I had a very strange discussion on Twitter (yes, another one) about regression curves. I think it started with a tweet based on an xkcd picture (just for fun, because it was New Year’s Day):

“don’t trust linear regressions” https://t.co/exUCvyRd1G pic.twitter.com/O6rBJfkULa

— Arthur Charpentier (@freakonometrics) January 1, 2017

There were comments on that picture by econometricians, mainly about ‘significant’ trends when datasets are very noisy. I then mentioned a graph that I had seen a couple of days earlier:

@AndyHarless @mileskimball actually, all that reminds me of a post by @RogerPielkeJr earlier (not a big fan of the regression line) pic.twitter.com/NQgzgVsBcE

— Arthur Charpentier (@freakonometrics) January 1, 2017

Let us reproduce that graph (Roger kindly sent me the dataset)

```r
db <- data.frame(year = 1990:2016,
                 ratio = c(.23, .27, .32, .37, .22, .26, .29, .15, .40, .28, .14, .09, .24, .18,
                           .29, .51, .13, .17, .25, .13, .21, .29, .25, .2, .15, .12, .12))
library(ggplot2)
```

The graph is here (with the same aesthetic conventions as Roger’s initial graph, i.e. using some sort of barplot)

```r
ggplot(db, aes(year, ratio)) +
  geom_bar(stat = "identity") +
  stat_smooth(method = "lm", se = FALSE)
```

My point was that the ‘confidence band’ of the regression is missing from that graph.

@freakonometrics @AndyHarless @mileskimball Because it is not a sample. Since 1990 weather losses/global GDP have gone down.

— Roger Pielke Jr. (@RogerPielkeJr) January 1, 2017

In R, at least, it is quite natural to get one (it is actually the default behaviour of the smoothing function, where `se = TRUE` unless stated otherwise):

```r
ggplot(db, aes(year, ratio)) +
  geom_bar(stat = "identity") +
  stat_smooth(method = "lm", se = TRUE)
```

It is hard to claim that the ‘regression line’ is significant (in the sense “significantly non horizontal”). To be more specific, if we look at the output of the regression model, we get

```r
summary(lm(ratio ~ year, data = db))
```

```
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.158531   4.549672   2.013    0.055 .
year        -0.004457   0.002271  -1.962    0.061 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

(which is exactly what Roger used in his graph to plot his red straight line). The *p*-value of the slope estimate in this linear regression is about 6%, so the slope is not significantly different from zero at the usual 5% level. But I found Roger’s point puzzling:

@freakonometrics @AndyHarless @mileskimball Disagree. U can create one, of course, but doesnt mean much. These data are not balls from urns.

— Roger Pielke Jr. (@RogerPielkeJr) January 1, 2017

See also

@freakonometrics @AndyHarless @mileskimball These data are not random.

— Roger Pielke Jr. (@RogerPielkeJr) January 1, 2017

First of all, let us get back to a more standard graph, with a scatterplot, and not bars,

```r
ggplot(db, aes(year, ratio)) +
  stat_smooth(method = "lm") +
  geom_point()
```

Here, we observe points $\{y_{1990},y_{1991},\cdots,y_{2016}\}$. In order to draw that blue line, we assume (Econometrics 101, actually) that those observations are realizations of random variables $\{Y_{1990},Y_{1991},\cdots,Y_{2016}\}$. Randomness here does not come from a survey, or from ‘balls in an urn’. Randomness arises because hurricanes and floods are themselves seen as realizations of random events. Yes, there might be measurement errors, but that’s not where randomness comes from (here). When we talk about ‘randomness’, it should be related to ‘model error’, i.e. the error we make if we consider a linear model (here), that is

$$Y_t=\beta_0+\beta_1t+\varepsilon_t$$

Even if observations are not obtained from balls in an urn, there is some kind of randomness here. Randomness means that we might have errors (random errors) around the estimated value (that is, on the blue curve), $y_t=\widehat{y}_t+\widehat{\varepsilon}_t$. One might consider a nonlinear model to reduce the error,

```r
ggplot(db, aes(year, ratio)) +
  geom_point() +
  geom_smooth()
```

but in that case, the danger is overfitting.
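One way to make the overfitting risk concrete is to compare in-sample fit with out-of-sample error. The sketch below does not use the loess smoother that `geom_smooth()` draws; as a stand-in, it assumes a degree-5 polynomial (an arbitrary choice, not taken from the original discussion) and compares it with the straight line through leave-one-out prediction error. The object names are chosen here for illustration.

```r
# Same data as in the post
db <- data.frame(year = 1990:2016,
                 ratio = c(.23, .27, .32, .37, .22, .26, .29, .15, .40, .28, .14, .09, .24, .18,
                           .29, .51, .13, .17, .25, .13, .21, .29, .25, .2, .15, .12, .12))

# Two candidate models: the straight line and an (assumed) degree-5 polynomial
formulas <- list(linear = ratio ~ year,
                 poly5  = ratio ~ poly(year, 5))

# In-sample residual sum of squares for each model
rss <- sapply(formulas, function(f) sum(residuals(lm(f, data = db))^2))

# Leave-one-out mean squared prediction error for each model:
# refit without year i, then predict the held-out year i
loo_mse <- sapply(formulas, function(f) {
  errs <- sapply(seq_len(nrow(db)), function(i) {
    fit <- lm(f, data = db[-i, ])
    db$ratio[i] - predict(fit, newdata = db[i, ])
  })
  mean(errs^2)
})

rbind(rss = rss, loo_mse = loo_mse)
```

The polynomial cannot have a larger in-sample residual sum of squares than the line, since it nests the linear model. The second row shows whether that extra flexibility also helps when predicting years that were left out of the fit.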

So yes, when we fit a linear model, there is always some kind of randomness, and it is possible to get a ‘confidence band’, which will be very useful for predictions (e.g. for reinsurance purposes here).
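As a rough sketch of how such a band can be obtained outside of the plot, base R's `confint()` and `predict()` give the interval for the slope and the bands for future years. The years 2017 to 2020 and the object names are assumptions chosen for illustration, and the intervals rely on the usual assumption of independent Gaussian errors in the linear model.

```r
# Same data as in the post
db <- data.frame(year = 1990:2016,
                 ratio = c(.23, .27, .32, .37, .22, .26, .29, .15, .40, .28, .14, .09, .24, .18,
                           .29, .51, .13, .17, .25, .13, .21, .29, .25, .2, .15, .12, .12))

fit <- lm(ratio ~ year, data = db)

# 95% confidence interval for the slope
slope_ci <- confint(fit, "year", level = 0.95)
slope_ci

# Hypothetical future years, used only for illustration
new_years <- data.frame(year = 2017:2020)

# Band for the regression line (uncertainty on the mean)
conf_band <- predict(fit, newdata = new_years, interval = "confidence")

# Band for a new observation (adds the model error on top)
pred_band <- predict(fit, newdata = new_years, interval = "prediction")

cbind(new_years, conf_band, pred_band[, c("lwr", "upr")])
```

The interval for the slope contains zero, which is the same message as the 6% *p*-value, and the prediction band is wider than the confidence band because it also accounts for the error term $\varepsilon_t$.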

Hi,

very interesting post!

Do you think that what you wrote is also valid for inferential tests (such as the t-test)? So, can we say that they make sense on exhaustive data because variables are always random (at least because of measurement error)?

thx,

Cyril

The data used for weather-related losses represent just an instantiation of an underlying stochastic process.

For example, one of the error components resides in how raw data used in loss calculation were measured and/or collected.

Therefore, confidence bands are legit and should be used.

Linear regression only tells us that a monotonic relationship (not necessarily of a first-degree polynomial type) does exist.

Hi Arthur,

I am afraid Roger is right. The standard inference task is designed to solve the sampling problem, thus to answer the question whether some sample data provides enough evidence to make an inference about the population the sample is drawn from. Inference in a regression situation would be to decide if there is sufficient evidence to reject the null hypothesis of no relation in the population based on the sample data.

In this case there is no sampling. There is no population of outcomes of which the provided data is a sample. Consequently, sampling variability that is reflected in sampling distributions of some parameter (the basis of usual confidence intervals and p-values of null hypothesis significance tests) cannot be a valid model for potential randomness inherent in this data.

If a confidence interval should be constructed, it would be necessary to explicitly model what kind of uncertainty the confidence interval reflects (i.e. measurement error or model-based error reflecting the potential outcomes that could have been realised, uncertainties based on the definition of the variable, etc.).

indeed, see https://freakonometrics.hypotheses.org/50057

It is also always interesting to see what happens when you delete single data points from regressions with low sample size and p-values near the “significance border” of 0.05. For instance, if you remove 2016, you get a p-value of 0.117, and if you remove the quite outlying point at 2005, you get p = 0.0126! These regressions are highly sensitive to the inclusion of single points…

BTW, I developed a shiny app that checks for this phenomenon, which I term “leave-one-out significance reversal”.

https://anspiess.shinyapps.io/influence/

Cheers,

Andrej
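The leave-one-out check described in this comment can be sketched in a few lines of base R. This is not the code of the Shiny app, only a minimal illustration using the data from the post; `loo_pvalue` is a name chosen here.

```r
# Same data as in the post
db <- data.frame(year = 1990:2016,
                 ratio = c(.23, .27, .32, .37, .22, .26, .29, .15, .40, .28, .14, .09, .24, .18,
                           .29, .51, .13, .17, .25, .13, .21, .29, .25, .2, .15, .12, .12))

# p-value of the slope when each year is dropped in turn
loo_pvalue <- sapply(seq_len(nrow(db)), function(i) {
  fit <- lm(ratio ~ year, data = db[-i, ])
  summary(fit)$coefficients["year", "Pr(>|t|)"]
})
names(loo_pvalue) <- db$year

# The two years mentioned in the comment
round(loo_pvalue[c("2005", "2016")], 4)

# Years whose removal moves the slope below the 5% threshold
names(loo_pvalue)[loo_pvalue < 0.05]
```

Dropping 2005 pushes the slope below the 5% threshold, while dropping 2016 moves it further away, which is the sensitivity the comment points out.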

The R graph is so powerful.

One question: I was told that a linear regression line should only be used when Rsq>=0.3. Does that make sense?

Owe-

2016 is annualized based on half-year results. Very likely that final numbers will be very close to these, as last 6 months of 2016 were very benign.

If you are an investor who gets paid based on the magnitude of this metric, being paid more the lower it is, then you will have been paid more in the 2010s than the 2000s than in the 1990s.

In fact, this is what has been seen in the actual reinsurance industry. That there is a downward trend in this metric is a simple fact. What it means is a separate question. Could be entirely luck. What you cannot argue, however, is that there is evidence to support claims that disaster costs have increased as a proportion of GDP 1990-2016.

Thanks

The question is whether reinsurance results are driven by this metric in any meaningful way. Premium levels are also cyclically driven (see Ingrey’s Insurance Cycle Clock from 1985 https://cas.confex.com/cas/2012clrs/webprogram/Handout/Session5257/CLRS%202012%20(Blumsohn).pdf p.2) as well as by investments into reinsurance, e.g. in the last couple of years we have seen inflows from hedge funds, which put pressure on premia.

I think your last paragraph shows that we come from entirely different points of view: you state as fact that there is a downward trend, and that the conclusion could be that this trend is a result of luck. It seems to me that you take the mathematical result of the linear regression at face value, and argue based on it.

From my point of view, the analysis always starts with the assumption of a “true” data generating process, and any sample is a random draw. As we usually have no complete sample, and no complete model, any estimation of the parameters of the data generating process will be influenced by the randomness of the process. One of the logical conclusions is that the likelihood of any point estimate to be the “true” parameter tends towards zero. This has been drilled into my head in the statistics education from about the first day, and I guess this makes it difficult for me to understand your point of view, or to even accept it as scientifically valid.

If you throw a coin n times and it comes up heads p times, then it came up heads p times and there is no confidence interval around that.

You can’t question the result of a linear regression unless you’re willing to doubt math itself. You can doubt whether inferences made from that are reasonable, but when none are made then there’s nothing to discuss really.

But Pielke was drawing a conclusion, namely that there is a downward trend in the cat losses as a percentage of GDP between 1990 and 2016. He drew this conclusion from the slope of the regression result, and then said that there is no randomness inherent because the data are realisations. But this is bullshit on so many levels, some of which Arthur addressed in the article. The fact alone that for 2016 he included half-year results only might invalidate his claim. Then there is the reliability of the underlying data, claims data as well as GDP data. And, most importantly, to all purposes natural catastrophes are random events, so if we argue about a trend in claims/GDP we must consider this randomness before drawing any conclusion.

thanks Owe,

I am currently working on an updated version, based on time series techniques… I’ll keep you posted!