On inappropriate use of least squares regression

by Greg Goodman
Inappropriate use of linear regression can produce spurious and significantly low estimations of the true slope of a linear relationship if both variables have significant measurement error or other perturbing factors. This is precisely the case when attempting to regress modelled or observed radiative flux against surface temperatures in order to estimate sensitivity of the climate system.

Figure 1 showing conventional and inverse ‘ordinary least squares’ fits to some real, observed climate variables.
Ordinary least squares regression ( OLS ) is a very useful technique, widely used in almost all branches of science. The principal is to adjust one or more fitting parameters to attain the best fit of a model function, according to the criterion of minimising the sum of the squared deviations of the data from the model.
It is usually one of the first techniques that is taught in schools for analysing experimental data. It is also a technique that is misapplied almost as often as it is used correctly.
It can be shown, that under certain conditions, the least squares fit is the best estimation of the true relationship that can be derived from the available data. In statistics this is often called the ‘best, unbiased linear estimator’ of the slope. (Those, who enjoy contrived acronyms, abbreviate this to “BLUE”.)
It is a fundamental assumption of this technique that the ordinate variable ( x-axis ) has negligible error: it is a “controlled variable”. It is the deviations of the dependant variable ( y-azix ) that are minimised. In the case of fitting a straight line to the data, it has been known since at least 1878 that this technique will under-estimate the slope if there is measurement or other errors in the x-variable. (R. J. Adcock) [link]
There are two main conditions for this result to be an accurate estimation of the slope. One is that the deviations of the data from the true relationship are ‘normally’ or gaussian distributed. That is to say that they are of a random nature. This condition can be violated by significant periodic components in the data or excessive number of out-lying data points. The latter may often occur when only a small number of data points is available and the noise, even if random in nature, is not sufficiently sampled to average out.
The other main condition is that there be negligible error ( or non-linear variability ) in the x variable. If this condition is not met, the OLS result derived from the data will almost always under-estimate the slope of the true relationship. This effect is sometimes referred to as regression dilution. The degree by which the slope is under-estimated is determined by the nature of the x and y errors but most strongly by those in x since they are required to be negligible for OLS to give the best estimation.
In this discussion, “errors” can be understood to be both observational inaccuracies and any variability due to some factor other than the supposed linear relationship that it is sought to determine by regression of the two variables.
In certain circumstances regression dilution can be corrected for, but in order to do so, some knowledge of the nature and size of the both x and y errors has to be known. Typically this is not the case beyond knowing whether the x variable is a ‘controlled variable’ with negligible error, although several techniques have been developed to estimate the error in the estimation of the slope [link].
A controlled variable can usually be attained in a controlled experiment, or when studying a time series, provided that the date and time of observations have been recorded and documented in a precise and consistent manner. It is typically not the case when both sets of data are observations of different variables, as is the case when comparing two quantities in climatology.
One way to demonstrate the problem is to invert the x and y axes and repeat the OLS fit. If the result were valid, irrespective of orientation, the first slope would be the reciprocal of second one. However, this is only the case when there is very small errors in both variables, ie. the data is highly correlated ( grouped closely around a straight line ). In the case of one controlled variable and one error prone variable, the inverted result will be incorrect. In the case of two datasets containing observational error, both results will be wrong and the correct result will generally lie somewhere in between.
Another way to check the result is by examining the cross-correlation between the residual and the independent variable ie. ( model – y ) vs x , then repeat for incrementally larger values of the fitted ratio. Depending on the nature of the data, it will often be obvious that the OLS result does not produce the minimum residual between the ordinate and the regressor, ie. it does not optimally account for co-variability of the two quantities.
In the latter situation, the two regression fits can be taken as bounding the likely true value but some knowledge of the relative errors is needed to decide where in that range the best estimation lies. There are a number of techniques such as bisecting the angle, taking the geometric mean (square root of the product), or some other average, but ultimately, they are no more objective unless driven by some knowledge of the relative errors. Clearly bisection would not be correct if one variable had low error, since the true slope would then be close to the OLS fit done with that quantity on the x-axis.


Figure 2. A typical example of linear regression of two noisy variables produced from synthetic randomised data. The true known slope used in generating the data is seen in between the two regression results. ( Click to enlarge graph and access code to reproduce data and graph. )


Figure 2b. A typical example of correct application of linear regression to data with negligible x-errors. The regressed slope is very close to the true value, so close as to be indistinguishable visually. ( Click to enlarge )
The larger the x-errors, the greater the skew in the distribution and the greater the dilution effect.
An Illustration: the Spencer simple model
The following case is used to illustrate the issue with ‘climate-like’ data. However, it should be emphasised that the problem is an objective mathematical one, the principal of which is independent of any particular test data used. Whether the following model is an accurate representation of climate ( it is not claimed to be ) has no bearing on the regression problem.
In a short article on his site Dr. Roy Spencer provided a simple, single-slab ocean, climate model with a predetermined feedback variable built into it. He observed that attempting to derive the climate sensitivity in the usual way consistently under-estimated the know feedback used to generate the data.

By specifying that sensitivity (with a total feedback parameter) in the model, one can see how an analysis of simulated satellite data will yield observations that routinely suggest a more sensitive climate system (lower feedback parameter) than was actually specified in the model run.

And if our climate system generates the illusion that it is sensitive, climate modelers will develop models that are also sensitive, and the more sensitive the climate model, the more global warming it will predict from adding greenhouse gasses to the atmosphere.

This is a very important observation. Regressing noisy radiative flux change against noisy temperature anomalies does consistently produce incorrectly high estimations of climate sensitivity. However, it is not an illusion created by the climate system, it is an illusion created by the incorrect application of OLS regression. When there are errors on both variables, the OLS slope is no longer an accurate estimation of the underlying linear relationship being sought.
Dr Spencer was kind enough to provide an implementation of the simple model in the form of a spread sheet download so that others may experiment and verify the effect.
To demonstrate this problem, the spreadsheet provided was modified to duplicate the dRad vs dTemp graph but with the axes inverted, ie. using exactly the same data for each run but additionally displaying it the other way around. Thus the ‘trend line’ provided by the spreadsheet is calculated with the variables inverted. No changes were made to the model.
Three values for the predetermined feedback variable were used in turn. Two values: 0.9 and 1.9 that Roy Spencer suggests represent the range of IPCC values and 5.0 which he proposes as a value closer to that which he has derived from satellite observational data.
Here is a snap-shot of the spreadsheet showing a table of results from nine runs for each feedback parameter value. Both the conventional and the inverted regression slopes and their geometric mean have been tabulated.

Figure 3. Snap-shot of spreadsheet, click to enlarge.
Firstly this confirms Roy Spencer’s observation that the regression of dRad against dTemp consistently and significantly under-estimates the feedback parameter used to create the data in the first place (and hence over-estimates climate sensitivity of the model). In this limited test, error is between a third and a half of the correct value. There is only one value of the conventional least squares slope that is greater than the respective feedback parameter value.
Secondly, it is noted that the geometric mean of the two OLS regressions does provide a reasonably close to the true feedback parameter, for the value derived from satellite observations. Variations are fairly evenly spread either side: the mean is only slightly higher than the true value and the standard deviation is about 9% of the mean.
However, for the two lower feedback values, representing the IPCC range of climate sensitivities, while the usual OLS regression is substantially less than the true value, the geometric mean over-estimates and does not provide a reliable correction over the range of feedbacks.
All the feedbacks represent a net negative feedback ( otherwise the climate system would be fundamentally unstable ). However, the IPCC range of values represents less negative feedbacks, thus a less stable climate. This can be seen reflected in the degree of variability in data plotted in the spreadsheet. The standard deviations of the slopes are also somewhat higher. This can be expected with less feedback controlling variations.
It can be concluded that the ratio of the proportional variability in the two quantities changes as a function of the degree of feedback in the system. The geometric mean of the two slopes does not provide a good estimation of the true feedback for the less stable configurations which have greater variability. This is in agreement with Isobe et al 1990 [link] which considers the merits of several regression methods.
The simple model helps to see how this relates to Rad / Temp scatter plots and climate sensitivity. However, the problem of regression dilution is a totally general mathematical result and can be reproduced from two series having a linear relationship with added random changes, as shown above.
What the papers say
A quick review of several recent papers on the problems of estimating climate sensitivity shows a general lack of appreciation of the regression dilution problem.
Dessler 2010 b [link] :

Estimates of Earth’s climate sensitivity are uncertain, largely because of uncertainty in the long-term cloud feedback.

Spencer & Braswell 2011 [link]:

Abstract: The sensitivity of the climate system to an imposed radiative imbalance remains the largest source of uncertainty in projections of future anthropogenic climate change.

There seems to be agreement that this is the key problem in assessing future climate trends. However, many authors seem unaware of the regression problem and much published work on this issue seems to rely heavily on the false assumption that OLS regression of dRad against dTemp can be used to correctly determine this ratio, and hence various sensitivities and feedbacks.
Trenberth 2010 [link]:

To assess climate sensitivity from Earth radiation observations of limited duration and observed sea surface temperatures (SSTs) requires a closed and therefore global domain, equilibrium between the fields, and robust methods of dealing with noise. Noise arises from natural variability in the atmosphere and observational noise in precessing satellite observations.

Whether or not the results provide meaningful insight depends critically on assumptions, methods and the time scales ….

Indeed so, unfortunately he then goes on to contradict earlier work by Lindzen and Choi that did address the OLS problem including a detailed statistical analysis comparing their results, by relying on inappropriate use of regression. Certainly not an example of the “robust methods” he is calling for.

Figure 4. Excerpt from Lindzen & Choio 2011, figure 7, showing consistent under-estimation of the slope by OLS regression ( black line ).
Spencer and Braswell 2011 [link]

As shown by SB10, the presence of any time-varying radiative forcing decorrelates the co-variations between radiative flux and temperature. Low correlations lead to regression-diagnosed feedback parameters biased toward zero, which corresponds to a borderline unstable climate system.

This is an important paper highlighting the need to take account of the lagged response of the climate during regression to avoid the decorrelating effect of delays in the response. However, it does not deal with the further attenuation due to regression dilution. It is ultimately still based on regression of two error laden-variables and thus does not recognise regression dilution that is also present in this situation. Thus it is likely that this paper is still over-estimating sensitivity.
Dessler 2011 [link]:

Using a more realistic value of σ(dF_ocean)/σ(dR_cloud) = 20, regression of TOA flux vs. dTs yields a slope that is within 0.4% of lamba.

Then in the conclusion of the paper, emphasis added:

Rather, the evolution of the surface and atmosphere during ENSO variations are dominated by oceanic heat transport. This means in turn that regressions of TOA fluxes vs. δTs can be used to accurately estimate climate sensitivity or the magnitude of climate feedbacks.

Also from a previous paper:
Dessler 2010 b [link]

The impact of a spurious long-term trend in either dRall-sky or dRclear-sky is estimated by adding in a trend of T0.5 W/m 2/ decade into the CERES data. This changes the calculated feedback by T0.18 W/m2/K. Adding these errors in quadrature yields a total uncertainty of 0.74 and 0.77 W/m2/K in the calculations, using the ECMWF and MERRA reanalyses, respectively. Other sources of uncertainty are negligible.

The author was apparently unaware that the inaccuracy of regressing two uncontrolled variables is a major source of uncertainty and error.
Lindzen & Choi 2011 [link]

[Our] new method does moderately well in distinguishing positive from negative feedbacks and in quantifying negative feedbacks. In contrast, we show that simple regression methods used by several existing papers generally exaggerate positive feedbacks and even show positive feedbacks when actual feedbacks are negative.

… but we see clearly that the simple regression always under-estimates negative feedbacks and exaggerates positive feedbacks.

Here the authors have clearly noted that there is a problem with the regression based techniques and go into quite some detail in quantifying the problem, though they do not explicitly identify it as being due to the presence of uncertainty in the x-variable distorting the regression results.
The L&C papers, to their credit, recognise that regression based methods on poorly correlated data seriously under-estimates the slope and utilise techniques to more correctly determine the ratio. They show probability density graphs from Monte Carlo tests to compare the two methods.
It seems the latter authors are exceptional in looking at the sensitivity question without relying on inappropriate use of linear regression. It is certainly part of the reason that their results are considerably lower than almost all other authors on this subject.
Forster & Gregory 2006 [link]

For less than perfectly correlated data, OLS regression of Q-N against δTs will tend to underestimate Y values and therefore overestimate the equilibrium climate sensitivity (see Isobe et al. 1990).

Another important reason for adopting our regression model was to reinforce the main conclusion of the paper: the suggestion of a relatively small equilibrium climate sensitivity. To show the robustness of this conclusion, we deliberately adopted the regression model that gave the highest climate sensitivity (smallest Y value). It has been suggested that a technique based on total least squares regression or bisector least squares regression gives a better fit, when errors in the data are uncharacterized (Isobe et al. 1990). For example, for 1985–96 both of these methods suggest YNET of around 3.5 +/- 2.0 W m2 K-1 ( a 0.7–2.4-K equilibrium surface temperature increase for 2 ϫ CO2 ), and this should be compared to our 1.0–3.6-K range quoted in the conclusions of the paper.

Here, the authors explicitly state the regression problem and its effect on the results of their study on sensitivity. However, when writing in 2005, they apparently feared that it would impede the acceptance of what was already a low value of climate sensitivity if they presented the mathematically more accurate, but lower figures.
It is interesting to note that Roy Spencer, in a non peer reviewed article, found an very similar figure of 3.66 W/m2/K by comparing ERBE data to MSU derived temperatures following Mt Pinatubo.[link]
So Forster and Gregory felt constrained to bury their best estimation of climate sensitivity, and the discussion of the regression problem in an appendix. In view of the ‘gatekeeper’ activities revealed in the Climategate emails, this may have been a wise judgement in 2005.
Now, ten years after the publication of F&G 2006, proper application of the best mathematical techniques available to correct this systematic over-estimation of climate sensitivity is long overdue.
A more recent study Lewis & Curry 2014 [link] used a different method of identifying changes between selected periods and thus is not affected by regression issues. This method also found lower values of climate sensitivity.
Conclusion
Inappropriate use of linear regression can produce spurious and significantly low estimations of the true slope of a linear relationship if both variables have significant measurement error or other perturbing factors.
This is precisely the case when attempting to regress modelled or observed radiative flux against surface temperatures in order to estimate sensitivity of the climate system.
In the sense that this regression is conventionally done in climatology, it will under-estimate the net feedback factor (often denoted as ‘lambda’). Since climate sensitivity is defined as the reciprocal of this term, this results in an over-estimation of climate sensitivity.
This situation may account for the difference between regression-based estimations of climate sensitivity and those produced by other methods. Many techniques to reduce this effect are available in the broader scientific literature, thought there is no single, generally applicable solution to the problem.
Those using linear regression to assess climate sensitivity need to account for this significant source of error when supplying uncertainly values in published estimations of climate sensitivity or take steps to address this issue.
The decorrelation due to simultaneous presence of both the in-phase and orthogonal climate reactions, as noted by Spencer et al, also needs to be accounted for to get the most accurate information from the available data. One possible approach to this is detailed here: https://judithcurry.com/2015/02/06/on-determination-of-tropical-feedbacks/
A mathematical explanation of the origin of regression dilution is provided here:
On the origins of regression dilution.
JC note:  As with all guest posts, keep your comments civil and relevant.
 Filed under: Sensitivity & feedbacks

Source