A Test of the Tropical 200-300 mb Warming Rate in Climate Models

by Ross McKitrick
I sat down to write a description of my new paper with John Christy, but when I looked up a reference via Google Scholar something odd cropped up that requires a brief digression.

Google Scholar insists on providing a list of “recommended” articles whenever I sign on to it. Most turn out to be unpublished or non-peer-reviewed discussion papers. But at least they are typically current, so I was surprised to see the top rank given to “Consistency of Modelled and Observed Temperature Trends in the Tropical Troposphere,” a decade-old paper by Santer et al. Google was, however, referring to its reappearance as a chapter in a 2018 book called Climate Modelling: Philosophical and Conceptual Issues, edited by Elizabeth Lloyd and Eric Winsberg, two US-based philosophers. Lloyd specifically describes herself as “a philosopher of climate science and evolutionary biology, as well as a scientist studying women’s sexuality” so readers should not expect specialized expertise in climate model evaluation, nor do the book’s editors exhibit any. Yet Google’s algorithm flagged it for me as the best thing out there and positioned two of its chapters as top leads in its “recommended” list.
Much of the first part of the book is an extended attack on a 2007 paper by David Douglass, John Christy, Benjamin Pearson and Fred Singer on the model/observational mismatch in the tropical troposphere. The editors add a diatribe against John Christy in particular for supposedly being impervious to empirical evidence, using flawed statistical methods and refusing to accept the validity of climate model representations of the warming of the tropical troposphere.
By way of contrast, and as an exemplar of research probity, they reproduce the decade-old Santer et al. paper and rely entirely on it for their case. If they are aware of any subsequent literature (which I doubt) they don’t mention it. They fail to mention:

  • Santer bitterly fought releasing his data
  • Despite having data up to 2007 he truncated his sample at 1999
  • If he had used the same methodology on the full data set he’d have reached the opposite conclusion, supporting Douglass et al. rather than supposedly refuting them
  • Steve McIntyre and I submitted a comment to the journal showing this. It was rejected, in part because the referee considered Santer’s statistical method invalid and didn’t want it perpetuated through further discussion
  • We re-cast the article as a more detailed discussion of trend comparison methodology and published it in 2010 in Atmospheric Science Letters. We confirmed, among other things, that based on modern econometric testing methods the gap between models and observations in the tropical troposphere is statistically significant.

McKitrick and Vogelsang (2014) provided a longer model-observational comparison using radiosonde records from 1958 to 2012 while generalizing the trend model to include a possible step change, and reaffirmed the significant discrepancy between models and observations. Similar conclusions were also reached by Fu et al. (2011), Bengtsson and Hodges (2009) and Po-Chedley and Fu (2012).
Needless to say you learn none of this in the Lloyd and Winsberg book.
A related issue is the ratio of tropospheric to surface warming. Klotzbach et al. (2009) found that climate models predict an average amplification ratio of about 1.2 between surface and tropospheric trends, but this far exceeded the observed average, which is typically less than 1.0. Critics said they should have used a different ratio between oceans and land, so Klotzbach et al. (2010) used 1.1 over land and 1.6 over oceans, which didn’t change their conclusions.
Vogelsang and Nawaz (2017) is an important new contribution to this literature since they provide the first formal treatment of the trend ratio problem. They note that there are several seemingly identical ways to write out the trend ratio regression but they each imply different estimators, one of which is systematically biased. They identify a preferred method (which corresponds to the form used by Klotzbach et al.) and they provide a practical method for constructing valid confidence intervals robust to general forms of autocorrelation.
They then use the Klotzbach et al. data sets (original and updated) and test whether the typical amplification ratios in climate models are consistent with observations. In almost all global surface/troposphere data pairings, the amplification ratios in models are too large and are rejected against the observations. When the testing is done separately for land and ocean regions the rejections are unanimous.
So: whether we test the tropospheric trend magnitudes, or the ratio of tropospheric to surface trends, across all kinds of data sets, and across all major trend intervals, models have been shown to exaggerate the amplification rate and the warming rate, globally and in the tropics.
Philosophers Elizabeth Lloyd and Eric Winsberg sound very smug and confident as they disparage people like John Christy and his coauthors and colleagues. Yet they clearly don’t know the literature, and they instead reveal that they are the ones who are impervious to empirical evidence, enamoured with flawed statistical methods and uncritical in their acceptance of biased climate model outputs.
Moving on.
John and I have published a new paper in Earth and Space Science that adds to the climate model evaluation literature, using tropical mid-troposphere trend comparisons (models versus observations) as a basis to make a more general point about models. For a model to be scientific it ought to have an underlying testable hypothesis. Large, complex models like GCMs embed countless minor hypotheses that can be tested and rejected without undermining the major structure of the model. For instance, if a GCM does a lousy job of reproducing rainfall patterns over the Amazon, that component could be modified or removed without the model ceasing to be a GCM. But there must be at least one major component that, in principle, were it to be falsified, would call into question such an essential component of the model structure that you couldn’t simply remove it without changing the overall model type.
The hypothesis we are interested in testing is the representation of moist thermodynamics in the model troposphere that yields amplified warming in response to rising CO2 levels, thereby generating the results of most interest to users of GCMs, namely projections of global warming due to greenhouse gas emissions.  We propose four criteria that a valid test must meet and we argue that the air temperature trend in the tropical 200-300 mb layer satisfies all four, pretty much uniquely as far as we know. That layer is where models exhibit the clearest and strongest response to greenhouse warming, on a rapid timetable, so it makes sense to focus on it as a test target. The four specific criteria are as follows.

  • Measurability: The target must be well-measured over a long interval. Many surface regions like the Arctic and oceans are poorly sampled. Homogenized radiosonde records for the tropical troposphere are now available from more than one independent source over a 60 year span, which is long enough to identify relevant trends without undue influence of short-term events arising from internal variability or volcanic activity.
  • Specificity: The phenomenon must reliably emerge in all models on a known time scale. It should not be possible to shield models endlessly from testing by appealing to pattern ambiguity or fuzzy time scales in model outputs. We looked at 102 CMIP5 runs of the tropical 200-300 mb layer temperature series over 1958-2017 and they are very coherent in their prediction: 94 percent of the cross correlations exceed 0.5 and 77 percent exceed 0.6. The first principal component explains 73 percent of the variance and the remaining PCs each explain only minute amounts of variance. Also, models project on average that about 2 C of warming should have happened by now, a magnitude well within observational capability. Hence models all predict one specific thing on a specific timetable.
  • Independence: The target of the prediction must not be an input to the empirical tuning of the model. This rules out using the global average surface temperature record. Satellite-based lower- and mid-troposphere composites are also somewhat contaminated since they include the near-surface layer in their weighting functions. Radiosondes measure each layer of the atmosphere independently, so they are not inputs to tuning against the surface.
  • Uniqueness: The causality behind the observed change should be uniquely tied to the measured phenomenon. If the model predicts that many things could cause the target to warm, an observed warming would be consistent both with a successful prediction and with a failed prediction coupled with the coincidental action of other causes. But the IPCC states that only greenhouse forcing would explain a strong historical warming trend in the target region. The presence of such a trend would thus have only one explanation; likewise, its absence would conflict with only one major hypothesis of the model, namely the set of parameterizations that yield amplified GHG-induced warming.
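
For readers who want to see what the coherence check under the Specificity criterion involves, here is a minimal sketch in plain Python. It is not the code from the paper. The runs are synthetic (a shared trend plus noise), the function names are made up for illustration, and the principal component share is computed from the correlation matrix by power iteration, which is an assumption on my part about how one might set it up.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(runs):
    """Matrix of pairwise correlations between model runs."""
    k = len(runs)
    return [[pearson(runs[i], runs[j]) for j in range(k)] for i in range(k)]


def share_above(corr, threshold):
    """Fraction of distinct run pairs whose correlation exceeds threshold."""
    k = len(corr)
    pairs = [corr[i][j] for i in range(k) for j in range(i + 1, k)]
    return sum(r > threshold for r in pairs) / len(pairs)


def first_pc_share(corr, iterations=500):
    """Share of variance on the first principal component.

    Uses power iteration on the correlation matrix, started from a vector
    of ones (adequate when the runs are mostly positively correlated).
    """
    k = len(corr)
    v = [1.0] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)
    )
    return eigenvalue / k


# Synthetic stand-in for model output: 10 hypothetical runs over 60 years.
random.seed(0)
runs = [[0.03 * t + random.gauss(0, 0.2) for t in range(60)] for _ in range(10)]
corr = correlation_matrix(runs)
print(share_above(corr, 0.5), share_above(corr, 0.6), first_pc_share(corr))
```

The more the runs share a common signal, the closer all three numbers get to 1.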

We took the annual 1958-2017 tropical 200-300 mb layer average temperatures from three radiosonde data sets (RATPAC, RICH and RAOBCORE) and from all 102 runs in the CMIP5 archive. The model runs followed the RCP4.5 concentrations pathway, which follows observed GHG levels and other forcings up to the early part of the last decade and uses projections thereafter. We estimated linear trends using ordinary least squares and computed robust confidence intervals using the Vogelsang-Franses method. We generated results both for a simple trend and for one allowing a possible break in 1979 following the method outlined in McKitrick and Vogelsang (2014).
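To give a feel for the trend estimation step, here is a rough sketch of an OLS trend with an autocorrelation-robust confidence interval. Note that this is not the Vogelsang-Franses method used in the paper: I have swapped in the simpler Newey-West (Bartlett kernel) variance with a normal critical value, and there is no break term. The function name, the lag choice and the synthetic series are all assumptions for illustration.

```python
import math
import random


def ols_trend_hac(y, max_lag=3, z=1.96):
    """OLS slope of y on a time index, with a Newey-West interval.

    Returns (slope, lower, upper) in units of y per time step.
    Stand-in only: the paper uses the Vogelsang-Franses method instead.
    """
    n = len(y)
    t_bar = (n - 1) / 2
    y_bar = sum(y) / n
    dt = [t - t_bar for t in range(n)]
    sxx = sum(d * d for d in dt)
    slope = sum(d * (v - y_bar) for d, v in zip(dt, y)) / sxx
    intercept = y_bar - slope * t_bar
    resid = [v - intercept - slope * t for t, v in enumerate(y)]

    # Score contributions for the slope: demeaned time times residual.
    u = [d * e for d, e in zip(dt, resid)]
    long_run = sum(a * a for a in u)
    for lag in range(1, max_lag + 1):
        weight = 1 - lag / (max_lag + 1)  # Bartlett kernel weight
        gamma = sum(u[i] * u[i - lag] for i in range(lag, n))
        long_run += 2 * weight * gamma
    se = math.sqrt(max(long_run, 0.0)) / sxx
    return slope, slope - z * se, slope + z * se


# Synthetic annual series: a trend plus AR(1) noise (made-up parameters).
random.seed(1)
noise, series = 0.0, []
for year in range(60):
    noise = 0.5 * noise + random.gauss(0, 0.15)
    series.append(0.02 * year + noise)

slope, lower, upper = ols_trend_hac(series)
# Annual data, so multiply by 10 to express the trend in C/decade.
print(10 * slope, 10 * lower, 10 * upper)
```

The point of the robust interval is that annual temperature series are autocorrelated, so the ordinary OLS standard error would typically be too narrow.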
The trends (circles) and confidence intervals (whiskers) are shown here (models-red, observations-blue):

The mean restricted trend (without a break term) is C/decade in the models and  C/decade in the observations. With a break term included they are, respectively, C/decade (models) and C/decade (observed).
In both cases, all 102 model trends exceed the average observed trend. In the restricted case (no break term), 62 of the discrepancies are significant, while in the general case 87 are. In both cases the model ensemble mean also rejects against observations.
We also constructed divergence terms consisting of each model run minus the average balloon record. The histograms of trends in these measures ought to be centered on zero if the model errors were mere noise. Instead the distributions are entirely positive, indicating a systematic positive bias:
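
The divergence calculation just described can be sketched in a few lines. This is an illustration under my own assumptions rather than the paper's code: the function names are hypothetical, the series are toy numbers, and only the point estimates of the divergence trends are computed (no significance test).

```python
def ols_slope(y):
    """OLS slope of y regressed on the time index 0, 1, 2, ..."""
    n = len(y)
    t_bar = (n - 1) / 2
    y_bar = sum(y) / n
    sxx = sum((t - t_bar) ** 2 for t in range(n))
    return sum((t - t_bar) * (v - y_bar) for t, v in enumerate(y)) / sxx


def divergence_trends(model_runs, observed_series):
    """Trend of (model run minus average observed record), one per run."""
    n_obs = len(observed_series)
    obs_mean = [sum(vals) / n_obs for vals in zip(*observed_series)]
    return [
        ols_slope([m - o for m, o in zip(run, obs_mean)])
        for run in model_runs
    ]


# Toy inputs: two hypothetical model runs and two hypothetical balloon records.
models = [[0.03 * t for t in range(60)], [0.04 * t for t in range(60)]]
balloons = [[0.01 * t for t in range(60)], [0.02 * t for t in range(60)]]

trends = divergence_trends(models, balloons)
share_positive = sum(tr > 0 for tr in trends) / len(trends)
print(trends, share_positive)
```

If model errors were mere noise, roughly half of these divergence trends would fall on each side of zero.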

Conclusion
Summarizing, all 102 CMIP5 model runs warm faster than observations, in most individual cases the discrepancy is significant, and on average the discrepancy is significant. The test of trend equivalence rejects whether or not we include a break at 1979, though the rejections are stronger when we control for its influence. Measures of series divergence are centered at a positive mean and the entire distribution is above zero. While the observed analogue exhibits a warming trend over the test interval it is significantly smaller than that shown in models, and the difference is large enough to reject the null hypothesis that models represent it correctly.
To the extent GCMs are getting some features of the surface climate correct as a result of their current tuning, they are doing so with a flawed structure. If tuning to the surface added empirical precision to a valid physical representation, we would expect to see a good fit between models and observations at the point where the models predict the clearest and strongest thermodynamic response to greenhouse gases. Instead we observe a discrepancy across all runs of all models, taking the form of a warming bias at a sufficiently strong rate as to reject the hypothesis that the models are realistic. Our interpretation of the results is that the major hypothesis in contemporary climate models, namely the theoretically-based negative lapse rate feedback response to increasing greenhouse gases in the tropical troposphere, is flawed.

Moderation note:  As with all guest posts, please keep your comments civil and relevant.
