New Paper by McKitrick and Vogelsang comparing models and observations in the tropical troposphere

This is a guest post by Ross McKitrick. Tim Vogelsang and I have a new paper comparing climate models and observations over a 55-year span (1958-2012) in the tropical troposphere. Among other things we show that climate models are inconsistent with the HadAT, RICH and RAOBCORE weather balloon series. In a nutshell, the models not only predict far too much warming, but they potentially get the nature of the change wrong. The models portray a relatively smooth upward trend over the whole span, while the data exhibit a single jump in the late 1970s, with no statistically significant trend on either side.
Our paper is called “HAC-Robust Trend Comparisons Among Climate Series With Possible Level Shifts.” It was published in Environmetrics, and is available with Open Access thanks to financial support from CIGI/INET. Data and code are here and in the paper’s SI.

Tropical Troposphere Revisited
The issue of models-vs-observations in the troposphere over the tropics has been much-discussed, including here at CA. Briefly to recap:

  • All climate models (GCMs) predict that in response to rising CO2 levels, warming will occur rapidly and with amplified strength in the troposphere over the tropics. See AR4 Figure 9.1  and accompanying discussion; also see AR4 text accompanying Figure 10.7.
  • Getting the tropical troposphere right in a model matters because that is where most solar energy enters the climate system, where there is a high concentration of water vapour, and where the strongest feedbacks operate. In simplified models, in response to uniform warming with constant relative humidity, about 55% of the total warming amplification occurs in the tropical troposphere, compared to 10% in the surface layer and 35% in the troposphere outside the tropics. And within the tropics, about two-thirds of the extra warming is in the upper layer and one-third in the lower layer. (Soden & Held, p. 464).
  • Neither weather satellites nor radiosondes (weather balloons) have detected much, if any, warming in the tropical troposphere, especially compared to what GCMs predict. The 2006 US Climate Change Science Program report (Karl et al 2006) noted this as a “potentially serious inconsistency” (p. 11). I suggest it is now time to drop the word “potentially.”
  • The missing hotspot has attracted a lot of discussion at blogs (eg http://joannenova.com.au/tag/missing-hot-spot/) and among experts (eg http://www.climatedialogue.org/the-missing-tropical-hot-spot). There are two related “hotspot” issues: amplification and sensitivity. The first refers to whether the ratio of tropospheric to surface warming is greater than 1, and the second refers to whether there is a strong tropospheric warming rate. Our analysis focused on the sensitivity issue, not the amplification one. To test amplification there has to have been substantial warming aloft, which turns out not to have been the case. Sensitivity can be tested directly, which is what we do, and in any case it is the more relevant question for measuring the rate of global warming.
  • In 2007 Douglass et al. published a paper in the IJOC showing that models overstated warming trends at every layer of the tropical troposphere. Santer et al. (2008) replied that if you control for autocorrelation in the data the trend differences are not statistically significant. This finding was very influential. It was relied upon by the EPA when replying to critics of their climate damage projections in the Technical Support Document behind the “endangerment finding”, which was the basis for their ongoing promulgation of new GHG regulations. It was also the basis for the conclusion of the Thorne et al. (2011) survey that “there is no reasonable evidence of a fundamental disagreement between models and observations” in the tropical troposphere.
  • But for some reason Santer et al. truncated their data at 1999, just at the end of a strong El Nino. Steve and I sent a comment to the IJOC pointing out that if they had applied their method to the full length of then-available data they would have gotten a very different result, namely a significant overprediction by models. The IJOC would not publish our comment.
  • I later redid the analysis using the full length of available data, applying a conventional panel regression method and a newer, more robust trend comparison methodology, namely the nonparametric HAC (heteroskedasticity and autocorrelation)-robust estimator developed by econometricians Tim Vogelsang and Philip Hans Franses (VF2005). I showed that over the 1979-2009 interval climate models on average predict 2-4x too much warming in the tropical lower- and mid-troposphere (LT, MT) layers, and the discrepancies were statistically significant. This paper was published as MMH2010 in Atmospheric Science Letters.
  • In the AR5, the IPCC is reasonably forthright on the topic (pp. 772-73). They acknowledge the findings in MMH2010 (and other papers that have since confirmed the point) and conclude that models overstated tropospheric warming over the satellite interval (post-1979). However, they claim that most of the bias is due to model overestimation of sea surface warming in the tropics. It is not clear from the text where they get this. Since the bias varies considerably among models, it seems to me more likely to have something to do with faulty parameterization of feedbacks. Moreover, the problem persists even in studies that constrain models to observed SST levels.
  • Notwithstanding the failure of models to get the tropical troposphere right, when discussing fidelity to temperature trends the SPM of the AR5 declares Very High Confidence in climate models (p. 15). But they also declare low confidence in their handling of clouds (p. 16), which is very difficult to square with their claim of very high confidence in models overall. They seem to be largely untroubled by trend discrepancies over 10-15 year spans (p. 15). We’ll see what they say about 55-year discrepancies.

 
The Long Balloon Record
After publishing MMH2010 I decided to extend the analysis back to the start of the weather balloon record in 1958. I knew that I’d have to deal with the Pacific Climate Shift in the late 1970s. This is a well-documented phenomenon (see ample references in the paper) in which a major reorganization of ocean currents induced a step-change in a lot of temperature series around the Pacific rim, including the tropospheric weather balloon record. Fitting a linear trend through a series with a positive step-change in the middle will bias the slope coefficient upwards. When I asked Tim if the VF method could be used in an application allowing for a suspected mean shift, he said no, it would require derivation of a new asymptotic distribution and critical values, taking into account the possibility of known or unknown break points. He agreed to take on the theoretical work and we began collaborating on the paper.
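The slope bias from ignoring a step change is easy to see in a simulation. The sketch below (my own illustration in Python/NumPy, not the paper's estimator) generates a trendless monthly series with a single level shift about a third of the way through, then fits OLS trends with and without a shift dummy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 660                                   # 55 years of monthly data
t = np.arange(n)
shift = (t >= n // 3).astype(float)       # step change about a third of the way in
y = 0.3 * shift + 0.1 * rng.standard_normal(n)  # zero true trend: just a level shift plus noise

# OLS trend ignoring the shift: y = a + b*t
X1 = np.column_stack([np.ones(n), t])
b_no_shift = np.linalg.lstsq(X1, y, rcond=None)[0][1]

# OLS trend allowing for the shift: y = a + b*t + c*shift
X2 = np.column_stack([np.ones(n), t, shift])
b_with_shift = np.linalg.lstsq(X2, y, rcond=None)[0][1]

print(f"trend ignoring shift:  {b_no_shift:.6f} per month")
print(f"trend allowing shift:  {b_with_shift:.6f} per month")
```

The fit that ignores the shift reports a clearly positive "trend" even though the true trend is zero; adding the shift dummy drives the estimated slope back toward zero. This is the mechanical reason a late-1970s step must be accounted for before interpreting a 1958-2012 trend.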
Much of the paper is taken up with deriving the methodology and establishing its validity. For readers who skip that part and wonder why it is even necessary, the answer is that in serious empirical disciplines, that’s what you are expected to do to establish the validity of novel statistical tools before applying them and drawing inferences.
Our paper provides a trend estimator and test statistic based on standard errors that are valid in the presence of serial correlation of any form up to but not including unit roots, that do not require the user to choose tuning parameters such as bandwidths and lag lengths, and that are robust to the possible presence of a shift term at a known or unknown break point. In the paper we present various sets of results based on three possible specifications: (i) there is no shift term in the data, (ii) there is a shift at a known date (we picked December 1977) and (iii) there is a possible shift term but we do not know when it occurs.
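For intuition on the "no tuning parameters" part: a conventional HAC variance estimator weights residual autocovariances with a kernel and a user-chosen bandwidth; the fixed-bandwidth approach underlying VF-type statistics instead sets the bandwidth equal to the full sample, which eliminates the bandwidth choice but makes the resulting test statistic follow a nonstandard distribution whose critical values must be simulated. The following is a loose illustrative sketch of that idea (my own code, not the paper's; it computes a Bartlett-kernel long-run variance for a trend coefficient with bandwidth equal to the sample size):

```python
import numpy as np

def trend_with_fullband_hac(y):
    """OLS trend slope plus a Bartlett-kernel long-run variance computed
    with bandwidth = sample size (fixed-bandwidth style, for illustration).
    Tests built on such a variance have nonstandard critical values."""
    n = len(y)
    t = np.arange(n, dtype=float)
    X = np.column_stack([np.ones(n), t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ beta                      # OLS residuals
    v = X * u[:, None]                    # score contributions x_t * u_t
    omega = v.T @ v / n                   # lag-0 term
    for j in range(1, n):                 # Bartlett weights, bandwidth M = n
        w = 1.0 - j / n
        gamma = v[j:].T @ v[:-j] / n
        omega += w * (gamma + gamma.T)
    Q = X.T @ X / n
    Qinv = np.linalg.inv(Q)
    V = Qinv @ omega @ Qinv / n           # sandwich variance of beta
    return beta[1], np.sqrt(V[1, 1])

# Demo on a trending series with AR(1) errors (serial correlation)
rng = np.random.default_rng(1)
n = 600
e = rng.standard_normal(n)
u = np.zeros(n)
for i in range(1, n):
    u[i] = 0.5 * u[i - 1] + e[i]
y = 0.005 * np.arange(n) + u
slope, se = trend_with_fullband_hac(y)
print(slope, se)
```

Because no bandwidth is truncated, the estimator involves no tuning choices, at the cost of requiring simulated or bootstrapped critical values, exactly the trade-off described above.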
 
Results

  • All climate models but one characterize the 1958-2012 interval as having a significant upward trend in temperatures. Allowing for a late-1970s step change has basically no effect on model-generated series. Half the climate models yield a small positive step and half a small negative step, but all except two still report a large, positive and significant trend around it. Indeed in half the cases the trend becomes even larger once we allow for the step change. In the GCM ensemble mean there is no step-change in the late 1970s, just a large, uninterrupted and significant upward trend.
  • Over the same interval, when we do not control for a step change in the observations, we find significant upward trends in tropical LT and MT temperatures, though the average observed trend is significantly smaller than the average modeled trend.
  • When we allow for a late-1970s step change in each radiosonde series, all three assign most of the post-1958 increase in both the LT and MT to the step change, and the trend slopes become essentially zero.
  • Climate models project much more warming over the 1958-2012 interval than was observed in either the LT or MT layer, and the inconsistency is statistically significant whether or not we allow for a step-change, but when we allow for a shift term the models are rejected at smaller significance levels.
  • When we treat the break point as unknown and allow a data-mining process to identify it, the shift term is marginally significant in the LT and significant in the MT, with the break point estimated in mid-1979.
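The unknown-break-date case works by searching: fit the trend-plus-shift model at every admissible break date, keep the best-fitting date, and then use critical values that account for the search (otherwise the selected break looks spuriously significant). A minimal sketch of the search step, with an assumed 15% trimming at each end of the sample (illustrative only; the paper's inference adjusts the critical values for this data mining):

```python
import numpy as np

def find_break(y, trim=0.15):
    """Search candidate break dates for the trend + level-shift model and
    return the date minimizing the residual sum of squares.
    (Sketch only: real inference must adjust critical values for the search.)"""
    n = len(y)
    t = np.arange(n, dtype=float)
    lo, hi = int(trim * n), int((1 - trim) * n)   # trim the sample ends
    best_ssr, best_k = np.inf, None
    for k in range(lo, hi):
        step = (t >= k).astype(float)
        X = np.column_stack([np.ones(n), t, step])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        ssr = np.sum((y - X @ beta) ** 2)
        if ssr < best_ssr:
            best_ssr, best_k = ssr, k
    return best_k

# Demo: a series with a true level shift at observation 150 and no trend
rng = np.random.default_rng(2)
n = 400
t = np.arange(n)
y = 0.5 * (t >= 150) + 0.1 * rng.standard_normal(n)
print(find_break(y))
```

In our data this kind of search places the break in mid-1979, close to the December 1977 date used in the known-break specification.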

When we began working on the paper a few years ago, the then-current data was from the CMIP3 model library, which is what we use in the paper. The AR5 used the CMIP5 library so I’ll generate results for those runs later, but for now I’ll discuss the CMIP3 results.
We used 23 CMIP3 models and 3 observational series. This is Figure 4 from our paper (click for larger version):

Each panel shows the trend terms (°C/decade) and HAC-robust confidence intervals for CMIP3 models 1-23 (red) and the 3 weather balloon series (blue). The left column shows the case where we don’t control for a step-change. The right column shows the case where we do, dating it at December 1977. The top row is MT, the bottom row is LT.
You can see that the model trends remain about the same with or without the level shift term, though the confidence intervals widen when we allow for a level shift. When we don’t allow for a level shift (left column), all six balloon series (three per layer) exhibit small but significant trends. When we allow for a level shift (right column), placing it at 1977:12, all observed trends become very small and statistically insignificant. All but two models (GFDL 2.0 (#7) and GFDL 2.1 (#8)) yield positive and significant trends either way.
Mainstream versus Reality
Figure 3 from our paper (below) shows the model-generated temperature data, mean GCM trend (red line) and the fitted average balloon trend (blue dashed line) over the sample period. In all series (including all the climate models) we allow a level shift at 1977:12. Top panel: MT; bottom panel: LT.

The dark red line shows the trend in the model ensemble mean. Since this displays the central tendency of climate models we can take it to be the central tendency of mainstream thinking about climate dynamics, and, in particular, how the climate responds to rising GHG forcing. The dashed blue line is the fitted trend through observations; i.e. reality. For my part, given the size and duration of the discrepancy, and the fact that the LT and MT trends are indistinguishable from zero, I do not see how the “mainstream” thinking can be correct regarding the processes governing the overall atmospheric response to rising CO2 levels. As the Thorne et al. review noted, a lack of tropospheric warming “would have fundamental and far-reaching implications for understanding of the climate system.”
Figures don’t really do justice to the clarity of our results: you need to see the numbers. Table 7 summarizes the main test scores on which our conclusions are drawn.

 
The first column indicates the data series being tested. The second column lists the null hypothesis. The third column gives the VF score, but note that this statistic follows a non-standard distribution and critical values must either be simulated or bootstrapped (as discussed in the paper). The last column gives the p-value.
The first block reports results with no level shift term included in the estimated models. The first 6 rows show the 3 LT trends (with the trend coefficient in °C/decade in brackets) followed by the 3 MT trends. The test of a zero trend strongly rejects in each case (here the 5% critical value is 41.53 and the 1% value is 83.96). The next two rows report tests of average model trend = average observed trend. These too reject, even ignoring the shift term.
The second block repeats these results with a level shift at 1977:12. Here you can see the dramatic effect of controlling for the Pacific Climate Shift. The VF scores for the zero-trend test collapse and the p-values soar; in other words the trends disappear and become practically and statistically insignificant. The model/obs trend equivalence tests strongly reject again.
The next two lines show that the shift terms are not significant in this case. This is partly because shift terms are harder to identify than trends in time series data.
The final section of the paper reports the results when we use a data-mining algorithm to identify the shift date, adjusting the critical values to take into account the search process. Again the trend equivalence tests between models and observations reject strongly, and this time the shift terms become significant or weakly significant.
We also report results model-by-model in the paper. Some GCMs do not individually reject, some always do, and for some it depends on the specification. Adding a level shift term increases the VF test scores but also increases the critical values so it doesn’t always lead to smaller p-values.
 
Why test the ensemble average and its distribution?
The IPCC (p. 772) says the observations should be tested against the span of the entire ensemble of model runs rather than the average. In one sense we do this: model-by-model results are listed in the paper. But we also dispute this approach, since the ensemble range can be made arbitrarily wide simply by adding more runs with alternative parameterizations. Proposing a test that requires the data to fall outside a range you can make as wide as you like effectively makes your theory unfalsifiable. Also, the IPCC (and everyone else) talks about climate models as a group, as a methodological genre; and it lends no support to the genre as a whole that a single outlying GCM overlaps the observations while all the others sit far away. Climate models, like any models (including economic ones), are ultimately large, elaborate numerical hypotheses: if the world works in such a way, and if the input variables change in such-and-such a way, then the following output variables will change thus and so. To defend “models” collectively, i.e. as a related set of physical hypotheses about how the world works, requires testing a measure of their central tendency, which we take to be the ensemble mean.
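One practical point about testing the ensemble mean: because the OLS trend is a linear function of the data, the trend fitted through the ensemble-mean series equals the average of the trends fitted through the individual runs, so a test on the mean trend really is a test on the models' central tendency. A quick check with synthetic "model runs" (illustrative stand-ins, not CMIP3 data):

```python
import numpy as np

def ols_trend(y):
    """OLS slope of y on a linear time trend."""
    n = len(y)
    X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

rng = np.random.default_rng(3)
n_models, n_obs = 23, 660
# Each synthetic run gets its own trend slope and its own noise
runs = np.array([0.001 * (1 + i % 5) * np.arange(n_obs)
                 + 0.2 * rng.standard_normal(n_obs)
                 for i in range(n_models)])

trend_of_mean = ols_trend(runs.mean(axis=0))        # trend of the ensemble mean
mean_of_trends = np.mean([ols_trend(r) for r in runs])  # mean of individual trends
print(trend_of_mean, mean_of_trends)                # equal up to floating point
```

The two quantities coincide by linearity, which is why the ensemble-mean trend is a well-defined summary of what the models, as a group, predict.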
Along the same lines, James Annan dismissed the MMH2010 results, saying that it was meaningless to compare the model average to the data. His argument was that some individual models also reject when compared against the model average; since it makes no sense to say models are inconsistent with models, the whole test, he concluded, must be wrong. But this is a non sequitur. Even if one or more individual models are such outliers that they reject against the model average, that does not negate the finding that the average model rejects against the observed data. If the central tendency of the models is significantly far from reality, the central tendency of the models is wrong, period. That the only model which reliably does not reject against the data (in this case GFDL 2.1) is an outlier among GCMs only adds to the evidence that the models are systematically biased.
There’s a more subtle problem in Annan’s rhetoric, when he says “Is anyone seriously going to argue on the basis of this that the models don’t predict their own behaviour?” In saying this he glosses over the distinction between a single outlier model and “the models” as a group, namely as a methodological genre. To refer to “the models” as an entity is to invoke the assumption of a shared set of hypotheses about how the climate works. Modelers often point out that GCMs are based on known physics. Presumably the laws of physics are the same for everybody, including all modelers. Some climatic processes are not resolvable from first principles and have to be represented as empirical approximations and parameterizations, hence there are differences among specific models and specific model runs. The model ensemble average (and its variance) therefore seems to me the best way to characterize the shared, central tendency of the models; to the extent a model is an outlier from the average, it is less and less representative of models in general.
 
Bottom Line
Over the 55-years from 1958 to 2012, climate models not only significantly over-predict observed warming in the tropical troposphere, but they represent it in a fundamentally different way than is observed. Models represent the interval as a smooth upward trend with no step-change. The observations, however, assign all the warming to a single step-change in the late 1970s coinciding with a known event (the Pacific Climate Shift), and identify no significant trend before or after. In my opinion the simplest and most likely interpretation of these results is that climate models, on average, fail to replicate whatever process yielded the step-change in the late 1970s and they significantly overstate the overall atmospheric response to rising CO2 levels.
 
 
