Gavin Schmidt and Reference Period “Trickery”

In the past few weeks, I’ve been re-examining the long-standing dispute over the discrepancy between models and observations in the tropical troposphere.  My interest was prompted in part by Gavin Schmidt’s recent attack on a graphic used by John Christy in numerous presentations (see recent discussion here by Judy Curry).   Schmidt made the sort of offensive allegations that he makes far too often:
@curryja use of Christy’s misleading graph instead is the sign of partisan not a scientist. YMMV. tweet;
@curryja Hey, if you think it’s fine to hide uncertainties, error bars & exaggerate differences to make political points, go right ahead.  tweet.
As a result, Curry decided not to use Christy’s graphic in her recent presentation to a congressional committee.  In today’s post, I’ll examine the validity (or lack) of Schmidt’s critique.
Schmidt’s primary dispute, as best as I can understand it, was about Christy’s centering of model and observation data to achieve a common origin in 1979, the start of the satellite period, a technique which (obviously) shows a greater discrepancy at the end of the period than if the data had been centered in the middle of the period.  I’ll show support for Christy’s method from his long-time adversary, Carl Mears, whose own comparison of models and observations used a short early centering period (1979-83) “so the changes over time can be more easily seen”. Whereas both Christy and Mears provided rational arguments for their baseline decision,  Schmidt’s argument was little more than shouting.

Background
The full history of the controversy over the discrepancy between models and observations in the tropical troposphere is voluminous.    While the main protagonists have been Christy, Douglass and Spencer on one side and Santer, Schmidt, Thorne and others on the other side, Ross McKitrick and I have also commented on this topic in the past, and McKitrick et al (2010) was discussed at some length by IPCC AR5, unfortunately, as too often, deceptively on key points.
Starting Points and Reference Periods
Christy and Spencer have produced graphics in a similar style for several years. Roy Spencer (here) in early 2014 showed a similar graphic using 1979-83 centering (shown below). Indeed, it was this earlier version that prompted vicious commentary by Bart Verheggen, commentary that appears to have originated some of the prevalent alarmist memes.

Figure 1.  2014 version of the Christy graphic, from Roy Spencer blog (here). This used 1979-83 centering. This was later criticized by Bart Verheggen here
Christy’s February 2016 presentation explained this common origin as the most appropriate reference period, using the start of a race as a metaphor: 

To this, on the contrary, I say that we have displayed the data in its most meaningful way. The issue here is the rate of warming of the bulk atmosphere, i. e., the trend. This metric tells us how rapidly heat is accumulating in the atmosphere – the fundamental metric of global warming. To depict this visually, I have adjusted all of the datasets so that they have a common origin. Think of this analogy: I have run over 500 races in the past 25 years, and in each one all of the runners start at the same place at the same time for the simple purpose of determining who is fastest and by how much at the finish line. Obviously, the overall relative speed of the runners is most clearly determined by their placement as they cross the finish line – but they must all start together.

The technique used in the 2016 graphic varied somewhat from the earlier style: it took the 1979 value of the 1975-2005 trend as a reference for centering, a value that was very close to the 1979-83 mean.
Carl Mears
Ironically, in RSS’s webpage comparison of models and observations, Christy’s longstanding adversary, Carl Mears, used an almost identical reference period (1979-84) in order that “the changes over time can be more easily seen”.   Mears wrote that “If the models, as a whole, were doing an acceptable job of simulating the past, then the observations would mostly lie within the yellow band”, but that “this was not the case”:

The yellow band shows the 5% to 95% envelope for the results of 33 CMIP-5 model simulations (19 different models, many with multiple realizations) that are intended to simulate Earth’s Climate over the 20th Century.  For the time period before 2005, the models were forced with historical values of greenhouse gases, volcanic aerosols, and solar output. After 2005, estimated projections of these forcings were used. If the models, as a whole, were doing an acceptable job of simulating the past, then the observations would mostly lie within the yellow band.  For the first two plots (Fig. 1 and Fig 2), showing global averages and tropical averages, this is not the case.

Mears illustrated the comparison in the following graphic, the caption to which states the reference period of 1979-84 and the associated explanation.

Figure 2. From RSS here.  Original caption:  Tropical (30S to 30N) Mean TLT Anomaly plotted as a function of time.  The the blue band is the 5% to 95% envelope for the RSS V3.3 MSU/AMSU Temperature uncertainty ensemble.  The yellow band is the 5% to 95% range of output from CMIP-5 climate simulations.  The mean value of each time series average from 1979-1984 is set to zero so the changes over time can be more easily seen. Again, after 1998, the observations are likely to be below the simulated values, indicating that the simulation as a whole are predicting more warming than has been observed by the satellites.  
 
The very slight closing overlap between the envelope of models and envelope of observations is clear evidence – to anyone with a practiced eye – that there is a statistically significant difference between the ensemble mean and observations using the t-statistic as in Santer et al 2008. (More on this in another post).
Nonetheless, Mears did not agree that the fault lay with the models, instead argued, together with Santer, that the fault lay with errors in forcings, errors in observations and internal variability (see here). Despite these differences in diagnosis, Mears agreed with Christy on the appropriateness of using a common origin for this sort of comparison.
IPCC AR5
IPCC, which, to borrow Schmidt’s words, is not shy about “exaggerat[ing or minimizing] differences to make political points”, selected a reference period in the middle of the satellite interval (1986-2005) for their AR5 Chapter 11 Figure 11.25, which compared a global comparison of  CMIP5 models to the average of 4 observational datasets.

Figure 3.  IPCC AR5 WG1 Figure 11.25a.
The effective origin in this graphic was therefore 1995, reducing the divergence between models and observations to approximately half of the full divergence over the satellite period. Roy Spencer recently provided the following diagram, illustrating the effect of centering two series with different trends at the middle of the period (top panel below), versus the start of the period (lower panel). If the two trending series are centered in the middle of the period, then the gap at closing is reduced to half of the gap arising from starting both series at a common origin (as in the Christy diagram.)

Figure 4. Roy Spencer’s diagram showing difference between centering at the beginning and in the middle.
Bart Verheggen
The alarmist meme about supposedly inappropriate baselines in Christy’s figure appears to have originated (or at least appeared in an early version) in a 2014 blogpost by Bart Verheggen, which reviled an earlier version of the graphic from Roy Spencer’s blog (here) shown above, which had used 1979-83 centering, a choice that was almost exactly identical to the 1979-84 centering that later used by RSS/Carl Mears (1979-84).
Verheggen labeled such baselining as “particularly flawed” and accused Christy and Spencer of “shifting” the model runs upwards to “increase the discrepancy”:

They shift the modelled temperature anomaly upwards to increase the discrepancy with observations by around 50%.

Verheggen claimed that the graphic began with an 1986-2005 reference period (the period used by IPCC AR5) and that Christy and Spencer had been “re-baseline[d]” to the shorter period of 1979-83 to “maximize the visual appearance of a discrepancy”:

The next step is re-baselining the figure to maximize the visual appearance of a discrepancy: Let’s baseline everything to the 1979-1983 average (way too short of a period and chosen very tactically it seems)… Which looks surprisingly similar to Spencer’s trickery-graph.

Verheggen did not provide a shred of evidence showing that Christy and Spencer had first done the graphic with IPCC’s middle-interval reference period and then “re-baselin[ed]” the graphic to “trick” people.  Nor, given that the reference period of “1979-83” was clearly labelled on the y-axis, it hardly required reverse engineering to conclude that Christy and Spencer had used a 1979-83 reference period nor should it have been “surprising” that an emulation using a 1979-83 reference period would look similar. Nor has Verheggen made similar condemnations of Mears’ use of a 1979-84 reference period to enable the changes to be “more easily seen”.
Verheggen’s charges continues to resonate in the alarmist blog community. A few days after Gavin Schmidt challenged Judy Curry, Verheggen’s post was cited at Climate Crocks as the “best analysis so far of John Christy’s go-to magical graph that gets so much traction in the deniosphere”.
The trickery is entirely the other way.  Graphical techniques that result in an origin in the middle of the period (~1995) rather than the start (1979) reduce the closing discrepancy by about 50%, thereby, hiding the divergence, so to speak.
Gavin Schmidt 
While Schmidt complained that the Christy diagram did not have a “reasonable baseline”,  Schmidt did not set out criteria for why one baseline was “reasonable” and another wasn’t, or what was wrong with using a common origin (or reference period at the start of the satellite period) “so the changes over time can be more easily seen” as Mears had done.
In March 2016, Schmidt produced his own graphics, using two different baselines to compare models and observations.  Schmidt made other iconographic variations to the graphic (which I intend to analyse separately), but for the analysis today, it is the reference periods that are of interest.
Schmidt’s first graphic (shown in the left panel below – unfortunately truncated on the left and right margins in the Twitter version) was introduced with the following comment:

Hopefully final versions for tropical mid-troposphere model-obs comparison time-series and trends (until 2016!).

This version used 1979-1988 centering, a choice which yields relatively small differences from Christy’s centering.  Victor Venema immediately ragged Schmidt about producing anomalies so similar to Christy and wondered about the reference period:

@ClimateOfGavin Are these Christy-anomalies with base period 1983? Or is it a coincidence that the observations fit so well in beginning?

Schmidt quickly re-did the graphic using 1979-1998 centering, thereby lessening the similarity to “Christy anomalies”, announcing the revision (shown on the right below) as follows:

@VariabilityBlog It’s easy enough to change. Here’s the same thing using 1979-1998. Perhaps that’s better…

After Schmidt’s “re-baselining” of the graphic (to borrow Verheggen’s term), the observations were now shown as within the confidence interval throughout the period. It was this second version that Schmidt later proffered to Curry as the result arising from a “more reasonable” baseline.

Figure 5. Two figures from Gavin Schmidt tweets on March 4, 2016.  Left – from March 4 tweet, using 1979-1988 centering. Note that parts of the graphic on the left and right margins appear to have been cut off, so that the graph does not go to 2015. Right- second version using 1979-1998 centering, thereby lowering model frame relative to observations.
The incident is more than a little ironic in the context of Verheggen’s earlier accusations. Verheggen showed a sequence of graphs going from a 1986-2005 baseline to a 1979-1983 baseline and accused Spencer and Christy of “re-baselining” the graphic “to maximize the visual appearance of a discrepancy” – which Verheggen called “trickery”. Verheggen made these accusations without a shred of evidence that Christy and Spencer had started from a 1986-2005 reference period – a highly questionable interval in the first place, if one is trying to show differences over the 1979-2012 period, as Mears had recognized. On the other hand, prompted by Venema, Schmidt actually did “re-baseline” his graphic, reducing the “visual appearance of a discrepancy”.
The Christy Graphic Again
Judy Curry had reservations about whether Schmidt’s “re-baselining” was sufficient to account for the changes from the Christy figure, observing:

My reaction was that these plots look nothing like Christy’s plot, and its not just a baseline issue.

In addition to changing the reference period, Schmidt’s graphic made several other changes:

  • Schmidt used annual data, rather than a 5-year average.
  • Schmidt showed a grey envelope representing the 5-95% confidence interval, rather than showing the individual spaghetti strands;
  • instead of showing 102 runs individually, Christy showed averages for 32 models. Schmidt seems to have used the 102 runs individually, based on his incorrect reference to 102 models(!) in his caption.

I am in the process of trying to replicate Schmidt’s graphic.  To isolate the effect of Schmidt’s re-baselining on the Christy graphic, I replicated the Christy graphic as closely as I could, with the resulting graphic (second panel) capturing the essentials in my opinion, and then reproduced the graphic using Schmidt centering.
The third panel isolates the effect of Schmidt’s 1979-1998 centering period. This moves downward both models and observations, models slightly more than observations. However, in my opinion, the visual effect is not materially changed from Christy centering. This seems to confirm Judy Curry’s surmise that the changes in Schmidt’s graphic arise from more than the change in baseline. One possibility was that change in visual appearance arose from Christy’s use of ensemble averages for each model, rather than individual runs. To test this, the fourth panel shows the Christy graphic using runs. Once again, it does not appear to me that this iconographic decision is material to the visual impression. While the spaghetti graph on this scale is not particularly clear, the INM-CM4 model run can be distinguished as the singleton “cold” model in all four panels.
Figure 1. Christy graphic (left panel) and variations. See discussion in text.   The blue line shows the average of the UAH 6.0 and RSS 3.3 TLT tropical data. 
Conclusion
 
There is nothing mysterious about using the gap between models and observations at the end of the period as a measure of differing trends.  When Secretariat defeated the field in the 1973 Belmont by 25 lengths, even contemporary climate scientists did not dispute that Secretariat ran faster than the other horses.
Even Ben Santer has not tried to challenge whether there was a “statistically significant difference” between Steph Curry’s epic 3-point shooting in 2015-6 and leaders in other seasons.  Last weekend, NYT Sports illustrated the gap between Steph Curry and previous 3-point leaders using a spaghetti graph (see below) that, like the Christy graph, started the comparisons with a common origin. The visual force comes in large measure from the separation at the end.
If NYT Sports had centered the series in the middle of the season (in Bart Verheggen style), then Curry’s separation at the end of the season would be cut in half.  If NYT Sports had centered the series on the first half (in the style of Gavin Schmidt’s “reasonable baseline”), Curry’s separation at the end of the season would likewise be reduced.  Obviously, such attempts to diminish the separation would be rejected as laughable.
There is a real discrepancy between models and observations in the tropical troposphere.  If the point at issue is the difference in trend during the satellite period (1979 on), then, as Carl Mears observed, it is entirely reasonable to use center the data on an early reference period such as the 1979-84 used by Mears or the 1979-83 period used by Christy and Spencer (or the closely related value of the trend in 1979) so that (in Mears’ words) “the changes over time can be more easily seen”.
Varying Schmidt’s words, doing anything else will result in “hiding” and minimizing “differences to make political points”, which, once again in Schmidt’s words, “is the sign of partisan not a scientist.”
There are other issues pertaining to the comparison of models and observations which I intend to comment on and/or re-visit.

Source