Schmidt’s Histogram Diagram Doesn’t Refute Christy

In my most recent post, I discussed yet another incident in the long-running dispute about the inconsistency between models and observations in the tropical troposphere – Gavin Schmidt’s Twitter mugging of John Christy and Judy Curry. Included in Schmidt’s exchange with Curry was a diagram with a histogram of model runs. In today’s post, I’ll parse the diagram presented to Curry, first discussing the effect of some sleight-of-hand and then showing that Schmidt’s diagram, once the sleight-of-hand is removed and the figure is read by someone familiar with statistical distributions, confirms Christy rather than contradicting him.
Background 
The proximate cause of Schmidt’s bilious tweets was Curry’s proposed use of the tropical troposphere spaghetti graph from Christy’s more recent congressional testimony in her planned presentation to NARUC.   In that testimony, Christy had reported that “models over-warm the tropical atmosphere by a factor of approximately 3, (Models +0.265, Satellites +0.095, Balloons +0.073 °C/decade)”
The Christy diagram has long been criticized by warmist blogs for its baseline – an allegation that I examined in my most recent post, in which I showed that baselining as prescribed by Schmidt and/or Verheggen was guilty of the very offences alleged in Schmidt’s accusation, and that, ironically, Christy’s nemesis, Carl Mears, had used a nearly identical baseline but had not been excoriated by Schmidt or others.
I had focused first on baselining, because that had been the main issue at warmist blogs relating to the Christy diagram. However, in Twitter followup to my post, Schmidt pretended not to recognize the baselining issue, instead saying that the issue was merely “uncertainties”, but did not expand on exactly how “uncertainties” discomfited the Christy graphic. Even though I had shown that Christy’s baselining was equivalent to Carl Mears’, Schmidt refused to disassociate himself from Verheggen’s offensive accusations.
One clue to Schmidt’s invocation of “uncertainties” comes from the histogram diagram that he proposed to Judy Curry, shown below. It was accompanied by a second diagram, which represented the spaghetti distribution of model runs as a grey envelope – an iconographical technique that I will discuss on another occasion. The histogram diagram consisted of two main elements: (1) a histogram of the 102 CMIP5 runs (32 models); (2) five line segments, each representing the confidence interval for one of five different satellite measurement series.

Schmidt did not provide a statistical interpretation or commentary on this graphic, apparently thinking that the diagram somehow refuted Christy on its face.  However, it does nothing of the sort.  CA reader JamesG characterized it as “the daft argument that because the obs uncertainties clip the model uncertainties then the models ain’t so bad.”  In fact, to anyone with a grounded understanding of joint statistical distributions, Schmidt’s diagram actually supports Christy’s claim of inconsistency.
TRP vs GLB Troposphere
Alert readers may have already noticed that whereas the Christy figure in controversy depicted trends in the tropical troposphere – a zone that has long been especially in dispute – Schmidt’s histogram depicted trends in the global troposphere.
In the figure below, I’ve closely emulated Schmidt’s diagram and shown the effect of the difference.  In the left panel, I’ve shown the Schmidt histogram (GLB TMT) with horizontal and vertical axes transposed for graphical convenience. The second panel shows my emulation of the Schmidt diagram using GLB TMT (mid-troposphere) from CMIP5. The third and fourth panels show identically constructed diagrams for tropical TMT and tropical TLT (lower troposphere), each derived from the Christy compilation of 102 CMIP5 runs (also, I believe, used by Schmidt.) Discussion below the figure.

Figure 1. Histograms of 1979-2015 trends versus satellite observations. Left – Gavin Schmidt; second panel – GLB TMT; third panel – TRP TMT; fourth panel: TRP TLT. The black triangle shows average of model runs.  All calculated from annualized data. 
The histograms and observations in panels 2-4 were all calculated from annualizations of monthly data (following indications of Schmidt’s method.)  The resulting panel for Global TMT (second panel) corresponds reasonably to the Schmidt diagram, though there are some puzzling differences of detail.   The lengths of the line segments for each satellite observation series were calculated as the standard error of the trend coefficient using OLS on annualized data, closely replicating the Schmidt segments (and corresponding to information from a Schmidt tweet.)  This yields higher uncertainty than the same calculation on monthly data, but less than assuming AR1 errors with monthly data. The confidence intervals are also somewhat larger than the corresponding confidence intervals in the RSS simulations of structural uncertainty, a detail that I can discuss on another occasion.
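As an illustration of the trend and standard-error calculation described above, here is a minimal sketch in Python (the post’s own calculations were done separately; the monthly series below is a synthetic placeholder, not actual satellite data):

```python
import numpy as np

def annualize(monthly, start_year=1979):
    """Average each calendar year of 12 monthly anomalies into one annual value."""
    n_years = len(monthly) // 12
    annual = np.asarray(monthly)[: n_years * 12].reshape(n_years, 12).mean(axis=1)
    return start_year + np.arange(n_years), annual

def ols_trend(years, annual):
    """OLS trend and its standard error, in deg C/decade, from annualized data."""
    x = years - years.mean()
    slope = np.sum(x * annual) / np.sum(x ** 2)          # deg C per year
    resid = annual - annual.mean() - slope * x
    se = np.sqrt(np.sum(resid ** 2) / (len(annual) - 2) / np.sum(x ** 2))
    return 10 * slope, 10 * se                           # convert to per decade

# usage with a synthetic 1979-2015 monthly series (444 months)
rng = np.random.default_rng(0)
fake_monthly = 0.015 * np.arange(444) / 12 + rng.normal(0, 0.2, size=444)
years, annual = annualize(fake_monthly)
trend, se = ols_trend(years, annual)
print(f"trend = {trend:.3f} deg C/decade, 95% CI approx +/- {2 * se:.3f}")
```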
In the third panel, I did the same calculation with tropical (TRP) TMT data, thus corresponding to the Christy diagram at which Schmidt had taken offence. The trends in this panel are noticeably higher than in the GLB panel (this is the well-known “hot spot” in models of the tropical troposphere). In my own previous discussions of this topic, I’ve considered the lower troposphere (TLT) rather than the mid-troposphere and, for consistency, I’ve shown this in the fourth (right) panel. Tropical TLT model runs trend slightly warmer than tropical TMT runs, but only a little. In each case, I’ve extracted the available satellite data. Tropical TLT data from RSS 4.0 and NOAA is not yet available (and thus not shown in the fourth panel.)
The average tropical TMT model trend was 0.275 deg C/decade, about 30% higher than the corresponding GLB trend (0.211 deg C/decade) shown in the Schmidt diagram. The difference between the mean of the model runs and observations was about 55% higher in the tropical diagram than in the GLB diagram.
So Schmidt’s use of the global mid-troposphere in his initial tweet to Curry had the effect of materially reducing the apparent discrepancy. Update (May 6): In a later tweet, Schmidt additionally showed the corresponding graphic for tropical TMT. I’ll update this post to reflect this.

The Model Mean: Back to Santer et al 2008
In response to my initial post about baselining, Chris Colose purported to defend Schmidt (tweet) stating:

“re-baselining is not the only issues. large obs uncertainty, model mean not appropriate, etc.”

I hadn’t said that “re-baselining” was the “only” issue. I had opened with it as an issue because it had been the most prominent in warmist critiques and had occasioned offensive allegations, originally from Verheggen, but repeated recently by others.  So I thought that it was important to take it off the table. I invited Gavin Schmidt to disassociate himself from Verheggen’s unwarranted accusations about re-baselining, but Schmidt refused.
Colose’s assertion that the “model mean [is] not appropriate” ought to raise questions, since differences in means are assessed all the time in all branches of science.  Ironically, a comparison of observations to the model mean was one of the key comparisons in Santer et al 2008, of which Schmidt was a co-author.  So Santer, Schmidt et al had no issue at the time with the principle of comparing observations to the model mean.  Unfortunately (as Ross and I observed in a contemporary submission), Santer et al used obsolete data (ending in 1999) and their results (purporting to show no statistically significant difference) were invalid using then up-to-date data. (The results are even more offside with the addition of data to the present.)
For their comparison of the difference between means, Santer et al used a t-statistic in which the standard error of the model mean was the standard deviation of the model trends divided by the square root of the number of models. I draw attention to this formula since Schmidt and others had argued vehemently against inclusion of the n_m divisor for the number of models.
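As a reconstruction of the formula from the description above and my reading of Santer et al 2008 (the notation here is my own), the statistic has the form

$$ d^{*} = \frac{\langle b_{m}\rangle - b_{o}}{\sqrt{\dfrac{s^{2}(b_{m})}{n_{m}} + s_{b_{o}}^{2}}} $$

where ⟨b_m⟩ and s(b_m) are the mean and standard deviation of the model trends, n_m is the number of models, b_o is the observed trend and s_{b_o} is the standard error of the observed trend; the s(b_m)/√n_m term is the standard error of the model mean at issue.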

The above formula for the standard error of the model mean, as Santer himself realized – mentioning the point in several Climategate emails – was identical to that used in Douglass et al 2008. The Santer et al formula differed from Douglass et al in the other term of the denominator: the standard error of observations s_{b_o}.
In December 2007, Santer et al 2008 coauthor Schmidt had ridiculed this formula for the standard error of models as an “egregious error”, claiming that division of the standard deviation by the square root of the number of models resulted in the “absurd” situation in which some runs contributing to the model mean lay outside the confidence interval for the model mean.

Schmidt’s December 2007 post relied on rhetoric rather than statistical references and his argument was not adopted in Santer et al 2008, which divided the standard deviation by the square root of the number of models.
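To illustrate the conventional point at issue (a generic sketch with made-up numbers, not the actual model data): the standard error of a mean of n values is narrower than the spread of the individual values by a factor of sqrt(n), so individual runs routinely fall outside the confidence interval for the mean.

```python
import numpy as np

rng = np.random.default_rng(1)
n_runs = 32
trends = rng.normal(0.27, 0.06, n_runs)       # hypothetical model trends

sd_runs = trends.std(ddof=1)                  # spread of individual runs
se_mean = sd_runs / np.sqrt(n_runs)           # standard error of the mean

lo, hi = trends.mean() - 2 * se_mean, trends.mean() + 2 * se_mean
outside = np.sum((trends < lo) | (trends > hi))
print(f"spread of runs {sd_runs:.3f}, s.e. of mean {se_mean:.3f}")
print(f"{outside} of {n_runs} runs lie outside the 95% CI for the mean - "
      "expected, since that CI describes the mean, not individual runs")
```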
Schmidt’s December 2007 argument caused some confusion in October 2008 when Santer et al 2008 was released, on which hitherto undiscussed Climategate emails shed interesting light. Gavin Cawley, commenting at Lucia’s and Climate Audit in October 2008 as “beaker”, was so persuaded by Schmidt’s December 2007 post that he argued that there must have been a misprint in Santer et al 2008. Cawley purported to justify his claimed misprint with a variety of arid arguments that made little sense to either Lucia or me. We lost interest in Cawley’s arguments once we were able to verify from tables in Santer et al 2008 that there was no misprint, and to establish that Santer et al 2008 had used the same formula for the standard error of models as Douglass et al (differing, as noted above, in the term for the standard error of observations.)
Cawley pursued the matter in emails to Santer that later became part of the Climategate record. Cawley pointed to Schmidt’s earlier post at Real Climate and asked Santer whether there was a misprint in Santer et al 2008. Santer forwarded Cawley’s inquiry to Tom Wigley, who told Santer that Schmidt’s Real Climate article was “simply wrong” and warned Santer that Schmidt was “not a statistician” – points on which a broad consensus could undoubtedly have been achieved. Unfortunately, Wigley never went public with his rejection of Schmidt’s statistical claims, which remain uncorrected to this day. Santer replied to Cawley that the formula in the article was correct and was conventional statistics, citing von Storch and Zwiers as authority. Although Cawley had been very vehement in his challenges to Lucia and me, he did not close the circle when he heard back from Santer by conceding that Lucia and I had been correct in our interpretation.
Bayesian vs Frequentist 
In recent statistical commentary, there has been a very consistent movement to de-emphasize “statistical significance” as a sort of talisman of scientific validity, while placing increased emphasis on descriptive statistics and on showing distributions – a move that is associated with the increasing prominence of Bayesianism and that is much easier with modern computers. As someone who treats data very descriptively, I’m comfortable with the movement.
Rather than worry about whether something is “statistically significant”, the more modern approach is to look at its “posterior distribution”. Andrew Gelman’s text (Applied Bayesian Analysis, p 95) specifically recommended this in connection with differences between means:

In problems involving a continuous parameter θ (say the difference between two means), the hypothesis that θ is exactly zero is rarely reasonable, and it is of more interest to estimate a posterior distribution or a corresponding interval estimate of θ. For a continuous parameter θ, the question ‘Does θ equal 0?’ can generally be rephrased more usefully as ‘What is the posterior distribution for θ?’ (text, p 95)

In the diagram below, I show how the information in a Schmidt-style histogram can be translated into a posterior distribution, and why such a distribution is helpful and relevant to someone trying to understand the data in a practical way. The techniques below do not use the full Bayesian apparatus of MCMC simulations (which I have not mastered), but I would be astonished if such techniques resulted in any material difference. (I’m somewhat reassured by the fact that this was my very first instinct when confronted with this issue: see the October 2008 CA post here and the Postscript below.)
On the left, I’ve shown the Schmidt-style diagram for tropical TMT (third panel above). In the middle, I’ve shown approximate distributions for model runs (pink) and observations (light blue) – explained below – and, in the right panel, the distribution of the difference between model mean and observations. From the diagram in the right panel, one can draw conclusions about the t-statistic for the difference in means, but, for me, the picture is more meaningful than a t-statistic.

Figure 2. Tropical TMT trends. Left – as in third panel of Figure 1. Middle – pink: distribution of model trends corresponding to the histogram; light blue: implied distribution of observed trends. Right – distribution of the difference of model and observed trends. In the data used in panel three above (TRP TMT), I got indistinguishable results (models +0.272 deg C/decade; satellites +0.095 deg C/decade).
The left panel histogram of trends for tropical TMT is derived from the Christy collation (also used by Schmidt) of the 102 CMIP5 runs (with taz) at KNMI. The line segments represent 95% confidence intervals for five satellite series based on the method used in Schmidt’s diagram (see Figure 1 for color code).
In the middle panel, I’ve used normal distributions for the approximations, since their properties are tractable, but the results of this post would apply for other distributions as well.  For models, I’ve used the mean  and standard deviation of the 102 CMIP5 runs (0.272 and 0.058 deg C/decade, respectively).  For observations, I presumed that each satellite was associated with a normal distribution with the standard deviation being the standard error of the trend coefficient in the regression calculation; for each of the five series, I simulated 1000 realizations. From the composite of 5000 realizations, I calculated the mean and standard deviation  (0.095 and 0.049 deg C/decade respectively) and used that for the normal distribution for observations shown in light blue.  There are other reasonable ways of doing this, but this seemed to me to be the most consistent with Schmidt’s graphic. Note that this technique  yields a somewhat wider envelope than the envelope of realizations representing structural uncertainty in the RSS ensemble.
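A minimal sketch of that construction, with placeholder values for the five satellite trends and standard errors (the actual per-series values are not tabulated in this post), and with the model normal taken from the figures quoted above:

```python
import numpy as np

rng = np.random.default_rng(2)

# placeholder trend and trend standard error (deg C/decade) for five satellite series
sat_trends = [0.090, 0.100, 0.080, 0.110, 0.095]
sat_ses    = [0.050, 0.050, 0.040, 0.060, 0.050]

# 1000 normal realizations per series, pooled into a composite of 5000
draws = np.concatenate([rng.normal(t, s, 1000) for t, s in zip(sat_trends, sat_ses)])
obs_mean, obs_sd = draws.mean(), draws.std(ddof=1)

# model distribution: normal fitted to the 102 CMIP5 trends (values from the post)
mod_mean, mod_sd = 0.272, 0.058

print(f"observations ~ N({obs_mean:.3f}, {obs_sd:.3f}); models ~ N({mod_mean:.3f}, {mod_sd:.3f})")
```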
In the right panel, I’ve shown the distribution of the difference of means, calculated following Jaynes’ formula (discussed at CA previously here). In an analysis following Jaynes’ technique, the issue is not whether the difference in means is “statistically significant”, but rather the odds/probability that a draw from the models would be higher than a draw from the observations, fully accounting for the uncertainties of both, calculated according to the following formula from Jaynes:
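In general form (my own rendering, with p_m and p_o denoting the model and observation densities), the probability that a model draw exceeds an observation draw is

$$ P(b_m > b_o) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} p_m(x)\, p_o(y)\, H(x - y)\, dx\, dy $$

where H is the Heaviside step function; equivalently, one forms the distribution of the difference δ = b_m − b_o and reads off its tail probabilities.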

Because the two distributions in the middle panel are specified as normal, the distribution of the difference is also normal, with its mean being the difference between the two means and its standard deviation being the square root of the sum of squares of the two standard deviations in the middle panel (mean 0.177 and sd 0.076 deg C/decade, respectively). For more complicated distributions, the difference distribution could be calculated using simulations to effect the integration.
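A minimal sketch of that arithmetic, using the means and standard deviations reported above (the tail probabilities obtained will depend slightly on rounding of those inputs):

```python
from math import sqrt
from scipy.stats import norm

mod_mean, mod_sd = 0.272, 0.058    # model trend distribution (deg C/decade)
obs_mean, obs_sd = 0.095, 0.049    # composite observed trend distribution

diff_mean = mod_mean - obs_mean                 # mean of the difference
diff_sd = sqrt(mod_sd ** 2 + obs_sd ** 2)       # sd of the difference

for threshold in (0.0, 0.1, 0.2):
    p = norm.sf(threshold, loc=diff_mean, scale=diff_sd)
    print(f"P(model draw exceeds observed draw by more than {threshold:.1f}) = {p:.3f}")
```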
Conclusion
In the present case, from the distribution in the right panel:

  • a model run will be warmer than an observed trend more than 99.5% of the time;
  • will be warmer than an observed trend by more than 0.1 deg C/decade approximately 88% of the time;
  • and will be warmer than an observed trend by more than 0.2 deg C/decade more than 41% of the time.

These values demonstrate a very substantial warm bias in models, as reported by Christy – a bias which cannot be dismissed by mere arm-waving about “uncertainties” in Schmidt style. As an editorial comment on why the “uncertainties” have a relatively negligible impact on the “bias”: it is important to recognize that the uncertainties work in both directions, a trivial point seemingly neglected in Schmidt’s “daft argument”. Schmidt’s “argument” relied almost entirely on the rhetorical impact of the upper tail of the observation distributions nicking the lower tail of the model distributions. But the wider upper tail is accompanied by a wider lower tail, and for observations in that lower tail the discrepancy from the models is even larger than the mean discrepancy.
Unsurprisingly, using up-to-date data, the t-test used in Santer et al 2008 is even more offside than it was in early 2009. The t-value under Santer’s equation 12 is 3.835, far outside usual confidence limits. Ironically, it fails even using the incorrect formula for standard error of models, which Schmidt had previously advocated.
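For reference, a sketch of an equation-12 style calculation; the inputs below are the summary figures from this post rather than the exact series behind the 3.835 value, so the output is illustrative only:

```python
from math import sqrt

def d_star(mod_mean, mod_sd, n_m, obs_trend, obs_se):
    """Difference-of-means statistic with the Santer-style s.e. of the model mean."""
    se_model_mean = mod_sd / sqrt(n_m)
    return (mod_mean - obs_trend) / sqrt(se_model_mean ** 2 + obs_se ** 2)

# whether n_m counts models (32) or runs (102) is a choice; both give a large value here
print(d_star(0.272, 0.058, 32, 0.095, 0.049))
print(d_star(0.272, 0.058, 102, 0.095, 0.049))
```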
The bottom line is that Schmidt’s diagram does not contradict Christy after all and totally fails to support Schmidt’s charge that Christy’s diagram was “partisan”.
Postscript
As a small postscript, I am somewhat pleased to observe that my very first instinct, when confronted by the data in dispute in Santer et al 2008, was to calculate a sort of posterior distribution, albeit by a somewhat homemade method – see the October 2008 CA post here.
In that post, I calculated a histogram of the model trends used in Douglass et al (tropical TLT to end 2004, as I recall – I’ll check what I did). Note that the model mean (and overall distribution) at that time was considerably lower than the model mean (and envelope) to the end of 2015. When one squints at the models in detail, they tend to accelerate in the 21st century. I had then calculated, for trial values between -0.3 and 0.4 deg C/decade, the proportion of models with a trend greater than the trial value (a different format from the density curve in my diagram above, but one can be calculated from the other).
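That calculation amounts to an empirical exceedance curve over the model trends; a sketch with placeholder values (the Douglass-era trends are not reproduced here):

```python
import numpy as np

# placeholder model trends (deg C/decade); the 2008 post used the Douglass et al runs
model_trends = np.array([0.05, 0.12, 0.16, 0.20, 0.22, 0.25, 0.28, 0.33])

for v in np.arange(-0.3, 0.41, 0.1):
    prop = np.mean(model_trends > v)
    print(f"proportion of model trends greater than {v:+.1f}: {prop:.2f}")
```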

Figures from CA here. Left – histogram of model runs used in Douglass et al 2008; right 