Neukom and Gergis Serve Cold Screened Spaghetti

Neukom, Gergis and Karoly, accompanied by a phalanx of protective specialists, have served up a plate of cold screened spaghetti in today’s Nature (announced by Gergis here).
Gergis et al 2012 (presently in a sort of zombie withdrawal) had foundered on ex post screening. Neukom, Gergis and Karoly 2014 take ex post screening to a new and, shall we say, unprecedented level. This will be the topic of today’s post.
Data Availability
As a preamble, the spaghetti is cold in the sense that the network of proxies is almost identical to that of Neukom and Gergis 2012, which was not archived at the time and which Neukom refused to provide (see CA here). I had hoped that Nature would require Neukom to archive the data this time, but disappointingly he once again did not. I’m reasonably optimistic that Nature will eventually require him to do so, but the unavailability of the data when the article is released significantly restricts commentary. I’ve written to Nature asking them to require Neukom and Gergis to archive the data. (April 1 – an archive has been placed online at NOAA.)
Wagenmakers’ Anti-Torture Protocol
In the wake of several social psychology scandals, there has been renewed statistical interest in the problem of “data torture”, notably by Wagenmakers (see here and here).
Wagenmakers observes that “data torture” can occur in many ways. He is particularly critical of the ad hoc and ex post techniques that authors commonly use to extract “statistically significant” results from unwieldy data. Ex post screening is an example of data torture. Wagenmakers urges that, for “confirmatory analysis”, authors be required to set out a statistical plan in advance and stick to it. He acknowledges that some results may emerge during analysis, but finds that such results can only be described as “exploratory”.
Wagenmakers’ anti-torture protocol not only condemns ex post statistical manipulations (including ex post screening), but also excludes data used in the formulation of a hypothesis from confirmatory testing of that hypothesis. In other words, Wagenmakers’ anti-torture protocol would exclude proxies used to develop previous Hockey Sticks and restrict confirmation studies to the consideration of new proxies. This would prevent the use of the same data over and over again in supposedly “independent” studies – a paleoclimate practice long criticized at CA.
In my own examination of new multiproxy reconstructions, I tend to be most interested in “new” proxies. It would be a worthwhile exercise in each new reconstruction to clearly show and discuss the “new” proxies – which are the only ones that pass Wagenmakers’ criteria.
Screening
Ex post (after the fact) screening is a form of data torture long criticized at climate blogs (CA, Jeff Id, Lucia, Lubos – though not previously under the term “data torture”), but widely accepted by IPCC scientists. It was an issue with Gergis et al 2012 and is again with Neukom, Gergis and Karoly 2014.
Gergis et al 2012 had stated that they had mitigated post hoc screening by de-trending the data before correlation. However, as Jean S observed at Climate Audit at the time, they actually calculated correlations on non-detrended data. Jean S observed that almost no proxies passed screening using the protocol reported in the article itself.
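To see why the distinction matters, here is a minimal sketch (synthetic data; all names and numbers are mine, not from either paper): two series that share nothing but a linear trend correlate strongly in raw form, but show essentially no correlation once detrended.

```python
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(1911, 1991).astype(float)

# Synthetic series sharing only a linear trend, plus independent noise
trend = 0.02 * (years - years[0])
proxy = trend + 0.2 * rng.standard_normal(years.size)
temp = trend + 0.2 * rng.standard_normal(years.size)

def detrend(x, t):
    """Remove the least-squares linear fit of x on t."""
    slope, intercept = np.polyfit(t, x, 1)
    return x - (slope * t + intercept)

r_raw = np.corrcoef(proxy, temp)[0, 1]
r_det = np.corrcoef(detrend(proxy, years), detrend(temp, years))[0, 1]
print(f"raw correlation:       {r_raw:.2f}")  # high - driven by the shared trend
print(f"detrended correlation: {r_det:.2f}")  # near zero - no year-to-year relationship
```

A proxy can therefore “pass” a non-detrended screen on the strength of a shared trend alone – which is exactly why the discrepancy that Jean S identified mattered.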
Gergis, encouraged by Mann and Schmidt, tried to persuade the journal that they should be allowed to change the description of the methodology to match their actual calculations. However, the journal did not agree. They required Gergis and Neukom to re-do their calculations using the stated methodology and to show that any difference in protocol “didn’t matter.” Unfortunately for Gergis and Neukom, it did matter. They subsequently re-submitted, but two years later, nothing has appeared.
In their new article, Neukom and Gergis are once again back in the post hoc screening business, and have taken it to, shall we say, unprecedented levels.
They stated that their network consisted of 325 “records”:

The palaeoclimate data network consists of 48 marine (46 coral and 2 sediment time series) and 277 terrestrial (206 tree-ring sites, 42 ice core, 19 documentary, 8 lake sediment and 2 speleothem) records [totalling 325 sites] (details in Supplementary Section 1)…

Some of the 206 tree ring sites are combined into “composites” of nearby sites: their list of proxies in Supplementary Table 1 contains 204 records, and it is these 204 records that are screened.
Once again, they claimed that their screening was based on local correlations using detrended data; the screening reduced the network to 111 records (54% of the 204). In the Methodology section of the article:

Proxies are screened with local grid-cell temperatures yielding 111 temperature predictors (Fig. 1) for the nested multivariate principal component regression procedure.

and in the SI:

The predictors for the reconstructions are selected based on their local correlations with the target grid…

Later in the SI, they state that detrended data was used for the local correlation:

Both the proxy and instrumental data are linearly detrended over the 1911-1990 overlap period prior to the correlation analyses. Correlations of each proxy record with all grid cells are then calculated for the period 1911-1990.

Jean S determined that only a few proxies in the network of Gergis et al 2012 (which contributes to the present network) passed a screening test using detrended data.
So how did Neukom and Gergis 2014 get a yield of over 54%?
Watch how they calculated “local” correlation. Later in the SI, they say (for all non-Antarctic proxies):

We consider the “local” correlation of each record as the highest absolute correlation of a proxy with all grid cells within a radius of 1000 km and for all the three lags (0, 1 or -1 years). A proxy record is included in the predictor set if this local correlation is significant (p<0.05). … Significance levels (5% threshold) are calculated taking AR1 autocorrelation into account (Bretherton et al., 1999).

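For reference, the cited Bretherton et al. (1999) adjustment replaces the sample size with an effective sample size derived from the two series’ lag-1 autocorrelations. A sketch of how a single-test threshold of this kind might be computed (my construction of the standard formula, not code from the paper):

```python
import numpy as np
from scipy import stats

def effective_n(x, y):
    """Bretherton et al. (1999) effective sample size for two autocorrelated
    series: n_eff = n * (1 - r1*r2) / (1 + r1*r2), where r1 and r2 are the
    lag-1 autocorrelations of x and y."""
    r1 = np.corrcoef(x[:-1], x[1:])[0, 1]
    r2 = np.corrcoef(y[:-1], y[1:])[0, 1]
    return len(x) * (1 - r1 * r2) / (1 + r1 * r2)

def critical_r(n_eff, alpha=0.05):
    """Two-sided critical correlation for a SINGLE test at level alpha."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n_eff - 2)
    return t_crit / np.sqrt(t_crit**2 + n_eff - 2)
```

The point to notice is that this benchmark is for one correlation, computed once.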
Mann et al 2008 had improved their screening yield by a “pick two” methodology. Neukom and Gergis go far beyond that, comparing each proxy to all grid cells within 1000 km at each of three lags. As I understand it, they picked the “best” correlation from several dozen comparisons. One wonders how they calculated “significance” for such a statistic (not elaborated in the article itself): unless their benchmarks allowed for the enormous number of comparisons, their “significance” calculations would be incorrect.
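A quick Monte Carlo illustrates the inflation. Assuming, purely for illustration (the numbers are mine), 50 grid cells within 1000 km, three lags, an 80-year calibration period and modestly autocorrelated series with no true relationship to one another, the “best of 150” correlation clears a single-test 5% threshold far more than 5% of the time:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_years, n_cells, n_sims, phi = 80, 50, 500, 0.3  # all assumed, for illustration

# Single-test 5% threshold with a Bretherton-style AR1 adjustment
n_eff = n_years * (1 - phi**2) / (1 + phi**2)
t_c = stats.t.ppf(0.975, df=n_eff - 2)
r_crit = t_c / np.sqrt(t_c**2 + n_eff - 2)

def ar1(n, phi, rng):
    """AR(1) series with coefficient phi."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

def lagged_r(a, b, lag):
    """Correlation of a with b shifted by lag (-1, 0 or +1 years)."""
    if lag > 0:
        a, b = a[:-lag], b[lag:]
    elif lag < 0:
        a, b = a[-lag:], b[:lag]
    return np.corrcoef(a, b)[0, 1]

hits_single = hits_best = 0
for _ in range(n_sims):
    proxy = ar1(n_years, phi, rng)
    cells = [ar1(n_years, phi, rng) for _ in range(n_cells)]  # independent of proxy
    best = max(abs(lagged_r(proxy, c, lag)) for c in cells for lag in (-1, 0, 1))
    hits_best += best > r_crit
    hits_single += abs(lagged_r(proxy, cells[0], 0)) > r_crit

print(f"false-positive rate, one cell, lag 0: {hits_single / n_sims:.2f}")  # ~0.05
print(f"false-positive rate, best of {3 * n_cells}:   {hits_best / n_sims:.2f}")  # far above 0.05
```

Neighbouring grid cells are spatially correlated in practice, so the effective number of independent comparisons is smaller than 150, but the direction of the bias is the same: picking the maximum and then benchmarking it as if it were a single correlation overstates significance.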
The above procedure is used for non-Antarctic proxies. For Antarctic proxies, they say:

Proxies from Antarctica, which are outside the domain used for proxy screening, are included, if they correlate significantly with at least 10% of the grid-area used for screening (latitude weighted).

At present, I am unable to interpret this test in operational terms.
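For concreteness, one possible reading – and it is purely my speculation, not the paper’s stated method – is to pass an Antarctic proxy when the latitude-weighted fraction of significantly correlated cells in the screening domain reaches 10%:

```python
import numpy as np

def antarctic_screen(r_by_cell, lat_by_cell, r_crit, min_frac=0.10):
    """Speculative reading of the Antarctic test: pass the proxy if the
    latitude-weighted fraction of screening-domain grid cells showing a
    significant correlation is at least min_frac. Entirely my construction."""
    weights = np.cos(np.radians(lat_by_cell))   # proportional to grid-cell area
    significant = np.abs(r_by_cell) > r_crit
    return weights[significant].sum() / weights.sum() >= min_frac
```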
With a radius of 1000 km, 54% of the proxies passed their test (111). With a reduced radius of 500 km, the yield fell to 42% (85 proxies). The acceptance rate for corals was about 80% and for other proxies was about 50% (slightly lower for ice cores).
Among the “long” proxies (ones that start earlier than 1025 and thus cover most of the MWP), 9 of 12 ice core proxies were rejected, including isotope records from Siple Dome, Berkner Island and EDML (Dronning Maud Land). The only “new” ice core record to pass was a still-unpublished Law Dome Na series (while the Na series from Siple Dome and EDML did not “pass”).
Of the 5 long tree ring series, only Mt Read (Tasmania) and Oroko Swamp (NZ) “passed”. These are not new series: Mt Read has been used since Mann et al 1999 and Jones et al 1998, while Oroko was considered in Mann and Jones 2003. Both were illustrated in AR4. Mann and Jones 2003 had rejected Oroko as not passing local correlation, but it “passes” Neukom and Gergis with flying colors. (The Oroko version needs to be parsed, because at least one version spliced instrumental data due to recent logging disturbance.) Not passing were three South American series, including Rio Alerce, a series used in Mann et al 1998-99.
None of the “documentary” series cover the medieval period, but calibration of these series is idiosyncratic to say the least. Nearly all of them are direct measures of precipitation. SI Table 4 shows these series ending in the late 20th century, but a footnote to the table says that the 20th century portion of nine of the 19 series is projected, citing earlier publications of the same authors for the projection method.

The documentary record ends in the 19th or early 20th century and was extended to present using “pseudo documentaries” (see Neukom et al. 2009 and Neukom et al. 2013)

They “explain” their extrapolation as follows:

Some documentary records did not originally cover the 20th century (Supplementary Table 4). In order to be able to calibrate them, we extend them to present using the “pseudo documentary” approach described by Neukom et al. (2009; 2013). In this approach, the representative instrumental data for each record are degraded with white noise and then classified into the index categories of the documentary record in order to realistically mimic its statistical properties and not overweight the record in the multiproxy calibration process. The amount of noise to be added is determined based on the overlap correlations with the instrumental data. In order to avoid potential biases by using only one iteration of noise degrading, we create 1,000 “pseudo documentaries” for each record and randomly sample one realization for each ensemble member (see below, Section 2.2). All documentary records are listed in Supplementary Table 4.

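As a sketch of what such an extension might look like (the noise calibration and quantile-based categories here are my assumptions; the cited papers describe the actual method):

```python
import numpy as np

rng = np.random.default_rng(2)

def pseudo_documentary(instrumental, r_target, n_categories=5, rng=rng):
    """Degrade an instrumental series with white noise so that its correlation
    with the original is roughly r_target, then discretize into documentary-style
    index categories. A sketch only; see Neukom et al. (2009, 2013) for the
    actual procedure."""
    sigma = instrumental.std() * np.sqrt(1 / r_target**2 - 1)  # noise level for target r
    degraded = instrumental + sigma * rng.standard_normal(instrumental.size)
    # Quantile-based categories (my assumption), e.g. 1 = very dry ... 5 = very wet
    edges = np.quantile(degraded, np.linspace(0, 1, n_categories + 1)[1:-1])
    return np.digitize(degraded, edges) + 1

precip = rng.standard_normal(80)            # stand-in 1911-1990 instrumental series
pseudo = pseudo_documentary(precip, r_target=0.5)
print(np.corrcoef(precip, pseudo)[0, 1])    # near 0.5, less discretization loss
```

Whatever the details, the degraded series is manufactured from the instrumental data itself, which presumably helps explain the strong “local” correlations reported for the extended records.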
For Tucuman precipitation, one of the documentary records so extended, they report a “local” correlation (under the idiosyncratic methods of the article) of 0.43 and a correlation to SH temperature of 0.37 – a higher correlation than for all but one of the documentary indices with actual 20th century data.
Comparison to PAGES2K
The present dataset is closely related to the data used for the South American and Australasian regional PAGES2K reconstructions used in IPCC AR5. I previously discussed the PAGES2K South American reconstruction here, pointing out that it had used the Quelccaya O18 and accumulation data upside down relative to the orientation employed by specialists and to Thompson’s own reports. I also discussed Neukom’s South American network in the context of the AR5 First Draft here.
Conclusion
Neukom et al 2014 is non-compliant with Wagenmakers’ anti-torture protocol on several important counts, including its unprecedented ex post screening and its reliance on the same proxies that have been used in multiple previous studies.
I have some work on SH proxies in inventory, some of which touches on both Neukom and Gergis 2014 and PAGES2K, and will try to write up some posts on the topic from time to time.
