Data Coverage in Cowtan and Way

As I was reading section 3 (Global temperature reconstruction) of the Cowtan and Way paper, I came across this text:

The HadCRUT4 map series was therefore renormalised to match the UAH baseline period of 1981-2010. For each map cell and each month of the year, the mean value of that cell during the baseline period was determined. If at least 15 of the 30 possible observations were present, an offset was applied to every value for that cell and month to bring the mean to zero; otherwise the cell was marked as unobserved for the whole period.
Renormalization is not a neutral step – coverage is very slightly reduced, however the impact of changes in coverage over recent periods is also reduced. Coverage of the renormalized HadCRUT4 map series is reduced by about 2%.

This raised the question of the general overall coverage of the hadcrut dataset used in the study as well as the effect of any changes in that coverage due to the methodology of the analysis in that paper. To address these issues, the data available on the paper’s website was downloaded and analysed using the R statistical package. The specific data used was contained in the file HadCRUT.4.1.1.0.median.txt which can be downloaded as part of the full data used in the paper.

The data was read into R and the reduced to cover the time period from January, 1979 to December, 2012. It consisted of monthly temperature grid values on a 5 x 5 degree grid producing 72 x 36 = 2592 possible values each month. The data was re-anomalised using the time interval and methodology described in the quoted text above. A monthly count of non-empty cells before and after anomalisation was calculated and the results plotted below (showing the percentage of coverage).

Several things are immediately evident from the graph. First, there seems to be a substantial drop in coverage during the last two years. This could possibly be due to late reports of temperatures in the data although that explanation might be questionable because the amount of the decrease is fairly large. The second observation is that there seems to be a large component of seasonality in the coverage with coverage lower in the northern hemisphere summer than in the winter.
The cells lost as a percentage of all possible cell values (missing or not) is 2.32%, approximately the value given in the paper. However, the percentage of cells containing temperatures (representing the real reduction of information) lost through anomalisation is 3.34%. The monthly distribution of the losses can be seen in the following graph. The change in the pattern shortly before 2005 could be related to Steve’s observations in his earlier post on CA.

Which grid cells are losing the information?

It is quite obvious that the losses occurred in those areas which were already short of temperature information thereby exacerbating the problems inherent in trying to create a reliable reconstruction of the gridded temperatures in the polar regions. I would suggest that the paper’s observation that “coverage is very slightly reduced” understates the impact of these reductions by ignoring the geography of where they occur.
Could these losses in information be avoided? If the analysis is done using anomalised data then probably not. The alternative of changing the satellite data to match the HadCrut set is not possible because the lack of overlap of the anomaly time periods. However, consideration should be given to the use of absolute temperatures rather than anomalies. This would require that the HadCrut data be in that form as well as the UAH temperature set. The resulting reconstruction would also be in absolute temperatures whose properties could be analyzed without difficulty.
I have some other questions about the paper (e.g. is the covariance structure of the temperature field dependent on season) but that can be looked at some another time.
Please keep comments civil and on topic. Any personal attacks will likely be snipped or deleted.
The R script for the analysis is given below:
### Libraries needed for plotting
library(fields)
library(maps)
### Function to read data
# Writes a temporary file in the working directory
read.tdat = function(tfile,outfile = "readtemp.txt",form = c(1, 72, 36), na.str = "-1.000e+30") {
alldat = readLines(tfile)
tlen = length(alldat)
ntime = tlen/(form[1]+form[3])
lines1 = seq (1,tlen,form[1]+form[3])
if (form[1] > 1) lines1 = c(t(outer(lines1,0:(form[1]-1),"+")))
subdat = alldat[-lines1]
write(subdat,file ="temp.txt",ncolumns=1)
tdat = read.table("temp.txt",header = F, na.strings = na.str)
tdat = as.matrix(tdat)
out.arr = array(NA,dim = c(form[2],form[3],ntime))
# reshape matrix
nr = nrow(tdat)
out.arr = array(NA, dim = c(72,36,ntime))
st.seq = seq(1,nr,form[3])
en.seq = seq(form[3],nr,form[3])
for(ind in 1:ntime) out.arr[,,ind] = t(tdat[st.seq[ind]:en.seq[ind],])
out.arr}
### Function to Re-anomalize
# crit is the minimum number of obsrvations needed
# anomaly period is st.time to en.time
re.anom = function(tempdat,dat.init =c(1979,1),st.time = c(1981,1),en.time = c(2010,12),crit = 15){
ddim = dim(tempdat)
outmat = array(NA,ddim)
for(ind1 in 1:ddim[1]) {
for (ind2 in 1:ddim[2]) {
dats = ts(tempdat[ind1,ind2,],start = dat.init,freq=12)
subdats = window(dats,start = st.time,end = en.time)
xmonths = tapply(subdats, cycle(subdats),function(x) sum(!is.na(x)) < crit )
monmeans = tapply(subdats, cycle(subdats), function(x) mean(x,na.rm = T))
monmeans[xmonths] = NA
outmat[ind1,ind2,] = c(dats - rep(monmeans, length.out = ddim[3])) }
}
outmat}
###Read HadCrut
# Uses file from C-W web site
hadcrut41.all = read.tdat("HadCRUT.4.1.1.0.median.txt",na.str = "-99.99")
# dim(hadcrut41.all) # 72 36 1956
### Select years 1979 to 2012 incl.
hc41 = hadcrut41.all[,,(1956 - 407):1956]
# dim(hc41) # 72 36 408
###Reanomalize
hadcrut = re.anom(hc41)
### Monthly coverage
time = seq(1979,2013-1/12,1/12)
cover = apply(hadcrut,3,function(x) sum(!is.na(x)))
cover2 = apply(hc41,3,function(x) sum(!is.na(x)))
matplot(time, cbind(cover2,cover)/25.92,type="l", main = "Monthly Coverage of HadCrut 4.1.1.0",
xlab = "Year", ylab = "Percent of Grid Cells",lty =1)
legend("bottomleft",legend=c("Original", "Re-anomalised"),lwd = 2, col = 1:2)
### Difference in coverage
100*(sum(!is.na(hc41)) - sum(!is.na(hadcrut)))/sum(!is.na(hc41)) # 3.340224
100*(sum(!is.na(hc41)) - sum(!is.na(hadcrut)))/(408*36*72) # 2.32238
plot(time, cover2-cover, type="l", main = "Monthly Cells Lost Due to Anomalisation",
xlab = "Year", ylab = "Grid Cell Count")
### Grid pattern for losses
covergrx = apply(hc41,c(1,2),function(x) sum(!is.na(x)))
covergr = apply(hadcrut,c(1,2),function(x) sum(!is.na(x)))
plot.list = list(x = seq(-177.5,177.5,5),y = seq(-87.5,87.5,5), z = covergrx[,36:1] - covergr[,36:1])
image.plot(plot.list,col = two.colors(n=64, start="white", middle = "red", end="black",alpha=1.0),
main = "Frequency of Grid Cell Data Loss")
abline(h = 90,v = -180)
map("world",fill=F,add=T,col="darkblue")

Source