Looking Into Climate Science

One way to approach the controversies in climate science is to get the raw data and analyze it in a way that one trusts, in a way that makes sense, if there are suspicions about the methods used. Even better would be to collect the raw data oneself, but that's not possible.

The conclusion I reach with this exercise is that GISS's (Goddard Institute for Space Science at NASA) analytical methods for processing weather station data don't introduce a significant amount of bias at least where the data available is concerned. The other issue is the heat island effect. This analysis shows that it exists and was not accounted for by the GISS analysis; there's no evidence that it was subtracted from the grand mean of global temperatures. This becomes important when attempting to extrapolate available weather station data onto the rest of earth's surface. Not removing the heat island effect from data extrapolated out to the rest of the globe would tend to make the earth look too hot.

As Steve McIntyre has shown, Jim Hansen's methods [at GISS] for processing station temperatures into global temperatures don't really make sense.

The burning question for me therefore has been: If GISSs methods don't make sense and perhaps introduce bias, then what do we get when proper methods are used? What, exactly, is the truth of the matter?

Steve pointed out that the job of processing this data and eliminating the effects of various factors to come up with an estimation of the temperature anomalies over time could be done using well understood mixed effects statistical modeling algorithms. I thought it rather astounding that a statistical method seemingly completely different from Hansen's methods would come up with much the same results. Steve used this in modeling the combination of location data into grid cells, but why not apply it to the whole data set?

And so I tried that.

giss.dset0, you may recall is temperature data taken from different weather stations. In some cases these stations are all together in one location, as in 4 or 5 stations in a given city area. Hansen first combined these stations together for a location, then he did a lot of complicated smoothing and adjustment of the data in a series of steps. dset0 is raw data, dset1 includes some adjustments and combination of data of stations at given locations, dset2 includes more adjustments within a given location, and gridding of the data does more of the same, smoothing data over geometrical areas, between locations. With all of the adjusting and smoothing going on the introduction of some sort of bias seems likely. For example, Hansen always used the longest records to start with and then increased or decreased the shorter records to match the longer. But aren't the longer records usually from urban stations? Would this not cement the heat island effect into the combined location data?

And so, working in the R Project statistical package and following McIntyre's lead, I first reorganized the data in giss.dset0 into a simple long data frame with location information folded in from giss.info.

extr = function(giss.dset0, giss.info) {
year = 0; temp = 0; alt=0; ur=0; loc=0; lxs=0; sta=0
Ni = (1:length(giss.dset0))[!is.na(giss.dset0)]
for(i in Ni) #if(sum(is.na(giss.dset0[])) > 0) next else
for(j in 1:length(giss.dset0[])) if(is.na(giss.dset0[][[j]])) next else
{
N = length(giss.dset0[][[j]][[1]])
sta = c(sta, rep(names(giss.dset0[])[j],N))
year = c(year, giss.dset0[][[j]][[1]])
temp = c(temp, giss.dset0[][[j]][[18]])
ur = c(ur, rep(giss.info$urban,N))
alt = c(alt, rep(250*(giss.info$alt.interp%/%250),N))
tmp=substr(names(giss.dset0[])[j],1,8)
loc = c(loc, rep(tmp, N))
lxs = c(lxs, rep(paste(tmp, "i", names(giss.dset0[])[j],sep=""),N))
}
sta=sta[-1]; year=year[-1]; temp=temp[-1]; alt=alt[-1]; loc=loc[-1]
ur = ur[-1]; lxs=lxs[-1]
return(data.frame(loc=loc,sta=sta,year=year,ur=ur,alt=alt,lxs=lxs,temp=temp))
}

dset0.t = extr(giss.dset0, giss.info)

(No, I could not use monthly data without running out of memory.)

Then calculate random effects:

(load the lme4 package)

dset0.fm = lmer(temp~(1|loc)+(1|sta)+(1|ur)+(1|alt)+(1|year), data=dset0.t)
dset0.fm1 = lmer(temp~(1|sta)+(1|year), data=dset0.t)

(loc = location id, sta=station id with location, ur=urban vs suburban vs rural, alt= location altitude in 250 meter bands)

ranef(dset0.fm)$year and ranef(dset0.fm1)$year contain the random effects year over year. The temperature anomaly for each year, in other words. Most of the variance is dumped into other groups. Using more groupings in the model formula makes no difference in these estimates (dset0.fm vs dset0.fm1 R-squared = 0.999).

Here is your heat island effect:

> ranef(dset0.fm)$ur
(Intercept)
R -0.23716400 (rural)
S -0.04312743 (suburban)
U 0.28028908 (urban)

And here is the effect of altitude:

> ranef(dset0.fm)$alt
(Intercept)
0 10.0821297
250 9.5952519
500 8.2160810
750 7.4049196
1000 6.4774994
1250 5.4415439
1500 4.2931860
1750 3.4352673
2000 1.7974976
2250 -0.5588156
2500 -1.6522236
2750 -5.5811832
3000 -5.2383055
3250 -13.6373919
3500 -8.7024097
3750 -13.8671371
4000 1.2833311
4250 -6.0728011
4500 -3.3266110
4750 0.8495323
5000 -1.0044707
5750 -0.4843320

What that yearly random effect looks like compared to the GISSTEMP book numbers:

View attachment 67134866

And this is the difference between the two:

View attachment 67134867

R-squared = 0.73. Close enough for government work. The slope coefficients are nearly identical (a 0.00035 per year difference). And I'm sure that a statistician would do much better. Much of what Hansen did in step 1 and 2 and in gridding the data was smoothing. Application of a smoothing algorithm would no doubt take out the variation in the random effects estimation I got, but that would not improve the fidelity of the results.

dset0 annual averages are as close to the raw data as I can get without running out of computer memory. I went from that data to a final result with one command in R.

If I can come up with the same results with no spacial factors at all other than altitude and station grouping then does gridding mean anything? Can the analysis represent anything more than the 25% of the earth's surface the stations occupy?

This shows that GISS methods, regardless how byzantine, introduce no significant amount of bias where temperature trends are concerned with regard to the station data available.

The reason that the GISS data set runs so much hotter that others may be in the effort to extrapolate station data to areas that don't have stations, that is to say, the other 75% of the earth's surface. This would be using a data set that is perhaps dominated by weather stations in urban areas with a significant heat island effect.

That lack of coverage of the earth's surface in station data, I'm guessing, is the reason for the difference in satellite, radiosonde, and station estimates of global temperature.

The amount of memory and processing power I have in my clearance priced desktop PC would have been duplicated by a computer the size of a city block, if at all, back in the days Hansen started doing this climate business. His multi step process of analyzing station data was perhaps an effort to perform an analysis that would have been impossible using the statistical methods I used here.

Update

Including another grouping factor, latitude in 5 degree bands, has no effect on the temperature anomalies, but it's interesting to look at the random effect for that group:

dset0.fm2 = lmer(temp~(1|ur)+(1|lat)+(1|alt)+(1|sta)+(1|year),
data=dset0.t)

> ranef(dset0.fm2)$lat
(Intercept)
-90 -48.0765255
-85 -30.6208086
-80 -30.5843329
-75 -26.8087508
-70 -17.5837928
-65 -10.6445832
-55 -0.4677458
-50 1.9302840
-45 4.6332613
-40 7.5261542
-35 10.1896612
-30 7.2221295
-25 16.5283583
-20 17.4532507
-15 18.3289638
-10 18.4701555
-5 18.5784181
0 18.7600821
5 19.5108772
10 19.8686941
15 19.7051168
20 17.5278219
25 15.0410194
30 11.2047241
35 7.3644670
40 3.2930061
45 -0.4509212
50 -0.3098546
55 -5.3064547
60 -9.0101544
65 -13.3917199
70 -16.3916299
75 -20.7265894
80 -22.7631131

And including that group gives more consistent altitude and urban estimates:

> ranef(dset0.fm2)$alt
(Intercept)
-1000 7.6009074
0 8.7884377
250 8.2316191
500 5.1136268
750 9.5794809
1000 5.9054092
1250 5.3426933
1500 4.0768528
1750 3.4410112
2000 2.0043036
2250 1.0247369
2500 -0.5251244
2750 -1.8542720
3000 -0.6135464
3250 -8.0035847
3500 -3.3222669
3750 -5.1468110
4000 -5.1510483
4250 -8.3572983
4500 -8.9318326
4750 -5.5925205
5000 -5.9627467
5750 -7.6483137

> ranef(dset0.fm2)$ur
(Intercept)
R -0.89551186
S -0.03941906
U 0.93493366

Click to expand...

Looking Into Climate Science

LiberalAvenger