for peer review - tulane universityeconomic/seminars/vogelsang_climate.pdf · the presence of...

64
For Peer Review HAC-ROBUST TREND COMPARISONS AMONG CLIMATE SERIES WITH P OSSIBLE L EVEL S HIFTS Ross McKitrick* Department of Economics University of Guelph Guelph ON Canada N1G 2W1 [email protected] 519-824-4120 x52532 Timothy J. Vogelsang Department of Economics Michigan State University [email protected] June 20, 2013 *Corresponding author Abstract: Comparisons of trends across climatic data sets are complicated by the presence of serial correlation and possible step-changes in the mean. We build on heteroskedasticity and autocorrelation (HAC) robust methods, specifically the Vogelsang-Franses (VF) nonparametric variance estimator, to allow for a step-change in the mean at a known or unknown date. The VF method provides a powerful multivariate trend estimator robust to unknown serial correlation up to but not including unit roots. We show that the critical values change when the mean shift occurs at a known or unknown date. We derive an asymptotic approximation that can be used to simulate critical values, and we outline a simple bootstrap procedure that generates valid critical values and p-values. Our application builds on the literature comparing simulated and observed trends in the tropical lower- and mid- troposphere since 1958. The method identifies a shift in observations relative to models in 1977, coinciding with the Pacific Climate Shift. Allowing for a level shift causes apparently significant observed trends to become statistically insignificant. Model over-estimation of warming is significant whether or not we account for a level shift, although null rejections are much stronger when the level shift is included. Keywords: Autocorrelation; trend estimation; level shifts; HAC methods; climate models Page 1 of 64 John Wiley & Sons Environmetrics

Upload: duongquynh

Post on 06-Jul-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

For Peer Review

HAC-ROBUST TREND COMPARISONS AMONG

CLIMATE SERIES WITH POSSIBLE LEVEL SHIFTS

Ross McKitrick* Department of Economics

University of Guelph Guelph ON Canada N1G 2W1

[email protected] 519-824-4120 x52532

Timothy J. Vogelsang

Department of Economics Michigan State University

[email protected]

June 20, 2013 *Corresponding author

Abstract: Comparisons of trends across climatic data sets are complicated by the presence of serial correlation and possible step-changes in the mean. We build on heteroskedasticity and autocorrelation (HAC) robust methods, specifically the Vogelsang-Franses (VF) nonparametric variance estimator, to allow for a step-change in the mean at a known or unknown date. The VF method provides a powerful multivariate trend estimator robust to unknown serial correlation up to but not including unit roots. We show that the critical values change when the mean shift occurs at a known or unknown date. We derive an asymptotic approximation that can be used to simulate critical values, and we outline a simple bootstrap procedure that generates valid critical values and p-values. Our application builds on the literature comparing simulated and observed trends in the tropical lower- and mid-troposphere since 1958. The method identifies a shift in observations relative to models in 1977, coinciding with the Pacific Climate Shift. Allowing for a level shift causes apparently significant observed trends to become statistically insignificant. Model over-estimation of warming is significant whether or not we account for a level shift, although null rejections are much stronger when the level shift is included.

Keywords: Autocorrelation; trend estimation; level shifts; HAC methods; climate

models

Page 1 of 64

John Wiley & Sons

Environmetrics

For Peer Review

2

1 INTRODUCTION Many empirical applications involve comparisons of linear trend magnitudes across

different time series with autocorrelation and/or heteroskedasticity of unknown form.

Vogelsang and Franses (2005, herein VF) derived a class of heteroskedasticity and

autocorrelation robust (HAC) tests for this purpose. The VF statistic is similar in form to the

familiar regression F-type statistics but remains valid under serial dependence up to but

not including unit roots in the time series. For treatments of the theory behind HAC

estimation and inference see Andrews (1991), Kiefer and Vogelsang (2005), Newey and

West (1987), Sun, Phillips and Jin (2008) and White and Domowitz (1984) among others.

Like many HAC approaches, the VF approach is nonparametric with respect to the serial

dependence structure and does not require a specific model of serial correlation or

heteroskedasticity to be implemented. Unlike most nonparametric approaches, the VF

approach avoids sensitivity to bandwidth selection by setting the bandwidth equal to the

entire sample.

Here we extend the VF approach to the case in which one or more of the series has a

possible step change. Our assumption throughout is that a researcher considers a one-time

step-change as a fundamentally different process than a continuous trend. Consequently, if

the null hypothesis is that two series have the same trend, and one series exhibits a trend

while the other exhibits a step and no trend, a rejection of the null would be considered

valid since the two phenomena are distinct and a prediction of one is not confirmed by

observing the other. If the researcher is considering a case in which an underlying

phenomenon might equally manifest itself as a one-time step-change or as a gradual trend,

controlling for a step-change is not necessary and the VF statistic would be appropriate.

Page 2 of 64

John Wiley & Sons

Environmetrics

For Peer Review

3

Accounting for step changes does not necessarily increase the likelihood of rejecting a

null of trend equivalence. In the top panel of Figure 1, a comparison of the linear trend

coefficients would suggest they are similar, but clearly y1 differs from y2 in that the former

is steadily trending while the latter is trendless with a single discrete step at the break

point Tb. By contrast, in the bottom panel a failure to account for the shift would overstate

the difference between the trend slopes. In each case, the influence of the shift term is

highlighted by the fact that if the trend slope comparisons were conducted over the pre-

shift or post-shift intervals, they might indicate opposite results to those based on the

entire sample (with the shift term omitted).

The basic linear trend model is written

, (1)

where i=1,…,n denotes a particular time series and t = 1,…,T denotes the time period.

The random part of yit is given by uit which is assumed to be covariance-stationary (in

which case yit is labeled a trend stationary series, that is, stationary around a linear time

trend, if one is present). For a series of length T, we parameterize the break point by

denoting the fraction of the sample occurring before it as .

The following issues must be addressed in order to derive an HAC-robust trend

comparison in the presence of possible mean shifts. (i) If is known, and specifically is

known to be in the (0,1) interval, the VF test score can be generalized, as we show in

Section 3, but the distribution is shown to depend on and the critical values change. It will

turn out that the form of the test and its critical values are the same whether one is testing

hypotheses involving the trend coefficients or the shift terms. (ii) If is unknown it must be

Page 3 of 64

John Wiley & Sons

Environmetrics

For Peer Review

4

estimated along with the magnitude of the associated shift term. But this gives rise to a

problem of non-identification under the null. The regression model takes the form

( ) (2)

where the dummy variable ( ) if and 1 otherwise (we will typically

suppress the term where it is convenient to do so). Hence, for series i, estimation of (2) by

OLS yields an estimated intercept of up to and thereafter. The null hypothesis

is ( ), i.e. there is no change point within the sample, but this implies cannot be

identified under the null, yet as noted above, the distribution under the alternative depends

on . Davies (1987), Andrews (1993), Andrews and Ploberger (1994), and Hansen (1996)

among others discuss inference approaches when a parameter is only identified under the

alternative. In general the solution involves use of a supremum function, which is akin to a

data-mining approach. In the present case the sup-VF score with allowed to vary across

(0,1) can be shown to correspond to the value to which the VF score converges in

distribution under the assumption that there is no mean shift in the data (see Section 4).

Since sup scores are known to lose considerable power if the sample end points are

included it is customary to restrict to vary only over a central interval excluding end

portions of the data. Herein we will exclude ten percent on either end.

Our analysis thus deals with both the detection of level shifts and construction of test

statistics for comparison of deterministic trend terms. A potential application related to the

first issue would be homogenization of weather data. Many long observational records are

believed to have been affected by possible equipment and/or sampling changes, changes to

the area around monitoring locations and so forth (see Hansen et al. 1999, Brohan et al.

Page 4 of 64

John Wiley & Sons

Environmetrics

For Peer Review

5

2006 for examples in the land record; Folland and Parker 1995, Thompson et al. 2008 for

examples in sea surface data). A typical method for detecting and removing level shifts is to

construct a reference series which is not expected to exhibit the discontinuity, such as the

mean of other weather station records in the vicinity, and then look for one or more jumps

in a record relative to its reference series. The question of whether a jump is real or not can

strongly affect the resulting trend calculations. If a change point is known, the analysis in

Section 3 applies, and if a change point is suspected but the date is unknown, the analysis in

Section 4 applies. However, a limitation of our framework is that we can only consider at

most one change point, whereas in typical applications long weather series may be

suspected of having several such breaks. If multiple breaks may occur at unknown dates

this greatly complicates the analysis from a computational perspective and is beyond the

scope of this paper. In addition, if one thinks level shifts occur frequently and with

randomness, then there is the additional difficulty that the range of possible specifications

could, in principle, include the case in which the level changes by a random amount at each

time period, which is equivalent to having a random walk, or unit root component in uit. If

yit has a unit root component, inference in models (1) and (2) becomes more complicated.

More importantly, it is difficult to give a physical interpretation to a unit root component of

a temperature series. See Mills (2010) for a discussion of temperature trend estimation

when a random walk is a possible element of the specification.

Hence we consider an application in which it is reasonable to assume that the observed

series are well characterized by a trend and at most one level shift (at a known date), and

that the errors are covariance stationary. We focus on the prediction of climate warming in

Page 5 of 64

John Wiley & Sons

Environmetrics

For Peer Review

6

the troposphere over the tropics. As shown in Section 6, climate models predict a steady

warming trend in this region due to rising atmospheric greenhouse gas levels, but none

predict a step-change, so trends and shifts can be regarded as distinct phenomena. A

number of studies (summarized below) have shown that models likely overstate the

warming trend, but there is disagreement as to whether the bias is statistically significant.

McKitrick et al. (2010) used the VF test to examine this issue over the 1979-2009 interval,

coinciding with the record available from weather satellites. We extend their analysis to the

1958-2010 interval using data from weather balloons. This long span encompasses a date

at which a known climatic event caused a level shift in many observed temperature series.

If the shift is statistically significant the comparison would thus be akin to that in Figure 1,

such that failure to take it into account could bias the comparison either towards over-

stating or understating the difference.

The event in question occurred around 1978 and is called the Pacific Climate Shift

(PCS). This manifested itself as an oceanic circulatory system change during which basin-

wide wind stress and sea surface temperature anomaly patterns reversed, causing an

abrupt step-like change in many weather observations, including in the troposphere, as

well as in other indicators such as fisheries catch records (see Seidel and Lanzante 2004,

Tsonis et al. 2007, Powell Jr. and Xu 2011, and extensive references therein). For our

purposes we do not try to present a specific physical explanation of the PCS or even

evidence that its origin was exogenous to the climate system, only that it was a large event

at an approximately known date, the existence of which has been documented and studied

extensively and which resulted in a shift in the mean of the temperature data. We first

Page 6 of 64

John Wiley & Sons

Environmetrics

For Peer Review

7

present results based on assumption that the PCS occurred at a known date (Section 6.2)

and then based on the assumption that the PCS is not known to have occurred or that the

date of occurrence is unknown (Section 6.3).

We find that the question of whether a level shift exists is sensitive to the reference

series against which it is tested. When tested against a zero-trend alternative, the

observations do not yield significant evidence of a shift, or the test lacks sufficient power to

detect such a shift if one is present. But when tested against model-generated predictions,

the shift term is highly significant in all cases. This suggests that change point detection

algorithms in temperature homogenization may be sensitive to the choice of reference

series. This issue actually applies to the choice of observed weather balloon data, as we will

discuss in Section 4.

If the date of the PCS is taken as given and exogenous, we find that the models project

significantly more warming in both the lower- and mid-troposphere than are found in

weather balloon records over the interval. This finding remains robust if we treat the date

of the PCS as unknown and apply the conservative data-mining approach. In fact it is robust

whether or not we include a level shift in the regression model: we reject equivalence of

the trend slopes between the observed and model-generated temperature series either

way. The evidence against equivalence is simply stronger when we control for a level shift

and this is true whether we treat the date of the shift to be known or unknown.

We also find that if the date of the PCS is assumed to be known then: a) the appearance

of positive and significant trend slopes in the individual observed temperature series

vanishes once we control for the effect of the level shift, and b) we find statistical evidence

Page 7 of 64

John Wiley & Sons

Environmetrics

For Peer Review

8

for a level shift in the observed temperature series in the mid-troposhere but not in the

lower-troposphere series. If the date of the PCS is assumed to be unknown, statistical

evidence for a level shift in the observed mid-troposphere series disappears, but this is not

surprising given that we use the data-mining robust critical value which decreases the

power of detecting such a shift.

2 BASIC SET-UP WITH NO BREAK OR KNOWN BREAK DATE

2.1 TREND MODELS The literature on estimation and inference in model (1) is by now well established, and

it may hardly seem possible that there is something new to be said on the subject. In fact

the last decade or so has seen some very useful methodological innovations for the purpose

of computing robust confidence intervals, trend significance and trend comparisons in the

presence of autocorrelation of unknown form. Many of these robust estimators use the

nonparametric HAC approach which is now widely used in econometrics and empirical

finance literatures. In contrast the nonparametric HAC approach is used less in applied

climatic or geophysical papers although nonparametric approaches have been proposed by

Bloomfield and Nychka (1992) and further examined by Woodward and Gray (1993) and

Fomby and Vogelsang (2002) for the univariate case. As far as we know, McKitrick et al.

(2010) is the first empirical climate paper to apply nonparametric HAC methods in

multivariate settings.

It will be convenient to define a general deterministic trend model which contains (1)

and (2) as special cases:

Page 8 of 64

John Wiley & Sons

Environmetrics

For Peer Review

9

(3)

where is a single deterministic regressor (typically the time trend in our applications)

and is a vector of additional deterministic regressors (typically the intercept and

shift terms). Model (1) is thus obtained for , , and , , and model

(2) is obtained for , , and ( ) , ( ) . Notice that we are

assuming that each time series i has the same deterministic regressors. This is needed for

the VF statistic to be robust to unknown conditional heteroskedasticity and serial

correlation. In some applications it might be reasonable to model some of the series as

having different trend functions. For example, we know that the climate model series in the

application do not have level shifts because level shifts are not part of the climate model

structures. When we think series could have different functional forms for the trend, we

can simply include in the union of trend regressors across all the series. While this will

result in a loss of degrees of freedom, in many applications the regressors will be similar

across series, so the loss in degrees of freedom will often be small. We view this loss of

degrees of freedom as a small price to pay for robustness to unknown forms of conditional

heteroskedasticity and autocorrelation.

We estimate model (3) using OLS equation by equation. This approach has some nice

properties in our set up. Because the regressors are the same for each equation, we have

the well-known exact equivalence between OLS and generalized least squares (GLS)

estimators that account for cross series correlation. Because we have covariance stationary

errors, the well-known Grenander and Rosenblatt (1957) result applies in which case OLS

Page 9 of 64

John Wiley & Sons

Environmetrics

For Peer Review

10

is also asymptotically equivalent to GLS estimators that account for serial dependence in

the data.

Defining the matrix ( ), model (3) can be written in vector

notation as

. (4)

The parameters of interest are in the vector so it is convenient to express the OLS

estimator using the “partialling out” approach, aka the Frisch-Waugh result (see Davidson

and MacKinnon, 2004 and Wooldridge, 2005) as follows. Let and denote,

respectively the OLS residuals from the regression of and on . The OLS estimator

of can be written as

(∑

)

∑ (5)

and it directly follows that

(∑

)

∑ . (6)

Note that the form of in equation (5) would be unchanged if we redefined to be

the shift term and to be the intercept and trend terms, however the definition of the ~

variables would be adjusted accordingly. This implies that the test statistic on hypotheses

about the shift term will take the same form as those for the trend terms.

2.2 THE VF TEST We are interested in testing null hypotheses of the form

(7)

Page 10 of 64

John Wiley & Sons

Environmetrics

For Peer Review

11

against alternatives where R and r are known restriction matrices of dimension

q x n and q x 1 respectively where q denotes the number of restrictions being tested. The

matrix R is assumed to have full row rank. Robust tests of need to account for

correlation across time, correlation across series, and conditional heteroskedasticity as

summarized by the long run variance of ut. This is defined as

∑ ( )

where ( ) is the matrix autocovariance function of ut. Those familiar with the

time series literature will notice that is proportional to the spectral density matrix of ut

evaluated at frequency zero.

The VF statistic is constructed using the following estimator of :

∑ (

)(

) , (8)

where ∑

. This is the Bartlett kernel nonparametric estimator of

using a bandwidth (truncation lag) equal to the sample size. The VF statistic for testing

is given by

( ) [(∑

)

]

( ) . (9)

In Appendix I we provide a finite sample motivation for the form of . Also note that (8)

was originally proposed by Kiefer, Vogelsang and Bunzel (2000, 2001) although in the

different but computationally identical form:

, (10)

Page 11 of 64

John Wiley & Sons

Environmetrics

For Peer Review

12

where ∑ . See Kiefer and Vogelsang (2002) for a formal derivation of the exact

equivalence between (8) and (10).

2.3 ASYMPTOTIC LIMIT OF THE VF STATISTIC We now provide sufficient conditions for obtaining an asymptotic approximation to the

sampling distribution of the VF statistic. A formal proof is given in Appendix 2. The fraction

( ] of the sample is cT and we denote the integer portion of this as [cT]. The symbols

and → denote weak convergence and convergence in distribution, is the matrix square

root of (i.e. ) and ( ) and ( ) denote and vectors of independent

standard Wiener processes.

Two assumptions are sufficient for obtaining the limit of VF. Define the partial sums of

∑ . The first assumption is that a functional central limit theorem (FCLT)

holds for . As T→

[ ] ( ). (11)

Assume that

there is a scalar , and a matrix , such that

∑ → ∫ ( )

[ ] and ∑ → ∫ ( )

[ ] . (12)

For example, in model (2), , ( ) , , [

], ( ) and

( ) ( ( )) where, in this case, DU denotes a continuous indicator function

taking the value 1 if and 0 otherwise.

( ) ( ) ( ) ( ( ) ( ) )

Page 12 of 64

John Wiley & Sons

Environmetrics

For Peer Review

13

( ) ∫ ( ) (∫ ( ) ( )

) (∫ ( ) ( )

)

∫ ( )

, (13)

In Appendix II we show that under assumptions (11) and (12), the limit of VF under the

null hypothesis (7) is given by

[ ∫ ( )

( )

]

, (14)

where ( ) and independent of the random matrix ∫ ( )

( )

The

limit of the VF statistic can therefore be seen to be similar to an F random variable,

however it follows a nonstandard distribution that depends on the deterministic

regressors in the model via the stochastic process ( ). The critical values of

thus

depend on the regressors in and by extension depend on the value of . It is important to

note that the critical values do not depend on which regressors are placed in (the

regressor of interest for hypothesis testing). For a given value of , one uses the same

critical values for testing the equality of trend slopes or testing the equality of intercepts or

testing the equality of level shifts in model (2).

Obtaining the critical values of the nonstandard asymptotic random variable defined by

(14) is straightforward using Monte Carlo simulation methods that are widely used in the

econometrics and statistics literatures. In the application, when we take the date of the

Pacific Climate Shift to be exogenously given at 1977:12, this yields a value of .

For model (2) with we simulated the asymptotic critical values of VF for testing

one restriction ( ) which we tabulate in Table 1a. The Wiener process that appears in

the limiting distribution is approximated by the scaled partial sums of 1,000 i.i.d. N(0,1)

random deviates. The vector ( ) is approximated using ( ( ) ) for

Page 13 of 64

John Wiley & Sons

Environmetrics

For Peer Review

14

t=1,2,…,T. The integrals are approximated by simple averages. 50,000 replications were

used. We see from Table 1a that the right tail of the VF statistic has a fatter tail than a

random variable.

2.4 BOOTSTRAP CRITICAL VALUES AND P-VALUES If carrying out simulations of the asymptotic distributions is not easily accomplished

using standard statistical packages, an alternative is to use a simple bootstrap approach as

follows:

1. For each i take the OLS residuals from (6) and sample with replacement from

to generate a bootstrap series

. Let

denote the

bootstrap resampled series to be used as dependent variables.

2. For each i estimate model (3) using in place of . Let

and denote the OLS

estimators and let

denote the OLS residuals. Let denote the

vector (

) and let denote the vector (

) .

3. Compute using (8) with in place of . The equivalent form given by (10) can also

be used and is faster to compute.

4. Compute VF using

( ) [(∑

)

]

( ) .

5. Repeat Steps 1 through 4 times where is a relatively large integer. This

generates random draws from VF*.

Page 14 of 64

John Wiley & Sons

Environmetrics

For Peer Review

15

6. Sort the values of VF* from smallest to largest and let [ ] [ ] [ ]

indicate the sorted values. The right tail critical value for a test with significance level is

given by [( ) ] where the integer part of ( ) is used if ( ) is not an

integer.

7. Bootstrap p values van be computed using the frequency of VF* values that exceed

the value of VF from the actual data.

Note that, by construction, the true value of is zero. Therefore , i.e. r=0 in the

bootstrap samples and VF* should be computed using r=0 to ensure that the null holds for

VF*.

Those familiar with bootstrap methods will notice that the resampling scheme used in

Step 1 does not reflect the serial correlation within a series or the correlation across series

because an i.i.d. resampling method is being used. Because VF* is based on HAC estimators

and its asymptotic null distributions do not depend on unknown correlation parameters, it

falls within the general framework considered by Gonçalves and Vogelsang (2011), where

it was shown that the simple, or naive, i.i.d. bootstrap will generate valid critical values. No

special methods, such as blocking, are required here. The formal results implied by the

theory of Gonçalves and Vogelsang (2011) are that VF* computed using Step 4 converges in

distribution to the same limits as does VF under the null. Therefore, the bootstrap critical

values are equivalent to the approximation given by (14).

3 TREATING THE BREAK DATE AS UNKNOWN

Page 15 of 64

John Wiley & Sons

Environmetrics

For Peer Review

16

Many previous authors (e.g. Seidel and Lanzante 2004) have treated the break date of

the level shift as known because the PCS was an exogenous event observed across many

different climatic data series. As a robustness check we also report results where we treat

the date of the level shift as unknown. We take a “data-mining” approach which has a long

history in the change point literature. For a given hypothesis, we compute the VF statistic

for a grid of possible break dates and determine the break date that gives the largest VF

statistic. In other words, we search for the break date that gives the strongest evidence

against the null hypothesis. But the effect of searching over break dates must also be

accounted for, otherwise this approach would be a “data-mining” exercise that could give

potentially misleading inference. The level of the test will be inflated above the nominal

level compared to the case where the break date is assumed to be known. Fortunately, it is

easy to obtain critical values that take into account the search over break dates.

For a given potential break date , let ( ) denote the VF statistic for testing a given

null hypothesis. The limiting random variable given by (14) depends on through the level

shift regressor and we now label the limit by ( ) to make explicit the dependence on

the break date used to estimate the model. Suppose we “trim” the fraction v from each end

of the sample, leaving a grid of potential break dates given by (in

our application we set v=0.1). Define the “data-mined” VF statistic as

( ) for ( )

Under the null hypothesis (7) and under the assumption there is no level shift in the data,

we have

Page 16 of 64

John Wiley & Sons

Environmetrics

For Peer Review

17

→ ( )

( ) (15)

where the limit follows from (14) and application of the continuous mapping theorem.

Using simulation methods identical to those used for the known break date case, we

computed asymptotic critical values for supVF for v=0.1 and q=1 for testing hypotheses

about the trend slope parameters in model (2). These critical values are given in Table 1b.

Using the supVF statistic along with the critical values given by (15) provides a very

conservative test with regard to the break date.

4 TESTING FOR A LEVEL SHIFT IN A UNIVARIATE TIME SERIES As part of the empirical application we provide visual evidence that the observed

temperature series exhibit level shifts around the time of the PCS. Some formal statistical

evidence regarding these level shifts can be provided by application of the VF statistic to an

individual time series. Consider model (2) for the case of n=1 and place the model in the

general framework (3) with , ( ) , and ( ) . If we take

the break date as known, then the VF statistic for testing for no level shift ( ) can

be computed as before using (9) with R=1 and r=0. The asymptotic null critical values are

still given by Table 1a.

If we treat the break date as unknown, we can apply the supVF statistic although the

asymptotic critical values depend on which regressor is placed in . While it is true that

for a given value of , the distribution of ( ) is the same regardless of the regressor

placed in , the covariance structure of ( ) across depends on which regressor is

placed in in . Therefore, the supVF statistic for testing zero trend slope has different

Page 17 of 64

John Wiley & Sons

Environmetrics

For Peer Review

18

asymptotic critical values than the supVF statistic for testing a zero level shift. We

simulated the asymptotic critical values of supVF for testing for a zero level shift for the

case of v=0.1 and q=1 and provide those critical values in Table 1b.

Other tests for a level shift at an unknown date of a trending time series have been

proposed in the empirical climate literature. Reeves et al (2007) provides a review of

change point detection methods developed in the climate literature but the review focuses

on tests designed for time series variables that do not have serial correlation. In contrast

Lund et al (2007) propose a test for a level shift that allows a specific form of

autocorrelation, first order periodic autoregressive model. We prefer the VF approach for

two reasons. First, the VF approach is robust to more general forms of autocorrelation.

Second, we formally derive and characterize the limiting null distribution of the sup

statistic and this allows us to tabulate null critical values. Lund et al (2007) also use a sup-

type statistic but they do not provide any asymptotic theory that can be used to generate

valid approximating critical values.

5 FINITE SAMPLE PERFORMANCE OF THE VF STATISTIC

In this section we report some results from a small Monte Carlo simulation study that

demonstrates the finite sample performance of the VF statistic. We compare the

performance of the VF statistic with a traditional Wald statistic. The Wald statistic is

configured to be robust to heteroskedasticity and serial correlation over time and to be

Page 18 of 64

John Wiley & Sons

Environmetrics

For Peer Review

19

robust to correlation across series. We use two established methods for constructing the

Wald statistic. The first method is based on the parametric estimator of given by

∑ (

)(

) , (16)

which is the Bartlett kernel estimator. The value of the bandwidth, M, was chosen by the

data dependent method proposed by Andrews (1991) based on the AR(1) plug-in method.

This data dependent method tends to choose values of M that are small relative to T

although M tends to be larger when serial correlation in the data is strong compared to

cases where serial correlation is weak. The second method also uses (16) but with

prewhitening based on autoregression models with lag 1 (AR(1)) fit to the components of

. Prewhitening was explored by Andrews and Monahan (1992). We set M=1 in the

prewhitening case which makes the estimator of an AR(1) parametric estimator. We

compute Wald statistics based on these two estimators of using (9) with replaced

with either or the prewhitened estimator. The resulting Wald statistics are denoted by

and respectively. Under the assumptions used to derive the limiting null

distribution of the VF statistic, it is well known that and have limiting null

distributions given by where

denotes a chi-square random variable with q degrees

of freedom.

We use the following data generating process:

( ) , (17)

, (18)

Page 19 of 64

John Wiley & Sons

Environmetrics

For Peer Review

20

where

,

√ (

) ,

,

( ), ( ) , and

, √

[( ) ]( )

. The errors, ,

are configured to have unit variances with ( )

√ . When , and

are uncorrelated with each other.

We report two sets of results. The first set of results focus on empirical null rejection

probabilities. For we set and so that there is no level shift in . For

we set so that the null hypothesis of equal trend slopes holds. We set so that

the two series are uncorrelated. We report results for T=120, 240, 636 and a selection of

values of and . In all cases 5,000 replications were used and we computed empirical

rejection probabilities for the , and VF statistics for testing using the

appropriate asymptotic critical values. The simulation results for this configuration

highlight the impact of serial correlation structure on null rejection probabilities relative to

sample the sample size.

The results are tabulated in Table 2a. There are two sets of results reported for each of

the three statistics. The first set of results corresponds to the case where no level shift

dummy variable is included in the estimated model. The second set of results corresponds

to the case where the level shift dummy is included in the estimated model. Results are

organized into three blocks corresponding to the three sample sizes. Within a block results

are given for seven configurations of the autoregressive parameters ranging from no serial

correlation to strong serial correlation.

Page 20 of 64

John Wiley & Sons

Environmetrics

For Peer Review

21

If the asymptotic approximations were working perfectly for the statistics, we would

see rejections of 0.05 in all cases. When the serial correlation is absent, all three statistics

have empirical rejection probabilities close to 0.05 regardless of the sample size. Once

there is serial correlation is the model, over-rejections can occur depending on the strength

of the serial correlation relative to the sample size. First focus on the case of AR(1) errors

( ). Rejections tend to be close to 0.05 when is small but as increases in value,

rejections tend to increase. This is especially true for the statistic where rejections

exceed 0.25 when In contrast, and VF suffer from less severe over-rejection

problems although they tend to be over-sized when T=120 and For a given value

of , as T increases, over-rejections tend to fall for all three statistics but slowest for .

Overall for the AR(1) error case, and VF have similar rejections to each other and

outperform .

One of the reasons that performs relatively well with AR(1) errors is because

is explicitly designed for AR(1) error structures. But, when the errors are not AR(1),

can suffer from over-rejection and under-rejection problems. Consider the cases

and . In these cases shows substantial over-

rejections that are larger than and VF. These over-rejections tend to persist as T

increases. In contrast VF is much less distorted and rejections approach 0.05 as T increases.

For the case of , under-rejects and the under-rejection problem

becomes more severe as T increases whereas VF has rejections close to 0.05 for all sample

sizes. The statistic tends to over-reject mildly in this case.

Page 21 of 64

John Wiley & Sons

Environmetrics

For Peer Review

22

In general, Table 2a indicates that the statistic has the least over-rejection problems

and is the better statistic with regard to control of type 1 error.

In the second set of results we focus T=636 to match the empirical application. We now

include a level shift in with and we set and . For we

set , .0105, .011, .012. We report results for . While we ran simulations for

a wide range of values for and , we only report results for and given

that results for other serial correlation configurations have similar patterns to what is

reported in Table 2a.

The results are given in Table 2b. The first block of 16 rows gives results for

whereas the second block of 16 rows gives results for . Within each block, results

are first given for followed by results for . For each value of , results are

given for followed by results for . When , we are observing null

rejection probabilities whereas for other values of we are observing power.

First focus on the results when the null hypothesis is true, i.e. . For we

see that when the level shift dummy is included, we have rejections close to 0.05 for all

three statistics. However, when the level shift dummy is not included and , we

observe severe over-rejections which range from 0.301 to 0.691. The statistic with the least

severe over-rejection problem is . When , we have relatively strong

autocorrelation in the data. When either or the level shift dummy is included in the

model, there are some mild over-rejection problems ranging from 0.066 for to 0.158 for

Page 22 of 64

John Wiley & Sons

Environmetrics

For Peer Review

23

with in between. Over-rejections are slightly worse when compared to

.

Now focus on the cases where . In these cases has a bigger trend slope

than and we should be rejecting the null of equal trend slopes. When , we see that

all statistics have good power when and power is higher for compared to

. Power increases as expected as increases. Across the three statistics, tends to

have lower power than other two statistics. This illustrates the well known tradeoff

between over-rejection problems and power. If we include the level shift dummy in the

model even though it is not needed ( ), all three tests show a reduction in power as

one would expect.

The most interesting power results occur for and 05 when the level

shift dummy is left out of the model. In this case the estimator of is biased up and one

can show that the probability limit of the estimator of exactly equals 0.0105. For the case

of , rejections of all three statistics are close to the nominal level of 0.050. This shows

that an omitted level shift variable can cripple the power of the tests to detect a difference

in trend slopes between two series. For larger values of the tests have power even if the

level shift variable is not included. When power is higher if the level shift

dummy is included whereas for power is higher when the level shift dummy is

left out.

These simulation results show that: i) the statistic have type 1 errors closest to the

nominal level, ii) the has lower power which is the price paid for more accurate type 1

error, iii) including a level shift dummy when there is no level shift in the data lowers

Page 23 of 64

John Wiley & Sons

Environmetrics

For Peer Review

24

power, iv) failure to include a level shift dummy when there is a level shift in the data

causes type 1 errors to be excessively larger than the nominal level and, depending on the

magnitude/direction of the level shift, can make it difficult to detect slopes that are

different, v) positive correlation across series ( ) tends to increase power and vi)

stronger serial correlation tends to inflate over-rejections under the null while reducing

power.

6 APPLICATION: DATA AND METHODS

6.1 OBSERVATIONS AND MODEL DATA Our empirical application uses data from the tropical lower- and mid-troposphere (LT,

MT respectively), where we will compare trends from a large suite of general circulation

models (GCMs) to those observed in three monthly radiosonde records over the 1958-2010

interval. Held and Soden (2000), Karl et al. (2006) and Thorne et al. (2011) provide

discussions of the importance of this region for assessing climate models. In response to

rising greenhouse gas levels, models predict a maximum warming trend will occur in the

lower- to mid-troposphere over the tropics. Karl et al. (2006, p. 11) noted that weather

balloons had not detected such a trend and deemed the mismatch a “potentially serious

inconsistency” with models. Douglass et al. (2007) argued that the inconsistency was real

and statistically significant, while Santer et al. (2008) countered that the difference was not

significant if autocorrelation was taken into account, though they only used data from 1979

to 1999 and a simple AR1 correction. McKitrick et al. (2010) extended the data to 2009 and

employed the VF approach, concluding the trend differences were significant. Fu et al.

(2011) also found climate models significantly exaggerate the gain in the warming trend

Page 24 of 64

John Wiley & Sons

Environmetrics

For Peer Review

25

with altitude throughout the tropics. Additionally, Bengtsson and Hodges (2011) and Po-

Chedley and Fu (2012) found that even models constrained to match observed post-1979

sea-surface temperatures overestimated warming trends in the topical mid-troposphere.

While the trend discrepancy is now well-established, Santer et al. (2011) emphasize the

need for multidecadal comparisons to identify whether it is structural or temporary. By

using the half century-length weather balloon records we meet this concern, but we also

span the 1977-78 Pacific Climate Shift. All climate models predict a steady trend in

response to rising greenhouse gases, and none predict a large one-time jump preceded and

followed by decades of static temperatures. Hence we satisfy the conditions described in

the introduction, namely that the underlying phenomenon implies a trend as opposed to a

jump, and our null hypothesis of trend equivalence requires controlling for a potential step-

change at a potentially unknown date.

Our application mainly focuses on trend slopes and comparisons of trend slopes across

series in which case we set and therefore . For model (2) ( )

with the level shift set at 1977:12, implying . We also provide some results on

the level shift parameters themselves in which case , and ( ) . Let

denote the OLS estimator of for a given parameter of interest for a given time series

using either model (1) or model (2). If only one restriction is being tested (q=1) in the form

then equation (9) reduces to a t-type statistic which can be inverted to yield the

VF standard error:

( ) √(∑

)

, (19)

Page 25 of 64

John Wiley & Sons

Environmetrics

For Peer Review

26

where is computed using (8) or (10) using from the respective models. Let cv0.025

denote the 2.5% right tail critical value of the asymptotic distribution of VFt. For model (1)

cv0.025 = 6.482 (see Table 1 of Vogelsang and Franses, 2005; their statistic) and for model

(2) cv0.025 = 6.965 (see Table 1a). A 95% confidence interval (CI) is computed as

( ) .

The tropics are defined as 20N to 20S. The GCM runs were compiled for McKitrick et al.

(2010). There were 57 runs from 23 models for each of LT and MT layers. Each model uses

prescribed forcing inputs up to the end of the 20th century climate experiment (20C3M, see

Santer et al. 2005), and most models include at least one extra forcing such as volcanoes or

land use. Projections forward after 2000 use the A1B emission scenario. Tables 2 and 3

report, for the LT and MT layers respectively, the climate models, the extra forcings, the

number of runs in each ensemble mean, estimated trend slopes in the cases with and

without level shifts, and VF standard errors. All series had a significant AR1 coefficient, but

as reported in McKitrick et al. (2010), over two-thirds also have significant higher-order AR

terms as well, which motivates the use of a HAC estimator as opposed to a simple AR(1)

treatment as in Santer et al. (2008) and Fu et al. (2011).

We used three observational temperature series. The HadAT radiosonde series is a set

of MSU-equivalent layer averages published on the Hadley Centre web site1 (Thorne et al.

2005). We use the 2LT layer to represent the GCM LT-equivalent and the T2 layer to

represent the GCM MT-equivalent. The other two series are denoted RAdiosonde

1 http://www.metoffice.gov.uk/hadobs/hadat/msu/anomalies/hadat_msu_tropical.txt.

Page 26 of 64

John Wiley & Sons

Environmetrics

For Peer Review

27

Observation Bias Correction using Reanalyses (RAOBCORE, Haimberger 2005) and

Radiosonde Innovation Composite Homogenization (RICH, Haimberger et al. 2008). Both

were obtained from the Institute for Meteorology and Geophysics at the University of

Vienna,2 however this site does not provide the data in LT- and MT-equivalent forms so

MSU-equivalent layer averages of the tropical latitudes were computed for us by John

Christy (pers. comm.) using weighting functions that match those used for the HadAT

series. We did not use the RATPAC series published by the National Oceanic and

Atmospheric Administration (NOAA3) since the zonal averages are only available in

quarterly or annual form. Another series, called IUK-radiosonde from the University of New

South Wales,4 only goes up to 2005.

The HadAT and RICH series deal with the problem of homogenizing short data

segments by comparing series at suspected breakpoints to reference series formed using

nearby observations to detect if shift terms are needed. The RAOBCORE series uses

reference series generated by nearby weather forecasting systems (called “reanalysis

data”). Production of these series is therefore an application of the shift-detection methods

developed in this paper, but for our current purposes we will take the data as given and

apply it to the model-observation comparison.

The last three lines of Tables 2 and 3 report the estimated trend slopes and VF standard

errors for the observed temperature series. Figure 2 displays the observed LT and MT

trends with the least squares trend lines shown. The estimated LT trends for, respectively,

2 http://www.univie.ac.at/theoret-met/research/raobcore/index.html. 3 Available at http://www1.ncdc.noaa.gov/pub/data/ratpac/ 4 Available at http://www.ccrc.unsw.edu.au/staff/profiles/sherwood/radproj/index.html

Page 27 of 64

John Wiley & Sons

Environmetrics

For Peer Review

28

HadAT, RICH and RAOBCORE are 0.135, 0.134 and 0.147 oC/decade. The MT trends are,

respectively, 0.089, 0.096 and 0.132 oC/decade. The effect of allowing for a level shift (step-

change) in 1977:12 is shown in Figure 3. The three radiosonde series are averaged and the

broken trend is plotted along with the trend in the GCM average (with a break located at

the same place). Individual monthly observations from three GCMs are also shown for

illustrative purposes. Using a break date of 1977:12 the observed LT trends fall to 0.064,

0.093 and 0.065 oC/decade and the MT trends fall to -0.001, 0.048 and 0.042 oC/decade.

Thus about half of the positive LT trend in Figure 2 can be attributed to the one-time

change at 1977:12 and essentially all the MT change is accounted for by the step-change.

Figure 4 plots all the estimated trend slopes along with their CIs. The top row leaves out

the level shift and the bottom row includes it. The model-generated trends are grouped on

the left with the 95% CI’s shown as the shaded region. The trends are ranked from smallest

to largest and the numbers beside each marker refer to the GCM number (see Table 2 for

names). The three trends on the right edge are, respectively, the Hadley and RICH series.

With or without the break term the range of model runs and their associated CI’s overlaps

with those of the observations. In that sense we could say there is a visual consistency

between the models and observations. However, that is too weak a test for the present

purpose, since the range of model runs can be made arbitrarily wide through choice of

parameters and internal dynamical schemes, and even if the reasonable range of

parameters or schemes is taken to be constrained on empirical or physical grounds, the

spread of trends in Figure 4 (spanning roughly 0.1 to 0.4 oC/decade in each layer) indicates

that it is still sufficiently wide as to be effectively unfalsifiable. Also, if we base the

Page 28 of 64

John Wiley & Sons

Environmetrics

For Peer Review

29

comparison on the range of model runs rather than some measure of central tendency it is

impossible to draw any conclusions about the models as a group, or as an implied physical

theory. Using a range comparison, the fact that, in Figure 4, models 8, 5 and 16 are

reasonably close to the observational series does not provide any support for models 2, 3

and 4, which are far away. We want to pose the trend comparison in a form that tells us

something about the behaviour of the models as a group, or as a methodological genre, and

this requires using multivariate testing framework.

6.2 MULTIVARIATE TREND COMPARISONS IN THE NO-BREAK AND KNOWN BREAK CASES For each layer we now treat the 23 climate model generated series and the 3

observational series as an n=26 panel of temperature series. We estimate models (1) and

(2) using the methods described in Section 3. The parameters of interest are the trend

slopes. We are interested in testing the null hypothesis that the weighted average of the

trend slopes in the 23 climate model generated series is the same as the average trend

slope of the observed series. The weight coefficient wi equals the number of runs in model

i’s ensemble mean, to adjust for the reduction in variance in multi-run ensemble means.

Placing the observed series in positions i=24,25,26 the restriction matrices for this null

hypothesis are

[

]

where the wi terms sum to 57.

Table 5 presents the VF statistics for the test of trend equivalence between the climate

models and observed data. Also reported are the VF statistics for testing the significance of

the individual trends of the observed temperature series, the magnitudes of which

Page 29 of 64

John Wiley & Sons

Environmetrics

For Peer Review

30

(oC/decade) are indicated in parentheses beside the series name. Asymptotic critical values

are provided in the table captions and significance is indicated as described in the table. We

also compute bootstrap p-values for the tests using the method outlined in Section 2.4

using 10,000 bootstrap replications.

In the trend model without level shifts (top panel of Table 5), the zero trend-hypothesis

is rejected at the 1% significance level for all 6 observed series, apparently indicating

strong evidence of a significant warming trend in the tropical troposphere over the 1958-

2010 interval. A test that the climate models, on average, predict the same trend as the

observational data sets is rejected in the LT layer and in the MT layer at 1% significance.

Table 6 repeats the model-observation trend equivalence test for each of the 23 models

individually. In the LT layer the differences are significant at 5% or lower in 8 cases and 14

cases in the MT layer. (Not reported are the single-model tests of trend significance, which

are significant at <1% in both layers for 22 models, at <10% for one model in the LT layer

and insignificant for one model in the MT layer.) So while, on average, the model trends are

significantly different from observations, in the LT layer it can at least be said that if we

ignore the step-change at 1977:12, almost two-thirds of the models have trends that

individually do not significantly differ from the observations.

When we add the level shift dummy at 1977:12 (middle block of Table 5), the trend

magnitudes and values of the VF statistics for testing the zero trend-hypothesis drop

considerably. The critical values for VF are larger than in the case without the mean-shift

dummy. Now none of the observed series has a significant trend. Hence when the level shift

Page 30 of 64

John Wiley & Sons

Environmetrics

For Peer Review

31

is left out, the increase in the series is spuriously associated with a trend slope. But the

entire trend is explained by a jump in the data around 1977.

The VF test of equivalence of trends between the climate models and observed data is

more strongly rejected when the level shift dummy is included. Notice that bootstrap p-

values drop to essentially zero in this case. This finding is not surprising because, as is clear

in Tables 2 and 3, while the estimated trend slopes decrease for the observed series when

the level shift dummy is included, the estimated trend slopes of the climate model series

are not systematically affected by the level shift dummy.5 Therefore, there is a greater

discrepancy between the climate model trends and the observed trends. Table 6 confirms

this in the model-specific tests (see “Known Break Date” columns). The number of

rejections in the LT layer jumps from 8 to 12 out of 23, and in the MT layer from 14 to 17

out of 23.

The VF scores on the level shift magnitudes for the individual time series where the

break date is 1977:12 and is treated as known are generally small. Not surprisingly, the

break terms are not significant for the model generated series, except in the MT where one

is significant at 10%. The break terms yield larger VF scores in the observed series. In the

LT layer it is significant at 5% for the RAOBCORE series and in the MT layer it is significant

at 5% for Hadley and RAOBCORE. It might seem surprising that the effect of the break

dummy is so dramatic on the estimated trend slope parameters, yet the break coefficients

5 The climate-models do not explicitly model the Pacific Climate Shift and so the level shift coefficient has no special meaning for the climate model data. Not surprisingly, the estimated level shift coefficients were positive in 11 cases and negative in 12 of the climate model series.

Page 31 of 64

John Wiley & Sons

Environmetrics

For Peer Review

32

themselves are not strongly significant. But in general, trend slopes are estimated more

efficiently than level shifts (or intercepts) and therefore it is more difficult to conduct

inference about level shifts than about trend slopes. While there is sufficient noise in the

data to make inference about level shifts difficult, the noise is not so large to mask

information about the trend slopes. In addition, unmodeled level shifts also induce

spurious noise into the model when conducting inference about trend slopes. Therefore,

controlling for a level shift makes inferences about trend slopes more informative.

6.3 MULTIVARIATE TREND COMPARISONS IN THE UNKNOWN-BREAK CASE As a robustness check regarding the exogeneity assumption about the break date of the

PCS, we report results where we treat the break date as unknown and use the supVF

statistic. The supVF statistic is calculated by computing the VF score with the break term Tb

sequentially set across the middle 80% of the data set, then selecting the maximum value.

For the tests of model-observational equivalence allowing for a level shift, Figure 5

shows the sequence of VF scores, with 10%, 5% and 1% critical values shown. The supVF

occurs at 1979:6 in the LT layer and at 1979:5 in the MT layer. These break dates are close

to, but not the same as, 1977:12 and all of the VF scores for dates near and around 1977:12

time interval far exceed the 1% critical values for the SupVF statistic. Since the supVF test is

conservative regarding the choice of break date, this provides strong additional support for

the results in Section 5.2.

The supVF scores of tests of model-observational equivalence are reported in Table 5

for model averages and in Table 6 for individual models. A pattern emerges particularly

Page 32 of 64

John Wiley & Sons

Environmetrics

For Peer Review

33

clearly in Table 5 that when we apply the data mining approach, the test scores get larger,

as expected, but so do the critical values. The net effect is to reduce the significance of the

test scores when the break date is treated as unknown. Had we not used the critical values

as given by (15), we would have spuriously inflated the significance of our findings. This is

a useful lesson in the perils of naïve data-mining, in which a specification is selected that

maximizes the chance of rejecting some null hypothesis, without taking into account the

effect of the data mining process on the null distribution of the test. The price one pays for

being honest and using the conservative critical values implied by (15) is lower power in

detecting a deviation from the null hypothesis. Even with lower power, we see in the third

panel of Table 5 that the average model is still clearly rejected against the data in both the

LT and MT layers. Since this result is not dependent on choosing a particular break date it

provides very strong confirmation of the known break date empirical finding.

Regarding the trend magnitudes, we do not report the supVF score for the test of a zero

trend in the 3rd block of Table 5. Recall that the search process looks for the location that

maximizes the chance of rejecting a null hypothesis. In this case the significance of the

trend would be maximized simply by leaving the break term out altogether, which

corresponds to the test scores in the first block of Table 5.

We do not report the supVF scores for the level shift parameters. While the supVF

scores are greater than the VF scores with break date 1977:12, the critical values, as given

by Table 1b, increase substantially, and we do not reject the null of no level shift for the

observed series. Again the problem is the low power of detecting a level shift in data with

substantial noise. Treating the break date as unknown further reduces this already low

Page 33 of 64

John Wiley & Sons

Environmetrics

For Peer Review

34

power. Two points are noteworthy regarding this finding. First, for the purpose of

detecting whether a break occurs in the data at an unknown point, inferences are highly

sensitive to the chosen reference series. Testing observations against a zero-trend

alternative does not indicate the existence of a break point, whereas comparison against

the model-generated alternative strongly indicates such a break. Second, even if we were to

choose a model with no level shifts in the observed temperature series, we would still

reject model-observational equivalence as shown in the first block of Table 5.

7 CONCLUSIONS Heteroskedasticity and autocorrelation robust (HAC) covariance matrix estimators

have been adapted to the linear trend model, permitting robust inferences about trend

significance and trend comparisons in data sets with complex and unknown

autocorrelation characteristics. Here we extend the multivariate HAC approach of

Vogelsang and Franses (2005) to allow more general deterministic regressors in the model.

We show that the asymptotic (approximating) critical values of the test statistics of

Vogelsang and Franses (2005) are nonstandard and depend on the specific deterministic

regressors included in the model. These critical values can be simulated directly.

Alternatively, we outline a simple bootstrap method for obtaining valid critical values and

p-values.

The empirical focus of the paper is a comparison of trends in climate model-generated

temperature data and corresponding observed temperature data in the tropical

troposphere. Our empirical innovation is to model a level shift in the observed data

corresponding to the Pacific Climate Shift that occurred around 1978. With respect to the

Page 34 of 64

John Wiley & Sons

Environmetrics

For Peer Review

35

Vogelsang and Franses (2005) approach, this amounts to adding a level shift dummy to the

model which requires a new set of critical values which we provide.

As our empirical findings show, the detection of a trend in the tropical lower- and mid-

troposphere data over the 1958-2010 interval is contingent on the decision of whether or

not to include a level shift term at December 1977. If the term is included, a time trend

regression with error terms robust to autocorrelation of unknown form indicates that the

trend observed over the 1958-2010 interval is not statistically different from zero in either

the LT or MT layers. Also most climate models predict a significantly larger trend over this

interval than is observed in either lower- or mid-tropospheres. We find a statistically

significant discrepancy between the average climate model trend and observational trends

whether the mean-shift term is included or not. However, with the shift term included the

null hypothesis of equal trend is rejected much more strongly (at much smaller significance

levels).

The testing method used herein is both powerful and conservative. The power of the

test is indicated by the span of test scores in Table 6 in which relatively small changes in

modeled trends translate into much higher rejection probabilities. Using the data-mining

method provides a check on the extent to which the results depend on the assumption of a

known break date.

As such our empirical approach has many other potential applications on climatic and

other data sets in which level shifts are believed to be have occurred. Examples could

include stratospheric temperature trends which are subject to level shifts coinciding with

major volcanic eruptions, and land surface trends where it is believed that the measuring

Page 35 of 64

John Wiley & Sons

Environmetrics

For Peer Review

36

equipment changed or was moved. Generalizing the approach to allow more than one

unknown break point is left for subsequent work.

8 APPENDIX I:MOTIVATION AND BACKGROUND OF VF APPROACH

Motivation for the form of the VF statistic can be developed by considering the very

simple model

.= itiit uay (A1)

Model (A1) can written in terms of the general model (3) setting 1=0t

d , ii

a= and 0=1t

d .

Because itu is a mean zero time series, it follows that )(= iti yEa . For the purpose of matrix

representations the natural organization is to denote rows by the time index t and columns

by the data source index i. However the matrix representation of the statistical theory

becomes easier if we transpose the data so that the columns represent time. We can then

refer to time series of column vectors: ),...,,(= 21

ntttt yyyy , ),...,,(= 21

naaaa and

),...,,(= 21

ntttt uuuu .

Rewrite the model as

,= tt uay

and suppose we are interested in testing q linear restrictions about the means, a , of the

form:

Page 36 of 64

John Wiley & Sons

Environmetrics

For Peer Review

37

,:,=: 10 rRaHrRaH

where R and r are, respectively, nq and 1q matrices of known constants. We require

that nq and that R have full rank ( qRrank =)( ). The natural estimator of a is the vector

of sample averages, i.e. the OLS estimator given by t

T

tyTya

1=

1==ˆ .

The variance of a , given the covariance stationarity assumption for tu , is given by

.))((1=]))([(=)ˆ(1

=1

0

1

=1=1

2

jj

T

j

t

T

t

t

T

t T

jTuuETaVar

Letting ))((1=1

1=0 jj

T

jTT

j

we have the more compact expression

.=)ˆ( 1

TTaVar

(A2)

An F -statistic constructed using (A2) is infeasible because T is unknown. A natural

estimator of T is given by

,ˆˆ=ˆ),ˆˆ)((1ˆ=ˆ

1=

11

1=

0 jtt

T

jt

jjj

T

j

T uuTT

j

where .ˆ=ˆ ayu tt Using T in place of

T leads to the VF statistic:

.)/ˆ(]ˆ[)ˆ(= 11 qraRRRTraRVF T

Because T is constructed without assuming a specific model of serial correlation, T is in

the class of nonparametric spectral estimators of .

Page 37 of 64

John Wiley & Sons

Environmetrics

For Peer Review

38

Asymptotic theory is used to generate an approximation for T and the null

distribution of VF using the FCLT given by (12). If it were the case that T were a

consistent estimator of , then VF would converge in distribution to a qq /2 random

variable. It turns out that T is not a consistent estimator of , however, it is relatively

easy to show that T does converge in distribution to a random matrix that is

proportional to but otherwise does not depend on unknown quantities. This property of

T means that the VF statistic can be used to test 0H because VF can be approximated

by a random variable that does not depend on unknown parameters.

Recall the partial sum of tu given by j

t

j

t uS 1=

= . Evaluating tS at ][= cTt gives

t

cT

t

cT uS ][

1=

][ = which is the sum of the first thc proportion of the data. For a given value of c,

the quantity ][cT as T . Therefore, if we scale by 1/2T we obtain the result

).(0,=)(0,][

][= 1/2

][

1/2

1/2

][

1/2

cNNcScTT

cTST

d

cTcT

For a given value of c, the scaled partial sums of tu satisfy a Central Limit Theorem (CLT).

These limits hold pointwise in c. The FCLT given by (11) is a stronger statement that says

this collection of CLTs, as indexed by c, hold jointly and uniformly in c and that the family of

limiting normal random variables are a Wiener process (or standard Brownian motion).

Not surprisingly, the FCLT requires slightly stronger assumptions for tu than a CLT. For

example, the condition

<)(

0=

lm

jj, where )(lm

j is the ml, element of the matrix j , is

Page 38 of 64

John Wiley & Sons

Environmetrics

For Peer Review

39

strengthened to

<)(

1=

lm

jjj which requires the autocovariances to shrink faster to

zero as j increases.

Using the FCLT given by (11) it immediately follows that

.)(0,=)(0,(1)===)ˆ( 1/2

1=

1/2 NINWSTuTuTaaT nnTt

T

t

~

Using the FCLT, it is straightforward to determine the asymptotic behavior of T . The first

step is to write T as a function of j

t

jt uS ˆ=ˆ1= using (10):

.ˆˆ2)ˆˆ)((1ˆ=ˆ1

1=

21

1=

0 tt

T

t

jj

T

j

T SSTT

j

Using the FCLT, the limit of ][

1/2 ˆcT

ST is easy to derive:

)ˆ(=)ˆ(=ˆ=ˆ

][

1=

1/2

][

1=

1/2

][

1=

1/2

][

1/2 auaTayTuTSTt

cT

t

t

cT

t

t

cT

t

cT

)ˆ(

][=)ˆ]([=

][

1/21/2

][

1=

1/2 aaTT

cTSTaacTTuT

cTt

cT

t

).((1)))((=(1))( cBcWcWWccW nnnnn

The stochastic process, )(cBn

, is the well known Brownian bridge. Using this result for

][

1/2 ˆcT

ST and the continuous mapping theorem, it follows that

.)()(2)ˆ)(ˆ(2=ˆ

1

0

1/21/21

1=

1

dccBcBSTSTT nntt

T

t

T

Page 39 of 64

John Wiley & Sons

Environmetrics

For Peer Review

40

We see that while T does not converge to = , it does converge to a random matrix

that is proportional to .

Establishing the limit of VF is now simple:

qraRRRTraRVF T )/ˆ(]ˆ[)ˆ(= 11 qraRTRRraRT T )/ˆ(]ˆ[)ˆ(= 1

quTRRRTuTR T /]ˆ[)(= 11 .(1)/])()(2[)(1)( 11

0qWRRdccBcBRWR

nnnn

While not obvious at first glance, the restriction matrix, R , drops from the limit. Because

Wiener processes are Gaussian (normally distributed), linear combinations of Wiener

processes are also Wiener processes. Therefore, )(cWRn

is a 1q vector of Wiener

processes and we can rewrite )(cWRn

as )(cWq

where is the qq matrix square

root of RR , i.e. RR ='* . Similarly, we can rewrite )(cBRn

as )(cBq

where

(1))(=)(qqq

cWcWcB . Because R is assumed to be full rank, it follows that is full rank

and is therefore invertible. We have

qWRRdccBcBRWRVF nnnn (1)/])()(2[)(1)( 1

1

0

qWdccBcBW q

'

qqq (1)/])()([2)(1)(= 11

0

,(1)/])()([2)(1= 1

1

0qWdccBcBW qqqq

and the matrices drop out because is invertible.

The limit of VF does not depend on unknown parameters. The limit is a quadratic form

involving a vector of independent standard normal random variables, (1),qW and the

Page 40 of 64

John Wiley & Sons

Environmetrics

For Peer Review

41

inverse of the random matrix dccBcBqq

)()(21

0

. Because (1)qW is independent of )(cBq

for all

c, (1)qW is independent of dccBcBqq

)()(21

0

and the limit of VF is similar in spirit to an F

random variable but its distribution is nonstandard. The random matrix dccBcBqq

)()(21

0

can be viewed as an approximation to the randomness of RR T whereas (1)qW

approximates the randomness of )ˆ( raRT .

9 APPENDIX II: DERIVATION OF NULL LIMIT OF VF The first step of the derivation is to obtain the asymptotic behavior of rR under 0H .

Under 0H we have rR and it follows that )ˆ(ˆˆ RRRrR . Using (6) we

have

( ) ( ∑

)

Using simple algebra, (11) and (12), we have

→ ∫ ( )

, (A3)

∑ ∫ ( ) ( )

, (A4)

where

( ) ( ) (∫ ( ) ( )

) (∫ ( ) ( )

)

( ).

The limit in (A4) can be more compactly written as follows:

Page 41 of 64

John Wiley & Sons

Environmetrics

For Peer Review

42

∫ ( ) ( )

∫ ( ) ( )

=∫ ( )

( )

∫ ( ) ( )

(A5)

where is the matrix square root of , i.e. . The equivalence in

distribution indicated by (A5) follows because Wiener processes are mean zero and

Gaussian. Using (A3), (A4) and (A5) it follows that

√ ( )

→ (∫ ( )

)

(∫ ( ) ( )

) (A6)

We next derive the limit of T . In deriving this limit it is convenient to stack the

deterministic regressors into a single column vector td where ).,( 10 ttt ddd Define the

combined scaling matrix

.=1

1

10

1)(1)(

Tk

kT

kkT

0

0

It immediately follows from (11) and (12) that

,)(0

][

1=

1 dssfdTc

tT

cT

t ,)()(

1

0=1

1 dssfsfddT TttT

T

t

(A7)

,)()(1

01=

2/1

sfsdWduRT qTtt

T

t

(A8)

The next step is to derive the limit of ][

1/2 ˆcTSRT :

tjj

T

j

jj

T

j

t

cT

t

t

cT

t

cT dddduuRTuRTSRT

1

=1=1

][

=1

1/2][

=1

1/2

][

1/2 ˆˆ

tT

cT

t

TjjT

T

j

Tjj

T

j

cT dTddTduRTSRT

][

=1

1

1

=1

1

=1

2/1

][

1/2

Page 42 of 64

John Wiley & Sons

Environmetrics

For Peer Review

43

.)()()()()()()(0

11

0

1

00cBdssfdssfsfsfsdWsdW f

q

c

qq

c

(A9)

The limit in (A9) follows from (11), (A7) and (A8). Using (A9) it directly follows that

.)()(2ˆˆ2ˆˆ2ˆ1

0

2/12/11

=1

11

=1

2 'f

q

f

q

d

tt

T

t

tt

T

t

T dssBsBRTSSRTTRSSTRRR

(A10)

Combining (A6) and (A10) gives the limit of VF :

qrRTRRdTrRTVF TTt

T

t

TT )/ˆ(ˆ~)ˆ(= 1

0

11

2

0

1=

2

0

11

0

11

0

12

0

1

00

1

0

12

0

1

0)()(2)(

~)()(

~)(

~

'f

q

f

qq

d

dssBsBdssfsdWsfdssf

qsdWsfdssf q /)()(~

)(~

0

1

0

12

0

1

0

./)()(~

)(~

)()(2)()(~

)(~

= 0

1

0

1/22

0

1

0

11

00

1

0

1/22

0

1

0qsdWsfdssfdssBsBsdWsfdssf q

f

q

f

qq

Using well known properties of Wiener processes, it follows that

),,(=)()(~

)(~

0

1

0

2/12

0

1

0qqq INZsdWsfdssf 0~

which allows us to write

qq

f

q

f

qq

d

VFqZdssBsBZVF /)()(21

1

0

Page 43 of 64

John Wiley & Sons

Environmetrics

For Peer Review

44

completing the derivation. Using properties of Wiener processes, it is simple to show that

)(sB f

q is Gaussian and that qZ and )(sB f

q are independent. Therefore, it follows that qZ and

dssBsB f

q

f

q )()(21

0 are independent.

10 REFERENCES Andrews, D.W.K. (1991) “Heteroskedasticity and autocorrelation consistent covariance

matrix estimation,” Econometrica 59, 817–854.

Andrews, D.W.K. (1993) “Tests for Parameter Instability and Structural Change with

Unknown Change Point,” Econometrica, 61, 821-856.

Andrews, D.W.K. and J.C.. Monahan (1992) “An improved heteroskedasticity and autocorrelation

consistent covariance matrix estimator,” Econometrica, 60, 953-966.

Andrews, D.W. K. and W. Ploberger (1994) “Optimal tests when a nuisance parameter is

present only under the alternative,“ Econometrica, 62, 1383-1414.

Bengtsson, L. and K. Hodges, “Evaluating temperature trends in the tropical

troposphere.” Climate Dynamics 2009, doi: 10.1007/s00382-009-0680-y.

Bloomfield, P. and D. Nychka (1992) “Climate spectra and detecting climate change,”

Climate Change, 21, 1-16.

Brohan, P., J.J. Kennedy, I. Harris, S.F.B. Tett and P.D. Jones (2006) “Uncertainty

estimates in regional and global observed temperature changes: a new dataset from 1850.”

Journal of Geophysical Research 111, D12106, doi:10.1029/2005JD006548

Page 44 of 64

John Wiley & Sons

Environmetrics

For Peer Review

45

Davidson, R. and J.G. MacKinnon (2004) Econometric Theory and Methods. Toronto:

Oxford.

Davies, R.B. (1987) “Hypothesis testing when a nuisance parameter is present only

under the alternative,” Biometrica, 74, 33-43.

Douglass, D. H., J. R. Christy, B. D. Pearson, and S. F. Singer. 2007. A comparison of

tropical temperature trends with model predictions. Intl J Climatology Vol 28(13) 1693—

1701 DOI 10.1002/joc.1651

Folland, C.K. and D.E. Parker (1995) “Correction of Instrumental Biases in Historical Sea

Surface Temperature Data.” Quarterly Journal of the Royal Meteorological Society (1995)

121: 319-367.

Fomby, T. and T.J. Vogelsang (2002) “The application of size-robust trend statistics to

global warming temperature series,” Journal of Climate, 15, 117-123.

Fu, Q., S. Manabe and C. M. Johanson (2011) “On the warming in the tropical upper

troposphere: Models versus observations” Geophysical Research Letters Vol. 38, L15704,

doi:10.1029/2011GL048101, 2011.

Gonçalves, S. and T.J. Vogelsang (2011) “Block bootstrap HAC robust tests: The

sophistication of the naïve bootstrap,” Econometric Theory, 27, 745-791.

Grenander, U. and M. Rosenblatt (1957) Statistical Analysis of Stationary Time Series.

New York: Wiley.

Page 45 of 64

John Wiley & Sons

Environmetrics

For Peer Review

46

Haimberger, L. (2005) “Homogenization of radiosonde temperature time series using

ERA-40 analysis feedback information.” ERA-40 Project Report Series No. 23, June 2005.

Available online at

http://www.ecmwf.int/publications/library/ecpublications/_pdf/era40/ERA40_PRS23.pd

f.

Haimberger, L., C. Tavolato, S. Sperka, (2008) “Towards the elimination of the warm

bias in historic radiosonde temperature records - some new results from a comprehensive

intercomparison of upper air data.” J. Climate. 21, 4586—4606 DOI

10.1175/2008JCLI1929.1.

Hansen, J., R. Ruedy, J. Glascoe, and Mki. Sato (1999), “GISS analysis of surface

temperature change,” Journal of Geophysical Research, 104, 30997-31022,

doi:10.1029/1999JD900835.

Hansen, B. (1996) “Inference When a Nuisance Parameter Is Not Identified Under the

Null Hypothesis,” Econometrica, Vol. 64, No. 2 (Mar., 1996), pp. 413-430.

Held, I. and B. J. Soden (2000) “Water Vapor Feedback and Global Warming” Annual

Review of Energy and Environment 25:441—75.

Karl, T. R., S. J. Hassol, C. D. Miller, and W. L. Murray (2006). Temperature Trends in the

Lower Atmosphere: Steps for Understanding and Reconciling Differences. Synthesis and

Assessment Product. Climate Change Science Program and the Subcommittee on Global

Change Research. http://www.climatescience.gov/Library/sap/sap1-1/finalreport/sap1-

1-final-all.pdf. Accessed August 3 2010.

Page 46 of 64

John Wiley & Sons

Environmetrics

For Peer Review

47

Kiefer, N.M and T.J. Vogelsang (2002) “Heteroskedasticity-autocorrelation robust

standard errors using the Bartlett kernel without truncation,” Econometrica, 70, 2093-

2095.

Kiefer, N.M and T.J. Vogelsang (2005) “A new asymptotic theory for heteroskedasticity-

autocorrelation robust tests,” Econometric Theory, 21, 1130-1164.

Kiefer, N.M., T.J. Vogelsang & H. Bunzel (2000) “Simple robust testing of regression

hypotheses,” Econometrica 68, 695–714.

Kiefer, N.M., T.J. Vogelsang & H. Bunzel (2001) “Simple robust testing of hypotheses in

nonlinear models,” Journal of the American Statistical Association, 96, 1088-1096.

Lund, R. X. Wang, Q. Lu, J. Reeves, C. Gallagher and Y. Feng (2007) “Changepoint

Detection in Periodic and Autocorrelated Time Series.” Journal of Climate Vol. 20, DOI:

10.1175/JCLI4291.1.

McKitrick, R. R., S. McIntyre and C. Herman (2010) “Panel andMultivariate Methods for

Tests of Trend Equivalence in Climate Data Sets.” Atmospheric Science Letters, 11(4) pp.

270-277, October/December 2010 DOI: 10.1002/asl.290.

Mills T. C. (2010) Skinning a cat: alternative models of representing temperature trends.

Climatic Change 101: 415-426, DOI 10.1007/s10584-010-9801-1.

Newey, W.K. & K.D. West (1987) “A simple, positive semi-definite, heteroskedasticity

and autocorrelation consistent covariance matrix.” Econometrica 55, 703–708.

Page 47 of 64

John Wiley & Sons

Environmetrics

For Peer Review

48

Po-Chedley, S. and Q. Fu (2012) “Discrepancies in tropical upper tropospheric warming

between atmospheric circulation models and satellites.” Environmental Research Letters 7

(2012) doi:10.1088/1748-9326/7/4/044018.

Powell Jr., A.M. and J. Xu (2011) “Abrupt Climate Regime Shifts, Their Potential Forcing

and Fisheries Impacts” Atmospheric and Climate Sciences 1, 33-47.

doi:10.4236/acs.2011.12004

Reeves, J., Chen, J., Wang, X. L., Lund, R. B. and Lu, Q. (2007). A review and comparison of

changepoint detection techniques for climate data. Journal of Applied Meteorology and

Climatology, 46, 900–915.

Santer, B.D., T. M. L. Wigley, C. Mears, F. J. Wentz, S. A. Klein, D. J. Seidel, K. E. Taylor, P.

W. Thorne, M. F. Wehner, P. J. Gleckler, J. S. Boyle, W. D. Collins, K. W. Dixon, C. Doutriaux, M.

Free, Q. Fu, J. E. Hansen, G. S. Jones, R. Ruedy, T. R. Karl, J. R. Lanzante, G. A. Meehl, V.

Ramaswamy, G. Russell, G. A. Schmidt (2005) Amplification of Surface Temperature Trends

and Variability in the Tropical Atmosphere. Science Vol. 309. no. 5740, pp. 1551 – 1556

DOI: 10.1126/science.1114867

Santer, B. D., P. W. Thorne, L. Haimberger, K. E. Taylor, T. M. L. Wigley, J. R. Lanzante, S.

Solomon, M. Free, P. J. Gleckler, and P. D. Jones. 2008. Consistency of Modelled and

Observed Temperature Trends in the Tropical Troposphere. Int. J. Climatol. Vol 28(13)

1703—1722 DOI: 10.1002/joc.1756

Santer, B.D., C. Mears, C. Doutriaux, P. Caldwell, P.J. Gleckler, T.M.L. Wigley, S. Solomon,

N.P. Gillett, D. Ivanova, T.R. Karl, J.R. Lanzante, G.A. Meehl, P.A. Stott, K.E. Taylor, P.W.

Thorne, M.F. Wehner, and F.J. Wentz (2011) “Separating Signal and Noise in Atmospheric

Page 48 of 64

John Wiley & Sons

Environmetrics

For Peer Review

49

Temperature Changes: The Importance of Timescale.” Journal of Geophysical Research 116,

D22105, doi:10.1029/2011JD016263.

Seidel, D. J. and J. R. Lanzante (2004) “An assessment of three alternatives to linear

trends for characterizing global atmospheric temperature changes.” Journal of Geophysical

Research VOL. 109, D14108, doi:10.1029/2003JD004414, 2004.

Sun, Y., P.C.B. Phillips, and S. Jin (2008) “Optimal bandwidth selection in

heteroskedasticity--autocorrelation robust testing,” Econometrica, 76, 175-194.

Thorne, P. W., D. E. Parker, S. F. B. Tett, P. D. Jones, M. McCarthy, H. Coleman, P. Brohan,

and J. R. Knight, (2005) “Revisiting radiosonde upper-air temperatures from 1958 to 2002.”

J. Geophys. Res., 110, D18105, doi:10.1029/2004JD005753.

Thorne, P. W. , P. Brohan, H.A. Titchner, M.P. McCarthy, S. C. Sherwood, T. C. Peterson, L.

Haimberger, D. E. Parker, S. F. B. Tett, B. D. Santer, D. R. Fereday, and J.J. Kennedy (2011) “A

quantification of uncertainties in historical tropical tropospheric temperature trends from

radiosondes.” Journal of Geophysical Research-Atmospheres Vol. 116, D12116,

doi:10.1029/2010JD015487, 2011.

Thompson, D. W. J., J. J. Kennedy, J. M. Wallace and P. D. Jones (2008). “A large

discontinuity in the mid-twentieth century in observed global-mean surface temperature”

Nature Vol 453, 29 May 2008 doi:10.1038/nature06982

Tsonis, A. K. Swanson and S. Kravtsov (2007). “A new dynamical mechanism for major

climate shifts” Geoohysical Research Letters Vol. 34, L13705, doi:10.1029/2007GL030288,

2007.

Page 49 of 64

John Wiley & Sons

Environmetrics

For Peer Review

50

Vogelsang, T.J. and P. H. Franses (2005) “Testing for Common Deterministic Trend

Slopes,” Journal of Econometrics 126, 1-24.

White, H., and I. Domowitz (1984) “Nonlinear Regression with Dependent

Observations,” Econometrica, 52, 143-161.

Woodward, W. A. and H.L. Gray (1993) “Global warming and the problem of testing for

trend in time series data,” J. Climate, 6, 953-962.

Wooldridge, J. (2005) Introductory Econometrics New York: Cengage Learning.

Page 50 of 64

John Wiley & Sons

Environmetrics

For Peer Review

51

11 TABLES

% VF .700 1.668 11.479 .750 2.166 14.429 .800 2.728 18.312 .850 3.388 23.676 .900 4.279 32.178 .950 5.673 48.514 .975 6.965 67.987 .990 8.639 97.928 .995 9.896 123.92

Table 1a: Asymptotic Critical Values. Model (2), known break date with , . The value in the first column shows the percentage in the upper (right) tail exceeding the indicated values of and . Left tail critical values of follow by symmetry around zero.

%

supVF trend slope

supVF level shift

.700 79.765 95.455

.750 88.184 109.94

.800 98.532 116.20

.850 111.78 130.76

.900 131.92 150.99

.950 166.41 188.68

.975 205.15 225.78

.990 261.39 279.85

.995 301.94 322.48

Table 1b: Asymptotic Critical Values. Model (2), unknown break date, q=1. 10% Trimming ( ). The value in the first column shows the percentage in the upper (right) tail exceeding the indicated values of the supVF statistics for, respectively, the trend slope and level shift coefficients.

Page 51 of 64

John Wiley & Sons

Environmetrics

For Peer Review

52

Without Level Shift With Level Shift

120 0 0 .060 .055 .049 .063 .057 .051

0.3 0 .063 .089 .053 .067 .093 .056

0.6 0 .073 .126 .068 .080 .131 .070

0.9 0 .139 .256 .127 .141 .264 .116

0.3 0.3 .181 .176 .076 .175 .172 .079

0.6 0.3 .255 .293 .148 .247 .286 .129

0.9 -0.3 .021 .102 .053 .022 .117 .060

240 0 0 .054 .053 .049 .058 .056 .054

0.3 0 .059 .080 .053 .059 .081 .057

0.6 0 .066 .103 .060 .064 .106 .064

0.9 0 .103 .192 .092 .101 .200 .092

0.3 0.3 .175 .145 .064 .168 .139 .068

0.6 0.3 .217 .229 .106 .209 .229 .101

0.9 -0.3 .013 .083 .055 .017 .092 .059

636 0 0 .056 .054 .053 .049 .047 .048

0.3 0 .055 .071 .053 .049 .067 .049

0.9 0 .057 .082 .056 .053 .078 .051

0.9 0 .073 .128 .070 .068 .128 .065

0.3 0.3 .158 .105 .059 .151 .103 .052

0.6 0.3 .177 .153 .077 .169 .150 .071

0.9 -0.3 .008 .071 .053 .007 .070 .050

Table 2a: Empirical Null Rejections with AR(2) Errors. , 5% Nominal Level, , , . The data generating process is given by (17) and (18).

Page 52 of 64

John Wiley & Sons

Environmetrics

For Peer Review

53

Without Level Shift With Level Shift

0 0 0 .01 .056 .054 .053 .049 .049 .048

.0105 .380 .378 .310 .144 .142 .127

.011 .912 .914 .816 .442 .439 .373

.012 .999 1.000 1.000 .949 .950 .881

.9 .01 .073 .130 .070 .069 .126 .066

.0105 .092 .153 .091 .073 .133 .073

.011 .148 .229 .142 .088 .158 .090

.012 .366 .484 .325 .160 .258 .143

0 .25 0 .01 .447 .443 .301 .049 .047 .048

.0105 .055 .055 .040 .144 .142 .127

.011 .319 .322 .213 .442 .439 .373

.012 .998 .998 .950 .949 .950 .881

.9 .01 .091 .158 .088 .069 .126 .066

.0105 .073 .128 .069 .073 .133 .073

.011 .089 .147 .084 .088 .158 .090

.012 .234 .324 .206 .160 .258 .143

.5 0 0 .01 .059 .056 .056 .050 .047 .047

.0105 .594 .595 .492 .226 .222 .187

.011 .995 .995 .963 .685 .679 .577

.012 1.000 1.000 1.000 .999 .999 .984

.9 .01 .073 .128 .072 .066 .127 .064

.0105 .106 .173 .101 .074 .142 .075

.011 .211 .304 .189 .108 .185 .101

.012 .572 .686 .495 .242 .350 .218

.5 .25 0 .01 .690 .691 .430 .050 .047 .047

.0105 .057 .056 .032 .226 .222 .187

.011 .508 .509 .303 .685 .679 .577

.012 1.000 1.000 .995 .999 .999 .984

.9 .01 .110 .179 .102 .066 .127 .064

.0105 .071 .126 .070 .074 .142 .075

.011 .098 .163 .094 .108 .185 .101

.012 .351 .467 .305 .242 .350 .218

Table 2b: Empirical Null Rejections and Empirical Power with AR(1) Errors. , T=636, 5% Nominal Level, , . The data generating process is given by (17) and (18).

Page 53 of 64

John Wiley & Sons

Environmetrics

For Peer Review

54

Simple Trend Trend + Level shift Data Series

Model/ Obs Name Extra Forcings; No. runs

Trend (C/decade)

Std Error

Trend (C/decade)

Std Error

Level Shift (C/decade)

Std Error

1 BCCR BCM2.0 O; 2 0.173 0.0071

0.176 0.0128

-0.013 -0.0405

2 CC;CMA3.1-T47 NA; 5 0.347 0.0046

0.345 0.0084

0.008 0.0266

3 CCCMA3.1-T63 NA; 1 0.373 0.0094

0.393 0.0146

-0.078 -0.0462

4 CNRM3.0 O; 1 0.249 0.0061

0.237 0.0106

0.044 0.0333

5 CSIRO3.0 1 0.139 0.0087

0.170 0.0183

-0.114 -0.0577

6 CSIRO3.5 1 0.242 0.0093

0.296 0.0136

-0.203 -0.0429

7 GFDL2.0 O, LU, SO, V; 1 0.186 0.0120

0.141 0.0235

0.170 0.0742

8 GFDL2.1 O, LU, SO, V; 1 0.109 0.0184

0.135 0.0319

-0.097 -0.1007

9 GISS_AOM 2 0.171 0.0085

0.156 0.0142

0.058 0.0449

10 GISS_EH O, LU, SO, V; 6 0.193 0.0117

0.226 0.0174

-0.121 -0.0551

11 GISS_ER O, LU, SO, V; 5 0.178 0.0137

0.207 0.0219

-0.110 -0.0690

12 IAP_FGOALS1.0 3 0.198 0.0132

0.243 0.0187

-0.170 -0.0591

13 ECHAM4 1 0.210 0.0140

0.236 0.0227

-0.096 -0.0717

14 INMCM3.0 SO, V; 1 0.178 0.0094

0.174 0.0169

0.015 0.0533

15 IPSL_CM4 1 0.189 0.0074

0.159 0.0126

0.114 0.0397

16 MIROC3.2_T106 O, LU, SO, V; 1 0.141 0.0104

0.121 0.0169

0.076 0.0534

17 MIROC3.2_T42 0.210 0.0133 0.233 0.0212 -0.086 -0.0669

Page 54 of 64

John Wiley & Sons

Environmetrics

For Peer Review

55

Table 3: Summary of Lower Troposphere data series. Notes: Each row refers to model ensemble mean (rows 1—23) or observational series (rows 24—26). All models forced with 20th century greenhouse gases and direct sulfate effects. Rows 10, 11, 19, 22 and 23 also include indirect sulfate effects. ‘Extra forcing’ indicates which models included other forcings: ozone depletion (O), solar changes (SO), land use (LU), volcanic eruptions (V). NA: information not supplied to PCMDI. No. runs: indicates number of individual realizations in the ensemble mean. Trend slopes estimated using OLS, Std Errors computed using VF method (see Section 4).

O, LU, SO, V; 3 18 MPI2.3.2a

SO, V; 5 0.205 0.0141

0.231 0.0228

-0.096 -0.0721 19 ECHAM5

O; 4 0.204 0.0059

0.198 0.0109

0.023 0.0343 20 CCSM3.0

O, SO, V; 7 0.217 0.0161

0.262 0.0236

-0.167 -0.0744 21 PCM_B06.57

O, SO, V; 4 0.176 0.0060

0.169 0.0114

0.024 0.0361 22 HADCM3

O; 1 0.190 0.0062

0.190 0.0114

0.001 0.0359 23 HADGEM1

O, LU, SO, V; 1 0.226 0.0104

0.205 0.0202

0.078 0.0636 24 HadAT 0.135 0.0093 0.064 0.0209 0.270 0.0660 25 RICH 0.134 0.0092 0.093 0.0206 0.155 0.0651 26 RAOBCORE 0.147 0.0080 0.065 0.0166 0.308 0.0525

Page 55 of 64

John Wiley & Sons

Environmetrics

For Peer Review[56]

Simple Trend Trend + Level shift Data Series

Model/ Obs Name Extra Forcings; No. runs

Trend

(C/decade)

Std Error

Trend

(C/decade)

Std Error

Level Shift (C/decade)

Std Error

1 BCCR BCM2.0 O; 2 0.176 0.0060

0.184 0.0107

-0.030 0.0336

2 CC;CMA3.1-T47 NA; 5 0.372 0.0046

0.365 0.0084

0.026 0.0265

3 CCCMA3.1-T63 NA; 1 0.399 0.0093

0.417 0.0149

-0.068 0.0471

4 CNRM3.0 O; 1 0.311 0.0072

0.296 0.0125

0.057 0.0395

5 CSIRO3.0 1 0.108 0.0086

0.139 0.0182

-0.117 0.0575

6 CSIRO3.5 1 0.229 0.0097

0.286 0.0142

-0.213 0.0448

7 GFDL2.0 O, LU, SO, V; 1 0.174 0.0117

0.133 0.0229

0.155 0.0722

8 GFDL2.1 O, LU, SO, V; 1 0.103 0.0198

0.143 0.0333

-0.152 0.1051

9 GISS_AOM 2 0.163 0.0081

0.154 0.0141

0.031 0.0446

10 GISS_EH O, LU, SO, V; 6 0.180 0.0114

0.210 0.0172

-0.114 0.0542

11 GISS_ER O, LU, SO, V; 5 0.162 0.0127

0.182 0.0211

-0.076 0.0666

12 IAP_FGOALS1.0 3 0.185 0.0125

0.225 0.0182

-0.149 0.0575

13 ECHAM4 1 0.200 0.0131

0.218 0.0220

-0.068 0.0696

Page 56 of 64

John Wiley & Sons

Environmetrics

For Peer Review[57]

14 INMCM3.0 SO, V; 1 0.183 0.0100

0.173 0.0175

0.035 0.0551

15 IPSL_CM4; 1 0.195 0.0081

0.157 0.0130

0.140 0.0411

16 MIROC3.2_T106 O, LU, SO, V; 1

0.147

0.0113

0.119 0.0175

0.105 0.0553

17 MIROC3.2_T42 O, LU, SO, V; 3 0.211 0.0143

0.234 0.0231

-0.086 0.0728

18 MPI2.3.2a SO, V; 5 0.182 0.0124

0.193 0.0214

-0.045 0.0675

19 ECHAM5 O; 4 0.202 0.0059

0.197 0.0109

0.018 0.0345

20 CCSM3.0 O, SO, V; 7 0.201 0.0132

0.232 0.0203

-0.116 0.0640

21 PCM_B06.57 O, SO, V; 4 0.161 0.0048

0.136 0.0084

0.096 0.0266

22 HADCM3 O; 1 0.170 0.0059

0.166 0.0109

0.134 0.0344

23 HADGEM1 O, LU, SO, V; 1 0.221 0.0108

0.210 0.0207

0.041 0.0653

24 HadAT 0.089 0.0088 -0.001 0.0172 0.336 0.0542 25 RICH 0.096 0.0075 0.048 0.0225 0.180 0.0709 26 RAOBCORE 0.132 0.0082 0.042 0.0152 0.339 0.0478

Table 4: Summary of Mid-Troposphere data series. Notes same as for Table 3.

Page 57 of 64

John Wiley & Sons

Environmetrics

For Peer Review

58

Trend Coef Null Hypothesis Test Score Bootstrap p value

No Level shift Hadley LT (0.135) trend = 0 213.3*** < 0.001 RICH LT (0.134) trend = 0 212.7*** < 0.001 RAOBCORE LT (0.147) trend = 0 337.5*** < 0.001 Hadley MT (0.089) trend = 0 102.8*** < 0.001 RICH MT (0.096) trend = 0 94.1*** < 0.001 RAOBCORE MT(0.132) trend = 0 259.8*** < 0.001 LT Models = Observed 96.2*** < 0.001 MT Models = Observed 155.4*** < 0.001

With Level shift at Date (Assumed Known): December 1977 Hadley LT (0.064) trend = 0 9.2 0.349 RICH LT (0.093) trend = 0 20.4 0.180 RAOBCORE LT (0.065) trend = 0 15.4 0.235 Hadley MT (-0.001) trend = 0 0.001 0.991 RICH MT (0.048) trend = 0 4.6 0.498 RAOBCORE MT (0.042) trend = 0 7.7 0.388 LT Models = Observed 261.1*** < 0.0002 MT Models = Observed 423.1*** < 0.0001

With Level Shift at Date Assumed Unknown LT Models = Observed 352.14*** < 0.001 MT Models = Observed 544.08*** < 0.001

Table 5: Results of hypothesis tests using VF statistic with and without level shift term at December, 1977. Notes: sample period (monthly): January 1958 to December 2010. The bootstrap p-value* is computed using the method described in Section 3.3 using 10000 bootstrap replications. VF Critical Values: Without level shift, 20.14 (10%, denoted *) 41.53 (5%, denoted **), 83.96 (1%, denoted ***). With level shift at known date, 32.32 (10%, denoted *), 49.03 (5%, denoted **), 97.23 (1%, denoted ***). With level shift at unknown date, 131.92 (10%, denoted *), 166.41 (5%, denoted **), 261.39 (1%, denoted ***).

Page 58 of 64

John Wiley & Sons

Environmetrics

For Peer Review

59

LT Layer MT Layer

Model

No Break

Known Break Date

Unknown Break Date

No Break

Known Break Date

Unknown Break Date

1 24.00* 53.35** 53.55 80.32** 206.48*** 115.54 2 750.18*** 231.96*** 731.85*** 1497.65*** 724.02*** 1078.92*** 3 653.13*** 913.03*** 924.73*** 636.38*** 1898.60*** 1446.43*** 4 107.71*** 48.37* 136.51* 469.47*** 192.48*** 378.75*** 5 0.00 7.78 14.09 0.43 16.17 11.96 6 79.02** 106.07*** 108.73 93.27*** 222.04*** 159.70* 7 31.53* 17.88 38.68 58.72** 47.08* 67.93 8 4.12 8.16 51.00 0.04 24.93 45.61 9 5.50 7.55 36.98 31.29* 29.52 49.78 10 28.38* 288.26*** 350.18*** 36.83* 495.44*** 476.83*** 11 11.65 150.50*** 170.06** 20.40* 192.87*** 241.43** 12 25.62* 297.75*** 363.07*** 35.91* 466.40*** 494.46*** 13 43.33** 378.37*** 453.11*** 53.93** 339.05*** 606.04*** 14 6.16 8.65 35.57 37.72* 27.26 61.59 15 27.03* 17.11 56.02 116.07*** 57.92** 97.57 16 0.03 2.61 42.62 19.89 15.83 47.48 17 37.32* 324.54*** 335.89*** 46.86** 231.36*** 350.56*** 18 29.94* 194.02*** 248.19*** 36.14* 186.32*** 285.08*** 19 53.03** 38.01* 92.42 159.76*** 156.99*** 127.39 20 29.92* 280.15*** 349.58*** 45.43** 372.73*** 563.97*** 21 31.54* 45.95* 53.44 142.56*** 130.34*** 92.19 22 34.89** 35.84* 52.27 64.75** 80.93** 77.74 23 144.28*** 94.78*** 143.16* 143.26*** 235.49*** 292.16*** <=5% 8/23 12/23 9/23 14/23 17/23 11/23

Table 6. VF tests of equivalent trends between individual model (ensemble mean in cases of multiple runs) and average of balloon series. Column 1: model. Column 2: LT layer, No Break case. Columns 3 and 4 (Break1 and Break2, respectively): with break assumed known at 1977:12 and with break at date assumed unknown. Columns 5-7: same for MT layer. VF Critical Values: Without level shift, 20.14 (10%, denoted *) 41.53 (5%, denoted **), 83.96 (1%, denoted ***). With level shift known to be at 1977:12, 33.32 (10%, denoted *), 51.20 (5%, denoted **), 98.46 (1%, denoted ***). With level shift at unknown date, 131.92 (10%, denoted *), 166.41 (5%, denoted **), 261.39 (1%, denoted ***). Last row: fraction of models exhibiting difference from observations significant at <=5%.

Page 59 of 64

John Wiley & Sons

Environmetrics

For Peer Review

Figure 1. Schematics of two series to be compared. 135x132mm (96 x 96 DPI)

Page 60 of 64

John Wiley & Sons

Environmetrics

For Peer Review

Figure 2. Radiosonde series. Left column: Lower Troposphere. Right column: Mid-Troposphere. Top row: HadAT series. Middle row: RICH series. Bottom row: RAOBCORE series. Least squares trends shown.

101x73mm (300 x 300 DPI)

Page 61 of 64

John Wiley & Sons

Environmetrics

For Peer Review

Figure 3. Model-Observation comparisons allowing for level shift after 1977:12. Left: LT layer, dots indicate monthly data from 3 GCMs, black line, trend through all GCMs, gray dashed line, trend through average of three radiosonde series allowing for step-change at 1977:12. Observed series are shifted down so starting

value coincides with that of model average. Right: same for MT layer. 101x73mm (300 x 300 DPI)

Page 62 of 64

John Wiley & Sons

Environmetrics

For Peer Review

Figure 4. 1958-2010 Trends and 95% CIs for 23 models (shaded region) and three radiosonde series Hadley, RICH and RAOBCORE (respectively, individual markers at right edge). Top row: Trends computed without allowing for level shift. Bottom row: Level shift term included in model. Left column: LT. Right

column: MT. 101x73mm (300 x 300 DPI)

Page 63 of 64

John Wiley & Sons

Environmetrics

For Peer Review

Figure 5. Grid search of VF scores of model-observational equivalence varying the break point across the middle 80% of the sample. Top: LT. Bottom: MT. Horizontal lines show 10%, 5%, and 1% critical values (resp. 131.92, 166.41 and 261.39). Maximum values are at obs 257 352.1 (LT) and obs 250 544.1 (MT).

101x73mm (300 x 300 DPI)

Page 64 of 64

John Wiley & Sons

Environmetrics