Statistical Inference and Prediction in Nonlinear Models using Additive Random Fields∗
Christian M. Dahl†
Department of Economics
Purdue University
Gloria Gonzalez-Rivera
Department of Economics
University of California, Riverside
Yu Qin
Countrywide Financial Corporation
April 5, 2005
Abstract
We study nonlinear models within the context of the flexible parametric random field regression model proposed by Hamilton (2001). Though the model is parametric, it enjoys the flexibility of the nonparametric approach, since it can approximate a large collection of nonlinear functions, and it has the added advantage that there is no "curse of dimensionality." The fundamental premise of this paper is that the parametric random field model, though a good approximation to the true data generating process, is still a misspecified model. We study the asymptotic properties of the estimators of the parameters in the conditional mean and variance of a generalized additive random field regression under misspecification.
∗The notation follows Abadir and Magnus (2002). All software developed for this paper can be obtained from the corresponding author.
†Corresponding author. Address: 403 West State Street, Purdue University, West Lafayette, IN 47907-2056, USA. E-mail: [email protected]. Phone: +1 765-494-4503. Fax: +1 765-496-1778.
The additive specification approximates the contribution of each regressor to the model by an individual random field, such that the conditional mean is the sum of as many independent random fields as there are regressors. This new specification can be viewed as a generalization of Hamilton's, and our results therefore provide "tools" for classical statistical inference that also apply to his model. We develop a test for additivity that is based on a measure of goodness of fit. The test has a limiting Gaussian distribution and is very easy to compute. Through extensive Monte Carlo simulations, we assess the out-of-sample predictive accuracy of the additive random field model and the finite sample properties of the test for additivity, comparing its performance to existing tests. The results are very encouraging and establish the parametric additive random field model as a good alternative to its nonparametric counterparts, particularly in small samples.
Keywords: Random Field Regression, Nonlocal Misspecification, Asymptotics, Test for Additivity.
JEL classification: C12; C15; C22.
1 Introduction
We study nonlinear regression models within the context of the parametric random field model proposed by Hamilton (2001). Stationary random fields have been long-standing tools for the analysis of spatial data, primarily in the fields of geostatistics, environmental science, agriculture, and computer design; see, e.g., Cressie (1993). Several applications can be found in Loh and Lam (2000), Abt and Welch (1998), Ying (1991, 1993), and Mardia and Marshall (1984). The analysis of economic data with random fields is still in a preliminary state of development. In a regression framework, random fields were first introduced by Hamilton (2001) and further developed in Dahl (2002), Dahl and Gonzalez-Rivera (2003), Dahl and Qin (2004), and Dahl and Hylleberg (2004).
Random fields are closely related to universal kriging and thin plate splines. Dahl (2002) points out that Hamilton's estimator of the conditional mean function becomes identical to the cubic spline smoother when the conditional mean function is viewed as a realization of a Brownian motion process.¹ In addition, Dahl (2002) shows that the random field approach has superior predictive accuracy compared to popular nonparametric estimators, such as the spline smoother, when the data are generated from popular econometric models such as LSTAR/ESTAR and various bilinear specifications. Though the random field model is parametric, it enjoys the flexibility of the nonparametric approach, since it can approximate a large collection of nonlinear functions, but with the added advantage that there is no "curse of dimensionality" and no dependence on bandwidth selection or smoothing parameters.
In many of the aforementioned applications (those not related to econometrics and/or economics), the random field model has been treated as the true data generating process. The view presented in this paper is that the parametric random field model, though a good approximation to the true data generating mechanism, is still a misspecified model. However, despite misspecification, Hamilton (2001) shows that it is possible to obtain a consistent estimator of the overall conditional mean function under very general conditions. The theoretical contribution of this paper focuses on the consequences of misspecification for the asymptotic properties of the estimators of the parameters in the conditional mean and in the variance of the error term of the model. This is an important additional contribution to Hamilton's results, which only concern the overall conditional mean function.
An additional innovation of this paper is the specification of additive random fields to model the conditional mean of the variable of interest. Within the nonparametric literature, additive models play a predominant role because they mitigate the "curse of dimensionality" and their estimators have faster convergence rates than those of nonadditive nonparametric models; see, e.g., Stone (1985, 1986). Hastie and Tibshirani (1990) provide an important and thorough analysis of additive models. There is a vast literature on nonparametric estimation of additive models; some recent contributions are Sperlich, Tjostheim and Yang (2002), Yang (2002), and Carroll, Hardle and Mammen (2002). In this paper, we investigate how and to what extent imposing an additive structure on the random field regression model improves its statistical properties when the true data generating process is additive in one or more of its arguments (nonzero interaction terms are allowed). In particular, we propose (in the simplest case, where there are no interaction terms) to approximate the individual contribution of each of the $k$ regressors to the conditional mean by a random field, such that the nonlinear part is the sum of $k$ individual and independent random fields. Our approach differs from Hamilton's model, where one "comprehensive" random field approximates the joint contribution of the $k$ regressors. This new specification can be viewed as a generalized random field, as it collapses to Hamilton's model when no assumptions regarding additivity are imposed. This implies that all the asymptotic results derived in this paper also apply to Hamilton's (2001) model. We will assess the gains in the predictive accuracy of the model when the additive structure is incorporated.
¹Using the results of Kimeldorf and Wahba (1971) and Wahba (1978, 1990) we show how this result generalizes to $x_t \in \mathbb{R}^k$.
First, we provide a complete characterization of the asymptotic theory associated with the estimators of the parameters of the generalized additive random field model. Our environment is similar to Hamilton's (2001). Our results facilitate classical statistical inference in random field regression models, which is far less computationally intensive than the Bayesian inference suggested by Hamilton (2001). Establishing the asymptotic properties of the estimators is non-trivial, mainly because the random field model is viewed as an approximation to the true data generating process, and as such we need to deal with misspecification concerns.
Second, when imposing additivity, we need to evaluate the validity of such a restriction. There is a vast literature on specification tests for additivity in a nonparametric setting. For example, Barry (1993) developed a test for additivity within a nonparametric model with sampling on a discrete grid. Eubank, Hart, Simpson, and Stefanski (1995) analyzed the asymptotic performance of a Tukey-type additivity test based on Fourier series estimation of nonparametric models and proposed a new test, which delivers a consistent estimate of the interaction terms. Chen, Liu, and Tsay (1995) relaxed the restriction of sampling on a grid and proposed an LM-type test for additivity in autoregressive processes. Hong and White (1995) and Wooldridge (1992) proposed several specification tests, which can also be used to test for additivity. Most of these studies on additivity testing rely on nonparametric models. In addition to being computationally demanding, these methods depend very heavily on the bandwidth selection and on the construction of the weight function, which typically are not easily obtained. Our contribution is a new test for additivity that does not depend on a nonparametric estimation procedure. It is based on a measure of goodness of fit, which permits fast computation. The test has a limiting Gaussian distribution, and a Monte Carlo study illustrates that it has very good size and power for a large class of additive and nonadditive data generating processes.
A potential drawback of additive models is the large number of parameters to be estimated. We introduce a more restrictive specification, called the "proportional additive random field model," where the weight ratio between any two random fields is kept fixed. This modification substantially improves the computational aspects of the estimation and testing procedures. Based on extensive Monte Carlo studies, we find that these improvements are achieved without sacrificing predictive efficiency relative to the generalized additive model.
The organization of the paper is as follows. In Section 2 we present the additive random field model. In Section 3 we establish the asymptotic properties of the estimators of the parameters of the model. In Section 4 we develop a new test for additivity and characterize its asymptotic distribution. In Section 5 we conduct various Monte Carlo experiments to analyze the small sample properties of the predictive accuracy of the estimated additive random field model and the size and power properties of the proposed test for additivity. We conclude in Section 6. All proofs can be found in the Mathematical Appendix.
2 Preliminaries
2.1 The additive random field regression model
Let $y_t \in \mathbb{R}$, $x_{t\cdot} \in \mathbb{R}^k$ and consider the model
$$y_t = \mu(x_{t\cdot}) + \epsilon_t, \qquad (1)$$
where $\epsilon_t$ is a sequence of independent and identically $N(0, \sigma^2)$ distributed random variables and $\mu(\cdot) : \mathbb{R}^k \to \mathbb{R}$ is a random function of a $k \times 1$ vector $x_{t\cdot}$, which is assumed to be deterministic.² The vector of explanatory variables is partitioned as $(x_{t1\cdot}, \ldots, x_{tI\cdot})'$, where $x_{ti\cdot} \in \mathbb{R}^{k_i}$ and $\sum_{i=1}^{I} k_i = k$.³ Model (1) is called an $I$-dimensional additive random field model if the conditional mean of $y_t$, i.e. $\mu(x_{t\cdot})$, has a linear component and a stochastic additive nonlinear component such as
$$\mu(x_{t\cdot}) = x_{t\cdot}\beta + \sum_{i=1}^{I} \lambda_i m_i(g_i \odot x_{ti\cdot}), \qquad (2)$$
where $\lambda_i \in \mathbb{R}_+$, $g_i \in \mathbb{R}^{k_i}_+$, $I \leq k$, and, for any choice of $z \in \mathbb{R}^{k_i}$, $m_i(z)$ is a realization of a random field. Model (2) is said to be fully additive when $I = k$, or partially additive when $I < k$. For example, suppose that the true functional form of the conditional mean is given by
$$y_t = \beta_1 x_{t1} + \beta_2 x_{t1}^2 + \beta_3 \sin(x_{t2} x_{t3}) + \epsilon_t.$$
This model is partially additive in $x_{t1}$. Individual random fields approximate the nonlinear components, i.e. $\lambda_1 m_1(g_1 x_{t1}) \approx \beta_2 x_{t1}^2$ and $\lambda_2 m_2(g_2 x_{t2}, g_3 x_{t3}) \approx \beta_3 \sin(x_{t2} x_{t3})$, for $I = 2$, $k_1 = 1$, and $k_2 = 2$. Each of the $I$ random fields is assumed to have the following distribution
$$m_i(z) \sim N(0, 1), \qquad (3)$$
$$E\big(m_i(z) m_j(w)\big) = \begin{cases} H_i(h) & \text{if } i = j \\ 0 & \text{if } i \neq j, \end{cases} \qquad (4)$$
where $h$ is half the Euclidean distance, $h \equiv \frac{1}{2}\left[(z - w)'(z - w)\right]^{1/2}$.⁴ The realization of $m_i(\cdot)$ for $i = 1, 2, \ldots, I$ is considered predetermined and independent of $\{x_{1\cdot}, \ldots, x_{T\cdot}, \epsilon_1, \ldots, \epsilon_T\}$. The $i$'th covariance function $H_i(h)$ is defined as
$$H_i(h) = \begin{cases} G_{k_i-1}(h, 1)/G_{k_i-1}(0, 1) & \text{if } h \leq 1 \\ 0 & \text{if } h > 1, \end{cases} \qquad (5)$$
²Without loss of generality we assume that all variables are demeaned.
³This partition of $x_{t\cdot}$ is made to simplify the exposition. In general, regressors can enter one or more of the random fields in the additive random field model without altering any of the asymptotic results.
⁴$g$ is a $k \times 1$ vector of parameters and $\odot$ denotes element-by-element multiplication, i.e. $g \odot x_t$ is the Hadamard product. $\beta$ is a $k \times 1$ vector of coefficients.
where $G_k(h, r)$, $0 < h \leq r$, is⁵
$$G_k(h, r) = \int_h^r (r^2 - z^2)^{k/2}\, dz. \qquad (6)$$
When the model is fully additive, $x_{ti\cdot}$ becomes a scalar for all $i$ and expression (5) reduces to $H_i(h) = (1 - h)\,1(h \leq 1)$.⁶,⁷ Since $m_i(z)$ is not observable for any choice of $z$, the functional form of $\mu(x_{t\cdot})$ cannot be observed. Hence, inference about the unknown parameters $(\beta, \lambda_1, \ldots, \lambda_I, g_1, \ldots, g_I, \sigma^2)$ must be based on the observed realizations of $y_t$ and $x_{t\cdot}$ only. For this purpose we rewrite model (1) as
$$y = X\beta + \varepsilon, \qquad (7)$$
where $y \equiv y_T$ is a $T \times 1$ vector with $t$-th element equal to $y_t$, $X \equiv X_T$ is a $T \times k$ matrix with $t$-th row equal to $x_{t\cdot}$, $\varepsilon$ is a $T \times 1$ random vector with $t$-th element equal to $\sum_{i=1}^{I} \lambda_i m_i(g_i \odot x_{ti\cdot}) + \epsilon_t$, and $\varepsilon \sim N(0_T, H(\lambda) + \sigma I_T)$, where $H(\lambda) \equiv \sum_{i=1}^{I} \lambda_i H_i$ and, with a slight abuse of notation, $(\lambda_i, \sigma)$ now denotes $(\lambda_i^2, \sigma^2)$. To avoid identification issues, the shape parameters $g_i$ in the random field model are assumed to be fixed, as in Hamilton (2001). Furthermore, we assume that the components of the vector $(\lambda_1, \lambda_2, \ldots, \lambda_I, \sigma)'$ are strictly positive.
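To make the structure of model (7) concrete, the following sketch simulates one draw from a fully additive specification, where each $H_i$ has the simple form $\max(1-h, 0)$. All numerical values, including the choice $g_i = 1$, are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, I = 200, 2                      # sample size and number of additive fields
sigma2 = 0.5                       # idiosyncratic variance ("sigma" in the text)
lam = np.array([1.0, 0.8])         # field variance scales ("lambda_i" in the text)
g = np.array([1.0, 1.0])           # illustrative fixed shape parameters

X = rng.uniform(-2, 2, size=(T, I))   # one regressor per field (fully additive case)
beta = np.array([0.5, -0.3])

def spherical_H(x, gi):
    """T x T spherical covariance for a one-dimensional field:
    H[t, s] = max(1 - h, 0), with h = 0.5 * |gi*x_t - gi*x_s|."""
    z = gi * x
    h = 0.5 * np.abs(z[:, None] - z[None, :])
    return np.maximum(1.0 - h, 0.0)

# Covariance of the composite error: C = sum_i lambda_i H_i + sigma I_T
C = sigma2 * np.eye(T)
for i in range(I):
    C += lam[i] * spherical_H(X[:, i], g[i])

# Draw y = X beta + eps with eps ~ N(0, C); a tiny jitter keeps Cholesky stable
L = np.linalg.cholesky(C + 1e-10 * np.eye(T))
y = X @ beta + L @ rng.standard_normal(T)
```

Note that the diagonal of $C$ equals $\sigma + \sum_i \lambda_i$ since each $H_i$ has unit diagonal.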
Assumption i. The parameter vector $g_i \in \mathbb{R}^{k_i}_+$ for $i = 1, 2, \ldots, I$ in the additive random field model (1) is fixed, with typical element
$$g_{j_i} = \frac{1}{2\sqrt{k_i}\, s_{j_i}^2},$$
for $i = 1, \ldots, I$ and $j_i \in \{1, \ldots, k\}$, where $s_{j_i}^2 = \frac{1}{T}\sum_{t=1}^{T}(x_{j_i t} - \bar{x}_{j_i})^2$ and $\bar{x}_{j_i}$ is the sample mean of the $j_i$-th explanatory variable.
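A minimal sketch of the fixed shape parameters in Assumption i is given below; the block structure passed via `blocks` is a hypothetical interface for mapping regressors to fields, not notation from the paper.

```python
import numpy as np

def default_g(X, blocks):
    """Shape parameters per Assumption i: for regressor j_i in a field of
    dimension k_i, g_{j_i} = 1 / (2 * sqrt(k_i) * s_{j_i}^2), where s_{j_i}^2
    is the sample variance (1/T normalization) of that regressor."""
    g = np.empty(X.shape[1])
    for block in blocks:                 # block = list of column indices in field i
        k_i = len(block)
        for j in block:
            s2 = X[:, j].var()           # ddof=0, i.e. the 1/T variance in the text
            g[j] = 1.0 / (2.0 * np.sqrt(k_i) * s2)
    return g

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
g = default_g(X, blocks=[[0], [1, 2]])   # I = 2 fields: k_1 = 1, k_2 = 2
```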
⁵From Hamilton (2001), $G_k(h, r)$ can be computed recursively from
$$G_0(h, r) = r - h,$$
$$G_1(h, r) = (\pi/4)r^2 - \tfrac{1}{2}h(r^2 - h^2)^{1/2} - (r^2/2)\sin^{-1}(h/r),$$
$$G_k(h, r) = -\frac{h}{1 + k}(r^2 - h^2)^{k/2} + \frac{k r^2}{1 + k} G_{k-2}(h, r),$$
for $k = 2, 3, \ldots$.
⁶The correlation between $m(z)$ and $m(w)$ is given by the volume of the intersection of a $k$-dimensional unit spheroid centered at $z$ and a $k$-dimensional unit spheroid centered at $w$, relative to the volume of a $k$-dimensional unit spheroid. Hence, the correlation between $m(z)$ and $m(w)$ is zero if $h \geq 1$, i.e. if the Euclidean distance between $z$ and $w$ is at least 2.
⁷The reader interested in a critical review of the choice of an appropriate covariance function is referred to Dahl and Gonzalez-Rivera (2003).
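The recursion in footnote 5 and the covariance in (5) translate directly into code; the sketch below checks a few closed-form values of $G_k$ and the fully additive reduction $H_i(h) = 1 - h$.

```python
import numpy as np

def G(k, h, r):
    """Hamilton's (2001) recursion for G_k(h, r) = integral_h^r (r^2 - z^2)^{k/2} dz."""
    if k == 0:
        return r - h
    if k == 1:
        return (np.pi / 4) * r**2 - 0.5 * h * np.sqrt(r**2 - h**2) \
               - (r**2 / 2) * np.arcsin(h / r)
    # k >= 2: integrate by parts down to the k = 0 or k = 1 base case
    return (-h / (1 + k)) * (r**2 - h**2) ** (k / 2) \
           + (k * r**2 / (1 + k)) * G(k - 2, h, r)

def H(h, k_i):
    """Spherical covariance (5): H_i(h) = G_{k_i-1}(h,1) / G_{k_i-1}(0,1) for h <= 1,
    and 0 beyond h = 1 (fields more than two units apart are uncorrelated)."""
    if h > 1.0:
        return 0.0
    return G(k_i - 1, h, 1.0) / G(k_i - 1, 0.0, 1.0)
```

For $k_i = 1$ this reproduces $H(h) = 1 - h$, and in every dimension $H(0) = 1$ with $H$ decreasing to zero at $h = 1$.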
Assumption ii. We define $\theta \equiv (\lambda_1, \ldots, \lambda_I, \sigma)' \in \Theta \subseteq \mathbb{R}^{I+1}_+$, where $\Theta$ is a compact parameter space. There exist sufficiently small but positive real numbers $\underline{\theta} = (\underline{\lambda}_1, \ldots, \underline{\lambda}_I, \underline{\sigma})'$ and sufficiently large positive real numbers $\overline{\theta} = (\overline{\lambda}_1, \ldots, \overline{\lambda}_I, \overline{\sigma})'$, such that $\lambda_i \in [\underline{\lambda}_i, \overline{\lambda}_i]$ for all $i = 1, \ldots, I$ and $\sigma \in [\underline{\sigma}, \overline{\sigma}]$.
We focus on deriving the limiting properties of the maximum likelihood estimators of $\theta = (\lambda_1, \lambda_2, \ldots, \lambda_I, \sigma)'$, the parameters of the nonlinear part of the model. The vector $\beta$ will be treated as a nuisance parameter vector, which can be estimated consistently by ordinary least squares by simply ignoring the possibly nonlinear part of the model (independently of whether this part is additive or not). Model (7) can be viewed as a generalized least squares representation where $\varepsilon$ has the non-spherical covariance matrix $C = H(\lambda) + \sigma I_T$. We can write the average log-likelihood function of $\varepsilon$ as (apart from a constant term)
$$Q_T(\theta, \beta) = -\frac{1}{2T} \ln\det C - \frac{1}{2T}(y - X\beta)' C^{-1} (y - X\beta). \qquad (8)$$
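The average log-likelihood (8) can be sketched as follows; the function signature is our own illustrative choice, and `np.linalg.slogdet` is used for a numerically stable log-determinant.

```python
import numpy as np

def avg_loglik(theta, beta, y, X, H_list):
    """Average log-likelihood (8), up to a constant:
    Q_T(theta, beta) = -(1/2T) ln det C - (1/2T)(y - X beta)' C^{-1} (y - X beta),
    with C = sum_i lambda_i H_i + sigma I_T and theta = (lambda_1, ..., lambda_I, sigma).
    Recall that lambda_i and sigma denote variance-scale parameters here."""
    *lam, sigma = theta
    T = len(y)
    C = sigma * np.eye(T)
    for lam_i, H_i in zip(lam, H_list):
        C += lam_i * H_i
    r = y - X @ beta
    _, logdet = np.linalg.slogdet(C)          # stable log-determinant
    return -logdet / (2 * T) - r @ np.linalg.solve(C, r) / (2 * T)
```

As a degenerate check (outside the positivity assumption, for verification only), setting $\lambda_1 = 0$ and $\sigma = c$ gives $C = cI_T$ and the closed form $-\frac{1}{2}\ln c - \frac{1}{2Tc}(y - X\beta)'(y - X\beta)$.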
2.2 The data generating process
The random field model is an approximation to the true functional form of the conditional
mean. We assume that there is a data generating mechanism that needs to be discovered.
The important question is that of “representability” of the true functional form through
some function of the covariance function of the random field. We follow similar arguments
as in Hamilton (2001). We assume that $y_t$ is generated according to the process
$$y_t = \psi(x_{t\cdot}) + e_t, \qquad (9)$$
for $t = 1, 2, \ldots, T$, where $\psi : \mathbb{R}^k \to \mathbb{R}$ is given by the additive function
$$\psi(x_{t\cdot}) = x_{t\cdot}\alpha + \sum_{i=1}^{I} l_i(x_{ti\cdot}). \qquad (10)$$
In addition, the following assumptions are imposed.
Assumption 1: The sequence $\{x_{t\cdot}\}$ is dense. The deterministic sequence $\{x_{ti\cdot}\}$, with $x_{ti\cdot} \in A_i$, where $A_i = A_{i1} \times A_{i2} \times \cdots \times A_{ik_i}$ is a closed rectangular subset of $\mathbb{R}^{k_i}$, and $\lambda \in \Gamma_0$, is said to be dense for $A_i$ uniformly on the compact space $A_i \times \Gamma_0 \subset \mathbb{R}^{k_i} \times \mathbb{R}$ if there exists a continuous $f_i : A_i \to \mathbb{R}$ such that $f_i(x_{ti\cdot}) > 0$ for all $i$ and $x_{ti\cdot}$, and such that for any $\epsilon > 0$ and any continuous $\phi_i : A_i \times A_i \times \Gamma_0 \to \mathbb{R}$ there exists an $N$ such that for all $T \geq N$,
$$\sup_{A_i \times \Gamma_0} \left| \frac{1}{T} \sum_{s=1}^{T} \phi_i(x_{ti\cdot}, x_{si\cdot}; \lambda) - \int_{A_i} \phi_i(x_{ti\cdot}, x; \lambda) f_i(x)\, dx \right| < \epsilon$$
for all $i = 1, 2, \ldots, I$.
Assumption 2: The function $l_i : \mathbb{R}^{k_i} \to \mathbb{R}$ is representable. Let $A_i$ and $\Gamma_0$ be given as in Assumption 1 and let $l_i : A_i \times \Gamma_0 \to \mathbb{R}$ be an arbitrary continuous function. We say that $l_i(\cdot)$ is representable with respect to $\phi_i(\cdot)$ if there exists a continuous function $f_i : A_i \to \mathbb{R}$ such that
$$l_i(x_{ti\cdot}; \lambda) = \int_{A_i} \phi_i(x_{ti\cdot}, x; \lambda) f_i(x)\, dx.$$
Assumption 1 is important because it implies that we can write
$$l_i(x_{ti\cdot}; \lambda_i) = \lim_{T\to\infty} \frac{1}{T} \sum_{s=1}^{T} \phi_i(x_{ti\cdot}, x_{si\cdot}; \lambda_i) = \lim_{T\to\infty} \frac{1}{T} \sum_{s=1}^{T} \lambda_i H_i(x_{ti\cdot}, x_{si\cdot}) \phi_i(x_{si\cdot}),$$
where $\lambda_i$ is given as in Assumption ii, $H_i(x_{ti\cdot}, x_{si\cdot})$ denotes the $(t, s)$ entry in $H_i$ given by (5), and $\phi_i : \mathbb{R}^{k_i} \to \mathbb{R}$ is an arbitrary continuous function. Note that, by varying $\phi_i(\cdot)$, Assumption 2 describes the general class of nonlinear/linear functions for which $l_i(x_{ti\cdot}; \lambda_i)$ is representable in terms of the spherical covariance function. Importantly, Hamilton (2001) shows that Taylor and Fourier sine series expansions are representable under Assumption 2. We can thus expect the random field model to have good approximation properties over a very broad class of functions. Defining the sample version of $l_i(\cdot)$ as
$$l_{Ti}(x_{ti\cdot}; \lambda_i) = \frac{1}{T} \sum_{s=1}^{T} \lambda_i H_i(x_{ti\cdot}, x_{si\cdot}) \phi_i(x_{si\cdot}), \qquad (11)$$
it follows that $l_{Ti}(x_{ti\cdot}; \lambda_i) \to l_i(x_{ti\cdot}; \lambda_i)$ uniformly on $A_i \times \Gamma_0$ for all $t$ as $T \to \infty$, hereby providing a necessary link between the approximating random field model and the true data generating process. Hamilton (2001) discusses pointwise convergence of $l_{Ti}(\cdot)$ to $l_i(\cdot)$ in $x_{ti\cdot}$ for all $t$. Following similar arguments as in Dahl and Qin (2004), we generalize the convergence to uniform convergence.
Assumption 3: Distribution of $e_t$. The error term $e_t$ is an i.i.d. Gaussian random variable with zero mean and variance $\sigma_e^2$.
Assumption 4: Limiting behavior of "second moments" of $\psi(X)$ and $X$. Let $\psi(X) = (\psi(x_{1\cdot}), \ldots, \psi(x_{T\cdot}))'$, where $\psi(\cdot)$ is given by (10). Assume: i. $\lim_{T\to\infty} \frac{1}{T} X'X$ converges to a finite nonsingular matrix; ii. $\lim_{T\to\infty} \frac{1}{T} \psi(X)'\psi(X)$ converges to a finite scalar uniformly in $\alpha$; iii. $\lim_{T\to\infty} \frac{1}{T} X'\psi(X)$ converges to a finite $k \times 1$ vector uniformly in $\alpha$.
Assumptions 3 and 4 seem somewhat restrictive but are a consequence of working in
Hamilton’s (2001) environment and they serve primarily to shorten the proofs. We find it
important to establish the asymptotic results under these basic assumptions first before
extending Hamilton’s (2001) fundamental assumptions. We conjecture that replacing
Gaussianity with stationarity, ergodicity, a sufficient number of moment conditions in
Assumption 3, and imposing similar conditions on (y,x′)′ in Assumption 4 would not
alter the main results.
3 Asymptotics
In this section we establish three important asymptotic results: (1) consistency of the estimator of the conditional mean in model (2); (2) consistency of the maximum likelihood estimators of the parameters in the nonlinear component of the model; and (3) their asymptotic normality. The estimation procedure is a two-stage approach. In the first stage, we estimate the parameters $\beta$ in the linear component of the model by OLS. In the second stage, the parameters $\theta = \theta(\beta)$ in the nonlinear part of the model are estimated by maximizing the objective function (8). We establish the asymptotic theory for the estimators of $\beta$ and $\theta$ under the additivity assumption in (10). To this end, we need to derive a set of results on uniform convergence of deterministic functions. In the following subsections, we state the main theorems; their proofs can be found in the Mathematical Appendix.
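The two-stage procedure can be sketched as follows for a single one-dimensional field. The DGP, the grid ranges, and the use of a coarse grid search in place of a proper numerical optimizer are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 120
x = np.sort(rng.uniform(-2, 2, T))
X = x[:, None]
y = 0.5 * x + 0.4 * x**2 + 0.3 * rng.standard_normal(T)  # illustrative nonlinear DGP

# Fixed shape parameter in the spirit of Assumption i (fully additive, k_i = 1)
g = 1.0 / (2.0 * x.var())
h = 0.5 * np.abs(g * x[:, None] - g * x[None, :])
H1 = np.maximum(1.0 - h, 0.0)                            # spherical covariance, k_i = 1

# Stage 1: OLS for the linear part, ignoring the random field component
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X @ beta_hat

# Stage 2: maximize the average log-likelihood (8) over theta = (lambda_1, sigma)
def Q(lam, sigma):
    C = lam * H1 + sigma * np.eye(T)
    _, logdet = np.linalg.slogdet(C)
    return -logdet / (2 * T) - r @ np.linalg.solve(C, r) / (2 * T)

grid = np.linspace(0.01, 3.0, 30)
lam_hat, sigma_hat = max(((l, s) for l in grid for s in grid),
                         key=lambda p: Q(*p))
```

In practice a gradient-based optimizer with the bounds of Assumption ii would replace the grid search.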
3.1 Consistency of the conditional mean estimator
Let $\mu_T = (\mu(x_{1\cdot}), \mu(x_{2\cdot}), \ldots, \mu(x_{T\cdot}))'$ and $\xi_T = (\xi_T(x_{1\cdot}), \xi_T(x_{2\cdot}), \ldots, \xi_T(x_{T\cdot}))'$, where $\xi_T(x_{t\cdot}) = E(\mu(x_{t\cdot}) \mid y_T, x_{T\cdot}, y_{T-1}, x_{T-1\cdot}, \ldots)$. Under the assumption of joint normality of $y_T$ and $\mu_T$, the expectation of the conditional mean function, conditional on $y_T, x_{T\cdot}, y_{T-1}, x_{T-1\cdot}, \ldots$, is given by
$$\xi_T = X\beta + H(\lambda)\left(H(\lambda) + \sigma I_T\right)^{-1} (y - X\beta).$$
For the misspecified additive random field model (2), the following theorem states that $\xi_T$ is a consistent estimator of $\mu_T$.
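The conditional mean formula above is a one-line linear smoother; a minimal sketch, with a hypothetical signature of our own choosing, is:

```python
import numpy as np

def xi_hat(y, X, beta, H_lam, sigma):
    """Inferred conditional mean from Section 3.1:
    xi_T = X beta + H(lambda) (H(lambda) + sigma I_T)^{-1} (y - X beta),
    where H_lam = sum_i lambda_i H_i is the weighted field covariance."""
    T = len(y)
    resid = y - X @ beta
    C = H_lam + sigma * np.eye(T)
    return X @ beta + H_lam @ np.linalg.solve(C, resid)
```

Two limiting cases make the smoother's behavior transparent: with $H(\lambda) = 0$ the estimate collapses to the linear fit $X\beta$, and with a dominant positive definite $H(\lambda)$ and $\sigma \to 0$ it interpolates the observed $y$.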
Theorem 1 Let Assumptions 1, 2, and 3 hold and define
$$L_T(\lambda) = (l_T(x_{1\cdot}; \lambda), l_T(x_{2\cdot}; \lambda), \ldots, l_T(x_{T\cdot}; \lambda))',$$
where $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_I)'$, and
$$L_T(\lambda) = \frac{1}{T} \sum_{i=1}^{I} \lambda_i H_i \phi_i,$$
for $\phi_i = (\phi_i(x_{1i\cdot}), \phi_i(x_{2i\cdot}), \ldots, \phi_i(x_{Ti\cdot}))'$. Then,
$$\lim_{T\to\infty} \sup_{\Theta \times A^2} \frac{1}{T} E\left[(\xi_T - X\beta - L_T(\lambda))'(\xi_T - X\beta - L_T(\lambda))\right] \to 0. \qquad (12)$$
Theorem 1 generalizes Theorem 4.7 in Hamilton (2001) in two directions: first, it establishes that the convergence is uniform on $\Theta \times A^2$, and second, it applies to a richer class of random field regression models. From Theorem 1, we can derive some additional results regarding uniform convergence of the following deterministic sequences. These results will play an important role later on, when we establish the asymptotic distribution of the estimators of the parameters.
Corollary 1 Let Assumptions i, ii, and 1 to 4 hold. Then,
$$\lim_{T\to\infty} \sup_{\Theta \times B \times A} \frac{1}{T}(\psi(X) - X\beta)' (H(\lambda) + \sigma I_T)^{-2} (\psi(X) - X\beta) \to 0, \qquad (13)$$
$$\lim_{T\to\infty} \sup_{\Theta \times B \times A} \frac{1}{T}\operatorname{tr}\left[(H(\lambda) + \sigma I_T)^{-1} H(\lambda) H(\lambda) (H(\lambda) + \sigma I_T)^{-1}\right] \to 0, \qquad (14)$$
$$\lim_{T\to\infty} \sup_{\Theta \times B \times A} \frac{1}{T}\operatorname{tr}\left[(H(\lambda) + \sigma I_T)^{-1} H(\lambda) (H(\lambda) + \sigma I_T)^{-1} H(\lambda)\right] \to 0. \qquad (15)$$
Corollary 2 Let Assumptions i, ii, and 1 to 4 hold. Then, for all $i, j = 1, 2, \ldots, I$,
$$\lim_{T\to\infty} \sup_{\Theta \times B \times A} \frac{1}{T}\operatorname{tr}\left[(H(\lambda) + \sigma I_T)^{-1} H_i (H(\lambda) + \sigma I_T)^{-1} H_j\right] \to 0.$$
3.2 Consistency of the parameter estimators
Recall that the average log-likelihood function (8) associated with model (7) is given by
$$Q_T(\theta, \beta) = -\frac{1}{2T} \ln\det C - \frac{1}{2T}(y - X\beta)' C^{-1} (y - X\beta).$$
In this section, we establish consistency of the maximum likelihood estimator of the parameter vector $\theta \equiv (\lambda_1, \ldots, \lambda_I, \sigma)'$. The consistency of $\hat\beta$, i.e., $\hat\beta \xrightarrow{p} \beta^*$, is shown by Dahl and Qin (2004). We proceed as follows: first, we prove that the expectation of $Q_T(\theta, \beta)$ converges uniformly to a limiting function $Q^*(\theta, \beta)$. Second, we prove the main theorem, which states that $\hat\theta \xrightarrow{p} \theta^*$, where $\theta^*$ maximizes $Q^*(\theta, \beta^*)$. For these results to hold, we need a set of auxiliary propositions on uniform convergence of deterministic sequences.
Specifically, let us define $R_T \equiv \frac{1}{T}\log\det C$ and $U_T \equiv \frac{1}{T}(\psi(X) - X\beta)' C^{-1} (\psi(X) - X\beta)$, with corresponding differentials
$$D^l_{i_1,\ldots,i_l} R_T = \frac{\partial^{i_1+\ldots+i_l} R_T}{(\partial\lambda_1)^{i_1} \cdots (\partial\lambda_l)^{i_l}}, \qquad D^l_{i_1,\ldots,i_l} U_T = \frac{\partial^{i_1+\ldots+i_l} U_T}{(\partial\lambda_1)^{i_1} \cdots (\partial\lambda_l)^{i_l}},$$
where $i_1, \ldots, i_l = 0, 1, \ldots, I$, $l = 1, 2, \ldots$, $D_0 R_T = \partial R_T / \partial\sigma$, and $D_0 U_T = \partial U_T / \partial\sigma$. Furthermore, define the limits $D^l_{i_1,\ldots,i_l} R = \lim_{T\to\infty} D^l_{i_1,\ldots,i_l} R_T$, $D^l_{i_1,\ldots,i_l} U = \lim_{T\to\infty} D^l_{i_1,\ldots,i_l} U_T$, $D_0 R = \lim_{T\to\infty} D_0 R_T$, and $D_0 U = \lim_{T\to\infty} D_0 U_T$. The following propositions establish that, under the assumptions stated above, the sequences $R_T$ and $U_T$, as well as their differentials, converge uniformly.
Proposition 1 Given Assumptions i, ii, and 1 to 4, $D_i R$, $D^2_{i,j} R$ and $D_0 U$ are equal to zero uniformly on $\Theta \times B \times A$ for all $i, j = 1, \ldots, I$.
Proposition 2 Given Assumptions i, ii, and 1 to 4, all of the function sequences $\{D^l_{i_1 \ldots i_l} R_T\}_T$ for $l = 1, 2, \ldots$ and $i_1, \ldots, i_l = 0, 1, \ldots, I$ are equicontinuous. Furthermore, each function sequence converges uniformly on $\Theta \times A$ as $T \to \infty$.

Proposition 3 Given Assumptions i, ii, and 1 to 4, all of the function sequences $\{D^l_{i_1 \ldots i_l} U_T\}_T$ for $l = 1, 2, \ldots$ and $i_1, \ldots, i_l = 0, 1, \ldots, I$ are equicontinuous. Furthermore, each function sequence converges uniformly on $\Theta \times B \times A$ as $T \to \infty$.
Next, we define the following second order sample moment matrices
$$M_{x_{\cdot i} x_{\cdot j}}(\theta) \equiv \frac{1}{T} x_{\cdot i}' (H(\lambda) + \sigma I_T)^{-1} x_{\cdot j},$$
$$M_{\psi x_{\cdot i}}(\theta) \equiv \frac{1}{T} \psi(X)' (H(\lambda) + \sigma I_T)^{-1} x_{\cdot i},$$
$$M_{\psi\psi}(\theta) \equiv \frac{1}{T} \psi(X)' (H(\lambda) + \sigma I_T)^{-1} \psi(X),$$
where $x_{\cdot i} = (x_{1i}, \ldots, x_{Ti})'$ for $i, j = 1, 2, \ldots, k$. Then, we have the following result.
Proposition 4 Given Assumptions i, ii, and 1 to 4, all of the function sequences $\{D^l_{i_1 \ldots i_l} M_{x_{\cdot i} x_{\cdot j}}\}_T$, $\{D^l_{i_1 \ldots i_l} M_{\psi x_{\cdot i}}\}_T$, and $\{D^l_{i_1 \ldots i_l} M_{\psi\psi}\}_T$ for $l = 1, 2, \ldots$ and $i_1, \ldots, i_l = 0, 1, \ldots, I$ are equicontinuous. Furthermore, each function sequence converges uniformly on $\Theta \times B \times A$ as $T \to \infty$, and
$$\lim_{T\to\infty} \sup_{\Theta \times B \times A} D_0 M_{x_{\cdot i} x_{\cdot j}}(\theta) \to 0, \qquad (16)$$
$$\lim_{T\to\infty} \sup_{\Theta \times B \times A} D_0 M_{\psi x_{\cdot i}}(\theta) \to 0, \qquad (17)$$
$$\lim_{T\to\infty} \sup_{\Theta \times B \times A} D_0 M_{\psi\psi}(\theta) \to 0. \qquad (18)$$
Next, consider the objective function (8). Let $Q^*_T(\theta, \beta) = E(Q_T(\theta, \beta))$. Substituting $y = \psi(X) + e$ in (8) and taking expectations, we have
$$Q^*_T(\theta, \beta) = -\frac{1}{2T}\log\det C - \frac{1}{2T}(\psi(X) - X\beta)' C^{-1} (\psi(X) - X\beta) - \frac{\sigma_e^2}{2T}\operatorname{tr}\left(C^{-1}\right), \qquad (19)$$
which can also be written, according to the notation established previously, as
$$Q^*_T(\theta, \beta) = -\frac{1}{2} R_T - \frac{1}{2} U_T - \frac{\sigma_e^2}{2} D_0 R_T.$$
Theorem 2 Let Assumptions i, ii, and 1 to 4 hold. Then
$$\lim_{T\to\infty} \sup_{\Theta \times B \times A} |Q_T(\theta, \beta) - Q^*_T(\theta, \beta)| \xrightarrow{p} 0, \qquad (20)$$
and
$$\lim_{T\to\infty} \sup_{\Theta \times B \times A} |Q^*_T(\theta, \beta) - Q^*(\theta, \beta)| \to 0, \qquad (21)$$
where $Q^*(\theta, \beta) = -\frac{1}{2} R - \frac{1}{2} U - \frac{\sigma_e^2}{2} D_0 R$.
By Theorem 2, it can be seen that $Q_T(\theta, \beta) \xrightarrow{p} Q^*(\theta, \beta)$. The existence of a unique $\theta^* \in \Theta$ that maximizes $Q^*(\theta, \beta)$ for a given $\beta$ is guaranteed by the following theorem.

Theorem 3 Let $U(\theta)$ be a convex function of $\theta \in \Theta$ and let $\operatorname{tr}(C^{-1}C^{-1}) < 2\sigma_e^2 \operatorname{tr}(C^{-1}C^{-1}C^{-1})$. Then $Q^*(\theta, \beta)$ is a concave function of $\theta \in \Theta$ and has a unique maximizer $\theta^*$ in $\Theta$. A necessary condition for concavity of $Q^*(\theta, \beta)$ in $\sigma$ is given by the condition $\sigma \leq 2\sigma_e^2$.
In the following sections, we will derive a consistent estimator of $\sigma_e^2$ that will permit empirical verification of the necessary and sufficient conditions for identification. Now, we can establish consistency of the estimators of the parameters.

Theorem 4 Let Assumptions i, ii, and 1 to 4 hold, and let $\hat\beta$ be the first-stage OLS estimator of the linear parameter $\beta$, such that $\hat\beta \xrightarrow{p} \beta^*$. Then
$$\hat\theta \xrightarrow{p} \theta^*,$$
where $\hat\theta$ is the two-stage estimator and $\theta^*$ maximizes $Q^*(\theta, \beta^*)$.
3.3 Asymptotic normality
In this section we establish the asymptotic distribution of $\hat\theta$. In the first stage, we compute the OLS estimator $\hat\beta$, which is taken as a nuisance parameter in the second stage. In the second stage, we estimate $\hat\theta$. The estimation of $\beta$ introduces further variability into the estimation of $\theta$, which will be reflected in the moments of its asymptotic distribution. We begin with the derivation of the asymptotic distribution of the score function. The objective function associated with the first-stage OLS estimation is given by
$$m_T(\beta) = -\frac{1}{T}\sum_{t=1}^{T}(y_t - x_{t\cdot}\beta)^2. \qquad (22)$$
We stack the gradient vector of (22) and the score vector associated with (8) in a $(I + 1 + k) \times 1$ vector as
$$g_T(\theta, \beta) = \left(D_\theta Q_T(\theta, \beta)',\, D_\beta m_T(\beta)'\right)', \qquad (23)$$
where
$$D_\theta Q_T(\theta, \beta) = \begin{pmatrix} D_1 Q_T(\theta, \beta) \\ \vdots \\ D_I Q_T(\theta, \beta) \\ D_0 Q_T(\theta, \beta) \end{pmatrix} = \begin{pmatrix} -\frac{1}{2T}\operatorname{tr}\left(C^{-1}H_1\right) + \frac{1}{2T}(y - X\beta)' C^{-1} H_1 C^{-1} (y - X\beta) \\ \vdots \\ -\frac{1}{2T}\operatorname{tr}\left(C^{-1}H_I\right) + \frac{1}{2T}(y - X\beta)' C^{-1} H_I C^{-1} (y - X\beta) \\ -\frac{1}{2T}\operatorname{tr}\left(C^{-1}\right) + \frac{1}{2T}(y - X\beta)' C^{-1} C^{-1} (y - X\beta) \end{pmatrix}, \qquad (24)$$
and
$$D_\beta m_T(\beta) = \frac{2}{T}\left(\sum_{t=1}^{T} x_{t1}(y_t - x_{t\cdot}\beta), \ldots, \sum_{t=1}^{T} x_{tk}(y_t - x_{t\cdot}\beta)\right)'.$$
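As a numerical sanity check on the score expressions in (24), the sketch below compares the analytic derivative $D_1 Q_T$ with a central finite difference of $Q_T$ with respect to $\lambda_1$, taking $\beta = 0$ and a single one-dimensional spherical field. All numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 60
y = rng.standard_normal(T)
r = y                                # beta = 0 for simplicity, so y - X beta = y
x = np.sort(rng.uniform(-1, 1, T))
h = 0.5 * np.abs(x[:, None] - x[None, :])
H1 = np.maximum(1.0 - h, 0.0)        # spherical covariance, k_i = 1

def Q(lam, sigma):
    C = lam * H1 + sigma * np.eye(T)
    _, logdet = np.linalg.slogdet(C)
    return -logdet / (2 * T) - r @ np.linalg.solve(C, r) / (2 * T)

def score_lambda(lam, sigma):
    """Analytic derivative D_1 Q_T from (24):
    -(1/2T) tr(C^{-1} H_1) + (1/2T) r' C^{-1} H_1 C^{-1} r."""
    C = lam * H1 + sigma * np.eye(T)
    Ci_r = np.linalg.solve(C, r)
    Ci_H = np.linalg.solve(C, H1)
    return -np.trace(Ci_H) / (2 * T) + Ci_r @ H1 @ Ci_r / (2 * T)

lam0, sig0, eps = 0.7, 0.9, 1e-6
numeric = (Q(lam0 + eps, sig0) - Q(lam0 - eps, sig0)) / (2 * eps)
analytic = score_lambda(lam0, sig0)
```

The two quantities agree to finite-difference accuracy, confirming the trace-plus-quadratic-form structure of the score.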
The following proposition provides the exact variance of $\sqrt{T} g_T$.

Proposition 5 Let Assumptions i, ii, and 1 to 4 hold. Then
$$\operatorname{cov}\left(\sqrt{T} D_{i_1} Q_T(\theta, \beta),\, \sqrt{T} D_{i_2} Q_T(\theta, \beta)\right) = \frac{\sigma_e^4}{2}\, \frac{1}{T}\operatorname{tr}\left(C^{-1} H_{i_1} C^{-1} C^{-1} H_{i_2} C^{-1}\right) + \sigma_e^2\, \frac{1}{T}\, c' C^{-1} H_{i_1} C^{-1} C^{-1} H_{i_2} C^{-1} c, \qquad (25)$$
$$\operatorname{cov}\left(\sqrt{T} D_{i_1} Q_T(\theta, \beta),\, \sqrt{T} D_{j_1} m_T(\beta)\right) = -2\sigma_e^2\, D_{i_1}\left(M_{x_{j_1}\psi} - \sum_{j=1}^{k} M_{x_{j_1} x_j} \beta_j\right), \qquad (26)$$
$$\operatorname{cov}\left(\sqrt{T} D_{j_1} m_T(\beta),\, \sqrt{T} D_{j_2} m_T(\beta)\right) = \frac{4\sigma_e^2}{T}\, x_{\cdot j_1}' x_{\cdot j_2}, \qquad (27)$$
for all $i_1, i_2 = 0, 1, 2, \ldots, I$ and $j_1, j_2 = 1, 2, \ldots, k$. Furthermore, (25), (26) and (27) converge uniformly on $\Theta \times B \times A$ as $T \to \infty$.
The asymptotic distribution of $\sqrt{T} g_T$ is given in the following theorem.

Theorem 5 Let $g_T(\theta, \beta)$ be given by (23) and let Assumptions i, ii, and 1 to 4 hold. Then, as $T \to \infty$,
$$\sqrt{T} g_T(\theta^*, \beta^*) \xrightarrow{d} N(0, \Sigma^*),$$
where $\Sigma^* = \Sigma(\theta^*, \beta^*) = \begin{pmatrix} \Sigma^*_{11} & \Sigma^*_{12} \\ \Sigma^{*\prime}_{12} & \Sigma^*_{22} \end{pmatrix}$ and
$$\Sigma^*_{11} = \frac{\sigma_e^4}{2} \lim_{T\to\infty} \frac{1}{T}\operatorname{tr}\left(C^{-1} H_{i_1} C^{-1} C^{-1} H_{i_2} C^{-1}\right),$$
$$\Sigma^*_{12} = -2\sigma_e^2 \lim_{T\to\infty} D_{i_1} M_{x_{j_1}\psi},$$
$$\Sigma^*_{22} = 4\sigma_e^2 \lim_{T\to\infty} \frac{1}{T} X'X.$$
The asymptotic normality of $\hat\theta$ can then be established in the following Theorem 6.

Theorem 6 Let Assumptions i, ii, and 1 to 4 hold. Define $\zeta = (\theta', \beta')'$, let $Q_T(\zeta)$ and $m_T(\beta)$ be given by (8) and (22) respectively, and let $\Sigma^*$ be defined as in Theorem 5. Then
$$\sqrt{T}\left(\hat\zeta_T - \zeta^*\right) \xrightarrow{d} N(0, M^*), \qquad (28)$$
where $M^* = G^{*-1}\Sigma^* G^{*-1\prime}$, for $G^* = \lim_{T\to\infty} G_T(\zeta^*)$, and
$$G_T(\zeta) = D_\zeta g_T(\zeta) = \begin{pmatrix} D^2_{\theta\theta} Q_T(\zeta) & D^2_{\theta\beta} Q_T(\zeta) \\ 0_{k \times (I+1)} & D^2_{\beta\beta} m_T(\beta) \end{pmatrix}.$$
For the parameter of interest $\theta$, notice that
$$\sqrt{T}(\hat\theta_T - \theta^*) \xrightarrow{d} N(0, M^*_{11}),$$
where $M^*_{11}$ is the upper left corner of the matrix $M^*$, which is equal to
$$M^*_{11} = (D^2_{\theta\theta} Q^*(\zeta^*))^{-1}\Sigma^*_{11}(D^2_{\theta\theta} Q^*(\zeta^*))^{-1} + \sigma_e^2\, (D^2_{\theta\theta} Q^*(\zeta^*))^{-1} D^2_{\theta\beta} Q^*(\zeta^*)\left(\lim_{T\to\infty}\frac{1}{T}X'X\right)^{-1}(D^2_{\theta\beta} Q^*(\zeta^*))'(D^2_{\theta\theta} Q^*(\zeta^*))^{-1}. \qquad (29)$$
A consistent estimator of the variance $M^*_{11}$ is obtained by substituting $\sigma_e^2$ and $\psi(X)$ with their respective consistent estimators $\hat\sigma_e^2$ and $\hat\mu$. Establishing the asymptotic normality and consistency of the estimators of the parameters of the additive random field model in Theorem 6 is useful for a number of reasons. These asymptotic results can be used to construct confidence bands for the estimated conditional mean function without the use of the computer-intensive Bayesian methods suggested by Hamilton (2001). Such bands are extremely useful in testing hypotheses about the functional form of the data generating process. Another important application of the asymptotic distribution is that it can be used to establish the asymptotic distribution of a very simple parametric test for additivity, which is discussed in the following section.
4 Testing for additivity
There are several additivity tests in the nonparametric literature, for example Chen, Liu, and Tsay (1995), and in the parametric literature, for example Barry (1993) and Eubank, Hart, Simpson, and Stefanski (1995). The last two tests suffer from a restrictive sampling scheme, because the explanatory variables are sampled on a grid, while the first relies on the selection of a data-dependent bandwidth. We propose an additivity test within the framework of a parametric random field model that depends neither on a restrictive sampling scheme nor on bandwidth selection. As a by-product, we also propose a "new" consistent estimator of $\sigma_e^2$.
The null hypothesis is that the data generating process is given by (9) and (10), described in Section 2.2. The test for additivity will assess the goodness of fit of the misspecified additive random field model $\mu(x_{t\cdot}) = x_{t\cdot}\beta + \sum_{i=1}^{I}\lambda_i m_i(g_i \odot x_{ti\cdot})$. A direct measure of the goodness of fit of the additive random field model is the estimation error. By Theorem 1, the mean squared error in the estimation of the conditional mean converges to 0 uniformly if the data generating process is additive in the regressors. Using this result, the test is based on the estimation error of the observed response $y_t$, which is the sum of the true conditional mean $\psi(x_{t\cdot})$ and the error term $e_t$.
Theorem 7 Let Assumptions i, ii, and 1 to 4 hold. Define the estimation error as $\hat\varepsilon = y - \left(X\beta + H(\lambda) C^{-1}(y - X\beta)\right)$, where $C = H(\lambda) + \sigma I_T$. Then
$$\frac{1}{\sqrt{T}}\left(\hat\varepsilon'\hat\varepsilon + \sigma^2 T D_0 U_T + \sigma^2\sigma_e^2 T D^2_{00} R_T\right) \stackrel{a}{\sim} N\left(0,\, -\frac{1}{3}\sigma^4\sigma_e^4 D^4_{0000} R\right), \qquad (30)$$
for any $(\theta', \beta')' \in \Theta \times B$.
Note that the quantities $D_0 U_T(\theta)$, $D^2_{00} R_T(\theta)$, and $D^4_{0000} R(\theta)$ are all nonpositive. It follows from Theorem 7 that
$$\lim_{T\to\infty} \sup_{\Theta \times B}\left|\frac{1}{T}\hat\varepsilon'\hat\varepsilon + \sigma^2\sigma_e^2 D^2_{00} R_T\right| = 0, \qquad (31)$$
because the deterministic term $\sigma^2 D_0 U_T$ converges to zero by Proposition 1. Furthermore, it follows from (31) that a consistent estimator of $\sigma_e^2$ can be constructed as described in the following corollary.
Corollary 3 Let Assumptions i, ii, and 1 to 4 hold. Let $\hat\varepsilon$ be defined as in Theorem 7. Then, as $T \to \infty$,
$$\frac{\frac{1}{T}\hat\varepsilon'\hat\varepsilon}{-\hat\sigma^2 D^2_{00}\hat{R}_T} \xrightarrow{p} \sigma_e^2, \qquad (32)$$
where $\hat\theta = (\hat\lambda', \hat\sigma)'$ and $\hat\beta$ are the consistent two-stage estimators.
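Since $R_T = \frac{1}{T}\log\det C$ implies $D^2_{00} R_T = -\frac{1}{T}\operatorname{tr}(C^{-2})$, the estimator in (32) is directly computable. A minimal sketch, assuming the two-stage estimates are already in hand:

```python
import numpy as np

def sigma2_e_hat(eps_hat, H_lam, sigma):
    """Consistent estimator of sigma_e^2 implied by (32):
    (eps' eps / T) divided by -sigma^2 * D^2_00 R_T, where
    R_T = (1/T) log det C gives D^2_00 R_T = -(1/T) tr(C^{-2})
    and C = H(lambda) + sigma I_T.  Here sigma is the paper's
    variance-scale parameter, so sigma^2 is its square as in (32)."""
    T = len(eps_hat)
    C = H_lam + sigma * np.eye(T)
    Cinv = np.linalg.inv(C)
    D2_00_R = -np.trace(Cinv @ Cinv) / T
    return (eps_hat @ eps_hat / T) / (-sigma**2 * D2_00_R)
```

In the degenerate case $H(\lambda) = 0$, $C = \sigma I_T$ and the $\sigma$ factors cancel, so the estimator reduces to $\hat\varepsilon'\hat\varepsilon / T$, the usual residual variance.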
Note that a careful examination of the proof of Theorem 7 indicates that the results in (31) and (32) do not specifically rely on the additive random field framework; they hold for any general random field model. Therefore, the consistent estimator of $\sigma_e^2$ can also be obtained by fitting a random field model to a nonadditive or additive model. One only needs to replace the corresponding quantities, for example $\hat\varepsilon$ and $D^2_{00} R_T$, with their sample counterparts in the random field model under consideration. Based on Theorem 7, we propose the following test statistic for additivity.
Corollary 4 Let θ̂ = (λ̂′, σ̂)′, β̂, and σ̂²_e be consistent estimators. Under the null hypothesis of an additive data generating process, the statistic ResT_DGQ is asymptotically standard normally distributed, i.e.,

ResT_DGQ ≡ [(1/T)ε′ε + σ̂²σ̂²_e D²_{00}R_T] / √(−(2/(3T))σ̂⁴σ̂²_e D³_{000}U_T − (1/(3T))σ̂⁴σ̂⁴_e D⁴_{0000}R_T) →d N(0, 1).  (33)
We will refer to ResT_DGQ as the residual-based test statistic. Notice that we have added the asymptotically vanishing term D³_{000}U_T to the variance and deleted the asymptotically vanishing term D_0U_T from the mean in order to obtain better finite sample properties. In particular, the additional term in the variance corrects the tendency of the test to over-reject a true null hypothesis in finite samples. Note that, by Proposition 1, lim_{T→∞} D²_{ij}R_T = 0 and lim_{T→∞} D_0U_T = 0 uniformly. While the first term relies solely on the covariance matrix of the random field, it is the second term that actually reflects the approximation of the true data generating process (hereafter, DGP) by the additive random field model. In other words, when the true DGP is not additive, the additive random field model will not be a good approximation, that is, lim_{T→∞} D_0U_T ↛ 0. However, D_0U_T is bounded, as

|D_0U_T| ≤ (1/(Tσ²))(ψ(X) − Xβ)′(ψ(X) − Xβ).

We can assume that when the true DGP is not additive, D_0U_T = O(1) for every (θ′, β′)′ ∈ Θ × B; thus, under the alternative hypothesis, ResT_DGQ = O_p(√T).⁸ This property agrees with most specification tests in the literature, see, e.g., Wooldridge (1992), and provides an argument for the consistency of the test statistic ResT_DGQ: under a true alternative, ResT_DGQ generally takes a large value. The test statistic ResT_DGQ is not yet directly computable due to the unknown but asymptotically vanishing term D³_{000}U_T. Since the estimator of the conditional mean ψ(X) is given by µ̂ = Xβ̂ + H(λ̂)C⁻¹(θ̂)(y − Xβ̂) and

lim_{T→∞} sup_{Θ×B} |D³_{000}Û_T| = 0,  (34)

where Û_T = (1/T)(µ̂ − Xβ)′C⁻¹(θ)(µ̂ − Xβ), we suggest replacing D³_{000}U_T by D³_{000}Û_T when computing the test statistic ResT_DGQ. This will not result in a loss of asymptotic power but will improve the small sample properties of the test. Summarizing, the test statistic ResT_DGQ can be computed by the following simple 3-step procedure:
⁸This result can be seen by multiplying (57) in the Mathematical Appendix by √T and noticing that the resulting last two terms on the right-hand side will be bounded in probability by a central limit theorem. After multiplying by √T, the first term on the right-hand side of (57), however, will be O(√T) when D_0U_T = O(1).
Step 1 Fit a random field model with only one comprehensive random field as in Hamilton (2001), Dahl and Gonzalez-Rivera (2003), or Dahl and Qin (2004). Compute the consistent estimator σ̂²_e as in (32) based on this random field model.

Step 2 Fit the additive random field model (2), and then use the σ̂²_e obtained in Step 1 to calculate the test statistic

ResT_DGQ = [(1/T)ε′ε + σ̂²σ̂²_e D²_{00}R_T] / √(−(2/(3T))σ̂⁴σ̂²_e D³_{000}Û_T − (1/(3T))σ̂⁴σ̂⁴_e D⁴_{0000}R_T),

by plugging in the two-stage estimates β̂ and θ̂.

Step 3 Reject the null if |ResT_DGQ| > Φ⁻¹(1 − α/2), where Φ(·) is the c.d.f. of the standard normal distribution and α denotes the nominal level of the test.
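For concreteness, the 3-step procedure can be sketched in code. The following is a minimal illustration, not the authors' implementation: it takes the fitted quantities from Steps 1 and 2 as given (H(λ̂), σ̂, β̂, and σ̂²_e), and it assumes that index 0 denotes differentiation with respect to σ (so that D_0C = I_T), from which the closed forms for D²_{00}R_T, D³_{000}Û_T, and D⁴_{0000}R_T below are inferred following the differential patterns of Propositions 1 to 3; these closed forms are our reading of the formulas, not quoted from the paper.

```python
import numpy as np

def res_tdgq(y, X, beta_hat, H_hat, sigma_hat, sigma2_e_hat):
    """Sketch of the ResT_DGQ additivity statistic of Corollary 4.

    H_hat is the fitted covariance matrix H(lambda_hat) of the additive field
    and sigma_hat the fitted weight of the identity component, so that
    C = H_hat + sigma_hat * I_T. sigma2_e_hat is the Step 1 estimate from (32).
    """
    T = len(y)
    C = H_hat + sigma_hat * np.eye(T)
    Cinv = np.linalg.inv(C)
    resid0 = y - X @ beta_hat
    eps = resid0 - H_hat @ Cinv @ resid0           # estimation error of Theorem 7
    mu_hat = X @ beta_hat + H_hat @ Cinv @ resid0  # estimated conditional mean
    c_hat = mu_hat - X @ beta_hat                  # plug-in for psi(X) - X beta

    C2 = Cinv @ Cinv
    C4 = np.linalg.matrix_power(Cinv, 4)
    D2_00R = -np.trace(C2) / T            # assumed 2nd sigma-derivative of R_T
    D4_0000R = -6.0 * np.trace(C4) / T    # assumed 4th sigma-derivative of R_T
    D3_000U = -6.0 * c_hat @ C4 @ c_hat / T  # assumed 3rd sigma-derivative of U_T-hat

    num = eps @ eps / T + sigma_hat**2 * sigma2_e_hat * D2_00R
    var = (-2.0 / (3 * T) * sigma_hat**4 * sigma2_e_hat * D3_000U
           - 1.0 / (3 * T) * sigma_hat**4 * sigma2_e_hat**2 * D4_0000R)
    return num / np.sqrt(var)
```

Since D³_{000}Û_T and D⁴_{0000}R_T are nonpositive, both variance terms are nonnegative and the square root is well defined.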
The auxiliary random field model in Step 1 is only used to estimate the variance σ²_e of the true disturbance term. This objective can also be achieved by other consistent estimators of σ²_e, such as the nonparametric estimator suggested by Hall, Kay, and Titterington (1990), which enjoys the parametric rate of convergence. We notice that too high an estimate of σ²_e usually results in a less powerful but more conservative test. In the Monte Carlo section we discuss the robustness of the residual-based test to “overestimates” of σ²_e within a large class of nonadditive data generating processes.
We conclude this section by discussing why it is not appropriate to construct a nested additivity test based on the likelihood function, such as an LM-type test as proposed by Hamilton (2001) and Dahl and Gonzalez-Rivera (2003) to detect neglected nonlinearity of a more general form. To perform a nested additivity test, we need under the alternative hypothesis a general model that includes a comprehensive random field, i.e., y_t = x_t·β + ∑_{i=1}^I λ_i m_i(x_{ti·}) + λ_h m_h(x_t·) + ε_t, where m_h(x_t·) denotes the comprehensive random field defined on the compact set A^k ⊂ R^k. Then, an LM-type test for additivity will have a null hypothesis defined as

H_0 : λ_h ≡ λ²_h = 0.  (35)

In this setting, one has to allow the domain of the parameter λ_h to be a compact set A_h ⊂ R containing the origin. On theoretical grounds, the inclusion of the origin in
the parameter space invalidates assumption ii., with critical consequences for the validity of the asymptotic results presented. On numerical grounds, we find that, under the null hypothesis (35), the elements of the Hessian matrix D²_{hh}Q_T(θ; β), i.e., the expression (1/T)c′C⁻¹H_hC⁻¹H_hC⁻¹c, do not converge; in most cases, this quantity actually explodes.
5 Simulation experiments
We perform several Monte Carlo experiments with a wide range of additive and nonad-
ditive data generating processes to evaluate the predictive power of the additive random
field model, and the performance of the residual based additivity test. In Table 1, we
present sixteen data generating processes: eight with an additive structure (A1 to A8) and eight with a nonadditive structure (N1 to N8). The nonlinear specifications are diverse: polynomials, logarithms, exponentials, thresholds, and sine functions, covering many of the most popular econometric models used in applied work. The explanatory variables (x_{t1}, x_{t2})′ are sampled independently from a uniform distribution, i.e., x_{ti} ∼ U(−5, 5) for i = 1, 2. Models A5-A8 and N5-N8 are adapted from Chen, Liu, and Tsay
(1995), with certain modifications of the coefficients to accommodate the uniformly de-
signed sampling domain. We conduct 1000 Monte Carlo replications with a sample size
of 100 observations for estimation and 100 out-of-sample points for the one-step ahead
prediction.
In this section, we introduce an estimator/algorithm based on a simplified version of
the additive random field model, which reduces the computational burden significantly
but maintains almost the same predictive accuracy as the additive model described in
the previous sections. We call the alternative specification the “proportional additive
random field model” and it is given by

y_t = x_t·β + λ ∑_{i=1}^I c_i m_i(g_i ⊙ x_{ti·}) + ε_t,  (36)

where c_i is a predetermined proportional weight of the i-th random field, λ is the total weight of the random field component, and ε_t ∼ IN(0, σ²). Studies of proportional additivity in nonparametric models can be found in Yang (2002) and Carroll, Hardle,
Table 1: Additive and nonadditive data generating processes. It is assumed that x_{ti} ∼ i.i.d. U(−5, 5) for i = 1, 2 and e_t ∼ N(0, σ²_e).

Model  True DGP
A1  y_t = 1 + 0.1x²_{t1} + 0.2x_{t2} + e_t
A2  y_t = 0.1x²_{t1} + 0.2 ln(x_{t2} + 6) + e_t
A3  y_t = 0.5(x_{t1} − 1)/(x_{t1} + 6) − 0.5 exp(0.5x_{t2}) + e_t
A4  y_t = 1.5 sin(x_{t1}) + 2 sin(x_{t2}) + e_t
A5  y_t = 0.5x_{t1} + sin(x_{t2}) + e_t
A6  y_t = 0.8x_{t1} − 0.3x_{t2} + e_t
A7  y_t = exp(−0.5x²_{t1})x_{t1} − exp(−0.5x²_{t2})x_{t2} + e_t
A8  y_t = −2x_{t1}1{x_{t1} ≤ 0} + 0.4x_{t2}1{x_{t2} > 0} + e_t
N1  y_t = 0.3x²_{t1}x_{t2} + 0.2x²_{t2} + e_t
N2  y_t = sin(x_{t1} + x_{t2}) + e_t
N3  y_t = 1.5 sin(x_{t1} + 0.2x_{t2}) + 2 sin(0.5x_{t1} + x_{t2}) + e_t
N4  y_t = 2·1{s_t < −8} + 1{−8 ≤ s_t < −6} − 1{−6 ≤ s_t < −4} + 2·1{−4 ≤ s_t < −2} + 1{−2 ≤ s_t < 0} − 1{0 ≤ s_t < 2} + 2·1{2 ≤ s_t < 4} + 1{4 ≤ s_t < 6} − 1{6 ≤ s_t < 8} + 2·1{s_t ≥ 8} + e_t, where s_t ≡ x_{t1} + x_{t2}
N5  y_t = x_{t1} sin(x_{t2}) + e_t
N6  y_t = 2 exp(−0.5x²_{t1})x_{t1} − exp(−0.5x²_{t1})x_{t2} + e_t
N7  y_t = (0.5x_{t1} − 0.4x_{t2})1{x_{t1} < 0} + (0.5x_{t1} + 0.3x_{t2})1{x_{t1} ≥ 0} + e_t
N8  y_t = x_{t1}(x_{t2} − 0.5)₊ + 0.8x_{t1}(0.5 − x_{t2})₊ + e_t
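The designs in Table 1 are straightforward to reproduce. The sketch below is an illustration of the simulation setup, not part of the paper's software; it draws one sample from two representative designs, the additive A1 and the nonadditive N5.

```python
import numpy as np

def simulate_dgp(model, T, sigma2_e, rng):
    """Draw one sample from Table 1 designs A1 (additive) or N5 (nonadditive).

    Regressors are i.i.d. U(-5, 5) and e_t ~ N(0, sigma2_e), as in Table 1;
    sigma2_e is the error variance.
    """
    x1 = rng.uniform(-5.0, 5.0, T)
    x2 = rng.uniform(-5.0, 5.0, T)
    e = rng.normal(0.0, np.sqrt(sigma2_e), T)
    if model == "A1":                  # y_t = 1 + 0.1 x_t1^2 + 0.2 x_t2 + e_t
        y = 1.0 + 0.1 * x1**2 + 0.2 * x2 + e
    elif model == "N5":                # y_t = x_t1 sin(x_t2) + e_t
        y = x1 * np.sin(x2) + e
    else:
        raise ValueError("only A1 and N5 are sketched here")
    return y, np.column_stack([x1, x2])
```

The remaining fourteen designs follow the same pattern with the formulas of Table 1.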
Table 2: Prediction mean squared errors (MSE) and simulated standard errors (in parentheses) for the additive random field model (MSE_ARF), the proportional additive random field model (MSE_PARF), and Hamilton's random field model (MSE_RF). The data generating processes A1-A8 are specified in Table 1. The sample size and number of Monte Carlo replications are 100 and 1000, respectively. The MSEs are based on 100 out-of-sample one-step ahead predictions.

Model  σ²_e  MSE_ARF  MSE_PARF  MSE_RF
0.01 0.0126 (0.0030) 0.0149 (0.0034) 0.0264 (0.0028)
A1 0.04 0.0397 (0.0080) 0.0455 (0.0159) 0.0690 (0.0076)
0.25 0.2517 (0.0270) 0.2605 (0.0283) 0.3154 (0.0296)
0.01 0.0127 (0.0033) 0.0161 (0.0034) 0.0265 (0.0027)
A2 0.04 0.0402 (0.0080) 0.0459 (0.0158) 0.0695 (0.0076)
0.25 0.2541 (0.0271) 0.2616 (0.0284) 0.3176 (0.0298)
0.01 0.0204 (0.0034) 0.0217 (0.0040) 0.0746 (0.0055)
A3 0.04 0.0690 (0.0096) 0.0730 (0.0142) 0.1293 (0.0120)
0.25 0.3185 (0.0369) 0.3099 (0.0382) 0.4524 (0.0414)
0.01 0.0176 (0.0031) 0.0192 (0.0037) 0.0480 (0.0045)
A4 0.04 0.0628 (0.0123) 0.0708 (0.0143) 0.0996 (0.0107)
0.25 0.3189 (0.0455) 0.3128 (0.0677) 0.4274 (0.0458)
0.01 0.0166 (0.0028) 0.0175 (0.0034) 0.0177 (0.0020)
A5 0.04 0.0602 (0.0076) 0.0499 (0.0136) 0.0607 (0.0066)
0.25 0.2746 (0.0228) 0.2797 (0.0296) 0.3148 (0.0304)
0.01 0.0106 (0.0017) 0.0099 (0.0015) 0.0098 (0.0007)
A6 0.04 0.0548 (0.0085) 0.0395 (0.0071) 0.0392 (0.0038)
0.25 0.2473 (0.0156) 0.2446 (0.0149) 0.2443 (0.0150)
0.01 0.0141 (0.0030) 0.0129 (0.0051) 0.0279 (0.0044)
A7 0.04 0.0519 (0.0103) 0.0504 (0.0119) 0.0779 (0.0092)
0.25 0.3188 (0.0203) 0.3192 (0.0203) 0.3192 (0.0196)
0.01 0.0146 (0.0031) 0.0144 (0.0034) 0.0295 (0.0030)
A8 0.04 0.0639 (0.0099) 0.0446 (0.0130) 0.0704 (0.0077)
0.25 0.2523 (0.0270) 0.2601 (0.0269) 0.3106 (0.0283)
and Mammen (2002). As already indicated, the estimation of (36) is computationally less demanding than the more flexible additive random field model (2). In Table 2 we compare the mean squared error of the one-step ahead prediction for the eight additive models A1-A8 using three different specifications of the random field: the additive, the proportional additive, and the comprehensive random field. We consider the same models with three error variances: σ²_e = 0.01, 0.04, and 0.25. Across Table 2, the additive and the proportional additive random field models exhibit smaller mean squared errors than the model with one comprehensive random field, with the exception of specification A6, which has a very simple linear structure. Though the proportional additive random field has a larger mean squared error relative to the additive model, the differences are almost negligible for most of the data generating processes. As expected, the prediction MSEs increase significantly when σ²_e increases. As indicated in Table 2, when σ²_e = 0.25 the differences in the MSEs are less noticeable, though the additive and proportional additive specifications remain superior to the comprehensive random field.
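To see why (36) lightens the computational burden, note that only the single weight λ is estimated while the proportions c_i are fixed in advance. The following is a minimal sketch of the implied composite matrix C, under our assumption (mirroring H(λ) = ∑_i λ_iH_i in the Mathematical Appendix) that λ and the c_i enter the kernel linearly; it is an illustration of the idea, not the authors' estimator.

```python
import numpy as np

def proportional_C(H_list, c, lam, sigma):
    """Composite matrix C for the proportional additive model (36): a sketch.

    Assumes C = lam * sum_i c_i * H_i + sigma * I_T, so a single weight lam
    replaces the I separate weights lambda_i of the additive model.
    """
    T = H_list[0].shape[0]
    H = sum(ci * Hi for ci, Hi in zip(c, H_list))  # fixed proportions c_i
    return lam * H + sigma * np.eye(T)
```

With I regressors this leaves one nonlinear field weight to optimize instead of I, which is the source of the reported computational savings.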
In Tables 3 and 4 we report the rejection frequencies of the residual-based test statistic for the sixteen additive and nonadditive data generating processes of Table 1. We compare the performance of our test with that of the Lagrange multiplier statistic proposed by Chen, Liu, and Tsay (1995), which has been documented to have extremely good size and power properties in finite samples. We denote this test statistic LM_CLT. The nominal size is chosen to be α = 0.05. In Table 3 we present the rejection frequencies of the two tests for the eight additive models A1-A8. Both tests have an acceptable empirical size, though for the simple linear structure A6, the LM_CLT test has marginally better size than the ResT_DGQ test, which is slightly oversized. Recall that the residual-based test is based on the estimation of the parameters of a random field designed to detect nonlinearities; thus, it is not surprising that its performance is inferior when the process is truly linear. In addition, across all specifications, the estimator of the error variance proposed in Corollary 3 performs very well, providing an accurate estimate of the population value.
In Table 4, we report the power of the tests for the eight nonadditive models N1-
N8. The residual-based test has perfect power across all eight specifications. Both tests
have perfect power for functional forms like N1 and N6-N8, which can be approximated
Table 3: Rejection frequencies (empirical size) for the additive models in Table 1 with nominal size α = 0.05. Consistent estimates of σ²_e obtained according to (32) are reported in parentheses. The sample size and number of Monte Carlo replications are 100 and 1000, respectively.

Model  σ²_e (σ̂²_e)  ResT_DGQ  LM_CLT
0.01(0.0125) 0.062 0.058
A1 0.04(0.0389) 0.049 0.048
0.25(0.2699) 0.051 0.058
0.01(0.0130) 0.093 0.054
A2 0.04(0.0419) 0.035 0.046
0.25(0.2692) 0.048 0.058
0.01(0.0198) 0.063 0.084
A3 0.04(0.0443) 0.000 0.064
0.25(0.2738) 0.038 0.050
0.01(0.0179) 0.000 0.104
A4 0.04(0.0446) 0.005 0.030
0.25(0.2512) 0.041 0.036
0.01(0.0103) 0.033 0.034
A5 0.04(0.0391) 0.048 0.042
0.25(0.2697) 0.049 0.044
0.01(0.0097) 0.075 0.058
A6 0.04(0.0387) 0.077 0.058
0.25(0.2419) 0.088 0.058
0.01(0.0106) 0.016 0.042
A7 0.04(0.0487) 0.050 0.040
0.25(0.2635) 0.072 0.046
0.01(0.0126) 0.037 0.044
A8 0.04(0.0420) 0.019 0.040
0.25(0.2628) 0.048 0.040
Table 4: Rejection frequencies (power) for the nonadditive models in Table 1 with nominal size α = 0.05. Consistent estimates of σ²_e obtained according to (32) are reported in parentheses. The sample size and number of Monte Carlo replications are 100 and 1000, respectively.

Model  σ²_e (σ̂²_e)  ResT_DGQ  LM_CLT
N1 0.01(0.4310) 1.000 1.000
0.04(0.4558) 1.000 1.000
0.25(0.6342) 1.000 1.000
N2 0.01(0.0238) 1.000 0.000
0.04(0.0499) 1.000 0.000
0.25(0.3102) 1.000 0.002
N3 0.01(0.0300) 1.000 0.000
0.04(0.0576) 1.000 0.000
0.25(0.2591) 1.000 0.000
N4 0.01(0.2108) 1.000 0.972
0.04(0.2444) 1.000 0.766
0.25(0.4749) 0.994 0.386
N5 0.01(0.0389) 1.000 1.000
0.04(0.0613) 1.000 0.986
0.25(0.2324) 1.000 0.736
N6 0.01(0.0311) 1.000 1.000
0.04(0.0540) 1.000 1.000
0.25(0.2816) 1.000 1.000
N7 0.01(0.0453) 1.000 1.000
0.04(0.0716) 1.000 1.000
0.25(0.3613) 1.000 1.000
N8 0.01(0.0963) 1.000 1.000
0.04(0.1234) 1.000 1.000
0.25(0.3096) 1.000 1.000
reasonably well by a Taylor series expansion.⁹ However, the LM_CLT test has no power against specifications that are Fourier series expansions, like N2 and N3. These models cannot be approximated well by low-order polynomials; nevertheless, the residual-based test ResT_DGQ has power equal to one. Note that although the error variance is overestimated across the eight nonadditive specifications, the residual-based test is in this case fairly robust and there is no loss of power.
6 Conclusion

In this paper we have proposed and analyzed the large sample behavior of non-locally misspecified additive random field models, which can be viewed as a generalization of the parametric random field proposed by Hamilton (2001). We establish the consistency and asymptotic normality of the estimators of the parameters entering the model under a set of assumptions similar to those in Hamilton (2001). As a by-product, we have proposed a test for additivity that is fully parametric, very easy to compute, and, based on simulation studies, appears to have very good small sample size and power properties when applied to a large class of additive and nonadditive specifications. Finally, we provide simulation evidence showing that the additive random field model substantially improves the accuracy of out-of-sample predictions relative to Hamilton's (2001) comprehensive random field model when the data generating mechanism is additive.
⁹Hamilton (2001, Lemma 4.9) proved that the random field regression model estimates very well specifications that can be approximated by a Taylor expansion. The LM test proposed by Chen, Liu, and Tsay (1995) is essentially based on a Volterra expansion.
References
Abadir, K. M. and J. R. Magnus (2002). Notation in econometrics: a proposal for a
standard. The Econometrics Journal 5, 76–90.
Abt, M. and W. J. Welch (1998). Fisher information and maximum likelihood estimation of covariance parameters in Gaussian stochastic processes. The Canadian Journal of Statistics 26(1), 127–137.
Barry, D. (1993). Testing for additivity of a regression function. The Annals of Statis-
tics 21(1), 235–254.
Carroll, R. J., W. Hardle, and E. Mammen (2002). Estimation in an additive model
when the components are linked parametrically. Econometric Theory 18, 886–912.
Chen, R., J. S. Liu, and R. S. Tsay (1995). Additivity tests for nonlinear autoregression.
Biometrika 82(2), 369–383.
Cressie, N. A. (1993). Statistics for Spatial Data. Wiley, NY.
Dahl, C. M. (2002). An investigation of tests for linearity and the accuracy of likelihood
based inference using random fields. Econometrics Journal 5, 263–284.
Dahl, C. M. and G. Gonzalez-Rivera (2003). Testing for neglected nonlinearity in
regression models based on the theory of random fields. Journal of Economet-
rics 114(1), 141–164.
Dahl, C. M. and S. Hylleberg (2004). Flexible regression models and relative forecast
performance. International Journal of Forecasting 20, 201–217.
Dahl, C. M. and Y. Qin (2004). The limiting behavior of the estimated parameters in a misspecified random field regression model. Unpublished manuscript, Purdue University.
Davidson, J. (1994). Stochastic Limit Theory. Oxford University Press.
Eubank, R., J. Hart, D. Simpson, and L. Stefanski (1995). Testing for additivity in
nonparametric regression. The Annals of Statistics 23(6), 1896–1920.
Hall, P., J. W. Kay, and D. M. Titterington (1990). Asymptotically optimal difference
based estimation of variance in nonparametric regression. Biometrika 77, 521–528.
Hamilton, J. D. (2001). A parametric approach to flexible nonlinear inference. Econo-
metrica 69(3), 537–573.
Hastie, T. and R. Tibshirani (1990). Generalized Additive Models. Chapman and Hall,
London.
Hong, Y. and H. White (1995). Consistent specification testing via nonparametric
series regression. Econometrica 63(5), 1133–1159.
Loh, W.-L. and T.-K. Lam (2000). Estimating structured correlation matrices in smooth Gaussian random field models. The Annals of Statistics 28(3), 880–904.
Lutkepohl, H. (1996). Handbook of Matrices. John Wiley and Sons, New York, USA.
Magnus, J. R. and H. Neudecker (1999). Matrix Differential Calculus with Applications
in Statistics and Econometrics. Wiley.
Mardia, K. and R. Marshall (1984). Maximum likelihood estimation of models for
residual covariance in spatial regression. Biometrika 71(1), 135–146.
Rudin, W. (1976). Principles of Mathematical Analysis. McGraw-Hill.
Sperlich, S., D. Tjostheim, and L. Yang (2002). Nonparametric estimation and testing
of interaction in additive models. Econometric Theory 18, 197–251.
Stone, C. J. (1985). Additive regression and other nonparametric models. The Annals
of Statistics 13(2), 689–705.
Stone, C. J. (1986). The dimensionality reduction principle for generalized additive
models. The Annals of Statistics 14(2), 590–606.
Wooldridge, J. M. (1992). A test of functional form against nonparametric alternatives.
Econometric Theory 8, 452–475.
Wooldridge, J. M. (1994). Estimation and inference for dependent processes. In R. En-
gle and D. McFadden (Eds.), Handbook of Econometrics, Volume IV, Chapter 45.
Amsterdam, North Holland.
Yang, L. (2002). Direct estimation in an additive model when the components are
proportional. Statistica Sinica 12, 801–821.
Ying, Z. (1991). Asymptotic properties of a maximum likelihood estimator with data from a Gaussian process. Journal of Multivariate Analysis 36, 280–296.
Ying, Z. (1993). Maximum likelihood estimation of parameters under a spatial sam-
pling scheme. The Annals of Statistics 21(3), 1567–1590.
Mathematical Appendix
We need some auxiliary results on uniform convergence of deterministic sequences before
proceeding with the proofs of consistency and asymptotic normality theorems.
Lemma A.1 Let λ = (λ_1, ..., λ_I) and σ be the nonlinear parameters in the additive random field model, and C ≡ H(λ) + σI_T. For any permissible covariance matrix H(λ),

C⁻¹H(λ) = H(λ)C⁻¹ = I_T − σC⁻¹.
Proof of Lemma A.1 Consider

CC⁻¹ = (H(λ) + σI_T)C⁻¹ = I_T,

then

H(λ)C⁻¹ + σC⁻¹ = I_T,

hence

H(λ)C⁻¹ = I_T − σC⁻¹.

Similarly, considering C⁻¹C = I_T, the first equality of the lemma follows. By Lemma A.1, the covariance matrix P_T ≡ E[(ξ_T − µ_T)(ξ_T − µ_T)′] can be written as

P_T = H(λ) − H(λ)C⁻¹H(λ) = σC⁻¹H(λ).
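Lemma A.1 is easy to verify numerically. The snippet below (our addition) checks the identity and the P_T representation on a randomly generated positive semidefinite H, a generic stand-in rather than a fitted random field covariance.

```python
import numpy as np

# Numerical check of Lemma A.1 on a random positive semidefinite H.
rng = np.random.default_rng(0)
T, sigma = 8, 0.7
A = rng.normal(size=(T, T))
H = A @ A.T                       # permissible covariance matrix H(lambda)
C = H + sigma * np.eye(T)
Cinv = np.linalg.inv(C)

lhs = Cinv @ H                    # C^{-1} H(lambda)
rhs = np.eye(T) - sigma * Cinv    # I_T - sigma C^{-1}
P_T = H - H @ Cinv @ H            # H - H C^{-1} H, the covariance P_T

assert np.allclose(lhs, rhs)
assert np.allclose(lhs, H @ Cinv)          # the two products coincide
assert np.allclose(P_T, sigma * Cinv @ H)  # P_T = sigma C^{-1} H(lambda)
```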
Lemma A.2 Let assumptions ii., 1, 2, and 3 hold. Then

lim_{T→∞} sup_{Θ×A²} P_T = O_T.

For each individual random field, let P_{Ti} ≡ σ(λ_iH_i + σI_T)⁻¹λ_iH_i. Then

lim_{T→∞} sup_{Θ×A²_i} P_{Ti} = O_T.

Proof of Lemma A.2 The result follows directly from Theorem 2 in Dahl and Qin (2004).
Corollary A.1 Let τ_i ≡ (1/T³)φ′_iP′_TP_Tφ_i, where φ_i : A_i → R is a continuous function for i = 1, 2, ..., I and φ_i = (φ_i(x_{1i·}), φ_i(x_{2i·}), ..., φ_i(x_{Ti·}))′. Then

lim_{T→∞} sup_{Θ×A²_i} τ_i → 0,  (37)

for all i = 1, 2, ..., I. Furthermore, for each individual random field, let τ_{ii} ≡ (1/T³)φ′_iP′_{Ti}P_{Ti}φ_i. Then

lim_{T→∞} sup_{Θ×A²_i} τ_{ii} → 0,  (38)

for all i = 1, 2, ..., I.

Proof of Corollary A.1 The result follows directly from Theorem 3 in Dahl and Qin (2004) and from Theorem 4.7 in Hamilton (2001).
Corollary A.2 Let φ_i be defined as in Corollary A.1, and let P^i_T ≡ σC⁻¹λ_iH_i. Define τ^i_i ≡ (1/T³)φ′_i(P^i_T)′P^i_Tφ_i. Then

lim_{T→∞} sup_{Θ×A²} τ^i_i → 0,  (39)

for all i = 1, 2, ..., I.

Proof of Corollary A.2 For any two n × n positive definite matrices A and B, we say that A > B if A − B is also positive definite. Since P^i_T ≤ P_T for all i and T, it follows that τ^i_i ≤ τ_i for all i and T. Since τ^i_i ≥ 0, Corollary A.2 follows immediately.
Proof of Theorem 1 Let e = (e_1, e_2, ..., e_T)′ be the vector of disturbances in the data generating process (Assumption 3). We can write

ξ_T − Xβ − L_T(λ) = H(λ)C⁻¹(y − Xβ) − L_T(λ)
 = H(λ)C⁻¹(L_T(λ) + e) − L_T(λ)
 = (1/T)(H(λ)C⁻¹ ∑_{i=1}^I λ_iH_iφ_i − ∑_{i=1}^I λ_iH_iφ_i) + H(λ)C⁻¹e
 = −(1/T) ∑_{i=1}^I P^i_Tφ_i + H(λ)C⁻¹e,
where P^i_T is defined in Corollary A.2 and the last equality follows from Lemma A.1. In order to show that

lim_{T→∞} sup_{Θ×A²} (1/T)E[(ξ_T − Xβ − L_T(λ))′(ξ_T − Xβ − L_T(λ))] → 0,  (40)

we need to show that

lim_{T→∞} sup_{Θ×A²} (1/T³)(∑_{i=1}^I P^i_Tφ_i)′(∑_{i=1}^I P^i_Tφ_i) ≤ lim_{T→∞} sup_{Θ×A²} (∑_{i=1}^I √τ^i_i)(∑_{j=1}^I √τ^j_j) → 0,  (41)

where τ^i_i is given by (39) in Corollary A.2. The convergence of the remaining terms in (40) is shown in Dahl and Qin (2004). Let a_ij = √τ^i_i √τ^j_j. By the Cauchy-Schwarz inequality, we have a_ij ≤ √(a_ii a_jj). Hence,

lim_{T→∞} sup_{Θ×A²} (∑_{i=1}^I √τ^i_i)(∑_{j=1}^I √τ^j_j) = lim_{T→∞} sup_{Θ×A²} ∑_{i=1}^I ∑_{j=1}^I a_ij
 ≤ lim_{T→∞} sup_{Θ×A²} (∑_{i=1}^I √a_ii)²
 ≤ lim_{T→∞} sup_{Θ×A²} I ∑_{i=1}^I a_ii
 = lim_{T→∞} sup_{Θ×A²} I ∑_{i=1}^I τ^i_i → 0,

by Corollary A.2.
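The inequality chain above can be sanity-checked numerically; the following illustrative snippet (our addition) verifies that (∑_i √τ_i)² ≤ I ∑_i τ_i for arbitrary nonnegative τ_i, which is the Cauchy-Schwarz step of the argument.

```python
import numpy as np

# Cauchy-Schwarz check: for nonnegative tau_1, ..., tau_I,
# (sum_i sqrt(tau_i))^2 <= I * sum_i tau_i.
rng = np.random.default_rng(1)
tau = rng.uniform(0.0, 2.0, size=6)   # stand-ins for the tau_i^i of Corollary A.2
I = len(tau)
lhs = np.sqrt(tau).sum() ** 2
rhs = I * tau.sum()
assert lhs <= rhs + 1e-12
```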
Proof of Corollary 1 Define

S ≡ ψ(X) − (Xβ + H(λ)(H(λ) + σI_T)⁻¹(y − Xβ)).

By Theorem 1 we have that lim_{T→∞} (1/T)E(S′S) → 0 uniformly in θ, β, and X. Now,

(1/T)E(S′S) = (σ²/T)(ψ(X) − Xβ)′(H(λ) + σI_T)⁻²(ψ(X) − Xβ)
 + (σ²_e/T) tr[(H(λ) + σI_T)⁻¹H(λ)H(λ)(H(λ) + σI_T)⁻¹].

Since both terms on the right-hand side are positive, results (13) and (14) of Corollary 1 follow. Result (15) is a direct consequence of Lemma A.1 and (14).
Lemma A.3 Suppose A and B are two (n, n) positive definite matrices. Then

tr(AB) > 0.  (42)

Proof of Lemma A.3 From Magnus and Neudecker (1999, Theorem 13, p. 16), there exists an orthogonal matrix P such that A = P′V P, where V = diag(γ_1, ..., γ_n) and γ_i for i = 1, ..., n are the eigenvalues of A. Consequently,

tr(AB) = tr(P′V PB) = tr(V(PBP′)).

Since B is positive definite, PBP′ is positive definite. Therefore, all diagonal elements of PBP′, denoted α_1, ..., α_n, must be positive, as must the eigenvalues γ_1, ..., γ_n of A. Hence,

tr(V(PBP′)) = ∑_{i=1}^n γ_iα_i > 0,

which completes the proof.
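A quick numerical illustration of Lemma A.3 (our addition, with arbitrary randomly generated positive definite matrices):

```python
import numpy as np

# tr(AB) > 0 whenever A and B are positive definite (Lemma A.3).
rng = np.random.default_rng(2)
n = 6
Ma = rng.normal(size=(n, n))
Mb = rng.normal(size=(n, n))
A = Ma @ Ma.T + n * np.eye(n)   # positive definite by construction
B = Mb @ Mb.T + n * np.eye(n)   # positive definite by construction
assert np.trace(A @ B) > 0.0
```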
Proof of Corollary 2 We can write

tr(C⁻¹H(λ)C⁻¹H(λ)) = tr[C⁻¹(∑_{i=1}^I λ_iH_i)C⁻¹(∑_{i=1}^I λ_iH_i)]
 = tr[(∑_{i=1}^I λ_iC⁻¹H_i)(∑_{i=1}^I λ_iC⁻¹H_i)]
 = ∑_{i=1}^I λ²_i tr(C⁻¹H_iC⁻¹H_i) + 2 ∑_{i=1}^I ∑_{j>i} λ_iλ_j tr(C⁻¹H_iC⁻¹H_j).

By combining (15) of Corollary 1 and Lemma A.3 (taking A = C⁻¹H_iC⁻¹ and B = H_j), it follows that each term on the right-hand side is positive and converges uniformly to zero on Θ × B × A as T → ∞, which concludes the proof.
Proof of Proposition 1 From Magnus and Neudecker (1999, Chapter 8),

D_iR_T = (1/T)D_i log det C = (1/T)tr(C⁻¹(D_iC)) = (1/T)tr(C⁻¹H_i),

and

D²_{i,j}R_T = −(1/T)tr(C⁻¹(D_iC)C⁻¹(D_jC)) = −(1/T)tr(C⁻¹H_iC⁻¹H_j).

From Corollary 2, it follows that D²_{i,j}R = lim_{T→∞} sup_{Θ×A} D²_{i,j}R_T = 0 for all i, j = 1, ..., I. From Corollary 1, it follows that D_0U = 0. Finally, let the eigenvalues of C⁻¹H_i be γ_1 ≤ γ_2 ≤ ... ≤ γ_T. Then

D_iR_T = (1/T)tr(C⁻¹H_i) = (1/T)∑_{t=1}^T γ_t ≤ √((1/T)∑_{t=1}^T γ²_t) = √(−D²_{i,i}R_T),

and D_iR = 0 immediately follows, for all i = 1, 2, ..., I.
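The matrix-calculus identity D_i log det C = tr(C⁻¹H_i) that drives the proof can be checked by finite differences. A small sketch (our addition; H_1 and H_2 are arbitrary positive semidefinite stand-ins for the field covariances):

```python
import numpy as np

# Central-difference check of d/d(lambda_1) log det C = tr(C^{-1} H_1),
# with C = lambda_1 H_1 + lambda_2 H_2 + sigma I_T.
rng = np.random.default_rng(3)
T, sigma, h = 6, 0.5, 1e-5
M1 = rng.normal(size=(T, T))
M2 = rng.normal(size=(T, T))
H1 = M1 @ M1.T
H2 = M2 @ M2.T
l1, l2 = 0.8, 1.3

def logdetC(a, b):
    # log det of C evaluated at field weights (a, b)
    return np.linalg.slogdet(a * H1 + b * H2 + sigma * np.eye(T))[1]

C = l1 * H1 + l2 * H2 + sigma * np.eye(T)
analytic = np.trace(np.linalg.solve(C, H1))
numeric = (logdetC(l1 + h, l2) - logdetC(l1 - h, l2)) / (2.0 * h)
assert abs(analytic - numeric) < 1e-4
```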
Proof of Proposition 2 The differentials of R_T for l = 3, 4 are given by

D³_{i1,i2,i3}R_T = (1/T)tr(C⁻¹H_{i3}C⁻¹H_{i2}C⁻¹H_{i1}) + (1/T)tr(C⁻¹H_{i2}C⁻¹H_{i3}C⁻¹H_{i1}),  (43)

and

D⁴_{i1,i2,i3,i4}R_T = −(1/T)tr(C⁻¹H_{i4}C⁻¹H_{i3}C⁻¹H_{i2}C⁻¹H_{i1})
 − (1/T)tr(C⁻¹H_{i3}C⁻¹H_{i4}C⁻¹H_{i2}C⁻¹H_{i1})
 − (1/T)tr(C⁻¹H_{i3}C⁻¹H_{i2}C⁻¹H_{i4}C⁻¹H_{i1})
 − (1/T)tr(C⁻¹H_{i4}C⁻¹H_{i2}C⁻¹H_{i3}C⁻¹H_{i1})
 − (1/T)tr(C⁻¹H_{i2}C⁻¹H_{i4}C⁻¹H_{i3}C⁻¹H_{i1})
 − (1/T)tr(C⁻¹H_{i2}C⁻¹H_{i3}C⁻¹H_{i4}C⁻¹H_{i1}).  (44)

Let us focus on the first term of (43). Recall the following property of the trace operator:

|tr(A′B)| ≤ √(tr(A′A) tr(B′B)).  (45)

Then,

|(1/T)tr(C⁻¹H_{i3}C⁻¹H_{i2}C⁻¹H_{i1})|
 ≤ (1/T)√(tr(C⁻¹H_{i3}H_{i3}C⁻¹) tr(H_{i1}C⁻¹H_{i2}C⁻¹C⁻¹H_{i2}C⁻¹H_{i1}))
 ≤ (1/T)(√T/λ_{i3})√(tr(H_{i1}C⁻¹H_{i2}C⁻¹C⁻¹H_{i2}C⁻¹H_{i1}))
 ≤ (1/T)(√T/λ_{i3})(√T/(λ_{i2}λ_{i1}))
 = 1/(λ_{i3}λ_{i2}λ_{i1}).  (46)

The second inequality in (46) follows from (C⁻¹)² ≤ ((λ_{i3}H_{i3})⁻¹)² for all i3. Thus,

tr(C⁻¹H_{i3}H_{i3}C⁻¹) = tr((C⁻¹)²(H_{i3})²) = (1/λ²_{i3})tr((C⁻¹)²(λ_{i3}H_{i3})²) ≤ T/λ²_{i3}.

Similarly, the third inequality in (46) follows from

tr(H_{i1}C⁻¹H_{i2}C⁻¹C⁻¹H_{i2}C⁻¹H_{i1}) ≤ tr(H_{i1}C⁻¹H_{i2}((λ_{i2}H_{i2})⁻¹)²H_{i2}C⁻¹H_{i1})
 = (1/λ²_{i2})tr(H_{i1}C⁻¹C⁻¹H_{i1})
 ≤ (1/λ²_{i2})tr(H_{i1}((λ_{i1}H_{i1})⁻¹)²H_{i1})
 = T/(λ²_{i2}λ²_{i1}).

The same argument can be applied to the second term of (43), which subsequently can be shown to be bounded by 1/(λ_{i3}λ_{i2}λ_{i1}). It follows that sup_{Θ×A} D³_{i1,i2,i3}R_T is bounded for all T and all i1, i2, i3 = 0, 1, 2, ..., I. Using similar arguments, we can show that sup_{Θ×A} D_{i1}R_T, sup_{Θ×A} D²_{i1,i2}R_T, and sup_{Θ×A} D⁴_{i1,i2,i3,i4}R_T are all bounded for all T and for all i1, i2, i3, i4 = 0, 1, 2, ..., I. Since {D^l_{i1...il}R_T}_T for l = 1, 2, ..., 4 and i1, ..., il = 0, 1, ..., I are bounded for all T, the sequences {D^l_{i1...il}R_T}_T for l = 1, 2, 3 and i1, ..., il = 0, 1, ..., I are equicontinuous. Since at least one of these equicontinuous sequences converges (for example {D²_{i1,i2}R_T}_T in Proposition 1), Theorem 5 in Dahl and Qin (2004) and Theorem 7.16 in Rudin (1976) imply that all of the function sequences {D^l_{i1...il}R_T}_T for l = 1, 2, 3 and i1, ..., il = 0, 1, ..., I are uniformly convergent on Θ × A as T → ∞. This completes the proof.
Proof of Proposition 3 Define c ≡ ψ(X) − Xβ. The differentials of U_T for l = 1, 2 and i1, i2 = 0, 1, ..., I are given by

D_{i1}U_T = −(1/T)tr(c′C⁻¹H_{i1}C⁻¹c),

and

D²_{i1,i2}U_T = (1/T)tr(c′C⁻¹H_{i2}C⁻¹H_{i1}C⁻¹c) + (1/T)tr(c′C⁻¹H_{i1}C⁻¹H_{i2}C⁻¹c).

Notice that

(1/T)tr(c′C⁻¹H_{i1}C⁻¹c) = (1/(Tλ_{i1}))tr(c′C⁻¹λ_{i1}H_{i1}C⁻¹c) ≤ (1/(Tλ_{i1}))tr(c′C⁻¹c) ≤ (1/(σλ_{i1}))(1/T)tr(c′c),

where, by Assumption 4, (1/T)tr(c′c) is bounded for all T. Note that the last inequality follows as C⁻¹ < σ⁻¹I, while Assumption 2 guarantees the existence of 1/(σλ_{i1}). This implies that sup_{Θ×B×A} D_{i1}U_T is bounded for all i1 = 1, 2, ..., I and for all T. Following a similar argument as in Proposition 2, the first term of D²_{i1,i2}U_T is bounded:

|(1/T)tr(c′C⁻¹H_{i2}C⁻¹H_{i1}C⁻¹c)| = |(1/(λ_{i1}λ_{i2}))(1/T)tr(c′C⁻¹λ_{i2}H_{i2}C⁻¹λ_{i1}H_{i1}C⁻¹c)| ≤ (1/(σλ_{i1}λ_{i2}))(1/T)tr(c′c).

The same bound is obtained for the second term of D²_{i1,i2}U_T. Consequently, sup_{Θ×B×A} D²_{i1,i2}U_T is bounded for all i1, i2 = 0, 1, ..., I and for all T. It is not difficult to show that all terms of D^l_{i1...il}U_T will be bounded by (σ ∏_{j=1}^l λ_{ij})⁻¹(1/T)tr(c′c), implying that sup_{Θ×B×A} D^l_{i1...il}U_T for l = 1, 2, ... and i1, ..., il = 0, 1, ..., I is bounded for all T. Since {D^l_{i1...il}U_T}_T for l = 1, 2, ... and i1, ..., il = 0, 1, ..., I are bounded for all T, the sequences {D^l_{i1...il}U_T}_T for l = 1, 2, ... and i1, ..., il = 0, 1, ..., I are equicontinuous. Since at least one of these equicontinuous function sequences converges (for example {D_0U_T}_T in Proposition 1), Theorem 5 in Dahl and Qin (2004) and Theorem 7.16 in Rudin (1976) imply that all of the function sequences {D^l_{i1...il}U_T}_T for l = 1, 2, ... and i1, ..., il = 0, 1, ..., I are uniformly convergent on Θ × B × A as T → ∞. This completes the proof.
Proof of Proposition 4 The proof of equicontinuity proceeds in a similar fashion as in Propositions 2 and 3. The uniform convergence results (16), (17), and (18) follow directly from Corollary 1. Inserting ψ(X) = x_{·i} and β = 0 in (13), we have

lim_{T→∞} sup_{Θ×A} (1/T)x′_{·i}((H(λ) + σI_T)⁻¹)²x_{·i} → 0.

Inserting ψ(X) = x_{·i} + x_{·j} and β = 0 in (13), we can write

lim_{T→∞} sup_{Θ×B×A} (1/T)c′(C⁻¹)²c = lim_{T→∞} sup_{Θ×B×A} (1/T)x′_{·i}(C⁻¹)²x_{·i}
 + lim_{T→∞} sup_{Θ×B×A} (1/T)x′_{·j}(C⁻¹)²x_{·j} + lim_{T→∞} sup_{Θ×B×A} (2/T)x′_{·i}(C⁻¹)²x_{·j}.

Since the first two terms converge to zero, the last term must also converge to zero for (13) to hold. Result (17) follows from (13) when β = 0. Result (18) follows from (13) when Xβ = x_{·i}.
Proof of Theorem 2 Define f_T(e, X, θ, β) ≡ Q_T(θ, β) − Q*_T(θ, β). We write

f_T(e, X, θ, β) = −(1/T)(ψ(X) − Xβ)′C⁻¹e − (1/(2T))e′C⁻¹e + (σ²_e/(2T))tr(C⁻¹).

We wish to show that

lim_{T→∞} sup_{Θ×B×A} |f_T(e, X, θ, β)| →p 0,

which (according to Theorem 21.9 in Davidson (1994)) will be satisfied if and only if a) f_T(e, X, θ, β) →p 0 for each (θ, β) ∈ Θ × B, and b) f_T(e, X, θ, β) is stochastically equicontinuous. Notice first that E[f_T(e, X, θ, β)] = 0. By Chebyshev's inequality, condition a) will then be satisfied if

lim_{T→∞} sup_{Θ×B×A} E[f_T(e, X, θ, β)²] → 0.

Now, let 0 < γ_1 ≤ ... ≤ γ_T be the eigenvalues of H(λ), and ν_t for t = 1, ..., T the corresponding eigenvectors. Let z_t ≡ (1/σ_e)ν′_te and a_t ≡ ν′_t(ψ(X) − Xβ) for t = 1, ..., T. Then z_t ∼ IN(0, 1) and we can write

f_T(e, X, θ, β) = (1/T)∑_{t=1}^T [−σ_ea_tz_t/(γ_t + σ) − σ²_ez²_t/(2(γ_t + σ)) + σ²_e/(2(γ_t + σ))]
 = −(1/T)∑_{t=1}^T (2σ_ea_tz_t + σ²_ez²_t − σ²_e)/(2(γ_t + σ)).  (47)

Therefore,

E[f_T(e, X, θ, β)²] = (1/T²)∑_{t=1}^T [4σ²_ea²_t E(z²_t) + σ⁴_e E(z⁴_t) + σ⁴_e + 4a_tσ³_e E(z³_t) − 4a_tσ³_e E(z_t) − 2σ⁴_e E(z²_t)] / (4(γ_t + σ)²),

and consequently

lim_{T→∞} sup_{Θ×B×A} E[f_T(e, X, θ, β)²] = lim_{T→∞} sup_{Θ×B×A} (1/T²)∑_{t=1}^T (4σ²_ea²_t + 3σ⁴_e + σ⁴_e − 2σ⁴_e)/(4(γ_t + σ)²)
 = lim_{T→∞} sup_{Θ×B×A} (1/T²)∑_{t=1}^T (2σ²_ea²_t + σ⁴_e)/(2(γ_t + σ)²)
 = lim_{T→∞} sup_{Θ×B×A} (−(1/T)σ²_eD_0U_T + (1/T²)∑_{t=1}^T σ⁴_e/(2(γ_t + σ)²))
 ≤ −σ²_eD_0U + lim_{T→∞} sup_{Θ×B×A} (1/(2T))(σ⁴_e/σ²) → 0,

where the last inequality follows from assumption ii. and Proposition 1. This completes the proof of condition a). To verify condition b), define

f̃_T = f_T(e, X, θ̃, β̃),

and note that

|f_T − f̃_T| ≤ (1/T)|tr(((ψ(X) − Xβ)′C⁻¹ − (ψ(X) − Xβ̃)′C̃⁻¹)e)|
 + (1/(2T))|tr(e′(C⁻¹ − C̃⁻¹)e)| + (σ²_e/(2T))|tr(C⁻¹ − C̃⁻¹)|
 ≤ √((1/T)∑e²_t) ‖(ψ(X) − Xβ)′C⁻¹ − (ψ(X) − Xβ̃)′C̃⁻¹‖
 + ((1/T)∑e²_t + σ²_e/(2T)) ‖C⁻¹ − C̃⁻¹‖.

It follows immediately (as X is non-stochastic) that ‖(ψ(X) − Xβ)′C⁻¹ − (ψ(X) − Xβ̃)′C̃⁻¹‖ ↓ 0 and ‖C⁻¹ − C̃⁻¹‖ ↓ 0 when (θ′, β′)′ → (θ̃′, β̃′)′. Furthermore, since (1/T)∑e²_t = O_p(1) (by the LLN) and σ²_e/(2T) = O(T⁻¹), we can conclude that condition b) holds according to, e.g., Theorem 21.10 in Davidson (1994). This completes the proof of (20). Condition (21) follows directly from Propositions 2 and 3.
Proof of Theorem 3 Calculate the first- and second-order conditions of the function $Q^*(\theta;\beta)$. From Propositions 2 and 3, and Theorem 2, we can write the Hessian matrix as
\[
H(\theta;\beta) =
\begin{pmatrix}
-\frac{1}{2}D^2_{11}U(\theta) & \cdots & -\frac{1}{2}D^2_{1I}U(\theta) & 0 \\
\vdots & \ddots & \vdots & \vdots \\
-\frac{1}{2}D^2_{I1}U(\theta) & \cdots & -\frac{1}{2}D^2_{II}U(\theta) & 0 \\
0 & \cdots & 0 & -\frac{1}{2}\left( D^2_{00}R(\theta) + \sigma_e^2 D^3_{000}R(\theta) \right)
\end{pmatrix}. \tag{48}
\]
The assumption of convexity of $U(\theta)$ guarantees that the upper block of the Hessian matrix (48) is negative definite and the function $Q^*(\theta;\beta)$ is concave in $(\lambda_1,\lambda_2,\ldots,\lambda_I)$. The lower-right element of the Hessian matrix is
\[
-\frac{1}{2}\left( D^2_{00}R(\theta) + \sigma_e^2 D^3_{000}R(\theta) \right) = \frac{1}{2}\operatorname{tr} C^{-1}C^{-1} - \sigma_e^2 \operatorname{tr} C^{-1}C^{-1}C^{-1}.
\]
For this term to be negative, it is necessary and sufficient that $\operatorname{tr} C^{-1}C^{-1} < 2\sigma_e^2 \operatorname{tr} C^{-1}C^{-1}C^{-1}$. A necessary condition for concavity of $Q^*(\theta;\beta)$ in $\sigma$ is given by $\sigma \le 2\sigma_e^2$. This condition comes from considering the eigenvalues of the matrix $C^{-1}\sigma$, which are less than one (Magnus and Neudecker, 1999, p. 25). Then $\operatorname{tr} C^{-1}C^{-1}C^{-1}\sigma^3 \le \operatorname{tr} C^{-1}C^{-1}\sigma^2$ and
\[
\sigma \le \frac{\operatorname{tr} C^{-1}C^{-1}}{\operatorname{tr} C^{-1}C^{-1}C^{-1}} < 2\sigma_e^2.
\]
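The final step rests on a simple spectral fact: when every eigenvalue of $\sigma C^{-1}$ is below one (equivalently, every eigenvalue of $C$ exceeds $\sigma$), we have $\sigma \operatorname{tr} C^{-3} \le \operatorname{tr} C^{-2}$. A minimal stand-alone check, using an arbitrary diagonal example rather than the paper's $C$:

```python
# Hypothetical numeric check of the trace inequality behind Theorem 3:
# if all eigenvalues of sigma*C^{-1} are < 1 (i.e. eigenvalues of C > sigma),
# then sigma * tr(C^-3) <= tr(C^-2), so sigma <= tr(C^-2)/tr(C^-3).
sigma = 0.8
eigs = [1.0, 1.7, 2.5, 4.0]           # eigenvalues of C, all > sigma (example values)
assert all(sigma / g < 1 for g in eigs)

tr_c2 = sum(1 / g**2 for g in eigs)   # tr C^{-2}
tr_c3 = sum(1 / g**3 for g in eigs)   # tr C^{-3}

print(sigma <= tr_c2 / tr_c3)  # prints True
```

The ratio $\operatorname{tr} C^{-2}/\operatorname{tr} C^{-3}$ is never smaller than the smallest eigenvalue of $C$, which is why the bound on $\sigma$ follows.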
Proof of Theorem 4 We verify the five conditions for consistency in Dahl and Qin (2004, Theorem 1). Condition i. requires that $\Theta$ and $B$ are compact parameter spaces, which is guaranteed by our Assumption ii. Condition ii., $\hat\beta \xrightarrow{p} \beta^* \in B$, can be verified by an argument similar to that in Dahl and Qin (2004). Condition iii. requires that $Q_T(\theta,\beta)$ is a continuous measurable function for all $T$, and it is satisfied trivially. Condition iv. requires that $Q_T(\theta,\beta) \xrightarrow{p} Q^*(\theta,\beta)$ uniformly on $\Theta \times B$, and it is satisfied by Theorem 2. Condition v. requires the existence of a unique maximizer $\theta^* \in \Theta$ of $Q^*(\theta,\beta^*)$, and it is satisfied by Theorem 3. This completes the proof of consistency of $\hat\theta$.
Proof of Proposition 5 Define $v \equiv (y - X\beta)$ and $B_{i_1} \equiv \frac{1}{\sqrt{T}}C^{-1}H_{i_1}C^{-1}$, and notice that $v \sim N_T(c, \sigma_e^2 I_T)$. First, we focus on proving equation (25), and for that purpose we use the moment generating function $M(s_{i_1}, s_{i_2})$ defined as
\[
M(s_{i_1}, s_{i_2}) = E\left( \exp\left( s_{i_1} v'B_{i_1}v + s_{i_2} v'B_{i_2}v \right) \right),
\]
where $s_{i_1}, s_{i_2} \in \mathbb{R}$. Now, let
\begin{align*}
\kappa &\equiv -\frac{1}{2\sigma_e^2}(v - c)'(v - c) + v'(s_{i_1}B_{i_1} + s_{i_2}B_{i_2})v \\
&= -\frac{1}{2\sigma_e^2} v'v + \frac{1}{\sigma_e^2} c'v - \frac{1}{2\sigma_e^2} c'c + v'(s_{i_1}B_{i_1} + s_{i_2}B_{i_2})v \\
&= -\frac{1}{2\sigma_e^2}\left( v'\left( I_T - 2s_{i_1}\sigma_e^2 B_{i_1} - 2s_{i_2}\sigma_e^2 B_{i_2} \right)v - c'v - v'c + c'c \right) \\
&= -\frac{1}{2\sigma_e^2}\Big[ \left( v - \left( I_T - 2s_{i_1}\sigma_e^2 B_{i_1} - 2s_{i_2}\sigma_e^2 B_{i_2} \right)^{-1}c \right)' \left( I_T - 2s_{i_1}\sigma_e^2 B_{i_1} - 2s_{i_2}\sigma_e^2 B_{i_2} \right) \\
&\qquad \times \left( v - \left( I_T - 2s_{i_1}\sigma_e^2 B_{i_1} - 2s_{i_2}\sigma_e^2 B_{i_2} \right)^{-1}c \right) - c'\left( \left( I_T - 2s_{i_1}\sigma_e^2 B_{i_1} - 2s_{i_2}\sigma_e^2 B_{i_2} \right)^{-1} - I_T \right)c \Big] \\
&= -\frac{1}{2\sigma_e^2}(v - \bar c)'\left( I_T - 2s_{i_1}\sigma_e^2 B_{i_1} - 2s_{i_2}\sigma_e^2 B_{i_2} \right)(v - \bar c) \\
&\qquad + c'(s_{i_1}B_{i_1} + s_{i_2}B_{i_2})\left( I_T - 2s_{i_1}\sigma_e^2 B_{i_1} - 2s_{i_2}\sigma_e^2 B_{i_2} \right)^{-1}c,
\end{align*}
where $\bar c = \left( I_T - 2s_{i_1}\sigma_e^2 B_{i_1} - 2s_{i_2}\sigma_e^2 B_{i_2} \right)^{-1}c$. By using the above formulation we can write
\begin{align*}
M(s_{i_1}, s_{i_2}) &= \int \frac{1}{(2\pi\sigma_e^2)^{T/2}} \exp(\kappa)\, dv \\
&= \int \frac{\left| I - 2s_{i_1}\sigma_e^2 B_{i_1} - 2s_{i_2}\sigma_e^2 B_{i_2} \right|^{-1/2}}{(2\pi\sigma_e^2)^{T/2} \left| I - 2s_{i_1}\sigma_e^2 B_{i_1} - 2s_{i_2}\sigma_e^2 B_{i_2} \right|^{-1/2}} \exp(\kappa)\, dv \\
&= \left| I - 2s_{i_1}\sigma_e^2 B_{i_1} - 2s_{i_2}\sigma_e^2 B_{i_2} \right|^{-1/2} \exp\left( c'(s_{i_1}B_{i_1} + s_{i_2}B_{i_2})\left( I_T - 2s_{i_1}\sigma_e^2 B_{i_1} - 2s_{i_2}\sigma_e^2 B_{i_2} \right)^{-1}c \right).
\end{align*}
Since $\operatorname{cov}(v'B_{i_1}v, v'B_{i_2}v) = E[(v'B_{i_1}v)(v'B_{i_2}v)] - E(v'B_{i_1}v)E(v'B_{i_2}v)$ and $E[(v'B_{i_1}v)(v'B_{i_2}v)] = D^2_{s_{i_1}s_{i_2}}M(s_{i_1}, s_{i_2})|_{s_{i_1}=0,\, s_{i_2}=0}$, we can write
\begin{align*}
\operatorname{cov}(v'B_{i_1}v, v'B_{i_2}v) &= 2\sigma_e^4 \operatorname{tr}(B_{i_1}B_{i_2}) + 4\sigma_e^2 c'B_{i_1}B_{i_2}c \\
&= 2\sigma_e^4 \frac{1}{T}\operatorname{tr}\left( C^{-1}H_{i_1}C^{-1}C^{-1}H_{i_2}C^{-1} \right) + 4\sigma_e^2 \frac{1}{T} c'C^{-1}H_{i_1}C^{-1}C^{-1}H_{i_2}C^{-1}c.
\end{align*}
Noticing that
\begin{align*}
\operatorname{cov}\left( \sqrt{T}D_{i_1}Q_T(\theta,\beta),\, \sqrt{T}D_{i_2}Q_T(\theta,\beta) \right) &= \operatorname{cov}\left( \tfrac{1}{2}v'B_{i_1}v,\, \tfrac{1}{2}v'B_{i_2}v \right) \\
&= \frac{\sigma_e^4}{2}\frac{1}{T}\operatorname{tr}\left( C^{-1}H_{i_1}C^{-1}C^{-1}H_{i_2}C^{-1} \right) + \sigma_e^2 \frac{1}{T} c'C^{-1}H_{i_1}C^{-1}C^{-1}H_{i_2}C^{-1}c,
\end{align*}
completes the proof of equation (25). To show that equation (26) holds, we proceed exactly as above. Let $\tilde x_{\cdot j_1} = \frac{1}{\sqrt{T}} x_{\cdot j_1}$. We write
\begin{align*}
\kappa &= -\frac{1}{2\sigma_e^2}(v - c)'(v - c) + s_{i_1}v'B_{i_1}v + s_{j_1}\tilde x_{\cdot j_1}'v \\
&= -\frac{1}{2\sigma_e^2}\left[ v - (I_T - 2\sigma_e^2 s_{i_1}B_{i_1})^{-1}(c + \sigma_e^2 s_{j_1}\tilde x_{\cdot j_1}) \right]' (I_T - 2\sigma_e^2 s_{i_1}B_{i_1}) \left[ v - (I_T - 2\sigma_e^2 s_{i_1}B_{i_1})^{-1}(c + \sigma_e^2 s_{j_1}\tilde x_{\cdot j_1}) \right] \\
&\qquad + \frac{1}{2\sigma_e^2}(c + \sigma_e^2 s_{j_1}\tilde x_{\cdot j_1})'(I_T - 2\sigma_e^2 s_{i_1}B_{i_1})^{-1}(c + \sigma_e^2 s_{j_1}\tilde x_{\cdot j_1}) - \frac{1}{2\sigma_e^2}c'c.
\end{align*}
Then, the corresponding moment generating function can be written as
\begin{align*}
M(s_{i_1}, s_{j_1}) &= E\exp\left( s_{i_1}v'B_{i_1}v + s_{j_1}\tilde x_{\cdot j_1}'v \right) \\
&= \left| I - 2s_{i_1}\sigma_e^2 B_{i_1} \right|^{-1/2} \exp\left( \frac{1}{2\sigma_e^2}(c + \sigma_e^2 s_{j_1}\tilde x_{\cdot j_1})'(I_T - 2\sigma_e^2 s_{i_1}B_{i_1})^{-1}(c + \sigma_e^2 s_{j_1}\tilde x_{\cdot j_1}) - \frac{1}{2\sigma_e^2}c'c \right).
\end{align*}
Immediately, using the matrix differentials $D\det(B) = \det(B)\operatorname{tr}\left( B^{-1}DB \right)$ and $DB^{-1} = -B^{-1}(DB)B^{-1}$, we get
\begin{align*}
E\left( (v'B_{i_1}v)(\tilde x_{\cdot j_1}'v) \right) &= D^2_{s_{i_1}s_{j_1}}M(s_{i_1}, s_{j_1})|_{s_{i_1}=0,\, s_{j_1}=0} \\
&= \sigma_e^2 \operatorname{tr}(B_{i_1})\left( \tilde x_{\cdot j_1}'c \right) + \left( \tilde x_{\cdot j_1}'c \right)(c'B_{i_1}c) + 2\sigma_e^2 \tilde x_{\cdot j_1}'B_{i_1}c.
\end{align*}
Therefore, we have $\operatorname{cov}\left( v'B_{i_1}v, \tilde x_{\cdot j_1}'v \right) = E[(v'B_{i_1}v)(\tilde x_{\cdot j_1}'v)] - E(v'B_{i_1}v)E\left( \tilde x_{\cdot j_1}'v \right) = 2\sigma_e^2 \tilde x_{\cdot j_1}'B_{i_1}c$. Equation (27) follows trivially. Finally, we note that, in expression (25), $\frac{1}{T}\operatorname{tr}\left( C^{-1}H_{i_1}C^{-1}C^{-1}H_{i_2}C^{-1} \right)$ is a component of $D_{0i_10i_2}R_T$ and $\frac{1}{T}c'C^{-1}H_{i_1}C^{-1}C^{-1}H_{i_2}C^{-1}c$ is a component of $D_{i_10i_2}U_T$. Therefore, (25) converges uniformly following Propositions 2 and 3. Uniform convergence of (26) and (27) results from Proposition 4 and Assumption 4, respectively.
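The quadratic-form covariance identity obtained above from the moment generating function, $\operatorname{cov}(v'B_1v, v'B_2v) = 2\sigma^4\operatorname{tr}(B_1B_2) + 4\sigma^2 c'B_1B_2c$ for $v \sim N(c, \sigma^2 I)$ and symmetric $B_1, B_2$, is easy to verify numerically. The sketch below uses small arbitrary symmetric matrices (illustrative values, not the model's $\frac{1}{\sqrt{T}}C^{-1}H_iC^{-1}$) and compares the closed form with a Monte Carlo estimate:

```python
# Numerical sanity check (not from the paper) of the covariance identity
#   cov(v'B1 v, v'B2 v) = 2*s^4*tr(B1 B2) + 4*s^2 * c' B1 B2 c,  v ~ N(c, s^2 I).
import random

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def quad(v, B):  # v'Bv
    n = len(v)
    return sum(v[i] * B[i][j] * v[j] for i in range(n) for j in range(n))

random.seed(12345)
c = [1.0, -0.5, 2.0]
s = 0.7
B1 = [[1.0, 0.2, 0.0], [0.2, 0.5, -0.1], [0.0, -0.1, 0.3]]   # symmetric example
B2 = [[0.4, -0.3, 0.1], [-0.3, 1.0, 0.0], [0.1, 0.0, 0.6]]   # symmetric example

# Closed-form covariance
B12 = matmul(B1, B2)
tr = sum(B12[i][i] for i in range(3))
cBc = sum(c[i] * B12[i][j] * c[j] for i in range(3) for j in range(3))
theory = 2 * s**4 * tr + 4 * s**2 * cBc

# Monte Carlo covariance
N = 100_000
q1 = q2 = q12 = 0.0
for _ in range(N):
    v = [ci + s * random.gauss(0.0, 1.0) for ci in c]
    a, b = quad(v, B1), quad(v, B2)
    q1 += a; q2 += b; q12 += a * b
mc = q12 / N - (q1 / N) * (q2 / N)

print(theory, mc)  # the two numbers should be close
```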
Theorem A.1 (adapted from Davidson (1994)) Let $\{g_{t,T}\}_{t=1}^T$ be a triangular array of $p$-dimensional random vectors and let $\frac{1}{T}\operatorname{var}\left( \sum_{t=1}^T g_{t,T} \right)$ converge to a positive semi-definite matrix $\Sigma$. If, for all $\alpha \in \mathbb{R}^p$ satisfying $\alpha'\alpha = 1$, $\frac{1}{\sqrt{T}}\sum_{t=1}^T \alpha' g_{t,T}$ converges in distribution to a normal random variable as $T \to \infty$ and $E\left( \sum_{t=1}^T g_{t,T} \right) = 0_p$, then $\frac{1}{\sqrt{T}}\sum_{t=1}^T g_{t,T} \xrightarrow{d} N(0_p, \Sigma)$.
Proof of Theorem 5 Building on Theorem A.1, let $p = k + I + 1$ and notice that $g_T(\hat\theta, \hat\beta) = 0_p$. In Theorem 4, we have proven the consistency of $(\hat\theta, \hat\beta)$ with respect to $(\theta^*, \beta^*)$. To prove the asymptotic normality of $g_T(\theta^*, \beta^*)$, we need to show that $g_T(\theta, \beta)$ is normally distributed and, given the equicontinuity of the function $g_T(\theta,\beta)$ together with the consistency property, the desired result will follow. Note that $E(g_T(\theta^*,\beta^*)) = 0_p$ and, by Proposition 5, $\operatorname{var}(g_T(\theta^*,\beta^*))$ converges uniformly to a positive semi-definite matrix $\Sigma^*$ as $T \to \infty$. The normality of the last $k$ rows of $g_T(\theta,\beta)$, given by $D_\beta m_T(\beta)$, is a direct consequence of the normality of $y$. We need to show the asymptotic normality of the first $I+1$ rows of $g_T(\theta,\beta)$, given by $D_\theta Q_T(\theta;\beta)$. Using Theorem A.1, we define $\frac{1}{T}\sum_{t=1}^T g_{t,T} = D_\theta Q_T(\theta,\beta)$. Let $C_0$ denote a non-stochastic term that may depend on $\theta$ and $\beta$. For any $\alpha \in \mathbb{R}^{I+1}$ we have
\begin{align*}
\sqrt{T}\sum_{i=0}^I \alpha_i D_i Q_T(\theta,\beta) &= \frac{1}{\sqrt{T}}(\psi(X) - X\beta)'C^{-1}\left\{ \sum_{i=0}^I \alpha_i H_i \right\} C^{-1}e + \frac{1}{2\sqrt{T}} e'C^{-1}\left\{ \sum_{i=0}^I \alpha_i H_i \right\} C^{-1}e + C_0 \\
&= \frac{1}{\sqrt{T}}\sum_{t=1}^T \left( \gamma_t \sigma_e \delta_t z_t + \tfrac{1}{2}\gamma_t \sigma_e^2 z_t^2 \right) + C_0,
\end{align*}
where $\gamma_t$, $t = 1,\ldots,T$, are the eigenvalues of $C^{-1}\left\{ \sum_{i=0}^I \alpha_i H_i \right\}C^{-1}$, $\nu_t$ are the corresponding eigenvectors such that $\nu_t'\nu_t = 1$, $z_t \equiv \nu_t'e/\sigma_e$ with $z_t \sim N(0,1)$, and $\delta_t \equiv \nu_t'(\psi(X) - X\beta)$. Notice that, by the Cauchy--Schwarz inequality,
\[
\frac{1}{\sqrt{T}}|\delta_t| \le \sqrt{\frac{1}{T}(\psi(X) - X\beta)'(\psi(X) - X\beta)} = \sqrt{\frac{1}{T}\sum_{t=1}^T \left( \psi(x_{t\cdot}) - x_{t\cdot}\beta \right)^2}. \tag{49}
\]
By Assumption 4 this implies that there exists a constant $C_1 < \infty$ such that
\[
\frac{1}{\sqrt{T}}|\delta_t| \le C_1,
\]
for any $T$ and $t = 1,\ldots,T$ on $B$. Next we will show that the condition
\[
|\gamma_t| \le C_2 < \infty, \tag{50}
\]
is satisfied for all $t$. Let $\kappa_t(A)$ denote the $t$'th eigenvalue of the $T \times T$ matrix $A$, where $\kappa_1(A) \le \kappa_2(A) \le \cdots \le \kappa_T(A)$. From Lutkepohl (1996, page 66, expression 13.a), we can write
\[
\kappa_t\!\left( C^{-1}\alpha_i H_i C^{-1} \right) = \alpha_i \kappa_t\!\left( C^{-1}H_i C^{-1} \right), \tag{51}
\]
and, from Weyl's theorem (Lutkepohl, 1996, page 162), we have
\[
\gamma_T \equiv \kappa_T\!\left( C^{-1}\left\{ \sum_{i=0}^I \alpha_i H_i \right\} C^{-1} \right) \le \sum_{i=0}^I \kappa_T\!\left( C^{-1}\alpha_i H_i C^{-1} \right) = \sum_{i=0}^I \alpha_i \kappa_T\!\left( C^{-1}H_i C^{-1} \right),
\]
where the last equality follows from (51). By the triangle inequality we can write
\[
|\gamma_T| \le \sum_{i=0}^I |\alpha_i| \left| \kappa_T\!\left( C^{-1}H_i C^{-1} \right) \right|.
\]
Since $\kappa_t\!\left( C^{-1}H_i C^{-1} \right) \ge 0$ for all $i = 0,1,\ldots,I$ and all $t = 1,2,\ldots,T$, condition (50) will hold if $\kappa_T\!\left( C^{-1}H_i C^{-1} \right)$ is bounded from above for all values of $i$. Since $C^{-1}H_i C^{-1}$ and $\frac{1}{\lambda_0\lambda_i}I - C^{-1}H_i C^{-1}$ are both positive definite (real symmetric) matrices for all $i$ (and $t$), we have that (Lutkepohl, 1996, page 162)
\[
\kappa_T\!\left( C^{-1}H_i C^{-1} \right) \le \frac{1}{\lambda_0\lambda_i} < \infty,
\]
for all $i = 1,2,\ldots,I$, and we conclude that (50) holds (the last inequality follows from Assumption ii.). In order to show asymptotic normality of $\sqrt{T}\sum_{i=0}^I \alpha_i D_i Q_T(\theta,\beta)$, we need to check the moment condition in Liapunov's Theorem (Theorem 23.11 in Davidson (1994)), i.e., verify that the following condition is satisfied:
\[
\lim_{T\to\infty} \frac{1}{T\sqrt{T}} \sum_{t=1}^T E\left| \gamma_t\sigma_e\delta_t z_t + \tfrac{1}{2}\gamma_t\sigma_e^2 z_t^2 \right|^3 = 0. \tag{52}
\]
Define $C_3 \equiv \frac{1}{T\sqrt{T}}\sum_{t=1}^T E\left| \gamma_t\sigma_e\delta_t z_t + \tfrac{1}{2}\gamma_t\sigma_e^2 z_t^2 \right|^3$. We have
\begin{align*}
C_3 &\le \frac{1}{T\sqrt{T}}\sum_{t=1}^T E\left( |\gamma_t\sigma_e\delta_t z_t| + \left| \tfrac{1}{2}\gamma_t\sigma_e^2 z_t^2 \right| \right)^3 \\
&= \frac{1}{T\sqrt{T}}\sum_{t=1}^T \left( E|\gamma_t\sigma_e\delta_t z_t|^3 + E\left| \tfrac{1}{2}\gamma_t\sigma_e^2 z_t^2 \right|^3 + \frac{3}{2}E|\gamma_t\sigma_e\delta_t z_t|^2 \left| \gamma_t\sigma_e^2 z_t^2 \right| + \frac{3}{4}E|\gamma_t\sigma_e\delta_t z_t| \left| \gamma_t\sigma_e^2 z_t^2 \right|^2 \right) \\
&= \frac{1}{T\sqrt{T}}\sum_{t=1}^T \left( |\gamma_t|^3\sigma_e^3|\delta_t|^3 E|z_t|^3 + \frac{1}{8}|\gamma_t|^3\sigma_e^6 E|z_t|^6 + \frac{3}{2}|\gamma_t|^3|\delta_t|^2\sigma_e^4 E|z_t|^4 + \frac{3}{4}|\gamma_t|^3|\delta_t|\sigma_e^5 E|z_t|^5 \right) \\
&\le C_2 C_1 \sigma_e^3 E|z_t|^3\, \frac{1}{T}\sum_{t=1}^T \gamma_t^2\delta_t^2 + \frac{C_2^3\sigma_e^6 E|z_t|^6}{8\sqrt{T}} + \frac{3}{2}C_2^2 C_1 \sigma_e^4 E|z_t|^4\, \frac{1}{T}\sum_{t=1}^T |\gamma_t||\delta_t| + \frac{3}{4}\frac{C_2^2\sigma_e^5 E|z_t|^5}{\sqrt{T}}\, \frac{1}{T}\sum_{t=1}^T |\gamma_t||\delta_t|.
\end{align*}
Since $z_t$ is a Gaussian random variable, $E|z_t|^j$ is bounded for all $j \in \mathbb{N}$. For (52) to be satisfied, it is sufficient to show that the following two conditions hold:
\begin{align}
\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^T \gamma_t^2\delta_t^2 &= 0 \tag{53} \\
\lim_{T\to\infty} \frac{1}{T}\sum_{t=1}^T |\gamma_t||\delta_t| &= 0. \tag{54}
\end{align}
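The Liapunov bound above relies on the Gaussian absolute moments $E|z_t|^j$ being finite; for $z \sim N(0,1)$ they have the closed form $E|z|^p = 2^{p/2}\,\Gamma((p+1)/2)/\sqrt{\pi}$, giving in particular $E z^2 = 1$, $E|z|^4 = 3$, and $E|z|^6 = 15$. A stand-alone cross-check of this fact (an illustration, not part of the proof):

```python
# Sketch (not from the paper): finite absolute moments of a standard normal.
# Closed form: E|z|^p = 2^(p/2) * Gamma((p+1)/2) / sqrt(pi), cross-checked
# against a direct numerical integral of |z|^p * phi(z).
import math

def abs_moment_closed_form(p):
    return 2 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)

def abs_moment_numeric(p, lo=-12.0, hi=12.0, n=20_000):
    # simple trapezoidal rule; the integrand is negligible beyond |z| = 12
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        z = lo + i * h
        f = abs(z) ** p * math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
        total += f if 0 < i < n else f / 2
    return total * h

for p in range(1, 7):
    print(p, abs_moment_closed_form(p), abs_moment_numeric(p))
```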
To verify (53), notice that we can write the condition as
\begin{align*}
\frac{1}{T}\sum_{t=1}^T \gamma_t^2\delta_t^2 &= \frac{1}{T}\operatorname{tr}\left( c'C^{-1}\left\{ \sum_{i=0}^I \alpha_i H_i \right\} C^{-1}C^{-1}\left\{ \sum_{i=0}^I \alpha_i H_i \right\} C^{-1}c \right) \\
&= \frac{1}{T}\sum_{i=0}^I \alpha_i^2 \operatorname{tr}\left( c'C^{-1}H_i C^{-1}C^{-1}H_i C^{-1}c \right) + \frac{1}{T}\sum_{i=0}^I \sum_{j>i} \alpha_i\alpha_j \operatorname{tr}\left( c'C^{-1}H_i C^{-1}C^{-1}H_j C^{-1}c \right) \\
&\qquad + \frac{1}{T}\sum_{j=0}^I \sum_{i>j} \alpha_i\alpha_j \operatorname{tr}\left( c'C^{-1}H_i C^{-1}C^{-1}H_j C^{-1}c \right).
\end{align*}
Define
\[
\Upsilon_{ijT} \equiv \frac{1}{T}\operatorname{tr}\left( c'C^{-1}H_i C^{-1}C^{-1}H_j C^{-1}c \right),
\]
and notice that by the Cauchy--Schwarz inequality (Lutkepohl, 1996, page 43)
\[
|\Upsilon_{ijT}| \le \sqrt{\Upsilon_{iiT}}\,\sqrt{\Upsilon_{jjT}}.
\]
From this inequality, it is easy to verify that $\lim_{T\to\infty} |\Upsilon_{ijT}| = 0$, since
\[
\lim_{T\to\infty} \Upsilon_{iiT} \le \lim_{T\to\infty} \frac{1}{T}\frac{1}{\lambda_i^2}\operatorname{tr}\left( c'C^{-1}C^{-1}c \right) = -\frac{1}{\lambda_i^2}\lim_{T\to\infty} D_0U_T = 0,
\]
for all $i = 0,1,2,\ldots,I$, where $\lim_{T\to\infty} D_0U_T = 0$ by Proposition 1 and $\lambda_i$, $i = 0,1,2,\ldots,I$, is bounded away from zero by Assumption ii. Furthermore, since $\sum_{i=0}^I \alpha_i^2 = 1$ (see Theorem A.1), it follows that $|\alpha_i| \le 1$ and $|\alpha_i\alpha_j| \le 1$ for all $i, j = 0,1,\ldots,I$. Consequently, we can write
\[
\lim_{T\to\infty} \left| \frac{1}{T}\sum_{t=1}^T \gamma_t^2\delta_t^2 \right| \le \sum_{i=0}^I \alpha_i^2 \lim_{T\to\infty} |\Upsilon_{iiT}| + \sum_{i=0}^I \sum_{j>i} |\alpha_i\alpha_j| \lim_{T\to\infty} |\Upsilon_{ijT}| + \sum_{j=0}^I \sum_{i>j} |\alpha_i\alpha_j| \lim_{T\to\infty} |\Upsilon_{ijT}| = 0,
\]
and condition (53) is verified. To verify condition (54), we use Cauchy's inequality and obtain
\[
\frac{1}{T}\sum_{t=1}^T |\gamma_t||\delta_t| \le \sqrt{\frac{1}{T}\sum \gamma_t^2\delta_t^2},
\]
and the desired result follows. Finally, the asymptotic variance $\Sigma^*$ is obtained once we show that
\[
\lim_{T\to\infty} D_{i_1}M_{x_{\cdot j_1}x_{\cdot j_1}}(\theta) \to 0, \tag{55}
\]
uniformly on $\Theta \times B \times A$ for all $j_1 = 1,2,\ldots,k$. To show condition (55), notice that
\[
D_i M_{x_{\cdot j}x_{\cdot j}}(\theta) = -\frac{1}{T}\operatorname{tr}\left( x_{\cdot j}'C^{-1}H_i C^{-1}x_{\cdot j} \right).
\]
Then
\begin{align*}
\left| \lim_{T\to\infty} D_i M_{x_{\cdot j}x_{\cdot j}}(\theta) \right| &\le \sqrt{\lim_{T\to\infty} \frac{1}{T}\operatorname{tr}\left( x_{\cdot j}'C^{-1}H_i C^{-1}C^{-1}H_i C^{-1}x_{\cdot j} \right)}\, \sqrt{\lim_{T\to\infty} \frac{1}{T}\operatorname{tr}\left( x_{\cdot j}'x_{\cdot j} \right)} \\
&\le \sqrt{\frac{1}{\lambda_i^2}\lim_{T\to\infty} \frac{1}{T}\operatorname{tr}\left( x_{\cdot j}'C^{-1}C^{-1}x_{\cdot j} \right)}\, \sqrt{\lim_{T\to\infty} \frac{1}{T} x_{\cdot j}'x_{\cdot j}} = 0,
\end{align*}
for all $i = 0,1,\ldots,I$ and $j = 1,2,\ldots,k$, where the last equality follows from Proposition 4 (the first term converges to zero) and Assumption 4 (the last term converges to a finite constant). This completes the proof.
Proof of Theorem 6 First, we need to show the convergence of the matrices $D^2_{\theta\theta}Q_T(\theta,\beta)$, $D^2_{\beta\beta}m_T(\beta)$, and $D^2_{\theta\beta}Q_T(\theta,\beta)$. The convergence of $D^2_{\theta\theta}Q_T(\theta,\beta)$ is established in Theorem 3. The convergence of $D^2_{\beta\beta}m_T(\beta)$ follows trivially from Assumption 4. We need to prove the convergence of $D^2_{\theta\beta}Q_T(\theta,\beta)$. Consider a typical element of $D^2_{\theta\beta}Q_T(\theta,\beta)$ given by $D^2_{i_1j_1}Q_T(\theta,\beta)$ for $i_1 = 0,1,\ldots,I$ and $j_1 = 1,2,\ldots,k$. We can write
\begin{align*}
D^2_{i_1j_1}Q_T(\theta,\beta) &= -\frac{1}{T}x_{\cdot j_1}'C^{-1}H_{i_1}C^{-1}c - \frac{1}{T}x_{\cdot j_1}'C^{-1}H_{i_1}C^{-1}e \\
&= -D_{i_1}M_{\psi x_{\cdot j_1}} - \sum_{j=1}^k \beta_j D_{i_1}M_{x_{\cdot j_1}x_{\cdot j}} - \frac{1}{T}x_{\cdot j_1}'C^{-1}H_{i_1}C^{-1}e,
\end{align*}
where the first term converges as $T \to \infty$ according to Proposition 4 for all $i_1, j_1$, and the last two terms converge to zero by Proposition 5 and Assumption 3, respectively.
Now, define $\zeta = (\theta',\beta')'$ and let $Q_T(\zeta)$ and $m_T(\beta)$ be given by (8) and (22), respectively. Under Assumptions 1--4, the following conditions are satisfied: i. $\hat\zeta_T \xrightarrow{p} \zeta^*$ (by Theorem 4). ii. $Q_T(\zeta)$ and $m_T(\beta)$ are twice continuously differentiable. iii. $\sqrt{T}g_T(\zeta^*) = \left( \sqrt{T}D_\theta Q_T(\zeta^*)', \sqrt{T}D_\beta m_T(\beta^*)' \right)'$ converges in distribution to a normal random variable $N(0,\Sigma^*)$ (by Theorem 5). iv. $D^2_{\theta\theta}Q_T(\zeta)$, $D^2_{\beta\beta}m_T(\beta)$ and $D^2_{\theta\beta}Q_T(\zeta)$ converge to nonsingular matrices for any $\zeta$ in a neighborhood of $\zeta^*$. Conditions i.--iv. are sufficient to obtain the desired result (Dahl and Qin, 2004, Theorem 9).
Proof of Theorem 7 Derive the first and second moments of $\frac{\varepsilon'\varepsilon}{\sqrt{T}}$:
\begin{align*}
\frac{1}{T}E(\varepsilon'\varepsilon) &= \sigma^2 \frac{1}{T}c'C^{-1}C^{-1}c + \sigma^2 \frac{1}{T}E\left( e'C^{-1}C^{-1}e \right) \\
&= -\sigma^2 D_0U_T - \sigma^2\sigma_e^2 D^2_{00}R_T,
\end{align*}
then
\[
E\left( \frac{\varepsilon'\varepsilon}{\sqrt{T}} \right) = -\sigma^2\sqrt{T}D_0U_T - \sigma^2\sigma_e^2\sqrt{T}D^2_{00}R_T.
\]
Notice that
\[
\frac{1}{T}\varepsilon'\varepsilon = -\sigma^2 D_0U_T + 2\sigma^2 \frac{1}{T}c'C^{-1}C^{-1}e + \sigma^2 \frac{1}{T}e'C^{-1}C^{-1}e. \tag{56}
\]
Now, let the eigenvalues and eigenvectors of $C^{-1}$ be $\gamma_t$ and $\nu_t$, for $t = 1,\ldots,T$, respectively. Let $z_t = \nu_t'e/\sigma_e \sim N(0,1)$ and $a_t = \nu_t'c$, $t = 1,\ldots,T$. Then (56) can be written as
\[
\frac{1}{T}\varepsilon'\varepsilon = -\sigma^2 D_0U_T + 2\sigma^2\sigma_e \frac{1}{T}\sum_{t=1}^T a_t\gamma_t^2 z_t + \sigma^2\sigma_e^2 \frac{1}{T}\sum_{t=1}^T \gamma_t^2 z_t^2. \tag{57}
\]
Using that $\operatorname{cov}(z_t, z_t^2) = 0$ and $\operatorname{var}(z_t^2) = 2$, we write
\begin{align*}
\operatorname{var}\left( \frac{1}{\sqrt{T}}\varepsilon'\varepsilon \right) &= \frac{4\sigma^4\sigma_e^2}{T}\sum_{t=1}^T a_t^2\gamma_t^4 \operatorname{var}(z_t) + \frac{4\sigma^4\sigma_e^3}{T}\sum_{t=1}^T a_t\gamma_t^4 \operatorname{cov}(z_t, z_t^2) + \frac{\sigma^4\sigma_e^4}{T}\sum_{t=1}^T \gamma_t^4 \operatorname{var}(z_t^2) \\
&= -\frac{2}{3}\sigma^4\sigma_e^2 D^3_{000}U_T - \frac{1}{3}\sigma^4\sigma_e^4 D^4_{0000}R_T. \tag{58}
\end{align*}
Asymptotically,
\[
\lim_{T\to\infty} \operatorname{var}\left( \frac{1}{\sqrt{T}}\varepsilon'\varepsilon \right) \to -\frac{1}{3}\sigma^4\sigma_e^4 D^4_{0000}R,
\]
by Propositions 1, 2, and 3. The proof of asymptotic normality of $\frac{\varepsilon'\varepsilon}{\sqrt{T}}$ follows in a fashion similar to that of Theorem 5. The asymptotic normality of the last term of (56), which is a multiple of $D_0Q_T(\theta,\beta)$, follows immediately from Theorem 5. The second term of (56) is already a normal random variable by Assumption 3.
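The moment facts used in (58), $\operatorname{cov}(z_t, z_t^2) = 0$ and $\operatorname{var}(z_t^2) = 2$, imply $\operatorname{var}(bz + cz^2) = b^2 + 2c^2$ for any constants $b, c$, which is exactly the structure of each summand in (57). A quick Monte Carlo illustration with arbitrary example values (not the paper's quantities):

```python
# Minimal stand-alone check of the moment facts behind (58): for z ~ N(0,1),
# cov(z, z^2) = 0 and var(z^2) = 2, hence var(b*z + c*z^2) = b^2 + 2*c^2.
import random

random.seed(7)
b, c = 1.3, -0.4          # arbitrary example coefficients
N = 200_000
s1 = s2 = 0.0
for _ in range(N):
    z = random.gauss(0.0, 1.0)
    w = b * z + c * z * z
    s1 += w
    s2 += w * w
mc_var = s2 / N - (s1 / N) ** 2
theory_var = b * b + 2 * c * c

print(theory_var, mc_var)  # the two should be close
```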
Proof of Corollary 4 We standardize the asymptotically normal random variable on the left-hand side of (30) by the actual standard deviation of $\frac{1}{\sqrt{T}}\varepsilon'\varepsilon$ in (58) and get
\[
\frac{\frac{1}{\sqrt{T}}\left( \varepsilon'\varepsilon + \sigma^2 T D_0U_T + \sigma^2\sigma_e^2 T D^2_{00}R_T \right)}{\sqrt{-\frac{2}{3}\sigma^4\sigma_e^2 D^3_{000}U_T - \frac{1}{3}\sigma^4\sigma_e^4 D^4_{0000}R_T}} \overset{a}{\sim} N(0,1).
\]
We replace the unknown parameters in the above expression by their consistent estimates. Then we multiply the numerator and the denominator of the left-hand side of the above expression by $\frac{1}{\sqrt{T}}$. After removing the term $\sigma^2 D_0U_T$, which converges to 0, we get the test $Res_T^{DGQ}$.