building neural network models for time series ... - fgv epgeepge.fgv.br/files/1108.pdf · building...

Building Neural Network Models for Time Series:

A Statistical Approach �

Marcelo C. MedeirosDepartment of Economics, Catholic University of Rio de Janeiro

Timo Ter�asvirtaDepartment of Economic Statistics, Stockholm School of Economics

Gianluigi RechQuantitative Analysis, Electrabel, Louvain-la-Neuve, Belgium

SSE/EFI Working Paper Series in Economics and Finance No 508

September 3, 2002

Abstract

This paper is concerned with modelling time series by single hidden layer feedforward neural networkmodels. A coherent modelling strategy based on statistical inference is presented. Variable selection iscarried out using existing techniques. The problem of selecting the number of hidden units is solvedby sequentially applying Lagrange multiplier type tests, with the aim of avoiding the estimation ofunidenti�ed models. Misspeci�cation tests are derived for evaluating an estimated neural networkmodel. A small-sample simulation experiment is carried out to show how the proposed modellingstrategy works and how the misspeci�cation tests behave in small samples. Two applications to realtime series, one univariate and the other multivariate, are considered as well. Sets of one-step-aheadforecasts are constructed and forecast accuracy is compared with that of other nonlinear models appliedto the same series.

Keywords: Model misspeci�cation, neural computing, nonlinear forecasting, nonlinear time series,smooth transition autoregression, sunspot series, threshold autoregression, �nancial prediction.

JEL Classi�cation Codes: C22, C51, C52, C61, G12Acknowledgments: This research has been supported by the Tore Browaldh's Foundation. The

research of the �rst author has been partially supported by CNPq. The paper is partly based onChapter 2 of the PhD thesis of the third author. A part of the work was carried out during the visitsof the �rst author to the Department of Economic Statistics, Stockholm School of Economics and thesecond author to the Department of Economics, PUC-Rio. The hospitality of these departments isgratefully acknowledged. Material from this paper has been presented at the Fifth Brazilian Conferenceon Neural Networks, Rio de Janeiro, April 2001, the 20th International Symposium on Forecasting,Dublin, June 2002, and seminars at CORE (Louvain-la-Neuve), Monash University (Clayton, VIC),Swedish School of Economics (Helsinki), University of California, San Diego, Cornell University (Ithaca,NY), Federal Univeristy of Rio de Janeiro, and Catholic University of Rio de Janeiro. We wish to thankthe participants of these occasions, Hal White in particular, for helpful comments. Our thanks also goto Chris Chat�eld and Dick van Dijk for useful remarks, and Allan Timmermann for the data used inthe second empirical example of the paper. The responsibility for any errors or shortcomings in thepaper remains ours.

�Address for correspondence: Timo Ter�asvirta, Department of Economic Statistics, Stockholm School of Eco-nomics, BOX 6501, SE-113 83, Stockholm, Sweden. E-mail: [email protected].

1

1 Introduction

Alternatives to linear models in econometric and time series modelling have increased in popularity

in recent years. Nonparametric models that do not make assumptions about the parametric form of

the functional relationship between the variables to be modelled have become more easily applicable

due to computational advances and increased computational power. Another class of models,

the exible functional forms, o�ers an alternative that in fact also leaves the functional form of

the relationship unspeci�ed. While these models do contain parameters, often a large number of

them, the parameters are not globally identi�ed or, using the statistical terminology, estimable.

Identi�cation or estimability, if achieved, is local at best without additional parameter restrictions.

The parameters are not interpretable either as they often are in parametric models.

The arti�cial neural network (ANN) model is a prominent example of such a exible functional

form. It has found applications in a number of �elds, including economics. Kuan and White (1994)

surveyed the use of ANN models in (macro)economics, and several �nancial applications appeared

in a recent special issue of IEEE Transactions on Neural Networks (Abu-Mostafa, Atiya, Magdon-

Ismail and White 2001). The use of the ANN model in applied work is generally motivated by a

mathematical result stating that under mild regularity conditions, a relatively simple ANN model

is capable of approximating any Borel-measurable function to any given degree of accuracy; see, for

example, Funahashi (1989), Cybenko (1989), Hornik, Stinchombe, and White (1989,1990), White

(1990), or Gallant and White (1992). Such an approximator would still contain a �nite number of

parameters. How to specify such a model, that is, how to �nd the right combination of parameters

and variables, is a central topic in the ANN literature and has been considered in a large number

of books such as Bishop (1995), Ripley (1996), Fine (1999), Haykin (1999), or Reed and Marks II

(1999), and articles. Many popular speci�cation techniques are \general-to-speci�c" or \top-down"

procedures: the investigator begins with a large model and applies appropriate algorithms to reduce

the number of parameters using a predetermined stopping-rule. Such algorithms usually do not

rely on statistical inference.

In this paper, we propose a coherent modelling strategy for simple single hidden-layer feedfor-

ward ANN time series models. These models discussed here are univariate, but adding exogenous

2

regressors to them does not pose problems. The di�erence between our strategy and the general-to-

speci�c approaches is that ours works in the opposite direction, from speci�c to general. We begin

with a small model and expand that according to a set of predetermined rules. The reason for

this is that we view our ANN model as a statistical nonlinear model and apply statistical inference

to the problem of specifying the model or, as the ANN experts express it, �nding the network

architecture. We shall argue in the paper that proper statistical inference is not available if we

choose to proceed from large models to smaller ones, from general to speci�c. Our \bottom-up"

strategy builds partly on early work by Ter�asvirta and Lin (1993). More recently, Anders and Korn

(1999) presented a strategy that shares certain features with our procedure. Swanson and White

(1995,1997a,1997b) also developed and applied a speci�c-to-general strategy that deserves mention

here. Balkin and Ord (2000) proposed an inference-based method for selecting the number of hid-

den units in the ANN model. Zapranis and Refenes (1999) developed a computer intensive strategy

based on statistics to select the variables and the number of hidden units of the ANN model. Our

aim has been to develop a strategy that minimizes the amount of computation required to reach

the �nal speci�cation and, furthermore, contains an in-sample evaluation of the estimated model.

We shall consider the di�erences between our strategy and the other ones mentioned here in later

sections of the paper.

The plan of the paper is as follows. Section 2 describes the model and Section 3 discusses

geometric and statistical interpretations for it. A model speci�cation strategy, consisting of speci-

�cation, estimation, and evaluation of the model is described in Section 4. The results concerning

a Monte-Carlo experiment are reported in Section 5 and two applications with real data sets are

presented in Section 6. Section 7 contains concluding remarks.

2 The Autoregressive Neural Network Model

The AutoRegressive Neural Network (AR-NN) model is de�ned as

yt = G(xt; ) + "t = �0~xt +

hXi=1

�iF (~!0

ixt � �i) + "t (1)

3

where G(xt; ) is a nonlinear function of the variables xt with parameter vector 2 R(q+2)h+q+1

de�ned as = [�0; �1; : : : ; �h; ~!0

1; : : : ; ~!0

h; �1; : : : ; �h]0. The vector ~xt 2 R

q+1 is de�ned as ~xt =

[1;x0t]0, where xt 2 R

q is a vector of lagged values of yt and/or some exogenous variables. The

function F ( ~!0ixt � �i), often called the activation function, is the logistic function

F (~!0ixt � �i) =�1 + e�( ~!

0

ixt��i)

��1

(2)

where ~!i = [~!1i; : : : ; ~!qi]0 2 R

q and �i 2 R, and the linear combination of these functions in (1)

forms the so-called hidden layer. Model (1) with (2) does not contain lags of "t and is therefore

called a feedforward NN model. For other choices of the activation function, see Chen, Racine and

Swanson (2001). Furthermore, f"tg is a sequence of independently normally distributed random

variables with zero mean and variance �2. The nonlinear function F (~!0ixt � �i) is usually called a

hidden neuron or a hidden unit. The normality assumption enables us to de�ne the log-likelihood

function, which is required for the statistical inference we need, but it can be relaxed.

3 Geometric and Statistical Interpretation

The characterization and tasks of a layer of hidden neurons in an AR-NN model are discussed

in several textbooks such as Bishop (1995), Haykin (1999), Fine (1999), and Reed and Marks II

(1999). It is nevertheless important to review some concepts in order to compare the AR-NN model

with other well-known nonlinear time series models.

Consider the output F ( ~!0ixt � �i) of a unit of the hidden layer of a neural network as de�ned

in (1) and (2). When ~!0ixt = �i, the parameters ~!i and �i de�ne a hyperplane in a q-dimensional

Euclidean space

H = fxt 2 Rq j~!0ixt = �ig: (3)

The direction of ~!i determines the orientation of the hyperplane and the scalar term �i=k~!ik the

position of the hyperplane in terms of its distance from the origin. A hyperplane induces a partition

of the space Rq into two regions de�ned by the halfspaces

H+ = fxt 2 R

q j~!0ixt � �ig (4)

4

and

H� = fxt 2 R

q j~!0ixt < �ig; (5)

associated to the states of the neuron. With h hyperplanes, a q-dimensional space will be split into

several polyhedral regions. Each region is de�ned by the nonempty intersection of the halfspaces (4)

and (5) of each hyperplane. Hyperplanes lying parallel to each other constitute a special case. In

this situation, ~!i � ~!, i = 1; : : : ; h, and equation (1) becomes

yt = �0~xt +

hXi=1

�iF (~!0xt � �i) + "t: (6)

The input space is thus split in h+ 1 regions.

Certain special cases of (1) are of interest. When xt = yt�d in F , model (1) becomes a multiple

logistic smooth transition autoregressive (MLSTAR) model with h + 1 regimes in which only the

intercept changes according to the regime. The resulting model is expressed as

yt = �0~xt +

hXi=1

�iF ( i (yt�d � ci)) + "t; (7)

where i = ~!i1 and ci = �i=~!i1. When h = 1, equation (7) de�nes a special case of an ordinary

LSTAR model considered in Ter�asvirta (1994). When i ! 1, i = 1; : : : ; h, model (7) becomes

a Self-Exciting Threshold AutoRegressive (SETAR) model with a switching intercept and h + 1

regimes. In another important special case xt = t in F . In this situation the intercept of a linear

model changes smoothly as a function of time. A linear model with h structural shifts in the

intercept is obtained as i !1, i = 1; : : : ; h.

An AR-NN model can thus be either interpreted as a semi-parametric approximation to any

Borel-measurable function or as an extension of the MLSTAR model where the transition variable

can be a linear combination of stochastic variables. We should however stress the fact that model

(1) is, in principle, neither globally nor locally identi�ed. Three characteristics of the model imply

non-identi�ability. The �rst one is the exchangeability property of the AR-NN model. The value

in the likelihood function of the model remains unchanged if we permute the hidden units. This

results in h! di�erent models that are indistinguishable from each other and in h! equal local maxima

5

of the log-likelihood function. The second characteristic is that in (2), F (x) = 1 � F (�x). This

yields two observationally equivalent parametrizations for each hidden unit. Finally, the presence

of irrelevant hidden units is a problem. If model (1) has hidden units such that �i = 0 for at least

one i, the parameters ~!i and �i remain unidenti�ed. Conversely, if ~!i = 0 then �i and �i can take

any value without the value of the likelihood function being a�ected.

The �rst problem is solved by imposing, say, the restrictions �1 � � � � � �h or �1 � � � � � �h. The

second source of underidenti�cation can be circumvented, for example, by imposing the restrictions

~!1i > 0, i = 1; : : : ; h. To remedy the third problem, it is necessary to ensure that the model

contains no irrelevant hidden units. This diÆculty is dealt with by applying statistical inference in

model speci�cation; see Section 4. For further discussion of the identi�ability of ANN models see,

for example, Sussman (1992), Kurkov�a and Kainen (1994), Hwang and Ding (1997), and Anders

and Korn (1999).

For estimation purposes it is often useful to reparametrize the logistic function (2) as

F� i�!0ixt � ci

��=�1 + e� i(!

0

ixt�ci)

��1

(8)

where i > 0, i = 1; : : : ; h, and k!ik = 1 with

!i1 =

vuut1�

qXj=2

!2ij > 0; i = 1; : : : ; h: (9)

The parameter vector of model (1) becomes

= [�0; �1; : : : ; �h; 1; : : : ; h; !12; : : : ; !1q; : : : ; !h2; : : : ; !hq; c1; : : : ; ch]0:

In this case the �rst two identifying restrictions discussed above can be de�ned as, �rst, c1 � � � � � ch

or �1 � � � � � �h and, second, i > 0; i = 1; : : : ; h.. We shall return to this parameterization in

Section 4.3, when maximum likelihood estimation of the parameters is considered.

6

4 Strategy for Building AR-NN Models

4.1 Three Stages of Model Building

As mentioned in the Introduction, our aim is to construct a coherent strategy for building AR-NN

models using statistical inference. The structure or architecture of an AR-NN model has to be

determined from the data. We call this stage speci�cation of the model, and it involves two sets

of decision problems. First, the lags or variables to be included in the model have to be selected.

Second, the number of hidden units has to be determined. Choosing the correct number of hidden

units is particularly important as selecting too many neurons yields an unidenti�ed model. In this

work, the lag structure or the variables included in the model are determined using well-known

variable selection techniques. The speci�cation stage of NN modelling also requires estimation

because we suggest choosing the hidden units sequentially. After estimating a model with h hidden

units we shall test it against the one with h+1 hidden units and continue until the �rst acceptance

of a null hypothesis. What follows thereafter is evaluation of the �nal estimated model to check if

the �nal model is adequate. NN models are typically only evaluated out-of-sample, but in this paper

we suggest in-sample misspeci�cation tests for the purpose. Similar tests are routinely applied in

evaluating STAR models (see Eitrheim and Ter�asvirta (1996)), and in this work we adapt them to

the AR-NN models. All this requires consistency and asymptotic normality for the estimators of

parameters of the AR-NN model, conditions for which, based on results in Trapletti, Leisch and

Hornik (2000), will be stated below.

We shall begin the discussion of our modelling strategy by considering variable selection. After

dealing with that problem we turn to parameter estimation. Finally, after discussing statistical

inference for selecting the hidden units and after presenting our in-sample model evaluation tools

we outline the modelling strategy as a whole.

4.2 Variable Selection

The �rst step in our model speci�cation is to choose the variables for the model from a set of

potential variables (lags in the pure AR-NN case). Several nonparametric variable selection tech-

niques exist (Tschernig and Yang 2000, Vieu 1995, Tj�stheim and Auestad 1994, Yao and Tong

7

1994, Auestad and Tj�stheim 1990), but they are computationally very demanding, in particular

when the number of observations is not small. In this paper variable selection is carried out by

linearizing the model and applying well-known techniques of linear variable selection to this ap-

proximation. This keeps computational cost to a minimum. For this purpose we adopt the simple

procedure proposed in Rech, Ter�asvirta and Tschernig (2001). Their idea is to approximate the

stationary nonlinear model by a polynomial of suÆciently high order. Adapted to the present situ-

ation, the �rst step is to approximate function G(xt; ) in (1) by a general k-th order polynomial.

By the Stone-Weierstrass theorem, the approximation can be made arbitrarily accurate if some

mild conditions, such as the parameter space being compact, are imposed on function G(xt; ).

Thus the AR-NN model, itself a universal approximator, is approximated by another function. This

yields

G(xt; ) = �0~xt +

qXj1=1

qXj2=j1

�j1j2xj1;txj2;t

+ � � �+

qXj1=1

� � �

qXjk=jk�1

�j1:::jkxj1;t � � � xjk;t +R(xt; );

(10)

where R(xt; ) is the approximation error that can be made negligible by choosing k suÆciently

high. The �0s are parameters, and � 2 Rq+1 is a vector of parameters. The linear form of the

approximation is independent of the number of hidden units in (1).

In equation (10), every product of variables involving at least one redundant variable has the

coeÆcient zero. The idea is to sort out the redundant variables by using this property of (10). In

order to do that, we �rst regress yt on all variables on the right-hand side of equation (10) assuming

R(xt; ) and compute the value of a model selection criterion (MSC), AIC or SBIC for example.

After doing that, we remove one variable from the original model and regress yt on all the remaining

terms in the corresponding polynomial and again compute the value of the MSC. This procedure is

repeated by omitting each variable in turn. We continue by simultaneously omitting two regressors

of the original model and proceed in that way until the polynomial is of a function of a single

regressor and, �nally, just a constant. Having done that, we choose the combination of variables

that yields the lowest value of the MSC. This amounts to estimatingPq

i=1

�qi

�+ 1 linear models

8

by ordinary least squares (OLS). Note that by following this procedure, the variables for the whole

ANN model are selected at the same time. Rech et al. (2001) showed that the procedure works well

already in small samples when compared to well-known nonparametric techniques. Furthermore,

it can be successfully applied even in large samples when nonparametric model selection becomes

computationally infeasible.

4.3 Parameter Estimation

As selecting the number of hidden units requires estimation of neural network models, we now turn

to this problem. A large number of algorithms for estimating the parameters of a NN model are

available in the literature. In this paper we instead estimate the parameters of our AR-NN model by

maximum likelihood making use of the assumptions made of "t in Section 2. The use of maximum

likelihood or quasi maximum likelihood makes it possible to obtain an idea of the uncertainty in

the parameter estimates through (asymptotic) standard deviation estimates. This is not possible

by using the above-mentioned algorithms. It may be argued that maximum likelihood estimation

of neural network models is most likely to lead to convergence problems, and that penalizing the

log-likelihood function one way or the other is a necessary precondition for satisfactory results.

Two things can be said in favour of maximum likelihood here. First, in this paper model building

proceeds from small to large models, so that estimation of unidenti�ed or nearly unidenti�edmodels,

a major reason for the need to penalize the log-likelihood, is avoided. Second, the starting-values

of the parameter estimates are chosen carefully, and we discuss the details of this in Section 4.3.2.

The AR-NN model is similar to many linear or nonlinear time series models in that the in-

formation matrix of the logarithmic likelihood function is block diagonal in such a way that we

can concentrate the likelihood and �rst estimate the parameters of the conditional mean. Thus

conditional maximum likelihood is equivalent to nonlinear least squares. Using model (1) as our

starting-point, we make the following assumptions:

(A.1) The ((r+1)�1) parameter vector � = [ 0; �2]0 is an interior point of the compact parameter

space which is a subspace of Rr � R+ ; the r-dimensional Euclidean space.

(A.2) The roots of the lag polynomial 1�Pp

j=1 �jzj lie outside the unit circle. This is a suÆcient

9

condition for weak stationarity, see Trapletti et al. (2000).

(A.3) The parameters satisfy the conditions c1 � : : : � ch, i > 0, i = 1; :::; h; and !i1 is de�ned as

in (9) for i = 1; :::; h.

Assumption (A.3) guarantees global identi�ability of the model. The maximum likelihood

estimator of the parameters of the conditional mean equals

= argmax

QT ( ) = �1

2argmin

TXt=1

qt( );

where qt( ) = (yt �G(xt; ))2. Conditions for consistency and asymptotic normality of are

stated in the following theorem.

Theorem 1. Under the assumptions (A.1){(A.3) the maximum likelihood estimator is almost

surely consistent for and

T 1=2( � )! N

�0;�plim

T!1A( )�1

�; (11)

where A( ) = 1�2T

@2QT ( )@ @ 0 .

Proof. Assumptions (A.1)-(A.3) together with the normality assumption of the errors satisfy

the assumptions of Theorem 4.1 (almost sure consistency) in Trapletti et al. (2000). As every

hidden unit in model (1) is a logistic function, assumptions 4.5 and 4.6 of Theorem 4.2 (asymptotic

normality) in Trapletti et al. (2000) are satis�ed as well, so that result (11) follows.

In this paper, we apply the heteroskedasticity-robust large sample estimator of the covariance

matrix of (White 1980)

B( ) =

TXt=1

hth0

t

!�1 TXt=1

"2t hth0

t

! TXt=1

hth0

t

!�1(12)

where ht =@qt( )@

�� =

, and "t is the residual. In the estimation, the use of algorithms such as

the Broyden-Fletcher-Goldfarb-Shanno (BFGS) or the Levenberg-Marquardt algorithms is strongly

recommended. See, for example, Bertsekas (1995) for details about optimization algorithms or

10

Fine (1999, Chapter 5) for ones especially applied to the estimation of NN models. Choosing an

appropriate line search procedure to select the step-length is another important question. Cubic or

quadratic interpolation are usually reasonable choices. All the models in this paper are estimated

with the Levenberg-Marquardt algorithm based on a cubic interpolation line search.

4.3.1 Concentrated Maximum Likelihood

In order to reduce the computational burden we can apply concentrated maximum likelihood to

estimate as follows. Consider the ith iteration and rewrite model (1) as

y = Z(�)� + "; (13)

where y0 = [y1; y2; : : : ; yT ], "0 = ["1; "2; : : : ; "T ], �

0 = [�0; �1; : : : ; �h], and

Z(�) =

0BBBB@~x01 F ( 1(!

0

1x1 � c1)) : : : F ( h(!0

hx1 � ch))

......

. . ....

~x0T F ( 1(!0

1xT � c1)) : : : F ( h(!0

hxT � ch))

1CCCCA ;

with � = [ 1; : : : ; h;!0

1; : : : ;!0

h; c1; : : : ; ch]0. Assuming � �xed (the value is obtained from the

previous iteration), the parameter vector � can be estimated analytically by

� =�Z(�)0Z(�)

��1Z(�)0y: (14)

The remaining parameters � are estimated conditionally on � by applying the Levenberg-Marquardt

algorithm, which completes the ith iteration. This form of concentrated maximum likelihood was

proposed by Leybourne, Newbold and Vougas (1998) in the context of STAR models. It substan-

tially reduces the dimensionality of the iterative estimation problem, as instead of an inversion of

a single large Hessian two smaller matrices are inverted, and a line search is only needed to obtain

the ith estimate of �.

11

4.3.2 Starting-values

Many iterative optimization algorithms are sensitive to the choice of starting-values, and this is

certainly so in the estimation of AR-NN models. Besides, an AR-NN model with h hidden units

contains h parameters, i; i = 1; :::; h; that are not scale-free. Our �rst task is thus to rescale the

input variables in such a way that they have the standard deviation equal to unity. In the univariate

AR-NN case, this simply means normalizing yt. If the model contains exogenous variables, they are

normalized separately. This, together with the fact that k!hjj = 1, gives us a basis for discussing

the choice of starting-values of i; i = 1; :::; h. Another advantage of this is that, in the multivariate

case normalization generally makes numerical optimization easier as all variables have the same

standard deviation. Assume now that we have estimated an AR-NN model with h � 1 hidden

units and want to estimate one with h units. Our speci�c-to-general speci�cation strategy has the

consequence that this situation frequently occurs in practice. A natural choice of initial values

for the estimation of parameters in the model with h neurons is to use the �nal estimates for the

parameters in the �rst h�1 hidden units and the linear unit. The starting-values for the parameters

h; �h; !h and ch in the hth hidden unit are obtained in three steps as follows.

1. For k = 1; : : : ;K:

(a) Construct a vector v(k)h = [v

(k)1h ; : : : ; v

(k)qh ]

0 such that v(k)1h 2 (0; 1] and v

(k)jh 2 [�1; 1],

j = 2; : : : ; q. The values for v(k)1h are drawn from a uniform (0; 1] distribution and the

ones for v(k)jh , j = 2; : : : ; q, from a uniform [�1; 1] distribution.

(b) De�ne !(k)h = v

(k)h kv

(k)h k�1, which guarantees k!

(k)h k = 1.

(c) Let c(k)h = med(!

(k)0

h x), where x = [x1; : : : ;xT ].

2. De�ne a grid of N positive values (n)h , n = 1; : : : ; N , for the slope parameter. This need not

be done randomly. As the changes in h have a small e�ect of the slope when h is large,

only a small number of large values are required, and the grid should be �ner at the low end

of the halfspace of h-values.

3. For k = 1; : : : ;K and n = 1; : : : ; N , estimate � using (14) and compute the value of QT ( )

for each combination of starting-values. Choose those values of the parameters that minimize

12

QT ( ) (they also maximize the concentrated log-likelihood function) as starting values.

After selecting the initial values of the hth hidden unit we have to reorder the units if necessary

in order to ensure that the identifying restrictions discussed in Section 3 are satis�ed.

Typically, choosing K = 1000 and N = 20 ensures good initial estimates. We should stress,

however, that K is a nondecreasing function of the number of input variables. If the latter is large

we have to select a large K as well.

4.3.3 Potential Problem in the Estimation of the Slope Parameter

Finding the appropriate starting-values for the parameters is not the only diÆcult phase of the

parameter estimation. It may also be diÆcult to obtain reasonably accurate estimates for those

slope parameters i, i = 1; : : : ; h, that are very large. This is the case unless the sample size is

also large. To obtain an accurate estimate of a large i it is necessary to have a large number

of observations in the neighbourhood of ci. When the sample size is not very large, there are

generally few observations suÆciently close to ci in the sample, which results in imprecise estimates

of the slope parameter. This manifests itself in low absolute t-values for the estimates of i. In

such cases, the model builder cannot take a low absolute value of the t-statistic of the parameters

of the transition function as evidence for omitting the hidden unit in question. Another reason

for not doing so is that the t-value does not have its customary interpretation as a value of an

asymptotic t-distributed statistic. This is due to an identi�cation problem to be discussed in the

next subsection. For more discussion see, for example, Bates and Watts (1988, p. 87) or Ter�asvirta

(1994).

4.4 Determining the Number of Hidden Units

The number of hidden units included in an ANN model is usually determined from the data. A

popular method for doing that is pruning, in which a model with a large number of hidden units is

estimated �rst, and the size of the model is subsequently reduced by applying an appropriate tech-

nique such as cross-validation. Another technique used in this connection is regularization, which

may be characterized as penalized maximum likelihood or least squares applied to the estimation

of neural network models. For discussion see, for example, Fine (1999, pp. 215{221). Bayesian

13

regularization, based on selecting a prior distribution for the parameters, may serve as an example.

The use of regularization techniques precludes the possibility of applying methods of statistical

inference to the estimated model.

As discussed in the Introduction, another possibility is to begin with a small model and sequen-

tially add hidden units to the model, for discussion see, for example, Fine (1999, pp. 232{233),

Anders and Korn (1999), or Swanson and White (1995,1997a,b). The decision of adding another

hidden neuron is often based on the use of model selection criteria (MSC) or cross-validation. This

has the following drawback. Suppose the data have been generated by an AR-NN model with h

hidden units. Applying an MSC to decide whether or not another hidden unit should be added to

the model requires estimation of a model with h+1 hidden neurons. In this situation, however, the

larger model is not identi�ed and its parameters cannot be estimated consistently. This is likely

to cause numerical problems in maximum likelihood estimation. Besides, even when convergence

is achieved, lack of identi�cation causes a severe problem in interpreting the MSC. The NN model

with h hidden units is nested in the model with h+1 units. A typical MSC comparison of the two

models is then equivalent to a likelihood ratio test of h units against h + 1 ones, see, for exam-

ple, Ter�asvirta and Mellin (1986) for discussion. The choice of MSC determines the (asymptotic)

signi�cance level of the test. But then, when the larger model is not identi�ed under the null hy-

pothesis, the likelihood ratio statistic does not have its customary asymptotic �2 distribution when

the null holds. For more discussion of the general situation of a model only being identi�ed under

the alternative hypothesis, see, for example, Davies (1977,1987) or Hansen (1996). In the AR-NN

case, this lack of identi�cation shows as an ambiguity in determining the size of the penalty. An

AR-NN model with h + 1 hidden units can be reduced to one with h units by setting �h+1 = 0,

which suggests that the number of degrees of freedom in the penalty term should equal one. On

the other hand, the (h+1)th hidden unit can also be eliminated by setting !h+1 = 0. This in turn

suggests that the number of degrees of freedom should be dim(xt). In practice, the most common

choice seems to equal dim(xt) + 2.

We shall also select the hidden units sequentially starting from a small model, in fact from

a linear one, but circumvent the identi�cation problem in a way that enables us to control the

signi�cance level of the tests in the sequence and thus also the overall signi�cance level of the

14

procedure. Following Ter�asvirta and Lin (1993) we derive a test that is repeated until the �rst

acceptance of the null hypothesis. Assume now that our AR-NN model (1) contains h + 1 hidden

units and write it as follows

yt = �0~xt +

hXi=1

�iF ( i(!0

ixt � ci)) + �h+1F ( h+1(!0

h+1xt � ch+1)) + "t: (15)

Assume further that we have accepted the hypothesis of model (15) containing h hidden units and

want to test for the (h+ 1)th hidden unit. The appropriate null hypothesis is

H0 : h+1 = 0; (16)

whereas the alternative is H1 : h+1 > 0. Under (16), the (h+ 1)th hidden unit is identically equal

to a constant and merges with the intercept in the linear unit.

We assume that under (16) the assumptions of Theorem 1 hold so that the parameters of

(15) can be estimated consistently. Model (15) is only identi�ed under the alternative so that, as

discussed above, the standard asymptotic inference is not available. This problem is circumvented

as in Luukkonen, Saikkonen and Ter�asvirta (1988) by expanding the (h + 1)th hidden unit into a

Taylor series around the null hypothesis (16). Using a third-order Taylor expansion, rearranging

and merging terms results in the following model

yt = �0~xt +

hXi=1

�iF ( i(!0

ixt � ci)) +

qXi=1

qXj=i

�ijxi;txj;t +

qXi=1

qXj=i

qXk=j

�ijkxi;txj;txk;t + "�t ; (17)

where "�t = "t + �h+1R(xt); R(xt) is the remainder. It can be shown that �ij = 2h+1~�ij, ~�ij 6= 0,

i = 1; : : : ; q; j = i; : : : ; q; and �ijk = 3h+1~�ijk, ~�ijk 6= 0, i = 1; : : : ; q; j = i; : : : ; q, k = j; : : : ; q.

Thus the null hypothesis H00 : �ij = 0, i = 1; : : : ; q; j = i; : : : ; q, �ijk = 0, i = 1; : : : ; q; j = i; : : : ; q;

k = j; : : : ; q. Note that under H0 : "�t = "t; so that the properties of the error process remain

unchanged under the null hypothesis. Finally, it may be pointed out that one may also view (17)

15

as resulting from a local approximation to the log-likelihood which for observation t takes the form

lt =�1

2ln(2�)�

1

2ln�2 �

1

2�2

(yt � �

0~xt �

hXi=1

�iF ( i(!0

ixt � ci))�

qXi=1

qXj=i

�ijxi;txj;t �

qXi=1

qXj=i

qXk=j

�ijkxi;txj;txk;t

)2

:

(18)

We make the following assumption to accompany the previous assumptions (A.1){(A.3):

(A.4) Ejxt;ijÆ <1, i = 1; : : : ; q, for some Æ > 6. This enables us to state the following well-known

result:

Theorem 2. Under H0 : h+1 = 0 and assumptions (A.1){(A.4), the LM type statistic

LM =1

�2

TXt=1

"t�0

t

8<:

TXt=1

�t�0

t �

TXt=1

�th0

t

TXt=1

hth0

t

!�1 TXt=1

ht�0

t

9=;

TXt=1

�t"t (19)

where "t = yt �G(xt; �),

ht =@G(xt; )

@ 0

�� =

=

"~x0t; F ( 1(!

0

1xt � c1)); : : : ; F ( h(!0

hxt � ch));

�1@F ( 1(!

0

1xt � c1))

@ 1; : : : ; �h

@F ( h(!0

hxt � ch))

@ h;

�1@F ( 1(!

0

1xt � c1))

@~!012; : : : ; �1

@F ( 1(!0

1xt � c1))

@~!01q; : : : ;

�h@F ( h(!

0

hxt � ch))

@~!0h2; : : : ; �h

@F ( h(!0

hxt � ch))

@~!0hq;

�1@F ( 1(!

0

1xt � c1))

@c1; : : : ; �h

@F ( h(!0

hxt � ch))

@ch

#0

with

@F ( i(!0

ixt � ci))

@ i= (!0ixt � ci)

�2 cosh

� i(!

0

ixt � ci)��2

; i = 1; : : : ; h

@F ( i(!0

ixt � ci))

@~!0ij= i~xj;t

�2 cosh

� i(!

0

ixt � ci)��2

; i = 1; : : : ; h; j = 2; : : : ; q

@F ( i(!0

ixt � ci))

@ci= � i

�2 cosh

� i(!

0

ixt � ci)��2

; i = 1; : : : ; h

16

and �t =hx21;t; x1;tx2;t; : : : ; xi;txj;t; : : : ; x

31;t; : : : ; xi;txj;txk;t; : : : ; x

3h;t

i, has an asymptotic �2 distri-

bution with m = q(q + 1)=2 + q(q + 1)(q + 2)=6 degrees of freedom.

The test can also be carried out in stages as follows:

1. Estimate model (1) with h hidden units. If the sample size is small and the model thus diÆcult

to estimate, numerical problems in applying the maximum likelihood algorithm may lead to

a solution such that the residual vector is not precisely orthogonal to the gradient matrix

of G(xt; ). This has an adverse e�ect on the empirical size of the test. To circumvent

this problem, we regress the residuals "t on ht and compute the sum of squared residuals

SSR0 =PT

t=1 ~"2t . The new residuals ~"t are orthogonal to ht.

2. Regress ~"t on ht and �t. Compute the sum of squared residuals SSR1 =PT

t=1 v2t .

3. Compute the �2 statistic

LMhn�2 = T

SSR0 � SSR1

SSR0; (20)

or the F version of the test

LMhnF =

(SSR0 � SSR1)=m

SSR1=(T � n�m); (21)

where n = (q+2)h+p+1. Under H0, LMhn�2 has an asymptotic �2 distribution with m degrees

of freedom and LMhnF is approximately F -distributed with m and T � n�m degrees of freedom.

The following cautionary remark is in order. If any i, i = 1; : : : ; h, is very large, the gradient

matrix becomes near-singular and the test statistic numerically unstable, which distorts the size

of the test. The reason is that the vectors corresponding to the partial derivatives with respect to

i, !i, and ci, respectively, tend to be almost perfectly linearly correlated. This is due to the fact

that the time series of those elements of the gradient resemble dummy variables being constant

most of the time and nonconstant simultaneously. The problem may be remedied by omitting

these elements from the regression in step 2. This can be done without signi�cantly a�ecting the

value of the test statistic; see Eitrheim and Ter�asvirta (1996) for discussion. This situation may

also cause problems in obtaining standard deviation estimates for parameter estimates through

the outer product matrix (12). The same remedy can be applied: omit the rows and columns

17

corresponding to the two parameters and invert the reduced matrix. This yields more reliable

standard deviation estimates for the remaining parameter estimates. In fact, omitting the rows

and columns corresponding to the high values i should suÆce.

Testing zero hidden units against at least one is a special case of the above test. This amounts

to testing linearity, and the test statistic is in this case identical to the one derived for testing

linearity against the AR-NN model in Ter�asvirta, Lin and Granger (1993). A natural alternative

to our procedure is the one �rst suggested in White (1989) and investigated later in Lee, White

and Granger (1993). In order to test the null hypothesis of h hidden units, one adds q hidden units

to model (1) by randomly selecting the parameters ~!h+j, �h+j, j = 1; : : : ; q. If q is large (a small

value may not render the test powerful enough), a small number of principal components of the q

extra units may be used instead. This solves the identi�cation problem as the extra neurons or the

transformed neurons are observable, and the null hypothesis �h+1 = � � � = �h+q = 0 or its equivalent

when the principal component approach is taken, can be tested using standard inference. When

h = 0; this technique also collapses into a linearity test: see Lee et al. (1993). Simulation results in

Ter�asvirta et al. (1993) and Anders and Korn (1999) indicate that the polynomial approximation

method presented here compares well with White's approach, and it is applied in the rest of this

work.

As mentioned in the Introduction, the normality assumption can be relaxed while the consis-

tency and asymptotic normality of the (quasi) maximum likelihood estimators are retained. This

is important, for example, in �nancial applications of the AR-NN model. In �nancial applications,

at least in ones to high-frequency data, such as intradaily, daily or even weekly series, the series

typically contain conditional heteroskedasticity. This possibility can be accounted for by robusti-

fying the tests against heteroskedasticity following Wooldridge (1990). A heteroskedasticity-robust

version of the LM type test, based on the notion of robustifying statistic (18), can be carried out

as follows.

1. As before.

2. Regress �t on ht and compute the residuals rt.

3. Regress 1 on ~"trt and compute the sum of squared residuals SSR1.

18

4. Compute the value of the test statistic

LM r�2 = T � SSR1: (22)

The test statistic has the same asymptotic �2 null distribution as before.

It should be noticed that in the case of conditional heteroskedasticity, the maximum likelihood

estimates discussed in Section 4.3 are just quasi maximum likelihood estimates. Under regularity

conditions they are still consistent and asymptotically normal.

4.5 Evaluation of the Estimated Model

After a model has been estimated it has to be evaluated. We propose two in-sample misspeci�cation

tests for this purpose. The �rst one tests for the instability of the parameters. The second one

tests the assumption of no serial correlation in the errors and is an application of the results in

Eitrheim and Ter�asvirta (1996) and Godfrey (1988, pp. 112{121).

4.5.1 Test of Parameter Constancy

Testing parameter constancy is an important way of checking the adequacy of linear or nonlinear

models. Many parameter constancy tests are tests against unspeci�ed alternatives or a single

structural break. In this section we present a parametric alternative to parameter constancy which

allows the parameters to change smoothly as a function of time under the alternative hypothesis.

In the following we assume that the logistic function (8) has constant parameters whereas both �

and �i, i = 1; : : : ; h, may be subject to changes over time. This assumption is made mainly because

changes in the parameters of the logistic function are more diÆcult to detect than changes in the

linear parameters.

In order to derive the test, consider a model with time-varying parameters de�ned as

yt = ~G(xt; ; ~ ) + "t = ~�0(t)~xt +

hXi=1

n~�i(t)F ( i(!

0

ixt � ci))o+ "t; (23)

19

where

~�(t) = �+ ��F (�(t� �)) ; (24)

and

~�i(t) = �i + ��iF (�(t� �)) ; i = 1; : : : ; h: (25)

The function F in (24) and (25) is de�ned as in (2), and � > 0. The parameter vector is de�ned as

before, and ~ = [��; ; ��1; : : : ; ��h; �; �]0. The parameter � controls the smoothness of the monotonic

change in the autoregressive parameters. When � ! 1, equations (23){(25) represent a model

with a single structural break at t = �. Combining (24) and (25) with (23), we have the following

model

yt =��0 + ��0F (�(t� �))

~xt +

hXi=1

n�i + ��iF (�(t� �))

oF ( i(!

0

ixt � ci)) + "t: (26)

The null hypothesis hypothesis of parameter constancy is

H0 : � = 0: (27)

Note that model (26) is only identi�ed under the alternative � > 0. As is obvious from Section

4.4, a consequence of this complication is that the standard asymptotic distribution theory for the

likelihood ratio or other classical test statistics for testing (27) is not available. To remedy this

problem we expand F (�(t� �)) into a �rst-order Taylor expansion around � = 0. This yields

t1 =1

4�(t� �) +R(t; �; �); (28)

where R(t; �; �) is the remainder. Replacing F (�(t� �)) in (26) by (28) and reparameterizing we

obtain

yt =��00 + �

0

0t�~xt +

hXi=1

(�i + �it)F ( i(!0

ixt � ci)) + "�t ; (29)

where �0 = � � ��=4, �0 = ��=4, �i = �i � ��i��=4, �i = ��i�=4, i = 1; : : : ; h, and "�t =

20

"t +R(t; �; �). The null hypothesis becomes

H0 : �0 = 0; �1 = � � � = �h = 0: (30)

Under H0, R(t; �; �) = 0 and "�t = "t, so that standard asymptotic distribution theory is available.

The local approximation to the tth term in the normal log likelihood function in a neighbourhood

of H0 for observation t, ignoring R(t; �; �), is

lt =�1

2ln(2�)�

1

2ln�2

�1

2�2

(yt �

��00 + �

0

0t�~xt �

hXi=1

(�i + �it)F ( i(!0

ixt � ci))

)2

:

(31)

The consistent estimators of the relevant partial derivatives of the log likelihood under the null

hypothesis are

@lt@�00

��H0

=1

�2"t~xt (32)

@lt@�00

��H0

=1

�2"tt~xt (33)

@lt@�i

��H0

=1

�2"tF ( i(!

0

ixt � ci)) (34)

@lt@�0i

��H0

=1

�2"ttF ( i(!

0

ixt � ci)) (35)

@lt@ 0i

��H0

=1

�2"t�i

@F ( i(!0

ixt � ci))

@ 0i(36)

@lt@!0i

��H0

=1

�2"t�i

@F ( i(!0

ixt � ci))

@!0i(37)

@lt@ci

��H0

=1

�2"t�i

@F ( i(!0

ixt � ci))

@ci(38)

where i = 1; : : : ; h, �2 = (1=T )PT

t=1 "2t , and "t = yt�G(xt; ) = yt��

0~xt�hPi=1

�iF ( i(!0

ixt�ci)) are

21

the residuals estimated under the null hypothesis. The LM statistic is (19) with ht =@G(xt; )

@ 0

�� =

and �t =�t~x0t; tF ( 1(!

0

1xt � c1)); : : : ; tF ( h(!0

hxt � ch))�0

. Testing hypothesis of just a subset of

coeÆcients being constant is also possible in this framework. Under H0, statistic (19) has an

asymptotic �2 distribution with q + h+ 1 degrees of freedom. This is the case even if �t contains

elements dominated by a deterministic trend, see Lin and Ter�asvirta (1994) for details.

The test can be carried out in stages as before. The only di�erences are the new de�nition of

�t at stage 2 and the degrees of freedom in the �2 or F test. As before, use of F-version of the test

is recommended.

The alternative to parameter constancy may be made more general simply by de�ning

F (�(t� c)) =

1 + exp

(��

KYk=1

(t� �k)

)!�1; � > 0 (39)

withK � 1. Transition function (39) withK � 2 allows the parameters to change nonmonotonically

over time, in contrast to the caseK = 1. Consequently, in the test statistic (19), �t = [� 01t; : : : ; �0

Kt]0

where

� 0kt =htkx0t; t

kF ( 1(!0

1xt � c1)); : : : ; tkF ( h(!

0

hxt � ch)i; k = 1; : : : ;K

and the number of degrees of freedom in the test statistic is adjusted accordingly. In this paper

we report results for K = 1; 2; 3; and call the corresponding test statistics LMK ;K = 1; 2; 3;

respectively. If the error process is heteroskedastic, a robust version of the test, immediately obvious

from Section 4.4, has to be employed instead of the standard test. When the model is assumed not

to contain any hidden units: �i(t) � 0, i = 1; : : : ; h, the test collapses into the parameter constancy

test in Lin and Ter�asvirta (1994).

4.5.2 Test of Serial Independence

Assume that the errors in equation (1) follow an rth order autoregressive process de�ned as

"t = �0�t + ut; (40)

22

where �0 = [�1; : : : ; �r] is a parameter vector, � 0t = ["t�1; : : : ; "t�r], and ut�NID(0; �2). We assume

that under H0, assumptions (A.1){(A.3) hold. Consider the null hypothesis H0 : � = 0 whereas

H1 : � 6= 0. The conditional normal log-likelihood of (1) with (40) for observation t, given the �xed

starting values has the form

lt =�1

2ln(2�) �

1

2ln�2

�1

2�2

8<:yt �

rXj=1

�jyt�j �G(xt; ) +

rXj=1

�jG(xt�j ; )

9=;

2

:

(41)

The �rst partial derivatives of the normal log-likelihood for observation t with respect to � and

are

@lt@�j

=� ut�2

�fyt�j �G(xt�j ; )g ; j = 1; : : : ; r

@lt@

= �� ut�2

�8<:@G(xt; )

@ �

rXj=1

�j@G(xt�j ; )

@

9=; :

(42)

Under the null hypothesis, the consistent estimators of the score are

TXt=1

@lt@�

��H0

=1

�2

TXt=1

"t�t andTXt=1

@lt@

��H0

= �1

�2

TXt=1

"tht;

where � 0t = ["t�1; : : : ; "t�r ], "t�j = yt�j � G(xt�j ^; ), j = 1; : : : ; r, ht = @G(xt; )@ , and �2 =

(1=T )PT

t=1 "2t . The LM statistic is (19) with ht and �t de�ned as above, and it has an asymptotic

�2 distribution with r degrees of freedom under the null hypothesis. For details, see Godfrey (1988,

pp. 112{121).

The test can be performed in three stages as shown before. It may be pointed out that the

Ljung-Box test or its asymptotically equivalent counterpart, the Box-Pierce test, both recommended

for use in connection with NN models by Zapranis and Refenes (1999), are not available. Their

asymptotic null distribution is unknown when the estimated model is an AR-NN model.

23

4.6 Modelling strategy

At this point we are ready to combine the above statistical ingredients into a coherent modelling

strategy. We �rst de�ne the potential variables (lags) and select a subset of them applying the

variable selection technique considered in Section 4.2. After selecting the variables we select the

number of hidden units sequentially. We begin testing linearity against a single hidden unit as

described in Section 4.4 at signi�cance level �. The model under the null hypothesis is simply a

linear AR(p) model. If the null hypothesis is not rejected, the AR model is accepted. In case of

a rejection, an AR-NN model with a single unit is estimated and tested against a model with two

hidden units at the signi�cance level �%, 0 < % < 1. Another rejection leads to estimating a model

with two hidden units and testing it against a model with three hidden neurons at the signi�cance

level �%2. The sequence is terminated at the �rst acceptance of the null hypothesis. The signi�cance

level is reduced at each step of the sequence and converges to zero. In the applications in Section 6,

we use % = 1=2: This way we avoid excessively large models and control the overall signi�cance level

of the procedure. An upper bound for the overall signi�cance level �� may be obtained using the

Bonferroni bound. For example, if � = 0:1; and % = 1=2 then �� 0:187. Note that if we instead

of our LM type test apply a model selection criterion such as AIC or SBIC to this sequence, we

in fact use the same signi�cance level at each step. Besides, the upper bound that can be worked

out in the linear case, see, for example, Ter�asvirta and Mellin (1986), remains unknown due to the

identi�cation problem mentioned above.

In following the above path we have indeed assumed that all hidden neurons contain the variables

that are originally selected to the AR-NN model. Another variant of the strategy is the one in which

the variables in each hidden unit are chosen individually from the set of originally selected variables.

In the present context this may be done, for example, by considering the estimated parameter vector

!h of the most recently added hidden neuron, removing the variables whose coeÆcients have the

lowest t-values and re-estimating the model. Anders and Korn (1999) recommended this alternative.

It has the drawback that the computational burden may become high as frequent estimation of

neural network models may be involved. Because of this we suggest another technique that combines

sequential testing for hidden units and variable selection. Consider equation (15). Instead of just

testing a single null hypothesis as is done within (15), we can do the following. First test the null

24

hypothesis involving all variables. Then remove one variable from the extra unit under test and

test the model with h hidden units against this reduced alternative. Remove each variable in turn

and carry out the test. Continue by removing two variables at a time. Finally, test the model with

h neurons against the alternatives in which the (h + 1)th unit only contains a single variable and

an intercept. Find the combination of variables for which the p-value of the test is minimized. If

this p-value is lower than a prescribed value, \signi�cance level", add the (h + 1)th unit with the

corresponding variables to the model. Otherwise accept the AR-NN model with h hidden units and

stop. This way of selecting the variables for each hidden unit is analogous to the variable selection

technique discussed in Section 4.2.

Compared to our �rst strategy, this one adds to the exibility and on the average leads to more

parsimonious models than the other one. On the other hand, as every choice of hidden unit involves

a possibly large number of tests, we do not control the signi�cance level of the overall hidden unit

test. We do that, albeit conditionally on the variables selected, if the set of input variables is

determined once and for all before choosing the number of hidden units.

Evaluation following the estimation of the �nal model is carried out by subjecting the model

to misspeci�cation tests discussed in Section 4.5. If the model does not pass the tests, the model

builder has to reconsider the speci�cation. For example, important lags (or, in the more general

case, exogenous variables), may be missing from the model. If parameter constancy is rejected,

model (23) may be estimated and used for forecasting, but reconsidering the whole speci�cation

may often be a more sensible option of the two.

Another way of evaluating the model is out-of-sample forecasting. As AR-NN models are most

often constructed for forecasting purposes, this is important. This part of the model evaluation is

carried out by saving the last observations in the series for forecasting and comparing the forecast

results with those from at least one benchmark model. Note, however, that the results are dependent

on the observations contained in that particular prediction period and may not allow very general

conclusions about the properties of the estimated model. They are also conditional on the structure

of the model remaining unchanged over the forecasting period, which may not necessarily be the

case. For more discussion about this, see Clements and Hendry (1999, Chapter 2).

25

4.7 Discussion and comparisons

It is useful to compare our modelling strategy with other bottom-up approaches available in the

literature. Swanson and White (1995), Swanson and White (1997a), and Swanson and White

(1997b) apply the SBIC model selection criterion (Schwarz 1978) as follows. They start with a

linear model, adding potential variables to it until SBIC indicates that the model cannot be further

improved. Then they estimate models with a single hidden unit and select regressors sequentially

to it one by one unless SBIC shows no further improvement. Next Swanson and White add another

hidden unit and proceed by adding variables to it. The selection process is terminated when

SBIC indicates that no more hidden units or variables should be added or when a predetermined

maximum number of hidden units has been reached. This modelling strategy can be termed fully

sequential.

Anders and Korn (1999) essentially adopt the procedure of Ter�asvirta and Lin (1993) described

in Section 4.4 for selecting the number of hidden units. After estimating the largest model they

suggest proceeding from general-to-speci�c by sequentially removing those variables from hidden

units whose parameter estimates have the lowest (t-test) p-values. Note that this presupposes

parameterizing the hidden units as in (2), not as in (8) and (9).

Balkin and Ord (2000) select the ordered variables (lags) sequentially using a linear model and a

forward stepwise regression procedure. If the F -test statistic of adding another lag obtains a value

exceeding 2, this lag is added to the set of input variables. The number of variables selected also

serves as a maximum number of hidden units. The authors suggest estimating all models from the

one with a single hidden unit up to the one with the maximum number of neurons. The �nal choice

is made using the Generalized Cross-Validation Criterion of Golub, Heath and Wahba (1979). The

model for which the value of this model selection criterion is minimized is selected.

Refenes and Zapranis (1999) (see also Zapranis and Refenes (1999)) propose adding hidden units

into the model sequentially (there is a ow chart in the paper indicating this). The number of units,

however, is selected only after adding all units up to a predetermined maximum number, so that

the procedure is not genuinely sequential. The choice is made by applying the Network Information

Criterion (Murata, Yoshizawa and Amari 1994) that can be traced back to Stone (1977). The model

is then pruned by removing redundant variables from the neurons and re-estimating the model.

26

Unlike the others, Refenes and Zapranis (1999) underline the importance of misspeci�cation testing

which also forms an integral part of our modelling procedure. They suggest, for example, that the

hypothesis of no error autocorrelation should be tested, by the Ljung-Box or the asymptotically

equivalent Box-Pierce test. Unfortunately, these tests do not have their customary asymptotic null

distribution when the estimated model is an AR-NN model instead of a linear autoregressive one.

Of these strategies, the Swanson and White one is computationally the most intensive one, as

the number of steps involving an estimation of a NN model is large. Our procedure is in this respect

the least demanding. The di�erence between our scheme and the Anders and Korn one is that in

our strategy, variable selection does not require estimation of NN models because it is wholly based

on LM type tests (the model is only estimated under the null hypothesis). Furthermore, there is a

possibility of omitting certain potential variables before even estimating neural network models.

Like ours, the Swanson and White strategy is truly sequential: the modeller proceeds by con-

sidering nested models. The di�erence lies in how to compare two nested models in the sequence.

Swanson and White apply SBIC whereas Anders and Korn and we use LM type tests. The problems

with the former technique have been discussed in Section 4.4. The problem of estimating uniden-

ti�ed models is still more acute in the approaches of Balkin and Ord and Refenes and Zapranis.

Because these procedures require the estimation of NN models up to one containing a predeter-

mined maximum number of hidden units, several estimated models may thus be unidenti�ed. The

problem is even more serious if statistical inference is applied in subsequent pruning as the selected

model may also be unidenti�ed. The probability of this happening is smaller in the Anders and

Korn case, in particular when the sequence of hidden unit tests has gradually decreasing signi�cance

levels.

5 Monte-Carlo Study

In this section we report results from two Monte Carlo experiments. The purpose of the �rst one is

to illustrate some features of the NN model selection strategy described in Section 4.6 and compare

it with the alternative in which model selection is carried out using an appropriate model selection

criterion. In the second experiment, the performance of the misspeci�cation tests of Section 4.5 is

27

considered. In both experiments we make use of the following model

yt = 0:10 + [0:75 � 0:2� � F �(0:05(t � 90))] yt�1 � 0:05yt�4

+ [0:80� 0:5� � F �(0:05(t � 90))]F (2:24(0:45yt�1 � 0:89yt�4 + 0:09))

+ [�0:70 + 2:0� � F (0:05(t � 90))]F (1:12(0:44yt�1 + 0:89yt�4 + 0:35)) + "t;

"t = �"t�1 + ut; ut � NID(0; �2)

(43)

where F � is de�ned as in 43) with K = 1 or 2. However, in the �rst experiment, � = � = 0.

The number of observations in the �rst experiment is either 200 or 1000, in the second one we

report results for 100 observations. In every replication, the �rst 500 observations are discarded to

eliminate the initialization e�ects. The number of replications equals 500.

5.1 Architecture Selection

Results from simulating the modelling strategy can be found in Table 1. The table also contains

results on choosing the number of hidden units using SBIC. This model selection criterion was

chosen for the experiment because Swanson and White (1995, 1997a,b) applied it to this problem.

In this case it is assumed that the model contains the correct variables. This is done in order to

obtain an idea of the behaviour of SBIC free from the e�ects of an incorrectly selected model.

Di�erent results can be obtained by varying the error variance, the size of the coeÆcients of

hidden units or \connection strengths" and, in the case of our strategy, the signi�cance levels.

In this experiment, the signi�cance level is halved at every step, but other choices are of course

possible. It seems, at least in the present experiment, that selecting the variables is easier than

choosing the right number of hidden units. In small samples, there is a strong tendency to choose

a linear model but, as can be expected, nonlinearity becomes more apparent with an increasing

sample size. The larger initial signi�cance level (� = 0:10) naturally leads to larger models on

the average than the smaller one (� = 0:05). Over�tting is relatively rare but the results suggest,

again not unexpectedly, that the initial signi�cance level should be lowered when the number of

observations increases. Finally, improving the signal-to-noise ratio improves the performance of our

strategy.

28

The results of the hidden unit selection by SBIC show that the empirical signi�cance level

implied by it is, at least in this experiment, very low for both T = 200 and T = 1000; although it

changes with the sample size. Compared to our approach, the linear model is still chosen relatively

often for T = 1000 whereas the correct model with two hidden units is not selected at all. A

drawback of SBIC is that the signi�cance level is not known to the model builder, whereas in our

strategy it can be controlled at will.

5.2 Parameter Constancy Tests

The parameter constancy test statistic is simulated for K = 1; 2 in (39). The results of size

simulations are presented using size discrepancy plots introduced in Davidson and MacKinnon

(1998). They can be found in Figure 1. The results are based on realizations with 100 observations

and two error standard deviations, � = 0:125 and � = 0:25. For the latter one, the test is seen to

be conservative, whereas this tendency disappears at customary levels of signi�cance (up to 0.10)

when the signal-to-noise ratio is increased (the error standard deviation decreased). Panels (d)-(f)

show that the size is somewhat sensitive to error autocorrelation.

The results of power simulations are not shown here. The reason is that they are what can be

expected: the test has reasonable power when smooth parameter change is present in model (43).

The results are available from the authors upon request.

5.3 Serial Independence Test

The test of no error autocorrelation is simulated using model (43) with � = 0; 1 and � = 0:125; 0:25.

The maximum lags in the alternative equal 1,2 and 4. The size discrepancy plots appear in Figure 2.

Again, the test is somewhat conservative for � = 0:25 and less so for � = 0:125. Not unexpectedly,

panels (c) and (d) show that the test has power against time-varying parameters (� = 1 in (43)).

The results of power simulations do not o�er any surprises: the power increases with parameter �.

They are thus omitted to save space but are available upon request.

29

Table 1: Outcomes of the experiments of selecting the number of hidden units using the testsequence starting at signi�cance levels � = 0:05 and 0:10 and sample sizes 200 and 1000 based on500 replications of model (43) for � = � = 0 and three di�erent values for � and the same usingSBIC.

� = 1

� = 0:05; % = 1=2200 observations 1000 observations

Correct Too many Too few Correct Too many Too fewvariables variables variables SBIC variables variables variables SBIC

h = 0 405 6 10 483 114 0 0 394

h = 1 73 2 1 17 363 0 0 106

h = 2 3 0 0 0 3 0 0 0

h > 2 0 0 0 0 0 0 0 0



h = 0 335 7 0 483 68 0 0 394

h = 1 122 5 0 17 387 0 0 106

h = 2 4 0 0 0 33 0 0 0

h > 2 1 0 0 0 4 0 0 0

� = 0:5� = 0:05; % = 1=2

200 observations 1000 observationsCorrect Too many Too few Correct Too many Too fewvariables variables variables SBIC variables variables variables SBIC

h = 0 365 2 0 475 18 0 0 256

h = 1 127 1 0 25 440 0 0 244

h = 2 5 0 0 0 38 0 0 0

h > 2 0 0 0 0 4 0 0 0



h = 0 282 0 0 475 4 0 0 256

h = 1 205 1 0 25 438 0 0 244

h = 2 11 0 0 0 54 0 0 0

h > 2 1 0 0 0 4 0 0 0

� = 0:125� = 0:05; % = 1=2

200 observations 1000 observationsCorrect Too many Too few Correct Too many Too fewvariables variables variables SBIC variables variables variables SBIC

h = 0 116 0 0 423 0 0 0 4

h = 1 360 0 0 77 304 0 0 495

h = 2 23 0 0 0 177 0 0 1

h > 2 1 0 0 0 19 0 0 0



h = 0 86 0 0 423 0 0 0 4

h = 1 382 0 0 77 262 0 0 495

h = 2 30 0 0 0 205 0 0 1

h > 2 2 0 0 0 33 0 0 0

Notes: (a) The cases where the number of variables is correct but the combination is not the correct one appear under the heading \Too few

variables". (b) The results concerning model selection using SBIC do not depend on the value of �.

30

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.20

0.01

0.02

0.03

0.04

0.05

0.06

0.07

Nominal Size

Nom

inal

Siz

e −

Empi

rical

Siz

e

0o lineK=1K=2

(a)

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.20

0.005

0.01

0.015

0.02

0.025

0.03

Nominal Size

Nom

inal

Siz

e −

Empi

rical

Siz

e

0o lineK=1K=2

(b)

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2

−0.035

−0.03

−0.025

−0.02

−0.015

−0.01

−0.005

0

Nominal Size

Nom

inal

Siz

e −

Empi

rical

Siz

e

0o lineK=1K=2

(c)

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2−0.16

−0.14

−0.12

−0.1

−0.08

−0.06

−0.04

−0.02

0

Nominal Size

Nom

inal

Siz

e −

Empi

rical

Siz

e

0o lineK=1K=2

(d)

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2

−0.06

−0.05

−0.04

−0.03

−0.02

−0.01

0

Nominal Size

Nom

inal

Siz

e −

Empi

rical

Siz

e

0o lineK=1K=2

(e)

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2

−0.2

−0.18

−0.16

−0.14

−0.12

−0.1

−0.08

−0.06

−0.04

−0.02

0

Nominal Size

Nom

inal

Siz

e −

Empi

rical

Siz

e

0o lineK=1K=2

(f)

Figure 1: Size discrepancy curves of the simulated parameter constancy test. Panel (a): � = 0,� = 0, and � = 0:25. Panel (b): � = 0, � = 0, and � = 0:125. Panel (c): � = 0, � = 0:2, and� = 0:25. Panel (d): � = 0, � = 0:4, and � = 0:25. Panel (e): � = 0, � = 0:2, and � = 0:125.Panel (f): � = 0, � = 0:4, and � = 0:125.

31

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2

0

0.005

0.01

0.015

0.02

0.025

0.03

0.035

0.04

0.045

0.05

Nominal Size

Nom

inal

Siz

e −

Em

piric

al S

ize

0o liner=1r=2r=4

(a)

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2

0

0.005

0.01

0.015

0.02

0.025

Nominal Size

Nom

inal

Siz

e −

Em

piric

al S

ize

0o liner=1r=2r=4

(b)

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2

−0.18

−0.16

−0.14

−0.12

−0.1

−0.08

−0.06

−0.04

−0.02

0

Nominal Size

Nom

inal

Siz

e −

Em

piric

al S

ize

0o liner=1r=2r=4

(c)

0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2

−0.6

−0.5

−0.4

−0.3

−0.2

−0.1

0

Nominal Size

Nom

inal

Siz

e −

Em

piric

al S

ize

0o liner=1r=2r=4

(d)

Figure 2: Size discrepancy curves of the no error autocorrelation test. Panel (a): � = 0, � = 0, and� = 0:25. Panel (b): � = 0, � = 0, and � = 0:125. Panel (c): � = 1, � = 0, and � = 0:25. Panel(d): � = 2, � = 0, and � = 0:125 in (43)

32

6 Case Studies

6.1 Example 1: Annual Sunspot Numbers, 1700{2000

In this section we illustrate our modelling strategy by two empirical examples. In the �rst ex-

ample we build an AR-ANN model for the annual sunspot numbers over the period 1700{1979

and forecast with the estimated model up until the year 2001. The series, consisting of the years

1700{2001, was obtained from the National Geophysical Data Center web page. 1 The sunspot

numbers are a heavily modelled nonlinear time series: for a neural network example see Weigend,

Huberman and Rumelhart (1992). In this work we adopt the square-root transformation of Ghad-

dar and Tong (1981) and Tong (1990, p. 420). The transformed observations have the form

yt = 2hp

(1 +Nt)� 1i, t = 1; : : : ; T , where Nt is the original number of sunspots in the year t.

The graph of the transformed series appears in Figure 3. Most of the published examples of �tting

NN models to sunspot series deal with the original and not the square-root transformed series.

We use the observations for the period 1700{1979 to estimate the model and the remaining ones

for a forecast evaluation. We begin the AR-NN modelling of the series by selecting the relevant lags

using the variable selection procedure described in Section 4.2. We use a third-order polynomial

approximation to the true model. Applying SBIC, lags 1,2, and 7 are selected whereas AIC yields

the lags 1,2,4,5,6,7,8,9, and 10. We proceed with the lags selected by the SBIC. However, the

residuals of the estimated linear AR model are strongly autocorrelated. The serial correlation is

removed by also including yt�3 in the set of selected variables. When building the AR-NN model

we select the input variables for each hidden unit separately using the speci�cation test described in

Section 4.4. Linearity is rejected at any reasonable signi�cance level and the p-value of the linearity

test minimized with lags 1 2, and 7 as input variables. The sequence of including hidden units is

1http://www.ngdc.noaa.gov/stp/SOLAR/SSN/ssn.html

33

discontinued after adding the second hidden unit, see Table 2, and the �nal estimated model is

yt = �0:17(0:83)

+ 0:85(0:09)

yt�1 + 0:14(0:12)

yt�2 � 0:31(0:06)

yt�3 + 0:08(0:05)

yt�7

+ 12:80(7:18)

� F

�0:46(0:23)

�0:29(�)

yt�1 � 0:87(0:83)

yt�2 + 0:40(0:09)

yt�7 � 6:68(0:05)

��

+ 2:44(0:48)

� F

"1:17 � 103(8:45�103)

�0:83(�)

yt�1 � 0:53(0:12)

yt�2 � 0:18(0:08)

yt�7 + 0:38(7:18)

�#+ "t:

(44)

� = 1:89 �=�L = 0:70 R2 = 0:89 pLJB = 1:8� 10�7

pARCH(1) = 0:94 pARCH(2) = 0:75 pARCH(3) = 0:90 pARCH(4) = 0:44;

where the �gures in parentheses below the estimates are standard deviation estimates, � is the

residual standard deviation, �L is the residual standard deviation of the linear AR model, R2 is

the determination coeÆcient, pLJB is the p-value of the Lomnicki-Jarque-Bera test of normality,

and pARCH(j), j = 1; : : : ; 4, is the p-value of the LM test of no ARCH against ARCH of order j.

The estimated correlation matrix of the linear term and the output of the hidden units is

� =

0BBBB@

1 �0:30 0:74

�0:30 1 �0:19

0:74 �0:19 1

1CCCCA : (45)

It is seen from (45) that there are no redundant hidden units in the model as none of the correlations

is close to unity in absolute value. Figure 3 illuminates the contributions of the two hidden units

to the explanation of yt. The linear unit can only represent a symmetric cycle, so that the hidden

units must handle the nonlinear part of the cyclical variation in the series. It is seen from Figure

3 that the �rst hidden unit is activated at the beginning of every upswing, and its values return to

zero before the peak. The unit thus helps explain the very rapid recovery of the series following

each trough. The second hidden unit is activated roughly when the series is obtaining values higher

than its mean. It contributes to characterizing another asymmetry in the sunspot cycle: the peaks

and the troughs have distinctly di�erent shapes, peaks being rounder than troughs. The switches

in the value of the hidden unit from zero to unity and back again are quite rapid ( 2 large), which

is the cause of the large standard deviation of the estimate of 2, see the discussion in Section 4.3.

34

50 100 150 200 2500

10

20

ty t

50 100 150 200 250

0.2

0.4

0.6

t

F 1

50 100 150 200 2500

0.5

1

t

F 2

Figure 3: Panel (a): Transformed sunspot time series, 1700{1979. Panel (b): Output of the �rsthidden unit in (44). Panel (c): Output of the second hidden unit in (44).

Table 2: Test of no additional hidden units: minimum p-value of the set of tests against each nullmodel.

Number of hidden units under the null hypothesis0 1 2

p-value 3� 10�14 2� 10�9 0:019

The results of the misspeci�cation tests of model (44) in Table 3 indicate no model misspec-

i�cation. In order to assess the out-of-sample performance of the estimated model we compare

our forecasting results with the ones obtained from the two SETAR models, the one reported in

Tong (1990, p. 420) and the other in Chen (1995), an arti�cial neural network (ANN) model with

10 hidden neurons and the �rst 9 lags as input variables, estimated with Bayesian regularization

(MacKay 1992a, MacKay 1992b), and a linear autoregressive model with lags selected using SBIC.

The SETAR model estimated by Chen (1995) is one in which the threshold variable is a nonlinear

function of lagged values of the time series whereas it is a single lag in Tong�s model.

Table 4 shows the results of the one-step-ahead forecasting for the period 1980-2001. The

Table 3: Tests of no error autocorrelation and parameter constancy for model (44).

LM Test for q-th order serial correlation LM type test of parameter constancy

Lag K1 2 3 4 8 12 1 2 3 4

p-value 0.55 0.61 0.34 0.49 0.47 0.22 0.98 0.95 0.93 0.88

35

Table 4: One-step ahead forecasts, their root mean square errors, and mean absolute errors for theannual number of sunspots from a set of time series models, for the period 1980-2001.

SETAR model SETAR modelAR-NN NN model (Tong 1990) (Chen 1995) AR model

Year Observation Forecast Error Forecast Error Forecast Error Forecast Error Forecast Error

1980 154.6 153.4 1.2 136.9 17.7 161.0 -6.4 134.3 20.3 159.8 -5.21981 140.4 128.4 12.0 130.5 9.9 135.7 4.7 125.4 15.0 123.3 17.11982 115.9 95.8 20.1 101.1 14.8 98.2 17.7 99.3 16.6 99.6 16.31983 66.6 76.7 -10.1 88.6 -22.0 76.1 -9.5 85.0 -18.4 78.9 -12.31984 45.9 29.8 16.1 45.8 0.1 35.7 10.2 41.3 4.7 33.9 12.01985 17.9 21.9 -4.0 29.5 -11.6 24.3 -6.4 29.8 -11.9 29.3 -11.41986 13.4 13.5 -0.1 9.5 3.9 10.7 2.7 9.8 3.6 10.7 2.71987 29.4 23.7 5.7 25.2 4.2 20.1 9.3 16.5 12.9 23.0 6.41988 100.2 86.7 13.5 76.8 23.4 54.5 45.7 66.4 33.8 61.2 38.91989 157.6 161.6 -3.9 152.9 4.6 155.8 1.8 121.8 35.8 159.2 -1.61990 142.6 159.7 -17.1 147.3 -4.7 156.4 -13.8 152.5 -9.9 175.5 -32.91991 145.7 118.2 27.5 121.2 24.5 93.3 52.4 123.7 22.0 119.1 26.61992 94.3 98.1 -3.8 114.3 -20.0 110.5 -16.2 115.9 -21.7 118.9 -24.61993 54.6 64.8 -10.2 71.0 -16.4 67.9 -13.3 69.2 -14.6 57.9 -3.31994 29.9 21.0 8.9 32.9 -3.0 27.0 2.9 35.7 -5.8 29.9 -0.11995 17.5 14.9 2.6 19.2 -1.7 18.4 -0.9 18.9 -1.4 17.6 -0.11996 8.6 19.2 -10.6 10.2 -1.6 18.1 -9.5 11.6 -3.0 15.7 -7.11997 21.5 17.6 3.9 21.3 0.2 12.3 9.2 11.8 9.7 16.0 5.51998 64.3 64.6 -0.3 67.6 -3.3 46.7 17.6 58.5 5.8 52.5 11.81999 93.3 113.0 -19.7 105.2 -11.9 105.7 -12.5 122.7 -29.4 109.2 -15.92000 119.6 102.4 17.2 101.8 17.8 99.5 20.1 102.7 16.8 115.1 4.42001 111 102.9 8.1 112.5 -1.5 110.2 0.8 112.5 -1.5 121.0 -10RMSE 12.2 12.8 18.1 17.3 15.9MAE 9.9 9.9 12.9 14.3 12.1

results, summarized by the root mean squared error (RMSE) and mean absolute error (MAE)

measures, are quite favourable for our AR-NN model. Turning away from the neural network

models, the less than impressive performance of the SETAR models may raise questions about

their feasibility. However, as Tong (1990, p. 421) has pointed out, these models are at their best in

forecasting several years ahead because they are able to reproduce the distinct nonlinear structure

of the sunspot series clearly better than the linear autoregressive models.

In order to �nd out whether or not model (44) generates more accurate one-step-ahead forecasts

than the other models we have applied the modi�ed Diebold-Mariano test (Diebold and Mariano

1995) of Harvey, Leybourne and Newbold (1997) to these series of forecasts. Table 5 shows the

values of the statistic and the corresponding p-values. The null hypothesis of no di�erence in the

theoretical MAE or RMSE between the AR-NN model and a competitor can be rejected only when

the competitor is any of the SETAR models. The AR-NN model thus appears somewhat better

than the SETAR alternatives but not better than the linear AR model and the NN one obtained

by Bayesian regularization.

We also compared multi-step forecasts made by our model and the alternative models described

above. The forecasts were made according to the following procedure.

36

Table 5: Modi�ed Diebold-Mariano test of the null of no di�erence between the forecast errors ofthe di�erent models.

Comparison MDM Statistic p-value

Squared ErrorsAR-NN vs. NN 0.41 0.34

AR-NN vs. SETAR (Tong 1990) 1.42 0.08AR-NN vs. SETAR (Chen 1995) 1.89 0.04

AR-NN vs. AR 1.29 0.10

Absolute ErrorsAR-NN vs. NN 0.21 0.42

AR-NN vs. SETAR (Tong 1990) 1.52 0.07AR-NN vs. SETAR (Chen 1995) 2.10 0.02

AR-NN vs. AR 1.10 0.15

1. For t = 1980; : : : ; 1988, compute the out-of-sample forecasts of one to 8-step-ahead of each

model, yt(k), and the associated forecast errors denoted by "t(k) where k is the forecasting

horizon.

2. For each forecasting horizon, compute the RMSE and the MAE statistics.

Table 6 shows the root mean squared error and the mean absolute errors for the annual number

of sunspots from a total of forecasts each made by each model for forecast horizons from 2 to 8

years. Several interesting facts emerge from the results. The forecastability of sunspots using the

linear AR model deteriorates very slowly with the forecast horizon. This is clearly due to the

extraordinarily persistent cycle in the series. As to the AR-NN model that also contains a linear

unit, the advantage in forecast accuracy compared to the AR model is clear at short horizons

but vanishes at the seven-year horizon. The large NN model obtained by Bayesian regularization

does not contain a linear unit and fares less well in this comparison. For the two SETAR models,

forecastability deteriorates quite slowly with the forecast horizon after a quick initial decay. The

accuracy of forecasts, however, measured by the root mean squared error or the mean absolute

error, is somewhat inferior to that of the linear AR model.

37

Table 6: Multi-step ahead forecasts, their root mean square errors, and mean absolute errors forthe annual number of sunspots from a set of time series models, for the period 1981-2000.

HorizonAR-NN NN model SETAR model (Tong 1990) SETAR model (Chen 1995) AR

RMSE MAE RMSE MAE RMSE MAE RMSE MAE RMSE MAE2 18.4 14.6 20.7 16.7 31.6 21.0 27.2 21.6 26.5 18.83 21.6 14.6 24.3 19.3 38.4 25.2 33.6 24.8 28.2 19.94 22.2 15.6 27.3 21.6 42.2 26.4 31.8 23.6 27.8 20.25 22.4 14.0 32.4 23.2 42.2 27.0 30.6 21.6 26.9 19.16 20.6 14.0 36.5 25.3 41.6 26.4 31.9 23.0 26.8 19.77 27.5 18.4 42.2 30.2 43.3 30.3 34.0 25.0 27.5 19.88 25.1 20.0 39.6 30.1 45.2 35.0 33.8 26.0 26.7 19.6

6.2 Example 2: Financial Prediction

Our second example has to do with forecasting stock returns. We have chosen it because our

results can be compared with ones from previous studies and because this is a multivariate example.

Pesaran and Timmermann (1995) provided evidence in favour of monthly US stock returns being

predictable. They constructed a linear model containing nine economic variables and showed that

using the model for managing a portfolio consisting of either the S&P 500 index or bonds gave

results superior to the ones obtained from a simple random walk model. The choice between stocks

and bonds was reconsidered every month, and pro�ts were reinvested. The time period extended

from January 1954 to December 1992. Later, Qi (1999) applied a neural network model based

on Bayesian regularization to the same data set and obtained results vastly superior to the ones

Pesaran and Timmermann (1995) had reported. Recently, however, Maasoumi and Racine (2001)

found that with no model could one come close to the level of accumulated wealth Qi's model

generated even though some models had a similar in-sample performance. When Racine (2001)

reproduced her experiment, he was unable to demonstrate similar results for the neural network

model.

Following the others, we respecify our model for each observation period. Thus, our modelling

strategy is applied as follows:

1. For t = 1; : : : ; T ; with T = 1960:1 to 1992:12

(a) Select the variables with the procedure described in Section 4.2 using a third-order Taylor

expansion.

(b) Test linearity with all the selected variables in the transition function using the heteroskedasticity-

38

robust version of the linearity test.

(c) If linearity is rejected, estimate an AR-NN model. Otherwise, estimate a linear regres-

sion including all the covariates. The number of hidden units is determined using the

heteroskedasticity-robust version of the LM test. The initial signi�cance level of the

tests equals 0.05.

The �rst model is estimated with the data extending to the end of 1959, and the whole modelling

procedure is repeated after adding another month to the sample. It is seen from Figure 4 that the

composition of variables varies quite considerably, although there are periods of stability, such as

the years 1978-1984, for example. (For a detailed description of the variables, see Pesaran and

Timmermann (1995).) Not a single one of the nine variables appears in every model, however.

Perhaps quite predictably when the sample is small, linearity is not rejected at the 5% level, see

Figure 5. There is a single period between 1984 and 1988 when the model selection strategy yields

a neural network (NN) model and another one at the end of the period. This already leads one to

expect only minor di�erences in wealth at the end of the period between the strategies based on

the linear and the NN model. In fact, the linear model containing all variables and the NN strategy

(either a linear model with a subset of variables or an NN model) lead to a di�erent investment

decision in only 10 cases out of 396. Out of these 10, our technique yielded a correct direction

forecast in four cases and the linear model in remaining six.

The accumulated wealth is shown in Table 7. The linear model gives the best results. Our NN

model (Panel E) is slightly better than Bayesian regularization NN model of Racine (2001) (Panel

D) for no or low transaction costs. For high transaction costs, the relationship is the opposite.

Thus, our NN modelling strategy compares well with the Qi-Racine approach but is not any better

than a linear model with a constant composition of variables. The main reason for the linear model

doing well is that there is not much structure to be modelled in the relationship between the returns

and the explanatory variables. A nonlinear model cannot therefore be expected to do better than

a linear one. Furthermore, NN models most often require a large sample to perform well, and in

this example a clear majority of samples must be considered small.

It has been pointed out, see for instance Fama (1998), that the accumulated wealth compar-

39

60 64 68 72 76 80 84 88 92None

DY_t

EP_t

I1_{t−1}

I1_{t−2}

I12_{t−1}

I12_{t−2}

Pi_{t−2}

Delta IP_{t−2}

Delta M_{t−2}

Time

Varia

bles

Figure 4: Variables selected using the AIC.

60 64 68 72 76 80 84 88 9210

−2

10−1

Time

Log(p−

value)

Figure 5: p-value of the linearity test (heteroskedasticity-robust version). The dashed lines are the0.1, 0.05, and 0.01 bounds.

40

Table 7: Risks and pro�ts of market, bond, and switching portfolios based on the out-of-sampleforecasts of alternative models, 1960.1 to 1992.12.

Transaction costs Mean return (%) Std. of return Sharpe ratio Final wealth ($)

Panel A: Market portfolioZero 11.15 14.90 0.35 2,503Low 11.13 14.90 0.43 2,463High 11.11 14.89 0.43 2,424

Panel B: Bond portfolioZero 5.93 2.74 { 700Low 4.72 2.74 { 471High 4.72 2.74 { 471

Panel C: Switching portfolio based on linear forecastsZero 13.66 10.08 0.77 7,458Low 12.21 10.18 0.74 4,631High 11.23 10.34 0.63 3,346

Panel D: Switching portfolio based NN forecasts of Racine (2001)

Zero 13.23 10.89 0.67 6,624Low 11.98 10.88 0.67 4,204High 11.23 10.93 0.60 3,292

Panel E: Switching portfolio based AR-NN forecasts

Zero 13.50 10.12 0.75 7,054Low 12.00 10.20 0.71 4,319High 10.99 10.34 0.61 3,089

41

isons may be misleading in assessing the forecasting performance of di�erent models because the

cumulative e�ect of a single pair of di�erent direction forecasts and thus investment decisions early

on may grow quite large. In our case, the di�erent decisions are few and appear relatively late in

the sample. As a result, repeating the same exercise without reinvesting the pro�ts leads to the

conclusion that there is no di�erence in performance between the linear AR model and the models,

either linear or NN, obtained by our technique.

7 Conclusion

In this paper we have demonstrated how statistical methods can be applied in building neural net-

work models. The idea is to specify parsimonious models and keep the computational cost small.

An advantage of our modelling strategy is that the modelling procedure is not a black box. Every

step in model building is clearly documented and motivated. On the other hand, using this strat-

egy requires active participation of the model builder and willingness to make decisions. Choosing

the model selection criterion for variable selection and determining signi�cance levels for the test

sequence for selecting the number of hidden units are not automated, and di�erent choices may

often produce di�erent models. Combining them in forecasting could be an interesting topic that,

however, lies beyond the scope of this paper. Nevertheless, the method shows promise, and research

is being carried out in order to learn more about its properties in modelling and forecasting station-

ary time series. The Matlab code for carrying out the modelling cycle exists and is downloadable

at www.econ.puc-rio/br/mcm/nonlinear.html or www.hhs.se/stat/research/nonlinear.htm.

References

Abu-Mostafa, Y. S., Atiya, A. F., Magdon-Ismail, M. and White, H.: 2001, Introduction to the special issue

on neural networks in �nancial engineering, IEEE Transactions on Neural Networks 12, 653{655.

Anders, U. and Korn, O.: 1999, Model selection in neural networks, Neural Networks 12, 309{323.

Auestad, B. and Tj�stheim, D.: 1990, Identi�cation of nonlinear time series: First order characterization

and order determination, Biometrika 77, 669{687.

42

Balkin, S. D. and Ord, J. K.: 2000, Automatic neural network modeling for univariate time series, Interna-

tional Journal of Forecasting 16, 509{515.

Bates, D. M. and Watts, D. G.: 1988, Nonlinear Regression Analysis and its Applications, Wiley, New York.

Bertsekas, D. P.: 1995, Nonlinear Programming, Athena Scienti�c, Belmont, MA.

Bishop, C. M.: 1995, Neural Networks for Pattern Recognition, Oxford University Press, Oxford.

Chen, R.: 1995, Threshold variable selection in open-loop threshold autoregressive models, Journal of Time

Series Analysis 16, 461{481.

Chen, X., Racine, J. and Swanson, N. R.: 2001, Semiparametric ARX neural-network models with an

application to forecasting in ation, IEEE Transactions on Neural Networks 12, 674{683.

Clements, M. P. and Hendry, D. F.: 1999, Forecasting Non-stationary Economic Time Series, MIT Press,

Cambridge, MA.

Cybenko, G.: 1989, Approximation by superposition of sigmoidal functions,Mathematics of Control, Signals,

and Systems 2, 303{314.

Davidson, R. and MacKinnon, J. G.: 1998, Graphical methods for investigating the size and power of

hypothesis tests, The Manchester School 66, 1{26.

Diebold, F. X. and Mariano, R. S.: 1995, Comparing predictive accuracy, Journal of Business and Economic

Statistics 13, 253{263.

Eitrheim, . and Ter�asvirta, T.: 1996, Testing the adequacy of smooth transition autoregressive models,

Journal of Econometrics 74, 59{75.

Fama, E. F.: 1998, Market eÆciency, long-term returns, and behavioral �nance, Journal of Financial Eco-

nomics 49, 283{306.

Fine, T. L.: 1999, Feedforward Neural Network Methodology, Springer, New York.

Funahashi, K.: 1989, On the approximate realization of continuous mappings by neural networks, Neural

Networks 2, 183{192.

Gallant, A. R. and White, H.: 1992, On learning the derivatives of an unknown mapping with multilayer

feedforward networks, Neural Networks 5, 129{138.

Ghaddar, D. K. and Tong, H.: 1981, Data transformations and self-exciting threshold autoregression, Journal

of the Royal Statistical Society C30, 238{248.

43

Godfrey, L. G.: 1988, Misspeci�cation Tests in Econometrics, Vol. 16 of Econometric Society Monographs,

second edn, Cambridge University Press, New York, NY.

Golub, G., Heath, M. and Wahba, G.: 1979, Generalized cross-validation as a method for choosing a good

ridge parameter, Technometrics 21, 215{223.

Hansen, B. E.: 1996, Inference when a nuisance parameter is not identi�ed under the null hypothesis,

Econometrica 64, 413{430.

Harvey, D., Leybourne, S. and Newbold, P.: 1997, Testing the equality of prediction mean squared errors,

International Journal of Forecasting 13, 281{291.

Haykin, H.: 1999, Neural Networks: A Comprehensive Foundation, second edn, Prentice-Hall, Oxford.

Hwang, J. T. G. and Ding, A. A.: 1997, Prediction intervals for arti�cial neural networks, Journal of the

American Statistical Association 92, 109{125.

Kuan, C. M. and White, H.: 1994, Arti�cial neural networks: An econometric perspective, Econometric

Reviews 13, 1{91.

Kurkov�a, V. and Kainen, P. C.: 1994, Functionally equivalent feedforward neural networks, Neural Compu-

tation 6, 543{558.

Lee, T.-H., White, H. and Granger, C. W. J.: 1993, Testing for negleted nonlinearity in time series models.

a comparison of neural network methods and alternative tests, Journal of Econometrics 56, 269{290.

Leybourne, S., Newbold, P. and Vougas, D.: 1998, Unit roots and smooth transitions, Journal of Time

Series Analysis 19, 83{97.

Lin, C. F. J. and Ter�asvirta, T.: 1994, Testing the constancy of regression parameters against continuous

structural change, Journal of Econometrics 62, 211{228.

Luukkonen, R., Saikkonen, P. and Ter�asvirta, T.: 1988, Testing linearity in univariate time series models,

Scandinavian Journal of Statistics 15, 161{175.

Maasoumi, E. and Racine, J.: 2001, Entropy and predictability of stock market returns, Journal of Econo-

metrics 107, 291{312.

MacKay, D. J. C.: 1992a, Bayesian interpolation, Neural Computation 4, 415{447.

MacKay, D. J. C.: 1992b, A practical Bayesian framework for backpropagation networks, Neural Computa-

tion 4, 448{472.

44

Murata, N., Yoshizawa, S. and Amari, S.-I.: 1994, Network information criterion - determining the number of

hidden units for an arti�cial neural network model, IEEE Transactions on Neural Networks 5, 865{872.

Pesaran, M. H. and Timmermann, A.: 1995, Predictability of stock returns: robustness and econoic signi�-

cance, Journal of Finance 50, 1201{1228.

Qi, M.: 1999, Nonlinear predictability of stock returns using �nancial and economic variables, Journal of

Business and Economic Statistics 17, 419{429.

Racine, J.: 2001, On the nonlinear predictability of stock returns using �nancial and economic variables,

Journal of Business and Economic Statistics 19, 380{382.

Rech, G., Ter�asvirta, T. and Tschernig, R.: 2001, A simple variable selection technique for nonlinear models,

Communications in Statistics, Theory and Methods 30, 1227{1241.

Reed, R. D. and Marks II, R. J.: 1999, Neural Smithing, MIT Press, Cambridge, MA.

Refenes, A. P. N. and Zapranis, A. D.: 1999, Neural model identi�cation, variable selection, and model

adequacy, Journal of Forecasting 18, 299{332.

Ripley, B. D.: 1996, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge.

Schwarz, G.: 1978, Estimating the dimension of a model, Annals of Statistics 6, 461{464.

Stone, M.: 1977, An asymptotic equivalence of choice of model by cross-validatiob and Akaike's criterion,

Journal of the Royal Statistical Society B 39, 44{47.

Sussman, H. J.: 1992, Uniqueness of the weights for minimal feedforward nets with a given input-output

map, Neural Networks 5, 589{593.

Swanson, N. R. and White, H.: 1995, A model selection approach to assesssing the information in the

term structure using linear models and arti�cial neural networks, Journal of Business and Economic

Statistics 13, 265{275.

Swanson, N. R. and White, H.: 1997a, Forecasting economic time series using exible versus �xed speci�ca-

tion and linear versus nonlinear econometric models, International Journal of Forecasting 13, 439{461.

Swanson, N. R. and White, H.: 1997b, A model selection approach to real-time macroeconomic forecasting

using linear models and arti�cial neural networks, Review of Economic and Statistics 79, 540{550.

Ter�asvirta, T.: 1994, Speci�cation, estimation, and evaluation of smooth transition autoregressive models,

Journal of the American Statistical Association 89, 208{218.

45

Ter�asvirta, T. and Lin, C.-F. J.: 1993, Determining the number of hidden units in a single hidden-layer

neural network model, Research Report 1993/7, Bank of Norway.

Ter�asvirta, T. and Mellin, I.: 1986, Model selection criteria and model selection tests in regression models,

Scandinavian Journal of Statistics 13, 159{171.

Ter�asvirta, T., Lin, C. F. and Granger, C. W. J.: 1993, Power of the neural network linearity test, Journal

of Time Series Analysis 14, 309{323.

Tj�stheim, D. and Auestad, B.: 1994, Nonparametric identi�cation of nonlinear time series { selecting

signi�cant lags, Journal of the American Statistical Association 89, 1410{1419.

Tong, H.: 1990, Non-linear Time Series: A Dynamical Systems Approach, Oxford University Press, Oxford.

Trapletti, A., Leisch, F. and Hornik, K.: 2000, Stationary and integrated autoregressive neural network

processes, Neural Computation 12, 2427{2450.

Tschernig, R. and Yang, L.: 2000, Nonparametric lag selection for time series, Journal of Time Series

Analysis 21, 457{487.

Vieu, P.: 1995, Order choice in nonlinear autoregressive models, Statistics 26, 307{328.

Weigend, A., Huberman, B. and Rumelhart, D.: 1992, Predicting sunspots and exchange rates with connec-

tionist networks, in M. Casdagli and S. Eubank (eds), Nonlinear Modeling and Forecasting, Addison-

Wesley.

White, H.: 1980, A heteroskedasticity-consistent covariance matrix estimator and a direct test for het-

eroskedasticity, Econometrica 48, 817{838.

White, H.: 1989, An additional hidden unit test for negleted nonlinearity in multilayer feedforward networks,

Proceedings of the International Joint Conference on Neural Networks, IEEE Press, New York, NY,

pp. 451{455.

White, H.: 1990, Connectionist nonparametric regression: Multilayer feedforward networks can learn arbi-

trary mappings, Neural Networks 3, 535{550.

Wooldridge, J. M.: 1990, A uni�ed approach to robust, regression-based speci�cation tests, Econometric

Theory 6, 17{43.

Yao, Q. and Tong, H.: 1994, On subset selection in non-parametric stochastic regression, Statistica Sinica

4, 51{70.

46

Zapranis, A. and Refenes, A.-P.: 1999, Principles of Neural Model Identi�cation, Selection and Adequacy:

With Applications to Financial Econometrics, Springer-Verlag, Berlin.

47

building neural network models for time series ... - fgv epgeepge.fgv.br/files/1108.pdf · building...

Documents