
Conditional resampling of hydrologic time series using multiple predictor variables: A k-nearest neighbour approach

R. Mehrotra* and Ashish Sharma

School of Civil and Environmental Engineering, The University of New South Wales, Sydney, Australia

*Also at National Institute of Hydrology, Roorkee 247 667, India

Abstract

Unlike parametric alternatives for time series generation, nonparametric approaches generate new values by conditionally resampling past observations using a probability rationale. Observations lying ‘close’ to the conditioning vector are resampled with higher probability; ‘closeness’ is defined using a Euclidean or Mahalanobis distance formulation. A common problem here is the difficulty in distinguishing the importance of each predictor in the estimation of the distance. As a consequence, the conditional probability, and hence the resampled series, offers a biased representation of the true population it aims to simulate. This paper presents a variation of the k-nearest neighbour resampler designed for use with multiple predictor variables. In the modification proposed, an influence weight is assigned to each predictor in the conditioning set with the aim of identifying nearest neighbours that represent the conditional dependence in an improved manner. The workability of the proposed modification is tested using synthetic data from known linear and non-linear models, and its applicability is illustrated through an example where daily rainfall is resampled over 30 stations near Sydney, Australia, using a predictor set consisting of selected large-scale atmospheric circulation variables.

1.0 Introduction

Hydrologic time series models aim to reproduce the most relevant statistical characteristics of the observed series to aid in the planning, design, operation and management of water resources. As the observed time series is just one realization amongst an infinite number that may have occurred, reliance on the historical series alone under-represents the uncertainty associated with the resulting design or management plan. As far as possible, this uncertainty must be incorporated in the planning or design of the system, something that is often accomplished through the use of time series or stochastic models. These modeling approaches are broadly grouped into parametric and nonparametric approaches, the former being expressed via selected parameters (often estimated as functions of sample moments), while the latter are formulated based on the entire sample, thus using the observed (sampled) variability in characterizing uncertainty. Nonparametric stochastic approaches offer an efficient alternative when sufficient data is available. Their intuitive simplicity and transparency have made them attractive and popular for use in hydrology and other sciences.

Two commonly used nonparametric stochastic alternatives are kernel density and nearest neighbour based resampling methods. Both have been used extensively for a range of hydro-climatological applications [Young, 1994; Lall and Sharma, 1996; Lall et al., 1996; Tarboton et al., 1998; Sharma and Lall, 1999; Rajagopalan and Lall, 1999; Buishand and Brandsma, 2001; Sharma and O’Neill, 2002; Yates et al., 2003; Beersma and Buishand, 2003; Harrold et al., 2003; Wójcik and Buishand, 2003; Mehrotra et al., 2004]. Both are based on a similar rationale, generating new values through a conditional probability density function (PDF) that is estimated using the observed data. This conditional PDF is broadly expressed as:

f(R_t | X_t) = Σ_i w_i f(R_t | R_i)                          (1a)

w_i = ψ(X_t − X_i) / Σ_j ψ(X_t − X_j)                        (1b)

where R_t is a vector of predictands, w_i is the weight associated with the predictands at time step i, X_t and X_i, respectively, are multivariate vectors (also called feature vectors) of predictor variables at times t and i, ψ(.) is a measure of proximity of X_t to X_i that differs depending on the nonparametric method used, and f(R_t | X_t) is the conditional PDF from which R_t is simulated. Note that if R_t is expressed as X_{t+1}, then the above formulation becomes analogous to a classical multivariate time series model [Salas, 1993]. In the notation used above, one can think of R_t as a vector of daily rainfall amounts or occurrences at multiple locations in a region that are to be simulated conditional to predictors X_t that represent relevant atmospheric indicators.

This general formulation is specific to a stochastic downscaling model [Mehrotra et al., 2004] which forms one of the three examples presented later in this study. Note also that in the notation used above, vectors and matrices are represented in bold, a convention that is followed throughout.

The nearest neighbour based resampling methods are founded on pattern recognition theory that dates back to the early fifties, and use the classic bootstrap described by Yakowitz [1985] and Efron and Tibshirani [1993]. An important issue in nearest-neighbour resampling is the choice of a function ψ(.) which identifies the proximity of the k nearest neighbours of a particular state. This study uses the k-nearest neighbour bootstrap formulation proposed by Lall and Sharma [1996], which specifies the proximity ψ(.) in (1b) as:

ψ(X_t − X_i) = 1/k    if k ≤ K                               (2)
             = 0      if k > K

where k denotes the number of observations between X_t and X_i (inclusive of X_t) in the historical record and K is a specified maximum value of k. The ‘proximity’ or ‘nearness’ is defined through a distance formulation, the Euclidean distance ξ_{t,i} between X_t and X_i being:

ξ_{t,i} = [ Σ_{j=1}^{m} { s_j (X_{j,t} − X_{j,i}) }^2 ]^{1/2}        (3)

where the vector X_i consists of m predictor variables X_{j,i}, j = 1…m, and s_j is the scaling weight for the jth predictor. It is also possible to use other distance measures (for example, see Yates et al. [2003]; Souza Filho and Lall [2003] and Wójcik and Buishand [2003]). For other nonparametric methods the exact formulation of (1a, b) varies (see, for example, Sharma [2000] for a formulation using kernel methods).
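To make the resampling mechanics concrete, the weights of (1b) under the discrete kernel of (2) and the scaled Euclidean distance of (3) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the function name and the row-per-observation array layout are assumptions:

```python
import numpy as np

def knn_resample_weights(X_t, X_obs, s, K):
    """Resampling weights w_i of Eq. (1b), using the kernel of Eq. (2)
    and the scaled Euclidean distance of Eq. (3)."""
    # Scaled Euclidean distance of X_t to every historical feature vector X_i
    d = np.sqrt(((s * (X_obs - X_t)) ** 2).sum(axis=1))
    order = np.argsort(d)                        # rank neighbours by proximity
    psi = np.zeros(len(X_obs))
    psi[order[:K]] = 1.0 / np.arange(1, K + 1)   # psi = 1/k for the K nearest, 0 otherwise
    return psi / psi.sum()                       # normalise as in Eq. (1b)
```

The predictand attached to observation i would then be resampled with probability w_i, so the nearest neighbour is drawn most often and anything beyond the Kth neighbour is never drawn.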

Predictor variables used in the calculation of Euclidean distances are made dimensionless through standardization using scaling weights. Scaling of variables ensures that each predictor has the same weighting in the selection of nearest neighbours. Scaling is also necessary when the feature vector consists of a combination of discrete and continuous variables [Sharma and Lall, 1999; Buishand and Brandsma, 2001]. The reciprocal of the sample standard deviation of each predictor variable is a common choice for the scaling weights [Lall and Sharma, 1996; Rajagopalan and Lall, 1999; Sharma and Lall, 1999; Harrold et al., 2003; Beersma and Buishand, 2003]. However, this has been noted to result in an improper representation of the conditional density, because of which application-specific choices have been used in Brandsma and Buishand [1998]; Buishand and Brandsma [2001] and Beersma and Buishand [2003].

For similar reasons, Yates et al. [2003] and Wójcik and Buishand [2003] used a modified representation of the distance measure in (3), adopting the Mahalanobis distance, based on the inverse of the predictor covariance matrix, as an alternative to the Euclidean distance. Souza Filho and Lall [2003] used a modified Euclidean distance formulation based on coefficients from a fitted linear regression model between the predictors and predictands. These formulations are discussed in detail later in the paper.

The Euclidean distance formulation in (3) considers all predictors to be equally important in the estimation of the conditional probability. In reality, however, each predictor may have unequal importance and some of the predictors may not be significant at all. These less important or otherwise redundant predictor variables tend to influence the conditional probability, and hence the resampled series offers a biased representation of the true population it aims to simulate.

Consider the example in Figure 1 wherein a sinusoidal series of the form z = sin(u) + ε is plotted. Thus, the z series is a function of only one predictor variable u, a uniformly distributed random number varying between 0 and 5π, ε being a Gaussian error term. However, to demonstrate the effect of inclusion of redundant predictor variables, we consider a feature vector that contains three additional predictor variables. These redundant predictors are all uniformly distributed random variables independent of u, z and each other. Giving all four predictors equal importance, or using other distance measures as specified in Yates et al. [2003]; Souza Filho and Lall [2003] and Wójcik and Buishand [2003], leads to a biased estimate of the conditional mean from the KNN conditional probability density. This bias is much stronger at the two boundaries of the data, and at the points which represent a large change in slope (peaks and troughs). An alternative formulation, denoted KNN-W, that considers an optimal scaling weight for each predictor does not suffer from the same deficiency, and is the focus of the present study.

The paper is organized as follows. A brief description of different KNN alternatives, our proposed model (KNN-W) and the procedure used to estimate the optimal influence weights are presented first. In the next section, we demonstrate the importance of including optimal influence weights by applying the different KNN formulations to selected linear and non-linear synthetic examples, and extend the comparison of KNN-W and traditional KNN to downscale synoptic atmospheric patterns to rainfall at multiple point locations in a region near Sydney, Australia. The last section presents the summary and conclusions from this research.

2.0 k nearest neighbour resampling

2.1 Background

As described in the previous section, the k nearest neighbour approach estimates the conditional probability on the basis of the k nearest neighbours of the conditioning vector X_t. In essence, the k patterns in the historical record that are most similar to the conditioning vector are identified, and the k sets of corresponding predictands are specified as the most likely values the system may have at time step t. The common way to define similarity uses the Euclidean distance formulation in (3). To ensure that each predictor of the conditioning vector has the same weighting in the selection of nearest neighbours, standardization of predictor variables is carried out. The reciprocal of the sample standard deviation of each predictor variable is a common choice for the standardization.

Recognizing the fact that correlations among the predictors may influence the selection of nearest neighbours, Yates et al. [2003] and Wójcik and Buishand [2003] used a modified representation of the distance measure in (3) by adopting the Mahalanobis distance as an alternative to the Euclidean distance, with standardization based on the inverse of the predictor covariance matrix. The Mahalanobis distance measure is defined by:

ξ_{t,i} = (X_t − X_i)^T C^{-1} (X_t − X_i)                   (4)

where T represents the transpose operation, C^{-1} is the inverse of the predictor (X) covariance matrix, and X_t and X_i have their usual meaning. It may be noted that this measure of proximity is commonly adopted in multivariate kernel density and regression function estimation (see, for example, Sharma [2000]). Also note that the selection of nearest neighbours in Yates et al. [2003] is based on a concept that is different from the probability weight metric specified in (2). Their procedure assigns a substantially greater probability of selecting the nearest neighbour for the same value of k as compared to the logic in equations (1) and (2).
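Under the same illustrative array conventions as before, the Mahalanobis proximity of (4) can be sketched as follows (a minimal example, not the authors' implementation):

```python
import numpy as np

def mahalanobis_sq(X_t, X_obs):
    """Proximity of Eq. (4): (X_t - X_i)^T C^{-1} (X_t - X_i) for every
    historical feature vector X_i, with C the predictor covariance matrix."""
    C_inv = np.linalg.inv(np.cov(X_obs, rowvar=False))  # inverse predictor covariance
    diff = X_obs - X_t
    # einsum evaluates the quadratic form row by row
    return np.einsum('ij,jk,ik->i', diff, C_inv, diff)
```

Because the covariance matrix couples the predictors, correlated predictors no longer dominate the neighbour search the way they can under the plain Euclidean distance of (3).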

The distances provide the measure of similarity of the current predictor X_t to the observed set X_i. To include the relative importance of each predictor in selecting the nearest neighbours, Souza Filho and Lall [2003] considered regression coefficients of standardized values of predictors and predictand to develop weights for each predictor, and formed a weighted Euclidean distance measure that is expressed as:


ξ_{t,i} = [ Σ_{j=1}^{m} { r_j s_j (X_{j,t} − X_{j,i}) }^2 ]^{1/2}    (5)

where r_j is the regression coefficient between the jth predictor and the predictand, and X_t, X_i, s_j and m have their usual meaning. This distance measure takes into account the linear dependence between predictor and predictand.

2.2 Modified KNN (KNN-W) model

In the modification proposed here, the Euclidean distance in (3) is modified to include an ‘influence’ weight for each predictor as follows:

ξ_{t,i} = [ Σ_{j=1}^{m} β_j { s_j (X_{j,t} − X_{j,i}) }^2 ]^{1/2}    (6)

where s_j is the scaling weight as defined earlier, and β_j is the influence weight associated with the jth predictor. The introduction of influence weights β (β = [β_j], j = 1,…,m) aims to define the information content of each predictor in the estimation of the conditional probability density f(R_t | X_t). In the simplest form, all influence weights β carry equal values and (6) reduces to (3), the traditional KNN model. If the relationship between predictor and predictand is linear, the influence weight β_j closely follows the scaled value r_j^2 / Σ_{i=1}^{m} r_i^2 of the regression coefficient r_j as specified in (5).
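A sketch of the KNN-W distance of (6), assuming the same array conventions as in the earlier sketches; setting an influence weight to zero removes the corresponding predictor from the neighbour search entirely:

```python
import numpy as np

def knn_w_distance(X_t, X_obs, s, beta):
    """Influence-weighted Euclidean distance of Eq. (6).
    With equal beta_j the neighbour ranking matches the traditional KNN
    distance of Eq. (3); beta_j = 0 silences a redundant predictor."""
    return np.sqrt((beta * (s * (X_obs - X_t)) ** 2).sum(axis=1))
```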


2.3 Estimation of the KNN-W influence weights

The scaling weight vector s (s = [s_j], j = 1,…,m), the number of nearest neighbours k and the influence weight vector β are the key to accurate estimation of the conditional probability density in the KNN-W approach. The scaling weight vector s is estimated as the reciprocal of the sample standard deviation associated with each variable in the predictor set. The influence weights β aim to define the relative influence each predictor variable has on the conditional PDF and are ascertained through an optimization procedure using leave-one-out cross-validation, as described next.

Cross validation (CV) is a commonly used basis for measuring the predictive error associated with a model. Cross validation involves developing the model using one part of the sample and assessing its accuracy by applying it to the other, the process being repeated to ensure that all observations are used in estimating the prediction error. Leave-one-out cross-validation (L1CV) is a special case of CV where the model is formulated at each observed data point using all observations except the observation under consideration, and applying the model to predict the observation that has been left out. This procedure is repeated for all observations, and the difference between the observed and predicted value at each data point provides the measure of predictive error. For the current application, an appropriate set of values for the influence vector β is ascertained by minimizing the predictive error associated with the model as assessed using L1CV. In the formulation described next, this measure of the predictive error is used to specify the log-likelihood associated with β, the optimal β being selected as the vector that results in the maximum log-likelihood score.
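The L1CV predictive error that drives the optimization can be sketched as follows: each observation is held out in turn, the conditional mean is taken as the kernel-weighted average of the K nearest predictands, and the squared errors are accumulated. The function is an illustrative reconstruction rather than the authors' code:

```python
import numpy as np

def l1cv_sse(X, R, s, beta, K):
    """Leave-one-out sum of squared prediction errors for a KNN-W
    conditional-mean estimator (illustrative sketch)."""
    n = len(R)
    sse = 0.0
    for t in range(n):
        mask = np.arange(n) != t                           # leave observation t out
        d = np.sqrt((beta * (s * (X[mask] - X[t])) ** 2).sum(axis=1))
        order = np.argsort(d)[:K]                          # K nearest neighbours
        w = 1.0 / np.arange(1, K + 1)                      # kernel of Eq. (2)
        w = w / w.sum()
        pred = (w * R[mask][order]).sum()                  # expected value of conditional PDF
        sse += (R[t] - pred) ** 2
    return sse
```

An optimizer would then search over β (subject to its constraints) for the vector that minimizes this error, equivalently maximizing the likelihood described next.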

It may also be noted here that, as the estimation of the best set of influence weights is exclusively based on the minimization of the predictive error computed using L1CV, the resulting set of predictands may not provide the optimal predictions when assessed based on statistics at a different scale of aggregation (see, for example, the assessment of a daily rainfall generation model using the variability in rainfall at an annual time scale in Harrold et al. [2003]). If the aim were to formulate influence weights that were optimal with respect to statistics estimated at multiple scales, the likelihood function would need to be suitably modified.

In addition to the influence weight vector β, the number of nearest neighbours k is also an unknown that is often difficult to specify in practice. The optimal value of k depends on the number of observations from which nearest neighbours are selected, the variability present in the observed records and the number of dependent variables. Also, the optimal k value differs when used in the context of time series simulation as compared to prediction [see, for example, Buishand and Brandsma, 2001]. In general, the optimal k will be smaller when the purpose is time series simulation than when the purpose is prediction. However, a small number of nearest neighbours carries the danger of duplicating a large part of the historical record in the simulated series. For a sufficient number of sampled observations (≈100) and a small number of predictors (<6), Lall and Sharma [1996] suggested that, as a basic guideline, k can be chosen as the square root of the length of the observations. For time series simulation, Buishand and Brandsma [2001] found better results with small k (≤5). The approach we follow for estimating the influence weights β can also be used for estimating the optimal value of k for a particular application, number of predictor variables and length of observations, however with certain modifications in the likelihood formulation. Likelihood measures (such as the ones presented in the next section) tend to optimize the parameter values on the basis of the expected value of the predictands at a given time step and may therefore settle on a smaller k value. However, as the aim of this study is to illustrate the importance of considering the influence weights in a multi-predictor formulation of the k-nearest neighbour resampling, we opt not to include the number of nearest neighbours in the optimization procedure described in the next section.

2.4 Adaptive Metropolis Algorithm

In the present study we have used the recently proposed adaptive Metropolis (AM) sampling approach [Haario et al., 2001; Marshall et al., 2004] to estimate the optimal values of the influence weights β. The optimization procedure for the synthetic examples is based on the maximization of the log likelihood for continuous data:

l(R | θ) = −(N/2) ln(2πσ^2) − (1/(2σ^2)) Σ_{t=1}^{N} [R_t − g(X_t; θ)]^2    (7)

where l(R | θ) is the log likelihood, R_t is the observed value, g(X_t; θ) is the predicted value at time step t, X_t is the set of predictors at time t, σ^2 is the predictive error variance, N is the total number of observations, and θ is the set of model parameters. For the present application, we include σ^2, the predictive error variance, and β, the influence weights, as the parameters θ to be optimised. Note that g(X_t; θ) is estimated using leave-one-out cross-validation as described in the previous section, as the expected value of the conditional PDF specified in (2) and (6).
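The continuous-data log likelihood of (7) is a standard Gaussian likelihood of the L1CV residuals; a direct transcription (sigma2 here denotes the predictive error variance σ^2):

```python
import numpy as np

def gaussian_loglik(R, pred, sigma2):
    """Log likelihood of Eq. (7): R holds the observations R_t,
    pred the L1CV predictions g(X_t; theta)."""
    N = len(R)
    return (-0.5 * N * np.log(2 * np.pi * sigma2)
            - ((R - pred) ** 2).sum() / (2 * sigma2))
```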

A different likelihood function is used for the downscaling application presented later in this section. As the response here is a discrete variable (rainfall occurrence, taking values of 0 or 1 denoting dry or wet states respectively), the likelihood measure of (7) is modified as follows:


fl(R | θ) = Σ_{t=1}^{N} Σ_{i=1}^{n} log[ 1 − | Rf_{t,i} − Σ_{j=1}^{k} a_j g_{i,j}(X_t; θ) | + b ]    (8)

where fl(R | θ) is the log likelihood, Rf_{t,i} is the observed rainfall occurrence at the ith station at time t, g_{i,j}(X_t; θ) is the predicted rainfall occurrence for the jth nearest neighbour for the ith station at time t, and X_t is the set of predictor variables at time t. Further, n defines the number of stations the atmospheric predictors are being downscaled to, N is the total number of observations, θ is the set of model parameters, a_j is the decreasing kernel for the jth nearest neighbour as defined by (2), and k is the number of nearest neighbours. The additional term b represents a small value to ensure that, on occasions when the observed and predicted rainfall occurrences disagree, the log-likelihood remains feasible and is duly penalized.
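The role of the small constant b can be seen in a sketch of the per-station, per-day score: the term is near zero when the kernel-weighted predicted occurrence matches the observed wet/dry state, and is bounded below by log b when they disagree. This is an illustrative reading of the description above, not a verbatim transcription of the authors' likelihood:

```python
import numpy as np

def occurrence_loglik(obs, pred, b=1e-6):
    """Discrete (wet/dry) log-likelihood sketch: each term is
    log(1 - |obs - pred| + b), so agreement contributes ~log(1) = 0 and
    total disagreement contributes ~log(b), a large penalty that stays finite."""
    return np.log(1.0 - np.abs(obs - pred) + b).sum()
```

Without b, a single complete disagreement would make the term log(0) and the log-likelihood infeasible, which is exactly the failure the text says b guards against.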

The above likelihood formulations were used to ascertain the optimal parameter vector θ using the adaptive Metropolis (AM) algorithm [Haario et al., 2001; Marshall et al., 2004]. This algorithm is characterised by a proposal distribution based on the estimated posterior covariance matrix of the parameters. At step t, Haario et al. [2001] consider a multivariate normal proposal N(θ_t, C_t) with mean given by the current value θ_t, where C_t is the proposal covariance. The covariance C_t has a fixed value (C_0) for the first few iterations and is updated after t_0 iterations as:

C_t = C_0                                          t ≤ t_0
    = η · Cov(θ_0, …, θ_{t−1}) + η · ε · I_d       t > t_0      (9)


where d is the total number of parameters included in the optimization algorithm, I_d is a d×d identity matrix, ε is a parameter chosen to ensure C_t does not become singular, and η is a scaling parameter to ensure reasonable acceptance rates of the proposed states. The steps involved in our implementation of the AM algorithm are:

a) Initialize t = 0 and set C_0 as a diagonal matrix with each diagonal term representing the variance associated with the prior distribution for each unknown.

b) Update C_t for the current iteration number t using (9).

c) Generate a proposed value θ* for θ, where θ* ~ N(θ_t, C_t). If parameter values are outside the parameter constraints, repeat the step again. The constraints for the β parameters are defined as:

   Σ_{j=1}^{m} β_j = 1.0 ;  and  0 < β_j < 1.0      (10)

These conditions ensure that the relative magnitude of each parameter is retained and that it remains bounded.

d) Compute the log likelihood l(R | θ*) using (7), or fl(R | θ*) using (8), following the L1CV procedure.

e) Calculate the acceptance probability, α, of the proposed parameter values using:

   α = min{ 1, exp[ l(R | θ*) + p(θ*) − l(R | θ_t) − p(θ_t) ] }      (11)

where p(θ) is the log prior distribution of θ, the priors being specified as a Uniform PDF over [0,1] for each element of the influence weight vector β, and a scaled inverse-chi-square distribution χ^{-2}(ν, λ) for the predictive error variance σ^2 (for the likelihood expressed in (7)), the parameters ν and λ being specified depending on the nature of the error associated with the true model. In the synthetic examples presented in the next section, ν and λ were specified as 0.5 and 8 respectively.

f) Generate u ~ U[0,1]. If u < α, accept θ_{t+1} = θ*; otherwise set θ_{t+1} = θ_t.

g) Repeat b–f a sufficient number of times to ensure that the posterior distribution of the parameter vector θ has been sampled exhaustively.
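Steps a)–g) can be condensed into a short sampler. This is a hedged sketch of the Haario et al. [2001] scheme, with the prior folded into the supplied log-posterior function and without the parameter-constraint check; the default η = 2.4²/d is a common scaling choice from that literature, not a value specified in this paper:

```python
import numpy as np

def adaptive_metropolis(log_post, theta0, C0, n_iter=2000, t0=200,
                        eta=None, eps=1e-8, seed=1):
    """Adaptive Metropolis sketch: Gaussian random-walk proposal whose
    covariance switches from the fixed C0 to
    eta * Cov(past states) + eta * eps * I after t0 iterations."""
    rng = np.random.default_rng(seed)
    d = len(theta0)
    eta = 2.4 ** 2 / d if eta is None else eta       # assumed default scaling
    chain = [np.asarray(theta0, dtype=float)]
    lp = log_post(chain[0])
    for t in range(1, n_iter):
        if t <= t0:
            C = C0
        else:                                        # adapted covariance, Eq. (9)
            C = eta * np.cov(np.asarray(chain), rowvar=False) + eta * eps * np.eye(d)
        prop = rng.multivariate_normal(chain[-1], C)
        lp_prop = log_post(prop)
        # accept with probability min(1, posterior ratio), as in step e)-f)
        if np.log(rng.uniform()) < lp_prop - lp:
            chain.append(prop)
            lp = lp_prop
        else:
            chain.append(chain[-1].copy())
    return np.asarray(chain)
```

In the paper's setting, log_post would evaluate the L1CV-based likelihood of (7) or (8) plus the log priors, with proposals violating the β constraints rejected before evaluation.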

The scaling parameter η and the stopping criterion for our algorithm are based on the recommendations of Marshall et al. [2004] and Haario et al. [2001]. A reasonable approach is to start the optimization with the initial set of β parameters assigned equal values satisfying (10), and to take the predictive error variance as the variance of the predictive error estimated using these parameter values. In the results reported next, the optimal θ has been chosen as the one that results in the maximum log-likelihood in (7) or (8) from the assigned number of iterations. Standard convergence tests [Marshall et al., 2004] were used to assess the adequacy of the number of iterations (15000) assigned for each application, details on which are presented next.

3.0 Applications

This section illustrates the impact of using optimal influence weights for k-nearest neighbour resampling through three carefully designed experiments and four KNN resampling alternatives. The first two experiments use data from known synthetic models and are designed to test the following three hypotheses: (a) the influence weight should assume a value of zero when a predictor variable is redundant; (b) the influence weight should be dictated by the information content each predictor has about the predictand(s); and (c) inclusion of influence weights should result in an improvement over the case where they are set equal to one (the traditional KNN approach). The last experiment presents an application to downscale rainfall using a system of atmospheric circulation predictors, the aim being to illustrate the improvements over the case where the predictors have equal influence weights.

The resampling alternatives considered in the synthetic examples differ in the way the scaling of variables is accomplished. The different alternatives for scaling of variables include: i) reciprocal of the standard deviation (KNN), defined using (3); ii) inverse of the predictor covariance (KNN-M1), defined using (4); iii) regression coefficient vector of predictor and predictand together with the reciprocal of the standard deviation (KNN-M2), defined using (5); and iv) influence weights together with the reciprocal of the standard deviation (KNN-W), defined using (6). While the second alternative (KNN-M1) is based on the Mahalanobis distance measure, the other alternatives consider the Euclidean distance measure as the choice of function for identifying the proximity of the k nearest neighbours. For KNN-M2, regression coefficients relating the predictors and predictand were estimated by carrying out a multiple linear regression analysis. For the synthetic cases, the number of nearest neighbours k was specified equal to the square root of the number of observations.

3.1 Synthetic case study using a linear model

A sample of 200 observations was generated using the following linear model:

z_i = x_{1,i} + x_{2,i} + ε_i      (12)


where i varies from 1 to 200, x_1 and x_2 are generated from a standard normal PDF N(0,1) and ε is a noise term also from a normal distribution N(0, 0.5). Thus, the predictand z is a known function of x_1 and x_2, with both contributing equally. To evaluate the capability of the method at identifying redundant predictors, two additional variables, x_3 and x_4, both independent and N(0,1), were also included as predictors in the KNN resampling algorithm. Given this configuration, an optimal outcome from the KNN-W should result in nearly equal influence weights (β_1 and β_2) for x_1 and x_2 and negligible weights (β_3 and β_4) for x_3 and x_4. On the other hand, as KNN considers all variables to be equally important and assigns equal weights to all, one would expect the predictive error associated with the KNN to be larger than that for the KNN-W. As the relationship among predictors and predictand is linear, KNN-M2 is expected to perform as well as KNN-W. Also, as all predictors are independent of each other, the performance of KNN-M1 should be similar to KNN.

Figure 2 presents the box-plots representing the posterior distribution of influence

weights obtained from the optimisation runs for the KNN-W. The optimized values of

these parameters are also presented in Table 1. As can be seen from these results, the

KNN-W is able to ascertain the inherent structure of the model by identifying variables x3

and x4 as redundant predictors with negligible influence weights, and variables x1 and x2

as significant predictors with almost equal influence weights, thus establishing the first

two of the three hypotheses we seek to test. This table also includes the influence

weights for KNN and KNN-M2 formulations. For KNN-M2 these are equivalent to

∑ =mi ij rr 1

22 , where jr is regression coefficient for jth predictor as specified in (5) and

these values are similar to the KNN-W influence weights. The specification of influence

weights for KNN-M1 model is not straightforward. Table 1 also presents the sum of

squares of residuals (SSR) for all KNN formulations ascertained based on the ability of


these models to correctly estimate the conditional mean using leave-one-out cross-

validation. The improved performance of the KNN-W as compared to other KNN

formulations establishes the last of the three hypotheses that were tested. As expected,

KNN-M2 provides SSR similar to KNN-W while KNN-M1 results are similar to KNN.
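The KNN-M2 weight rule used above (each weight equal to r_j^2 / Σ_i r_i^2, with r_j a multiple-regression coefficient) can be sketched as below. This is a minimal illustration, assuming ordinary least squares with an intercept that is excluded from the weights:

```python
import numpy as np

def knn_m2_weights(X, z):
    """KNN-M2 influence weights: beta_j = r_j**2 / sum_i r_i**2,
    where r_j is the multiple linear regression coefficient of the
    j-th predictor (the intercept is dropped)."""
    A = np.column_stack([np.ones(len(X)), X])     # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)  # least-squares fit
    r = coef[1:]                                  # drop the intercept term
    return r ** 2 / np.sum(r ** 2)
```

For a purely linear predictand such as the one in this case study, these weights fall close to the KNN-W optima, which is why the two formulations perform similarly here.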

3.2 Synthetic case study using a nonlinear model

The second case study is based on a sample of 300 observations from a nonlinear

model of the form:

z_i = sin(u_i) + ε_i                                                                 (13)

where i varies from 1 to 300, u is a series of uniformly distributed random numbers over

the interval [0, 5π], ε represents noise from a normal distribution N(0, 0.20), and z

denotes the response from the model. Three redundant predictors, each from a Uniform

distribution U[0,1], were also included to evaluate the impact they have on the resulting

model estimates. As with the previous example, we expect KNN-W to assign a unit

influence weight (β1) to the variable u and negligible influence weights (β2, β3 and β4)

to the three redundant predictors. As the relationship between the predictors and the predictand is now non-linear, we expect the performance of KNN, KNN-M1 and KNN-M2 to

be inferior as compared to the KNN-W.
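The synthetic experiment just described can be reproduced in a few lines. This is a sketch under one stated assumption: we read N(0, 0.20) as a standard deviation of 0.20, and the random seed is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300
u = rng.uniform(0.0, 5.0 * np.pi, n)          # predictor: uniform over [0, 5*pi]
z = np.sin(u) + rng.normal(0.0, 0.20, n)      # Eq. (13): z_i = sin(u_i) + eps_i
redundant = rng.uniform(0.0, 1.0, (n, 3))     # three redundant U[0, 1] predictors
X = np.column_stack([u, redundant])           # feature matrix for the resampler
```

An optimal weight estimate should then concentrate nearly all influence on the first column of X.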

Box-plots of the posterior distribution of each influence weight are illustrated in Figure

3. The optimal values of influence weights and corresponding error statistics are

presented in Table 2. This table also includes the influence weights for KNN and KNN-

M2 formulations. As can be seen from these results, the KNN-W is able to identify the

three additional variables as redundant predictors with negligible influence weights, and


the u variable as the sole significant predictor with a weight nearing unity. A plot of the

true series and the KNN and KNN-W estimates was presented in Figure 1 earlier in the

paper. As the results of KNN-M1 and KNN-M2 are very similar to those of KNN, they were not included in the plot. As can be inferred from Table 2 and Figure 1, the KNN-W results

offer a significant improvement over other KNN formulations. The KNN, KNN-M1 and

KNN-M2 are not able to identify the non-linear nature of the relationship between the predictors and predictand, and the associated SSR values are much higher than those of KNN-W.

We have demonstrated that inclusion of influence weights in defining Euclidean

distances and thereby estimating the conditional PDF results in an improvement in the

resampling procedure, irrespective of whether the relationship between the predictors and the predictand is linear or non-linear. Similar improvements can be expected in an application where the true form

of the underlying model is not known. This is demonstrated next by downscaling rainfall

at a network of raingauges near Sydney, Australia.

3.3 Downscaling of rainfall near Sydney, Australia

The downscaling application uses 43 years (1960-2002) of daily rainfall at 30 stations

located around Sydney, New South Wales, Australia for June (winter in the southern

hemisphere) and is approached using the KNN and KNN-W formulations. In an earlier

study, Mehrotra et al. [2004] identified the mean sea level pressure (MSLP), geopotential

heights (GPH) at 700 hPa and their gradients as plausible atmospheric circulation

indicators of the rainfall patterns in this region. Further consultations with climatologists

and hydrologists indicated that total precipitable water content may also have a strong

influence in driving the rainfall mechanism over the region. Consequently, a simple

correlation analysis was conducted and a total of 6 atmospheric circulation variables

were identified as predictors for the downscaling example. These consisted of the total

precipitable water content over the study region, east-west gradient of GPH at 700 hPa


and north-south gradient of MSLP of current as well as previous day. The predictands in

this application are daily rainfall occurrences at the 30 stations. A day was classified as wet or dry depending on whether the rainfall amount was greater than or less than 0.3 mm, respectively [after Harrold et al., 2003; Buishand, 1978]. Because this application aims to predict rainfall values (in contrast to generating a time series of likely values), and in view of the number of predictors, the total number of observations, the large number of predictands and the suggestions of Buishand and Brandsma [2001], the number of nearest neighbours k was kept small and set equal to 5 for this application.

The influence weights for this example were estimated using the optimization procedure

and the likelihood measure in (8). Once the influence weights were identified, we

evaluated the performances of KNN and KNN-W by predicting daily rainfall amounts in a

leave-one-out cross validation setting. A comparison of the observed and predicted

rainfall amount served as an additional validation measure for the procedure as the

model was developed with reference only to the rainfall occurrence (wet or dry) at each

station. The leave-one-out cross-validation was performed by downscaling for day t

based on the conditional PDF in (2) and (3), with neighbours that did not include the data for

day t. Results from both models were ascertained by resampling 100 realisations of

rainfall amounts for each day of the observed record, based on which various

performance measures were computed. The comparison of results was based on the

reproduction of selected spatial and temporal characteristics of daily and aggregated

monthly rainfall totals including those of importance to water resource management.

The graphical comparison was performed on the basis of the following statistics: (a) a

performance measure indicating the success at simulating wet and dry days similar to

what was observed; (b) mean value of monthly rainfall totals and (c) maximum daily

rainfall amount. Numerical comparison of these models was based on the estimation of


the root mean square error (RMSE) for the cross correlation, wetness fraction, wet and

dry spell statistics, and daily rainfall occurrence.
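The RMSE measure used for the numerical comparison can be sketched as follows (an illustrative helper, applied here to a statistic evaluated station by station):

```python
import numpy as np

def rmse(observed, simulated):
    """Root mean square error between observed and simulated values of
    a statistic, e.g. one value per station."""
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(simulated, dtype=float)
    return float(np.sqrt(np.mean((sim - obs) ** 2)))
```

The entries of Table 4 are of this form, one RMSE per statistic, computed over the simulated realisations and stations.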

Table 3 presents the optimized values of influence weights for all predictors. The KNN-W

recognizes the three atmospheric circulation variables of the current day as the most

significant predictors.

Figure 4 presents the success rates of model predicted wet and dry days at each station.

The success rate is defined as the ratio of number of days when both observed and

predicted values are wet (dry) to the total number of wet (dry) days in the observed

record. For a perfect match the success rate approaches unity. The accurate reproduction of wet and dry days defines the day-to-day persistence of the daily rainfall and also helps in representing wet and dry spells of longer durations; as such, these statistics are of prime concern in catchment management studies. Both models underestimate the success rates of wet and dry days. However, KNN-W is more successful in reproducing these statistics at all stations.
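The success rate defined above can be computed as follows (a minimal sketch; the array names are illustrative):

```python
import numpy as np

def success_rates(observed_wet, predicted_wet):
    """Wet- and dry-day success rates: the fraction of observed wet
    (dry) days on which the prediction is also wet (dry). A perfect
    model returns (1.0, 1.0)."""
    obs = np.asarray(observed_wet, dtype=bool)
    pred = np.asarray(predicted_wet, dtype=bool)
    wet = (obs & pred).sum() / obs.sum()        # agreement on observed wet days
    dry = (~obs & ~pred).sum() / (~obs).sum()   # agreement on observed dry days
    return float(wet), float(dry)
```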

To evaluate the performance of the model on a statistic that is independent of the

procedure used to specify the model, attention was paid to the representation of the

daily rainfall amount on each day classified as wet. Note that this serves as a means to

validate the performance of the model as the influence weight estimation procedure uses

only the rainfall occurrence to estimate the model likelihood. Figure 5 presents the

scatter plots of observed and model predicted monthly mean and maximum daily rainfall

values at each station. There is a marginal improvement in the statistics estimated

based on KNN-W results as compared to those from the KNN model. Note that the

improvements are more visible for the stations having higher rainfall averages.


Table 4 provides the RMSE scores of various statistics for KNN and KNN-W application

runs. As can be inferred from this table, inclusion of influence weights offers

encouraging improvement in the predicted results in almost all statistics considered.

The overall divergence of the predicted results from the observed ones could be due to the large number of stations and the choice of predictor variables considered in the present application. The atmospheric circulation variables selected as potential predictors do not adequately define the temporal correlation of rainfall at each station. We feel that inclusion of a better mechanism explaining the lagged spatial distribution of rainfall patterns could offer improved results. Defining the relative contribution of each predictor variable through influence weights, the main focus of the present study, nevertheless helps overcome this deficiency to some extent.

4.0 Summary and Conclusions

We presented a variation of the traditional KNN time series resampling approach to deal

with the problem of using multiple predictors in the specification of the conditional

probability density. The utility of the proposed approach was assessed by applying it to

linear and non-linear synthetic datasets and comparing the results with the traditional

KNN formulations. The results of the study indicate that the proposed modification is

able to pick up the linear as well as the non-linear dependence structure between the predictors and predictands. The modified formulation was also applied to downscale rainfall using

multiple predictors, and improvements in the results as compared to the traditional KNN

approach were observed. The modified approach was found to generate more

representative rainfall occurrences and amounts when evaluated at both daily and

monthly time scales.


The modification requires the estimation of an optimum influence weight for each

predictor. In the study presented here, we used the AM algorithm [Haario et al., 2001;

Marshall et al., 2004] for estimation of these weights. The AM algorithm performs a

search in the parameter space and requires a large number of iterations to converge

when many parameters are being assessed. Also, the KNN method is data intensive, as it sorts distances at each time step, and therefore requires a significant amount of computer time. Combining the AM algorithm with the KNN method may slow the computations considerably, especially with a large number of predictors and predictands. We feel that this limitation can be better addressed through the use of special sorting algorithms, such as partial sorts, adaptive sorting algorithms [Estivill-Castro and Wood, 1992] and randomized algorithms [Motwani and Raghavan, 1995] or their

variations, which were not attempted in our study.

One of the significant advantages of the KNN framework is the simplicity with which

multiple predictor variables can be accommodated in the feature vector. The proposed

modification can be used to find the significance of the added predictor(s) as compared

to the existing predictors in defining the predictand(s). It is also possible to include the

number of nearest neighbours, k, for the given dataset in the optimization procedure

using a modified version of the model likelihood. The “curse of dimensionality” [Scott,

1992] is less felt in the KNN-W formulation due to the reduced weight associated with

redundant or irrelevant predictor variables. Consequently, formulation of higher

dimensional nonparametric models, such as those used for disaggregation [Tarboton et

al., 1998], multivariate stochastic simulation [Salas et al., 1980] or downscaling

[Mehrotra et al., 2004] is simplified through the use of the influence weight logic

proposed here.

Acknowledgements


This work was partially funded by the Australian Research Council. We wish to

acknowledge the constructive comments of two anonymous reviewers whose inputs

greatly benefited the quality of our presentation.

References

Beersma, J. J., and T. A. Buishand, Multi-site simulation of daily precipitation and temperature conditional on the atmospheric circulation, Clim. Res., 25, 121-133, 2003.

Brandsma, T., and T. A. Buishand, Simulation of extreme precipitation in the Rhine basin by nearest neighbour resampling, Hydrol. Earth Syst. Sci., 2, 195-209, 1998.

Buishand, T. A., Some remarks on the use of daily rainfall models, J. Hydrol., 36, 295-308, 1978.

Buishand, T. A., and T. Brandsma, Multisite simulation of daily precipitation and temperature in the Rhine basin by nearest-neighbor resampling, Water Resour. Res., 37, 2761-2776, 2001.

Efron, B., and R. J. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall, New York, 1993.

Estivill-Castro, V., and D. Wood, A survey of adaptive sorting algorithms, ACM Computing Surveys (CSUR), 24(4), 441-476, 1992.

Haario, H., E. Saksman, and J. Tamminen, An adaptive Metropolis algorithm, Bernoulli, 7(2), 223-242, 2001.

Harrold, T. I., A. Sharma, and S. J. Sheather, A nonparametric model for stochastic generation of daily rainfall occurrence, Water Resour. Res., 39(10), SWC 10:1-11, 2003.

Lall, U., and A. Sharma, A nearest neighbor bootstrap for time series resampling, Water Resour. Res., 32, 679-693, 1996.

Lall, U., B. Rajagopalan, and D. G. Tarboton, A nonparametric wet/dry spell model for resampling daily precipitation, Water Resour. Res., 32(9), 2803-2823, 1996.

Marshall, L., D. Nott, and A. Sharma, A comparative study of Markov chain Monte Carlo methods for conceptual rainfall-runoff modelling, Water Resour. Res., 40, W02501, doi:10.1029/2003WR002378, 2004.

Mehrotra, R., A. Sharma, and I. Cordery, Comparison of two approaches for downscaling synoptic atmospheric patterns to multisite precipitation occurrence, accepted for publication in J. Geophys. Res., 2004.

Motwani, R., and P. Raghavan, Randomized Algorithms, Cambridge University Press, 476 pp., 1995.

Rajagopalan, B., and U. Lall, A k-nearest-neighbour simulator for daily precipitation and other weather variables, Water Resour. Res., 35(10), 3089-3101, 1999.

Salas, J. D., Analysis and modelling of hydrologic time series, in Handbook of Hydrology, edited by D. R. Maidment, pp. 19.1-19.72, McGraw-Hill, New York, 1993.

Salas, J. D., J. W. Delleur, V. Yevjevich, and W. L. Lane, Applied Modeling of Hydrologic Time Series, Water Resources Publications, Colorado, USA, 482 pp., 1980.

Scott, D. W., Multivariate Density Estimation: Theory, Practice and Visualization, John Wiley, New York, 1992.

Sharma, A., and U. Lall, A nonparametric approach for daily rainfall simulation, Mathematics and Computers in Simulation, 48, 361-371, 1999.

Sharma, A., Seasonal to interannual rainfall probabilistic forecasts for improved water supply management: Part 3 — A nonparametric probabilistic forecast model, J. Hydrol., 239, 249-258, 2000.

Sharma, A., and R. O'Neill, A nonparametric approach for representing interannual dependence in monthly streamflow sequences, Water Resour. Res., 38(7), 5.1-5.10, 2002.

Souza Filho, F., and U. Lall, Seasonal to interannual ensemble streamflow forecasts for Ceara, Brazil: Applications of a multivariate, semiparametric algorithm, Water Resour. Res., 39(11), 1307, doi:10.1029/2002WR001373, 2003.

Tarboton, D. G., A. Sharma, and U. Lall, Disaggregation procedures for stochastic hydrology based on nonparametric density estimation, Water Resour. Res., 34(1), 107-119, 1998.

Wójcik, R., and T. A. Buishand, Simulation of 6-hourly rainfall and temperature by two resampling schemes, J. Hydrol., 273, 69-80, 2003.

Yakowitz, S., Nonparametric density estimation, prediction, and regression for Markov sequences, J. Am. Stat. Assoc., 80(389), 215-221, 1985.

Yates, D., S. Gangopadhyay, B. Rajagopalan, and K. Strzepek, A technique for generating regional climate scenarios using a nearest-neighbor algorithm, Water Resour. Res., 39(7), SWC 7:1-15, 2003.

Young, K. C., A multivariate chain model for simulating climatic parameters from daily data, J. Appl. Meteorol., 33, 661-671, 1994.


Figure Captions

Figure 1: True, observed and predicted data points using KNN and KNN-W formulations – non-linear synthetic dataset. The points on the graph indicate the series formed, the thick continuous line indicates the true sine curve, the thin dashed line indicates the series generated by KNN, and the thin continuous line indicates the series generated by KNN-W.

Figure 2: Box plot of influence weights of KNN-W with linear synthetic dataset. Filled

circles (●) indicate optimized values of influence weights β and associated

SSR. The whiskers extend to 1.5 times the inter-quartile range of the data.

Observations outside the whiskers are represented as “○”.

Figure 3: Box plot of influence weights ß of KNN-W with non-linear synthetic dataset.

Filled circles (●) indicate optimized values of influence weights β and

associated SSR. The whiskers extend to 1.5 times the inter-quartile range of

the data. Observations outside the whiskers are represented as “○”.

Figure 4: Success rates of number of days being (a) wet and (b) dry in the predicted record at each station for KNN and KNN-W using real dataset.

Figure 5: Scatter plots of (a) means and (b) maximum daily rainfall amount at each

station for KNN and KNN-W using real dataset.


Table 1: Optimized values of influence weights and related sum of squares of residuals (SSR) – linear synthetic case. The influence weights for KNN-M2 are specified as r_j^2 / Σ_{i=1}^{m} r_i^2, where r_j is the regression coefficient for the jth predictor.

Formulation    β1      β2      β3      β4      SSR
KNN-W          0.485   0.507   0.000   0.008   54.52
KNN            0.250   0.250   0.250   0.250   85.45
KNN-M1         -       -       -       -       80.95
KNN-M2         0.511   0.489   0.000   0.000   55.83

Table 2: Optimized values of influence weights and related sum of squares of residuals (SSR) – non-linear synthetic case. The influence weights for KNN-M2 are specified as r_j^2 / Σ_{i=1}^{m} r_i^2, where r_j is the regression coefficient for the jth predictor.

Formulation    β1      β2      β3      β4      SSR
KNN-W          0.998   0.001   0.000   0.001    13.13
KNN            0.250   0.250   0.250   0.250   129.79
KNN-M1         -       -       -       -       130.74
KNN-M2         0.021   0.012   0.748   0.219   168.08


Table 3: Optimized values of influence weights and sum of squares of residuals (SSR) – real dataset.

Formulation    β1      β2      β3      β4      β5      β6      SSR
KNN-W          0.160   0.260   0.480   0.020   0.040   0.050   9541
KNN            0.166   0.166   0.166   0.166   0.166   0.166   9990


Table 4: RMSE in statistical characteristics – real dataset.

Statistic                                          KNN     KNN-W
Number of wet spells of two days                   0.976   0.994
Maximum wet spell length in days                   1.923   2.009
Total number of wet spells                         1.441   1.458
Number of dry spells of two days                   0.816   0.826
Maximum dry spell length in days                   4.603   4.516
Total number of dry spells                         1.343   1.337
Number of solitary wet days                        1.485   1.512
Number of wet days                                 3.618   3.851
Daily rainfall occurrence                          0.490   0.503
Probability of a wet day following a wet day       0.071   0.077
Probability of a wet day following a dry day       0.015   0.015
Probability of a dry day following a wet day       0.015   0.015
Probability of a dry day following a dry day       0.074   0.080
Lag one correlation of mean wetness fraction       0.367   0.381
Cross correlation of rainfall amount at stations   1.152   1.091


[Figure 1: plot of z = sin(u) against u for the non-linear synthetic dataset, showing the observed points, the true sine curve, and the KNN and KNN-W estimates.]

Figure 1: True, observed and predicted data points using KNN and KNN-W formulations – non-linear synthetic dataset. The points on the graph indicate the series formed, the thick continuous line indicates the true sine curve, the thin dashed line indicates the series generated by KNN, and the thin continuous line indicates the series generated by KNN-W.


Figure 2: Box plot of influence weights of KNN-W with linear synthetic dataset. Filled

circles (●) indicate optimized values of influence weights β and associated

SSR. The whiskers extend to 1.5 times the inter-quartile range of the data.

Observations outside the whiskers are represented as “○”.


Figure 3: Box plot of influence weights β of KNN-W with non-linear synthetic dataset.

Filled circles (●) indicate optimized values of influence weights β and

associated SSR. The whiskers extend to 1.5 times the inter-quartile range of

the data. Observations outside the whiskers are represented as “○”.


[Figure 4: success rate plotted against station number for KNN-W and KNN; panel (a) for wet days, panel (b) for dry days.]

Figure 4: Success rates of number of days being (a) wet and (b) dry in the predicted record at each station for KNN and KNN-W using real dataset.


[Figure 5: scatter plots of predicted against observed values in mm for KNN and KNN-W; panel (a) average of monthly rainfall totals, panel (b) extreme daily rainfall.]

Figure 5: Scatter plots of (a) means, and (b) maximum daily rainfall amount at each station for KNN and KNN-W using real dataset.