Conditional resampling of hydrologic time series using
multiple predictor variables: A k-nearest neighbour
approach
R. Mehrotra* and Ashish Sharma
School of Civil and Environmental Engineering, The University of New South Wales,
Sydney, Australia
*Also at National Institute of Hydrology, Roorkee -247 667, India
Abstract
Unlike parametric alternatives for time series generation, nonparametric approaches
generate new values by conditionally resampling past observations using a probability
rationale. Observations lying ‘close’ to the conditioning vector are resampled with higher
probability, ‘closeness’ is defined using a Euclidean or Mahalanobis distance
formulation. A common problem in this is the difficulty in distinguishing the importance of
each predictor in the estimation of the distance. As a consequence, the conditional
probability and hence the resampled series, offers a biased representation of the true
population it aims to simulate. This paper presents a variation of the k-nearest neighbor
resampler designed for use with multiple predictor variables. In the modification
proposed, an influence weight is assigned to each predictor in the conditioning set with
the aim of identifying nearest neighbours that represent the conditional dependence in
an improved manner. The workability of the proposed modification is tested using
synthetic data from known linear and non-linear models and its applicability is illustrated
through an example where daily rainfall is resampled over 30 stations near Sydney,
Australia using a predictor set consisting of selected large-scale atmospheric circulation
variables.
1.0 Introduction
Hydrologic time series models aim to reproduce the most relevant statistical
characteristics of the observed series to aid in the planning, design, operation and
management of water resources. As the observed time series is just one realization
amongst an infinite number that may have occurred, reliance on only the historical series
under-represents the uncertainty associated with the resulting design or management
plan. As far as possible, this uncertainty must be incorporated in the planning or design
of the system, something that is often accomplished through the use of time series or
stochastic models. These modeling approaches are broadly grouped into parametric and
nonparametric approaches, the former being expressed via selected parameters (often
estimated as functions of sample moments), while the latter are formulated using the
entire sample, thus drawing on the observed (sampled) variability to characterize
uncertainty. Nonparametric stochastic approaches offer an efficient alternative when
sufficient data are available. Their intuitive simplicity and transparency have made them
attractive and popular for use in hydrology and other sciences.
Two commonly used nonparametric stochastic alternatives are kernel density and
nearest neighbour based resampling methods. Both have been used extensively for a
range of hydro-climatological applications [Young, 1994; Lall and Sharma, 1996; Lall et
al., 1996; Tarboton et al., 1998; Sharma and Lall, 1999; Rajagopalan and Lall, 1999;
Buishand and Brandsma, 2001; Sharma and O’Neill, 2002; Yates et al., 2003; Beersma
and Buishand, 2003; Harrold et al., 2003; Wójcik and Buishand, 2003; Mehrotra et al.,
2004]. Both are based on a similar rationale, generating new values from a
conditional probability density function (PDF) that is estimated using the observed
data. This conditional PDF is broadly expressed as:
f(R_t | X_t) = Σ_i w_i R_i        (1a)

w_i = ψ(X_t − X_i) / Σ_j ψ(X_t − X_j)        (1b)
where R_t is a vector of predictands, w_i is the weight associated with the predictands at
time step i, X_t and X_i, respectively, are multivariate vectors (also called feature
vectors) of predictor variables at times t and i, ψ(.) is a measure of the proximity of X_t to X_i,
which differs depending on the nonparametric method used, and f(R_t | X_t) is the
conditional PDF based on which R_t is simulated. Note that if R_t is expressed as X_{t+1},
then the above formulation becomes analogous to a classical multivariate time series
model [Salas, 1993]. In the notation used above, one can think of R_t as a vector of daily
rainfall amounts or occurrences at multiple locations in a region that are to be simulated
conditional to predictors X_t that represent relevant atmospheric indicators.
This general formulation is specific to a stochastic downscaling model [Mehrotra et al.,
2004] which forms one of the three examples presented later in this study. Note also that
in the notation used above, vectors or matrices have been represented as “bold”, a
notation that will be followed elsewhere.
The nearest neighbour based resampling methods are founded on pattern recognition
theory that dates back to the early fifties, and use the classic bootstrap described by
Yakowitz [1985] and Efron and Tibshirani [1993]. An important issue in nearest-
neighbour resampling is the choice of a function ψ(.) which identifies the proximity of k
nearest neighbours of a particular state. This study uses the k-nearest neighbour
bootstrap formulation proposed by Lall and Sharma [1996] which specifies the proximity
ψ(.) in (1b) as:
ψ(X_t − X_i) = 1/k   if k ≤ K        (2)
             = 0     if k > K

where k denotes the number of observations between X_t and X_i (inclusive of X_t) in the
historical record and K is a specified maximum value of k. The ‘proximity’ or ‘nearness’
is defined through a distance formulation, the Euclidean distance ξ_{t,i} between X_t and
X_i being:
ξ_{t,i} = [ Σ_{j=1}^{m} { s_j (X_{j,t} − X_{j,i}) }² ]^(1/2)        (3)
where the vector X_i consists of m predictor variables X_{j,i}, j = 1…m, and s_j is the
scaling weight for the jth predictor. It is also possible to use other distance measures (for
example, see Yates et al. [2003]; Souza Filho and Lall [2003] and Wójcik and Buishand
[2003]). For other nonparametric methods the exact formulation of (1a, b) varies (see for
example, Sharma, [2000] for a formulation using kernel methods).
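To make the resampling logic concrete, the following is a minimal sketch (in Python with NumPy) of one conditional resampling step based on (1)-(3). The function name is illustrative, and the reciprocal of the sample standard deviation used for the scaling weights is simply the common choice discussed below, not a prescription of the original formulation.

```python
import numpy as np

def knn_resample(X_hist, R_hist, x_t, K, rng=None):
    """Resample one predictand conditional on x_t using the K nearest neighbours.

    X_hist : (n, m) historical predictor vectors; R_hist : (n, ...) matching predictands.
    """
    rng = np.random.default_rng() if rng is None else rng
    s = 1.0 / X_hist.std(axis=0)                     # scaling weights s_j, eq. (3)
    dist = np.sqrt((((X_hist - x_t) * s) ** 2).sum(axis=1))
    nearest = np.argsort(dist)[:K]                   # indices of the K nearest neighbours
    w = 1.0 / np.arange(1, K + 1)                    # decreasing kernel 1/k, eq. (2)
    w /= w.sum()                                     # normalization as in eq. (1b)
    return R_hist[rng.choice(nearest, p=w)]          # resample with probability w_k
```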
Predictor variables used in the calculation of Euclidean distances are made
dimensionless through standardization using scaling weights. Scaling of variables
ensures that each predictor has the same weighting in selection of nearest neighbours.
Scaling is also necessary when the feature vector consists of a combination of discrete
and continuous variables [Sharma and Lall, 1999; Buishand and Brandsma, 2001]. The
reciprocal of the sample standard deviation of each predictor variable is a common
choice for the scaling weights [Lall and Sharma, 1996; Rajagopalan and Lall, 1999;
Sharma and Lall, 1999; Harrold et al., 2003; Beersma and Buishand, 2003]. However,
this has been noted to result in an improper representation of the conditional density,
and application-specific choices have therefore been used by Brandsma and
Buishand [1998]; Buishand and Brandsma [2001] and Beersma and Buishand [2003].
For similar reasons, Yates et al. [2003] and Wójcik and Buishand [2003] used a modified
representation of the distance measure in (3), adopting the Mahalanobis distance, based
on the inverse of the predictor covariance matrix, as an alternative to the Euclidean
distance. Souza Filho and Lall [2003] used a modified Euclidean distance formulation
based on coefficients from a fitted linear regression model between the predictors and
predictands. These formulations are discussed in detail later in the paper.
The Euclidean distance formulation in (3) considers all predictors to be equally important
in the estimation of the conditional probability. In reality, however, each predictor may
have unequal importance and some of the predictors may not be significant at all. These
less important or otherwise redundant predictor variables tend to distort the
conditional probability, and hence the resampled series offers a biased representation of
the true population it aims to simulate.
Consider the example in Figure 1 wherein a sinusoidal series of the form z = sin(u) + ε
is plotted. Thus, the z series is a function of only one predictor variable u, a uniformly
distributed random number varying between 0 and 5π, ε being a Gaussian error term.
However, to demonstrate the effect of inclusion of redundant predictor variables, we
consider that the feature vector consists of three additional predictor variables. These
redundant predictors are all uniformly distributed random variables independent of u, z
and each other. Giving all four predictors equal importance or using other distance
measures as specified in Yates et al. [2003]; Souza Filho and Lall [2003] and Wójcik and
Buishand [2003] leads to a biased estimate of the conditional mean from the KNN
conditional probability density. This bias is much stronger at the two boundaries of the
data, and at the points which represent a large change in slope (peaks and troughs). An
alternative formulation, denoted KNN-W, that considers an optimal scaling weight for each
predictor does not suffer from the same deficiency, and is the focus of the present
study.
The paper is organized as follows. A brief description of different KNN alternatives, our
proposed model (KNN-W) and the procedure used to estimate the optimal scaling
weights are presented first. In the next section, we demonstrate the importance of
inclusion of optimal scaling weights by applying the different KNN formulations to
selected linear and non-linear synthetic examples, and extend the comparison of KNN-
W and traditional KNN to downscale synoptic atmospheric patterns to rainfall at multiple
point locations in a region near Sydney, Australia. The last section presents the
summary and conclusions from this research.
2.0 k nearest neighbour resampling
2.1 Background
As described in the previous section, the k nearest neighbour approach estimates the
conditional probability on the basis of k nearest neighbours of the conditioning vector X_t.
In essence, the k patterns in the historical record that are most similar to the conditioning
vector are identified, and the k sets of corresponding predictands are specified as the
most likely values the system may have at the time step t. The common way to define
similarity uses the Euclidean distance formulation in (3). To ensure that each predictor of
the conditioning vector has the same weighting in selection of nearest neighbours,
standardization of predictor variables is carried out. The reciprocal of the sample
standard deviation of each predictor variable is a common choice for the standardization.
Recognizing the fact that correlations among the predictors may influence the selection
of nearest neighbours, Yates et al. [2003] and Wójcik and Buishand [2003] used a
modified representation of the distance measure in (3), adopting the Mahalanobis
distance as an alternative to the Euclidean distance, with the standardization based
on the inverse of the predictor covariance matrix. The Mahalanobis distance
measure is defined by:
ξ_{t,i} = (X_t − X_i)^T C⁻¹ (X_t − X_i)        (4)
where T represents the transpose operation, C⁻¹ is the inverse of the predictor (X)
covariance matrix, and X_t and X_i have their usual meaning. It may be noted that this
measure of proximity is commonly adopted in multivariate kernel density and regression
function estimation (see, for example, Sharma [2000]). Also note that the selection of the
nearest neighbour in Yates et al. [2003] is based on a concept that is different from the
probability weight metric specified in (2). Their procedure assigns a substantially greater
probability of selecting the nearest neighbour for the same value of k as compared to the
logic in equations (1) and (2).
The distances provide the measure of similarity of the current predictor X_t to the
observed set X_i. To include the relative importance of each predictor in selecting the
nearest neighbours, Souza Filho and Lall [2003] considered the regression coefficients of
the standardized values of the predictors and predictand to develop weights for each predictor
and formed a weighted Euclidean distance measure that is expressed as:
ξ_{t,i} = [ Σ_{j=1}^{m} { r_j s_j (X_{j,t} − X_{j,i}) }² ]^(1/2)        (5)
where r_j is the regression coefficient between the jth predictor and predictand, and X_t,
X_i, s_j and m have their usual meaning. This distance measure takes into account the
linear dependence between predictor and predictand.
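Purely as an illustration, the two alternative distance measures in (4) and (5) could be computed along the following lines; the helper names and the least-squares estimation of the regression coefficients are our assumptions, not the cited authors' implementations.

```python
import numpy as np

def mahalanobis_distance(X_hist, x_t):
    """Equation (4): quadratic form based on the inverse predictor covariance matrix.
    The neighbour ranking is unaffected by whether a square root is taken."""
    C_inv = np.linalg.inv(np.cov(X_hist, rowvar=False))
    diff = X_hist - x_t
    return np.einsum('ij,jk,ik->i', diff, C_inv, diff)

def regression_weighted_distance(X_hist, z_hist, x_t):
    """Equation (5): Euclidean distance with each term weighted by a regression coefficient."""
    s = 1.0 / X_hist.std(axis=0)                        # scaling weights s_j
    Xs = (X_hist - X_hist.mean(axis=0)) * s             # standardized predictors
    zs = (z_hist - z_hist.mean()) / z_hist.std()        # standardized predictand
    A = np.column_stack([Xs, np.ones(len(Xs))])         # design matrix with intercept
    r = np.linalg.lstsq(A, zs, rcond=None)[0][:-1]      # regression coefficients r_j
    return np.sqrt(((r * s * (X_hist - x_t)) ** 2).sum(axis=1))
```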
2.2 Modified KNN (KNN-W) model
In the modification proposed here, the Euclidean distance in (3) is modified to include an
‘influence’ weight for each predictor as follows:
ξ_{t,i} = [ Σ_{j=1}^{m} β_j { s_j (X_{j,t} − X_{j,i}) }² ]^(1/2)        (6)
where s_j is the scaling weight as defined earlier, and β_j is the influence weight
associated with the jth predictor. The introduction of the influence weights β
(β = [β_j], j = 1,…,m) aims to define the information content of each predictor in the
estimation of the conditional probability density f(R_t | X_t). In the simplest form, all
influence weights β carry equal values and (6) reduces to (3), the traditional KNN
model. If the relationship between predictor and predictand is linear, the influence
weight β_j closely follows the scaled value r_j² / Σ_{i=1}^{m} r_i² of the regression
coefficient r_j as specified in (5).
2.3 Estimation of the KNN-W influence weights
The scaling weight vector s (s = [s_j], j = 1,…,m), the number of nearest neighbours k and the
influence weight vector β are the key to accurate estimation of the conditional probability
density in the KNN-W approach. The scaling weight vector s is estimated as the
reciprocal of the sample standard deviation associated with each variable in the
predictor set. The influence weights β aim to define the relative influence each predictor
variable has on the conditional PDF and are ascertained through an optimization
procedure using leave-one-out cross-validation, as described next.
Cross validation (CV) is a commonly used basis for measuring the predictive error
associated with a model. Cross validation involves developing the model using one part
of the sample and assessing its accuracy by applying it on the other, the process being
repeated to ensure that all observations are used in estimating the prediction error.
Leave-one-out cross-validation (L1CV) is a special case of CV where the model is
formulated at each observed data point using all observations except the observation
under consideration, and applying the model to predict the observation that has been left
out. This procedure is then repeated for all observations and the difference of observed
and predicted value at each data point provides the measure of predictive error. For the
current application, an appropriate set of values for the influence vector β is ascertained
by minimizing the predictive error associated with the model as assessed using L1CV. In
the formulation described next, this measure of the predictive error is used to specify the
log-likelihood associated with β, the optimal β being selected as the vector that
results in the maximum log-likelihood score.
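A minimal sketch of this L1CV scoring for a candidate influence-weight vector is given below, with the conditional mean formed from the kernel in (2) and the weighted distance in (6); the function name and the simple looping strategy are illustrative only.

```python
import numpy as np

def l1cv_sse(X, z, beta, K):
    """Sum of squared leave-one-out prediction errors for the KNN-W conditional mean."""
    n = len(z)
    s = 1.0 / X.std(axis=0)                            # scaling weights s_j
    idx = np.arange(n)
    sse = 0.0
    for t in range(n):
        mask = idx != t                                # leave observation t out
        d = np.sqrt((beta * (s * (X[mask] - X[t])) ** 2).sum(axis=1))   # eq. (6)
        nbrs = np.argsort(d)[:K]                       # K nearest neighbours
        w = 1.0 / np.arange(1, K + 1)                  # decreasing kernel, eq. (2)
        w /= w.sum()
        z_hat = (w * z[mask][nbrs]).sum()              # expected value of the conditional PDF
        sse += (z[t] - z_hat) ** 2
    return sse
```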
It may also be noted here that as the estimation of the best set of influence weights is
exclusively based on the minimization of the predictive error computed using L1CV, the
resulting set of predictands may not provide the optimal predictions when assessed
based on statistics at a different scale of aggregation (see for example the assessment
of a daily rainfall generation model using the variability in rainfall at an annual time scale
in Harrold et al. [2003]). If the aim were to formulate influence weights that were optimal
with respect to statistics estimated at multiple scales, the likelihood function would need
to be suitably modified.
In addition to the influence weight vector β, the number of nearest neighbours k is also
an unknown that is often difficult to specify in practice. The optimal value of k depends
on the number of observations from which nearest neighbours are selected, variability
present in the observed records and number of dependent variables. Also, the optimal k
value differs when used in the context of time series simulation as compared to when
used for prediction [see for example Buishand and Brandsma, 2001]. In general, the
optimal k will be smaller when the purpose is time series simulation than when the
purpose is prediction. However, a small number of nearest neighbours carries the
danger of duplicating a large part of the historic record in the simulated series. For a
sufficient number of sampled observations (≈100) and a small number of predictors (<6),
Lall and Sharma [1996] suggested that, as a basic guideline, k can be chosen as the
square root of the length of the observations. For time series simulation, Buishand and
Brandsma [2001] found better results with small k (≤5). The approach we follow for
estimating the influence weights β can also be used for estimating the optimal value of k
for a particular application, number of predictor variables and length of observations,
albeit with certain modifications to the likelihood formulation. Likelihood measures
(such as the ones presented in the next section) tend to optimize the parameter values
on the basis of the expected value of the predictands at a given time step and may therefore
settle on a smaller k value. However, as the aim of this study is to illustrate the
importance of considering the influence weights in a multi-predictor formulation of the k-
nearest neighbour resampling, we opt not to include the number of nearest neighbours
in the optimization procedure described in the next section.
2.4 Adaptive Metropolis Algorithm
In the present study we have used the recently proposed adaptive Metropolis (AM)
sampling approach [Haario et al., 2001; Marshall et al., 2004] to estimate the optimal
values of the influence weights β. The optimization procedure for the synthetic examples is
based on the maximization of the log likelihood for continuous data:
l(R | θ) = −(N/2) ln(2πσ²) − (1/(2σ²)) Σ_{t=1}^{N} [R_t − g(X_t; θ)]²        (7)
where l(R | θ) is the log likelihood, R_t is the observed value, g(X_t; θ) is the predicted
value at time step t, X_t is the set of predictors at time t, σ² is the predictive error
variance, N is the total number of observations, and θ is the set of model parameters. For
the present application, we include σ², the predictive error variance, and β, the influence
weights, as the parameters θ to be optimised. Note that g(X_t; θ) is estimated using
leave-one-out cross-validation as described in the previous section, as the expected
value of the conditional PDF specified in (2) and (6).
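Given the L1CV residuals R_t − g(X_t; θ), the log likelihood in (7) reduces to the Gaussian expression sketched below; this is a hedged illustration with names of our choosing.

```python
import numpy as np

def log_likelihood_continuous(residuals, sigma2):
    """Equation (7): Gaussian log likelihood of the L1CV residuals R_t - g(X_t; theta)."""
    N = len(residuals)
    return -0.5 * N * np.log(2.0 * np.pi * sigma2) - (residuals ** 2).sum() / (2.0 * sigma2)
```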
A different likelihood function is used for the downscaling application presented later in
this section. As here the response is a discrete variable (rainfall occurrence, taking
values of 0 or 1 denoting dry or wet states respectively), the likelihood measure of (7) is
modified as follows:
lf(R | θ) = Σ_{t=1}^{N} Σ_{i=1}^{n} log[ 1 − | Rf_{i,t} − (Σ_{j=1}^{k} a_j g_{i,j}(X_t; θ)) / (Σ_{j=1}^{k} a_j) | + b ]        (8)
where lf(R | θ) is the log likelihood, Rf_{i,t} is the observed rainfall occurrence at the ith
station at time t, g_{i,j}(X_t; θ) is the predicted rainfall for the jth nearest neighbour for the ith
station at time t, and X_t is the set of predictor variables at time t. Further, n defines the
number of stations the atmospheric predictors are being downscaled to, N is the total
number of observations, θ is the set of model parameters, a_j is the decreasing kernel for
the jth nearest neighbour as defined by (2), and k is the number of nearest neighbours.
The additional term b is a small value to ensure that, on occasions when the
observed and predicted rainfall occurrences disagree, the log-likelihood remains feasible
and is duly penalized.
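Assuming the reconstructed form of (8) above, the occurrence likelihood can be sketched as follows, where pred holds the kernel-weighted wet-day fraction from the k nearest neighbours at each station and day; the value of the small constant b is our assumption.

```python
import numpy as np

def log_likelihood_occurrence(obs, pred, b=1e-3):
    """Equation (8), as reconstructed: obs and pred are (N, n) arrays of observed (0/1)
    occurrences and predicted wet-day fractions; b keeps the log finite when they disagree."""
    return np.log(1.0 - np.abs(obs - pred) + b).sum()
```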
The above likelihood formulations were used to ascertain the optimal parameter vector θ
using the adaptive Metropolis (AM) algorithm [Haario et al., 2001; Marshall et al., 2004].
This algorithm is characterised by a proposal distribution based on the estimated
posterior covariance matrix of the parameters. At step t, Haario et al. [2001] consider a
multivariate normal proposal N(θ_t, C_t) with mean given by the current value θ_t, where C_t
is the proposal covariance. The covariance C_t has a fixed value (C_0) for the first few
iterations and is updated after t_0 iterations as:
C_t = C_0                                          for t ≤ t_0
C_t = η · Cov(θ_0, …, θ_{t−1}) + η · ε · I_d        for t > t_0        (9)
where d is the total number of parameters included in the optimization algorithm, I_d is a d×d
identity matrix, ε is a parameter chosen to ensure C_t does not become singular, and η
is a scaling parameter to ensure reasonable acceptance rates of the proposed states.
The steps involved in our implementation of the AM algorithm are:
a) Initialize t = 0 and set C_0 as a diagonal matrix with each diagonal term
representing the variance associated with the prior distribution for each unknown.
b) Update C_t for the current iteration number t using (9).
c) Generate a proposed value θ* for θ, where θ* ~ N(θ_t, C_t). If the proposed parameter values
are outside the parameter constraints, repeat this step. The constraints for
the β parameters are defined as:

Σ_{j=1}^{m} β_j = 1.0;  and  0.0 < β_j < 1.0        (10)

These conditions ensure that the relative magnitude of each parameter is
retained and that each remains bounded.
d) Compute the log likelihood l(R | θ*) using (7) or lf(R | θ*) using (8), following the
L1CV procedure.
e) Calculate the acceptance probability, α, of the proposed parameter values using:
α = min{ 1, exp[ l(R | θ*) + p(θ*) − l(R | θ_t) − p(θ_t) ] }        (11)
where p(θ) is the log prior distribution of θ, the priors being specified as a
Uniform PDF over [0,1] for each element of the influence weight vector β, and a
scaled inverse-chi-square distribution χ⁻²(ν,λ) for the predictive error variance σ²
(for the likelihood expressed in (7)), the parameters ν and λ being specified
depending on the nature of the error associated with the true model. In the
synthetic examples presented in the next section, ν and λ were specified as 0.5
and 8 respectively.
f) Generate u ~ U[0,1]. If u < α, accept θ_{t+1} = θ*; otherwise set θ_{t+1} = θ_t.
g) Repeat b-f a sufficient number of times to ensure that the posterior distribution of
the parameter vector θ has been sampled exhaustively.
The scaling parameter η and the stopping criterion for our algorithm are based on the
recommendations of Marshall et al. [2004] and Haario et al. [2001]. A reasonable
approach is to start the optimization with the initial set of β parameters assigned
equal values satisfying (10), and to take the predictive error variance as the variance of
the predictive errors estimated using these parameter values. In the results reported next,
the optimal θ has been chosen as the one that results in the maximum log-likelihood in
(7 or 8) from the assigned number of iterations. Standard convergence tests [Marshall et
al., 2004] were used to assess the adequacy of the number of iterations (15000)
assigned for each application, details on which are presented next.
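For illustration, a compact sketch of the AM search described in steps (a) to (g) is given below. It is a simplified version: log_post would combine the L1CV log likelihood from (7) or (8) with the log priors above, the proposal scaling η defaults to the common 2.4²/d choice, and infeasible proposals are handled by letting log_post return −inf (so they are rejected) rather than by redrawing as in step (c).

```python
import numpy as np

def adaptive_metropolis(log_post, theta0, n_iter=15000, t0=500, eps=1e-6, eta=None, seed=0):
    """log_post(theta) -> log likelihood + log prior (−inf if theta violates the constraints);
    theta0 must be a feasible starting parameter vector."""
    rng = np.random.default_rng(seed)
    d = len(theta0)
    eta = (2.4 ** 2) / d if eta is None else eta          # common scaling choice (assumed)
    chain = [np.asarray(theta0, dtype=float)]
    lp = log_post(chain[0])
    C = np.eye(d) * 0.01                                  # initial proposal covariance C_0
    for t in range(1, n_iter):
        if t > t0:                                        # adapt the covariance, eq. (9)
            C = eta * np.cov(np.asarray(chain), rowvar=False) + eta * eps * np.eye(d)
        theta_star = rng.multivariate_normal(chain[-1], C)
        lp_star = log_post(theta_star)
        if np.log(rng.uniform()) < lp_star - lp:          # accept/reject, eq. (11)
            chain.append(theta_star)
            lp = lp_star
        else:
            chain.append(chain[-1].copy())
    return np.asarray(chain)
```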
3.0 Applications
This section illustrates the impact of using optimal influence weights for k-nearest
neighbour resampling through three carefully designed experiments and four KNN
resampling alternatives. The first two experiments use data from known synthetic
models and are designed to test the following three hypotheses: (a) the influence weight
should assume a value of zero when a predictor variable is redundant; (b) the influence
weight should be dictated by the information content each predictor has about the
predictand(s); and, (c) inclusion of influence weights should result in an improvement
over the case where they are set equal to one (the traditional KNN approach). The last
experiment presents an application to downscale rainfall using a system of atmospheric
circulation predictors, the aim being to illustrate the improvements over the case where
the predictors have equal influence weights.
The resampling alternatives considered in the synthetic examples differ in the way the
scaling of variables is accomplished. The different alternatives for scaling of variables
include: i) reciprocal of the standard deviation (KNN), defined using (3); ii) inverse of the
predictor covariance (KNN-M1), defined using (4); iii) regression coefficient vector of
predictor and predictand and reciprocal of the standard deviation (KNN-M2), defined
using (5) and iv) influence weights and reciprocal of the standard deviation (KNN-W),
defined using (6). While the second alternative (KNN-M1) is based on the Mahalanobis
distance measure, other alternatives consider the Euclidean distance measure as the
choice of the function in identifying the proximity of k nearest neighbours. For KNN-M2,
regression coefficients relating the predictors and predictand were estimated by carrying
out a multiple linear regression analysis. For the synthetic cases, the number of nearest
neighbours k was specified equal to the square root of the number of observations.
3.1 Synthetic case study using a linear model
A sample of 200 observations was generated using the following linear model:
z_i = x_{1,i} + x_{2,i} + ε_i        (12)
where i varies from 1 to 200, x1 and x2 are generated from a standard normal PDF
N(0,1) and ε is a noise term also from a normal distribution N(0, 0.5). Thus, predictand z
is a known function of x1 and x2 with both contributing equally. To evaluate the capability
of the method at identifying redundant predictors, two additional variables, x3 and x4,
both independent and N(0,1) were also included as predictors in the KNN resampling
algorithm. Given this configuration, an optimal outcome from the KNN-W should result in
nearly equal influence weights (β1 and β2) to both x1 and x2 and negligible weights
(β3 and β4) to the x3 and x4 variables. On the other hand, as KNN considers all variables to
be equally important and assigns equal weights to all, one would expect the predictive
error associated with the KNN to be larger than that for the KNN-W. As the relationship
among predictors and predictand is linear, KNN-M2 is expected to perform as well as
KNN-W. Also, as all predictors are independent of each other, performance of KNN-M1
should be similar to KNN.
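The linear experiment can be reproduced in outline as follows; the random seed, the interpretation of N(0, 0.5) as a noise standard deviation of 0.5, and the candidate weight vectors are our assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)                      # assumed seed
n = 200
x = rng.standard_normal((n, 4))                     # x1, x2 informative; x3, x4 redundant
z = x[:, 0] + x[:, 1] + rng.normal(0.0, 0.5, n)     # eq. (12); 0.5 taken as the noise std

beta_knn = np.full(4, 0.25)                         # equal weights: traditional KNN
beta_knnw = np.array([0.5, 0.5, 0.0, 0.0])          # weights concentrated on x1 and x2
# l1cv_sse(x, z, beta, K=int(np.sqrt(n))) from the earlier sketch can now be used to
# compare the predictive error of the two weight choices.
```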
Figure 2 presents the box-plots representing the posterior distribution of influence
weights obtained from the optimisation runs for the KNN-W. The optimized values of
these parameters are also presented in Table 1. As can be seen from these results, the
KNN-W is able to ascertain the inherent structure of the model by identifying variables x3
and x4 as redundant predictors with negligible influence weights, and variables x1 and x2
as significant predictors with almost equal influence weights, thus establishing the first
two of the three hypotheses we seek to test. This table also includes the influence
weights for the KNN and KNN-M2 formulations. For KNN-M2 these are equivalent to
r_j² / Σ_{i=1}^{m} r_i², where r_j is the regression coefficient for the jth predictor as specified in (5),
and these values are similar to the KNN-W influence weights. The specification of influence
weights for KNN-M1 model is not straightforward. Table 1 also presents the sum of
squares of residuals (SSR) for all KNN formulations ascertained based on the ability of
these models to correctly estimate the conditional mean using leave-one-out cross-
validation. The improved performance of the KNN-W as compared to other KNN
formulations establishes the last of the three hypotheses that were tested. As expected,
KNN-M2 provides SSR similar to KNN-W while KNN-M1 results are similar to KNN.
3.2 Synthetic case study using a nonlinear model
The second case study is based on a sample of 300 observations from a nonlinear
model of the form:
z_i = sin(u_i) + ε_i        (13)
where i varies from 1 to 300, u is a series of uniformly distributed random numbers over
the interval [0, 5π], ε represents noise from a normal distribution N(0, 0.20), and z
denotes the response from the model. Three redundant predictors, each from a Uniform
distribution U[0,1], were also included to evaluate the impact they have on the resulting
model estimates. As with the previous example, we expect KNN-W to assign a unit
influence weight (β1) to the variable u and negligible influence weights (β2, β3 and β4)
to the three redundant predictors. As now the relationship among predictors and
predictand is non-linear, we expect the performance of KNN, KNN-M1 and KNN-M2 to
be inferior as compared to the KNN-W.
Box-plots of the posterior distribution for each influence weight are illustrated in Figure
3. The optimal values of influence weights and corresponding error statistics are
presented in Table 2. This table also includes the influence weights for KNN and KNN-
M2 formulations. As can be seen from these results, the KNN-W is able to identify the
three additional variables as redundant predictors with negligible influence weights, and
the u variable as the sole significant predictor with a weight nearing unity. A plot of the
true series and the KNN and KNN-W estimates was presented in Figure 1 earlier in the
paper. As the results of KNN-M1 and KNN-M2 are very similar to those of KNN, these were not
included in the plot. As can be inferred from Table 2 and Figure 1, the KNN-W results
offer a significant improvement over other KNN formulations. The KNN, KNN-M1 and
KNN-M2 are not able to identify the non-linear nature of the relationship among
predictors and predictand and the associated SSR values are much higher than KNN-W.
We have demonstrated that the inclusion of influence weights in defining Euclidean
distances, and thereby in estimating the conditional PDF, results in an improvement in the
resampling procedure irrespective of whether the relationship between predictors and
predictand is linear or non-linear. Similar improvements can be expected in an application where the true form
of the underlying model is not known. This is demonstrated next by downscaling rainfall
at a network of raingauges near Sydney, Australia.
3.3 Downscaling of rainfall near Sydney, Australia
The downscaling application uses 43 years (1960-2002) of daily rainfall at 30 stations
located around Sydney, New South Wales, Australia for June (winter in the southern
hemisphere) and is approached using the KNN and KNN-W formulations. In an earlier
study, Mehrotra et al. [2004] identified the mean sea level pressure (MSLP), geopotential
heights (GPH) at 700 hPa and their gradients as plausible atmospheric circulation
indicators of the rainfall patterns in this region. Further consultations with climatologists
and hydrologists indicated that total precipitable water content may also have a strong
influence in driving the rainfall mechanism over the region. Consequently, a simple
correlation analysis was conducted and a total of 6 atmospheric circulation variables
were identified as predictors for the downscaling example. These consisted of the total
precipitable water content over the study region, east-west gradient of GPH at 700 hPa
and north-south gradient of MSLP for the current as well as the previous day. The predictands in
this application are daily rainfall occurrences at the 30 stations. A day was considered as
a wet day or dry day depending on whether the rainfall amount was greater or less than
0.3 mm respectively [after Harrold et al., 2003; Buishand, 1978]. Keeping in view that
this application aims to predict rainfall values (in contrast to generating a time series
of likely values), and considering the number of predictors, the total number of observations, the large
number of predictands and the suggestions of Buishand and Brandsma [2001], the
number of nearest neighbours k was kept small and specified as 5 for this
application.
The influence weights for this example were estimated using the optimization procedure
and the likelihood measure in (8). Once the influence weights were identified, we
evaluated the performances of KNN and KNN-W by predicting daily rainfall amounts in a
leave-one-out cross validation setting. A comparison of the observed and predicted
rainfall amount served as an additional validation measure for the procedure as the
model was developed with reference only to the rainfall occurrence (wet or dry) at each
station. The leave-one-out cross-validation was performed by downscaling for day t
based on the conditional PDF in (2) and (3) with neighbours that did not include the data for
day t. Results from both models were ascertained by resampling 100 realisations of
rainfall amounts for each day of the observed record, based on which various
performance measures were computed. The comparison of results was based on the
reproduction of selected spatial and temporal characteristics of daily and aggregated
monthly rainfall totals including those of importance to water resource management.
The graphical comparison was performed on the basis of the following statistics: (a) a
performance measure indicating the success at simulating wet and dry days similar to
what was observed; (b) mean value of monthly rainfall totals and (c) maximum daily
rainfall amount. Numerical comparison of these models was based on the estimation of
the root mean square error (RMSE) for the cross correlation, wetness fraction, wet and
dry spell statistics and daily rainfall occurrence.
Table 3 presents the optimized values of influence weights for all predictors. The KNN-W
recognizes the three atmospheric circulation variables of the current day as the most
significant predictors.
Figure 4 presents the success rates of model predicted wet and dry days at each station.
The success rate is defined as the ratio of number of days when both observed and
predicted values are wet (dry) to the total number of wet (dry) days in the observed
record. In the case of a perfect match the success rate approaches unity. The accurate
reproduction of wet and dry days defines the day-to-day persistence of the daily rainfall
and also helps in representing the wet and dry spells of longer durations. As such, these
are of prime concern in catchment management studies. Both models underestimate
the success rates of wet and dry days. However, KNN-W is more successful in
reproducing these statistics at all stations.
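As a simple illustration of this definition, the wet-day (or dry-day) success rate at a station could be computed as below; obs and pred are 0/1 occurrence series for one station, and the names are ours.

```python
import numpy as np

def success_rate(obs, pred, wet=True):
    """Fraction of observed wet (or dry) days that are also predicted wet (or dry)."""
    target = 1 if wet else 0
    days = (obs == target)
    return np.logical_and(days, pred == target).sum() / days.sum()
```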
To evaluate the performance of the model on a statistic that is independent of the
procedure used to specify the model, attention was paid to the representation of the
daily rainfall amount on each day classified as wet. Note that this serves as a means to
validate the performance of the model as the influence weight estimation procedure uses
only the rainfall occurrence to estimate the model likelihood. Figure 5 presents the
scatter plots of observed and model predicted monthly mean and maximum daily rainfall
values at each station. There is a marginal improvement in the statistics estimated
based on KNN-W results as compared to those from the KNN model. Note that the
improvements are more visible for the stations having higher rainfall averages.
Table 4 provides the RMSE scores of various statistics for KNN and KNN-W application
runs. As can be inferred from this table, inclusion of influence weights offers an
encouraging improvement in the predicted results for almost all statistics considered.
The overall divergence of the predicted results from the observed ones could be due to
the large number of stations and the choice of predictor variables considered in the
present application. The atmospheric circulation variables selected as potential
predictors are not adequate for successfully defining the temporal correlation of rainfall at each
station. We feel that inclusion of a better mechanism explaining the lagged
spatial distribution of rainfall patterns could offer improved results. Defining the relative
contribution of each predictor variable through influence weights, however, helps to some extent
to overcome this deficiency, and this is the main focus of the present study.
4.0 Summary and Conclusions
We presented a variation of the traditional KNN time series resampling approach to deal
with the problem of using multiple predictors in the specification of the conditional
probability density. The utility of the proposed approach was assessed by applying it to
linear and non-linear synthetic datasets and comparing the results with the traditional
KNN formulations. The results of the study indicate that the proposed modification is
able to pick up the linear as well as non-linear dependence structure between the predictors and
predictands. The modified formulation was also applied to downscale rainfall using
multiple predictors and improvements in the results as compared to traditional KNN
approach were observed. The modified approach was found to generate more
representative rainfall occurrences and amounts when evaluated at both daily and
monthly time scales.
The modification requires the estimation of the optimal value of the influence weight for each
predictor. In the study presented here, we used the AM algorithm [Haario et al., 2001;
Marshall et al., 2004] for estimation of these weights. The AM algorithm performs a
search in the parameter space and requires a large number of iterations to converge
when many parameters are being assessed. Also, the KNN method is data intensive and
requires the distances to be sorted at each time step, and therefore demands a significant amount
of computer time. The combination of the AM algorithm and the KNN method may slow down
the computations considerably, especially with a large number of predictors and
predictands. We feel that this limitation could be better addressed through the use of
specialised sorting algorithms, such as partial sorts, adaptive sorting algorithms [Estivill-Castro
and Wood, 1992] and randomized algorithms [Motwani and Raghavan, 1995] or their
variations, which were not attempted in our study.
One of the significant advantages of the KNN framework is the simplicity with which
multiple predictor variables can be accommodated in the feature vector. The proposed
modification can be used to find the significance of the added predictor(s) as compared
to the existing predictors in defining the predictand(s). It is also possible to include the
number of nearest neighbours k, for the given dataset, in the optimization procedure
using a modified version of the model likelihood. The “curse of dimensionality” [Scott,
1992] is less felt in the KNN-W formulation due to the reduced weight associated with
redundant or irrelevant predictor variables. Consequently, formulation of higher
dimensional nonparametric models, such as those used for disaggregation [Tarboton et
al., 1998], multivariate stochastic simulation [Salas et al., 1980] or downscaling
[Mehrotra et al., 2004] is simplified through the use of the influence weight logic
proposed here.
Acknowledgements
This work was partially funded by the Australian Research Council. We wish to
acknowledge the constructive comments of two anonymous reviewers whose inputs
greatly benefited the quality of our presentation.
References
Beersma, J. J., T. A. Buishand, Multi-site simulation of daily precipitation and
temperature conditional on the atmospheric circulation, Clim. Res., 25, 121-133, 2003.
Brandsma, T., and T. A. Buishand, Simulation of extreme precipitation in the Rhine basin
by nearest neighbour resampling, Hydrol. Earth Syst. Sci., 2, 195-209, 1998.
Buishand, T. A., Some remarks on the use of daily rainfall models, J. Hydrol., 36, 295-
308, 1978.
Buishand, T. A., and T. Brandsma, Multisite simulation of daily precipitation and
temperature in the Rhine basin by nearest-neighbor resampling, Water Resour. Res.,
37, 2761-2776, 2001.
Efron, B., and R. J. Tibshirani, An introduction to the bootstrap, Chapman and Hall, New
York, 1993.
Estivill-Castro, V. and Derick Wood, A survey of adaptive sorting algorithms, ACM
Computing Surveys (CSUR), 24(4), 441-476, 1992.
Haario, H., E. Saksman and J. Tamminen, An adaptive Metropolis algorithm, Bernoulli,
7(2), 223-242, 2001.
Harrold, T. I., A. Sharma, and S. J. Sheather, A nonparametric model for stochastic
generation of daily rainfall occurrence, Water Resour. Res., 39(10), SWC 10:1-11, 2003.
Lall, U., and A. Sharma, A nearest neighbor bootstrap for time series resampling, Water
Resour. Res., 32, 679-693, 1996.
Lall, U., B. Rajagopalan, and D. G. Tarboton, A nonparametric wet/dry spell model for
resampling daily precipitation, Water Resour. Res., 32(9), 2803-2823, 1996.
Marshall L., D. Nott and A. Sharma, A comparative study of Markov chain Monte Carlo
methods for conceptual rainfall-runoff modelling, Water Resour. Res., 40, W02501,
doi:10.1029/2003WR002378, 2004.
Mehrotra, R., A. Sharma and I. Cordery, Comparison of two approaches for downscaling
synoptic atmospheric patterns to multisite precipitation occurrence, accepted for
publication in J. Geophys. Res., 2004
Motwani, R. and P. Raghavan, Randomized algorithms, Cambridge University Press,
476p, 1995.
Rajagopalan, B., and U. Lall, A k-nearest neighbour simulator for daily precipitation and
other weather variables, Water Resour. Res., 35(10), 3089-3101, 1999.
Salas, J. D., Analysis and modelling of hydrologic time series, in Handbook of
Hydrology, edited by D.R. Maidment, pp19.1-19.72, McGraw-Hill, New York, 1993.
Salas, J. D., J. W. Delleur, V. Yevjevich and W. L. Lane, Applied modeling of hydrologic
time series, Water Resource Publications, Colorado, USA, 482p, 1980
Scott, D. W., Multivariate density estimation: Theory, practice and visualization, John
Wiley, New York, 1992.
Sharma, A., and U. Lall, A nonparametric approach for daily rainfall simulation,
Mathematics and Computers in Simulation, 48, 361-371,1999.
Sharma, A., Seasonal to interannual rainfall probabilistic forecasts for improved water
supply management: Part 3 — A nonparametric probabilistic forecast model, J. Hydrol.,
239, 249–258, 2000.
Sharma, A. and R. O’Neill, A nonparametric approach for representing interannual
dependence in monthly streamflow sequences, Water Resour. Res., 38(7), 5.1- 5.10,
2002.
Souza Filho, F. and U. Lall, Seasonal to interannual ensemble streamflow forecasts for
Ceara, Brazil: Applications of a multivariate, semiparametric algorithm, Water Resour.
Res., 39(11), 1307, doi:10.1029/2002WR001373, 2003.
Tarboton, D.G., A. Sharma, and U. Lall, Disaggregation procedures for stochastic
hydrology based on nonparametric density estimation, Water Resour. Res., 34(1),
107-119, 1998.
Wójcik, R. and Buishand, T.A., Simulation of 6-hourly rainfall and temperature by two
resampling schemes, J. Hydrol., 273, 69–80, 2003.
Yakowitz, S., Nonparametric density estimation, prediction, and regression for Markov
sequences, J. Am. Stat. Assoc., 80(389), 215-221, 1985.
Yates, D., S. Gangopadhyay, B. Rajagopalan, and K. Strzepek, A technique for
generating regional climate scenarios using a nearest-neighbor algorithm. Water
Resour. Res., 39(7), SWC 7 1-15, 2003.
Young, K. C., A multivariate chain model for simulating climatic parameters from daily
data, J. Appl. Meteorol., 33, 661-671, 1994.
Figure Captions
Figure 1: True, observed and predicted data points using KNN and KNN-W formulations for the
non-linear synthetic dataset. The points on the graph indicate the series
formed, the thick continuous line indicates the true sine curve, the thin dashed line
indicates the series generated by KNN and the thin continuous line indicates the
series generated by KNN-W.
Figure 2: Box plot of influence weights of KNN-W with linear synthetic dataset. Filled
circles (●) indicate optimized values of influence weights β and associated
SSR. The whiskers extend to 1.5 times the inter-quartile range of the data.
Observations outside the whiskers are represented as “○”.
Figure 3: Box plot of influence weights ß of KNN-W with non-linear synthetic dataset.
Filled circles (●) indicate optimized values of influence weights β and
associated SSR. The whiskers extend to 1.5 times the inter-quartile range of
the data. Observations outside the whiskers are represented as “○”.
Figure 4: Success rates of number of days being (a) wet and (b) dry in predicted record
at each station for KNN and KNN-W using real dataset.
Figure 5: Scatter plots of (a) means and (b) maximum daily rainfall amount at each
station for KNN and KNN-W using real dataset.
Table 1: Optimized values of influence weights and related sum of squares of residuals
(SSR) – linear synthetic case. The influence weights for KNN-M2 are specified
as r_j² / Σ_{i=1}^{m} r_i², where r_j is the regression coefficient for the jth predictor.
Formulation    ß1       ß2       ß3       ß4       SSR
KNN-W          0.485    0.507    0.000    0.008    54.52
KNN            0.250    0.250    0.250    0.250    85.45
KNN-M1         -        -        -        -        80.95
KNN-M2         0.511    0.489    0.000    0.000    55.83
Table 2: Optimized values of influence weights and related sum of squares of residuals
(SSR) – non-linear synthetic case. The influence weights for KNN-M2 are
specified as r_j² / Σ_{i=1}^{m} r_i², where r_j is the regression coefficient for the jth
predictor.
Formulation    ß1       ß2       ß3       ß4       SSR
KNN-W          0.998    0.001    0.000    0.001    13.13
KNN            0.250    0.250    0.250    0.250    129.79
KNN-M1         -        -        -        -        130.74
KNN-M2         0.021    0.012    0.748    0.219    168.08
Table 3: Optimized values of influence weights and sum of squares of residuals (SSR) –
real dataset.
Formulation    ß1       ß2       ß3       ß4       ß5       ß6       SSR
KNN-W          0.160    0.260    0.480    0.020    0.040    0.050    9541
KNN            0.166    0.166    0.166    0.166    0.166    0.166    9990
Table 4: RMSE in statistical characteristics - real dataset.
Statistic KNN KNN-W
Number of wet spells of two days 0.976 0.994
Maximum wet spell length in days 1.923 2.009
Total number of wet spells 1.441 1.458
Number of dry spells of two days 0.816 0.826
Maximum dry spell length in days 4.603 4.516
Total number of dry spells 1.343 1.337
Number of solitary wet days 1.485 1.512
Number of wet days 3.618 3.851
Daily rainfall occurrence 0.490 0.503
Probability of a wet day following a wet day 0.071 0.077
Probability of a wet day following a dry day 0.015 0.015
Probability of a dry day following a wet day 0.015 0.015
Probability of a dry day following a dry day 0.074 0.08
Lag one correlation of mean wetness fraction 0.367 0.381
Cross correlation of rainfall amount at stations 1.152 1.091
[Figure 1 plot: z = sin(u) + ε versus u, showing the observed points, the TRUE sine curve, and the KNN and KNN-W estimates.]
Figure 1: True, observed and predicted data points using KNN and KNN-W formulations for the
non-linear synthetic dataset. The points on the graph indicate the series
formed, thick continuous line indicates true sine curve, thin dashed line
indicates the series generated by KNN and thin continuous line indicates the
series generated by KNN-W.
Figure 2: Box plot of influence weights of KNN-W with linear synthetic dataset. Filled
circles (●) indicate optimized values of influence weights β and associated
SSR. The whiskers extend to 1.5 times the inter-quartile range of the data.
Observations outside the whiskers are represented as “○”.
Figure 3: Box plot of influence weights ß of KNN-W with non-linear synthetic dataset.
Filled circles (●) indicate optimized values of influence weights β and
associated SSR. The whiskers extend to 1.5 times the inter-quartile range of
the data. Observations outside the whiskers are represented as “○”.
[Figure 4 plots: success rate versus station number (1 to 30) for KNN-W and KNN; panel (a) for wet days, panel (b) for dry days.]
Figure 4: Success rates of number of days being (a) wet and (b) dry in the predicted
record at each station for KNN and KNN-W using real dataset.
[Figure 5 plots: predicted versus observed values in mm for KNN and KNN-W; panel (a) average of monthly rainfall totals, panel (b) extreme daily rainfall.]
Figure 5: Scatter plots of (a) means, and (b) maximum daily rainfall amount at each
station for KNN and KNN-W using real dataset.