K-Nearest Neighbor Resampling Technique
(Weather Generation and Water Quality Applications)
Balaji Rajagopalan
Somkiat Apipattanavis & Erin Towler
Department of Civil, Environmental and Architectural Engineering
University of Colorado
Boulder, CO
Denver Water
February 2007
“Translation” of Climate Info
• Users most interested in sectoral outcomes (streamflows, crop yields, risk of disease X)
[Diagram: Climate Forecast/Projection → Translation → Process Models → Distribution of Outcomes]
[Diagram, Translation: Historical Data → Synthetic series → Process model → Frequency distribution of outcomes]
Why Simulation?
• Limited historical data
– Cannot capture the full range of variability
– Selecting a (single or a set of) historical years from the record with equal chance (unconditional bootstrap, Index Sequential Method)
• Need – tool to generate ‘scenarios’ that capture the historical statistical properties
• Several statistical techniques are available (e.g., time series techniques, Monte Carlo techniques, etc.)
– These are cumbersome and restrictive (in their assumptions)
• Re-sampling techniques are simple and robust
– Unconditional and conditional bootstrap and K-nearest neighbor (K-NN) bootstrap offer attractive alternatives
Re-sampling Techniques
• Drawing cards from a well-shuffled deck
– Selecting a (single or a set of) historical years from the record with equal chance
– Unconditional bootstrap, Index Sequential Method
• Drawing cards from a biased deck
– Selecting a (single or a set of) historical years with unequal chance, e.g., selecting only El Niño years
– Conditional bootstrap
• K-Nearest Neighbor Bootstrap – “pattern matching”
– Select ‘K’ nearest neighbors (e.g., years) to the current ‘feature’
– Select one of the K neighbors at random
– Repeat to produce an ensemble
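The two "deck" analogies can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the `flagged` set of years are hypothetical, and the flagged years are placeholders rather than a real El Niño classification.

```python
import random

def unconditional_bootstrap(years, n, rng):
    """Well-shuffled deck: every historical year has an equal chance."""
    return [rng.choice(years) for _ in range(n)]

def conditional_bootstrap(years, is_eligible, n, rng):
    """Biased deck: draw only from years that satisfy a condition."""
    pool = [y for y in years if is_eligible(y)]
    return [rng.choice(pool) for _ in range(n)]

rng = random.Random(42)
years = list(range(1931, 2004))          # a 73-year record
flagged = set(range(1935, 2004, 5))      # hypothetical "El Nino" years, placeholder only

equal_chance = unconditional_bootstrap(years, 10, rng)
biased = conditional_bootstrap(years, lambda y: y in flagged, 10, rng)
```

Both functions sample with replacement, so the same year can appear more than once in an ensemble.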
Examples
• Ensemble Weather Generation
– Scenario generation
– Forecast
– Argentina - Pampas Region
• Water Quality Modeling (Boulder Water Utility)
Two Step Weather Generator
• Estimate transition (wet to dry, etc.) probabilities of the order-1 Markov chain from historical data – for each month
• Generate precipitation state time series using the Markov chain

Probability of Dry and Wet Days
Dry day      Wet day
0.60 (pd)    0.40 (pw)

Transition Prob (pij)
           Dry day       Wet day
Dry day    0.70 (pdd)    0.30 (pdw)
Wet day    0.80 (pwd)    0.20 (pww)

Generated Precipitation State time series
1 0 0 1 1 0 0 0 1 0 0 - - - - -
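A minimal sketch of the first step, assuming states are coded 0 = dry and 1 = wet. The function names are hypothetical, the example 0/1 series from the slide is reused as if it were a historical record, and the sketch handles a single series rather than one chain per month.

```python
import random

def estimate_transitions(states):
    """Return p[i][j] = P(next state = j | current state = i) from a 0/1 series."""
    counts = [[0, 0], [0, 0]]
    for today, tomorrow in zip(states, states[1:]):
        counts[today][tomorrow] += 1
    probs = []
    for row in counts:
        total = sum(row)
        # Assumption for this sketch: fall back to 50/50 if a state never occurs.
        probs.append([c / total for c in row] if total else [0.5, 0.5])
    return probs

def generate_states(p, n_days, first_state, rng):
    """Generate a dry (0) / wet (1) sequence from a first-order Markov chain."""
    states = [first_state]
    for _ in range(n_days - 1):
        p_wet = p[states[-1]][1]
        states.append(1 if rng.random() < p_wet else 0)
    return states

record = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0]   # example series, treated as observed
p = estimate_transitions(record)
simulated = generate_states(p, n_days=31, first_state=0, rng=random.Random(1))
```

In practice the transition matrix would be estimated separately for each calendar month, as the bullets state.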
• Suppose we need to simulate weather for January 5th, and January 4th is a wet day
• Get neighbors from a 7-day window (7*50) centered on January 4th
• Screen days using the precipitation state [(1,0), days in blue], i.e., “potential neighbors”
• Calculate the distances between the weather variables of the current day's feature vector and those of the potential neighbors
• Select the K-nearest neighbors
• Assign them weights
[Table: historical daily record by year (rows) and day of January and February (columns), with the potential neighbors highlighted]
• Pick a day from the k-NN using the weight function – say, Jan 1st 1953
• The simulated weather for Jan 5th is that of Jan 2nd 1953
• Repeat
Weight function and number of neighbors:
$$K(j) = \frac{1/j}{\sum_{i=1}^{k} 1/i}, \qquad k = \sqrt{n}$$
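The neighbor selection step can be sketched as follows. This is an illustration under stated assumptions: distances are plain Euclidean on unscaled variables (the slide only says "distances"), k is the rounded square root of the number of potential neighbors, and the candidate feature vectors and labels are hypothetical.

```python
import math
import random

def knn_weights(k):
    """Weights (1/j) / sum_{i=1..k}(1/i): the nearest neighbor gets the most weight."""
    total = sum(1.0 / i for i in range(1, k + 1))
    return [(1.0 / j) / total for j in range(1, k + 1)]

def pick_successor(current, candidates, rng):
    """candidates: list of (feature_vector, successor_label) for the potential neighbors.
    Returns the successor of one neighbor drawn from the k nearest."""
    k = max(1, round(math.sqrt(len(candidates))))
    ranked = sorted(candidates, key=lambda c: math.dist(current, c[0]))
    nearest = ranked[:k]
    chosen = rng.choices(nearest, weights=knn_weights(k), k=1)[0]
    return chosen[1]

# Hypothetical feature vectors (precipitation, max temperature, min temperature).
current = (2.0, 29.0, 18.0)
candidates = [((2.0 + i, 29.0 + 0.5 * i, 18.0), f"successor_{i}") for i in range(9)]
picked = pick_successor(current, candidates, random.Random(7))
```

With nine potential neighbors, k is 3, so `picked` is the successor of one of the three closest days. The simulated weather is the day after the selected neighbor, which is why each candidate carries its successor.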
Single Site Simulation
• Pergamino, Argentina
– Daily weather variables 1931-2003: precipitation, max. temperature, min. temperature
• 100 simulations of 73 year length (as length of record)
• Statistics of simulated and historical data are compared
Spell Properties
Pergamino, Argentina
wet and dry spell statistics
Moments (wet month - Jan)
Moments (dry month - July)
Conditional K-NN Re-sampling
• Conditioned on IRI seasonal forecast
• Get the forecast probabilities (A:N:B = 40:35:25, i.e., above:normal:below)
• Divide historical (seasonal) totals into 3 tercile categories
• Bootstrap 40, 35 and 25 samples of historical years from the wet, normal and dry categories, respectively
• Apply the two-step weather generator on this sample.
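The tercile-conditioned sampling of years could look like the sketch below. It assumes A, N and B denote the above-normal (wet), near-normal and below-normal (dry) terciles. The seasonal totals are synthetic placeholders and the function names are hypothetical.

```python
import random

def tercile_categories(seasonal_totals):
    """Split years into below / normal / above terciles of the seasonal total.
    seasonal_totals: dict mapping year -> seasonal total."""
    ranked = sorted(seasonal_totals, key=seasonal_totals.get)
    n = len(ranked)
    return {
        "below": ranked[: n // 3],
        "normal": ranked[n // 3 : 2 * n // 3],
        "above": ranked[2 * n // 3 :],
    }

def conditional_sample(categories, forecast, rng):
    """forecast: dict category -> number of years to draw (e.g. 40, 35, 25)."""
    sample = []
    for name, count in forecast.items():
        sample += rng.choices(categories[name], k=count)
    return sample

# Synthetic, distinct seasonal totals for a 73-year record (placeholder values).
totals = {1931 + i: (37 * i) % 73 for i in range(73)}
categories = tercile_categories(totals)
forecast = {"above": 40, "normal": 35, "below": 25}
sampled_years = conditional_sample(categories, forecast, random.Random(3))
```

The 100 sampled years then form the pool on which the two-step weather generator is run.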
Conditional Weather Generation (results)
Multi-site extension
• Same procedure as single site is used, but
– Calculate the average time series – “single site virtual weather data”
– Apply the two-step generator
– Select the weather at all the locations on the picked day – to obtain multi-site simulation
• Stations in Pampas region, Argentina
– Pergamino
– Junin
– Nueve de Julio
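A small sketch of the multi-site idea, assuming a simple arithmetic mean across stations for the "virtual" single site. The precipitation values are hypothetical and the function names are illustrative.

```python
def area_average(station_series):
    """Average the daily series across stations to get one 'virtual' site.
    station_series: dict mapping station name -> list of daily values."""
    n_days = len(next(iter(station_series.values())))
    n_sites = len(station_series)
    return [sum(s[d] for s in station_series.values()) / n_sites for d in range(n_days)]

def multisite_weather(station_series, picked_day):
    """Return the observed weather at every station on the picked historical day."""
    return {name: series[picked_day] for name, series in station_series.items()}

# Hypothetical daily precipitation (mm) at the three stations.
stations = {
    "Pergamino": [0.0, 5.0, 2.0],
    "Junin": [0.0, 7.0, 1.0],
    "Nueve de Julio": [3.0, 0.0, 0.0],
}
virtual_site = area_average(stations)
same_day_everywhere = multisite_weather(stations, picked_day=1)
```

Because all stations take their values from the same historical day, the spatial correlation between sites is carried over from the record.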
wet and dry spell Statistics
Pergamino, Argentina
Multisite Case
Basic Distribution Properties
Spatial Correlation
Motivation
Influent Water Quality → Water Treatment Plant → Finished Water Quality
• TOC
• TSUVA
• Alkalinity
• pH
• Turbidity
• Temperature
Finished water must comply with a given regulation
Motivation
[Figure: probability density functions (x-axis: sw_avg) for the Input Distribution and the Output Distribution of the WTP; the output is split into Comply and Non-Compliance]
Uncertainty helps us to understand the risk of non-compliance with a given regulation
Data Set
Information Collection Rule (ICR)
• Monitoring effort mandated by USEPA
• Large public water systems
• Water quality and operating data
- Disinfection by-products (DBPs) and microorganisms to support rulemakings
• Most comprehensive view of large drinking water systems to date
• 18 months (Jul. 1997 – Dec. 1998)
• 458 continental US locations
ICR Database
• Water Quality
– Influent
– Intermediate
– Finished
– Distribution system
• Chemical Additions
Characterize Variability
Influent water quality has significant variability due to
- climate
- geology
- water management practices
Source Water
• TOC
• TSUVA
• Alkalinity
• pH
• Turbidity
• Temperature
• Total Hardness
• Examine influent water quality for surface waters (SWs)
– Spatial variability
– Temporal variability
• Focus on total organic carbon (TOC)
– TOC is a precursor in formation of DBPs
– Methods extend to other water quality parameters
Spatial Variability
• Local polynomial approach
• Find best K and P combination
• Contour estimates
$$TOC_{annual\ average} = f(Latitude, Longitude)$$
Spatial Variability
SW Average Annual TOC (mg/L)
2,30. P
Spatial Variability
Similar spatial patterns found for
• Finished water TOC (lower)
• Distribution system DBPs
– TTHM (total trihalomethanes)
– HAA5 (five haloacetic acids)
Spatial Variability
Spatial patterns consistent with previous research for other influent water quality variables
• Alkalinity
• Bromide
Temporal Variability
[Figure: monthly influent TOC (mg/L), J to D, 1998; City of Boulder’s Betasso Water Treatment Plant (CO)]
Temporal Variability
• Some locations exhibited seasonal trends, others did not
• Month to month variations should be considered
• Inherent variability in water quality contributes to uncertainty
• How can we quantify uncertainty?
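Quantify Uncertainty
Simulate “ensembles” of influent water quality (Monte Carlo)
Observed data:
$$\begin{bmatrix} TOC_{1} & \ldots & TOC_{12} \end{bmatrix}$$
Ensembles:
$$\begin{bmatrix} TOC_{S1\_1} & \ldots & TOC_{S1\_12} \\ \vdots & \ddots & \vdots \\ TOC_{S100\_1} & \ldots & TOC_{S100\_12} \end{bmatrix}$$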
Traditional Method
• Fit a probability density function (pdf) to the data: Normal, Lognormal, etc.
• Simulate from pdf
[Figure: Normal and Lognormal pdfs]
Limitations
- What if the pdf is not a good fit?
- What if you don’t have enough data to fit the pdf? (e.g., 18 months/location in ICR database)
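For comparison with the bootstrap that follows, the traditional route can be sketched as below. It assumes a lognormal model fitted from the mean and standard deviation of the log-values; the monthly TOC values are hypothetical and the function names are illustrative.

```python
import math
import random
import statistics

def fit_lognormal(values):
    """Estimate (mu, sigma) from the mean and standard deviation of ln(values)."""
    logs = [math.log(v) for v in values]
    return statistics.mean(logs), statistics.stdev(logs)

def simulate_lognormal(mu, sigma, n, rng):
    """Draw n values from the fitted lognormal pdf."""
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]

observed = [1.8, 2.1, 2.6, 3.0, 2.4, 1.9]   # hypothetical TOC values (mg/L)
mu, sigma = fit_lognormal(observed)
ensemble = simulate_lognormal(mu, sigma, 100, random.Random(0))
```

With only a handful of observations, as here, the fitted parameters are poorly constrained, which is the limitation the slide points out.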
[Figure: Histogram of May; x-axis: May (1000 to 6000), y-axis: Density]
Space-Time Bootstrapping Method
• Skip fitting a pdf to the data
• Simulate by bootstrapping
– Randomly sample data with replacement
• Expand bootstrapping pool to include “similar” locations (nearest neighbors)
• What is limited in time is available in space
• Find nearest neighbors (locations) in terms of a feature vector that includes variables of interest
• Feature vector includes:
- Average Annual Concentration
- Latitude
- Longitude
$$FeatureVector = (TOC_{average}, Lat, Lon)$$
Average annual concentration helps find neighbors that are similar but may not be geographically nearby.
Average annual TOC (mg/L) for Ohio surface waters
Geographically close, but not good “neighbors” for bootstrapping
$$FeatureVector = (TOC_{average}, Lat, Lon)$$
• Sample monthly TOC values based on feature vector
• Conditional probability
$$f(TOC_{monthly} \mid FeatureVector)$$
Simulation Algorithm
1) User inputs their location and their average annual TOC concentration
$$x_{user} = \begin{bmatrix} TOC_{user} & Lat_{user} & Lon_{user} \end{bmatrix}$$
2) The ICR database is queried for all eligible entries
$$x_{ICR} = \begin{bmatrix} TOC_{1} & Lat_{1} & Lon_{1} \\ \vdots & \vdots & \vdots \\ TOC_{i} & Lat_{i} & Lon_{i} \\ \vdots & \vdots & \vdots \\ TOC_{m} & Lat_{m} & Lon_{m} \end{bmatrix}$$
Algorithm - cont.
3) Calculate distances, d, between the $x_{user}$ vector and the $x_{ICR}$ vector
$$d = \lVert x_{user} - x_{ICR} \rVert$$
Algorithm - cont.
3) Calculate distances using weighted Mahalanobis equation
$$d_i = \sqrt{W (x_{user} - x_{ICR\_i}) \, S^{-1} \, W^{T} (x_{user} - x_{ICR\_i})^{T}}$$
Without the weights (W) and the covariance matrix (S), this reduces to the Euclidean distance
By including the covariance matrix S, the components of the feature vector do not have to be scaled (Davis 1986)
Weights are assigned as
$$W = \begin{bmatrix} W_{TOC} & W_{Lat} & W_{Lon} \end{bmatrix}$$
Weights offer flexibility in neighbor selection
Four weight settings are shown in panels (a) to (d): each of $W_{TOC}$, $W_{Lat}$ and $W_{Lon}$ set to 1 on its own (the other two set to 0), and all three set to 1
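The distance calculation can be sketched in plain Python. This is one reading of the equation, not necessarily the authors' implementation: the weights are applied elementwise to the difference vector, S is the sample covariance of the database rows, and the ICR rows below are hypothetical.

```python
import math

def covariance(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[c] for r in rows) / n for c in range(p)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def solve(a, b):
    """Solve a @ y = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def weighted_mahalanobis(x_user, x_i, s, w):
    """d_i = sqrt(u S^-1 u^T), with u = w * (x_user - x_i) taken elementwise."""
    u = [wk * (a - b) for wk, a, b in zip(w, x_user, x_i)]
    y = solve(s, u)                                   # y = S^-1 u
    return math.sqrt(max(0.0, sum(uk * yk for uk, yk in zip(u, y))))

# Hypothetical database rows: (average annual TOC, latitude, longitude).
x_icr = [
    [2.1, 40.0, -105.3],
    [3.5, 39.1, -84.5],
    [1.8, 33.5, -86.8],
    [4.2, 40.3, -74.1],
    [2.9, 41.9, -87.6],
]
s = covariance(x_icr)
x_user = [2.5, 40.0, -105.2]
distances = [weighted_mahalanobis(x_user, row, s, [1, 1, 1]) for row in x_icr]
nearest_first = sorted(range(len(x_icr)), key=distances.__getitem__)
```

Setting a weight to 0 removes that component from the distance, which is how the weight settings change which locations count as neighbors.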
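Algorithm - cont.
4) Obtain observed monthly data for each nearest neighbor
$$x_{NN} = \begin{bmatrix} TOC_{1\_Jan} & \ldots & TOC_{1\_Dec} \\ \vdots & & \vdots \\ TOC_{i\_Jan} & \ldots & TOC_{i\_Dec} \\ \vdots & & \vdots \\ TOC_{k\_Jan} & \ldots & TOC_{k\_Dec} \end{bmatrix}$$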
Algorithm - cont.
5) Bootstrap $x_{NN}$ using a weight function
$$p_j = \frac{1/j}{\sum_{i=1}^{k} 1/i}$$
Increases likelihood of picking nearer neighbors
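Steps 4 and 5 can be sketched as below. The sketch assumes each month is resampled independently from the same calendar month at a neighbor drawn with probability p_j; the slides do not spell out that detail. The neighbor values are hypothetical and the function names are illustrative.

```python
import random

def neighbor_probs(k):
    """p_j = (1/j) / sum_{i=1..k}(1/i) for neighbors ranked j = 1 (nearest) to k."""
    total = sum(1.0 / i for i in range(1, k + 1))
    return [(1.0 / j) / total for j in range(1, k + 1)]

def bootstrap_ensemble(x_nn, n_sims, rng):
    """x_nn: k rows ordered nearest first, each holding 12 monthly values.
    Returns n_sims simulated years of 12 values each."""
    probs = neighbor_probs(len(x_nn))
    ensemble = []
    for _ in range(n_sims):
        year = []
        for month in range(12):
            row = rng.choices(x_nn, weights=probs, k=1)[0]   # pick one neighbor
            year.append(row[month])                          # take its value for this month
        ensemble.append(year)
    return ensemble

# Hypothetical monthly TOC (mg/L) at three nearest neighbors.
x_nn = [[j + 0.1 * m for m in range(12)] for j in range(1, 4)]
ensemble = bootstrap_ensemble(x_nn, n_sims=100, rng=random.Random(0))
```

Each column of `ensemble` is a 100-member sample for one month, which is the input to a monthly box plot.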
Apply algorithm to quantify uncertainty in influent TOC concentration
City of Boulder’s Betasso Water Treatment Plant (CO)
SWs only, N = 334
Red dot is the Boulder plant being simulated
Empty black dots are the “neighbors” to be bootstrapped
Identify nearest neighbors
- Include Boulder in pool for bootstrapping
- $W_{TOC} = W_{Lat} = W_{Lon} = 1$
Uncertainty quantified for Boulder
Box plot each monthly bootstrap ensemble (100 values)
[Figure: box plots of influent TOC (mg/L) by month (J to D) and annual (Ann); legend: median, 25th and 75th percentiles, 5th and 95th percentiles, outliers]
[Figure: monthly box plots of simulated influent TOC (mg/L), J to D and annual (Ann), with 1998 observations overlaid]
• Simulates seasonal trends
• Provides rich variety of uncertainty
Overlay recent data
• Simulations capture recent data
[Figure: monthly box plots of simulated influent TOC (mg/L), J to D and annual (Ann), with observations from 1997, 1998, 2003, 2004 and 2005 overlaid]
Portable Across Locations
City of Birmingham’s Carson Filter Plant (AL)
[Figure: monthly box plots of simulated influent TOC (mg/L), J to D and annual (Ann), with observations from 1997, 1998, 2003, 2004 and 2005 overlaid]
Applies to Other Variables
New Jersey American Water Swimming River Treatment Plant (NJ)
[Figure: monthly box plots of simulated influent alkalinity (mg/L as CaCO3), J to D and annual (Ann), with observations from 1997, 1998, 2002, 2003, 2004 and 2005 overlaid]
Summary & Conclusions
• K-NN resampling technique provides a simple and robust alternative for generating ‘scenarios’
– Quantify uncertainty
– Ensemble forecast
• Very general – can be easily applied to a variety of situations
– Weather generation
– Water quality
– Streamflow (Colorado River Basin)
• Can readily be extended to generate ‘scenarios’ under climate change or decadal variability
– Modify the ‘feature vector’ to include the climate variability information
• Rajagopalan and Lall (1999); Yates et al. (2003); Apipattanavis et al. (2007) - all papers in Water Resources Research
Acknowledgements
AwwaRF project 3115
“Decision Tool to Help Utilities Develop Simultaneous Compliance Strategies”
Utilities
City of Boulder’s Betasso Water Treatment Plant (CO)
City of Birmingham’s Carson Filter Plant (AL)
New Jersey American Water Swimming River Treatment Plant (NJ)
Greater Cincinnati (OH) Water Works Richard Miller Water Treatment Plant
Questions
“It is better to be roughly right than precisely wrong.”
-John Maynard Keynes (1883-1946)