an introduction to statistical modelling of extreme values
An introduction to statistical modelling of extreme values Application to calculate extreme wind speeds Edward Omey, Fermin Mallor, Eulalia Nualart
HUB RESEARCH PAPER 2009/36 NOVEMBER 2009
An introduction to statistical modelling
of extreme values.
Application to calculate extreme wind speeds
Pamplona, 2009
Fermín Mallor, Public University of Navarre
Eulalia Nualart. Public University of Navarre
Edward Omey. Hogeschool Universiteit Brussel
Contents
1. INTRODUCTION
2. CLASSICAL EXTREME VALUE THEORY
3. THRESHOLD MODELS
4. EXTREMES OF DEPENDENT SEQUENCES
5. BIBLIOGRAPHY
1. INTRODUCTION
High wind speeds pose a threat to the integrity of structures such as wind turbines. An
accurate estimate of how often extreme wind speeds occur is an important factor in
achieving a correct balance between safety and the cost of “over-design”. This design
problem also arises in many other engineering areas, such as ocean engineering (wave
height), hydraulic engineering (floods) and structural engineering (earthquakes), as well
as in meteorology (temperatures, rainfall, etc.) and fatigue strength analysis (workloads).
What all these applications have in common is that the interest lies not in the average
behaviour of the phenomenon under study but in its extreme behaviour. The
distinguishing feature of an extreme value analysis is therefore that its objective is to
describe not the usual behaviour of a stochastic phenomenon but its unusual and rarely
observed events.
For example, suppose that a sea-wall is to be built to protect the coast against all
sea-levels that are likely to occur within its projected life span (for example, 100 years).
An accurate estimate of the highest sea-level in 100 years is necessary in order to balance
economic and safety goals. The problem that the statistical methods face is that the
available records of sea-levels may span a much shorter period, say 15 years. The
challenge is to estimate what sea-levels might occur over the next 100 years given only
the 15 years of data.
Motivation of extreme wind speed analysis: the calculation of Vref
The extreme wind speed estimates are used to determine the critical design loads which
the turbine must withstand during its lifetime. According to the International Standard
IEC 61400-1 for Wind Turbine Generator Systems, the extreme wind speed Vref is a basic
parameter for wind turbine classes and is therefore strongly related to the design of wind
turbines. Vref is defined as the extreme 10-min average wind speed with a recurrence
period of 50 years. In general Vref has to be determined statistically on the basis of
on-site measurements.
Classification of wind turbine generators according to Vref
Return levels and return periods
A return level with a return period of T = 1/p years is a high threshold x(p) (e.g., for the
annual peak of the 10-minute average wind speed) whose probability of exceedance in
any given year is p. For example, if p = 0.01, then the return period is T = 100 years.
The parameter of interest Vref is thus the return level of the 10-minute average wind
speed with a return period of 50 years.
Two common interpretations of a return level with a return period of T years are:
(i) Waiting time: Average waiting time until next occurrence of event is T years
(ii) Number of events: Average number of events occurring within a T-year time period
is one
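The two interpretations can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that exceedances in different years are independent (an assumption made here only for illustration); the function names are hypothetical.

```python
def prob_at_least_one_exceedance(p: float, years: int) -> float:
    """P(the return level is exceeded in at least one of `years` independent years)."""
    return 1.0 - (1.0 - p) ** years


def mean_waiting_time(p: float) -> float:
    """Mean of the geometric waiting time (in years) until the first exceedance."""
    return 1.0 / p


p = 1.0 / 50.0  # annual exceedance probability of a 50-year return level
print(mean_waiting_time(p))                 # average waiting time in years
print(50 * p)                               # expected number of exceedances in 50 years
print(prob_at_least_one_exceedance(p, 50))  # roughly 0.64, clearly below 1
```

Under this assumption, a 50-year return level is exceeded on average once in 50 years, yet the probability of seeing at least one exceedance within a given 50-year period is only about 64%.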
The statistical theory developed to deal with these problems and this type of data is
known as Extreme Value Theory. The presentation of its main results and their
application to the analysis of extreme wind speeds are the two main purposes of this
monograph.
A sample of consequences of extreme winds
http://www.youtube.com/watch?v=oAWMpxX60KM&feature=player_embedded
http://www.youtube.com/watch_popup?v=CqEccgR0q-o
http://www.youtube.com/watch?v=b43lAoovqd8&feature=fvw
2. CLASSICAL EXTREME VALUE THEORY
2.1. Model formulation
The core of extreme value theory is the study of the statistical behaviour of
$$M_n = \max\{X_1, \ldots, X_n\},$$
where $X_1, \ldots, X_n$ is a sequence of independent random variables having a
common distribution function F.
In applications, the variables $X_i$ usually represent values of a process measured on a
regular time-scale, for example the 10-minute average (or maximum) wind
speed. Then $M_n$ is the maximum of the observed process over n time units.
The distribution function of $M_n$ satisfies:
$$P(M_n \le z) = P(X_1 \le z, \ldots, X_n \le z) = P(X_1 \le z) \times \cdots \times P(X_n \le z) = (F(z))^n.$$
Thus, one way to study $M_n$ is to estimate F from the available data (for example, the
10-minute speed records measured during a certain interval of time) and then to
substitute this estimate into the previous formula to obtain an estimate of the
distribution of $M_n$.
The problem with this approach is that small errors in the estimate of F can lead to
large discrepancies in $F^n(z)$.
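The sensitivity of the n-th power to small errors can be illustrated numerically. The sketch below uses two hypothetical values of F(z) at the same level z, differing by only 0.00001, and takes n as the number of 10-minute intervals in one year; the numbers are illustrative and are not taken from any data set.

```python
def max_cdf(F_z: float, n: int) -> float:
    """P(M_n <= z) = F(z)**n for n independent observations."""
    return F_z ** n


n = 6 * 24 * 365  # number of 10-minute intervals in one year

# Two hypothetical estimates of F(z) at the same level z.
for F_z in (0.99999, 0.99998):
    print(F_z, round(max_cdf(F_z, n), 4))
```

A difference in the fifth decimal of F(z) changes the estimated probability that the annual maximum stays below z from roughly 0.59 to roughly 0.35.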
An alternative approach is to estimate $F^n(z)$ directly from the extreme data. The
idea is similar to the one used to approximate the distribution of the sample mean by
its limiting normal distribution. For this purpose it is necessary to study the behaviour
of $F^n(z)$ as n tends to infinity. However, since $F(z) < 1$ for $z < z_{\sup}$, where
$z_{\sup}$ is the smallest value of z such that $F(z) = 1$, we have $F^n(z) \to 0$ as
$n \to \infty$ for every $z < z_{\sup}$.
To overcome this difficulty and reach a limit different from 0, we use the following
linear normalization of $M_n$:
$$M_n^* = \frac{M_n - b_n}{a_n},$$
where $\{a_n\}$ and $\{b_n\}$ are sequences of constants with $a_n > 0$.
Example 1 The Pareto distribution
The Pareto distribution is $F(z) = 1 - z^{-a}$, where $z > 1$ and $a > 0$.
Looking at $F^n(z)$, we have $F^n(z) = (1 - z^{-a})^n$. In view of the well-known formula
$\exp(x) = \lim_{n \to \infty} (1 + x/n)^n$, we replace z by $n^{1/a} z$ to find that, for $z > 0$,
$$F^n(n^{1/a} z) = (1 - n^{-1} z^{-a})^n \to \exp(-z^{-a}) \quad (n \to \infty).$$
Then $n^{-1/a} M_n \to Y$ in distribution, where the distribution function of Y is given by
$$F_Y(z) = \exp(-z^{-a}), \quad z > 0.$$
Example 2 The exponential distribution
Now we have $F(z) = 1 - \exp(-z)$, $z > 0$. In this case $F^n(z) = (1 - \exp(-z))^n$ and,
again using the definition of the exponential function, we find that
$$F^n(z + \log n) = (1 - n^{-1} \exp(-z))^n \to \exp(-\exp(-z)) \quad (n \to \infty).$$
This shows that $M_n - \log n \to Y$ in distribution, where $F_Y(z) = \exp(-\exp(-z))$.
Example 3 The uniform distribution on [0, 1]
Here we have $F(x) = x$ for $0 < x < 1$. Hence $F^n(x) = x^n$ ($0 < x < 1$) and
consequently
$$F^n(1 + n^{-1} x) = (1 + n^{-1} x)^n \to \exp(x) \quad (n \to \infty), \quad \text{for } x < 0.$$
This shows that $n(M_n - 1) \to Y$ in distribution, where $F_Y(x) = \exp(x)$, $x < 0$.
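The three limits can be verified numerically by evaluating $F^n$ at the normalized argument for a large n and comparing it with the limiting distribution function. The following sketch uses hypothetical evaluation points and n = 10^6; the function names are introduced here for illustration only.

```python
import math


def pareto_max_cdf(z: float, a: float, n: int) -> float:
    """P(n**(-1/a) * M_n <= z) for the Pareto distribution F(z) = 1 - z**(-a)."""
    return (1.0 - (n ** (1.0 / a) * z) ** (-a)) ** n


def exponential_max_cdf(z: float, n: int) -> float:
    """P(M_n - log(n) <= z) for the exponential distribution F(z) = 1 - exp(-z)."""
    return (1.0 - math.exp(-(z + math.log(n)))) ** n


def uniform_max_cdf(x: float, n: int) -> float:
    """P(n * (M_n - 1) <= x) for the uniform distribution on [0, 1], with x < 0."""
    return (1.0 + x / n) ** n


n = 10**6
# Each line prints the finite-n value followed by the limiting value.
print(pareto_max_cdf(2.0, 3.0, n), math.exp(-(2.0 ** -3.0)))
print(exponential_max_cdf(0.5, n), math.exp(-math.exp(-0.5)))
print(uniform_max_cdf(-1.0, n), math.exp(-1.0))
```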
These examples illustrate the following general result: if $M_n^*$ converges in
distribution to a nondegenerate variable Y, then the distribution function of Y
necessarily belongs to one of three classes of distributions.
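Theorem 1: Extremal types theorem. If there exist sequences of constants $\{a_n > 0\}$
and $\{b_n\}$ such that
$$P\{(M_n - b_n)/a_n \le z\} \to G(z) \quad (n \to \infty),$$
where G is a non-degenerate distribution function, then G belongs to one of the
following types:
I. (Gumbel)
$$G(z) = \exp\left\{-\exp\left[-\left(\frac{z-b}{a}\right)\right]\right\}, \quad -\infty < z < \infty$$
II. (Fréchet)
$$G(z) = \begin{cases} 0, & z \le b \\ \exp\left\{-\left(\dfrac{z-b}{a}\right)^{-\alpha}\right\}, & z > b \end{cases}$$
III. (Weibull)
$$G(z) = \begin{cases} \exp\left\{-\left[-\left(\dfrac{z-b}{a}\right)\right]^{\alpha}\right\}, & z < b \\ 1, & z \ge b \end{cases}$$
In all cases, a > 0 and b is real. In cases II and III we have α > 0.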
These three classes of distributions are called the extreme value distributions of
types I, II and III, and are also known as the Gumbel, Fréchet and Weibull families,
respectively.
Observe that these three types of distributions are the only possible limits for the
distribution of the normalized maxima, regardless of the distribution F of the
population.
The three limit types have different forms of tail behaviour. The upper end point
$z_{\sup}$ is finite for the Weibull distribution ($z_{\sup} = \mu - \sigma/\xi$ in the
parameterization of Section 2.2), while $z_{\sup} = \infty$ for the Fréchet and Gumbel
distributions. However, the density of the Gumbel distribution decays exponentially,
whereas the density of the Fréchet distribution decays polynomially. Many common
distributions, such as the normal, lognormal, exponential and gamma, belong to the
domain of attraction of the Gumbel type. The Fréchet type has a heavy tail:
$E(X^r) = \infty$ for $r \ge 1/\xi$ (which means that it has infinite variance if
$\xi \ge 1/2$).
2.2. The generalized extreme value distribution
In the past it was usual to adopt one of the three families and then to estimate the
parameters of the model. This approach has a weakness: one of the three models has to
be chosen and assumed to be correct, and the uncertainty implied by this choice is not
taken into account in the subsequent inferences. A better analysis is obtained by
combining the three models into a single family named the generalized extreme value
distribution (GEV):
$$G(z) = \exp\left\{-\left[1 + \xi\left(\frac{z-\mu}{\sigma}\right)\right]^{-1/\xi}\right\},$$
defined on $\{z : 1 + \xi(z-\mu)/\sigma > 0\}$, with parameters
location: $-\infty < \mu < \infty$
scale: $\sigma > 0$
shape: $-\infty < \xi < \infty$
The type II (Fréchet) distribution is obtained when $\xi > 0$.
The type III (Weibull) distribution is obtained when $\xi < 0$.
The type I (Gumbel) distribution is obtained by letting $\xi \to 0$.
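As an illustration, the GEV distribution function can be coded directly from its definition. This is a minimal sketch rather than a library implementation: the function name is hypothetical, the case ξ = 0 is handled through the Gumbel limit, and values outside the support are mapped to 0 or 1.

```python
import math


def gev_cdf(z: float, mu: float, sigma: float, xi: float) -> float:
    """GEV distribution function G(z); xi = 0 is treated as the Gumbel limit."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    s = (z - mu) / sigma
    if xi == 0.0:
        return math.exp(-math.exp(-s))
    t = 1.0 + xi * s
    if t <= 0.0:
        # Outside the support: below the lower end point (xi > 0)
        # or above the upper end point (xi < 0).
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))


# Same location and scale, three tail behaviours (illustrative values).
for xi in (-0.2, 0.0, 0.2):
    print(xi, round(gev_cdf(3.0, 0.0, 1.0, xi), 4))
```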
This unification facilitates the statistical analysis. The uncertainty in the estimate
of the parameter ξ measures the lack of certainty in the choice of one of the three
models. The extremal types theorem can now be restated in the following way.
Theorem 2. If there exist sequences of constants $\{a_n > 0\}$ and $\{b_n\}$ such that
$$P\{(M_n - b_n)/a_n \le z\} \to G(z) \quad (n \to \infty),$$
where G is a non-degenerate distribution function, then G is a member of the GEV
family:
$$G(z) = \exp\left\{-\left[1 + \xi\left(\frac{z-\mu}{\sigma}\right)\right]^{-1/\xi}\right\},$$
defined on $\{z : 1 + \xi(z-\mu)/\sigma > 0\}$, with parameters
location: $-\infty < \mu < \infty$
scale: $\sigma > 0$
shape: $-\infty < \xi < \infty$
One of the nice properties of the GEV family is that it is a family of max-stable
distributions.
Theorem 2B
Suppose that $Y_1, Y_2, \ldots, Y_n$ are independent random variables with a common
GEV distribution function $F_Y(x) = G(x)$. Then $M_n = \max(Y_1, Y_2, \ldots, Y_n)$ is of the
same type, i.e. there are constants $u_n > 0$ and $v_n$ real such that
$$u_n (M_n - v_n) \overset{d}{=} Y.$$
Proof
For simplicity take μ = 0 and σ = 1. For ξ = 0 we have $G(x) = \exp(-\exp(-x))$,
and then it is clear that $G^n(x + \log n) = G(x)$.
For ξ ≠ 0, we have $G((x-1)/\xi) = \exp(-x^{-1/\xi})$, and it follows that
$$G^n\left(\frac{n^{\xi} x - 1}{\xi}\right) = \exp\left(-n (n^{\xi} x)^{-1/\xi}\right) = \exp\left(-x^{-1/\xi}\right) = G\left(\frac{x-1}{\xi}\right). \qquad \blacksquare$$
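Max-stability can be checked numerically for the standard GEV (μ = 0, σ = 1). The explicit constants used below, $u_n = n^{-\xi}$ and $v_n = (n^{\xi} - 1)/\xi$ for ξ ≠ 0 (and $u_n = 1$, $v_n = \log n$ for ξ = 0), are obtained here by rearranging the identities in the proof; they are not stated in the text and should be read as an assumption of this sketch.

```python
import math


def G(z: float, xi: float) -> float:
    """Standard GEV (mu = 0, sigma = 1) distribution function, for 1 + xi*z > 0."""
    if xi == 0.0:
        return math.exp(-math.exp(-z))
    return math.exp(-(1.0 + xi * z) ** (-1.0 / xi))


def normalising_constants(n: int, xi: float) -> tuple:
    """Constants (u_n, v_n) such that u_n * (M_n - v_n) has the same law as Y."""
    if xi == 0.0:
        return 1.0, math.log(n)
    return n ** (-xi), (n ** xi - 1.0) / xi


n, y = 365, 0.7
for xi in (-0.2, 0.0, 0.3):
    u_n, v_n = normalising_constants(n, xi)
    # P(u_n * (M_n - v_n) <= y) = G(y / u_n + v_n) ** n, to be compared with G(y).
    print(xi, G(y / u_n + v_n, xi) ** n, G(y, xi))
```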
Furthermore, in practice the unknown normalizing constants do not pose a problem.
For large n, we have $P((M_n - b_n)/a_n \le z) \approx G(z)$. It then follows that
$$P(M_n \le z) \approx G((z - b_n)/a_n) = G^*(z),$$
where $G^*$ is again a member of the GEV family.
2.3. Practical implementation
The above results lead to the following approach for modelling extremes of a series
of independent and identically distributed observations $X_1, X_2, \ldots$. The first step
consists in blocking the data into sequences of n observations, with n sufficiently
large. Then the maximum $Z_i$ of each block i is calculated and, finally, the GEV
distribution is fitted to this series of block maxima $Z_1, Z_2, \ldots$.
In environmental applications the length of the blocks is usually one year, and the
data are then the annual maxima $Z_i$ of year i.
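The blocking step can be sketched as follows. The series used here is simulated (10 hypothetical "years" of 365 daily values) purely to show the mechanics; the function name and the choice to drop an incomplete final block are assumptions of this example.

```python
import random


def block_maxima(values: list, block_size: int) -> list:
    """Maxima of consecutive, non-overlapping blocks; an incomplete final block is dropped."""
    n_blocks = len(values) // block_size
    return [max(values[i * block_size:(i + 1) * block_size]) for i in range(n_blocks)]


random.seed(0)
# Hypothetical series: 10 "years" of 365 daily values from an exponential distribution.
series = [random.expovariate(1.0) for _ in range(10 * 365)]
annual_maxima = block_maxima(series, 365)
print(len(annual_maxima))
print([round(z, 2) for z in annual_maxima[:3]])
```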
Once the GEV distribution has been fitted, say to the annual maxima, we can
calculate the quantile function $z_p$ of the annual maximum distribution as:
$$z_p = \begin{cases} \mu - \dfrac{\sigma}{\xi}\left[1 - \{-\log(1-p)\}^{-\xi}\right], & \xi \ne 0 \\ \mu - \sigma \log\{-\log(1-p)\}, & \xi = 0 \end{cases}$$
Observe that $G(z_p) = 1 - p$. In our terminology, $z_p$ is the return level associated
with the return period $1/p$. That is, $z_p$ is the level that is expected to be exceeded,
on average, once every $1/p$ years. Equivalently, $z_p$ is the level that is exceeded by
the annual maximum in any particular year with probability p.
By using $y_p = -\log(1-p)$, this quantile function can be expressed as
$$z_p = \begin{cases} \mu - \dfrac{\sigma}{\xi}\left[1 - y_p^{-\xi}\right], & \xi \ne 0 \\ \mu - \sigma \log y_p, & \xi = 0 \end{cases}$$
If $z_p$ is plotted against $\log y_p$, the plot is linear in the case ξ = 0; it is
convex in the case ξ < 0, with asymptotic limit $\mu - \sigma/\xi$ as p tends to 0; and
it is concave in the case ξ > 0, with no finite bound.
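The quantile function translates directly into code. The sketch below uses hypothetical parameter values (μ = 100, σ = 10) to show how the shape parameter changes the growth of the return levels; the function name is illustrative.

```python
import math


def gev_return_level(p: float, mu: float, sigma: float, xi: float) -> float:
    """Level z_p exceeded by the block maximum with probability p (0 < p < 1)."""
    y_p = -math.log(1.0 - p)
    if xi == 0.0:
        return mu - sigma * math.log(y_p)
    return mu - (sigma / xi) * (1.0 - y_p ** (-xi))


# Illustrative (hypothetical) parameters.
mu, sigma = 100.0, 10.0
for xi in (-0.2, 0.0, 0.2):
    levels = [round(gev_return_level(1.0 / T, mu, sigma, xi), 1) for T in (10, 50, 100)]
    bound = mu - sigma / xi if xi < 0 else "none"
    print("xi =", xi, "return levels (T = 10, 50, 100):", levels, "upper bound:", bound)
```

For ξ < 0 the return levels approach the finite bound $\mu - \sigma/\xi$ as the return period grows, whereas for ξ ≥ 0 they increase without limit.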
This graph is called a return level plot; it is useful as a validation tool as well as a
way of presenting the fitted model.
2.4. Inference for the GEV distribution
The choice of the block length implies a trade-off between bias and variance.
When the blocks are short, the approximation of the distribution of the block maxima
by the limit distribution is poor, which leads to bias in estimation and extrapolation.
Long blocks, on the other hand, generate only a few block maxima, leading to large
estimation variance.
The method most commonly used to estimate the parameters is maximum likelihood.
One difficulty of this approach is that the regularity conditions required for the usual
asymptotic results are not satisfied by the GEV distributions, because the end-point of
the distribution depends on the parameter values. This violation means that the standard
asymptotic likelihood results are not automatically applicable. This problem has
been studied in detail (Smith, 1985), with the following results:
• When $\xi > -0.5$, the maximum likelihood estimators have the usual
asymptotic properties.
• When $-1 < \xi < -0.5$, the maximum likelihood estimators can generally be
obtained, but they do not have the standard asymptotic properties.
• When $\xi < -1$, the maximum likelihood estimators are unlikely to be
obtainable.
Observe that the case $\xi < -0.5$ corresponds to distributions with a very short
bounded upper tail, which is rarely encountered in real applications of extreme value
modelling.
Denoting by $Z_1, \ldots, Z_m$ the block maxima, and under the assumption that they are
independent variables having a GEV distribution, the log-likelihood for the GEV
when ξ ≠ 0 is
$$\ell(\mu, \sigma, \xi) = -m \log \sigma - \left(1 + \frac{1}{\xi}\right) \sum_{i=1}^{m} \log\left[1 + \xi\left(\frac{z_i - \mu}{\sigma}\right)\right] - \sum_{i=1}^{m} \left[1 + \xi\left(\frac{z_i - \mu}{\sigma}\right)\right]^{-1/\xi},$$
provided that $1 + \xi (z_i - \mu)/\sigma > 0$ for i = 1, …, m. When this condition is not
satisfied the likelihood is zero and the log-likelihood is minus infinity.
In the Gumbel case (ξ = 0), the log-likelihood is:
$$\ell(\mu, \sigma) = -m \log \sigma - \sum_{i=1}^{m} \left(\frac{z_i - \mu}{\sigma}\right) - \sum_{i=1}^{m} \exp\left\{-\left(\frac{z_i - \mu}{\sigma}\right)\right\}$$
By maximizing these log-likelihood functions we obtain the maximum likelihood
estimates $(\hat\mu, \hat\sigma, \hat\xi)$. The maximization is carried out with numerical
optimization algorithms.
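A numerical optimizer needs the negative log-likelihood as a function of the parameters. The following sketch codes the two expressions above and returns infinity outside the admissible region; the function name, the switch to the Gumbel expression for |ξ| below 1e-8 and the sample values are assumptions made for illustration. In practice this function would be passed to a general-purpose minimizer.

```python
import math


def gev_neg_log_likelihood(params: tuple, z: list) -> float:
    """Negative GEV log-likelihood; returns +inf outside the admissible region."""
    mu, sigma, xi = params
    if sigma <= 0:
        return math.inf
    m = len(z)
    s = [(zi - mu) / sigma for zi in z]
    if abs(xi) < 1e-8:  # Gumbel case
        return m * math.log(sigma) + sum(s) + sum(math.exp(-si) for si in s)
    t = [1.0 + xi * si for si in s]
    if min(t) <= 0:
        return math.inf
    return (m * math.log(sigma)
            + (1.0 + 1.0 / xi) * sum(math.log(ti) for ti in t)
            + sum(ti ** (-1.0 / xi) for ti in t))


# Hypothetical block maxima, evaluated at two candidate parameter vectors.
maxima = [10.2, 11.5, 9.8, 12.1, 10.9, 13.4, 10.1]
print(gev_neg_log_likelihood((10.5, 1.0, 0.0), maxima))
print(gev_neg_log_likelihood((10.5, 1.0, -0.1), maxima))
```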
The classical theory of maximum likelihood estimation establishes that the
distribution of )ˆ,ˆ,ˆ( ξσµ is approximately normal with mean ),,( ξσµ and variance-
covariance matrix equal to the inverse of the observed information matrix evaluated
at the maximum likelihood estimate. Confidence intervals are obtained from this
approximate normality of the estimator.
Inference for return levels
The maximum likelihood estimate of the 1/p return level $z_p$, for 0 < p < 1, is
$$\hat z_p = \begin{cases} \hat\mu - \dfrac{\hat\sigma}{\hat\xi}\left[1 - y_p^{-\hat\xi}\right], & \hat\xi \ne 0 \\ \hat\mu - \hat\sigma \log y_p, & \hat\xi = 0 \end{cases}$$
where $y_p = -\log(1-p)$. Confidence intervals can be obtained from the normal
approximation to the distribution of the estimator, but caution is required in their
interpretation, especially for return levels corresponding to long return periods,
because the normal approximation may be poor. A better approximation is generally
obtained from the profile likelihood function.
2.5. Graphical model checking
Though it is impossible to check the validity of an extrapolation based on the GEV
model, the fit can be assessed with reference to the observed data.
Probability Plot.
A probability plot is a comparison of the empirical and fitted distribution functions.
The empirical distribution function evaluated at the i-th ordered block maximum
$z_{(i)}$ is $\tilde G(z_{(i)}) = i/(m+1)$, and the fitted distribution function at the same
point is
$$\hat G(z_{(i)}) = \exp\left\{-\left[1 + \hat\xi\left(\frac{z_{(i)} - \hat\mu}{\hat\sigma}\right)\right]^{-1/\hat\xi}\right\}.$$
For the model to be good it is necessary that $\tilde G(z_{(i)}) \approx \hat G(z_{(i)})$. In
practice, the points $\{(\tilde G(z_{(i)}), \hat G(z_{(i)})),\ i = 1, \ldots, m\}$ should lie close to
the unit diagonal. But because both functions are bound to approach 1 as z increases,
the plot is least informative in this region. The following graph avoids this deficiency.
Quantile plot.
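The quantile plot is a representation of the points
$$\left\{\left(\hat G^{-1}(i/(m+1)),\ z_{(i)}\right),\ i = 1, \ldots, m\right\},$$
where
$$\hat G^{-1}\left(\frac{i}{m+1}\right) = \hat\mu - \frac{\hat\sigma}{\hat\xi}\left[1 - \left\{-\log\left(\frac{i}{m+1}\right)\right\}^{-\hat\xi}\right], \quad i = 1, \ldots, m.$$
In the ideal situation the points should lie close to the unit diagonal. Departures from
linearity in the quantile plot indicate model failure.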
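The coordinates of both diagnostic plots can be computed as in the sketch below. It assumes a fitted model with $\hat\xi \ne 0$ and all observations inside the support; the data and the parameter values are hypothetical, and the function names are introduced here for illustration.

```python
import math


def gev_cdf(z: float, mu: float, sigma: float, xi: float) -> float:
    """Fitted GEV distribution function (assumes xi != 0 and z inside the support)."""
    return math.exp(-(1.0 + xi * (z - mu) / sigma) ** (-1.0 / xi))


def gev_quantile(q: float, mu: float, sigma: float, xi: float) -> float:
    """Inverse of gev_cdf for 0 < q < 1 (assumes xi != 0)."""
    return mu - (sigma / xi) * (1.0 - (-math.log(q)) ** (-xi))


def diagnostic_points(maxima: list, mu: float, sigma: float, xi: float) -> tuple:
    """Return (probability-plot points, quantile-plot points) for a fitted GEV."""
    z = sorted(maxima)
    m = len(z)
    pp = [(i / (m + 1), gev_cdf(z[i - 1], mu, sigma, xi)) for i in range(1, m + 1)]
    qq = [(gev_quantile(i / (m + 1), mu, sigma, xi), z[i - 1]) for i in range(1, m + 1)]
    return pp, qq


# Hypothetical block maxima and fitted parameters.
maxima = [10.2, 11.5, 9.8, 12.1, 10.9, 13.4, 10.1]
pp, qq = diagnostic_points(maxima, 10.4, 1.0, -0.1)
print([(round(a, 2), round(b, 2)) for a, b in pp])
print([(round(a, 2), round(b, 2)) for a, b in qq])
```

Both lists can then be drawn as scatter plots together with the line y = x.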
Return level plot.
The return level plot represents the points $\{(\log y_p, \hat z_p) : 0 < p < 1\}$. Confidence
intervals are usually added to this plot to make it more informative. Return levels
matter in engineering because the return period is used as a design criterion.
Furthermore, to use this plot as a model diagnostic, empirical estimates of the return
level function are also added. For a suitable model, the model-based curve and the
empirical estimates should be in agreement.
2.6. Case study. Hourly average wind data from Schiphol in the Netherlands.
We consider the records of the hourly average wind speed at Schiphol, the Netherlands
(lat. 52.330 north, lon. 4.738 east). The data were recorded by the Royal Netherlands
Meteorological Institute (KNMI), through the KNMI HYDRA PROJECT, from March 1,
1950 to December 31, 2005. The measuring height was 10 meters.
(http://www.knmi.nl/samenw/hydra/index.html)
The data. The next figures show the original data and the daily, monthly and yearly
maxima, respectively.
Analysis of the yearly maxima data with the package extRemes.
http://www.assessment.ucar.edu/toolkit/
The Extremes Toolkit is an interactive program for analyzing extreme value data using the
R statistical programming language. A graphical user interface is provided, so knowledge
of R is not necessarily required.
[Figure: Max. Wind Speed 1950-2005, Schiphol. Horizontal axis: Year (1940 to 2010); vertical axis: Mean hourly speed m/s (0 to 300).]
Descriptive statistics of the yearly maxima:
N    mean    Std.Dev.  Min.  Q1     median  Q3     max
56   208.39  26.06     157   189.5  205.0   226.5  280.0
Step 1. Loading the file with the year maximum wind speed data.
The following plots of the data have been made using the scatter plot option of the package extRemes:
Step 2. Fit a GEV distribution to the yearly maxima data. We use the following option of
extRemes: Analyze > Generalized Extreme Value (GEV) Distribution
The numerical results provided by the software are:
************ GEV fit -----------------------------------
Response variable: speedmaxyear
L-moments (stationary case) estimates (used to initialize MLE optimization routine):
Location (mu): 197.6025
Scale (sigma): 23.84311
Shape (xi): -0.1420611
Likelihood ratio test (5% level) for xi=0 does not reject Gumbel hypothesis.
likelihood ratio statistic is 2.308828 < 3.841459 1 df chi-square critical value.
p-value for likelihood-ratio test is 0.128641
Convergence successfull!
[1] "Convergence successfull!"
[1] "Maximum Likelihood Estimates:"
                   MLE        Stand. Err.
MU: (identity)     197.88060  3.50346
SIGMA: (identity)   23.55972  2.45932
Xi: (identity)      -0.15223  0.08944
[1] "Negative log-likelihood: 260.451307992363"
Parameter covariance:
     [,1]        [,2]         [,3]
[1,] 12.2742524   1.68235174  -0.116433152
[2,]  1.6823517   6.04824808  -0.098884367
[3,] -0.1164332  -0.09888437   0.008000298
[1] "Convergence code (see help file for optim): 0"
NULL
Model name: gev.fit1
The diagnostic plots are:
These diagnostic plots can be recovered using:
Plot > Fit diagnostics
Confidence intervals for the parameters are estimated using the profile
likelihood method:
Analyze > Parameter Confidence Intervals > GEV fit
The numerical results are:
*****
[1] "Estmating CIs for GEV 50-yr. return level and shape parameter (xi)."
[1] "Estimated return level = 267.1967"
[1] "Estimated (MLE) shape parameter = -0.1522"
[1] "50-year return level: 95% confidence interval approximately"
[1] "(253.91618, 295.71891)"
[1] "shape parameter (xi): 95% confidence interval approximately"
[1] "(-0.34134, 0.65109)"
*****
The profile plots are
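As a cross-check, the 50-year return level can be recomputed from the maximum likelihood estimates with the quantile formula of Section 2.3, and a normal-approximation (delta method) interval can be derived from the printed covariance matrix. The sketch below copies the estimates and covariance from the output above; the gradient expression is the standard delta-method derivative of $z_p$ and is not part of the software output, so the resulting interval is only an illustration of the normal approximation discussed in Section 2.4.

```python
import math

# Maximum likelihood estimates and covariance matrix as printed in the output above.
mu, sigma, xi = 197.88060, 23.55972, -0.15223
cov = [[12.2742524, 1.68235174, -0.116433152],
       [1.6823517, 6.04824808, -0.098884367],
       [-0.1164332, -0.09888437, 0.008000298]]

p = 1.0 / 50.0
y = -math.log(1.0 - p)
z50 = mu - (sigma / xi) * (1.0 - y ** (-xi))

# Gradient of z_p with respect to (mu, sigma, xi), used by the delta method.
grad = [1.0,
        -(1.0 - y ** (-xi)) / xi,
        sigma * (1.0 - y ** (-xi)) / xi ** 2 - (sigma / xi) * y ** (-xi) * math.log(y)]
var = sum(grad[i] * cov[i][j] * grad[j] for i in range(3) for j in range(3))
se = math.sqrt(var)

print("50-year return level:", round(z50, 2))
print("delta-method standard error:", round(se, 2))
print("normal 95% interval:", (round(z50 - 1.96 * se, 1), round(z50 + 1.96 * se, 1)))
```

The point estimate should agree with the reported value 267.1967 up to rounding of the parameters. The normal interval is symmetric by construction, whereas the profile likelihood interval reported by the software, (253.9, 295.7), is skewed towards high values.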
3. THRESHOLD MODELS
3.1. Model formulation and main result
Modelling only block maxima wastes a lot of data when a detailed record of the studied
phenomenon is available. We now present an alternative analysis that makes more
efficient use of the data. The approach consists in considering for the analysis those
data that are regarded as extreme observations, say, those that exceed a threshold
level u. The stochastic behaviour of these excesses over u is then studied. More
formally, given a sequence $X_1, \ldots, X_n$ of independent and identically distributed
random variables with distribution function F, we are interested in the conditional
probability
$$F_u(y) = P(X \le u + y \mid X > u) = \frac{F(u + y) - F(u)}{1 - F(u)}, \quad y > 0.$$
The following result gives an approximation to this probability for high values of the
threshold u.
Theorem 3. Let $X_1, \ldots, X_n$ be a sequence of independent and identically distributed
random variables with a common distribution function F, and let
$M_n = \max\{X_1, \ldots, X_n\}$ satisfy the conditions to be approximated by a GEV, that is,
for large n:
$$\Pr\{M_n \le z\} \approx G(z), \quad \text{where } G(z) = \exp\left\{-\left[1 + \xi\left(\frac{z-\mu}{\sigma}\right)\right]^{-1/\xi}\right\}.$$
Then, for large enough u, the distribution function of (X – u), conditional on X > u, is
approximately given by the GENERALIZED PARETO DISTRIBUTION
$$H(y) = 1 - \left(1 + \frac{\xi y}{\tilde\sigma}\right)^{-1/\xi},$$
defined on $\{y : y > 0 \text{ and } (1 + \xi y/\tilde\sigma) > 0\}$, where $\tilde\sigma = \sigma + \xi(u - \mu)$.
This result relates the two approaches to studying the distribution of the maximum. We
see that the parameters of the Generalized Pareto Distribution (GPD) are uniquely
determined by the parameters of the associated GEV distribution of block maxima.
Observe that this implies that if we change the block size in the GEV analysis, the
parameter ξ remains unchanged, while the parameters µ and σ change in such a way
that the value of $\tilde\sigma$ stays fixed.
As for the GEV distribution, the parameter ξ is dominant in determining the qualitative
behaviour of the GPD:
• If ξ < 0, then the distribution of excesses has the upper bound $-\tilde\sigma/\xi$ (that is,
X is bounded above by $u - \tilde\sigma/\xi$).
• If ξ > 0, then the distribution is unbounded.
• If ξ = 0, then the distribution is also unbounded and is an exponential distribution
with parameter $1/\tilde\sigma$.
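The GPD distribution function and the relation between the GEV and GPD scale parameters can be written down directly. This is a minimal sketch with hypothetical parameter values and function names; ξ = 0 is handled through the exponential limit.

```python
import math


def gpd_scale_from_gev(u: float, mu: float, sigma: float, xi: float) -> float:
    """Scale of the GPD of excesses over u implied by GEV parameters (mu, sigma, xi)."""
    return sigma + xi * (u - mu)


def gpd_cdf(y: float, scale: float, xi: float) -> float:
    """H(y) for an excess y; xi = 0 gives the exponential distribution."""
    if y <= 0:
        return 0.0
    if xi == 0.0:
        return 1.0 - math.exp(-y / scale)
    t = 1.0 + xi * y / scale
    if t <= 0.0:  # only possible for xi < 0: y lies beyond the upper bound -scale/xi
        return 1.0
    return 1.0 - t ** (-1.0 / xi)


# Hypothetical GEV parameters and threshold.
scale = gpd_scale_from_gev(u=30.0, mu=20.0, sigma=5.0, xi=-0.1)
print("GPD scale:", scale, "upper bound of the excesses:", -scale / -0.1)
for y in (1.0, 5.0, 20.0):
    print(y, round(gpd_cdf(y, scale, -0.1), 4))
```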
3.2. Threshold selection. Mean residual life plot
Let $\{x_1, \ldots, x_n\}$ be the original data and let us consider as extreme events those that
exceed a threshold u, say $x_{(1)}, \ldots, x_{(k)}$. We denote the excesses over the threshold by
$y_j = x_{(j)} - u$. By the previous theorem, when the threshold u is large enough,
the values $y_j$ can be viewed as independent realizations of a variable distributed
according to a GPD, whose parameters have to be estimated and the model then
validated.
The issue of how to choose the threshold is similar to that of selecting the size of a
block, in the sense that both imply a balance between bias and variance. A threshold
that is too low leads to failure of the asymptotic approximation of the model, while one
that is too high provides few observations and hence high variance.
A method to help in the choice of the threshold is based on the mean of the GPD: if Y is
a random variable following a GPD with parameters σ and ξ, then $E(Y) = \sigma/(1 - \xi)$
when ξ < 1. Otherwise the mean is infinite.
If the model is valid for a threshold $u_0$, then it is also valid for all thresholds u greater
than $u_0$. The means in both cases are:
$$e(u_0) = E(X - u_0 \mid X > u_0) = \frac{\tilde\sigma_{u_0}}{1 - \xi}$$
$$e(u) = E(X - u \mid X > u) = \frac{\tilde\sigma_u}{1 - \xi} = \frac{\tilde\sigma_{u_0} + \xi (u - u_0)}{1 - \xi}$$
Thus, for $u > u_0$, $e(u) = E(X - u \mid X > u)$ is a linear function of u. Based on this
result, the procedure to choose the threshold is as follows:
• Build the mean residual life plot, by representing the points
$$\left\{\left(u,\ \frac{1}{n_u} \sum_{i=1}^{n_u} (x_{(i)} - u)\right) : u < x_{\max}\right\},$$
where $x_{(1)}, \ldots, x_{(n_u)}$ are the $n_u$ observations exceeding u and $x_{\max}$ is the
maximum observation in the data set.
• Choose as threshold the value above which the plot is approximately linear in u.
Adding confidence intervals to the plot can help to determine this point.
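The points of the mean residual life plot are simple to compute. The sketch below uses simulated exponential data (for which ξ = 0, so the mean excess should be roughly constant in u); the data, the grid of thresholds and the function name are assumptions of this example.

```python
import random


def mean_residual_life(data: list, thresholds: list) -> list:
    """Points (u, mean excess over u) for each threshold u below the sample maximum."""
    points = []
    for u in thresholds:
        excesses = [x - u for x in data if x > u]
        if excesses:  # thresholds at or above the maximum give no point
            points.append((u, sum(excesses) / len(excesses)))
    return points


random.seed(0)
# Hypothetical data: exponential observations with mean 1.
data = [random.expovariate(1.0) for _ in range(5000)]
for u, e_u in mean_residual_life(data, [0.0, 0.5, 1.0, 1.5, 2.0]):
    print(u, round(e_u, 3))
```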
3.3. Parameter estimation
Once the threshold has been chosen, the next step is to estimate the parameters of the
GPD, for example by maximum likelihood. If we denote by $y_1, \ldots, y_k$ the k excesses
over the threshold, the log-likelihood function, in the case that ξ is not zero, is:
$$\ell(\sigma, \xi) = -k \log \sigma - \left(1 + \frac{1}{\xi}\right) \sum_{i=1}^{k} \log\left(1 + \frac{\xi y_i}{\sigma}\right),$$
when $(1 + \xi y_i/\sigma) > 0$ for i = 1, …, k; otherwise $\ell(\sigma, \xi) = -\infty$.
In the case ξ = 0 the log-likelihood is
$$\ell(\sigma) = -k \log \sigma - \sigma^{-1} \sum_{i=1}^{k} y_i.$$
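The GPD log-likelihood can be coded in the same way as the GEV one. In this sketch the function name, the tolerance used to switch to the ξ = 0 expression and the sample excesses are hypothetical. For ξ = 0 the maximum over σ is attained at the sample mean of the excesses, which gives a simple check.

```python
import math


def gpd_neg_log_likelihood(scale: float, xi: float, excesses: list) -> float:
    """Negative GPD log-likelihood; returns +inf outside the admissible region."""
    if scale <= 0:
        return math.inf
    k = len(excesses)
    if abs(xi) < 1e-8:  # exponential case
        return k * math.log(scale) + sum(excesses) / scale
    t = [1.0 + xi * y / scale for y in excesses]
    if min(t) <= 0:
        return math.inf
    return k * math.log(scale) + (1.0 + 1.0 / xi) * sum(math.log(ti) for ti in t)


# Hypothetical excesses over a threshold; their mean is 1.0.
y = [0.5, 1.2, 2.0, 0.3, 1.0]
for scale in (0.8, 1.0, 1.2):
    print(scale, round(gpd_neg_log_likelihood(scale, 0.0, y), 4))
```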
Return levels
To calculate the return levels, we first need an expression for the unconditional
distribution of X. Writing $\delta_u = \Pr(X > u)$, and from the conditional distribution
$$\Pr\{X > x \mid X > u\} = \left[1 + \xi\left(\frac{x - u}{\sigma}\right)\right]^{-1/\xi}, \quad x > u,$$
we obtain that
$$\Pr\{X > x\} = \delta_u \left[1 + \xi\left(\frac{x - u}{\sigma}\right)\right]^{-1/\xi}.$$
Hence, the level $x_m$ that is exceeded on average once every m observations is the
solution of
$$\delta_u \left[1 + \xi\left(\frac{x_m - u}{\sigma}\right)\right]^{-1/\xi} = \frac{1}{m},$$
which is
$$x_m = u + \frac{\sigma}{\xi}\left[(m \delta_u)^{\xi} - 1\right].$$
This expression is valid for values of m large enough that $x_m > u$.
In the case ξ = 0 the return level is $x_m = u + \sigma \log(m \delta_u)$, again for m large
enough.
The estimation of these return levels requires replacing the parameters by their
estimates. For the probability $\delta_u = \Pr(X > u)$, the maximum likelihood estimator is the
sample proportion of observations over the threshold u, that is, $\hat\delta_u = k/n$.
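The return level formula can be sketched as follows. All numerical values (threshold, scale, shape, exceedance proportion and the assumption of 365 observations per year) are hypothetical, as is the function name.

```python
import math


def gpd_return_level(m: float, u: float, scale: float, xi: float, delta_u: float) -> float:
    """Level exceeded on average once every m observations (requires m * delta_u > 1)."""
    if m * delta_u <= 1.0:
        raise ValueError("m is too small: the return level would not exceed u")
    if xi == 0.0:
        return u + scale * math.log(m * delta_u)
    return u + (scale / xi) * ((m * delta_u) ** xi - 1.0)


# Hypothetical fit: threshold 30, 40 exceedances among 1000 daily observations.
u, scale, xi, delta_u = 30.0, 5.0, -0.1, 40 / 1000
for years in (10, 50, 100):
    m = 365 * years  # number of observations in the return period
    print(years, round(gpd_return_level(m, u, scale, xi, delta_u), 2))
```

For an N-year return level, m is the number of observations in N years, so the number of observations per year has to be supplied by the analyst.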
3.4. Model checking
Another tool to help in the choice of threshold u
As noted before, when the GPD is a valid model for a threshold $u_0$, it is also a
valid model for any $u > u_0$. At both levels the parameter ξ is the same, and the scale
parameters are related by $\sigma_u = \sigma_{u_0} + \xi(u - u_0)$. Thus, the reparameterized scale
$\sigma^* = \sigma_u - \xi u$ is constant with respect to u. Consequently, the estimates of $\sigma^*$
and ξ should be approximately constant above $u_0$ when $u_0$ is a valid threshold. This
argument suggests plotting the estimates of $\sigma^*$ and ξ against u, together with
confidence intervals, and selecting $u_0$ as the lowest value of u for which the estimates
remain near-constant.
Probability plots, quantile plots and return level plots are used for assessing the quality
of a fitted generalized Pareto model. Assuming a threshold u, ordered excesses
$y_{(1)} \le \cdots \le y_{(k)}$ and an estimated model $\hat H$ for the GPD, we have:
Probability plot. It represents the points $\{(i/(k+1),\ \hat H(y_{(i)})),\ i = 1, \ldots, k\}$.
Quantile plot. It represents the points $\{(\hat H^{-1}(i/(k+1)),\ y_{(i)}),\ i = 1, \ldots, k\}$.
When the model is valid, the points in both plots lie close to the unit diagonal.
When $\hat\xi \ne 0$ the estimates are:
$$\hat H(y) = 1 - \left(1 + \frac{\hat\xi y}{\hat\sigma}\right)^{-1/\hat\xi} \quad \text{and} \quad \hat H^{-1}(p) = \frac{\hat\sigma}{\hat\xi}\left[(1 - p)^{-\hat\xi} - 1\right].$$
When $\hat\xi = 0$ the expressions are:
$$\hat H(y) = 1 - \exp\left(-\frac{y}{\hat\sigma}\right) \quad \text{and} \quad \hat H^{-1}(p) = -\hat\sigma \ln(1 - p).$$
Return level plot. It represents the points $\{(m, \hat x_m)\}$, where, as we have seen before,
for $\hat\xi \ne 0$
$$\hat x_m = u + \frac{\hat\sigma}{\hat\xi}\left[(m \hat\delta_u)^{\hat\xi} - 1\right],$$
and for the case $\hat\xi = 0$ the return level is $\hat x_m = u + \hat\sigma \log(m \hat\delta_u)$.
Recall that $\hat x_m$ is the estimated value that is exceeded on average once every m
observations.
3.5. Case study. Hourly average wind data from Schiphol in the Netherlands.
We analyse the Schiphol data using the threshold method. We consider the series of
monthly maximum wind speeds.
[Figure: Monthly maximum speed (Schiphol). Horizontal axis: Index (1 to 670); vertical axis: speedmaxmonth (100 to 300).]
We analyse these data using the extRemes package of R.
Descriptive statistics
Monthly max. speed data
N    mean    Std.Dev  min  Q1   median  Q3   max
670  149.69  33.35    87   125  146     172  280
Threshold selection. To select the threshold we look at the mean residual life plot. Recall
that a threshold that is too low will give biased parameter estimates, whereas if the
selected threshold is too high only a few values will be used and, as a result, the
estimated parameters will have a large variance.
We can observe a decreasing linear trend between approximately 185 and 255 (this
decreasing trend indicates a negative value for the parameter ξ). We choose as
threshold u the value 185.
We fit the Generalized Pareto Distribution and get the following results:
Convergence successfull!
[1] "Threshold = 185"
[1] "Number of exceedances of threshold = 99"
[1] "Exceedance rate (per year)= 1.77"
[1] "Maximum Likelihood Estimates:"
               MLE         Std. Err.
Scale (sigma): 25.6462453  3.7656845
Shape (xi):    -0.1298863  0.1078369
[1] "Negative log-likelihood: 407.336"
Parameter covariance:
     [,1]        [,2]
[1,] 14.1803800  -0.32751390
[2,] -0.3275139   0.01162880
We also use, as a threshold selection tool, the fit of the GPD over a range of thresholds,
which, as described above, requires fitting the GPD several times, each time using a
different threshold. We have to find the lowest threshold above which the estimates of
both parameters remain approximately constant.
It seems that a value somewhat greater than 185 would be more appropriate. We
therefore repeat the analysis with a threshold of 210.
The new results are:
Convergence successfull!
[1] "Threshold = 210"
[1] "Number of exceedances of threshold = 33"
[1] "Exceedance rate (per year)= 0.591"
[1] "Maximum Likelihood Estimates:"
               MLE         Std. Err.
Scale (sigma): 28.7074116  6.8613851
Shape (xi):    -0.3147829  0.1727022
[1] "Negative log-likelihood: 133.3984"
Parameter covariance:
     [,1]       [,2]
[1,] 47.078606  -1.04385528
[2,] -1.043855   0.02982605
Model checking
The estimate and confidence interval for the return level are:
Estmating CIs for GPD 100-yr. return level.
Using 12 days per year.
Estimated 100-yr. return level = 276.2093
100-year return level: 95% confidence interval approximately (264.44, 320.87)
We can also estimate a confidence interval for the shape parameter ξ:
Estmating CIs for GPD shape parameter (xi).
Estimated (MLE) shape parameter 0.2721
shape parameter (xi): 95% confidence interval approximately (-0.4667, 0.0589)
The main objective of this type of analysis is the estimation of the return levels.
For the 50-year return level we have obtained:
Estmating CIs for GPD 50-yr. return level.
Using 12 days per year.
Estimated 50-yr. return level = 269.611
50-year return level: 95% confidence interval approximately (258.20376, 301.5432)
We can compare these results with those obtained with the block maxima analysis using
the GEV distribution:
"Estimated return level = 267.1967"
[1] "50-year return level: 95% confidence interval approximately"
[1] "(253.91618, 295.71891)"
4. EXTREMES OF DEPENDENT SEQUENCES
4.1 Stationarity and limited long-range dependence
In the models studied so far we supposed that the observations come from a sequence
of independent random variables. In real applications this assumption is unrealistic
because some dependence over time is usually observed. For example, in wind speed
records it is natural to find high positive correlation among consecutive hourly
observations. The next figure shows the autocorrelation function for the wind
series of Schiphol (hourly average wind) that we used in the previous sections.
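The autocorrelation function mentioned above can be computed along the following lines. This is a minimal sketch, not the procedure used for the figure: the Schiphol series is not reproduced here, so a simulated AR(1) series (an assumption made purely for illustration) stands in for strongly correlated hourly data, and the name `sample_acf` is hypothetical.

```python
import random


def sample_acf(x, max_lag):
    """Sample autocorrelation of x at lags 0..max_lag (biased estimator)."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n
    acf = []
    for lag in range(max_lag + 1):
        ck = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / n
        acf.append(ck / c0)
    return acf


# Illustrative stand-in for an hourly series: AR(1) with coefficient 0.9.
random.seed(1)
phi = 0.9
series = [0.0]
for _ in range(4999):
    series.append(phi * series[-1] + random.gauss(0.0, 1.0))

acf = sample_acf(series, 24)  # autocorrelations up to a lag of 24 hours
```

A slowly decaying sequence of positive values in `acf` is the kind of pattern that signals dependence between neighbouring observations.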
We now generalize from a sequence of independent random variables to a stationary
series. Stationarity corresponds to a series whose stochastic behaviour is
homogeneous through time but whose variables may be mutually dependent.
Definition. A random process $X_1, X_2, \ldots$ is said to be stationary if, given any set of
integers $\{i_1, \ldots, i_k\}$ and any integer $m$, the joint distributions of
$(X_{i_1}, \ldots, X_{i_k})$ and $(X_{i_1+m}, \ldots, X_{i_k+m})$ are identical.
That is, the joint distribution of any set of variables remains unchanged when they are
viewed $m$ time units later. Stationarity therefore allows the variables to have a
dependence structure but excludes trends, seasonality and other deterministic cycles.
To obtain theoretical results for the analysis of extremes of stationary
sequences, it is usual to assume a condition that limits the extent of long-range
dependence at extreme levels. We will assume that the events $X_i > u$ and $X_j > u$ are
approximately independent when the threshold level $u$ is high enough and the
time points $i$ and $j$ are far apart.
Many physical phenomena satisfy this property. In our wind speed example it means
that a high wind today might influence the probability of an extreme wind tomorrow,
perhaps because both are caused by the passage of the same storm, but it is unlikely
to influence the probability of an extreme wind one month later.
The following condition formalizes the notion of extreme events being near-
independent if they are sufficiently distant in time.
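$D(u_n)$ condition. A stationary series $X_1, X_2, \ldots$ is said to satisfy the $D(u_n)$ condition if
for all $i_1 < \cdots < i_p < j_1 < \cdots < j_q$ with $j_1 - i_p > k$,
$$\left| \Pr\{X_{i_1} \le u_n, \ldots, X_{i_p} \le u_n, X_{j_1} \le u_n, \ldots, X_{j_q} \le u_n\} - \Pr\{X_{i_1} \le u_n, \ldots, X_{i_p} \le u_n\} \Pr\{X_{j_1} \le u_n, \ldots, X_{j_q} \le u_n\} \right| \le \alpha(n, k),$$
where $\alpha(n, k_n) \to 0$ for some sequence $k_n$ satisfying $k_n / n \to 0$ as $n \to \infty$.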
Observe that for independent sequences the difference is always 0. For the results of the
next section the condition needs to be satisfied only for a threshold $u_n$ that increases
with $n$. In this way we ensure that extreme observations that are sufficiently far apart
are almost independent.
4.2 Limit result for the maximum of a stationary process
The following result establishes that the limiting distribution of the normalized
maximum of a stationary process is in the family of generalized extreme value distributions.
Theorem 4. Let $X_1, X_2, \ldots$ be a stationary process and $M_n = \max\{X_1, \ldots, X_n\}$. If there
exist sequences of constants $\{a_n > 0\}$ and $\{b_n\}$ such that
$$\Pr\{(M_n - b_n)/a_n \le z\} \to G(z) \quad \text{as } n \to \infty,$$
where $G$ is a non-degenerate distribution function and the $D(u_n)$ condition is satisfied
with $u_n = a_n z + b_n$ for every real $z$, then $G$ is a member of the generalized extreme value
family of distributions.
Observe that this result implies that when the stationary series has limited long-range
dependence at extreme levels, the maxima follow the same limit laws as in the case of
independent series. Furthermore, there exists a relationship between both distributions.
Theorem 5. Let $X_1, X_2, \ldots$ be a stationary process and $X_1^*, X_2^*, \ldots$ be a sequence of
independent variables with the same marginal distribution. Let $M_n = \max\{X_1, \ldots, X_n\}$
and $M_n^* = \max\{X_1^*, \ldots, X_n^*\}$. Under suitable regularity conditions, there exist sequences
of constants $\{a_n > 0\}$ and $\{b_n\}$ such that
$$\Pr\{(M_n^* - b_n)/a_n \le z\} \to G_1(z) \quad \text{as } n \to \infty$$
if and only if
$$\Pr\{(M_n - b_n)/a_n \le z\} \to G_2(z) \quad \text{as } n \to \infty,$$
where $G_2(z) = G_1^{\theta}(z)$ for some constant $\theta$ with $0 < \theta \le 1$.
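From the relationship between both distributions, it is easy to see that both have the
same shape parameter $\xi$. If $G_1$ is a GEV distribution with parameters $(\mu, \sigma, \xi)$, then
$G_2 = G_1^{\theta}$ is a GEV distribution with parameters $(\mu^*, \sigma^*, \xi)$, where
when $\xi \neq 0$: $\mu^* = \mu - \frac{\sigma}{\xi}\left(1 - \theta^{\xi}\right)$ and $\sigma^* = \sigma \theta^{\xi}$
when $\xi = 0$: $\mu^* = \mu + \sigma \log\theta$ and $\sigma^* = \sigma$
The second case is the limit of the first as $\xi \to 0$, since $(1 - \theta^{\xi})/\xi \to -\log\theta$.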
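The relationship $G_2 = G_1^{\theta}$ can be checked numerically. The sketch below assumes the standard GEV distribution function and the parameter mapping $\mu^* = \mu - (\sigma/\xi)(1 - \theta^{\xi})$, $\sigma^* = \sigma\theta^{\xi}$ (with $\mu^* = \mu + \sigma\log\theta$, $\sigma^* = \sigma$ when $\xi = 0$). The function names and the parameter values are illustrative assumptions, not estimates from the text.

```python
import math


def gev_cdf(z, mu, sigma, xi):
    """GEV distribution function G(z) with location mu, scale sigma, shape xi."""
    if xi == 0.0:
        return math.exp(-math.exp(-(z - mu) / sigma))
    t = 1.0 + xi * (z - mu) / sigma
    if t <= 0.0:
        # Outside the support: below it when xi > 0, above it when xi < 0.
        return 0.0 if xi > 0 else 1.0
    return math.exp(-t ** (-1.0 / xi))


def powered_gev_params(mu, sigma, xi, theta):
    """Parameters (mu*, sigma*, xi) of G**theta when G is GEV(mu, sigma, xi)."""
    if xi == 0.0:
        return mu + sigma * math.log(theta), sigma, xi
    return mu - sigma / xi * (1.0 - theta ** xi), sigma * theta ** xi, xi


# Illustrative values only.
mu, sigma, xi, theta = 100.0, 20.0, -0.15, 0.4
star = powered_gev_params(mu, sigma, xi, theta)

# Largest discrepancy between G1(z)**theta and the GEV with starred parameters.
max_diff = max(
    abs(gev_cdf(z, mu, sigma, xi) ** theta - gev_cdf(z, *star))
    for z in range(60, 221, 20)
)
```

If the mapping is right, `max_diff` is of the order of floating-point rounding error.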
The quantity $\theta$ is called the extremal index. This index can be interpreted in terms of
the propensity of the process to cluster at extreme levels. Loosely speaking, we have
$$\theta = (\text{limiting mean cluster size})^{-1},$$
where limiting is in the sense of clusters of exceedances of increasingly high thresholds.
4.3 Models for block maxima.
The distribution of the block maxima, when the $D(u_n)$ condition is satisfied, falls in the
same family of distributions as if the series were independent. This means that dependence
in the data can be ignored and the block maxima can be modelled as before, when we assumed
independence. The only caveat is that, because $M_n$ has statistical properties similar to
those of $M^*_{n\theta}$ (the maximum of $n\theta$ independent observations), the quality of the
GEV family as an approximation to the distribution of block maxima is diminished.
4.4 Threshold models.
The generalized Pareto distribution remains appropriate for threshold excesses but the
methodology needs to be adapted because the extremes have some tendency to cluster,
violating the assumption of independence among the individual excesses.
The most commonly used method for dealing with the problem of dependent
exceedances in the threshold exceedance model is declustering. This process
filters the dependent observations to obtain a set of threshold
excesses that are approximately independent.
[Figure: clusters C1 to C6 of threshold exceedances, with the gap length marked.]
The steps for the analysis of stationary series by the threshold method are the following:
• Use an empirical rule to define clusters of exceedances. For example, all
observations over a threshold belong to the same cluster if the runs of observations
below the threshold between them have a length less than a certain value k. In the
figure we have six clusters of exceedances when k is taken equal to the length of the
gap shown.
• Identify the maximum excess within each cluster.
• Assume that the cluster maxima are independent, with conditional excess
distribution given by the generalized Pareto distribution.
• Fit the generalized Pareto distribution to the cluster maxima.
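The first two steps above can be sketched as follows. This is a minimal illustration of runs declustering, assuming that a new cluster starts whenever at least `r` consecutive observations fall at or below the threshold; the function name `runs_decluster` and the toy data are hypothetical and are not taken from any package.

```python
def runs_decluster(x, u, r):
    """Group the indices of exceedances of u into clusters.

    Two consecutive exceedances belong to the same cluster when fewer than
    r observations at or below the threshold lie between them.
    """
    clusters = []
    last = None  # index of the most recent exceedance
    for i, value in enumerate(x):
        if value > u:
            if last is None or i - last - 1 >= r:
                clusters.append([])  # gap long enough: start a new cluster
            clusters[-1].append(i)
            last = i
    return clusters


# Toy series, threshold and run length (illustrative values only).
x = [1, 5, 6, 1, 7, 1, 1, 1, 8, 1, 9]
u, r = 4, 2

clusters = runs_decluster(x, u, r)
cluster_maxima = [max(x[i] for i in c) for c in clusters]
n_u = sum(len(c) for c in clusters)  # number of exceedances
n_c = len(clusters)                  # number of clusters
theta_hat = n_c / n_u                # ratio of clusters to exceedances
```

The values in `cluster_maxima`, minus the threshold, are the excesses to which the generalized Pareto distribution would then be fitted.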
Return level. The return level associated with the probability $1/m$ is
$$x_m = u + \frac{\sigma}{\xi}\left[(m \delta_u \theta)^{\xi} - 1\right],$$
where $\sigma$ and $\xi$ are the parameters of the threshold excess
generalized Pareto distribution, $\delta_u$ is the probability of an exceedance of $u$, and $\theta$ is the
extremal index.
Denoting the total number of observations by $n$, the number of exceedances above the
threshold $u$ by $n_u$ and the number of clusters obtained above $u$ by $n_c$, the parameters
$\delta_u$ and $\theta$ are estimated as
$$\hat{\delta}_u = \frac{n_u}{n} \quad \text{and} \quad \hat{\theta} = \frac{n_c}{n_u}.$$
$\hat{\delta}_u$ and $\hat{\theta}$ are the maximum likelihood estimators of $\delta_u$ and $\theta$, respectively. Then, when
the parameters of the generalized Pareto distribution, $\xi$ and $\sigma$, are estimated by the
maximum likelihood method,
$$\hat{x}_m = u + \frac{\hat{\sigma}}{\hat{\xi}}\left[(m \hat{\delta}_u \hat{\theta})^{\hat{\xi}} - 1\right]$$
is the maximum likelihood estimator of the return level $x_m$.
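The return level formula translates directly into code. The sketch below assumes the formula $x_m = u + (\sigma/\xi)[(m\delta_u\theta)^{\xi} - 1]$, with the limiting form $u + \sigma\log(m\delta_u\theta)$ when $\xi = 0$. The function name and all numerical inputs are hypothetical and chosen only for illustration.

```python
import math


def return_level(m, u, sigma, xi, delta_u, theta):
    """m-observation return level for a declustered threshold model."""
    rate = m * delta_u * theta  # expected number of clusters in m observations
    if xi == 0.0:
        return u + sigma * math.log(rate)
    return u + sigma / xi * (rate ** xi - 1.0)


# Hypothetical counts: n observations, n_u exceedances, n_c clusters.
n, n_u, n_c = 100000, 400, 100
delta_hat = n_u / n    # estimated probability of exceeding u
theta_hat = n_c / n_u  # estimated extremal index

# Hypothetical GPD estimates and threshold.
u, sigma_hat, xi_hat = 170.0, 25.0, -0.1

# 50-year return level with 8766 hourly observations per year.
m = 50 * 8766
x_50 = return_level(m, u, sigma_hat, xi_hat, delta_hat, theta_hat)
```

Note that the product $\hat{\delta}_u\hat{\theta}$ reduces to $n_c/n$, so in practice only the number of clusters and the length of the series enter the estimate.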
4.5 Case study. Hourly average wind data from Schiphol in the Netherlands.
We analyse the full Schiphol series using the threshold method for stationary
processes. The descriptive statistics for the data are:
Descriptive statistics for the hourly data
N                489504.0
mean             53.2
Std.Dev.         30.9
min              0.0
Q1               31.0
median           48.0
Q3               71.0
max              280.0
missing values   0.0
[Figures: histogram of DatosSchiphol$UP (frequency); plot of the complete series (speed m/s against Index).]
The previous figures show the histogram, the plot of the complete series and a zoom of
part of this series.
Declustering the data. The first step of the analysis is to decluster the data in order to
eliminate the dependence among neighbouring extreme observations. The procedure
assumes that exceedances belong to the same cluster if they are separated by fewer
than 'r' (run length) values below a given threshold. In our case we choose a threshold
of 170 and a run length of 96 (that is, 4 days of hourly observations).
As a result, 222 clusters are obtained above the threshold 170.
[1] "Declustering ..."
[1] "declustering performed for:"
[1] "UP and assigned to UP.u170r96dc"
[1] "222 clusters using threshold of 170 and r = 96"
A new data vector is generated with the same length as the original data vector, but with
the maximum of each cluster followed by 'filler' numbers that are below the given
threshold, 'u'.
The following threshold analysis is carried out on this new data vector.
Fitting declustered data to a generalized Pareto distribution
We fit a GPD to the declustered data using a threshold level of 190. Observe that the
number of observations per year is 8766, which corresponds to 365.25 x 24.
We obtain the following results.
An initial estimation of the parameters by the L-moments method:
L-moments estimates for (stationary) GPD are:
scale: 23.63865 shape: -0.1041125
These L-moments estimators were used as initial parameter estimates.
A hypothesis test for $\xi = 0$:
Likelihood ratio test (5% level) for xi=0 does not reject Exponential hypothesis.
likelihood ratio statistic is 1.435923 < 3.841459 1 df chi-square critical value.
p-value for likelihood-ratio test is 0.2308003
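The p-value of this likelihood ratio test can be reproduced by hand. The sketch below relies on the fact that, for a chi-square variable with 1 degree of freedom, the upper tail probability is $\Pr\{\chi^2_1 > x\} = \mathrm{erfc}(\sqrt{x/2})$. The statistic and critical value are copied from the output above; the variable names are illustrative.

```python
import math

lr_statistic = 1.435923    # likelihood ratio statistic reported above
critical_value = 3.841459  # 5% critical value, chi-square with 1 df

# Upper tail probability of a chi-square(1) variable.
p_value = math.erfc(math.sqrt(lr_statistic / 2.0))

# The exponential hypothesis (xi = 0) is rejected only if the statistic
# exceeds the critical value.
reject_exponential = lr_statistic > critical_value

# Sanity check: the critical value should leave 5% in the upper tail.
tail_at_critical = math.erfc(math.sqrt(critical_value / 2.0))
```

The computed `p_value` should agree with the reported 0.2308 up to rounding.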
Information about exceedances
[1] "Threshold = 190"
[1] "Number of exceedances of threshold = 83"
[1] "Exceedance rate (per year)= 1.4863"
The results for the maximum likelihood estimation
[1] "Maximum Likelihood Estimates:"
MLE Std. Err.
Scale (sigma): 24.60 3.766
Shape (xi): -0.1477 0.1079
[1] "Negative log-likelihood: 336.5808"
And the estimate of the variance-covariance matrix
Parameter covariance:
           [,1]        [,2]
[1,] 14.1831957 -0.32382041
[2,] -0.3238204  0.01164461
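The standard errors reported with the estimates are the square roots of the diagonal of this covariance matrix. The sketch below recomputes them and, as an additional illustration not taken from the text, forms a symmetric Wald 95% interval for $\xi$ based on the normal approximation. Such an interval generally differs from a profile likelihood interval, which need not be symmetric.

```python
import math
from statistics import NormalDist

# Estimates and covariance matrix copied from the output above.
sigma_hat, xi_hat = 24.60, -0.1477
cov = [[14.1831957, -0.32382041],
       [-0.3238204, 0.01164461]]

# Standard errors: square roots of the diagonal entries.
se_sigma = math.sqrt(cov[0][0])
se_xi = math.sqrt(cov[1][1])

# Correlation between the two estimators.
corr = cov[0][1] / (se_sigma * se_xi)

# Wald 95% interval for the shape parameter (normal approximation).
z = NormalDist().inv_cdf(0.975)
wald_ci_xi = (xi_hat - z * se_xi, xi_hat + z * se_xi)
```

The negative value of `corr` reflects the usual trade-off between the scale and shape estimates in a GPD fit.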
Diagnostic plots show a reasonably good fit.
Graphical tools for determining the threshold
Mean residual plot
Linearity of parameter estimations
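The quantity drawn in a mean residual plot is the empirical mean excess over each candidate threshold. For a generalized Pareto model with $\xi < 1$ the mean excess is linear in the threshold, so one looks for the level above which the plot is approximately straight. A minimal sketch, with a hypothetical function name and toy data:

```python
def mean_excess(x, thresholds):
    """Empirical mean of (x - u) over the observations exceeding each u."""
    result = []
    for u in thresholds:
        excesses = [v - u for v in x if v > u]
        # No exceedances: the mean excess is undefined for this threshold.
        result.append(sum(excesses) / len(excesses) if excesses else float("nan"))
    return result


# Toy data and candidate thresholds (illustrative values only).
data = [1, 2, 3, 4, 5]
me = mean_excess(data, [2, 3])
```

Plotting `me` against the thresholds gives the mean residual plot. In practice it would be computed on the declustered series over a fine grid of thresholds.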
Estimation of return levels
We use the profile likelihood method to obtain confidence intervals for the parameter ξ
and for the return level of 50 years.
The package extRemes of R provides the following result:
Estmating CIs for GPD 50-yr. return level and shape parameter (xi).
Using 8766 days per year.
Estimated 50-yr. return level = 268.4197
Estimated (MLE) shape parameter = -0.1477
50-year return level: 95% confidence interval approximately
(255.396, 305.700)
shape parameter (xi): 95% confidence interval approximately
(-0.32207, 0.11549)
Changing the confidence level to 90% we obtain these intervals:
50-year return level: 90% confidence interval approximately
(256.975, 296.137)
shape parameter (xi): 90% confidence interval approximately
(-0.29791, 0.06547)
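As a rough arithmetic check, the point estimate of the 50-year return level can be reproduced by hand from the rounded estimates reported above. This sketch assumes that the return level is computed from the yearly rate of cluster maxima above the threshold, that is, 83 exceedances in $489504/8766 \approx 55.84$ years, or about 1.4863 per year:

$$\hat{x}_{50} = u + \frac{\hat{\sigma}}{\hat{\xi}}\left[(50 \times 1.4863)^{\hat{\xi}} - 1\right] = 190 + \frac{24.60}{-0.1477}\left[74.315^{-0.1477} - 1\right] \approx 190 + 78.4 = 268.4$$

This agrees with the reported value 268.4197 up to the rounding of the inputs.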
5. BIBLIOGRAPHY
1. Castillo, E.; Hadi, A. S.; Balakrishnan, N. and Sarabia, J. M. (2005). Extreme Value
and Related Models with Applications in Engineering and Science. Wiley New
Jersey
2. Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values.
Springer London.
3. Eric Gilleland, Rick Katz and Greg Young (2004). extRemes: Extreme value
toolkit. R package version 1.59. http://www.assessment.ucar.edu/toolkit/
4. Ferro CAT and Segers J (2003) Inference for clusters of extreme values. Journal of
the Royal Statistical Society B 65, 545-556.
5. Gross, J.; Heckert, A.; Lechner, J.; Simiu, E. (1994). Novel Extreme value
estimation procedures: application to extreme wind data. In J. Galambos et al. (eds.),
Extreme Value Theory and Applications, 139-158. Kluwer Academic Publishers.
Netherlands.
6. Perrin, O.; Rootzén, H.; Taesler, R. (2006) A discussion of statistical methods used
to estimate extreme wind speeds. Theor. Appl. Climatol. 85, 203–215.
7. R Development Core Team (2008). R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-
900051-07-0, URL http://www.R-project.org.
8. Reiss, R. D.; Thomas, M. (2007). Statistical Analysis of Extreme Values, with
applications to Insurance, Finance, Hydrology and other Fields. Birkhäuser
9. Sacré, C.; Moisselin, J.M.; Sabre, M.; Flori, J.P.; Dubuisson, B. (2007). A new
statistical approach to extreme wind speeds in France. Journal of Wind Engineering
10. Sanabria, L. A., Cechet, R. P. (2007). A Statistical Model of Severe Winds.
Geoscience Australia Record 2007/12, 60p. ISBN: 978 1 921236 43 3
11. Simiu, E. (2002). Meteorological extremes. Encyclopedia of Environmetrics,
Volume 3, Abdel H. El-Shaarawi and Walter W. Piegorsch, eds., 1255-1259. John
Wiley & Sons, Ltd, Chichester, United Kingdom.