TRANSCRIPT
Exploratory analysis for high dimensional extremes: Support identification, Anomaly detection and clustering, Principal Component Analysis
Anne Sabourin1
+ many others: Mael Chiapino (PhD), Stephan Clemencon (Telecom Paris), Holger
Drees (U. Hamburg), Vincent Feuillard (Airbus), Nicolas Goix (PhD), Johan Segers
(UC Louvain)
1 Telecom Paris, Institut polytechnique de Paris, France.
AgroParistech seminar, March 30th, 2020
Outline
Introduction: dimension reduction for multivariate extremes
Sparsity in multivariate extremes (Goix et al., 2016, 2017; Chiapino et al., 2016, 2019a)
Application: extremes/anomalies clustering (Chiapino et al., 2019b)
Principal Component Analysis for extremes (S. and Drees, 20++)
Motivation(s)
• Multivariate heavy-tailed random vector $X = (X_1, \dots, X_d)$ (e.g. spatial field (temperature, precipitation), asset (negative) prices, ...)
• Focus on the distribution of the largest values: $\mathrm{Law}(X \mid \|X\| > t)$, $t \gg 1$, with $P(\|X\| > t)$ small.
Possible goals: simulation (stress tests), anomaly detection (preprocessing) among extreme values, ...
• $d \gg 1$: modeling $\mathrm{Law}(X \mid \|X\| > t)$ is unfeasible.
• Dimension reduction problem(s):
1. Identify the groups of features $\alpha \subset \{1, \dots, d\}$ which may be large together (while the others stay small), given that one of them is large.
2. Identify a single low-dimensional projection subspace $V_0$ such that $\mathrm{Law}(X \mid \|X\| > t)$ is approximately concentrated on $V_0$.
Applications to risk management
Sensor networks (road traffic, river streamflow, temperature, internet traffic, ...) or financial asset prices:
→ extreme event = traffic jam, flood, heatwave, network congestion, falling price
→ question: which groups of sensors / assets are likely to be jointly impacted?
→ how to define alert regions (alert groups of features/components)?
Spatial case: one feature = one sensor.
Applications to anomaly detection
• Training step: learn a 'normal region' (e.g. an approximate support).
• Prediction step (with new data): anomalies = points outside the 'normal region'.
If 'normal' data are heavy tailed, Abnormal ≠ Extreme: there may be extreme 'normal' data.
How to distinguish between large anomalies and normal extremes?
Standardized data
• Random vector $X = (X_1, \dots, X_d)$; $X_j \ge 0$.
• Margins: $X_j \sim F_j$, $1 \le j \le d$ (continuous).
• Preliminary step: standardization (here: to Pareto margins): $V_j = \frac{1}{1 - F_j(X_j)}$, so that $P(V_j > v) = \frac{1}{v}$.
• Goal: $P(V \in A)$, for $A$ 'far from 0'?
• Each component $j$ is homogeneous (of order $-1$): for all $t > 0$,
$$P(V_j \in tA)\,/\,P(V_j > t) = P(V_j \in A).$$
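As a concrete illustration (our addition, not on the slides), a minimal sketch of the rank-based Pareto standardization in Python; the function name is ours:

```python
import numpy as np

def pareto_standardize(X):
    """Rank-transform each column to approximately unit-Pareto margins.

    Empirical version of V_j = 1 / (1 - F_j(X_j)), with the rank-based
    estimate F_j(x) = (rank - 1) / n used on the estimation slide below.
    """
    X = np.asarray(X, dtype=float)
    n, _ = X.shape
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1  # column ranks in 1..n
    return 1.0 / (1.0 - (ranks - 1) / n)                   # values in [1, n]
```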
Multivariate extremes: regular variation
• Informally: the marginal homogeneity property remains valid in the multivariate sense.
• A random vector $V = (V_1, \dots, V_d) \in \mathbb{R}^d$ is regularly varying if there exists a limit measure $\mu$ such that
$$\frac{P(V \in tA)}{P(\|V\| > t)} \xrightarrow[t \to \infty]{} \mu(A)$$
for $A \subset \mathbb{R}^d$ with $0 \notin \mathrm{closure}(A)$ and $\mu(\partial A) = 0$.
• Necessarily $\mu$ is homogeneous: $\mu(rA) = r^{-\alpha}\mu(A)$, for some $\alpha > 0$ (tail index).
• With $V_j = \frac{1}{1 - F_j(X_j)}$, necessarily $\alpha = 1$.
Multivariate extremes: regular variation (Cont'd)
• $\mu$ rules the (probabilistic) behaviour of extremes: if $A$ is far from the origin,
$$P(V \in A) \approx \mu(A).$$
• Examples: max-stable vectors with standardized margins, multivariate Student distributions, ...
• Statistical procedures based on Extreme Value Theory: 2 steps.
1. Learn useful features of $\mu$ using the $k$ observations $V_{(1)}, \dots, V_{(k)}$ with largest norm, with $k \ll n$, the number of available data.
2. Use the approximation $P(V \in A) \approx \mu(A)$ for $A$ far from 0.
Angular measure
• Homogeneity of $\mu$ → polar coordinates are convenient:
$$r = \|x\| \text{ (any norm)}; \quad \theta = r^{-1}x.$$
• Angular measure $\Phi$ on the corresponding sphere: $\Phi(B) = \mu\{r > 1,\ \theta \in B\}$.
• Then $\mu$ decomposes as a product; only $\Phi$ needs to be estimated:
$$\mu\{r > t,\ \theta \in B\} = t^{-\alpha}\,\Phi(B).$$
Angular measure
• $\Phi$ rules the joint distribution of extremes.
• Asymptotic dependence: $(V_1, V_2)$ may be large together,
vs
• Asymptotic independence: only $V_1$ or $V_2$ may be large.
No assumption on $\Phi$: non-parametric framework.
Outline
Introduction: dimension reduction for multivariate extremes
Sparsity in multivariate extremes (Goix et al., 2016, 2017; Chiapino et al., 2016, 2019a)
Application: extremes/anomalies clustering (Chiapino et al., 2019b)
Principal Component Analysis for extremes (S. and Drees, 20++)
Towards high dimension
• Reasonable hope: only a moderate number of the $V_j$'s may be simultaneously large → sparse angular measure.
• Our goal from a MEVT point of view:
Estimate the (sparse) support of the angular measure (i.e. the dependence structure).
Which components may be large together, while the others are small?
Sparse angular support
Full support: anything may happen. Sparse support: e.g. $V_1$ not large if $V_2$ or $V_3$ is large.
Where is the mass?
Subcones of $\mathbb{R}^d_+$: $C_\alpha = \{x \succeq 0 : x_i > 0\ (i \in \alpha),\ x_j = 0\ (j \notin \alpha),\ \|x\| \ge 1\}$, for $\alpha \subset \{1, \dots, d\}$.
Support recovery + representation
• $\{C_\alpha,\ \alpha \subset \{1, \dots, d\}\}$: partition of $\{x : \|x\| \ge 1\}$.
• Goal 1: learn the $(2^d - 1)$-dimensional representation (potentially sparse)
$$\mathcal{M} = \big(\mu(C_\alpha)\big)_{\alpha \subset \{1,\dots,d\},\ \alpha \ne \emptyset}; \quad \text{support } \mathcal{S} = \{\alpha : \mu(C_\alpha) > 0\}.$$
Main interest:
$$\mu(C_\alpha) > 0 \iff \text{features } j \in \alpha \text{ may be large together while the others are small.}$$
Identifying non-empty edges
Issue: real data are non-asymptotic: $V_j > 0$ for all $j$.
Cannot just count data on each subcone: only the largest-dimensional one has empirical mass!
Identifying non-empty edges
Fix $\varepsilon > 0$. Assign data $\varepsilon$-close to an edge to that edge:
$$C_\alpha \to R^\varepsilon_\alpha$$
→ New partition of the input space, compatible with non-asymptotic data.
Empirical estimator of $\mu(C_\alpha)$
(Counts the standardized points in $C^\varepsilon_\alpha$, far from 0.)
Data: $X_i$, $i = 1, \dots, n$, $X_i = (X_{i,1}, \dots, X_{i,d})$.
• Standardize: $\hat{V}_{i,j} = \frac{1}{1 - \hat{F}_j(X_{i,j})}$, with $\hat{F}_j(X_{i,j}) = \frac{\mathrm{rank}(X_{i,j}) - 1}{n}$.
• Natural estimator:
$$\mu_n(C_\alpha) = \frac{n}{k}\, P_n\Big(V \in \frac{n}{k} R^\varepsilon_\alpha\Big) \;\to\; \hat{\mathcal{M}} = \big(\mu_n(C_\alpha),\ \alpha \subset \{1, \dots, d\}\big)$$
• Estimated support: $\hat{\mathcal{S}} = \{\alpha : \mu_n(C_\alpha) > \mu_0\}$.
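A minimal sketch of the natural estimator above (our own illustration; the choice of sup-norm and the function name are assumptions):

```python
import numpy as np
from collections import Counter

def damex_masses(V, k, eps=0.1):
    """Estimate mu_n(C_alpha) by counts on epsilon-thickened subcones.

    V   : (n, d) array of rank-standardized (unit-Pareto) data.
    k   : number of extreme observations, k << n.
    eps : tolerance assigning a point to the faces where it is 'large'.
    """
    n, d = V.shape
    t = n / k                                      # radial threshold n/k
    extremes = V[np.max(V, axis=1) > t]            # points 'far from 0'
    masses = Counter()
    for v in extremes:
        alpha = tuple(np.nonzero(v > eps * t)[0])  # thickened cone R_alpha^eps
        masses[alpha] += 1.0 / k                   # each point carries (n/k)/n
    return dict(masses)                            # alpha -> mu_n(C_alpha)
```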
Sparsity in real datasets
Data = wave directions from 50 buoys in the North Sea. (Shell Research; thanks to J. Wadsworth.)
Finite sample error bound
VC-bound adapted to low-probability regions (see Goix, S., Clemencon, 2015).

Theorem
If the margins $F_j$ are continuous and if the density of the angular measure is bounded by $M > 0$ on each subface (infinity norm), then there is a constant $C$ s.t. for any $n, d, k$, $\delta \ge e^{-k}$, $\varepsilon \le 1/4$, with probability $\ge 1 - \delta$,
$$\max_\alpha |\mu_n(C_\alpha) - \mu(C_\alpha)| \le Cd\left(\sqrt{\frac{1}{k\varepsilon}\log\frac{d}{\delta}} + Md\varepsilon\right) + \mathrm{Bias}_{\frac{n}{k},\varepsilon}(F, \mu).$$
Bias: using non-asymptotic data to learn about an asymptotic quantity.
$$\text{Regular variation} \iff \mathrm{Bias}_{t,\varepsilon} \xrightarrow[t \to \infty]{} 0$$
• Existing literature ($d = 2$): $1/\sqrt{k}$.
• Here: $1/\sqrt{k\varepsilon} + Md\varepsilon$. Price to pay for biasing the estimator with $\varepsilon$.
OK if $\varepsilon k \to \infty$, $\varepsilon \to 0$. Choice of $\varepsilon$: cross-validation, or simply '$\varepsilon = 0.1$'.
Tools for the proof
1. VC inequality for small-probability classes (Goix et al., 2015)
→ maximal deviations ≤ $\sqrt{p}\,\times$ (usual bound), $p$ = probability of the low-probability region.
2. Apply it to the VC-class of rectangles $\{\frac{k}{n}R(x, z, \alpha),\ x, z \succeq \varepsilon\}$
→ $p \le \frac{dk}{\varepsilon n}$, and
$$\sup_\alpha |\mu_n - \mu|(R^\varepsilon_\alpha) \le Cd\sqrt{\frac{1}{\varepsilon k}\log\frac{d}{\delta}}.$$
3. Approximate $\mu(C_\alpha)$ by $\mu(R^\varepsilon_\alpha)$ → error ≤ $Md\varepsilon$ (bounded angular density).
Algorithm DAMEX (Detecting Anomalies with Multivariate Extremes) (Goix, S., Clemencon, 2016)
Anomaly = new observation 'violating the sparsity pattern': observed in an empty or light subcone.
Scoring function: for $x$ whose standardized version $v$ lies in $\frac{n}{k}R^\varepsilon_\alpha$,
$$s_n(x) = \frac{1}{\|v\|}\,\mu_n(R^\varepsilon_\alpha) \underset{x \text{ large}}{\approx} P\big(V \in C_\alpha,\ \|V\| > \|v\|\big).$$
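Continuing the sketch above, a hypothetical reading of the scoring rule; the cone-assignment convention for new points is our assumption, not necessarily the one used in the paper:

```python
import numpy as np

def damex_score(v, masses, n, k, eps=0.1):
    """Anomaly score for a standardized new point v: a small score means v
    falls in an empty or light subcone, i.e. violates the sparsity pattern."""
    t = n / k
    alpha = tuple(np.nonzero(v > eps * t)[0])   # cone containing v (assumed rule)
    return masses.get(alpha, 0.0) / np.max(v)   # s_n(x) = mu_n(R_alpha^eps)/||v||_inf
```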
Extension: feature clustering (Chiapino, S., 2016; Chiapino, S., Segers, 2019a)
• Motivating example: river streamflow dataset, $d = 92$ gauging stations.
• Typical groups jointly impacted by extreme records include noisy additional features!
→ Empirical $\mu$-mass scattered over many $C_\alpha$'s
→ No apparent sparsity pattern.
• How to gather 'close-by' $\alpha$'s into feature clusters? → 'Robust' version of DAMEX: the CLEF algorithm and variants + asymptotic analysis.
Conclusion I
• Discovering subgroups of components that are likely to be simultaneously large is doable, with an error scaling as $1/\sqrt{k}$, $k$: number of extreme observations.
• Two algorithms:
  • DAMEX: easy to implement, linear complexity $O(dn\log n)$, not very robust to weak signals / noisy dependence structure.
  • CLEF: a bit more complex (graph mining, but existing Python packages can help); complexity OK only if the dependence graph is sparse, but more robust.
• Statistical guarantees: non-asymptotic (DAMEX) and asymptotic (CLEF).
• Open questions: optimal choice of tuning parameters (cross-validation is common practice, but there is no theory in an extreme-value context).
Outline
Introduction: dimension reduction for multivariate extremes
Sparsity in multivariate extremes (Goix et al., 2016, 2017; Chiapino et al., 2016, 2019a)
Application: extremes/anomalies clustering (Chiapino et al., 2019b)
Principal Component Analysis for extremes (S. and Drees, 20++)
Application: clustering extreme data (Chiapino et al., 2019b)
• Context: monitoring a high-dimensional system (e.g. air flight data from Airbus, 82 parameters, 18 000 obs), where extremes are of particular interest (associated with anomalies / risk regions).
• Naive idea: use the list of dependent maximal subgroups $\{\alpha_k,\ k \le K\}$ issued from DAMEX/CLEF and the corresponding rectangles of the kind
$$tR^\varepsilon_\alpha = \{x \in \mathbb{R}^d : x_j > t\varepsilon,\ j \in \alpha;\ x_j < t\varepsilon,\ j \notin \alpha;\ \|x\| > t\}.$$
• Issue: in practice many data points fall outside the $tR^\varepsilon_\alpha$'s. How to assign them to a cluster?
Mixture model for extremes
• See the dependent subgroups $\alpha_k \subset \{1, \dots, d\}$, $k \le K$ issued from DAMEX/CLEF as components of a mixture model.
• $Z_k$: hidden indicator variable of component $k$. Conditionally on $\{\|V\| > r_0,\ Z_k = 1\}$,
$$V = V_k + \epsilon_k = R_k W_k + \epsilon_k, \qquad (1)$$
where
• $V_k \in C_{\alpha_k}$,
• $\epsilon_k \in C^\perp_{\alpha_k}$, with i.i.d. Exponential components,
• $R_k = \|V_k\| \sim \mathrm{Pareto}(1)$,
• $W_k = R_k^{-1}V_k \in S_{\alpha_k} \sim \Phi_k$ (angular measure restricted to the $k$-th face: Dirichlet distribution with the $L^1$ norm).
Model for the $k$-th mixture component
$$V = V_k + \epsilon_k = R_k W_k + \epsilon_k, \qquad (2)$$
• Training the mixture model: EM algorithm.
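To make the component model concrete, a sketch of a sampler for equation (2) (our illustration; the Dirichlet parameter and noise rate are placeholder assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_component(alpha_k, d, n, dirichlet_param=1.0, noise_rate=1.0):
    """Draw n points from the k-th mixture component V = R_k W_k + eps_k:
    Pareto(1) radius, Dirichlet angle on the face alpha_k (0-based indices),
    Exponential noise on the remaining coordinates."""
    p = len(alpha_k)
    R = 1.0 + rng.pareto(1.0, size=n)                        # P(R > r) = 1/r
    W = rng.dirichlet(dirichlet_param * np.ones(p), size=n)  # angle on L1-simplex
    V = np.zeros((n, d))
    V[:, alpha_k] = R[:, None] * W                           # V_k = R_k W_k
    rest = [j for j in range(d) if j not in alpha_k]
    V[:, rest] = rng.exponential(1.0 / noise_rate, size=(n, len(rest)))  # eps_k
    return V
```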
Clustering extremes
• After training: each extreme point $V_i$ has probability $p_{i,k}$ of coming from mixture component $k$.
• Similarity measure between $V_i, V_j$:
$$s_{i,j} = P(V_i, V_j \in \text{same component}) = \sum_{k=1}^{K} p_{i,k}\,p_{j,k}$$
→ Similarity matrix $(S_{i,j}) \in [0,1]^{N\times N}$, where $N$ is the number of extreme observations.
• Clustering based on the similarity matrix using off-the-shelf techniques (e.g. spectral clustering).
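A sketch of this last step, assuming the EM posteriors are available as a matrix; scikit-learn's SpectralClustering stands in for the off-the-shelf technique (our illustration, not the authors' code):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_extremes(P, n_clusters):
    """Cluster extreme points from their (N, K) posterior matrix P,
    P[i, k] = p_{i,k}: build S = P P^T (probability that points i and j
    share a mixture component) and use it as a precomputed affinity."""
    S = P @ P.T                                   # S_{ij} = sum_k p_{ik} p_{jk}
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(S)
```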
Some results

Table: Shuttle dataset (9 attributes, 7 classes): purity score, comparison with standard approaches for different extreme sample sizes.

                      n0=500  n0=400  n0=300  n0=200  n0=100
Dirichlet mixture      0.80    0.82    0.82    0.84    0.85
Kmeans                 0.72    0.73    0.75    0.78    0.80
Spectral clustering    0.78    0.77    0.82    0.81    0.80
Flights anomaly clustering + Visual display
Using standard visualization tools (Python package Networkx)
Conclusion II
• DAMEX/CLEF output can be used to perform clustering of extremes.
• Mixture modelling: {mixture components} = output of DAMEX/CLEF.
• Additional layer: building a similarity matrix to perform clustering.
• Question: what if the angular measure is concentrated on an 'oblique' subspace of the central subsphere (only particular linear combinations of all components are likely to be large)? Then CLEF/DAMEX fail, because all the mass is in the central subsphere and no particular structure can be discovered.
→ Idea: perform some sort of PCA on extreme data.
Outline
Introduction: dimension reduction for multivariate extremes
Sparsity in multivariate extremes (Goix et al., 2016, 2017; Chiapino et al., 2016, 2019a)
Application: extremes/anomalies clustering (Chiapino et al., 2019b)
Principal Component Analysis for extremes (S. and Drees, 20++)
PCA for extremes: context and motivation
• $(X_1, \dots, X_d)$: a multivariate random vector with tail index $\alpha > 0$ and limit measure $\mu$.
• Motivating assumption (not necessary):
Hypothesis 1
The vector space $S_0 = \mathrm{span}(\mathrm{supp}\,\mu)$ generated by the support of $\mu$ has dimension $p < d$.
• Purpose of this work: recover $S_0$ from the data, with guarantees concerning the reconstruction error.
Motivating assumption: interpretation
$$\dim(S_0) = p < d;\quad S_0 = \mathrm{span}(\mathrm{supp}(\mu))$$
$$\iff$$
Certain linear combinations are much likelier to be large than others.
Dimension reduction in EVT: quick overview
• Looking for multiple subspaces where $\mu$ concentrates:
  • Chautru, 2015 (clustering + principal nested spheres)
  • Goix et al., 2016, 2017; Chiapino et al. (space partitioning); Simpson et al., 20++ (relaxing the partition)
• Engelke & Hitz, 20++: graphical models
• k-means clustering: Janssen & Wan, 20++
• Dimension reduction in regression analysis: Gardes, 2018
• PCA on a transformed version of the data: Cooley & Thibaud, 20++
Heavy-tailed scarecrow against using PCA for extremes
• 'Classical dimension reduction tools such as PCA fail for multivariate extremes because they require the existence of second moments.'
• Possible answer: since $\mu$ is homogeneous, what matters is the angular component.
Proposed method for recovering the support of $\mu$
• Perform PCA on the angular data (or any rescaled version of the data with enough moments) corresponding to the observations with largest norm.
• The first eigenvectors of the rescaled empirical covariance matrix provide an estimate of $S_0 = \mathrm{span}(\mathrm{supp}(\mu))$.
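A minimal sketch of the proposed method, assuming the self-normalization $\omega(x) = 1/\|x\|$ (the function name is ours):

```python
import numpy as np

def extreme_pca(X, k, p):
    """PCA on self-normalized extremes: keep the k observations with
    largest Euclidean norm, rescale them to the unit sphere, and return
    the top-p eigenvectors of their second-moment matrix as a basis of
    the estimated subspace S_0_hat."""
    norms = np.linalg.norm(X, axis=1)
    idx = np.argsort(norms)[-k:]                 # k largest observations
    Theta = X[idx] / norms[idx, None]            # angular data Theta = X/||X||
    Sigma = Theta.T @ Theta / k                  # empirical 2nd-moment matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # ascending eigenvalues
    return eigvecs[:, ::-1][:, :p]               # top-p eigenvectors
```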
Toy example
[Figure: scatter plot of a simulated bivariate heavy-tailed sample.]
Toy example
[Figure: the same sample, with the subspace V_0 carrying the extremes overlaid.]
Toy example, proposed method
[Figure: selection of the k observations with largest norm (handwritten note: radial threshold).]
Toy example, proposed method
[Figure: the selected extreme points. Handwritten note: compare a) $\|X_i\|$ with b) $\|X_i - \Pi_{V_0}X_i\|$; in expectation b) is optimal; question: how close is a) to b)?]
Toy example, proposed method
[Figure: result of the proposed method on the toy example.]
Empirical Risk Minimization setting
• $\|\cdot\|$: Euclidean norm.
• $\Pi_S$ (resp. $\Pi^\perp_S$): orthogonal projection operator onto the linear space $S$ (resp. $S^\perp$).
• Rescaled observations: $\Theta = \theta(X) = \omega(X)\cdot X$,
• $\omega : \mathbb{R}^d \to \mathbb{R}_+$: suitable scaling function (think $\omega(x) = 1/\|x\|$; variants allowed s.t. $E(\|\Theta\|^2) < \infty$).
• Conditional risk of a linear subspace $S \subset \mathbb{R}^d$:
$$R_t(S) = E\big(\|\Pi^\perp_S(\Theta)\|^2 \,\big|\, \|X\| > t\big)$$
• Empirical counterpart:
$$R_{n,k}(S) = \frac{1}{k}\sum_{i=1}^{k}\|\Pi^\perp_S(\Theta_{(i)})\|^2,$$
with $\|X\|_{(1)} \ge \dots \ge \|X\|_{(n)}$ the order statistics of the norm and $\Theta_{(i)}$ the corresponding rescaled data.
Minimizing a risk ⟺ diagonalizing a covariance matrix
• Denote $\mathcal{E}_q$ = set of all $q$-dimensional subspaces of $\mathbb{R}^d$, $1 \le q \le d$.
• $\Sigma_t = E\big(\Theta\Theta^\top \mid \|X\| > t\big)$: conditional second-moment matrix.
• Standard fact from Principal Component Analysis: assume for simplicity that $\Sigma_t$ has distinct eigenvalues, and let $(u_1, \dots, u_d)$ denote the eigenvectors associated with the eigenvalues in decreasing order. Then
$$\operatorname*{argmin}_{S \in \mathcal{E}_q} R_t(S) = \mathrm{span}(u_1, \dots, u_q).$$
• Similarly,
$$\operatorname*{argmin}_{S \in \mathcal{E}_q} R_{n,k}(S) = \mathrm{span}(u^n_1, \dots, u^n_q),$$
where $(u^n_j)$ are the eigenvectors of the empirical second-moment matrix of the $\Theta_{(i)}$, $i \le k$.
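For the record, the one-line argument behind this standard fact (our addition, using Pythagoras and the trace trick):
$$R_t(S) = E\big(\|\Theta\|^2 - \|\Pi_S\Theta\|^2 \mid \|X\| > t\big) = E\big(\|\Theta\|^2 \mid \|X\| > t\big) - \operatorname{tr}(\Pi_S\,\Sigma_t),$$
so minimizing $R_t$ over $\mathcal{E}_q$ amounts to maximizing $\operatorname{tr}(\Pi_S\Sigma_t)$, which by the Rayleigh-Ritz / Ky Fan principle is achieved by the span of the $q$ leading eigenvectors of $\Sigma_t$.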
ERM setting: risk at the limit
• Limit risk above extreme levels:
$$R_\infty(S) := E_\infty\|\Pi^\perp_S\Theta\|^2,$$
where $E_\infty$: expectation w.r.t. the limit conditional distribution
$$P_\infty(\,\cdot\,) = \lim_{t\to\infty} P\big(X \in t(\,\cdot\,) \mid \|X\| > t\big) = \mu(\,\cdot\,)/\mu(\{x : \|x\| > 1\}).$$
• Hypothesis 1 ($S_0 = \mathrm{span}(\mathrm{supp}\,\mu)$, $\dim(S_0) = p$) implies
$$\{S_0\} = \operatorname*{argmin}_{\mathcal{E}_p} R_\infty, \quad R_\infty(S_0) = 0,$$
and for all $S'$ of dimension $p' < p$, $R_\infty(S') > 0$.
Questions
• Is the empirical minimizer $\hat S_n$ of $R_{n,k}$ consistent?
• Uniform, non-asymptotic bounds on $|R_{n,k}(S) - R_{t_{n,k}}(S)|$? (Classical goal in statistical learning.)
• Relevance for practical applications (improved performance for non-parametric estimation of the probability of failure regions)?
Convergence of minimizers of the true conditional risk
• Scaling condition on the weight $\omega$ (→ second moments of $\Theta$ exist):
$$\exists \beta \in \big(1 - \tfrac{\alpha}{2},\ 1\big]:\ \forall \lambda > 0,\ x \in \mathbb{R}^d:\ \omega(\lambda x) = \lambda^{-\beta}\omega(x),$$
and $c_\omega := \sup_{\|x\|=1}\omega(x) < \infty$.
• Define a metric $\rho$ between linear spaces $S, W$:
$$\rho(S, W) = |||\Pi_S - \Pi_W||| = |||\Pi^\perp_S - \Pi^\perp_W||| = \sup_{\|x\|=1}\|\Pi^\perp_S x - \Pi^\perp_W x\|.$$

Theorem 2 (Drees, S., 20++)
Under the scaling condition, if $R_\infty$ has a unique minimizer $S^*_\infty$ in $\mathcal{E}_p$, then for all minimizers $\hat S_n$ of $R_{n,k}$ in $\mathcal{E}_p$ one has
$$\rho(\hat S_n, S^*_\infty) \to 0 \text{ in probability.}$$
(Proof: equicontinuity in probability of the rescaled $R_{n,k}$ for large $n$ (Karamata) + compactness of $\mathcal{E}_p$.)
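For intuition, the metric $\rho$ is easy to compute from orthonormal bases; a short sketch (our addition):

```python
import numpy as np

def subspace_distance(U, W):
    """rho(S, W) = |||Pi_S - Pi_W|||: spectral norm of the difference of the
    orthogonal projectors, for orthonormal basis matrices U, W of shape (d, q)."""
    P_S = U @ U.T                            # projector onto span(U)
    P_W = W @ W.T                            # projector onto span(W)
    return np.linalg.norm(P_S - P_W, ord=2)  # operator (spectral) norm
```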
Uniform risk bound
• Stronger condition on $\omega$: $\omega(x) \le 1/\|x\|$ (thus $\|\Theta\| \le 1$).
• $t_{n,k}$: quantile of level $1 - k/n$ of $\|X\|$.
• $S_t := E\big(\|\Theta\|^4 \mid \|X\| > t\big) - \operatorname{tr}(\Sigma^2_t)$; $\Sigma_t = E\big(\Theta\Theta^\top \mid \|X\| > t\big)$.

Theorem 3 (Drees, S., 20++), simplified version
With probability at least $1 - \delta$,
$$\sup_{S \in \mathcal{E}_p}|R_{n,k}(S) - R_{t_{n,k}}(S)| \le \Big[\frac{p \wedge (d-p)}{k}\,S_{t_{n,k}}\Big]^{1/2} + \Big[\frac{8}{k}(1 + k/n)\log(4/\delta)\Big]^{1/2} + \frac{4\log(4/\delta)}{3k}.$$
(Proof: variant of the bounded difference inequality (McDiarmid, 98) + arguments from Blanchard et al., 07.)
• NB: the term $S_t$ is unknown; an alternative statement is proven with only empirical quantities in the upper bound.
Simulations: questions
• Can $p = \dim(V_0)$ be chosen empirically from the risk plots?
• Does the empirical angular measure after projection on the subspace learned by PCA provide better estimates than the classical one for the following risk-related quantities?
(i) $\lim_{u\to\infty} P\big(p^{-1}\sum_{1\le j\le p}X^j/\|X\| > t^{(i)} \mid \|X\| > u\big) = H\{x : p^{-1}\sum_{j=1}^p x^j > t^{(i)}\}$, for some $t^{(i)} \in (0, p^{-1/2})$
(ii) $\lim_{u\to\infty} P\big(\min_{1\le j\le p}X^j > u,\ \max_{p+1\le j\le d}X^j \le u \mid \|X\| > u\big) = \int\big((\min_{1\le j\le p}x^j)^\alpha - (\max_{p+1\le j\le d}x^j)^\alpha\big)_+\,H(dx)$
(iii) $\lim_{u\to\infty} P\big(X^1 > u \mid \max_{1\le j\le d}X^j > u\big) = \int (x^1)^\alpha\,H(dx)\,\big/\int(\max_{1\le j\le d}x^j)^\alpha\,H(dx)$
(iv) $\lim_{u\to\infty} P\big(\min_{1\le j\le d}X^j > u \mid \|X\| > u\big) = \int(\min_{1\le j\le d}x^j)^\alpha\,H(dx)$
Simulations: models
• $d$-dimensional vectors with limit measure concentrated on a $p < d$ dimensional subspace.
• Structure: $p$-dimensional max-stable model + $d$-dimensional Gaussian noise (absolute values), $\rho = 0.2$, $\sigma^2 \in \{10^5/d,\ 10/d\}$.
• Fréchet margins (unit scale) with tail index $\alpha \in \{1, 2\}$.
• Dependence for the $p$-dimensional model:
  • Max-stable vector from the Dirichlet model (Coles & Tawn, 91; see Segers, 2012 for simulation), parameter $(3, \dots, 3)$.
  • Other settings (not shown here).
• $n = 1000$, $k \in \{5, 10, 15, \dots, 200\}$, 1000 replications.
Choice of p, Dirichlet model, p = 2, d = 10
[Figure: mean empirical risk (left) and empirical risk for one sample (right) versus k, for PCA projecting onto a subspace of dimension 1 ≤ p ≤ 10.]
→ Choice p = 2 obvious for small k; p ∈ {2, 3} for k ≥ 50.
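A sketch of how such risk curves can be produced, in the spirit of extreme_pca above (our illustration):

```python
import numpy as np

def empirical_risk_curve(X, k, max_p):
    """R_{n,k}(S_p_hat) for p = 1..max_p: mean squared norm of the residual
    of the angular extreme points after projection on the top-p eigenspace;
    plotted against k, these curves are used to choose p."""
    norms = np.linalg.norm(X, axis=1)
    idx = np.argsort(norms)[-k:]
    Theta = X[idx] / norms[idx, None]            # rescaled extremes
    Sigma = Theta.T @ Theta / k
    _, eigvecs = np.linalg.eigh(Sigma)
    U = eigvecs[:, ::-1]                         # descending eigenvalue order
    risks = []
    for p in range(1, max_p + 1):
        proj = Theta @ U[:, :p] @ U[:, :p].T     # projection onto S_p_hat
        risks.append(np.mean(np.sum((Theta - proj) ** 2, axis=1)))
    return risks
```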
Performance for estimating failure probabilities: RMSE’srelated to the angular measure H
[Figure: four panels (i)-(iv). RMSEs based on $\hat H_{n,k}$ (black, solid), $\hat H^{\mathrm{PCA}}_{n,k}$ (blue, dashed) and $\hat H^{\mathrm{PCA}}_{n,k,10}$ (red, dash-dotted) versus k, in the Dirichlet model with parameter 3, p = 2 and d = 10.]
• PCA step with 10 observations → estimators relatively insensitive to the choice of k used for $\hat H_{n,k}$.
Conclusion III (PCA)
• Plotting the empirical risk is useful to choose $p$.
• In case of doubt, choose the highest plausible dimension.
• For estimating failure probabilities: estimators including a PCA step are competitive; for probability (i) [concomitance of extremes] they are superior.
• Choosing $k_{\mathrm{PCA}} < k$ offers improved robustness w.r.t. the choice of $k$ in the second step.
Bibliography I
• Blanchard, G., Bousquet, O., & Zwald, L. (2007). Statistical properties of kernelprincipal component analysis. Machine Learning, 66(2-3), 259-294.
• Chautru, E. (2015). Dimension reduction in multivariate extreme value analysis.Electronic journal of statistics, 9(1), 383-418.
• Chiapino, M., Sabourin, A. (2016). Feature clustering for extreme eventsanalysis, with application to extreme stream-flow data. In InternationalWorkshop on New Frontiers in Mining Complex Patterns (pp. 132-147).Springer, Cham.
• Chiapino, M., Sabourin, A., Segers, J. (2019). Identifying groups of variableswith the potential of being large simultaneously. Extremes, 22(2), 193-222.
• Chiapino, M., Clemencon, S., Feuillard, V., Sabourin, A. (2019). Amultivariate extreme value theory approach to anomaly clustering andvisualization. Computational Statistics, 1-22.
• Cooley, D., & Thibaud, E. Decompositions of dependence for high-dimensionalextremes. arXiv:1612.07190.
• Cai, J.-J., Einmahl, J., & de Haan, L. (2011). Estimation of extreme risk regions under multivariate regular variation. Annals of Statistics.
Bibliography II
• Goix, N., Sabourin, A., & Clemencon, S. (2015). Learning the dependence structure of rare events: a non-asymptotic study. COLT 2015.
• Drees, H. & Sabourin, A. Principal Component Analysis for multivariateextremes, arXiv:1906.11043
• Engelke, S., & Hitz, A. S. Graphical models for extremes. arXiv:1812.01734.
• Gardes, L. (2018). Tail dimension reduction for extreme quantile estimation.Extremes, 21(1), 57-95.
• Goix, N., Sabourin, A., Clemencon, S. (2016). Sparse Representation ofMultivariate Extremes with Applications to Anomaly Ranking. In AISTATS(pp. 75-83).
• Goix, N., Sabourin, A., & Clemencon, S. (2017). Sparse representation of multivariate extremes with applications to anomaly detection. JMVA.
• Simpson, E. S., Wadsworth, J. L., & Tawn, J. A. Determining the DependenceStructure of Multivariate Extremes. arXiv:1809.01606.
• Janssen, A., & Wan, P. k-means clustering of extremes. arXiv:1904.02970.