
Exploratory analysis for high dimensional extremes: Support identification, Anomaly detection and clustering, Principal Component Analysis.

Anne Sabourin¹

+ many others: Maël Chiapino (PhD), Stephan Clémençon (Télécom Paris), Holger Drees (U. Hamburg), Vincent Feuillard (Airbus), Nicolas Goix (PhD), Johan Segers (UC Louvain)

¹ Télécom Paris, Institut polytechnique de Paris, France.

AgroParisTech seminar, March 30th, 2020

Outline

Introduction: dimension reduction for multivariate extremes

Sparsity in multivariate extremes (Goix et al., 2016, 2017; Chiapino et al., 2016, 2019a)

Application: extremes/anomalies clustering (Chiapino et al., 2019b)

Principal Component Analysis for extremes (S. and Drees, 202+)

Motivation(s)

• Multivariate heavy-tailed random vector $X = (X_1, \dots, X_d)$ (e.g. spatial field (temperature, precipitation), asset (negative) prices, ...)

• Focus on the distribution of the largest values: $\mathrm{Law}(X \mid \|X\| > t)$, $t \gg 1$, with $P(\|X\| > t)$ small.

Possible goals: simulation (stress test), anomaly detection (preprocessing) among extreme values, ...

• $d \gg 1$: modeling $\mathrm{Law}(X \mid \|X\| > t)$ is unfeasible.

• Dimension reduction problem(s):

1. Identify the groups of features $\alpha \subset \{1, \dots, d\}$ which may be large together (while the others stay small), given that one of them is large.

2. Identify a single low-dimensional projection subspace $V_0$ such that $\mathrm{Law}(X \mid \|X\| > t)$ is approximately concentrated on $V_0$.


Examples: It cannot rain everywhere at the same time

[Figures: daily precipitation; air pollutants]

Applications to risk management

Sensor networks (road traffic, river streamflow, temperature, internet traffic, ...) or financial asset prices:

→ extreme event = traffic jam, flood, heatwave, network congestion, falling price

→ question: which groups of sensors / assets are likely to be jointly impacted?

→ how to define alert regions (alert groups of features/components)?

spatial case: one feature = one sensor

Applications to anomaly detection

• Training step: learn a 'normal region' (e.g. approximate support)

• Prediction step (with new data): anomalies = points outside the 'normal region'

If 'normal' data are heavy tailed, Abnormal ≠ Extreme. There may be extreme 'normal' data.

How to distinguish between large anomalies and normal extremes?


Standardized data

• Random vector $X = (X_1, \dots, X_d)$; $X_j \ge 0$.

• Margins: $X_j \sim F_j$, $1 \le j \le d$ (continuous).

• Preliminary step: standardization (here, to Pareto margins):
$$V_j = \frac{1}{1 - F_j(X_j)}, \qquad P(V_j > v) = \frac{1}{v}.$$

• Goal: $P(V \in A)$ for $A$ 'far from 0'?

• Each component $j$ is homogeneous (of order $-1$): for all $t > 0$,
$$P(V_j \in tA)\,/\,P(V_j > t) = P(V_j \in A).$$
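As a concrete illustration, here is a minimal numpy sketch of this rank-based standardization to approximately unit-Pareto margins, using the empirical ranks as in the estimation step described later (the function name is ours):

```python
import numpy as np

def pareto_standardize(X):
    """Rank-transform each column of the (n, d) array X to approximately
    unit-Pareto margins: V_ij = 1 / (1 - F_hat_j(X_ij)),
    with F_hat_j the (slightly shifted) empirical CDF of column j."""
    X = np.asarray(X, dtype=float)
    n, _ = X.shape
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1  # ranks in 1..n per column
    F_hat = (ranks - 1.0) / n                               # empirical CDF in [0, (n-1)/n]
    return 1.0 / (1.0 - F_hat)                              # standardized values in [1, n]
```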


Multivariate extremes: regular variation

• Informally: the marginal homogeneity property remains valid in the multivariate sense.

• A random vector $V = (V_1, \dots, V_d) \in \mathbb{R}^d$ is regularly varying if there exists a limit measure $\mu$ such that
$$\frac{P(V \in tA)}{P(\|V\| > t)} \xrightarrow[t \to \infty]{} \mu(A)$$
for $A \subset \mathbb{R}^d$ with $0 \notin \mathrm{closure}(A)$ and $\mu(\partial A) = 0$.

• Necessarily $\mu$ is homogeneous: $\mu(rA) = r^{-\alpha}\mu(A)$, for some $\alpha > 0$ (tail index).

• With $V_j = \frac{1}{1 - F_j(X_j)}$, necessarily $\alpha = 1$.


Multivariate extremes: regular variation (cont'd)

• $\mu$ rules the (probabilistic) behaviour of extremes: if $A$ is far from the origin,
$$P(V \in A) \approx \mu(A).$$

• Examples: max-stable vectors with standardized margins, multivariate Student distributions, ...

• Statistical procedures based on Extreme Value Theory: 2 steps.

1. Learn useful features of $\mu$ using the $k$ observations $V_{(1)}, \dots, V_{(k)}$ with largest norm, with $k \ll n$, the number of available data.

2. Use the approximation $P(V \in A) \approx \mu(A)$ for $A$ far from 0.


Angular measure

• Homogeneity of $\mu$ → polar coordinates are convenient:
$$r = \|x\| \ \text{(any norm)}; \qquad \theta = r^{-1}x.$$

• Angular measure $\Phi$ on the corresponding sphere: $\Phi(B) = \mu\{r > 1,\ \theta \in B\}$.

• Then $\mu$ decomposes as a product, and only $\Phi$ needs to be estimated:
$$\mu\{r > t,\ \theta \in B\} = t^{-\alpha}\,\Phi(B).$$
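In practice the angular points are simply the rescaled versions of the observations with the largest norms. A small sketch (our own helper; any norm can be plugged in):

```python
import numpy as np

def angular_points(V, k, norm_ord=np.inf):
    """Return the angular components theta = v / ||v|| of the k rows of V
    with the largest norm, together with their radii."""
    r = np.linalg.norm(V, ord=norm_ord, axis=1)
    idx = np.argsort(r)[-k:]                 # indices of the k largest norms
    return V[idx] / r[idx, None], r[idx]     # points on the unit sphere, radii
```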


Angular measure

• $\Phi$ rules the joint distribution of extremes.

• Asymptotic dependence: $(V_1, V_2)$ may be large together.

vs.

• Asymptotic independence: only $V_1$ or $V_2$ may be large.

No assumption on $\Phi$: non-parametric framework.


Outline

Introduction: dimension reduction for multivariate extremes

Sparsity in multivariate extremes (Goix et al., 2016, 2017; Chiapino et al., 2016, 2019a)

Application: extremes/anomalies clustering (Chiapino et al., 2019b)

Principal Component Analysis for extremes (S. and Drees, 202+)

Towards high dimension

• Reasonable hope: only a moderate number of the $V_j$'s may be simultaneously large → sparse angular measure.

• Our goal from a MEVT point of view:

Estimate the (sparse) support of the angular measure (i.e. the dependence structure).

Which components may be large together, while the others are small?

Sparse angular support

Full support: anything may happen. Sparse support: e.g. $V_1$ not large if $V_2$ or $V_3$ is large.

Where is the mass?

Subcones of $\mathbb{R}^d_+$:
$$C_\alpha = \{x \succeq 0 :\ x_i > 0\ (i \in \alpha),\ x_j = 0\ (j \notin \alpha),\ \|x\| \ge 1\}, \qquad \alpha \subset \{1, \dots, d\}.$$

Support recovery + representation

• $\{C_\alpha,\ \alpha \subset \{1, \dots, d\}\}$: partition of $\{x : \|x\| \ge 1\}$.

• Goal 1: learn the $(2^d - 1)$-dimensional representation (potentially sparse)
$$\mathcal{M} = \big(\mu(C_\alpha)\big)_{\alpha \subset \{1, \dots, d\},\, \alpha \ne \emptyset}; \qquad \text{support } \mathcal{S} = \{\alpha : \mu(C_\alpha) > 0\}.$$

Main interest:
$$\mu(C_\alpha) > 0 \iff \text{features } j \in \alpha \text{ may be large together while the others are small.}$$

Identifying non-empty edges

Issue: real data are non-asymptotic: $V_j > 0$.

Cannot just count data in each subcone: only the largest-dimensional one has empirical mass!

Identifying non-empty edges

Fix $\varepsilon > 0$. Assign data that are $\varepsilon$-close to an edge to that edge:
$$C_\alpha \;\to\; R^\varepsilon_\alpha$$

→ New partition of the input space, compatible with non-asymptotic data.


Empirical estimator of $\mu(C_\alpha)$

(Counts the standardized points in $R^\varepsilon_\alpha$, far from 0.)

Data: $X_i$, $i = 1, \dots, n$, $X_i = (X_{i,1}, \dots, X_{i,d})$.

• Standardize: $\widehat{V}_{i,j} = \frac{1}{1 - \widehat{F}_j(X_{i,j})}$, with $\widehat{F}_j(X_{i,j}) = \frac{\mathrm{rank}(X_{i,j}) - 1}{n}$.

• Natural estimator:
$$\mu_n(C_\alpha) = \frac{n}{k}\, P_n\Big(\widehat{V} \in \tfrac{n}{k} R^\varepsilon_\alpha\Big) \;\to\; \widehat{\mathcal{M}} = \big(\mu_n(C_\alpha)\big)_{\alpha \subset \{1, \dots, d\}}$$

• Estimated support $\widehat{\mathcal{S}} = \{\alpha : \mu_n(C_\alpha) > \mu_0\}$.
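A compact sketch of this estimator (our own variable names; eps is the thickening parameter and the infinity norm is used, as in DAMEX):

```python
import numpy as np
from collections import Counter

def damex_masses(V, k, eps=0.1):
    """Estimate mu_n(C_alpha) for every subcone alpha: keep the standardized
    points with ||v||_inf >= n/k, assign each one to the epsilon-thickened
    rectangle it falls in (coordinates exceeding eps * n/k), count with weight 1/k."""
    n, d = V.shape
    t = n / k                                      # radial threshold n/k
    extremes = V[np.max(V, axis=1) >= t]
    masses = Counter()
    for v in extremes:
        alpha = tuple(np.nonzero(v > eps * t)[0])  # the 'large' coordinates of v
        masses[alpha] += 1.0 / k
    return masses                                  # dict: alpha -> mu_n(C_alpha)
```

The estimated support is then {alpha : masses[alpha] > mu_0} for a small tolerance mu_0.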


Sparsity in real datasets

Data: wave direction data from 50 buoys in the North Sea. (Shell Research, thanks to J. Wadsworth)

Finite sample error bound

VC-bound adapted to low probability regions (see Goix, S., Clémençon, 2015)

Theorem. If the margins $F_j$ are continuous and if the density of the angular measure is bounded by $M > 0$ on each subface (infinity norm), there is a constant $C$ such that for any $n$, $d$, $k$, $\delta \ge e^{-k}$, $\varepsilon \le 1/4$, with probability $\ge 1 - \delta$,
$$\max_\alpha \big|\mu_n(C_\alpha) - \mu(C_\alpha)\big| \;\le\; Cd\left(\sqrt{\frac{1}{k\varepsilon}\log\frac{d}{\delta}} + Md\varepsilon\right) + \mathrm{Bias}_{\frac{n}{k},\varepsilon}(F, \mu).$$

Bias: using non-asymptotic data to learn about an asymptotic quantity.
$$\text{Regular variation} \iff \mathrm{Bias}_{t,\varepsilon} \xrightarrow[t \to \infty]{} 0$$

• Existing literature ($d = 2$): $1/\sqrt{k}$.

• Here: $1/\sqrt{k\varepsilon} + Md\varepsilon$. Price to pay for biasing the estimator with $\varepsilon$.

OK if $\varepsilon k \to \infty$, $\varepsilon \to 0$. Choice of $\varepsilon$: cross-validation or '$\varepsilon = 0.1$'.


Tools for the proof

1. VC inequality for small-probability classes (Goix et al., 2015)
→ maximal deviations $\le \sqrt{p} \times$ (usual bound), where $p$ is the probability of the rare region.

2. Apply it to the VC-class of rectangles $\{\tfrac{k}{n} R(x, z, \alpha),\ x, z \ge \varepsilon\}$
→ $p \le \frac{dk}{\varepsilon n}$, so that
$$\sup_\alpha |\mu_n - \mu|(R^\varepsilon_\alpha) \;\le\; Cd\sqrt{\frac{1}{\varepsilon k}\log\frac{d}{\delta}}$$

3. Approximate $\mu(C_\alpha)$ by $\mu(R^\varepsilon_\alpha)$ → error $\le Md\varepsilon$ (bounded angular density).


Algorithm DAMEX (Detecting Anomalies with Multivariate Extremes) (Goix, S., Clémençon, 2016)

Anomaly = new observation 'violating the sparsity pattern': observed in an empty or light subcone.

Scoring function: for $x$ whose standardized version $\widehat{v}$ lies in the rescaled rectangle associated with $\alpha$,
$$s_n(x) = \frac{1}{\|\widehat{v}\|}\,\mu_n(R^\varepsilon_\alpha) \;\approx\; P\big(V \in C_\alpha,\ \|V\| > \|\widehat{v}\|\big) \quad \text{for large } \|\widehat{v}\|.$$
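A sketch of how such a score could be computed from the masses estimated above (our own helper; we assume the subcone of a new point is obtained by thresholding its coordinates at a fraction eps of its norm):

```python
import numpy as np

def damex_score(v, masses, eps=0.1):
    """Anomaly score of a new standardized observation v (1-d array):
    small score = anomalous. masses is the output of damex_masses."""
    r = np.max(v)                                  # ||v||_inf
    alpha = tuple(np.nonzero(v > eps * r)[0])      # subcone containing v (assumed rule)
    return masses.get(alpha, 0.0) / r              # ~ P(V in C_alpha, ||V|| > r)
```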


Extension: feature clustering (Chiapino, S., 2016; Chiapino, S., Segers, 2019a)

• Motivating example: river stream-flow dataset, d = 92 gauging stations.

• Typical groups jointly impacted by extreme records include noisy additional features!
→ Empirical µ-mass scattered over many $C_\alpha$'s
→ No apparent sparsity pattern.

• How to gather 'close-by' α's into feature clusters? → A 'robust' version of DAMEX: the CLEF algorithm and variants + asymptotic analysis.

Conclusion I

• Discovering subgroups of components that are likely to be simultaneously large is doable, with an error scaling as $1/\sqrt{k}$, where $k$ is the number of extreme observations.

• Two algorithms:
  • DAMEX: easy to implement, linear complexity $O(dn\log n)$, not very robust to weak signals / noisy dependence structure.
  • CLEF: a bit more complex (graph mining, but existing Python packages can help); complexity OK only if the dependence graph is sparse, but more robust.

• Statistical guarantees: non-asymptotic (DAMEX) and asymptotic (CLEF).

• Open questions: optimal choice of tuning parameters (cross-validation is common practice but there is no theory in an extreme-value context).

Outline

Introduction: dimension reduction for multivariate extremes

Sparsity in multivariate extremes (Goix et al., 2016, 2017; Chiapino et al., 2016, 2019a)

Application: extremes/anomalies clustering (Chiapino et al., 2019b)

Principal Component Analysis for extremes (S. and Drees, 202+)

Application: clustering extreme data (Chiapino et al., 2019b)

• Context: monitoring a high-dimensional system (e.g. air flight data from Airbus, 82 parameters, 18,000 observations), where extremes are of particular interest (associated with anomalies / risk regions).

• Naive idea: use the list of dependent maximal subgroups $\{\alpha_k,\ k \le K\}$ issued from DAMEX/CLEF and the corresponding rectangles of the kind
$$t R^\varepsilon_\alpha = \{x \in \mathbb{R}^d :\ x_j > t\varepsilon\ (j \in \alpha),\ x_j < t\varepsilon\ (j \notin \alpha),\ \|x\| > t\}$$

• Issue: in practice many data points fall outside the $t R^\varepsilon_\alpha$'s. How to assign them to a cluster?

Mixture model for extremes

• View the dependent subgroups $\alpha_k \subset \{1, \dots, d\}$, $k \le K$, issued from DAMEX/CLEF as components of a mixture model.

• $Z_k$: hidden indicator variable of component $k$. Conditionally on $\{\|V\| > r_0,\ Z_k = 1\}$,
$$V = V_k + \epsilon_k = R_k W_k + \epsilon_k, \qquad (1)$$
where
  • $V_k \in C_{\alpha_k}$,
  • $\epsilon_k \in C^\perp_{\alpha_k}$, i.i.d. Exponential,
  • $R_k = \|V_k\| \sim \mathrm{Pareto}(1)$,
  • $W_k = R_k^{-1} V_k \in S_{\alpha_k} \sim \Phi_k$ (angular measure restricted to the $k$-th face: Dirichlet distribution with the $L^1$ norm).
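For concreteness, a possible way to simulate from one such component under Eq. (1), with illustrative parameter values (the Dirichlet parameter and the noise rate below are placeholders, not the fitted values):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_component(alpha_k, d, n, dir_param=3.0, noise_rate=1.0):
    """Sketch: draw n points from the k-th mixture component V = R_k W_k + eps_k.
    alpha_k lists the 'active' coordinates of the component."""
    V = np.zeros((n, d))
    # angular part W_k ~ Dirichlet on the face indexed by alpha_k (L1 norm)
    W = rng.dirichlet(dir_param * np.ones(len(alpha_k)), size=n)
    # radial part R_k ~ Pareto(1)
    R = 1.0 / rng.uniform(size=n)
    V[:, alpha_k] = R[:, None] * W
    # noise eps_k ~ Exponential on the complementary coordinates
    others = [j for j in range(d) if j not in alpha_k]
    V[:, others] = rng.exponential(1.0 / noise_rate, size=(n, len(others)))
    return V
```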

Model for the $k$-th mixture component
$$V = V_k + \epsilon_k = R_k W_k + \epsilon_k \qquad (2)$$

• Training the mixture model: EM algorithm.

Clustering extremes

• After training: each extreme point $V_i$ has probability $p_{i,k}$ of coming from mixture component $k$.

• Similarity measure between $V_i$, $V_j$:
$$s_{i,j} = P(V_i, V_j \in \text{same component}) = \sum_{k=1}^K p_{i,k}\, p_{j,k}$$
→ Similarity matrix $(S_{i,j}) \in [0,1]^{N \times N}$, where $N$ is the number of extreme points.

• Clustering based on the similarity matrix using off-the-shelf techniques (e.g. spectral clustering).
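A minimal sketch of this last step, assuming the (N × K) matrix P of posterior probabilities p_{i,k} has already been produced by the EM step (not shown):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_extremes(P, n_clusters):
    """Build the similarity s_ij = sum_k p_ik p_jk between extreme points
    and cluster it with off-the-shelf spectral clustering."""
    S = P @ P.T                                        # (N, N) similarity matrix in [0, 1]
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(S)                        # one cluster label per extreme point
```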

Some results

Table: Shuttle dataset (9 attributes, 7 classes): purity score, comparison with standard approaches for different extreme sample sizes.

                      n0=500  n0=400  n0=300  n0=200  n0=100
Dirichlet mixture       0.80    0.82    0.82    0.84    0.85
K-means                 0.72    0.73    0.75    0.78    0.80
Spectral clustering     0.78    0.77    0.82    0.81    0.80

Flights anomaly clustering + Visual display

Using standard visualization tools (Python package Networkx)


Conclusion II

• DAMEX/CLEF output can be used to perform clustering of extremes.

• Mixture modelling: {mixture components} = output of DAMEX/CLEF.

• Additional layer: building a similarity matrix to perform clustering.

• Question: what if the angular measure is concentrated on an 'oblique' subspace of the central subsphere (only particular linear combinations of all components are likely to be large)? Then CLEF/DAMEX fail, because all the mass is in the central subsphere and no particular structure can be discovered.
→ Idea: perform some sort of PCA on extreme data.

Outline

Introduction: dimension reduction for multivariate extremes

Sparsity in multivariate extremes (Goix et al., 2016, 2017; Chiapino et al., 2016, 2019a)

Application: extremes/anomalies clustering (Chiapino et al., 2019b)

Principal Component Analysis for extremes (S. and Drees, 202+)

PCA for extremes: context and motivation

• $(X_1, \dots, X_d)$ a multivariate random vector with tail index $\alpha > 0$ and limit measure $\mu$.

• Motivating assumption (not necessary):

Hypothesis 1. The vector space $S_0 = \mathrm{span}(\mathrm{supp}\,\mu)$ generated by the support of $\mu$ has dimension $p < d$.

• Purpose of this work: recover $S_0$ from the data, with guarantees concerning the reconstruction error.

Motivating assumption: interpretation
$$\dim(S_0) = p < d; \qquad S_0 = \mathrm{span}(\mathrm{supp}(\mu))$$
$$\iff$$
Certain linear combinations are much likelier to be large than others.

Dimension reduction in EVT: quick overview

• Looking for multiple subspaces where µ concentrates:
  • Chautru, 2015 (clustering + principal nested spheres)
  • Goix et al., 2016, 2017; Chiapino et al. (space partitioning); Simpson et al., 20++ (relaxing the partition)
  • Engelke & Hitz, 20++: graphical models
  • k-means clustering: Janssen & Wan, 20++

• Dimension reduction in regression analysis: Gardes, 2018.

• PCA on a transformed version of the data: Cooley & Thibaud (20++)

Heavy-tailed scarecrow against using PCA for extremes

• 'Classical dimension reduction tools such as PCA fail for multivariate extremes because they require the existence of second moments.'

• Possible answer: since µ is homogeneous, what matters is the angular component.

Proposed method for recovering the support of µ:

• Perform PCA on angular data (or any rescaled version of the data with enough moments) corresponding to the observations with largest norm.

• The first eigenvectors of the rescaled empirical covariance matrix provide an estimate of $S_0 = \mathrm{span}(\mathrm{supp}(\mu))$.
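A minimal sketch of this proposed procedure with the simplest scaling ω(x) = 1/‖x‖ (function and variable names are ours):

```python
import numpy as np

def extreme_pca(X, k, p):
    """Rescale the k observations with largest norm to the unit sphere, then
    return the leading p eigenvectors of their second-moment matrix, whose
    span estimates S0 = span(supp(mu))."""
    r = np.linalg.norm(X, axis=1)
    idx = np.argsort(r)[-k:]                  # k most extreme observations
    Theta = X[idx] / r[idx, None]             # rescaled (angular) points
    Sigma = Theta.T @ Theta / k               # empirical second-moment matrix
    eigval, eigvec = np.linalg.eigh(Sigma)    # eigenvalues in ascending order
    return eigvec[:, ::-1][:, :p]             # top-p eigenvectors (columns)
```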

Toy example

[Scatterplot of a simulated heavy-tailed sample; the target subspace V_0 is marked.]

Toy example, proposed method

[Scatterplots: the most extreme observations are rescaled to the unit sphere and PCA is applied to the rescaled points, recovering V_0.]

Empirical Risk Minimization setting

• $\|\cdot\|$: Euclidean norm.

• $\Pi_S$ (resp. $\Pi^\perp_S$): orthogonal projection operator onto the linear space $S$ (resp. $S^\perp$).

• Rescaled observations: $\Theta = \theta(X) = \omega(X)\cdot X$,

• $\omega : \mathbb{R}^d \to \mathbb{R}_+$: suitable scaling function (think $\omega(x) = 1/\|x\|$; variants allowed such that $E\big[\|\Theta\|^2\big] < \infty$).

• Conditional risk of a linear subspace $S \subset \mathbb{R}^d$:
$$R_t(S) = E\big[\|\Pi^\perp_S(\Theta)\|^2 \,\big|\, \|X\| > t\big]$$

• Empirical counterpart:
$$R_{n,k}(S) = \frac{1}{k}\sum_{i=1}^{k} \|\Pi^\perp_S(\Theta_{(i)})\|^2$$
with $\|X\|_{(1)} \ge \dots \ge \|X\|_{(n)}$ the order statistics of the norm and $\Theta_{(i)}$ the corresponding rescaled data.

" Its't 1/2--11X -Tsx 112

Minimizing a risk ⟺ Diagonalizing a covariance matrix

• Denote by $E_q$ the set of all $q$-dimensional subspaces of $\mathbb{R}^d$, $1 \le q \le d$.

• $\Sigma_t = E\big[\Theta\Theta^\top \,\big|\, \|X\| > t\big]$: conditional second-moment matrix.

• Standard fact from Principal Component Analysis: assume for simplicity that $\Sigma_t$ has distinct eigenvalues. Let $(u_1, \dots, u_d)$ denote the eigenvectors associated with the eigenvalues in decreasing order. Then
$$\operatorname*{argmin}_{S \in E_q} R_t(S) = \mathrm{span}(u_1, \dots, u_q).$$

• Similarly,
$$\operatorname*{argmin}_{S \in E_q} R_{n,k}(S) = \mathrm{span}(u^n_1, \dots, u^n_q),$$
where $(u^n_j)$ are the eigenvectors of the empirical second-moment matrix of the $\Theta_{(i)}$, $i \le k$.
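This equivalence gives a cheap way to trace the minimal empirical risk as a function of the dimension q, which is the curve used later to choose p. A sketch, assuming Theta holds the k rescaled extreme observations:

```python
import numpy as np

def empirical_risk_curve(Theta):
    """min over q-dimensional S of R_{n,k}(S), for q = 1..d: by the PCA argument
    this equals the sum of the d - q smallest eigenvalues of the second-moment matrix."""
    k, d = Theta.shape
    Sigma = Theta.T @ Theta / k
    eigval = np.sort(np.linalg.eigvalsh(Sigma))[::-1]            # decreasing eigenvalues
    return np.array([eigval[q:].sum() for q in range(1, d + 1)])
```

Plotting this curve against q (and for several k) is how the dimension p is chosen in the experiments below.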

ERM setting: risk at the limit

• Limit risk above extreme levels:
$$R_\infty(S) := E_\infty\big[\|\Pi^\perp_S \Theta\|^2\big],$$
where $E_\infty$ is the expectation with respect to the limit conditional distribution
$$P_\infty(\,\cdot\,) = \lim_{t \to \infty} P\big(X \in t(\,\cdot\,) \,\big|\, \|X\| > t\big) = \mu(\,\cdot\,)/\mu(\{x : \|x\| > 1\}).$$

• Hypothesis 1 ($S_0 = \mathrm{span}\,\mathrm{supp}\,\mu$, $\dim(S_0) = p$) implies
$$\{S_0\} = \operatorname*{argmin}_{E_p} R_\infty, \qquad R_\infty(S_0) = 0,$$
and for all $S'$ of dimension $p' < p$, $R_\infty(S') > 0$.

Questions

• Is the empirical minimizer $\widehat{S}_n$ of $R_{n,k}$ consistent?

• Uniform, non-asymptotic bounds on $|R_{n,k}(S) - R_{t_{n,k}}(S)|$? (classical goal in statistical learning)

• Relevance for practical applications (improved performance for non-parametric estimation of the probability of failure regions)?

Convergence of minimizers of the true conditional risk

• Scaling condition on the weight $\omega$ (→ second moments of $\Theta$ exist):
$$\exists\, \gamma \in \big(1 - \tfrac{\alpha}{2},\, 1\big]: \quad \forall \lambda > 0,\ x \in \mathbb{R}^d: \ \omega(\lambda x) = \lambda^{-\gamma}\omega(x),
\qquad \text{and } c_\omega := \sup_{\|x\|=1}\omega(x) < \infty.$$

• Define a metric $\rho$ between linear spaces $S$, $W$:
$$\rho(S, W) = |||\Pi_S - \Pi_W||| = |||\Pi^\perp_S - \Pi^\perp_W||| = \sup_{\|x\|=1}\|\Pi^\perp_S x - \Pi^\perp_W x\|$$

Theorem 2 (Drees, S., 20++). Under the scaling condition, if $R_\infty$ has a unique minimizer $S^*_\infty$ in $E_p$, then for all minimizers $\widehat{S}_n$ of $R_{n,k}$ in $E_p$ one has
$$\rho(\widehat{S}_n, S^*_\infty) \to 0 \quad \text{in probability}.$$

(Proof: equicontinuity in probability of the rescaled $R_{n,k}$ for large $n$ (Karamata) + compactness of $E_p$.)

Uniform risk bound

• Stronger condition on $\omega$: $\omega(x) \le 1/\|x\|$ (thus $\|\Theta\| \le 1$).

• $t_{n,k}$: quantile of level $1 - k/n$ for $\|X\|$.

• $S_t := E\big[\|\Theta\|^4 \,\big|\, \|X\| > t\big] \approx_t \mathrm{tr}(\Sigma_t^2)$; $\ \Sigma_t = E\big[\Theta\Theta^\top \,\big|\, \|X\| > t\big]$.

Theorem 3 (Drees, S., 20++), simplified version. With probability at least $1 - \delta$,
$$\sup_{S \in E_p} |R_{n,k}(S) - R_{t_{n,k}}(S)| \;\le\; \Big[\frac{p \wedge (d-p)}{k}\, S_{t_{n,k}}\Big]^{1/2} + \Big[\frac{8}{k}(1 + k/n)\log(4/\delta)\Big]^{1/2} + \frac{4\log(4/\delta)}{3k}.$$

(Variant of the bounded difference inequality (McDiarmid, 98) + arguments from Blanchard et al., 07.)

• NB: the term $S_t$ is unknown; an alternative statement is proven with only empirical quantities in the upper bound.

Simulations: questions

• Can $p = \dim(V_0)$ be chosen empirically from the risk plots?

• Does the empirical angular measure after projection on the subspace learned by PCA provide better estimates than the classical one for the following risk-related quantities?

(i) $\lim_{u\to\infty} P\big(p^{-1}\sum_{1\le j\le p} X^j/\|X\| > t^{(i)} \,\big|\, \|X\| > u\big) = H\{x :\ p^{-1}\sum_{j=1}^p x^j > t^{(i)}\}$ for some $t^{(i)} \in (0, p^{-1/2})$

(ii) $\lim_{u\to\infty} P\big(\min_{1\le j\le p} X^j > u,\ \max_{p+1\le j\le d} X^j \le u \,\big|\, \|X\| > u\big) = \int \big((\min_{1\le j\le p} x^j)^\alpha - (\max_{p+1\le j\le d} x^j)^\alpha\big)_+ \, H(dx)$

(iii) $\lim_{u\to\infty} P\big(X^1 > u \,\big|\, \max_{1\le j\le d} X^j > u\big) = \int (x^1)^\alpha\, H(dx) \,\big/ \int (\max_{1\le j\le d} x^j)^\alpha\, H(dx)$

(iv) $\lim_{u\to\infty} P\big(\min_{1\le j\le d} X^j > u \,\big|\, \|X\| > u\big) = \int (\min_{1\le j\le d} x^j)^\alpha\, H(dx)$

Simulations: models

• $d$-dimensional vectors with limit measure concentrated on a $p < d$ dimensional subspace.

• Structure: $p$-dimensional max-stable model + $d$-dimensional Gaussian noise (absolute values), $\rho = 0.2$, $\sigma^2 \in \{10^{-5}/d,\ 10/d\}$.

• Unit Fréchet margins with tail index $\alpha \in \{1, 2\}$.

• Dependence for the $p$-dimensional model:
  • Max-stable vector from the Dirichlet model (Coles & Tawn, 91; see Segers, 2012 for simulation), parameter $(3, \dots, 3)$.
  • Other settings (not shown here).

• $n = 1000$, $k \in \{5, 10, 15, \dots, 200\}$, 1000 replications.

Choice of p, Dirichlet model, p = 2, d = 10

[Figure: mean empirical risk (left) and empirical risk for one sample (right) versus k, for PCA projecting onto a subspace of dimension 1 ≤ p ≤ 10.]

→ Choice p = 2 obvious for small k; p ∈ {2, 3} for k ≥ 50.

Performance for estimating failure probabilities: RMSEs related to the angular measure H

[Figure, panels (i)-(iv): RMSEs based on $H_{n,k}$ (black, solid), $H^{\mathrm{PCA}}_{n,k}$ (blue, dashed) and $H^{\mathrm{PCA}}_{n,k,10}$ (red, dash-dotted) versus k, in the Dirichlet model with parameter 3, p = 2 and d = 10.]

• PCA step with 10 observations → estimators relatively insensitive to the choice of k used for $H_{n,k}$.

Conclusion III (PCA)

• Plotting the empirical risk is useful for choosing p.

• In case of doubt, choose the highest plausible dimension.

• For estimating failure probabilities: estimators including a PCA step are competitive; for probability (i) [concomitance of extremes] they are superior.

• Choosing $k_{\mathrm{PCA}} < k$ offers improved robustness w.r.t. the choice of k in the second step.

Bibliography I

• Blanchard, G., Bousquet, O., & Zwald, L. (2007). Statistical properties of kernel principal component analysis. Machine Learning, 66(2-3), 259-294.

• Chautru, E. (2015). Dimension reduction in multivariate extreme value analysis. Electronic Journal of Statistics, 9(1), 383-418.

• Chiapino, M., Sabourin, A. (2016). Feature clustering for extreme events analysis, with application to extreme stream-flow data. In International Workshop on New Frontiers in Mining Complex Patterns (pp. 132-147). Springer, Cham.

• Chiapino, M., Sabourin, A., Segers, J. (2019). Identifying groups of variables with the potential of being large simultaneously. Extremes, 22(2), 193-222.

• Chiapino, M., Clémençon, S., Feuillard, V., Sabourin, A. (2019). A multivariate extreme value theory approach to anomaly clustering and visualization. Computational Statistics, 1-22.

• Cooley, D., & Thibaud, E. Decompositions of dependence for high-dimensional extremes. arXiv:1612.07190.

• Cai, J.-J., Einmahl, J., & de Haan, L. (2011). Estimation of extreme risk regions under multivariate regular variation. Annals of Statistics.

Bibliography II

• Goix, N., Sabourin, A., Clémençon, S. (2015). Learning the dependence structure of rare events: a non-asymptotic study. COLT.

• Drees, H., & Sabourin, A. Principal Component Analysis for multivariate extremes. arXiv:1906.11043.

• Engelke, S., & Hitz, A. S. Graphical models for extremes. arXiv:1812.01734.

• Gardes, L. (2018). Tail dimension reduction for extreme quantile estimation. Extremes, 21(1), 57-95.

• Goix, N., Sabourin, A., Clémençon, S. (2016). Sparse Representation of Multivariate Extremes with Applications to Anomaly Ranking. In AISTATS (pp. 75-83).

• Goix, N., Sabourin, A., Clémençon, S. (2017). Sparse representation of multivariate extremes with applications to anomaly detection. JMVA.

• Simpson, E. S., Wadsworth, J. L., & Tawn, J. A. Determining the Dependence Structure of Multivariate Extremes. arXiv:1809.01606.

• Janssen, A., & Wan, P. k-means clustering of extremes. arXiv:1904.02970.