Clustering CDS: algorithms, distances, stability and convergence rates


Upload: hellebore-capital-limited

Post on 16-Jan-2017


HELLEBORECAPITAL

Introduction

The standard methodology

Exploring dependence between returns

Copula-based dependence coefficients (clustering distances)

Empirical convergence rates

Beyond dependence: a (copula,margins) representation

Clustering CDS: algorithms, distances, stability and convergence rates

CMStatistics 2016, University of Seville, Spain

Gautier Marti, Frank Nielsen, Philippe Donnat

HELLEBORECAPITAL, December 9, 2016

Gautier Marti Clustering CDS: algorithms, distances, stability and convergence rates


1 Introduction

2 The standard methodology

3 Exploring dependence between returns

4 Copula-based dependence coefficients (clustering distances)

5 Empirical convergence rates

6 Beyond dependence: a (copula,margins) representation


Introduction

Goal: finding groups of ‘homogeneous’ assets that can help to:

• build alternative measures of risk,

• elaborate trading strategies, etc.

But we need high confidence in these clusters (networks).

So we need appropriate AND fast-converging methodologies [8]:

• to be consistent yet efficient (bias–variance tradeoff),

• to avoid non-stationarity of the time series (too large a sample).

A good model selection criterion: the minimum sample size needed to reach a given ‘accuracy’.


The standard methodology - description

The methodology widely adopted in empirical studies [7]:

Let $N$ be the number of assets.

Let $P_i(t)$ be the price at time $t$ of asset $i$, $1 \le i \le N$.

Let $r_i(t)$ be the log-return at time $t$ of asset $i$:

$$r_i(t) = \log P_i(t) - \log P_i(t-1).$$

For each pair $i, j$ of assets, compute their correlation:

$$\rho_{ij} = \frac{\langle r_i r_j \rangle - \langle r_i \rangle \langle r_j \rangle}{\sqrt{\left(\langle r_i^2 \rangle - \langle r_i \rangle^2\right)\left(\langle r_j^2 \rangle - \langle r_j \rangle^2\right)}}.$$

Convert the correlation coefficients $\rho_{ij}$ into distances:

$$d_{ij} = \sqrt{2(1 - \rho_{ij})}.$$
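The three steps above (log-returns, Pearson correlation, distance) can be sketched in a few lines of NumPy. This is only an illustration: the function name `correlation_distances` and the simulated price paths are assumptions made for the example, not part of the talk.

```python
import numpy as np

def correlation_distances(prices):
    """prices: array of shape (T + 1, N), one column per asset."""
    log_returns = np.diff(np.log(prices), axis=0)   # r_i(t), shape (T, N)
    rho = np.corrcoef(log_returns, rowvar=False)    # Pearson rho_ij, shape (N, N)
    rho = np.clip(rho, -1.0, 1.0)                   # guard against rounding error
    return np.sqrt(2.0 * (1.0 - rho))               # d_ij = sqrt(2 (1 - rho_ij))

# Hypothetical data: 4 geometric random walks observed over 251 dates.
rng = np.random.default_rng(0)
prices = np.exp(np.cumsum(rng.normal(0.0, 0.01, size=(251, 4)), axis=0))
d = correlation_distances(prices)
```

The distance takes values in [0, 2]: 0 for perfectly correlated returns and 2 for perfectly anti-correlated ones.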


From all the distances dij , compute a minimum spanning tree:

Figure: A minimum spanning tree of stocks (from [1]); stocks from the same industry (represented by color) tend to cluster together
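A minimum spanning tree can be extracted from a distance matrix with SciPy's `minimum_spanning_tree`. The sketch below uses a hand-made 4 × 4 distance matrix (an assumption for illustration) and a hypothetical helper `mst_edges`. Note that SciPy treats zero entries of a dense matrix as missing edges, so two perfectly correlated assets ($d_{ij} = 0$) would need special handling.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_edges(dist):
    """Return the MST edges (i, j, d_ij), with i < j, of a symmetric distance matrix."""
    tree = minimum_spanning_tree(np.asarray(dist, dtype=float)).tocoo()
    edges = [(int(min(i, j)), int(max(i, j)), float(w))
             for i, j, w in zip(tree.row, tree.col, tree.data)]
    return sorted(edges)

# Toy distances: assets 0 and 1 are close, assets 2 and 3 are close.
dist = np.array([
    [0.00, 0.30, 1.20, 1.30],
    [0.30, 0.00, 1.10, 1.25],
    [1.20, 1.10, 0.00, 0.40],
    [1.30, 1.25, 0.40, 0.00],
])
edges = mst_edges(dist)
```

The tree keeps the two short within-group links and a single bridge between the groups, which is the link a single-linkage dendrogram would merge last.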


The standard methodology - limitations

• MST clustering is equivalent to Single Linkage clustering:

  • chaining phenomenon

  • not stable to noise / small perturbations [11]

• Use of the Pearson correlation:

  • can take value 0 even though the variables are strongly dependent

  • not invariant to monotone transformations of the variables

  • not robust to outliers

Is it still useful for financial time series? Stocks? CDS?
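The first two limitations of the Pearson correlation are easy to reproduce on synthetic data. The example below is an illustration only (the data and variable names are assumptions): $Y = X^2$ is fully determined by $X$ yet uncorrelated with it, and applying $\exp$ to one variable changes the Pearson coefficient but leaves the rank-based Spearman coefficient unchanged.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Strong dependence, zero correlation: y is a deterministic function of x.
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2
pearson_dependent = pearsonr(x, y)[0]

# Monotone transformation: exp(z) carries the same ranks as z.
rng = np.random.default_rng(0)
z = rng.normal(size=500)
w = z + rng.normal(size=500)
pearson_raw, pearson_exp = pearsonr(z, w)[0], pearsonr(np.exp(z), w)[0]
spearman_raw, spearman_exp = spearmanr(z, w)[0], spearmanr(np.exp(z), w)[0]
```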


Copulas

Sklar’s Theorem [13]

For $(X_i, X_j)$ having continuous marginal cdfs $F_{X_i}, F_{X_j}$, its joint cumulative distribution $F$ is uniquely expressed as

$$F(X_i, X_j) = C(F_{X_i}(X_i), F_{X_j}(X_j)),$$

where $C$ is known as the copula of $(X_i, X_j)$.

The copula, a joint distribution with uniform marginals, encodes all the dependence.


From ranks to empirical copula

$r_i, r_j$ are the rank statistics of $X_i, X_j$ respectively, i.e. $r_i^t$ is the rank of $X_i^t$ in $\{X_i^1, \ldots, X_i^T\}$: $r_i^t = \sum_{k=1}^{T} \mathbf{1}\{X_i^k \le X_i^t\}$.

Deheuvels’ empirical copula [3]

Any copula $\hat{C}$ defined on the lattice $L = \{(\frac{t_i}{T}, \frac{t_j}{T}) : t_i, t_j = 0, \ldots, T\}$ by

$$\hat{C}\left(\frac{t_i}{T}, \frac{t_j}{T}\right) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}\{r_i^t \le t_i,\ r_j^t \le t_j\}$$

is an empirical copula.

$\hat{C}$ is a consistent estimator of $C$ with uniform convergence [4].
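The definition above translates directly into code: rank both samples, place one unit of mass at each rank pair, and accumulate. The function name `empirical_copula` is an assumption for this sketch, which also assumes no ties (continuous margins) and stores the full $(T+1) \times (T+1)$ lattice, so it is only practical for moderate $T$.

```python
import numpy as np
from scipy.stats import rankdata

def empirical_copula(x, y):
    """Return c with c[a, b] = C_hat(a / T, b / T) for a, b = 0, ..., T."""
    T = len(x)
    rx = rankdata(x, method="ordinal").astype(int)   # ranks in 1..T
    ry = rankdata(y, method="ordinal").astype(int)
    counts = np.zeros((T + 1, T + 1))
    counts[rx, ry] = 1.0                             # one point mass per observation
    # Double cumulative sum counts observations with r_x <= a and r_y <= b.
    return counts.cumsum(axis=0).cumsum(axis=1) / T

# Comonotone toy sample: y = x, so C_hat(u, v) = min(u, v) on the lattice.
x = np.array([0.1, 0.4, 0.3, 0.9])
c = empirical_copula(x, x)
```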


Clustering of bivariate empirical copulas

Generate the $\binom{N}{2}$ bivariate empirical copulas

Find clusters of copulas using optimal transport [10, 9]

Compute and display the clusters’ centroids [2]

Some code available at www.datagrapple.com/Tech.


Copula-centers for stocks (CAC 40)

Figure: Stocks: More mass in the bottom-left corner, i.e. lower tail dependence. Stock prices tend to plummet together.


Copula-centers for Credit Default Swaps (XO index)

Figure: Credit default swaps: More mass in the top-right corner, i.e. upper tail dependence. Insurance cost against entities’ default tends to soar in stressed markets.


Dependence as relative distances between copulas

$C$: copula of $(X_i, X_j)$; $|u - v|/\sqrt{2}$: distance from $(u, v)$ to the diagonal.

Spearman’s $\rho_S$:

$$\rho_S(X_i, X_j) = 12 \int_0^1 \int_0^1 \left(C(u, v) - uv\right) du\, dv = 1 - 6 \int_0^1 \int_0^1 (u - v)^2\, dC(u, v)$$

Many correlation coefficients can be expressed as distances to the Fréchet–Hoeffding bounds or to independence [6]. Some are explicitly built this way (e.g. [12, 5, 9]).
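The second expression for $\rho_S$ suggests a plug-in estimate: replace the integral against $dC$ by an average over the normalised ranks $(u_t, v_t) = (r_i^t / T, r_j^t / T)$. The sketch below does this; the function name is an assumption, and the result differs from the textbook rank formula by at most $2/T^2$.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def spearman_from_pseudo_obs(x, y):
    """Plug-in estimate of 1 - 6 * E[(U - V)^2] using normalised ranks."""
    n = len(x)
    u = rankdata(x) / n
    v = rankdata(y) / n
    return 1.0 - 6.0 * np.mean((u - v) ** 2)

# Hypothetical sample: a noisy linear relationship.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + rng.normal(size=100)
rho_plugin = spearman_from_pseudo_obs(x, y)
rho_scipy = spearmanr(x, y)[0]
```

On points lying exactly on the diagonal ($u = v$) the squared distance vanishes and the coefficient equals 1, which is the geometric reading given on the slide.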


A metric space for copulas: Optimal Transport


The Target/Forget Dependence Coefficient (TFDC)

Now, we can define our bespoke dependence coefficient:

Build the forget-dependence copulas $\{C_l^F\}_l$

Build the target-dependence copulas $\{C_k^T\}_k$

Compute the empirical copula $C_{ij}$ from $x_i, x_j$

$$\mathrm{TFDC}(C_{ij}) = \frac{\min_l D(C_l^F, C_{ij})}{\min_l D(C_l^F, C_{ij}) + \min_k D(C_{ij}, C_k^T)}$$
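The ratio above can be sketched with the distance $D$ passed in as a function. In this illustration copulas are represented as $n \times n$ histograms of probability mass, and `frobenius` is a simple stand-in for $D$ chosen to keep the example short; it is not the optimal transport distance of the preceding slide. All names are assumptions.

```python
import numpy as np

def tfdc(c_ij, forget_copulas, target_copulas, dist):
    """Distance to the nearest forget copula, relative to forget + target distances."""
    d_forget = min(dist(c_f, c_ij) for c_f in forget_copulas)
    d_target = min(dist(c_ij, c_t) for c_t in target_copulas)
    return d_forget / (d_forget + d_target)

def frobenius(a, b):
    # Stand-in distance between two discretised copulas (not optimal transport).
    return float(np.linalg.norm(a - b))

n = 10
independence = np.full((n, n), 1.0 / n**2)   # forget: no dependence
comonotone = np.eye(n) / n                   # target: perfect positive dependence
mixture = 0.5 * independence + 0.5 * comonotone
score = tfdc(mixture, [independence], [comonotone], frobenius)
```

By construction the coefficient is 0 when the empirical copula coincides with a forget-dependence copula and 1 when it coincides with a target-dependence copula.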


Spearman vs. TFDC

Figure: Empirical copulas for $(X, Y)$ where $X = Z\,\mathbf{1}\{Z < a\} + \varepsilon_X \mathbf{1}\{Z > a\}$, $Y = Z\,\mathbf{1}\{Z < a + 0.25\} + \varepsilon_Y \mathbf{1}\{Z > a + 0.25\}$, $a = 0, 0.05, \ldots, 0.95, 1$, and where $Z$ is uniform on $[0, 1]$ and $\varepsilon_X, \varepsilon_Y$ are independent noises (left). TFDC and Spearman coefficients estimated between $X$ and $Y$ as a function of $a$ (right; x-axis: discontinuity position $a$; y-axis: estimated positive dependence). For $a = 0.75$, the Spearman coefficient yields a negative value, yet $X = Y$ over $[0, a]$.


Process: Recovering a simulated ground-truth [8]

A simulation & benchmark process that needs to be refined:

• Extract (using a large sample) a filtered correlation matrix R

• Generate samples of size T = 10, . . . , 20, . . . from a relevant distribution (parameterized by R)

• Compute the ratio of the number of correct clusterings obtained over the number of trials, as a function of T

Figure: Empirical rates of convergence for Single Linkage, Average Linkage and Ward (x-axis: sample size, 100 to 500; y-axis: score, 0 to 1; curves: Gaussian - Pearson, Gaussian - Spearman, Student - Pearson, Student - Spearman).

A full comparative study will be posted online at www.datagrapple.com/Tech.
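The benchmark loop described above can be sketched as follows. Several simplifications are assumptions of this example rather than the setup of [8]: the ground-truth matrix R is a hand-made block correlation matrix instead of one filtered from real data, samples are Gaussian only, and the clustering is average linkage on $d_{ij} = \sqrt{2(1 - \rho_{ij})}$. Function names and parameter values are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def block_correlation(n_clusters=3, size=5, rho_in=0.7, rho_out=0.1):
    """Ground truth: high correlation within clusters, low correlation across."""
    labels = np.repeat(np.arange(n_clusters), size)
    R = np.where(labels[:, None] == labels[None, :], rho_in, rho_out)
    np.fill_diagonal(R, 1.0)
    return R, labels

def same_partition(a, b):
    """True if two label vectors define the same clusters (up to renaming)."""
    return np.array_equal(a[:, None] == a[None, :], b[:, None] == b[None, :])

def recovery_rate(R, labels, T, n_trials=50, method="average", seed=0):
    """Fraction of trials in which the true clusters are recovered from T samples."""
    rng = np.random.default_rng(seed)
    k = len(np.unique(labels))
    hits = 0
    for _ in range(n_trials):
        X = rng.multivariate_normal(np.zeros(len(R)), R, size=T)   # shape (T, N)
        rho = np.clip(np.corrcoef(X, rowvar=False), -1.0, 1.0)
        D = np.sqrt(2.0 * (1.0 - rho))
        np.fill_diagonal(D, 0.0)
        Z = linkage(squareform(D, checks=False), method=method)
        found = fcluster(Z, t=k, criterion="maxclust")
        hits += same_partition(found, labels)
    return hits / n_trials

R, labels = block_correlation()
rates = {T: recovery_rate(R, labels, T) for T in (10, 50, 250)}
```

Plotting `rates` against T for several (distribution, correlation, algorithm) combinations gives curves of the kind shown in the figure; the smallest T reaching a target score is the model selection criterion mentioned in the introduction.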


ON CLUSTERING FINANCIAL TIME SERIES

GAUTIER MARTI, PHILIPPE DONNAT AND FRANK NIELSEN

NOISY CORRELATION MATRICES

Let $X$ be the matrix storing the standardized returns of $N = 560$ assets (credit default swaps) over a period of $T = 2500$ trading days.

Then, the empirical correlation matrix of the returns is

$$C = \frac{1}{T} X X^\top.$$

We can compute the empirical density of its eigenvalues

$$\rho(\lambda) = \frac{1}{N} \frac{dn(\lambda)}{d\lambda},$$

where $n(\lambda)$ counts the number of eigenvalues of $C$ less than $\lambda$.

From random matrix theory, the Marchenko-Pastur distribution gives the limit distribution as $N \to \infty$, $T \to \infty$ and $T/N$ fixed. It reads:

$$\rho(\lambda) = \frac{T/N}{2\pi} \frac{\sqrt{(\lambda_{\max} - \lambda)(\lambda - \lambda_{\min})}}{\lambda},$$

where $\lambda_{\min}^{\max} = 1 + N/T \pm 2\sqrt{N/T}$, and $\lambda \in [\lambda_{\min}, \lambda_{\max}]$.
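The Marchenko-Pastur bounds and density are simple to evaluate numerically. The sketch below computes them for the $N = 560$, $T = 2500$ setting of the poster and counts the eigenvalues of an empirical correlation matrix that exceed $\lambda_{\max}$. The function names are assumptions, and no real CDS data is used here.

```python
import numpy as np

def marchenko_pastur(N, T, n_points=10_000):
    """Return (lambda grid, density, lambda_min, lambda_max) for unit-variance noise."""
    q = N / T
    lam_min = 1.0 + q - 2.0 * np.sqrt(q)
    lam_max = 1.0 + q + 2.0 * np.sqrt(q)
    lam = np.linspace(lam_min, lam_max, n_points)
    root = np.sqrt(np.clip((lam_max - lam) * (lam - lam_min), 0.0, None))
    density = (T / N) / (2.0 * np.pi) * root / lam
    return lam, density, lam_min, lam_max

def count_signal_eigenvalues(X):
    """X: standardized returns, shape (N, T). Count eigenvalues above lambda_max."""
    N, T = X.shape
    C = X @ X.T / T
    lam_max = 1.0 + N / T + 2.0 * np.sqrt(N / T)
    return int(np.sum(np.linalg.eigvalsh(C) > lam_max))

lam, density, lam_min, lam_max = marchenko_pastur(N=560, T=2500)
```

For these values the noise band is roughly $[0.28, 2.17]$; eigenvalues above the upper edge are the ones the poster interprets as ‘market’ and ‘industrial sectors’.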

Figure 1: Marchenko-Pastur density vs. empirical density of the correlation matrix eigenvalues (x-axis: λ; y-axis: ρ(λ))

Notice that the Marchenko-Pastur density fits the empirical density well, meaning that most of the information contained in the empirical correlation matrix amounts to noise: only 26 eigenvalues are greater than $\lambda_{\max}$.

The highest eigenvalue corresponds to the ‘market’; the 25 others can be associated with ‘industrial sectors’.

CLUSTERING TIME SERIES

Given a correlation matrix of the returns,

Figure 2: An empirical and noisy correlation matrix

one can re-order assets using a hierarchical clustering algorithm to make the hierarchical correlation pattern blatant,

Figure 3: The same noisy correlation matrix re-ordered by a hierarchical clustering algorithm

and finally filter the noise according to the correlation pattern:

Figure 4: The resulting filtered correlation matrix
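The re-ordering step can be sketched with SciPy: build a dendrogram from the correlation-based distance and permute rows and columns by the order of its leaves. The choice of average linkage, the helper name, and the toy matrix are assumptions; the poster does not state which algorithm produced Figure 3, and the filtering step of Figure 4 is not covered here.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

def reorder_by_clustering(corr, method="average"):
    """Permute a correlation matrix so that clustered assets become adjacent."""
    dist = np.sqrt(2.0 * (1.0 - np.clip(corr, -1.0, 1.0)))
    np.fill_diagonal(dist, 0.0)
    order = leaves_list(linkage(squareform(dist, checks=False), method=method))
    return corr[np.ix_(order, order)], order

# Toy example: two clusters whose members are interleaved in the input order.
labels = np.array([0, 1, 0, 1, 0, 1])
corr = np.where(labels[:, None] == labels[None, :], 0.8, 0.1)
np.fill_diagonal(corr, 1.0)
reordered, order = reorder_by_clustering(corr)
```

After the permutation the two clusters appear as two diagonal blocks, which is the visual effect of going from Figure 2 to Figure 3.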

BEYOND CORRELATION

Sklar’s Theorem. For any random vector $X = (X_1, \ldots, X_N)$ having continuous marginal cumulative distribution functions $F_i$, its joint cumulative distribution $F$ is uniquely expressed as

$$F(X_1, \ldots, X_N) = C(F_1(X_1), \ldots, F_N(X_N)),$$

where $C$, the multivariate distribution of uniform marginals, is known as the copula of $X$.

Figure 5: ArcelorMittal and Société générale prices are projected on dependence ⊕ distribution space; notice their heavy-tailed exponential distribution.

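Let $\theta \in [0, 1]$. Let $(X, Y) \in \mathcal{V}^2$. Let $G = (G_X, G_Y)$, where $G_X$ and $G_Y$ are respectively the marginal cdfs of $X$ and $Y$. We define the following distance

$$d_\theta^2(X, Y) = \theta\, d_1^2(G_X(X), G_Y(Y)) + (1 - \theta)\, d_0^2(G_X, G_Y),$$

where $d_1^2(G_X(X), G_Y(Y)) = 3\, \mathbb{E}\left[|G_X(X) - G_Y(Y)|^2\right]$, and $d_0^2(G_X, G_Y) = \frac{1}{2} \int_{\mathbb{R}} \left( \sqrt{\frac{dG_X}{d\lambda}} - \sqrt{\frac{dG_Y}{d\lambda}} \right)^2 d\lambda$.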
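One possible empirical version of $d_\theta^2$ is sketched below. Two estimation choices are assumptions of this example: $G_X(X)$ and $G_Y(Y)$ are replaced by normalised ranks, and the densities in $d_0^2$ are approximated by histograms on shared bins, so that the integral becomes a sum over bins. The function name and the `bins` parameter are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata

def d_theta_sq(x, y, theta=0.5, bins=50):
    """Empirical d_theta^2 for two samples x, y of equal length."""
    T = len(x)
    # d_1^2: dependence part, 3 * E[(G_X(X) - G_Y(Y))^2] with ranks / T as cdf values.
    u, v = rankdata(x) / T, rankdata(y) / T
    d1_sq = 3.0 * np.mean((u - v) ** 2)
    # d_0^2: squared Hellinger distance between the margins, on a shared histogram grid.
    edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=bins)
    p = np.histogram(x, bins=edges)[0] / T
    q = np.histogram(y, bins=edges)[0] / T
    d0_sq = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
    return theta * d1_sq + (1.0 - theta) * d0_sq

# Hypothetical sample: y is x shifted far away, so ranks agree but margins differ.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + 100.0
dependence_only = d_theta_sq(x, y, theta=1.0)
distribution_only = d_theta_sq(x, y, theta=0.0)
```

The shifted sample illustrates why both terms are needed: the dependence term sees two identical series, while the distribution term sees two margins with disjoint supports.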

CLUSTERING RESULTS & STABILITY

Figure 6: (Top) The returns correlation structure appears more clearly using rank correlation; (Bottom) Clusters of returns distributions can be partly described by the returns volatility (panel: Standard Deviations Histogram; x-axis: standard deviation in basis points; y-axis: number of occurrences)

Figure 7: Stability test on Odd/Even trading days subsampling: our approach (GNPR) yields more stable clusters with respect to this perturbation than standard approaches (using Pearson correlation or L2 distances).


[1] Ricardo Coelho, Przemyslaw Repetowicz, Stefan Hutzler, and Peter Richmond. Investigation of Cluster Structure in the London Stock Exchange.

[2] Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 685–693, 2014.

[3] Paul Deheuvels. La fonction de dependance empirique et ses proprietes. Un test non parametrique d’independance. Acad. Roy. Belg. Bull. Cl. Sci. (5), 65(6):274–292, 1979.

[4] Paul Deheuvels. A non-parametric test for independence. Publications de l’Institut de Statistique de l’Universite de Paris, 26:29–50, 1981.

[5] Fabrizio Durante and Roberta Pappada. Cluster analysis of time series via Kendall distribution. In Strengthening Links Between Data Analysis and Soft Computing, pages 209–216. Springer, 2015.

[6] Eckhard Liebscher et al. Copula-based dependence measures. Dependence Modeling, 2(1):49–64, 2014.

[7] Rosario N Mantegna. Hierarchical structure in financial markets. The European Physical Journal B-Condensed Matter and Complex Systems, 11(1):193–197, 1999.

[8] Gautier Marti, Sebastien Andler, Frank Nielsen, and Philippe Donnat. Clustering financial time series: How long is enough? Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 2583–2589, 2016.

[9] Gautier Marti, Sebastien Andler, Frank Nielsen, and Philippe Donnat. Exploring and measuring non-linear correlations: Copulas, lightspeed transportation and clustering. NIPS 2016 Time Series Workshop, 55, 2016.

[10] Gautier Marti, Sebastien Andler, Frank Nielsen, and Philippe Donnat. Optimal transport vs. Fisher-Rao distance between copulas for clustering multivariate time series. In IEEE Statistical Signal Processing Workshop, SSP 2016, Palma de Mallorca, Spain, June 26-29, 2016, pages 1–5, 2016.

[11] Gautier Marti, Philippe Very, Philippe Donnat, and Frank Nielsen. A proposal of a methodological framework with experimental guidelines to investigate clustering stability on financial time series. In 14th IEEE International Conference on Machine Learning and Applications, ICMLA 2015, Miami, FL, USA, December 9-11, 2015, pages 32–37, 2015.

[12] Barnabas Poczos, Zoubin Ghahramani, and Jeff G. Schneider. Copula-based kernel dependency measures. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, 2012.

[13] A Sklar. Fonctions de repartition a n dimensions et leurs marges. Universite Paris 8, 1959.