Development of robust scatter estimators under independent contamination model
C. Agostinelli1, A. Leung2, V.J. Yohai3 and R.H. Zamar2
1 Università Ca’ Foscari di Venezia, 2 University of British Columbia, and 3 Universidad de Buenos Aires and CONICET
Mar 16, 2013
C. Agostinelli1, A. Leung2,, V.J. Yohai3 and R.H. Zamar2 Development of robust scatter estimators under independent contamination model
Some declarations
- To math geeks: I am sorry, but I will keep math equations and theorems to a minimum in my talk today (come on, it is 9 am!)
Objective of the day
Objective: robust estimation of the (location and) scatter matrix for a data set of size n with p continuous variables.
What is contamination?
Perhaps the most classical contamination model is the Huber-Tukey contamination model (HTCM) (Tukey in 1960, Huber in 1964), which was originally for 1-D data...
Contamination is row-wise, e.g.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.9 -2.8 -2.1 -0.8 -2.4 1.3 2.7 3.4 0.9 -0.1
[2,] -2.4 2.3 -1.8 -3.0 1.9 1.0 -0.5 0.4 -2.8 -1.5
[3,] 0.7 -2.3 -0.6 2.9 -1.5 -0.8 2.9 0.0 -2.6 1.8
[4,] 1.0 1.9 1.6 1.1 0.0 -2.2 1.0 -4.1 2.2 -0.9
[5,] 0.1 -1.0 1.8 2.2 -0.1 2.1 -1.3 3.1 1.2 1.0
[6,] 1.7 3.0 0.6 0.9 -1.4 1.9 -0.3 -0.4 -0.4 1.7
[7,] -0.8 1.0 2.5 3.9 -2.8 2.5 -0.3 -0.9 2.6 2.4
What is contamination?
HTCM in math notation,
x∗ = (1 − u)x + uc
where
- x = (x1, ..., xp) ∼ N(µ, Σ)
- c ∼ “something”
- u ∼ Bin(1, ε), 0 ≤ ε < 1/2
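The row-wise model above can be simulated in a few lines. This is a minimal sketch, not the authors' code: the function name `contaminate_rowwise`, the constant outlier value standing in for c ∼ “something”, and the identity covariance are all assumptions made for illustration.

```python
import numpy as np

def contaminate_rowwise(X, eps, outlier_value, rng):
    """Apply x* = (1 - u) x + u c with ONE indicator u per observation (row)."""
    n, p = X.shape
    u = rng.binomial(1, eps, size=n)        # u ~ Bin(1, eps), one draw per row
    C = np.full((n, p), outlier_value)      # assumed stand-in for c ~ "something"
    X_star = (1 - u)[:, None] * X + u[:, None] * C
    return X_star, u

rng = np.random.default_rng(0)
n, p, eps = 30, 3, 0.20
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)   # clean data
X_star, u = contaminate_rowwise(X, eps, outlier_value=10.0, rng=rng)
print("contaminated rows:", np.flatnonzero(u))
```

A row is either kept whole or replaced whole, which is what "row-wise" means here.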
New contamination model
HTCM may not be realistic...
- outliers are more likely to happen in certain variables, independent of others
- what if p is large but n is of moderate to small size?
- what if every single observation has one contaminated component?
Alqallaf, Van Aelst, Yohai and Zamar (2006) proposed a new contamination model...
Cell-wise contamination model
New contamination model
Contamination is cell-wise, e.g.
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 2.69 2.10 4.59 2.13 -1.09 2.72 -0.72 0.47 -1.42 -1.90
[2,] 2.92 2.20 -1.70 -1.83 -1.05 4.89 0.32 -1.93 -2.59 -2.48
[3,] -0.75 0.53 -3.22 3.07 4.04 -1.39 -0.26 0.44 0.05 2.14
[4,] -2.35 4.46 -0.99 -0.41 0.68 -2.79 1.37 1.74 1.35 1.78
[5,] -1.09 -2.77 4.59 -2.78 -0.97 1.35 4.10 -0.56 3.79 -0.11
[6,] -1.94 -0.33 -0.40 -3.22 1.32 0.24 -1.89 1.02 2.60 4.54
In math notation, the model is
x∗ = (1 − U)x + Uc
where x = (x1, ..., xp) and c are the same as before, except
U = diag(u1, ..., up), where the ui ∼ Bin(1, ε) are independent, 0 ≤ ε < 1/2
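The cell-wise model can be simulated the same way, except that every cell gets its own indicator. This is a minimal sketch under assumptions: the function name `contaminate_cellwise`, the constant outlier value, and the standard normal clean data are illustrative choices, not taken from the talk.

```python
import numpy as np

def contaminate_cellwise(X, eps, outlier_value, rng):
    """Apply x* = (I - U) x + U c with an independent indicator u_i per cell."""
    n, p = X.shape
    U = rng.binomial(1, eps, size=(n, p))   # row i of U holds the diagonal of U for x_i
    C = np.full((n, p), outlier_value)      # assumed stand-in for c
    X_star = (1 - U) * X + U * C
    return X_star, U

rng = np.random.default_rng(0)
n, p, eps = 1000, 3, 0.20
X = rng.standard_normal((n, p))
X_star, U = contaminate_cellwise(X, eps, outlier_value=10.0, rng=rng)

frac_cells = U.mean()               # share of contaminated cells, close to eps
frac_rows = U.any(axis=1).mean()    # share of rows with at least one bad cell
print(f"cells contaminated: {frac_cells:.3f}, rows touched: {frac_rows:.3f}")
```

Even with a modest share of bad cells, the share of rows that contain at least one bad cell is noticeably larger, and it grows with p.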
Existing robust scatter estimators
Under HTCM, we have...
- Minimum Volume Ellipsoid (MVE) (Rousseeuw, 1985)
- Minimum Covariance Determinant (MCD) (Rousseeuw, 1985)
- S-estimator (Davies, 1987)
- MM-estimator (Yohai, 1987; Tatsuoka and Tyler, 2000)
- modified GK estimator (Maronna and Zamar, 2002)
- ...
Let’s look at how these existing robust scatter estimators (e.g. MVE, S-est, MM-est) perform under HTCM and cell-wise contam.
HTCM
Let’s first illustrate through mini examples and diagrams:
- p = 3, n = 30, ε = 0.20, random covariance matrix, origin center, normal
- 95% conf. ellipsoids: MLE-clean (blue), MLE (yellow), MVE (green), S-est. (red), MM-est. (gray)
Davies’ S-estimator
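Definition (Davies, 1987): For µ ∈ R^p and positive definite Σ, the S-estimator is
\[ (\hat\mu, \tilde\Sigma) = \arg\min_{\mu,\Sigma} s(\mu,\Sigma), \qquad \hat\Sigma = s^{*}\,\tilde\Sigma \]
where s(µ,Σ) is the solution s to
\[ \frac{1}{n}\sum_{i=1}^{n} \rho\!\left( \frac{(x_i-\mu)^{T}\Sigma^{-1}(x_i-\mu)\,|\Sigma|^{1/p}}{s} \right) = \frac{1}{2}, \]
where ρ(·) is some bounded monotone loss function whose tuning constant c must satisfy
\[ E_{\Phi}\!\left[ \rho\!\left( \frac{\|X\|^{2}}{c} \right) \right] = \frac{1}{2}. \]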
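To make the scale s(µ,Σ) in the definition concrete, the sketch below solves the defining equation for s by bisection at a fixed (µ,Σ). It is an illustration under assumptions, not the authors' implementation: the Tukey bisquare form of ρ, the placeholder tuning constant `c = 1.0`, and the function names are hypothetical choices, and the outer minimization over (µ,Σ) is not attempted.

```python
import numpy as np

def rho_bisquare(t):
    """Bounded monotone loss on squared distances (assumed): min(1, 1 - (1 - t)^3)."""
    t = np.clip(t, 0.0, 1.0)
    return 1.0 - (1.0 - t) ** 3

def solve_scale(t, b=0.5, iters=200):
    """Find s > 0 with mean(rho(t / s)) = b; the left side decreases in s."""
    def excess(s):
        return np.mean(rho_bisquare(t / s)) - b
    lo, hi = 1e-12, 1.0
    while excess(hi) > 0:          # grow the bracket until the mean drops below b
        hi *= 2.0
    for _ in range(iters):         # plain bisection
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def s_scale(X, mu, Sigma, c=1.0, b=0.5):
    """Scale s(mu, Sigma) from the S-estimator equation; c is a placeholder constant."""
    n, p = X.shape
    R = X - mu
    d = np.einsum("ia,ab,ib->i", R, np.linalg.inv(Sigma), R)   # squared Mahalanobis
    t = d * np.linalg.det(Sigma) ** (1.0 / p) / c              # normalized distances
    return solve_scale(t, b)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
mu, Sigma = np.zeros(3), np.eye(3)
print("s(mu, Sigma) =", s_scale(X, mu, Sigma))
```

Because the distances are multiplied by |Σ|^{1/p}, rescaling Σ by a constant leaves s unchanged, so the minimization only searches over the shape of Σ.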
MM-estimator (a two-stage estimator)
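Definition: For µ ∈ R^p and positive definite Σ, the MM-estimator is
\[ (\hat\mu, \hat\Sigma) = \arg\min_{\mu,\Sigma} J(\mu,\Sigma) \]
where
\[ J(\mu,\Sigma) = \frac{1}{n}\sum_{i=1}^{n} \rho_2\!\left( \frac{(x_i-\mu)^{T}\Sigma^{-1}(x_i-\mu)\,|\Sigma|^{1/p}}{s_n} \right) \]
with ρ2(·) being a different loss function, i.e. ρ2(·) ≤ ρ1(·), where ρ1(·) is the loss used for the S-estimate, and s_n being the scale from the S-estimate.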
Cell-wise contamination
- p = 3, n = 30, ε = 0.20, random covariance matrix, origin center, normal
- 95% conf. ellipsoids: MLE-clean (blue), MLE (yellow), MVE (green), S-est. (red), MM-est. (gray)
Composite S-estimator
The MVE, S- and MM-estimators perform very badly under cell-wise contam....
Note that in our cell-wise contam. example, P(≥ 1 variable is contam.) = 1 − (1 − ε)^p = 0.488.
In fact, all affine equivariant estimators for covariance collapse under cell-wise contam. (Alqallaf et al., 2009)!
We need to develop a new estimator...
Composite-S estimator (CSE)
...but this estimator is not affine equivariant, and affine equivariance is what saves an estimator from failing under HTCM!
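The probability quoted on this slide is easy to check, and the same formula shows how fast the share of affected observations grows with p. A minimal sketch follows; the function name is hypothetical, and the extra (ε, p) pairs are only illustrative values.

```python
def prob_row_contaminated(eps, p):
    """P(at least one of p independent cells is contaminated) = 1 - (1 - eps)^p."""
    return 1.0 - (1.0 - eps) ** p

# The example from the slide: eps = 0.20, p = 3
print(round(prob_row_contaminated(0.20, 3), 3))   # 0.488

# Illustrative larger dimensions at eps = 0.10
for p in (10, 20):
    print(p, round(prob_row_contaminated(0.10, p), 3))
```

Once this probability passes 1/2, most rows contain at least one outlying cell, which is why estimators that treat each row as a unit run into trouble.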
Composite S-estimator
In short, CSE attempts to minimize the size of the covariance (e.g. “ellipses”) for each pair of variables simultaneously, instead of for all variables at once.
It tries to downweight bivariate Mahalanobis distances, instead of full ones, when constructing the covariance matrix.
Now let’s have an example; we will get back to its definition later...
Composite S-estimator
Example: p = 5, n = 100, ε = 0.10, random covariance matrix, origin center, normal, cell-wise contam.
95% confidence region based on Davies’ S-estimator vs true covariance:
[Figure: scatter plot matrix of V1 to V5. Legend: true, S-est.]
95% confidence region based on CSE:
[Figure: scatter plot matrix of V1 to V5. Legend: true, CSE.]
95% confidence region based on CSE versus S-est. based on each pair:
[Figure: scatter plot matrix of V1 to V5. Legend: true, CSE, Pairwise-S.]
Composite S-estimator
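Definition (CSE): For a given robust initial estimator Ω0,
\[ (\hat\mu, \tilde\Sigma) = \arg\min_{\mu,\Sigma} s(\mu,\Sigma,\Omega_0), \qquad \hat\Sigma = s^{*}\,\tilde\Sigma \]
where s(µ,Σ,Ω0) is the solution s to
\[ \frac{2}{p(p-1)\,n} \sum_{i=1}^{n} \sum_{k=1}^{p-1} \sum_{j=k+1}^{p} \rho\!\left( \frac{d_i^{jk}(\mu,\Sigma)}{s\,c_0} \cdot \frac{|\Sigma^{jk}|^{1/2}}{|\Omega_0^{jk}|^{1/2}} \right) = \frac{1}{2}. \]
Here
\[ d_i^{jk}(\mu,\Sigma) = (x_i^{jk}-\mu^{jk})^{T}\,(\Sigma^{jk})^{-1}\,(x_i^{jk}-\mu^{jk}) \]
is the bivariate Mahalanobis distance, and c0 must satisfy the same criterion as in Davies’ S-estimator, but in the bivariate case.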
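The composite scale in this definition can be evaluated numerically at a fixed (µ,Σ). The sketch below computes the bivariate distances for every pair and solves the equation for s by bisection. It rests on assumptions that are not from the talk: the bisquare form of ρ, the placeholder `c0 = 1.0`, the identity matrix as Ω0, and all function names. The minimization over (µ,Σ) is not attempted.

```python
import numpy as np
from itertools import combinations

def rho_bisquare(t):
    """Bounded monotone loss on squared distances (assumed form)."""
    t = np.clip(t, 0.0, 1.0)
    return 1.0 - (1.0 - t) ** 3

def pairwise_distances(X, mu, Sigma):
    """Bivariate squared Mahalanobis distances d_i^{jk}, one column per pair."""
    n, p = X.shape
    pairs = list(combinations(range(p), 2))
    D = np.empty((n, len(pairs)))
    for m, (j, k) in enumerate(pairs):
        idx = [j, k]
        R = X[:, idx] - mu[idx]
        S_inv = np.linalg.inv(Sigma[np.ix_(idx, idx)])
        D[:, m] = np.einsum("ia,ab,ib->i", R, S_inv, R)
    return D, pairs

def composite_scale(X, mu, Sigma, Omega0, c0=1.0, b=0.5, iters=200):
    """Solve the composite equation for s; c0 is a placeholder tuning constant."""
    D, pairs = pairwise_distances(X, mu, Sigma)
    ratio = np.array([
        np.sqrt(np.linalg.det(Sigma[np.ix_(q, q)]) / np.linalg.det(Omega0[np.ix_(q, q)]))
        for q in pairs
    ])
    T = D * ratio / c0                     # argument of rho, before dividing by s
    def excess(s):
        return np.mean(rho_bisquare(T / s)) - b   # mean over all (i, pair) terms
    lo, hi = 1e-12, 1.0
    while excess(hi) > 0:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
mu, Sigma, Omega0 = np.zeros(5), np.eye(5), np.eye(5)
print("composite scale:", composite_scale(X, mu, Sigma, Omega0))
```

A single outlying cell in variable j only inflates the distances of the pairs that contain j, so the remaining pairs of that observation still contribute clean information.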
Composite MM-estimator
CSE in general is robust under cell-wise contam. but not efficient.
Efficiency is a measurement of variability of the estimate relative to some gold standard, such as MLE, under no contamination.
We use the corresponding MM-version (Tatsuoka and Tyler, 2000) of CSE, the composite MM-estimator (CMME), to achieve efficiency.
Composite S- and MM-estimator
Both have very nice but complex estimation procedures that are closely linked with the S-estimator with missing data (Danilov et al., 2012), but we will not describe them here.
Some results shown in ICORS 2012
We performed a Monte Carlo study to assess the behavior ofthe proposed estimators.
Simulation setting:
- x ∼ N(0, Σ0), some n and p
- Σ0 is exchangeable correlation, i.e.
\[ \Sigma_0 = \begin{pmatrix} 1 & r & \cdots & r \\ r & 1 & \cdots & r \\ \vdots & \vdots & \ddots & \vdots \\ r & r & \cdots & 1 \end{pmatrix} \]
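The simulation setting can be reproduced with a short helper. This is a minimal sketch: the function name `exchangeable_corr` and the seed are assumptions, while p = 10, n = 100 and r = 0.5 match one of the configurations reported in the talk.

```python
import numpy as np

def exchangeable_corr(p, r):
    """p x p matrix with ones on the diagonal and r everywhere else."""
    return (1.0 - r) * np.eye(p) + r * np.ones((p, p))

rng = np.random.default_rng(0)
p, n, r = 10, 100, 0.5
Sigma0 = exchangeable_corr(p, r)
X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)   # clean sample x ~ N(0, Sigma0)
print(Sigma0[:3, :3])
```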
Some results shown in ICORS 2012
Here we show some results for
- Correlations: r = 0.5 and r = 0.9
- p = 10 and n = 100.
- p = 20 and n = 200.
Some results shown in ICORS 2012
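Performance criteria:
1. Likelihood ratio test distance (LRT) for robustness evaluation
\[ \bar{D}(\hat\Sigma,\Sigma_0) = \frac{1}{N}\sum_{i=1}^{N} D(\hat\Sigma_i,\Sigma_0) \]
where \hat\Sigma_i is the estimate from the i-th Monte Carlo replication and
\[ D(\Sigma,\Sigma_0) = \operatorname{trace}(\Sigma_0^{-1}\Sigma) - \log\det(\Sigma_0^{-1}\Sigma) - p \]
2. Relative efficiency based on LRT values for efficiency evaluation
\[ \bar{D}(\hat\Sigma_{MLE},\Sigma_0) \,/\, \bar{D}(\hat\Sigma,\Sigma_0) \]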
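The LRT distance is straightforward to compute. The sketch below evaluates D(Σ,Σ0) and averages it over replications, using the sample covariance as a stand-in for the MLE. The function names, the number of replications and the identity Σ0 are assumptions for illustration, and the robust estimators themselves are not implemented here.

```python
import numpy as np

def lrt_distance(Sigma, Sigma0):
    """D(Sigma, Sigma0) = trace(Sigma0^{-1} Sigma) - log det(Sigma0^{-1} Sigma) - p."""
    p = Sigma0.shape[0]
    M = np.linalg.solve(Sigma0, Sigma)        # Sigma0^{-1} Sigma without an explicit inverse
    _, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - p

def average_lrt(estimates, Sigma0):
    """Average LRT distance over a list of estimated scatter matrices."""
    return float(np.mean([lrt_distance(S, Sigma0) for S in estimates]))

rng = np.random.default_rng(0)
p, n, N = 5, 100, 50                          # N replications (illustrative)
Sigma0 = np.eye(p)
estimates = [np.cov(rng.standard_normal((n, p)), rowvar=False) for _ in range(N)]
print("average LRT distance of the sample covariance:", average_lrt(estimates, Sigma0))
```

The relative efficiency of another estimator would then be the ratio of this average to the average obtained from that estimator's own estimates on clean data.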
Monte Carlo results
Gaussian Efficiency Without Outliers
ESTIMATES     p = 10, n = 100        p = 20, n = 200
              r = 0.5    r = 0.9     r = 0.5    r = 0.9
S-est         0.91       0.90        0.96       0.96
Pairwise-S    0.25       0.45        0.36       0.37
CSE           0.70       0.50        0.74       0.44
CMME          0.74       0.78        0.81       0.60
Monte Carlo results
n = 100, p = 10, ε = 10%
[Figure: average LRT distance versus outlier size (5 to 20) under 10% contamination (n = 100, p = 10). Panels: Corr. = 0.5 ICM, Corr. = 0.9 ICM, Corr. = 0.5 THCM, Corr. = 0.9 THCM. Legend: Pairwise-S, CS (QC), Classical-S, CMM (QC).]
Remarks and conclusion
- In general, CSE (and CMME) are very robust under cell-wise contam.
- We have seen that CSE (and CMME) do not perform very well under HTCM
- Our goal is to have an estimator highly robust under both HTCM and cell-wise contam. (we are ambitious!)
- ...while efficiency is our second priority
To be continued....
Acknowledgement
Special thanks to Professor R. Zamar and Professor V. Yohai!
Prof. Zamar Prof. Yohai
...AND THANK YOU FOR LISTENING!