Source: w3.impa.br/~rimfo/ebp_subgaussian.pdf
TRANSCRIPT
Sub-Gaussian estimators under heavy tails
Roberto Imbuzeiro Oliveira
XIX Escola Brasileira de Probabilidade, Maresias, August 6th, 2015
Joint with Luc Devroye (McGill), Matthieu Lerasle (CNRS/Nice), and Gábor Lugosi (ICREA/UPF)
Our problem (and why it's interesting)
Our problem
We want to estimate the mean of a probability distribution over the real line from an i.i.d. sample.
This is, or is closely related to, many fundamental statistical tasks.
Our problem
We assume finite variances, but as little else as possible.
Interesting in theory, important in practice.
Our problem
Want nearly optimal tail bounds, uniformly over large classes of distributions.
High-confidence estimates are sometimes necessary.
Formal statement
Given: $\mathcal{P}$, a family of probability distributions over $\mathbb{R}$. For $P \in \mathcal{P}$, $\mu_P$ and $\sigma_P^2$ are the mean and variance of $P$.
Want: for each large enough $n \in \mathbb{N}$, an estimator $\hat{E}_n : \mathbb{R}^n \to \mathbb{R}$ and a parameter $\delta_{\min} = \delta_{\min,n} \in [0,1)$ such that, if $X_1^n = (X_1, \dots, X_n)$ is i.i.d. from $P \in \mathcal{P}$, then
$$\forall \delta \in [\delta_{\min}, 1): \quad \mathbb{P}\left( |\hat{E}_n(X_1^n) - \mu_P| > L\,\sigma_P \sqrt{\frac{1 + \ln(1/\delta)}{n}} \right) \le \delta.$$
In this definition: the family $\mathcal{P}$ should be very large (nonparametric); $\delta_{\min}$ should be very small (exponentially small in $n$?); and $L$ is a constant (it may depend on the family).
Why sub-Gaussian?
What we ask for is basically that the estimator has Gaussian-like fluctuations around the mean:
$$\mathbb{P}\left( |\hat{E}_n(X_1^n) - \mu_P| > \frac{\lambda\,\sigma_P}{\sqrt{n}} \right) \le C_1\, e^{-\lambda^2/C_2}.$$
Catoni: Gaussian-like fluctuations are optimal for "reasonable" families of distributions (more on this below).
Why is this interesting?
The estimator must turn heavy tails into light tails! (Tail surgery?)
Why is this interesting?
Related (weaker) estimators have been applied to problems in Statistics and Machine Learning. Our notion could improve these results.
Audibert and Catoni + Hsu and Sabato (least squares), Bubeck et al. (bandits), Brownlees et al. (empirical risk minimization).
When is this possible?
This is the main subject of our paper.
We present our results before we move on.
Our results
First result
Assumption: variance known up to an interval.
Partially known variance
Example: $\mathcal{P}_2^{[\sigma_1^2, \sigma_2^2]}$ := all distributions with variance $\sigma_P^2 \in [\sigma_1^2, \sigma_2^2]$.
We let $R := \sigma_2/\sigma_1$ (may depend on $n$).
Theorem: If $R$ is bounded, then for all large enough $n$ there exist $\hat{E}_n : \mathbb{R}^n \to \mathbb{R}$, $\delta_{\min} \approx e^{-c n}$, and a constant $L$ such that, when $P \in \mathcal{P}_2^{[\sigma_1^2, \sigma_2^2]}$ and $X_1^n \sim_d P^{\otimes n}$,
$$\forall \delta \in [\delta_{\min}, 1): \quad \mathbb{P}\left( |\hat{E}_n(X_1^n) - \mu_P| > L\,\sigma_P \sqrt{\frac{1 + \ln(1/\delta)}{n}} \right) \le \delta.$$
If $R$ is unbounded, any sequence $\delta_{\min} \to 0$ fails.
This is optimal up to the exact values of $c > 0$ and $L > 0$. And the dichotomy is real: the failure for unbounded $R$ is truly different behavior!
Second result
Assumption: (slightly) higher moments.
Higher moments
Example: $\mathcal{P}_{\alpha,\eta}$ := all distributions with
$$\mathbb{E}_P|X - \mu_P|^\alpha \le (\eta\,\sigma_P)^\alpha$$
(here $\alpha \in (2,3)$ is fixed; $\eta \ge \eta_0$ may depend on $n$).
Theorem: for all large enough $n$, if $k_{\alpha,\eta} := (C\,\eta)^{2\alpha/(\alpha-2)}$, there exist $\hat{E}_n : \mathbb{R}^n \to \mathbb{R}$, $\delta_{\min} \approx e^{-c n/k_{\alpha,\eta}}$, and a constant $L$ such that, when $P \in \mathcal{P}_{\alpha,\eta}$ and $X_1^n \sim_d P^{\otimes n}$,
$$\forall \delta \in [\delta_{\min}, 1): \quad \mathbb{P}\left( |\hat{E}_n(X_1^n) - \mu_P| > L\,\sigma_P \sqrt{\frac{1 + \ln(1/\delta)}{n}} \right) \le \delta.$$
Optimal up to the value of $c > 0$.
An extension
It suffices to assume that the family $\mathcal{P}$ is $k$-regular:
$$\exists k \in \mathbb{N},\ \forall P \in \mathcal{P},\ \forall j \ge k:\ \text{if } X_1^j \sim_d P^{\otimes j}, \quad \mathbb{P}\left( \pm\frac{1}{j}\sum_{i=1}^j (X_i - \mu_P) \ge 0 \right) \ge \frac{1}{3}.$$
For instance, symmetric distributions are 1-regular: the centered sum is itself symmetric, so each sign occurs with probability at least $1/2 \ge 1/3$.
Third result
Assumption: bounded kurtosis. Under bounded kurtosis one can get a nearly optimal constant, $L = \sqrt{2} + \varepsilon$.
This will be further discussed later.
Some background
History
Typical analyses of estimators for means are based on expectations, not deviations.
Exceptions do exist (e.g., Kolmogorov's CLT for medians), but the assumptions and goals are different.
History
Catoni’s paper (AIHP Prob. Stat. 2012) seems to be the first to focus on deviations as a fundamental problem.
We’ll mention some more applied results later.
Gaussian lower bound
Recall the normal cumulative distribution function:
$$\Phi(r) := \int_{-\infty}^{r} e^{-x^2/2}\,\frac{dx}{\sqrt{2\pi}}, \qquad \Phi^{-1}(1-\delta) \sim \sqrt{2\ln(1/\delta)} \quad \text{for } \delta \ll 1.$$
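A quick numerical sanity check of this asymptotic (an illustrative Python snippet; `scipy.stats.norm.isf` computes $\Phi^{-1}(1-\delta)$):

```python
# Check Phi^{-1}(1 - delta) ~ sqrt(2 ln(1/delta)) numerically.
import numpy as np
from scipy.stats import norm

for delta in (1e-2, 1e-4, 1e-8):
    exact = norm.isf(delta)                      # Phi^{-1}(1 - delta)
    approx = np.sqrt(2.0 * np.log(1.0 / delta))  # sub-Gaussian asymptotic
    print(f"delta={delta:.0e}  exact={exact:.3f}  approx={approx:.3f}")
# The ratio exact/approx creeps toward 1 as delta -> 0
# (roughly 0.77, 0.87, 0.92 at the three levels above).
```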
Gaussian lower bound
Family: $\mathcal{P}^{\sigma^2}_{\mathrm{Gauss}}$, all Gaussian distributions over $\mathbb{R}$ with variance $\sigma^2 > 0$.
Thm (Catoni): for any $n$,
$$\inf_{\hat{E}_n}\ \sup_{\substack{P \in \mathcal{P}^{\sigma^2}_{\mathrm{Gauss}} \\ X_1^n \sim_d P^{\otimes n}}} \mathbb{P}\left( \hat{E}_n(X_1^n) - \mu_P \ge \Phi^{-1}(1-\delta)\,\frac{\sigma_P}{\sqrt{n}} \right) = \delta.$$
A similar result holds for the lower tail.
The deviation $\Phi^{-1}(1-\delta)\,\sigma_P/\sqrt{n}$ is asymptotic to $L\sqrt{\ln(1/\delta)}\,\frac{\sigma_P}{\sqrt{n}}$ with $L = \sqrt{2}$.
Compare with the definition above: the sub-Gaussian property asks precisely for deviations $L\,\sigma_P\sqrt{(1+\ln(1/\delta))/n}$, so $L = \sqrt{2}$ is the benchmark constant that no estimator can beat over a family containing all Gaussians.
The empirical mean
It follows from Catoni's result that the empirical mean
$$\hat{E}_n(X_1^n) := \frac{1}{n}\sum_{i=1}^n X_i$$
has optimal deviations for all Gaussian distributions.
This is an exception, rather than the rule.
Empirical mean fails
Example: $\mathcal{P}_2^{\sigma^2}$, all distributions with variance $\sigma_P^2 = \sigma^2$.
Thm (Catoni): Chebyshev is basically optimal:
$$\sup_{\substack{P \in \mathcal{P}_2^{\sigma^2} \\ X_1^n \sim_d P^{\otimes n}}} \mathbb{P}\left( \left|\frac{1}{n}\sum_{i=1}^n X_i - \mu_P\right| > \frac{c\,\sigma_P}{\sqrt{\delta\,n}} \right) \ge \delta.$$
Empirical mean fails
Example: $\mathcal{P}^{\kappa}_{\mathrm{krt}}$, all distributions with kurtosis $\kappa_P := \mathbb{E}_P|X - \mu_P|^4/\sigma_P^4 \le \kappa$.
Thm (Catoni): if $n$ is large and $\delta \ge 1/n$,
$$\sup_{\substack{P \in \mathcal{P}^{\kappa}_{\mathrm{krt}} \\ X_1^n \sim_d P^{\otimes n}}} \mathbb{P}\left( \left|\frac{1}{n}\sum_{i=1}^n X_i - \mu_P\right| > \frac{c\,\sigma_P}{(\delta\,n)^{1/4}} \right) \ge \delta.$$
Positive results
Catoni obtained sharp sub-Gaussian estimators in some settings.
Unfortunately, they depend on the confidence level!
One example
Example: $\mathcal{P}_2^{\sigma^2}$, all distributions with variance $\sigma_P^2 = \sigma^2$.
Thm (Catoni): Set $\delta_{\min} := e^{-o(n)}$. Then $\forall \delta \in [\delta_{\min}, 1)$, there exists a $\delta$-dependent $\hat{E}_{n,\delta}$ with
$$\sup_{\substack{P \in \mathcal{P}_2^{\sigma^2} \\ X_1^n \sim_d P^{\otimes n}}} \mathbb{P}\left( |\hat{E}_{n,\delta}(X_1^n) - \mu_P| > \sigma_P\sqrt{\frac{(2+o(1))\ln(2/\delta)}{n}} \right) \le \delta.$$
Why is this bad?
Suppose you want high confidence. The only guarantee is that the probability of a huge error is very low.
Nothing is known about the probability of average-to-large errors in more typical events.
Why is this bad?
Statistical and machine learning applications (Bubeck et al., Brownlees et al., Hsu/Sabato) had to cope with this dependence on the confidence level.
In all cases, something was lost.
Our results are “better”
… or rather, genuinely different.
Our results imply that parameter-dependent estimators are easier to obtain.
We’ll see that right now.
Median of means
A simple construction of a sub-Gaussian, parameter-dependent estimator that only requires finite second moments.
Known for a long time, in many forms, in different communities (Nemirovski/Yudin, Alon/Matias/Szegedy, Levin, Jerrum/Sinclair, Hsu…). "Pre-history".
Median of means
Example: $\mathcal{P}_2^{\sigma^2}$, all distributions with variance $\sigma_P^2 = \sigma^2$.
Thm: Set $\delta_{\min} := e^{1-n/2}$. Then $\forall \delta \in [\delta_{\min}, 1)$, there exists a $\delta$-dependent $\hat{E}_{n,\delta}$ with
$$\sup_{\substack{P \in \mathcal{P}_2^{\sigma^2} \\ X_1^n \sim_d P^{\otimes n}}} \mathbb{P}\left( |\hat{E}_{n,\delta}(X_1^n) - \mu_P| > L\,\sigma_P\sqrt{\frac{1+\ln(2/\delta)}{n}} \right) \le \delta.$$
Median of means
Sample: $X_1^n := (X_1, X_2, X_3, \dots, X_n)$ from distribution $P$.
Blocks: split $\{1, 2, \dots, n\} = B_1 \cup B_2 \cup \dots \cup B_b$ into disjoint blocks of size $n/b$.
Means: for each block $B_\ell$, define
$$Y_\ell := \frac{b}{n}\sum_{i \in B_\ell} X_i.$$
Median of means:
$$\hat{E}_{n,\delta}(X_1^n) := \text{median of } (Y_1, Y_2, \dots, Y_b).$$
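In code, the construction is a few lines. A minimal sketch in Python/NumPy (the choice $b \approx \ln(1/\delta)$ anticipates the analysis below; clipping $b$ to $[1, n]$ is our own guard, not part of the slide):

```python
import numpy as np

def median_of_means(x, delta):
    """Median-of-means estimate of the mean at confidence level delta.

    Splits the sample into b ~ ln(1/delta) disjoint blocks, averages each
    block, and returns the median of the b block means.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    b = min(n, max(1, int(np.ceil(np.log(1.0 / delta)))))  # b ~ ln(1/delta)
    blocks = np.array_split(x, b)             # disjoint blocks of size ~ n/b
    return np.median([blk.mean() for blk in blocks])
```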
Analysis
Interval: $\left[\mu_P - L\,\sigma_P\sqrt{b/n},\ \mu_P + L\,\sigma_P\sqrt{b/n}\right]$ around $\mu_P$ on the real line.
Want: the median of $Y_1, \dots, Y_b$ lands in the interval. Sufficient: more than half of the $Y_\ell$'s are in there.
Each $Y_\ell = \frac{b}{n}\sum_{i \in B_\ell} X_i$ with the $X_i$ i.i.d. from $P$, so $\mathbb{E}(Y_\ell) = \mu_P$ and $\mathrm{Var}(Y_\ell) = b\,\sigma_P^2/n$.
By Chebyshev, $\mathbb{P}(Y_\ell \notin \text{interval}) \le L^{-2}$. Disjoint blocks $\Rightarrow$ these events are independent.
So the probability that $\ge b/2$ of the $Y_\ell$'s miss the interval is bounded by a binomial tail probability: if $L$ is large,
$$\mathbb{P}\left( \mathrm{Bin}(b, L^{-2}) \ge b/2 \right) \le e^{-b}$$
(for instance, this tail is at most $2^b L^{-b} = (2/L)^b \le e^{-b}$ once $L \ge 2e$).
Take $b \approx \ln(1/\delta)$ and we're done.
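An illustrative experiment (not from the talk), reusing `median_of_means` from the sketch above: compare the high-confidence error of the empirical mean with that of median of means on a heavy-tailed Pareto distribution with finite variance. All parameters here are arbitrary choices.

```python
# Illustrative Monte Carlo: on a Pareto with tail index 2.5 (finite
# variance, heavy right tail), compare the (1 - delta)-quantile of the
# absolute error of the empirical mean vs. median of means.
import numpy as np

rng = np.random.default_rng(0)
alpha = 2.5
mu = alpha / (alpha - 1.0)        # mean of the Pareto(alpha) on [1, inf)
n, reps, delta = 200, 20000, 1e-3

errs_mean, errs_mom = [], []
for _ in range(reps):
    x = (1.0 - rng.random(n)) ** (-1.0 / alpha)   # inverse-CDF sampling
    errs_mean.append(abs(x.mean() - mu))
    errs_mom.append(abs(median_of_means(x, delta) - mu))

q = 1.0 - delta
print("empirical mean :", np.quantile(errs_mean, q))
print("median of means:", np.quantile(errs_mom, q))
# The high-confidence error of the empirical mean is typically noticeably
# larger: its tail is driven by the single largest observation.
```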
Our proof ideas
Exponential is optimal
Family: $\mathcal{P}_{\mathrm{La}}$, all Laplace distributions $\mathrm{La}_\lambda$ with $\lambda \in \mathbb{R}$ and
$$\frac{d\mathrm{La}_\lambda}{dx}(x) = \frac{e^{-|x-\lambda|}}{2}.$$
Property:
$$e^{-|\lambda|\,n} \le \frac{d\mathrm{La}_\lambda^{\otimes n}}{d\mathrm{La}_0^{\otimes n}}(x) \le e^{|\lambda|\,n}.$$
Consequence: any estimator with constant $L$ will mistake a $\mathrm{La}_0$ sample for a $\mathrm{La}_{10L^2}$ sample with prob. $\approx e^{1-5L^2 n}$.
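To spell out the arithmetic (our own sketch; the slide's exact constants may differ): test whether $\hat{E}_n \ge 5L^2$. The density-ratio property with $\lambda = 10L^2$ gives, for large $n$,
$$\mathbb{P}_{\mathrm{La}_0}\left( \hat{E}_n \ge 5L^2 \right) \ \ge\ e^{-10L^2 n}\, \mathbb{P}_{\mathrm{La}_{10L^2}}\left( \hat{E}_n \ge 5L^2 \right) \ \ge\ \tfrac{1}{2}\, e^{-10L^2 n},$$
since a sub-Gaussian estimator under $\mathrm{La}_{10L^2}$ concentrates near $10L^2$. But under $\mathrm{La}_0$ (variance 2), the deviation $5L^2$ corresponds to confidence $\delta^* = e^{1 - \frac{25}{2}L^2 n}$; if $\delta_{\min} \le \delta^*$, the sub-Gaussian property would force the left-hand side below $\delta^* \ll e^{-10L^2 n}$, a contradiction. Hence $\delta_{\min}$ cannot decay faster than $e^{-cL^2 n}$: exponentially small in $n$ is the best possible.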
Partially known variance (recap)
Recall the first result: for $\mathcal{P}_2^{[\sigma_1^2, \sigma_2^2]}$ with $R := \sigma_2/\sigma_1$ bounded, there is a sub-Gaussian $\hat{E}_n$ with $\delta_{\min} \approx e^{-cn}$; if $R$ is unbounded, any sequence $\delta_{\min} \to 0$ fails.
Why unbounded fails
Family: $\mathcal{P}^{[c/n,\,Rc/n]}_{\mathrm{Po}}$, Poisson distributions with very small means $c/n \le \mu_P \le Rc/n$.
Recall mean = variance for a Poisson!
$X_1^n$ := sample with mean $c/n$, $S_X := X_1 + \dots + X_n$.
$Y_1^n$ := sample with mean $Rc/n$, $S_Y := Y_1 + \dots + Y_n$.
Why unbounded fails
Assume a good estimator $\hat{E}_n$ with constant $L$. Then
$$\mathbb{P}\left( n\,\hat{E}_n(Y_1^n) \ge Rc/2 \right) \ge 1 - e^{1 - \frac{Rc}{4L^2}}.$$
In particular, $\mathbb{P}\left( n\,\hat{E}_n(Y_1^n) \ge Rc/2 \,\middle|\, S_Y = Rc \right) \approx 1$.
Same for $X$ as for $Y$! (The sample sum is a sufficient statistic, so conditionally on $S_X = Rc$ the $X$-sample has the same law as the $Y$-sample conditioned on $S_Y = Rc$.)
Why unbounded fails
$$\mathbb{P}\left( n\,\hat{E}_n(X_1^n) \ge Rc/2 \,\middle|\, S_X = Rc \right) \approx 1.$$
So
$$\mathbb{P}\left( n\,\hat{E}_n(X_1^n) \ge Rc/2 \right) \ge \mathbb{P}(S_X = Rc) \approx e^{-R\ln(R)\,c}.$$
On the other hand, this probability should be $\approx e^{-\frac{R^2 c}{L^2}}$ by the sub-Gaussian estimation property. Since $R^2 \gg R\ln R$, this is a contradiction for $R$ large.
The positive result
Recall the theorem: for $\mathcal{P}_2^{[\sigma_1^2, \sigma_2^2]}$ with $R := \sigma_2/\sigma_1$ bounded, a single estimator $\hat{E}_n$ achieves the sub-Gaussian property with $\delta_{\min} \approx e^{-cn}$ and a constant $L$. Here is the construction.
Confidence intervals
Use median of means to get a confidence interval:
$$\hat{I}_{n,\delta}(X_1^n) := \left[ \hat{E}_{n,\delta}(X_1^n) \pm L\,\sigma_2\sqrt{\frac{1+\ln(1/\delta)}{n}} \right],$$
$$\mathbb{P}\left( \mu_P \in \hat{I}_{n,\delta}(X_1^n) \ \text{and}\ |\hat{I}_{n,\delta}(X_1^n)| \le 2LR\,\sigma_P\sqrt{\frac{1+\ln(1/\delta)}{n}} \right) \ge 1 - \delta.$$
Confidence intervals
We'll combine sub-Gaussian confidence intervals to obtain a single sub-Gaussian estimator.
Similar in spirit to Lepskii’s adaptation method from nonparametric statistics.
Confidence intervals
Lemma: let $I_1, I_2, \dots, I_K$ be random nonempty closed intervals. Assume $\mu \in \mathbb{R}$ satisfies $\mathbb{P}(\mu \notin I_k) \le 2^{-k}$, $1 \le k \le K$. Set
$$\hat{K} := \min\{k \le K : \cap_{j=k}^{K} I_j \ne \emptyset\},$$
and let $\hat{E}$ := midpoint of $\cap_{j=\hat{K}}^{K} I_j$. Then
$$\forall\, 1 \le k \le K: \quad \mathbb{P}\left( |\hat{E} - \mu| > |I_k| \right) \le 2^{1-k}.$$
Proof sketch
Recall $\hat{K} := \min\{k \le K : \cap_{j=k}^{K} I_j \ne \emptyset\}$ and $\hat{E}$ := midpoint of $\cap_{j=\hat{K}}^{K} I_j$.
Assume $\mu \in I_j$ for all $j \ge k$. Then $\cap_{j=k}^{K} I_j \ne \emptyset$, so $\hat{K} \le k$, and $\cap_{j=\hat{K}}^{K} I_j \subseteq I_k$; hence both $\hat{E}$ and $\mu$ lie in $I_k$, so $|\hat{E} - \mu| \le |I_k|$ under the assumption.
$$\Rightarrow \quad \mathbb{P}\left( |\hat{E} - \mu| > |I_k| \right) \le \sum_{j \ge k} \mathbb{P}(\mu \notin I_j) \le 2^{1-k}.$$
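A minimal sketch of the resulting $\delta$-independent estimator in Python (reusing `median_of_means` from above; the constant `L = 3.0` and the parameter `K` are illustrative stand-ins, and the half-widths follow the slide's $L\sigma_2\sqrt{(1+\ln(1/\delta_k))/n}$ with $\delta_k = 2^{-k}$):

```python
import numpy as np

def combined_estimator(x, sigma2, K, L=3.0):
    """delta-independent estimator via the interval-intersection lemma.

    Builds median-of-means confidence intervals I_k at levels delta_k = 2^{-k}
    for k = 1..K, finds K_hat = min{k : I_k cap ... cap I_K is nonempty},
    and returns the midpoint of that intersection.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    deltas = 2.0 ** (-np.arange(1, K + 1))
    centers = np.array([median_of_means(x, d) for d in deltas])
    halves = L * sigma2 * np.sqrt((1.0 + np.log(1.0 / deltas)) / n)
    lows, highs = centers - halves, centers + halves
    for k in range(K):                  # k = 0 corresponds to I_1
        lo, hi = lows[k:].max(), highs[k:].min()
        if lo <= hi:                    # intersection I_{k+1},...,I_K nonempty
            return 0.5 * (lo + hi)      # midpoint of the intersection
    # unreachable: the intersection over {K} alone is the nonempty I_K
```

By the lemma, the error exceeds $|I_k|$ with probability at most $2^{1-k}$ for every $k$ simultaneously, which after adjusting constants is exactly a sub-Gaussian guarantee with $\delta_{\min} = 2^{-K}$.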
Other uses
The same scheme proves the second result, for the higher-moment family $\mathcal{P}_{\alpha,\eta}$ of distributions with $\mathbb{E}_P|X - \mu_P|^\alpha \le (\eta\,\sigma_P)^\alpha$.
Use quantiles of means (instead of medians of means) to build the confidence intervals.
Berry-Esseen-type bounds prove that the empirical means are nearly symmetric.
Different ideas - kurtosis
Under bounded kurtosis, one can use the empirical mean of truncated random variables.
The truncation is data-driven and uses preliminary estimates of the mean and variance.
Empirical-process arguments show this is close to truncating at the exact mean and variance. Sharp bounds!
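The slides only sketch this idea; below is a heavily simplified illustration of a data-driven truncated mean in Python. The split-sample design, the truncation radius, and all constants are our own stand-ins, not the estimator from the paper:

```python
import numpy as np

def truncated_mean(x, delta):
    """Empirical mean of truncated data with a data-driven truncation level.

    Illustrative only: estimate mean/std on one half of the sample, then
    truncate the other half at mu_hat +/- B and average. The real estimator
    and its constants differ.
    """
    x = np.asarray(x, dtype=float)
    half = x.size // 2
    pilot, rest = x[:half], x[half:]
    mu_hat, sigma_hat = pilot.mean(), pilot.std()
    # Truncation radius B = sigma * sqrt(n / (1 + ln(1/delta))): the
    # truncation bias is at most sigma^2 / B = sigma * sqrt((1+ln(1/delta))/n),
    # i.e., of the sub-Gaussian order.
    B = sigma_hat * np.sqrt(rest.size / (1.0 + np.log(1.0 / delta)))
    return np.clip(rest, mu_hat - B, mu_hat + B).mean()
```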
Open problems
Sharp constants are essential for statisticians.
Are sub-Gaussian confidence intervals somehow equivalent to sub-Gaussian estimators?
Efficient extensions to vector-valued data and to risk minimization problems.
Optimal deviation bounds for Poissons, Bernoullis, etc.
Thank you! (References in the next slides.)
Our preprint
It should be posted to the arXiv within a few weeks. Available upon request from
roboliv AT gmail.com
Catoni's work
Catoni's estimation paper, plus a companion paper on least squares (with Audibert):
J.-Y. Audibert & O. Catoni. "Robust linear least squares regression." Ann. Statist. 39, no. 5 (2011).
O. Catoni. "Challenging the empirical mean and empirical variance: a deviation study." Ann. Inst. H. Poincaré Probab. Statist. 48, no. 4 (2012).
Median of means
D. Hsu. http://www.inherentuncertainty.org/2010/12/robust-statistics.html (see also L. Levin. "Notes for Miscellaneous Lectures." arXiv:cs/0503039).
N. Alon, Y. Matias & M. Szegedy. "The Space Complexity of Approximating the Frequency Moments." J. Comput. Syst. Sci. 58, no. 1 (1999).
A. Nemirovski & D. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley (1983).
Some applications
C. Brownlees, E. Joly & G. Lugosi. "Empirical risk minimization for heavy-tailed losses." To appear in Ann. Statist.
S. Bubeck, N. Cesa-Bianchi & G. Lugosi. "Bandits with heavy tail." IEEE Transactions on Information Theory 59, no. 11 (2013).
D. Hsu & S. Sabato. "Loss minimization and parameter estimation with heavy tails." arXiv:1307.1827; abstract in ICML proceedings (2014).