Concentration inequalities
Jean-Yves Audibert1,2
1. Imagine - ENPC/CSTB - Université Paris-Est
2. Willow (INRIA/ENS/CNRS)
ThRaSH’2010
Problem
Tight upper and lower bounds on
f(X1, . . . , Xn)
with
X1, . . . , Xn i.i.d. random variables
taking their values in some (measurable) space X and
f : X^n → R
a function whose value depends on all the variables but not too much
on any of them. For example: f(X1, . . . , Xn) = (X1 + · · · + Xn)/n or
f(X1, . . . , Xn) = sup_{g∈G} [g(X1) + · · · + g(Xn)]/n
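A quick numerical illustration of these two functionals (a minimal Python/NumPy sketch; the uniform distribution and the finite class G of centered threshold functions are choices made here, not part of the slides):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(size=n)                     # i.i.d. sample, here from the uniform law on [0, 1]

# First example: the empirical mean f(X1, ..., Xn) = (X1 + ... + Xn)/n
empirical_mean = X.mean()

# Second example: sup over a finite class G of the empirical mean of g;
# here G = {x -> 1{x <= s} - s : s in a grid}, so Eg(X) = 0 under the uniform law.
thresholds = np.linspace(0.05, 0.95, 19)
sup_empirical_process = max(np.mean((X <= s) - s) for s in thresholds)

print(empirical_mean, sup_empirical_process)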
Outline
• Asymptotic viewpoint
• Nonasymptotic viewpoint
– Gaussian approximation
– Gaussian processes
– Sum of i.i.d. r.v.
– Functions with bounded differences
– Self-bounding functions
The asymptotic viewpoint
• What is the limit of f(X1, . . . , Xn)?
• What is the limit of its centered and scaled version:
[f(X1, . . . , Xn) − E f(X1, . . . , Xn)] / √(Var f(X1, . . . , Xn)) ?
Convergence of random variables
• Convergence in distribution: Wn →(d) W as n → +∞
⇔ ∀t ∈ R s.t. FW is continuous at t, FWn(t) → FW(t)
⇔ ∀f : R → R continuous and bounded, E f(Wn) → E f(W)
⇔ ∀t ∈ R, E e^{itWn} → E e^{itW} (with i² = −1)
• Convergence in probability: Wn →(P) W
⇔ ∀ε > 0, P(|Wn − W| ≥ ε) → 0
• Almost sure convergence: Wn →(a.s.) W ⇔ P(Wn → W) = 1
Almost sure cvg ⇒ cvg in probability ⇒ cvg in distribution
• If ∀ε > 0, ∑_{n≥1} P(|Wn − W| > ε) < +∞, then Wn →(a.s.) W
Convergence of the empirical mean
f(X1, . . . , Xn) = (X1 + · · · + Xn)/n
• LLN (1713): If X, X1, X2, . . . are i.i.d. r.v. with E|X| < +∞, then
X̄ = (∑_{i=1}^n Xi)/n →(a.s.) EX
• CLT (1733): If X, X1, X2, . . . are i.i.d. r.v. with EX² < +∞, then
√n (X̄ − EX) →(d) N(0, Var X),
or equivalently: for any t,
P{ √(n/Var X) (X̄ − EX) > t } → ∫_t^{+∞} e^{−u²/2}/√(2π) du.
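A minimal simulation (Python/NumPy sketch; the exponential distribution is an arbitrary choice) comparing the two sides of the CLT tail statement at a moderate n:

import math
import numpy as np

rng = np.random.default_rng(0)
n, reps, t = 50, 100_000, 1.0
X = rng.exponential(scale=1.0, size=(reps, n))   # EX = 1, Var X = 1

# Left-hand side: P( sqrt(n / Var X) * (X̄ - EX) > t ), estimated by Monte Carlo
lhs = np.mean(np.sqrt(n / 1.0) * (X.mean(axis=1) - 1.0) > t)

# Right-hand side: standard Gaussian tail  ∫_t^∞ e^{-u²/2}/√(2π) du
rhs = 0.5 * math.erfc(t / math.sqrt(2))

print(lhs, rhs)   # already fairly close for n = 50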
Slutsky’s lemma (1925)
Let (Vn) and (Wn) be two sequences of random vectors or variables.
If Vn →(P) v and Wn →(d) W, then
1. Vn + Wn →(d) v + W
2. Vn Wn →(d) v W
3. Vn^{-1} Wn →(d) v^{-1} W if v is invertible
An example of a complicated functional: the t-statistic
Let
f(X1, . . . , Xn) = √n (X̄ − EX)/Sn,
with
Sn² = (1/n) ∑_{i=1}^n (Xi − X̄)²
Since Sn² = (1/n) ∑_{i=1}^n (Xi − EX)² − (EX − X̄)², from the LLN, we have
Sn² →(a.s.) Var X. From the CLT, √n (X̄ − EX) →(d) N(0, Var X). Thus, from Slutsky's lemma,
f(X1, . . . , Xn) →(d) N(0, 1).
Appropriate decompositions of complicated functionals allow one to
compute their asymptotic distribution.
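A short check (Python/NumPy sketch, again with an arbitrary exponential sample) that the t-statistic is approximately N(0, 1) even though Sn is itself random:

import math
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 100_000
X = rng.exponential(size=(reps, n))              # EX = 1

Xbar = X.mean(axis=1)
Sn = X.std(axis=1)                               # sqrt of (1/n) * sum (Xi - X̄)²
T = np.sqrt(n) * (Xbar - 1.0) / Sn               # the t-statistic f(X1, ..., Xn)

t = 1.0
print(np.mean(T > t), 0.5 * math.erfc(t / math.sqrt(2)))   # ≈ P(N(0,1) > 1)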
Nonasymptotic bounds
Motivations:
• When the nonasymptotic regime plays a crucial role (for instance,
multi-armed bandit problems, racing algorithms, stopping-time
problems)
• When asymptotic analysis is not achievable through standard
arguments
• To derive asymptotic results!
The Berry (1941)-Esseen (1942) theorem
• X,X1, . . . , Xn i.i.d.
• E|X|³ < +∞ and σ² = Var X
• X̄ = (X1 + · · · + Xn)/n
• Z ∼ N(EX, (Var X)/n)
sup_{x∈R} |P(X̄ > x) − P(Z > x)| ≤ E|X − EX|³/(2σ³) · 1/√n
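A Monte-Carlo illustration (Python/NumPy sketch; Bernoulli(0.1) is chosen so the third-moment term is visible): it estimates sup_x |P(X̄ > x) − P(Z > x)| on a grid and compares it with the Berry-Esseen right-hand side:

import math
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 0.1, 200, 200_000
Xbar = (rng.random((reps, n)) < p).mean(axis=1)  # Bernoulli(p) sample means

sigma = math.sqrt(p * (1 - p))
rho = p * (1 - p) ** 3 + (1 - p) * p ** 3        # E|X - EX|³ for a Bernoulli(p)
mu_Z, sd_Z = p, sigma / math.sqrt(n)             # Z ~ N(EX, Var X / n)

xs = np.linspace(p - 5 * sd_Z, p + 5 * sd_Z, 400)
emp_tail = np.array([np.mean(Xbar > x) for x in xs])
gauss_tail = np.array([0.5 * math.erfc((x - mu_Z) / (sd_Z * math.sqrt(2))) for x in xs])

print(np.max(np.abs(emp_tail - gauss_tail)), rho / (2 * sigma ** 3 * math.sqrt(n)))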
Slud’s theorem (1977)
• X1, . . . , Xn i.i.d. ∼ B(p) with p ≤ 1/2
• Z ∼ N(EX, (Var X)/n)
• for any x ∈ [p, 1 − p],
P(X̄ > x) ≥ P(Z > x)
The Paley-Zygmund inequality (1932)
• X1, . . . , Xn i.i.d.
• for any 0 ≤ λ < 1,
P( |√n (X̄ − EX)|/√(Var X) > λ ) ≥ (1 − λ²)² min( 1/3, (Var X)²/E(X − EX)⁴ ).
Supremum of Gaussian processes (GP)
• Gaussian process (W(g))_{g∈G}: for any g1, . . . , gd ∈ G, (W(g1), . . . , W(gd)) is a Gaussian random vector
• GP: a powerful, flexible probabilistic model parametrized by
µ(g) = EW(g) and K(g, g′) = Cov(W(g), W(g′))
• Good intuition on GP ⇒ good intuition on sup_{g∈G} [g(X1) + · · · + g(Xn)]/n
sup_{g∈G} [g(X1) + · · · + g(Xn)]/n ≈ sup_{g∈G} W(g)
with µ(g) = Eg(X) and K(g, g′) = (1/n) Cov(g(X), g′(X)).
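A small simulation (Python/NumPy sketch; the uniform law and the finite class of centered cosine functions are ad-hoc choices) comparing the supremum of the empirical process with the supremum of the approximating Gaussian vector:

import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 5_000
freqs = np.arange(1, 11)   # G = {x -> sqrt(2) cos(2*pi*j*x), j = 1..10}: Eg(X) = 0 under U[0, 1]

def g_values(x):           # matrix of g(x) values, shape (len(x), |G|)
    return np.sqrt(2) * np.cos(2 * np.pi * np.outer(x, freqs))

# supremum of the empirical process over G, repeated over independent samples
sup_emp = np.array([g_values(rng.uniform(size=n)).mean(axis=0).max() for _ in range(reps)])

# approximating Gaussian vector: mean mu(g) = Eg(X) = 0, covariance K(g, g') = Cov(g(X), g'(X)) / n
K = np.cov(g_values(rng.uniform(size=200_000)), rowvar=False) / n
sup_gp = rng.multivariate_normal(np.zeros(len(freqs)), K, size=reps).max(axis=1)

print(sup_emp.mean(), sup_gp.mean())   # the two suprema have similar means (and distributions)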
The Borell (1975) - Cirel’son et al. (1976) inequality
• Z = sup_{g∈G} {W(g) − EW(g)}
• σ² = sup_{g∈G} Var W(g) = sup_{g∈G} K(g, g)
for any λ ∈ R,
log E e^{λ(Z−EZ)} ≤ λ²σ²/2
for any t > 0,
P(Z − EZ ≥ t) ≤ e^{−t²/(2σ²)}
Dudley’s integral (1967)
• d(g, g′) = √(E[W(g) − W(g′)]²)
• N(ε) = ε-packing number of (G, d)
• σ² = sup_{g∈G} Var W(g) = sup_{g∈G} K(g, g)
E sup_{g∈G} {W(g) − EW(g)} ≤ 12 ∫_0^σ √(log N(ε)) dε
Another Borell (1975) - Cirel’son et al. (1976) inequality
• X1, . . . , Xn i.i.d. ∼ N(0, 1)
• f : R^n → R L-Lipschitz for the Euclidean distance:
for any x, x′ in R^n, |f(x) − f(x′)| ≤ L‖x − x′‖
for any t > 0,
P( f(X1, . . . , Xn) − Ef(X1, . . . , Xn) ≥ t ) ≤ e^{−t²/(2L²)}.
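A quick check (Python/NumPy sketch) with the 1-Lipschitz function f(x) = ‖x‖: the observed deviations of ‖(X1, . . . , Xn)‖ above (a Monte-Carlo proxy of) its mean are compared with the Gaussian bound e^{−t²/(2L²)}:

import math
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 100_000
# f(x) = ||x|| is 1-Lipschitz for the Euclidean distance (L = 1)
norms = np.linalg.norm(rng.standard_normal((reps, n)), axis=1)
mean_norm = norms.mean()               # Monte-Carlo proxy for E f(X1, ..., Xn)

for t in (0.5, 1.0, 2.0):
    print(t, np.mean(norms - mean_norm >= t), math.exp(-t ** 2 / 2))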
Some useful probabilistic inequalities
• Markov’s inequality: for any r.v. X and a > 0, since |X| ≥ a 1_{|X|≥a},
P(|X| ≥ a) ≤ (1/a) E|X|.
• Jensen’s ineq.: for any integrable r.v. X and ϕ : R^d → R convex,
ϕ(EX) ≤ Eϕ(X).
• For any r.v. X, EX ≤ ∫_0^{+∞} P(X ≥ t) dt (with equality if X ≥ 0)
• Markov’s inequality is the basis of Chernoff’s argument: ∀s > 0,
P(X ≥ t) = P(e^{sX} ≥ e^{st}) ≤ e^{−st} E e^{sX}.
Control of the Laplace transform ⇒ control of the large deviations.
Hoeffding’s inequality (1963)
If X, X1, X2, . . . are i.i.d. r.v. with a ≤ X ≤ b, then
1. ∀s ∈ R,
E e^{s(X−EX)} ≤ e^{s²(b−a)²/8}
2. For any t ≥ 0,
P(X̄ − EX ≥ t) ≤ e^{−2nt²/(b−a)²},
or equivalently, for any ε > 0,
P( X̄ − EX < (b − a) √(log(ε⁻¹)/(2n)) ) ≥ 1 − ε,
i.e., “w.h.p.” X̄ − EX < (b − a) √(log(ε⁻¹)/(2n)).
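A Monte-Carlo sanity check (Python/NumPy sketch; Bernoulli(1/2), so a = 0 and b = 1) of the tail bound in point 2:

import math
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 200_000
Xbar = (rng.random((reps, n)) < 0.5).mean(axis=1)   # Bernoulli(1/2): a = 0, b = 1, EX = 1/2

for t in (0.05, 0.10, 0.15):
    print(t, np.mean(Xbar - 0.5 >= t), math.exp(-2 * n * t ** 2))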
Log-Laplace upper bound
1. ∀s ∈ R, E e^{s(X−EX)} ≤ e^{s²(b−a)²/8}
ϕ(s) = log E e^{sX},   Ps(dω) = [e^{sX(ω)}/E e^{sX}] · P(dω)
ϕ′(s) = E_{Ps} X,   ϕ′′(s) = Var_{Ps} X
Var_{Ps} X = inf_{r∈R} E_{Ps}(X − r)² ≤ E_{Ps}(X − (a+b)/2)² ≤ (b−a)²/4.
ϕ(s) = ϕ(0) + sϕ′(0) + ∫_0^s (s − t) ϕ′′(t) dt
⇒ log E e^{sX} ≤ sEX + ∫_0^s (s − t) (b−a)²/4 dt ≤ sEX + (b−a)²s²/8
Chernoff’s Argument
2. For any t ≥ 0, P(X̄ − EX > t) ≤ e^{−2nt²/(b−a)²}.
P(X̄ − EX ≥ t) = P(e^{s(X̄−EX)} ≥ e^{st})
≤ e^{−st} E[e^{s(X̄−EX)}]
= e^{−st} E( e^{(s/n) ∑_{i=1}^n (Xi−EX)} )
= e^{−st} ( E e^{(s/n)(X−EX)} )^n
≤ e^{−st + s²(b−a)²/(8n)}
= e^{−2nt²/(b−a)²}  by choosing s = 4nt/(b−a)².
Union bound
• P(A) ≥ 1 − ε and P(B) ≥ 1 − ε ⇒ P(A ∩ B) ≥ 1 − 2ε
(since P(A^c ∪ B^c) ≤ P(A^c) + P(B^c))
For instance: Hoeffding applied to X + Hoeffding applied to −X + union bound
⇒ with proba ≥ 1 − ε, |X̄ − EX| < (b − a) √(log(2ε⁻¹)/(2n))
(leads to pessimistic but correct confidence intervals, unlike the CLT)
• If P(A1) ≥ 1 − ε, . . . , P(Am) ≥ 1 − ε, then
P(A1 ∩ · · · ∩ Am) ≥ 1 − mε
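A minimal sketch (Python/NumPy) of the resulting two-sided confidence interval; the Beta-distributed sample and the level ε = 0.05 are arbitrary choices:

import math
import numpy as np

rng = np.random.default_rng(0)
n, a, b, eps = 500, 0.0, 1.0, 0.05
X = rng.beta(2, 5, size=n)                       # some [0, 1]-valued sample

half_width = (b - a) * math.sqrt(math.log(2 / eps) / (2 * n))   # Hoeffding + union bound
print(X.mean() - half_width, X.mean() + half_width)             # contains EX with proba >= 1 - eps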
Bernstein’s (1946) inequality
Hoeffding’s inequality vs CLT:
e^{−2α² Var X/(b−a)²} ≥ P[ √(n/Var X) (X̄ − EX) > α ] → P(Z > α) ≈ e^{−α²/2}/(α√(2π))  (with Z ∼ N(0, 1))
⇒ Hoeffding’s inequality is imprecise for r.v. having low variance
Bernstein’s inequality:
If X, X1, X2, . . . are i.i.d. r.v. with X − EX ≤ c, then
• for any ε > 0, with proba at least 1 − ε,
X̄ ≤ EX + √(2 log(ε⁻¹) Var X/n) + c log(ε⁻¹)/(3n)
• for any t ≥ 0,
P(X̄ − EX > t) ≤ e^{−nt²/(2 Var X + 2ct/3)}
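A numeric comparison (plain Python sketch; Bernoulli(0.01), so Var X is small) of the deviation terms given by Hoeffding and by Bernstein at the same confidence level:

import math

n, eps = 10_000, 0.05
p = 0.01                                          # low-variance case: Var X = p(1 - p)
var, c, b_minus_a = p * (1 - p), 1.0, 1.0

hoeffding = b_minus_a * math.sqrt(math.log(1 / eps) / (2 * n))
bernstein = math.sqrt(2 * math.log(1 / eps) * var / n) + c * math.log(1 / eps) / (3 * n)
print(hoeffding, bernstein)                       # Bernstein is much tighter when Var X << (b - a)²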
Empirical Bernstein’s inequality (A., Munos, Szepesvári, 2007; Maurer, Pontil, 2009)
If X, X1, X2, . . . are i.i.d. r.v. with a ≤ X ≤ b, then for any ε > 0,
with proba at least 1 − ε,
EX ≤ X̄ + √(2 log(ε⁻¹) σ²/n) + 7(b − a) log(ε⁻¹)/(3n)
with
σ² = ∑_{i=1}^n (Xi − X̄)²/(n − 1).
(to be compared with EX ≤ X̄ + √(2 log(ε⁻¹) Var X/n) + (b − a) log(ε⁻¹)/(3n))
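A minimal sketch (Python/NumPy) computing the empirical Bernstein upper confidence bound on EX from a sample; the Beta distribution and ε are arbitrary choices:

import math
import numpy as np

rng = np.random.default_rng(0)
n, a, b, eps = 1_000, 0.0, 1.0, 0.05
X = rng.beta(2, 8, size=n)                 # a [0, 1]-valued sample with EX = 0.2

sigma2 = X.var(ddof=1)                     # empirical variance with the 1/(n - 1) normalization
ucb = (X.mean()
       + math.sqrt(2 * math.log(1 / eps) * sigma2 / n)
       + 7 * (b - a) * math.log(1 / eps) / (3 * n))
print(X.mean(), ucb)                       # EX <= ucb with proba at least 1 - eps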
Hoeffding-Azuma inequalities (McDiarmid’s version, 1989)
If for some c ≥ 0,
sup_{i∈{1,...,n}, (x1,...,xn)∈X^n, x∈X} [ f(x1, . . . , xn) − f(x1, . . . , xi−1, x, xi+1, . . . , xn) ] ≤ c,
then, for any λ ∈ R, W = f(X1, . . . , Xn) satisfies
E e^{λ(W−EW)} ≤ e^{nλ²c²/8}
and for any t ≥ 0,
P(W − EW > t) ≤ e^{−2t²/(nc²)}
First example: Hoeffding’s inequality in Hilbert space
• X1, . . . , Xn i.i.d. r.v. taking values in a separable Hilbert space
• EX = 0 and ‖X‖ ≤ 1
For any t ≥ 4√n,
P( ‖X1 + · · · + Xn‖ ≥ t ) ≤ e^{−t²/(8n)}.
Second example: supremum of empirical process
W = f(X1, . . . , Xn) = sup_{g∈G} [g(X1) + · · · + g(Xn)]/n,   G finite
• Assumptions: ∀g ∈ G, g takes its values in [−1, 1] and Eg(X1) = 0
• sup_{i, (x1,...,xn)∈X^n, x∈X} [ f(x1^{i−1}, xi, x_{i+1}^n) − f(x1^{i−1}, x, x_{i+1}^n) ] ≤ 2/n,
• McDiarmid’s inequality ⇒ P(W − EW > t) ≤ e^{−nt²/2}
⇒ with proba ≥ 1 − ε,
sup_{g∈G} [g(X1) + · · · + g(Xn)]/n ≤ E sup_{g∈G} [g(X1) + · · · + g(Xn)]/n + √(2 log(ε⁻¹)/n)
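A small simulation (Python/NumPy sketch; the finite class of centered threshold functions and the uniform law are ad-hoc choices) checking this McDiarmid-based bound, with EW replaced by its Monte-Carlo estimate:

import math
import numpy as np

rng = np.random.default_rng(0)
n, reps, eps = 200, 20_000, 0.05
s_grid = np.linspace(0.1, 0.9, 9)                # G = {x -> 1{x <= s} - s}: values in [-1, 1], Eg(X) = 0

def W(x):                                        # W = sup_{g in G} (g(X1) + ... + g(Xn))/n
    return max(np.mean((x <= s) - s) for s in s_grid)

Ws = np.array([W(rng.uniform(size=n)) for _ in range(reps)])
bound = Ws.mean() + math.sqrt(2 * math.log(1 / eps) / n)       # EW estimated by its Monte-Carlo mean
print(np.mean(Ws > bound), eps)                  # observed violation rate is (much) below eps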
Third example: kernel density estimation
• X1, . . . , Xn i.i.d. r.v. from a distribution with density p on R.
• h > 0 and K : R → R+ with ∫_R K = 1
• p̂(x) = (1/(nh)) ∑_{i=1}^n K((x − Xi)/h)
• W = f(X1, . . . , Xn) = ∫ |p̂(x) − p(x)| dx
f(x1^{i−1}, xi, x_{i+1}^n) − f(x1^{i−1}, x′i, x_{i+1}^n) ≤ (1/(nh)) ∫ |K((x − xi)/h) − K((x − x′i)/h)| dx ≤ 2/n,
• McDiarmid’s inequality ⇒ with proba ≥ 1 − ε,
W − EW ≤ √(2 log(ε⁻¹)/n)
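A minimal sketch (Python/NumPy; the standard Gaussian density and a Gaussian kernel are assumptions made here) computing the L1 error W whose concentration is controlled above:

import numpy as np

rng = np.random.default_rng(0)
n, h = 500, 0.2
X = rng.standard_normal(n)                       # sample from the true density p = N(0, 1)

grid = np.linspace(-5.0, 5.0, 2001)
K = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)          # Gaussian kernel, integrates to 1
p_hat = K((grid[:, None] - X[None, :]) / h).mean(axis=1) / h    # p̂(x) = (1/(nh)) Σ K((x - Xi)/h)
p_true = K(grid)

dx = grid[1] - grid[0]
W = np.abs(p_hat - p_true).sum() * dx            # Riemann-sum approximation of W = ∫ |p̂ - p| dx
print(W)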
Self-bounding functions (Boucheron, Lugosi, Massart, 2003, 2009; Maurer, 2005)
• fi(x1, . . . , xn) = inf_{x′i∈X} f(x1, . . . , xi−1, x′i, xi+1, . . . , xn)
• If for some a, b ≥ 0, for any (x1, . . . , xn) ∈ X^n,
∑_{i=1}^n [f(x1, . . . , xn) − fi(x1, . . . , xn)]² ≤ a f(x1, . . . , xn) + b,
then, for any t ≥ 0, W = f(X1, . . . , Xn) satisfies
P(W − EW > t) ≤ e^{−t²/(2(aEW + b + at/2))}
Talagrand’s inequality (Talagrand, 1996; Rio, 2002; Bousquet, 2003)
• W = sup_{g∈G} [g(X1) + · · · + g(Xn)]/n
• Eg(X) = 0 and g(X) ≤ c
• v = sup_{g∈G} Var g(X) + 2cEW
for any ε > 0, with proba at least 1 − ε,
W − EW ≤ √(2v log(ε⁻¹)/n) + c log(ε⁻¹)/(3n)
for any t ≥ 0,
P(W − EW > t) ≤ e^{−nt²/(2v + 2ct/3)}
Expected maximal deviations
Let σ > 0, m ≥ 2, and W1, . . . , Wm be r.v. such that for all s > 0 and any
1 ≤ i ≤ m, E e^{sWi} ≤ e^{s²σ²/2}. Then
E{ max_{1≤i≤m} Wi } ≤ σ√(2 log m).
If for any s > 0, we also have E e^{−sWi} ≤ e^{s²σ²/2}, then
E{ max_{1≤i≤m} |Wi| } ≤ σ√(2 log(2m)).
Proof:
max_{1≤i≤m} Wi ≤ (1/s) log ∑_{i=1}^m e^{sWi},
so, taking expectations and using Jensen’s inequality,
E max_{1≤i≤m} Wi ≤ (1/s) log ∑_{i=1}^m E e^{sWi} ≤ (1/s) log(m e^{s²σ²/2}) = (log m)/s + sσ²/2,
and the choice s = √(2 log m)/σ gives the first bound.
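A quick numerical check (Python/NumPy sketch) of the σ√(2 log m) bound for m standard Gaussians, which satisfy E e^{sWi} ≤ e^{s²σ²/2} with σ = 1:

import math
import numpy as np

rng = np.random.default_rng(0)
m, reps = 1_000, 10_000
max_W = rng.standard_normal((reps, m)).max(axis=1)   # max of m standard Gaussians (σ = 1)
print(max_W.mean(), math.sqrt(2 * math.log(m)))      # empirical mean of the max stays below σ√(2 log m) ≈ 3.72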
Extension to martingale difference sequences
• Let X1, X2, . . . and U1, U2, . . . be r.v. such that
E[Xi | U1, . . . , Ui−1] = 0 for all i ≥ 1
• Assume that for some c > 0 and some r.v. Ai measurable w.r.t.
U1, . . . , Ui−1, Xi takes its values in [Ai, Ai + c] for all i ≥ 1. Then for any t ≥ 0,
P( (X1 + · · · + Xn)/n > t ) ≤ e^{−2nt²/c²}
• same r.h.s. as if we had i.i.d. r.v. taking values in [0, c]
Other extensions
• All upper bounds extend easily to independent, non-identically
distributed r.v.
• Some upper bounds on the empirical mean can be extended to
random vectors
• All upper bounds on the empirical mean remain valid if the Xi are
sampled without replacement