Concentration inequalities
Jean-Yves Audibert1,2
1. Imagine - ENPC/CSTB - Université Paris-Est
2. Willow (INRIA/ENS/CNRS)
ThRaSH’2010
Problem
Tight upper and lower bounds on
f(X1, . . . , Xn)
with
X1, . . . , Xn i.i.d. random variables
taking their values in some (measurable) space X and
f : X^n → R
a function whose value depends on all the variables but not too much
on any of them. For example: f(X1, . . . , Xn) = (X1 + · · · + Xn)/n or
f(X1, . . . , Xn) = sup_{g∈G} [g(X1) + · · · + g(Xn)]/n
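A quick numerical illustration of these two functionals (a minimal Python/NumPy sketch; the uniform distribution and the finite class G of centered threshold functions are choices made here, not part of the slides):

import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(size=n)                     # i.i.d. sample, here from the uniform law on [0, 1]

# First example: the empirical mean f(X1, ..., Xn) = (X1 + ... + Xn)/n
empirical_mean = X.mean()

# Second example: sup over a finite class G of the empirical mean of g;
# here G = {x -> 1{x <= s} - s : s in a grid}, so Eg(X) = 0 under the uniform law.
thresholds = np.linspace(0.05, 0.95, 19)
sup_empirical_process = max(np.mean((X <= s) - s) for s in thresholds)

print(empirical_mean, sup_empirical_process)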
Outline
• Asymptotic viewpoint
• Nonasymptotic viewpoint
– Gaussian approximation
– Gaussian processes
– Sum of i.i.d. r.v.
– Functions with bounded differences
– Self-bounding functions
The asymptotic viewpoint
• What is the limit of f(X1, . . . , Xn)?
• What is the limit of its centered and scaled version:
[f(X1, . . . , Xn) − E f(X1, . . . , Xn)] / √(Var f(X1, . . . , Xn)) ?
Convergence of random variables
• Convergence in distribution: Wn →(d) W as n → +∞
⇔ ∀t ∈ R s.t. FW is continuous at t, FWn(t) → FW(t)
⇔ ∀f : R → R continuous and bounded, E f(Wn) → E f(W)
⇔ ∀t ∈ R, E e^{itWn} → E e^{itW} (with i² = −1)
• Convergence in probability: Wn →(P) W
⇔ ∀ε > 0, P(|Wn − W| ≥ ε) → 0
• Almost sure convergence: Wn →(a.s.) W ⇔ P(Wn → W) = 1
Almost sure cvg ⇒ cvg in probability ⇒ cvg in distribution
• If ∀ε > 0, ∑_{n≥1} P(|Wn − W| > ε) < +∞, then Wn →(a.s.) W
Convergence of the empirical mean
f(X1, . . . , Xn) = (X1 + · · · + Xn)/n
• LLN (1713): If X, X1, X2, . . . are i.i.d. r.v. with E|X| < +∞, then
X̄ = (∑_{i=1}^n Xi)/n →(a.s.) EX
• CLT (1733): If X, X1, X2, . . . are i.i.d. r.v. with EX² < +∞, then
√n (X̄ − EX) →(d) N(0, Var X),
or equivalently: for any t,
P{ √(n/Var X) (X̄ − EX) > t } → ∫_t^{+∞} e^{−u²/2}/√(2π) du.
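A minimal simulation (Python/NumPy sketch; the exponential distribution is an arbitrary choice) comparing the two sides of the CLT tail statement at a moderate n:

import math
import numpy as np

rng = np.random.default_rng(0)
n, reps, t = 50, 100_000, 1.0
X = rng.exponential(scale=1.0, size=(reps, n))   # EX = 1, Var X = 1

# Left-hand side: P( sqrt(n / Var X) * (X̄ - EX) > t ), estimated by Monte Carlo
lhs = np.mean(np.sqrt(n / 1.0) * (X.mean(axis=1) - 1.0) > t)

# Right-hand side: standard Gaussian tail  ∫_t^∞ e^{-u²/2}/√(2π) du
rhs = 0.5 * math.erfc(t / math.sqrt(2))

print(lhs, rhs)   # already fairly close for n = 50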
Slutsky’s lemma (1925)
Let (Vn) and (Wn) be two sequences of random vectors or variables.
If Vn →(P) v and Wn →(d) W, then
1. Vn + Wn →(d) v + W
2. Vn Wn →(d) v W
3. Vn^{-1} Wn →(d) v^{-1} W if v is invertible
An example of a complicated functional: the t-statistic
Let
f(X1, . . . , Xn) = √n (X̄ − EX)/Sn,
with
Sn² = (1/n) ∑_{i=1}^n (Xi − X̄)²
Since Sn² = (1/n) ∑_{i=1}^n (Xi − EX)² − (EX − X̄)², from the LLN, we have
Sn² →(a.s.) Var X. From the CLT, √n (X̄ − EX) →(d) N(0, Var X). Thus, from Slutsky's lemma,
f(X1, . . . , Xn) →(d) N(0, 1).
Appropriate decompositions of complicated functionals allow one to
compute their asymptotic distribution.
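A short check (Python/NumPy sketch, again with an arbitrary exponential sample) that the t-statistic is approximately N(0, 1) even though Sn is itself random:

import math
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 100_000
X = rng.exponential(size=(reps, n))              # EX = 1

Xbar = X.mean(axis=1)
Sn = X.std(axis=1)                               # sqrt of (1/n) * sum (Xi - X̄)²
T = np.sqrt(n) * (Xbar - 1.0) / Sn               # the t-statistic f(X1, ..., Xn)

t = 1.0
print(np.mean(T > t), 0.5 * math.erfc(t / math.sqrt(2)))   # ≈ P(N(0,1) > 1)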
Nonasymptotic bounds
Motivations:
• When the nonasymptotic regime plays a crucial role (for instance,
multi-armed bandit problems, racing algorithms, stopping-time
problems)
• When asymptotic analysis is not achievable through standard
arguments
• To derive asymptotic results!
The Berry (1941)-Esseen (1942) theorem
• X,X1, . . . , Xn i.i.d.
• E|X|³ < +∞ and σ² = Var X
• X̄ = (X1 + · · · + Xn)/n
• Z ∼ N(EX, (Var X)/n)
sup_{x∈R} |P(X̄ > x) − P(Z > x)| ≤ E|X − EX|³/(2σ³) · 1/√n
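A Monte-Carlo illustration (Python/NumPy sketch; Bernoulli(0.1) is chosen so the third-moment term is visible): it estimates sup_x |P(X̄ > x) − P(Z > x)| on a grid and compares it with the Berry-Esseen right-hand side:

import math
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 0.1, 200, 200_000
Xbar = (rng.random((reps, n)) < p).mean(axis=1)  # Bernoulli(p) sample means

sigma = math.sqrt(p * (1 - p))
rho = p * (1 - p) ** 3 + (1 - p) * p ** 3        # E|X - EX|³ for a Bernoulli(p)
mu_Z, sd_Z = p, sigma / math.sqrt(n)             # Z ~ N(EX, Var X / n)

xs = np.linspace(p - 5 * sd_Z, p + 5 * sd_Z, 400)
emp_tail = np.array([np.mean(Xbar > x) for x in xs])
gauss_tail = np.array([0.5 * math.erfc((x - mu_Z) / (sd_Z * math.sqrt(2))) for x in xs])

print(np.max(np.abs(emp_tail - gauss_tail)), rho / (2 * sigma ** 3 * math.sqrt(n)))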
Slud’s theorem (1977)
• X1, . . . , Xn i.i.d. ∼ B(p) with p ≤ 1/2
• Z ∼ N(EX, (Var X)/n)
• for any x ∈ [p, 1 − p],
P(X̄ > x) ≥ P(Z > x)
The Paley-Zygmund inequality (1932)
• X1, . . . , Xn i.i.d.
• for any 0 ≤ λ < 1,
P( |√n (X̄ − EX)|/√(Var X) > λ ) ≥ (1 − λ²)² min( 1/3, (Var X)²/E(X − EX)⁴ ).
Supremum of Gaussian processes (GP)
• Gaussian process (W(g))_{g∈G}: for any g1, . . . , gd ∈ G, (W(g1), . . . , W(gd)) is a Gaussian random vector
• GP: a powerful, flexible probabilistic model parametrized by
µ(g) = EW(g) and K(g, g′) = Cov(W(g), W(g′))
• Good intuition on GP ⇒ good intuition on sup_{g∈G} [g(X1) + · · · + g(Xn)]/n
sup_{g∈G} [g(X1) + · · · + g(Xn)]/n ≈ sup_{g∈G} W(g)
with µ(g) = Eg(X) and K(g, g′) = (1/n) Cov(g(X), g′(X)).
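A small simulation (Python/NumPy sketch; the uniform law and the finite class of centered cosine functions are ad-hoc choices) comparing the supremum of the empirical process with the supremum of the approximating Gaussian vector:

import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 5_000
freqs = np.arange(1, 11)   # G = {x -> sqrt(2) cos(2*pi*j*x), j = 1..10}: Eg(X) = 0 under U[0, 1]

def g_values(x):           # matrix of g(x) values, shape (len(x), |G|)
    return np.sqrt(2) * np.cos(2 * np.pi * np.outer(x, freqs))

# supremum of the empirical process over G, repeated over independent samples
sup_emp = np.array([g_values(rng.uniform(size=n)).mean(axis=0).max() for _ in range(reps)])

# approximating Gaussian vector: mean mu(g) = Eg(X) = 0, covariance K(g, g') = Cov(g(X), g'(X)) / n
K = np.cov(g_values(rng.uniform(size=200_000)), rowvar=False) / n
sup_gp = rng.multivariate_normal(np.zeros(len(freqs)), K, size=reps).max(axis=1)

print(sup_emp.mean(), sup_gp.mean())   # the two suprema have similar means (and distributions)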
The Borell (1975) - Cirel’son et al. (1976) inequality
• Z = sup_{g∈G} {W(g) − EW(g)}
• σ² = sup_{g∈G} Var W(g) = sup_{g∈G} K(g, g)
for any λ ∈ R,
log E e^{λ(Z−EZ)} ≤ λ²σ²/2
for any t > 0,
P(Z − EZ ≥ t) ≤ e^{−t²/(2σ²)}
Dudley’s integral (1967)
• d(g, g′) = √(E[W(g) − W(g′)]²)
• N(ε) = ε-packing number of (G, d)
• σ² = sup_{g∈G} Var W(g) = sup_{g∈G} K(g, g)
E sup_{g∈G} {W(g) − EW(g)} ≤ 12 ∫_0^σ √(log N(ε)) dε
Another Borell (1975) - Cirel’son et al. (1976) inequality
• X1, . . . , Xn i.i.d. ∼ N(0, 1)
• f : R^n → R L-Lipschitz for the Euclidean distance:
for any x, x′ in R^n, |f(x) − f(x′)| ≤ L‖x − x′‖
for any t > 0,
P( f(X1, . . . , Xn) − Ef(X1, . . . , Xn) ≥ t ) ≤ e^{−t²/(2L²)}.
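A quick check (Python/NumPy sketch) with the 1-Lipschitz function f(x) = ‖x‖: the observed deviations of ‖(X1, . . . , Xn)‖ above (a Monte-Carlo proxy of) its mean are compared with the Gaussian bound e^{−t²/(2L²)}:

import math
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 100_000
# f(x) = ||x|| is 1-Lipschitz for the Euclidean distance (L = 1)
norms = np.linalg.norm(rng.standard_normal((reps, n)), axis=1)
mean_norm = norms.mean()               # Monte-Carlo proxy for E f(X1, ..., Xn)

for t in (0.5, 1.0, 2.0):
    print(t, np.mean(norms - mean_norm >= t), math.exp(-t ** 2 / 2))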
Some useful probabilistic inequalities
• Markov’s inequality: for any r.v. X and a > 0, since |X| ≥ a 1_{|X|≥a},
P(|X| ≥ a) ≤ (1/a) E|X|.
• Jensen’s ineq.: for any integrable r.v. X and ϕ : R^d → R convex,
ϕ(EX) ≤ Eϕ(X).
• For any r.v. X, EX ≤ ∫_0^{+∞} P(X ≥ t) dt (with equality if X ≥ 0)
• Markov’s inequality is the basis of Chernoff’s argument: ∀s > 0,
P(X ≥ t) = P(e^{sX} ≥ e^{st}) ≤ e^{−st} E e^{sX}.
Control of the Laplace transform ⇒ control of the large deviations.
Hoeffding’s inequality (1963)
If X, X1, X2, . . . are i.i.d. r.v. with a ≤ X ≤ b, then
1. ∀s ∈ R,
E e^{s(X−EX)} ≤ e^{s²(b−a)²/8}
2. For any t ≥ 0,
P(X̄ − EX ≥ t) ≤ e^{−2nt²/(b−a)²},
or equivalently, for any ε > 0,
P( X̄ − EX < (b − a) √(log(ε⁻¹)/(2n)) ) ≥ 1 − ε,
i.e., “w.h.p.” X̄ − EX < (b − a) √(log(ε⁻¹)/(2n)).
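A Monte-Carlo sanity check (Python/NumPy sketch; Bernoulli(1/2), so a = 0 and b = 1) of the tail bound in point 2:

import math
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 200_000
Xbar = (rng.random((reps, n)) < 0.5).mean(axis=1)   # Bernoulli(1/2): a = 0, b = 1, EX = 1/2

for t in (0.05, 0.10, 0.15):
    print(t, np.mean(Xbar - 0.5 >= t), math.exp(-2 * n * t ** 2))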
Log-Laplace upper bound
1. ∀s ∈ R, E e^{s(X−EX)} ≤ e^{s²(b−a)²/8}
ϕ(s) = log E e^{sX},   Ps(dω) = [e^{sX(ω)}/E e^{sX}] · P(dω)
ϕ′(s) = E_{Ps} X,   ϕ′′(s) = Var_{Ps} X
Var_{Ps} X = inf_{r∈R} E_{Ps}(X − r)² ≤ E_{Ps}(X − (a+b)/2)² ≤ (b−a)²/4.
ϕ(s) = ϕ(0) + sϕ′(0) + ∫_0^s (s − t) ϕ′′(t) dt
⇒ log E e^{sX} ≤ sEX + ∫_0^s (s − t) (b−a)²/4 dt ≤ sEX + (b−a)²s²/8
Chernoff’s Argument
2. For any t ≥ 0, P(X̄ − EX > t) ≤ e^{−2nt²/(b−a)²}.
P(X̄ − EX ≥ t) = P(e^{s(X̄−EX)} ≥ e^{st})
≤ e^{−st} E[e^{s(X̄−EX)}]
= e^{−st} E( e^{(s/n) ∑_{i=1}^n (Xi−EX)} )
= e^{−st} ( E e^{(s/n)(X−EX)} )^n
≤ e^{−st + s²(b−a)²/(8n)}
= e^{−2nt²/(b−a)²}  by choosing s = 4nt/(b−a)².
Union bound
• P(A) ≥ 1 − ε and P(B) ≥ 1 − ε ⇒ P(A ∩ B) ≥ 1 − 2ε
(since P(A^c ∪ B^c) ≤ P(A^c) + P(B^c))
For instance: Hoeffding applied to X + Hoeffding applied to −X + union bound
⇒ with proba ≥ 1 − ε, |X̄ − EX| < (b − a) √(log(2ε⁻¹)/(2n))
(leads to pessimistic but correct confidence intervals, unlike the CLT)
• If P(A1) ≥ 1 − ε, . . . , P(Am) ≥ 1 − ε, then
P(A1 ∩ · · · ∩ Am) ≥ 1 − mε
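A minimal sketch (Python/NumPy) of the resulting two-sided confidence interval; the Beta-distributed sample and the level ε = 0.05 are arbitrary choices:

import math
import numpy as np

rng = np.random.default_rng(0)
n, a, b, eps = 500, 0.0, 1.0, 0.05
X = rng.beta(2, 5, size=n)                       # some [0, 1]-valued sample

half_width = (b - a) * math.sqrt(math.log(2 / eps) / (2 * n))   # Hoeffding + union bound
print(X.mean() - half_width, X.mean() + half_width)             # contains EX with proba >= 1 - eps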
Bernstein’s (1946) inequality
Hoeffding’s inequality vs CLT:
e^{−2α² Var X/(b−a)²} ≥ P[ √(n/Var X) (X̄ − EX) > α ] → P(Z > α) ≈ e^{−α²/2}/(α√(2π))  (with Z ∼ N(0, 1))
⇒ Hoeffding’s inequality is imprecise for r.v. having low variance
Bernstein’s inequality:
If X, X1, X2, . . . are i.i.d. r.v. with X − EX ≤ c, then
• for any ε > 0, with proba at least 1 − ε,
X̄ ≤ EX + √(2 log(ε⁻¹) Var X/n) + c log(ε⁻¹)/(3n)
• for any t ≥ 0,
P(X̄ − EX > t) ≤ e^{−nt²/(2 Var X + 2ct/3)}
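A numeric comparison (plain Python sketch; Bernoulli(0.01), so Var X is small) of the deviation terms given by Hoeffding and by Bernstein at the same confidence level:

import math

n, eps = 10_000, 0.05
p = 0.01                                          # low-variance case: Var X = p(1 - p)
var, c, b_minus_a = p * (1 - p), 1.0, 1.0

hoeffding = b_minus_a * math.sqrt(math.log(1 / eps) / (2 * n))
bernstein = math.sqrt(2 * math.log(1 / eps) * var / n) + c * math.log(1 / eps) / (3 * n)
print(hoeffding, bernstein)                       # Bernstein is much tighter when Var X << (b - a)²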
Empirical Bernstein’s inequality (A., Munos, Szepesvári, 2007; Maurer, Pontil, 2009)
If X, X1, X2, . . . are i.i.d. r.v. with a ≤ X ≤ b, then for any ε > 0,
with proba at least 1 − ε,
EX ≤ X̄ + √(2 log(ε⁻¹) σ²/n) + 7(b − a) log(ε⁻¹)/(3n)
with
σ² = ∑_{i=1}^n (Xi − X̄)²/(n − 1).
(to be compared with EX ≤ X̄ + √(2 log(ε⁻¹) Var X/n) + (b − a) log(ε⁻¹)/(3n))
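A minimal sketch (Python/NumPy) computing the empirical Bernstein upper confidence bound on EX from a sample; the Beta distribution and ε are arbitrary choices:

import math
import numpy as np

rng = np.random.default_rng(0)
n, a, b, eps = 1_000, 0.0, 1.0, 0.05
X = rng.beta(2, 8, size=n)                 # a [0, 1]-valued sample with EX = 0.2

sigma2 = X.var(ddof=1)                     # empirical variance with the 1/(n - 1) normalization
ucb = (X.mean()
       + math.sqrt(2 * math.log(1 / eps) * sigma2 / n)
       + 7 * (b - a) * math.log(1 / eps) / (3 * n))
print(X.mean(), ucb)                       # EX <= ucb with proba at least 1 - eps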
Hoeffding-Azuma inequalities (McDiarmid’s version, 1989)
If for some c ≥ 0,
sup_{i∈{1,...,n}, (x1,...,xn)∈X^n, x∈X} [ f(x1, . . . , xn) − f(x1, . . . , xi−1, x, xi+1, . . . , xn) ] ≤ c,
then, for any λ ∈ R, W = f(X1, . . . , Xn) satisfies
E e^{λ(W−EW)} ≤ e^{nλ²c²/8}
and for any t ≥ 0,
P(W − EW > t) ≤ e^{−2t²/(nc²)}
First example: Hoeffding’s inequality in Hilbert space
• X1, . . . , Xn i.i.d. r.v. taking values in a separable Hilbert space
• EX = 0 and ‖X‖ ≤ 1
For any t ≥ 4√n,
P( ‖X1 + · · · + Xn‖ ≥ t ) ≤ e^{−t²/(8n)}.
Second example: supremum of empirical process
W = f(X1, . . . , Xn) = sup_{g∈G} [g(X1) + · · · + g(Xn)]/n,   G finite
• Assumptions: ∀g ∈ G, g takes its values in [−1, 1] and Eg(X1) = 0
• sup_{i, (x1,...,xn)∈X^n, x∈X} [ f(x1^{i−1}, xi, x_{i+1}^n) − f(x1^{i−1}, x, x_{i+1}^n) ] ≤ 2/n,
• McDiarmid’s inequality ⇒ P(W − EW > t) ≤ e^{−nt²/2}
⇒ with proba ≥ 1 − ε,
sup_{g∈G} [g(X1) + · · · + g(Xn)]/n ≤ E sup_{g∈G} [g(X1) + · · · + g(Xn)]/n + √(2 log(ε⁻¹)/n)
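A small simulation (Python/NumPy sketch; the finite class of centered threshold functions and the uniform law are ad-hoc choices) checking this McDiarmid-based bound, with EW replaced by its Monte-Carlo estimate:

import math
import numpy as np

rng = np.random.default_rng(0)
n, reps, eps = 200, 20_000, 0.05
s_grid = np.linspace(0.1, 0.9, 9)                # G = {x -> 1{x <= s} - s}: values in [-1, 1], Eg(X) = 0

def W(x):                                        # W = sup_{g in G} (g(X1) + ... + g(Xn))/n
    return max(np.mean((x <= s) - s) for s in s_grid)

Ws = np.array([W(rng.uniform(size=n)) for _ in range(reps)])
bound = Ws.mean() + math.sqrt(2 * math.log(1 / eps) / n)       # EW estimated by its Monte-Carlo mean
print(np.mean(Ws > bound), eps)                  # observed violation rate is (much) below eps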
Third example: kernel density estimation
• X1, . . . , Xn i.i.d. r.v. from a distribution with density p on R.
• h > 0 and K : R → R+ with ∫_R K = 1
• p̂(x) = (1/(nh)) ∑_{i=1}^n K((x − Xi)/h)
• W = f(X1, . . . , Xn) = ∫ |p̂(x) − p(x)| dx
f(x1^{i−1}, xi, x_{i+1}^n) − f(x1^{i−1}, x′i, x_{i+1}^n) ≤ (1/(nh)) ∫ |K((x − xi)/h) − K((x − x′i)/h)| dx ≤ 2/n,
• McDiarmid’s inequality ⇒ with proba ≥ 1 − ε,
W − EW ≤ √(2 log(ε⁻¹)/n)
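A minimal sketch (Python/NumPy; the standard Gaussian density and a Gaussian kernel are assumptions made here) computing the L1 error W whose concentration is controlled above:

import numpy as np

rng = np.random.default_rng(0)
n, h = 500, 0.2
X = rng.standard_normal(n)                       # sample from the true density p = N(0, 1)

grid = np.linspace(-5.0, 5.0, 2001)
K = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)          # Gaussian kernel, integrates to 1
p_hat = K((grid[:, None] - X[None, :]) / h).mean(axis=1) / h    # p̂(x) = (1/(nh)) Σ K((x - Xi)/h)
p_true = K(grid)

dx = grid[1] - grid[0]
W = np.abs(p_hat - p_true).sum() * dx            # Riemann-sum approximation of W = ∫ |p̂ - p| dx
print(W)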
Self-bounding functions (Boucheron, Lugosi, Massart, 2003, 2009; Maurer, 2005)
• fi(x1, . . . , xn) = inf_{x′i∈X} f(x1, . . . , xi−1, x′i, xi+1, . . . , xn)
• If for some a, b ≥ 0, for any (x1, . . . , xn) ∈ X^n,
∑_{i=1}^n [f(x1, . . . , xn) − fi(x1, . . . , xn)]² ≤ a f(x1, . . . , xn) + b,
then, for any t ≥ 0, W = f(X1, . . . , Xn) satisfies
P(W − EW > t) ≤ e^{−t²/(2(aEW + b + at/2))}
Talagrand’s inequality (Talagrand, 1996; Rio, 2002; Bousquet, 2003)
• W = sup_{g∈G} [g(X1) + · · · + g(Xn)]/n
• Eg(X) = 0 and g(X) ≤ c
• v = sup_{g∈G} Var g(X) + 2cEW
for any ε > 0, with proba at least 1 − ε,
W − EW ≤ √(2v log(ε⁻¹)/n) + c log(ε⁻¹)/(3n)
for any t ≥ 0,
P(W − EW > t) ≤ e^{−nt²/(2v + 2ct/3)}
Expected maximal deviations
Let σ > 0, m ≥ 2, and W1, . . . , Wm be r.v. such that for all s > 0 and any
1 ≤ i ≤ m, E e^{sWi} ≤ e^{s²σ²/2}. Then
E{ max_{1≤i≤m} Wi } ≤ σ√(2 log m).
If for any s > 0, we also have E e^{−sWi} ≤ e^{s²σ²/2}, then
E{ max_{1≤i≤m} |Wi| } ≤ σ√(2 log(2m)).
Proof:
max_{1≤i≤m} Wi ≤ (1/s) log ∑_{i=1}^m e^{sWi},
so, taking expectations and using Jensen’s inequality,
E max_{1≤i≤m} Wi ≤ (1/s) log ∑_{i=1}^m E e^{sWi} ≤ (1/s) log(m e^{s²σ²/2}) = (log m)/s + sσ²/2,
and the choice s = √(2 log m)/σ gives the first bound.
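A quick numerical check (Python/NumPy sketch) of the σ√(2 log m) bound for m standard Gaussians, which satisfy E e^{sWi} ≤ e^{s²σ²/2} with σ = 1:

import math
import numpy as np

rng = np.random.default_rng(0)
m, reps = 1_000, 10_000
max_W = rng.standard_normal((reps, m)).max(axis=1)   # max of m standard Gaussians (σ = 1)
print(max_W.mean(), math.sqrt(2 * math.log(m)))      # empirical mean of the max stays below σ√(2 log m) ≈ 3.72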
Extension to martingale difference sequences
• Let X1, X2, . . . and U1, U2, . . . be r.v. such that
E[Xi | U1, . . . , Ui−1] = 0 for all i ≥ 1
• Assume that for some c > 0 and some r.v. Ai measurable w.r.t.
U1, . . . , Ui−1, Xi takes its values in [Ai, Ai + c] for all i ≥ 1. Then for any t ≥ 0,
P( (X1 + · · · + Xn)/n > t ) ≤ e^{−2nt²/c²}
• same r.h.s. as if we had i.i.d. r.v. taking values in [0, c]
Other extensions
• All upper bounds extend easily to independent, non-identically
distributed r.v.
• Some upper bounds on the empirical mean can be extended to
random vectors
• All upper bounds on the empirical mean remain valid if the Xi are
sampled without replacement