
  • Fast rates of convergence for learning problems

    Prof. John Duchi

  • Outline

    I. Mean estimation and uniform laws

    II. Convexity

    III. Growth conditions and fast rates

  • Best rates from uniform laws

    What do our uniform laws give us?

    $$\hat h = \operatorname*{argmin}_{h \in \mathcal{H}} \Big\{ \hat L_n(h) = \frac{1}{n} \sum_{i=1}^n \ell(h; X_i) \Big\}$$

    and

    $$\sup_{h \in \mathcal{H}} \big| \hat L_n(h) - L(h) \big| \lesssim \frac{1}{\sqrt{n}}$$

    with high probability

    Best this can be?

    $$L(\hat h) - L(h^\star) = \hat L_n(\hat h) - \hat L_n(h^\star) + L(\hat h) - \hat L_n(\hat h) + \hat L_n(h^\star) - L(h^\star)$$
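Spelling out the rate this decomposition yields (a worked step using only the displays above): the first term is nonpositive because $\hat h$ minimizes $\hat L_n$, and each remaining term is bounded by the uniform deviation.

```latex
L(\hat h) - L(h^\star)
  \;\le\; 0 \;+\; 2\sup_{h \in \mathcal{H}} \big|\hat L_n(h) - L(h)\big|
  \;\lesssim\; \frac{1}{\sqrt{n}}
  \qquad \text{with high probability,}
```

so the uniform-law approach by itself cannot beat the $1/\sqrt{n}$ rate.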

  • The best rate from this approach

    Central limit theorems: Consider the third error term involving $h^\star$:

    $$\sqrt{n}\Big( L(h^\star) - \hat L_n(h^\star) \Big) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \big[ L(h^\star) - \ell(h^\star; X_i) \big] \stackrel{d}{\to} \mathsf{N}\big(0, \mathrm{Var}(\ell(h^\star; X))\big)$$

    Is this right?
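A quick simulation of this CLT (an illustrative sketch; the Gaussian data and the squared loss from the next slide are my choices, with $\theta^\star = 0$ so $L(\theta^\star) = \tfrac{1}{2}$):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 2_000, 5_000
X = rng.normal(0.0, 1.0, size=(trials, n))   # theta* = E[X] = 0

loss = 0.5 * X**2                            # ell(theta*; X_i) = 0.5*(theta* - X_i)^2
L_star = 0.5                                 # L(theta*) = 0.5*Var(X)
Z = np.sqrt(n) * (L_star - loss.mean(axis=1))

# The empirical std of Z should match sqrt(Var(ell(theta*; X)))
print("std of sqrt(n)*(L - L_n_hat):", Z.std())
print("sqrt(Var(ell)):              ", loss.std())
```

Both printed values come out near $\sqrt{1/2} \approx 0.707$, matching the asserted limiting variance.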

  • Estimating a mean

    Goal: We want to estimate $\theta^\star = \mathbb{E}[X]$, using the loss

    $$\ell(\theta; x) = \frac{1}{2}(\theta - x)^2$$

    with risk

    $$L(\theta) = \frac{1}{2}\mathbb{E}\big[(\theta - X)^2\big] = \frac{1}{2}\mathbb{E}\big[(\theta - \mathbb{E}[X] + \mathbb{E}[X] - X)^2\big] = \frac{1}{2}(\theta - \mathbb{E}[X])^2 + \frac{1}{2}\mathrm{Var}(X).$$

    Gap in risks: Subtracting, we have

    $$L(\theta) - L(\theta^\star) = \frac{1}{2}(\theta - \theta^\star)^2 + \frac{1}{2}\mathrm{Var}(X) - \frac{1}{2}\mathrm{Var}(X) = \frac{1}{2}(\theta - \theta^\star)^2$$
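A numerical sanity check of this slide (a sketch; the Gaussian distribution and sample sizes are illustrative): the excess risk $\tfrac{1}{2}(\hat\theta_n - \theta^\star)^2$ of the sample mean shrinks at the fast rate $1/n$, not $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = 2.0  # true mean (illustrative choice)

for n in [100, 1_000, 10_000]:
    trials = 2_000
    X = rng.normal(loc=theta_star, scale=1.0, size=(trials, n))
    theta_hat = X.mean(axis=1)          # ERM for the squared loss is the sample mean
    excess = 0.5 * (theta_hat - theta_star) ** 2
    print(f"n={n:6d}  mean excess risk={excess.mean():.2e}  (1/(2n)={0.5/n:.2e})")
```

Since $\mathbb{E}[(\hat\theta_n - \theta^\star)^2] = \mathrm{Var}(X)/n$, the printed excess risk tracks $1/(2n)$.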

  • Estimating a sub-Gaussian mean

    Let $X_i$ be independent $\sigma^2$-sub-Gaussian, so that

    $$\hat\theta_n = \frac{1}{n}\sum_{i=1}^n X_i = \operatorname*{argmin}_\theta \hat L_n(\theta)$$

    and for $t \ge 0$ we have

    $$\mathbb{P}\big(|\hat\theta_n - \theta^\star| \ge t\big) \le 2\exp\Big({-\frac{nt^2}{2\sigma^2}}\Big)$$

    Lemma. With probability at least $1 - \delta$, we have

    $$L(\hat\theta_n) - L(\theta^\star) = \frac{1}{2}(\hat\theta_n - \theta^\star)^2 \le \frac{C\sigma^2 \log\frac{1}{\delta}}{n}.$$
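The lemma follows by inverting the tail bound; a one-line derivation filling in the step the slide leaves implicit (the constant $C$ absorbs the factor 2 inside the logarithm):

```latex
\text{Set } t = \sigma\sqrt{\tfrac{2\log(2/\delta)}{n}}:\qquad
\mathbb{P}\Big(|\hat\theta_n - \theta^\star| \ge \sigma\sqrt{\tfrac{2\log(2/\delta)}{n}}\Big) \le \delta,
\quad\text{so w.p. } \ge 1 - \delta,\quad
\tfrac{1}{2}(\hat\theta_n - \theta^\star)^2 \le \frac{\sigma^2\log\tfrac{2}{\delta}}{n}.
```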

  • Uniform law for means?

    $$\sup_{\theta \in \mathbb{R}} \Big\{ \hat L_n(\theta) - L(\theta) \Big\} = +\infty$$
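Why the supremum is infinite (a short computation from the definitions above): the quadratic terms cancel, leaving a function that is affine in $\theta$.

```latex
\hat L_n(\theta) - L(\theta)
  = \frac{1}{2n}\sum_{i=1}^n (\theta - X_i)^2 - \frac{1}{2}\mathbb{E}\big[(\theta - X)^2\big]
  = \big(\mathbb{E}[X] - \bar X_n\big)\,\theta
    + \frac{1}{2}\Big(\frac{1}{n}\sum_{i=1}^n X_i^2 - \mathbb{E}[X^2]\Big),
```

which is unbounded in $\theta$ whenever $\bar X_n \neq \mathbb{E}[X]$, i.e. with probability one for continuous distributions.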

  • Convexity: heuristic and graphical explanation


  • Convexity: definitions

    Definition. A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if

    $$f(\lambda u + (1 - \lambda) v) \le \lambda f(u) + (1 - \lambda) f(v)$$

    for all $u, v \in \mathbb{R}^d$ and $\lambda \in [0, 1]$

  • Basic properties

    A few properties of convex functions

    - If $f : \mathbb{R} \to \mathbb{R}$ is twice differentiable, then $f$ is convex if and only if $f''(t) \ge 0$

    - If $f : \mathbb{R}^m \to \mathbb{R}$ is convex and $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, then $g(x) := f(Ax + b)$ is convex

    - If $f_1, f_2$ are convex, then $f_1 + f_2$ is convex

  • Examples

    Example (logistic loss): $\varphi(t) = \log(1 + e^{-t})$ and $\ell(\theta; x, y) = \log(1 + e^{-y x^T \theta}) = \varphi(y x^T \theta)$

    Example: any norm $\|\cdot\|$.

    Example ($\ell_1$-regularized linear regression):

    $$\frac{1}{2n}\|X\theta - y\|_2^2 + \lambda \|\theta\|_1.$$
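A small numerical check (illustrative, not from the slides) that the logistic loss satisfies the convexity definition along random segments; this is exactly the composition-plus-definition reasoning of the previous slide:

```python
import numpy as np

def logistic_loss(theta, x, y):
    """Logistic loss ell(theta; x, y) = log(1 + exp(-y * x^T theta))."""
    return np.log1p(np.exp(-y * (x @ theta)))

rng = np.random.default_rng(1)
d = 5
x, y = rng.normal(size=d), 1.0

# Check f(lam*u + (1-lam)*v) <= lam*f(u) + (1-lam)*f(v) on random segments
for _ in range(1000):
    u, v = rng.normal(size=d), rng.normal(size=d)
    lam = rng.uniform()
    lhs = logistic_loss(lam * u + (1 - lam) * v, x, y)
    rhs = lam * logistic_loss(u, x, y) + (1 - lam) * logistic_loss(v, x, y)
    assert lhs <= rhs + 1e-12, "convexity inequality violated"
print("convexity inequality held on all sampled segments")
```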

  • Convex functions have no local minima

    Theorem. Let $B = \{\theta : \|\theta\| \le 1\}$, $S = \{\theta : \|\theta\| = 1\}$, and suppose $f$ is convex and satisfies

    $$f(\theta) \ge f(\theta^\star) \quad \text{for } \theta \in \theta^\star + \epsilon S.$$

    For $\theta \notin \theta^\star + \epsilon B$, define

    $$\theta_\epsilon := \frac{\epsilon}{\|\theta - \theta^\star\|}\,\theta + \Big(1 - \frac{\epsilon}{\|\theta - \theta^\star\|}\Big)\,\theta^\star.$$

    Then

    $$f(\theta) - f(\theta^\star) \ge \frac{\|\theta - \theta^\star\|}{\epsilon}\,\big[f(\theta_\epsilon) - f(\theta^\star)\big]$$

  • Proof of theorem

    Note that $\theta_\epsilon \in \theta^\star + \epsilon S$, so for $t = \frac{\epsilon}{\|\theta - \theta^\star\|} \le 1$,

    $$\theta_\epsilon = \frac{\epsilon}{\|\theta - \theta^\star\|}\,\theta + \Big(1 - \frac{\epsilon}{\|\theta - \theta^\star\|}\Big)\,\theta^\star = t\theta + (1 - t)\theta^\star,$$

    and

    $$f(\theta_\epsilon) \le t f(\theta) + (1 - t) f(\theta^\star).$$

    Rearranging gives $f(\theta) - f(\theta^\star) \ge \frac{1}{t}\big[f(\theta_\epsilon) - f(\theta^\star)\big]$, which is the claim.

  • Convex loss functions

    Suppose that we use a convex loss, i.e. $\ell(\theta; X)$ is convex in $\theta$. Then

    $$\hat L_n(\theta) > \hat L_n(\theta^\star) \quad \text{for all } \theta \in \theta^\star + \epsilon S$$

    implies that

    $$\hat\theta_n = \operatorname*{argmin}_\theta \hat L_n(\theta) \quad \text{satisfies} \quad \|\hat\theta_n - \theta^\star\| \le \epsilon,$$

    by the preceding theorem applied to $f = \hat L_n$: any point outside the ball $\theta^\star + \epsilon B$ has empirical loss at least that of some point on the sphere, hence exceeding $\hat L_n(\theta^\star)$.

  • A picture of how we achieve fast rates


  • Growth and smoothness

    Let us fix some radius $r > 0$, and assume

    $$\text{[Growth]} \quad L(\theta) \ge L(\theta^\star) + \frac{\lambda}{2}\|\theta - \theta^\star\|^2 \quad \text{for } \|\theta - \theta^\star\| \le r$$

    and that

    $$\text{[Smoothness]} \quad \ell(\cdot\,; x) \text{ is } M\text{-Lipschitz on } \{\theta : \|\theta - \theta^\star\| \le r\}$$

    Example (linear regression with bounded $x$); a sketch follows below.
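One way to fill in the example, under assumptions the slide leaves implicit ($\theta^\star$ is the risk minimizer, $\|x\|_2 \le b$, and $|y|$ bounded):

```latex
% \ell(\theta; x, y) = \tfrac{1}{2}(x^T\theta - y)^2.
% Growth: since \mathbb{E}[x(x^T\theta^\star - y)] = 0 at the minimizer,
L(\theta) - L(\theta^\star)
  = \tfrac{1}{2}(\theta - \theta^\star)^T \mathbb{E}[xx^T]\,(\theta - \theta^\star)
  \ge \tfrac{\lambda_{\min}(\mathbb{E}[xx^T])}{2}\,\|\theta - \theta^\star\|_2^2,
% so [Growth] holds with \lambda = \lambda_{\min}(\mathbb{E}[xx^T]).
% Smoothness: \|\nabla_\theta \ell(\theta; x, y)\|_2 = \|x\|_2\,|x^T\theta - y|
%   \le b\,\big(b(\|\theta^\star\|_2 + r) + |y|\big) =: M
%   on \{\theta : \|\theta - \theta^\star\| \le r\}, so [Smoothness] holds.
```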

  • Fast rates under growth conditions

    Theorem. Let the conditions on growth and smoothness hold, define

    $$\Theta_\epsilon := \{\theta \in \Theta \mid \|\theta - \theta^\star\| \le \epsilon\}$$

    and the localized Rademacher complexity

    $$\mathcal{R}_n(\Theta_\epsilon) := \mathbb{E}\Bigg[ \sup_{\theta \in \Theta_\epsilon} \frac{1}{n} \sum_{i=1}^n \varepsilon_i \big[\ell(\theta; X_i) - \ell(\theta^\star; X_i)\big] \Bigg],$$

    where the $\varepsilon_i \in \{\pm 1\}$ are i.i.d. random signs. Fix $t \ge 0$ and choose any $\epsilon \le r$ such that

    $$\frac{\lambda \epsilon^2}{2} \ge 2\mathcal{R}_n(\Theta_\epsilon) + \sqrt{2}\,M t \epsilon.$$

    Then

    $$\mathbb{P}\big( \|\hat\theta - \theta^\star\| \ge \epsilon \big) \le 2e^{-nt^2}$$
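The localized complexity can be estimated by Monte Carlo; a minimal sketch for a concrete instance (squared loss for one-dimensional mean estimation, an illustrative choice not on the slide), averaging the supremum over fresh sign draws:

```python
import numpy as np

rng = np.random.default_rng(2)

def localized_rademacher(X, theta_star, eps, num_sign_draws=500, grid_size=201):
    """Monte Carlo estimate of R_n(Theta_eps) for ell(theta; x) = 0.5*(theta - x)^2,
    with Theta_eps = [theta* - eps, theta* + eps] discretized on a grid."""
    n = len(X)
    thetas = np.linspace(theta_star - eps, theta_star + eps, grid_size)
    # loss differences ell(theta; X_i) - ell(theta*; X_i), shape (grid, n)
    diffs = 0.5 * (thetas[:, None] - X) ** 2 - 0.5 * (theta_star - X) ** 2
    sups = []
    for _ in range(num_sign_draws):
        signs = rng.choice([-1.0, 1.0], size=n)   # Rademacher variables
        sups.append((diffs @ signs).max() / n)    # sup over the theta-grid
    return float(np.mean(sups))

n, theta_star = 1000, 0.0
X = rng.normal(theta_star, 1.0, size=n)
for eps in [0.1, 0.2, 0.4]:
    Rn = localized_rademacher(X, theta_star, eps)
    print(f"eps={eps:.1f}  R_n ~= {Rn:.4f}  (compare eps/sqrt(n) = {eps/np.sqrt(n):.4f})")
```

The estimates scale roughly linearly in $\epsilon$ and like $1/\sqrt{n}$ in $n$, which is the behavior the next slides exploit.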

  • Bounding the local Rademacher complexity

    Under the conditions of the theorem, for $\Theta \subset \mathbb{R}^d$ we have $\mathcal{R}_n(\Theta_\epsilon) \lesssim M\epsilon\sqrt{d/n}$, so the theorem yields

    $$\|\hat\theta - \theta^\star\| \le \frac{CM}{\lambda}\bigg(\sqrt{\frac{d}{n}} + t\bigg) \quad \text{w.p.} \ge 1 - 2e^{-nt^2}.$$

  • Bounding the local Rademacher complexity

    Under the conditions of the theorem, for $\Theta \subset \mathbb{R}^d$ we have

    $$L(\hat\theta) - L(\theta^\star) \le O(\text{stuff}) \cdot \frac{\log\frac{1}{\delta}}{n} \quad \text{w.p.} \ge 1 - \delta.$$
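Unpacking the $O(\text{stuff})$ factor (a derivation sketch consistent with the theorem and the previous corollary; constants are generic): take $t^2 = \log(2/\delta)/n$, set $\epsilon = \frac{CM}{\lambda}(\sqrt{d/n} + t)$, and on the event $\|\hat\theta - \theta^\star\| \le \epsilon$ bound the excess risk over $\Theta_\epsilon$:

```latex
L(\hat\theta) - L(\theta^\star)
  \le \sup_{\theta \in \Theta_\epsilon}
      \big| \hat L_n(\theta) - L(\theta) - (\hat L_n(\theta^\star) - L(\theta^\star)) \big|
  \le 2\mathcal{R}_n(\Theta_\epsilon) + \sqrt{2}\,M t \epsilon
  \lesssim \frac{M^2}{\lambda}\cdot\frac{d + \log\frac{1}{\delta}}{n},
```

so "stuff" is of order $M^2\big(d + \log\frac{1}{\delta}\big)/\lambda$, with the dimension dependence folded into the constant the slide hides.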

  • Multiclass classification

    Suppose we have the multiclass logistic loss for $\theta = [\theta_1 \cdots \theta_k]$, $\theta_l \in \mathbb{R}^d$, $y \in \{1, \ldots, k\}$, $\|x\|_2 \le M$:

    $$\ell(\theta; x, y) = \log\Bigg( \sum_{l=1}^k \exp\big( x^T(\theta_l - \theta_y) \big) \Bigg).$$

    Then

    $$\mathcal{R}_n(\Theta_\epsilon) \lesssim \frac{M\sqrt{dk}}{\sqrt{n}}\,\epsilon$$
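A numerically stable implementation of this loss (a sketch; scipy's `logsumexp` handles the exponentials, and the variable names are mine):

```python
import numpy as np
from scipy.special import logsumexp

def multiclass_logistic_loss(theta, x, y):
    """ell(theta; x, y) = log(sum_l exp(x^T (theta_l - theta_y))) for theta of shape (d, k).

    Equals the usual cross-entropy -log softmax_y(x^T theta), since subtracting
    x^T theta_y inside every exponent is the same as dividing by exp(x^T theta_y).
    """
    scores = x @ theta                   # shape (k,): x^T theta_l for each class l
    return logsumexp(scores - scores[y])

rng = np.random.default_rng(3)
d, k = 4, 3
theta = rng.normal(size=(d, k))
x = rng.normal(size=d)
x /= np.linalg.norm(x)                   # enforce the bound ||x||_2 <= M with M = 1
print(multiclass_logistic_loss(theta, x, y=1))
```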

  • Proof of theorem

    Part 1: Consider the event that $\hat L_n(\theta) \le \hat L_n(\theta^\star)$ for some $\theta \in \Theta_\epsilon$ with $\|\theta - \theta^\star\| = \epsilon$. By the growth condition, $L(\theta) - L(\theta^\star) \ge \frac{\lambda}{2}\epsilon^2$ for such $\theta$, so this event implies

    $$\Big( \hat L_n(\theta) - L(\theta) \Big) - \Big( \hat L_n(\theta^\star) - L(\theta^\star) \Big) \le -\frac{\lambda}{2}\epsilon^2$$

  • Proof of theorem

    Part 2: Consider the localized excess risk for $\theta \in \Theta_\epsilon$,

    $$\frac{1}{n}\sum_{i=1}^n \big[ (\ell(\theta; X_i) - \ell(\theta^\star; X_i)) - (L(\theta) - L(\theta^\star)) \big],$$

    and get (for all $t \ge 0$)

    $$\mathbb{P}\bigg( \sup_{\theta \in \Theta_\epsilon} \Big| \hat L_n(\theta) - L(\theta) - \big(\hat L_n(\theta^\star) - L(\theta^\star)\big) \Big| \ge 2\mathcal{R}_n(\Theta_\epsilon) + t \bigg) \le 2\exp\Big({-\frac{nt^2}{2M^2\epsilon^2}}\Big)$$

  • Proof of theorem

    Part 3: Implications:

    $$\|\hat\theta - \theta^\star\| \ge \epsilon$$

    $$\Rightarrow\; \hat L_n(\theta) - \hat L_n(\theta^\star) \le 0 \;\text{ for some } \theta \in \Theta_\epsilon$$

    $$\Rightarrow\; \sup_{\theta \in \Theta_\epsilon} \Big| \hat L_n(\theta) - L(\theta) - \big(\hat L_n(\theta^\star) - L(\theta^\star)\big) \Big| \ge \frac{\lambda\epsilon^2}{2}$$

    $$\Rightarrow\; \sup_{\theta \in \Theta_\epsilon} \Big| \hat L_n(\theta) - L(\theta) - \big(\hat L_n(\theta^\star) - L(\theta^\star)\big) \Big| \ge 2\mathcal{R}_n(\Theta_\epsilon) + \sqrt{2}\,M t \epsilon,$$

    and Part 2 (with deviation $\sqrt{2}\,M t \epsilon$) bounds the probability of this last event by $2e^{-nt^2}$, proving the theorem.

  • Reading and bibliography

    1. P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005

    2. A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996 (Ch. 3.2–3.4)

    3. S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005 (§5.3)

    4. A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM and Mathematical Programming Society, 2009 (Ch. 5.3)