TRANSCRIPT
-
Fast rates of convergence for learning problems
Prof. John Duchi
-
Outline
I Mean estimation and uniform laws
II Convexity
III Growth conditions and fast rates
-
Best rates from uniform laws
What do our uniform laws give us?
$$\hat{h} = \operatorname{argmin}_{h \in \mathcal{H}} \bigg\{ \hat{L}_n(h) = \frac{1}{n} \sum_{i=1}^n \ell(h; X_i) \bigg\}$$
and
$$\sup_{h \in \mathcal{H}} \big|\hat{L}_n(h) - L(h)\big| \lesssim \frac{1}{\sqrt{n}}$$
with high probability.
Best this can be?
$$L(\hat{h}) - L(h^\star) = \hat{L}_n(\hat{h}) - \hat{L}_n(h^\star) + L(\hat{h}) - \hat{L}_n(\hat{h}) + \hat{L}_n(h^\star) - L(h^\star)$$
-
The best rate from this approach
Central limit theorems: Consider the third error term, involving $h^\star$:
$$\sqrt{n} \big( L(h^\star) - \hat{L}_n(h^\star) \big) = \frac{1}{\sqrt{n}} \sum_{i=1}^n \big[ L(h^\star) - \ell(h^\star; X_i) \big] \overset{d}{\to} N\big(0, \operatorname{Var}(\ell(h^\star; X))\big)$$
Is this right?
-
Estimating a mean
Goal: We want to estimate $\theta^\star = E[X]$, using the loss
$$\ell(\theta; x) = \frac{1}{2} (\theta - x)^2$$
with risk
$$L(\theta) = \frac{1}{2} E[(\theta - X)^2] = \frac{1}{2} E[(\theta - E[X] + E[X] - X)^2] = \frac{1}{2} (\theta - E[X])^2 + \frac{1}{2} \operatorname{Var}(X).$$
Gap in risks: Subtracting, we have
$$L(\theta) - L(\theta^\star) = \frac{1}{2} (\theta - \theta^\star)^2 + \frac{1}{2} \operatorname{Var}(X) - \frac{1}{2} \operatorname{Var}(X) = \frac{1}{2} (\theta - \theta^\star)^2$$
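The gap identity can be sanity-checked against the empirical risk, where the sample mean plays the role of $\theta^\star$; a minimal numerical sketch (the Gaussian data is an assumption, and the identity holds exactly because the cross term vanishes at the mean):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, scale=2.0, size=100_000)  # E[X] = 1, Var(X) = 4

def empirical_risk(theta, X):
    # hat L_n(theta) = (1/n) * sum_i 0.5 * (theta - X_i)^2
    return 0.5 * np.mean((theta - X) ** 2)

theta_hat = X.mean()  # empirical minimizer, standing in for theta*
theta = 3.0
gap = empirical_risk(theta, X) - empirical_risk(theta_hat, X)
closed_form = 0.5 * (theta - theta_hat) ** 2
# The two agree up to floating point: the cross term is zero at the mean
assert abs(gap - closed_form) < 1e-8
```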
-
Estimating a sub-Gaussian mean
Let $X_i$ be independent $\sigma^2$-sub-Gaussian, so that
$$\hat{\theta}_n = \frac{1}{n} \sum_{i=1}^n X_i = \operatorname{argmin}_\theta \hat{L}_n(\theta)$$
and for $t \ge 0$ we have
$$P\big(|\hat{\theta}_n - \theta^\star| \ge t\big) \le 2 \exp\Big( -\frac{n t^2}{2 \sigma^2} \Big)$$
Lemma: With probability at least $1 - \delta$, we have
$$L(\hat{\theta}_n) - L(\theta^\star) = \frac{1}{2} (\hat{\theta}_n - \theta^\star)^2 \le \frac{C \sigma^2 \log \frac{1}{\delta}}{n}.$$
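A small simulation illustrates the fast rate: the expected excess risk is $\sigma^2/(2n)$, so doubling $n$ halves it. A minimal sketch, assuming Gaussian (hence $\sigma^2$-sub-Gaussian) samples:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, theta_star, trials = 1.0, 0.0, 2000

def mean_excess_risk(n):
    # Excess risk 0.5 * (hat_theta_n - theta*)^2, averaged over many trials;
    # its expectation is sigma^2 / (2n)
    samples = rng.normal(theta_star, sigma, size=(trials, n))
    theta_hat = samples.mean(axis=1)
    return np.mean(0.5 * (theta_hat - theta_star) ** 2)

for n in (100, 200, 400):
    print(n, mean_excess_risk(n))  # roughly halves each time n doubles
```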
-
Uniform law for means?
$$\sup_{\theta \in \mathbb{R}} \big\{ \hat{L}_n(\theta) - L(\theta) \big\} = +\infty$$
-
Convexity: heuristic and graphical explanation
-
Convexity: definitions
Definition: A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if
$$f(\lambda u + (1 - \lambda) v) \le \lambda f(u) + (1 - \lambda) f(v)$$
for all $u, v \in \mathbb{R}^d$ and $\lambda \in [0, 1]$.
-
Basic properties
A few properties of convex functions:
- If $f : \mathbb{R} \to \mathbb{R}$ is twice differentiable, then $f$ is convex if and only if $f''(t) \ge 0$
- If $f : \mathbb{R}^m \to \mathbb{R}$ is convex and $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, then $g(x) := f(Ax + b)$ is convex
- If $f_1, f_2$ are convex, then $f_1 + f_2$ is convex
-
Examples
Example (logistic loss): $\varphi(t) = \log(1 + e^{-t})$ and $\ell(\theta; x, y) = \log(1 + e^{-y x^T \theta}) = \varphi(y x^T \theta)$.
Example: Any norm $\|\cdot\|$.
Example ($\ell_1$-regularized linear regression):
$$\frac{1}{2n} \|X\theta - y\|_2^2 + \lambda \|\theta\|_1.$$
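Convexity of the logistic loss $\varphi$ can be checked pointwise against the definition; a minimal numerical sketch (the sample points are arbitrary):

```python
import math

def logistic(t):
    # phi(t) = log(1 + exp(-t)), computed stably for large |t|
    return math.log1p(math.exp(-t)) if t >= 0 else -t + math.log1p(math.exp(t))

def convexity_gap(u, v, lam):
    # Convexity means lam*phi(u) + (1-lam)*phi(v) - phi(lam*u + (1-lam)*v) >= 0
    return lam * logistic(u) + (1 - lam) * logistic(v) - logistic(lam * u + (1 - lam) * v)

for u, v, lam in [(-3.0, 2.0, 0.25), (0.5, 4.0, 0.5), (-1.0, -0.5, 0.9)]:
    assert convexity_gap(u, v, lam) >= 0.0
```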
-
Convex functions have no local minima
Theorem: Let $B = \{\theta : \|\theta\| \le 1\}$, $S = \{\theta : \|\theta\| = 1\}$, and suppose $f$ is convex and satisfies
$$f(\theta) \ge f(\theta^\star) \quad \text{for } \theta \in \theta^\star + \epsilon S.$$
For $\theta \notin \theta^\star + \epsilon B$, define
$$\theta_\epsilon := \frac{\epsilon}{\|\theta - \theta^\star\|} \theta + \Big( 1 - \frac{\epsilon}{\|\theta - \theta^\star\|} \Big) \theta^\star$$
Then
$$f(\theta) - f(\theta^\star) \ge \frac{\|\theta - \theta^\star\|}{\epsilon} \big[ f(\theta_\epsilon) - f(\theta^\star) \big]$$
-
Proof of theorem
Note that $\theta_\epsilon \in \theta^\star + \epsilon S$, so for $t = \frac{\epsilon}{\|\theta - \theta^\star\|} \le 1$,
$$\theta_\epsilon = \frac{\epsilon}{\|\theta - \theta^\star\|} \theta + \Big( 1 - \frac{\epsilon}{\|\theta - \theta^\star\|} \Big) \theta^\star = t\theta + (1 - t)\theta^\star,$$
and by convexity
$$f(\theta_\epsilon) \le t f(\theta) + (1 - t) f(\theta^\star).$$
Rearranging gives $f(\theta) - f(\theta^\star) \ge \frac{1}{t} \big[ f(\theta_\epsilon) - f(\theta^\star) \big] = \frac{\|\theta - \theta^\star\|}{\epsilon} \big[ f(\theta_\epsilon) - f(\theta^\star) \big]$, which is the claim.
-
Convex loss functions
Suppose that we use a convex loss, i.e., $\ell(\theta; X)$ is convex in $\theta$. Then
$$\hat{L}_n(\theta) > \hat{L}_n(\theta^\star) \quad \text{for all } \theta \in \theta^\star + \epsilon S$$
implies that
$$\hat{\theta}_n = \operatorname{argmin}_\theta \hat{L}_n(\theta) \quad \text{satisfies} \quad \|\hat{\theta}_n - \theta^\star\| \le \epsilon$$
-
A picture of how we achieve fast rates
-
Growth and smoothness
Let us fix some radius $r > 0$, and assume
[Growth] $L(\theta) \ge L(\theta^\star) + \frac{\lambda}{2} \|\theta - \theta^\star\|^2$ for $\|\theta - \theta^\star\| \le r$
and that
[Smoothness] $\ell(\cdot; x)$ is $M$-Lipschitz on $\{\theta : \|\theta - \theta^\star\| \le r\}$
Example: linear regression with bounded $x$.
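For the linear-regression example, the growth constant $\lambda$ is the smallest eigenvalue of the second-moment matrix $E[xx^T]$, since $L(\theta) - L(\theta^\star) = \frac{1}{2}(\theta - \theta^\star)^T E[xx^T] (\theta - \theta^\star)$. A minimal sketch with simulated bounded features (the uniform design is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 10_000

# Bounded features (||x||_inf <= 1), as in the slide's example
X = rng.uniform(-1.0, 1.0, size=(n, d))
Sigma = X.T @ X / n                    # empirical second-moment matrix E[x x^T]
lam = np.linalg.eigvalsh(Sigma).min()  # growth constant: smallest eigenvalue

# Quadratic expansion of the squared-loss risk around theta*:
#   0.5 * delta^T Sigma delta >= (lam / 2) * ||delta||^2  for any delta
delta = rng.normal(size=d)
lhs = 0.5 * delta @ Sigma @ delta
rhs = 0.5 * lam * delta @ delta
assert lhs >= rhs - 1e-12
```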
-
Fast rates under growth conditions
Theorem: Let the growth and smoothness conditions hold, define
$$\Theta_\epsilon := \{ \theta \in \Theta \mid \|\theta - \theta^\star\| \le \epsilon \}$$
and the localized Rademacher complexity
$$R_n(\Theta_\epsilon) := E\bigg[ \sup_{\theta \in \Theta_\epsilon} \bigg| \frac{1}{n} \sum_{i=1}^n \varepsilon_i \big[ \ell(\theta; X_i) - \ell(\theta^\star; X_i) \big] \bigg| \bigg]$$
Fix $t \ge 0$ and choose any $\epsilon \le r$ such that
$$\frac{\lambda \epsilon^2}{2} \ge 2 R_n(\Theta_\epsilon) + \sqrt{2} \frac{M}{\sqrt{n}} t \cdot \epsilon.$$
Then
$$P\big( \|\hat{\theta} - \theta^\star\| \ge \epsilon \big) \le 2 e^{-n t^2}$$
-
Bounding the local Rademacher complexity
Under the conditions of the theorem, for $\Theta \subset \mathbb{R}^d$ we have
$$\|\hat{\theta} - \theta^\star\| \le \frac{C M}{\lambda \sqrt{n}} \big( \sqrt{d} + t \big) \quad \text{w.p.} \ge 1 - 2 e^{-n t^2}.$$
-
Bounding the local Rademacher complexity
Under the conditions of the theorem, for $\Theta \subset \mathbb{R}^d$ we have
$$L(\hat{\theta}) - L(\theta^\star) \le O(\text{stuff}) \cdot \frac{\log \frac{1}{\delta}}{n} \quad \text{w.p.} \ge 1 - \delta.$$
-
Multiclass classification
Suppose we have the multiclass logistic loss: for $\theta = [\theta_1 \cdots \theta_k]$, $\theta_l \in \mathbb{R}^d$, $y \in \{1, \dots, k\}$, $\|x\|_2 \le M$,
$$\ell(\theta; x, y) = \log \bigg( \sum_{l=1}^k \exp\big( x^T (\theta_l - \theta_y) \big) \bigg).$$
Then
$$R_n(\Theta_\epsilon) \lesssim \frac{M \sqrt{dk}}{\sqrt{n}} \epsilon$$
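The loss above translates directly to code; a minimal sketch, where the $(d, k)$ parameter layout and zero-indexed labels are assumptions:

```python
import numpy as np

def multiclass_logistic_loss(theta, x, y):
    """l(theta; x, y) = log sum_l exp(x^T (theta_l - theta_y)).

    theta: (d, k) array whose columns are theta_1, ..., theta_k;
    y: label in {0, ..., k-1} (zero-indexed).
    """
    scores = x @ theta        # x^T theta_l for each class l
    z = scores - scores[y]    # x^T (theta_l - theta_y)
    m = z.max()               # log-sum-exp, computed stably
    return m + np.log(np.sum(np.exp(z - m)))

rng = np.random.default_rng(3)
d, k = 4, 3
theta = rng.normal(size=(d, k))
x = rng.normal(size=d)
loss = multiclass_logistic_loss(theta, x, y=1)
assert loss >= 0.0
```

The $l = y$ term contributes $\exp(0) = 1$ to the sum, so the loss is always nonnegative.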
-
Proof of theorem
Part 1: Consider the event that $\hat{L}_n(\theta) \le \hat{L}_n(\theta^\star)$ for some $\theta \in \Theta_\epsilon$, which implies
$$\big( \hat{L}_n(\theta) - L(\theta) \big) - \big( \hat{L}_n(\theta^\star) - L(\theta^\star) \big) \le -\frac{\lambda}{2} \epsilon^2$$
-
Proof of theorem
Part 2: Consider the localized excess risk for $\theta \in \Theta_\epsilon$,
$$\frac{1}{n} \sum_{i=1}^n \big[ (\ell(\theta; X_i) - \ell(\theta^\star; X_i)) - (L(\theta) - L(\theta^\star)) \big]$$
and get (for all $t \ge 0$)
$$P\bigg( \sup_{\theta \in \Theta_\epsilon} \big| \hat{L}_n(\theta) - L(\theta) - (\hat{L}_n(\theta^\star) - L(\theta^\star)) \big| \ge 2 R_n(\Theta_\epsilon) + t \bigg) \le 2 \exp\Big( -\frac{n t^2}{2 M^2 \epsilon^2} \Big)$$
-
Proof of theorem
Part 3: Implications: $\|\hat{\theta} - \theta^\star\| \ge \epsilon$
$$\Rightarrow \quad \hat{L}_n(\theta) - \hat{L}_n(\theta^\star) \le 0 \quad \text{for some } \theta \in \Theta_\epsilon$$
$$\Rightarrow \quad \sup_{\theta \in \Theta_\epsilon} \big| \hat{L}_n(\theta) - L(\theta) - (\hat{L}_n(\theta^\star) - L(\theta^\star)) \big| \ge \frac{\lambda \epsilon^2}{2}$$
$$\Rightarrow \quad \sup_{\theta \in \Theta_\epsilon} \big| \hat{L}_n(\theta) - L(\theta) - (\hat{L}_n(\theta^\star) - L(\theta^\star)) \big| \ge 2 R_n(\Theta_\epsilon) + \sqrt{2} \frac{M}{\sqrt{n}} t \cdot \epsilon$$
-
Reading and bibliography
1. P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497–1537, 2005.
2. A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996 (Ch. 3.2–3.4).
3. S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005 (§5.3).
4. A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM and Mathematical Programming Society, 2009 (Ch. 5.3).