
Early Stopping for Computational Learning

Lorenzo Rosasco
Università di Genova, Massachusetts Institute of Technology, Istituto Italiano di Tecnologia

CBMM

Sestri Levante, September, 2014

joint work with A. Tacchetti, S. Villa (IIT); also B. C. Vu + S. Villa (IIT) and J. Lin, D. X. Zhou (CityU)

Image classification (not processing)

$$X_n = \begin{pmatrix} x_{11} & \cdots & x_{p1} \\ \vdots & & \vdots \\ x_{1n} & \cdots & x_{pn} \end{pmatrix}, \qquad Y_n = \begin{pmatrix} +1 \\ -1 \\ \vdots \end{pmatrix}$$

n = millions, p = hundreds of thousands

Learning Theory and Optimization: One Remark

Learning is stochastic optimization… yet, there is a divide between statistical and optimization analysis.

This is not the learning problem

Given $(x_1, y_1), \ldots, (x_n, y_n)$, consider
$$\min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n V(y_i, w^T x_i) + \lambda R(w)$$

This is not the learning problem

Given $P(x, y)$, consider
$$\min_{w \in \mathbb{R}^p} \mathbb{E}\, V(y, w^T x) + \lambda R(w)$$

Plan

• Bias + Variance + Computations
• Name Game: Problems, Estimators and Algorithms
• Iterative Algorithms and Early Stopping
• Learning with Stochastic/Incremental Gradient
• Beyond Least Squares

The Problem

• $S = (x_i, y_i)_{i=1}^n \sim P(x, y)$,
• $x_i \in X = \mathbb{R}^p$, $p \ge 1$, $\|x_i\| \le 1$, $|y_i| \le 1$ almost surely.

Assumption 0
$$w^\dagger = \operatorname*{argmin}_{w \in X_0} \|w\|^2, \qquad X_0 = \operatorname*{argmin}_{X} \mathcal{E}(w), \qquad \mathcal{E}(w) = \mathbb{E}|y - w^T x|^2$$

Error Measures

Risk: $P\big(\mathcal{E}(w) - \mathcal{E}(w^\dagger) \ge \epsilon\big)$

Parameters: $P\big(\|w - w^\dagger\|^2 \ge \epsilon\big)$

Assumption 1

Let $C = \mathbb{E}[x x^T]$ and $C v_j = \alpha_j v_j$, $j = 1, 2, \ldots$

• Source Condition: $\displaystyle\sum_j \frac{|v_j^T w^\dagger|^2}{\alpha_j^{2s}} < \infty$, $s \in (0, \infty)$
• Effective Dimension: $\alpha_j \sim j^{-b}$, $b \in [1, \infty]$

An Estimator

Bias + Variance

Ill-Posedness?

All we have is $S = (x_i, y_i)_{i=1}^n \sim P(x, y)$.

$$w_\lambda = \operatorname*{argmin}_{w \in X} \frac{1}{n} \sum_{i=1}^n (y_i - w^T x_i)^2 + \lambda\, w^T w$$

Convergence

Theorem […]
Choose $\lambda_n$ such that $\lambda_n \to 0$ and $1/(\lambda_n n) \to 0$ for $n \to \infty$. Set $w = w_{\lambda_n}$. Then under Assumption 0, it holds
$$P\big(\lim_{n \to \infty} \mathcal{E}(w) = \mathcal{E}(w^\dagger)\big) = 1$$

Analogous results hold for parameter estimation. Not a practical result.

Validation


Cross Validation

A Posteriori Parameter Choice

What about $\lambda$?

A posteriori choice: split the data into a training set $S$ and a validation set $S' = (x'_k, y'_k)_{k=1}^m$, and choose
$$\hat\lambda = \operatorname*{argmin}_{\lambda \in \Lambda_n} \mathcal{E}'(\lambda), \qquad \mathcal{E}'(\lambda) = \frac{1}{m} \sum_{k=1}^m (y'_k - w_\lambda^T x'_k)^2$$
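The hold-out selection above can be sketched in a few lines. This is an illustrative sketch, not code from the slides; the function names and the grid are made up:

```python
import numpy as np

def ridge(X, Y, lam):
    """Tikhonov-regularized least squares: (X^T X + lam * n * I)^{-1} X^T Y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + lam * n * np.eye(p), X.T @ Y)

def select_lambda(X, Y, X_val, Y_val, lambdas):
    """Pick the lambda in the grid minimizing the hold-out squared error."""
    errs = [np.mean((Y_val - X_val @ ridge(X, Y, lam)) ** 2) for lam in lambdas]
    return lambdas[int(np.argmin(errs))]
```

Each candidate $\lambda$ costs one full solve on the training half, which is where the $\#\Lambda$ factor in the complexity counts below comes from.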

Adaptive rates

Theorem
Set $w = w_{\hat\lambda}$. Then under Assumptions 0 and 1, with $r = s + 1/2$, it holds
$$\mathcal{E}(w) - \mathcal{E}(w^\dagger) \le c\,\tau\, n^{-\frac{2rb}{2rb+1}}$$
with probability at least $1 - e^{-\tau}$, $\tau \in [0, \infty)$.

[Caponnetto, De Vito, R., Smale, Zhou 2005-10]

The above result is optimal in a minimax sense, i.e., with respect to
$$\inf_w \sup_{w^\dagger \in \Theta_{b,s}} \mathbb{E}[\mathcal{E}(w) - \mathcal{E}(w^\dagger)],$$
and it describes what is done in practice.

Proof Approach

$$\mathbb{E}(y - w^T x)^2 \quad \longleftrightarrow \quad \frac{1}{n} \|X_n w - Y_n\|^2$$

Population problem: $C w = g$, with $C = \mathbb{E}\big[\tfrac{1}{n} X_n^T X_n\big]$ and $g = \mathbb{E}[x y]$.

Empirical problem: $\hat C w = \hat g$, with $\hat C = \tfrac{1}{n} X_n^T X_n$ and $\hat g = \tfrac{1}{n} \sum_{i=1}^n x_i y_i$.

$\|\hat C - C\|$: concentration inequalities.

Separate analysis (spectral calculus) and probability.

[Caponnetto, De Vito, R., Smale, Zhou 2005-10]

Algorithms?

What's missing? Computations!

• How do approximate computations affect our error estimates?
• What's the cost of computing an estimator?

Learning Theory ~ Stochastic Optimization: so far, Information-Based Complexity (IBC) and the Nemirovski-Yudin oracle model.

Computations

$$w_\lambda = \operatorname*{argmin}_{w \in X} \frac{1}{n} \sum_{i=1}^n (y_i - w^T x_i)^2 + \lambda\, w^T w$$

Parametric: $w_\lambda = (X_n^T X_n + \lambda n I)^{-1} X_n^T Y_n$, with complexity $O(np^2\, \#\Lambda) + O(mp\, \#\Lambda)$.

Nonparametric: $x^T w = x^T X_n^T c$, with $c_\lambda = (X_n X_n^T + \lambda n I)^{-1} Y_n$ and complexity $O(n^3\, \#\Lambda) + O(mn\, \#\Lambda)$.

Can we do better than this?
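The parametric ($p \times p$ system) and nonparametric ($n \times n$ system) closed forms above give the same predictor, since $(X_n^T X_n + \lambda n I)^{-1} X_n^T = X_n^T (X_n X_n^T + \lambda n I)^{-1}$. A minimal numerical check of that identity, as a sketch with made-up sizes:

```python
import numpy as np

def ridge_primal(Xn, Yn, lam):
    """Parametric form: solve the p x p system, cost O(n p^2 + p^3)."""
    n, p = Xn.shape
    return np.linalg.solve(Xn.T @ Xn + lam * n * np.eye(p), Xn.T @ Yn)

def ridge_dual(Xn, Yn, lam):
    """Nonparametric form: solve the n x n system, cost O(n^2 p + n^3)."""
    n = Xn.shape[0]
    c = np.linalg.solve(Xn @ Xn.T + lam * n * np.eye(n), Yn)
    return Xn.T @ c  # w = X_n^T c, so predictions are x^T X_n^T c
```

Which form is cheaper depends on whether $n$ or $p$ is smaller, which is exactly the parametric/nonparametric split in the complexity counts above.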

gradient descent… [Landweber ’50]

$w_0 = 0$,
for $k = 1 : t-1$
  $w_k = w_{k-1} - \dfrac{\gamma}{n}\, X_n^T (X_n w_{k-1} - Y_n)$

equivalently,
$$w_t = \frac{\gamma}{n} \sum_{j=0}^{t-1} \Big(I - \frac{\gamma}{n} X_n^T X_n\Big)^j X_n^T Y_n$$
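The Landweber iteration above is plain gradient descent on the empirical risk; a minimal sketch (step size and names are illustrative, assuming the data are scaled so that $\|\hat C\| < 2/\gamma$):

```python
import numpy as np

def gd_least_squares(Xn, Yn, t, gamma):
    """t steps of gradient descent (Landweber iteration) on (1/n)||X_n w - Y_n||^2."""
    n, p = Xn.shape
    w = np.zeros(p)  # w_0 = 0, as in the slides
    for _ in range(t):
        w = w - (gamma / n) * Xn.T @ (Xn @ w - Yn)
    return w
```

For large $t$ the iterate approaches the least-squares solution; the point of early stopping is that one deliberately does not take $t$ that large.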

Interlude: Neumann Series

For a scalar (or operator) $c$ with $|1 - c| < 1$:
$$c^{-1} = \sum_{j=0}^{\infty} (1 - c)^j, \qquad c^{-1} \sim \sum_{j=0}^{t-1} (1 - c)^j$$

Applied to the empirical covariance:
$$(X_n^T X_n)^{-1} = \sum_{j=0}^{\infty} (I - X_n^T X_n)^j, \qquad (X_n^T X_n)^{-1} \sim \sum_{j=0}^{t-1} (I - X_n^T X_n)^j \sim (X_n^T X_n + \lambda_n I)^{-1}$$

…also gradient descent
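The truncated Neumann series can be checked numerically; a tiny sketch with a made-up diagonal matrix whose spectrum lies in $(0, 1)$ so the series converges:

```python
import numpy as np

# Matrix with eigenvalues in (0, 1): the Neumann series for its inverse converges.
A = np.diag([0.9, 0.5, 0.1])

# Truncated Neumann series: sum_{j=0}^{t-1} (I - A)^j approximates A^{-1}.
t = 200
partial = np.zeros((3, 3))
term = np.eye(3)
for _ in range(t):
    partial += term
    term = term @ (np.eye(3) - A)
```

Note how the small eigenvalue (0.1) is the slow one: truncating at small $t$ leaves it poorly inverted, which is exactly the regularizing effect that makes truncation behave like adding $\lambda_n I$.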

Early Stopping

Theorem [Caponnetto, Yao, R. ’07]
Choose $t_n$ such that $t_n \to \infty$ and $t_n/n \to 0$ for $n \to \infty$. Set $w = w_{t_n}$. Then under Assumption 0, it holds
$$P\big(\lim_{n \to \infty} \mathcal{E}(w) = \mathcal{E}(w^\dagger)\big) = 1$$

Theorem [Bauer, Pereverzev, R. ’07; Caponnetto, Yao ’10]
Set $w = w_t$. Then under Assumptions 0 and 1, it holds
$$\mathcal{E}(w) - \mathcal{E}(w^\dagger) \le c\,\tau\, n^{-\frac{2rb}{2rb+1}}$$
with probability at least $1 - e^{-\tau}$, $\tau \in [0, \infty)$.

Related results: GD, aka L2Boosting [Bühlmann, Yu ’02; Yao, R., Caponnetto ’05; Bauer, Pereverzev, R. ’07; Caponnetto, Yao ’10; Raskutti et al. ’13]

One Observation

[Figure: empirical error and validation/test error as functions of the iteration number t (log scale). The empirical error decreases monotonically in t, while the test error first decreases and then increases; its minimum gives the stopping time.]

Bias + Variance + Computations

[Diagram: computations vs. bias and variance.]

Better Complexity

Parametric complexity: $O(np^2\, \#\Lambda) + O(mp\, \#\Lambda)$ improves to $O(np\, \#\Lambda)$.

Nonparametric complexity: $O(n^3\, \#\Lambda) + O(mn\, \#\Lambda)$ improves to $O(n^2\, \#\Lambda)$.

Few Openish Questions

✴ What's the best we can do?
✴ What about other (iterative) schemes?
  • Accelerated GD, aka ν-method [Bauer, Pereverzev, R. ’07; Caponnetto, Yao ’10]
  • Conjugate Gradient (CG), aka Partial Least Squares [Blanchard, Krämer ’09]
  • Nesterov's method?
  • others?
✴ Stochastic vs Incremental?
✴ Other loss functions?
✴ Other regularization?

Yet Another Iteration: Stochastic Gradient

GD:
$w_0 = 0$,
for $k = 1 : t-1$
  $w_k = w_{k-1} - \dfrac{\gamma}{n}\, X_n^T (X_n w_{k-1} - Y_n)$

SGD, aka Robbins-Monro (lower iteration cost, $O(p)$):
$w_0 = 0$,
for $k = 1 : n$
  $w_k = w_{k-1} - \gamma\, x_k (x_k^T w_{k-1} - y_k)$
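The one-pass SGD recursion above, as a sketch (the constant step size is one of several choices discussed below):

```python
import numpy as np

def sgd_one_pass(X, Y, gamma):
    """One pass of SGD: each of the n examples is used exactly once, O(p) per step."""
    n, p = X.shape
    w = np.zeros(p)
    for k in range(n):
        w = w - gamma * X[k] * (X[k] @ w - Y[k])
    return w
```

Here the sample size $n$ itself plays the role of the number of iterations, so "how many examples" and "how many steps" coincide by construction.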

SGD Flavors

Varying step size; averaging; penalized SGD.

Averaging:
$$\bar w = \frac{1}{n} \sum_{k=0}^{n-1} w_k$$

Penalized SGD:
$w_0 = 0$,
for $k = 1 : n$
  $w_k = w_{k-1} - \gamma \big( x_k (x_k^T w_{k-1} - y_k) + \lambda w_{k-1} \big)$

• Varying step size and penalized SGD: only results in expectation.
• 2 parameters to cross-validate [Smale, Yao ’05; Tarres, Yao ’07],
• step-size cross validation [Ying, Pontil ’05; Zhang ’04]; constant step size for finite dimensions [Bach, Moulines ’13].
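The averaging variant can be sketched by accumulating the iterates $w_0, \ldots, w_{n-1}$ during the pass; a minimal illustration with a constant step size (an assumption for simplicity, since the references above also treat decaying steps):

```python
import numpy as np

def averaged_sgd(X, Y, gamma):
    """One pass of SGD, returning the average (1/n) sum_{k=0}^{n-1} w_k of the iterates."""
    n, p = X.shape
    w = np.zeros(p)
    w_sum = np.zeros(p)
    for k in range(n):
        w_sum += w  # accumulate w_0, ..., w_{n-1} before each update
        w = w - gamma * X[k] * (X[k] @ w - Y[k])
    return w_sum / n
```

Averaging damps the oscillation of the last iterate around the solution, at the price of mixing in the early, far-from-solution iterates.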

Does Anybody Use These Methods?

• $\gamma$ small:
  – the generalization error keeps decreasing after the first pass on the data;
  – overfitting eventually occurs.
• $\gamma$ big: more iterations are needed.

SGD Flavors (cont.)

Multiple Epochs SGD, aka IGD:
$w_0 = 0$,
for $j = 1 : t-1$
  $v_0 = w_{j-1}$,
  for $k = 1 : n$
    $v_k = v_{k-1} - \dfrac{\gamma}{n}\, x_k (x_k^T v_{k-1} - y_k)$
  end
  $w_j = v_n$

I have all the n examples, I can process them more than once… should I?
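The multiple-epochs recursion above can be sketched directly; with step $\gamma/n$ per example, one epoch roughly matches one full gradient step, so the number of epochs plays the role of the regularization parameter $t$ (names are illustrative):

```python
import numpy as np

def incremental_gd(X, Y, epochs, gamma):
    """Multiple-epochs incremental gradient: cycle through the n examples `epochs` times."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(epochs):
        v = w
        for k in range(n):
            v = v - (gamma / n) * X[k] * (X[k] @ v - Y[k])
        w = v  # w_j = v_n: the epoch's final iterate
    return w
```

Early stopping then means choosing the number of epochs, exactly as $t$ was chosen for batch gradient descent.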

Early Stopping

Theorem [R., Tacchetti, Villa ’14]
Choose $t_n$ such that $t_n \to \infty$ and $t_n/\sqrt{n} \to 0$ for $n \to \infty$. Set $w = w_{t_n}$. Then under Assumption 0, it holds
$$P\big(\lim_{n \to \infty} \mathcal{E}(w) = \mathcal{E}(w^\dagger)\big) = 1$$

Theorem [R., Tacchetti, Villa ’14]
Set $w = w_t$. Then under Assumptions 0 and 1, it holds
$$\|w - w^\dagger\|^2 \le c\,\tau\, n^{-\frac{s}{s+1}}$$
with probability at least $1 - e^{-\tau}$, $\tau \in [0, \infty)$.

✴ Still missing: results with the whole Assumption 1.

Few more questions

• Results suggest the same statistical/numerical complexity as GD:
  Parametric complexity: $O(np\, \#\Lambda) + O(mp\, \#\Lambda)$
  Nonparametric complexity: $O(n^2\, \#\Lambda) + O(mn\, \#\Lambda)$

I have all the n examples, I can process them more than once… should I? Yes! (or cross-validate the step size [Dieuleveut, Bach ’14])

Few Openish Questions

✴ Optimal numerical complexity in a statistical minimax class?
✴ Linear Nonparametrics?
✴ What about other (iterative) schemes?
  • Accelerated GD, aka ν-method [Bauer, Pereverzev, R. ’07; Caponnetto, Yao ’10]
  • Conjugate Gradient (CG), aka Partial Least Squares [Blanchard, Krämer ’09]
  • Nesterov's method?
  • others?
✴ Stochastic vs Incremental?
✴ Other loss functions?
✴ Other regularization?

What about other loss functions?

Nemitski Loss: $V : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ such that
• $V(y, \cdot)$ is convex for all $y \in \mathbb{R}$;
• for $p \in [1, \infty)$, $V(y, \alpha) \le a(y) + b\,|\alpha|^p$ for all $\alpha, y \in \mathbb{R}$,
where $b \in [0, \infty)$ and $a : \mathbb{R} \to [0, \infty)$ with $\mathbb{E}\, a(y) < \infty$.

$$\min_{w \in X_0} \|w\|^2, \qquad X_0 = \operatorname*{argmin}_X \mathcal{E}(w), \qquad \mathcal{E}(w) = \mathbb{E}\, V(y, w^T x)$$

Early Stopping with Subgradient [Lin, R., Zhou ’14]
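For non-smooth losses, gradient descent is replaced by the subgradient method, with the iteration count again acting as the regularization parameter. A sketch for the absolute loss $V(y, \alpha) = |y - \alpha|$, one concrete Nemitski loss; the constant step size is a simplification, not necessarily the choice analyzed in the reference:

```python
import numpy as np

def subgradient_method(X, Y, t, gamma):
    """t steps of subgradient descent on the empirical risk (1/n) sum_i |y_i - w.x_i|."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(t):
        r = X @ w - Y
        g = X.T @ np.sign(r) / n  # a subgradient of the empirical absolute-loss risk
        w = w - gamma * g
    return w
```

The structure is identical to the least-squares case; only the (sub)gradient of the loss changes.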

What about Other Regularizers?

$$\min_{w \in X_0} R(w), \qquad X_0 = \operatorname*{argmin}_X \mathcal{E}(w), \qquad \mathcal{E}(w) = \mathbb{E}|y - w^T x|^2$$

Convex regularization: $R : X \to \mathbb{R}$ a proper, lower semicontinuous, convex functional.

[R., Villa, Vu ’14]

A New Paradigm for Learning Algorithm Design?

$w_t$: Bias/Optimization $\leftrightarrow$ Variance/Stability

$$\min_{w \in X_0} R(w), \qquad X_0 = \operatorname*{argmin}_X \mathcal{E}(w), \qquad \mathcal{E}(w) = \mathbb{E}\, V(y, w^T x)$$

$$\min_{w \in \mathbb{R}^p} \mathbb{E}\, V(y, w^T x) + \lambda R(w) \qquad\qquad \min_{w \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^n V(y_i, w^T x_i) + \lambda R(w)$$

Lots of questions

✴ Avoid data cross validation/splitting?
✴ Early stopping for convex loss, beyond subgradient?
✴ Early stopping for convex regularization…
✴ Problems: from learning to more general stochastic optimization or inverse problems (stochastic or not), robust optimization…
✴ Approaches: coordinate descent, distributed approaches…