TRANSCRIPT
Early Stopping for Computational Learning
Lorenzo Rosasco, Università di Genova, Massachusetts Institute of Technology, Istituto Italiano di Tecnologia
CBMM
Sestri Levante, September, 2014
joint work with A. Tacchetti and S. Villa (IIT); also B. C. Vu + S. Villa (IIT), and J. Lin, D. X. Zhou (CityU)
Image classification (not processing)
$$X_n = \begin{pmatrix} x_{11} & \cdots & x_{p1} \\ \vdots & & \vdots \\ x_{1n} & \cdots & x_{pn} \end{pmatrix} \qquad Y_n = \begin{pmatrix} +1 \\ -1 \\ \vdots \end{pmatrix}$$
n = millions, p = hundreds of thousands
Learning Theory and Optimization: One Remark
Learning is stochastic optimization…
…yet, there is a divide between statistical and optimization analysis!
This is not the learning problem!
Given $(x_1, y_1), \dots, (x_n, y_n)$, consider
$$\min_{w \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n V(y_i, w^T x_i) + \lambda R(w)$$
This is not the learning problem!
Given $P(x, y)$, consider
$$\min_{w \in \mathbb{R}^p} \mathbb{E}\,V(y, w^T x) + \lambda R(w)$$
Plan
• Bias + Variance + Computations
• Name Game: Problems, Estimators and Algorithms
• Iterative Algorithms and Early Stopping
• Learning with Stochastic/Incremental Gradient
• Beyond Least Squares
The Problem
• $S = (x_i, y_i)_{i=1}^n \sim P(x, y)$,
• $x_i \in X = \mathbb{R}^p$, $p \geq 1$, $\|x_i\| \leq 1$, $|y_i| \leq 1$ almost surely.
$$w^\dagger = \arg\min_{w \in X_0} \|w\|^2, \qquad X_0 = \arg\min_X \mathcal{E}(w), \qquad \mathcal{E}(w) = \mathbb{E}|y - w^T x|^2$$
Assumption 0
Error Measures
Risk: $P(\mathcal{E}(w) - \mathcal{E}(w^\dagger) \geq \epsilon)$
Parameters: $P(\|w - w^\dagger\|^2 \geq \epsilon)$
Assumption 1
Let $C = \mathbb{E}[xx^T]$ and $C v_j = \alpha_j v_j$, $j = 1, \dots$.
• Source Condition: $\displaystyle\sum_j \frac{|v_j^T w^\dagger|^2}{\alpha_j^{2s}} < \infty$, $s \in (0, \infty)$
• Effective Dimension: $\alpha_j \sim j^{-b}$, $b \in [1, \infty]$
An Estimator
Bias + Variance
Ill-Posedness
All we have is $S = (x_i, y_i)_{i=1}^n \sim P(x, y)$.
$$w_\lambda = \arg\min_{w \in X} \frac{1}{n}\sum_{i=1}^n (y_i - w^T x_i)^2 + \lambda\, w^T w$$
Convergence
Theorem […]
Choose $\lambda_n$ such that $\lambda_n \to 0$ and $1/(\lambda_n n) \to 0$ for $n \to \infty$.
Set $w = w_{\lambda_n}$. Then under Assumption 0, it holds
$$P\Big(\lim_{n \to \infty} \mathcal{E}(w) = \mathcal{E}(w^\dagger)\Big) = 1$$
Analogous results for parameter estimation. Not a practical result.
Validation
Cross Validation
A Posteriori Parameter Choice
What about $\lambda$?
Split the data into a training set $S$ and a validation set $S'$ of size $m$.
$$\lambda = \arg\min_{\lambda \in \Lambda_n} \mathcal{E}'(\lambda), \qquad \mathcal{E}'(\lambda) = \frac{1}{m}\sum_{k=1}^m (y'_k - w_\lambda^T x'_k)^2$$
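As an illustration, a minimal numerical sketch of this a posteriori choice; the synthetic data model, the grid, and all sizes below are invented for illustration, not the talk's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 200, 100, 20                      # train size, validation size, dimension
w_true = rng.standard_normal(p) / np.sqrt(p)

# Training set S and validation set S'
X = rng.standard_normal((n, p))
Xv = rng.standard_normal((m, p))
y = X @ w_true + 0.1 * rng.standard_normal(n)
yv = Xv @ w_true + 0.1 * rng.standard_normal(m)

def ridge(lam):
    """w_lambda = argmin (1/n)||Xw - y||^2 + lam ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * n * np.eye(p), X.T @ y)

lambdas = np.logspace(-6, 0, 13)            # candidate grid Lambda_n
val_err = [np.mean((yv - Xv @ ridge(lam)) ** 2) for lam in lambdas]
lam_star = lambdas[int(np.argmin(val_err))] # lambda = argmin E'(lambda)
```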
Adaptive rates
Theorem
Set $w = w_\lambda$. Then under Assumptions 0 and 1, with $r = s + 1/2$, it holds
$$\mathcal{E}(w) - \mathcal{E}(w^\dagger) \leq c\,\tau\, n^{-\frac{2rb}{2rb+1}}$$
with probability at least $1 - e^{-\tau}$, $\tau \in [0, \infty)$.
[Caponnetto, De Vito, R., Smale, Zhou 2005-10]
The above result is optimal in a minimax sense:
$$\inf_w \sup_{w^\dagger \in \Theta_{b,s}} \mathbb{E}[\mathcal{E}(w) - \mathcal{E}(w^\dagger)]$$
The above result describes what is done in practice.
Proof Approach
$$\mathbb{E}(y - w^T x)^2 \quad\longleftrightarrow\quad \frac{1}{n}\|X_n w - Y_n\|^2$$
$$Cw = g; \qquad C = \mathbb{E}\Big[\tfrac{1}{n} X_n^T X_n\Big]; \qquad g = \mathbb{E}[xy]$$
$$\hat{C}w = \hat{g}; \qquad \hat{C} = \frac{1}{n} X_n^T X_n; \qquad \hat{g} = \frac{1}{n}\sum_{i=1}^n x_i y_i$$
$$\|\hat{C} - C\| \quad\longrightarrow\quad \text{Concentration Inequalities}$$
Separate analysis (spectral calculus) and probability.
[Caponnetto, De Vito, R., Smale, Zhou 2005-10]
Algorithms?
What’s missing?
How do approximate computations affect our error estimates?
What's the cost of computing an estimator?
Computations!
Learning Theory ~ Stochastic Optimization
so far: IBC, Nemirovski-Yudin oracle model
Computations
$$w_\lambda = \arg\min_{w \in X} \frac{1}{n}\sum_{i=1}^n (y_i - w^T x_i)^2 + \lambda\, w^T w$$
Parametric: $w_\lambda = (X_n^T X_n + \lambda n I)^{-1} X_n^T Y_n$
Complexity: $O(np^2 \cdot \#\lambda) + O(mp \cdot \#\lambda)$
Nonparametric: $x^T w = x^T X_n^T c$, $\quad c_\lambda = (X_n X_n^T + \lambda n I)^{-1} Y_n$
Complexity: $O(n^3 \cdot \#\lambda) + O(mn \cdot \#\lambda)$
Can we do better than this?
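The parametric and nonparametric formulas compute the same estimator. A quick sketch (synthetic data, illustrative sizes) checking the identity $(X_n^T X_n + \lambda n I)^{-1} X_n^T Y_n = X_n^T (X_n X_n^T + \lambda n I)^{-1} Y_n$ numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 50, 8, 0.1
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

# Parametric (primal) form: solve a p x p system, cost ~ O(n p^2)
w_primal = np.linalg.solve(X.T @ X + lam * n * np.eye(p), X.T @ Y)

# Nonparametric (dual) form: solve an n x n system, cost ~ O(n^3)
c = np.linalg.solve(X @ X.T + lam * n * np.eye(n), Y)
w_dual = X.T @ c

# Both routes give the same w_lambda
assert np.allclose(w_primal, w_dual)
```

Which form is cheaper depends on whether $p < n$ (primal) or $n < p$ (dual/kernel).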
gradient descent…
$w_0 = 0$,
for $k = 1 : t$
$$w_k = w_{k-1} - \frac{\gamma}{n} X_n^T (X_n w_{k-1} - Y_n)$$
$$w_t = \frac{\gamma}{n} \sum_{j=0}^{t-1} \Big(I - \frac{\gamma}{n} X_n^T X_n\Big)^j X_n^T Y_n$$
[Landweber ’50]
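A numerical sanity check (illustrative sizes and step size) that the gradient descent iterate coincides with the truncated Landweber expansion above:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, gamma, t = 50, 8, 0.5, 30
X = rng.standard_normal((n, p))
Y = rng.standard_normal(n)

# Gradient descent on the least-squares objective, started at 0
w = np.zeros(p)
for _ in range(t):
    w = w - (gamma / n) * X.T @ (X @ w - Y)

# Unrolled form: w_t = (gamma/n) sum_{j=0}^{t-1} (I - (gamma/n) X^T X)^j X^T Y
A = np.eye(p) - (gamma / n) * (X.T @ X)
w_unrolled = np.zeros(p)
Aj = np.eye(p)                      # holds A^j
for _ in range(t):
    w_unrolled += (gamma / n) * Aj @ (X.T @ Y)
    Aj = Aj @ A

assert np.allclose(w, w_unrolled)
```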
Interlude: Neumann Series
$$(X_n^T X_n)^{-1} \sim (X_n^T X_n + \lambda n I)^{-1}$$
$$c^{-1} = \sum_{j=0}^{\infty} (1-c)^j, \qquad c^{-1} \approx \sum_{j=0}^{t-1} (1-c)^j$$
$$(X_n^T X_n)^{-1} = \sum_{j=0}^{\infty} (I - X_n^T X_n)^j, \qquad (X_n^T X_n)^{-1} \approx \sum_{j=0}^{t-1} (I - X_n^T X_n)^j$$
…also gradient descent
Early Stopping
Theorem [Caponnetto, Yao, R. ’07]
Choose $t_n$ such that $t_n \to \infty$ and $t_n/n \to 0$ for $n \to \infty$.
Set $w = w_{t_n}$. Then under Assumption 0, it holds
$$P\Big(\lim_{n \to \infty} \mathcal{E}(w) = \mathcal{E}(w^\dagger)\Big) = 1$$
Theorem [Bauer, Pereverzev, R. ’07, Caponnetto, Yao ’10]
Set $w = w_t$. Then under Assumptions 0 and 1, it holds
$$\mathcal{E}(w) - \mathcal{E}(w^\dagger) \leq c\,\tau\, n^{-\frac{2rb}{2rb+1}}$$
with probability at least $1 - e^{-\tau}$, $\tau \in [0, \infty)$.
Related results: GD, aka L2-Boosting [Bühlmann, Yu ’02, Yao, R., Caponnetto ’05, Bauer, Pereverzev, R. ’07, Caponnetto, Yao ’10, Raskutti et al. ’13]
One Observation
[Figure: empirical error and validation/test error as a function of the iteration t (log scale). The empirical error decreases monotonically with t, while the test error first decreases and then increases.]
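The observation can be reproduced in a few lines; the setup below (sizes, noise level, step size) is an illustrative guess, not the talk's actual experiment:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, noise = 100, 80, 0.5           # high dimension + noise: overfitting regime
w_true = rng.standard_normal(p) / np.sqrt(p)
X = rng.standard_normal((n, p))
Xv = rng.standard_normal((n, p))
y = X @ w_true + noise * rng.standard_normal(n)
yv = Xv @ w_true + noise * rng.standard_normal(n)

gamma, T = 0.2, 500
w = np.zeros(p)
emp_err, val_err = [], []
for _ in range(T):
    w = w - (gamma / n) * X.T @ (X @ w - y)
    emp_err.append(np.mean((X @ w - y) ** 2))   # decreases monotonically in t
    val_err.append(np.mean((Xv @ w - yv) ** 2)) # U-shaped in t

t_star = int(np.argmin(val_err))     # early stopping: t plays the role of lambda
```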
Bias + Variance + Computations
Computations
Bias Variance
Better Complexity
Parametric complexity: $O(np^2 \cdot \#\lambda) + O(mp \cdot \#\lambda) \;\to\; O(np \cdot \#\lambda)$
Nonparametric complexity: $O(n^3 \cdot \#\lambda) + O(mn \cdot \#\lambda) \;\to\; O(n^2 \cdot \#\lambda)$
Few Openish Questions
✴ What's the best we can do?
✴ What about other (iterative) schemes?
  • Accelerated GD, aka ν-method [Bauer, Pereverzev, R. ’07, Caponnetto, Yao ’10]
  • Conjugate Gradient (CG), aka Partial Least Squares [Blanchard, Krämer ’09]
  • Nesterov method?
  • others?
✴ Stochastic vs Incremental?
✴ Other loss functions?
✴ Other regularization?
Yet Another Iteration: Stochastic Gradient
GD
$w_0 = 0$,
for $k = 1 : t$
$$w_k = w_{k-1} - \frac{\gamma}{n} X_n^T (X_n w_{k-1} - Y_n)$$
SGD, aka Robbins-Monro (lower iteration cost: $O(p)$)
$w_0 = 0$,
for $k = 1 : n$
$$w_k = w_{k-1} - \gamma\, x_k (x_k^T w_{k-1} - y_k)$$
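A single pass of the SGD recursion, in a hypothetical synthetic setup scaled so that $\|x_k\| \approx 1$, as in the assumptions above:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 2000, 10
w_true = rng.standard_normal(p) / np.sqrt(p)
X = rng.standard_normal((n, p)) / np.sqrt(p)   # rows scaled so ||x_k|| ~ 1
y = X @ w_true + 0.1 * rng.standard_normal(n)

gamma = 0.5
w = np.zeros(p)
for k in range(n):                 # one pass: each example used once, O(p) per step
    xk, yk = X[k], y[k]
    w = w - gamma * xk * (xk @ w - yk)
```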
SGD Flavors
Varying Step-Size / Penalized SGD
$w_0 = 0$,
for $k = 1 : n$
$$w_k = w_{k-1} - \gamma_k \big( x_k (x_k^T w_{k-1} - y_k) + \lambda w_{k-1} \big)$$
Averaging: $\displaystyle \bar{w} = \frac{1}{n}\sum_{k=0}^{n-1} w_k$
• Varying step-size / penalized SGD: only results in expectation.
• 2 parameters to cross-validate [Smale, Yao ’05 and Tarres, Yao ’07],
• step-size cross validation [Ying, Pontil ’05, Zhang ’04], constant step-size for finite dimensions [Bach, Moulines ’13].
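A sketch combining the two flavors (decaying step size, ridge penalty, averaged iterate); the step-size schedule and λ below are illustrative guesses, not values from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam = 2000, 10, 1e-3
w_true = rng.standard_normal(p) / np.sqrt(p)
X = rng.standard_normal((n, p)) / np.sqrt(p)
y = X @ w_true + 0.1 * rng.standard_normal(n)

w = np.zeros(p)
w_sum = np.zeros(p)
for k in range(n):
    gamma_k = 1.0 / np.sqrt(k + 1)            # varying (decaying) step size
    xk, yk = X[k], y[k]
    # penalized step: stochastic gradient of the lam-regularized risk
    w = w - gamma_k * (xk * (xk @ w - yk) + lam * w)
    w_sum += w
w_ave = w_sum / n                              # averaged iterate
```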
Does Anybody Use These Methods?
• $\gamma$ small:
  – generalization error keeps decreasing after the first pass on the data
  – overfitting eventually occurs
• $\gamma$ big: more iterations are needed.
SGD Flavors (cont.)
Multiple Epochs SGD, aka IGD
$w^0 = 0$,
for $j = 1 : t$
  $v_0 = w^{j-1}$
  for $k = 1 : n$
$$v_k = v_{k-1} - \frac{\gamma}{n}\, x_k (x_k^T v_{k-1} - y_k)$$
  end
  $w^j = v_n$
I have all the n examples, I can process them more than once…
…should I?
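The multi-epoch scheme can be sketched directly from the pseudocode above, on hypothetical synthetic data (sizes and step size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 10
w_true = rng.standard_normal(p) / np.sqrt(p)
X = rng.standard_normal((n, p)) / np.sqrt(p)
y = X @ w_true + 0.1 * rng.standard_normal(n)

gamma, t = 1.0, 50
w = np.zeros(p)
dist = []
for epoch in range(t):             # t passes over the same n examples
    v = w.copy()
    for k in range(n):             # one incremental sweep, step gamma/n per example
        xk, yk = X[k], y[k]
        v = v - (gamma / n) * xk * (xk @ v - yk)
    w = v
    dist.append(np.linalg.norm(w - w_true))
```

Here the number of epochs t is the regularization parameter: stopping early limits variance, running longer reduces bias.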
Early Stopping
Theorem [R., Tacchetti, Villa ’14]
Choose $t_n$ such that $t_n \to \infty$ and $t_n/\sqrt{n} \to 0$ for $n \to \infty$.
Set $w = w_{t_n}$. Then under Assumption 0, it holds
$$P\Big(\lim_{n \to \infty} \mathcal{E}(w) = \mathcal{E}(w^\dagger)\Big) = 1$$
Theorem [R., Tacchetti, Villa ’14]
Set $w = w_t$. Then under Assumptions 0 and 1, it holds
$$\|w - w^*\|^2 \leq c\,\tau\, n^{-\frac{s}{s+1}}$$
with probability at least $1 - e^{-\tau}$, $\tau \in [0, \infty)$.
✴ Still missing: results with the whole Assumption 1.
Few more questions
• Results suggest the same statistical/numerical complexity as GD:
Parametric complexity: $O(np \cdot \#\lambda) + O(mp \cdot \#\lambda)$
Nonparametric complexity: $O(n^2 \cdot \#\lambda) + O(mn \cdot \#\lambda)$
I have all the n examples, I can process them more than once…
…should I? Yes! (or cross validate the step size [Dieuleveut, Bach ’14])
Few Openish Questions
✴ Optimal numerical complexity in a statistical minimax class?
✴ Linear Nonparametrics?
✴ What about other (iterative) schemes?
  • Accelerated GD, aka ν-method [Bauer, Pereverzev, R. ’07, Caponnetto, Yao ’10]
  • Conjugate Gradient (CG), aka Partial Least Squares [Blanchard, Krämer ’09]
  • Nesterov method?
  • others?
✴ Stochastic vs Incremental?
✴ Other loss functions?
✴ Other regularization?
What about other loss functions?
Nemitski Loss: $V : \mathbb{R} \times \mathbb{R} \to [0, \infty)$ such that
• $V(y, \cdot)$ is convex for all $y \in \mathbb{R}$;
• for $p \in [1, \infty)$, $V(y, \alpha) \leq a(y) + b|\alpha|^p$, $\forall \alpha, y \in \mathbb{R}$,
where $b \in [0, \infty)$ and $a : \mathbb{R} \to [0, \infty)$ with $\mathbb{E}\,a(y) < \infty$.
$$\min_{w \in X_0} \|w\|^2, \qquad X_0 = \arg\min_X \mathcal{E}(w), \qquad \mathcal{E}(w) = \mathbb{E}\,V(y, w^T x)$$
Early Stopping with Subgradient [Lin, R., Zhou ’14]
What about Other Regularizers?
$$\min_{w \in X_0} R(w), \qquad X_0 = \arg\min_X \mathcal{E}(w), \qquad \mathcal{E}(w) = \mathbb{E}|y - w^T x|^2$$
Convex Regularization: $R : X \to \mathbb{R}$, proper, l.s.c. functional.
[R., Villa, Vu ’14]
A New Paradigm for Learning Algorithm Design?
$w_t$: Bias/Optimization vs. Variance/Stability
$$\min_{w \in X_0} R(w), \qquad X_0 = \arg\min_X \mathcal{E}(w), \qquad \mathcal{E}(w) = \mathbb{E}\,V(y, w^T x)$$
$$\min_{w \in \mathbb{R}^p} \mathbb{E}\,V(y, w^T x) + \lambda R(w) \qquad\qquad \min_{w \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n V(y_i, w^T x_i) + \lambda R(w)$$
Lots of questions
✴ Avoid data cross validation/splitting?
✴ Early stopping for convex loss, beyond subgradient?
✴ Early stopping for convex regularization…
✴ Problems: from learning to more general stochastic optimization or inverse problems (stochastic or not), robust optimization…
✴ Approaches: coordinate descent, distributed approaches…