TRANSCRIPT
Machine Learning (CSE 446): Practical issues: optimization and learning
Sham M Kakade © 2018
University of Washington, [email protected]
Announcements
- Midterm summary:
  - stats: 71.5, std: 18
- Office hours today: 1:15-2:30 (No office hours on Monday)
- Monday: John Thickstun guest lecture
- Grading:
  - HW3 posted
    - will be periodically updated for typos/clarifications
  - extra credit posted soon
- Today:
  - Midterm review
  - GD/SGD: practical issues
Midterm
What is a good model of this distribution?
“A mixture of Gaussians”
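As a quick illustration of that answer (my own sketch, not part of the slide), here is the density and a sampler for a two-component 1-D Gaussian mixture; the weights, means, and standard deviations below are arbitrary placeholders.

import numpy as np

def mixture_density(x, weights=(0.5, 0.5), means=(-2.0, 2.0), stds=(1.0, 1.0)):
    """Evaluate p(x) = sum_j w_j * N(x; mu_j, sigma_j^2)."""
    x = np.asarray(x, dtype=float)
    p = np.zeros_like(x)
    for w, mu, s in zip(weights, means, stds):
        p += w * np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return p

def sample_mixture(n, weights=(0.5, 0.5), means=(-2.0, 2.0), stds=(1.0, 1.0), seed=0):
    """Draw n samples: pick a component at random, then sample from its Gaussian."""
    rng = np.random.default_rng(seed)
    comps = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(np.asarray(means)[comps], np.asarray(stds)[comps])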
Midterm Q4: scratch space
Midterm Q5: scratch space
Today
The “general” Loss Minimization Problem
w^* = \arg\min_w \frac{1}{N} \sum_{n=1}^{N} \underbrace{\ell(x_n, y_n, w)}_{\ell_n(w)} + R(w)
How do we run GD? SGD? Which one to use?
How do we run them?
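As a reference point for these questions, a minimal GD loop might look as follows (a sketch; grad_F, the step size eta, and the iteration count K are assumed inputs, with grad_F computing the full gradient of the regularized average loss):

import numpy as np

def gradient_descent(grad_F, w0, eta, K):
    """Plain gradient descent: repeat w <- w - eta * grad F(w) for K iterations."""
    w = np.array(w0, dtype=float)
    for _ in range(K):
        w = w - eta * grad_F(w)
    return w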
Our running example
\arg\min_w \frac{1}{N} \sum_{n=1}^{N} \frac{1}{2} (y_n - w \cdot x_n)^2 + \frac{1}{2} \lambda \|w\|^2
- GD? SGD?
- Note we are computing an average. What is a crude way to estimate an average? (see the sketch below)
Will it converge?
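To make the GD-vs-SGD contrast concrete for this objective, here is a sketch of the exact full-average gradient next to the crude one-example estimate of that average (the array names X, an N x d matrix, and y, a length-N vector, are my own assumptions, not notation fixed on the slide):

import numpy as np

def full_gradient(w, X, y, lam):
    """Exact gradient: averages over all N examples, plus the ridge term."""
    residual = X @ w - y                          # shape (N,)
    return X.T @ residual / len(y) + lam * w

def stochastic_gradient(w, X, y, lam, rng):
    """Crude estimate of the average: use a single randomly chosen example."""
    n = rng.integers(len(y))                      # e.g. rng = np.random.default_rng(0)
    return (X[n] @ w - y[n]) * X[n] + lam * w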
How does GD behave? A 1-dim example
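The slide leaves space for a worked example, so the following is my own 1-D illustration: for F(w) = (1/2) a w^2, the GD update is w <- (1 - eta*a) w, which shrinks toward 0 when eta < 2/a and blows up when eta > 2/a.

def gd_1d(a=2.0, w0=1.0, eta=0.1, K=20):
    """Run GD on F(w) = 0.5 * a * w^2 (gradient is a*w) and return the iterates."""
    w = w0
    trajectory = [w]
    for _ in range(K):
        w = w - eta * a * w
        trajectory.append(w)
    return trajectory

# e.g. gd_1d(eta=0.1) shrinks toward 0; gd_1d(eta=1.1) diverges (since 2/a = 1.0 here).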
GD: How do we set the step sizes?
- Theory:
  - square loss: (one concrete choice is sketched after this list)
  - more generally:
- Practice:
  - square loss:
  - more generally:
- Do we decay the step size?
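One theory-style choice for the square-loss case (a sketch of a standard answer from convex optimization, not necessarily the answer intended on the slide): the regularized square-loss objective has constant Hessian H = (1/N) X^T X + lambda*I, and the constant step size eta = 1/lambda_max(H) is safe for GD.

import numpy as np

def safe_step_size(X, lam):
    """Return 1 / lambda_max(H) for H = (1/N) X^T X + lam * I (assumed setup)."""
    H = X.T @ X / X.shape[0] + lam * np.eye(X.shape[1])
    return 1.0 / np.linalg.eigvalsh(H).max()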
SGD for the square loss
Data: step sizes ⟨η^(1), ..., η^(K)⟩
Result: parameter w
initialize: w^(0) = 0
for k ∈ {1, ..., K} do
    n ∼ Uniform({1, ..., N})
    w^(k) = w^(k−1) + η^(k) (y_n − w^(k−1) · x_n) x_n
end
return w^(K)
Algorithm 1: SGD
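A direct Python rendering of Algorithm 1 (a sketch; the data layout, X as an N x d array and y as a length-N array, is my own assumption):

import numpy as np

def sgd_square_loss(X, y, etas, seed=0):
    """SGD for the (unregularized) square loss, one random example per step."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                      # w^(0) = 0
    for eta in etas:                              # etas = <eta^(1), ..., eta^(K)>
        n = rng.integers(len(y))                  # n ~ Uniform({1, ..., N})
        w = w + eta * (y[n] - w @ X[n]) * X[n]    # w^(k) = w^(k-1) + eta^(k) (y_n - w^(k-1)·x_n) x_n
    return w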
- where did the N go?
- regularization?
- minibatching?
SGD: How do we set the step sizes?
- Theory:
- Practice:
  - How do we start it?
  - When do we decay it? (one common schedule is sketched below)
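One common practical schedule (a sketch; the initial value and decay point are knobs to tune, not values given on the slide) is to hold the step size constant at first and then decay it roughly like 1/sqrt(k):

def step_size(k, eta0=0.1, decay_after=1000):
    """eta^(k): constant eta0 for the first `decay_after` steps, then ~ eta0 / sqrt(k)."""
    if k <= decay_after:
        return eta0
    return eta0 * (decay_after / k) ** 0.5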
Stochastic Gradient Descent: Convergence
w^* = \arg\min_w \frac{1}{N} \sum_{n=1}^{N} \ell_n(w)
- w^(k): our parameter after k updates.
- Thm: Suppose ℓ(·) is convex (and satisfies mild regularity conditions). There is a decreasing sequence of step sizes η^(k) so that our function value, F(w^(k)), converges to the minimal function value, F(w^*).
- GD vs. SGD: we need to turn down our step sizes over time! (classical sufficient conditions on the decay are noted below)
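For reference, a standard sufficient condition on the decreasing step sizes from the SGD literature (not stated explicitly on the slide) is:

% Robbins-Monro style step-size conditions; satisfied, e.g., by \eta^{(k)} = \eta_0 / k
\sum_{k=1}^{\infty} \eta^{(k)} = \infty
\qquad \text{and} \qquad
\sum_{k=1}^{\infty} \bigl(\eta^{(k)}\bigr)^2 < \infty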
Making features: scratch space