Early Stopping is Nonparametric Variational Inference
Dougal Maclaurin, David Duvenaud, Ryan Adams
Good ideas always have Bayesian interpretations
Regularization = MAP inference
Limiting model capacity = Bayesian Occam's razor
Cross-validation = Estimating marginal likelihood
Dropout = Integrating out spike-and-slab
Ensembling = Bayes model averaging?
Early stopping = ??
Gradient descent as a sampler
• Optimization paths start from random init, and move towards modes...
Implicit Distributions
• What about the implicit distribution of parameters after optimizing for t steps?
• Starts as a bad approximation (prior dist)
• Ends as a bad approximation (point mass)
• Ensembling = taking multiple samples from dist (see the sketch below)
• Early stopping = choosing best intermediate dist
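As a rough illustration (not from the talk), the sketch below treats q_t operationally: a sample from q_t is whatever parameter vector you reach after t gradient steps from a fresh random initialization, so running several independent chains gives several samples. The quadratic toy loss, step size, and names such as grad_loss and sample_q_t are invented for the example.

```python
# Hypothetical sketch: a "sample" from the implicit distribution q_t is the
# parameter vector reached after t gradient steps from an independent random init.
import numpy as np

rng = np.random.default_rng(0)
D, alpha, sigma0 = 2, 0.1, 1.0          # toy dimension, step size, init scale

def grad_loss(theta):
    # Gradient of a toy quadratic loss whose mode sits at (1, -1).
    return theta - np.array([1.0, -1.0])

def sample_q_t(t, n_samples=1000):
    """Draw n_samples from q_t: initialize from q_0, then take t gradient steps."""
    thetas = sigma0 * rng.normal(size=(n_samples, D))   # theta_0 ~ q_0
    for _ in range(t):
        thetas = thetas - alpha * grad_loss(thetas)
    return thetas

# q_0 is the broad initial distribution; q_50 has collapsed most of the way to a point mass.
print("std of q_0: ", sample_q_t(0).std(axis=0))
print("std of q_50:", sample_q_t(50).std(axis=0))
```

Averaging predictions over these samples is the ensembling interpretation above; picking the t whose q_t best approximates the posterior is early stopping.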
Cross-validation vs. marginal likelihood
• What if we could evaluate marginal likelihood of implicit distribution?
• Could choose all hypers to maximize marginal likelihood
• No need for cross-validation?
[Figure: train and test RMSE, and estimated marginal likelihood, vs. training iteration (0–500)]
Variational Lower Bound
log p(x) ≥ −E_q(θ)[−log p(θ, x)] − E_q(θ)[log q(θ)] = −E[q] + S[q]
(Energy E[q] = E_q(θ)[−log p(θ, x)]; Entropy S[q] = −E_q(θ)[log q(θ)])

Energy estimated from the optimized objective function (training loss is the NLL):
E_q(θ)[−log p(θ, x)] ≈ −log p(θ̂T, x)

Entropy estimated by tracking the change at each iteration:
−E_q(θ)[log q(θ)] ≈ S[q0] + Σ_{t=0}^{T−1} log |J(θ̂t)|
Using a single sample!
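For completeness, the inequality at the top of this slide is the standard variational bound; it follows in one line from Jensen's inequality (a textbook step, not specific to this talk):

```latex
% Variational lower bound on the marginal likelihood via Jensen's inequality.
\log p(x) = \log \int q(\theta)\,\frac{p(\theta, x)}{q(\theta)}\, d\theta
  \;\ge\; \int q(\theta)\,\log \frac{p(\theta, x)}{q(\theta)}\, d\theta
  = \mathbb{E}_{q(\theta)}\!\left[\log p(\theta, x)\right]
    - \mathbb{E}_{q(\theta)}\!\left[\log q(\theta)\right].
```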
Estimating change in entropy
• Intuitively: High curvature makes entropy decrease quickly
• Can measure local curvature with Hessian
• Approximation good for small step-sizes
Estimating change in entropy
Volume change given by the Jacobian of the optimizer's update operator:
S[qt+1] − S[qt] = E_{qt(θt)}[ log |J(θt)| ]

Gradient descent update rule:
θt+1 = θt − α∇L(θt)

has Jacobian:
J(θt) = I − α∇∇L(θt)

Entropy change estimated at a single sample:
S[qt+1] − S[qt] ≈ log |I − α∇∇L(θt)|
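A useful extra step (standard linear algebra, not written out on the slide): in terms of the eigenvalues λi of the Hessian ∇∇L(θt), the log-determinant expands as below. This is why the estimate behaves well for small step sizes, and it suggests how a low-order Taylor truncation, computable with Hessian-vector products, can replace the exact determinant, as the next slide's O(D) approximation does.

```latex
% Small-step expansion of the entropy change, assuming alpha * lambda_max < 1.
\log\bigl|I - \alpha\,\nabla\nabla L(\theta_t)\bigr|
  = \sum_{i=1}^{D} \log\bigl(1 - \alpha\lambda_i\bigr)
  = -\alpha \sum_{i=1}^{D} \lambda_i
    \;-\; \frac{\alpha^2}{2} \sum_{i=1}^{D} \lambda_i^2 \;-\; \cdots
  \;\approx\; -\alpha\,\mathrm{tr}\bigl(\nabla\nabla L(\theta_t)\bigr).
```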
Final algorithm
Stochastic gradient descent
1: input: weight init scale σ0, step size α, negative log-likelihood L(θ, t)
2: initialize θ0 ∼ N(0, σ0 I_D)
3: for t = 1 to T do
4:   θt = θt−1 − α∇L(θt−1, t)
5: output sample θT

SGD with entropy estimate
1: input: weight init scale σ0, step size α, negative log-likelihood L(θ, t)
2: initialize θ0 ∼ N(0, σ0 I_D)
3: initialize S0 = (D/2)(1 + log 2π) + D log σ0
4: for t = 1 to T do
5:   St = St−1 + log |I − α∇∇L(θt−1, t)|
6:   θt = θt−1 − α∇L(θt−1, t)
7: output sample θT, entropy estimate ST
• Approximate bound: log p(x) ≳ −L(θT) + ST (see the toy sketch below)
• Determinant is O(D³)
• O(D) Taylor approximation using Hessian-vector products
• Scales linearly in parameters and dataset size
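A minimal, self-contained sketch (not the authors' code) of the "SGD with entropy estimate" recipe, run on a made-up Bayesian linear-regression problem. The data, step size, and names such as neg_log_joint are invented for the example, and because D is tiny the log-determinant is computed exactly with numpy rather than with the O(D) Hessian-vector-product approximation mentioned above.

```python
# Toy run of SGD with a running entropy estimate and variational lower bound.
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y = X w + noise, with an N(0, I) prior on the weights.
N, D, noise_var = 100, 5, 0.25
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + np.sqrt(noise_var) * rng.normal(size=N)

def neg_log_joint(theta):
    """L(theta) = -log p(theta, x) up to constants: Gaussian likelihood + N(0, I) prior."""
    resid = y - X @ theta
    return 0.5 * resid @ resid / noise_var + 0.5 * theta @ theta

def grad(theta):
    return -X.T @ (y - X @ theta) / noise_var + theta

def hessian(theta):
    return X.T @ X / noise_var + np.eye(D)   # constant for this quadratic model

alpha, T, sigma0 = 1e-3, 500, 1.0
theta = sigma0 * rng.normal(size=D)                       # theta_0 ~ N(0, sigma0^2 I) = the prior
S = D / 2 * (1 + np.log(2 * np.pi)) + D * np.log(sigma0)  # entropy of q_0

bound = []
for t in range(T):
    # Entropy change of the implicit distribution under one gradient-descent step.
    S += np.linalg.slogdet(np.eye(D) - alpha * hessian(theta))[1]
    theta = theta - alpha * grad(theta)
    # Running variational lower bound on log p(x): -energy + entropy.
    bound.append(-neg_log_joint(theta) + S)

# The bound's maximum suggests when to stop, without a validation set.
print("suggested stopping iteration:", int(np.argmax(bound)))
```

The bound keeps falling after the training loss has converged because the entropy term keeps shrinking; that trade-off is exactly what the stopping criterion on the next slide exploits.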
Choosing when to stop
• Neural network on the Boston housing dataset.
• SGD marginal likelihood estimate gives stopping criterion without a validation set
[Figure: train and test RMSE, and estimated marginal likelihood, vs. training iteration (0–500)]
Choosing number of hidden units
• Neural net on 50000 MNIST examples
• Largest model has 2 million parameters
• Gives reasonable estimates, but cross-validation still better
[Figure: training and test predictive likelihood, and marginal likelihood, vs. number of hidden units (10 to 3000)]
Limitations
• SGD not even trying to maximize lower bound – good approximation is by accident!
• Entropy term gets arbitrarily bad due to concentration, but true performance only gets as bad as maximum likelihood estimate
Entropy-friendly optimization
• Modified SGD to move slower near convergence, optimized new hyperparameter
• Hurts performance, but gives tighter bound
• Ideally would match test likelihood
[Figure: true posterior and implicit distributions after 0, 150, and 300 iterations; training and test likelihood, and marginal likelihood, vs. gradient threshold]
Limitations
• Irrelevant parameters can cause low entropy estimate
• No momentum – would need to estimate distribution (see Kingma & Welling, 2015)
Main Takeaways
• Optimization with random restarts implies nonparametric intermediate distributions
• Early stopping chooses among these distributions
• Ensembling samples from them
• Can scalably estimate variational lower bound on model evidence during optimization
• Another connection between practice and theory
Thanks!