Early Stopping is Nonparametric Variational Inference
Dougal Maclaurin, David Duvenaud, Ryan Adams
Good ideas always have Bayesian interpretations
Regularization = MAP inference
Limiting model capacity = Bayesian Occam's razor
Cross-validation = Estimating marginal likelihood
Dropout = Integrating out spike-and-slab
Ensembling = Bayes model averaging?
Early stopping = ??
Gradient descent as a sampler
• Optimization paths start from random init, and move towards modes...
Implicit Distributions
• What about the implicit distribution of parameters after optimizing for t steps?
• Starts as a bad approximation (prior dist)
• Ends as a bad approximation (point mass)
• Ensembling = taking multiple samples from dist (see the sketch below)
• Early stopping = choosing best intermediate dist
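As a rough illustration (not from the talk), the sketch below treats q_t operationally: a sample from q_t is whatever parameter vector you reach after t gradient steps from a fresh random initialization, so running several independent chains gives several samples. The quadratic toy loss, step size, and names such as grad_loss and sample_q_t are invented for the example.

```python
# Hypothetical sketch: a "sample" from the implicit distribution q_t is the
# parameter vector reached after t gradient steps from an independent random init.
import numpy as np

rng = np.random.default_rng(0)
D, alpha, sigma0 = 2, 0.1, 1.0          # toy dimension, step size, init scale

def grad_loss(theta):
    # Gradient of a toy quadratic loss whose mode sits at (1, -1).
    return theta - np.array([1.0, -1.0])

def sample_q_t(t, n_samples=1000):
    """Draw n_samples from q_t: initialize from q_0, then take t gradient steps."""
    thetas = sigma0 * rng.normal(size=(n_samples, D))   # theta_0 ~ q_0
    for _ in range(t):
        thetas = thetas - alpha * grad_loss(thetas)
    return thetas

# q_0 is the broad initial distribution; q_50 has collapsed most of the way to a point mass.
print("std of q_0: ", sample_q_t(0).std(axis=0))
print("std of q_50:", sample_q_t(50).std(axis=0))
```

Averaging predictions over these samples is the ensembling interpretation above; picking the t whose q_t best approximates the posterior is early stopping.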
Cross-validation vs. marginal likelihood
• What if we could evaluate marginal likelihood of implicit distribution?
• Could choose all hypers to maximize marginal likelihood
• No need for cross-validation?
[Figure: train and test RMSE, and estimated marginal likelihood, vs. training iteration (0–500)]
Variational Lower Bound
log p(x) ≥ −E_q(θ)[−log p(θ, x)] − E_q(θ)[log q(θ)] = −E[q] + S[q]
(Energy E[q] = E_q(θ)[−log p(θ, x)]; Entropy S[q] = −E_q(θ)[log q(θ)])

Energy estimated from the optimized objective function (training loss is the NLL):
E_q(θ)[−log p(θ, x)] ≈ −log p(θ̂T, x)

Entropy estimated by tracking the change at each iteration:
−E_q(θ)[log q(θ)] ≈ S[q0] + Σ_{t=0}^{T−1} log |J(θ̂t)|
Using a single sample!
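For completeness, the inequality at the top of this slide is the standard variational bound; it follows in one line from Jensen's inequality (a textbook step, not specific to this talk):

```latex
% Variational lower bound on the marginal likelihood via Jensen's inequality.
\log p(x) = \log \int q(\theta)\,\frac{p(\theta, x)}{q(\theta)}\, d\theta
  \;\ge\; \int q(\theta)\,\log \frac{p(\theta, x)}{q(\theta)}\, d\theta
  = \mathbb{E}_{q(\theta)}\!\left[\log p(\theta, x)\right]
    - \mathbb{E}_{q(\theta)}\!\left[\log q(\theta)\right].
```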
Estimating change in entropy
• Intuitively: High curvature makes entropy decrease quickly
• Can measure local curvature with Hessian
• Approximation good for small step-sizes
Estimating change in entropy
Volume change given by the Jacobian of the optimizer's update operator:
S[qt+1] − S[qt] = E_{qt(θt)}[ log |J(θt)| ]

Gradient descent update rule:
θt+1 = θt − α∇L(θt)

has Jacobian:
J(θt) = I − α∇∇L(θt)

Entropy change estimated at a single sample:
S[qt+1] − S[qt] ≈ log |I − α∇∇L(θt)|
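A useful extra step (standard linear algebra, not written out on the slide): in terms of the eigenvalues λi of the Hessian ∇∇L(θt), the log-determinant expands as below. This is why the estimate behaves well for small step sizes, and it suggests how a low-order Taylor truncation, computable with Hessian-vector products, can replace the exact determinant, as the next slide's O(D) approximation does.

```latex
% Small-step expansion of the entropy change, assuming alpha * lambda_max < 1.
\log\bigl|I - \alpha\,\nabla\nabla L(\theta_t)\bigr|
  = \sum_{i=1}^{D} \log\bigl(1 - \alpha\lambda_i\bigr)
  = -\alpha \sum_{i=1}^{D} \lambda_i
    \;-\; \frac{\alpha^2}{2} \sum_{i=1}^{D} \lambda_i^2 \;-\; \cdots
  \;\approx\; -\alpha\,\mathrm{tr}\bigl(\nabla\nabla L(\theta_t)\bigr).
```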
Final algorithm
Stochastic gradient descent
1: input: weight init scale σ0, step size α, negative log-likelihood L(θ, t)
2: initialize θ0 ∼ N(0, σ0 I_D)
3: for t = 1 to T do
4:   θt = θt−1 − α∇L(θt−1, t)
5: output sample θT

SGD with entropy estimate
1: input: weight init scale σ0, step size α, negative log-likelihood L(θ, t)
2: initialize θ0 ∼ N(0, σ0 I_D)
3: initialize S0 = (D/2)(1 + log 2π) + D log σ0
4: for t = 1 to T do
5:   St = St−1 + log |I − α∇∇L(θt−1, t)|
6:   θt = θt−1 − α∇L(θt−1, t)
7: output sample θT, entropy estimate ST
• Approximate bound: log p(x) ≳ −L(θT) + ST (see the toy sketch below)
• Determinant is O(D³)
• O(D) Taylor approximation using Hessian-vector products
• Scales linearly in parameters and dataset size
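A minimal, self-contained sketch (not the authors' code) of the "SGD with entropy estimate" recipe, run on a made-up Bayesian linear-regression problem. The data, step size, and names such as neg_log_joint are invented for the example, and because D is tiny the log-determinant is computed exactly with numpy rather than with the O(D) Hessian-vector-product approximation mentioned above.

```python
# Toy run of SGD with a running entropy estimate and variational lower bound.
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: y = X w + noise, with an N(0, I) prior on the weights.
N, D, noise_var = 100, 5, 0.25
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + np.sqrt(noise_var) * rng.normal(size=N)

def neg_log_joint(theta):
    """L(theta) = -log p(theta, x) up to constants: Gaussian likelihood + N(0, I) prior."""
    resid = y - X @ theta
    return 0.5 * resid @ resid / noise_var + 0.5 * theta @ theta

def grad(theta):
    return -X.T @ (y - X @ theta) / noise_var + theta

def hessian(theta):
    return X.T @ X / noise_var + np.eye(D)   # constant for this quadratic model

alpha, T, sigma0 = 1e-3, 500, 1.0
theta = sigma0 * rng.normal(size=D)                       # theta_0 ~ N(0, sigma0^2 I) = the prior
S = D / 2 * (1 + np.log(2 * np.pi)) + D * np.log(sigma0)  # entropy of q_0

bound = []
for t in range(T):
    # Entropy change of the implicit distribution under one gradient-descent step.
    S += np.linalg.slogdet(np.eye(D) - alpha * hessian(theta))[1]
    theta = theta - alpha * grad(theta)
    # Running variational lower bound on log p(x): -energy + entropy.
    bound.append(-neg_log_joint(theta) + S)

# The bound's maximum suggests when to stop, without a validation set.
print("suggested stopping iteration:", int(np.argmax(bound)))
```

The bound keeps falling after the training loss has converged because the entropy term keeps shrinking; that trade-off is exactly what the stopping criterion on the next slide exploits.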
Choosing when to stop
• Neural network on the Boston housing dataset.
• SGD marginal likelihood estimate gives stopping criterion without a validation set
[Figure: train and test RMSE, and estimated marginal likelihood, vs. training iteration (0–500)]
Choosing number of hidden units
• Neural net on 50000 MNIST examples
• Largest model has 2 million parameters
• Gives reasonable estimates, but cross-validation still better
[Figure: training and test predictive likelihood, and marginal likelihood, vs. number of hidden units (10 to 3000)]
Limitations
• SGD not even trying to maximize lower bound – good approximation is by accident!
• Entropy term gets arbitrarily bad due to concentration, but true performance only gets as bad as maximum likelihood estimate
Entropy-friendly optimization
• Modified SGD to move slower near convergence, optimized new hyperparameter
• Hurts performance, but gives tighter bound
• Ideally would match test likelihood
[Figure: true posterior and implicit distributions after 0, 150, and 300 iterations; training and test likelihood, and marginal likelihood, vs. gradient threshold]
Limitations
• Irrelevant parameters can cause low entropy estimate
• No momentum – would need to estimate distribution (see Kingma & Welling, 2015)
Main Takeaways
• Optimization with random restarts implies nonparametric intermediate distributions
• Early stopping chooses among these distributions
• Ensembling samples from them
• Can scalably estimate variational lower bound on model evidence during optimization
• Another connection between practice and theory
Thanks!