Early Stopping is Nonparametric Variational Inference
Dougal Maclaurin, David Duvenaud, Ryan Adams
duvenaud/talks/early-stopping-bnp.pdf

TRANSCRIPT

Page 1:

Early Stopping is Nonparametric Variational Inference

Dougal Maclaurin, David Duvenaud, Ryan Adams

Page 2:

Good ideas always have Bayesian interpretations

Regularization = MAP inference
Limiting model capacity = Bayesian Occam’s razor
Cross-validation = Estimating marginal likelihood
Dropout = Integrating out spike-and-slab
Ensembling = Bayes model averaging?
Early stopping = ??

Page 3:

Gradient descent as a sampler

• Optimization paths start from random init, and move towards modes...
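This picture can be made concrete: run gradient descent from many random initializations, and treat the set of iterates after t steps as samples from an implicit distribution q_t. A minimal numpy sketch on a 1-D double-well loss (the loss, step size, and variable names are illustrative choices, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(theta):
    # Gradient of the double-well loss L(theta) = (theta**2 - 1)**2,
    # whose modes are at theta = -1 and theta = +1.
    return 4 * theta * (theta**2 - 1)

# Many optimization paths from a broad random init (the "prior" q_0).
thetas = rng.uniform(-2.0, 2.0, size=1000)
for _ in range(200):
    thetas = thetas - 0.01 * grad(thetas)

# The population of iterates is a sample from the implicit
# distribution q_200, concentrated near the two modes +-1.
print(np.round(np.mean(np.abs(thetas)), 2))
```

Early in optimization the population still resembles the broad init; late in optimization it collapses onto the modes, and the intermediate distributions in between are the q_t that the later slides reason about.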

Page 36:

Implicit Distributions

• What about the implicit distribution of parameters after optimizing for t steps?

• Starts as a bad approximation (prior dist)

• Ends as a bad approximation (point mass)

• Ensembling = taking multiple samples from dist

• Early stopping = choosing best intermediate dist

Page 47:

Cross-validation vs. marginal likelihood

• What if we could evaluate the marginal likelihood of the implicit distribution?

• Could choose all hypers to maximize marginal likelihood

• No need for cross-validation?

[Figure: train/test RMSE and marginal-likelihood estimate vs. training iteration (0–500)]

Page 48:

Variational Lower Bound

log p(x) ≥ −E_{q(θ)}[−log p(θ, x)] − E_{q(θ)}[log q(θ)]

where the first bracketed term is the energy, E[q] = E_{q(θ)}[−log p(θ, x)], and the second term is the entropy, S[q] = −E_{q(θ)}[log q(θ)].

Energy estimated from the optimized objective function (training loss is the NLL):

E_{q(θ)}[−log p(θ, x)] ≈ −log p(θ̂_T, x)

Entropy estimated by tracking the change at each iteration:

−E_{q(θ)}[log q(θ)] ≈ S[q_0] + Σ_{t=0}^{T−1} log |J(θ̂_t)|

Using a single sample!

Page 49:

Estimating change in entropy

• Intuitively: high curvature makes entropy decrease quickly

• Can measure local curvature with the Hessian

• Approximation good for small step sizes


Page 54:

Estimating change in entropy

Volume change is given by the Jacobian of the optimizer’s update operator:

S[q_{t+1}] − S[q_t] = E_{q_t(θ_t)}[ log |J(θ_t)| ]

The gradient descent update rule

θ_{t+1} = θ_t − α∇L(θ_t)

has Jacobian

J(θ_t) = I − α∇∇L(θ_t)

so the entropy change can be estimated at a single sample:

S[q_{t+1}] − S[q_t] ≈ log |I − α∇∇L(θ_t)|
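For a quadratic loss the update is linear, so this identity can be checked exactly: one gradient descent step maps a Gaussian N(0, Σ) to N(0, JΣJᵀ), whose entropy differs by exactly log|J|. A small numpy check (the particular A, Σ, and α below are illustrative choices, not from the talk):

```python
import numpy as np

def gaussian_entropy(cov):
    # Differential entropy of N(0, cov): 0.5 * log|2*pi*e*cov|
    d = cov.shape[0]
    return 0.5 * (d * (1 + np.log(2 * np.pi)) + np.linalg.slogdet(cov)[1])

alpha = 0.1
A = np.array([[2.0, 0.5], [0.5, 1.0]])    # Hessian of L(theta) = 0.5 theta^T A theta
J = np.eye(2) - alpha * A                  # Jacobian of theta -> theta - alpha * A theta

cov = np.array([[1.0, 0.2], [0.2, 2.0]])   # covariance of q_t
cov_next = J @ cov @ J.T                   # covariance of q_{t+1}

delta_S = gaussian_entropy(cov_next) - gaussian_entropy(cov)
log_det_J = np.linalg.slogdet(J)[1]
print(np.allclose(delta_S, log_det_J))     # the two quantities agree
```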

Page 55:

Final algorithm

Stochastic gradient descent
1: input: weight init scale σ_0, step size α, negative log-likelihood L(θ, t)
2: initialize θ_0 ∼ N(0, σ_0 I_D)
3:
4: for t = 1 to T do
5:
6:   θ_t = θ_{t−1} − α∇L(θ_{t−1}, t)
7: output sample θ_T

SGD with entropy estimate
1: input: weight init scale σ_0, step size α, negative log-likelihood L(θ, t)
2: initialize θ_0 ∼ N(0, σ_0 I_D)
3: initialize S_0 = (D/2)(1 + log 2π) + D log σ_0
4: for t = 1 to T do
5:   S_t = S_{t−1} + log |I − α∇∇L(θ_{t−1}, t)|
6:   θ_t = θ_{t−1} − α∇L(θ_{t−1}, t)
7: output sample θ_T, entropy estimate S_T

• Approximate bound: log p(x) ⪆ −L(θ_T) + S_T

• Exact determinant is O(D³); an O(D) Taylor approximation uses Hessian-vector products

• Scales linearly in number of parameters and dataset size
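The right-hand algorithm can be sketched directly in numpy, using the exact O(D³) log-determinant (fine at toy scale; the scalable version on the slide replaces it with a Hessian-vector-product Taylor approximation). The function `sgd_with_entropy` and the quadratic example loss are illustrative, not the authors' code:

```python
import numpy as np

def sgd_with_entropy(grad, hess, theta0, sigma0, alpha, T):
    """Gradient descent that also tracks the entropy of the implicit
    distribution: S_t = S_{t-1} + log|I - alpha * Hessian(theta_{t-1})|."""
    D = theta0.size
    # Entropy of the Gaussian init N(0, sigma0^2 I_D).
    S = 0.5 * D * (1 + np.log(2 * np.pi)) + D * np.log(sigma0)
    theta = theta0.copy()
    for _ in range(T):
        # The exact log-determinant is the O(D^3) step; the scalable
        # version would use Hessian-vector products instead.
        S += np.linalg.slogdet(np.eye(D) - alpha * hess(theta))[1]
        theta = theta - alpha * grad(theta)
    return theta, S

# Toy example: quadratic NLL  L(theta) = 0.5 * theta^T A theta.
A = np.diag([1.0, 4.0])
rng = np.random.default_rng(0)
sigma0 = 1.0
theta0 = sigma0 * rng.standard_normal(2)

theta_T, S_T = sgd_with_entropy(grad=lambda th: A @ th,
                                hess=lambda th: A,
                                theta0=theta0, sigma0=sigma0,
                                alpha=0.1, T=100)
elbo = -0.5 * theta_T @ A @ theta_T + S_T  # log p(x) >~ -L(theta_T) + S_T
```

As T grows, θ_T collapses toward the mode and S_T decreases without bound, which is the "entropy term gets arbitrarily bad" behavior noted in the Limitations slide.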

Page 56:

Choosing when to stop

• Neural network on the Boston housing dataset.

• SGD marginal likelihood estimate gives a stopping criterion without a validation set

[Figure: train/test RMSE and marginal-likelihood estimate vs. training iteration (0–500)]

Page 57:

Choosing number of hidden units

• Neural net on 50,000 MNIST examples

• Largest model has 2 million parameters

• Gives reasonable estimates, but cross-validation still better

[Figure: train/test predictive likelihood and marginal likelihood vs. number of hidden units (10–3000)]

Page 58:

Limitations

• SGD not even trying to maximize lower bound – good approximation is by accident!

• Entropy term gets arbitrarily bad due to concentration, but true performance only gets as bad as maximum likelihood estimate

Page 59:

Entropy-friendly optimization

• Modified SGD to move slower near convergence, optimized new hyperparameter

• Hurts performance, but gives tighter bound

• Ideally would match test likelihood

[Figure: true posterior vs. implicit distribution at 0, 150, and 300 iterations; train/test likelihood and marginal likelihood vs. gradient threshold (0–1000)]


Page 61:

Limitations

• Irrelevant parameters can cause low entropy estimate

• No momentum – would need to estimate distribution (see Kingma & Welling, 2015)

Page 62:

Main Takeaways

• Optimization with random restarts implies nonparametric intermediate distributions

• Early stopping chooses among these distributions

• Ensembling samples from them

• Can scalably estimate variational lower bound on model evidence during optimization

• Another connection between practice and theory

Thanks!
