
Stochastic Approximations, Diffusion Limit and Small Random Perturbations of Dynamical Systems – a probabilistic approach to machine learning. Wenqing Hu, Department of Mathematics and Statistics, Missouri S&T.


Post on 12-Jul-2020








Machine Learning Background : Training DNN via minibatch Stochastic Gradient Descent (SGD).

- Deep Neural Network (DNN). The goal is to solve the following stochastic optimization problem

$$\min_{x \in \mathbb{R}^d} F(x) , \qquad F(x) \equiv \frac{1}{M} \sum_{i=1}^{M} F_i(x) ,$$

where each component $F_i$ corresponds to the loss function for data point $i \in \{1, \dots, M\}$, and $x$ is the vector of weights being optimized.


Machine Learning Background : Training DNN via minibatch Stochastic Gradient Descent (SGD).

- Naive Thinking : Gradient Descent (GD) updates as

$$x^{(t)} = x^{(t-1)} - \eta \, \frac{1}{M} \sum_{i=1}^{M} \nabla F_i(x^{(t-1)}) .$$

- The learning rate $\eta > 0$ is usually a small stepsize.

- Access to all gradients $\nabla F_i(x^{(t-1)})$ for $i = 1, \dots, M$ is in general very expensive in large-scale machine learning problems.


Machine Learning Background : Training DNN via minibatch Stochastic Gradient Descent (SGD).

- Let $B$ be a minibatch of prescribed size sampled uniformly from $\{1, \dots, M\}$. Then the objective function can be further written as the expectation of a stochastic function

$$\frac{1}{M} \sum_{i=1}^{M} F_i(x) = \mathbf{E}_B \left( \frac{1}{|B|} \sum_{i \in B} F_i(x) \right) \equiv \mathbf{E}_B F(x ; B) .$$

- SGD updates as

$$x^{(t)} = x^{(t-1)} - \eta \, \frac{1}{|B_t|} \sum_{i \in B_t} \nabla F_i(x^{(t-1)}) ,$$

which is the classical mini-batch version of SGD.
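The mini-batch update can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the least-squares losses $F_i(x) = \frac{1}{2}(a_i \cdot x - b_i)^2$, the names `minibatch_sgd` and `grad_i`, and all numerical values are hypothetical choices, not taken from the slides.

```python
import numpy as np

def minibatch_sgd(grad_i, x0, M, batch_size, eta, steps, rng):
    """Iterate x <- x - eta * (1/|B_t|) * sum_{i in B_t} grad F_i(x).

    B_t is drawn uniformly without replacement from {0, ..., M-1}.
    """
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        batch = rng.choice(M, size=batch_size, replace=False)
        g = np.mean([grad_i(x, i) for i in batch], axis=0)
        x = x - eta * g
    return x

# Hypothetical toy problem: F_i(x) = 0.5 * (a_i . x - b_i)^2 with noiseless targets,
# so x_true minimizes every F_i and hence F.
rng = np.random.default_rng(0)
M, d = 200, 3
A = rng.normal(size=(M, d))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true

def grad_i(x, i):
    # Gradient of the i-th component loss.
    return (A[i] @ x - b[i]) * A[i]

x_hat = minibatch_sgd(grad_i, np.zeros(d), M, batch_size=10, eta=0.05, steps=2000, rng=rng)
print(np.round(x_hat, 3))
```

Each step touches only `batch_size` of the `M` gradients, which is the cost saving that motivates SGD over full gradient descent.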


Stochastic Optimization.

- Stochastic Optimization Problem

$$\min_{x \in U \subset \mathbb{R}^d} F(x) , \qquad F(x) \equiv \mathbf{E}[F(x ; \zeta)] .$$

- The index random variable $\zeta$ follows some prescribed distribution $\mathcal{D}$.


Stochastic Optimization.

- We aim to find a local minimum point $x^*$ of the expectation function $F(x) \equiv \mathbf{E}[F(x ; \zeta)]$ :

$$x^* = \arg\min_{x \in U \subset \mathbb{R}^d} \mathbf{E}[F(x ; \zeta)] .$$


Stochastic Approximation : Stochastic Gradient Descent (SGD) Algorithm.

- Stochastic Gradient Descent (SGD) has the iteration :

$$x^{(t)} = x^{(t-1)} - \eta \nabla F(x^{(t-1)} ; \zeta_t) ,$$

where $\{\zeta_t\}$ are i.i.d. random variables that have the same distribution as $\zeta \sim \mathcal{D}$.


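Closer look at SGD.

- SGD : $x^{(t)} = x^{(t-1)} - \eta \nabla F(x^{(t-1)} ; \zeta_t)$.

- $F(x) \equiv \mathbf{E} F(x ; \zeta)$, in which $\zeta \sim \mathcal{D}$.

- Let $e_t = \nabla F(x^{(t-1)} ; \zeta_t) - \nabla F(x^{(t-1)})$, and we can rewrite SGD as

$$x^{(t)} = x^{(t-1)} - \eta \left( \nabla F(x^{(t-1)}) + e_t \right) .$$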


Diffusion Limit of SGD.

- Stochastic difference equation :

$$x^{(t)} - x^{(t-1)} = -\eta \nabla F(x^{(t-1)}) - \eta e_t .$$

- When $\eta$ is small, we can approximate $x^{(t)}$ by a diffusion process $X_t$ driven by the stochastic differential equation

$$dX_t = -\eta \nabla F(X_t) \, dt + \eta \sigma(X_t) \, dW_t , \qquad X_0 = x^{(0)} ,$$

where $W_t$ is a standard Brownian motion in $\mathbb{R}^d$ and

$$\sigma(x) \sigma^T(x) = D(x) .$$

- The diffusion matrix (noise covariance matrix)

$$D(x) = \mathbf{E} \left( \nabla F(x ; \zeta) - \nabla F(x) \right) \left( \nabla F(x ; \zeta) - \nabla F(x) \right)^T .$$
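As a rough numerical illustration of the diffusion limit, the sketch below computes $D(x)$ for a finite-sum loss, takes a symmetric square root as $\sigma(x)$, and simulates the SDE with the Euler–Maruyama scheme. Everything here is an assumption made for illustration: $\zeta$ is taken uniform over data indices (batch size one), the least-squares losses and the names `noise_covariance`, `matrix_sqrt`, `euler_maruyama` are hypothetical, and the symmetric square root is just one valid choice of $\sigma$.

```python
import numpy as np

def noise_covariance(per_sample_grads):
    """D(x) = E[(g - E g)(g - E g)^T] with zeta uniform over the rows (one row per data point)."""
    centered = per_sample_grads - per_sample_grads.mean(axis=0)
    return centered.T @ centered / per_sample_grads.shape[0]

def matrix_sqrt(D):
    """Symmetric sigma with sigma @ sigma.T == D, for symmetric positive semidefinite D."""
    w, V = np.linalg.eigh(D)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def euler_maruyama(grad_F, sigma, x0, eta, n_steps, dt, rng):
    """Simulate dX = -eta * grad F(X) dt + eta * sigma(X) dW with step size dt."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt), size=x.shape)
        x = x - eta * grad_F(x) * dt + eta * (sigma(x) @ dW)
    return x

# Hypothetical toy problem: F_i(x) = 0.5 * (a_i . x - b_i)^2.
rng = np.random.default_rng(0)
M, d = 100, 2
A = rng.normal(size=(M, d))
b = A @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=M)

def per_sample_grads(x):
    return (A @ x - b)[:, None] * A  # row i is grad F_i(x)

grad_F = lambda x: per_sample_grads(x).mean(axis=0)
sigma = lambda x: matrix_sqrt(noise_covariance(per_sample_grads(x)))

x_end = euler_maruyama(grad_F, sigma, np.zeros(d), eta=0.05, n_steps=2000, dt=1.0, rng=rng)
print(x_end)
```

With `dt=1.0`, one Euler–Maruyama step has the same conditional mean $-\eta \nabla F$ and covariance $\eta^2 D$ as one SGD step, which is the first-and-second-moment matching behind the approximation.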


Diffusion Limit of SGD : Justification.

- Slogan : Continuous Markov processes are characterized by their local statistical characteristics in the first and second moments only (conditional mean and (co)variance).

- Such an approximation has been justified in the weak sense in much of the classical literature. It can also be thought of as a normal deviation result (Hu–Li–Li–Liu, 2018).

- Conversely, the discrete iteration can be viewed as a numerical scheme for the diffusion limit $X_t$.

- In much of the CS literature, people simply refer to the diffusion limit as the SGD algorithm.


Small random perturbations of dynamical systems : gradient flow case.

- Diffusion Limit of SGD :

$$dX_t = -\eta \nabla F(X_t) \, dt + \eta \sigma(X_t) \, dW_t , \qquad X_0 = x^{(0)} .$$

- Let $Y_t = X_{t/\eta}$, then

$$dY_t = -\nabla F(Y_t) \, dt + \sqrt{\eta} \, \sigma(Y_t) \, dW_t , \qquad Y_0 = x^{(0)} .$$

(Random perturbations of the gradient flow !)

- In much of the CS literature, people simply refer to the randomly perturbed process $Y_t$ as the SGD algorithm.
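The time change can be checked in a few lines, reading the substitution as $Y_t = X_{t/\eta}$. The rescaled Brownian motion $\widetilde{W}$ is notation introduced here for the check; the slides write the new driving noise simply as $W_t$.

$$Y_t = X_{t/\eta} = x^{(0)} - \int_0^{t/\eta} \eta \nabla F(X_s) \, ds + \int_0^{t/\eta} \eta \sigma(X_s) \, dW_s .$$

Substituting $s = u/\eta$ in the drift integral gives $\int_0^t \nabla F(Y_u) \, du$. For the noise term, set $\widetilde{W}_u = \sqrt{\eta} \, W_{u/\eta}$, which is again a standard Brownian motion by Brownian scaling, so that $dW_{u/\eta} = \eta^{-1/2} \, d\widetilde{W}_u$ and

$$\int_0^{t/\eta} \eta \sigma(X_s) \, dW_s = \int_0^t \sqrt{\eta} \, \sigma(Y_u) \, d\widetilde{W}_u .$$

Hence $dY_t = -\nabla F(Y_t) \, dt + \sqrt{\eta} \, \sigma(Y_t) \, d\widetilde{W}_t$: the drift becomes order one and the noise amplitude becomes $\sqrt{\eta}$.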


Summary : Probabilistic Approach.

stochastic approximation  --[training algorithm]-->  statistical properties of learning model (convergence, generalization, ...)

↕  [small learning rate]  ↕

diffusion limit  <--[time change]--  small random perturbations of dynamical systems


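SGD as a randomly perturbed gradient flow.

- SGD as a randomly perturbed gradient flow :

$$dY_t = -\nabla F(Y_t) \, dt + \sqrt{\eta} \, \sigma(Y_t) \, dW_t , \qquad Y_0 = x^{(0)} ,$$

where

$$\sigma(x) \sigma^T(x) = D(x) .$$

- The diffusion matrix (noise covariance matrix)

$$D(x) = \mathbf{E} \left( \nabla F(x ; \zeta) - \nabla F(x) \right) \left( \nabla F(x ; \zeta) - \nabla F(x) \right)^T .$$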


Target Problem 1 : Convergence Time.

- Assume isotropic noise $D(x) = I_d$.

- Diffusion Limit :

$$dX_t = -\eta \nabla F(X_t) \, dt + \eta \, dW_t , \qquad X_0 = x^{(0)} .$$

- Convergence Time = Hitting time

$$\tau^\eta = \inf\{ t \ge 0 : F(X_t) \le F(x^*) + e \}$$

for some small $e > 0$.

- Asymptotics of $\mathbf{E} \tau^\eta$ as $\eta \to 0$ ?


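Target Problem 1 : Convergence Time.

- Random perturbations : Let $Y_t = X_{t/\eta}$, then

$$dY_t = -\nabla F(Y_t) \, dt + \sqrt{\eta} \, dW_t , \qquad Y_0 = x^{(0)} .$$

- Hitting time

$$T^\eta = \inf\{ t \ge 0 : F(Y_t) \le F(x^*) + e \}$$

for some small $e > 0$.

- Relation between the two hitting times :

$$\tau^\eta = \eta^{-1} T^\eta .$$

- Asymptotics of $\mathbf{E} T^\eta$ as $\eta \to 0$ ?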
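A small simulation makes the two time scales concrete. This is a sketch under assumptions that are not in the slides: a quadratic loss $F(y) = \frac{1}{2}|y|^2$ with minimizer $x^* = 0$ and no saddle points, an Euler–Maruyama discretization of the $Y_t$ equation, and hypothetical names such as `hitting_time_Y`.

```python
import numpy as np

def hitting_time_Y(grad_F, F, y0, F_star, e, eta, dt, rng, max_steps=10**6):
    """First time T with F(Y_T) <= F_star + e for dY = -grad F(Y) dt + sqrt(eta) dW."""
    y = np.array(y0, dtype=float)
    for n in range(max_steps):
        if F(y) <= F_star + e:
            return n * dt
        y = y - grad_F(y) * dt + np.sqrt(eta * dt) * rng.normal(size=y.shape)
    return np.inf  # threshold not reached within max_steps

# Hypothetical loss with a single minimum at the origin.
F = lambda y: 0.5 * float(y @ y)
grad_F = lambda y: y

rng = np.random.default_rng(1)
eta = 1e-6
T_eta = hitting_time_Y(grad_F, F, np.array([1.0, 1.0]), F_star=0.0, e=0.01,
                       eta=eta, dt=1e-3, rng=rng)
tau_eta = T_eta / eta  # hitting time on the original SGD time scale

print(T_eta, tau_eta)
```

For this loss the noiseless gradient flow reaches the threshold at $t = \frac{1}{2} \ln(100)$, so `T_eta` stays of order one while `tau_eta` grows like $\eta^{-1}$. The logarithmic corrections discussed in the slides only appear when the path has to leave saddle points.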


Approach to Target Problem 1 : Where is the difficulty ?

Figure 1 – Various critical points and the landscape of F .


Approach to Target Problem 1 : Where is the difficulty ?

- The algorithm spends most of its time near critical points of the loss function $F$.


Approach to Target Problem 1 : Escape from a saddle point.

Figure 2 – Escape from a saddle point.


Approach to Target Problem 1 : Chain of saddle points.

Figure 3 – Chain of saddle points.


Approach to Target Problem 1 : Sequence of stopping times.

- Standard Markov cycle type argument.

- But the geometry is a little different from the classical arguments for elliptic equilibria found in the Freidlin–Wentzell book.

- $\mathbf{E} T^\eta \lesssim \dfrac{k}{2 \gamma_1} \ln(\eta^{-1})$, conditioned upon convergence.
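A one-dimensional heuristic shows where the $\ln(\eta^{-1})$ scaling can come from. It is only a sketch under assumptions made here, not the argument of the cited work: the dynamics near a saddle are linearized along a single unstable direction with rate $\gamma > 0$, the process starts exactly at the saddle, and "escape" means reaching distance of order one.

$$dZ_t = \gamma Z_t \, dt + \sqrt{\eta} \, dW_t , \qquad Z_0 = 0 \quad \Longrightarrow \quad Z_t = \sqrt{\eta} \int_0^t e^{\gamma (t - s)} \, dW_s ,$$

$$\mathrm{Var}(Z_t) = \eta \int_0^t e^{2 \gamma (t - s)} \, ds = \frac{\eta}{2 \gamma} \left( e^{2 \gamma t} - 1 \right) .$$

The standard deviation reaches order one when $e^{2 \gamma t} \approx 2 \gamma / \eta$, that is, at $t \approx \frac{1}{2 \gamma} \ln(\eta^{-1}) + O(1)$. Passing $k$ such saddles in sequence then suggests a total time of order $\frac{k}{2 \gamma} \ln(\eta^{-1})$, which matches the form of the bound above.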


Approach to Target Problem 1 : Convergence Time.

- Theorem (Hu-Li, 2017). (i) For any small $\rho > 0$, with probability at least $1 - \rho$, the diffusion limit $X_t$ of SGD converges to the minimizer $x^*$ for sufficiently small $\eta$ after passing through all $k$ saddle points $O_1, \dots, O_k$ ; (ii) Consider the stopping time $\tau^\eta$. Then as $\eta \downarrow 0$, conditioned on the above convergence of the diffusion limit $X_t$ of SGD, we have

$$\lim_{\eta \downarrow 0} \frac{\mathbf{E} \tau^\eta}{\eta^{-1} \ln \eta^{-1}} \le \frac{k}{2 \gamma_1} .$$



Target Problem 2 : Effect of batch size on generalization.

- It is believed that escape from “bad minima” to “good minima” is responsible for the good empirical properties of SGD training, e.g. generalization.

- DNN training : How does the batch size affect the escape properties from local minimum points ?

- Large batch training vs. small batch training.

- Large-batch methods tend to converge to sharp minimizers of the training and testing functions.

- Small-batch methods consistently converge to flat minimizers.

Sharp minima → Poorer generalization ; Flat minima → Better generalization.


Approach to Target Problem 2 : Relationship between batch size and the diffusion matrix D(x).

- Relationship between batch size and the diffusion matrix $D(x)$ (Hu-Li-Li-Liu, 2018) :

$$D(x) = \left( \frac{1}{|B|} - \frac{1}{M} \right) D_0(x) ,$$

where

$$D_0(x) = \frac{1}{M - 1} \sum_{i=1}^{M} \left( \nabla F_i(x) - \nabla F(x) \right) \left( \nabla F_i(x) - \nabla F(x) \right)^T .$$

- Naively, the larger $D(x)$ is, the easier it is to escape, due to the injection of noise.
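The batch-size scaling $D = (1/|B| - 1/M) D_0$ can be checked exactly on a small example by enumerating every minibatch. The sketch assumes uniform sampling without replacement, which is what produces the $-1/M$ correction. The per-sample gradients are random placeholders and the function names are hypothetical.

```python
import itertools
import numpy as np

def D0(grads):
    """D_0 = 1/(M-1) * sum_i (g_i - gbar)(g_i - gbar)^T, one row of grads per data point."""
    M = grads.shape[0]
    centered = grads - grads.mean(axis=0)
    return centered.T @ centered / (M - 1)

def minibatch_covariance_exact(grads, batch_size):
    """Exact covariance of the minibatch-mean gradient over all subsets of the given size."""
    M = grads.shape[0]
    gbar = grads.mean(axis=0)
    means = np.array([grads[list(B)].mean(axis=0)
                      for B in itertools.combinations(range(M), batch_size)])
    centered = means - gbar
    return centered.T @ centered / means.shape[0]

rng = np.random.default_rng(0)
M = 8
grads = rng.normal(size=(M, 2))  # placeholder per-sample gradients at some fixed x

results = {}
for batch_size in (1, 2, 4, 8):
    lhs = minibatch_covariance_exact(grads, batch_size)
    rhs = (1.0 / batch_size - 1.0 / M) * D0(grads)
    results[batch_size] = bool(np.allclose(lhs, rhs))
print(results)
```

The full-batch case `batch_size == M` gives zero covariance, consistent with gradient descent having no sampling noise, while halving the batch size roughly doubles the noise when $|B| \ll M$.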


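Approach to Target Problem 2 : Escape from basin of attraction via Large Deviations Theory.

- Theorem (Hu-Li-Li-Liu, 2018). Let $U$ be a basin of attraction such that $O \in U$ is the only minimum of $F(x)$, and assume that the boundary $\partial U$ of the domain $U$ is smooth so that $\nabla F(x)$, $x \in \partial U$, points to the interior of the boundary of $U$. Then for $x_0 \in U$ we have the following two asymptotics :

$$P(x_0, \partial U) \asymp \exp\left( -\frac{1}{\eta} \min_{x \in \partial U} \phi^{QP}_{loc}(x ; O) \right) ,$$

$$\mathbf{E} \tau(x_0, \partial U) \asymp \exp\left( \frac{1}{\eta} \min_{x \in \partial U} \phi^{QP}_{loc}(x ; O) \right) .$$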



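Approach to Target Problem 2 : The quasi-potential.

- The local quasi-potential

$$\phi^{QP}_{loc}(x ; x_0) = \inf_{T > 0} \; \inf_{\psi(0) = x_0 , \, \psi(T) = x} S_{0T}(\psi) .$$

- Action functional (Large Deviations Rate Function) :

$$S_{0T}(\psi) = \begin{cases} \dfrac{1}{2} \displaystyle\int_0^T \left( \dot{\psi}_t + \nabla F(\psi_t) \right)^T D^{-1}(\psi_t) \left( \dot{\psi}_t + \nabla F(\psi_t) \right) dt , & \text{if } \psi_t \text{ is absolutely continuous for } t \in [0, T] ; \\ +\infty , & \text{otherwise} . \end{cases}$$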
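To get a feel for the action functional, one can discretize it and evaluate it along candidate paths. The sketch below assumes the standard Freidlin–Wentzell normalization with a factor $\frac{1}{2}$ in front of the integral, a one-dimensional quadratic loss $F(x) = x^2/2$ with minimum $O = 0$, and $D = I$. The function `action` and the two test paths are hypothetical illustrations.

```python
import numpy as np

def action(path, dt, grad_F, D_inv):
    """Midpoint-rule discretization of 1/2 * int (psi' + grad F)^T D^{-1} (psi' + grad F) dt."""
    total = 0.0
    for a, b in zip(path[:-1], path[1:]):
        mid = 0.5 * (a + b)
        v = (b - a) / dt + grad_F(mid)
        total += 0.5 * float(v @ D_inv(mid) @ v) * dt
    return total

# Assumed setting: F(x) = x^2 / 2 in one dimension, isotropic noise D = I.
grad_F = lambda x: x
D_inv = lambda x: np.eye(1)

T, n = 12.0, 12000
t = np.linspace(0.0, T, n + 1)

# Time-reversed gradient flow from (almost) O = 0 up to x = 1, i.e. psi' = +grad F(psi).
uphill = np.exp(t - T).reshape(-1, 1)
# A straight line from 0 to 1 over the same time horizon, for comparison.
straight = (t / T).reshape(-1, 1)

S_uphill = action(uphill, T / n, grad_F, D_inv)
S_straight = action(straight, T / n, grad_F, D_inv)
print(S_uphill, S_straight)
```

Along the time-reversed gradient flow the action is close to $2 (F(1) - F(0)) = 1$, which is the value of the quasi-potential in this isotropic example, and the straight-line path costs more.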


Target Problem 3 : Algorithm’s implicit selection of specific local minima.

- Implicit Regularization : It is widely believed that SGD is an implicit regularizer which helps it find a local minimum that generalizes well.

- How to understand this phenomenon ?


Approach to Target Problem 3 : Structure of diffusion matrix D(x).

- The covariance structure (diffusion matrix) $D(x)$ depends implicitly on the model architecture :

$$D(x) = \mathbf{E} \left( \nabla F(x ; \zeta) - \nabla F(x) \right) \left( \nabla F(x ; \zeta) - \nabla F(x) \right)^T .$$

- Anisotropic Noise vs. Isotropic Noise.

- $D(x)$ may have a very large condition number !


Approach to Target Problem 3 : Relation between the quasi-potential and the diffusion matrix D(x).

- Theorem (Hu-Zhu-Xiong-Huan, 2019). The local quasi-potential $\phi^{QP}_{loc}(x ; x_0)$ is a solution to the Hamilton–Jacobi equation

$$\frac{1}{2} \left( \nabla \phi^{QP}_{loc}(x ; x_0) \right)^T D(x) \, \nabla \phi^{QP}_{loc}(x ; x_0) - \nabla F(x) \cdot \nabla \phi^{QP}_{loc}(x ; x_0) = 0 ,$$

with boundary condition

$$\phi^{QP}_{loc}(O ; x_0) = 0 .$$
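As a sanity check, the isotropic case can be worked out by hand. The check assumes the Hamilton–Jacobi equation carries a factor $\frac{1}{2}$ on the quadratic term, namely $\frac{1}{2} (\nabla \phi)^T D \nabla \phi - \nabla F \cdot \nabla \phi = 0$, and takes $D(x) = I_d$ with the candidate $\phi(x) = 2 \left( F(x) - F(O) \right)$:

$$\nabla \phi = 2 \nabla F \quad \Longrightarrow \quad \frac{1}{2} (2 \nabla F)^T (2 \nabla F) - \nabla F \cdot (2 \nabla F) = 2 |\nabla F|^2 - 2 |\nabla F|^2 = 0 , \qquad \phi(O) = 0 .$$

So for isotropic noise the local quasi-potential inside the basin of $O$ is just a rescaled copy of the loss. Any preference for particular minima beyond their loss values must therefore come from the anisotropy of $D(x)$.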


Approach to Target Problem 3 : Escape from local minimum point : Isotropic Noise vs. Anisotropic Noise.

Figure 4 – Escape from local minimum point : Isotropic Noise vs. Anisotropic Noise.


Target Problem 3 : Algorithm’s implicit selection of specific local minima.

- Suppose that $F(x)$ is non-convex and admits several local minimum points.

- Does SGD always select the global minimum of $F(x)$ ?


Approach to Target Problem 3 : Variational Inference.

- Variational Inference : Invariant density

$$\rho^{SS}(x) = \frac{1}{Z(\eta)} \exp\left( -\frac{2}{\eta} \Phi(x) \right) .$$

- SGD selects the global minimum of $\Phi(x)$ : $\rho(x, t) \to \rho^{SS}(x)$ as $t \to \infty$.

- If the diffusion matrix $D(x) = I_d$ is isotropic, then $\Phi(x) \propto F(x)$.

- Anisotropic noise is common !

- So in general $\Phi(x) \ne F(x)$.
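The isotropic claim can be verified directly from the stationary Fokker–Planck equation. The computation assumes the randomly perturbed gradient flow $dY_t = -\nabla F(Y_t) \, dt + \sqrt{\eta} \, dW_t$ with $D = I_d$, and that $e^{-2F/\eta}$ is integrable. The density $\rho(x, t)$ of $Y_t$ satisfies

$$\partial_t \rho = \nabla \cdot \left( \rho \nabla F + \frac{\eta}{2} \nabla \rho \right) .$$

Trying $\rho^{SS}(x) = Z(\eta)^{-1} \exp\left( -\frac{2}{\eta} F(x) \right)$ gives $\nabla \rho^{SS} = -\frac{2}{\eta} \rho^{SS} \nabla F$, so the flux $\rho^{SS} \nabla F + \frac{\eta}{2} \nabla \rho^{SS}$ vanishes identically and $\rho^{SS}$ is stationary. In this case the potential in the exponent is the loss $F$ itself, and the density concentrates on the global minimum of $F$ as $\eta \to 0$.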


Approach to Target Problem 3 : Variational Inference via Large Deviation Theory.

- Large Deviation Theory provides another way to express the invariant density

$$\rho^{SS}(x) \asymp \exp\left( -\frac{1}{\eta} \phi^{QP}(x) \right) .$$

- (Global) quasi-potential : $\phi^{QP}(x)$.


Approach to Target Problem 3 : Variational Inference via Large Deviation Theory.

- Theorem (Hu-Zhu-Xiong-Huan, 2019). SGD selects the global minimum $x^*$ of $\phi^{QP}(x)$ such that $\phi^{QP}(x^*) = 0$.


Approach to Target Problem 3 : Global quasi-potential.

- The (global) quasi-potential $\phi^{QP}(x)$ can be calculated from the local quasi-potential $\phi^{QP}_{loc}(x ; x_0)$.

- We have seen that the local quasi-potential $\phi^{QP}_{loc}(x ; x_0)$ depends on the diffusion matrix $D(x)$ via the Hamilton–Jacobi equation.

- Markov chain on local minimum points (Freidlin–Wentzell theory).


Approach to Target Problem 3 : Implicit regularization via covariance structure.

Figure 5 – Implicit regularization via covariance structure. The loss function is symmetric with respect to two local minima (−2, 0) and (2, 0). Left : Process starts from (−2, 0), anisotropic noise in the potential well ; Right : Process starts from (2, 0), isotropic noise in the potential well. SGD tends to select (2, 0).


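Stochastic Approximation : Stochastic heavy-ball method.

$$\min_{x \in \mathbb{R}^d} f(x) .$$

- “Heavy ball method” (B. Polyak, 1964) :

$$\begin{cases} x_t = x_{t-1} + \sqrt{s} \, v_t , \\ v_t = (1 - \alpha \sqrt{s}) \, v_{t-1} - \sqrt{s} \, \nabla_X f(x_{t-1}) , \\ x_0 , v_0 \in \mathbb{R}^d . \end{cases}$$

- Stochastic heavy ball method :

$$\begin{cases} x_t = x_{t-1} + \sqrt{s} \, v_t , \\ v_t = (1 - \alpha \sqrt{s}) \, v_{t-1} - \sqrt{s} \, \nabla_X f(x_{t-1} ; \zeta_t) , \\ x_0 , v_0 \in \mathbb{R}^d . \end{cases}$$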
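The two recursions translate almost line by line into code. In the sketch below the stochastic gradient $\nabla_X f(x ; \zeta_t)$ is replaced by the exact gradient plus Gaussian noise, which is an assumption made purely for illustration. The quadratic test function, the parameter values and the name `heavy_ball` are likewise hypothetical.

```python
import numpy as np

def heavy_ball(grad, x0, v0, s, alpha, steps, rng=None, noise_std=0.0):
    """v_t = (1 - alpha*sqrt(s)) * v_{t-1} - sqrt(s) * g(x_{t-1});  x_t = x_{t-1} + sqrt(s) * v_t.

    With noise_std > 0, g is the exact gradient plus Gaussian noise, an assumed
    stand-in for the stochastic gradient grad f(x; zeta_t).
    """
    x = np.array(x0, dtype=float)
    v = np.array(v0, dtype=float)
    root_s = np.sqrt(s)
    for _ in range(steps):
        g = grad(x)
        if noise_std > 0.0:
            g = g + noise_std * rng.normal(size=x.shape)
        v = (1.0 - alpha * root_s) * v - root_s * g  # velocity update uses x_{t-1}
        x = x + root_s * v                           # position update uses the new v_t
    return x, v

# Hypothetical ill-conditioned quadratic f(x) = 0.5 * x^T H x.
H = np.diag([1.0, 10.0])
grad = lambda x: H @ x

x_det, v_det = heavy_ball(grad, [1.0, 1.0], [0.0, 0.0], s=0.01, alpha=1.0, steps=1000)
x_sto, v_sto = heavy_ball(grad, [1.0, 1.0], [0.0, 0.0], s=0.01, alpha=1.0, steps=1000,
                          rng=np.random.default_rng(0), noise_std=0.1)
print(np.linalg.norm(x_det), np.linalg.norm(x_sto))
```

Eliminating $v_t$ gives $x_t - x_{t-1} = (1 - \alpha \sqrt{s})(x_{t-1} - x_{t-2}) - s \, \nabla f(x_{t-1})$, the familiar momentum form with momentum coefficient $1 - \alpha \sqrt{s}$ and step size $s$.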


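Diffusion Limit of Stochastic heavy-ball method.

- Diffusion limit of the stochastic heavy ball method :

$$\begin{cases} dx(t) = \sqrt{s} \, v(t) \, dt , \\ dv(t) = -\sqrt{s} \, \alpha v(t) \, dt - \sqrt{s} \, \nabla_X f(x(t)) \, dt + \sqrt{s} \, \sigma(x(t)) \, dW_t , \\ x(0) , v(0) \in \mathbb{R}^d . \end{cases}$$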


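Small random perturbations of dynamical systems : dissipative Hamiltonian system.

- Small random perturbations of a dissipative Hamiltonian system :

$$\begin{cases} dX^\varepsilon_t = V^\varepsilon_t \, dt ; \\ dV^\varepsilon_t = \left( -\alpha V^\varepsilon_t - \nabla_X f(X^\varepsilon_t) \right) dt + \varepsilon \sigma(X^\varepsilon_t) \, dW_t , \\ X^\varepsilon_0 , V^\varepsilon_0 \in \mathbb{R}^d . \end{cases}$$

- Hu-Li-Su, 2017 and work in progress.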


Many further interesting directions at the interplay of probability and data science ...

- Escape from saddle points for anisotropic noise.

- Can the program on the Hamilton–Jacobi equation be carried over to specific deep neural network structures ?

- Empirical Risk vs. Population Risk : How does this affect SGD’s variational inference ? Relation with Generalization.

- Many more applied problems in collaboration with CS/ECE/ORIE people...


Thank you for your attention !