[DL Reading Group] Bayesian Dark Knowledge
TRANSCRIPT
Bayesian Dark Knowledge
August 18, 2016
Contents
1 Introduction
2 Background Knowledge
3 Bayesian Dark Knowledge
4 How to improve the original Bayesian Dark Knowledge
Introduction
"Bayesian Dark Knowledge" is a method unifying SGLD with distillation.
SGLD is a method for learning large-scale Bayesian models such as Bayesian neural networks. SGLD makes it possible to avoid overfitting.
Distillation is a method for training student networks using soft labels created by teacher networks.
Background Knowledge
SGLD
MALA (Metropolis-Adjusted Langevin Algorithm)
The objective: sample from p(θ), which is often the posterior p(θ | Data). The method is based on a Langevin diffusion with stationary distribution p(θ), defined by
dθ(t) = (1/2) ∇θL(θ(t)) dt + N(0, I dt).
However, isotropic diffusion is inefficient.
So a preconditioning matrix M is introduced; M is usually set to the inverse of the Fisher information matrix.
dθ(t) = (1/2) M ∇θL(θ(t)) dt + N(0, M dt).
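As a concrete illustration, here is a minimal MALA sketch on a toy 1-D target (a standard normal; the target, step size, and all names below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: standard normal, so L(theta) = log p(theta) up to a constant.
def log_p(theta):
    return -0.5 * theta ** 2

def grad_log_p(theta):
    return -theta

def mala_step(theta, eps):
    """One Metropolis-adjusted Langevin step (isotropic case, M = I)."""
    prop = theta + 0.5 * eps * grad_log_p(theta) + rng.normal(0.0, np.sqrt(eps))

    def log_q(b, a):  # log Gaussian proposal density q(b | a), up to a constant
        mean = a + 0.5 * eps * grad_log_p(a)
        return -((b - mean) ** 2) / (2 * eps)

    log_alpha = log_p(prop) + log_q(theta, prop) - log_p(theta) - log_q(prop, theta)
    return prop if np.log(rng.uniform()) < log_alpha else theta

theta, samples = 3.0, []
for _ in range(5000):
    theta = mala_step(theta, eps=0.5)
    samples.append(theta)
# After burn-in, the samples should roughly follow N(0, 1).
```

The accept/reject step corrects the discretization error of the diffusion; SGLD, introduced next, drops it.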
SGLD
SGLD is a method combining SGD with MALA.
The formula is as follows:
∆θt = (ϵt/2) (∇ log p(θt) + (N/n) ∑i=1..n ∇ log p(xti | θt)) + ηt,  ηt ∼ N(0, ϵt).
Note that in SGD, the noise term is absent.
Rejection rates go to zero asymptotically, so the Metropolis accept/reject step can be omitted.
In the initial phase, the update behaves like SGD.
In the latter phase, the update behaves like MALA.
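The update formula above can be sketched on a toy model (posterior over the mean of a unit-variance Gaussian; the model, data, and fixed step size are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all names and constants are illustrative): posterior over the
# mean theta of a unit-variance Gaussian, with a N(0, 10) prior.
N = 1000                         # full dataset size
data = rng.normal(2.0, 1.0, N)

def grad_log_prior(theta):
    return -theta / 10.0         # gradient of log N(0, 10)

def grad_log_lik(x, theta):
    return x - theta             # per-example gradient (unit variance)

def sgld_step(theta, minibatch, eps):
    """One SGLD update: scaled stochastic gradient plus N(0, eps) noise."""
    n = len(minibatch)
    grad = grad_log_prior(theta) + (N / n) * grad_log_lik(minibatch, theta).sum()
    return theta + 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps))

theta, samples = 0.0, []
for t in range(4000):
    batch = rng.choice(data, size=32)
    theta = sgld_step(theta, batch, eps=1e-3)  # fixed here; the paper decays eps_t
    samples.append(theta)
# Late iterates approximate posterior samples, concentrated near the data mean.
```

With a decaying ϵt, early iterates are dominated by the gradient (SGD-like) and late iterates by the injected noise (MALA-like), matching the two phases above.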
Distillation
By training ensembles of networks or large networks, we can get good accuracy.
Networks like these are called teacher networks. However, their model size is large.
After training the teacher networks, we want to transfer the knowledge in the learned function into a single smaller model.
When transferring the knowledge, it is better to use soft targets, which are created by the teacher networks, instead of the original labels, i.e., hard targets.
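A minimal sketch of soft targets, using the temperature-scaled softmax from Hinton et al.'s distillation; the logits and temperature below are made-up illustrations:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; T > 1 softens the distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=3.0):
    """Cross-entropy between teacher soft targets and student soft predictions."""
    soft_targets = softmax(teacher_logits, T)
    log_student = np.log(softmax(student_logits, T) + 1e-12)
    return -(soft_targets * log_student).sum(axis=-1).mean()

teacher_logits = np.array([[5.0, 2.0, 0.5]])   # a confident teacher prediction
student_logits = np.array([[4.0, 2.5, 0.3]])
loss = distillation_loss(student_logits, teacher_logits)
```

Unlike a hard label (e.g. class 0 only), the soft target also tells the student how much more likely class 1 is than class 2, which is the "dark knowledge" being transferred.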
Bayesian Dark Knowledge
Overview
Bayesian Dark Knowledge is a method combining SGLD with the concept of distillation.
SGLD is a useful method for learning Bayesian deep networks.
The problem is that SGLD needs to store many copies of the parameters.
The motivation is to replace a set of neural networks with a single deep network.
Methods
Algorithm
Points
The method does not require storing the teacher weights: in the distillation phase, θ is updated online.
The variance of the prior of teacher networks is smaller thanthe variance of the prior of student networks.
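To make the online scheme concrete, here is a toy sketch: a Bayesian logistic regression teacher distilled into a student of the same parametric form (in the paper both are deep networks; the data, sizes, step sizes, and input perturbation below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for a Bayesian logistic regression teacher (sizes illustrative).
N = 500
X = rng.normal(0.0, 1.0, (N, 2))
w_true = np.array([2.0, -1.0])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgld_grad(w, Xb, yb):
    """Stochastic gradient of the log posterior: N(0, 1) prior + scaled likelihood."""
    return -w + (N / len(Xb)) * Xb.T @ (yb - sigmoid(Xb @ w))

w_teacher = np.zeros(2)   # current SGLD posterior sample (never stored)
w_student = np.zeros(2)   # single model distilling the posterior predictive
eps, lr = 1e-3, 0.05

for t in range(3000):
    # 1) SGLD step on the teacher: one fresh posterior sample per iteration.
    idx = rng.integers(0, N, 32)
    g = sgld_grad(w_teacher, X[idx], y[idx])
    w_teacher += 0.5 * eps * g + rng.normal(0.0, np.sqrt(eps), 2)
    # 2) Online distillation: the student fits the current teacher's soft
    #    predictions on perturbed inputs; the teacher sample is then discarded.
    Xg = X[rng.integers(0, N, 32)] + rng.normal(0.0, 0.1, (32, 2))
    target = sigmoid(Xg @ w_teacher)
    pred = sigmoid(Xg @ w_student)
    w_student += lr * Xg.T @ (target - pred) / 32   # cross-entropy gradient step
```

Because the student averages over the stream of teacher samples, it ends up approximating the posterior predictive distribution without any posterior samples being archived.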
Results
How to improve the original Bayesian Dark Knowledge
How to improve?
SGLD phase
Slow mixing rate.
The above method does not consider the local geometric structure.
Distillation phase
We do not have knowledge about p(x).
We sample from the actual data only.
Preconditioned SGLD
p-SGLD
p-SGLD combines RMSprop with Riemannian SGLD.
RMSprop is an adaptive learning-rate method that accounts for curvature.
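A sketch of a pSGLD-style update on a badly scaled toy Gaussian, using an RMSprop running average as the diagonal preconditioner (the target and all constants are illustrative; the Γ correction term from the paper is dropped, as is common when the decay rate is close to 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Badly scaled toy target: N(0, diag([100, 0.01])). A single global step size
# mixes poorly here; a per-parameter preconditioner adapts to each scale.
var = np.array([100.0, 0.01])

def grad_log_p(theta):
    return -theta / var

theta = np.zeros(2)
v = np.ones(2)                  # RMSprop running average of squared gradients
alpha, lam, eps = 0.99, 1e-5, 1e-2
samples = []

for t in range(20000):
    g = grad_log_p(theta)
    v = alpha * v + (1 - alpha) * g * g
    G = 1.0 / (lam + np.sqrt(v))            # diagonal preconditioner
    # pSGLD update: preconditioned drift, injected noise with covariance eps * G.
    theta = theta + 0.5 * eps * G * g + rng.normal(0.0, np.sqrt(eps * G))
    samples.append(theta.copy())
```

The key design point is that the injected noise is scaled by the same preconditioner G as the drift, so each coordinate mixes at its own natural scale.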
References
A. Korattikara et al., "Bayesian Dark Knowledge"
G. Hinton et al., "Distilling the Knowledge in a Neural Network"
C. Li et al., "Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks"