[DL Reading Group] Bayesian Dark Knowledge
TRANSCRIPT
Bayesian Dark Knowledge
August 18, 2016
Contents
1 Introduction
2 Background Knowledge
3 Bayesian Dark Knowledge
4 How to improve the original Bayesian Dark Knowledge
Introduction
"Bayesian Dark Knowledge" is a method unifying SGLD with distillation.
SGLD is a method for learning large-scale Bayesian models such as Bayesian neural networks. SGLD makes it possible to avoid overfitting.
Distillation is a method for training student networks using soft labels created by teacher networks.
Background Knowledge
SGLD
MALA (Metropolis-Adjusted Langevin Algorithm)
The objective: sample from p(θ), which is often the posterior p(θ | Data). The method is based on a Langevin diffusion with stationary distribution p(θ), defined by
dθ(t) = (1/2) ∇θL(θ(t)) dt + N(0, I dt).
However, isotropic diffusion is inefficient.
So a preconditioning matrix M is introduced; M is usually set to the inverse of the Fisher information matrix.
dθ(t) = (1/2) M ∇θL(θ(t)) dt + N(0, M dt).
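As a concrete illustration, here is a minimal MALA sketch on a toy 1-D target (a standard normal; the target, step size, and all names below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target: standard normal, so L(theta) = log p(theta) up to a constant.
def log_p(theta):
    return -0.5 * theta ** 2

def grad_log_p(theta):
    return -theta

def mala_step(theta, eps):
    """One Metropolis-adjusted Langevin step (isotropic case, M = I)."""
    prop = theta + 0.5 * eps * grad_log_p(theta) + rng.normal(0.0, np.sqrt(eps))

    def log_q(b, a):  # log Gaussian proposal density q(b | a), up to a constant
        mean = a + 0.5 * eps * grad_log_p(a)
        return -((b - mean) ** 2) / (2 * eps)

    log_alpha = log_p(prop) + log_q(theta, prop) - log_p(theta) - log_q(prop, theta)
    return prop if np.log(rng.uniform()) < log_alpha else theta

theta, samples = 3.0, []
for _ in range(5000):
    theta = mala_step(theta, eps=0.5)
    samples.append(theta)
# After burn-in, the samples should roughly follow N(0, 1).
```

The accept/reject step corrects the discretization error of the diffusion; SGLD, introduced next, drops it.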
SGLD
SGLD is a method combining SGD with MALA.
The formula is as follows:
∆θt = (ϵt/2) (∇ log p(θt) + (N/n) ∑i=1..n ∇ log p(xti | θt)) + ηt,  ηt ∼ N(0, ϵt).
Note that in SGD, the noise term is absent.
Rejection rates go to zero asymptotically, so the Metropolis accept/reject step can be omitted.
In the initial phase, the update behaves like SGD.
In the latter phase, the update behaves like MALA.
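The update formula above can be sketched on a toy model (posterior over the mean of a unit-variance Gaussian; the model, data, and fixed step size are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all names and constants are illustrative): posterior over the
# mean theta of a unit-variance Gaussian, with a N(0, 10) prior.
N = 1000                         # full dataset size
data = rng.normal(2.0, 1.0, N)

def grad_log_prior(theta):
    return -theta / 10.0         # gradient of log N(0, 10)

def grad_log_lik(x, theta):
    return x - theta             # per-example gradient (unit variance)

def sgld_step(theta, minibatch, eps):
    """One SGLD update: scaled stochastic gradient plus N(0, eps) noise."""
    n = len(minibatch)
    grad = grad_log_prior(theta) + (N / n) * grad_log_lik(minibatch, theta).sum()
    return theta + 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps))

theta, samples = 0.0, []
for t in range(4000):
    batch = rng.choice(data, size=32)
    theta = sgld_step(theta, batch, eps=1e-3)  # fixed here; the paper decays eps_t
    samples.append(theta)
# Late iterates approximate posterior samples, concentrated near the data mean.
```

With a decaying ϵt, early iterates are dominated by the gradient (SGD-like) and late iterates by the injected noise (MALA-like), matching the two phases above.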
Distillation
By training ensembles of networks or large networks, we can get good accuracy.
Networks like these are called teacher networks. However, their model size is large.
After training the teacher networks, we want to transfer the knowledge in the learned function into a single smaller model.
When transferring the knowledge, it is better to use soft targets, which are created by the teacher networks, instead of the original labels, i.e., hard targets.
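A minimal sketch of soft targets, using the temperature-scaled softmax from Hinton et al.'s distillation; the logits and temperature below are made-up illustrations:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; T > 1 softens the distribution."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=3.0):
    """Cross-entropy between teacher soft targets and student soft predictions."""
    soft_targets = softmax(teacher_logits, T)
    log_student = np.log(softmax(student_logits, T) + 1e-12)
    return -(soft_targets * log_student).sum(axis=-1).mean()

teacher_logits = np.array([[5.0, 2.0, 0.5]])   # a confident teacher prediction
student_logits = np.array([[4.0, 2.5, 0.3]])
loss = distillation_loss(student_logits, teacher_logits)
```

Unlike a hard label (e.g. class 0 only), the soft target also tells the student how much more likely class 1 is than class 2, which is the "dark knowledge" being transferred.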
Bayesian Dark Knowledge
Overview
Bayesian Dark Knowledge is a method combining SGLD with the concept of distillation.
SGLD is a useful method for learning Bayesian deep networks.
The problem is that SGLD needs to store many copies of the parameters.
The motivation is to replace a set of neural networks with a single deep network.
Methods
Algorithm
Points
The method does not require storing the teacher weights: in the distillation phase, θ is updated online.
The variance of the prior of teacher networks is smaller thanthe variance of the prior of student networks.
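To make the online scheme concrete, here is a toy sketch: a Bayesian logistic regression teacher distilled into a student of the same parametric form (in the paper both are deep networks; the data, sizes, step sizes, and input perturbation below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data for a Bayesian logistic regression teacher (sizes illustrative).
N = 500
X = rng.normal(0.0, 1.0, (N, 2))
w_true = np.array([2.0, -1.0])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgld_grad(w, Xb, yb):
    """Stochastic gradient of the log posterior: N(0, 1) prior + scaled likelihood."""
    return -w + (N / len(Xb)) * Xb.T @ (yb - sigmoid(Xb @ w))

w_teacher = np.zeros(2)   # current SGLD posterior sample (never stored)
w_student = np.zeros(2)   # single model distilling the posterior predictive
eps, lr = 1e-3, 0.05

for t in range(3000):
    # 1) SGLD step on the teacher: one fresh posterior sample per iteration.
    idx = rng.integers(0, N, 32)
    g = sgld_grad(w_teacher, X[idx], y[idx])
    w_teacher += 0.5 * eps * g + rng.normal(0.0, np.sqrt(eps), 2)
    # 2) Online distillation: the student fits the current teacher's soft
    #    predictions on perturbed inputs; the teacher sample is then discarded.
    Xg = X[rng.integers(0, N, 32)] + rng.normal(0.0, 0.1, (32, 2))
    target = sigmoid(Xg @ w_teacher)
    pred = sigmoid(Xg @ w_student)
    w_student += lr * Xg.T @ (target - pred) / 32   # cross-entropy gradient step
```

Because the student averages over the stream of teacher samples, it ends up approximating the posterior predictive distribution without any posterior samples being archived.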
Results
How to improve the original Bayesian Dark Knowledge
How to improve?
SGLD phase
Slow mixing rate.
The above method does not consider the local geometric structure.
Distillation phase
We do not have knowledge about p(x).
We sample from the actual data only.
Preconditioned SGLD
p-SGLD
p-SGLD combines RMSprop with Riemannian SGLD.
RMSprop is an adaptive learning-rate method that accounts for curvature.
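A sketch of a pSGLD-style update on a badly scaled toy Gaussian, using an RMSprop running average as the diagonal preconditioner (the target and all constants are illustrative; the Γ correction term from the paper is dropped, as is common when the decay rate is close to 1):

```python
import numpy as np

rng = np.random.default_rng(0)

# Badly scaled toy target: N(0, diag([100, 0.01])). A single global step size
# mixes poorly here; a per-parameter preconditioner adapts to each scale.
var = np.array([100.0, 0.01])

def grad_log_p(theta):
    return -theta / var

theta = np.zeros(2)
v = np.ones(2)                  # RMSprop running average of squared gradients
alpha, lam, eps = 0.99, 1e-5, 1e-2
samples = []

for t in range(20000):
    g = grad_log_p(theta)
    v = alpha * v + (1 - alpha) * g * g
    G = 1.0 / (lam + np.sqrt(v))            # diagonal preconditioner
    # pSGLD update: preconditioned drift, injected noise with covariance eps * G.
    theta = theta + 0.5 * eps * G * g + rng.normal(0.0, np.sqrt(eps * G))
    samples.append(theta.copy())
```

The key design point is that the injected noise is scaled by the same preconditioner G as the drift, so each coordinate mixes at its own natural scale.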
References
A. Korattikara et al., "Bayesian Dark Knowledge"
G. Hinton et al., "Distilling the Knowledge in a Neural Network"
C. Li et al., "Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks"