[DL Reading Group] Bayesian Dark Knowledge

Bayesian Dark Knowledge August 18, 2016

Upload: deeplearningjp2016

Post on 08-Jan-2017



Contents

1 Introduction

2 Background Knowledge

3 Bayesian Dark Knowledge

4 How to improve the original Bayesian Dark Knowledge

Introduction

"Bayesian Dark Knowledge" is a method that unifies SGLD (stochastic gradient Langevin dynamics) with distillation.

SGLD is a method for learning large-scale Bayesian models such as Bayesian neural networks. Because it samples from the posterior instead of committing to a single point estimate, SGLD helps avoid overfitting.

Distillation is a method for training student networks using soft labels created by teacher networks.

Background Knowledge

SGLD

MLA (Metropolis-Adjusted Langevin Algorithm, commonly abbreviated MALA)

The objective: sample from p(θ), which is often the posterior p(θ|Data). The method is based on a Langevin diffusion with stationary distribution p(θ), defined by

\[ d\theta(t) = \frac{1}{2} \nabla_\theta L(\theta(t))\,dt + \mathcal{N}(0, I\,dt), \]

where L(θ) = log p(θ).

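However, isotropic diffusion is inefficient when the parameters have very different scales or are strongly correlated.

So a preconditioning matrix M is introduced. M is usually set to the inverse of the Fisher information matrix.

\[ d\theta(t) = \frac{1}{2} M \nabla_\theta L(\theta(t))\,dt + \mathcal{N}(0, M\,dt). \]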
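As a rough illustration of the two diffusions above, here is a minimal sketch of their Euler discretization. It omits the Metropolis accept/reject step, so it is the unadjusted Langevin algorithm rather than full MLA. The toy target (a zero-mean Gaussian whose two coordinates have very different scales), the step sizes, and the function names are all assumptions made for illustration. M is set to the target covariance, which for this Gaussian equals the inverse Fisher information.

```python
import numpy as np

def langevin_samples(grad_log_p, theta0, step, n_steps, M=None, seed=0):
    """Euler discretization of d(theta) = 1/2 M grad L dt + N(0, M dt)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    d = theta.size
    M = np.eye(d) if M is None else np.asarray(M, dtype=float)
    chol = np.linalg.cholesky(M)          # chol @ z has covariance M
    out = np.empty((n_steps, d))
    for t in range(n_steps):
        noise = np.sqrt(step) * (chol @ rng.standard_normal(d))
        theta = theta + 0.5 * step * (M @ grad_log_p(theta)) + noise
        out[t] = theta
    return out

# Assumed toy target: N(0, diag(100, 0.01)).
cov = np.diag([100.0, 0.01])
prec = np.linalg.inv(cov)

def grad_log_p(theta):
    return -prec @ theta                  # gradient of L = log p

# Isotropic diffusion (M = I) versus preconditioned diffusion (M = cov).
plain = langevin_samples(grad_log_p, [0.0, 0.0], step=0.01, n_steps=20000)
precond = langevin_samples(grad_log_p, [0.0, 0.0], step=0.1, n_steps=20000, M=cov)
```

In this toy setting the isotropic chain must keep its step below 0.04 to stay stable on the narrow coordinate, so it explores the wide coordinate slowly. With M = cov, both coordinates relax at the same rate.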

SGLD

SGLD combines SGD with MLA: it takes minibatch gradient steps as in SGD and injects Gaussian noise as in Langevin dynamics.

The formula is as follows:

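\[ \Delta\theta_t = \frac{\epsilon_t}{2} \left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{i=1}^{n} \nabla \log p(x_{ti} \mid \theta_t) \right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \epsilon_t). \]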

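Note that removing the noise term η_t recovers SGD.

As the step size ε_t decreases, the rejection rate goes to zero asymptotically, so the Metropolis accept/reject step can be skipped.

In the initial phase, the update behaves like SGD (the gradient term dominates).

In the later phase, the update behaves like MLA (the injected noise dominates).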
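The SGLD update can be sketched on a toy problem as follows. Everything concrete here is an assumption made for illustration: the model (Gaussian data with unknown mean θ and a Gaussian prior), the polynomially decaying step-size schedule, and all names. N is the dataset size and n the minibatch size, matching the N/n factor in the update rule.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: x_i ~ N(theta, 1), prior theta ~ N(0, 10^2).
N, n = 1000, 10                        # dataset size, minibatch size
x = rng.normal(2.0, 1.0, size=N)

def grad_log_prior(theta):
    return -theta / 10.0**2

def grad_log_lik(x_batch, theta):
    return x_batch - theta             # one gradient per data point

def step_size(t, a=1e-3, b=1.0, gamma=0.55):
    return a * (b + t) ** (-gamma)     # assumed decaying schedule

theta = 0.0
samples = []
for t in range(5000):
    eps = step_size(t)
    batch = x[rng.choice(N, size=n, replace=False)]
    grad = grad_log_prior(theta) + (N / n) * grad_log_lik(batch, theta).sum()
    eta = rng.normal(0.0, np.sqrt(eps))            # eta_t ~ N(0, eps_t)
    theta = theta + 0.5 * eps * grad + eta         # drop eta to get SGD
    samples.append(theta)

samples = np.array(samples)
posterior_mean_estimate = samples[1000:].mean()    # discard burn-in
```

No accept/reject test appears in the loop. The sketch relies on the decreasing step size to make the rejection rate negligible.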

Distillation

Training an ensemble of networks, or a single large network, yields good accuracy.

Such networks are called teacher networks. However, their model size is large.

After training the teacher networks, we want to transfer the knowledge they encode, viewed as a function, into a single smaller model (the student).

When transferring the knowledge, it is better to use soft targets, which are produced by the teacher networks, instead of the original labels, i.e., hard targets.
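To make the soft-target idea concrete, here is a minimal sketch of a distillation loss. The logits are made-up numbers and the function names are hypothetical. The temperature parameter comes from the distillation paper and is not mentioned on these slides, so it is an assumption here.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_targets(teacher_logits_list, temperature=1.0):
    """Average the class probabilities of an ensemble of teachers."""
    probs = [softmax(l, temperature) for l in teacher_logits_list]
    return np.mean(probs, axis=0)

def distillation_loss(student_logits, targets, temperature=1.0):
    """Cross-entropy between soft targets and student predictions."""
    log_q = np.log(softmax(student_logits, temperature) + 1e-12)
    return -(targets * log_q).sum(axis=-1).mean()

# Hypothetical logits for 2 inputs and 3 classes, from 2 teachers.
teachers = [np.array([[4.0, 1.0, 0.0], [0.0, 2.0, 1.0]]),
            np.array([[3.0, 2.0, 0.0], [0.5, 2.5, 0.0]])]
targets = soft_targets(teachers, temperature=2.0)
hard = np.eye(3)[targets.argmax(axis=1)]      # hard labels, for comparison

student = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 0.5]])
loss = distillation_loss(student, targets, temperature=2.0)
```

The hard labels keep only the winning class. The soft targets also carry the teachers' relative confidence in the other classes, which is the extra signal the student learns from.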

Bayesian Dark Knowledge

Overview

Bayesian Dark Knowledge is a method that combines SGLD with the concept of distillation.

SGLD is a useful method for learning Bayesian deep networks.

The problem is that SGLD needs to archive many copies of the parameters, one per posterior sample, and to evaluate all of them at prediction time.

The motivation is to replace this set of neural networks with a single deep network.

Methods

Algorithm

Points

The method does not need to archive the weights. In the distillation phase, θ is updated online, and the student is trained against the current sample only.

The variance of the prior of the teacher networks is smaller than that of the student networks.
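The online scheme described in these points can be sketched as a single loop that interleaves one SGLD step for the teacher with one SGD step for the student. This is a simplified illustration under stated assumptions, not the paper's exact algorithm: both models are logistic regressions, the step sizes are constant, the student is updated at every iteration, and the distillation inputs are drawn from the training data. The data, the hyperparameter values, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: binary labels from a 2-D logistic model.
N, n = 500, 20
X = rng.normal(size=(N, 2))
true_w = np.array([2.0, -1.0])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

def predict(X, w):
    return 1.0 / (1.0 + np.exp(-X @ w))       # p(y = 1 | x, w)

theta = np.zeros(2)    # teacher parameters, sampled by SGLD
w = np.zeros(2)        # student parameters, trained by SGD
lam, gam = 1.0, 1e-3   # prior precisions: the teacher prior has smaller variance
eps, rho = 1e-3, 0.05  # assumed constant step sizes

for t in range(5000):
    # Teacher: one SGLD step on a minibatch.
    idx = rng.choice(N, size=n, replace=False)
    grad_lik = X[idx].T @ (y[idx] - predict(X[idx], theta))
    grad = -lam * theta + (N / n) * grad_lik
    theta = theta + 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps), size=2)

    # Student: one SGD step toward the current teacher sample only,
    # so no past value of theta has to be stored.
    idx2 = rng.choice(N, size=n, replace=False)
    Xs = X[idx2]
    soft = predict(Xs, theta)                  # teacher soft labels
    grad_w = Xs.T @ (predict(Xs, w) - soft) / n + gam * w
    w = w - rho * grad_w
```

Because the student sees a different teacher sample at every step, its SGD iterates average over the teacher samples, and w ends up approximating the teacher's predictive distribution without any archive of θ.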

Results

How to improve the original Bayesian Dark Knowledge

How to improve?

SGLD phase

Slow mixing rate.

Plain SGLD does not take the local geometric structure of the posterior into account.

Distillation phase

We have no knowledge of p(x), the distribution of the inputs.

We sample from the actual data only.

Preconditioned SGLD

p-SGLD

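p-SGLD combines RMSprop with Riemannian SGLD.

RMSprop is an adaptive learning-rate method: it rescales each parameter's step by a running average of its squared gradients, which serves as a cheap proxy for the curvature.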
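A minimal sketch of a p-SGLD style update follows, assuming a diagonal preconditioner built from the RMSprop statistic. Two simplifications are made: the gradient is computed on the full toy target instead of a minibatch, and the correction term that accounts for how the preconditioner changes with θ is dropped. The target, the hyperparameter values, and the function names are assumptions.

```python
import numpy as np

def psgld(grad_log_post, theta0, eps=1e-3, alpha=0.99, lam=1e-5,
          n_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    V = np.zeros_like(theta)                    # running avg of squared grads
    out = np.empty((n_steps, theta.size))
    for t in range(n_steps):
        g = grad_log_post(theta)
        V = alpha * V + (1.0 - alpha) * g * g   # RMSprop statistic
        G = 1.0 / (lam + np.sqrt(V))            # diagonal preconditioner
        noise = rng.standard_normal(theta.size) * np.sqrt(eps * G)
        theta = theta + 0.5 * eps * G * g + noise
        out[t] = theta
    return out

# Assumed toy target: N(0, diag(100, 0.01)), i.e. badly scaled coordinates.
variances = np.array([100.0, 0.01])

def grad_log_post(theta):
    return -theta / variances

samples = psgld(grad_log_post, theta0=[1.0, 1.0], eps=0.05)
```

The preconditioner G gives the wide coordinate a larger effective step and noise scale than the narrow one, which is how the update adapts to the local geometry that plain SGLD ignores.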

References

A. Korattikara et al., "Bayesian Dark Knowledge"

G. Hinton et al., "Distilling the Knowledge in a Neural Network"

C. Li et al., "Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks"