
Page 1

Fast dropout training

ICML 2013 Sida Wang, Chris Manning

Page 2

What is dropout training

•  Introduced by Hinton et al. in “Improving neural networks by preventing co-adaptation of feature detectors”

•  Randomly select some inputs for each unit
•  Zero them
•  Compute the gradient, make an update

•  Repeat (a minimal training-loop sketch follows below)
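As a concrete illustration (not the paper's code), here is a minimal numpy sketch of Monte Carlo dropout training for binary logistic regression; the function name, fixed learning rate, and epoch count are my own assumptions:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mc_dropout_sgd(X, y, p_keep=0.5, lr=0.1, epochs=5, seed=0):
    """A possible MC-dropout SGD loop for binary logistic regression.
    X: (n, m) data, y: (n,) labels in {0, 1}, p_keep: probability of keeping an input."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(epochs):
        for i in rng.permutation(n):
            z = rng.binomial(1, p_keep, size=m)   # randomly select inputs to keep
            xz = z * X[i]                         # zero out the dropped inputs
            grad = (y[i] - sigmoid(w @ xz)) * xz  # gradient of the log-likelihood
            w += lr * grad                        # make an update, then repeat
    return w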

Page 3

Dropout training is promising

•  Won the ImageNet challenge by a large margin

•  George Dahl et al. won the Merck challenge

•  Dropout seems to be an important ingredient

Page 4

In this work…

•  Sampling is inefficient – we integrate instead

•  With 50% random dropout, after 5 passes of the data, (1/2)^5 = 1/32 ≈ 3% of the data is still unseen

•  Dropout as Bayesian model selection, equivalence to making the weights random

Page 5

CLT and the Gaussian approximation

•  Sample from the matching Gaussian

… and empirical evidence. We focus on logistic regression for simplicity, but the idea is also applicable to other probabilistic discriminative models, and to neural networks.

2 The implied objective function, and its faster approximations

To avoid unnecessarily cumbersome notation, we illustrate the main idea with binary logistic regression (LR) with a single training vector x, and label y ∈ {0, 1}.

2.1 The implied objective function for dropout training

To train LR with dropout on data with dimension m, first sample z_i ∼ Bernoulli(p_i) for i = 1...m. Here p_i is the probability of not dropping out input x_i. For neural networks, x is the activation of the previous layer. For LR, x is the data. Typically, p_bias = 1 and p = 0.1–0.9 for everything else. After sampling z = {z_i}_{i=1...m} we can compute the stochastic gradient descent (SGD) update as follows:

Δw = (y − σ(w^T D_z x)) D_z x    (1)

where D_z = diag(z) ∈ R^{m×m}, and σ(x) = 1/(1 + e^{−x}) is the logistic function.

This update rule, applied over the training data over multiple passes, can be seen as a Monte Carlo approximation to the following gradient:

Δw̄ = E_{z; z_i ∼ Bernoulli(p_i)}[(y − σ(w^T D_z x)) D_z x]    (2)

The objective function with the above gradient is the expected conditional log-likelihood of the label given the data with dropped-out dimensions indicated by z, for y ∼ Bernoulli(σ(w^T D_z x)). This is the implied objective function for dropout training:

L(w) = E_z[log p(y | D_z x; w)] = E_z[y log σ(w^T D_z x) + (1 − y) log(1 − σ(w^T D_z x))]    (3)

Since we are just taking an expectation, the objective is still concave provided that the original log-likelihood is concave, which it is for logistic regression.

Evaluating the expectation in (2) naively by summing over all possible z has complexity O(2^m m). Rather than directly computing the expectation with respect to z, we propose a variable transformation that allows us to compute the expectation with respect to a simple random variable Y ∈ R, instead of z ∈ R^m.

2.2 Faster approximations to the dropout objective

We make the observation that evaluating the objective function L(w) involves taking the expectation with respect to the variable Y(z) = w^T D_z x = Σ_{i=1}^{m} w_i x_i z_i, a weighted sum of Bernoulli random variables. For most machine learning problems, {w_i} typically form a unimodal distribution centered at 0, and {x_i} is either unimodal or in a fixed interval. In this case, Y can be well approximated by a normal distribution even for relatively low dimensional data with m = 10. More technically, the Lyapunov condition is generally satisfied for a weighted sum of Bernoulli random variables of the form Y that are weighted by real data [5]. Then, Lyapunov's central limit theorem states that Y(z) tends to a normal distribution as m → ∞. We empirically verify that the approximation is good for typical datasets of moderate dimensions, except when a couple of dimensions dominate all others. Finally, let

S = E_z[Y(z)] + √(Var[Y(z)]) ε = µ_S + σ_S ε    (4)

be the approximating Gaussian, where ε ∼ N(0, 1), E_z[Y(z)] = Σ_{i=1}^{m} p_i w_i x_i, and Var[Y(z)] = Σ_{i=1}^{m} p_i (1 − p_i) (w_i x_i)².
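A small numpy check of equation (4) under assumed toy values: compute µ_S and σ_S² in closed form and compare them with the empirical mean and variance of Y(z) = w^T D_z x over many sampled dropout masks.

import numpy as np

rng = np.random.default_rng(0)
m = 50
w = rng.normal(size=m)      # toy weights
x = rng.normal(size=m)      # toy input
p = np.full(m, 0.5)         # keep probabilities p_i

# Closed-form moments of Y(z) = sum_i w_i x_i z_i, eq. (4)
mu_S = np.sum(p * w * x)
var_S = np.sum(p * (1 - p) * (w * x) ** 2)

# Empirical moments from Monte Carlo dropout masks
Z = rng.binomial(1, p, size=(100000, m))
Y = Z @ (w * x)
print(mu_S, Y.mean())       # should agree closely
print(var_S, Y.var())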

2.2.1 Gaussian approximation

Given good convergence, we note that drawing samples of the approximating Gaussian S of Y(z), a constant time operation, is much cheaper than drawing samples directly of Y(z), which takes O(m). This effect is very significant for high dimensional datasets. So without doing much, we can already approximate the objective function (3) m times faster by sampling from S instead of the MC dropout method (it would be silly to permanently corrupt predetermined data dimensions).
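To make the cost argument concrete, here is a sketch (toy setup, not from the paper) of estimating the expected log-likelihood E_z[log σ(Y(z))] for a positive example two ways: drawing K dropout masks costs O(Km), while computing (µ_S, σ_S) once in O(m) and then drawing K scalar Gaussian samples costs O(m + K).

import numpy as np

def log_sigmoid(t):
    # numerically stable log sigma(t)
    return -np.logaddexp(0.0, -t)

rng = np.random.default_rng(1)
m, K = 5000, 200
w, x = rng.normal(size=m), rng.normal(size=m)
p = 0.5

# O(K m): direct MC dropout estimate
Z = rng.binomial(1, p, size=(K, m))
est_mc = log_sigmoid(Z @ (w * x)).mean()

# O(m + K): Gaussian approximation, sample the scalar S instead
mu_S = p * np.sum(w * x)
sigma_S = np.sqrt(p * (1 - p) * np.sum((w * x) ** 2))
est_gauss = log_sigmoid(rng.normal(mu_S, sigma_S, size=K)).mean()
print(est_mc, est_gauss)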


Page 6

This method works well

Substituting in σ(x) ≈ Φ(√(π/8) x), we get

∫_{−∞}^{∞} σ(x) N(x|µ, s²) dx ≈ σ( µ / √(1 + πs²/8) )    (8)

This is an approximation that is used for Bayesian prediction when the posterior is approximated by a Gaussian [MacKay]. As we now have a closed-form approximation of α, one can also obtain expressions for β and γ by differentiating.

Furthermore, we can even approximate the objective function (3) in a closed form that is easily differentiable. The key expression is

E_{Y∼N(µ,s²)}[log σ(Y)] = ∫_{−∞}^{∞} log σ(x) N(x|µ, s²) dx    (9)
                        ≈ √(1 + πs²/8) · log σ( µ / √(1 + πs²/8) )    (10)

The actual objective as defined in (3) can be obtained from the above by observing that 1 − σ(x) = σ(−x). The gradient or Hessian with respect to w can be found by differentiating analytically.

Finally, we can approximate the output variance

Var_{Y∼N(µ,s²)}[σ(Y)] = E[σ(Y)²] − E[σ(Y)]²    (11)
                      ≈ E[σ(a(Y − b))] − E[σ(Y)]²    (12)
                      ≈ σ( a(µ − b) / √(1 + πa²s²/8) ) − σ( µ / √(1 + πs²/8) )²    (13)

By matching the values and derivatives at the point where σ(x)² = 1/2, reasonable values are a = 4 − 2√2 and b = −log(√2 − 1).
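The closed-form approximations above are easy to code. The sketch below (my own helper names, not the paper's code) implements the approximate E[σ(Y)], E[log σ(Y)], and Var[σ(Y)] for Y ∼ N(µ, s²) as in (8)–(13), and checks them against Monte Carlo estimates.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def expected_sigmoid(mu, s2):
    # eq. (8): E[sigma(Y)] for Y ~ N(mu, s2)
    return sigmoid(mu / np.sqrt(1.0 + np.pi * s2 / 8.0))

def expected_log_sigmoid(mu, s2):
    # eqs. (9)-(10): E[log sigma(Y)]
    k = np.sqrt(1.0 + np.pi * s2 / 8.0)
    return k * np.log(sigmoid(mu / k))

def var_sigmoid(mu, s2):
    # eqs. (11)-(13), with a = 4 - 2*sqrt(2), b = -log(sqrt(2) - 1)
    a = 4.0 - 2.0 * np.sqrt(2.0)
    b = -np.log(np.sqrt(2.0) - 1.0)
    e_sq = sigmoid(a * (mu - b) / np.sqrt(1.0 + np.pi * a * a * s2 / 8.0))
    return e_sq - expected_sigmoid(mu, s2) ** 2

mu, s2 = 1.0, 4.0
Y = np.random.default_rng(0).normal(mu, np.sqrt(s2), size=200000)
print(expected_sigmoid(mu, s2), sigmoid(Y).mean())
print(expected_log_sigmoid(mu, s2), np.log(sigmoid(Y)).mean())
print(var_sigmoid(mu, s2), sigmoid(Y).var())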

3 Experiments

[Figure 1: Validation error rate (roughly 0.1–0.5) vs. time spent in training (left) and number of training iterations (right), on log scales, for Plain LR, the Gaussian approximation, and MC dropout. Trained using batch gradient descent with Wolfe line search on the 20-newsgroup subtask alt.atheism vs. religion.misc. 100 samples are used. For MC dropout, z_i is sampled only for non-zero x_i, with a dropout rate of 0.5.]

The accuracy and time taken are listed in table 1 for the datasets described in section A.1. The Gaussian approximation is generally around 10 times faster than MC dropout and performs comparably to NBSVM in [3]. Further speedup is possible by using one of the deterministic approximations instead of sampling. While each iteration of the Gaussian approximation is still slower than LR, it sometimes reaches a better validation performance in less time.


•  Reduces complexity from O(Md) to O(M+d) for M samples and data dimension d.

Page 7

Integrating everything

•  We need to compute several expectations


[Figure: the approximated expectation E[σ(Y)] plotted against µ (roughly −5 to 5) for several values of s, using the closed-form Gaussian approximation.]

This integral can be evaluated exactly for the rectified linear unit f(x) = max(0, x). Let r = µ/s; then

ν = ∫_{−∞}^{∞} f(x) N(x|µ, s²) dx = Φ(r) µ + s N(r|0, 1)    (13)

With dropout training, each hidden unit also has an output variance, which can be approximated fairly well (see A.4).
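A quick numerical check of equation (13) for the rectified linear unit, using scipy's standard normal cdf and pdf (a sketch of my own, not the paper's code):

import numpy as np
from scipy.stats import norm

def relu_gaussian_mean(mu, s):
    # nu = Phi(r) * mu + s * N(r | 0, 1), with r = mu / s, eq. (13)
    r = mu / s
    return norm.cdf(r) * mu + s * norm.pdf(r)

mu, s = 0.7, 2.0
samples = np.random.default_rng(0).normal(mu, s, size=200000)
print(relu_gaussian_mean(mu, s), np.maximum(0.0, samples).mean())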

[Figure 1: Top: MC dropout covariance matrix of the inputs of 10 random hidden units; top-left: trained to convergence; top-right: at random initialization. The covariance is not completely diagonal once trained to convergence. Bottom: empirical input distribution of the input of a hidden unit; bottom-left: trained to convergence; bottom-right: at initialization. We lose almost nothing here.]

3.2 Training with backpropagation

The resulting neural network can be trained by backpropagation with an additional set of derivatives. In normal backpropagation, one only needs to keep ∂L/∂µ_i for each hidden unit i with input mean µ_i. For approximate dropout training, we need ∂L/∂s_i² as well for the input variance s_i². Here

µ_i = p Σ_j w_ij ν'_j   and   s_i² = Σ_j [ p(1 − p) ν'_j² + p τ'_j² ] w_ij²

where ν'_j, τ'_j² are the output mean and variance of the previous layer.
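The formulas above suggest a forward pass that propagates a mean and a variance per unit instead of a single activation. Below is a rough numpy sketch of one sigmoid hidden layer under this scheme; the layer sizes, helper names, and uniform keep probability p are my own assumptions, and the sigmoid output mean/variance uses the closed-form approximations from the excerpts above.

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_mean_var(mu, s2):
    # output mean/variance of a sigmoid unit with Gaussian input
    a, b = 4.0 - 2.0 * np.sqrt(2.0), -np.log(np.sqrt(2.0) - 1.0)
    nu = sigmoid(mu / np.sqrt(1.0 + np.pi * s2 / 8.0))
    tau2 = sigmoid(a * (mu - b) / np.sqrt(1.0 + np.pi * a * a * s2 / 8.0)) - nu ** 2
    return nu, np.maximum(tau2, 0.0)

def fast_dropout_layer(nu_prev, tau2_prev, W, p=0.5):
    """Propagate (mean, variance) through one dropout + sigmoid layer.
    nu_prev, tau2_prev: previous layer's output mean/variance (d_in,); W: (d_out, d_in)."""
    mu = p * (W @ nu_prev)                                         # input means mu_i
    s2 = (W ** 2) @ (p * (1 - p) * nu_prev ** 2 + p * tau2_prev)   # input variances s_i^2
    return sigmoid_mean_var(mu, s2)

rng = np.random.default_rng(0)
x = rng.normal(size=20)   # for the first layer, the "previous output" is the data with zero variance
nu, tau2 = fast_dropout_layer(x, np.zeros_like(x), rng.normal(size=(10, 20)) * 0.3)
print(nu.shape, tau2.shape)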

3.3 The output layer

We still need to define what the cost function L is, which is task dependent. We outline how to do approximate dropout for the final layer for one-vs-rest logistic units, linear units under squared loss, and softmax units under their representative cost functions.

Logistic units with the cross-entropy loss function can be well approximated using the following:

E_{X∼N(µ,s²)}[log σ(X)] = ∫_{−∞}^{∞} log σ(x) N(x|µ, s²) dx    (14)
                        ≈ √(1 + πs²/8) · log σ( µ / √(1 + πs²/8) )    (15)

Linear units with squared error loss can be computed exactly:

E_{X∼N(µ,s²)}[(X − t)²] = ∫_{−∞}^{∞} (x − t)² N(x|µ, s²) dx    (16)
                        = s² + (µ − t)²    (17)

Since s² = Σ_j α w_j² x_j², this is L2 regularization.
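For the linear/squared-loss case the expectation is exact, and the extra s² term acts like a data-weighted L2 penalty. A tiny check under assumed toy values, comparing eq. (17) against Monte Carlo over dropout masks:

import numpy as np

rng = np.random.default_rng(0)
m, p, t = 30, 0.5, 0.5
w, x = rng.normal(size=m), rng.normal(size=m)

mu = p * np.sum(w * x)                     # E[w^T D_z x]
s2 = p * (1 - p) * np.sum((w * x) ** 2)    # Var[w^T D_z x]
print(s2 + (mu - t) ** 2)                  # eq. (17): exact expected squared loss

# Monte Carlo check over dropout masks
Z = rng.binomial(1, p, size=(200000, m))
print((((Z @ (w * x)) - t) ** 2).mean())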


Page 8

The objective function

•  For LR, the objective is the expected log-likelihood E_z[log p(y | D_z x; w)], approximated in closed form below

•  Other losses can be integrated in a similar way

In addition to taking the expectation as mentioned in 2.2.1, α, β, γ are functions of µ_s, σ_s² ∈ R, and can be computed by 1-d numerical integration as defined in (18), (19), and (20). Simpson's method on [µ − 4σ, µ + 4σ] works quite well. One can also integrate the logit-normal distribution parameterized by µ and σ on [0, 1], but that is less stable numerically due to potentially huge masses at either end [6]. Doing numerical integration online takes as long as using 200 samples. Because we do not need that much accuracy in practice, numerical integration is not recommended. For even faster speed, one can also tabulate α, β, γ instead of computing them as needed. The function is smoother if parameterized by µ/σ, σ instead. However, numerical integration and a lookup table fail to scale to multinomial LR. In that case, we can use the Unscented Transform (UST) instead of sampling [7]. The rest of the procedure remains unchanged from 2.2.1.
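As an illustration of the 1-d numerical integration route (a sketch, with scipy's Simpson rule standing in for whatever implementation the authors used), we can integrate E[log σ(X)] over [µ − 4σ, µ + 4σ] and compare with the closed-form approximation given in 2.2.3 below:

import numpy as np
from scipy.integrate import simpson
from scipy.stats import norm

def log_sigmoid(t):
    return -np.logaddexp(0.0, -t)

mu, sigma = 1.0, 3.0

# Simpson's rule on [mu - 4*sigma, mu + 4*sigma]
xs = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 401)
integrand = log_sigmoid(xs) * norm.pdf(xs, loc=mu, scale=sigma)
numeric = simpson(integrand, x=xs)

# Closed-form approximation
k = np.sqrt(1.0 + np.pi * sigma ** 2 / 8.0)
closed = k * log_sigmoid(mu / k)
print(numeric, closed)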

2.2.3 A closed-form approximation

Interestingly, a fairly accurate closed-form approximation is also possible by using the Gaussian cumulative distribution function Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt to approximate the logistic function. It can be shown, by parameter differentiation with respect to s and then integrating with respect to s, that

∫_{−∞}^{∞} Φ(λx) N(x|µ, s²) dx = Φ( µ / √(λ^{−2} + s²) )    (8)

Substituting in σ(x) ≈ Φ(√(π/8) x), we get

∫_{−∞}^{∞} σ(x) N(x|µ, s²) dx ≈ σ( µ / √(1 + πs²/8) )    (9)

This is an approximation that is used for Bayesian prediction when the posterior is approximated by a Gaussian [8]. As we now have a closed-form approximation of α, one can also obtain expressions for β and γ by differentiating.

Furthermore, we can even approximate the objective function (7) in a closed form that is easily differentiable:

E_{Y∼N(µ,s²)}[log σ(Y)] = ∫_{−∞}^{∞} log σ(x) N(x|µ, s²) dx    (10)
                        ≈ √(1 + πs²/8) · log σ( µ / √(1 + πs²/8) )    (11)

The actual objective as defined in (3) can be obtained from the above by observing that 1 − σ(x) = σ(−x). The gradient or Hessian with respect to w can be found by differentiating analytically.

3 Fast dropout for neural networks

Dropout training, as originally proposed, was intended for neural networks where hidden units are dropped out, instead of the data. Fast dropout is directly applicable to dropping out the final hidden layer of neural networks. In this section, we approximately extend our technique to deep neural networks and show how to do it for several popular types of hidden units and output units.

3.1 The hidden layers

Under dropout training, each hidden unit takes a random variable as input, and produces a random variable as output. Because of the CLT, we may approximate the inputs as Gaussians and characterize the outputs by their means and variances. An additional complication is that the inputs to hidden units have a covariance, which is close to diagonal in practice as shown in figure 1.

Consider any hidden unit in dropout training; we may approximate its input as a Gaussian variable X ∼ N(x|µ, s²), which leads to its output mean and variance ν, τ². For the commonly used sigmoid unit,

ν = ∫_{−∞}^{∞} σ(x) N(x|µ, s²) dx ≈ σ( µ / √(1 + πs²/8) )    (12)


Page 9

Document classification results

•  Improvements over plain LR and most previous methods. Sometimes state-of-the-art.

•  Average accuracy and time on 7 datasets:

Methods:   Plain LR | Dropout | Gaussian | Determ.
Accuracy:  84.97    | 86.88   | 86.99    | 86.87
Time:      92       | 2661    | 325      | 120

Page 10

Relation to Bayesian model selection – random weights

•  We are in effect maximizing a lower bound on the Bayesian evidence (see the derivation in the excerpt below)

•  Under the model with random weights w_i ∼ N(µ_i, αµ_i²)


Unfortunately, the best way to compute the cross-entropy loss for softmax seems to be sampling from the input Gaussian directly (see A.3). Sampling can be applied to any output units with any loss function. The unscented transform can be used in place of random sampling, trading accuracy for faster speed.

Unlike the best practice in real dropout, the test time procedure for approximate dropout is exactly the same as the training time procedure, with no weight halving needed. One shortcoming is that the training implementation does become more complicated, mainly in the backpropagation stage, while the network function is straightforward. In the case when we are not concerned about training time, we can still apply deterministic dropout at test time, instead of using the weight halving heuristic that does not exactly match the training objective that is optimized. This alone can give a noticeable improvement in test performance without much extra work.
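To make the test-time options concrete, here is a toy sketch (assumed weights and a single logistic output; not the paper's code) comparing weight scaling, averaging several real dropout samples, and the deterministic fast-dropout prediction:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
m, p = 200, 0.5
w, x = rng.normal(size=m) * 0.1, rng.normal(size=m)

# (a) weight scaling heuristic: use p*w deterministically
pred_scale = sigmoid(p * np.sum(w * x))

# (b) average many real dropout samples
Z = rng.binomial(1, p, size=(1000, m))
pred_mc = sigmoid(Z @ (w * x)).mean()

# (c) deterministic fast dropout: Gaussian input moments as in eq. (4), output via the sigmoid approximation
mu = p * np.sum(w * x)
s2 = p * (1 - p) * np.sum((w * x) ** 2)
pred_fd = sigmoid(mu / np.sqrt(1.0 + np.pi * s2 / 8.0))

print(pred_scale, pred_mc, pred_fd)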

4 Relation to Bayesian model selection

Once we make the Gaussian approximation, there is an alternative interpretation of where the variance comes from. In the dropout framework, the variance comes from the dropout variable z. Under the alternative interpretation where the weight is a random variable, we can view dropout training as maximizing a lower bound on the Bayesian marginal likelihood in a class of models M_µ indexed by µ ∈ R^m. Concretely, consider random variables v_i ∼ N(µ_i, αµ_i²); then the dropout objective

L(w) = E_{z; z_i ∼ Bernoulli(p_i)}[log p(y | w^T D_z x)]
     ≈ E_{Y ∼ N(E[w^T D_z x], Var[w^T D_z x])}[log p(y | Y)]
     = E_{v: v_i ∼ N(µ_i, αµ_i²)}[log p(y | v^T x)]
     ≤ log E_{v: v_i ∼ N(µ_i, αµ_i²)}[p(y | v^T x)]    (by Jensen's inequality)
     = log M_µ

where M_µ = ∫ p(D|v) p(v|µ) dv is the Bayesian evidence, p(v_i|µ_i) = N(v_i|µ_i, αµ_i²), and p(y|v^T x) = σ(v^T x)^y (1 − σ(v^T x))^{1−y} is the logistic model. For dropout training, µ = pw and α = (1 − p)/p.
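A small numerical check of the random-weight reading (toy values; my own sketch): with µ = pw and α = (1 − p)/p, the Gaussian weights v_i ∼ N(µ_i, αµ_i²) give v^T x the same mean and variance as w^T D_z x under dropout.

import numpy as np

rng = np.random.default_rng(0)
m, p = 100, 0.5
w, x = rng.normal(size=m), rng.normal(size=m)

# dropout view: Y = w^T D_z x, z_i ~ Bernoulli(p)
Z = rng.binomial(1, p, size=(200000, m))
Y = Z @ (w * x)

# random-weight view: v_i ~ N(mu_i, alpha * mu_i^2), mu = p*w, alpha = (1-p)/p
mu, alpha = p * w, (1 - p) / p
V = rng.normal(mu, np.sqrt(alpha) * np.abs(mu), size=(200000, m))
Yv = V @ x

print(Y.mean(), Yv.mean())   # means agree
print(Y.var(), Yv.var())     # variances agree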

Here the variance of v is tied to its magnitude, so a larger weight is only beneficial when it is robust to noise. While α can be determined by the dropout process, we are also free to choose α, and we find empirically that using a slightly larger α than that determined by dropout often performs slightly better.

5 Experiments

Figure 2 shows that the quality of the gradient approximation using Gaussian samples is comparable to the difference between different MC dropout runs with 200 samples. Figures 3 and 4 show that, under identical settings, the Gaussian approximation is much faster than MC dropout and has a very similar training profile. Both the Gaussian approximation and MC dropout training reduce the validation error rate by about 30% over plain LR when trained to convergence, without ever overfitting.

Method:    Plain | Aprx. Drp. | A.D.+Var | Prev. | SotA
#Errors:   182   | 124        | 110      | 150   | 30

Table 1: Neural network on MNIST: results on MNIST with 2 hidden layers of 1200 units each. A.D.+Var is approximate dropout plus more artificial variance by increasing α. Prev. is the best previous result without using domain knowledge [1]. SotA is the state-of-the-art result on the test set.


Page 11

Fast dropout in neural networks

•  The fast dropout idea can be applied to neural networks

•  Each hidden unit has an input mean and an input variance

•  Outputs a mean and variance


Page 12

Applies to different types of units

•  Sigmoid unit: ν ≈ σ( µ / √(1 + πs²/8) )

•  Rectified linear unit: ν = Φ(µ/s) µ + s N(µ/s | 0, 1)

•  And other objective functions: regression, MaxOut


Page 13

Neural networks training curve

•  Validation errors vs. training iterations

[Plot: number of validation errors (about 120–200) vs. epochs (up to 800) for MC dropout sgd, fast dropout sgd, fast dropout batch, and plain sgd.]

Figure 6. Validation errors vs. epochs: we used the exact SGD training schedule described in (Hinton et al., 2012) and a 784-800-800-10 2-hidden-layer neural network. This training schedule is presumably tuned for real dropout. Fast dropout performs similarly to real dropout but with less variance. Fast dropout batch: use batch L-BFGS and fast dropout, with validation error evaluated every 10 epochs.

… simplicity. Table 3 compares several test time methods on neural networks trained for MNIST and CIFAR using real dropout. Multiple real dropout samples and fast dropout provide a small but noticeable improvement over weight scaling.

6.5. Other experiments

The normal equations (10) show the contrast between additive and multiplicative L2 regularization. For linear regression, L2 regularization outperformed dropout on 10 datasets from UCI that we tried.¹ Results on 5 of them are shown in table 4.

Classification results using neural networks on small UCI datasets are shown in table 5, where fast dropout does better than plain neural networks in most cases.

7. Conclusions

We presented a way of getting the benefits of dropout training without actually sampling, thereby speeding up the process by an order of magnitude. For high dimensional datasets (over a few hundred), each iteration of fast dropout is less than 2 times slower than normal training. We provided a deterministic and easy-to-compute objective function approximately equivalent to that of real dropout training. One can optimize this objective using standard optimization

1 http://archive.ics.uci.edu/ml/

MNIST (number of errors)
           Full | Scale | D1  | D10 | D100 | FD
Test       129  | 108   | 199 | 118 | 105  | 103
Train      4    | 1     | 36  | 1   | 1    | 1
Time(s)    10   | 10    | 13  | 110 | 1.1K | 16

CIFAR-10 (percent error)
           Full | Scale | D1  | D10 | D100 | FD
Test       53   | 47    | 51  | 45  | 43   | 44
Train      42   | 35    | 42  | 33  | 32   | 33
Time(s)    17   | 23    | 25  | 230 | 2.2K | 29

Table 3. Different test time methods on networks trained with real dropout: a 784-800-800-10 neural network is trained with real dropout on MNIST (3072-1000-1000-10 for CIFAR-10) and tested using: Full: use all weights without scaling; Scale: w → pw; D(n): take n real dropout samples; FD: fast dropout.

Dataset   | L2 train | L2 test | FD train | FD test
Autos     | 0.25     | 0.51    | 0.41     | 0.57
Cardio    | 109.24   | 117.87  | 140.93   | 188.91
House     | 23.57    | 21.02   | 65.44    | 56.26
Liver     | 9.01     | 9.94    | 9.69     | 9.88
Webgrph   | 0.17     | 0.20    | 0.19     | 0.21

Table 4. Linear regression using eq. (10). Dropout performs worse than L2 regularization (recall that fast dropout is exact). While a digit is still easily recognizable when half of its dimensions are dropped out, dropout noise is excessive for the low dimensional regressors.

Classification accuracy
Dataset   | L2 train | L2 test | FD train | FD test
SmallM    | 100      | 87      | 99       | 90
USPS      | 100      | 95      | 98       | 96
Isolet    | 100      | 91      | 95       | 93
Hepatitis | 100      | 94      | 99       | 95
Soybean   | 100      | 91      | 94       | 89

Table 5. Classification results on various datasets: we used an M-200-100-K neural network, and cross-validated the parameters.

methods, whereas standard methods are of limited use in real dropout because we only have a noisy measurement of the gradient. Furthermore, since fast dropout is not losing any information in individual training cases from sampling, it is capable of doing more work in each iteration, often reaching the same validation set performance in a shorter time and in fewer iterations.

Acknowledgments

We thank Andrej Karpathy, Charlie Tang and Percy Liang for helpful discussions. Sida Wang was partially funded by an NSERC PGS-M scholarship and a Stanford School of Engineering fellowship.

Page 14

Performance on MNIST

•  2-layer neural networks on MNIST

Method name                          | Number of errors
NN-1200-1200 plain                   | 182
NN-1200-1200 det. dropout            | 134
NN-1200-1200 det. dropout + Var      | 110
NN-300 MSE [LeCun 1998]              | 360
NN-800 [Simard 2003]                 | 160
Real dropout [Hinton 2012]           | 105–120 (79 with pretraining)

Page 15

Conclusions

•  Dropout training seems promising

•  But doing real dropout is slow: sampling is expensive

•  Apply the Gaussian approximation

•  Cheaper samples or completely deterministic approximations!

Page 16

Related Recent Works

•  Part of this work appeared in NIPS 2012 workshops

•  Learning with Marginalized Corrupted Features [van der Maaten et al., 2013]

•  DropConnect [Wan et al., 2013]