
Neural Networks and Deep Learning

www.cs.wisc.edu/~dpage/cs760/


Goals for the lecture: you should understand the following concepts

•  perceptrons
•  the perceptron training rule
•  linear separability
•  hidden units
•  multilayer neural networks
•  gradient descent
•  stochastic (online) gradient descent
•  the sigmoid function
•  gradient descent with a linear output unit
•  gradient descent with a sigmoid output unit
•  backpropagation

Goals for the lecture: you should understand the following concepts

•  weight initialization
•  early stopping
•  the role of hidden units
•  input encodings for neural networks
•  output encodings
•  recurrent neural networks
•  autoencoders
•  stacked autoencoders

Neural networks

•  a.k.a. artificial neural networks, connectionist models
•  inspired by interconnected neurons in biological systems
•  simple processing units
•  each unit receives a number of real-valued inputs
•  each unit produces a single real-valued output

Perceptrons [McCulloch & Pitts, 1943; Rosenblatt, 1959; Widrow & Hoff, 1960]

o = \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}

input units: represent the given x
output unit: represents the binary classification

[Figure: a perceptron with input units x_1, ..., x_n, weights w_1, ..., w_n, and bias weight w_0 on a constant input of 1]

Learning a perceptron: the perceptron training rule

1.  randomly initialize weights

2.  iterate through training instances until convergence

    2a. calculate the output for the given instance:

        o = \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}

    2b. update each weight:

        \Delta w_i = \eta (y - o) x_i
        w_i \leftarrow w_i + \Delta w_i

η is the learning rate; set it to a value << 1
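A minimal sketch of this rule in Python (the toy dataset, learning rate, and epoch count below are illustrative choices, not from the slides):

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (y - o) * x_i."""
    n_features = X.shape[1]
    w = np.random.uniform(-0.01, 0.01, size=n_features + 1)  # w[0] is the bias weight
    for _ in range(epochs):
        for x, target in zip(X, y):
            net = w[0] + np.dot(w[1:], x)
            o = 1 if net > 0 else 0          # threshold output
            w[1:] += eta * (target - o) * x  # update the input weights
            w[0] += eta * (target - o)       # the bias input is fixed at 1
    return w

# learn the linearly separable AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = train_perceptron(X, y)
```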


Representational power of perceptrons

o = \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}

perceptrons can represent only linearly separable concepts

[Figure: positive (+) and negative (-) examples plotted in the (x_1, x_2) plane, separated by a line]

the decision boundary is given by: 1 if w_0 + w_1 x_1 + w_2 x_2 > 0, i.e. the line

w_1 x_1 + w_2 x_2 = -w_0, \qquad x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}

which we can also write as \mathbf{w} \cdot \mathbf{x} > 0

Representational power of perceptrons

•  in previous example, feature space was 2D so decision boundary was a line

•  in higher dimensions, decision boundary is a hyperplane


Some linearly separable functions

AND:

    x1  x2 | y
a    0   0 | 0
b    0   1 | 0
c    1   0 | 0
d    1   1 | 1

OR:

    x1  x2 | y
a    0   0 | 0
b    0   1 | 1
c    1   0 | 1
d    1   1 | 1

[Figures: the four points a, b, c, d plotted in the (x_1, x_2) plane; for AND and for OR, a single line separates the positive points from the negative ones]

XOR is not linearly separable

XOR:

    x1  x2 | y
a    0   0 | 0
b    0   1 | 1
c    1   0 | 1
d    1   1 | 0

[Figure: the positive points b and c and the negative points a and d cannot be separated by any single line]

a multilayer perceptron can represent XOR (assume w_0 = 0 for all nodes)

[Figure: a two-layer network over x_1 and x_2 with weights of ±1; the exact assignment is not recoverable from this transcript, but one valid choice is two hidden threshold units computing step(x_1 - x_2) and step(x_2 - x_1) feeding an output unit with weights 1, 1, as checked in the sketch below]
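A quick check of one such weight assignment (the specific weights below are an assumption chosen to satisfy the slide's constraints, not read from the original figure):

```python
def step(z):
    # threshold unit with w0 = 0: fires iff its net input is positive
    return 1 if z > 0 else 0

def xor_mlp(x1, x2):
    h1 = step(x1 - x2)    # hidden unit with weights (1, -1)
    h2 = step(x2 - x1)    # hidden unit with weights (-1, 1)
    return step(h1 + h2)  # output unit with weights (1, 1)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_mlp(x1, x2))  # prints the XOR truth table
```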


Example multilayer neural network

input: two features from spectral analysis of a spoken sound
output: vowel sound occurring in the context “h__d”

figure from Huang & Lippmann, NIPS 1988

[Figure: network with a layer of input units, a layer of hidden units, and a layer of output units]

Decision regions of a multilayer neural network

input: two features from spectral analysis of a spoken sound
output: vowel sound occurring in the context “h__d”

figure from Huang & Lippmann, NIPS 1988


Learning in multilayer networks

•  work on neural nets fizzled in the 1960s
   •  single-layer networks had representational limitations (linear separability)
   •  no effective methods for training multilayer networks
•  revived with the invention of the backpropagation method [Rumelhart & McClelland, 1986; also Werbos, 1975]
   •  key insight: require the neural network to be differentiable; use gradient descent

[Figure: a multilayer network with inputs x_1 and x_2] how do we determine the error signal for the hidden units?

Gradient descent in weight space

figure from Cho & Chow, Neurocomputing 1999

Given a training set

D = \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}

we can specify an error measure that is a function of our weight vector w:

E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} \left( y^{(d)} - o^{(d)} \right)^2

This error measure defines a surface over the hypothesis (i.e. weight) space.

[Figure: error surface over weights w_1 and w_2; figure from Cho & Chow, Neurocomputing 1999]

Gradient descent in weight space

gradient descent is an iterative process aimed at finding a minimum in the error surface

on each iteration
•  the current weights define a point in this space
•  find the direction in which the error surface descends most steeply
•  take a step (i.e. update the weights) in that direction

[Figure: error surface over weights w_1 and w_2]

Gradient descent in weight space

calculate the gradient of E:

\nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]

take a step in the opposite direction:

\Delta \mathbf{w} = -\eta \, \nabla E(\mathbf{w}) \qquad \Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}

[Figure: error surface over w_1 and w_2, with -\partial E / \partial w_1 and -\partial E / \partial w_2 indicating the direction of steepest descent]

The sigmoid function

•  to be able to differentiate E with respect to w_i, our network must represent a continuous function
•  to do this, we use sigmoid functions instead of threshold functions in our hidden and output units

f(x) = \frac{1}{1 + e^{-x}}

[Figure: the sigmoid function plotted against x]
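A small sketch of this function and its derivative (the derivative is the "useful property" used later when differentiating the network):

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)); maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    # f'(x) = f(x) * (1 - f(x))
    fx = sigmoid(x)
    return fx * (1.0 - fx)

print(sigmoid(0.0))        # 0.5
print(sigmoid_deriv(0.0))  # 0.25
```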


The sigmoid function for the case of a single-layer network

f(\mathbf{x}) = \frac{1}{1 + e^{-\left( w_0 + \sum_{i=1}^{n} w_i x_i \right)}}

[Figure: the sigmoid plotted against the net input w_0 + \sum_{i=1}^{n} w_i x_i]

Batch neural network training

given: network structure and a training set

D = \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}

initialize all weights in w to small random numbers
until stopping criteria met do
    initialize the error: E(\mathbf{w}) = 0
    for each (x^{(d)}, y^{(d)}) in the training set
        input x^{(d)} to the network and compute output o^{(d)}
        increment the error: E(\mathbf{w}) = E(\mathbf{w}) + \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2
    calculate the gradient: \nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]
    update the weights: \Delta \mathbf{w} = -\eta \, \nabla E(\mathbf{w})
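A minimal sketch of this batch loop for a single linear output unit. It estimates the gradient by finite differences so that it follows the algorithm exactly as stated; the analytic gradient is derived a few slides later. The dataset, step size, and epoch count are placeholders.

```python
import numpy as np

def output(w, x):
    # single linear output unit: o = w0 + sum_i w_i * x_i
    return w[0] + np.dot(w[1:], x)

def error(w, X, Y):
    # E(w) = 1/2 * sum_d (y_d - o_d)^2
    return 0.5 * sum((y - output(w, x)) ** 2 for x, y in zip(X, Y))

def batch_gradient_descent(X, Y, eta=0.05, epochs=200, h=1e-5):
    w = np.random.uniform(-0.01, 0.01, size=X.shape[1] + 1)
    for _ in range(epochs):
        grad = np.zeros_like(w)
        for i in range(len(w)):
            # finite-difference estimate of dE/dw_i
            w_plus, w_minus = w.copy(), w.copy()
            w_plus[i] += h
            w_minus[i] -= h
            grad[i] = (error(w_plus, X, Y) - error(w_minus, X, Y)) / (2 * h)
        w -= eta * grad  # take a step opposite the gradient
    return w

X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([1.0, 3.0, 5.0, 7.0])   # y = 2x + 1
print(batch_gradient_descent(X, Y))   # approximately [1, 2]
```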


Online vs. batch training

•  Standard gradient descent (batch training): calculates the error gradient for the entire training set before taking a step in weight space

•  Stochastic gradient descent (online training): calculates the error gradient for a single instance, then takes a step in weight space
   –  much faster convergence
   –  less susceptible to local minima

Online neural network training (stochastic gradient descent)

given: network structure and a training set

D = \{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}

initialize all weights in w to small random numbers
until stopping criteria met do
    for each (x^{(d)}, y^{(d)}) in the training set
        input x^{(d)} to the network and compute output o^{(d)}
        calculate the error: E(\mathbf{w}) = \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2
        calculate the gradient: \nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]
        update the weights: \Delta \mathbf{w} = -\eta \, \nabla E(\mathbf{w})


Convergence of gradient descent

•  gradient descent will converge to a minimum in the error function

•  for a multi-layer network, this may be a local minimum (i.e. there may be a “better” solution elsewhere in weight space)

•  for a single-layer network, this will be a global minimum (i.e. gradient descent will find the “best” solution)


Taking derivatives in neural nets

recall the chain rule from calculus:

y = f(u), \quad u = g(x) \quad \Rightarrow \quad \frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \, \frac{\partial u}{\partial x}

we'll make use of this as follows, with net = w_0 + \sum_{i=1}^{n} w_i x_i:

\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial o} \, \frac{\partial o}{\partial net} \, \frac{\partial net}{\partial w_i}

Gradient descent: simple case

Consider a simple case of a network with one linear output unit and no hidden units:

o^{(d)} = w_0 + \sum_{i=1}^{n} w_i x_i^{(d)}

[Figure: linear unit with inputs x_1, ..., x_n, weights w_1, ..., w_n, and bias weight w_0 on a constant input of 1]

let's learn the w_i's that minimize the squared error

E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} \left( y^{(d)} - o^{(d)} \right)^2

batch case:

\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \, \frac{1}{2} \sum_{d \in D} \left( y^{(d)} - o^{(d)} \right)^2

online case:

\frac{\partial E^{(d)}}{\partial w_i} = \frac{\partial}{\partial w_i} \, \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2

Stochastic gradient descent: simple case

let's focus on the online case (stochastic gradient descent):

\frac{\partial E^{(d)}}{\partial w_i} = \frac{\partial}{\partial w_i} \, \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2

= \left( y^{(d)} - o^{(d)} \right) \frac{\partial}{\partial w_i} \left( y^{(d)} - o^{(d)} \right)

= \left( y^{(d)} - o^{(d)} \right) \left( -\frac{\partial o^{(d)}}{\partial w_i} \right)

= -\left( y^{(d)} - o^{(d)} \right) \frac{\partial o^{(d)}}{\partial net^{(d)}} \, \frac{\partial net^{(d)}}{\partial w_i}

= -\left( y^{(d)} - o^{(d)} \right) \frac{\partial net^{(d)}}{\partial w_i}    (since o^{(d)} = net^{(d)} for a linear unit)

= -\left( y^{(d)} - o^{(d)} \right) x_i^{(d)}

Gradient descent with a sigmoid

Now let's consider the case in which we have a sigmoid output unit and no hidden units:

net^{(d)} = w_0 + \sum_{i=1}^{n} w_i x_i^{(d)} \qquad o^{(d)} = \frac{1}{1 + e^{-net^{(d)}}}

[Figure: sigmoid unit with inputs x_1, ..., x_n, weights w_1, ..., w_n, and bias weight w_0 on a constant input of 1]

useful property:

\frac{\partial o^{(d)}}{\partial net^{(d)}} = o^{(d)} \left( 1 - o^{(d)} \right)

Stochastic GD with sigmoid output unit

\frac{\partial E^{(d)}}{\partial w_i} = \frac{\partial}{\partial w_i} \, \frac{1}{2} \left( y^{(d)} - o^{(d)} \right)^2

= \left( y^{(d)} - o^{(d)} \right) \frac{\partial}{\partial w_i} \left( y^{(d)} - o^{(d)} \right)

= \left( y^{(d)} - o^{(d)} \right) \left( -\frac{\partial o^{(d)}}{\partial w_i} \right)

= -\left( y^{(d)} - o^{(d)} \right) \frac{\partial o^{(d)}}{\partial net^{(d)}} \, \frac{\partial net^{(d)}}{\partial w_i}

= -\left( y^{(d)} - o^{(d)} \right) o^{(d)} \left( 1 - o^{(d)} \right) \frac{\partial net^{(d)}}{\partial w_i}

= -\left( y^{(d)} - o^{(d)} \right) o^{(d)} \left( 1 - o^{(d)} \right) x_i^{(d)}
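A minimal sketch of stochastic gradient descent for a single sigmoid unit using the gradient just derived (the toy data, seed, and learning rate are placeholders):

```python
import numpy as np

def sgd_sigmoid_unit(X, Y, eta=0.5, epochs=1000):
    """Online training of one sigmoid output unit with squared error."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.01, 0.01, size=X.shape[1] + 1)  # w[0] is the bias weight
    for _ in range(epochs):
        for x, y in zip(X, Y):
            net = w[0] + np.dot(w[1:], x)
            o = 1.0 / (1.0 + np.exp(-net))
            # dE/dw_i = -(y - o) * o * (1 - o) * x_i, so step opposite the gradient
            delta = eta * (y - o) * o * (1.0 - o)
            w[1:] += delta * x
            w[0] += delta  # the bias input is fixed at 1
    return w

# learn OR, which is linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([0, 1, 1, 1])
print(sgd_sigmoid_unit(X, Y))
```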


Backpropagation

•  now we've covered how to do gradient descent for single-layer networks with
   •  linear output units
   •  sigmoid output units

•  how can we calculate \frac{\partial E}{\partial w_i} for every weight in a multilayer network?

   → backpropagate errors from the output units to the hidden units

Backpropagation notation

let's consider the online case, but drop the (d) superscripts for simplicity

we'll use
•  subscripts on y, o, net to indicate which unit they refer to
•  subscripts on weights to indicate the unit a weight emanates from and the unit it goes to

[Figure: weight w_{ji} connects unit i to unit j, which produces output o_j]

Backpropagation

each weight is changed by

\Delta w_{ji} = -\eta \, \frac{\partial E}{\partial w_{ji}} = -\eta \, \frac{\partial E}{\partial net_j} \, \frac{\partial net_j}{\partial w_{ji}} = \eta \, \delta_j \, o_i

where

\delta_j = -\frac{\partial E}{\partial net_j}

and o_i = x_i if i is an input unit

Backpropagation

each weight is changed by

\Delta w_{ji} = \eta \, \delta_j \, o_i \qquad \text{where} \qquad \delta_j = -\frac{\partial E}{\partial net_j}

\delta_j = o_j (1 - o_j)(y_j - o_j)  if j is an output unit (same as a single-layer net with sigmoid output)

\delta_j = o_j (1 - o_j) \sum_k \delta_k w_{kj}  if j is a hidden unit (the sum collects the backpropagated contributions to the error from the units k that j feeds into)


Backpropagation illustrated

1.  calculate the error of the output units: \delta_j = o_j (1 - o_j)(y_j - o_j)

2.  determine the updates for weights going to the output units: \Delta w_{ji} = \eta \, \delta_j \, o_i

[Figure: the error δ_j is computed at each output unit j]

Backpropagation illustrated

3.  calculate the error for the hidden units: \delta_j = o_j (1 - o_j) \sum_k \delta_k w_{kj}

4.  determine the updates for weights going to the hidden units using the hidden-unit errors: \Delta w_{ji} = \eta \, \delta_j \, o_i

[Figure: errors are backpropagated from the output units to each hidden unit j]
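A compact sketch of these four steps for one hidden layer of sigmoid units and a single sigmoid output unit (hidden-layer size, learning rate, and epoch count are illustrative; XOR training can get stuck in a local minimum, so a different seed or more hidden units may be needed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, Y, n_hidden=2, eta=0.5, epochs=5000, seed=0):
    """Online backpropagation minimizing squared error."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W_hid = rng.uniform(-0.1, 0.1, size=(n_hidden, n_in + 1))  # column 0 is the bias
    W_out = rng.uniform(-0.1, 0.1, size=n_hidden + 1)          # entry 0 is the bias
    for _ in range(epochs):
        for x, y in zip(X, Y):
            x1 = np.concatenate(([1.0], x))        # prepend the constant bias input
            h = sigmoid(W_hid @ x1)                # hidden activations
            h1 = np.concatenate(([1.0], h))
            o = sigmoid(W_out @ h1)                # network output
            delta_out = o * (1 - o) * (y - o)                   # step 1
            delta_hid = h * (1 - h) * (delta_out * W_out[1:])   # step 3
            W_out += eta * delta_out * h1                       # step 2
            W_hid += eta * np.outer(delta_hid, x1)              # step 4
    return W_hid, W_out

# XOR is not linearly separable, but a net with one hidden layer can learn it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([0, 1, 1, 0])
W_hid, W_out = train_backprop(X, Y)
```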


Neural network jargon

•  activation: the output value of a hidden or output unit

•  epoch: one pass through the training instances during gradient descent

•  transfer function: the function used to compute the output of a hidden/output unit from the net input

•  minibatch: in practice, randomly partition the data into many parts (e.g., 10 examples each), and compute the SGD gradient for one step with one of these parts (minibatches) instead of just one example


Initializing weights

•  Weights should be initialized to
   •  small values, so that the sigmoid activations are in the range where the derivative is large (learning will be quicker)
   •  random values, to ensure symmetry breaking (i.e. if all weights are the same, the hidden units will all represent the same thing)

•  typical initial weight range: [-0.01, 0.01]

Setting the learning rate

convergence depends on having an appropriate learning rate

[Figure: error plotted against a weight w_{ij}, with the descent direction -\partial E / \partial w_{ij}; with η too large the error goes up (the step overshoots), with η too small the error goes down only a little per step]

Stopping criteria

•  conventional gradient descent: train until a local minimum is reached

•  empirically better approach: early stopping
   •  use a validation set to monitor accuracy during training iterations
   •  return the weights that result in minimum validation-set error

[Figure: validation error vs. training iterations; stop training at the iteration where validation error is lowest]
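A sketch of the early-stopping loop; train_one_epoch and validation_error are hypothetical caller-supplied functions, and the patience rule is an assumption about when to give up:

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=200, patience=10):
    """Keep the weights that give the lowest validation-set error.
    train_one_epoch(model): one pass of (S)GD over the training set (hypothetical).
    validation_error(model): error on the held-out validation set (hypothetical)."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_best = 0
    for _ in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_error:
            best_error, best_model = err, copy.deepcopy(model)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:  # no recent improvement: stop early
            break
    return best_model
```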


Input (feature) encoding for neural networks

nominal features are usually represented using a 1-of-k encoding:

A = [1, 0, 0, 0]^T    C = [0, 1, 0, 0]^T    G = [0, 0, 1, 0]^T    T = [0, 0, 0, 1]^T

ordinal features can be represented using a thermometer encoding:

small = [1, 0, 0]^T    medium = [1, 1, 0]^T    large = [1, 1, 1]^T

real-valued features can be represented using individual input units (we may want to scale/normalize them first, though):

precipitation = [0.68]
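A small sketch of the first two encodings (the category orderings passed in are assumptions for illustration):

```python
import numpy as np

def one_of_k(value, categories):
    # 1-of-k encoding for a nominal feature
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

def thermometer(value, ordered_levels):
    # thermometer encoding for an ordinal feature: all levels up to `value` are on
    vec = np.zeros(len(ordered_levels))
    vec[: ordered_levels.index(value) + 1] = 1.0
    return vec

print(one_of_k("C", ["A", "C", "G", "T"]))                   # [0. 1. 0. 0.]
print(thermometer("medium", ["small", "medium", "large"]))   # [1. 1. 0.]
```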


Output encoding for neural networks

regression tasks usually use output units with linear transfer functions

binary classification tasks usually use one sigmoid output unit

k-ary classification tasks usually use k sigmoid or softmax output units:

o_i = \frac{e^{net_i}}{\sum_{j \in outputs} e^{net_j}}
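A sketch of the softmax output computation (the max-subtraction is a standard numerical-stability detail, not part of the slide's formula):

```python
import numpy as np

def softmax(net):
    # o_i = exp(net_i) / sum_j exp(net_j); subtracting the max avoids overflow
    z = np.exp(net - np.max(net))
    return z / np.sum(z)

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities that sum to 1
```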


Recurrent neural networks

recurrent networks are sometimes used for tasks that involve making sequences of predictions
•  Elman networks: recurrent connections go from hidden units to inputs
•  Jordan networks: recurrent connections go from output units to inputs


Alternative approach to training deep networks

•  use unsupervised learning to find useful hidden-unit representations


Learning representations

•  the feature representation provided is often the most significant factor in how well a learning system works

•  an appealing aspect of multilayer neural networks is that they are able to change the feature representation

•  can think of the nodes in the hidden layer as new features constructed from the original features in the input layer

•  consider having more levels of constructed features, e.g., pixels -> edges -> shapes -> faces or other objects


Competing intuitions

•  Only need a 2-layer network (input, hidden layer, output)
   –  Representation Theorem (1989): using sigmoid activation functions (more recently generalized to others as well), a single hidden layer can represent any continuous function
   –  Empirically, adding more hidden layers does not improve accuracy, and often degrades it, when training by standard backpropagation

•  Deeper networks are better
   –  More efficient representationally, e.g., can represent the n-variable parity function with polynomially many (in n) nodes using multiple hidden layers, but need exponentially many (in n) nodes when limited to a single hidden layer
   –  More structure, so should be able to construct more interesting derived features

The role of hidden units

•  Hidden units transform the input space into a new space where perceptrons suffice
•  They numerically represent “constructed” features
•  Consider learning the target function using the network structure below:

[Figure: the network structure for this task]

The role of hidden units

•  In this task, hidden units learn a compressed numerical coding of the inputs/outputs


How many hidden units should be used?

•  conventional wisdom in the early days of neural nets: prefer small networks, because fewer parameters (i.e. weights & biases) will be less likely to overfit
•  somewhat more recent wisdom: if early stopping is used, larger networks often behave as if they have fewer “effective” hidden units, and find better solutions

[Figure: test-set error vs. training epochs for networks with 4 and 15 hidden units; figure from Weigend, Proc. of the CMSS 1993]

Another way to avoid overfitting

•  Allow many hidden units but force each hidden unit to output mostly zeroes: sparse units tend to represent meaningful concepts
•  Gradient descent solves an optimization problem, so we can add a “regularizing” term to the objective function
•  Let X be a vector of random variables, one for each hidden unit, giving the average output of that unit over the data set. Let the target distribution s have the variables independent, each with a low probability of outputting one (say 0.1), and let ŝ be the empirical distribution in the data set. Add to the backpropagation objective function (the function from which the δ's are computed) a penalty of KL(s(X) || ŝ(X)). A sketch of this penalty follows.
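A sketch of such a sparsity penalty as it appears in sparse-autoencoder formulations, with independent Bernoulli units at target rate 0.1 as on the slide (the exact form and the clipping constant are assumptions):

```python
import numpy as np

def sparsity_penalty(avg_activations, rho=0.1):
    """KL(s || s_hat) for independent Bernoulli units: s has P(output 1) = rho for
    every hidden unit; s_hat is each unit's empirical average activation."""
    rho_hat = np.clip(avg_activations, 1e-8, 1 - 1e-8)  # avoid log(0)
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return np.sum(kl)

# average activation of three hidden units over a data set (made-up numbers)
print(sparsity_penalty(np.array([0.05, 0.12, 0.60])))
```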


Backpropagation with multiple hidden layers

•  in principle, backpropagation can be used to train arbitrarily deep networks (i.e. with multiple hidden layers)

•  in practice, this doesn’t usually work well

•  there are likely to be lots of local minima

•  diffusion of gradients leads to slow training in lower layers

•  gradients are smaller, less pronounced at deeper levels

•  errors in credit assignment propagate as you go back


Autoencoders

•  one approach: use autoencoders to learn hidden-unit representations
•  in an autoencoder, the network is trained to reconstruct the inputs


Autoencoder variants

•  how to encourage the autoencoder to generalize

•  bottleneck: use fewer hidden units than inputs

•  sparsity: use a penalty function that encourages most hidden unit activations to be near 0 [Goodfellow et al. 2009]

•  denoising: train to predict true input from corrupted input [Vincent et al. 2008]

•  contractive: force encoder to have small derivatives (of hidden unit output as input varies) [Rifai et al. 2011]


Stacking Autoencoders

•  autoencoders can be stacked to form highly nonlinear representations [Bengio et al. NIPS 2006]

[Figure: stacked training procedure]
1.  train an autoencoder to represent x
2.  discard its output layer; train an autoencoder to represent the hidden representation h1; repeat for k layers
3.  discard the output layer; train the weights on the last layer for the supervised task

each W_i here represents the matrix of weights between layers

Fine-Tuning

•  After completion, run backpropagation on the entire network to fine-tune weights for the supervised task

•  Because this backpropagation starts with good structure and weights, its credit assignment is better and so its final results are better than if we just ran backpropagation initially


Why does the unsupervised training step work well?

•  regularization hypothesis: representations that are good for P(x) are good for P(y | x)

•  optimization hypothesis: unsupervised initializations start near better local minima of supervised training error


Deep learning not limited to neural networks

•  First developed by Geoff Hinton and colleagues for belief networks, a kind of hybrid between neural nets and Bayes nets

•  Hinton motivates the unsupervised deep learning training process by the credit assignment problem, which appears in belief nets, Bayes nets, neural nets, restricted Boltzmann machines, etc.
   •  d-separation: the problem of evidence at a converging connection creating competing explanations
   •  backpropagation: can't choose which neighbors get the blame for an error at this node

Room for Debate

•  many now argue that the unsupervised pre-training phase is not really needed…

•  backprop is sufficient if done better
   –  wider diversity in initial weights; try many initial settings until you get learning
   –  don't worry much about the exact learning rate, but add momentum: if moving fast in a given direction, keep it up for a while
   –  need a lot of data for deep net backprop

Problems with Backprop for Deep Neural Networks

•  Overfits both training data and the particular starting point

•  Converges too quickly to a suboptimal solution, even with SGD (gradient from one example or “minibatch” of examples at one time)

•  Need more training data and/or fewer weights to estimate, or other regularizer


Trick 1: Data Augmentation

•  Deep learning depends critically on “Big Data” – need many more training examples than features

•  Turn one positive (negative) example into many positive (negative) examples

•  Image data: rotate, re-scale, or shift image, or flip image about axis; image still contains the same objects, exhibits the same event or action
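A sketch of simple label-preserving image augmentations of this kind, using numpy on an image stored as a 2-D array (rotation by arbitrary angles would need an image library):

```python
import numpy as np

def augment(image):
    """Yield label-preserving variants of a 2-D grayscale image array."""
    yield np.fliplr(image)            # flip about the vertical axis
    yield np.rot90(image)             # rotate 90 degrees
    yield np.roll(image, 2, axis=1)   # shift 2 pixels to the right
    yield image[::2, ::2]             # crude re-scaling by subsampling

image = np.arange(36).reshape(6, 6)   # stand-in for a real image
variants = list(augment(image))
```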


Trick 2: Parameter (Weight) Tying

•  Normally all neurons at one layer are connected to next layer

•  Instead, have only n features feed to one specific neuron at next level (e.g., 4 or 9 pixels of image go to one hidden unit summarizing this “super-pixel”)

•  Tie the 4 (or 9) input weights across all super-pixels… more data per weight


Weight Tying Example: Convolution

•  Have a sliding window (e.g., square of 4 pixels, set of 5 consecutive items in a sequence, etc), and only the neurons for these inputs feed into one neuron, N1, at the next layer.

•  Slide this window over by some amount and repeat, feeding into another neuron, N2, etc.

•  Tie the input weights for N1, N2, etc., so they will all learn the same concept (e.g., diagonal edge).

•  Repeat into new neurons N1’, N2’, etc., to learn other concepts.
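A sketch of this weight-tying idea as a 1-D convolution: a single shared weight vector slides over the input, so every window position (N1, N2, ...) uses the same weights (the example signal and "edge detector" weights are illustrative):

```python
import numpy as np

def conv1d(x, shared_weights, stride=1):
    """Each output neuron sees one window of the input; all of them share weights."""
    k = len(shared_weights)
    outputs = []
    for start in range(0, len(x) - k + 1, stride):
        window = x[start:start + k]
        outputs.append(np.dot(shared_weights, window))  # same weights at every position
    return np.array(outputs)

x = np.array([0.0, 1.0, 1.0, 0.0, 2.0, 2.0, 0.0])
edge_detector = np.array([-1.0, 1.0])   # one "concept" learned by the tied weights
print(conv1d(x, edge_detector))          # responds wherever the input steps up
```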


Alternate Convolutional Layer with Pooling Layer

•  Mean pooling: k nodes (e.g., corresponding to 4 pixels constituting a square in an image) are averaged to create one node (e.g., corresponding to one pixel) at the next layer.

•  Max pooling: replace average with maximum
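A sketch of 2x2 mean and max pooling on an image array (the window size and stride of 2 are assumptions for illustration):

```python
import numpy as np

def pool2x2(image, mode="max"):
    """Replace each non-overlapping 2x2 block with its max (or mean)."""
    h, w = image.shape
    blocks = image[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

image = np.array([[1., 2., 0., 1.],
                  [3., 4., 2., 2.],
                  [0., 1., 5., 6.],
                  [1., 1., 7., 8.]])
print(pool2x2(image, "max"))   # [[4. 2.] [1. 8.]]
print(pool2x2(image, "mean"))  # [[2.5  1.25] [0.75 6.5 ]]
```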


Used in Convolutional Neural Networks for Vision Applications

[Figure: a convolutional neural network architecture; image from http://masters.donntu.org/2012/fknt/umiarov/diss/images/image3_en.png]

Trick 3: Alternative Activation

•  Tanh: (e^{2x} - 1) / (e^{2x} + 1), the hyperbolic tangent

•  ReLU: max(0, x), the rectified linear unit; its smooth approximation ln(1 + e^x) is called softplus

[Screenshot: Wolfram MathWorld entry on the hyperbolic tangent (mathworld.wolfram.com/HyperbolicTangent.html), defining tanh as the ratio of the hyperbolic sine to the hyperbolic cosine and giving its derivative]

[Screenshot: Wikipedia article "Rectifier (neural networks)" (en.wikipedia.org/wiki/Rectifier_(neural_networks)): the rectifier max(0, x) was first introduced to a dynamical network by Hahnloser et al. in 2000, has been used in convolutional networks more effectively than the logistic sigmoid or the hyperbolic tangent, and was, as of 2015, the most popular activation function for deep neural networks; a smooth approximation is the softplus function ln(1 + e^x), whose derivative is the logistic function]
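A sketch of the activation functions mentioned above:

```python
import numpy as np

def tanh(x):
    # (e^(2x) - 1) / (e^(2x) + 1); numpy also provides np.tanh
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)

def relu(x):
    # rectified linear unit: max(0, x)
    return np.maximum(0.0, x)

def softplus(x):
    # smooth approximation to the rectifier: ln(1 + e^x)
    return np.log1p(np.exp(x))

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x), relu(x), softplus(x))
```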


Trick 4: Alternative Error Function

•  Example: Cross-entropy

[Screenshot: excerpt from neuralnetworksanddeeplearning.com, chap. 3: with the quadratic cost, ∂C/∂w and ∂C/∂b contain a factor σ'(z), which is very small when the neuron's output is close to 0 or 1; this is the origin of the learning slowdown. The slowdown can be addressed by replacing the quadratic cost with the cross-entropy cost

C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln(1 - a) \right]

for a neuron with inputs x_1, x_2, ..., weights w_1, w_2, ..., bias b, weighted sum z = \sum_j w_j x_j + b, and output a = σ(z).]
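A sketch of the cross-entropy cost and its gradient for a single sigmoid neuron, following the formula above (the clipping is a numerical-stability detail I have added, and the toy data are placeholders):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(w, b, X, Y):
    # C = -(1/n) * sum_x [ y ln a + (1 - y) ln(1 - a) ]
    a = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return -np.mean(Y * np.log(a) + (1 - Y) * np.log(1 - a))

def cross_entropy_grad(w, b, X, Y):
    # with cross-entropy the sigma'(z) factor cancels: dC/dw_j = (1/n) sum_x (a - y) x_j
    a = sigmoid(X @ w + b)
    return X.T @ (a - Y) / len(Y), np.mean(a - Y)

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([0., 1., 1., 1.])
w, b = np.zeros(2), 0.0
print(cross_entropy_cost(w, b, X, Y), cross_entropy_grad(w, b, X, Y))
```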


Trick 5: Momentum

[Screenshot: excerpt from the CS231n course notes (cs231n.github.io/neural-networks-3): the momentum coefficient (typically about 0.9) behaves like a friction coefficient that damps the velocity; it is usually cross-validated over values such as 0.5, 0.9, 0.95, 0.99, and is sometimes annealed from about 0.5 toward 0.99 over multiple epochs. Nesterov momentum instead evaluates the gradient at the "looked-ahead" position x + mu * v rather than at the current position; it enjoys stronger theoretical convergence guarantees for convex functions and in practice consistently works slightly better than standard momentum.]
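A sketch of the standard momentum update described above (v is the velocity, mu the momentum coefficient; grad_fn is a caller-supplied function returning the gradient at w, and the quadratic test function is a placeholder):

```python
import numpy as np

def sgd_momentum(w, grad_fn, eta=0.01, mu=0.9, steps=100):
    """Gradient descent with momentum: keep moving in directions we have been moving."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)        # gradient of the error at the current weights
        v = mu * v - eta * g  # damped velocity accumulates the negative gradients
        w = w + v             # step along the velocity, not just the raw gradient
    return w

# minimize f(w) = w1^2 + 10 * w2^2, whose gradient is (2 w1, 20 w2)
grad = lambda w: np.array([2.0 * w[0], 20.0 * w[1]])
print(sgd_momentum(np.array([3.0, 2.0]), grad))  # approaches [0, 0]
```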


Trick 6: Dropout Training

•  Build some redundancy into the hidden units

•  Essentially create an “ensemble” of neural networks, but without high cost of training many deep networks

•  Dropout training…


Dropout training

•  On each training iteration, drop out (ignore) 50% of the units (or 90%, or some other fraction) by forcing their output to 0 during the forward pass
•  Ignore them for forward & backprop (all of training)

[Figures from Srivastava et al., Journal of Machine Learning Research 2014: on each training iteration, randomly “drop out” a subset of the units and their weights and do forward and backprop on the remaining network; at test time, use all units and weights in the network and adjust the weights according to the probability that the source unit was dropped out]

At Test Time

•  Final model uses all nodes
•  Multiply each weight from a node by the fraction of times that node was used during training
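A sketch of dropout applied to one hidden layer with drop probability 0.5; at test time the activations (equivalently, the weights leaving the layer) are scaled by the retention probability, matching the rule above:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(h, p_drop=0.5, training=True):
    """h: activations of a hidden layer. During training, zero out a random subset."""
    if training:
        mask = rng.random(h.shape) >= p_drop   # keep each unit with prob 1 - p_drop
        return h * mask
    # at test time use all units, scaled by the probability a unit was kept
    return h * (1.0 - p_drop)

h = np.array([0.2, 0.9, 0.5, 0.7])
print(dropout_forward(h, training=True))   # some activations forced to 0
print(dropout_forward(h, training=False))  # all activations, scaled by 0.5
```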


Trick 7: Batch Normalization

•  If outputs of earlier layers are uniform or change greatly on one round for one mini-batch, then neurons at next levels can’t keep up: they output all high (or all low) values

•  Next layer doesn’t have ability to change its outputs with learning-rate-sized changes to its input weights

•  We say the layer has “saturated”


Another View of Problem

•  In ML, we assume future data will be drawn from same probability distribution as training data

•  For a hidden unit, after training, the earlier layers have new weights and hence generate input data for this hidden unit from a new distribution

•  Want to reduce this internal covariate shift for the benefit of later layers


Batch Normalization

[Excerpt from the batch normalization paper (Ioffe & Szegedy), Algorithm 1: given the values of an activation x over a mini-batch B = {x_1, ..., x_m} and parameters γ, β to be learned, the Batch Normalizing Transform y_i = BN_{γ,β}(x_i) is computed as

\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i    (mini-batch mean)

\sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2    (mini-batch variance)

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}    (normalize)

y_i = \gamma \hat{x}_i + \beta \equiv BN_{\gamma,\beta}(x_i)    (scale and shift)

The normalized activations x̂ have mean 0 and variance 1 (neglecting ε) as long as the elements of each mini-batch are sampled from the same distribution. The transform is differentiable, so the gradient of the loss can be backpropagated through it and with respect to γ and β, and the learned scale and shift allow the transform to represent the identity, preserving network capacity. At inference time the mini-batch statistics are replaced by population statistics, so the normalization becomes a fixed linear transform of each activation. In convolutional networks the BN transform is added immediately before the nonlinearity, normalizing x = Wu + b.]
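A sketch of the batch-normalization transform above applied to one mini-batch of activations (training-time statistics only; tracking population statistics for inference is omitted, and the example batch is a placeholder):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: a (batch_size, n_features) array of activations for one mini-batch."""
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to mean 0, variance 1
    return gamma * x_hat + beta              # learned scale and shift

batch = np.array([[1.0, 50.0],
                  [2.0, 60.0],
                  [3.0, 70.0]])
print(batch_norm_forward(batch, gamma=np.ones(2), beta=np.zeros(2)))
```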


Comments on Batch Normalization

•  The first three steps are just like standardization of the input data, but with respect to only the data in the mini-batch. We can take derivatives and incorporate the learning of the last step's parameters into backpropagation.

•  Note that the last step can completely undo the previous 3 steps

•  But if so, this undoing is driven by the later layers, not the earlier layers; the later layers get to “choose” whether they want standard normal inputs or not

Some Deep Learning Resources

•  Nature, Jan 8, 2014: http://www.nature.com/news/computer-science-the-learning-machines-1.14481

•  Ng Tutorial: http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial

•  Hinton Tutorial: http://videolectures.net/jul09_hinton_deeplearn/

•  LeCun & Ranzato Tutorial: http://www.cs.nyu.edu/~yann/talks/lecun-ranzato-icml2013.pdf

Comments on neural networks

•  stochastic gradient descent often works well for very large data sets

•  backpropagation generalizes to
   •  arbitrary numbers of output and hidden units
   •  arbitrary layers of hidden units (in theory)
   •  arbitrary connection patterns
   •  other transfer (i.e. output) functions
   •  other error measures

•  backprop doesn't usually work well for networks with multiple layers of hidden units; recent work in deep networks addresses this limitation