INF 5860 Machine learning for image classification
Lecture: Backpropagation – learning in neural nets
Anne Solberg, March 3, 2017


Page 1: INF 5860 Machine learning for image classification Lecture

INF 5860 Machine learning for image classification

Lecture: Backpropagation – learning in neural nets
Anne Solberg, March 3, 2017

Page 2: INF 5860 Machine learning for image classification Lecture

Reading material

– Reading material:
  • http://cs231n.github.io/optimization-2/
– Additional optional material:
  • Lecture on backpropagation in the Coursera course on Machine Learning (Andrew Ng)
  • CS 231n on youtube: lecture 4
  • http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
  • http://colah.github.io/posts/2015-08-Backprop/
  • http://neuralnetworksanddeeplearning.com/chap2.html

Page 3: INF 5860 Machine learning for image classification Lecture

Notation – forward propagation

Assume that the input is layer 0.
a_i^{(j)}      – activation of unit i in layer j
\Theta^{(j)}   – matrix of weights controlling the function mapping from layer j-1 to layer j

\Theta^{(j)} has dimension (nodes in layer j) x (nodes in layer (j-1) + 1):
with s_j nodes in layer j and s_{j-1} nodes in layer j-1, \Theta^{(j)} has size s_j x (s_{j-1} + 1).

Example (three inputs, three hidden units, one output):

a_1^{(1)} = g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2 + \Theta_{13}^{(1)} x_3)
a_2^{(1)} = g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2 + \Theta_{23}^{(1)} x_3)
a_3^{(1)} = g(\Theta_{30}^{(1)} x_0 + \Theta_{31}^{(1)} x_1 + \Theta_{32}^{(1)} x_2 + \Theta_{33}^{(1)} x_3)
h_\Theta(x) = a_1^{(2)} = g(\Theta_{10}^{(2)} a_0^{(1)} + \Theta_{11}^{(2)} a_1^{(1)} + \Theta_{12}^{(2)} a_2^{(1)} + \Theta_{13}^{(2)} a_3^{(1)})

Page 4: INF 5860 Machine learning for image classification Lecture

Example feed-forward computation

• Input x: 3x1 vector (plus a bias node +1 in the input layer and in the hidden layer)

z^{(1)} = \Theta^{(1)} x,        a^{(1)} = g(z^{(1)})
z^{(2)} = \Theta^{(2)} a^{(1)},  a^{(2)} = g(z^{(2)}) = h_\Theta(x)

\Theta^{(1)}: 4x4 (nof. hidden nodes in layer 1 x (nof. inputs + 1))
\Theta^{(2)}: 2x5 (nof. classes x (nof. hidden nodes in layer 1 + 1))

If we have N training samples we can predict all n = 1,...,N at one time by stacking the samples in a matrix X, where row n holds the three pixel values of sample n:

X = [ x_pixel1(1)  x_pixel2(1)  x_pixel3(1)
      ...
      x_pixel1(N)  x_pixel2(N)  x_pixel3(N) ]

z1 = Theta1.dot(X)
a1 = sigmoid(z1)
# Append 1 to a1 before computing z2
# Continue with layer 2 ...
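The snippet above glosses over the bias handling. A minimal runnable sketch of the vectorized forward pass could look like the following (my own illustration, not the course's reference code; the 3-input / 4-hidden-node / 2-class sizes come from this slide, while the sample count and the bias-appending layout are assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N = 5                                     # number of training samples (assumed)
X = np.random.randn(N, 3)                 # one training sample per row, as in the matrix above
Theta1 = np.random.randn(4, 4) * 0.01     # nof. hidden nodes x (nof. inputs + 1)
Theta2 = np.random.randn(2, 5) * 0.01     # nof. classes x (nof. hidden nodes + 1)

A0 = np.hstack([np.ones((N, 1)), X]).T    # add the bias column, put samples along columns: (4, N)
z1 = Theta1.dot(A0)                       # (4, N)
a1 = sigmoid(z1)

a1 = np.vstack([np.ones((1, N)), a1])     # append 1 to a1 before computing z2
z2 = Theta2.dot(a1)                       # (2, N)
a2 = sigmoid(z2)                          # predictions for all N samples at one time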

Page 5: INF 5860 Machine learning for image classification Lecture

Cost function for one-vs-all neural networks

For a neural net with one-vs-all:

Output: a^{(L)} = h_\Theta(x) \in R^K

L: number of layers
s_l: number of units (without bias) in layer l

J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k(i) \log\left(h_\Theta(X(i,:))\right)_k + (1 - y_k(i)) \log\left(1 - h_\Theta(X(i,:))\right)_k \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{ji}^{(l)}\right)^2

J(\Theta) = LossTerm + \lambda * RegularizationTerm

Remark: two variations are common:
– Regularize all weights including the bias terms (sum from 0)
– Avoid regularizing the bias terms (sum from 1)

In practice, this choice does not matter.
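As a rough numpy sketch of this cost function (my own illustration under an assumed data layout: A_L holds the sigmoid outputs for all samples, Y is a one-hot label matrix, and the bias columns of each Theta are excluded from the regularization, following the second convention above):

import numpy as np

def one_vs_all_cost(A_L, Y, Thetas, lam):
    """A_L: K x m sigmoid outputs, Y: K x m one-hot labels, Thetas: list of weight matrices."""
    m = Y.shape[1]
    eps = 1e-12                                         # avoid log(0)
    loss = -np.sum(Y * np.log(A_L + eps) + (1 - Y) * np.log(1 - A_L + eps)) / m
    reg = sum(np.sum(T[:, 1:] ** 2) for T in Thetas)    # skip the bias column
    return loss + lam / (2 * m) * reg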

Page 6: INF 5860 Machine learning for image classification Lecture

Cost function for softmax neural networks

For a neural net with softmax loss function:

Output: a^{(L)} = h_\Theta(x) = [P(y=1|x,\Theta), P(y=2|x,\Theta), ..., P(y=K|x,\Theta)]^T = \frac{1}{\sum_{j=1}^{K} e^{\Theta_j^T x}} [e^{\Theta_1^T x}, e^{\Theta_2^T x}, ..., e^{\Theta_K^T x}]^T

L: number of layers
s_l: number of units (without bias) in layer l

J(\Theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{k=1}^{K} ind(y(i) = k) \log \frac{e^{\Theta_k^T x(i)}}{\sum_{j=1}^{K} e^{\Theta_j^T x(i)}} \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{ji}^{(l)}\right)^2

J(\Theta) = LossTerm + \lambda * RegularizationTerm

Remark: two variations are common; see the previous slide.
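A corresponding numpy sketch for the softmax loss (again my own illustration under an assumed layout: Z_L holds the output-layer scores and y the integer class labels):

import numpy as np

def softmax_cost(Z_L, y, Thetas, lam):
    """Z_L: K x m output-layer scores, y: length-m integer labels in 0..K-1."""
    m = y.shape[0]
    Z = Z_L - Z_L.max(axis=0, keepdims=True)            # shift for numerical stability
    probs = np.exp(Z) / np.exp(Z).sum(axis=0, keepdims=True)
    loss = -np.mean(np.log(probs[y, np.arange(m)] + 1e-12))
    reg = sum(np.sum(T[:, 1:] ** 2) for T in Thetas)    # same regularization term as before
    return loss + lam / (2 * m) * reg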

Page 7: INF 5860 Machine learning for image classification Lecture

Introduction to backpropagation and computational graphs

• We now have a network architecture and a cost function.
• A learning algorithm for the net should give us a way to change the weights in such a manner that the output is closer to the correct class labels.
• The activation function should assure that a small change in weights results in a small change in outputs.
• Backpropagation uses partial derivatives to compute the derivative of the cost function J with respect to all the weights.

Page 8: INF 5860 Machine learning for image classification Lecture

[Figure: a single gate f with inputs x and y and output z = f(x, y).]

During backpropagation, the node will learn \partial L / \partial z, the gradient of the loss with respect to its output.

The gate uses the chain rule to redistribute this gradient to its inputs:

\frac{\partial L}{\partial x} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial x}, \qquad \frac{\partial L}{\partial y} = \frac{\partial L}{\partial z} \frac{\partial z}{\partial y}

Green numbers: forward propagation (activations). Red numbers: backwards propagation (gradients).
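To make the local chain rule concrete, here is a small sketch of a single multiply gate (my own example, not from the slides): the forward pass stores its inputs, and the backward pass turns an upstream gradient dL/dz into gradients on x and y.

class MultiplyGate:
    """z = x * y.  Local derivatives: dz/dx = y, dz/dy = x."""

    def forward(self, x, y):
        self.x, self.y = x, y          # store inputs for the backward pass
        return x * y

    def backward(self, dL_dz):
        dL_dx = dL_dz * self.y         # chain rule: dL/dx = dL/dz * dz/dx
        dL_dy = dL_dz * self.x         # chain rule: dL/dy = dL/dz * dz/dy
        return dL_dx, dL_dy

gate = MultiplyGate()
z = gate.forward(2.0, -3.0)            # forward value: -6.0
dx, dy = gate.backward(1.0)            # with upstream gradient 1.0: dx = -3.0, dy = 2.0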

Page 9: INF 5860 Machine learning for image classification Lecture

A more complicated graph example

Page 10: INF 5860 Machine learning for image classification Lecture

[Figure: computational graph for the sigmoid example from cs231n optimization-2, f(w,x) = 1/(1 + e^{-(w0*x0 + w1*x1 + w2)}), with w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3. The gradient at the output is initialized to 1.0.]

Page 11: INF 5860 Machine learning for image classification Lecture

[Figure: backward through the 1/x gate. Local derivative d(1/x)/dx = -1/x^2, so the gradient is -1/(1.37)^2 * 1.0 = -0.53.]

Page 12: INF 5860 Machine learning for image classification Lecture

[Figure: backward through the +1 gate. Local derivative 1, so the gradient passes through unchanged: 1 * (-0.53) = -0.53.]

Page 13: INF 5860 Machine learning for image classification Lecture

[Figure: backward through the exp gate. Local derivative d(e^x)/dx = e^x, so the gradient is e^{-1} * (-0.53) = -0.20.]

Page 14: INF 5860 Machine learning for image classification Lecture

[Figure: backward through the multiply-by-(-1) gate: (-1) * (-0.20) = 0.20.]

Page 15: INF 5860 Machine learning for image classification Lecture

[Figure: backward through the add gate. The local derivative towards each input is 1, so the gradient is distributed to both inputs: (1) * 0.20 = 0.20.]

Page 16: INF 5860 Machine learning for image classification Lecture

[Figure: the second add gate likewise distributes the gradient to both of its inputs: (1) * 0.20 = 0.20.]

Page 17: INF 5860 Machine learning for image classification Lecture

[Figure: backward through the multiply gate w0*x0. dL/dw0 = x0 * 0.20 = -1 * 0.20 = -0.2 and dL/dx0 = w0 * 0.20 = 2 * 0.20 = 0.4.]

Page 18: INF 5860 Machine learning for image classification Lecture

[Figure: backward through the multiply gate w1*x1. dL/dw1 = x1 * 0.20 = -2 * 0.20 = -0.4 and dL/dx1 = w1 * 0.20 = -3 * 0.20 = -0.6.]

Page 19: INF 5860 Machine learning for image classification Lecture

The sigmoid gate

[Figure: the same graph, with the gates *(-1), exp, +1 and 1/x grouped into a single sigmoid gate.]

Page 20: INF 5860 Machine learning for image classification Lecture

The sigmoid gate

Output: 0.73. Derivative of the sigmoid gate: \sigma'(z) = (1 - \sigma(z))\sigma(z) = (1 - 0.73) * 0.73 = 0.20. This is the same result as above.
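The whole worked example can be reproduced with a few lines of staged Python (a sketch of my own, using the input values shown in the figures): each backward line is just the local derivative times the upstream gradient, and the numbers match the red values on the slides.

import math

# Inputs from the example
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass, staged into simple operations
p0 = w0 * x0          # -2.0
p1 = w1 * x1          #  6.0
s  = p0 + p1 + w2     #  1.0
neg = -s              # -1.0
e  = math.exp(neg)    #  0.37
den = 1.0 + e         #  1.37
f  = 1.0 / den        #  0.73  (the sigmoid output)

# Backward pass: chain rule, gate by gate
df   = 1.0                      # gradient at the output
dden = -1.0 / den**2 * df       # 1/x gate:     -0.53
de   = 1.0 * dden               # +1 gate:      -0.53
dneg = e * de                   # exp gate:     -0.20
ds   = -1.0 * dneg              # *(-1) gate:    0.20
dp0 = dp1 = dw2 = ds            # add gates distribute the gradient: 0.20
dw0, dx0 = x0 * dp0, w0 * dp0   # -0.2, 0.4
dw1, dx1 = x1 * dp1, w1 * dp1   # -0.4, -0.6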

Page 21: INF 5860 Machine learning for image classification Lecture

Forward and backward for a single neuron

Remark: an efficient implementation will store inputs and intermediates during forward, so that they are available for backprop.
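A minimal sketch of what such caching can look like for one sigmoid neuron (my own illustration; the cache contents and function names are assumptions, not the course's code):

import numpy as np

def neuron_forward(w, x, b):
    z = np.dot(w, x) + b
    a = 1.0 / (1.0 + np.exp(-z))
    cache = (w, x, a)              # keep what the backward pass will need
    return a, cache

def neuron_backward(da, cache):
    w, x, a = cache
    dz = da * a * (1.0 - a)        # sigmoid local gradient g'(z) = g(z)(1 - g(z))
    return dz * x, dz * w, dz      # gradients on w, x and b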

Page 22: INF 5860 Machine learning for image classification Lecture

A more tricky example

• Stage the forward pass into simple operations that we know the derivative of:

Page 23: INF 5860 Machine learning for image classification Lecture

A more tricky example

• In the backwards pass: compute the derivative of all these terms:

Page 24: INF 5860 Machine learning for image classification Lecture

Patterns in backward flow

add gate: gradient distributor
max gate: gradient router
mul gate: be careful

f = x*y means that df/dx = y and df/dy = x

Remark on the multiplier gate: if a gate gets one large and one small input, backprop will use the big input to cause a large change on the small input, and vice versa. This is partly why feature scaling is important.
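These three patterns are easy to verify numerically (a throwaway sketch of my own, not part of the slides):

upstream = 0.20

# add gate: distributes the same gradient to both inputs
d_add = (upstream, upstream)                     # (0.20, 0.20)

# max gate: routes the full gradient to the larger input, 0 to the other
x, y = 4.0, 1.5
d_max = (upstream if x > y else 0.0, upstream if y > x else 0.0)   # (0.20, 0.0)

# mul gate: swaps the inputs, so a large input gives the *other* input a large gradient
x, y = 100.0, 0.01
d_mul = (y * upstream, x * upstream)             # (0.002, 20.0)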

Page 25: INF 5860 Machine learning for image classification Lecture

The optimization problem

Given a loss function J and a feed-forward net with L layers, with weights \Theta^{(l)}, we want to minimize J using gradient descent.
We need the derivatives of J with respect to every \Theta_{mn}^{(l)}.

Backpropagation: recursive application of the chain rule on a computational graph to compute the gradients of all inputs/parameters/intermediates.

Implementation:
• Forward: compute the result of the node operation and save the intermediates needed for gradient computation.
• Backwards: apply the chain rule to compute the gradients of the loss function with respect to the input of each node.

Page 26: INF 5860 Machine learning for image classification Lecture

A very simple net with one input

[Figure: x --(*\Theta^{(1)})--> z^{(1)} --Sigmoid--> a^{(1)} --(*\Theta^{(2)})--> z^{(2)} --Sigmoid--> a^{(2)}]

Assume that we want to minimize the square error between the output a^{(2)} and the true class y:

E = \frac{1}{2}(y - a^{(2)})^2   (mean square error in this example; E is the measure of error on the training data)

Compute the partial derivatives \partial E / \partial \Theta^{(1)} and \partial E / \partial \Theta^{(2)}, and use gradient descent to update \Theta^{(1)} and \Theta^{(2)}.

Page 27: INF 5860 Machine learning for image classification Lecture

\frac{\partial E}{\partial \Theta^{(2)}} = \frac{\partial E}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial \Theta^{(2)}}

\frac{\partial E}{\partial a^{(2)}} = -(y - a^{(2)}) = (a^{(2)} - y)

a^{(2)} applies the sigmoid function g(z), so g'(z) = g(z)(1 - g(z)) and \frac{\partial a^{(2)}}{\partial z^{(2)}} = g(z^{(2)})(1 - g(z^{(2)}))

z^{(2)} = \Theta^{(2)} a^{(1)}, so \frac{\partial z^{(2)}}{\partial \Theta^{(2)}} = a^{(1)}

\frac{\partial E}{\partial \Theta^{(2)}} = (a^{(2)} - y)\, g(z^{(2)})(1 - g(z^{(2)}))\, a^{(1)}

Page 28: INF 5860 Machine learning for image classification Lecture

\frac{\partial E}{\partial \Theta^{(1)}} = \frac{\partial E}{\partial a^{(2)}} \frac{\partial a^{(2)}}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial a^{(1)}} \frac{\partial a^{(1)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial \Theta^{(1)}}

\frac{\partial z^{(2)}}{\partial a^{(1)}} = \Theta^{(2)}

a^{(1)} applies the sigmoid function g(z), so g'(z) = g(z)(1 - g(z)) and \frac{\partial a^{(1)}}{\partial z^{(1)}} = g(z^{(1)})(1 - g(z^{(1)}))

z^{(1)} = \Theta^{(1)} x, so \frac{\partial z^{(1)}}{\partial \Theta^{(1)}} = x

\frac{\partial E}{\partial \Theta^{(1)}} = (a^{(2)} - y)\, g(z^{(2)})(1 - g(z^{(2)}))\, \Theta^{(2)}\, g(z^{(1)})(1 - g(z^{(1)}))\, x
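For this scalar example the two derivative expressions are easy to check in code (a sketch of my own; the numeric values of x, y and the weights are arbitrary assumptions):

import numpy as np

g = lambda z: 1.0 / (1.0 + np.exp(-z))

x, y = 0.5, 1.0                      # one scalar input and its target
th1, th2 = 0.3, -0.7                 # the two scalar weights

# Forward pass
z1 = th1 * x;  a1 = g(z1)
z2 = th2 * a1; a2 = g(z2)
E = 0.5 * (y - a2) ** 2

# Gradients from the formulas on this and the previous slide
dE_dth2 = (a2 - y) * g(z2) * (1 - g(z2)) * a1
dE_dth1 = (a2 - y) * g(z2) * (1 - g(z2)) * th2 * g(z1) * (1 - g(z1)) * x

# Numerical check by finite differences (foreshadowing gradient checking)
eps = 1e-6
def cost(t1, t2):
    a = g(t2 * g(t1 * x))
    return 0.5 * (y - a) ** 2
print(dE_dth1, (cost(th1 + eps, th2) - cost(th1 - eps, th2)) / (2 * eps))
print(dE_dth2, (cost(th1, th2 + eps) - cost(th1, th2 - eps)) / (2 * eps))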

Page 29: INF 5860 Machine learning for image classification Lecture

From scalars to vectors

• In the example, x was a scalar, and \partial E / \partial \Theta was a vector with one element per weight.
• When working with vector input, \partial E / \partial \Theta^{(l)} for each layer will be a matrix.
• Deriving the vector/matrix version of backpropagation is more tedious, but follows the same principle.
• A good source is http://neuralnetworksanddeeplearning.com/chap2.html
• We now present the vector algorithm.

Page 30: INF 5860 Machine learning for image classification Lecture

Backpropagation algorithm for a single training sample (x_i, y_i)

For now, ignore the regularization (set \lambda = 0). For a 3-layer net:

Let \delta_j^{(3)} = a_j^{(3)} - y_j

Let \delta^{(l)} be the vector of \delta_j^{(l)}, j = 1, ..., s_l, where s_l is the number of nodes in layer l.

Compute
\delta^{(2)} = (\Theta^{(3)})^T \delta^{(3)} .* g'(z^{(2)})
\delta^{(1)} = (\Theta^{(2)})^T \delta^{(2)} .* g'(z^{(1)})

With this notation, \frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = a_j^{(l-1)} \delta_i^{(l)}

Note that .* is the elementwise product, or Hadamard product, of two vectors.
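A compact numpy sketch of these steps for one training sample (my own illustration of the formulas above, for a 3-layer net with sigmoid activations; bias terms are left out to keep it close to the slide):

import numpy as np

g = lambda z: 1.0 / (1.0 + np.exp(-z))

def backprop_single(x, y, Theta2, Theta3):
    """One sample: x is the input a^(1), y a one-hot target.
    Theta2 maps layer 1 -> 2, Theta3 maps layer 2 -> 3 (no bias terms in this sketch)."""
    # Forward pass, keeping z and a for every layer
    a1 = x
    z2 = Theta2.dot(a1); a2 = g(z2)              # hidden layer
    z3 = Theta3.dot(a2); a3 = g(z3)              # output layer

    # Backward pass
    d3 = a3 - y                                  # delta_j^(3) = a_j^(3) - y_j
    d2 = Theta3.T.dot(d3) * g(z2) * (1 - g(z2))  # (Theta^(3))^T delta^(3) .* g'(z^(2))

    # dJ/dTheta_ij^(l) = a_j^(l-1) * delta_i^(l): outer products
    dTheta2 = np.outer(d2, a1)
    dTheta3 = np.outer(d3, a2)
    return dTheta2, dTheta3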

Page 31: INF 5860 Machine learning for image classification Lecture

Derivative of loss function

• In backpropagation, we need the derivative of the loss function with respect to the activations of the output layer, a_i^{(L)}.
• If we ignore the regularization term, the derivative of the logistic loss function for sample i can be shown to be (a_i^{(L)} - y_i)
  – See http://stats.stackexchange.com/questions/219241/gradient-for-logistic-loss-function
• For softmax, ignoring the regularization term, the derivative of the softmax loss is also (a_i^{(L)} - y_i)
  – See http://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function

NOTE: a_i^{(L)} is computed differently in the two cases.

Page 32: INF 5860 Machine learning for image classification Lecture

\delta_k^{(2)} = a_k^{(2)} - ind(y = k)

\delta^{(1)} = (\Theta^{(2)})^T \delta^{(2)} .* g'(z^{(1)})

Notice that the bias nodes do not receive input from the previous layer. Thus, they should NOT be used in backpropagation.

Page 33: INF 5860 Machine learning for image classification Lecture

Including the regularization term

J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k(i) \log\left(h_\Theta(X(i,:))\right)_k + (1 - y_k(i)) \log\left(1 - h_\Theta(X(i,:))\right)_k \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{ji}^{(l)}\right)^2

J(\Theta) = LossTerm + \lambda * RegularizationTerm

Backpropagation update including the regularization:

\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} = D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)}   for j = 0
(here the convention is that we do not regularize the bias terms)

\frac{\partial J(\Theta)}{\partial \Theta_{ij}^{(l)}} = D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m}\Theta_{ij}^{(l)}   for j \ge 1

Note that i is indexed from 1, and j from 0 (it gets input from the bias in the previous layer).

Remark: softmax will have the same regularization term.

Page 34: INF 5860 Machine learning for image classification Lecture

Backpropagation with a loop over training data

Training set (x_1, y_1), ....., (x_m, y_m)
Set \Delta_{ij}^{(l)} = 0 for all i, j, l
for i = 1:m
    Set a^{(0)} = x_i
    Do forward propagation to compute a^{(l)} for l = 1, ..., L-1
    Compute \delta_k^{(L-1)} = a_k^{(L-1)} - ind(y_i = k)   (ind is an indicator function, 1 if y_i = k and 0 otherwise)
    Compute \delta^{(L-2)}, ...., \delta^{(1)} using \delta^{(l)} = (\Theta^{(l+1)})^T \delta^{(l+1)} .* g'(z^{(l)})
    \Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l-1)} \delta_i^{(l)}

D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m}\Theta_{ij}^{(l)}   if j \ne 0
D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)}   if j = 0

Here, \frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}
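Sketched in numpy, the loop accumulates the single-sample gradients and then averages and regularizes (my own illustration; backprop_single is the hypothetical helper from the sketch two slides back, and since that sketch has no bias terms, all weights are regularized here):

import numpy as np

def batch_gradients(X, Y, Theta2, Theta3, lam):
    """X: one sample per column, Y: one-hot targets per column."""
    m = X.shape[1]
    Delta2 = np.zeros_like(Theta2)
    Delta3 = np.zeros_like(Theta3)
    for i in range(m):
        dT2, dT3 = backprop_single(X[:, i], Y[:, i], Theta2, Theta3)
        Delta2 += dT2
        Delta3 += dT3
    # Average and add the regularization gradient
    D2 = Delta2 / m + (lam / m) * Theta2
    D3 = Delta3 / m + (lam / m) * Theta3
    return D2, D3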

Page 35: INF 5860 Machine learning for image classification Lecture

Checking dimensions

\delta^{(1)} = (\Theta^{(2)})^T \delta^{(2)} .* g'(z^{(1)})

• Note that in backpropagation, we multiply with the transposed weight matrix \Theta^T.
• When implementing this, shape() is your best friend.
• Think of a net with one hidden layer (layer 1) with 25 nodes + bias, and an output layer with 10 nodes (10 classes).
• \Theta^{(2)} has dimension 10x26 including the bias, and (\Theta^{(2)})^T is 26x10. \delta^{(2)} has dimension 10x1.
• REMARK: we can either ignore the bias terms in backpropagation, or also compute \delta_0^{(1)} (resulting in a 26x1 vector), but later ignore the \delta_0^{(1)} value.
  – When doing backpropagation from layer 2 to layer 1, ignore the bias (index 0 of layer 2) and backpropagate (\Theta^{(2)})^T(1:25, 0:9) \delta^{(2)}.
• \delta^{(1)} then has dimension [(25x10)x(10x1)] .* (25x1) = 25x1.
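A quick way to apply the shape() advice in practice (a tiny sketch of my own with the 25-hidden-node, 10-class example; the assert statements make dimension mistakes fail loudly):

import numpy as np

Theta2 = np.random.randn(10, 26)      # output layer: 10 classes x (25 hidden nodes + bias)
delta2 = np.random.randn(10, 1)       # delta at the output layer
gprime_z1 = np.random.rand(25, 1)     # g'(z) for the 25 hidden nodes (bias excluded)

delta1 = Theta2[:, 1:].T.dot(delta2) * gprime_z1   # drop the bias column before transposing
assert Theta2.T.shape == (26, 10)
assert delta1.shape == (25, 1)        # [(25x10) x (10x1)] .* (25x1) = 25x1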

Page 36: INF 5860 Machine learning for image classification Lecture

Assumptions behind backpropagation

1. The loss function should be expressed as a sum or average over all training samples.
   – This is true for all the functions we have studied so far.
   – We will be able to compute the gradient for a single training example, and then average over all samples.

J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k(i) \log\left(h_\Theta(X(i,:))\right)_k + (1 - y_k(i)) \log\left(1 - h_\Theta(X(i,:))\right)_k \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{ji}^{(l)}\right)^2

Output: h_\Theta(x) \in R^K;  L: number of layers;  s_l: number of units (without bias) in layer l

Page 37: INF 5860 Machine learning for image classification Lecture

Assumptions behind backpropagation

2. The loss function must be expressed as a function of the outputs of the net.
   – This allows us to change the weights and measure how similar y_i and the output h_\Theta(x) are.

(The cost function J(\Theta) and the output h_\Theta(x) \in R^K from the previous slide are shown again here; J depends on the training data only through the outputs h_\Theta(X(i,:)).)

Page 38: INF 5860 Machine learning for image classification Lecture

Gradient checking

• When implementing backpropagation, we use gradient checking to verify the implementation.
• When the code works, we turn off gradient checking.
• But what is it?

Page 39: INF 5860 Machine learning for image classification Lecture

Gradient checking: numerical estimation of the gradient

• The gradient of a function is defined as:

\frac{d}{d\theta} J(\theta) = \lim_{\varepsilon \to 0} \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2\varepsilon}

• When we have the cost function implemented, we can easily approximate the gradient as

\frac{d}{d\theta} J(\theta) \approx \frac{J(\theta + \varepsilon) - J(\theta - \varepsilon)}{2\varepsilon}

Page 40: INF 5860 Machine learning for image classification Lecture

Procedure for gradient checking

• 'Unroll' \Theta^{(1)}, \Theta^{(2)}, ... into a 1-d vector \theta = [\theta_1, ...., \theta_n]
• Approximate

\frac{\partial J}{\partial \theta_1} \approx \frac{J(\theta_1 + \varepsilon, \theta_2, ...., \theta_n) - J(\theta_1 - \varepsilon, \theta_2, ...., \theta_n)}{2\varepsilon}

\frac{\partial J}{\partial \theta_2} \approx \frac{J(\theta_1, \theta_2 + \varepsilon, ...., \theta_n) - J(\theta_1, \theta_2 - \varepsilon, ...., \theta_n)}{2\varepsilon}

...

\frac{\partial J}{\partial \theta_n} \approx \frac{J(\theta_1, \theta_2, ...., \theta_n + \varepsilon) - J(\theta_1, \theta_2, ...., \theta_n - \varepsilon)}{2\varepsilon}

• Check that the difference between this partial derivative and the one from backpropagation is smaller than a threshold.
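A small numpy sketch of this procedure (my own; cost is any function of the unrolled parameter vector, and the step size and tolerance are assumptions):

import numpy as np

def numerical_gradient(cost, theta, eps=1e-4):
    """Central-difference estimate of dJ/dtheta for an unrolled parameter vector."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        grad[i] = (cost(theta_plus) - cost(theta_minus)) / (2 * eps)
    return grad

# Usage: compare against the unrolled backpropagation gradient for a small example, e.g.
# assert np.max(np.abs(numerical_gradient(cost, theta) - backprop_grad)) < 1e-6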

Page 41: INF 5860 Machine learning for image classification Lecture

Regarding gradient checking:

• Computing the approximated gradient is computationally much slower than backpropagation:
  – Use gradient checking for a small example when debugging the backpropagation code.
  – Once it works, turn off gradient checking and proceed with training the entire data set.

Page 42: INF 5860 Machine learning for image classification Lecture

Random initialization of weights

• All weights must be initialized to small, but different, random numbers.
  – More on why next week.

Page 43: INF 5860 Machine learning for image classification Lecture

Training a neural network

• Choose an architecture:
  – Number of inputs: dimension of feature vector or image
  – Number of outputs: number of classes
  – 1-2 hidden layers.
    • For simplicity: use the same number of nodes in each hidden layer
  – More on practical details in the next two lectures.

Page 44: INF 5860 Machine learning for image classification Lecture

Training a network

1. Randomly initialize each weight to small numbers
2. Implement forward propagation to get the output
3. Implement code to compute the cost function J(\Theta)
4. Implement backprop to compute the partial derivatives:
   for i = 1:m
       Perform forward propagation and backpropagation for sample (x_i, y_i)
5. Use gradient checking to compare numerical estimates and backpropagation gradients. Afterwards, disable gradient checking.
6. Use gradient descent (or other optimization methods) with backpropagation to minimize J.
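Put together, the steps above amount to a training loop along these lines (a minimal self-contained sketch of my own with toy data and no bias terms, not the course's reference implementation):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 20 samples with 3 features, 2 classes (one-hot columns of Y)
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 20))
Y = np.eye(2)[:, rng.integers(0, 2, 20)]

# 1. Randomly initialize each weight to small, different numbers
Theta1 = rng.normal(size=(4, 3)) * 0.01     # 4 hidden nodes
Theta2 = rng.normal(size=(2, 4)) * 0.01     # 2 output classes

alpha = 0.5
for it in range(1000):
    # 2.-3. Forward propagation and the cost J(Theta)
    A1 = sigmoid(Theta1.dot(X))
    A2 = sigmoid(Theta2.dot(A1))
    m = X.shape[1]
    J = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m

    # 4. Backprop: deltas and gradients
    d2 = A2 - Y
    d1 = Theta2.T.dot(d2) * A1 * (1 - A1)
    D2 = d2.dot(A1.T) / m
    D1 = d1.dot(X.T) / m

    # 5. (While debugging: compare D1, D2 with a numerical gradient on a small example.)

    # 6. Gradient descent update
    Theta1 -= alpha * D1
    Theta2 -= alpha * D2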


Page 45: INF 5860 Machine learning for image classification Lecture

Weekly exercise:

• A detailed programming exercise, with descriptions of the operations, will be available.
• Implementing backpropagation is central to Mandatory exercise 1
  – No solution in python will be given, but test data with known results.

Page 46: INF 5860 Machine learning for image classification Lecture

Next weeks:

• Training in practice, useful tricks
• Babysitting the training process
• Parameter updates
• Activation functions
• Weight initialization
• Preprocessing
• Evaluation

Main reading material: http://cs231n.github.io/neural-networks-3/