Back-Propagation 16-385 Computer Vision (Kris Kitani) Carnegie Mellon University


Page 1:

Back-Propagation
16-385 Computer Vision (Kris Kitani)
Carnegie Mellon University

Page 2:

Back to the… World's Smallest Perceptron!

y = wx

A function of ONE parameter! (a.k.a. line equation, linear regression)

[Diagram: input x passes through weight w and function f to produce output y.]

Page 3:

Training the world's smallest perceptron

This should be the gradient of the loss function.

Now where does this come from? This is just gradient descent, that means…

Page 4:

L = ½(y − ŷ)²    (the loss function)
ŷ = wx

dL/dw …is the rate at which this (the loss function) will change… per unit change of this (the weight parameter).

Let's compute the derivative…

Page 5:

Compute the derivative:

dL/dw = d/dw { ½(y − ŷ)² }
      = −(y − ŷ) d(wx)/dw
      = −(y − ŷ)x = ∇w    (just shorthand)

That means the weight update for gradient descent is:

w = w − ∇w
  = w + (y − ŷ)x    (move in direction of negative gradient)
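As a quick sanity check (not part of the original slides), the analytic derivative can be compared against a finite difference; the values of w, x, and y below are arbitrary illustrative numbers:

```python
# Hedged sketch: numerically check dL/dw = -(y - y_hat) * x for the
# one-parameter perceptron y_hat = w * x with L = 0.5 * (y - y_hat)**2.

def loss(w, x, y):
    y_hat = w * x
    return 0.5 * (y - y_hat) ** 2

w, x, y = 0.5, 2.0, 3.0
analytic = -(y - w * x) * x              # the derivative derived on this slide

eps = 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)

print(analytic, numeric)                 # both should be ~ -4.0
```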

Page 6:

Gradient Descent (world's smallest perceptron)

For each sample {xi, yi}:

1. Predict
   a. Forward pass:     ŷ = w xi
   b. Compute loss:     Li = ½(yi − ŷ)²

2. Update
   a. Back-propagation: dLi/dw = −(yi − ŷ)xi = ∇w
   b. Gradient update:  w = w − ∇w
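A minimal Python sketch of this loop, assuming a tiny made-up dataset; as on the slide, the update uses a step size of 1:

```python
# Hedged sketch of the gradient-descent loop above for y_hat = w * x.
# The dataset and initial weight are made-up illustrative values.

samples = [(0.5, 1.0), (0.8, 1.7), (0.3, 0.55)]   # (x_i, y_i) pairs
w = 0.0

for x_i, y_i in samples:
    # 1a. Forward pass
    y_hat = w * x_i
    # 1b. Compute loss
    L_i = 0.5 * (y_i - y_hat) ** 2
    # 2a. Back-propagation: dL/dw = -(y_i - y_hat) * x_i
    grad_w = -(y_i - y_hat) * x_i
    # 2b. Gradient update (step size 1, as on the slide)
    w = w - grad_w
    print(f"x={x_i}, loss={L_i:.3f}, w={w:.3f}")
```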

Page 7:

Training the world's smallest perceptron

Page 8:

World's (second) smallest perceptron! A function of two parameters!

[Diagram: inputs x1 and x2, with weights w1 and w2, summed to produce the output y.]

Page 9:

Gradient Descent

For each sample {xi, yi}:

1. Predict
   a. Forward pass
   b. Compute loss

2. Update
   a. Back-propagation
   b. Gradient update

We just need to compute the partial derivatives for this network.

Page 10:

Back-Propagation

∂L/∂w1 = ∂/∂w1 { ½(y − ŷ)² }
       = −(y − ŷ) ∂ŷ/∂w1
       = −(y − ŷ) ∂(Σi wi xi)/∂w1
       = −(y − ŷ) ∂(w1 x1)/∂w1
       = −(y − ŷ) x1 = ∇w1

∂L/∂w2 = ∂/∂w2 { ½(y − ŷ)² }
       = −(y − ŷ) ∂ŷ/∂w2
       = −(y − ŷ) ∂(Σi wi xi)/∂w2
       = −(y − ŷ) ∂(w2 x2)/∂w2
       = −(y − ŷ) x2 = ∇w2

Why do we have partial derivatives now?

Page 11:

Back-Propagation

(Same partial derivatives as the previous slide:)

∂L/∂w1 = −(y − ŷ) x1 = ∇w1
∂L/∂w2 = −(y − ŷ) x2 = ∇w2

Gradient Update:

w1 = w1 − η∇w1 = w1 + η(y − ŷ)x1
w2 = w2 − η∇w2 = w2 + η(y − ŷ)x2

Page 12:

Gradient Descent

For each sample {xi, yi}:

1. Predict
   a. Forward pass:     ŷ = fMLP(xi; θ)
   b. Compute loss:     Li = ½(yi − ŷ)²

2. Update
   a. Back-propagation (two BP lines now):
      ∇w1 = −(yi − ŷ) x1i
      ∇w2 = −(yi − ŷ) x2i
   b. Gradient update:
      w1 = w1 + η(yi − ŷ) x1i
      w2 = w2 + η(yi − ŷ) x2i
      (η is an adjustable step size, since the gradients are approximated from a stochastic sample)
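A hedged sketch of the same procedure in Python for the two-parameter perceptron; the dataset, initial weights, and step size are made-up illustrative values:

```python
# Hedged sketch of stochastic gradient descent for the two-parameter
# perceptron y_hat = w1*x1 + w2*x2 with L2 loss.
import random

samples = [((0.5, 1.0), 2.4), ((1.0, 0.2), 1.5), ((0.3, 0.8), 1.9)]  # ((x1, x2), y)
w1, w2 = 0.0, 0.0
eta = 0.1                                  # adjustable step size

for _ in range(100):
    (x1, x2), y = random.choice(samples)   # stochastic sample
    # 1. Predict: forward pass and loss
    y_hat = w1 * x1 + w2 * x2
    L = 0.5 * (y - y_hat) ** 2
    # 2a. Back-propagation: the two partial derivatives from the slide
    grad_w1 = -(y - y_hat) * x1
    grad_w2 = -(y - y_hat) * x2
    # 2b. Gradient update
    w1 -= eta * grad_w1
    w2 -= eta * grad_w2

print(w1, w2)
```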

Page 13:

We haven't seen a lot of 'propagation' yet because our perceptrons only had one layer…

Page 14:

Multi-layer perceptron: a function of FOUR parameters and FOUR layers!

[Diagram: input x → (weight w1, bias b1) → hidden h1 → (weight w2) → hidden h2 → (weight w3) → output y.]

Page 15:

[Diagram: input layer 1 (input x, weight w1, bias b1) → sum a1 → activation f1 → hidden layer 2 (weight w2) → sum a2 → activation f2 → hidden layer 3 (weight w3) → sum a3 → activation f3 → output layer 4 (output ŷ).]


Page 17:

[Diagram: the same network as on the previous slide.]

a1 = w1 · x + b1
a2 = w2 · f1(w1 · x + b1)
a3 = w3 · f2(w2 · f1(w1 · x + b1))
ŷ = f3(w3 · f2(w2 · f1(w1 · x + b1)))
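A minimal sketch of this forward pass in Python, assuming sigmoid activations for f1, f2, f3 (the slides pick a sigmoid a few pages later) and illustrative parameter values:

```python
# Hedged sketch of the forward pass above for the scalar 4-layer MLP.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, w3):
    a1 = w1 * x + b1          # layer 1 -> 2: weighted input plus bias
    f1 = sigmoid(a1)
    a2 = w2 * f1              # layer 2 -> 3
    f2 = sigmoid(a2)
    a3 = w3 * f2              # layer 3 -> 4
    y_hat = sigmoid(a3)       # f3: the output
    return a1, f1, a2, f2, a3, y_hat

print(forward(x=1.0, w1=0.5, b1=0.1, w2=-0.3, w3=0.8)[-1])
```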


Page 24:

a1 = w1 · x + b1
a2 = w2 · f1(w1 · x + b1)
a3 = w3 · f2(w2 · f1(w1 · x + b1))
ŷ = f3(w3 · f2(w2 · f1(w1 · x + b1)))

The entire network can be written out as one long equation.

We need to train the network: what is known? What is unknown?

Page 25:

(Same equations as the previous slide, with the known quantities highlighted.)

We need to train the network: what is known? What is unknown?

Page 26:

(Same equations, with the unknown quantities highlighted.)

We need to train the network: what is known? What is unknown?

Note: the activation function sometimes has parameters too.

Page 27:

Learning an MLP

Given a set of samples {xi, yi} and an MLP ŷ = fMLP(x; θ),
estimate the parameters of the MLP: θ = {f, w, b}.

Page 28:

Stochastic Gradient Descent

For each random sample {xi, yi}:

1. Predict
   a. Forward pass:     ŷ = fMLP(xi; θ)
   b. Compute loss

2. Update
   a. Back-propagation: ∂L/∂θ    (vector of parameter partial derivatives)
   b. Gradient update:  θ ← θ − η∇θ    (vector of parameter update equations)

Page 29:

So we need to compute the partial derivatives:

∂L/∂θ = [ ∂L/∂w3,  ∂L/∂w2,  ∂L/∂w1,  ∂L/∂b ]

Page 30:

[Diagram: the same network, with a loss layer L(y, ŷ) attached to the output.]

Remember, the partial derivative ∂L/∂w1 describes how this (w1) affects this (the loss).

So, how do you compute it?

Page 31:

The Chain Rule

Page 32:

According to the chain rule…

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)

[Diagram: rest of the network → f2 → (w3) → a3 → f3 → ŷ → L(y, ŷ).]

Intuitively, the effect of the weight w3 on the loss function: L depends on f3, which depends on a3, which depends on w3.

Page 33:

Chain Rule!

[Diagram: rest of the network → f2 → (w3) → a3 → f3 → ŷ → L(y, ŷ).]

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)
       = −(y − ŷ)(∂f3/∂a3)(∂a3/∂w3)
       = −(y − ŷ) f3(1 − f3)(∂a3/∂w3)
       = −(y − ŷ) f3(1 − f3) f2

Page 34:

(Same derivation.) The first factor, ∂L/∂f3 = −(y − ŷ), is just the partial derivative of the L2 loss.

Page 35:

(Same derivation.) Let's use a sigmoid function: ds(x)/dx = s(x)(1 − s(x)), so ∂f3/∂a3 = f3(1 − f3).
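A small sketch (not from the slides) that checks the resulting expression for ∂L/∂w3 against a finite difference, assuming a sigmoid f3 and illustrative values for f2, w3, and y:

```python
# Hedged sketch: verify dL/dw3 = -(y - y_hat) * f3 * (1 - f3) * f2 numerically
# for the last layer of the scalar MLP, with a sigmoid activation.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def last_layer_loss(w3, f2, y):
    a3 = w3 * f2
    y_hat = sigmoid(a3)          # f3
    return 0.5 * (y - y_hat) ** 2, y_hat

w3, f2, y = 0.7, 0.4, 1.0        # illustrative values
L, f3 = last_layer_loss(w3, f2, y)

analytic = -(y - f3) * f3 * (1 - f3) * f2

eps = 1e-6
Lp, _ = last_layer_loss(w3 + eps, f2, y)
Lm, _ = last_layer_loss(w3 - eps, f2, y)
numeric = (Lp - Lm) / (2 * eps)

print(analytic, numeric)         # should agree to ~6 decimal places
```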



Page 38:

[Diagram: the full network x → (w1, b1) → a1 → f1 → (w2) → a2 → f2 → (w3) → a3 → f3 → ŷ.]

∂L/∂w2 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂w2)

Page 39:

∂L/∂w2 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂w2)

The first two factors were already computed for ∂L/∂w3 — re-use them (propagate)!
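A hedged sketch of that re-use: the product (∂L/∂f3)(∂f3/∂a3) is computed once, cached, and extended to get ∂L/∂w2. Sigmoid activations are assumed, and the numbers below stand in for quantities saved during a forward pass:

```python
# Hedged sketch of re-using upstream factors during back-propagation.
f1, f2, y_hat = 0.52, 0.61, 0.70   # activations from the forward pass (illustrative)
w3, y = 0.8, 1.0

delta3 = -(y - y_hat) * y_hat * (1 - y_hat)    # (dL/df3)(df3/da3): computed for dL/dw3
grad_w3 = delta3 * f2                          # dL/dw3, from the previous slides

# re-use delta3, then append (da3/df2)(df2/da2)(da2/dw2)
grad_w2 = delta3 * w3 * f2 * (1 - f2) * f1     # dL/dw2
print(grad_w3, grad_w2)
```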

Page 40:

The chain rule says… (L depends on f3, which depends on a3, which depends on f2, …, which depends on a1, which depends on w1.)

∂L/∂w1 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂w1)

Page 41:

∂L/∂w1 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂w1)

The leading factors were already computed for ∂L/∂w3 and ∂L/∂w2 — re-use them (propagate)!

Page 42:

[Diagram: the full network with the dependency chain annotated.]

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)
∂L/∂w2 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂w2)
∂L/∂w1 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂w1)
∂L/∂b  = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂b)
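A hedged sketch that computes all four expressions for the scalar network, assuming sigmoid activations; the helper names (forward, backward) and the parameter values are illustrative, not from the slides:

```python
# Hedged sketch: the four chain-rule expressions above for the scalar MLP.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, w3):
    a1 = w1 * x + b1; f1 = sigmoid(a1)
    a2 = w2 * f1;     f2 = sigmoid(a2)
    a3 = w3 * f2;     f3 = sigmoid(a3)      # f3 is the prediction y_hat
    return f1, f2, f3

def backward(x, y, w1, b1, w2, w3):
    f1, f2, f3 = forward(x, w1, b1, w2, w3)
    dL_df3  = -(y - f3)                     # dL/df3 (L2 loss)
    df3_da3 = f3 * (1 - f3)                 # sigmoid derivative
    df2_da2 = f2 * (1 - f2)
    df1_da1 = f1 * (1 - f1)
    # dL/dw3 = dL/df3 * df3/da3 * da3/dw3
    grad_w3 = dL_df3 * df3_da3 * f2
    # dL/dw2 = dL/df3 * df3/da3 * da3/df2 * df2/da2 * da2/dw2
    grad_w2 = dL_df3 * df3_da3 * w3 * df2_da2 * f1
    # dL/dw1 = ... * da2/df1 * df1/da1 * da1/dw1
    grad_w1 = dL_df3 * df3_da3 * w3 * df2_da2 * w2 * df1_da1 * x
    # dL/db  = same chain, but da1/db = 1
    grad_b1 = dL_df3 * df3_da3 * w3 * df2_da2 * w2 * df1_da1
    return grad_w1, grad_b1, grad_w2, grad_w3

print(backward(x=1.0, y=0.5, w1=0.3, b1=0.1, w2=-0.4, w3=0.7))
```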


Page 45:

Stochastic Gradient Descent

For each training sample {xi, yi}:

1. Predict
   a. Forward pass:     ŷ = fMLP(xi; θ)
   b. Compute loss:     Li

2. Update
   a. Back-propagation: compute ∂L/∂w3, ∂L/∂w2, ∂L/∂w1, ∂L/∂b using the chain-rule expressions from the previous slide
   b. Gradient update:
      w3 = w3 − η∇w3
      w2 = w2 − η∇w2
      w1 = w1 − η∇w1
      b  = b − η∇b

Page 46:

Stochastic Gradient Descent

For each training sample {xi, yi}:

1. Predict
   a. Forward pass:     ŷ = fMLP(xi; θ)
   b. Compute loss:     Li

2. Update
   a. Back-propagation: ∂L/∂θ    (vector of parameter partial derivatives)
   b. Gradient update:  θ ← θ − η ∂L/∂θ    (vector of parameter update equations)
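Putting the lecture together, a self-contained hedged sketch of SGD for this scalar 4-layer MLP (sigmoid activations, L2 loss); the dataset, initialization, and step size are made-up illustrative values:

```python
# Hedged end-to-end sketch of SGD for the scalar 4-layer MLP in these slides:
# theta = {w1, b1, w2, w3}, sigmoid activations, L2 loss.
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

samples = [(0.0, 0.2), (0.5, 0.6), (1.0, 0.9)]   # illustrative (x_i, y_i) pairs
w1, b1, w2, w3 = 0.1, 0.0, 0.1, 0.1              # theta
eta = 0.5                                         # step size

for step in range(2000):
    x, y = random.choice(samples)                 # stochastic sample
    # 1. Predict: forward pass and loss
    a1 = w1 * x + b1;  f1 = sigmoid(a1)
    a2 = w2 * f1;      f2 = sigmoid(a2)
    a3 = w3 * f2;      y_hat = sigmoid(a3)
    L = 0.5 * (y - y_hat) ** 2
    # 2a. Back-propagation: chain rule, re-using each layer's cached product
    d3 = -(y - y_hat) * y_hat * (1 - y_hat)       # (dL/df3)(df3/da3)
    d2 = d3 * w3 * f2 * (1 - f2)
    d1 = d2 * w2 * f1 * (1 - f1)
    grad_w3, grad_w2, grad_w1, grad_b1 = d3 * f2, d2 * f1, d1 * x, d1
    # 2b. Gradient update: theta <- theta - eta * dL/dtheta
    w3 -= eta * grad_w3;  w2 -= eta * grad_w2
    w1 -= eta * grad_w1;  b1 -= eta * grad_b1

print(L)
```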