Back-Propagation 16-385 Computer Vision (Kris Kitani) Carnegie Mellon University


Page 1:

Back-Propagation
16-385 Computer Vision (Kris Kitani)
Carnegie Mellon University

Page 2:

Back to the… World's Smallest Perceptron!

y = wx

A function of ONE parameter! (a.k.a. line equation, linear regression)

[Diagram: input x passes through weight w and function f to produce output y.]

Page 3:

Training the world's smallest perceptron

This should be the gradient of the loss function.

Now where does this come from? This is just gradient descent, that means…

Page 4:

L = ½(y − ŷ)²    (the loss function)
ŷ = wx

dL/dw …is the rate at which this (the loss function) will change… per unit change of this (the weight parameter).

Let's compute the derivative…

Page 5:

Compute the derivative:

dL/dw = d/dw { ½(y − ŷ)² }
      = −(y − ŷ) d(wx)/dw
      = −(y − ŷ)x = ∇w    (just shorthand)

That means the weight update for gradient descent is:

w = w − ∇w
  = w + (y − ŷ)x    (move in direction of negative gradient)
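As a quick sanity check (not part of the original slides), the analytic derivative can be compared against a finite difference; the values of w, x, and y below are arbitrary illustrative numbers:

```python
# Hedged sketch: numerically check dL/dw = -(y - y_hat) * x for the
# one-parameter perceptron y_hat = w * x with L = 0.5 * (y - y_hat)**2.

def loss(w, x, y):
    y_hat = w * x
    return 0.5 * (y - y_hat) ** 2

w, x, y = 0.5, 2.0, 3.0
analytic = -(y - w * x) * x              # the derivative derived on this slide

eps = 1e-6
numeric = (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2 * eps)

print(analytic, numeric)                 # both should be ~ -4.0
```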

Page 6:

Gradient Descent (world's smallest perceptron)

For each sample {xi, yi}:

1. Predict
   a. Forward pass:     ŷ = w xi
   b. Compute loss:     Li = ½(yi − ŷ)²

2. Update
   a. Back-propagation: dLi/dw = −(yi − ŷ)xi = ∇w
   b. Gradient update:  w = w − ∇w
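A minimal Python sketch of this loop, assuming a tiny made-up dataset; as on the slide, the update uses a step size of 1:

```python
# Hedged sketch of the gradient-descent loop above for y_hat = w * x.
# The dataset and initial weight are made-up illustrative values.

samples = [(0.5, 1.0), (0.8, 1.7), (0.3, 0.55)]   # (x_i, y_i) pairs
w = 0.0

for x_i, y_i in samples:
    # 1a. Forward pass
    y_hat = w * x_i
    # 1b. Compute loss
    L_i = 0.5 * (y_i - y_hat) ** 2
    # 2a. Back-propagation: dL/dw = -(y_i - y_hat) * x_i
    grad_w = -(y_i - y_hat) * x_i
    # 2b. Gradient update (step size 1, as on the slide)
    w = w - grad_w
    print(f"x={x_i}, loss={L_i:.3f}, w={w:.3f}")
```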

Page 7:

Training the world's smallest perceptron

Page 8:

World's (second) smallest perceptron! A function of two parameters!

[Diagram: inputs x1 and x2, with weights w1 and w2, summed to produce the output y.]

Page 9:

Gradient Descent

For each sample {xi, yi}:

1. Predict
   a. Forward pass
   b. Compute loss

2. Update
   a. Back-propagation
   b. Gradient update

We just need to compute the partial derivatives for this network.

Page 10:

Back-Propagation

∂L/∂w1 = ∂/∂w1 { ½(y − ŷ)² }
       = −(y − ŷ) ∂ŷ/∂w1
       = −(y − ŷ) ∂(Σi wi xi)/∂w1
       = −(y − ŷ) ∂(w1 x1)/∂w1
       = −(y − ŷ) x1 = ∇w1

∂L/∂w2 = ∂/∂w2 { ½(y − ŷ)² }
       = −(y − ŷ) ∂ŷ/∂w2
       = −(y − ŷ) ∂(Σi wi xi)/∂w2
       = −(y − ŷ) ∂(w2 x2)/∂w2
       = −(y − ŷ) x2 = ∇w2

Why do we have partial derivatives now?

Page 11:

Back-Propagation

(Same partial derivatives as the previous slide:)

∂L/∂w1 = −(y − ŷ) x1 = ∇w1
∂L/∂w2 = −(y − ŷ) x2 = ∇w2

Gradient Update:

w1 = w1 − η∇w1 = w1 + η(y − ŷ)x1
w2 = w2 − η∇w2 = w2 + η(y − ŷ)x2

Page 12:

Gradient Descent

For each sample {xi, yi}:

1. Predict
   a. Forward pass:     ŷ = fMLP(xi; θ)
   b. Compute loss:     Li = ½(yi − ŷ)²

2. Update
   a. Back-propagation (two BP lines now):
      ∇w1 = −(yi − ŷ) x1i
      ∇w2 = −(yi − ŷ) x2i
   b. Gradient update:
      w1 = w1 + η(yi − ŷ) x1i
      w2 = w2 + η(yi − ŷ) x2i
      (η is an adjustable step size, since the gradients are approximated from a stochastic sample)
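A hedged sketch of the same procedure in Python for the two-parameter perceptron; the dataset, initial weights, and step size are made-up illustrative values:

```python
# Hedged sketch of stochastic gradient descent for the two-parameter
# perceptron y_hat = w1*x1 + w2*x2 with L2 loss.
import random

samples = [((0.5, 1.0), 2.4), ((1.0, 0.2), 1.5), ((0.3, 0.8), 1.9)]  # ((x1, x2), y)
w1, w2 = 0.0, 0.0
eta = 0.1                                  # adjustable step size

for _ in range(100):
    (x1, x2), y = random.choice(samples)   # stochastic sample
    # 1. Predict: forward pass and loss
    y_hat = w1 * x1 + w2 * x2
    L = 0.5 * (y - y_hat) ** 2
    # 2a. Back-propagation: the two partial derivatives from the slide
    grad_w1 = -(y - y_hat) * x1
    grad_w2 = -(y - y_hat) * x2
    # 2b. Gradient update
    w1 -= eta * grad_w1
    w2 -= eta * grad_w2

print(w1, w2)
```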

Page 13:

We haven't seen a lot of 'propagation' yet because our perceptrons only had one layer…

Page 14:

Multi-layer perceptron: a function of FOUR parameters and FOUR layers!

[Diagram: input x → (weight w1, bias b1) → hidden h1 → (weight w2) → hidden h2 → (weight w3) → output y.]

Page 15:

[Diagram: input layer 1 (input x, weight w1, bias b1) → sum a1 → activation f1 → hidden layer 2 (weight w2) → sum a2 → activation f2 → hidden layer 3 (weight w3) → sum a3 → activation f3 → output layer 4 (output ŷ).]


Page 17:

[Diagram: the same network as on the previous slide.]

a1 = w1 · x + b1
a2 = w2 · f1(w1 · x + b1)
a3 = w3 · f2(w2 · f1(w1 · x + b1))
ŷ = f3(w3 · f2(w2 · f1(w1 · x + b1)))
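A minimal sketch of this forward pass in Python, assuming sigmoid activations for f1, f2, f3 (the slides pick a sigmoid a few pages later) and illustrative parameter values:

```python
# Hedged sketch of the forward pass above for the scalar 4-layer MLP.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, w3):
    a1 = w1 * x + b1          # layer 1 -> 2: weighted input plus bias
    f1 = sigmoid(a1)
    a2 = w2 * f1              # layer 2 -> 3
    f2 = sigmoid(a2)
    a3 = w3 * f2              # layer 3 -> 4
    y_hat = sigmoid(a3)       # f3: the output
    return a1, f1, a2, f2, a3, y_hat

print(forward(x=1.0, w1=0.5, b1=0.1, w2=-0.3, w3=0.8)[-1])
```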


Page 24:

a1 = w1 · x + b1
a2 = w2 · f1(w1 · x + b1)
a3 = w3 · f2(w2 · f1(w1 · x + b1))
ŷ = f3(w3 · f2(w2 · f1(w1 · x + b1)))

The entire network can be written out as one long equation.

We need to train the network: what is known? What is unknown?

Page 25:

(Same equations as the previous slide, with the known quantities highlighted.)

We need to train the network: what is known? What is unknown?

Page 26:

(Same equations, with the unknown quantities highlighted.)

We need to train the network: what is known? What is unknown?

Note: the activation function sometimes has parameters too.

Page 27:

Learning an MLP

Given a set of samples {xi, yi} and an MLP ŷ = fMLP(x; θ),
estimate the parameters of the MLP: θ = {f, w, b}.

Page 28:

Stochastic Gradient Descent

For each random sample {xi, yi}:

1. Predict
   a. Forward pass:     ŷ = fMLP(xi; θ)
   b. Compute loss

2. Update
   a. Back-propagation: ∂L/∂θ    (vector of parameter partial derivatives)
   b. Gradient update:  θ ← θ − η∇θ    (vector of parameter update equations)

Page 29:

So we need to compute the partial derivatives:

∂L/∂θ = [ ∂L/∂w3,  ∂L/∂w2,  ∂L/∂w1,  ∂L/∂b ]

Page 30:

[Diagram: the same network, with a loss layer L(y, ŷ) attached to the output.]

Remember, the partial derivative ∂L/∂w1 describes how this (w1) affects this (the loss).

So, how do you compute it?

Page 31:

The Chain Rule

Page 32:

According to the chain rule…

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)

[Diagram: rest of the network → f2 → (w3) → a3 → f3 → ŷ → L(y, ŷ).]

Intuitively, the effect of the weight w3 on the loss function: L depends on f3, which depends on a3, which depends on w3.

Page 33:

Chain Rule!

[Diagram: rest of the network → f2 → (w3) → a3 → f3 → ŷ → L(y, ŷ).]

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)
       = −(y − ŷ)(∂f3/∂a3)(∂a3/∂w3)
       = −(y − ŷ) f3(1 − f3)(∂a3/∂w3)
       = −(y − ŷ) f3(1 − f3) f2

Page 34:

(Same derivation.) The first factor, ∂L/∂f3 = −(y − ŷ), is just the partial derivative of the L2 loss.

Page 35:

(Same derivation.) Let's use a sigmoid function: ds(x)/dx = s(x)(1 − s(x)), so ∂f3/∂a3 = f3(1 − f3).
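A small sketch (not from the slides) that checks the resulting expression for ∂L/∂w3 against a finite difference, assuming a sigmoid f3 and illustrative values for f2, w3, and y:

```python
# Hedged sketch: verify dL/dw3 = -(y - y_hat) * f3 * (1 - f3) * f2 numerically
# for the last layer of the scalar MLP, with a sigmoid activation.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def last_layer_loss(w3, f2, y):
    a3 = w3 * f2
    y_hat = sigmoid(a3)          # f3
    return 0.5 * (y - y_hat) ** 2, y_hat

w3, f2, y = 0.7, 0.4, 1.0        # illustrative values
L, f3 = last_layer_loss(w3, f2, y)

analytic = -(y - f3) * f3 * (1 - f3) * f2

eps = 1e-6
Lp, _ = last_layer_loss(w3 + eps, f2, y)
Lm, _ = last_layer_loss(w3 - eps, f2, y)
numeric = (Lp - Lm) / (2 * eps)

print(analytic, numeric)         # should agree to ~6 decimal places
```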



Page 38:

[Diagram: the full network x → (w1, b1) → a1 → f1 → (w2) → a2 → f2 → (w3) → a3 → f3 → ŷ.]

∂L/∂w2 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂w2)

Page 39:

∂L/∂w2 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂w2)

The first two factors were already computed for ∂L/∂w3 — re-use them (propagate)!
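A hedged sketch of that re-use: the product (∂L/∂f3)(∂f3/∂a3) is computed once, cached, and extended to get ∂L/∂w2. Sigmoid activations are assumed, and the numbers below stand in for quantities saved during a forward pass:

```python
# Hedged sketch of re-using upstream factors during back-propagation.
f1, f2, y_hat = 0.52, 0.61, 0.70   # activations from the forward pass (illustrative)
w3, y = 0.8, 1.0

delta3 = -(y - y_hat) * y_hat * (1 - y_hat)    # (dL/df3)(df3/da3): computed for dL/dw3
grad_w3 = delta3 * f2                          # dL/dw3, from the previous slides

# re-use delta3, then append (da3/df2)(df2/da2)(da2/dw2)
grad_w2 = delta3 * w3 * f2 * (1 - f2) * f1     # dL/dw2
print(grad_w3, grad_w2)
```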

Page 40:

The chain rule says… (L depends on f3, which depends on a3, which depends on f2, …, which depends on a1, which depends on w1.)

∂L/∂w1 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂w1)

Page 41:

∂L/∂w1 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂w1)

The leading factors were already computed for ∂L/∂w3 and ∂L/∂w2 — re-use them (propagate)!

Page 42:

[Diagram: the full network with the dependency chain annotated.]

∂L/∂w3 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂w3)
∂L/∂w2 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂w2)
∂L/∂w1 = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂w1)
∂L/∂b  = (∂L/∂f3)(∂f3/∂a3)(∂a3/∂f2)(∂f2/∂a2)(∂a2/∂f1)(∂f1/∂a1)(∂a1/∂b)
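A hedged sketch that computes all four expressions for the scalar network, assuming sigmoid activations; the helper names (forward, backward) and the parameter values are illustrative, not from the slides:

```python
# Hedged sketch: the four chain-rule expressions above for the scalar MLP.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, b1, w2, w3):
    a1 = w1 * x + b1; f1 = sigmoid(a1)
    a2 = w2 * f1;     f2 = sigmoid(a2)
    a3 = w3 * f2;     f3 = sigmoid(a3)      # f3 is the prediction y_hat
    return f1, f2, f3

def backward(x, y, w1, b1, w2, w3):
    f1, f2, f3 = forward(x, w1, b1, w2, w3)
    dL_df3  = -(y - f3)                     # dL/df3 (L2 loss)
    df3_da3 = f3 * (1 - f3)                 # sigmoid derivative
    df2_da2 = f2 * (1 - f2)
    df1_da1 = f1 * (1 - f1)
    # dL/dw3 = dL/df3 * df3/da3 * da3/dw3
    grad_w3 = dL_df3 * df3_da3 * f2
    # dL/dw2 = dL/df3 * df3/da3 * da3/df2 * df2/da2 * da2/dw2
    grad_w2 = dL_df3 * df3_da3 * w3 * df2_da2 * f1
    # dL/dw1 = ... * da2/df1 * df1/da1 * da1/dw1
    grad_w1 = dL_df3 * df3_da3 * w3 * df2_da2 * w2 * df1_da1 * x
    # dL/db  = same chain, but da1/db = 1
    grad_b1 = dL_df3 * df3_da3 * w3 * df2_da2 * w2 * df1_da1
    return grad_w1, grad_b1, grad_w2, grad_w3

print(backward(x=1.0, y=0.5, w1=0.3, b1=0.1, w2=-0.4, w3=0.7))
```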


Page 45:

Stochastic Gradient Descent

For each training sample {xi, yi}:

1. Predict
   a. Forward pass:     ŷ = fMLP(xi; θ)
   b. Compute loss:     Li

2. Update
   a. Back-propagation: compute ∂L/∂w3, ∂L/∂w2, ∂L/∂w1, ∂L/∂b using the chain-rule expressions from the previous slide
   b. Gradient update:
      w3 = w3 − η∇w3
      w2 = w2 − η∇w2
      w1 = w1 − η∇w1
      b  = b − η∇b

Page 46:

Stochastic Gradient Descent

For each training sample {xi, yi}:

1. Predict
   a. Forward pass:     ŷ = fMLP(xi; θ)
   b. Compute loss:     Li

2. Update
   a. Back-propagation: ∂L/∂θ    (vector of parameter partial derivatives)
   b. Gradient update:  θ ← θ − η ∂L/∂θ    (vector of parameter update equations)
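Putting the lecture together, a self-contained hedged sketch of SGD for this scalar 4-layer MLP (sigmoid activations, L2 loss); the dataset, initialization, and step size are made-up illustrative values:

```python
# Hedged end-to-end sketch of SGD for the scalar 4-layer MLP in these slides:
# theta = {w1, b1, w2, w3}, sigmoid activations, L2 loss.
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

samples = [(0.0, 0.2), (0.5, 0.6), (1.0, 0.9)]   # illustrative (x_i, y_i) pairs
w1, b1, w2, w3 = 0.1, 0.0, 0.1, 0.1              # theta
eta = 0.5                                         # step size

for step in range(2000):
    x, y = random.choice(samples)                 # stochastic sample
    # 1. Predict: forward pass and loss
    a1 = w1 * x + b1;  f1 = sigmoid(a1)
    a2 = w2 * f1;      f2 = sigmoid(a2)
    a3 = w3 * f2;      y_hat = sigmoid(a3)
    L = 0.5 * (y - y_hat) ** 2
    # 2a. Back-propagation: chain rule, re-using each layer's cached product
    d3 = -(y - y_hat) * y_hat * (1 - y_hat)       # (dL/df3)(df3/da3)
    d2 = d3 * w3 * f2 * (1 - f2)
    d1 = d2 * w2 * f1 * (1 - f1)
    grad_w3, grad_w2, grad_w1, grad_b1 = d3 * f2, d2 * f1, d1 * x, d1
    # 2b. Gradient update: theta <- theta - eta * dL/dtheta
    w3 -= eta * grad_w3;  w2 -= eta * grad_w2
    w1 -= eta * grad_w1;  b1 -= eta * grad_b1

print(L)
```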