Neural networks
Nuno Vasconcelos ECE Department, UCSD
Classification

a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world

e.g.
• x ∈ X ⊂ R² = (fever, blood pressure)
• y ∈ Y = {disease, no disease}

X and Y are related by an (unknown) function f(·)

    y = f(x)

goal: design a classifier h: X → Y such that h(x) = f(x), ∀x
Perceptron

the classifier implements the linear decision rule

    h(x) = sgn[g(x)]    with    g(x) = w^T x + b

[figure: the hyperplane g(x) = 0, with normal w, offset b/||w||, and distance g(x)/||w|| from a point x to the plane]

learning is formulated as an optimization problem
• define the set of errors

    E = { x_i | y_i (w^T x_i + b) < 0 }

• define the cost

    J_p(w, b) = - Σ_{x_i ∈ E} y_i (w^T x_i + b)

• and minimize
• J_p cannot be negative since, in E, all y_i (w^T x_i + b) are negative
• at zero we know we have the best possible solution (E empty)
Gradient descent

the simplest possible minimization technique
• pick an initial estimate x^(0)
• follow the negative gradient

    x^(n+1) = x^(n) - η ∇f(x^(n))

[figure: one step of size η along -∇f(x^(n)) from the current estimate x^(n) on the curve f(x)]

usually the gradient is a function of the entire training set D
a more efficient alternative is "stochastic gradient descent" (sketched below)
• take the step immediately after each point
• no guarantee this is a descent step but, on average, you follow the same direction after processing the entire D
• very popular in learning, where D is usually large
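A minimal sketch (not from the slides) contrasting the two variants on a least-squares cost; the function names, the cost, and the defaults for η and the number of steps are illustrative assumptions.

```python
# Batch vs. stochastic gradient descent on J(w) = sum_i (y_i - w.x_i)^2.
import numpy as np

def batch_gd(X, y, eta=0.01, n_steps=100):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = -2 * X.T @ (y - X @ w)            # gradient computed on the entire set D
        w = w - eta * grad                        # x(n+1) = x(n) - eta * grad f(x(n))
    return w

def stochastic_gd(X, y, eta=0.01, n_epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in np.random.permutation(len(y)):   # take a step immediately after each point
            grad_i = -2 * (y[i] - X[i] @ w) * X[i]
            w = w - eta * grad_i
    return w
```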
Perceptron learning

for the Perceptron this leads to:

set k = 0, w_0 = 0, b_0 = 0
set R = max_i ||x_i||
do {
  for i = 1:n {
    if y_i (w_k^T x_i + b_k) < 0 then {
      w_{k+1} = w_k + η y_i x_i
      b_{k+1} = b_k + η y_i R²
      k = k + 1
    }
  }
} until y_i (w_k^T x_i + b_k) ≥ 0, ∀i (no errors)
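A minimal runnable sketch of this update rule (not the author's code); labels are assumed to be ±1, max_epochs is an illustrative safeguard for non-separable data, and the error test uses ≤ 0 so that the all-zero initialization still triggers updates.

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=1000):
    w = np.zeros(X.shape[1])
    b = 0.0
    R2 = np.max(np.sum(X ** 2, axis=1))       # R^2 = max_i ||x_i||^2
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:         # point misclassified (or on the boundary)
                w = w + eta * yi * xi          # w_{k+1} = w_k + eta * y_i * x_i
                b = b + eta * yi * R2          # b_{k+1} = b_k + eta * y_i * R^2
                errors += 1
        if errors == 0:                        # no errors: separating hyperplane found
            break
    return w, b
```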
Perceptron learning

the interesting part is that this is guaranteed to converge in finite time

Theorem: Let D = {(x_1, y_1), ..., (x_n, y_n)} and

    R = max_i ||x_i||

If there is (w*, b*) such that ||w*|| = 1 and

    y_i (w*^T x_i + b*) ≥ γ,  ∀i

then the Perceptron will find an error-free hyperplane in at most

    (2R/γ)²  iterations

the margin γ appears as a measure of the difficulty of the learning problem
Some history

Minsky and Papert identified serious problems
• there are very simple logic problems that the Perceptron cannot solve

it was later realized that these can be eliminated by relying on a multi-layered Perceptron (MLP) or "neural network"
this is a cascade of Perceptrons where
• the x_i are the input units
• layer 1: h_j(x) = sgn[w_j^T x]
• layer 2: u(x) = sgn[w^T h(x)]
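The classic example of such a logic problem is XOR, which no single Perceptron can represent but a two-layer cascade can; the sketch below (not from the slides, with hand-picked rather than learned weights) shows one such network.

```python
# XOR computed by a 2-layer cascade of threshold units with hand-picked weights.
def step(a):
    return 1 if a > 0 else 0

def xor_mlp(x1, x2):
    h1 = step(x1 + x2 - 0.5)      # layer 1, unit 1: OR
    h2 = step(1.5 - x1 - x2)      # layer 1, unit 2: NAND
    return step(h1 + h2 - 1.5)    # layer 2: AND of the two hidden units

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))   # prints the XOR truth table
```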
Graphical representation

the Perceptron is usually represented as

[figure: input units feeding a single output unit through the weights w_i, plus a bias term w_0]

    h(x) = sgn( Σ_i w_i x_i + w_0 ) = sgn( w^T x )

input units: coordinates of x
weights: coordinates of w
homogeneous coordinates: x = (x, 1)^T
Sigmoids

the sgn[·] function is problematic in two ways:
• no derivative at 0
• non-smooth

it can be approximated in various ways, for example by the hyperbolic tangent

    s(x) = tanh(σx) = (e^{σx} - e^{-σx}) / (e^{σx} + e^{-σx})

σ controls the approximation error; unlike sgn, s(x)
• has a derivative everywhere
• is smooth

neural networks are implemented with these functions

[figure: plots of s(x), s'(x), and s''(x)]
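A small sketch (function names assumed) of this activation and its derivative, which the gradient computations below will need:

```python
# s(x) = tanh(sigma * x) and its derivative s'(x) = sigma * (1 - tanh(sigma * x)^2)
import numpy as np

def s(x, sigma=1.0):
    return np.tanh(sigma * x)

def s_prime(x, sigma=1.0):
    return sigma * (1.0 - np.tanh(sigma * x) ** 2)
```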
Neural network

the MLP as function approximation
Two modes of operation

normal mode, after training: feedforward

[figure: enter x at the input, propagate forward (feedforward), collect z at the output]
Two modes of operation

training mode: backpropagation

[figure: enter x at the input, feedforward to obtain z, compare z to the target t, e = ||z - t||², then backpropagate to update the weights so as to minimize the error]
Backpropagation

is just gradient descent

at the end of the day, the output z is just a big function of
• the input vector x
• the weights, which we can represent by a big vector W
• e.g.

    z(x; W) = s[ Σ_{j=1}^J v_j s( Σ_i w_{ji} x_i ) ],    with W = (v, w)

objective:

    W* = argmin_W J(W)    with    J(W) = Σ_{i=1}^n [ t_i - z(x_i; W) ]²
Backpropagation

this is conceptually trivial, but computing the gradient of J looks quite messy
• e.g. for

    z = s[ Σ_{j=1}^J v_j s( Σ_i w_{ji} x_i ) ]

• what is ∂z/∂w_13 ?

it turns out that it is possible to do this easily, by doing a certain amount of book-keeping
the solution is the backpropagation algorithm, which is based on local updates
the key to understanding it is to make the right definitions
In detail

notation:
• input i: x_i
• weight from input i to hidden unit j: w_{ji}
• hidden unit j:  g_j = Σ_i w_{ji} x_i  (before sigmoid),  y_j = s[g_j]  (after sigmoid)
• weight from hidden unit j to output unit k: w_{kj}
• output unit k:  u_k = Σ_j w_{kj} y_j  (before sigmoid),  z_k = s[u_k]  (after sigmoid)

[figure: network diagram  x_i --w_{ji}--> (g_j, y_j) --w_{kj}--> (u_k, z_k)]
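A minimal sketch (weight-matrix layout and the use of tanh are assumptions) of the forward pass in this notation:

```python
# W_hidden[j, i] = w_ji, W_out[k, j] = w_kj, s = tanh
import numpy as np

def forward(x, W_hidden, W_out):
    g = W_hidden @ x      # g_j = sum_i w_ji x_i   (before sigmoid)
    y = np.tanh(g)        # y_j = s[g_j]           (after sigmoid)
    u = W_out @ y         # u_k = sum_j w_kj y_j   (before sigmoid)
    z = np.tanh(u)        # z_k = s[u_k]           (after sigmoid)
    return g, y, u, z
```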
Computing the gradient of J

the key is the chain rule

the output layer is easy

    ∂J/∂w_{kj} = (∂J/∂u_k) (∂u_k/∂w_{kj}) = -δ_k y_j        (*)

where

    δ_k = -∂J/∂u_k = -(∂J/∂z_k) (∂z_k/∂u_k) = (t_k - z_k) s'(u_k)        (**)

is the sensitivity of unit k

with  J = (1/2) Σ_i [t_i - z_i]²,   u_k = Σ_j w_{kj} y_j,   z_k = s[u_k]
Computing the gradient of J

for the hidden layer

    ∂J/∂w_{ji} = (∂J/∂y_j) (∂y_j/∂g_j) (∂g_j/∂w_{ji}) = (∂J/∂y_j) s'(g_j) x_i

this is a little more subtle, since J depends on y_j through all the z_k

with  J = (1/2) Σ_i [t_i - z_i]²,   g_j = Σ_i w_{ji} x_i,   y_j = s[g_j]
Computing the gradient of J

    ∂J/∂y_j = ∂/∂y_j { (1/2) Σ_k (t_k - z_k)² }
            = -Σ_k (t_k - z_k) (∂z_k/∂y_j)
            = -Σ_k (t_k - z_k) (∂z_k/∂u_k) (∂u_k/∂y_j)
            = -Σ_k (t_k - z_k) s'(u_k) w_{kj}

and, from (**),

    ∂J/∂y_j = -Σ_k δ_k w_{kj}

overall

    ∂J/∂w_{ji} = -[ Σ_k δ_k w_{kj} ] s'(g_j) x_i

with  J = (1/2) Σ_i [t_i - z_i]²,   u_k = Σ_j w_{kj} y_j,   z_k = s[u_k]
Computing the gradient of J

    ∂J/∂w_{ji} = -[ Σ_k δ_k w_{kj} ] s'(g_j) x_i

by analogy with (*)

    ∂J/∂w_{ji} = -δ_j x_i

with

    δ_j = s'(g_j) Σ_k δ_k w_{kj}
In summary

[figure: units i and j connected by the weight w_{ji}]

for any pair (i, j)

    ∂J/∂w_{ji} = -δ_j x_i

with

    δ_j = s'(g_j) Σ_k w_{kj} δ_k      if j is hidden, and

    δ_j = (t_j - z_j) s'(u_j)         if j is output.

The weight updates are

    w_{ji}^(n+1) = w_{ji}^(n) - η ∂J/∂w_{ji}

the error is "backpropagated" by local message passing!
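A minimal runnable sketch (not the author's code; the single hidden layer, tanh units with σ = 1, and the array shapes are illustrative assumptions) of one such update:

```python
# One backpropagation update for a 1-hidden-layer MLP, cost J = 1/2 sum_k (t_k - z_k)^2.
import numpy as np

def backprop_step(x, t, W_hidden, W_out, eta=0.1):
    # forward pass (notation of the slides)
    g = W_hidden @ x                                 # g_j = sum_i w_ji x_i
    y = np.tanh(g)                                   # y_j = s[g_j]
    u = W_out @ y                                    # u_k = sum_j w_kj y_j
    z = np.tanh(u)                                   # z_k = s[u_k]

    # sensitivities
    delta_k = (t - z) * (1.0 - z ** 2)               # delta_k = (t_k - z_k) s'(u_k)
    delta_j = (1.0 - y ** 2) * (W_out.T @ delta_k)   # delta_j = s'(g_j) sum_k w_kj delta_k

    # gradient descent: w <- w - eta*dJ/dw, with dJ/dw_kj = -delta_k y_j, dJ/dw_ji = -delta_j x_i
    W_out = W_out + eta * np.outer(delta_k, y)
    W_hidden = W_hidden + eta * np.outer(delta_j, x)
    return W_hidden, W_out
```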
Feature transformation

the MLP can be seen as: non-linear feature transformation + linear discriminant

[figure: the hidden layers implement a feature transformation  y = Φ(x), followed by a Perceptron acting on y]
Feature transformation

the feature transformation
• searches for the space where the patterns become separable

example:
• two-class problem
• 2-1 network
• not linearly separable in the space of the Xs
• made linearly separable in the space of the Ys
• the figure shows the evolution of the Ys and of the training error
Feature transformation

Q: is separability always possible?
A: not really, it depends on the number of units

example:
• two-class problem
• a 2-1 network is not enough
• but a 3-1 network is

in practice
• an art-form
• trial and error
Other problems

the optimization surface can be quite nasty

example:
• scalar problem
• 1-1 network
• the cost has many "plateaus"
• the globally optimal solution has no error
• but the gradient is frequently close to zero
• slow progress

in general: one plateau per training point
• improves with more points, degrades with more weights (dimensions)
Other problems

how do we set the learning rate η?
• if too small or too big, we will need many iterations
• it could even diverge

line search (sketched below):
• pick η^(0)
• compute  x' = x^(n) - η^(0) ∇f(x^(n))  and then f(x')
• if not good, make η^(k+1) = α η^(k) (with α < 1) and repeat
• until you get a minimum of f(x) along the step  x^(n+1) = x^(n) - η ∇f(x^(n))
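A minimal sketch (not from the slides; f and grad_f are assumed callables, and the defaults for η^(0), α, and the number of tries are illustrative) of this backtracking line search:

```python
import numpy as np

def backtracking_step(f, grad_f, x, eta0=1.0, alpha=0.5, max_tries=30):
    g = grad_f(x)
    eta = eta0
    for _ in range(max_tries):
        x_new = x - eta * g          # x' = x(n) - eta * grad f(x(n))
        if f(x_new) < f(x):          # "good": the step actually decreases f
            return x_new
        eta = alpha * eta            # otherwise shrink: eta(k+1) = alpha * eta(k)
    return x                         # give up and keep the current estimate
```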
Structural risk minimization

what about complexity penalties, overfitting, and all that?

SRM, in general:
1. start from a nested collection of families of functions

    S_1 ⊂ ... ⊂ S_k

2. for each S_i, find the set of parameters that minimizes the empirical risk
3. select the function class such that

    R* = min_i { R_emp,i + Φ(h_i) }

where Φ(h) is a function of the VC dimension (complexity) of the family S_i

this can be done by
1. S_i = {MLPs such that ||W||² < λ_i}
2. backpropagation in this family
Structural risk minimization

instead of

    W* = argmin_W J(W),    with    J(W) = Σ_{i=1}^n [ t_i - z(x_i; W) ]²

solve

    W* = argmin_W J(W)    subject to    W^T W < λ

we will see that this is equivalent to

    W* = argmin_W { J(W) + (ε/2η) W^T W }

re-working out backpropagation, this can be done by "shrinking" (sketched below)
• after each weight update do   w_new = (1 - ε) w_old
• this is known as "weight decay" and penalizes complex models
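A minimal sketch (names and defaults are illustrative) of a backpropagation update followed by the shrink step:

```python
import numpy as np

def decayed_update(W, grad_J, eta=0.1, eps=1e-3):
    W = W - eta * grad_J     # ordinary gradient-descent step on J
    W = (1.0 - eps) * W      # shrink w_new = (1 - eps) w_old, i.e. weight decay
    return W
```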
In summary

this works, but requires tuning ε
the cost surface is nasty
one needs to try different architectures
hence, training can be painfully slow
• "weeks" is quite common
• a good neural network may take years to train

however, when you are finished, it tends to work well

examples:
• the Rowley and Kanade face detector
• the LeCun digit recognizer (see http://yann.lecun.com/exdb/lenet/index.html)
Rowley and Kanade

"Neural network-based face detection"
Rowley, H.A.; Baluja, S.; Kanade, T.
IEEE Transactions on PAMI, Volume 20, Issue 1, Jan. 1998

the face detector:
Results
Tricks (good for any learning algorithm)

expand the training set to cover most of the variation
• a more exhaustive training set always produces better results than a less exhaustive one
• if you can create interesting examples artificially, then by all means do so (see the sketch below)
• e.g. in vision: rotate, scale, translate
• independently of what algorithm you are using
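A minimal sketch (assuming 2-D grayscale numpy images and the scipy.ndimage transforms; the particular angle, scale, and shift are illustrative) of creating artificial examples this way:

```python
import numpy as np
from scipy import ndimage

def augment(image, angle_deg=10.0, scale=1.1, shift_px=(2, -3)):
    rotated = ndimage.rotate(image, angle_deg, reshape=False)  # rotate about the center
    scaled = ndimage.zoom(rotated, scale)                      # rescale the image
    shifted = ndimage.shift(scaled, shift_px)                  # translate by a few pixels
    return shifted
```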
Tricks (good for any learning algorithm)

where do I get negative examples?
• finding a good negative example is difficult
• use the classifier itself to do it (see the sketch after this list):
  1. put together a training set D_1
  2. train classifier C_k with training set D_k
  3. run it on a dataset that has no positive examples
  4. make D_{k+1} = {examples classified as positive} ∪ D_k
  5. go to 2
• e.g. "close" non-face examples
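A minimal sketch of this bootstrapping loop (the train/predict interfaces, the ±1 labels, and the defaults are assumptions, not the author's code):

```python
import numpy as np

def mine_hard_negatives(train, predict, X_pos, X_neg_pool, n_init_neg=100, n_rounds=3):
    rng = np.random.default_rng(0)
    idx = rng.choice(len(X_neg_pool), n_init_neg, replace=False)
    X = np.concatenate([X_pos, X_neg_pool[idx]])             # D_1: positives + random negatives
    y = np.concatenate([np.ones(len(X_pos)), -np.ones(n_init_neg)])
    clf = None
    for _ in range(n_rounds):
        clf = train(X, y)                                     # 2. train C_k on D_k
        scores = predict(clf, X_neg_pool)                     # 3. run on data with no positives
        hard = X_neg_pool[scores > 0]                         # examples (wrongly) classified positive
        X = np.concatenate([X, hard])                         # 4. D_{k+1} = hard negatives U D_k
        y = np.concatenate([y, -np.ones(len(hard))])
    return clf
```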