
Page 1: Neural networks

Nuno Vasconcelos, ECE Department, UCSD

Page 2: Classification

a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world

e.g.
• x ∈ X ⊂ R² = (fever, blood pressure)

• y ∈ Y = {disease, no disease}

X and Y are related by an (unknown) function

goal: design a classifier h: X → Y such that h(x) = f(x) ∀x

$$y = f(x)$$

Page 3: Perceptron

the classifier implements the linear decision rule

$$h(x) = \mathrm{sgn}[g(x)]$$

with

$$g(x) = w^T x + b$$

[figure: hyperplane geometry, showing w, x, g(x)/||w|| and b/||w||]

learning is formulated as an optimization problem
• define the set of errors

$$E = \{ x_i \mid y_i (w^T x_i + b) < 0 \}$$

• define the cost

$$J_p(w, b) = -\sum_{x_i \in E} y_i \left( w^T x_i + b \right)$$

• and minimize
• Jp cannot be negative since, in E, all yi(wTxi + b) are negative
• at zero we know we have the best possible solution (E empty)
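To make the cost concrete, here is a minimal NumPy sketch (mine, not from the slides) that evaluates J_p and its gradient; the function name and the toy data are made up for illustration.

```python
# Minimal sketch of the Perceptron criterion J_p and its gradient (illustrative).
import numpy as np

def perceptron_cost_and_grad(w, b, X, y):
    """J_p(w,b) = -sum_{i in E} y_i (w^T x_i + b), where E is the set of errors."""
    margins = y * (X @ w + b)                       # y_i (w^T x_i + b) for every point
    E = margins < 0                                 # misclassified points
    J = -np.sum(margins[E])                         # non-negative, zero when E is empty
    grad_w = -np.sum(y[E][:, None] * X[E], axis=0)  # dJ/dw = -sum_{i in E} y_i x_i
    grad_b = -np.sum(y[E])                          # dJ/db = -sum_{i in E} y_i
    return J, grad_w, grad_b

# toy 2D example (hypothetical data)
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(perceptron_cost_and_grad(np.array([1.0, -1.0]), 0.0, X, y))
```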

Page 4: Gradient descent

gradient descent is the simplest possible minimization technique
• pick an initial estimate x^(0)
• follow the negative gradient

$$x^{(n+1)} = x^{(n)} - \eta \nabla f\left( x^{(n)} \right)$$

usually the gradient is a function of the entire training set D
a more efficient alternative is "stochastic gradient descent"
• take the step immediately after each point
• no guarantee this is a descent step but, on average, you follow the same direction after processing the entire D
• very popular in learning, where D is usually large

[figure: f(x), with a step of -η∇f(x^(n)) taken from x^(n)]
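The following is a small illustrative sketch (not part of the slides) contrasting the batch rule above with its stochastic variant; the function names and the toy quadratic cost are assumptions.

```python
# Minimal sketch of (stochastic) gradient descent on a generic cost.
import numpy as np

def gradient_descent(grad_f, x0, eta=0.1, steps=100):
    """Batch rule: x_{n+1} = x_n - eta * grad f(x_n)."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= eta * grad_f(x)
    return x

def stochastic_gradient_descent(grad_fi, x0, n_points, eta=0.1, epochs=10):
    """Take a step immediately after each training point i."""
    x = np.array(x0, dtype=float)
    for _ in range(epochs):
        for i in np.random.permutation(n_points):
            x -= eta * grad_fi(x, i)   # not guaranteed to descend at every step
    return x

# toy cost f(x) = sum_i (x - a_i)^2 with made-up anchors a_i
a = np.array([1.0, 2.0, 3.0, 4.0])
print(gradient_descent(lambda x: 2 * np.sum(x - a), np.zeros(1), eta=0.05))
print(stochastic_gradient_descent(lambda x, i: 2 * (x - a[i]), np.zeros(1), len(a), eta=0.05))
```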

Page 5: Perceptron learning

for the Perceptron this leads to:

set k = 0, w_0 = 0, b_0 = 0
set R = max_i ||x_i||
do {
    for i = 1:n {
        if y_i (w_k^T x_i + b_k) < 0 then {
            w_{k+1} = w_k + η y_i x_i
            b_{k+1} = b_k + η y_i R²
            k = k + 1
        }
    }
} until y_i (w_k^T x_i + b_k) ≥ 0, ∀ i   (no errors)
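Below is a runnable sketch of this update rule (my own NumPy translation, not the author's code); the toy data is made up, and the error test uses ≤ 0 instead of < 0 so that learning actually starts from the all-zeros initialization.

```python
# Sketch of the Perceptron learning rule from the slide (NumPy, illustrative only).
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    w = np.zeros(X.shape[1])
    b = 0.0
    R2 = np.max(np.sum(X**2, axis=1))       # R^2 = max_i ||x_i||^2
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:       # misclassified (or on the boundary)
                w += eta * yi * xi           # w_{k+1} = w_k + eta * y_i * x_i
                b += eta * yi * R2           # b_{k+1} = b_k + eta * y_i * R^2
                errors += 1
        if errors == 0:                      # stop when no errors remain
            break
    return w, b

# linearly separable toy data (made up)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(w, b, np.sign(X @ w + b))
```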

Page 6: Perceptron learning

the interesting part is that this is guaranteed to converge in finite time

Theorem: Let D = {(x_1, y_1), ..., (x_n, y_n)} and

$$R = \max_i \|x_i\|$$

If there is (w*, b*) such that ||w*|| = 1 and

$$y_i \left( {w^*}^T x_i + b^* \right) > \gamma, \quad \forall i$$

then the Perceptron will find an error-free hyperplane in at most

$$\left( \frac{2R}{\gamma} \right)^2 \text{ iterations}$$

the margin γ appears as a measure of the difficulty of the learning problem

Page 7: Some history

Minsky and Papert identified serious problems
• there are very simple logic problems that the Perceptron cannot solve

it was later realized that these can be eliminated by relying on a multi-layered Perceptron (MLP) or "neural network"
this is a cascade of Perceptrons where
• x_i are the input units
• layer 1: h_j(x) = sgn[w_j^T x]
• layer 2: u(x) = sgn[w^T h(x)]

Page 8: Graphical representation

the Perceptron is usually represented as

$$h(x) = \mathrm{sgn}\left( \sum_i w_i x_i + w_0 \right) = \mathrm{sgn}\left( w^T x \right)$$

(w_0 is the bias term)

input units: coordinates of x
weights: coordinates of w
homogeneous coordinates: x = (x, 1)^T

Page 9: Sigmoids

the sgn[.] function is problematic in two ways:
• no derivative at 0
• non-smooth

it can be approximated in various ways, for example by the hyperbolic tangent

$$s(x) = \tanh(\sigma x) = \frac{e^{\sigma x} - e^{-\sigma x}}{e^{\sigma x} + e^{-\sigma x}}$$

σ controls the approximation error, but unlike sgn[.] the approximation
• has a derivative everywhere
• is smooth

neural networks are implemented with these functions

[figure: s(x), s'(x), s''(x)]
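As a small illustration (mine, not from the slides), here are the tanh sigmoid and the derivative that backpropagation will need; sigma plays the role of σ above.

```python
# tanh sigmoid and its derivative; sigma controls how closely it approximates sgn.
import numpy as np

def s(x, sigma=1.0):
    return np.tanh(sigma * x)

def s_prime(x, sigma=1.0):
    # d/dx tanh(sigma*x) = sigma * (1 - tanh(sigma*x)^2): defined and smooth everywhere
    return sigma * (1.0 - np.tanh(sigma * x) ** 2)

x = np.linspace(-3, 3, 7)
print(s(x, sigma=5.0))        # large sigma -> close to sgn(x)
print(s_prime(x, sigma=5.0))
```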

Page 10: Neural network

the MLP as function approximation

Page 11: Two modes of operation

normal mode, after training: feedforward

[figure: enter x at the input, feedforward through the network, collect z at the output]

Page 12: Two modes of operation

training mode: backpropagation

[figure: enter x at the input, feedforward to produce z, compare z to the target t (e = ||z - t||²), then backpropagate to update the weights so as to minimize the error]

Page 13: Backpropagation

is just gradient descent
at the end of the day, the output z is just a big function of
• the input vector x
• the weights, which we can represent by a big vector W
• e.g.

$$z = s\left[ \sum_{j=1}^{J} v_j \, s\left( \sum_i w_{ji} x_i \right) \right] = z(x; W), \quad \text{with } W = (v, w)$$

objective:

$$W^* = \arg\min_W J(W), \quad \text{with } J(W) = \sum_{i=1}^{n} \left[ t_i - z(x_i; W) \right]^2$$

Page 14: Backpropagation

this is conceptually trivial, but computing the gradient of J looks quite messy
• e.g. for

$$z = s\left[ \sum_{j=1}^{J} v_j \, s\left( \sum_i w_{ji} x_i \right) \right]$$

• what is dz/dw_{13}?

it turns out that it is possible to do this easily by doing a certain amount of book-keeping
the solution is the backpropagation algorithm, which is based on local updates
the key to understanding it is to make the right definitions

Page 15: In detail

notation:
• input i: x_i
• weight to hidden unit j: w_{ji}
• hidden unit j: $g_j = \sum_i w_{ji} x_i$ (before sigmoid), $y_j = s[g_j]$ (after sigmoid)
• weight to output unit k: w_{kj}
• output unit k: $u_k = \sum_j w_{kj} y_j$ (before sigmoid), $z_k = s[u_k]$ (after sigmoid)

[figure: network diagram x_i → (w_ji) → g_j, y_j → (w_kj) → u_k, z_k]

Page 16: Computing the gradient of J

the key is the chain rule
the output layer is easy

$$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial u_k} \frac{\partial u_k}{\partial w_{kj}} = -\delta_k y_j \qquad (*)$$

where

$$\delta_k = -\frac{\partial J}{\partial u_k} = -\frac{\partial J}{\partial z_k} \frac{\partial z_k}{\partial u_k} = (t_k - z_k)\, s'(u_k) \qquad (**)$$

is the sensitivity of unit k

(recall: $J = \frac{1}{2}\sum_{i=1}^{n}\left[ t_i - z_i \right]^2$, $u_k = \sum_j w_{kj} y_j$, $z_k = s[u_k]$)

Page 17: Computing the gradient of J

for the hidden layer this is a little more subtle, since J depends on y_j through all the z_k

$$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial g_j} \frac{\partial g_j}{\partial w_{ji}} = \frac{\partial J}{\partial y_j}\, s'(g_j)\, x_i$$

(recall: $J = \frac{1}{2}\sum_{i=1}^{n}\left[ t_i - z_i \right]^2$, $g_j = \sum_i w_{ji} x_i$, $y_j = s[g_j]$)

Page 18: Computing the gradient of J

$$\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j}\left[ \frac{1}{2}\sum_k (t_k - z_k)^2 \right] = -\sum_k (t_k - z_k) \frac{\partial z_k}{\partial y_j} = -\sum_k (t_k - z_k) \frac{\partial z_k}{\partial u_k} \frac{\partial u_k}{\partial y_j} = -\sum_k (t_k - z_k)\, s'(u_k)\, w_{kj}$$

and, from (**),

$$\frac{\partial J}{\partial y_j} = -\sum_k \delta_k w_{kj}$$

overall

$$\frac{\partial J}{\partial w_{ji}} = -\left[ \sum_k \delta_k w_{kj} \right] s'(g_j)\, x_i$$

(recall: $u_k = \sum_j w_{kj} y_j$, $z_k = s[u_k]$)

Page 19: Computing the gradient of J

by analogy with (*)

$$\frac{\partial J}{\partial w_{ji}} = -\left[ \sum_k \delta_k w_{kj} \right] s'(g_j)\, x_i = -\delta_j x_i$$

with

$$\delta_j = s'(g_j) \sum_k \delta_k w_{kj}$$

Page 20: In summary

[figure: unit i feeds unit j through the weight w_ji]

for any pair (i, j)

$$\frac{\partial J}{\partial w_{ji}} = -\delta_j x_i$$

with

$$\delta_j = s'(g_j) \sum_k \delta_k w_{kj} \quad \text{if } j \text{ is hidden, and} \quad \delta_j = (t_j - z_j)\, s'(u_j) \quad \text{if } j \text{ is output}$$

The weight updates are

$$w_{ji}^{(n+1)} = w_{ji}^{(n)} - \eta \frac{\partial J}{\partial w_{ji}}$$

the error is "backpropagated" by local message passing!
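To tie the pieces together, here is a compact sketch (an illustration with my own naming, not code from the slides) of one backpropagation step for a single-hidden-layer network with tanh units and squared error, following the δ recursions above; the toy sizes and data are assumptions.

```python
# One backprop step for a 1-hidden-layer MLP with tanh units and squared error,
# following the delta recursions in the slides (illustrative sketch).
import numpy as np

def s(a):        return np.tanh(a)
def s_prime(a):  return 1.0 - np.tanh(a) ** 2

def backprop_step(W1, W2, x, t, eta=0.1):
    # forward pass
    g = W1 @ x          # g_j = sum_i w_ji x_i   (before sigmoid, hidden)
    y = s(g)            # y_j = s[g_j]           (after sigmoid, hidden)
    u = W2 @ y          # u_k = sum_j w_kj y_j   (before sigmoid, output)
    z = s(u)            # z_k = s[u_k]           (after sigmoid, output)
    # sensitivities
    delta_out = (t - z) * s_prime(u)             # delta_k = (t_k - z_k) s'(u_k)
    delta_hid = s_prime(g) * (W2.T @ delta_out)  # delta_j = s'(g_j) sum_k delta_k w_kj
    # gradient step: w <- w - eta * dJ/dw, with dJ/dw_ji = -delta_j x_i
    W2 += eta * np.outer(delta_out, y)
    W1 += eta * np.outer(delta_hid, x)
    return W1, W2, 0.5 * np.sum((t - z) ** 2)

# toy example: 2 inputs, 3 hidden units, 1 output (made-up numbers)
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, t = np.array([0.5, -1.0]), np.array([1.0])
for _ in range(5):
    W1, W2, err = backprop_step(W1, W2, x, t)
    print(err)   # the squared error decreases step by step
```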

Page 21: Feature transformation

the MLP can be seen as: non-linear feature transformation + linear discriminant

$$y = \Phi(x)$$

[figure: the hidden layers implement the feature transformation Φ; the output unit is a Perceptron]

Page 22: Feature transformation

the feature transformation
• searches for the space where the patterns become separable

example:
• two class problem
• 2-1 network
• not linearly separable in the space of Xs
• made linearly separable in the space of Ys
• the figure shows the evolution of the Ys and the training error

Page 23: Feature transformation

Q: is separability always possible?
A: not really, it depends on the number of units

example:
• two class problem
• a 2-1 network is not enough
• but a 3-1 network is

in practice
• an art-form
• trial and error

Page 24: Other problems

the optimization surface can be quite nasty
example:
• scalar problem
• 1-1 network
• the cost has many "plateaus"
• the globally optimal solution has no error
• but the gradient is frequently close to zero
• slow progress

in general: one plateau per training point
• improves with more points, degrades with more weights (dimensions)

Page 25: Other problems

how do we set the learning rate η in $x^{(n+1)} = x^{(n)} - \eta \nabla f\left( x^{(n)} \right)$?
• if it is too small or too big, we will need many iterations
• it could even diverge

line search:
• pick η^(0)
• compute $x' = x^{(n)} - \eta^{(0)} \nabla f\left( x^{(n)} \right)$ and then f(x')
• if not good, make η^(k+1) = α η^(k) (with α < 1) and repeat
• until you get a minimum of f(x) (a sketch of this line search follows)
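A minimal sketch of the backtracking line search just described (illustrative only; the shrink factor alpha, the initial eta0, and the toy cost are assumptions):

```python
# Sketch of a backtracking line search: shrink eta until the step decreases f.
import numpy as np

def backtracking_step(f, grad_f, x, eta0=1.0, alpha=0.5, max_tries=20):
    d = grad_f(x)
    eta = eta0
    for _ in range(max_tries):
        x_new = x - eta * d              # x' = x - eta * grad f(x)
        if f(x_new) < f(x):              # "good" step: the cost went down
            return x_new, eta
        eta *= alpha                     # otherwise eta_{k+1} = alpha * eta_k
    return x, eta                        # give up: keep the current point

# toy quadratic cost (made up)
f = lambda x: np.sum((x - 3.0) ** 2)
grad_f = lambda x: 2.0 * (x - 3.0)
x = np.array([0.0])
for _ in range(10):
    x, eta = backtracking_step(f, grad_f, x)
print(x)
```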

Page 26: Structural risk minimization

what about complexity penalties, overfitting and all that?

SRM, in general:
1. start from a nested collection of families of functions

$$S_1 \subset \cdots \subset S_k$$

2. for each S_i, find the set of parameters that minimizes the empirical risk
3. select the function class such that

$$R^* = \min_i \left\{ R_{emp}^{\,i} + \Phi(h_i) \right\}$$

where Φ(h) is a function of the VC dimension (complexity) of the family S_i

this can be done by
1. S_i = {MLPs such that ||W||² < λ_i}
2. backpropagation in this family

Page 27: Structural risk minimization

instead of

$$W^* = \arg\min_W J(W), \quad \text{with } J(W) = \sum_i \left[ t_i - z(x_i; W) \right]^2$$

solve

$$W^* = \arg\min_W J(W) \quad \text{subject to} \quad W^T W < \lambda$$

we will see that this is equivalent to

$$W^* = \arg\min_W \left\{ J(W) + \frac{\varepsilon}{2\eta} W^T W \right\}$$

re-working out backpropagation, this can be done by "shrinking"
• after each weight update do

$$w_{new} = (1 - \varepsilon)\, w_{old}$$

• this is known as "weight decay" and penalizes complex models (a small code sketch follows)
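A minimal sketch (not from the slides) of one weight update with the shrinkage step; eta and eps are assumed hyper-parameters.

```python
# Gradient step followed by the "weight decay" shrinkage w_new = (1 - eps) * w_old.
import numpy as np

def decayed_update(W, grad_J, eta=0.1, eps=0.01):
    W = W - eta * grad_J          # usual backpropagation / gradient step
    W = (1.0 - eps) * W           # shrink: penalizes large ||W||^2 (complex models)
    return W

# toy usage with a made-up gradient
W = np.array([[0.5, -1.2], [2.0, 0.3]])
grad_J = np.array([[0.1, -0.2], [0.4, 0.0]])
print(decayed_update(W, grad_J))
```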

Page 28: In summary

this works, but requires tuning ε
the cost surface is nasty
one needs to try different architectures
hence, training can be painfully slow
• "weeks" is quite common
• a good neural network may take years to train

however, when you are finished it tends to work well
examples:
• the Rowley and Kanade face detector
• the LeCun digit recognizer (see http://yann.lecun.com/exdb/lenet/index.html)

Page 29: Rowley and Kanade

Neural network-based face detection
Rowley, H.A.; Baluja, S.; Kanade, T.
IEEE Transactions on PAMI, Volume 20, Issue 1, Jan. 1998

the face detector:

Page 30: Results

Page 31: Tricks (good for any learning algorithm)

expand the training set to cover most of the variation
• a more exhaustive training set always produces better results than a less exhaustive one
• if you can create interesting examples artificially, then by all means do so
• e.g. in vision: rotate, scale, translate (a sketch follows below)
• this holds independently of what algorithm you are using
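As one possible illustration (assuming scipy.ndimage is available; the parameter ranges are made up), artificially expanding an image training set might look like:

```python
# Sketch of expanding an image training set by rotation, translation and scaling.
import numpy as np
from scipy import ndimage

def augment(img, rng):
    out = [img]
    out.append(ndimage.rotate(img, angle=rng.uniform(-15, 15), reshape=False))  # rotate
    out.append(ndimage.shift(img, shift=rng.integers(-2, 3, size=2)))           # translate
    out.append(ndimage.zoom(img, rng.uniform(0.9, 1.1)))                         # scale
    return out

rng = np.random.default_rng(0)
img = rng.random((28, 28))            # stand-in for a training image
augmented = augment(img, rng)
print([a.shape for a in augmented])
```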

Page 32: Tricks (good for any learning algorithm)

where do I get negative examples?
• finding a good negative example is difficult
• use the classifier itself to do it (sketched below):

1. put together training set D_1
2. train classifier C_k with training set D_k
3. run it on a dataset that has no positive examples
4. make D_{k+1} = {examples classified as positive} ∪ D_k
5. goto 2.

• e.g. "close" non-face examples
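A minimal sketch of this bootstrapping loop (hypothetical names throughout; train and predict stand in for whatever classifier is being used):

```python
# Bootstrapping negative examples: grow the training set with the false positives
# of the current classifier. `train` and `predict` are placeholders for the learner.
import numpy as np

def bootstrap_negatives(D_pos, D_neg, negative_pool, train, predict, rounds=3):
    """negative_pool is a dataset known to contain no positive examples."""
    for _ in range(rounds):
        X = np.vstack([D_pos, D_neg])
        y = np.concatenate([np.ones(len(D_pos)), -np.ones(len(D_neg))])
        clf = train(X, y)                         # 2. train C_k on D_k
        scores = predict(clf, negative_pool)      # 3. run on positive-free data
        false_pos = negative_pool[scores > 0]     # examples classified as positive
        if len(false_pos) == 0:
            break
        D_neg = np.vstack([D_neg, false_pos])     # 4. D_{k+1} = false positives U D_k
    return D_neg

# toy linear "classifier": sign of projection onto the difference of class means
def train(X, y):
    return X[y > 0].mean(axis=0) - X[y < 0].mean(axis=0)
def predict(w, X):
    return X @ w

rng = np.random.default_rng(0)
D_pos = rng.normal(2.0, 1.0, size=(20, 2))
D_neg = rng.normal(-2.0, 1.0, size=(5, 2))
pool = rng.normal(-0.5, 2.0, size=(200, 2))   # stand-in for positive-free data
print(len(bootstrap_negatives(D_pos, D_neg, pool, train, predict)))
```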
