Neural networks
Nuno Vasconcelos ECE Department, UCSD
Classification

a classification problem has two types of variables
• X - vector of observations (features) in the world
• Y - state (class) of the world

e.g.
• x ∈ X ⊂ R² = (fever, blood pressure)
• y ∈ Y = {disease, no disease}

X and Y are related by an (unknown) function f(·)

    y = f(x)

goal: design a classifier h: X → Y such that h(x) = f(x), ∀x
Perceptron

the classifier implements the linear decision rule

    h(x) = sgn[g(x)]    with    g(x) = w^T x + b

[figure: the hyperplane g(x) = 0, with normal w, offset b/||w||, and distance g(x)/||w|| from a point x to the plane]

learning is formulated as an optimization problem
• define the set of errors

    E = { x_i | y_i (w^T x_i + b) < 0 }

• define the cost

    J_p(w, b) = - Σ_{x_i ∈ E} y_i (w^T x_i + b)

• and minimize
• J_p cannot be negative since, in E, all y_i (w^T x_i + b) are negative
• at zero we know we have the best possible solution (E empty)
Gradient descent

the simplest possible minimization technique
• pick an initial estimate x^(0)
• follow the negative gradient

    x^(n+1) = x^(n) - η ∇f(x^(n))

[figure: one step of size η along -∇f(x^(n)) from the current estimate x^(n) on the curve f(x)]

usually the gradient is a function of the entire training set D
a more efficient alternative is "stochastic gradient descent" (sketched below)
• take the step immediately after each point
• no guarantee this is a descent step but, on average, you follow the same direction after processing the entire D
• very popular in learning, where D is usually large
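A minimal sketch (not from the slides) contrasting the two variants on a least-squares cost; the function names, the cost, and the defaults for η and the number of steps are illustrative assumptions.

```python
# Batch vs. stochastic gradient descent on J(w) = sum_i (y_i - w.x_i)^2.
import numpy as np

def batch_gd(X, y, eta=0.01, n_steps=100):
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = -2 * X.T @ (y - X @ w)            # gradient computed on the entire set D
        w = w - eta * grad                        # x(n+1) = x(n) - eta * grad f(x(n))
    return w

def stochastic_gd(X, y, eta=0.01, n_epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in np.random.permutation(len(y)):   # take a step immediately after each point
            grad_i = -2 * (y[i] - X[i] @ w) * X[i]
            w = w - eta * grad_i
    return w
```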
Perceptron learning

for the Perceptron this leads to:

set k = 0, w_0 = 0, b_0 = 0
set R = max_i ||x_i||
do {
  for i = 1:n {
    if y_i (w_k^T x_i + b_k) < 0 then {
      w_{k+1} = w_k + η y_i x_i
      b_{k+1} = b_k + η y_i R²
      k = k + 1
    }
  }
} until y_i (w_k^T x_i + b_k) ≥ 0, ∀i (no errors)
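A minimal runnable sketch of this update rule (not the author's code); labels are assumed to be ±1, max_epochs is an illustrative safeguard for non-separable data, and the error test uses ≤ 0 so that the all-zero initialization still triggers updates.

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, max_epochs=1000):
    w = np.zeros(X.shape[1])
    b = 0.0
    R2 = np.max(np.sum(X ** 2, axis=1))       # R^2 = max_i ||x_i||^2
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:         # point misclassified (or on the boundary)
                w = w + eta * yi * xi          # w_{k+1} = w_k + eta * y_i * x_i
                b = b + eta * yi * R2          # b_{k+1} = b_k + eta * y_i * R^2
                errors += 1
        if errors == 0:                        # no errors: separating hyperplane found
            break
    return w, b
```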
Perceptron learning

the interesting part is that this is guaranteed to converge in finite time

Theorem: Let D = {(x_1, y_1), ..., (x_n, y_n)} and

    R = max_i ||x_i||

If there is (w*, b*) such that ||w*|| = 1 and

    y_i (w*^T x_i + b*) ≥ γ,  ∀i

then the Perceptron will find an error-free hyperplane in at most

    (2R/γ)²  iterations

the margin γ appears as a measure of the difficulty of the learning problem
Some history

Minsky and Papert identified serious problems
• there are very simple logic problems that the Perceptron cannot solve

it was later realized that these can be eliminated by relying on a multi-layered Perceptron (MLP) or "neural network"
this is a cascade of Perceptrons where
• the x_i are the input units
• layer 1: h_j(x) = sgn[w_j^T x]
• layer 2: u(x) = sgn[w^T h(x)]
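The classic example of such a logic problem is XOR, which no single Perceptron can represent but a two-layer cascade can; the sketch below (not from the slides, with hand-picked rather than learned weights) shows one such network.

```python
# XOR computed by a 2-layer cascade of threshold units with hand-picked weights.
def step(a):
    return 1 if a > 0 else 0

def xor_mlp(x1, x2):
    h1 = step(x1 + x2 - 0.5)      # layer 1, unit 1: OR
    h2 = step(1.5 - x1 - x2)      # layer 1, unit 2: NAND
    return step(h1 + h2 - 1.5)    # layer 2: AND of the two hidden units

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))   # prints the XOR truth table
```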
Graphical representation

the Perceptron is usually represented as

[figure: input units feeding a single output unit through the weights w_i, plus a bias term w_0]

    h(x) = sgn( Σ_i w_i x_i + w_0 ) = sgn( w^T x )

input units: coordinates of x
weights: coordinates of w
homogeneous coordinates: x = (x, 1)^T
Sigmoids

the sgn[·] function is problematic in two ways:
• no derivative at 0
• non-smooth

it can be approximated in various ways, for example by the hyperbolic tangent

    s(x) = tanh(σx) = (e^{σx} - e^{-σx}) / (e^{σx} + e^{-σx})

σ controls the approximation error; unlike sgn, s(x)
• has a derivative everywhere
• is smooth

neural networks are implemented with these functions

[figure: plots of s(x), s'(x), and s''(x)]
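A small sketch (function names assumed) of this activation and its derivative, which the gradient computations below will need:

```python
# s(x) = tanh(sigma * x) and its derivative s'(x) = sigma * (1 - tanh(sigma * x)^2)
import numpy as np

def s(x, sigma=1.0):
    return np.tanh(sigma * x)

def s_prime(x, sigma=1.0):
    return sigma * (1.0 - np.tanh(sigma * x) ** 2)
```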
Neural network

the MLP as function approximation
Two modes of operation

normal mode, after training: feedforward

[figure: enter x at the input, propagate forward (feedforward), collect z at the output]
Two modes of operation

training mode: backpropagation

[figure: enter x at the input, feedforward to obtain z, compare z to the target t, e = ||z - t||², then backpropagate to update the weights so as to minimize the error]
Backpropagation

is just gradient descent

at the end of the day, the output z is just a big function of
• the input vector x
• the weights, which we can represent by a big vector W
• e.g.

    z(x; W) = s[ Σ_{j=1}^J v_j s( Σ_i w_{ji} x_i ) ],    with W = (v, w)

objective:

    W* = argmin_W J(W)    with    J(W) = Σ_{i=1}^n [ t_i - z(x_i; W) ]²
Backpropagation

this is conceptually trivial, but computing the gradient of J looks quite messy
• e.g. for

    z = s[ Σ_{j=1}^J v_j s( Σ_i w_{ji} x_i ) ]

• what is ∂z/∂w_13 ?

it turns out that it is possible to do this easily, by doing a certain amount of book-keeping
the solution is the backpropagation algorithm, which is based on local updates
the key to understanding it is to make the right definitions
In detail

notation:
• input i: x_i
• weight from input i to hidden unit j: w_{ji}
• hidden unit j:  g_j = Σ_i w_{ji} x_i  (before sigmoid),  y_j = s[g_j]  (after sigmoid)
• weight from hidden unit j to output unit k: w_{kj}
• output unit k:  u_k = Σ_j w_{kj} y_j  (before sigmoid),  z_k = s[u_k]  (after sigmoid)

[figure: network diagram  x_i --w_{ji}--> (g_j, y_j) --w_{kj}--> (u_k, z_k)]
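A minimal sketch (weight-matrix layout and the use of tanh are assumptions) of the forward pass in this notation:

```python
# W_hidden[j, i] = w_ji, W_out[k, j] = w_kj, s = tanh
import numpy as np

def forward(x, W_hidden, W_out):
    g = W_hidden @ x      # g_j = sum_i w_ji x_i   (before sigmoid)
    y = np.tanh(g)        # y_j = s[g_j]           (after sigmoid)
    u = W_out @ y         # u_k = sum_j w_kj y_j   (before sigmoid)
    z = np.tanh(u)        # z_k = s[u_k]           (after sigmoid)
    return g, y, u, z
```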
Computing the gradient of J

the key is the chain rule

the output layer is easy

    ∂J/∂w_{kj} = (∂J/∂u_k) (∂u_k/∂w_{kj}) = -δ_k y_j        (*)

where

    δ_k = -∂J/∂u_k = -(∂J/∂z_k) (∂z_k/∂u_k) = (t_k - z_k) s'(u_k)        (**)

is the sensitivity of unit k

with  J = (1/2) Σ_i [t_i - z_i]²,   u_k = Σ_j w_{kj} y_j,   z_k = s[u_k]
Computing the gradient of J

for the hidden layer

    ∂J/∂w_{ji} = (∂J/∂y_j) (∂y_j/∂g_j) (∂g_j/∂w_{ji}) = (∂J/∂y_j) s'(g_j) x_i

this is a little more subtle, since J depends on y_j through all the z_k

with  J = (1/2) Σ_i [t_i - z_i]²,   g_j = Σ_i w_{ji} x_i,   y_j = s[g_j]
Computing the gradient of J

    ∂J/∂y_j = ∂/∂y_j { (1/2) Σ_k (t_k - z_k)² }
            = -Σ_k (t_k - z_k) (∂z_k/∂y_j)
            = -Σ_k (t_k - z_k) (∂z_k/∂u_k) (∂u_k/∂y_j)
            = -Σ_k (t_k - z_k) s'(u_k) w_{kj}

and, from (**),

    ∂J/∂y_j = -Σ_k δ_k w_{kj}

overall

    ∂J/∂w_{ji} = -[ Σ_k δ_k w_{kj} ] s'(g_j) x_i

with  J = (1/2) Σ_i [t_i - z_i]²,   u_k = Σ_j w_{kj} y_j,   z_k = s[u_k]
Computing the gradient of J

    ∂J/∂w_{ji} = -[ Σ_k δ_k w_{kj} ] s'(g_j) x_i

by analogy with (*)

    ∂J/∂w_{ji} = -δ_j x_i

with

    δ_j = s'(g_j) Σ_k δ_k w_{kj}
In summary

[figure: units i and j connected by the weight w_{ji}]

for any pair (i, j)

    ∂J/∂w_{ji} = -δ_j x_i

with

    δ_j = s'(g_j) Σ_k w_{kj} δ_k      if j is hidden, and

    δ_j = (t_j - z_j) s'(u_j)         if j is output.

The weight updates are

    w_{ji}^(n+1) = w_{ji}^(n) - η ∂J/∂w_{ji}

the error is "backpropagated" by local message passing!
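A minimal runnable sketch (not the author's code; the single hidden layer, tanh units with σ = 1, and the array shapes are illustrative assumptions) of one such update:

```python
# One backpropagation update for a 1-hidden-layer MLP, cost J = 1/2 sum_k (t_k - z_k)^2.
import numpy as np

def backprop_step(x, t, W_hidden, W_out, eta=0.1):
    # forward pass (notation of the slides)
    g = W_hidden @ x                                 # g_j = sum_i w_ji x_i
    y = np.tanh(g)                                   # y_j = s[g_j]
    u = W_out @ y                                    # u_k = sum_j w_kj y_j
    z = np.tanh(u)                                   # z_k = s[u_k]

    # sensitivities
    delta_k = (t - z) * (1.0 - z ** 2)               # delta_k = (t_k - z_k) s'(u_k)
    delta_j = (1.0 - y ** 2) * (W_out.T @ delta_k)   # delta_j = s'(g_j) sum_k w_kj delta_k

    # gradient descent: w <- w - eta*dJ/dw, with dJ/dw_kj = -delta_k y_j, dJ/dw_ji = -delta_j x_i
    W_out = W_out + eta * np.outer(delta_k, y)
    W_hidden = W_hidden + eta * np.outer(delta_j, x)
    return W_hidden, W_out
```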
Feature transformation

the MLP can be seen as: non-linear feature transformation + linear discriminant

[figure: the hidden layers implement a feature transformation  y = Φ(x), followed by a Perceptron acting on y]
Feature transformation

the feature transformation
• searches for the space where the patterns become separable

example:
• two-class problem
• 2-1 network
• not linearly separable in the space of the Xs
• made linearly separable in the space of the Ys
• the figure shows the evolution of the Ys and of the training error
Feature transformation

Q: is separability always possible?
A: not really, it depends on the number of units

example:
• two-class problem
• a 2-1 network is not enough
• but a 3-1 network is

in practice
• an art-form
• trial and error
Other problems

the optimization surface can be quite nasty

example:
• scalar problem
• 1-1 network
• the cost has many "plateaus"
• the globally optimal solution has no error
• but the gradient is frequently close to zero
• slow progress

in general: one plateau per training point
• improves with more points, degrades with more weights (dimensions)
Other problems

how do we set the learning rate η?
• if too small or too big, we will need many iterations
• it could even diverge

line search (sketched below):
• pick η^(0)
• compute  x' = x^(n) - η^(0) ∇f(x^(n))  and then f(x')
• if not good, make η^(k+1) = α η^(k) (with α < 1) and repeat
• until you get a minimum of f(x) along the step  x^(n+1) = x^(n) - η ∇f(x^(n))
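A minimal sketch (not from the slides; f and grad_f are assumed callables, and the defaults for η^(0), α, and the number of tries are illustrative) of this backtracking line search:

```python
import numpy as np

def backtracking_step(f, grad_f, x, eta0=1.0, alpha=0.5, max_tries=30):
    g = grad_f(x)
    eta = eta0
    for _ in range(max_tries):
        x_new = x - eta * g          # x' = x(n) - eta * grad f(x(n))
        if f(x_new) < f(x):          # "good": the step actually decreases f
            return x_new
        eta = alpha * eta            # otherwise shrink: eta(k+1) = alpha * eta(k)
    return x                         # give up and keep the current estimate
```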
Structural risk minimization

what about complexity penalties, overfitting, and all that?

SRM, in general:
1. start from a nested collection of families of functions

    S_1 ⊂ ... ⊂ S_k

2. for each S_i, find the set of parameters that minimizes the empirical risk
3. select the function class such that

    R* = min_i { R_emp,i + Φ(h_i) }

where Φ(h) is a function of the VC dimension (complexity) of the family S_i

this can be done by
1. S_i = {MLPs such that ||W||² < λ_i}
2. backpropagation in this family
Structural risk minimization

instead of

    W* = argmin_W J(W),    with    J(W) = Σ_{i=1}^n [ t_i - z(x_i; W) ]²

solve

    W* = argmin_W J(W)    subject to    W^T W < λ

we will see that this is equivalent to

    W* = argmin_W { J(W) + (ε/2η) W^T W }

re-working out backpropagation, this can be done by "shrinking" (sketched below)
• after each weight update do   w_new = (1 - ε) w_old
• this is known as "weight decay" and penalizes complex models
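A minimal sketch (names and defaults are illustrative) of a backpropagation update followed by the shrink step:

```python
import numpy as np

def decayed_update(W, grad_J, eta=0.1, eps=1e-3):
    W = W - eta * grad_J     # ordinary gradient-descent step on J
    W = (1.0 - eps) * W      # shrink w_new = (1 - eps) w_old, i.e. weight decay
    return W
```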
In summary

this works, but requires tuning ε
the cost surface is nasty
one needs to try different architectures
hence, training can be painfully slow
• "weeks" is quite common
• a good neural network may take years to train

however, when you are finished, it tends to work well

examples:
• the Rowley and Kanade face detector
• the LeCun digit recognizer (see http://yann.lecun.com/exdb/lenet/index.html)
Rowley and Kanade

"Neural network-based face detection"
Rowley, H.A.; Baluja, S.; Kanade, T.
IEEE Transactions on PAMI, Volume 20, Issue 1, Jan. 1998

the face detector:
Results
Tricks (good for any learning algorithm)

expand the training set to cover most of the variation
• a more exhaustive training set always produces better results than a less exhaustive one
• if you can create interesting examples artificially, then by all means do so (see the sketch below)
• e.g. in vision: rotate, scale, translate
• independently of what algorithm you are using
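A minimal sketch (assuming 2-D grayscale numpy images and the scipy.ndimage transforms; the particular angle, scale, and shift are illustrative) of creating artificial examples this way:

```python
import numpy as np
from scipy import ndimage

def augment(image, angle_deg=10.0, scale=1.1, shift_px=(2, -3)):
    rotated = ndimage.rotate(image, angle_deg, reshape=False)  # rotate about the center
    scaled = ndimage.zoom(rotated, scale)                      # rescale the image
    shifted = ndimage.shift(scaled, shift_px)                  # translate by a few pixels
    return shifted
```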
Tricks (good for any learning algorithm)

where do I get negative examples?
• finding a good negative example is difficult
• use the classifier itself to do it (see the sketch after this list):
  1. put together a training set D_1
  2. train classifier C_k with training set D_k
  3. run it on a dataset that has no positive examples
  4. make D_{k+1} = {examples classified as positive} ∪ D_k
  5. go to 2
• e.g. "close" non-face examples
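A minimal sketch of this bootstrapping loop (the train/predict interfaces, the ±1 labels, and the defaults are assumptions, not the author's code):

```python
import numpy as np

def mine_hard_negatives(train, predict, X_pos, X_neg_pool, n_init_neg=100, n_rounds=3):
    rng = np.random.default_rng(0)
    idx = rng.choice(len(X_neg_pool), n_init_neg, replace=False)
    X = np.concatenate([X_pos, X_neg_pool[idx]])             # D_1: positives + random negatives
    y = np.concatenate([np.ones(len(X_pos)), -np.ones(n_init_neg)])
    clf = None
    for _ in range(n_rounds):
        clf = train(X, y)                                     # 2. train C_k on D_k
        scores = predict(clf, X_neg_pool)                     # 3. run on data with no positives
        hard = X_neg_pool[scores > 0]                         # examples (wrongly) classified positive
        X = np.concatenate([X, hard])                         # 4. D_{k+1} = hard negatives U D_k
        y = np.concatenate([y, -np.ones(len(hard))])
    return clf
```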