
  • Machine Learning Techniques (機器學習技巧)

    Lecture 11: Neural Network
    Hsuan-Tien Lin (林軒田)

    [email protected]

    Department of Computer Science & Information Engineering

    National Taiwan University (國立台灣大學資訊工程系)

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/25

  • Neural Network

    Agenda

    Lecture 11: Neural Network
    Random Forest: Theory and Practice
    Neural Network Motivation
    Neural Network Hypothesis
    Neural Network Training
    Deep Neural Networks

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 1/25

  • Neural Network Random Forest: Theory and Practice

    Theory: Does Diversity Help?
    strength-correlation decomposition (classification):

    lim_{T→∞} E_out(G) ≤ ρ · (1 − s²) / s²

    • strength s: average voting margin within G
    • correlation ρ: similarity between the g_t
    • similar for regression (bias-variance decomposition)

    RF good if diverse and strong

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 2/25

  • Neural Network Random Forest: Theory and Practice

    Practice: How Many Trees Needed?
    theory: the more, the ‘better’

    • NTU KDDCup 2013 Track 1: predicting author-paper relation
    • 1 − E_val of thousands of trees: [0.981, 0.985] depending on seed;
      1 − E_out of top 20 teams: [0.98130, 0.98554]
    • decision: take 12000 trees with seed 1

    cons of RF: may need lots of trees if random process too unstable

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 3/25
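
    A small sketch of the ‘how many trees’ experiment, using scikit-learn's RandomForestClassifier on synthetic data rather than the KDDCup data; the dataset, split sizes, and seed below are illustrative assumptions, not the competition setup.

```python
# Hypothetical illustration: watch 1 - E_val stabilize as the number of trees grows,
# with a fixed random seed as on the slide. Requires scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

for n_trees in (10, 100, 1000):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=1)
    rf.fit(X_tr, y_tr)
    print(n_trees, rf.score(X_val, y_val))   # validation accuracy = 1 - E_val
```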

  • Neural Network Random Forest: Theory and Practice

    Fun Time

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 4/25

  • Neural Network Neural Network Motivation

    Disclaimer
    Many parts of this lecture borrow

    Prof. Yaser S. Abu-Mostafa’s slides with permission.

    Learning From Data
    Yaser S. Abu-Mostafa
    California Institute of Technology
    Lecture 10: Neural Networks

    Sponsored by Caltech’s Provost Office, E&AS Division, and IST • Thursday, May 3, 2012

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 5/25

  • Neural Network Neural Network Motivation

    Biological inspiration
    biological function −→ biological structure

    Learning From Data - Lecture 10 8/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 6/25

  • Neural Network Neural Network Motivation


    Combining perceptrons

    [figure: two perceptrons h1 and h2 with different +/− decision regions over (x1, x2); OR(x1, x2) and AND(x1, x2) are each realized by a single perceptron with weights (1, 1) and biases +1.5 and −1.5]

    Learning From Data - Lecture 10 9/21

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 7/25
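
    A minimal sketch of the OR/AND perceptrons on this slide, assuming ±1-valued inputs and outputs; the helper name `perceptron` is my own, while the bias/weight values (+1.5, −1.5, weights 1, 1) follow the numbers shown in the figure.

```python
# With x1, x2 in {-1, +1}: bias +1.5 gives OR, bias -1.5 gives AND.
import numpy as np

def perceptron(x, w, b):
    """sign(b + w . x), with sign(0) treated as +1."""
    return 1 if b + np.dot(w, x) >= 0 else -1

def OR(x1, x2):
    return perceptron([x1, x2], w=[1, 1], b=1.5)

def AND(x1, x2):
    return perceptron([x1, x2], w=[1, 1], b=-1.5)

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, OR(x1, x2), AND(x1, x2))
```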

  • Neural Network Neural Network Motivation


    Creating layers

    [figure: f built in layers from h1·h̄2 and h̄1·h2, i.e. f = OR(h1 AND NOT h2, NOT h1 AND h2), reusing the OR/AND perceptrons with biases ±1.5 and weights ±1]

    Learning From Data - Lecture 10 10/21

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 8/25
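
    A minimal, self-contained sketch of the layered construction read off the figure above: f = (h1 AND NOT h2) OR (NOT h1 AND h2), built from the same OR/AND perceptrons; this is my own illustration, not course code.

```python
# Two layers of perceptrons realize a function (XOR of h1, h2) that a single
# perceptron cannot; signals are in {-1, +1}, so NOT is just a sign flip.
import numpy as np

def perceptron(x, w, b):
    return 1 if b + np.dot(w, x) >= 0 else -1

def OR(a, b):  return perceptron([a, b], [1, 1],  1.5)
def AND(a, b): return perceptron([a, b], [1, 1], -1.5)

def XOR(h1, h2):
    return OR(AND(h1, -h2), AND(-h1, h2))

for h1 in (-1, 1):
    for h2 in (-1, 1):
        print(h1, h2, XOR(h1, h2))
```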

  • Neural Network Neural Network Motivation

    The multilayer perceptron

    [figure: inputs x1, x2 feed two perceptron units computing w1 • x and w2 • x; their outputs feed a final unit producing f, with weights ±1 and biases ±1.5 on the connections]

    3 layers "feedforward"

    Learning From Data - Lecture 10 11/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 9/25

  • Neural Network Neural Network Motivation

    A powerful model

    [figure: a +/− target pattern approximated by 8 perceptrons and by 16 perceptrons (panels: Target, 8 perceptrons, 16 perceptrons)]

    2 red flags for generalization and optimization

    Learning From Data - Lecture 10 12/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/25

  • Neural Network Neural Network Motivation

    Fun Time

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 11/25

  • Neural Network Neural Network Hypothesis


    The neural network

    [figure: inputs x1, x2, ..., xd (plus a constant 1 at each layer) feed layers of θ units; each unit forms a signal s and outputs θ(s); the single output is h(x)]

    input x · hidden layers 1 ≤ l < L · output layer l = L

    Learning From Data - Lecture 10 13/21

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 12/25

  • Neural Network Neural Network Hypothesis


    How the network operates

    [plot: the tanh activation θ(s), saturating at +1 and −1, compared with the linear and hard-threshold activations]

    θ(s) = tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s})

    w^(l)_ij :  1 ≤ l ≤ L layers,  0 ≤ i ≤ d^(l−1) inputs,  1 ≤ j ≤ d^(l) outputs

    x^(l)_j = θ(s^(l)_j) = θ( Σ_{i=0}^{d^(l−1)} w^(l)_ij x^(l−1)_i )

    Apply x to x^(0)_1 · · · x^(0)_{d^(0)} → · · · → x^(L)_1 = h(x)

    Learning From Data - Lecture 10 14/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 13/25
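
    A short sketch of the forward computation above, assuming tanh units and a bias coordinate x_0 = 1 prepended at every layer; the function name, weight-matrix shapes, and the tiny example network are illustrative choices, not the course's reference code.

```python
# Forward propagation: x^(l)_j = theta( sum_i w^(l)_ij x^(l-1)_i ), theta = tanh.
import numpy as np

def forward(x, weights):
    """x: input vector of length d(0); weights: list of matrices where
    weights[l] has shape (d(l-1)+1, d(l)), row 0 multiplying the bias term."""
    x_layer = np.asarray(x, dtype=float)
    for W in weights:
        x_with_bias = np.concatenate(([1.0], x_layer))   # prepend x_0 = 1
        s = x_with_bias @ W                               # signals s^(l)_j
        x_layer = np.tanh(s)                              # outputs x^(l)_j = theta(s)
    return x_layer[0]                                      # x^(L)_1 = h(x)

# tiny 2-3-1 network with random weights, just to show the shapes
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
print(forward([0.5, -0.2], weights))
```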

  • Neural Network Neural Network Hypothesis

    Fun Time

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 14/25

  • Neural Network Neural Network Training

    Applying SGD

    All the weights w = {w^(l)_ij} determine h(x)
    Error on example (x_n, y_n) is e(h(x_n), y_n) = e(w)
    To implement SGD, we need the gradient

    ∇e(w):  ∂e(w) / ∂w^(l)_ij   for all i, j, l

    Learning From Data - Lecture 10 16/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 15/25

  • Neural Network Neural Network Training

    Computing ∂e(w) / ∂w^(l)_ij

    [figure: the weight w^(l)_ij connects x^(l−1)_i (bottom) to the signal s^(l)_j and output x^(l)_j = θ(s^(l)_j) (top)]

    We can evaluate ∂e(w) / ∂w^(l)_ij one by one: analytically or numerically.
    A trick for efficient computation:

    ∂e(w) / ∂w^(l)_ij = ∂e(w) / ∂s^(l)_j × ∂s^(l)_j / ∂w^(l)_ij

    We have ∂s^(l)_j / ∂w^(l)_ij = x^(l−1)_i
    We only need: ∂e(w) / ∂s^(l)_j = δ^(l)_j

    Learning From Data - Lecture 10 17/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 16/25
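
    A sketch of the ‘numerically’ option mentioned above: estimating one partial derivative ∂e(w)/∂w^(l)_ij by central finite differences on a tiny tanh network with squared error; all names, shapes, and the chosen weight index are illustrative assumptions, reusing the same forward-pass conventions as the earlier sketch.

```python
# Finite-difference estimate of de/dw for one weight; useful as a sanity check
# against the analytic delta-based computation derived on the next slides.
import numpy as np

def forward(x, weights):
    x_layer = np.asarray(x, dtype=float)
    for W in weights:
        x_layer = np.tanh(np.concatenate(([1.0], x_layer)) @ W)
    return x_layer[0]

def squared_error(x, y, weights):
    return (forward(x, weights) - y) ** 2            # e(w) = (x^(L)_1 - y_n)^2

rng = np.random.default_rng(1)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
x, y = [0.3, -0.7], 1.0

eps, (l, i, j) = 1e-6, (0, 2, 1)                      # perturb one chosen weight
w_plus  = [W.copy() for W in weights]; w_plus[l][i, j]  += eps
w_minus = [W.copy() for W in weights]; w_minus[l][i, j] -= eps
numeric = (squared_error(x, y, w_plus) - squared_error(x, y, w_minus)) / (2 * eps)
print("numerical estimate of de/dw:", numeric)
```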

  • Neural Network Neural Network Training

    δ for the final layer

    δ^(l)_j = ∂e(w) / ∂s^(l)_j

    For the final layer l = L and j = 1:

    δ^(L)_1 = ∂e(w) / ∂s^(L)_1
    e(w) = (x^(L)_1 − y_n)²
    x^(L)_1 = θ(s^(L)_1)
    θ′(s) = 1 − θ²(s) for the tanh

    Learning From Data - Lecture 10 18/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 17/25
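
    A one-function sketch of the final-layer sensitivity under the slide's assumptions (squared error, tanh output): δ^(L)_1 = 2 (x^(L)_1 − y_n) (1 − (x^(L)_1)²); the function name is my own.

```python
import numpy as np

def delta_output(s_L, y):
    """delta^(L)_1 for e(w) = (x^(L)_1 - y)^2 with a tanh output unit."""
    x_L = np.tanh(s_L)                        # x^(L)_1 = theta(s^(L)_1)
    return 2.0 * (x_L - y) * (1.0 - x_L**2)   # uses theta'(s) = 1 - theta(s)^2

print(delta_output(0.4, 1.0))
```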

  • Neural Network Neural Network Training

    Back propagation of δ

    [figure: x^(l−1)_i (bottom) connects through w^(l)_ij to s^(l)_j and x^(l)_j (top); δ^(l)_j propagates back to δ^(l−1)_i, scaled by 1 − (x^(l−1)_i)²]

    δ^(l−1)_i = ∂e(w) / ∂s^(l−1)_i
              = Σ_{j=1}^{d^(l)}  ∂e(w)/∂s^(l)_j × ∂s^(l)_j/∂x^(l−1)_i × ∂x^(l−1)_i/∂s^(l−1)_i
              = Σ_{j=1}^{d^(l)}  δ^(l)_j × w^(l)_ij × θ′(s^(l−1)_i)

    δ^(l−1)_i = (1 − (x^(l−1)_i)²)  Σ_{j=1}^{d^(l)}  w^(l)_ij δ^(l)_j

    Learning From Data - Lecture 10 19/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 18/25

  • Neural Network Neural Network Training

    Backpropagation algorithm

    [figure: the weight w^(l)_ij between x^(l−1)_i (bottom) and δ^(l)_j (top)]

    1: Initialize all weights w^(l)_ij at random
    2: for t = 0, 1, 2, . . . do
    3:   Pick n ∈ {1, 2, · · · , N}
    4:   Forward: Compute all x^(l)_j
    5:   Backward: Compute all δ^(l)_j
    6:   Update the weights: w^(l)_ij ← w^(l)_ij − η x^(l−1)_i δ^(l)_j
    7:   Iterate to the next step until it is time to stop
    8: Return the final weights w^(l)_ij

    Learning From Data - Lecture 10 20/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 19/25
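
    A sketch of the whole algorithm in plain NumPy, under the same assumptions as the earlier snippets (tanh units, squared error, bias row 0 in each weight matrix); the function name, data, network size, learning rate, and fixed step count are illustrative, not the course's reference implementation.

```python
import numpy as np

def train_backprop(X, Y, layer_sizes, eta=0.1, steps=20000, seed=0):
    """layer_sizes = [d(0), ..., d(L)]; returns weight matrices where
    weights[l-1] has shape (d(l-1)+1, d(l)), row 0 holding the bias weights."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(scale=0.5, size=(d_in + 1, d_out))
               for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    for _ in range(steps):
        n = rng.integers(len(X))                          # pick one example (SGD)
        # forward: compute all x^(l)
        xs = [np.asarray(X[n], dtype=float)]
        for W in weights:
            xs.append(np.tanh(np.concatenate(([1.0], xs[-1])) @ W))
        # backward: delta^(L) for squared error, then back-propagate
        deltas = [2.0 * (xs[-1] - Y[n]) * (1.0 - xs[-1] ** 2)]
        for l in range(len(weights) - 1, 0, -1):
            back = weights[l][1:, :] @ deltas[0]          # skip the bias row
            deltas.insert(0, (1.0 - xs[l] ** 2) * back)
        # update: w^(l)_ij <- w^(l)_ij - eta * x^(l-1)_i * delta^(l)_j
        for l, W in enumerate(weights):
            W -= eta * np.outer(np.concatenate(([1.0], xs[l])), deltas[l])
    return weights

# learn an XOR-like target (inputs and labels in {-1, +1}) with a 2-3-1 network
X = [[-1, -1], [-1, 1], [1, -1], [1, 1]]
Y = [[-1.0], [1.0], [1.0], [-1.0]]
w = train_backprop(X, Y, [2, 3, 1])
for x in X:
    out = np.asarray(x, dtype=float)
    for W in w:
        out = np.tanh(np.concatenate(([1.0], out)) @ W)
    print(x, out)   # outputs should move toward the corresponding targets
```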

  • Neural Network Neural Network Training

    Fun Time

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 20/25

  • Neural Network Deep Neural Networks

    Final remark: hidden layers

    [figure: the neural network again, with inputs x1, ..., xd feeding the hidden θ units that produce h(x)]

    learned nonlinear transform: interpretation?

    Learning From Data - Lecture 10 21/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 21/25

  • Neural Network Deep Neural Networks

    Shallow versus Deep Structures
    shallow: few hidden layers; deep: many hidden layers

    Shallow
    • efficient
    • powerful if enough neurons

    Deep
    • challenging to train
    • needing more structural (model) decisions
    • ‘meaningful’?

    deep structure (deep learning) re-gained attention recently

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 22/25

  • Neural Network Deep Neural Networks

    Key Techniques behind Deep Learning

    • (usually) unsupervised pre-training between hidden layers, viewing hidden layers as ‘condensing’ a low-level representation into a high-level one

    • fine-tune with backprop after initializing with those ‘good’ weights, because direct backprop may get stuck more easily

    • speed-up: better optimization algorithms, and faster GPUs

    currently very useful for vision and speech recognition

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 23/25

  • Neural Network Deep Neural Networks

    Fun Time

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 24/25

  • Neural Network Deep Neural Networks

    Summary

    Lecture 11: Neural Network
    Random Forest: Theory and Practice

    Neural Network Motivation

    Neural Network Hypothesis

    Neural Network Training

    Deep Neural Networks

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 25/25
