Machine Learning Techniques (機器學習技巧)
Lecture 11: Neural Network
Hsuan-Tien Lin (林軒田)
htlin@csie.ntu.edu.tw
Department of Computer Science & Information Engineering
National Taiwan University (國立台灣大學資訊工程系)
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/25
-
Neural Network
Agenda
Lecture 11: Neural Network
Random Forest: Theory and Practice
Neural Network Motivation
Neural Network Hypothesis
Neural Network Training
Deep Neural Networks
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 1/25
-
Neural Network Random Forest: Theory and Practice
Theory: Does Diversity Help?
strength-correlation decomposition (classification):

lim_{T→∞} E_out(G) ≤ ρ · (1 − s²) / s²

• strength: average voting margin within G
• correlation: similarity between the g_t
• similar for regression (bias-variance decomposition)

RF good if diverse and strong
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 2/25
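A quick plug-in illustration (the numbers are assumed for illustration, not from the lecture): with strength s = 0.8 and correlation ρ = 0.3, the bound gives

lim_{T→∞} E_out(G) ≤ 0.3 · (1 − 0.64) / 0.64 ≈ 0.17

so making the trees stronger or less correlated both tighten the bound, matching the 'diverse and strong' message above.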
-
Neural Network Random Forest: Theory and Practice
Practice: How Many Trees Needed?
theory: the more, the 'better'

• NTU KDDCup 2013 Track 1: predicting author-paper relation
• 1 − E_val of thousands of trees: [0.981, 0.985] depending on seed;
  1 − E_out of top 20 teams: [0.98130, 0.98554]
• decision: take 12000 trees with seed 1

cons of RF: may need lots of trees if random process too unstable
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 3/25
-
Neural Network Random Forest: Theory and Practice
Fun Time
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 4/25
-
Neural Network Neural Network Motivation
Disclaimer
Many parts of this lecture borrow from Prof. Yaser S. Abu-Mostafa's slides, with permission.

Learning From Data
Yaser S. Abu-Mostafa
California Institute of Technology
Lecture 10: Neural Networks
Sponsored by Caltech's Provost Office, E&AS Division, and IST • Thursday, May 3, 2012
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 5/25
-
Neural Network Neural Network Motivation
Biological inspiration
biological function −→ biological structure

[Figure: biological neurons and the analogous network of simple computational units]

Learning From Data - Lecture 10 8/21
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 6/25
-
Neural Network Neural Network Motivation
Combining perceptrons

[Figure: two perceptrons h1 and h2 over inputs (x1, x2), each splitting the plane into + and − regions, to be combined into a more complex classifier]

For ±1-valued inputs, OR(x1, x2) is a perceptron with bias 1.5 and weights 1, 1; AND(x1, x2) is a perceptron with bias −1.5 and weights 1, 1.

Learning From Data - Lecture 10 9/21
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 7/25
-
Neural Network Neural Network Motivation
Creating layers

f = h1 h̄2 + h̄1 h2

[Figure: f built in layers from h1 and h2: two AND-with-negation units (bias −1.5, weights ±1) feeding an OR unit (bias 1.5, weights 1, 1)]

Learning From Data - Lecture 10 10/21
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 8/25
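To make the layered construction concrete, here is a minimal Python sketch of f = h1 h̄2 + h̄1 h2 built from the OR/AND perceptrons above; it is purely illustrative (the helper names are my own) and assumes inputs and perceptron outputs in {−1, +1}:

import numpy as np

def perceptron(bias, w1, w2):
    # a perceptron over two ±1 inputs: h(x1, x2) = sign(bias + w1*x1 + w2*x2)
    return lambda x1, x2: int(np.sign(bias + w1 * x1 + w2 * x2))

# OR and AND on ±1 inputs, with the biases from the slide above
OR  = perceptron( 1.5, 1, 1)
AND = perceptron(-1.5, 1, 1)

def f(h1, h2):
    # f = (h1 AND NOT h2) OR (NOT h1 AND h2), i.e. XOR of h1 and h2
    return OR(AND(h1, -h2), AND(-h1, h2))

for h1 in (-1, 1):
    for h2 in (-1, 1):
        print(h1, h2, "->", f(h1, h2))   # prints +1 exactly when h1 != h2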
-
Neural Network Neural Network Motivation
The multilayer perceptron

[Figure: a 3-layer feedforward network; the first layer computes h1 = sign(w1 • x) and h2 = sign(w2 • x) from inputs x1, x2, and the remaining layers combine them with the 1.5 / −1.5 / ±1 weights above to output f]

3 layers "feedforward"

Learning From Data - Lecture 10 11/21
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 9/25
-
Neural Network Neural Network Motivation
A powerful model

[Figure: a curved ± target boundary, next to approximations built from 8 perceptrons and from 16 perceptrons]

Target / 8 perceptrons / 16 perceptrons

2 red flags for generalization and optimization

Learning From Data - Lecture 10 12/21
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/25
-
Neural Network Neural Network Motivation
Fun Time
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 11/25
-
Neural Network Neural Network Hypothesis
The neural network

[Figure: a feedforward network with constant bias units (1), inputs x1, x2, ..., xd, a nonlinearity θ(s) inside each node, and output h(x)]

input x;  hidden layers 1 ≤ l < L;  output layer l = L

Learning From Data - Lecture 10 13/21
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 12/25
-
Neural Network Neural Network Hypothesis
How the network operates

[Plot: the activation θ(s) = tanh(s), a soft threshold between the linear and hard-threshold choices, saturating at ±1]

θ(s) = tanh(s) = (e^s − e^(−s)) / (e^s + e^(−s))

w^(l)_ij :  1 ≤ l ≤ L layers,  0 ≤ i ≤ d^(l−1) inputs,  1 ≤ j ≤ d^(l) outputs

x^(l)_j = θ( s^(l)_j ) = θ( Σ_{i=0}^{d^(l−1)} w^(l)_ij x^(l−1)_i )

Apply x to x^(0)_1 · · · x^(0)_{d^(0)} → · · · → x^(L)_1 = h(x)

Learning From Data - Lecture 10 14/21
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 13/25
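A minimal Python sketch of this forward recursion (illustrative only: the layer sizes and random weights are assumptions, and the bias unit x^(l)_0 = 1 is prepended explicitly):

import numpy as np

def forward(x, weights):
    # weights[l] has shape (1 + d^(l-1), d^(l)); row 0 carries the bias weights
    a = np.asarray(x, dtype=float)          # x^(0), without the bias unit
    for W in weights:
        a = np.concatenate(([1.0], a))      # prepend the constant x_0 = 1
        a = np.tanh(a @ W)                  # x^(l)_j = tanh(sum_i w_ij x_i)
    return a                                # x^(L); here a single output h(x)

# toy network: 2 inputs -> 3 hidden units -> 1 output, with made-up weights
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
print(forward([0.5, -1.0], weights))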
-
Neural Network Neural Network Hypothesis
Fun Time
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 14/25
-
Neural Network Neural Network Training
Applying SGD

All the weights w = { w^(l)_ij } determine h(x)

Error on example (x_n, y_n) is e( h(x_n), y_n ) = e(w)

To implement SGD, we need the gradient
∇e(w):  ∂ e(w) / ∂ w^(l)_ij  for all i, j, l

Learning From Data - Lecture 10 16/21
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 15/25
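Once that gradient is available, the per-example SGD step is the standard update (η is an assumed step size; the same update reappears in step 6 of the backpropagation algorithm later):

w^(l)_ij ← w^(l)_ij − η · ∂ e(w) / ∂ w^(l)_ij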
-
Neural Network Neural Network Training
Computing ∂ e(w) / ∂ w^(l)_ij

[Figure: the weight w^(l)_ij connects the output x^(l−1)_i below to the signal s^(l)_j and output x^(l)_j = θ(s^(l)_j) above]

We can evaluate ∂ e(w) / ∂ w^(l)_ij one by one: analytically or numerically

A trick for efficient computation:

∂ e(w) / ∂ w^(l)_ij = ( ∂ e(w) / ∂ s^(l)_j ) × ( ∂ s^(l)_j / ∂ w^(l)_ij )

We have ∂ s^(l)_j / ∂ w^(l)_ij = x^(l−1)_i

We only need: ∂ e(w) / ∂ s^(l)_j = δ^(l)_j

Learning From Data - Lecture 10 17/21
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 16/25
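The 'numerically' option can be sketched as a symmetric finite difference on one weight at a time. This illustration assumes the squared error e = (h(x) − y)² and the same tanh forward pass as before; the function names and eps value are my own:

import numpy as np

def forward(x, weights):
    # tanh forward pass with an explicit bias unit per layer
    a = np.asarray(x, dtype=float)
    for W in weights:
        a = np.tanh(np.concatenate(([1.0], a)) @ W)
    return a

def numeric_grad(x, y, weights, l, i, j, eps=1e-5):
    # finite-difference estimate of d e(w) / d w^(l)_ij for e = (h(x) - y)^2
    def err():
        return float((forward(x, weights)[0] - y) ** 2)
    weights[l][i, j] += eps
    e_plus = err()
    weights[l][i, j] -= 2 * eps
    e_minus = err()
    weights[l][i, j] += eps                 # restore the original weight
    return (e_plus - e_minus) / (2 * eps)

rng = np.random.default_rng(1)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
print(numeric_grad([0.5, -1.0], 1.0, weights, l=0, i=1, j=2))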
-
Neural Network Neural Network Training
δ for the final layer

δ^(l)_j = ∂ e(w) / ∂ s^(l)_j

For the final layer l = L and j = 1:

δ^(L)_1 = ∂ e(w) / ∂ s^(L)_1

e(w) = e( h(x_n), y_n ) = ( x^(L)_1 − y_n )²

x^(L)_1 = θ( s^(L)_1 )

θ′(s) = 1 − θ²(s)  for the tanh

Learning From Data - Lecture 10 18/21
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 17/25
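Combining the three facts above gives the final-layer delta in closed form (this last step is implied but not written out on the slide):

δ^(L)_1 = 2 ( x^(L)_1 − y_n ) · θ′( s^(L)_1 ) = 2 ( x^(L)_1 − y_n ) ( 1 − (x^(L)_1)² )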
-
Neural Network Neural Network Training
Back propagation of δ

[Figure: δ^(l)_j at the top (layer l) propagates back through w^(l)_ij and θ to give δ^(l−1)_i at the bottom (layer l−1), picking up the factor 1 − (x^(l−1)_i)²]

δ^(l−1)_i = ∂ e(w) / ∂ s^(l−1)_i
          = Σ_{j=1}^{d^(l)} ( ∂ e(w) / ∂ s^(l)_j ) × ( ∂ s^(l)_j / ∂ x^(l−1)_i ) × ( ∂ x^(l−1)_i / ∂ s^(l−1)_i )
          = Σ_{j=1}^{d^(l)} δ^(l)_j × w^(l)_ij × θ′( s^(l−1)_i )

δ^(l−1)_i = ( 1 − (x^(l−1)_i)² ) Σ_{j=1}^{d^(l)} w^(l)_ij δ^(l)_j

Learning From Data - Lecture 10 19/21
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 18/25
-
Neural Network Neural Network Training
Backpropagation algorithm

[Figure: the update pairs x^(l−1)_i with δ^(l)_j across the weight w^(l)_ij]

1: Initialize all weights w^(l)_ij at random
2: for t = 0, 1, 2, . . . do
3:   Pick n ∈ {1, 2, · · · , N}
4:   Forward: compute all x^(l)_j
5:   Backward: compute all δ^(l)_j
6:   Update the weights: w^(l)_ij ← w^(l)_ij − η x^(l−1)_i δ^(l)_j
7:   Iterate to the next step until it is time to stop
8: Return the final weights w^(l)_ij

Learning From Data - Lecture 10 20/21
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 19/25
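Here is a compact, self-contained Python sketch of steps 1-8 for tanh units and squared error. It is an illustration under my own assumptions (layer sizes, learning rate η, fixed number of steps, toy data), not the course's reference implementation:

import numpy as np

def backprop_sgd(X, y, sizes=(2, 3, 1), eta=0.1, steps=20000, seed=0):
    # sizes = (d^(0), ..., d^(L)); W[l] has shape (1 + d^(l), d^(l+1)), row 0 = bias
    rng = np.random.default_rng(seed)
    W = [rng.normal(scale=0.5, size=(din + 1, dout))
         for din, dout in zip(sizes[:-1], sizes[1:])]          # step 1
    for t in range(steps):                                     # step 2
        n = rng.integers(len(X))                               # step 3: pick one example
        xs = [np.concatenate(([1.0], X[n]))]                   # step 4: forward pass
        for Wl in W:
            xs.append(np.concatenate(([1.0], np.tanh(xs[-1] @ Wl))))
        out = xs[-1][1]                                        # x^(L)_1 = h(x_n)
        delta = np.array([2 * (out - y[n]) * (1 - out ** 2)])  # delta^(L)
        for l in range(len(W) - 1, -1, -1):                    # steps 5-6
            grad = np.outer(xs[l], delta)                      # x^(l-1)_i * delta^(l)_j
            delta = (1 - xs[l][1:] ** 2) * (W[l][1:] @ delta)  # delta^(l-1)
            W[l] -= eta * grad                                 # w <- w - eta * x * delta
    return W                                                   # step 8

# toy usage on XOR-style data in {-1,+1}^2 (data and settings are illustrative)
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], float)
y = np.array([-1.0, 1.0, 1.0, -1.0])
W = backprop_sgd(X, y)

Whether the toy run actually fits the XOR-style labels depends on the assumed initialization and step size; the point of the sketch is only to show how the forward pass, the δ recursion, and the weight update fit together.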
-
Neural Network Neural Network Training
Fun Time
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 20/25
-
Neural Network Deep Neural Networks
Final remark: hidden layers

[Figure: the same feedforward network; the hidden layers transform the raw input x before it reaches the output layer]

learned nonlinear transform; interpretation?

Learning From Data - Lecture 10 21/21
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 21/25
-
Neural Network Deep Neural Networks
Shallow versus Deep Structures
shallow: few hidden layers; deep: many hidden layers

Shallow
• efficient
• powerful if enough neurons

Deep
• challenging to train
• needing more structural (model) decisions
• 'meaningful'?

deep structure (deep learning) regained attention recently
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 22/25
-
Neural Network Deep Neural Networks
Key Techniques behind Deep Learning
• (usually) unsupervised pre-training between hidden layers, viewing hidden layers as 'condensing' a low-level representation into a high-level one
• fine-tune with backprop after initializing with those 'good' weights, because direct backprop may get stuck more easily
• speed-up: better optimization algorithms, and faster GPUs

currently very useful for vision and speech recognition
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 23/25
-
Neural Network Deep Neural Networks
Fun Time
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 24/25
-
Neural Network Deep Neural Networks
Summary
Lecture 11: Neural Network
Random Forest: Theory and Practice
Neural Network Motivation
Neural Network Hypothesis
Neural Network Training
Deep Neural Networks
Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 25/25