
  • Machine Learning Techniques (機器學習技巧)

    Lecture 11: Neural Network
    Hsuan-Tien Lin (林軒田)

    [email protected]

    Department of Computer Science & Information Engineering

    National Taiwan University (國立台灣大學資訊工程系)

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 0/25

  • Neural Network

    Agenda

    Lecture 11: Neural Network
    Random Forest: Theory and Practice
    Neural Network Motivation
    Neural Network Hypothesis
    Neural Network Training
    Deep Neural Networks

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 1/25

  • Neural Network Random Forest: Theory and Practice

    Theory: Does Diversity Help?
    strength-correlation decomposition (classification):

    lim_{T→∞} E_out(G) ≤ ρ · (1 − s²) / s²

    • strength s: average voting margin within G
    • correlation ρ: similarity between the g_t
    • similar for regression (bias-variance decomposition)

    RF good if diverse and strong

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 2/25

  • Neural Network Random Forest: Theory and Practice

    Practice: How Many Trees Needed?
    theory: the more, the ‘better’

    • NTU KDDCup 2013 Track 1: predicting author-paper relation
    • 1 − E_val of thousands of trees: [0.981, 0.985] depending on seed;
      1 − E_out of top 20 teams: [0.98130, 0.98554]
    • decision: take 12000 trees with seed 1

    cons of RF: may need lots of trees if random process too unstable

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 3/25
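
    A small sketch of the ‘how many trees’ experiment, using scikit-learn's RandomForestClassifier on synthetic data rather than the KDDCup data; the dataset, split sizes, and seed below are illustrative assumptions, not the competition setup.

```python
# Hypothetical illustration: watch 1 - E_val stabilize as the number of trees grows,
# with a fixed random seed as on the slide. Requires scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=1)

for n_trees in (10, 100, 1000):
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=1)
    rf.fit(X_tr, y_tr)
    print(n_trees, rf.score(X_val, y_val))   # validation accuracy = 1 - E_val
```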

  • Neural Network Random Forest: Theory and Practice

    Fun Time

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 4/25

  • Neural Network Neural Network Motivation

    Disclaimer
    Many parts of this lecture borrow

    Prof. Yaser S. Abu-Mostafa’s slides with permission.

    Learning From Data
    Yaser S. Abu-Mostafa
    California Institute of Technology
    Lecture 10: Neural Networks

    Sponsored by Caltech’s Provost Office, E&AS Division, and IST • Thursday, May 3, 2012

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 5/25

  • Neural Network Neural Network Motivation

    Biological inspiration
    biological function −→ biological structure

    Learning From Data - Lecture 10 8/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 6/25

  • Neural Network Neural Network Motivation


    Combining perceptrons

    [figure: two perceptrons h1 and h2 with different +/− decision regions over (x1, x2); OR(x1, x2) and AND(x1, x2) are each realized by a single perceptron with weights (1, 1) and biases +1.5 and −1.5]

    Learning From Data - Lecture 10 9/21

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 7/25
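
    A minimal sketch of the OR/AND perceptrons on this slide, assuming ±1-valued inputs and outputs; the helper name `perceptron` is my own, while the bias/weight values (+1.5, −1.5, weights 1, 1) follow the numbers shown in the figure.

```python
# With x1, x2 in {-1, +1}: bias +1.5 gives OR, bias -1.5 gives AND.
import numpy as np

def perceptron(x, w, b):
    """sign(b + w . x), with sign(0) treated as +1."""
    return 1 if b + np.dot(w, x) >= 0 else -1

def OR(x1, x2):
    return perceptron([x1, x2], w=[1, 1], b=1.5)

def AND(x1, x2):
    return perceptron([x1, x2], w=[1, 1], b=-1.5)

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, OR(x1, x2), AND(x1, x2))
```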

  • Neural Network Neural Network Motivation


    Creating layers

    [figure: f built in layers from h1·h̄2 and h̄1·h2, i.e. f = OR(h1 AND NOT h2, NOT h1 AND h2), reusing the OR/AND perceptrons with biases ±1.5 and weights ±1]

    Learning From Data - Lecture 10 10/21

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 8/25
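
    A minimal, self-contained sketch of the layered construction read off the figure above: f = (h1 AND NOT h2) OR (NOT h1 AND h2), built from the same OR/AND perceptrons; this is my own illustration, not course code.

```python
# Two layers of perceptrons realize a function (XOR of h1, h2) that a single
# perceptron cannot; signals are in {-1, +1}, so NOT is just a sign flip.
import numpy as np

def perceptron(x, w, b):
    return 1 if b + np.dot(w, x) >= 0 else -1

def OR(a, b):  return perceptron([a, b], [1, 1],  1.5)
def AND(a, b): return perceptron([a, b], [1, 1], -1.5)

def XOR(h1, h2):
    return OR(AND(h1, -h2), AND(-h1, h2))

for h1 in (-1, 1):
    for h2 in (-1, 1):
        print(h1, h2, XOR(h1, h2))
```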

  • Neural Network Neural Network Motivation

    The multilayer perceptron

    [figure: inputs x1, x2 feed two perceptron units computing w1 • x and w2 • x; their outputs feed a final unit producing f, with weights ±1 and biases ±1.5 on the connections]

    3 layers "feedforward"

    Learning From Data - Lecture 10 11/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 9/25

  • Neural Network Neural Network Motivation

    A powerful model

    [figure: a +/− target pattern approximated by 8 perceptrons and by 16 perceptrons (panels: Target, 8 perceptrons, 16 perceptrons)]

    2 red flags for generalization and optimization

    Learning From Data - Lecture 10 12/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 10/25

  • Neural Network Neural Network Motivation

    Fun Time

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 11/25

  • Neural Network Neural Network Hypothesis


    The neural network

    [figure: inputs x1, x2, ..., xd (plus a constant 1 at each layer) feed layers of θ units; each unit forms a signal s and outputs θ(s); the single output is h(x)]

    input x · hidden layers 1 ≤ l < L · output layer l = L

    Learning From Data - Lecture 10 13/21

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 12/25

  • Neural Network Neural Network Hypothesis


    How the network operates

    [plot: the tanh activation θ(s), saturating at +1 and −1, compared with the linear and hard-threshold activations]

    θ(s) = tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s})

    w^(l)_ij :  1 ≤ l ≤ L layers,  0 ≤ i ≤ d^(l−1) inputs,  1 ≤ j ≤ d^(l) outputs

    x^(l)_j = θ(s^(l)_j) = θ( Σ_{i=0}^{d^(l−1)} w^(l)_ij x^(l−1)_i )

    Apply x to x^(0)_1 · · · x^(0)_{d^(0)} → · · · → x^(L)_1 = h(x)

    Learning From Data - Lecture 10 14/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 13/25
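
    A short sketch of the forward computation above, assuming tanh units and a bias coordinate x_0 = 1 prepended at every layer; the function name, weight-matrix shapes, and the tiny example network are illustrative choices, not the course's reference code.

```python
# Forward propagation: x^(l)_j = theta( sum_i w^(l)_ij x^(l-1)_i ), theta = tanh.
import numpy as np

def forward(x, weights):
    """x: input vector of length d(0); weights: list of matrices where
    weights[l] has shape (d(l-1)+1, d(l)), row 0 multiplying the bias term."""
    x_layer = np.asarray(x, dtype=float)
    for W in weights:
        x_with_bias = np.concatenate(([1.0], x_layer))   # prepend x_0 = 1
        s = x_with_bias @ W                               # signals s^(l)_j
        x_layer = np.tanh(s)                              # outputs x^(l)_j = theta(s)
    return x_layer[0]                                      # x^(L)_1 = h(x)

# tiny 2-3-1 network with random weights, just to show the shapes
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
print(forward([0.5, -0.2], weights))
```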

  • Neural Network Neural Network Hypothesis

    Fun Time

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 14/25

  • Neural Network Neural Network Training

    Applying SGD

    All the weights w = {w^(l)_ij} determine h(x)
    Error on example (x_n, y_n) is e(h(x_n), y_n) = e(w)
    To implement SGD, we need the gradient

    ∇e(w):  ∂e(w) / ∂w^(l)_ij   for all i, j, l

    Learning From Data - Lecture 10 16/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 15/25

  • Neural Network Neural Network Training

    Computing ∂e(w) / ∂w^(l)_ij

    [figure: the weight w^(l)_ij connects x^(l−1)_i (bottom) to the signal s^(l)_j and output x^(l)_j = θ(s^(l)_j) (top)]

    We can evaluate ∂e(w) / ∂w^(l)_ij one by one: analytically or numerically.
    A trick for efficient computation:

    ∂e(w) / ∂w^(l)_ij = ∂e(w) / ∂s^(l)_j × ∂s^(l)_j / ∂w^(l)_ij

    We have ∂s^(l)_j / ∂w^(l)_ij = x^(l−1)_i
    We only need: ∂e(w) / ∂s^(l)_j = δ^(l)_j

    Learning From Data - Lecture 10 17/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 16/25
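
    A sketch of the ‘numerically’ option mentioned above: estimating one partial derivative ∂e(w)/∂w^(l)_ij by central finite differences on a tiny tanh network with squared error; all names, shapes, and the chosen weight index are illustrative assumptions, reusing the same forward-pass conventions as the earlier sketch.

```python
# Finite-difference estimate of de/dw for one weight; useful as a sanity check
# against the analytic delta-based computation derived on the next slides.
import numpy as np

def forward(x, weights):
    x_layer = np.asarray(x, dtype=float)
    for W in weights:
        x_layer = np.tanh(np.concatenate(([1.0], x_layer)) @ W)
    return x_layer[0]

def squared_error(x, y, weights):
    return (forward(x, weights) - y) ** 2            # e(w) = (x^(L)_1 - y_n)^2

rng = np.random.default_rng(1)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(4, 1))]
x, y = [0.3, -0.7], 1.0

eps, (l, i, j) = 1e-6, (0, 2, 1)                      # perturb one chosen weight
w_plus  = [W.copy() for W in weights]; w_plus[l][i, j]  += eps
w_minus = [W.copy() for W in weights]; w_minus[l][i, j] -= eps
numeric = (squared_error(x, y, w_plus) - squared_error(x, y, w_minus)) / (2 * eps)
print("numerical estimate of de/dw:", numeric)
```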

  • Neural Network Neural Network Training

    δ for the final layer

    δ^(l)_j = ∂e(w) / ∂s^(l)_j

    For the final layer l = L and j = 1:

    δ^(L)_1 = ∂e(w) / ∂s^(L)_1
    e(w) = (x^(L)_1 − y_n)²
    x^(L)_1 = θ(s^(L)_1)
    θ′(s) = 1 − θ²(s) for the tanh

    Learning From Data - Lecture 10 18/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 17/25
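
    A one-function sketch of the final-layer sensitivity under the slide's assumptions (squared error, tanh output): δ^(L)_1 = 2 (x^(L)_1 − y_n) (1 − (x^(L)_1)²); the function name is my own.

```python
import numpy as np

def delta_output(s_L, y):
    """delta^(L)_1 for e(w) = (x^(L)_1 - y)^2 with a tanh output unit."""
    x_L = np.tanh(s_L)                        # x^(L)_1 = theta(s^(L)_1)
    return 2.0 * (x_L - y) * (1.0 - x_L**2)   # uses theta'(s) = 1 - theta(s)^2

print(delta_output(0.4, 1.0))
```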

  • Neural Network Neural Network Training

    Back propagation of δ

    [figure: x^(l−1)_i (bottom) connects through w^(l)_ij to s^(l)_j and x^(l)_j (top); δ^(l)_j propagates back to δ^(l−1)_i, scaled by 1 − (x^(l−1)_i)²]

    δ^(l−1)_i = ∂e(w) / ∂s^(l−1)_i
              = Σ_{j=1}^{d^(l)}  ∂e(w)/∂s^(l)_j × ∂s^(l)_j/∂x^(l−1)_i × ∂x^(l−1)_i/∂s^(l−1)_i
              = Σ_{j=1}^{d^(l)}  δ^(l)_j × w^(l)_ij × θ′(s^(l−1)_i)

    δ^(l−1)_i = (1 − (x^(l−1)_i)²)  Σ_{j=1}^{d^(l)}  w^(l)_ij δ^(l)_j

    Learning From Data - Lecture 10 19/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 18/25

  • Neural Network Neural Network Training

    Backpropagation algorithm

    [figure: the weight w^(l)_ij between x^(l−1)_i (bottom) and δ^(l)_j (top)]

    1: Initialize all weights w^(l)_ij at random
    2: for t = 0, 1, 2, . . . do
    3:   Pick n ∈ {1, 2, · · · , N}
    4:   Forward: Compute all x^(l)_j
    5:   Backward: Compute all δ^(l)_j
    6:   Update the weights: w^(l)_ij ← w^(l)_ij − η x^(l−1)_i δ^(l)_j
    7:   Iterate to the next step until it is time to stop
    8: Return the final weights w^(l)_ij

    Learning From Data - Lecture 10 20/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 19/25
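
    A sketch of the whole algorithm in plain NumPy, under the same assumptions as the earlier snippets (tanh units, squared error, bias row 0 in each weight matrix); the function name, data, network size, learning rate, and fixed step count are illustrative, not the course's reference implementation.

```python
import numpy as np

def train_backprop(X, Y, layer_sizes, eta=0.1, steps=20000, seed=0):
    """layer_sizes = [d(0), ..., d(L)]; returns weight matrices where
    weights[l-1] has shape (d(l-1)+1, d(l)), row 0 holding the bias weights."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(scale=0.5, size=(d_in + 1, d_out))
               for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])]
    for _ in range(steps):
        n = rng.integers(len(X))                          # pick one example (SGD)
        # forward: compute all x^(l)
        xs = [np.asarray(X[n], dtype=float)]
        for W in weights:
            xs.append(np.tanh(np.concatenate(([1.0], xs[-1])) @ W))
        # backward: delta^(L) for squared error, then back-propagate
        deltas = [2.0 * (xs[-1] - Y[n]) * (1.0 - xs[-1] ** 2)]
        for l in range(len(weights) - 1, 0, -1):
            back = weights[l][1:, :] @ deltas[0]          # skip the bias row
            deltas.insert(0, (1.0 - xs[l] ** 2) * back)
        # update: w^(l)_ij <- w^(l)_ij - eta * x^(l-1)_i * delta^(l)_j
        for l, W in enumerate(weights):
            W -= eta * np.outer(np.concatenate(([1.0], xs[l])), deltas[l])
    return weights

# learn an XOR-like target (inputs and labels in {-1, +1}) with a 2-3-1 network
X = [[-1, -1], [-1, 1], [1, -1], [1, 1]]
Y = [[-1.0], [1.0], [1.0], [-1.0]]
w = train_backprop(X, Y, [2, 3, 1])
for x in X:
    out = np.asarray(x, dtype=float)
    for W in w:
        out = np.tanh(np.concatenate(([1.0], out)) @ W)
    print(x, out)   # outputs should move toward the corresponding targets
```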

  • Neural Network Neural Network Training

    Fun Time

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 20/25

  • Neural Network Deep Neural Networks

    Final remark: hidden layers

    [figure: the neural network again, with inputs x1, ..., xd feeding the hidden θ units that produce h(x)]

    learned nonlinear transform: interpretation?

    Learning From Data - Lecture 10 21/21


    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 21/25

  • Neural Network Deep Neural Networks

    Shallow versus Deep Structures
    shallow: few hidden layers; deep: many hidden layers

    Shallow
    • efficient
    • powerful if enough neurons

    Deep
    • challenging to train
    • needing more structural (model) decisions
    • ‘meaningful’?

    deep structure (deep learning) re-gained attention recently

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 22/25

  • Neural Network Deep Neural Networks

    Key Techniques behind Deep Learning

    • (usually) unsupervised pre-training between hidden layers, viewing hidden layers as ‘condensing’ a low-level representation into a high-level one

    • fine-tune with backprop after initializing with those ‘good’ weights, because direct backprop may get stuck more easily

    • speed-up: better optimization algorithms, and faster GPUs

    currently very useful for vision and speech recognition

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 23/25

  • Neural Network Deep Neural Networks

    Fun Time

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 24/25

  • Neural Network Deep Neural Networks

    Summary

    Lecture 11: Neural Network
    Random Forest: Theory and Practice

    Neural Network Motivation

    Neural Network Hypothesis

    Neural Network Training

    Deep Neural Networks

    Hsuan-Tien Lin (NTU CSIE) Machine Learning Techniques 25/25
