Lecture 03: Multi-layer Perceptron · y328yu/mycourses/489/lectures/lec03_org.pdf
TRANSCRIPT
CS489/698: Intro to ML · Lecture 03: Multi-layer Perceptron
9/19/17 · Yao-Liang Yu
![Page 2: Lecture 03: Multi-layer Perceptrony328yu/mycourses/489/lectures/lec03_org.pdf · Linear predictor (perceptron / winnow / linear regression) underfits • NNs learn. hierarchical nonlinear](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d388ca188c9931d5c8c3278/html5/thumbnails/2.jpg)
Outline
• Failure of Perceptron
• Neural Network
• Backpropagation
• Universal Approximator
Outline
• Failure of Perceptron
• Neural Network
• Backpropagation
• Universal Approximator
The XOR problem
• X = { (0,0), (0,1), (1,0), (1,1) }, y = { -1, 1, 1, -1 }
• f*(x) = sign[ fxor(x) - ½ ], where fxor(x) = (x1 & ¬x2) | (¬x1 & x2)
No separating hyperplane
• x1 = (0,0), y1 = -1 ⇒ b < 0
• x2 = (0,1), y2 = 1 ⇒ w2 + b > 0
• x3 = (1,0), y3 = 1 ⇒ w1 + b > 0
• x4 = (1,1), y4 = -1 ⇒ w1 + w2 + b < 0
• Contradiction! Adding the inequalities for x2 and x3 gives (w1 + w2 + b) + b > 0, which contradicts w1 + w2 + b < 0 and b < 0.
• [At least one of the inequalities has to be strict for the contradiction]
Exercise: what happens if we run perceptron/winnow on this example?
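As a sketch of the exercise (illustrative code, not part of the slides): running the classic perceptron update on the four XOR points never reaches an error-free pass, because an error-free pass would yield exactly the separating hyperplane shown above to be impossible.

```python
import numpy as np

# Run the classic perceptron update on the XOR data. Since no separating
# hyperplane exists, the weights cycle forever and the algorithm never
# completes a pass with zero mistakes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])
X_aug = np.hstack([X, np.ones((4, 1))])  # absorb the bias b into the weights

w = np.zeros(3)
for epoch in range(100):
    mistakes = 0
    for xi, yi in zip(X_aug, y):
        if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
            w += yi * xi         # perceptron update
            mistakes += 1
    if mistakes == 0:
        break
print(epoch + 1, mistakes)  # still making mistakes after 100 passes
```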
Fixing the problem
• Our model (hyperplanes) underfits the data (the XOR function)
• Option 1: fix the representation, use a richer model
• Option 2: fix the (linear) model, use a richer representation, e.g., hand-craft the extra feature x1x2
• NN: still uses hyperplanes, but learns the representation simultaneously
Outline
• Failure of Perceptron
• Neural Network
• Backpropagation
• Universal Approximator
Two-layer perceptron
[Figure: a two-layer perceptron. Input layer (x1, x2, plus a constant 1); a first linear layer with weights u11, u12, u21, u22 and biases c1, c2 produces z1, z2; a nonlinear activation function gives the hidden-layer units h1, h2 (the learned representation); a second linear layer with weights w1, w2 and bias b produces the output ŷ. The nonlinear transform between the two linear layers makes all the difference!]
Does it work?
• x1 = (0,0), y1 = -1: z1 = (0,-1), h1 = (0,0), ŷ1 = -1 ✔
• x2 = (0,1), y2 = 1: z2 = (1,0), h2 = (1,0), ŷ2 = 1 ✔
• x3 = (1,0), y3 = 1: z3 = (1,0), h3 = (1,0), ŷ3 = 1 ✔
• x4 = (1,1), y4 = -1: z4 = (2,1), h4 = (2,1), ŷ4 = -1 ✔
Nonlinear activation: Rectified Linear Unit (ReLU), f(z) = max(z, 0)
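A quick check of the table above in code. The actual weights live in the slide's figure, so the values of U, c, w, b below are an assumption chosen to reproduce the listed z-values (z = Ux + c, h = ReLU(z), ŷ = sign(w·h + b)):

```python
import numpy as np

# Assumed weights consistent with the z-values on the slide.
U = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = -0.5

def predict(x):
    z = U @ x + c                    # first linear layer
    h = np.maximum(z, 0.0)           # ReLU activation
    return int(np.sign(w @ h + b))   # second linear layer + sign

X = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print([predict(x) for x in X])  # → [-1, 1, 1, -1], matching the XOR labels
```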
Multi-layer perceptron
[Figure: a multi-layer perceptron. Input layer x1, …, xd in R^d (plus a bias unit); fully connected weights to a hidden layer z1, …, zk → h1, …, hk in R^k; fully connected weights to an output layer ŷ1, …, ŷm in R^m.]
Multi-layer perceptron (stacked)
[Figure: p copies of the previous block stacked into layers 1, 2, …, p. The number of stacked layers is the depth; the number of units per layer is the width. The computation is feed-forward: the hidden layers form a learned hierarchical nonlinear feature representation, followed by a linear predictor.]
Activation function
• Sigmoid: f(z) = 1 / (1 + e^{-z}) (smooth, saturates)
• Tanh: f(z) = tanh(z) (smooth, saturates)
• Rectified Linear (ReLU): f(z) = max(z, 0) (nonsmooth at 0)
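The three activations, written out directly (a minimal sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(z, 0.0)

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))  # saturates toward 0 and 1 for large |z|
print(tanh(z))     # saturates toward -1 and 1
print(relu(z))     # grows linearly for z > 0, exactly 0 otherwise
```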
Underfitting vs. Overfitting
• Linear predictors (perceptron / winnow / linear regression) underfit
• NNs learn hierarchical nonlinear features jointly with a linear predictor
  • may overfit
  • tons of heuristics to combat this (some later)
• Size of a NN: # of weights/connections
  • ~1 billion; the human brain: ~10^6 billion (estimated)
Weights training
[Figure: the stacked multi-layer perceptron viewed as one mapping from input x in R^d to prediction ŷ in R^m.]
• Need a loss to measure the difference between the prediction ŷ and the truth y
  • e.g., the squared loss (ŷ - y)²; more later
• Need a training set {(x1,y1), …, (xn,yn)} to train weights Θ
ŷ = q(x; Θ)
Gradient Descent
• Update: Θ ← Θ - η ∇J(Θ)
• Step size η (learning rate):
  • constant, if L is smooth
  • diminishing, otherwise
• (Generalized) gradient: each step averages over all n training samples, i.e., O(n) work!
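A minimal gradient-descent sketch on a toy least-squares objective (the data and objective are illustrative, not from the lecture):

```python
import numpy as np

# Gradient descent on J(theta) = (1/n) * sum_i (x_i . theta - y_i)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

def grad(theta):
    # Full-batch gradient: one pass over all n samples -> O(n) per step.
    return 2.0 / len(X) * X.T @ (X @ theta - y)

theta = np.zeros(3)
eta = 0.1                       # constant step size (J is smooth here)
for _ in range(500):
    theta -= eta * grad(theta)  # theta <- theta - eta * grad J(theta)

print(np.round(theta, 3))  # converges to theta_true = [ 1.  -2.   0.5]
```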
Stochastic Gradient Descent (SGD)
• diminishing step size, e.g., η_t = 1/√t or 1/t
• variants: averaging, momentum, variance reduction, etc.
• sampling: without replacement; cycle through the data; permute in each pass
• The full gradient is an average over n samples; a single random sample gives an unbiased estimate and suffices
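A minimal SGD sketch on a toy least-squares problem (illustrative data): each step uses the gradient at one random sample, O(1) work instead of O(n), with a diminishing 1/√t step size:

```python
import numpy as np

# SGD: replace the average over n gradients by the gradient at a
# single randomly sampled point.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

theta = np.zeros(3)
for t in range(1, 5001):
    i = rng.integers(len(X))                 # a random sample suffices
    g = 2.0 * (X[i] @ theta - y[i]) * X[i]   # stochastic gradient, O(1)
    theta -= (0.1 / np.sqrt(t)) * g          # diminishing step size 1/sqrt(t)

print(np.round(theta, 2))  # close to theta_true = [ 1.  -2.   0.5]
```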
A little history on optimization
• Gradient descent first mentioned in (Cauchy, 1847)
• First rigorous convergence proof (Curry, 1944)
• SGD proposed and analyzed (Robbins & Monro, 1951)
Herbert Robbins (1915–2001)
Outline
• Failure of Perceptron
• Neural Network
• Backpropagation
• Universal Approximator
Backpropagation
• A fancy name for the chain rule for derivatives:
  f(x) = g[ h(x) ]  ⇒  f'(x) = g'[ h(x) ] * h'(x)
• Efficiently computes the derivatives in a NN
• Two passes; complexity = O(size(NN))
  • forward pass: compute function values sequentially
  • backward pass: compute derivatives sequentially, in reverse order
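A hand-rolled instance of the two passes for the two-layer network, with a finite-difference check of one derivative (illustrative code; the variable names are mine, not the lecture's):

```python
import numpy as np

# Two-pass computation for a two-layer net with ReLU and squared loss.
def forward_backward(U, c, w, b, x, y):
    # Forward pass: compute and cache each intermediate value.
    z = U @ x + c
    h = np.maximum(z, 0.0)
    yhat = w @ h + b
    L = (yhat - y) ** 2
    # Backward pass: chain rule, from the loss back to each weight.
    dyhat = 2.0 * (yhat - y)
    dw, db = dyhat * h, dyhat
    dh = dyhat * w
    dz = dh * (z > 0)                # ReLU derivative (0 where z <= 0)
    dU, dc = np.outer(dz, x), dz
    return L, (dU, dc, dw, db)

# Check the derivative w.r.t. b against a finite difference.
rng = np.random.default_rng(0)
U, c = rng.normal(size=(2, 2)), rng.normal(size=2)
w, b = rng.normal(size=2), 0.3
x, y = np.array([0.5, -1.0]), 1.0
L, grads = forward_backward(U, c, w, b, x, y)
eps = 1e-6
L2, _ = forward_backward(U, c, w, b + eps, x, y)
print(abs((L2 - L) / eps - grads[3]) < 1e-4)  # → True
```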
Algorithm
• f: activation function
• J = L + λΩ: training objective
[Figure: forward-pass and backward-pass pseudocode]
![Page 22: Lecture 03: Multi-layer Perceptrony328yu/mycourses/489/lectures/lec03_org.pdf · Linear predictor (perceptron / winnow / linear regression) underfits • NNs learn. hierarchical nonlinear](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d388ca188c9931d5c8c3278/html5/thumbnails/22.jpg)
History (perfect for a course project!)
Outline
• Failure of Perceptron
• Neural Network
• Backpropagation
• Universal Approximator
Rationals are dense in R
• Any real number can be approximated by some rational number arbitrarily well
• Or in fancy mathematical language
[Formula labels: domain of interest · subset in pocket · metric for approx.]
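The formula itself appears as an image in the transcript; presumably it is the standard density statement, with the slide's three labels attached where they belong:

```latex
\forall x \in \underbrace{\mathbb{R}}_{\text{domain of interest}}\;
\forall \epsilon > 0\;
\exists q \in \underbrace{\mathbb{Q}}_{\text{subset in pocket}}:\quad
\underbrace{|x - q|}_{\text{metric for approx.}} < \epsilon .
```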
Kolmogorov-Arnold Theorem
• Binary addition is the “only” multivariate function!
• Solves Hilbert’s 13th problem!
• The univariate functions can be constructed from g, but g is unknown…
Theorem (Kolmogorov, Arnold, Lorentz, Sprecher, …). Any continuous function g: [0,1]^d → R can be written (exactly!) as:
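The formula appears as an image in the transcript; presumably it is the standard Kolmogorov–Arnold superposition, with continuous univariate functions Φ_q and φ_{q,p}:

```latex
g(x_1, \dots, x_d) \;=\; \sum_{q=0}^{2d} \Phi_q\!\left( \sum_{p=1}^{d} \phi_{q,p}(x_p) \right)
```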
Andrey Kolmogorov (1903–1987)
Foundations of the theory of probability
"Every mathematician believes that he is ahead of the others. The reason none state this belief in public is because they are intelligent people."
Universal Approximator
• the conditions on f are necessary in some sense
• they include (almost) all activation functions used in practice
Theorem (Cybenko, Hornik et al., Leshno et al., …). Any continuous function g: [0,1]^d → R can be uniformly approximated to arbitrary precision by a two-layer NN with an activation function f that
1. is locally bounded,
2. has a “negligible” closure of its set of discontinuity points, and
3. is not a polynomial.
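A numerical illustration of the flavor of this result (not the theorem's construction; the target g(x) = sin(2πx) and all parameter choices are illustrative): fix many ReLU hidden units with random kink locations in [0,1] and fit only the linear output layer by least squares.

```python
import numpy as np

# Random-feature two-layer ReLU net approximating g(x) = sin(2*pi*x).
rng = np.random.default_rng(0)
k = 200                                    # hidden width
t = rng.uniform(0.0, 1.0, size=k)          # random kink locations
s = rng.choice([-1.0, 1.0], size=k)        # random orientations

x = np.linspace(0.0, 1.0, 500)
H = np.maximum(s * (x[:, None] - t), 0.0)  # hidden ReLU features, (500, k)
H = np.hstack([H, np.ones((500, 1))])      # plus an output-bias column
g = np.sin(2 * np.pi * x)

w, *_ = np.linalg.lstsq(H, g, rcond=None)  # fit the linear output layer
err = np.max(np.abs(H @ w - g))            # uniform (sup-norm) error
print(err < 0.05)  # the fit is uniformly accurate; improves with width k
```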
Caveat and Remedy
• NNs were praised for being “universal”
  • but we shall see later that many kernels are universal as well
  • desirable, but perhaps not THE explanation
• May need exponentially many hidden units…
• Increasing depth may reduce network size, exponentially!
  • can be a course project; ask for references
Questions?