Lecture 03: Multi-layer Perceptron · y328yu/mycourses/489/lectures/lec03_org.pdf
TRANSCRIPT
CS489/698: Intro to ML · Lecture 03: Multi-layer Perceptron
9/19/17 · Yao-Liang Yu
![Page 2: Lecture 03: Multi-layer Perceptrony328yu/mycourses/489/lectures/lec03_org.pdf · Linear predictor (perceptron / winnow / linear regression) underfits • NNs learn. hierarchical nonlinear](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d388ca188c9931d5c8c3278/html5/thumbnails/2.jpg)
Outline
• Failure of Perceptron
• Neural Network
• Backpropagation
• Universal Approximator
Outline
• Failure of Perceptron
• Neural Network
• Backpropagation
• Universal Approximator
The XOR problem
• X = { (0,0), (0,1), (1,0), (1,1) }, y = { -1, 1, 1, -1 }
• f*(x) = sign[ fxor(x) - ½ ], where fxor(x) = (x1 & ¬x2) | (¬x1 & x2)
No separating hyperplane
• x1 = (0,0), y1 = -1 ⇒ b < 0
• x2 = (0,1), y2 = 1 ⇒ w2 + b > 0
• x3 = (1,0), y3 = 1 ⇒ w1 + b > 0
• x4 = (1,1), y4 = -1 ⇒ w1 + w2 + b < 0
• Contradiction! Adding the inequalities for x2 and x3 gives (w1 + w2 + b) + b > 0, which contradicts w1 + w2 + b < 0 and b < 0.
• [At least one of the inequalities has to be strict for the contradiction]
Exercise: what happens if we run perceptron/winnow on this example?
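As a sketch of the exercise (illustrative code, not part of the slides): running the classic perceptron update on the four XOR points never reaches an error-free pass, because an error-free pass would yield exactly the separating hyperplane shown above to be impossible.

```python
import numpy as np

# Run the classic perceptron update on the XOR data. Since no separating
# hyperplane exists, the weights cycle forever and the algorithm never
# completes a pass with zero mistakes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])
X_aug = np.hstack([X, np.ones((4, 1))])  # absorb the bias b into the weights

w = np.zeros(3)
for epoch in range(100):
    mistakes = 0
    for xi, yi in zip(X_aug, y):
        if yi * (w @ xi) <= 0:   # misclassified (or on the boundary)
            w += yi * xi         # perceptron update
            mistakes += 1
    if mistakes == 0:
        break
print(epoch + 1, mistakes)  # still making mistakes after 100 passes
```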
Fixing the problem
• Our model (hyperplanes) underfits the data (the XOR function)
• Option 1: fix the representation, use a richer model
• Option 2: fix the (linear) model, use a richer representation, e.g., hand-craft the extra feature x1x2
• NN: still uses hyperplanes, but learns the representation simultaneously
Outline
• Failure of Perceptron
• Neural Network
• Backpropagation
• Universal Approximator
Two-layer perceptron
[Figure: a two-layer perceptron. Input layer (x1, x2, plus a constant 1); a first linear layer with weights u11, u12, u21, u22 and biases c1, c2 produces z1, z2; a nonlinear activation function gives the hidden-layer units h1, h2 (the learned representation); a second linear layer with weights w1, w2 and bias b produces the output ŷ. The nonlinear transform between the two linear layers makes all the difference!]
Does it work?
• x1 = (0,0), y1 = -1: z1 = (0,-1), h1 = (0,0), ŷ1 = -1 ✔
• x2 = (0,1), y2 = 1: z2 = (1,0), h2 = (1,0), ŷ2 = 1 ✔
• x3 = (1,0), y3 = 1: z3 = (1,0), h3 = (1,0), ŷ3 = 1 ✔
• x4 = (1,1), y4 = -1: z4 = (2,1), h4 = (2,1), ŷ4 = -1 ✔
Nonlinear activation: Rectified Linear Unit (ReLU), f(z) = max(z, 0)
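A quick check of the table above in code. The actual weights live in the slide's figure, so the values of U, c, w, b below are an assumption chosen to reproduce the listed z-values (z = Ux + c, h = ReLU(z), ŷ = sign(w·h + b)):

```python
import numpy as np

# Assumed weights consistent with the z-values on the slide.
U = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = -0.5

def predict(x):
    z = U @ x + c                    # first linear layer
    h = np.maximum(z, 0.0)           # ReLU activation
    return int(np.sign(w @ h + b))   # second linear layer + sign

X = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print([predict(x) for x in X])  # → [-1, 1, 1, -1], matching the XOR labels
```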
Multi-layer perceptron
[Figure: a multi-layer perceptron. Input layer x1, …, xd in R^d (plus a bias unit); fully connected weights to a hidden layer z1, …, zk → h1, …, hk in R^k; fully connected weights to an output layer ŷ1, …, ŷm in R^m.]
Multi-layer perceptron (stacked)
[Figure: p copies of the previous block stacked into layers 1, 2, …, p. The number of stacked layers is the depth; the number of units per layer is the width. The computation is feed-forward: the hidden layers form a learned hierarchical nonlinear feature representation, followed by a linear predictor.]
Activation function
• Sigmoid: f(z) = 1 / (1 + e^{-z}) (smooth, saturates)
• Tanh: f(z) = tanh(z) (smooth, saturates)
• Rectified Linear (ReLU): f(z) = max(z, 0) (nonsmooth at 0)
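The three activations, written out directly (a minimal sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(z, 0.0)

z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))  # saturates toward 0 and 1 for large |z|
print(tanh(z))     # saturates toward -1 and 1
print(relu(z))     # grows linearly for z > 0, exactly 0 otherwise
```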
Underfitting vs. Overfitting
• Linear predictors (perceptron / winnow / linear regression) underfit
• NNs learn hierarchical nonlinear features jointly with a linear predictor
  • may overfit
  • tons of heuristics to combat this (some later)
• Size of a NN: # of weights/connections
  • ~1 billion; the human brain: ~10^6 billion (estimated)
Weights training
[Figure: the stacked multi-layer perceptron viewed as one mapping from input x in R^d to prediction ŷ in R^m.]
• Need a loss to measure the difference between the prediction ŷ and the truth y
  • e.g., the squared loss (ŷ - y)²; more later
• Need a training set {(x1,y1), …, (xn,yn)} to train weights Θ
ŷ = q(x; Θ)
Gradient Descent
• Update: Θ ← Θ - η ∇J(Θ)
• Step size η (learning rate):
  • constant, if L is smooth
  • diminishing, otherwise
• (Generalized) gradient: each step averages over all n training samples, i.e., O(n) work!
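A minimal gradient-descent sketch on a toy least-squares objective (the data and objective are illustrative, not from the lecture):

```python
import numpy as np

# Gradient descent on J(theta) = (1/n) * sum_i (x_i . theta - y_i)^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

def grad(theta):
    # Full-batch gradient: one pass over all n samples -> O(n) per step.
    return 2.0 / len(X) * X.T @ (X @ theta - y)

theta = np.zeros(3)
eta = 0.1                       # constant step size (J is smooth here)
for _ in range(500):
    theta -= eta * grad(theta)  # theta <- theta - eta * grad J(theta)

print(np.round(theta, 3))  # converges to theta_true = [ 1.  -2.   0.5]
```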
Stochastic Gradient Descent (SGD)
• diminishing step size, e.g., η_t = 1/√t or 1/t
• variants: averaging, momentum, variance reduction, etc.
• sampling: without replacement; cycle through the data; permute in each pass
• The full gradient is an average over n samples; a single random sample gives an unbiased estimate and suffices
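A minimal SGD sketch on a toy least-squares problem (illustrative data): each step uses the gradient at one random sample, O(1) work instead of O(n), with a diminishing 1/√t step size:

```python
import numpy as np

# SGD: replace the average over n gradients by the gradient at a
# single randomly sampled point.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

theta = np.zeros(3)
for t in range(1, 5001):
    i = rng.integers(len(X))                 # a random sample suffices
    g = 2.0 * (X[i] @ theta - y[i]) * X[i]   # stochastic gradient, O(1)
    theta -= (0.1 / np.sqrt(t)) * g          # diminishing step size 1/sqrt(t)

print(np.round(theta, 2))  # close to theta_true = [ 1.  -2.   0.5]
```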
A little history on optimization
• Gradient descent first mentioned in (Cauchy, 1847)
• First rigorous convergence proof (Curry, 1944)
• SGD proposed and analyzed (Robbins & Monro, 1951)
Herbert Robbins (1915–2001)
Outline
• Failure of Perceptron
• Neural Network
• Backpropagation
• Universal Approximator
Backpropagation
• A fancy name for the chain rule for derivatives:
  f(x) = g[ h(x) ]  ⇒  f'(x) = g'[ h(x) ] * h'(x)
• Efficiently computes the derivatives in a NN
• Two passes; complexity = O(size(NN))
  • forward pass: compute function values sequentially
  • backward pass: compute derivatives sequentially, in reverse order
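A hand-rolled instance of the two passes for the two-layer network, with a finite-difference check of one derivative (illustrative code; the variable names are mine, not the lecture's):

```python
import numpy as np

# Two-pass computation for a two-layer net with ReLU and squared loss.
def forward_backward(U, c, w, b, x, y):
    # Forward pass: compute and cache each intermediate value.
    z = U @ x + c
    h = np.maximum(z, 0.0)
    yhat = w @ h + b
    L = (yhat - y) ** 2
    # Backward pass: chain rule, from the loss back to each weight.
    dyhat = 2.0 * (yhat - y)
    dw, db = dyhat * h, dyhat
    dh = dyhat * w
    dz = dh * (z > 0)                # ReLU derivative (0 where z <= 0)
    dU, dc = np.outer(dz, x), dz
    return L, (dU, dc, dw, db)

# Check the derivative w.r.t. b against a finite difference.
rng = np.random.default_rng(0)
U, c = rng.normal(size=(2, 2)), rng.normal(size=2)
w, b = rng.normal(size=2), 0.3
x, y = np.array([0.5, -1.0]), 1.0
L, grads = forward_backward(U, c, w, b, x, y)
eps = 1e-6
L2, _ = forward_backward(U, c, w, b + eps, x, y)
print(abs((L2 - L) / eps - grads[3]) < 1e-4)  # → True
```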
Algorithm
• f: activation function
• J = L + λΩ: training objective
[Figure: forward-pass and backward-pass pseudocode]
![Page 22: Lecture 03: Multi-layer Perceptrony328yu/mycourses/489/lectures/lec03_org.pdf · Linear predictor (perceptron / winnow / linear regression) underfits • NNs learn. hierarchical nonlinear](https://reader030.vdocuments.site/reader030/viewer/2022040622/5d388ca188c9931d5c8c3278/html5/thumbnails/22.jpg)
History (perfect for a course project!)
Outline
• Failure of Perceptron
• Neural Network
• Backpropagation
• Universal Approximator
Rationals are dense in R
• Any real number can be approximated by some rational number arbitrarily well
• Or in fancy mathematical language
[Formula labels: domain of interest · subset in pocket · metric for approx.]
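The formula itself appears as an image in the transcript; presumably it is the standard density statement, with the slide's three labels attached where they belong:

```latex
\forall x \in \underbrace{\mathbb{R}}_{\text{domain of interest}}\;
\forall \epsilon > 0\;
\exists q \in \underbrace{\mathbb{Q}}_{\text{subset in pocket}}:\quad
\underbrace{|x - q|}_{\text{metric for approx.}} < \epsilon .
```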
Kolmogorov-Arnold Theorem
• Binary addition is the “only” multivariate function!
• Solves Hilbert’s 13th problem!
• The univariate functions can be constructed from g, but g is unknown…
Theorem (Kolmogorov, Arnold, Lorentz, Sprecher, …). Any continuous function g: [0,1]^d → R can be written (exactly!) as:
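The formula appears as an image in the transcript; presumably it is the standard Kolmogorov–Arnold superposition, with continuous univariate functions Φ_q and φ_{q,p}:

```latex
g(x_1, \dots, x_d) \;=\; \sum_{q=0}^{2d} \Phi_q\!\left( \sum_{p=1}^{d} \phi_{q,p}(x_p) \right)
```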
Andrey Kolmogorov (1903–1987)
Foundations of the theory of probability
"Every mathematician believes that he is ahead of the others. The reason none state this belief in public is because they are intelligent people."
Universal Approximator
• the conditions on f are necessary in some sense
• they include (almost) all activation functions used in practice
Theorem (Cybenko, Hornik et al., Leshno et al., …). Any continuous function g: [0,1]^d → R can be uniformly approximated to arbitrary precision by a two-layer NN with an activation function f that
1. is locally bounded,
2. has a “negligible” closure of its set of discontinuity points, and
3. is not a polynomial.
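A numerical illustration of the flavor of this result (not the theorem's construction; the target g(x) = sin(2πx) and all parameter choices are illustrative): fix many ReLU hidden units with random kink locations in [0,1] and fit only the linear output layer by least squares.

```python
import numpy as np

# Random-feature two-layer ReLU net approximating g(x) = sin(2*pi*x).
rng = np.random.default_rng(0)
k = 200                                    # hidden width
t = rng.uniform(0.0, 1.0, size=k)          # random kink locations
s = rng.choice([-1.0, 1.0], size=k)        # random orientations

x = np.linspace(0.0, 1.0, 500)
H = np.maximum(s * (x[:, None] - t), 0.0)  # hidden ReLU features, (500, k)
H = np.hstack([H, np.ones((500, 1))])      # plus an output-bias column
g = np.sin(2 * np.pi * x)

w, *_ = np.linalg.lstsq(H, g, rcond=None)  # fit the linear output layer
err = np.max(np.abs(H @ w - g))            # uniform (sup-norm) error
print(err < 0.05)  # the fit is uniformly accurate; improves with width k
```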
Caveat and Remedy
• NNs were praised for being “universal”
  • but we shall see later that many kernels are universal as well
  • desirable, but perhaps not THE explanation
• May need exponentially many hidden units…
• Increasing depth may reduce network size, exponentially!
  • can be a course project; ask for references
Questions?