
Page 1: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5


Function Learning and Neural Nets

R&N: Chap. 20, Sec. 20.5

Page 2: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Function-Learning Formulation

Goal function f
Training set: (x^(i), y^(i)), i = 1, …, n, with y^(i) = f(x^(i))
Inductive inference: find a function h that fits the points well
Same Keep-It-Simple bias

[Figure: training points and a candidate fit, f(x) vs. x]

Page 3: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Least-Squares Fitting

Propose a class of functions g(x, θ) parameterized by θ
Minimize E(θ) = Σᵢ (g(x^(i), θ) − y^(i))²

[Figure: training points and fitted curve, f(x) vs. x]
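As a concrete illustration, here is a minimal sketch of this objective in Python/NumPy (our choice of tooling; the model and data names are ours, not the slides'):

```python
import numpy as np

def squared_error(theta, g, xs, ys):
    """E(theta) = sum_i (g(x_i, theta) - y_i)^2 over the training set."""
    residuals = np.array([g(x, theta) - y for x, y in zip(xs, ys)])
    return np.sum(residuals ** 2)

# Example: fit quality of a line y = theta[0]*x for three points
line = lambda x, theta: theta[0] * x
print(squared_error(np.array([2.0]), line, [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]))
```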

Page 4: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Linear Least-Squares

g(x, θ) = x₁θ₁ + … + x_N θ_N
Best θ given by θ = (AᵀA)⁻¹ Aᵀ b,
where A is the matrix of the x^(i)'s and b is the vector of the y^(i)'s

[Figure: training points with linear fit g(x, θ) approximating f(x)]
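A small sketch of this normal-equation solve in NumPy (our tooling, not the slides'); in practice np.linalg.lstsq is numerically preferable to forming (AᵀA)⁻¹ explicitly:

```python
import numpy as np

# Toy data: N = 4 samples, M = 2 features per sample
A = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 3.0]])           # rows are the x^(i)'s
b = np.array([5.0, 4.5, 7.0, 11.0])  # the y^(i)'s

# theta = (A^T A)^{-1} A^T b, exactly as on the slide
theta = np.linalg.inv(A.T @ A) @ A.T @ b

# Equivalent but better-conditioned solve
theta_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(theta, theta_lstsq)
```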

Page 5: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Constant Offset

Set x₀ = 1, so g(x, θ) = x₀θ₀ + x₁θ₁ + … + x_N θ_N
Best θ again given by θ = (AᵀA)⁻¹ Aᵀ b,
where A is the matrix of the (extended) x^(i)'s and b is the vector of the y^(i)'s

[Figure: training points with affine fit g(x, θ) approximating f(x)]
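In code, the x₀ = 1 trick is just a column of ones prepended to A; a one-line sketch (again NumPy, our assumption):

```python
import numpy as np

A = np.array([[2.0], [0.5], [1.0], [3.0]])      # original x^(i)'s
A1 = np.hstack([np.ones((A.shape[0], 1)), A])   # x0 = 1 column for the offset theta_0
b = np.array([5.0, 2.2, 3.1, 7.0])
theta, *_ = np.linalg.lstsq(A1, b, rcond=None)  # [theta_0, theta_1]
print(theta)
```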

Page 6: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Nonlinear Least-Squares

E.g. quadratic: g(x, θ) = θ₀ + xθ₁ + x²θ₂
E.g. exponential: g(x, θ) = exp(θ₀ + xθ₁)
Any combination: g(x, θ) = exp(θ₀ + xθ₁) + θ₂ + xθ₃

[Figure: linear, quadratic, and other fits to the same points, f(x) vs. x]
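A hedged sketch of fitting the exponential model with SciPy's curve_fit, a standard nonlinear least-squares routine (the slides don't prescribe a solver, so this is our substitution):

```python
import numpy as np
from scipy.optimize import curve_fit

def g(x, t0, t1):
    """Exponential model g(x, theta) = exp(theta_0 + x * theta_1)."""
    return np.exp(t0 + x * t1)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 20)
y = np.exp(0.5 + 1.3 * x) + 0.05 * rng.normal(size=20)  # synthetic data

theta, _ = curve_fit(g, x, y, p0=[0.0, 1.0])  # p0: initial guess
print(theta)  # roughly [0.5, 1.3]
```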

Page 7: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Performance of Nonlinear Least-Squares

Overfitting: too many parameters
Efficient optimization?
• Often can only find a local minimum of the objective E(θ)
• Expensive with lots of data!

Page 8: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Neural Networks

Overfitting: too many parameters
Efficient optimization?
• Often can only find a local minimum of the objective E(θ)
• Expensive with lots of data!

Page 9: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Perceptron
(the goal function f is a Boolean one)

y = g(Σ_{i=1,…,n} wᵢ xᵢ)

[Figure: a unit with inputs x₁, …, x_n, weights wᵢ, activation g, and output y; in the (x₁, x₂) plane, positive and negative examples separated by the line w₁x₁ + w₂x₂ = 0]
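A sketch of this unit in Python. The slides only show the decision rule; the training loop below is the classic perceptron update, an addition of ours, not from the slides:

```python
import numpy as np

def perceptron_predict(w, x):
    """y = g(sum_i w_i * x_i) with g a hard threshold at 0."""
    return 1 if np.dot(w, x) > 0 else 0

def perceptron_train(X, y, epochs=20, lr=0.1):
    """Classic perceptron rule (our assumption): nudge w on each mistake."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            w += lr * (yi - perceptron_predict(w, xi)) * xi
    return w

# Linearly separable toy data in the (x1, x2) plane, plus an x0 = 1 offset input
X = np.array([[1, 2.0, 2.0], [1, 1.5, 2.5], [1, -1.0, -2.0], [1, -2.0, -1.5]])
y = np.array([1, 1, 0, 0])
w = perceptron_train(X, y)
print(w, [perceptron_predict(w, xi) for xi in X])
```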

Page 10: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Perceptron
(the goal function f is a Boolean one)

y = g(Σ_{i=1,…,n} wᵢ xᵢ)

[Figure: the same unit, but with positive and negative examples that no single line can separate (hence the "?")]

Page 11: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Unit (Neuron)

y = g(Σ_{i=1,…,n} wᵢ xᵢ)
g(u) = 1/[1 + exp(−u)]

[Figure: a unit with inputs x₁, …, x_n, weights wᵢ, sigmoid activation g, and output y]
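A direct transcription of this unit into Python (function names are ours):

```python
import numpy as np

def sigmoid(u):
    """g(u) = 1 / (1 + exp(-u))"""
    return 1.0 / (1.0 + np.exp(-u))

def unit(w, x):
    """y = g(sum_i w_i * x_i): a single sigmoid neuron."""
    return sigmoid(np.dot(w, x))

print(unit(np.array([0.5, -1.0]), np.array([2.0, 1.0])))  # g(0.0) = 0.5
```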

Page 12: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

A Single Neuron Can Learn

A disjunction of Boolean literals: x₁ ∨ x₂ ∨ x₃
The majority function
XOR?
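A quick sanity check with hand-picked weights (ours): one threshold unit computes the disjunction and the majority function, while XOR is not linearly separable, so no single unit of this form can represent it:

```python
import numpy as np

def threshold_unit(w, x):
    """Single unit with a hard threshold; x[0] = 1 serves as a constant offset."""
    return 1 if np.dot(w, x) > 0 else 0

# Disjunction x1 v x2 v x3: fires iff at least one input is 1
w_or = np.array([-0.5, 1.0, 1.0, 1.0])
# Majority of three inputs: fires iff at least two inputs are 1
w_maj = np.array([-1.5, 1.0, 1.0, 1.0])

for bits in [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]:
    x = np.array([1, *bits])
    assert threshold_unit(w_or, x) == (1 if any(bits) else 0)
    assert threshold_unit(w_maj, x) == (1 if sum(bits) >= 2 else 0)
print("disjunction and majority OK; XOR has no such weight vector")
```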

Page 13: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Neural Network

Network of interconnected neurons

[Figure: two connected units, each computing y = g(Σᵢ wᵢ xᵢ)]

Acyclic (feed-forward) vs. recurrent networks

Page 14: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Two-Layer Feed-Forward Neural Network

[Figure: inputs feeding a hidden layer with weights w₁ⱼ, feeding an output layer with weights w₂ₖ]

Page 15: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Backpropagation (Principle)

New example: y^(k) = f(x^(k))
φ^(k) = outcome of the NN with weights w^(k−1) for inputs x^(k)
Error function: E^(k)(w^(k−1)) = ||φ^(k) − y^(k)||²
w_ij^(k) = w_ij^(k−1) − ε ∂E^(k)/∂w_ij   (i.e., w^(k) = w^(k−1) − ε∇E)
Backpropagation algorithm: update the weights of the inputs to the last layer, then the weights of the inputs to the previous layer, etc.
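A minimal sketch of this update for a two-layer sigmoid network on one example (our own small implementation of the principle; layer sizes and names are our choices):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def backprop_step(W1, W2, x, y, eps=0.5):
    """One update w <- w - eps * dE/dw for E = ||phi - y||^2
    on a two-layer feed-forward net with sigmoid units."""
    # Forward pass
    h = sigmoid(W1 @ x)      # hidden-layer activations
    phi = sigmoid(W2 @ h)    # network output phi
    # Backward pass: last layer first, then the previous layer
    d_out = 2 * (phi - y) * phi * (1 - phi)   # dE/d(pre-activation) at output
    d_hid = (W2.T @ d_out) * h * (1 - h)      # propagated back to hidden layer
    W2 -= eps * np.outer(d_out, h)
    W1 -= eps * np.outer(d_hid, x)
    return np.sum((phi - y) ** 2)

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, y = np.array([0.5, -1.0]), np.array([1.0])
for _ in range(200):
    err = backprop_step(W1, W2, x, y)
print(err)  # error shrinks toward 0
```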

Page 16: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Understanding Backpropagation

Minimize E(θ) by Gradient Descent…

[Figure: the curve E(θ)]

Page 17: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Understanding Backpropagation

Minimize E(θ) by Gradient Descent…

[Figure: the curve E(θ) with the gradient of E at the current θ]

Page 18: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Understanding Backpropagation

Minimize E(θ) by Gradient Descent…

[Figure: the curve E(θ); the step taken is proportional to the gradient]
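The picture in one loop: a plain gradient-descent sketch on a 1-D objective (the example function and step size are our choices):

```python
def grad_descent(dE, theta0, eps=0.1, steps=100):
    """theta <- theta - eps * dE/dtheta, repeated."""
    theta = theta0
    for _ in range(steps):
        theta -= eps * dE(theta)
    return theta

# Example: E(theta) = (theta - 3)^2, so dE/dtheta = 2*(theta - 3)
print(grad_descent(lambda t: 2 * (t - 3.0), theta0=0.0))  # converges near 3.0
```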

Page 19: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Understanding Backpropagation

Example of Stochastic Gradient Descent
Minimize E(θ) = e₁(θ) + e₂(θ) + … + e_N(θ), where eᵢ = (g(x^(i), θ) − y^(i))²
Take a step to reduce eᵢ

[Figure: the curve E(θ) with the gradient of e₁]

Page 20: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Understanding Backpropagation

Example of Stochastic Gradient Descent
Minimize E(θ) = e₁(θ) + e₂(θ) + … + e_N(θ), where eᵢ = (g(x^(i), θ) − y^(i))²
Take a step to reduce eᵢ

[Figure: the curve E(θ) with the gradient of e₁]

Page 21: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Understanding Backpropagation

Example of Stochastic Gradient Descent
Minimize E(θ) = e₁(θ) + e₂(θ) + … + e_N(θ), where eᵢ = (g(x^(i), θ) − y^(i))²
Take a step to reduce eᵢ

[Figure: the curve E(θ) with the gradient of e₂]

Page 22: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Understanding Backpropagation

Example of Stochastic Gradient Descent
Minimize E(θ) = e₁(θ) + e₂(θ) + … + e_N(θ), where eᵢ = (g(x^(i), θ) − y^(i))²
Take a step to reduce eᵢ

[Figure: the curve E(θ) with the gradient of e₂]

Page 23: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Understanding Backpropagation

Example of Stochastic Gradient Descent
Minimize E(θ) = e₁(θ) + e₂(θ) + … + e_N(θ), where eᵢ = (g(x^(i), θ) − y^(i))²
Take a step to reduce eᵢ

[Figure: the curve E(θ) with the gradient of e₃]

Page 24: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Understanding Backpropagation

Example of Stochastic Gradient Descent
Minimize E(θ) = e₁(θ) + e₂(θ) + … + e_N(θ), where eᵢ = (g(x^(i), θ) − y^(i))²
Take a step to reduce eᵢ

[Figure: the curve E(θ) with the gradient of e₃]
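A sketch of this stochastic variant: instead of the gradient of the full sum E(θ), each step uses the gradient of a single eᵢ (Python, with a linear model as the stand-in example, our choice):

```python
import numpy as np

def sgd(xs, ys, theta0, eps=0.02, epochs=50):
    """Minimize E(theta) = sum_i (g(x_i, theta) - y_i)^2 one term at a time,
    with g(x, theta) = theta * x as the example model."""
    theta = theta0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            e_grad = 2 * (theta * x - y) * x   # gradient of e_i alone
            theta -= eps * e_grad              # step to reduce e_i
    return theta

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 2.5 * xs + 0.1 * np.random.default_rng(1).normal(size=4)
print(sgd(xs, ys, theta0=0.0))  # near 2.5
```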

Page 25: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Stochastic Gradient Descent

[Figure: parameter values over time, settling at a (local) minimum of E]

Page 26: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Stochastic Gradient Descent

[Figure: objective function values over time]

Page 27: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Caveats

Choosing a convergent "learning rate" ε can be hard in practice

[Figure: the curve E(θ)]

Page 28: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Comments and Issues

How to choose the size and structure of networks?
• If the network is too large, risk of over-fitting (data caching)
• If the network is too small, the representation may not be rich enough
Role of representation: e.g., learn the concept of an odd number
Incremental learning

Page 29: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Role of Marketing

Not a good model of a neuron
• Spiking behavior, recurrence in real NNs
No special properties above other learning techniques
Like other learning techniques, a convenient way to get results without thinking too hard

Page 30: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5


Incremental (“Online”) Function Learning

Page 31: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Incremental ("Online") Function Learning

Data is streaming into the learner: x₁,y₁, …, x_t,y_t, with yᵢ = f(xᵢ)
Observes x_{t+1} and must make a prediction for the next time step, y_{t+1}
Brute-force approach:
• Store all data at step t
• Use your learner of choice on all data up to time t, predict for time t+1

Page 32: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Example: Mean Estimation

yᵢ = μ + error term (no x's)
Current estimate: θ_t = 1/t Σ_{i=1…t} yᵢ
θ_{t+1} = 1/(t+1) Σ_{i=1…t+1} yᵢ
        = 1/(t+1) (y_{t+1} + Σ_{i=1…t} yᵢ)
        = 1/(t+1) (y_{t+1} + t·θ_t)

[Figure: five samples and their mean θ₅]

Page 33: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Example: Mean Estimation

yᵢ = μ + error term (no x's)
Current estimate: θ_t = 1/t Σ_{i=1…t} yᵢ
θ_{t+1} = 1/(t+1) Σ_{i=1…t+1} yᵢ
        = 1/(t+1) (y_{t+1} + Σ_{i=1…t} yᵢ)
        = 1/(t+1) (y_{t+1} + t·θ_t)

[Figure: the mean θ₅ and a new sample y₆]

Page 34: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Example: Mean Estimation

yᵢ = μ + error term (no x's)
Current estimate: θ_t = 1/t Σ_{i=1…t} yᵢ
θ_{t+1} = 1/(t+1) Σ_{i=1…t+1} yᵢ
        = 1/(t+1) (y_{t+1} + Σ_{i=1…t} yᵢ)
        = 1/(t+1) (y_{t+1} + t·θ_t)

[Figure: θ₅, y₆, and the update θ₆ = 5/6·θ₅ + 1/6·y₆]

Page 35: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Example: Mean Estimation

θ_{t+1} = 1/(t+1) (y_{t+1} + t·θ_t)
Only need to store θ_t and t
Similar formulas for the standard deviation

[Figure: θ₆ = 5/6·θ₅ + 1/6·y₆]
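The recursion in code, a running-mean sketch (the streaming-update form; names are ours):

```python
def running_mean(stream):
    """Maintain theta_t = mean(y_1..y_t), storing only theta and t."""
    theta, t = 0.0, 0
    for y in stream:
        theta = (y + t * theta) / (t + 1)  # theta_{t+1} = (y_{t+1} + t*theta_t)/(t+1)
        t += 1
        yield theta

print(list(running_mean([2.0, 4.0, 6.0])))  # [2.0, 3.0, 4.0]
```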

Page 36: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Incremental Least Squares

Recall the least-squares estimate θ = (AᵀA)⁻¹ Aᵀ b,
where A is the matrix of the x^(i)'s (laid out in rows) and b is the vector of the y^(i)'s:

A = [ x^(1)ᵀ ; x^(2)ᵀ ; … ; x^(N)ᵀ ]   (N×M)
b = [ y^(1) ; y^(2) ; … ; y^(N) ]       (N×1)

Page 37: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Incremental Least Squares

Let A^(t), b^(t) be the A matrix and b vector up to time t:
θ^(t) = (A^(t)ᵀ A^(t))⁻¹ A^(t)ᵀ b^(t)

A^(t+1) = [ A^(t) ; x^(t+1)ᵀ ]   ((t+1)×M)
b^(t+1) = [ b^(t) ; y^(t+1) ]    ((t+1)×1)

Page 38: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Incremental Least Squares

Let A^(t), b^(t) be the A matrix and b vector up to time t:
θ^(t+1) = (A^(t+1)ᵀ A^(t+1))⁻¹ A^(t+1)ᵀ b^(t+1)

A^(t+1)ᵀ b^(t+1) = A^(t)ᵀ b^(t) + y^(t+1) x^(t+1)

A^(t+1) = [ A^(t) ; x^(t+1)ᵀ ]   ((t+1)×M)
b^(t+1) = [ b^(t) ; y^(t+1) ]    ((t+1)×1)

Page 39: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Incremental Least Squares

Let A^(t), b^(t) be the A matrix and b vector up to time t:
θ^(t+1) = (A^(t+1)ᵀ A^(t+1))⁻¹ A^(t+1)ᵀ b^(t+1)

A^(t+1)ᵀ b^(t+1) = A^(t)ᵀ b^(t) + y^(t+1) x^(t+1)
A^(t+1)ᵀ A^(t+1) = A^(t)ᵀ A^(t) + x^(t+1) x^(t+1)ᵀ

A^(t+1) = [ A^(t) ; x^(t+1)ᵀ ]   ((t+1)×M)
b^(t+1) = [ b^(t) ; y^(t+1) ]    ((t+1)×1)

Page 40: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Incremental Least Squares

Let A^(t), b^(t) be the A matrix and b vector up to time t:
θ^(t+1) = (A^(t+1)ᵀ A^(t+1))⁻¹ A^(t+1)ᵀ b^(t+1)

A^(t+1)ᵀ b^(t+1) = A^(t)ᵀ b^(t) + y^(t+1) x^(t+1)
A^(t+1)ᵀ A^(t+1) = A^(t)ᵀ A^(t) + x^(t+1) x^(t+1)ᵀ

A^(t+1) = [ A^(t) ; x^(t+1)ᵀ ]   ((t+1)×M)
b^(t+1) = [ b^(t) ; y^(t+1) ]    ((t+1)×1)

Page 41: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Incremental Least Squares

Let A^(t), b^(t) be the A matrix and b vector up to time t:
θ^(t+1) = (A^(t+1)ᵀ A^(t+1))⁻¹ A^(t+1)ᵀ b^(t+1)

A^(t+1)ᵀ b^(t+1) = A^(t)ᵀ b^(t) + y^(t+1) x^(t+1)
A^(t+1)ᵀ A^(t+1) = A^(t)ᵀ A^(t) + x^(t+1) x^(t+1)ᵀ

Sherman-Morrison update:
(Y + xxᵀ)⁻¹ = Y⁻¹ − Y⁻¹ xxᵀ Y⁻¹ / (1 + xᵀ Y⁻¹ x)
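The sign in the denominator is easy to misremember (the transcript had a minus; for Y + xxᵀ it is plus); a short NumPy check of the identity as written above (random Y and x, our test harness):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
Y = M @ M.T + 3 * np.eye(3)   # a well-conditioned symmetric Y
x = rng.normal(size=3)

Yinv = np.linalg.inv(Y)
lhs = np.linalg.inv(Y + np.outer(x, x))
rhs = Yinv - (Yinv @ np.outer(x, x) @ Yinv) / (1 + x @ Yinv @ x)
print(np.allclose(lhs, rhs))  # True
```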

Page 42: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5

Incremental Least Squares

Putting it all together. Store:
p^(t) = A^(t)ᵀ b^(t)
Q^(t) = (A^(t)ᵀ A^(t))⁻¹

Update:
p^(t+1) = p^(t) + y x
Q^(t+1) = Q^(t) − Q^(t) xxᵀ Q^(t) / (1 + xᵀ Q^(t) x)
θ^(t+1) = Q^(t+1) p^(t+1)
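A compact sketch of the whole recursive scheme in Python/NumPy. Seeding Q with a scaled identity is a standard regularization trick we add here, since Q = (AᵀA)⁻¹ is undefined until AᵀA becomes invertible:

```python
import numpy as np

class IncrementalLeastSquares:
    """Store p = A^T b and Q = (A^T A)^(-1); update per sample via Sherman-Morrison."""
    def __init__(self, dim, q0=1e6):
        self.p = np.zeros(dim)
        self.Q = q0 * np.eye(dim)   # large q0 ~ near-zero initial A^T A (our regularizer)
        self.theta = np.zeros(dim)

    def update(self, x, y):
        self.p += y * x                            # p <- p + y x
        Qx = self.Q @ x
        self.Q -= np.outer(Qx, Qx) / (1 + x @ Qx)  # Sherman-Morrison rank-1 update
        self.theta = self.Q @ self.p               # theta <- Q p
        return self.theta

rls = IncrementalLeastSquares(dim=2)
rng = np.random.default_rng(0)
for _ in range(200):
    x = rng.normal(size=2)
    y = x @ np.array([2.0, -1.0]) + 0.01 * rng.normal()
    theta = rls.update(x, y)
print(theta)  # close to [2, -1]
```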

Page 43: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5


Recap

• Function learning with least squares

• Neural nets, backpropagation, and gradient descent

• Incremental learning

Page 44: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5


Reminder

• HW6 due

• HW7 available on Oncourse

Page 45: Function Learning and Neural Nets R&N: Chap. 20, Sec. 20.5


Machine Learning Classes

• CS659 (Hauser) Principles of Intelligent Robot Motion

• CS657 (Yu) Computer Vision

• STAT520 (Trosset) Introduction to Statistics

• STAT682 (Rocha) Statistical Model Selection