Neural Networks: Introduction to Artificial Intelligence, COS302, Michael L. Littman, Fall 2001


Page 1: Neural Networks

Introduction to Artificial Intelligence
COS302
Michael L. Littman
Fall 2001

Page 2: Administration

11/28: Neural Networks (Ch. 19 [19.3, 19.4])
12/03: Latent Semantic Indexing
12/05: Belief Networks (Ch. 15 [15.1, 15.2])
12/10: Belief Network Inference (Ch. 19 [19.6])

Page 3: Proposal

11/28: Neural Networks (Ch. 19 [19.3, 19.4])
12/03: Backpropagation in NNs
12/05: Latent Semantic Indexing
12/10: Segmentation

Page 4: Regression: Data

x_1 = 2, y_1 = 1
x_2 = 6, y_2 = 2.2
x_3 = 4, y_3 = 2
x_4 = 3, y_4 = 1.9
x_5 = 4, y_5 = 3.1

Given x, we want to predict y.

Page 5: Regression: Picture

[Scatter plot of the five (x, y) data points; x axis 0 to 8, y axis 0 to 3.5.]

Page 6: Linear Regression

Linear regression assumes that the expected value of the output given an input, E(y|x), is linear.

Simplest case:

out(x) = w x

for some unknown weight w.

Estimate w given the data.

Page 7: 1-Parameter Linear Reg.

Assume that the data is formed by

y_i = w x_i + noise

where:
• the noise signals are independent
• the noise is normally distributed with mean 0 and unknown variance σ²
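As an aside not on the slides, here is a minimal NumPy sketch of this generative assumption; the true weight w_true and the noise standard deviation sigma are made-up illustration values:

    import numpy as np

    rng = np.random.default_rng(0)
    w_true, sigma = 0.5, 0.3          # hypothetical true weight and noise std. dev.
    x = np.array([2.0, 6.0, 4.0, 3.0, 4.0])

    # y_i = w x_i + noise, with independent Normal(0, sigma^2) noise per point
    y = w_true * x + rng.normal(loc=0.0, scale=sigma, size=x.shape)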

Page 8: Distribution for ys

[Figure: normal density centered at wx, with tick marks at wx ± σ and wx ± 2σ.]

Pr(y | w, x) is normally distributed with mean wx and variance σ².

Page 9: Data to Model

Fix the xs. What w makes the ys most likely? Also known as:

argmax_w Pr(y_1 … y_n | x_1 … x_n, w)
= argmax_w prod_i Pr(y_i | x_i, w)
= argmax_w prod_i exp(-1/2 ((y_i - w x_i)/σ)²)
= argmin_w sum_i (y_i - w x_i)²

Minimize the sum of squared residuals.
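A small sketch (not on the slides) that checks this equivalence numerically on the Page 4 data: the w that maximizes the Gaussian log-likelihood is the same w that minimizes the sum of squared residuals. The grid of candidate weights and the σ value are arbitrary illustration choices:

    import numpy as np

    x = np.array([2.0, 6.0, 4.0, 3.0, 4.0])
    y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])
    sigma = 1.0                               # assumed noise std. dev. (any fixed value works)
    ws = np.linspace(0.0, 1.5, 1501)          # candidate weights

    loglik = np.array([-0.5 * np.sum(((y - w * x) / sigma) ** 2) for w in ws])
    sse = np.array([np.sum((y - w * x) ** 2) for w in ws])

    print(ws[np.argmax(loglik)], ws[np.argmin(sse)])   # the same value of w wins both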

Page 10: Residuals

[Scatter plot of the data with a fitted line; vertical segments mark the residuals δ_i.]

Page 11: How Minimize?

E = sum_i (y_i - w x_i)²
= sum_i y_i² - (2 sum_i x_i y_i) w + (sum_i x_i²) w²

Minimize this quadratic function of w.

E is minimized with

w* = (sum_i x_i y_i) / (sum_i x_i²)

so the ML model is Out(x) = w* x.
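A minimal sketch (not on the slides) of this closed form applied to the Page 4 data:

    import numpy as np

    x = np.array([2.0, 6.0, 4.0, 3.0, 4.0])
    y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

    w_star = np.sum(x * y) / np.sum(x * x)    # w* = (sum_i x_i y_i) / (sum_i x_i^2)

    def out(x_new):
        return w_star * x_new                 # ML model Out(x) = w* x

    print(w_star, out(5.0))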

Page 12: Multivariate Regression

What if the inputs are vectors?

n data points, D components: stack the inputs as the rows of an n×D matrix X = [x_1; …; x_n] and the outputs as an n×1 vector Y = [y_1; …; y_n].

Page 13: Closed Form Solution

Multivariate linear regression assumes a vector w such that

Out(x) = w^T x = w[1] x[1] + … + w[D] x[D]

ML solution: w = (X^T X)^-1 (X^T Y)

X^T X is D×D; its (k, j) element is sum_i x_ij x_ik.
X^T Y is D×1; its k-th element is sum_i x_ik y_i.
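A minimal NumPy sketch (not on the slides); X and Y are made-up data with n = 5 points and D = 2 components:

    import numpy as np

    X = np.array([[2.0, 1.0],
                  [6.0, 0.5],
                  [4.0, 2.0],
                  [3.0, 1.5],
                  [4.0, 0.0]])               # made-up inputs, one row per data point
    Y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

    # ML solution w = (X^T X)^-1 (X^T Y); solve() avoids forming the inverse explicitly
    w = np.linalg.solve(X.T @ X, X.T @ Y)

    def out(x):
        return w @ x                          # Out(x) = w^T x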

Page 14: Got Constants?

[Scatter plot of data whose trend does not pass through the origin; x axis 0 to 8, y axis 0 to 10.]

Page 15: Fitting with an Offset

We might expect a linear function that doesn't go through the origin.

Simple, obvious hack so we don't have to start from scratch…
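The hack the slides use later (Page 21, step 2) is to append a constant 1 component to every input, so the weight on that component becomes the offset. A minimal sketch (not on the slides), using the Page 4 data:

    import numpy as np

    X = np.array([[2.0], [6.0], [4.0], [3.0], [4.0]])    # Page 4 inputs as an n x 1 matrix
    Y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

    # Append a constant-1 column so one weight acts as the offset (intercept)
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])

    w = np.linalg.solve(X1.T @ X1, X1.T @ Y)             # [slope, offset]
    print(w)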

Page 16: Gradient Descent

Scalar function f(w): we want a local minimum.

Start with some value for w.

Gradient descent rule:

w ← w - η ∂/∂w f(w)

η: "learning rate" (a small positive number)

Justify!
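A minimal sketch of the rule (not on the slides), minimizing the hypothetical function f(w) = (w - 3)²; the learning rate and iteration count are arbitrary:

    def f(w):
        return (w - 3.0) ** 2           # hypothetical scalar function, minimized at w = 3

    def df(w):
        return 2.0 * (w - 3.0)          # its derivative d/dw f(w)

    w, eta = 0.0, 0.1                   # starting value and learning rate
    for _ in range(100):
        w = w - eta * df(w)             # w <- w - eta * d/dw f(w)
    print(w)                            # close to 3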

Page 17: Partial Derivatives

E = sum_k (w^T x_k - y_k)² = f(w)

w_j ← w_j - η ∂/∂w_j f(w)

How would a small increase in weight w_j change the error?

Small positive? Large positive? Small negative? Large negative?
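For this E the partial derivative works out to ∂E/∂w_j = 2 sum_k (w^T x_k - y_k) x_kj. A small sketch (not on the slides) checks that against a finite-difference estimate; X, Y, and w are made-up values:

    import numpy as np

    X = np.array([[2.0, 1.0], [6.0, 0.5], [4.0, 2.0]])   # made-up inputs (n = 3, D = 2)
    Y = np.array([1.0, 2.2, 2.0])
    w = np.array([0.3, -0.1])

    def E(w):
        return np.sum((X @ w - Y) ** 2)                   # E = sum_k (w^T x_k - y_k)^2

    grad = 2.0 * X.T @ (X @ w - Y)                        # analytic partial derivatives

    eps = 1e-6
    fd = np.array([(E(w + eps * np.eye(2)[j]) - E(w - eps * np.eye(2)[j])) / (2 * eps)
                   for j in range(2)])
    print(grad, fd)                                       # the two estimates agree closely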

Page 18: Neural Net Connection

Set of weights w.

Find weights to minimize the sum of squared residuals. Why?

When would we want to use gradient descent?

Page 19: Linear Perceptron

Earliest, simplest NN.

[Diagram: inputs x_1, x_2, x_3, …, x_D plus a constant input 1, weighted by w_1, w_2, w_3, …, w_D and w_0, feed a sum unit whose output is y.]

Page 20: Learning Rule

Multivariate linear function, trained by gradient descent.

Derive the update rule (a worked version is sketched below)…

out(x) = w^T x

E = sum_k (w^T x_k - y_k)² = f(w)

w_j ← w_j - η ∂/∂w_j f(w)
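Carrying the derivation out (a sketch, not spelled out on the slides), with δ_k = y_k - w^T x_k:

∂/∂w_j f(w) = sum_k 2 (w^T x_k - y_k) x_kj = -2 sum_k δ_k x_kj

so the update becomes

w_j ← w_j + 2η sum_k δ_k x_kj

which, absorbing the factor of 2 into η, is step 4 of the batch algorithm on the next slide.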

Page 21: Neural Networks Introduction to Artificial Intelligence COS302 Michael L. Littman Fall 2001

““Batch” AlgorithmBatch” Algorithm

1.1. Randomly initialize wRandomly initialize w11…w…wDD

2.2. Append 1s to inputs to allow Append 1s to inputs to allow function to miss the originfunction to miss the origin

3.3. For i=1 to n, For i=1 to n, i i = y= yi i – w– wTT xxii

4.4. For j=1 to D, wFor j=1 to D, wjj= w= wj j + + sum sumii i i xxijij

5.5. If sumIf sumii ii2 2 is small, stop, else 3.is small, stop, else 3.

Why squared?Why squared?
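A minimal sketch of this loop (not on the slides), run on the Page 4 data; the learning rate, iteration cap, and stopping test are made-up choices (this sketch stops when the squared error stops shrinking rather than when it is absolutely small, since noisy data leaves a residual error):

    import numpy as np

    X = np.array([[2.0], [6.0], [4.0], [3.0], [4.0]])    # Page 4 inputs
    Y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

    X = np.hstack([X, np.ones((X.shape[0], 1))])          # step 2: append 1s for the offset
    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])                       # step 1: random initialization

    eta, prev_err = 0.01, np.inf
    for _ in range(10000):
        delta = Y - X @ w                                 # step 3: residuals delta_i
        w = w + eta * (X.T @ delta)                       # step 4: w_j += eta * sum_i delta_i x_ij
        err = np.sum(delta ** 2)                          # step 5: stop once the error settles
        if abs(prev_err - err) < 1e-9:
            break
        prev_err = err
    print(w)                                              # approximately [slope, offset]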

Page 22: Classification

Let's say all outputs are 0 or 1.

How can we interpret the output of the perceptron as zero or one?

Page 23: Classification

[Plot of 0/1-valued outputs against x with a linear fit; y axis from -0.4 to 1, x axis 0 to 10.]

Page 24: Change Output Function

Solution:

Instead of out(x) = w^T x

we'll use

out(x) = g(w^T x)

g: ℝ → (0, 1), a squashing function

Page 25: Sigmoid

E = sum_k (g(w^T x_k) - y_k)² = f(w)

where g(h) = 1/(1 + e^(-h))

[Plot of the sigmoid g(h), rising from 0 to 1.]
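A one-line sketch of the squashing function (not on the slides):

    import numpy as np

    def g(h):
        return 1.0 / (1.0 + np.exp(-h))     # sigmoid: squashes any real h into (0, 1)

    print(g(-5.0), g(0.0), g(5.0))          # close to 0, exactly 0.5, close to 1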

Page 26: Classification Percept.

[Diagram: inputs x_1, x_2, x_3, …, x_D and a constant input 1, weighted by w_1, w_2, w_3, …, w_D and w_0, feed a sum unit producing net_i, which is squashed by g to give the output y.]

Page 27: Classifying Regions

[Diagram: the (x_1, x_2) plane divided by a line, with points labeled 1 on one side and 0 on the other.]

Page 28: Gradient Descent in Perceptrons

Notice g'(h) = g(h)(1 - g(h)).

Let net_i = sum_k w_k x_ik and δ_i = y_i - g(net_i), so out(x_i) = g(net_i).

E = sum_i (y_i - g(net_i))²

∂E/∂w_j = sum_i 2 (y_i - g(net_i)) (-∂/∂w_j g(net_i))
= -2 sum_i (y_i - g(net_i)) g'(net_i) ∂/∂w_j net_i
= -2 sum_i δ_i g(net_i) (1 - g(net_i)) x_ij

Page 29: Delta Rule for Perceptrons

w_j = w_j + η sum_i δ_i out(x_i) (1 - out(x_i)) x_ij

Invented and popularized by Rosenblatt (1962).

Guaranteed convergence; stable behavior for overconstrained and underconstrained problems.
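A minimal sketch of training a classification perceptron with this rule (not on the slides); the toy 0/1-labeled points, learning rate, and iteration count are made-up illustration values:

    import numpy as np

    def g(h):
        return 1.0 / (1.0 + np.exp(-h))                  # sigmoid squashing function

    # Made-up 2-D points (with an appended 1 for the offset weight) and 0/1 labels
    X = np.array([[0.1, 0.2, 1.0],
                  [0.2, 0.1, 1.0],
                  [0.8, 0.9, 1.0],
                  [0.9, 0.8, 1.0]])
    Y = np.array([0.0, 0.0, 1.0, 1.0])

    rng = np.random.default_rng(0)
    w = rng.normal(size=X.shape[1])
    eta = 0.5

    for _ in range(10000):
        out = g(X @ w)                                    # out(x_i) = g(net_i)
        delta = Y - out                                   # delta_i = y_i - out(x_i)
        # Delta rule: w_j += eta * sum_i delta_i out(x_i) (1 - out(x_i)) x_ij
        w = w + eta * (X.T @ (delta * out * (1.0 - out)))

    print(np.round(g(X @ w)))                             # should approach 0, 0, 1, 1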

Page 30: What to Learn

Linear regression as ML

Gradient descent to find ML

Perceptron training rule (regression version and classification version)

Sigmoids for classification problems

Page 31: Homework 9 (due 12/5)

1. Write a program that decides whether a pair of words are synonyms using WordNet. I'll send you the list; you send me the answers.

2. Draw a decision tree that represents (a) f_1 + f_2 + … + f_n (or), (b) f_1 f_2 … f_n (and), (c) parity (an odd number of features "on").

3. Show that g'(h) = g(h)(1 - g(h)).