learningfromdata lecture2 theperceptronmagdon/courses/lfd-slides/slideslect02.pdf · a perceptron...

25
Learning From Data Lecture 2 The Perceptron The Learning Setup A Simple Learning Algorithm: PLA Other Views of Learning Is Learning Feasible: A Puzzle M. Magdon-Ismail CSCI 4100/6100

Upload: duongcong

Post on 30-Sep-2018

217 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Learning From DataLecture 2

The Perceptron

The Learning Setup

A Simple Learning Algorithm: PLAOther Views of Learning

Is Learning Feasible: A Puzzle

M. Magdon-IsmailCSCI 4100/6100

Page 2: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

recap: The Plan

1. What is Learning?

2. Can We do it?

3. How to do it?

4. How to do it well?

5. General principles?

6. Advanced techniques.

7. Other Learning Paradigms.

conceptstheorypractice

our language will be mathematics . . .. . . our sword will be computer algorithms

c© AML Creator: Malik Magdon-Ismail The Perceptron: 2 /25 Recap: key players −→

Page 3: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

recap: The Key Players

• Salary, debt, years in residence, . . . input x ∈ Rd = X .

• Approve credit or not output y ∈ {−1,+1} = Y .

• True relationship between x and y target function f : X 7→ Y .

(The target f is unknown.)

• Data on customers data set D = (x1, y1), . . . , (xN , yN).

(yn = f(xn).)

X Y and D are given by the learning problem;

The target f is fixed but unknown.

We learn the function f from the data D.

c© AML Creator: Malik Magdon-Ismail The Perceptron: 3 /25 Recap: learning setup −→

Page 4: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

recap: Summary of the Learning Setup

(ideal credit approval formula)

(historical records of credit customers)

(set of candidate formulas)

(learned credit approval formula)

UNKNOWN TARGET FUNCTION

f : X 7→ Y

TRAINING EXAMPLES

(x1, y1), (x2, y2), . . . , (xN , yN )

HYPOTHESIS SET

H

FINAL

HYPOTHESIS

g ≈ f

LEARNING

ALGORITHM

A

yn = f(xn)

c© AML Creator: Malik Magdon-Ismail The Perceptron: 4 /25 Simple learning model −→

Page 5: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

A Simple Learning Model

• Input vector x = [x1, . . . , xd]t.

• Give importance weights to the different inputs and compute a “Credit Score”

“Credit Score” =d∑

i=1

wixi.

• Approve credit if the “Credit Score” is acceptable.

Approve credit ifd∑

i=1

wixi > threshold, (“Credit Score” is good)

Deny credit ifd∑

i=1

wixi < threshold. (“Credit Score” is bad)

• How to choose the importance weights wi

input xi is important =⇒ large weight |wi|

input xi beneficial for credit =⇒ positive weight wi > 0input xi detrimental for credit =⇒ negative weight wi < 0

c© AML Creator: Malik Magdon-Ismail The Perceptron: 5 /25 Rewriting the model −→

Page 6: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

A Simple Learning Model

Approve credit ifd∑

i=1

wixi > threshold,

Deny credit ifd∑

i=1

wixi < threshold.

can be written formally as

h(x) = sign

((

d∑

i=1

wixi

)

+ w0

)

The “bias weight” w0 corresponds to the threshold. (How?)

c© AML Creator: Malik Magdon-Ismail The Perceptron: 6 /25 Perceptron −→

Page 7: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

The Perceptron Hypothesis Set

We have defined a Hyopthesis set H

H = {h(x) = sign(wtx)} ← uncountably infinite H

w =

w0

w1...wd

∈ Rd+1, x =

1x1...xd

∈ {1} × Rd.

This hypothesis set is called the perceptron or linear separator

c© AML Creator: Malik Magdon-Ismail The Perceptron: 7 /25 Geometry of perceptron −→

Page 8: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Geometry of The Perceptron

h(x) = sign(wtx) (Problem 1.2 in LFD)

Age

Inco

me

Age

Incom

e

Which one should we pick?

c© AML Creator: Malik Magdon-Ismail The Perceptron: 8 /25 Use the data −→

Page 9: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Use the Data to Pick a Line

Age

Inco

me

Age

Inco

me

A perceptron fits the data by using a line to separate the +1 from −1 data.

Fitting the data: How to find a hyperplane that separates the data?(“It’s obvious - just look at the data and draw the line,” is not a valid solution.)

c© AML Creator: Malik Magdon-Ismail The Perceptron: 9 /25 How to learn g −→

Page 10: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

How to Learn a Final Hypothesis g from H

We want to select g ∈ H so that g ≈ f .

We certainly want g ≈ f on the data set D. Ideally,

g(xn) = yn.

How do we find such a g in the infinite hypothesis set H, if it exists?

Idea! Start with some weight vector and try to improve it.

Age

Inco

me

c© AML Creator: Malik Magdon-Ismail The Perceptron: 10 /25 PLA −→

Page 11: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

The Perceptron Learning Algorithm (PLA)

A simple iterative method.

1: w(1) = 0

2: for iteration t = 1, 2, 3, . . .

3: the weight vector is w(t).

4: From (x1, y1), . . . , (xN , yN) pick any misclassified example.

5: Call the misclassified example (x∗, y∗),

sign (w(t) • x∗) 6= y∗.

6: Update the weight:

w(t + 1) = w(t) + y∗x∗.

7: t← t + 1

w(t + 1)w(t)

w(t)w(t + 1)

x∗

y∗ = −1

x∗

y∗x∗

y∗ = +1

y∗x∗

PLA implements our idea: start at some weights and try to improve.

“incremental learning”on a single example at a time

c© AML Creator: Malik Magdon-Ismail The Perceptron: 11 /25 PLA convergence −→

Page 12: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Does PLA Work?

Theorem. If the data can be fit by a linear separator, then after some finite numberof steps, PLA will find one.

What if the data cannot be fit by a perceptron?

AgeIncome

iteration 1

c© AML Creator: Malik Magdon-Ismail The Perceptron: 12 /25 Start −→

Page 13: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Does PLA Work?

Theorem. If the data can be fit by a linear separator, then after some finite numberof steps, PLA will find one.

What if the data cannot be fit by a perceptron?

AgeIncome

iteration 1

c© AML Creator: Malik Magdon-Ismail The Perceptron: 13 /25 Iteration 1 −→

Page 14: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Does PLA Work?

Theorem. If the data can be fit by a linear separator, then after some finite numberof steps, PLA will find one.

After how long?

What if the data cannot be fit by a perceptron?

AgeIncome

iteration 1

c© AML Creator: Malik Magdon-Ismail The Perceptron: 14 /25 Iteratrion 2 −→

Page 15: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Does PLA Work?

Theorem. If the data can be fit by a linear separator, then after some finite numberof steps, PLA will find one.

After how long?

What if the data cannot be fit by a perceptron?

AgeIncome

iteration 2

c© AML Creator: Malik Magdon-Ismail The Perceptron: 15 /25 Iteration 3 −→

Page 16: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Does PLA Work?

Theorem. If the data can be fit by a linear separator, then after some finite numberof steps, PLA will find one.

After how long?

What if the data cannot be fit by a perceptron?

AgeIncome

iteration 3

c© AML Creator: Malik Magdon-Ismail The Perceptron: 16 /25 Iteration 4 −→

Page 17: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Does PLA Work?

Theorem. If the data can be fit by a linear separator, then after some finite numberof steps, PLA will find one.

After how long?

What if the data cannot be fit by a perceptron?

AgeIncome

iteration 4

c© AML Creator: Malik Magdon-Ismail The Perceptron: 17 /25 Iteration 5 −→

Page 18: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Does PLA Work?

Theorem. If the data can be fit by a linear separator, then after some finite numberof steps, PLA will find one.

After how long?

What if the data cannot be fit by a perceptron?

AgeIncome

iteration 5

c© AML Creator: Malik Magdon-Ismail The Perceptron: 18 /25 Iteration 6 −→

Page 19: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Does PLA Work?

Theorem. If the data can be fit by a linear separator, then after some finite numberof steps, PLA will find one.

After how long?

What if the data cannot be fit by a perceptron?

AgeIncome

iteration 6

c© AML Creator: Malik Magdon-Ismail The Perceptron: 19 /25 Non-separable data? −→

Page 20: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Does PLA Work?

Theorem. If the data can be fit by a linear separator, then after some finite numberof steps, PLA will find one.

After how long?

What if the data cannot be fit by a perceptron?

AgeIncome

iteration 1

c© AML Creator: Malik Magdon-Ismail The Perceptron: 20 /25 We can fit! −→

Page 21: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

We can Fit the Data

•We can find an h that works from infinitely many (for the perceptron).(So computationally, things seem good.)

• Ultimately, remember that we want to predict.

We don’t care about the data, we care about “outside the data”.

Can a limited data set reveal enough information to pin down anentire target function, so that we can predict outside the data?

c© AML Creator: Malik Magdon-Ismail The Perceptron: 21 /25 Other views of learning −→

Page 22: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Other Views of Learning

• Design: learning is from data, design is from specs and a model.

• Statistics, Function Approximation.

• Data Mining: find patterns in massive data (typically unsupervised).

• Three Learning Paradigms

– Supervised: the data is (xn, f(xn)) – you are told the answer.

– Reinforcement: you get feedback on potential answers you try:

x→ try something → get feedback.

– Unsupervised: only given xn, learn to “organize” the data.

c© AML Creator: Malik Magdon-Ismail The Perceptron: 22 /25 Coins – supervised −→

Page 23: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Supervised Learning - Classifying Coins

10

15

25

Size

Mass

10

1

5

25

SizeMass

c© AML Creator: Malik Magdon-Ismail The Perceptron: 23 /25 Coins – unsupervised −→

Page 24: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Unsupervised Learning - Categorizing Coins

Size

Mass

type 1

type 2

type 3

type 4

SizeMass

c© AML Creator: Malik Magdon-Ismail The Perceptron: 24 /25 Puzzle: outside the data −→

Page 25: LearningFromData Lecture2 ThePerceptronmagdon/courses/LFD-Slides/SlidesLect02.pdf · A perceptron fits the data by using a line to separate the +1 from −1 data. Fittingthe data:

Outside the Data Set - A Puzzle

Trees (f = +1)

Dogs (f = −1)

Tree or Dog? (f = ?)

c© AML Creator: Malik Magdon-Ismail The Perceptron: 25 /25