
Lecture 4: Linear predictors and the Perceptron

Introduction to Learning and Analysis of Big Data

Kontorovich and Sabato (BGU)


Inductive Bias

Inductive bias is critical to prevent overfitting.

Inductive bias =⇒ a relatively “simple” hypothesis class H.

What if we don’t know which H is suitable for our learning problem?
- Choose a good representation (relevant features).
- Use a general-purpose hypothesis class.

One of the most popular choices: Linear predictors.


Linear predictors

Recall that in many learning problems, X = R^d.

Each example is a vector with d coordinates (features).

In binary classification problems, Y consists of two labels.

In linear prediction, H is the class of all linear separators.

Illustration with d = 2:


Why restrict to linear predictors?

If the algorithm could choose any separation, we could get overfitting: labels of unseen examples would not be predicted correctly.

- E.g., if using a squiggly line.


Preventing overfitting

Linear predictors prevent overfitting:

- In dimension d, if the training sample size is Θ(d), then training error ≈ true prediction error.

Dimension d ≡ number of features.

So, if we use linear predictors with enough training samples, does this guarantee low prediction error?

Recall the No-Free-Lunch theorem...


Preventing overfitting is not enough

No overfitting: error on the training sample ≈ true prediction error on the distribution.

We can still have high error on the training sample.


Preventing overfitting is not enough

Recall the decomposition of the prediction error err(h_S, D):
- Approximation error: err_app := inf_{h∈H} err(h, D)
- Estimation error: err_est := err(h_S, D) − inf_{h∈H} err(h, D).

No overfitting ≡ estimation error is low.

But we also need low approximation error: if all linear predictors have high error, no sample size will work.

In practice, the approximation error of linear predictors is often low:
- If “good” features are chosen to represent the examples!


Formalizing linear predictors

To formalize linear predictors, we will use inner products.

Definition: For vectors x, z ∈ R^d, 〈x, z〉 := Σ_{i=1}^d x(i)z(i).

The length of a vector x ∈ R^d can be defined by its inner product:

‖x‖ = √(Σ_{i=1}^d x(i)²) = √〈x, x〉.

The angle between two vectors is defined by their inner product:

cos(θ) = 〈x, z〉 / (‖x‖ · ‖z‖).

(Large value =⇒ small angle.)

Inner products are commutative: 〈x, z〉 = 〈z, x〉.

Inner products are linear: if a, b ∈ R and x, x′, z ∈ R^d,

〈a · x, z〉 = a · 〈x, z〉,
〈x + x′, z〉 = 〈x, z〉 + 〈x′, z〉.
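As a quick sanity check of these definitions, here is a minimal NumPy sketch (the vectors are made-up toy values, not from the slides):

```python
import numpy as np

# Toy vectors chosen for illustration only.
x = np.array([3.0, 4.0])
z = np.array([4.0, 3.0])

inner = np.dot(x, z)                                   # <x, z> = sum_i x(i) z(i)
norm_x = np.sqrt(np.dot(x, x))                         # ||x|| = sqrt(<x, x>)
cos_theta = inner / (np.linalg.norm(x) * np.linalg.norm(z))

print(inner, norm_x, cos_theta)                        # 24.0 5.0 0.96
```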


Formalizing linear predictors

Call the labels Y = {−1, +1}.

In 2 dimensions: the line is a · x(1) + b = x(2).

The linear prediction rule is:

y = +1  if a · x(1) + b ≥ x(2),
y = −1  if a · x(1) + b < x(2).


Formalizing linear predictors

In two dimensions:

y = +1  if a · x(1) + b ≥ x(2),
y = −1  if a · x(1) + b < x(2).

Define a vector w = (w(1), w(2)) = (a, −1).

We can rewrite this as:

y = sign(w(1) · x(1) + w(2) · x(2) + b) = sign(〈w, x〉 + b).

For a vector w ∈ R^d and a number b ∈ R, define the linear predictor h_{w,b}:

∀x ∈ R^d,  h_{w,b}(x) := sign(〈w, x〉 + b).

b is called the bias of the predictor.

Hypothesis class of all linear predictors in dimension d:

H_L^d := {h_{w,b} | w ∈ R^d, b ∈ R}.
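A minimal Python sketch of h_{w,b} (the numbers are illustrative, and sign(0) is treated as +1 for concreteness):

```python
import numpy as np

def linear_predictor(w: np.ndarray, b: float, x: np.ndarray) -> int:
    """h_{w,b}(x) = sign(<w, x> + b), with labels in {-1, +1} (sign(0) -> +1)."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# The 2-D example above: w = (a, -1) predicts +1 iff a*x(1) + b >= x(2).
a, b = 2.0, 1.0                                        # hypothetical slope and bias
w = np.array([a, -1.0])
print(linear_predictor(w, b, np.array([1.0, 2.0])))    # +1, since 2*1 + 1 >= 2
print(linear_predictor(w, b, np.array([0.0, 5.0])))    # -1, since 0 + 1 < 5
```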


Formalizing linear predictors

h_{w,b}(x) := sign(〈w, x〉 + b),  H_L^d := {h_{w,b} | w ∈ R^d, b ∈ R}.

In 3 dimensions, the linear boundary 〈w, x〉 + b = 0 is a plane.

In higher dimensions, it is a hyperplane.

The vector w is the normal to the hyperplane.

|b|/‖w‖ is the distance from the origin to the hyperplane.


The bias b is not needed

Suppose we have a classification problem with X = R^d.

For every example x ∈ R^d, define x′ ∈ R^{d+1}:

x = (x(1), . . . , x(d))  =⇒  x′ := (x(1), . . . , x(d), 1).

For every linear predictor with a bias, h_{w,b} on R^d, define a linear predictor h_{w′} without a bias on R^{d+1}:

w′ := (w(1), . . . , w(d), b).

We get h_{w,b}(x) = h_{w′}(x′) for all x, w, b:

h_{w′}(x′) = sign(〈x′, w′〉) = sign(〈x, w〉 + b) = h_{w,b}(x).

Conclusion: by adding a coordinate which is always 1, we can discard the bias term.
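A small NumPy sketch of this trick (the data and weights here are hypothetical):

```python
import numpy as np

def add_constant_feature(X: np.ndarray) -> np.ndarray:
    """Map each row x in R^d to x' = (x, 1) in R^(d+1)."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

w, b = np.array([2.0, -1.0]), 0.5
w_prime = np.append(w, b)                   # w' = (w(1), ..., w(d), b)

X = np.array([[1.0, 3.0], [0.0, -2.0]])     # hypothetical examples
X_prime = add_constant_feature(X)

# h_{w,b}(x) and h_{w'}(x') agree on every example.
assert np.array_equal(np.sign(X @ w + b), np.sign(X_prime @ w_prime))
```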


Removing the bias term

From one dimension with a bias to two dimensions without a bias (illustrated on the slide).

Linear predictors without a bias are called homogeneous.

Linear predictors are also called halfspaces.


Implementing the ERM for linear predictors

Implementing ERM: find a linear predictor h_w with minimal empirical error:

err(h_w, S) ≡ (1/m) · |{i : sign(〈x_i, w〉) ≠ y_i}|.
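The empirical error itself is easy to compute; a one-line NumPy sketch (X holds the examples as rows, y the {−1, +1} labels; the names are ours):

```python
import numpy as np

def empirical_error(w: np.ndarray, X: np.ndarray, y: np.ndarray) -> float:
    """err(h_w, S): fraction of sample points with sign(<x_i, w>) != y_i."""
    return float(np.mean(np.sign(X @ w) != y))
```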

Minimizing this error over all linear predictors is NP-hard.

There are workarounds (later in the course).

Today: efficient algorithm if problem is realizable.

Definition

D is realizable by H if there exists some h∗ ∈ H such that err(h∗,D) = 0.


ERM in the realizable case

Definition

D is realizable by H if there exists some h∗ ∈ H such that err(h∗,D) = 0.

For any x_i in the training sample, y_i = h∗(x_i).

So, min_{h∈H} err(h, S) ≤ err(h∗, S) = 0.

ERM in the realizable case: find some h ∈ H such that err(h, S) = 0.

For linear predictors: find h_{w,b} that separates the positive and negative labels in the training sample.
- This can be done efficiently.
- We will see two efficient methods.

For linear predictors: realizable = separable.


ERM for separable linear predictors: Linear Programming

A linear program (LP) is a problem of the following form:

maximize_{w∈R^d} 〈u, w〉
subject to Aw ≥ v.

w ∈ R^d: the vector we wish to find.

u ∈ R^d, v ∈ R^m, A ∈ R^{m×d}.

The values of u, v, A define the specific linear program.

LPs can be solved efficiently.

Many solvers are available.

In Matlab: w = linprog(-u,-A,-v).


ERM for separable linear predictors: Linear Programming

Linear Program

maximize_{w∈R^d} 〈u, w〉
subject to Aw ≥ v

ERM for the separable case: find a linear predictor with zero error on the training sample (x_i, y_i)_{i≤m}.

Recall y_i ∈ {−1, +1}. Our goal:

Find w ∈ R^d s.t. ∀i ≤ m, sign(〈x_i, w〉) = y_i.

This is equivalent to:

Find w ∈ R^d s.t. ∀i ≤ m, y_i〈x_i, w〉 > 0.

Problem: in the linear program we have a weak inequality ≥, not strict >.

If we use ≥ here, the trivial w = 0 satisfies the constraints.


ERM for separable linear predictors: Linear Programming

Linear Program

maximize_{w∈R^d} 〈u, w〉
subject to Aw ≥ v

Our goal: Find w ∈ R^d s.t. ∀i ≤ m, y_i〈x_i, w〉 > 0.

We need to change the strict inequality into a weak one.

If the problem is separable, there exists a solution. Name one of the solutions w∗.

Denote γ := min_i y_i〈x_i, w∗〉. Note that γ > 0.

Define w̃ := w∗/γ.

For all i ≤ m, y_i〈x_i, w̃〉 = y_i〈x_i, w∗〉/γ ≥ 1.

Conclusion: there is a predictor w ∈ R^d such that ∀i ≤ m, y_i〈x_i, w〉 ≥ 1.

Also, any w that satisfies these constraints separates the training sample, so it is a valid solution.


ERM for separable linear predictors: Linear Programming

Linear Program

maximize_{w∈R^d} 〈u, w〉
subject to Aw ≥ v

Our goal can be re-written as:

Find w ∈ R^d s.t. ∀i ≤ m, y_i〈x_i, w〉 ≥ 1.

Turn this into the form of a linear program:
- u = (0, . . . , 0) (nothing is maximized), v = (1, . . . , 1).
- Row i of the matrix A is y_i·x_i ≡ (y_i · x_i(1), . . . , y_i · x_i(d)).
- The linear program:

maximize_{w∈R^d} 0
subject to A·w ≥ (1, . . . , 1)ᵀ, where row i of A is (y_i · x_i(1), . . . , y_i · x_i(d)).
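A minimal sketch of this LP with SciPy on a toy separable dataset of our own; since scipy.optimize.linprog minimizes and uses ≤ constraints, we pass −A and −1:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical separable toy data, labels in {-1, +1}.
X = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 1.0], [1.0, -1.0]])
y = np.array([1, 1, -1, -1])

A = y[:, None] * X                       # row i is y_i * x_i
m, d = A.shape

# linprog solves: minimize c^T w subject to A_ub @ w <= b_ub.
# "maximize 0 s.t. A w >= 1" becomes "minimize 0 s.t. -A w <= -1".
res = linprog(c=np.zeros(d), A_ub=-A, b_ub=-np.ones(m), bounds=[(None, None)] * d)

w = res.x
assert np.all(y * (X @ w) >= 1 - 1e-6)   # zero training error on S
print(w)
```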


ERM for separable linear predictors: Linear Programming

The LP approach is very easy to implement.

But it can be slow.

And it fails completely if there is even one bad label.


The Perceptron

The Perceptron algorithm was invented in 1958 by Rosenblatt.

This version is called the Batch Perceptron.

Goal remains as before: find a linear predictor with zero error on the training sample (x_i, y_i)_{i≤m}.

Perceptron Idea:

- Work in rounds.
- Start with a default predictor.
- In each round, look at a single training example.
- If the current predictor is wrong on this example, move the predictor in the right direction.
- Stop when the predictor assigns the correct label to all examples.


The Perceptron

Batch Perceptron

input: a training sample S = ((x_1, y_1), . . . , (x_m, y_m))
output: w ∈ R^d such that ∀i ≤ m, h_w(x_i) = y_i
1: w^(1) ← (0, . . . , 0); t ← 1
2: while ∃i s.t. y_i〈w^(t), x_i〉 ≤ 0 do
3:   w^(t+1) ← w^(t) + y_i x_i
4:   t ← t + 1
5: end while
6: return w^(t)

Why does the update rule make sense?

y_i〈w^(t+1), x_i〉 = y_i〈w^(t) + y_i x_i, x_i〉
                  = y_i〈w^(t), x_i〉 + y_i²〈x_i, x_i〉
                  = y_i〈w^(t), x_i〉 + ‖x_i‖².

Each update moves y_i〈w^(t), x_i〉 closer to being positive.
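A direct Python transcription of the pseudocode above, as a sketch (the max_updates safety cap and the toy data are our additions, not part of the slides):

```python
import numpy as np

def batch_perceptron(X: np.ndarray, y: np.ndarray, max_updates: int = 10_000) -> np.ndarray:
    """Batch Perceptron sketch; assumes (X, y) is linearly separable, labels in {-1, +1}."""
    d = X.shape[1]
    w = np.zeros(d)                                     # w^(1) = (0, ..., 0)
    for _ in range(max_updates):
        mistakes = np.flatnonzero(y * (X @ w) <= 0)     # i with y_i <w^(t), x_i> <= 0
        if mistakes.size == 0:
            return w                                    # zero error on the sample
        i = mistakes[0]
        w = w + y[i] * X[i]                             # w^(t+1) <- w^(t) + y_i x_i
    raise RuntimeError("no separator found; sample may not be separable")

# Toy separable data (same as in the LP sketch above).
X = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 1.0], [1.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = batch_perceptron(X, y)
assert np.all(y * (X @ w) > 0)
```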


The Perceptron

Illustration in two-dimensional space (d = 2) on the board.
- The separator tilts in the right direction in each update.
- The same example can be repeated several times.

Does this always work?

How many updates does it take to get an error-free separator?


The separation margin

Intuitively, separation is “easier” if positive and negative points are far apart.

“far apart” ≡ there is a separator which is far from all points.


We will show that the Perceptron is indeed faster in this case.

First, let’s make this formal.

Claim

|〈w, x〉| / ‖w‖ is the distance between x ∈ R^d and the separator defined by w.


The separation margin

Claim: the distance between x and the hyperplane defined by w is |〈w, x〉| / ‖w‖.

Define ŵ := w/‖w‖. Then ‖ŵ‖ = √〈ŵ, ŵ〉 = 1.

The hyperplane is H = {v | 〈v, w〉 = 0} = {v | 〈v, ŵ〉 = 0}.
The distance between the hyperplane and x is ∆ := min_{v∈H} ‖x − v‖.
Take v = x − 〈ŵ, x〉 · ŵ. Then v ∈ H, because

〈v, ŵ〉 = 〈x, ŵ〉 − 〈(〈ŵ, x〉 · ŵ), ŵ〉 = 〈x, ŵ〉 − 〈ŵ, x〉〈ŵ, ŵ〉 = 0.

The distance is at most ‖x − v‖:

∆ ≤ ‖x − v‖ = ‖〈ŵ, x〉 · ŵ‖ = |〈ŵ, x〉| · ‖ŵ‖ = |〈ŵ, x〉| = |〈w, x〉| / ‖w‖.

Also, for any u ∈ H:

‖x − u‖² = ‖(x − v) + (v − u)‖² = 〈(x − v) + (v − u), (x − v) + (v − u)〉
         = ‖x − v‖² + 2〈x − v, v − u〉 + ‖v − u‖² ≥ |〈w, x〉|²/‖w‖² + 2〈x − v, v − u〉.

And 〈x − v, v − u〉 = 〈ŵ, x〉〈ŵ, v − u〉 = 0, since v, u ∈ H.

Hence ∆ ≥ |〈w, x〉|/‖w‖ as well, so ∆ = |〈w, x〉|/‖w‖.
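A quick numeric check of this claim (w and x are arbitrary illustrative values):

```python
import numpy as np

w = np.array([3.0, 4.0])                      # hypothetical normal vector
x = np.array([2.0, 1.0])                      # hypothetical point

w_hat = w / np.linalg.norm(w)                 # unit normal w-hat
v = x - np.dot(w_hat, x) * w_hat              # projection of x onto the hyperplane
assert np.isclose(np.dot(v, w_hat), 0.0)      # v lies on the hyperplane <v, w> = 0

dist = np.abs(np.dot(w, x)) / np.linalg.norm(w)
assert np.isclose(np.linalg.norm(x - v), dist)
print(dist)                                   # 2.0
```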


The separation margin

Denote R := max_i ‖x_i‖. We will normalize by this value.

The minimal normalized distance of any x_i in S from w is called the margin of w:

γ(w) := (1/R) · min_{i≤m} |〈w, x_i〉| / ‖w‖.


The separation margin

Which separators have a large margin?

For any α > 0 and w ∈ R^d, α · w defines the same separator, with the same margin.

We can look at separators w such that min_{i≤m} y_i〈w, x_i〉 = 1.

This does not lose generality!

Then

γ(w) = 1 / (R · ‖w‖)    (R := max_i ‖x_i‖).

Small norm ‖w‖ and small R =⇒ large margin γ(w).
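A short sketch computing the margin of a separator on the toy data from the Perceptron sketch above, and checking that it matches the 1/(R·‖w‖) form after rescaling (all values are illustrative):

```python
import numpy as np

X = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 1.0], [1.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([-1.0, 1.0])                     # a separator for this toy data

R = np.max(np.linalg.norm(X, axis=1))
gamma = np.min(np.abs(X @ w)) / (R * np.linalg.norm(w))   # (1/R) min_i |<w, x_i>| / ||w||

# Rescale so that min_i y_i <w, x_i> = 1; the margin is then 1 / (R ||w||).
w_scaled = w / np.min(y * (X @ w))
assert np.isclose(gamma, 1.0 / (R * np.linalg.norm(w_scaled)))
print(gamma)
```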


The Perceptron: Guarantee

Theorem (Theorem 9.1 in the course book)

Assume that S = ((x_1, y_1), . . . , (x_m, y_m)) is separable. Then:

1. When the Perceptron stops and returns w^(t), ∀i ≤ m, y_i〈w^(t), x_i〉 > 0.

2. Define

R := max_{i∈[m]} ‖x_i‖,
B := min{‖w‖ : ∀i ≤ m, y_i〈w, x_i〉 ≥ 1}.

The Perceptron performs at most (RB)² updates.

Part (1) is trivial: Perceptron never stops unless this holds.

γ(w) = 1/(R · ‖w‖).

Let γ_S be the largest margin achievable on S.

Then γ_S = 1/(RB).

Number of updates ∝ 1/γ_S².


The Perceptron: Proving the theorem

We will show that if the Perceptron runs for at least T iterations, then T ≤ (RB)². So the total number of iterations is at most (RB)².

Let w∗ be such that ∀i ≤ m, y_i〈w∗, x_i〉 ≥ 1 and ‖w∗‖ = B.

We will keep track of two quantities: ‖w^(t)‖ and 〈w∗, w^(t)〉.

We will show that the norm grows slowly, while the inner product grows fast.

More precisely:
- ‖w^(t+1)‖ ≤ √t · R
- 〈w∗, w^(t+1)〉 ≥ t.

Recall that a larger 〈w∗, w^(t+1)〉 / (‖w∗‖ · ‖w^(t+1)‖) =⇒ a smaller angle between w∗ and w^(t+1).

Reminder: Cauchy-Schwarz inequality. For all u, v ∈ R^d,

〈u, v〉 ≤ ‖u‖ · ‖v‖.

So we will conclude:

T ≤ 〈w∗, w^(T+1)〉 ≤ ‖w∗‖ · ‖w^(T+1)‖ ≤ B · √T · R  =⇒  T ≤ (RB)².


The Perceptron: Proving the theorem

Upper bounding the norm ‖w^(T+1)‖:

In iteration t, let i be the example that was used to update w^(t).

Recall the Perceptron update: w^(t+1) ← w^(t) + y_i x_i.

We have y_i〈w^(t), x_i〉 ≤ 0. So

‖w^(t+1)‖² = ‖w^(t) + y_i x_i‖²
           = ‖w^(t)‖² + 2 y_i〈w^(t), x_i〉 + y_i² ‖x_i‖²
           ≤ ‖w^(t)‖² + R².

‖w^(1)‖² = 0.

By induction, ‖w^(T+1)‖² ≤ T·R².

So ‖w^(T+1)‖ ≤ √T · R.


The Perceptron: Proving the theorem

Lower bounding the inner product 〈w∗, w^(T+1)〉:

w^(1) = (0, . . . , 0)  =⇒  〈w∗, w^(1)〉 = 0.

Recall the Perceptron update: w^(t+1) ← w^(t) + y_i x_i.

In each iteration, 〈w∗, w^(t)〉 is increased by at least 1:

〈w∗, w^(t+1)〉 − 〈w∗, w^(t)〉 = 〈w∗, w^(t+1) − w^(t)〉
                             = 〈w∗, y_i x_i〉
                             = y_i〈w∗, x_i〉
                             ≥ 1    (from the definition of w∗).

After T iterations:

〈w∗, w^(T+1)〉 = Σ_{t=1}^T (〈w∗, w^(t+1)〉 − 〈w∗, w^(t)〉) ≥ T.

This means that our w gets “closer” to w∗ at each iteration.


The Perceptron: Proving the theorem

We showed:

- ‖w^(T+1)‖ ≤ √T · R
- 〈w∗, w^(T+1)〉 ≥ T.

Using Cauchy-Schwarz:

T ≤ 〈w∗, w^(T+1)〉 ≤ ‖w∗‖ · ‖w^(T+1)‖ ≤ B · √T · R.

Conclusion: T ≤ (RB)².

We showed:

The Perceptron runs for at most (RB)² iterations.

When it stops, the separator it returns separates the examples in S .

With γ_S := the best possible margin on S, we have γ_S = 1/(RB).

So, the number of iterations is O(1/γ_S²).
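A small numeric check of the two quantities tracked in the proof, using the toy data from the Perceptron sketch above and a feasible w∗ chosen by hand (all illustrative; since this w∗ need not have minimal norm, the printed bound (R·‖w∗‖)² is at least (RB)²):

```python
import numpy as np

X = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 1.0], [1.0, -1.0]])
y = np.array([1, 1, -1, -1])
w_star = np.array([-1.0, 1.0])                # satisfies y_i <w_star, x_i> >= 1
assert np.all(y * (X @ w_star) >= 1)

R = np.max(np.linalg.norm(X, axis=1))
w, t = np.zeros(2), 0
while np.any(y * (X @ w) <= 0):
    i = np.flatnonzero(y * (X @ w) <= 0)[0]
    w = w + y[i] * X[i]                       # Perceptron update
    t += 1
    assert np.linalg.norm(w) <= np.sqrt(t) * R + 1e-9   # ||w^(t+1)|| <= sqrt(t) R
    assert np.dot(w_star, w) >= t - 1e-9                # <w*, w^(t+1)> >= t

print(t, (R * np.linalg.norm(w_star)) ** 2)   # updates T and the bound (R ||w*||)^2
```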


Perceptron properties

Processes one example at a time: low working memory.

Number of updates depends on the margin.

If the margin is very small, the Perceptron might take Ω(2^d) time to converge.

In practice, in many natural problems, the margin is large and the Perceptron is faster than LP.

What if the training sample is not separable?
- LP will completely fail.
- The Perceptron can still run, but will not terminate on its own.
- There is no guarantee for the Perceptron in this case.


Linear predictors: Intermediate summary

Linear predictors are very popular, because:
- If the sample size is Θ(d) (e.g., 10 times the dimension), the training error and the true error will probably be similar.
- For many natural problems, there are linear predictors with low error.

Computing the ERM for a linear predictor is NP-hard.

But in the realizable/separable case, there are efficient algorithms:
- Using linear programming;
- The Batch Perceptron algorithm, if the margin is not too small.

Next: What to do when the problem is not separable.
