Lecture 4: Linear predictors and the Perceptron
Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU) Lecture 4 1 / 34
Inductive Bias
Inductive bias is critical to prevent overfitting.
Inductive bias =⇒ a relatively “simple” hypothesis class H.
What if we don’t know which H is suitable for our learning problem?
- Choose a good representation (relevant features).
- Use a general-purpose hypothesis class.
One of the most popular choices: Linear predictors.
Linear predictors
Recall that in many learning problems X = R^d.
Each example is a vector with d coordinates (features).
In binary classification problems, Y consists of two labels.
In linear prediction, H is the class of all linear separators.
Illustration with d = 2:
Why restrict to linear predictors?
If the algorithm could choose any separation, we could get overfitting:
labels of unseen examples would not be predicted correctly.
- E.g., if using a squiggly line.
Preventing overfitting
Linear predictors prevent overfitting:
- In dimension d, if the training sample size is Θ(d), then training error ≈ true prediction error.
Dimension d ≡ number of features.
So, if we use linear predictors with enough training samples, does this guarantee low prediction error?
Recall the No-Free-Lunch theorem...
Preventing overfitting is not enough
No overfitting: error on training sample ≈ true prediction error on distribution.
We can still have high error on the training sample.
Preventing overfitting is not enough
Recall the decomposition of the prediction error err(h_S, D):
- Approximation error: err_app := inf_{h∈H} err(h, D).
- Estimation error: err_est := err(h_S, D) − inf_{h∈H} err(h, D).
No overfitting ≡ estimation error is low.
But we also need low approximation error: if all linear predictors have high error, no sample size will work.
In practice, linear predictors often do have low approximation error:
- If “good” features are chosen to represent the examples!
Formalizing linear predictors
To formalize linear predictors we will use inner products.
Definition: For vectors x, z ∈ R^d, 〈x, z〉 := Σ_{i=1}^{d} x(i)·z(i).
The length of a vector x ∈ R^d can be defined by its inner product:
‖x‖ = √(Σ_{i=1}^{d} x(i)²) = √〈x, x〉.
The angle between two vectors is defined by their inner product:
cos(θ) = 〈x, z〉 / (‖x‖ · ‖z‖).
(A large value =⇒ a small angle.)
Inner products are commutative: 〈x, z〉 = 〈z, x〉.
Inner products are linear: if a ∈ R and x, x′, z ∈ R^d,
〈a · x, z〉 = a · 〈x, z〉,   〈x + x′, z〉 = 〈x, z〉 + 〈x′, z〉.
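These definitions are easy to check numerically. A minimal sketch using NumPy (an illustration, not part of the lecture; the vectors are arbitrary):

```python
import numpy as np

x = np.array([3.0, 4.0])
z = np.array([4.0, 3.0])

inner = np.dot(x, z)                         # <x, z> = 3*4 + 4*3 = 24
norm_x = np.sqrt(np.dot(x, x))               # ||x|| = sqrt(<x, x>) = 5
cos_theta = inner / (np.linalg.norm(x) * np.linalg.norm(z))

assert inner == 24.0 and norm_x == 5.0
# commutativity and linearity
assert np.isclose(np.dot(x, z), np.dot(z, x))
assert np.isclose(np.dot(2.0 * x, z), 2.0 * np.dot(x, z))
assert np.isclose(cos_theta, 24.0 / 25.0)    # large value => small angle
```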
Formalizing linear predictors
Call the labels Y = {−1, +1}.
In 2 dimensions: the line is a · x(1) + b = x(2).
The linear prediction rule is:
y = +1 if a · x(1) + b ≥ x(2),   −1 if a · x(1) + b < x(2).
Formalizing linear predictors
In two dimensions:
y = +1 if a · x(1) + b ≥ x(2),   −1 if a · x(1) + b < x(2).
Define a vector w = (w(1), w(2)) = (a, −1).
Can rewrite this as: y = sign(w(1) · x(1) + w(2) · x(2) + b) = sign(〈w, x〉 + b).
For a vector w ∈ R^d and a number b ∈ R, define the linear predictor h_{w,b}:
∀x ∈ R^d, h_{w,b}(x) := sign(〈w, x〉 + b).
b is called the bias of the predictor.
Hypothesis class of all linear predictors in dimension d:
H_L^d := {h_{w,b} | w ∈ R^d, b ∈ R}.
Formalizing linear predictors
h_{w,b}(x) := sign(〈w, x〉 + b),   H_L^d := {h_{w,b} | w ∈ R^d, b ∈ R}.
In 3 dimensions, the linear boundary 〈w, x〉 + b = 0 is a plane.
In higher dimensions, it is a hyperplane.
The vector w is the normal to the hyperplane.
|b|/‖w‖ is the distance from the origin to the hyperplane.
[figure: a hyperplane with normal vector w, at distance |b|/‖w‖ from the origin]
The bias b is not needed
Suppose we have a classification problem with X = R^d.
For every example x ∈ R^d, define x′ ∈ R^{d+1}:
x = (x(1), . . . , x(d)) =⇒ x′ := (x(1), . . . , x(d), 1).
For every linear predictor with a bias, h_{w,b} on R^d, define a linear predictor h_{w′} without a bias on R^{d+1}:
w′ := (w(1), . . . , w(d), b).
We get h_{w,b}(x) = h_{w′}(x′) for all x, w, b:
h_{w′}(x′) = sign(〈x′, w′〉) = sign(〈x, w〉 + b) = h_{w,b}(x).
Conclusion: by adding a coordinate which is always 1, we can discard the bias term.
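As a quick sanity check, this construction can be sketched in NumPy (illustrative code, not from the slides; the predictor (w, b) and the examples are arbitrary):

```python
import numpy as np

def add_one_coordinate(X):
    """Append a constant-1 coordinate to every example: x -> x' = (x, 1)."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

w, b = np.array([2.0, -1.0]), 0.5            # arbitrary predictor with a bias
X = np.array([[1.0, 3.0], [4.0, 2.0]])       # two examples in R^2

X1 = add_one_coordinate(X)                   # examples in R^3
w1 = np.append(w, b)                         # homogeneous weight vector w' = (w, b)

# h_{w,b}(x) = sign(<w,x> + b) agrees with h_{w'}(x') = sign(<w',x'>)
assert np.array_equal(np.sign(X @ w + b), np.sign(X1 @ w1))
```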
Removing the bias term
From 1 dimension with a bias:
To two dimensions without a bias:
Linear predictors without a bias are called homogeneous.
Linear predictors are also called halfspaces.
Implementing the ERM for linear predictors
Implementing ERM: find a linear predictor h_w with minimal empirical error:
err(h_w, S) ≡ (1/m) · |{i | sign(〈x_i, w〉) ≠ y_i}|.
This problem is NP-hard.
There are workarounds (later in the course).
Today: an efficient algorithm if the problem is realizable.
Definition
D is realizable by H if there exists some h∗ ∈ H such that err(h∗,D) = 0.
ERM in the realizable case
Definition
D is realizable by H if there exists some h∗ ∈ H such that err(h∗,D) = 0.
For any x_i in the training sample, y_i = h∗(x_i).
So, min_{h∈H} err(h, S) ≤ err(h∗, S) = 0.
ERM in the realizable case: find some h ∈ H such that err(h, S) = 0.
For linear predictors: find h_{w,b} that separates the positive and negative labels in the training sample.
- Can be done efficiently.
- We will see two efficient methods.
For linear predictors: realizable ≡ separable.
ERM for separable linear predictors: Linear Programming
A linear program (LP) is a problem of the following form:
maximize_{w∈R^d} 〈u, w〉
subject to Aw ≥ v.
w ∈ R^d: a vector we wish to find.
u ∈ R^d, v ∈ R^m, A ∈ R^{m×d}.
The values of u, v, A define the specific linear program.
LPs can be solved efficiently.
Many solvers are available.
In Matlab: w = linprog(-u,-A,-v).
ERM for separable linear predictors: Linear Programming
Linear Program
maximize_{w∈R^d} 〈u, w〉
subject to Aw ≥ v
ERM for the separable case: find a linear predictor with zero error on the training sample (x_i, y_i)_{i≤m}.
Recall y_i ∈ {−1, +1}. Our goal:
Find w ∈ R^d s.t. ∀i ≤ m, sign(〈x_i, w〉) = y_i.
This is equivalent to:
Find w ∈ R^d s.t. ∀i ≤ m, y_i〈x_i, w〉 > 0.
Problem: in the linear program we have a weak inequality ≥, not a strict >.
If we use ≥ here, w = 0 trivially satisfies the constraints.
ERM for separable linear predictors: Linear Programming
Linear Program
maximize_{w∈R^d} 〈u, w〉
subject to Aw ≥ v
Our goal: find w ∈ R^d s.t. ∀i ≤ m, y_i〈x_i, w〉 > 0.
Need to change the strict inequality to a weak one.
If the problem is separable, there exists a solution. Name one of the solutions w∗.
Denote γ := min_i y_i〈x_i, w∗〉. Note γ > 0.
Define w̄ := w∗/γ.
For all i ≤ m, y_i〈x_i, w̄〉 = y_i〈x_i, w∗〉/γ ≥ 1.
Conclusion: there is a predictor w ∈ R^d such that ∀i ≤ m, y_i〈x_i, w〉 ≥ 1.
Also, any predictor that satisfies this is a good solution.
ERM for separable linear predictors: Linear Programming
Linear Program
maximize_{w∈R^d} 〈u, w〉
subject to Aw ≥ v
Our goal can be re-written as:
Find w ∈ R^d s.t. ∀i ≤ m, y_i〈x_i, w〉 ≥ 1.
Turn this into the form of a linear program:
- u = (0, . . . , 0) (nothing is maximized), v = (1, . . . , 1).
- Row i of the matrix A is y_i x_i ≡ (y_i · x_i(1), . . . , y_i · x_i(d)).
- The linear program: maximize_{w∈R^d} 0 subject to Aw ≥ (1, . . . , 1)ᵀ.
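The slides use Matlab’s linprog; as a sketch, the same feasibility problem can be posed to SciPy’s linprog, which minimizes c·w subject to A_ub·w ≤ b_ub and bounds w ≥ 0 by default, so the constraints are negated and the bounds relaxed. The toy sample here is an assumption for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# A separable toy sample in R^2
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

# Feasibility LP: find w with y_i <x_i, w> >= 1, i.e. (-y_i x_i) @ w <= -1.
res = linprog(c=np.zeros(d),                 # u = 0: nothing is maximized
              A_ub=-(y[:, None] * X),        # row i is -y_i x_i
              b_ub=-np.ones(m),
              bounds=[(None, None)] * d)     # w is unconstrained in sign

assert res.success
w = res.x
assert np.all(y * (X @ w) >= 1 - 1e-9)       # zero error on the sample
```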
ERM for separable linear predictors: Linear Programming
The LP approach is very easy to implement.
But it can be slow.
And it fails completely if there is even one bad label.
The Perceptron
The Perceptron algorithm was invented in 1958 by Rosenblatt.
This version is called the Batch Perceptron.
Goal remains as before: find a linear predictor with zero error on the training sample (x_i, y_i)_{i≤m}.
Perceptron idea:
- Work in rounds.
- Start with a default predictor.
- In each round, look at a single training example.
- If the current predictor is wrong on this example, move the predictor in the right direction.
- Stop when the predictor assigns the correct label to all examples.
The Perceptron
Batch Perceptron
input: a training sample S = (x_1, y_1), . . . , (x_m, y_m)
output: w ∈ R^d such that ∀i ≤ m, h_w(x_i) = y_i
1: w^(1) ← (0, . . . , 0); t ← 1
2: while ∃i s.t. y_i〈w^(t), x_i〉 ≤ 0 do
3:   w^(t+1) ← w^(t) + y_i x_i
4:   t ← t + 1
5: end while
6: return w^(t)
Why does the update rule make sense?
y_i〈w^(t+1), x_i〉 = y_i〈w^(t) + y_i x_i, x_i〉 = y_i〈w^(t), x_i〉 + y_i²〈x_i, x_i〉 = y_i〈w^(t), x_i〉 + ‖x_i‖².
Each update moves y_i〈w^(t), x_i〉 closer to being positive.
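The pseudocode translates almost line by line to NumPy. A minimal sketch (the max_updates cap is an addition, since the loop never terminates on non-separable data, and the toy sample is illustrative):

```python
import numpy as np

def batch_perceptron(X, y, max_updates=10_000):
    """Batch Perceptron: returns w with sign(<w, x_i>) = y_i
    on a separable sample (homogeneous predictors, no bias)."""
    m, d = X.shape
    w = np.zeros(d)                           # w^(1) = (0, ..., 0)
    for _ in range(max_updates):              # safety cap (not in the slides)
        mistakes = np.flatnonzero(y * (X @ w) <= 0)
        if mistakes.size == 0:
            return w                          # all examples correctly labeled
        i = mistakes[0]                       # any mistaken example will do
        w = w + y[i] * X[i]                   # the update rule
    raise RuntimeError("no separator found within max_updates")

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = batch_perceptron(X, y)
assert np.all(y * (X @ w) > 0)
```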
The Perceptron
Illustration in two-dimensional space (d = 2): on the board.
- The separator tilts in the right direction in each update.
- The same example can be repeated several times.
Does this always work?
How many updates does it take to get an error-free separator?
The separation margin
Intuitively, separation is “easier” if positive and negative points are far apart.
“Far apart” ≡ there is a separator which is far from all points.
[figure: a separator w and the distance from a point to it]
We will show that the Perceptron is indeed faster in this case.
First, let’s make this formal.
Claim
|〈w, x〉| / ‖w‖ is the distance between x ∈ R^d and the separator defined by w.
The separation margin
Claim: the distance between x and the hyperplane defined by w is |〈w, x〉| / ‖w‖.
Define ŵ := w/‖w‖. Then ‖ŵ‖ = √〈ŵ, ŵ〉 = 1.
The hyperplane is H = {v | 〈v, w〉 = 0} = {v | 〈v, ŵ〉 = 0}.
The distance between the hyperplane and x is ∆ := min_{v∈H} ‖x − v‖.
Take v = x − 〈ŵ, x〉 · ŵ. Then v ∈ H, because
〈v, ŵ〉 = 〈x, ŵ〉 − 〈(〈ŵ, x〉 · ŵ), ŵ〉 = 〈x, ŵ〉 − 〈ŵ, x〉〈ŵ, ŵ〉 = 0.
The distance is at most ‖x − v‖:
∆ ≤ ‖x − v‖ = ‖〈ŵ, x〉 · ŵ‖ = |〈ŵ, x〉| · ‖ŵ‖ = |〈ŵ, x〉| = |〈w, x〉| / ‖w‖.
Also, for any u ∈ H:
‖x − u‖² = ‖(x − v) + (v − u)‖² = 〈(x − v) + (v − u), (x − v) + (v − u)〉
= ‖x − v‖² + 2〈x − v, v − u〉 + ‖v − u‖² ≥ |〈w, x〉|²/‖w‖² + 2〈x − v, v − u〉.
And 〈x − v, v − u〉 = 〈ŵ, x〉〈ŵ, v − u〉 = 0, so ∆ ≥ |〈w, x〉| / ‖w‖ as well, proving the claim.
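The claim and the closest point v constructed in the proof can be sanity-checked numerically (a NumPy sketch with arbitrary vectors, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x = rng.normal(size=3)

# Claimed distance from x to the hyperplane {v : <w, v> = 0}
dist = abs(np.dot(w, x)) / np.linalg.norm(w)

# The closest point from the proof: v = x - <w_hat, x> * w_hat
w_hat = w / np.linalg.norm(w)
v = x - np.dot(w_hat, x) * w_hat

assert abs(np.dot(w, v)) < 1e-12                   # v lies on the hyperplane
assert np.isclose(np.linalg.norm(x - v), dist)     # and realizes the distance
```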
The separation margin
Denote R := max_i ‖x_i‖. We will normalize by this value.
The minimal normalized distance of any x_i in S from w is called the margin of w:
γ(w) := (1/R) · min_{i≤m} |〈w, x_i〉| / ‖w‖.
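The margin is straightforward to compute. A small NumPy sketch (the sample and separator are assumptions for illustration):

```python
import numpy as np

def margin(w, X):
    """gamma(w) = (1/R) * min_i |<w, x_i>| / ||w||, with R = max_i ||x_i||."""
    R = np.max(np.linalg.norm(X, axis=1))
    return np.min(np.abs(X @ w)) / (R * np.linalg.norm(w))

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
w = np.array([1.0, 1.0])
# every |<w, x_i>| = 3, ||w|| = sqrt(2), R = sqrt(5)
assert np.isclose(margin(w, X), 3 / np.sqrt(10))
```

Note that margin(a * w, X) equals margin(w, X) for any a > 0, matching the scale-invariance discussed next.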
The separation margin
Which separators have a large margin?
For any α > 0 and w ∈ R^d, α · w defines the same separator, with the same margin.
We can look at separators w such that min_{i≤m} y_i〈w, x_i〉 = 1.
This does not lose generality!
Then γ(w) = 1/(R · ‖w‖)   (R := max_i ‖x_i‖).
Small norm ‖w‖ and small R =⇒ large margin γ(w).
The Perceptron: Guarantee
Theorem (Theorem 9.1 in course book)
Assume that S = ((x_1, y_1), . . . , (x_m, y_m)) is separable. Then:
1. When the Perceptron stops and returns w^(t), ∀i ≤ m, y_i〈w^(t), x_i〉 > 0.
2. Define R := max_{i∈[m]} ‖x_i‖ and B := min{‖w‖ | ∀i ≤ m, y_i〈w, x_i〉 ≥ 1}.
The Perceptron performs at most (RB)² updates.
Part (1) is trivial: the Perceptron never stops unless this holds.
Recall that for a normalized separator, γ(w) = 1/(R · ‖w‖).
Let γ_S be the largest margin achievable on S. Then γ_S = 1/(RB).
Number of updates ∝ 1/γ_S².
The Perceptron: Proving the theorem
We will show that if the Perceptron runs for at least T iterations, then T ≤ (RB)². So the total number of iterations is at most (RB)².
Let w∗ be such that ∀i ≤ m, y_i〈w∗, x_i〉 ≥ 1 and ‖w∗‖ = B.
We will keep track of two quantities: ‖w^(t)‖ and 〈w∗, w^(t)〉.
We will show that the norm grows slowly, while the inner product grows fast.
More precisely:
- ‖w^(t+1)‖ ≤ √t · R
- 〈w∗, w^(t+1)〉 ≥ t.
Recall that a larger 〈w∗, w^(t+1)〉 / (‖w∗‖ · ‖w^(t+1)‖) =⇒ a smaller angle between w∗ and w^(t+1).
Reminder: Cauchy–Schwarz inequality. For all u, v ∈ R^d,
〈u, v〉 ≤ ‖u‖ · ‖v‖.
So we will conclude:
T ≤ 〈w∗, w^(T+1)〉 ≤ ‖w∗‖ · ‖w^(T+1)‖ ≤ B · √T · R =⇒ T ≤ (RB)².
The Perceptron: Proving the theorem
Upper bounding the norm ‖w^(T+1)‖:
In iteration t, let i be the example that was used to update w^(t).
Recall the Perceptron update: w^(t+1) ← w^(t) + y_i x_i.
We have y_i〈w^(t), x_i〉 ≤ 0. So
‖w^(t+1)‖² = ‖w^(t) + y_i x_i‖²
= ‖w^(t)‖² + 2y_i〈w^(t), x_i〉 + y_i²‖x_i‖²
≤ ‖w^(t)‖² + R².
‖w^(1)‖² = 0.
By induction, ‖w^(T+1)‖² ≤ T · R².
So ‖w^(T+1)‖ ≤ √T · R.
The Perceptron: Proving the theorem
Lower bounding the inner product 〈w∗, w^(T+1)〉:
w^(1) = (0, . . . , 0) =⇒ 〈w∗, w^(1)〉 = 0.
Recall the Perceptron update: w^(t+1) ← w^(t) + y_i x_i.
In each iteration, 〈w∗, w^(t)〉 increases by at least one:
〈w∗, w^(t+1)〉 − 〈w∗, w^(t)〉 = 〈w∗, w^(t+1) − w^(t)〉 = 〈w∗, y_i x_i〉 = y_i〈w∗, x_i〉 ≥ 1
(the last step from the definition of w∗).
After T iterations, telescoping gives:
〈w∗, w^(T+1)〉 = Σ_{t=1}^{T} (〈w∗, w^(t+1)〉 − 〈w∗, w^(t)〉) ≥ T.
This means that our w gets “closer” to w∗ at each iteration.
The Perceptron: Proving the theorem
We showed:
- ‖w^(T+1)‖ ≤ √T · R
- 〈w∗, w^(T+1)〉 ≥ T.
Using Cauchy–Schwarz:
T ≤ 〈w∗, w^(T+1)〉 ≤ ‖w∗‖ · ‖w^(T+1)‖ ≤ B · √T · R.
Conclusion: T ≤ (RB)².
We showed
The Perceptron runs for at most (RB)² iterations.
When it stops, the separator it returns separates the examples in S.
Let γ_S := the best possible margin on S; then γ_S = 1/(RB).
So, the number of iterations is O(1/γ_S²).
Perceptron properties
Processes one example at a time: low working memory.
Number of updates depends on the margin.
If the margin is very small, the Perceptron might take Ω(2^d) time to converge.
In practice, in many natural problems, the margin is large and the Perceptron is faster than LP.
What if the training sample is not separable?
- LP will completely fail.
- The Perceptron can still run, but will not terminate on its own.
- No guarantee for the Perceptron in this case.
Linear predictors: Intermediate summary
Linear predictors are very popular, because:
- If the sample size is Θ(d) (e.g., 10 times the dimension d), the training error and the true error will probably be similar.
- For many natural problems, there are linear predictors with low error.
Computing the ERM for a linear predictor is NP-hard.
But in the realizable/separable case, there are efficient algorithms:
- Using linear programming;
- The Batch Perceptron algorithm, if the margin is not too small.
Next: What to do when the problem is not separable.