Lecture 4: Linear predictors and the Perceptron
Introduction to Learning and Analysis of Big Data
Kontorovich and Sabato (BGU) Lecture 4 1 / 34
Inductive Bias
Inductive bias is critical to prevent overfitting.
Inductive bias =⇒ a relatively “simple” hypothesis class H.
What if we don’t know which H is suitable for our learning problem?
- Choose a good representation (relevant features).
- Use a general-purpose hypothesis class.
One of the most popular choices: Linear predictors.
Linear predictors
Recall that in many learning problems X = R^d.
Each example is a vector with d coordinates (features).
In binary classification problems, Y consists of two labels.
In linear prediction, H is the class of all linear separators.
Illustration with d = 2:
Why restrict to linear predictors?
If the algorithm could choose any separation, we could get overfitting:
labels of unseen examples would not be predicted correctly.
- E.g., if using a squiggly line.
Preventing overfitting
Linear predictors prevent overfitting:
- In dimension d, if the training sample size is Θ(d), then training error ≈ true prediction error.
Dimension d ≡ number of features.
So, if we use linear predictors with enough training samples, does this guarantee low prediction error?
Recall the No-Free-Lunch theorem...
Preventing overfitting is not enough
No overfitting: error on training sample ≈ true prediction error on distribution.
We can still have high error on the training sample.
Preventing overfitting is not enough
Recall the decomposition of the prediction error err(h_S, D):
- Approximation error: err_app := inf_{h∈H} err(h, D).
- Estimation error: err_est := err(h_S, D) − inf_{h∈H} err(h, D).
No overfitting ≡ estimation error is low.
But we also need low approximation error: if all linear predictors have high error, no sample size will work.
In practice, linear predictors often do have low approximation error:
- If “good” features are chosen to represent the examples!
Formalizing linear predictors
To formalize linear predictors we will use inner products.
Definition: For vectors x, z ∈ R^d, 〈x, z〉 := Σ_{i=1}^{d} x(i)·z(i).
The length of a vector x ∈ R^d can be defined by its inner product:
‖x‖ = √(Σ_{i=1}^{d} x(i)²) = √〈x, x〉.
The angle between two vectors is defined by their inner product:
cos(θ) = 〈x, z〉 / (‖x‖ · ‖z‖).
(A large value =⇒ a small angle.)
Inner products are commutative: 〈x, z〉 = 〈z, x〉.
Inner products are linear: if a ∈ R and x, x′, z ∈ R^d,
〈a · x, z〉 = a · 〈x, z〉,   〈x + x′, z〉 = 〈x, z〉 + 〈x′, z〉.
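These definitions are easy to check numerically. A minimal sketch using NumPy (an illustration, not part of the lecture; the vectors are arbitrary):

```python
import numpy as np

x = np.array([3.0, 4.0])
z = np.array([4.0, 3.0])

inner = np.dot(x, z)                         # <x, z> = 3*4 + 4*3 = 24
norm_x = np.sqrt(np.dot(x, x))               # ||x|| = sqrt(<x, x>) = 5
cos_theta = inner / (np.linalg.norm(x) * np.linalg.norm(z))

assert inner == 24.0 and norm_x == 5.0
# commutativity and linearity
assert np.isclose(np.dot(x, z), np.dot(z, x))
assert np.isclose(np.dot(2.0 * x, z), 2.0 * np.dot(x, z))
assert np.isclose(cos_theta, 24.0 / 25.0)    # large value => small angle
```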
Formalizing linear predictors
Call the labels Y = {−1, +1}.
In 2 dimensions: the line is a · x(1) + b = x(2).
The linear prediction rule is:
y = +1 if a · x(1) + b ≥ x(2),   −1 if a · x(1) + b < x(2).
Formalizing linear predictors
In two dimensions:
y = +1 if a · x(1) + b ≥ x(2),   −1 if a · x(1) + b < x(2).
Define a vector w = (w(1), w(2)) = (a, −1).
Can rewrite this as: y = sign(w(1) · x(1) + w(2) · x(2) + b) = sign(〈w, x〉 + b).
For a vector w ∈ R^d and a number b ∈ R, define the linear predictor h_{w,b}:
∀x ∈ R^d, h_{w,b}(x) := sign(〈w, x〉 + b).
b is called the bias of the predictor.
Hypothesis class of all linear predictors in dimension d:
H_L^d := {h_{w,b} | w ∈ R^d, b ∈ R}.
Formalizing linear predictors
h_{w,b}(x) := sign(〈w, x〉 + b),   H_L^d := {h_{w,b} | w ∈ R^d, b ∈ R}.
In 3 dimensions, the linear boundary 〈w, x〉 + b = 0 is a plane.
In higher dimensions, it is a hyperplane.
The vector w is the normal to the hyperplane.
|b|/‖w‖ is the distance from the origin to the hyperplane.
[figure: a hyperplane with normal vector w, at distance |b|/‖w‖ from the origin]
The bias b is not needed
Suppose we have a classification problem with X = R^d.
For every example x ∈ R^d, define x′ ∈ R^{d+1}:
x = (x(1), . . . , x(d)) =⇒ x′ := (x(1), . . . , x(d), 1).
For every linear predictor with a bias, h_{w,b} on R^d, define a linear predictor h_{w′} without a bias on R^{d+1}:
w′ := (w(1), . . . , w(d), b).
We get h_{w,b}(x) = h_{w′}(x′) for all x, w, b:
h_{w′}(x′) = sign(〈x′, w′〉) = sign(〈x, w〉 + b) = h_{w,b}(x).
Conclusion: by adding a coordinate which is always 1, we can discard the bias term.
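As a quick sanity check, this construction can be sketched in NumPy (illustrative code, not from the slides; the predictor (w, b) and the examples are arbitrary):

```python
import numpy as np

def add_one_coordinate(X):
    """Append a constant-1 coordinate to every example: x -> x' = (x, 1)."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

w, b = np.array([2.0, -1.0]), 0.5            # arbitrary predictor with a bias
X = np.array([[1.0, 3.0], [4.0, 2.0]])       # two examples in R^2

X1 = add_one_coordinate(X)                   # examples in R^3
w1 = np.append(w, b)                         # homogeneous weight vector w' = (w, b)

# h_{w,b}(x) = sign(<w,x> + b) agrees with h_{w'}(x') = sign(<w',x'>)
assert np.array_equal(np.sign(X @ w + b), np.sign(X1 @ w1))
```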
Removing the bias term
From 1 dimension with a bias:
To two dimensions without a bias:
Linear predictors without a bias are called homogeneous.
Linear predictors are also called halfspaces.
Implementing the ERM for linear predictors
Implementing ERM: find a linear predictor h_w with minimal empirical error:
err(h_w, S) ≡ (1/m) · |{i | sign(〈x_i, w〉) ≠ y_i}|.
This problem is NP-hard.
There are workarounds (later in the course).
Today: an efficient algorithm if the problem is realizable.
Definition
D is realizable by H if there exists some h∗ ∈ H such that err(h∗,D) = 0.
ERM in the realizable case
Definition
D is realizable by H if there exists some h∗ ∈ H such that err(h∗,D) = 0.
For any x_i in the training sample, y_i = h∗(x_i).
So, min_{h∈H} err(h, S) ≤ err(h∗, S) = 0.
ERM in the realizable case: find some h ∈ H such that err(h, S) = 0.
For linear predictors: find h_{w,b} that separates the positive and negative labels in the training sample.
- Can be done efficiently.
- We will see two efficient methods.
For linear predictors: realizable ≡ separable.
ERM for separable linear predictors: Linear Programming
A linear program (LP) is a problem of the following form:
maximize_{w∈R^d} 〈u, w〉
subject to Aw ≥ v.
w ∈ R^d: a vector we wish to find.
u ∈ R^d, v ∈ R^m, A ∈ R^{m×d}.
The values of u, v, A define the specific linear program.
LPs can be solved efficiently.
Many solvers are available.
In Matlab: w = linprog(-u,-A,-v).
ERM for separable linear predictors: Linear Programming
Linear Program
maximize_{w∈R^d} 〈u, w〉
subject to Aw ≥ v
ERM for the separable case: find a linear predictor with zero error on the training sample (x_i, y_i)_{i≤m}.
Recall y_i ∈ {−1, +1}. Our goal:
Find w ∈ R^d s.t. ∀i ≤ m, sign(〈x_i, w〉) = y_i.
This is equivalent to:
Find w ∈ R^d s.t. ∀i ≤ m, y_i〈x_i, w〉 > 0.
Problem: in the linear program we have a weak inequality ≥, not a strict >.
If we use ≥ here, w = 0 trivially satisfies the constraints.
ERM for separable linear predictors: Linear Programming
Linear Program
maximize_{w∈R^d} 〈u, w〉
subject to Aw ≥ v
Our goal: find w ∈ R^d s.t. ∀i ≤ m, y_i〈x_i, w〉 > 0.
Need to change the strict inequality to a weak one.
If the problem is separable, there exists a solution. Name one of the solutions w∗.
Denote γ := min_i y_i〈x_i, w∗〉. Note γ > 0.
Define w̄ := w∗/γ.
For all i ≤ m, y_i〈x_i, w̄〉 = y_i〈x_i, w∗〉/γ ≥ 1.
Conclusion: there is a predictor w ∈ R^d such that ∀i ≤ m, y_i〈x_i, w〉 ≥ 1.
Also, any predictor that satisfies this is a good solution.
ERM for separable linear predictors: Linear Programming
Linear Program
maximize_{w∈R^d} 〈u, w〉
subject to Aw ≥ v
Our goal can be re-written as:
Find w ∈ R^d s.t. ∀i ≤ m, y_i〈x_i, w〉 ≥ 1.
Turn this into the form of a linear program:
- u = (0, . . . , 0) (nothing is maximized), v = (1, . . . , 1).
- Row i of the matrix A is y_i x_i ≡ (y_i · x_i(1), . . . , y_i · x_i(d)).
- The linear program: maximize_{w∈R^d} 0 subject to Aw ≥ (1, . . . , 1)ᵀ.
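The slides use Matlab’s linprog; as a sketch, the same feasibility problem can be posed to SciPy’s linprog, which minimizes c·w subject to A_ub·w ≤ b_ub and bounds w ≥ 0 by default, so the constraints are negated and the bounds relaxed. The toy sample here is an assumption for illustration:

```python
import numpy as np
from scipy.optimize import linprog

# A separable toy sample in R^2
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, d = X.shape

# Feasibility LP: find w with y_i <x_i, w> >= 1, i.e. (-y_i x_i) @ w <= -1.
res = linprog(c=np.zeros(d),                 # u = 0: nothing is maximized
              A_ub=-(y[:, None] * X),        # row i is -y_i x_i
              b_ub=-np.ones(m),
              bounds=[(None, None)] * d)     # w is unconstrained in sign

assert res.success
w = res.x
assert np.all(y * (X @ w) >= 1 - 1e-9)       # zero error on the sample
```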
ERM for separable linear predictors: Linear Programming
The LP approach is very easy to implement.
But it can be slow.
And it fails completely if there is even one bad label.
The Perceptron
The Perceptron algorithm was invented in 1958 by Rosenblatt.
This version is called the Batch Perceptron.
Goal remains as before: find a linear predictor with zero error on the training sample (x_i, y_i)_{i≤m}.
Perceptron idea:
- Work in rounds.
- Start with a default predictor.
- In each round, look at a single training example.
- If the current predictor is wrong on this example, move the predictor in the right direction.
- Stop when the predictor assigns the correct label to all examples.
The Perceptron
Batch Perceptron
input: a training sample S = (x_1, y_1), . . . , (x_m, y_m)
output: w ∈ R^d such that ∀i ≤ m, h_w(x_i) = y_i
1: w^(1) ← (0, . . . , 0); t ← 1
2: while ∃i s.t. y_i〈w^(t), x_i〉 ≤ 0 do
3:   w^(t+1) ← w^(t) + y_i x_i
4:   t ← t + 1
5: end while
6: return w^(t)
Why does the update rule make sense?
y_i〈w^(t+1), x_i〉 = y_i〈w^(t) + y_i x_i, x_i〉 = y_i〈w^(t), x_i〉 + y_i²〈x_i, x_i〉 = y_i〈w^(t), x_i〉 + ‖x_i‖².
Each update moves y_i〈w^(t), x_i〉 closer to being positive.
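The pseudocode translates almost line by line to NumPy. A minimal sketch (the max_updates cap is an addition, since the loop never terminates on non-separable data, and the toy sample is illustrative):

```python
import numpy as np

def batch_perceptron(X, y, max_updates=10_000):
    """Batch Perceptron: returns w with sign(<w, x_i>) = y_i
    on a separable sample (homogeneous predictors, no bias)."""
    m, d = X.shape
    w = np.zeros(d)                           # w^(1) = (0, ..., 0)
    for _ in range(max_updates):              # safety cap (not in the slides)
        mistakes = np.flatnonzero(y * (X @ w) <= 0)
        if mistakes.size == 0:
            return w                          # all examples correctly labeled
        i = mistakes[0]                       # any mistaken example will do
        w = w + y[i] * X[i]                   # the update rule
    raise RuntimeError("no separator found within max_updates")

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = batch_perceptron(X, y)
assert np.all(y * (X @ w) > 0)
```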
The Perceptron
Illustration in two-dimensional space (d = 2): on the board.
- The separator tilts in the right direction in each update.
- The same example can be repeated several times.
Does this always work?
How many updates does it take to get an error-free separator?
The separation margin
Intuitively, separation is “easier” if positive and negative points are far apart.
“Far apart” ≡ there is a separator which is far from all points.
[figure: a separator w and the distance from a point to it]
We will show that the Perceptron is indeed faster in this case.
First, let’s make this formal.
Claim
|〈w, x〉| / ‖w‖ is the distance between x ∈ R^d and the separator defined by w.
The separation margin
Claim: the distance between x and the hyperplane defined by w is |〈w, x〉| / ‖w‖.
Define ŵ := w/‖w‖. Then ‖ŵ‖ = √〈ŵ, ŵ〉 = 1.
The hyperplane is H = {v | 〈v, w〉 = 0} = {v | 〈v, ŵ〉 = 0}.
The distance between the hyperplane and x is ∆ := min_{v∈H} ‖x − v‖.
Take v = x − 〈ŵ, x〉 · ŵ. Then v ∈ H, because
〈v, ŵ〉 = 〈x, ŵ〉 − 〈(〈ŵ, x〉 · ŵ), ŵ〉 = 〈x, ŵ〉 − 〈ŵ, x〉〈ŵ, ŵ〉 = 0.
The distance is at most ‖x − v‖:
∆ ≤ ‖x − v‖ = ‖〈ŵ, x〉 · ŵ‖ = |〈ŵ, x〉| · ‖ŵ‖ = |〈ŵ, x〉| = |〈w, x〉| / ‖w‖.
Also, for any u ∈ H:
‖x − u‖² = ‖(x − v) + (v − u)‖² = 〈(x − v) + (v − u), (x − v) + (v − u)〉
= ‖x − v‖² + 2〈x − v, v − u〉 + ‖v − u‖² ≥ |〈w, x〉|²/‖w‖² + 2〈x − v, v − u〉.
And 〈x − v, v − u〉 = 〈ŵ, x〉〈ŵ, v − u〉 = 0, so ∆ ≥ |〈w, x〉| / ‖w‖ as well, proving the claim.
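The claim and the closest point v constructed in the proof can be sanity-checked numerically (a NumPy sketch with arbitrary vectors, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x = rng.normal(size=3)

# Claimed distance from x to the hyperplane {v : <w, v> = 0}
dist = abs(np.dot(w, x)) / np.linalg.norm(w)

# The closest point from the proof: v = x - <w_hat, x> * w_hat
w_hat = w / np.linalg.norm(w)
v = x - np.dot(w_hat, x) * w_hat

assert abs(np.dot(w, v)) < 1e-12                   # v lies on the hyperplane
assert np.isclose(np.linalg.norm(x - v), dist)     # and realizes the distance
```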
The separation margin
Denote R := max_i ‖x_i‖. We will normalize by this value.
The minimal normalized distance of any x_i in S from w is called the margin of w:
γ(w) := (1/R) · min_{i≤m} |〈w, x_i〉| / ‖w‖.
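The margin is straightforward to compute. A small NumPy sketch (the sample and separator are assumptions for illustration):

```python
import numpy as np

def margin(w, X):
    """gamma(w) = (1/R) * min_i |<w, x_i>| / ||w||, with R = max_i ||x_i||."""
    R = np.max(np.linalg.norm(X, axis=1))
    return np.min(np.abs(X @ w)) / (R * np.linalg.norm(w))

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
w = np.array([1.0, 1.0])
# every |<w, x_i>| = 3, ||w|| = sqrt(2), R = sqrt(5)
assert np.isclose(margin(w, X), 3 / np.sqrt(10))
```

Note that margin(a * w, X) equals margin(w, X) for any a > 0, matching the scale-invariance discussed next.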
The separation margin
Which separators have a large margin?
For any α > 0 and w ∈ R^d, α · w defines the same separator, with the same margin.
We can look at separators w such that min_{i≤m} y_i〈w, x_i〉 = 1.
This does not lose generality!
Then γ(w) = 1/(R · ‖w‖)   (R := max_i ‖x_i‖).
Small norm ‖w‖ and small R =⇒ large margin γ(w).
The Perceptron: Guarantee
Theorem (Theorem 9.1 in course book)
Assume that S = ((x_1, y_1), . . . , (x_m, y_m)) is separable. Then:
1. When the Perceptron stops and returns w^(t), ∀i ≤ m, y_i〈w^(t), x_i〉 > 0.
2. Define R := max_{i∈[m]} ‖x_i‖ and B := min{‖w‖ | ∀i ≤ m, y_i〈w, x_i〉 ≥ 1}.
The Perceptron performs at most (RB)² updates.
Part (1) is trivial: the Perceptron never stops unless this holds.
Recall that for a normalized separator, γ(w) = 1/(R · ‖w‖).
Let γ_S be the largest margin achievable on S. Then γ_S = 1/(RB).
Number of updates ∝ 1/γ_S².
The Perceptron: Proving the theorem
We will show that if the Perceptron runs for at least T iterations, then T ≤ (RB)². So the total number of iterations is at most (RB)².
Let w∗ be such that ∀i ≤ m, y_i〈w∗, x_i〉 ≥ 1 and ‖w∗‖ = B.
We will keep track of two quantities: ‖w^(t)‖ and 〈w∗, w^(t)〉.
We will show that the norm grows slowly, while the inner product grows fast.
More precisely:
- ‖w^(t+1)‖ ≤ √t · R
- 〈w∗, w^(t+1)〉 ≥ t.
Recall that a larger 〈w∗, w^(t+1)〉 / (‖w∗‖ · ‖w^(t+1)‖) =⇒ a smaller angle between w∗ and w^(t+1).
Reminder: Cauchy–Schwarz inequality. For all u, v ∈ R^d,
〈u, v〉 ≤ ‖u‖ · ‖v‖.
So we will conclude:
T ≤ 〈w∗, w^(T+1)〉 ≤ ‖w∗‖ · ‖w^(T+1)‖ ≤ B · √T · R =⇒ T ≤ (RB)².
The Perceptron: Proving the theorem
Upper bounding the norm ‖w^(T+1)‖:
In iteration t, let i be the example that was used to update w^(t).
Recall the Perceptron update: w^(t+1) ← w^(t) + y_i x_i.
We have y_i〈w^(t), x_i〉 ≤ 0. So
‖w^(t+1)‖² = ‖w^(t) + y_i x_i‖²
= ‖w^(t)‖² + 2y_i〈w^(t), x_i〉 + y_i²‖x_i‖²
≤ ‖w^(t)‖² + R².
‖w^(1)‖² = 0.
By induction, ‖w^(T+1)‖² ≤ T · R².
So ‖w^(T+1)‖ ≤ √T · R.
The Perceptron: Proving the theorem
Lower bounding the inner product 〈w∗, w^(T+1)〉:
w^(1) = (0, . . . , 0) =⇒ 〈w∗, w^(1)〉 = 0.
Recall the Perceptron update: w^(t+1) ← w^(t) + y_i x_i.
In each iteration, 〈w∗, w^(t)〉 increases by at least one:
〈w∗, w^(t+1)〉 − 〈w∗, w^(t)〉 = 〈w∗, w^(t+1) − w^(t)〉 = 〈w∗, y_i x_i〉 = y_i〈w∗, x_i〉 ≥ 1
(the last step from the definition of w∗).
After T iterations, telescoping gives:
〈w∗, w^(T+1)〉 = Σ_{t=1}^{T} (〈w∗, w^(t+1)〉 − 〈w∗, w^(t)〉) ≥ T.
This means that our w gets “closer” to w∗ at each iteration.
The Perceptron: Proving the theorem
We showed:
- ‖w^(T+1)‖ ≤ √T · R
- 〈w∗, w^(T+1)〉 ≥ T.
Using Cauchy–Schwarz:
T ≤ 〈w∗, w^(T+1)〉 ≤ ‖w∗‖ · ‖w^(T+1)‖ ≤ B · √T · R.
Conclusion: T ≤ (RB)².
We showed
The Perceptron runs for at most (RB)² iterations.
When it stops, the separator it returns separates the examples in S.
Let γ_S := the best possible margin on S; then γ_S = 1/(RB).
So, the number of iterations is O(1/γ_S²).
Perceptron properties
Processes one example at a time: low working memory.
Number of updates depends on the margin.
If the margin is very small, the Perceptron might take Ω(2^d) time to converge.
In practice, in many natural problems, the margin is large and the Perceptron is faster than LP.
What if the training sample is not separable?
- LP will completely fail.
- The Perceptron can still run, but will not terminate on its own.
- No guarantee for the Perceptron in this case.
Linear predictors: Intermediate summary
Linear predictors are very popular, because:
- If the sample size is Θ(d) (e.g., 10 times the dimension d), the training error and the true error will probably be similar.
- For many natural problems, there are linear predictors with low error.
Computing the ERM for a linear predictor is NP-hard.
But in the realizable/separable case, there are efficient algorithms:
- Using linear programming;
- The Batch Perceptron algorithm, if the margin is not too small.
Next: What to do when the problem is not separable.