Introduction to Machine Learning (67577), Lecture 2

Shai Shalev-Shwartz

School of CS and Engineering, The Hebrew University of Jerusalem

PAC learning

Outline

1 The PAC Learning Framework

2 No Free Lunch and Prior Knowledge

3 PAC Learning of Finite Hypothesis Classes

4 The Fundamental Theorem of Learning Theory
   The VC dimension

5 Solving ERM for Halfspaces

Recall: The Game Board

Domain set, X: This is the set of objects that we may wish to label.

Label set, Y: The set of possible labels.

A prediction rule, h : X → Y: used to label future examples. This function is called a predictor, a hypothesis, or a classifier.

Example

X = R^2 representing color and shape of papayas.

Y = {±1} representing “tasty” or “non-tasty”.

h(x) = 1 if x is within the inner rectangle

Batch Learning

The learner’s input:

Training data, S = ((x1, y1), . . . , (xm, ym)) ∈ (X × Y)^m

The learner’s output:

A prediction rule, h : X → Y

What should be the goal of the learner?

Intuitively, h should be correct on future examples

“Correct on future examples”

Let f be the correct classifier; then we should find h s.t. h ≈ f

One way: define the error of h w.r.t. f to be

L_{D,f}(h) = P_{x∼D}[h(x) ≠ f(x)]

where D is some (unknown) probability measure over X

More formally, D is a distribution over X, that is, for a given A ⊂ X, the value of D(A) is the probability to see some x ∈ A. Then,

L_{D,f}(h) := P_{x∼D}[h(x) ≠ f(x)] = D({x ∈ X : h(x) ≠ f(x)}).

Can we find h s.t. L_{D,f}(h) is small?
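
To make the definition concrete, here is a small Monte Carlo sketch in Python that estimates L_{D,f}(h) by sampling from D. The particular D (uniform on [0,1]^2), the rectangle used for f, and the slightly-off rectangle used for h are illustrative choices, not part of the lecture.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: X = [0,1]^2, D uniform, f labels points inside an inner rectangle as +1.
def f(x):
    return 1 if 0.3 <= x[0] <= 0.7 and 0.4 <= x[1] <= 0.8 else -1

# A candidate predictor h that uses a slightly wrong rectangle.
def h(x):
    return 1 if 0.35 <= x[0] <= 0.7 and 0.4 <= x[1] <= 0.8 else -1

# L_{D,f}(h) = P_{x~D}[h(x) != f(x)], estimated by averaging over samples drawn from D.
xs = rng.uniform(0.0, 1.0, size=(100_000, 2))
estimate = np.mean([h(x) != f(x) for x in xs])
print(estimate)  # close to the true error 0.05 * 0.4 = 0.02 for these rectangles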

Data-generation Model

We must assume some relation between the training data and D, f

Simple data generation model:

Independently Identically Distributed (i.i.d.): Each xi is sampled independently according to D

Realizability: For every i ∈ [m], yi = f(xi)
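
A sketch of this data model in Python (the distribution and the target rectangle below are illustrative choices, not from the lecture): each xi is drawn i.i.d. from D and, by realizability, labeled by f.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative target: f labels a papaya as tasty (+1) iff (color, shape) lies in an inner rectangle.
def f(x):
    return 1 if 0.3 <= x[0] <= 0.7 and 0.4 <= x[1] <= 0.8 else -1

def sample_training_set(m):
    """Draw m i.i.d. instances from D (here: uniform on [0,1]^2) and label each one by f."""
    xs = rng.uniform(0.0, 1.0, size=(m, 2))
    return [(x, f(x)) for x in xs]

S = sample_training_set(50)  # S = ((x1, y1), ..., (xm, ym)) with yi = f(xi)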

Can only be Approximately correct

Claim: We can't hope to find h s.t. L_{D,f}(h) = 0

Proof: for every ε ∈ (0, 1) take X = {x1, x2} and D({x1}) = 1 − ε, D({x2}) = ε

The probability not to see x2 at all among m i.i.d. examples is (1 − ε)^m ≈ e^{−εm}

So, if ε ≪ 1/m we're likely not to see x2 at all, but then we can't know its label

Relaxation: We'd be happy with L_{D,f}(h) ≤ ε, where ε is user-specified
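
A quick numeric illustration (not on the original slide): with ε = 0.001 and m = 100 we get (1 − ε)^m ≈ 0.905 ≈ e^{−0.1}, i.e. with probability about 90% the sample misses x2 entirely, and then no learner can know its label.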

Can only be Probably correct

Recall that the input to the learner is randomly generated

There's always a (very small) chance to see the same example again and again

Claim: No algorithm can guarantee L_{D,f}(h) ≤ ε for sure

Relaxation: We'd allow the algorithm to fail with probability δ, where δ ∈ (0, 1) is user-specified

Here, the probability is over the random choice of examples

Probably Approximately Correct (PAC) learning

The learner doesn’t know D and f .

The learner receives accuracy parameter ε and confidence parameter δ

The learner can ask for training data, S, containing m(ε, δ) examples (that is, the number of examples can depend on the value of ε and δ, but it can't depend on D or f)

Learner should output a hypothesis h s.t. with probability of at least 1 − δ it holds that L_{D,f}(h) ≤ ε

That is, the learner should be Probably (with probability at least 1 − δ) Approximately (up to accuracy ε) Correct

No Free Lunch

Suppose that |X| = ∞

For any finite C ⊂ X take D to be the uniform distribution over C

If the number of training examples is m ≤ |C|/2 the learner has no knowledge on at least half the elements in C

Formalizing the above, it can be shown that:

Theorem (No Free Lunch)

Fix δ ∈ (0, 1), ε < 1/2. For every learner A and training set size m, there exists D, f such that with probability of at least δ over the generation of a training set, S, of m examples, it holds that L_{D,f}(A(S)) ≥ ε.

Remark: L_{D,f}(random guess) = 1/2, so the theorem states that you can't be better than a random guess

Prior Knowledge

Give more knowledge to the learner: the target f comes from some hypothesis class, H ⊂ Y^X

The learner knows H

Is it possible to PAC learn now?

Of course, the answer depends on H, since the No Free Lunch theorem tells us that for H = Y^X the problem is not solvable ...

Learning Finite Classes

Assume that H is a finite hypothesis class

E.g.: H is all the functions from X to Y that can be implemented using a Python program of length at most b

Use the Consistent learning rule:

Input: H and S = (x1, y1), . . . , (xm, ym)

Output: any h ∈ H s.t. ∀i, yi = h(xi)

This is also called Empirical Risk Minimization (ERM)

ERM_H(S)

Input: training set S = (x1, y1), . . . , (xm, ym)

Define the empirical risk: L_S(h) = (1/m) |{i : h(xi) ≠ yi}|

Output: any h ∈ H that minimizes L_S(h)
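
A brute-force sketch of ERM_H for a finite class (illustrative: here H is given as an explicit Python list of predictors; any finite enumeration would do):

def empirical_risk(h, S):
    """L_S(h): the fraction of training examples that h labels incorrectly."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H, S):
    """Return some h in the finite class H minimizing the empirical risk L_S(h)."""
    return min(H, key=lambda h: empirical_risk(h, S))

# Illustrative finite class over X = R: threshold classifiers with thresholds on a small grid.
H = [lambda x, t=t: 1 if x >= t else -1 for t in (0.0, 0.25, 0.5, 0.75, 1.0)]
S = [(0.1, -1), (0.2, -1), (0.6, 1), (0.9, 1)]
h_hat = erm(H, S)
print(empirical_risk(h_hat, S))  # 0.0: a consistent hypothesis exists in H for this sample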

Theorem

Fix ε, δ. If m ≥ log(|H|/δ)/ε, then for every D, f, with probability of at least 1 − δ (over the choice of S of size m), L_{D,f}(ERM_H(S)) ≤ ε.
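
As a rough plug-in (an illustration, not from the slides): if H is the class of predictors computable by Python programs of at most b characters over a 128-character alphabet, then |H| ≤ 128^b, so log(|H|/δ) ≤ b log 128 + log(1/δ) ≈ 4.9 b + log(1/δ) (natural log), and the required sample size grows only linearly in the length budget b.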

Proof

Let S|x = (x1, . . . , xm) be the instances of the training set

We would like to prove:

D^m({S|x : L_{D,f}(ERM_H(S)) > ε}) ≤ δ

Let H_B be the set of “bad” hypotheses,

H_B = {h ∈ H : L_{D,f}(h) > ε}

Let M be the set of “misleading” samples,

M = {S|x : ∃h ∈ H_B, L_S(h) = 0}

Observe:

{S|x : L_{D,f}(ERM_H(S)) > ε} ⊆ M = ⋃_{h∈H_B} {S|x : L_S(h) = 0}

Proof (Cont.)

Lemma (Union bound)

For any two sets A, B and a distribution D we have

D(A ∪ B) ≤ D(A) + D(B).

We have shown: {S|x : L_{D,f}(ERM_H(S)) > ε} ⊆ ⋃_{h∈H_B} {S|x : L_S(h) = 0}

Therefore, using the union bound,

D^m({S|x : L_{D,f}(ERM_H(S)) > ε}) ≤ Σ_{h∈H_B} D^m({S|x : L_S(h) = 0}) ≤ |H_B| max_{h∈H_B} D^m({S|x : L_S(h) = 0})

Observe:

D^m({S|x : L_S(h) = 0}) = (1 − L_{D,f}(h))^m

(by independence: L_S(h) = 0 means each of the m i.i.d. instances avoids the error set {x : h(x) ≠ f(x)}, which has mass L_{D,f}(h))

If h ∈ H_B then L_{D,f}(h) > ε and therefore

D^m({S|x : L_S(h) = 0}) < (1 − ε)^m

We have shown:

D^m({S|x : L_{D,f}(ERM_H(S)) > ε}) < |H_B| (1 − ε)^m

Finally, using 1 − ε ≤ e^{−ε} and |H_B| ≤ |H| we conclude:

D^m({S|x : L_{D,f}(ERM_H(S)) > ε}) < |H| e^{−εm}

The right-hand side is ≤ δ if m ≥ log(|H|/δ)/ε.

Illustrating the use of the union bound

Each point is a possible sample S|x. Each colored oval represents misleading samples for some h ∈ H_B. The probability mass of each such oval is at most (1 − ε)^m. But the algorithm might err if it samples S|x from any of these ovals.

PAC learning

Definition (PAC learnability)

A hypothesis class H is PAC learnable if there exists a function m_H : (0, 1)^2 → N and a learning algorithm with the following property:

for every ε, δ ∈ (0, 1)

for every distribution D over X, and for every labeling function f : X → {0, 1}

when running the learning algorithm on m ≥ m_H(ε, δ) i.i.d. examples generated by D and labeled by f, the algorithm returns a hypothesis h such that, with probability of at least 1 − δ (over the choice of the examples), L_{D,f}(h) ≤ ε.

m_H is called the sample complexity of learning H

Leslie Valiant, Turing award 2010

For transformative contributions to the theory of computation, including the theory of probably approximately correct (PAC) learning, the complexity of enumeration and of algebraic computation, and the theory of parallel and distributed computing.

What is learnable and how to learn?

We have shown:

Corollary

Let H be a finite hypothesis class.

H is PAC learnable with sample complexity m_H(ε, δ) ≤ log(|H|/δ)/ε

This sample complexity is obtained by using the ERM_H learning rule

What about infinite hypothesis classes?

What is the sample complexity of a given class?

Is there a generic learning algorithm that achieves the optimal sample complexity?

The fundamental theorem of statistical learning:

The sample complexity is characterized by the VC dimension

The ERM learning rule is a generic (near) optimal learner

(Photos: Chervonenkis and Vapnik)

The VC dimension — Motivation

If someone can explain every phenomenon, her explanations are worthless.

Example: http://www.youtube.com/watch?v=p_MzP2MZaOo

Pay attention to the retrospective explanations at 5:00

Suppose we got a training set S = (x1, y1), . . . , (xm, ym)

We try to explain the labels using a hypothesis from H

Then, oops, the labels we received were incorrect and we get the same instances with different labels, S′ = (x1, y′1), . . . , (xm, y′m)

We again try to explain the labels using a hypothesis from H

If this works for us, no matter what the labels are, then something is fishy ...

Formally, if H allows all functions over some set C of size m, then, based on the No Free Lunch theorem, we can't learn from, say, m/2 examples

The VC dimension — Formal Definition

Let C = {x1, . . . , x|C|} ⊂ X

Let H_C be the restriction of H to C, namely, H_C = {h_C : h ∈ H}, where h_C : C → {0, 1} is s.t. h_C(xi) = h(xi) for every xi ∈ C

Observe: we can represent each h_C as the vector (h(x1), . . . , h(x|C|)) ∈ {±1}^|C|

Therefore: |H_C| ≤ 2^|C|

We say that H shatters C if |H_C| = 2^|C|

VCdim(H) = sup{|C| : H shatters C}

That is, the VC dimension is the maximal size of a set C such that H gives no prior knowledge w.r.t. C
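
For a finite class and a finite point set, shattering can be checked by brute force. A sketch (the grid of thresholds below is an illustrative finite stand-in for the class of threshold functions on the line):

def behaviours(H, C):
    """H_C: the set of distinct label vectors (h(x1), ..., h(x_|C|)) realized by H on C."""
    return {tuple(h(x) for x in C) for h in H}

def shatters(H, C):
    """H shatters C iff H realizes all 2^|C| labelings of C."""
    return len(behaviours(H, C)) == 2 ** len(C)

# Illustrative check with threshold functions sign(x - t); a finite grid of thresholds is
# enough to exhibit every behaviour these classifiers can have on the two points below.
H = [lambda x, t=t: 1 if x - t >= 0 else -1 for t in [i / 10 for i in range(-10, 11)]]

print(shatters(H, [0.0]))        # True: one point can be labeled both +1 and -1
print(shatters(H, [0.25, 0.55])) # False: no threshold labels the left point +1 and the right -1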

VC dimension — Examples

To show that VCdim(H) = d we need to show that:

1 There exists a set C of size d which is shattered by H.

2 Every set C of size d + 1 is not shattered by H.

Threshold functions: X = R, H = {x ↦ sign(x − θ) : θ ∈ R}

Show that {0} is shattered

Show that any two points cannot be shattered

Intervals: X = R, H = {h_{a,b} : a < b ∈ R}, where h_{a,b}(x) = 1 iff x ∈ [a, b]

Show that {0, 1} is shattered

Show that any three points cannot be shattered

Axis-aligned rectangles: X = R^2, H = {h_{(a1,a2,b1,b2)} : a1 < a2 and b1 < b2}, where h_{(a1,a2,b1,b2)}(x1, x2) = 1 iff x1 ∈ [a1, a2] and x2 ∈ [b1, b2]

Show: (figures omitted: one example of a shattered set, and one of a set c1, . . . , c5 that is not shattered)

Finite classes:

Show that the VC dimension of a finite H is at most log_2(|H|).

Show that there can be an arbitrary gap between VCdim(H) and log_2(|H|)

Halfspaces: X = R^d, H = {x ↦ sign(⟨w, x⟩) : w ∈ R^d}

Show that {e1, . . . , ed} is shattered

Show that any d + 1 points cannot be shattered
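
A quick check of the first claim for d = 3 (a sketch; the construction w = Σ_i yi e_i is the standard one): for any desired labeling y ∈ {±1}^d of the standard basis, the halfspace with normal w = Σ_i yi e_i realizes it, because ⟨w, e_i⟩ = yi.

import itertools
import numpy as np

d = 3
E = np.eye(d)  # rows are the standard basis vectors e_1, ..., e_d

# For every labeling y in {±1}^d, the vector w = sum_i y_i e_i satisfies <w, e_i> = y_i,
# so sign(<w, e_i>) = y_i for all i. Hence {e_1, ..., e_d} is shattered by halfspaces.
for y in itertools.product((-1, 1), repeat=d):
    w = sum(yi * ei for yi, ei in zip(y, E))
    assert all(np.sign(w @ ei) == yi for yi, ei in zip(y, E))
print("all", 2 ** d, "labelings of {e_1, ..., e_d} are realized by some halfspace")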

The Fundamental Theorem of Statistical Learning

Theorem (The Fundamental Theorem of Statistical Learning)

Let H be a hypothesis class of binary classifiers, and let d = VCdim(H). Then, there are absolute constants C1, C2 such that the sample complexity of PAC learning H is

C1 (d + log(1/δ)) / ε ≤ m_H(ε, δ) ≤ C2 (d log(1/ε) + log(1/δ)) / ε

Furthermore, this sample complexity is achieved by the ERM learning rule.

Proof of the lower bound – main ideas

Suppose VCdim(H) = d and let C = {x1, . . . , xd} be a shattered set

Consider the distribution D supported on C s.t.

D({xi}) = 1 − 4ε if i = 1, and D({xi}) = 4ε/(d − 1) if i > 1

If we see m i.i.d. examples then the expected number of examples from C \ {x1} is 4εm

If m < (d − 1)/(8ε) then 4εm < (d − 1)/2 and therefore we have no information on the labels of at least half the examples in C \ {x1}

Best we can do is to guess, but then our error is ≥ (1/2) · 2ε = ε

Proof of the upper bound – main ideas

Recall our proof for a finite class:

For a single hypothesis, we've shown that the probability of the event L_S(h) = 0, given that L_{D,f}(h) > ε, is at most e^{−εm}

Then we applied the union bound over all “bad” hypotheses, to obtain the bound on ERM failure: |H| e^{−εm}

If H is infinite, or very large, the union bound yields a meaningless bound

The two-sample trick: show that

P_{S∼D^m}[∃h ∈ H_B : L_S(h) = 0] ≤ 2 P_{S,T∼D^m}[∃h ∈ H_B : L_S(h) = 0 and L_T(h) ≥ ε/2]

Symmetrization: Since S, T are i.i.d., we can think of first sampling 2m examples and then splitting them into S, T at random

If we fix h and S ∪ T, the probability of having L_S(h) = 0 while L_T(h) ≥ ε/2 is ≤ e^{−εm/4}

Once we have fixed S ∪ T, we can take a union bound only over H_{S∪T}!

Lemma (Sauer-Shelah-Perles-Vapnik-Chervonenkis)

Let H be a hypothesis class with VCdim(H) ≤ d < ∞. Then, for all C ⊂ X s.t. |C| = m > d + 1 we have

|H_C| ≤ (em/d)^d
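
The point of the lemma is that once |C| is larger than the VC dimension, the number of behaviours of H on C grows only polynomially in m (roughly like m^d) rather than like 2^m; this is what makes the union bound over H_{S∪T} in the previous step meaningful. A quick numeric comparison (illustrative values of m and d):

import math

d = 10  # an assumed VC dimension, for illustration only
for m in (40, 80, 160, 320):
    sauer_bound = (math.e * m / d) ** d
    print(f"m={m:4d}   2^m = {2 ** m:.2e}   (em/d)^d = {sauer_bound:.2e}")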

ERM for halfspaces

Recall: H = {x ↦ sign(⟨w, x⟩) : w ∈ R^d}

ERM for halfspaces: given S = (x1, y1), . . . , (xm, ym), find w s.t. for all i, sign(⟨w, xi⟩) = yi.

Cast as a Linear Program: find w s.t.

∀i, yi⟨w, xi⟩ > 0.

Can solve efficiently using standard methods

Exercise: show how to solve the above Linear Program using the Ellipsoid learner from the previous lecture
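
A sketch of this LP in code (assuming scipy is available). In the realizable case the strict inequality yi⟨w, xi⟩ > 0 can be replaced by yi⟨w, xi⟩ ≥ 1, since any separating w can be rescaled; the tiny data set at the bottom is made up for illustration.

import numpy as np
from scipy.optimize import linprog

def erm_halfspace_lp(X, y):
    """Find w with y_i <w, x_i> >= 1 for all i via a feasibility LP (realizable data assumed).

    X: (m, d) array of instances; y: (m,) array of labels in {-1, +1}.
    """
    m, d = X.shape
    # y_i <w, x_i> >= 1   is the same as   -(y_i * x_i) . w <= -1
    A_ub = -(y[:, None] * X)
    b_ub = -np.ones(m)
    c = np.zeros(d)  # feasibility problem: the objective is irrelevant
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * d)
    if not res.success:
        raise ValueError("no homogeneous separating halfspace found")
    return res.x

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
w = erm_halfspace_lp(X, y)
print(np.sign(X @ w) == y)  # [ True  True  True  True]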

ERM for halfspaces using the Perceptron Algorithm

Perceptron

initialize: w = (0, . . . , 0) ∈ R^d
while ∃i s.t. yi⟨w, xi⟩ ≤ 0:
    w ← w + yi xi

Dates back at least to Rosenblatt 1958.
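
A direct Python sketch of the update rule above (it terminates only on realizable data, which is exactly what the analysis that follows quantifies); the small separable data set is made up for illustration.

import numpy as np

def perceptron(X, y):
    """Repeat the Perceptron update until every training example satisfies y_i <w, x_i> > 0.

    X: (m, d) array; y: (m,) array with entries in {-1, +1}. Terminates only on realizable data.
    """
    m, d = X.shape
    w = np.zeros(d)
    while True:
        mistakes = [i for i in range(m) if y[i] * (w @ X[i]) <= 0]
        if not mistakes:
            return w
        i = mistakes[0]
        w = w + y[i] * X[i]  # the update step: w <- w + y_i x_i

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.sign(X @ w) == y)  # [ True  True  True  True]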

Analysis

Theorem (Agmon’54, Novikoff’62)

Let (x1, y1), . . . , (xm, ym) be a sequence of examples such that there exists w∗ ∈ R^d such that for every i, yi⟨w∗, xi⟩ ≥ 1. Then, the Perceptron will make at most

‖w∗‖^2 max_i ‖xi‖^2

updates before stopping with an ERM halfspace.

The condition always holds if the data is realizable by some halfspace

However, ‖w∗‖ might be very large

In many practical cases, ‖w∗‖ would not be too large

Proof

Let w^(t) be the value of w at iteration t

Let (xt, yt) be the example used to update w at iteration t

Denote R = max_i ‖xi‖

The cosine of the angle between w∗ and w^(t) is ⟨w^(t), w∗⟩ / (‖w^(t)‖ ‖w∗‖)

By the Cauchy-Schwarz inequality, this is always ≤ 1

We will show:

1 ⟨w^(t+1), w∗⟩ ≥ t

2 ‖w^(t+1)‖ ≤ R√t

This would yield

t / (R√t ‖w∗‖) ≤ ⟨w^(t+1), w∗⟩ / (‖w^(t+1)‖ ‖w∗‖) ≤ 1

Rearranging the above yields t ≤ ‖w∗‖^2 R^2 as required.

Proof (Cont.)

Showing ⟨w^(t+1), w∗⟩ ≥ t:

Initially, ⟨w^(1), w∗⟩ = 0

Whenever we update, ⟨w^(t), w∗⟩ increases by at least 1:

⟨w^(t+1), w∗⟩ = ⟨w^(t) + yt xt, w∗⟩ = ⟨w^(t), w∗⟩ + yt⟨xt, w∗⟩, where yt⟨xt, w∗⟩ ≥ 1

Showing ‖w^(t+1)‖^2 ≤ R^2 t:

Initially, ‖w^(1)‖^2 = 0

Whenever we update, ‖w^(t)‖^2 increases by at most R^2:

‖w^(t+1)‖^2 = ‖w^(t) + yt xt‖^2 = ‖w^(t)‖^2 + 2yt⟨w^(t), xt⟩ + yt^2 ‖xt‖^2 ≤ ‖w^(t)‖^2 + R^2,

since 2yt⟨w^(t), xt⟩ ≤ 0 (we only update on examples where yt⟨w^(t), xt⟩ ≤ 0) and yt^2 ‖xt‖^2 ≤ R^2.

Summary

The PAC Learning model

What is PAC learnable?

PAC learning of finite classes using ERM

The VC dimension and the fundamental theorem of learning

Classes of finite VC dimension

How to PAC learn?

Using ERM

Learning halfspaces using: Linear programming, Ellipsoid, Perceptron
