© Stanley Chan 2020. All Rights Reserved.

ECE595 / STAT598: Machine Learning I
Lecture 28 Sample and Model Complexity
Spring 2020

Stanley Chan
School of Electrical and Computer Engineering, Purdue University

1 / 17



Outline

Lecture 28 Sample and Model Complexity

Lecture 29 Bias and Variance

Lecture 30 Overfit

Today’s Lecture:

Generalization Bound using VC Dimension

Review of growth function and VC dimension
Generalization bound

Sample and Model Complexity

Sample complexity
Model complexity
Trade-off

2 / 17


VC Dimension

Definition (VC Dimension)

The Vapnik-Chervonenkis dimension of a hypothesis set H, denoted by dVC, is the largest value of N for which H can shatter N training samples.

You give me a hypothesis set H, e.g., linear model

You tell me the number of training samples N

Start with a small N

I will be able to shatter for a while, until I hit a bump

E.g., linear in 2D: N = 3 is okay, but N = 4 is not okay

So I find the largest N such that H can shatter N training samples

E.g., linear in 2D: dVC = 3

If H is complex, then expect large dVC

Does not depend on p(x), A and f

3 / 17
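To make the shattering idea concrete, here is a minimal sketch (not from the slides) that brute-forces every ±1 labeling of a point set and checks, via a linear-program feasibility test, whether a 2D linear classifier (w, b) can realize it. It assumes numpy and scipy are available; the two point sets at the bottom are illustrative choices.

import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    # Feasibility LP: find (w1, w2, b) with y_i * (w . x_i + b) >= 1 for every sample.
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0  # status 0 = feasible, 2 = infeasible

def shattered(X):
    # H shatters X if every one of the 2^N labelings is achievable.
    return all(linearly_separable(X, np.array(lab))
               for lab in itertools.product([-1.0, 1.0], repeat=len(X)))

three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four_points  = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shattered(three_points))  # True:  N = 3 is okay, so dVC >= 3
print(shattered(four_points))   # False: the XOR labeling of these 4 points is not realizable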


Linking the Growth Function

Theorem (Sauer’s Lemma)

Let dVC be the VC dimension of a hypothesis set H. Then

m_H(N) \le \sum_{i=0}^{d_{VC}} \binom{N}{i}. \qquad (1)

I skip the proof here. See AML Chapter 2.2 for the proof.

What is more interesting is this:

\sum_{i=0}^{d_{VC}} \binom{N}{i} \le N^{d_{VC}} + 1.

This can be proved by simple induction. Exercise.

So together we have

m_H(N) \le N^{d_{VC}} + 1.

4 / 17
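As a quick numerical illustration (a sketch, not part of the lecture), the snippet below compares the Sauer bound, its polynomial relaxation N^{d_VC} + 1, and the unrestricted count 2^N; the values of dVC and N are arbitrary choices.

from math import comb

def sauer_bound(N, d_vc):
    # Sum_{i=0}^{d_vc} C(N, i), the right-hand side of (1).
    return sum(comb(N, i) for i in range(d_vc + 1))

d_vc = 3
for N in (5, 10, 20, 50):
    print(N, sauer_bound(N, d_vc), N**d_vc + 1, 2**N)

# For finite d_vc the growth function is bounded by a polynomial in N,
# while the unrestricted number of dichotomies 2^N grows exponentially.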


Difference between VC and Hoeffding

5 / 17


Generalization Bound Again

Recall the generalization bound

E_{in}(g) - \sqrt{\frac{1}{2N}\log\frac{2M}{\delta}} \;\le\; E_{out}(g) \;\le\; E_{in}(g) + \sqrt{\frac{1}{2N}\log\frac{2M}{\delta}}.

Replace M by mH(N), and then use m_H(N) \le N^{d_{VC}} + 1:

E_{out}(g) \le E_{in}(g) + \sqrt{\frac{1}{2N}\log\frac{2\left(N^{d_{VC}} + 1\right)}{\delta}}.

Wonderful!

Everything is characterized by δ, N and dVC

dVC tells us the expressiveness of the model

You can also think of dVC as the effective number of parameters

6 / 17


Generalization Bound Again

If dVC < ∞,

Then as N → ∞,

\varepsilon = \sqrt{\frac{1}{2N}\log\frac{2\left(N^{d_{VC}} + 1\right)}{\delta}} \to 0.

If this is the case, then the final hypothesis g ∈ H will generalize.

If dVC = ∞,

Then H is as diverse as it can be

It is not possible to generalize

Message 1: If you choose a complex model, then you need to pay the price in training samples

Message 2: If you choose an extremely complex model, then it may not be able to generalize regardless of the number of samples

7 / 17
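A small sketch (not from the slides) to watch the penalty term shrink numerically when dVC is finite; dVC = 3 and δ = 0.1 are illustrative values.

import math

def vc_epsilon(N, d_vc, delta):
    # epsilon = sqrt( (1/(2N)) * log( 2*(N**d_vc + 1) / delta ) )
    return math.sqrt(math.log(2 * (N**d_vc + 1) / delta) / (2 * N))

for N in (100, 1_000, 10_000, 100_000, 1_000_000):
    print(f"N = {N:>9,d}   epsilon = {vc_epsilon(N, d_vc=3, delta=0.1):.4f}")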


Generalizing the Generalization Bound

Theorem (Generalization Bound)

For any tolerance δ > 0

E_{out}(g) \le E_{in}(g) + \sqrt{\frac{8}{N}\log\frac{4\, m_H(2N)}{\delta}},

with probability at least 1 − δ.

There are some small, subtle technical requirements. See AML Chapter 2.2.

How tight is this generalization bound? Not too tight.

The Hoeffding inequality has slack: it must hold for all values of Eout

The growth function mH(N) gives the worst-case scenario

Bounding mH(N) by a polynomial introduces slack

8 / 17


Outline

Lecture 28 Sample and Model Complexity

Lecture 29 Bias and Variance

Lecture 30 Overfit

Today’s Lecture:

Generalization Bound using VC Dimension

Review of growth function and VC dimension
Generalization bound

Sample and Model Complexity

Sample complexity
Model complexity
Trade-off

9 / 17


Sample and Model Complexity

Sample Complexity

What is the smallest number of samples required?

Required to ensure training and testing error are close

Close = within a certain ε, with confidence 1 − δ

Regardless of what learning algorithm you use

Model Complexity

What is the largest model you can use?

Refers to the hypothesis set

With respect to the number of training samples

Largest = measured in terms of VC dimension

Can use = within a certain ε, with confidence 1 − δ

Regardless of what learning algorithm you use

10 / 17


Sample Complexity

The generalization bound is

E_{out}(g) \le E_{in}(g) + \sqrt{\frac{8}{N}\log\frac{4\, m_H(2N)}{\delta}}.

If you want the generalization error to be at most ε, then

\sqrt{\frac{8}{N}\log\frac{4\, m_H(2N)}{\delta}} \le \varepsilon.

Rearrange terms and use the VC dimension bound on mH(2N):

N \ge \frac{8}{\varepsilon^2}\log\left(\frac{4\left((2N)^{d_{VC}} + 1\right)}{\delta}\right).

Example. dVC = 3. ε = 0.1. δ = 0.1 (90% confidence). Then the number of samples we need is

N \ge \frac{8}{0.1^2}\log\left(\frac{4(2N)^{3} + 4}{0.1}\right).

11 / 17


Sample Complexity

How do we solve for N in this inequality?

N \ge \frac{8}{0.1^2}\log\left(\frac{4(2N)^{3} + 4}{0.1}\right).

Put N = 1000 into the right-hand side:

N \ge \frac{8}{0.1^2}\log\left(\frac{4(2\times 1000)^{3} + 4}{0.1}\right) \approx 21{,}193.

Not enough. So put N = 21,193 into the right-hand side. Iterate.

Then we get N ≈ 30,000.

So we need roughly 30,000 samples.

However, the generalization bound is not tight, so this estimate is an over-estimate.

Rule of thumb: use N ≥ 10 × dVC.

12 / 17
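The "plug in and iterate" recipe above is a fixed-point iteration; here is a minimal sketch of it (not from the slides), with the same dVC = 3, ε = 0.1, δ = 0.1 example. The starting guess and stopping tolerance are arbitrary choices.

import math

def sample_complexity(d_vc, eps, delta, n_start=1000.0, tol=1.0):
    # Iterate N <- (8/eps^2) * log( (4*(2N)^d_vc + 4) / delta ) until it settles.
    n = n_start
    while True:
        rhs = (8.0 / eps**2) * math.log((4.0 * (2.0 * n)**d_vc + 4.0) / delta)
        if abs(rhs - n) < tol:
            return math.ceil(rhs)
        n = rhs

print(sample_complexity(d_vc=3, eps=0.1, delta=0.1))  # about 29,300, i.e. roughly 30,000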


Error Bar

The generalization bound is

E_{out}(g) \le E_{in}(g) + \sqrt{\frac{8}{N}\log\left(\frac{4\left((2N)^{d_{VC}} + 1\right)}{\delta}\right)}.

What error bar can we offer?

Example. N = 100. δ = 0.1 (90% confidence). dVC = 1.

E_{out}(g) \le E_{in}(g) + \sqrt{\frac{8}{100}\log\left(\frac{4\left((2\times 100) + 1\right)}{0.1}\right)} \approx E_{in}(g) + 0.848.

Close to useless.

If we use N = 1000, then

E_{out}(g) \le E_{in}(g) + 0.301.

A somewhat more respectable estimate.

13 / 17
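A one-function sketch (not from the slides) that reproduces the two error bars above.

import math

def vc_error_bar(N, d_vc, delta):
    # sqrt( (8/N) * log( 4*((2N)^d_vc + 1) / delta ) )
    return math.sqrt((8.0 / N) * math.log(4.0 * ((2.0 * N)**d_vc + 1.0) / delta))

print(f"{vc_error_bar(100,  d_vc=1, delta=0.1):.3f}")   # 0.848
print(f"{vc_error_bar(1000, d_vc=1, delta=0.1):.3f}")   # 0.301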


Model Complexity

The generalization bound is

E_{out}(g) \le E_{in}(g) + \underbrace{\sqrt{\frac{8}{N}\log\frac{4\left((2N)^{d_{VC}} + 1\right)}{\delta}}}_{=\ \varepsilon(N,\mathcal{H},\delta)}

ε(N, H, δ) = the penalty for model complexity

If dVC is large, then ε(N, H, δ) is big

So the generalization error is large

There is a trade-off curve

14 / 17
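To see the trade-off numerically, a short sketch (not from the slides): fix N and δ and watch ε(N, H, δ) grow with dVC. N = 10,000 and δ = 0.1 are illustrative choices.

import math

def model_penalty(N, d_vc, delta):
    # epsilon(N, H, delta) = sqrt( (8/N) * log( 4*((2N)^d_vc + 1) / delta ) )
    return math.sqrt((8.0 / N) * math.log(4.0 * ((2 * N)**d_vc + 1) / delta))

N, delta = 10_000, 0.1
for d_vc in (1, 2, 5, 10, 20):
    print(f"d_VC = {d_vc:>2}   epsilon = {model_penalty(N, d_vc, delta):.3f}")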


Trade-off Curve

15 / 17


Generalization Bound for Testing

Testing Set: Dtest = {x1, . . . , xL}.

The final hypothesis g is already determined, so there is no need to use the union bound.

The Hoeffding inequality is as simple as

P\{|E_{in}(g) - E_{out}(g)| > \varepsilon\} \le 2e^{-2\varepsilon^2 L}.

The generalization bound is

E_{out}(g) \le E_{in}(g) + \sqrt{\frac{1}{2L}\log\frac{2}{\delta}}.

If you have a lot of testing samples, then Ein(g) will be a good estimate of Eout(g)

Independent of model complexity

Only δ and L

16 / 17
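A final sketch (not from the slides): the test-set error bar depends only on L and δ, so a small table is easy to generate; δ = 0.1 is an illustrative choice.

import math

def test_error_bar(L, delta):
    # sqrt( (1/(2L)) * log(2/delta) ), from the Hoeffding bound with a single fixed hypothesis.
    return math.sqrt(math.log(2.0 / delta) / (2.0 * L))

for L in (100, 1_000, 10_000):
    print(f"L = {L:>6,d}   error bar = {test_error_bar(L, delta=0.1):.4f}")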


Reading List

Yaser Abu-Mostafa, Learning from Data, Chapter 2.1

Mehryar Mohri, Foundations of Machine Learning, Chapter 3.2

Stanford CS229 note: http://cs229.stanford.edu/notes/cs229-notes4.pdf

17 / 17