introduction to machine learningepxing/class/10715/lectures/vctheory.pdf · machine learning,...

45
Advanced Introduction to Machine Learning, CMU-10715 VapnikChervonenkis Theory Barnabás Póczos

Upload: others

Post on 15-Jul-2020

22 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Advanced Introduction to

Machine Learning, CMU-10715

Vapnik–Chervonenkis Theory

Barnabás Póczos

Page 2: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

We have explored many ways of learning from data

But…

– How good is our classifier, really?

– How much data do we need to make it “good enough”?

Learning Theory

2

Page 3: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Review of what we have

learned so far

3

Page 4: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Notation

4

This is what the learning algorithm produces

We will need these definitions, please copy it!

Page 5: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Big Picture

5

Bayes risk

Estimation error Approximation error

Bayes risk

Ultimate goal:

Approximation error

Estimation error

Page 6: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Big Picture: Illustration of Risks

6

Upper bound

Goal of Learning:

Page 7: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Learning Theory

7

Page 8: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Outline

8

These results are useless if N is big, or infinite. (e.g. all possible hyper-planes)

Today we will see how to fix this with the

Shattering coefficient and VC dimension

Theorem:

From Hoeffding’s inequality, we have seen that

Page 9: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Outline

9

Theorem:

From Hoeffding’s inequality, we have seen that

After this fix, we can say something meaningful about this too:

This is what the learning algorithm produces and its true risk

Page 10: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Hoeffding inequality

10

Theorem:

Observation!

Definition:

Page 11: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

McDiarmid’s

Bounded Difference Inequality

It follows that

11

Page 12: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Bounded Difference Condition

12

Our main goal is to bound

Lemma:

Let g denote the following function:

Observation:

Proof:

) McDiarmid can be applied for g!

Page 13: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Bounded Difference Condition

13

Corollary:

The Vapnik-Chervonenkis inequality does that

with the shatter coefficient (and VC dimension)!

Page 14: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Concentration and

Expected Value

14

Page 15: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Vapnik-Chervonenkis inequality

15

We already know:

Vapnik-Chervonenkis inequality:

Corollary: Vapnik-Chervonenkis theorem:

Our main goal is to bound

Page 16: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Shattering

16

Page 17: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

How many points can a linear

boundary classify exactly in 1D?

- +

2 pts 3 pts - + +

- - +

- + - ?? There exists placement s.t. all labelings can be classified

- +

17

The answer is 2

Page 18: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

- + 3 pts 4 pts

- +

+ - +

- +

- ??

- +

- +

How many points can a linear

boundary classify exactly in 2D?

There exists placement s.t. all labelings can be classified

18

The answer is 3

Page 19: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

How many points can a linear

boundary classify exactly in d-dim?

19

The answer is d+1

How many points can a linear

boundary classify exactly in 3D?

The answer is 4

tetraeder

+

+ -

-

Page 20: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Growth function,

Shatter coefficient

Definition

0 0 0

0 1 0

1 1 1

1 0 0

0 1 1

0 1 0

1 1 1

(=5 in this example)

Growth function, Shatter coefficient

maximum number of behaviors on n points

20

Page 21: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Growth function,

Shatter coefficient

Definition

Growth function, Shatter coefficient

maximum number of behaviors on n points

- +

+

Example: Half spaces in 2D

+ + -

21

Page 22: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

VC-dimension

Definition

Growth function, Shatter coefficient

maximum number of behaviors on n points

Definition: VC-dimension

# b

eha

vio

rs

Definition: Shattering

Note: 22

Page 23: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

VC-dimension

Definition

# b

eha

vio

rs

23

Page 24: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

- +

- +

VC-dimension

24

Page 25: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Examples

25

Page 26: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

VC dim of decision stumps

(axis aligned linear separator) in 2d

What’s the VC dim. of decision stumps in 2d?

- +

+

- +

-

+ +

-

There is a placement of 3 pts that can be shattered ) VC dim ≥ 3

26

Page 27: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

What’s the VC dim. of decision stumps in 2d? If VC dim = 3, then for all placements of 4 pts, there exists a labeling that

can’t be shattered

3 collinear 1 in convex hull of other 3

quadrilateral

- + - - +

-

-

+ -

+ -

VC dim of decision stumps

(axis aligned linear separator) in 2d

27

Page 28: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

VC dim. of axis parallel

rectangles in 2d What’s the VC dim. of axis parallel rectangles in 2d?

- +

+

- +

- There is a placement of 3 pts that can be shattered ) VC dim ≥ 3

28

Page 29: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

VC dim. of axis parallel

rectangles in 2d

There is a placement of 4 pts that can be shattered ) VC dim ≥ 4 29

Page 30: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

VC dim. of axis parallel

rectangles in 2d What’s the VC dim. of axis parallel rectangles in 2d?

+ +

-

-

- -

- + - - +

-

- -

- +

-

+

-

If VC dim = 4, then for all placements of 5 pts, there exists a labeling that

can’t be shattered

4 collinear

2 in convex hull 1 in convex hull

pentagon

30

Page 31: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Sauer’s Lemma

31

The VC dimension can be used to upper bound

the shattering coefficient.

Sauer’s lemma:

Corollary:

We already know that [Exponential in n]

[Polynomial in n]

Page 32: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Proof of Sauer’s Lemma

Write all different behaviors on a sample

(x1,x2,…xn) in a matrix:

0 0 0

0 1 0

1 1 1

1 0 0

0 1 0

1 1 1

0 1 1

32

0 0 0

0 1 0

1 1 1

1 0 0

0 1 1

Page 33: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Proof of Sauer’s Lemma

We will prove that

33

0 0 0

0 1 0

1 1 1

1 0 0

0 1 1

Shattered subsets of columns:

Therefore,

Page 34: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Proof of Sauer’s Lemma

Lemma 1

34

0 0 0

0 1 0

1 1 1

1 0 0

0 1 1

for any binary matrix with no repeated rows.

Shattered subsets of columns:

Lemma 2

In this example: 5· 6

In this example: 6· 1+3+3=7

Page 35: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Proof of Lemma 1

Lemma 1

35

0 0 0

0 1 0

1 1 1

1 0 0

0 1 1

Shattered subsets of columns:

Proof

In this example: 6· 1+3+3=7

Page 36: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Proof of Lemma 2

36

for any binary matrix with no repeated rows.

Lemma 2

Induction on the number of columns Proof

Base case: A has one column. There are three cases:

) 1 · 1

) 1 · 1

) 2 · 2

Page 37: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Proof of Lemma 2

37

Inductive case: A has at least two columns.

We have,

By induction (less columns)

0 0 0

0 1 0

1 1 1

1 0 0

0 1 1

Page 38: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Proof of Lemma 2

38

because

0 0 0

0 1 0

1 1 1

1 0 0

0 1 1

Page 39: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Vapnik-Chervonenkis inequality

39

Vapnik-Chervonenkis inequality:

From Sauer’s lemma:

Since

Therefore,

[We don’t prove this]

Estimation error

Page 40: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Linear (hyperplane) classifiers

40

Estimation error

We already know that

Estimation error

Estimation error

Page 41: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Vapnik-Chervonenkis Theorem

41

We already know from McDiarmid:

Corollary: Vapnik-Chervonenkis theorem: [We don’t prove them]

Vapnik-Chervonenkis inequality:

Hoeffding + Union bound for finite function class:

Page 42: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

PAC Bound for the Estimation

Error

42

VC theorem:

Inversion:

Estimation error

Page 43: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Structoral Risk Minimization

43

Bayes risk

Estimation error Approximation error

Ultimate goal:

Approximation error

Estimation error

So far we studied when estimation error ! 0, but we also want approximation error ! 0

Many different variants…

penalize too complex models to avoid overfitting

Page 44: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

What you need to know

Complexity of the classifier depends on number of points that can

be classified exactly

Finite case – Number of hypothesis

Infinite case – Shattering coefficient, VC dimension

PAC bounds on true error in terms of empirical/training error and

complexity of hypothesis space

Empirical and Structural Risk Minimization

44

Page 45: Introduction to Machine Learningepxing/Class/10715/lectures/VCTheory.pdf · Machine Learning, CMU-10715 Vapnik–Chervonenkis Theory Barnabás Póczos . ... What you need to know

Thanks for your attention

45