
Page 1

Learning From Data
Lecture 7

Approximation Versus Generalization

The VC Dimension
Approximation Versus Generalization: Bias and Variance
The Learning Curve

M. Magdon-Ismail
CSCI 4100/6100

Page 2

Recap: The Vapnik-Chervonenkis Bound (VC Bound)

P[ |Ein(g) − Eout(g)| > ε ] ≤ 4 mH(2N) e^(−ε²N/8), for any ε > 0.
P[ |Ein(g) − Eout(g)| > ε ] ≤ 2|H| e^(−2ε²N)   ← finite H

P[ |Ein(g) − Eout(g)| ≤ ε ] ≥ 1 − 4 mH(2N) e^(−ε²N/8), for any ε > 0.
P[ |Ein(g) − Eout(g)| ≤ ε ] ≥ 1 − 2|H| e^(−2ε²N)   ← finite H

Eout(g) ≤ Ein(g) + √( (8/N) log( 4 mH(2N) / δ ) ), w.p. at least 1 − δ.
Eout(g) ≤ Ein(g) + √( (1/2N) log( 2|H| / δ ) )   ← finite H

mH(N) ≤ Σ_{i=0}^{k−1} (N choose i) ≤ N^(k−1) + 1,   where k is a break point.

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 2/22   VC dimension →

Page 3

The VC Dimension dvc

mH(N) ∼ N^(k−1)

The tightest bound is obtained with the smallest break point k∗.

Definition [VC Dimension]: dvc = k∗ − 1.

The VC dimension is the largest N that can be shattered (mH(N) = 2^N).

N ≤ dvc: H could shatter your data (H can shatter some N points).
N > dvc: N is a break point for H; H cannot possibly shatter your data.

mH(N) ≤ N^dvc + 1 ∼ N^dvc

Eout(g) ≤ Ein(g) + O( √( dvc log N / N ) )

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 3/22   dvc versus number of parameters →

Page 4

The VC-dimension is an Effective Number of Parameters

                      mH(N) for N =  1    2    3    4     5     · · ·   # Param   dvc
2-D perceptron                       2    4    8   14           · · ·      3       3
1-D pos. ray                         2    3    4    5           · · ·      1       1
2-D pos. rectangles                  2    4    8   16   < 2^5   · · ·      4       4
pos. convex sets                     2    4    8   16     2^5   · · ·      ∞       ∞

There are models with few parameters but infinite dvc.

There are models with redundant parameters but small dvc.
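To make the mH(N) column concrete, here is a minimal sketch (Python, not from the slides) that brute-forces the dichotomies a 1-D positive ray h(x) = sign(x − a) can produce on N points; it reproduces the N + 1 pattern in the table above.

```python
import numpy as np

def ray_dichotomies(points):
    """Count the distinct dichotomies h(x) = sign(x - a) can produce on the given points."""
    xs = np.sort(np.asarray(points, dtype=float))
    # One representative threshold per region: below all points, between neighbours, above all.
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    return len({tuple(np.where(xs > a, 1, -1)) for a in thresholds})

for N in range(1, 6):
    print(N, ray_dichotomies(np.random.rand(N)))   # prints 2, 3, 4, 5, 6: mH(N) = N + 1
```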

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 4/22   dvc for perceptron →

Page 5

VC-dimension of the Perceptron in R^d is d + 1

This can be shown in two steps:

1. dvc ≥ d + 1. What needs to be shown?

   (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 1 points that cannot be shattered.
   (c) Every set of d + 1 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.

2. dvc ≤ d + 1. What needs to be shown?

   (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 2 points that cannot be shattered.
   (c) Every set of d + 2 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.
   (e) Every set of d + 2 points cannot be shattered.

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 5/22   Step 1 answer →

Page 6

VC-dimension of the Perceptron in R^d is d + 1

This can be shown in two steps:

1. dvc ≥ d + 1. What needs to be shown?

 ✓ (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 1 points that cannot be shattered.
   (c) Every set of d + 1 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.

2. dvc ≤ d + 1. What needs to be shown?

   (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 2 points that cannot be shattered.
   (c) Every set of d + 2 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.
   (e) Every set of d + 2 points cannot be shattered.

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 6/22   Step 2 answer →

Page 7

VC-dimension of the Perceptron in R^d is d + 1

This can be shown in two steps:

1. dvc ≥ d + 1. What needs to be shown?

 ✓ (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 1 points that cannot be shattered.
   (c) Every set of d + 1 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.

2. dvc ≤ d + 1. What needs to be shown?

   (a) There is a set of d + 1 points that can be shattered.
   (b) There is a set of d + 2 points that cannot be shattered.
   (c) Every set of d + 2 points can be shattered.
   (d) Every set of d + 1 points cannot be shattered.
 ✓ (e) Every set of d + 2 points cannot be shattered.
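As an illustration of 1(a), here is a minimal sketch (Python/NumPy, not part of the slides) of the standard construction: with a bias coordinate, take x0 = 0 and xi = ei; the (d+1)×(d+1) matrix X built from these points is invertible, so every dichotomy y can be realized by solving for w.

```python
import itertools
import numpy as np

d = 4    # dimension of the input space; any d works

# Points in homogeneous coordinates (first column is the constant bias coordinate):
# row 0 is the origin, rows 1..d are the standard basis vectors e_1, ..., e_d.
X = np.hstack([np.ones((d + 1, 1)), np.vstack([np.zeros(d), np.eye(d)])])

shattered = True
for y in itertools.product([-1.0, 1.0], repeat=d + 1):   # every dichotomy of the d+1 points
    y = np.array(y)
    w = np.linalg.solve(X, y)                             # X is invertible, so Xw = y exactly
    shattered = shattered and np.array_equal(np.sign(X @ w), y)

print(shattered)   # True: all 2^(d+1) dichotomies are realized, so dvc >= d + 1
```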

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 7/22   dvc characterizes complexity in error bar →

Page 8

A Single Parameter Characterizes Complexity

Eout(g) ≤ Ein(g) + √( (1/2N) log( 2|H| / δ ) )   ← finite H

[Figure: error versus |H| for finite H. The in-sample error falls and the model-complexity penalty rises as |H| grows, so the out-of-sample error is minimized at an intermediate |H|∗.]

Eout(g) ≤ Ein(g) + √( (8/N) log( 4((2N)^dvc + 1) / δ ) )

(The square-root term is Ω(dvc), the penalty for model complexity.)

[Figure: error versus the VC dimension dvc. The in-sample error falls and the model-complexity penalty rises as dvc grows, so the out-of-sample error is minimized at an intermediate d∗vc.]
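For a feel for the numbers, the penalty term can be evaluated directly; a minimal sketch (Python, my own function name, not from the slides):

```python
import math

def vc_penalty(n, d_vc, delta):
    """Omega = sqrt((8/N) * ln(4 * ((2N)^dvc + 1) / delta))."""
    return math.sqrt((8.0 / n) * math.log(4.0 * ((2.0 * n) ** d_vc + 1.0) / delta))

# The penalty grows with dvc and shrinks only slowly with N:
for d_vc in (3, 5, 10):
    print(d_vc, [round(vc_penalty(n, d_vc, delta=0.1), 2) for n in (1000, 10000, 100000)])
```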

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 8/22   Sample complexity →

Page 9

Sample Complexity: How Many Data Points Do You Need?

Set the error bar at ε:

ε = √( (8/N) ln( 4((2N)^dvc + 1) / δ ) )

Solve for N:

N = (8/ε²) ln( 4((2N)^dvc + 1) / δ ) = O(dvc ln N)

Example. dvc = 3; error bar ε = 0.1; confidence 90% (δ = 0.1).
A simple iterative method works well. Trying N = 1000 we get

N ≈ (8/0.1²) ln( (4(2000)³ + 4) / 0.1 ) ≈ 21192.

We continue iteratively, and converge to N ≈ 30000. If dvc = 4, N ≈ 40000; for dvc = 5, N ≈ 50000.

(N ∝ dvc, but gross overestimates)

Practical Rule of Thumb: N = 10× dvc
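The iteration is easy to code. A minimal sketch (Python, my own names, not from the slides) of the fixed-point update N ← (8/ε²) ln(4((2N)^dvc + 1)/δ):

```python
import math

def sample_complexity(d_vc, eps, delta, n=1000.0, tol=1.0, max_iter=100):
    """Iterate N = (8/eps^2) * ln(4 * ((2N)^dvc + 1) / delta) to a fixed point."""
    for _ in range(max_iter):
        n_new = (8.0 / eps ** 2) * math.log(4.0 * ((2.0 * n) ** d_vc + 1.0) / delta)
        if abs(n_new - n) < tol:
            break
        n = n_new
    return n_new

for d_vc in (3, 4, 5):
    print(d_vc, round(sample_complexity(d_vc, eps=0.1, delta=0.1)))
# roughly 30000, 40000, 50000, in line with the estimates above
```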

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 9/22   Theory versus practice →

Page 10

Theory Versus Practice

The VC analysis allows us to reach outside the data for general H:
– a single parameter characterizes the complexity of H
– dvc depends only on H
– Ein can reach outside D to Eout when dvc is finite

In Practice . . .

• The VC bound is loose.

– the Hoeffding inequality is itself a worst-case bound;

– mH(N) is a worst case # of dichotomies, not average case or likely case.

– The polynomial bound on mH(N) is loose.

• It is a good guide – models with small dvc are good.

• Roughly 10× dvc examples needed to get good generalization.

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 10/22   Test set →

Page 11

The Test Set

• Another way to estimate Eout(g) is using a test set to obtain Etest(g).

• Etest is better than Ein: you don’t pay the price for fitting. You can use |H| = 1 in the Hoeffding bound with Etest.

• Both a test set and a training set have variance. The training set has an optimistic bias due to selection (fitting the data); a test set has no bias.

• The price for a test set is fewer training examples. (why is this bad?)

Etest ≈ Eout, but now Eout itself may be bad.
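With |H| = 1, the Hoeffding bound gives a concrete error bar for Etest. A minimal sketch (Python, my own function name, not from the slides):

```python
import math

def test_error_bar(n_test, delta):
    """Hoeffding with |H| = 1: |Etest - Eout| <= sqrt((1/2N) * ln(2/delta)) w.p. >= 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_test))

for n_test in (100, 1000, 10000):
    print(n_test, round(test_error_bar(n_test, delta=0.05), 3))
# e.g. 1000 held-out points pin Eout to within about 0.043 of Etest, at 95% confidence.
```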

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 11/22   Approximation versus Generalization →

Page 12

VC Bound Quantifies Approximation Versus Generalization

The best H is H = {f}. You are better off buying a lottery ticket.

dvc ↑ =⇒ better chance of approximating f (Ein ≈ 0).

dvc ↓ =⇒ better chance of generalizing to out of sample (Ein ≈ Eout).

Eout ≤ Ein + Ω(dvc).

VC analysis depends only on H. It is independent of f, P(x), and the learning algorithm.

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 12/22   Bias-variance analysis →

Page 13

Bias-Variance Analysis

Another way to quantify the tradeoff:

1. How well can the learning approximate f?
   . . . as opposed to how well the learning approximated f in-sample (Ein).

2. How close can you get to that approximation with a finite data set?
   . . . as opposed to how close Ein is to Eout.

Bias-variance analysis applies to squared error (classification and regression).

Bias-variance analysis can take into account the learning algorithm

Different learning algorithms can have different Eout when applied to the same H!

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 13/22   Sin example →

Page 14

A Simple Learning Problem

2 Data Points. 2 hypothesis sets:

H0 : h(x) = b

H1 : h(x) = ax + b

[Two figures: example fits to the two data points, one from H0 (a constant) and one from H1 (a line), plotted in the (x, y) plane.]

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 14/22   Many data sets →

Page 15

Let’s Repeat the Experiment Many Times

[Two figures: the learned hypotheses gD from many data sets overlaid in the (x, y) plane, for H0 (left) and H1 (right).]

For each data set D, you get a different gD.

So, for a fixed x, gD(x) is a random value, depending on D.

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 15/22   Average behavior →

Page 16

What’s Happening on Average

[Two figures: the average hypothesis ḡ(x) plotted against the target sin(x), for H0 (left) and H1 (right).]

We can define:

gD(x)   ← a random value, depending on D

ḡ(x) = ED[ gD(x) ] ≈ (1/K)( gD1(x) + · · · + gDK(x) )   ← your average prediction on x

var(x) = ED[ (gD(x) − ḡ(x))² ] = ED[ gD(x)² ] − ḡ(x)²   ← how variable is your prediction?
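In code, ḡ(x) and var(x) are just sample statistics over many independently drawn data sets. A minimal sketch (Python); `draw_dataset` and `learn` are hypothetical stand-ins for whatever generates D and fits gD:

```python
import numpy as np

def average_and_variance(draw_dataset, learn, x, K=10000):
    """Estimate g_bar(x) and var(x) from K independently drawn data sets.
    draw_dataset() -> a data set D;  learn(D) -> a hypothesis g_D (a callable)."""
    predictions = np.array([learn(draw_dataset())(x) for _ in range(K)])  # g_D(x), one per D
    return predictions.mean(), predictions.var()  # g_bar(x), var(x)
```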

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 16/22   Error on out-of-sample test point →

Page 17

Eout on Test Point x for Data D

[Two figures: at a test point x, the gap between gD(x) and f(x); its square is the pointwise error Eout^D(x).]

Eout^D(x) = (gD(x) − f(x))²   ← squared error, a random value depending on D

Eout(x) = ED[ Eout^D(x) ]   ← expected Eout(x) before seeing D

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 17/22   Bias-var decomposition →

Page 18

The Bias-Variance Decomposition

Eout(x) = ED[ (gD(x) − f(x))² ]
        = ED[ gD(x)² − 2 gD(x) f(x) + f(x)² ]
        = ED[ gD(x)² ] − 2 ḡ(x) f(x) + f(x)²   ← understand this; the rest is just algebra
        = ED[ gD(x)² ] − ḡ(x)² + ḡ(x)² − 2 ḡ(x) f(x) + f(x)²
        = ( ED[ gD(x)² ] − ḡ(x)² ) + ( ḡ(x) − f(x) )²
        =           var(x)         +      bias(x)

Eout(x) = bias(x) + var(x)

[Two diagrams: for a very small model, H sits far from f (large bias, small var); for a very large model, H spreads widely around f (small bias, large var).]

If you take the average over x:  Eout = bias + var

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 18/22   Back to sin example →

Page 19

Back to H0 and H1; and, our winner is . . .

[Two figures: the average hypothesis ḡ(x) against the target sin(x), for H0 (left) and H1 (right).]

H0:  bias = 0.50,  var = 0.25,  Eout = 0.75   ✓
H1:  bias = 0.21,  var = 1.69,  Eout = 1.90
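These numbers are straightforward to reproduce by simulation. A sketch under the textbook setup as I understand it (target f(x) = sin(πx), x uniform on [−1, 1], noiseless data sets of N = 2 points, both models fit by least squares); the code is mine, not the slides':

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)        # assumed target for this example
x_test = np.linspace(-1, 1, 201)       # grid for averaging bias and var over x
K, N = 10000, 2                        # number of data sets, points per data set

preds = {"H0": np.empty((K, x_test.size)), "H1": np.empty((K, x_test.size))}
for k in range(K):
    x = rng.uniform(-1, 1, N)
    y = f(x)
    preds["H0"][k] = y.mean()                       # H0: best constant fit
    a, b = np.polyfit(x, y, 1)                      # H1: best line fit
    preds["H1"][k] = a * x_test + b

for name, p in preds.items():
    g_bar = p.mean(axis=0)                          # g_bar(x) on the grid
    bias = np.mean((g_bar - f(x_test)) ** 2)        # bias, averaged over x
    var = np.mean(p.var(axis=0))                    # var, averaged over x
    print(f"{name}: bias={bias:.2f}  var={var:.2f}  Eout={bias + var:.2f}")
# Should land close to the slide: H0 ~ 0.50 / 0.25 / 0.75, H1 ~ 0.21 / 1.69 / 1.90.
```

Re-running with N = 5 should give roughly the numbers on the next page: the variance of both models drops, and H1 becomes the winner.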

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 19/22   2 versus 5 data points →

Page 20

Match Learning Power to Data, . . . Not to f

2 Data Points:

[Figures: sample fits and the average hypothesis ḡ(x) against sin(x), for H0 and H1 on 2-point data sets.]

H0:  bias = 0.50,  var = 0.25,  Eout = 0.75   ✓
H1:  bias = 0.21,  var = 1.69,  Eout = 1.90

5 Data Points:

[Figures: sample fits and the average hypothesis ḡ(x) against sin(x), for H0 and H1 on 5-point data sets.]

H0:  bias = 0.50,  var = 0.10,  Eout = 0.60
H1:  bias = 0.21,  var = 0.21,  Eout = 0.42   ✓

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 20/22   Learning curves →

Page 21

Learning Curves: When Does the Balance Tip?

[Two learning-curve plots, expected error versus number of data points N: Ein and Eout for a simple model (left) and for a complex model (right).]

Eout = Ex[ Eout(x) ]
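A learning curve can be estimated the same way as the bias and variance above: fix N, average Ein and Eout over many data sets, then sweep N. A sketch reusing the assumed sin(πx) setup with H1 fit by least squares (my code, not the slides'):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(np.pi * x)
x_test = np.linspace(-1, 1, 201)

for N in (2, 5, 10, 20, 50, 100):
    e_in, e_out = [], []
    for _ in range(2000):                              # average over many data sets of size N
        x = rng.uniform(-1, 1, N)
        y = f(x)
        a, b = np.polyfit(x, y, 1)                     # H1: least-squares line
        e_in.append(np.mean((a * x + b - y) ** 2))
        e_out.append(np.mean((a * x_test + b - f(x_test)) ** 2))
    print(f"N = {N:3d}   E[Ein] = {np.mean(e_in):.3f}   E[Eout] = {np.mean(e_out):.3f}")
# Ein rises and Eout falls toward a common asymptote (roughly the bias of H1, about 0.2),
# the qualitative shape of the learning curves sketched above.
```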

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 21/22   Decomposing the learning curve →

Page 22

Decomposing The Learning Curve

VC Analysis:

[Learning-curve plot, expected error versus N: the region under Ein is the in-sample error, and the region between Ein and Eout is the generalization error.]

Pick H that can generalize and has a good chance to fit the data.

Bias-Variance Analysis:

[Learning-curve plot, expected error versus N: Eout is split into bias plus variance, with Ein shown below.]

Pick (H, A) to approximate f and not behave wildly after seeing the data.

© AML Creator: Malik Magdon-Ismail   Approximation Versus Generalization: 22/22