Learning From Data: Lecture 7
Approximation Versus Generalization

- The VC Dimension
- Approximation Versus Generalization: Bias and Variance
- The Learning Curve

M. Magdon-Ismail, CSCI 4100/6100
Recap: The Vapnik-Chervonenkis Bound (VC Bound)

P[|Ein(g) − Eout(g)| > ε] ≤ 4 mH(2N) e^(−ε²N/8), for any ε > 0.
    (finite H: P[|Ein(g) − Eout(g)| > ε] ≤ 2|H| e^(−2ε²N))

P[|Ein(g) − Eout(g)| ≤ ε] ≥ 1 − 4 mH(2N) e^(−ε²N/8), for any ε > 0.
    (finite H: P[|Ein(g) − Eout(g)| ≤ ε] ≥ 1 − 2|H| e^(−2ε²N))

Eout(g) ≤ Ein(g) + √((8/N) log(4 mH(2N)/δ)), w.p. at least 1 − δ.
    (finite H: Eout(g) ≤ Ein(g) + √((1/(2N)) log(2|H|/δ)))

mH(N) ≤ Σ_{i=0}^{k−1} (N choose i) ≤ N^(k−1) + 1, where k is a break point.
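Not on the slide, but a quick numeric sanity check of the last bound; a minimal Python sketch using only the standard library:

```python
from math import comb

def growth_bound(N, k):
    """Sauer bound: mH(N) <= sum_{i=0}^{k-1} C(N, i) when k is a break point."""
    return sum(comb(N, i) for i in range(k))

N, k = 20, 4  # e.g. a hypothesis set with break point k = 4
print(growth_bound(N, k), "<=", N ** (k - 1) + 1)  # 1351 <= 8001
```

The binomial sum is polynomial in N (degree k − 1), which is exactly what lets the VC bound vanish as N grows.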
The VC Dimension dvc

mH(N) ∼ N^(k−1). The tightest bound is obtained with the smallest break point k∗.

Definition [VC Dimension]: dvc = k∗ − 1.

The VC dimension is the largest N that can be shattered (mH(N) = 2^N).
N ≤ dvc: H could shatter your data (H can shatter some set of N points).
N > dvc: N is a break point for H; H cannot possibly shatter your data.

mH(N) ≤ N^dvc + 1 ∼ N^dvc

Eout(g) ≤ Ein(g) + O(√(dvc log N / N))
The VC-dimension is an Effective Number of Parameters

                           mH(N) for N = 1    2    3    4    5    · · ·   #params   dvc
2-D perceptron                           2    4    8   14         · · ·      3       3
1-D positive ray                         2    3    4    5         · · ·      1       1
2-D positive rectangles                  2    4    8   16   <2^5  · · ·      4       4
positive convex sets                     2    4    8   16    32   · · ·      ∞       ∞
There are models with few parameters but infinite dvc.
There are models with redundant parameters but small dvc.
VC-dimension of the Perceptron in R^d is d + 1

This can be shown in two steps:

1. dvc ≥ d + 1. What needs to be shown?
   ✓ (a) There is a set of d + 1 points that can be shattered.
     (b) There is a set of d + 1 points that cannot be shattered.
     (c) Every set of d + 1 points can be shattered.
     (d) Every set of d + 1 points cannot be shattered.

2. dvc ≤ d + 1. What needs to be shown?
     (a) There is a set of d + 1 points that can be shattered.
     (b) There is a set of d + 2 points that cannot be shattered.
     (c) Every set of d + 2 points can be shattered.
     (d) Every set of d + 1 points cannot be shattered.
   ✓ (e) Every set of d + 2 points cannot be shattered.
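A small numerical illustration of step 1 (not from the slides): in R^d, take the origin together with the d standard basis vectors. With a bias coordinate prepended, these d + 1 points form an invertible (d + 1) × (d + 1) matrix X, so every dichotomy y can be realized exactly by solving Xw = y. A sketch with numpy:

```python
import numpy as np
from itertools import product

d = 3
# The origin and the d standard basis vectors, each with a leading 1
# for the bias coordinate: an invertible (d+1) x (d+1) matrix X.
X = np.hstack([np.ones((d + 1, 1)), np.vstack([np.zeros(d), np.eye(d)])])

# For every dichotomy y in {-1,+1}^(d+1), solve X w = y exactly;
# then sign(X w) = y, i.e. the perceptron with weights w realizes y.
shattered = all(
    np.array_equal(np.sign(X @ np.linalg.solve(X, np.array(y))), np.array(y))
    for y in product([-1.0, 1.0], repeat=d + 1)
)
print(f"d = {d}: all 2^{d + 1} dichotomies realized ->", shattered)  # True
```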
A Single Parameter Characterizes Complexity

Eout(g) ≤ Ein(g) + √((1/(2N)) log(2|H|/δ))

[Figure: error versus |H|. The in-sample error falls and the model-complexity penalty rises as |H| grows; the out-of-sample error is minimized at some |H|∗.]

Eout(g) ≤ Ein(g) + √((8/N) log(4((2N)^dvc + 1)/δ))

The square-root term is Ω(dvc), the penalty for model complexity.

[Figure: the same tradeoff with the VC dimension dvc on the horizontal axis; the out-of-sample error is minimized at some d∗vc.]
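A quick sketch of how slowly this penalty shrinks with N, assuming the natural logarithm in the bound:

```python
import numpy as np

def vc_penalty(N, dvc, delta=0.1):
    """VC error bar: sqrt((8/N) * ln(4 * ((2N)^dvc + 1) / delta))."""
    return np.sqrt(8.0 / N * np.log(4.0 * ((2.0 * N) ** dvc + 1.0) / delta))

for N in (100, 1_000, 10_000, 100_000):
    print(N, round(vc_penalty(N, dvc=3), 3))
```

The bar decays only like √(dvc · log N / N), which is why the next slide's sample-size estimates come out so large.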
Sample Complexity: How Many Data Points Do You Need?

Set the error bar at ε:

ε = √((8/N) ln(4((2N)^dvc + 1)/δ))

Solve for N:

N = (8/ε²) ln(4((2N)^dvc + 1)/δ) = O(dvc ln N)

Example. dvc = 3; error bar ε = 0.1; confidence 90% (δ = 0.1). A simple iterative method works well (see the sketch after this slide). Trying N = 1000, we get

N ≈ (8/0.1²) ln(4(2000³ + 1)/0.1) ≈ 21192.

We continue iteratively and converge to N ≈ 30000. For dvc = 4, N ≈ 40000; for dvc = 5, N ≈ 50000.
(N ∝ dvc, but these are gross overestimates.)

Practical Rule of Thumb: N = 10 × dvc
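The iteration above is a simple fixed-point computation; a minimal sketch using only the standard library:

```python
import math

def sample_complexity(dvc, eps=0.1, delta=0.1, n=1000.0):
    """Iterate N <- (8/eps^2) * ln(4*((2N)^dvc + 1)/delta) to a fixed point."""
    for _ in range(100):
        n_next = 8.0 / eps**2 * math.log(4.0 * ((2.0 * n) ** dvc + 1.0) / delta)
        if abs(n_next - n) < 1.0:   # converged to within one example
            return n_next
        n = n_next
    return n

for dvc in (3, 4, 5):
    # converges near the slide's estimates of 30000, 40000, 50000
    print(dvc, round(sample_complexity(dvc)))
```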
Theory Versus Practice

The VC analysis allows us to reach outside the data for a general H:
– a single parameter, dvc, characterizes the complexity of H;
– dvc depends only on H;
– Ein can reach outside D to Eout when dvc is finite.

In Practice . . .

• The VC bound is loose:
  – the Hoeffding inequality is itself a worst-case bound;
  – mH(N) is a worst-case number of dichotomies, not the average or likely case;
  – the polynomial bound on mH(N) is loose.
• It is a good guide: models with small dvc are good.
• Roughly 10 × dvc examples are needed to get good generalization.
The Test Set

• Another way to estimate Eout(g) is to use a test set, which gives Etest(g).
• Etest is a better estimate than Ein: you don't pay the price for fitting, so you can use |H| = 1 in the Hoeffding bound with Etest.
• Both the test and training estimates have variance, but the training estimate also has an optimistic bias due to selection (fitting the data). A test set has no bias.
• The price for a test set is fewer training examples. (Why is this bad?) Etest ≈ Eout, but now Etest itself may be bad, because g was learned from less data.
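With |H| = 1, the Hoeffding bound gives a tight, model-independent error bar on Etest; a minimal sketch:

```python
import math

def test_error_bar(n_test, delta=0.05):
    """Hoeffding with |H| = 1: Eout <= Etest + sqrt(ln(2/delta) / (2*n_test)),
    with probability at least 1 - delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_test))

for n in (100, 1_000, 10_000):
    print(n, round(test_error_bar(n), 3))  # 0.136, 0.043, 0.014
```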
VC Bound Quantifies Approximation Versus Generalization

The best H would be H = {f}; if you are hoping for that, you are better off buying a lottery ticket.

dvc ↑ =⇒ better chance of approximating f (Ein ≈ 0).
dvc ↓ =⇒ better chance of generalizing out of sample (Ein ≈ Eout).

Eout ≤ Ein + Ω(dvc).

VC analysis depends only on H; it is independent of f, P(x), and the learning algorithm.
Bias-Variance Analysis

Another way to quantify the tradeoff:

1. How well can the learning approximate f?
   (. . . as opposed to how well the learning approximated f in-sample, via Ein.)
2. How close can you get to that approximation with a finite data set?
   (. . . as opposed to how close Ein is to Eout.)

Bias-variance analysis applies to squared error (classification and regression).
Bias-variance analysis can take the learning algorithm into account: different learning algorithms can have different Eout when applied to the same H!
A Simple Learning Problem
2 data points; 2 hypothesis sets:

H0: h(x) = b
H1: h(x) = ax + b

[Figure: the two data points fit by a constant (H0, left) and by a line (H1, right).]
Let’s Repeat the Experiment Many Times
[Figure: many data sets overlaid; the resulting fits gD for H0 (left) and H1 (right).]

For each data set D, you get a different gD.
So, for a fixed x, gD(x) is a random value, depending on D.
What’s Happening on Average
[Figure: the average fit ḡ(x) against the target sin(x), for H0 (left) and H1 (right).]

We can define:

gD(x) ← a random value, depending on D

ḡ(x) = ED[gD(x)] ≈ (1/K)(gD1(x) + · · · + gDK(x)) ← your average prediction on x

var(x) = ED[(gD(x) − ḡ(x))²] = ED[gD(x)²] − ḡ(x)² ← how variable is your prediction?
Eout on Test Point x for Data D
[Figure: at a test point x, the squared gap between gD(x) and f(x); two panels for two different data sets D.]

EDout(x) = (gD(x) − f(x))² ← squared error, a random value depending on D

Eout(x) = ED[EDout(x)] ← expected out-of-sample error at x, before seeing D
The Bias-Variance Decomposition
Eout(x) = ED[(gD(x) − f(x))²]
        = ED[gD(x)² − 2 gD(x) f(x) + f(x)²]
        = ED[gD(x)²] − 2 ḡ(x) f(x) + f(x)²      ← understand this step; the rest is just algebra
        = ED[gD(x)²] − ḡ(x)² + ḡ(x)² − 2 ḡ(x) f(x) + f(x)²
        = (ED[gD(x)²] − ḡ(x)²) + (ḡ(x) − f(x))²
        =          var(x)      +     bias(x)

Eout(x) = bias(x) + var(x)

[Figure: a very small model's H sits far from f (large bias, small var); a very large model's H surrounds f (small bias, large var).]

If you take the average over x: Eout = bias + var.
Back to H0 and H1; and, our winner is . . .
[Figure: average fits ḡ(x) with variance bands against sin(x), for H0 (left) and H1 (right).]

H0: bias = 0.50, var = 0.25, Eout = 0.75 ✓
H1: bias = 0.21, var = 1.69, Eout = 1.90
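The slides' plots label the target sin(x); assuming the textbook's setup (f(x) = sin(πx), x uniform on [−1, 1], two-point data sets), a Monte Carlo sketch reproduces these numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)      # assumed target, as in the LFD text
xs = np.linspace(-1, 1, 401)         # grid for averaging over x
K = 10_000                           # number of data sets

X = rng.uniform(-1, 1, size=(K, 2))  # K data sets of 2 points each
Y = f(X)

# H0: constant b = midpoint of the two y's; H1: the line through both points.
g0 = Y.mean(axis=1)[:, None] * np.ones_like(xs)          # shape (K, len(xs))
a = (Y[:, 1] - Y[:, 0]) / (X[:, 1] - X[:, 0])
b = Y[:, 0] - a * X[:, 0]
g1 = a[:, None] * xs + b[:, None]

for name, g in (("H0", g0), ("H1", g1)):
    gbar = g.mean(axis=0)                  # average hypothesis g-bar(x)
    bias = ((gbar - f(xs)) ** 2).mean()    # averaged over the grid of x
    var = g.var(axis=0).mean()
    print(name, round(bias, 2), round(var, 2), round(bias + var, 2))
# expected: H0 -> 0.50 0.25 0.75,  H1 -> 0.21 1.69 1.90
```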
Match Learning Power to Data, . . . Not to f
2 Data Points:

[Figure: fits and average fits for H0 and H1 on 2-point data sets.]

H0: bias = 0.50, var = 0.25, Eout = 0.75 ✓
H1: bias = 0.21, var = 1.69, Eout = 1.90

5 Data Points:

[Figure: fits and average fits for H0 and H1 on 5-point data sets.]

H0: bias = 0.50, var = 0.10, Eout = 0.60
H1: bias = 0.21, var = 0.21, Eout = 0.42 ✓
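The earlier Monte Carlo sketch extends directly to 5-point data sets; under the same assumed setup, H1 is now fit by least squares and its variance collapses:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)     # assumed target, as before
xs = np.linspace(-1, 1, 401)
K, N = 10_000, 5                    # 10,000 data sets of 5 points each

g0 = np.empty((K, xs.size))
g1 = np.empty((K, xs.size))
for k in range(K):
    x = rng.uniform(-1, 1, N)
    y = f(x)
    g0[k] = y.mean()                # H0: best constant (least squares)
    a, b = np.polyfit(x, y, 1)      # H1: least-squares line
    g1[k] = a * xs + b

for name, g in (("H0", g0), ("H1", g1)):
    gbar = g.mean(axis=0)
    bias = ((gbar - f(xs)) ** 2).mean()
    var = g.var(axis=0).mean()
    print(name, round(bias, 2), round(var, 2), round(bias + var, 2))
# expected: H0 -> 0.50 0.10 0.60,  H1 -> 0.21 0.21 0.42
```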
Learning Curves: When Does the Balance Tip?
[Figure: expected error versus number of data points N. Simple model (left): Ein and Eout converge quickly, to a relatively high asymptote. Complex model (right): the curves converge more slowly, to a lower asymptote.]

Eout = Ex[Eout(x)]
Decomposing The Learning Curve
[Figure: two decompositions of the learning curve of expected error versus N. VC analysis (left): Eout = in-sample error + generalization error. Bias-variance analysis (right): Eout = bias + variance.]

VC analysis: pick H that can generalize and has a good chance to fit the data.
Bias-variance analysis: pick (H, A) to approximate f and not behave wildly after seeing the data.
© AML. Creator: Malik Magdon-Ismail.