Machine Learning 10-701, Carnegie Mellon School of Computer Science (Tom Mitchell): PAC Learning slides
TRANSCRIPT
Slide 1
Machine Learning 10-701 Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
February 24, 2011
Today:
• Computational Learning Theory
• PAC learning theorem
• VC dimension
Recommended reading:
• Mitchell: Ch. 7
• suggested exercises: 7.1, 7.2, 7.7
* see Annual Conference on Learning Theory (COLT)
Slide 2
Function Approximation: The Big Picture
Target concept is the boolean-valued function to be learned:
c: X → {0,1}
Slide 3
Slide 4
Training examples D: instances drawn at random from probability distribution P(x).
Can we bound the true error of h, error_true(h), in terms of its training error, error_train(h)?
Slide 5
Can we bound error_true(h) in terms of error_train(h)?
If D were a set of examples drawn from P(x) and independent of h, then we could use standard statistical confidence intervals to determine that, with 95% probability, error_true(h) lies in the interval:
error_train(h) ± 1.96 · sqrt( error_train(h) (1 − error_train(h)) / |D| )
but D is the training data for h ….
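The interval mentioned above can be computed directly. This is a sketch of the standard normal approximation to the binomial (the 1.96 factor gives roughly 95% coverage); the numbers below are made up for illustration, and the whole calculation assumes D really is independent of h, which is exactly what fails for training data:

```python
import math

def error_confidence_interval(error_hat, m, z=1.96):
    """~95% confidence interval for error_true(h), given the observed
    error rate of h on m examples drawn independently of h."""
    half_width = z * math.sqrt(error_hat * (1 - error_hat) / m)
    return max(0.0, error_hat - half_width), min(1.0, error_hat + half_width)

# e.g. h makes 40 mistakes on 400 held-out examples
lo, hi = error_confidence_interval(0.10, m=400)
print(f"error_true(h) in [{lo:.3f}, {hi:.3f}] with ~95% probability")
```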
c: X → {0,1}
Slide 6
Function Approximation: The Big Picture
Goal: output a hypothesis whose true error is less than ε.
Slide 7
Any(!) learner that outputs a hypothesis consistent with all training examples (i.e., an h contained in VS_{H,D})
Slide 8
What it means
[Haussler, 1988]: the probability that the version space is not ε-exhausted after m training examples is at most |H| e^{−εm}.
1. How many training examples suffice? Suppose we want this probability to be at most δ:
   |H| e^{−εm} ≤ δ   ⇒   m ≥ (1/ε)(ln |H| + ln(1/δ))
2. If m ≥ (1/ε)(ln |H| + ln(1/δ)), then with probability at least (1 − δ), every consistent h satisfies error_true(h) ≤ ε.
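The algebra behind step 1, solving the Haussler bound for m, spelled out:

```latex
|H|\, e^{-\epsilon m} \le \delta
\;\iff\; e^{-\epsilon m} \le \frac{\delta}{|H|}
\;\iff\; -\epsilon m \le \ln\delta - \ln|H|
\;\iff\; m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
```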
Slide 9
Example: H is Conjunction of Boolean Literals
Consider classification problem f: X → Y:
• instances: X = <X1, X2, X3, X4> where each Xi is boolean
• learned hypotheses are rules of the form:
  – IF <X1, X2, X3, X4> = <0, ?, 1, ?> THEN Y = 1, ELSE Y = 0
  – i.e., rules constrain any subset of the Xi
How many training examples m suffice to assure that with probability at least 0.99, any consistent learner will output a hypothesis with true error at most 0.05?
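A worked answer for the conjunction example, as a sketch: each of the 4 boolean attributes can be required to be 0, required to be 1, or left unconstrained ("?"), which gives |H| = 3^4 = 81; "probability at least 0.99" means δ = 0.01 and "true error at most 0.05" means ε = 0.05.

```python
import math

# |H| = 3^4: each attribute constrained to 0, to 1, or left free ("?")
h_size = 3 ** 4

# m >= (1/epsilon) * (ln|H| + ln(1/delta))
epsilon, delta = 0.05, 0.01
m = math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)
print(m)  # 180
```

So roughly 180 examples suffice for this (small) hypothesis space.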
Example: H is Decision Tree with depth=2
Consider classification problem f: X → Y:
• instances: X = <X1 … XN> where each Xi is boolean
• learned hypotheses are decision trees of depth 2, using only two variables
How many training examples m suffice to assure that with probability at least 0.99, any consistent learner will output a hypothesis with true error at most 0.05?
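The same bound applies once |H| is counted. One plausible count, stated here as an assumption for illustration (the slide leaves the count to be worked out): choose which 2 of the N variables the tree uses, then label each of the 2×2 = 4 leaves with 0 or 1, giving |H| = C(N,2) · 2^4. The value N = 100 below is hypothetical.

```python
import math

N = 100  # hypothetical number of boolean attributes
# assumed count: choose 2 of N variables, then 2^4 leaf labelings
h_size = math.comb(N, 2) * 2 ** 4

epsilon, delta = 0.05, 0.01
m = math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)
print(h_size, m)  # 79200 318
```

Note how weakly m depends on |H|: the hypothesis space grew from 81 to 79200, yet the ln|H| term keeps the required sample size modest.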
Slide 10
Sufficient condition for PAC-learnability: holds if learner L requires only a polynomial number of training examples, and the processing time per example is polynomial.
Slide 11
true error ≤ training error + degree of overfitting
note: ε here is the difference between the training error and the true error
Additive Hoeffding Bounds – Agnostic Learning
• Given m independent flips of a coin with true Pr(heads) = θ, we can bound the error in the maximum likelihood estimate θ̂:
  Pr(θ > θ̂ + ε) ≤ e^{−2mε²}
• Relevance to agnostic learning: for any single hypothesis h,
  Pr( error_true(h) > error_train(h) + ε ) ≤ e^{−2mε²}
• But we must consider all hypotheses in H:
  Pr( (∃h ∈ H) error_true(h) > error_train(h) + ε ) ≤ |H| e^{−2mε²}
• So, with probability at least (1 − δ), every h satisfies
  error_true(h) ≤ error_train(h) + sqrt( (ln |H| + ln(1/δ)) / (2m) )
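The resulting uniform gap is easy to evaluate. A sketch, with made-up numbers (|H| = 81 as in the conjunction example, m = 1000 training examples, δ = 0.01):

```python
import math

def agnostic_gap(h_size, m, delta):
    """With probability >= 1 - delta, every h in H satisfies
    error_true(h) <= error_train(h) + this gap (union bound + Hoeffding)."""
    return math.sqrt((math.log(h_size) + math.log(1 / delta)) / (2 * m))

gap = agnostic_gap(h_size=81, m=1000, delta=0.01)
print(f"{gap:.4f}")
```

The gap shrinks as 1/sqrt(m): quadrupling the data halves the bound on overfitting.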
Slide 12
General Hoeffding Bounds
• When estimating a parameter θ inside [a, b] from m examples:
  Pr( |θ̂ − θ| ≥ ε ) ≤ 2 e^{−2mε² / (b−a)²}
• When estimating a probability, θ is inside [0, 1], so:
  Pr( |θ̂ − θ| ≥ ε ) ≤ 2 e^{−2mε²}
• And if we're interested in only one-sided error, then:
  Pr( θ > θ̂ + ε ) ≤ e^{−2mε²}
Question: If H = {h | h: X → Y} is infinite, what measure of complexity should we use in place of |H|?
Slide 13
Answer: the VC dimension of H: the size of the largest subset of X that H can shatter, i.e., for which H can guarantee zero training error regardless of the target function c.
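The definition above can be checked by brute force on a tiny hypothesis space. A sketch: take H to be the conjunction rules from the earlier example, over n = 2 boolean attributes (each attribute required to be 0, required to be 1, or left free, so h(x) = 1 iff x matches the pattern), and find the largest subset of X on which H realizes every possible labeling. The VC dimension of boolean conjunctions is n, so we expect 2 here.

```python
from itertools import combinations, product

n = 2
patterns = list(product('01?', repeat=n))      # H: 3^n conjunction rules
points = list(product((0, 1), repeat=n))       # X: all 2^n instances

def h(pattern, x):
    # h(x) = 1 iff x satisfies every non-'?' constraint of the pattern
    return all(p == '?' or p == str(xi) for p, xi in zip(pattern, x))

def shattered(subset):
    # H shatters the subset iff all 2^|subset| labelings are realized
    labelings = {tuple(h(p, x) for x in subset) for p in patterns}
    return len(labelings) == 2 ** len(subset)

vc = max(k for k in range(len(points) + 1)
         if any(shattered(s) for s in combinations(points, k)))
print(vc)  # 2
```

The same brute-force check works for any finite H over a finite X, though it is exponential and only meant to illustrate the definition.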