T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi:
General Conditions for Predictivity in Learning Theory
Michael Pfeiffer
25.11.2004
Motivation
Supervised Learning: learn functional relationships from a finite set of labelled training examples
Generalization: How well does the learned function perform on unseen test examples?
This is the central question in supervised learning.
What you will hear
New idea: Stability implies predictivity. A learning algorithm is stable if small perturbations of the training set do not change the hypothesis much.
Conditions are placed on the learning map rather than on the hypothesis space, in contrast to VC-analysis.
Agenda
Introduction Problem Definition Classical Results Stability Criteria Conclusion
Some Definitions 1/2
Training Data: S = {z_1 = (x_1, y_1), …, z_n = (x_n, y_n)} ⊂ Z = X × Y, drawn from an unknown distribution µ(x, y)
Hypothesis Space: H; hypothesis f_S ∈ H, f_S: X → Y
Learning Algorithm: a map L: Z^n → H with f_S = L(S) = L((x_1, y_1), …, (x_n, y_n))
Regression: f_S is real-valued / Classification: f_S is binary; L is assumed to be symmetric (the ordering of the training examples is irrelevant)
Some Definitions 2/2
Loss Function: V: H × Z → ℝ, e.g. the squared loss V(f, z) = (f(x) − y)²
Assume that V is bounded
Empirical Error (Training Error): I_S[f] = (1/n) Σ_{i=1}^n V(f, z_i)
Expected Error (True Error): I[f] = ∫_Z V(f, z) dµ(z)
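The two error notions above can be made concrete in a few lines. The following sketch (names and the toy distribution are my own, chosen for illustration) computes the empirical error I_S[f] exactly and approximates the expected error I[f] by Monte-Carlo sampling from µ:

```python
import random

def v_sq(f, z):
    """Squared loss V(f, z) = (f(x) - y)^2 for a sample z = (x, y)."""
    x, y = z
    return (f(x) - y) ** 2

def empirical_error(f, S):
    """Empirical error I_S[f] = (1/n) * sum_i V(f, z_i) over the training set S."""
    return sum(v_sq(f, z) for z in S) / len(S)

def expected_error(f, sample_mu, m=100_000):
    """Monte-Carlo estimate of the expected error I[f] = integral V(f, z) dmu(z)."""
    return sum(v_sq(f, sample_mu()) for _ in range(m)) / m

# Toy distribution (assumed for illustration): x ~ U[0, 1], y = 2x, no noise.
random.seed(0)

def sample_mu():
    x = random.random()
    return (x, 2 * x)

f = lambda x: 2 * x            # hypothesis matching the target exactly
S = [sample_mu() for _ in range(10)]
print(empirical_error(f, S))   # 0.0 - zero training error, and I[f] = 0 as well
```

Since the hypothesis coincides with the noise-free target, both errors are exactly zero here; for any other hypothesis the two quantities differ, and generalization is precisely the statement that they stay close.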
Generalization and Consistency
Convergence in Probability: X_n → X in probability iff ∀ε > 0: lim_{n→∞} P{|X_n − X| > ε} = 0
Generalization: performance on the training examples must be a good indicator of performance on future examples: lim_{n→∞} |I_S[f_S] − I[f_S]| = 0 in probability
Consistency: the expected error converges to the most accurate one in H: ∀ε > 0: lim_{n→∞} P_µ{ I[f_S] − inf_{f∈H} I[f] > ε } = 0
Agenda
Introduction Problem Definition Classical Results Stability Criteria Conclusion
Empirical Risk Minimization (ERM)
Focus of classical learning theory research: exact and almost-exact ERM
Minimize the training error over H, i.e. take the best hypothesis on the training data: I_S[f_S] = min_{f∈H} I_S[f]
For ERM: Generalization ⇔ Consistency
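Exact ERM over a finite hypothesis space can be written down directly as a minimum over training errors. A minimal sketch (the threshold-classifier space and all names are my own illustration):

```python
def erm(H, S, V):
    """Exact ERM: return the hypothesis in H with minimal training error I_S[f]."""
    return min(H, key=lambda f: sum(V(f, z) for z in S) / len(S))

# Finite hypothesis space of threshold classifiers f_t(x) = 1 if x >= t else 0.
H = [lambda x, t=t: 1 if x >= t else 0 for t in (0.0, 0.25, 0.5, 0.75)]
V = lambda f, z: (f(z[0]) - z[1]) ** 2
S = [(0.1, 0), (0.2, 0), (0.6, 1), (0.9, 1)]  # labels consistent with threshold 0.5
f_S = erm(H, S, V)
print([f_S(x) for x, _ in S])  # [0, 0, 1, 1] - zero training error
```

Note that `min` breaks ties arbitrarily among empirical minimizers; the classical theory asks when any such choice generalizes.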
What algorithms are ERM? All of these belong to the class of ERM algorithms: Least Squares Regression, Decision Trees, ANN Backpropagation (?), ...
Are all learning algorithms ERM? NO! Support Vector Machines, k-Nearest Neighbour, Bagging, Boosting, Regularization, ...
Vapnik asked
What property must the hypothesis space H have to ensure good generalization of ERM?
Classical Results for ERM¹
Theorem: A necessary and sufficient condition for generalization and consistency of ERM is that H is a uniform Glivenko-Cantelli (uGC) class: the empirical means of the functions induced by H and V converge to their true expected values, uniformly over H and in probability:
∀ε > 0: lim_{n→∞} sup_µ P_µ{ sup_{f∈H} | (1/n) Σ_{i=1}^n f(x_i) − ∫_X f(x) dµ(x) | > ε } = 0
¹ e.g. Alon, Ben-David, Cesa-Bianchi, Haussler: Scale-sensitive dimensions, uniform convergence and learnability, Journal of the ACM 44(4), 1997
VC-Dimension
Binary functions f: X → {0, 1}
VC-dim(H) = size of the largest finite set in X that can be shattered by H
e.g. linear separation in 2D yields VC-dim = 3
Theorem: Let H be a class of binary-valued hypotheses; then H is a uGC class if and only if VC-dim(H) is finite¹.
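The shattering claim for linear separation in 2D can be checked by brute force: for three non-collinear points, every one of the 2³ labelings is realized by some linear classifier, so VC-dim ≥ 3. A small sketch (the point set and the grid of candidate separators are my own choices; a small grid happens to suffice for these points):

```python
from itertools import product

# Three non-collinear points in the plane.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

def separable(labels):
    """Check whether some linear classifier sign(w.x + b) realizes `labels`.
    A coarse grid over (w1, w2, b) is enough for these three points."""
    for w1, w2, b in product((-1, 0, 1), (-1, 0, 1), (-0.5, 0.5)):
        preds = tuple(1 if w1 * x + w2 * y + b > 0 else -1 for x, y in pts)
        if preds == labels:
            return True
    return False

# All 2^3 = 8 labelings are realized, so the set is shattered: VC-dim >= 3.
shattered = all(separable(lab) for lab in product((-1, 1), repeat=3))
print(shattered)  # True
```

The matching upper bound (no four points in the plane can be shattered by lines) is what pins VC-dim down to exactly 3.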
Achievements of Classical Learning Theory
Complete characterization of the necessary and sufficient conditions for generalization and consistency of ERM
Remaining questions: What about non-ERM algorithms? Can we establish criteria for the learning algorithm rather than only for the hypothesis space?
Agenda
Introduction Problem Definition Classical Results Stability Criteria Conclusion
Poggio et al. asked
What property must the learning map L have for good generalization of general algorithms?
Can a new theory subsume the classical results for ERM?
Stability
Small perturbations of the training set should not change the hypothesis much, especially deleting one training example: Sⁱ = S \ {z_i}
How can this be mathematically defined?
[Diagram: the learning map takes the original training set S and the perturbed training set Sⁱ to nearby hypotheses in the hypothesis space]
Uniform Stability¹
A learning algorithm L is uniformly stable if
∀S ∈ Z^n, ∀i ∈ {1, …, n}: sup_{z∈Z} |V(f_S, z) − V(f_{Sⁱ}, z)| ≤ K/n
After deleting one training sample, the change in loss must be small at all points z ∈ Z.
Uniform stability implies generalization, but the requirement is too strong: most algorithms (e.g. ERM) are not uniformly stable.
¹ Bousquet, Elisseeff: Stability and Generalization, JMLR 2, 2002
CVloo Stability¹
Cross-validation leave-one-out stability:
considers only the errors at the removed training points
strictly weaker than uniform stability
lim_{n→∞} sup_{i∈{1,…,n}} |V(f_{Sⁱ}, z_i) − V(f_S, z_i)| = 0 in probability
[Diagram: remove z_i from the training set and measure the change of the error at x_i]
1 Mukherjee, Niyogi, Poggio, Rifkin: Statistical Learning: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, MIT 2003
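CVloo stability can be probed empirically. The sketch below (my own toy setup) uses the simplest ERM algorithm imaginable, the constant predictor equal to the mean of the training labels, and measures how much the loss at a removed point z_i changes when z_i is left out; for this algorithm the change shrinks on the order of 1/n:

```python
import random

def fit_mean(S):
    """ERM over constant hypotheses: f_S(x) = mean of the training labels."""
    m = sum(y for _, y in S) / len(S)
    return lambda x: m

V = lambda f, z: (f(z[0]) - z[1]) ** 2
random.seed(1)

deltas = []
for n in (10, 100, 1000):
    S = [(random.random(), random.random()) for _ in range(n)]
    f_S = fit_mean(S)
    # CVloo stability: largest change of the loss at the removed point z_i.
    delta = max(abs(V(fit_mean(S[:i] + S[i + 1:]), S[i]) - V(f_S, S[i]))
                for i in range(n))
    deltas.append(delta)
    print(n, round(delta, 5))  # shrinks roughly like 1/n
```

Removing one point shifts the mean by (m − y_i)/(n − 1), so the leave-one-out loss change vanishes as n grows, which is exactly the CVloo condition.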
Equivalence for ERM¹
Theorem: For good loss functions, the following statements are equivalent for ERM:
L is distribution-independent CVloo stable
ERM generalizes and is universally consistent
H is a uGC class
Question: Does CVloo stability ensure generalization for all learning algorithms?
CVloo Counterexample¹
Let X be uniform on [0, 1], Y ∈ {−1, +1}, and let the target be f*(x) = 1. Learning algorithm L:
f_S(x) = 1 if x is a training point, −1 otherwise
f_{Sⁱ}(x) = f_S(x) if x = x_i, and as defined above otherwise
No change at the removed training point ⇒ L is CVloo stable, yet the algorithm does not generalize at all: the training error is 0 while the expected error is maximal.
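The failure of generalization in this counterexample is easy to see numerically: the hypothesis memorizes the training points, so the training error is exactly 0, while fresh samples from the uniform distribution are almost surely not training points and incur the maximal loss. A minimal simulation (constants and names are my own):

```python
import random

random.seed(2)
# Sketch of the counterexample: x ~ U[0, 1], target y = 1 everywhere.
n = 50
S = [(random.random(), 1) for _ in range(n)]
train_x = {x for x, _ in S}
f_S = lambda x: 1 if x in train_x else -1   # +1 on training points, -1 elsewhere

V = lambda f, z: (f(z[0]) - z[1]) ** 2 / 4  # bounded loss, 0 or 1 here
emp = sum(V(f_S, z) for z in S) / n                      # training error
test = [(random.random(), 1) for _ in range(10_000)]
exp = sum(V(f_S, z) for z in test) / len(test)           # estimate of I[f_S]
print(emp, exp)  # 0.0 1.0 - the generalization gap is maximal
```

Since the training set has measure zero in [0, 1], I_S[f_S] = 0 while I[f_S] = 1 for every n, so |I_S[f_S] − I[f_S]| never converges to 0.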
Additional Stability Criteria
Error (Eloo) stability: lim_{n→∞} sup_{i∈{1,…,n}} |I[f_S] − I[f_{Sⁱ}]| = 0 in probability
Empirical Error (EEloo) stability: lim_{n→∞} sup_{i∈{1,…,n}} |I_S[f_S] − I_{Sⁱ}[f_{Sⁱ}]| = 0 in probability
Weak conditions, satisfied by most reasonable learning algorithms (e.g. ERM)
Not sufficient for generalization on their own
CVEEEloo Stability
A learning map L is CVEEEloo stable if it is CVloo stable, Eloo stable, and EEloo stable.
Question: Does this imply generalization for all L?
CVEEEloo implies Generalization¹
Theorem: If L is CVEEEloo stable and the loss function is bounded, then fS generalizes
Remarks: none of the three conditions (CVloo, Eloo, EEloo) is sufficient by itself, and Eloo plus EEloo stability together are not sufficient either
For ERM, CVloo stability alone is necessary and sufficient for generalization and consistency
Consistency
CVEEEloo stability in general does NOT guarantee consistency
Good generalization does NOT necessarily mean good prediction, but poor expected performance will then be indicated by poor training performance
CVEEEloo stable algorithms
Support Vector Machines and Regularization
k-Nearest Neighbour (with k increasing with n)
Bagging (with the number of regressors increasing with n)
More results to come (e.g. AdaBoost)
For some of these algorithms a 'VC-style' analysis is impossible (e.g. k-NN), yet for all of them generalization is guaranteed by the theorems above!
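The k-NN entry in the list above can be illustrated with a leave-one-out experiment (a toy sketch; data and names are my own assumptions). For k-NN regression with labels in [0, 1), removing the queried training point can change the prediction by at most 1/k, so larger k means more stability:

```python
import random

def knn_predict(S, x, k):
    """k-nearest-neighbour regression: average the labels of the k closest points."""
    nearest = sorted(S, key=lambda z: abs(z[0] - x))[:k]
    return sum(y for _, y in nearest) / k

random.seed(3)
S = [(random.random(), random.random()) for _ in range(200)]
x0 = S[0][0]
changes = {}
for k in (1, 5, 25):
    p = knn_predict(S, x0, k)
    p_loo = knn_predict(S[1:], x0, k)  # leave out the queried training point
    changes[k] = abs(p - p_loo)
    print(k, round(changes[k], 4))     # bounded by 1/k since labels lie in [0, 1)
```

The two neighbour sets differ in exactly one point, so |p − p_loo| = |y_0 − y_(k)| / k < 1/k; letting k grow with n is what drives this leave-one-out change to zero.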
Agenda
Introduction Problem Definition Classical Results Stability Criteria Conclusion
Implications
Classical "VC-style" conditions correspond to Occam's Razor: prefer simple hypotheses
CVloo stability corresponds to incremental change: online algorithms
Inverse problems: stability corresponds to well-posedness; condition numbers characterize stability
Stability-based learning may have more direct connections with the brain's learning mechanisms, since it is a condition on the learning machinery
Language Learning
Goal: learn grammars from sentences. Hypothesis space: the class of all learnable grammars.
What is easier to characterize and gives more insight into real language learning: the language learning algorithm, or the class of all learnable grammars?
A focus on algorithms shifts the focus to stability.
Conclusion
Stability implies generalization: intuitive (CVloo) and technical (Eloo, EEloo) criteria
The theory subsumes the classical ERM results and provides generalization criteria also for non-ERM algorithms
Restrictions are placed on the learning map rather than on the hypothesis space
A new approach for designing learning algorithms
Open Questions
Easier or other necessary and sufficient conditions for generalization
Conditions for general consistency
Tight bounds for sample complexity
Applications of the theory to new algorithms
Stability proofs for existing algorithms
Thank you!
Sources
T. Poggio, R. Rifkin, S. Mukherjee, P. Niyogi: General conditions for predictivity in learning theory, Nature 428, pp. 419-422, 2004
S. Mukherjee, P. Niyogi, T. Poggio, R. Rifkin: Statistical Learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, AI Memo 2002-024, MIT, 2003
T. Mitchell: Machine Learning, McGraw-Hill, 1997
C. Tomasi: Past performance and future results, Nature 428, p. 378, 2004
N. Alon, S. Ben-David, N. Cesa-Bianchi, D. Haussler: Scale-sensitive Dimensions, Uniform Convergence, and Learnability, Journal of the ACM 44(4), 1997