Lecture 3: Statistical Decision Theory (Part II)
Wenbin Lu
Department of Statistics, North Carolina State University
Fall 2019
Outline of This Note
Part I: Statistical Decision Theory (Classical Statistical Perspective: "Estimation")
loss and risk
MSE and bias-variance tradeoff
Bayes risk and minimax risk
Part II: Learning Theory for Supervised Learning (Machine Learning Perspective: "Prediction")
optimal learner
empirical risk minimization
restricted estimators
Loss, Risk, and Optimal Learner
Supervised Learning
Any supervised learning problem has three components:
Input vector $X \in R^d$.
Output $Y$, either discrete- or continuous-valued.
Probability framework: $(X, Y) \sim P(X, Y)$.
The goal is to estimate the relationship between $X$ and $Y$, described by $f(X)$, for future prediction and decision-making.
For regression, $f: R^d \to R$.
For $K$-class classification, $f: R^d \to \{1, \ldots, K\}$.
Learning Loss Function
As in classical decision theory, we use a loss function $L$
to measure the discrepancy between $Y$ and $f(X)$, and
to penalize errors in predicting $Y$.
Examples:
squared error loss (used in mean regression): $L(Y, f(X)) = (Y - f(X))^2$
absolute error loss (used in median regression): $L(Y, f(X)) = |Y - f(X)|$
0-1 loss (used in binary classification): $L(Y, f(X)) = I(Y \neq f(X))$
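A minimal sketch of the three losses in Python with NumPy (the slides name no language, so this choice is an assumption); each function returns the pointwise loss:

```python
import numpy as np

def squared_error_loss(y, fx):
    # L(Y, f(X)) = (Y - f(X))^2, used in mean regression
    return (np.asarray(y) - np.asarray(fx)) ** 2

def absolute_error_loss(y, fx):
    # L(Y, f(X)) = |Y - f(X)|, used in median regression
    return np.abs(np.asarray(y) - np.asarray(fx))

def zero_one_loss(y, fx):
    # L(Y, f(X)) = I(Y != f(X)), used in classification
    return np.asarray(np.asarray(y) != np.asarray(fx), dtype=float)
```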
Risk, Expected Prediction Error (EPE)
The risk of $f$ is
$$R(f) = E_{X,Y} L(Y, f(X)) = \int L(y, f(x)) \, dP(x, y).$$
In machine learning, $R(f)$ is called the Expected Prediction Error (EPE).
For squared error loss,
$$R(f) = EPE(f) = E_{X,Y}[Y - f(X)]^2.$$
For 0-1 loss,
$$R(f) = EPE(f) = E_{X,Y} I(Y \neq f(X)) = \Pr(Y \neq f(X)),$$
which is the probability of making a mistake.
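Because $R(f)$ is an expectation over $P(X, Y)$, it can be approximated by Monte Carlo whenever we can simulate from the population. A sketch under a hypothetical model (the sinusoidal truth, noise level, and uniform predictors below are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: X ~ Uniform(0, 1), Y = sin(2*pi*X) + N(0, 0.3^2)
n = 100_000
x = rng.uniform(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, n)

f = lambda t: np.sin(2 * np.pi * t)   # the true regression function
g = lambda t: np.zeros_like(t)        # a poor constant predictor

# Monte Carlo approximation of EPE under squared error loss
print(np.mean((y - f(x)) ** 2))  # ~0.09, the noise variance (Bayes risk)
print(np.mean((y - g(x)) ** 2))  # ~0.59 = 0.09 + E[sin(2*pi*X)^2] = 0.09 + 0.5
```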
Optimal Learner
We seek the $f$ that minimizes the average (long-term) loss over the entire population.
Optimal Learner: the minimizer of $R(f)$ is
$$f^* = \arg\min_f R(f).$$
The solution $f^*$ depends on the loss $L$.
$f^*$ is not achievable without knowing $P(x, y)$, i.e., the entire population.
Bayes Risk: the optimal risk, i.e., the risk of the optimal learner. For squared error loss,
$$R(f^*) = E_{X,Y}(Y - f^*(X))^2.$$
How to Find The Optimal Learner
Note that
$$R(f) = E_{X,Y} L(Y, f(X)) = \int L(y, f(x)) \, dP(x, y) = E_X \left\{ E_{Y|X} L(Y, f(X)) \right\}.$$
One can therefore work pointwise: for each fixed $x$, find $f^*(x)$ by solving
$$f^*(x) = \arg\min_c E_{Y|X=x} L(Y, c).$$
Example 1: Regression with Squared Error Loss
For regression, consider $L(Y, f(X)) = (Y - f(X))^2$. The risk is
$$R(f) = E_{X,Y}(Y - f(X))^2 = E_X E_{Y|X}[(Y - f(X))^2 \mid X].$$
For each fixed $x$, the best learner solves the problem
$$\min_c E_{Y|X}[(Y - c)^2 \mid X = x].$$
The solution is $E(Y \mid X = x)$, so
$$f^*(x) = E(Y \mid X = x) = \int y \, dP(y \mid x).$$
So the optimal learner is $f^*(X) = E(Y \mid X)$.
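This can be checked numerically: fix the conditional distribution of $Y$ given $X = x$ and scan candidate constants $c$; the minimizer of the estimated conditional risk sits at the conditional mean. A sketch, with the conditional law $N(2, 1)$ chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical conditional law at a fixed x: Y | X = x ~ N(2.0, 1.0)
y = rng.normal(loc=2.0, scale=1.0, size=200_000)

# Approximate E[(Y - c)^2 | X = x] over a grid of candidate constants c
cs = np.linspace(0.0, 4.0, 401)
risks = [np.mean((y - c) ** 2) for c in cs]
print(cs[np.argmin(risks)])  # close to 2.0 = E(Y | X = x)
```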
Decomposition of EPE under Squared Error Loss
For any learner $f$, its EPE under squared error loss decomposes as
$$\begin{aligned}
EPE(f) &= E_{X,Y}[Y - f(X)]^2 = E_X E_{Y|X}([Y - f(X)]^2 \mid X) \\
&= E_X E_{Y|X}([Y - E(Y|X) + E(Y|X) - f(X)]^2 \mid X) \\
&= E_X E_{Y|X}([Y - E(Y|X)]^2 \mid X) + E_X[f(X) - E(Y|X)]^2 \\
&= E_{X,Y}[Y - f^*(X)]^2 + E_X[f(X) - f^*(X)]^2 \\
&= EPE(f^*) + \text{MSE} \\
&= \text{Bayes risk} + \text{MSE},
\end{aligned}$$
where the cross term vanishes because $E_{Y|X}[Y - E(Y|X) \mid X] = 0$.
The Bayes risk (the gold standard) is a lower bound on the risk of any learner.
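The decomposition is easy to verify by simulation: under a hypothetical model with known $f^*$, the Monte Carlo estimates of $EPE(f)$ and of Bayes risk + MSE agree up to sampling error (the quadratic truth and the competitor $f$ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

n = 500_000
x = rng.uniform(-1.0, 1.0, n)
fstar = x ** 2                        # f*(x) = E(Y | X = x)
y = fstar + rng.normal(0.0, 0.5, n)   # noise variance 0.25 = Bayes risk

f = lambda t: 0.5 * t                 # some other learner
epe_f = np.mean((y - f(x)) ** 2)      # EPE(f)
bayes = np.mean((y - fstar) ** 2)     # EPE(f*), about 0.25
mse = np.mean((f(x) - fstar) ** 2)    # E_X [f(X) - f*(X)]^2
print(epe_f, bayes + mse)             # the two agree up to Monte Carlo error
```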
Example 2: Classification with 0-1 Loss
In binary classification, assume $Y \in \{0, 1\}$, so the learning rule is $f: R^d \to \{0, 1\}$.
Using the 0-1 loss $L(Y, f(X)) = I(Y \neq f(X))$,
the expected prediction error is
$$EPE(f) = E\, I(Y \neq f(X)) = \Pr(Y \neq f(X)) = E_X[P(Y = 1 \mid X)\, I(f(X) = 0) + P(Y = 0 \mid X)\, I(f(X) = 1)].$$
Bayes Classification Rule for Binary Problems
For each fixed $x$, the minimizer of $EPE(f)$ is given by
$$f^*(x) = \begin{cases} 1 & \text{if } P(Y = 1 \mid X = x) > P(Y = 0 \mid X = x) \\ 0 & \text{if } P(Y = 1 \mid X = x) < P(Y = 0 \mid X = x). \end{cases}$$
It is called the Bayes classifier, denoted by $\phi_B$.
The Bayes rule can be written as
$$\phi_B(x) = f^*(x) = I[P(Y = 1 \mid X = x) > 0.5].$$
To obtain the Bayes rule, we need to know $P(Y = 1 \mid X = x)$, or at least whether $P(Y = 1 \mid X = x)$ is larger than 0.5 or not.
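In code, the Bayes rule is a single threshold once $P(Y = 1 \mid X = x)$ is known. A sketch assuming a hypothetical logistic form for the conditional probability:

```python
import numpy as np

def bayes_classifier(p1):
    # phi_B(x) = I[P(Y = 1 | X = x) > 0.5]
    return (np.asarray(p1) > 0.5).astype(int)

# Hypothetical model: P(Y = 1 | X = x) = 1 / (1 + exp(-x))
x = np.linspace(-3.0, 3.0, 7)
p1 = 1.0 / (1.0 + np.exp(-x))
print(bayes_classifier(p1))  # 0 where P(Y = 1 | x) <= 0.5, 1 where it exceeds 0.5
```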
Learning Process: ERM
Probability Framework
Assume the training set
$$(X_1, Y_1), \ldots, (X_n, Y_n) \overset{\text{i.i.d.}}{\sim} P(X, Y).$$
We aim to find the best function (optimal solution) from the model class $\mathcal{F}$ for decision making.
However, the optimal learner $f^*$ is not attainable without knowledge of $P(x, y)$.
The training set $(x_1, y_1), \ldots, (x_n, y_n)$ generated from $P(x, y)$ carries information about $P(x, y)$.
We therefore approximate the integral over $P(x, y)$ by the average over the training samples.
Empirical Risk Minimization
Key Idea: we cannot compute $R(f) = E_{X,Y} L(Y, f(X))$, but we can compute its empirical version using the data!
Empirical Risk (also known as the training error):
$$R_{emp}(f) = \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)).$$
A reasonable estimator $\hat{f}$ should be close to $f^*$, or converge to $f^*$, as more data are collected.
By the law of large numbers, $R_{emp}(f) \to EPE(f)$ as $n \to \infty$ for each fixed $f$.
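The law-of-large-numbers claim can be watched directly: for a fixed learner $f$, the empirical risk settles at the EPE as $n$ grows. A sketch under an illustrative linear model with unit noise variance (so $EPE(f) = 1$ when $f$ is the truth):

```python
import numpy as np

rng = np.random.default_rng(3)

f = lambda t: 2.0 * t                  # a fixed learner, equal to the truth here

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 1.0, n)
    r_emp = np.mean((y - f(x)) ** 2)   # R_emp(f) = (1/n) sum_i L(y_i, f(x_i))
    print(n, r_emp)                    # approaches EPE(f) = 1.0 as n grows
```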
Learning Based on ERM
We construct the estimator of $f^*$ by
$$\hat{f} = \arg\min_f R_{emp}(f).$$
This principle is called Empirical Risk Minimization (ERM).
For regression, the empirical risk is the Residual Sum of Squares (RSS), up to the constant factor $1/n$, which does not affect the minimizer:
$$RSS(f) = \sum_{i=1}^n [y_i - f(x_i)]^2.$$
ERM thus gives the least squares fit:
$$\hat{f}_{ols} = \arg\min_f RSS(f) = \arg\min_f \sum_{i=1}^n [y_i - f(x_i)]^2.$$
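For the linear class, the ERM (least squares) problem has a closed-form solution. A minimal sketch using NumPy's `lstsq` (the simulated design and coefficients are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

n, d = 200, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0.0, 0.1, n)

# ERM over linear functions f(x) = x'beta: minimize RSS(beta) = ||y - X beta||^2
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to beta_true
```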
Issues with ERM
Minimizing $R_{emp}(f)$ over all functions is ill-posed.
Any function passing through all $(x_i, y_i)$ has $RSS = 0$. Example:
$$f_1(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some } i = 1, \ldots, n \\ 1 & \text{otherwise.} \end{cases}$$
It is easy to check that $RSS(f_1) = 0$.
However, this does not amount to any form of learning: for future points not equal to any $x_i$, the rule always predicts $y = 1$. This phenomenon is called "overfitting".
Overfitting is undesirable, since a small $R_{emp}(f)$ does not imply a small $R(f) = \int L(y, f(x)) \, dP(x, y)$.
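The interpolating rule $f_1$ can be written down directly: it attains $RSS = 0$ on the training set yet learns nothing. A sketch under an illustrative sinusoidal truth:

```python
import numpy as np

rng = np.random.default_rng(5)

x_train = rng.uniform(0.0, 1.0, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 20)

def f1(x):
    # Memorize the training points exactly; predict 1 everywhere else
    lookup = dict(zip(x_train, y_train))
    return np.array([lookup.get(xi, 1.0) for xi in np.atleast_1d(x)])

print(np.sum((y_train - f1(x_train)) ** 2))  # RSS(f1) = 0: a "perfect" fit

x_test = rng.uniform(0.0, 1.0, 1000)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0.0, 0.2, 1000)
print(np.mean((y_test - f1(x_test)) ** 2))   # large test error: no learning
```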
Restricted Model Class F
Class of Restricted Estimators
We need to restrict the estimation process to a set of functions $\mathcal{F}$, to control the complexity of the fitted functions and hence avoid overfitting. Two common strategies:
Choose a simple model class $\mathcal{F}$:
$$\hat{f}_{ols} = \arg\min_{f \in \mathcal{F}} RSS(f) = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^n [y_i - f(x_i)]^2.$$
Choose a large model class $\mathcal{F}$, but use a roughness penalty to control model complexity: find $f \in \mathcal{F}$ to minimize the penalized RSS,
$$\hat{f} = \arg\min_{f \in \mathcal{F}} \left\{ RSS(f) + \lambda J(f) \right\}.$$
The solution is called a regularized empirical risk minimizer; a sketch follows below.
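As a concrete instance of regularized ERM, take $\mathcal{F}$ to be linear functions and $J(f) = \|\beta\|^2$; the penalized RSS is then the ridge criterion with a closed-form minimizer. A sketch (the design, the two nonzero coefficients, and $\lambda = 1$ are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)

n, d = 50, 20
X = rng.normal(size=(n, d))
y = X[:, 0] - X[:, 1] + rng.normal(0.0, 0.5, n)

# Regularized ERM: minimize RSS(beta) + lam * ||beta||^2 (ridge regression)
lam = 1.0
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(beta_ridge[:2])  # shrunken estimates of the two nonzero coefficients
```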
How should we choose the model class $\mathcal{F}$: a simple one or a complex one?
Parametric Learners
Some model classes can be parametrized as
$$\mathcal{F} = \{ f(X; \alpha) : \alpha \in \Lambda \},$$
where $\alpha$ is the index parameter for the class.
Parametric Model Class: the form of $f$ is known except for finitely many unknown parameters.
Linear models (linear regression, linear SVM, LDA):
$$f(X; \beta) = \beta_0 + X'\beta_1, \quad \beta_1 \in R^d,$$
$$f(X; \beta) = \beta_0 + \beta_1 \sin(X_1) + \beta_2 X_2^2, \quad X \in R^2.$$
Nonlinear models:
$$f(x; \beta) = \beta_1 x^{\beta_2},$$
$$f(X; \beta) = \beta_1 + \beta_2 \exp(\beta_3 X^{\beta_4}).$$
Advantages of Linear Models
Why are linear models in such wide use?
If the linear assumption is correct, they are more efficient than nonparametric fits.
They are convenient and easy to fit.
They are easy to interpret.
They provide a first-order Taylor approximation to $f(X)$.
Example: in linear regression, $\mathcal{F} = \{\text{all linear functions of } x\}$, and ERM is the same as solving the ordinary least squares problem
$$(\hat{\beta}_0^{ols}, \hat{\beta}^{ols}) = \arg\min_{(\beta_0, \beta)} \sum_{i=1}^n [y_i - \beta_0 - x_i'\beta]^2.$$
Drawbacks of Parametric Models
We never know whether the linear assumption is appropriate.
Not flexible in practice: a parametric fit has all orders of derivatives and the same functional form everywhere.
Individual observations can have a large influence on remote parts of the curve.
The polynomial degree cannot be controlled continuously.
More Flexible Learners
Nonparametric Model Class $\mathcal{F}$: the functional form of $f$ is unspecified, and the parameter set can be infinite-dimensional. Examples:
$\mathcal{F} = \{\text{all continuous functions of } x\}$.
$\mathcal{F}$ = the second-order Sobolev space on $[0, 1]$.
$\mathcal{F}$ is the space spanned by a set of basis functions (dictionary functions):
$$\mathcal{F} = \left\{ f : f_\theta(x) = \sum_{m=1}^{\infty} \theta_m h_m(x) \right\}.$$
Nearest-neighbor methods: $\hat{f}(x) = k^{-1} \sum_{x_i \in N_k(x)} y_i$, where $N_k(x)$ is the neighborhood of $x$ defined by the $k$ closest points $x_i$ in the training sample (a sketch follows after this list).
Kernel methods and local regression: $\mathcal{F}$ contains all piecewise constant or piecewise linear functions.
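As referenced in the list above, a minimal sketch of $k$-nearest-neighbor regression in one dimension (the sinusoidal truth and $k = 10$ are illustrative):

```python
import numpy as np

def knn_regress(x0, x_train, y_train, k):
    # f_hat(x0) = k^{-1} * sum of y_i over the k nearest training points N_k(x0)
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

rng = np.random.default_rng(7)
x_train = rng.uniform(0.0, 1.0, 200)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, 200)

print(knn_regress(0.25, x_train, y_train, k=10))  # near sin(pi/2) = 1
```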
Examples of Nonparametric Methods
For regression problems,
splines, wavelets, kernel estimators, locally weighted polynomial regression, GAM
projection pursuit, regression trees, k-nearest neighbors
For classification problems,
kernel support vector machines, k-nearest neighbors
classification trees ...
For density estimation,
histogram, kernel density estimation
Semiparametric Models: a compromise between linear models and nonparametric models.
Example: partially linear models.
Nonparametric Models
Motivation: the underlying regression function may be so complicated that no reasonable parametric model would be adequate.
Advantages:
Does NOT assume any specific functional form.
Assume less, thus discover more.
Infinitely many parameters, which is not the same as no parameters.
Local: the fit relies more on nearby observations.
Outliers and influential points affect the fit only locally.
Let the data speak for themselves and reveal the appropriate functional form!
Price we pay: more computational effort, lower efficiency, and a slower convergence rate (if a linear model happens to be reasonable).
Nonparametric vs Parametric Models
They are not mutually exclusive competitors.
A nonparametric regression estimate may suggest a simple parametric model.
In practice, parametric models are often preferred for their simplicity.
A linear regression line is an infinitely smooth fit.
method          # parameters   estimate   outliers
parametric      finite         global     sensitive
nonparametric   infinite       local      robust
[Figure: four fits to the same $(x, y)$ data on $[0, 1]$: a linear fit, a 2nd-order polynomial fit, a 3rd-order polynomial fit, and a smoothing spline ($\lambda = 0.85$).]
Interpretation of Risk
Let $\mathcal{F}$ denote the restricted model space.
Denote $f^* = \arg\min EPE(f)$, the optimal learner; the minimization is taken over all measurable functions.
Denote $\tilde{f} = \arg\min_{f \in \mathcal{F}} EPE(f)$, the best learner in the restricted space $\mathcal{F}$.
Denote $\hat{f} = \arg\min_{f \in \mathcal{F}} R_{emp}(f)$, the learner based on limited data in the restricted space $\mathcal{F}$.
Then
$$EPE(\hat{f}) - EPE(f^*) = [EPE(\hat{f}) - EPE(\tilde{f})] + [EPE(\tilde{f}) - EPE(f^*)] = \text{estimation error} + \text{approximation error}.$$
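Both error terms can be estimated by simulation when the truth is known. A sketch with $f^*(x) = x^2$ and $\mathcal{F}$ the linear functions, for which the normal equations under $X \sim \text{Uniform}(0, 1)$ give the best-in-class learner $\tilde{f}(x) = x - 1/6$ (all modeling choices here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)

# Truth: f*(x) = x^2 with X ~ Uniform(0, 1); restricted class F = {a + b*x}
def fstar(x):  return x ** 2
def ftilde(x): return x - 1.0 / 6.0   # best linear L2 approximation of x^2

def epe(f, m=500_000):
    # Monte Carlo estimate of EPE(f) under squared error loss
    x = rng.uniform(0.0, 1.0, m)
    y = fstar(x) + rng.normal(0.0, 0.1, m)
    return np.mean((y - f(x)) ** 2)

# f_hat: least squares line fit to a small training set (ERM within F)
n = 30
x = rng.uniform(0.0, 1.0, n)
y = fstar(x) + rng.normal(0.0, 0.1, n)
b, a = np.polyfit(x, y, 1)            # returns (slope, intercept)
fhat = lambda t: a + b * t

print(epe(ftilde) - epe(fstar))  # approximation error, about 1/180
print(epe(fhat) - epe(ftilde))   # estimation error, shrinks as n grows
```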
Estimation Error and Approximation Error
The gap $EPE(\hat{f}) - EPE(\tilde{f})$ is called the estimation error; it is due to the limited training data.
The gap $EPE(\tilde{f}) - EPE(f^*)$ is called the approximation error. It is incurred by approximating $f^*$ with a restricted model space or via regularization, since it is possible that $f^* \notin \mathcal{F}$.
This decomposition is similar to the bias-variance tradeoff, with
the estimation error playing the role of variance, and
the approximation error playing the role of bias.