TRANSCRIPT
Training versus Testing and Linear Regularization
Tong Zhang
Rutgers University
T. Zhang (Rutgers) Linear Regularization 1 / 23
Supervised Learning Problem
Prediction problem
  Input X: known information.
  Output Y: unknown information.
  Goal: to predict Y based on X
    find a function (prediction rule) f(X) such that Y ≈ f(X)
Observe historical data Sn = {(X1, Y1), . . . , (Xn, Yn)}
  Sn called training data
Learning: find a good prediction rule f(x) based on Sn
Example: least squares regression
  f(x) = w^T x: linear prediction rule
  learning algorithm: find w to minimize squared error on training data

    w = arg min_w ∑_{i=1}^n (f(X_i) − Y_i)^2
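As a concrete sketch of this least squares step (my illustration, not part of the lecture; the synthetic data and `w_true` are assumptions), the minimization can be carried out directly with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: Y is linear in X plus a little noise.
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.5])
Y = X @ w_true + 0.1 * rng.normal(size=n)

# Least squares: w = arg min_w sum_i (w^T X_i - Y_i)^2
w, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(w)  # should be close to w_true
```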
Training versus Testing
Training data: Sn = {(X1, Y1), . . . , (Xn, Yn)}
Learning: find a prediction rule on training data
  find a rule to fit well on Sn
Test data: future observations that may be different from training data
  purpose: learned prediction rule should work well on test data
Learning process:
  learn prediction rule on training data
  evaluate performance on test data
Why separate training and testing?
  training error is usually unrealistically low (overfitting)
  error on test data (data not used to fit model) is more realistic
Overfitting
Goal: predict well on future unseen data (test data)
Two aspects:
  rule should fit training data well
    requires a more complex model.
  behavior of rule on test data should match that on training data
    requires a less complex (more stable) model.
Model complexity:
  more complex model: smaller training error but larger difference between test and training error
  less complex model: larger training error but smaller difference between test and training error
Related concepts:
  bias-variance trade-off
  regularization
Overfitting: Simple Example
Binary classification example:
  Input X is one dimensional and uniformly distributed in [−1, 1]
  True class label:

    Y = 1 if X ≥ 0;  Y = −1 if X < 0

Given training data (Xi, Yi) (i = 1, . . . , n)
  with probability one the Xi are all different
Assume you don’t know the relationship of X and Y but want to learn it from data.
Overfitting Method
The following prediction rule fits training data perfectly

  f(X) = Yi if X = Xi for some i;  f(X) = 1 otherwise

Has no prediction capability when X is not in the training data.
Very complex prediction rule: arbitrary flexibility.
Characteristics of overfitting procedure:
  training error: small (0)
  difference of test error and training error: large (0.5)
  test error: large (0.5)
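A minimal sketch of this memorization rule (my illustration, not from the lecture), on synthetic data drawn exactly as described on the previous slide:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X_train = rng.uniform(-1, 1, size=n)
Y_train = np.where(X_train >= 0, 1, -1)

# Memorization rule: return the stored label for a seen X, predict 1 otherwise.
lookup = dict(zip(X_train.tolist(), Y_train.tolist()))

def f_overfit(x):
    return lookup.get(float(x), 1)

train_err = np.mean([f_overfit(x) != y for x, y in zip(X_train, Y_train)])

# Fresh test points are (with probability one) unseen, so the rule predicts 1
# everywhere and errs whenever Y = -1, i.e. about half the time.
X_test = rng.uniform(-1, 1, size=10000)
Y_test = np.where(X_test >= 0, 1, -1)
test_err = np.mean([f_overfit(x) != y for x, y in zip(X_test, Y_test)])
print(train_err, test_err)
```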
Underfitting Method
Simply return the following prediction rule, independent of the data:

  f(X) = 1.

Ignore the data: fits training data poorly
Simplistic prediction rule: no flexibility.
Characteristics of underfitting procedure:
  training error: large (≈ 0.5)
  difference of test error and training error: small (≈ 0)
  test error: large (0.5)
Balanced Fit
We pick the class of functions

  f(x) = sign(x − θ)

with unknown θ; then find θ to minimize training error.
Fits training data well with good prediction capability.
  small number of parameters: one dimensional.
  some flexibility but not overly complex.
Characteristics of balanced fitting:
  training error: small (0)
  difference of test error and training error: small (≈ 0)
  test error: small (≈ 0)
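A sketch of this balanced fit on synthetic data (my illustration; scanning candidate thresholds is one simple way to minimize the training error over θ):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X_train = rng.uniform(-1, 1, size=n)
Y_train = np.where(X_train >= 0, 1, -1)

def train_error(theta):
    pred = np.where(X_train - theta >= 0, 1, -1)
    return np.mean(pred != Y_train)

# Try the midpoints between consecutive sorted training points
# (plus the interval endpoints) and keep the best threshold.
xs = np.sort(X_train)
candidates = np.concatenate(([-1.0], (xs[:-1] + xs[1:]) / 2, [1.0]))
theta = min(candidates, key=train_error)

X_test = rng.uniform(-1, 1, size=10000)
Y_test = np.where(X_test >= 0, 1, -1)
test_err = np.mean(np.where(X_test - theta >= 0, 1, -1) != Y_test)
print(theta, train_error(theta), test_err)
```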
Regularization
Model complexity:
  how complex a target function can be represented by the model family
  more complex model:
    represents more complex and less smooth functions
    prone to overfitting
  less complex model:
    represents less complex and smoother functions
    prone to underfitting
Regularization: limit model complexity to achieve good test error
  restrict parameter space to control model complexity
Typical learning algorithms:
  have tuning parameters to control model complexity
  tuning parameters should be adjusted to achieve a good balance.
Bias-variance trade-off
[Figure: error versus model complexity. The training error curve decreases as complexity grows; the test error curve (bias + variance + noise) first decreases, then increases. Low complexity: high bias, low variance; high complexity: low bias, high variance.]
Linear Model
Consider training data: Sn = {(X1, Y1), . . . , (Xn, Yn)}
Linear prediction rule:

  f(X) = w^T X.

Linear Least Squares Method:

  w = arg min_w (1/n) ∑_{i=1}^n (w^T X_i − Y_i)^2

Complexity:
  dimension d: the larger d is, the more complex the model is
Generalization and Model Complexity
Training error:

  training error(f) = (1/n) ∑_{i=1}^n (f(X_i) − Y_i)^2.

Test error:

  test error(w) = E_{(x,y)} (f(x) − y)^2

Learning algorithm: find w to minimize training error
Model complexity: measures the difference of training and test error
Generalization error: with large probability we have

  test error(w) ≤ training error(w) + O(d/n)

  where the O(d/n) term bounds the difference of training and test error
only useful when d ≪ n
what to do when d is large? need another form of regularization
Linear Regularization
Regularization: restrict the model expressiveness when d is large.
Restrict the linear family size: g(w) ≤ A.
  example: g(w) = ‖w‖₂² = ∑_{j=1}^d w_j²
  A: tuning parameter (estimated through cross-validation).
Model complexity: measured by A.
Benefit of regularization:
  statistical: robust to a large number of features.
  numerical: stabilizes the solution.
Linear Regularization Formulation
Goal: minimize average squared loss on unseen data.
A practical method: minimize observed loss on training data:

  w = arg min_w (1/n) ∑_{i=1}^n (w^T X_i − Y_i)^2,  such that g(w) ≤ A.

Equivalent formulation (λ ≥ 0):

  w = arg min_w (1/n) ∑_{i=1}^n (w^T X_i − Y_i)^2 + λ g(w)

Small A corresponds to larger λ, and vice versa.
Complexity: larger A means larger complexity; smaller λ means larger complexity.
Training and Test Error versus λ
[Figure: training and test mean squared error plotted against λ on a log scale (1e−04 to 1e+08).]

model complexity decreases as λ increases
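The qualitative behavior in this figure can be reproduced with a small simulation (my sketch; the data, dimensions, and λ grid are made up), using the closed-form ridge solution that appears later in the lecture:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 40                      # d close to n, so the unregularized fit overfits
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
Y = X @ w_true + rng.normal(size=n)

X_test = rng.normal(size=(5000, d))
Y_test = X_test @ w_true + rng.normal(size=5000)

def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

lams = [1e-4, 1e-2, 1.0, 1e2, 1e4]
train_errs = [np.mean((X @ ridge(l) - Y) ** 2) for l in lams]
test_errs = [np.mean((X_test @ ridge(l) - Y_test) ** 2) for l in lams]

# Training error only grows as lambda increases; very heavy regularization
# (the largest lambda) clearly hurts the test error.
print(train_errs)
print(test_errs)
```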
Model Complexity: generalization bound
Least squares with square regularization:

  w = arg min_w (1/n) ∑_i (w^T X_i − Y_i)^2 + λ‖w‖₂².

Predictive power of w:

  test error ≤ (1 + max_x ‖x‖₂² / (λn))^2 · min_w E_{x,y} [ (w^T x − y)^2 + λ‖w‖₂² ]

  where the min_w term is the best possible regularized loss
Robust to a large feature set:
  square regularization g(w) = ‖w‖₂²
  generalization depends on ‖x‖₂² / λ, not on dimension d
Lp Regularization
Lp-regularization (constrained version):

  w_Lp = arg min_{w∈R^d} ∑_{i=1}^n (w^T X_i − Y_i)^2  subject to  ∑_{j=1}^d |w_j|^p ≤ A.

Equivalent formulation, where λ ≥ 0 is the regularization parameter (penalized version):

  w_Lp = arg min_{w∈R^d} ∑_{i=1}^n (w^T X_i − Y_i)^2 + λ ∑_{j=1}^d |w_j|^p.

p = 0: sparsity; p = 1: Lasso; p = 2: ridge regression.
  subset selection: p = 0 (non-convex, non-smooth)
  p ∈ (0, 1): non-convex, smooth
  p ≥ 1: convex (p = 1 is the closest convex approximation to p = 0)
  p = 2: ridge regression; p = 1: Lasso
Ridge regression
Regularization formulation:

  w_L2 = arg min_{w∈R^d} ∑_{i=1}^n (w^T X_i − Y_i)^2 + λ ∑_{j=1}^d w_j².

Implicit dimension reduction effect (principal components).
More stable (smaller weights) than least squares.
It does not generally lead to a sparse solution.
Closely related to kernel methods.
Ridge regression solution
X: n × d data matrix (n training examples and d features)
Solution of ridge regression:

  w_L2 = (X^T X + λI)^{−1} X^T Y.

Compare to the standard least squares regression solution:

  w = (X^T X)^{−1} X^T Y.

  disadvantage: X^T X may not be invertible
Advantage: ridge regression allows d > n
  stable: X^T X + λI is always invertible
  implicit dimension reduction effect.
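A quick numerical check of these claims (illustrative synthetic data; the sizes are arbitrary): with d > n the Gram matrix X^T X is singular, yet X^T X + λI remains invertible.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 20, 50                      # more features than training examples
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)

# X^T X is d x d but has rank at most n < d, so it is singular ...
rank = np.linalg.matrix_rank(X.T @ X)

# ... while X^T X + lambda*I is positive definite, so ridge is well defined.
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
print(rank, w_ridge.shape)
```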
The Effect of Correlated Features
Boston Housing data: 13 variables (X[·, 1 · · · 13])
Adding another feature: X[·, 14] = X[·, 13] + random noise
  smaller noise: X[·, 13] is more correlated with X[·, 14]
Weight |w13| under various noise sizes:

  noise size     | 0.01 | 0.1  | 0.5  | 1
  least squares  | 358  | 25.5 | 1.77 | 14.7
  ridge (λ = 1)  | 3.90 | 3.99 | 3.39 | 7.31
least squares is not stable when there are correlated features
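The instability can be traced to the conditioning of X^T X. A synthetic sketch (not the Boston Housing data; the two nearly identical columns play the role of X[·, 13] and X[·, 14]):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.001 * rng.normal(size=n)   # nearly duplicate feature
X = np.column_stack([x1, x2])
Y = x1 + 0.1 * rng.normal(size=n)      # target depends on the shared signal

# Least squares solves a badly conditioned system; adding lambda*I repairs it.
cond_ls = np.linalg.cond(X.T @ X)
cond_ridge = np.linalg.cond(X.T @ X + 1.0 * np.eye(2))

w_ls = np.linalg.solve(X.T @ X, X.T @ Y)
w_ridge = np.linalg.solve(X.T @ X + 1.0 * np.eye(2), X.T @ Y)
print(cond_ls, cond_ridge)
print(w_ls, w_ridge)
```

Note that the total weight w1 + w2 is well determined in both cases; it is the split between the two correlated columns that least squares cannot pin down.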
Lasso
Find w with 1-norm ≤ s to minimize squared error:
  w_L1 = arg min_{w∈R^d} ∑_{i=1}^n (w^T X_i − Y_i)^2  subject to  ∑_{j=1}^d |w_j| ≤ s.
Equivalent to (λ ≥ 0 is regularization parameter):
  w_L1 = arg min_{w∈R^d} ∑_{i=1}^n (w^T X_i − Y_i)^2 + λ ∑_{j=1}^d |w_j|.
Solution Property
Convex optimization problem, but the solution may not be unique.
The global solution can be found efficiently.
The solution is sparse: some wj will be zero, which achieves feature selection.
The solution is not necessarily stable.
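Sparsity of the Lasso solution can be demonstrated with a small coordinate-descent solver for the penalized objective (my illustration; `lasso_cd`, the soft-thresholding update, and the synthetic data are a standard sketch, not part of the lecture):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, Y, lam, n_iter=200):
    """Coordinate descent for sum_i (w^T X_i - Y_i)^2 + lam * sum_j |w_j|."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            # Residual with coordinate j removed.
            r = Y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r
            # One-dimensional minimizer of a*w^2 - 2*rho*w + lam*|w|.
            w[j] = soft_threshold(rho, lam / 2) / col_sq[j]
    return w

rng = np.random.default_rng(6)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [3.0, -2.0, 1.5]          # only 3 of 20 features matter
Y = X @ w_true + 0.1 * rng.normal(size=n)

w_lasso = lasso_cd(X, Y, lam=20.0)
print(np.nonzero(w_lasso)[0])   # mostly just the first three coordinates
```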
Summary
Training versus test:
  find a prediction rule to best fit training data
  performance is evaluated on test data
Model complexity: the difference between training and test error
  increasing model complexity decreases training error but increases the difference between training and test error
Regularization:
  control model complexity by restricting the parameter space
Linear regularization: restrict the weights using ‖w‖_p ≤ A
  p = 2: ridge regression, which gives a stable solution and allows d ≥ n
  p = 1: Lasso (convex), which gives a sparse solution but may be unstable.