Training versus Testing and Linear Regularization


Page 1: Training versus Testing and Linear Regularization

Training versus Testing and Linear Regularization

Tong Zhang

Rutgers University

T. Zhang (Rutgers) Linear Regularization 1 / 23

Page 2: Training versus Testing and Linear Regularization

Supervised Learning Problem

Prediction problem:
- Input X: known information.
- Output Y: unknown information.
- Goal: predict Y based on X, i.e. find a function (prediction rule) f(X) such that Y ≈ f(X).

Observe historical data Sn = {(X1,Y1), ..., (Xn,Yn)}; Sn is called the training data.

Learning: find a good prediction rule f(x) based on Sn.

Example: least squares regression
- f(x) = w^T x: linear prediction rule
- learning algorithm: find w to minimize the squared error on the training data:

    w = argmin_w Σ_{i=1}^n (f(X_i) − Y_i)^2

T. Zhang (Rutgers) Linear Regularization 2 / 23
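As a concrete illustration of the least squares rule above, here is a minimal NumPy sketch (my own example, not from the slides; the synthetic data, seed, and variable names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: Y is a noisy linear function of X.
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
Y = X @ w_true + 0.01 * rng.normal(size=n)

# Least squares: w = argmin_w sum_i (w^T X_i - Y_i)^2.
w, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(w)  # close to w_true, since the noise is small
```

`np.linalg.lstsq` solves the same minimization directly, without forming X^T X explicitly.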


Page 4: Training versus Testing and Linear Regularization

Training versus Testing

Training data: Sn = {(X1,Y1), ..., (Xn,Yn)}
Learning: find a prediction rule on the training data
- find a rule that fits Sn well

Test data: future observations that may differ from the training data
- purpose: the learned prediction rule should work well on test data

Learning process:
- learn the prediction rule on training data
- evaluate performance on test data

Why separate training and testing?
- training error is usually unrealistically low (overfitting)
- error on test data (data not used to fit the model) is more realistic

T. Zhang (Rutgers) Linear Regularization 3 / 23
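The train/evaluate protocol described above can be sketched as follows (a hypothetical setup with synthetic data; the split sizes and seed are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data, then a holdout split: fit on the training part only,
# and measure error separately on both parts.
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
Y = X @ w_true + 0.5 * rng.normal(size=200)

X_train, Y_train = X[:150], Y[:150]
X_test, Y_test = X[150:], Y[150:]

w, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)

train_err = np.mean((X_train @ w - Y_train) ** 2)
test_err = np.mean((X_test @ w - Y_test) ** 2)

# Training error tends to be optimistic relative to test error,
# since w was fit to the training sample.
print(train_err, test_err)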


Page 7: Training versus Testing and Linear Regularization

Overfitting

Goal: predict well on future unseen data (test data)
Two aspects:
- the rule should fit the training data well: this requires a more complex model
- the behavior of the rule on test data should match that on training data: this requires a less complex (more stable) model

Model complexity:
- a more complex model: smaller training error but a larger difference between test and training error
- a less complex model: larger training error but a smaller difference between test and training error

Related concepts:
- bias-variance trade-off
- regularization

T. Zhang (Rutgers) Linear Regularization 4 / 23


Page 9: Training versus Testing and Linear Regularization

Overfitting: Simple Example

Binary classification example:
- input X is one dimensional and uniformly distributed in [−1,1]
- true class label:

    Y = 1 if X ≥ 0,  Y = −1 if X < 0

Given training data (Xi, Yi) (i = 1, ..., n):
- with probability one the Xi are all different

Assume you don't know the relationship between X and Y but want to learn it from data.

T. Zhang (Rutgers) Linear Regularization 5 / 23

Page 10: Training versus Testing and Linear Regularization

Overfitting Method

The following prediction rule fits the training data perfectly:

    f(X) = Yi if X = Xi for some i;  f(X) = 1 otherwise

- It has no prediction capability when X is not in the training data.
- Very complex prediction rule: arbitrary flexibility.

Characteristics of the overfitting procedure:
- training error: small (0)
- difference between test error and training error: large (0.5)
- test error: large (0.5)

T. Zhang (Rutgers) Linear Regularization 6 / 23
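The memorizing rule can be simulated directly (my own sketch; the sample sizes and seed are arbitrary). It reproduces the characteristics listed above: zero training error, test error near 0.5:

```python
import numpy as np

rng = np.random.default_rng(2)

# X uniform in [-1, 1]; true label Y = 1 if X >= 0, else -1.
n = 1000
X_train = rng.uniform(-1, 1, size=n)
Y_train = np.where(X_train >= 0, 1, -1)

# Memorize the training set exactly.
memory = dict(zip(X_train.tolist(), Y_train.tolist()))

def f(x):
    # Return the memorized label if x was seen in training, else 1.
    return memory.get(x, 1)

# Training error is exactly 0: every training point is memorized.
train_err = np.mean([f(x) != y for x, y in zip(X_train, Y_train)])

# Fresh test points are distinct from the training points with probability
# one, so the rule always predicts 1 and errs on the negative half.
X_test = rng.uniform(-1, 1, size=10000)
Y_test = np.where(X_test >= 0, 1, -1)
test_err = np.mean([f(x) != y for x, y in zip(X_test, Y_test)])

print(train_err, round(test_err, 2))
```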

Page 11: Training versus Testing and Linear Regularization

Underfitting Method

Simply return the following prediction rule, independent of the data:

    f(X) = 1.

- Ignores the data: fits the training data poorly.
- Simplistic prediction rule: no flexibility.

Characteristics of the underfitting procedure:
- training error: large (≈ 0.5)
- difference between test error and training error: small (≈ 0)
- test error: large (0.5)

T. Zhang (Rutgers) Linear Regularization 7 / 23

Page 12: Training versus Testing and Linear Regularization

Balanced Fit

We pick the class of functions

    f(x) = sign(x − θ)

with unknown θ, then find θ to minimize the training error.
- Fits the training data well with good prediction capability.
- Small number of parameters: one dimensional.
- Some flexibility, but not overly complex.

Characteristics of balanced fitting:
- training error: small (0)
- difference between test error and training error: small (≈ 0)
- test error: small (≈ 0)

T. Zhang (Rutgers) Linear Regularization 8 / 23
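The balanced fit can be sketched by searching the training points as candidate thresholds and keeping the one with the smallest training error (my own illustration under assumed sample sizes, not code from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

n = 200
X_train = rng.uniform(-1, 1, size=n)
Y_train = np.where(X_train >= 0, 1, -1)

def train_error(theta):
    # f(x) = sign(x - theta), with sign(0) taken as 1.
    pred = np.where(X_train - theta >= 0, 1, -1)
    return np.mean(pred != Y_train)

# Candidate thresholds: the training points themselves plus an endpoint.
candidates = np.concatenate(([-1.0], np.sort(X_train)))
theta = min(candidates, key=train_error)

# The learned threshold sits near 0, so the test error is small too.
X_test = rng.uniform(-1, 1, size=10000)
Y_test = np.where(X_test >= 0, 1, -1)
pred = np.where(X_test - theta >= 0, 1, -1)
test_err = np.mean(pred != Y_test)

print(train_error(theta), round(test_err, 3))
```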

Page 13: Training versus Testing and Linear Regularization

Regularization

Model complexity:
- how complex a target function the model family can represent
- a more complex model: represents more complex, less smooth functions; prone to overfitting
- a less complex model: represents less complex, smoother functions; prone to underfitting

Regularization: limit model complexity to achieve good test error
- restrict the parameter space to control model complexity

Typical learning algorithms:
- have tuning parameters that control model complexity
- tuning parameters should be adjusted to achieve a good balance

T. Zhang (Rutgers) Linear Regularization 9 / 23

Page 14: Training versus Testing and Linear Regularization

Bias-variance trade-off

[Figure: error versus model complexity. The training error decreases as model complexity grows; the test error (bias + variance + noise) is U-shaped. Low complexity: high bias, low variance. High complexity: low bias, high variance.]

T. Zhang (Rutgers) Linear Regularization 10 / 23

Page 15: Training versus Testing and Linear Regularization

Linear Model

Consider training data: Sn = {(X1,Y1), ..., (Xn,Yn)}
Linear prediction rule:

    f(X) = w^T X.

Linear least squares method:

    w = argmin_w (1/n) Σ_{i=1}^n (w^T X_i − Y_i)^2

Complexity:
- dimension d: the larger d is, the more complex the model is

T. Zhang (Rutgers) Linear Regularization 11 / 23


Page 17: Training versus Testing and Linear Regularization

Generalization and Model Complexity

Training error:

    training error(f) = (1/n) Σ_{i=1}^n (f(X_i) − Y_i)^2.

Test error (for f(x) = w^T x):

    test error(w) = E_{(x,y)} (f(x) − y)^2

Learning algorithm: find w to minimize the training error.
Model complexity: measures the difference between training and test error.
Generalization error: with large probability we have

    test error(w) ≤ training error(w) + O(d/n)

where the O(d/n) term bounds the difference between training and test error.
- only useful when d ≪ n
- what to do when d is large? we need another form of regularization

T. Zhang (Rutgers) Linear Regularization 12 / 23

Page 18: Training versus Testing and Linear Regularization

Linear Regularization

Regularization: restrict the model expressiveness when d is large.
Restrict the size of the linear family: g(w) ≤ A.
- example: g(w) = ‖w‖₂² = Σ_{j=1}^d w_j²
- A: tuning parameter (estimated through cross-validation)

Model complexity: measured by A.
Benefits of regularization:
- statistical: robust to a large number of features
- numerical: stabilizes the solution

T. Zhang (Rutgers) Linear Regularization 13 / 23

Page 19: Training versus Testing and Linear Regularization

Linear Regularization Formulation

Goal: minimize the average squared loss on unseen data.
A practical method: minimize the observed loss on the training data:

    w = argmin_w (1/n) Σ_{i=1}^n (w^T X_i − Y_i)^2  such that g(w) ≤ A.

Equivalent formulation (λ ≥ 0):

    w = argmin_w (1/n) Σ_{i=1}^n (w^T X_i − Y_i)^2 + λ g(w)

A small A corresponds to a larger λ, and vice versa.
Complexity: large A means large complexity; small λ means large complexity.

T. Zhang (Rutgers) Linear Regularization 14 / 23

Page 20: Training versus Testing and Linear Regularization

Training and Test Error versus λ

[Figure: mean squared error versus λ on a log scale (roughly 1e−4 to 1e+8), with one curve for the training error and one for the test error.]

Model complexity decreases as λ increases.
T. Zhang (Rutgers) Linear Regularization 15 / 23

Page 21: Training versus Testing and Linear Regularization

Model Complexity: generalization bound

Least squares with square regularization:

    w = argmin_w (1/n) Σ_i (w^T X_i − Y_i)^2 + λ‖w‖₂².

Predictive power of w:

    test error ≤ (1 + (1/(λn)) max_x ‖x‖₂²)² · min_w E_{x,y} [(w^T x − y)² + λ‖w‖₂²]

where the minimum on the right is the best possible regularized loss.

Robust to a large feature set:
- square regularization g(w) = ‖w‖₂²
- generalization depends on ‖x‖₂²/λ, not on the dimension d

T. Zhang (Rutgers) Linear Regularization 16 / 23

Page 22: Training versus Testing and Linear Regularization

Lp Regularization

Lp regularization (constrained version):

    w_Lp = argmin_{w ∈ R^d} Σ_{i=1}^n (w^T X_i − Y_i)^2  subject to  Σ_{j=1}^d |w_j|^p ≤ A.

Equivalent formulation, where λ ≥ 0 is the regularization parameter (penalized version):

    w_Lp = argmin_{w ∈ R^d} Σ_{i=1}^n (w^T X_i − Y_i)^2 + λ Σ_{j=1}^d |w_j|^p.

p = 0: sparsity; p = 1: Lasso; p = 2: ridge regression.
- subset selection: p = 0 (non-convex, non-smooth)
- p ∈ (0,1): non-convex, smooth
- p ≥ 1: convex (p = 1 is the closest convex approximation to p = 0)

T. Zhang (Rutgers) Linear Regularization 17 / 23

Page 23: Training versus Testing and Linear Regularization

Ridge regression

Regularization formulation:

    w_L2 = argmin_{w ∈ R^d} Σ_{i=1}^n (w^T X_i − Y_i)^2 + λ Σ_{j=1}^d w_j².

- Implicit dimension reduction effect (principal components).
- More stable (smaller weights) than least squares.
- Does not generally lead to a sparse solution.
- Closely related to kernel methods.

T. Zhang (Rutgers) Linear Regularization 18 / 23

Page 24: Training versus Testing and Linear Regularization

Ridge regression solution

X: n × d data matrix (n training examples and d features).
Solution of ridge regression:

    w_L2 = (X^T X + λI)^{−1} X^T Y.

Compare to the standard least squares regression solution:

    w = (X^T X)^{−1} X^T Y.

- disadvantage of least squares: X^T X may not be invertible
- advantage of ridge regression: it allows d > n
- stable: X^T X + λI is always invertible
- implicit dimension reduction effect

T. Zhang (Rutgers) Linear Regularization 19 / 23

Page 25: Training versus Testing and Linear Regularization

The Effect of Correlated Features

Boston Housing data: 13 variables (X[·, 1..13]).
Add another feature: X[·,14] = X[·,13] + random noise
- smaller noise: X[·,13] is more correlated with X[·,14]

Weight |w13| under various noise sizes:

    noise size      0.01   0.1    0.5    1
    least squares   358    25.5   1.77   14.7
    ridge (λ = 1)   3.90   3.99   3.39   7.31

Least squares is not stable when there are correlated features.

T. Zhang (Rutgers) Linear Regularization 20 / 23
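The instability can be reproduced on synthetic data (a stand-in for the Boston Housing experiment, not the actual dataset; the sizes, seed, and noise level are my own choices):

```python
import numpy as np

rng = np.random.default_rng(5)

# Two nearly identical columns: x2 = x1 + small noise.
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.001 * rng.normal(size=n)
X = np.column_stack([x1, x2])
Y = x1 + 0.1 * rng.normal(size=n)

XtX = X.T @ X

# Near-collinearity makes X^T X badly conditioned, so the least squares
# weights are sensitive to noise; adding lam*I restores conditioning,
# which is why the ridge weights in the table stay moderate.
cond_ls = np.linalg.cond(XtX)
cond_ridge = np.linalg.cond(XtX + np.eye(2))

w_ridge = np.linalg.solve(XtX + np.eye(2), X.T @ Y)
print(cond_ls, cond_ridge, np.abs(w_ridge).max())
```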

Page 26: Training versus Testing and Linear Regularization

Lasso

Find w with 1-norm ≤ s to minimize the squared error:

    w_L1 = argmin_{w ∈ R^d} Σ_{i=1}^n (w^T X_i − Y_i)^2  subject to  Σ_{j=1}^d |w_j| ≤ s.

Equivalent to (λ ≥ 0 is the regularization parameter):

    w_L1 = argmin_{w ∈ R^d} Σ_{i=1}^n (w^T X_i − Y_i)^2 + λ Σ_{j=1}^d |w_j|.

T. Zhang (Rutgers) Linear Regularization 21 / 23
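One simple way to solve the penalized Lasso problem is proximal gradient descent (ISTA) with soft-thresholding. This is my own sketch, not an algorithm from the slides, and the synthetic data, seed, and λ are assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

# Sparse ground truth: only 2 of 10 features matter.
n, d = 100, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[0], w_true[3] = 3.0, -2.0
Y = X @ w_true + 0.1 * rng.normal(size=n)

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1: shrink toward 0, clip to exactly 0.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# ISTA for: sum_i (w^T X_i - Y_i)^2 + lam * sum_j |w_j|.
lam = 20.0
step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz const of gradient
w = np.zeros(d)
for _ in range(5000):
    grad = 2 * X.T @ (X @ w - Y)
    w = soft_threshold(w - step * grad, step * lam)

print(np.count_nonzero(w))  # most coordinates end up exactly zero
```

The soft-threshold step is what produces exact zeros, matching the feature-selection property on the next slide.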

Page 27: Training versus Testing and Linear Regularization

Solution property

- Convex optimization problem, but the solution may not be unique.
- The global solution can be found efficiently.
- The solution is sparse: some w_j will be zero, which achieves feature selection.
- The solution is not necessarily stable.

T. Zhang (Rutgers) Linear Regularization 22 / 23

Page 28: Training versus Testing and Linear Regularization

Summary

Training versus testing:
- find the prediction rule that best fits the training data
- performance is evaluated on test data

Model complexity: the difference between training and test error
- increasing model complexity decreases the training error but increases the difference between training and test error

Regularization:
- control model complexity by restricting the parameter space

Linear regularization: restrict the weights using ‖w‖_p ≤ A
- p = 2: ridge regression, giving a stable solution and allowing d ≥ n
- p = 1: Lasso (convex), giving a sparse solution that may be unstable

T. Zhang (Rutgers) Linear Regularization 23 / 23