
Regression / Calibration

MLR, RR, PCR, PLS

Paul Geladi

Head of Research NIRCE
Unit of Biomass Technology and Chemistry
Swedish University of Agricultural Sciences, Umeå
Technobothnia, Vasa
[email protected] [email protected]




Univariate regression

[Figure: straight line in the x-y plane, with offset a and slope b]

y = a + bx + e

[Figure: three fits of y against x: linear fit (underfit), overfit, quadratic fit]
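The underfit / overfit contrast can be reproduced with a small sketch. The data are hypothetical (a quadratic trend plus noise), and the polynomial degrees 1, 2 and 9 are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: quadratic trend plus measurement noise.
x = np.linspace(-1.0, 1.0, 12)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(scale=0.1, size=x.size)


def rmse_of_fit(degree):
    """Fit a polynomial of the given degree; return the RMSE on the fitted points."""
    coeffs = np.polyfit(x, y, degree)
    residual = y - np.polyval(coeffs, x)
    return float(np.sqrt(np.mean(residual**2)))


rmse_linear = rmse_of_fit(1)     # too simple for the data: underfit
rmse_quadratic = rmse_of_fit(2)  # matches the structure of the data
rmse_high = rmse_of_fit(9)       # flexible enough to follow the noise: overfit

print(rmse_linear, rmse_quadratic, rmse_high)
```

The fit error on the calibration points can only go down as the degree goes up, so it cannot reveal overfit by itself; that requires test data.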

Multivariate linear regression

y = f(x)

Works sometimes

y = f(x)

Works only for a few variables

Measurement noise!

∞ possible functions

[Figure: data matrix X with I rows (samples) and K columns (variables), and response vector y with I rows]

y = f(x)

Simplified by:

y = b0 + b1x1 + b2x2 + ... + bKxK + f

Linear approximation

Nomenclature

y : response
xk : predictors
bk : regression coefficients
b0 : offset, constant
f : residual

X, y mean-centered: b0 drops out

y = b1x1 + b2x2 + ... + bKxK + f

One such equation for each of the I samples.

In matrix form, with X (I × K), y (I × 1), b (K × 1) and f (I × 1):

y = Xb + f

X, y: known, measurable
b, f: unknown

No unique solution with both b and f unknown

f must be constrained

The MLR solution

Multiple Linear Regression

Ordinary Least Squares (OLS)

b = (X’X)^{-1} X’y
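A minimal NumPy sketch of the OLS formula, assuming mean-centered X and y with I ≥ K. The data and the coefficient vector b_true are hypothetical. The normal equations are solved directly instead of forming the inverse explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)

I, K = 20, 3                          # I samples, K predictors (I >= K)
X = rng.normal(size=(I, K))
b_true = np.array([1.0, -2.0, 0.5])   # hypothetical coefficients
y = X @ b_true + rng.normal(scale=0.05, size=I)

# Mean-center X and y so that the offset b0 drops out.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# b = (X'X)^{-1} X'y, obtained by solving (X'X) b = X'y.
b = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)

# Residual f = y - Xb; for OLS it is orthogonal to the columns of X.
f = yc - Xc @ b
print(b)
```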

Problems?

Least squares

3b1 + 4b2 = 1
4b1 + 5b2 = 0

One solution

3b1 + 4b2 = 1
4b1 + 5b2 = 0
b1 + b2 = 4

No solution

3b1 + 4b2 + b3 = 1
4b1 + 5b2 + b3 = 0

∞ solutions
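The three small systems above can be checked numerically. This is a sketch using `numpy.linalg.solve` for the square case and `numpy.linalg.lstsq` for the other two; the variable names are illustrative only.

```python
import numpy as np

# I = K: two equations, two unknowns -> one exact solution.
X_sq = np.array([[3.0, 4.0], [4.0, 5.0]])
y_sq = np.array([1.0, 0.0])
b_sq = np.linalg.solve(X_sq, y_sq)

# I > K: the third equation contradicts the first two -> no exact solution.
# Least squares returns the b that minimises the residual f'f.
X_over = np.array([[3.0, 4.0], [4.0, 5.0], [1.0, 1.0]])
y_over = np.array([1.0, 0.0, 4.0])
b_over, *_ = np.linalg.lstsq(X_over, y_over, rcond=None)
f_over = y_over - X_over @ b_over      # residual is not zero

# K > I: infinitely many exact solutions.
# lstsq picks the one with the smallest norm.
X_under = np.array([[3.0, 4.0, 1.0], [4.0, 5.0, 1.0]])
y_under = np.array([1.0, 0.0])
b_under, *_ = np.linalg.lstsq(X_under, y_under, rcond=None)

print(b_sq, b_over, b_under)
```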

b = (X’X)^{-1} X’y

- K > I: ∞ solutions
- I > K: no solution
- error in X
- error in y
- inverse may not exist
- inverse may be unstable

3b1 + 4b2 + e = 1
4b1 + 5b2 + e = 0
b1 + b2 + e = 4

Solution

Wanted solution

- I ≥ K
- No inverse
- No noise in X

Diagnostics

y = Xb + f

SStot = SSmod + SSres

R2 = SSmod / SStot = 1 - SSres / SStot

Coefficient of determination

SSres = f’f

RMSEC = [SSres / (I - A)]^{1/2}, with A the number of components in the model

Root Mean Squared Error of Calibration
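The two calibration diagnostics can be sketched as a small function, assuming mean-centered X and y so that SStot = y’y. The function name, the argument `n_components` (standing in for A) and the data are hypothetical. For a plain MLR fit, A is taken here as the number of fitted coefficients, which is an assumption.

```python
import numpy as np


def calibration_diagnostics(X, y, b, n_components):
    """Return (R2, RMSEC) for mean-centered X, y and coefficients b."""
    f = y - X @ b                       # residual
    ss_res = float(f @ f)               # SSres = f'f
    ss_tot = float(y @ y)               # SStot, valid because y is centered
    r2 = 1.0 - ss_res / ss_tot
    rmsec = float(np.sqrt(ss_res / (len(y) - n_components)))
    return r2, rmsec


rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=30)
X = X - X.mean(axis=0)
y = y - y.mean()

b, *_ = np.linalg.lstsq(X, y, rcond=None)
r2, rmsec = calibration_diagnostics(X, y, b, n_components=2)
print(r2, rmsec)
```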

Alternatives to MLR/OLS

Ridge Regression (RR)

b = (X’X)^{-1} X’y

The identity matrix I is the easiest matrix to invert

b = (X’X + kI)^{-1} X’y

k (ridge constant) as small as possible

Problems

- Choice of ridge constant

- No diagnostics
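A short sketch of the ridge formula on two nearly collinear predictors, where X’X is close to singular. The data and the ridge constant k = 0.1 are hypothetical; choosing k is exactly the open problem listed above.

```python
import numpy as np


def ridge_coefficients(X, y, k):
    """b = (X'X + kI)^{-1} X'y, with I the K x K identity matrix."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(K), X.T @ y)


rng = np.random.default_rng(3)
x1 = rng.normal(size=30)
x2 = x1 + rng.normal(scale=1e-3, size=30)    # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=30)

X = X - X.mean(axis=0)
y = y - y.mean()

b_ols = ridge_coefficients(X, y, k=0.0)      # plain OLS, unstable here
b_rr = ridge_coefficients(X, y, k=0.1)       # shrunk, stable coefficients

print(b_ols, b_rr)
```

With k = 0 the two coefficients can take large values of opposite sign; the ridge constant shrinks them toward the stable solution near (1, 1).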

Principal Component Regression (PCR)

- I ≥ K
- Easy inversion

[Figure: PCA reduces X (K columns) to the score matrix T (A columns)]

- A ≤ I
- T orthogonal
- Noise in X removed

y = Td + f

d = (T’T)^{-1} T’y

Problem

How many components used?

Advantage

- PCA done on data
- Outliers
- Classes
- Noise in X removed
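PCR can be sketched in a few lines, assuming mean-centered X and y. Here the PCA step is done with a singular value decomposition, which is one common choice and not necessarily the one used in the slides. The function name and data are hypothetical.

```python
import numpy as np


def pcr_coefficients(X, y, A):
    """Regress y on the first A principal component scores of X."""
    # PCA of X via SVD: rows of Vt are the loading vectors.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:A].T                            # loadings, K x A
    T = X @ P                               # scores, I x A, orthogonal columns
    d = np.linalg.solve(T.T @ T, T.T @ y)   # d = (T'T)^{-1} T'y
    b = P @ d                               # coefficients for the K variables
    return b, T


rng = np.random.default_rng(4)
X = rng.normal(size=(25, 4))
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=25)
X = X - X.mean(axis=0)
y = y - y.mean()

b_two, T_two = pcr_coefficients(X, y, A=2)     # reduced model
b_full, T_full = pcr_coefficients(X, y, A=4)   # A = K reproduces OLS
print(b_two, b_full)
```

Because T is orthogonal, T’T is diagonal and its inversion is trivial, which is the "easy inversion" property listed above.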

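Partial Least Squares Regression

[Figure: PLS decomposition with A components. X is decomposed into scores t, with weights w’ and loadings p’; Y is decomposed into scores u, with loadings q’. Outer relationship: X to t and Y to u. Inner relationship: t to u.]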

Advantages

- X decomposed
- Y decomposed
- Noise in X left out
- Noise in Y left out

PCR, PLS are one component at a time methods

After each component, a residual is calculated

The next component is calculated on the residual
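The one-component-at-a-time idea can be sketched for the single-response case (often called PLS1), assuming mean-centered X and y. This is a textbook NIPALS-style variant, not necessarily the exact algorithm behind the slides, and the function name and data are hypothetical.

```python
import numpy as np


def pls1_coefficients(X, y, A):
    """PLS regression for one response, one component at a time."""
    X_res = X.copy()                 # residual of X after each component
    y_res = y.copy()                 # residual of y after each component
    I, K = X.shape
    W = np.zeros((K, A))             # weights
    P = np.zeros((K, A))             # X loadings
    q = np.zeros(A)                  # y loadings
    T = np.zeros((I, A))             # scores
    for a in range(A):
        w = X_res.T @ y_res          # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = X_res @ w                # score vector for this component
        tt = t @ t
        p = X_res.T @ t / tt
        q[a] = y_res @ t / tt
        # Deflation: the next component is calculated on these residuals.
        X_res = X_res - np.outer(t, p)
        y_res = y_res - q[a] * t
        W[:, a], P[:, a], T[:, a] = w, p, t
    b = W @ np.linalg.solve(P.T @ W, q)   # coefficients for the original X
    return b, T


rng = np.random.default_rng(6)
X = rng.normal(size=(25, 4))
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=25)
X = X - X.mean(axis=0)
y = y - y.mean()

b_pls, T = pls1_coefficients(X, y, A=4)
print(b_pls)
```

With A equal to the number of variables K and full-rank X, the result coincides with the OLS coefficients; fewer components give a regularised solution.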

Another view

y = Xb + f

y = XbRR + fRR

y = XbPCR + fPCR

y = XbPLS + fPLS

[Figure: regression vectors in the space of b1, b2, b3, showing the OLS vector, a shrunk and rotated vector, a regression vector with too much shrinkage, and the subspace of useful regression vectors]

Prediction

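[Figure: calibration set Xcal (I × K) with ycal (I × 1); test set Xtest (J × K) with ytest (J × 1); the model gives yhat for the test set]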

Prediction diagnostics

yhat = Xtest b

ftest = ytest - yhat

PRESS = ftest’ftest

RMSEP = [PRESS / J]^{1/2}

Root Mean Squared Error of Prediction

R2test = Q2 = 1 - ftest’ftest / ytest’ytest
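A sketch of the prediction diagnostics on a tiny hand-checkable test set. The function name, the coefficient vector b and the data are hypothetical; ytest is assumed to be centered, as in the formula for Q2.

```python
import numpy as np


def prediction_diagnostics(X_test, y_test, b):
    """Return (RMSEP, Q2) for a test set and a fixed coefficient vector b."""
    y_hat = X_test @ b
    f_test = y_test - y_hat
    press = float(f_test @ f_test)                    # PRESS = ftest'ftest
    rmsep = float(np.sqrt(press / len(y_test)))       # J = number of test samples
    q2 = 1.0 - press / float(y_test @ y_test)
    return rmsep, q2


b = np.array([1.0, 2.0])
X_test = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-2.0, -2.0]])
y_test = np.array([1.5, 2.0, 2.5, -6.0])              # sums to zero (centered)

rmsep, q2 = prediction_diagnostics(X_test, y_test, b)
print(rmsep, q2)
```

Here the test residuals are (0.5, 0, -0.5, 0), so PRESS = 0.5 and RMSEP = (0.5 / 4)^{1/2}.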

Some rules of thumb

R2 > 0.65 5 PLS comp.

R2test > 0.5

R2 - R2test < 0.2

Bias

f = y - Xb

The calibration residual f always has 0 bias

ftest = ytest - yhat

bias = (1/J) Σ ftest (sum over the J test samples)
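The difference between calibration and test bias can be illustrated with a small sketch. The data are hypothetical, and the test responses are given a deliberate systematic offset of 0.3 so that a nonzero bias appears.

```python
import numpy as np

rng = np.random.default_rng(5)
b_true = np.array([1.0, -1.0])                 # hypothetical coefficients

# Calibration set, mean-centered before fitting.
X_cal = rng.normal(size=(15, 2))
y_cal = X_cal @ b_true + rng.normal(scale=0.05, size=15)
x_mean, y_mean = X_cal.mean(axis=0), y_cal.mean()
Xc, yc = X_cal - x_mean, y_cal - y_mean

b, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
f = yc - Xc @ b
bias_cal = f.sum() / len(f)                    # zero up to rounding error

# Test set with a systematic offset of 0.3 in the measured responses.
X_test = rng.normal(size=(10, 2))
y_test = X_test @ b_true + 0.3
y_hat = y_mean + (X_test - x_mean) @ b         # predict with calibration means
f_test = y_test - y_hat
bias = f_test.sum() / len(f_test)              # bias = (1/J) * sum of ftest

print(bias_cal, bias)
```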

Leverage - influence

b = (X’X)^{-1} X’y

yhat = Xb = X(X’X)^{-1} X’y = Hy

H: the hat matrix

Diagonal elements of H: leverage
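A sketch of the hat matrix and its diagonal, assuming mean-centered X with I ≥ K. The data are hypothetical; the first sample is placed far from the others on purpose so that it receives the highest leverage.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(10, 2))
X[0] = [8.0, 8.0]                       # one extreme sample
X = X - X.mean(axis=0)

# H = X (X'X)^{-1} X', computed without forming the inverse explicitly.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)                   # one leverage value per sample

print(leverage)
```

The leverages sum to K, the number of columns of X, so a single value close to 1 marks a sample that largely determines its own prediction.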

Leverage - influence

[Figure: residual plots of ftest against ypred around 0, showing unbiased, biased, outlier, small variance, large variance and heteroscedastic cases]

Residual

-Check histogram f

-Check variablewise E

-Check objectwise E

[Figure: panels E, F, G of predicted response against measured response, showing a heteroscedastic case, an outlier by extrapolation and a bad outlier]

[Figure: PLS decomposition of X and Y with A components (scores t, u; weights w’; loadings p’, q’)]

Plotting: line plots

Scree plot RMSEC, RMSECV, RMSEP

Loading plot against wavelength

Score plot against time

Residual against sample

Residual against yhat

T2 against sample

H against sample

Plotting: scatter plots 2D, 3D

Score plot

Loading plot

Biplot

H against residual

Inner relation t - u

Weights w, q

Nonlinearities

[Figure: panels A to E of y against x: linear, weak nonlinear, strong nonlinear, non-monotonic, linear approximations]

Remedies for nonlinearities. Making nonlinear data fit a linear model or making the model nonlinear.

-Fundamental theory (e.g. going from transmittance to absorbance)

-Use extra latent variables in PCR or PLSR

-Use transformations of latent variables

-Remove disturbing variables

-Find subsets that behave linearly


-Use intrinsically nonlinear methods

-Locally transform variables X, y, or both nonlinearly (powers, logarithms, adding powers)

-Transformation in a neighbourhood (window methods)

-Use global transformations (Fourier, Wavelet)

-GIFI type discretization
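The first remedy, using fundamental theory, can be illustrated with the transmittance to absorbance example. The sketch assumes ideal Beer-Lambert behaviour; the concentrations and the constant 0.8 (absorptivity times path length) are hypothetical.

```python
import numpy as np

c = np.linspace(0.1, 2.0, 10)              # hypothetical concentrations
transmittance = 10.0 ** (-0.8 * c)         # nonlinear in concentration
absorbance = -np.log10(transmittance)      # linear in concentration


def r2_of_linear_fit(x, y):
    """R2 of the univariate fit y = a + bx."""
    slope, offset = np.polyfit(x, y, 1)
    residual = y - (offset + slope * x)
    centered = y - y.mean()
    return 1.0 - float(residual @ residual) / float(centered @ centered)


r2_transmittance = r2_of_linear_fit(transmittance, c)
r2_absorbance = r2_of_linear_fit(absorbance, c)
print(r2_transmittance, r2_absorbance)
```

After the transformation a linear model describes the data exactly, whereas the same model on raw transmittance leaves a systematic curved residual.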
