
Post on 31-Dec-2015


DEFINING MULTIVARIATE CALIBRATION MODEL COMPLEXITY FOR MODEL SELECTION AND COMPARISON

John Kalivas

Department of Chemistry, Idaho State University, Pocatello, Idaho

MULTIVARIATE CALIBRATION MODEL

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$$

$\mathbf{y}$ ($m \times 1$): quantitative information of prediction property for $m$ samples

$\mathbf{X}$ ($m \times p$): respective values for $p$ predictor variables (wavelengths for spectral data)

$\mathbf{b}$ ($p \times 1$): unknown regression coefficients

$\mathbf{e}$ ($m \times 1$): errors with mean zero and covariance $\sigma^2\mathbf{I}$

Prediction for an unknown sample:

$$\hat{y}_{\mathrm{unk}} = \mathbf{x}_{\mathrm{unk}}^T \hat{\mathbf{b}}$$

REGRESSION VECTOR SOLUTION

$$\min_{\mathbf{b}} \|\mathbf{X}\mathbf{b} - \mathbf{y}\|_2^2$$

MLR solution, requires $m \ge p$ (variable selection) and nearly orthogonal X:

$$\hat{\mathbf{b}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

Biased regression methods require selection of meta-parameter(s):

$$\hat{\mathbf{b}} = \mathbf{X}^{+}\mathbf{y}$$
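As a minimal sketch of the MLR solution above, the normal-equations form and the pseudoinverse form can be compared on simulated data. The data, dimensions, and variable names below are illustrative assumptions and are not taken from the presentation.

```python
import numpy as np

# Simulated calibration data (assumed for illustration only)
rng = np.random.default_rng(0)
m, p = 20, 4                       # m >= p, as MLR requires
X = rng.normal(size=(m, p))
b_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ b_true + 0.01 * rng.normal(size=m)

# Normal-equations form: b = (X^T X)^{-1} X^T y
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudoinverse form: b = X^+ y
b_pinv = np.linalg.pinv(X) @ y

# Prediction for an unknown sample: y_unk = x_unk^T b
x_unk = rng.normal(size=p)
y_unk = x_unk @ b_pinv
```

When X has full column rank the two forms agree; the pseudoinverse form remains defined when it does not.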

BIASED MODELING METHODS

PLS

PCR

Ridge regression (RR)

Generalized RR

Cyclic subspace regression

Continuum regression

Ridge PCR and PLS

Generalized ridge PCR and PLS

Etc.

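GENERIC EXPRESSION

$$\hat{\mathbf{b}} = \mathbf{X}^{+}\mathbf{y}$$

$$\hat{\mathbf{b}} = \mathbf{V}\mathbf{F}\boldsymbol{\Sigma}^{-1}\mathbf{U}^T\mathbf{y} \quad \text{where} \quad \mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$$

$$\hat{\mathbf{b}} = \sum_{i=1}^{k} f_i \frac{\mathbf{u}_i^T\mathbf{y}}{\sigma_i}\mathbf{v}_i = \sum_{i=1}^{k} \beta_i \mathbf{v}_i$$

where $k = \mathrm{rank}(\mathbf{X}) \le \min(m, p)$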
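The generic expression can be sketched as a small function that takes a vector of filter values and returns the regression vector. The function name `filtered_regression` and the simulated data are hypothetical, chosen only for illustration.

```python
import numpy as np

def filtered_regression(X, y, f):
    """Return b = sum_i f_i (u_i^T y / sigma_i) v_i using the SVD X = U S V^T.

    `f` holds one filter value per retained singular vector (hypothetical helper).
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = np.linalg.matrix_rank(X)          # k = rank(X) <= min(m, p)
    U, s, Vt = U[:, :k], s[:k], Vt[:k]
    beta = np.asarray(f)[:k] * (U.T @ y) / s   # weights in the eigenvector basis
    return Vt.T @ beta

# Illustrative data (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(15, 6))
y = rng.normal(size=15)

# With every filter value equal to 1 the expression reduces to b = X^+ y
b_ls = filtered_regression(X, y, np.ones(6))
```

Different modeling methods then differ only in the vector `f` that is passed in.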

PCR, RR, AND PLS FILTER VALUES

$$\hat{\mathbf{b}} = \sum_{i=1}^{k} f_i \frac{\mathbf{u}_i^T\mathbf{y}}{\sigma_i}\mathbf{v}_i$$

PCR: $f_i = 1$ for retained basis vectors and $f_i = 0$ for deleted basis vectors

RR: $0 \le f_i \le 1$ depending on ridge value

PLS: $0 \le f_i < \infty$ depending on PLS factor model
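To illustrate the PCR case, the sketch below checks that 0/1 filter values give the same regression vector as an ordinary least-squares fit on the first d principal component scores. The data and the choice d = 3 are assumptions made for the example.

```python
import numpy as np

# Illustrative data (assumed)
rng = np.random.default_rng(1)
X = rng.normal(size=(25, 6))
y = rng.normal(size=25)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# PCR with d retained factors: f_i = 1 for retained, 0 for deleted basis vectors
d = 3
f_pcr = np.r_[np.ones(d), np.zeros(len(s) - d)]
b_filter = Vt.T @ (f_pcr * (U.T @ y) / s)

# Same model obtained by regressing y on the first d scores T = X V_d
T = X @ Vt[:d].T
q = np.linalg.lstsq(T, y, rcond=None)[0]
b_scores = Vt[:d].T @ q
```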

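RR AND PLS FILTER VALUES

$$\hat{\mathbf{b}} = \sum_{i=1}^{k} f_i \frac{\mathbf{u}_i^T\mathbf{y}}{\sigma_i}\mathbf{v}_i$$

RR:

$$f_i = \frac{\sigma_i^2}{\sigma_i^2 + \lambda}, \quad \lambda \ge 0$$

PLS:

$$f_i^{(d)} = 1 - \prod_{j=1}^{d}\left(1 - \frac{\sigma_i^2}{\theta_j^{(d)}}\right) \quad \text{for the } d\text{th factor model}$$

$\theta_j^{(d)}$ are the eigenvalues of $\mathbf{X}^T\mathbf{X}$ restricted to the Krylov subspace

$$\mathcal{K}_d(\mathbf{X}^T\mathbf{X}, \mathbf{X}^T\mathbf{y}) = \mathrm{span}\left\{\mathbf{X}^T\mathbf{y},\ \mathbf{X}^T\mathbf{X}\mathbf{X}^T\mathbf{y},\ \ldots,\ (\mathbf{X}^T\mathbf{X})^{d-1}\mathbf{X}^T\mathbf{y}\right\}$$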
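A sketch of both filter formulas follows, assuming the ridge value enters as an unsquared λ and that the PLS regression vector is the least-squares solution restricted to the Krylov subspace. Each filtered solution is compared with a direct solve. Data, λ, and d are illustrative assumptions.

```python
import numpy as np

# Illustrative data (assumed)
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 8))
y = rng.normal(size=30)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
c = (U.T @ y) / s                       # unfiltered weights u_i^T y / sigma_i

# RR filter values: f_i = sigma_i^2 / (sigma_i^2 + lambda)
lam = 5.0
f_rr = s**2 / (s**2 + lam)
b_rr = Vt.T @ (f_rr * c)
b_rr_direct = np.linalg.solve(X.T @ X + lam * np.eye(8), X.T @ y)

# PLS filter values for a d factor model via Ritz values theta_j
d = 3
A, g = X.T @ X, X.T @ y
K = np.column_stack([np.linalg.matrix_power(A, j) @ g for j in range(d)])
Q, _ = np.linalg.qr(K)                  # orthonormal basis of the Krylov subspace
theta = np.linalg.eigvalsh(Q.T @ A @ Q) # eigenvalues of X^T X restricted to it
f_pls = 1.0 - np.prod(1.0 - s[:, None]**2 / theta[None, :], axis=1)
b_pls = Vt.T @ (f_pls * c)

# Direct solution: least squares with b constrained to the Krylov subspace
b_pls_direct = Q @ np.linalg.solve(Q.T @ A @ Q, Q.T @ g)
```

The explicit power basis is used only for readability; for larger d a Lanczos-type recursion would be numerically safer.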

A CALIBRATION ASSESSMENT PROBLEM

$s^2$ (the estimate of $\sigma^2$) is given by MSEC:

$$\mathrm{RMSEC} = \sqrt{\frac{\sum_{i}(y_i - \hat{y}_i)^2}{m - df}}$$

Need degrees of freedom or fitting degrees of freedom (df)

df = p for the particular MLR model, which requires m > p
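The role of df in RMSEC can be shown with a short helper. The function name `rmsec` and the numbers in the example are hypothetical.

```python
import numpy as np

def rmsec(y, y_hat, df):
    """Root mean square error of calibration with m - df in the denominator."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    m = len(y)
    if m <= df:
        raise ValueError("need m > df for a defined RMSEC")
    return np.sqrt(np.sum((y - y_hat) ** 2) / (m - df))

# Made-up values: the same residuals give a larger RMSEC as df grows
y_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_fit = [1.1, 1.9, 3.2, 3.8, 5.0]
rmsec_df2 = rmsec(y_obs, y_fit, df=2)
rmsec_df4 = rmsec(y_obs, y_fit, df=4)
```

Because df sits in the denominator, any ambiguity in how df is counted changes the reported RMSEC directly.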

MORE df PROBLEMS

df = d, the number of factors (basis vectors) for PCR and PLS, but models can be represented in any basis set:

$$\hat{\mathbf{b}} = \mathbf{B}\boldsymbol{\beta}$$

The same model in different basis sets requires a different number of basis vectors

RR and others are not factor based and/or use multiple meta-parameters

ANOTHER CALIBRATION ASSESSMENT PROBLEM

Useful to plot results for different modeling methods on one plot

Example: a plot of RMSEV against number of factors (basis vectors) is possible for PCR and PLS

RR cannot be included in the plot

Still have improper comparison of PCR and PLS as factors are in different basis sets

NECESSITY

Effective-rank (ER) for inter-model comparison of $\hat{\mathbf{b}}$ from $\mathbf{y} = \mathbf{X}\mathbf{b}$, where $\hat{\mathbf{b}}$ is from factor based methods such as PCR or PLS, non-factor based methods such as RR, and/or methods based on multiple meta-parameters

Smaller ER, more parsimonious model?


SOLUTIONS

Develop ER in a common basis set using information on how the basis vectors are used

Develop ER that is basis set independent

COMMON BASIS SET

Use filter values ($f_i$) in eigenvector basis set V:

$$\hat{\mathbf{b}} = \sum_{i} f_i \frac{\mathbf{u}_i^T\mathbf{y}}{\sigma_i}\mathbf{v}_i$$

$$f\mathrm{ER} = \sum_{i} f_i$$

Gilliam, et al., Inverse Problems, 6 (1990) 725

$$f\mathrm{ER} = \mathrm{tr}(\mathbf{X}\mathbf{X}^{+}) = \mathrm{tr}(\mathbf{H}) \quad \text{where} \quad \hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$$

Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, (1998)
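For a ridge model the two fER definitions can be checked against each other numerically. This is a minimal sketch with assumed data and an assumed ridge value, with λ entering unsquared.

```python
import numpy as np

# Illustrative data (assumed)
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))
lam = 2.0

# fER as the sum of the ridge filter values
s = np.linalg.svd(X, compute_uv=False)
fer_sum = np.sum(s**2 / (s**2 + lam))

# fER as the trace of the hat matrix H, where y_hat = H y
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(5), X.T)
fer_trace = np.trace(H)
```

Both values fall below p = 5 here, reflecting the shrinkage applied by the ridge value.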

BASIS SET INDEPENDENT

$h_i$ = change in the fitted value $\hat{y}_i$ depending on the change in the observed value $y_i$

The larger the $h_i$, the more $\hat{y}_i$ will change if $y_i$ changes (fluctuations around the expected value due to random noise)

Add normally distributed noise $\boldsymbol{\delta}$ to $\mathbf{y}$, N times

Obtain $\hat{\mathbf{y}}$ vectors from models with perturbed $\mathbf{y}$

Calculate $\hat{h}_i$ for the ith sample (sensitivity of a fitted value to perturbation in the respective observed value) as the regression slope to:

$$\hat{y}_{i,n} = \alpha_i + \hat{h}_i\,\delta_{i,n}, \quad n = 1, \ldots, N$$

$$\mathrm{ER}_{\mathrm{GDF}} = \sum_{i=1}^{m} \hat{h}_i$$

Ye, Journal of the American Statistical Association, 93 (1998) 120
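The perturbation procedure can be sketched as a Monte Carlo estimate. The helper name `gdf`, the noise level, the number of perturbations, and the ridge model used as the fitting method are all assumptions for illustration. For a linear fit $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$ the estimate should land close to tr(H).

```python
import numpy as np

def gdf(fit_predict, y, n_perturb=2000, noise_sd=0.1, seed=0):
    """Monte Carlo estimate of the sum of sensitivities h_i (hypothetical helper).

    fit_predict: callable mapping a response vector to fitted values.
    """
    rng = np.random.default_rng(seed)
    m = len(y)
    delta = rng.normal(scale=noise_sd, size=(n_perturb, m))   # N noise vectors
    y_hat = np.array([fit_predict(y + d) for d in delta])     # shape (N, m)
    # Regression slope of fitted value i on perturbation i, for each sample
    dc = delta - delta.mean(axis=0)
    yc = y_hat - y_hat.mean(axis=0)
    h = (dc * yc).sum(axis=0) / (dc**2).sum(axis=0)
    return h.sum()

# Illustrative ridge model (assumed data and ridge value)
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
y = rng.normal(size=30)
H = X @ np.linalg.solve(X.T @ X + 1.0 * np.eye(5), X.T)
er_gdf = gdf(lambda yy: H @ yy, y)
```

Since only fitted values are needed, the same function could wrap PLS, PCR, or any other fitting routine.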

BASIS SET INDEPENDENT

$$\mathrm{ER}_{\mathrm{VDV}} = m\left(1 - \sqrt{\frac{\mathrm{SSE}/m}{\mathrm{RMSECV}^2}}\right)$$

Van der Voet, Journal of Chemometrics, 13 (1999) 111

$\mathrm{ER}_{\mathrm{VDV}}$ is based on error estimates, which themselves contain error

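BASIS SET INDEPENDENT

Know $\hat{\mathbf{b}} = \mathbf{V}\boldsymbol{\beta} = \mathbf{T}\boldsymbol{\delta} = \mathbf{I}\boldsymbol{\gamma}$ for eigenvector basis set V, PLS basis set T, and std. basis set I, with $\boldsymbol{\beta}$, $\boldsymbol{\delta}$, and $\boldsymbol{\gamma}$ being respective weight vectors for a model in that basis set

Eigenvector basis set:

$$\|\boldsymbol{\beta}\|_2^2 = \sum \beta_i^2 = \|\hat{\mathbf{b}}\|_2^2$$

PLS basis set:

$$\|\boldsymbol{\delta}\|_2^2 = \sum \delta_i^2 = \|\hat{\mathbf{b}}\|_2^2$$

Std. basis set:

$$\|\boldsymbol{\gamma}\|_2^2 = \sum \gamma_i^2 = \|\hat{\mathbf{b}}\|_2^2$$

$$\mathrm{ER}_{\hat{\mathbf{b}}} = \mathrm{rank}(\mathbf{X})\,\frac{\|\hat{\mathbf{b}}\|_2^2}{\|\hat{\mathbf{b}}_{\mathrm{LS}}\|_2^2}$$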

DATA SETS

CARBONIC ANHYDRASE (CA) INHIBITORS: CA contributes to production of eye humor, which with excess secretion causes permanent damage and diseases (macular edema and open-angle glaucoma).

142 compounds assayed for inhibition of CA isozymes CA I, CA II, & CA IV. Inhibition values Log(Ki) modeled with 63 (full) & 8 (subset) descriptors deemed best for an ANN (Mattioni & Jurs, J. Chem. Inf. Comput. Sci., 42 (2002) 94)

DIHYDROFOLATE REDUCTASE (DHFR) INHIBITORS: inhibition of DHFR is important in combating diseases from the pathogens Pneumocystis carinii (pc) and Toxoplasma gondii (tg) in unhealthy immune systems.

334, 320, & 340 compounds assayed for inhibition of (pc) DHFR, (tg) DHFR, & mammalian standard rlDHFR. Log of 50% inhibition concentration values (IC50) modeled with 84, 83, & 84 (full) & 10 (subset) descriptors deemed best for an ANN (Mattioni & Jurs, J. Mol. Graphics Modeling, 21 (2002) 391)

CA IV (FULL): PLS & PCR RMSEC (df = d) AGAINST d

[Plot: RMSEC against number of factors d]

PCR (df = d = fER), PLS (df = d), & PLS (df = fER)

[Plot: RMSEC curves for the three df choices]

PCR & PLS (df = fER) AGAINST fER

[Plot: x-axis fER]

PCR, PLS, & RR (df = fER)

[Plot: x-axis fER]

CA IV (FULL): PLS & PCR RMSEV AGAINST d

[Plot: RMSEV against number of factors d]

PLS & PCR AGAINST fER

[Plot: x-axis fER]

PLS, PCR, & RR AGAINST fER

[Plot: x-axis fER]

BIAS/VARIANCE CONSIDERATION

[Plot: prediction error against model complexity, with bias and variance curves]

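GENERAL TIKHONOV REGULARIZATION

$$\min_{\mathbf{b}}\left(\|\mathbf{X}\mathbf{b} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{L}\mathbf{b}\|_2^2\right)$$

$$\hat{\mathbf{b}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{L}^T\mathbf{L})^{-1}\mathbf{X}^T\mathbf{y}$$

λ is a meta-parameter that must be optimized

L is a matrix of values, usually a derivative operator:

$$\mathbf{L}_1 = \begin{pmatrix} 1 & -1 & & \\ & \ddots & \ddots & \\ & & 1 & -1 \end{pmatrix}_{(p-1)\times p} \qquad \mathbf{L}_2 = \begin{pmatrix} 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & -2 & 1 \end{pmatrix}_{(p-2)\times p}$$

Tikhonov, Soviet Math. Dokl., 4 (1963) 1035

L can be the spectral error covariance matrix for removal of undesired spectral variation (wavelength selection)

Kalivas, Anal. Chim. Acta, 505 (2004) 9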
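A minimal sketch of Tikhonov regularization with derivative operators follows. The helper names, the sign convention of the difference rows, the sizes (p−1)×p and (p−2)×p, and the unsquared λ are assumptions. The closed-form solution is compared with the equivalent augmented least-squares problem.

```python
import numpy as np

def first_derivative_operator(p):
    """(p-1) x p matrix with rows [1, -1] (hypothetical helper)."""
    return np.eye(p - 1, p) - np.eye(p - 1, p, k=1)

def second_derivative_operator(p):
    """(p-2) x p matrix with rows [1, -2, 1] (hypothetical helper)."""
    return np.eye(p - 2, p) - 2 * np.eye(p - 2, p, k=1) + np.eye(p - 2, p, k=2)

def tikhonov(X, y, L, lam):
    """Solve (X^T X + lam L^T L) b = X^T y."""
    return np.linalg.solve(X.T @ X + lam * L.T @ L, X.T @ y)

# Illustrative data (assumed)
rng = np.random.default_rng(5)
X = rng.normal(size=(12, 6))
y = rng.normal(size=12)
L1 = first_derivative_operator(6)
lam = 0.5
b_tik = tikhonov(X, y, L1, lam)

# Equivalent stacked problem: min || [X; sqrt(lam) L] b - [y; 0] ||^2
X_aug = np.vstack([X, np.sqrt(lam) * L1])
y_aug = np.concatenate([y, np.zeros(L1.shape[0])])
b_aug = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]
```

The stacked form avoids forming X^T X explicitly, which can help when X is ill-conditioned.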

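STANDARDIZED TIKHONOV REGULARIZATION

General form:

$$\min_{\mathbf{b}}\left(\|\mathbf{X}\mathbf{b} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{L}\mathbf{b}\|_2^2\right)$$

Standard form:

$$\min_{\bar{\mathbf{b}}}\left(\|\bar{\mathbf{X}}\bar{\mathbf{b}} - \bar{\mathbf{y}}\|_2^2 + \lambda\|\bar{\mathbf{b}}\|_2^2\right) \quad \text{where } \bar{\mathbf{X}}, \bar{\mathbf{b}}, \bar{\mathbf{y}} \text{ are } \mathbf{X}, \mathbf{b}, \mathbf{y} \text{ in std. form}$$

$$\hat{\bar{\mathbf{b}}} = (\bar{\mathbf{X}}^T\bar{\mathbf{X}} + \lambda\mathbf{I})^{-1}\bar{\mathbf{X}}^T\bar{\mathbf{y}}$$

Back-transform $\hat{\bar{\mathbf{b}}}$ to $\hat{\mathbf{b}}$ in general form

$$\|\hat{\bar{\mathbf{b}}}\|_2^2 = \|\mathbf{L}\hat{\mathbf{b}}\|_2^2 \quad \text{and} \quad \|\bar{\mathbf{X}}\hat{\bar{\mathbf{b}}} - \bar{\mathbf{y}}\|_2^2 = \|\mathbf{X}\hat{\mathbf{b}} - \mathbf{y}\|_2^2$$

Hansen, Rank-Deficient and Discrete Ill-Posed Problems, SIAM, (1998)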

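STANDARDIZED TIKHONOV REGULARIZATION

Simple case: L is square and invertible

$$\bar{\mathbf{y}} = \mathbf{y}, \quad \bar{\mathbf{X}} = \mathbf{X}\mathbf{L}^{-1}, \quad \text{and} \quad \bar{\mathbf{b}} = \mathbf{L}\mathbf{b}$$

$$\hat{\bar{\mathbf{b}}} = (\bar{\mathbf{X}}^T\bar{\mathbf{X}} + \lambda\mathbf{I})^{-1}\bar{\mathbf{X}}^T\bar{\mathbf{y}}$$

$$\hat{\mathbf{b}} = \mathbf{L}^{-1}\hat{\bar{\mathbf{b}}}$$

L = I gives RR:

$$\hat{\mathbf{b}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$$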
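The simple case can be verified numerically: solving the standard-form ridge problem and back-transforming should reproduce the general-form Tikhonov solution. The diagonal L, the data, and λ below are assumed for illustration.

```python
import numpy as np

# Illustrative data and an invertible L (assumed)
rng = np.random.default_rng(6)
X = rng.normal(size=(15, 5))
y = rng.normal(size=15)
L = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])
lam = 0.7

# Standard form: X_bar = X L^{-1}, y_bar = y, then ordinary ridge regression
X_bar = X @ np.linalg.inv(L)
b_bar = np.linalg.solve(X_bar.T @ X_bar + lam * np.eye(5), X_bar.T @ y)

# Back-transformation: b = L^{-1} b_bar
b_back = np.linalg.solve(L, b_bar)

# General form solved directly for comparison
b_general = np.linalg.solve(X.T @ X + lam * L.T @ L, X.T @ y)
```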

HARMONIOUS (PARETO) PLOT

For graphical characterization of Tikhonov regularization, plot a variance indicator against a bias criterion to reduce the chance of overfitting or underfitting

Curve will have an L-shape (L-curve)

Ideal model at corner with the proper bias/variance trade-off (harmonious model)

PCR and PLS: best number of factors

RR: best ridge value

etc.

Intra- and inter-model comparison

Lawson, et al., Solving Least-Squares Problems, Prentice-Hall, (1974)

EXAMPLE PLOT

[Plot: L-curve of $\|\hat{\mathbf{b}}\|_2$ against $\|\hat{\mathbf{y}} - \mathbf{y}\|_2$, with the overfitting region, the underfitting region, and the best model at the corner marked]
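The points of such a harmonious curve can be generated for RR by sweeping the ridge value and recording a variance indicator and a bias criterion for each model. The data and the grid of ridge values in this sketch are assumptions, and plotting is left out.

```python
import numpy as np

# Illustrative data (assumed)
rng = np.random.default_rng(7)
X = rng.normal(size=(40, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=40)

lams = np.logspace(-3, 3, 25)           # increasing ridge values
norms, resid = [], []
for lam in lams:
    b = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)
    norms.append(np.linalg.norm(b))          # variance indicator ||b||_2
    resid.append(np.linalg.norm(X @ b - y))  # bias criterion ||y_hat - y||_2
norms, resid = np.array(norms), np.array(resid)
```

As the ridge value grows, the regression vector norm shrinks while the residual norm rises, which traces out the L-shape.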

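VARIANCE EXPRESSIONS

Faber, et al., Journal of Chemometrics, 11 (1997) 181

$$V(\hat{y}_{\mathrm{unk}}) = \left(\frac{1}{m} + h_{\mathrm{unk}}\right)\left(s_e^2 + s_y^2 + \|\hat{\mathbf{b}}\|_2^2 s_X^2\right) + s_e^2 + \|\hat{\mathbf{b}}\|_2^2 s_{x_{\mathrm{unk}}}^2$$

$$s^2 = s_e^2 + s_y^2 + \|\hat{\mathbf{b}}\|_2^2 s_X^2 = \frac{\sum (\hat{y}_i - y_i)^2}{df_{\mathrm{eff}}}$$

Lorber, et al., Journal of Chemometrics, 2 (1988) 93

$$V(\hat{y}_{\mathrm{unk}}) = s^2 h_{\mathrm{unk}} + \|\hat{\mathbf{b}}\|_2^2 s_{x_{\mathrm{unk}}}^2$$

$$s^2 = \frac{\sum (\hat{y}_i - y_i)^2}{df}$$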

EXPERIMENTAL APPROACH

Intra- and inter-model comparison of RR, PLS, and PCR with QSAR data for the most harmonious and parsimonious models

LOOCV tends to overfit

Use mean values from LMOCV

Data sets randomly split 300 times with v validation and m – v calibration samples where v ≈ 0.6m

Shao, J., J. Am. Statist. Assoc., 88 (1993) 486-494
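The splitting scheme can be sketched as below for a ridge model. The function name `lmocv_rmsev`, the simulated data, and the absence of mean centering are assumptions of this example and may differ from the preprocessing actually used.

```python
import numpy as np

def lmocv_rmsev(X, y, lam, n_splits=300, val_frac=0.6, seed=0):
    """Mean RMSEV over random leave-multiple-out splits (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    m, p = X.shape
    v = int(round(val_frac * m))            # v validation samples, m - v calibration
    rmsev = []
    for _ in range(n_splits):
        perm = rng.permutation(m)
        val, cal = perm[:v], perm[v:]
        # Ridge model built on the calibration samples only
        b = np.linalg.solve(X[cal].T @ X[cal] + lam * np.eye(p), X[cal].T @ y[cal])
        rmsev.append(np.sqrt(np.mean((y[val] - X[val] @ b) ** 2)))
    return float(np.mean(rmsev))

# Illustrative data (assumed)
rng = np.random.default_rng(8)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, -1.0, 0.5, 2.0]) + 0.1 * rng.normal(size=50)
mean_rmsev = lmocv_rmsev(X, y, lam=1.0)
```

Fixing the seed makes the 300 splits reproducible, so different models can be compared on identical partitions.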

CA IV HARMONIOUS RR(750), PLS(6), AND PCR(8) PLOTS FOR 63 DESCRIPTORS

[Plots: $\|\hat{\mathbf{b}}\|_2$ against RMSEC and RMSEV; ridge value range: 45 - 7050]

GENERAL APPROACH TO OPTIMIZATION OF PARETO CURVE

$\hat{\mathbf{b}} = \mathbf{V}\boldsymbol{\beta}$ for basis set V and weights β, or any basis set with respective weights

Use an optimization algorithm (simplex, simulated annealing, etc.) adjusting weight values in β while minimizing the distance to target values of variance and bias measures

Models converge to RR models

CA IV MODEL VALUES FOR 63 DESCRIPTORS

Model^a     ‖b̂‖₂     RMSEC   RMSEV   R²cal   R²val   ER
RR (750)    0.0721   0.667   0.774   0.870   0.751   7.97
PLS (6)     0.0855   0.679   0.791   0.855   0.747   8.08
PCR (8)     0.0948   0.679   0.795   0.853   0.747   8

^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.

CA IV HARMONIOUS RR(6), PLS(4), AND PCR(5) PLOTS FOR 8 DESCRIPTORS

[Plots: $\|\hat{\mathbf{b}}\|_2$ against RMSEC and RMSEV; ridge value range: 0.2 - 126]

CA IV MODEL VALUES FOR 8 DESCRIPTORS

Model^a     ‖b̂‖₂     RMSEC   RMSEV   R²cal   R²val   ER
RR (6)      0.460    0.593   0.671   0.880   0.816   5.45
PLS (4)     0.463    0.602   0.676   0.875   0.813   5.20
PCR (5)     0.464    0.605   0.677   0.879   0.814   5
MLR         3.553    0.584   0.700   0.894   0.802   8

^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.

CA IV MODEL VALUES

Model^a     No. of Descriptors   ‖b̂‖₂     RMSEC   RMSEV   R²cal   R²val   ER
RR (750)    63                   0.0721   0.667   0.774   0.870   0.751   7.97
PLS (6)     63                   0.0855   0.679   0.791   0.855   0.747   8.08
PCR (8)     63                   0.0948   0.679   0.795   0.853   0.747   8
RR (6)      8                    0.460    0.593   0.671   0.880   0.816   5.45
PLS (4)     8                    0.463    0.602   0.676   0.875   0.813   5.20
PCR (5)     8                    0.464    0.605   0.677   0.879   0.814   5
MLR         8                    3.553    0.584   0.700   0.894   0.802   8

^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.

CA IV HARMONY/PARSIMONY PLOTS: PLS(6) AND PCR(8) FOR 63 DESCRIPTORS

[Plots: axes RMSEV, fER, and $\|\hat{\mathbf{b}}\|_2$]

CA IV HARMONY/PARSIMONY PLOTS: PLS(6) 63 DESCRIPTORS AND PLS(4) 8 DESCRIPTORS

[Plot: axes RMSEV, fER, and $\|\hat{\mathbf{b}}\|_2$]

CA I MODEL VALUES

Model^a     No. of Descriptors   ‖b̂‖₂     RMSEC   RMSEV   R²cal   R²val   ER
RR (670)    63                   0.0667   0.566   0.695   0.868   0.717   8.09
PLS (6)     63                   0.0767   0.577   0.703   0.851   0.720   8.03
PCR (7)     63                   0.0742   0.591   0.705   0.835   0.717   7
RR (22)     8                    0.209    0.624   0.699   0.792   0.720   3.54
PLS (3)     8                    0.227    0.641   0.705   0.787   0.717   3.65
PCR (4)     8                    0.267    0.641   0.711   0.790   0.712   4
MLR         8                    9.722    0.532   0.657   0.878   0.762   8

^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.

tgDHFR MODEL VALUES USING 10 DESCRIPTORS

Model^a     ‖b̂‖₂     RMSEC   RMSEV   R²cal   R²val   ER
RR (11)     0.597    0.857   0.919   0.677   0.580   6.51
PLS (5)     0.607    0.852   0.929   0.657   0.578   6.62
PCR (6)     0.646    0.867   0.927   0.657   0.581   6
MLR         8.664    0.765   0.902   0.766   0.634   10

^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.

pcDHFR MODEL VALUES USING 10 DESCRIPTORS

Model^a     ‖b̂‖₂     RMSEC   RMSEV   R²cal   R²val   ER
RR (17)     0.421    0.915   0.979   0.605   0.478   5.89
PLS (5)     0.500    0.916   0.996   0.603   0.478   6.50
PCR (6)     0.500    0.916   0.993   0.600   0.478   6
MLR         679      0.870   1.020   0.603   0.495   10

^a Parentheses contain ridge values for RR and the number of factors for PLS and PCR.

SUMMARY

ER necessary for fair intra- and inter-model comparison

RMSEC and RMSEV plot overlays are possible for different modeling methods

Harmonious plots allow proper determination of meta-parameters and validation

Fair intra- and inter-model comparisons are possible (plot overlays are possible)

In optimal model region of harmonious curve, differences in models are small

ER assesses the true nature of variable selection for improved parsimony

Harmony/parsimony compromise

FUTURE WORK

Use ER with multiple variance and bias indicators for better characterization of the harmony/parsimony tradeoff for intra- and inter-model comparison with full and/or variable subsets

Include variable selection in the modeling process:

$$\min_{\mathbf{b}}\left(\|\mathbf{X}\mathbf{b} - \mathbf{y}\|_2^2 + \lambda\|\mathbf{L}\mathbf{b}\|_1\right)$$

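Include L = second derivative operator in Tikhonov regularization

A form of RR with smoothing

Smooth spectral noise and temperature influences

Use standardization approach with PCR and PLS

PCR:

$$\text{from: } \min \|\mathbf{X}_d\mathbf{b} - \mathbf{y}\|_2^2 \qquad \text{to: } \min \|\bar{\mathbf{X}}_d\bar{\mathbf{b}} - \bar{\mathbf{y}}\|_2^2$$

PLS:

$$\text{from: } \min \|\mathbf{X}\mathbf{b} - \mathbf{y}\|_2^2 \ \text{ subject to } \ \mathbf{b} \in \mathcal{K}_d(\mathbf{X}^T\mathbf{X}, \mathbf{X}^T\mathbf{y})$$

$$\text{to: } \min \|\bar{\mathbf{X}}\bar{\mathbf{b}} - \bar{\mathbf{y}}\|_2^2 \ \text{ subject to } \ \bar{\mathbf{b}} \in \mathcal{K}_d(\bar{\mathbf{X}}^T\bar{\mathbf{X}}, \bar{\mathbf{X}}^T\bar{\mathbf{y}})$$

$$\text{where } \mathcal{K}_d(\mathbf{X}^T\mathbf{X}, \mathbf{X}^T\mathbf{y}) = \mathrm{span}\left\{\mathbf{X}^T\mathbf{y},\ \mathbf{X}^T\mathbf{X}\mathbf{X}^T\mathbf{y},\ \ldots,\ (\mathbf{X}^T\mathbf{X})^{d-1}\mathbf{X}^T\mathbf{y}\right\}$$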

ACKNOWLEDGEMENTS

Forrest Stout and Heather Seipel

Peter Jurs and Brian Mattioni provided QSAR data sets

National Science Foundation

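STANDARDIZATION PROCESS

For $\mathbf{L}$ ($s \times p$) with rank(L) = s < p, obtain a QR factorization of $\mathbf{L}^T$:

$$\mathbf{L}^T = \mathbf{K}\mathbf{R} = \begin{bmatrix}\mathbf{K}_s & \mathbf{K}_o\end{bmatrix}\begin{bmatrix}\mathbf{R}_s \\ \mathbf{0}\end{bmatrix}$$

Form $\mathbf{X}\mathbf{K}_o$ ($m \times (p - s)$) and perform a QR factorization of $\mathbf{X}\mathbf{K}_o$:

$$\mathbf{X}\mathbf{K}_o = \mathbf{H}\mathbf{T} = \begin{bmatrix}\mathbf{H}_o & \mathbf{H}_q\end{bmatrix}\begin{bmatrix}\mathbf{T}_o \\ \mathbf{0}\end{bmatrix}$$

Compute standardized data:

$$\bar{\mathbf{X}} = \mathbf{H}_q^T\mathbf{X}\mathbf{L}^{+} = \mathbf{H}_q^T\mathbf{X}\mathbf{K}_s\mathbf{R}_s^{-T}, \qquad \bar{\mathbf{y}} = \mathbf{H}_q^T\mathbf{y}$$

Perform back-transformation:

$$\hat{\mathbf{b}} = \mathbf{L}^{+}\hat{\bar{\mathbf{b}}} + \mathbf{K}_o\mathbf{T}_o^{-1}\mathbf{H}_o^T\left(\mathbf{y} - \mathbf{X}\mathbf{L}^{+}\hat{\bar{\mathbf{b}}}\right)$$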