model development and validation in chemometrics

5th Iranian Chemometrics Workshop

Model Development and Validation in Chemometrics

Bahram Hemmateenejad

Chemistry Department, Shiraz University

Shiraz, Iran

E-mail: [email protected]

Relationships between variables
Regression/Correlation?

Correlation problem
- We have a collection of measures.
- All are of interest in their own right.
- We wish to see how, and how strongly, they are related.

Regression problem
- We have a collection of measures.
- One measure is of special interest.
- We wish to explore its relationship with the others.

Mathematical Model

Y = f(X)
- Y: dependent variables
- X: independent variables

- One Y, one X
- One Y, many X
- Many Y, many X

- Hard modeling (fitting data to the model)
- Soft modeling (fitting the model to the data)

Hard Modeling

A pre-defined model is available, for example:

$y = b_0 + b_1 x_1 + b_2 x_2 + \dots$

$y = b_0 + b_1 x + b_2 x^2 + \dots$

$y = b_0 \cdot 10^{\,b_1 x + b_2 x^2 + \dots}$

Our task
1. Getting data (from our own experiments, or reported data from previous studies)
2. Fitting the data to the model and calculating the model constants (or coefficients)
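The second task, fitting data to a pre-defined model, can be sketched for the simplest case, a straight line y = b0 + b1 x. This is only an illustration: the function name `fit_line` and the calibration values are made up, and ordinary least squares is assumed as the fitting criterion.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical calibration data, invented for this sketch.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 2.1, 3.9, 6.1, 7.9]

b0, b1 = fit_line(x, y)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

Only the coefficients are estimated here; the form of the model is fixed in advance, which is what distinguishes hard modeling from soft modeling.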

Hard Modeling

Advantages
- The procedure is very simple.
- Both the dependent and independent variables are known.
- Only the coefficients are unknown.
- No feature selection is needed.

Disadvantages
- It requires deep insight into the chemical system.
- It is restricted to some simple chemical phenomena.

Hard Modeling

Beer-Lambert law

One-component systems:

$A = A_0 + b c$

Multi-component systems:

$A = A_0 + \sum_i b_i c_i$

$A = A_0 + \sum_i b_i c_i + A_x$

$A_x$: non-additive absorbance problem. A complicated matrix effect cannot be described by a simple mathematical model.
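As a small illustration of the additive multi-component case, the sketch below recovers two concentrations from absorbances at two wavelengths, and then shows how an unmodelled extra absorbance biases the result. The assumptions are mine: A0 = 0, two components, two wavelengths, and invented sensitivity values; `solve_two_components` is a hypothetical helper.

```python
def solve_two_components(b, a):
    """Solve a[j] = b[j][0]*c1 + b[j][1]*c2 for (c1, c2) by Cramer's rule."""
    det = b[0][0] * b[1][1] - b[0][1] * b[1][0]
    if abs(det) < 1e-12:
        raise ValueError("sensitivity matrix is singular (collinear spectra)")
    c1 = (a[0] * b[1][1] - b[0][1] * a[1]) / det
    c2 = (b[0][0] * a[1] - a[0] * b[1][0]) / det
    return c1, c2


# Hypothetical sensitivities b[wavelength][component], invented for the sketch.
b = [[2.0, 0.5],
     [0.4, 1.5]]
c_true = (0.10, 0.20)

# Additive absorbances with A0 = 0.
a = [b[j][0] * c_true[0] + b[j][1] * c_true[1] for j in range(2)]
c1, c2 = solve_two_components(b, a)
print(f"additive case: c1 = {c1:.3f}, c2 = {c2:.3f}")

# An unmodelled contribution A_x = 0.05 at both wavelengths biases the estimates.
c1_biased, c2_biased = solve_two_components(b, [aj + 0.05 for aj in a])
print(f"with A_x:      c1 = {c1_biased:.3f}, c2 = {c2_biased:.3f}")
```

The second call shows why the non-additive term matters: the hard model still returns numbers, but they no longer match the true concentrations.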

Soft Modeling

- No prior information about the chemical model is available.
- We know some chemical facts about the system.
- Data are collected and then different reasonable models are tried to fit the data.
- Many models may fit. Which one is better?
  - Getting deeper into the chemical facts
  - Better prediction
  - Lower modeling error

Soft Modeling

Descriptive model
- Describing the chemistry of the system
- Choosing useful independent variables
  - Chemically meaningful variables
  - The smallest number of independent variables
- Very high statistical qualities are not required.
- They must be evaluated for correct modeling.
- Be careful about the homogeneity and heterogeneity of the data.

Soft Modeling

Predictive model
- The ultimate goal is predicting y for future samples.
- Use as many predictor variables as possible.
- Feature selection becomes important.
- Chemical meaning is not essential for the predictors.
- Very high statistical qualities are required.
- Model validation is an essential part of modeling.

Predictive-descriptive model
- It is a high-quality chemical model.

Modeling Purposes

Development of new algorithms and methods
- New modeling method, new scoring function, new validation procedure, …
- Simulated data or previously reported data
- Comparison with existing methods
- Validation of the results

Application of models to new chemical systems
- The chemical system is novel.
- Be familiar with the system under study, or read carefully about it.
- Examine the results for accuracy.

Modeling Purposes

Comparative studies
- Comparing existing algorithms for an individual chemical system
- Comparing various types of independent variables for a chemical system
- Applying an individual modeling method to different chemical systems

Steps in Chemical Modeling

- Select the modeling purpose
- Study the chemical or mathematical system carefully
- Select the kind of model (predictive or descriptive?)
- Data preparation
  - Plot the data
  - Data splitting (calibration, validation, prediction)
- Model development (MLR, PCR, PLS, ANN)
  - Calculate the model coefficients
  - Validate its performance
- Final model validation

Data Splitting

At least two sets of data are necessary
- Model development step
- Final model validation step

In many cases two sets are also used in the model development step
- Calibration set, to calculate the model coefficients
- Validation set, to test the accuracy of the calculated constants

Calibration-Validation
Calibration-Validation-Prediction

Data Splitting

Selection of appropriate training and test sets is very important in model building.

All data sets must span the same space with regard to
- Diversity in the dependent variable
- Diversity in the independent variables
- Diversity in both dependent and independent variables

The training set should contain two thirds of the total data.

Splitting methods

J. T. Leonard, K. Roy, On selection of training and test sets for the development of predictive QSAR models, QSAR & Combinatorial Science, 2006, in press.

Random splitting
- It is not a good choice.
- A homogeneous group of data from one side of the total data may be classified as the test set.
- The final model performance will be highly dependent on the training/test set data.

Ranking data based on the value of the dependent variable (y)
- It may be a good choice.
- Diversity in the dependent variable is important here.
- Structural similarity is not considered.
- There is a risk that the training set data have different chemical structures from the validation/prediction data.
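Splitting by ranking on y can be sketched as follows. The rule used here (sort by y, then send the middle sample of every group of three to the test set, so that roughly two thirds stay in the training set and the extremes are kept for training) is one possible choice and not necessarily the one used in the workshop; `split_by_y_rank` and the y values are hypothetical.

```python
def split_by_y_rank(y, group=3):
    """Rank samples by y; the middle sample of each group goes to the test set."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    middle = group // 2
    train = [idx for pos, idx in enumerate(order) if pos % group != middle]
    test = [idx for pos, idx in enumerate(order) if pos % group == middle]
    return train, test


# Hypothetical responses, invented for this sketch.
y = [3.2, 1.1, 5.6, 2.4, 4.8, 0.7]

train, test = split_by_y_rank(y)
print("training indices:", train)
print("test indices:    ", test)
```

Note that only y is inspected. Nothing in this procedure checks whether the test samples are structurally similar to the training samples, which is exactly the risk listed above.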

R1    R2    y
H     Me    1
H     Et    2
Me    Me    3
Me    Et    4
OH    Me    5
OH    Et    6

(Figure: structure with substituent positions R1 and R2)

Selection on the basis of the independent-variable space
- Multivariate design
- Principal component analysis
- Clustering methods

Model Development

Preliminary considerations
- Simple models are preferred.
- Linear or nonlinear modeling
  - Produce a linear model if possible.
  - First examine MLR and then PCA-based methods.
  - MLR is more predictive.
  - Choose ANN as the final trial.
- Variable collinearity
- Feature selection/feature extraction

Model Development

Collinear variables
- Degree of collinearity ($R^2$ > 0.95, 0.9, 0.85)
- Correlation with y
- Chemical relevance
- Correlation with the other variables
- Noise content
- Cost of computation
- Calculation accuracy

      y     x1    x2    x3    x4
y     1     0.7   0.4   0.5   0.2
x1          1     0.7   0.4   0.2
x2                1     0.95  0.81
x3                      1     0.74
x4                            1
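One way to act on such a table is sketched below: flag every predictor pair whose squared correlation exceeds a limit, and from each flagged pair drop the variable that correlates less with y. This uses only two of the criteria listed (degree of collinearity and correlation with y). It assumes that the tabulated entries are correlation coefficients r and that the limit applies to r squared; the function `screen_collinear` is hypothetical.

```python
# Values copied from the correlation table above.
r_with_y = {"x1": 0.7, "x2": 0.4, "x3": 0.5, "x4": 0.2}
r_between = {
    ("x1", "x2"): 0.7, ("x1", "x3"): 0.4, ("x1", "x4"): 0.2,
    ("x2", "x3"): 0.95, ("x2", "x4"): 0.81, ("x3", "x4"): 0.74,
}


def screen_collinear(r_with_y, r_between, r2_limit=0.85):
    """Return the variables dropped because of collinearity."""
    dropped = set()
    # Examine the most strongly correlated pairs first.
    for (a, b), r in sorted(r_between.items(), key=lambda kv: -abs(kv[1])):
        if a in dropped or b in dropped:
            continue
        if r * r > r2_limit:
            # Keep the variable that is better correlated with y.
            weaker = a if abs(r_with_y[a]) < abs(r_with_y[b]) else b
            dropped.add(weaker)
    return sorted(dropped)


print("dropped:", screen_collinear(r_with_y, r_between))
```

In practice the other criteria (chemical relevance, noise content, cost of computation) may override this purely statistical choice.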

Model development

1. Select the regression method
   - MLR
   - PCR
   - PLS
   - ANN

2. Select the features (variables)
   - Stepwise
   - Genetic algorithm
   - Chance correlation
   - Support vector machines
   - Ant colony

Model development

1. Calibrate the model (calculate the model coefficients from the training data)

2. Evaluate the resulting model
   - Internal validation
   - Cross-validation
   - External validation

3. Calculate goodness of fit
   - Standard error (SE)
   - Correlation coefficient ($R^2$)
   - Cross-validation correlation coefficient ($Q^2$)
   - Root mean square error (RMSE)
   - PRESS
   - Variance ratio (F-value)

Under/Over fitting

D. M. Hawkins, The Problem of Overfitting, J. Chem. Inf. Comput. Sci. 2004, 44, 1-12.

Under-fitting
- Includes fewer terms than are necessary
- Uses less complicated approaches than are necessary

Over-fitting
- Includes more terms than are necessary
- Uses more complicated approaches than are necessary

Under-fitting
- Model performance is low
- Low calibration statistics
- Low generalization
- Low predictivity

Over-fitting
- Unstable model
- Inaccurate coefficients
- High calibration statistics
- Low prediction statistics

Overfitting

Two types of overfitting
1. Using a model that is more flexible than it needs to be
2. Using a model that includes irrelevant components

Why overfitting is undesirable
1. Worse decisions
2. Worse predictions
3. Wasted time
4. Results that cannot be reproduced by others

Assessing model fit

- The use of calibration statistics generally leads to overfitting.
- Cross-validation test on the calibration data
- Use of a separate validation set

Better predictive model?

- The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. Tropsha et al., QSAR Comb. Sci. 2003, 22, 69.

- The better predictive model: high q2 for the training set or low root mean square error of prediction for the test set? Aptula et al., QSAR Comb. Sci. 2005, 24, 385.

- Assessing model fit by cross-validation. Hawkins et al., J. Chem. Inf. Comput. Sci. 2003, 43, 579.

- Mean squared error of prediction (MSEP) estimates for principal component regression (PCR) and partial least squares regression (PLS). Mevik and Cederkvist, Journal of Chemometrics 2004, 18, 422-429.

Cross-Validation (CV)

Why CV?
- Model stability
- Model predictivity
- Degree of over-fitting

CV methods
- Leave-one-out (LOO-CV)
- Leave-many-out (LMO-CV)
- k-fold CV
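Leave-one-out cross-validation can be sketched for a one-variable linear model as below. Each sample is left out in turn, the line is refitted on the rest, and the squared prediction errors are summed into PRESS. The definition Q2 = 1 - PRESS/SST with SST taken around the mean of all y values is a common convention and is assumed here; the helper names and data are made up.

```python
def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def loo_q2(x, y):
    """Leave-one-out cross-validated Q2 = 1 - PRESS / SST."""
    n = len(y)
    y_mean = sum(y) / n
    press = 0.0
    for i in range(n):
        # Refit without sample i, then predict sample i.
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    sst = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - press / sst


# Hypothetical data, invented for this sketch.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 2.1, 3.9, 6.1, 7.9]
print(f"Q2 (LOO) = {loo_q2(x, y):.4f}")
```

Leave-many-out and k-fold variants follow the same pattern, except that several samples are held out in each round.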

Final Model Validation

- Separate prediction set
- Cross-validation
- Bootstrapping
- Y-randomization (chance correlation)

Cross-validation or separate test set?

- It is a challenging problem.
- However, use of a final prediction set is essential.
- In the model development step:
  - The choice depends heavily on the sample size.
  - Always perform cross-validation.
  - If the data size allows, use another separate validation set.
  - Never use a validation data set of very small size (e.g. 3 or 4 samples).

Bootstrap Re-sampling

- Another approach to cross-validation
- The basic premise is that each data set should be representative of the population from which it was drawn.
- K groups of size n are generated by repeated random selection of n objects.
- Some objects can be included in many groups.
- Others may never be selected.
- The model obtained on n randomly selected objects is used to predict the target properties.
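A minimal sketch of the bootstrap procedure for a one-variable linear model follows. It assumes that the objects are drawn with replacement and that each model predicts the objects that were not drawn into its group (the "out-of-bag" objects); the slides do not state which objects are predicted, so this is an interpretation. All names and data are hypothetical.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def bootstrap_rmse(x, y, k=200, seed=0):
    """RMSE of predictions for objects left out of each of k bootstrap groups."""
    rng = random.Random(seed)
    n = len(y)
    squared_errors = []
    for _ in range(k):
        drawn = [rng.randrange(n) for _ in range(n)]  # n draws with replacement
        drawn_set = set(drawn)
        left_out = [i for i in range(n) if i not in drawn_set]
        # Skip groups that leave nothing to predict or cannot define a line.
        if not left_out or len({x[i] for i in drawn}) < 2:
            continue
        b0, b1 = fit_line([x[i] for i in drawn], [y[i] for i in drawn])
        squared_errors.extend((y[i] - (b0 + b1 * x[i])) ** 2 for i in left_out)
    if not squared_errors:
        raise ValueError("no usable bootstrap groups were drawn")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical data, invented for this sketch.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.2, 2.9, 5.1, 7.0, 8.8, 11.1, 13.0, 14.9]
print(f"bootstrap RMSE = {bootstrap_rmse(x, y):.3f}")
```

The fixed seed only makes the sketch repeatable; in a real study the spread of the results over the K groups is as informative as their average.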

Y-randomization

Unscrambling, Chance correlations

- Some models may be obtained by chance, especially when the number of samples is small or the model has a large number of constants (coefficients).
- The chance correlation test is widely used to ensure the robustness of a model.
- The dependent vector is randomly shuffled and a new model is developed using the original predictor variables.
- The resulting models must have low statistical qualities for both calibration and prediction samples.
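The shuffling test can be sketched for a one-variable linear model as follows: the y vector is permuted many times, a new line is fitted each time to the unchanged x, and the resulting R2 values are compared with the R2 of the real model. The number of shuffles, the seed and the data are arbitrary choices made for this illustration.

```python
import random


def r_squared(x, y):
    """R2 of the least-squares line y = b0 + b1*x."""
    n = len(y)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sst = sum((yi - y_mean) ** 2 for yi in y)
    b1 = sxy / sxx
    ssr = sum((yi - (y_mean + b1 * (xi - x_mean))) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ssr / sst


def y_randomization(x, y, n_shuffles=100, seed=0):
    """R2 values of models built on randomly shuffled y vectors."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_shuffles):
        y_shuffled = list(y)
        rng.shuffle(y_shuffled)  # break the x-y pairing, keep the y distribution
        scores.append(r_squared(x, y_shuffled))
    return scores


# Hypothetical data, invented for this sketch.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

real = r_squared(x, y)
shuffled = y_randomization(x, y)
print(f"R2 of real model:        {real:.3f}")
print(f"mean R2 of shuffled y's: {sum(shuffled) / len(shuffled):.3f}")
```

If the shuffled models score nearly as well as the real one, the real model is probably a chance correlation.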

Goodness of fit (Scoring Function)

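The most frequently cited quantity, but the least significant one (R or $R^2$)

- Total sum of squares (SST)
- Residual sum of squares (SSR)
- Regression or model sum of squares (SSM)

$SST = \sum_i (y_i - \bar{y})^2, \quad SSR = \sum_i (y_i - \hat{y}_i)^2, \quad SSM = SST - SSR$

$R^2 = SSM/SST = 1 - (SSR/SST)$

Here $\bar{y}$ is the mean of the measured values and $\hat{y}_i$ is the value calculated by the model for sample i.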
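The sums of squares and R2 defined on this slide translate directly into code. The sketch below takes measured and calculated values as plain lists; the function name and the numbers are invented for illustration.

```python
def sums_of_squares(y, y_hat):
    """Return (SST, SSR, SSM, R2) for measured y and calculated y_hat."""
    y_mean = sum(y) / len(y)
    sst = sum((yi - y_mean) ** 2 for yi in y)               # total
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # residual
    ssm = sst - ssr                                         # model
    return sst, ssr, ssm, ssm / sst


# Hypothetical measured and calculated values.
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.1, 1.9, 3.2, 3.8]

sst, ssr, ssm, r2 = sums_of_squares(y, y_hat)
print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSM = {ssm:.2f}, R2 = {r2:.3f}")
```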

Some aspects of using $R^2$

- Homogeneity or diversity of the data
  - High sample diversity gives a high SST and therefore a high $R^2$, even if the model is not actually predictive.
  - High data homogeneity gives a low SST and therefore a low $R^2$, even if the model is actually predictive.
- Adding a random variable will increase SSM and therefore increase $R^2$.
- Using $R^2$ leads to an over-fitted model.

- Cross-validated correlation coefficient ($q^2$ or $Q^2$)

- Correlation coefficient for prediction samples ($R^2_P$)

- Root mean square error (RMSE) for calibration, prediction and cross-validation

  RMSE = standard deviation of the residuals $(y - \hat{y})$

- Prediction residual error sum of squares (PRESS) for calibration, prediction and cross-validation

  PRESS = sum of the squared deviations

- Relative error of prediction (REP)

  $REP = [(y - \hat{y})/y] \times 100$
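These error measures can be computed from measured and predicted values as sketched below. Two choices here are assumptions rather than statements from the slides: RMSE is taken as the square root of PRESS divided by the number of samples, and the per-sample REP values are summarised by their mean absolute value. The function name and data are hypothetical.

```python
import math


def error_measures(y, y_hat):
    """Return (PRESS, RMSE, per-sample REP in percent)."""
    residuals = [yi - fi for yi, fi in zip(y, y_hat)]
    press = sum(r ** 2 for r in residuals)
    rmse = math.sqrt(press / len(y))  # assumed denominator: number of samples
    rep = [100.0 * r / yi for r, yi in zip(residuals, y)]
    return press, rmse, rep


# Hypothetical measured and predicted values.
y = [1.0, 2.0, 4.0, 5.0]
y_hat = [1.1, 1.8, 4.2, 5.0]

press, rmse, rep = error_measures(y, y_hat)
mean_abs_rep = sum(abs(v) for v in rep) / len(rep)
print(f"PRESS = {press:.3f}, RMSE = {rmse:.3f}, mean |REP| = {mean_abs_rep:.2f} %")
```

The same function can be applied to calibration, cross-validation or prediction results; only the source of `y_hat` changes.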

Difference among $R^2$, RMSE and PRESS
- These quantities are correlated with one another.
- $R^2$ measures the percentage of the total variance in the original data that is described by the selected model.
- RMSE describes the reproducibility of the model in predicting y for different samples.
- PRESS and REP measure the total model accuracy.

Important notes

Data splitting
1. Random splitting
2. Diversity in the y-variable
3. Diversity in the X-variables
4. Diversity in both y and X

Model development
1. Calibrate the model with the training set
2. Validate the model either by cross-validation or with a separate test set

Final model validation
1. Separate validation set
2. Cross-validation
3. Bootstrapping
4. Y-randomization

Numerical Example

- 40 samples
- 7 independent variables
- 1 dependent variable

Finding a linear relationship between y and X

Data matrix

y x1 x2 x3 x4 x5 x6 x7

-1.5 -2.4 0.14 152.16 40.4 1.01 0.23 17.22

-1.04 0.3 0.39 124.15 47.97 0.78 0.21 17.38

-1.02 -1.36 -0.48 138.13 39.7 1.1 0.22 17.14

-0.99 0 0.34 154.18 42.21 1.12 0.22 17.86

-0.97 0.01 0.27 153.2 38.43 1.05 0.18 9.16

-0.83 0.62 0.33 138.18 35.58 0.75 0.22 17.39

-0.81 -1.77 -0.57 138.13 38.7 1.12 0.2 17.2

-0.78 0.23 -0.18 137.15 41.47 1.01 0.22 8.79

-0.7 0 0.41 154.18 44.33 1.12 0.21 17.84

-0.6 0.77 0.39 154.18 42.64 1.11 0.18 9.34

-0.52 -0.75 0.43 265.45 32.71 1.44 0.22 10.38

-0.51 -2.28 -0.59 138.13 36.61 1.12 2 17.3

-0.51 1.19 0.39 124.15 41.6 0.75 0.17 8.99

-0.39 1.22 0.34 124.15 43.34 0.75 0.22 17.67

-0.38 0.71 0.06 133.16 45.98 0.57 0.22 8.84

-0.38 1.38 -0.46 136.16 35.77 0.82 0.18 8.91

-0.36 1.94 0.42 138.18 39.16 0.75 0.17 9.12

-0.3 1.35 -0.38 136.16 38.63 0.79 0.22 8.83

-0.3 1.78 0.32 168.21 41.16 1.1 0.18 9.12

-0.29 0.44 0.37 108.15 34.18 0.39 0.17 8.92

-0.24 1.37 -0.27 137.15 41.05 1.13 0.21 8.98

-0.21 1.48 0.4 94.12 39.97 0.39 0.17 8.63

-0.18 1.94 0.43 108.15 36.58 0.39 0.17 8.76

-0.18 0.33 0.32 168.21 36.72 1.19 0.21 17.86

-0.16 0.73 0.21 151.18 40.99 1.05 0.22 8.97

-0.14 0.98 -0.49 152.16 42.85 1.15 0.22 9.14

-0.12 1.32 -0.4 166.19 41.15 1.1 0.18 9.18

-0.09 1.42 0.42 154.18 46.71 1.09 0.22 9.1

-0.08 1.52 -0.48 182.19 43.04 1.48 0.21 17.61

-0.06 1.94 0.39 108.15 37.11 0.39 0.17 8.81

-0.05 1.88 -0.49 152.16 45.04 1.11 0.18 8.95

0.08 1.81 -0.4 152.16 41.6 1.09 0.22 8.86

0.33 2.64 -0.05 150.19 38.52 0.75 0.17 9.18

0.59 1.83 -1.15 153.15 31.03 0.36 0.22 9.1

1.13 2.66 -0.33 163 24.1 0.39 0.17 8.88

1.02 2.81 -0.49 198.23 40.89 0.74 0.17 9.1

0.8 2.89 0.13 142.59 35.76 0.39 0.17 8.88

1.23 3.39 -0.63 198.23 41.96 0.81 0.18 19.06

1.3 3.63 0.48 164.27 31.02 0.39 0.17 9.57

1.35 3.23 -0.16 170.22 40.68 0.36 0.22 9.27

Correlation matrix

      y      x1     x2     x3     x4     x5     x6     x7
y     1      0.83   -0.29  0.29   -0.36  -0.50  -0.11  -0.35
x1           1      -0.03  0.076  -0.13  -0.52  -0.41  -0.49
x2                  1      -0.18  0.10   -0.12  -0.23  -0.03
x3                         1      -0.12  0.46   -0.05  0.10
x4                                1      0.42   -0.07  0.2
x5                                       1      0.17   0.37
x6                                              1      0.26
x7                                                     1

Stepwise regression

Variables             R2      SE      F
x1                    0.695   0.382   86
x1, x6                0.763   0.341   60
x1, x6, x3            0.819   0.303   54
x1, x6, x3, x5        0.875   0.255   61
x1, x6, x3, x5, x2    0.905   0.225   65
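The F-values in this table can be cross-checked against the tabulated R2 values. The sketch assumes the usual variance-ratio formula for a linear model with an intercept, F = (R2 / k) / ((1 - R2) / (n - k - 1)), with n = 40 samples and k selected variables. Under that assumption the recomputed values agree with the table to within about one unit, with the remaining differences plausibly due to rounding of R2.

```python
def f_value(r2, n, k):
    """Variance ratio for a model with k predictors fitted to n samples."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


n = 40  # number of samples in the numerical example

# (number of variables, R2, F as reported in the stepwise table)
steps = [
    (1, 0.695, 86),
    (2, 0.763, 60),
    (3, 0.819, 54),
    (4, 0.875, 61),
    (5, 0.905, 65),
]

for k, r2, f_reported in steps:
    print(f"k = {k}: F recomputed = {f_value(r2, n, k):.1f}, reported = {f_reported}")
```

Unlike R2, which can only increase as variables are added, F penalises the extra terms, which is why it falls between the first and third steps before rising again.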

Data splitting

1. Calibration, prediction

2. Calibration, validation, prediction

Calibration: Two thirds of total data = 26

Remaining: 14

What is the decision?

Selecting a separate test set in model development

Data splitting

- Validation: 8 samples
- Final prediction: 6 samples
- How to split the data?
  - Random?
  - Y-sorting
  - PCA on X or [X y]

Random splitting
