5th Iranian Chemometrics Workshop
Model Development and Validation in Chemometrics
Bahram Hemmateenejad
Chemistry Department, Shiraz University
Shiraz, Iran
E-mail: [email protected]
Relationships between variables: regression or correlation?
Correlation problem
  We have a collection of measures
  All are of interest in their own right
  We wish to see how, and how strongly, they are related
Regression problem
  We have a collection of measures
  One measure is of special interest
  We wish to explore its relationship with the others
Mathematical Model
$Y = f(X)$
  Y: dependent variables
  X: independent variables
One Y, one X
One Y, many X
Many Y, many X
Hard modeling (fitting the data to the model)
Soft modeling (fitting the model to the data)
Hard Modeling
A pre-defined model is available, for example
  $y = b_0 + b_1 x_1 + b_2 x_2 + \dots$
  $y = b_0 + b_1 x + b_2 x^2 + \dots$
  $y = b_0 \cdot 10^{b_1 x + b_2 x^2 + \dots}$
Our task
  1. Get the data (from our own experiments, or reported data from previous studies)
  2. Fit the data to the model and calculate the model constants (coefficients)
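Step 2 can be sketched for the simplest pre-defined model, $y = b_0 + b_1 x$, fitted by ordinary least squares. The function name `fit_line` and the data are illustrative assumptions, not taken from the slides.

```python
def fit_line(x, y):
    """Ordinary least-squares estimates (b0, b1) for y = b0 + b1*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of squares of x and cross-product of x and y about their means
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Hypothetical noise-free data generated from y = 0.5 + 2x
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.5 + 2.0 * xi for xi in x]
b0, b1 = fit_line(x, y)
print(b0, b1)
```

Because the data are noise-free, the fit should recover the generating constants. With real measurements the coefficients would carry uncertainty that needs to be evaluated separately.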
Hard Modeling
Advantages
  The procedure is very simple
  Both the dependent and independent variables are known
  Only the coefficients are unknown
  No feature selection is needed
Disadvantages
  Deep insight into the chemical system is required
  It is restricted to some simple chemical phenomena
Hard Modeling
Beer-Lambert law
  One-component systems: $A = A_0 + b c$
  Multi-component systems: $A = A_0 + \sum_i b_i c_i$
  Non-additive absorbance problem: $A = A_0 + \sum_i b_i c_i + A_x$
  A complicated matrix effect ($A_x$) cannot be described by a simple mathematical model
Soft Modeling
No prior information about the chemical model is available
  We know some chemical facts about the system
  Data are collected, and then different reasonable models are fitted to them
Many models may fit. Which one is better?
  The one that gives deeper insight into the chemical facts
  Better prediction
  Lower modeling error
Soft Modeling
Descriptive model
  Describes the chemistry of the system
  Choose useful independent variables
    Chemically meaningful variables
    The smallest possible number of independent variables
  Very high statistical quality is not required
  They must still be evaluated for correct modeling
  Be careful about the homogeneity and heterogeneity of the data
Soft Modeling
Predictive model
  The ultimate goal is predicting y for future samples
  Use as many predictor variables as possible
  Feature selection becomes important
  Chemical meaning is not essential for the predictors
  Very high statistical quality is required
  Model validation is an essential part of modeling
Predictive-descriptive model
  It is a high-quality chemical model
Modeling Purposes
Development of new algorithms and methods
  New modeling method, new scoring function, new validation procedure, ...
  Use simulated data or previously reported data
  Compare with existing methods
  Validate the results
Application of models to new chemical systems
  The chemical system is novel
  Be familiar with the system under study, or read carefully about it
  Examine the results for accuracy
Modeling Purposes
Comparative studies
  Comparing existing algorithms for an individual chemical system
  Comparing various types of independent variables for a chemical system
  Applying an individual modeling method to different chemical systems
Steps in Chemical Modeling
Select the modeling purpose
Study the chemical or mathematical system carefully
Select the kind of model (predictive or descriptive?)
Data preparation
Plot the data
Data splitting (calibration, validation, prediction)
Model development (MLR, PCR, PLS, ANN)
  Calculate the model coefficients
  Validate its performance
Final model validation
Data Splitting
At least two sets of data are necessary
  Model development step
  Final model validation step
In many cases, two sets are also used in the model development step
  Calibration set, to calculate the model coefficients
  Validation set, to test the accuracy of the calculated constants
Calibration-Validation
Calibration-Validation-Prediction
Data Splitting
Selecting appropriate training and test sets is very important in model building
All data sets must span the same space with regard to
  Diversity in the dependent variable
  Diversity in the independent variables
  Diversity in both dependent and independent variables
The training set should contain two thirds of the total data
Splitting methods
J.T. Leonard, K. Roy, On selection of training and test sets for the development of predictive QSAR models, QSAR & Combinatorial Science, 2006, in press.
Splitting methods
Random splitting
  It is not a good choice
  A homogeneous subset from one side of the total data may be assigned to the test set
  Final model performance will be highly dependent on the training/test set data
Ranking the data by the value of the dependent variable (y)
  It may be a good choice
  Diversity in the dependent variable is what matters here
  Structural similarity is not considered
  There is a risk that the training set has different chemical structures from the validation/prediction data
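The y-ranking approach can be sketched as follows: samples are sorted by y and the middle sample of every consecutive group of three goes to the test set, so that the training set keeps about two thirds of the data and both sets span the y range. The function name `split_by_y_rank`, the group size and the y values are assumptions made for illustration.

```python
def split_by_y_rank(y, every=3):
    """Rank samples by y; the middle one of each group of `every` is a test sample.

    Returns (train_indices, test_indices) referring to positions in `y`.
    """
    order = sorted(range(len(y)), key=lambda i: y[i])
    pick = every // 2  # position inside each group that goes to the test set
    test = [idx for pos, idx in enumerate(order) if pos % every == pick]
    train = [idx for pos, idx in enumerate(order) if pos % every != pick]
    return train, test


# Hypothetical y values for six samples
y = [5, 1, 4, 2, 6, 3]
train, test = split_by_y_rank(y)
print("train y:", [y[i] for i in train])
print("test y:", [y[i] for i in test])
```

This scheme looks only at y, so, as noted above, it gives no guarantee that the two sets are structurally similar.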
Splitting methods
R1   R2   y
H    Me   1
H    Et   2
Me   Me   3
Me   Et   4
OH   Me   5
OH   Et   6
[Figure: chemical structure with substituents R1 and R2]
Splitting methods
Selection on the basis of the independent-variable space
  Multivariate design
  Principal component analysis
  Clustering methods
Model Development
Preliminary considerations
  Simple models are preferred
  Linear or nonlinear modeling
    Produce a linear model if possible
    First examine MLR and then PCA-based methods
    MLR is more predictive
    Choose ANN as the final trial
  Variable collinearity
  Feature selection/feature extraction
Model Development
Collinear variables
  Degree of collinearity: $R^2 > 0.95$, 0.9, 0.85
  Correlation with y
  Chemical relevance
  Correlation with the other variables
  Noise content
  Cost of computation
  Calculation accuracy

      y      x1     x2     x3     x4
y     1      0.7    0.4    0.5    0.2
x1           1      0.7    0.4    0.2
x2                  1      0.95   0.81
x3                         1      0.74
x4                                1
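A minimal sketch of how the correlation table can be screened for collinear predictors. Several things here are assumptions rather than the author's procedure: the table entries are treated as correlation coefficients r, the 0.85 cut-off on r squared is used, and only the first criterion from the list (correlation with y) decides which variable of a collinear pair is dropped.

```python
# Upper-triangular correlations copied from the table above
# (assumed to be correlation coefficients r, not r squared).
corr = {
    ("y", "x1"): 0.7, ("y", "x2"): 0.4, ("y", "x3"): 0.5, ("y", "x4"): 0.2,
    ("x1", "x2"): 0.7, ("x1", "x3"): 0.4, ("x1", "x4"): 0.2,
    ("x2", "x3"): 0.95, ("x2", "x4"): 0.81,
    ("x3", "x4"): 0.74,
}


def collinear_pairs(corr, threshold=0.85):
    """Return predictor pairs whose squared correlation exceeds `threshold`."""
    return [pair for pair, r in corr.items()
            if "y" not in pair and r ** 2 > threshold]


def weaker_with_y(pair, corr):
    """Of two collinear predictors, return the one less correlated with y."""
    return min(pair, key=lambda name: abs(corr[("y", name)]))


pairs = collinear_pairs(corr)
to_drop = [weaker_with_y(pair, corr) for pair in pairs]
print(pairs, to_drop)
```

In practice the other listed criteria (chemical relevance, noise content, cost of computation) may override a choice made on correlation with y alone.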
Model development
1. Select the regression method
  MLR
  PCR
  PLS
  ANN
2. Select the features (variables)
  Stepwise
  Genetic algorithm
  Chance correlation
  Support vector machines
  Ant colony
Model development
1. Calibrate the model (calculate the model coefficients from the training data)
2. Evaluate the resulting model
  Internal validation
  Cross-validation
  External validation
3. Calculate goodness of fit
  Standard error (SE)
  Correlation coefficient ($R^2$)
  Cross-validation correlation coefficient ($Q^2$)
  Root mean square error (RMSE)
  PRESS
  Variance ratio (F-value)
Under/Over fitting
D. M. Hawkins, The Problem of Overfitting, J. Chem. Inf. Comput. Sci. 2004, 44, 1-12.
Under-fitting
  Includes fewer terms than are necessary
  Uses less complicated approaches than are necessary
Over-fitting
  Includes more terms than are necessary
  Uses more complicated approaches than are necessary
Under/Over fitting
Under-fitting: model performance is low
  Low calibration statistics
  Low generalization
  Low predictivity
Over-fitting
  Unstable model
  Inaccurate coefficients
  High calibration statistics
  Low prediction statistics
Overfitting
Two types of overfitting
  1. Using a model that is more flexible than it needs to be
  2. Using a model that includes irrelevant components
Why overfitting is undesirable
  1. Worse decisions
  2. Worse predictions
  3. Wasted time
  4. Results that others cannot reproduce
Assessing model fit
Relying on calibration statistics generally leads to overfitting
Cross-validation test on the calibration data
Use of a separate validation set
Better predictive model?
The importance of being earnest: Validation is the absolute essential for successful application and interpretation of QSPR models. Tropsha et al., QSAR Comb. Sci. 2003, 22, 69.
The better predictive model: High q2 for training set or low root mean square error of prediction for the test set? Aptula et al., QSAR Comb. Sci. 2005, 24, 385.
Assessing model fit by cross-validation. Hawkins et al., J. Chem. Inf. Comput. Sci. 2003, 43, 579.
Mean squared error of prediction (MSEP) estimates for principal component regression (PCR) and partial least squares regression (PLS). Mevik and Cederkvist, Journal of Chemometrics, 2004, 18, 422-429.
Cross-Validation (CV)
Why CV?
  Model stability
  Model predictivity
  Degree of over-fitting
CV methods
  Leave-one-out (LOO-CV)
  Leave-many-out (LMO-CV)
  k-fold CV
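Leave-one-out cross-validation can be sketched for a one-predictor linear model. Each sample is left out in turn, the line is refitted on the rest, and the squared prediction errors are accumulated into PRESS. The definition $q^2 = 1 - PRESS/SST$ with SST taken about the overall mean is one common convention and is assumed here; the function names and data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares estimates (b0, b1) for y = b0 + b1*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


def loo_q2(x, y):
    """Leave-one-out q2 = 1 - PRESS/SST for the model y = b0 + b1*x."""
    n = len(y)
    y_mean = sum(y) / n
    press = 0.0
    for i in range(n):
        # Refit without sample i, then predict sample i
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    sst = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - press / sst


# Hypothetical, nearly linear data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
q2 = loo_q2(x, y)
print(q2)
```

Leave-many-out and k-fold CV follow the same pattern, with groups of samples held out in each round instead of a single one.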
Final Model Validation
Separate prediction set
Cross-validation
Bootstrapping
Y-randomization (chance correlation)
Cross-validation or separate test set?
It is a challenging problem
However, use of a final prediction set is essential
In the model development step
  The choice depends heavily on the sample size
  Always perform cross-validation
  If the data size allows, use another separate validation set
  Never use a validation set of very small size (e.g. 3 or 4 samples)
Bootstrap Re-sampling
Another approach to cross-validation
The basic premise is that each data set should be representative of the population from which it was drawn
K groups of size n are generated by repeated random selection of n objects
  Some objects can be included in many groups
  Others may never be selected
The model obtained on the n randomly selected objects is used to predict the target properties
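The resampling step can be sketched with the standard library. K groups of n indices are drawn with replacement, and the objects that were never drawn into a group are recorded. Treating those left-out objects as the ones to predict is an assumption made here, as are the function name and the fixed seed.

```python
import random


def bootstrap_groups(n_objects, k_groups, seed=0):
    """Draw k_groups bootstrap samples of size n_objects, with replacement.

    Returns a list of (selected, left_out) index lists; `left_out` holds the
    objects that were never drawn into that group.
    """
    rng = random.Random(seed)
    groups = []
    for _ in range(k_groups):
        selected = [rng.randrange(n_objects) for _ in range(n_objects)]
        chosen = set(selected)
        left_out = [i for i in range(n_objects) if i not in chosen]
        groups.append((selected, left_out))
    return groups


groups = bootstrap_groups(n_objects=10, k_groups=5)
for selected, left_out in groups:
    print(sorted(selected), left_out)
```

A model would then be calibrated on each `selected` list and its predictions collected over the K rounds to estimate the prediction error.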
Y-randomization
Scrambling, chance correlations
Some models may be obtained by chance
  Especially when the number of samples is small or the model has a large number of constants (coefficients)
The chance-correlation test is widely used to ensure the robustness of a model
The dependent vector is randomly shuffled and a new model is developed using the original predictor variables
The resulting models must have low statistical quality for both calibration and prediction samples
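The shuffling test can be sketched for a one-predictor linear model. The y vector is shuffled many times, the model is refitted each time, and the resulting $R^2$ values are compared with that of the unshuffled data. The function names, the number of trials and the data are illustrative assumptions.

```python
import random


def r_squared_line(x, y):
    """R2 of the least-squares line y = b0 + b1*x (the squared Pearson r)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def y_randomization(x, y, n_trials=100, seed=0):
    """R2 values obtained after randomly shuffling y, keeping x unchanged."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        y_shuffled = y[:]
        rng.shuffle(y_shuffled)
        scores.append(r_squared_line(x, y_shuffled))
    return scores


# Hypothetical data with a true linear relationship
x = [float(i) for i in range(1, 21)]
y = [2.0 * xi + 1.0 for xi in x]
true_r2 = r_squared_line(x, y)
random_r2 = y_randomization(x, y)
print(true_r2, sum(random_r2) / len(random_r2), max(random_r2))
```

If the shuffled models score nearly as well as the original one, the original correlation is likely to be a chance result.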
Goodness of fit (Scoring Function)
The most frequently cited quantity, but the least significant one: R or $R^2$
Total sum of squares (SST)
Residual sum of squares (SSR)
Regression or model sum of squares (SSM)
$SST = \sum_i (y_i - \bar{y})^2$, $SSR = \sum_i (y_i - \hat{y}_i)^2$, $SSM = SST - SSR$
$R^2 = SSM/SST = 1 - (SSR/SST)$
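The definitions above translate directly into code. This is a small sketch with made-up observed and predicted values; the function names are illustrative.

```python
def sums_of_squares(y, y_pred):
    """Return (SST, SSR, SSM) for observed y and model predictions y_pred."""
    y_mean = sum(y) / len(y)
    sst = sum((yi - y_mean) ** 2 for yi in y)               # total
    ssr = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))  # residual
    ssm = sst - ssr                                         # model
    return sst, ssr, ssm


def r_squared(y, y_pred):
    """R2 = SSM/SST, which equals 1 - SSR/SST."""
    sst, ssr, ssm = sums_of_squares(y, y_pred)
    return ssm / sst


# Hypothetical observed and predicted values
y = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 1.5, 3.5, 3.5]
sst, ssr, ssm = sums_of_squares(y, y_pred)
r2 = r_squared(y, y_pred)
print(sst, ssr, ssm, r2)  # SST = 5.0, SSR = 1.0, SSM = 4.0, R2 = 0.8
```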
Goodness of fit (Scoring Function)
Some aspects of using $R^2$
Homogeneity or diversity of data
  High sample diversity gives high SST and therefore high $R^2$, even if the model is not actually predictive
  High data homogeneity gives low SST and therefore low $R^2$, even if the model is actually predictive
Adding any variable, even a random one, cannot decrease SSM and therefore cannot decrease the calibration $R^2$
Relying on $R^2$ leads to over-fitted models
Goodness of fit (Scoring Function)
Cross-validated correlation coefficient ($q^2$ or $Q^2$)
Correlation coefficient for prediction samples ($R^2_P$)
Root mean square error (RMSE) for calibration, prediction and cross-validation
  RMSE = standard deviation of the residuals ($y - \hat{y}$)
Prediction residual error sum of squares (PRESS) for calibration, prediction and cross-validation
  PRESS = sum of squared deviations
Relative error of prediction (REP)
  $REP = [(y - \hat{y})/y] \times 100$
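A short sketch of the three error measures. RMSE is computed here as the square root of PRESS divided by the number of samples, and REP is returned per sample exactly as the formula above is written. Both choices are assumptions, since other conventions exist (for example, a single REP value scaled by the mean of y). The data are hypothetical.

```python
import math


def press(y, y_pred):
    """Sum of squared deviations between observed and predicted values."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))


def rmse(y, y_pred):
    """Root mean square error: sqrt(PRESS / n)."""
    return math.sqrt(press(y, y_pred) / len(y))


def rep_percent(y, y_pred):
    """Per-sample relative error of prediction, [(y - y_pred) / y] * 100."""
    return [(yi - pi) / yi * 100.0 for yi, pi in zip(y, y_pred)]


# Hypothetical observed and predicted values
y = [2.0, 4.0, 5.0, 10.0]
y_pred = [1.0, 5.0, 5.0, 8.0]
print(press(y, y_pred))        # 1 + 1 + 0 + 4 = 6.0
print(rmse(y, y_pred))         # sqrt(6 / 4)
print(rep_percent(y, y_pred))  # about 50, -25, 0 and 20 percent
```

The same functions apply to calibration, cross-validation and prediction residuals; only the source of `y_pred` changes.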
Goodness of fit (Scoring Function)
Difference among $R^2$, RMSE and PRESS
  These quantities are correlated with one another
  $R^2$ measures the fraction of the total variance in the original data that is described by the selected model
  RMSE describes the reproducibility of the model in predicting y for different samples
  PRESS and REP measure the total model accuracy
Important notes
Data splitting
  1. Random splitting
  2. Diversity in the y-variable
  3. Diversity in the X-variables
  4. Diversity in both y and X
Model development
  1. Calibrate the model with the training set
  2. Validate the model either by cross-validation or with a separate test set
Final model validation
  1. Separate validation set
  2. Cross-validation
  3. Bootstrapping
  4. Y-randomization
Numerical Example
40 samples
7 independent variables
1 dependent variable
Goal: find a linear relationship between y and X
Data matrix
-1.5 -2.4 0.14 152.16 40.4 1.01 0.23 17.22
-1.04 0.3 0.39 124.15 47.97 0.78 0.21 17.38
-1.02 -1.36 -0.48 138.13 39.7 1.1 0.22 17.14
-0.99 0 0.34 154.18 42.21 1.12 0.22 17.86
-0.97 0.01 0.27 153.2 38.43 1.05 0.18 9.16
-0.83 0.62 0.33 138.18 35.58 0.75 0.22 17.39
-0.81 -1.77 -0.57 138.13 38.7 1.12 0.2 17.2
-0.78 0.23 -0.18 137.15 41.47 1.01 0.22 8.79
-0.7 0 0.41 154.18 44.33 1.12 0.21 17.84
-0.6 0.77 0.39 154.18 42.64 1.11 0.18 9.34
-0.52 -0.75 0.43 265.45 32.71 1.44 0.22 10.38
-0.51 -2.28 -0.59 138.13 36.61 1.12 2 17.3
-0.51 1.19 0.39 124.15 41.6 0.75 0.17 8.99
-0.39 1.22 0.34 124.15 43.34 0.75 0.22 17.67
-0.38 0.71 0.06 133.16 45.98 0.57 0.22 8.84
-0.38 1.38 -0.46 136.16 35.77 0.82 0.18 8.91
-0.36 1.94 0.42 138.18 39.16 0.75 0.17 9.12
-0.3 1.35 -0.38 136.16 38.63 0.79 0.22 8.83
-0.3 1.78 0.32 168.21 41.16 1.1 0.18 9.12
-0.29 0.44 0.37 108.15 34.18 0.39 0.17 8.92
-0.24 1.37 -0.27 137.15 41.05 1.13 0.21 8.98
-0.21 1.48 0.4 94.12 39.97 0.39 0.17 8.63
-0.18 1.94 0.43 108.15 36.58 0.39 0.17 8.76
-0.18 0.33 0.32 168.21 36.72 1.19 0.21 17.86
-0.16 0.73 0.21 151.18 40.99 1.05 0.22 8.97
-0.14 0.98 -0.49 152.16 42.85 1.15 0.22 9.14
-0.12 1.32 -0.4 166.19 41.15 1.1 0.18 9.18
-0.09 1.42 0.42 154.18 46.71 1.09 0.22 9.1
-0.08 1.52 -0.48 182.19 43.04 1.48 0.21 17.61
-0.06 1.94 0.39 108.15 37.11 0.39 0.17 8.81
-0.05 1.88 -0.49 152.16 45.04 1.11 0.18 8.95
0.08 1.81 -0.4 152.16 41.6 1.09 0.22 8.86
0.33 2.64 -0.05 150.19 38.52 0.75 0.17 9.18
0.59 1.83 -1.15 153.15 31.03 0.36 0.22 9.1
1.13 2.66 -0.33 163 24.1 0.39 0.17 8.88
1.02 2.81 -0.49 198.23 40.89 0.74 0.17 9.1
0.8 2.89 0.13 142.59 35.76 0.39 0.17 8.88
1.23 3.39 -0.63 198.23 41.96 0.81 0.18 19.06
1.3 3.63 0.48 164.27 31.02 0.39 0.17 9.57
1.35 3.23 -0.16 170.22 40.68 0.36 0.22 9.27
Correlation matrix
    y      x1     x2     x3     x4     x5     x6     x7
y   1      0.83   -0.29  0.29   -0.36  -0.50  -0.11  -0.35
x1         1      -0.03  0.076  -0.13  -0.52  -0.41  -0.49
x2                1      -0.18  0.10   -0.12  -0.23  -0.03
x3                       1      -0.12  0.46   -0.05  0.10
x4                              1      0.42   -0.07  0.2
x5                                     1      0.17   0.37
x6                                            1      0.26
x7                                                   1
Stepwise regression
Variables            R2      Se      F
x1                   0.695   0.382   86
x1, x6               0.763   0.341   60
x1, x6, x3           0.819   0.303   54
x1, x6, x3, x5       0.875   0.255   61
x1, x6, x3, x5, x2   0.905   0.225   65
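The F column can be cross-checked from the R2 column using the usual variance-ratio formula for a model with an intercept, $F = (R^2/k) / [(1 - R^2)/(n - k - 1)]$. The sketch below assumes that all n = 40 samples of the numerical example were used in the stepwise regression; the function name is illustrative.

```python
def f_value(r2, n_samples, n_vars):
    """Variance ratio F = (R2 / k) / ((1 - R2) / (n - k - 1))."""
    return (r2 / n_vars) / ((1.0 - r2) / (n_samples - n_vars - 1))


# (number of variables, R2, tabulated F) from the stepwise table;
# n = 40 is assumed from the numerical example.
steps = [(1, 0.695, 86), (2, 0.763, 60), (3, 0.819, 54),
         (4, 0.875, 61), (5, 0.905, 65)]
for k, r2, f_table in steps:
    print(k, round(f_value(r2, 40, k), 1), f_table)
```

The computed values should agree with the tabulated ones to within the rounding of R2 to three decimals. Note that F drops at the third variable and rises again afterwards even though R2 increases at every step.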
Data splitting
1. Calibration, prediction
2. Calibration, validation, prediction
Calibration: about two thirds of the total data (26 samples)
Remaining: 14 samples
What is the decision?
  Select a separate test set in model development
Data splitting
Validation: 8 samples
Final prediction: 6 samples
How to split the data?
  Random?
  Y-sorting
  PCA on X or [X y]
Random splitting