Post on 20-Dec-2015
1
LMO & Jackknife
If a QSPR/QSAR model has a high average q2 in LMO validation, it can be reasonably concluded that the obtained model is robust.
Leave-many-out (LMO) validation:
An internal validation procedure, like LOO.
LMO employs a smaller training set than LOO and can be repeated many more times, because there are many possible combinations of compounds to leave out of the training set.
n objects in the data set; G cancellation groups of equal size, G = n/m, with 2 < G < 10.
A large number of groups: n - m objects in the training set, m objects in the validation set => q2 estimated from the m left-out objects.
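The LMO procedure above can be sketched as follows. This is a minimal illustration in Python rather than the course's MATLAB code, assuming a one-descriptor least-squares model; `fit_line` and `lmo_q2` are hypothetical names introduced here:

```python
import random
from statistics import mean

def fit_line(xs, ys):
    # ordinary least squares for a one-descriptor model y = a + b*x
    xb, yb = mean(xs), mean(ys)
    b = sum((x - xb) * (y - yb) for x, y in zip(xs, ys)) / \
        sum((x - xb) ** 2 for x in xs)
    return yb - b * xb, b

def lmo_q2(x, y, G, rounds=20, seed=0):
    # leave-many-out: split the n objects into G cancellation groups of
    # m = n // G, refit on the remainder, predict the held-out group
    rng = random.Random(seed)
    n, m = len(x), len(x) // G
    press = ss = 0.0
    ybar = mean(y)
    for _ in range(rounds):
        order = rng.sample(range(n), n)          # one random grouping
        for g in range(G):
            out = set(order[g * m:(g + 1) * m])  # validation objects
            tr = [i for i in range(n) if i not in out]
            a, b = fit_line([x[i] for i in tr], [y[i] for i in tr])
            for i in out:
                press += (y[i] - (a + b * x[i])) ** 2
                ss += (y[i] - ybar) ** 2
    return 1.0 - press / ss

# a noise-free linear relation is predicted exactly, so q2 = 1
x = [float(i) for i in range(12)]
y = [2.0 * xi + 1.0 for xi in x]
print(lmo_q2(x, y, G=3))
```

A real LMO run would use the full descriptor matrix and a multivariate model; only the grouping logic carries over unchanged.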
2
Jackknife:
The training set → a number of subsamples, SubSampNo > G.
Each subsample → SubTrain and SubValid.
=> SubSampNo estimations of the parameters (instead of a time-consuming repetition of the experiment).
Used along with LMO cross-validation (internal validation).
3
[Figure: LMO scheme with n = 6, m = 2, G = 3. Each subsample splits the data into SubTrain1 … SubTrain4 and SubValid1 … SubValid4; each SubValid gives a q2i from the sum of (y - ŷ)2 over its left-out objects, and the estimates are pooled into q2TOT. The number of subsamples can be much larger than the number of molecules in the training set.]
4
[Figure: jackknife scheme with n = 6, m = 2, G = 3. Each SubTrain (SubTrain1 … SubTrain4, …) yields one coefficient vector (b1, b2, …, bsn); # subsamples >> # molecules in the training set.]
5
>> for i=1:subSampNo
       PERMUT(i,:)=randperm(Dr);
   end

for i=1:9                      % 9 subsamples
    PERMUT(i,:)=randperm(6);   % 6 molecules in train
end

PERMUT =
     6     5     2     4     3     1
     1     6     3     5     2     4
     5     2     6     4     3     1
     5     4     2     1     6     3
     5     4     1     6     2     3
     2     6     5     1     3     4
     1     2     6     5     3     4
     6     2     1     5     4     3
     4     5     1     6     3     2

In each row, the first four columns form SubTrain and the last two form SubValid.
6
SubTrain sets (→ b):    SubValid sets (→ q2):
6 5 2 4 → b1            3 1 → q21
1 6 3 5 → b2            2 4 → q22
5 2 6 4 → b3            3 1 → q23
5 4 2 1 → b4            6 3 → q24
5 4 1 6 → b5            2 3 → q25
2 6 5 1 → b6            3 4 → q26
1 2 6 5 → b7            3 4 → q27
6 2 1 5 → b8            4 3 → q28
4 5 1 6 → b9            3 2 → q29

For each SubTrain, D b = y is solved as b = D+ y; the q2i values are pooled into q2TOT, and the b estimates are collected into a histogram.
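The whole loop, permutations, SubTrain/SubValid split, the b = D+ y fit, and a q2 per subsample, can be sketched in Python. This is an illustrative translation of the MATLAB workflow, assuming a single descriptor so the pseudoinverse reduces to a scalar division; `jackknife_subsamples` is a hypothetical name:

```python
import random
from statistics import mean

def jackknife_subsamples(x, y, n_train, n_sub, seed=1):
    # each subsample: one random permutation of the objects (a row of
    # PERMUT); first n_train entries -> SubTrain (gives b), rest -> SubValid
    rng = random.Random(seed)
    n, ybar = len(x), mean(y)
    bs, q2s = [], []
    for _ in range(n_sub):
        perm = rng.sample(range(n), n)
        tr, va = perm[:n_train], perm[n_train:]
        # least-squares slope through the origin: b = D+ y for one descriptor
        b = sum(x[i] * y[i] for i in tr) / sum(x[i] ** 2 for i in tr)
        press = sum((y[i] - b * x[i]) ** 2 for i in va)
        ss = sum((y[i] - ybar) ** 2 for i in va)
        bs.append(b)
        q2s.append(1.0 - press / ss)
    return bs, q2s

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0 * xi for xi in x]                  # y = 2x exactly
bs, q2s = jackknife_subsamples(x, y, n_train=4, n_sub=9)
print(bs)    # every subsample recovers b = 2.0
print(q2s)   # every q2_i = 1.0
```

With real data the nine b values would scatter, and their histogram is exactly the per-descriptor distribution the next slides examine.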
7
[Figure: the descriptors d1, d2, …, dn of each SubTrain set (6 5 2 4 → b1; 1 6 3 5 → b2; …; 4 5 1 6 → b9) yield one coefficient vector per subsample; stacking them gives the distribution of b for each descriptor, e.g. the 3rd.]
8
Jackknife on all 31 molecules and all 53 descriptors, 200 subsamples (using MLR):
[Figure: jackknife b values for all descriptors, and histograms (frequency vs. b) of b for descriptors 25 and 15.]
9
Jackknife on all 31 samples and all 53 descriptors (using MLR):
[Figure: histograms (frequency vs. b) of the jackknife b values for descriptors 25 and 15, with fitted normal curves.]
>> histfit(bJACK(:,15),20);
10
What is the probability that 0.0 differs from the population merely by chance?
To determine this probability, all data in the population, and the value 0.0, should be standardized to z scores:

z = (x - x̄) / s
11
>> disttool
z = -1.5
Probability that -1.5 differs from μ by chance.
12
>> disttool
× 2 = 0.134 = p (two-tailed). The probability that the difference between -1.5 and μ is due to random error is not < 0.05 (p > 0.05), so -1.5 is not significantly different from the population.
p < 0.05 => significant difference.
>> cdf gives the area to the left of z: 0.0668.
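The standardization and two-tailed probability can also be computed without disttool. A small Python sketch using the standard-normal CDF (the assumption, as on these slides, is that the b distribution is treated as normal; `norm_cdf` and `two_tailed_p` are hypothetical names):

```python
from math import erf, sqrt

def norm_cdf(z):
    # standard-normal CDF: the area to the left of z (what >> cdf returns)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def two_tailed_p(value, mu, sigma):
    # standardize, then double the one-sided tail area
    z = (value - mu) / sigma
    return 2.0 * norm_cdf(-abs(z))

print(round(norm_cdf(-1.5), 4))                # 0.0668: area left of z = -1.5
print(round(two_tailed_p(0.0, -1.5, 1.0), 3))  # 0.134: 0.0 vs population at mu = -1.5
```

In the jackknife context, `mu` and `sigma` would be the mean and standard deviation of the 200 b estimates for one descriptor, and `value` is 0.0.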
13
[Figure: all descriptors, MLR — b vs. descriptor number; b vs. subsample number; histogram (frequency) of q2 values; p-value vs. descriptor number.]
q2TOT = -407.46
# p < 0.05 = 0
# significant descriptors = 0
14
[Figure: all descriptors, PLS, lv = 14 — b vs. descriptor number; b vs. subsample number; histogram (frequency) of q2 values; p-value vs. descriptor number.]
q2TOT = -0.0988
# p < 0.05 = 28
# significant descriptors = 28
15
[Figure: all descriptors, PLS, lv = 14 — b vs. descriptor number; p-value vs. descriptor number. q2TOT = -0.0988; # p < 0.05 = 28.]
Significant descriptors with p < 0.05 can be sorted according to p value, for doing a forward selection:
---------------------------------
Desc No      p
---------------------------------
51           1.4002e-022
37           1.383e-010
35           8.605e-009
38           9.1021e-009
39           1.8559e-008
36           8.7005e-008
15           0.00027689
1            0.00038808
2            0.00040547
45           0.00059674
32           0.00063731
---------------------------------
16
q2TOT at different numbers of latent variables in PLS (applying all descriptors); the program was run 4 times:

lv    run 1    run 2    run 3    run 4
 8   -.0411    .0776   -.0431    .0270
 9    .2200    .2340    .3641    .2576
10    .1721    .1147    .2391    .1434   (37 signif. variables)
11    .2855    .1948    .0667    .2372
12    .1847    .1275    .2390    .2184
13   -.0343   -.1439    .0120    .0049
14   -.2578   -.2460   -.3010   -.0989   (28 signif. variables)

[Figure: q2TOT vs. number of latent variables (8-14), annotated "Overfitting" and "Information ↓".]
17
for lv=6:13                  % number of latent variables in PLS
    for i=lv:18
        [p,Z,q2TOTbox(lv,i),q2,bJACK]= ...
            jackknife(D(:,SRTDpDESC(1:i,1)), y, 150, 27, 2, lv);
    end
end
[Figure: surface of q2TOT vs. lv and number of descriptors. Max q2TOT at lv = 7 and # desc = 7.]
18
D=Dini(:,[38 50 3]);
[q2, bJACK]=jackknife(D, y, 500, 27)

[Figure: histograms of the resulting q2 and b values.]

Three significant descriptors (p < 0.05), as an example.
19
[p,Z,q2TOTbox(lv,i),q2,bJACK]= ...
    jackknife(D(:,[34 38 45 51]), y, 150, 27, 2, 7);

[34 38 45 51]: selected descriptors
150: number of subset samples in the jackknife
27: number of samples in the training set of each subset
2: calibration method (1, MLR; 2, PLS)
7: number of latent variables in PLS
Jackknife is a method for determining the significant descriptors, used alongside LMO CV as internal validation, and it can also be applied for descriptor selection.
function jackknife
20
Exercise:
Apply the jackknife to a selected set of descriptors using MLR, and examine the results and the significance of the descriptors.
21
Cross model validation (CMV)
Anderssen et al., Reducing over-optimism in variable selection by cross model validation, Chemom Intell Lab Syst (2006) 84, 69-74.
Validation during variable selection, and not posterior to it.
Gidskehaug et al., Cross model validation and optimization of bilinear regression models, Chemom Intell Lab Syst (2008) 93, 1-10.
CMV
The data set → a number of train and test sets.
Each train set → subsamples → SubTrain and SubValid.
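The nested splitting that CMV requires can be sketched as follows; this is an illustrative Python skeleton, not the course code, and `cmv_splits` is a hypothetical name:

```python
import random

def cmv_splits(n, n_test, n_subvalid, n_outer, n_inner, seed=2):
    # cross model validation: the outer loop separates a test set that never
    # takes part in variable/lv selection; the inner loop makes SubTrain/
    # SubValid splits of the remaining train objects for the selection step
    rng = random.Random(seed)
    splits = []
    for _ in range(n_outer):
        order = rng.sample(range(n), n)
        test, train = order[:n_test], order[n_test:]
        inner = []
        for _ in range(n_inner):
            sub = rng.sample(train, len(train))   # selection sees train only
            inner.append((sub[n_subvalid:], sub[:n_subvalid]))
        splits.append((train, test, inner))
    return splits

for train, test, inner in cmv_splits(n=15, n_test=3, n_subvalid=3,
                                     n_outer=3, n_inner=4):
    for sub_train, sub_valid in inner:
        # the test objects make no contribution to the selection process
        assert not set(test) & (set(sub_train) | set(sub_valid))
    print(len(train), len(test))
```

In the real procedure, jackknife variable selection and lv optimization run inside each inner split, and the PLS model built on the full train set predicts only its test set.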
22
[Figure: CMV scheme with n = 15, m = 3, G = 3, showing Train and Test sets. Within each train set, the jackknife selects the variables and the number of latent variables; a PLS model (b1) built on the train set predicts its test set, and q2CMV1, q2CMV2, … are computed from the sums of (y - ŷ)2. The test set makes no contribution to the variable and lv selection process.]
23
[Figure: the procedure is repeated for each train/test split, giving q2CMV1 … q2CMVm.]
CMV: effective external validation.
24
[q2TOT,q2CMV]=crossmv(trainD,trainy,testD,testy,selVAR,7)

selVAR: set of selected descriptors (the applied calibration method is PLS)
7: number of latent variables in PLS
CMV is an effective external validation method ...
function CMV
25
Bootstrapping
Bootstrapping:
Bootstrap re-sampling, another approach to internal validation
Wehrens et al., The bootstrap: a tutorial, Chemom Intell Lab Syst (2002) 54, 35-52.
There is only one data set.
Data set should be representative of the population from which it was drawn.
Bootstrapping simulates random selection: generation of K groups of size n by repeated random selection (with replacement) of n objects from the original data set.
26
Some objects can be included in the same random sample several times, while other objects may never be selected.
The model obtained on the n randomly selected objects is used to predict the target properties for the excluded samples, and q2 is estimated as in LMO.
27
for i=1:10              % number of subsamples in bootstrap
    for j=1:6           % Dr=6, number of molecules in Train
        RND=randperm(6);
        bootSamp(i,j)=RND(1);
    end
end
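The same resampling idea in Python: draw with replacement and collect the never-selected objects as the validation set (illustrative sketch; `bootstrap_samples` is a hypothetical name):

```python
import random

def bootstrap_samples(n, k, seed=3):
    # K bootstrap groups: draw n objects WITH replacement; some objects
    # appear several times, and those never drawn form SubValid (out-of-bag)
    rng = random.Random(seed)
    samples = []
    for _ in range(k):
        draw = [rng.randrange(n) for _ in range(n)]
        oob = sorted(set(range(n)) - set(draw))
        samples.append((draw, oob))
    return samples

for draw, oob in bootstrap_samples(n=6, k=3):
    print(draw, oob)   # drawn molecules (repeats allowed), excluded molecules
```

Each `draw` plays the role of one bootSamp row (the SubTrain used to fit b), and each `oob` is the SubValid used for the corresponding q2.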
bootSamp =
     5     5     6     3     6     1   → b1;   left out: 2 4 → q21
     4     2     6     3     2     6   → b2;   left out: 1 5 → q22
     …
     (subsample 10)                    → b10;  left out: 5 → q210

Each SubTrain has the same number of molecules as Train; the molecules not present in SubTrain form SubValid.
28
[Figure: bootstrap b values over 200 subsamples, and bBOOT vs. descriptor number for the three descriptors 38, 50 and 15.]
29
The distributions of the b values are not normal => nonparametric estimation of the confidence limits.
[Figure: sorted bBOOT values for each descriptor (vs. sorted subsample number), and a histogram (frequency vs. b value).]
With 200 subsamples, 200 × 0.025 = 5 => the 5th value from the left and the 5th from the right are the 95% confidence limits. An interval that excludes zero => significant; an interval that includes zero => not significant.
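The nonparametric limits can be sketched as follows (illustrative Python, assuming 200 bootstrap values per descriptor; `percentile_ci` and `is_significant` are hypothetical names):

```python
import random

def percentile_ci(values, alpha=0.05):
    # nonparametric confidence limits: sort the bootstrap b values and take
    # the k-th value from each end, k = round(len * alpha / 2)
    v = sorted(values)
    k = round(len(v) * alpha / 2.0)   # 200 * 0.025 = 5
    return v[k - 1], v[-k]            # 5th from left, 5th from right

def is_significant(values, alpha=0.05):
    # an interval excluding zero -> the descriptor is significant
    lo, hi = percentile_ci(values, alpha)
    return not (lo <= 0.0 <= hi)

rng = random.Random(4)
b_signif = [0.3 + 0.05 * rng.random() for _ in range(200)]  # well away from 0
b_nonsig = [rng.random() - 0.5 for _ in range(200)]         # straddles 0
print(is_significant(b_signif), is_significant(b_nonsig))   # True False
```

The 5th-from-each-end rule is exactly the slide's 200 × 0.025 = 5 counting, written as an index into the sorted values.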
30
[Figure: 95% confidence limits of b for the three descriptors.]

Desc No   lower limit   upper limit
38        -12e-5        -1.5e-5      small effect, but significant
50         0.1113        0.5131      significant
15        -0.0181        0.0250      not significant
31
[bBOOT]=bootstrp(trainD, trainy, 1000, 2, 7)

1000: number of subset samples in bootstrapping (# molecules in each SubTrain = # molecules in Train)
2: calibration method (1, MLR; 2, PLS)
7: number of latent variables in PLS
Bootstrap is a method for determining the confidence interval for descriptors...
function Bootstrapping
32
Model validation
Y-randomization:
Random shuffling of the dependent variable vector, followed by development of a new QSAR model using the original independent variable matrix.
The process is repeated a number of times.
Expected: QSAR models with low R2 and LOO q2 values; an acceptable model cannot be obtained in this way.
Sometimes high q2 values are obtained instead => chance correlation or structural redundancy of the training set.
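A minimal Y-randomization sketch (illustrative Python with a one-descriptor correlation model, not the course code; `r2` and `y_randomization` are hypothetical names):

```python
import random
from statistics import mean

def r2(xs, ys):
    # squared correlation between one descriptor and the activity
    xb, yb = mean(xs), mean(ys)
    sxy = sum((x - xb) * (y - yb) for x, y in zip(xs, ys))
    sxx = sum((x - xb) ** 2 for x in xs)
    syy = sum((y - yb) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def y_randomization(xs, ys, n_rounds=50, seed=5):
    # shuffle ONLY the dependent variable, refit, collect R2 of the
    # scrambled models; the descriptor matrix is left untouched
    rng = random.Random(seed)
    y = list(ys)
    scrambled = []
    for _ in range(n_rounds):
        rng.shuffle(y)
        scrambled.append(r2(xs, y))
    return scrambled

x = [float(i) for i in range(20)]
y = [1.5 * xi + 0.3 for xi in x]
print(r2(x, y))                       # close to 1 for the true model
print(max(y_randomization(x, y)))     # scrambled models stay far lower
```

If the scrambled R2 values approached the true one, the original fit would be suspected of chance correlation.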
33
Training and test
External validation:
Selecting training and test sets:
a. Finding a new experimentally tested set: not a simple task.
b. Splitting the data set into a training set (for establishing the QSAR model) and a test set (for external validation).
Both the training and test sets should separately span the whole descriptor space occupied by the entire data set.
Ideally, each member of the test set should be close to at least one point in the training set.
34
Approaches for creating training and test sets:
1. Straightforward random selection
Yasri et al., Toward an optimal procedure for variable selection and QSAR model building, J Chem Inf Comput Sci (2001) 41, 1218-1227.
2. Activity sampling
Kauffman et al., QSAR and k-nearest neighbor classification analysis of selective cyclooxygenase-2 inhibitors using topologically based numerical descriptors, J Chem Inf Comput Sci (2001) 41, 1553-1560.
Mattioni et al., Development of QSAR and classification models for a set of carbonic anhydrase inhibitors, J Chem Inf Comput Sci (2002) 42, 94-102.
35
3. Systematic clustering techniques
Burden et al., Use of automatic relevance determination in QSAR studies using Bayesian neural networks, J Chem Inf Comput Sci (2000) 40, 1423-1430.
Snarey et al., Comparison of algorithms for dissimilarity-based compound selection, J Mol Graph Model (1997) 15, 372-385.
4. Self organizing maps (SOMs)
Gramatica et al., QSAR study on the tropospheric degradation of organic compounds, Chemosphere (1999) 38, 1371-1378.
Better than random selection
Kohonen map, 53 × 31 Selwood data matrix. The columns (molecules) are used as input for the Kohonen map, so that samples are taken from all regions of the molecule space (e.g. map cells containing molecules 19, 18; 4, 23, 14; 3, 20; 15, 16).
Train/Test arrangement 1 — test set: 19, 18, 3, 20, 4, 23, 14, 15, 16 (the others → train)
Train/Test arrangement 2 — test set: 27, 12, 3, 7, 30, 23, 11, 16
                                                          RMSECV   RMSEP
Sample selection (Kohonen), descriptor selection (Kohonen)   0.4384   0.6251
Sample selection (Kohonen), descriptor selection (p-value)   0.4205   0.6432

Descriptor selection using p-value: 51, 37, 35, 38, 39, 36, 15
Descriptor selection using Kohonen correlation map (correlation with activity): 35, 36, 37, 40, 44, 43, 51, 15
38
5. Kennard Stone
Kennard et al., Computer aided design of experiments, Technometrics (1969) 11, 137-148.
Bourguignon et al., Optimization in irregularly shaped regions: pH and solvent strength in reverse-phase HPLC separation, Anal Chem (1994) 66, 893-904.
6. Factorial and D-optimal design
Eriksson et al., Multivariate design and modeling in QSAR. Tutorial, Chemometr Intell Lab Syst (1996) 34, 1-19.
Mitchell et al., Algorithm for the construction of "D-optimal" experimental designs, Technometrics (2000) 42, 48-54.
39
Gramatica et al., QSAR modeling of bioconcentration factors by theoretical molecular descriptors, Quant Struct-Act Relat (2003) 22, 374-385.
D-optimal
Selection of the samples that maximize the determinant |X'X|.
X: variance-covariance (information) matrix of the independent variables (descriptors), or of the independent plus dependent variables.
These samples span the whole area occupied by the representative points and constitute the training set; the points not selected are used as the test set.
=> well-balanced structural diversity and representativity of the entire data space (descriptors and responses).
40
trainD1 = [D(1:3:end,:); D(2:3:end,:)];
trainD2 = D([1:2 5:13 17 21 22 25:end],:);
Selected descriptors                    detCovDy (trainD1)   detCovDy (trainD2)
D=Dini;                  % all          -3.48e-236 !!        2.13e-243 !!
D=Dini(:,[51 37 35 38 39 36 15]);       2.18e53              2.66e53
D=Dini(:,[38 50 3]);                    5.90e08              4.45e08

Optimum selection of descriptors and molecules in the training set can be performed using detCovDy (D-optimal design).
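The detCovDy criterion amounts to computing det(X'X) for candidate training sets and preferring the larger value. An illustrative Python sketch (`det_xtx` is a hypothetical name, and plain Gaussian elimination stands in for MATLAB's det):

```python
def det(mat):
    # determinant by Gaussian elimination with partial pivoting
    a = [row[:] for row in mat]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-300:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d                      # each row swap flips the sign
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def det_xtx(X):
    # |X'X|: the D-optimality criterion for a candidate training set
    cols = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(cols)]
           for i in range(cols)]
    return det(xtx)

# a spread-out candidate set carries more information than a clustered one
spread = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
clustered = [[1.0, 1.0], [1.05, 1.0], [1.0, 1.05], [1.05, 1.05]]
print(det_xtx(spread) > det_xtx(clustered))   # True
```

This is the same comparison as the table above: the near-collinear all-descriptor matrix gives a vanishing determinant, while a well-chosen subset gives a large one.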
41
leverage
leverage = xT (XT X)-1 x

Model applicability domain
No matter how robust, significant and validated a QSAR model may be, it cannot be expected to reliably predict the modeled property for the entire universe of chemicals!
Leverage is a criterion for determining whether a query compound falls within the applicability domain of the model:
x: descriptor vector of the query compound
X: matrix of the training-set independent variables
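The leverage formula can be sketched for a two-descriptor training matrix, where the inverse of X'X has a closed form (illustrative Python; `leverage` here is a hypothetical helper, not the course function):

```python
def leverage(x, X):
    # h = x' (X'X)^-1 x for a query vector x; with 2 descriptors the
    # 2x2 inverse can be written out explicitly
    s11 = sum(r[0] * r[0] for r in X)
    s12 = sum(r[0] * r[1] for r in X)
    s22 = sum(r[1] * r[1] for r in X)
    d = s11 * s22 - s12 * s12            # det(X'X)
    inv = [[s22 / d, -s12 / d], [-s12 / d, s11 / d]]
    t0 = inv[0][0] * x[0] + inv[0][1] * x[1]
    t1 = inv[1][0] * x[0] + inv[1][1] * x[1]
    return x[0] * t0 + x[1] * t1

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
print(leverage([0.5, 0.5], X))           # inside the training space: small h
print(leverage([10.0, -10.0], X) > 1.0)  # far outside: high leverage -> outside AD
```

A query compound with leverage far above those of the training samples, as on the next slide, lies outside the applicability domain.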
42
[Figure: leverages of the train samples, and of the test samples (on the order of 10^13-10^14), when all descriptors are used.]
Using all descriptors, the leverages of all test samples are very high.
This means that the test samples are not in the space of the training samples and cannot be predicted.
43
[Figure: leverages of the train and test samples when descriptors 38, 50, 3, 13 and 24 are used; all values fall in a similar range (roughly 0-1.4).]
Using a subset of descriptors (38, 50, 3, 13, 24), the leverages of the test samples are similar to those of the training samples.
This means that the test samples are in the space of the training samples and can be predicted.