![Page 1: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/1.jpg)
Chance Correlation in QSAR studies
Ahmadreza Mehdipour
Medicinal & Natural Product Chemistry Research Center
![Page 2: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/2.jpg)
Correlation or causation?
• Correlation is essential but not sufficient
• Correlation is meaningless unless its cause (or role) in the biological activity is interpreted
• A satisfactory QSAR correlation does not mean that a particular descriptor causes the efficient action of a compound
![Page 3: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/3.jpg)
Chance Correlation
• Topliss Ratio (J. Med. Chem. 1972, 15, 1066)
• A misconception
• Ratio of variables in model to Sample Size
• Ratio of variables in Data Pool to Sample Size
• Revalidation of the problem by Livingstone (J. Med. Chem. 2005, 48, 661)
![Page 4: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/4.jpg)
• Topliss et al. demonstrated that the more independent variables (X) that are available for selection in a multiple linear regression model, the more likely a model will be found by chance. These authors recommended that in order to reduce the risk of chance correlations there should be a certain ratio of data points to the number of independent variables available. Unfortunately, this ratio was often misinterpreted as the number of data points to the number of independent variables in the final model, a practice that did very little if anything to reduce chance effects.
D.W. Salt, S. Ajmani, R. Crichton, D.J. Livingstone, An improved approximation to the estimation of the critical F values in best subset regression. J. Chem. Inf. Model. 47 (2007) 143-149.
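As a rough illustration of why the size of the data pool matters, assume (a simplification, not part of the cited work) that the K candidate descriptors are mutually independent and that each one is tested alone at significance level α. The probability that at least one of them looks significant purely by chance is then:

$$
P(\text{at least one chance hit}) = 1 - (1 - \alpha)^{K}
$$

With α = 0.05 this gives about 0.40 for K = 10, 0.64 for K = 20 and 0.99 for K = 100, no matter how few variables end up in the final model. Real descriptors are usually correlated, so the formula is only an intuition for the trend, not an exact risk estimate.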
![Page 5: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/5.jpg)
Chance Correlation
How does it occur?
• A trial example with random data
• Characteristics:
• N (Sample Size)=20
• K (Number of variables in data pool)=10, 20, 50, 75, 100
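A trial of this kind can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions, not the procedure used for the slides: both the response and every descriptor are pure Gaussian noise, only the single best descriptor is selected (P = 1), and the function names `r_squared`, `best_chance_r2` and `mean_best_r2` are hypothetical.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def best_chance_r2(n_samples, k_pool, rng):
    """Highest single-descriptor R^2 found in a pool of pure-noise descriptors."""
    y = [rng.gauss(0, 1) for _ in range(n_samples)]
    pool = [[rng.gauss(0, 1) for _ in range(n_samples)] for _ in range(k_pool)]
    return max(r_squared(x, y) for x in pool)


def mean_best_r2(n_samples, k_pool, n_runs=200, seed=0):
    """Average the best chance R^2 over repeated random trials."""
    rng = random.Random(seed)
    total = sum(best_chance_r2(n_samples, k_pool, rng) for _ in range(n_runs))
    return total / n_runs


if __name__ == "__main__":
    # Same settings as the trial example: N = 20, growing data pool K.
    for k in (10, 20, 50, 75, 100):
        print(f"N=20 K={k:3d} mean best R2 = {mean_best_r2(20, k):.3f}")
```

Because no descriptor carries any real information, any rise of the best R² with K is caused by selection from a larger pool alone.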
![Page 6: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/6.jpg)
N=20 K=10
![Page 7: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/7.jpg)
N=20 K=20
![Page 8: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/8.jpg)
N=20 K=50
![Page 9: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/9.jpg)
N=20 K=75
![Page 10: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/10.jpg)
N=20 K=100
![Page 11: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/11.jpg)
Avoiding chance correlation
What should we do?
![Page 12: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/12.jpg)
Solutions for detection of chance correlation
• Fmax critical
• Randomization of Y (input scrambling)
• Validation procedures
![Page 13: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/13.jpg)
Fmax Critical
Livingstone approach
The normal tabulated F is a valid test of significance
ONLY WHEN
K = P
K = number of variables in data pool
P = number of variables in model
![Page 14: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/14.jpg)
Fmax Critical
However, in most cases K >> P
K = number of variables in data pool
P = number of variables in model
N = sample size
![Page 15: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/15.jpg)
Introduction of Fmax Critical
• Simulated random data
• Run 1000 times
• Different N, K and P
• Obtain Fmax for each combination (for a significance level of 5%)
• Check for some known data sets
www.cmd.port.ac.uk
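The simulation idea can be sketched as follows. This is a minimal illustration, not the published algorithm: it assumes Gaussian noise for both response and descriptors, restricts the model to the single best descriptor (P = 1) so that no subset search is needed, and takes the empirical 95th percentile of the simulated Fmax values as the critical value. The names `f_statistic` and `fmax_critical` are hypothetical.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def f_statistic(r2, n_samples, p_model):
    """Regression F value for a model with p_model variables and an intercept."""
    return (r2 / p_model) / ((1.0 - r2) / (n_samples - p_model - 1))


def fmax_critical(n_samples, k_pool, n_runs=1000, alpha=0.05, seed=0):
    """Empirical critical Fmax for picking the best 1 of k_pool noise descriptors."""
    rng = random.Random(seed)
    fmax_values = []
    for _ in range(n_runs):
        y = [rng.gauss(0, 1) for _ in range(n_samples)]
        # Best R^2 over the whole pool: this is the selection step.
        best_r2 = max(
            r_squared([rng.gauss(0, 1) for _ in range(n_samples)], y)
            for _ in range(k_pool)
        )
        fmax_values.append(f_statistic(best_r2, n_samples, 1))
    fmax_values.sort()
    index = min(n_runs - 1, int((1.0 - alpha) * n_runs))
    return fmax_values[index]


if __name__ == "__main__":
    for k in (1, 10, 50):
        print(f"N=20 P=1 K={k:2d} critical Fmax = {fmax_critical(20, k):.2f}")
```

With K = P = 1 the simulated value should land near the tabulated F for 1 and 18 degrees of freedom; as K grows, a model has to exceed a higher threshold before it can be called significant.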
![Page 16: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/16.jpg)
Randomization of Y
Ys are randomly attributed to samples
![Page 17: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/17.jpg)
Y-randomization
However, this method should also be performed during the variable selection process
If R²max and Q²max of the randomized models are low,
then the risk of chance correlation is low
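A Y-randomization test that repeats the variable selection on every scrambled response can be sketched as below. It is a simplified illustration: selection is reduced to picking the single descriptor with the highest R², and the names `select_best_r2` and `y_randomization` are hypothetical, as is the choice of 200 shuffles.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def select_best_r2(pool, y):
    """Variable selection step: R^2 of the best single descriptor in the pool."""
    return max(r_squared(x, y) for x in pool)


def y_randomization(pool, y, n_shuffles=200, seed=0):
    """Compare the real model with models selected on scrambled responses.

    Returns (observed R^2, highest scrambled R^2, empirical p-value).
    """
    rng = random.Random(seed)
    observed = select_best_r2(pool, y)
    shuffled = list(y)
    scrambled = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # Y values are randomly reassigned to samples
        scrambled.append(select_best_r2(pool, shuffled))  # selection is redone
    hits = sum(r >= observed for r in scrambled)
    p_value = (1 + hits) / (n_shuffles + 1)
    return observed, max(scrambled), p_value


if __name__ == "__main__":
    rng = random.Random(42)
    pool = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(50)]
    noise_y = [rng.gauss(0, 1) for _ in range(20)]
    print(y_randomization(pool, noise_y))
```

Scrambling Y only for the final, already selected descriptors would skip the step where most chance correlation enters, which is why the selection is repeated inside the loop.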
![Page 18: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/18.jpg)
Cross-validation Process
Different N, K, P
• N = 10, 20, 30, 40, 50, 80, 100
• P = 1-8
• K = P, 10, 20, 30, 50, 100
Run 1000 times
Evaluation factors:
• R² of training set
• Q²1 = Q² for LOO CV
• Q²20% = Q² for Leave-20%-of-samples-Out CV
• Q²50% = Q² for Leave-50%-of-samples-Out CV
• R²P = R² of one random test set (25% of samples)
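The cross-validated Q² values listed above can be computed along the following lines. This is a sketch under assumptions, not the authors' code: it uses a one-descriptor linear model, defines Q² as 1 − PRESS/SS with SS taken around the mean of all Y values, and forms the left-out groups by a random split into `n_folds` parts. The names `fit_line` and `q_squared` are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares slope and intercept for a one-descriptor model."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx


def q_squared(x, y, n_folds, seed=0):
    """Cross-validated Q^2 = 1 - PRESS / SS_total.

    n_folds == len(y) gives leave-one-out; 5 folds leave about 20% out;
    2 folds leave about 50% out.
    """
    rng = random.Random(seed)
    indices = list(range(len(y)))
    rng.shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    press = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        # Accumulate squared prediction errors on the left-out samples only.
        press += sum((y[i] - (intercept + slope * x[i])) ** 2 for i in held_out)
    mean_y = sum(y) / len(y)
    ss_total = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - press / ss_total


if __name__ == "__main__":
    rng = random.Random(7)
    x = [rng.gauss(0, 1) for _ in range(20)]
    y = [rng.gauss(0, 1) for _ in range(20)]  # pure noise response
    for label, folds in (("LOO", 20), ("L20%O", 5), ("L50%O", 2)):
        print(f"Q2 {label}: {q_squared(x, y, folds):.3f}")
```

To mirror the study design, this function would be applied to the model chosen from a noise pool in each of the repeated runs, keeping the maximum values as the chance-level reference.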
![Page 19: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/19.jpg)
[Figure: five panels plotting R²max, Q²1max, Q²20%max, Q²50%max and R²Pmax (vertical axis, 0 to 1) against p (horizontal axis, 1 to 9), with one curve per sample size n = 10, 20, 30, 40, 50, 80, 100.]
![Page 20: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/20.jpg)
Cross-validation Process
• Leave-one-out vs. leave-group-out
• Q² for L50%O is independent of N, K, P
Hemmateenejad B, Mehdipour AR, Bagheri L, Miri R, Judging the significance of the multiple linear regression-based QSAR models by cross-validation. To be submitted
![Page 21: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/21.jpg)
Concluding Remarks
Be aware of N to K ratio
Not only N to P ratio
Check different approaches for chance correlation
![Page 22: Chance Correlation in QSAR studies](https://reader036.vdocuments.site/reader036/viewer/2022081420/5681634e550346895dd3ed0b/html5/thumbnails/22.jpg)
Models are not real but
sometimes are helpful