the good, the bad, and the ugly…practices of qsar...
TRANSCRIPT
![Page 1: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/1.jpg)
The good, the bad, and the ugly…practices of QSAR modeling
Alexander TropshaLaboratory for Molecular Modeling
andCarolina Center for Exploratory
Cheminformatics ResearchSchool of Pharmacy
UNC-Chapel Hill
![Page 2: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/2.jpg)
Some definitions
• Clint Eastwood is the "good" (slow to anger, but quick on the trigger)
• Lee Van Cleef is the bad (an elegant exemplar of absolute evil) and
• Eli Wallach is the "ugly" (a menacingly funny, totally amoral bandido whose relationship with the Eastwood character consists largely of betrayals).
![Page 3: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/3.jpg)
OUTLINE• Introduction: The need for developing
externally validated predictive models of biological data
• Why do models fail (bad practices)• Predictive QSAR Modeling Workflow (good
practices)• Examples of the Workflow applications
– QSAR based virtual screening and hit identification– Consensus QSAR modeling of chemical toxicity
• Conclusions: “best” QSAR modeling is a decision support science focus on accurate predictions
![Page 4: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/4.jpg)
Large fraction are confirmed actives
Small number of computational hits
SAR dataset External database/library
Key point: Focus on Externally Validated Predictions
QSAR Magic
Input
Output
Real Test
![Page 5: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/5.jpg)
QSAR Modeling is easy…
Chemistry Biology(IC50, Kd...)
Cheminformatics(Molecular Descriptors)
Comp.1 Value1 D1 D2 D3 Dn
Goal: Establish correlations
Comp.2 Value2 " " " "Comp.3 Value3 " " " "
Comp.N ValueN " " " "- - - - - - - - - - - - - - - - - - - - - -
between descriptors and the target property capable of predicting activities of novel compounds
BA = F(D) {e.g., …}(e.g., -LogIC50 = k1D1+k2D2+…+knDn)
∑ −∑ −
−=2
22
)()ˆ(1
i
ii
yyyyq
|
||
|
![Page 6: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/6.jpg)
y = 0.5958x + 2.3074
R2 = 0.2135
3
5
7
9
3 5 7 9
Predicted
Obs
erve
d
EXTERNAL TEST SET PREDICTIONSy = 0.4694x + 2.9313
R2 = 0.1181
3
5
7
9
3 5 7 9
Predicted
Obs
erve
d
But … the unbearable lightness of model building for training sets…
…leads to unacceptable prediction accuracy. 0
0.5
1
1.5
2
2.5
3
0 1 2 3 4Actual LogED50 (ED50 = mM/kg)
Pre
dict
ed L
ogED
50
TrainingLinear (Training)
![Page 7: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/7.jpg)
Choices and Practices• Descriptors (thousands and counting)• Data-analytical methods (dozens and
counting)• Validation approaches (unfortunately (!)
only a handful but counting)• Experimental validation as part of model
building (very rare)BUT
• We typically use one (or at best very few) modeling techniques
• Publish successes only • Compete but (mostly) indirectly
![Page 8: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/8.jpg)
BEWARE OF q2!!!
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
0.5 0.6 0.7 0.8 0.9 1
q2
R2
Poor ModelsGood Models
Golbraikh & Tropsha, J. Mol. Graphics Mod. 2002, 20, 269-276.
•Only a small fraction of “predictive” training set models withLOO q2 > 0.6 is capable of making accurate predictions (r2 > 0.6)for the test sets.
![Page 9: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/9.jpg)
Some reasons as to why QSAR fails
• No external validation• Incorrect selection of an external test set• Incorrect division of a dataset into training and test sets• Incorrect measure of prediction accuracy• Not all statistical criteria are used to estimate predictive
power of a model• No applicability domain• Incorrectly defined applicability domain• No Y-randomization• Leverage (structure) and activity outliers are not removed• Modeling set is too small
![Page 10: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/10.jpg)
No external validation• It is still a problem, particularly in toxicity studies.
A typical paper:A small dataset of congeneric compounds: n~10-20 QSAR as a linear regression in the form:
log(1/EC50)=a*log P + b(sometimes, additional 1-2 descriptors like ELOMO or the number of H-bond donors in a molecule are included)The only validation method used: Leave-group-out cross-validationRelatively high q2 (not always) and R2:
Typical model acceptance criteria: q2>0.5, r2-q2>0.3 (some use R instead of R2, because R>R2)No true validation using compounds not included in the training set(in some cases the model is tested on just 2-3 compounds)
![Page 11: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/11.jpg)
Artificial Improvement of Predictive Ability of a QSAR Model
• Johnson, S.R. The Trouble with QSAR (or How I Learned To Stop Worrying and Embrace Fallacy). J. Chem. Inf. Model. 2008, 48, 25-26:"The common practice has been to select the model with the best fitness function score and predict a small group of observations that were withheld at the beginning. All too often, the model development process stops here, or, worse, the validation set is poorly predicted and models are iteratively tested until one predicts this set of compounds well."
A typical example:A dataset is divided into a training and test setMultiple QSAR models with high q2 values are built using training setQSAR model with the highest R2 for the test set is selected
Selected model might have poor predictive ability for other compounds
Another EXTERNAL EVALUATION SETS are necessary
![Page 12: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/12.jpg)
• Typical division of a dataset into training and test sets: random– Undesired outcome:
• some compounds of the test set can be out of the applicability domain of the training set
• large gaps of activity in the training or the test set and activity outliers in them
• Requirements for training and test sets:– Compounds with maximum and minimum activities of the dataset should
be included into the training set (important for methods that cannot extrapolate).
– Large gaps of activities is not allowed neither in training nor in test set.– Compounds of the training set should be distributed within the entire area
of the descriptor space occupied by the dataset.– Each compound of the test set should be close to at least one compound of
the training set.
Incorrect division of a dataset into training and test sets
![Page 13: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/13.jpg)
• A typical example of the target function:CCR=N(classified correctly)/N(total)
A dataset:Class 1: 80 compoundsClass 2: 20 compoundsModel: assign all compounds to Class 1.Target function: CCR=0.8 The model has high classification accuracy
Classification QSAR for Biased Datasets: Incorrect Target Function
• Better target function:CCR=0.5(Sensitivity+Specificity)
∑=
=K
ktotalk
corrk
NN
KCCR
1
1• General formula:
K – the number of classes
Nkcorr – the number of compounds of
class k assigned to class k
Nktotal – total number of compounds
of class k
• For category response variable, target functions can depend also on the absolute errors (differences between predicted and observed classes).
![Page 14: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/14.jpg)
Not all statistical parameters are used to estimate predictive power of a model
• Sometimes, cross-validation q2 for the training set and R2 for the test set are the only criteria used There are other important criteria which are not always used such as coefficients of determination and slopes for regressions through the origin (predicted vs. observed and observed vs. predicted activities), etc.
• Standard error of prediction alone is not a good statistical parameter, if it is not compared with the standard deviation of activities.
![Page 15: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/15.jpg)
No Applicability Domain for the Model
• Compounds which are highly dissimilar from all compounds of the training set (according to the set of descriptors selected) cannot be predicted reliablyLack of the AD:
unjustified extrapolationwrong prediction
Typical situation:a compound of the test set for which error of prediction is highis considered as outlierHOWEVER: a compound of the test set dissimilar from all compounds of the training set can be by chance predicted accurately
![Page 16: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/16.jpg)
Applicability Domain is Too Large
• Typical AD:Rectangular hyper-parallelepiped in the descriptor space with edges equal to intervals of change for each descriptor.This hyper-parallelepiped can be mostly empty with points concentrated onlyalong certain directions (it is particularly true, if descriptors are linearly dependent).Too large AD can lead to the same consequences as not having it at all.AD should be defined as a union of relatively small areas around all points of the training set. For example, currently we define it as
Dcutoff=<dnn>+Zσnn,where <dnn> and σnn are the average of distances between K nearest neighbors in the training set and their standard deviation, and Z is a user-defined value (default is 0.5), which can be adjusted.
![Page 17: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/17.jpg)
Y-randomization test is not carried out
• Y-randomization test:– Scramble activities of the training set– Build models and get model statistics.– If statistics are comparable to those obtained for models built with
real activities of the training set, the last are unreliable and should be discarded.
Frequently, Y-randomization test is not carried out.
Y-randomization test is of particular importance, if there is:
- a small number of compounds in the training or test set- response variable is categorical
![Page 18: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/18.jpg)
Outliers detection and removal• Many potential outliers can be detected in the dataset
prior to QSAR studies, but typically it is not done.
• Two types of outliersLeverage outliers: compounds dissimilar from all other compounds in a dataset.Activity outliers: compounds similar to some other compounds in the dataset, but their activities are quite different from those of their nearest neighbors.
![Page 19: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/19.jpg)
Detection of Leverage Outliers
• Calculate distance/similarity matrix in the entire descriptor space.
• For each compound, find its nearest neighbor/most similar compound.
• Find compounds which are out of the cutoff distance from their nearest neighbors or have similarity to the most similar compound lower than a predefined threshold. These compounds are leverage outliers for this predefined distance or threshold.
![Page 20: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/20.jpg)
Detection of Leverage Outliers using a Sphere-Exclusion algorithm
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
0 0.2 0.4 0.6 0.8 1
Leverage Outliers Not Outliers
PROBE SPHERE RADIUS R:
• Calculate distances between nearest neighbors.
• Find mean and standard deviation σ of these distances.
• R= + Zσ,
where Z is a user defined parameter Z-cutoff.
• Calculations with multiple Z-cutoff values can be carried out.
D
D
![Page 21: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/21.jpg)
Example of “misinformation”: Optimal (left panel) and traditional (right panel) orientations of androgen (DHT shown in gold) and estrogen (estradiol shown in green) within the human SHBG steroid-binding
site.
A. Cherkasov, JMC, in press
Identical q2 (CoMFA*) of 0.53
*CoMFA – Completely Misleading Famous Aberration
![Page 22: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/22.jpg)
Why can’t we get it Right? Have not we tried enough?
• Descriptors? No, we have plenty (e.g., 1000’s in Dragon)
• Datamining methods? No, we also have plenty (e.g., SAS)
• Training set statistics? NO, it does not work• Test set statistics? Maybe, but it is still insufficient
So…what else can we do?????• Change the success criteria! Leave behind the phase of
“narcissistic” modeling and focus on externalpredictivity and experimental validation.
• Recognize QSAR as an empirical data modeling approach: just do it any (all) way you like but VALIDATE on independent datasets!
![Page 23: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/23.jpg)
Revising QSAR Modeling Process : Predictive QSAR Modeling Workflow*
• Model Building: Combination of various descriptor sets and variable selection data modeling methods (Combi-QSAR)
• Model Validation– Y-randomization– Training, test, AND evaluation set selection– Model sampling and selection criteria– Applicability domain
• Consensus prediction using multiple models*Tropsha, A., Gramatica, P., Gombar, V. The importance of being earnest:…Quant. Struct. Act. Relat. Comb. Sci. 2003, 22, 69-77.
![Page 24: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/24.jpg)
Only accept models that have a
q2 > 0.6R2 > 0.6, etc.
Multiple Training Sets
Validated Predictive Models with High Internal
& External Accuracy
Predictive QSAR Workflow*
Original Dataset
Multiple Test Sets
Combi-QSAR ModelingSplit into
Training, Test, and External
Validation Sets
Activity Prediction
Y-Randomization
External validationUsing Applicability
Domain
Database Screening
UsingApplicability
Domain*Tropsha, A., Gramatica, P., Gombar, V. The importance of being earnest:…Quant. Struct. Act. Relat. Comb. Sci. 2003, 22, 69-77.
Experimental Validation
![Page 25: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/25.jpg)
DATASET
EXTERNAL VALIDATION SETMODELING SUBSET
TRAINING SET TEST SET
DIVISION OF A DATASET IN THREE SUBSETS AND EXTERNAL VALIDATION
"PREDICTIVE" MODELS
Rational division
MODELS Model Validation
External Validation
Random division
PREDICTIVE MODELS
![Page 26: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/26.jpg)
COMBINATORIAL QSAR
C-QicsKNNKNN
BINARY QSAR,…BINARY QSAR,…
COMFA descriptorsCOMFA descriptors
MolconnMolconn Z Z descriptorsdescriptors
Chirality descriptorsChirality descriptors
VolsurfVolsurf descriptorsdescriptors
Comma descriptorsComma descriptors
MOE descriptorsMOE descriptors
Dragon descriptorsDragon descriptors
SAR Dataset
Compound representation
Selection of best models
Model validation
and external test sets using Y-Randomization
QSAR modelingg SVMSVM
DECISION TREEDECISION TREE
Lima, P., Golbraikh, A., Oloff, S., Xiao, Y., Tropsha, A. Combinatorial QSAR Modeling of P-Glycoprotein Substrates. J. Chem. Info. Model., 2006 46, 1245-1254.Kovatcheva, A., Golbraikh, A., Oloff, S., Xiao, Y., Zheng, W., Wolschann, P., Buchbauer, G., Tropsha, A. Combinatorial QSAR of Ambergris Fragrance Compounds. J Chem. Inf. Comput. Sci. 2004, 44, 582-95
![Page 27: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/27.jpg)
DEFINING THE APPLICABILITY DOMAIN
Distribution of distances between points and their nearest neighbors in the training set
0
0.1
0.2
0.3
0.4
0-0.1
0.1-0.2
0.2-0.3
0.3-0.4
0.4-0.5
0.5-0.60.6-0.7
0.7-0.80.8-0.9
0.9-1.0
DistancesNi
/ N
Training setTest set
Training set: 60 compoundsTest set: 35 compounds
MODEL:Two nearest neighborsThe number of descriptors: 8Q2(CV)=0.57 R2 =0.67
DISTANCES:<D>train=0.287StDev(D)train=σ =0.149
Closest nearest neighbors of test set compounds:
Dtest ≤ <D>train+ σ ZCutOff(ZCutOff=0.5)
N is the total number of distances ( Ntrain=60 2=120; Ntest=70 )
Ni is the number of distances in each category (bin)
× ×
*Tropsha, A., Gramatica, P., Gombar, V. The importance of being earnest:…Quant. Struct. Act. Relat. Comb. Sci. 2003, 22, 69-77.
![Page 28: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/28.jpg)
Applicability domain vs. prediction accuracy (Ames Genotoxicity dataset)
80
82
84
86
88
90
92
0.5 1 2 3 5 10
kNN Z-Score Used for Prediction Cutoff
Test
Set
%A
ccur
acy
![Page 29: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/29.jpg)
Application of the Predictive QSAR Workflow to HDAC Inhibitors*
59 HDAC Inhibitors
Validation Set(9 compounds)
MultipleTraining Sets
Multiple Test Sets
Remaining Subset (50 compounds)
Y-Randomization
Variable Selection QSAR Models
1385 accepted models that have a
R2Train > 0.60R2Test > 0.60
70 Validated Predictive Models with High Internal
& External Accuracy
Screen 3,100,000
Compounds
27 Hits predicted as HDAC inhibitors
4 validated experimentally*
2 are mkMinhibitors
*collaboration with Bryan Roth, UNC
![Page 30: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/30.jpg)
Experimental validation of HDAC computational hits (data from Bryan
Roth’s lab, UNC-Pharmacology))
![Page 31: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/31.jpg)
Training:48 GGTI compounds (from -log 3.8 to 7.6)274 descriptors (MolconnZ)611 models generatedbest q2 = 0.77 best r2 = 0.94104 models pass cutoff (0.6 for q2&r2)
Database Search:~9 million small molecules (ZINC + Raw Database)
79 predicted actives using linear consensus kNN (0.5 cutoff)range = 4.51 to 5.96 (1.45 log)56 cmpds > 5.5 predicted activity
7 selected for further analysis based on high predicted activity, uniqueness of structure (divergent from training set), & availability
Distribution of Training Set Activities
3.5-4 4-4.5 4.5-5 5-5.5 5.5-6 6-6.5 6.5-7 7-7.5 7.5-80
5
10
15
- log IC50#
of C
ompo
unds
Application to GGTase-I Inhibitors
*collaboration with Y. Peterson & P. Casey, Duke
![Page 32: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/32.jpg)
GGTI QSAR Hits GGTI QSAR Hits –– GGTaseGGTase--I in vitro Activity AssayI in vitro Activity Assay
-8 -7 -6 -5 -4 -30.0
0.2
0.4
0.6
0.8
1.0
As1 68.55 +/- 26.54 -4.16
As2 18.91 +/- 6.84 -4.723
En1 122.6 +/-76.4 -3.912
En2 30.118+/- 8.473 -4.521
Sig 1 139.6 +/- 76.69 -3.855
Sig2 3.12 +/- 1.61 -5.506
Sig3 3.85 +/’- 1.12 -5.415
log [Compound], M
% D
MSO
Con
trol
IC50 (μM) log IC50
![Page 33: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/33.jpg)
2 Training Set Scaffolds
GGTIDUx SeriesPyrazoles
Peterson & Casey
GGTIx SeriesPeptidomimetics
Hamilton & Sebti
Novel Scaffolds DiscoveredDatabase Mining Reveals Unique Chemical EntitiesDatabase Mining Reveals Unique Chemical Entities
NH
NH
R1 O
OOHR2
R3
R2O
N N
R1
SNH R3
O
R3O
NHO R1
NH
R2
O
R3
NH O
CH3
N NN
N N
R1
R2
O
Amide
PyrazoleN N
R6
R5
ON
OR1 R2
R3
R4
Amide
Carboxyl
Amine
Linker
Amide
Amide
Furan
Pyrazole Amide
Amide
Amide
Tetrazole
![Page 34: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/34.jpg)
Characteristic AmpC Ligands and Decoys and Their Ranks by Different Scoring Functions. Blue = DOCK, magenta = ScreenScore, yellow = FlexX, cyan = PLP, purple = PMF, and red = SMoG (SMoG ranks are based on a ranking, which does not include halogenated compounds).
*J. Med. Chem. 2005, 48, 3714-3728
Decoys are frequently indistinguishable from binders
using typical SB scoring functions.*
Application of Cheminformatics Approaches to Binding Decoys
![Page 35: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/35.jpg)
Recent examples of experimentally validated QSAR-based predictions
• Anticonvulsants: Shen, M. et al, J. Med. Chem. 2004, 47, 2356-2364.
• HIV-1 reverse transcriptase inhibitors: Medina-Franco, J., et al, J. Comput. Aided. Mol. Des., 2005, 19, 229–242
• D1 receptor antagonists: Oloff et al, J. Med. Chem., 2005,48, 7322-32
• Anticancer agents: Zhang et al, J. Comp. Aid. Molec. Des., 2007, 21, 97-112.
• Amp inhibitors: Zhang L. et al, J. Comp. Aid. Molec. Des., 2008 (ASAP)
• HDAC inhibitors: Wang, S. et al, (unpublished)• GGT-I inhibitors: Wang, Peterson, et al (provisional
patent)
![Page 36: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/36.jpg)
PubChem NCGC AmpC Dataset Reexamined by QSAR and Docking based Virtual Screening
ASAP March 13, 2008
ASAP March 12, 2008
![Page 37: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/37.jpg)
QSAR modeling of binders vs. binding decoys and structure-less virtual screening
[blue:DOCK, magenta: ScreenScore, yellow: FlexX, cyan: PLP, purple: PMF, and red: SMoG]
(Graves AP; et al. J. Med. Chem. 2005, 48, 3714)
![Page 38: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/38.jpg)
Classification QSAR modeling of binders vs. apparent non-binding decoys.*
T4 lysozymeL99A
AmpCβ-lactamase
Inhibitors 55 21
Non-binders 65 80
Total 120 101
Inhibitors: (AmpC) compounds can inhibit competitively
(T4) compounds are known to bind to T4 lysozyme L99A
Non-binders: (AmpC) compounds do not inhibit at 1 mM
(T4) compounds can’t detect binding at high concentrations
*http://shoichetlab.compbio.ucsf.edu/take-away.php
![Page 39: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/39.jpg)
Study Design
80 nonbinders21 inhibitors
Modeling Set51 compounds
Binary kNN QSARModel Building
(MolConnZ descr)
342Predictive Models
Database Mining
50 nonbindersdissimilar to
inhibitors10 compounds
64 HTS ‘hits’(non-binders)
Similarity Search
Class 1 Class 2
Correct Classification Rate (CCR) = 0.5 * (TP/N1+TN/N0)
N1: Total number of inhibitorsN0: Total number of nonbinders
![Page 40: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/40.jpg)
External Validation Results
Experimental
PredictedInhibitor Nonbinder Total
Inhibitor 5 0
5
5
5
Nonbinder 0 5
Total 5 10
Experimental
PredictedInhibitor Nonbinder Total
Inhibitor 0 6
41
47
6
Nonbinder 0 41
Total 0 47
10 randomly excluded compounds
CCR = 0.5 * (5/5 + 5/5) = 1
50 nonbinders dissimilar to inhibitors
Accuracy = 41/47 = 0.87
![Page 41: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/41.jpg)
The QSAR models do not predict the majority of the 64 HTS ‘Hits’ as binders
Z = 0.5; Accuracy = 20/25 = 0.8Z = 3.0; Accuracy = 47/55 = 0.85
No. of models for prediction
0 50 100 150 200 250 300 350Th
e av
erag
e cl
ass
num
ber
0.91.01.11.21.31.41.51.61.71.81.92.02.1
No. of models in prediction
0 20 40 60 80 100 120
The
aver
age
clas
s nu
mbe
r
0.91.01.11.21.31.41.51.61.71.81.92.02.1
predicted as nonbinderpredicted as inhibitor 5
20
8
47
assigned as nonbinder
assigned as inhibitor
![Page 42: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/42.jpg)
Virtual Screening of a PubChem AMPcHTS dataset of 69,653 Compounds
Database mining of 69653 compounds (Z-cutoff = 0.5)
No. of models in prediction
100 120 140 160 180 200 220 240 260 280
The
aver
age
clas
s nu
mbe
r
0.9
1.0
1.1
1.2
1.3
1.4
1.5
1.6
Hit selection criteria:- Within AD of at least 50% of models- 80% of those predict a compound as an inhibitor
This leads to 15 Hits
![Page 43: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/43.jpg)
5 Compounds Selected for Experimental Testing
S
S
OO
O
O
N H
HH
H H
H
H
H
HH
H
Cl
S
F
OO
O O
NH
H H
H H
H
H
H
H
ClSO O
O
O
NH
H H
H
H
H
H
H
H
H
S
F
OO
O
O
NH
H
H
H
H
H
H
H
H
H
Cl
S
O
O
O
O
NH
H
H
H
H
H
H
H
H
H
HH
H
H
![Page 44: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/44.jpg)
One compound (CID 69951) Shows Micro-molar Inhibitory Activity
Kd = 18μM, Ki = 135μM
Log[I](Log(mM)
-6 -5 -4 -3 -2 -1 0 1
%ac
tivity
20
40
60
80
100
120
CID: 699751
Experiments done by Dr. D. Teotico at UCSF
O H
S
O
ONH
O
C l
![Page 45: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/45.jpg)
Descriptor InterpretationRank Descriptor ID Frequency Interpretation
1 nHCsatu 32.2
28.4
27.5
27.2
26.3
26.3
26.0
26.0
26.0
25.4
24.3
24.3
24.0
23.7
23.7
2 Hsulfonamide
CHn(unsatuaed)
:O:(aromatic)
:S: (aromatic)
:CH:
-CH2-
3 nnitrile
4 Hmin
5 naaO
6 naaS
7 SaaCH
8 n3Pad24
9 SssCH2
10 SHBint5
11 Xvch5
12 n2Pag23
13 IDW
14 htets2
15 nimine
Cl
N
ON+
N
NHN
N
SNH
O
HO
SO
-O O O
inhibitor nonbinder
SO
O N
C N
CN
![Page 46: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/46.jpg)
Comparison between QSAR and Docking Hits
S
S
OO
O
O
N H
HH
H H
H
H
H
HH
H
ClSO O
O
O
NH
H H
H
H
H
H
H
H
H
S
F
OO
O
O
NH
H
H
H
H
H
H
H
H
HCl
S
O
O
O
O
NH
H
H
H
H
H
H
H
H
H
HH
H
H
IC50 = 0.7mMKi = 135µM
N
O
O
HO
O
HO
O
OH
Ki = 37μM
N
O
O
HO
O
HO
O
Ki = 55μM
QSAR Hits Docking Hits
2 true hits out of 16 tested
Cl
S
F
OO
O O
NH
H H
H H
H
H
H
H
IC50 = 3.0mM IC50 = 9.0mM
IC50 = 1.8mM IC50 = 7.0mM 5 true hits out of 5 VS hits tested
![Page 47: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/47.jpg)
QSAR Consensus Prediction of VS Hits
# of models for prediction
0 50 100 150 200 250
The
cons
ensu
s sc
ore
1.0
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
2.0
ClSO O
O
O
NH
H H
H
H
H
H
H
H
H
N
O
O
HO
O
HO
O
OH
Ki = 37μM
N
O
O
HO
O
HO
O
Ki = 55μM
assigned as nonbinder
assigned as inhibitor
Ki = 135μM
QSAR HitsDocking Hits
![Page 48: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/48.jpg)
QSAR Modeling* of the TETRATOX aquatic toxicity endpoint
• Schultz, T.W. TETRATOX: Tetrahymenapyriformis population growth impairment endpoint-A surrogate for fish lethality. Toxicol. Methods (1997) 7: 289-309
• A short-term, static protocol using the common freshwater ciliate Tetrahymena pyriformis (strain GL-C) to test aquatic toxicity.
• The 50% impairment growth concentration (IGC50) is the recorded endpoint.
• Website: http://www.vet.utk.edu/TETRATOX/
*Zhu et al, JCIM, 2008, in press
![Page 49: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/49.jpg)
International Virtual Collaboratory* of Computational Chemical Toxicology
• USA: UNC-Chapel Hill (UNC) - H. Zhu and A. Tropsha
• France: University of Louis Pasteur (ULP) – D. FOURCHES and A. VARNEK
• Italy: University of Insubria (UI) – E. PAPA and P. GRAMATICA
• Sweden: University of Kalmar (UK) – T. ÖBERG• Germany: Munich Information Center for Protein
Sequences/Virtual Computational Chemistry Laboratory (VCCLAB)– I. TETKO
• Canada: University of British Columbia (UBC) –A. CHERKASOV
*a new networked organizational form that also includes social processes; collaboration techniques; formal and informal communication; andagreement on norms, principles, values, and rules
![Page 50: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/50.jpg)
Different countries, different groups, different tools – shared basic principles
• Explore and combine various QSAR approaches• Use extensive model validation and applicability
domains• Consider external prediction accuracy as the
ultimate criteria of model quality
∑∑ ><−−−=YY
predabs YYYYR 2expexp
2exp
2 )()(1
2expexp
2exp
2 )()(1 ∑∑ ><−−−=YY
LOOabs YYYYQ
∑ −=Y
pred nYYMAE (3)
(2)
(1)
![Page 51: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/51.jpg)
Overview of the Approaches (15 methodologies total)
Group ID
Modeling Techniques
Descriptor Type
Applicability Domain
UNC kNN, SVM MolConnZ, Dragon
Euclidean distance threshold between a test compound and compounds in the modeling set
ULP MLR, kNN, SVM
Fragments Euclidean distance threshold between a compound and compounds in the modeling set; bounding box
UI OLS Dragon Leverage approach
UK PLS Dragon Residual standard deviation and leverage within the PLSR model
MIPS ASNN E-state Maximal correlation coefficient of the test molecule to the training set molecules in the space of models
UBC MLR, ANN, SVM, PLS
IND_I Descriptor variability
![Page 52: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/52.jpg)
Individual vs. Consensus Models for the Modeling Set
Modeling Set (n=644) Model Group ID q2 SE Coverage
kNN-Dragon UNC 0.93 0.23 100% kNN-MolconnZ UNC 0.92 0.26 99.8% SVM-Dragon UNC 0.93 0.26 100%
SVM-MolconnZ UNC 0.89 0.33 100% kNN-Fragmental ULP 0.77 0.44 100% SVM-Fragmental ULP 0.95 0.23 100%
MLR ULP 0.94 0.25 100% MLR-CODESSA ULP 0.72 0.47 100%
OLS UI 0.86 0.35 92.1% PLS UK 0.88 0.34 97.7%
ASNN MISP 0.92 0.27 83.9% PLS-IND_I UBC 0.76 0.39 100% MLR-IND_I UBC 0.77 0.39 100% ANN-IND_I UBC 0.77 0.39 100% SVM-IND_I UBC 0.79 0.31 100%
Consensus Model
- 0.92 0.22 100%
![Page 53: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/53.jpg)
Which model is best?• Observation: Models that afford most accurate
predictions for the validation sets are not necessarily ranked as top models for the modeling set.
• Back to choices and practices: So how do we choose “the best” models?
Should we choose?• Consensus Prediction
– Only predict compounds within the applicability domain of most models
– For each compound, exclude predictions that have high deviations from the mean value
– Final predicted value is the average all predictions.
![Page 54: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/54.jpg)
Consensus Model gives the lowest MAE of prediction (Validation Set I)
![Page 55: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/55.jpg)
Consensus Model gives the lowest MAE of prediction (Validation Set II)
![Page 56: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/56.jpg)
Data visualization (ROI is filed)Calculations using 74 Dragon descriptorsnormalized between 0 and 1.
The three first components are visualized.
Distance cutoff : 0.7
Training SetTest Set 1Test Set 2
![Page 57: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/57.jpg)
Conclusions of the aquatic toxicity modeling
• Training set modeling is insufficient to guarantee externally predictive models
• The use of AD is critical to achieve respectable external predictivity of individual models BUT one should keep in mind the balance between predictivity and space coverage
• Consensus prediction – affords high predictive power– has lowest MAE– stable against relatively inefficient individual models– avoids the problem of making a choice!!!
![Page 58: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/58.jpg)
Emerging approaches: Combining chemical and biological descriptors in QSAR
modeling of chemical carcinogenicity.
![Page 59: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/59.jpg)
NTP-HTS Content Summary of 1408 Compounds
• Chemical Structure Types:- Organic: 1,348- Inorganic: 27- Organometallic: 19- No structure: 14
• 1348 Organic compounds contain:- Unique: 1,279- Complex: 51- Salt: 20- Duplicates: 53
• Curated subset: 1,289 unique organic compounds
![Page 60: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/60.jpg)
Additional biological data on 1,289 NTP/HTS compounds*
NTP-HTS
NTPBSI NTPGTZ HPVCSI CPDB IRISSI
1,289 1,153 1,053 423 270 181
NTPBSI: National Toxicology Program Chemical Structure Index fileNTPGTZ: National Toxicology Program genotoxicityHPVCSI: High Production Volume ChemicalsCPDB: Carcinogenic Potency Data Base All SpeciesIRISSI: EPA Integrated Risk Information System
*Based on the DSSTox project of Dr. Ann Richard at EPA.
![Page 61: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/61.jpg)
The relationship between HTS activity and rodent carcinogenicity of 270 compounds
HTS actives HTS inconclusives HTS inactives
CPDB actives 30 12 136
CPDB Inactives 7 11 74
Correlation 81% - 35%
![Page 62: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/62.jpg)
Comparison between Predictive Power of QSAR Models using Conventional
vs. Hybrid Descriptors.
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
Sensitivity Specificity Overall Predictivity
Chemical (MolConnZ)Descriptors
Chemical (MolConnZ)Descriptors + Biological (HTS)Descriptors
Zhu et al, EHP, 2008, in press
![Page 63: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/63.jpg)
Relative contributions of HTS descriptors to 34 acceptable models
0
2
4
6
8
10
12
14
16
HTS-BJ HTS-HEK293
HTS-HepG2
HTS-Jurkat
HTS-MRC-5
HTS-SK-N-SH
HTS-Global
![Page 64: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/64.jpg)
Final Thoughts
• Best time ever to be a cheminformatics scholar– Growth of databases– Tool development– Collaborations with computational and experimental scientists
• Extending cheminformatics approaches to new areas– Structure based virtual screening– “-omics” data analysis– Genotype - phenotype correlations
• Focus on Knowledge Discovery (accurate testable predictions!) in Biomolecular Databases
• Practice best practices! Collaborate!!!
Nothing that worth knowing can be taught.Oscar Wilde
![Page 65: The good, the bad, and the ugly…practices of QSAR modelinglori.academicdirect.org/didactic/attends/06_2008_June... · 2013-08-12 · • Predictive QSAR Modeling Workflow (good](https://reader035.vdocuments.site/reader035/viewer/2022070704/5e8a113f4b73e26e1c3416da/html5/thumbnails/65.jpg)
ACKNOWLEDGMENTSUNC ASSOCIATES
Former:-Stephen CAMMER-Sung Jin CHO-Weifan ZHENG- Min SHEN-Bala KRISHNAMOORTHY-Shuxing ZHANG–Peter ITSKOWITZ–Scott OLOFF-Shuquan ZONG-Raed KHASHAN
•Structural bioinformatics group:
– Yetian CHEN– Tanarat KIETSAKORN– Berk ZAFER
• Collaborators– Hal Kohn (UNC)– Richard Mailman (UNC)– Bryan Roth (UNC)– Yuri Peterson (Duke)– Diane Pozefsky (UNC)
• Funding– NIH
• P20-HG003898 (RoadMap)• R21GM076059 (RoadMap)• R01-GM66940• GM068665
– EPA (STAR award)
Cheminformatics group:- Kun WANG - Rima HAJJO– Sasha GOLBRAIKH - Mei WANG– Raed KHASHAN - Lin YE– Chris GRULKE - Liying ZHANG- Hao TANG - Mihir SHAH- Simon WANG - Jui-Hua Hsieh- Hao ZHU - Tong-Ying Wu- M. KARTHIKEYAN
– Jun FENG– Yun-De XIAO–Yuanyuan QIAO–Ruchir SHAH–Patricia LIMA–Assia KOVACHEVA–Julia GRACE–Hao HU
Current