Prediction of bioactivity from chemical structure
DESCRIPTION
Presentation for the Small Molecule Bioactivity Resources At The EBI training course 2010
TRANSCRIPT
![Page 1: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/1.jpg)
Prediction of bioactivity from chemical structure
Small Molecule Bioactivity Resources At The EBI
Jérémy Besnard
![Page 2: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/2.jpg)
Myself
• PhD student at the University of Dundee
  – Supervisor: Prof. Andrew Hopkins
  – Lab: medicinal informatics
• Background
  – Chemistry degree with some biology
  – One industrial year at Pfizer in computational chemistry
![Page 3: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/3.jpg)
Prediction of bioactivity
• Type of predictions
  – How active is a compound?
    • Continuous model
  – Is the compound active, or not?
    • Categorical model
QSAR – Quantitative Structure-Activity Relationship
Some slides are adapted from Richard Lewis (Novartis) presentation at the University of Sheffield Practical introduction to Chemoinformatics course (next in 2011)
![Page 4: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/4.jpg)
Example
[Figure: plot of Activity (axis 3 to 8) against Molecular Weight (axis 150 to 550)]
Molecular Weight 180 220 250 290 340 380 450 500
Activity (pIC50) 4 4.3 4.8 5.4 4.8 5.8 7.5 7.7
Molecular Weight = 360: Activity?
Linear regression:
Activity = 0.01 × Molecular Weight + 1.7 (R2 = 0.900)
Activity = 5.3
Molecular Weight = 360: Active?
Category:
Molecular Weight > 260 = active
Active: Yes
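The two predictions on this slide can be reproduced with a short least-squares fit. What follows is a minimal Python sketch that uses only the data table above; the function and variable names are illustrative and do not come from the course material. One point worth noting: the equation on the slide uses rounded coefficients. The unrounded slope is about 0.0118, so the unrounded fit predicts roughly 5.9 at a molecular weight of 360, against 5.3 from the rounded equation.

```python
MOLECULAR_WEIGHT = [180, 220, 250, 290, 340, 380, 450, 500]
ACTIVITY = [4.0, 4.3, 4.8, 5.4, 4.8, 5.8, 7.5, 7.7]  # pIC50 values from the slide


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c; returns (m, c, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


def is_active(molecular_weight, cutoff=260):
    """Categorical rule from the slide: molecular weight > 260 means active."""
    return molecular_weight > cutoff


slope, intercept, r_squared = fit_line(MOLECULAR_WEIGHT, ACTIVITY)
predicted = slope * 360 + intercept
print(f"Activity = {slope:.4f} * MW + {intercept:.2f} (R2 = {r_squared:.3f})")
print(f"Predicted activity at MW 360: {predicted:.2f}")
print(f"Active at MW 360: {is_active(360)}")
```

The continuous model returns a number, the categorical rule returns only yes or no; both are built from the same eight data points.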
![Page 5: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/5.jpg)
QSAR
Activity = IC50, Ki, Ratios…
Molecular Descriptors
Topological (shape, size)
Physical & Thermodynamics
Chemical feature (substructure)
Activity = f(Molecular Descriptors)
Statistics
> <FCFP_4#S>160131618154665203677720-154910344918721545241070061035991735244-453677277-581879738-1094243697690083042-975279903...
![Page 6: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/6.jpg)
The absolute basics
• Activity + Representation + Method = QSAR
• Activity = experimental data
• Representation = description of the molecule
• Method = statistical tool to use
  – Underlying principle: similar molecules should have similar activities
![Page 7: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/7.jpg)
Advantages of Models
• Fast and cheap method
  – Virtual screening: the computer does the manipulation
    • Human: days to weeks
    • Computer: seconds to hours
• Help understand the science behind the observation
  – Tool to design compounds with a higher chance of being active
![Page 8: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/8.jpg)
Activity
• It can be anything
  – Continuous: IC50, %Inhibition, EC50, ratios, …
  – Categorical: Yes/No, Low/Medium/High
• Better if
  – Data come from the same assay/condition
  – Good quality (you trust the experimental data)
• For ADME endpoints
  – Lots of software solutions: not easy to predict!
    • Few experimental data points (and not very reliable)
    • In vivo phenomena
![Page 9: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/9.jpg)
Molecular descriptors
www.moleculardescriptors.eu
• Many, many, many
• Simple counts
  – Number of atoms, rings, hydrogen bond donors, acceptors, molecular weight…
• Physicochemical
  – Hydrophobicity, polarity: cLogP, Polar Surface Area (PSA)
• Shape
  – Topological indices
  – Big, small, long, round
• 2D fingerprints
  – Presence or absence of certain substructures
    • From a dictionary (MACCS, e.g. count of acids)
    • On the fly: look at the substructures present in the data
• 3D: fingerprints, electrostatics, shape
![Page 10: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/10.jpg)
Fingerprint
• Binary vector: list of 0 and 1
• Dictionary: fixed size with each bit = one group (defined in advance)
• Hashed: fragment the molecules and insert each fragment in a bit position of the vector
[Figure: example molecule mapped to dictionary bits: Acid, Cl, Amide, 6 aromatic ring, …]
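The difference between a dictionary fingerprint and a hashed fingerprint can be sketched in a few lines of Python. This is only an illustration: the four-entry dictionary, the fragment strings and the 16-bit vector length are assumptions chosen for readability, and real toolkits use far larger dictionaries and their own fragmentation and hashing schemes.

```python
import zlib

# Hypothetical dictionary: one fixed bit position per predefined group.
DICTIONARY = ["acid", "Cl", "amide", "aromatic ring"]


def dictionary_fingerprint(groups_present):
    """Return one bit per dictionary entry (1 = group present in the molecule)."""
    present = set(groups_present)
    return [1 if group in present else 0 for group in DICTIONARY]


def hashed_fingerprint(fragments, n_bits=16):
    """Hash each fragment string to a bit position and set that bit."""
    bits = [0] * n_bits
    for fragment in fragments:
        position = zlib.crc32(fragment.encode("utf-8")) % n_bits
        bits[position] = 1  # two different fragments may collide on one bit
    return bits


print(dictionary_fingerprint(["acid", "aromatic ring"]))
print(hashed_fingerprint(["C(=O)O", "c1ccccc1"]))
```

In the dictionary version every bit has a known meaning; in the hashed version no dictionary is needed, at the price of possible bit collisions.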
![Page 11: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/11.jpg)
Extending the Initial Atom Codes
• Fingerprint bits indicate presence and absence of certain structural features
• Fingerprints do not depend on a predefined set of substructural features
• Each iteration adds bits that represent larger and larger structures
[Figure: the atom environment encoded at Iteration 0, Iteration 1 and Iteration 2]
![Page 12: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/12.jpg)
Generating the Fingerprint
• Iteration is repeated the desired number of times
  – Each iteration extends the diameter by two bonds
• Codes from all iterations are collected
• Duplicate bits may be removed
> <FCFP_2#S>160131618154665203677720-154910344918721545241070061035...
> <FCFP_4#S>160131618154665203677720-154910344918721545241070061035991735244-453677277-581879738-1094243697690083042-975279903...
> <FCFP_0#S>16013
...
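The iteration described above can be illustrated with a toy Python sketch. It is a simplified stand-in, not the Pipeline Pilot FCFP algorithm: the initial atom codes are plain atomic numbers (an assumption; FCFP starts from functional classes), bond orders are ignored, and `zlib.crc32` is used only because it gives a reproducible hash.

```python
import zlib


def circular_codes(atom_codes, bonds, iterations):
    """Collect atom-environment codes for iterations 0..iterations.

    atom_codes: initial integer code per atom (iteration 0).
    bonds: list of (i, j) atom index pairs.
    """
    neighbours = {i: [] for i in range(len(atom_codes))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    current = list(atom_codes)
    collected = set(current)  # using a set removes duplicate codes
    for _ in range(iterations):
        updated = []
        for atom, code in enumerate(current):
            # Combine the atom's own code with its sorted neighbour codes.
            env = (code, tuple(sorted(current[n] for n in neighbours[atom])))
            updated.append(zlib.crc32(repr(env).encode("utf-8")))
        current = updated
        collected.update(current)  # codes from all iterations are kept
    return collected


# Toy molecule: heavy atoms of acetic acid, C-C(O)(O), coded by atomic number.
ATOMS = [6, 6, 8, 8]
BONDS = [(0, 1), (1, 2), (1, 3)]

for n in range(3):
    print(n, len(circular_codes(ATOMS, BONDS, n)))
```

Each extra iteration lets every atom code absorb information from one bond further away, which is why the fingerprint at a higher iteration count contains the lower one as a prefix in the listings above.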
![Page 13: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/13.jpg)
Data Sets
![Page 14: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/14.jpg)
Validity of a model
• It is easy to introduce artefacts and “false correlation”
The Trouble with QSAR (or How I Learned To Stop Worrying and Embrace Fallacy), Johnson, J. Chem. Inf. Model., 2008, 48 (1), pp 25–26
![Page 15: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/15.jpg)
Training and Test Sets
• Build the model from the training set
• Predict the test set
• Also called Leave-N-Out validation, where N ranges from 1 compound to 50% of the dataset
• Cross validation: repeat the steps N times using complementary training and test sets
http://www.cs.cmu.edu/~awm/tutorials
http://research.cs.tamu.edu/prism/lectures/iss/iss_l13.pdf
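The idea of complementary training and test sets can be sketched as a k-fold split. The Python below is a minimal illustration using only the standard library; the function name, the fixed random seed and the choice of five folds are assumptions made for the example.

```python
import random


def k_fold_splits(n_items, k, seed=0):
    """Yield (training_indices, test_indices) for k complementary folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # The training set is everything that is not in the current test fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_splits(10, 5):
    print("train:", sorted(train), "test:", sorted(test))
```

Every compound appears in exactly one test fold, so each prediction is made by a model that never saw that compound. Setting k equal to the number of compounds gives leave-one-out validation.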
![Page 16: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/16.jpg)
Space of the sets
• The training set should cover the representation space evenly
![Page 17: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/17.jpg)
Training vs Test Sets
• The test set should not be too dissimilar to the training set
  – Too similar = the quality of the model is overestimated
  – Too dissimilar = difficult prediction
[Figure: three examples of a Test Set positioned relative to the training set]
![Page 18: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/18.jpg)
Questions?
![Page 19: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/19.jpg)
Statistical Methods
Activity Molecular Descriptors
Training and test sets
Activity = f(Molecular Descriptors)
Statistics
> <FCFP_4#S>160131618154665203677720-154910344918721545241070061035991735244-453677277-581879738-1094243697690083042-975279903...
![Page 20: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/20.jpg)
Categorical
• The focus is on a specific criterion:
  – Is the activity < 10 µM? (as in an HTS assay)
• The data is not continuous
  – Soluble/Insoluble
• Try to find a rule (or set of rules) to split the data into classes with the lowest rate of misclassification
  – Different coefficients to measure the quality (ref: Assessing the accuracy of prediction algorithms for classification: an overview. Baldi et al. Bioinformatics 2000, 16:412-424)
![Page 21: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/21.jpg)
Recursive Partitioning
• Using decision trees
• Rules are organized like a tree, each node = one rule
  – Cut-off: Molecular weight < 450
  – Absence/presence of a group: Acid group
• Usually easy to interpret
• Drawback: overfitting, which gives a model too specific to the training data
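A single node of such a tree can be sketched as a search for the cut-off with the fewest misclassifications. The Python below is a minimal illustration, not a full recursive partitioning implementation: it handles one descriptor and one split, and a real tree would repeat the search on each resulting subset. The data reuse the molecular weight table from the earlier Example slide; labelling a compound active when its pIC50 is at least 5 is an assumption made for this example.

```python
def best_cutoff(values, labels):
    """Find the cut-off on one descriptor with the fewest misclassifications.

    values: descriptor values (e.g. molecular weight).
    labels: True for active, False for inactive.
    Returns (cutoff, errors, active_side), active_side being "above" or "below".
    """
    best = None
    for cutoff in sorted(set(values)):
        above = [lab for v, lab in zip(values, labels) if v > cutoff]
        below = [lab for v, lab in zip(values, labels) if v <= cutoff]
        # Errors if compounds above the cut-off are called active...
        errors_above_active = above.count(False) + below.count(True)
        # ...and errors for the reverse rule.
        errors_below_active = above.count(True) + below.count(False)
        for errors, side in ((errors_above_active, "above"),
                             (errors_below_active, "below")):
            if best is None or errors < best[1]:
                best = (cutoff, errors, side)
    return best


MOLECULAR_WEIGHT = [180, 220, 250, 290, 340, 380, 450, 500]
PIC50 = [4.0, 4.3, 4.8, 5.4, 4.8, 5.8, 7.5, 7.7]
IS_ACTIVE = [p >= 5.0 for p in PIC50]  # assumed activity threshold

print(best_cutoff(MOLECULAR_WEIGHT, IS_ACTIVE))
```

On this data the search returns a cut-off at 250 with one misclassified compound, which agrees with the "Molecular weight > 260 = active" rule of the Example slide. Repeating the split until every leaf is pure is what leads to the overfitting mentioned above.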
![Page 22: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/22.jpg)
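[Figure: decision tree built on 45 compounds (21 Actives, 24 Inactives). Split rules: Molecular Weight (>450 / ≤450), Polar surface area (>100 / ≤100), cLogP (>4.2 / ≤4.2), Acid Group (Yes / No). Node counts, given as actives,inactives: 2,10; 19,14; 0,10; 2,0; 19,7; 0,7; 18,2; 1,5. Example molecules: MW 178, PSA 37, LogP 3 and MW 205, PSA 20, LogP 3.]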
![Page 23: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/23.jpg)
Substructural Analysis
• Idea: each fragment of the molecule makes a contribution to the activity, independent of the other fragments in the molecule.
• Fragments get a score for their activity, and the score of a molecule is the sum of the scores of its fragments.
• A simple fragment scoring function:
$$ w_i = \frac{\mathrm{act}_i}{\mathrm{act}_i + \mathrm{inact}_i} $$
act_i = number of active compounds containing fragment i
inact_i = number of inactive compounds containing fragment i
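The fragment scoring function, weight = act / (act + inact), and the summed molecule score can be written out directly. The Python below is a minimal sketch; the four training molecules and their fragment names are invented for illustration only.

```python
from collections import Counter


def fragment_weights(molecules):
    """molecules: list of (set_of_fragments, is_active) pairs.

    Returns {fragment: act / (act + inact)}.
    """
    act, inact = Counter(), Counter()
    for fragments, is_active in molecules:
        (act if is_active else inact).update(fragments)
    return {f: act[f] / (act[f] + inact[f]) for f in set(act) | set(inact)}


def score(fragments, weights):
    """Sum the weights of the fragments; unseen fragments contribute 0."""
    return sum(weights.get(f, 0.0) for f in fragments)


# Hypothetical training set.
TRAINING = [
    ({"acid", "phenyl"}, True),
    ({"acid"}, True),
    ({"phenyl", "Cl"}, False),
    ({"Cl"}, False),
]

WEIGHTS = fragment_weights(TRAINING)
print(WEIGHTS)
print(score({"acid", "Cl"}, WEIGHTS))
```

Here "acid" appears only in actives (weight 1.0), "Cl" only in inactives (weight 0.0) and "phenyl" in one of each (weight 0.5), so a new molecule is scored purely from the fragments it contains.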
![Page 24: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/24.jpg)
Naïve Bayesian Classifiers
• Related to the substructural analysis (slight differences in the weight sum calculation, see Ref)
• Use with fingerprints
  – Each substructure (bit in the fingerprint) gets a weight
  – Fingerprint can be mixed with other properties
    • Properties are binned and each bin obtains a weight
• Molecules are scored: the higher the score, the higher the chance of being in a specific category
• Native implementation in Pipeline Pilot (practical)
Ref: New Methods for Ligand-Based Virtual Screening: Use of Data Fusion and Machine Learning to Enhance the Effectiveness of Similarity Searching, Hert et al., J. Chem. Inf. Model., 2006, 46 (2), pp 462–470
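To show how a Bayesian weight can differ from the simple fragment ratio, here is a sketch of one Laplacian-corrected form described in the literature on fingerprint-based Bayesian classifiers: weight = log((A_f + 1) / (T_f × P + 1)), where A_f is the number of actives containing feature f, T_f the number of all compounds containing it, and P the overall fraction of actives. Whether this matches the Pipeline Pilot implementation in every detail is an assumption, and the training data are invented.

```python
import math


def laplacian_weights(molecules):
    """molecules: list of (set_of_features, is_active) pairs.

    Returns {feature: log((A_f + 1) / (T_f * base_rate + 1))}.
    """
    n_total = len(molecules)
    n_active = sum(1 for _, is_active in molecules if is_active)
    base_rate = n_active / n_total
    feature_total, feature_active = {}, {}
    for features, is_active in molecules:
        for f in features:
            feature_total[f] = feature_total.get(f, 0) + 1
            if is_active:
                feature_active[f] = feature_active.get(f, 0) + 1
    return {
        f: math.log((feature_active.get(f, 0) + 1)
                    / (feature_total[f] * base_rate + 1))
        for f in feature_total
    }


def bayes_score(features, weights):
    """Sum of the weights of the features present; unseen features add 0."""
    return sum(weights.get(f, 0.0) for f in features)


# Hypothetical training set.
TRAINING = [
    ({"acid", "phenyl"}, True),
    ({"acid"}, True),
    ({"phenyl", "Cl"}, False),
    ({"Cl"}, False),
]

NB_WEIGHTS = laplacian_weights(TRAINING)
print(NB_WEIGHTS)
```

Features enriched in actives get a positive weight, features enriched in inactives a negative one, and a feature seen only a few times is pulled towards zero by the correction.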
![Page 25: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/25.jpg)
Validation
• Collection of coefficients
• Most common ones
  – Specificity and sensitivity
  – ROC curve
  – Enrichment plot
![Page 26: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/26.jpg)
Specificity & Sensitivity
• Specificity
$$ \text{Specificity} = \frac{TN}{TN + FP} $$
• Sensitivity
$$ \text{Sensitivity} = \frac{TP}{TP + FN} $$
TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative
Example:
• If all compounds are predicted inactive: Specificity = 1 (very good), Sensitivity = 0 (very bad)
• If all compounds are predicted active: Specificity = 0 (very bad), Sensitivity = 1 (very good)
http://en.wikipedia.org/wiki/Sensitivity_and_specificity
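The two extreme cases in the example can be checked numerically. The Python below is a minimal sketch with invented labels; it assumes the observed data contain at least one active and one inactive compound, otherwise one of the ratios is undefined.

```python
def sensitivity_specificity(observed, predicted):
    """observed, predicted: lists of booleans (True = active).

    Returns (sensitivity, specificity).
    """
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o and p)
    tn = sum(1 for o, p in pairs if not o and not p)
    fp = sum(1 for o, p in pairs if not o and p)
    fn = sum(1 for o, p in pairs if o and not p)
    return tp / (tp + fn), tn / (tn + fp)


OBSERVED = [True, True, False, False, False]

print(sensitivity_specificity(OBSERVED, [False] * 5))  # everything predicted inactive
print(sensitivity_specificity(OBSERVED, [True] * 5))   # everything predicted active
```

Neither number is informative on its own, which is why the two are reported together.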
![Page 27: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/27.jpg)
ROC curve
• Plot sensitivity versus 1 - specificity
• Coefficient = Area Under Curve: 1 is ideal, 0.5 is random
http://www.medcalc.be/manual/roc.php
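The area under the ROC curve can be computed without drawing the curve, because it equals the probability that a randomly chosen active is scored higher than a randomly chosen inactive (ties counting one half). The Python below is a small sketch of that pairwise formulation with invented scores; it is quadratic in the number of compounds, so it is meant for illustration rather than large screens.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from pairwise active/inactive comparisons."""
    actives = [s for s, lab in zip(scores, labels) if lab]
    inactives = [s for s, lab in zip(scores, labels) if not lab]
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0   # active ranked above inactive
            elif a == i:
                wins += 0.5   # tie
    return wins / (len(actives) * len(inactives))


print(roc_auc([0.9, 0.8, 0.4, 0.3], [True, False, True, False]))
```

A model that ranks every active above every inactive gives 1, and a model that gives every compound the same score gives 0.5, matching the "ideal" and "random" values on the slide.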
![Page 28: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/28.jpg)
Enrichment curve
• In some studies the exact rank of the compounds is not that important: the idea is to select X percent of the data
• Use the model to select the top X percent of compounds: try to have most of the active molecules inside
Here, 40% of the actives are in the top 10%. This plot doesn’t tell how many compounds this represents (there could be 40 actives and 10,000 inactives in the top 10%)
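One point on an enrichment curve is simply the fraction of all actives recovered in the top-scoring X percent. The Python below is a minimal sketch with an invented set of 20 compounds, arranged so that it reproduces the "40% of the actives in the top 10%" reading; the function name is illustrative.

```python
def fraction_of_actives_found(scores, labels, top_fraction):
    """Fraction of all actives that fall in the top-scoring part of the list."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_selected = max(1, int(top_fraction * len(ranked)))
    found = sum(1 for _, is_active in ranked[:n_selected] if is_active)
    total = sum(1 for is_active in labels if is_active)
    return found / total


# Hypothetical screen: 20 compounds scored 20 down to 1, 5 of them active.
SCORES = list(range(20, 0, -1))
ACTIVE_POSITIONS = {0, 1, 7, 12, 19}
LABELS = [i in ACTIVE_POSITIONS for i in range(20)]

print(fraction_of_actives_found(SCORES, LABELS, 0.10))
```

As noted above, this fraction says nothing about how many inactives share the selected top slice, so it is worth reporting the size of the selection alongside it.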
![Page 29: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/29.jpg)
Other methods
• There are other statistical methods.
• There is no perfect method: the choice is project dependent (and also a “personal” choice)
• Most common:
  – Forest of trees
  – Support Vector Machine
  – Neural Networks
![Page 30: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/30.jpg)
Questions?
![Page 31: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/31.jpg)
Regression
• Provide a value with more information than yes or no
• Usually smaller set than classification
• Link activity to the structure by an equation (simple to complicated)
![Page 32: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/32.jpg)
Historical
• First equation: Hansch in 1964
• Link activity to the molecule’s electronic characteristics and to its hydrophobicity
$$ \log(1/C) = k_1 \log P + k_2 \sigma + k_3 $$
  – C is the concentration required to produce a response
  – LogP is the octanol/water partition coefficient (possibility to cross membrane)
  – σ is the Hammett substitution parameter (strength of the electron-withdrawing or -donating properties of the aromatic substituent)
• It is a linear equation
• Then improved with a parabolic function
$$ \log(1/C) = -k_1 (\log P)^2 + k_2 \log P + k_3 \sigma + k_4 $$
ρ-σ-π Analysis. A Method for the Correlation of Biological Activity and Chemical Structure, Hansch et al., J. Am. Chem. Soc., 1964, 86 (8), pp 1616–1626
Parabolic dependence of drug action upon lipophilic character as revealed by a study of hypnotics, Hansch et al., J. Med. Chem., 1968, 11 (1), pp 1–11
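One consequence of the parabolic form is an optimum lipophilicity. As a short worked step, assume the usual sign convention for the parabolic Hansch equation, log(1/C) = -k_1 (log P)^2 + k_2 log P + k_3 σ + k_4 with k_1 > 0, and hold σ fixed. Setting the derivative with respect to log P to zero gives:

$$ \frac{\partial \log(1/C)}{\partial \log P} = -2 k_1 \log P + k_2 = 0 \quad\Rightarrow\quad \log P_{\mathrm{opt}} = \frac{k_2}{2 k_1} $$

Below this value activity increases with lipophilicity, above it activity falls again, which the linear equation cannot describe.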
![Page 33: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/33.jpg)
Deriving a QSAR equation
• Most common method is linear regression
$$ y = mx + c $$
• In QSAR x is usually a descriptor (e.g. logP)
• Aim: minimise the sum of the squared differences between the predicted and the real values
• With more than one descriptor:
$$ y = \sum_{i=1}^{n} m_i x_i + c $$
![Page 34: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/34.jpg)
Quality
• Most common way is to use the square of the correlation coefficient, R2
• Need to review the data:
[Figure: plots with almost the same R2]
![Page 35: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/35.jpg)
Cross validation
• Involves removing some of the values from the dataset, building a QSAR model from the remaining data, and applying this model to the removed data.
• The R2 of cross validation is written Q2; it represents the goodness of prediction (R2 represents the goodness of fit).
• Q2 should be lower than R2, but not by too much (otherwise the model is over-fitted).
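The relation between R2 and Q2 can be made concrete with a leave-one-out calculation. The Python below is a minimal sketch that reuses the molecular weight and pIC50 table from the earlier Example slide; computing Q2 as 1 - PRESS / SS_total with the mean of the full data set is one common convention and is an assumption here, as are the function names.

```python
MOLECULAR_WEIGHT = [180, 220, 250, 290, 340, 380, 450, 500]
PIC50 = [4.0, 4.3, 4.8, 5.4, 4.8, 5.8, 7.5, 7.7]


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = m*x + c."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def total_sum_of_squares(ys):
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def r_squared(xs, ys):
    """Goodness of fit: the model is built on all compounds."""
    m, c = fit_line(xs, ys)
    ss_res = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_res / total_sum_of_squares(ys)


def q_squared_loo(xs, ys):
    """Goodness of prediction: each compound is predicted by a model built without it."""
    press = 0.0
    for i in range(len(xs)):
        m, c = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (m * xs[i] + c)) ** 2
    return 1 - press / total_sum_of_squares(ys)


R2 = r_squared(MOLECULAR_WEIGHT, PIC50)
Q2 = q_squared_loo(MOLECULAR_WEIGHT, PIC50)
print(f"R2 = {R2:.3f}, Q2 = {Q2:.3f}")
```

For this small data set Q2 comes out somewhat below R2, which is the pattern the slide describes for a model that is not badly over-fitted.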
![Page 36: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/36.jpg)
Designing QSAR experiment
• Find the smallest number of variables that explains as much of the data as possible
  – It is easy to calculate thousands of parameters with a computer in seconds.
• Rule of thumb
  – >5 compounds for each descriptor
  – Check the descriptors: remove the invariant ones
  – Remove correlated factors (by deleting a descriptor, or using a data reduction technique such as PCA)
• Selection
  – Algorithms to select the most significant descriptors
    • Forward-stepping regression: start from 1 and add
    • Backward-stepping regression: start with all and remove
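The two clean-up rules of thumb (drop invariant descriptors, drop one of each highly correlated pair) can be sketched as a simple filter. The Python below is an illustration only: the descriptor table is invented, and the correlation threshold of 0.9 is an assumed value, not one given in the course.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def filter_descriptors(table, max_abs_r=0.9):
    """table: {descriptor name: list of values}. Returns the names that are kept."""
    kept = []
    for name, values in table.items():
        if len(set(values)) == 1:
            continue  # invariant descriptor: carries no information
        if any(abs(pearson(values, table[k])) > max_abs_r for k in kept):
            continue  # too correlated with a descriptor already kept
        kept.append(name)
    return kept


# Hypothetical descriptor table for four compounds.
DESCRIPTORS = {
    "MW": [180, 220, 250, 290],
    "heavy_atoms": [13, 16, 18, 21],   # almost proportional to MW
    "n_rings": [1, 1, 1, 1],           # invariant
    "cLogP": [3.0, 1.2, 2.5, 0.8],
}

print(filter_descriptors(DESCRIPTORS))
```

The filter keeps the first descriptor of each correlated group, so the order of the table matters; stepwise selection or PCA, as listed above, are the less arbitrary alternatives.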
![Page 37: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/37.jpg)
Regression algorithms
• Multiple linear regression (see practical)
  – Easy to interpret
  – Problem of correlations between factors
• Partial Least Squares (PLS)
  – Similar to PCA: reduces the number of factors (x_i) to new orthogonal “latent variables” (t_i)
  – Compared to PCA, adds a correlation between the observed data and the latent variables (y ~ a_1 t_1)
$$ y = a_1 t_1 + a_2 t_2 + \dots + a_n t_n $$
$$ t_i = b_{i1} x_1 + b_{i2} x_2 + \dots + b_{ip} x_p $$
![Page 38: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/38.jpg)
Not limited
• There are many regression algorithms, which differ in
  – Implementation
  – Selection of factors
  – How a good model is judged
• Other methods
  – Gaussian Processes ( http://dx.doi.org/10.1021/ci7000633 )
  – Molecular Field Analysis and Partial Least Squares: CoMFA and derivatives, using 3D steric and electrostatic information ( http://www.wiley.com/legacy/wileychi/ecc/samples/sample05.pdf and http://www.netsci.org/Science/Compchem/feature11.html )
![Page 39: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/39.jpg)
Regression + Category
• Poor regression but good classification
[Figure: Observed versus Predicted plot divided into Good and Bad regions, with the False Positives and False Negative regions marked]
![Page 40: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/40.jpg)
After
• Once the models are built and you have an idea of their mathematical quality:
  – Look at the observed vs predicted plot
  – Try to understand the model
• Do the descriptors make sense?
  – LogP is important when modelling solubility
  – Why is a certain substructure so important?
![Page 41: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/41.jpg)
Outliers
• What to do with outliers?
• Prediction far from observed:
  – Are the compounds similar to the training set?
  – Outside your space of confidence
• Chemical similarity = activity similarity is not always true
  – There are activity cliffs (see refs)
  – Interesting for SAR study
On Outliers and Activity Cliffs−Why QSAR Often Disappoints, Maggiora, J. Chem. Inf. Model., 2006, 46, 1535−1535
Structure−Activity Relationship Anatomy by Network-like Similarity Graphs and Local Structure−Activity Relationship Indices, Wawer et al., J. Med. Chem., 2008, 51 (19), pp 6075–6084
![Page 42: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/42.jpg)
A model is a model
• It is not the reality
• Provides help for experimentation
  – Understand what happens
  – Reduce the number of experiments
  – Does not replace lab work
• There is no one perfect model
  – Depends on the method, data sets, descriptors, tuning parameters…
![Page 43: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/43.jpg)
Real correlation?
• Does a decrease in marriages decrease the risk of death?
Should we ban Church of England Weddings?
Why do we Sometimes get Nonsense-Correlations between Time-Series?--A Study in Sampling and the Nature of Time-Series, Yule, Journal of the Royal Statistical Society, Vol. 89, No. 1. (Jan., 1926), pp. 1-63
![Page 44: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/44.jpg)
Further – Multiple Targets
• Large scale model:
  – Prediction of multiple interactions at once
  – Need large database
    • Wombat (literature), MDDR (patent)
    • ChEMBL
• Identify side effects, or unknown beneficial effects
![Page 45: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/45.jpg)
Principle
• SEA approach:
  – Similarity of a compound to active ligands (similar to BLAST); website: http://sea.bkslab.org/
• Multiple category Bayesian model:
  – Each fingerprint bit gets a different weight for each target: the sum differs by target
• Output:
  – List of proteins ranked by probability of binding
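The multiple-category idea (one set of bit weights per target, one summed score per target, then a ranking) can be sketched as follows. This Python example is purely illustrative: the target names, bit identifiers and weights are invented, and the summed scores are relative scores for ranking rather than calibrated probabilities.

```python
def rank_targets(fingerprint_bits, weights_by_target):
    """Score a compound against every target and rank targets by score.

    fingerprint_bits: set of bit identifiers present in the compound.
    weights_by_target: {target: {bit: weight}}; bits without a weight add 0.
    """
    scores = {
        target: sum(weights.get(bit, 0.0) for bit in fingerprint_bits)
        for target, weights in weights_by_target.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-target weights for three fingerprint bits.
WEIGHTS_BY_TARGET = {
    "target_A": {"bit_1": 1.2, "bit_2": -0.4, "bit_3": 0.1},
    "target_B": {"bit_1": -0.8, "bit_2": 0.9, "bit_3": 0.3},
    "target_C": {"bit_1": 0.2, "bit_2": 0.1, "bit_3": -0.5},
}

for target, target_score in rank_targets({"bit_1", "bit_3"}, WEIGHTS_BY_TARGET):
    print(target, round(target_score, 2))
```

The same compound gets a different sum for each target because the same bit carries a different weight in each target's model.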
![Page 46: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/46.jpg)
References
• An Introduction to Chemoinformatics, A. Leach and V. Gillet
• Sheffield course (next one in 2011): http://www.shef.ac.uk/is/research/groups/chem/courses.html ; Conference: http://cisrg.shef.ac.uk/shef2010/
• Pipeline Pilot documentation, and Cheminformatics analysis and learning in a data pipelining environment, Hassan et al., Molecular Diversity (2006) 10: 283–299
• Multiple targets:
  – Predicting new molecular targets for known drugs, Keiser et al., Nature 462, 175-181 (12 November 2009), and Relating protein pharmacology by ligand chemistry, Keiser et al., Nat Biotech 25 (2), 197-206 (2007)
  – Prediction of Biological Targets for Compounds Using Multiple-Category Bayesian Models Trained on Chemogenomics Databases, Nidhi et al., J. Chem. Inf. Model., 2006, 46 (3), pp 1124–1133
  – Global mapping of pharmacological space, Paolini et al., Nat Biotech 24 (7), 805-815 (2006)
![Page 47: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/47.jpg)
Questions
![Page 48: Prediction Of Bioactivity From Chemical Structure](https://reader035.vdocuments.site/reader035/viewer/2022081414/54c637b54a7959c9388b4643/html5/thumbnails/48.jpg)
Practicals
Using Pipeline Pilot
Regression and Classification