Prediction of bioactivity from chemical structure

Small Molecule Bioactivity Resources At The EBI

Jérémy Besnard

Myself

• PhD student at the University of Dundee
  – Supervisor: Prof. Andrew Hopkins
  – Lab: medicinal informatics

• Background
  – Chemistry degree with some biology
  – One industrial year at Pfizer in computational chemistry

Prediction of bioactivity

• Type of predictions
  – How active is a compound?
    • Continuous model
  – Is the compound active, or not?
    • Categorical model

QSAR – Quantitative Structure-Activity Relationship

Some slides are adapted from Richard Lewis's (Novartis) presentation at the University of Sheffield "Practical Introduction to Chemoinformatics" course (next in 2011)

http://www.shef.ac.uk/is/research/groups/chem/courses.html

Example

[Plot: Activity (pIC50) against Molecular Weight]

Molecular Weight 180 220 250 290 340 380 450 500

Activity (pIC50) 4 4.3 4.8 5.4 4.8 5.8 7.5 7.7

New compound: Molecular Weight = 360. Activity?

Linear regression: Activity = 0.01 × Molecular Weight + 1.7 (R² = 0.900), giving Activity = 5.3

Active? Category: Molecular Weight > 260 = active, giving Active: Yes
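As a minimal sketch, the regression in the example can be reproduced with ordinary least squares in plain Python. The function name `fit_line` is illustrative and does not come from the slides. With unrounded coefficients (slope about 0.0118, intercept about 1.70) the prediction for a molecular weight of 360 comes out near 5.9; the 5.3 quoted on the slide follows from rounding the slope to 0.01.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c; returns (m, c, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return m, c, r_squared


# Data from the example table
mol_weight = [180, 220, 250, 290, 340, 380, 450, 500]
activity = [4, 4.3, 4.8, 5.4, 4.8, 5.8, 7.5, 7.7]  # pIC50

m, c, r2 = fit_line(mol_weight, activity)
predicted = m * 360 + c   # continuous model
is_active = 360 > 260     # categorical rule quoted on the slide
print(f"slope={m:.4f} intercept={c:.2f} R2={r2:.3f} "
      f"predicted={predicted:.2f} active={is_active}")
```

The same data therefore support both kinds of model: a value (continuous) or a yes/no answer (categorical).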

QSAR

Activity = IC50, Ki, Ratios…

Molecular Descriptors

Topological (shape, size)

Physical & Thermodynamics

Chemical feature (substructure)

Activity = f(Molecular Descriptors)

Statistics

[Figure: example molecule and its FCFP_4 fingerprint]

The absolute basics

• Activity + Representation + Method = QSAR

• Activity = experimental data

• Representation = description of the molecule

• Method = statistical tool to use
  – Underlying principle: similar molecules should have similar activities

Advantages of Models

• Fast and cheap method
  – Virtual screening: the computer does the manipulation
    • Human: days to weeks
    • Computer: seconds to hours

• Helps understand the science behind the observation
  – Tool to design compounds with a higher chance of being active

Activity

• It can be anything
  – Continuous: IC50, % inhibition, EC50, ratios, …
  – Categorical: Yes/No, Low/Medium/High

• Better if
  – Data come from the same assay/conditions
  – Good quality (you trust the experimental data)

• For ADME endpoints
  – Lots of software solutions: not easy to predict!
    • Few experimental data points (and not very reliable)
    • In vivo phenomena

Molecular descriptors

www.moleculardescriptors.eu

• Many, many, many
• Simple counts
  – Number of atoms, rings, hydrogen bond donors, acceptors, molecular weight…
• Physicochemical
  – Hydrophobicity, polarity: cLogP, Polar Surface Area (PSA)
• Shape
  – Topological indices
  – Big, small, long, round
• 2D fingerprints
  – Presence or absence of certain substructures
    • From a dictionary (MACCS, e.g. count of acids)
    • On the fly: look at the substructures present in the data
• 3D: fingerprints, electrostatics, shape

Fingerprint

• Binary vector: list of 0s and 1s
• Dictionary: fixed size, with each bit = one group (defined in advance)
• Hashed: fragment the molecule and insert each fragment at a bit position of the vector

[Figure: example dictionary fingerprint with bits labelled Acid, Cl, Amide, 6 aromatic ring, …]
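The two fingerprint styles can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the group names, the fragment strings, the vector length of 32 and the use of `zlib.crc32` as the hash are all choices made for this example, not details from the slides or from any particular toolkit.

```python
import zlib


def dictionary_fingerprint(groups_in_molecule, dictionary):
    """One bit per predefined group, in the fixed order of `dictionary`."""
    present = set(groups_in_molecule)
    return [1 if group in present else 0 for group in dictionary]


def hashed_fingerprint(fragments, n_bits=32):
    """Hash each fragment string to a bit position and set that bit."""
    bits = [0] * n_bits
    for fragment in fragments:
        position = zlib.crc32(fragment.encode("utf-8")) % n_bits
        bits[position] = 1  # different fragments may collide on the same bit
    return bits


# Hypothetical dictionary and fragments, for illustration only
dictionary = ["acid", "Cl", "amide", "aromatic ring"]
print(dictionary_fingerprint(["acid", "aromatic ring"], dictionary))
print(hashed_fingerprint(["C(=O)O", "c1ccccc1", "CC"]))
```

A dictionary fingerprint is directly interpretable (bit 0 always means "acid"), whereas a hashed fingerprint needs no predefined list but can map two different fragments to the same bit.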

Extending the Initial Atom Codes

• Fingerprint bits indicate presence and absence of certain structural features

• Fingerprints do not depend on a predefined set of substructural features

Each iteration adds bits that represent larger and larger structures

[Figure: substructures captured at Iteration 0, Iteration 1 and Iteration 2]

Generating the Fingerprint

• Iteration is repeated the desired number of times
  – Each iteration extends the diameter by two bonds
• Codes from all iterations are collected
• Duplicate bits may be removed

> <FCFP_2#S>160131618154665203677720-154910344918721545241070061035...

> <FCFP_4#S>160131618154665203677720-154910344918721545241070061035991735244-453677277-581879738-1094243697690083042-975279903...

> <FCFP_0#S>16013

...
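The iteration scheme can be sketched on a toy molecular graph. This is a deliberately simplified illustration, not the actual FCFP or ECFP algorithm: real implementations also account for details such as bond types, and the initial atom codes, the `zlib.crc32` hash and the function name `circular_codes` are assumptions made for this example.

```python
import zlib


def circular_codes(atom_codes, bonds, iterations=2):
    """Collect atom-environment codes over a number of iterations.

    atom_codes: initial integer code per atom (iteration 0)
    bonds: list of (atom_index, atom_index) pairs
    """
    neighbours = {i: [] for i in range(len(atom_codes))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    current = list(atom_codes)
    collected = set(current)  # iteration 0 codes
    for iteration in range(1, iterations + 1):
        updated = []
        for atom, code in enumerate(current):
            # Combine the atom's own code with its neighbours' codes
            environment = (iteration, code,
                           tuple(sorted(current[n] for n in neighbours[atom])))
            updated.append(zlib.crc32(repr(environment).encode("utf-8")))
        current = updated
        collected.update(current)  # a set removes duplicate codes
    return collected


# Toy three-atom chain with hypothetical initial codes
print(sorted(circular_codes([1, 2, 1], [(0, 1), (1, 2)], iterations=2)))
```

Each pass folds in information from one bond further out, so the codes from iteration k describe environments with a diameter of 2k bonds.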

Data Sets

Validity of a model

• It is easy to introduce artefacts and “false correlation”

The Trouble with QSAR (or How I Learned To Stop Worrying and Embrace Fallacy), Johnson, J. Chem. Inf. Model., 2008, 48 (1), pp 25–26

http://dx.doi.org/10.1021/ci700332k

Training and Test Sets

• Build the model from the training set

• Predict the test set

• Also called Leave-N-Out validation, where N ranges from 1 compound to 50% of the dataset

• Cross validation: repeat the steps N times using complementary training and test sets

http://www.cs.cmu.edu/~awm/tutorials

http://research.cs.tamu.edu/prism/lectures/iss/iss_l13.pdf
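A minimal sketch of how complementary training and test sets could be generated for cross validation, in plain Python. The function name `k_fold_splits`, the fixed random seed and the choice of 5 folds are assumptions made for this illustration.

```python
import random


def k_fold_splits(n_items, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_splits(10, 5):
    print("train:", sorted(train), "test:", sorted(test))
```

Every compound appears in exactly one test set, so each prediction is made by a model that never saw that compound. Setting `k` equal to the number of compounds gives leave-one-out validation.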

Space of the sets

• The training set should cover the representation space evenly

Training vs Test Sets

• The test set should be neither too similar nor too dissimilar to the training set
  – Too similar = the quality of the model is overestimated
  – Too dissimilar = difficult prediction

[Figure: panels labelled "Test Set"]

Questions?

Statistical Methods

Activity Molecular Descriptors

Training and test sets

Activity = f(Molecular Descriptors)

Statistics

[Figure: example molecule and its FCFP_4 fingerprint]

Categorical

• The focus is on a specific criterion:
  – Is the activity < 10 µM? (like in an HTS assay)

• The data are not continuous
  – Soluble/Insoluble

• Try to find a rule (or set of rules) to split the data into classes with the lowest rate of misclassification
  – Different coefficients measure the quality (ref: Assessing the accuracy of prediction algorithms for classification: an overview, Baldi et al., Bioinformatics 2000, 16:412-424)

http://bioinformatics.oxfordjournals.org/cgi/reprint/16/5/412

Recursive Partitioning

• Using decision trees

• Rules are organized like a tree, each node = one rule
  – Cut-off: Molecular weight < 450
  – Absence/presence of a group: Acid group

• Usually easy to interpret

• Drawback: overfitting, giving a model too specific to the training data

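[Figure: decision tree. Counts are given as actives, inactives.
Root: 21 Actives, 24 Inactives
Molecular Weight (cut-off 450): nodes 2,10 and 19,14
Polar surface area, splitting the 2,10 node: >100 gives 0,10; ≤100 gives 2,0
cLogP (cut-off 4.2), splitting the 19,14 node: nodes 0,7 and 19,7
Acid Group (Yes/No), splitting the 19,7 node: nodes 18,2 and 1,5
Example molecules: MW 178, PSA 37, LogP 3 and MW 205, PSA 20, LogP 3]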
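One node of such a tree can be sketched as a search for the cut-off with the fewest misclassifications. This is a simplified illustration, not a full recursive partitioning implementation; the function name `best_cutoff` and the activity threshold of pIC50 ≥ 5 (i.e. 10 µM) used to label compounds are assumptions made for this example.

```python
def best_cutoff(values, labels):
    """Find the cut-off for the rule 'value > cutoff means active'.

    Returns (cutoff, n_errors), trying the midpoints between
    consecutive sorted values and keeping the first best one.
    """
    pairs = sorted(zip(values, labels))
    xs = [value for value, _ in pairs]
    best = (None, len(pairs) + 1)
    for low, high in zip(xs, xs[1:]):
        if low == high:
            continue
        cutoff = (low + high) / 2
        errors = sum((value > cutoff) != active for value, active in pairs)
        if errors < best[1]:
            best = (cutoff, errors)
    return best


mol_weight = [180, 220, 250, 290, 340, 380, 450, 500]
activity = [4, 4.3, 4.8, 5.4, 4.8, 5.8, 7.5, 7.7]   # pIC50
labels = [a >= 5 for a in activity]                  # assumed activity threshold

print(best_cutoff(mol_weight, labels))  # (270.0, 1)
```

A recursive partitioning method repeats this search over all descriptors, splits the data on the best rule, and then applies the same procedure to each resulting subset until a stopping criterion is met.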

Substructural Analysis

• Idea: each fragment of the molecule makes a contribution to the activity, independent of the other fragments in the molecule.

• Fragments get a score for their activity, and the score of a molecule is the sum of the scores of its fragments.

• A simple fragment scoring function:

$$w_i = \frac{act_i}{act_i + inact_i}$$

act_i = number of active compounds containing fragment i

inact_i = number of inactive compounds containing fragment i
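The fragment weight w_i = act_i / (act_i + inact_i) and the additive molecule score can be sketched as follows. The fragment names and the tiny training set are hypothetical, chosen only to make the arithmetic easy to check.

```python
from collections import Counter


def fragment_weights(molecules):
    """molecules: list of (fragments, is_active) pairs.

    Returns {fragment: act / (act + inact)}.
    """
    act, inact = Counter(), Counter()
    for fragments, is_active in molecules:
        (act if is_active else inact).update(set(fragments))
    return {f: act[f] / (act[f] + inact[f]) for f in set(act) | set(inact)}


def score(fragments, weights):
    """Sum of the weights of the fragments present (unknown fragments add 0)."""
    return sum(weights.get(f, 0.0) for f in set(fragments))


# Hypothetical training set
training = [
    ({"acid", "ring"}, True),
    ({"acid"}, True),
    ({"ring", "Cl"}, False),
    ({"Cl"}, False),
]
weights = fragment_weights(training)
print(weights)                           # acid: 1.0, ring: 0.5, Cl: 0.0
print(score({"acid", "ring"}, weights))  # 1.5
```

Here "acid" appears only in actives (weight 1.0), "Cl" only in inactives (weight 0.0), and "ring" in one of each (weight 0.5).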

Naïve Bayesian Classifiers

• Related to the substructural analysis (slight differences in the calculation of the weight sum, see ref)

• Use with fingerprints
  – Each substructure (bit in the fingerprint) gets a weight
  – Fingerprints can be mixed with other properties
    • Properties are binned and each bin obtains a weight

• Molecules are scored: the higher the score, the higher the chance of being in a specific category

• Native implementation in Pipeline Pilot (practical)

Ref: New Methods for Ligand-Based Virtual Screening: Use of Data Fusion and Machine Learning to Enhance the Effectiveness of Similarity Searching, Hert et al., J. Chem. Inf. Model., 2006, 46 (2), pp 462–470

http://dx.doi.org/10.1021/ci050348j
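A minimal sketch of a naïve Bayesian weighting over fingerprint bits. It uses a generic Laplace-smoothed log-odds weight per bit and scores only the bits that are present; this is an assumption made for illustration and is not necessarily the formula used in Pipeline Pilot or in Hert et al. The bit names and training set are hypothetical.

```python
import math
from collections import Counter


def train_bit_weights(molecules, alpha=1.0):
    """molecules: list of (bits_present, is_active) pairs.

    Returns {bit: log(P(bit | active) / P(bit | inactive))},
    with Laplace smoothing `alpha` to avoid division by zero.
    """
    n_act = sum(1 for _, is_active in molecules if is_active)
    n_inact = len(molecules) - n_act
    act, inact = Counter(), Counter()
    for bits, is_active in molecules:
        (act if is_active else inact).update(set(bits))
    weights = {}
    for bit in set(act) | set(inact):
        p_act = (act[bit] + alpha) / (n_act + 2 * alpha)
        p_inact = (inact[bit] + alpha) / (n_inact + 2 * alpha)
        weights[bit] = math.log(p_act / p_inact)
    return weights


def bayes_score(bits, weights):
    """Higher score = more likely to belong to the active category."""
    return sum(weights.get(bit, 0.0) for bit in set(bits))


training = [
    ({"acid", "ring"}, True),
    ({"acid"}, True),
    ({"ring", "Cl"}, False),
    ({"Cl"}, False),
]
bit_weights = train_bit_weights(training)
print(bayes_score({"acid", "ring"}, bit_weights) > bayes_score({"Cl"}, bit_weights))
```

Bits seen mostly in actives get positive weights, bits seen mostly in inactives get negative weights, and uninformative bits stay near zero.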

Validation

• Collection of coefficients

• Most common ones
  – Specificity and sensitivity
  – ROC curve
  – Enrichment plot

Specificity & Sensitivity

TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative

• Specificity

$$\text{Specificity} = \frac{TN}{TN + FP}$$

• Sensitivity

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

Example:
– If all compounds are predicted inactive: Specificity = 1 (very good), Sensitivity = 0 (very bad)
– If all compounds are predicted active: Specificity = 0 (very bad), Sensitivity = 1 (very good)

http://en.wikipedia.org/wiki/Sensitivity_and_specificity
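The two coefficients, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), can be computed directly from observed and predicted classes. The sketch below assumes at least one active and one inactive compound in the observed data (otherwise a denominator is zero); the function name and example data are illustrative.

```python
def sensitivity_specificity(observed, predicted):
    """observed, predicted: lists of booleans (True = active)."""
    tp = sum(1 for o, p in zip(observed, predicted) if o and p)
    tn = sum(1 for o, p in zip(observed, predicted) if not o and not p)
    fp = sum(1 for o, p in zip(observed, predicted) if not o and p)
    fn = sum(1 for o, p in zip(observed, predicted) if o and not p)
    return tp / (tp + fn), tn / (tn + fp)


observed = [True, True, False, False, False]

# Everything predicted inactive: sensitivity 0, specificity 1
print(sensitivity_specificity(observed, [False] * 5))
# Everything predicted active: sensitivity 1, specificity 0
print(sensitivity_specificity(observed, [True] * 5))
```

The two trivial predictors show why neither coefficient is meaningful on its own.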



ROC curve

• Plot sensitivity versus 1 − specificity

• Coefficient = Area Under the Curve (AUC): 1 is ideal, 0.5 is random

http://www.medcalc.be/manual/roc.php
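As a sketch, the area under the ROC curve can be computed without drawing the curve, using the equivalent pairwise formulation: the fraction of (active, inactive) pairs in which the active compound receives the higher score, with ties counting as one half. The function name and example scores are assumptions made for this illustration.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from model scores and True/False labels."""
    actives = [s for s, is_active in zip(scores, labels) if is_active]
    inactives = [s for s, is_active in zip(scores, labels) if not is_active]
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


print(roc_auc([0.9, 0.8, 0.2, 0.1], [True, True, False, False]))  # 1.0, ideal
print(roc_auc([0.5, 0.5, 0.5, 0.5], [True, False, True, False]))  # 0.5, random
print(roc_auc([0.9, 0.8, 0.7, 0.6], [True, False, True, False]))  # 0.75
```

This quadratic loop is fine for small sets; for large screening sets a rank-based calculation would be more efficient.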



Enrichment curve

• In some studies the exact rank of the compounds is not that important: the idea is to select X percent of the data
• Use the model to select the top X percent of compounds and try to have most of the active molecules among them

In the example plot, 40% of the actives are in the top 10%. This plot doesn’t tell how many compounds this represents (could be 40 actives and 10,000 inactives in the top 10%).
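One point of an enrichment curve can be sketched as follows: rank by model score, take the top fraction, and report both the fraction of actives recovered and how many compounds were selected (the number the plot alone does not show). The function name and the toy data set of 20 compounds are assumptions made for this example.

```python
def actives_in_top(scores, labels, top_fraction):
    """Return (fraction of all actives found, number of compounds selected)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_selected = max(1, round(top_fraction * len(ranked)))
    found = sum(1 for _, is_active in ranked[:n_selected] if is_active)
    total = sum(1 for _, is_active in ranked if is_active)
    return found / total, n_selected


# Toy data: 20 compounds already in score order, 5 of them active
scores = list(range(20, 0, -1))
labels = [i in {0, 1, 4, 9, 14} for i in range(20)]

fraction, n_selected = actives_in_top(scores, labels, 0.10)
print(fraction, n_selected)  # 0.4 2
```

Repeating the calculation for increasing values of `top_fraction` gives the full curve.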

Other methods

• There are other statistical methods.

• There is no perfect method and it is project dependent (also “personal” choice)

• Most common:
  – Forest of trees
  – Support Vector Machine
  – Neural Networks

Questions?

Regression

• Provide a value with more information than yes or no

• Usually smaller set than classification

• Link activity to the structure by an equation (simple to complicated)

Historical

• First equation: Hansch in 1964
• Links activity to the molecule’s electronic characteristics and to its hydrophobicity

$$\log(1/C) = k_1 \log P + k_2 \sigma + k_3$$

C is the concentration required to produce a response

log P is the octanol/water partition coefficient (possibility to cross membranes)

σ is the Hammett substitution parameter (strength of the electron-withdrawing or -donating properties of the aromatic substituent)

• It is a linear equation
• Then improved with a parabolic function

$$\log(1/C) = -k_1 (\log P)^2 + k_2 \log P + k_3 \sigma + k_4$$

ρ-σ-π Analysis. A Method for the Correlation of Biological Activity and Chemical Structure, Hansch et al., J. Am. Chem. Soc., 1964, 86 (8), pp 1616–1626

http://dx.doi.org/10.1021/ja01062a035

Parabolic dependence of drug action upon lipophilic character as revealed by a study of hypnotics, Hansch et al., J. Med. Chem., 1968, 11 (1), pp 1–11

http://dx.doi.org/10.1021/jm00307a001

Deriving a QSAR equation

• Most common method is linear regression

$$y = mx + c$$

• In QSAR, x is usually a descriptor (e.g. logP)

• Aim: minimise the sum of the squared differences between the predicted and the real values

• With more than one descriptor:

$$y = \sum_{i=1}^{n} m_i x_i + c$$

Quality

• Most common way is to use the square of the correlation coefficient, R²

• Need to review the data:

[Figure: different datasets with almost the same R²]

Cross validation

• Involves removing some of the values from the dataset, building a QSAR model from the remaining data, and applying this model to the removed data.

• The R² of cross validation is written Q²; it represents the goodness of prediction (R² represents the goodness of fit).

• Q² should be lower than R², but not by too much (otherwise the model is over-fitted).
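A sketch of R² and a leave-one-out Q² for a single-descriptor linear model, in plain Python. Q² is computed here as 1 − PRESS / SS_tot, with SS_tot taken around the mean of the full data set; other definitions exist, so treat this as one common convention rather than the definition used in the talk. The data are the molecular weight / pIC50 values from the earlier example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + c; returns (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def r2_and_q2(xs, ys):
    """Return (R2 of the fit, leave-one-out Q2 of the predictions)."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)

    m, c = fit_line(xs, ys)
    rss = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

    press = 0.0
    for i in range(len(xs)):
        # Refit without compound i, then predict compound i
        m_i, c_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (m_i * xs[i] + c_i)) ** 2

    return 1 - rss / ss_tot, 1 - press / ss_tot


mol_weight = [180, 220, 250, 290, 340, 380, 450, 500]
activity = [4, 4.3, 4.8, 5.4, 4.8, 5.8, 7.5, 7.7]

r2, q2 = r2_and_q2(mol_weight, activity)
print(f"R2={r2:.3f} Q2={q2:.3f}")
```

With this definition Q² cannot exceed R² for a least-squares fit; a large gap between the two is the warning sign of over-fitting.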

Designing QSAR experiment

• Find the smallest number of variables to explain as much data as possible
  – It is easy to calculate thousands of parameters with a computer in seconds.

• Rule of thumb
  – >5 compounds for each descriptor
  – Check the descriptors: remove the invariant ones
  – Remove correlated factors (by deleting a descriptor, or using a data reduction technique such as PCA)

• Selection
  – Algorithms to select the most significant descriptors
    • Forward-stepping regression: start from 1 and add
    • Backward-stepping regression: start with all and remove
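The two clean-up rules of thumb (drop invariant descriptors, drop correlated ones) can be sketched as a simple greedy filter. The correlation threshold of 0.9, the descriptor names and the values are assumptions made for this illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def filter_descriptors(table, max_corr=0.9):
    """table: {descriptor name: list of values, one per compound}.

    Drops invariant descriptors, then drops any descriptor whose absolute
    correlation with an already kept descriptor exceeds max_corr.
    """
    kept = []
    for name, values in table.items():
        if len(set(values)) <= 1:
            continue  # invariant: carries no information
        if any(abs(pearson(values, table[k])) > max_corr for k in kept):
            continue  # redundant with a descriptor already kept
        kept.append(name)
    return kept


# Hypothetical descriptor table for four compounds
table = {
    "mol_weight": [180, 220, 250, 290],
    "heavy_atoms": [13, 16, 18, 21],   # nearly proportional to mol_weight
    "n_rings": [1, 1, 1, 1],           # invariant
    "logp": [3.0, 1.0, 2.5, 0.5],
}
print(filter_descriptors(table))  # ['mol_weight', 'logp']
```

The result depends on the order of the descriptors, since the first of a correlated pair is the one kept; stepwise selection or PCA are the more systematic alternatives listed above.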

Regression algorithms

• Multiple linear regression (see practical)
  – Easy to interpret
  – Problem of correlations between factors

• Partial Least Squares (PLS)
  – Similar to PCA in reducing the number of factors (x_i) to new orthogonal “latent variables” (t_i)
  – Compared to PCA, adds a correlation between the observed data and the latent variables (y ~ a_1 t_1)

$$y = a_1 t_1 + a_2 t_2 + \dots + a_n t_n$$

$$t_i = b_{i1} x_1 + b_{i2} x_2 + \dots + b_{ip} x_p$$

Not limited

• There are multiple regression algorithms
  – Implementation
  – Selection of factors
  – Best way to consider a good model

• Other methods
  – Gaussian Processes (http://dx.doi.org/10.1021/ci7000633)
  – Molecular Field Analysis and Partial Least Squares: CoMFA and derivatives, using 3D steric and electrostatic information (http://www.wiley.com/legacy/wileychi/ecc/samples/sample05.pdf and http://www.netsci.org/Science/Compchem/feature11.html)

Regression + Category

• Poor regression but good classification

[Figure: Observed versus Predicted plot divided into Good and Bad regions, with the False Positives and False Negatives marked]

After

• Once models are built and you have an idea of their mathematical quality:
  – Look at the observed vs predicted plot
  – Try to understand the model

• Do the descriptors make sense?
  – LogP is important when modelling solubility
  – Why is a certain substructure so important?

Outliers

• What to do with outliers?

• Prediction far from observed:
  – Are the compounds similar to the training set?
  – Outside your space of confidence

• Chemical similarity = activity similarity is not always true
  – There are activity cliffs (see refs)
  – Interesting for SAR study

On Outliers and Activity Cliffs−Why QSAR Often Disappoints, Maggiora, J. Chem. Inf. Model., 2006, 46, 1535−1535

http://dx.doi.org/10.1021/ci060117s

Structure−Activity Relationship Anatomy by Network-like Similarity Graphs and Local Structure−Activity Relationship Indices, Wawer et al., J. Med. Chem., 2008, 51 (19), pp 6075–6084

http://dx.doi.org/10.1021/jm800867g

A model is a model

• It is not the reality

• Provides help for experiments
  – Understand what happens
  – Reduce the number of experiments
  – Does not replace lab work

• There is no one perfect model
  – Depends on the method, data sets, descriptors, tuning parameters…

Real correlation?

• Does a decrease in marriages decrease the risk of death?

Should we ban Church of England weddings?

Why do we Sometimes get Nonsense-Correlations between Time-Series?--A Study in Sampling and the Nature of Time-Series, Yule, Journal of the Royal Statistical Society, Vol. 89, No. 1. (Jan., 1926), pp. 1-63

Further – Multiple Targets

• Large scale model:
  – Prediction of multiple interactions at once
  – Need large database
    • Wombat (literature), MDDR (patent)
    • ChEMBL

• Identify side effects, or unknown beneficial effects

Principle

• SEA approach:
  – Similarity of a compound to active ligands (similar to BLAST); website: http://sea.bkslab.org/

• Multiple category Bayesian model:
  – Each fingerprint feature gets a different weight for each target: the sum is different for each target

• Output:
  – List of proteins ranked by probability of binding
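The multiple-category idea can be sketched as one weight table per target, with targets ranked by the summed weights of the features present in the query compound. Everything named here is hypothetical: the target names, feature names and weights are invented for illustration, and the summed weight is used only as a ranking score, not as a calibrated probability.

```python
def rank_targets(features, weights_by_target):
    """Rank targets by the sum of per-target weights of the features present.

    features: iterable of fingerprint features of the query compound
    weights_by_target: {target: {feature: weight}}
    Returns a list of (target, score), best first.
    """
    present = set(features)
    scores = {
        target: sum(weights.get(f, 0.0) for f in present)
        for target, weights in weights_by_target.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-target weights
weights_by_target = {
    "target A": {"acid": 1.2, "ring": 0.1, "Cl": -0.8},
    "target B": {"acid": -0.5, "ring": 0.9, "Cl": 0.4},
}
print(rank_targets({"ring", "Cl"}, weights_by_target))
```

The same compound therefore gets a different score for each target, and the ranked list suggests which proteins to test first.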

References

• An Introduction to Chemoinformatics, A. Leach and V. Gillet
• Sheffield course: next one in 2011: http://www.shef.ac.uk/is/research/groups/chem/courses.html ; Conference: http://cisrg.shef.ac.uk/shef2010/
• Pipeline Pilot documentation, and Cheminformatics analysis and learning in a data pipelining environment, Hassan et al., Molecular Diversity (2006) 10: 283–299
• Multiple targets:
  – Predicting new molecular targets for known drugs, Keiser et al., Nature 462, 175-181 (12 November 2009), and Relating protein pharmacology by ligand chemistry, Keiser et al., Nat Biotech 25 (2), 197-206 (2007)
  – Prediction of Biological Targets for Compounds Using Multiple-Category Bayesian Models Trained on Chemogenomics Databases, Nidhi et al., J. Chem. Inf. Model., 2006, 46 (3), pp 1124–1133
  – Global mapping of pharmacological space, Paolini et al., Nat Biotech 24 (7), 805-815 (2006)

Questions

Practicals

Using Pipeline Pilot

Regression and Classification
