similarity and diversity alexandre varnek, university of strasbourg, france

Similarity and Diversity

Alexandre Varnek, University of Strasbourg, France

What is similar?

Shape Colour

Size Pattern

Different „spaces“, classified by:

O

HOH

O

HCOOHO

HNH2

O

HCl

O

H

OH

O

H

COOH

O

H

NH2

O

H

Cl

O

HOH

O

HCOOH

O

HNH2

O

HCl

O

H O

N

OH

O

H O

N

COOH

O

H O

N

NH2O

H O

N

Cl

16 diverse aldehydes...

O

HOH

O

HCOOH

O

HNH2

O

HCl

O

H

OH O

H

COOH

O

H

NH2O

H

Cl

O

HOH

O

HCOOH

O

HNH2

O

HCl

O

H O

N

OH

O

H O

N

COOH

O

H O

N

NH2

O

H O

N

Cl

...sorted by common scaffold

O

HOH

O

HCOOH

O

HNH2

O

HCl

O

H

OH

O

H

COOH O

H

NH2

O

H

Cl

O

HOH

O

HCOOH

O

HNH2

O

HCl

O

H O

N

OH

O

H O

N

COOH

O

H O

N

NH2

O

H O

N

Cl

...sorted by functional groups

Compounds active as opioid receptors

The „Similarity Principle“ :Structurally similar molecules are assumed to have similar biological properties

structural similarity “fading away” …

0.82

0.39

0.84

0.72

0.67

0.64

0.53

0.56

0.52

reference compounds

Structural Spectrum of Thrombin Inhibitors

Key features in similarity/diversity calculations:

• Properties to describe elements (descriptors, fingerprints)

• Distance measure („metrics“)

N-Dimensional Descriptor Space

• Each chosen descriptor adds a dimension to the reference space

• Calculation of n descriptor values produces an n-dimensional coordinate vector in descriptor space that determines the position of a molecule

descriptor3

descriptor2

descriptor1

descriptorn

molecule Mi

= (descriptor1(i), descriptor2(i), …, descriptorn(i))

Chemical Reference Space

• Distance in chemical space is used as a measure of molecular “similarity“ and “dissimilarity“

• “Molecular similarity“ covers only chemical similarity but also property similarity including biological activity

descriptor3

descriptor2

descriptor1

descriptorn

DAB

A

B

Distance Metrics in n-D Space

• If two molecules have comparable values in all the n descriptors in the space, they are located close to each other in the n-D space.– how to define “closeness“ in space as a measure

of molecular similarity?– distance metrics

Descriptor-based Similarity• When two molecules A and B are projected into an n-D

space, two vectors, A and B, represent their descriptor values, respectively.– A = (a1,a2,...an)– B = (b1,b2,...bn)

• The similarity between A and B, SAB, is negatively correlated with thedistance DAB– shorter distance ~ more similar molecules– in the case of normalized distance

(within value range [0,1]), similarity = 1 – distance

descriptor3

descriptor2

descriptor1

descriptorn

DAB

A

B

C

DBC

DAB>DBC SAB<SBC

(1) The distance values dAB 0; dAA= dBB= 0

(2) Symmetry properties: dAB= dBA

(3) Triangle inequality: dAB dAC+ dBC

Metrics Properties

descriptor3

descriptor2

descriptor1

descriptorn

DAB

A

B

C

DBC

Euclidean Distance in n-D Space• Given two n-dimensional vectors, A and B

– A = (a1,a2,...an)

– B = (b1,b2,...bn)

• Euclidean distance DAB is defined as:

• Example:– A = (3,0,1); B = (5,2,0)

– DAB = = 3

n

iiiAB baD

1

2)(

222 )01()20()53(

descriptor3

descriptor2

descriptor1

descriptorn

DA

B

A

B

Manhattan Distance in n-D Space• Given two n-dimensional vectors, A and B

– A = (a1,a2,...an)

– B = (b1,b2,...bn)

• Manhattan distance DAB is defined as:

• Example:– A = (3,0,1); B = (5,2,0)

– DAB = = 5

n

iiiAB baD

1

||

|01||20||53|

descriptor3

descriptor2

descriptor1

descriptorn

DA

B

A

B

Euclidian distance:

[(x11 - x21) 2 + (x12 - x22)2] 1/2 =

= (42 + 22)1/2 = 4.472

Manhattan (Hamming) distance:

|x11 - x21| + |x12 - x22| = 4 + 2 = 6

Sup distance:

Max (|x11 - x21|, |x12 - x22|) =

= Max (4, 2) = 4

Distance Measures („Metrics“):

Binary Fingerprint

Popular Similarity/Distance Coefficients

• Similarity metrics:– Tanimoto coefficient– Dice coefficient– Cosine coefficient

• Distance metrics:– Euclidean distance– Hamming distance– Soergel distance

Tanimoto Coefficient (Tc)

• Definition:

– value range: [0,1]– Tc is also known as Jaccard coefficient– Tc is the most popular similarity coefficient

cba

cs

),(Tc),( BABA

A BC

Example Tc Calculation

a = 4, b = 4, c = 2

A

B

binary

3

1

6

2

244

2),(Tc

BA

Dice Coefficient

• Definition:

– value range: [0,1]– monotonic with the Tanimoto coefficient

ba

cs

2

),( BA

Cosine Coefficient• Definition:

• Properties:– value range: [0,1]– correlated with the Tanimoto coefficient but not

strictly monotonic with it

ab

cs ),( BA

Hamming Distance• Definition:

– value range: [0,N] (N, length of the fingerprint) – also called Manhattan/City Block distance

cbad 2),( BA

Soergel Distance• Definition:

• Properties:– value range: [0,1]– equivalent to (1 – Tc) for binary fingerprints

cba

cbad

2

),( BA

Similarity coefficients

Properties of Similarlity and Distance Coefficients

(1) The distance values dAB 0; dAA= dBB= 0(2) Symmetry properties: dAB= dBA

(3) Triangle inequality: dAB dAC+ dBC

The Euclidean and Hamming distances and the Tanimoto coefficients (dichotomous variables) obey all properties.The Tanimoto, Dice and Cosine coefficients do not obey inequality (3).

Coefficients are monotonic if they produce the same similarlity ranking

Metric Properties

Using bit strings to encode molecular size. A biphenyl query is compared to a series of analogues of increasing size. The Tanimoto coefficient, which is shown next to the corresponding structure, decreases with increasing size, until a limiting value isreached.

Similarity search

D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386

Similarity search

Molecular similarity at a range of Tanimoto coefficient values


Similarity search


The distribution of Tanimoto coefficient values found in database searches with a range of query molecules of increasing size and complexity

A comparison of the Soergel and Hamming distance values for two pairs of structures to illustrate the effect of molecular size

Molecular Similarity

A R. Leach and V. J. Gillet "An Introduction to Chemoinformatics" , Kluwer Academic Publisher, 2003

Molecular Similarity

The maximum common subgraph (MCS) between the two molecules is in bold

Similarity = Nbonds(MCS) / Nbonds(query)

A R. Leach and V. J. Gillet "An Introduction to Chemoinformatics" , Kluwer Academic Publisher, 2003

Activity landscape

Inhibitors of acyl-CoA:cholesterol acyltransferase represented with MACCS (a), TGT (b), and Molprint2D (c) fingerprints.

How important is a choice of descriptors ?

small changes in structure have dramatic effects on activity

“cliffs” in activity landscapes

discontinuous SARscontinuous SARs

gradual changes in structure result in moderate changes in activity

“rolling hills” (G. Maggiora)

Structure-Activity Landscape Index: SALIij = Aij / Sij

Aij Sij ) is the difference between activities (similarities) of molecules i and j

R. Guha et al. J.Chem.Inf.Mod., 2008, 48, 646

VEGFR-2 tyrosine kinase inhibitors

bad news for molecular similarity analysis...

MACCSTc: 1.00

Analog

6 nM

2390 nM

small changes in structure have dramatic effects on activity

“cliffs” in activity landscapes

lead optimization, QSAR

discontinuous SARs

Example of a “Classical” Discontinuous SAR

Adenosine deaminase inhibitors

(MACCS Tanimoto similarity)

Any similarity method must recognize thesecompounds as being “similar“ ...

Libraries design

Goal: to select a representative subset from a large database

Chemical Space

Overlapping similarity radii Redundancy„Void“ regions Lack of information

Chemical Space

„Void“ regions Lack of information

Chemical Space

No redundancy, no „voids“ Optimally diverse compound library

• Clustering

• Dissimilarity-based methods

• Cell-based methods

• Optimisation techniques

Subset selection from the libraries

Clustering in chemistry

What is clustering?

Clustering is the separation of a set of objects into groups such that items in one group are more like each other than items in a different group

A technique to understand, simplify and interpret large amounts of multidimensional data

Classification without labels (“unsupervised learning”)

Where clustering is used ?

General: data mining, statistical data analysis, data

compression, image segmentation, document classification (information retrieval)

Chemical: representative sample, subsets selection, classification of new compounds

Overall strategy

Select descriptors Generate descriptors for all items Scale descriptors Define similarity measure (« metrics ») Apply appropriate clustering method to group the items on basis of chosen descriptors and similarity measure Analyse results

Data Presentation

mol

ecu

les

descriptors

Pattern matrixm

olec

ule

s

molecules

Proximity matrix

Library contains n molecules, each molecule is described by p descriptors

dii = 0; dij = dji

Clustering methods

Hierarchical

Non-hierarchical

Agglomerative

DivisiveMonothetic

Polythetic

Jarvis-Patrick

Single Pass

Nearest Neighbour

Topographic

Mixture Model

Relocation

Others

Complete Link

Single Link

Group Average

Median

Weighted Gr Av

Centroid

Ward

Hierarchical Clustering

A dendrogram representing an hierarchical clustering of 7 compounds

divisive

aggl

omer

ativ

e

Sequential Agglomerative Hierarchical Non-overlapping (SAHN) methods

Simple link Complete link Group average

In the Single Link method, the intercluster distance is equal to the minimum distance between any two compounds, one from each cluster.

In the Complete Link method, the intercluster distance is equal to the furthest distance between any two compounds, one from each cluster.

The Group Average method measures intercluster distance as the average of the distances between all compounds in the two clusters.

Hierarchical Clustering:Johnson’s method

The algorithm is an agglomerative scheme that erases rows and columns in the proximity matrix as old clusters are merged into new ones.

min[( ), ( )] { [( ), ( )]}d r s d i j

Step 1. Group the pair of objects into a cluster

Step 2. Update the proximity matrix

[( ), ( , )] { [( ), ( )], [( ), ( )]}mind k r s d k r d k s

Single-link

[( ), ( , )] { [( ), ( )], [( ), ( )]}maxd k r s d k r d k sComplete-

link

Hierarchical Clustering: single link

Hierarchical Clustering: complete link

Hierarchical Clustering: single vs complete

link

Non-Hierarchical Clustering:the Jarvis-Patrick

method

At the first step, all nearest neighbours of each compound are found by calculating of all paiwise similarities and sorting according to this similarlity. Then, two compounds are placed into the same cluster if:

1.They are in each other’s list of m nearest neighbours.2.They have p (where p< m) nearest neighbours in common. Typical values: m = 14 ; p = 8.

Pb: too many singletons.

Non-Hierarchical Clustering:the relocation

methods

Relocation algorithms involve an initial assignment of compounds to clusters, which is then iteratively refined by moving (relocating) certain compounds from one cluster to another.

Example: the K-means method1. Random choise of c « seed » compounds. Other compounds are assigned to the nearest seed resulting in an initial set of c clusters. 2.The centroides of cluster are calculated. The objects are re-assigned to the nearest cluster centroid.

Pb: the method is dependent upon the initial set of cluster centroids.

Efficiency of Clustering Methods

Method Storage space

Time

Hierarchical

(general)

O (N2) O (N3)

Hierarchical

(Ward’s method + RNN)

O (N) O (N2)

Non-Hierarchical

(general)

O (N) O (MN)

Non-Hierarchical

(Jarvis-Patric method)

O (N2) O (MN)

N is the number of compounds and M is the number of clusters

Validity of clustering How many clusters are in the data ? Does partitioning match the

categories ? Where should be the dendrogram be cut

? Which of two partitions fit the data

better ?

Dissimilarity-Based Compound Selection (DBCS)

4 steps basic algorithm for DBCS:

1. Select a compound and place it in the subset.2. Calculate the dissimilarity between each

compound remaining in the data set and the compounds in the subset.

3. Choose the next compound as that which is most dissimilar to the compounds in subset.

4. If n < n0 (n0 being the desired size number of compounds in the final subset), return to step 2.


Basic algorithm for DBCS: 1st step – selection of the initial compound

1. Select it at random;

2. Choose the molecules which is « most representative » (e.g., has the largest sum of similarlities to other molecules);

3. Choose the molecules which is « most dissimilar » (e.g., has the smallest sum of similarlities to other molecules).


Basic algorithm for DBCS: 2nd step – calculation of dissimilarity

• Dissimilarity is the opposite of similarity

(Dissimilarity)i,j = 1 – (Similarity )i,j

(where « Similarity » is Tanimoto, or Dice, or Cosine, … coefficients)

Diversity

Diversity characterises a set of molecules

)1(

),(

NN

IJIityDissimilar

IJDiversity


Basic algorithm for DBCS: 3nd step – selection the most dissimilar compound

1). MaxSum method selects the compound i that has the maximum sum of distances to all molecules in the subset

,1

m

i i jj

score D

, ; 1,min( )i i j j mscore D

2). MaxMin method selects the compound i with the maximum distance to its closest neighbour in the subset

There are several methods to select a diversed subset containing m compounds

Basic algorithm for DBCS: 3nd step – selection the most dissimilar compound

3). The Sphere Exclusion Algorithm

1. Define a threshold dissimilarity, t2. Select a compound and place it in the subset.3. Remove all molecules from the data set that have a dissimilarity to

the selected molecule of less than t4. Return to step 2 if there are molecules remaining in the data set.

The next compound can be selected • randomly;• using MinMax-like method

DBCS : Subset selection from the libraries

Cell-based methods

Cell-based or Partitioning methods operated within a predefined low-dimentional chemical space.

If there are K axes (properties) and each is devided into bi bins, then the number of cells Ncells in the multidimentianal space is

1

K

cells ii

N b

Cell-based methods

The construction of 2-dimentional chemical space.

LogP bins: <0, 0-0.3, 3-7 and >7MW bins: 0-250, 250-500, 500-750, > 750.

Cell-based methods

A key feature of Cell-based methods is that they do not requere the calculation of paiwise distances Di,j between compounds; the chemical space is defined independently of the molecules that are positioned within it.

•Advantages of Cell-based methods 1.Empty cells (voids) or cells with low ocupancy can be easily identified.2.The diversity of different subsets can be easily compared by examining the overlap in the cells occupied by each subset.

Main pb: Cell-based methods are restricted to relatively low-dimentional space

Optimisation techniques

DBCS methods prepare a diverse subset selecting interatively ONE molecule a time.

Optimisation techniques provide an efficient ways of sampling large search spaces and selection of diversed subsets

Optimisation techniques

• Example: Monte-Carlo search

1. Random selection of an initial subset and calculation of its diversity D.

2. A new subset is generated from the first by replacing some of its compounds with other randomly selected.

3. The diversity of the new subset Di+1 is compared with Di

if D = Di+1 - Di > 0, the new set is accepted if D < 0, the probability of acceptence depends on the Metropolis

condition, exp(- D / kT).

Scaffolds and Frameworks

Bemis, G.W.; Murcko, M.A. J.Med.Chem 1996, 39, 2887-2893

Frameworks

G. Schneider, P. Schneider, S. Renner, QSAR Comb.Sci. 25, 2006, No.12, 1162 – 1171

Frameworks

Dissection of a molecule according to Bemis and Murcko. Diazepam contains three sidechains and one framework with two ring systems and a zero-atom linker.


Graph Frameworks for Compounds in the CMC Database (Numbers Indicate Frequency of Occurrence)

L’algorithme de Bemis et Murcko de génération de framework : 1)les hydrogènes sont supprimés, 2)les atomes avec une seule liaison sont supprimés successivement, 3)le scaffold est obtenu, 4)tous les types d’atomes sont définis en tant que C et tous les types de liaisonssont définis en tant que simples liaisons, ce qui permet d’obtenir le framework.

Scaffolds et Frameworks

Contrairement à la méthode de Bemis et Murcko, A. Monge a proposé de distinguer les liaisons aromatiques et non aromatiques (thèse de doctorat, Univ. Orléans, 2007)


Scaffold-Hopping: How Far Can You Jump?G. Schneider, P. Schneider, S. Renner, QSAR Comb.Sci. 25, 2006, No.12, 1162 – 1171

The Scaffold Tree − Visualization of the Scaffold Universe by Hierarchical Scaffold ClassificationA. Schuffenhauer, P. Ertl, S. Roggo, S. Wetzel, M. A. Koch, and H.WaldmannJ. Chem. Inf. Model., 2007, 47 (1), 47-58

A. Schuffenhauer, P. Ertl, S. Roggo, S. Wetzel, M. A. Koch, and H.Waldmann J. Chem. Inf. Model., 2007, 47 (1), 47-58

Scaffold tree for the results of pyruvate kinase assay. Color intensity represents the ratio of active and inactive molecules with these scaffolds.

similarity and diversity alexandre varnek, university of strasbourg, france

Documents