Chemical Landscape Analysis – the case of tautomers


Upload: nina-jeliazkova

Post on 26-Jan-2015


DESCRIPTION

Poster at the 6th Joint Sheffield Conference on Chemoinformatics by Nina Jeliazkova, Nikolay T. Kochev, Vedrin Jeliazkov.

The Structure-Activity Relationships (SAR) landscape and activity cliffs concepts are a popular analysis and visualisation technique, with origins in medicinal chemistry and receptor-ligand interaction modelling. While intuitive, the definition of an activity cliff as a “pair of structurally similar compounds with large differences in potency” is commonly recognized as ambiguous.

We have recently proposed a new and efficient method for identifying activity cliffs and visualizing activity landscapes [1]. The method introduces a probabilistic measure: the likelihood of a compound having a large activity difference compared to other compounds, while being highly similar to them. The likelihood is effectively a quantification of a SAS Map with defined thresholds and does not require storing the pairwise similarity matrix. The method generates a list of individual compounds, ranked according to their likelihood of being involved in the formation of activity cliffs, and goes beyond characterizing cliffs by structure pairs only. Every compound is associated with zero, one, or more compounds with similarity and activity difference above the defined thresholds. The paired structures can be easily retrieved by a standard similarity query. The arrangement as a graph naturally emerges from the set of top-ranked compounds, as they are usually interconnected as activity cliff pairs. The popular matched molecular pairs approach could be considered a special case, but is also improved by being able to identify multiple matching pairs at once.

We now extend the landscape analysis and visualisation to datasets where the chemical structures are represented by more than one tautomer, and study the influence of tautomerization on the SAR landscape. The tautomer generation relies on the Ambit-Tautomer open source package, developed by the authors [2].

Finally, the method is implemented as part of the existing open source Ambit package [3] and can be accessed via an OpenTox API compliant web service. The OpenTox API provides a uniform REST web service application programming interface (API) to chemical structures, experimental data and calculated properties, descriptor calculation, model building, validation and reporting [4]. The AMBIT web services package [3] is being developed by Ideaconsult Ltd. and is one of several existing independent implementations of the OpenTox API, providing data sharing and remote calculation capabilities. Visualisation of the ranked activity cliffs by bubble charts is presented, and interactive visualisations are available at http://toxmatch.sf.net.

[1] http://www.ncbi.nlm.nih.gov/pubmed/23110534

[2] http://dx.doi.org/10.1002/minf.201200133

[3] http://www.jcheminf.com/content/3/1/18

[4] http://www.jcheminf.com/content/2/1/7

TRANSCRIPT

Page 1: Chemical Landscape Analysis – the case of tautomers

The method generates a ranked list of individual compounds, ordered according to their likelihood of being involved in the formation of activity cliffs. The following examples use the PubChem thrombin inhibitors assay AID 1215; the Tanimoto similarity is calculated on fingerprints from the CDK library.


References

[1] Jeliazkova N., Jeliazkov V., Chemical Landscape Analysis with the OpenTox Framework, Current Topics in Medicinal Chemistry, 2012, 12(18):1987-2001.

[2] Kochev N., Paskaleva V., Jeliazkova N., Ambit-Tautomer: An Open Source Tool for Tautomer Generation, Molecular Informatics, 2013, 32(5-6):481-504.

[3] Jeliazkova N., Jeliazkov V., AMBIT RESTful web services: an implementation of the OpenTox application programming interface, Journal of Cheminformatics, 2011, 3:18.

[4] AMBIT project, http://ambit.sourceforge.net

Nina Jeliazkova*1, Nikolay T. Kochev2, Vedrin Jeliazkov1

*e-mail : [email protected] twitter: @10705013; 1Ideaconsult Ltd, 4 Angel Kanchev Str., Sofia 1000, Bulgaria; 2University of Plovdiv, Department of Analytical Chemistry and Computer Chemistry, Bulgaria

What: The Structure-Activity Relationships (SAR) landscape and activity cliffs

Why: analysis and visualisation technique

Origin: medicinal chemistry and receptor-ligand interactions modelling

Activity cliff definition: a “pair of structurally similar compounds with large differences in potency”

State of the art: SAS Maps, network graphs, quantification by SALI and SARI, a number of methods analyzing the pairwise similarity matrix; various extensions

Pros: intuitive

Cons: ambiguous, not scalable to large datasets

Method:

We have recently proposed a new and efficient method for identifying activity cliffs and visualizing activity landscapes [1]. The method ranks the activity cliffs by a probabilistic measure: the likelihood of a compound having a large activity difference compared to other compounds, while being highly similar to them.

Table 1. Conditional probability of events co-occurrence

                                 s (high similarity)   !s (low similarity)
t (large activity difference)    a ~ P(s|t)            b ~ P(!s|t)
!t (small activity difference)   c ~ P(s|!t)           d ~ P(!s|!t)

$$G^2 = a \log\frac{a\,(c+d)}{c\,(a+b)} + b \log\frac{b\,(c+d)}{d\,(a+b)}$$

Poster sections: Background; Activity cliffs ranking; The tautomers in the chemical landscape
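The G2 expression can be checked numerically. The following is a minimal Python sketch, not the Ambit implementation: the function name `g2` and the `alpha` smoothing parameter are illustrative assumptions, and the poster does not state the smoothing constant it applies. With the natural logarithm and no smoothing, the counts listed in Table 2 for IDs 12413 and 12439 both give a value that rounds to 0.07, which agrees with the G2 column of that table.

```python
import math


def g2(a, b, c, d, alpha=0.0):
    """G2 = a*log(a(c+d) / (c(a+b))) + b*log(b(c+d) / (d(a+b))), natural log.

    alpha is a hypothetical additive (Laplace-style) smoothing constant that
    is added to every count; the value used on the poster is not given.
    """
    a, b, c, d = (x + alpha for x in (a, b, c, d))
    if min(a, b, c, d) <= 0:
        raise ValueError("zero count: pass alpha > 0 to apply smoothing")
    cliff_term = a * math.log(a * (c + d) / (c * (a + b)))
    other_term = b * math.log(b * (c + d) / (d * (a + b)))
    return cliff_term + other_term


# Counts a, b, c, d taken from the Table 2 rows for IDs 12413 and 12439.
print(round(g2(1, 310, 1, 216), 2))  # 0.07
print(round(g2(1, 308, 1, 218), 2))  # 0.07
```

A row with a zero count (such as c = 0) makes the unsmoothed expression undefined, which is why the sketch refuses to evaluate it unless a positive `alpha` is supplied.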

[SAS Map quadrants: I, activity cliffs (count a); II, nondescript (count b); III, scaffold hops (count d); IV, smooth (count c)]

Rank  ID     a  b    c  d    IC50, µM       G2
1     12371  2  216  0  310  50 (inactive)  32.34
2     12413  1  310  1  216  5.84           0.07
3     12439  1  308  1  218  10.90          0.07

Visualisation: Bubble Chart

The area of each circle is proportional to G2.

The activity cliffs are ordered as in the Table 2 ranking.
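Making the circle area, rather than the radius, proportional to G2 means the radius grows with the square root of G2. A small Python sketch of that scaling follows; the function name `bubble_radius`, the `area_per_unit` scale factor, and the clamping of negative values to zero are assumptions made for illustration and are not taken from the poster's JavaScript charts.

```python
import math


def bubble_radius(g2_value, area_per_unit=100.0):
    """Radius of a circle whose area equals g2_value * area_per_unit.

    area_per_unit is a hypothetical display scale (e.g. square pixels per
    unit of G2). Negative inputs are clamped to zero (an assumption).
    """
    area = max(g2_value, 0.0) * area_per_unit
    return math.sqrt(area / math.pi)


# Quadrupling G2 doubles the radius, so the area grows fourfold.
print(bubble_radius(4.0) / bubble_radius(1.0))
```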

Fig 2. The result of a similarity query for the top-ranked compound, ID = 12731

We extend the landscape analysis and visualisation to datasets where the chemical structures are represented by more than one tautomer, and study the influence of tautomerization on the SAR landscape. The thrombin inhibitors dataset (AID 1215) contains 529 structures. The tautomer-enriched dataset (generated by the Ambit-Tautomer package [2]) consists of 6145 structures.

Table 2. Activity cliffs ranking by G2 (Tanimoto threshold > 0.8 and activity difference > 21.6)

Fig 1. SAS Map of PubChem thrombin inhibitor assay AID 1215 (IC50, μM); Tanimoto similarity on hashed 1024-bit fingerprints (the CDK library). Counts a, b, c, d as in Table 1.

If only the structure pairs between a given compound and all other compounds in the analysed dataset are taken into account, G2 characterizes the likelihood of this particular compound forming activity cliffs with the compounds in the dataset. By estimating G2 for all structures in the dataset, a ranking can be established, thus identifying the most prominent activity cliffs.

Note that this is a ranking of individual structures, not pairs of structures. This is a significant advantage, especially when processing large datasets, as only the likelihood (or the four counts) needs to be stored per compound, instead of the entire pairwise matrix. Column a gives the number of pairs that form activity cliffs with the compound. The paired structures can be easily retrieved by a standard similarity query. The arrangement as a graph naturally emerges from the set of top-ranked compounds, as they are usually interconnected as activity cliff pairs.
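The per-compound bookkeeping can be sketched in Python as below. This is an illustration under stated assumptions rather than the Ambit code: fingerprints are represented as sets of on-bit indices, the function names `tanimoto` and `cliff_counts` are hypothetical, the default thresholds are the ones quoted in the Table 2 caption, and the toy fingerprints and activities are invented for the example.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp1 | fp2)
    return len(fp1 & fp2) / union if union else 0.0


def cliff_counts(index, fingerprints, activities,
                 sim_threshold=0.8, act_threshold=21.6):
    """Return the four counts (a, b, c, d) for the compound at `index`.

    Only one row of the pairwise matrix is visited at a time, so the full
    matrix never has to be stored.
    """
    a = b = c = d = 0
    for j, (fp, act) in enumerate(zip(fingerprints, activities)):
        if j == index:
            continue
        similar = tanimoto(fingerprints[index], fp) > sim_threshold
        large_diff = abs(activities[index] - act) > act_threshold
        if large_diff and similar:
            a += 1          # activity cliff
        elif large_diff:
            b += 1          # nondescript
        elif similar:
            c += 1          # smooth
        else:
            d += 1          # scaffold hop
    return a, b, c, d


# Toy data (invented): compound 0 is similar to compounds 1 and 2 only.
fps = [set(range(10)), set(range(9)), set(range(11)), {20, 21, 22}, {30, 31}]
acts = [50.0, 5.0, 48.0, 1.0, 45.0]
print(cliff_counts(0, fps, acts))  # (1, 1, 1, 1)
```

Collecting the indices where `a` is incremented, instead of only counting them, would give the cliff partners that the text retrieves with a similarity query.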

The network graph

The bubble chart is space efficient and can represent a large number of values in a small space.

More (interactive) examples at:

http://toxmatch.sf.net

Combined bubble chart of G2 ranked compounds. Similarity threshold 0.8; each color corresponds to a different activity difference threshold. The gray color at the right indicates the structures with count a = 0, but G2 > 0, due to the additive smoothing. These are potential activity cliffs at different similarity thresholds.

PubChem AID 1215, tautomers enriched: the network graph and the bubble chart

There are 8 activity cliff pairs instead of only one in the original dataset (Fig 2). The bubble chart shows that the G2 ranking is not the same for all the tautomers of the same compound (the sizes of the circles of the same color differ).

The enriched dataset contains 8 tautomers for each of the three structures in Fig 2. Blue: tautomers of ID = 12731. Red: tautomers of ID = 12413. Green: tautomers of ID = 12439.

Fig 3. Activity cliffs ranking of the tautomer enriched AID 1215 dataset

The network graph in Fig 3 shows that the activity cliffs never involve more than one tautomer. Therefore, if the correct combination of tautomers is missing in a particular dataset, the activity cliffs might not be identified.

DSSTox CPDBAS dataset (carcinogenicity) Multiple activities and thresholds (1519 structures).

Fig 4. The tautomers of ID = 12731 (PubChem SID 861943), 1-[5-(4-bromophenyl)-7-(4-methoxyphenyl)-1,7-dihydro-[1,2,4]triazolo[1,5-a]pyrimidin-2-yl]pyrrolidine-2,5-dione

The activity cliffs ranking method is implemented as part of the open source Ambit package [3, 4] and can be accessed via a REST web service (OpenTox API compliant). The user interface and all charts are JavaScript based and accessible through modern web browsers.

Finally, each of the original structures is assigned the maximum G2, taken over the set of its tautomers. Then the structures are ordered and the rank is assigned. Fig 5 illustrates how the activity cliff ranking changes, compared to the ranking derived from the original structures only.
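The aggregation step described here (maximum G2 over the tautomers of each original structure, followed by ordering) can be sketched in Python as follows. The function name `rank_by_max_g2`, the input layout of (parent ID, G2) pairs, and the toy scores are assumptions made for illustration; tie handling is not specified in the source, and this sketch simply keeps the order in which parents were first seen.

```python
def rank_by_max_g2(tautomer_scores):
    """Rank original structures by the maximum G2 over their tautomers.

    tautomer_scores: iterable of (parent_id, g2) pairs, one per tautomer.
    Returns a list of (rank, parent_id, max_g2) with rank starting at 1.
    """
    best = {}
    for parent_id, score in tautomer_scores:
        if parent_id not in best or score > best[parent_id]:
            best[parent_id] = score
    # sorted() is stable, so ties keep first-seen order.
    ordered = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [(rank, pid, score)
            for rank, (pid, score) in enumerate(ordered, start=1)]


# Invented scores for three parent structures with several tautomers each.
scores = [("A", 0.5), ("A", 2.0), ("B", 1.2), ("C", 0.1), ("B", 0.3)]
print(rank_by_max_g2(scores))  # [(1, 'A', 2.0), (2, 'B', 1.2), (3, 'C', 0.1)]
```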

[Fig 5 plot: x-axis, Rank (original dataset); y-axis, Rank (tautomers enriched dataset); both axes run from 1 to 501]

Fig 5. Activity cliff ranking in the original and tautomer enriched datasets

The method goes beyond finding structure pairs only.

* Additive (Laplace) smoothing is used to deal with zero counts

The likelihood G2 is effectively a quantification of a SAS Map with defined thresholds. It can be calculated for the entire dataset (Fig 1), for a selected set of compounds, or for an individual compound.