Fingerprint Design and Molecular Complexity Effects

Chemical Similarity Searching

• Long history in pharmaceutical research
• One of the most popular approaches in virtual screening
• Based on the concept of global molecular similarity
• Similarity-property principle (Johnson & Maggiora, 1990): molecules that are similar overall are likely to have similar biological activity

[Figure: two similar xanthine oxidase inhibitors]

Molecular Fingerprints

• Bit string representations of molecular structure and properties
• 2D and/or 3D features typically encoded as a vector of binary values
• Reasons for popularity in similarity searching:
- computational efficiency
- surprising effectiveness in detecting active compounds
- often intuitive design

[Figure: a molecule and its molecular fingerprint]

Exemplary Fingerprint Designs

• 2D fragment-based, keyed fingerprints
• Pathways: O=CNC, ..., CCN=CC=CCl

[Figure: fingerprint design schematics; surviving labels are the point labels A, D, H and the numeric labels 2, 3, 4, 5]

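Similarity Search

[Figure: the fingerprint X_R of the reference molecule(s) R is compared with the fingerprint X_A of each molecule A in the screening database; similarity assessment yields a ranked list of database molecules]

$$\mathrm{Tc}(X_A, X_R) = \frac{c}{a + r - c}$$

- a, r: number of "1" bits in X_A and X_R
- c: number of "1" bits shared by X_A and X_R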
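The ranking step can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the results in these slides: fingerprints are plain lists of 0/1 values, the toy molecules `mol_1` to `mol_3` are made up, and returning 0 for two empty fingerprints is an assumed convention.

```python
def tanimoto(x_a, x_r):
    """Tanimoto coefficient Tc = c / (a + r - c) for two equal-length bit lists."""
    if len(x_a) != len(x_r):
        raise ValueError("fingerprints must have the same length")
    a = sum(x_a)                                      # "1" bits in X_A
    r = sum(x_r)                                      # "1" bits in X_R
    c = sum(1 for p, q in zip(x_a, x_r) if p and q)   # shared "1" bits
    denom = a + r - c
    # Assumed convention: two all-zero fingerprints get similarity 0.
    return c / denom if denom else 0.0


def rank_database(reference, database):
    """Return (name, Tc) pairs sorted from most to least similar to the reference."""
    scored = [(name, tanimoto(fp, reference)) for name, fp in database.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical 8-bit fingerprints, for illustration only.
reference = [1, 1, 0, 1, 0, 0, 1, 0]
database = {
    "mol_1": [1, 1, 0, 1, 0, 0, 0, 0],
    "mol_2": [0, 0, 1, 0, 1, 1, 0, 1],
    "mol_3": [1, 0, 0, 1, 0, 0, 1, 1],
}
ranking = rank_database(reference, database)
for name, score in ranking:
    print(f"{name}: Tc = {score:.2f}")
```

Here `mol_1` shares all three of its "1" bits with the reference (Tc = 3 / (3 + 4 - 3) = 0.75) and therefore ranks first.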

Multiple Active Reference Molecules

• Search performance generally increases if multiple active reference compounds are available
• Search strategies for multiple reference compounds include data fusion techniques or the centroid approach

[Figure: 1-NN, 5-NN, and centroid strategies relating a database compound to the reference molecules; 'NN': nearest neighbor]
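A minimal sketch of both strategies follows. It assumes that k-NN fusion averages the k highest Tanimoto values between a database compound and the references, and that the centroid is the position-wise mean of the reference fingerprints compared via the continuous form of the Tanimoto coefficient; the slides do not spell out either detail, and the function names are hypothetical.

```python
def tanimoto(x, y):
    """Continuous Tanimoto; equals c / (a + b - c) when both vectors are binary."""
    xy = sum(p * q for p, q in zip(x, y))
    denom = sum(p * p for p in x) + sum(q * q for q in y) - xy
    return xy / denom if denom else 0.0


def knn_score(db_fp, references, k):
    """k-NN fusion: mean of the k highest similarities to the reference set."""
    if k < 1:
        raise ValueError("k must be at least 1")
    scores = sorted((tanimoto(db_fp, ref) for ref in references), reverse=True)
    top = scores[:k]
    return sum(top) / len(top)


def centroid_score(db_fp, references):
    """Centroid approach: compare against the averaged reference fingerprint."""
    n = len(references)
    centroid = [sum(bits) / n for bits in zip(*references)]
    return tanimoto(db_fp, centroid)


# Hypothetical 4-bit fingerprints, for illustration only.
references = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
]
db_fp = [1, 1, 0, 0]

one_nn = knn_score(db_fp, references, k=1)      # best single match
three_nn = knn_score(db_fp, references, k=3)    # average over all three
centroid = centroid_score(db_fp, references)
```

With k = 1 the score is driven entirely by the closest reference, whereas larger k and the centroid reward compounds that resemble the reference set as a whole.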

Molecular Complexity and Size Effects

• Similarity searching with conventional fingerprints is often biased by molecular complexity or size effects

[Figure: distributions of MACCS Tc similarity values (x-axis, 0 to 1) against absolute frequency in ZINC (y-axis, 0 to 8000) for six reference molecules M1 to M6]

* Number of bit positions set to "1": M1 20, M2 30, M3 37, M4 41, M5 46, M6 83

Motivation for a Novel Fingerprint Design

• Not sensitive to molecular complexity effects
• Compound class-directed, trainable
• Focus on multiple reference compounds
• Avoid structural fragment descriptors

Fingerprint Design

• Property Descriptor Value Range Derived Fingerprint: PDR-FP
- Constant bit density regardless of molecular size and complexity
- Encodes molecular property descriptors displaying a general tendency to adopt activity class-selective value ranges

Activity-Specific Descriptor Value Ranges

• Identify property descriptors with activity class-selective tendencies
• A descriptor displays an activity class-selective tendency if
- activity classes produce narrow value ranges
- class ranges are only sparsely populated by database molecules

[Figure: relative frequency against descriptor value; the narrow value ranges of class 1, class 2, and class 3 are overlaid on the descriptor value distribution in a large compound database (DB)]

Assessment of Descriptor Selectivity

• Simple descriptor scoring function to capture the probability that an arbitrary database compound does not match a class value range:

$$\mathrm{DS}(d,R) = 100 \times \Pr\left(X \notin [\mathrm{acMin}_{d,R}, \mathrm{acMax}_{d,R}]\right)$$

- d: descriptor
- R: set of active reference molecules (activity class)
- [acMin_{d,R}, acMax_{d,R}]: value range of d in R
- X: random variable following the value distribution of d in a DB

⇒ DS(d,R): fraction of DB molecules (%) that do not match the value range of R
- Top score 100: no DB molecule matches the class value range
- Lowest score 0: all DB molecules match the class range
- Score 50: 50% of the DB match
- Score 80: 20% of the DB match

Eckert & Bajorath. J Med Chem 49, 2284 (2006)
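The score can be estimated directly from descriptor values. The sketch below assumes the probability is approximated by the empirical fraction of database values falling outside the class range; the function name and the numbers are made up for illustration.

```python
def descriptor_score(class_values, db_values):
    """DS(d, R): percentage of DB values outside [acMin, acMax] of the class."""
    ac_min, ac_max = min(class_values), max(class_values)
    # Empirical estimate of Pr(X not in [acMin, acMax]) (an assumption of this sketch).
    outside = sum(1 for v in db_values if v < ac_min or v > ac_max)
    return 100.0 * outside / len(db_values)


# Hypothetical descriptor values for one activity class and a small database.
class_values = [310.0, 325.5, 342.1]
db_values = [120.0, 180.0, 250.0, 300.0, 315.0, 330.0, 400.0, 450.0, 510.0, 620.0]

score = descriptor_score(class_values, db_values)
print(f"DS = {score:.0f}")  # 8 of 10 DB values lie outside [310.0, 342.1]
```

Two of the ten database values fall inside the class range, so 20% of the DB match and the score is 80, in line with the scale described above.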

PDR-FP Descriptor Selection

• Screening of a pool of 184 1D/2D descriptors against a panel of 26 activity classes:
- Descriptor scoring: descriptor pool d1, d2, ..., d184 scored against activity classes 1, 2, ..., 26; remove descriptors scoring consistently <50
- Correlation analysis: remove descriptors from highly correlated pairs
• Descriptors for PDR-FP design (d1, d2, ..., d93): 93 descriptors with a general tendency to adopt activity class-selective value ranges

PDR-FP Design

• Equifrequent binning of descriptor value ranges for a screening database
• Typically 5 or 6 intervals (bits) per descriptor
• Total of 93 descriptors require 500 bits

[Figure: equifrequent binning of the distribution of molecular weight in '2D ZINC' (x-axis, 100 to 500) into six intervals, each holding 16.7% of the compounds; the intervals of descriptors d1, d2, ..., d92, d93 are concatenated into a 500-bit string]

Eckert & Bajorath. J Chem Inf Model 46, 2515 (2006)
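Equifrequent binning and the resulting one-bit-per-descriptor encoding can be sketched as follows. The boundary rule (taking evenly spaced order statistics of the database values) and the function names are assumptions of this illustration; the database values are synthetic.

```python
from bisect import bisect_right


def equifrequent_boundaries(db_values, n_bins):
    """Interior bin boundaries so each bin holds about len(db_values) / n_bins values."""
    ordered = sorted(db_values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def encode_descriptor(value, boundaries):
    """Bit segment for one descriptor: exactly one bit is set, for the bin holding value."""
    bits = [0] * (len(boundaries) + 1)
    bits[bisect_right(boundaries, value)] = 1
    return bits


# Synthetic, skewed database values for one descriptor (60 compounds).
db_values = [i * i for i in range(60)]
bounds = equifrequent_boundaries(db_values, n_bins=6)

# Every bin receives the same number of database compounds.
counts = [0] * 6
for v in db_values:
    counts[bisect_right(bounds, v)] += 1

segment = encode_descriptor(312.0, bounds)
print(bounds, counts, segment)
```

Because the boundaries follow the database distribution, the bins differ in width but not in population, which is what keeps each bit equally likely to be set for an arbitrary database compound.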

PDR-FP Design

• For every descriptor, exactly one PDR-FP bit is set to "1"
• For every test compound, PDR-FP has 93 bits set to "1"
• Constant bit density of 18.6% (93 of 500 bits), independent of molecular size/complexity

[Figure: PDR-FP bit string composed of the segments for descriptors d1, d2, ..., d91, d92, d93]

Activity-Oriented Training of PDR-FP

• Bit strings of reference compounds are combined into a "class-specific search string"
• High bit frequencies correspond to significant deviations in value distributions between active and database compounds and represent activity-specific weighting factors

$$x_{R,i} = \sum_{j=1}^{5} x_{R_j,i}$$

[Figure: activity-oriented training; the fingerprints X_R1, X_R2, X_R3, X_R4, X_R5 of five active reference molecules are summed position by position into the search string X_R, e.g. 3 2 0 0 0 0 1 1 2 1 0 0 0 0 0 0 5 0 0 0 4 1 0 0 0 0 2 2 1 0]

Similarity Assessment with PDR-FP

• High-frequency bit positions must be emphasized during similarity assessment
• A similarity metric is needed to compare binary (DB molecule) and non-binary (class search string) vectors
• In vector notation, the dot product is used:

$$\mathrm{sm}_{\mathrm{PDR\text{-}FP}}(X_A, X_R) = \frac{1}{\mathrm{NF}} \sum_{i=1}^{500} x_{A,i}\, x_{R,i}$$

• Database compounds matching many positions with high weights display the activity-specific bit setting and obtain high similarity values
• NF: normalization factor to place similarity values in [0,1]

[Figure: similarity assessment between the fingerprint X_A of a database molecule and the search string X_R, e.g. 3 2 0 0 0 0 1 1 2 1 0 0 0 0 0 0 5 0 0 0 4 1 0 0 0 0 2 2 1 0]
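Training and scoring can be sketched together. One point is an assumption of this example rather than something stated on the slides: NF is taken to be the largest dot product any valid fingerprint could reach, i.e. the sum of the highest weight within each descriptor's segment. The toy fingerprint uses 3 descriptors with 3 bins each instead of 93 descriptors and 500 bits.

```python
def train_search_string(reference_fps):
    """Position-wise sum of the reference fingerprints (the class-specific search string)."""
    return [sum(bits) for bits in zip(*reference_fps)]


def normalization_factor(search_string, bins_per_descriptor):
    """Assumed NF: best achievable score when exactly one bit per descriptor is set."""
    nf, start = 0, 0
    for width in bins_per_descriptor:
        nf += max(search_string[start:start + width])
        start += width
    return nf


def pdr_similarity(db_fp, search_string, nf):
    """Dot product of a binary DB fingerprint with the weighted search string, scaled by NF."""
    return sum(a * r for a, r in zip(db_fp, search_string)) / nf


# Toy layout: 3 descriptors x 3 bins, five hypothetical reference molecules.
bins_per_descriptor = [3, 3, 3]
reference_fps = [
    [1, 0, 0, 0, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 0, 0],
]
search_string = train_search_string(reference_fps)
nf = normalization_factor(search_string, bins_per_descriptor)

typical_active = [1, 0, 0, 0, 1, 0, 0, 0, 1]   # hits the highest weight in every segment
unrelated = [0, 0, 1, 0, 0, 1, 1, 0, 0]        # hits low-weight positions only
print(pdr_similarity(typical_active, search_string, nf))
print(pdr_similarity(unrelated, search_string, nf))
```

Under this choice of NF, a database compound that falls into the most populated class interval for every descriptor scores 1, and compounds outside the class ranges score close to 0.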


PDR-FP Similarity Searching

• On structurally homogeneous classes, state-of-the-art 2D FPs perform comparably well
• Calculations:
- random sets of 5 active reference compounds
- 2.1 million (inactive) DB molecules ("2D" ZINC)
- selection sets of 100 compounds
- except for PDR-FP, results for best-performing 1-NN or centroid search strategy
- 100 trials (averaged)

                                     Intra-class MACCS Tc*   Potential   Recovery rate, % (by method)
Activity class                       av       max            DB hits     BCI      Daylight   Molprint 2D   PDR-FP
Bradykinin BK2 antagonist            0.56     0.90           17          28.43    23.18      25.78         37.76
HIV protease inhibitors              0.64     0.92           10          83.40    78.40      76.30         76.67
Cyclooxygenase-2 (Cox-2) inhibitors  0.68     0.98           11          88.43    83.36      72.29         73.05
ACE inhibitors                       0.70     0.96           10          83.40    60.40      93.20         80.71

*Intra-class MACCS Tc is used to assess structural diversity

PDR-FP Similarity Searching

• On structurally diverse classes, PDR-FP outperforms other 2D fingerprints
• Calculations:
- random sets of 5 active reference compounds
- 2.1 million (inactive) DB molecules ("2D" ZINC)
- selection sets of 100 compounds
- except for PDR-FP, results for best-performing 1-NN or centroid search strategy
- 100 trials (averaged)

                                     Intra-class MACCS Tc    Potential   Recovery rate, % (by method)
Activity class                       av       max            DB hits     BCI      Daylight   Molprint 2D   PDR-FP
Thrombin inhibitors                  0.53     0.79           43          5.31     6.91       8.39          40.48
Squalene epoxidase inhibitors        0.40     0.82           20          11.75    5.10       19.82         38.43
Neurokinin NK2 antagonists           0.58     0.79           20          7.35     3.70       11.44         41.34
Glucagon receptor antagonists        0.44     0.79           28          4.26     5.71       7.64          32.72
Factor VIIa inhibitors               0.46     0.85           18          2.39     4.50       12.05         28.75

Scaffold Transitions with PDR-FP

[Figure: thrombin inhibitors; reference molecules]

Molecular Complexity and Size Effects

• PDR-FP produces constant bit density
• Calculations are not biased by molecular complexity

[Figure: distributions of PDR-FP similarity values (x-axis, 0 to 1) against absolute frequency in ZINC (y-axis, 0 to 10000) for the six reference molecules M1 to M6]

* Number of bit positions set to "1": 93 for each of M1 to M6

Relevance of Complexity Effects

• Assessing molecular similarity using MACCS keys and the Tversky similarity coefficient
• MACCS keys:
- 166 bits recording 166 structural fragments
- complexity-dependent
• Screen of NCI anti-HIV database
- ca. 40,000 compounds

Chen & Brown. ChemMedChem 2, 180 (2007)

Tversky Similarity

• Tversky similarity coefficient (Tv):
- variable weights on reference and database molecules
- a: number of "1" bits in the bitstring X_A of molecule A
- b: number of "1" bits in the bitstring X_B of molecule B
- c: number of "1" bits shared by A and B
- α: relative weighting factor in [0,1]
- for α = 0.5: symmetrical Dice coefficient

$$\mathrm{Tv}(X_A, X_B, \alpha) = \frac{c}{\alpha(a - c) + (1 - \alpha)(b - c) + c} \;\Leftrightarrow\; \mathrm{Tv}(X_A, X_B, \alpha) = \frac{c}{\alpha(a - b) + b}$$
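A short sketch makes the role of α concrete. The two 8-bit fingerprints are hypothetical: A stands for a complex molecule with six "1" bits, B for a simpler one with three, all of which are shared with A.

```python
def tversky(x_a, x_b, alpha):
    """Tv = c / (alpha * (a - c) + (1 - alpha) * (b - c) + c) for two bit lists."""
    a = sum(x_a)
    b = sum(x_b)
    c = sum(1 for p, q in zip(x_a, x_b) if p and q)
    denom = alpha * (a - c) + (1 - alpha) * (b - c) + c
    # Assumed convention: similarity 0 when the denominator vanishes.
    return c / denom if denom else 0.0


x_a = [1, 1, 1, 1, 1, 1, 0, 0]   # complex molecule A: a = 6
x_b = [1, 1, 1, 0, 0, 0, 0, 0]   # simpler molecule B: b = 3, c = 3

for alpha in (0.0, 0.5, 1.0):
    print(f"alpha = {alpha}: Tv = {tversky(x_a, x_b, alpha):.3f}")
```

At α = 0.5 the value equals the Dice coefficient 2c / (a + b). Because A sets more bits than B, the similarity falls as α rises from 0 to 1, which is the kind of bit-density dependence examined on the following slides.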

Relevance of Complexity Effects

• Optimal hit rates with Tversky coefficient in benchmark calculations
- not at α = 0.5
- at α of around 0.6 to 0.8
• Apparent "asymmetry of chemical similarity"
- why?

Chen & Brown. ChemMedChem 2, 180 (2007)

Activity Classes of Different Avg. Size

Activity   Number of    Number of      MACCS         TGD           PDR-FP
class      compounds    heavy atoms    bit density   bit density   bit density
BEN        57           25.6           30.8%         13.4%         18.6%
CAT        90           32.3           30.2%         20.8%         18.6%
HH2        41           27.6           33.5%         21.8%         18.6%
NNI        50           14.0           20.3%         6.0%          18.6%
TNF        65           31.0           31.7%         19.7%         18.6%
NCI        42687        25.2           25.7%         13.2%         18.6%

NCI: background DB for screening trials

Pairwise Tversky Similarity

• PDR-FP, 500 bits, constant bit density
- no complexity effect

[Figure: average pairwise Tv similarity (y-axis, 0.18 to 0.22) against the Tv parameter α (x-axis, 0 to 1) for NNI vs. NCI, TNF vs. NCI, and HH2 vs. NCI]


• MACCS, 166 bits, fragment-based, complexity-dependent bit density
- apparent asymmetric behavior of Tversky similarity
- consequence: biased search performance

[Figure: average pairwise Tv similarity (y-axis, 0.3 to 0.8) against the Tv parameter α (x-axis, 0 to 1) for NNI vs. NCI, TNF vs. NCI, and HH2 vs. NCI]

Class complexity and bit density (high → low): HH2 > TNF > NCI > NNI


• TGD, 420 bits, 2D pharmacophore pattern, complexity-dependent
- apparent asymmetric behavior of Tversky similarity

[Figure: average pairwise Tv similarity (y-axis, 0.3 to 0.8) against the Tv parameter α (x-axis, 0 to 1) for NNI vs. NCI, TNF vs. NCI, and HH2 vs. NCI]

Class complexity and bit density (high → low): HH2 > TNF > NCI > NNI


• Bit densities:
- HH2, TNF > NCI: similarity values decrease with increasing α, NCI compounds de-selected
- NNI < NCI: similarity values increase, NCI preferentially selected

[Figure, two panels: average Tv similarity (y-axis, 0.3 to 0.8) against the Tv parameter α (x-axis, 0 to 1). Panel "activity class vs. NCI": NNI vs. NCI, TNF vs. NCI, HH2 vs. NCI. Panel "intra-class similarity": NNI vs. NNI, TNF vs. TNF, HH2 vs. HH2]

Complexity Effects

• Apparent asymmetry of similarity searching:
- direct consequence of complexity effects and ensuing differences in fingerprint bit densities
• Optimized reference molecules tend to be larger on average than database compounds:
- α > 0.5 produces highest hit rates (when DB compounds are preferentially de-selected)

Wang, Eckert & Bajorath. ChemMedChem 2, 1037 (2007)

Overcoming Complexity Effects

• Fingerprints with constant bit density
• Similarity metrics that equally weight contributions from "1" and "0" bits

Overcoming Complexity Effects

• Design of a weighted Tversky coefficient (wTv):

$$\mathrm{wTv}(X_A, X_B, \alpha, \beta) = \beta\,\frac{c}{\alpha(a - b) + b} + (1 - \beta)\,\frac{c'}{\alpha(a' - b') + b'}$$

- β: extra parameter as weight on "1" bits
- a', b', c': number of "0" bits in the bitstrings of molecules A and B and "0"s shared by A and B, respectively

Conventional Tv for comparison:

$$\mathrm{Tv}(X_A, X_B, \alpha) = \frac{c}{\alpha(a - b) + b}$$

Wang & Bajorath. J Chem Inf Model 48, 75 (2008)
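The weighted coefficient can be sketched by applying the conventional Tv expression once to the "1" bits and once to the "0" bits. The fingerprints and function names below are hypothetical, and returning 0 for a vanishing denominator is an assumed convention.

```python
def tversky_counts(a, b, c, alpha):
    """Conventional Tv from bit counts: c / (alpha * (a - b) + b)."""
    denom = alpha * (a - b) + b
    return c / denom if denom else 0.0


def weighted_tversky(x_a, x_b, alpha, beta):
    """wTv = beta * Tv on "1" bits + (1 - beta) * Tv on "0" bits."""
    n = len(x_a)
    a, b = sum(x_a), sum(x_b)
    c = sum(1 for p, q in zip(x_a, x_b) if p and q)
    a0, b0 = n - a, n - b                                    # "0" bits in A and B
    c0 = sum(1 for p, q in zip(x_a, x_b) if not p and not q)  # shared "0" bits
    return (beta * tversky_counts(a, b, c, alpha)
            + (1 - beta) * tversky_counts(a0, b0, c0, alpha))


x_a = [1, 1, 1, 1, 1, 1, 0, 0]   # complex molecule A
x_b = [1, 1, 1, 0, 0, 0, 0, 0]   # simpler molecule B

for beta in (1.0, 0.5):
    values = [weighted_tversky(x_a, x_b, alpha, beta) for alpha in (0.0, 0.5, 1.0)]
    print(f"beta = {beta}:", [round(v, 3) for v in values])
```

For β = 1 the function reduces to the conventional Tv, which in this toy case drops from 1.0 to 0.5 as α goes from 0 to 1. For β = 0.5 the "0"-bit term moves in the opposite direction, so the values at α = 0 and α = 1 (0.70 and 0.75) are nearly balanced.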

Weighted Tversky Coefficient

• Calculations:
- activity class: TNF-alpha release inhibitors, highly complex reference molecules
- database: NCI anti-HIV
- average pairwise MACCS wTv similarity between TNF and NCI

[Figure: average similarity (y-axis, 0.4 to 0.9) against the wTv parameter α (x-axis, 0 to 1), with one curve each for β = 0, β = 0.5, and β = 1]

For β = 0.5: equal weight on "1" and "0" bits
⇒ complexity-independent search calculations

Weighted Similarity Search

[Figure: three contour plots of the hit rate (HR) in the top 100 over α and β values (axes up to 1, in steps of 0.1), with HR levels from 0-0.05 to 0.35-0.4. Panels: 50 reference molecules with bit density 18.7%, 50 reference molecules with bit density 25.2%, and 50 reference molecules with bit density 39.2%]

TK inhibitors, MACCS, single ref. mol., avg. hit rates
250 potential hits: avg. bit density 25.2%
42687 NCI: avg. bit density 25.7%

Complex ref. molecules: lowest hit rates, most difficult search scenario

Wang & Bajorath. J Chem Inf Model 48, 75 (2008)

Complexity of Reference Molecules

• Standard similarity search calculations are affected by the complexity of reference molecules
- templates of high complexity (e.g. optimized leads) are vulnerable to varying complexity of DB compounds
- searching against a lower-complexity DB produces on average low similarity values
- for low-complexity templates, similarity value distributions are comparable for DB compounds of varying complexity

[Figure: distribution of Tanimoto similarity values for molecules with different complexity (TK inhibitors); relative frequency against MACCS Tc (x-axis, 0 to 0.8) for ZINC molecules with 18%, 30%, and 42% bit density. Panel (a): reference molecule with 31% bit density. Panel (b): reference molecule with 13% bit density]

Playing with Fingerprints

• Attempting to balance complexity effects in unconventional ways
• Randomly converting "1" bits in fingerprints to "0" should lower their chemical information content and search performance

Random Reduction of Bit Density

• Test calculations:
- five activity classes taken from the MDDR, each with 20 reference molecules (REF) and 80 potential hits (ADC)
- background database of 5000 ZINC compounds (DB)
- MACCS Tanimoto similarity, 20-NN fusion rule, hit rates for top-scoring 100 DB compounds

Activity class   Description
TKI              protein tyrosine kinase inhibitors
RTI              reverse transcriptase inhibitors
PA2              phospholipase A2 inhibitors
LKT              leukotriene antagonists
COX              cyclooxygenase inhibitors
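The random conversion of "1" bits to "0" can be sketched as below. The function name, the rounding rule for the number of bits kept, and the 166-bit example fingerprint are assumptions of this illustration; a seeded generator is used only to make the run repeatable.

```python
import random


def reduce_bit_density(fp, target_density, rng):
    """Randomly switch "1" bits to "0" until about target_density of all bits remain set."""
    on_positions = [i for i, bit in enumerate(fp) if bit]
    n_keep = min(len(on_positions), round(target_density * len(fp)))
    keep = set(rng.sample(on_positions, n_keep))   # positions that stay at "1"
    return [1 if i in keep else 0 for i in range(len(fp))]


rng = random.Random(0)
fp = [1] * 40 + [0] * 126            # hypothetical 166-bit fingerprint, about 24% density
reduced = reduce_bit_density(fp, 0.12, rng)

print(sum(fp) / len(fp), sum(reduced) / len(reduced))
```

Only existing "1" bits are removed and no new bits are introduced, so each reduced fingerprint is a random subset of the original feature set. Repeating the call with different seeds gives the multiple randomly generated fingerprint versions per reference molecule used in the trials.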


Uniform Bit Density Reduction

• Uniform bit density reduction
- all active and DB compounds selected to produce comparable MACCS bit density, approx. 22-23%
- three gradual reductions in bit density levels carried out
- at each level, 10 random trials performed (with 10 randomly generated fingerprint versions for each reference molecule)

[Figure: average hit rate (%) over five classes (y-axis, 0 to 40) at reduced bit densities of REF, ADC, and DB of 7-8%, 12-13%, and 17-18%, compared with the original bit density of 22-23%]

Uniform Bit Density Reduction

• Comparable bit density of REF, ADC, and DB corresponds to virtual absence of complexity effects
• Bit density reduction in fingerprints of all molecules leads to loss in chemical information content and lower search performance


Random Reduction of REF Bit Density

• Increasingly complex reference molecules
- for each class, three sets of REF with increasing bit density are selected
- for each set, three gradual reductions in bit density levels
- at each level, 10 random trials (with 10 randomly generated fingerprint versions for each reference molecule)

Set    Bit density
RS1    22-23% (DB)
RS2    27-28%
RS3    30-33%
RS4    38-41%

[Figure, four panels: avg. HR (%) against bit density, from 7-8% in steps up to the original bit density of each set. RS1: up to 22-23% (original), y-axis 0 to 40. RS2: up to 28-29% (original), y-axis 0 to 25. RS3: up to 30-34% (original), y-axis 0 to 10. RS4: up to 39-41% (original), y-axis 0 to 6]

Wang, Geppert & Bajorath. Chem Biol Drug Des 71, 511 (2008)

Random Reduction of REF Bit Density

• Hit rates for unmodified fingerprints decrease in the order of increasing reference set complexity (RS1 > RS2 > RS3 > RS4)
• When REF are more complex than DB, random reduction of REF bit density increases hit rates
• The higher the bit density of REF, the larger the preferred reduction rates become

Conclusions

• Molecular complexity effects generally compromise similarity searching using fingerprints
• PDR-FP is introduced as a fingerprint design with constant bit density that is thus not affected by differences in molecular complexity
• Equifrequent binning of database descriptor value ranges enables compound class-directed training of PDR-FP
• PDR-FP is shown to perform well on compound classes with increasing structural diversity

Conclusions (cont.)

• Apparent asymmetry in similarity searching is a direct consequence of molecular complexity effects
• Complexity effects can also be balanced by introducing similarity metrics that equally weight contributions of "1" and "0" bits, such as wTv
• The use of highly complex reference molecules represents the least promising similarity search scenario
• Even random reduction in fingerprint bit density balances complexity effects and improves hit rates, which outweighs the associated loss in information content

Acknowledgment

Hanna Geppert

Yuan Wang
