fingerprint design and molecular complexity effects
TRANSCRIPT
Chemical Similarity Searching
• Long history in pharmaceutical research
• One of the most popular approaches in virtual screening
• Based on the concept of global molecular similarity
• Similarity-property principle (Johnson & Maggiora, 1990):
overall similar molecules are likely to have similar biological activity
[Figure: two similar xanthine oxidase inhibitors]
Molecular Fingerprints
• Bit string representations of molecular structure and properties
• 2D and/or 3D features typically encoded as a vector of binary values
• Reasons for popularity in similarity searching:
- computational efficiency
- surprising effectiveness in detecting active compounds
- often intuitive design
[Figure: molecular fingerprint]
Exemplary Fingerprint Designs
• 2D fragment-based, keyed fingerprints
Pathways: O=CNC, ..., CCN=CC=CCl
Similarity Search
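[Figure: reference molecule(s) R and the screening database enter similarity assessment, which yields a ranked list of database molecules]
$$\mathrm{Tc}(X_R, X_A) = \frac{c}{r + a - c}$$
- r, a: number of "1" bits in the fingerprints X_R of reference molecule R and X_A of database molecule A
- c: number of "1" bits shared by X_R and X_A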
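The Tanimoto coefficient used for ranking can be illustrated with a minimal sketch. It assumes fingerprints are stored as Python sets of "on" bit positions; the function name and the toy fingerprints are illustrative and not taken from the slides.

```python
def tanimoto(bits_r: set, bits_a: set) -> float:
    """Tc = c / (r + a - c) for fingerprints given as sets of on-bit positions."""
    r = len(bits_r)            # "1" bits in the reference fingerprint
    a = len(bits_a)            # "1" bits in the database fingerprint
    c = len(bits_r & bits_a)   # "1" bits shared by both
    denom = r + a - c
    # Two empty fingerprints have no defined Tc; return 0.0 by convention here.
    return c / denom if denom else 0.0


# Toy fingerprints (hypothetical bit positions)
ref = {1, 4, 7, 9}
db = {1, 4, 8}
print(tanimoto(ref, db))  # c=2, r=4, a=3 -> 2/5 = 0.4
```

Ranking a database then amounts to sorting its molecules by this value in descending order.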
Multiple Active Reference Molecules
• Search performance generally increases if multiple active reference compounds are available
• Search strategies for multiple reference compounds include data fusion techniques (1-NN, 5-NN) or the centroid approach
[Figure: a database compound compared with reference molecules under the 1-NN, 5-NN, and centroid strategies; "NN": nearest neighbor]
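The strategies named above can be sketched as follows. This is an illustration under stated assumptions, not the implementation behind the slides: k-NN fusion is taken to average the k highest Tanimoto values against the reference set, and the centroid approach is taken to average the reference bit vectors and compare the result with the database fingerprint using the continuous form of the Tanimoto coefficient. All function names are hypothetical.

```python
def tanimoto(x: set, y: set) -> float:
    c = len(x & y)
    denom = len(x) + len(y) - c
    return c / denom if denom else 0.0


def knn_score(db_fp: set, ref_fps: list, k: int = 1) -> float:
    """Average of the k largest similarities to the reference molecules."""
    sims = sorted((tanimoto(ref, db_fp) for ref in ref_fps), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)


def centroid_score(db_fp: set, ref_fps: list, n_bits: int) -> float:
    """Continuous Tanimoto between the averaged reference vector and db_fp."""
    centroid = [sum(1 for fp in ref_fps if i in fp) / len(ref_fps)
                for i in range(n_bits)]
    x = [1.0 if i in db_fp else 0.0 for i in range(n_bits)]
    dot = sum(p * q for p, q in zip(centroid, x))
    denom = sum(p * p for p in centroid) + sum(q * q for q in x) - dot
    return dot / denom if denom else 0.0


refs = [{0, 1}, {1, 2}]   # two toy reference fingerprints
db = {1}                  # one toy database fingerprint
print(knn_score(db, refs, k=1))          # 0.5
print(centroid_score(db, refs, n_bits=4))
```

With k=1 the score is simply the best match to any single reference molecule, whereas the centroid score rewards bits that are common across the whole reference set.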
Molecular Complexity and Size Effects
• Similarity searching with conventional fingerprints is often biased by molecular complexity or size effects
[Figure: histograms of MACCS Tc similarity (x-axis, 0 to 1) against absolute frequency in ZINC (y-axis, 0 to 8000) for six reference molecules M1 to M6]
Number of bit positions set to "1": M1 20, M2 30, M3 37, M4 41, M5 46, M6 83
Motivation for a Novel Fingerprint Design
• Not sensitive to molecular complexity effects
• Compound class-directed, trainable
• Focus on multiple reference compounds
• Avoid structural fragment descriptors
Fingerprint Design
• Property Descriptor Value Range Derived Fingerprint: PDR-FP
- Constant bit density regardless of molecular size and complexity
- Encodes molecular property descriptors displaying a general tendency to adopt activity class-selective value ranges
Activity-Specific Descriptor Value Ranges
• Identify property descriptors with activity class-selective tendencies
• A descriptor displays an activity class-selective tendency if
- activity classes produce narrow value ranges
- class ranges are only sparsely populated by database molecules
[Figure: relative frequency vs. descriptor value; narrow value ranges of class 1, class 2, and class 3 shown against the descriptor value distribution in a large compound database (DB)]
Assessment of Descriptor Selectivity
• Simple descriptor scoring function to capture the probability that an arbitrary database compound does not match a class value range:
$$\mathrm{DS}(d, R) = 100 \times \Pr\left(X \notin [\mathrm{acMin}_{d,R}, \mathrm{acMax}_{d,R}]\right)$$
- d: descriptor
- R: set of active reference molecules (activity class)
- [acMin_{d,R}, acMax_{d,R}]: value range of d in R
- X: random variable following the value distribution of d in a DB
⇒ DS(d,R): percentage of DB molecules that do not match the value range of R
- Top score 100: no DB molecule matches the class value range
- Lowest score 0: all DB molecules match the class range
- Score 50: 50% of the DB match
- Score 80: 20% of the DB match
Eckert & Bajorath. J Med Chem 49, 2284 (2006)
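As a minimal sketch of the DS score, the probability can be estimated empirically as the fraction of database values that fall outside the closed range spanned by the reference set. The function name and the toy molecular weight values are hypothetical.

```python
def descriptor_score(ref_values, db_values):
    """DS(d, R): percentage of DB values outside [acMin, acMax] of the reference set."""
    ac_min, ac_max = min(ref_values), max(ref_values)
    outside = sum(1 for v in db_values if v < ac_min or v > ac_max)
    return 100.0 * outside / len(db_values)


# Toy example: one descriptor, two reference molecules, ten DB molecules
ref_mw = [300.0, 350.0]
db_mw = [100, 200, 310, 340, 400, 500, 320, 600, 700, 800]
print(descriptor_score(ref_mw, db_mw))  # 3 of 10 DB values match -> 70.0
```

A narrow class range in a sparsely populated region of the database distribution therefore scores close to 100.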
PDR-FP Descriptor Selection
• Screening of a pool of 184 1D/2D descriptors against a panel of 26 activity classes:
- Descriptor pool: d1 ... d184
- Descriptor scoring against activity classes 1 ... 26: remove descriptors scoring consistently <50
- Correlation analysis: remove descriptors from highly correlated pairs
- Descriptors for PDR-FP design: d1 ... d93
• Result: 93 descriptors with a general tendency to adopt activity class-selective value ranges
PDR-FP Design
• Equifrequent binning of descriptor value ranges for a screening database
• Typically 5 or 6 intervals (bits) per descriptor
• Total of 93 descriptors requires 500 bits
[Figure: distribution of molecular weight in ‘2D ZINC’ (x-axis, 100 to 500) split by equifrequent binning into six intervals that each hold 16.7% of the database; bit segments of descriptors d1 ... d93 concatenated into 500 bits]
Eckert & Bajorath. J Chem Inf Model 46, 2515 (2006)
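Equifrequent binning and the resulting one-bit-per-descriptor encoding can be sketched as below. This is a simplified illustration: boundaries are read off the sorted database values, so heavily tied values would give unequal bins, and the helper names are hypothetical rather than part of the published method.

```python
from bisect import bisect_right


def equifrequent_boundaries(db_values, n_bins):
    """Inner boundaries so that each bin holds about 1/n_bins of the DB values."""
    ordered = sorted(db_values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def encode(descriptor_values, boundaries_per_descriptor):
    """Concatenate one segment per descriptor; exactly one bit is set per segment."""
    bits = []
    for value, bounds in zip(descriptor_values, boundaries_per_descriptor):
        segment = [0] * (len(bounds) + 1)
        segment[bisect_right(bounds, value)] = 1  # index of the bin holding value
        bits.extend(segment)
    return bits


# Toy database: 12 molecular weights and 10 values of a second descriptor
mw_db = [100 + 50 * k for k in range(12)]
other_db = list(range(1, 11))
bounds = [equifrequent_boundaries(mw_db, 6), equifrequent_boundaries(other_db, 5)]

fp = encode([250, 9], bounds)
print(fp)       # 6 + 5 = 11 positions
print(sum(fp))  # one "1" per descriptor -> 2
```

Because the bin boundaries come from the screening database alone, every molecule sets the same number of bits no matter how large or complex it is.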
PDR-FP Design
• For every descriptor, exactly one PDR-FP bit is set to “1”
• For every test compound, PDR-FP has 93 bits set to “1”
• Constant bit density of 18.6% independent of molecular size/complexity
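The quoted bit density follows directly from the design figures: 93 descriptors each set one bit in a 500-bit string. The split between 5-bin and 6-bin descriptors shown in the second line is only an illustration of how 93 descriptors could reach 500 bits if each used exactly 5 or 6 bins; the slides say "typically 5 or 6", so the actual split may differ.

```latex
\[
\text{bit density} = \frac{\text{bits set to 1}}{\text{total bits}}
                   = \frac{93}{500} = 0.186 = 18.6\%
\]
\[
58 \times 5 + 35 \times 6 = 290 + 210 = 500, \qquad 58 + 35 = 93
\]
```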
Activity-Oriented Training of PDR-FP
• Bit strings of reference compounds are combined into a “class-specific search string”
• High bit frequencies correspond to significant deviations in value distributions between active and database compounds and represent activity-specific weighting factors
$$x_{iR} = \sum_{j=1}^{5} x_{iR_j}$$
[Figure: the fingerprints X_R1 ... X_R5 of five active reference molecules are summed position by position (activity-oriented training) into the search string X_R]
X_R = 3 2 0 0 0 0 1 1 2 1 0 0 0 0 0 0 5 0 0 0 4 1 0 0 0 0 2 2 1 0
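Training the class-specific search string amounts to a column-wise sum over the reference fingerprints. The sketch below assumes fingerprints are equal-length lists of 0/1 values; the five toy fingerprints cover a single six-bin descriptor and are chosen to reproduce the first six positions of the search string shown on the slide.

```python
def class_search_string(ref_fps):
    """Position-wise sum of the binary reference fingerprints."""
    return [sum(column) for column in zip(*ref_fps)]


# Five toy reference fingerprints for one descriptor with six bins
refs = [
    [1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]
print(class_search_string(refs))  # [3, 2, 0, 0, 0, 0]
```

Since every reference molecule sets exactly one bit per descriptor, the counts within each descriptor segment always add up to the number of reference molecules.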
Similarity Assessment with PDR-FP
• High-frequency bit positions must be emphasized during similarity assessment
• A similarity metric is needed to compare binary (DB molecule) and non-binary (class search string) vectors
• In vector notation, the dot product is used:
$$\mathrm{sim}_{\mathrm{PDR\text{-}FP}}(X_R, X_A) = \frac{1}{\mathrm{NF}} \sum_{i=1}^{500} x_{iR}\, x_{iA}$$
• Database compounds matching many positions with high weights display the activity-specific bit setting and obtain high similarity values
• NF: normalization factor to place similarity values in [0,1]
[Figure: the search string X_R (3 2 0 0 0 0 1 1 2 1 0 0 0 0 0 0 5 0 0 0 4 1 0 0 0 0 2 2 1 0) and the binary fingerprint X_A of a database molecule enter similarity assessment]
Eckert & Bajorath. J Chem Inf Model 46, 2515 (2006)
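The weighted dot product can be sketched as follows. The slides do not define NF explicitly, so this sketch makes a loud assumption: NF is taken to be the highest score any valid fingerprint could reach, that is, the sum of the largest weight within each descriptor segment. It also assumes that the 30 positions of the example search string are five descriptors with six bins each, which is consistent with every group of six summing to 5. Function and variable names are hypothetical.

```python
def pdr_fp_similarity(db_bits, search_string, bins_per_descriptor):
    """Dot product of a binary fingerprint and a weighted search string, scaled to [0, 1]."""
    dot = sum(a * r for a, r in zip(db_bits, search_string))
    # ASSUMPTION: NF = best achievable score with one bit set per descriptor.
    nf = 0
    start = 0
    for width in bins_per_descriptor:
        nf += max(search_string[start:start + width])
        start += width
    return dot / nf if nf else 0.0


search = [3, 2, 0, 0, 0, 0, 1, 1, 2, 1, 0, 0, 0, 0, 0, 0, 5, 0,
          0, 0, 4, 1, 0, 0, 0, 0, 2, 2, 1, 0]
bins = [6] * 5  # assumed segment layout


def fingerprint(on_positions, n_bits=30):
    return [1 if i in on_positions else 0 for i in range(n_bits)]


best = fingerprint({0, 8, 16, 20, 26})   # hits the top weight in every segment
weak = fingerprint({1, 6, 17, 21, 28})   # mostly low-weight positions
print(pdr_fp_similarity(best, search, bins))  # 16/16 = 1.0
print(pdr_fp_similarity(weak, search, bins))  # 5/16 = 0.3125
```

Other normalizations, such as dividing by the number of reference molecules times the number of descriptors, would also map scores into [0,1]; the choice here is only one plausible reading.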
PDR-FP Similarity Searching
• On structurally homogeneous classes, state-of-the-art 2D FPs perform comparably well
• Calculations:
- random sets of 5 active reference compounds
- 2.1 million (inactive) DB molecules (“2D” ZINC)
- selection sets of 100 compounds
- except for PDR-FP, results for best performing 1-NN or centroid search strategy
- 100 trials (averaged)

Activity class                         Intra-class MACCS Tc*   Potential   Recovery rate, %
                                       av      max             DB hits     BCI     Daylight   Molprint 2D   PDR-FP
Bradykinin BK2 antagonist              0.56    0.90            17          28.43   23.18      25.78         37.76
HIV protease inhibitors                0.64    0.92            10          83.40   78.40      76.30         76.67
Cyclooxygenase-2 (Cox-2) inhibitors    0.68    0.98            11          88.43   83.36      72.29         73.05
ACE inhibitors                         0.70    0.96            10          83.40   60.40      93.20         80.71

*Intra-class MACCS Tc is used to assess structural diversity
Eckert & Bajorath. J Chem Inf Model 46, 2515 (2006)
PDR-FP Similarity Searching
• On structurally diverse classes, PDR-FP outperforms other 2D fingerprints
• Calculations:
- random sets of 5 active reference compounds
- 2.1 million (inactive) DB molecules (“2D” ZINC)
- selection sets of 100 compounds
- except for PDR-FP, results for best performing 1-NN or centroid search strategy
- 100 trials (averaged)

Activity class                         Intra-class MACCS Tc    Potential   Recovery rate, %
                                       av      max             DB hits     BCI     Daylight   Molprint 2D   PDR-FP
Thrombin inhibitors                    0.53    0.79            43          5.31    6.91       8.39          40.48
Squalene epoxidase inhibitors          0.40    0.82            20          11.75   5.10       19.82         38.43
Neurokinin NK2 antagonists             0.58    0.79            20          7.35    3.70       11.44         41.34
Glucagon receptor antagonists          0.44    0.79            28          4.26    5.71       7.64          32.72
Factor VIIa inhibitors                 0.46    0.85            18          2.39    4.50       12.05         28.75
Molecular Complexity and Size Effects
• PDR-FP produces constant bit density
• Calculations are not biased by molecular complexity
[Figure: histograms of PDR-FP similarity (x-axis, 0 to 1) against absolute frequency in ZINC (y-axis, 0 to 10000) for reference molecules M1 to M6]
Number of bit positions set to “1”: 93 for each of M1 to M6
Relevance of Complexity Effects
• Assessing molecular similarity using MACCS keys and the Tversky similarity coefficient
• MACCS keys:
- 166 bits recording 166 structural fragments
- complexity-dependent
• Screen of NCI anti-HIV database
- ca. 40,000 compounds
Chen & Brown. ChemMedChem 2, 180 (2007)
Tversky Similarity
• Tversky similarity coefficient (Tv):
$$\mathrm{Tv}(X_A, X_B, \alpha) = \frac{c}{\alpha (a - c) + (1 - \alpha)(b - c) + c} \;\Leftrightarrow\; \mathrm{Tv}(X_A, X_B, \alpha) = \frac{c}{\alpha (a - b) + b}$$
- variable weights on reference and database molecules
- a: number of “1” bits in the bitstring X_A of molecule A
- b: number of “1” bits in the bitstring X_B of molecule B
- c: number of “1” bits shared by A and B
- α: relative weighting factor in [0,1]
- for α = 0.5: symmetrical Dice coefficient
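A minimal sketch of the Tversky coefficient, assuming fingerprints are stored as sets of "on" bit positions (function name and toy fingerprints are illustrative). The example uses a molecule A with twice as many bits as B to show how the value shifts with α.

```python
def tversky(bits_a: set, bits_b: set, alpha: float) -> float:
    """Tv = c / (alpha*(a - c) + (1 - alpha)*(b - c) + c)."""
    a, b = len(bits_a), len(bits_b)
    c = len(bits_a & bits_b)
    denom = alpha * (a - c) + (1 - alpha) * (b - c) + c
    return c / denom if denom else 0.0


mol_a = {0, 1, 2, 3, 4, 5}   # a = 6 bits set (the more complex molecule)
mol_b = {0, 1, 2}            # b = 3 bits set, all shared with A (c = 3)

for alpha in (0.0, 0.5, 1.0):
    print(alpha, tversky(mol_a, mol_b, alpha))
# alpha = 0.0 -> 1.0; alpha = 0.5 -> 2c/(a+b) = 0.666... (Dice); alpha = 1.0 -> 0.5
```

Here the similarity drops as α grows because A sets more bits than B, which is the kind of bit-density dependence examined on the following slides.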
Relevance of Complexity Effects
• Optimal hit rates with the Tversky coefficient in benchmark calculations
- not at α = 0.5
- but for α between 0.6 and 0.8
• Apparent “asymmetry of chemical similarity”
- why?
Chen & Brown. ChemMedChem 2, 180 (2007)
Activity Classes of Different Avg. Size

Activity class   Number of compounds   Number of heavy atoms   MACCS bit density   TGD bit density   PDR-FP bit density
BEN              57                    25.6                    30.8%               13.4%             18.6%
CAT              90                    32.3                    30.2%               20.8%             18.6%
HH2              41                    27.6                    33.5%               21.8%             18.6%
NNI              50                    14.0                    20.3%               6.0%              18.6%
TNF              65                    31.0                    31.7%               19.7%             18.6%
NCI              42687                 25.2                    25.7%               13.2%             18.6%

NCI: background DB for screening trials
Pairwise Tversky Similarity
• PDR-FP, 500 bits, constant bit density
- no complexity effect
[Figure: average similarity (y-axis, 0.18 to 0.22) vs. Tversky α (x-axis, 0 to 1) for NNI vs. NCI, TNF vs. NCI, and HH2 vs. NCI]
Pairwise Tversky Similarity
• MACCS, 166 bits, fragment-based, complexity-dependent bit density
- apparent asymmetric behavior of Tversky similarity
- consequence: biased search performance
[Figure: average similarity (y-axis, 0.3 to 0.8) vs. Tversky α (x-axis, 0 to 1) for NNI vs. NCI, TNF vs. NCI, and HH2 vs. NCI]
Class complexity and bit density (high → low): HH2 > TNF > NCI > NNI
Pairwise Tversky Similarity
• TGD, 420 bits, 2D pharmacophore pattern, complexity-dependent
- apparent asymmetric behavior of Tversky similarity
[Figure: average similarity (y-axis, 0.3 to 0.8) vs. Tversky α (x-axis, 0 to 1) for NNI vs. NCI, TNF vs. NCI, and HH2 vs. NCI]
Class complexity and bit density (high → low): HH2 > TNF > NCI > NNI
Pairwise Tversky Similarity
• Bit densities:
- HH2, TNF > NCI: similarity values decrease with increasing α, NCI compounds de-selected
- NNI < NCI: similarity values increase, NCI preferentially selected
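Both trends can be read off the compact form of the Tversky coefficient by differentiating with respect to α. The worked step below assumes that A is the activity class molecule and B the NCI compound, and treats the bit counts a, b, c as fixed.

```latex
\[
\mathrm{Tv}(X_A, X_B, \alpha) = \frac{c}{\alpha (a - b) + b}
\quad\Longrightarrow\quad
\frac{\partial\,\mathrm{Tv}}{\partial \alpha}
  = -\,\frac{c\,(a - b)}{\bigl(\alpha (a - b) + b\bigr)^{2}}
\]
\[
a > b:\ \frac{\partial\,\mathrm{Tv}}{\partial \alpha} < 0, \qquad
a < b:\ \frac{\partial\,\mathrm{Tv}}{\partial \alpha} > 0, \qquad
a = b:\ \mathrm{Tv} = \frac{c}{b}\ \text{for every } \alpha
\]
```

The first case matches HH2 and TNF (higher bit density than NCI), the second matches NNI, and the third corresponds to a constant bit density fingerprint such as PDR-FP, for which α has no effect.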
[Figure, panel "activity class vs. NCI": average similarity (y-axis, 0.3 to 0.8) vs. Tversky α (x-axis, 0 to 1) for NNI vs. NCI, TNF vs. NCI, and HH2 vs. NCI]
[Figure, panel "intra-class similarity": average similarity (y-axis, 0.3 to 0.8) vs. Tversky α (x-axis, 0 to 1) for NNI vs. NNI, TNF vs. TNF, and HH2 vs. HH2]
Complexity Effects
• Apparent asymmetry of similarity searching:
- direct consequence of complexity effects and ensuing differences in fingerprint bit densities
• Optimized reference molecules tend to be on average larger than database compounds:
- α > 0.5 produces highest hit rates (when DB compounds are preferentially de-selected)
Wang, Eckert & Bajorath. ChemMedChem 2, 1037 (2007)
Overcoming Complexity Effects
• Fingerprints with constant bit density
• Similarity metrics that equally weight contributions from “1” and “0” bits
Overcoming Complexity Effects
• Design of a weighted Tversky coefficient (wTv):
$$\mathrm{wTv}(X_A, X_B, \alpha, \beta) = \frac{\beta\, c}{\alpha (a - b) + b} + \frac{(1 - \beta)\, c'}{\alpha (a' - b') + b'}$$
- β: extra parameter as weight on “1” bits
- a', b', c': number of “0” bits in the bitstrings of molecules A and B, and number of “0” bits shared by A and B, respectively
Conventional Tv for comparison:
$$\mathrm{Tv}(X_A, X_B, \alpha) = \frac{c}{\alpha (a - b) + b}$$
Wang & Bajorath. J Chem Inf Model 48, 75 (2008)
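The weighted coefficient can be sketched by applying the same Tversky term once to the "1" bits and once to the "0" bits. The sketch assumes fingerprints are sets of "on" positions within a fixed length n_bits, so the "0" counts follow by subtraction; names and the toy fingerprints are hypothetical.

```python
def weighted_tversky(bits_a: set, bits_b: set, n_bits: int,
                     alpha: float, beta: float) -> float:
    """wTv = beta * Tv on "1" bits + (1 - beta) * Tv on "0" bits."""
    a, b = len(bits_a), len(bits_b)
    c = len(bits_a & bits_b)
    a0, b0 = n_bits - a, n_bits - b        # "0" bits in A and in B
    c0 = n_bits - len(bits_a | bits_b)     # "0" bits shared by A and B

    def term(shared, count_a, count_b):
        denom = alpha * (count_a - count_b) + count_b
        return shared / denom if denom else 0.0

    return beta * term(c, a, b) + (1 - beta) * term(c0, a0, b0)


mol_a = {0, 1, 2, 3, 4, 5}   # complex molecule: 6 of 10 bits set
mol_b = {0, 1, 2}            # simpler molecule: 3 of 10 bits set

for beta in (0.0, 0.5, 1.0):
    print(beta, weighted_tversky(mol_a, mol_b, n_bits=10, alpha=0.5, beta=beta))
```

With β = 1 the expression reduces to the conventional Tv, and with β = 0 only the shared "0" bits count, so intermediate values trade the two contributions off against each other.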
Weighted Tversky Coefficient
• Calculations:
- activity class: TNF-alpha release inhibitors, highly complex reference molecules
- database: NCI anti-HIV
- average pairwise MACCS wTv similarity between TNF and NCI
[Figure: average similarity (y-axis, 0.4 to 0.9) vs. value of wTv α (x-axis, 0 to 1), with one curve each for β = 0, β = 0.5, and β = 1]
For β = 0.5: equal weight on “1” and “0” bits
⇒ complexity-independent search calculations
Weighted Similarity Search
[Figure: three heat maps of hit rate (HR) in the top 100 as a function of α and β values (0 to 1); HR scale in steps of 0.05 from 0-0.05 to 0.35-0.4; one panel per set of 50 reference molecules with bit density 18.7%, 25.2%, and 39.2%]
TK inhibitors, MACCS, single ref. mol., avg. hit rates
250 potential hits: avg. bit density 25.2%
42687 NCI: avg. bit density 25.7%
Complex ref. molecules: lowest hit rates, most difficult search scenario
Wang & Bajorath. J Chem Inf Model 48, 75 (2008)
Complexity of Reference Molecules
• Standard similarity search calculations are affected by the complexity of reference molecules
- templates of high complexity (e.g. optimized leads) are vulnerable to varying complexity of DB compounds
- searching against a lower-complexity DB produces on average low similarity values
- for low-complexity templates, similarity value distributions are comparable for DB compounds of varying complexity
[Figure: distribution of Tanimoto similarity values for molecules with different complexity (TK inhibitors); relative frequency vs. MACCS Tc (x-axis, 0 to 0.8) for ZINC molecules with 18%, 30%, and 42% bit density; panel (a) reference molecule with 31% bit density, panel (b) reference molecule with 13% bit density]
Playing with Fingerprints
• Attempting to balance complexity effects in unconventional ways
• Randomly converting “1” bits in fingerprints to “0” should lower their chemical information content and search performance
Random Reduction of Bit Density
• Test calculations:
- five activity classes taken from the MDDR, each with 20 reference molecules (REF) and 80 potential hits (ADC)
- background database of 5000 ZINC compounds (DB)
- MACCS Tanimoto similarity, 20-NN fusion rule, hit rates for the top-scoring 100 DB compounds

Activity class   Description
COX              cyclooxygenase inhibitors
LKT              leukotriene antagonists
PA2              phospholipase A2 inhibitors
RTI              reverse transcriptase inhibitors
TKI              protein tyrosine kinase inhibitors
Random Reduction of Bit Density
• Uniform bit density reduction
- all active and DB compounds selected to produce comparable MACCS bit density, approx. 22-23%
- three gradual reductions in bit density levels carried out
- at each level, 10 random trials performed (with 10 randomly generated fingerprint versions for each reference molecule)
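Random reduction of bit density can be sketched as keeping a random subset of the "on" bits until a target density is reached. This is an illustrative reading of the procedure rather than the published protocol: the target is expressed as a fraction of the fingerprint length, and the function name, seed, and toy fingerprint are hypothetical.

```python
import random


def reduce_bit_density(on_bits: set, n_bits: int, target_density: float,
                       rng: random.Random) -> set:
    """Randomly convert "1" bits to "0" until about target_density of n_bits remain set."""
    target_count = round(target_density * n_bits)
    if target_count >= len(on_bits):
        return set(on_bits)  # already at or below the target density
    # sorted() turns the set into a sequence, as rng.sample requires
    return set(rng.sample(sorted(on_bits), target_count))


rng = random.Random(0)
original = set(range(38))  # toy 166-bit fingerprint with 38 bits set (about 22.9%)

# Ten randomly generated reduced versions at one density level, as on the slide
versions = [reduce_bit_density(original, 166, 0.125, rng) for _ in range(10)]
print(len(versions[0]))  # round(0.125 * 166) = 21 bits remain
```

Each reduced fingerprint is a subset of the original, so no new structural information is introduced; only existing bits are discarded.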
[Figure: Uniform Bit Density Reduction; average hit rate (%) over five classes (y-axis, 0 to 40) vs. bit density of REF, ADC, and DB at the reduced levels 7-8%, 12-13%, and 17-18% and at the original level 22-23%]
Uniform Bit Density Reduction
• Comparable bit density of REF, ADC, and DB corresponds to virtual absence of complexity effects
• Bit density reduction in fingerprints of all molecules leads to loss in chemical information content and lower search performance
Random Reduction of Bit Density
• Increasingly complex reference molecules
- for each class, three sets of REF with increasing bit density are selected
- for each set, three gradual reductions in bit density levels
- at each level, 10 random trials (with 10 randomly generated fingerprint versions for each reference molecule)

Set   Bit density
RS1   22-23%
RS2   27-28%
RS3   30-33%
RS4   38-41%

Random Reduction of REF Bit Density
[Figure: avg. HR (%) vs. REF bit density, one panel per reference set RS1 to RS4; each panel runs from the lowest reduced level (7-8%) up to the original bit density of that set]
Wang, Geppert & Bajorath. Chem Biol Drug Des 71, 511 (2008)
Random Reduction of REF Bit Density
• Hit rates for unmodified fingerprints decrease in the order of increasing reference set complexity (RS1 > RS2 > RS3 > RS4)
• When REF are more complex than DB, random reduction of REF bit density increases hit rates
• The higher the bit density of REF, the larger the preferred reduction becomes
Conclusions
• Molecular complexity effects generally compromise similarity searching using fingerprints
• PDR-FP is introduced as a fingerprint design with constant bit density that is thus not affected by differences in molecular complexity
• Equifrequent binning of database descriptor value ranges enables compound class-directed training of PDR-FP
• PDR-FP is shown to perform well on compound classes with increasing structural diversity
Conclusions (cont.)
• Apparent asymmetry in similarity searching is a direct consequence of molecular complexity effects
• Complexity effects can also be balanced by introducing similarity metrics that equally weight contributions of “1” and “0” bits, such as wTv
• The use of highly complex reference molecules represents the least promising similarity search scenario
• Even random reduction in fingerprint bit density balances complexity effects and improves hit rates, which outweighs the associated loss in information content