Fast Descriptor Calculation for
Combinatorial Libraries
Geoff Downs & John Barnard
Sheffield, UK
BCI: Barnard Chemical Information Ltd.
Descriptor Generation for Combinatorial Libraries
Need to calculate structure descriptors for large virtual libraries
• Subset selection better in property space than in reactant (precursor) space
Full library enumeration, followed by descriptor calculation for single molecules, can be slow
Direct analysis of Markush representation of library can offer order-of-magnitude speedups
• gives accurately-calculated descriptors for all molecules in the library
Markush Structures
Scaffold plus R-groups
Each R-group alternative (n_i) shown once
Convenient for input and display
Markush is O(Σ n_i)
• 1 core + 100 R1 + 100 R2 + 100 R3 = 301
Enumeration is O(Π n_i)
• 1 core × 100 R1 × 100 R2 × 100 R3 = 1,000,000
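The difference between the two sizes can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slide; the variable names are illustrative and not part of any BCI software.

```python
from math import prod

# Number of alternatives at each variable position (R1, R2, R3), as on the slide.
alternatives = [100, 100, 100]

# Markush representation: the core plus each alternative stored once.
markush_size = 1 + sum(alternatives)

# Full enumeration: one molecule per combination of alternatives.
enumerated_size = prod(alternatives)

print(markush_size, enumerated_size)
```

Adding a fourth position with 100 alternatives would grow the Markush size by 100 but multiply the enumerated size by 100.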
Direct Analysis of Markush
Avoid multiple analysis of common/repeated parts
Time and space advantages
• do as much work as possible in Σ n_i
• work in Π n_i only when absolutely necessary
Generate partial descriptors from individual building blocks, and overlaps between them
Combine these using appropriate logic to form full descriptors for individual products
Applicable where descriptors are “additive” in nature
Two-stage Descriptor Generation from Markush Structures
1. Analyse core and R-group alternatives
• Build intermediate representation of “partial descriptors”
• Some partial descriptors may involve overlap between core and R-group(s)
• O(Σ n_i) [Sigma Phase]
2. Assemble “full” descriptor for each individual molecule in library
• Usually simple addition, concatenation or logical OR of partial descriptors
• O(Π n_i) [Pi Phase]
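A minimal sketch of the two stages, using molecular weight as the additive descriptor. The building blocks, their element counts and the function names are hypothetical and chosen only for illustration; overlap terms between core and R-groups are not needed for this particular descriptor.

```python
from itertools import product

ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "Cl": 35.45}

# Hypothetical building blocks, each given as element counts.
core = {"C": 7, "H": 4, "N": 1, "O": 1}
r_groups = [
    [{"H": 1}, {"C": 1, "H": 3}],               # R1 alternatives
    [{"H": 1}, {"Cl": 1}, {"O": 1, "H": 1}],    # R2 alternatives
]

def partial_weight(block):
    """Partial descriptor for one building block."""
    return sum(ATOMIC_MASS[el] * n for el, n in block.items())

def sigma_phase(core, r_groups):
    """Visit every building block once: work proportional to the sum of n_i."""
    return partial_weight(core), [[partial_weight(b) for b in alts] for alts in r_groups]

def pi_phase(core_w, partials):
    """One cheap addition per product: work proportional to the product of n_i."""
    for combo in product(*partials):
        yield core_w + sum(combo)

core_w, partials = sigma_phase(core, r_groups)
weights = list(pi_phase(core_w, partials))
print(len(weights), round(weights[0], 3))
```

The expensive step (here a trivial formula lookup, in practice structure analysis) runs once per building block, and the per-molecule step is reduced to additions.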
Descriptors from Markush
Previously described structure fingerprint generation
• based on dictionary of predefined fragments
• “Partial” fingerprints for relevant building blocks ORed together for each specific structure
More recent work on calculation of property values
• “Lipinski” properties
• topological indices
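The ORing of partial fingerprints can be sketched with Python integers used as bit sets. The fragment dictionary and the fragment assignments below are invented for illustration, and fragments that span the boundary between building blocks (the overlap case) are left out of this sketch.

```python
from functools import reduce
from itertools import product
from operator import or_

# Hypothetical fragment dictionary: fragment name -> bit position.
FRAGMENT_BITS = {"C=O": 0, "aromatic ring": 1, "N-H": 2, "C-Cl": 3, "O-H": 4}

def partial_fingerprint(fragments):
    """Set one bit per dictionary fragment found in a building block."""
    return reduce(or_, (1 << FRAGMENT_BITS[f] for f in fragments), 0)

# Sigma phase: one partial fingerprint per building block.
core_fp = partial_fingerprint(["C=O", "aromatic ring"])
r1_fps = [partial_fingerprint(f) for f in (["N-H"], ["C-Cl"])]
r2_fps = [partial_fingerprint(f) for f in (["O-H"], ["C-Cl", "O-H"])]

# Pi phase: OR the relevant partial fingerprints for each specific structure.
library_fps = [core_fp | a | b for a, b in product(r1_fps, r2_fps)]
print([format(fp, "05b") for fp in library_fps])
```

Because OR is idempotent, a fragment present in two building blocks sets its bit only once, which is why the last two products share a fingerprint here.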
Markush Analysis Software
[Flow diagram; box labels: Markush Input; Reaction and Precursor Input; cSLN; RG File; MTZ File (SMIRKS/SMILES); Diversity Explorer Exchange File (SMIRKS/SMILES); Main internal Markush representation (EFCL); Fragment dictionary; SlogP atom types; Partial fingerprints; Enumerated Fingerprints; Centroid/Modal Fingerprint; Cluster Numbers; Partial SMILES; Enumerated SMILES; Partial property values; Enumerated property values]
Internal Markush Representation
Data structure held in memory only while needed for analysis
• Separate building blocks (“partial structures”) with logical relationships
• Several (non-independent) substituent groups may be included in a single structural variable
Can be built from various input formats
• “Markush-type” input (e.g. RGfile, cSLN) imported directly
• Generic reaction and precursor input is more complex
Representation may be “optimised” for efficient processing
Reaction/Precursor input
Build Markush incrementally, one reaction step at a time
Each step modifies core and adds an R-group (clipped reagents)
Input modules based on Daylight reaction toolkit implemented
• http://www.daylight.com/meetings/mug00/Barnard
Module based on Accord SDK under development
SMILES Enumeration
Markush analysis can be used for fast enumeration of non-canonical SMILES for library members
Based on SMILES trick: “C1.C1” is equivalent to “CC”
• dot separates the two carbon atoms
• “ring closure” numerals join them up again
Sigma Phase:
• Generate Partial SMILES for each Partial Structure
• Use unsatisfied ring closure numeral for bonds outside the Partial Structure
Pi phase:
• Concatenate Partial SMILES from each relevant PS
SMILES Enumeration
core . R1 . R2
O=C%12Nc1ccc(ccc1)C(=O)OC%11 . C%12 . [H]%11
O=C%12Nc1ccc(ccc1)C(=O)OC%11 . C%12 . Cl%11
O=C%12Nc1ccc(ccc1)C(=O)OC%11 . C%12Br . [H]%11
O=C%12Nc1ccc(ccc1)C(=O)OC%11 . C%12Br . Cl%11
Sigma phase: fast generation of partial SMILES
• < 0.04s for 100x100x100 = 1M benzodiazepine library
Pi phase: simple concatenation of partial SMILES
• 38,775 structures per sec (SGI R10k)
• Producing canonical SMILES slows down enumeration by factor of 45
o Each individual molecule must be separately canonicalised
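The Pi-phase concatenation is simple enough to sketch in plain Python. The R-group partial SMILES are the ones shown on the slide; the core string is a shortened, hypothetical stand-in, and the function name is an assumption made for this example.

```python
from itertools import product

# Partial SMILES with unsatisfied ring-closure numerals (%11, %12)
# marking the bonds that leave each partial structure.
core = "O=C%12OC%11"              # hypothetical simplified core
r1 = ["C%12", "C%12Br"]           # R1 alternatives from the slide
r2 = ["[H]%11", "Cl%11"]          # R2 alternatives from the slide

def enumerate_smiles(core, *r_groups):
    """Pi phase: join one partial SMILES per position with dots."""
    for combo in product(*r_groups):
        yield ".".join((core, *combo))

library = list(enumerate_smiles(core, r1, r2))
for smiles in library:
    print(smiles)
```

The output is valid but non-canonical SMILES: the dots keep the parts textually separate while the matching numerals close the bonds between them, so no structure perception is needed per molecule.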
Lipinski Property Generation
Molecular weight
• trivial addition of partial molecular weights
Count of aromatic rings
• addition of partial counts – optimisation of internal representation ensures that aromatic rings are not split between building blocks
Hydrogen bond donor/acceptor counts
• Partial counts may depend on combination of more than one R-group (e.g. where H is an alternative)
• “Overlap” terms (combinations of building blocks) may need to be included in addition
• HBD/HBA definitions can be customised
Lipinski Property Generation
Rotatable bond counts
• Some complexities for bonds between core and R-group
• RB = any single bond except to a terminal atom (H, Cl etc.) or a terminal group (CH3, NO2 etc.)
• In example, R1 to ring single bond is not rotatable when R1 is CH3 or R2 and R3 are identical terminal atoms
[Structure diagram: ring containing N, with substituent R1; R1 alternatives shown include CH3 and a group *N(R2)R3]
Lipinski Property Generation
logP
• Used SlogP atom-contribution method
o Wildman & Crippen, JCICS 1999, 39, 868-873
o 68 atom types (+ 4 supplemental) defined as SMARTS patterns
• Atom types redefined as BCI Fragment Dictionary, e.g. [CH3][(N,O,S,P,F,Cl,Br,I)] => C as X
o 797 fragments (644 AA + 26 AS + 131 direct assignment)
o Charged N,O and intermediate C, X, Y atom types
• Some atom types require examination of neighbouring building blocks
Lipinski Generation Timings
100 × 100 × 100 = 1M benzodiazepine library
SGI R10000
Sigma phase:
• calculation of partial property values
• <0.04s
Pi phase:
• assembly and output of full property values
• 95.59s for all 1M molecules
• 10,461 molecules/s
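The quoted throughput follows directly from the library size and the Pi-phase time. The short check below also derives two figures that the slides imply but do not state (total time for SMILES enumeration, and the rate when canonicalising), so those two should be read as back-of-envelope estimates.

```python
library_size = 100 * 100 * 100

# Lipinski properties: Pi-phase time quoted on this slide.
pi_phase_seconds = 95.59
property_rate = library_size / pi_phase_seconds    # molecules per second

# SMILES enumeration: rate quoted on the earlier SMILES slide.
smiles_rate = 38_775                               # structures per second
smiles_seconds = library_size / smiles_rate        # derived estimate
canonical_rate = smiles_rate / 45                  # derived from the "factor of 45"

print(round(property_rate), round(smiles_seconds, 1), round(canonical_rate))
```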
Topological Index Generation
Many topological indices are based on summing the terms for small parts of structure
• Simple extra calculation needed at end for some indices
• Several implemented (others under development)
o Kier Chi connectivity indices; any order
o Counts of different subgraph types; any order
o Kier Kappa and Phi shape indices
o Zagreb index
o (Wiener index and Balaban (JX, JY) indices)
• Hosoya Index not amenable to Markush approach
o Requires analysis of full molecule
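As an illustration of an index built from small local terms, the sketch below computes the first Zagreb index (sum of squared vertex degrees on the hydrogen-suppressed graph) from building blocks. The graph encoding, the convention that atom 0 is the attachment point, and the function names are assumptions for this example, not the BCI implementation. The only overlap term needed is the change in degree at the two attachment atoms.

```python
def degrees(n_atoms, bonds):
    deg = [0] * n_atoms
    for a, b in bonds:
        deg[a] += 1
        deg[b] += 1
    return deg

def zagreb(n_atoms, bonds):
    """First Zagreb index: sum of squared vertex degrees."""
    return sum(d * d for d in degrees(n_atoms, bonds))

def sigma_phase(block):
    """Partial index and attachment-atom degree (atom 0 is the attachment point)."""
    n_atoms, bonds = block
    return zagreb(n_atoms, bonds), degrees(n_atoms, bonds)[0]

def pi_phase(core_part, r_part):
    """Join core and R-group by one bond between their attachment atoms."""
    (m_core, d_core), (m_r, d_r) = core_part, r_part
    # The new bond raises each attachment degree d to d + 1,
    # and (d + 1)**2 - d**2 == 2*d + 1.
    return m_core + m_r + (2 * d_core + 1) + (2 * d_r + 1)

core = (3, [(0, 1), (1, 2)])        # propyl chain, attached at an end atom
r_alternatives = [
    (1, []),                        # methyl
    (2, [(0, 1)]),                  # ethyl
    (3, [(0, 1), (0, 2)]),          # isopropyl, attached at the central atom
]

core_part = sigma_phase(core)
library = [pi_phase(core_part, sigma_phase(r)) for r in r_alternatives]
print(library)
```

The three results match what a direct calculation on butane, pentane and 2-methylpentane gives, without any of the joined graphs being built.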
Kier Index Generation
Sigma Phase
• Identify all subgraphs up to n bonds (n is maximum index order)
• Count number of subgraphs of different types, and calculate contributions to Chi indices
Pi Phase
• Sum appropriate subgraph counts and index contributions for each molecule
• Kappa and Phi shape indices calculated from low-order subgraph counts
Sigma phase is significantly slower than for Lipinski properties and fingerprints
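To show what a subgraph contribution looks like, here is a sketch that enumerates path subgraphs of a given order in a single hydrogen-suppressed graph and sums their contributions to the path Chi index. It uses simple (non-valence) delta values, works on a whole molecule rather than on Markush building blocks, and the function names are illustrative; it does not handle an isolated single-atom graph.

```python
from math import prod, sqrt

def paths(adj, order):
    """Yield each simple path with `order` bonds once, as a tuple of atoms."""
    def extend(path):
        if len(path) == order + 1:
            if order == 0 or path[0] < path[-1]:   # drop the reversed duplicate
                yield tuple(path)
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                yield from extend(path + [nxt])
    for start in adj:
        yield from extend([start])

def chi_path(adj, order):
    """Sum over paths of the product of delta**-0.5 for the atoms in the path."""
    return sum(prod(1 / sqrt(len(adj[a])) for a in p) for p in paths(adj, order))

# 2-methylbutane: chain 0-1-2-3 with atom 4 branching from atom 1.
adj = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}

for order in range(3):
    print(order, len(list(paths(adj, order))), round(chi_path(adj, order), 4))
```

Each path contributes an independent term, which is what makes it possible to compute most terms once per building block and only sum them per molecule.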
Chi Index Sigma Phase Timings
[Plot: Time/s (log scale, 0.01 to 100,000) against Maximum Subgraph Order (0 to 9)]
Calculation of partial subgraph counts up to specified order
100 x 100 x 3 = 30,000 compounds
400MHz Pentium Celeron, 64MB RAM
Slowdown at higher orders – number of subgraphs
Exponential increase in number of subgraphs at higher orders
• Also a problem when handling specific structures
Subgraph Types
• P (Path) – nodes have 1 or 2 connections
• C (Cluster) – nodes have 1 or 3 connections
• PC (Path/Cluster) – nodes have 1, 2 or 3 connections
• CH (Chain) – subgraph contains a ring
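The four type definitions translate into a small classifier. This sketch assumes the subgraph is connected and given as a list of bonds; the function name and the "other" fallback (for branch points with more than three connections, which the slide's definitions do not cover) are assumptions.

```python
from collections import Counter

def classify_subgraph(bonds):
    """Classify one connected subgraph, given as a list of (atom, atom) bonds."""
    degree = Counter()
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
    # A connected graph with at least as many bonds as atoms contains a ring.
    if len(bonds) >= len(degree):
        return "CH"
    counts = set(degree.values())
    if counts <= {1, 2}:
        return "P"
    if counts <= {1, 3}:
        return "C"
    if counts <= {1, 2, 3}:
        return "PC"
    return "other"

print(classify_subgraph([(0, 1), (1, 2)]))                  # path
print(classify_subgraph([(0, 1), (0, 2), (0, 3)]))          # cluster
print(classify_subgraph([(0, 1), (1, 2), (1, 3), (3, 4)]))  # path/cluster
print(classify_subgraph([(0, 1), (1, 2), (2, 0)]))          # chain (ring)
```

Adding a single bond can move a subgraph from one type to another: extending any arm of the three-bond cluster above turns it into a path/cluster.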
Explosion in Number of Subgraphs
[Plot: Mean number of subgraphs per molecule (0 to 200) against Subgraph Order (0 to 9); series: Path, Cluster, Chain, Path-Cluster]
Explosion in Number of Subgraphs
[Plot: Maximum number of subgraphs in any molecule (0 to 6000) against Subgraph Order (0 to 9); series: Path, Cluster, Chain, Path-Cluster]
Slowdown at higher orders – number of R-groups
Higher order subgraphs can involve core and multiple R-groups
• Order 6 PC can involve all three R-groups
Depends on how well-separated R-groups are
[Structure diagram; labels: R1, R2 (twice), R3, CH3]
R1: 100 alternatives
R2 (symmetrical): 100 alternatives
R3: 3 alternatives
Speeding-up Kier Index Generation
Limit maximum order for subgraph counts and Kier connectivity indices
Avoid identifying PC/CH subgraphs if these indices are not required
• not available yet
• some complications as subgraphs can change type as bonds are added
[Diagram of example subgraph types: Cluster, Path, Chain (i.e. Ring), Path-Cluster]
Clustering Library Members
Previously described clustering of library members on the basis of fingerprints
Lipinski properties and topological indices can also be used as basis for clustering
• Descriptors are re-generated from partial descriptors as needed (Pi phase) and need not be stored
K-means relocation method needs O(N) time
• Non-hierarchical clustering method
• Produces high-quality clusters
• User specifies required number of clusters
• Results can depend on random selection of cluster seeds
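A sketch of how clustering can run without storing full descriptors: each library member is just a tuple of building-block indices, and its descriptor is rebuilt from partial values whenever it is needed. The partial descriptor values are invented, and the loop is the textbook batch k-means, which may differ in detail from the relocation method used in the BCI software.

```python
import random
from itertools import product

# Hypothetical partial descriptors (e.g. weight, donor count) per building block.
CORE = (100.0, 1.0)
R1 = [(1.0, 0.0), (15.0, 0.0), (200.0, 3.0)]
R2 = [(1.0, 0.0), (17.0, 1.0), (210.0, 4.0)]

def descriptor(member):
    """Pi phase: rebuild the full descriptor from partials on demand."""
    i, j = member
    return tuple(c + a + b for c, a, b in zip(CORE, R1[i], R2[j]))

def sq_dist(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))

def kmeans(members, k, seed=0, max_iter=100):
    rng = random.Random(seed)            # the result can depend on these seeds
    centroids = [descriptor(m) for m in rng.sample(members, k)]
    assignment = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda c: sq_dist(descriptor(m), centroids[c]))
               for m in members]
        if new == assignment:            # no member relocated: converged
            break
        assignment = new
        for c in range(k):
            pts = [descriptor(m) for m, a in zip(members, assignment) if a == c]
            if pts:                      # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(col) / len(pts) for col in zip(*pts))
    return assignment

members = list(product(range(len(R1)), range(len(R2))))
clusters = kmeans(members, k=2)
print(dict(zip(members, clusters)))
```

Each pass costs time proportional to the number of members times the number of clusters, and memory holds only the partial descriptors, the centroids and one cluster number per member.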
Current Work: Library Overlap
Work in progress to identify the overlap between combinatorial libraries
Identify specific compounds in common
• expressed as another Markush structure
“Brute force” algorithm would
• Fully enumerate libraries involved
• Compare lists of (e.g.) canonical SMILES for common members
Library Overlap
Markush algorithm originally designed for structure search in chemical patents
• uses “reduced graph” representation of Markush
o avoids “segmentation problem” (different boundaries between R-group and scaffold)
• eliminates non-matching parts very rapidly
• slower (atom-by-atom) check to confirm matches
o worst case is matching library against itself
Implementation in software toolkit form
• can be incorporated into users’ software
• could form basis for Markush Registration/Search system
Potential Future Work: 3D Conformation Generation
Preliminary discussions with Gasteiger group (Univ. Erlangen) on linking Markush approach with CORINA
CORINA works by
• separating cyclic and acyclic components
• establishing conformation for each independently
• linking them back together
• checking and adjusting for steric crowding
Some analogies with Markush approach
• First two steps are equivalent to Sigma phase
• Last two steps are equivalent to Pi phase
References
Barnard, J. M.; Downs, G. M.; von Scholley-Pfab, A.; Brown, R. D. “Use of Markush structure analysis techniques for descriptor generation and clustering of large combinatorial libraries.” J. Mol. Graph. Modelling 2000, 18 (4/5), 452-463
Reactions Markush (Daylight MUG00 meeting) http://www.daylight.com/meetings/mug00/Barnard
P.S. we are recruiting too…
http://www.bci1.demon.co.uk
Copyright © Barnard Chemical Information Ltd., 2001