Fast Descriptor Calculation for Combinatorial Libraries

Geoff Downs & John Barnard

Sheffield, UK

BCI: Barnard Chemical Information Ltd.

Descriptor Generation for Combinatorial Libraries

Need to calculate structure descriptors for large virtual libraries
• Subset selection is better in property space than in reactant (precursor) space
Full library enumeration, followed by descriptor calculation for single molecules, can be slow
Direct analysis of the Markush representation of a library can offer order-of-magnitude speedups
• gives accurately calculated descriptors for all molecules in the library

Markush Structures

Scaffold plus R-groups
Each R-group alternative shown once (ni alternatives for R-group i)
Convenient for input and display
Markush is O(Σ ni)
• 1 core + 100 R1 + 100 R2 + 100 R3 = 301
Enumeration is O(Π ni)
• 1 core × 100 R1 × 100 R2 × 100 R3 = 1,000,000

Direct Analysis of Markush

Avoid multiple analysis of common/repeated parts
Time and space advantages
• do as much work as possible in Σ ni
• work in Π ni only when absolutely necessary
Generate partial descriptors from individual building blocks, and overlaps between them
Combine these using appropriate logic to form full descriptors for individual products
Applicable where descriptors are “additive” in nature

Two-stage Descriptor Generation from Markush Structures

1. Analyse core and R-group alternatives
• Build intermediate representation of “partial descriptors”
• Some partial descriptors may involve overlap between core and R-group(s)
• O(Σ ni) [Sigma Phase]
2. Assemble “full” descriptor for each individual molecule in library
• Usually simple addition, concatenation or logical OR of partial descriptors
• O(Π ni) [Pi Phase]
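The two phases above can be sketched with molecular weight as the additive descriptor. This is a minimal illustration, not the BCI implementation: the scaffold (a C6H4 ring with two open positions), the substituent lists and all function names are assumptions chosen for the example.

```python
from itertools import product
from math import prod

# Sigma phase output: one partial value per building block.
# Illustrative fragment masses in g/mol (assumed example data).
CORE_MW = 76.098  # C6H4 scaffold with two open positions
R_GROUPS = {
    "R1": {"H": 1.008, "CH3": 15.035, "Cl": 35.45},
    "R2": {"H": 1.008, "CH3": 15.035, "Cl": 35.45},
}

def sigma_cost(r_groups):
    """Building blocks analysed once each: 1 core + sum of n_i."""
    return 1 + sum(len(alts) for alts in r_groups.values())

def pi_cost(r_groups):
    """Products that have to be assembled: product of n_i."""
    return prod(len(alts) for alts in r_groups.values())

def pi_phase(core_mw, r_groups):
    """Yield (substituent names, molecular weight) for every product.

    The full descriptor is a simple addition of partial descriptors.
    """
    positions = sorted(r_groups)
    for choice in product(*(r_groups[p].items() for p in positions)):
        names = tuple(name for name, _ in choice)
        yield names, core_mw + sum(mw for _, mw in choice)

weights = dict(pi_phase(CORE_MW, R_GROUPS))
```

Here `sigma_cost` grows with the sum of the R-group list lengths, while `pi_cost` grows with their product, which is why as much work as possible is pushed into the first phase.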

Descriptors from Markush

Previously described structure fingerprint generation
• based on dictionary of predefined fragments
• “Partial” fingerprints for relevant building blocks ORed together for each specific structure
More recent work on calculation of property values
• “Lipinski” properties
• topological indices
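The fingerprint assembly described above (partial fingerprints ORed together) might look like the following sketch. The 8-bit fingerprints, the building-block names and the function `product_fingerprints` are hypothetical; a real fragment dictionary would set many more bits, and fragments that span a block boundary would need their own overlap partials.

```python
from itertools import product

# Hypothetical 8-bit partial fingerprints: one bit per dictionary fragment.
CORE_FP = 0b0000_0011
R1_FPS = {"a": 0b0000_0100, "b": 0b0000_1001}
R2_FPS = {"x": 0b0001_0000, "y": 0b0010_0100}

def product_fingerprints(core_fp, *r_group_fps):
    """Pi phase: OR the core bits with one partial fingerprint per R-group."""
    result = {}
    for combo in product(*(fps.items() for fps in r_group_fps)):
        fp = core_fp
        for _, partial in combo:
            fp |= partial
        result[tuple(name for name, _ in combo)] = fp
    return result

fps = product_fingerprints(CORE_FP, R1_FPS, R2_FPS)
```

Because OR is idempotent, a fragment present in several building blocks sets its bit once, so no correction is needed when partial fingerprints share bits.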

Markush Analysis Software

[Diagram: data flow through the Markush analysis software. Labels recovered from the diagram:]
• Markush Input: cSLN, RG File
• Reaction and Precursor Input: MTZ File (SMIRKS/SMILES), Diversity Explorer Exchange File (SMIRKS/SMILES)
• Main internal Markush representation (EFCL)
• Fragment dictionary, SlogP atom types
• Partial fingerprints, Partial SMILES, Partial property values
• Enumerated Fingerprints, Enumerated SMILES, Enumerated property values
• Centroid/Modal Fingerprint, Cluster Numbers

Internal Markush Representation

Data structure held in memory only while needed for analysis
• Separate building blocks (“partial structures”) with logical relationships
• Several (non-independent) substituent groups may be included in a single structural variable
Can be built from various input formats
• “Markush-type” input (e.g. RGfile, cSLN) imported directly
• Generic reaction and precursor input is more complex
Representation may be “optimised” for efficient processing

Reaction/Precursor input

Build Markush incrementally, one reaction step at a time

Each step modifies core and adds an R-group (clipped reagents)

Input modules based on Daylight reaction toolkit implemented
• http://www.daylight.com/meetings/mug00/Barnard

Module based on Accord SDK under development

SMILES Enumeration

Markush analysis can be used for fast enumeration of non-canonical SMILES for library members
Based on SMILES trick: “C1.C1” is equivalent to “CC”
• dot separates the two carbon atoms
• “ring closure” numerals join them up again
Sigma Phase:
• Generate Partial SMILES for each Partial Structure
• Use unsatisfied ring closure numeral for bonds outside the Partial Structure
Pi phase:
• Concatenate Partial SMILES from each relevant PS

SMILES Enumeration

core . R1 . R2
O=C%12Nc1ccc(ccc1)C(=O)OC%11 . C%12 . [H]%11
O=C%12Nc1ccc(ccc1)C(=O)OC%11 . C%12 . Cl%11
O=C%12Nc1ccc(ccc1)C(=O)OC%11 . C%12Br . [H]%11
O=C%12Nc1ccc(ccc1)C(=O)OC%11 . C%12Br . Cl%11

Sigma phase: fast generation of partial SMILES
• < 0.04s for 100x100x100 = 1M benzodiazepine library
Pi phase: simple concatenation of partial SMILES
• 38,775 structures per sec (SGI R10k)
• Producing canonical SMILES slows down enumeration by factor of 45
o Each individual molecule must be separately canonicalised
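The concatenation step can be sketched in a few lines. The partial SMILES below are an assumed toy library (a benzene ring with two para attachment points), not the benzodiazepine library from the timings, and `enumerate_smiles` is a hypothetical helper name.

```python
from itertools import product

# Partial SMILES (Sigma phase output). The %11 and %12 ring-closure
# numerals are left unsatisfied inside each block on purpose; they pair
# up once the blocks are concatenated with dots.
CORE = "c1cc%11ccc1%12"      # benzene ring with two open attachment points
R1 = ["[H]%11", "C%11"]      # hydrogen or methyl
R2 = ["[H]%12", "Cl%12"]     # hydrogen or chlorine

def enumerate_smiles(core, *r_groups):
    """Pi phase: join the core and one partial SMILES per R-group with dots."""
    for combo in product(*r_groups):
        yield ".".join((core, *combo))

library = list(enumerate_smiles(CORE, R1, R2))
```

The resulting strings are valid but non-canonical; turning them into canonical SMILES would mean parsing each product with a chemistry toolkit, which is the per-molecule cost noted above.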

Lipinski Property Generation

Molecular weight
• trivial addition of partial molecular weights
Count of aromatic rings
• addition of partial counts
o optimisation of internal representation ensures that aromatic rings are not split between building blocks
Hydrogen bond donor/acceptor counts
• Partial counts may depend on combination of more than one R-group (e.g. where H is an alternative)
• “Overlap” terms (combinations of building blocks) may need to be included in addition
• HBD/HBA definitions can be customised
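An overlap term of the kind mentioned above can be illustrated with a toy case. The assumptions here are all hypothetical: a scaffold nitrogen that carries both R1 and R2, donors counted per atom (an N or O bearing at least one H), and the substituent lists shown. Since the slide notes that HBD/HBA definitions can be customised, this is only one possible definition.

```python
from itertools import product

# Hypothetical partial donor counts for the substituents on their own.
R1_HBD = {"H": 0, "CH3": 0, "CH2OH": 1}
R2_HBD = {"H": 0, "CH3": 0}
CORE_HBD = 0  # fixed donors elsewhere in the core (none in this toy case)

def nitrogen_overlap(r1, r2):
    """Overlap term: the shared N is a donor atom if either substituent is H.

    This cannot be assigned to R1 or R2 alone, because it depends on both.
    """
    return 1 if "H" in (r1, r2) else 0

def hbd_counts():
    """Pi phase: add partial counts plus the overlap term for each product."""
    return {
        (r1, r2): CORE_HBD + R1_HBD[r1] + R2_HBD[r2] + nitrogen_overlap(r1, r2)
        for r1, r2 in product(R1_HBD, R2_HBD)
    }

counts = hbd_counts()
```

Assigning one donor to each H alternative independently would count the nitrogen twice when both R1 and R2 are H, which is why the term is keyed on the combination.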

Lipinski Property Generation

Rotatable bond counts
• Some complexities for bonds between core and R-group
• RB = any single bond except to a terminal atom (H, Cl etc.) or a terminal group (CH3, NO2 etc.)
• In example, R1 to ring single bond is not rotatable when R1 is CH3, or when R2 and R3 are identical terminal atoms

[Figure: example structure with a ring bearing R1, where R1 = *CH3 or *N(R2)(R3); * marks the attachment point]

Lipinski Property Generation

logP
• Used SlogP atom-contribution method
o Wildman & Crippen, JCICS 1999, 39, 868-873
o 68 atom types (+ 4 supplemental) defined as SMARTS patterns
• Atom types redefined as BCI Fragment Dictionary, e.g. [CH3][(N,O,S,P,F,Cl,Br,I)] => C as X
o 797 fragments (644 AA + 26 AS + 131 direct assignment)
o Charged N,O and intermediate C, X, Y atom types
• Some atom types require examination of neighbouring building blocks

Lipinski Generation Timings

100 × 100 × 100 = 1M benzodiazepine library
SGI R10000
Sigma phase:
• calculation of partial property values
• <0.04s
Pi phase:
• assembly and output of full property values
• 95.59s for all 1M molecules
• 10,461 molecules/s

Topological Index Generation

Many topological indices are based on summing the terms for small parts of structure
• Simple extra calculation needed at end for some indices
• Several implemented (others under development)
o Kier Chi connectivity indices; any order
o Counts of different subgraph types; any order
o Kier Kappa and Phi shape indices
o Zagreb index
o (Wiener index and Balaban (JX, JY) indices)
• Hosoya Index not amenable to Markush approach
o Requires analysis of full molecule
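As a small worked case of "summing the terms for small parts of structure", the sketch below treats the first Zagreb index, taken here as the sum of squared vertex degrees of the hydrogen-suppressed graph. It assumes every R-group alternative contains at least one heavy atom, so that each attachment point always gains exactly one bond; an H alternative would need an overlap correction. The building blocks and the name `partial_zagreb` are illustrative only.

```python
from collections import Counter
from itertools import product

def partial_zagreb(edges, link_atoms):
    """Sum of squared degrees for one building block.

    edges: bonds inside the block, as (atom, atom) pairs.
    link_atoms: one entry per bond leading to a neighbouring block;
    each adds 1 to that atom's degree (assumes no R-group is bare H).
    """
    degree = Counter(link_atoms)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return sum(d * d for d in degree.values())

# Sigma phase: one partial value per building block.
CORE = partial_zagreb([(0, 1)], [0, 1])  # -CH2-CH2- with two open sites
R1 = {"methyl": partial_zagreb([], [0]),
      "ethyl": partial_zagreb([(0, 1)], [0])}
R2 = {"methyl": partial_zagreb([], [0]),
      "isopropyl": partial_zagreb([(0, 1), (0, 2)], [0])}

# Pi phase: simple addition per product.
zagreb = {(a, b): CORE + R1[a] + R2[b] for a, b in product(R1, R2)}
```

For methyl plus methyl the product is butane, whose degrees 1, 2, 2, 1 give 10 directly, matching the sum of the three partial values (8 + 1 + 1).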

Kier Index Generation

Sigma Phase
• Identify all subgraphs up to n bonds (n is maximum index order)
• Count number of subgraphs of different types, and calculate contributions to Chi indices
Pi Phase
• Sum appropriate subgraph counts and index contributions for each molecule
• Kappa and Phi shape indices calculated from low-order subgraph counts
Sigma phase is significantly slower than for Lipinski properties and fingerprints
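The subgraph identification and Chi contributions can be illustrated for path subgraphs of a single specific structure. The sketch computes simple (non-valence) chi path indices, where each path contributes the product of 1/sqrt(degree) over its atoms. It does not show the harder part of the Sigma phase, which is handling paths that cross building-block boundaries; the function name and the edge-list input format are assumptions.

```python
from math import sqrt

def chi_path_indices(edges, max_order):
    """Path counts and simple chi path indices for orders 0..max_order.

    edges: bonds of a hydrogen-suppressed graph with integer atom labels.
    """
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    weight = {v: 1 / sqrt(len(nbrs)) for v, nbrs in adj.items()}

    counts = [0] * (max_order + 1)
    chi = [0.0] * (max_order + 1)

    def extend(path, term):
        order = len(path) - 1
        # Every path is reached from both ends; count one orientation only.
        if order == 0 or path[0] < path[-1]:
            counts[order] += 1
            chi[order] += term
        if order < max_order:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt], term * weight[nxt])

    for v in adj:
        extend([v], weight[v])
    return counts, chi

# 2-methylbutane: C1-C2(-C5)-C3-C4
counts, chi = chi_path_indices([(1, 2), (2, 5), (2, 3), (3, 4)], max_order=2)
```

The depth-first extension visits every path up to the maximum order, so the running time follows the number of subgraphs, which is the growth the timing slides go on to show.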

Chi Index Sigma Phase Timings

[Figure: Sigma-phase time in seconds (log scale, 0.01 to 100,000) against maximum subgraph order (0 to 9)]

Calculation of partial subgraph counts up to specified order
100 x 100 x 3 = 30,000 compounds
400MHz Pentium Celeron, 64MB RAM

Slowdown at higher orders – number of subgraphs

Exponential increase in number of subgraphs at higher orders
• Also a problem when handling specific structures
Subgraph Types
• P (Path) – nodes have 1 or 2 connections
• C (Cluster) – nodes have 1 or 3 connections
• PC (Path/Cluster) – nodes have 1, 2 or 3 connections
• CH (Chain) – subgraph contains a ring
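The four subgraph types listed above can be told apart from the node degrees within the subgraph. The following is a minimal sketch under stated assumptions: the input is a connected subgraph given as an edge list, any node with three or more connections is treated as a branch point (the slide mentions only 3), and the function name is hypothetical.

```python
from collections import Counter

def subgraph_type(edges):
    """Classify a connected subgraph (edge list) as "P", "C", "PC" or "CH"."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # A connected graph with at least as many edges as nodes contains a ring.
    if len(edges) >= len(degree):
        return "CH"
    degrees = set(degree.values())
    has_branch = any(d >= 3 for d in degrees)
    if not has_branch:
        return "P"                          # only 1- and 2-connected nodes
    return "PC" if 2 in degrees else "C"    # branch with or without chain nodes
```

For example, `subgraph_type([(0, 1), (0, 2), (0, 3)])` gives `"C"`, and adding the bond `(3, 4)` turns the same subgraph into `"PC"`, which illustrates how a subgraph can change type as bonds are added.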

Explosion in Number of Subgraphs

[Figure: mean number of subgraphs per molecule (0 to 200) against subgraph order (0 to 9), with separate series for Path, Cluster, Chain and Path-Cluster subgraphs]

Explosion in Number of Subgraphs

[Figure: maximum number of subgraphs in any molecule (0 to 6000) against subgraph order (0 to 9), with separate series for Path, Cluster, Chain and Path-Cluster subgraphs]

Slowdown at higher orders – number of R-groups

Higher order subgraphs can involve core and multiple R-groups
• Order 6 PC can involve all three R-groups
Depends on how well-separated R-groups are

[Figure: scaffold with substituent positions R1, R2 (shown twice) and R3, plus a CH3 group]

R1: 100 alternatives
R2 (symmetrical): 100 alternatives
R3: 3 alternatives

Speeding-up Kier Index Generation

Limit maximum order for subgraph counts and Kier connectivity indices
Avoid identifying PC/CH subgraphs if these indices are not required
• not available yet
• some complications as subgraphs can change type as bonds are added

[Figure: example subgraphs labelled Cluster, Path, Chain (i.e. Ring) and Path-Cluster]

Clustering Library Members

Previously described clustering of library members on the basis of fingerprints
Lipinski properties and topological indices can also be used as basis for clustering
• Descriptors are re-generated from partial descriptors as needed (Pi phase) and need not be stored
K-means relocation method needs O(N) time
• Non-hierarchical clustering method
• Produces high-quality clusters
• User specifies required number of clusters
• Results can depend on random selection of cluster seeds
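A batch form of k-means that re-generates descriptors on every pass, instead of storing them, could be sketched as follows. This is an assumed variant for illustration, not the clustering code used in the software: `descriptors` is a hypothetical zero-argument callable standing in for the Pi phase, and the seeds are passed in explicitly so the run is reproducible.

```python
def kmeans(descriptors, seeds, passes=10):
    """Cluster descriptor vectors without holding them all in memory.

    descriptors: callable returning a fresh iterable of vectors each pass.
    seeds: initial centroids; the result can depend on this choice.
    Each pass costs O(N * k) distance evaluations.
    """
    centroids = [list(s) for s in seeds]
    labels = []
    for _ in range(passes):
        sums = [[0.0] * len(c) for c in centroids]
        sizes = [0] * len(centroids)
        labels = []
        for vec in descriptors():
            nearest = min(
                range(len(centroids)),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(vec, centroids[c])),
            )
            labels.append(nearest)
            sizes[nearest] += 1
            for i, x in enumerate(vec):
                sums[nearest][i] += x
        for c, n in enumerate(sizes):
            if n:  # an empty cluster keeps its previous centroid
                centroids[c] = [s / n for s in sums[c]]
    return labels, centroids

# Toy two-descriptor vectors (assumed data) forming two obvious groups.
POINTS = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(lambda: iter(POINTS), seeds=[POINTS[0], POINTS[3]])
```

Only the cluster labels and the running sums are kept between passes, so memory use does not depend on the descriptor length times the library size.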

Current Work: Library Overlap

Work in progress to identify the overlap between combinatorial libraries
Identify specific compounds in common
• expressed as another Markush structure
“Brute force” algorithm would
• Fully enumerate libraries involved
• Compare lists of (e.g.) canonical SMILES for common members
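The brute-force comparison can be sketched with sets. The assumptions are substantial: each product is represented by a joined string that stands in for a canonical SMILES, and both libraries use the same core with the same site numbering. The second function shows the special case where the overlap is itself a Markush, obtained by intersecting the R-group lists; it is not the reduced-graph algorithm, and all names are hypothetical.

```python
from itertools import product

def enumerate_library(core, *r_groups):
    """Full enumeration: one identifier per product, O(product of n_i)."""
    return {"|".join((core, *combo)) for combo in product(*r_groups)}

def brute_force_overlap(lib_a, lib_b):
    """Enumerate both libraries and intersect the identifier sets."""
    return enumerate_library(*lib_a) & enumerate_library(*lib_b)

def same_core_overlap(lib_a, lib_b):
    """Overlap as a Markush, valid only when cores and sites are identical."""
    core_a, *groups_a = lib_a
    core_b, *groups_b = lib_b
    if core_a != core_b or len(groups_a) != len(groups_b):
        raise ValueError("libraries are not segmented the same way")
    return (core_a, *(sorted(set(a) & set(b)) for a, b in zip(groups_a, groups_b)))

LIB_A = ("core", ["H", "Me", "Et"], ["Cl", "Br"])
LIB_B = ("core", ["Me", "Et", "Pr"], ["Br", "I"])
common = brute_force_overlap(LIB_A, LIB_B)
common_markush = same_core_overlap(LIB_A, LIB_B)
```

The shortcut fails as soon as the two libraries draw the boundary between scaffold and R-group differently, whereas the brute-force route still works if the identifiers are truly canonical.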

Library Overlap

Markush algorithm originally designed for structure search in chemical patents
• uses “reduced graph” representation of Markush
o avoids “segmentation problem” (different boundaries between R-group and scaffold)
• eliminates non-matching parts very rapidly
• slower (atom-by-atom) check to confirm matches
o worst case is matching library against itself
Implementation in software toolkit form
• can be incorporated into users’ software
• could form basis for Markush Registration/Search system

Potential Future Work: 3D Conformation Generation

Preliminary discussions with Gasteiger group (Univ. Erlangen) on linking Markush approach with CORINA
CORINA works by
• separating cyclic and acyclic components
• establishing conformation for each independently
• linking them back together
• checking and adjusting for steric crowding
Some analogies with Markush approach
• First two steps are equivalent to Sigma phase
• Last two steps are equivalent to Pi phase

References

Barnard, J. M.; Downs, G. M.; von Scholley-Pfab, A.; Brown, R. D., “Use of Markush structure analysis techniques for descriptor generation and clustering of large combinatorial libraries.” J. Mol. Graph. Modelling 2000, 18 (4/5), 452-463
Reactions Markush (Daylight MUG00 meeting) http://www.daylight.com/meetings/mug00/Barnard

P.S. we are recruiting too…

http://www.bci1.demon.co.uk

Copyright © Barnard Chemical Information Ltd., 2001
