
1 Chemical Structure Representation and Search Systems

Lecture 4. Nov 11, 2003

John Barnard

Barnard Chemical Information Ltd

Chemical Informatics Software & Consultancy Services

Sheffield, UK

2 New MDL File Formats

Since lecture on Oct 28, MDL have published details of a new file format “XDfile”
• XML-based data format for transferring structure/reaction information with associated data
• built around existing MDL connection table formats
• can incorporate Chime strings (encrypted format used to render structures and reactions on a Web page)
• can incorporate SMILES strings

New download site:
• http://www.mdl.com/downloads/public/ctfile/ctfile.jsp

3 Lecture 4: Topics to be Covered

full structure search
• structure registration systems
• graph isomorphism

algorithm complexity and NP-complete problems

substructure search
• subgraph isomorphism
• screening searches and fingerprints
• substructure query formulation
  o SMARTS

commercial systems

4 Full Structure and Substructure Search

full structure search
• query is a complete molecule
• is this molecule in the database?
  o or tautomers, stereoisomers etc. of it

substructure search
• query is a pattern of atoms and bonds
• does this pattern occur as a substructure of any of the molecules in my database?

superstructure search
• query is a complete molecule
• are any of the molecules in the database substructures of it?

N.B. Some Daylight Chemical Information Systems Inc. documentation uses “substructure” and “superstructure” search in the opposite sense to that given here

5 Full Structure Search

Many databases contain millions of structures, so search speed is important

Simplest approach uses canonical representation for query and database structures (e.g. canonical SMILES)
• could sort database SMILES into alphanumerical order
• search sorted list for match with query

“Hash table” lookup can improve search speed
• calculate hash-code (“idiot number” in predefined range) from SMILES for each database structure
• this is address (disk file or memory) at which full representation is stored
• only SMILES which have same hash code need to be compared
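The lookup described on this slide can be sketched in a few lines of Python. This is an illustration only: it assumes every string is already a canonical SMILES (one unique string per molecule), and the registry numbers and the `bucket` hash function are made up for the example. Python's `dict` is itself a hash table, so the exact-match search is a single lookup.

```python
# Sketch: full-structure search keyed on canonical SMILES.
# Assumption: all strings are already canonical; registry numbers are invented.
DATABASE = {
    "CCO": "REG-0001",       # ethanol
    "CC(=O)O": "REG-0002",   # acetic acid
    "c1ccccc1": "REG-0003",  # benzene
}

def full_structure_search(canonical_smiles, db=DATABASE):
    """Return the registry number of an exact match, or None if absent."""
    return db.get(canonical_smiles)

def bucket(canonical_smiles, n_buckets=8):
    """Toy hash code: a number in a predefined range derived from the string.
    Different strings may share a bucket; only those need a full comparison."""
    return sum(canonical_smiles.encode("ascii")) % n_buckets
```

Two different strings can land in the same bucket (here `"CO"` and `"OC"` do), which is why the full representations sharing a hash code still have to be compared.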

6 Structure Registration Systems

Many chemical and pharmaceutical companies maintain compound “registry” systems
• database of all compounds worked on internally
• may include many compounds never published elsewhere (i.e. not in Chemical Abstracts, Beilstein)
• links to company reports, biological screening data, stock number in compounds store etc.
• links to electronic lab notebooks, LIMS (Lab. Info. Management System), ORACLE database etc.


new compounds need to be added regularly
• used to be done by chemical information specialists
• now frequently done directly by bench chemists

registration system must
• check consistency of input data
  o e.g. compare molecular formula with structure
• check that compound is really new
  o different ways of handling tautomers, salts, stereoisomers etc.
• assign registry number
• add supplementary data (melting point etc.)
• make data immediately available for search


“Public” databases use same principles, adding compounds from published literature
• Chemical Abstracts Registry file
  o links to document where data on molecule was published
• Beilstein Registry file
  o lots of data may be stored with compound, from different data sources; existing records may need updating

Updates for searching may be made available at regular intervals (weekly, monthly, annually, etc.)

9 Graph Isomorphism

In graph theory terms, when two full structures match, their graphs are said to be isomorphic

each node N1 in G1 must be mapped to a node N2 in G2

neighbours of N1 must map to neighbours of N2

10 Graph isomorphism by brute force

for each node in G1
• map it against an unmapped node in G2

check that neighbours of each node map appropriately in the two graphs

if each graph has n nodes there are n! ways of doing this
• n! = n × (n-1) × (n-2) × (n-3) × … × 3 × 2 × 1
• this is a big number if n is anything non-trivial
• 9! = 362 880
• 10! = 3 628 800
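A minimal Python sketch of the brute-force approach follows. It assumes unlabelled, undirected graphs stored as adjacency dictionaries (node to set of neighbours); the function name and representation are choices made for this illustration.

```python
from itertools import permutations
from math import factorial

def brute_force_isomorphic(adj1, adj2):
    """Try every one-to-one node mapping from G1 to G2 (n! candidates)."""
    nodes1, nodes2 = list(adj1), list(adj2)
    if len(nodes1) != len(nodes2):
        return False
    # Each undirected edge appears twice in an adjacency dict, in both graphs.
    if sum(map(len, adj1.values())) != sum(map(len, adj2.values())):
        return False
    for perm in permutations(nodes2):
        mapping = dict(zip(nodes1, perm))
        # Neighbours of each G1 node must map to neighbours of its image in G2.
        if all(mapping[b] in adj2[mapping[a]] for a in adj1 for b in adj1[a]):
            return True
    return False

path_a = {1: {2}, 2: {1, 3}, 3: {2}}
path_b = {"x": {"z"}, "y": {"z"}, "z": {"x", "y"}}
triangle = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
```

Because the edge counts are checked first, a mapping that preserves every edge of G1 is also a one-to-one correspondence of edges. The number of candidate mappings is `factorial(n)`, which is what makes the method impractical beyond very small graphs.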

11 Computational complexity

a measure of how long a computational algorithm will take to run, depending on the size of input
• if you give it twice as much data will it take twice as long to run?

e.g. comparing a word sequentially against each member of a list of words of length n
• time taken depends directly on length of list
• algorithm is O(n) [“order-n”]

e.g. comparing each word in a list of length n with every other word of the same list
• algorithm is O(n^2) [“order-n-squared”]


some algorithms may have complexity O(n^3), O(n^4), O(log n), O(n log n) etc.
• these are all “polynomial” time algorithms

some algorithms have exponential complexity, e.g. O(2^n)
• this is much slower than polynomial

brute-force graph isomorphism is O(n!)
• this is even slower than simple exponential


for some problems you can find more efficient algorithms (lower order of complexity) to do the same thing
• e.g. searching a sorted list
  o simple “sequential” search is O(n)
  o “binary chop” search is O(log n)

for some problems there are no known polynomial-time algorithms
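The difference between sequential search and binary chop on a sorted list can be seen by counting steps. The sketch below is illustrative; the function names and the step counters are additions for this example.

```python
def sequential_search(sorted_items, target):
    """O(n): compare the target with each item in turn."""
    comparisons = 0
    for item in sorted_items:
        comparisons += 1
        if item == target:
            return True, comparisons
    return False, comparisons

def binary_chop(sorted_items, target):
    """O(log n): halve the remaining range at every step."""
    lo, hi = 0, len(sorted_items)
    steps = 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] < target:
            lo = mid + 1      # target can only be in the upper half
        else:
            hi = mid          # target can only be in the lower half (or at mid)
    found = lo < len(sorted_items) and sorted_items[lo] == target
    return found, steps

items = list(range(1024))
```

For a list of 1024 items, finding the last item sequentially takes 1024 comparisons, while binary chop needs no more than 11 halving steps.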

14 NP-complete problems

a class of problems for which no polynomial-time algorithms are known

problems in this class are mathematically “equivalent”
• if a polynomial-time algorithm could be found for one of them, it could be adapted to solve all of them

well-known example is “travelling salesman problem” (shortest path visiting each of several cities)

it is suspected (but not proven) that no polynomial-time algorithms can exist for this class of problems

15 NP-complete problems

graph isomorphism is in NP, but it is not known to be NP-complete (nor is a polynomial-time algorithm known for it)

subgraph isomorphism is a generalisation of graph isomorphism
• nodes in G1 (query structure) must be mapped to subset of nodes in G2 (database structure)
• i.e. G1 is a subgraph of G2

subgraph isomorphism has been proven to be NP-complete
• substructure searching is inherently slow

16 Subgraph isomorphism

NP-completeness of problem means that worst-case match times of all known algorithms are exponential in number of atoms involved

but average-case match times can be better than this

much effort has been expended on this problem over the past 40+ years
• closely-related problems remain an active area of research

17 Speeding up subgraph isomorphism

1. use a faster computer

2. use tricks to avoid exploring potential solutions that are bound to fail

3. do most of the work in a pre-processing of the database structures, independently of the query

18 Speeding up subgraph isomorphism

chemical graphs have several characteristics that allow heuristics (“tricks”) to be used to speed up isomorphism identification
• several different node and edge labels
• low connectivity of each node
• using hydrogen-suppressed graphs reduces size of problem (number of nodes)

these tricks would be of less use for general graphs

additional tricks and algorithms may be used in special cases (e.g. if graphs are trees)

19 Backtracking

modification of the brute-force approach

abandons partial solutions part-way through when it can be seen they are bound to fail

worst-case is still exponential in number of nodes, but doesn’t arise very often
• first map an arbitrary pair of nodes
• then map neighbours of these nodes
• if successful, map neighbours of each neighbour, etc.
• if not, backtrack one step, and try a different mapping

20 Backtracking

algorithm will terminate
• when all query nodes are mapped [MATCH]
• when all alternative mappings for first query node have been tried, and have failed [NO MATCH]

extra tricks can be used for further improvement
• only map nodes with same element type and charge, and compatible bonding patterns
• start with unusual atom types, and nodes with lots of neighbours
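A backtracking substructure matcher along these lines might look like the following Python sketch. It is a simplified illustration, not any published algorithm: atoms carry only an element symbol, bond orders and charges are ignored, and the data layout (`*_atoms` as id to element, `*_adj` as id to set of neighbour ids) is an assumption made for the example.

```python
def subgraph_match(q_atoms, q_adj, t_atoms, t_adj):
    """Return {query atom: target atom} for the first match found, or None."""
    # Heuristic from the slide: start with the most highly connected query atoms.
    order = sorted(q_atoms, key=lambda a: -len(q_adj[a]))

    def extend(i, mapping, used):
        if i == len(order):
            return dict(mapping)                 # all query atoms mapped: MATCH
        q = order[i]
        for t in t_atoms:
            if t in used or t_atoms[t] != q_atoms[q]:
                continue                         # wrong element, or already taken
            if len(t_adj[t]) < len(q_adj[q]):
                continue                         # too few neighbours to match
            # Already-mapped neighbours of q must map to neighbours of t.
            if all(mapping[n] in t_adj[t] for n in q_adj[q] if n in mapping):
                mapping[q] = t
                used.add(t)
                result = extend(i + 1, mapping, used)
                if result is not None:
                    return result
                del mapping[q]                   # backtrack one step
                used.discard(t)
        return None                              # every alternative failed

    return extend(0, {}, set())

# Query: C-O fragment. Target: ethanol (C-C-O), hydrogens suppressed.
query_atoms, query_adj = {0: "C", 1: "O"}, {0: {1}, 1: {0}}
ethanol_atoms = {0: "C", 1: "C", 2: "O"}
ethanol_adj = {0: {1}, 1: {0, 2}, 2: {1}}
```

In the ethanol example the first attempt maps the query carbon to the terminal carbon, fails to find an attached oxygen, backtracks, and succeeds with the second carbon.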

21 Partitioning and Relaxation

often used as an adjunct to backtracking

start by partitioning the nodes into sets of possible correspondents
• e.g. nitrogens can only match nitrogens
• iteratively refine the partition on basis of other possible correspondences
  o e.g. if F6 is only possible correspondent for Q1 then F6 cannot be a correspondent for Q2
• if the list of possible correspondents for a query node becomes empty, there is no isomorphism

22 Partitioning and Relaxation

can also reduce lists of possible correspondents by looking at neighbours
• if F6 is to remain a valid correspondent of Q1, then the neighbours of F6 must be possible correspondents of the neighbours of Q1
• as this check is repeated for each node, we are bringing in information from further away, but only ever looking at immediate neighbours
  o this technique is the same as Morgan’s algorithm for node labelling in canonicalisation
  o it is called relaxation

backtracking can be used as a fallback when no further reductions can be made in the lists of possible correspondents
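The partition-and-refine idea can be sketched as below. This is a simplified illustration under stated assumptions, not Sussenguth's or Ullmann's published procedure: atoms are labelled by element only, and the graphs use the same hypothetical layout as a plain adjacency dictionary (`*_atoms` as id to element, `*_adj` as id to set of neighbour ids).

```python
def relax(q_atoms, q_adj, t_atoms, t_adj):
    """Return {query atom: set of possible target correspondents},
    or None if some set becomes empty (no isomorphism possible)."""
    # Initial partition: same element and at least as many neighbours.
    cand = {q: {t for t in t_atoms
                if t_atoms[t] == q_atoms[q] and len(t_adj[t]) >= len(q_adj[q])}
            for q in q_atoms}
    changed = True
    while changed:
        changed = False
        for q in q_atoms:
            for t in list(cand[q]):
                # Each neighbour of q needs a possible correspondent next to t.
                neighbours_ok = all(cand[qn] & t_adj[t] for qn in q_adj[q])
                # t is unavailable if it is the only option for another query atom.
                taken = any(cand[o] == {t} for o in q_atoms if o != q)
                if not neighbours_ok or taken:
                    cand[q].discard(t)
                    changed = True
        if any(not c for c in cand.values()):
            return None
    return cand

# Query: C-O fragment. Target: ethanol (C-C-O), hydrogens suppressed.
query_atoms, query_adj = {0: "C", 1: "O"}, {0: {1}, 1: {0}}
ethanol_atoms = {0: "C", 1: "C", 2: "O"}
ethanol_adj = {0: {1}, 1: {0, 2}, 2: {1}}
```

For the C-O query against ethanol, relaxation alone removes the terminal carbon as a correspondent, leaving one candidate per query atom. When sets with several candidates remain, a backtracking search over the reduced lists would finish the job.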

23 Subgraph isomorphism algorithms

Ray and Kirsch’s algorithm (1957)
• basic backtracking

Sussenguth’s partitioning algorithm (1965)
• relaxation technique called “connectivity property”, with backtracking as fall-back

Figueras’s set reduction algorithm (1972)

Ullmann’s algorithm (1976)
• efficient relaxation and backtracking

von Scholley’s relaxation algorithm (1984)

24 Screening

so far we’ve considered matching one query substructure against one database full structure
• each structure from the database needs to be compared against the query in turn
• many will fail because they don’t contain the query substructure

“screening” allows many of these to be eliminated before we get to this stage

uses structure “fingerprints” discussed in lecture 3

25 Fingerprints

the fragments present in a structure can be represented as a sequence of 0s and 1s

00010100010101000101010011110100
• 0 means fragment is not present in structure
• 1 means fragment is present in structure (perhaps multiple times)

each 0 or 1 can be represented as a single bit in the computer (a “bitstring”)

for chemical structures often called structure “fingerprints”

26 Screening

build a fingerprint for the query substructure

only those database structures that contain all the fragments in the query can possibly match the query

Query:       00000100010101000001010011010100
DB struct 1: 00010100010101000101010011110100 MATCH
DB struct 2: 00000000100101001001000011100000 NO MATCH

comparing fingerprint bitstrings is very fast (logical AND operation)

only those structures that pass the screening stage need to be considered as candidates for atom-by-atom isomorphism search
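The screening test on this slide can be written as a single bitwise operation. The sketch below holds each fingerprint as a Python integer built from the bitstrings shown above; the function name is chosen for this illustration.

```python
QUERY = int("00000100010101000001010011010100", 2)
DB_STRUCT_1 = int("00010100010101000101010011110100", 2)
DB_STRUCT_2 = int("00000000100101001001000011100000", 2)

def passes_screen(query_fp, structure_fp):
    """True if every bit set in the query is also set in the structure."""
    return (query_fp & structure_fp) == query_fp

# Only structures that pass go on to the atom-by-atom search.
candidates = [name for name, fp in [("DB struct 1", DB_STRUCT_1),
                                    ("DB struct 2", DB_STRUCT_2)]
              if passes_screen(QUERY, fp)]
```

Passing the screen does not prove a match; it only says the structure cannot be ruled out, so the isomorphism search is still needed for the candidates.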

27 Screening

can be made faster by “inverting” the bit strings (actually, turning them on their side)
• instead of a bitstring of fragments for each structure ...
• store a bitstring of structures for each fragment
  o each bit represents a database structure
  o 1=structure contains fragment; 0=structure does not
• search by ANDing together the bitstrings for the fragments present in the query
  o this will list those structures that contain all the query’s fragments
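A small Python sketch of the inverted arrangement follows. It assumes fingerprints are held as integers and that structures are numbered from 0 in database order; both function names are invented for this example.

```python
def invert(fingerprints, n_bits):
    """Turn per-structure fingerprints into one bitstring per fragment.
    Bit s of columns[f] is 1 if structure s contains fragment f."""
    columns = [0] * n_bits
    for s, fp in enumerate(fingerprints):
        for f in range(n_bits):
            if (fp >> f) & 1:
                columns[f] |= 1 << s
    return columns

def screen_inverted(query_fp, columns, n_structures):
    """AND together the columns of the fragments present in the query."""
    hits = (1 << n_structures) - 1          # start with every structure
    for f, column in enumerate(columns):
        if (query_fp >> f) & 1:
            hits &= column
    return [s for s in range(n_structures) if (hits >> s) & 1]

query = int("00000100010101000001010011010100", 2)
database = [int("00010100010101000101010011110100", 2),
            int("00000000100101001001000011100000", 2)]
columns = invert(database, 32)              # done once, before any query
```

The inversion is paid for once when the database is built. A query then touches only the columns for its own fragments, rather than one fingerprint per database structure.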

28 Screening Effectiveness

Ideally we want to eliminate as many structures as possible at the screening stage
• 99% screenout or more would be good

Fingerprint construction can help in this
• frequency distributions of fragments in a large database are very “skewed”
  o a few fragments occur in almost all compounds
    • will therefore give little or no screenout
  o many fragments occur in very few compounds
    • need very long fingerprint (lots of fragments) to ensure that we will have some in the query

29 Fingerprint construction

best fragments are medium frequency ones

fragments also need to be independent of each other

dictionary used for CA Registry search was constructed on basis of analysis of fragment frequency distributions

30 Daylight fingerprints

each fragment is used to generate a hash code, which specifies the bit position to be set
• actually most fragments set several bits
• small fragments (more frequent) set fewer bits
• larger fragments (less frequent) set more bits

several different fragments may set the same bit
• in principle this can reduce screening effectiveness
• it may allow a fingerprint to match when structure does not contain the same fragment as the query
  o in practice is not a serious problem
• it will never cause a structure to be rejected when it is actually a match
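The hashing idea can be illustrated with the Python sketch below. It is not Daylight's actual algorithm: the use of SHA-256, the 64-bit length, the rule that longer fragment strings set more bits, and the fragment strings themselves are all assumptions made for the example.

```python
import hashlib

def bit_positions(fragment, n_bits=64):
    """Map a fragment string to one or more bit positions via a hash code."""
    # Assumption for illustration: longer fragment strings set more bits (1 to 4).
    n_set = min(4, max(1, len(fragment) // 2))
    digest = hashlib.sha256(fragment.encode("utf-8")).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % n_bits
            for i in range(n_set)]

def hashed_fingerprint(fragments, n_bits=64):
    """OR together the bits set by every fragment of a structure."""
    fp = 0
    for fragment in fragments:
        for pos in bit_positions(fragment, n_bits):
            fp |= 1 << pos
    return fp

query_fp = hashed_fingerprint(["CC", "CO"])
structure_fp = hashed_fingerprint(["CC", "CO", "CCO", "CCCO"])
```

Because a structure containing all the query's fragments sets at least the same bits, `query_fp & structure_fp == query_fp` always holds for a true match. Bit collisions can only add false candidates, never remove real ones.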

31 Daylight fingerprints

fingerprint can be “folded” to reduce length of fingerprint

again this increases the chances of false matches, but with “sparse” (low-density) fingerprints this is more than offset by the increase in search speed

[Figure: folding a 32-bit fingerprint to 16 bits]

Original:    0010 0100 0101 0010 1001 1010 1101 0100

First half:  0010 0100 0101 0010
Second half: 1001 1010 1101 0100

Folded (OR): 1011 1110 1101 0110
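Folding can be sketched in Python as an OR of the two halves of the fingerprint. The function name is an assumption for this example; the fingerprints are held as integers.

```python
def fold(fp, n_bits):
    """Halve the fingerprint length by ORing its two halves together.
    Returns the folded fingerprint and its new length."""
    half = n_bits // 2
    mask = (1 << half) - 1
    return (fp >> half) | (fp & mask), half

original = int("00100100010100101001101011010100", 2)
folded, length = fold(original, 32)

# Screening still works after folding: a structure that passed before
# folding also passes afterwards (but more false matches can slip through).
query = int("00000100010101000001010011010100", 2)
structure = int("00010100010101000101010011110100", 2)
folded_query, _ = fold(query, 32)
folded_structure, _ = fold(structure, 32)
```

The folded value for the 32-bit example is `1011 1110 1101 0110`, as in the figure.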

32 Hardware solutions

“use a faster computer”

cheaper memory means that a lot of operations can be performed in memory
• in Daylight, fingerprints are stored and matched in memory

parallel processing
• database parallel (split database over several machines)
• algorithm-parallel (different operations on same structure on different processors)

33 Parallel processing

Chemical Abstracts Registry File
• different machines search different parts of the database
• results are collated for presentation to user

other research work has looked at various algorithms and various processors
• speedup declines as more processors added
  o overheads in controlling them become dominant
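The database-parallel pattern (split, search each part, collate) can be sketched as follows. This is only an illustration of the pattern, not a description of the CAS system: Python threads stand in for separate machines, the records are invented, and each partition runs a simple fingerprint screen.

```python
from concurrent.futures import ThreadPoolExecutor

def search_partition(partition, query_fp):
    """Screen one part of the database; partition is a list of (id, fingerprint)."""
    return [rid for rid, fp in partition if (fp & query_fp) == query_fp]

def database_parallel_search(records, query_fp, n_workers=2):
    # Split the database into n_workers roughly equal parts.
    chunks = [records[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial = list(pool.map(search_partition, chunks, [query_fp] * n_workers))
    # Collate the partial hit lists for presentation to the user.
    return sorted(hit for hits in partial for hit in hits)

records = [(1, 0b1011), (2, 0b0011), (3, 0b1111), (4, 0b1000)]
```

The collation step is the kind of coordination overhead the slide mentions: as more workers are added, splitting and merging account for a growing share of the total time.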

34 Parallel processing

subgraph isomorphism algorithms are not very suitable for algorithm parallelisation
• individual operations are very simple

von Scholley algorithm designed for parallel implementation
• each processor handles one atom in relaxation step
• problem is distributing data to processors
  o most processors spend time waiting for next data

35 Preprocessing the database

Do the time-consuming work in advance

Full structure search provides an example of this

Canonicalisation is a slow process (at least as hard as graph isomorphism)
• but it can be done in a pre-processing of the file, independently of the query
• then store the canonical representations
• can do rapid matches against a canonicalised query structure
• this is faster than using a graph isomorphism algorithm on non-canonical representations

36 Preprocessing the database

Similar principles are used in some substructure search systems

A tree structure is built, classifying all the atoms found in all the structures in the database

• first level based on atom type
• second level based on number of connections
• third level based on type of first neighbour
• fourth level based on type of second neighbour
• etc.
• lower levels based on classifications applied to neighbours (relaxation)
• bottom of tree lists structures that contain this class of atom

37 Tree-structured fragment searches

[Figure: tree of atom classes. Levels, from the root: central atom type; number of connections; first neighbour; second neighbour; third neighbour. Fragments shown in the tree include C, C-C, C-O, C-Br, C-C-F and C-C-Br.]

38 Tree-structured fragment search

search can be done by tracing tree, looking for atom classes found in query
• combine lists of structures found at the bottom

a backtracking atom-by-atom search may be needed to check hits found

best-known example is Beilstein’s Crossfire

main problem is updating the trees when new structures are added to the database

39 Substructure queries

queries for substructure search systems may be more complicated than simple subgraphs

different systems provide different capabilities
• variable atom and bond types
• specification of allowed substitution

[Figure: example query structure. Labels: variable atom type [F,Cl,Br,I]; single or double bond; C(s0); C(s1); no further substituents; one or more substituents.]

40 Substructure queries

some systems provide very complex query options

[Figure: example of a complex query. Labels: R1; O(L1 - 5); N; *; R1= (R-group definition).]

41 Substructure queries: SMARTS

Daylight uses an extension of SMILES to describe structure queries (SMARTS)
• can attach various properties to each atom
  o [CX3] carbon with 3 connections
  o [Nr5] nitrogen in a ring of size 5
• properties can be combined with logical operators
  o ! (NOT)
  o & (AND – high precedence)
  o , (OR)
  o ; (AND – low precedence)

42 SMARTS

complex patterns can be specified this way:
• [F,Cl,Br,I] any of the halogen atoms
• [!C;!R0] atom other than aliphatic carbon, in a ring

$(smarts_string) can also be used as an atom property
• this is called recursive SMARTS
• e.g. $(NC=*)
  o nitrogen single-bond carbon double-bond any-atom
  o (e.g. an amide nitrogen)

43 SMARTS

recursive SMARTS can be used to describe very complex patterns
• e.g. primary or secondary amine, but not amide

[N&X3;H2,H1;!$(NC=*)]

• N&X3: nitrogen AND 3 connections
• ; AND
• H2,H1: 2 attached hydrogens OR 1 attached hydrogen
• ; AND
• !$(NC=*): NOT (nitrogen connected to carbon with double bond to any atom)
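The operator precedence in `[N&X3;H2,H1;!$(NC=*)]` can be made explicit by writing the same test as a Python boolean expression. This sketch does not parse SMARTS. The atom is a plain dictionary whose keys (`element`, `connections`, `hydrogens`, `bonded_to_double_bonded_carbon`) are hypothetical names invented for this example, with the recursive part precomputed as a flag.

```python
def is_amine_not_amide(atom):
    """Evaluate [N&X3;H2,H1;!$(NC=*)] on a pre-analysed atom record."""
    return (
        (atom["element"] == "N" and atom["connections"] == 3)   # N&X3 (high-precedence AND)
        and (atom["hydrogens"] == 2 or atom["hydrogens"] == 1)  # H2,H1 (OR)
        and not atom["bonded_to_double_bonded_carbon"]          # !$(NC=*)
    )                                                           # the two ';' are low-precedence ANDs

methylamine_n = {"element": "N", "connections": 3, "hydrogens": 2,
                 "bonded_to_double_bonded_carbon": False}
acetamide_n = {"element": "N", "connections": 3, "hydrogens": 2,
               "bonded_to_double_bonded_carbon": True}
trimethylamine_n = {"element": "N", "connections": 3, "hydrogens": 0,
                    "bonded_to_double_bonded_carbon": False}
```

The grouping follows from the precedence rules on the earlier slide: `&` binds tighter than `,`, which binds tighter than `;`.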

44 Commercial systems

Several software companies provide structure registration and search systems to the chemical/pharmaceutical industry
• MDL Information Systems Inc.
  o MACCS, ISIS
• Daylight Chemical Information Systems Inc.
  o THOR, MERLIN, DayCart (Oracle cartridge)
• IDBS ActivityBase
• Accelrys Accord Enterprise Informatics
  o replaces Oxford Molecular RS3

45 Conclusions from Lecture 4

substructure matching (subgraph isomorphism) is an NP-complete problem
• worst-case time requirements of known algorithms rise exponentially with number of atoms involved
• heuristics (tricks) can be used to improve average search speed

several algorithms have been published
• most use partitioning and relaxation techniques

fingerprint screening can rapidly eliminate the bulk of non-matching structures

different systems allow different degrees of sophistication in formulating search queries

46 Further Reading

J. M. Barnard, “Substructure searching methods: old and new”, J. Chem. Inf. Comput. Sci., 1993, 33, 532-538

J. Xu. “Two dimensional structure and substructure searching.” In J. Gasteiger (ed.) Handbook of Chemoinformatics: From Data to Knowledge, Vol. 2, pp. 868-884, Wiley-VCH, 2003

47 Lecture 5: More structure searching

• Searching Markush structures in patents
  o nature and origin of Markush structures
  o fragment codes
  o topological systems (MARPAT, Markush DARC)
• Reaction searching
  o atom-atom mapping
• Maximal Common Substructure search
  o what is the largest substructure common to two molecules?
