1 Chemical Structure Representation and Search Systems
Lecture 4. Nov 11, 2003
John Barnard
Barnard Chemical Information Ltd
Chemical Informatics Software & Consultancy Services
Sheffield, UK
2 New MDL File Formats
Since the lecture on Oct 28, MDL have published details of a new file format, “XDfile”
• XML-based data format for transferring structure/reaction information with associated data
• built around existing MDL connection table formats
• can incorporate Chime strings (encrypted format used to render structures and reactions on a Web page)
• can incorporate SMILES strings
New download site:
• http://www.mdl.com/downloads/public/ctfile/ctfile.jsp
3 Lecture 4: Topics to be Covered
full structure search
• structure registration systems
• graph isomorphism
algorithm complexity and NP-complete problems
substructure search
• subgraph isomorphism
• screening searches and fingerprints
• substructure query formulation
o SMARTS
commercial systems
4 Full Structure and Substructure Search
full structure search
• query is a complete molecule
• is this molecule in the database?
o or tautomers, stereoisomers etc. of it
substructure search
• query is a pattern of atoms and bonds
• does this pattern occur as a substructure of any of the molecules in my database?
superstructure search
• query is a complete molecule
• are any of the molecules in the database substructures of it?
N.B. Some Daylight Chemical Information Systems Inc. documentation uses “substructure” and “superstructure” search in the opposite sense to those given here
5 Full Structure Search
Many databases contain millions of structures, so search speed is important
Simplest approach uses a canonical representation for query and database structures (e.g. canonical SMILES)
• could sort database SMILES into alphanumerical order
• search sorted list for match with query
“Hash table” lookup can improve search speed
• calculate hash code (“idiot number” in predefined range) from SMILES for each database structure
• this is the address (disk file or memory) at which the full representation is stored
• only SMILES which have the same hash code need to be compared
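The hash-table lookup described above can be sketched as follows. This is a minimal illustration, not any particular system's implementation: the bucket count, the use of CRC32 as the hash function, and the function names are all assumptions, and the SMILES strings are assumed to have been canonicalised already by some other tool.

```python
import zlib

NUM_BUCKETS = 1024  # hypothetical size of the predefined hash-code range


def hash_code(smiles: str) -> int:
    """Map a (pre-canonicalised) SMILES string to an address in a fixed range."""
    return zlib.crc32(smiles.encode("utf-8")) % NUM_BUCKETS


def build_table(canonical_smiles):
    """Store each SMILES in the bucket addressed by its hash code."""
    table = {}
    for smiles in canonical_smiles:
        table.setdefault(hash_code(smiles), []).append(smiles)
    return table


def is_in_database(table, query: str) -> bool:
    """Only SMILES sharing the query's hash code are compared with it."""
    return query in table.get(hash_code(query), [])


# Illustrative data only; assumed to be canonical SMILES.
database = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
table = build_table(database)
print(is_in_database(table, "CCO"), is_in_database(table, "CCCl"))
```

Two different SMILES may share a hash code, which is why the final string comparison inside the bucket is still needed.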
6 Structure Registration Systems
Many chemical and pharmaceutical companies maintain compound “registry” systems
• database of all compounds worked on internally
• may include many compounds never published elsewhere (i.e. not in Chemical Abstracts, Beilstein)
• links to company reports, biological screening data, stock number in compounds store etc.
• links to electronic lab notebooks, LIMS (Laboratory Information Management System), ORACLE database etc.
7 Structure Registration Systems
new compounds need to be added regularly
• used to be done by chemical information specialists
• now frequently done directly by bench chemists
registration system must
• check consistency of input data
o e.g. compare molecular formula with structure
• check that compound is really new
o different ways of handling tautomers, salts, stereoisomers etc.
• assign registry number
• add supplementary data (melting point etc.)
• make data immediately available for search
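The novelty check and registry-number assignment listed above might look roughly like this. The `Registry` class and its method names are hypothetical, and the sketch assumes each structure arrives as an already-canonicalised SMILES string; consistency checks and the handling of tautomers, salts and stereoisomers are deliberately left out.

```python
class Registry:
    """Hypothetical in-memory compound registry keyed on canonical SMILES."""

    def __init__(self):
        self._number_by_structure = {}  # canonical SMILES -> registry number
        self._data_by_number = {}       # registry number -> supplementary data
        self._next_number = 1

    def register(self, canonical_smiles, data=None):
        """Return (registry_number, is_new) for the submitted structure."""
        existing = self._number_by_structure.get(canonical_smiles)
        if existing is not None:
            return existing, False      # compound is not really new
        number = self._next_number
        self._next_number += 1
        self._number_by_structure[canonical_smiles] = number
        self._data_by_number[number] = dict(data or {})
        return number, True             # searchable as soon as this returns


registry = Registry()
first = registry.register("CCO", {"melting_point_C": -114})
duplicate = registry.register("CCO")
second = registry.register("CCN")
```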
8 Structure Registration Systems
“Public” databases use the same principles, adding compounds from the published literature
• Chemical Abstracts Registry file
o links to document where data on molecule was published
• Beilstein Registry file
o lots of data may be stored with compound, from different data sources; existing records may need updating
Updates for searching may be made available at regular intervals (weekly, monthly, annually, etc.)
9 Graph Isomorphism
In graph theory terms, when two full structures match, their graphs are said to be isomorphic
each node N1 in G1 must be mapped to a node N2 in G2
neighbours of N1 must map to neighbours of N2
10 Graph isomorphism by brute force
for each node in G1
• map it against an unmapped node in G2
check that neighbours of each node map appropriately in the two graphs
if each graph has n nodes there are n! ways of doing this
• n × (n-1) × (n-2) × (n-3) … × 3 × 2 × 1
• this is a big number if n is anything non-trivial
• 9! = 362 880
• 10! = 3 628 800
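A brute-force isomorphism test of this kind can be written in a few lines. The sketch below is illustrative only: graphs are assumed to be unlabelled, undirected and stored as dictionaries of neighbour sets, and the function name is made up for this example.

```python
from itertools import permutations
from math import factorial


def brute_force_isomorphic(adj1, adj2):
    """adj1, adj2: dict mapping each node to the set of its neighbours."""
    nodes1, nodes2 = list(adj1), list(adj2)
    if len(nodes1) != len(nodes2):
        return False
    # Same number of (directed) neighbour entries, i.e. same number of edges.
    if sum(map(len, adj1.values())) != sum(map(len, adj2.values())):
        return False
    # Try every one of the n! ways of mapping G1 nodes onto G2 nodes.
    for perm in permutations(nodes2):
        mapping = dict(zip(nodes1, perm))
        if all(mapping[b] in adj2[mapping[a]] for a in adj1 for b in adj1[a]):
            return True
    return False


path = {1: {2}, 2: {1, 3}, 3: {2}}
same_path = {"x": {"z"}, "y": {"z"}, "z": {"x", "y"}}
triangle = {"p": {"q", "r"}, "q": {"p", "r"}, "r": {"p", "q"}}
print(brute_force_isomorphic(path, same_path))  # z plays the role of node 2
print(factorial(9), factorial(10))              # number of mappings to try
```

Because the edge counts are equal and the mapping is one-to-one, checking that every edge of G1 lands on an edge of G2 is enough to establish the isomorphism.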
11 Computational complexity
a measure of how long a computational algorithm will take to run, depending on the size of input
• if you give it twice as much data will it take twice as long to run?
e.g. comparing a word sequentially against each member of a list of words of length n
• time taken depends directly on length of list
• algorithm is O(n) [“order-n”]
e.g. comparing each word in a list of length n with every other word of the same list
• algorithm is O(n²) [“order-n-squared”]
12 Computational complexity
some algorithms may have complexity O(n³), O(n⁴), O(log n), O(n log n) etc.
• these are all “polynomial” time algorithms (running time bounded by some power of n)
some algorithms have exponential complexity, e.g. O(2ⁿ)
• this is much slower than polynomial
brute-force graph isomorphism is O(n!)
• this is even slower than simple exponential
13 Computational complexity
for some problems you can find more efficient algorithms (lower order of complexity) to do the same thing
• e.g. searching a sorted list
o simple “sequential” search is O(n)
o “binary chop” search is O(log n)
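To make the difference between the two searches concrete, the sketch below counts comparisons for each on a sorted list of 1024 made-up keys. The function names and the key format are illustrative assumptions; the binary chop counts only the comparisons made inside its loop.

```python
def sequential_search(items, target):
    """O(n): compare the target with each item in turn."""
    comparisons = 0
    for item in items:
        comparisons += 1
        if item == target:
            return True, comparisons
    return False, comparisons


def binary_chop(sorted_items, target):
    """O(log n): halve the remaining range at every step."""
    comparisons = 0
    lo, hi = 0, len(sorted_items)
    while lo < hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_items) and sorted_items[lo] == target
    return found, comparisons


keys = [f"K{i:04d}" for i in range(1024)]  # already in sorted order
print(sequential_search(keys, keys[-1]))
print(binary_chop(keys, keys[-1]))
```

Doubling the list to 2048 keys doubles the worst-case work for the sequential search but adds only one more step to the binary chop.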
for some problems there are no known polynomial-time algorithms
14 NP-complete problems
a class of problems for which no polynomial-time algorithms are known
problems in this class are mathematically “equivalent”
• if a polynomial-time algorithm could be found for one of them, it would fit all of them
well-known example is “travelling salesman problem” (shortest path visiting each of several cities)
it is suspected (but not proven) that no polynomial-time algorithms can exist for this class of problems
15 NP-complete problems
graph isomorphism is in NP, but it is not known to be either NP-complete or solvable in polynomial time
subgraph isomorphism is a generalisation of graph isomorphism
• nodes in G1 (query structure) must be mapped to a subset of nodes in G2 (database structure)
• i.e. G1 is a subgraph of G2
subgraph isomorphism has been proven to be NP-complete
• substructure searching is inherently slow
16 Subgraph isomorphism
NP-completeness of problem means that worst-case match times are exponential in number of atoms involved
but average-case match times can be better than this
much effort has been expended on this problem over the past 40+ years
• closely-related problems remain an active area of research
17 Speeding up subgraph isomorphism
1. use a faster computer
2. use tricks to avoid exploring potential solutions that are bound to fail
3. do most of the work in a pre-processing of the database structures, independently of the query
18 Speeding up subgraph isomorphism
chemical graphs have several characteristics that allow heuristics (“tricks”) to be used to speed up isomorphism identification
• several different node and edge labels
• low connectivity of each node
• using hydrogen-suppressed graphs reduces size of problem (number of nodes)
these tricks would be of less use for general graphs
additional tricks and algorithms may be used in special cases (e.g. if graphs are trees)
19 Backtracking
modification of the brute-force approach
abandons partial solutions part-way through when it can be seen they are bound to fail
worst-case is still exponential in number of nodes, but doesn’t arise very often
• first map an arbitrary pair of nodes
• then map neighbours of these nodes
• if successful, map neighbours of each neighbour, etc.
• if not, backtrack one step, and try a different mapping
20 Backtracking
algorithm will terminate
• when all query nodes are mapped [MATCH]
• when all alternative mappings for first query node have been tried, and have failed [NO MATCH]
extra tricks can be used for further improvement
• only map nodes with same element type and charge, and compatible bonding patterns
• start with unusual atom types, and nodes with lots of neighbours
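A backtracking substructure match along these lines can be sketched as below. It is a simplified illustration rather than any published algorithm: structures are assumed to be hydrogen-suppressed and stored as dictionaries of (element, neighbour set), bond orders and charges are ignored, and query atoms are simply tried in order of decreasing neighbour count.

```python
def find_substructure(query, target):
    """Return a mapping query atom -> target atom, or None if there is no match.

    query, target: dict atom_id -> (element, set of neighbouring atom_ids).
    """
    order = sorted(query, key=lambda q: -len(query[q][1]))  # busiest atoms first
    mapping = {}

    def compatible(q, t):
        if query[q][0] != target[t][0]:            # same element type
            return False
        if len(query[q][1]) > len(target[t][1]):   # enough neighbours available
            return False
        # Every already-mapped neighbour of q must sit next to t in the target.
        return all(mapping[n] in target[t][1] for n in query[q][1] if n in mapping)

    def extend(i):
        if i == len(order):
            return True                            # all query nodes mapped: MATCH
        q = order[i]
        used = set(mapping.values())
        for t in target:
            if t not in used and compatible(q, t):
                mapping[q] = t
                if extend(i + 1):
                    return True
                del mapping[q]                     # backtrack, try another mapping
        return False                               # alternatives exhausted

    return dict(mapping) if extend(0) else None


ethanol = {0: ("C", {1}), 1: ("C", {0, 2}), 2: ("O", {1})}
c_o = {"a": ("C", {"b"}), "b": ("O", {"a"})}
c_n = {"a": ("C", {"b"}), "b": ("N", {"a"})}
print(find_substructure(c_o, ethanol), find_substructure(c_n, ethanol))
```

In the C-O query, the first attempt maps the query carbon to the methyl carbon, fails when the oxygen cannot be placed next to it, and succeeds after backtracking to the other carbon.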
21 Partitioning and Relaxation
often used as an adjunct to backtracking
start by partitioning the nodes into sets of possible correspondents
• e.g. nitrogens can only match nitrogens
• iteratively refine the partition on basis of other possible correspondences
o e.g. if F6 is only possible correspondent for Q1 then F6 cannot be a correspondent for Q2
• if the list of possible correspondents for a query node becomes empty, there is no isomorphism
22 Partitioning and Relaxation
can also reduce lists of possible correspondents by looking at neighbours
• if F6 is to remain a valid correspondent of Q1, then every neighbour of Q1 must have a possible correspondent among the neighbours of F6
• as this check is repeated for each node, we are bringing in information from further away, but only ever looking at immediate neighbours
o this technique is the same as Morgan’s algorithm for node labelling in canonicalisation
o it is called relaxation
backtracking can be used as a fallback when no further reductions can be made in the lists of possible correspondents
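The partition-and-refine idea can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of any of the published algorithms: structures are dictionaries of (element, neighbour set), the initial partition uses element type only, and the function name `relax` is made up for this example.

```python
def relax(query, target):
    """Return dict query atom -> set of possible target correspondents,
    or None as soon as some query atom has no possible correspondent left."""
    # Initial partition: an atom can only correspond to one of the same element.
    cand = {q: {t for t in target if target[t][0] == query[q][0]} for q in query}
    changed = True
    while changed:
        changed = False
        for q, (_, q_nbrs) in query.items():
            for t in list(cand[q]):
                # Each neighbour of q needs a candidate among t's neighbours.
                if not all(cand[n] & target[t][1] for n in q_nbrs):
                    cand[q].discard(t)
                    changed = True
            if not cand[q]:
                return None
        # A target atom that is the only candidate for one query atom
        # cannot also be a candidate for another.
        for q in query:
            if len(cand[q]) == 1:
                (only,) = cand[q]
                for other in query:
                    if other != q and only in cand[other]:
                        cand[other].discard(only)
                        changed = True
                        if not cand[other]:
                            return None
    return cand


ethanol = {0: ("C", {1}), 1: ("C", {0, 2}), 2: ("O", {1})}
c_o = {"a": ("C", {"b"}), "b": ("O", {"a"})}
print(relax(c_o, ethanol))
```

Here relaxation alone settles the match; in general the surviving candidate lists would be handed to a backtracking search.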
23 Subgraph isomorphism algorithms
Ray and Kirsch’s algorithm (1957)
• basic backtracking
Sussenguth’s partitioning algorithm (1965)
• relaxation technique called “connectivity property”, with backtracking as fall-back
Figueras’s set reduction algorithm (1972)
Ullmann’s algorithm (1976)
• efficient relaxation and backtracking
von Scholley’s relaxation algorithm (1984)
24 Screening
so far we’ve considered matching one query substructure against one database full structure
• each structure from the database needs to be compared against the query in turn
• many will fail because they don’t contain the query substructure
“screening” allows many of these to be eliminated before we get to this stage
uses structure “fingerprints” discussed in lecture 3
25 Fingerprints
the fragments present in a structure can be represented as a sequence of 0s and 1s
00010100010101000101010011110100
• 0 means fragment is not present in structure
• 1 means fragment is present in structure (perhaps multiple times)
each 0 or 1 can be stored as a single bit in the computer (the whole sequence is a “bitstring”)
for chemical structures often called structure “fingerprints”
26 Screening
build a fingerprint for the query substructure
only those database structures that contain all the fragments in the query can possibly match the query
Query:       00000100010101000001010011010100
DB struct 1: 00010100010101000101010011110100 MATCH
DB struct 2: 00000000100101001001000011100000 NO MATCH
comparing fingerprint bitstrings is very fast (logical AND operation)
only those structures that pass the screening stage need to be considered as candidates for atom-by-atom isomorphism search
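The screening test on this slide can be reproduced with a single AND per database structure. The sketch below uses the three bitstrings from the example; the function and variable names are illustrative.

```python
def passes_screen(query_fp: str, db_fp: str) -> bool:
    """A structure passes if every bit set in the query is also set in it."""
    q, d = int(query_fp, 2), int(db_fp, 2)
    return (q & d) == q


QUERY = "00000100010101000001010011010100"
DATABASE = {
    "DB struct 1": "00010100010101000101010011110100",
    "DB struct 2": "00000000100101001001000011100000",
}

# Only these candidates go on to the atom-by-atom isomorphism search.
candidates = [name for name, fp in DATABASE.items() if passes_screen(QUERY, fp)]
print(candidates)
```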
27 Screening
can be made faster by “inverting” the bit strings (actually, turning them on their side)
• instead of a bitstring of fragments for each structure ...
• store a bitstring of structures for each fragment
o each bit represents a database structure
o 1 = structure contains fragment; 0 = structure does not
• search by ANDing together the bitstrings for the fragments present in the query
o this will list those structures that contain all the query’s fragments
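An inverted screen of this kind can be sketched with Python integers standing in for the per-fragment bitstrings. The data layout (a set of fragment indices per structure) and the function names are assumptions made for illustration.

```python
from functools import reduce


def invert(fingerprints, num_fragments):
    """fingerprints: list of sets of fragment indices, one set per structure.
    Returns one integer per fragment; bit i is set if structure i contains it."""
    columns = [0] * num_fragments
    for struct_id, fragments in enumerate(fingerprints):
        for f in fragments:
            columns[f] |= 1 << struct_id
    return columns


def screen(columns, query_fragments, num_structures):
    """AND together the bitstrings of the fragments present in the query."""
    everything = (1 << num_structures) - 1
    hits = reduce(lambda acc, f: acc & columns[f], query_fragments, everything)
    return [i for i in range(num_structures) if (hits >> i) & 1]


fingerprints = [{0, 1, 3}, {1, 2}, {0, 1, 2, 3}]  # three structures, four fragments
columns = invert(fingerprints, num_fragments=4)
print(screen(columns, {1, 3}, num_structures=3))
```

Only the bitstrings for the query's own fragments are read, so the cost depends on the number of query fragments rather than on the full fingerprint length.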
28 Screening Effectiveness
Ideally we want to eliminate as many structures as possible at the screening stage
• 99% screenout or more would be good
Fingerprint construction can help in this
• frequency distributions of fragments in a large database are very “skewed”
o a few fragments occur in almost all compounds, and will therefore give little or no screenout
o many fragments occur in very few compounds, so a very long fingerprint (lots of fragments) is needed to ensure that we will have some in the query
29 Fingerprint construction
best fragments are medium-frequency ones
fragments also need to be independent of each other
dictionary used for CA Registry search was constructed on basis of analysis of fragment frequency distributions
30 Daylight fingerprints
each fragment is used to generate a hash code, which specifies the bit position to be set
• actually most fragments set several bits
• small fragments (more frequent) set fewer bits
• larger fragments (less frequent) set more bits
several different fragments may set the same bit
• in principle this can reduce screening effectiveness
• it may allow a fingerprint to match when structure does not contain the same fragment as the query
o in practice this is not a serious problem
• it will never cause a structure to be rejected when it is actually a match
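The general idea of a hashed fingerprint can be sketched as below. This is not Daylight's actual scheme: the fingerprint length, the use of CRC32, the fixed two bits per fragment, and the fragment strings are all assumptions chosen to keep the example short.

```python
import zlib

FP_BITS = 64  # hypothetical fingerprint length


def fragment_bits(fragment: str, bits_per_fragment: int = 2):
    """Hash a fragment to a few bit positions (different fragments may collide)."""
    positions = set()
    for salt in range(bits_per_fragment):
        code = zlib.crc32(f"{salt}:{fragment}".encode("utf-8"))
        positions.add(code % FP_BITS)
    return positions


def fingerprint(fragments):
    """OR together the bits set by every fragment of a structure."""
    fp = 0
    for fragment in fragments:
        for pos in fragment_bits(fragment):
            fp |= 1 << pos
    return fp


def passes_screen(query_fp: int, db_fp: int) -> bool:
    return (query_fp & db_fp) == query_fp


structure_fp = fingerprint(["C-C", "C-O", "C-C-O"])
query_fp = fingerprint(["C-O"])
print(passes_screen(query_fp, structure_fp))
```

Because a structure's fingerprint includes every bit set by each of its fragments, a query built from a subset of those fragments always passes; collisions can only add false candidates, never remove true ones.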
31 Daylight fingerprints
fingerprint can be “folded” to reduce length of fingerprint
again this increases the chances of false matches, but with “sparse” (low-density) fingerprints this is more than offset by the increase in search speed
0010 0100 0101 0010 1001 1010 1101 0100   (original fingerprint)
0010 0100 0101 0010 | 1001 1010 1101 0100   (split into two halves)
1011 1110 1101 0110   (folded fingerprint: the two halves ORed together)
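Folding can be sketched as a bitwise OR of the two halves of the fingerprint. The function name is illustrative, and the input is the 32-bit fingerprint from this slide, written as a string of 0s and 1s.

```python
def fold(bits: str) -> str:
    """Halve a fingerprint's length by ORing its first half with its second."""
    half = len(bits) // 2
    first, second = bits[:half], bits[half:]
    return "".join("1" if a == "1" or b == "1" else "0"
                   for a, b in zip(first, second))


original = "00100100010100101001101011010100"
folded = fold(original)
print(folded)
```

The folded fingerprint is denser than the original, which is why folding pays off mainly for sparse fingerprints.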
32 Hardware solutions
“use a faster computer”
cheaper memory means that a lot of operations can be performed in memory
• in Daylight, fingerprints are stored and matched in memory
parallel processing
• database-parallel (split database over several machines)
• algorithm-parallel (different operations on same structure on different processors)
33 Parallel processing
Chemical Abstracts Registry File
• different machines search different parts of the database
• results are collated for presentation to user
other research work has looked at various algorithms and various processors
• speedup declines as more processors are added
o overheads in controlling them become dominant
34 Parallel processing
subgraph isomorphism algorithms are not very suitable for algorithm parallelisation
• individual operations are very simple
von Scholley algorithm designed for parallel implementation
• each processor handles one atom in relaxation step
• problem is distributing data to processors
o most processors spend time waiting for next data
35 Preprocessing the database
Do the time-consuming work in advance
Full structure search provides an example of this
Canonicalisation is a slow process (at least as hard as graph isomorphism in the worst case)
• but it can be done in a pre-processing of the file, independently of the query
• then store the canonical representations
• can do rapid matches against a canonicalised query structure
• this is faster than using a graph isomorphism algorithm on non-canonical representations
36 Preprocessing the database
Similar principles are used in some substructure search systems
A tree structure is built, classifying all the atoms found in all the structures in the database
• first level based on atom type
• second level based on number of connections
• third level based on type of first neighbour
• fourth level based on type of second neighbour
• etc.
• lower levels based on classifications applied to neighbours (relaxation)
• bottom of tree lists structures that contain this class of atom
37 Tree-structured fragment searches
[Figure: tree of atom classes with levels labelled Central atom type, Number of Connections, First Neighbour, Second Neighbour and Third Neighbour; example fragments shown include C-C-Br, C-C-F, C-C, C-O and C-Br]
38 Tree-structured fragment search
search can be done by tracing tree, looking for atom classes found in query
• combine lists of structures found at the bottom
a backtracking atom-by-atom search may be needed to check hits found
best-known example is Beilstein’s Crossfire
main problem is updating the trees when new structures are added to the database
39 Substructure queries
queries for substructure search systems may be more complicated than simple subgraphs
different systems provide different capabilities
• variable atom and bond types
• specification of allowed substitution
[Figure: example query structure with the labels: Variable atom type; [F,Cl,Br,I]; single or double bond; C(s0); C(s1); no further substituents; one or more substituents]
40 Substructure queries
some systems provide very complex query options
[Figure: example query structure containing N and * atoms, an O(L1 - 5) label, and an R1 group with its definition (R1=)]
41 Substructure queries: SMARTS
Daylight uses an extension of SMILES to describe structure queries (SMARTS)
• can attach various properties to each atom
o [CX3] carbon with 3 connections
o [Nr5] nitrogen in a ring of size 5
• properties can be combined with logical operators
o ! (NOT)
o & (AND – high precedence)
o , (OR)
o ; (AND – low precedence)
42 SMARTS
complex patterns can be specified this way:
• [F,Cl,Br,I] any of the halogen atoms
• [!C;!R0] heteroatom in a ring
$(smarts_string) can also be used as an atom property
• this is called recursive SMARTS
• e.g. $(NC=*)
o nitrogen single-bond carbon double-bond any-atom
o (e.g. an amide)
43 SMARTS
recursive SMARTS can be used to describe very complex patterns
• e.g. primary or secondary amine, but not amide
[N&X3;H2,H1;!$(NC=*)]
o N&X3: nitrogen AND 3 connections
o H2,H1: 2 attached hydrogens OR 1 attached hydrogen
o !$(NC=*): NOT nitrogen connected to carbon with double bond to any atom
o the three parts are joined by “;” (low-precedence AND)
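The way the operators in [N&X3;H2,H1;!$(NC=*)] group can be spelled out as an ordinary Boolean expression. The sketch below is a hand translation, not a SMARTS parser: atoms are plain dictionaries with hypothetical keys, and the recursive $(NC=*) test is assumed to have been evaluated beforehand and stored as a flag.

```python
def is_amine_not_amide(atom) -> bool:
    """Hand-translated logic of [N&X3;H2,H1;!$(NC=*)]."""
    return (
        (atom["element"] == "N" and atom["connections"] == 3)    # N&X3
        and (atom["hydrogens"] == 2 or atom["hydrogens"] == 1)   # H2,H1
        and not atom["on_double_bonded_carbon"]                  # !$(NC=*)
    )


# Nitrogen atoms described by hand; X counts all connections, including hydrogens.
methylamine_n = {"element": "N", "connections": 3, "hydrogens": 2,
                 "on_double_bonded_carbon": False}
acetamide_n = {"element": "N", "connections": 3, "hydrogens": 2,
               "on_double_bonded_carbon": True}
trimethylamine_n = {"element": "N", "connections": 3, "hydrogens": 0,
                    "on_double_bonded_carbon": False}

for atom in (methylamine_n, acetamide_n, trimethylamine_n):
    print(is_amine_not_amide(atom))
```

The “,” (OR) binds more tightly than “;”, which is why H2,H1 is grouped before being ANDed with the other two parts.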
44 Commercial systems
Several software companies provide structure registration and search systems to the chemical/pharmaceutical industry
• MDL Information Systems Inc.
o MACCS, ISIS
• Daylight Chemical Information Systems Inc.
o THOR, MERLIN, DayCart (Oracle cartridge)
• IDBS ActivityBase
• Accelrys Accord Enterprise Informatics
o replaces Oxford Molecular RS3
45 Conclusions from Lecture 4
substructure matching is an NP-complete problem
• worst-case time requirements rise exponentially with number of atoms involved
• heuristics (tricks) can be used to improve average search speed
several algorithms have been published
• most use partitioning and relaxation techniques
fingerprint screening can rapidly eliminate the bulk of non-matching structures
different systems allow different degrees of sophistication in formulating search queries
46 Further Reading
J. M. Barnard, “Substructure searching methods: old and new”, J. Chem. Inf. Comput. Sci., 1993, 33, 532-538
J. Xu. “Two dimensional structure and substructure searching.” In J. Gasteiger (ed.) Handbook of Chemoinformatics: From Data to Knowledge, Vol. 2, pp. 868-884, Wiley-VCH, 2003
47 Lecture 5: More structure searching
• Searching Markush structures in patents
o nature and origin of Markush structures
o fragment codes
o topological systems (MARPAT, Markush DARC)
• Reaction searching
o atom-atom mapping
• Maximal Common Substructure search
o what is the largest substructure common to two molecules?