Pharmacophoric pattern matching in files of 3D chemical structures: comparison of geometric searching algorithms

A T Brint and P Willett*

Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, UK

This paper reports a comparative evaluation of four algorithms that can be used to determine the presence or absence of a set of atoms and associated interatomic distances in 3D chemical structures. The geometric searching algorithms tested are that described by Lesk for the identification of patterns in proteins, one derived from the set reduction techniques used for substructure searching in files of 2D chemical structures, a procedure based on clique detection, and Ullman’s subgraph isomorphism algorithm. Tests with structures from the Cambridge Crystallographic Data Bank demonstrate the general superiority of Ullman’s algorithm for data of this sort.

Keywords: chemical structure, clique detection, geometric searching, information retrieval, pharmacophoric pattern matching, set reduction, subgraph isomorphism, 3D substructure searching

Received 8 October 1986 Accepted 21 October 1986

A pharmacophoric pattern is the set of structural features in a drug molecule which is recognized at a biological receptor site; molecules containing a specified pattern are expected to exhibit the corresponding activity and there is thus considerable interest in methods for the identification of such patterns. Pharmacophoric patterns can be either topological or topographical in character, depending upon whether the requisite pattern of atoms is specified in two-dimensional (2D) or three-dimensional (3D) space. However, topographical pharmacophoric patterns provide much more information about the precise molecular orientation required for activity, and there has accordingly been some interest in the development of computer systems which would allow the automatic detection of pharmacophoric patterns in 3D chemical structures’-‘.

The authors are currently engaged in a project that will enable interactive 3D pharmacophoric pattern searches to be carried out on a routine basis, using techniques analogous to those developed for substructure searching in 2D chemical structure retrieval systems [4,5]. A search in a machine-readable file for all molecules containing a specified 2D query substructure is effected by a two-stage retrieval algorithm. In the first stage, the query is matched against each of the structures in the file using a highly redundant, bit-string search which rapidly eliminates large numbers of structures that cannot possibly contain the query pattern. Each bit in the bit string describing a molecule or query denotes the presence or absence in that structure of a discriminating, substructural screen. A detailed atom-by-atom search is then carried out for those few molecules that have passed the screening search; this atom-by-atom search involves a backtracking subgraph isomorphism procedure which confirms the presence or absence of the query in the molecule. The authors have described previously a methodology for the selection of 3D screens that could be used for the implementation of the first stage of an interactive pharmacophoric pattern matching system: the screens are based upon interatomic distance information and are chosen by an algorithmic selection procedure so as to occur approximately equifrequently in the database that is being searched [6]. An evaluation of the efficiency of the screens resulting from this selection procedure showed that they were highly effective in eliminating molecules that did not contain a specified query pharmacophore. For the set of ten pharmacophoric patterns and 12 728 structures from the Cambridge Crystallographic Data Bank (CCDB)’ used in the evaluation, only some 5% of the structures needed to undergo the time-consuming geometric search which checks the relative arrangements of the atoms in a 3D structure to determine whether the exact query pattern is present’.

*To whom all correspondence should be addressed

There is an obvious, brute-force algorithm for geometric searching. Given a query pattern and a structure containing NQ and NS atoms respectively, generate all NS!/(NQ!(NS-NQ)!) combinations of NQ atoms from the structure and determine whether any of these combinations is an exact match for the query pattern, this match involving a check on the geometric arrangement of the NQ(NQ-1)/2 distinct interatomic distances. Such a procedure is clearly impracticable for all but the smallest structures, and there is accordingly a need for sophisticated techniques which can reduce the number of combinations that need to be tested to an acceptable level.
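To make the cost of the brute-force approach concrete, the following is a minimal Python sketch, not the authors' FORTRAN 77 code. It assumes that atoms are given as 3D coordinate tuples, that all atoms are of one type, and that a match means every pattern distance agrees with the corresponding structure distance to within a tolerance `tol`; the function name `brute_force_match` is hypothetical. It enumerates ordered assignments of structure atoms to pattern atoms, which visits every combination in every order and so is even more expensive than the combination count quoted in the text.

```python
from itertools import permutations
from math import comb, dist


def brute_force_match(pattern, structure, tol=0.25):
    """Return the first tuple of structure indices matching the pattern, or None.

    pattern, structure: lists of (x, y, z) tuples (assumed input format).
    All NQ(NQ-1)/2 interatomic distances must agree to within tol.
    """
    nq = len(pattern)
    for cand in permutations(range(len(structure)), nq):
        if all(
            abs(dist(pattern[i], pattern[j])
                - dist(structure[cand[i]], structure[cand[j]])) <= tol
            for i in range(nq)
            for j in range(i + 1, nq)
        ):
            return cand
    return None


# A 3-4-5 triangle hidden in a four-atom structure (illustrative data).
pattern = [(0, 0, 0), (3, 0, 0), (0, 4, 0)]
structure = [(10, 10, 10), (1, 1, 1), (4, 1, 1), (1, 5, 1)]
print(brute_force_match(pattern, structure))  # (1, 2, 3)

# Number of 7-atom combinations in a 60-atom structure.
print(comb(60, 7))  # 386206920
```

Even before the distance checks, a 7-atom pattern in a 60-atom structure gives several hundred million combinations, which is why the partitioning and refinement techniques described in the following sections matter.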

Volume 5 Number 1 March 1987 0263-7855/87/010049-08 $03.00 @ 1987 Butterworth & Co (Publishers) Ltd 49

ALGORITHMS FOR PHARMACOPHORIC PATTERN MATCHING

There have been several reports of computer procedures that can be used to carry out various sorts of matching operations on 3D chemical structures’“*8-‘6. The authors’ evaluation of algorithms for pharmacophoric pattern matching has considered the four algorithms described below. In each case, their mode of operation is illustrated with respect to the (very artificial) pattern and structure shown in Figure 1.

The nodes in this pattern and structure are all assumed to be of the same type and the interatomic distances which are not marked in the structure are assumed to be different from those present in the pattern.

Lesk’s algorithm

Lesk” has described an algorithm primarily for pattern searching in proteins but which can also be used with the smaller molecules present in the CCDB. The algorithm assigns a structure atom, Si, as a candidate match for a pattern atom, Pj, if Si has other structure atoms of the appropriate atomic types at the same distances from it as does Pj. All of the structure atoms which are not matched to any pattern atom are removed from consideration and the candidate matches are checked again to ensure that the deletions have not affected any of the current structure atom-to-pattern atom matches. This process is repeated until no more eliminations can be made. Thus, the algorithm provides a partitioning technique which reduces the number of possible structure atoms that can match each of the pattern atoms. Once this has been done, the brute-force algorithm is invoked to generate all of the allowed combinations of NQ structure atoms. These combinations are then tested to see whether they match the pharmacophore by trying to rotate them onto the pattern [16]; it should be noted that efficiencies of operation can result if some structure atom, S, occurs in more than one allowed combination since the rotation need be performed once only.

Figure 1. Pattern and structure used to compare the four algorithms

In more detail, the algorithm consists of the following steps:

1 Form an array of triples, where each triple consists of a pair of atom types and the distance between them.

2 Associate two bit-strings with each pattern atom, the entries in the first string corresponding to the set of atom types present in the pattern, and those in the second string to the set of interatomic distances present in the pattern.

3 For each pattern atom, set the bit in string one associated with its atom type.

4 For each pair of pattern atoms, set the bits in the second strings which correspond to the elements in the triple array of step 1 which have the same atom types as the pair and where the distance equals that between the pair of atoms (within any tolerance range that has been specified).

5 Associate two analogous bit-strings with all the structure atoms under consideration.

6 For each of these structure atoms, set the bit in string one associated with its type (if one exists).

7 For all pairs of these structure atoms, set the bits in the second string in a similar manner to step 4.

8 Check whether each of the structure atoms under consideration has all the attributes (shown by ones in the bit strings) of at least one pattern atom; if it does not, remove it from the set of relevant structure atoms. If any atom has been eliminated from the structure, return to step 5.

9 For each pattern atom form a list of the structure atoms which are possible matches for it, i.e. which have matching bit-strings.

10 Form all possible combinations from step 9 and test to see whether they are matches by trying to rotate them onto the pattern.

To illustrate this procedure, consider the example pattern and structure shown previously. Since all of the atoms are of the same atomic types, no partitioning can result from the use of the first bit-string, and thus attention will be restricted to the second one which represents the available interatomic distance information. The set of strings for the pattern is of the form:

        Distance
Atom    a  b  c
P1      1  0  1
P2      1  1  0
P3      0  1  1

while that for the structure is of the form:

        Distance
Atom    a  b  c
S1      1  0  1
S2      1  1  0
S3      0  1  1
S4      1  1  0
S5      1  0  1
S6      0  0  1

A comparison of these two tables shows no possible match for atom S6 and this can hence be deleted from consideration. The structure bit strings are then recalculated, giving:

        Distance
Atom    a  b  c
S1      1  0  1
S2      1  1  0
S3      0  1  1
S4      1  1  0
S5      1  0  0

50 Journal of Molecular Graphics

S5 now has no potential match since there is no longer a candidate atom at a distance c from it; elimination of this atom and the recalculation of the table gives:

        Distance
Atom    a  b  c
S1      1  0  1
S2      1  1  0
S3      0  1  1
S4      0  1  0

S4 can now be eliminated, giving the set of unique equivalences P1 = S1, P2 = S2 and P3 = S3 which are then passed on to the check for an exact match.
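The partitioning stage of this worked example can be sketched in a few lines of Python. This is an illustrative simplification rather than the authors' implementation: it assumes all atoms are of one type, represents each bit-string as a Python set of distance labels, and removes every unmatched atom in a pass rather than one at a time. The edge lists `PATTERN` and `STRUCT` are a reconstruction of Figure 1 inferred from the example tables, and the names `signatures` and `lesk_partition` are hypothetical.

```python
# Labelled interatomic distances of the example, keyed by atom-number pairs.
# Assumption: reconstructed from the bit-string tables, not read from Figure 1.
PATTERN = {(1, 2): "a", (2, 3): "b", (1, 3): "c"}
STRUCT = {(1, 2): "a", (2, 3): "b", (1, 3): "c",
          (3, 4): "b", (4, 5): "a", (5, 6): "c"}


def signatures(edges, atoms):
    """Set of distance labels seen from each atom, counting only atoms still present."""
    sig = {v: set() for v in atoms}
    for (u, v), label in edges.items():
        if u in atoms and v in atoms:
            sig[u].add(label)
            sig[v].add(label)
    return sig


def lesk_partition(pattern, structure):
    """Repeatedly drop structure atoms lacking the attributes of every pattern atom."""
    p_atoms = {v for edge in pattern for v in edge}
    s_atoms = {v for edge in structure for v in edge}
    p_sig = signatures(pattern, p_atoms)
    while True:
        s_sig = signatures(structure, s_atoms)  # recalculated after each deletion pass
        keep = {s for s in s_atoms
                if any(p_sig[p] <= s_sig[s] for p in p_atoms)}
        if keep == s_atoms:
            break
        s_atoms = keep
    # Candidate structure atoms for each pattern atom (step 9 of the list above).
    return {p: sorted(s for s in s_atoms if p_sig[p] <= s_sig[s])
            for p in sorted(p_atoms)}


print(lesk_partition(PATTERN, STRUCT))  # {1: [1], 2: [2], 3: [3]}
```

On this data the loop deletes S6, then S5, then S4, leaving one candidate per pattern atom. In general several candidates survive, and the combinations must still be generated and tested.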

The coded version of this algorithm used arrays of integers rather than bit-strings owing to the difficulties associated with the manipulation of bit-strings in FORTRAN 77, the implementation language for our experiments; tests showed that the use of the latter data structure caused substantial decreases in the run-time efficiency of the algorithm”.

Set reduction algorithm

Set reduction involves the successive elimination of candidate structure atoms from sets corresponding to each pattern atom on the basis of an analysis of neighbourhood and connectivity information; the technique has been widely used as a component of 2D substructure searching systems [18,19]. The set reduction algorithm described here is a development of that used in previous geometric searching experiments’.

The first stage of the algorithm involves the creation of a distance table. The NQ pattern atoms are labelled from 1 to NQ and for each of the NQ(NQ-1)/2 distinct interatomic distances in the pattern, a list of pairs of atoms from the structure is produced. The distance between the atoms in these pairs is equal to that between the pattern atoms (to within any specified tolerances), and the atom type of the first atom corresponds with the type of the first pattern atom (and similarly for the second atom); thus if the query atoms in some pair are both of the same atomic type, two entries will be made in the list, the second having the atoms in the opposite order to the first.
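The construction of the distance table can be sketched as follows. This is a minimal Python illustration under stated assumptions: atoms are supplied as parallel lists of type symbols and 3D coordinates, indices are 0-based, and the function name `distance_table` and the tolerance default are hypothetical. Ordered pairs are generated, so a query pair with two atoms of the same type yields both orientations of each structure pair, as the text describes.

```python
from itertools import combinations, permutations
from math import dist


def distance_table(p_types, p_xyz, s_types, s_xyz, tol=0.25):
    """Map each pattern pair (i, j), i < j, to the ordered structure pairs (u, v)
    whose types match (i, j) in that order and whose distance matches within tol."""
    table = {}
    for i, j in combinations(range(len(p_xyz)), 2):
        d_p = dist(p_xyz[i], p_xyz[j])
        table[(i, j)] = [
            (u, v)
            for u, v in permutations(range(len(s_xyz)), 2)
            if s_types[u] == p_types[i]
            and s_types[v] == p_types[j]
            and abs(dist(s_xyz[u], s_xyz[v]) - d_p) <= tol
        ]
    return table


# Illustrative data: two carbons and a nitrogen.
s_types = ["C", "C", "N"]
s_xyz = [(0, 0, 0), (0, 1.5, 0), (0, 0, 1.5)]

# Same-type query pair: both orientations are listed.
print(distance_table(["C", "C"], [(0, 0, 0), (1.5, 0, 0)], s_types, s_xyz))
# {(0, 1): [(0, 1), (1, 0)]}

# Mixed-type query pair: only the orientation with C first and N second.
print(distance_table(["C", "N"], [(0, 0, 0), (1.5, 0, 0)], s_types, s_xyz))
# {(0, 1): [(0, 2)]}
```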

Once the distance table has been calculated, the main stage consists of taking each pattern atom Pi in turn and finding the smallest list of pairs in the table that is associated with it. For each pair, Sj-Sk, in this list, check that the atom in correspondence with Pi also corresponds with it in the other NQ-2 lists corresponding to Pi. If it does not, the pair is removed from the list. When the list has been processed, the pairs in the other lists corresponding to Pi are checked to see whether the atom which corresponds to Pi does so in the list which was processed first. If it does not, the pair is again removed. The main stage is repeated until no further eliminations can be made.

The distance table for the example pattern and structure is:

Pattern pair       P1-P2    P2-P3    P1-P3

Structure pairs    S1-S2    S2-S3    S1-S3
                   S2-S1    S3-S2    S3-S1
                   S4-S5    S3-S4    S5-S6
                   S5-S4    S4-S3    S6-S5

The first atom and atom pair examined by the algorithm are P1 and P1-P2. Structure atoms 2 and 4 are not a match for P1 because they are not present in P1's column of the structure atoms in the P1-P3 pair list. Therefore the S2-S1 and S4-S5 atom pairs can be eliminated from the P1-P2 pair list, yielding the equivalence P1 = S1 or S5. Attention then turns to the P1 column of P1-P3 and atoms which are not possible equivalences can be deleted; thus the pairs S3-S1 and S6-S5 are eliminated since these equivalences are not present in the P1-P2 list. Hence, the distance table now has the form:

Pattern pair       P1-P2    P2-P3    P1-P3

Structure pairs    S1-S2    S2-S3    S1-S3
                   S5-S4    S3-S2    S5-S6
                            S3-S4
                            S4-S3

Consideration of P, leads to the elimination of the S,S, pair from the P,-P, list, and hence of the S,-S, pair from the PI-P, list. Consideration of P, then gives the final distance table:

Pattern pair       P1-P2    P2-P3    P1-P3

Structure pairs    S1-S2    S2-S3    S1-S3

When no more eliminations can be made from the pair lists, they are passed on to the final stage of the algorithm. In the authors' previous study’, the remaining pattern atom-to-structure atom equivalences were used as the input to the brute-force algorithm which generated all of the possible combinations of NQ structure atoms for matching against the pattern. However, a depth-first search procedure has now been implemented which allows the successful combinations to be identified much more efficiently. The search procedure is as follows, where Pi is the ith pattern atom:

1 (Initialization): set the level, L, of the search to the value one, Index[1], the index into the pair list P1-P2, to one and Combin[1], the structure atom currently matching pattern atom 1, to the atom corresponding to P1 in the first atom pair of P1-P2.

2 Set L = L+1 and Combin[L], the current structure atom under consideration, to the atom corresponding to PL in the Index[L-1]th atom pair in P1-PL.

3 Check the Pi-PL (i = 2, ..., L-1) pair lists to ensure that the pairs (Combin[i], Combin[L]) are present in the relevant lists. If so then go to step 6 (the next level of the search).

4 (Backtrack): find the first pair in the P1-PL list which is greater than Index[L-1] and whose first atom is Combin[1]. Set Index[L-1] to the number of this pair, Combin[L] to the second atom and go to step 3. If no pair is found, then go to step 5.

5 Set L = L-1. If L = 1, then set Index[1] to Index[1]+1 and go to step 2 (unless Index[1] is greater than the number of pairs in the P1-P2 list, when the program terminates as the pattern is not contained in the structure), otherwise go to step 4.

6 Set L = L+1. If L is greater than the pattern size, then the pattern has been found and the program terminates, otherwise find the first pair in P1-PL with the first atom equalling Combin[1] and set Index[L-1] to be the number of this pair and Combin[L] to be the second atom of the pair. If no pair is found go to step 4, otherwise go to step 3.

In the case of the example P and S, a unique set of equivalences is obtained from the set reduction, and thus this backtracking step is not required.
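The depth-first search over the pair lists can be sketched recursively in Python. This is a simplified stand-in for the iterative procedure with the Index and Combin arrays, not a transcription of it: pattern atoms are 0-based, the table maps each pattern pair (i, j) with i < j to a list of ordered structure pairs, and the function name `depth_first_match` and the sample pair lists are assumptions made for illustration.

```python
def depth_first_match(table, nq):
    """Return one structure atom per pattern atom consistent with every pair list,
    or None. Requires nq >= 2 and a pair list for every pattern pair."""
    pairs = {key: set(value) for key, value in table.items()}

    def extend(combin):
        level = len(combin)  # index of the next pattern atom to place
        if level == nq:
            return combin
        # Candidates come from the list linking pattern atom 0 to this level.
        for u, v in sorted(pairs[(0, level)]):
            if u != combin[0] or v in combin:
                continue
            # Check the remaining lists P_i-P_level for the atoms placed so far.
            if all((combin[i], v) in pairs[(i, level)] for i in range(1, level)):
                found = extend(combin + [v])
                if found:
                    return found
        return None  # exhausting the candidates backtracks to the caller

    for first in sorted({u for u, _ in pairs[(0, 1)]}):
        found = extend([first])
        if found:
            return found
    return None


# Unreduced pair lists for a three-atom pattern and six structure atoms
# (illustrative values; structure atoms are numbered 1 to 6).
table = {
    (0, 1): [(1, 2), (2, 1), (4, 5), (5, 4)],
    (1, 2): [(2, 3), (3, 2), (3, 4), (4, 3)],
    (0, 2): [(1, 3), (3, 1), (5, 6), (6, 5)],
}
print(depth_first_match(table, 3))  # [1, 2, 3]
```

Because each candidate is checked against the lists for all previously placed atoms before the search descends, most partial combinations are rejected early instead of being generated in full.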

When only k of the distances in the pattern are specified, for k < NQ(NQ-1)/2, the method is the same as that above but only k lists are used. In fact, Gund’ has pointed out that for NQ > 3 only 4(NQ-3)+2 lists need to be used since not all of the NQ(NQ-1)/2 distances in the pattern are independent, with a similar observation applying to Lesk’s algorithm. However, tests showed that this economy only begins to have a significant effect for NQ >= 9, and it was not employed for the algorithms under test.

Clique detection algorithm

Any 3D molecule can be regarded as a graph in which both the nodes and the edges have labels, these corresponding to the atomic types and the interatomic distances respectively. This being so, it is possible to use established graph theoretical procedures for pharmacophoric pattern matching, and two such algorithms have been tested.

The first approach is based on the maximal common subgraph (MCS) algorithm of Levi [20] and Barrow et al. [21,22], where an MCS is defined as the largest subgraph that is common to a pair of graphs [23]; such algorithms have been widely used in chemical information systems as the basis for automatic reaction indexing systems [24,25]. Given a pair of graphs, P and S, an MCS algorithm can be used to determine whether P is a subgraph of S simply by determining whether the MCS is equivalent to P. The algorithm of Levi and Barrow et al. involves the generation of a correspondence graph, C, which can be formed from the two original graphs P and S by:

1 Create the set of all pairs of nodes from the two graphs such that the nodes of each pair are of the same type.

2 Form a graph whose nodes are the pairs from step 1. Two nodes (S1, P1), (S2, P2) are connected if the values of the arcs from S1 to S2 and P1 to P2 are the same.

Maximal common subgraphs then correspond to the cliques of the correspondence graph, where a clique is a subgraph in which every node is connected to every other node and which is not contained in any larger subgraph with this property. Examples of the use of this technique for 3D geometric searching have been described by Golender and Rozenblit” and by Kuhl et al. [14]. The former authors have used it to find out if a pharmacophore occurs in a molecule by looking for cliques in the correspondence graph whose size is the same as that of the pattern, and it was this method which was coded.

As an illustration of this method, consider the example graphs P and S shown previously. As the nodes are all of the same type, the nodes of the correspondence graph, C, are all the pairs (Si, Pj) (i = 1, ..., 6; j = 1, ..., 3). If these nodes are enumerated as C1 = (S1, P1), C2 = (S1, P2), C3 = (S1, P3) etc., then the connectivity matrix for the correspondence graph is:

      1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
 1    1  0  0  0  1  0  0  0  1  0  0  0  0  0  0  0  0  0
 2    0  1  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 3    0  0  1  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0
 4    0  1  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 5    1  0  0  0  1  0  0  0  1  0  0  0  0  0  0  0  0  0
 6    0  0  0  0  0  1  0  1  0  0  0  0  0  0  0  0  0  0
 7    0  0  1  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0
 8    0  0  0  0  0  1  0  1  0  0  0  1  0  0  0  0  0  0
 9    1  0  0  0  1  0  0  0  1  0  1  0  0  0  0  0  0  0
10    0  0  0  0  0  0  0  0  0  1  0  0  0  1  0  0  0  0
11    0  0  0  0  0  0  0  0  1  0  1  0  1  0  0  0  0  0
12    0  0  0  0  0  0  0  1  0  0  0  1  0  0  0  0  0  0
13    0  0  0  0  0  0  0  0  0  0  1  0  1  0  0  0  0  1
14    0  0  0  0  0  0  0  0  0  1  0  0  0  1  0  0  0  0
15    0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0  0
16    0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  0  0
17    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0
18    0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1

For example, consider the nodes of the correspondence graph C13 = (S5, P1), C11 = (S4, P2) and C18 = (S6, P3). Then C13 is connected to C11 in the correspondence graph since the distances between P1-P2 and S5-S4 are both a, and C13 is also connected to C18, since the distances between P1-P3 and S5-S6 are both c; however, C11 is not connected to C18 since the distances between P2-P3 and S4-S6 are not the same. After setting up the correspondence graph, the problem then becomes one of determining whether the correspondence graph contains a clique of size 3. The clique detection algorithm used here was that described by Bron and Kerbosch [26].

Ullman’s subgraph isomorphism algorithm

The final algorithm tested was the subgraph isomorphism algorithm due to Ullman [27]. The algorithm begins with three main arrays P, S and M0, of sizes NQ*NQ, NS*NS and NQ*NS respectively. P and S are the distance matrices of the pattern and the structure while the elements of M0 have the value one if the relevant pattern and structure nodes could match each other, and zero otherwise. The algorithm uses a tree search algorithm which changes ones into zeros to try to alter M0 into a matrix M where each row contains a single one and each column contains no more than a single one. M represents a permutation of the nodes of the structure’s graph, and so, if C is the matrix M*S^T*M^T, where M^T and S^T are the transposes of M and S respectively, then M specifies a subgraph isomorphism if:

C1: $\forall i\,\forall j\;(P_{ij} = 1) \Rightarrow (C_{ij} = 1) \qquad (1 \le i \le NQ;\ 1 \le j \le NQ)$

The basic algorithm works by forming a series of matrices M_D (D = 1, 2, ..., NQ), each one being created from its predecessor M_(D-1) by systematically changing all but one of the ones in a row to zero. The final matrix is checked to see whether it satisfies the conditions imposed on M; if it does not, then backtracking occurs. Ullman modifies this naive tree search by adding a refinement procedure. This procedure stems from the fact that, for a subgraph isomorphism, if Px is a neighbour of Pi in the pattern and Sj in the structure matches with Pi, then there must exist a neighbour Sy of Sj which matches with Px (and the relevant entry for Px, Sy in M must be one). Therefore, for any subgraph isomorphism, if Pi corresponds with Sj, then

C2: $\forall x\;(P_{ix} = 1) \Rightarrow (\exists y\;(M_{xy} \cdot S_{jy} = 1)) \qquad (1 \le x \le NQ;\ 1 \le y \le NS)$

This is an example of an arc consistency algorithm*‘. The refinement procedure tests each one in M_D to see whether the condition is satisfied, changing the one to zero if it is not. If any change took place, the procedure is repeated. If M_NQ is left unchanged by condition C2, then M_NQ represents a subgraph isomorphism. The algorithm is thus as follows:

1 Form matrices P, S and M0. Set D, the depth of the tree search, to 1. Set M equal to M0 and then refine M. If the new M has one row of all zeros, then go to step 5.

2 If there is no node in the structure graph which could match pattern node D and which has not already been provisionally matched with an earlier pattern node, then go to step 7.

3 Find, from M, the next potential match for pattern node D. Set all other entries in the Dth row of M to zero and refine M. If the new M has one row of all zeros, then go to step 5.

4 If D is equal to the pattern size, then a subgraph isomorphism has been found, otherwise go to step 6 (the next level of the search tree).

5 If there are no more potential matches for pattern node D (compare with step 2) go to step 7. Otherwise set M equal to M_D and go to step 3 to try this new potential match.

6 Increase D by one (the next level of the tree search) and go to step 2.

7 No match has been found at this point in the tree search. If D = 1 then terminate, else subtract one from D, set M equal to M_D and backtrack to step 5.

The basic algorithm was modified so as to deal with labelled, weighted graphs by changing condition C2 to condition C3.

C3: $\forall x\;(P_{ix} = K) \Rightarrow (\exists y\;(M_{xy} = 1 \wedge |S_{jy} - K| < E)) \qquad (1 \le x \le NQ;\ 1 \le y \le NS)$

where E is the allowed tolerance for two distances to be regarded as equal.
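The refinement procedure with the tolerance-based condition C3, combined with a simple tree search, can be sketched in Python as below. This is a hedged illustration rather than the authors' FORTRAN 77 program: M0 is initialized to all ones (atoms assumed to be of one type), the search is written recursively, and the names `refine` and `ullman_search` are hypothetical. The numeric distances (a = 3.0, b = 4.0, c = 5.0, and 9.0 for any distance not in the pattern) and the column order S5, S4, S2, S1, S3, S6 are assumptions chosen to mirror the six-atom example.

```python
def refine(M, P, S, eps):
    """Zero every M[i][j] that violates condition C3, repeating until stable.
    Returns False if some pattern atom is left with no candidate."""
    nq, ns = len(P), len(S)
    changed = True
    while changed:
        changed = False
        for i in range(nq):
            for j in range(ns):
                if M[i][j] and not all(
                    any(M[x][y] and abs(S[j][y] - P[i][x]) < eps
                        for y in range(ns) if y != j)
                    for x in range(nq) if x != i
                ):
                    M[i][j] = 0
                    changed = True
    return all(any(row) for row in M)


def ullman_search(P, S, eps=0.25):
    """Return the structure column matched to each pattern atom, or None."""
    nq, ns = len(P), len(S)

    def search(M, depth, used):
        if not refine(M, P, S, eps):
            return None
        if depth == nq:
            return [row.index(1) for row in M]
        for j in range(ns):
            if M[depth][j] and j not in used:
                trial = [row[:] for row in M]  # copy, so backtracking is trivial
                trial[depth] = [int(k == j) for k in range(ns)]
                found = search(trial, depth + 1, used | {j})
                if found is not None:
                    return found
        return None

    return search([[1] * ns for _ in range(nq)], 0, frozenset())


X = 9.0  # stands for "a distance not present in the pattern"
P = [[0, 3, 5], [3, 0, 4], [5, 4, 0]]
S = [[0, 3, X, X, X, 5],   # S5
     [3, 0, X, X, 4, X],   # S4
     [X, X, 0, 3, 4, X],   # S2
     [X, X, 3, 0, 5, X],   # S1
     [X, 4, 4, 5, 0, X],   # S3
     [5, X, X, X, X, 0]]   # S6
print(ullman_search(P, S))  # [3, 2, 4], i.e. P1 = S1, P2 = S2, P3 = S3
```

On this input the initial refinement alone reduces M to a single one per row, so the tree search that follows never has to backtrack.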

In the example of Figure 1, the structure atoms were reordered as (S5, S4, S2, S1, S3, S6) so as to avoid the method immediately finding the match (S1, S2, S3) in the first 3 rows and columns of the matrix M0. The distance matrices for the pattern and for this new ordering of the structure are:

Distance matrix for the pattern

      P1  P2  P3
P1    -   a   c
P2    a   -   b
P3    c   b   -

Distance matrix for the structure

      S5  S4  S2  S1  S3  S6
S5    -   a   X   X   X   c
S4    a   -   X   X   b   X
S2    X   X   -   a   b   X
S1    X   X   a   -   c   X
S3    X   b   b   c   -   X
S6    c   X   X   X   X   -

where X indicates that the distance is not one of those contained in the pattern. As all the atoms are of the same type, any structure atom could possibly match any pattern atom, and so M0 is initially a 3*6 matrix with every element set to one, although more sophisticated approaches might be used to initialize M0, e.g. one might check that each of the structure atoms had neighbours at the right distances.

The first step in the algorithm requires M to be set equal to M0 and then refined. The refinement involves finding each element M_ij which has the value one and then checking whether, for these values of i and j, condition C3 holds. If it does not, M_ij is set equal to zero. The matrix M after the first application of the refinement procedure is

1 0 0 1 0 0
0 1 1 0 0 0
0 0 0 0 1 0

As the refinement procedure changed some elements of M, it is reapplied to M. In more detail, the table below shows the working out of condition C3 for each non-zero element of M.

i   j   x   distance   y   M_xy   Action
1   1   2   a          2   1
1   1   3   c          6   0      M_11 := 0
1   4   2   a          3   1
1   4   3   c          5   1      No change
2   2   1   a          1   0
2   2   3   b          5   1      M_22 := 0
2   3   1   a          4   1
2   3   3   b          5   1      No change
3   5   1   c          4   1
3   5   2   b          3   1      No change

In each step, i and j are first assigned to be the row and column numbers respectively of the non-zero element. Next, x and the relevant pattern distance are found from matrix P before y is found from matrix S. Finally, the element M_xy from M is examined and if it is zero, then M_ij is set equal to zero. In fact, the algorithm may be terminated after the second application of the refinement procedure, where M is of the form:

0 0 0 1 0 0
0 0 1 0 0 0
0 0 0 0 1 0

Since each row contains a single one, i.e. a unique pattern atom-to-structure atom equivalence, a match for the pattern has been found in the structure without any recourse to backtracking; this was found to be the usual case in the searches which were undertaken.

EXPERIMENTAL STUDIES OF ALGORITHMIC EFFICIENCY

The four algorithms described above were implemented in FORTRAN 77 on a Prime 9950 minicomputer, and the efficiencies tested using structures from the CCDB.

The first set of experiments involved selecting structures containing 14, 25, 42 and 60 nonhydrogen atoms respectively, and then searching for the presence or absence of patterns containing from 3 to 15 atoms. The query patterns were obtained by selecting the appropriate number of atoms from the target structure, and then slightly distorting some of the interatomic distances. A pattern interatomic distance was accepted as matching a structure interatomic distance if the margin of error was ≤ 0.25 Å; thus depending upon the degree of distortion imposed on the pattern it was possible to investigate the efficiency of both successful and unsuccessful searches. The patterns were specifically chosen to consist of carbon atoms; the evaluation can thus be regarded as a worst case test of algorithmic efficiency since the majority of the structure atoms are also carbons, and since pharmacophoric patterns normally involve heteroatoms.

The results of the searches are presented in Table 1. To improve the accuracy of the timing, each run involved 20 searches for the pattern in the query molecule with the recorded times being the average over the 20 runs. Some of the entries in the tables for the Lesk algorithm are represented by ‘*’. Such entries correspond to runs in which the algorithm generated such an exorbitant number of combinations that it was not possible to run the program to completion; this problem is discussed in more detail later.

The second set of experiments involved a subfile of 250 molecules selected from the start of the CCDB. Every 25th molecule was used to generate patterns of size 3, 5 and 7 atoms by randomly selecting the appropriate number of carbon atoms from the structure. Because one of the selected molecules contained 4 carbon atoms and another one 5, only 9 patterns of size 5 and 8 patterns of size 7 were used, giving a total of 27 patterns which were then run against the subfile to see how many times these patterns occurred; the means and standard deviations of the times for these runs are given in Table 2. Unfortunately, Lesk’s algorithm again suffered from the combinatorial problem and no times were obtained for it.

Table 2. Run times (s of CPU time on a Prime 9950) for pharmacophoric pattern searches of 250 structures. The two figures in each case correspond to the mean and the standard deviation averaged over the sets of query patterns

| Algorithm      | Pattern size 3 | Pattern size 5 | Pattern size 7 |
|----------------|----------------|----------------|----------------|
| Lesk           | * *            | * *            | * *            |
| Set reduction  | 66 18          | 131 69         | 156 82         |
| Clique finding | 44 1           | 109 15         | 235 24         |
| Ullman         | 31 2           | 37 10          | 33 10          |

DISCUSSION

It is important to note that the results reported in Tables 1 and 2 are specific to the environment in which the experiments were carried out. Different results might be obtained if, for example, the structures to be searched were those resulting from an initial screening search followed by a distance search as in our previous experiments’, or if the query patterns contained many heteroatoms. If, however, it is accepted that the results presented here do provide a general indication of the relative performances of the algorithms, it is clear that Ullman’s algorithm is by far the quickest and that its relative advantage increases with the size of the pattern, although the suggestion of Gund’, that only 4(NQ-3)+2 distances should be used rather than the NQ(NQ-1)/2 used here, might reduce this advantage to some extent. It is noticeable that the algorithm is quicker at unsuccessful searches than at successful ones.
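To give a feel for how much Gund's suggestion would reduce the number of distance comparisons, the two counts can be tabulated. This sketch assumes that NQ is the number of atoms in the query pattern and that the first expression reads 4(NQ - 3) + 2; the printed expression is damaged in this copy, so that reading is an assumption. The function names are illustrative.

```python
def all_pair_distances(nq):
    """Number of interatomic distances when every atom pair is used."""
    return nq * (nq - 1) // 2


def reduced_distances(nq):
    """Assumed reading of Gund's count: 4(NQ - 3) + 2 distances."""
    return 4 * (nq - 3) + 2


for nq in (5, 7, 9, 11, 13, 15):
    print(nq, all_pair_distances(nq), reduced_distances(nq))
```

On this reading the two counts coincide at NQ = 5 (10 distances each) and diverge for larger patterns, reaching 105 against 50 at NQ = 15, which is where any effect on the relative timings would be most visible.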

Table 1. Run times (s of CPU time on a Prime 9950) for pharmacophoric pattern searches of molecules of size (a) 14, (b) 25, (c) 52 and (d) 60 atoms. In each case, two times are reported, the first corresponding to the time for a search when the pattern was present in the structure and the second corresponding to the time for a search when the pattern was absent

| Molecule | Algorithm      | Size 3    | Size 5    | Size 7    | Size 9    | Size 11   | Size 13   | Size 15    |
|----------|----------------|-----------|-----------|-----------|-----------|-----------|-----------|------------|
| a        | Lesk           | 0.17 0.03 | 0.07 0.04 | 0.13 0.09 | 0.12 0.16 |           |           |            |
| a        | Set reduction  | 0.03 0.02 | 0.07 0.04 | 0.10 0.08 | 0.15 0.12 |           |           |            |
| a        | Clique finding | 0.03 0.03 | 0.06 0.06 | 0.10 0.11 | 0.15 0.16 |           |           |            |
| a        | Ullman         | 0.03 0.02 | 0.04 0.02 | 0.06 0.02 | 0.08 0.02 |           |           |            |
| b        | Lesk           | * *       | * *       | * *       | * *       | * *       |           |            |
| b        | Set reduction  | 0.77 0.14 | 0.94 0.35 | 1.56 0.90 | 1.64 1.34 | 2.21 1.97 |           |            |
| b        | Clique finding | 0.16 0.18 | 0.39 0.62 | 0.75 1.48 | 1.28 2.63 | 1.95 4.46 |           |            |
| b        | Ullman         | 0.12 0.05 | 0.19 0.06 | 0.26 0.17 | 0.28 0.13 | 0.38 0.13 |           |            |
| c        | Lesk           | 1.48 0.31 | 0.26 0.22 | 0.54 0.46 | 0.99 0.83 | 1.43 1.25 | 2.05 1.69 |            |
| c        | Set reduction  | 0.15 0.14 | 0.35 0.37 | 0.64 0.69 | 1.03 1.10 | 1.51 1.57 | 1.89 2.04 |            |
| c        | Clique finding | 0.10 0.10 | 0.30 0.34 | 0.54 0.69 | 0.89 1.15 | 1.31 1.66 | 1.60 2.00 |            |
| c        | Ullman         | 0.10 0.09 | 0.13 0.10 | 0.18 0.08 | 0.25 0.09 | 0.35 0.09 | 0.44 0.09 |            |
| d        | Lesk           | 0.45 0.29 | 1.36 1.39 | 0.88 0.83 | 1.31 1.26 | 1.98 1.88 | 2.91 2.74 | 4.54 4.12  |
| d        | Set reduction  | 0.25 0.74 | 0.80 0.74 | 1.21 1.26 | 2.62 2.67 | 4.11 4.27 | 5.43 5.54 | 8.96 9.35  |
| d        | Clique finding | 0.36 0.38 | 1.21 1.64 | 1.73 2.21 | 2.94 4.39 | 4.31 7.20 | 5.64 9.45 | 9.03 16.93 |
| d        | Ullman         | 0.20 0.13 | 0.27 0.13 | 0.32 0.17 | 0.43 0.17 | 0.58 0.17 | 0.73 0.18 | 0.92 0.18  |

The problem under investigation was to determine whether a specific 3D pattern of atoms was present in a molecule, and so, when a clique of the same size as the pattern was found by the clique detection algorithm, the search was successful and could terminate. If no clique of sufficient size is present in the correspondence graph, then all the cliques need to be generated by the algorithm to establish this fact, and thus the clique detection algorithm is not recommended for use if many unsuccessful searches are expected; in addition, this algorithm becomes relatively less efficient as the size of the pattern or of the structure increases.
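The behaviour described here can be illustrated with a small Python sketch of a correspondence graph and a clique search that stops at the first clique of the required size. This is a simplified backtracking search in the spirit of clique enumeration, not the Bron and Kerbosch procedure used in the paper. The function names, the coordinate-tuple representation and the omission of atom types (all atoms are treated as carbons, as in the test patterns) are assumptions.

```python
import math
from itertools import combinations


def correspondence_graph(pattern, structure, tol=0.25):
    """Nodes pair pattern atom i with structure atom j; an edge joins two
    nodes when the pattern and structure distances agree within tol."""
    nodes = [(i, j) for i in range(len(pattern)) for j in range(len(structure))]
    adj = {node: set() for node in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # an atom cannot be matched twice
        dp = math.dist(pattern[i], pattern[k])
        ds = math.dist(structure[j], structure[l])
        if abs(dp - ds) <= tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj


def find_clique(adj, size):
    """Return the first clique of the given size, or None if there is none."""
    def expand(clique, candidates):
        if len(clique) == size:
            return clique  # successful search: stop immediately
        for v in list(candidates):
            found = expand(clique + [v], candidates & adj[v])
            if found is not None:
                return found
            candidates = candidates - {v}  # v has been fully explored
        return None

    return expand([], set(adj))
```

A successful search returns as soon as one clique of the pattern's size is found, whereas an unsuccessful search has to exhaust every branch before returning `None`, which mirrors the asymmetry between the present and absent timings for clique finding in Table 1.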

The Lesk and set reduction algorithms achieve efficiencies in operation by means of a partitioning step which reduces the number of potential matches which must be processed in the final stage. The Lesk algorithm would appear, in theory at least, to be much less powerful since it can never eliminate more structure atoms from consideration than can the set reduction approach; the latter involves not only the identification of structure atoms with neighbours at the right distances, but also considers whether these neighbours are also potential matches for atoms in the pattern. In practice, the two algorithms are of broadly comparable efficiencies owing to the rather complicated operations that need to be carried out on the pair lists in the distance table that forms the basis for the set reduction algorithm; this characteristic becomes of greater importance as the size of the pattern increases. However, this disadvantage is often compensated for by the fact that the set reduction algorithm manages to avoid generating most of the possible combinations by using the depth first search described previously.
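The partitioning idea behind set reduction can be sketched in Python as an iterative pruning of candidate sets. This is a simplified illustration under stated assumptions: it works directly on coordinates rather than on the pair lists of a distance table, it ignores atom types, and `reduce_candidates` is a hypothetical name rather than the authors' routine.

```python
import math


def reduce_candidates(pattern, structure, tol=0.25):
    """For each pattern atom, keep only the structure atoms that have, for
    every other pattern atom, a neighbour at the right distance which is
    itself still a candidate for that other pattern atom."""
    nq, ns = len(pattern), len(structure)
    cand = [set(range(ns)) for _ in range(nq)]
    changed = True
    while changed:  # repeat until no further atom can be eliminated
        changed = False
        for i in range(nq):
            for j in list(cand[i]):
                for k in range(nq):
                    if k == i:
                        continue
                    dp = math.dist(pattern[i], pattern[k])
                    supported = any(
                        l != j
                        and abs(dp - math.dist(structure[j], structure[l])) <= tol
                        for l in cand[k]
                    )
                    if not supported:
                        cand[i].discard(j)
                        changed = True
                        break
    return cand
```

If any candidate set becomes empty the pattern cannot be present and no final stage is needed. Dropping the requirement that the neighbour be a member of `cand[k]` gives a weaker check that only looks for neighbours at the right distances, which is the sense in which the text describes the Lesk partitioning as unable to eliminate more atoms than set reduction.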

The Lesk and set reduction algorithms are not always able to reduce the number of structure atoms to an acceptable level, even if the structure does not contain the query pattern. Thus, for one of the size-7 query patterns compared with the set of 250 molecules, the final stage involved 15 possible matches for the first pattern atom, 16 for the second, 18 for the third, 16 for the fourth, 14 for the fifth, 14 for the sixth and 15 for the seventh. This corresponds to a total of over 200 million possible combinations, and the simple approach used in the final stage of the Lesk algorithm had to be aborted after it had used over 40 min of CPU time; however, the depth first search used in the set reduction algorithm ran to completion in a fraction of a second.
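The size of this final stage, and the reason a depth first search copes with it, can be shown with a short Python sketch. The candidate counts are those quoted in the text; `depth_first_match` is an illustrative function, not the authors' code, and it assumes atoms are given as coordinate tuples with candidate structure atoms listed per pattern atom.

```python
import math

# Candidate counts for the seven pattern atoms, as quoted in the text.
counts = [15, 16, 18, 16, 14, 14, 15]
total_combinations = math.prod(counts)  # 203212800, i.e. over 200 million


def depth_first_match(pattern, structure, cand, tol=0.25):
    """Assign pattern atoms one at a time, abandoning a partial assignment
    as soon as one of its distances fails to match."""
    assignment = []

    def extend(i):
        if i == len(pattern):
            return True
        for j in cand[i]:
            if j in assignment:
                continue
            if all(
                abs(math.dist(pattern[i], pattern[k])
                    - math.dist(structure[j], structure[assignment[k]])) <= tol
                for k in range(i)
            ):
                assignment.append(j)
                if extend(i + 1):
                    return True
                assignment.pop()
        return False

    return assignment if extend(0) else None
```

Enumerating every combination means testing each of the 203,212,800 tuples in full, whereas the depth first search discards a whole subtree whenever the first two or three assigned atoms are already inconsistent, so most combinations are never generated.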

When it did not suffer from the combinatorial problem, Lesk’s algorithm was in the same performance range as the clique finding and set reduction approaches, performing relatively better with large patterns. This characteristic becomes of importance if larger structures are considered than the small molecules used here, e.g. the 3D structures in the Protein Data Bank, where many hundreds or thousands of atoms may need to be considered. For such structures, the clique detection and Ullman algorithms cannot be used owing to the size of the matrices required by these two procedures, and thus the choice is restricted to the Lesk and set reduction algorithms. Brint¹⁷ has found evidence to suggest that the Lesk algorithm gives rather better results owing to the complexity of the distance table operations required in the set reduction processing; the authors are currently considering the best way to tackle the problem of substructure searching in such macromolecules. For small molecular structures of the sort tested here, there would seem little doubt that the Ullman algorithm is by far the most efficient, and the algorithm of choice for any practical system for pharmacophoric pattern searching.
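For readers unfamiliar with the approach, the following Python sketch shows a refinement-plus-backtracking search in the style of Ullman's subgraph isomorphism algorithm, adapted here to interatomic distances. It is a simplified illustration under stated assumptions (coordinate tuples, no atom types, a hypothetical `ullmann_match` function) and is not the implementation timed in the paper.

```python
import math


def ullmann_match(pattern, structure, tol=0.25):
    """Return a list mapping pattern atom i to a structure atom, or None."""
    nq, ns = len(pattern), len(structure)
    dp = [[math.dist(a, b) for b in pattern] for a in pattern]
    ds = [[math.dist(a, b) for b in structure] for a in structure]

    def refine(m):
        """Clear m[i][j] whenever some other pattern atom k has no remaining
        candidate at a matching distance; fail if any row becomes empty."""
        changed = True
        while changed:
            changed = False
            for i in range(nq):
                for j in range(ns):
                    if m[i][j] and not all(
                        any(m[k][l] and l != j and abs(dp[i][k] - ds[j][l]) <= tol
                            for l in range(ns))
                        for k in range(nq) if k != i
                    ):
                        m[i][j] = False
                        changed = True
                if not any(m[i]):
                    return False
        return True

    def search(m, row):
        if row == nq:
            return [r.index(True) for r in m]
        for j in range(ns):
            if not m[row][j]:
                continue
            trial = [r[:] for r in m]
            trial[row] = [c == j for c in range(ns)]  # fix row -> j
            for other in range(nq):
                if other != row:
                    trial[other][j] = False  # j is no longer available
            if refine(trial):
                result = search(trial, row + 1)
                if result is not None:
                    return result
        return None

    start = [[True] * ns for _ in range(nq)]
    return search(start, 0) if refine(start) else None
```

The sketch holds an NQ × NS match matrix together with full distance matrices for the pattern and the structure, which gives some idea of why matrix-based procedures become impractical for structures containing thousands of atoms. Because refinement is applied at every node of the search tree, an absent pattern is often rejected before any backtracking takes place.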

Volume 5 Number 1 March 1987

ACKNOWLEDGEMENTS

The authors thank David Bawden, Jeremy Fisher and Susan Jakes for their contributions to the set-reduction algorithm and the Science and Engineering Research Council for the award of a Research Studentship to ATB.

REFERENCES

1 Gund, P et al. ‘Computer searching of a molecular structure for pharmacophoric patterns’ in Proc. Int. Conf. Comp. Chem. Res. Educ. Ljubljana, Yugoslavia (July 12-17 1973) pp 33-38

2 Gund, P ‘Three-dimensional pharmacophoric pattern searching’ Progr. Mol. Sub-Cell. Biol. Vol 5 (1977) pp 117-143

3 Esaki, T ‘Fundamental studies on quantitative drug design’ PhD thesis Nagoya University, Japan (1983)

4 Ash, J E and Hyde, E Chemical information systems Ellis Horwood, Chichester, UK (1975)

5 Ash, J E et al. Communication, storage and retrieval of chemical information Ellis Horwood, Chichester, UK (1985)

6 Jakes, S E and Willett, P ‘Pharmacophoric pattern matching in files of 3-D chemical structures: selection of inter-atomic distance screens’ J. Mol. Graph. Vol 4 (1986) pp 12-20

7 Allen, F H et al. ‘The Cambridge Crystallographic Data Centre: computer-based search, retrieval, analysis and display of information’ Acta Cryst. Vol B35 (1979) pp 2331-2339

8 Jakes, S E et al. ‘Pharmacophoric pattern matching in files of 3-D chemical structures: evaluation of search performance’ J. Mol. Graph. Vol 5 No 1 (March 1987) pp 41-48

9 Sundaram, K et al. ‘A quantitative approach to the comparison of biomolecular topographies’ Physiol. Chem. Phys. Vol 6 (1974) pp 469-478

10 Lesk, A M ‘Detection of 3-D patterns of atoms in chemical structures’ Commun. ACM Vol 22 (1979) pp 219-224

11 Kuntz, I D et al. ‘A geometrical approach to macromolecule-ligand interactions’ J. Mol. Biol. Vol 161 (1982) pp 269-288

12 Golender, V and Rozenblit, A Logical and combinatorial algorithms for drug design Research Studies Press, Letchworth, UK (1983)

13 Crandell, C W and Smith, D H ‘Computer-assisted examination of compounds for common three-dimensional substructures’ J. Chem. Inf. Comp. Sci. Vol 23 (1983) pp 186-197

14 Kuhl, F S et al. ‘A combinatorial algorithm for calculating ligand binding’ J. Comput. Chem. Vol 5 (1984) pp 24-34

15 Danziger, D J and Dean, P M ‘The search for functional correspondences in molecular structure between two dissimilar molecules’ J. Theor. Biol. Vol 116 (1985) pp 215-224

16 McLachlan, A D ‘Rapid comparison of protein structures’ Acta Cryst. Vol A38 (1982) pp 871-873

17 Brint, A T PhD thesis in preparation

18 Sussenguth, E H ‘A graph-theoretic algorithm for matching chemical structures’ J. Chem. Document. Vol 5 (1965) pp 36-43

19 Figueras, J ‘Substructure search by set reduction’ J. Chem. Document. Vol 12 (1972) pp 237-244

20 Levi, G ‘A note on the derivation of maximal common subgraphs of two directed or undirected graphs’ Calcolo Vol 9 (1972) pp 341-352

21 Barrow, H G and Burstall, R M ‘Subgraph isomorphism, matching relational structures and maximal cliques’ Inf. Process. Lett. Vol 4 (1976) pp 83-84

22 Barrow, H G and Tenenbaum, J M ‘Computational vision’ Proc. IEEE Vol 69 (1981) pp 572-595

23 McGregor, J J ‘Backtrack search algorithms and the maximal common subgraph problem’ Software Pract. Exper. Vol 12 (1982) pp 23-34

24 McGregor, J J and Willett, P ‘Use of a maximal common subgraph algorithm in the automatic identification of the ostensible bond changes occurring in chemical reactions’ J. Chem. Inf. Comput. Sci. Vol 21 (1981) pp 137-140

25 Willett, P Modern approaches to chemical reaction searching Gower, Aldershot, UK (1986)

26 Bron, C and Kerbosch, J ‘Algorithm 457. Finding all cliques of an undirected graph’ Commun. ACM Vol 16 (1973) pp 575-577

27 Ullman, J R ‘An algorithm for subgraph isomorphism’ J. ACM Vol 23 (1976) pp 31-42

28 McGregor, J J ‘Relational consistency algorithms and their application in finding subgraph and graph isomorphisms’ Inf. Sci. Vol 19 (1979) pp 229-250


56 Journal of Molecular Graphics