Quantitative Structure-Activity Relationships (QSAR)

Tutorial

Chemometrics and Intelligent Laboratory Systems, 6 (1989) 181-190
Elsevier Science Publishers B.V., Amsterdam - Printed in The Netherlands

W.J. DUNN III

Department of Medicinal Chemistry and Pharmacognosy, University of Illinois at Chicago, 833 S. Wood Street, Chicago, IL 60612 (U.S.A.)

(Received 23 October 1988; accepted 28 February 1989)

CONTENTS

Abstract
1 The beginnings of QSAR
2 The general QSAR problem
2.1 The structure-activity data
2.2 The biological activity data
2.3 The chemical descriptor data
2.4 Data structure in QSAR studies
3 Modeling relationships between chemical structure and biological activity; four levels of QSAR
3.1 QSAR at level one
3.2 QSAR at level two
3.3 QSAR at level three
3.4 QSAR at level four
4 Conclusions
References

ABSTRACT

Dunn III, W.J., 1989. Quantitative structure-activity relationships (QSAR). Chemometrics and Intelligent Laboratory Systems, 6: 181-190.

Drug design has been influenced considerably since the first quantitative structure-activity relationship (QSAR) study was published by Hansch and his coworkers in 1962. These workers used mathematical models developed from multivariable data analysis methods to correlate changes in biological activity with changes in chemical structure for series of drugs. Since the publication of this pioneering work of Hansch, QSAR has become extensively used in drug design. There are a number of unique, but practical, problems involved in the development of QSARs. These aspects of QSAR research are discussed.

0169-7439/89/$03.50 © 1989 Elsevier Science Publishers B.V.




1 THE BEGINNINGS OF QSAR

Chemistry is the discovery of the order that exists in chemical systems and establishing quantitative rules which describe this order. Finding quantitative relationships between the chemical structure of atoms and molecules and their properties governed much of the effort of early chemists. The task of the study of quantitative structure-activity relationships (QSAR) is essentially that of finding quantitative relationships between the biological activities of compounds and their chemical structures. Consider the generic structure (I) in Fig. 1. A basic problem in physical organic chemistry is one of developing rules which can predict the change in physical properties, e.g. density, boiling point, etc., for members of the series of compounds (I) in which Y represents some change in structure relative to a reference substituent, usually Y = H. The problem in QSAR related to this may be formulated as one of finding similar rules which can lead to reliable predictions of the biological activities of the members of the series.

Such studies began in the latter part of the 19th century when chemists began to study the biological effects of compounds. It was only a natural extension of these early efforts to relate chemical structure to physical properties that led Crum Brown and Fraser [1] to propose in 1868 that the basis of the physiological action of compounds was their chemical structure. They further proposed that functional relationships between structure and activity (QSAR) could be obtained within the framework of mathematics. The next significant early development was from the work of Meyer [2] and Overton [3] who independently proposed a theory of narcosis. They proposed that the general anesthetic potency of a compound was related to its partition coefficient between olive oil and water. This was not expressed quantitatively, however. The first quantitative relationship, to this author’s knowledge, was published by Furukawa in a series of three papers in the Journal of Tokyo Chemical Society (CA, 13: 977 *) discussing the biological activity of perfumes [4-6]. The significant result of this work was Furukawa’s proposal that, within a series of similar compounds, a cutoff value in ‘size’ will be observed which will result in a decrease in activity. Size was defined in terms of the number of carbon atoms in a molecule.

Fig. 1. Generic structure for a series of drugs for a QSAR study.

The groundwork for a further understanding of relationships between chemical structure and biological activity can be attributed to Hammett [7] whose work is summarized in the book Physical Organic Chemistry, published in 1940. The work of Hammett dealt with chemical structure and reactivity, and the relationship between structure and activity was formulated as the Hammett equation, which was called a linear free energy relationship (LFER). This provided the basis for Hansch’s [8] first QSAR paper in 1962. This work, on the relationship between the chemical structure and the biological activity of phenoxyacetic acid plant growth regulators, was the first paper on QSAR and Hansch is considered to be the father of this field. His work extended the ideas of Hammett [7] to the effect of changes in structure of compounds on their levels of biological activity.

The initial QSAR model, generally stated below (eq. 1) as the ‘Hansch model for QSAR’, is now regarded as a special case of a more general model for the relationship between chemical structure and biological activity of compounds. Before a discussion of this general problem is begun some basic ideas are presented.

$$\log (1/C) = a + b \log P + c\,(\log P)^2 + d\,(\mathrm{steric}) + e\,(\mathrm{electronic}) \tag{1}$$

In the Hansch model, biological activity is determined as a standard response where C is the molar concentration required to obtain a predetermined level of activity. This may be an LD50, I50, etc. This arranges activity on an increasing scale and also has the effect of log normalizing the data, a practice quite common in physical organic chemistry. Here log P is the logarithm of the partition coefficient for the compound between a nonpolar solvent and water. The nonpolar solvent most generally used is 1-octanol. This parameter is assumed to model the interactions of bioactive compounds with various biophases in the biological system. Other solvents have been used but empirically 1-octanol is the most useful.

* I am indebted to Professor Toshio Fujita, Department of Agricultural Chemistry, Kyoto University, Kyoto, Japan for assisting with the translation of these articles.

In the early work of Hansch, the quadratic term was observed to be statistically significant but could not be explained since there was no interpretation for it within the framework of the Hammett equation. It is now understood to be the result of differences in drug distribution in biophases. Within a series of drugs, very hydrophilic ones tend to be excreted while very lipophilic ones tend to concentrate in nonpolar biophases. In either case the drugs become unavailable for critical receptors and have lower activities. This is consistent with the finding, in almost every case in which the second order term is significant, that the coefficient c < 0 and plots of log 1/C vs. log P exhibit a maximum in log P. This has been termed the ‘optimum log P’ for a series.
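The location of this maximum follows directly from eq. 1. As a short worked step, assuming the steric and electronic terms are held fixed, the derivative with respect to log P is set to zero:

$$\frac{\partial \log (1/C)}{\partial \log P} = b + 2c \log P = 0 \quad\Longrightarrow\quad \log P_{\mathrm{opt}} = -\frac{b}{2c}$$

The second derivative is 2c, so the stationary point is a maximum exactly when c < 0, in agreement with the observation above.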

Within the framework of the Hansch method, steric and electronic effects can be modelled by a number of parameters, the majority of which are linear free energy-based substituent constants. Examples are the Taft [9] steric and Verloop [10] STERIMOL constants. The effects of electronic changes within a series of aromatic compounds are assumed to be modelled by Hammett σ-constants [7] which are obtained from pKa data on benzoic acid and substituted benzoic acids. Log P is assumed to be a linear free energy-based property which has additive-constitutive properties like pKa. This allows one to describe the lipophilicity changes with a substituent constant, the Hansch π-constant, with benzene as the parent molecule.

In practice QSARs are developed by measuring biological activities as the dependent variable and physicochemical properties on a series of compounds as independent variables. Using multiple regression, predictive models are then developed.
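The regression step can be sketched as an ordinary least-squares fit of eq. 1. The following is a minimal illustration only: the function name `fit_hansch`, the descriptor values and the coefficients are all hypothetical, and the activities are generated without noise from known coefficients so that the fit can be checked by eye.

```python
import numpy as np

def fit_hansch(log_p, steric, electronic, log_inv_c):
    """Least-squares fit of eq. 1; returns the coefficients (a, b, c, d, e)."""
    log_p = np.asarray(log_p, dtype=float)
    design = np.column_stack([
        np.ones_like(log_p),                   # a: intercept
        log_p,                                 # b: log P
        log_p ** 2,                            # c: (log P)^2
        np.asarray(steric, dtype=float),       # d: steric constant
        np.asarray(electronic, dtype=float),   # e: electronic constant
    ])
    target = np.asarray(log_inv_c, dtype=float)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

# Synthetic series of eight compounds (made-up values, not from any study).
log_p = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
steric = np.array([0.0, -0.5, -1.2, 0.0, -0.3, -1.0, -0.2, -0.7])
electronic = np.array([0.0, 0.7, 0.2, -0.3, 0.5, 0.1, -0.2, 0.3])
true_coef = np.array([2.0, 1.2, -0.3, 0.4, 0.8])  # a, b, c, d, e
log_inv_c = (true_coef[0] + true_coef[1] * log_p + true_coef[2] * log_p ** 2
             + true_coef[3] * steric + true_coef[4] * electronic)

coef = fit_hansch(log_p, steric, electronic, log_inv_c)
optimum_log_p = -coef[1] / (2.0 * coef[2])  # maximum of the parabola when c < 0
```

With real data the activities carry experimental error, so the recovered coefficients would need confidence intervals and the model would need validation before use in prediction.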

In spite of the numerous successes in drug design, this approach lacks generality for a number of reasons. The parabolic form of the model has been shown to be inappropriate in some cases where nonlinearity of log P with activity is observed [11,12]. Another problem with the approach is that it cannot deal with inactive compounds. Understanding why a compound is inactive, and being able to predict this result, is certainly an important aspect of drug design.

The use of multiple regression essentially treats the QSAR as a univariate problem when it is in fact multivariate. In the discussion which follows, the development of QSAR and drug design is presented from a multivariate approach. The most commonly used data analysis methods in QSAR have been the subject of tutorials in this [13] and other publications [14,15]. Therefore, only a brief review of the statistical methods will be given here. The emphasis will rather be placed on specific properties of these methods as they apply to this problem.

2 THE GENERAL QSAR PROBLEM

Referring to general structure (I) above, the problem of QSAR as stated by Crum Brown and Fraser [1] is to find, within the framework of mathematics, relationships between the change in chemical structure and the change in physiological actions. This should include the problem of whether a compound will be active or inactive, which leads to formulation of the problem as one of classification. This is what has been called ‘the four levels of pattern recognition’ [16]. At the first level the objective is to classify a compound of unknown activity into one of two or more defined classes, e.g. agonist vs. antagonist, substrate vs. inhibitor, etc. It must be known in advance that the compound is one of these defined groups. Also, certain classes of compounds do not belong explicitly to one class but can exhibit dual activities. For example, a drug can be a partial agonist or partial antagonist, which QSAR methods should predict. At the next level of classification, a compound of unknown activity can be categorized as being a member of one of a group of defined classes or be a member of none of these. In other words, the unknown may be an outlier.

Fig. 2. Data for a QSAR problem. (Columns: Compound, Biological Activity, Chemical Descriptors; the rows labelled ‘Test set’ hold the compounds of unknown activity.)

At the first two levels of pattern recognition, qualitative relationships between structure and activity are developed. At the third level of pattern recognition, once an unknown has been assigned to a class, a single measure of biological activity is correlated with structure. At the fourth or highest level of pattern recognition, chemical structure descriptors are correlated with two or more measures of biological response. An example of this nature is the prediction of the genetic toxicity profile in a battery of tests for an unknown compound from its chemical structure.

2.1 The structure-activity data

The data analysis steps in QSAR studies are carried out from two basic data matrices. These are given below in Fig. 2 for a series of compounds on which biological activities, either categorical, e.g., active or inactive, or continuous, are tabulated with chemical descriptor data. The descriptor data are either measured in systems which model the biological system or are calculated. They are assumed to be relevant, in some way, to the problem of the physiological action of the compounds. If a coordinate system is constructed with axes defined in terms of the descriptor data, the space spanned by the data is referred to as the pattern space.

The biological data will be referred to as the dependent variables, Y, and the chemical data as the independent variables, X. The compounds of known activity are the training sets and the compounds of unknown activity are the test compounds. The primary objective of QSAR methods is to predict the activities of the test substances.
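The layout of the two matrices can be sketched as follows. This is an illustration under stated assumptions: the compound labels, descriptor values and activities are made up, and marking unknown activities with NaN is simply one convenient convention, not something prescribed here.

```python
import numpy as np

# Hypothetical series: five compounds, three descriptors, one activity measure.
compound_ids = ["cpd1", "cpd2", "cpd3", "cpd4", "cpd5"]

# X: one row per compound, one column per chemical descriptor.
X = np.array([
    [1.2, 0.23, -0.5],
    [1.8, 0.06, -1.1],
    [2.4, 0.45, -0.2],
    [3.1, 0.71, -0.9],
    [0.9, 0.00, -0.4],
])

# Y: one row per compound; NaN marks a compound whose activity is unknown.
Y = np.array([[4.1], [4.8], [np.nan], [5.3], [np.nan]])

# Compounds with fully known activity form the training set; the rest are test compounds.
is_training = ~np.isnan(Y).any(axis=1)
X_train, Y_train = X[is_training], Y[is_training]
X_test = X[~is_training]
test_ids = [cid for cid, known in zip(compound_ids, is_training) if not known]
```

Each row of X is then one point in the pattern space described above.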

2.2 The biological activity data

The biological activity data for the training compounds must result from measurements, and the data which result can be in several forms for QSAR studies. Most commonly, if compounds have detectable activity their activities can be obtained at a series of doses, which leads to a dose-response relationship. From this, standard responses can be obtained for each member of the series. If a compound in the series is inactive the data can be assigned discrete or binary values. In many cases the data are reported to have been obtained at a single dose for a series. In this case the data can be easily normalized, although this is not ideal because of the risk of error propagation.

The training sets and their measured biological activity data are assumed to span the spectrum of possible activities for their structure type. This assumption has rarely been fulfilled in studies performed to date. Only recently has the importance of experimental design in QSAR been realized, even though in 1973 Hansch and Unger proposed [17] that cluster analysis could be used in the selection of substituents in series design. Cluster analysis was used by Dunn et al. [18] to design a series of antitumor agents. This technique, which finds clustering in higher dimensional descriptor data, is not optimal for QSAR studies because it cannot easily be applied to data for compounds with multiple substitution.

More recently Austel [19] has proposed that factorial design be used to select members of a series of compounds once a lead drug has been identified. Hellberg et al. [20] have published a complete fractional factorial design for peptides modified in four positions. This is a subset of the 20^4 (160,000) possible peptides which could result from substitution in four positions by the 20 possible natural amino acids, and it could be used as a series for QSAR studies that follow from a clue in which the same four positions are found to be important to biological activity. Factorial design can deal with the problem of multiple substitution, which cluster analysis cannot, and thus it is a promising technique for use in drug design and QSAR. The use of experimental design techniques is necessary once a clue has been found if biological activity is to be optimized with any efficiency in synthesis and testing. QSAR models developed on compounds which have been selected from experimental design will also give the most reliable predictions.
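The scale of the selection problem, and the way a fractional factorial design cuts it down, can be sketched as below. This is not the design of Hellberg et al. [20]; it is a generic two-level half-fraction in which each of the four positions is coded as low (-1) or high (+1) on a single hypothetical property, with the fourth position set by the product of the other three.

```python
from itertools import product

N_AMINO_ACIDS = 20
N_POSITIONS = 4

# Every possible peptide when each of four positions can hold any of 20 amino acids.
n_peptides = N_AMINO_ACIDS ** N_POSITIONS

def half_fraction_design():
    """Two-level 2^(4-1) design: positions A, B, C take every sign combination
    and position D is fixed by the generator D = A*B*C."""
    return [(a, b, c, a * b * c) for a, b, c in product((-1, 1), repeat=3)]

design = half_fraction_design()  # 8 coded peptides instead of 2**4 = 16
```

Each row would then be translated into an actual peptide by choosing, at every position, an amino acid that is low or high on the coded property.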

2.3 The chemical descriptor data

The chemical data used to describe the molecular structures of biologically active compounds are a most important aspect of a QSAR study. Most descriptors which have been used are based on the chemical reactivity analogy and reflect electronic distributions for the chemical agents. Hammett σ-constants, for example, are interpreted as reflecting electron withdrawing or donating effects of substituents. Such thinking assumes that it is only the variation within the series of chemical agents that is important and that no second-order interactions are significant. From the biochemical side, it is necessary to consider what happens to the biological system, and biochemists tend to think more in terms of size and shape when correlating structure with function and activity. Those involved in QSAR research are becoming more aware of this aspect of the problem, with a much better understanding of the resulting structure-activity relationships.

As mentioned above, the chemical data describing the training and test compounds can be obtained experimentally from model systems, or they can be calculated. In general, descriptors fall into two categories: (1) global types and (2) substituent types. Global variables are based on the whole molecule, examples being log P and dipole moment. These descriptors have the advantage of allowing the direct comparison of compounds which are apparently dissimilar. Substituent type descriptors have the advantage of being easy to approximate due to the assumptions of ‘similarity transference’. Substituent constants should be used with caution, however. In most of the earlier studies, substituent constants and other linear free energy-related variables were used. There has been considerable interest in the use of discrete variables such as counts of substructural fragments [21] and connectivity indices [22] but this is for the most part misguided and these approaches are not recommended, especially when used with data analysis methods which are sensitive to finding chance correlations [23]. This will be discussed in more detail later.

A rather interesting approach to molecular description, which is a variation of the linear free energy approach, is to use latent variables, or principal component scores, derived from the data matrix, X, as independent variables in the derivation of QSAR. This approach was used by Dunn et al. [24] to study the QSAR for the transdermal diffusion of steroids. In this study the significant latent variable could be identified. Hellberg et al. [20] describe the descriptors found this way as ‘principal properties’. The principal property approach, unless the principal properties can be identified, cannot easily lead to mechanistic interpretation of the QSAR. It should be noted that the problem of interpretation is separate from QSAR, however, and requires additional experimental data beyond that required for QSAR modeling.
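Deriving such latent variables can be sketched with a singular value decomposition of the autoscaled descriptor matrix. The function name and the small descriptor matrix below are hypothetical, and autoscaling (centering and scaling each column to unit variance) is an assumed preprocessing choice rather than one stated in the text.

```python
import numpy as np

def principal_component_scores(X, n_components):
    """Return scores, loadings and explained-variance fractions for autoscaled X."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscale each descriptor
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]    # latent variables, one row per compound
    loadings = Vt[:n_components].T                     # contribution of each descriptor
    explained = (s ** 2 / np.sum(s ** 2))[:n_components]
    return scores, loadings, explained

# Hypothetical descriptor matrix: six compounds, three descriptors.
X = np.array([
    [1.0, 0.2, 3.1],
    [2.0, 0.1, 2.9],
    [3.0, 0.4, 3.5],
    [4.0, 0.3, 3.3],
    [5.0, 0.6, 3.9],
    [6.0, 0.5, 3.7],
])
scores, loadings, explained = principal_component_scores(X, n_components=2)
```

The columns of `scores` can then replace the original descriptors as independent variables; inspecting `loadings` is one way to attempt the identification of a latent variable mentioned above.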

2.4 Data structure in QSAR studies

In the chemical descriptor data matrix above, each substance can be expressed as a vector in pattern space. When the data for a class are projected into pattern space, in the ideal case, the class will group in a geometrically defined cluster. For two or more classes, a cluster will result for each class. This is shown in Fig. 3 for two classes in a three-dimensional pattern space. Such classes are assumed to be composed of chemically and pharmacologically similar compounds. Here the data structure is said to be symmetric [25]. The term symmetric is used because the classes can each be defined by a mathematical model and are thus separated in pattern space. Enzyme substrates and inhibitors are expected to display symmetric data structures, as are receptor agonists and antagonists. Compounds which are found to be neither substrates nor inhibitors, nor agonists nor antagonists, are considered to be ‘outliers’ from these classes.

One of the most common classification problems in QSAR is the classification of compounds as active or inactive in a particular assay or group of assays. The assessment of the carcinogenic potential of a compound based on its response in several genetic toxicity assays is an example of such a problem. One might expect that this would be a straightforward case of a two class problem when, in general, it is not. If it is assumed that the active compounds exert their effect by the same mechanism and are described as being chemically similar they should form a well defined class. Compounds which can be detoxified or metabolised will be converted to easily excreted derivatives and removed from the system. These types of functions are generally functional group specific so that noncarcinogens are generally unique with regard to a particular mechanism of action. Such substances do not behave as a homogeneous class and display what is termed an asymmetric data structure [25] as shown in Fig. 3. The noncarcinogens are considered to be outliers from the active class. This has an important implication in classification studies. If an unknown is found to be ‘nonactive’ it could be a member of an as yet to be discovered class of carcinogens.

3 MODELING RELATIONSHIPS BETWEEN CHEMICAL STRUCTURE AND BIOLOGICAL ACTIVITY; FOUR LEVELS OF QSAR

3.1 QSAR at level one

Various data analysis techniques are available for developing QSARs. Selecting the one to use is a function of how the QSAR problem is formulated. This is done within the four levels of pattern recognition discussed above. At the first level the objective is to classify a compound into one of the defined classes. By formulating the problem in this way it is assumed that the compound of unknown activity must be a member of one of these classes. At this level, the hyperplane methods, such as linear discriminant analysis [26] and the linear learning machine [27], can be used only if the possibility of an asymmetric data structure can be precluded; they cannot be used when the data structure is asymmetric. Only class modeling techniques, such as SIMCA pattern recognition [28], can be used when such a data structure is observed. This method will be discussed in the next section.

Linear discriminant analysis and the linear learning machine tend to select variables which separate classes as compared to variables which determine class assignment. Also, they are extremely sensitive to collinearity in the chemical descriptor data and are subject to finding chance correlations. As a general rule, there should be at least four compounds per chemical descriptor in order to minimize the possibility of finding chance correlations. Due to the fact that in most QSAR problems, the compounds within the series studied are similar or generally of the same chemical class, collinearity is built into the problem so these techniques should be used with caution.
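Both cautions can be checked before a hyperplane method is applied. The sketch below is an assumption-laden illustration: the function name is hypothetical, the default ratio of four comes from the rule of thumb above, and the correlation cutoff of 0.9 for flagging collinear descriptor pairs is an arbitrary choice, not a value from the text.

```python
import numpy as np

def chance_correlation_checks(X, min_ratio=4.0, max_abs_r=0.9):
    """Return (enough_compounds, collinear_pairs) for a descriptor matrix X."""
    X = np.asarray(X, dtype=float)
    n_compounds, n_descriptors = X.shape
    enough_compounds = (n_compounds / n_descriptors) >= min_ratio
    r = np.corrcoef(X, rowvar=False)  # descriptor-by-descriptor correlation matrix
    collinear_pairs = [
        (i, j)
        for i in range(n_descriptors)
        for j in range(i + 1, n_descriptors)
        if abs(r[i, j]) > max_abs_r
    ]
    return enough_compounds, collinear_pairs

# Hypothetical data: eight compounds, three descriptors; column 1 is a linear
# function of column 0, so that pair is perfectly collinear.
col0 = np.arange(1.0, 9.0)
col1 = 2.0 * col0 + 1.0
col2 = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
X = np.column_stack([col0, col1, col2])

enough_compounds, collinear_pairs = chance_correlation_checks(X)
```

Here eight compounds for three descriptors falls short of the four-to-one rule, and the first two descriptors are flagged as collinear.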

3.2 QSAR at level two

At this level the objective of the QSAR study is to classify a compound of unknown class assignment into one of the defined classes with the possibility that the unknown can be a member of none of the classes. This is equivalent to the unknown being an outlier from the defined classes. Only class modeling methods of pattern recognition can operate at this level. Examples of these techniques are SIMCA [28], UNEQ [29] and PRIMA [30]. PRIMA is similar to SIMCA pattern recognition with zero principal component models as the classifier. These three methods have been compared recently [31] and a new pattern classifier, DASCO, has recently been reported [32].
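The idea of class modeling can be sketched in the spirit of SIMCA: each class gets its own principal component model, and an unknown is accepted by a class only if its residual distance to that model is comparable to the residuals of the training compounds. This is a simplified illustration, not the published SIMCA procedure: the class name `ClassModel`, the fixed acceptance factor `limit=2.0` (standing in for a statistical test on the residual variances) and the data are all assumptions, and the range of the scores within each class is not checked.

```python
import numpy as np

class ClassModel:
    """Principal component model of a single class (SIMCA-style sketch)."""

    def __init__(self, X_class, n_components):
        X_class = np.asarray(X_class, dtype=float)
        n, p = X_class.shape
        self.n_components = n_components
        self.mean = X_class.mean(axis=0)
        centered = X_class - self.mean
        _, _, Vt = np.linalg.svd(centered, full_matrices=False)
        self.loadings = Vt[:n_components].T
        residuals = centered - centered @ self.loadings @ self.loadings.T
        dof = max((n - n_components - 1) * (p - n_components), 1)
        self.s0 = np.sqrt(np.sum(residuals ** 2) / dof)  # typical class residual

    def residual_sd(self, x):
        centered = np.asarray(x, dtype=float) - self.mean
        residual = centered - centered @ self.loadings @ self.loadings.T
        return np.sqrt(np.sum(residual ** 2) / (centered.size - self.n_components))

def classify(x, models, limit=2.0):
    """Names of the classes whose model fits x; an empty list means an outlier."""
    return [name for name, m in models.items() if m.residual_sd(x) <= limit * m.s0]

# Two hypothetical classes, each lying close to a line in a 3-D pattern space.
class_a = [[0, 0, 0.1], [1, 1, -0.1], [2, 2, 0.1], [3, 3, -0.1], [4, 4, 0.1]]
class_b = [[10, 10.1, 10], [10, 9.9, 11], [10, 10.1, 12], [10, 9.9, 13], [10, 10.1, 14]]
models = {"A": ClassModel(class_a, 1), "B": ClassModel(class_b, 1)}

member_of = classify([2.5, 2.5, 0.05], models)   # lies close to the class A model
outlier = classify([5.0, 0.0, 5.0], models)      # fits neither class model
```

Because each class is modeled separately, an unknown can be assigned to one class, to several, or to none, which is what distinguishes this level from the hyperplane methods of level one.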


3.3 QSAR at level three

At level 3 an unknown is classified and in a subsequent step its level of activity is estimated. It follows that, if compounds of a given pharmacological class cluster in pattern space, then those compounds with similar levels of activity cluster within the pattern space for that class. This has been observed in several cases [33,34]. A level 3 classification corresponds to class assignment followed by a variation of a classical Hansch analysis using, in some cases, latent variables.

The data analysis techniques which can be applied in the post-classification step at this level are principal components regression [14,15] and partial least squares (PLS) regression [13,35]. Of the two methods PLS is recommended due to the fact that it is more stable when there is greater uncertainty in the biological data than in the chemical data. Generally, the problem is overdetermined in the chemical descriptor matrix and only a fraction of the variation in the chemical data is explained by the PLS analysis. PLS has the advantage that it extracts the latent variables from both X and Y that are systematic and correlated so, if there is systematic variation in both matrices, PLS finds it. Principal components regression is not as sensitive to this as it extracts latent variables independently from X and Y which may not be correlated.
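The difference between the two methods can be sketched with a deliberately idealized example. In the made-up data below the descriptor with the largest variance is unrelated to activity, while a low-variance descriptor determines it completely; the one-component models are written for a single activity vector y, and all names and values are hypothetical.

```python
import numpy as np

def one_component_pcr(X, y):
    """Regress y on the first principal component of X (component chosen from X alone)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    t = X @ Vt[0]                      # scores on the direction of largest X-variance
    return t * (t @ y) / (t @ t)       # fitted y

def one_component_pls(X, y):
    """Regress y on the first PLS component (direction of maximum covariance with y)."""
    w = X.T @ y
    w = w / np.linalg.norm(w)          # PLS weight vector
    t = X @ w
    return t * (t @ y) / (t @ t)       # fitted y

def r_squared(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Mean-centered, mutually orthogonal synthetic descriptors.
x1 = np.array([3.0, -3.0, 3.0, -3.0, 3.0, -3.0, 3.0, -3.0])   # large variance, irrelevant
x2 = np.array([1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0])   # small variance, relevant
X = np.column_stack([x1, x2])
y = 2.0 * x2

r2_pcr = r_squared(y, one_component_pcr(X, y))
r2_pls = r_squared(y, one_component_pls(X, y))
```

In this constructed case the single principal component follows x1 and explains essentially none of y, whereas the single PLS component follows x2 and explains all of it. Real data are never this clean, so the gap between the two methods is usually much smaller.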

3.4 QSAR at level four

The data for a level 4 QSAR problem consists of at least one Y or biological activity vector for each compound and a multidimensional chemical descriptor matrix, X. Biological activity on a series of compounds can be obtained at different levels of biological complexity such as isolated enzyme, single cell, isolated organ, etc., while activity or Y matrices can be obtained for more than one type of activity and in this way multiway analyses can be done [15].

Very few examples of level four QSAR studies have been published so an example will be used to illustrate the technique. This example is from structure-activity studies of the inhibition of the serine peptidases, acrosin, trypsin and thrombin.

Peptides Ala-Cys, No. 2, and Val-Ile-Pro, No. 10, have no activity reported in the enzyme systems trypsin and thrombin, respectively. These missing values are indicated by -. The objectives of the analysis are, first, to determine if these two peptides are classified as inhibitors of these enzymes and if so, second, to estimate their level of activity. These two compounds are the test peptides and the remainder are the training peptides.

TABLE 1

Biological activity data for peptides (A1-A2-Arg-CH2-Cl) against the enzymes acrosin, trypsin and thrombin

The first three columns, which correspond to Y in Fig. 2, are logs of the half-lives (log t1/2) for irreversible inhibition of each respective protease. z1, z2 and z3 correspond to X in Fig. 2 and are the principal properties for each amino acid A1 and A2. From Hellberg et al. [20].

No.  Peptide      Acrosin  Trypsin  Thrombin  z1(A1)  z2(A1)  z3(A1)  z1(A2)  z2(A2)  z3(A2)
 1   Pro-Phe        1.11     1.36     1.54     -1.22    0.88    2.23   -4.92    1.30    0.45
 2   Ala-Cys        1.00     1.52      -        0.07   -1.73    0.09    2.84    1.41   -3.14
 3   Ala-Phe        1.00     1.28     1.23      0.07   -1.73    0.09   -4.92    1.30    0.45
 4   acGly-Gly      0.88     1.20     1.28      2.23   -5.36    0.03    2.23   -5.36    0.30
 5   Gly-Val        1.46     1.52     1.26      2.23   -5.36    0.03   -2.69   -2.53   -1.29
 6   Pro-Gly        0.62     1.30     1.76     -1.22    0.88    2.23    2.23   -5.36    0.30
 7   Phe-Ala        0.56     1.34     1.25     -4.92    1.30    0.45    0.07   -1.73    0.09
 8   Ile-Leu        0.75     1.20     1.11     -4.44   -1.68   -1.03   -4.19   -1.03   -0.98
 9   Val-Pro        0.63     1.20     1.23     -2.69   -2.53   -1.29   -1.22    0.88    2.23
10   Val-Ile-Pro    0.54      -       1.11     -4.44   -1.68   -1.03   -1.22    0.88    2.23
11   Ile-Pro        0.46     1.18     1.34     -4.44   -1.68   -1.03   -1.22    0.88    2.23
12   Val-Val        0.32     1.32     1.52     -2.69   -2.53   -1.29   -2.69   -2.53   -1.29
13   Glu-Gly        0.58     1.65     1.49      3.08    0.39   -0.07    2.23   -5.36    0.30

All of the compounds in the table are tripeptides on which the C-terminal end has been altered to contain the -CH2-Cl function which is an alkylating group. Peptide No. 4 has an acyl group on the N-terminal amino acid and peptide No. 10 is a tetrapeptide. In the case of the acyl derivative, the z-values for the second peptide were used and in the case of No. 10 the z-values for the Ile residue were used. In the latter case this ignores the Val residue in the terminal position. Since these enzymes are proteases, irreversible inhibition should result from the reaction of this group with a nucleophilic group in the enzyme active site. Therefore inhibition should be a function of noncovalent and covalent interactions of the substrates with the active site.

Application of PLS regression to the data for the training compounds revealed that the inhibition of the enzyme thrombin did not parallel that of the other two and the variables z3 for the first and second amino acid residues were not significant. A one-component PLS model was significant by cross-validation [36] and accounted for 49% of the variation in the inhibitory activity. This was mainly with the enzyme acrosin. The two test peptides are within the pattern space for the class and a plot of the observed and predicted inhibitory activities for the peptides against acrosin and trypsin is given in Fig. 4. The agreement is good.

Fig. 4. Observed and PLS predicted values for the peptides in Table 1. (Horizontal axis: observed acrosin activity.)

A recent application of QSAR at level 4 is the work of Cramer et al. [37] using PLS to estimate the binding of steroids to carrier proteins. Here the chemical descriptor data are generated by placing the structure of the steroid in a standard three-dimensional grid and calculating the interaction energy (van der Waals and Coulombic) of the steroid at points in the grid with a methyl group probe. These energies are then systematically placed in a matrix as the chemical descriptor data and PLS regression is used to find and correlate energies in the spatial regions that are related to the level of binding. This approach has been called ‘3-D QSAR’. It is three-dimensional only in the sense that the three-dimensional structure of the compounds is considered in their description. It should not be confused with ‘multiway’ PLS data analysis in which more than one X-data matrix is correlated with biological activity.
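The way grid energies become one row of the descriptor matrix can be illustrated with a short sketch. It is written under stated assumptions and is not the implementation of ref. [37]: the Lennard-Jones 6-12 form, the parameter values (`eps`, `r_min`, `probe_charge`, `min_dist`, `cutoff`) and all function names are hypothetical choices made for illustration. Atoms are given as `(x, y, z, partial_charge)` tuples in angstroms and electron charges.

```python
import math

COULOMB_CONSTANT = 332.06  # kcal*angstrom/(mol*e^2), standard conversion factor


def probe_energy(atoms, point, probe_charge=1.0, eps=0.1, r_min=3.4, min_dist=1.0):
    """Return (steric, electrostatic) energy of a probe placed at `point`.

    All numeric defaults are illustrative assumptions, not fitted parameters.
    """
    steric = 0.0
    coulomb = 0.0
    for x, y, z, charge in atoms:
        # Clamp the distance so a probe sitting on an atom does not divide by zero.
        r = max(math.dist((x, y, z), point), min_dist)
        ratio = (r_min / r) ** 6
        steric += eps * (ratio * ratio - 2.0 * ratio)  # Lennard-Jones 6-12
        coulomb += COULOMB_CONSTANT * probe_charge * charge / r
    return steric, coulomb


def grid_points(lo, hi, spacing):
    """Yield the lattice points of a regular 3-D grid from corner lo to corner hi."""
    counts = [int(round((h - l) / spacing)) + 1 for l, h in zip(lo, hi)]
    for i in range(counts[0]):
        for j in range(counts[1]):
            for k in range(counts[2]):
                yield (lo[0] + i * spacing, lo[1] + j * spacing, lo[2] + k * spacing)


def field_descriptor_row(atoms, lo, hi, spacing, cutoff=30.0):
    """Build one descriptor row: steric energies followed by electrostatic energies."""
    steric_row, coulomb_row = [], []
    for point in grid_points(lo, hi, spacing):
        s, c = probe_energy(atoms, point)
        # Truncate extreme values so points inside the molecule do not dominate.
        steric_row.append(min(s, cutoff))
        coulomb_row.append(max(-cutoff, min(c, cutoff)))
    return steric_row + coulomb_row
```

Calling `field_descriptor_row` once per aligned compound, with the same grid each time, gives rows of equal length that can be stacked into the X matrix for PLS regression. Because the number of grid points far exceeds the number of compounds, multiple regression would not be usable on such a matrix.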

4 CONCLUSIONS

The study of quantitative structure-activity relationships, or QSAR, has developed from its beginnings as proposed by Crum Brown and Fraser [1] to the modeling of these relationships with the most sophisticated data analysis techniques available. Even though some would argue that obtaining good biological activity data is more difficult, in my opinion the greatest problem facing those involved in drug design using QSAR methods is describing molecules in a way that is relevant to the problem of biological activity.

In the early 1970s a great emphasis was placed on molecular modeling and the graphical display of structures of both the drugs and their receptors. During this time, QSAR became more qualitative than quantitative. It became possible to generate large numbers of theoretically based descriptors, but reliable quantitative structure-activity models could not be developed using multiple regression. Researchers have been slow to recognize this problem, but there has been considerable recent interest in multivariate methods and reports are appearing in which principal components and PLS regression methods are being used. These techniques are usually the methods of choice due to the multivariate nature of the QSAR problem.


In addition to the chemical description problem there is also the problem of the biological, or dependent variable, data. Early workers in QSAR recognized it to be an efficient technique for optimizing structure and activity but were more or less dependent on available data, either from the literature or from industrial databases. Rarely did such data result from systematic statistical design, with the result that most QSAR models were not optimal from a predictive point of view.

REFERENCES

1 A. Crum Brown and T.R. Fraser, On the connection between chemical constitution and physiological action. Part I. On the physiological action of the salts of the ammonium bases, derived from strychnia, brucia, thebaia, codeia, morphia and nicotia, Proceedings of the Royal Society of Edinburgh, XXV (1868) 151-203.
2 H.H. Meyer, Welche Eigenschaft der Anaesthetica bedingt ihre narkotische Wirkung?, Pathologie und Pharmakologie, 42 (1899) 109-118.
3 E. Overton, Studien über die Narkose, Fischer Verlag, Jena, 1901.
4 S. Furukawa, Biological activity of perfumes, Journal of the Tokyo Chemical Society, 39 (1918) 584-660.
5 S. Furukawa, Biological activity of perfumes, Journal of the Tokyo Chemical Society, 39 (1918) 809-846.
6 S. Furukawa, Biological activity of perfumes, Journal of the Tokyo Chemical Society, 40 (1919) 42-59.
7 L.P. Hammett, Physical Organic Chemistry, McGraw-Hill, New York, 1st ed., 1940.
8 C. Hansch, P.P. Maloney, T. Fujita and R. Muir, Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients, Nature (London), 194 (1962) 178-180.
9 R.W. Taft, in M.S. Newman (Editor), Steric Effects in Organic Chemistry, Wiley, New York, 1956, p. 644.
10 A. Verloop, W. Hoogenstraaten and J. Tipker, in E.J. Ariens (Editor), Drug Design, Vol. VII, Academic Press, New York, 1976, p. 165.
11 Y.C. Martin, Quantitative Drug Design, Marcel Dekker, New York, 1978.
12 H. Kubinyi, Quantitative structure-activity relationships of the bilinear model, a new model for nonlinear dependence of biological activity on hydrophobic character, Journal of Medicinal Chemistry, 20 (1977) 625-629.
13 S. Wold, K. Esbensen and P. Geladi, Principal components analysis, Chemometrics and Intelligent Laboratory Systems, 2 (1987) 37-52.
14 P. Geladi and B.R. Kowalski, Partial least squares: A tutorial, Analytica Chimica Acta, 185 (1986) 1-17.
15 S. Wold, P. Geladi, K. Esbensen and J. Öhman, Multi-way principal components and PLS-analysis, Journal of Chemometrics, 1 (1987) 41-56.
16 C. Albano, W.J. Dunn III, E. Edlund, E. Johansson, B. Nordén, M. Sjöström and S. Wold, Four levels of pattern recognition, Analytica Chimica Acta, 103 (1978) 429-435.
17 C. Hansch and S.H. Unger, Strategy in drug design. Cluster analysis as an aid in the selection of substituents, Journal of Medicinal Chemistry, 16 (1973) 1217-1222.
18 W.J. Dunn III, M.J. Greenberg and S. Callejas, Use of cluster analysis in the development of structure-activity relations for antitumor triazenes, Journal of Medicinal Chemistry, 19 (1976) 1299-1301.
19 V. Austel, A manual method for systematic drug design, European Journal of Medicinal Chemistry, 17 (1982) 9-16.
20 S. Hellberg, M. Sjöström, B. Skagerberg and S. Wold, Peptide quantitative structure-activity relationships, a multivariate approach, Journal of Medicinal Chemistry, 30 (1987) 1126-1135.
21 A.J. Stuper, W.E. Brugger and P.C. Jurs, Computer Assisted Studies of Chemical Structure and Biological Function, Wiley, New York, 1979.
22 L.B. Kier and L.H. Hall, Molecular Connectivity in Structure-Activity Analysis, Research Studies Press Ltd., Letchworth, 1986.
23 J.G. Topliss and R.P. Edwards, Chance factors in studies of quantitative structure activity relationships, Journal of Medicinal Chemistry, 22 (1979) 1238-1244.
24 W.J. Dunn III, M.G. Koehler and S. Grigoras, The role of solvent accessible surface area in determining partition coefficients, Journal of Medicinal Chemistry, 30 (1987) 1121-1126.
25 W.J. Dunn III and S. Wold, Structure-activity analyzed by pattern recognition: The asymmetric case, Journal of Medicinal Chemistry, 23 (1980) 595-596.
26 P.A. Lachenbruch, Discriminant Analysis, Hafner, New York, 1975.
27 N.J. Nilsson, Learning Machines, McGraw-Hill, New York, 1965.
28 S. Wold, Pattern recognition by means of disjoint principal components models, Pattern Recognition, 8 (1976) 127-139.
29 M.P. Derde and D.L. Massart, UNEQ: a disjoint modeling technique for pattern recognition based on normal distribution, Analytica Chimica Acta, 184 (1986) 33-51.
30 I. Juricskay and G.E. Veress, PRIMA: a new pattern recognition method, Analytica Chimica Acta, 171 (1985) 61-76.
31 M.P. Derde and D.L. Massart, Comparison of the performance of the class modeling techniques, UNEQ, SIMCA and PRIMA, Chemometrics and Intelligent Laboratory Systems, 4 (1988) 65-93.
32 I.E. Frank, DASCO - a new classification method, Chemometrics and Intelligent Laboratory Systems, 4 (1988) 215-222.
33 W.J. Dunn III, S. Wold and Y.C. Martin, A structure-activity study of β-adrenergic agents using the SIMCA method of pattern recognition, Journal of Medicinal Chemistry, 21 (1978) 922-927.
34 S. Wold, S. Hellberg and W.J. Dunn III, Computer methods for assessment of acute toxicity, Proc. 1st CFN Symposium on LD50 and possible alternatives, in P. Lindgren (Editor), Acta Pharmacologica et Toxicologica, Supplementum II, 52 (1983) 158-189.
35 S. Wold, H. Wold, A. Ruhe and W.J. Dunn III, The colinearity problem in linear and nonlinear regression. The partial least squares (PLS) approach to general inverses, SIAM Journal of the Science of Statistics and Computation, 5 (1984) 735-743.
36 S. Wold, Cross validatory estimation of the number of components in factor and principal components models, Technometrics, 20 (1978) 397-406.
37 R.D. Cramer III, D.E. Patterson and J.D. Bunce, Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins, Journal of the American Chemical Society, 110 (1988) 5959-5967.