Abt. Simulation biologischer Systeme WSI/ZBIT, Eberhard-Karls-Universität Tübingen
Drug Design 2
Oliver Kohlbacher Winter 2009/2010
9. QSAR — Part II: Linear Models, PCA —
Overview
• QSAR/QSPR
• Early models – Hammett, Taft
– Hansch
– Free-Wilson
• Linear models – Methods
• Statistical foundations
• Multiple linear regression
• Principal component analysis
• Principal component regression
QSAR
• QSAR – Quantitative Structure-Activity Relationship
• Quantitative description of a biological activity as a function of structure
• In contrast to QSPR, the modeled quantities are aggregate quantities (e.g., effective concentrations) of multi-step processes:
– The activity of a structure arises from several steps: release, absorption, transport, action
– Modeling these complex steps with a single model at sufficient accuracy is not always possible
QSAR Model Development
[Figure: QSAR workflow – Training: a model is fitted to structures with known activities; Validation: predicted activities are compared with known values; Prediction: the model predicts the activity of a new structure]
QSAR/QSPR – Models
• Question: what is the relationship between the modeled property and the descriptors?
• In general, these properties are nonlinear functions of their descriptors
• Some properties can be described as linear functions of certain descriptors (e.g., ClogP, AlogP)
• Linearity can often be achieved by transformations of the data
• However, there are also genuinely nonlinear QSAR approaches
Hammett Equation
• An early precursor of QSAR models is the Hammett equation
• It was developed to predict the reactivities of different meta- or para-substituted aromatic compounds
• The Hammett equation
  lg (k/k0) = σρ
  describes the ratio of the rate constants k of the substituted and k0 of the unsubstituted (Y = H) compound
• The reaction rate depends on the electron density in the ring; hence, electronic effects (+/-I, +/-M) of the substituents play a major role
• ρ depends on the type of reaction considered; σ depends only on the substituent
• σ describes the electronic effect of Y on the ring: the stronger the electron-withdrawing effect of the substituent, the more positive σ
• σmeta and σpara differ as well
Taft Equation
• The Taft equation describes steric influences on reaction rates
• It was derived for the case of ester hydrolysis
R-COOEt + H2O → R-COOH + EtOH
• Reactivities for varying substituents R are measured relative to the reactivity of a methyl group:
  lg (kR/kMe) = Es
  where Es is a steric parameter that can be determined for each group R
• Both Hammett and Taft parameters are popular descriptors for QSAR/QSPR these days
Hansch Analysis
• Hansch and Fujita developed a first mathematical model of biological activity in 1964:
pC = -log EC = a log P – b log²P + k
• This simple nonlinear model allows the prediction of several properties
• To this end, one has to determine effective concentrations EC for a series of compounds and parameters a, b, k have to be fitted to reproduce experimental data
• A whole range of similar empirically derived equations exists for related problems
Free-Wilson Models
• The already introduced methods for computing AlogP and ClogP are conceptually identical to the so-called Free-Wilson models
• In these models, the activity is determined as a sum of group contributions:
  −log EC = ∑ ai + EC0
  where EC0 is the activity of a reference compound
• Free-Wilson models and Hansch models can also be mixed:
  −log EC = a log P − b log²P + ∑ ai + k
• This type of approach is not very general; however, it usually works well for sets of closely related structures (varying substituents on the same scaffold)
Example: log P of Carbamates R-OCONH2

R                  P      log P   Δlog P
-CH3               0.22   -0.66
-CH2-CH3           0.70   -0.15    0.51
-(CH2)2-CH3        2.3     0.36    0.51
-(CH2)3-CH3        7.1     0.85    0.49
-(CH2)4-CH3       22.5     1.35    0.50
-(CH2)5-CH3       70.8     1.85    0.50
-(CH2)6-CH3      230       2.36    0.51
-(CH2)7-CH3      700       2.85    0.49
-CH(CH3)-C2H5      4.5     0.65   -0.20 *
-C(CH3)3           3.0     0.48   -0.37 *

* relative to -(CH2)3-CH3

H. Kubinyi, Lecture Drug Design
Houston et al., J. Pharmacol. Exp. Ther. (1974), 189, 244
• For a homologous series, the behavior is trivial
• Steric effects have to be considered as well
• Can be added as group contributions
• Absolutely no extrapolation to non-alkanes possible!
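The constant CH2 increment in the homologous series can be checked with a quick least-squares fit; a minimal sketch (log P values taken from the table above, with n counting the CH2 groups):

```python
import numpy as np

# log P for R = -(CH2)n-CH3, n = 0..7, from the carbamate table above
n_ch2 = np.arange(8)
log_p = np.array([-0.66, -0.15, 0.36, 0.85, 1.35, 1.85, 2.36, 2.85])

# Least-squares line log P = a*n + b; the slope a is the per-CH2 increment
a, b = np.polyfit(n_ch2, log_p, 1)
print(f"CH2 increment: {a:.2f}")  # ~0.50 per CH2 group
```

The fitted slope of about 0.50 matches the Δlog P column of the table; this group contribution is exactly what a Free-Wilson-type model exploits.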
QSAR Formalism
Now let's get a bit more formal:
• For the development of a QSAR model, we have to select a set of m descriptors
• Assuming we are given (experimental) activity data for n training structures, we obtain n input vectors xi, 1 ≤ i ≤ n, of length m encoding the descriptors
• Each structure i also has an activity/property yi
• The training dataset is thus described by
– a property vector y = (y1, y2, ..., yn) and
– a descriptor matrix X = (x1, x2, ..., xn)
QSAR Formalism
• From the descriptor matrix X and the property vector y we can now construct a QSAR model
• This model is, generally speaking, a function of descriptor values x encoding an arbitrary structure:
  f : R^m → R, x ↦ y′ = f(x)
• Goal: determine the function f in such a way that for all data points xi the predicted values yi′ = f(xi) are as close as possible to the training data yi (i.e., minimize the prediction error)
Example
• A series of derivatives of (1) was tested for activity
• Effective concentrations were measured
• We use three descriptors (π, σ, and Es) that were computed for all nine structures (n = 9, m = 3); each row below gives xi = (xi1, xi2, xi3) and the activity yi

π     σ      Es    pC
0.00   0.00  1.24  7.46
0.15  -0.07  1.24  8.16
0.70   0.11  1.24  8.68
1.02   0.15  1.24  8.89
1.26   0.14  1.24  9.25
0.52  -0.31  1.24  9.30
0.13   0.35  0.78  7.52
0.76   0.40  0.27  8.16
1.78   0.55  0.27  9.00
Descriptive Statistics
• Simple descriptive statistics yield essential insights into the influence of individual descriptors and their correlations
• Each descriptor j is characterized by its mean mj and its variance vj:
  mj = (1/n) ∑i xij,  vj = 1/(n−1) ∑i (xij − mj)²
• The total variance v of the descriptors is the sum of the individual descriptor variances. In addition to the variance, one often considers its square root, the standard deviation sj = √vj
Example
• For the nine structures we obtain
  m1 = 0.702; m2 = 0.147; m3 = 0.973
  s1 = 0.585; s2 = 0.261; s3 = 0.426
  v = s1² + s2² + s3² = 0.592
• For normally distributed descriptors
  – 68% of the values lie in [mj−sj, mj+sj] (mj±sj)
  – 95% of the values lie in [mj−2sj, mj+2sj] (mj±2sj)
• These intervals are the 68% and 95% confidence intervals, respectively
π (xi1)   σ (xi2)   Es (xi3)
0.00       0.00     1.24
0.15      -0.07     1.24
0.70       0.11     1.24
1.02       0.15     1.24
1.26       0.14     1.24
0.52      -0.31     1.24
0.13       0.35     0.78
0.76       0.40     0.27
1.78       0.55     0.27
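The numbers above can be reproduced with a few lines of numpy; note the sample variance with n − 1 in the denominator, which is what the slide values correspond to:

```python
import numpy as np

# Descriptor matrix for the nine structures (pi, sigma, Es)
X = np.array([
    [0.00,  0.00, 1.24], [0.15, -0.07, 1.24], [0.70, 0.11, 1.24],
    [1.02,  0.15, 1.24], [1.26,  0.14, 1.24], [0.52, -0.31, 1.24],
    [0.13,  0.35, 0.78], [0.76,  0.40, 0.27], [1.78,  0.55, 0.27],
])

m = X.mean(axis=0)          # per-descriptor means m_j
s = X.std(axis=0, ddof=1)   # sample standard deviations s_j
v = (s ** 2).sum()          # total variance

print(m.round(3))   # [0.702 0.147 0.973]
print(s.round(3))   # [0.585 0.261 0.426]
print(round(v, 3))  # 0.592
```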
Covariance Matrix
• Dependencies between the m descriptors are described by the covariance matrix C
• C is a symmetric matrix of dimension m × m; for centered descriptors, C = XTX / (n − 1)
• The diagonal elements of C are the variances of the corresponding descriptors
• Off-diagonal elements describe the correlation between two descriptors
Correlation Coefficient
• Normalizing the elements of the covariance matrix with the corresponding standard deviations yields the correlation coefficients rjk:
  rjk = cjk/(sj sk)
• The correlation coefficient describes the degree of correlation between two descriptors
• Descriptors are, of course, perfectly correlated with themselves; the diagonal entries of C thus yield correlation coefficients of 1
• r (often also written R) becomes ±1 for collinear (linearly dependent) descriptors and 0 for completely uncorrelated descriptors
• The squared correlation coefficient R² is often used instead of R
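A sketch of how C and the correlation matrix follow from the centered descriptor matrix, using the same nine-structure data as above:

```python
import numpy as np

X = np.array([
    [0.00,  0.00, 1.24], [0.15, -0.07, 1.24], [0.70, 0.11, 1.24],
    [1.02,  0.15, 1.24], [1.26,  0.14, 1.24], [0.52, -0.31, 1.24],
    [0.13,  0.35, 0.78], [0.76,  0.40, 0.27], [1.78,  0.55, 0.27],
])

Xc = X - X.mean(axis=0)        # center each descriptor
C = Xc.T @ Xc / (len(X) - 1)   # covariance matrix (m x m, symmetric)

s = np.sqrt(np.diag(C))        # standard deviations s_j
R = C / np.outer(s, s)         # correlation coefficients r_jk = c_jk/(s_j s_k)

print(np.diag(R))              # descriptors correlate perfectly with themselves
```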
Multiple Linear Regression
• Assumption
  – Variable y can be expressed as a linear function of multiple independent variables x = (x1, x2, ..., xm):
    y = a0 + ∑ ai xi + e
  – The error e is normally distributed
• Given
  – A dataset with n data pairs (yi, xi)
• Find
  – Estimates for the coefficients a0...am such that the error e of the model becomes minimal
• Ansatz
  – Minimize the squared error
Multiple Linear Regression
• Fitting ansatz
  y = X a + e
  with y = (y1, y2, ..., yn), X the n × m matrix containing the independent variables of the n data pairs, and e = (e1, e2, ..., en) the vector of residuals (deviations between predicted and observed values)
• This yields a system of linear equations of the form
  y1 = a0 + a1 x11 + a2 x12 + ... + am x1m + e1
  y2 = a0 + a1 x21 + a2 x22 + ... + am x2m + e2
  ...
  yn = a0 + a1 xn1 + a2 xn2 + ... + am xnm + en
Multiple Linear Regression
• We need to find the solution of the equation system that yields the best regression hyperplane
• Minimization of the squared error (OLS – ordinary least squares)
  ⇒ minimize ∑i ei² = eTe
• Find coefficients a′ for which the partial derivatives ∂(eTe)/∂ai vanish
• This condition results in the normal equations, which permit the calculation of a′:
  (XTX) a′ = XTy
  or, solved for a′:
  a′ = (XTX)-1 XTy
Multiple Linear Regression
• The normal equations can be solved if XTX is not singular, i.e., if the inverse (XTX)-1 exists
• This is the case if rank(X) = m + 1, i.e., if at least m + 1 independent data points are available
• Computing a′ is then possible with standard methods for matrix inversion
• The optimal linear model then yields predicted values y′ for all inputs from X
  y′ = X a′
  and has residuals
  e = y − y′
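The normal-equation solution can be written very compactly; a minimal sketch (np.linalg.solve is used instead of an explicit matrix inverse for numerical stability, and the data below are a synthetic check, not from the lecture):

```python
import numpy as np

def mlr_fit(X, y):
    """Ordinary least squares: solve (X^T X) a' = X^T y for the coefficients a'."""
    Xa = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for a0
    a = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)    # normal equations
    y_pred = Xa @ a                             # predicted values y' = X a'
    e = y - y_pred                              # residuals
    return a, y_pred, e

# Tiny synthetic check: exactly linear data y = 2 + 3*x1 - x2 is recovered exactly
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 2 + 3 * X[:, 0] - X[:, 1]
a, y_pred, e = mlr_fit(X, y)
print(a.round(6))  # recovers [2, 3, -1]
```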
Analysis of Variance
• The quality of the fit can be determined by analysis of variance (ANOVA)
• Consider the total variance of the dependent variable y
• Assumption: the total variance has two independent contributions:
  – the variance explained by the regression, vR
  – the residual variance vE caused by the residual error
• Both quantities add up to the total variance:
  vy = vR + vE
• vR is often called SSR (sum of squares regression), vE accordingly SSE (sum of squares error)
Statistical Significance
• A model's statistical significance captures how likely it is that the model could have arisen by mere chance
• The so-called F-statistic considers not only the quality of the model correlation, but also the number of data points and descriptors
• Large values of F imply a reliable model; small values should make you suspicious!
• The F-statistic can also be translated into p-values for the model confidence
Example: Activity of N,N-dimethyl-α-bromophenethylamines
• Different meta- and para-substituted derivatives of (1) show anti-adrenergic activity (inhibition of adrenergic receptors)
• Unger and Hansch developed a QSAR model for a series of these structures
• The model uses three simple descriptors (lipophilicity π, Hammett constant σ, and a steric parameter Es)
Example: Activity of N,N-dimethyl-α-bromophenethylamines

Meta (X)  Para (Y)   π      σ      Es (meta)  pC (exp.)  pC (theor.)
H         H          0.00    0.00   1.24       7.46       7.88
H         F          0.15   -0.07   1.24       8.16       8.17
H         Cl         0.70    0.11   1.24       8.68       8.60
H         Br         1.02    0.15   1.24       8.89       8.94
H         I          1.26    0.14   1.24       9.25       9.26
H         Me         0.52   -0.31   1.24       9.30       8.98
F         H          0.13    0.35   0.78       7.52       7.43
Cl        H          0.76    0.40   0.27       8.16       8.05
Cl        Br         1.78    0.55   0.27       9.00       9.11
Linear Model: Prediction
pC = 1.259(±0.19) π − 1.460(±0.34) σ + 0.208(±0.17) Es + 7.619(±0.24)
(n = 22; R = 0.959; s = 0.173; F = 69.24)
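The tabulated theoretical pC values can be reproduced by plugging the descriptors into the fitted equation; a quick sketch:

```python
import numpy as np

# pi, sigma, Es and the tabulated theoretical pC values from the slide above
data = np.array([
    [0.00,  0.00, 1.24, 7.88], [0.15, -0.07, 1.24, 8.17],
    [0.70,  0.11, 1.24, 8.60], [1.02,  0.15, 1.24, 8.94],
    [1.26,  0.14, 1.24, 9.26], [0.52, -0.31, 1.24, 8.98],
    [0.13,  0.35, 0.78, 7.43], [0.76,  0.40, 0.27, 8.05],
    [1.78,  0.55, 0.27, 9.11],
])
pi, sigma, Es, pC_theor = data.T

# Published model: pC = 1.259 pi - 1.460 sigma + 0.208 Es + 7.619
pC_pred = 1.259 * pi - 1.460 * sigma + 0.208 * Es + 7.619

# Agrees with the tabulated theoretical values to within rounding
print(float(np.abs(pC_pred - pC_theor).max()))  # < 0.01
```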
Interpretation of QSAR Equations
• QSAR equations are usually given in the form shown below
• The numbers in parentheses are the 95% confidence intervals for each of the parameters a0...am (± 2 s)
  pC = 1.151(±0.19) x1 − 1.464(±0.38) x2 + 7.817(±0.19)
  (n = 14; R = 0.945; s = 0.196; F = 78.63)
• The second line shows the most important statistical parameters of the linear regression:
  n: number of data points
  R: correlation coefficient
  s: standard deviation
  F: F-statistic
Interpretation of QSAR Equations
• Apart from statistical information, the equations often also convey partial physical, chemical, or biological meaning
• Example: pC = 1.259 π − 1.460 σ + 0.208 Es + …
• The activity of the substances increases with their lipophilicity (coefficient of π is positive)
  ⇒ lipophilic side chains are favorable
• The activity decreases with increasing electron-acceptor strength of the substituent (coefficient of σ is negative)
  ⇒ electron donors will provide better activity
⇒ In summary, lipophilic electron donors (e.g., alkyl residues) are ideal and should be tested!
Scaling
• Linear approaches are naturally limited, because there is no reason to assume that the activity/property is linear in its descriptors
• General nonlinear approaches (e.g., support vector regression, kernel ridge regression, artificial neural networks) are much more complex and often lack interpretability
• Popular trick: if things do not look the way we like them – transform them!
Transformations
• Example: fractional oral bioavailability dataset of Palm et al.
• FA = f(PSA) is not linear, but a sigmoid function
Transformations
• Clever transformation of the data often renders it linear
• Well-known examples are the log transformations already used for log P, log K, log S (instead of P, K, S)
• In the case of the Palm data set, one can apply the well-known logit transformation
• logit(x) converts a sigmoid into a linear function
• We transform the data points of the Palm data set:
  (FA, PSA) → (logit(FA), PSA)
• The transformed data then yields a linear function
Transformations
• For the Palm et al. data set we thus obtain R² = 0.882 for a simple linear regression
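The logit transformation is simply the inverse of the sigmoid; a minimal sketch (the FA values below are hypothetical placeholders, not the Palm et al. data):

```python
import numpy as np

def logit(x):
    """logit(x) = ln(x / (1 - x)); maps the interval (0, 1) onto the real line."""
    x = np.asarray(x, dtype=float)
    return np.log(x / (1.0 - x))

# Hypothetical fractional-absorption values
fa = np.array([0.05, 0.25, 0.50, 0.75, 0.95])
print(logit(fa).round(3))  # symmetric around logit(0.5) = 0
```

A sigmoid FA = 1/(1 + exp(a·PSA + b)) becomes the straight line logit(FA) = −(a·PSA + b), which is why the transformed Palm data can be fitted by linear regression.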
Scaling
• Apart from linearizing transformations, it is often necessary to scale the input (descriptor) data
• Descriptors typically have widely varying value ranges and spreads
• Descriptors with a larger variance would otherwise dominate the descriptors with lower variance in a regression
• Standard approach: scale and center the data such that
  – mi = 0 : centered
  – si = 1 : unit variance
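This centering and scaling ("autoscaling") is a one-liner per column; a minimal sketch with hypothetical raw descriptors of very different magnitudes:

```python
import numpy as np

def autoscale(X):
    """Center each column to mean 0 and scale to unit (sample) variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Hypothetical raw descriptors with very different scales
X = np.array([[6.5, 330.0], [5.0, 291.0], [8.0, 369.0], [6.5, 330.0]])
Z = autoscale(X)

print(Z.mean(axis=0))         # ~[0. 0.] (centered)
print(Z.std(axis=0, ddof=1))  # [1. 1.] (unit variance)
```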
Scaling
[Figure: two descriptors with very different scales, mlogP = 6.5±1.5 and mvdW = 330±39, are both mapped to m = 0.0, s = 1.0 by autoscaling]
Causality vs. Correlation
• Not everything that is correlated is causally related
• Be very, very careful with premature conclusions and daring extrapolations based on statistical models!
Sies, Nature (1988), 332, 495 Monty Python, The Meaning of Life
Number of Descriptors
• For a sufficiently large number of descriptors, one always gets a good correlation!
• Example
  pC = 0.495(±0.35) x1 − 0.149(±0.38) x2 − 1.69(±0.41) x3 + 0.207(±0.47) x4 + 9.12(±0.34)
  (n = 6; R = 0.979; s = 0.242; F = 5.7)

pC    X  Y   π     σ      Es    x1       x2       x3       x4
7.88  H  H   0.00   0.00  1.24  0.78259  0.09918  0.95813  0.14123
8.17  H  F   0.15  -0.07  1.24  0.22324  0.00000  0.54788  0.52587
8.60  H  Cl  0.70   0.11  1.24  0.14289  0.40056  0.48163  0.72094
8.94  H  Br  1.02   0.15  1.24  0.07975  0.60681  0.14965  0.13125
9.26  H  I   1.26   0.14  1.24  0.86688  0.98799  0.18470  0.69078
8.98  H  Me  0.52  -0.31  1.24  0.09430  0.31273  0.06645  0.52361
Number of Descriptors
• Excellent correlation – but nonsense: x1...x4 are random numbers
• The number of descriptors ought to be much smaller than the number of data points
• The F-statistic tells the story: 5.7 is a bit low!
Remember: "Essentially, all models are wrong, but some are useful." (G. E. P. Box)
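The F values quoted on these slides are consistent with the standard regression F-test, F = (R²/m) / ((1 − R²)/(n − m − 1)) – a reconstruction, since the slides do not give the formula. Small deviations from the quoted values stem from the rounded R:

```python
def f_statistic(r2, n, m):
    """Regression F-test: F = (R^2 / m) / ((1 - R^2) / (n - m - 1))."""
    return (r2 / m) / ((1.0 - r2) / (n - m - 1))

# Random-descriptor model above: n = 6, m = 4, R = 0.979
print(round(f_statistic(0.979 ** 2, 6, 4), 1))   # ~5.8; the slide quotes 5.7 (unrounded R)

# Three-descriptor model from earlier: n = 22, m = 3, R = 0.959
print(round(f_statistic(0.959 ** 2, 22, 3), 1))  # ~68.7; the slide quotes 69.24
```

With only one residual degree of freedom (n − m − 1 = 1), even a near-perfect R² yields a small F, which is exactly why the random-descriptor model is exposed.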
Example
• Model with two parameters
  pC = 1.151(±0.19) π − 1.464(±0.38) σ + 7.817(±0.19)
  (n = 22; R = 0.945; s = 0.196; F = 78.63)
• Model with three parameters
  pC = 1.259(±0.19) π − 1.460(±0.34) σ + 0.208(±0.17) Es + 7.619(±0.24)
  (n = 22; R = 0.959; s = 0.173; F = 69.24)
• Adding the steric parameter Es to the model results in only marginal improvement
• For most purposes, the first model is thus better!
Occam's Razor
• Occam's Razor also applies to the selection of descriptors:
  'The simplest explanation tends to be the best one.'
⇒ Get rid of unnecessary descriptors!
• Unneeded descriptors do not convey any important information, but only noise
• They lead to overfitting
• Overfitted models show reduced generality
Cross Validation
• Cross validation is a technique for assessing a model's robustness
• Key idea
  – Split the data set into partitions
  – Construct a model on a partial data set
  – Validate the model on the remainder of the data set
  – Swap the data sets and repeat
• If the model is sufficiently robust (i.e., it generalizes well and does not overfit the training data), it should yield similar performance (s, R²) on the validation data set as on the training data set
[Figure: data set X is split into XA and XB; model MA is trained on XA and validated on XB]
Cross Validation
• Cross validation is done for k data sets (k-fold cross validation)
• k is typically chosen between three and ten
• X is then randomly partitioned into k data sets of size n/k
• For each of the k data sets:
  – Construct a model using the data from the k − 1 remaining data sets
  – Validate the model on the chosen data set
Leave One Out
• The simplest variant of cross validation is leave-one-out (LOO)
• Idea
  – Remove one of the data points
  – Construct a model on all remaining data points
  – Predict the value and determine the error for the removed data point
  – Repeat for all data points
• This corresponds to n-fold cross validation
• In contrast to, e.g., 3-fold cross validation, LOO can also be applied to very small datasets
• True cross validation with smaller k should be preferred whenever possible, though
Cross Validation Quality
• Model quality in a cross validation is usually expressed by Q², the cross-validated equivalent of R²
  – R² measures the quality of the fit
  – Q² measures the quality of the prediction on unseen data
• For a sufficiently robust model, Q² should be close to R²
• If Q² is significantly lower than R², then the model was overfitted to the training data
• Q² is computed using the PRESS (predictive residual sum of squares)
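Q² can be computed by refitting the model n times; a minimal sketch under the common definition Q² = 1 − PRESS / ∑i (yi − ȳ)², using the nine-structure example data from earlier (assumption: an OLS model on the three descriptors):

```python
import numpy as np

def press_q2(X, y):
    """Leave-one-out cross validation: Q^2 = 1 - PRESS / sum((y - mean(y))^2)."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        Xa = np.column_stack([np.ones(n - 1), X[mask]])
        a, *_ = np.linalg.lstsq(Xa, y[mask], rcond=None)  # fit without point i
        y_hat = np.concatenate(([1.0], X[i])) @ a         # predict the held-out point
        press += (y[i] - y_hat) ** 2
    return 1.0 - press / ((y - y.mean()) ** 2).sum()

X = np.array([
    [0.00,  0.00, 1.24], [0.15, -0.07, 1.24], [0.70, 0.11, 1.24],
    [1.02,  0.15, 1.24], [1.26,  0.14, 1.24], [0.52, -0.31, 1.24],
    [0.13,  0.35, 0.78], [0.76,  0.40, 0.27], [1.78,  0.55, 0.27],
])
y = np.array([7.46, 8.16, 8.68, 8.89, 9.25, 9.30, 7.52, 8.16, 9.00])

q2 = press_q2(X, y)
print(round(q2, 3))  # Q^2 <= R^2 of the full fit, since PRESS >= SSE for OLS
```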
Problems with Linear Regression
• The QSAR data set of Selwood et al. tries to model the activity of 31 compounds with 53 descriptors (1D–3D descriptors)
• In order to apply linear regression, we have to remove some of these descriptors
• An exhaustive search among all possible models with a suitable subset of descriptors is not feasible: there are about 10^16 possible models with 29 or fewer parameters!
• However, with appropriate heuristics for variable selection, one can identify a model consisting of only five descriptors without too much effort
Selwood et al., J. Med. Chem. (1990), 33, 136
Variable Selection
Forward selection
– M0 ← {}; k ← 0; F0 ← 0
– WHILE k < m
  • Test all (linear regression) models for the variables from Mk combined with each of the remaining m − k variables
  • Select the variable i which gives the largest F-statistic Fk+1 in the regression
  • If Fk+1 is significantly better than Fk:
    – Mk+1 ← Mk ∪ {i}
  • Otherwise: abort
Variable Selection
Y X1 X2 X3 X4
78.5 7 26 6 60
74.3 1 29 15 52
104.3 11 56 8 20
87.6 11 31 8 47
95.9 7 52 6 33
109.2 11 55 9 22
102.7 3 71 17 6
72.5 1 31 22 44
93.1 2 54 18 22
115.9 21 47 4 26
83.8 1 40 23 34
113.3 11 66 9 12
109.4 10 68 8 12
Draper & Smith, Applied Regression Analysis, Wiley, New York, 1966, p. 178ff.
Forward Selection
• M0 = {}
• Build models for
  – Y using X1: R² = 0.534, F = 12.6
  – Y using X2: R² = 0.666, F = 22.0
  – Y using X3: R² = 0.286, F = 4.4
  – Y using X4: R² = 0.674, F = 22.8
⇒ M1 = {X4}
• Build models for
  – Y using {X4, X1}: R² = 0.972, F = 176
  – Y using {X4, X2}: R² = 0.680, F = 10.6
  – Y using {X4, X3}: R² = 0.935, F = 72.3
⇒ M2 = {X1, X4}
Forward Selection
• M2 = {X1, X4}
• Build models for
  – Y using {X1, X2, X4}: R² = 0.982, F = 167
  – Y using {X1, X3, X4}: R² = 0.981, F = 157
• No significant improvement – abort!
Best model: Y = a X1 + b X4 + c (R² = 0.972; F = 176)
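The R² values in this walk-through can be verified directly from the table above; a minimal sketch:

```python
import numpy as np

# Hald cement data from the table above (Draper & Smith)
Y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])
X = np.array([
    [ 7, 26,  6, 60], [ 1, 29, 15, 52], [11, 56,  8, 20], [11, 31,  8, 47],
    [ 7, 52,  6, 33], [11, 55,  9, 22], [ 3, 71, 17,  6], [ 1, 31, 22, 44],
    [ 2, 54, 18, 22], [21, 47,  4, 26], [ 1, 40, 23, 34], [11, 66,  9, 12],
    [10, 68,  8, 12],
], dtype=float)

def r2(cols):
    """R^2 of an OLS fit of Y on the chosen columns of X (0-based indices)."""
    A = np.column_stack([np.ones(len(Y)), X[:, cols]])
    a, *_ = np.linalg.lstsq(A, Y, rcond=None)
    e = Y - A @ a
    return 1.0 - (e ** 2).sum() / ((Y - Y.mean()) ** 2).sum()

# Step 1: single-variable models -> X4 wins (R^2 ~ 0.67)
print([round(r2([j]), 3) for j in range(4)])
# Step 2: adding X1 to X4 gives R^2 ~ 0.972
print(round(r2([0, 3]), 3))
```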
Variable Selection
• Similar: backward elimination
  – Start with a model containing all variables
  – Remove each of the variables, one by one, and select the best resulting model
  – Abort if there is no improvement
• Usually better than forward selection or backward elimination is stepwise regression
  – Combination of forward and backward search
  – Can execute "forward" and "backward" steps
  – Avoids getting stuck in some local minima
• All these methods are heuristics that work well even for large descriptor spaces
• They typically do not find an optimal model (variable selection is NP-complete!)
Variable Selection: Example
• Backward elimination
  – Y using {X1, X2, X3, X4}: R² = 0.982, F = 111
  – Y using {X1, X2, X4}: R² = 0.982, F = 167
  – Y using {X1, X2}: R² = 0.982, F = 230!
• Stepwise regression
  – Y using X4: R² = 0.675, F = 22.8
  – Y using {X1, X4}: R² = 0.972, F = 176
  – Y using {X1, X2, X4}: R² = 0.982, F = 166
  – Y using {X1, X2}: R² = 0.982, F = 230
Variable Selection: Example
• Individual correlations:
  – Y with X1: R² = 0.534, F = 12.6
  – Y with X2: R² = 0.666, F = 22.0
  – Y with X3: R² = 0.286, F = 4.4
  – Y with X4: R² = 0.674, F = 22.8
• The best models contain either X2 or X4, but not both!
• Reason: X2 and X4 are collinear: R² = 0.947 between the two descriptors
• All methods thus remove one of the two descriptors from the final model
Variable Selection
• Another popular option for variable selection uses genetic algorithms (GAs)
• Some methods also combine GAs with stepwise regression
• All of these methods lead to decent models
• Optimality is not really necessary, as long as collinear descriptors are removed and the models have sufficient quality in the end
• Hundreds of descriptors can easily be handled by these approaches
Latent Variables
• Many of the descriptors do not directly contribute to the variance of the dependent variable
• In order to reduce the dimension of the problem, one can try to identify latent variables
• Latent variables or components are linear combinations of descriptors
• These components should be chosen in such a way that a minimal number of components explains as much of the variance as possible
Latent Variables
• A latent variable t = (t1, …, tn) can be extracted as a linear combination of descriptors from the descriptor matrix X = (xij)
• The components of such a latent variable t are then
  ti = p1 xi1 + p2 xi2 + … + pm xim = ∑j pj xij
  and are called "scores"
• pj describes the influence of descriptor j on t
• The vector p = (p1, …, pm) is called the loading vector
Latent Variables
• The equation ti = ∑j pj xij defines the projection of a vector xi = (xi1, …, xim) onto an axis p
• The m-dimensional descriptor space is thus projected onto a one-dimensional latent variable along the axis p
• Radical reduction of the problem dimensionality
• Disadvantage: it becomes hard to unravel the underlying relationships between the modeled property and the descriptors ("black-box models")
Latent Variables
• We can also try to identify multiple latent variables ti
• These latent variables ti form a matrix of latent variables T (score matrix); the corresponding loading vectors pi form the loading matrix P
• T is computed by multiplication of X and P (XP = T)
• k latent variables allow the projection of the m-dimensional descriptor space onto k dimensions (dimensionality reduction)
Principal Component Analysis
• The matrix T can be used for linear regression in the same way we used the descriptor matrix before
• Questions
  – What are suitable latent variables?
  – How do we determine a suitable loading matrix?
• Ansatz
  – Find mutually orthogonal components (principal components)
  – A minimal number of these components shall then explain a sufficiently large portion of the total variance
⇒ Principal Component Analysis (PCA)
Principal Component Analysis
• The first principal component (PC) is chosen such that it explains the largest portion of the total variance
• The second PC is chosen such that it explains as much of the remaining variance as possible
⇒ PC2 has to be orthogonal to PC1, otherwise it would explain some of the variance already explained by PC1!
• Higher PCs are likewise orthogonal to all previous PCs
⇒ The PCs define a new coordinate system (basis)
• Difference to descriptor space: the principal component dimensions are ordered by decreasing importance!
[Figure: data in the (X1, X2) plane with the two principal components PC1 and PC2 as new axes]
PCA and MLR
• PCA minimizes the squared error orthogonal to the principal components
• Linear regression minimizes the squared error in the Y-direction
[Figure: orthogonal residuals (PCA) vs. vertical residuals (linear regression)]
Principal Component Analysis
• An n × m matrix X of rank r can be expressed as a sum of r matrices of rank 1:
  X = t1 p1T + t2 p2T + ... + tr prT
• This expresses X in terms of coordinates ti projected onto the principal components pi
• The vectors ti are pairwise orthogonal; the same holds for the pi
⇒ The pi thus define a rotation matrix P
Principal Component Analysis
• Principal components are ordered by decreasing importance
• Less important (higher) PCs often contain only noise
• In order to reduce the dimensionality of the problem, one can thus often use only the first d PCs, i.e., the descriptor data is projected down to a d-dimensional space (d < m):
  X = T PT (using all r PCs)
  X ≈ Td PdT (using only the first d PCs)
Principal Component Analysis
• The determination of the principal components is usually done by singular value decomposition (SVD)
• Generally, each n × m matrix X can be written as the product of three other matrices:
  X = U W VT
  where U (n × r) and V (m × r) are orthonormal matrices and W (r × r) is a diagonal matrix
• In our case: UW = T and V = P
• This results in an eigenvalue problem: W² contains the eigenvalues λi of XTX, and V contains the corresponding eigenvectors of XTX
Principal Component Analysis
• The eigenvalue problem itself can be solved with standard methods from linear algebra
• One obtains a set of real-valued, non-negative eigenvalues and the corresponding principal components
• The eigenvectors of XTX correspond to the loading vectors we were looking for (= PCs)
• The corresponding eigenvalues describe the relative relevance of the PCs
• For centered data, XTX is (up to a constant factor) the covariance matrix
• The magnitude of an eigenvalue is proportional to the fraction of variance explained by the associated PC
• Using the d most important principal components, one can then compute the projected data T
Principal Component Analysis
Overview
• Compute the covariance matrix C (XTX using centered, variance-normalized data)
• Compute the eigenvalues λi of C
• Sort the eigenvectors by their corresponding eigenvalues
• Select the important eigenvectors
  – The fraction f of the total variance explained by the first d eigenvalues is
    f = (λ1 + ... + λd) / (λ1 + ... + λr)
  – Select a minimal d such that a sufficient fraction of the total variance is explained by the first d PCs
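This overview translates almost line by line into code; a sketch using the nine-structure example data from earlier (autoscaled, so the covariance matrix equals the correlation matrix):

```python
import numpy as np

def pca(X):
    """PCA: eigen-decomposition of the covariance matrix of autoscaled data."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # center, unit variance
    C = Z.T @ Z / (len(X) - 1)                        # covariance matrix
    lam, P = np.linalg.eigh(C)                        # eigenvalues/eigenvectors
    order = np.argsort(lam)[::-1]                     # sort by decreasing eigenvalue
    lam, P = lam[order], P[:, order]
    T = Z @ P                                         # score matrix (latent variables)
    return lam, P, T

X = np.array([
    [0.00,  0.00, 1.24], [0.15, -0.07, 1.24], [0.70, 0.11, 1.24],
    [1.02,  0.15, 1.24], [1.26,  0.14, 1.24], [0.52, -0.31, 1.24],
    [0.13,  0.35, 0.78], [0.76,  0.40, 0.27], [1.78,  0.55, 0.27],
])

lam, P, T = pca(X)
print((lam / lam.sum()).round(3))  # fraction of total variance per PC, decreasing
```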
PCA – Applications
• Explorative data analysis
  – Projection of the high-dimensional data onto a few (2–3) dimensions and visualization often reveals previously unknown relationships
  – The data is projected onto the first 2–3 PCs
• Principal component regression
  – If there are too many variables for a multiple linear regression (m > n), then instead of a variable selection one can also perform a PCA
  – The first few PCs can then be used instead of the original descriptor matrix
Example
Gasteiger, Handbook of Chemoinformatics, Vol. 3, p. 1110
Example
Gasteiger, Handbook of Chemoinformatics, Vol. 3, p. 1111
• The first PC explains about 88% (6.68/(6.68+0.93) = 0.878) of the total variance
• The second PC essentially contains noise
Principal Component Regression
• PCR (principal component regression) combines PCA and MLR:
  – Determine the principal components
  – Do a multiple linear regression in principal component space
• Advantages
  – An option to circumvent variable selection
  – Principal components explain the total variance
• Disadvantages
  – PCs have no direct physical interpretation
    ⇒ hard to deduce improvements of the structures
  – PCs are selected based on the variance of the independent variables
    ⇒ totally unrelated to the dependent variable!
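A sketch of PCR; with all m components it reproduces the MLR fit exactly (the scores span the same space as the scaled descriptors), which makes a convenient sanity check:

```python
import numpy as np

def pcr_fit(X, y, d):
    """Principal component regression: OLS on the first d PC scores."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # autoscale
    lam, P = np.linalg.eigh(Z.T @ Z / (len(X) - 1))   # PCA of the covariance matrix
    P = P[:, np.argsort(lam)[::-1]][:, :d]            # loadings of the d largest PCs
    T = Z @ P                                         # score matrix
    Ta = np.column_stack([np.ones(len(T)), T])
    a, *_ = np.linalg.lstsq(Ta, y, rcond=None)        # regression in PC space
    return Ta @ a                                     # fitted values

X = np.array([
    [0.00,  0.00, 1.24], [0.15, -0.07, 1.24], [0.70, 0.11, 1.24],
    [1.02,  0.15, 1.24], [1.26,  0.14, 1.24], [0.52, -0.31, 1.24],
    [0.13,  0.35, 0.78], [0.76,  0.40, 0.27], [1.78,  0.55, 0.27],
])
y = np.array([7.46, 8.16, 8.68, 8.89, 9.25, 9.30, 7.52, 8.16, 9.00])

y_pcr = pcr_fit(X, y, d=3)  # all three PCs -> identical to MLR on X itself
```

In practice one would pick d < m, trading a little fit quality for stability, as in the Tayar example below where three PCs nearly match the six-descriptor MLR.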
PCR – Example
• Tayar et al. modeled log P for the 20 proteinogenic amino acids (AAs)
• The dataset contains six descriptors for each AA:
  – LCE, LCF: two lipophilicity descriptors
  – FET: transfer free energy from organic solvent to water
  – POL: polarity parameter
  – VOL: molecular volume
  – ASA: area of the solvent-accessible surface
• MLR
  – Model with six descriptors: R² = 0.94, F = 34.8
  – Model with three descriptors: R² = 0.84, F = 27.3
• PCR
  – Model including the first three principal components: R² = 0.92, F = 55.4
PCA – AA Clusters (PC1/PC2)
Summary
• QSPR/QSAR allows the prediction of properties and activities of molecular structures based on reduced representations, descriptors
• Multiple linear regression is a simple method for the construction of linear models
• There are several simple statistical measures describing the quality of a model (e.g., correlation coefficient, F-value)
• Construction of a minimal model (Occam's Razor!) requires the selection of suitable descriptors
• Principal component analysis (PCA) allows reducing the dimensionality of the problem in a simple fashion
• Often a few principal components are sufficient to explain most of the data's variance
References
Books
• [HSRF] H.-D. Höltje, W. Sippl, D. Rognan, G. Folkers: Molecular Modeling – Basic Principles and Applications, 2nd ed., Wiley, 2003
• [Lea] Andrew Leach: Molecular Modelling: Principles and Applications, 2nd ed., Prentice Hall, 2001
• [BKK] Böhm, Klebe, Kubinyi: Wirkstoffdesign, Spektrum, 2002