
Page 1:

Unsupervised Forward Selection
A data reduction algorithm for use with very large data sets

David Whitley†, Martyn Ford† and David Livingstone†‡

† Centre for Molecular Design, University of Portsmouth
‡ ChemQuest

Page 2:

Outline

• Variable selection issues
• Pre-processing strategy
• Dealing with multicollinearity
• Unsupervised forward selection
• Model selection strategy
• Applications

Page 3:

Variable Selection Issues

• Relevance
  – statistically significant correlation with the response
  – non-small variance

• Redundancy: linear dependence
  – some variables have no unique information: Σi ai vi = 0 for some coefficients ai, not all zero

• Multicollinearity: near linear dependence
  – some variables have little unique information: Σi ai vi ≈ 0

Page 4:

Pre-processing Strategy

• Identify variables with a significant correlation with the response

• Remove variables with small variance

• Remove variables with no unique information

• Identify a set of variables on which to construct a model (the correlation and variance filters are sketched below)
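As a minimal sketch, the correlation and variance filters might look like the following in Python with NumPy/SciPy. The function name, the 5% level and the variance threshold are illustrative assumptions; the appropriate threshold is data-dependent (e.g. 1.0 kcal/mol for the steroid data later in the deck).

```python
import numpy as np
from scipy import stats

def prefilter(X, y, var_min=1.0, alpha=0.05):
    """Return indices of columns of X that pass the pre-processing filters:
    non-small variance and a significant Pearson correlation with y."""
    keep = []
    for j in range(X.shape[1]):
        if X[:, j].var(ddof=1) < var_min:        # drop small-variance variables
            continue
        r, pval = stats.pearsonr(X[:, j], y)     # correlation with the response
        if pval < alpha:                         # keep if significant at level alpha
            keep.append(j)
    return np.array(keep)
```

Removal of variables with no unique information (redundancy) is then left to UFS itself.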

Page 5:

Effect of Multicollinearity

Build regression models of the form

  yi = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4 + β5 x5 + ei

where

  x5 = x1 + λ zi

and x1 - x4, y, zi and ei are random N(0,1). Increasing λ reduces the collinearity between x5 and x1.
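A sketch of this simulation; the sample size, seed and λ value are arbitrary illustrations, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 0.5                        # illustrative sample size and lambda

x14 = rng.standard_normal((n, 4))       # x1..x4 ~ N(0,1)
z = rng.standard_normal(n)
x5 = x14[:, 0] + lam * z                # x5 = x1 + lam*z
X = np.column_stack([x14, x5])
y = rng.standard_normal(n)              # response, random N(0,1) as on the slide

# small lam -> strong collinearity between x1 and x5
print(np.corrcoef(X[:, 0], X[:, 4])[0, 1])
```

Fitting the five-variable model to such data over a range of λ values, and recording Q² each time, produces the behaviour summarized on the next slide.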

Page 6:

Effect of Multicollinearity

[Figure: Q² of the fitted models as λ varies]

Page 7:

Dealing with Multicollinearity

• Examine pair-wise correlations between variables, and remove one from each pair with high correlation (a greedy sketch of this approach follows)

• Corchop (Livingstone & Rahr, 1989) aims to remove the smallest number of variables while breaking the largest number of pair-wise collinearities
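The first bullet's pairwise approach might be sketched as below. This is not Corchop itself, and the choice of which variable to drop from a pair is an assumed heuristic.

```python
import numpy as np

def pairwise_filter(X, rmax=0.9):
    """Repeatedly drop one variable from the most correlated remaining pair
    until no absolute pairwise correlation exceeds rmax."""
    R = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(R, 0.0)
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        sub = R[np.ix_(keep, keep)]
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        if sub[i, j] <= rmax:
            break
        # assumed heuristic: drop the variable more correlated with the rest
        drop = keep[i] if sub[i].sum() >= sub[j].sum() else keep[j]
        keep.remove(drop)
    return keep
```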

Page 8:

Unsupervised Forward Selection

1. Select the first two variables with the smallest pair-wise correlation coefficient
2. Reject variables whose squared pair-wise correlation coefficient with the selected columns exceeds rsqmax
3. Select the next variable to have the smallest squared multiple correlation coefficient with those previously selected
4. Reject variables with squared multiple correlation coefficients greater than rsqmax
5. Repeat 3 - 4 until all variables are selected or rejected (implemented in the sketch below)
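A compact NumPy sketch of these steps. The function name is mine, and for brevity the rejection test uses the squared multiple correlation from the first iteration onwards, a slight simplification of the pairwise test in step 2.

```python
import numpy as np

def ufs(X, rsqmax=0.5):
    """Unsupervised forward selection: indices of retained columns of X."""
    n, p = X.shape
    Z = (X - X.mean(0)) / X.std(0, ddof=1)      # standardize the columns
    R2 = np.corrcoef(Z, rowvar=False) ** 2      # squared pairwise correlations
    np.fill_diagonal(R2, np.inf)

    # step 1: start from the least-correlated pair of variables
    i, j = np.unravel_index(np.argmin(R2), R2.shape)
    selected = [i, j]
    candidates = set(range(p)) - {i, j}

    while candidates:
        A = Z[:, selected]
        rsq = {}
        for k in list(candidates):
            # squared multiple correlation of column k with the selected set
            beta, *_ = np.linalg.lstsq(A, Z[:, k], rcond=None)
            r2 = 1.0 - ((Z[:, k] - A @ beta) ** 2).sum() / (Z[:, k] ** 2).sum()
            if r2 > rsqmax:
                candidates.discard(k)           # steps 2/4: reject
            else:
                rsq[k] = r2
        if not rsq:
            break
        best = min(rsq, key=rsq.get)            # step 3: smallest R^2 wins
        selected.append(best)
        candidates.discard(best)
    return selected
```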

Page 9:

Continuum Regression

• A regression procedure with the generalized criterion function (see the sketch below)

  F = (c'X'y)^(4(1-α)) (c'X'Xc)^(2(2α-1))

• Varying the continuous parameter 0 ≤ α ≤ 1.5 adjusts the balance between the covariance of the response with the descriptors and the variance of the descriptors, so that
  – α = 0 is equivalent to ordinary least squares
  – α = 0.5 is equivalent to partial least squares
  – α = 1.0 is equivalent to principal components regression
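Writing S = c'X'y and V = c'X'Xc, the criterion can be evaluated directly. The helper below is a sketch; the exponent form follows the reconstructed formula above, chosen so that α = 0, 0.5 and 1.0 reduce to the OLS, PLS and PCR criteria.

```python
import numpy as np

def cr_criterion(c, X, y, alpha):
    """Continuum regression criterion F(c) for a unit-norm weight vector c."""
    S = c @ X.T @ y                      # covariance term c'X'y
    V = c @ X.T @ X @ c                  # variance term  c'X'Xc
    # base S**2 keeps the value real for non-integer exponents
    return (S ** 2) ** (2 * (1 - alpha)) * V ** (2 * (2 * alpha - 1))
```

The first CR component maximizes F over unit vectors c; subsequent components are usually extracted after deflating X, as in PLS.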

Page 10:

Model Selection Strategy

• For α = 0.0, 0.1, …, 1.5, build a CR model for the set of variables selected by UFS with rsqmax = 0.1, 0.2, …, 0.9, 0.99 (see the sketch below)

• Select the model with the rsqmax and α maximizing Q² (leave-one-out cross-validated R²)
  – Apply n-fold cross-validation to check predictive ability
  – Apply a randomization test (1000 permutations of the response scores) to guard against chance correlation
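A sketch of this strategy, reusing the `ufs` function above. `cr_fit` and `cr_predict` stand in for a continuum regression implementation and, like the other helper names, are assumptions.

```python
import numpy as np
from itertools import product

def loo_q2(X, y, fit, predict):
    """Leave-one-out cross-validated R^2 (Q^2)."""
    n, press = len(y), 0.0
    for i in range(n):
        mask = np.arange(n) != i
        model = fit(X[mask], y[mask])
        press += float((y[i] - predict(model, X[i:i + 1])) ** 2)
    return 1.0 - press / float(((y - y.mean()) ** 2).sum())

def select_model(X, y, cr_fit, cr_predict):
    """Grid search over rsqmax and alpha, maximizing Q^2."""
    best = (-np.inf, None)
    for rsqmax, alpha in product((0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99),
                                 np.linspace(0.0, 1.5, 16)):
        cols = ufs(X, rsqmax=rsqmax)                    # UFS variable subset
        fit = lambda A, b, a=alpha: cr_fit(A, b, alpha=a)
        q2 = loo_q2(X[:, cols], y, fit, cr_predict)
        if q2 > best[0]:
            best = (q2, (rsqmax, alpha, cols))
    return best

def randomization_test(X, y, q2_obs, q2_of, n_perm=1000, seed=0):
    """Tail probability: fraction of permuted responses reaching the observed Q^2."""
    rng = np.random.default_rng(seed)
    hits = sum(q2_of(X, rng.permutation(y)) >= q2_obs for _ in range(n_perm))
    return hits / n_perm
```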

Page 11:

Pyrethroid Data Set

• 70 physicochemical descriptors to predict killing activity (KA) of 19 pyrethroid insecticides

• Only 6 descriptors are correlated with KA at the 5% level

• Optimal models
  – 4-variable, 2-component model with R² = 0.775, Q² = 0.773 obtained when rsqmax = 0.7, α = 1.2
  – 3-variable, 1-component model with R² = 0.81, Q² = 0.76 obtained when rsqmax = 0.6, α = 0.2

Page 12:

Optimal Model I

• Standard errors are bootstrap estimates based on 5000 bootstrap samples

• Randomization test tail probabilities below 0.0003 for fit and 0.0071 for prediction

KA = 2.31 + 9.564 A5 + 0.8044 A3 + 0.00024 MIZ - 0.037 DVX
            (2.80)     (98)        (0.000083)    (0.11)

Page 13:

Optimal Model II

• Standard errors are bootstrap estimates based on 5000 bootstrap samples

• Randomization test tail probabilities below 0.0001 for fit and 0.0052 for prediction

KA = 1.80 + 8.567 A5 + 0.00019 MIZ - 0.20 DVX
            (2.00)     (0.000055)    (0.08)

Page 14:

N-Fold Cross-Validation

[Figures: 3-variable model and 4-variable model]

Page 15:

Feature Recognition

• Important explanatory variables may not be selected for inclusion in the model
  – force some variables in, then continue the UFS algorithm

• The component loadings for the original variables can be examined to identify variables highly correlated with the components in the model (see the sketch below)
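If loadings are taken to be the correlations between the original descriptors and the component score vector t (an assumption about the exact definition used here), they can be computed as:

```python
import numpy as np

def component_loadings(X, t):
    """Correlation of each original variable (column of X) with component scores t."""
    Xc = X - X.mean(axis=0)
    tc = t - t.mean()
    return (Xc * tc[:, None]).sum(axis=0) / np.sqrt(
        (Xc ** 2).sum(axis=0) * (tc ** 2).sum())
```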

Page 16:

Loadings for the 1-component pyrethroid model with tail probability < 0.01

variable   loading
A5          0.756
A3          0.723
A8          0.619
NS16       -0.605
DVX        -0.603
ES12       -0.584
MIZ         0.567

Page 17:

Steroid Data Set

• 21 steroid compounds from the SYBYL CoMFA tutorial, used to model binding affinity to human TBG (thyroxine-binding globulin)

• Initial data set has 1248 variables with values below 30 kcal/mol

• Removed 858 variables not significantly correlated with response (5% level)

• Removed 367 variables with variance below 1.0 kcal/mol

• Leaving 23 variables to be processed by UFS/CR

Page 18:

Optimal models

• UFS/CR produces a 3-variable, 1-component model with R² = 0.85, Q² = 0.83 at rsqmax = 0.3, α = 0.3

• CoMFA tutorial produces a 5-component model with R² = 0.98, Q² = 0.6

Page 19:

N-Fold Cross-Validation

[Figures: CoMFA tutorial model and UFS/CR model]

Page 20:

Putative Pharmacophore

Page 21:

Selwood Data Set

• 53 descriptors to predict biological activity of 31 antifilarial antimycin analogues

• 12 descriptors are correlated with the response variable at the 5% level

• Optimal models
  – 2-variable, 1-component model with R² = 0.42, Q² = 0.41 obtained when rsqmax = 0.1, α = 1.0
  – 12-variable, 1-component model with R² = 0.85, Q² = 0.5 obtained when rsqmax = 0.99, α = 0.0 (omitting compound M6)

Page 22:

N-Fold Cross-Validation

[Figures: 2-variable model and 12-variable model]

Page 23:

Summary

• Multicollinearity is a potential cause of poor predictive power in regression.

• The UFS algorithm eliminates redundancy and reduces multicollinearity, thus improving the chances of obtaining robust, low-dimensional regression models.

• Chance correlation can be addressed by eliminating variables that are uncorrelated with the response.

Page 24:

Summary

• UFS can be used to adjust the balance between reducing multicollinearity and including relevant information.

• Case studies show that leave-one-out cross-validation should be supplemented by n-fold cross-validation, in order to obtain accurate and precise estimates of predictive ability (Q²).

Page 25:

Acknowledgements

• AstraZeneca
• GlaxoSmithKline
• MSI
• Unilever

BBSRC Cooperation with Industry Project: Improved Mathematical Methods for Drug Design

Page 26:

Reference

D. C. Whitley, M. G. Ford and D. J. Livingstone. Unsupervised forward selection: a method for eliminating redundant variables. J. Chem. Inf. Comput. Sci., 2000, 40, 1160-1168.

UFS software available from: http://www.cmd.port.ac.uk

CR is a component of Paragon (available summer 2001)