Active Learning Strategies for Compound Screening

Megon Walker (1) and Simon Kasif (1,2)

(1) Bioinformatics Program, Boston University
(2) Department of Biomedical Engineering, Boston University

229th ACS National Meeting, March 13-17, 2005, San Diego, CA

Outline

- Introduction to active learning for compound screening
- Objectives and performance criteria
- Algorithms and procedures
- Thrombin dataset results
- Preliminary conclusions

Introduction: drug discovery

- Drug discovery is an iterative process.
- Goal: to identify many target-binding compounds with minimal screening iterations.

[Figure: binary matrix of compounds (rows) by features (columns); compounds are described by descriptors, then cycle through selection and screening.]

Introduction: supervised learning

- Input: a data set with positive and negative examples.
- Output: a classifier $o(x)$ such that, for each example $x$, $o(x) = 1$ if the example is positive and $o(x) = -1$ if the example is negative.

Standard learning
- The classifier trains on a static training set.
- Train, then test.

Active learning
- The classifier chooses the data points for its training set.
- The classifier "requests" labels.
- Training and testing proceed in iterative rounds.

Introduction: active learning & compound screening

Mamitsuka et al. Proceedings of the Fifteenth International Conference on Machine Learning, 1998: 1-9.

Warmuth et al. J. Chem. Inf. Comput. Sci. 2003, 43: 667-673.

[Figure: 1st query and 2nd query. Each panel shows the compounds-by-features matrix with an activity column (A = active, I = inactive). Labeled compounds are used to train classifier #1 and classifier #2; the remaining training compounds are NOT labeled ("?"), and the test compounds are also unlabeled ("?"). In the 2nd query more training compounds carry A/I labels.]

Objectives

Exploitation
- Hit performance
- Enrichment factor (EF):

$$\mathrm{EF} = \frac{\mathrm{hits}_{\mathrm{sampled}} / N_{\mathrm{sampled}}}{\mathrm{hits}_{\mathrm{total}} / N_{\mathrm{total}}}$$

Exploration
- Accurate model of activity
- Sensitivity:

$$\mathrm{sensitivity} = \frac{\text{true positives}}{\text{total positives}}$$

[Figure: Hit Performance. x-axis: fraction of compounds tested; y-axis: fraction of hits found (0 to 1). Curves: optimal sample selection, other sample selection, random sample selection.]
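The hit performance curve described above can be sketched in a few lines. This is only an illustration on made-up data: the function name `hit_curve` and the toy activity list are assumptions, not part of the original slides. It compares an optimal ordering (all hits tested first) with a random ordering.

```python
import random

def hit_curve(order, is_hit):
    """Fraction of all hits found after testing each compound in `order`."""
    total_hits = sum(is_hit)
    found, curve = 0, []
    for idx in order:
        found += is_hit[idx]
        curve.append(found / total_hits)
    return curve

# Toy library: 10 compounds, 3 of them active (1 = hit). Made-up data.
is_hit = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

# Optimal selection tests every hit before any non-hit.
optimal_order = sorted(range(len(is_hit)), key=lambda i: -is_hit[i])
optimal = hit_curve(optimal_order, is_hit)

# Random selection: shuffle with a fixed seed for repeatability.
random_order = list(range(len(is_hit)))
random.Random(0).shuffle(random_order)
rand = hit_curve(random_order, is_hit)
```

Any other selection strategy yields a curve that lies on or below `optimal`; a useful strategy should lie above the random curve on average.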

Methods: datasets

- 632 DuPont thrombin-targeting compounds
  - 149 actives
  - 483 inactives
- A binary feature vector for each compound
  - shape-based features
  - pharmacophore features
  - 139,351 features
- Retrospective data
- 200 features selected by mutual information (MI) with respect to the activity labels
  - mean MI = 0.126

$$I(X;Y) = \sum_{x} \sum_{y} P(x,y) \log_2 \frac{P(x,y)}{P(x)\,P(y)}$$

[Figure: binary matrix of compounds (rows) by features (columns), with an activity label per compound (A = active, I = inactive).]

1. Warmuth et al. J. Chem. Inf. Comput. Sci. 2003 Mar-Apr; 43(2): 667-73.
2. Eksterowicz et al. J. Mol. Graph. Model. 2002 Jun; 20(6): 469-77.
3. Putta et al. J. Chem. Inf. Comput. Sci. 2002 Sep-Oct; 42(5): 1230-40.
4. KDD Cup 2001. http://www.cs.wisc.edu/~dpage/kddcup2001/
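As a rough illustration of MI-based feature selection, the sketch below estimates I(X;Y) from empirical frequencies for binary feature columns and binary activity labels, then keeps the top-scoring columns. The function names and the toy data are assumptions for illustration; the slides do not describe how the probabilities were estimated.

```python
from math import log2

def mutual_information(feature, labels):
    """I(X;Y) in bits between a binary feature column and binary labels."""
    n = len(feature)
    mi = 0.0
    for x in (0, 1):
        for y in (0, 1):
            p_xy = sum(1 for f, l in zip(feature, labels) if f == x and l == y) / n
            p_x = sum(1 for f in feature if f == x) / n
            p_y = sum(1 for l in labels if l == y) / n
            if p_xy > 0:  # a zero joint probability contributes nothing
                mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi

def top_k_features(columns, labels, k):
    """Indices of the k feature columns with the highest MI to the labels."""
    scores = [mutual_information(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda j: -scores[j])[:k]

labels = [1, 0, 0, 1]  # 1 = active, 0 = inactive (toy labels)
columns = [
    [1, 0, 0, 1],  # identical to the labels: 1 bit of MI
    [1, 1, 1, 1],  # constant feature: 0 bits
    [1, 0, 1, 0],  # independent of the labels: 0 bits
]
best = top_k_features(columns, labels, k=1)
```

On the real data the same ranking would be applied to all 139,351 columns with k = 200.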

[Flowchart: algorithm and procedure]

1. Start; input data files.
2. Pick training and testing data for the next round of cross validation.
3. 1st batch? If yes, select the 1st batch randomly or by chemist. If no, apply sample selection: P(active), uncertainty, or density.
4. Query the training set batch labels.
5. Train the classifier committee on labeled training set subsamples.
6. Predict compound labels by committee weighted majority vote.
7. All training set labels queried? If no, return to step 3.
8. Cross validation completed? If no, return to step 2.
9. Compute accuracy and performance statistics; end.
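The inner loop of the flowchart (batch selection, label query, retraining) can be sketched as follows. This is a minimal skeleton under stated assumptions: `active_learning_run`, `train`, `select`, and the toy oracle are hypothetical names and stand-ins, the prediction step and the outer cross-validation loop are omitted, and the first batch is always chosen at random.

```python
import random

def active_learning_run(pool, oracle, train, select, batch_size, rng):
    """Query labels for a training pool batch by batch.

    pool   : list of compound ids available for training
    oracle : dict id -> label, standing in for the screening assay
    train  : callable(labeled dict) -> model
    select : callable(model, unlabeled ids, batch_size) -> ids to query next
    Returns the order in which compounds were queried.
    """
    unlabeled = list(pool)
    labeled, order = {}, []
    while unlabeled:
        if not labeled:
            # 1st batch: random choice (a chemist could pick it instead).
            batch = rng.sample(unlabeled, min(batch_size, len(unlabeled)))
        else:
            model = train(labeled)
            batch = select(model, unlabeled, batch_size)
        for cid in batch:
            labeled[cid] = oracle[cid]  # "query" the label
            unlabeled.remove(cid)
            order.append(cid)
    return order

# Toy stand-ins (hypothetical): ids 15..19 are active, the "model" is the
# list of known actives, and selection prefers ids closest to a known active.
oracle = {i: 1 if i >= 15 else -1 for i in range(20)}

def train(labeled):
    return [cid for cid, label in labeled.items() if label == 1]

def select(actives, unlabeled, k):
    def distance(cid):
        return min((abs(cid - a) for a in actives), default=0)
    return sorted(unlabeled, key=distance)[:k]

order = active_learning_run(list(range(20)), oracle, train, select, 5, random.Random(0))
```

Passing `train` and `select` as callables keeps the loop independent of the classifier and of the sample selection strategy, so the strategies listed in the flowchart can be swapped without touching the loop.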

Methods: cross validation

- 5X cross validation

[Figure: the compounds-by-features matrix split into five blocks of compounds. In the 1st round the first four blocks are used for training and the fifth for testing; in the 2nd round the fourth block is used for testing and the others for training.]
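A five-block split like the one in the figure can be generated as below. This is a sketch only: the function name `k_fold_splits`, the shuffling step, and the seed are assumptions, since the slides do not say how compounds were assigned to blocks.

```python
import random

def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k rounds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal blocks
    for held_out in range(k):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test

# 632 compounds, as in the thrombin dataset.
splits = list(k_fold_splits(632, k=5))
```

Every compound appears in exactly one test block, so each compound is predicted once as unseen data over the five rounds.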


Methods: perceptron

Given:
- binary input vector, $x$
- weight vector, $w$
- threshold value, $T$
- learning rate, $n$
- classification, $t$

TEST:

$$o(x) = \begin{cases} 1 & \text{if } \sum_i w_i x_i > T \\ -1 & \text{otherwise} \end{cases}$$

TRAIN:
- If classified correctly, do nothing.
- If misclassified, update each weight:

$$w_i \leftarrow w_i + n\,(t - o(x))\,x_i$$
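The TEST and TRAIN rules above translate directly into code. The sketch below is an illustration rather than the authors' implementation: the fixed threshold of 0.5, the learning rate, the epoch limit, and the toy examples are all assumed values.

```python
def predict(weights, threshold, x):
    """o(x) = 1 if sum_i w_i * x_i > T, else -1."""
    activation = sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > threshold else -1

def train_perceptron(examples, n_features, threshold=0.5, rate=0.1, epochs=20):
    """Perceptron rule: on a mistake, w_i <- w_i + rate * (t - o(x)) * x_i."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(weights, threshold, x)
            if o != t:  # correctly classified examples leave w unchanged
                mistakes += 1
                weights = [w + rate * (t - o) * xi for w, xi in zip(weights, x)]
        if mistakes == 0:
            break  # every training example is classified correctly
    return weights

# Toy data: the label is +1 exactly when the first bit is on.
examples = [([1, 0, 1], 1), ([1, 1, 0], 1), ([0, 1, 1], -1), ([0, 0, 1], -1)]
weights = train_perceptron(examples, n_features=3)
```

Because the inputs are binary, each update only touches the weights of features that are switched on in the misclassified compound.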


Methods: classifier committees

- Bagging: uniform sampling distribution.
- Boosting: compounds misclassified by classifier #1 are more likely to be resampled by classifier #2.

[Figure: compounds-by-features matrix with an activity column (A = active, I = inactive). Labeled training compounds feed the committee; the remaining training compounds are NOT labeled ("?"), and the test compounds are also unlabeled ("?").]
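The difference between the two resampling schemes can be sketched as follows. The function names and the doubling factor for misclassified compounds are illustrative assumptions; this is not the exact update rule of AdaBoost, and the slides do not give the rule used in the study.

```python
import random

def bagging_sample(n, rng):
    """Bagging: draw n indices uniformly, with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def boosting_weights(weights, misclassified, factor=2.0):
    """Up-weight misclassified compounds, then renormalise to sum to 1."""
    raised = [w * factor if i in misclassified else w for i, w in enumerate(weights)]
    total = sum(raised)
    return [w / total for w in raised]

def boosting_sample(weights, rng):
    """Draw len(weights) indices with replacement, proportional to weights."""
    n = len(weights)
    return rng.choices(range(n), weights=weights, k=n)

rng = random.Random(0)
n = 8
uniform = [1 / n] * n
bag = bagging_sample(n, rng)

# Suppose classifier #1 misclassified compounds 2 and 5 (made-up example).
w2 = boosting_weights(uniform, misclassified={2, 5})
boost = boosting_sample(w2, rng)
```

With these weights, compounds 2 and 5 are each twice as likely as any other compound to appear in the training subsample of classifier #2.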



Methods: weighted voting

- A weighted vote of all classifiers predicts the compound activity label.
- Each vote is the perceptron output × the perceptron weight.



Methods: sample selection strategies

- P(active): select the compounds predicted active with the highest probability by the committee.
- Uncertainty: select the compounds on which the committee disagrees most strongly.
- Density with respect to actives: select the compounds most similar to previously labeled or predicted actives.

Tanimoto similarity metric, given compound bitstrings A and B:
- a = # bits on in A
- b = # bits on in B
- c = # bits on in both A and B

$$\mathrm{Tanimoto}(A, B) = \frac{c}{a + b - c}$$
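The three strategies and the Tanimoto metric can be sketched as below. Several details are assumptions made for illustration: `p_active` is taken to be a committee-derived probability in [0, 1], disagreement is measured as closeness of that probability to 0.5, and density is scored by the maximum similarity to any known active (a mean would be another reasonable choice).

```python
def tanimoto(a_bits, b_bits):
    """Tanimoto similarity c / (a + b - c) for two equal-length bitstrings."""
    a = sum(a_bits)
    b = sum(b_bits)
    c = sum(1 for x, y in zip(a_bits, b_bits) if x and y)
    denom = a + b - c
    return c / denom if denom else 0.0

def select_by_p_active(p_active, k):
    """P(active): the k candidates with the highest predicted P(active)."""
    return sorted(p_active, key=lambda cid: -p_active[cid])[:k]

def select_by_uncertainty(p_active, k):
    """Uncertainty: the k candidates whose committee vote is closest to a tie."""
    return sorted(p_active, key=lambda cid: abs(p_active[cid] - 0.5))[:k]

def select_by_density(candidates, actives, k):
    """Density: the k candidates most similar to any known or predicted active."""
    def best_similarity(cid):
        return max(tanimoto(candidates[cid], act) for act in actives)
    return sorted(candidates, key=lambda cid: -best_similarity(cid))[:k]

# Made-up committee probabilities and bitstrings.
p = {"c1": 0.9, "c2": 0.55, "c3": 0.1}
candidates = {"c1": [1, 1, 0, 0], "c2": [0, 0, 1, 1]}
actives = [[1, 1, 0, 1]]

# a = 3 bits on, b = 2 bits on, c = 2 bits shared, so similarity is 2 / 3.
sim = tanimoto([1, 1, 0, 1], [1, 0, 0, 1])
```

P(active) and density both favor exploitation (finding more hits), while uncertainty favors exploration (improving the model where it is least sure).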



Methods: performance criteria

- Hit performance
- Enrichment factor (EF):

$$\mathrm{EF} = \frac{\mathrm{hits}_{\mathrm{sampled}} / N_{\mathrm{sampled}}}{\mathrm{hits}_{\mathrm{total}} / N_{\mathrm{total}}}$$

- Sensitivity:

$$\mathrm{sensitivity} = \frac{\text{true positives}}{\text{total positives}}$$

[Figure: Hit Performance. y-axis: fraction of hits found (0 to 1). Curves: optimal sample selection, other sample selection, random sample selection.]
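Both criteria are simple ratios, as the sketch below shows. The dataset totals (149 actives among 632 compounds) come from the slides; the sampled counts and the label lists are made up for illustration, and the function names are assumptions.

```python
def enrichment_factor(hits_sampled, n_sampled, hits_total, n_total):
    """EF = (hits_sampled / N_sampled) / (hits_total / N_total)."""
    return (hits_sampled / n_sampled) / (hits_total / n_total)

def sensitivity(true_labels, predicted_labels, positive=1):
    """True positives divided by total positives."""
    positives = [(t, p) for t, p in zip(true_labels, predicted_labels) if t == positive]
    true_pos = sum(1 for _, p in positives if p == positive)
    return true_pos / len(positives)

# Hypothetical screen: 60 hits among the first 126 compounds tested.
ef = enrichment_factor(hits_sampled=60, n_sampled=126, hits_total=149, n_total=632)

# Hypothetical predictions on four test compounds (1 = active, -1 = inactive).
sens = sensitivity([1, 1, -1, 1], [1, -1, -1, 1])
```

An EF of 1 means the selected sample is no richer in hits than the library as a whole, which is what random selection achieves on average.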



Results: hit performance

[Figure: Hit Performance. y-axis: fraction of hits found (0 to 1). Series: P(active)_bagged, P(active)_boosted, uncertainty_bagged, uncertainty_boosted, density_bagged, perceptron_density_boosted, optimal, random.]

[Figure: Enrichment. y-axis: enrichment (0.8 to 3.8).]

Results: sensitivity

- Uncertainty: highest testing set sensitivity initially.
- No significant increase in testing set sensitivity.

[Figure: Testing Set Sensitivity. x-axis ticks: 0.05 to 0.80; y-axis: sensitivity (0.55 to 0.8). Series: P(active)_bagged, P(active)_boosted, density_bagged, density_boosted, uncertainty_bagged, uncertainty_boosted.]

Results: bagging vs. boosting

Boosting:
- Training set true positives (TP) climb faster and converge higher.
- Overfits to the training data.

[Figure: Training Set Sensitivity. x-axis ticks: 0.05 to 0.80; y-axis: sensitivity (0.65 to 1). Series: P(active)_bagged, P(active)_boosted, density_bagged, density_boosted, uncertainty_bagged, uncertainty_boosted.]

Results: # classifiers

[Figure: Hit Performance (Bagged, Density). x-axis ticks: 0.05 to 0.80; y-axis: fraction of hits found (0 to 1). Series: 1, 25, 50, 100 classifiers.]

[Figure: Testing Set Sensitivity (Bagged, Density). y-axis: sensitivity (0.5 to 0.8).]

Conclusions

- Sample selection
- Bag vs. boost
- Committee vs. single classifier
- Testing set sensitivity
- Trade-off: exploration and exploitation
