Active Learning Strategies for Compound Screening
Megon Walker (1) and Simon Kasif (1,2)
(1) Bioinformatics Program, Boston University
(2) Department of Biomedical Engineering, Boston University
229th ACS National Meeting, March 13-17, 2005, San Diego, CA
Outline
- Introduction to active learning for compound screening
- Objectives and performance criteria
- Algorithms and procedures
- Thrombin dataset results
- Preliminary conclusions
Introduction: drug discovery
- drug discovery is an iterative process
- goal: to identify many target-binding compounds with minimal screening iterations
[Figure: compounds x features matrix of binary descriptors (example rows: 0 1 1 1 0, 1 0 1 1 0, 1 1 1 0 1, 0 1 1 0 1), with a cycle between compound selection and screening]
Introduction: supervised learning
- input: data set with positive and negative examples
- output: a classifier o(x) such that for each example x
  - o(x) = 1 if the example is positive
  - o(x) = -1 if the example is negative
- standard learning
  - classifier trains on a static training set
  - train, then test
- active learning
  - classifier chooses data points for the training set
  - classifier "requests" labels
  - iterative rounds of training and testing
Introduction: active learning & compound screening
- Mamitsuka et al. Proceedings of the Fifteenth International Conference on Machine Learning, 1998: 1-9.
- Warmuth et al. J. Chem. Inf. Comput. Sci. 2003, 43: 667-673.
[Figure: compounds x features matrix with an active/inactive (A/I) label column, shown after the 1st query and after the 2nd query. Labeled compounds are used to train classifier #1 and classifier #2; compounds that are not yet labeled and the test compounds are marked "?". After the 2nd query, more training compounds carry A/I labels.]
Objectives
- exploitation
  - Hit Performance
  - Enrichment Factor (EF)
- exploration
  - Accurate model of activity
  - Sensitivity
Sensitivity = \frac{\text{true positives}}{\text{total positives}}
EF = \frac{hits_{sampled} / N_{sampled}}{hits_{total} / N_{total}}
[Figure: Hit Performance. x-axis: fraction of compounds tested; y-axis: fraction of hits found (0 to 1). Curves: optimal sample selection, other sample selection, random sample selection.]
Methods: datasets
- 632 DuPont thrombin-targeting compounds (refs. 1-4)
  - 149 actives
  - 483 inactives
- a binary feature vector for each compound
  - shape-based features
  - pharmacophore features
  - 139,351 features
- retrospective data
- 200 features selected by mutual information (MI) w.r.t. activity labels
  - mean MI = 0.126
I(X;Y) = \sum_{x} \sum_{y} P(x,y) \log_2 \frac{P(x,y)}{P(x) P(y)}
[Figure: compounds x features matrix with an activity label (A = active, I = inactive) per compound; example rows: 0 1 1 1 A, 1 0 1 1 I, 1 1 1 0 I, 0 1 1 0 A]
1. Warmuth et al. J Chem Inf Comput Sci. 2003 Mar-Apr;43(2):667-73.
2. Eksterowicz et al. J Mol Graph Model. 2002 Jun;20(6):469-77.
3. Putta et al. J Chem Inf Comput Sci. 2002 Sep-Oct;42(5):1230-40.
4. KDD Cup 2001. http://www.cs.wisc.edu/~dpage/kddcup2001/
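The MI-based feature selection on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `mutual_information` and `select_top_features` are hypothetical, probabilities are estimated by simple counting, and the toy data is the 4 x 4 example matrix from the figure rather than the thrombin set.

```python
from collections import Counter
from math import log2

def mutual_information(feature, labels):
    """I(X;Y) in bits between one binary feature column and the activity labels."""
    n = len(labels)
    joint = Counter(zip(feature, labels))   # counts of (x, y) pairs
    px = Counter(feature)                   # counts of x
    py = Counter(labels)                    # counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def select_top_features(matrix, labels, k):
    """Return the indices of the k feature columns with the highest MI."""
    n_features = len(matrix[0])
    scores = [
        mutual_information([row[j] for row in matrix], labels)
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

# Toy data: the example matrix from the figure (A = active, I = inactive).
X = [[0, 1, 1, 1], [1, 0, 1, 1], [1, 1, 1, 0], [0, 1, 1, 0]]
y = ["A", "I", "I", "A"]
top = select_top_features(X, y, k=1)
```

In the toy matrix the first column separates actives from inactives exactly, so it carries the most information about the label. On the real data, `k` would be 200.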
Procedure (flowchart)
Start
1. Input data files
2. Pick training and testing data for next round of cross validation
3. 1st batch?
   - yes: select 1st batch randomly or by chemist
   - no: sample selection by P(active), uncertainty, or density
4. Query training set batch labels
5. Train classifier committee on labeled training set subsamples
6. Predict compound labels by committee weighted majority vote
7. All training set labels queried?
   - no: return to step 3
   - yes: continue
8. Cross validation completed?
   - no: return to step 2
   - yes: continue
9. Accuracy and performance statistics
End
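The inner loop of the flowchart (one cross-validation round) can be sketched as below. This is a schematic under stated assumptions: `active_learning_round`, `train`, and `select` are hypothetical names, the first batch is drawn at random (the slide also allows a chemist's pick), and the toy `train` and `select` callables only stand in for a real committee and a real selection strategy.

```python
import random

def active_learning_round(pool_labels, batch_size, train, select, seed=0):
    """Query labels batch by batch until the whole training pool is labeled.

    pool_labels: dict compound_id -> true label, used as the label oracle
    train:  callable(labeled dict) -> model                    (hypothetical)
    select: callable(model, unlabeled ids, k) -> list of ids   (hypothetical)
    Returns the order in which compounds were queried.
    """
    rng = random.Random(seed)
    unlabeled = sorted(pool_labels)
    labeled = {}
    queried_order = []
    model = None
    while unlabeled:
        k = min(batch_size, len(unlabeled))
        if model is None:
            batch = rng.sample(unlabeled, k)      # 1st batch: random selection
        else:
            batch = select(model, unlabeled, k)   # later batches: sample selection
        for cid in batch:
            labeled[cid] = pool_labels[cid]       # query the batch labels
            unlabeled.remove(cid)
            queried_order.append(cid)
        model = train(labeled)                    # retrain on everything labeled so far
    return queried_order

# Toy stand-ins: the "model" is just the labeled set, selection takes the first k ids.
pool = {i: (1 if i % 3 == 0 else -1) for i in range(10)}
order = active_learning_round(
    pool,
    batch_size=3,
    train=lambda labeled: labeled,
    select=lambda model, unlabeled, k: unlabeled[:k],
)
```

The returned query order is what a hit performance curve would be computed from.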
Methods: cross validation
- 5X cross validation
[Figure: the compounds x features matrix split into five blocks of compounds. 1st round: blocks 1-4 train, block 5 test. 2nd round: blocks 1-3 and 5 train, block 4 test.]
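A five-fold split like the one in the figure could be generated as follows. This is a sketch only: the helper name `k_fold_splits` is hypothetical, the compounds are shuffled with a fixed seed, and the slide does not say whether the folds were stratified by activity, so no stratification is done here.

```python
import random

def k_fold_splits(n_compounds, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k cross-validation rounds."""
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal blocks
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 632 compounds, as in the thrombin dataset described earlier.
splits = list(k_fold_splits(632, k=5))
```

Each compound appears in exactly one testing block, and active learning is run on the remaining four blocks in every round.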
Methods: perceptron
- given
  - binary input vector, x
  - weight vector, w
  - threshold value, T
  - learning rate, n
  - classification, t
- TEST: o(x) = 1 if \sum_i w_i x_i > T, else -1
- TRAIN:
  - if classified correctly, do nothing
  - if misclassified, w_i \leftarrow w_i + n (t - o(x)) x_i
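The TEST and TRAIN rules above translate almost directly into code. The sketch below follows the slide's symbols (w, T, n, t, o(x)); the number of passes over the data (`epochs`), the zero initial weights, and the default values of T and n are assumptions made for illustration.

```python
def predict(w, x, T):
    """TEST rule: o(x) = 1 if sum_i w_i * x_i > T, else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > T else -1

def train_perceptron(examples, n_features, T=0.5, n=0.1, epochs=20):
    """TRAIN rule: update the weights only when an example is misclassified."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, t in examples:
            o = predict(w, x, T)
            if o != t:
                # w_i <- w_i + n * (t - o(x)) * x_i
                w = [wi + n * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Toy data: binary feature vectors labeled 1 (active) or -1 (inactive).
T = 0.5
examples = [
    ([0, 1, 1, 1], 1),
    ([1, 0, 1, 1], -1),
    ([1, 1, 1, 0], -1),
    ([0, 1, 1, 0], 1),
]
w = train_perceptron(examples, n_features=4, T=T)
```

Because x is binary, an update only changes the weights of features that are switched on in the misclassified compound.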
Methods: classifier committees
- bagging: uniform sampling distribution
- boosting: compounds misclassified by classifier #1 are more likely to be resampled by classifier #2
[Figure: compounds x features matrix with an A/I label column. Classifier #1 and classifier #2 each train on a subsample of the labeled compounds; compounds that are not yet labeled and the test compounds are marked "?".]
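The difference between the two resampling schemes can be illustrated as follows. This is a simplified sketch: the function names are hypothetical, and the fixed `boost` multiplier is an assumption standing in for whatever reweighting rule was actually used (the slide only says misclassified compounds are more likely to be resampled).

```python
import random

def bagging_sample(labeled_ids, rng):
    """Bagging: draw with replacement from a uniform distribution."""
    return rng.choices(labeled_ids, k=len(labeled_ids))

def boosting_sample(labeled_ids, misclassified, rng, boost=3.0):
    """Boosting-style draw: compounds the previous classifier got wrong
    receive a larger sampling weight (boost is a hypothetical factor)."""
    weights = [boost if cid in misclassified else 1.0 for cid in labeled_ids]
    return rng.choices(labeled_ids, weights=weights, k=len(labeled_ids))

rng = random.Random(0)
ids = list(range(100))
bag = bagging_sample(ids, rng)
# With an exaggerated boost, compound 0 dominates the next training subsample.
boosted = boosting_sample(ids, misclassified={0}, rng=rng, boost=1000.0)
```

Each committee member is then trained on its own subsample, so bagged members differ only by sampling noise while boosted members concentrate on the hard compounds.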
Methods: weighted voting
- weighted vote of all classifiers predicts compound activity label
- each vote: perceptron output x perceptron weight
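A weighted majority vote of this kind can be sketched in a few lines. The slide does not say how each perceptron's weight is derived, so the weights below are made-up numbers, and breaking a tie toward inactive is also an assumption.

```python
def committee_vote(outputs, weights):
    """Sum (perceptron output x perceptron weight) over the committee.

    outputs: list of +1 / -1 predictions, one per perceptron
    weights: list of non-negative vote weights (hypothetical values here)
    Returns (predicted label, raw weighted score).
    """
    score = sum(o * a for o, a in zip(outputs, weights))
    label = 1 if score > 0 else -1   # a tie counts as inactive (assumption)
    return label, score

# One confident "active" vote outweighs two weaker "inactive" votes.
label_a, score_a = committee_vote([1, -1, -1], [0.9, 0.2, 0.3])
# With a weaker first member, the inactive votes win.
label_b, score_b = committee_vote([1, -1, -1], [0.2, 0.3, 0.3])
```

The raw score is also useful on its own, since its magnitude reflects how strongly the committee agrees.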
Methods: sample selection strategies
- P(active): select compounds predicted active with highest probability by the committee
- uncertainty: select compounds on which the committee disagrees most strongly
- density with respect to actives: select compounds most similar to previously labeled or predicted actives
  - Tanimoto similarity metric
  - given compound bitstrings A and B
    - a = # bits on in A
    - b = # bits on in B
    - c = # bits on in both A and B
  - Tanimoto(A, B) = \frac{c}{a + b - c}
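The three strategies can be written as scoring functions over the unlabeled compounds, with the batch taken from the top of the ranking. This is an illustrative sketch: the function names are hypothetical, P(active) is approximated by the fraction of committee members voting active, uncertainty by how evenly the votes are split, and density by the mean Tanimoto similarity to known actives. The slide does not give these exact definitions.

```python
def tanimoto(A, B):
    """Tanimoto similarity c / (a + b - c) for two equal-length bitstrings."""
    a = sum(A)                                      # bits on in A
    b = sum(B)                                      # bits on in B
    c = sum(1 for x, y in zip(A, B) if x and y)     # bits on in both
    return c / (a + b - c) if (a + b - c) else 0.0

def p_active_score(votes):
    """Fraction of committee members (+1 / -1 votes) predicting active."""
    return sum(1 for v in votes if v == 1) / len(votes)

def uncertainty_score(votes):
    """1.0 when the committee is split evenly, 0.0 when it is unanimous."""
    return 1.0 - abs(sum(votes)) / len(votes)

def density_score(candidate, actives):
    """Mean Tanimoto similarity of a candidate to the known actives."""
    if not actives:
        return 0.0
    return sum(tanimoto(candidate, act) for act in actives) / len(actives)

def select_batch(candidates, score, k):
    """Return the k candidate ids with the highest score."""
    return sorted(candidates, key=score, reverse=True)[:k]

# Toy committee votes for three unlabeled compounds.
votes = {"c1": [1, 1, 1, 1], "c2": [1, -1, 1, -1], "c3": [-1, -1, -1, 1]}
by_p_active = select_batch(votes, lambda cid: p_active_score(votes[cid]), k=1)
by_uncertainty = select_batch(votes, lambda cid: uncertainty_score(votes[cid]), k=1)
```

P(active) and density favor exploitation, while uncertainty favors exploration, which is the trade-off the objectives slide sets up.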
Methods: performance criteria
- Hit Performance
- Enrichment Factor (EF)
- Sensitivity
Sensitivity = \frac{\text{true positives}}{\text{total positives}}
EF = \frac{hits_{sampled} / N_{sampled}}{hits_{total} / N_{total}}
[Figure: Hit Performance. x-axis: fraction of compounds tested; y-axis: fraction of hits found (0 to 1). Curves: optimal sample selection, other sample selection, random sample selection.]
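The three criteria can be computed as in the sketch below. The function names and the toy labels are hypothetical; labels follow the earlier convention of 1 for active and -1 for inactive.

```python
def sensitivity(true_labels, predicted_labels):
    """True positives divided by total positives."""
    positives = [p for t, p in zip(true_labels, predicted_labels) if t == 1]
    return sum(1 for p in positives if p == 1) / len(positives)

def enrichment_factor(hits_sampled, n_sampled, hits_total, n_total):
    """Hit rate among sampled compounds relative to the overall hit rate."""
    return (hits_sampled / n_sampled) / (hits_total / n_total)

def hit_performance_curve(query_order, labels):
    """List of (fraction of compounds tested, fraction of hits found)."""
    total_hits = sum(1 for cid in labels if labels[cid] == 1)
    found = 0
    curve = []
    for i, cid in enumerate(query_order, start=1):
        if labels[cid] == 1:
            found += 1
        curve.append((i / len(query_order), found / total_hits))
    return curve

# Toy example: four compounds, two of them active, actives queried first.
labels = {0: 1, 1: -1, 2: 1, 3: -1}
curve = hit_performance_curve([0, 2, 1, 3], labels)
ef = enrichment_factor(hits_sampled=5, n_sampled=10, hits_total=10, n_total=100)
sens = sensitivity([1, 1, -1, 1], [1, -1, -1, 1])
```

Random selection is expected to give an EF near 1, so values above 1 indicate that the selection strategy is finding hits faster than chance.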
Results: hit performance
[Figure: Hit Performance. x-axis: fraction of compounds tested; y-axis: fraction of hits found (0 to 1). Curves: P(active)_bagged, P(active)_boosted, uncertainty_bagged, uncertainty_boosted, density_bagged, perceptron_density_boosted, optimal, random.]
[Figure: Enrichment. x-axis: fraction of compounds tested; y-axis: enrichment (0.8 to 3.8).]
Results: sensitivity
- uncertainty: highest testing set sensitivity initially
- no significant increase in testing set sensitivity
[Figure: Testing Set Sensitivity. x-axis: fraction of compounds tested (0.05 to 0.80); y-axis: sensitivity (0.55 to 0.8). Curves: P(active)_bagged, P(active)_boosted, density_bagged, density_boosted, uncertainty_bagged, uncertainty_boosted.]
Results: bagging vs. boosting
- boosting
  - training set TP climbs faster, converges higher
  - overfits to the training data
[Figure: Training Set Sensitivity. x-axis: fraction of compounds tested (0.05 to 0.80); y-axis: sensitivity (0.65 to 1). Curves: P(active)_bagged, P(active)_boosted, density_bagged, density_boosted, uncertainty_bagged, uncertainty_boosted.]
Results: # classifiers
[Figure: Hit Performance (Bagged, Density). x-axis: fraction of compounds tested (0.05 to 0.80); y-axis: fraction of hits found (0 to 1). One curve per number of classifiers: 1, 25, 50, 100.]
[Figure: Testing Set Sensitivity (Bagged, Density). x-axis: fraction of compounds tested; y-axis: sensitivity (0.5 to 0.8).]
Conclusions
- Sample selection
- Bag vs. boost
- Committee vs. single classifier
- Testing set sensitivity
- Trade-off: exploration and exploitation