Fingerprinting Chemical Structures
Rajarshi Guha https://github.com/rajarshi/ctpa-fingerprints
September 9 2014
High Throughput Screening
• Test thousands to hundreds of thousands of compounds in one or more assays
  – Biochemical, genetic, pharmacological assays
• Employs a robotic platform
• Rapidly identify novel modulators of biological systems
  – Infectious agents
  – Cellular basis of diseases
Goal of HTS
• Rapidly screen large compound collections
• Efficiently identify real actives
  – Test them in slower, accurate, expensive screens
• Use the data to learn what types of compounds tend to be active
• Use the model to suggest more compounds to screen
[Figure: number of molecules at each screening stage (HTS, cherry picks); values shown: 300K, 1000, 300]
HTS Data Types
• Categorical
  – active/inactive or toxic/nontoxic
• Continuous
  – Single point
  – Dose response
• Multiple readouts
  – Might read at different wavelengths or timepoints
  – More complex when dealing with imaging
• These (usually) represent the dependent variable
[Figures: response vs. log10 concentration; response vs. concentration]
Independent Variable(s)
• HTS tests the activity of a molecule
  – The molecule is our "independent variable"
• Need to describe the molecular structure
  – Various discrete or real-valued descriptors
  – Surfaces (3D)
  – Binary fingerprints
Activity = f(Structure)
Fingerprint Representation
• Lots of types of fingerprints
• "Keyed" fingerprints indicate the presence or absence of a structural feature
• Length can vary from 166 to 4096 bits or more
• Fingerprints are usually compared using the Tanimoto metric
1 0 1 1 0 0 0 1 0
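The Tanimoto comparison mentioned above can be sketched in base R. This is a minimal illustration on plain 0/1 vectors, not the fingerprint package's implementation; the function name `tanimoto` and the second fingerprint `fp2` are made up for the example.

```r
# Tanimoto similarity for two binary fingerprints stored as 0/1 vectors:
# (bits on in both) / (bits on in either).
tanimoto <- function(x, y) {
  stopifnot(length(x) == length(y))
  both <- sum(x & y)
  either <- sum(x | y)
  if (either == 0) return(NA_real_)  # undefined when no bit is set in either
  both / either
}

fp1 <- c(1, 0, 1, 1, 0, 0, 0, 1, 0)  # the bit string shown on the slide
fp2 <- c(1, 0, 0, 1, 0, 1, 0, 1, 0)  # hypothetical second fingerprint
sim <- tanimoto(fp1, fp2)            # 3 shared bits out of 5 set in either
print(sim)
```

Identical fingerprints give 1, and fingerprints with no bits in common give 0.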
What Can I Use Them For?
• Search
  – Given a potent active molecule, find similar ones (or dissimilar, but also potent)
• Prediction
  – Given a set of active & inactive molecules, build a model to predict which members from a large collection will be active
• Clustering
  – Given a set of molecules, do they cluster into structurally different groups?
Fingerprints in R
• The fingerprint package supports I/O, manipulation, similarity methods, and various utility methods
• A fingerprint is an S4 object
  – Create them manually
    new("fingerprint", nbit = 1024, bits = c(1,4,5,100,200))
  – Read them in from files
    fp.read('data/cdk.fp', size=1024, lf=cdk.lf)
Getting Fingerprints
• You can also generate fingerprints from chemical structures using the rcdk package
• If you're not doing cheminformatics you can read in your own FP data by implementing a line reader
  – See cdk.lf, moe.lf, bci.lf
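As a rough sketch of what a line reader looks like, the function below parses one line of a made-up file format and returns a list of (name, bits, extra data), the same shape returned by the custom reader used for the solubility data in this deck. The format (`<name>,<bit positions separated by ';'>`) and the name `csv.lf` are assumptions for illustration only.

```r
# Hypothetical line format: "<name>,<on-bit positions separated by ';'>"
# e.g. "doc42,3;17;101"
csv.lf <- function(line) {
  toks <- strsplit(line, ",", fixed = TRUE)[[1]]
  name <- toks[1]
  bits <- if (length(toks) > 1 && nzchar(toks[2])) {
    as.numeric(strsplit(toks[2], ";", fixed = TRUE)[[1]])
  } else {
    numeric(0)  # no bits listed on this line
  }
  # Same return shape as the line reader in the solubility example:
  # name, bits, and an (empty) list for any extra data
  list(name, bits, list())
}

rec <- csv.lf("doc42,3;17;101")
str(rec)
```

A reader like this would then be passed as the `lf` argument, along the lines of `fp.read('myfile.txt', size=1024, lf=csv.lf)`, where the file name is again hypothetical.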
Random Fingerprints
• Useful for benchmarking, generating null distributions, exploring effects of bit density
library(fingerprint)

## How long does a similarity matrix calculation take as a function of fp length?
nfp <- 300
sizes <- c(64, 128, 512, 1024, 4096, 8192)
times <- sapply(sizes, function(size) {
  fps <- lapply(1:nfp, function(i) random.fingerprint(size, size * 0.35))
  system.time(junk <- fp.sim.matrix(fps))[3]
})

## For a given length, how does bit density affect calculation time?
densities <- c(0.1, 0.25, 0.5, 0.75, 0.95)
times <- sapply(densities, function(density) {
  fps <- lapply(1:nfp, function(i) random.fingerprint(1024, 1024 * density))
  system.time(junk <- fp.sim.matrix(fps))[3]
})
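The "null distribution" use case can be sketched without the package. The base R example below draws pairs of random bit vectors at a fixed density and collects their Tanimoto similarities; the helper `random.bits` and all parameter values are assumptions chosen for illustration.

```r
set.seed(1)
nbit <- 1024
density <- 0.35
npair <- 2000

# Random fingerprint as a logical vector with exactly `non` bits switched on
random.bits <- function(nbit, non) {
  fp <- logical(nbit)
  fp[sample.int(nbit, non)] <- TRUE
  fp
}

non <- round(nbit * density)
null.sims <- replicate(npair, {
  x <- random.bits(nbit, non)
  y <- random.bits(nbit, non)
  sum(x & y) / sum(x | y)  # Tanimoto similarity of the random pair
})

print(mean(null.sims))
print(quantile(null.sims, 0.99))
```

For bit density p, the values should cluster roughly around p / (2 - p), about 0.21 here. An observed similarity well above the upper quantile is then unlikely to be explained by bit density alone.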
Random Fingerprints
[Figures: time (s) vs. fingerprint length; time (s) vs. bit density]
[Figure: density of similarity values (0 to 1) by metric: Dice, Tanimoto]
Compare Similarity Metrics
• More than 20 similarity metrics
  – Some are written in C, so very fast and applicable to larger fingerprint collections
  – Others are in pure R and slower
fps <- fp.read('data/cdk.fp', size=881, lf=cdk.lf, header=TRUE)[1:500]
s.tanimoto <- fp.sim.matrix(fps, method='tanimoto')
s.dice <- fp.sim.matrix(fps, method='dice')
d <- rbind(data.frame(method='Tanimoto', s=as.numeric(s.tanimoto)),
           data.frame(method='Dice', s=as.numeric(s.dice)))
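To see how the two metrics compared above relate, here is a small base R sketch on random 0/1 vectors (not the package's C implementations; the function names and data are illustrative assumptions). With c bits in common and a and b bits set in each fingerprint, Tanimoto is c / (a + b - c) and Dice is 2c / (a + b), so Dice = 2T / (1 + T).

```r
tanimoto <- function(x, y) sum(x & y) / sum(x | y)
dice <- function(x, y) 2 * sum(x & y) / (sum(x) + sum(y))

set.seed(42)
x <- rbinom(881, 1, 0.3)  # hypothetical 881-bit fingerprints
y <- rbinom(881, 1, 0.3)

t.xy <- tanimoto(x, y)
d.xy <- dice(x, y)

# Dice is a monotone transform of Tanimoto
print(all.equal(d.xy, 2 * t.xy / (1 + t.xy)))
```

Because the transform is monotone, the two metrics rank pairs identically; only the shape of the similarity distribution differs, with Dice values never below the corresponding Tanimoto values.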
Predicting with Fingerprints
• Read in fingerprints & convert to matrix form
• See
  – data/solubility.csv
  – data/solubility.maccs
• 33,182 observations of solubility
• 57,857 fingerprints
• Requires some data wrangling before modeling
OOB estimate of error rate: 22.37%
Confusion matrix:
        high   low  medium  class.error
high     181    52     621   0.78805621
low       35  5611    4598   0.45226474
medium    89  2029   19965   0.09591088
[Figure: frequency of each solubility class (high, low, medium)]
Predicting with Fingerprints
• The model will use MACCS keys
  – 166 bits
  – Each bit is associated with a structural feature
• Low resolution, somewhat simplistic
• Data comes in a non-standard format, so we must implement our own line reader
• Classification problem: predict low/medium/high solubility
Predicting with Fingerprints
library(fingerprint)
sol <- read.csv('data/solubility.csv', header=TRUE)

## Custom line reader: the first token is the title, the rest are the bits
fps <- fp.read('data/solubility.maccs', header=FALSE, size=166,
               lf=function(line) {
                 toks <- strsplit(line, " ")[[1]]
                 title <- toks[1]
                 bits <- as.numeric(toks[2:length(toks)])
                 list(title, bits, list())
               })

## Extract fingerprints for which we have a label
common <- which( sapply(fps, function(x) x@name) %in% sol$sid )
fps <- fps[common]

## Order the fingerprints & data
sol <- sol[order(sol$sid),]
fps <- fps[order(sapply(fps, function(x) as.integer(x@name)))]

## Make X matrix
fpm <- fp.to.matrix(fps)

## Model!
library(randomForest)
m1 <- randomForest(x=fpm, y=as.factor(sol$label))
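The "Make X matrix" step produces one row per molecule and one column per bit. A minimal base R sketch of that idea follows, assuming each fingerprint is given as the positions of its on bits. The function `bits.to.matrix` and the toy data are hypothetical and are not how fp.to.matrix is implemented.

```r
# Toy data: each fingerprint is the set of positions of its on bits
on.bits <- list(a = c(1, 4, 5), b = c(2, 4), c = integer(0))
nbit <- 6

bits.to.matrix <- function(on.bits, nbit) {
  m <- matrix(0L, nrow = length(on.bits), ncol = nbit,
              dimnames = list(names(on.bits), NULL))
  # Switch on the listed positions, one row per fingerprint
  for (i in seq_along(on.bits)) m[i, on.bits[[i]]] <- 1L
  m
}

X <- bits.to.matrix(on.bits, nbit)
print(X)
```

Each column of such a matrix is then a binary predictor, which is the form a model such as a random forest expects.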
Predicting with Fingerprints
• We can then use the RF variable importance measure
• Features important for predictive performance
  – Presence of aromatic rings
  – Presence of charged atoms
  – Presence of 6-membered rings
  – N & O atoms connected in a chain
• Chemically sensible
[Figure: random forest variable importance (MeanDecreaseGini) for the most important MACCS bits]
MACCS key definitions: https://github.com/cdk/cdk/blob/master/descriptor/fingerprint/src/main/resources/org/openscience/cdk/fingerprint/data/maccs.txt
Clustering with Fingerprints
• Generate a distance matrix directly from a list of fingerprints
library(fingerprint)
fps <- fp.read('data/cdk.fp', size=881, lf=cdk.lf)[1:500]
sims <- fp.sim.matrix(fps)
## Convert similarities to distances before clustering
dmat <- as.dist(1-sims)
clus <- hclust(dmat)
par(mar=c(1,4,1,1))
plot(clus, labels=FALSE, xlab='', main='')
[Figure: cluster dendrogram; y-axis: Height (0.0 to 0.8)]
• Exercise: How do clusters vary with similarity metric and/or fingerprint type?
Comparing Data Sets
• How do we compare two sets of chemical structures?
  – Sizes may be different, and very large
• Pairwise?
  – O(N^2) running time
  – Need to aggregate the resultant pairwise values
• Distributions?
  – Of what?
  – Can lead to multiple ways to generate a comparison
  – Data fusion?
[Figure: bit spectrum; normalized frequency (0 to 1) vs. bit position]
Bit Spectrum
• Vector summary of the fingerprints for a dataset
• Defined as the fraction of times a bit position is set to 1, for each bit position
0    0    1     ...
0    1    0     ...
1    1    1     ...
1    0    1     ...
...                   (~ 10K molecules)
0.5  0.5  0.75  ...   (bit spectrum)
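The definition above amounts to a column mean over a 0/1 matrix with one row per molecule. A minimal base R sketch using the four example rows from the slide (the package's own `bit.spectrum` works on fingerprint objects instead):

```r
# Rows are molecules, columns are bit positions
fps <- rbind(c(0, 0, 1),
             c(0, 1, 0),
             c(1, 1, 1),
             c(1, 0, 1))

# Fraction of molecules with each bit set
bit.spec <- colMeans(fps)
print(bit.spec)  # 0.50 0.50 0.75
```

The spectrum has one entry per bit position regardless of how many molecules went into it, which is what makes later comparisons independent of dataset size.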
Bit Spectrum
• Once the spectra are computed, comparing two datasets is an O(1) operation
  – Independent of dataset size
  – Simply take the difference of the two bit spectra
• e.g.: Compare ~ 800 solubles with > 30k insolubles
library(ggplot2)

## make two subsets and generate bit spectra
sol.idx <- which(sol$label == 'high')
insol.idx <- which(sol$label != 'high')
sol.bs <- bit.spectrum(fps[sol.idx])
insol.bs <- bit.spectrum(fps[insol.idx])

## display a difference plot
bsdiff <- sol.bs - insol.bs
d <- data.frame(x=1:length(sol.bs), y=bsdiff)
ggplot(d, aes(x=x,y=y))+geom_line()+
  xlab('Bit Position')+
  ylab('Normalized Frequency')+
  ylim(c(-1,1))
[Figure: difference in normalized frequency (-1 to 1) vs. bit position]
Explaining Poor Model Performance
• Training set for model
• Poor predic(ons on test set
• Both test set classes look like the toxic class in the training set
Guha & Schurer, J. Comp. Aided. Molec. Des., 2008, 22, 367
Summary
• Fingerprints are a useful representation for molecules
  – Fast, objective, compact
• But they are applicable to other domains and objects
  – Can be generated from arbitrary datasets (e.g. text) or objects (e.g. networks)
• Useful for various tasks
  – Search & comparison, prediction, clustering
• The fingerprint package provides a domain-agnostic way to handle binary fingerprints
Comparing Clusterings
• Generate multiple representations of a set of molecules
• How differently do these representations cluster?
  – Measure correlation of clusters using the cophenetic coefficient
• A variety of R packages to support this
  – dendextend, clValid
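The cophenetic comparison can be sketched with base R alone. The example below clusters the same 30 objects under two made-up binary representations and correlates the cophenetic distances of the two trees. The data are random and purely illustrative; `dist(..., method = "binary")` is used because it is the Tanimoto-style distance available in base R.

```r
set.seed(7)
n <- 30

# Two hypothetical binary representations of the same 30 objects
rep.a <- matrix(rbinom(n * 64, 1, 0.3), nrow = n)
rep.b <- rep.a
flip <- sample(length(rep.b), 200)
rep.b[flip] <- 1 - rep.b[flip]  # perturb to mimic a different fingerprint type

# Cluster each representation separately
hc.a <- hclust(dist(rep.a, method = "binary"))
hc.b <- hclust(dist(rep.b, method = "binary"))

# Correlation between the cophenetic distances of the two dendrograms
coph.cor <- cor(cophenetic(hc.a), cophenetic(hc.b))
print(coph.cor)
```

A value near 1 means the two representations group the objects in much the same way. Repeating this for every pair of representations gives a correlation matrix.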
Comparing Clusterings
[Figure: side-by-side dendrograms of the same molecules; panels: Pubchem 881, CDK Ext 1024]
Comparing Clusterings
Pairwise cophenetic correlations for clusterings generated using different fingerprints
                Pubchem  CDK Extended  CDK Graph      MACCS
Pubchem       1.0000000     0.7075479  0.6879805  0.5752923
CDK Extended  0.7075479     1.0000000  0.8050349  0.7386863
CDK Graph     0.6879805     0.8050349  1.0000000  0.7288428
MACCS         0.5752923     0.7386863  0.7288428  1.0000000