Fingerprinting Chemical Structures

Rajarshi Guha
https://github.com/rajarshi/ctpa-fingerprints

September 9, 2014

High Throughput Screening

•  Test thousands to hundreds of thousands of compounds in one or more assays
   – Biochemical, genetic, pharmacological assays

•  Employs a robotic platform

•  Rapidly identify novel modulators of biological systems
   – Infectious agents
   – Cellular basis of diseases

Goal of HTS

•  Rapidly screen large compound collections

•  Efficiently identify real actives
   – Test them in slower, accurate, expensive screens

•  Use the data to learn what types of compounds tend to be active

•  Use the model to suggest more compounds to screen

[Figure: number of molecules at each screening stage (300K, 1000, 300), with stages labeled HTS and cherry picks]

HTS Data Types

•  Categorical
   – active/inactive or toxic/nontoxic

•  Continuous
   – Single point
   – Dose response

•  Multiple readouts
   – Might read at different wavelengths or timepoints
   – More complex when dealing with imaging

•  These (usually) represent the dependent variable

[Figure: two example plots of Response versus Concentration, one of them on a log10 concentration axis]

Independent Variable(s)

•  HTS tests the activity of a molecule
   – The molecule is our “independent variable”

•  Need to describe the molecular structure
   – Various discrete or real-valued descriptors
   – Surfaces (3D)
   – Binary fingerprints

Activity = f(Structure)

Fingerprint Representation

•  Lots of types of fingerprints

•  “Keyed” fingerprints indicate the presence or absence of a structural feature

•  Length can vary from 166 to 4096 bits or more

•  Fingerprints are usually compared using the Tanimoto metric

1 0 1 1 0 0 0 1 0
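As a minimal sketch of the Tanimoto comparison, the base R snippet below treats each fingerprint as the set of its "on" bit positions and divides the number of bits set in both by the number of bits set in either. The helper name `tanimoto` and the second fingerprint are illustrative assumptions; the first fingerprint is the 9-bit example shown above.

```r
# Tanimoto similarity between two fingerprints given as vectors of
# "on" bit positions (hypothetical helper, base R only).
tanimoto <- function(a, b) {
  shared <- length(intersect(a, b))   # bits set in both
  total  <- length(union(a, b))       # bits set in either
  if (total == 0) return(NA_real_)    # undefined when no bits are set
  shared / total
}

fp1 <- which(c(1, 0, 1, 1, 0, 0, 0, 1, 0) == 1)  # bits 1, 3, 4, 8
fp2 <- which(c(1, 0, 0, 1, 0, 1, 0, 1, 0) == 1)  # bits 1, 4, 6, 8 (made up)

sim <- tanimoto(fp1, fp2)
print(sim)
```

Here the two fingerprints share 3 bits out of the 5 that are set in at least one of them, so the similarity is 0.6.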

What Can I Use Them For?

•  Search
   – Given a potent active molecule, find similar ones (or dissimilar, but also potent)

•  Prediction
   – Given a set of active & inactive molecules, build a model to predict which members of a large collection will be active

•  Clustering
   – Given a set of molecules, do they cluster into structurally different groups?

Fingerprints in R

•  The fingerprint package supports I/O, manipulation, similarity methods, and various utility methods

•  A fingerprint is an S4 object
   – Create them manually

    new("fingerprint", nbit = 1024, bits = c(1,4,5,100,200))

   – Read them in from files

    fp.read('data/cdk.fp', size=1024, lf=cdk.lf)

Getting Fingerprints

•  You can also generate fingerprints from chemical structures using the rcdk package

•  If you’re not doing cheminformatics you can read in your own FP data by implementing a line reader
   – See cdk.lf, moe.lf, bci.lf
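A line reader is just a function that takes one line of text and returns the fingerprint name, the "on" bit positions, and any extra data. The sketch below assumes a made-up file format (a name, a tab, then comma-separated bit positions); the function name `my.lf` and the format are hypothetical, and the return shape mirrors the `list(title, bits, list())` used by the line reader later in these slides.

```r
# Hypothetical line format: "<name>\t<comma-separated on-bit positions>"
# Returns list(name, bits, misc), where misc holds any extra fields.
my.lf <- function(line) {
  toks <- strsplit(line, "\t", fixed = TRUE)[[1]]
  name <- toks[1]
  bits <- as.numeric(strsplit(toks[2], ",", fixed = TRUE)[[1]])
  list(name, bits, list())
}

parsed <- my.lf("doc42\t3,17,256")
str(parsed)
```

Such a function would then be passed as the `lf` argument of `fp.read`, in the same way `cdk.lf` is passed above.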

Random Fingerprints

•  Useful for benchmarking, generating null distributions, exploring effects of bit density

    library(fingerprint)

    ## How long does a similarity matrix calculation take as a function of fp length?
    nfp <- 300
    sizes <- c(64, 128, 512, 1024, 4096, 8192)
    times <- sapply(sizes, function(size) {
      fps <- lapply(1:nfp, function(i) random.fingerprint(size, size * 0.35))
      system.time(junk <- fp.sim.matrix(fps))[3]
    })

    ## For a given length, how does bit density affect calculation time?
    densities <- c(0.1, 0.25, 0.5, 0.75, 0.95)
    times <- sapply(densities, function(density) {
      fps <- lapply(1:nfp, function(i) random.fingerprint(1024, 1024 * density))
      system.time(junk <- fp.sim.matrix(fps))[3]
    })

[Figure: similarity matrix calculation time (s) versus fingerprint length, and versus bit density]

Compare Similarity Metrics

[Figure: density of pairwise similarity values by metric (Dice, Tanimoto)]

•  More than 20 similarity metrics
   – Some are written in C, so they are very fast and applicable to larger fingerprint collections
   – Others are in pure R and are slow

    fps <- fp.read('data/cdk.fp', size=881, lf=cdk.lf, header=TRUE)[1:500]
    s.tanimoto <- fp.sim.matrix(fps, method='tanimoto')
    s.dice <- fp.sim.matrix(fps, method='dice')
    d <- rbind(data.frame(method='Tanimoto', s=as.numeric(s.tanimoto)),
               data.frame(method='Dice', s=as.numeric(s.dice)))
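To see why the Dice and Tanimoto distributions differ in a systematic way, it may help to compute both by hand on a pair of binary vectors. The sketch below uses base R and random 0/1 vectors as stand-ins for fingerprints (the vector length and density are arbitrary assumptions) and checks the identity D = 2T / (1 + T), which follows from the standard definitions of the two coefficients.

```r
set.seed(1)
nbit <- 64
# Two random binary vectors standing in for fingerprints
x <- rbinom(nbit, 1, 0.35)
y <- rbinom(nbit, 1, 0.35)

both <- sum(x & y)   # bits set in both vectors
a <- sum(x)          # bits set in x
b <- sum(y)          # bits set in y

tanimoto <- both / (a + b - both)
dice     <- 2 * both / (a + b)

# Dice is a monotonic transform of Tanimoto: D = 2T / (1 + T)
print(c(tanimoto = tanimoto, dice = dice,
        from.T = 2 * tanimoto / (1 + tanimoto)))
```

Because the transform is monotonic, the two metrics rank pairs identically; Dice values are simply shifted toward 1.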

Predicting with Fingerprints

•  Read in fingerprints & convert to matrix form

•  See
   – data/solubility.csv
   – data/solubility.maccs

•  33,182 observations of solubility

•  57,857 fingerprints

•  Requires some data wrangling before modeling

    OOB estimate of error rate: 22.37%
    Confusion matrix:
           high  low medium class.error
    high    181   52    621  0.78805621
    low      35 5611   4598  0.45226474
    medium   89 2029  19965  0.09591088

[Figure: frequency of each solubility class (high, low, medium)]


•  The model will use MACCS keys
   – 166 bits
   – Each bit is associated with a structural feature

•  Low resolution, somewhat simplistic

•  Data comes in a non-standard format, so we must implement our own line reader

•  Classification problem
   – Predict low/medium/high solubility

Predicting with Fingerprints

    sol <- read.csv('data/solubility.csv', header=TRUE)
    fps <- fp.read('data/solubility.maccs', header=FALSE, size=166,
                   lf=function(line) {
                     toks <- strsplit(line, " ")[[1]]
                     title <- toks[1]
                     bits <- as.numeric(toks[2:length(toks)])
                     list(title, bits, list())
                   })

    ## Extract fingerprints for which we have a label
    common <- which( sapply(fps, function(x) x@name) %in% sol$sid )
    fps <- fps[common]

    ## Order the fingerprints & data
    sol <- sol[order(sol$sid),]
    fps <- fps[order(sapply(fps, function(x) as.integer(x@name)))]

    ## Make X matrix
    fpm <- fp.to.matrix(fps)

    ## Model!
    library(randomForest)
    m1 <- randomForest(x=fpm, y=as.factor(sol$label))
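The step that turns a list of fingerprints into a predictor matrix is conceptually simple: one row per fingerprint, one 0/1 column per bit position. The base R sketch below illustrates the idea on toy data; it is not the implementation behind `fp.to.matrix`, and the 8-bit fingerprints and their names are made up.

```r
# Fingerprints as lists of "on" bit positions (toy data, 8 bits each)
nbit <- 8
fps <- list(a = c(1, 4, 5), b = c(2, 4), c = c(1, 8))

# One row per fingerprint, one column per bit position
fpm <- t(sapply(fps, function(bits) {
  row <- integer(nbit)
  row[bits] <- 1L
  row
}))
colnames(fpm) <- paste0("bit", seq_len(nbit))
print(fpm)
```

Each column of such a matrix can then be treated as a binary feature by a model such as a random forest.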


•  We can then use the RF variable importance measure

•  Features important for predictive performance
   – Presence of aromatic rings
   – Presence of charged atoms
   – Presence of 6-membered rings
   – N & O atoms connected in a chain

•  Chemically sensible

[Figure: MACCS bits ranked by random forest variable importance (MeanDecreaseGini)]

MACCS key definitions: https://github.com/cdk/cdk/blob/master/descriptor/fingerprint/src/main/resources/org/openscience/cdk/fingerprint/data/maccs.txt

Clustering with Fingerprints

•  Generate a distance matrix directly from a list of fingerprints

    fps <- fp.read('data/cdk.fp', size=881, lf=cdk.lf)[1:500]
    sims <- fp.sim.matrix(fps)
    dmat <- as.dist(1-sims)
    clus <- hclust(dmat)
    par(mar=c(1,4,1,1))
    plot(clus, labels=FALSE, xlab='', main='')

[Figure: dendrogram from the hierarchical clustering; vertical axis: Height]

•  Exercise: How do clusters vary with similarity metric and/or fingerprint type?

Comparing Data Sets

•  How do we compare two sets of chemical structures? – Sizes may be different, and very large

•  Pairwise?
   – O(N^2) running time
   – Need to aggregate the resultant pairwise values

•  Distributions?
   – Of what?
   – Can lead to multiple ways to generate a comparison

– Data fusion?

Bit Spectrum

[Figure: example bit spectrum, normalized frequency versus bit position]

•  Vector summary of the fingerprints for a dataset

•  Defined as the fraction of times a bit position is set to 1, for each bit position

    0     0     1     ...
    0     1     0     ...
    1     1     1     ...
    1     0     1     ...
    ...   ...   ...           (~ 10K molecules)

    0.5   0.5   0.75  ...     (bit spectrum)
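The bit spectrum is the column mean of the 0/1 fingerprint matrix. The base R sketch below reproduces the four example rows above and then subtracts the spectrum of a second, made-up dataset, which is all a spectrum-based comparison of two datasets requires. The matrix `other` is an assumption for illustration only.

```r
# Rows are fingerprints, columns are bit positions (the four example rows)
fpm <- rbind(c(0, 0, 1),
             c(0, 1, 0),
             c(1, 1, 1),
             c(1, 0, 1))

# Bit spectrum: fraction of fingerprints with each bit set
spectrum <- colMeans(fpm)
print(spectrum)   # 0.50 0.50 0.75

# A second, made-up dataset to compare against
other <- rbind(c(1, 1, 0),
               c(1, 0, 0))

# Positive values: bit is more frequent in the first dataset
bsdiff <- spectrum - colMeans(other)
print(bsdiff)
```

The two datasets have different sizes (4 and 2 rows), but their spectra have the same length, so the difference is well defined.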

Bit Spectrum

•  Comparison of two datasets is now an O(1) operation with respect to the number of molecules
   – Independent of dataset size once the spectra are computed
   – Simply take the difference of the two bit spectra

•  e.g.: Compare ~ 800 solubles with > 30k insolubles

    ## make two subsets and generate bit spectra
    sol.idx <- which(sol$label == 'high')
    insol.idx <- which(sol$label != 'high')
    sol.bs <- bit.spectrum(fps[sol.idx])
    insol.bs <- bit.spectrum(fps[insol.idx])

    ## display a difference plot
    library(ggplot2)
    bsdiff <- sol.bs - insol.bs
    d <- data.frame(x=1:length(sol.bs), y=bsdiff)
    ggplot(d, aes(x=x,y=y)) + geom_line() +
      xlab('Bit Position') +
      ylab('Normalized Frequency') +
      ylim(c(-1,1))

[Figure: difference in normalized frequency (Δ Normalized Frequency) versus bit position]

Explaining Poor Model Performance

•  Training set for model

•  Poor predictions on test set

•  Both test set classes look like the toxic class in the training set

Guha & Schurer, J. Comp. Aided. Molec. Des., 2008, 22, 367

Summary

•  Fingerprints are a useful representation for molecules
   – Fast, objective, compact

•  But are applicable to other domains and objects
   – Can be generated from arbitrary datasets (e.g. text) or objects (e.g. networks)

•  Useful for various tasks
   – Search & comparison, prediction, clustering

•  The fingerprint package provides a domain-agnostic way to handle binary fingerprints

Comparing Clusterings

•  Generate multiple representations of a set of molecules

•  How differently do these representations cluster?
   – Measure correlation of clusters using the cophenetic coefficient

•  A variety of R packages support this
   – dendextend, clValid
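A cophenetic correlation between two clusterings can also be computed with base R alone. The sketch below is an illustration under stated assumptions: the two "representations" are random numeric matrices (the second a noisy copy of the first) rather than real fingerprints, and Euclidean distance stands in for a fingerprint-based distance.

```r
set.seed(42)
# Toy data: 20 objects described by two made-up representations
rep1 <- matrix(rnorm(20 * 5), nrow = 20)
rep2 <- rep1 + matrix(rnorm(20 * 5, sd = 0.5), nrow = 20)  # noisy copy

h1 <- hclust(dist(rep1))
h2 <- hclust(dist(rep2))

# Cophenetic distance: dendrogram height at which a pair is first merged
c1 <- cophenetic(h1)
c2 <- cophenetic(h2)

# Correlation between the two sets of cophenetic distances
r <- cor(as.vector(c1), as.vector(c2))
print(r)
```

With fingerprints, the distance matrices would instead come from `as.dist(1 - fp.sim.matrix(fps))` for each fingerprint type, as in the clustering example earlier.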


[Figure: dendrograms of the same molecules clustered using Pubchem (881-bit) and CDK Extended (1024-bit) fingerprints]

Pairwise cophenetic correlations for clusterings generated using different fingerprints

                    Pubchem  CDK Extended  CDK Graph      MACCS
    Pubchem       1.0000000     0.7075479  0.6879805  0.5752923
    CDK Extended  0.7075479     1.0000000  0.8050349  0.7386863
    CDK Graph     0.6879805     0.8050349  1.0000000  0.7288428
    MACCS         0.5752923     0.7386863  0.7288428  1.0000000
