Fingerprinting Chemical Structures

Rajarshi Guha, https://github.com/rajarshi/ctpa-fingerprints, September 9 2014


Page 1: Fingerprinting Chemical Structures

Fingerprinting Chemical Structures

Rajarshi Guha
https://github.com/rajarshi/ctpa-fingerprints

September 9 2014

Page 2: Fingerprinting Chemical Structures

High Throughput Screening

• Test thousands to hundreds of thousands of compounds in one or more assays
  – Biochemical, genetic, pharmacological assays
• Employs a robotic platform
• Rapidly identify novel modulators of biological systems
  – Infectious agents
  – Cellular basis of diseases

 

Page 3: Fingerprinting Chemical Structures

Goal of HTS

• Rapidly screen large compound collections
• Efficiently identify real actives
  – Test them in slower, accurate, expensive screens
• Use the data to learn what types of compounds tend to be active
• Use the model to suggest more compounds to screen

[Figure: Number of Molecules at successive screening stages (stage labels: HTS, Cherry Picks; values shown: 300K, 1000, 300)]

Page 4: Fingerprinting Chemical Structures

HTS Data Types

• Categorical: active/inactive or toxic/nontoxic
• Continuous
  – Single point
  – Dose response
• Multiple readouts
  – Might read at different wavelengths or timepoints
  – More complex when dealing with imaging
• These (usually) represent the dependent variable

[Figure: two plots of Response against Concentration, one of them on a log10 Concentration axis]

Page 5: Fingerprinting Chemical Structures

Independent Variable(s)

• HTS tests the activity of a molecule, so the molecule is our "independent variable"
• Need to describe the molecular structure
  – Various discrete or real-valued descriptors
  – Surfaces (3D)
  – Binary fingerprints

Activity = f(Structure)

Page 6: Fingerprinting Chemical Structures

Fingerprint Representation

• Lots of types of fingerprints
• "Keyed" fingerprints indicate the presence or absence of a structural feature
• Length can vary from 166 to 4096 bits or more
• Fingerprints are usually compared using the Tanimoto metric

1 0 1 1 0 0 0 1 0
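The Tanimoto comparison mentioned above can be sketched in a few lines of base R. This is an illustration only, not the fingerprint package's implementation: it assumes a fingerprint is stored as the positions of its set bits, and the second fingerprint (fp2) is made up for the example.

```r
# Tanimoto similarity for two fingerprints stored as vectors of "on" bit positions.
tanimoto <- function(a, b) {
  common <- length(intersect(a, b))   # bits set in both fingerprints
  total  <- length(union(a, b))       # bits set in at least one fingerprint
  if (total == 0) return(NA_real_)    # undefined when neither has any bit set
  common / total
}

# The 9-bit example above, 1 0 1 1 0 0 0 1 0, has bits 1, 3, 4 and 8 set.
fp1 <- which(c(1, 0, 1, 1, 0, 0, 0, 1, 0) == 1)
# A second, made-up fingerprint for comparison (bits 1, 4, 6 and 8 set).
fp2 <- which(c(1, 0, 0, 1, 0, 1, 0, 1, 0) == 1)

sim <- tanimoto(fp1, fp2)   # 3 shared bits out of 5 bits in the union
```

Identical fingerprints score 1 and fingerprints with no bits in common score 0.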

Page 7: Fingerprinting Chemical Structures

What Can I Use Them For?

• Search
  – Given a potent active molecule, find similar ones (or dissimilar, but also potent)
• Prediction
  – Given a set of active & inactive molecules, build a model to predict which members from a large collection will be active
• Clustering
  – Given a set of molecules, do they cluster into structurally different groups?
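As a rough sketch of the search use case, the base R below ranks a small library against a query by Tanimoto similarity. The library contents, the molecule names and the 0.7 cutoff are all hypothetical choices made for illustration.

```r
# Tanimoto similarity on vectors of "on" bit positions.
tanimoto <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Hypothetical library: each entry lists the positions of the bits that are set.
fp_library <- list(
  mol_a = c(1, 3, 4, 8),
  mol_b = c(1, 4, 6, 8),
  mol_c = c(2, 5, 7),
  mol_d = c(1, 3, 4, 8, 9)
)
query <- c(1, 3, 4, 8)   # fingerprint of the known active (made up)

# Similarity of every library member to the query, most similar first.
sims   <- sapply(fp_library, tanimoto, b = query)
ranked <- sort(sims, decreasing = TRUE)

# Keep everything above an arbitrary similarity cutoff.
hits <- names(ranked)[ranked >= 0.7]
```

Lowering the cutoff trades precision for recall; the right value depends on the fingerprint type and the dataset.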

Page 8: Fingerprinting Chemical Structures

Fingerprints in R

• The fingerprint package supports I/O, manipulation, similarity methods, and various utility methods
• A fingerprint is an S4 object
  – Create them manually

    new("fingerprint", nbit = 1024, bits = c(1,4,5,100,200))

  – Read them in from files

    fp.read('data/cdk.fp', size=1024, lf=cdk.lf)

Page 9: Fingerprinting Chemical Structures

Getting Fingerprints

• You can also generate fingerprints from chemical structures using the rcdk package
• If you're not doing cheminformatics you can read in your own FP data by implementing a line reader
  – See cdk.lf, moe.lf, bci.lf
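A line reader is just a function that turns one line of text into a fingerprint record. The sketch below follows the three-element return shape used by the custom reader for the solubility data later in this deck (title, bits, empty list); the file format it parses and the name my.lf are hypothetical.

```r
# Hypothetical format: "<name>,<bit positions separated by ;>", e.g. "mol1,1;4;5;100"
my.lf <- function(line) {
  toks <- strsplit(line, ",", fixed = TRUE)[[1]]
  name <- toks[1]
  bits <- as.numeric(strsplit(toks[2], ";", fixed = TRUE)[[1]])
  # Same shape as the reader on the solubility slide: title, bits, extra data
  list(name, bits, list())
}

parsed <- my.lf("mol1,1;4;5;100")
```

Following the pattern of the other readers, such a function would be passed to fp.read through its lf argument.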

Page 10: Fingerprinting Chemical Structures

Random Fingerprints

• Useful for benchmarking, generating null distributions, exploring effects of bit density

    library(fingerprint)

    ## How long does a similarity matrix calculation take as a function of fp length?
    nfp <- 300
    sizes <- c(64, 128, 512, 1024, 4096, 8192)
    times <- sapply(sizes, function(size) {
      fps <- lapply(1:nfp, function(i) random.fingerprint(size, size * 0.35))
      system.time(junk <- fp.sim.matrix(fps))[3]   # elapsed time in seconds
    })

    ## For a given length, how does bit density affect calculation time?
    densities <- c(0.1, 0.25, 0.5, 0.75, 0.95)
    times <- sapply(densities, function(density) {
      fps <- lapply(1:nfp, function(i) random.fingerprint(1024, 1024 * density))
      system.time(junk <- fp.sim.matrix(fps))[3]   # elapsed time in seconds
    })
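The null-distribution use can be sketched without the package. The base R below draws pairs of unrelated random fingerprints at a fixed bit density and collects their Tanimoto similarities; the seed, the number of pairs and the helper names are arbitrary choices for the example, and the closed-form value is only a rough approximation.

```r
set.seed(42)   # arbitrary seed, only for reproducibility

nbit    <- 1024
density <- 0.35
nset    <- round(nbit * density)   # number of bits switched on in each fingerprint

# A random fingerprint is represented here as a vector of "on" positions.
random_fp <- function() sort(sample.int(nbit, nset))
tanimoto  <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Null distribution: similarities between 1000 pairs of unrelated random fingerprints.
null_sims <- replicate(1000, tanimoto(random_fp(), random_fp()))

mean_sim <- mean(null_sims)
# Rough expectation when bits are set independently with probability `density`.
expected <- density / (2 - density)
```

A similarity observed between two real molecules can then be judged against quantile(null_sims, 0.95) or a similar threshold.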

Page 11: Fingerprinting Chemical Structures

Random Fingerprints

[Figure: two plots of Time (s), one against Fingerprint Length (0 to 8000) and one against Bit Density (0.25 to 0.75)]

Page 12: Fingerprinting Chemical Structures

Compare Similarity Metrics

• More than 20 similarity metrics
  – Some are written in C, so very fast, applicable to larger fingerprint collections
  – Others are in pure R, slow

    fps <- fp.read('data/cdk.fp', size=881, lf=cdk.lf, header=TRUE)[1:500]
    s.tanimoto <- fp.sim.matrix(fps, method='tanimoto')
    s.dice <- fp.sim.matrix(fps, method='dice')
    d <- rbind(data.frame(method='Tanimoto', s=as.numeric(s.tanimoto)),
               data.frame(method='Dice', s=as.numeric(s.dice)))

[Figure: density of Similarity values (0 to 1) for the Dice and Tanimoto metrics]
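To see how the two metrics in this comparison relate, here is a small base R sketch using the textbook definitions of Tanimoto and Dice on bit-position vectors. It assumes the package's 'tanimoto' and 'dice' methods follow these standard formulas, and the two fingerprints are made up.

```r
# Both functions assume each vector lists distinct "on" bit positions.
tanimoto <- function(a, b) {
  common <- length(intersect(a, b))
  common / (length(a) + length(b) - common)
}
dice <- function(a, b) {
  common <- length(intersect(a, b))
  2 * common / (length(a) + length(b))
}

fp1 <- c(1, 3, 4, 8)
fp2 <- c(1, 4, 6, 8)

t_sim <- tanimoto(fp1, fp2)   # 3 / (4 + 4 - 3)
d_sim <- dice(fp1, fp2)       # 2 * 3 / (4 + 4)

# With these definitions Dice is a fixed function of Tanimoto: D = 2T / (1 + T)
d_from_t <- 2 * t_sim / (1 + t_sim)
```

Because D = 2T / (1 + T) is never smaller than T on [0, 1], Dice values sit at or above the corresponding Tanimoto values, and the two metrics rank pairs identically.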

Page 13: Fingerprinting Chemical Structures

Predicting with Fingerprints

• Read in fingerprints & convert to matrix form
• See
  – data/solubility.csv
  – data/solubility.maccs
• 33,182 observations of solubility
• 57,857 fingerprints
• Requires some data wrangling before modeling

OOB estimate of error rate: 22.37%
Confusion matrix:
        high   low  medium  class.error
high     181    52     621   0.78805621
low       35  5611    4598   0.45226474
medium    89  2029   19965   0.09591088

[Figure: bar chart of Frequency (0 to 20000) by Solubility Class (high, low, medium)]
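The per-class and overall error rates in the random forest output can be recomputed from the confusion matrix alone. The sketch below types in the counts from this slide and assumes the usual randomForest layout, with observed classes in rows and predicted classes in columns.

```r
# Counts copied from the confusion matrix on this slide.
conf <- matrix(c(181,   52,   621,
                  35, 5611,  4598,
                  89, 2029, 19965),
               nrow = 3, byrow = TRUE,
               dimnames = list(observed  = c("high", "low", "medium"),
                               predicted = c("high", "low", "medium")))

# Per-class error: fraction of each observed class that was predicted wrongly.
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall error: fraction of all observations off the diagonal.
oob_error <- 1 - sum(diag(conf)) / sum(conf)
```

The overall rate of about 22% hides very uneven performance: the small "high" class is misclassified about 79% of the time, while the dominant "medium" class is wrong in under 10% of cases.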

Page 14: Fingerprinting Chemical Structures

Predicting with Fingerprints

• The model will use MACCS keys
  – 166 bits
  – Each bit is associated with a structural feature
• Low resolution, somewhat simplistic
• Data comes in a non-standard format, so we must implement our own line reader
• Classification problem: predict low/medium/high solubility

Page 15: Fingerprinting Chemical Structures

Predicting with Fingerprints

    library(fingerprint)

    sol <- read.csv('data/solubility.csv', header=TRUE)

    ## Custom line reader: first token is the title, remaining tokens are the bits
    fps <- fp.read('data/solubility.maccs', header=FALSE, size=166,
                   lf=function(line) {
                     toks <- strsplit(line, " ")[[1]]
                     title <- toks[1]
                     bits <- as.numeric(toks[2:length(toks)])
                     list(title, bits, list())
                   })

    ## Extract fingerprints for which we have a label
    common <- which( sapply(fps, function(x) x@name) %in% sol$sid )
    fps <- fps[common]

    ## Order the fingerprints & data so that rows line up by sid
    sol <- sol[order(sol$sid),]
    fps <- fps[order(sapply(fps, function(x) as.integer(x@name)))]

    ## Make X matrix
    fpm <- fp.to.matrix(fps)

    ## Model!
    library(randomForest)
    m1 <- randomForest(x=fpm, y=as.factor(sol$label))

Page 16: Fingerprinting Chemical Structures

Predicting with Fingerprints

• We can then use the RF variable importance measure
• Features important for predictive performance
  – Presence of aromatic rings
  – Presence of charged atoms
  – Presence of 6-membered rings
  – N & O atoms connected in a chain
• Chemically sensible

[Figure: variable importance plot of MACCS bits ranked by MeanDecreaseGini (axis 0 to 250)]

https://github.com/cdk/cdk/blob/master/descriptor/fingerprint/src/main/resources/org/openscience/cdk/fingerprint/data/maccs.txt

Page 17: Fingerprinting Chemical Structures

Clustering with Fingerprints

• Generate a distance matrix directly from a list of fingerprints

    fps <- fp.read('data/cdk.fp', size=881, lf=cdk.lf)[1:500]
    sims <- fp.sim.matrix(fps)
    dmat <- as.dist(1-sims)     # turn similarities into distances
    clus <- hclust(dmat)
    par(mar=c(1,4,1,1))
    plot(clus, labels=FALSE, xlab='', main='')

[Figure: dendrogram of the 500 fingerprints, Height axis from 0.0 to 0.8]

• Exercise: How do clusters vary with similarity metric and/or fingerprint type?
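The same similarity to distance to hclust pipeline can be tried without any data files. The base R sketch below uses a made-up 6 by 8 bit matrix with two obvious structural groups and computes Tanimoto similarities with matrix algebra; the choice of k = 2 in cutree is an assumption for this toy data.

```r
# Toy fingerprints (made up): one row per molecule, one column per bit.
fpm <- rbind(
  m1 = c(1, 1, 1, 0, 0, 0, 0, 0),
  m2 = c(1, 1, 0, 1, 0, 0, 0, 0),
  m3 = c(1, 1, 1, 1, 0, 0, 0, 0),
  m4 = c(0, 0, 0, 0, 1, 1, 1, 0),
  m5 = c(0, 0, 0, 0, 1, 1, 0, 1),
  m6 = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Pairwise Tanimoto: shared bits / bits set in either fingerprint.
common <- fpm %*% t(fpm)      # counts of shared "on" bits for every pair
nbits  <- rowSums(fpm)        # number of "on" bits per molecule
sims   <- common / (outer(nbits, nbits, "+") - common)

# Same steps as on the slide: similarity -> distance -> hierarchical clustering.
clus   <- hclust(as.dist(1 - sims))
groups <- cutree(clus, k = 2)   # cut the tree into two clusters
```

Swapping the similarity formula or the bit matrix and comparing the resulting groups is one way to start on the exercise.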

Page 18: Fingerprinting Chemical Structures

Comparing Data Sets

• How do we compare two sets of chemical structures?
  – Sizes may be different, and very large
• Pairwise?
  – O(N^2) running time
  – Need to aggregate the resultant pairwise values

Page 19: Fingerprinting Chemical Structures

Comparing Data Sets

• How do we compare two sets of chemical structures?
  – Sizes may be different, and very large
• Distributions?
  – Of what?
  – Can lead to multiple ways to generate a comparison
  – Data fusion?

Page 20: Fingerprinting Chemical Structures

Bit Spectrum

• Vector summary of the fingerprints for a dataset
• Defined as the fraction of times a bit position is set to 1, for each bit position

[Figure: bit spectrum, Normalized Frequency (0.00 to 1.00) against Bit Position (0 to 750)]

[Illustration: fingerprints of ~ 10K molecules stacked as rows, with the bit spectrum as the last row]

0 0 1 ...
0 1 0 ...
1 1 1 ...
1 0 1 ...
0.5 0.5 0.75 ...
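With fingerprints stored as rows of a 0/1 matrix, the bit spectrum is simply the column mean. The base R sketch below reproduces the small example above; the second dataset (other) is made up to show that spectra from datasets of different sizes are directly comparable. This is an illustration, not the package's bit.spectrum function.

```r
# The four 3-bit fingerprints from the illustration above, one per row.
fpm <- rbind(c(0, 0, 1),
             c(0, 1, 0),
             c(1, 1, 1),
             c(1, 0, 1))

# Fraction of fingerprints in which each bit position is set.
spectrum <- colMeans(fpm)

# A second, made-up dataset of a different size over the same three bits.
other <- rbind(c(1, 0, 1),
               c(1, 0, 0))
other_spectrum <- colMeans(other)

# Comparing the datasets only touches the two length-3 spectra.
spec_diff <- spectrum - other_spectrum
```

Positive entries in spec_diff mark bits that are more common in the first dataset, negative entries bits that are more common in the second.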

Page 21: Fingerprinting Chemical Structures

Bit Spectrum

• Comparing two datasets is now an O(1) operation in the number of molecules: once the spectra are computed, the cost is independent of dataset size
  – Simply take the difference of the two bit spectra
• e.g.: Compare ~ 800 solubles with > 30k insolubles

    library(ggplot2)

    ## make two subsets and generate bit spectra
    sol.idx <- which(sol$label == 'high')
    insol.idx <- which(sol$label != 'high')
    sol.bs <- bit.spectrum(fps[sol.idx])
    insol.bs <- bit.spectrum(fps[insol.idx])

    ## display a difference plot
    bsdiff <- sol.bs - insol.bs
    d <- data.frame(x=1:length(sol.bs), y=bsdiff)
    ggplot(d, aes(x=x,y=y))+geom_line()+
      xlab('Bit Position')+
      ylab('Normalized Frequency')+
      ylim(c(-1,1))

[Figure: Δ Normalized Frequency (-1.0 to 1.0) against Bit Position (0 to 150)]

Page 22: Fingerprinting Chemical Structures

Explaining Poor Model Performance

• Training set for model
• Poor predictions on test set
• Both test set classes look like the toxic class in the training set

Guha & Schurer, J. Comp. Aided. Molec. Des., 2008, 22, 367

Page 23: Fingerprinting Chemical Structures

Summary

• Fingerprints are a useful representation for molecules: fast, objective, compact
• But are applicable to other domains and objects
  – Can be generated from arbitrary datasets (e.g. text) or objects (e.g. networks)
• Useful for various tasks: search & comparison, prediction, clustering
• The fingerprint package provides a domain agnostic way to handle binary fingerprints

Page 24: Fingerprinting Chemical Structures
Page 25: Fingerprinting Chemical Structures

Comparing Clusterings

• Generate multiple representations of a set of molecules
• How differently do these representations cluster?
  – Measure correlation of clusters using cophenetic coefficient
• A variety of R packages to support this
  – dendextend, clValid
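A cophenetic comparison can be sketched with base R alone, using hclust and cophenetic from the stats package rather than the packages named above. The two bit matrices are random stand-ins for two fingerprint types computed on the same 20 objects, so the seed, sizes and densities are all arbitrary.

```r
set.seed(1)   # arbitrary seed, only for reproducibility

# Two made-up binary representations of the same 20 objects.
rep1 <- matrix(rbinom(20 * 32, 1, 0.4), nrow = 20)
rep2 <- matrix(rbinom(20 * 64, 1, 0.4), nrow = 20)

# Cluster each representation; dist(method = "binary") is 1 - Tanimoto on the set bits.
hc1 <- hclust(dist(rep1, method = "binary"))
hc2 <- hclust(dist(rep2, method = "binary"))

# Cophenetic distance: the dendrogram height at which two objects first join.
# Correlating these distances between two trees measures how similarly they cluster.
coph_self  <- cor(cophenetic(hc1), cophenetic(hc1))   # a tree against itself
coph_cross <- cor(cophenetic(hc1), cophenetic(hc2))   # the two representations
```

Since the two random representations are unrelated, coph_cross should usually be far below 1; real fingerprint types describing the same molecules would be expected to agree more closely.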

Page 26: Fingerprinting Chemical Structures

Comparing Clusterings

[Figure: two dendrograms of the same molecules, one built from Pubchem 881 fingerprints and one from CDK Ext 1024 fingerprints; height axes from 0.0 to 0.8]

Page 27: Fingerprinting Chemical Structures

Comparing Clusterings

Pairwise cophenetic correlations for clusterings generated using different fingerprints

              Pubchem    CDK Extended  CDK Graph  MACCS
Pubchem       1.0000000  0.7075479     0.6879805  0.5752923
CDK Extended  0.7075479  1.0000000     0.8050349  0.7386863
CDK Graph     0.6879805  0.8050349     1.0000000  0.7288428
MACCS         0.5752923  0.7386863     0.7288428  1.0000000