crowd computing: all your base are belong to us


Upload: david-thompson

Post on 30-Nov-2014


Category:

Technology



DESCRIPTION

Results from the Boehringer Ingelheim Pharmaceuticals, Inc. 'Predicting a biological response' Kaggle competition. The central thesis is that many problems can be framed with gaming elements, which lowers the barrier to participation and increases engagement. Presented at the Bio-IT Cloud Summit, Data-focused Cloud Applications session, Sept. 12-13, Hotel Kabuki, San Francisco, CA.

TRANSCRIPT

Page 1: Crowd computing: All your base are belong to us
Page 2: Crowd computing: All your base are belong to us

Crowd Computing: All Your Base Are Belong to Us

David  C.  Thompson  

Page 3: Crowd computing: All your base are belong to us

What  is  about  to  happen  

• Some background on:
  – me
  – competition
• Crowdsourced science through the ‘ages’
• The data set
• The Kaggle process
• An overview of the competition
• The models and implementation
• What we have learnt

Page 4: Crowd computing: All your base are belong to us

Behold!  Let  the  science  begin  …  

Page 5: Crowd computing: All your base are belong to us

http://amzn.to/OyQMVf

Page 6: Crowd computing: All your base are belong to us

about.me/dcthompson  

My favourite papers from each period:
[1] J. Chem. Phys. 122, 124107 (2005)
[2] J. Chem. Phys. 128, 224103 (2008)
[3] J. Chem. Inf. Model. 49, 1889 (2009)
[4] J. Chem. Inf. Model. 51, 93 (2011)

Page 7: Crowd computing: All your base are belong to us

A funny thing happened at my 1st external communications conference …

Or …

Or …

A heart-wrenching tale of man versus coffee machine …

How an external networking opportunity brought some ‘gamification’ to research

Page 8: Crowd computing: All your base are belong to us

http://www.taviscurry.com/

Page 9: Crowd computing: All your base are belong to us

Do  real  science,  at  home.  

Page 10: Crowd computing: All your base are belong to us

What happens when you search for ‘blindfolded archery’

Page 11: Crowd computing: All your base are belong to us

I never make predictions. And I never will*

Lots of opportunity to translate problems, from all fields, into systems with gaming elements

• Goal – What do you hope to achieve by playing the game?
• Rules – The limitations on how you can achieve the goals
• Feedback – How close are you to achieving your goal?
• Voluntary participation – Everyone playing the game accepts the goals, the rules, and the feedback

* Paul Gascoigne
http://janemcgonigal.com/

Page 12: Crowd computing: All your base are belong to us
Page 13: Crowd computing: All your base are belong to us

http://fold.it/portal/

Page 14: Crowd computing: All your base are belong to us
Page 15: Crowd computing: All your base are belong to us

What you should know about this exercise

• We wanted to investigate the utility of the process
• We wanted to move with speed
• We wanted to use a data set the scientific community had previously seen
• We wanted to be inclusive – no domain expertise needed

Page 16: Crowd computing: All your base are belong to us

 “All  models  are  wrong,  but  some  models  are  useful”

 –  G.  E.  P.  Box  

Simulation and its discontents, Sherry Turkle, Cambridge, MA: MIT Press (2009)

Shameless slide reuse … *

* D. C. Thompson et al. Schrödinger Regional User Meeting, New York, NY 2009

“…the validity of any given model is of limited scope, as is the case with any mental construct that we have about what our molecules are doing, whether we used a software package or waved our hands around in the air.” – D. Lowe

 

Page 17: Crowd computing: All your base are belong to us

The data set

• Version 2 of the Hansen Ames mutagenicity data was used
• The following protocol was observed:

What happened                            # of molecules (removed)
Download SMILES                          6512
Conversion with Corina                   6503 (9)
Remove non-zero formal charge            6419 (84)
Remove if more than 99 atoms             6414 (5)
Remove if it contains undesirable atoms* 6252 (162)

* D, B, Al, P, Ga, Si, Ge, Sn, As, Sb, Se, Te, At, He, Ne, Ar, Kr, Xe, Rn

http://doc.ml.tu-berlin.de/toxbenchmark/
J. Chem. Inf. Model. 49, 2077 (2009)
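The three removal rules in the protocol above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Molecule` record, its fields, and the three example molecules are hypothetical (the talk used Corina on SMILES input, which is not reproduced here), and "non-zero formal charge" is read as "any atom carries a formal charge".

```python
from dataclasses import dataclass

# Elements excluded by the protocol's footnote.
UNDESIRABLE = {"D", "B", "Al", "P", "Ga", "Si", "Ge", "Sn", "As", "Sb",
               "Se", "Te", "At", "He", "Ne", "Ar", "Kr", "Xe", "Rn"}
MAX_ATOMS = 99


@dataclass
class Molecule:
    """Hypothetical pre-parsed record; not the file format used in the talk."""
    name: str
    elements: list        # one element symbol per atom
    formal_charges: list  # one formal charge per atom


def passes_filters(mol):
    """Apply the three removal rules in the order listed in the protocol."""
    if any(charge != 0 for charge in mol.formal_charges):
        return False  # assumption: any charged atom disqualifies the molecule
    if len(mol.elements) > MAX_ATOMS:
        return False  # more than 99 atoms
    if UNDESIRABLE.intersection(mol.elements):
        return False  # contains at least one undesirable element
    return True


# Illustrative molecules only; they are not taken from the data set.
molecules = [
    Molecule("benzene", ["C"] * 6 + ["H"] * 6, [0] * 12),
    Molecule("ammonium", ["N"] + ["H"] * 4, [1, 0, 0, 0, 0]),
    Molecule("phosphine", ["P"] + ["H"] * 3, [0] * 4),
]
kept = [m.name for m in molecules if passes_filters(m)]
print(kept)
```

Only the neutral, small, all-organic example survives; the charged and phosphorus-containing examples are dropped.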

Page 18: Crowd computing: All your base are belong to us

Descriptor calculation

SD file, descriptor calculation – 6252 x 5030
– Filter for low variance (≤ 0.01); removed 2537
– Remove for high correlation (> 0.90); removed 716
– Descriptor normalization resulted in 6252 x 1777 .csv file

Descriptor Engine                # of descriptors retained (calculated)
MOE 2D                           76 (186)
Atom Pair                        696 (1920)
MolConn-Z                        174 (745)
Pipeline Pilot Property Counts   5 (130)
Daylight fingerprints            825 (2048)
clogP                            0 (1)

[Figure: plot not recoverable from the transcript; axis ticks ran 0–1400 and 50–1200]

J. Chem. Inf. Model. 49, 2077 (2009)
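A minimal sketch of the two descriptor filters, using the thresholds quoted above (variance ≤ 0.01, correlation > 0.90). Several details are assumptions, since the slides do not give them: population variance, absolute Pearson correlation, and a greedy first-come rule for deciding which member of a correlated pair to keep. The function names and the toy data are hypothetical.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def filter_descriptors(columns, min_var=0.01, max_corr=0.90):
    """columns maps descriptor name -> list of values, one per molecule."""
    # Step 1: drop descriptors whose variance is at or below the threshold.
    varied = {k: v for k, v in columns.items() if pvariance(v) > min_var}
    # Step 2: keep a descriptor only if it is not highly correlated
    # with any descriptor that has already been kept (greedy, assumed).
    kept = {}
    for name, values in varied.items():
        if all(abs(pearson(values, other)) <= max_corr
               for other in kept.values()):
            kept[name] = values
    return kept


# Toy data: "b" is almost a multiple of "a"; "c" is constant.
toy = {
    "a": [0.0, 1.0, 2.0, 3.0],
    "b": [0.0, 2.0, 4.0, 6.1],
    "c": [1.0, 1.0, 1.0, 1.0],
    "d": [3.0, 0.0, 2.0, 1.0],
}
kept_names = list(filter_descriptors(toy))
print(kept_names)
```

On the toy data, "c" falls to the variance filter and "b" to the correlation filter, leaving "a" and "d".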

Page 19: Crowd computing: All your base are belong to us

Testing Framework

“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy

• Public Leaderboard: The split of the test set that competition participants see real-time feedback on over the course of the competition.

• Private Leaderboard: The split of the test set that is used to determine the competition winners and estimate the generalization error. Participants do not see feedback on this during the competition.
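The public/private split described above can be sketched as a seeded random partition of the test-set rows. The 25% public fraction, the seed, and the function name are illustrative assumptions; the slides do not state how this competition's split was made.

```python
import random


def split_leaderboard(test_ids, public_fraction=0.25, seed=0):
    """Randomly partition test-set ids into public and private subsets.

    public_fraction and seed are hypothetical values for illustration.
    """
    ids = list(test_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    n_public = int(len(ids) * public_fraction)
    return ids[:n_public], ids[n_public:]


public_ids, private_ids = split_leaderboard(range(100))
print(len(public_ids), len(private_ids))
```

Submissions would be scored on `public_ids` during the competition and on `private_ids` once it closes, so the final ranking cannot be tuned against the visible feedback.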

Page 20: Crowd computing: All your base are belong to us

Expectations

“Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set”

• 20 models generated with different algorithms and descriptors
• Models have overall accuracies between 0.75 and 0.83 for the training set and 0.76 and 0.82 for the test set
• Inter-laboratory accuracy for Ames test reported at 85%

Expectation: Models should have similar accuracy to literature

Goal: Models should be balanced; sensitivity and specificity should be high

J. Chem. Inf. Model. 50, 2094 (2010)

Page 21: Crowd computing: All your base are belong to us

http://www.kaggle.com/c/bioresponse

Page 22: Crowd computing: All your base are belong to us
Page 23: Crowd computing: All your base are belong to us

\[ \text{log loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\right] \]

Performance as a function of time

796 players
703 teams
8841 entries
55 forum topics, 409 posts
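For reference, the log loss metric can be computed as in the sketch below. The natural logarithm and the clipping constant `eps` are assumptions (clipping is a common guard against log(0), not something stated in the slides), and the example labels and predictions are made up.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels (0 or 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0 (assumed)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


labels = [1, 0, 1, 1]
uniform = log_loss(labels, [0.5, 0.5, 0.5, 0.5])    # always predict 0.5
confident = log_loss(labels, [0.9, 0.1, 0.8, 0.7])  # mostly right, confident
print(uniform, confident)
```

Predicting 0.5 everywhere scores ln 2 whatever the labels are; better-calibrated predictions score lower.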

Page 24: Crowd computing: All your base are belong to us

Final Ranking  Team Name                           Public Ranking  Δ (log loss)
1              Winter is Coming & Sergey           11              0
2              seelary                             26              7E-05
3              bluehat                             1               0.00051
4              jazz                                15              0.0014
5              Wayne Zhang & Gxav & woshialex      19              0.00146
6              Indy Actuaries                      38              0.00184
7              bluemaster & imran                  7               0.00231
8              Efiimov & Bers & Cragin & vsu       4               0.00241
9              y_tag                               18              0.0026
10             Killian O’Connor                    44              0.00285
11             PlanetThanet & SirGuessalot         40              0.00298
12             AussieTim                           48              0.00335
13             Jason Farmer                        31              0.00347
14             GreenPeace                          16              0.00356
15             mars                                32              0.00388
16             Fuzzify                             60              0.00392
17             Emanuele                            63              0.00395
18             HappyHour                           10              0.00431
19             Baltic                              30              0.00465
20             dejavu                              20              0.00482
352            Random Forest Benchmark             373             0.04184
541            Support Vector Machine Benchmark    522             0.12147
647            Optimized Constant Value Benchmark  638             0.31414
650            Uniform Benchmark                   642             0.31959

https://github.com/emanuele/kaggle_pbr

https://github.com/benhamner/BioResponse

Page 25: Crowd computing: All your base are belong to us

#FTW Strategies

• Feature selection
• RF + complementary approaches
• Blending

All three winning teams identified D27 as important. What is it? Organon toxicophore*

* J. Med. Chem. 48, 312 (2005)
“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
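Blending, the third strategy listed above, can be as simple as a weighted average of the probabilities predicted by several models. The sketch below shows that simplest form only. It is not the winning teams' actual procedure, which the slides do not detail, and the `rf` and `svm` values are made up.

```python
def blend(predictions, weights=None):
    """Weighted average of per-model probability lists.

    predictions: list of lists, one inner list of probabilities per model.
    weights: optional per-model weights; equal weights if omitted.
    """
    n_models = len(predictions)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)  # one row per molecule
    ]


# Hypothetical outputs from a random forest and a complementary model.
rf = [0.9, 0.2, 0.6]
svm = [0.7, 0.4, 0.2]
blended = blend([rf, svm])
print(blended)
```

Blend weights are usually tuned on held-out data. Averaging tends to help most when the models make different kinds of errors, which is one reading of "RF + complementary approaches".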

Page 26: Crowd computing: All your base are belong to us

Private Set Performance

Group          Model    TP   FN   FP   TN   Se    Sp    CCR
Winning Teams  Team 1   873  165  151  687  0.84  0.82  0.83
Winning Teams  Team 2   888  150  165  673  0.86  0.80  0.83
Winning Teams  Team 3   893  145  162  676  0.86  0.80  0.83
Benchmarks     RF       888  150  166  672  0.86  0.80  0.83
Benchmarks     SVM      822  216  215  673  0.79  0.74  0.77
Other          Team 17  896  142  169  669  0.86  0.80  0.83
Other          D27      781  257  215  623  0.75  0.74  0.75

Se: TP/(TP+FN)
Sp: TN/(FP+TN)
CCR: (Se + Sp)/2
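The Se, Sp and CCR definitions can be checked directly against the confusion-matrix counts. The sketch below uses the Team 1 counts from this slide; the function name is hypothetical.

```python
def classification_rates(tp, fn, fp, tn):
    """Return sensitivity, specificity and correct classification rate."""
    se = tp / (tp + fn)   # Se: TP/(TP+FN)
    sp = tn / (fp + tn)   # Sp: TN/(FP+TN)
    ccr = (se + sp) / 2   # CCR: (Se + Sp)/2
    return se, sp, ccr


# Team 1 counts on the private set.
se, sp, ccr = classification_rates(tp=873, fn=165, fp=151, tn=687)
print(f"{se:.2f} {sp:.2f} {ccr:.2f}")  # prints "0.84 0.82 0.83"
```

CCR weights the two classes equally, which makes it a more informative summary than raw accuracy when balanced models are the goal.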

Page 27: Crowd computing: All your base are belong to us

Okay, where’s this ‘second’ web service?

BIpredict
• Physicochemical properties are updated as molecule is built
• Atomistic descriptor values are appended directly to the molecule

* D. C. Thompson, Chemical Computing Group User Group Meeting, Montreal, 2011

Page 28: Crowd computing: All your base are belong to us

So, what did we learn?

• Was this useful? – Yes
• Participation was high, contributors and contributions were diverse*
• A large number of models were of a high quality
  – Differences in top models in log loss metric are small
  – Different statistical measures lead to different rankings
  – RandomForest benchmark has high correct classification rate (CCR)

* Sort of

Page 29: Crowd computing: All your base are belong to us

‘Machine learning that matters’

Kiri L. Wagstaff. Machine Learning that Matters. Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), June 2012. (CL #12-2026)

Domain expertise   Machine learning skill

Page 30: Crowd computing: All your base are belong to us

Know  your  meme  

http://roflcon.org/
http://katemiltner.com/

Page 31: Crowd computing: All your base are belong to us

Thanks to:
Lilly Ackley
Ben Hamner
Amy Kunkel
Mehul Patel
Alex Renner, PhD
All Kaggle participants – esp. Winter is Coming & Sergey