Crowd Computing: All Your Base Are Belong to Us
David C. Thompson
What is about to happen
• Some background on:
  – me
  – competition
• Crowdsourced science through the ‘ages’
• The data set
• The Kaggle process
• An overview of the competition
• The models and implementation
• What we have learnt
Behold! Let the science begin …
http://amzn.to/OyQMVf
about.me/dcthompson
My favourite papers from each period: [1] J. Chem. Phys. 122, 124107 (2005) [2] J. Chem. Phys. 128, 224103 (2008) [3] J. Chem. Inf. Model. 49, 1889 (2009) [4] J. Chem. Inf. Model. 51, 93 (2011)
A funny thing happened at my 1st external communications conference …
Or …
Or …
A heart-wrenching tale of man versus coffee machine …
How an external networking opportunity brought some ‘gamification’ to research
http://www.taviscurry.com/
Do real science, at home.
What happens when you search for ‘blindfolded archery'
I never make predictions. And I never will*
Lots of opportunity to translate problems, from all fields, into systems with gaming elements
• Goal – What do you hope to achieve by playing the game?
• Rules – The limitations on how you can achieve the goals
• Feedback – How close are you to achieving your goal?
• Voluntary participation – Everyone playing the game accepts the goals, the rules, and the feedback
* Paul Gascoigne
http://janemcgonigal.com/
http://fold.it/portal/
• We wanted to investigate the utility of the process
• We wanted to move with speed
• We wanted to use a data set the scientific community had previously seen
• We wanted to be inclusive – no domain expertise needed
What you should know about this exercise
“All models are wrong, but some models are useful”
– G. E. P. Box
Simulation and Its Discontents, Sherry Turkle, Cambridge, MA: MIT Press (2009)
Shameless slide reuse … *
* D. C. Thompson et al. Schrödinger Regional User Meeting, New York, NY 2009
“…the validity of any given model is of limited scope, as is the case with any mental construct that we have about what our molecules are doing, whether we used a software package or waved our hands around in the air.” – D. Lowe
The data set
• Version 2 of the Hansen Ames mutagenicity data was used
• The following protocol was observed:
http://doc.ml.tu-berlin.de/toxbenchmark/
J. Chem. Inf. Model. 49, 2077 (2009)
* D, B, Al, P, Ga, Si, Ge, Sn, As, Sb, Se, Te, At, He, Ne, Ar, Kr, Xe, Rn
What happened                           # of molecules (removed)
Download SMILES                         6512
Conversion with Corina                  6503 (9)
Remove non-zero formal charge           6419 (84)
Remove if more than 99 atoms            6414 (5)
Remove if contain undesirable atoms*    6252 (162)
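The three removal steps in the protocol can be sketched in a few lines. This is only an illustration under stated assumptions: the `Molecule` record and the `apply_protocol` helper are hypothetical, "formal charge" is read as the net charge of the molecule, and a real pipeline would obtain these fields from a cheminformatics toolkit after the Corina conversion rather than from hand-built records.

```python
from dataclasses import dataclass

# Element symbols treated as undesirable, copied from the slide footnote.
UNDESIRABLE = {"D", "B", "Al", "P", "Ga", "Si", "Ge", "Sn", "As", "Sb",
               "Se", "Te", "At", "He", "Ne", "Ar", "Kr", "Xe", "Rn"}
MAX_ATOMS = 99


@dataclass
class Molecule:
    """Hypothetical pre-parsed record (not a toolkit object)."""
    name: str
    formal_charge: int  # assumed to be the net charge of the molecule
    elements: list      # one element symbol per atom


def apply_protocol(molecules):
    """Apply the filters in slide order; return survivors and per-step removals."""
    steps = [
        ("non-zero formal charge", lambda m: m.formal_charge == 0),
        ("more than 99 atoms", lambda m: len(m.elements) <= MAX_ATOMS),
        ("undesirable atoms", lambda m: not UNDESIRABLE.intersection(m.elements)),
    ]
    removed = {}
    for label, keep in steps:
        kept = [m for m in molecules if keep(m)]
        removed[label] = len(molecules) - len(kept)
        molecules = kept
    return molecules, removed
```

Because the filters run in sequence, each count reports only the molecules that survived the earlier steps, which matches the way the slide tallies removals.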
Descriptor calculation
• SD file, descriptor calculation
  – 6252 x 5030
  – Filter for low variance (≤ 0.01); removed 2537
  – Remove for high correlation (> 0.90); removed 716
  – Descriptor normalization resulted in 6252 x 1777 .csv file

Descriptor Engine                   # of descriptors retained (calculated)
MOE 2D                              76 (186)
Atom Pair                           696 (1920)
MolConn-Z                           174 (745)
Pipeline Pilot Property Counts      5 (130)
Daylight fingerprints               825 (2048)
clogP                               0 (1)
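The descriptor clean-up (variance filter, correlation filter, normalization) might look like the sketch below. Several details are assumptions, since the slides do not give them: correlation is taken as the absolute Pearson coefficient, correlated descriptors are pruned greedily in column order, and normalization is min-max scaling to [0, 1]. The function names are hypothetical.

```python
import math


def variance(col):
    """Population variance of a list of numbers."""
    mean = sum(col) / len(col)
    return sum((x - mean) ** 2 for x in col) / len(col)


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant columns."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)


def filter_descriptors(columns, var_cut=0.01, corr_cut=0.90):
    """columns maps descriptor name -> list of values, one per molecule."""
    # Step 1: drop descriptors whose variance is at or below the cut-off.
    varied = {n: c for n, c in columns.items() if variance(c) > var_cut}

    # Step 2 (assumed greedy rule): keep a descriptor only if it is not
    # highly correlated with any descriptor already kept.
    selected = {}
    for name, col in varied.items():
        if all(abs(pearson(col, other)) <= corr_cut
               for other in selected.values()):
            selected[name] = col

    # Step 3 (assumed min-max scaling): map each kept descriptor to [0, 1].
    scaled = {}
    for name, col in selected.items():
        lo, hi = min(col), max(col)
        scaled[name] = [(x - lo) / (hi - lo) for x in col]
    return scaled
```

Running the variance filter first guarantees that every surviving column is non-constant, so neither the correlation nor the scaling step can divide by zero.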
J. Chem. Inf. Model. 49, 2077 (2009)
Testing Framework
“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
• Public Leaderboard: The split of the test set that competition participants see real-time feedback on over the course of the competition.
• Private Leaderboard: The split of the test set that is used to determine the competition winners and estimate the generalization error. Participants do not see feedback on this during the competition.
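A public/private split of a held-out test set can be illustrated as follows. This is a generic sketch, not Kaggle's procedure: the `split_leaderboard` helper, the random assignment, and the `public_fraction` parameter are all assumptions, and the fraction used in this competition is not stated in the slides.

```python
import random


def split_leaderboard(test_ids, public_fraction, seed=0):
    """Randomly partition test-set ids into public and private subsets."""
    rng = random.Random(seed)  # fixed seed so the partition is reproducible
    shuffled = list(test_ids)
    rng.shuffle(shuffled)
    n_public = round(len(shuffled) * public_fraction)
    # Public ids drive the live leaderboard; private ids decide final ranking.
    return shuffled[:n_public], shuffled[n_public:]
```

The point of the design is that scores on the private subset are never shown during the competition, so they give a less optimistic estimate of generalization than the public scores participants can tune against.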
Expectations
“Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set”
• 20 models generated with different algorithms and descriptors
• Models have overall accuracies between 0.75 and 0.83 for the training set and 0.76 and 0.82 for the test set
• Inter-laboratory accuracy for Ames test reported at 85%
Expectation: Models should have similar accuracy to literature
Goal: Models should be balanced; sensitivity and specificity should be high
J. Chem. Inf. Model. 50, 2094 (2010)
http://www.kaggle.com/c/bioresponse
\[
\text{log loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \,\right]
\]
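For readers who want to reproduce the metric, a minimal log loss sketch follows. It assumes the natural logarithm, binary labels in {0, 1}, and predicted probabilities of the positive class; the `eps` clipping is a common safeguard against log(0) and is an assumption here, not something the slides specify.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0 or 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

Under these assumptions a constant prediction of 0.5 scores ln 2 (about 0.693) on any label set, and the metric penalizes confident wrong answers far more heavily than hesitant ones.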
Performance as a function of time
796 players 703 teams 8841 entries 55 forum topics, 409 posts
Final Ranking   Team Name                              Public Ranking   Δ (log loss)
1               Winter is Coming & Sergey              11               0
2               seelary                                26               7E-05
3               bluehat                                1                0.00051
4               jazz                                   15               0.0014
5               Wayne Zhang & Gxav & woshialex         19               0.00146
6               Indy Actuaries                         38               0.00184
7               bluemaster & imran                     7                0.00231
8               Efiimov & Bers & Cragin & vsu          4                0.00241
9               y_tag                                  18               0.0026
10              Killian O’Connor                       44               0.00285
11              PlanetThanet & SirGuessalot            40               0.00298
12              AussieTim                              48               0.00335
13              Jason Farmer                           31               0.00347
14              GreenPeace                             16               0.00356
15              mars                                   32               0.00388
16              Fuzzify                                60               0.00392
17              Emanuele                               63               0.00395
18              HappyHour                              10               0.00431
19              Baltic                                 30               0.00465
20              dejavu                                 20               0.00482
352             Random Forest Benchmark                373              0.04184
541             Support Vector Machine Benchmark       522              0.12147
647             Optimized Constant Value Benchmark     638              0.31414
650             Uniform Benchmark                      642              0.31959
https://github.com/emanuele/kaggle_pbr
https://github.com/benhamner/BioResponse
#FTW Strategies
• Feature selection
• RF + complementary approaches
• Blending
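As an illustration of the blending idea, the sketch below averages the predicted probabilities of several models, optionally with weights. It is a generic example, not the winning teams' method: the `blend` helper and its weighting scheme are assumptions, and the slides do not say how the winners combined their models.

```python
def blend(predictions, weights=None):
    """Weighted average of per-model probability lists (one list per model)."""
    if weights is None:
        weights = [1.0] * len(predictions)  # default: plain average
    total = sum(weights)
    # zip(*predictions) walks the models' outputs one test molecule at a time.
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)
    ]
```

For example, `blend([rf_probs, svm_probs], weights=[3, 1])` would lean three times as heavily on a hypothetical random forest output as on a hypothetical SVM output.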
All three winning teams identified D27 as important. What is it? Organon toxicophore*
* J. Med. Chem. 48, 312 (2005)
“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy
Winning Teams
           TP    FN    FP    TN    Se     Sp     CCR
Team 1     873   165   151   687   0.84   0.82   0.83
Team 2     888   150   165   673   0.86   0.80   0.83
Team 3     893   145   162   676   0.86   0.80   0.83

Benchmarks
           TP    FN    FP    TN    Se     Sp     CCR
RF         888   150   166   672   0.86   0.80   0.83
SVM        822   216   215   673   0.79   0.74   0.77

Other
           TP    FN    FP    TN    Se     Sp     CCR
Team 17    896   142   169   669   0.86   0.80   0.83
D27        781   257   215   623   0.75   0.74   0.75
Se: TP/(TP+FN) Sp: TN/(FP+TN) CCR: (Se + Sp)/2
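The three definitions translate directly into code. The sketch below is a hypothetical helper, and the usage example assumes the Team 1 counts are read from the slide as TP = 873, FN = 165, FP = 151, TN = 687.

```python
def classification_rates(tp, fn, fp, tn):
    """Return (sensitivity, specificity, correct classification rate)."""
    se = tp / (tp + fn)   # fraction of true positives recovered
    sp = tn / (fp + tn)   # fraction of true negatives recovered
    ccr = (se + sp) / 2   # balanced average of the two
    return se, sp, ccr


# Team 1 counts as read from the slide (assumed reading of the matrix).
se, sp, ccr = classification_rates(tp=873, fn=165, fp=151, tn=687)
print(round(se, 2), round(sp, 2), round(ccr, 2))  # prints 0.84 0.82 0.83
```

Because CCR averages the two rates rather than pooling the counts, it is not inflated when one class outnumbers the other, which fits the stated goal of balanced models.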
Private Set Performance
Okay, where’s this ‘second’ web service?
BIpredict
• Physicochemical properties are updated as the molecule is built
• Atomistic descriptor values are appended directly to the molecule
* D. C. Thompson Chemical Computing Group, User Group Meeting, Montreal, 2011
So, what did we learn?
• Was this useful? – Yes
• Participation was high, contributors and contributions were diverse*
• A large number of models were of a high quality
  – Differences in top models in log loss metric are small
  – Different statistical measures lead to different rankings
  – RandomForest benchmark has high correct classification rate (CCR)
* Sort of
‘Machine learning that matters’
Kiri L. Wagstaff. Machine Learning that Matters. Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), June 2012.
Domain expertise
Machine learning skill
Know your meme
http://roflcon.org/
http://katemiltner.com/
Thanks to:
Lilly Ackley
Ben Hamner
Amy Kunkel
Mehul Patel
Alex Renner, PhD
All Kaggle participants – esp. Winter is Coming & Sergey