
Crowd Computing: All Your Base Are Belong to Us

David C. Thompson

What is about to happen

•  Some background on: – me – competition

•  Crowdsourced science through the ‘ages’

•  The data set

•  The Kaggle process

•  An overview of the competition

•  The models and implementation

•  What we have learnt

Behold! Let the science begin …

http://amzn.to/OyQMVf

about.me/dcthompson

My favourite papers from each period: [1] J. Chem. Phys. 122, 124107 (2005) [2] J. Chem. Phys. 128, 224103 (2008) [3] J. Chem. Inf. Model. 49, 1889 (2009) [4] J. Chem. Inf. Model. 51, 93 (2011)

A funny thing happened at my 1st external communications conference …


Or …

Or …

A heart-wrenching tale of man versus coffee machine …

How an external networking opportunity brought some ‘gamification’ to research

http://www.taviscurry.com/

Do real science, at home.

What happens when you search for ‘blindfolded archery'

I never make predictions. And I never will*

Lots of opportunity to translate problems, from all fields, into systems with gaming elements

•  Goal – What do you hope to achieve by playing the game?

•  Rules – The limitations on how you can achieve the goals

•  Feedback – How close are you to achieving your goal?

•  Voluntary participation – Everyone playing the game accepts the goals, the rules, and the feedback

* Paul Gascoigne

http://janemcgonigal.com/

http://fold.it/portal/

•  We wanted to investigate the utility of the process

•  We wanted to move with speed

•  We wanted to use a data set the scientific community had previously seen

•  We wanted to be inclusive – no domain expertise needed

What you should know about this exercise

“All models are wrong, but some models are useful”

– G. E. P. Box

Simulation and its discontents, Sherry Turkle, Cambridge, MA: MIT Press (2009)

Shameless slide reuse … *

* D. C. Thompson et al. Schrödinger Regional User Meeting, New York, NY 2009

“…the validity of any given model is of limited scope, as is the case with any mental construct that we have about what our molecules are doing, whether we used a software package or waved our hands around in the air.” – D. Lowe

The data set

•  Version 2 of the Hansen Ames mutagenicity data was used

•  The following protocol was observed:

http://doc.ml.tu-berlin.de/toxbenchmark/

J. Chem. Inf. Model. 49, 2077 (2009)

* D, B, Al, P, Ga, Si, Ge, Sn, As, Sb, Se, Te, At, He, Ne, Ar, Kr, Xe, Rn

What happened                          # of molecules (removed)
Download SMILES                        6512
Conversion with Corina                 6503 (9)
Remove non-zero formal charge          6419 (84)
Remove if more than 99 atoms           6414 (5)
Remove if contain undesirable atoms*   6252 (162)
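The filtering steps in the table can be sketched in a few lines. This is only an illustration under stated assumptions: the `Molecule` record is hypothetical (it stands in for whatever a cheminformatics toolkit would provide), "formal charge" is taken to mean the net charge of the molecule, and the slide does not say whether hydrogens count toward the 99-atom limit.

```python
from dataclasses import dataclass

# Element symbols listed on the slide as undesirable (D is deuterium).
UNDESIRABLE = {"D", "B", "Al", "P", "Ga", "Si", "Ge", "Sn", "As", "Sb",
               "Se", "Te", "At", "He", "Ne", "Ar", "Kr", "Xe", "Rn"}


@dataclass
class Molecule:
    """Hypothetical record; not a class from any real toolkit."""
    smiles: str
    symbols: list        # one element symbol per atom
    formal_charge: int   # assumed to be the net formal charge


def passes_filters(mol, max_atoms=99):
    """Apply the three removal rules from the table, in order."""
    if mol.formal_charge != 0:                 # remove non-zero formal charge
        return False
    if len(mol.symbols) > max_atoms:           # remove if more than 99 atoms
        return False
    if UNDESIRABLE.intersection(mol.symbols):  # remove undesirable atoms
        return False
    return True


def filter_molecules(molecules):
    """Return only the molecules that survive every rule."""
    return [m for m in molecules if passes_filters(m)]
```

Calling `filter_molecules` on a list of such records returns the survivors; counting the list before and after each rule would reproduce a "removed" column like the one in the table.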

Descriptor calculation

SD file, descriptor calculation

– 6252 x 5030

– Filter for low variance (≤ 0.01); removed 2537

– Remove for high correlation (> 0.90); removed 716

– Descriptor normalization resulted in 6252 x 1777 .csv file

Descriptor Engine                # of descriptors kept (calculated)
MOE 2D                           76 (186)
Atom Pair                        696 (1920)
MolConn-Z                        174 (745)
Pipeline Pilot Property Counts   5 (130)
Daylight fingerprints            825 (2048)
clogP                            0 (1)
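The descriptor filtering described above (drop low-variance columns, drop one of each highly correlated pair, then normalize) might look like the following minimal sketch. The slide gives only the cut-offs, so several details here are assumptions: population variance is used, correlated columns are removed greedily in column order, and normalization is taken to be a z-score.

```python
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def filter_descriptors(columns, var_cutoff=0.01, corr_cutoff=0.90):
    """columns maps descriptor name -> list of values (one per molecule)."""
    # Step 1: drop descriptors whose variance is at or below the cut-off.
    kept = {n: v for n, v in columns.items() if pvariance(v) > var_cutoff}
    # Step 2: greedily keep a descriptor only if it is not highly
    # correlated with any descriptor already selected (an assumption;
    # the slide does not say which member of a pair was removed).
    selected = []
    for name in kept:
        if all(abs(pearson(kept[name], kept[s])) <= corr_cutoff
               for s in selected):
            selected.append(name)
    return {n: kept[n] for n in selected}


def zscore(values):
    """Assumed normalization: zero mean, unit variance.

    Safe after filter_descriptors, which removes constant columns.
    """
    m, sd = mean(values), pvariance(values) ** 0.5
    return [(v - m) / sd for v in values]
```

Applying `zscore` to every column returned by `filter_descriptors` gives a normalized matrix analogous to the 6252 x 1777 file mentioned on the slide.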


J. Chem. Inf. Model. 49, 2077 (2009)

Testing Framework

“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy

•  Public Leaderboard: The split of the test set that competition participants see real-time feedback on over the course of the competition.

•  Private Leaderboard: The split of the test set that is used to determine the competition winners and estimate the generalization error. Participants do not see feedback on this during the competition.
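A public/private split of this kind can be illustrated with a short sketch. The function name, the 25% public fraction and the fixed seed are all assumptions made for the example; the slides do not state how the competition's test set was actually divided.

```python
import random


def split_leaderboard(test_ids, public_fraction=0.25, seed=0):
    """Randomly divide test-set IDs into public and private parts.

    public_fraction and seed are illustrative values only.
    """
    ids = list(test_ids)
    random.Random(seed).shuffle(ids)      # reproducible shuffle
    cut = int(len(ids) * public_fraction)
    return ids[:cut], ids[cut:]           # (public, private)


public_ids, private_ids = split_leaderboard(range(100))
```

Scores on `public_ids` would be shown during the competition, while scores on `private_ids` stay hidden until the end, so tuning against the public feedback cannot inflate the final ranking.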

Expectations

“Applicability Domains for Classification Problems: Benchmarking of Distance to Models for Ames Mutagenicity Set”

•  20 models generated with different algorithms and descriptors

•  Models have overall accuracies between 0.75 and 0.83 for the training set and between 0.76 and 0.82 for the test set

•  Inter-laboratory accuracy for the Ames test is reported at 85%

Expectation: Models should have similar accuracy to literature

Goal: Models should be balanced; sensitivity and specificity should be high

J. Chem. Inf. Model. 50, 2094 (2010)

http://www.kaggle.com/c/bioresponse

$$\text{log loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$
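The log loss metric can be computed directly from its definition. The sketch below assumes binary labels (0 or 1) and predicted probabilities; the clipping constant `eps` is an added assumption to avoid taking the logarithm of zero and is not part of the formula on the slide.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under y_pred."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

Predicting 0.5 for every molecule gives a log loss of ln 2 (about 0.693) whatever the labels are, which is a useful reference point when reading the benchmark rows of the leaderboard.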

Performance as a function of time

796 players 703 teams 8841 entries 55 forum topics, 409 posts

Final Ranking   Team Name                               Public Ranking   Δ (log loss)
1               Winter is Coming & Sergey               11               0
2               seelary                                 26               7E-05
3               bluehat                                 1                0.00051
4               jazz                                    15               0.0014
5               Wayne Zhang & Gxav & woshialex          19               0.00146
6               Indy Actuaries                          38               0.00184
7               bluemaster & imran                      7                0.00231
8               Efiimov & Bers & Cragin & vsu           4                0.00241
9               y_tag                                   18               0.0026
10              Killian O’Connor                        44               0.00285
11              PlanetThanet & SirGuessalot             40               0.00298
12              AussieTim                               48               0.00335
13              Jason Farmer                            31               0.00347
14              GreenPeace                              16               0.00356
15              mars                                    32               0.00388
16              Fuzzify                                 60               0.00392
17              Emanuele                                63               0.00395
18              HappyHour                               10               0.00431
19              Baltic                                  30               0.00465
20              dejavu                                  20               0.00482
352             Random Forest Benchmark                 373              0.04184
541             Support Vector Machine Benchmark        522              0.12147
647             Optimized Constant Value Benchmark      638              0.31414
650             Uniform Benchmark                       642              0.31959

https://github.com/emanuele/kaggle_pbr

https://github.com/benhamner/BioResponse

#FTW Strategies

•  Feature selec;on

•  RF + complementary approaches

•  Blending
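Blending, in its simplest form, means combining the predicted probabilities of several models into one prediction. The sketch below shows a weighted average; it is a generic illustration, not the procedure any of the winning teams used, and the function name and weighting scheme are assumptions.

```python
def blend(predictions, weights=None):
    """Weighted average of per-model probability lists.

    predictions: list of lists, one list of probabilities per model,
    all of equal length. weights defaults to equal weighting.
    """
    n_models = len(predictions)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    n_items = len(predictions[0])
    return [
        sum(w * model[i] for w, model in zip(weights, predictions)) / total
        for i in range(n_items)
    ]


# Example: a random forest and a complementary model, weighted 3:1.
blended = blend([[0.0, 1.0], [1.0, 0.0]], weights=[3, 1])
```

In practice the weights would be chosen on held-out data, for example by minimizing log loss, rather than fixed by hand as in this example.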

All three winning teams identified D27 as important. What is it? Organon toxicophore*

* J. Med. Chem. 49, 312 (2005)

“Predictive Modeling from a Kaggler’s Perspective” Jeremy Achin, Sergey Yergenson, Tom Degodoy

                 TP     FN     FP     TN
Winning Teams
  Team 1         873    165    151    687
  Team 2         888    150    165    673
  Team 3         893    145    162    676
Benchmarks
  RF             888    150    166    672
  SVM            822    216    215    673
Other
  Team 17        896    142    169    669
  D27            781    257    215    623

           Se      Sp      CCR
RF         0.86    0.80    0.83
SVM        0.79    0.74    0.77
Team 1     0.84    0.82    0.83
Team 2     0.86    0.80    0.83
Team 3     0.86    0.80    0.83
Team 17    0.86    0.80    0.83
D27        0.75    0.74    0.75

Se (sensitivity): TP/(TP+FN)
Sp (specificity): TN/(FP+TN)
CCR (correct classification rate): (Se + Sp)/2
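These three definitions translate directly into code. The sketch below applies them to the Team 1 confusion-matrix counts reported on the slide (TP 873, FN 165, FP 151, TN 687); the function name is an assumption made for the example.

```python
def classification_rates(tp, fn, fp, tn):
    """Return (sensitivity, specificity, correct classification rate)."""
    se = tp / (tp + fn)      # fraction of true positives recovered
    sp = tn / (fp + tn)      # fraction of true negatives recovered
    ccr = (se + sp) / 2      # balanced average of the two
    return se, sp, ccr


se, sp, ccr = classification_rates(tp=873, fn=165, fp=151, tn=687)
print(f"Se={se:.2f} Sp={sp:.2f} CCR={ccr:.2f}")  # Se=0.84 Sp=0.82 CCR=0.83
```

Because CCR averages the two rates rather than pooling the counts, it does not reward a model for favouring the majority class.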

Private Set Performance

Okay, where’s this ‘second’ web service?


BIpredict

•  Physicochemical properties are updated as the molecule is built

•  Atomistic descriptor values are appended directly to the molecule

* D. C. Thompson, Chemical Computing Group, User Group Meeting, Montreal, 2011

So, what did we learn?

•  Was this useful? –  Yes

•  Participation was high, contributors and contributions were diverse*

•  A large number of models were of a high quality

– Differences in top models in log loss metric are small

– Different statistical measures lead to different rankings

– RandomForest benchmark has high correct classification rate (CCR)

* Sort of

‘Machine learning that matters’

Kiri L. Wagstaff. Machine Learning that Matters. Proceedings of the Twenty-Ninth International Conference on Machine Learning (ICML), June 2012. (CL #12-2026)

Domain expertise

Machine learning skill

Know your meme

http://roflcon.org/

http://katemiltner.com/

Thanks to:

Lilly Ackley
Ben Hamner
Amy Kunkel
Mehul Patel
Alex Renner, PhD
All Kaggle participants – esp. Winter is Coming & Sergey
