open source pharma: crowd computing: a new approach to predictive modeling

Predictive in silico models
Crowd computing: A new approach to predictive modeling
Jörg Bentzien
Open-Source Pharma, Bellagio, Italy, 7/16/2014 – 7/18/2014


Introduction

• Ph.D. in Chemistry, Univ. Münster, Germany, Prof Martin Klessinger: Photochemical [2+2] cycloaddition reactions
  Bentzien, Klessinger J Org Chem (1994), 59, 4887 - 4894
• Post-doctoral studies at USC, Los Angeles, CA, Nobel Laureate Prof Arieh Warshel: Enzymatic reactions
  Bentzien et al. J Phys Chem B (1998), 102, 2293 - 2301
• Xencor, Monrovia, CA: Protein design
  Hayes et al. PNAS (2002), 99, 15926 - 15931
• Boehringer Ingelheim Pharmaceuticals, Ridgefield, CT: Computational Chemist, Small Molecule Drug Design, ADMET Modeling
• Crowdsourcing with Kaggle, 2012
  Bentzien et al. Drug Discovery Today (2013), 18, 472 - 478

Why are we building predictive in silico Models?

We cannot make and test every compound.
• Reduce drug failure rates, de-risk compounds
• Select and prioritize compounds before synthesis

Predictive in silico models could help to achieve this task.

Lack of efficacy and safety/toxicity are the main reasons why drugs fail in the clinic. Toxicity is the main reason for attrition in early drug development.

Reasons for attrition in clinical trials:
[Chart: reasons for attrition; labelled categories include Efficacy and Safety]
Arrowsmith, Nat. Rev. Drug Disc. 2013, 12, 569

Principle of in silico modeling

[Figure: workflow from chemical structure to predicted observable]

• Code the chemical structure in machine-readable form
• Calculate descriptors (x1, x2, x3, x4, …): physical-chemical descriptors, molecular properties, fingerprints, substructure counts, etc.
• Select: training set, test set, validation set
• Generate predictive model: Random Forest, SVM, PLS, CoMFA, etc.
• Prediction of (ADMET) observable: in vivo effect, in vitro effect

Pi = f(x1, x2, x3, x4, …)

[Figure: example chemical structures]

Regression:
[Plot: pIC50 predicted (y-axis) vs. IC50 exp (x-axis), ticks at 0, 5, 10]

Classification:

                   Positive Predicted     Negative Predicted
Positive Experi.   True Positive (TP)     False Negative (FN)
Negative Experi.   False Positive (FP)    True Negative (TN)


Find a relation between the chemical structure and the observable, Pi (e.g. genotoxicity), by first calculating descriptors, xi (e.g. physchem properties), and then using a mathematical algorithm that calculates the observable Pi for each structure.
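The relation Pi = f(x1, x2, …) can be illustrated with a minimal sketch. The descriptor values and labels below are invented toy data, and a simple k-nearest-neighbour vote stands in for the Random Forest, SVM or PLS models named on the slide; the function names are hypothetical.

```python
import math

# Hypothetical toy data: each compound is a descriptor vector (x1, x2, x3)
# paired with an experimental label (1 = active, 0 = inactive).
TRAINING_SET = [
    ((0.9, 0.1, 3.0), 1),
    ((0.8, 0.2, 2.5), 1),
    ((0.7, 0.3, 2.8), 1),
    ((0.2, 0.9, 0.5), 0),
    ((0.1, 0.8, 0.7), 0),
    ((0.3, 0.7, 0.4), 0),
]


def predict_probability(descriptors, training_set, k=3):
    """Return Pi = f(x1, x2, ...) as the fraction of active compounds
    among the k nearest training compounds (Euclidean distance)."""
    by_distance = sorted(
        training_set,
        key=lambda item: math.dist(descriptors, item[0]),
    )
    nearest = by_distance[:k]
    return sum(label for _, label in nearest) / k


def predict_class(descriptors, training_set, k=3, threshold=0.5):
    """Turn the predicted probability into a 1/0 class call."""
    probability = predict_probability(descriptors, training_set, k)
    return 1 if probability >= threshold else 0


# A new compound whose descriptors resemble the active training compounds.
new_compound = (0.85, 0.15, 2.9)
print(predict_probability(new_compound, TRAINING_SET))
print(predict_class(new_compound, TRAINING_SET))
```

In practice the descriptors would come from a cheminformatics toolkit and the model from a machine learning library; only the shape of the workflow is shown here.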

Example compound (structure shown on slide): BI 621,079, hCB2 cAMP EC50 = 1.6 nM

Crowd-Sourcing applied to in silico Modeling: The general idea

Traditional Model Building vs. the KAGGLE approach
Taking advantage of the “crowd”: one vs. many

1. Define the problem
2. Prepare the data
3. Generate a model
   Traditional model building: single expert modeller
   The KAGGLE approach: the crowd
4. Find a solution

Example data (AM1):

                                   Ames positive predicted   Ames negative predicted
Ames positive experimental (183)   167                       16
Ames negative experimental (74)    21                        53

[Figure: structures of a potent Ames negative compound and a potent Ames positive compound]


The Kaggle Challenge: The Data Set
Predicting a Biological Response, 3/16/2012 – 6/15/2012

Data set of 6512 compounds from the literature. CADD-BI performed:
• Data set clean-up (6252: 3401p/2851n)
• Random split into:
    Training Set (3751: 2034p/1717n)
    Public Test Set (625: 329p/296n)
    Private Validation Set (1876: 1038p/838n)
• Pre-calculated descriptors (1776)
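A random split into three sets of these sizes could be sketched as follows. This is only an illustration: the slide does not say how the split was implemented, so the `random_split` helper, the seed and the integer compound identifiers are assumptions.

```python
import random


def random_split(items, sizes, seed=0):
    """Shuffle items reproducibly and cut them into consecutive chunks
    whose lengths are given by sizes."""
    if sum(sizes) != len(items):
        raise ValueError("sizes must add up to the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    chunks, start = [], 0
    for size in sizes:
        chunks.append(shuffled[start:start + size])
        start += size
    return chunks


# Compound identifiers are stand-ins; the set sizes are those on the slide.
compound_ids = list(range(6252))
training, public_test, private_validation = random_split(
    compound_ids, sizes=(3751, 625, 1876)
)
print(len(training), len(public_test), len(private_validation))
```

The three sizes correspond to roughly 60%, 10% and 30% of the cleaned data set.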

Participants had no knowledge of
• the modeled endpoint
• the descriptor types
• the chemical structures

• BI offered $20,000 for the best three models
• Participants could use any technology they wanted
• BI will get the models

Objectives:
• Response to competition
• Quality of the algorithms/models
• Model transfer

Task: Generate an Ames classification model (1 = Ames positive, 0 = Ames negative)

This challenge does NOT test all aspects of predictive in silico modeling.
• Important aspects, e.g. data set selection, descriptor selection/design, are missing
• The study is a machine learning exercise, a proof of concept
• Advantage: we know exactly what to expect; comparative benchmarks are available



The Kaggle Challenge: The Competition
Predicting a Biological Response, 3/16/2012 – 6/15/2012

Overwhelming response to competition!
• 796 players (487 first-time participants)
• 703 teams
• 8841 models submitted
• On average 88 entries per day!
• Optimal model generated after ~20 days

Best models perform better than standard benchmarks:

Model                                        Rank   Log Loss
Best Model                                      1   0.37356
Random Forest                                 352   0.41540
SVM                                           541   0.49503
Each Class Predicted with Probability 0.5     599   0.69250
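For readers unfamiliar with the log loss metric used for the ranking, the standard binary definition is LogLoss = -(1/N) Σ [ y_i ln(p_i) + (1 - y_i) ln(1 - p_i) ], where y_i is the experimental class and p_i the predicted probability of class 1. The sketch below is a plain implementation of that formula; the clipping constant `eps` and the toy labels are assumptions, not details taken from the competition.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


labels = [1, 0, 1, 1, 0]                  # toy experimental classes
uninformative = [0.5] * len(labels)       # every compound predicted at 0.5
confident = [0.9, 0.1, 0.8, 0.7, 0.2]     # toy model output

print(log_loss(labels, uninformative))
print(log_loss(labels, confident))
```

Predicting 0.5 for every compound gives ln 2, about 0.693, whatever the labels are, which is close to the 0.69250 listed for that benchmark; lower values indicate better probability estimates.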

The Kaggle Challenge: Measuring the Performance
Different performance metrics for in silico classification models

Model                      LogLoss   Sensitivity   Specificity   CCR     PPV     NPV     MCC
Random Forest (rank 352)   0.41540   0.855         0.802         0.829   0.843   0.818   0.66
SVM (rank 540)             0.49503   0.792         0.743         0.768   0.793   0.743   0.55
Rank 1                     0.37356   0.841         0.820         0.830   0.853   0.806   0.66
Rank 2                     0.37363   0.855         0.803         0.829   0.843   0.818   0.66
Rank 3                     0.37407   0.860         0.807         0.833   0.846   0.823   0.67
Rank 10                    0.37641   0.860         0.810         0.835   0.849   0.824   0.67
Rank 50                    0.38229   0.856         0.805         0.831   0.845   0.819   0.66
Rank 100                   0.38958   0.869         0.794         0.831   0.839   0.830   0.67

Differences in top models in logloss metric are small. Different statistical measures lead to different rankings.

The RF benchmark has a high correct classification rate (CCR) and a high Matthews correlation coefficient (MCC).
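That different measures lead to different rankings can be checked directly on the tabulated values. The sketch below copies four columns of the performance table for the ranked entries and reports the best model under each metric; the dictionary layout and the `ranking` helper are illustrative choices.

```python
# Values copied from the performance table: LogLoss, Sensitivity, Specificity, CCR.
MODELS = {
    "Rank 1": (0.37356, 0.841, 0.820, 0.830),
    "Rank 2": (0.37363, 0.855, 0.803, 0.829),
    "Rank 3": (0.37407, 0.860, 0.807, 0.833),
    "Rank 10": (0.37641, 0.860, 0.810, 0.835),
    "Rank 50": (0.38229, 0.856, 0.805, 0.831),
    "Rank 100": (0.38958, 0.869, 0.794, 0.831),
}
METRICS = ("logloss", "sensitivity", "specificity", "ccr")


def ranking(metric):
    """Order model names from best to worst for one metric.
    Log loss is better when lower; the other metrics are better when higher."""
    column = METRICS.index(metric)
    lower_is_better = metric == "logloss"
    return sorted(
        MODELS,
        key=lambda name: MODELS[name][column],
        reverse=not lower_is_better,
    )


for metric in METRICS:
    print(metric, ranking(metric)[0])
```

On these numbers the entry ranked 1st by log loss is also best by specificity, while the entry ranked 10th has the highest CCR and the entry ranked 100th the highest sensitivity.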

Confusion matrix layout:

                   Positive Predicted     Negative Predicted
Positive Experi.   True Positive (TP)     False Negative (FN)
Negative Experi.   False Positive (FP)    True Negative (TN)

Confusion matrix counts per model:

Group           Model     TP    FN    FP    TN
Winning Teams   Rank 1    873   165   151   687
Winning Teams   Rank 2    888   150   165   673
Winning Teams   Rank 3    893   145   162   676
Benchmarks      RF        888   150   166   672
Benchmarks      SVM       822   216   215   623
Other Models    Rank 17   896   142   169   669
Other Models    D27       781   257   215   623
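The classification metrics in the performance table can be recomputed from the confusion matrix counts. The sketch below uses the standard definitions and takes CCR as the mean of sensitivity and specificity, which is an assumption that happens to reproduce the tabulated values; the function name is hypothetical.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Compute common classification metrics from confusion matrix counts."""
    sensitivity = tp / (tp + fn)              # true positive rate
    specificity = tn / (tn + fp)              # true negative rate
    ccr = (sensitivity + specificity) / 2     # assumed: balanced accuracy
    ppv = tp / (tp + fp)                      # positive predictive value
    npv = tn / (tn + fn)                      # negative predictive value
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ccr": ccr,
        "ppv": ppv,
        "npv": npv,
        "mcc": mcc,
    }


# Counts of the 1st ranked model: TP = 873, FN = 165, FP = 151, TN = 687.
rank1 = classification_metrics(tp=873, fn=165, fp=151, tn=687)
for name, value in rank1.items():
    print(f"{name}: {value:.3f}")
```

For the 1st ranked model this yields a sensitivity of about 0.841, a specificity of about 0.820 and an MCC of about 0.66, in line with the table.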



The Kaggle Challenge: Lessons learned

Technology aspects:
• 1st ranked team: R-software, blending of several different RandomForest models, with special feature selection and weighting techniques. Final models were merged using other machine learning techniques.
• 2nd ranked team: R-software, RandomForest, derived new response variable depending on value and observed activity. This may lead to better separation between actives and inactives.
• 3rd ranked team: R-software, RandomForest with special techniques to deal with imbalanced data sets.
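The idea of blending several models can be sketched in its simplest form as a weighted average of their predicted probabilities. This is not the winning team's method, which merged models using other machine learning techniques; the `blend` function, the weights and the toy probabilities are assumptions for illustration.

```python
def blend(probability_lists, weights=None):
    """Weighted average of per-compound probabilities from several models."""
    n_models = len(probability_lists)
    if weights is None:
        weights = [1.0] * n_models          # default: equal weights
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")
    n_compounds = len(probability_lists[0])
    if any(len(probs) != n_compounds for probs in probability_lists):
        raise ValueError("all models must predict the same compounds")
    total_weight = sum(weights)
    return [
        sum(w * probs[i] for w, probs in zip(weights, probability_lists))
        / total_weight
        for i in range(n_compounds)
    ]


# Toy predictions of two hypothetical models for two compounds.
model_a = [0.9, 0.2]
model_b = [0.7, 0.4]
print(blend([model_a, model_b]))
print(blend([model_a, model_b], weights=[3, 1]))
```

With equal weights the blend is the plain mean of the two models; unequal weights let a better-performing model contribute more.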

• The challenge was a success
• There was a great response
• Predictive in silico models were generated within a three-month time frame
• Models were at least as good as the literature
• Social aspects of crowd-sourcing were observed

The Kaggle Challenge: Lessons learned (continued)

Performance aspects:
• Model performance on par with best literature models, reached maximum performance for data set
• Top-ranking models are not significantly different from Random Forest benchmark
• Quick turn-around (3 months), code made available
• Model performance plateaued after 20 days

A standard RandomForest model is a good starting point. In-house technology performs as well as more complex approaches.

Social aspects of competition:
• Very strong response: 703 teams, 8841 models submitted
• People from all over the world participated:
    1st place team from US (Harvard, Travelers insurance)
    2nd place team from Russia, graduate student from Moscow
    3rd place from China, graduate student from Beijing
• Winning teams had no CompChem/Chemistry background
• Formation of teams occurred during competition

Bentzien et al. “Crowd computing: Using competitive dynamics to develop and refine highly predictive models”, Drug Discovery Today (2013), 18, 472 - 478.



The Kaggle Challenge: Lessons learned (continued)
Important aspects for successful crowdsourcing:

Design the crowdsourcing challenge:
• Very clearly defined task/objective
• Predefined precise metric to measure entries
• Provide adequate incentive/prize money for participants

Participants:
• Hosting the challenge either through a third party or self
• Internal/restricted/open challenge
• Promote the crowdsourcing challenge among key expert leaders

The challenge:
• Right barrier for participation
• Fast turn-around/feedback to participants
• Gamification can provide additional incentive to participants and can lead to synergies amongst participants

After the challenge:
• Clear follow-up of what to do with the results
• Does the challenge benefit your network/organization?

Crowd-Sourcing: Other examples

http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html?_r=0

Lakhani et al., Nat Biotech, 2013, 31, 108-111.

www.innocentive.com

www.the-dream-project.com (Prill et al., Science Signaling, 2011, 4, 1-6)

www.kaggle.com

www.topcoder.com

www.grants4targets.com


Crowd-Sourcing: A new way for solving problems(?)

Will crowd-sourcing solve all the problems? Likely not.
• Crowd-sourcing offers opportunities but it is not without risks.
• For crowd-sourcing to be successful/innovative, the task needs to be structured right.

Murcko & Walters, “Alpha Shock”, J Comput Aided Mol Des 2012, 26, 97-102

Simple crowd work (example: Amazon Mechanical Turk)
• tendency to be mechanical
• not innovative
• has exploitative tendency

Potential framework for future crowd work. Requires
• intelligent work decomposition
• sophisticated workflow design
• high level of collaborative work
• quality assurance

Kittur et al., “The Future of Crowd Work”, 16th ACM Conference on Computer Supported Cooperative Work (CSCW 2013)

Will crowd-sourcing be the future way of drug discovery? Maybe, ….

Drug discovery will definitely be different from what it is now.



Acknowledgements: Business Partners and Collaborators

ADMET-WG: Jan Kriegl, Bernd Beck, Stefan Scheuerer, Michael Durawa, Pierre Bonneau, Sanjay Srivastava, Michel Garneau, Hassan Kadhim, Matthias Klemencic, Christian Klein, Robert Happel, Gerald Birringer, Dustin Smith, Scott Oloff, Zheng Yang

Toxicology: Warren Ku, Patricia Escobar, Ray Kemper

External Collaborators: Ernst-Walter Knapp, Özgür Demir-Kamuk, Alex Tropsha, Curt Breneman, John Pu, Andy Fant, Zhuo Zhen

Medicinal Chemistry: Robert Hughes, In silico VPR-team, all the MedChem users

Research IS: Scott Oloff, David Thompson (PAC), Zheng Yang, Scott Whalen, Cathy Farrell, Miguel Teodoro, IS-Innovation Team, Alex Renner

Structural Research: Sandy Farmer, Neil Farrow, Ingo Mügge, all CADD colleagues

SKD: Will Loging

Kaggle: Kaggle Team, Kaggle Challenge Participants
