Bigger Data to Increase Drug Discovery

Sean Ekins

Phoenix Nest, Inc., Brooklyn, NY.
Collaborations in Chemistry, Inc., Fuquay Varina, NC.
Collaborative Drug Discovery, Inc., Burlingame, CA.
Collaborations Pharmaceuticals, Inc., Fuquay Varina, NC.

In a Perfect World…

• All major diseases cured
• All >7000 rare diseases have treatments available
• Neglected diseases are eradicated
• Antibiotics, antivirals and vaccines developed to anticipate all future mutations
• Drug resistance eradicated
• All research coordinated globally
• Government/individual collaborations discover and fund all research
• Billions of molecules will be available with data for different targets
• All decisions will involve machine learning
• Life expectancy is infinite

Big DATA

Ebola-related tweets in a 6-week period, 2014

Robert Moore

Why ‘Bigger’ and not ‘Big’

Just a matter of scale?

Drug Discovery’s definition of Big data

Everyone else’s definition of Big data

What about Chemistry and Biology - Pharmacology X.0

• Data Sources
– PubChem
– ChEMBL
– ToxCast: over 1800 molecules tested against over 800 endpoints

BUT WHERE ARE THE ‘Big’ Chemistry DBs?

But what about small data?

• In some cases it's all we have
• In vivo data is not high throughput
• Small data builds networks

http://smalldatagroup.com/

The past

• 1996
• Data from low-throughput drug-drug interaction studies
• E.g. Ki values with CYP3A4
• A drug company might have tens of values
• These data were used to build 3D QSAR models and pharmacophores

JPET, 290: 429-438, 1999

Pharmacophore features and observed vs. predicted IC50 correlation (r):

Process                     HPF   HBA   HBD   r
Acoustic mediated process   2     1     1     0.92
Tip-based process           0     2     1     0.80

HPF = hydrophobic features; HBA = hydrogen bond acceptor; HBD = hydrogen bond donor.

[Figure panels: Acoustic, Tip based]

Generated with Discovery Studio (Accelrys)

Cyan = hydrophobic

Green = hydrogen bond acceptor

Purple = hydrogen bond donor

Each model shows most potent molecule mapping

How you dispense liquids may be important: insights from small data

PLoS ONE 8(5): e62325 (2013)

Ebola inhibitor Pharmacophore

Ekins S, Freundlich JS and Coffee M F1000Research 2014, 3:277

Docking FDA approved compounds in VP35 protein showing overlap with ligand (yellow)

Proposed that amodiaquine, chloroquine, clomiphene and toremifene, which are all active in vitro, may have common features and bind a common site / target

A common feature pharmacophore for FDA-approved drugs inhibiting the Ebola virus

The last 5 years - Present

• 2010
• Data from high-throughput screens at Pfizer
• E.g. metabolic stability data for ~200K compounds
• These data were used to build machine learning models
• 2015
• Could easily be double this amount

Drug Metab Dispos, 38: 2083-2090, 2010

Ebola Machine Learning Models

ROC values for models built on a training set of 868 compounds:

Model (validation)                         Ebola replication (actives = 20)   Ebola pseudotype (actives = 41)
RP Forest (out-of-bag)                     0.70                               0.85
RP Single Tree (5-fold cross validation)   0.78                               0.81
SVM (5-fold cross validation)              0.73                               0.76
Bayesian (5-fold cross validation)         0.86                               0.85
Bayesian (leave out 50% x 100)             0.86                               0.82
Open Bayesian (5-fold cross validation)    0.82                               0.82

Ekins, Freundlich, Madrid and Clark

https://goo.gl/uG8K3P
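
The ROC values reported for the Ebola models summarize how well each model ranks actives ahead of inactives. As a minimal sketch of what that number measures (not the software used to produce the slide), the rank-based calculation below returns the probability that a randomly chosen active scores higher than a randomly chosen inactive. The function name `roc_auc` and the toy scores are illustrative assumptions.

```python
def roc_auc(scores, labels):
    """Rank-based ROC AUC: P(active scores above inactive); ties count as 0.5."""
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


# Toy data (hypothetical): model scores, with 1 = active and 0 = inactive.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_labels = [1, 0, 1, 0, 1, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the 0.70 to 0.86 values in the table should be read.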

Tuberculosis still kills 1.6-1.7 million people per year (~1 every 19 seconds)

1/3 of the world's population infected!

streptomycin (1943)
para-aminosalicylic acid (1949)
isoniazid (1952)
pyrazinamide (1954)
cycloserine (1955)
ethambutol (1962)
rifampicin (1967)

Multi drug resistance in 4.3% of cases

Extensively drug resistant: increasing incidence

2 new drugs (bedaquiline, delamanid) in 40 yrs

Tuberculosis – a big disease

Tested                 Outcome
>350,000 molecules     >1500 active and non-toxic
~2M                    Published 177
2M                     100s
>300,000               800

Big Data: Screening for New Tuberculosis Treatments

How many will become a new drug?
How do we learn from this big data?

TBDA screened over 1 million, 1 million more to go

TB Alliance + Japanese pharma screens

Over 8000 molecules with dose response data for Mtb in CDD Public, from NIAID/SRI

https://app.collaborativedrug.com/register

Over 6 years analyzed in vitro data and built models

Workflow:

• Mtb screening molecule database/s
• High-throughput phenotypic Mtb screening
• Descriptors + Bioactivity (+Cytotoxicity)
• Bayesian Machine Learning classification Mtb Model
• Molecule Database (e.g. GSK malaria actives) virtually scored using Bayesian Models
• Top scoring molecules assayed for Mtb growth inhibition
• New bioactivity data may enhance models

Identify in vitro hits and test models:

• 3 x published prospective tests
• ~750 molecules were tested in vitro
• 198 actives were identified
• >20% hit rate
• Multiple retrospective tests: 3-10 fold enrichment


Ekins et al., Pharm Res 31: 414-435, 2014
Ekins et al., Tuberculosis 94: 162-169, 2014
Ekins et al., PLoS ONE 8: e63240, 2013
Ekins et al., Chem Biol 20: 370-378, 2013
Ekins et al., JCIM 53: 3054-3063, 2013
Ekins and Freundlich, Pharm Res 28: 1859-1869, 2011
Ekins et al., Mol BioSyst 6: 840-851, 2010
Ekins et al., Mol BioSyst 6: 2316-2324, 2010
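
The hit rate and fold-enrichment figures quoted for the prospective and retrospective tests are simple ratios. The sketch below recomputes the hit rate from the counts on the slide (198 actives out of ~750 tested). The `baseline_rate` passed to `fold_enrichment` is a hypothetical value chosen only to illustrate the calculation; it is not a number from the talk.

```python
def hit_rate(n_active, n_tested):
    """Fraction of tested molecules that were confirmed active."""
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    return n_active / n_tested


def fold_enrichment(model_rate, baseline_rate):
    """How many times better the model-guided hit rate is than a baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return model_rate / baseline_rate


# Counts from the slide: 198 actives among ~750 molecules tested in vitro.
prospective = hit_rate(198, 750)

# Hypothetical baseline of 5% (an assumption, for illustration only).
example_fold = fold_enrichment(prospective, 0.05)
```

With these inputs the prospective hit rate is 26.4%, consistent with the ">20%" stated on the slide.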

5 active compounds vs Mtb in a few months

7 tested, 5 active (70% hit rate)

Ekins et al.,Chem Biol 20, 370–378, 2013

1. Virtually screen 13,533-member GSK antimalarial hit library

2. Bayesian Model = SRI TAACF-CB2 dose response + cytotoxicity model

3. Top 46 commercially available compounds visually inspected

4. 7 compounds chosen for Mtb testing based on
- drug-likeness
- chemotype diversity

GSK #          Bayesian Score   Mtb H37Rv MIC (µg/mL)   GSK Reported % Inhibition HepG2 @ 10 µM cmpd
TCMDC-123868   5.73             >32                     40
TCMDC-125802   5.63             0.0625                  5
TCMDC-124192   5.27             2.0                     4
TCMDC-124334   5.20             2.0                     4
TCMDC-123856   5.09             1.0                     83
TCMDC-123640   4.66             >32                     10
TCMDC-124922   4.55             1.0                     9

(Chemical Structure column: images not reproduced.)
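
To make the 5-of-7 result easy to check, the table can be encoded directly. This sketch assumes a compound counts as active when a finite MIC was measured, that is, when the value is not reported as ">32" µg/mL. That cutoff and the variable names are illustrative assumptions, not the activity criterion stated in the paper.

```python
# (GSK ID, Bayesian score, Mtb H37Rv MIC in µg/mL as reported, % HepG2 inhibition at 10 µM)
rows = [
    ("TCMDC-123868", 5.73, ">32", 40),
    ("TCMDC-125802", 5.63, "0.0625", 5),
    ("TCMDC-124192", 5.27, "2.0", 4),
    ("TCMDC-124334", 5.20, "2.0", 4),
    ("TCMDC-123856", 5.09, "1.0", 83),
    ("TCMDC-123640", 4.66, ">32", 10),
    ("TCMDC-124922", 4.55, "1.0", 9),
]


def is_active(mic_text):
    # Assumption: a censored value such as ">32" means no inhibition was reached.
    return not mic_text.startswith(">")


active_rows = [r for r in rows if is_active(r[2])]
actives = [r[0] for r in active_rows]
hit_rate = len(actives) / len(rows)

# Lowest measured MIC among the actives.
most_potent = min(active_rows, key=lambda r: float(r[2]))[0]
```

Under that assumption five of the seven compounds are active, and the lowest MIC belongs to TCMDC-125802.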

Filling out the triazine matrix using SARtable: A new kind of map

Green = good activity, Red = bad; colored dots are predictions

No relationship between internal or external ROC and the number of molecules in the training set?

PCA of combined data and ARRA(red)

Ekins et al., J Chem Inf Model 54: 2157-2165 (2014)

Internal and leave out 50% x 100 ROC track each other
External ROC shows less correlation
Smaller models do just as well with external testing

~350,000

What matters most: >70 years of TB mouse in vivo data – Mind the gap – 770 molecules

MIND THE TB GAP

Ekins et al., J Chem Inf Model 54: 1070-82, 2014

Ekins, Nuermberger & Freundlich DDT 19: 1279-1282, 2014

In vivo Machine Learning Models

Model            ROC (5-fold cross validation)   External test set
RP Forest        0.75                            3/11 (27.3%)
RP Single Tree   0.71                            4/11 (36.4%)
SVM              0.77                            7/11 (63.6%)
Bayesian         0.73                            8/11 (72.7%)

Ekins et al., J Chem Inf Model 54: 1070-82, 2014

How can we find the in vivo active compounds? We need a map.

>70 years of TB in vivo data
Green = in vivo mouse active
Empty = in vivo inactive
Yellow = 2013-2015 data

Uses Bayesian fingerprints and clustering by similarity

Clark and Ekins - unpublished
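
Clustering by similarity, as used for the map, needs a similarity measure between fingerprints and a grouping rule. The sketch below uses Tanimoto similarity on sets of fingerprint bits and a simple leader-style grouping. The 0.5 threshold, the grouping rule, and the toy fingerprints are assumptions for illustration; they are not the unpublished method behind the figure.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint bits."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def leader_cluster(fps, threshold=0.5):
    """Assign each fingerprint to the first cluster whose leader is similar enough."""
    clusters = []  # each cluster is a list of indices; the first index is the leader
    for idx, fp in enumerate(fps):
        for cluster in clusters:
            if tanimoto(fp, fps[cluster[0]]) >= threshold:
                cluster.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


# Toy fingerprints (hypothetical bit IDs).
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {20}]
clusters = leader_cluster(toy_fps)
```

Molecules that share most of their bits fall into the same group, while a singleton such as the last toy fingerprint forms its own cluster.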

Clustering in vivo mouse TB data: hex plot

>70 years of TB in vivo data
Green = in vivo mouse active
Empty = in vivo inactive
Yellow = 2013-2015 data

Clark and Ekins - unpublished

Clustering in vivo mouse TB data

Triazine surrounded by inactives

Issues: high log P, poor solubility

How do we ‘increase drug discovery’?

• Make data and models more accessible
• Collaborate
• Share

– Create mobile apps

• Encourage engagement from non scientists

MODELS RESIDE IN PAPERS, NOT ACCESSIBLE… THIS IS UNDESIRABLE

How do we share them?
How do we use them?

• CDD Vision
Uses Bayesian algorithm and FCFP_6 fingerprints

Bayesian models

Clark et al., J Cheminform 6:38 2014

Predictions for the InhA target: (a) the ROC curve with ECFP_6 and FCFP_6 fingerprints; (b) modified Bayesian estimators for active and inactive compounds; (c) structures of selected binders.

For each listed target with at least two binders, it is first assumed that all of the molecules in the collection that do not indicate this as one of their targets are inactive.

In the app we used ECFP_6 fingerprints

Building Bayesian models for each target in TB Mobile

Clark et al., J Cheminform 6:38 2014
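
How a fingerprint-based Bayesian model turns substructure bits into a score can be sketched in a few lines. The version below is a Laplacian-corrected naive Bayesian estimator of the general kind described in the cheminformatics literature, written from scratch for illustration. It is not the TB Mobile or CDD code, and the integer bit IDs and toy molecules are made up. As stated on the slide, every molecule not annotated with the target is treated as inactive.

```python
import math
from collections import Counter


def train_bayesian(fingerprints, labels):
    """Return a per-bit log contribution using a Laplacian correction.

    fingerprints: list of sets of int bit IDs
    labels: list of bool, True = active against the target
    """
    n_total = len(labels)
    base_rate = sum(labels) / n_total  # overall fraction of actives
    bit_total = Counter()
    bit_active = Counter()
    for fp, is_active in zip(fingerprints, labels):
        for bit in fp:
            bit_total[bit] += 1
            if is_active:
                bit_active[bit] += 1
    # contribution = ln((A + 1) / (T * P + 1)), where A = actives with the bit,
    # T = molecules with the bit, P = base rate of actives
    return {
        bit: math.log((bit_active[bit] + 1) / (bit_total[bit] * base_rate + 1))
        for bit in bit_total
    }


def score(model, fp):
    """Sum the contributions of the bits present; unseen bits contribute 0."""
    return sum(model.get(bit, 0.0) for bit in fp)


# Toy training set (hypothetical): two binders and two assumed inactives.
toy_fps = [{1, 2}, {1, 3}, {4, 5}, {4, 6}]
toy_labels = [True, True, False, False]
toy_model = train_bayesian(toy_fps, toy_labels)
```

Bits seen mostly in binders get positive contributions and bits seen mostly in assumed inactives get negative ones, so `score(toy_model, {1, 2})` is positive and `score(toy_model, {4, 5})` is negative.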

TB Mobile Vers. 2

Ekins et al., J Cheminform 5:13, 2013
Clark et al., J Cheminform 6:38, 2014

Predict targets
Cluster molecules

http://goo.gl/vPOKS

http://goo.gl/iDJFR

Predictions for 2013-2015 in vivo molecules

Bayesian models added to mobile apps: MMDS

Bayesian models added to mobile apps: Approved drugs

Human microsomal intrinsic clearance

Human protein binding

Solubility pH 7.4

AZ dataset models >1000 molecules

Models from ChEMBL data

http://molsync.com/bayesian2

What do 2000 ChEMBL models look like?

[Figure: average ROC vs. folding bit size]

http://molsync.com/bayesian2
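
"Folding bit size" refers to compressing a sparse set of fingerprint hash codes into a fixed-length bit vector. A minimal sketch, assuming folding is done by taking each hash code modulo the target length (a common approach, and not necessarily the one used for the ChEMBL models), shows why smaller sizes cause distinct features to collide. The hash codes are made up.

```python
def fold(hash_codes, n_bits):
    """Fold arbitrary non-negative hash codes into the range 0..n_bits-1."""
    if n_bits <= 0:
        raise ValueError("n_bits must be positive")
    return {code % n_bits for code in hash_codes}


# Hypothetical hash codes for five distinct substructure features.
codes = {5, 77, 1029, 2053, 4096}

folded_1024 = fold(codes, 1024)  # 5, 1029 and 2053 all land on bit 5
folded_4096 = fold(codes, 4096)  # the five features stay distinct
```

Here five features collapse to three bits at size 1024 but remain five bits at size 4096, which is the trade-off between model size and information loss that a plot of average ROC against folding bit size explores.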

Bigger datasets and model collections

• Profiling “big datasets” is going to be the norm.
• A recent study mined PubChem datasets for compounds that have rat in vivo acute toxicity data
• This could be used in other big data initiatives like ToxCast (> 1000 compounds x 800 assays) and Tox21 etc.
• Kinase screening data (1000s mols x 100s assays)
• GPCR datasets etc (1000s mols x 100s assays)

Zhang J, Hsieh JH, Zhu H (2014) Profiling Animal Toxicants by Automatically Mining Public Bioassay Data: A Big Data Approach for Computational Toxicology. PLoS ONE 9(6): e99863. doi:10.1371/journal.pone.0099863

• Data is at your fingertips instantly
• Labs add data to a massive corpus of knowledge
• Instantly available to all
• Algorithms for mining, prediction
• Millions of models accessible
• Making decisions on experiments needed and running them
• Data visualization, exploration is real-time, updated
• Data follows you

Sean Ekins, a computational drug discovery consultant at Collaborations in Chemistry in North Carolina, is much more skeptical. He notes pharma companies have found hundreds of antimalaria compounds more potent than TNP-470 and says that he is not convinced Eve can do QSAR. He wants to see Eve go head-to-head with a real computational chemist. “Eve should go back to the Garden of Eden and leave drug discovery to scientists who know what they are doing,” Ekins says.

How close are we?

• Computers and models do not replace scientists
• A tool to help us sift through ideas quickly
• Many examples have led to leads
• Bigger data not needed for good models
• More data becoming public
• Can model ADME, bioactivity and more
• Collaboration and software are important
• Mobile apps have useful cheminformatics features - aid anyone to do drug discovery
• Models are compact (< 1 MB) and portable
• The age of model sharing is here

Conclusions

Wanted

• “Bigger” small molecule screening datasets
• Preferably > 500,000 – 1,000,000 molecules with data
• To test how machine learning algorithms scale

• Contact [email protected]

Nadia Litterman, Krishna Dole and all at CDD, Megan Coffee, SRI, MM4TB and many others …

Funding: Bill and Melinda Gates Foundation (Grant#49852), 1R41AI088893-01, 2R42AI088893-02, R43 LM011152-01, 9R44TR000942-02, 1R41AI108003-01, 1U19AI109713-01, MM4TB

Software: Biovia

Freundlich Lab
