Bigger Data to Increase Drug Discovery
Sean Ekins
Phoenix Nest, Inc., Brooklyn, NY.
Collaborations in Chemistry, Inc., Fuquay Varina, NC.
Collaborative Drug Discovery, Inc., Burlingame, CA.
Collaborations Pharmaceuticals, Inc., Fuquay Varina, NC.
In a Perfect World…
• All major diseases cured
• All > 7000 rare diseases have treatments available
• Neglected diseases are eradicated
• Antibiotics, antivirals, vaccines developed to anticipate all future mutations
• Drug resistance eradicated
• All research coordinated globally
• Government/individual collaboration discovers and funds all research
• Billions of molecules will be available with data for different targets
• All decisions will involve machine learning
• Life expectancy is infinite
Big Data
Ebola-related tweets in a 6-week period, 2014
Robert Moore
Why ‘Bigger’ and not ‘Big’
Just a matter of scale?
Drug Discovery’s definition of Big data
Everyone else’s definition of Big data
What about Chemistry and Biology - Pharmacology X.0
• Data Sources
• PubChem
• ChEMBL
• ToxCast: over 1800 molecules tested against over 800 endpoints
But where are the ‘Big’ chemistry DBs?
But what about small data?
• In some cases it's all we have
• In vivo data is not high throughput
• Small data builds networks
http://smalldatagroup.com/
The past
• 1996
• Data from low-throughput drug-drug interaction studies
• E.g. Ki values with CYP3A4
• A drug company might have 10s of values
• These data were used to build 3D QSAR models and pharmacophores
JPET, 290: 429-438, 1999
Process | Hydrophobic features (HPF) | Hydrogen bond acceptor (HBA) | Hydrogen bond donor (HBD) | Observed vs. predicted IC50 (r)
Acoustic-mediated process | 2 | 1 | 1 | 0.92
Tip-based process | 0 | 2 | 1 | 0.80
Pharmacophore panels: Acoustic; Tip-based
Generated with Discovery Studio (Accelrys)
Cyan = hydrophobic
Green = hydrogen bond acceptor
Purple = hydrogen bond donor
Each model shows the most potent molecule mapping
How you dispense liquids may be important: insights from small data
PLoS ONE 8(5): e62325 (2013)
Ebola inhibitor Pharmacophore
Ekins S, Freundlich JS and Coffee M F1000Research 2014, 3:277
Docking FDA approved compounds in VP35 protein showing overlap with ligand (yellow)
Proposed that amodiaquine, chloroquine, clomiphene and toremifene, which are all active in vitro, may have common features and bind a common site/target
A common feature pharmacophore for FDA-approved drugs inhibiting the Ebola virus
The last 5 years to present
• 2010
• Data from high-throughput screens at Pfizer
• E.g. metabolic stability data, ~200K compounds
• These data were used to build machine learning models
• 2015
• Could easily be double this amount
Drug Metab Dispos, 38: 2083-2090, 2010
Ebola Machine Learning Models
Models (training set 868 compounds) | RP Forest (out-of-bag ROC) | RP Single Tree (5-fold cross-validation ROC) | SVM (5-fold cross-validation ROC) | Bayesian (5-fold cross-validation ROC) | Bayesian (leave out 50% x 100 ROC) | Open Bayesian (5-fold cross-validation ROC)
Ebola replication (actives = 20) | 0.70 | 0.78 | 0.73 | 0.86 | 0.86 | 0.82
Ebola pseudotype (actives = 41) | 0.85 | 0.81 | 0.76 | 0.85 | 0.82 | 0.82
Ekins, Freundlich, Madrid and Clark
https://goo.gl/uG8K3P
Tuberculosis still kills 1.6-1.7m/yr (~1 every 8 seconds)
One third of the world's population is infected
streptomycin (1943), para-aminosalicylic acid (1949), isoniazid (1952), pyrazinamide (1954), cycloserine (1955), ethambutol (1962), rifampicin (1967)
Multi drug resistance in 4.3% of cases
Extensively drug resistant increasing incidence
2 new drugs (bedaquiline, delamanid) in 40 yrs
Tuberculosis – a big disease
Big Data: Screening for New Tuberculosis Treatments
Screens (figure labels, in order): Tested >350,000 molecules; Tested ~2M; 2M; >300,000
Outcomes (figure labels, in order): >1500 active and non toxic; Published 177; 100s; 800
How many will become a new drug?
How do we learn from this big data?
TBDA screened over 1 million, 1 million more to go
TB Alliance + Japanese pharma screens
Over 8000 molecules with dose response data for Mtb in CDD Public, from NIAID/SRI
https://app.collaborativedrug.com/register
Over 6 years analyzed in vitro data and built models
Workflow (figure labels):
• Mtb screening molecule database/s, high-throughput phenotypic Mtb screening
• Descriptors + Bioactivity (+ Cytotoxicity)
• Bayesian machine learning classification Mtb model
• Molecule database (e.g. GSK malaria actives) virtually scored using Bayesian models
• Top scoring molecules assayed for Mtb growth inhibition
• Identify in vitro hits and test models
• New bioactivity data may enhance models
Results:
• 3 x published prospective tests: ~750 molecules were tested in vitro, 198 actives were identified, >20% hit rate
• Multiple retrospective tests: 3-10 fold enrichment
Ekins et al., Pharm Res 31: 414-435, 2014
Ekins et al., Tuberculosis 94: 162-169, 2014
Ekins et al., PLoS ONE 8: e63240, 2013
Ekins et al., Chem Biol 20: 370-378, 2013
Ekins et al., JCIM 53: 3054-3063, 2013
Ekins and Freundlich, Pharm Res 28: 1859-1869, 2011
Ekins et al., Mol BioSyst 6: 840-851, 2010
Ekins et al., Mol BioSyst 6: 2316-2324, 2010
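The Bayesian classification step in this workflow can be illustrated with a minimal sketch. It assumes a Laplacian-corrected naive Bayes estimator over fingerprint features, and it uses small integer sets as stand-ins for real fingerprint features such as FCFP_6. The function names and toy data are hypothetical, and this is not the published implementation.

```python
import math
from collections import Counter

def train_bayesian(fingerprints, labels):
    """Return a per-feature log contribution (Laplacian-corrected, assumed form)."""
    total = len(fingerprints)
    base_rate = sum(labels) / total          # fraction of actives in the training set
    n_with = Counter()                       # molecules containing each feature
    n_active_with = Counter()                # actives containing each feature
    for fp, active in zip(fingerprints, labels):
        for feature in fp:
            n_with[feature] += 1
            if active:
                n_active_with[feature] += 1
    return {
        f: math.log((n_active_with[f] + 1.0) / (n_with[f] * base_rate + 1.0))
        for f in n_with
    }

def score(model, fp):
    """Sum the contributions of known features; unseen features add nothing."""
    return sum(model.get(f, 0.0) for f in fp)

# Toy training set: integer IDs stand in for hashed fingerprint features.
train_fps = [{1, 2, 3}, {1, 2, 4}, {1, 5}, {6, 7}, {6, 8}, {7, 8, 9}]
train_labels = [1, 1, 1, 0, 0, 0]            # 1 = active, 0 = inactive
model = train_bayesian(train_fps, train_labels)

# Virtually score a hypothetical library and rank it, highest score first.
library = {"mol_a": {1, 2}, "mol_b": {6, 7}, "mol_c": {1, 9}}
ranked = sorted(library, key=lambda name: score(model, library[name]), reverse=True)
print(ranked)
```

Features seen mostly in actives get positive contributions and features seen mostly in inactives get negative ones, so ranking a library by the summed score puts likely actives near the top for testing.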
5 active compounds vs Mtb in a few months
7 tested, 5 active (~71% hit rate)
Ekins et al.,Chem Biol 20, 370–378, 2013
1. Virtually screen 13,533-member GSK antimalarial hit library
2. Bayesian Model = SRI TAACF-CB2 dose response + cytotoxicity model
3. Top 46 commercially available compounds visually inspected
4. 7 compounds chosen for Mtb testing based on
- drug-likeness
- chemotype diversity
GSK # | Bayesian score | Mtb H37Rv MIC (µg/mL) | GSK reported % inhibition HepG2 @ 10 µM cmpd
TCMDC-123868 | 5.73 | >32 | 40
TCMDC-125802 | 5.63 | 0.0625 | 5
TCMDC-124192 | 5.27 | 2.0 | 4
TCMDC-124334 | 5.20 | 2.0 | 4
TCMDC-123856 | 5.09 | 1.0 | 83
TCMDC-123640 | 4.66 | >32 | 10
TCMDC-124922 | 4.55 | 1.0 | 9
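The hit rate for these seven compounds can be checked directly from the tabulated MIC values. The sketch below assumes that a censored MIC such as ">32" means the compound counts as inactive; that activity rule is an assumption for illustration, and the helper names are hypothetical.

```python
# (compound, Bayesian score, Mtb H37Rv MIC in µg/mL) copied from the table
results = [
    ("TCMDC-123868", 5.73, ">32"),
    ("TCMDC-125802", 5.63, "0.0625"),
    ("TCMDC-124192", 5.27, "2.0"),
    ("TCMDC-124334", 5.20, "2.0"),
    ("TCMDC-123856", 5.09, "1.0"),
    ("TCMDC-123640", 4.66, ">32"),
    ("TCMDC-124922", 4.55, "1.0"),
]

def is_active(mic: str) -> bool:
    """Assumption: a censored value such as '>32' means no measurable activity."""
    return not mic.startswith(">")

def hit_rate(rows):
    """Return (number active, number tested, fraction active)."""
    actives = [name for name, _, mic in rows if is_active(mic)]
    return len(actives), len(rows), len(actives) / len(rows)

n_active, n_tested, rate = hit_rate(results)
print(f"{n_active}/{n_tested} active ({rate:.1%})")  # prints "5/7 active (71.4%)"
```

The same ratio applied to the prospective tests reported earlier (198 actives out of ~750 tested) gives roughly 26%, consistent with the ">20% hit rate" figure.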
Filling out the triazine matrix using SARtable: A new kind of map
Green = good activity, Red = bad; colored dots are predictions
No relationship between internal or external ROC and the number of molecules in the training set?
PCA of combined data and ARRA (red)
Ekins et al., J Chem Inf Model 54: 2157-2165 (2014)
• Internal and leave out 50% x 100 ROC track each other
• External ROC shows less correlation
• Smaller models do just as well with external testing
~350,000
What matters most: >70 years of TB mouse in vivo data, 770 molecules. Mind the gap.
MIND THE TB GAP
Ekins et al., J Chem Inf Model 54: 1070-82, 2014
Ekins, Nuermberger & Freundlich DDT 19: 1279-1282, 2014
In vivo Machine Learning Models
Model | RP Forest | RP Single Tree | SVM | Bayesian
ROC 5 fold cross validation | 0.75 | 0.71 | 0.77 | 0.73
External test set | 3/11 (27.3%) | 4/11 (36.4%) | 7/11 (63.6%) | 8/11 (72.7%)
Ekins et al., J Chem Inf Model 54: 1070-82, 2014
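For readers unfamiliar with the ROC values reported for these models, here is a minimal sketch of how a ROC AUC can be computed from model scores, together with a simple 5-fold index split. The scores, labels, and function names are toy assumptions, not the data or code behind the slides.

```python
def roc_auc(scores, labels):
    """AUC as the chance that a random active outscores a random inactive (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k interleaved folds (assumes the data are already shuffled)."""
    return [list(range(i, n, k)) for i in range(k)]

# Toy example: six molecules with model scores and known activity labels.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(toy_scores, toy_labels)
folds = k_fold_indices(10, k=5)
print(round(auc, 3), folds)
```

In cross-validation each fold is held out once, the model is trained on the remaining folds, and the ROC is computed on the held-out predictions. An external test set, by contrast, is scored once by the final model.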
How can we find the in vivo active compounds?
We need a map..
>70 years of TB in vivo data
Green = in vivo mouse active
Empty = in vivo inactive
Yellow = 2013-2015 data
Uses Bayesian fingerprints and clustering by similarity
Clark and Ekins - unpublished
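Clustering by fingerprint similarity can be sketched as follows. The method behind the map is unpublished, so this is a generic illustration only: it assumes Tanimoto similarity on feature sets and a greedy leader clustering with a hypothetical threshold of 0.5, using toy integer features.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprint feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def leader_cluster(fps, threshold=0.5):
    """Greedy leader clustering: each molecule joins the first cluster whose leader is similar enough."""
    clusters = []  # each cluster is a list of indices; the first index is the leader
    for i, fp in enumerate(fps):
        for members in clusters:
            if tanimoto(fp, fps[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar leader found, start a new cluster
    return clusters

# Toy fingerprints: two chemotypes sharing most of their features within each group.
toy_fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}, {7, 8, 10}, {1, 2, 3, 4, 5}]
clusters = leader_cluster(toy_fps, threshold=0.5)
print(clusters)
```

Once molecules are grouped this way, coloring each member by its in vivo outcome shows which clusters are rich in actives and which actives sit among inactives.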
Clustering in vivo mouse TB data: hex plot
>70 years of TB in vivo data
Green = in vivo mouse active
Empty = in vivo inactive
Yellow = 2013-2015
Clark and Ekins - unpublished
Clustering in vivo mouse TB data
Triazine surrounded by inactives
Issues: high Log P, poor solubility
How do we ‘increase drug discovery’?
• Make data and models more accessible
• Collaborate
• Share
– Create mobile apps
• Encourage engagement from non scientists
MODELS RESIDE IN PAPERS, NOT ACCESSIBLE… THIS IS UNDESIRABLE
How do we share them?
How do we use them?
• CDD Vision
• Uses Bayesian algorithm and FCFP_6 fingerprints
Bayesian models
Clark et al., J Cheminform 6:38 2014
Predictions for the InhA target: (a) the ROC curve with ECFP_6 and FCFP_6 fingerprints; (b) modified Bayesian estimators for active and inactive compounds; (c) structures of selected binders.
For each listed target with at least two binders, it is first assumed that all of the molecules in the collection that do not indicate this as one of their targets are inactive.
In the app we used ECFP_6 fingerprints
Building Bayesian models for each target in TB Mobile
Clark et al., J Cheminform 6:38 2014
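The labeling rule described for the per-target models (targets with at least two binders, all other molecules in the collection treated as inactive) can be sketched as below. The molecule IDs are hypothetical, the target names other than InhA are illustrative, and the function is an assumption about how such training labels might be assembled, not the TB Mobile code.

```python
from collections import defaultdict

def per_target_labels(annotations, min_binders=2):
    """annotations maps molecule ID -> set of target names.

    Returns target -> {molecule ID: True if annotated binder, else False},
    keeping only targets with at least `min_binders` binders.
    """
    binders = defaultdict(set)
    for mol, targets in annotations.items():
        for target in targets:
            binders[target].add(mol)
    return {
        target: {mol: (mol in mols) for mol in annotations}
        for target, mols in binders.items()
        if len(mols) >= min_binders
    }

# Hypothetical collection: only InhA has two or more annotated binders.
annotations = {
    "mol1": {"InhA"},
    "mol2": {"InhA", "KasA"},
    "mol3": {"DprE1"},
    "mol4": set(),
}
labels = per_target_labels(annotations)
print(labels)
```

Each resulting label dictionary is one training set, so a collection with many annotated targets yields one Bayesian model per qualifying target.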
TB Mobile Vers.2
Ekins et al., J Cheminform 5:13, 2013
Clark et al., J Cheminform 6:38 2014
• Predict targets
• Cluster molecules
http://goo.gl/vPOKS
http://goo.gl/iDJFR
Predictions for 2013-2015 in vivo molecules
Bayesian models added to mobile apps: MMDS
Bayesian models added to mobile apps: Approved drugs
Human microsomal intrinsic clearance
Human protein binding
Solubility pH 7.4
AZ dataset models >1000 molecules
Models from ChEMBL data
http://molsync.com/bayesian2
What do 2000 ChEMBL models look like?
Plot: average ROC vs. folding bit size
http://molsync.com/bayesian2
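"Folding bit size" refers to collapsing hashed fingerprint features into a fixed number of bit positions. A minimal sketch follows, assuming integer feature hashes and a power-of-two bit size applied as a bit mask. The function name and example values are hypothetical, and the masking scheme is an assumption rather than a confirmed detail of the models.

```python
def fold(features, n_bits=1024):
    """Map integer feature hashes onto bit positions 0..n_bits-1 with a bit mask."""
    if n_bits <= 0 or n_bits & (n_bits - 1):
        raise ValueError("n_bits must be a power of two")
    mask = n_bits - 1
    return {f & mask for f in features}

# Toy feature hashes: 5 and 1029 collide at 1024 bits but stay distinct at 2048 bits.
toy_features = {5, 1029, 2048, 70000}
folded_1024 = fold(toy_features, n_bits=1024)
folded_2048 = fold({5, 1029}, n_bits=2048)
print(sorted(folded_1024), sorted(folded_2048))
```

Smaller bit sizes give more compact models but more collisions between unrelated features, which is the trade-off a plot of average ROC against folding bit size explores.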
Bigger datasets and model collections
• Profiling “big datasets” is going to be the norm.
• A recent study mined PubChem datasets for compounds that have rat in vivo acute toxicity data
• This could be used in other big data initiatives like ToxCast (> 1000 compounds x 800 assays) and Tox21 etc.
• Kinase screening data (1000s mols x 100s assays)
• GPCR datasets etc (1000s mols x 100s assays)
Zhang J, Hsieh JH, Zhu H (2014) Profiling Animal Toxicants by Automatically Mining Public Bioassay Data: A Big Data Approach for Computational Toxicology. PLoS ONE 9(6): e99863. doi:10.1371/journal.pone.0099863
• Data is at your fingertips instantly
• Labs add data to a massive corpus of knowledge
• Instantly available to all
• Algorithms for mining, prediction
• Millions of models accessible
• Making decisions on experiments needed and running them
• Data visualization, exploration is real-time, updated
• Data follows you
Sean Ekins, a computational drug discovery consultant at Collaborations in Chemistry in North Carolina, is much more skeptical. He notes pharma companies have found hundreds of antimalaria compounds more potent than TNP-470 and says that he is not convinced Eve can do QSAR. He wants to see Eve go head-to-head with a real computational chemist. “Eve should go back to the Garden of Eden and leave drug discovery to scientists who know what they are doing,” Ekins says.
How close are we?
Conclusions
• Computers and models do not replace scientists
• A tool to help us sift through ideas quickly
• Many examples have led to leads
• Bigger data not needed for good models
• More data becoming public
• Can model ADME, bioactivity and more
• Collaboration and software are important
• Mobile apps have useful cheminformatics features that help anyone do drug discovery
• Models are compact (< 1MB) and portable
• The age of model sharing is here
Wanted
• “Bigger” small molecule screening datasets
• Preferably > 500,000 – 1,000,000 molecules with data
• To test how machine learning algorithms scale
• Contact [email protected]
Nadia Litterman, Krishna Dole and all at CDD, Megan Coffee, SRI, MM4TB and many others …
Funding: Bill and Melinda Gates Foundation (Grant#49852), 1R41AI088893-01, 2R42AI088893-02, R43 LM011152-01, 9R44TR000942-02, 1R41AI108003-01, 1U19AI109713-01, MM4TB
Software: Biovia
Freundlich Lab