Computationally Leveraging the Collective: Mining Published Data and Crowdsourcing Consensus Models
Nicole C. Kleinstreuer
NICEATM Deputy Director
PI, Comp Tox Group, DIR/BCBB
Leveraging Artificial Intelligence and Machine Learning to Advance Environmental Health Research and Decisions
6th June 2019, NAS, Washington DC
Outline
• Problem Statement
• Application of AI/ML to Big Data in Toxicology
• Current and Future Projects:
– Endocrine Disruption
– Acute Oral Toxicity
– Automation
• Reference Data Identification
• Mechanistic Testing Toolbox
Environmental Chemical Disease Contributions
• Pesticides
– Cancer, neurodegenerative diseases, thyroid
• Consumer products
– Neurological, developmental, systemic
• Air pollutants
– Childhood ADHD, autism, allergic asthma
• Drinking water contaminants
– Systemic effects, cancer, neurological
• Endocrine Disruptors
– Developmental impairment, decreased fertility, cancer
… and many others
https://www.niehs.nih.gov/health/materials/index.cfm
Picture © ChemSec
Chemicals >> Data
• 80+ million substances synthesized
• 140,000 chemicals in commerce
(plus mixtures, natural products and metabolites)
• Less than 10% tested
[Figure: Number of Chemicals (Chemical Diversity) vs. Number of Assays (Biological Diversity), with Tox21 and ToxCast labeled]
High Throughput Screening (Tox21/ToxCast)
• Produce data on thousands of chemicals
• Prioritize compounds for hazard
• Develop predictive models for biological response in humans/ecosystems
• Reduce reliance on animal models
• NICEATM: comp tox + validation support to Tox21 and ICCVAM
Curated Legacy Data, e.g. REACH, ToxRefDB, ICE
Omics technologies, e.g. transcriptomics, metabolomics, exposomics
High-Throughput Screening, e.g. Tox21, EUToxRisk
Chemical Features, e.g. RDKit, PhysChem
Big Data + Machine Learning
Supporting Regulatory Decisions
Far too many chemicals to test with standard animal-based methods or even in vitro HTS
– Cost, time, animal welfare, human relevance
– >10,000 chemicals to be tested for EDSP, >50,000 for TSCA
– Fill the data gaps and bridge the lack of knowledge
Alternative
Endpoints:
Endocrine Disruption: Estrogen (ER) & Androgen (AR)
• Binding
• Agonism
• Antagonism
Acute Systemic Toxicity: Oral LD50s
• Toxic/Very toxic
• LD50 point estimates
• EPA categories
• GHS categories
• Obtain high quality training sets
– Apply best modeling practices
– Validate performance of models
– Define applicability domain and model limitations
– Use models to predict across large chemical sets
– Help inform regulatory decision making
Overall Approach
Mansouri et al. 2018
• Data & structure curation
» Flagged and curated files available
• Preparation of training and test sets
» Available as SDFiles and csv data files
• Initial descriptor calculation
» molecular descriptors and structural fingerprints generated and shared
• Variable selection technique
» e.g. genetic algorithm
• Selection of a mathematical method
» Test several approaches: KNN, PLS, SVM, RF…
• Validation of the model’s predictive ability
» 5-fold cross validation & external test set
• Define the Applicability Domain
» Local (nearest neighbors) and global (leverage) approaches
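The validation step above mentions 5-fold cross validation. A minimal sketch of how the fold indices might be generated, assuming a simple shuffled split (the function name, seed and sample count are illustrative and not taken from the projects described here):

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled samples into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Hypothetical training set of 23 chemicals.
splits = list(k_fold_indices(n_samples=23, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each chemical lands in exactly one test fold. The external test set is held out separately and never enters these folds.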
Modeling Steps and Considerations
CoMPARA: Collaborative Modeling Project for Androgen Receptor Activity (2017/18)
Endocrine Disruptor Screening Program (EDSP)
Mansouri et al. 2016 EHP 124:1023–1033
Mansouri et al. 2019 under review at EHP
CATMoS: Collaborative Acute Toxicity Modeling Suite (2018/19)
ICCVAM Acute Systemic Toxicity Workgroup
Kleinstreuer et al. 2018 Comp Tox; Mansouri et al. 2019 in prep
ICCVAM, NICEATM
Global Collaborative Projects
• 35 Participants/Groups from around the globe representing academia, industry, and government contributed
Consortium:
International Participation
(https://batchgeo.com/map/d06c5d497ed8f76ecfee500c2b0e1dfa)
              CERAPP participants models         CoMPARA participants models
              Categorical  Continuous  Total     Categorical  Continuous  Total
Binding       21           3           24        35           5           40
Agonist       11           3           14        21           5           26
Antagonist    8            2           10        22           3           25
Total         40           8           48        78           13          91
Models for Endocrine Disruption
Judson et al. Toxicol. Sci. (2015) 148: 137-154
Kleinstreuer et al. CRT (2017) 30(4): 946-964
Tox21/ToxCast ER Pathway Model Tox21/ToxCast AR Pathway Model
Validated pathway models provide training data for 1800 chemicals
CERAPP consensus (Sn = sensitivity, Sp = specificity, BA = balanced accuracy)
      Binding        Agonist        Antagonist
      Train  Test    Train  Test    Train  Test
Sn    0.93   0.58    0.85   0.94    0.67   0.18
Sp    0.97   0.92    0.98   0.94    0.94   0.90
BA    0.95   0.75    0.92   0.94    0.80   0.54

CoMPARA consensus
      Binding        Agonist        Antagonist
      Train  Test    Train  Test    Train  Test
Sn    0.99   0.69    0.95   0.74    1.00   0.61
Sp    0.91   0.87    0.98   0.97    0.95   0.87
BA    0.95   0.78    0.97   0.86    0.97   0.74
Distributions of the number of predicted chemical structures by all binding models.
Consensus Models Assessment
Training Set: 1.8k, Evaluation Set: 7k, Prediction Set: 32k
Informing Regulatory Decisions
Modeling Endpoints for Acute Oral Toxicity
EPA hazard categories:
I (≤ 50 mg/kg)
II (> 50 to ≤ 500 mg/kg)
III (> 500 to ≤ 5000 mg/kg)
IV (> 5000 mg/kg)
GHS hazard categories:
I (≤ 5 mg/kg)
II (> 5 to ≤ 50 mg/kg)
III (> 50 to ≤ 300 mg/kg)
IV (> 300 to ≤ 2000 mg/kg)
NC (> 2000 mg/kg)
Hazard / Packing Group:
Highly toxic (≤ 50 mg/kg)
Toxic (> 50 to 5000 mg/kg)
+ Nontoxic (> 2000 mg/kg)
+ Quantitative LD50 values
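The category thresholds above can be read as a lookup from an LD50 value to each classification scheme. A minimal sketch, assuming the bounds exactly as printed on the slide (the function and dictionary key names are illustrative):

```python
from bisect import bisect_left

# Inclusive upper bounds in mg/kg, copied from the category definitions above.
EPA_BOUNDS = [50, 500, 5000]
EPA_LABELS = ["I", "II", "III", "IV"]
GHS_BOUNDS = [5, 50, 300, 2000]
GHS_LABELS = ["I", "II", "III", "IV", "NC"]

def categorize(ld50, bounds, labels):
    # bisect_left keeps a value equal to a bound in the lower (more toxic)
    # category, which matches the "≤" in the definitions.
    return labels[bisect_left(bounds, ld50)]

def classify_ld50(ld50):
    """Map an oral LD50 (mg/kg) to the endpoints modeled in the project."""
    return {
        "EPA": categorize(ld50, EPA_BOUNDS, EPA_LABELS),
        "GHS": categorize(ld50, GHS_BOUNDS, GHS_LABELS),
        "very_toxic": ld50 <= 50,
        "nontoxic": ld50 > 2000,
    }

print(classify_ld50(300))
print(classify_ld50(2500))
```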
Rodent LD50 data obtained & curated for ~15k chemicals
QSAR-ready structures: ~12k
Bootstrapping of the standard deviations for repeat test chemicals (~1000) identified a 95% confidence interval for LD50 values of ±0.31 log10(mg/kg)
Defining a Confidence Range
[Figure: axis label LD50 (log10(mg/kg))]
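The slide does not spell out how the bootstrap was set up. The sketch below shows one generic way to bootstrap per-chemical standard deviations of replicate log10(LD50) values. It runs on synthetic data, and the sample sizes, noise level and the percentile interval on the mean SD are all assumptions for illustration. It does not reproduce the ±0.31 figure.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic stand-in data: replicate log10(LD50) values for 200 hypothetical chemicals.
replicates = []
for _ in range(200):
    true_value = rng.uniform(1.0, 4.0)
    n_repeats = rng.randint(2, 6)
    replicates.append([rng.gauss(true_value, 0.3) for _ in range(n_repeats)])

# One standard deviation per repeat-tested chemical.
per_chemical_sd = [statistics.stdev(values) for values in replicates]

def bootstrap_mean(values, n_resamples=2000):
    """Return the sample mean and a 95% percentile bootstrap interval for it."""
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(values, k=len(values))  # sample with replacement
        means.append(statistics.fmean(resample))
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return statistics.fmean(values), lower, upper

mean_sd, lower, upper = bootstrap_mean(per_chemical_sd)
print(f"mean SD = {mean_sd:.2f} log10(mg/kg), 95% interval [{lower:.2f}, {upper:.2f}]")
```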
Submitted Models
Consortium Comprised 35 Participating Groups
• Very Toxic: 32 models
• Non-toxic: 33 models
• EPA categories: 26 models
• GHS categories: 23 models
• LD50: 25 models
Total: 139 models
Support vector machine
Artificial Neural Networks
Regression Model
Bayesian Networks
XGBoost
kNN
Deep Learning
Random Forest
Training Set: 10k, Evaluation Set: 2k, Prediction Set: 48k
Evaluation procedure
Qualitative evaluation:
• Documentation
• Defined endpoint
• Unambiguous algorithm
• Availability of code
• Applicability domain definition
• Availability of data used for modeling
• Mechanistic interpretation
Quantitative evaluation:
- Goodness of fit: training statistics
- Evaluation set predictivity: statistics on the evaluation set
- Robustness: balance between goodness of fit and test set predictivity
The consensus predictions perform as well as replicate in vivo data at predicting oral acute toxicity outcome
CATMoS Consensus Model: in-domain predictions, weighted majority/average
Performance Assessment
                     Very Toxic     Non-Toxic      EPA            GHS
                     Train  Eval    Train  Eval    Train  Eval    Train  Eval
Sensitivity          0.87   0.67    0.93   0.70    0.73   0.50    0.63   0.45
Specificity          0.94   0.96    0.96   0.88    0.96   0.91    0.91   0.92
Balanced Accuracy    0.93   0.81    0.94   0.79    0.83   0.71    0.77   0.68
In vivo Balanced Accuracy: Very Toxic 0.81, Non-Toxic 0.89, EPA 0.82, GHS 0.79

LD50 values    Train  Eval   In Vivo
R2             0.84   0.64   0.80
RMSE           0.32   0.51   0.42
Consensus models outperformed all individual models for each endpoint
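The CATMoS consensus is described as combining in-domain predictions by weighted majority (for classes) or weighted average (for LD50 values). A minimal sketch of that idea, assuming each model reports a prediction, a weight and an applicability-domain flag. How the weights are derived is not stated on the slide, so the ones below are hypothetical.

```python
from collections import defaultdict

def consensus_class(predictions):
    """Weighted majority vote over (label, weight, in_domain) tuples."""
    votes = defaultdict(float)
    for label, weight, in_domain in predictions:
        if in_domain:  # skip models whose applicability domain excludes the chemical
            votes[label] += weight
    if not votes:
        return None  # no model covers this chemical
    return max(votes, key=votes.get)

def consensus_value(predictions):
    """Weighted average over (value, weight, in_domain) tuples."""
    used = [(value, weight) for value, weight, in_domain in predictions if in_domain]
    total = sum(weight for _, weight in used)
    if total == 0:
        return None
    return sum(value * weight for value, weight in used) / total

# Hypothetical model outputs for one chemical.
class_preds = [("toxic", 0.9, True), ("nontoxic", 0.6, True),
               ("nontoxic", 0.5, True), ("toxic", 0.8, False)]
value_preds = [(2.0, 1.0, True), (3.0, 3.0, True), (10.0, 5.0, False)]

print(consensus_class(class_preds))   # nontoxic
print(consensus_value(value_preds))   # 2.75
```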
• LD50: 28954
• VT: 23767
• NT: 30971
• EPA: 25487
• GHS: 25720
Consensus Implementation
Generalized CATMoS models: datasets
• High concordance among models
• Proportional distribution of:
– LD50 values
– VT/NT classes
– EPA/GHS categories
• Split into 75% training and 25% test set
• Calculate PaDEL & CDK2 descriptors
• Dimensionality reduction (missing values & low variance)
• Feature selection (most relevant descriptors for each endpoint)
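One way to obtain a 75%/25% split with a proportional class distribution is a stratified split. The sketch below assumes categorical labels and uses made-up class counts. Continuous LD50 values would first need to be binned, which is not shown.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.75, seed=0):
    """Split indices so each class keeps roughly the same share in both sets."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(train_fraction * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)

# Hypothetical class labels for 120 chemicals.
labels = ["VT"] * 20 + ["NT"] * 60 + ["other"] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```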
Generalized CATMoS models: new chemical predictions
New chemical to be predicted
Nearest neighbors ($N_i$) at distances $d_i$ from the new chemical
$w_i = f(d_i)$
If $d_1 \neq 0$: $\mathrm{Pred}_i = f(w_i, N_i)$
If $d_1 = 0$: $\mathrm{Pred}_i = N_i$
Automated, weighted-endpoint dependent read-across: weighted kNN
$d_i$: Euclidean distance based on the selected descriptors for each endpoint
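A minimal sketch of a weighted kNN prediction for a continuous endpoint. The slide leaves the weighting function f unspecified, so inverse-distance weighting is assumed here, and the descriptor vectors and values are made up.

```python
import math

def weighted_knn_predict(query, training, k=5):
    """Predict a value for `query` from (descriptor_vector, value) training pairs."""
    neighbors = sorted(
        ((math.dist(query, x), y) for x, y in training),
        key=lambda pair: pair[0],
    )[:k]
    nearest_distance, nearest_value = neighbors[0]
    if nearest_distance == 0:
        # Exact match in descriptor space: return the neighbor's own value.
        return nearest_value
    # Assumed weighting: w_i = 1 / d_i (all distances are > 0 in this branch).
    weights = [1.0 / d for d, _ in neighbors]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, neighbors))
    return weighted_sum / sum(weights)

# Hypothetical two-descriptor training set.
training = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0), ((0.0, 2.0), 4.0)]
print(weighted_knn_predict((0.0, 0.0), training, k=2))  # exact match
print(weighted_knn_predict((0.5, 0.0), training, k=2))  # two equidistant neighbors
```

For categorical endpoints the same weights could feed a weighted vote instead of an average.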
Consensus Implementation
OPERA Standalone app
Running Consensus Models
Mansouri et al. J Cheminform (2018). https://doi.org/10.1186/s13321-018-0263-1
Command line Graphical user interface
- Free, open-source & open-data
- Single chemical and batch mode
- Multiple platforms (Windows and Linux)
- Embeddable libraries (Java, C, C++, Python)
https://github.com/NIEHS/OPERA
OPERA predictions on EPA’s CompTox dashboard
Calculation result for a chemical
Model performance with full QMRF
Nearest Neighbors from Training Set
Mansouri et al. OPERA models (https://doi.org/10.1186/s13321-018-0263-1)
Williams et al. CompTox Chemistry Dashboard (https://doi.org/10.1186/s13321-017-0247-6)
https://comptox.epa.gov/dashboard
Summary
• Toxicology data can be synthesized and modeled effectively using AI & machine learning approaches.
– Consensus models are a powerful way to leverage collective expertise
– Other applications: exposure, use case, systematic review, etc.
• Machine learning models (e.g. QSARs) have already achieved limited acceptance in the regulatory space.
• Additional education, training, and communication will facilitate more widespread adoption.
https://ntp.niehs.nih.gov/pubhealth/evalatm/natl-strategy/
https://ice.ntp.niehs.nih.gov/
Manually Identifying Reference Data
Systematic literature search of publicly available data (e.g. PubMed)
Identify chemical activities measured in “guideline-like” uterotrophic studies
Identify a subset of in vivo reference chemicals
• Active chemicals verified in >2 independent studies
• Inactive chemicals verified in >2 independent studies (with no positive results in any study)
Kleinstreuer et al. EHP (2015)
Automating Reference Data Identification
• Project with Oak Ridge National Labs (ORNL) and FDA CFSAN to apply text-mining (NLP) approaches & ML to identify high-quality data
• Semi-automated retrieval and evaluation of published literature (trained on uterotrophic database)
• Apply to developmental toxicity studies (with ICCVAM DARTWG)
• Define literature search keywords, identify corpus
• Extract/characterize study protocol details from regulatory guidelines: minimum criteria
• Apply ML algorithms to identify high-quality studies, expert check
• Search published literature for assays/biomarkers to modernize carcinogenicity testing toolbox
• Initial search strategy: Work with NTP Report on Carcinogens and Office of Health Assessment and Translation to identify keywords
– 256 keywords mapped to Hallmarks of Cancer and Key Characteristics of Carcinogens
– 7 keywords for assays/biomarkers, crossed with HM/KC
• Recruit participants to screen and tag abstracts
– Metadata: KC, HM, Organism, Publication type, Study type
– MeSH terms automatically tracked for PubMed articles
Semi-supervised systematic review
Automating Toolbox Construction
Sysrev: Semi-automated review platform
https://sysrev.com/
• Freely available website
• Abstract screening and annotating
• Intuitive user interface
• Including mobile/tablet access
• Uses machine learning to rank the abstracts
@sysrev1
Register: https://sysrev.com/register
Project name: Hallmark and key characteristics mapping
Project link: https://sysrev.com/p/3588
Example: Carcinogenicity
Hallmarks of Cancer & Key Characteristics of Carcinogens
• Inflammation
• Oxidative stress
• Genotoxicity/instability
• Angiogenesis
• Immortalization/proliferation
• Immunosuppression
• Invasion/metastasis
• Specific receptor- or enzyme-mediated
Mechanistic Mapping of HTS Assays
Hanahan & Weinberg 2011; Smith et al. 2016; Guyton et al. 2018; Chiu et al. 2018
Acknowledgments
• Kamel Mansouri
• Richard Judson
• ILS/NICEATM
• Alexandre Borrel
• ICCVAM partners
• ICATM partners
• Modeling consortium participants
• Tom Luechtefeld
Extra Slides
QSAR-ready KNIME workflow
Removal of duplicates
Normalization of tautomers
Cleaning of salts and counterions
Removal of inorganics and mixtures
Final inspection
QSAR-ready structures
Indigo
Aim of the workflow:
• Combine different procedures and ideas
• Minimize the differences between the structures used for prediction
• Produce a flexible free and open source workflow to be shared
Structure standardization procedure
Mansouri et al. (http://ehp.niehs.nih.gov/15-10267/)
Fourches et al. J Chem Inf Model, 2010, 29, 476–488
Wedebye et al. Danish EPA Environmental Project No. 1503, 2013
Total binders: 3961
Agonists: 2494
Antagonists: 2793
Consensus Qualitative Accuracy
                     ToxCast data    Literature data    Literature data
                                     (All: 7283)        (>6 sources: 1209)
Sensitivity          0.93            0.30               0.87
Specificity          0.97            0.91               0.94
Balanced accuracy    0.95            0.61               0.91
                      ToxCast data (training set)    Literature data (test set)
Observed\Predicted    Actives    Inactives           Actives    Inactives
Actives               83         6                   597        1385
Inactives             40         1400                463        4838
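The sensitivity, specificity and balanced accuracy reported for the consensus follow directly from the confusion-matrix counts above (rows are observed, columns are predicted). A short check, with an illustrative function name:

```python
def binary_metrics(tp, fn, fp, tn):
    """Return (sensitivity, specificity, balanced accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)   # observed actives predicted active
    specificity = tn / (tn + fp)   # observed inactives predicted inactive
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy

toxcast = binary_metrics(tp=83, fn=6, fp=40, tn=1400)
literature = binary_metrics(tp=597, fn=1385, fp=463, tn=4838)

for name, (sn, sp, ba) in [("ToxCast", toxcast), ("Literature", literature)]:
    print(f"{name}: Sn={sn:.2f} Sp={sp:.2f} BA={ba:.2f}")
# ToxCast: Sn=0.93 Sp=0.97 BA=0.95
# Literature: Sn=0.30 Sp=0.91 BA=0.61
```

The low sensitivity on the full literature set comes from the 1385 observed actives that the consensus predicts as inactive.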
Prediction Accuracy Strongly Depends on Data Quality
ROC curve of the external validation set (literature)
Mansouri et al. (2016) EHP 124:1023–1033 DOI:10.1289/ehp.1510267
757 chemicals have >75% active concordance
Actives
Inactives
Prioritization
Most models predict most chemicals as inactive
Only a small fraction of chemicals are prioritized for further testing
Mansouri et al. (2016) EHP 124:1023–1033 DOI:10.1289/ehp.1510267
Chemical Prioritization
OPERA v2
Model Property
AOH Atmospheric Hydroxylation Rate
BCF Bioconcentration Factor
BioHL Biodegradation Half-life
RB Ready Biodegradability
BP Boiling Point
HL Henry's Law Constant
KM Fish Biotransformation Half-life
KOA Octanol/Air Partition Coefficient
LogP Octanol-water Partition Coefficient
MP Melting Point
KOC Soil Adsorption Coefficient
VP Vapor Pressure
WS Water solubility
RT HPLC retention time
OPERA 1.5: Physchem & Environmental fate (models listed above)
New in OPERA 2:
• Physchem properties:
– General structural properties
– pKa
– Log D
• Toxicity endpoints:
– ER activity (CERAPP): https://ehp.niehs.nih.gov/15-10267/
– AR activity (CoMPARA): https://doi.org/10.13140/RG.2.2.19612.80009
– Acute toxicity (CATMoS): https://doi.org/10.1016/j.comtox.2018.08.002
• ADME properties:
– Plasma fraction unbound (FuB)
– Intrinsic clearance (Clint)
Addressing Risk Probabilistically
[Figure labels: Prior, Posterior, Risk, % Population; axis ticks 0.1, 1, 10]
+ Human-Relevant Mechanistic Information (from HTS assays, 3D organotypic systems, QSAR models, targeted animal studies, etc.)
+ Exposure Data and Population Genetics (from biomonitoring studies, high-throughput transcriptomics, GWAS studies, etc.)