Computationally Leveraging the Collective: Mining Published Data and Crowdsourcing Consensus Models. Nicole C. Kleinstreuer, NICEATM Deputy Director; PI, Comp Tox Group, DIR/BCBB. Leveraging Artificial Intelligence and Machine Learning to Advance Environmental Health Research and Decisions, 6th June 2019, NAS, Washington DC


Page 1: Computationally Leveraging the Collective: Mining ...nas-sites.org › emergingscience › files › 2019 › 07 › 07_Kleinstreuer.pdf · • Toxicology data can be synthesized

Computationally Leveraging the Collective: Mining Published Data and Crowdsourcing Consensus Models

Nicole C. Kleinstreuer

NICEATM Deputy Director

PI, Comp Tox Group, DIR/BCBB

Leveraging Artificial Intelligence and Machine Learning to Advance Environmental Health Research and Decisions

6th June 2019, NAS, Washington DC

Page 2

Outline

• Problem Statement

• Application of AI/ML to Big Data in Toxicology

• Current and Future Projects:

– Endocrine Disruption

– Acute Oral Toxicity

– Automation

• Reference Data Identification

• Mechanistic Testing Toolbox

Page 3

Environmental Chemical Disease Contributions

• Pesticides

– Cancer, neurodegenerative diseases, thyroid

• Consumer products

– Neurological, developmental, systemic

• Air pollutants

– Childhood ADHD, autism, allergic asthma

• Drinking water contaminants

– Systemic effects, cancer, neurological

• Endocrine Disruptors

– Developmental impairment, decreased fertility, cancer

…and many others

https://www.niehs.nih.gov/health/materials/index.cfm

Page 4

Chemicals >> Data

Picture © ChemSec

• 80+ million substances synthesized

• 140,000 chemicals in commerce (plus mixtures, natural products and metabolites)

• Less than 10% tested

Page 5

High Throughput Screening (Tox21/ToxCast)

[Figure: Tox21 and ToxCast compared by number of chemicals (chemical diversity) and number of assays (biological diversity)]

• Produce data on thousands of chemicals

• Prioritize compounds for hazard

• Develop predictive models for biological response in humans/ecosystems

• Reduce reliance on animal models

• NICEATM: comp tox + validation support to Tox21 and ICCVAM

Page 6

Big Data + Machine Learning

• Curated legacy data, e.g. REACH, ToxRefDB, ICE

• Omics technologies, e.g. transcriptomics, metabolomics, exposomics

• High-throughput screening, e.g. Tox21, EUToxRisk

• Chemical features, e.g. RDKit, PhysChem

Page 7

Supporting Regulatory Decisions

Far too many chemicals to test with standard animal-based methods or even in vitro HTS

– Cost, time, animal welfare, human relevance

– >10,000 chemicals to be tested for EDSP, >50,000 for TSCA

– Fill the data gaps and bridge the lack of knowledge

Alternative

Endpoints:

Endocrine Disruption: Estrogen (ER) & Androgen (AR)

• Binding

• Agonism

• Antagonism

Acute Systemic Toxicity: Oral LD50s

• Toxic / Very toxic

• LD50 point estimates

• EPA categories

• GHS categories

Page 8

Overall Approach

• Obtain high quality training sets

– Apply best modeling practices

– Validate performance of models

– Define applicability domain and model limitations

– Use models to predict across large chemical sets

– Help inform regulatory decision making

Mansouri et al. 2018

Page 9

• Data & structure curation

» Flagged and curated files available

• Preparation of training and test sets

» Available as SDFiles and csv data files

• Initial descriptor calculation

» molecular descriptors and structural fingerprints generated and shared

• Variable selection technique

» e.g. genetic algorithm

• Selection of a mathematical method

» Test several approaches: KNN, PLS, SVM, RF…

• Validation of the model’s predictive ability

» 5-fold cross validation & external test set

• Define the Applicability Domain

» Local (nearest neighbors) and global (leverage) approaches

Modeling Steps and Considerations
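The validation and applicability-domain steps listed on this slide can be illustrated with a small, standard-library-only sketch. It is not the consortium's code: the descriptors are made-up toy points, kNN stands in for any of the methods named on the slide, and the function names and the distance threshold are assumptions chosen for illustration.

```python
import math
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote over the k nearest training points (binary labels 0/1)."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    nearest = order[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def cross_validate(x, y, k_folds=5, k_neighbors=3):
    """Return the mean accuracy over the held-out folds."""
    scores = []
    for fold in k_fold_indices(len(x), k_folds):
        held_out = set(fold)
        train_x = [x[i] for i in range(len(x)) if i not in held_out]
        train_y = [y[i] for i in range(len(x)) if i not in held_out]
        correct = sum(knn_predict(train_x, train_y, x[i], k_neighbors) == y[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)


def in_local_domain(train_x, query, threshold, k=3):
    """Local applicability domain: mean distance to the k nearest neighbors.

    The threshold is a hypothetical cutoff; the slide does not give one.
    """
    distances = sorted(euclidean(p, query) for p in train_x)[:k]
    return sum(distances) / len(distances) <= threshold


# Toy, made-up descriptors: two well-separated clusters of ten points each.
X = [(0.1 * i, 0.0) for i in range(10)] + [(5.0 + 0.1 * i, 5.0) for i in range(10)]
Y = [0] * 10 + [1] * 10

print("5-fold CV accuracy:", cross_validate(X, Y))
print("near training data:", in_local_domain(X, (0.5, 0.1), threshold=1.0))
print("far from training data:", in_local_domain(X, (20.0, 20.0), threshold=1.0))
```

A global (leverage-based) domain check would need the descriptor matrix inverse and is left out to keep the sketch short.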

Page 10

Global Collaborative Projects

Endocrine Disruptor Screening Program (EDSP)

• CoMPARA: Collaborative Modeling Project for Androgen Receptor Activity (2017/18)

• Mansouri et al. 2016 EHP 124:1023–1033; Mansouri et al. 2019 under review at EHP

ICCVAM Acute Systemic Toxicity Workgroup

• CATMoS: Collaborative Acute Toxicity Modeling Suite (2018/19)

• Kleinstreuer et al. 2018 Comp Tox; Mansouri et al. 2019 in prep

ICCVAM, NICEATM

Page 11

International Participation

Consortium:

• 35 participants/groups from around the globe, representing academia, industry, and government, contributed

(https://batchgeo.com/map/d06c5d497ed8f76ecfee500c2b0e1dfa)

Page 12

Models for Endocrine Disruption

Tox21/ToxCast ER Pathway Model: Judson et al. Toxicol. Sci. (2015) 148: 137-154

Tox21/ToxCast AR Pathway Model: Kleinstreuer et al. CRT (2017) 30(4): 946-964

Validated pathway models provide training data for 1800 chemicals

             CERAPP participants' models       CoMPARA participants' models
             Categorical  Continuous  Total    Categorical  Continuous  Total
Binding      21           3           24       35           5           40
Agonist      11           3           14       21           5           26
Antagonist   8            2           10       22           3           25
Total        40           8           48       78           13          91

Page 13

Consensus Models Assessment

Training Set: 1.8k, Evaluation Set: 7k, Prediction Set: 32k

CERAPP consensus

      Binding        Agonist        Antagonist
      Train  Test    Train  Test    Train  Test
Sn    0.93   0.58    0.85   0.94    0.67   0.18
Sp    0.97   0.92    0.98   0.94    0.94   0.90
BA    0.95   0.75    0.92   0.94    0.80   0.54

CoMPARA consensus

      Binding        Agonist        Antagonist
      Train  Test    Train  Test    Train  Test
Sn    0.99   0.69    0.95   0.74    1.00   0.61
Sp    0.91   0.87    0.98   0.97    0.95   0.87
BA    0.95   0.78    0.97   0.86    0.97   0.74

(Sn: sensitivity; Sp: specificity; BA: balanced accuracy)

[Figure: distributions of the number of predicted chemical structures by all binding models]

Page 14

Informing Regulatory Decisions

Page 15

Modeling Endpoints for Acute Oral Toxicity

• EPA hazard categories: I (≤50 mg/kg); II (>50 to ≤500 mg/kg); III (>500 to ≤5000 mg/kg); IV (>5000 mg/kg)

• GHS hazard categories (packing group): I (≤5 mg/kg); II (>5 to ≤50 mg/kg); III (>50 to ≤300 mg/kg); IV (>300 to ≤2000 mg/kg); NC (>2000 mg/kg)

• Hazard: Highly toxic (≤50 mg/kg); Toxic (>50-5000 mg/kg)

• + Nontoxic (>2000 mg/kg)

• + Quantitative LD50 values

Rodent LD50 data obtained & curated for ~15k chemicals

QSAR-ready structures: ~12k
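The category boundaries on this slide translate directly into threshold checks. The sketch below encodes only the cutoffs shown above (in mg/kg); the function names are hypothetical and are not part of any CATMoS or OPERA API.

```python
def epa_category(ld50_mg_per_kg):
    """EPA category from an oral LD50, using the cutoffs listed on the slide."""
    if ld50_mg_per_kg <= 50:
        return "I"
    if ld50_mg_per_kg <= 500:
        return "II"
    if ld50_mg_per_kg <= 5000:
        return "III"
    return "IV"


def ghs_category(ld50_mg_per_kg):
    """GHS category from an oral LD50; NC means not classified (>2000 mg/kg)."""
    if ld50_mg_per_kg <= 5:
        return "I"
    if ld50_mg_per_kg <= 50:
        return "II"
    if ld50_mg_per_kg <= 300:
        return "III"
    if ld50_mg_per_kg <= 2000:
        return "IV"
    return "NC"


def is_very_toxic(ld50_mg_per_kg):
    """Binary 'very toxic' endpoint: LD50 at or below 50 mg/kg."""
    return ld50_mg_per_kg <= 50


def is_nontoxic(ld50_mg_per_kg):
    """Binary 'nontoxic' endpoint: LD50 above 2000 mg/kg."""
    return ld50_mg_per_kg > 2000


for ld50 in (3, 40, 250, 1500, 6000):
    print(ld50, epa_category(ld50), ghs_category(ld50), is_very_toxic(ld50), is_nontoxic(ld50))
```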

Page 16

Defining a Confidence Range

Bootstrapping of the standard deviations for repeat test chemicals (~1000) identified a 95% confidence interval for LD50 values of ±0.31 log10(mg/kg)

[Figure: axis label LD50 (log10(mg/kg))]
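The slide does not spell out the bootstrap procedure, so the following is only a sketch of one plausible reading: compute a standard deviation per repeat-tested chemical, resample those standard deviations with replacement, and turn the mean into a 95% margin with a normal approximation. The replicate values are invented, and the 1.96 multiplier is an assumption rather than the authors' stated method.

```python
import random
import statistics


def bootstrap_mean(values, n_resamples=2000, seed=0):
    """Return the 2.5th and 97.5th percentiles of the resampled mean."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(sample))
    means.sort()
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples) - 1]


# Hypothetical replicate log10(LD50, mg/kg) values for repeat-tested chemicals.
replicates = {
    "chem_A": [2.10, 2.35, 2.20],
    "chem_B": [3.00, 3.40],
    "chem_C": [1.50, 1.62, 1.45, 1.70],
    "chem_D": [2.90, 2.95],
    "chem_E": [3.80, 3.30, 3.55],
    "chem_F": [0.90, 1.20],
}

# One standard deviation per chemical, then bootstrap the mean of those values.
sds = [statistics.stdev(values) for values in replicates.values()]
lo, hi = bootstrap_mean(sds)
mean_sd = statistics.fmean(sds)

# Normal approximation (an assumption): 95% margin is about 1.96 standard deviations.
margin = 1.96 * mean_sd
print(f"mean SD = {mean_sd:.3f}, bootstrap 95% range = ({lo:.3f}, {hi:.3f})")
print(f"approximate 95% margin = +/-{margin:.3f} log10(mg/kg)")
```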

Page 17

Submitted Models

Consortium Comprised 35 Participating Groups

• Very Toxic: 32 models

• Non-toxic: 33 models

• EPA categories: 26 models

• GHS categories: 23 models

• LD50: 25 models

Total: 139 models

[Figure: modeling approaches: Support Vector Machine, Artificial Neural Networks, Regression Model, Bayesian Networks, XGBoost, kNN, Deep Learning, Random Forest]

Training Set: 10k, Evaluation Set: 2k, Prediction Set: 48k

Page 18

Evaluation procedure

Qualitative evaluation:

• Documentation

• Defined endpoint

• Unambiguous algorithm

• Availability of code

• Applicability domain definition

• Availability of data used for modeling

• Mechanistic interpretation

Quantitative evaluation:

- Goodness of fit: training statistics

- Evaluation set predictivity: statistics on the evaluation set

- Robustness: balance between (Goodness of fit) & (Test set predictivity)

Page 19

Performance Assessment

CATMoS Consensus Model: in-domain predictions, weighted majority/average

                            Very Toxic     Non-Toxic      EPA            GHS
                            Train  Eval    Train  Eval    Train  Eval    Train  Eval
Sensitivity                 0.87   0.67    0.93   0.70    0.73   0.50    0.63   0.45
Specificity                 0.94   0.96    0.96   0.88    0.96   0.91    0.91   0.92
Balanced Accuracy           0.93   0.81    0.94   0.79    0.83   0.71    0.77   0.68
In vivo Balanced Accuracy   0.81           0.89           0.82           0.79

        LD50 values
        Train  Eval  In Vivo
R2      0.84   0.64  0.80
RMSE    0.32   0.51  0.42

The consensus predictions perform as well as replicate in vivo data at predicting oral acute toxicity outcome

Consensus models outperformed all individual models for each endpoint
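The phrase "in-domain predictions, weighted majority/average" can be read as the following combination rule, shown here as a minimal sketch. How CATMoS actually derived its weights is not stated on the slide; the weights, labels, and function names below are hypothetical.

```python
def consensus_class(predictions):
    """Weighted majority vote over in-domain categorical predictions.

    predictions: list of (label, weight, in_domain) tuples, one per model.
    Returns None when no model covers the chemical.
    """
    totals = {}
    for label, weight, in_domain in predictions:
        if in_domain:
            totals[label] = totals.get(label, 0.0) + weight
    if not totals:
        return None
    return max(totals, key=totals.get)


def consensus_value(predictions):
    """Weighted average over in-domain continuous predictions (e.g. log10 LD50)."""
    used = [(value, weight) for value, weight, in_domain in predictions if in_domain]
    total_weight = sum(weight for _, weight in used)
    if total_weight == 0:
        return None
    return sum(value * weight for value, weight in used) / total_weight


# Hypothetical model outputs; the out-of-domain prediction is ignored.
votes = [("toxic", 0.9, True), ("nontoxic", 0.6, True),
         ("nontoxic", 0.5, True), ("toxic", 1.0, False)]
values = [(2.0, 1.0, True), (3.0, 3.0, True), (10.0, 5.0, False)]

print(consensus_class(votes))
print(consensus_value(values))
```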

Page 20

Consensus Implementation

Generalized CATMoS models: datasets

• LD50: 28954

• VT: 23767

• NT: 30971

• EPA: 25487

• GHS: 25720

• High concordance among models

• Proportional distribution of:

– LD50 values

– VT/NT classes

– EPA/GHS categories

• Split into 75% training and 25% test set

• Calculate PaDEL & CDK2 descriptors

• Dimensionality reduction (missing values & low variance)

• Feature selection (most relevant descriptors for each endpoint)
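Two of the dataset-preparation steps listed on this slide, the proportional 75/25 split and the removal of descriptors with missing values or low variance, can be sketched with the standard library. The variance cutoff and the toy data are assumptions; descriptor calculation (PaDEL, CDK2) is outside the scope of this sketch.

```python
import random
import statistics
from collections import defaultdict


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Split indices so each class keeps about the same train/test proportion."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(train_fraction * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)


def keep_informative_columns(rows, min_variance=1e-3):
    """Return indices of descriptor columns with no missing values (None)
    and a population variance above the (hypothetical) cutoff."""
    kept = []
    for col in range(len(rows[0])):
        values = [row[col] for row in rows]
        if any(v is None for v in values):
            continue
        if statistics.pvariance(values) > min_variance:
            kept.append(col)
    return kept


# Toy labels: 8 very toxic (VT) and 12 nontoxic (NT) chemicals.
labels = ["VT"] * 8 + ["NT"] * 12
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))

# Toy descriptor table: column 1 is constant, column 2 has a missing value.
descriptors = [[1.0, 0.5, None], [2.0, 0.5, 1.0], [3.0, 0.5, 2.0]]
print(keep_informative_columns(descriptors))
```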

Page 21

Consensus Implementation

Generalized CATMoS models: new chemical predictions

Automated, weighted-endpoint dependent read-across: weighted kNN

New chemical to be predicted

Nearest neighbors ($N_i$) at distances $d_i$

$w_i = f(d_i)$

If $d_1 \neq 0$: $\mathrm{Pred}_i = f(w_i, N_i)$

If $d_1 = 0$: $\mathrm{Pred}_i = N_i$

$d_i$: Euclidean distance based on the selected descriptors for each endpoint
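A minimal sketch of the weighted kNN read-across described on this slide follows. The slide leaves the weighting function f unspecified, so inverse distance is used here purely as an assumption; the zero-distance case returns the matching neighbor's value, as on the slide. Function and variable names are hypothetical.

```python
import math


def weighted_knn_predict(train_x, train_y, query, k=5):
    """Distance-weighted kNN prediction for a continuous endpoint.

    Weights are 1/d (an assumed choice of f). If the nearest neighbor sits at
    distance zero, its value is returned directly.
    """
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    nearest_distance, nearest_value = neighbors[0]
    if nearest_distance == 0:
        return nearest_value
    weights = [1.0 / distance for distance, _ in neighbors]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, neighbors))
    return weighted_sum / sum(weights)


# Toy descriptors and endpoint values (made up for illustration).
train_x = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
train_y = [1.0, 2.0, 4.0]

print(weighted_knn_predict(train_x, train_y, (1.0, 0.0), k=2))  # exact match in training set
print(weighted_knn_predict(train_x, train_y, (2.0, 0.0), k=2))  # two equidistant neighbors
```

For categorical endpoints the same weights could feed a weighted vote instead of a weighted mean.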

Page 22

Running Consensus Models

OPERA Standalone app

Command line and graphical user interface

- Free, open-source & open-data

- Single chemical and batch mode

- Multiple platforms (Windows and Linux)

- Embeddable libraries (Java, C, C++, Python)

Mansouri et al. J Cheminform (2018). https://doi.org/10.1186/s13321-018-0263-1

https://github.com/NIEHS/OPERA

Page 23

OPERA predictions on EPA’s CompTox dashboard

[Figure panels: Calculation Result for a chemical; Model Performance with full QMRF; Nearest Neighbors from Training Set]

Mansouri et al. OPERA models (https://doi.org/10.1186/s13321-018-0263-1)

Williams et al. CompTox Chemistry Dashboard (https://doi.org/10.1186/s13321-017-0247-6)

https://comptox.epa.gov/dashboard

Page 24

Summary

• Toxicology data can be synthesized and modeled effectively using AI & machine learning approaches.

– Consensus models are a powerful way to leverage collective expertise

– Other applications: exposure, use case, systematic review, etc.

• Machine learning models (e.g. QSARs) have already achieved limited acceptance in the regulatory space.

• Additional education, training, and communication will facilitate more widespread adoption.

https://ntp.niehs.nih.gov/pubhealth/evalatm/natl-strategy/

https://ice.ntp.niehs.nih.gov/

Page 25

Manually Identifying Reference Data

Systematic literature search of publicly available data (e.g. PubMed)

Identify chemical activities measured in “guideline-like” uterotrophic studies

Identify a subset of in vivo reference chemicals

• Active chemicals verified in >2 independent studies

• Inactive chemicals verified in >2 independent studies (with no positive results in any study)

Kleinstreuer et al. EHP (2015)
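The reference-chemical rules on this slide can be expressed as a simple filter over curated study records. This is a sketch under assumptions: the record layout and names are hypothetical, the ">2 independent studies" wording is implemented literally as "more than two", and an active chemical is not disqualified by additional negative studies because the slide does not say so.

```python
from collections import defaultdict


def classify_reference_chemicals(study_results, more_than=2):
    """Assign 'active' or 'inactive' reference status per chemical.

    study_results: iterable of (chemical, study_id, is_active) records.
    Chemicals meeting neither rule are left out of the result.
    """
    positives = defaultdict(set)
    negatives = defaultdict(set)
    for chemical, study_id, is_active in study_results:
        target = positives if is_active else negatives
        target[chemical].add(study_id)

    reference = {}
    for chemical in set(positives) | set(negatives):
        n_positive = len(positives.get(chemical, ()))
        n_negative = len(negatives.get(chemical, ()))
        if n_positive > more_than:
            reference[chemical] = "active"
        elif n_positive == 0 and n_negative > more_than:
            reference[chemical] = "inactive"
    return reference


# Hypothetical curated records from guideline-like studies.
records = (
    [("chem_A", f"s{i}", True) for i in range(3)]      # 3 positive studies
    + [("chem_B", f"s{i}", False) for i in range(3)]   # 3 negative studies
    + [("chem_C", f"s{i}", False) for i in range(3)]   # 3 negative, 1 positive
    + [("chem_C", "s9", True)]
    + [("chem_D", f"s{i}", True) for i in range(2)]    # only 2 positive studies
)
print(classify_reference_chemicals(records))
```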

Page 26

Automating Reference Data Identification

• Project with Oak Ridge National Labs (ORNL) and FDA CFSAN to apply text-mining (NLP) approaches & ML to identify high-quality data

• Semi-automated retrieval and evaluation of published literature (trained on uterotrophic database)

• Apply to developmental toxicity studies (with ICCVAM DARTWG)

• Define literature search keywords, identify corpus

• Extract/characterize study protocol details from regulatory guidelines: minimum criteria

• Apply ML algorithms to identify high-quality studies, expert check

Page 27

Automating Toolbox Construction

Semi-supervised systematic review

• Search published literature for assays/biomarkers to modernize carcinogenicity testing toolbox

• Initial search strategy: Work with NTP Report on Carcinogens and Office of Health Assessment and Translation to identify keywords

– 256 keywords mapped to Hallmarks of Cancer and Key Characteristics of Carcinogens

– 7 keywords for assays/biomarker, crossed with HM/KC

• Recruit participants to screen and tag abstracts

– Metadata: KC, HM, Organism, Publication type, Study type

– MeSH terms automatically tracked for PubMed articles

Page 28

Sysrev: Semi-automated review platform

https://sysrev.com/

• Freely available website

• Abstract screening and annotating

• Intuitive user interface

• Including mobile/tablet access

• Uses machine learning to rank the abstracts

@sysrev1

Register: https://sysrev.com/register

Project name: Hallmark and key characteristics mapping

Project link: https://sysrev.com/p/3588

Page 29

Mechanistic Mapping of HTS Assays

Example: Carcinogenicity

Hallmarks of Cancer & Key Characteristics of Carcinogens

• Inflammation

• Oxidative stress

• Genotoxicity/instability

• Angiogenesis

• Immortalization/proliferation

• Immunosuppression

• Invasion/metastasis

• Specific receptor- or enzyme-mediated

Hanahan & Weinberg 2011; Smith et al. 2016; Guyton et al. 2018; Chiu et al. 2018

Page 30

Acknowledgments

• Kamel Mansouri

• Richard Judson

• ILS/NICEATM

• Alexandre Borrel

• ICCVAM partners

• ICATM partners

• Modeling consortium participants

• Tom Luechtefeld

Page 31

Extra Slides

Page 32

Structure standardization procedure

QSAR-ready KNIME workflow

• Remove duplicates

• Normalize tautomers

• Clean salts and counterions

• Remove inorganics and mixtures

• Final inspection

• QSAR-ready structures

Indigo

Aim of the workflow:

• Combine different procedures and ideas

• Minimize the differences between the structures used for prediction

• Produce a flexible, free and open source workflow to be shared

Mansouri et al. (http://ehp.niehs.nih.gov/15-10267/)

Fourches et al. J Chem Inf Model, 2010, 29, 476 – 488

Wedebye et al. Danish EPA Environmental Project No. 1503, 2013

Page 33

Prediction Accuracy Strongly Depends on Data Quality

Total binders: 3961; Agonists: 2494; Antagonists: 2793

Consensus Qualitative Accuracy

                     ToxCast data   Literature data   Literature data
                                    (All: 7283)       (>6 sources: 1209)
Sensitivity          0.93           0.30              0.87
Specificity          0.97           0.91              0.94
Balanced accuracy    0.95           0.61              0.91

                     ToxCast data (training set)   Literature data (test set)
Observed\Predicted   Actives   Inactives           Actives   Inactives
Actives              83        6                   597       1385
Inactives            40        1400                463       4838

[Figure: ROC curve of the external validation set (literature)]

Mansouri et al. (2016) EHP 124:1023–1033 DOI:10.1289/ehp.1510267
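The accuracy figures on this slide follow from the two confusion matrices, reading rows as observed and columns as predicted classes. The short check below recomputes sensitivity, specificity, and balanced accuracy from the counts shown; the function name is an illustrative choice.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return (sensitivity, specificity, balanced accuracy) from confusion counts."""
    sensitivity = tp / (tp + fn)   # observed actives predicted active
    specificity = tn / (tn + fp)   # observed inactives predicted inactive
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy


# Counts taken from the slide (rows: observed, columns: predicted).
toxcast = classification_metrics(tp=83, fn=6, fp=40, tn=1400)
literature = classification_metrics(tp=597, fn=1385, fp=463, tn=4838)

for name, metrics in (("ToxCast (training)", toxcast), ("Literature (all)", literature)):
    sn, sp, ba = metrics
    print(f"{name}: Sn={sn:.2f} Sp={sp:.2f} BA={ba:.2f}")
```

The rounded results agree with the ToxCast and "Literature data (All: 7283)" columns of the accuracy table, and the literature counts sum to 7283.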

Page 34

Chemical Prioritization

[Figure: Prioritization, showing Actives and Inactives]

Most models predict most chemicals as inactive

757 chemicals have >75% active concordance

Only a small fraction of chemicals are prioritized for further testing

Mansouri et al. (2016) EHP 124:1023–1033 DOI:10.1289/ehp.1510267
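Prioritization by active concordance can be sketched as the fraction of models that call a chemical active, with chemicals above the slide's 75% cutoff flagged for further testing. The data layout, the handling of missing predictions, and the function names are assumptions for illustration.

```python
def active_concordance(model_calls):
    """Fraction of models predicting active (1), ignoring missing calls (None)."""
    calls = [call for call in model_calls if call is not None]
    if not calls:
        return None
    return sum(calls) / len(calls)


def prioritize(predictions, cutoff=0.75):
    """Return (chemical, concordance) pairs above the cutoff, highest first."""
    flagged = []
    for chemical, calls in predictions.items():
        score = active_concordance(calls)
        if score is not None and score > cutoff:
            flagged.append((chemical, score))
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical per-model calls: 1 = active, 0 = inactive, None = no prediction.
predictions = {
    "chem_1": [1, 1, 1, 1, 0],
    "chem_2": [0, 0, 1, 0, None],
    "chem_3": [1, 1, 1, 1],
}
print(prioritize(predictions))
```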

Page 35

OPERA v2

OPERA 1.5

Physchem & Environmental fate:

Model   Property
AOH     Atmospheric Hydroxylation Rate
BCF     Bioconcentration Factor
BioHL   Biodegradation Half-life
RB      Ready Biodegradability
BP      Boiling Point
HL      Henry's Law Constant
KM      Fish Biotransformation Half-life
KOA     Octanol/Air Partition Coefficient
LogP    Octanol-water Partition Coefficient
MP      Melting Point
KOC     Soil Adsorption Coefficient
VP      Vapor Pressure
WS      Water solubility
RT      HPLC retention time

New in OPERA 2:

• Physchem properties:

– General structural properties

– pKa

– Log D

• Toxicity endpoints

– ER activity (CERAPP): https://ehp.niehs.nih.gov/15-10267/

– AR activity (CoMPARA): https://doi.org/10.13140/RG.2.2.19612.80009

– Acute toxicity (CATMoS): https://doi.org/10.1016/j.comtox.2018.08.002

• ADME properties

– Plasma fraction unbound (FuB)

– Intrinsic clearance (Clint)

Page 36

Addressing Risk Probabilistically

Prior

+ Human-Relevant Mechanistic Information (from HTS assays, 3D organotypic systems, QSAR models, targeted animal studies, etc.)

+ Exposure Data and Population Genetics (from biomonitoring studies, high-throughput transcriptomics, GWAS studies, etc.)

Posterior

[Figure: axes labeled Risk and % Population; tick values 0.1, 1, 10]