
Computationally Leveraging the Collective: Mining Published Data and Crowdsourcing Consensus Models

Nicole C. Kleinstreuer

NICEATM Deputy Director

PI, Comp Tox Group, DIR/BCBB

Leveraging Artificial Intelligence and Machine Learning to Advance Environmental Health Research and Decisions

6th June 2019, NAS, Washington DC

Outline

• Problem Statement

• Application of AI/ML to Big Data in Toxicology

• Current and Future Projects:

– Endocrine Disruption

– Acute Oral Toxicity

– Automation

• Reference Data Identification

• Mechanistic Testing Toolbox

Environmental Chemical Disease Contributions

• Pesticides

– Cancer, neurodegenerative diseases, thyroid

• Consumer products

– Neurological, developmental, systemic

• Air pollutants

– Childhood ADHD, autism, allergic asthma

• Drinking water contaminants

– Systemic effects, cancer, neurological

• Endocrine Disruptors

– Developmental impairment, decreased fertility, cancer

… and many others

https://www.niehs.nih.gov/health/materials/index.cfm


Picture © ChemSec

Chemicals >> Data

• 80+ million substances synthesized

• 140,000 chemicals in commerce

(plus mixtures, natural products and metabolites)

• Less than 10% tested

[Figure: Tox21 and ToxCast positioned on two axes, Number of Chemicals (Chemical Diversity) and Number of Assays (Biological Diversity)]

High Throughput Screening (Tox21/ToxCast)

• Produce data on thousands of chemicals

• Prioritize compounds for hazard

• Develop predictive models for biological response in humans/ecosystems

• Reduce reliance on animal models

• NICEATM: comp tox + validation support to Tox21 and ICCVAM

Curated Legacy Data, e.g. REACH, ToxRefDB, ICE

Omics technologies, e.g. transcriptomics, metabolomics, exposomics

High-Throughput Screening, e.g. Tox21, EUToxRisk

Chemical Features, e.g. RDKit, PhysChem

Big Data + Machine Learning

Supporting Regulatory Decisions

Far too many chemicals to test with standard animal-based methods or even in vitro HTS

– Cost, time, animal welfare, human relevance

– >10,000 chemicals to be tested for EDSP, >50,000 for TSCA

– Fill the data gaps and bridge the lack of knowledge

Alternative

Endpoints:

Endocrine Disruption: Estrogen (ER) & Androgen (AR)

• Binding

• Agonism

• Antagonism

Acute Systemic Toxicity: Oral LD50s

• Toxic/Very toxic

• LD50 point estimates

• EPA categories

• GHS categories

• Obtain high-quality training sets

– Apply best modeling practices

– Validate performance of models

– Define applicability domain and model limitations

– Use models to predict across large chemical sets

– Help inform regulatory decision making

Overall Approach

Mansouri et al. 2018

• Data & structure curation

» Flagged and curated files available

• Preparation of training and test sets

» Available as SDFiles and csv data files

• Initial descriptor calculation

» molecular descriptors and structural fingerprints generated and shared

• Variable selection technique

» e.g. genetic algorithm

• Selection of a mathematical method

» Test several approaches: KNN, PLS, SVM, RF…

• Validation of the model’s predictive ability

» 5-fold cross validation & external test set

• Define the Applicability Domain

» Local (nearest neighbors) and global (leverage) approaches

Modeling Steps and Considerations
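The last step, defining the applicability domain with a global (leverage) and a local (nearest-neighbor) check, can be illustrated with a minimal sketch. It is not the workflow used in the projects described here: the function names, the 3p/n leverage cutoff, the neighbor count `k`, and the distance cutoff `max_mean_dist` are all assumptions chosen for illustration.

```python
import numpy as np

def leverage(X_train, X_query):
    """Leverage h = x (X^T X)^-1 x^T of each query row relative to the training matrix."""
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)
    return np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)

def in_global_domain(X_train, X_query):
    """Global check: leverage at or below an assumed cutoff of 3p/n."""
    n, p = X_train.shape
    threshold = 3.0 * p / n  # assumption: a commonly quoted warning leverage
    return leverage(X_train, X_query) <= threshold

def in_local_domain(X_train, X_query, k=3, max_mean_dist=1.0):
    """Local check: mean Euclidean distance to the k nearest training rows.

    k and max_mean_dist are hypothetical tuning parameters.
    """
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.sort(d, axis=1)[:, :k]
    return nearest.mean(axis=1) <= max_mean_dist

# Toy descriptor matrix (6 chemicals, 2 descriptors) and two query chemicals.
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                    [-1.0, 0.0], [0.0, -1.0], [-1.0, -1.0]])
X_query = np.array([[0.5, 0.5], [5.0, -5.0]])

print(in_global_domain(X_train, X_query))
print(in_local_domain(X_train, X_query))
```

In this toy example the first query sits inside the training cloud and passes both checks, while the second lies far outside it and fails both.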

CoMPARA: Collaborative Modeling Project for Androgen Receptor Activity (2017/18)

CATMoS: Collaborative Acute Toxicity Modeling Suite (2018/19)

Endocrine Disruptor Screening Program (EDSP)

ICCVAM Acute Systemic Toxicity Workgroup

Mansouri et al. 2016 EHP 124:1023–1033; Mansouri et al. 2019 under review at EHP

Kleinstreuer et al. 2018 Comp Tox; Mansouri et al. 2019 in prep

ICCVAM, NICEATM

Global Collaborative Projects

• 35 Participants/Groups from around the globe representing academia, industry, and government contributed

Consortium:

International Participation

(https://batchgeo.com/map/d06c5d497ed8f76ecfee500c2b0e1dfa)


Number of submitted models:

              CERAPP participants' models          CoMPARA participants' models
              Categorical  Continuous  Total       Categorical  Continuous  Total
Binding            21           3        24             35           5        40
Agonist            11           3        14             21           5        26
Antagonist          8           2        10             22           3        25
Total              40           8        48             78          13        91

Models for Endocrine Disruption

Judson et al. Toxicol. Sci. (2015) 148: 137-154; Kleinstreuer et al. CRT (2017) 30(4): 946-964.

Tox21/ToxCast ER Pathway Model; Tox21/ToxCast AR Pathway Model

Validated pathway models provide training data for 1800 chemicals

CERAPP consensus

        Binding         Agonist         Antagonist
        Train   Test    Train   Test    Train   Test
Sn      0.93    0.58    0.85    0.94    0.67    0.18
Sp      0.97    0.92    0.98    0.94    0.94    0.90
BA      0.95    0.75    0.92    0.94    0.80    0.54

CoMPARA consensus

        Binding         Agonist         Antagonist
        Train   Test    Train   Test    Train   Test
Sn      0.99    0.69    0.95    0.74    1.00    0.61
Sp      0.91    0.87    0.98    0.97    0.95    0.87
BA      0.95    0.78    0.97    0.86    0.97    0.74

(Sn: sensitivity; Sp: specificity; BA: balanced accuracy)

Distributions of the number of predicted chemical structures by all binding models.

Consensus Models Assessment

Training Set: 1.8k, Evaluation Set: 7k, Prediction Set: 32k

Informing Regulatory Decisions

Modeling Endpoints for Acute Oral Toxicity

EPA hazard categories:

I (≤50 mg/kg)

II (>50 to ≤500 mg/kg)

III (>500 to ≤5000 mg/kg)

IV (>5000 mg/kg)

GHS hazard categories:

I (≤5 mg/kg)

II (>5 to ≤50 mg/kg)

III (>50 to ≤300 mg/kg)

IV (>300 to ≤2000 mg/kg)

NC (>2000 mg/kg)

Hazard Packing Group:

Highly toxic (≤50 mg/kg)

Toxic (>50 to 5000 mg/kg)

+ Nontoxic (>2000 mg/kg)

+ Quantitative LD50 values
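The category boundaries listed above map a single LD50 value to each categorical endpoint. The following is a minimal sketch of that mapping; the function names are hypothetical, and only the numeric thresholds come from the slide.

```python
from bisect import bisect_left

# Upper bounds (mg/kg, inclusive) of each category, taken from the slide.
EPA_BOUNDS = [50, 500, 5000]
EPA_LABELS = ["I", "II", "III", "IV"]
GHS_BOUNDS = [5, 50, 300, 2000]
GHS_LABELS = ["I", "II", "III", "IV", "NC"]

def epa_category(ld50_mg_per_kg):
    """EPA category for an oral LD50 in mg/kg."""
    return EPA_LABELS[bisect_left(EPA_BOUNDS, ld50_mg_per_kg)]

def ghs_category(ld50_mg_per_kg):
    """GHS category for an oral LD50 in mg/kg (NC = not classified)."""
    return GHS_LABELS[bisect_left(GHS_BOUNDS, ld50_mg_per_kg)]

def is_very_toxic(ld50_mg_per_kg):
    """Binary 'very toxic' endpoint: LD50 at or below 50 mg/kg."""
    return ld50_mg_per_kg <= 50

def is_nontoxic(ld50_mg_per_kg):
    """Binary 'nontoxic' endpoint: LD50 above 2000 mg/kg."""
    return ld50_mg_per_kg > 2000

for ld50 in (3, 50, 250, 2500, 6000):
    print(ld50, epa_category(ld50), ghs_category(ld50),
          is_very_toxic(ld50), is_nontoxic(ld50))
```

`bisect_left` returns the index of the first bound greater than or equal to the value, so a value exactly on a boundary (for example 50 mg/kg) falls in the lower, more toxic category, matching the "≤" in the definitions.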

Rodent LD50 data obtained & curated for ~15k chemicals

QSAR-ready structures: ~12k

Bootstrapping of the standard deviations for repeat test chemicals (~1000) identified a 95% confidence interval for LD50 values of ±0.31 log10(mg/kg)

Defining a Confidence Range

[Figure axis: LD50 (log10(mg/kg))]

Submitted Models

Consortium Comprised 35 Participating Groups

• Very Toxic: 32 models

• Non-toxic: 33 models

• EPA categories: 26 models

• GHS categories: 23 models

• LD50: 25 models

Total: 139 models

Support Vector Machine

Artificial Neural Networks

Regression Model

Bayesian Networks

XGBoost

kNN

Deep Learning

Random Forest

Training Set: 10k, Evaluation Set: 2k, Prediction Set: 48k

Evaluation procedure

Qualitative evaluation:

• Documentation

• Defined endpoint

• Unambiguous algorithm

• Availability of code

• Applicability domain definition

• Availability of data used for modeling

• Mechanistic interpretation

Quantitative evaluation:

- Goodness of fit: training statistics

- Evaluation set predictivity: statistics on the evaluation set

- Robustness: balance between goodness of fit and test set predictivity

The consensus predictions perform as well as replicate in vivo data at predicting oral acute toxicity outcome

CATMoS Consensus Model: in-domain predictions, weighted majority/average
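A consensus built from in-domain predictions by weighted majority (categorical endpoints) or weighted average (LD50 values) might look like the sketch below. The slide does not state how the weights were derived, so the per-model weights here are placeholders, and the function names and the tuple layout are assumptions.

```python
from collections import defaultdict

def weighted_majority(predictions):
    """Consensus class from (label, weight, in_domain) tuples.

    Only predictions flagged as inside the model's applicability domain vote.
    Returns None when no model covers the chemical.
    """
    totals = defaultdict(float)
    for label, weight, in_domain in predictions:
        if in_domain:
            totals[label] += weight
    if not totals:
        return None
    return max(totals, key=totals.get)

def weighted_average(predictions):
    """Consensus value from (value, weight, in_domain) tuples."""
    used = [(value, weight) for value, weight, in_domain in predictions if in_domain]
    total_weight = sum(weight for _, weight in used)
    if total_weight == 0:
        return None
    return sum(value * weight for value, weight in used) / total_weight

# Hypothetical model outputs for one chemical; weights are illustrative only.
class_votes = [("toxic", 0.9, True), ("nontoxic", 0.6, True),
               ("nontoxic", 0.5, True), ("toxic", 0.8, False)]
ld50_votes = [(2.0, 1.0, True), (3.0, 3.0, True), (10.0, 5.0, False)]

print(weighted_majority(class_votes))
print(weighted_average(ld50_votes))
```

In the example, the out-of-domain "toxic" vote is ignored, so the two in-domain "nontoxic" votes (total weight 1.1) outweigh the single in-domain "toxic" vote (0.9).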

Performance Assessment

                            Very Toxic      Non-Toxic       EPA             GHS
                            Train   Eval    Train   Eval    Train   Eval    Train   Eval
Sensitivity                 0.87    0.67    0.93    0.70    0.73    0.50    0.63    0.45
Specificity                 0.94    0.96    0.96    0.88    0.96    0.91    0.91    0.92
Balanced Accuracy           0.93    0.81    0.94    0.79    0.83    0.71    0.77    0.68
In vivo Balanced Accuracy       0.81            0.89            0.82            0.79

LD50 values
        Train   Eval    In Vivo
R2      0.84    0.64    0.80
RMSE    0.32    0.51    0.42

Consensus models outperformed all individual models for each endpoint

• LD50: 28954

• VT: 23767

• NT: 30971

• EPA: 25487

• GHS: 25720

Consensus Implementation

Generalized CATMoS models: datasets

• High concordance among models

• Proportional distribution of:

– LD50 values

– VT/NT classes

– EPA/GHS categories

• Split into 75% training and 25% test set

• Calculate PaDEL & CDK2 descriptors

• Dimensionality reduction (missing values & low variance)

• Feature selection (most relevant descriptors for each endpoint)

Generalized CATMoS models: new chemical predictions

New chemical to be predicted

Nearest neighbors ($N_i$), each at distance $d_i$ from the new chemical

$w_i = f(d_i)$

If $d_1 \neq 0$: $\mathrm{Pred}_i = f(w_i, N_i)$

If $d_1 = 0$: $\mathrm{Pred}_i = N_i$

Automated, weighted, endpoint-dependent read-across: weighted kNN

$d_i$: Euclidean distance based on the selected descriptors for each endpoint
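The weighted kNN read-across can be sketched as follows. The slide leaves the weighting function f unspecified, so inverse-distance weighting is an assumption made for illustration, as are the function name and the default `k`. The exact-match branch returns the neighbor's value directly, as in the zero-distance case above.

```python
import numpy as np

def weighted_knn_predict(X_train, y_train, x_new, k=5):
    """Distance-weighted kNN prediction for one new chemical.

    X_train: (n, p) descriptor matrix; y_train: (n,) endpoint values;
    x_new: (p,) descriptors of the chemical to predict.
    """
    d = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances d_i
    order = np.argsort(d)[:k]                    # indices of the k nearest neighbors
    d_k, y_k = d[order], y_train[order]
    if d_k[0] == 0.0:
        # Exact match in the training set: use its value directly.
        return float(y_k[0])
    w = 1.0 / d_k  # assumption: f(d) = 1/d; the original f is not given
    return float(np.sum(w * y_k) / np.sum(w))

# Toy data: 4 training chemicals, 2 descriptors, one continuous endpoint.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
y_train = np.array([1.0, 2.0, 4.0, 100.0])

print(weighted_knn_predict(X_train, y_train, np.array([0.0, 0.0]), k=2))
print(weighted_knn_predict(X_train, y_train, np.array([0.5, 0.0]), k=2))
```

For a categorical endpoint the same weights could drive a weighted vote instead of a weighted mean.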

Consensus Implementation

OPERA Standalone app

Running Consensus Models

Mansouri et al. J Cheminform (2018). https://doi.org/10.1186/s13321-018-0263-1

Command line Graphical user interface

- Free, open-source & open-data

- Single chemical and batch mode

- Multiple platforms (Windows and Linux)

- Embeddable libraries (Java, C, C++, Python)

https://github.com/NIEHS/OPERA


OPERA predictions on EPA’s CompTox dashboard

Calculation result for a chemical

Model performance with full QMRF

Nearest Neighbors from Training Set

Mansouri et al. OPERA models (https://doi.org/10.1186/s13321-018-0263-1)

Williams et al. CompTox Chemistry Dashboard (https://doi.org/10.1186/s13321-017-0247-6)

https://comptox.epa.gov/dashboard


Summary

• Toxicology data can be synthesized and modeled effectively using AI & machine learning approaches.

– Consensus models are a powerful way to leverage collective expertise

– Other applications: exposure, use case, systematic review, etc.

• Machine learning models (e.g. QSARs) have already achieved limited acceptance in the regulatory space.

• Additional education, training, and communication will facilitate more widespread adoption.

https://ntp.niehs.nih.gov/pubhealth/evalatm/natl-strategy/

https://ice.ntp.niehs.nih.gov/


Manually Identifying Reference Data

Systematic literature search of publicly available data (e.g. PubMed)

Identify chemical activities measured in “guideline-like” uterotrophic studies

Identify a subset of in vivo reference chemicals

• Active chemicals verified in >2 independent studies

• Inactive chemicals verified in >2 independent studies (with no positive results in any study)

Kleinstreuer et al. EHP (2015)

Automating Reference Data Identification

• Project with Oak Ridge National Labs (ORNL) and FDA CFSAN to apply text-mining (NLP) approaches & ML to identify high-quality data

• Semi-automated retrieval and evaluation of published literature (trained on uterotrophic database)

• Apply to developmental toxicity studies (with ICCVAM DARTWG)

• Define literature search keywords, identify corpus

• Extract/characterize study protocol details from regulatory guidelines: minimum criteria

• Apply ML algorithms to identify high-quality studies, expert check

• Search published literature for assays/biomarkers to modernize carcinogenicity testing toolbox

• Initial search strategy: Work with NTP Report on Carcinogens and Office of Health Assessment and Translation to identify keywords

– 256 keywords mapped to Hallmarks of Cancer and Key Characteristics of Carcinogens

– 7 keywords for assays/biomarkers, crossed with HM/KC

• Recruit participants to screen and tag abstracts

– Metadata: KC, HM, Organism, Publication type, Study type

– MeSH terms automatically tracked for PubMed articles

Semi-supervised systematic review

Automating Toolbox Construction

Sysrev: Semi-automated review platform

https://sysrev.com/

• Freely available website

• Abstract screening and annotating

• Intuitive user interface

• Including mobile/tablet access

• Uses machine learning to rank the abstracts

@sysrev1

Register: https://sysrev.com/register

Project name: Hallmark and key characteristics mapping

Project link: https://sysrev.com/p/3588


Example: Carcinogenicity

Hallmarks of Cancer & Key Characteristics of Carcinogens

• Inflammation

• Oxidative stress

• Genotoxicity/instability

• Angiogenesis

• Immortalization/proliferation

• Immunosuppression

• Invasion/metastasis

• Specific receptor- or enzyme-mediated

Mechanistic Mapping of HTS Assays

Hanahan & Weinberg 2011; Smith et al. 2016; Guyton et al. 2018; Chiu et al. 2018

Acknowledgments

• Kamel Mansouri

• Richard Judson

• ILS/NICEATM

• Alexandre Borrel

• ICCVAM partners

• ICATM partners

• Modeling consortium participants

• Tom Luechtefeld

Extra Slides

QSAR-ready KNIME workflow

Removal of duplicates

Normalization of tautomers

Cleaning of salts and counterions

Removal of inorganics and mixtures

Final inspection

QSAR-ready structures

Indigo

Aim of the workflow:

• Combine different procedures and ideas

• Minimize the differences between the structures used for prediction

• Produce a flexible, free and open-source workflow to be shared

Structure standardization procedure

Mansouri et al. (http://ehp.niehs.nih.gov/15-10267/)

Fourches et al. J Chem Inf Model, 2010, 29, 476–488

Wedebye et al. Danish EPA Environmental Project No. 1503, 2013

Total binders: 3961

Agonists: 2494

Antagonists: 2793

Consensus Qualitative Accuracy

                     ToxCast data    Literature data    Literature data
                                     (All: 7283)        (>6 sources: 1209)
Sensitivity          0.93            0.30               0.87
Specificity          0.97            0.91               0.94
Balanced accuracy    0.95            0.61               0.91

Confusion matrices (rows: observed; columns: predicted)

                     ToxCast data (training set)    Literature data (test set)
Observed             Actives    Inactives           Actives    Inactives
Actives              83         6                   597        1385
Inactives            40         1400                463        4838
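The accuracy statistics reported for the consensus follow directly from these confusion-matrix counts. A minimal sketch of the arithmetic, with a hypothetical helper name, is:

```python
def binary_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, and balanced accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # observed actives predicted active
    specificity = tn / (tn + fp)   # observed inactives predicted inactive
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy

# Counts from the slide (rows: observed, columns: predicted).
toxcast = binary_metrics(tp=83, fn=6, fp=40, tn=1400)
literature = binary_metrics(tp=597, fn=1385, fp=463, tn=4838)

print([round(v, 2) for v in toxcast])
print([round(v, 2) for v in literature])
```

Rounded to two decimals, these reproduce the ToxCast and all-literature columns of the accuracy table. The drop in sensitivity on the full literature set is what motivates the point about data quality.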

Prediction Accuracy Strongly Depends on Data Quality

ROC curve of the external validation set (literature)

Mansouri et al. (2016) EHP 124:1023–1033 DOI:10.1289/ehp.1510267

757 chemicals have >75% active concordance

Actives

Inactives

Prioritization

Most models predict most chemicals as inactive

Only a small fraction of chemicals are prioritized for further testing

Mansouri et al. (2016) EHP 124:1023–1033 DOI:10.1289/ehp.1510267

Chemical Prioritization

OPERA v2

OPERA 1.5: Physchem & Environmental fate

Model   Property
AOH     Atmospheric Hydroxylation Rate
BCF     Bioconcentration Factor
BioHL   Biodegradation Half-life
RB      Ready Biodegradability
BP      Boiling Point
HL      Henry's Law Constant
KM      Fish Biotransformation Half-life
KOA     Octanol/Air Partition Coefficient
LogP    Octanol-water Partition Coefficient
MP      Melting Point
KOC     Soil Adsorption Coefficient
VP      Vapor Pressure
WS      Water Solubility
RT      HPLC Retention Time

New in OPERA 2:

• Physchem properties:

– General structural properties

– pKa

– Log D

• Toxicity endpoints:

– ER activity (CERAPP): https://ehp.niehs.nih.gov/15-10267/

– AR activity (CoMPARA): https://doi.org/10.13140/RG.2.2.19612.80009

– Acute toxicity (CATMoS): https://doi.org/10.1016/j.comtox.2018.08.002

• ADME properties:

– Plasma fraction unbound (FuB)

– Intrinsic clearance (Clint)

Addressing Risk Probabilistically

[Figure: prior and posterior distributions; axes labeled Risk and % Population]

+ Human-Relevant Mechanistic Information (from HTS assays, 3D organotypic systems, QSAR models, targeted animal studies, etc.)

+ Exposure Data and Population Genetics (from biomonitoring studies, high-throughput transcriptomics, GWAS studies, etc.)
