

  • Towards Reproducible Toxicogenomics for Risk Assessment

    Weida Tong, Ph.D.
    Director, Division of Bioinformatics and Biostatistics, NCTR/FDA

    Part 1 – Lessons learned from MAQC‐1 and SEQC‐1

  • Irreproducible Science is Not Science

    • Up to 85% of resources are wasted on irreproducible science
      – Does not correlate with journal impact factor

    • The causes of irreproducible results are diverse
      – Human factors: selective publication, sloppiness and data fabrication
      – Technical factors: in silico mistakes, inadequate statistics, differences between technologies and between labs

    • Irreproducible biology research costs put at $28 billion per year

    • Journals unite for reproducibility

    • NIH plans to enhance reproducibility

    • Independent labs to verify high-profile papers

    • If a job is worth doing, it is worth doing twice

    • Sluggish data sharing hampers the reproducibility effort

    • Statistical errors

  • Different Data Analysis Methods, with the Same VGDS Data Set

    [Figure: Sponsor and FDA; labeled 20%]

    Updated from Federico Goodsaid, The 3rd MAQC Project Meeting, Palo Alto, CA, December 1-2, 2005

  • MicroArray Quality Control (MAQC) Consortium

    • Objective: An FDA‐led, community‐wide consortium effort to assess the technical performance and application of genomics technologies in clinical application, safety assessment and precision medicine.

    • Started in 2005 and completed 3 projects by 2014; most FDA centers have been involved.

    • Evaluated 3 genomics technologies: microarrays (MAQC 1 and 2), GWAS (MAQC 2) and RNA‐seq (MAQC 3)

    • Produced 28 peer‐reviewed articles, 11 of which were published in Nat Biotechnol; 2 are among the most cited papers in the past 20 years

    • Supported the FDA’s “Guidance for Industry: Pharmacogenomics Data Submission – Companion Guidance”

  • MAQC/SEQC Objectives

    [Figure: Sample → RNA → Array or Seq → DEGs]

    – Lab efficiency; this can be achieved
    – Platform specific; the difference rests on how to assess low expression
    – MAQC1 criterion (P + FC) to ensure cross‐lab and cross‐platform reproducibility

    If the same samples are provided to different labs and platforms with different analysis protocols, can we get the same DEGs?

  • DEGs Reproducibility – A definition

    • POG (Percentage of Overlapped Genes) plot
      1. The DEGs are ranked by a chosen method (i.e., FC, P, FC + P, or P + FC)
      2. Two ranked lists of DEGs are generated, one from Exp1 and one from Exp2
      3. The same number of top‐ranked genes is selected from each list; this number is shown on the x‐axis
      4. Concordance is defined as the percentage of overlap between the DEG lists from the two experiments
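    The four POG steps can be sketched in a few lines of Python. This is a minimal illustration, not the consortium's code: the function name `pog_curve` and the toy gene IDs are hypothetical, and the two input lists are assumed to be already ranked by whichever method was chosen in step 1.

```python
def pog_curve(ranked_exp1, ranked_exp2, sizes):
    """Return {n: percentage of overlapping genes among the top n of each list}.

    ranked_exp1, ranked_exp2: gene IDs ordered from most to least significant
    (the ranking method itself is chosen upstream and is not part of this sketch).
    sizes: list sizes to evaluate (the x-axis of a POG plot).
    """
    limit = min(len(ranked_exp1), len(ranked_exp2))
    curve = {}
    for n in sizes:
        if n < 1 or n > limit:
            raise ValueError(f"list size {n} is outside 1..{limit}")
        top1 = set(ranked_exp1[:n])
        top2 = set(ranked_exp2[:n])
        # Concordance: shared genes as a percentage of the list size.
        curve[n] = 100.0 * len(top1 & top2) / n
    return curve


# Toy ranked lists (hypothetical gene IDs, not from the rat dataset).
exp1 = ["g1", "g2", "g3", "g4"]
exp2 = ["g2", "g1", "g5", "g3"]
curve = pog_curve(exp1, exp2, sizes=[1, 2, 4])
```

    Plotting `curve` with the list size on the x‐axis and the percentage on the y‐axis gives one line of a POG plot; repeating it for each ranking method gives the comparison.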

  • Rat Toxicogenomics Study ‐ Validation

    AA – Aristolochic acid; RDL – Riddelliine; CFY – Comfrey; CTR – Control

    Kidney: AA, CTR
    Liver: AA, RDL, CFY, CTR

    Microarrays from Applied Biosystems, Affymetrix (2 sites), Agilent, and GE Healthcare. Results are summarized in:

    o Guo et al., Nat. Biotechnol. 24, 1162‐1169 (2006).
    o Tong et al., Nat. Biotechnol. 24, 1132‐1139 (2006).

  • Are the Lists of Differentially Expressed Genes (DEGs) Reproducible?

    • Within‐lab repeatability (same technician, same protocol)

    • Cross‐lab reproducibility (same protocol, same experimental design)

    • Cross‐platform reproducibility (same experimental design)

  • Within Site Repeatability ‐ Rat TGx Study

    Kidney: AA, CTR
    Liver: AA, RDL, CFY, CTR

    [Figure: P vs FC for CFY and CTR; Exp 1 and Exp 2]

  • Rat Toxicogenomics Dataset: Volcano Plot

    [Figure: volcano plot (P vs FC), with intra‐laboratory concordance (%, 0 to 100) plotted against the number of genes selected as differentially expressed (1 to 10000)]

    • P + FC >2.0
    • P + FC >1.4
    • P ranking
    • FC ranking
    • FC + P

    Guo L et al., Nature Biotechnology, 24(9), 1162-9 (2006).

  • Rat Toxicogenomics Dataset: Gene List Concordance

    [Figure: five panels (ABI, AG1, GEH, AFX, AFX2), each plotting intra‐laboratory concordance (%, 0 to 100) against the number of genes selected as differentially expressed (1 to 10000)]

    • Fold Change Rank Ordering
    • Fold Change Rank Ordering/P‐value1.4
    • P‐value Rank Ordering

    Guo L et al., Nature Biotechnology, 24(9), 1162-9 (2006).

  • Summary

    • Reproducibility ≠ Statistical Significance

    – Reproducibility is a third dimension beyond sensitivity and specificity, yet it is not commonly considered by statisticians.

    – MAQC1 criterion: Using an (FC + P) cutoff for differentially expressed genes (DEGs) can enhance reproducibility while still optimally balancing sensitivity and specificity.

    • Irreproducible DEGs result in irreproducible biological interpretation

    • Questions that are not resolved:
    – Is the MAQC1 criterion lab‐dependent?
    – Is the MAQC1 criterion platform‐dependent?
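
    One possible reading of a combined FC and P selection rule is to keep genes that pass a P‐value cutoff and then rank the survivors by absolute fold change. The sketch below assumes that reading; the function name `select_degs`, the 0.05 cutoff, and the toy values are illustrative assumptions, not parameters stated on these slides.

```python
def select_degs(genes, p_cutoff=0.05, top_n=None):
    """Filter by P-value, then rank by absolute log2 fold change.

    genes: iterable of (gene_id, log2_fold_change, p_value) tuples.
    p_cutoff: assumed, illustrative threshold (not taken from the slides).
    top_n: optional cap on the length of the returned list.
    """
    passing = [g for g in genes if g[2] < p_cutoff]
    # Largest absolute fold change first; sign is ignored for ranking.
    ranked = sorted(passing, key=lambda g: abs(g[1]), reverse=True)
    ids = [g[0] for g in ranked]
    return ids if top_n is None else ids[:top_n]


# Toy input (hypothetical genes and values).
toy = [
    ("a", 2.0, 0.01),
    ("b", -3.0, 0.20),   # large fold change, but fails the P cutoff
    ("c", -2.5, 0.03),
    ("d", 0.4, 0.001),   # very small P, but small fold change
]
all_degs = select_degs(toy)
top_two = select_degs(toy, top_n=2)
```

    Ranking by P alone would put gene "d" first; ranking by fold change after the filter moves it to the end of the list, which is the behavior the concordance plots compare.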

  • NGS is a tool and the challenge rests on its application

    Genome size

    • Human (~3.2B)

    • Microbial genome (1‐10M)

    • mRNA (~27K)

    • microRNA (18‐25)

  • The 3rd Phase of MAQC ‐ SEquencing Quality Control (SEQC)

    • >180 participants from 73 organizations
    • Generated >10 Tb of data and >100 billion reads
    • Represented ~6% of the data in GEO (Jun, 2014)
    • 10 manuscripts: 3 in Nat Biotechnol, 2 in Nat Commun, 3 in Sci Data, 2 in Genome Biology

    [Figure: Objectives; Study designs; Datasets]

  • #5: Relative measures agree well across laboratories and platforms, but absolute measurements do not

    • 11 labs
    • 3 platforms: Illumina, SOLiD, 454
    • 6 reference samples
    • Bioinformatics pipelines

    Statistical test:

    A (Lab1) vs A (Lab2) = DEGs

    DEG: A/B (Lab1) vs A/B (Lab2)
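
    A toy calculation shows why a ratio such as A/B can agree across labs even when the absolute values for A do not. It assumes, purely for illustration, that the lab effect is a single multiplicative factor on every measurement; the numbers and the 1.5x factor are hypothetical and are not SEQC data.

```python
import math

# Hypothetical expression values for one gene in samples A and B at lab 1.
lab1 = {"A": 200.0, "B": 50.0}

# Assumed lab effect: lab 2 reads every value 1.5x higher than lab 1.
lab_bias = 1.5
lab2 = {sample: value * lab_bias for sample, value in lab1.items()}

# Absolute comparison: the same sample A measured at two labs.
absolute_log2_diff = math.log2(lab2["A"] / lab1["A"])

# Relative comparison: the A/B ratio computed within each lab.
relative_lab1 = math.log2(lab1["A"] / lab1["B"])
relative_lab2 = math.log2(lab2["A"] / lab2["B"])
```

    Under this assumption the absolute comparison differs by log2(1.5) while the two A/B log ratios are identical, because the lab factor cancels in the ratio. Real lab and platform effects need not be purely multiplicative, so this only illustrates the direction of the finding.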

  • Summary

    • RNA-seq outperforms microarrays in detecting low‐expression genes

    • Absolute measurement is not reliable

    • DEGs reproducibility: the MAQC1 criterion is equally effective for RNA-seq data

    • Caveat: not tested on a toxicogenomics dataset