-
Towards Reproducible Toxicogenomics for Risk Assessment
Weida Tong, Ph.D.
Director, Division of Bioinformatics and Biostatistics, NCTR/FDA
Part 1 – Lessons learned from MAQC‐1 and SEQC‐1
-
Irreproducible Science is Not Science
• Up to 85% of resources are wasted in irreproducible science
– Does not correlate with journal impact factor
• The causes of irreproducible results are diverse
– Human factors: selective publication, sloppiness, and data fabrication
– Technical factors: in silico mistakes, inadequate statistics, differences between technologies and between labs
• Irreproducible biology research costs put at $28 billion per year
• Journals unite for reproducibility
• NIH plans to enhance reproducibility
• Independent labs to verify high-profile papers
• If a job is worth doing, it is worth doing twice
• Sluggish data sharing hampers the reproducibility effort
• Statistical errors
-
Different Data Analysis Methods, with the Same VGDS Data Set
[Figure labels: Sponsor; FDA; 20%]
Updated from Federico Goodsaid, The 3rd MAQC Project Meeting, Palo Alto, CA, December 1-2, 2005
-
MicroArray Quality Control (MAQC) Consortium
• Objective: An FDA‐led, community‐wide consortium effort to assess the technical performance and application of genomics technologies in clinical application, safety assessment, and precision medicine.
• Started in 2005 and completed 3 projects by 2014; most FDA centers have been involved.
• Evaluated 3 genomics technologies: microarrays (MAQC 1 and 2), GWAS (MAQC 2), and RNA‐seq (MAQC 3)
• Produced 28 peer‐reviewed articles, 11 of which were published in Nat Biotechnol; 2 are among the most cited papers in the past 20 years
• Supported the FDA’s “Guidance for Industry: Pharmacogenomics Data Submission – Companion Guidance”
-
MAQC/SEQC Objectives
If the same samples are provided to different labs and platforms with different analysis protocols …… can we get the same DEGs?
Workflow: Sample → RNA → Array or Seq → DEGs
• Lab efficiency; this can be achieved
• Platform specific; difference rests on how to assess low expression
• MAQC1 criterion (P + FC) to ensure cross‐lab and cross‐platform reproducibility
-
DEGs Reproducibility – A Definition
• POG (Percentage of Overlapping Genes) plot
1. The DEGs are ranked by a chosen method (i.e., FC, P, FC + P, or P + FC)
2. Two ranked lists of DEGs are generated, one from Exp1 and one from Exp2
3. The same number of top‐ranked genes is selected from each list; this number is shown on the x‐axis
4. Concordance is defined as the percentage of genes shared by the two selected DEG lists
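The POG definition above can be sketched in a few lines of Python. This is a minimal illustration, not the consortium's code: the function names `pog` and `pog_curve` and the gene identifiers are hypothetical, and the sketch assumes that both ranked lists contain at least as many genes as the largest list size requested.

```python
def pog(ranked_a, ranked_b, top_n):
    """Percentage of overlapping genes among the top `top_n` entries of two ranked lists.

    Assumes both lists are already ranked (best first) and hold at least `top_n` genes.
    """
    if top_n <= 0:
        raise ValueError("top_n must be positive")
    top_a = set(ranked_a[:top_n])
    top_b = set(ranked_b[:top_n])
    return 100.0 * len(top_a & top_b) / top_n


def pog_curve(ranked_a, ranked_b, sizes):
    """POG at each list size; the (size, POG) pairs are the points of a POG plot."""
    return [(n, pog(ranked_a, ranked_b, n)) for n in sizes]


# Hypothetical ranked DEG lists from two experiments (top-ranked gene first).
exp1 = ["g1", "g2", "g3", "g4", "g5"]
exp2 = ["g2", "g1", "g5", "g6", "g3"]

curve = pog_curve(exp1, exp2, [2, 3, 5])
for n, pct in curve:
    print(f"top {n}: {pct:.1f}% overlap")
```

Plotting the list size on the x‐axis and the returned percentage on the y‐axis gives the kind of concordance curve described in steps 3 and 4.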
-
Rat Toxicogenomics Study ‐ Validation
AA – Aristolochic acid; RDL – Riddelliine; CFY – Comfrey; CTR – Control
Kidney: AA, CTR
Liver: AA, RDL, CFY, CTR
Microarrays from Applied Biosystems, Affymetrix (2 sites), Agilent, and GE Healthcare. Results are summarized in:
o Guo et al., Nat. Biotechnol. 24, 1162‐1169 (2006).
o Tong et al., Nat. Biotechnol. 24, 1132‐1139 (2006).
-
Are the Lists of Differentially Expressed Genes (DEGs) Reproducible?
• Within‐lab repeatability (same technician, same protocol)
• Cross‐lab reproducibility (same protocol, same experimental design)
• Cross‐platform reproducibility (same experimental design)
-
Within‐Site Repeatability ‐ Rat TGx Study
[Figure labels: Kidney (AA, CTR); Liver (AA, RDL, CFY, CTR); CFY, CTR; P vs FC; Exp 1; Exp 2]
-
Rat Toxicogenomics Dataset: Volcano Plot
[Figure: intra‐laboratory concordance (%) versus number of genes selected as differentially expressed (1 to 10000). Series: P + FC > 2.0; P + FC > 1.4; P ranking; FC ranking; FC + P]
Guo L et al., Nature Biotechnology, 24(9), 1162-9 (2006).
-
Rat Toxicogenomics Dataset: Gene List Concordance
[Figure: five panels (ABI, AG1, GEH, AFX, AFX2) showing intra‐laboratory concordance (%) versus number of genes selected as differentially expressed (1 to 10000). Series: Fold Change Rank Ordering; Fold Change Rank Ordering/P‐value1.4; P‐value Rank Ordering]
Guo L et al., Nature Biotechnology, 24(9), 1162-9 (2006).
-
Summary
• Reproducibility ≠ Statistical Significance
– Reproducibility is a third dimension beyond sensitivity and specificity, yet it is not commonly considered by statisticians.
– MAQC1 criterion: Using an (FC + P) cutoff for differentially expressed genes (DEGs) can enhance reproducibility while still optimally balancing sensitivity and specificity.
• Irreproducible DEGs result in irreproducible biological interpretation
• Questions that are not resolved:
– Is the MAQC1 criterion lab‐dependent?
– Is the MAQC1 criterion platform‐dependent?
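One possible reading of an (FC + P) cutoff, sketched in Python: keep genes that pass a P‐value threshold and a minimum fold change, then rank the survivors by fold‐change magnitude. The thresholds, the `GeneStat` record, the `select_degs` function, and the example values are all assumptions made for illustration; the slides do not give the exact cutoffs or the order in which the two filters are applied.

```python
from typing import NamedTuple


class GeneStat(NamedTuple):
    gene: str
    log2_fc: float   # log2 fold change, treated vs. control
    p_value: float   # P-value from any per-gene statistical test


def select_degs(stats, p_cutoff=0.05, min_abs_log2_fc=1.0):
    """Return gene names passing both cutoffs, ranked by |log2 fold change|.

    The default cutoffs are illustrative assumptions, not values from the slides.
    """
    passing = [
        s for s in stats
        if s.p_value < p_cutoff and abs(s.log2_fc) >= min_abs_log2_fc
    ]
    passing.sort(key=lambda s: abs(s.log2_fc), reverse=True)
    return [s.gene for s in passing]


# Hypothetical per-gene statistics.
example = [
    GeneStat("g1", 3.0, 0.001),
    GeneStat("g2", -2.5, 0.04),
    GeneStat("g3", 4.0, 0.30),     # large fold change, but fails the P cutoff
    GeneStat("g4", 0.5, 0.0001),   # very significant, but fold change too small
    GeneStat("g5", 1.2, 0.02),
]

print(select_degs(example))
```

In this sketch, g3 and g4 show why the two cutoffs are combined: either statistic alone would select a gene that the other rejects.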
-
NGS is a tool and the challenge rests on its application
Genome size
• Human (~3.2B)
• Microbial genome (1‐10M)
• mRNA (~27K)
• microRNA (18‐25)
-
The 3rd Phase of MAQC ‐ SEquencing Quality Control (SEQC)
• >180 participants from 73 organizations
• Generated >10 Tb of data and >100 billion reads
• Represented ~6% of the data in GEO (June 2014)
• 10 manuscripts: 3 in Nat Biotechnol, 2 in Nat Commun, 3 in Sci Data, 2 in Genome Biology
Objectives
Study designs
Datasets
-
#5: Relative measures agree well across laboratories and platforms, but absolute measurements do not
11 Labs
3 Platforms
- Illumina
- SOLiD
- 454
6 reference samples
Bioinformatics Pipelines
Statistical test:
A (Lab1) vs A (Lab2) = DEGs
DEG: A/B (Lab1) vs A/B (Lab2)
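A toy calculation can show why the relative comparison (A/B in Lab1 vs A/B in Lab2) may agree while the absolute comparison (A in Lab1 vs A in Lab2) does not. The sketch below assumes, purely for illustration, that each lab applies its own multiplicative scale factor to every measurement; the slides do not state this mechanism, and all numbers and names are hypothetical.

```python
import math

# Hypothetical "true" expression of one gene in reference samples A and B.
true_a, true_b = 200.0, 50.0

# Assumed lab-specific multiplicative bias (an illustration, not SEQC data).
lab_scale = {"Lab1": 1.0, "Lab2": 3.0}

# Each lab measures both samples with its own scale factor.
measured = {lab: (true_a * s, true_b * s) for lab, s in lab_scale.items()}

# Absolute comparison: sample A in Lab2 vs. sample A in Lab1 (log2 scale).
absolute_log2_diff = math.log2(measured["Lab2"][0] / measured["Lab1"][0])

# Relative comparison: log2(A/B) computed within each lab.
relative_log2 = {lab: math.log2(a / b) for lab, (a, b) in measured.items()}

print(f"A (Lab2) vs A (Lab1): log2 difference = {absolute_log2_diff:.2f}")
for lab, value in relative_log2.items():
    print(f"{lab}: log2(A/B) = {value:.2f}")
```

Under this assumption the lab factor cancels in the within‐lab ratio, so the A/B values match across labs even though the absolute values for the same sample differ.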
-
Summary
• RNA-seq outperforms microarrays in detecting genes with low expression
• Absolute measurement is not reliable
• DEG reproducibility: the MAQC1 criterion is equally effective for RNA-seq data
• Caveat: not tested on a toxicogenomics dataset