Avoiding Nonsense Results in Your NGS Variant Studies
DESCRIPTION
Presented at the 2014 Bio-IT World Expo in Boston, this slideshow provides info on the use of Lyons-Weiler's entropy-based measures of genotypic signal to improve concordance among alternative variant calling algorithms and to evaluate various steps in the GATK best practices pipeline. The second part of the talk presented data showing a demarcation bias in the widely used measure of fold change in selected differentially expressed genes, transcripts or proteins from microarray and RNASeq data. http://www.bio-itworldexpo.com/Next-Gen-Sequencing-Informatics/
TRANSCRIPT
Avoiding Nonsense Results in your NGS Variant Studies
James Lyons-Weiler, PhD
Scientific Director / Senior Research Scientist
Bioinformatics Analysis Core
Genomics & Proteomics Core Laboratories
University of Pittsburgh
Pittsburgh, PA
May 1, 2014
Two Parts
• Identifying sites with low genotypic signal increases concordance among variant callers
• Hazards in finding differentially expressed genes in RNASeq – how to do it more robustly.
23andMe: High risk of RA and psoriasis
GTL: Low risk of RA and psoriasis
NYTimes Article, etc.
Data were from Illumina HiSeq 2000
Among-method average concordance: 57.5% overall; 32.7% at high coverage
O’Rawe et al.
TRUTH (BIOLOGICAL MOLECULAR SEQUENCE)
SEQUENCER
MAPPER
VARIANT CALLERS
LOW CONCORDANCE (O’Rawe et al., 2013)
Consensus Analysis (e.g., 2/3, 3/4, set analysis)
Information Theory (-> modeling)
Improve Callers (fix errors, modeling)
Bake Offs
Simulations
Spiked Ins
Entropy of Base Distributions
A T C G: Low entropy, High enthalpy
A T C G: Low entropy, High enthalpy
A T C G: High entropy, Low enthalpy
Boltzmann Entropy
• s = k ln w (Planck)
• w = antiln(s/k)
http://schneider.ncifcrf.gov/images/boltzmann/boltzmann-tomb-4.html
Rank Sorted Distribution of w (O’Rawe et al. data)
Homozygotes w = 1
Heterozygotes w = 2
Example w Density Distribution
w and FBVC
A    T    C    G    w         pw      Zygosity       Genotype
200  0    0    0    1         0       Homozygote     AA
16   158  13   13   2.102558  0       Homozygote     TT
100  100  0    0    2         0       Heterozygote   AT
58   30   1    111  2.768507  0       Heterozygote   AG
28   80   14   78   3.303636  0       Heterozygote   TG
76   38   29   57   3.758733  0       Heterozygote   AG
33   49   60   58   3.895496  0.0126  Heterozygote?  CG?
50   50   50   50   4         1       noise          unknown
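The w column above can be approximated with a short sketch. It assumes w is the exponential of the Shannon entropy (in nats) of the observed base frequencies, i.e. s = k ln w with k = 1; the slides do not spell out the estimator, so the function name `genotypic_w` and this reading are assumptions. Under this reading the rows 200/0/0/0, 100/100/0/0 and 50/50/50/50 give w = 1, 2 and 4, while the intermediate rows come out close to, but not identical with, the tabulated values, so the authors' exact formula may differ.

```python
import math

def genotypic_w(counts):
    """Effective number of base states, w = exp(H).

    H is the Shannon entropy (natural log) of the base frequencies.
    Assumption: this is one plausible reading of s = k ln w with k = 1,
    not necessarily the estimator used in the talk.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no reads at this site")
    entropy = 0.0
    for c in counts:
        if c > 0:  # 0 * ln(0) is treated as 0
            p = c / total
            entropy -= p * math.log(p)
    return math.exp(entropy)

# Base counts in A, T, C, G order, taken from the table
for counts in [(200, 0, 0, 0), (100, 100, 0, 0), (16, 158, 13, 13), (50, 50, 50, 50)]:
    print(counts, round(genotypic_w(counts), 4))
```

A clean homozygote yields w near 1, a balanced heterozygote w near 2, and pure noise w near 4, matching the pattern on the slide.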
Operational* Equiprobable Null Distribution
{f(A) = f(T) = f(G) = f(C)}
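One way to turn the equiprobable null into a significance value is a Monte Carlo test: simulate sites at the same depth with all four bases equally likely, and report the fraction of simulated sites whose w is at least as low as the observed one. This is a sketch under assumptions: the slides do not say how pw is computed, w is taken as exp(Shannon entropy), and `pw_monte_carlo` is a hypothetical name. It is not expected to reproduce the slide's pw values exactly.

```python
import math
import random

def genotypic_w(counts):
    """w = exp(Shannon entropy in nats); an assumed reading of s = k ln w."""
    total = sum(counts)
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log(p)
    return math.exp(entropy)

def pw_monte_carlo(counts, n_sim=2000, seed=0):
    """Estimate P(w_null <= w_observed) under f(A) = f(T) = f(G) = f(C)."""
    rng = random.Random(seed)          # fixed seed for repeatable estimates
    depth = sum(counts)
    observed = genotypic_w(counts)
    hits = 0
    for _ in range(n_sim):
        sim = [0, 0, 0, 0]
        for base in rng.choices(range(4), k=depth):  # equiprobable bases
            sim[base] += 1
        # small tolerance guards against floating-point ties
        if genotypic_w(sim) <= observed + 1e-12:
            hits += 1
    return hits / n_sim

print(pw_monte_carlo((200, 0, 0, 0)))     # strong signal: estimate at or near 0
print(pw_monte_carlo((50, 50, 50, 50)))   # pure noise: every simulation qualifies
```

Low pw means the site's base distribution is far more concentrated than noise would produce; a site that looks like the null gets pw near 1.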
Convergence of significance (pw)
What We Expect
TRUTH (BIOLOGICAL MOLECULAR SEQUENCE)
SEQUENCER
MAPPER
VARIANT/BASE CALLERS
Genotypic Signal Filtering
INCREASED CONCORDANCE
Phom Function
Caller / signal  Concordance w/ FBVC  Hom    Het
gatk
ALL              0.5762               11868  17670
pw<=0.05         0.9976               11282  5676
pw>0.05          0.0074               586    11994
samtools
ALL              0.5649               11541  18799
pw<=0.05         0.9917               11489  5761
pw>0.05          0.0002               52     13038
snver
ALL              0.6006               11904  16729
pw<=0.05         0.9934               11812  5470
pw>0.05          0.0007               92     11259
From the O’Rawe et al. generated results
FBVC = frequency-based variant caller (Lyons-Weiler et al.)
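The stratification used in the concordance table above (ALL, pw<=0.05, pw>0.05) can be sketched as follows, assuming each site already carries a genotype call from two callers and a pw value. The `sites` records and the function name are hypothetical illustrations, not output from Gconf or from the O’Rawe et al. data.

```python
def stratified_concordance(sites, alpha=0.05):
    """Fraction of sites where two callers agree, overall and split by pw.

    sites: iterable of (call_a, call_b, pw) tuples (hypothetical layout).
    Returns None for a stratum that contains no sites.
    """
    strata = {"ALL": [], "pw<=alpha": [], "pw>alpha": []}
    for call_a, call_b, pw in sites:
        agree = call_a == call_b
        strata["ALL"].append(agree)
        key = "pw<=alpha" if pw <= alpha else "pw>alpha"
        strata[key].append(agree)
    return {
        name: (sum(flags) / len(flags) if flags else None)
        for name, flags in strata.items()
    }

# Toy, made-up sites: (caller A genotype, caller B genotype, pw)
sites = [
    ("AA", "AA", 0.0),
    ("AT", "AT", 0.0),
    ("TT", "TT", 0.01),
    ("TG", "TG", 0.04),
    ("AG", "AA", 0.30),
    ("CG", "CC", 0.80),
]
result = stratified_concordance(sites)
print(result)
```

In this toy input the disagreements fall entirely in the pw>0.05 stratum, which is the qualitative pattern the table reports for real callers.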
Comparison    Tx                                          Signal    %Concordance
FBVC_vs_FBVC  Marked                                      ALL       85.64
                                                          pw<=0.05  91.08
                                                          pw>0.05   35.66
FBVC_vs_FBVC  Realigned                                   ALL       83.82
                                                          pw<=0.05  91.69
                                                          pw>0.05   28.21
FBVC_vs_FBVC  Recalibrated                                ALL       93.14
                                                          pw<=0.05  ***99.39
                                                          pw>0.05   48.53
FBVC_vs_FBVC  Reduced                                     ALL       21.54
                                                          pw<=0.05  24.57
                                                          pw>0.05   4.25
FBVC_vs_FBVC  Marked-Realigned                            ALL       76.91
                                                          pw<=0.05  86.11
                                                          pw>0.05   15.44
FBVC_vs_FBVC  Marked-Realigned-Recalibrated               ALL       76.73
                                                          pw<=0.05  85.99
                                                          pw>0.05   15.34
FBVC_vs_FBVC  Marked-Realigned-Recalibrated-Reduced       ALL       19.98
                                                          pw<=0.05  22.9
                                                          pw>0.05   2.66
TRUTH (BIOLOGICAL MOLECULAR SEQUENCE)
SEQUENCER
MAPPER
VARIANT CALLERS
LOW CONCORDANCE (O’Rawe et al., 2013)
Consensus Analysis (e.g., 2/3, 3/4, set analysis)
Information Theory (-> modeling)
Improve Callers (fix errors, modeling)
Bake Offs
Simulations
Spiked Ins
Lifescope reads (red)
Shrimp2 reads (blue)
Mappers must be systematically evaluated
Part 2: Good and Bad News for RNASeq (and everything else):
The Bad News:
Fold Change is Biased.
The Good News:
We have identified a much less biased method.
T-test is not appropriate for small N, large P data
(such as RNASeq)
Fold Change > 2.0
Delta > 25
FC(A/B) is Blind to Large Portions of Your Data
FC(A/B)
Delta (and J5: Patel & Lyons-Weiler, 2004)
Ratios are Hard to Interpret as Biological Differences
Gene A B delta (A-B) FC(A/B)
gene1 5 3 2 1.667
gene2 50 30 20 1.667
gene3 500 300 200 1.667
gene4 5000 3000 2000 1.667
gene5 50000 30000 20000 1.667
A-B is a difference. A/B is a quotient.
Log2 Transformation Does not Help
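The table above can be reproduced in a few lines to show why a ratio hides the size of a difference. The sketch applies the cutoffs from the earlier threshold slide (Fold Change > 2.0, Delta > 25) to these five genes; pairing those cutoffs with this particular table is an illustration, not something the slides do explicitly, and the names `classify`, `FC_CUTOFF` and `DELTA_CUTOFF` are hypothetical.

```python
# (A, B) expression values copied from the slide's table
genes = {
    "gene1": (5, 3),
    "gene2": (50, 30),
    "gene3": (500, 300),
    "gene4": (5000, 3000),
    "gene5": (50000, 30000),
}

FC_CUTOFF = 2.0      # from the "Fold Change > 2.0" slide
DELTA_CUTOFF = 25    # from the "Delta > 25" slide

def classify(a, b):
    """Return delta, fold change, and whether each cutoff is exceeded."""
    delta = a - b
    fc = a / b
    return {
        "delta": delta,
        "fc": fc,
        "fc_pass": fc > FC_CUTOFF,
        "delta_pass": delta > DELTA_CUTOFF,
    }

results = {name: classify(a, b) for name, (a, b) in genes.items()}
for name, r in results.items():
    print(f"{name}: delta={r['delta']}, FC={r['fc']:.3f}, "
          f"FC>2: {r['fc_pass']}, delta>25: {r['delta_pass']}")
```

Every gene has the same fold change of about 1.667, so none clears the fold-change cutoff, while the absolute difference ranges from 2 to 20000 and three of the five genes clear the delta cutoff.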
Reveals Minor Delta (&J5) Bias
Pink = FC(A/B). Black = Delta
G-Thresholding J5
FC Bias in Amyotrophic Lateral Sclerosis
[Figure: scatter plot; x-axis: Control (0 to 200000); y-axis: ALS (0 to 350000); series: DEGy, FCDEGy]
Black circles = FC(A/B). Pink = Gthr-J5 genes
Black circles = FC(A/B). Pink = Gthr-J5 genes
FC(A/B) Bias in Alcohol-Induced Hepatitis
Conclusions
• Not all NGS/HTS sites have sufficient genotypic signal to warrant a base call. High coverage alone does not provide a solution.
• By measuring genotypic signal, we can determine which sites we can call with confidence.
• Fold change (FC(A/B)) is blind to highly expressed genes and should be abandoned as a measure of differential expression altogether, even for single-gene or single-protein studies!
• Published microarray data sets analyzed to date using only FC(A/B) are a gold mine for re-analysis using less biased methods.
Credits and Contact
• pw, pHom, etc.: James Lyons-Weiler, Alan Twaddle, Rahil Sethi
– (MS in preparation)
– Our software is called Gconf (not yet available)
• Fold-Change Bias: James Lyons-Weiler, Tamanna Sultana, Rick Jordan, Rahil Sethi
– (Paper in review)
– For now, read:
• Mariani TJ, Budhraja V, Mecham BH, Gu CC, Watson MA, Sadovsky Y. 2003. A variable fold change threshold determines significance for expression microarrays. FASEB J. 17:321-3. doi: 10.1096/fj.02-0351fje
• Pearson K. 1897. On a form of spurious correlation that may arise when indices are used for the measurement of organs. Proc Roy Soc Lond 60:489-498. doi: 10.1098/rspl.1896.0076