Avoiding Nonsense Results in Your NGS Variant Studies

Post on 02-Dec-2014


DESCRIPTION

Presented at the 2014 Bio-IT World Expo in Boston, this slideshow provides info on the use of Lyons-Weiler's entropy-based measures of genotypic signal to improve concordance among alternative variant calling algorithms and to evaluate various steps in the GATK best practices pipeline. The second part of the talk presented data showing a demarcation bias in the widely used measure of fold change in selected differentially expressed genes, transcripts or proteins from microarray and RNASeq data. http://www.bio-itworldexpo.com/Next-Gen-Sequencing-Informatics/

TRANSCRIPT

Avoiding Nonsense Results in your NGS Variant Studies

James Lyons-Weiler, PhD

Scientific Director / Senior Research Scientist

Bioinformatics Analysis Core

Genomics & Proteomics Core Laboratories

University of Pittsburgh

Pittsburgh, PA

May 1, 2014

Two Parts

• Identifying sites with low genotypic signal increases concordance among variant callers

• Hazards in finding differentially expressed genes in RNASeq – how to do it more robustly.

23andMe: High risk of RA and psoriasis

GTL: Low risk of RA and psoriasis

NYTimes Article, etc.

Data were from an Illumina HiSeq 2000

Among-method average concordance: 57.5% overall; 32.7% at high coverage

O’Rawe et al.

TRUTH (BIOLOGICAL MOLECULAR SEQUENCE)

SEQUENCER

MAPPER

VARIANT CALLERS

LOW CONCORDANCE (O’Rawe et al., 2013)

Consensus Analysis (e.g., 2/3, 3/4, set analysis)

Information Theory (-> modeling)

Improve Callers (fix errors, modeling)

Bake Offs

Simulations

Spiked Ins

Entropy of Base Distributions

[Figure: three base distributions over A, T, C, G, labeled in order: low entropy, high enthalpy; low entropy, high enthalpy; high entropy, low enthalpy]

Boltzmann Entropy

• s = k ln w (Planck)

• w = antiln(s/k)

http://schneider.ncifcrf.gov/images/boltzmann/boltzmann-tomb-4.html

Rank Sorted Distribution of w (O’Rawe et al. data)

Homozygotes w = 1

Heterozygotes w = 2

Example w Density Distribution

w and FBVC

A    T    C    G    w         pw      Zygosity       Genotype
200  0    0    0    1         0       Homozygote     AA
16   158  13   13   2.102558  0       Homozygote     TT
100  100  0    0    2         0       Heterozygote   AT
58   30   1    111  2.768507  0       Heterozygote   AG
28   80   14   78   3.303636  0       Heterozygote   TG
76   38   29   57   3.758733  0       Heterozygote   AG
33   49   60   58   3.895496  0.0126  Heterozygote?  CG?
50   50   50   50   4         1       noise          unknown
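The slides do not give the exact formula for w. A minimal sketch, assuming w is the antilog of the Shannon entropy (in nats) of the base counts at a site, which follows the Boltzmann relation w = antiln(s/k) with k = 1. The function name effective_states is hypothetical. This reading reproduces the anchor rows of the table (one base gives 1, two equal bases give 2, four equal bases give 4), but some intermediate rows differ slightly (58/30/1/111 gives about 2.71 here versus 2.768507 on the slide), so the authors' actual calculation may differ.

```python
import math

def effective_states(counts):
    """Return exp(H), where H is the Shannon entropy (nats) of the base counts.

    Assumption: this is one plausible reading of w, not the authors' code.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no reads at this site")
    # Zero counts contribute nothing to the entropy, so they are skipped.
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return math.exp(entropy)

# Anchor rows from the slide table, as (A, T, C, G) read counts.
for counts in [(200, 0, 0, 0), (100, 100, 0, 0), (50, 50, 50, 50)]:
    print(counts, round(effective_states(counts), 6))
```

Under this reading, w behaves like an effective number of bases present at the site, which is why homozygotes cluster near 1 and heterozygotes near 2.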

Operational* Equiprobable Null Distribution

{f(A) = f(T) = f(G) = f(C)}

Convergence of significance (pw)
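How pw is computed is not shown on the slides. A rough sketch, assuming pw is a Monte Carlo p-value: the fraction of simulated sites, drawn from the equiprobable null at the same read depth, whose w is no larger than the observed w. The names effective_states and pw_monte_carlo, the simulation count, and the seed are all hypothetical; w is again taken to be the antilog of the Shannon entropy, which is itself an assumption.

```python
import math
import random
from collections import Counter

def effective_states(counts):
    """exp(Shannon entropy in nats) of the base counts (assumed form of w)."""
    total = sum(counts)
    entropy = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return math.exp(entropy)

def pw_monte_carlo(counts, n_sim=2000, seed=0):
    """Estimate P(w_null <= w_observed) under f(A) = f(T) = f(G) = f(C)."""
    rng = random.Random(seed)
    depth = sum(counts)
    w_obs = effective_states(counts)
    hits = 0
    for _ in range(n_sim):
        # Draw `depth` reads with each base equally likely.
        draw = Counter(rng.choices("ATCG", k=depth))
        # Small tolerance guards against floating-point ties.
        if effective_states(list(draw.values())) <= w_obs + 1e-12:
            hits += 1
    return hits / n_sim

print(pw_monte_carlo((200, 0, 0, 0)))    # clean homozygote
print(pw_monte_carlo((50, 50, 50, 50)))  # pure noise
```

As more simulations are run, the estimate settles down, which may be what the "convergence of significance" slide illustrates; that connection is a guess.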

What We Expect

TRUTH (BIOLOGICAL MOLECULAR SEQUENCE)

SEQUENCER

MAPPER

VARIANT/BASE CALLERS

Genotypic Signal Filtering

INCREASED CONCORDANCE

pHom Function

Caller    Sites     Concordance w/ FBVC  Hom    Het
gatk      ALL       0.5762               11868  17670
gatk      pw<=0.05  0.9976               11282  5676
gatk      pw>0.05   0.0074               586    11994
samtools  ALL       0.5649               11541  18799
samtools  pw<=0.05  0.9917               11489  5761
samtools  pw>0.05   0.0002               52     13038
snver     ALL       0.6006               11904  16729
snver     pw<=0.05  0.9934               11812  5470
snver     pw>0.05   0.0007               92     11259

From the O’Rawe et al. generated results

FBVC = frequency-based variant caller (Lyons-Weiler et al.)
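The stratified comparison in the table above can be sketched as follows. This is an illustration only: the site names, genotype calls, and pw values are made-up toy data, and the helper concordance is hypothetical. It shows concordance between two call sets computed over all sites, then separately for sites with pw <= 0.05 and pw > 0.05.

```python
def concordance(calls_a, calls_b, sites):
    """Fraction of the given sites where the two callers report the same genotype."""
    sites = list(sites)
    if not sites:
        return float("nan")
    agree = sum(1 for s in sites if calls_a[s] == calls_b[s])
    return agree / len(sites)

# Hypothetical toy data: genotype calls per site from two callers, plus pw per site.
fbvc = {"site1": "AA", "site2": "AT", "site3": "CG", "site4": "TT"}
other = {"site1": "AA", "site2": "AT", "site3": "CC", "site4": "TT"}
pw = {"site1": 0.0, "site2": 0.0, "site3": 0.40, "site4": 0.01}

high_signal = [s for s in fbvc if pw[s] <= 0.05]
low_signal = [s for s in fbvc if pw[s] > 0.05]

print("ALL     ", concordance(fbvc, other, fbvc))
print("pw<=0.05", concordance(fbvc, other, high_signal))
print("pw>0.05 ", concordance(fbvc, other, low_signal))
```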

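Comparison    Tx                                     Signal    %Concordance
FBVC_vs_FBVC  Marked                                 ALL       85.64
FBVC_vs_FBVC  Marked                                 pw<=0.05  91.08
FBVC_vs_FBVC  Marked                                 pw>0.05   35.66
FBVC_vs_FBVC  Realigned                              ALL       83.82
FBVC_vs_FBVC  Realigned                              pw<=0.05  91.69
FBVC_vs_FBVC  Realigned                              pw>0.05   28.21
FBVC_vs_FBVC  Recalibrated                           ALL       93.14
FBVC_vs_FBVC  Recalibrated                           pw<=0.05  ***99.39
FBVC_vs_FBVC  Recalibrated                           pw>0.05   48.53
FBVC_vs_FBVC  Reduced                                ALL       21.54
FBVC_vs_FBVC  Reduced                                pw<=0.05  24.57
FBVC_vs_FBVC  Reduced                                pw>0.05   4.25
FBVC_vs_FBVC  Marked-Realigned                       ALL       76.91
FBVC_vs_FBVC  Marked-Realigned                       pw<=0.05  86.11
FBVC_vs_FBVC  Marked-Realigned                       pw>0.05   15.44
FBVC_vs_FBVC  Marked-Realigned-Recalibrated          ALL       76.73
FBVC_vs_FBVC  Marked-Realigned-Recalibrated          pw<=0.05  85.99
FBVC_vs_FBVC  Marked-Realigned-Recalibrated          pw>0.05   15.34
FBVC_vs_FBVC  Marked-Realigned-Recalibrated-Reduced  ALL       19.98
FBVC_vs_FBVC  Marked-Realigned-Recalibrated-Reduced  pw<=0.05  22.9
FBVC_vs_FBVC  Marked-Realigned-Recalibrated-Reduced  pw>0.05   2.66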

TRUTH (BIOLOGICAL MOLECULAR SEQUENCE)

SEQUENCER

MAPPER

VARIANT CALLERS

LOW CONCORDANCE (O’Rawe et al., 2013)

Consensus Analysis (e.g., 2/3, 3/4, set analysis)

Information Theory (-> modeling)

Improve Callers (fix errors, modeling)

Bake Offs

Simulations

Spiked Ins

Lifescope reads (red)

Shrimp2 reads (blue)

Mappers must be systematically evaluated

Part 2: Good and Bad News for RNASeq (and everything else):

The Bad News:

Fold Change is Biased.

The Good News:

We have identified a much less biased method.

T-test is not appropriate for small N, large P data (such as RNASeq)

Fold Change > 2.0

Delta > 25

FC(A/B) is Blind to Large Portions of Your Data

FC(A/B)

Delta (and J5: Patel & Lyons-Weiler, 2004)

Ratios Are Hard to Interpret as Biological Differences

Gene A B delta (A-B) FC(A/B)

gene1 5 3 2 1.667

gene2 50 30 20 1.667

gene3 500 300 200 1.667

gene4 5000 3000 2000 1.667

gene5 50000 30000 20000 1.667

A-B is a difference.

A/B is a quotient.
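The table can be reproduced with a few lines of arithmetic. This sketch uses the five genes from the slide and the two cutoffs shown earlier in the talk (Fold Change > 2.0 and Delta > 25); the function name summarize is hypothetical. It shows that a fixed fold-change cutoff treats all five genes identically, while a delta cutoff separates them by the size of the absolute difference.

```python
# Expression values (A, B) for the five genes on the slide.
genes = {
    "gene1": (5, 3),
    "gene2": (50, 30),
    "gene3": (500, 300),
    "gene4": (5000, 3000),
    "gene5": (50000, 30000),
}

def summarize(a, b, fc_cutoff=2.0, delta_cutoff=25):
    """Return delta, fold change, and whether each cutoff selects the gene."""
    delta = a - b
    fc = a / b
    return delta, fc, fc > fc_cutoff, delta > delta_cutoff

for name, (a, b) in genes.items():
    delta, fc, by_fc, by_delta = summarize(a, b)
    print(name, delta, round(fc, 3), "FC>2.0:", by_fc, "Delta>25:", by_delta)

selected_by_fc = [n for n, (a, b) in genes.items() if summarize(a, b)[2]]
selected_by_delta = [n for n, (a, b) in genes.items() if summarize(a, b)[3]]
```

None of the genes passes the fold-change cutoff, even though gene5 differs by 20000 units between conditions.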

Log2 Transformation Does Not Help

Reveals Minor Delta (& J5) Bias

Pink = FC(A/B)

Black = Delta

G-Thresholding J5

FC Bias in Amyotrophic Lateral Sclerosis

[Figure: scatter plot; axis ticks run 0 to 350000 and 0 to 200000; labels: Control, ALS, DEGy, FCDEGy]

Black circles = FC(A/B). Pink = Gthr-J5 genes

FC(A/B) Bias in Alcohol-Induced Hepatitis

Conclusions

• Not all NGS/HTS sites have sufficient genotypic signal to warrant a base call. High coverage alone does not provide a solution.

• By measuring genotypic signal, we can determine which sites we can call with confidence.

• Fold change (FC(A/B)) is blind to highly expressed genes and should be abandoned as a measure of differential expression altogether, even for single-gene or single-protein studies!

• Published microarray data sets analyzed to date using only FC(A/B) are a gold mine for re-analysis using less biased methods.

Credits and Contact

• pw, pHom, etc.: James Lyons-Weiler, Alan Twaddle, Rahil Sethi.

– (MS in preparation)

– Our software is called Gconf (not yet available)

• Fold-Change Bias: James Lyons-Weiler, Tamanna Sultana, Rick Jordan, Rahil Sethi

– (Paper in review)

– For now, read

• Mariani TJ, Budhraja V, Mecham BH, Gu CC, Watson MA, Sadovsky Y. 2003. A variable fold change threshold determines significance for expression microarrays. FASEB J. 17:321-3. doi: 10.1096/fj.02-0351fje

• Pearson, K. 1897. On a form of spurious correlation that may arise when indices are used for the measurement of organs. Proc Roy Soc Lond 60:489-498 doi: 10.1098/rspl.1896.0076
