varid : a variation detection framework for color- and letter-space platforms

31
VARiD: A Variation Detection Framework for Color- and Letter-Space platforms Adrian Dalca 1,2 , Steven Rumble 3 , Sam Levy 4 , Michael Brudno 2,5 1. Massachusetts Institute of Technology 2. University of Toronto 3. Stanford University 4. The Scripps Research Institute 5. Donnelly Centre and the Banting and Best Department of Medical Research

Upload: dung

Post on 23-Feb-2016

39 views

Category:

Documents


0 download

DESCRIPTION

VARiD : A Variation Detection Framework for Color- and Letter-Space platforms. Adrian Dalca 1,2 , Steven Rumble 3 , Sam Levy 4 , Michael Brudno 2,5. Massachusetts Institute of Technology University of Toronto Stanford University The Scripps Research Institute - PowerPoint PPT Presentation

TRANSCRIPT

Slide 1

VARiD: A Variation Detection Framework for Color- and Letter-Space platformsAdrian Dalca1,2, Steven Rumble3, Sam Levy4, Michael Brudno2,5Massachusetts Institute of TechnologyUniversity of TorontoStanford UniversityThe Scripps Research InstituteDonnelly Centre and the Banting and Best Department of Medical Research

1motivation | methods | results | summaryVariation detection from NGS readsReference: TCAGCATCGGCATCGACTGCACAGGACCAGTCGATCGAC

Donor: ??????????????????????????????????????? GCATCGACTGCA CGGGATCGACTGAligned reads: ATCCATTGCA GATCCACTGCAC Determine differences (variation) between reference and donorusing NGS reads of the donorMotivationColor-space and Letter-space platformsbring them togetherMethodsSummaryResults33motivation | methods | results | summarySequencing Platforms letter-space Sanger, 454, Illumina, etc> NC_005109.2 | BRCA1 SX3TCAGCATCGGCATCGACTGCACAGG color-space AB SOLiD less software tools available > NC_005109.2 | BRCA1 AF3T212313230313232121311120 many differences -> useful to combine this information sequencing biases inherent errors advantages4Mention that we use numbers for colors- Dont List Differences4

Color Spacemotivation | methods | results | summaryTranslation MatrixTranslation Automata5> T2123132303132321213111205Translating> T212313230313232121311120> T

Sequencing Error vs SNPSequencing Error> T212313230313232121311120> T212313230310232121311120> TCAGCATCGGCAAGCTGACGTGTCC

SNP> TCAGCATCGGCATCGACTGCACAGG> TCAGCATCGGCAGCGACTGCACAGG> T212313230312332121311120AGCTCAGCATCGGCATCGACTGCACAGGColor Spacemotivation | methods | results | summary6Known first letter part of the linker, not of the read6

Color Spacemotivation | methods | results | summary7 clear distinction between a sequencing error and a SNP can this help us in SNP detection? sounds like it! single color change error, 2 colors changed (likely) SNP.Easy snp callWell covered basesDifficult Casereference incolor-spacereadsposition

7Detection Heterozygous SNPs Homozygous SNPs Tri-allelic SNPs small indels account for various errors, quality values & misalignmentsMotivation variation caller to handle both letter-space & color-space readsMotivationmotivation | methods | results | summary8VARiD system to make inferences on the donor bases variation detection8Methods

Simple HMM Modelstates, emissions, transitions, FB

Extended HMM Modelgaps, diploids, exceptions

MotivationSummaryResults99Statistical model for a system - states

Assume that system is a Markov process with state unobserved. Markov Process: next state depends only on current state

We can observe the states emission (output)each state has a probability distribution over outputsHidden Markov Model (HMM)motivation | methods | results | summary10S1S2S3e1e2e1e2e1e210Hidden Markov Model (HMM)motivation | methods | results | summary11Apply HMM to variation detection: we dont know the state (donor), but we can observe some output determined by the state (aligned reads)11Hidden Markov Model (HMM)motivation | methods | results | summary12. . . . . B6 B7 B8 B9 . . . . .B6 B7B7 B8B8 B9AAACcolor 0color 1AAACcolor 0color 1AAACcolor 0color 1unknowndonor- Will go to states, emissions and transitions12Why pairs of letters? Handle colors. AA and TT gives the same colors. Cant just model colorsThe donor could be: letters: AA color 0 letters: AC color 1 : letters: TT color 016 combinations

Statesmotivation | methods | results | summary13B6 B7REDUNDANT?

13Transitionsmotivation | methods | results | summary14. . . B6 B7 B8 . . .B6 B7B7 B8AACAATTT::GACT::AATT.........Possible StatesTransitions only certain transitions allowed when allowed, p(Xt|Xt-1) = freq(Xt) each state depends only on the previous states (Markov Process)States 16 possible states only look at second letter- Will go to states, emissions and transitions14T01020100311223 T1030101311223 T20100311223ATTGCGCAATGCG TTGGGCAATGCGA GCGCACTGCGACUnknown genomeColor readsLetter readsEmissionsmotivation | methods | results | summary15..... B6B7B8B9 .....B7 B8color 0color 1AAACcolor 0AA15AAcolor 0color 1color 2color 3letters Aletters C1 3emissionprobabilityp(em|AA)letters T1- 3letters GEmission Probabilitiesmotivation | methods | results | summary16Same color emission distributionTTDifferent letter emission distributionTT16

E.g. For state CC:Combining emission probabilities probability that this state emitted these reads.motivation | methods | results | summaryEmission Probabilities17T01020100311223 T1030101311223 T20100311223ATTGCGCAATGCG TTGGGCAATGCGA GCGCACTGCGAC..... B6B7B8B9 .....17Summary

unknown state donor pair at location

transitions transition probabilities

emissions reads at location emission probabilities motivation | methods | results | summarySimple HMM18B6 B7AAACcolor 0color 118 Have set-up a form of an HMM run Forward-Backward algorithm get probability distribution over states at each positionAACAATTT::GACT::likely statemotivation | methods | results | summaryForward-Backward Algorithm19 Variation Detection:compare most likely state with reference:

ref: GCTATCCAdon: ...AT...probability- FB intuition !!!19Methods

Simple HMM Modelstates, emissions, transitions, FB

Extended HMM Modelgaps, diploids, exceptions

MotivationSummaryResults2020Simple HMM only detects homozygous SNPs

Extended HMM: short indels heterozygous SNPs complex error profiles & quality valuesmotivation | methods | results | summaryExtended HMM2121Expand states Have states that include gaps emit: gap or colorA----GAGTGT-T- Have larger states, for diploids Transitions built in similar fashion as before Same algorithm, but in all we have 1600 states with very sparse transitionsExpansion: Gaps and heterozygous SNPsmotivation | methods | results | summary2222 Emission probabilities Support quality values Use variable error rates for emissions Translate through the first color first color is incorrect letter-space signalDonor: ACAGCATCGGCATCGACTGC 1123132303123321213read: >T2123132303123321213 > C123132303123321213Expansionmotivation | methods | results | summary23 Post-process putative SNPs correlated adjacent errors may support het SNPs check putative SNPs know the error rate (= error rate at first color) note: not ok to translate the whole read due to effects of color-space error, but one letter is safe. handle like a normal letter-space emission23motivation | methods | results | summary

blue: varid stepsSummaryResultsMotivationMethodsSummary2525Resultsmotivation | methods | results | summary Human dataset from Harismendy et al, 2009. (NA17156,17275,17460,17773)454, SOLiD, Sanger

Color-space dataset: Compare random subsets: Corona (with AB mapper) VARiD (with SHRiMP) VARiD (with AB mapper)

Conclusions: the three pipelines perform very similarly. High-coverage results is as good as can be achieved 26F-measureSAY: Although VARiD should be used to combine datasets, for now we only tested it on purely color-space dataSAY: (most FN?) Low coverage for the 2nd allele- add a column for indels?

26Resultsmotivation | methods | results | summaryLetter-space dataset: Compare random subsets : GigaBayes (with Mosaik) VARiD (with SHRiMP) VARiD (with Mosaik)

Conclusion:the three pipelines perform very similarly. High-coverage results is as good as can be achieved 27F-measureSAY: Although VARiD should be used to combine datasets, for now we only tested it on purely color-space dataSAY: (most FN?) Low coverage for the 2nd allele- add a column for indels?

27letter-spaceF-meas.0x1x2.5x5x10xColor-space0x0.019.443.559.171.710x47.851.859.469.576.520x60.658.965.373.480.350x73.169.873.680.083.5100x75.675.277.982.786.0ResultsVARiD Motivation:Combining Letter-space and Color-space data to achieve increased accuracy in at-cost comparisonmotivation | methods | results | summary28Assuming a same-cost comparison of: 10x letter-space (LS) 100x color-space (CS) 5x LS and 50x CSVARiDSAY: Although VARiD should be used to combine datasets, for now we only tested it on purely color-space dataSAY: (most FN?) Low coverage for the 2nd allele- add a column for indels?

28SummaryMotivationMethodsResults2929Summary of VARiD HMM modeling underlying donor Treats color-space and letter-space together in the same framework no translation take advantage of each technologys properties accurately calls SNPs, short indels in both color- and letter-space improved results with hybrid data.Summary motivation | methods | results | summary Website: http://compbio.cs.utoronto.ca/varid (VARiD freely available) Contact: [email protected] ADJ SNP: would need an extra runStress: Fully Probabilistic

TOOK OUT ADJACENT SNPs example30Acknowledgementsmotivation | methods | results | summary Website: http://compbio.cs.utoronto.ca/varid (VARiD freely available) Contact: [email protected] Participant Support Award : ISCB, Department of Energy, and National Science Foundation Acknowledgements Steven Rumble, Sam Levy, Michael Brudno Misko DzambaFunding

For ADJ SNP: would need an extra runStress: Fully Probabilistic

TOOK OUT ADJACENT SNPs example31