
VARiD: A Variation Detection Framework for Color-space and Letter-space platforms

By A.V. Dalca, S. M. Rumble, S. Levy, M. Brudno

Presented by Velian Pandeliev

VARiD Overview

• Purpose: Variation detection (SNPs, indels)
• Pitch: First to use both colour-space and letter-space data
• Principle: Hidden Markov Model with the Forward-Backward algorithm
• Platform: 454/Roche, Solexa, ABI SOLiD
• Pros: Can work with unconverted sets of both formats simultaneously
• Performance: Linear in the length of the reference; great on mixed-format data

ABI SOLiD Basics

• Reads bases two at a time
• Outputs one of four colours based on a transition state machine.
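The two-base encoding can be sketched in a few lines. This is a minimal illustration, assuming the commonly described SOLiD scheme in which each base gets a 2-bit code (A=0, C=1, G=2, T=3) and the colour is the XOR of the two codes. That scheme agrees with the examples later in these slides (state CA emits colour 1, A followed by G emits colour 2). The function names are made up for this sketch.

```python
# Assumed 2-bit codes; the colour of a base pair is the XOR of the two codes.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def colour(first: str, second: str) -> int:
    """Colour emitted for two adjacent bases."""
    return CODE[first] ^ CODE[second]


def encode(seq: str) -> list:
    """Translate a letter-space sequence into its list of colours."""
    return [colour(a, b) for a, b in zip(seq, seq[1:])]


if __name__ == "__main__":
    print(colour("C", "A"))  # colour for the pair CA
    print(encode("ACGT"))    # one colour per adjacent pair
```

A sequence of n bases yields n - 1 colours, so decoding back to letters needs one known starting base.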

ABI SOLiD Properties

• Read errors and SNPs present differently: relative to the reference, a single read error changes one colour, whereas an isolated SNP changes two adjacent colours.

ABI SOLiD Properties

• A read error propagates through the rest of the sequence on translation to letter-space
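A small experiment makes both properties concrete. The sketch below assumes the XOR-based colour encoding (A=0, C=1, G=2, T=3); the reference sequence and the position of the simulated error are arbitrary choices for illustration.

```python
# Assumed 2-bit codes; colour = XOR of the codes of two adjacent bases.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def encode(seq):
    """Letter space -> colour space."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def decode(first_base, colours):
    """Colour space -> letter space, starting from a known first base."""
    seq = [first_base]
    for c in colours:
        seq.append(BASE[CODE[seq[-1]] ^ c])
    return "".join(seq)


reference = "ACGTTGCA"           # arbitrary example sequence
ref_colours = encode(reference)

# Simulate one mis-read colour, then translate the read to letter space.
read = ref_colours.copy()
read[2] = (read[2] + 1) % 4
translated = decode(reference[0], read)
wrong = [i for i, (a, b) in enumerate(zip(reference, translated)) if a != b]

# Simulate a SNP (T -> A at index 3) and compare in colour space.
snp_colours = encode("ACGATGCA")
changed = [i for i, (a, b) in enumerate(zip(ref_colours, snp_colours)) if a != b]

if __name__ == "__main__":
    print(translated, wrong)  # every base after the error is affected
    print(changed)            # the SNP touches two adjacent colours
```

One wrong colour corrupts every later base after translation, while a real SNP shows up as exactly two adjacent colour changes.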

Consequences

• Colour-space encoding is better suited to calling SNPs than letter-space encoding

• In letter-space data, errors do not propagate through to the rest of the read

Wouldn’t it be great to have a SNP calling framework that could use both kinds of data!?

VARiD

• A Hidden Markov Model for Variation Detection

In general, HMMs have the following elements:
- States (hidden)
- Transitions (probabilities of reaching any particular state from the previous one)
- Emissions (observed outputs)

Building a Basic HMM

States: pairs of consecutive letter-space positions:

S = {AA, AT, AC, AG,
     TT, TA, TC, TG,
     CC, CA, CT, CG,
     GG, GA, GT, GC}

Building a Basic HMM

Transitions: since consecutive states share a nucleotide, probabilities are defined as follows:

$$
P(\text{transition } WX \to YZ) =
\begin{cases}
\text{frequency}(Z) & \text{if } X = Y \\
0 & \text{if } X \neq Y
\end{cases}
$$
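The transition rule translates directly into code. This is a sketch only: the uniform base frequencies are an assumption made for the example, not values from the paper.

```python
from itertools import product

BASES = "ACGT"
# The 16 basic states: every ordered pair of nucleotides.
STATES = ["".join(p) for p in product(BASES, repeat=2)]


def transition(prev, nxt, freq):
    """P(WX -> YZ) = frequency(Z) if X == Y, otherwise 0."""
    return freq[nxt[1]] if prev[1] == nxt[0] else 0.0


# Assumed (hypothetical) base frequencies.
uniform = {b: 0.25 for b in BASES}

# All transitions out of state CA: only states starting with A are reachable.
row = {s: transition("CA", s, uniform) for s in STATES}

if __name__ == "__main__":
    print({s: p for s, p in row.items() if p > 0})
```

Each state has exactly four reachable successors, so every row of the transition matrix sums to 1 whenever the base frequencies do.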

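Building a Basic HMM

Emissions: a letter and a colour from donor reads at each state.

E.g., for colour space:

$$
P(\text{emission} = c \mid \text{state} = CA) = q(c \mid CA) =
\begin{cases}
1 - 3\varepsilon & \text{if } c \text{ is } 1 \\
\varepsilon & \text{if } c \text{ is } 0, 2, 3
\end{cases}
$$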

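Building a Basic HMM

Emissions: a letter and a colour from donor reads at each state.

E.g., for letter space:

$$
P(\text{emission} = n \mid \text{state} = CA) = q(n \mid CA) =
\begin{cases}
1 - 3\xi & \text{if } n \text{ is } A \\
\xi & \text{if } n \text{ is } C, G, T
\end{cases}
$$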

Building a Basic HMM

Emission probabilities from all reads:

$$
P(\text{emissions} = E \mid \text{state} = s) = q(E \mid s) = \prod_{c \in E} q(c \mid s) \prod_{n \in E} q(n \mid s)
$$

which combines colour- and letter-space data.
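The combined emission probability is a product over every colour and every letter observed at a position. The sketch below assumes the XOR colour encoding (A=0, C=1, G=2, T=3) and uses placeholder error rates for ε and ξ; both values are hypothetical.

```python
# Assumed 2-bit codes; a state's colour is the XOR of its two bases.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def q_colour(c, state, eps):
    """q(c|s): 1 - 3*eps for the colour the state implies, eps otherwise."""
    expected = CODE[state[0]] ^ CODE[state[1]]
    return 1 - 3 * eps if c == expected else eps


def q_letter(n, state, xi):
    """q(n|s): 1 - 3*xi for the state's second nucleotide, xi otherwise."""
    return 1 - 3 * xi if n == state[1] else xi


def q_emissions(colours, letters, state, eps=0.01, xi=0.01):
    """q(E|s): product over all colour-space and letter-space observations.

    eps and xi are illustrative defaults, not values from the paper.
    """
    p = 1.0
    for c in colours:
        p *= q_colour(c, state, eps)
    for n in letters:
        p *= q_letter(n, state, xi)
    return p


if __name__ == "__main__":
    # Two colour reads and one letter read covering the same position.
    print(q_emissions([1, 1], ["A"], "CA"))
    print(q_emissions([1, 1], ["A"], "CG"))
```

Because the two products are independent, a position covered only by colour-space reads or only by letter-space reads needs no special handling: the missing product is simply 1.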

Building a Basic HMM

Detecting variation is accomplished through finding the maximum likelihood state for each position in the genotype (the donor) and comparing it against the reference nucleotide.

Building a Basic HMM

Source: Dalca, A. & Brudno, M. (Poster)

By running the Forward-Backward algorithm on the HMM, a probability distribution over the possible states is obtained at each position, and a base is called (in bold).
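Posterior decoding on the basic 16-state model can be sketched as follows. This is a toy version under stated assumptions, not VARiD's implementation: it uses the XOR colour encoding, uniform base frequencies, fixed placeholder error rates, and a made-up data layout in which each position carries a list of observed colours and a list of observed letters.

```python
from itertools import product

BASES = "ACGT"
CODE = {b: i for i, b in enumerate(BASES)}        # assumed 2-bit codes
STATES = ["".join(p) for p in product(BASES, repeat=2)]


def emission(obs, state, eps=0.05, xi=0.05):
    """q(E|s) for one position; obs = (colours, letters). Rates are placeholders."""
    colours, letters = obs
    expected = CODE[state[0]] ^ CODE[state[1]]
    p = 1.0
    for c in colours:
        p *= 1 - 3 * eps if c == expected else eps
    for n in letters:
        p *= 1 - 3 * xi if n == state[1] else xi
    return p


def transition(prev, nxt, freq=0.25):
    """frequency(Z) if the states overlap, else 0 (uniform frequencies assumed)."""
    return freq if prev[1] == nxt[0] else 0.0


def normalise(v):
    total = sum(v.values())
    return {s: p / total for s, p in v.items()}


def forward_backward(observations):
    """Return the posterior distribution over states at every position."""
    n = len(observations)
    # Forward pass, rescaled at each step to avoid underflow.
    fwd = [normalise({s: emission(observations[0], s) for s in STATES})]
    for t in range(1, n):
        prev = fwd[-1]
        fwd.append(normalise({
            s: emission(observations[t], s)
               * sum(prev[r] * transition(r, s) for r in STATES)
            for s in STATES}))
    # Backward pass, also rescaled.
    bwd = [dict.fromkeys(STATES, 1.0) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        bwd[t] = normalise({
            r: sum(transition(r, s) * emission(observations[t + 1], s) * bwd[t + 1][s]
                   for s in STATES)
            for r in STATES})
    return [normalise({s: fwd[t][s] * bwd[t][s] for s in STATES}) for t in range(n)]


# Toy donor ACGT seen through mixed reads; position 1 has one colour error (0).
observations = [([1, 1], ["C"]), ([3, 3, 0], ["G", "G"]), ([1], ["T"])]
posteriors = forward_backward(observations)
calls = [max(p, key=p.get) for p in posteriors]

if __name__ == "__main__":
    print(calls)
```

The called state at each position is the one with the highest posterior; comparing its nucleotides with the reference is what flags a candidate variant.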

Extensions

The HMM described above is quite simple and only calls a single nucleotide for each position.

VARiD extends the model to detect heterozygous SNPs, as well as to handle indels.

Microindels

To deal with microindels (<5 bp) in the sample, gap states are required.

E.g. [A - - - G] (would emit colour 2)
- 4 dummy ‘gap’ nucleotides are defined, one each for A, C, G, T
- [A - - - G] = {(A, gap-A), (gap-A, gap-A), (gap-A, gap-A), (gap-A, G)}

Microindels

Requires 24 more states:
- (X, gapX) × 4
- (gapX, gapX) × 4
- (gapX, Y) × 16
- Total (incl. the original 16): 40 states
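The state count can be checked by enumerating the states explicitly. The `gap-X` labels follow the naming on the slide; the variable names are made up for this sketch.

```python
from itertools import product

BASES = "ACGT"
GAPS = [f"gap-{b}" for b in BASES]          # one dummy gap nucleotide per base

base_states = list(product(BASES, repeat=2))              # (X, Y): 16 states
open_gap = [(b, f"gap-{b}") for b in BASES]               # (X, gapX): 4 states
extend_gap = [(g, g) for g in GAPS]                       # (gapX, gapX): 4 states
close_gap = [(g, b) for g in GAPS for b in BASES]         # (gapX, Y): 16 states

all_states = base_states + open_gap + extend_gap + close_gap

if __name__ == "__main__":
    print(len(open_gap) + len(extend_gap) + len(close_gap))  # extra gap states
    print(len(all_states))                                   # total states
```

Tying each gap nucleotide to the base that preceded it lets the closing state (gapX, Y) emit the colour of the pair (X, Y), as in the [A - - - G] example.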

Heterozygous SNPs

• For diploid samples, each state has to account for heterozygous differences
• Each state in VARiD’s HMM is a unique combination of two of the original 40 states (obtained by S × S)
• 40² = 1600 states!

Features

• Keeps track of quality scores and positions within a read to augment HMM error rates (ε, ξ) for greater accuracy

• Post-processing ensures that all heterozygous SNP calls are supported by enough reads

Features

Source: Original paper

Features

• The first T in a read is NOT part of the sequence, and so NOT part of the genotype!
• VARiD eliminates the linker remnant without having to translate the read fully

VALiDation

• 260kb from the human genome
• Sequenced with ABI SOLiD and 454/Roche
• Reference obtained through Sanger reads
• Artificial datasets created with varying amounts of coverage
• Tested in colour-space alone (against Corona), letter-space alone (against gigaBayes) with various aligners, and with a combination of data

VALiDation

Measures:

• True Positives (correctly identified SNPs)
• False Positives (SNPs not in Sanger set)
• Precision (TP as fraction of all predictions)
• Recall (TP as fraction of Sanger set SNPs)

VALiDation

Colour space only

In colour space, VARiD had slightly higher precision than the Corona caller on AB-mapped reads, but comparable to slightly lower recall.

Using VARiD with SHRiMP produced a higher recall rate, but a lower precision when compared to VARiD + AB mapper.

(no significance statistics were presented)

VALiDation

Letter space only

In letter space, gigaBayes + mosaik performed better than VARiD (using the same mosaik mapper) at low coverage, but fell behind at higher coverage.

VARiD + SHRiMP did better than VARiD + mosaik at both low and high coverage, and clearly outperformed gigaBayes at 20x coverage.

VALiDation

Mixed space

VARiD’s true strength lies in being able to combine colour- and letter-space reads and to perform better on them than on cost-equivalent letter-only or colour-only data.

Issues

• No statistical significance presented on performance improvement
• Experimental size relatively small (260kb)
• Not ideal for low coverage data
• Would be interesting to see how VARiD performs on more diverse data sets (more/fewer SNPs, indels, etc.)

• Any more?

The End.

References

• Dalca, A.V., Rumble, S.M., Levy, S., Brudno, M. VARiD: A Variation Detection Framework for Color-space and Letter-space platforms. 2010 (in progress)

• Dalca, A.V. & Brudno, M. VARiD: Variation Detection in Color-space and Letter-space (poster)

• Hidden Markov model. (2010, February 2). In Wikipedia, The Free Encyclopedia. Retrieved 13:24, February 10, 2010, from http://en.wikipedia.org/w/index.php?title=Hidden_Markov_model&oldid=341442380

• Rumble, S.M., Lacroute, P., Dalca, A.V., Fiume, M., Sidow, A. and Brudno, M. (2009) SHRiMP: Accurate mapping of short color-space reads. PLoS Comput Biol.