From calling bases to calling variants: Experiences with Illumina data

Gerton Lunter, Wellcome Trust Centre for Human Genetics

Post on 18-Jan-2016

TRANSCRIPT

Page 1: From calling bases to calling variants: Experiences with Illumina data

Gerton Lunter

Wellcome Trust Centre for Human Genetics

From calling bases to calling variants: Experiences with Illumina data

Page 2: From calling bases to calling variants: Experiences with Illumina data

This talk

Refresher: Illumina sequencing

QC
  What can go wrong
  Useful QC statistics

Read mapping
  Comparison of popular read mappers
  Stampy

Indel and SNP calling

(Some results: 1000 Genomes indel calls)

Page 3: From calling bases to calling variants: Experiences with Illumina data

7x Illumina GA-II
2x Roche 454
1x Illumina HiSeq 2000

Page 4: From calling bases to calling variants: Experiences with Illumina data

1. Refresher: Illumina sequencing

Page 5: From calling bases to calling variants: Experiences with Illumina data

Illumina sequencing

Page 6: From calling bases to calling variants: Experiences with Illumina data
Page 7: From calling bases to calling variants: Experiences with Illumina data
Page 8: From calling bases to calling variants: Experiences with Illumina data

Illumina sequencing

8 lanes … x 120 tiles x 108 bp x 2 reads …

= about 48 Gb of raw sequence

Page 9: From calling bases to calling variants: Experiences with Illumina data

2. QC

Page 10: From calling bases to calling variants: Experiences with Illumina data

Quality issues

Bases are identified by their fluorescent tag
  Overlapping emission spectra

Single base per cycle: reversible terminator chemistry
  Not perfect: a fraction lags, a fraction runs ahead: “Phasing”
  Limits read length

Optimizing yield: cluster density
  Higher densities mean more errors
  Above an optimum, yield decreases
  Partly a signal processing issue: software improvements

Low amounts of initial DNA
  Linker-linker hybrids; duplicated reads

Page 11: From calling bases to calling variants: Experiences with Illumina data

Overlapping fluorescence spectra

C/A and G/T overlap
(Most common mutations are transitions, A-G and C-T)

Rougemont et al, 2008

Page 12: From calling bases to calling variants: Experiences with Illumina data

Refresher: Phred scores

Phred score = -10 log10(probability of error)

10: 10% error probability
20: 1% error probability
30: 0.1% error probability (one in 1,000)

3 = 50%, 7 = 20%
13 = 5%, 17 = 2%
23 = 0.5%, 27 = 0.2%
33 = 0.05%, 37 = 0.02%
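The conversions on this slide can be checked with a few lines of Python. This is a minimal sketch of the standard Phred definition; the function names are illustrative and not taken from the talk.

```python
import math

def phred_to_error_prob(q):
    """Error probability for a Phred score q: p = 10^(-q/10)."""
    return 10.0 ** (-q / 10.0)

def error_prob_to_phred(p):
    """Phred score for an error probability p: q = -10 log10(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -10.0 * math.log10(p)

# Print the scores listed on the slide next to their error probabilities.
for q in (3, 7, 10, 13, 17, 20, 23, 27, 30, 33, 37):
    print(q, f"{phred_to_error_prob(q):.4%}")
```

The percentages on the slide are rounded: a score of 3 is 50.1% rather than exactly 50%.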

Page 13: From calling bases to calling variants: Experiences with Illumina data

Phasing

August 2009 June 2010

Page 14: From calling bases to calling variants: Experiences with Illumina data

Cluster density & other improvements

June 2010:

August 2009:

Page 15: From calling bases to calling variants: Experiences with Illumina data

Library complexity, duplicate reads

Some sequences are read several times:
  Low amount of initial material, many PCR copies
  Optical duplicates; secondary cluster seeding

Problem for variant calling
  Any PCR error will be seen twice: evidence for variant

Rate of duplicates is rarely >5%
  Criterion: both ends of a PE read map to matching location
  Can occur by chance, but low probability, except for very high coverage

Post processing: duplicate removal
  Standard processing step (e.g. Samtools, Picard)

Useful statistic:
  Duplicate fraction is approximately additive across lanes (same library)
  2x duplication fraction ≈ fraction of the library that was sequenced
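The duplicate criterion above (both ends of a paired-end read map to matching locations) can be sketched as follows. This is an illustration only, not how Samtools or Picard implement it: the input tuple layout and the function name are assumptions, and real tools also take strand and base quality into account when choosing which copy to keep.

```python
def mark_duplicates(pairs):
    """Split read pairs into unique pairs and duplicates.

    pairs: iterable of (name, chrom, start1, start2) tuples, a hypothetical
    layout giving the mapped start of each end of a paired-end read.
    A pair is called a duplicate if an earlier pair has the same key.
    """
    seen = set()
    unique, duplicates = [], []
    for name, chrom, start1, start2 in pairs:
        key = (chrom, start1, start2)
        if key in seen:
            duplicates.append(name)
        else:
            seen.add(key)
            unique.append(name)
    return unique, duplicates

pairs = [
    ("r1", "chr1", 100, 350),
    ("r2", "chr1", 100, 350),  # same ends as r1: duplicate
    ("r3", "chr1", 100, 360),  # one end differs: kept
]
unique, duplicates = mark_duplicates(pairs)
duplicate_fraction = len(duplicates) / len(pairs)
```

The resulting duplicate fraction is the statistic that the slide relates to the fraction of the library sequenced.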

Page 16: From calling bases to calling variants: Experiences with Illumina data

Library complexity, duplicate reads

Fraction α of all molecules is sequenced
Number of times a PCR copy is sequenced: Poisson(α)

n =          0         1           2               3               …
Poisson:     e^(-α)    α e^(-α)    α^2 e^(-α)/2    α^3 e^(-α)/6    …
Duplicates:  0         0           1               2               …

Expected fraction of duplicates: e^(-α) - 1 + α
As a fraction of all reads sequenced: (e^(-α) - 1 + α)/α = ½ α + …
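The closed form can be checked against a direct sum over the Poisson distribution, counting n - 1 duplicates for a molecule that is sequenced n times. This is a minimal numerical sketch; the function names are illustrative.

```python
import math

def duplicates_per_molecule(alpha):
    """Closed form e^(-alpha) - 1 + alpha (expm1 avoids cancellation)."""
    return math.expm1(-alpha) + alpha

def duplicates_by_summation(alpha, n_max=60):
    """Direct sum of (n - 1) * Poisson(n; alpha) over n = 1 .. n_max."""
    term = math.exp(-alpha)      # Poisson probability of n = 0
    total = 0.0
    for n in range(1, n_max + 1):
        term *= alpha / n        # Poisson probability of n
        total += (n - 1) * term
    return total

def duplicate_fraction_of_reads(alpha):
    """Duplicates as a fraction of all reads; about alpha / 2 for small alpha."""
    return duplicates_per_molecule(alpha) / alpha

for alpha in (0.02, 0.1, 0.5):
    print(alpha, duplicate_fraction_of_reads(alpha), alpha / 2)
```

For small α the two printed columns agree closely, which is the basis for the rule of thumb that twice the duplicate fraction approximates the fraction of the library sequenced.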

Page 17: From calling bases to calling variants: Experiences with Illumina data

Sequencing QC

Page 18: From calling bases to calling variants: Experiences with Illumina data

QC statistics

Page 19: From calling bases to calling variants: Experiences with Illumina data

QC statistics - coverage

Page 20: From calling bases to calling variants: Experiences with Illumina data

QC statistics – quality scores

Page 21: From calling bases to calling variants: Experiences with Illumina data

GATK recalibration tool

Page 22: From calling bases to calling variants: Experiences with Illumina data

3. Read mapping

Page 23: From calling bases to calling variants: Experiences with Illumina data

Read mapping

First processing step after sequencing:
  Read mapping (most times)
  Assembly (no reference sequence; specialized analyses)

Quality of mapping determines downstream results
  Accessible genome
  Biases (ref vs. variant)
  Sensitivity (divergent reference; SNPs, indels, SV)
  Specificity (calibration of mapping quality)

Page 24: From calling bases to calling variants: Experiences with Illumina data

Read mapper comparison

Read mappers:
  Maq
  BWA
  Eland
  Novoalign
  Stampy

Criteria:
  Sensitivity (overall; divergent reference; variants)
  Specificity (mapping quality calibration)
  Speed

Page 25: From calling bases to calling variants: Experiences with Illumina data

Sensitivity

Page 26: From calling bases to calling variants: Experiences with Illumina data

Sensitivity - indels

Page 27: From calling bases to calling variants: Experiences with Illumina data

Sensitivity – Divergent reference

Page 28: From calling bases to calling variants: Experiences with Illumina data

Specificity – ROC curves

ROC - indels

Page 29: From calling bases to calling variants: Experiences with Illumina data

Performance on real data

[Figure; axis label: Proportion mapped to within 10 kb of mate]

Page 30: From calling bases to calling variants: Experiences with Illumina data

Efficiency

Page 31: From calling bases to calling variants: Experiences with Illumina data

Stampy – first part of algorithm

read
  → 15 bp subsequence
  → remove rev-comp symmetry
  → 29 bit word
  → hash table: 4 bytes x 2^29 entries (2 GB); open addressing, cache-friendly
  → candidate positions
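One way to see why a 15 bp subsequence fits in 29 bits once reverse-complement symmetry is removed: a 15-mer needs 30 bits at 2 bits per base, and because 15 is odd the middle base of a k-mer always differs from the middle base of its reverse complement. Picking the strand whose middle base is A or C saves one bit. The sketch below illustrates that counting argument; it is an assumption made for illustration and is not claimed to be the encoding Stampy actually uses.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # complement of code c is 3 - c
K = 15

def reverse_complement(kmer):
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(kmer))

def encode_kmer(kmer):
    """Map a 15-mer and its reverse complement to the same 29-bit word."""
    if len(kmer) != K:
        raise ValueError("expected a 15 bp subsequence")
    codes = [CODE[b] for b in kmer]
    mid = K // 2
    if codes[mid] >= 2:
        # Middle base is G or T: switch to the reverse-complement strand,
        # whose middle base is then C or A.
        codes = [3 - c for c in reversed(codes)]
    word = 0
    for i, c in enumerate(codes):
        if i == mid:
            word = (word << 1) | c   # middle base is 0 or 1: one bit
        else:
            word = (word << 2) | c   # other 14 bases: two bits each
    return word

kmer = "ACGTTGCATGCAAGT"
assert encode_kmer(kmer) == encode_kmer(reverse_complement(kmer))
assert encode_kmer(kmer) < 2 ** 29
```

A word produced this way could index a table of 2^29 entries directly, which matches the table size given on the slide.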

Page 32: From calling bases to calling variants: Experiences with Illumina data

Second part: Fast candidate alignment

Single-instruction-multiple-data (SIMD), parallel execution

Affine gap penalties.

Linear-time and constant memory algorithm: DP table in registers.

Maximum indel size 15 bp.
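To illustrate how an indel size limit and affine gap penalties fit together, here is a plain Python sketch of a banded affine-gap alignment (Gotoh-style recurrences restricted to a diagonal band). It is not Stampy's implementation: there is no SIMD, the penalty values are arbitrary assumptions, a band on the diagonal offset stands in for the 15 bp indel limit, and full-length rows are allocated for readability, so the sketch does not reach the linear-time, constant-memory behaviour described on the slide.

```python
INF = float("inf")

def banded_affine_cost(a, b, band=15, mismatch=1, gap_open=3, gap_extend=1):
    """Minimum global alignment cost of a and b within |i - j| <= band.

    A gap of length L costs gap_open + (L - 1) * gap_extend.
    M: last column is a match/mismatch; X: gap in b; Y: gap in a.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return INF
    M = [INF] * (m + 1)
    X = [INF] * (m + 1)
    Y = [INF] * (m + 1)
    M[0] = 0
    for j in range(1, min(m, band) + 1):
        Y[j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        new_M = [INF] * (m + 1)
        new_X = [INF] * (m + 1)
        new_Y = [INF] * (m + 1)
        for j in range(max(0, i - band), min(m, i + band) + 1):
            # a[i-1] aligned to a gap: come from cell (i-1, j).
            new_X[j] = min(M[j] + gap_open, Y[j] + gap_open, X[j] + gap_extend)
            if j > 0:
                sub = 0 if a[i - 1] == b[j - 1] else mismatch
                new_M[j] = min(M[j - 1], X[j - 1], Y[j - 1]) + sub
                # b[j-1] aligned to a gap: come from cell (i, j-1).
                new_Y[j] = min(new_M[j - 1] + gap_open,
                               new_X[j - 1] + gap_open,
                               new_Y[j - 1] + gap_extend)
        M, X, Y = new_M, new_X, new_Y
    return min(M[m], X[m], Y[m])
```

With these assumed penalties, a single 1 bp deletion costs 3 and a 2 bp deletion costs 4, so one longer gap is preferred over two separate ones.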

Page 33: From calling bases to calling variants: Experiences with Illumina data

Third part: Modeling mapping failures

Pseudo-Bayesian posterior (using candidates, rather than all mapping positions)

Failure to find the correct candidate (2 or more mismatches in every 15 bp subsequence)

Sequence not in reference (is sequence match better than expected best random match?)
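The idea of a posterior computed over candidates only, with an extra allowance for the ways mapping can fail, might be sketched like this. The function, its parameters, and the single lumped "missed" term are assumptions made for illustration; the slide does not give the actual formula.

```python
import math

def mapping_quality(candidate_likelihoods, missed_likelihood=0.0, cap=60):
    """Phred-scaled probability that the best candidate is wrong.

    candidate_likelihoods: likelihood of the read at each candidate position.
    missed_likelihood: hypothetical lumped weight for failure modes, i.e. the
    correct position was never proposed or the sequence is not in the reference.
    """
    best = max(candidate_likelihoods)
    total = sum(candidate_likelihoods) + missed_likelihood
    p_wrong = (total - best) / total
    if p_wrong <= 0.0:
        return cap
    return min(cap, round(-10.0 * math.log10(p_wrong)))

# Two equally good candidates: the best one is right half the time.
print(mapping_quality([1.0, 1.0]))
# A single candidate, with 1% of the weight reserved for mapping failures.
print(mapping_quality([0.99], missed_likelihood=0.01))
```

Without the failure term, a read with a single candidate would always receive the maximum mapping quality, which is the overconfidence this part of the algorithm is meant to avoid.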

Page 34: From calling bases to calling variants: Experiences with Illumina data

4. SNP and indel calling

Page 35: From calling bases to calling variants: Experiences with Illumina data

SNP calling

General idea:

Works quite well! Some caveats:
  Include mapping quality:
    P(read | g) = P(read | wrong map) P(wrong map) + P(read | g, correct map) P(correct map)
  Mapping errors are dependent: don’t include mapQ<10
  Base errors are not uniform (A/C/G/T): assume worst case (all identical)
  Assumes no anomalies (seg dups; alignments; indel/SV; …)

Hard problem: be conservative
  Expected SNP rate (human): 10^-3/nt. FPR of 10^-5 required for 1% FDR
  Filtering is required to achieve good FDR, or all data features must be adequately modeled
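The mapping-quality mixture and the FDR arithmetic above can be made concrete with a small sketch. Several details are assumptions for illustration: a diploid, biallelic site, a uniform 1/4 base probability for wrongly mapped reads, independent reads, and the worst-case reading of the base error model in which every error produces the same base. The function names and the read tuple layout are hypothetical.

```python
import math

def phred_to_prob(q):
    return 10.0 ** (-q / 10.0)

def read_likelihood(base, base_q, map_q, ref, alt, alt_copies):
    """P(read | g) as a mixture over wrong and correct mapping."""
    e = phred_to_prob(base_q)
    p_wrong_map = phred_to_prob(map_q)

    def p_base_given_allele(allele):
        # Worst case: an error always yields the same base, so the full
        # error probability e goes to the observed mismatching base.
        return 1.0 - e if base == allele else e

    frac_alt = alt_copies / 2.0
    p_correct = (frac_alt * p_base_given_allele(alt)
                 + (1.0 - frac_alt) * p_base_given_allele(ref))
    return 0.25 * p_wrong_map + p_correct * (1.0 - p_wrong_map)

def genotype_log_likelihoods(reads, ref, alt, min_map_q=10):
    """Log-likelihoods for 0, 1, 2 alt copies; reads are (base, baseQ, mapQ)."""
    used = [r for r in reads if r[2] >= min_map_q]   # drop mapQ < 10
    return [sum(math.log(read_likelihood(b, bq, mq, ref, alt, k))
                for b, bq, mq in used)
            for k in (0, 1, 2)]

def expected_fdr(fpr, snp_rate, sensitivity=1.0):
    """False discovery rate implied by a per-site false positive rate."""
    false_calls = fpr * (1.0 - snp_rate)
    true_calls = sensitivity * snp_rate
    return false_calls / (false_calls + true_calls)

reads = [("A", 30, 60)] * 5 + [("G", 30, 60)] * 5
print(genotype_log_likelihoods(reads, ref="A", alt="G"))
print(expected_fdr(fpr=1e-5, snp_rate=1e-3))
```

The second printed value is close to 0.01, matching the slide: with one true SNP per 1,000 sites, a false positive rate of 10^-5 per site gives about 1% false discoveries.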

Page 36: From calling bases to calling variants: Experiences with Illumina data

Indel calling

General idea:

Differences with SNP calling:
  Pseudo-Bayes: cannot consider all possible variants/genotypes
    Generate large set of candidates
    Filter using goodness-of-fit test

Illumina reads do not have an explicit indel error model

Page 37: From calling bases to calling variants: Experiences with Illumina data

Indel error model

Homopolymer run length

Page 38: From calling bases to calling variants: Experiences with Illumina data

Wrap up

GA-II produces large amounts of good data

Artefacts do occur; keep an eye on QC statistics

Choice of mapper influences yield and quality

Variant calling:
  Bayesian approaches work well
  Some assumptions (independence) not met, hard to model
  Filtering remains necessary