Polymorphism discovery in next-generation re-sequencing data Gabor T. Marth Boston College Biology Department Illumina workshop, Washington, DC November 19-20, 2007


Page 1

Polymorphism discovery in next-generation re-sequencing data

Gabor T. Marth

Boston College Biology Department

Illumina workshop, Washington, DC

November 19-20, 2007

Page 2

Why do we care about genetic variations?

• they underlie phenotypic differences

• they cause inherited diseases

• they allow tracking ancestral human history

Page 3

Variation types

• Structural variations

• SNPs

• Epigenetic variations (Mikkelsen et al. Nature 2007)

Page 4

Sequence resources for polymorphism discovery

[Figure: bases per machine run (1 Mb to 1 Gb) versus read length (10 bp to 1,000 bp)]

• Illumina/Solexa, AB/SOLiD short-read sequencers (1-4 Gb in 25-50 bp reads)

• 454 pyrosequencer (100 Mb in ~250 bp reads)

• ABI capillary sequencer

Page 5

Resequencing-based SNP discovery

(i) base calling

(ii) micro-repeat analysis

(iii) read mapping (pair-wise alignment to genome reference)

(iv) read assembly

(v) SNP calling

(vi) SNP validation

(vii) data viewing, hypothesis generation

[Figure labels: REF (reference sequence), IND (individual reads)]

Page 6

Talk topics

Tools for resequencing read analysis

• base calling
• resequenceability analysis
• read mapping / alignment / assembly
• SNP calling
• structural variation discovery
• read data visualization

Data mining projects

• SNP and short-INDEL discovery in C. elegans
• Complete mutational profiling in Pichia stipitis

Page 7

Reference-guided read alignment

Reference-sequence guided assembly:

…they give you the pieces…

…AND the cover on the box

Page 8

Some pieces are easier to place than others…

…pieces with unique features

pieces that look like each other…

Page 9

Resequenceability: unique read placement

• Reads from repeats cannot be uniquely mapped

• RepeatMasker does not capture all repeats at the read length scale

• Near-perfect repeats can also be a problem because of sequencing errors and / or SNPs

[Figure: fraction of reads versus number of mismatches]
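As a rough illustration of unique read placement, the sketch below counts how many read-length fragments of a sequence occur exactly once. It is a toy under stated assumptions: only exact matches on the forward strand are considered (no mismatches, no reverse complement), and the function name `unique_fraction` and the example sequence are hypothetical.

```python
from collections import Counter


def unique_fraction(genome: str, read_len: int) -> float:
    """Fraction of read-length fragments that occur exactly once in the genome.

    Assumption: exact matching on the forward strand only.
    """
    fragments = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(fragments)
    unique = sum(1 for fragment in fragments if counts[fragment] == 1)
    return unique / len(fragments)


toy_genome = "ACGTACGTTTGACCA"  # "ACGT" occurs twice, so two 4-mers are not unique
print(unique_fraction(toy_genome, 4))  # 10 of the 12 fragments are unique
print(unique_fraction(toy_genome, 8))  # longer reads are all unique here
```

Longer fragments are more often unique, which is the intuition behind measuring resequenceability at the read-length scale rather than relying on a general repeat annotation.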

Page 10

Finding micro-repeats is not easy

• Hash based methods (fast but only work out to a couple of mismatches)
• Exact methods (very slow but find every repeat copy)
• Heuristic methods (fast but miss a fraction of the repeats)

Page 11

Presenting repeats for downstream analysis

masking bases

masking fragments

• bases in repetitive fragments may be resequenced with reads representing other, unique fragments

• fragment-level repeat annotations therefore spare a higher fraction of the genome than base-level repeat masking
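The difference between the two ways of presenting repeats can be sketched on a toy sequence. Under base-level masking, every base touched by a repeated fragment is lost; under fragment-level annotation, a base is lost only if no unique fragment covers it. The function name `masked_fractions` is hypothetical, and the sketch assumes exact matching on the forward strand only.

```python
from collections import Counter
from typing import Tuple


def masked_fractions(genome: str, read_len: int) -> Tuple[float, float]:
    """Return (base-level masked fraction, fragment-level lost fraction)."""
    n = len(genome)
    starts = range(n - read_len + 1)
    counts = Counter(genome[i:i + read_len] for i in starts)

    in_repeat = [False] * n  # base touched by at least one repeated fragment
    in_unique = [False] * n  # base covered by at least one unique fragment
    for i in starts:
        is_unique = counts[genome[i:i + read_len]] == 1
        flags = in_unique if is_unique else in_repeat
        for j in range(i, i + read_len):
            flags[j] = True

    base_level = sum(in_repeat) / n
    fragment_level = sum(1 for covered in in_unique if not covered) / n
    return base_level, fragment_level


base_level, fragment_level = masked_fractions("ACGTACGTTTGACCA", 4)
print(f"base-level masking loses {base_level:.2f} of the sequence")
print(f"fragment-level annotation loses {fragment_level:.2f} of the sequence")
```

In this toy case the repeated 4-mer removes 8 of 15 bases under base-level masking, while only one base has no unique fragment covering it.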

Page 12

Fragment level annotation is economical

[Figure: fraction of genome (0.00 to 0.40) versus number of mismatches allowed (0 to 2)]

Page 13

Paired-end reads will not make the question go away

Page 14

Read alignment

Page 15

Read alignment

• billions of sequences must be aligned

• INDELs require gapped alignment

• sequences, often from different machine types (ABI capillary, 454/GS20, 454/FLX, Illumina), must be assembled together

Page 16

MOSAIK: an anchored aligner / assembler

Step 1. initial short-hash scan for possible read locations

Step 2. evaluation of candidate locations with the Smith-Waterman (SW) method

Michael Stromberg
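The two steps can be illustrated with a generic seed-and-extend sketch. This is not MOSAIK's implementation: the function names, the hash size `k`, the scoring values, and the window padding are all assumptions chosen for a small runnable example.

```python
from collections import defaultdict


def build_hash_index(reference: str, k: int) -> dict:
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_starts(read: str, index: dict, k: int) -> set:
    """Step 1: each k-mer hit at read offset j proposes reference start pos - j."""
    starts = set()
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], ()):
            starts.add(pos - j)
    return starts


def smith_waterman_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Best local alignment score with a linear gap penalty (score only)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def align(read: str, reference: str, index: dict, k: int, pad: int = 2):
    """Step 2: score each candidate window with SW and keep the best one."""
    best = None
    for start in candidate_starts(read, index, k):
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        score = smith_waterman_score(read, reference[lo:hi])
        if best is None or score > best[1]:
            best = (start, score)
    return best


reference = "GATTACAGGCTTAACCGGTTAGCATCGA"
read = "GCTTAGCCGGTT"  # reference[8:20] with one substitution
index = build_hash_index(reference, 4)
print(align(read, reference, index, 4))
```

The hash scan is cheap and narrows the search to a few candidate windows, so the slower dynamic-programming step only runs where a read could plausibly belong.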

Page 17

MOSAIK – performance

• Solexa read alignments to C. elegans genome:
  100 million reads aligned in 95 minutes
  18,000 reads / second

• 454 reads to Pichia (yeast-size) genome:
  GS20: 2,000 reads / second
  FLX: 300 reads / second

• Solexa read alignments to masked human genome:
  40 seconds for 1 million reads
  18,000 reads / second
  5.5 GB RAM used (more for longer initial hash sizes)

Page 18

Polymorphism detection

• Goal: to discern true variation from sequencing error

sequencing error

polymorphism

Page 19

Using base quality values

• use base quality values to help us decide if mismatches are true polymorphisms or sequencing errors
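A base quality value on the PHRED scale encodes an error probability, Q = -10 log10(P_error), which is what makes it usable for weighing a mismatch. The sketch below shows the conversion in both directions. The helper `base_probabilities` additionally assumes that an error is equally likely to be any of the three other bases, which is a simplification and not something the slide states.

```python
import math

BASES = "ACGT"


def phred_to_error_prob(quality: float) -> float:
    """Q = -10 * log10(P_error)  =>  P_error = 10 ** (-Q / 10)."""
    return 10 ** (-quality / 10)


def error_prob_to_phred(p_error: float) -> float:
    return -10 * math.log10(p_error)


def base_probabilities(call: str, quality: float) -> dict:
    """Probability of each true base given the call and its quality.

    Assumption: the error probability is split evenly over the other three bases.
    """
    p_error = phred_to_error_prob(quality)
    return {b: (1 - p_error) if b == call else p_error / 3 for b in BASES}


print(phred_to_error_prob(20))      # Q20 is a 1 in 100 chance of a wrong call
print(base_probabilities("A", 30))  # a Q30 mismatch is hard to explain as error
```

A mismatch supported by a Q30 base is about 100 times less likely to be a sequencing error than one supported by a Q10 base.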

Page 20

Bayesian detection algorithm

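$$
P(\mathrm{SNP}) =
\frac{\displaystyle\sum_{\text{all variable } S}
      \frac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots
      \frac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)}\,
      P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
     {\displaystyle\sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]}
      \frac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots
      \frac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})}\,
      P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
$$

• Base call + base quality: P(S_i | R_i)

• Base composition: P_Prior(S_i)

• Expected polymorphism rate: P_Prior(S_1, ..., S_N)

• Depth of coverage: N

The numerator sums over the polymorphic combinations only; the denominator sums over all combinations, polymorphic and monomorphic. The result is the Bayesian posterior probability, i.e. the SNP score.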
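The SNP score is a ratio of two sums over base combinations: polymorphic combinations in the numerator, all combinations in the denominator. The sketch below evaluates it by brute force for a handful of haploid reads. It is a simplified illustration, not the PolyBayes code. The joint prior used here (monomorphic combinations share 1 - theta in proportion to base composition, polymorphic combinations share theta evenly) and the even split of error probability are assumptions made for the example, as are all function and parameter names.

```python
from itertools import product

BASES = "ACGT"


def base_posteriors(call: str, quality: float) -> dict:
    """P(S | R) for one read; errors are assumed evenly split over other bases."""
    p_error = 10 ** (-quality / 10)
    return {b: (1 - p_error) if b == call else p_error / 3 for b in BASES}


def snp_probability(calls, quals, theta=0.001, composition=None) -> float:
    """Posterior probability that the aligned column is polymorphic.

    theta is the assumed prior polymorphism rate; composition is the
    assumed prior base composition (uniform by default).
    """
    composition = composition or {b: 0.25 for b in BASES}
    n = len(calls)
    per_read = [base_posteriors(c, q) for c, q in zip(calls, quals)]
    n_polymorphic = 4 ** n - 4  # every combination except AAAA.., CCCC.., etc.

    numerator = 0.0
    denominator = 0.0
    for combo in product(BASES, repeat=n):
        variable = len(set(combo)) > 1
        if variable:
            joint_prior = theta / n_polymorphic
        else:
            joint_prior = (1 - theta) * composition[combo[0]]
        term = joint_prior
        for base, posterior in zip(combo, per_read):
            term *= posterior[base] / composition[base]
        denominator += term
        if variable:
            numerator += term
    return numerator / denominator


print(snp_probability("AAAA", [30, 30, 30, 30]))  # all reads agree
print(snp_probability("AACC", [30, 30, 30, 30]))  # two well-supported alleles
print(snp_probability("AAAC", [30, 30, 30, 5]))   # lone low-quality mismatch
```

The enumeration grows as 4^N, so this form is only practical for shallow columns; it is meant to make the roles of base quality, base composition, polymorphism rate, and depth visible.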

Page 21

The PolyBayes software

Marth et al. Nature Genetics 1999

http://bioinformatics.bc.edu/~marth/PolyBayes

Page 22

Data visualization

1. aid software development: integration of trace data viewing, fast navigation, zooming/panning

2. facilitate data validation (e.g. SNP validation): co-viewing of multiple read types, quality value displays

3. promote hypothesis generation: integration of annotation tracks

Weichun Huang

Page 23

SNP calling in short-read coverage

C. elegans reference genome (Bristol, N2 strain)

Pasadena, CB4858 (1 ½ machine runs)

Bristol, N2 strain (3 ½ machine runs)

• 5 runs (~120 million) Illumina reads from Wash. U. (Elaine Mardis)
• detect polymorphisms between the Pasadena and the Bristol strain
• aligned / assembled the reads (< 4 hours on 1 CPU)
• found 44,642 SNP candidates (2 hours on our 160-CPU cluster)
• SNP density: 1 in 1,630 bp (of non-repeat genome sequence)

Page 24

Polymorphism discovery in C. elegans

• SNP calling error rate very low:
  Validation rate = 97.8% (224/229)
  Conversion rate = 92.6% (224/242)
  Missed SNP rate = 3.75% (26/693)

• INDEL candidates validate and convert at similar rates to SNPs:
  Validation rate = 89.3% (193/216)
  Conversion rate = 87.3% (193/221)

[Figure panels: SNP, INS]

Page 25

Mutational profiling: deep 454/Illumina data

• collaboration with Doug Smith at Agencourt

• Pichia stipitis converts xylose to ethanol (bio-fuel production)

• one mutagenized strain had especially high conversion efficiency

• determine where the mutations were that caused this phenotype

• we resequenced the 15 Mb genome with 454, Illumina, and SOLiD reads

Pichia stipitis reference sequence

Image from JGI web site

Page 26

Mutational profiling: comparisons

Technology   Coverage   Nominal coverage   FP   FN   Total error
454/FLX      2 runs     12.9x              1    0    1
454/FLX      1 run      9.8x               6    1    7
Illumina     7 lanes    53.5x              0    0    0
Illumina     3 lanes    23.4x              0    0    0
Illumina     2 lanes    15.6x              2    0    2
Illumina     1 lane     7.6x               2    2    2
SOLiD        -          30.0x              0    0    0
SOLiD        -          20.0x              0    0    0
SOLiD        -          10.0x              0    0    0
SOLiD        -          8.0x               0    4    4
SOLiD        -          6.0x               0    6    6

Page 27

Our software is available for testing

http://bioinformatics.bc.edu/marthlab/Beta_Release

Page 28

Credits

http://bioinformatics.bc.edu/marthlab

Elaine Mardis (Washington University)

Doug Smith (Agencourt)

Research supported by: NHGRI (G.T.M.), BC Presidential Scholarship (A.R.Q.)

Derek Barnett

Eric Tsung

Aaron Quinlan

Damien Croteau-Chonka

Weichun Huang

Michael Stromberg

Chip Stewart

Michele Busby

Page 29

Resequencing of diploid individual genomes

Ind. 1

Ind. 2

Ind. 3

Ind. 4

Page 30

How do we find sequence variations?

• compare multiple sequences from the same genome region

Page 31

Resequencing applications of next-gen sequencers

Polymorphism discovery:
• organismal SNP discovery
• complete mutational profiling
• individual human resequencing for SNP, INDEL and structural variation discovery

Emerging applications:
• DNA-protein interaction analysis (ChIP-Seq)
• epigenetic analysis (methylation profiling)
• novel transcript discovery
• quantification of gene expression

[Figure: reference genome versus resequenced individual, with a deletion (DEL) and a SNP marked]

Page 32

Task 5. Dealing with massive data volumes

• two connected working groups to define standard data formats

Short-read format working group

[email protected] (Asim Siddiqui, UBC)

Assembly format working group

Boston College

http://assembly.bc.edu

Page 33

SNP calling in low 454 coverage

• with Andy Clark (Cornell) and Elaine Mardis (Wash. U.)
• 10 different African and American D. melanogaster isolates
• 10 runs of 454 reads (~300,000 reads per isolate) (~1.5X total)
• can we detect SNPs in survey-style 454 read coverage?

DNA courtesy of Chuck Langley, UC Davis

• base-calling with PYROBAYES
• alignment to 120 Mb euchromatic reference sequence with MOSAIK
• SNP detection with POLYBAYES

Page 34

SNP calling results

iso-1 reference

46-2 454 read

46-2 ABI reads (2 fwd + 2 rev)

• 658,280 SNPs

• θ ≈ 5 × 10^-3 (1 SNP / 200 bp)

• 92.9% validation rate (1,342 / 1,443)

• 2.0% missed SNP rate (25 / 1,247)

Page 35

Flow signal vs. actual base number

Page 36

Reference-guided read alignment

Page 37

PYROBAYES: A 454 base caller program

• better correlation between assigned and measured quality values

• higher fraction of high-quality bases

Aaron Quinlan

Page 38

454 errors: over and under-calls

Page 39

Validation / score calibration

[Figure: confirmation rate [%] (0 to 100) versus SNP score [%] (bins 51-60, 61-70, 71-80, 81-90, 91-100)]

Page 40

Traditional SNP discovery data

capillary sequences (ABI)

clonal (haploid) sequences

Page 41

The SNP score

polymorphism

specific variation

Page 42

SNPs and short INDELs

Single-base substitutions (SNPs)

Insertion-deletion polymorphisms (INDELs)

Page 43

Structural variations

Translocations: DNA exchange between different chromosomes

Inversion of long chromosomal tracts

“Simple” duplications and deletions

Multiple duplications (copy number changes)

Page 44

Epigenetic variations

Epigenetic variations, e.g. changes in methylation / chromatin structure, do not strictly involve base changes

Mikkelsen et al. Nature 2007

Page 45

Task 1. Base calling / base accuracy estimation

• how do we translate the machine readouts to base calls?
• how do we estimate and represent sequencing errors (base quality values)?

Page 46

454 pyrosequencing errors

Page 47

454 pyrosequencer error profile

INDEL errors dominate

Page 48

454 base quality values

• most bases have low quality values, not optimal for SNP discovery

• native 454 base quality values underestimate true accuracy

Page 49

Illumina/Solexa base accuracy

• Most errors are substitutions, so PHRED quality values work

• Measured base quality is a function of base position within the read (i.e. there is need for quality value calibration)
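One simple way to picture position-dependent calibration is to measure the error rate at each read position against a trusted reference and convert it to a PHRED-scale value. The sketch below does that for equal-length reads that are already aligned without gaps. The function name, the pseudo-counts, and the toy data are assumptions for illustration, not the calibration procedure used in the talk.

```python
import math


def calibrated_quality_by_position(reads, references):
    """Empirical PHRED-scale quality for each read position.

    Assumptions: reads[i] is aligned base-for-base to references[i],
    all reads have the same length, and every mismatch is a sequencing error.
    """
    length = len(reads[0])
    errors = [0] * length
    for read, ref in zip(reads, references):
        for i, (called, truth) in enumerate(zip(read, ref)):
            if called != truth:
                errors[i] += 1

    qualities = []
    for count in errors:
        # Pseudo-counts keep the rate away from zero so the log is defined.
        rate = (count + 1) / (len(reads) + 2)
        qualities.append(round(-10 * math.log10(rate)))
    return qualities


# Toy data: 98 reads of "ACGT", 9 of which carry an error at the last position.
reads = ["ACGT"] * 89 + ["ACGA"] * 9
references = ["ACGT"] * 98
print(calibrated_quality_by_position(reads, references))
```

In the toy data the measured quality drops at the last position, mirroring the pattern of accuracy decreasing along the read.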

Page 50

Illumina/Solexa base accuracy

• error rate grows as a function of base position within the read

• a large fraction of the reads contains 1 or 2 errors

Page 51

Task 2. Read mapping and assembly

… is similar to a jigsaw puzzle…

De novo assembly:

… that you have to put together all by yourself

Page 52

Structural variation discovery

• copy number variations (deletions & amplifications) can be detected from variations in the depth of read coverage

• structural rearrangements (inversions and translocations) require paired-end reads
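The depth-of-coverage idea in the first bullet can be sketched as follows: compare the mean read depth in each genomic window to a baseline and flag windows that fall well below or above it. The thresholds, the use of the median as the baseline, and the function name are assumptions chosen for illustration. Real data would also need correction for mappability and sequence composition.

```python
from statistics import median


def call_copy_number_windows(window_depths, low=0.6, high=1.4):
    """Label each window by its depth relative to the median depth.

    Assumptions: most windows are copy-number neutral, so the median is a
    usable baseline; low and high are hypothetical ratio thresholds.
    """
    baseline = median(window_depths)
    if baseline <= 0:
        raise ValueError("baseline depth must be positive")

    calls = []
    for depth in window_depths:
        ratio = depth / baseline
        if ratio <= low:
            calls.append("deletion")
        elif ratio >= high:
            calls.append("amplification")
        else:
            calls.append("normal")
    return calls


depths = [30, 31, 29, 14, 30, 62, 30]  # mean read depth per window (toy values)
print(call_copy_number_windows(depths))
```

Inversions and translocations leave depth largely unchanged, which is why the second bullet calls for paired-end reads rather than a depth-based test like this one.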

Page 53

Task 4. Data visualization

make screenshot with annotation

Page 54

Applications

1. SNP and INDEL discovery in deep Illumina short-read coverage (Caenorhabditis elegans)

2. Mutational profiling in deep 454 and Illumina read data (Pichia stipitis)

(image from Nature Biotech.)