Informatics challenges and computer tools for sequencing 1000s of human genomes

Gabor T. Marth, Boston College Biology Department

Cold Spring Harbor Laboratory, Personal Genomes meeting, October 9-12, 2008

Large-scale individual human resequencing

Next-gen sequencers offer vast throughput…

[Figure: bases per machine run (1 Mb to 10 Gb) versus read length (10 bp to 1,000 bp), both on log scales. Platforms shown: Illumina and AB/SOLiD short-read sequencers (5-15 Gb in 25-70 bp reads), 454 pyrosequencer (100-400 Mb in 200-450 bp reads), ABI capillary sequencer.]

The resequencing informatics pipeline

(i) base calling

(ii) read mapping

(iii) read assembly

(iv) SNP and short INDEL calling

(v) SV calling

(vi) data validation, hypothesis generation

[Figure labels: REF, IND]

The variation discovery “toolbox”

• base callers

• read mappers

• SNP callers

• SV callers

• assembly viewers

GigaBayes

1. Base calling

base sequence

base quality (Q-value) sequence

• early manufacturer-supplied base callers were imperfect
• third party software made substantial improvements
• machine manufacturers are now focusing more on base calling
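As a small illustration of what a base quality (Q-value) encodes, the sketch below converts between Q-values and error probabilities. It assumes the Q-values on these slides follow the standard Phred convention, Q = -10 log10(p); the function names are hypothetical and not taken from any base caller.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Error probability implied by a Phred-scaled quality value Q (assumed convention)."""
    return 10.0 ** (-q / 10.0)

def error_prob_to_phred(p: float) -> float:
    """Phred-scaled quality value for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)

# Print the error probability implied by a few common quality values.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4g}")
```

Under this convention, every 10 units of Q correspond to a tenfold drop in the expected error rate.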

2. Read mapping

Read mapping is like doing a jigsaw puzzle…

…you get the pieces…

… and they give you the picture on the box

Larger, more unique pieces are easier to place than others…

Next-gen reads are generally short

[Figure: read length ranges by platform on a 0-400 bp axis: ~200-450 bp (variable), 25-70 bp (fixed), 25-50 bp (fixed), 20-60 bp (variable).]

Base error rates are low

[Figure panels: Illumina, 454]

Strategies to deal with non-unique mapping

Mapping probabilities (qualities)

[Figure: one read with three candidate placements, with mapping probabilities 0.8, 0.19 and 0.01.]
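One way to arrive at mapping probabilities like these is to score each candidate placement and normalize the scores. The sketch below is an assumption-laden illustration, not the method used by MOSAIK or any other mapper: it assumes ungapped placements, Phred-scaled base qualities, a uniform prior over candidate locations, and hypothetical function names.

```python
import math

def placement_likelihood(read, ref_segment, quals):
    """Likelihood of the read if it truly came from ref_segment (ungapped, assumed model)."""
    lik = 1.0
    for base, ref_base, q in zip(read, ref_segment, quals):
        e = 10.0 ** (-q / 10.0)               # Phred-scaled error probability (assumed)
        lik *= (1.0 - e) if base == ref_base else e / 3.0
    return lik

def mapping_probabilities(read, quals, candidates):
    """Posterior probability of each candidate placement under a uniform prior."""
    liks = [placement_likelihood(read, seg, quals) for seg in candidates]
    total = sum(liks)
    return [lik / total for lik in liks]

def mapping_quality(p_best, cap=60.0):
    """Phred-scaled probability that the best placement is wrong, capped for p_best near 1."""
    if p_best >= 1.0:
        return cap
    return min(cap, -10.0 * math.log10(1.0 - p_best))

# Hypothetical read and three candidate reference segments (0, 1 and 2 mismatches).
read = "ACGTAC"
quals = [20] * len(read)
candidates = ["ACGTAC", "ACGTAT", "ACCTAT"]
probs = mapping_probabilities(read, quals, candidates)
print([round(p, 4) for p in probs], round(mapping_quality(max(probs)), 1))
```

A read that fits several locations almost equally well ends up with a low mapping quality, which downstream callers can use to down-weight it.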

Error types are very different

[Figure panels: Illumina, 454]

Gapped alignments

MOSAIK

• fast
• accurate
• gapped
• versatile (short + long reads)

3. SNP and short-INDEL calling

• deep alignments of 100s / 1000s of individuals
• trio sequences

Allele discovery is a multi-step sampling process

Population Samples Reads

Capturing the allele in the sample

[Figure: Prob(allele captured in sample), from 0 to 1, versus population allele frequency (1E-04 to 0.5), with one curve per sample size: n=100, n=200, n=400, n=800, n=1600.]
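The capture step can be sketched with a simple binomial sampling argument. The code below assumes that n counts diploid individuals (so 2n chromosomes are sampled) and that chromosomes are drawn independently from the population; the slide does not state either assumption, and the function name is hypothetical.

```python
def capture_probability(allele_freq: float, n_individuals: int, ploidy: int = 2) -> float:
    """Probability that at least one sampled chromosome carries the allele.

    Assumes independent sampling of ploidy * n_individuals chromosomes.
    """
    n_chrom = ploidy * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chrom

# Tabulate capture probability for the sample sizes named on the slide.
for n in (100, 200, 400, 800, 1600):
    row = ", ".join(f"AF={f:g}: {capture_probability(f, n):.3f}"
                    for f in (0.0001, 0.001, 0.01))
    print(f"n={n}: {row}")
```

Under these assumptions, rare alleles are only likely to be present in the sample at all once hundreds or thousands of individuals are sequenced.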

Allele calling in the reads

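$$
\Pr(G_1^{i_1}, G_2^{i_2}, \ldots, G_n^{i_n} \mid B) =
\frac{\Pr(G_1^{i_1}, G_2^{i_2}, \ldots, G_n^{i_n}) \prod_{k=1}^{n} \sum_{T_k} \Pr(B_k \mid T_k)\,\Pr(T_k \mid G_k^{i_k})}
{\sum_{l_1, \ldots, l_n} \Pr(G_1^{l_1}, G_2^{l_2}, \ldots, G_n^{l_n}) \prod_{k=1}^{n} \sum_{T_k} \Pr(B_k \mid T_k)\,\Pr(T_k \mid G_k^{l_k})}
$$

Pr(B_k | T_k): base quality

T_k: allele call in read

n: number of individuals

GigaBayes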
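To make the structure of the calculation concrete, here is a minimal single-individual sketch in the same Bayesian spirit: a prior over genotypes, Pr(T | G) = 1/2 for each chromosome, and Pr(B | T) derived from the base quality. It is not GigaBayes code; the prior values, the Phred interpretation of the qualities, and all names are assumptions made for illustration.

```python
def base_likelihood(base, true_allele, qual):
    """Pr(B | T): probability of the called base given the true allele in the read."""
    e = 10.0 ** (-qual / 10.0)                 # Phred-scaled quality (assumed)
    return 1.0 - e if base == true_allele else e / 3.0

def genotype_posteriors(bases, quals, priors):
    """Posterior over diploid genotypes for one individual at one site."""
    unnormalized = {}
    for genotype, prior in priors.items():
        lik = 1.0
        for base, qual in zip(bases, quals):
            # Sum over the two chromosomes the read may come from: Pr(T | G) = 1/2 each.
            lik *= sum(0.5 * base_likelihood(base, t, qual) for t in genotype)
        unnormalized[genotype] = prior * lik
    total = sum(unnormalized.values())
    return {g: p / total for g, p in unnormalized.items()}

# Hypothetical priors: the reference homozygote is by far the most likely a priori.
priors = {("A", "A"): 0.998, ("A", "C"): 0.001, ("C", "C"): 0.001}

# Four reads with A and two with C, all at Q30.
posteriors = genotype_posteriors("AAAACC", [30] * 6, priors)
for genotype, p in posteriors.items():
    print("/".join(genotype), f"{p:.4f}")
```

The multi-individual formula above extends this by using a joint prior over all n genotypes rather than an independent prior per individual.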

How many reads needed to call an allele?

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

Reads with alternate allele    Q30     Q40     Q50     Q60
1                              0.01    0.01    0.1     0.5
2                              0.82    1.0     1.0     1.0
3                              1.0     1.0     1.0     1.0

The need for accurate data…

… and realistic base quality values

Recalibrated base quality values (Illumina)

More samples or deeper coverage / sample?

Shallower read coverage from more individuals …

…or deeper coverage from fewer samples?

simulation analysis by Aaron Quinlan

Analysis indicates a balance
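The trade-off can be explored with a rough analytic sketch at a fixed sequencing budget. This is not the simulation referred to above: it assumes diploid individuals, Poisson-distributed read depth split evenly between the two chromosomes, no sequencing errors, and a hypothetical detection rule of at least two reads carrying the allele across the whole sample.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability computed in log space to avoid overflow for large n."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)

def detection_probability(allele_freq, n_individuals, coverage, min_alt_reads=2):
    """P(allele is in the sample AND seen in at least min_alt_reads reads)."""
    n_chrom = 2 * n_individuals
    per_chrom_depth = coverage / 2.0
    total = 0.0
    for k in range(1, n_chrom + 1):            # k = number of carrier chromosomes
        pk = binom_pmf(k, n_chrom, allele_freq)
        if pk < 1e-15 and k > allele_freq * n_chrom:
            break                               # remaining tail is negligible
        lam = k * per_chrom_depth               # expected reads carrying the allele
        p_too_few = sum(math.exp(-lam) * lam ** j / math.factorial(j)
                        for j in range(min_alt_reads))
        total += pk * (1.0 - p_too_few)
    return total

# Same total budget (samples x coverage = 1600) split four ways.
for n, cov in ((100, 16), (200, 8), (400, 4), (800, 2)):
    print(f"{n} @ {cov}x:",
          ", ".join(f"AF={f:g}: {detection_probability(f, n, cov):.3f}"
                    for f in (0.001, 0.01)))
```

Because the sketch ignores sequencing errors and per-individual genotype confidence, it captures only the discovery side of the trade-off; genotype accuracy pulls in the opposite direction, toward deeper coverage per sample.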

SNP calling in trios

[Table: Pr(G_C | G_M, G_F), the probability of each child genotype (11, 12, 22) for every combination of mother (G_M) and father (G_F) genotypes 11, 12 and 22, including a small mutation term.]

• the child inherits one chromosome from each parent
• there is a small probability for a mutation in the child
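The two rules above are enough to sketch a transmission table. The code below assumes a biallelic site with alleles labelled 1 and 2 and a hypothetical per-transmission mutation probability `mu` that flips the transmitted allele; the exact mutation model on the slide is not recoverable, so this is an illustration rather than a reproduction.

```python
from itertools import product

def gamete_probs(genotype, mu=0.0):
    """Probability that a parent with this genotype transmits allele 1 or allele 2."""
    probs = {1: 0.0, 2: 0.0}
    for allele in genotype:                     # each chromosome is passed on with prob 1/2
        other = 2 if allele == 1 else 1
        probs[allele] += 0.5 * (1.0 - mu)
        probs[other] += 0.5 * mu                # assumed mutation: allele flips with prob mu
    return probs

def child_genotype_probs(mother, father, mu=0.0):
    """Pr(G_C | G_M, G_F) over the unordered child genotypes 11, 12 and 22."""
    out = {(1, 1): 0.0, (1, 2): 0.0, (2, 2): 0.0}
    for (a, pa), (b, pb) in product(gamete_probs(mother, mu).items(),
                                    gamete_probs(father, mu).items()):
        out[tuple(sorted((a, b)))] += pa * pb
    return out

# Two heterozygous parents, no mutation: the classic 1:2:1 ratio.
print(child_genotype_probs((1, 2), (1, 2)))
# Two 11 parents with a small mutation probability: a 12 child becomes possible.
print(child_genotype_probs((1, 1), (1, 1), mu=1e-8))
```

In a trio caller, a table like this acts as the prior that ties the three genotypes together, so a call in the child that neither parent could have transmitted needs much stronger read evidence.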

SNP calling in trios

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

mother father

child P=0.79

P=0.86

4. Structural variation discovery

Event                        Read pair mapping pattern                          Depth
Deletion                     LM ~ LF+Ldel                                       low
Tandem duplication           LM ~ LF-Ldup                                       high
Inversion                    LM ~ +Linv & ends flipped; LM ~ -Linv              normal
Translocation                LM ~ LF+LT1; LM ~ LF+LT2; LM ~ LF-LT1-LT2          normal
Insertion                    un-paired read clusters                            normal
Chromosomal translocation    LM ~ LF+LT & cross-paired read clusters            normal

[Figure labels: DNA reference, pattern; segment lengths Ldel, Ldup, Linv, LT1, LT2, Lins, LT]
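The signatures above can be turned into a rough per-pair classifier. The sketch reads LM as the mapped distance between the two ends of a pair and LF as the expected fragment length, which is an interpretation rather than something the slides define; the tolerance, the input flags and the function names are all hypothetical, and a real caller would work on clusters of pairs and also check depth.

```python
def classify_read_pair(mapped_distance, expected_fragment_length, tolerance,
                       mate_mapped=True, same_chromosome=True, orientation_ok=True):
    """Label one read pair with the structural-variation signature it is consistent with."""
    if not mate_mapped:
        return "insertion candidate (un-paired read)"
    if not same_chromosome:
        return "chromosomal translocation candidate (cross-paired)"
    if not orientation_ok:
        return "inversion candidate (ends flipped)"
    delta = mapped_distance - expected_fragment_length
    if delta > tolerance:
        return "deletion candidate"             # LM ~ LF + Ldel
    if delta < -tolerance:
        return "tandem duplication candidate"   # LM ~ LF - Ldup
    return "concordant"

def estimated_event_length(mapped_distance, expected_fragment_length):
    """Ldel or Ldup implied by a single pair: the absolute value of LM - LF."""
    return abs(mapped_distance - expected_fragment_length)

# Hypothetical library: 400 bp fragments, pairs within 100 bp are treated as concordant.
print(classify_read_pair(1900, 400, 100), estimated_event_length(1900, 400))
print(classify_read_pair(-600, 400, 100), estimated_event_length(-600, 400))
print(classify_read_pair(420, 400, 100))
```

Single pairs are noisy, so in practice a call is usually made only when several pairs with the same signature cluster at the same breakpoint and the read depth agrees.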

Read pair mapping pattern (breakpoint detection)

Copy number estimation

Depth of read coverage

Deletion: Aberrant positive mapping distance

Tandem duplication: negative mapping distance

Het deletion “revealed” by normalization
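A minimal sketch of depth-based copy number estimation follows. It assumes depth has already been summarized in fixed windows and normalizes each window by the sample-wide median depth, treating the median as the diploid baseline; the slides do not say which normalization was used, so this choice and the function name are assumptions.

```python
from statistics import median

def copy_number_from_depth(window_depths, ploidy=2):
    """Estimate integer copy number per window from depth normalized by the median."""
    baseline = median(window_depths)            # assumed to represent the normal diploid state
    return [round(ploidy * depth / baseline) for depth in window_depths]

# Hypothetical windowed depths: two windows at about half the typical depth.
depths = [30, 31, 29, 15, 16, 30, 32]
copy_numbers = copy_number_from_depth(depths)
print(copy_numbers)
```

In this toy input, the two windows at roughly half the median depth come out at copy number 1, the pattern expected for a heterozygous deletion.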

Chip Stewart, Saturday poster session

5. Data visualization

• software development
• data validation
• hypothesis generation

Summary

• Next-generation sequencing is a boon for large-scale individual human resequencing

• Basic data mining tools are being applied and tested in the 1000 Genomes Project

• There is still a lot of fine-tuning to do

• A different set of tools is needed for comparative analysis and effective visualization of 100s/1000s of genomes

Credits

Derek Barnett

Eric Tsung

Aaron Quinlan

Damien Croteau-Chonka

Weichun Huang

Michael Stromberg

Chip Stewart

Michele Busby

Several postdoc positions are available… mail [email protected]

Software tools for next-gen data

http://bioinformatics.bc.edu/marthlab/Beta_Release


Individual genotype directly from sequence

AACGTTAGCATA
AACGTTAGCATA
AACGTTCGCATA
AACGTTCGCATA

AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA
AACGTTCGCATA

AACGTTAGCATA
AACGTTAGCATA

individual 1

individual 3

individual 2

A/C

C/C

A/A

Genotyping from primary sequence data

[Figure: fraction of confident genotypes (0 to 1) versus SNP position (0 to 50,000), for four designs of samples @ coverage:]

100 @ 16x: 0.975 +/- 0.121

200 @ 8x: 0.968 +/- 0.129

400 @ 4x: 0.924 +/- 0.151

800 @ 2x: 0.769 +/- 0.154

Most reads contain no or few errors

Paired-end reads help unique read placement

PE
• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency

MP
• circularization: 500 bp - 10 kb (sweet spot ~3 kb)
• fragment length limited by library complexity

Korbel et al. Science 2007

How many reads needed to call an allele?

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaCgtacctac
aatgtagtaCgtacctac

aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac
aatgtagtaAgtacctac

P=0.82 P=0.08
