design goals

MICHAEL STRÖMBERG

Boston College Data Club

April 2008

Design Goals

Crash Course: Reference-guided Assembly

Sequencing Technologies

future

Next-Gen Sequence Lengths

[Figure: sequence length (bp) by sequencing technology, showing max, mean, and min. Panel 1: Capillary (Sanger), Roche 454 FLX; y-axis 0 to 1600 bp. Panel 2: Illumina, AB SOLiD, Helicos; y-axis 0 to 80 bp.]

[Figure: Unique Genome Coverage (H. sapiens). x-axis: sequence length, 3 to 33; y-axis: unique genome coverage, 0% to 100%.]
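The relationship between read length and unique coverage can be illustrated with a toy calculation. The sketch below is an assumption about how such a curve might be computed, not the method behind the slide: it counts, for a given read length, the fraction of start positions whose subsequence occurs exactly once in the reference. The function name `unique_fraction` and the toy sequence are hypothetical, and reverse complements are ignored for brevity.

```python
from collections import Counter


def unique_fraction(reference: str, read_length: int) -> float:
    """Fraction of read start positions whose read_length-mer occurs exactly once."""
    n_positions = len(reference) - read_length + 1
    if n_positions <= 0:
        return 0.0
    # Count every substring of the given length (forward strand only).
    counts = Counter(reference[i:i + read_length] for i in range(n_positions))
    # A position is uniquely placeable if its substring was seen exactly once.
    unique_positions = sum(1 for c in counts.values() if c == 1)
    return unique_positions / n_positions


if __name__ == "__main__":
    toy_reference = "ACGTACGTTTGACCA"  # hypothetical toy sequence, not real data
    for length in (2, 4, 6):
        print(length, round(unique_fraction(toy_reference, length), 2))
```

On a real genome the same idea would need a memory-efficient k-mer counter; the dictionary here is only suitable for small inputs.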

Mixing It Up: Paired-end Reads

[Figure: histogram of paired-end fragment lengths. x-axis: fragment length (bp), 0 to 350; y-axis: read pairs (count), 0 to 1800.]
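A fragment length histogram like the one on this slide can be derived from the aligned positions of the two mates. The sketch below is a minimal illustration under stated assumptions: coordinates are 1-based and inclusive, both mates align to the same reference sequence, and the names `fragment_length` and `histogram` are hypothetical rather than taken from MOSAIK.

```python
from collections import Counter


def fragment_length(mate1_start, mate1_end, mate2_start, mate2_end):
    """Span from the leftmost to the rightmost aligned base (inclusive coordinates)."""
    return max(mate1_end, mate2_end) - min(mate1_start, mate2_start) + 1


def histogram(lengths, bin_width=50):
    """Count fragment lengths per bin; each key is the lower edge of its bin."""
    return Counter((length // bin_width) * bin_width for length in lengths)


if __name__ == "__main__":
    # Hypothetical aligned pairs: (mate1_start, mate1_end, mate2_start, mate2_end)
    pairs = [(100, 134, 265, 299), (500, 534, 676, 710), (900, 934, 1126, 1160)]
    lengths = [fragment_length(*pair) for pair in pairs]
    for bin_start, count in sorted(histogram(lengths).items()):
        print(bin_start, count)
```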

How Does It Work?

C. elegans: a case for INDELs

SPEED

100 million Illumina reads
Alignment time: 93 min (17,800 reads/s)
Assembly time: 100 min

INDELS

INDEL validation rate: 89.3% (216)
SNP validation rate: 97.8% (229)

P. stipitis: Co-assembly

[Figure labels: Capillary, 454 FLX, 454 GS20, Illumina]

Scaling Up

[Figure: reference sequence length (bp) by project date. x-axis: Project Date, Dec-05 to Jun-08; y-axis: Reference Sequence Length (bp), 10,000 to 10,000,000,000. Labeled projects: C. elegans, M. musculus, H. sapiens, P. stipitis, M. musculus mtDNA, H. sapiens CAPON region, D. melanogaster, H. sapiens ENCODE region.]

Performance: Aligners

Aligners: Feature Set

Aligners: ELAND, MAQ, Newbler, SHRiMP, SOAP

Sequencing Platforms: Illumina, 454, SOLiD, capillary; Illumina; Illumina, SOLiD; 454; Illumina, SOLiD; Illumina

Alignment Algorithm: Smith-Waterman, Hash-based; Hash-based; FlowMapper; Smith-Waterman, Hash-based

Co-assembly Creation: ?

Gapped Alignments: ?

Paired-end Reads:

Platform Binaries: Windows, Mac, Linux, Sun, iPhone; Mac, Linux; Linux; Mac, Linux; Mac, Linux

Performance: Aligner

Illumina 35 bp (X Chromosome)

program    aligned reads/s
MOSAIK     180 - 16,658
ELAND      7,716
SOAP       1,637
MAQ        1,376
SHRIMP     39

[Figure: bar chart of aligned reads/s, 0 to 16,000, for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, SOAP, SHRIMP.]

Performance: Aligner

Roche 454 FLX ~250 bp

program              aligned reads/s
Roche 454 Newbler    1,176
MOSAIK               317 - 616

Using P. stipitis (15.4 Mbp) 454 FLX data set. 932,565 reads basecalled by PyroBayes†.

† Quinlan et al. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nature Methods (2008)

Accuracy: Synthetic Data Sets

1 per 1.3 kb 1 per 7.2 kb

H. sapiens X chromosome

1 million

Accuracy: Classification

[Figure: percentage of reads, 0% to 100%, classified as unique reads and non-unique reads for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, SOAP.]

Accuracy: Unique Read Alignment

[Figure: percentages, 0% to 100%, for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, SOAP. Series: reads, INDELs, SNPs.]

Reasons to use MOSAIK?

• Fast
• Accurate
• Multiprocessor (OpenMP)
• Co-assemblies
• Gapped alignments
• Widely used

“One tool, many technologies, many applications”

(Near) Future Development

• All technologies
– Pacific Biosciences
– Helicos
• All application areas
– Adapter trimming
– Coverage graphs
• Optimization
• Improved paired-end read support
• File format standardization (SAF & SRF)

1000 Genomes Project

• Many samples with light coverage (1000 dg)
– 100 samples from 10 populations at 2x coverage
– Find 90% of the 1% frequency variants per population
• Trios with moderate coverage (990 dg)
– 30 trios at 11x coverage
• If you’re looking for SNPs, are your tools and methods robust?

Scaling Up: Disk Footprint

• Current situation: files created by MOSAIK are not optimized for speed or size
– Assembly can take a long time (slow disk speed)
• Hypothetical solution
– Optimize the file formats
– Ditch the built-in index
– Keep data sorted by aligned location
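One reason sorting by aligned location can stand in for a separate index is that all reads starting inside a region form one contiguous block, which a binary search can locate. The sketch below illustrates that idea only; it is an assumption for illustration and says nothing about MOSAIK's actual file layout. The function name `reads_in_region` and the sample positions are hypothetical.

```python
import bisect


def reads_in_region(sorted_starts, region_start, region_end):
    """Return the slice bounds [lo, hi) of reads whose start lies in the region.

    sorted_starts must be in ascending order of aligned start position.
    """
    lo = bisect.bisect_left(sorted_starts, region_start)
    hi = bisect.bisect_right(sorted_starts, region_end)
    return lo, hi


if __name__ == "__main__":
    # Hypothetical aligned start positions, already sorted.
    starts = [5, 10, 10, 20, 35, 50]
    lo, hi = reads_in_region(starts, 10, 35)
    print(starts[lo:hi])
```

With records on disk, the same search would step through fixed-size records or block offsets instead of an in-memory list.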


Scaling Up: Memory Footprint

• Current situation: storing the entire human genome with all associated hash locations
– Optimized hash table ≈ 55 GB RAM
– File-based hash table (BerkeleyDB)
• User selects how much RAM to use
• Dreadfully slow performance
• Large disk footprint ≈ 65 GB file
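To make the memory pressure concrete, the sketch below shows the kind of structure involved: a table that maps every fixed-length subsequence (hash) of the reference to all positions where it occurs. This is a toy illustration under assumptions, not MOSAIK's optimized implementation; `build_hash_table` and the toy sequence are hypothetical names. Because one position is stored per reference base, the position lists alone grow linearly with genome size.

```python
from collections import defaultdict


def build_hash_table(reference, hash_size):
    """Map each hash_size-mer in the reference to the list of its start positions."""
    table = defaultdict(list)
    for pos in range(len(reference) - hash_size + 1):
        table[reference[pos:pos + hash_size]].append(pos)
    return table


if __name__ == "__main__":
    toy_reference = "ACGTACG"  # hypothetical toy sequence
    table = build_hash_table(toy_reference, 3)
    for kmer, positions in sorted(table.items()):
        print(kmer, positions)
```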


[Figure: JumpDB Memory Usage (Human Genome). x-axis: hash size (bp), 9 to 18; y-axis: memory used (GB RAM), 0 to 70. Series: JumpDB, MOSAIK hash table.]

[Figure: Alignment Performance with 35bp human reads. Bars: Berkeley (all positions in database), Berkeley (1 position in database), Jump (all positions in file-based database), Mosaik hash table. Axis: Reads/s, 0 to 20.]

Scaling Up: Speed & Sensitivity

• Current situation: as the hash size increases, speed increases but sensitivity decreases
• Hypothetical solution: use small hash sizes and require a cluster of hash hits of a predefined length
• Status: implemented but not tested
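One way to read the clustering idea is sketched below; it is an assumption about the approach, not the tested MOSAIK code. Small hashes from the read are looked up in a reference hash table, hits are grouped by diagonal (reference position minus read position), and a candidate region is kept only if adjoining or overlapping hits on one diagonal span at least a minimum length. The names `build_hash_table`, `candidate_regions`, and `min_cluster_length` are hypothetical.

```python
from collections import defaultdict


def build_hash_table(reference, hash_size):
    """Map each hash_size-mer in the reference to the list of its start positions."""
    table = defaultdict(list)
    for pos in range(len(reference) - hash_size + 1):
        table[reference[pos:pos + hash_size]].append(pos)
    return table


def candidate_regions(read, table, hash_size, min_cluster_length):
    """Return sorted (diagonal, cluster_length) pairs that pass the length filter."""
    # Group seed hits by diagonal: hits from one ungapped match share a diagonal.
    diagonals = defaultdict(list)
    for read_pos in range(len(read) - hash_size + 1):
        for ref_pos in table.get(read[read_pos:read_pos + hash_size], ()):
            diagonals[ref_pos - read_pos].append(read_pos)

    regions = []
    for diagonal, read_positions in diagonals.items():
        read_positions.sort()
        run_start = prev = read_positions[0]
        best = hash_size  # a single hit covers hash_size bases
        for pos in read_positions[1:]:
            if pos - prev > hash_size:
                run_start = pos  # gap between hits: start a new cluster
            prev = pos
            best = max(best, prev + hash_size - run_start)
        if best >= min_cluster_length:
            regions.append((diagonal, best))
    return sorted(regions)


if __name__ == "__main__":
    reference = "TTACGTGCATT"  # hypothetical toy reference
    table = build_hash_table(reference, 3)
    print(candidate_regions("ACGTGCA", table, 3, 6))  # read taken from the reference
    print(candidate_regions("ACGAAAA", table, 3, 6))  # read sharing only one hash
```

The filter is what lets a small hash size stay practical: isolated chance hits are discarded before any expensive alignment step, while true matches produce long clusters.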

BORK! BORK! BORK!

(translated: when will MOSAIK get published?)

Acknowledgements

Boston College
Gabor Marth
Derek Barnett
Michele Busby
Weichun Huang
Aaron Quinlan
Chip Stewart

Thomas Seyfried
Mike Kiebish

Washington University School of Medicine
Elaine Mardis
Jarret Glasscock
Vincent Magrini

Agencourt
Douglas Smith
Wei Tao
