design goals
MICHAEL STRÖMBERG
Boston College Data Club
April 2008
Design Goals
Crash Course: Reference-guided Assembly
Sequencing Technologies
future
Next-Gen Sequence Lengths
[Figure: two bar charts of sequence length (bp), showing min, mean, and max, by sequencing technology. Panel 1: Capillary (Sanger) and Roche 454 FLX, y-axis 0 to 1600 bp. Panel 2: Illumina, AB SOLiD, and Helicos, y-axis 0 to 80 bp.]
[Figure: Unique Genome Coverage (H. sapiens). x-axis: sequence length, 3 to 33; y-axis: unique genome coverage, 0% to 100%.]
Mixing It Up: Paired-end Reads
[Figure: histogram of read pairs (count, 0 to 1800) by fragment length (bp, 0 to 350).]
How Does It Work?
C. elegans: a case for INDELs
SPEED
100 million Illumina reads
Alignment time: 93 min (17,800 reads/s)
Assembly time: 100 min
INDELS
INDEL validation rate: 89.3% (216)
SNP validation rate: 97.8% (229)
P. stipitis: Co-assembly
Capillary
454 FLX
454 GS20
Illumina
Scaling Up
[Figure: reference sequence length (bp, log scale from 10,000 to 10,000,000,000) by project date (Dec-05 to Jun-08). Projects shown: C. elegans, M. musculus, H. sapiens, P. stipitis, M. musculus mtDNA, H. sapiens CAPON region, D. melanogaster, H. sapiens ENCODE region.]
Performance: Aligners
Aligners: Feature Set
Aligners: ELAND; MAQ; Newbler; SHRiMP; SOAP
Sequencing Platforms: Illumina, 454, SOLiD, capillary; Illumina; Illumina, SOLiD; 454; Illumina, SOLiD; Illumina
Alignment Algorithm: Smith-Waterman; Hash-based; Hash-based; FlowMapper; Smith-Waterman; Hash-based
Co-assembly Creation: ?
Gapped Alignments: ?
Paired-end Reads
Platform Binaries: Windows, Mac, Linux, Sun, iPhone; Mac, Linux; Linux; Mac, Linux; Mac, Linux
Performance: Aligner
Illumina 35 bp (X Chromosome)
program    aligned reads/s
MOSAIK     180 - 16,658
ELAND      7,716
SOAP       1,637
MAQ        1,376
SHRIMP     39
[Figure: bar chart of aligned reads/s (0 to 16000) for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, SOAP, and SHRIMP.]
Performance: Aligner
Roche 454 FLX ~250 bp
program              aligned reads/s
Roche 454 Newbler    1,176
MOSAIK               317 - 616
Using P. stipitis (15.4 Mbp) 454 FLX data set. 932,565 reads basecalled by PyroBayes†.
† Quinlan et al. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nature Methods (2008)
Accuracy: Synthetic Data Sets
1 per 1.3 kb 1 per 7.2 kb
H. sapiens X chromosome
1 million
Accuracy: Classification
[Figure: bar chart (0% to 100%) of unique reads and non-unique reads for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, and SOAP.]
Accuracy: Unique Read Alignment
[Figure: bar chart (0% to 100%) for MOSAIK (fast), MOSAIK (single), MOSAIK (multi), MOSAIK (all), ELAND, MAQ, and SOAP. Series: reads, INDELs, SNPs.]
Reasons to use MOSAIK?
• FAST
• Accurate
• Multiprocessor (OpenMP)
• Co-assemblies
• Gapped alignments
• Widely used
“One tool, many technologies, many applications”
(Near) Future Development
• All technologies
– Pacific Biosciences
– Helicos
• All application areas
– Adapter trimming
– Coverage graphs
• Optimization
• Improved paired-end read support
• File format standardization (SAF & SRF)
1000 Genomes Project
• Many samples with light coverage (1000 dg)
– 100 samples from 10 populations at 2x coverage
– Find 90% of the 1% frequency variants per population
• Trios with moderate coverage (990 dg)
– 30 trios at 11x coverage
• If you’re looking for SNPs, are your tools and methods robust?
Scaling Up: Disk Footprint
• Current situation: files created by MOSAIK are not optimized for speed or size
– Assembly can take a long time (slow disk speed)
• Hypothetical solution
– Optimize the file formats
– Ditch the built-in index
– Keep data sorted by aligned location
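To illustrate why keeping data sorted by aligned location can stand in for a built-in index, here is a minimal sketch. The record layout, the read names, and the `reads_in_region` helper are all hypothetical and are not MOSAIK's file format; the point is only that a binary search over sorted positions finds a region without a separate index structure.

```python
from bisect import bisect_left, bisect_right

# Hypothetical alignment records: (reference_position, read_name).
alignments = [
    (1500, "read_c"),
    (120, "read_a"),
    (980, "read_b"),
    (1500, "read_d"),
    (4000, "read_e"),
]

# Keep the records sorted by aligned location (the sort is stable,
# so reads at the same position keep their original order).
alignments.sort(key=lambda rec: rec[0])
positions = [pos for pos, _ in alignments]


def reads_in_region(start, end):
    """Return the names of reads whose aligned position lies in [start, end]."""
    lo = bisect_left(positions, start)   # first record at or after start
    hi = bisect_right(positions, end)    # one past the last record at or before end
    return [name for _, name in alignments[lo:hi]]


print(reads_in_region(900, 1500))
```

On disk, the same idea would presumably be a binary search over fixed-size records rather than over an in-memory list; that part is an assumption and is not shown here.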
Scaling Up: Memory Footprint
• Current situation: the entire human genome is stored with all associated hash locations
– Optimized hash table ≈ 55 GB RAM
– File-based hash table (BerkeleyDB)
  • User selects how much RAM to use
  • Dreadfully slow performance
  • Large disk footprint ≈ 65 GB file
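As a rough picture of where the memory goes, the sketch below builds a toy hash table that maps every fixed-size word in a reference to all of its positions. The function names, the toy reference, and the use of a Python dictionary are illustrative assumptions, not MOSAIK's implementation. Because every reference position is stored once, the table grows with genome length.

```python
from collections import defaultdict


def build_hash_table(reference, hash_size):
    """Map every hash_size-long word in the reference to all of its start positions."""
    table = defaultdict(list)
    for pos in range(len(reference) - hash_size + 1):
        table[reference[pos:pos + hash_size]].append(pos)
    return table


def seed_hits(read, table, hash_size):
    """Yield (read_offset, reference_position) for every word shared with the reference."""
    for offset in range(len(read) - hash_size + 1):
        # .get avoids creating empty entries in the defaultdict on a miss
        for ref_pos in table.get(read[offset:offset + hash_size], ()):
            yield offset, ref_pos


reference = "ACGTACGTTAGC"  # toy reference, not real data
table = build_hash_table(reference, hash_size=4)

print(table["ACGT"])
print(list(seed_hits("CGTTAG", table, 4)))
```

The number of stored positions is `len(reference) - hash_size + 1`, which is why a whole-genome table is large regardless of how compactly each entry is encoded.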
[Figure: JumpDB Memory Usage (Human Genome). x-axis: hash size (bp), 9 to 18; y-axis: memory used (GB RAM), 0 to 70. Series: JumpDB, MOSAIK hash table.]
[Figure: Alignment Performance with 35bp human reads. Axis: reads/s, 0 to 20. Series: Berkeley (all positions in database), Berkeley (1 position in database), Jump (all positions in file-based database), Mosaik hash table.]
Scaling Up: Speed & Sensitivity
• Current situation: as the hash size increases, speed increases but sensitivity decreases
• Hypothetical solution: use small hash sizes and require a clustering of a predefined length.
• Status: Implemented but not tested.
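One way to read the hypothetical solution above is sketched here. The slide does not define "clustering", so this assumes a cluster is a run of hash hits at consecutive read offsets on the same diagonal (reference position minus read offset), and that the cluster length is the number of bases the run spans. The function name and parameters are hypothetical, and this is not MOSAIK's code.

```python
from collections import defaultdict


def candidate_regions(read, reference, hash_size, min_cluster_len):
    """Return reference start positions supported by a long-enough hit cluster.

    Assumption: a cluster is a run of hits on one diagonal
    (reference position minus read offset) at consecutive read offsets.
    """
    # Index the reference with a small hash size (many positions per word).
    table = defaultdict(list)
    for pos in range(len(reference) - hash_size + 1):
        table[reference[pos:pos + hash_size]].append(pos)

    # Group the read offsets of all hits by diagonal.
    diagonals = defaultdict(list)
    for offset in range(len(read) - hash_size + 1):
        for ref_pos in table.get(read[offset:offset + hash_size], ()):
            diagonals[ref_pos - offset].append(offset)

    accepted = []
    for diagonal, offsets in sorted(diagonals.items()):
        # Offsets are increasing because the read is scanned left to right.
        run_start = prev = offsets[0]
        best = hash_size  # a single hit spans hash_size bases
        for off in offsets[1:]:
            if off != prev + 1:
                run_start = off  # gap found: start a new run
            prev = off
            best = max(best, prev - run_start + hash_size)
        if best >= min_cluster_len:
            accepted.append(diagonal)
    return accepted


reference = "CCACGTTAGCACCCCACGCC"  # toy reference; "ACG" occurs twice
read = "ACGTTAGCA"

print(candidate_regions(read, reference, hash_size=3, min_cluster_len=8))
print(candidate_regions(read, reference, hash_size=3, min_cluster_len=3))
```

With the small hash size, the short repeat produces a spurious hit; requiring a cluster of at least 8 bases keeps the true location and discards the isolated one, which is the trade-off the slide is after.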
BORK! BORK! BORK!
(translated: when will MOSAIK get published?)
Acknowledgements
Boston College: Gabor Marth, Derek Barnett, Michele Busby, Weichun Huang, Aaron Quinlan, Chip Stewart
Thomas Seyfried, Mike Kiebish
Washington University School of Medicine: Elaine Mardis, Jarret Glasscock, Vincent Magrini
Agencourt: Douglas Smith, Wei Tao