Making the most of short reads - Torsten Seemann - AGRF NGS SIG - 28 Apr 2009
Making the most of short reads
Torsten Seemann
Victorian Bioinformatics Consortium
Monash University
07/08/12 Making the most of short reads 2
Outline
● About the VBC
● Sequencing technologies
● Read mapping
● Applications
● Conclusion
● Questions
What is the VBC?
● Victorian Bioinformatics Consortium
● 2000-2005
– Monash .med .infotech, CSIRO, DPI
– $4M STI grant from State Govt.
● 2005+
– Dept. Microbiology, Monash Uni.
– NHMRC/ARC Network Parasitology
– Micromon (sequencing centre)
Where is the VBC?
● Monash Uni.
● Clayton Campus
● STRIP2 / Bldg 76
● Level 2
● Microbiology
● Rooms 223-225
VBC capabilities
● Sequence analysis
● Assembly, annotation, SNPs
● Anything-omics!
● Microarray analysis/storage
● Data mining/visualization
● Custom software development
● Computer system architecture
VBC Collaborators
● Monash Uni.
● Uni. Melbourne
● Bio21
● UNSW, Uni. Syd
● UQ : IMB
● MIMR, MMC, Austin
● MISCL
● CSIRO : FSA, LI
● USDA : ARS
● Pasteur Institute
● TIGR
● UCSD
● UCLA
● Uni. Copenhagen
Sanger sequencing
● Dye terminated capillary sequencing
● Read length ~ 300 - 900 bp
● Yield ~ 1 Mbp per day maximum
● Cost ~ $HIGH
Roche 454 FLX+
● Pyro-sequencing
● Read length ~ 100 - 250 bp
● Yield ~ 600 Mbp (250 bp PE)
● Run time ~ 1 day
● Prep time ~ 5 days
● Homo-polymer run errors
● Cost $MEDIUM
ABI SOLID 3
● Sequencing by ligation
● Read length ~ 35 – 50 bp
● Yield ~ 15,000 Mbp (50 bp PE)
● Run time ~ 14 days
● Prep time ~ ? days
● Colour space error propagation
● Cost $MEDIUM
Illumina GA2 (Solexa)
● Sequencing by synthesis
● Read length ~ 36 – 100 bp
● Yield ~ 6,000 Mbp (36 bp PE)
● Run time ~ 5 days
● Prep time ~ 1 day
● No homo-polymer errors
● Cost $LOW
Illumina output 36bp
Good read
@HWUSI-EAS100R:3:1:5:1526#0/1
TCCCTTGCATTACTCTTAATCGAGGAAATCCCTTTG
+HWUSI-EAS100R:3:1:5:1526#0/1
abbaaaaaaaaaaaaaaaaa_X^WT]a```a_a\`\
Bad read
@HWUSI-EAS100R:3:1:3:1073#0/2
TGNNNNNNCAAATTCANNNNNNNTCNNTTTATATCT
+HWUSI-EAS100R:3:1:3:1073#0/2
a\DDDDDD^[K]BBBBBBBBBBBBBBBBBBBBBBBB
'B'=Q2 Pr(wrong)=0.38
'a'=Q33 Pr(wrong)=0.0005
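The two quality values on this slide can be reproduced with a short calculation. This is a minimal sketch, not the speaker's code. It assumes an ASCII offset of 64 (so 'B' is Q2 and 'a' is Q33) and the Solexa odds-based quality scale, Q = -10 log10(p / (1 - p)), which appears to be what gives Pr(wrong) = 0.38 for Q2. The plain Phred formula is included for comparison and would give about 0.63 for Q2. The function names are illustrative.

```python
OFFSET = 64  # assumption: Illumina 1.x ASCII offset, so 'B' -> Q2 and 'a' -> Q33


def solexa_error_prob(ch, offset=OFFSET):
    """Error probability on the Solexa scale: Q = -10 * log10(p / (1 - p))."""
    q = ord(ch) - offset
    odds = 10 ** (-q / 10)  # p / (1 - p)
    return odds / (1 + odds)


def phred_error_prob(ch, offset=OFFSET):
    """Error probability on the Phred scale: Q = -10 * log10(p)."""
    q = ord(ch) - offset
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for ch in "Ba":
        print(ch, ord(ch) - OFFSET,
              round(solexa_error_prob(ch), 4),
              round(phred_error_prob(ch), 4))
```

At high quality the two scales agree closely, so the choice of scale only matters for poor bases such as the run of 'B' characters in the bad read.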
Read mapping
● Align 10^8 36 bp reads to a 5 Mbp reference
● Traditional tools too slow
● New crop of “short read aligners” (SRA)
– SHRiMP
– MAQ
– Bowtie
– ELAND
– Novocraft
SRA capabilities
● SNP = Single nucleotide polymorphism
– Substitution, e.g. A → C
– Insertion or deletion (“indel”), e.g. A → -
● Warning: not all aligners support indels!
● We tend to use SHRiMP
– Supports substitutions and indels
– Fast SIMD implementation & parallelizable
– Full post-hit Smith-Waterman alignment
– Will identify “most” high scoring hits
Genome coverage
● Mapped 7 M reads to 4 Mbp genome
● Yellow line is mean coverage (56x)
● Bowl shaped coverage = circular genome
● Could be used to guide scaffolding
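A coverage graph like the one described here can be built from nothing more than the mapped start position of each read. The sketch below is illustrative only. It assumes 0-based start coordinates, fixed-length reads and a linear reference, and the function names are hypothetical rather than taken from any aligner.

```python
def depth_profile(starts, read_len, ref_len):
    """Per-base read depth from 0-based mapped start positions (linear reference)."""
    diff = [0] * (ref_len + 1)  # difference array: +1 where a read starts, -1 after it ends
    for s in starts:
        diff[s] += 1
        diff[min(s + read_len, ref_len)] -= 1
    depth, running = [], 0
    for d in diff[:ref_len]:
        running += d
        depth.append(running)
    return depth


def mean_depth(depth):
    """Mean of a per-base depth profile (the 'yellow line')."""
    return sum(depth) / len(depth)


def expected_mean_depth(n_reads, read_len, ref_len):
    """Upper bound on mean depth if every base of every read maps."""
    return n_reads * read_len / ref_len


if __name__ == "__main__":
    profile = depth_profile([0, 2, 2, 5], read_len=4, ref_len=10)
    print(profile, mean_depth(profile))
    print(expected_mean_depth(7_000_000, 36, 4_000_000))
```

With the slide's rounded figures (7 M reads of 36 bp on a 4 Mbp genome) the upper bound works out to 63x, in the same range as the 56x mean reported.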
Missing DNA
● Read coverage drops to zero where reference has DNA that the new sequence does not
● LB022 absent
● hemH present
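Stretches of the reference that are missing from the new strain can be listed by scanning the depth profile for runs of zero coverage. This is a minimal sketch under assumptions: the per-base depth list is taken as given, and the `min_len` cut-off is a hypothetical parameter for ignoring very short gaps.

```python
def zero_coverage_regions(depth, min_len=1):
    """Return half-open (start, end) intervals where depth is zero for >= min_len bases."""
    regions, start = [], None
    for i, d in enumerate(depth):
        if d == 0 and start is None:
            start = i  # a zero-depth run begins here
        elif d != 0 and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    # close a run that extends to the end of the reference
    if start is not None and len(depth) - start >= min_len:
        regions.append((start, len(depth)))
    return regions


if __name__ == "__main__":
    print(zero_coverage_regions([5, 4, 0, 0, 0, 6, 7, 0, 3, 0, 0], min_len=2))
```

Intersecting these intervals with the gene coordinates of the reference annotation would then flag genes such as LB022 as candidates for absence.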
Repeated DNA
● Coverage increases in repeated areas
● LA_SNP3199 is probably triplicated in this strain (depth 120, average 40)
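The reasoning behind "triplicated" is a simple ratio of local depth to average depth. As a rough sketch, assuming coverage is otherwise uniform and using a hypothetical function name:

```python
def estimate_copy_number(region_depth, mean_depth):
    """Estimate copy number as local depth divided by genome-wide mean depth."""
    if mean_depth <= 0:
        raise ValueError("mean_depth must be positive")
    return round(region_depth / mean_depth)


if __name__ == "__main__":
    # Slide figures: depth 120 in the repeat, average 40 elsewhere
    print(estimate_copy_number(120, 40))
```

In practice coverage is uneven, so an estimate like this is a hint to follow up rather than a measurement.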
SNPs
● SNPs appear as dips/pinches in the coverage graph
● LA1299 gene has 4 possible SNPs relative to the reference
● Rest of gene has average coverage
Repairing 454 data
● 454 has “homopolymer” errors
● Loses track if the same base occurs more than 3 times in a row
● Traditional assemblers don't like too many indels or frame shifts
● 454 developed the Newbler assembler
● Challenging for hybrid assemblies
● What if we could “repair” our 454 data?
454 Repair Guide
● One sample with 454 and Illumina reads
● Get a read mapper supporting indels
● Align all your Illumina reads to 454 data
● If sufficient un-ambiguous depth
– correct the 454 sequence!
● Can apply to old closed sequences, 454 contigs, 454 reads etc.
● Find old errors via resequencing
Example repair
>FF6ELPM06G1HYY original 180bp
AAATCTAAAAGAATAGTCGTGGAGCAGGTAGAAAACCTAGATTTACTGAAGAAGAAAAAATATTATAAGAGCTCAAAGAAAAGAAGGAAAACAATAAAAGAGCTTGCAACTTTAAATAATTGTAGCTTTGGAGTAATTCATAAAATTTTACATGAATAATAAATAAAAGGGGATTGAGAT
Sequence        Pos  Change type       Old  New  Evidence
FF6ELPM06G1HYY  11   insertion-before  -    A    "A"x166
FF6ELPM06G1HYY  61   insertion-before  -    A    "A"x212 "-"x12
FF6ELPM06G1HYY  92   insertion-before  -    A    "A"x368 "-"x1
>FF6ELPM06G1HYY repaired 183bp
AAATCTAAAAAGAATAGTCGTGGAGCAGGTAGAAAACCTAGATTTACTGAAGAAGAAAAAAATATTATAAGAGCTCAAAGAAAAGAAGGAAAAACAATAAAAGAGCTTGCAACTTTAAATAATTGTAGCTTTGGAGTAATTCATAAAATTTTACATGAATAATAAATAAAAGGGGATTGAGAT
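The repair step can be sketched as two small functions: one that decides whether the Illumina evidence at a position is strong enough, and one that applies the accepted insertions. This is not the tool used for the slide. It assumes positions are 1-based coordinates in the original read (which is consistent with the example), and the depth and fraction thresholds are hypothetical values chosen for illustration.

```python
def is_supported(evidence, min_depth=20, min_frac=0.9):
    """True if the most common call has enough depth and enough agreement.

    evidence maps each observed call to its read count, e.g. {"A": 212, "-": 12}.
    min_depth and min_frac are illustrative thresholds, not values from the talk.
    """
    total = sum(evidence.values())
    best = max(evidence.values())
    return total >= min_depth and best / total >= min_frac


def apply_insertions(seq, changes):
    """Insert each base before its 1-based position in the original sequence.

    changes is a list of (pos, base). Edits are applied from the highest
    position down so that earlier coordinates are not shifted.
    """
    out = list(seq)
    for pos, base in sorted(changes, reverse=True):
        out.insert(pos - 1, base)
    return "".join(out)


if __name__ == "__main__":
    prefix = "AAATCTAAAAGAATAGTCGT"  # first 20 bases of the original read
    if is_supported({"A": 166}):
        print(apply_insertions(prefix, [(11, "A")]))
```

Applied to the first 20 bases of the read, the position 11 change lengthens the run of four A's to five, matching the start of the repaired sequence.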
Trimming short reads
● Quality worsens toward 3' end
● Many reads have “N” basecalls
● Variation across flowcell/slide
● Will reduce data size
● Trade quality for depth
● Is it worth it?
Should I trim?
● For 36 bp
– Results are mixed
– Usually best NOT to trim
– Depth will “fix” most errors
● For 75+ bp
– 3' quality can be very poor
– Seems best to trim
– Not all reads need trimming
● More research needed
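For readers who do decide to trim, one simple approach is to remove bases from the 3' end until a base of acceptable quality is reached. The following is a minimal sketch of that idea, not a recommendation from the talk. It assumes an ASCII quality offset of 64, and the `min_q` threshold of 20 is a hypothetical choice.

```python
def trim_3prime(seq, qual, min_q=20, offset=64):
    """Trim low-quality or 'N' bases from the 3' end of a read.

    Stops at the first base (scanning from the 3' end) that is not 'N'
    and has quality >= min_q. Reads that are already good are unchanged.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    end = len(seq)
    while end > 0 and (seq[end - 1] == "N" or ord(qual[end - 1]) - offset < min_q):
        end -= 1
    return seq[:end], qual[:end]


if __name__ == "__main__":
    # Toy read: three trailing 'B' (Q2) bases are removed
    print(trim_3prime("ACGTACGTAC", "aaaaaaaBBB"))
    # A uniformly good read is returned as-is
    print(trim_3prime("ACGT", "aaaa"))
```

Because only the 3' end is examined, a read with a poor base in the middle keeps its full length, which fits the observation that not all reads need trimming.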
Conclusion
● Short read mapping is a powerful tool for genomic discovery
– Automated analysis e.g. SNPs
– Visualization e.g. depth/coverage graphs
– Repairing longer read data
● Still need de novo assembly for unmapped reads
Contact me
Web: http://www.vicbioinformatics.com/