making the most of short reads torsten seemann - agrf ngs sig - 28 apr 2009

Making the most of short reads

Torsten Seemann

Victorian Bioinformatics Consortium
Monash University

Outline

● About the VBC
● Sequencing technologies
● Read mapping
● Applications
● Conclusion
● Questions


What is the VBC ?

● Victorian Bioinformatics Consortium
● 2000-2005
  – Monash .med .infotech, CSIRO, DPI
  – $4M STI grant from State Govt.
● 2005+
  – Dept. Microbiology, Monash Uni.
  – NHMRC/ARC Network Parasitology
  – Micromon (sequencing centre)


Where is the VBC ?

● Monash Uni.
● Clayton Campus
● STRIP2 / Bldg 76
● Level 2
● Microbiology
● Rooms 223-225


VBC capabilities

● Sequence analysis
● Assembly, annotation, SNPs
● Anything-omics!
● Microarray analysis/storage
● Data mining/visualization
● Custom software development
● Computer system architecture


VBC Collaborators

● Monash Uni.
● Uni. Melbourne
● Bio21
● UNSW, Uni. Syd
● UQ : IMB
● MIMR, MMC, Austin
● MISCL

● CSIRO : FSA, LI
● USDA : ARS
● Pasteur Institute
● TIGR
● UCSD
● UCLA
● Uni. Copenhagen


Sanger sequencing

● Dye terminated capillary sequencing
● Read length ~ 300 - 900 bp
● Yield ~ 1 Mbp per day maximum
● Cost ~ $HIGH


Roche 454 FLX+

● Pyro-sequencing
● Read length ~ 100 - 250 bp
● Yield ~ 600 Mbp (250 bp PE)
● Run time ~ 1 day
● Prep time ~ 5 days
● Homo-polymer run errors
● Cost $MEDIUM


ABI SOLID 3

● Sequencing by ligation
● Read length ~ 35 – 50 bp
● Yield ~ 15,000 Mbp (50 bp PE)
● Run time ~ 14 days
● Prep time ~ ? days
● Colour space error propagation
● Cost $MEDIUM


Illumina GA2 (Solexa)

● Sequencing by synthesis
● Read length ~ 36 – 100 bp
● Yield ~ 6,000 Mbp (36 bp PE)
● Run time ~ 5 days
● Prep time ~ 1 day
● No homo-polymer errors
● Cost $LOW


Illumina output 36bp

Good read

@HWUSI-EAS100R:3:1:5:1526#0/1
TCCCTTGCATTACTCTTAATCGAGGAAATCCCTTTG
+HWUSI-EAS100R:3:1:5:1526#0/1
abbaaaaaaaaaaaaaaaaa_X^WT]a```a_a\`\

Bad read

@HWUSI-EAS100R:3:1:3:1073#0/2
TGNNNNNNCAAATTCANNNNNNNTCNNTTTATATCT
+HWUSI-EAS100R:3:1:3:1073#0/2
a\DDDDDD^[K]BBBBBBBBBBBBBBBBBBBBBBBB

'B'=Q2 Pr(wrong)=0.38

'a'=Q33 Pr(wrong)=0.0005
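The two quality figures above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the quality characters use an ASCII offset of 64 (the slide does not state the encoding). The function names are illustrative. Under that assumption, the quoted 0.38 for Q2 matches the older Solexa odds-based scale, which gives about 0.39, rather than the Phred scale, which gives about 0.63. At Q33 the two scales agree at about 0.0005.

```python
def illumina_char_to_q(ch, offset=64):
    """Convert one quality character to an integer score (offset 64 is an assumption)."""
    return ord(ch) - offset


def phred_error(q):
    """Phred scale: Q = -10 * log10(p), so p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def solexa_error(q):
    """Solexa scale: Q = -10 * log10(p / (1 - p)), so p = 1 / (1 + 10^(Q/10))."""
    return 1 / (1 + 10 ** (q / 10))


if __name__ == "__main__":
    for ch in "Ba":
        q = illumina_char_to_q(ch)
        print(ch, q, phred_error(q), solexa_error(q))
```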


Read mapping

● Align 10^8 36 bp reads to a 5 Mbp reference
● Traditional tools too slow
● New crop of “short read aligners” (SRA)
  – SHRiMP
  – MAQ
  – Bowtie
  – ELAND
  – Novocraft


SRA capabilities

● SNP = Single nucleotide polymorphism
  – Substitution, e.g. A → C
  – Insertion or deletion (“indel”), e.g. A → -
● Warning: not all aligners support indels!
● We tend to use SHRiMP
  – Supports substitutions and indels
  – Fast SIMD implementation & parallelizable
  – Full post-hit Smith-Waterman alignment
  – Will identify “most” high scoring hits
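To give a feel for why indexed seed lookup is faster than scanning the reference for every read, here is a toy seed-and-verify mapper. It is a sketch only and not how SHRiMP, MAQ, Bowtie or the other tools listed are actually implemented. It assumes exact k-mer seeds and verifies each candidate by counting mismatches, so it handles substitutions but not indels (a real aligner would use something like Smith-Waterman at that step). All names and thresholds are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read, reference, index, k, max_mismatches=2):
    """Return sorted (start, mismatches) pairs for placements within the mismatch limit."""
    hits = {}
    # Use non-overlapping seeds taken from the read.
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference) or start in hits:
                continue
            window = reference[start:start + len(read)]
            # Verification step: substitutions only, no gaps.
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())


if __name__ == "__main__":
    reference = "TTGACCAGTACGGATCCATG"
    index = build_kmer_index(reference, 4)
    # Read copied from position 5 of the reference with one substitution.
    print(map_read("CAGTACTG", reference, index, 4))
```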


Genome coverage

● Mapped 7 M reads to 4 Mbp genome
● Yellow line is mean coverage (56x)
● Bowl shaped coverage = circular genome
● Could be used to guide scaffolding
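A coverage graph like this one can be built from the mapped read positions alone. The sketch below assumes fixed-length reads, 0-based start positions and a linear coordinate system (reads are clipped at the end rather than wrapped around a circular genome). The function names are illustrative. As a sanity check, if the 7 M reads were 36 bp long (an assumption, since the slide does not give the read length), the upper bound on mean coverage would be 63x, so a measured 56x is plausible once unmapped reads are discounted.

```python
def expected_mean_coverage(n_reads, read_length, genome_length):
    """Upper bound on mean depth if every base of every read maps."""
    return n_reads * read_length / genome_length


def coverage_depth(genome_length, read_starts, read_length):
    """Per-base depth from 0-based read start positions, using a difference array."""
    diff = [0] * (genome_length + 1)
    for start in read_starts:
        end = min(start + read_length, genome_length)  # clip reads at the end
        diff[start] += 1
        diff[end] -= 1
    depth = []
    running = 0
    for i in range(genome_length):
        running += diff[i]
        depth.append(running)
    return depth


if __name__ == "__main__":
    depth = coverage_depth(10, [0, 2, 2, 7], 4)
    print(depth, sum(depth) / len(depth))
    print(expected_mean_coverage(7_000_000, 36, 4_000_000))
```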


Missing DNA

● Read coverage drops to zero where reference has DNA that the new sequence does not

● LB022 absent
● hemH present
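Regions like this can be found automatically by scanning a per-base depth list for runs of zeros. A minimal sketch follows, assuming `depth` is a list of integers with one entry per reference position. The `min_length` cut-off is a hypothetical parameter for ignoring very short gaps.

```python
def zero_coverage_regions(depth, min_length=1):
    """Return half-open (start, end) intervals where depth is zero for at least min_length bases."""
    regions = []
    start = None
    for i, d in enumerate(depth):
        if d == 0 and start is None:
            start = i  # a zero-depth run begins here
        elif d != 0 and start is not None:
            if i - start >= min_length:
                regions.append((start, i))
            start = None
    # Handle a run that extends to the end of the reference.
    if start is not None and len(depth) - start >= min_length:
        regions.append((start, len(depth)))
    return regions


if __name__ == "__main__":
    print(zero_coverage_regions([3, 2, 0, 0, 0, 4, 0, 5, 0, 0], min_length=2))
```

Intersecting these intervals with gene coordinates from the reference annotation would then give the list of candidate absent genes.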


Repeated DNA

● Coverage increases in repeated areas
● LA_SNP3199 is probably triplicated in this strain – depth 120, average 40
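The triplication estimate is just the ratio of local depth to average depth: 120 / 40 = 3. As a rough sketch (the function name is illustrative, and real data would need some allowance for coverage noise):

```python
def estimate_copy_number(region_depth, mean_depth):
    """Estimate copy number as region depth over genome-wide mean depth, rounded to an integer."""
    if mean_depth <= 0:
        raise ValueError("mean_depth must be positive")
    return round(region_depth / mean_depth)


if __name__ == "__main__":
    print(estimate_copy_number(120, 40))
```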


SNPs

● SNPs appear as dips/pinches in the coverage graph

● LA1299 gene has 4 possible SNPs relative to the reference

● Rest of gene has average coverage
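Candidate substitutions can also be called directly from the aligned bases at each position rather than read off the graph. The following is a minimal majority-vote sketch, not necessarily the method used for these slides. It assumes the bases covering one reference position are supplied as a string, and the `min_depth` and `min_fraction` thresholds are hypothetical.

```python
from collections import Counter


def call_snp(ref_base, pileup_bases, min_depth=10, min_fraction=0.9):
    """Return the variant base if a non-reference base dominates the column, else None."""
    if len(pileup_bases) < min_depth:
        return None  # too little evidence to call anything
    base, count = Counter(pileup_bases).most_common(1)[0]
    if base != ref_base and count / len(pileup_bases) >= min_fraction:
        return base
    return None


if __name__ == "__main__":
    print(call_snp("A", "C" * 19 + "A"))
    print(call_snp("A", "A" * 20))
```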


Repairing 454 data

● 454 has “homopolymer” errors
● Loses track if same base > 3 times in a row
● Traditional assemblers don't like too many indels or frame shifts
● 454 developed Newbler assembler
● Challenging for hybrid assemblies
● What if we could “repair” our 454 data?


454 Repair Guide

● One sample with 454 and Illumina reads
● Get a read mapper supporting indels
● Align all your Illumina reads to 454 data
● If sufficient un-ambiguous depth
  – correct the 454 sequence!

● Can apply to old closed sequences, 454 contigs, 454 reads etc.

● Find old errors via resequencing


Example repair

>FF6ELPM06G1HYY original 180bp
AAATCTAAAAGAATAGTCGTGGAGCAGGTAGAAAACCTAGATTTACTGAAGAAGAAAAAATATTATAAGAGCTCAAAGAAAAGAAGGAAAACAATAAAAGAGCTTGCAACTTTAAATAATTGTAGCTTTGGAGTAATTCATAAAATTTTACATGAATAATAAATAAAAGGGGATTGAGAT

Sequence        Pos  Change type       Old  New  Evidence
FF6ELPM06G1HYY  11   insertion-before  -    A    "A"x166
FF6ELPM06G1HYY  61   insertion-before  -    A    "A"x212 "-"x12
FF6ELPM06G1HYY  92   insertion-before  -    A    "A"x368 "-"x1

>FF6ELPM06G1HYY repaired 183bp
AAATCTAAAAAGAATAGTCGTGGAGCAGGTAGAAAACCTAGATTTACTGAAGAAGAAAAAAATATTATAAGAGCTCAAAGAAAAGAAGGAAAAACAATAAAAGAGCTTGCAACTTTAAATAATTGTAGCTTTGGAGTAATTCATAAAATTTTACATGAATAATAAATAAAAGGGGATTGAGAT
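The repair step in this example can be sketched in two parts: decide whether the Illumina evidence is unambiguous, then apply the accepted insertions. The code below is an illustration under stated assumptions, not the actual tool used. Positions are taken to be 1-based coordinates in the original sequence, evidence is a dictionary of base counts such as `{"A": 212, "-": 12}`, and the thresholds are hypothetical. Edits are applied from right to left so that earlier coordinates are not shifted by later insertions.

```python
def is_unambiguous(evidence, new_base, min_depth=20, min_fraction=0.9):
    """True if enough reads cover the site and most of them support new_base."""
    total = sum(evidence.values())
    return total >= min_depth and evidence.get(new_base, 0) / total >= min_fraction


def apply_insertions(sequence, edits):
    """Apply (pos, base) edits, each meaning: insert base before 1-based position pos."""
    repaired = sequence
    # Work from the highest position down so original coordinates stay valid.
    for pos, base in sorted(edits, reverse=True):
        idx = pos - 1
        repaired = repaired[:idx] + base + repaired[idx:]
    return repaired


if __name__ == "__main__":
    # First 20 bases of the read above, with the position-11 insertion.
    prefix = "AAATCT" + "A" * 4 + "GAATAGTCGT"
    if is_unambiguous({"A": 166}, "A"):
        print(apply_insertions(prefix, [(11, "A")]))
```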


Trimming short reads

● Quality worsens toward 3' end
● Many reads have “N” basecalls
● Variation across flowcell/slide

● Will reduce data size
● Trade quality for depth
● Is it worth it?


Should I trim?

● For 36 bp
  – Results are mixed
  – Usually best NOT to trim
  – Depth will “fix” most errors
● For 75+ bp
  – 3' quality can be very poor
  – Seems best to trim
  – Not all reads need trimming
● More research needed
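For readers who do want to trim, one simple approach is to cut bases from the 3' end while they are low quality or “N”. This is a minimal sketch of that idea, not a recommendation of particular settings. The quality offset of 64 and the `min_q` threshold of 20 are assumptions.

```python
def trim_3prime(seq, quals, min_q=20, offset=64):
    """Remove trailing bases that are 'N' or below min_q; return (sequence, qualities)."""
    end = len(seq)
    while end > 0 and (ord(quals[end - 1]) - offset < min_q or seq[end - 1] == "N"):
        end -= 1
    return seq[:end], quals[:end]


if __name__ == "__main__":
    print(trim_3prime("ACGTACGTNN", "aaaaaaBBBB"))
```

Reads that are already good to the last base pass through unchanged, which fits the point that not all reads need trimming.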


Conclusion

● Short read mapping is a powerful tool for genomic discovery

  – Automated analysis e.g. SNPs
  – Visualization e.g. depth/coverage graphs
  – Repairing longer read data

● Still need de novo assembly for unmapped reads


Contact me

Web: http://www.vicbioinformatics.com/

[email protected]

