FuzzyPath - A Hybrid De novo Assembler using Solexa and 454 Short Reads
Zemin Ning
The Wellcome Trust Sanger Institute

Post on 21-Jan-2016

Page 1: FuzzyPath - A Hybrid De novo Assembler using Solexa and 454 Short Reads Zemin Ning The Wellcome Trust Sanger Institute

FuzzyPath - A Hybrid De novo Assembler using Solexa and 454 Short Reads

Zemin Ning

The Wellcome Trust Sanger Institute

Page 2:

Outline of the Talk:

Assembly strategy
Read extension using base qualities and read pairs
Repeat junctions and single base variation
Fuzzy kmers - how to find mismatches
Assemblies with mixed Solexa and 454 reads
Solexa reads guided by a closely related reference
Long Solexa reads with 70 bp
Future work

Page 3:

Assembly Strategy

Genome/Chromosome

Forward-reverse paired reads: 30-70 bp each, known distance ~500 bp

Solexa reads assembler to extend long reads of 1-2 Kb

Capillary reads assembler: Phrap/Phusion

Page 4:

Kmer Extension & Repeat Junctions
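The figure for this slide did not survive in the transcript, so the following is only a minimal sketch of the general idea behind kmer extension, not FuzzyPath's actual implementation. It assumes a plain kmer count table, extends a seed one base at a time, and stops when no base or more than one base can follow (a possible repeat junction). The function names, the `max_len` cap and the handling of cycles are all hypothetical; base qualities, read pairs and reverse complements are ignored.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every kmer of length k in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_right(seed, counts, k, max_len=2000):
    """Greedily extend a seed kmer to the right (assumes k >= 2).

    Returns (contig, reason), where reason is one of:
    "dead_end" - no kmer continues the contig
    "junction" - more than one base can follow (possible repeat junction)
    "cycle"    - the next kmer was already used
    "max_len"  - the hypothetical length cap was reached
    """
    contig = seed
    seen = {seed[-k:]}
    while len(contig) < max_len:
        suffix = contig[-(k - 1):]
        candidates = [b for b in "ACGT" if counts.get(suffix + b, 0) > 0]
        if not candidates:
            return contig, "dead_end"
        if len(candidates) > 1:
            return contig, "junction"
        next_kmer = suffix + candidates[0]
        if next_kmer in seen:
            return contig, "cycle"
        seen.add(next_kmer)
        contig += candidates[0]
    return contig, "max_len"


if __name__ == "__main__":
    table = kmer_counts(["ACGTA", "ACGTC"], 4)
    print(extend_right("ACGT", table, 4))
```

In the usage example the two reads agree on ACGT and then diverge, so the extension stops immediately and reports a junction.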

Page 5:

Quality Filters on Junctions

Page 6:

Repetitive Contig and Read Pairs

For each hit read in the contig, contig index and offset are stored.

Figure labels: depth, insert length, current read position, contig start, pair read position.
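The slide states that a contig index and an offset are stored for each read placed in a contig, and its figure labels mention the insert length and the pair read position. A minimal sketch of how such a record could be used to test whether a mate supports the current position is given below. The `ReadHit` type, the `mate_supports_position` function, the `tolerance` parameter and the assumption that the mate sits upstream at a smaller offset are all hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ReadHit:
    contig_index: int  # which contig the read was placed in
    offset: int        # position of the read within that contig


def mate_supports_position(hits: Dict[str, ReadHit], mate_id: str,
                           contig_index: int, current_offset: int,
                           insert_length: int, tolerance: int) -> bool:
    """Return True if the mate was placed in the same contig at roughly
    one insert length upstream of the current read position."""
    hit = hits.get(mate_id)
    if hit is None or hit.contig_index != contig_index:
        return False
    distance = current_offset - hit.offset
    return abs(distance - insert_length) <= tolerance


if __name__ == "__main__":
    hits = {"read_17/1": ReadHit(contig_index=3, offset=1200)}
    # Insert length of ~500 bp as on the assembly strategy slide;
    # the tolerance of 50 bp is an illustrative value.
    print(mate_supports_position(hits, "read_17/1", 3, 1710, 500, 50))
```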

Page 7:

Handling of Single Base Variations

Page 8:

Number of Mismatches between Two Kmers

ACGTAACT A ACAGTT
00 01 10 11 00 00 01 11 00 00 01 00 10 11 11

ACGTAACT C ACAGTT
00 01 10 11 00 00 01 11 01 00 01 00 10 11 11

ACGTAACT   ACAGTT
00 00 00 00 00 00 00 00 01 00 00 00 00 00 00
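The bit patterns on this slide are consistent with a 2-bit encoding (A=00, C=01, G=10, T=11) in which the last row is the bitwise XOR of the first two, so that non-zero pairs mark mismatching bases. A minimal sketch under that reading follows; the function names are hypothetical, and collapsing each 2-bit pair before counting is one common way to do this rather than a confirmed detail of FuzzyPath.

```python
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack_kmer(kmer: str) -> int:
    """Pack a kmer into an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def count_mismatches(a: str, b: str) -> int:
    """Count mismatching bases between two kmers of equal length."""
    if len(a) != len(b):
        raise ValueError("kmers must have the same length")
    diff = pack_kmer(a) ^ pack_kmer(b)
    # Mask with a 1 in the low bit of every 2-bit pair: 0b0101...01.
    mask = (4 ** len(a) - 1) // 3
    # A base differs if either bit of its pair is set in the XOR.
    collapsed = (diff | (diff >> 1)) & mask
    return bin(collapsed).count("1")


if __name__ == "__main__":
    # The two kmers shown on the slide differ at one base (A vs C).
    print(count_mismatches("ACGTAACTAACAGTT", "ACGTAACTCACAGTT"))
```

Collapsing the pairs matters because a single mismatch such as A versus T sets both bits of its pair, so counting set bits directly would report two differences.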

Page 9:

Use of Kmers with Mismatches
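The content of this slide is missing from the transcript. As an illustration only, one simple way to use kmers with mismatches is to look up a kmer together with all kmers that differ from it at a single base. The sketch below assumes a dictionary of kmer counts and a limit of one mismatch; both the function names and that limit are assumptions, not details from the talk.

```python
def neighbours_one_mismatch(kmer):
    """Yield every kmer that differs from the input at exactly one base."""
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def fuzzy_lookup(kmer, table):
    """Return entries of table for the kmer and its one-mismatch neighbours."""
    hits = {}
    if kmer in table:
        hits[kmer] = table[kmer]
    for neighbour in neighbours_one_mismatch(kmer):
        if neighbour in table:
            hits[neighbour] = table[neighbour]
    return hits


if __name__ == "__main__":
    table = {"ACGT": 5, "ACCT": 2, "TTTT": 1}
    print(fuzzy_lookup("ACGT", table))
```

A kmer of length K has 3K one-mismatch neighbours, so this enumeration stays cheap for short kmers but grows quickly if more mismatches are allowed.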

Page 10:

Mixed Solexa and 454 Reads

Figure: pileup of 454 reads at a repeat junction. Read length L = ~250 bp; labels: L-K+1 kmers, L-N-K+1 kmers.
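The label "L-K+1 kmers" is the number of overlapping kmers of length K in a read of length L. A short sketch makes the count concrete; the value K = 31 is an arbitrary illustrative choice, since the transcript does not state K, and N is not defined in the transcript, so the L-N-K+1 label is left out here.

```python
def kmers(read: str, k: int) -> list:
    """Return all overlapping kmers of length k in a read, left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [read[i:i + k] for i in range(len(read) - k + 1)]


if __name__ == "__main__":
    read = "A" * 250           # stand-in for a ~250 bp 454 read
    print(len(kmers(read, 31)))  # L - K + 1 = 250 - 31 + 1
```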

Page 11:

Pileup of Solexa and 454 Reads

Page 12:

Guided by A Closely Related Reference

Figure: pileup of shredded reads at a repeat junction. Read length L = 3000 bp; labels: L-K+1 kmers, L-N-K+1 kmers.

Page 13:

Pileup of Solexa and Shredded Reads

Page 14:

Long Solexa Reads with 70 bp

Figure: pileup of long Solexa reads at a repeat junction. Read length L = 70 bp; label: L-K+1 kmers.

Page 15:

Pileup of Long 70 bp Solexa Reads

Page 16:

S. suis P1/7 Solexa/454 Assembly

Solexa reads:
Number of reads: 3,084,185
Finished genome size: 2,007,491 bp
Read length: 39 and 36 bp
Estimated read coverage: ~55X
Number of 454 reads: 100,000
Read coverage of 454: 10X

Assembly features - contig stats:
Total number of contigs: 73
Total bases of contigs: 1,999,817 bp
N50 contig size: 62,508
Largest contig: 162,190
Averaged contig size: 27,394
Contig coverage over the genome: ~99 %
Contig extension errors: 2
Mis-assembly errors: 3
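For readers unfamiliar with the contig statistics above, the sketch below shows the usual definition of N50 (the length of the contig at which the running total of sorted contig lengths first reaches half of the assembled bases) and checks the averaged contig size reported on the slide. The individual contig lengths are not in the transcript, so the list used for N50 is a made-up toy example.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig lengths must be positive")


if __name__ == "__main__":
    # Toy contig lengths, not data from the talk.
    print(n50([2, 3, 4, 5, 6]))
    # Averaged contig size from the slide: total bases / number of contigs.
    print(1_999_817 // 73)
```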

Page 17:

S. suis P1/7 with Shredded Pair-end Reads

Shredded reads:
Number of reads: 1,338,161
Finished genome size: 2,007,491 bp
Read length: 36 bp
Estimated read coverage: 24X
Insert size: 500 bp

Assembly features:
                          Paired_Data    Not_Paired
Number of contigs:        35             317
Total assembled bases:    1.996 Mb       1.956 Mb
N50 contig size:          243,039        13,929
Largest contig:           474,070        33,460
Averaged contig size:     57,043         6,168
Contig coverage:          >99.0 %        >99.0 %
Contig extension errors:  0              0
Mis-assembly errors:      3              2
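The estimated read coverage on this slide can be reproduced from the other figures as number of reads times read length divided by genome size. The sketch below uses the slide's values and assumes the read length of 36 is in bp; the function name is hypothetical.

```python
def read_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Estimate average read coverage (depth) over a genome."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return num_reads * read_length_bp / genome_size_bp


if __name__ == "__main__":
    coverage = read_coverage(1_338_161, 36, 2_007_491)
    print(f"{coverage:.1f}X")  # prints 24.0X, matching the slide's 24X
```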

Page 18:

Salmonella delhi5 Solexa Assembly Guided by A Close Reference

Solexa reads:
Number of reads: 6,346,317
Finished genome size: 4.7 Mbp
Read length: 33 bp
Estimated read coverage: ~40X
Shredded reference of SpA: 10X

Assembly features - contig stats:
Total number of contigs: 66
Total bases of contigs: 4,615,704 bp
N50 contig size: 168,793
Largest contig: 401,700
Averaged contig size: 69,934
Contig coverage over the genome: ~98 %
Contig extension errors: 0
Mis-assembly errors: 2

Page 19:

S. suis P1/7 Shredded Read Assembly

Shredded reads:
Number of reads: 1,338,161
Finished genome size: 2,007,491 bp
Read length: 36 bp
Estimated read coverage: 24X
Insert size: 500 bp

Assembly features:
                          Paired_Data    Not_Paired
Number of contigs:        35             317
Total assembled bases:    1.996 Mb       1.956 Mb
N50 contig size:          243,039        13,929
Largest contig:           474,070        33,460
Averaged contig size:     57,043         6,168
Contig coverage:          >99.0 %        >99.0 %
Contig extension errors:  0              0
Mis-assembly errors:      3              2

Page 20:

Acknowledgements:

Yong Gu, Ben Blackburne, Hannes Ponstingl, Harold Swerdlow, Michael Quail, Tony Cox, Richard Durbin