Post on 01-Apr-2015


Page 1: FuzzyPath Assemblies - from Bacterial to Mammalian Genomes and Zebrafish Finishing

Zemin Ning

The Wellcome Trust Sanger Institute

Page 2: Assembly Strategy

Forward-reverse paired reads: 30-70 bp each, a known distance apart (~500 bp)

Solexa reads assembler: extends the paired reads into long reads of 1-2 Kb

Capillary reads assembler (Phrap/Phusion): assembles the long reads into the Genome/Chromosome

Page 3: Kmer Extension & Repeat Junctions
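A minimal sketch of what kmer extension and a repeat junction can look like, assuming exact kmers and no sequencing errors. This is not FuzzyPath's implementation; the function names, the toy reads and k=3 are all hypothetical. A seed is extended one base at a time while the reads offer exactly one next base, and extension stops at a dead end or at a junction where several continuations compete.

```python
from collections import defaultdict


def build_kmer_successors(reads, k):
    """Map each k-mer to the set of bases observed directly after it."""
    successors = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            successors[read[i:i + k]].add(read[i + k])
    return successors


def extend_from_seed(seed, successors, k, max_len=10_000):
    """Extend seed to the right until a dead end, a junction, or a loop."""
    contig = seed
    seen = {seed[-k:]}
    while len(contig) < max_len:
        next_bases = successors.get(contig[-k:], set())
        if len(next_bases) != 1:
            break  # 0 continuations: dead end; 2 or more: junction
        contig += next(iter(next_bases))
        tail = contig[-k:]
        if tail in seen:
            break  # the path has looped back onto itself
        seen.add(tail)
    return contig


# Toy reads (hypothetical): two reads disagree after "CGT", forming a junction.
reads = ["AACGT", "ACGTC", "ACGTG"]
table = build_kmer_successors(reads, k=3)
print(extend_from_seed("AAC", table, k=3))
```

Stopping at the junction rather than guessing is the conservative choice; the later slides list the extra evidence (base quality, read pairs, fuzzy kmers, other read types) that can be used to pick a branch.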

Page 4: Handling of Single Base Variations

Page 5: Fuzzy Kmers - Number of Mismatches between Two Kmers

ACGTAACT A ACAGTT
00 01 10 11 00 00 01 11 00 00 01 00 10 11 11

ACGTAACT C ACAGTT
00 01 10 11 00 00 01 11 01 00 01 00 10 11 11

ACGTAACT   ACAGTT
00 00 00 00 00 00 00 00 01 00 00 00 00 00 00
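The rows above read as a 2-bit encoding of two kmers (A=00, C=01, G=10, T=11) followed by their difference, where a non-zero pair marks a mismatch. A minimal sketch of counting mismatches this way, assuming the comparison is a bitwise XOR (the slide shows the result but does not name the operation). The function names are hypothetical.

```python
# 2-bit base codes as shown on the slide.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode_kmer(kmer):
    """Pack a kmer into an integer, two bits per base, first base highest."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def count_mismatches(kmer_a, kmer_b):
    """Count differing bases between two equal-length kmers via XOR."""
    if len(kmer_a) != len(kmer_b):
        raise ValueError("kmers must have the same length")
    diff = encode_kmer(kmer_a) ^ encode_kmer(kmer_b)
    # Fold each 2-bit pair onto its low bit, so a pair that differs in
    # either bit contributes exactly one set bit.
    low_bits = (4 ** len(kmer_a) - 1) // 3  # binary 0101...01
    folded = (diff | (diff >> 1)) & low_bits
    return bin(folded).count("1")


# The two kmers from the slide differ only at the ninth base (A vs C).
print(count_mismatches("ACGTAACTAACAGTT", "ACGTAACTCACAGTT"))
```

Folding before counting matters: A versus T differs in both bits of the pair, and a plain bit count of the XOR would report that single mismatch as two.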

Page 6: Kmer Extension & Repeat Junctions

Means to handle repeats:
- Base quality
- Read pair
- Fuzzy kmers
- Closely related reference
- 454 or Sanger reads

Pileup of other reads (454, Sanger, etc.) at a repeat junction

Consensus
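A minimal sketch of deriving a consensus from a pileup of longer reads across a junction, assuming the reads are already aligned and padded with '-' where they give no coverage. The majority-vote rule, the `min_fraction` threshold and the function name are assumptions for illustration, not the method used in the talk.

```python
from collections import Counter


def pileup_consensus(aligned_reads, min_fraction=0.5):
    """Column-wise majority vote; 'N' where no base wins a clear majority."""
    length = max(len(read) for read in aligned_reads)
    consensus = []
    for col in range(length):
        counts = Counter(
            read[col]
            for read in aligned_reads
            if col < len(read) and read[col] != "-"
        )
        if not counts:
            consensus.append("N")  # no read covers this column
            continue
        base, votes = counts.most_common(1)[0]
        total = sum(counts.values())
        consensus.append(base if votes / total > min_fraction else "N")
    return "".join(consensus)


# Toy pileup (hypothetical): the third read carries one disagreeing base.
print(pileup_consensus(["ACGT-", "ACGTA", "-CCTA"]))
```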

Page 7: Pileup of Solexa and 454 Reads

Page 8: S. suis P1/7 Solexa/454 Assembly

Solexa reads:
Number of reads: 3,084,185
Finished genome size: 2,007,491 bp
Read length: 39 and 36 bp
Estimated read coverage: ~55X
Number of 454 reads: 100,000
Read coverage of 454: 10X

Assembly features (contig stats):
Total number of contigs: 73
Total bases of contigs: 1,999,817 bp
N50 contig size: 62,508 bp
Largest contig: 162,190 bp
Averaged contig size: 27,394 bp
Contig coverage over the genome: ~99%
Contig extension errors: 2
Mis-assembly errors: 3
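The contig statistics on these slides include an N50 contig size. A minimal sketch of how such a figure can be computed, assuming the common definition: sort contigs longest first and report the length of the contig at which the running total reaches half of all assembled bases. The contig lengths in the example are made up and are not from any assembly in this talk.

```python
def n50(contig_lengths):
    """Length of the contig at which the longest-first running sum
    reaches half of the total assembled bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


def average_contig_size(contig_lengths):
    """Total assembled bases divided by the number of contigs."""
    return sum(contig_lengths) / len(contig_lengths)


# Hypothetical contig lengths, for illustration only.
lengths = [2, 3, 4, 5, 6]
print(n50(lengths), average_contig_size(lengths))
```

The averaged contig size on the slide is consistent with this simple ratio: 1,999,817 bp over 73 contigs is about 27,394 bp.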

Page 9: Salmonella Senftenberg Solexa Assembly from Paired-End Reads

Solexa reads:
Number of reads: 6,000,000
Finished genome size: ~4.8 Mbp
Read length: 2x37 bp
Estimated read coverage: ~92.5X
Insert size: 170/50-300 bp

Assembly features (contig stats):   Solexa     454
Total number of contigs:            75         390
Total bases of contigs:             4.80 Mbp   4.77 Mbp
N50 contig size:                    139,353    25,702
Largest contig:                     395,600    62,040
Averaged contig size:               63,969     12,224
Contig coverage on genome:          ~99.8%     99.4%
Contig extension errors:            0
Mis-assembly errors:                0          4
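The estimated read coverage quoted here can be reproduced with a one-line calculation: sequenced bases divided by genome size. The sketch below assumes the "Number of reads" figure counts read pairs; the slide does not say so, but ~92.5X only comes out under that assumption. The function and parameter names are hypothetical.

```python
def estimated_coverage(n_pairs, read_len, genome_size, reads_per_pair=2):
    """Sequenced bases divided by genome size.

    Assumes n_pairs counts read pairs, each contributing reads_per_pair
    reads of read_len bases.
    """
    return n_pairs * reads_per_pair * read_len / genome_size


# Salmonella: 6,000,000 pairs of 2x37 bp over ~4.8 Mbp.
print(estimated_coverage(6_000_000, 37, 4_800_000))
```

The same assumption reproduces the E. coli 042 slide (7,055,348 pairs of 2x36 bp over 5.35 Mbp gives about 95X).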

Page 10: Extremely GC Biased Genomes

library        organism             read length   Mb sequence   genome      mean
                                    (bp)          generated     size (Mb)   coverage
PCR-free       B. pertussis ST24    2 x 76        907           4.1         221
PCR-free       E. coli 042          2 x 76        573           5.3         108
PCR-free       P. falciparum 3D7    2 x 76        1486          23.0        65
PCR-free       B. pertussis ST24    2 x 36        452           4.1         110
PCR-free       P. falciparum 3D7    2 x 36        1008          23.0        44
PCR-free       E. coli 042          2 x 36        958           5.3         181
standard-245   P. falciparum 3D7    2 x 35        2198          23.0        96
standard-368   P. falciparum 3D7    2 x 35        2628          23.0        115
standard-851   P. falciparum 3D7    2 x 35        474           23.0        21
standard-883   P. falciparum clin   2 x 36        3994          23.0        175

GC: 68.0% (B. pertussis), 50.5% and 50.8% (E. coli), 19.0% (P. falciparum)

Page 11: Malaria 3D7 Assemblies

Solexa reads:                 2x36 bp        2x76 bp
Number of reads:              14.0 million   9.77 million
Finished genome size:         23 Mbp         23 Mbp
Estimated read coverage:      43X            64X
Insert size:                  170 bp         170 bp

Assembly features:
Total number of contigs:      26,926         22,839
Total bases of contigs:       19.2 Mbp       21.1 Mbp
N50 contig size:              1,456          1,621
Largest contig:               9,106          9,825
Averaged contig size:         706            923
Contig coverage on genome:    ~83.5%         91.7%
Contig extension errors:      ?              ?
Mis-assembly errors:          ?              ?

Page 12: E. coli strain 042 Assembly

Solexa reads:
Number of reads: 7,055,348
Finished genome size: 5.35 Mbp
Read length: 2x36 bp
Estimated read coverage: ~95X
Insert size: 170/50-300 bp

Assembly features (contig stats):
Total number of contigs: 168
Total bases of contigs: 5.19 Mbp
N50 contig size: 85,886 bp
Largest contig: 337,768 bp
Averaged contig size: 30,886 bp
Contig coverage over the genome: ~99%
Contig extension errors: 1
Mis-assembly errors: 2

Page 13: Mouse Chromosome 17 Assembly

Solexa reads:
Number of reads: 86.5 million
Finished genome size: 95.2 Mbp
Read length: 2x36 bp
Estimated read coverage: ~65X
Insert size: 120/50-200 bp

Assembly features (contig stats):
Total number of contigs: 55,802
Total bases of contigs: 75.8 Mbp
N50 contig size: 2,322 bp
Largest contig: 17,859 bp
Averaged contig size: 1,358 bp
Contig coverage over the genome: ~80%
Contig extension errors: ?
Mis-assembly errors: ?

Page 14: Pooled Clones: Zfish 9, Pig 3

Clone Name   Length (bp)   Finished   Cloning Vector     Species     Capillary Data Pathway
zH117H1      129221        Yes        pTARBAC2.1         D. rerio    /nfs/repository/d0012/zH117H1
zH141B18     119622        Yes        pTARBAC2.1         D. rerio    /nfs/repository/d0012/zH141B18
zH151M17     122622        Yes        pTARBAC2.1         D. rerio    /nfs/repository/d0014/zH151M17
zH117E7      139449        Yes        pTARBAC2.1         D. rerio    /nfs/repository/d0015/zH117E7
zH137D22     122615        Yes        pTARBAC2.1         D. rerio    /nfs/repository/d0023/zH137D22
zH97A24      113538        Yes        pTARBAC2.1         D. rerio    /nfs/repository/d0027/zH97A24
zH146D21     109862        Yes        pTARBAC2.1         D. rerio    /nfs/repository/d0040/zH146D21
zH140N19     118794        Yes        pTARBAC2.1         D. rerio    /nfs/repository/d0013/zH140N19
zH147D24     111470        Yes        pTARBAC2.1         D. rerio    /nfs/repository/d0011/zH147D24
bE2F11       170585        Yes        pTARBAC1.3_BamHI   S. scrofa   /nfs/repository/d0027/bE2F11
bE156J20     210831        Yes        pTARBAC1.3_BamHI   S. scrofa   /nfs/repository/d0041/bE156J20
bE240L11     216560*       No         pTARBAC1.3_BamHI   S. scrofa   /nfs/repository/d0012/bE240L11

* Finished length may be shorter or longer once complete

Page 15: Mapping of Solexa Reads On the Reference

[Figure: averaged read depth on 1 kb window (y-axis, 0-600) against accumulated clone position in Kb (x-axis, 0-1700); series: Zfish-set1, Zfish-set2]
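The plot on this slide shows averaged read depth in 1 kb windows along the accumulated clone position. A minimal sketch of how such a profile can be computed from mapped read start positions, assuming 0-based starts on the reference and a fixed read length. The function name and the toy input are hypothetical; the talk does not describe the actual mapping pipeline.

```python
def windowed_depth(read_starts, read_len, ref_len, window=1000):
    """Average per-base read depth in consecutive fixed-size windows.

    Assumes 0-based start positions within the reference and that every
    read spans read_len bases (clipped at the reference end).
    """
    n_windows = (ref_len + window - 1) // window
    covered = [0] * n_windows  # bases of read overlap per window
    for start in read_starts:
        end = min(start + read_len, ref_len)
        pos = start
        while pos < end:
            w = pos // window
            stop = min((w + 1) * window, end)
            covered[w] += stop - pos
            pos = stop
    depths = []
    for w in range(n_windows):
        width = min(window, ref_len - w * window)  # last window may be short
        depths.append(covered[w] / width)
    return depths


# Toy input (hypothetical): three 20 bp reads on a 2 kb reference;
# the read starting at 990 straddles the window boundary.
print(windowed_depth([0, 0, 990], read_len=20, ref_len=2000))
```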

Page 16:

[Diagram labels: paired reads of 30-70 bp, insert ~300 bp; Solexa assembly; FuzzyPath; extended long reads of 1-2 Kb; Fishing WGS Reads; Phusion; WGS Reads 5X; Combined Reads; Phusion or Phrap; Genome/Chromosome Assembly]

Page 20: Acknowledgements

Yong Gu, James Bonfield, Helen Beasley, Siobhan Whitehead, Daniel Turner, Michael Quail, Tony Cox, Richard Durbin