FuzzyPath Assemblies - from Mixed Solexa/454 Datasets to Extremely GC Biased Genomes
Zemin Ning
The Wellcome Trust Sanger Institute
Assembly Strategy
Solexa reads assembler to extend long reads of 1-2 Kb
Genome/Chromosome
Capillary reads assembler: Phrap/Phusion
Forward-reverse paired reads: 30-70 bp per read, known distance ~500 bp
Kmer Extension & Repeat Junctions
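As a rough illustration of what kmer extension and a repeat junction mean in practice, here is a minimal sketch, not FuzzyPath's actual implementation. It assumes a plain kmer count table built from the reads; the names `build_kmer_counts` and `extend_right` and the `min_count` threshold are hypothetical. Extension proceeds one base at a time and stops when zero or more than one next base is supported, the latter being treated as a repeat junction.

```python
from collections import defaultdict


def build_kmer_counts(reads, k):
    """Count every kmer of length k observed in the reads."""
    counts = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_right(seed, counts, k, min_count=1, max_len=10_000):
    """Extend a seed (length >= k) to the right, one base at a time.

    Returns (contig, reason), where reason is one of:
      "junction" - more than one next base is supported (possible repeat)
      "dead_end" - no next base is supported
      "cycle"    - the extension revisited a kmer it had already used
      "max_len"  - the length limit was reached
    """
    contig = seed
    seen = {seed[-k:]}
    while len(contig) < max_len:
        suffix = contig[-(k - 1):]
        candidates = [b for b in "ACGT"
                      if counts.get(suffix + b, 0) >= min_count]
        if len(candidates) != 1:
            return contig, ("junction" if candidates else "dead_end")
        next_kmer = suffix + candidates[0]
        if next_kmer in seen:
            return contig, "cycle"
        seen.add(next_kmer)
        contig += candidates[0]
    return contig, "max_len"


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGG", "GTACTT"]
    counts = build_kmer_counts(reads, k=4)
    print(extend_right("ACGT", counts, k=4))
```

In this toy input the extension halts after `ACGTAC`, because both `TACG` and `TACT` are present; this is the point where extra evidence such as read pairs or longer reads would be needed to choose a branch.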
Handling of Single Base Variations
ACGTAACT A ACAGTT   00 01 10 11 00 00 01 11 00 00 01 00 10 11 11
ACGTAACT C ACAGTT   00 01 10 11 00 00 01 11 01 00 01 00 10 11 11
ACGTAACT   ACAGTT   00 00 00 00 00 00 00 00 01 00 00 00 00 00 00
Fuzzy Kmers
Number of Mismatches between Two Kmers
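The two-bit patterns on the slide (A = 00, C = 01, G = 10, T = 11) suggest that mismatches between two kmers can be counted by comparing their packed codes. The sketch below is one way to do that with a bitwise XOR; it is an illustration under that assumption, and the function names `encode`, `mismatches` and `is_fuzzy_match` are hypothetical rather than taken from FuzzyPath.

```python
# Two-bit base codes as shown on the slide.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(kmer):
    """Pack a kmer into an integer, two bits per base, first base highest."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]
    return value


def mismatches(kmer_a, kmer_b):
    """Count positions where two equal-length kmers differ."""
    if len(kmer_a) != len(kmer_b):
        raise ValueError("kmers must have the same length")
    diff = encode(kmer_a) ^ encode(kmer_b)
    count = 0
    while diff:
        if diff & 0b11:  # any set bit in this two-bit slot means a mismatch
            count += 1
        diff >>= 2
    return count


def is_fuzzy_match(kmer_a, kmer_b, max_mismatches=1):
    """Treat two kmers as the same 'fuzzy kmer' if they differ only slightly."""
    return mismatches(kmer_a, kmer_b) <= max_mismatches


if __name__ == "__main__":
    a = "ACGTAACTAACAGTT"
    b = "ACGTAACTCACAGTT"
    print(mismatches(a, b), is_fuzzy_match(a, b))
```

For the two 15-base kmers on the slide, the XOR is zero everywhere except the two-bit slot of the ninth base, so the mismatch count is one.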
Means to handle repeats:
- Base quality
- Read pair
- Fuzzy kmers
- Closely related reference
- 454 or Sanger reads
Kmer Extension & Repeat Junctions
Pileup of other reads like 454, Sanger etc. at a repeat junction
Consensus
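A consensus over a pileup can be sketched as a per-column majority vote. The example below assumes the reads are already aligned and padded to equal length with `-`; that input format, the `N` fallback for empty columns, and the function name `consensus` are assumptions made for illustration. Base qualities are ignored here, although a real assembler would likely weight by them.

```python
from collections import Counter


def consensus(pileup, gap="-"):
    """Majority-vote consensus over pre-aligned, equal-length rows.

    Columns that contain only gaps are reported as 'N'.
    """
    if not pileup:
        return ""
    width = len(pileup[0])
    if any(len(row) != width for row in pileup):
        raise ValueError("all pileup rows must have the same length")
    called = []
    for column in zip(*pileup):
        votes = Counter(base for base in column if base != gap)
        called.append(votes.most_common(1)[0][0] if votes else "N")
    return "".join(called)


if __name__ == "__main__":
    rows = [
        "ACGT--",
        "ACGTAC",
        "-CTTAC",  # one disagreeing base in the third column
        "--GTAC",
    ]
    print(consensus(rows))
```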
Pileup of Solexa and 454 Reads
S. suis P1/7 Solexa/454 Assembly
Solexa reads:
Number of reads: 3,084,185
Finished genome size: 2,007,491 bp
Read length: 39 and 36 bp
Estimated read coverage: ~55X
Number of 454 reads: 100,000
Read coverage of 454: 10X
Assembly features - contig stats:
Total number of contigs: 73
Total bases of contigs: 1,999,817 bp
N50 contig size: 62,508
Largest contig: 162,190
Averaged contig size: 27,394
Contig coverage over the genome: ~99%
Contig extension errors: 2
Mis-assembly errors: 3
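The contig statistics quoted on these slides (number of contigs, total bases, N50, largest and averaged contig size) can all be derived from a list of contig lengths. The sketch below uses the common definition of N50, the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the total; the function name `contig_stats` and the toy lengths are illustrative only.

```python
def contig_stats(lengths):
    """Summarise contig lengths: count, total, N50, largest, mean."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        if running * 2 >= total:  # reached half of all assembled bases
            n50 = length
            break
    return {
        "count": len(ordered),
        "total": total,
        "n50": n50,
        "largest": ordered[0],
        "mean": total / len(ordered),
    }


if __name__ == "__main__":
    print(contig_stats([80, 70, 50, 40, 30, 20]))
```

As a consistency check on the slide itself, 1,999,817 bases over 73 contigs gives about 27,394.8, which agrees with the stated averaged contig size of 27,394.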
Salmonella Senftenberg Solexa Assembly from Paired-End Reads
Solexa reads:
Number of reads: 6,000,000
Finished genome size: ~4.8 Mbp
Read length: 2x37 bp
Estimated read coverage: ~92.5X
Insert size: 170/50-300 bp
Assembly features - contig stats:
                               Solexa      454
Total number of contigs:       75          390
Total bases of contigs:        4.80 Mbp    4.77 Mbp
N50 contig size:               139,353     25,702
Largest contig:                395,600     62,040
Averaged contig size:          63,969      12,224
Contig coverage on genome:     ~99.8%      99.4%
Contig extension errors:       0
Mis-assembly errors:           0           4
E. coli strain 042 Assembly
Solexa reads:
Number of reads: 7,055,348
Finished genome size: 5.35 Mbp
Read length: 2x36 bp
Estimated read coverage: ~95X
Insert size: 170/50-300 bp
Assembly features - contig stats:
Total number of contigs: 168
Total bases of contigs: 5.19 Mbp
N50 contig size: 85,886
Largest contig: 337,768
Averaged contig size: 30,886
Contig coverage over the genome: ~99%
Contig extension errors: 1
Mis-assembly errors: 2
Salmonella delhi5 Solexa Assembly Guided by a Close Reference
Solexa reads:
Number of reads: 6,346,317
Finished genome size: 4.7 Mbp
Read length: 33 bp
Estimated read coverage: ~40X
Shredded reference of SpA: 10X
Assembly features - contig stats:
Total number of contigs: 66
Total bases of contigs: 4,615,704 bp
N50 contig size: 168,793
Largest contig: 401,700
Averaged contig size: 69,934
Contig coverage over the genome: ~98%
Contig extension errors: 0
Mis-assembly errors: 2
The Malaria Genome Project
Datasets with Various GC Content
library        organism             read length   Mb sequence generated   genome size (Mb)   mean coverage
PCR-free       B. pertussis ST24    2 x 76        907                     4.1                221
PCR-free       E. coli 042          2 x 76        573                     5.3                108
PCR-free       P. falciparum 3D7    2 x 76        1486                    23.0               65
PCR-free       B. pertussis ST24    2 x 36        452                     4.1                110
PCR-free       P. falciparum 3D7    2 x 36        1008                    23.0               44
PCR-free       E. coli 042          2 x 36        958                     5.3                181
standard-245   P. falciparum 3D7    2 x 35        2198                    23.0               96
standard-368   P. falciparum 3D7    2 x 35        2628                    23.0               115
standard-851   P. falciparum 3D7    2 x 35        474                     23.0               21
standard-883   P. falciparum clin   2 x 36        3994                    23.0               175
GC: 68.0% (B. pertussis), 50.5% and 50.8% (E. coli), 19.0% (P. falciparum)
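The mean coverage column in the table appears to be the amount of sequence generated divided by the genome size. The sketch below recomputes it from the table's own figures; the function name `mean_coverage` is illustrative. The recomputed values agree with the stated ones to within about one to two units, which would be consistent with the genome sizes having been rounded for display, though that explanation is an assumption.

```python
# Rows copied from the table:
# (library, organism, Mb sequence generated, genome size in Mb, stated coverage)
DATASETS = [
    ("PCR-free", "B. pertussis ST24", 907, 4.1, 221),
    ("PCR-free", "E. coli 042", 573, 5.3, 108),
    ("PCR-free", "P. falciparum 3D7", 1486, 23.0, 65),
    ("PCR-free", "B. pertussis ST24", 452, 4.1, 110),
    ("PCR-free", "P. falciparum 3D7", 1008, 23.0, 44),
    ("PCR-free", "E. coli 042", 958, 5.3, 181),
    ("standard-245", "P. falciparum 3D7", 2198, 23.0, 96),
    ("standard-368", "P. falciparum 3D7", 2628, 23.0, 115),
    ("standard-851", "P. falciparum 3D7", 474, 23.0, 21),
    ("standard-883", "P. falciparum clin", 3994, 23.0, 175),
]


def mean_coverage(mb_sequenced, genome_size_mb):
    """Mean coverage = total sequenced bases / genome size (same units)."""
    return mb_sequenced / genome_size_mb


if __name__ == "__main__":
    for library, organism, mb, genome_mb, stated in DATASETS:
        computed = mean_coverage(mb, genome_mb)
        print(f"{library:13s} {organism:19s} computed {computed:6.1f}  stated {stated}")
```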
[Figure: fraction of unmasked genome (%) against depth of genome base coverage (0-100). Series: 3D7:PCR-free-36bp-43x, 3D7:PCR-free-76bp-64x, PFClin:run883-36bp-50x, PFClin:run883-36bp-170x, 3D7:run368-35bp-114x, 3D7:run245-35bp-95x, 3D7:run851-35bp-21x]
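A curve of genome fraction against depth of base coverage can be computed from read placements. The sketch below is only an illustration of that calculation: it assumes each aligned read is given as a hypothetical `(start, length)` pair on a single sequence with 0-based coordinates, and it does not model the masking step mentioned in the axis label.

```python
from collections import Counter


def per_base_depth(genome_len, alignments):
    """Depth at every position, from (start, length) read placements."""
    delta = [0] * (genome_len + 1)
    for start, length in alignments:
        end = min(start + length, genome_len)
        delta[start] += 1  # coverage rises where the read begins
        delta[end] -= 1    # and falls just past its last base
    depths, running = [], 0
    for i in range(genome_len):
        running += delta[i]
        depths.append(running)
    return depths


def depth_distribution(depths):
    """Percentage of positions observed at each depth value."""
    counts = Counter(depths)
    total = len(depths)
    return {d: 100.0 * n / total for d, n in sorted(counts.items())}


if __name__ == "__main__":
    depths = per_base_depth(10, [(0, 4), (2, 4), (8, 4)])
    print(depths)
    print(depth_distribution(depths))
```

A running sum over the resulting percentages gives the accumulated form of the same curve.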
[Figure: accumulated fraction of unmasked genome (%) against depth of genome base coverage (1-1000, log scale). Series: 3D7:PCR-free-36bp-43x, 3D7:PCR-free-76bp-64x, PFClin:run883-36bp-170x, 3D7:run368-35bp-114x, 3D7:run245-35bp-95x, 3D7:run851-35bp-21x]
[Figure: GC content (%) against depth of genome base coverage (0-200). Series: 3D7:PCR-free-36bp-43x, 3D7:PCR-free-76bp-64x, PFClin:run883-36bp-170x, 3D7:run368-35bp-114x, 3D7:run245-35bp-95x, 3D7:run851-35bp-21x]
[Figure: percentage of matched reads (%) against depth of duplication (0-20). Series: 3D7:PCR-free-36bp-43x, 3D7:PCR-free-76bp-64x, PFClin:run883-36bp-170x, 3D7:run368-35bp-114x, 3D7:run245-35bp-95x, 3D7:run851-35bp-21x]
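One plausible way to obtain a duplication profile like the one plotted is to group matched reads by their mapping position and ask what share of reads falls into groups of each size. The slides do not define "depth of duplication", so the definition used here (number of reads sharing the same placement key) is an assumption, as are the function name `duplication_profile` and the key format.

```python
from collections import Counter


def duplication_profile(read_keys):
    """Percentage of matched reads at each duplication depth.

    read_keys: one hashable placement key per matched read, for example a
    (sequence, position, strand) tuple. Reads with equal keys are treated
    as duplicates of one another.
    """
    per_key = Counter(read_keys)
    total_reads = sum(per_key.values())
    reads_at_depth = Counter()
    for copies in per_key.values():
        reads_at_depth[copies] += copies
    return {depth: 100.0 * n / total_reads
            for depth, n in sorted(reads_at_depth.items())}


if __name__ == "__main__":
    starts = [1, 1, 1, 2, 3, 3, 4, 5, 6, 7]
    print(duplication_profile(starts))
```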
Malaria 3D7 Assemblies
Solexa reads:                  2x36 bp     2x76 bp
Number of reads:               14.0m       9.77m
Finished genome size:          23 Mbp      23 Mbp
Estimated read coverage:       43x         64x
Insert size:                   170 bp      170 bp
Assembly features:
Total number of contigs:       26,926      22,839
Total bases of contigs:        19.2 Mbp    21.1 Mbp
N50 contig size:               1,456       1,621
Largest contig:                9,106       9,825
Averaged contig size:          706         923
Contig coverage on genome:     ~83.5%      91.7%
Contig extension errors:       ?           ?
Mis-assembly errors:           ?           ?
Acknowledgements:
Yong Gu, Ben Blackburne, Hannes Ponstingl, Daniel Turner, Michael Quail, Tony Cox, Richard Durbin