helsinki genome project-20151210-amb
TRANSCRIPT
-
Things to consider when initiating a genome project: the assembly pipeline @ SciLifeLab
Helsinki, Dec 9th 2015
Álvaro Martínez Barrio, [email protected] /in/ambarrio @ambarrio
-
Workshop Outline
Introducing SciLifeLab
The important considerations of all genome projects
The annotation and assembly platforms
A vision into the future
-
Survey
How many of you have used sequencing facilities?
Assembled a genome?
Planning to start a genome project?
Have worked with NGS data?
Just curious about NGS?
-
Things to consider
Repeats
Heterozygosity
Size of your genome
GC content
Access to material and specifically HMW DNA
Access to a good computational cluster
Good bioinformaticians/lab technicians
-
Things to consider
WHAT IS YOUR SCIENTIFIC QUESTION?
-
Variation space
-
Things to consider
http://www.intechopen.com/books/recent-advances-in-autism-spectrum-disorders-volume-i/discovering-the-genetics-of-autism
-
Things to consider
Figure 2 | Structural variation sequence signatures. There are four general sequence-based analytical approaches used to detect structural variation. Theoretically, read-pair (RP), split-read and assembly methods can be used to discover variants from all classes of structural variant (SV), but each has different biases depending on the underlying sequence content of the variants and the data properties of the sequence reads. However, read-depth approaches can be used to detect only losses (deletions) and gains (duplications), and cannot discriminate between tandem and interspersed duplications. Briefly, read-pair methods analyse the mapping information of paired-end reads and their discordancy from the expected span size and mapped strand properties. Sensitivity, specificity and breakpoint accuracy are dependent on the read length, insert size and physical coverage [3,4,59,62,65,66,68,69]. Breakpoints are indicated by red arrows. Read-depth analysis examines the increase and decrease in sequence coverage to detect duplications and deletions, respectively, and predict absolute copy numbers of genomic intervals [45,62,74-76]. Split-read algorithms are capable of detecting exact breakpoints of all variant classes by analysing the sequence alignment of the reads and the reference genome; however, they usually require longer reads than the other methods and have less power in repeat- and duplication-rich loci [62,78,79]. Assembly algorithms [83-86,115] have the most power to detect SVs of all classes at the breakpoint resolution, but assembling short sequences and inserts often result in contig/scaffold fragmentation in regions with high repeat and duplication content [89]. MEI, mobile-element insertion. Repbase is a database of repetitive elements.
Alkan C., Coe B.P., Eichler E.E. Nature Rev Genetics (2011)
-
Things to consider
Ward L.D. & Kellis M. Nat Biotechnology (2012)
-
About me
Álvaro Martínez Barrio, [email protected] /in/ambarrio @ambarrio
PhD Bioinformatics 2010
Postdoc PopGenetics/CompBiol 2014, L. Andersson + H. Ronne
Herring: Illumina, SOLiD, Moleculo, PacBio
Species Plant: 454, SOLiD, Illumina
Species Seal (~3 Gb): Illumina
Species Beetle: Illumina, PacBio
-
Pool-seq: A sequencing technique in which sequencing libraries are not prepared from DNA of a single individual or cell but from a mixture of DNA fragments originating from different individuals or cells. In the context of this Review, Pool-seq is used to describe the unbiased sequencing of the entire genome.
Coverage: The number of reads that span a given genomic position.
Sequencing libraries: Sets of fragmented DNA extracted from one or more individuals that serve as the template for subsequent sequencing.
Exome sequencing: A sequencing approach in which the complexity of the genome is reduced through hybridization to exonic sequences, which results in a higher sequence coverage of protein-coding regions.
Restriction-site-associated DNA markers: Sequence polymorphisms in close proximity to a restriction enzyme recognition site.
are subject to larger sampling variance, whereby they can result in considerable errors even when the allele frequencies in the sample have been determined at high accuracy. In other words, accurately sequencing a small population sample will still result in noisy allele frequency estimates. By contrast, Pool-seq makes use of large population samples, but not all chromosomes in the samples are analysed. The higher accuracy to cost ratio of Pool-seq arises from the fact that very few chromosomes are sequenced more than once, whereas for sequencing of individuals each chromosome is typically sequenced multiple times (5-40 times). This advantage is clearly demonstrated in FIG. 1, in which the accuracy of Pool-seq is compared with sequencing of individuals at a fixed sequencing cost (that is, assuming that the same number of sequence reads is used in each case). Although Pool-seq mostly performs better when 50 individuals are pooled, its performance is clearly superior when pooling 100 or more individuals (FIG. 1a). Additionally, the accuracy of Pool-seq relative to sequencing of individuals increases with the coverage of individual genomes (FIG. 1b).
The cost-effectiveness of Pool-seq becomes even more evident when the costs for the preparation of the sequencing libraries are considered: Pool-seq uses a single library for the entire sample, whereas sequencing of individuals requires a separate library to be prepared for each genome. As library construction constitutes ~20% of the total sequencing costs for species with moderate genome sizes, this is an important cost factor.
Comparison to reduced-representation sequencing
Sequencing individuals at a high coverage is undoubtedly the gold standard for obtaining high-quality data, but budget constraints frequently require alternatives for studying large populations. In addition to Pool-seq, other strategies have been developed for sequencing large samples (FIG. 2). Below, we compare different sequencing approaches (TABLE 1) and weigh their particular strengths and weaknesses against those of Pool-seq (TABLE 2).
In contrast to the whole-genome approach of Pool-seq, the cost savings of these alternative approaches are achieved by reducing the representation of the genome in the sequence data. Different strategies for targeting the sequencing to specific regions of the genome can be categorized into exome sequencing [13,14], high-throughput RNA sequencing (RNA-seq) [15] and methods using restriction-site-associated DNA markers [16]. Each of these methods has been combined with pooling to further reduce costs, but each approach has its particular strengths and weaknesses (see below).
Figure 1 | Cost-effectiveness of Pool-seq. The accuracy of allele frequency estimates is compared for whole-genome sequencing of pools of individuals (Pool-seq) and whole-genome sequencing of individuals using the ratio of the standard deviation (SD) of the estimated allele frequency with both methods. The same number of reads is used for both sequencing strategies. A value smaller than one indicates that Pool-seq is more accurate than sequencing of individuals. a | The influence of the pool size is shown. A larger pool size results in higher accuracy of Pool-seq, but Pool-seq still produces more accurate allele frequency estimates even for pool sizes of 50 individuals in most comparisons. Only when the number of sequenced individuals approaches the pool size does sequencing of individuals become the superior strategy. b | Influence of coverage and variation in representation of individuals in a pool is shown. With a lower coverage per individual, the advantage of Pool-seq decreases. It should be noted that with a decreasing coverage per individual, the two approaches produce very similar types of data; that is, sequencing of individuals tends to show the same limitations as Pool-seq, such as for estimating linkage disequilibrium and for distinguishing sequencing errors from low-frequency polymorphisms. Variation in the representation of individuals in the DNA pool reduces the accuracy of Pool-seq only slightly (0% (that is, all individuals are uniformly represented; orange line) and 30% (light blue line)). The graphs were generated with the PIFs software [12], ignoring sequencing errors.
[Figure 1, panels a and b. x-axis: Number of individuals sequenced separately (10 to 50); y-axis: SD pool / SD individuals (0.4 to 1.1). Curves are labelled by pool size / coverage per sequenced individual / deviation in DNA content from each individual in the pool. Panel a: 500 / 5 / 30%, 100 / 5 / 30%, 50 / 5 / 30%. Panel b: 100 / 20 / 0%, 100 / 20 / 30%, 100 / 5 / 30%, 100 / 1 / 30%.]
Why pooling?
Schlötterer C., Tobler R., Kofler R. and Nolte V. Nature Rev Genetics (2014)
-
SciLifeLab (promotion slides)
National service
Local scientific center
Director (July 2015): Olli Kallioniemi
Co-director: Kerstin Lindblad-Toh
Vision: To be an internationally leading center that develops, uses and provides access to advanced technologies for molecular biosciences with focus on health and environment.
www.scilifelab.se
2010: Strategic research initiative
2013: National resource
2015: New management and chairman
-
SciLifeLab platforms
National Genomics Infrastructure: Joakim Lundeberg, Ann-Christine Syvänen, Ulf Gyllensten
National Bioinformatics Infrastructure Sweden: Bengt Persson
Clinical Diagnostics: Lars Engstrand
Computer resources free for Swedish researchers
VR, SNIC
Ongoing merge of BILS, WABI and more; complete 2016. National, distributed
-
Know a good bioinformatician
-
NBIS - We're here for you!
-
The Bioinformatics Platform 2016
Funding: The Research Council, SciLifeLab, KAW foundation, host universities
Applied at the Research Council as continued national infrastructure 2016-2023. Decision late 2015.
Custom-tailored support, tools, training
Today ~70 FTE
-
Long-term Support: Wallenberg Advanced Bioinformatics Infrastructure
www.scilifelab.se/facilities/wabi/
Tailored solutions, high impact
Directors: Siv Andersson, Gunnar von Heijne
Managers: Björn Nystedt, Thomas Svensson
Sweden's strongest unit for analyses of large-scale genomic data (24 FTE)
Applied bioinformatics: 500 h free support/project
Variant analyses in health and disease, transcriptomics, single-cell analyses, epigenetics, metagenomics
National committee reviews and selects projects based on scientific quality
Staff in Stockholm, Uppsala, Lund, Gothenburg, Linköping, Umeå.
-
WABI personnel (2013-2014)
Johan Reimegård, Mikael Huss, Åsa Björklund, Pär Engström, Jakub Orzechowski Westholm
Estelle Proux-Wéra
Sanella Kjellqvist
Diana Ekman, Pall Olason, Anna Johansson, Marcel Martin, Alvaro Martinez Barrio, Per Unneberg
-
Know how to handle your data
-
Today: Human genome sequenced in days, towards $1000 genome
requires supercomputers for analysis and storage
Massively parallel sequencing.
-
SciLifeLab Bioinformatics Compute and Storage (UPPNEX)
1. Sample transfer
2. Data delivery
3. Analysis
Scientists
www.uppmax.uu.se/uppnex
High-performance computers and large-scale storage for bioinformatics analysis.
-
How do you work on UPPMAX computers?
Login
Submit jobs
Job queue
Job assigned
Work interactively
-
Resources
2014: Research 3328 cores; Production 768 cores; Storage ~7 PB; Long-term storage; Mosler 384 cores
2015: Research ~8000 cores; Production ~3200 cores; Redundancy 768 cores; Storage ~11 PB; Long-term storage; Mosler 384 cores
Private Cloud 1600 cores
Chipster, CanvasDB
-
Project growth
[Figure: number of active projects per year, 2004 to 2013 (y-axis ticks 100 to 400), for UPPMAX and UPPNEX.]
-
UPPNEX history
2009: 13,152 MSEK from KAW and SNIC
2012: 23.8 MSEK from KAW/SNIC
2014: 20 + 20 MSEK from KAW for WGS
SNIC receives 47.8 MSEK from VR to handle sensitive data
-
UPPMAX personnel
+3 more
-
Know how to extract your DNA
-
Olga Vinnere Pettersson (UGC) [email protected]
mp4 - http://bit.ly/1Ul7RmH
pptx - http://bit.ly/1Z6yIFH
Q&A - http://bit.ly/1I1Sb6o
-
Bacteria Fungi
Insects Plants
-
Know how to measure your assembly results
-
Just a word on N50
N50 typically refers to a con...
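The slide text is cut off here, so the following is only a minimal sketch of the commonly used definition: the N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name and the contig lengths are hypothetical and chosen for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards until half of the
    # total assembly length has been covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths in bp (illustration only).
example_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(example_contigs))
```

Because N50 depends only on the length distribution, two assemblies with the same N50 can differ a lot in correctness, which is worth keeping in mind when comparing assemblers.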
-
Know why assembling is difficult
-
Two types of assemblies
Case 1: Flycatcher (1.2 Gbp), Herring (800 Mbp), Malassezia (7 Mbp)
Case 2: Spruce (20 Gbp), Barnacle (1.4 Gbp), Wolbachia (4 Mbp)
-
Pre-assembly
Quality trimming
(Error correction)
Kmer analysis
De novo repeat library
-
Quality trimming
De Bruijn-graph assemblers are in principle sensi...
-
Reads vs kmers
1 read: L = 100 bp
Kmers: k = 21 bp
N = L - k + 1 = 100 - 21 + 1 = 80 kmers per read
Kmer coverage = Base coverage * (L - k + 1) / L
Ex: 50X * (100 - 21 + 1) / 100 = 40X (i.e. kmer coverage is 80% of base coverage)
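The conversion on this slide can be written as two small functions. This is a sketch with hypothetical function names; the numbers are the ones from the slide (100 bp reads, k = 21, 50X base coverage).

```python
def kmers_per_read(read_length, k):
    """Number of k-mers in a read of the given length: L - k + 1."""
    return read_length - k + 1


def kmer_coverage(base_coverage, read_length, k):
    """Convert base coverage to k-mer coverage: C * (L - k + 1) / L."""
    return base_coverage * kmers_per_read(read_length, k) / read_length


print(kmers_per_read(100, 21))     # 80
print(kmer_coverage(50, 100, 21))  # 40.0
```

The same relation explains why a larger k lowers the effective coverage seen by a de Bruijn graph assembler for a fixed amount of sequence data.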
-
Kmer analyses
How to count kmers?
Compute the frequency of each kmer in the dataset (e.g. Jellyfish --both-strands)
Note: RAM-intense!
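To make concrete what a kmer counter computes, here is a minimal pure-Python sketch. It is not how Jellyfish is implemented and would be far too slow and memory-hungry for real data. Counting a kmer together with its reverse complement (the canonical kmer) is an assumption here about the intent of the --both-strands option named on the slide; the toy reads are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def kmer_histogram(counts):
    """Map each k-mer depth to the number of distinct k-mers seen at that depth."""
    return Counter(counts.values())


# Hypothetical toy reads (illustration only).
toy_reads = ["ACGTAC", "GTACGT"]
toy_counts = count_kmers(toy_reads, 3)
print(toy_counts)
print(kmer_histogram(toy_counts))
```

The histogram (depth versus number of distinct kmers) is the input for the genome size, repeat and heterozygosity estimates discussed on the following slides.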
-
Digging into the kmers
Genome size
Remove low-copy kmers
Iden...
-
Repeats: first shot
The nb of dis...
-
Heterozygosity and ploidy ... and humans are easy.
Bacteria, archaea, fungi, some plants
Most animals, some plants
Many plants
Also: Heterozygosity is generally very low in mammals; most other species are much harder
-
Heterozygosity with kmer graphs
Double peak in the kmer histogram; clear indication of heterozygosity
Not entirely easy to quantify (although attempts have been made)
-
A word on quality filtering
Light QC filter
Hard QC filter
A word of precaution on quality filtering!
-
Fig 4.1 17-mer depth distribution
[Figure: 17-mer frequency histogram. x-axis: Depth (X), 0 to 100; y-axis: Percentage (%), 0 to 2.]
Table 4.2 17-mer data statistics
K    K-mer_NO.        Peak_depth   Genome Size    Used Bases       Used Reads    X
17   22,425,038,045   32           700,782,438    26,827,492,429   275,153,399   38.28
A total of 26.82 Gb of data was retained for the 17-mer analysis. The 17-mer frequency distribution derived from the sequenced reads is plotted in Fig 4.1. The peak of the 17-mer distribution is at about 32 and the total K-mer count is 22,425,038,045, so the genome size can be estimated (by the formula Genome Size = K-mer_num / Peak_depth) as 700.78 Mb.
If the heterozygous rate is high, a small peak appears at 1/2 of the peak depth, so this K-mer analysis can be used to roughly determine the heterozygous rate of a given genome.
The distribution can also be used to assess the repeat content of the genome: if the genome contains a high proportion of repeats, the distribution displays a fat tail, which indicates that a larger than expected proportion of the genome has a high sequencing depth, which may be due to sequence similarity.
Conclusion: Genome size is 700.78 Mb, and the heterozygous rate in this genome is too high to do whole genome shotgun sequencing and assembly.
4.3 Estimation of heterozygous rate
We simulated the herring genome with different heterozygosis rates, ran the 17-mer analysis on each, and obtained Fig 4.2.
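The arithmetic behind Table 4.2 can be reproduced in a few lines. This is a sketch using the formula quoted in the report (Genome Size = K-mer_num / Peak_depth) and the numbers from the table; the function name is hypothetical, and the integer division is an assumption made to match the whole-number genome size shown in the table.

```python
def estimate_genome_size(total_kmers, peak_depth):
    """Genome size estimate: total k-mer count divided by the peak k-mer depth."""
    return total_kmers // peak_depth


# Values taken from Table 4.2.
total_kmers = 22_425_038_045
peak_depth = 32
used_bases = 26_827_492_429

genome_size = estimate_genome_size(total_kmers, peak_depth)
base_coverage = used_bases / genome_size

print(genome_size)              # 700782438
print(round(base_coverage, 2))  # 38.28
```

The estimate is sensitive to where the peak is read off the histogram: moving the peak depth by one unit changes the result by roughly 3% at a depth of 32.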
Heterozygosity with kmer graphs
-
Fig 4.2 Hybrid effect on K-mer distribution.
[Figure: 17-mer depth distributions. x-axis: Depth (X), 0 to 80; y-axis: Percentage, 0 to 2; curves: H_0.01067, Epi, H_0.012, H_0.015.]
The X axis is the depth of the 17-mers and the Y axis is the ratio of 17-mers. Epi is the 17-mer curve of herring. H_0.01067 means that the heterozygosis rate is 1.067%, H_0.012 is 1.2% and H_0.015 is 1.5%.
From this figure, we can see that as the heterozygosis rate increases, the sub-peak becomes more apparent at half of the expected K-mer depth on the X axis. We conclude that the heterozygosis rate of the herring genome is about 1.5%.
Heterozygosity with kmer graphs
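As a rough illustration of what "a sub-peak at half of the expected K-mer depth" means, the sketch below looks for a second local maximum near half the depth of the tallest peak in a kmer histogram. This is not the method used in the report (which compared against simulated genomes); the function names, the tolerance, the minimum depth cut-off and the synthetic histograms are all assumptions for illustration. It also assumes the homozygous peak is the tallest one, which fails for very heterozygous genomes.

```python
import math


def local_maxima(hist, min_depth=5):
    """Depths whose value exceeds both neighbours; depths below min_depth
    are ignored because they are dominated by sequencing-error k-mers."""
    depths = sorted(d for d in hist if d >= min_depth)
    peaks = []
    for prev, cur, nxt in zip(depths, depths[1:], depths[2:]):
        if hist[cur] > hist[prev] and hist[cur] > hist[nxt]:
            peaks.append(cur)
    return peaks


def has_half_depth_peak(hist, tolerance=0.2, min_depth=5):
    """True if a secondary peak lies within tolerance of half the main peak depth."""
    peaks = local_maxima(hist, min_depth)
    if not peaks:
        return False
    main = max(peaks, key=lambda d: hist[d])
    target = main / 2
    return any(abs(p - target) <= tolerance * target for p in peaks if p != main)


def bump(depth, centre, width, height):
    """Gaussian-shaped bump used to build synthetic histograms."""
    return height * math.exp(-((depth - centre) / width) ** 2)


# Synthetic histograms (illustration only): main peak at depth 32,
# with and without a heterozygous sub-peak at depth 16.
heterozygous = {d: bump(d, 32, 6, 1.8) + bump(d, 16, 4, 0.6) for d in range(1, 81)}
homozygous = {d: bump(d, 32, 6, 1.8) for d in range(1, 81)}

print(has_half_depth_peak(heterozygous))
print(has_half_depth_peak(homozygous))
```

Real histograms are noisy, so in practice some smoothing would be needed before looking for local maxima.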
-
The heterozygosity was estimated to be 1.5%
Heterozygosity with kmer graphs
-
Why repeats destroy assemblies
Genome assembly - things to think about
-
Repeat library and repeat quantification
Create a de novo repeat library
  Run a low-coverage (e.g. 0.1X) assembly (e.g. RepeatExplorer or Trinity)
  Filter contaminants and mito/chloro
  [Make non-redundant (e.g. Cdhit)]
Quantify the (high) repeat content by an independent subset of reads
  - Mapping (e.g. bwa), or
  - Mask with RepeatMasker
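Two pieces of arithmetic sit behind the steps above: how many reads to subsample for a low-coverage (e.g. 0.1X) assembly, and what fraction of an independent read subset is repeat. A minimal sketch, with hypothetical function names; the example combines the 800 Mbp herring genome size and the 100 bp read length mentioned elsewhere in the talk purely for illustration, and the masked-base counts are made up.

```python
def reads_for_coverage(genome_size_bp, coverage, read_length_bp):
    """Number of reads to subsample to reach a target coverage."""
    return round(genome_size_bp * coverage / read_length_bp)


def repeat_fraction(repeat_bases, total_bases):
    """Fraction of bases in the read subset that map to, or are masked by, the repeat library."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    return repeat_bases / total_bases


# 0.1X of an 800 Mbp genome with 100 bp reads.
print(reads_for_coverage(800_000_000, 0.1, 100))  # 800000

# Hypothetical result: 30 Mbp masked out of 100 Mbp of reads.
print(repeat_fraction(30_000_000, 100_000_000))   # 0.3
```

Using a read subset that was not part of the low-coverage assembly keeps the quantification independent of the library construction, as the slide recommends.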
-
Repeat library and repeat quantification
A real example: 5 Mbp mitochondrion in spruce
[Figure: Coverage vs. %GC]
-
Repeat library from low coverage data
R R R R R
Overlaps?
Sparse seq data
-
Repeat library from low coverage data
R R R R R
Overlaps?
Assembled con...
-
Repeat library from low coverage data
Quan...
-
Classifying repeats
LTR Gypsy/Copia
LINE/SINE
DNA elements
This is very tricky
Classifying the repeat library directly
  RepeatMasker
  Repeat protein domain search (http://www.repeatmasker.org/cgi-bin/RepeatProteinMaskRequest)
Problems
  No close homologs in databases
  Rapid evolution of repeats (like transposable elements, TEs)
  Non-autonomous TEs do not contain proteins
Solutions
  Fetch intact ORFs from hits in assembly
  Extend assembly matches and get more complete elements
  Check match alignment profiles in assembly (LINEs conserved at 3' end but not at 5' ...)
=> Often slow, manual, species-specific solutions
-
Know the technology bias
-
Coverage distribution
[Figure: histogram of sequence coverage. x-axis: Coverage (0 to 100); y-axis: Number of Mb's in hg19 (0 to 350); series: 454, Illumina, SOLiD; average coverage marked.]
-
Clark M.J., et al. Nat. Biotech (2011)
Performance comparison of exome DNA sequencing technologies. (Mike Snyder's lab)
-
Ning L., et al. Scientific Reports (2015)
-
Know the assembly algorithms
-
[Figure: two assembly workflows, each in three phases: 1 pre-processing, 2 assembly, 3 finishing / polishing.]
Short Reads (Illumina) - graph assembly: adapter removal, quality trimming, error correction, de Bruijn or string graph construction, contigs, scaffolding (read pairs), read mapping
Long Reads (PacBio) - HGAP assembly: read self-correction, overlap-layout-consensus assembly, consensus calling with quiver, assembled genome
The overall assembly strategy is the same, but the data and tools are fundamentally different.
-
http://www.lucigen.com/NxSeq-Long-Mate-Pair-Library-Kit/
-
Many instruments, too many solutions
Assembler name   Algorithm        Input
Arachne          OLC              Sanger
CAP3             OLC              Sanger
TIGR             Greedy           Sanger
Newbler          OLC              454/Roche
Edena            OLC              Illumina
SGA              OLC              Illumina
MaSuRCA          De Bruijn/OLC    Illumina
Velvet           De Bruijn        Illumina
ALLPATHS         De Bruijn        Illumina/PacBio
ABySS            De Bruijn        Illumina
SOAPdenovo       De Bruijn        Illumina
CLC              De Bruijn        Illumina/454
CABOG            OLC              Hybrid
No easy way to determine the best assembly/assembler
Implemented heuristics are the key issue
Choice of approach depends on the data being assembled
Efforts are currently ongoing to establish best practices
Assemblathons and GAGE to evaluate existing solutions
-
OLC vs. de Bruijn
-
OLC
Pros: Can use longer reads properly
Cons: Time consuming, high memory requirements
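To make the OLC cost concrete, here is a minimal sketch of the overlap step only (layout and consensus are left out). The function names, the toy reads and the `min_len` cutoff are assumptions for illustration, not part of any assembler listed in the talk. The all-vs-all comparison is what makes OLC time and memory hungry as the number of reads grows.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best exact overlap is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def overlap_graph(reads, min_len=3):
    """All-vs-all overlap step: maps (i, j) to the overlap length of read i -> read j."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = suffix_prefix_overlap(a, b, min_len)
                if k:
                    edges[(i, j)] = k
    return edges


# Toy, error-free reads (hypothetical data)
reads = ["ATCGTTCC", "GTTCCGAG", "CGAGTCTC"]
edges = overlap_graph(reads)
print(edges)
```

Real OLC assemblers use indexed, error-tolerant overlap detection rather than exact string comparison; the quadratic loop here is only meant to show where the cost comes from.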
-
de Bruijn
-
Generate assembly via de Bruijn
Martin & Wang, Nat. Rev. Genet. (2011)
-
de Bruijn
Pros: Computationally efficient, can work with large-coverage short-read datasets
Cons: Sensitive to sequence errors, the connection between assembly and reads is lost, does not work so well with longer reads
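A minimal sketch of the de Bruijn idea, assuming error-free toy reads and a hypothetical k of 4: reads are cut into k-mers, each k-mer becomes an edge between its two (k-1)-mers, and a contig is spelled by walking unambiguous edges. The helper names are illustrative only. Note how the reads themselves are discarded once the k-mers are in the graph, which is the "connection between assembly and reads is lost" point above.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Spell a sequence by following edges while each node has exactly one successor."""
    seq, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles (e.g. repeats)
            break
        seq += nxt[-1]
        seen.add(nxt)
        node = nxt
    return seq


# Toy, error-free reads (hypothetical data)
reads = ["ATCGTT", "CGTTCC", "TTCCGA"]
graph = de_bruijn(reads, k=4)
contig = walk(graph, "ATC")
print(contig)
```

A single wrong base in a read adds up to k erroneous k-mers to the graph, which is why error correction before graph construction matters so much for this approach.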
-
Some recommendations
Large eukaryote genome, Illumina data: ALLPATHS-LG (needs specific libraries), SOAPdenovo, SGA, MaSuRCA, DISCOVAR
Large eukaryote genome, additional longer reads: MaSuRCA, Newbler, CABOG
Small eukaryote or prokaryote genome, Illumina data: SPAdes, MaSuRCA, SOAPdenovo, ABySS, Velvet, DISCOVAR
Small eukaryote or prokaryote genome, mixed data: MIRA, SPAdes, MaSuRCA, Newbler
Need to run in parallel: ABySS, Ray
Amplified data (Single Cell Genomics): SPAdes
-
Standard contiguity metrics
Just a word on N50
N50 typically refers to a con[...]
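As a worked example of the metric, here is a small sketch using the common definition: N50 is the length of the contig at which the running sum of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The contig lengths below are made up for illustration.

```python
def n50(lengths):
    """Contig length at which the cumulative sum (longest first) reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


# Hypothetical contig lengths, total = 300
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(contig_lengths))  # 80 + 70 = 150 reaches half of 300, so N50 is 70
```

Because N50 only looks at lengths, wrongly joined contigs raise it just as well as correct joins do, so it says nothing about correctness on its own.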
-
The devil is in the repeats
Repeats and short reads (F. Vezzi)
Short reads make everything harder!!
[Figure: regions C, R, A, B] Mathematically best result:
-
Repeat errors
Overlapping non-identical reads
Collapsed repeats and chimeras
Wrong contig order
Inversions
-
ATCGGGTATATAG-CCTA
||||||| || || ||||
ATCGGGTGTACAGCCCTA
[Figure: repeat copies A and B merged into a single copy A&B]
Collapsible repeat errors (worst!)
-
Know how to patch gaps/finalize
-
Gaps
-
CCS vs CLR
-
other options for assembling PacBio reads
https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads
-
Hybrid assemblies
-
Gaps
-
PacBio data is noisy
PacBio data cannot (currently) be assembled in its raw state
Several strategies exist for correcting reads prior to assembly
Correction without complementary technology used to be difficult; until recently, it was limited by computational power and SMRT cell throughput
Koren & Phillippy, Curr. Opin. Microbiol. (2014)
-
Hybrid assemblers (for PacBio)
other options for assembling PacBio reads
-
Hybrid assemblers
other options for assembling PacBio reads
Zimin A.V., Marçais G., Puiu D., Roberts M., Salzberg S.L., Yorke J.A. Bioinformatics (2013)
-
Pure PacBio
-
Pure PacBio
other options for assembling PacBio reads
-
Finishing/Polishing (Olli-Pekka)
Quiver isn't perfect: using Pilon to polish remaining indels
Pilon makes use of short-read mapping to identify potential indels, SNPs, ambiguous bases and local misassemblies
$ java -Xmx16G -jar path/to/pilon-1.8.jar \
    --genome path/to/fasta --unpaired path/to/mapping.bam \
    --output sample_name --changes --variant --tracks \
    --mindepth 100
Pilon removed 128 remaining indels in a 3.8 Mbp genome despite Quiver calling a > QV 55 consensus
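To put those numbers in perspective, here is a back-of-the-envelope sketch using the standard Phred relation QV = -10 * log10(error rate). It assumes, purely for illustration, that all 128 indels were real consensus errors and that they were the only errors left; the variable names are made up.

```python
import math


def phred(error_rate: float) -> float:
    """Phred-scaled quality for a per-base error rate."""
    return -10 * math.log10(error_rate)


def error_rate(qv: float) -> float:
    """Per-base error rate implied by a Phred quality value."""
    return 10 ** (-qv / 10)


genome_size = 3_800_000   # 3.8 Mbp genome from the slide
indels = 128              # indels corrected by Pilon

observed_qv = phred(indels / genome_size)
expected_errors_at_qv55 = error_rate(55) * genome_size

print(f"QV implied by 128 errors in 3.8 Mbp: {observed_qv:.1f}")
print(f"Errors expected at QV 55: {expected_errors_at_qv55:.0f}")
```

Under these assumptions a true QV 55 consensus would carry roughly a dozen errors in a genome of this size, while 128 indels correspond to a QV in the mid 40s, which is why a second polishing pass with short reads can still pay off.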
-
LETTER, Nature, Vol 527, p. 508, 26 November 2015, doi:10.1038/nature15714
Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum
Robert VanBuren, Doug Bryant, Patrick P. Edger, Haibao Tang, Diane Burgess, Dinakar Challabathula, Kristi Spittle, Richard Hall, Jenny Gu, Eric Lyons, Michael Freeling, Dorothea Bartels, Boudewijn Ten Hallers, Alex Hastie, Todd P. Michael & Todd C. Mockler
Plant genomes, and eukaryotic genomes in general, are typically repetitive, polyploid and heterozygous, which complicates genome assembly [1]. The short read lengths of early Sanger and current next-generation sequencing platforms hinder assembly through complex repeat regions, and many draft and reference genomes are fragmented, lacking skewed GC and repetitive intergenic sequences, which are gaining importance due to projects like the Encyclopedia of DNA Elements (ENCODE) [2]. Here we report the whole-genome sequencing and assembly of the desiccation-tolerant grass Oropetium thomaeum. Using only single-molecule real-time sequencing, which generates long (>16 kilobases) reads with random errors, we assembled 99% (244 megabases) of the Oropetium genome into 625 contigs with an N50 length of 2.4 megabases. Oropetium is an example of a near-complete draft genome which includes gapless coverage over gene space as well as intergenic sequences such as centromeres, telomeres, transposable elements and rRNA clusters that are typically unassembled in draft genomes. Oropetium has 28,466 protein-coding genes and 43% repeat sequences, yet with 30% more compact euchromatic regions it is the smallest known grass genome. The Oropetium genome demonstrates the utility of single-molecule real-time sequencing for assembling high-quality plant and other eukaryotic genomes, and serves as a valuable resource for the plant comparative genomics community.
The genomes of Arabidopsis [3], rice [4], poplar, grape and Sorghum [5] were first sequenced using high-quality and reiterative Sanger-based approaches producing a series of gold standard reference genomes. The advent of next-generation sequencing (NGS) technologies reduced costs of sequencing substantially, which has enabled sequencing of over 100 plant genomes [1]. The quality of plant genome assemblies depends on genome size, ploidy, heterozygosity and sequence coverage, but most NGS-based genomes have on the order of tens of thousands of short contigs distributed in thousands of scaffolds. The short read lengths of NGS, inherent biases and non-random sequencing errors have resulted in highly fragmented draft genome assemblies that are not complete, which means they are missing biologically meaningful sequences including entire genes, regulatory regions, transposable elements, centromeres, telomeres and haplotype-specific structural variations. It is becoming clear from ENCODE projects that complete genomes are needed to better understand the importance of the non-coding regions of genomes [2].
More than 40% of calories consumed by humans are derived from grasses, and the grass family (Poaceae) is arguably the most important plant family with regard to global food security [6]. The size and complexity of most grass genomes has challenged progress in gene discovery and comparative genomics, although draft genomes are now available for most agriculturally important grasses [1]. The largest genome assemblies, such as maize (2,300 megabases (Mb)) [7], barley (5,100 Mb) [8] and wheat (hexaploid, 17,000 Mb) [9] are highly fragmented as a result of the inability of current sequencing technologies to span complex repeat regions. Near-finished reference genomes are available for rice [4], Sorghum [5] and Brachypodium [10], but more high-quality grass genomes are needed for comparative genomics and gene discovery. Here we present the near-complete draft genome of the grass Oropetium thomaeum, the first high-quality reference genome from the Chloridoideae subfamily. The draft genome is near complete because we were able to sequence through complex repeat regions that are unassembled in most draft genomes. Oropetium has the smallest known grass genome at 245 Mb and is also a resurrection plant that can survive the extreme water stress such as loss of >95% of cellular water (Fig. 1) [11].
Single-molecule real-time (SMRT) sequencing (Pacific Biosciences) produces long and unbiased sequences, which enables assembly of complex repeat structures and GC- and AT-rich regions that are often unassembled or highly fragmented in NGS-based draft genomes. We generated ~72× sequencing coverage of the Oropetium genome using 32 SMRT cells on the PacBio RS II platform (which is equivalent to
-
Know how to annotate
-
Annotation (Jarkko)
BILS assembly and annotation service
Henrik Lantz: Team leader
Mahesh Panchal: Assembly
Jacques Dainat: Annotation
Martin Norling: Assembly
Lucile Soler: Annotation
5 PhDs, all in Uppsala
Annotation 2 years, assembly 1 year
Not driving own research, focusing on support
80 h of free support to all projects, submitted by customer
Dedicated compute cluster for annotation, ~160 cores
Assemblies run on shared cluster, ~3200 cores
All organisms, all types of data
Close contact with sequencing facilities
-
Annotation/Assembly technology
Assembly: Perl/Make pipeline
  Pre-assembly: Quality control, kmer analyses
  Assembly: Different assembly programs
  Assembly validation: FRCbam, Quast, own tools
Annotation: Maker-MPI
  proteins, RNA-seq
  Refinement scripts
  Functional annotation: Blast, Synteny
-
Know how to validate
-
Assembly validation
Assembly validation: is it important?
Sometimes, easy questions are the most difficult:
  Is my de novo assembly correct?
  What assembler do I need to use?
  I just used all the possible assemblers one can think of. How do I pick one now?
  Does my assembly contain genes?
  Is my assembly good enough to perform gene annotation?
-
Assembly validation
Assembly validation is extremely difficult
Too often only connectivity measures are used
There is no real solution, only a set of best practices that one can follow
Recently a lot of attention on assembly validation:
-
Evaluating assemblies with a reference
Counting errors is not always possible:
  Reference almost always absent.
  Error types are not weighted accordingly.
Visualization is useful, however:
  No automation
  Does not scale on large genomes
WOW. Looks like it is difficult even with the answer
-
Evaluating assemblies without a reference
Statistics (N50, etc.)
Congruency with raw sequencing data:
  Alignments
  QAtools
  FRCbam
  REAPR
Gene space:
  CEGMA
  reference genes
  transcriptome
There is no real recipe, or a tool. We can only suggest some best practices.
-
Your reads are often the best source to validate your assemblies
Check again your insert sizes (PicardTools, http://picard.sourceforge.net)
Plotting coverage x %GC x length
Post assembly: am I on the right track?
Check library insert sizes (use PicardTools, http://picard.sourceforge.net/)
[Figure: insert size histograms (Insert Size vs. Count; FR, RF, TANDEM) for All_Reads in MP_on_masurca_sorted.bam (MP), PE_on_masurca_sorted.bam (PE) and 7_130425_AD1YUEACXX_P469_101_index12_trimmedtoassembly.abyss.scaf_onlyAligned.bam]
Failed MP or bad assembly?
Plot cov vs %GC vs length
Look at the plots and at the tables; duplication rate is an important measure.
You need to check if the plot(s) coincide with what you expect.
[Figure: plotting coverage and GC content; panels show coverage vs. GC, coverage frequency, and contig length (kbp) vs. coverage; clusters labelled: your genome, mitochondrion, contaminations]
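For the coverage x %GC x length plot, the GC and length columns can be taken straight from the assembly FASTA. Below is a minimal sketch with a made-up two-contig FASTA; the function names are illustrative. Per-contig coverage is not computed here, since it has to come from mapping the reads back to the assembly.

```python
def parse_fasta(text):
    """Parse FASTA text into {name: sequence}; the name is the first word of the header."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (N and other symbols are ignored)."""
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


# Hypothetical toy assembly
fasta = """>contig_1
ATGCGC
GGCC
>contig_2
ATATATAT
"""

table = [(name, len(seq), gc_fraction(seq)) for name, seq in parse_fasta(fasta).items()]
for name, length, gc in table:
    print(f"{name}\t{length}\t{gc:.2f}")
```

Contigs whose GC or coverage falls far from the main cloud are the candidates for organelle or contaminant sequence, as in the spruce mitochondrion example earlier in the talk.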
-
Data congruency
Idea: Map read pairs back to the assembly and look for discrepancies like:
  no read coverage
  no span coverage
  too long/short pair distances
Reads can be aligned back to the assembly to identify suspicious features.
But what do we do with these features?
FRCbam (Vezzi et al. 2012)
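The discrepancy checks listed above can be sketched on already-extracted numbers. This is a simplified illustration, not how FRCbam itself is implemented: it assumes a per-base depth list and a list of observed pair distances have been pulled from the alignments beforehand, and the expected distance and tolerance are made-up values.

```python
def zero_coverage_regions(depth):
    """Half-open (start, end) intervals of the assembly with no read coverage."""
    regions, start = [], None
    for pos, d in enumerate(depth):
        if d == 0 and start is None:
            start = pos
        elif d != 0 and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


def suspicious_pairs(distances, expected, tolerance):
    """Indices of read pairs whose distance is too long or too short."""
    return [i for i, d in enumerate(distances) if abs(d - expected) > tolerance]


# Hypothetical per-base depth along one contig and observed pair distances
depth = [5, 4, 0, 0, 0, 6, 7, 0]
pair_distances = [310, 295, 1200, 40, 305]

gaps = zero_coverage_regions(depth)
odd_pairs = suspicious_pairs(pair_distances, expected=300, tolerance=100)
print(gaps, odd_pairs)
```

Counting such flagged regions per contig gives the kind of per-contig feature counts that a feature response curve is built from.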
-
Data congruency
FRCbam (Vezzi et al. 2012)
Features
4 coverage related features:
  LOW_COV_PE, HIGH_COV_PE, LOW_NORM_COV_PE, and HIGH_NORM_COV_PE
4 features for compression/expansion events (CE stats):
  COMPR_PE, STRECH_PE, COMPR_MP, and STRECH_MP
6 features on suspicious pair/mate orientations:
  HIGH_SINGLE_PE, and HIGH_SINGLE_MP
  HIGH_SPAN_PE, and HIGH_SPAN_MP
  HIGH_OUTIE_PE, and HIGH_OUTIE_MP
[Figure: regions A, B, C with repeat copies R1 and R2 (A R1 B R2 C) versus a merged copy R1,2]
Reads can be aligned back to the assembly to identify suspicious features.
-
FRCurve
FRCbam predicted Assemblathon 2 outcome
FRCbam (Vezzi et al. 2012)
The Feature Response Curve (FRCurve) characterizes the sensitivity (coverage) of the sequence assembler as a function of its discrimination threshold (number of features).
Feature Response Curve:
  Overcomes limits of standard indicators (i.e. N50)
  Captures trade-off between quality and contiguity
  Deeply connected to ROC curves
  Features can be used to identify problematic regions
  Single features can be plotted to identify assembler-specific bias
[Figure: Feature Space rhody TOTAL; approximate coverage (%) vs. feature threshold (0-8,000) for SGA, Ray, CLC, SOAPdenovo, ALLPATHS-LG PB, ABySS, MSRA-CA, CABOG PB, CABOG, VELVET, ALLPATHS-LG]
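A simplified sketch of how one point of such a curve can be computed, assuming each contig already has a length and a feature count: contigs are taken from longest to shortest for as long as the accumulated number of features stays within the threshold, and the summed length is reported as approximate coverage of the genome. The function name, the contig values and the genome size are made up for illustration and this is not the FRCbam code.

```python
def frc_point(contigs, threshold, genome_size):
    """Approximate coverage (%) reached without exceeding `threshold` features.

    `contigs` is a list of (length, n_features) tuples.
    """
    covered, features = 0, 0
    for length, n_feat in sorted(contigs, key=lambda c: c[0], reverse=True):
        if features + n_feat > threshold:
            break
        features += n_feat
        covered += length
    return 100.0 * covered / genome_size


# Hypothetical assembly: (contig length, number of suspicious features)
contigs = [(500, 2), (300, 0), (150, 3), (50, 1)]
genome_size = 1000

curve = [(t, frc_point(contigs, t, genome_size)) for t in (0, 2, 5, 6)]
print(curve)
```

An assembler whose curve rises steeply reaches high coverage with few suspicious features, which is the quality versus contiguity trade-off the slide refers to.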
-
Features and PCA
[Figure: two PCA plots (PCA1 vs. PCA2) of assembly features, points labelled by organism: bifido, clap, clap19, copro, ecoli, egg, entero, eubac, fragilis, fuso7, fusonuke, kleb, rhody, staphylocossus, strep, swig, tim]
Assembled 18 bacterial genomes with 11 assemblers (Illumina + PacBio data)
PCA performed on features:
  Assemblies of the same organism (family) tend to cluster;
  No clear difference when using PacBio data;
-
REAPR (Hunt et al. 2013)
Uses the same principle as FRCurve:
  Identifies suspicious/erroneous positions
  Breaks assemblies in suspicious positions
The broken assembly is more fragmented but hopefully more correct (REAPR cannot make things worse)
-
Conserved core (species) gene space
Gene space
CEGMA (http://korflab.ucdavis.edu/datasets/cegma/)
HMMs for 248 core eukaryotic genes aligned to your assembly to assess completeness of gene space
  complete: 70% aligned
  partial: 30% aligned
Similar idea based on aa or nt alignments of:
  Golden standard genes from own species
  Transcriptome assembly
  Reference species protein set
Use e.g. GSNAP/BLAT (nt), exonerate/SCIPIO (aa)
-
Other external validation methods
Restriction map: representation of the cut sites on a given DNA molecule to provide spatial information of genetic loci
Optical maps can be used to check assembly correctness
Long PacBio reads can be used as well
-
Other external validation methods
De novo assembly (Illumina, PGM): set of unordered and not oriented contigs
[Figure: DNA sequence contigs placed against an optical map]
De novo reconstructs parts missing in the reference strain
Correctly assembles long tandem repeats
-
Don't panic. And don't rush
Keeping up with the development can be stressful, so you need to stay calm
Choose quality before quantity
Know your biological system, so you know what to expect
Combine sequencing with other data
Share knowledge and be nice to your bioinformatics friends
For each conclusion, ask yourself if it can be an artefact due to:
  Incomplete assembly
  Repeats
  Indels
  Coverage bias
  Divergent sequences (mapping)
-
Know that your final assembly will be incomplete
-
Things that are not there
[Figure: chromosome ideograms 1-22 and X (scale bar 100 Mb) with STR density heatmap (low to high) and markers for closed gaps, inversions and complex events]
Extended Data Figure 3 | Genome distribution of closed gaps and insertions. Chromosome ideogram heatmap depicts the normalized density of inserted CHM1 base pairs per 5-Mb bin with a strong bias noted near the end of most chromosomes. Locations of structural variants and closed gaps are given by coloured diamonds to the left of each chromosome: closed gap sequences (red), inversions (green), and complex events (blue).
Chaisson M.J.P. et al. Nature (2014)
LETTERdoi:10.1038/nature13907
Resolving the complexity of the human genomeusing single-molecule sequencingMark J. P. Chaisson1, John Huddleston1,2, Megan Y. Dennis1, Peter H. Sudmant1, Maika Malig1, Fereydoun Hormozdiari1,Francesca Antonacci3, Urvashi Surti4, Richard Sandstrom1, Matthew Boitano5, Jane M. Landolin5, John A. Stamatoyannopoulos1,Michael W. Hunkapiller5, Jonas Korlach5 & Evan E. Eichler1,2
The human genome is arguably the most complete mammalianreference assembly13, yet more than 160 euchromatic gaps remain46
and aspects of its structural variation remain poorly understood tenyears after its completion79. To identify missing sequence and gen-etic variation, here we sequence and analyse a haploid human genome(CHM1) using single-molecule, real-time DNA sequencing10. We closeor extend 55% of the remaining interstitial gaps in the human GRCh37reference genome78% of which carried long runs of degenerateshort tandem repeats, often several kilobases in length, embeddedwithin (G1C)-rich genomic regions. We resolve the complete sequenceof 26,079 euchromatic structural variants at the base-pair level, includ-ing inversions, complex insertions and long tracts of tandem repeats.Most have not been previously reported, with the greatest increasesin sensitivity occurring for events less than 5 kilobases in size. Com-pared to the human reference, we find a significant insertional bias(3:1) in regions corresponding to complex insertions and long shorttandem repeats. Our results suggest a greater complexity of the humangenome in the form of variation of longer and more complex repet-itive DNA that can now be largely resolved with the application ofthis longer-read sequencing technology.
Data generated by single-molecule, real-time (SMRT) sequencingtechnology differ drastically from most sequencing platforms becausenative DNA is sequenced without cloning or amplification, and readlengths typically exceed 5 kilobases (kb). Despite overall lower individualread accuracy (,85%), longer read length facilitates high confidencemapping across a greater percentage of the genome11,12.We generated,40-fold sequence coverage from a human CHM1 hydatidiform moleusing long-read SMRT sequence technology (average mapped readlength 5 5.8 kb; Supplementary Table 1). We selected a complete hyda-tidiform mole to sequence because it is haploid, lacking allelic variation,and provides higher effective sequence coverage. We aligned 93.8% ofall sequence reads to the human reference genome (GRCh37) using amodified version of BLASR11 (Supplementary Information) and gener-ated local assemblies of the mapped reads using Celera13 and Quiver14,the latter of which leverages estimates of insertion, deletion and substi-tution probabilities to determine consensus sequences accurately. Wecompared the consensus sequences of regions with previously sequencedand assembled large-insert bacterial artificial chromosome (BAC) clonesgenerated from CHM1tert (ref. 15). The comparison shows a consensussequencing concordance of .99.97% (phred quality 5 37.5), with 72%of the errors confined to indels within homopolymer stretches (Sup-plementary Table 3).
We initially assessed whether the mapped reads could facilitate clos-ure of any of the 164 interstitial euchromatic gaps within the humanreference genome (GRCh37). We extended into gap regions using areiterative map-and-assemble strategy, in which SMRT whole-genomesequencing (WGS) reads mapping to each edge of a gap were assembledinto a new high-quality consensus, which, in turn, served as a template
for recruiting additional sequence reads for assembly (SupplementaryInformation). Using this approach, we closed 50 gaps and extended into40 others (60 boundaries), adding 398 kb and 721 kb of novel sequenceto the genome, respectively (Supplementary Table 4). The closed gapsin the human genome were enriched for simple repeats, long tandemrepeats, and high (G1C) content (Fig. 1) but also included novel exons(Supplementary Table 20) and putative regulatory sequences based onDNase I hypersensitivity and chromatin immunoprecipitation followedby high-throughput DNA sequencing (ChIP-seq) analysis (Supplemen-tary Information). We identified a significant 15-fold enrichment of shorttandem repeats (STRs) when compared to a random sample (P , 0.00001)(Fig. 1a). A total of 78% (39 out of 50) of the closed gap sequences werecomposed of 10% or more of STRs. The STRs were frequently embeddedin longer, more complex, tandem arrays of degenerate repeats reach-ing up to 8,000 bp in length (Extended Data Fig. 1ac), some of whichbore resemblance to sequences known to be toxic to Escherichia coli16.Because most human reference sequences17,18 have been derived fromclones propagated in E. coli, it is perhaps not surprising that the appli-cation of a long-read sequence technology to uncloned DNA wouldresolve such gaps. Moreover, the length and complex degeneracy of theseSTRs embedded within (G1C)-rich DNA probably thwarted efforts tofollow up most of these by PCR amplification and sequencing.
Next, we developed a computational pipeline (Extended Data Fig. 2)to characterize structural variation systematically (structural variationdefined here as differences $50 bp in length, including deletions, dupli-cations, insertions and inversions7). Structural variants were discoveredby mapping SMRT sequencing reads to the human reference genome11
1 Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA. 2 Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA. 3 Dipartimento di Biologia, Università degli Studi di Bari Aldo Moro, Bari 70125, Italy. 4 Department of Pathology, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, USA. 5 Pacific Biosciences of California, Inc., Menlo Park, California 94025, USA.
[Figure 1 plot. Panel a: proportion of region with simple repeats (axis 0.00–1.00), Gaps versus Reference. Panel b: (G+C) content (axis 0–100) for reference flank, gap closure and tandem repeat; category labels: Gap only, Tandem repeats, Gap without tandem repeats, Sampled reference. P values shown on the plot: P = 0.02712, P = 0.00003, P < 0.00001, P < 2.2 × 10^-16.]
Figure 1 | Sequence content of gap closures. a, Gap closures are enriched for simple repeats compared to equivalently sized regions randomly sampled from GRCh37. b, Human genome gaps typically consist of (G+C)-rich sequence (yellow) flanking complex (A+T)-rich STRs (green) (empirical P value; Supplementary Information). Red line indicates genomic (G+C) content.
Nature (2014). © 2014 Macmillan Publishers Limited. All rights reserved.
-
Things that are not there
Steinberg K.M. et al. Genome Research (2014)
reference assembly, many groups have described shortcomings of this resource, including remaining gaps, single-nucleotide errors, or gross misassembly due to complex haplotypic variation (Eichler et al. 2004; Doggett et al. 2006; Kidd et al. 2010; Chen and Butte 2011; The 1000 Genomes Project Consortium 2012). Both gaps and misassembled regions often arise because the DNA sequence used for the assembly was from multiple diploid sources containing complex structural variation. Because such loci often contain medically relevant gene families, it is important to resolve variation at these sites, as the structural and single-nucleotide diversity is likely associated with clinical phenotypes (Eichler et al. 2004). Thus, to resolve structurally complex regions and provide a more effective reference resource for such loci, we combined WGS data and BAC sequences from a haploid DNA source to create a single haplotype assembly of the human genome.
Haplotype information is critical to interpreting clinical and personal genomic information as well as genetic diversity and ancestry data, and most previously sequenced individual human genomes are not haploresolved. The current reference human genome sequence represents a mosaic that further complicates haplotyping; within a BAC clone there is a single haplotype representation, but haplotypes can switch at BAC clone junctions. By utilizing an essentially haploid DNA source, we resolved a single haplotype across complex regions of the genome where the reference genome contained a mixture of haplotypes from various sources and/or contained unresolved gaps. For example, a gap on Chromosome 4p14 in GRCh37 (Chr 4: 40296397–40297096) was completely resolved using CHM1 WGS data. The gap was flanked by repetitive elements that were not traversed by a clone. This region has subsequently been updated with a complete tiling path in GRCh38.
Figure 5. Overview of the Chr 11 (NC_018922.2) 1.9-Mb region, exhibiting three alignment bins with a large number of PacBio cliff reads where the alignment coverage dropped off sharply. WGS component (light green lines) boundaries flanked by such reads are marked with red dashed lines. The ends of each component at the boundary are labeled with letters to show orientation. Pairs of alignments corresponding to three different PacBio reads are marked in yellow, green, and dark blue. These alignments overlap by < 10% on each of the reads. The split alignments for these three reads suggest that the two WGS components marked in purple should be inverted and translocated as indicated by the arrow at the top of the image. The other PacBio reads in these bins exhibit the same pattern of split alignments, which supports the proposed reordering and orientation of the WGS components. The bottom light green lines show a proposed tiling path with the orientation corrected; the letters indicate where each end of the initial tiling path components should be placed.
-
Summary
Genome size and repeat content can be estimated without an assembly. Removing adapters and trimming low-QV bases is good unless the assembly program does error correction (EC) itself.
Assess the level of heterozygosity in your target genome before you assemble (or sequence) it, and set your expectations accordingly.
Choose an assembler that excels in the area you are interested in (e.g., coverage, continuity) and prepare libraries for it.
Interested in doing just coding potential analyses (e.g., training a gene finder, studying codon usage bias, looking for intron-specific motifs)? => Consider studying exome assemblies.
Or consider a proxy: study a species that is sufficiently close evolutionarily and whose genome is of good quality.
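The first point above, that genome size can be estimated without an assembly, can be sketched from a k-mer count histogram of the raw reads. The following is a minimal illustration, not the pipeline used by the speaker: the function name, the dictionary input and the `min_depth` error cutoff are assumptions, and the histogram would come from any k-mer counter run on the reads.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Rough genome size estimate from a k-mer histogram.

    histogram: dict mapping k-mer multiplicity (depth) to the number of
               distinct k-mers seen at that depth.
    min_depth: assumed cutoff; k-mers below it are treated as sequencing
               errors and ignored.
    Returns (estimated_size, peak_depth).
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # The most populated depth approximates single-copy k-mer coverage.
    peak_depth = max(solid, key=lambda d: solid[d])
    # Total k-mer occurrences divided by coverage gives the genome size.
    total_kmers = sum(d * n for d, n in solid.items())
    return total_kmers // peak_depth, peak_depth


if __name__ == "__main__":
    # Toy histogram: an error tail at depths 1-2, a main peak near 20x,
    # and a small two-copy repeat peak at 40x.
    toy = {1: 50000, 2: 10000, 18: 100, 19: 300, 20: 500,
           21: 300, 22: 100, 40: 50}
    print(estimate_genome_size(toy))
```

K-mers at multiples of the peak depth hint at repeat content, since each contributes proportionally more to the total. In a highly heterozygous genome the tallest peak may sit at half the true coverage, which would roughly double the estimate, so the heterozygosity check in the next point should come first.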
-
Summary
Genome size and repeat content can be estimated without an assembly. Removing adapters and trimming low-QV bases is good unless the assembly program does error correction (EC) itself.
Assess the level of heterozygosity in your target genome before you assemble (or sequence) it, and set your expectations accordingly.
Choose an assembler that excels in the area you are interested in (e.g., coverage, continuity, or number of error-free bases).
Interested in doing just coding potential analyses (e.g., training a gene finder, studying codon usage bias, looking for intron-specific motifs)? => Consider studying exome assemblies.
Or consider a proxy: study a species that is sufficiently close evolutionarily and whose genome is of good quality.
Settle on an assembly so science can continue!
-
Know the future
-
A vision into the future
-
Acknowledgements
Olga Vinnere Pettersson, Björn Nysted, Ola Spujth, Henrik Lantz, Jacques Daimat, Francesco Vezzi, BGI, Jon Badalamenti (Bond Lab), Stephan C. Schuster (Penn U)