helsinki genome project-20151210-amb

Things to consider when initiating a genome project: the assembly pipeline @ SciLifeLab. Helsinki, Dec 9th 2015. Álvaro Martínez Barrio, PhD [email protected] linkedin.com/in/ambarrio @ambarrio


TRANSCRIPT

  • Things to consider when initiating a genome project: the assembly pipeline @ SciLifeLab

    Helsinki, Dec 9th 2015

    Álvaro Martínez Barrio, PhD [email protected] linkedin.com/in/ambarrio @ambarrio

  • Workshop Outline

    Introducing SciLifeLab
    The important considerations of all genome projects
    The annotation and assembly platforms
    A vision into the future

  • Survey

    How many of you have used sequencing facilities?
    Assembled a genome?
    Planning to start a genome project?
    Have worked with NGS data?
    Just curious about NGS?

  • Things to consider

    Repeats
    Heterozygosity
    Size of your genome
    GC content
    Access to material and specifically HMW DNA
    Access to a good computational cluster
    Good bioinformaticians/lab technicians

  • Things to consider

    WHAT IS YOUR SCIENTIFIC QUESTION?

  • Variation space

  • Things to consider

    http://www.intechopen.com/books/recent-advances-in-autism-spectrum-disorders-volume-i/discovering-the-genetics-of-autism

  • Things to consider

    Figure 2 | Structural variation sequence signatures. There are four general sequence-based analytical approaches used to detect structural variation. Theoretically, read-pair (RP), split-read and assembly methods can be used to discover variants from all classes of structural variant (SV), but each has different biases depending on the underlying sequence content of the variants and the data properties of the sequence reads. However, read-depth approaches can be used to detect only losses (deletions) and gains (duplications), and cannot discriminate between tandem and interspersed duplications. Briefly, read-pair methods analyse the mapping information of paired-end reads and their discordancy from the expected span size and mapped strand properties. Sensitivity, specificity and breakpoint accuracy are dependent on the read length, insert size and physical coverage [3,4,59,62,65,66,68,69]. Breakpoints are indicated by red arrows. Read-depth analysis examines the increase and decrease in sequence coverage to detect duplications and deletions, respectively, and predict absolute copy numbers of genomic intervals [45,62,74-76]. Split-read algorithms are capable of detecting exact breakpoints of all variant classes by analysing the sequence alignment of the reads and the reference genome; however, they usually require longer reads than the other methods and have less power in repeat- and duplication-rich loci [62,78,79]. Assembly algorithms [83-86,115] have the most power to detect SVs of all classes at the breakpoint resolution, but assembling short sequences and inserts often result in contig/scaffold fragmentation in regions with high repeat and duplication content [89]. MEI, mobile-element insertion. Repbase is a database of repetitive elements.

    Alkan C., Coe B.P., Eichler E.E. Nature Rev Genetics (2011)

  • Things to consider

    Ward L.D. & Kellis M. Nat Biotechnology (2012)

  • About me

    Álvaro Martínez Barrio, PhD [email protected] linkedin.com/in/ambarrio @ambarrio

    PhD Bioinformatics 2010
    Postdoc Pop Genetics/Comp Biol 2014, L. Andersson + H. Ronne

    Herring: Illumina, SOLiD, Moleculo, PacBio
    Species Plant: 454, SOLiD, Illumina
    Species Seal (~3 Gb): Illumina
    Species Beetle: Illumina, PacBio

  • Pool-seq: A sequencing technique in which sequencing libraries are not prepared from DNA of a single individual or cell but from a mixture of DNA fragments originating from different individuals or cells. In the context of this Review, Pool-seq is used to describe the unbiased sequencing of the entire genome.

    Coverage: The number of reads that span a given genomic position.

    Sequencing libraries: Sets of fragmented DNA extracted from one or more individuals that serve as the template for subsequent sequencing.

    Exome sequencing: A sequencing approach in which the complexity of the genome is reduced through hybridization to exonic sequences, which results in a higher sequence coverage of protein-coding regions.

    Restriction-site-associated DNA markers: Sequence polymorphisms in close proximity to a restriction enzyme recognition site.

    are subject to larger sampling variance, whereby they can result in considerable errors even when the allele frequencies in the sample have been determined at high accuracy. In other words, accurately sequencing a small population sample will still result in noisy allele frequency estimates. By contrast, Pool-seq makes use of large population samples, but not all chromosomes in the samples are analysed. The higher accuracy to cost ratio of Pool-seq arises from the fact that very few chromosomes are sequenced more than once, whereas for sequencing of individuals each chromosome is typically sequenced multiple times (5-40 times). This advantage is clearly demonstrated in FIG. 1, in which the accuracy of Pool-seq is compared with sequencing of individuals at a fixed sequencing cost (that is, assuming that the same number of sequence reads is used in each case). Although Pool-seq mostly performs better when 50 individuals are pooled, its performance is clearly superior when pooling 100 or more individuals (FIG. 1a). Additionally, the accuracy of Pool-seq relative to sequencing of individuals increases with the coverage of individual genomes (FIG. 1b).

    The cost-effectiveness of Pool-seq becomes even more evident when the costs for the preparation of the sequencing libraries are considered: Pool-seq uses a single library for the entire sample, whereas sequencing of individuals requires a separate library to be prepared for each genome. As library construction constitutes ~20% of the total sequencing costs for species with moderate genome sizes, this is an important cost factor.

    Comparison to reduced-representation sequencing

    Sequencing individuals at a high coverage is undoubtedly the gold standard for obtaining high-quality data, but budget constraints frequently require alternatives for studying large populations. In addition to Pool-seq, other strategies have been developed for sequencing large samples (FIG. 2). Below, we compare different sequencing approaches (TABLE 1) and weigh their particular strengths and weaknesses against those of Pool-seq (TABLE 2).

    In contrast to the whole-genome approach of Pool-seq, the cost savings of these alternative approaches are achieved by reducing the representation of the genome in the sequence data. Different strategies for targeting the sequencing to specific regions of the genome can be categorized into exome sequencing [13,14], high-throughput RNA sequencing (RNA-seq) [15] and methods using restriction-site-associated DNA markers [16]. Each of these methods have been combined with pooling to further reduce costs, but each approach has its particular strengths and weaknesses (see below).
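
The remark that a small population sample gives noisy allele frequency estimates can be illustrated with plain binomial sampling. The sketch below is an assumption-laden simplification and not the PIFs calculation behind the review's figure: it treats the sampled chromosomes as independent draws, ignores sequencing error and read sampling, and uses hypothetical function names.

```python
import math


def allele_freq_sd(p, n_chromosomes):
    """Standard deviation of an allele frequency estimate obtained by
    sampling n_chromosomes independently from a population in which the
    true allele frequency is p (binomial sampling only)."""
    if n_chromosomes <= 0:
        raise ValueError("n_chromosomes must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    return math.sqrt(p * (1.0 - p) / n_chromosomes)


if __name__ == "__main__":
    # 10 diploid individuals contribute 20 chromosomes, 100 contribute 200.
    for individuals in (10, 50, 100, 500):
        sd = allele_freq_sd(0.5, 2 * individuals)
        print(f"{individuals:>4} individuals: SD = {sd:.4f}")
```

Under these assumptions the sampling noise shrinks only with the square root of the number of chromosomes sampled, which is why drawing on many individuals matters even when each one is sequenced shallowly.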

    Figure 1 | Cost-effectiveness of Pool-seq. The accuracy of allele frequency estimates is compared for whole-genome sequencing of pools of individuals (Pool-seq) and whole-genome sequencing of individuals using the ratio of the standard deviation (SD) of the estimated allele frequency with both methods. The same number of reads is used for both sequencing strategies. A value smaller than one indicates that Pool-seq is more accurate than sequencing of individuals. a | The influence of the pool size is shown. A larger pool size results in higher accuracy of Pool-seq, but Pool-seq still produces more accurate allele frequency estimates even for pool sizes of 50 individuals in most comparisons. Only when the number of sequenced individuals approaches the pool size does sequencing of individuals become the superior strategy. b | Influence of coverage and variation in representation of individuals in a pool is shown. With a lower coverage per individual, the advantage of Pool-seq decreases. It should be noted that with a decreasing coverage per individual, the two approaches produce very similar types of data; that is, sequencing of individuals tends to show the same limitations as Pool-seq, such as for estimating linkage disequilibrium and for distinguishing sequencing errors from low-frequency polymorphisms. Variation in the representation of individuals in the DNA pool reduces the accuracy of Pool-seq only slightly (0% (that is, all individuals are uniformly represented; orange line) and 30% (light blue line)). The graphs were generated with the PIFs software [12], ignoring sequencing errors.

    [Figure 1 plot, panels a and b. X-axis: number of individuals sequenced separately (10 to 50). Y-axis: SD pool / SD individuals (0.4 to 1.1). Legend columns: pool size, coverage per sequenced individual, deviation in DNA content from each individual in the pool. One panel varies pool size (500, 100, 50) at coverage 5 and 30% deviation; the other keeps pool size 100 with coverage/deviation of 20/0%, 20/30%, 5/30% and 1/30%.]

    Schlötterer C., Tobler R., Kofler R. and Nolte V. Nature Rev Genetics (2014)

    Why pooling?

  • Schlötterer C., Tobler R., Kofler R. and Nolte V. Nature Rev Genetics (2014)

  • SciLifeLab (promotion slides)

    National service
    Local scientific center

    Director (July 2015): Olli Kallioniemi
    Co-director: Kerstin Lindblad-Toh

    Vision: To be an internationally leading center that develops, uses and provides access to advanced technologies for molecular biosciences with focus on health and environment.

    www.scilifelab.se

    2010: Strategic research initiative
    2013: National resource
    2015: New management and chairman

  • SciLifeLab platforms

    National Genomics Infrastructure: Joakim Lundeberg, Ann-Christine Syvänen, Ulf Gyllensten
    National Bioinformatics Infrastructure Sweden: Bengt Persson
    Clinical Diagnostics: Lars Engstrand

    Computer resources free for Swedish researchers (VR, SNIC)

    Ongoing merge of BILS, WABI and more; complete 2016. National, distributed

  • Know a good bioinformatician

  • NBIS: We're here for you!

  • The Bioinformatics Platform 2016

    Funding: The Research Council, SciLifeLab, KAW foundation, host universities

    Applied at the Research Council as continued national infrastructure 2016-2023. Decision late 2015.

    Custom-tailored support
    Tools
    Training

    Today ~70 FTE

  • Long-term Support: Wallenberg Advanced Bioinformatics Infrastructure (www.scilifelab.se/facilities/wabi/)

    Tailored solutions, high impact

    Directors: Siv Andersson, Gunnar von Heijne
    Managers: Björn Nystedt, Thomas Svensson

    Sweden's strongest unit for analyses of large-scale genomic data (24 FTE)

    Applied bioinformatics: 500h free support/project
    Variant analyses in health and disease
    Transcriptomics
    Single-cell analyses
    Epigenetics
    Metagenomics

    National committee reviews and selects projects based on scientific quality

    Staff in Stockholm, Uppsala, Lund, Gothenburg, Linköping, Umeå.

  • WABI personnel (2013-2014)

    Johan Reimegård, Mikael Huss, Åsa Björklund, Pär Engström, Jakub Orzechowski Westholm, Estelle Proux-Wéra, Sanella Kjellqvist, Diana Ekman, Pall Olason, Anna Johansson, Marcel Martin, Alvaro Martinez Barrio, Per Unneberg

  • Know how to handle your data

  • Today: Human genome sequenced in days - towards $1000 genome

    Massively parallel sequencing requires supercomputers for analysis and storage.

  • SciLifeLab Bioinformatics Compute and Storage (UPPNEX)

    1. Sample transfer
    2. Data delivery
    3. Analysis

    Scientists

    www.uppmax.uu.se/uppnex
    High-performance computers and large-scale storage for bioinformatics analysis.

  • How do you work on UPPMAX computers?

    Login
    Submit jobs
    Job queue
    Job assigned
    Work interactively

  • Resources

    2015: Research ~8000 cores; Production ~3200 cores; Redundancy 768 cores; Storage ~11 PB; Long-term storage; Mosler 384 cores

    2014: Research 3328 cores; Production 768 cores; Storage ~7 PB; Long-term storage; Mosler 384 cores

    Private Cloud 1600 cores; Chipster, CanvasDB

  • Project growth

    [Plot: number of active projects per year, 2004 to 2013 (y-axis ticks 100 to 400), labelled UPPMAX and UPPNEX.]

  • UPPNEX history

    2009: 13,152 MSEK from KAW and SNIC
    2012: 23.8 MSEK from KAW/SNIC
    2014: 20 + 20 MSEK from KAW for WGS
          SNIC receives 47.8 MSEK from VR to handle sensitive data

  • UPPMAX personnel

    +3 more

  • Know how to extract your DNA

  • Olga Vinnere Pettersson (UGC) [email protected]

    mp4 - http://bit.ly/1Ul7RmH
    pptx - http://bit.ly/1Z6yIFH
    Q&A - http://bit.ly/1I1Sb6o

  • Bacteria Fungi

    Insects Plants

  • Know how to measure your assembly results

  • Just a word on N50

    N50 typically refers to a con...
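
The slide text on N50 is cut off in this transcript. As a reminder of the commonly used definition (not necessarily the exact wording of the slide), N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembled bases. A minimal sketch, with a hypothetical function name:

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths.

    Contigs are sorted from longest to shortest and accumulated until
    the running sum reaches at least half of the total assembly length;
    the length of the contig that crosses that threshold is the N50.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    contigs = [2, 2, 2, 3, 3, 4, 8, 8]
    print("N50 =", n50(contigs))
```

Note that N50 says nothing about correctness: joining contigs wrongly raises N50, so it is best read alongside other quality measures.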

  • Know why assembling is difficult

  • Two types of assemblies

    Case 1: Flycatcher (1.2 Gbp), Herring (800 Mbp), Malassezia (7 Mbp)

    Case 2: Spruce (20 Gbp), Barnacle (1.4 Gbp), Wolbachia (4 Mbp)

  • Pre-assembly

    Quality trimming
    (Error correction)
    Kmer analysis
    De novo repeat library

  • Quality trimming

    De Bruijn-graph assemblers are in principle sensi...

  • Reads vs kmers

    1 read: L = 100 bp

    Kmers: k = 21 bp
    N = L - k + 1 = 100 bp - 21 bp + 1 = 80 kmers per read

    Kmer coverage = Base coverage * (L - k + 1) / L

    Ex: 50X * (100 - 21 + 1) / 100 = 40X (i.e. kmer coverage is 80% of base coverage)
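
The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the two formulas above; the function names are hypothetical and all reads are assumed to have the same length.

```python
def kmers_per_read(read_length, k):
    """Number of k-mers in one read: N = L - k + 1 (0 if the read is shorter than k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0, read_length - k + 1)


def kmer_coverage(base_coverage, read_length, k):
    """Convert base coverage to k-mer coverage: C_k = C * (L - k + 1) / L."""
    return base_coverage * kmers_per_read(read_length, k) / read_length


if __name__ == "__main__":
    print(kmers_per_read(100, 21), "kmers per 100 bp read at k = 21")
    print(kmer_coverage(50, 100, 21), "X kmer coverage from 50X base coverage")
```

The practical consequence is that larger k values lower the effective coverage seen by a de Bruijn graph assembler, so the choice of k has to be weighed against the available sequencing depth.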

  • Kmer analyses

    How to count kmers?

    Compute the frequency of each kmer in the dataset (e.g. Jellyfish --both-strands)
    Note: RAM-intense!
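
To make concrete what counting kmers on both strands means, here is a toy pure-Python counter. It is not how Jellyfish works internally and would be far too slow and memory-hungry for real data; the function names are hypothetical. Each kmer and its reverse complement are collapsed into one canonical form, and kmers containing anything other than A, C, G or T are skipped (an assumption made here for simplicity).

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_canonical_kmers(reads, k):
    """Count kmers across reads, merging each kmer with its reverse complement."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) - set("ACGT"):
                continue  # skip kmers with N or other ambiguity codes
            # The canonical form is the lexicographically smaller strand.
            counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


def kmer_histogram(counts):
    """Map each multiplicity (depth) to the number of distinct kmers seen that often."""
    return Counter(counts.values())


if __name__ == "__main__":
    counts = count_canonical_kmers(["ACGTT", "AACGT"], k=3)
    print(dict(counts))
    print(dict(kmer_histogram(counts)))
```

The histogram returned by `kmer_histogram` is the kind of depth distribution that the following slides inspect for genome size, repeats and heterozygosity.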

  • Digging into the kmers

    Genome size
    Remove low-copy kmers
    Iden...

  • Repeats: first shot

    The nb of dis...

  • Heterozygosity and ploidy, and humans are easy.

    Bacteria, archaea, fungi, some plants

    Most animals, some plants

    Many plants

    Also: Heterozygosity is generally very low in mammals; most other species are much harder

  • Heterozygosity with kmer graphs

    Double peak in the kmer histogram; clear indication of heterozygosity
    Not entirely easy to quantify (although attempts have been made)
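
A crude way to look for the double peak is sketched below. It assumes a clean, smooth histogram given as a dict mapping depth to the number of kmers at that depth; real histograms are noisy and would need smoothing first. The thresholds (`min_depth`, `tolerance`) and function names are hypothetical, and the check assumes the homozygous peak is the taller one, which fails for very heterozygous genomes.

```python
def find_peaks(hist, min_depth=5):
    """Depths that are strict local maxima, ignoring the low-depth error kmers."""
    peaks = []
    for depth in sorted(d for d in hist if d >= min_depth):
        count = hist[depth]
        if count > hist.get(depth - 1, 0) and count > hist.get(depth + 1, 0):
            peaks.append(depth)
    return peaks


def has_half_depth_peak(hist, min_depth=5, tolerance=0.2):
    """True if a secondary peak sits near half the depth of the main peak."""
    peaks = find_peaks(hist, min_depth)
    if not peaks:
        return False
    main = max(peaks, key=lambda d: hist[d])
    return any(
        abs(depth - main / 2) <= tolerance * main
        for depth in peaks
        if depth != main
    )


def toy_histogram(het_peak_height):
    """Synthetic histogram: main peak at depth 32, optional peak at depth 16."""
    hist = {}
    for depth in range(1, 61):
        hist[depth] = (
            max(0, 100 - 10 * abs(depth - 32))
            + max(0, het_peak_height - 10 * abs(depth - 16))
        )
    hist[1], hist[2], hist[3] = 1000, 300, 50  # sequencing-error kmers
    return hist


if __name__ == "__main__":
    print(has_half_depth_peak(toy_histogram(40)))  # heterozygous-like
    print(has_half_depth_peak(toy_histogram(0)))   # single peak only
```

This only flags the presence of a sub-peak; as the slide notes, turning its size into a heterozygosity rate is a harder problem.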

  • A word on quality filtering

    Light QC filter
    Hard QC filter

    A word of precaution on quality filtering!

  • Heterozygosity with kmer graphs

    Fig 4.1: 17-mer depth distribution
    [Plot: x-axis Depth (X), 0 to 100; y-axis Percentage (%), 0 to 2.]

    Table 4.2: 17-mer data statistics

    K    K-mer_NO.        Peak_depth   Genome Size    Used Bases       Used Reads    X
    17   22,425,038,045   32           700,782,438    26,827,492,429   275,153,399   38.28

    In total, 26.82 Gb of data was retained for the 17-mer analysis. The 17-mer frequency distribution derived from the sequenced reads is plotted in Fig 4.1. The peak of the 17-mer distribution is at a depth of about 32 and the total K-mer count is 22,425,038,045, so the genome size can be estimated (by the formula Genome Size = K-mer_num / Peak_depth) as 700.78 Mb.

    If the heterozygous rate is high, a small peak will appear at 1/2 of the peak depth, so this K-mer analysis can be used to roughly determine the heterozygous rate of a given genome.

    The distribution can also be used to assess the repeat content of the genome: if the genome contains a high proportion of repeats, the distribution will display a fat tail, which indicates that a larger than expected proportion of the genome has a high sequencing depth, possibly due to sequence similarity.

    Conclusion: Genome size is 700.78 Mb, and the heterozygous rate in this genome is too high to do whole genome shotgun sequence and assembly.

    4.3 Estimation of heterozygous rate

    We simulated the herring genome with different heterozygosity rates, ran the 17-mer analysis on each of them, and obtained Fig 4.2.

    Fig 4.2: Hybrid effect on K-mer distribution.
    [Plot: x-axis Depth (X), 0 to 80; y-axis Percentage, 0 to 2; curves H_0.01067, Epi, H_0.012, H_0.015.]

    The X axis is the depth of the 17-mers and the Y axis is the ratio of 17-mers. Epi is the 17-mer curve of herring. H_0.01067 means that the heterozygosity rate is 1.067%, H_0.012 is 1.2% and H_0.015 is 1.5%.

    From this figure, we can see that as the heterozygosity rate increases, the sub-peak becomes more apparent at half of the expected K-mer depth on the X axis. We conclude that the heterozygosity rate of the herring genome is about 1.5%.
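
The genome size figure in the report excerpt follows directly from its own formula, Genome Size = K-mer_num / Peak_depth, and the last column of the table appears to be used bases divided by that estimate. A short check of both numbers, with hypothetical function names and integer truncation assumed for the genome size:

```python
def estimate_genome_size(total_kmers, peak_depth):
    """Genome size estimate: total k-mer count divided by the main peak depth."""
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return total_kmers // peak_depth


def sequencing_depth(used_bases, genome_size):
    """Average base coverage implied by the estimated genome size."""
    return used_bases / genome_size


if __name__ == "__main__":
    size = estimate_genome_size(22_425_038_045, 32)
    depth = sequencing_depth(26_827_492_429, size)
    print(f"Estimated genome size: {size:,} bp ({size / 1e6:.2f} Mb)")
    print(f"Implied base coverage: {depth:.2f}X")
```

The estimate is only as good as the peak call: picking the heterozygous half-depth peak instead of the homozygous one would roughly double the apparent genome size.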

  • 7

    Fig4.1 17-mer depth distribution

    Table4.2 17-mer Data statistics

    K K-mer_NO. Peak_depth Genome Size Used Bases Used Reads X

    17 22,425,038,045 32 700,782,438 26,827,492,429 275,153,399 38.28

    Total 26.82Gb data was retained for 17-mer analysis .the 17-mer frequency distribution

    derived from the sequenced reads was plotted in Fig1, the peak of the 17-mer distribution is about

    32, and the total K-mer count is 22,425,038,045, then the genome size can be estimated ( by

    formula: Genome Size=K-mer_num/Peak_depth) as 700.78Mb.

    If the heterozygous rate is higher, then a small peak will be presented at 1/2 of Peak depth.

    So this K-mer analysis can be used to roughly determine the heterozygous rate of a given genome.

    Also, this distribution can be used to determine the repeat content of the genome, if this

    genome contains high proportion of repeat; the distribution will display a fat tail which indicates

    more than expect proportion of the genome have a high sequencing depth which may due to

    sequence similarly.

    Conclusion: Genome size is 700.78Mb, and the heterozygous rate in this genome is too high

    to do whole genome shotgun sequence and assembly.

    4.3 Estimation of heterozygous rate

    We simulate herring genome with different heterozygosis rate, and make the 17-mer analysis

    on them respectively, and then get the figure 2.

    0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    1.6

    1.8

    2

    0 10 20 30 40 50 60 70 80 90 100

    Percen

    tage(%

    )

    Depth(X)

    7

    Fig4.1 17-mer depth distribution

    Table4.2 17-mer Data statistics

    K K-mer_NO. Peak_depth Genome Size Used Bases Used Reads X

    17 22,425,038,045 32 700,782,438 26,827,492,429 275,153,399 38.28

    Total 26.82Gb data was retained for 17-mer analysis .the 17-mer frequency distribution

    derived from the sequenced reads was plotted in Fig1, the peak of the 17-mer distribution is about

    32, and the total K-mer count is 22,425,038,045, then the genome size can be estimated ( by

    formula: Genome Size=K-mer_num/Peak_depth) as 700.78Mb.

    If the heterozygous rate is higher, then a small peak will be presented at 1/2 of Peak depth.

    So this K-mer analysis can be used to roughly determine the heterozygous rate of a given genome.

    Also, this distribution can be used to determine the repeat content of the genome, if this

    genome contains high proportion of repeat; the distribution will display a fat tail which indicates

    more than expect proportion of the genome have a high sequencing depth which may due to

    sequence similarly.

    Conclusion: Genome size is 700.78Mb, and the heterozygous rate in this genome is too high

    to do whole genome shotgun sequence and assembly.

    4.3 Estimation of heterozygous rate

    We simulate herring genome with different heterozygosis rate, and make the 17-mer analysis

    on them respectively, and then get the figure 2.

    0

    0.2

    0.4

    0.6

    0.8

    1

    1.2

    1.4

    1.6

    1.8

    2

    0 10 20 30 40 50 60 70 80 90 100

    Percen

    tage(%

    )

    Depth(X)

    7

    Fig4.1 17-mer depth distribution

    Table4.2 17-mer Data statistics

    K K-mer_NO. Peak_depth Genome Size Used Bases Used Reads X

    17 22,425,038,045 32 700,782,438 26,827,492,429 275,153,399 38.28

    Total 26.82Gb data was retained for 17-mer analysis .the 17-mer frequency distribution

    derived from the sequenced reads was plotted in Fig1, the peak of the 17-mer distribution is about

    32, and the total K-mer count is 22,425,038,045, then the genome size can be estimated ( by

    formula: Genome Size=K-mer_num/Peak_depth) as 700.78Mb.

    If the heterozygous rate is higher, then a small peak will be presented at 1/2 of Peak depth.

    So this K-mer analysis can be used to roughly determine the heterozygous rate of a given genome.

    Also, this distribution can be used to determine the repeat content of the genome, if this

    genome contains high proportion of repeat; the distribution will display a fat tail which indicates

    more than expect proportion of the genome have a high sequencing depth which may due to

    sequence similarly.

    Conclusion: Genome size is 700.78Mb, and the heterozygous rate in this genome is too high

    to do whole genome shotgun sequence and assembly.

    4.3 Estimation of heterozygous rate

    We simulate herring genome with different heterozygosis rate, and make the 17-mer analysis

    on them respectively, and then get the figure 2.

    Fig 4.2 Hybrid effect on K-mer distribution.

    The X axis is the 17-mer depth and the Y axis is the proportion of 17-mers at that depth. Epi is the 17-mer curve of herring. H_0.01067 denotes a heterozygosity rate of 1.067%, H_0.012 a rate of 1.2%, and H_0.015 a rate of 1.5%.

    The figure shows that as the heterozygosity rate increases, the sub-peak at half of the expected K-mer depth on the X axis becomes more apparent. We conclude that the heterozygosity rate of the herring genome is about 1.5%.

    [Figure: 17-mer depth curves; x axis: Depth (X), 0 to 80; y axis: Percentage, 0 to 2; series: H_0.01067, Epi, H_0.012, H_0.015]

    The heterozygosity was estimated to be 1.5%

    Heterozygosity with kmer graphs

  • Repeats: first shot

    Thenbofdis

  • Why repeats destroy assemblies

    Genome assembly - things to think about

  • Repeat library and repeat quantification

    Create a de novo repeat library
    - Run a low-coverage (e.g. 0.1X) assembly (e.g. RepeatExplorer or Trinity)
    - Filter contaminants and mito/chloro
    - [Make non-redundant (e.g. CD-HIT)]

    Quantify the (high) repeat content by an independent subset of reads
    - Mapping (e.g. bwa), or
    - Mask with RepeatMasker
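How many reads a 0.1X subsample needs, and how to keep the quantification reads independent of the library reads, can be sketched as below. The function names, the 100 bp read length and the use of the 700.78 Mb herring estimate are illustrative assumptions, not part of the original workflow.

```python
import math
import random


def reads_for_coverage(genome_size: int, read_length: int, coverage: float) -> int:
    """Number of reads needed so that reads * read_length ~= coverage * genome_size."""
    return math.ceil(coverage * genome_size / read_length)


def split_independent_subsets(read_ids, n_library, n_quant, seed=0):
    """Draw two disjoint random subsets of read IDs.

    The first is meant for building the repeat library, the second for
    quantifying repeat content. Sampling once without replacement and
    then splitting guarantees that no read is in both subsets.
    """
    rng = random.Random(seed)
    picked = rng.sample(read_ids, n_library + n_quant)
    return picked[:n_library], picked[n_library:]


# Assumed example: 700.78 Mb genome, 100 bp reads, 0.1X target coverage.
n_reads = reads_for_coverage(700_782_438, 100, 0.1)
library_ids, quant_ids = split_independent_subsets(list(range(1000)), 100, 50)
print(n_reads, len(library_ids), len(quant_ids))
```

In practice the subsampling would be done on the FASTQ files themselves; the point here is only the arithmetic and the disjoint split.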

    A real example

    [Figure: coverage vs %GC plot; 5 Mbp mitochondrion in spruce]

  • Repeat library from low coverage data

    R R R R R

    Overlaps?

    Sparse seq data

  • Repeat library from low coverage data

    R R R R R

    Overlaps?

    Assembled con

  • Repeat library from low coverage data

    Quan

  • Classifying repeats

    LTR Gypsy/Copia, LINE/SINE, DNA elements

    This is very tricky

    Classifying the repeat library directly
    - RepeatMasker
    - Repeat protein domain search (http://www.repeatmasker.org/cgi-bin/RepeatProteinMaskRequest)

    Problems
    - No close homologs in databases
    - Rapid evolution of repeats (like transposable elements, TEs)
    - Non-autonomous TEs do not contain proteins

    Solutions
    - Fetch intact ORFs from hits in assembly
    - Extend assembly matches and get more complete elements
    - Check match alignment profiles in assembly (LINEs conserved at 3' end but not at 5' ..)

    => Often slow, manual, species-specific solutions

  • Know the technology bias

  • Coverage distribution

    [Figure: sequence coverage per platform (454, Illumina, SOLiD); x axis: Coverage, 0 to 100; y axis: Number of Mb's in hg19, 0 to 350; average coverage marked]

  • Clark M.J., et al. Nat. Biotech (2011)

    Performance comparison of exome DNA sequencing technologies. (Mike Snyder's lab)

  • Ning L., et al. Scientific Reports (2015)

  • Know the assembly algorithms

  • [Figure: assembly workflows in three stages: 1 pre-processing, 2 assembly, 3 finishing / polishing]

    Short Reads (Illumina) - graph assembly: adapter removal, quality trimming, de Bruijn or string graph construction, error correction, scaffolding (contigs, read pairs, NNNNNN gaps), read mapping

    Long Reads (PacBio) - HGAP assembly: read self-correction (read length distribution), overlap-layout-consensus assembly, consensus calling with quiver, assembled genome

    the overall assembly strategy is the same

    but the data and tools are fundamentally different

  • http://www.lucigen.com/NxSeq-Long-Mate-Pair-Library-Kit/

  • Many instruments, too many solutions

    Assembler Name   Algorithm       Input
    Arachne          OLC             Sanger
    CAP3             OLC             Sanger
    TIGR             Greedy          Sanger
    Newbler          OLC             454/Roche
    Edena            OLC             Illumina
    SGA              OLC             Illumina
    MaSuRCA          De Bruijn/OLC   Illumina
    Velvet           De Bruijn       Illumina
    ALLPATHS         De Bruijn       Illumina/PacBio
    ABySS            De Bruijn       Illumina
    SOAPdenovo       De Bruijn       Illumina
    CLC              De Bruijn       Illumina/454
    CABOG            OLC             Hybrid

    No easy way to determine best assembly/assembler
    - implemented heuristics are the key issue

    Choice of approach depends on data being assembled

    Currently efforts ongoing to establish best practices
    - Assemblathons and GAGE to evaluate existing solutions

  • OLC vs. de Bruijn

  • OLC

    Pros: Can use longer reads properly

    Cons: Time consuming, high memory requirements
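The overlap step of OLC can be illustrated with a toy all-against-all suffix/prefix comparison. This is a naive sketch, not how any assembler in the table implements it; the function names, the `min_overlap` default and the three reads are made up for illustration. The nested loop over all read pairs hints at why OLC gets time- and memory-hungry as read counts grow.

```python
def overlap_length(a: str, b: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no exact overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def all_overlaps(reads, min_overlap: int = 3):
    """Compare every ordered pair of reads: quadratic in the number of reads."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap_length(a, b, min_overlap)
                if n:
                    overlaps[(i, j)] = n
    return overlaps


# Three toy reads that chain into one layout: read 0 -> read 1 -> read 2.
toy_reads = ["ATCGTTCC", "TTCCGAGT", "GAGTCTCC"]
print(all_overlaps(toy_reads))
```

Real overlappers tolerate mismatches and use indexing to avoid comparing every pair; exact matching is used here only to keep the example short.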

  • de Bruijn

  • Generate assembly via de Bruijn

    Martin & Wang, Nat. Rev. Genet. (2011)

  • De Bruijn

    Pros: Computationally efficient, can work with large-coverage short-read datasets

    Cons: Sensitive to sequence errors, connection between assembly and read is lost, does not work so well with longer reads
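A minimal sketch of de Bruijn graph construction helps to see where the pros and cons come from. The function name and the toy reads are hypothetical; real assemblers add reverse complements, k-mer counting and error pruning on top of this.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k: int):
    """Build a de Bruijn graph: nodes are (k-1)-mers, and every k-mer in a
    read contributes one edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads, k = 3.
graph = de_bruijn_graph(["ATCGT", "TCGTA"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
```

Only k-mers are stored, never which read they came from, which is the lost connection between assembly and read. A single sequencing error creates up to k new k-mers, hence the sensitivity to errors.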

  • Some recommendations

    Large eukaryote genome, Illumina data: Allpaths-LG (needs specific libraries), SOAPdenovo, SGA, Masurca, DISCOVAR

    Large eukaryote genome, additional longer reads: Masurca, Newbler, CABOG

    Small eukaryote or prokaryote genome, Illumina data: Spades, Masurca, SOAPdenovo, Abyss, Velvet, DISCOVAR

    Small eukaryote or prokaryote genome, mixed data: MIRA, Spades, Masurca, Newbler

    Need to run in parallel: Abyss, Ray

    Amplified data (Single Cell Genomics): Spades

  • Standard contiguity metrics

    Just a word on N50

    N50 typically refers to a con
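As a reminder of how the metric is computed, here is a short sketch using the common definition: N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name and the toy lengths are illustrative only.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig (or scaffold) lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Toy assembly: total length 290, so half is 145; 80 + 70 = 150 reaches it.
print(n50([80, 70, 50, 40, 30, 20]))
```

Because the total is the assembly length and not the true genome size, N50 values are only comparable between assemblies of similar total length.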

  • The devil is in the repeats

    Repeats and Short Reads

    Moreover short reads

    Short reads make everything harder!!

    [Figure: regions C, R, A, B]

    Mathematically best result:

    F. Vezzi

  • Repeat errors

    - Overlapping non-identical reads
    - Collapsed repeats and chimeras
    - Wrong contig order
    - Inversions

  • Collapsable repeat errors (worst!)

    ATCGGGTATATAG-CCTA
    ||||||| || || ||||
    ATCGGGTGTACAGCCCTA

    [Figure labels: A, B, A&B, ?]

  • Know how to patch gaps/finalize

  • Gaps

  • CCS vs CLR

  • other options for assembling PacBio reads

    https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads

  • Hybrid assemblies

  • Gaps

  • PacBio data is noisy

    PacBio data cannot (currently) be assembled in its raw state
    - several strategies exist for correcting reads prior to assembly
    - correction without a complementary technology used to be difficult: until recently, it was limited by computational power and SMRT cell throughput

    Koren & Phillippy, Curr Op Micro 2014

  • Hybrid assemblers (for PacBio)

  • Hybrid assemblers

    Zimin A.V., Marçais G., Puiu D., Roberts M., Salzberg S.L., Yorke J.A. Bioinformatics (2013)

  • Pure PacBio

  • Finishing/Polishing (Olli-Pekka)

    quiver isn't perfect: using Pilon to polish remaining indels

    makes use of short read mapping to identify potential indels, SNPs, ambiguous bases, local misassemblies

    $ java -Xmx16G -jar path/to/pilon-1.8.jar \
        --genome path/to/fasta --unpaired path/to/mapping.bam \
        --output sample_name --changes --variant --tracks \
        --mindepth 100

    Pilon removed 128 remaining indels in a 3.8 Mbp genome despite Quiver calling a > QV 55 consensus

  • Nature, Vol 527, p. 508, 26 November 2015. 2015 Macmillan Publishers Limited. All rights reserved

    LETTER doi:10.1038/nature15714

    Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum

    Robert VanBuren, Doug Bryant, Patrick P. Edger, Haibao Tang, Diane Burgess, Dinakar Challabathula, Kristi Spittle, Richard Hall, Jenny Gu, Eric Lyons, Michael Freeling, Dorothea Bartels, Boudewijn Ten Hallers, Alex Hastie, Todd P. Michael & Todd C. Mockler

    Plant genomes, and eukaryotic genomes in general, are typically repetitive, polyploid and heterozygous, which complicates genome assembly1. The short read lengths of early Sanger and current next-generation sequencing platforms hinder assembly through complex repeat regions, and many draft and reference genomes are fragmented, lacking skewed GC and repetitive intergenic sequences, which are gaining importance due to projects like the Encyclopedia of DNA Elements (ENCODE)2. Here we report the whole-genome sequencing and assembly of the desiccation-tolerant grass Oropetium thomaeum. Using only single-molecule real-time sequencing, which generates long (>16 kilobases) reads with random errors, we assembled 99% (244 megabases) of the Oropetium genome into 625 contigs with an N50 length of 2.4 megabases. Oropetium is an example of a near-complete draft genome which includes gapless coverage over gene space as well as intergenic sequences such as centromeres, telomeres, transposable elements and rRNA clusters that are typically unassembled in draft genomes. Oropetium has 28,466 protein-coding genes and 43% repeat sequences, yet with 30% more compact euchromatic regions it is the smallest known grass genome. The Oropetium genome demonstrates the utility of single-molecule real-time sequencing for assembling high-quality plant and other eukaryotic genomes, and serves as a valuable resource for the plant comparative genomics community.

    The genomes of Arabidopsis3, rice4, poplar, grape and Sorghum5 were first sequenced using high-quality and reiterative Sanger-based approaches producing a series of gold standard reference genomes. The advent of next-generation sequencing (NGS) technologies reduced costs of sequencing substantially, which has enabled sequencing of over 100 plant genomes1. The quality of plant genome assemblies depends on genome size, ploidy, heterozygosity and sequence coverage, but most NGS-based genomes have on the order of tens of thousands of short contigs distributed in thousands of scaffolds. The short read lengths of NGS, inherent biases and non-random sequencing errors have resulted in highly fragmented draft genome assemblies that are not complete, which means they are missing biologically meaningful sequences including entire genes, regulatory regions, transposable elements, centromeres, telomeres and haplotype-specific structural variations. It is becoming clear from ENCODE projects that complete genomes are needed to better understand the importance of the non-coding regions of genomes2.

    More than 40% of calories consumed by humans are derived from grasses, and the grass family (Poaceae) is arguably the most important plant family with regard to global food security6. The size and complexity of most grass genomes has challenged progress in gene discovery and comparative genomics, although draft genomes are now available for most agriculturally important grasses1. The largest genome assemblies, such as maize (2,300 megabases (Mb))7, barley (5,100 Mb)8 and wheat (hexaploid, 17,000 Mb)9 are highly fragmented as a result of the inability of current sequencing technologies to span complex repeat regions. Near-finished reference genomes are available for rice4, Sorghum5 and Brachypodium10, but more high-quality grass genomes are needed for comparative genomics and gene discovery. Here we present the near-complete draft genome of the grass Oropetium thomaeum, the first high-quality reference genome from the Chloridoideae subfamily. The draft genome is near complete because we were able to sequence through complex repeat regions that are unassembled in most draft genomes. Oropetium has the smallest known grass genome at 245 Mb and is also a resurrection plant that can survive the extreme water stress such as loss of >95% of cellular water (Fig. 1)11.

    Single-molecule real-time (SMRT) sequencing (Pacific Biosciences) produces long and unbiased sequences, which enables assembly of complex repeat structures and GC- and AT-rich regions that are often unassembled or highly fragmented in NGS-based draft genomes. We generated ~72× sequencing coverage of the Oropetium genome using 32 SMRT cells on the PacBio RS II platform (which is equivalent to

  • Know how to annotate

  • Annotation (Jarkko)

    BILS assembly and annotation service

    Henrik Lantz - Team leader
    Mahesh Panchal - Assembly
    Jacques Dainat - Annotation
    Martin Norling - Assembly
    Lucile Soler - Annotation

    - 5 PhDs, all in Uppsala
    - Annotation 2 years, assembly 1 year
    - Not driving own research, focusing on support
    - 80 h of free support to all projects - submitted by customer
    - Dedicated compute cluster for annotation, ~160 cores
    - Assemblies run on shared cluster, ~3200 cores
    - All organisms - all types of data
    - Close contact with sequencing facilities

    Annotation/Assembly technology

    Assembly
    - Perl/Make pipeline
    - Pre-assembly: quality control, kmer analyses
    - Assembly: different assembly programs
    - Assembly validation: FRCbam, Quast, own tools

    Annotation
    - Maker-MPI: proteins, RNA-seq
    - Refinement scripts
    - Functional annotation: Blast, Synteny

  • Know how to validate

  • Assembly validation

    Assembly validation: is it important?

    Sometimes, easy questions are the most difficult:
    - Is my de novo assembly correct?
    - Which assembler do I need to use?
    - I just used all the possible assemblers one can think of. How do I pick one now?
    - Does my assembly contain genes?
    - Is my assembly good enough to perform gene annotation?

  • Assembly validation

    Assembly validation is extremely difficult
    - Too often only connectivity measures are used
    - There is no real solution, only a set of best practices that one can follow

    Recently a lot of attention on assembly validation:

  • Evaluating assemblies with a reference

    Counting errors is not always possible:
    - Reference almost always absent.
    - Error types are not weighted accordingly.

    Visualization is useful, however:
    - No automation
    - Does not scale on large genomes

    WOW. Looks like it is difficult even with the answer

  • Evaluating assemblies without a reference

    - Statistics (N50, etc.)
    - Congruency with raw sequencing data: alignments, QAtools, FRCbam, REAPR
    - Gene space: CEGMA, reference genes, transcriptome

    There is no real recipe, or a tool. We can only suggest some best practice.

  • Your reads are often the best source to validate your assemblies

    Post assembly: am I on the right track?

    Check again your insert sizes (Picard Tools, http://picard.sourceforge.net)

    [Figures: insert size histograms (x axis: Insert Size; y axis: Count; orientations FR, RF, TANDEM) for All_Reads in MP_on_masurca_sorted.bam (MP), PE_on_masurca_sorted.bam (PE) and 7_130425_AD1YUEACXX_P469_101_index12_trimmedtoassembly.abyss.scaf_onlyAligned.bam]

    Failed MP or bad assembly?

    Look at the plots and at the tables; duplication rate is an important measure.

    You need to check if the plot(s) coincide with what you expect.

    Plotting coverage x %GC x length

    [Figures: coverage vs GC, coverage frequency histogram, and contig length (kbp) vs coverage; labelled groups: your genome, mitochondrion, contaminations]

  • Data congruency

    Idea: Map read pairs back to the assembly and look for discrepancies like:
    - no read coverage
    - no span coverage
    - too long/short pair distances

    Reads can be aligned back to the assembly to identify suspicious features.

    But what do we do with these features?

    FRCbam (Vezzi et al. 2012)
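The idea can be sketched on already-extracted pair coordinates. This is not FRCbam's implementation: the function names, the 3-standard-deviation cut-off and the toy pairs are assumptions, and each pair is reduced to a (leftmost start, rightmost end) span on one contig with 0 <= start < end <= contig length.

```python
def flag_pairs(pairs, mean_insert, sd_insert, n_sd=3):
    """Return pairs whose span is too long or too short for the library."""
    low = mean_insert - n_sd * sd_insert
    high = mean_insert + n_sd * sd_insert
    return [(s, e) for s, e in pairs if not low <= e - s <= high]


def uncovered_regions(spans, contig_length):
    """Return half-open [start, end) regions that no span covers."""
    depth_change = [0] * (contig_length + 1)
    for start, end in spans:
        depth_change[start] += 1
        depth_change[end] -= 1
    regions, depth, gap_start = [], 0, None
    for pos in range(contig_length):
        depth += depth_change[pos]
        if depth == 0 and gap_start is None:
            gap_start = pos
        elif depth > 0 and gap_start is not None:
            regions.append((gap_start, pos))
            gap_start = None
    if gap_start is not None:
        regions.append((gap_start, contig_length))
    return regions


# Toy contig of 1000 bp, library with 300 +/- 20 bp inserts.
pairs = [(0, 300), (100, 420), (250, 560), (600, 900), (50, 950)]
suspicious = flag_pairs(pairs, mean_insert=300, sd_insert=20)
proper = [p for p in pairs if p not in suspicious]
print(suspicious, uncovered_regions(proper, 1000))
```

Regions without span coverage from properly sized pairs are candidate misjoins; tools such as FRCbam and REAPR formalize this into the features discussed next.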

  • Data congruency

    FRCbam (Vezzi et al. 2012)

    Features

    4 coverage-related features:
    - LOW_COV_PE, HIGH_COV_PE, LOW_NORM_COV_PE, and HIGH_NORM_COV_PE

    4 features for compression/expansion events (CE stats):
    - COMPR_PE, STRECH_PE, COMPR_MP, and STRECH_MP

    6 features on suspicious pair/mate orientations:
    - HIGH_SINGLE_PE, and HIGH_SINGLE_MP
    - HIGH_SPAN_PE, and HIGH_SPAN_MP
    - HIGH_OUTIE_PE, and HIGH_OUTIE_MP

    [Figure: read pairs R1, R2 placed on assembly regions A, B, C]

  • FRCurve

    FRCbam (Vezzi et al. 2012)

    The Feature Response Curve (FRCurve) characterizes the sensitivity (coverage) of the sequence assembler as a function of its discrimination threshold (number of features).

    Feature Response Curve:
    - Overcomes limits of standard indicators (i.e. N50)
    - Captures trade-off between quality and contiguity
    - Deeply connected to ROC curves
    - Features can be used to identify problematic regions
    - Single features can be plotted to identify assembler-specific bias

    FRCbam predicted Assemblathon 2 outcome

    [Figure: Feature Space rhody TOTAL; x axis: Feature threshold, 0 to 8,000; y axis: approximate coverage (%), 0 to 120; assemblers: SGA, Ray, CLC, SOAPdenovo, ALLPATHS-LG PB, ABySS, MSRA-CA, CABOG PB, CABOG, VELVET, ALLPATHS-LG]
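How such a curve is obtained can be sketched as follows, assuming the usual construction: contigs are sorted by decreasing length and, for each feature threshold, the longest contigs are accumulated while their total feature count stays within the threshold; the reported value is their summed length relative to the genome size. The function name and the toy contigs are hypothetical, and this is not the FRCbam code.

```python
def feature_response_curve(contigs, genome_size, thresholds):
    """contigs: list of (length, n_features) tuples.

    Returns a list of (threshold, approximate coverage in %) points.
    """
    ordered = sorted(contigs, key=lambda c: c[0], reverse=True)
    curve = []
    for threshold in thresholds:
        features = 0
        covered = 0
        for length, n_features in ordered:
            if features + n_features > threshold:
                break  # the next contig would exceed the feature budget
            features += n_features
            covered += length
        curve.append((threshold, 100.0 * covered / genome_size))
    return curve


# Toy assembly: one long but suspicious contig and two clean shorter ones.
toy_contigs = [(500, 4), (300, 1), (200, 0)]
print(feature_response_curve(toy_contigs, genome_size=1000, thresholds=[0, 4, 5]))
```

A curve that rises steeply at low thresholds means much of the genome is covered by contigs with few suspicious features, which is what distinguishes it from a pure contiguity metric such as N50.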

  • FeaturesandPCAFeatures!and!PCA!

    5 4 3 2 1

    21

    01

    2

    PCA1

    PCA2

    bifido

    ecoli

    enteroeubac

    fragilis

    kleb

    staphylocossus

    strep

    swigtimbifido

    ecoli

    entero

    fragilis

    fuso7

    kleb

    staphylocossusstrep

    swig

    tim

    bifido

    clap

    clap19

    ecoli

    entero

    fragilis

    fusonuke

    kleb

    strep

    swig

    tim

    bifido

    ecoli

    entero

    eubac

    fragilis

    kleb

    staphylocossus

    strep

    swig

    bifido

    ecolientero

    eubac

    swig

    tim

    bifidoecoli

    enteroeubac

    fragilis

    kleb

    staphylocossus

    [Figure: PCA plot (axes PCA1 and PCA2) of the assemblies, each point labelled by organism: bifido, clap, clap19, copro, ecoli, egg, entero, eubac, fragilis, fuso7, fusonuke, kleb, rhody, staphylocossus, strep, swig, tim]

    Assembled 18 bacterial genomes with 11 assemblers (Illumina + PacBio data)

    PCA performed on features:

    Assemblies of the same organism (family) tend to cluster

    No clear difference when using PacBio data

  • REAPR (Hunt et al. 2013)

    Uses the same principle as FRCurve:

    Identifies suspicious/erroneous positions

    Breaks assemblies at suspicious positions

    The broken assembly is more fragmented but hopefully more correct (REAPR cannot make things worse)
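A minimal sketch of the "break at suspicious positions" idea, assuming the suspicious positions are already known. This is not REAPR's own algorithm or interface: the function name `break_contig`, the 0-based coordinates and the `min_length` filter are all hypothetical and only illustrate why the result is more fragmented but contains no flagged joins.

```python
def break_contig(sequence, suspicious_positions, min_length=1):
    """Split a contig at each flagged 0-based position (hypothetical helper).

    Pieces shorter than min_length are dropped.
    """
    # Keep only positions that fall strictly inside the contig, without duplicates.
    cuts = sorted({p for p in suspicious_positions if 0 < p < len(sequence)})
    pieces = []
    start = 0
    for cut in cuts + [len(sequence)]:
        piece = sequence[start:cut]
        if len(piece) >= min_length:
            pieces.append(piece)
        start = cut
    return pieces


contig = "ACGTACGTACGTACGTACGT"
pieces = break_contig(contig, [8, 14])
print(pieces)
```

With the default `min_length`, joining the pieces gives back the original contig, so no sequence is lost; only the questionable joins are removed.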

  • Conserved core (species) gene space

    CEGMA (http://korflab.ucdavis.edu/datasets/cegma/)

    HMMs for 248 core eukaryotic genes aligned to your assembly to assess completeness of gene space

    complete: 70% aligned

    partial: 30% aligned

    Similar idea based on aa or nt alignments of:

    Golden standard genes from own species

    Transcriptome assembly

    Reference species protein set

    Use e.g. GSNAP/BLAT (nt), exonerate/SCIPIO (aa)
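The completeness idea above can be sketched as a simple tally, assuming you already have, for each core gene, the fraction of its length that aligned to the assembly. This is not CEGMA's implementation: the function name, the gene names and the reading of the slide's cutoffs as "at least 70%" and "at least 30%" are assumptions for illustration.

```python
def summarize_gene_space(aligned_fraction, complete_cutoff=0.70, partial_cutoff=0.30):
    """Tally core genes as complete, partial or missing (hypothetical helper).

    aligned_fraction maps gene name -> fraction of the gene aligned (0.0 to 1.0).
    The cutoffs are assumed to mean "at least 70%" and "at least 30%".
    """
    complete = sum(1 for f in aligned_fraction.values() if f >= complete_cutoff)
    partial = sum(
        1 for f in aligned_fraction.values() if partial_cutoff <= f < complete_cutoff
    )
    total = len(aligned_fraction)
    return {
        "complete": complete,
        "partial": partial,
        "missing": total - complete - partial,
        "percent_complete": 100.0 * complete / total if total else 0.0,
    }


# Made-up alignment fractions for four hypothetical core genes.
example = {"geneA": 0.95, "geneB": 0.72, "geneC": 0.45, "geneD": 0.10}
summary = summarize_gene_space(example)
print(summary)
```

Here "partial" counts only genes that did not reach the complete cutoff; a tool may report the categories differently, so compare like with like.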

  • Other external validation methods

    Restriction map: representation of the cut sites on a given DNA molecule to provide spatial information of genetic loci

    Optical maps can be used to check assembly correctness

    Long PacBio reads can be used as well
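To compare an assembly with a restriction or optical map, you first need the fragment sizes the assembly predicts. A minimal sketch of such an in-silico digest follows; the function names are hypothetical, and it makes the simplifying assumption that the cut falls at the first base of the recognition site, whereas real enzymes cut at an enzyme-specific offset.

```python
def cut_sites(sequence, site):
    """Return 0-based start positions of every occurrence of the recognition site."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


def fragment_lengths(sequence, site):
    """Predicted fragment sizes, cutting at the start of each site (simplification)."""
    boundaries = [0] + cut_sites(sequence, site) + [len(sequence)]
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


# Toy contig with two GAATTC sites.
contig = "AAA" + "GAATTC" + "CCCC" + "GAATTC" + "TT"
sites = cut_sites(contig, "GAATTC")
fragments = fragment_lengths(contig, "GAATTC")
print(sites, fragments)
```

The ordered list of predicted fragment sizes is what would be compared against the observed map; large disagreements point to candidate misassemblies.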

  • Other external validation methods

    De novo reconstructs parts missing in the reference strain

    Correctly assembles long tandem repeats

    De novo assembly (Illumina, PGM)

    Set of unordered and unoriented contigs

    Optical map

    DNA seq contigs

  • Don't panic. And don't rush

    Keeping up with the development can be stressful, so you need to stay calm

    Choose quality before quantity

    Know your biological system, so you know what to expect

    Combine sequencing with other data

    Share knowledge and be nice to your bioinformatics friends

    For each conclusion, ask yourself if it can be an artefact due to:

    Incomplete assembly

    Repeats

    Indels

    Coverage bias

    Divergent sequences (mapping)

  • Know that your final assembly will be incomplete

  • Things that are not there

    [Figure: chromosome ideogram heatmap for chromosomes 1-22 and X (scale bar: 100 Mb). Legend: closed gap, inversion, complex event; STR density from high to low.]

    Extended Data Figure 3 | Genome distribution of closed gaps and insertions. Chromosome ideogram heatmap depicts the normalized density of inserted CHM1 base pairs per 5-Mb bin with a strong bias noted near the end of most chromosomes. Locations of structural variants and closed gaps are given by coloured diamonds to the left of each chromosome: closed gap sequences (red), inversions (green), and complex events (blue).

    RESEARCH LETTER

    ©2014 Macmillan Publishers Limited. All rights reserved

    Chaisson M.J.P. et al. Nature (2014)

    LETTER doi:10.1038/nature13907

    Resolving the complexity of the human genome using single-molecule sequencing

    Mark J. P. Chaisson¹, John Huddleston¹,², Megan Y. Dennis¹, Peter H. Sudmant¹, Maika Malig¹, Fereydoun Hormozdiari¹, Francesca Antonacci³, Urvashi Surti⁴, Richard Sandstrom¹, Matthew Boitano⁵, Jane M. Landolin⁵, John A. Stamatoyannopoulos¹, Michael W. Hunkapiller⁵, Jonas Korlach⁵ & Evan E. Eichler¹,²

    The human genome is arguably the most complete mammalian reference assembly¹⁻³, yet more than 160 euchromatic gaps remain⁴⁻⁶ and aspects of its structural variation remain poorly understood ten years after its completion⁷⁻⁹. To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time DNA sequencing¹⁰. We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome, 78% of which carried long runs of degenerate short tandem repeats, often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long short tandem repeats. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology.

    Data generated by single-molecule, real-time (SMRT) sequencing technology differ drastically from most sequencing platforms because native DNA is sequenced without cloning or amplification, and read lengths typically exceed 5 kilobases (kb). Despite overall lower individual read accuracy (~85%), longer read length facilitates high confidence mapping across a greater percentage of the genome¹¹,¹². We generated ~40-fold sequence coverage from a human CHM1 hydatidiform mole using long-read SMRT sequence technology (average mapped read length = 5.8 kb; Supplementary Table 1). We selected a complete hydatidiform mole to sequence because it is haploid, lacking allelic variation, and provides higher effective sequence coverage. We aligned 93.8% of all sequence reads to the human reference genome (GRCh37) using a modified version of BLASR¹¹ (Supplementary Information) and generated local assemblies of the mapped reads using Celera¹³ and Quiver¹⁴, the latter of which leverages estimates of insertion, deletion and substitution probabilities to determine consensus sequences accurately. We compared the consensus sequences of regions with previously sequenced and assembled large-insert bacterial artificial chromosome (BAC) clones generated from CHM1tert (ref. 15). The comparison shows a consensus sequencing concordance of >99.97% (phred quality = 37.5), with 72% of the errors confined to indels within homopolymer stretches (Supplementary Table 3).

    We initially assessed whether the mapped reads could facilitate closure of any of the 164 interstitial euchromatic gaps within the human reference genome (GRCh37). We extended into gap regions using a reiterative map-and-assemble strategy, in which SMRT whole-genome sequencing (WGS) reads mapping to each edge of a gap were assembled into a new high-quality consensus, which, in turn, served as a template for recruiting additional sequence reads for assembly (Supplementary Information). Using this approach, we closed 50 gaps and extended into 40 others (60 boundaries), adding 398 kb and 721 kb of novel sequence to the genome, respectively (Supplementary Table 4). The closed gaps in the human genome were enriched for simple repeats, long tandem repeats, and high (G+C) content (Fig. 1) but also included novel exons (Supplementary Table 20) and putative regulatory sequences based on DNase I hypersensitivity and chromatin immunoprecipitation followed by high-throughput DNA sequencing (ChIP-seq) analysis (Supplementary Information). We identified a significant 15-fold enrichment of short tandem repeats (STRs) when compared to a random sample (P < 0.00001) (Fig. 1a). A total of 78% (39 out of 50) of the closed gap sequences were composed of 10% or more of STRs. The STRs were frequently embedded in longer, more complex, tandem arrays of degenerate repeats reaching up to 8,000 bp in length (Extended Data Fig. 1a-c), some of which bore resemblance to sequences known to be toxic to Escherichia coli¹⁶. Because most human reference sequences¹⁷,¹⁸ have been derived from clones propagated in E. coli, it is perhaps not surprising that the application of a long-read sequence technology to uncloned DNA would resolve such gaps. Moreover, the length and complex degeneracy of these STRs embedded within (G+C)-rich DNA probably thwarted efforts to follow up most of these by PCR amplification and sequencing.

    Next, we developed a computational pipeline (Extended Data Fig. 2) to characterize structural variation systematically (structural variation defined here as differences ≥50 bp in length, including deletions, duplications, insertions and inversions⁷). Structural variants were discovered by mapping SMRT sequencing reads to the human reference genome¹¹

    ¹Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA. ²Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA. ³Dipartimento di Biologia, Università degli Studi di Bari Aldo Moro, Bari 70125, Italy. ⁴Department of Pathology, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, USA. ⁵Pacific Biosciences of California, Inc., Menlo Park, California 94025, USA.

    [Figure 1, panels a and b. Axis labels: proportion of region with simple repeats; (G+C) content. Categories: gaps, reference; reference flank, gap closure, tandem repeat; gap only, tandem repeats, gap without tandem repeats, sampled reference. P values shown: P = 0.02712, P = 0.00003, P < 0.00001, P < 2.2 × 10⁻¹⁶.]

    Figure 1 | Sequence content of gap closures. a, Gap closures are enriched for simple repeats compared to equivalently sized regions randomly sampled from GRCh37. b, Human genome gaps typically consist of (G+C)-rich sequence (yellow) flanking complex (A+T)-rich STRs (green) (empirical P value; Supplementary Information). Red line indicates genomic (G+C) content.


  • Things that are not there

    Steinberg K.M. et al. Genome Research (2014)

    reference assembly, many groups have described shortcomings of this resource, including remaining gaps, single-nucleotide errors, or gross misassembly due to complex haplotypic variation (Eichler et al. 2004; Doggett et al. 2006; Kidd et al. 2010; Chen and Butte 2011; The 1000 Genomes Project Consortium 2012). Both gaps and misassembled regions often arise because the DNA sequence used for the assembly was from multiple diploid sources containing complex structural variation. Because such loci often contain medically relevant gene families, it is important to resolve variation at these sites, as the structural and single-nucleotide diversity is likely associated with clinical phenotypes (Eichler et al. 2004). Thus, to resolve structurally complex regions and provide a more effective reference resource for such loci, we combined WGS data and BAC sequences from a haploid DNA source to create a single haplotype assembly of the human genome.

    Haplotype information is critical to interpreting clinical and personal genomic information as well as genetic diversity and ancestry data, and most previously sequenced individual human genomes are not haploresolved. The current reference human genome sequence represents a mosaic that further complicates haplotyping; within a BAC clone there is a single haplotype representation, but haplotypes can switch at BAC clone junctions. By utilizing an essentially haploid DNA source, we resolved a single haplotype across complex regions of the genome where the reference genome contained a mixture of haplotypes from various sources and/or contained unresolved gaps. For example, a gap on Chromosome 4p14 in GRCh37 (Chr 4: 40296397-40297096) was completely resolved using CHM1 WGS data. The gap was flanked by repetitive elements that were not traversed by a clone. This region has subsequently been updated with a complete tiling path in GRCh38.

    Figure 5. Overview of the Chr 11 (NC_018922.2) 1.9-Mb region, exhibiting three alignment bins with a large number of PacBio cliff reads where the alignment coverage dropped off sharply. WGS component (light green lines) boundaries flanked by such reads are marked with red dashed lines. The ends of each component at the boundary are labeled with letters to show orientation. Pairs of alignments corresponding to three different PacBio reads are marked in yellow, green, and dark blue. These alignments overlap by <10% on each of the reads. The split alignments for these three reads suggest that the two WGS components marked in purple should be inverted and translocated as indicated by the arrow at the top of the image. The other PacBio reads in these bins exhibit the same pattern of split alignments, which supports the proposed reordering and orientation of the WGS components. The bottom light green lines show a proposed tiling path with the orientation corrected; the letters indicate where each end of the initial tiling path components should be placed.

    CHM1 assembly of the human genome

    Genome Research, www.genome.org

    Downloaded from genome.cshlp.org on November 16, 2014 - Published by Cold Spring Harbor Laboratory Press

  • Summary

    Genome size and repeat content can be estimated without an assembly.

    Removing adapters and trimming low-QV bases is good unless the assembly program does error correction (EC) itself.

    Assess the levels of heterozygosity in your target genome before you assemble (or sequence) it, and set your expectations accordingly.

    Choose an assembler that excels in the area you are interested in (e.g., coverage, continuity) and prepare libraries for it.

    Interested in doing just coding potential analyses (e.g., training a gene finder, studying codon usage bias, looking for intron-specific motifs)? => Consider studying exome assemblies.

    Or consider a proxy: study a species that is evolutionarily close enough and whose genome is of good quality.

  • Summary

    Genome size and repeat content can be estimated without an assembly.

    Removing adapters and trimming low-QV bases is good unless the assembly program does error correction (EC) itself.

    Assess the levels of heterozygosity in your target genome before you assemble (or sequence) it, and set your expectations accordingly.

    Choose an assembler that excels in the area you are interested in (e.g., coverage, continuity, or number of error-free bases).

    Interested in doing just coding potential analyses (e.g., training a gene finder, studying codon usage bias, looking for intron-specific motifs)? => Consider studying exome assemblies.

    Or consider a proxy: study a species that is evolutionarily close enough and whose genome is of good quality.

    Settle on an assembly so Science can continue!
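The first summary point, estimating genome size without an assembly, is commonly done from a k-mer count histogram of the raw reads. A minimal sketch, assuming the histogram (depth -> number of distinct k-mers) was produced by a k-mer counter beforehand: the function name, the `min_depth` error cutoff and the toy numbers are all hypothetical. It uses the usual approximation genome size ≈ total k-mers / peak depth and ignores heterozygosity and repeat structure.

```python
def estimate_genome_size(histogram, min_depth=2):
    """Rough genome size from a k-mer histogram (hypothetical helper).

    histogram maps depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored.
    """
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    # The most populated depth is taken as the single-copy coverage peak.
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram: an error spike at depth 1, a main peak at 20, a repeat at 40.
toy_histogram = {1: 5000, 2: 100, 18: 200, 19: 600, 20: 1000, 21: 600, 22: 200, 40: 50}
size, peak = estimate_genome_size(toy_histogram)
print(size, peak)
```

On real data the error tail can extend past depth 2 and a heterozygous genome shows a second peak at half the depth, so look at the histogram before trusting the number; `min_depth` in particular should be chosen by eye.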

  • Know the future

  • A vision into the future

  • A vision into the future

  • A vision into the future

  • A vision into the future

  • Acknowledgements

    Olga Vinnere Pettersson, Björn Nysted, Ola Spujth, Henrik Lantz, Jacques Daimat, Francesco Vezzi, BGI, Jon Badalamenti (Bond Lab), Stephan C. Schuster (Penn U)