helsinki genome project-20151210-amb

Things to consider when initiating a genome project: the assembly pipeline @ SciLifeLab

Helsinki, Dec 9th 2015

Álvaro Martínez Barrio, [email protected] /in/ambarrio @ambarrio

Workshop Outline

Introducing SciLifeLab
The important considerations of all genome projects
The annotation and assembly platforms
A vision into the future

Survey

How many of you have used sequencing facilities?
Assembled a genome?
Planning to start a genome project?
Have worked with NGS data?
Just curious about NGS?

Things to consider

Repeats
Heterozygosity
Size of your genome
GC content
Access to material and specifically HMW DNA
Access to a good computational cluster
Good bioinformaticians/lab technicians

WHAT IS YOUR SCIENTIFIC QUESTION?

Variation space

Things to consider

http://www.intechopen.com/books/recent-advances-in-autism-spectrum-disorders-volume-i/discovering-the-genetics-of-autism

Figure 2 | Structural variation sequence signatures. There are four general sequence-based analytical approaches used to detect structural variation. Theoretically, read-pair (RP), split-read and assembly methods can be used to discover variants from all classes of structural variant (SV), but each has different biases depending on the underlying sequence content of the variants and the data properties of the sequence reads. However, read-depth approaches can be used to detect only losses (deletions) and gains (duplications), and cannot discriminate between tandem and interspersed duplications. Briefly, read-pair methods analyse the mapping information of paired-end reads and their discordancy from the expected span size and mapped strand properties. Sensitivity, specificity and breakpoint accuracy are dependent on the read length, insert size and physical coverage [3,4,59,62,65,66,68,69]. Breakpoints are indicated by red arrows. Read-depth analysis examines the increase and decrease in sequence coverage to detect duplications and deletions, respectively, and predict absolute copy numbers of genomic intervals [45,62,74-76]. Split-read algorithms are capable of detecting exact breakpoints of all variant classes by analysing the sequence alignment of the reads and the reference genome; however, they usually require longer reads than the other methods and have less power in repeat- and duplication-rich loci [62,78,79]. Assembly algorithms [83-86,115] have the most power to detect SVs of all classes at the breakpoint resolution, but assembling short sequences and inserts often result in contig/scaffold fragmentation in regions with high repeat and duplication content [89]. MEI, mobile-element insertion. Repbase is a database of repetitive elements.


Alkan C., Coe B.P., Eichler E.E. Nature Rev Genetics (2011)

Things to consider

Ward L.D. & Kellis M. Nat Biotechnology (2012)

About me

PhD Bioinformatics 2010
Postdoc Pop Genetics/Comp Biol 2014, L. Andersson + H. Ronne

Herring: Illumina, SOLiD, Moleculo, PacBio
Species Plant: 454, SOLiD, Illumina
Species Seal (~3 Gb): Illumina
Species Beetle: Illumina, PacBio

Pool-seq: A sequencing technique in which sequencing libraries are not prepared from DNA of a single individual or cell but from a mixture of DNA fragments originating from different individuals or cells. In the context of this Review, Pool-seq is used to describe the unbiased sequencing of the entire genome.

Coverage: The number of reads that span a given genomic position.

Sequencing libraries: Sets of fragmented DNA extracted from one or more individuals that serve as the template for subsequent sequencing.

Exome sequencing: A sequencing approach in which the complexity of the genome is reduced through hybridization to exonic sequences, which results in a higher sequence coverage of protein-coding regions.

Restriction-site-associated DNA markers: Sequence polymorphisms in close proximity to a restriction enzyme recognition site.

are subject to larger sampling variance, whereby they can result in considerable errors even when the allele frequencies in the sample have been determined at high accuracy. In other words, accurately sequencing a small population sample will still result in noisy allele frequency estimates. By contrast, Pool-seq makes use of large population samples, but not all chromosomes in the samples are analysed. The higher accuracy to cost ratio of Pool-seq arises from the fact that very few chromosomes are sequenced more than once, whereas for sequencing of individuals each chromosome is typically sequenced multiple times (5-40 times). This advantage is clearly demonstrated in FIG. 1, in which the accuracy of Pool-seq is compared with sequencing of individuals at a fixed sequencing cost (that is, assuming that the same number of sequence reads is used in each case). Although Pool-seq mostly performs better when 50 individuals are pooled, its performance is clearly superior when pooling 100 or more individuals (FIG. 1a). Additionally, the accuracy of Pool-seq relative to sequencing of individuals increases with the coverage of individual genomes (FIG. 1b).

The cost-effectiveness of Pool-seq becomes even more evident when the costs for the preparation of the sequencing libraries are considered: Pool-seq uses a single library for the entire sample, whereas sequencing of individuals requires a separate library to be prepared for each genome. As library construction constitutes ~20% of the total sequencing costs for species with moderate genome sizes, this is an important cost factor.

Comparison to reduced-representation sequencing

Sequencing individuals at a high coverage is undoubtedly the gold standard for obtaining high-quality data, but budget constraints frequently require alternatives for studying large populations. In addition to Pool-seq, other strategies have been developed for sequencing large samples (FIG. 2). Below, we compare different sequencing approaches (TABLE 1) and weigh their particular strengths and weaknesses against those of Pool-seq (TABLE 2).

In contrast to the whole-genome approach of Pool-seq, the cost savings of these alternative approaches are achieved by reducing the representation of the genome in the sequence data. Different strategies for targeting the sequencing to specific regions of the genome can be categorized into exome sequencing [13,14], high-throughput RNA sequencing (RNA-seq) [15] and methods using restriction-site-associated DNA markers [16]. Each of these methods have been combined with pooling to further reduce costs, but each approach has its particular strengths and weaknesses (see below).

Figure 1 | Cost-effectiveness of Pool-seq. The accuracy of allele frequency estimates is compared for whole-genome sequencing of pools of individuals (Pool-seq) and whole-genome sequencing of individuals using the ratio of the standard deviation (SD) of the estimated allele frequency with both methods. The same number of reads is used for both sequencing strategies. A value smaller than one indicates that Pool-seq is more accurate than sequencing of individuals. a | The influence of the pool size is shown. A larger pool size results in higher accuracy of Pool-seq, but Pool-seq still produces more accurate allele frequency estimates even for pool sizes of 50 individuals in most comparisons. Only when the number of sequenced individuals approaches the pool size does sequencing of individuals become the superior strategy. b | Influence of coverage and variation in representation of individuals in a pool is shown. With a lower coverage per individual, the advantage of Pool-seq decreases. It should be noted that with a decreasing coverage per individual, the two approaches produce very similar types of data; that is, sequencing of individuals tends to show the same limitations as Pool-seq, such as for estimating linkage disequilibrium and for distinguishing sequencing errors from low-frequency polymorphisms. Variation in the representation of individuals in the DNA pool reduces the accuracy of Pool-seq only slightly (0% (that is, all individuals are uniformly represented; orange line) and 30% (light blue line)). The graphs were generated with the PIFs software [12], ignoring sequencing errors.

[Figure 1, panels a and b: y-axis SD pool / SD individuals (0.4 to 1.1); x-axis Number of individuals sequenced separately (10 to 50). Legend columns: Pool size, Coverage per sequenced individual, Deviation in DNA content from each individual in the pool.]


Schlötterer C., Tobler R., Kofler R. and Nolte V. Nature Rev Genetics (2014)

Why pooling?

SciLifeLab (promotion slides)

National service
Local scientific center

SciLifeLab

Director (July 2015): Olli Kallioniemi
Co-director: Kerstin Lindblad-Toh

Vision: To be an internationally leading center that develops, uses and provides access to advanced technologies for molecular biosciences with focus on health and environment.

www.scilifelab.se

2010: Strategic research initiative
2013: National resource
2015: New management and chairman

SciLifeLab platforms

National Genomics Infrastructure: Joakim Lundeberg, Ann-Christine Syvänen, Ulf Gyllensten

National Bioinformatics Infrastructure Sweden: Bengt Persson

Clinical Diagnostics: Lars Engstrand

Computer resources free for Swedish researchers: VR, SNIC

Ongoing merge of BILS, WABI and more; complete 2016. National, distributed

Know a good bioinformatician

NBIS - We're here for you!

The Bioinformatics Platform 2016

Funding: The Research Council, SciLifeLab, KAW foundation, host universities

Applied at the Research Council as continued national infrastructure 2016-2023. Decision late 2015.

Custom-tailored support
Tools
Training

Today ~70 FTE

Long-term Support
Wallenberg Advanced Bioinformatics Infrastructure
www.scilifelab.se/facilities/wabi/

Tailored solutions, high impact

Directors: Siv Andersson, Gunnar von Heijne
Managers: Björn Nystedt, Thomas Svensson

Sweden's strongest unit for analyses of large-scale genomic data (24 FTE)

Applied bioinformatics: 500 h free support/project
Variant analyses in health and disease
Transcriptomics
Single-cell analyses
Epigenetics
Metagenomics

National committee reviews and selects projects based on scientific quality

Staff in Stockholm, Uppsala, Lund, Gothenburg, Linköping, Umeå.

WABI personnel (2013-2014)

Johan Reimegård, Mikael Huss, Åsa Björklund, Pär Engström, Jakub Orzechowski Westholm, Estelle Proux-Wéra, Sanella Kjellqvist, Diana Ekman, Pall Olason, Anna Johansson, Marcel Martin, Alvaro Martinez Barrio, Per Unneberg

Know how to handle your data

Today: Human genome sequenced in days, towards the $1000 genome

Massively parallel sequencing requires supercomputers for analysis and storage

1. Sample transfer
2. Data delivery
3. Analysis

SciLifeLab Bioinformatics Compute and Storage (UPPNEX)

Scientists

www.uppmax.uu.se/uppnex
High-performance computers and large-scale storage for bioinformatics analysis.

How do you work on UPPMAX computers?

Login
Submit jobs
Job queue
Job assigned
Work interactively

Resources

2015
Research ~8000 cores
Production ~3200 cores
Redundancy 768 cores
Storage ~11 PB
Long-term storage
Mosler 384 cores

2014
Research 3328 cores
Production 768 cores
Storage ~7 PB
Long-term storage
Mosler 384 cores

Private Cloud 1600 cores
Chipster, CanvasDB

Project growth

[Plot: number of active projects per year, 2004 to 2013; y-axis 100 to 400; series UPPMAX, UPPNEX.]

UPPNEX history

2009: 13,152 MSEK from KAW and SNIC
2012: 23.8 MSEK from KAW/SNIC
2014: 20 + 20 MSEK from KAW for WGS
SNIC receives 47.8 MSEK from VR to handle sensitive data

UPPMAX personnel

+3 more

Know how to extract your DNA

Olga Vinnere Pettersson (UGC) [email protected]

mp4 - http://bit.ly/1Ul7RmH
pptx - http://bit.ly/1Z6yIFH
Q&A - http://bit.ly/1I1Sb6o

Bacteria
Fungi
Insects
Plants

Know how to measure your assembly results

Just a word on N50

N50 typically refers to a con
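The slide text is cut off in the source, so the sketch below falls back on the commonly used definition of N50 (an assumption about what the slide went on to say): sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembled bases. The contig lengths are made-up illustration values.

```python
def n50(lengths):
    """Return N50 under the common definition: the largest length L such that
    contigs of length >= L together hold at least half of all assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Hypothetical contig lengths in bp (not from the talk).
contigs = [100, 80, 50, 40, 30]
print(n50(contigs))
```

A single very long contig can dominate this number, which is one reason N50 alone says little about correctness.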

Know why assembling is difficult

Two types of assemblies

Case 1: Flycatcher (1.2 Gbp), Herring (800 Mbp), Malassezia (7 Mbp)

Case 2: Spruce (20 Gbp), Barnacle (1.4 Gbp), Wolbachia (4 Mbp)

Pre-assembly

Quality trimming
(Error correction)
Kmer analysis
De novo repeat library

Quality trimming

De Bruijn-graph assemblers are in principle sensi

Reads vs kmers

1 read: 100 bp
Kmers: k = 21 bp

\[ N = L - k + 1 = 100 - 21 + 1 = 80 \]

\[ \text{kmer coverage} = \text{base coverage} \times \frac{L - k + 1}{L} \]

Ex:

\[ 50\mathrm{X} \times \frac{100 - 21 + 1}{100} = 40\mathrm{X} \]

(i.e. kmer coverage is 80% of base coverage)
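The arithmetic above can be checked with a few lines of Python. This is only a sketch of the formula on the slide; the function names are made up for illustration.

```python
def kmers_per_read(read_length, k):
    """Number of k-mers in one read of the given length: N = L - k + 1."""
    return read_length - k + 1


def kmer_coverage(base_coverage, read_length, k):
    """K-mer coverage = base coverage * (L - k + 1) / L."""
    return base_coverage * kmers_per_read(read_length, k) / read_length


# The slide's example: 100 bp reads, k = 21, 50X base coverage.
print(kmers_per_read(100, 21))     # 80
print(kmer_coverage(50, 100, 21))  # 40.0
```

Longer k or shorter reads both lower the k-mer coverage you get out of the same amount of sequence.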

Kmer analyses

Compute the frequency of each kmer in the dataset (e.g. Jellyfish --both-strands)
Note: RAM-intense!

How to count kmers?
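To make the idea concrete, here is a toy kmer counter in plain Python. It is a sketch for small inputs, not a replacement for Jellyfish. It counts each kmer together with its reverse complement under one "canonical" key, which is my reading of what the both-strands option above is for; the function names and the two example reads are invented.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count canonical k-mers over all reads (held in memory, so toy-sized only)."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return counts


def histogram(counts):
    """Map depth -> number of distinct k-mers seen at that depth."""
    return Counter(counts.values())


# Two made-up reads.
counts = count_kmers(["ACGTAC", "GTACGT"], k=3)
print(dict(counts))
print(dict(histogram(counts)))
```

The histogram (depth against number of distinct kmers) is the curve that the following slides read genome size, heterozygosity and repeat content from.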

Digging into the kmers

Genome size
Remove low-copy kmers
Iden

Repeats: first shot

The nb of dis

Heterozygosity and ploidy ... and humans are easy.

Bacteria, archaea, fungi, some plants

Most animals, some plants

Many plants

Also: Heterozygosity is generally very low in mammals; most other species are much harder

Heterozygosity with kmer graphs

Double peak in the kmer histogram; clear indication of heterozygosity
Not entirely easy to quantify (although attempts have been made)

A word on quality filtering

Light QC filter
Hard QC filter

A word of precaution on quality filtering!


Fig4.1 17-mer depth distribution

Table4.2 17-mer Data statistics

K    K-mer_NO.        Peak_depth   Genome Size    Used Bases       Used Reads    X
17   22,425,038,045   32           700,782,438    26,827,492,429   275,153,399   38.28

In total, 26.82 Gb of data was retained for the 17-mer analysis. The 17-mer frequency distribution derived from the sequenced reads was plotted in Fig1. The peak of the 17-mer distribution is at about 32 and the total K-mer count is 22,425,038,045, so the genome size can be estimated (by the formula Genome Size = K-mer_num/Peak_depth) as 700.78 Mb.

If the heterozygous rate is higher, a small peak will appear at 1/2 of the peak depth. This K-mer analysis can therefore be used to roughly determine the heterozygous rate of a given genome.

The distribution can also be used to assess the repeat content of the genome: if the genome contains a high proportion of repeats, the distribution will display a fat tail, which indicates that a larger than expected proportion of the genome has a high sequencing depth, possibly due to sequence similarity.

Conclusion: Genome size is 700.78 Mb, and the heterozygous rate in this genome is too high to do whole genome shotgun sequencing and assembly.
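The estimate in the report can be reproduced directly from the numbers in Table4.2. The sketch below applies the report's own formula (Genome Size = K-mer_num/Peak_depth); the second function is an extra consistency check under the assumption that every read contributes (length - k + 1) kmers, and all function names are made up.

```python
def genome_size(total_kmers, peak_depth):
    """Genome size estimate = total k-mer count / depth of the main peak."""
    return total_kmers / peak_depth


def expected_total_kmers(used_bases, used_reads, k):
    """Total k-mers if every read yields (length - k + 1) k-mers (assumption)."""
    return used_bases - used_reads * (k - 1)


# Values copied from Table4.2.
K = 17
TOTAL_KMERS = 22_425_038_045
PEAK_DEPTH = 32
USED_BASES = 26_827_492_429
USED_READS = 275_153_399

size = genome_size(TOTAL_KMERS, PEAK_DEPTH)
print(round(size / 1e6, 2))         # 700.78 (Mb)
print(round(USED_BASES / size, 2))  # 38.28 (the X column: base coverage)
print(expected_total_kmers(USED_BASES, USED_READS, K) == TOTAL_KMERS)
```

The estimate is only as good as the choice of peak: with a heterozygous sample, picking the half-depth peak instead of the main one would roughly double the inferred size.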

4.3 Estimation of heterozygous rate

We simulated the herring genome with different heterozygosity rates and ran the 17-mer analysis on each simulation, which gives figure 2.

[Fig4.1 plot: x-axis Depth (X), 0 to 100; y-axis Percentage (%), 0 to 2.]

Fig4.2 Hybrid effect on K-mer distribution.

The X axis is the depth of the 17-mers and the Y axis is the ratio of 17-mers. Epi is the 17-mer curve of herring. H_0.01067 means that the heterozygosity rate is 1.067%, H_0.012 is 1.2% and H_0.015 is 1.5%.

From this figure, we can see that as the heterozygosity rate increases, the sub-peak at half of the expected K-mer depth on the X axis becomes more apparent. We conclude that the heterozygosity rate of the herring genome is about 1.5%.

[Fig4.2 plot: x-axis Depth (X), 0 to 80; y-axis Percentage, 0 to 2; curves H_0.01067, Epi, H_0.012, H_0.015.]



The heterozygosity was estimated to be 1.5%

Why repeats destroy assemblies

Genome assembly - things to think about

Repeat library and repeat quantification

Create a de novo repeat library
Run a low-coverage (e.g. 0.1X) assembly (e.g. RepeatExplorer or Trinity)
Filter contaminants and mito/chloro
[Make non-redundant (e.g. Cdhit)]
Quantify the (high) repeat content by an independent subset of reads
- Mapping (e.g. bwa), or
- Mask with RepeatMasker
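One practical detail in the recipe above is drawing two small, non-overlapping read subsets: one for the low-coverage assembly and an independent one for quantification. The following is a minimal sketch of that step only, assuming single-end reads held in a list; the function names, genome size and read counts are illustrative, not the author's pipeline.

```python
import random


def sampling_fraction(target_coverage, genome_size, read_length, n_reads):
    """Fraction of reads to keep so that kept bases / genome size ~ target coverage."""
    needed_reads = target_coverage * genome_size / read_length
    return min(1.0, needed_reads / n_reads)


def two_disjoint_subsets(reads, fraction, seed=0):
    """Split off two non-overlapping random subsets, each ~fraction of the reads.

    Requires fraction <= 0.5. The first subset is meant for building the repeat
    library, the second for quantifying repeats independently.
    """
    if not 0 <= fraction <= 0.5:
        raise ValueError("fraction must be between 0 and 0.5")
    rng = random.Random(seed)
    library, quantify = [], []
    for read in reads:
        u = rng.random()
        if u < fraction:
            library.append(read)
        elif u < 2 * fraction:
            quantify.append(read)
    return library, quantify


# Hypothetical numbers: 700 Mbp genome, 100 bp reads, 275 million reads, 0.1X target.
frac = sampling_fraction(0.1, 700_000_000, 100, 275_000_000)
print(frac)
```

For paired-end data the same draw has to be applied to both mates of a pair, which this sketch does not handle.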

A real example

[Plot: Coverage against %GC]

5 Mbp mitochondrion in spruce

Repeat library from low coverage data

Sparse seq data: R R R R R
Overlaps?

Assembled con

Repeat library from low coverage data

Quan

Classifying repeats

LTR Gypsy/Copia
LINE/SINE
DNA elements

This is very tricky

Classifying the repeat library directly
RepeatMasker
Repeat protein domain search (http://www.repeatmasker.org/cgi-bin/RepeatProteinMaskRequest)

Problems
No close homologs in databases
Rapid evolution of repeats (like transposable elements, TEs)
Non-autonomous TEs do not contain proteins

Solutions
Fetch intact ORFs from hits in assembly
Extend assembly matches and get more complete elements
Check match alignment profiles in assembly (LINEs conserved at 3' end but not at 5' ..)
=> Often slow, manual, species-specific solutions

Know the technology bias

Coverage distribution

[Plot: x-axis Coverage, 0 to 100; y-axis Number of Mb's in hg19, 0 to 350; series 454, Illumina, SOLiD; average coverage marked.]

Sequence coverage

WGS data (Roche 454 Titanium)
Paired-end data (Roche 454 Titanium)

Clark M.J., et al. Nat. Biotech (2011)

Performance comparison of exome DNA sequencing technologies. (Mike Snyder's lab)

Ning L., et al. Scientific Reports (2015)

Know the assembly algorithms

1 pre-processing
2 assembly
3 finishing / polishing

Short Reads (Illumina) - graph assembly
adapter removal
quality trimming
error correction
de Bruijn or string graph construction
contigs
scaffolding (read pairs, gaps shown as NNNNNN)
read mapping

Long Reads (PacBio) - HGAP assembly
[Plot: reads against read length]
read self-correction
overlap-layout-consensus assembly
consensus calling with quiver
assembled genome

ATCGTT-CCGAGTCTCCCCGCAATCGCAAGCG-TTTCAT CGAGTCT-CGCGCAATCGCAAGCG-TTTCATCGTT-CCGAGTCTCCCCGCCATC TT-CCGAGACTCCCCGCAATCGCAAGCGATT GTTTCCGAGTCTCCCCGCAATCGCTAGCG-TTGCAT

the overall assembly strategy is the same
but the data and tools are fundamentally different

http://www.lucigen.com/NxSeq-Long-Mate-Pair-Library-Kit/





Many instruments, too many solutions

Assembler Name   Algorithm       Input
Arachne          OLC             Sanger
CAP3             OLC             Sanger
TIGR             Greedy          Sanger
Newbler          OLC             454/Roche
Edena            OLC             Illumina
SGA              OLC             Illumina
MaSuRCA          De Bruijn/OLC   Illumina
Velvet           De Bruijn       Illumina
ALLPATHS         De Bruijn       Illumina/PacBio
ABySS            De Bruijn       Illumina
SOAPdenovo       De Bruijn       Illumina
CLC              De Bruijn       Illumina/454
CABOG            OLC             Hybrid

No easy way to determine best assembly/assembler
Implemented heuristics are the key issue
Choice of approach depends on data being assembled
Currently efforts ongoing to establish best practices
Assemblathons and GAGE to evaluate existing solutions

OLC vs. de Bruijn

OLC

Pros: Can use longer reads properly
Cons: Time consuming, high memory requirements

de Bruijn

Generate assembly via de Bruijn

Martin & Wang, Nat. Rev. Genet. (2011)

Pros: Computationally efficient, can work with large coverage short read datasets

Cons: Sensitive to sequence errors, connection between assembly and read is lost, does not work so well with longer reads

De Bruijn
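A toy de Bruijn graph shows where two of the cons listed above come from. In this sketch (illustrative only; the function names and reads are invented, and real assemblers add error correction, reverse complements and graph simplification), every kmer becomes an edge between its (k-1)-mer prefix and suffix, so the graph no longer remembers which read a kmer came from, and a single base error opens a branch.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer is an edge
    from its prefix to its suffix. Read identity is discarded on purpose."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from start for as long as there is exactly one way forward."""
    path = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop on cycles
            break
        seen.add(node)
        path += node[-1]
    return path


# Error-free toy reads from the made-up sequence ATGGCGTGCAA.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
print(walk_unambiguous(de_bruijn(reads, 4), "ATG"))

# One read with a single base error creates a branch and the contig stops early.
noisy = reads + ["ATGGCTT"]
print(walk_unambiguous(de_bruijn(noisy, 4), "ATG"))
```

This is why quality trimming and error correction sit in front of de Bruijn assemblers in the pre-processing step.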



















Some recommendations

Large eukaryote genome, Illumina data: Allpaths-LG (needs specific libraries), SOAPdenovo, SGA, Masurca, DISCOVAR

Large eukaryote genome, additional longer reads: Masurca, Newbler, CABOG

Small eukaryote or prokaryote genome, Illumina data: Spades, Masurca, SOAPdenovo, Abyss, Velvet, DISCOVAR

Small eukaryote or prokaryote genome, mixed data: MIRA, Spades, Masurca, Newbler

Need to run in parallel: Abyss, Ray
Amplified data (Single Cell Genomics): Spades

Standard contiguity metrics

Just a word on N50

N50 typically refers to a con

The devil is in the repeats

De Novo Assembly: Instruments, Our Experience, Validation

Repeats and Short Reads

Moreover: short reads

Short reads make everything harder!!

F. Vezzi, NGS

C R A B

Mathematically best result:

Repeat errors

Overlapping non-identical reads
Collapsed repeats and chimeras
Wrong contig order
Inversions

ATCGGGTATATAG-CCTA
||||||| || || ||||
ATCGGGTGTACAGCCCTA

[Diagram labels: A, B, A&B]

Collapsable repeat errors (worst!)

Know how to patch gaps/finalize

CCS vs CLR

other options for assembling PacBio reads

https://github.com/PacificBiosciences/Bioinformatics-Training/wiki/Large-Genome-Assembly-with-PacBio-Long-Reads

Hybrid assemblies

PacBio data is noisy

PacBio data cannot (currently) be assembled in its raw state

Several strategies exist for correcting reads prior to assembly

Correction without complementary technology used to be difficult; until recently, it was limited by computational power and SMRT cell throughput

Koren & Phillippy, Curr Op Micro 2014

Hybrid assemblers (for PacBio)

Zimin A.V., Marçais G., Puiu D., Roberts M., Salzberg S.L., Yorke J.A. Bioinformatics (2013)

Pure PacBio


Finishing/Polishing (Olli-Pekka)

Quiver isn't perfect: using Pilon to polish remaining indels

Pilon makes use of short read mapping to identify potential indels, SNPs, ambiguous bases and local misassemblies

$ java -Xmx16G -jar path/to/pilon-1.8.jar \
    --genome path/to/fasta --unpaired path/to/mapping.bam \
    --output sample_name --changes --variant --tracks \
    --mindepth 100

Pilon removed 128 remaining indels in a 3.8 Mbp genome despite Quiver calling > QV 55 consensus

LETTER, Nature, vol. 527, p. 508, 26 November 2015. doi:10.1038/nature15714

Single-molecule sequencing of the desiccation-tolerant grass Oropetium thomaeum

Robert VanBuren, Doug Bryant, Patrick P. Edger, Haibao Tang, Diane Burgess, Dinakar Challabathula, Kristi Spittle, Richard Hall, Jenny Gu, Eric Lyons, Michael Freeling, Dorothea Bartels, Boudewijn Ten Hallers, Alex Hastie, Todd P. Michael & Todd C. Mockler

Plant genomes, and eukaryotic genomes in general, are typically repetitive, polyploid and heterozygous, which complicates genome assembly [1]. The short read lengths of early Sanger and current next-generation sequencing platforms hinder assembly through complex repeat regions, and many draft and reference genomes are fragmented, lacking skewed GC and repetitive intergenic sequences, which are gaining importance due to projects like the Encyclopedia of DNA Elements (ENCODE) [2]. Here we report the whole-genome sequencing and assembly of the desiccation-tolerant grass Oropetium thomaeum. Using only single-molecule real-time sequencing, which generates long (>16 kilobases) reads with random errors, we assembled 99% (244 megabases) of the Oropetium genome into 625 contigs with an N50 length of 2.4 megabases. Oropetium is an example of a near-complete draft genome which includes gapless coverage over gene space as well as intergenic sequences such as centromeres, telomeres, transposable elements and rRNA clusters that are typically unassembled in draft genomes. Oropetium has 28,466 protein-coding genes and 43% repeat sequences, yet with 30% more compact euchromatic regions it is the smallest known grass genome. The Oropetium genome demonstrates the utility of single-molecule real-time sequencing for assembling high-quality plant and other eukaryotic genomes, and serves as a valuable resource for the plant comparative genomics community.

The genomes of Arabidopsis [3], rice [4], poplar, grape and Sorghum [5] were first sequenced using high-quality and reiterative Sanger-based approaches producing a series of gold standard reference genomes. The advent of next-generation sequencing (NGS) technologies reduced costs of sequencing substantially, which has enabled sequencing of over 100 plant genomes [1]. The quality of plant genome assemblies depends on genome size, ploidy, heterozygosity and sequence coverage, but most NGS-based genomes have on the order of tens of thousands of short contigs distributed in thousands of scaffolds. The short read lengths of NGS, inherent biases and non-random sequencing errors have resulted in highly fragmented draft genome assemblies that are not complete, which means they are missing biologically meaningful sequences including entire genes, regulatory regions, transposable elements, centromeres, telomeres and haplotype-specific structural variations. It is becoming clear from ENCODE projects that complete genomes are needed to better understand the importance of the non-coding regions of genomes [2].

More than 40% of calories consumed by humans are derived from grasses, and the grass family (Poaceae) is arguably the most important plant family with regard to global food security [6]. The size and complexity of most grass genomes has challenged progress in gene discovery and comparative genomics, although draft genomes are now available for most agriculturally important grasses [1]. The largest genome assemblies, such as maize (2,300 megabases (Mb)) [7], barley (5,100 Mb) [8] and wheat (hexaploid, 17,000 Mb) [9] are highly fragmented as a result of the inability of current sequencing technologies to span complex repeat regions. Near-finished reference genomes are available for rice [4], Sorghum [5] and Brachypodium [10], but more high-quality grass genomes are needed for comparative genomics and gene discovery. Here we present the near-complete draft genome of the grass Oropetium thomaeum, the first high-quality reference genome from the Chloridoideae subfamily. The draft genome is near complete because we were able to sequence through complex repeat regions that are unassembled in most draft genomes. Oropetium has the smallest known grass genome at 245 Mb and is also a resurrection plant that can survive the extreme water stress such as loss of >95% of cellular water (Fig. 1) [11].

Single-molecule real-time (SMRT) sequencing (Pacific Biosciences) produces long and unbiased sequences, which enables assembly of complex repeat structures and GC- and AT-rich regions that are often unassembled or highly fragmented in NGS-based draft genomes. We generated ~72x sequencing coverage of the Oropetium genome using 32 SMRT cells on the PacBio RS II platform (which is equivalent to

Know how to annotate

Annotation (Jarkko)

BILS assembly and annotation service

Henrik Lantz - Team leader
Mahesh Panchal - Assembly
Jacques Dainat - Annotation
Martin Norling - Assembly
Lucile Soler - Annotation

5 PhDs, all in Uppsala

Annotation 2 years, assembly 1 year
Not driving own research, focusing on support
80 h of free support to all projects - submitted by customer
Dedicated compute cluster for annotation, ~160 cores
Assemblies run on shared cluster, ~3200 cores
All organisms - all types of data
Close contact with sequencing facilities


Annotation/Assembly technology

Assembly
Perl/Make pipeline
Pre-assembly: Quality control, kmer analyses
Assembly: Different assembly programs
Assembly validation: FRCbam, Quast, Own tools

Annotation
Maker-MPI: proteins, RNA-seq
Refinement scripts
Functional annotation: Blast, Synteny

Knowhowtovalidate

Assembly validation

Assembly validation: is it important?

Sometimes, easy questions are the most difficult:
  Is my de novo assembly correct?
  Which assembler do I need to use?
  I just used all the possible assemblers one can think of. How do I pick one now?
  Does my assembly contain genes?
  Is my assembly good enough to perform gene annotation?


Assembly validation

Assembly validation is extremely difficult
Too often only connectivity measures are used
There is no real solution, only a set of best practices that one can follow
Assembly validation has recently received a lot of attention

Evaluating assemblies with a reference

Counting errors is not always possible:
  Reference almost always absent.
  Error types are not weighted accordingly.

Visualization is useful, however:
  No automation
  Does not scale to large genomes

Wow. It looks like it is difficult even with the answer.

Evaluating assemblies without a reference

Statistics (N50, etc.)
Congruency with raw sequencing data:
  Alignments
  QA tools
  FRCbam
  REAPR
Gene space:
  CEGMA
  Reference genes
  Transcriptome

There is no real recipe or tool. We can only suggest some best practices.
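As a reminder of what the N50 statistic actually measures, here is a minimal sketch. The function name `n50` is an assumption for illustration; tools such as Quast report this value along with many others.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


if __name__ == "__main__":
    print(n50([80, 70, 50, 40, 30, 20]))
```

N50 only describes contiguity: joining contigs incorrectly raises it, which is why it is listed here next to congruency and gene-space checks instead of on its own.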

Your reads are often the best source to validate your assemblies

Check your insert sizes again (PicardTools, http://picard.sourceforge.net)

Plotting coverage x %GC x length

Post assembly: am I on the right track?

Check library insert sizes (use PicardTools, http://picard.sourceforge.net/)

[Figure: insert size histograms (x axis: insert size; y axis: count; pair orientations FR, RF and TANDEM) for paired-end (PE) and mate-pair (MP) libraries mapped back to the assembly. Panels: All_Reads in MP_on_masurca_sorted.bam, All_Reads in PE_on_masurca_sorted.bam, and All_Reads in 7_130425_AD1YUEACXX_P469_101_index12_trimmedtoassembly.abyss.scaf_onlyAligned.bam. Plot labels: your genome, mitochondrion, contaminations.]

Failed MP or bad assembly?
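One way to put a number on the question above is to summarise the observed insert sizes robustly and count how many pairs fall outside the range the library was designed for. The sketch below assumes the signed template lengths have already been extracted from the alignments; the function names and the `low`/`high` bounds are hypothetical and this is not how PicardTools is implemented.

```python
from statistics import median


def insert_size_summary(template_lengths):
    """Median and median absolute deviation (MAD) of absolute insert sizes.

    Zero-length templates (unpaired or unmapped mates) are skipped.
    """
    sizes = [abs(t) for t in template_lengths if t != 0]
    if not sizes:
        raise ValueError("no usable insert sizes")
    med = median(sizes)
    mad = median(abs(s - med) for s in sizes)
    return med, mad


def fraction_outside(template_lengths, low, high):
    """Fraction of pairs whose insert size falls outside [low, high]."""
    sizes = [abs(t) for t in template_lengths if t != 0]
    if not sizes:
        raise ValueError("no usable insert sizes")
    return sum(1 for s in sizes if s < low or s > high) / len(sizes)


if __name__ == "__main__":
    tlen = [300, 310, 290, -305, 0, 5000]
    print(insert_size_summary(tlen))
    print(fraction_outside(tlen, 200, 400))
```

A mate-pair library whose median sits near the paired-end size, or a large fraction of pairs outside the expected range, points to either a failed library or a misassembly; the histogram is still needed to tell the two apart.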

Plot cov vs %GC vs length

Look at the plots and at the tables; duplication rate is an important measure.

You need to check whether the plot(s) coincide with what you expect.
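The table behind such a plot can be assembled with very little code. The sketch below assumes contig sequences and a mean coverage per contig are already available as dictionaries; the names `gc_fraction`, `contig_rows` and `coverage_outliers` and the 3-fold cutoff are illustrative assumptions.

```python
from statistics import median


def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (0.0 if there are none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def contig_rows(contigs, coverage):
    """One row per contig: (name, length, GC fraction, mean coverage)."""
    return [(name, len(seq), gc_fraction(seq), coverage[name])
            for name, seq in contigs.items()]


def coverage_outliers(rows, fold=3.0):
    """Names of contigs whose coverage differs from the median by more than `fold`."""
    med = median(row[3] for row in rows)
    return [row[0] for row in rows if row[3] > fold * med or row[3] < med / fold]


if __name__ == "__main__":
    contigs = {"ctg1": "ACGT" * 10, "ctg2": "AATTGC" * 8, "ctg3": "AT" * 8 + "GC" * 2}
    coverage = {"ctg1": 50.0, "ctg2": 48.0, "ctg3": 900.0}
    rows = contig_rows(contigs, coverage)
    print(coverage_outliers(rows))
```

Contigs with much higher coverage than the rest are candidates for organelle sequence or collapsed repeats, and clusters with a different GC content may be contamination; both need to be checked against what you expect from the organism.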

[Figure: Plotting coverage and GC content. Panels: GC vs. coverage, frequency histogram of coverage, and contig length (kbp) vs. coverage at two zoom levels.]

http://picard.sourceforge.net

Data congruency

Idea: map read pairs back to the assembly and look for discrepancies such as:
  no read coverage
  no span coverage
  too long/short pair distances

Reads can be aligned back to the assembly to identify suspicious features.

But what do we do with these features?

FRCbam (Vezzi et al. 2012)
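The first two discrepancies in the list, no read coverage and no span coverage, reduce to the same interval problem. The sketch below is not FRCbam; it assumes alignments have already been reduced to half-open (start, end) intervals on one contig, and the function name `uncovered_regions` is hypothetical.

```python
def uncovered_regions(contig_length, intervals):
    """Return half-open (start, end) regions of the contig covered by no interval.

    Pass read alignments to find regions without read coverage, or the outer
    coordinates of each read pair to find regions without span coverage.
    """
    diff = [0] * (contig_length + 1)        # difference array of depth changes
    for start, end in intervals:
        start, end = max(start, 0), min(end, contig_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    regions, depth, gap_start = [], 0, None
    for pos in range(contig_length):
        depth += diff[pos]
        if depth == 0 and gap_start is None:
            gap_start = pos                  # a gap opens here
        elif depth > 0 and gap_start is not None:
            regions.append((gap_start, pos))  # the gap closes here
            gap_start = None
    if gap_start is not None:
        regions.append((gap_start, contig_length))
    return regions


if __name__ == "__main__":
    reads = [(0, 10), (40, 50)]   # two mates of one pair
    pairs = [(0, 50)]             # the fragment they span
    print(uncovered_regions(50, reads))
    print(uncovered_regions(50, pairs))
```

A region without read coverage that is still spanned by pairs is usually less worrying than a region that no pair spans, since the latter means nothing in the data supports the join.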

Data congruency

FRCbam (Vezzi et al. 2012)

Features

4 coverage-related features:
  LOW_COV_PE, HIGH_COV_PE, LOW_NORM_COV_PE, and HIGH_NORM_COV_PE
4 features for compression/expansion events (CE stats):
  COMPR_PE, STRECH_PE, COMPR_MP, and STRECH_MP
6 features on suspicious pair/mate orientations:
  HIGH_SINGLE_PE, and HIGH_SINGLE_MP
  HIGH_SPAN_PE, and HIGH_SPAN_MP
  HIGH_OUTIE_PE, and HIGH_OUTIE_MP

[Figure: diagram with region labels A, R1,2, B; A, R1,2, C, B; and A, R1, B, R2, C; sequence shown: AGAGCTAGCAGAGCTAGCAGATCTCGCAGATCTCGC.]

Reads can be aligned back to the assembly to identify suspicious features.
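The compression/expansion features rest on a simple idea: if the inserts spanning a position are on average shorter or longer than the library as a whole, sequence has probably been collapsed or wrongly inserted there. A minimal sketch of the usual form of the CE statistic follows; the slides do not show how FRCbam computes its features, so the function names and the cutoff of 3 are assumptions.

```python
from math import sqrt
from statistics import mean


def ce_statistic(local_inserts, library_mean, library_sd):
    """Z-score of the mean insert size of pairs spanning one position.

    Z = (local mean - library mean) / (library sd / sqrt(n))
    """
    n = len(local_inserts)
    if n == 0 or library_sd <= 0:
        raise ValueError("need at least one insert and a positive library sd")
    return (mean(local_inserts) - library_mean) / (library_sd / sqrt(n))


def classify(z, cutoff=3.0):
    """Label a position from its CE statistic (the cutoff is an assumed value)."""
    if z <= -cutoff:
        return "compression"
    if z >= cutoff:
        return "expansion"
    return "ok"


if __name__ == "__main__":
    z = ce_statistic([2800] * 25, library_mean=3000, library_sd=300)
    print(round(z, 2), classify(z))
```

Because the denominator shrinks with the number of spanning pairs, a modest but consistent shift in a well-covered region is flagged while the same shift in a poorly covered region is not.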

FRCurve

FRCbam (Vezzi et al. 2012)

FRCbam predicted the Assemblathon 2 outcome

The Feature Response Curve (FRCurve) characterizes the sensitivity (coverage) of the sequence assembler as a function of its discrimination threshold (number of features).

Feature Response Curve:
  Overcomes limits of standard indicators (e.g. N50)
  Captures trade-off between quality and contiguity
  Deeply connected to ROC curves
  Features can be used to identify problematic regions
  Single features can be plotted to identify assembler-specific bias

[Figure: Feature Space rhody TOTAL. x axis: feature threshold (0 to 8,000); y axis: approximate coverage (%) (0 to 120). Curves: SGA, Ray, CLC, SOAPdenovo, ALLPATHS-LG PB, ABySS, MSRA-CA, CABOG PB, CABOG, VELVET, ALLPATHS-LG.]
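To make the definition of the curve concrete: for each feature threshold, contigs are taken from longest to shortest for as long as their summed features stay within the threshold, and the bases collected are reported as a percentage of the genome size. The sketch below follows that description only; it is not FRCbam's code, and the function names and the (length, feature count) input format are assumptions.

```python
def frc_coverage(contigs, genome_size, threshold):
    """Approximate coverage (%) reached before the feature budget is exceeded.

    contigs: iterable of (length, number_of_features) tuples.
    """
    covered = 0
    features = 0
    for length, n_features in sorted(contigs, key=lambda c: c[0], reverse=True):
        if features + n_features > threshold:
            break                      # the next contig would exceed the budget
        features += n_features
        covered += length
    return 100.0 * covered / genome_size


def frcurve(contigs, genome_size, thresholds):
    """Points (threshold, coverage %) of the Feature Response Curve."""
    return [(t, frc_coverage(contigs, genome_size, t)) for t in thresholds]


if __name__ == "__main__":
    assembly = [(500, 4), (300, 1), (200, 0)]
    for point in frcurve(assembly, genome_size=1000, thresholds=[0, 4, 5]):
        print(point)
```

An assembly whose curve rises steeply reaches high coverage with few suspicious features; coverage can exceed 100% when the assembly is larger than the estimated genome, which is why the y axis in the plot runs past 100.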

Features and PCA

[Figure: two PCA plots (PCA1 vs. PCA2) of assembly features; points are labelled by organism: bifido, clap, clap19, copro, ecoli, egg, entero, eubac, fragilis, fuso7, fusonuke, kleb, rhody, staphylocossus, strep, swig, tim.]

Assembled 18 bacterial genomes with 11 assemblers (Illumina + PacBio data)

PCA performed on features:
  Assemblies of the same organism (family) tend to cluster
  No clear difference when using PacBio data

REAPR (Hunt et al. 2013)

Uses the same principle as FRCurve:
  Identifies suspicious/erroneous positions
  Breaks assemblies at suspicious positions

The broken assembly is more fragmented but hopefully more correct (REAPR cannot make things worse)

Conserved core (species) gene space

Gene space

CEGMA (http://korflab.ucdavis.edu/datasets/cegma/)
  HMMs for 248 core eukaryotic genes aligned to your assembly to assess completeness of gene space
  complete: 70% aligned
  partial: 30% aligned

Similar idea based on aa or nt alignments of:
  Gold standard genes from own species
  Transcriptome assembly
  Reference species protein set
Use e.g. GSNAP/BLAT (nt), exonerate/SCIPIO (aa)

Other external validation methods

Restriction map: representation of the cut sites on a given DNA molecule to provide spatial information of genetic loci
Optical maps can be used to check assembly correctness
Long PacBio reads can be used as well

Other external validation methods

De novo reconstructs parts missing in the reference strain
Correctly assembles long tandem repeats

[Figure labels: de novo assembly (Illumina, PGM); set of unordered and unoriented contigs; optical map; DNA sequence contigs.]

Don't panic. And don't rush

Keeping up with the development can be stressful, so you need to stay calm
Choose quality before quantity
Know your biological system, so you know what to expect
Combine sequencing with other data
Share knowledge and be nice to your bioinformatics friends

For each conclusion, ask yourself if it can be an artefact due to:
  Incomplete assembly
  Repeats
  Indels
  Coverage bias
  Divergent sequences (mapping)

Know that your final assembly will be incomplete

Things that are not there

[Figure: chromosome ideograms (1 to 22 and X; scale bar 100 Mb) with a heatmap of STR density (high to low) and markers for closed gaps, inversions and complex events.]

Extended Data Figure 3 | Genome distribution of closed gaps and insertions. Chromosome ideogram heatmap depicts the normalized density of inserted CHM1 base pairs per 5-Mb bin with a strong bias noted near the end of most chromosomes. Locations of structural variants and closed gaps are given by coloured diamonds to the left of each chromosome: closed gap sequences (red), inversions (green), and complex events (blue).

©2014 Macmillan Publishers Limited. All rights reserved

Chaisson M.J.P. et al. Nature (2014), doi:10.1038/nature13907

Resolving the complexity of the human genome using single-molecule sequencing

Mark J. P. Chaisson (1), John Huddleston (1,2), Megan Y. Dennis (1), Peter H. Sudmant (1), Maika Malig (1), Fereydoun Hormozdiari (1), Francesca Antonacci (3), Urvashi Surti (4), Richard Sandstrom (1), Matthew Boitano (5), Jane M. Landolin (5), John A. Stamatoyannopoulos (1), Michael W. Hunkapiller (5), Jonas Korlach (5) & Evan E. Eichler (1,2)

The human genome is arguably the most complete mammalian reference assembly [1-3], yet more than 160 euchromatic gaps remain [4-6] and aspects of its structural variation remain poorly understood ten years after its completion [7-9]. To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time DNA sequencing [10]. We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome, 78% of which carried long runs of degenerate short tandem repeats, often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long short tandem repeats. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology.

Data generated by single-molecule, real-time (SMRT) sequencing technology differ drastically from most sequencing platforms because native DNA is sequenced without cloning or amplification, and read lengths typically exceed 5 kilobases (kb). Despite overall lower individual read accuracy (~85%), longer read length facilitates high confidence mapping across a greater percentage of the genome [11,12]. We generated ~40-fold sequence coverage from a human CHM1 hydatidiform mole using long-read SMRT sequence technology (average mapped read length = 5.8 kb; Supplementary Table 1). We selected a complete hydatidiform mole to sequence because it is haploid, lacking allelic variation, and provides higher effective sequence coverage. We aligned 93.8% of all sequence reads to the human reference genome (GRCh37) using a modified version of BLASR [11] (Supplementary Information) and generated local assemblies of the mapped reads using Celera [13] and Quiver [14], the latter of which leverages estimates of insertion, deletion and substitution probabilities to determine consensus sequences accurately. We compared the consensus sequences of regions with previously sequenced and assembled large-insert bacterial artificial chromosome (BAC) clones generated from CHM1tert (ref. 15). The comparison shows a consensus sequencing concordance of >99.97% (phred quality = 37.5), with 72% of the errors confined to indels within homopolymer stretches (Supplementary Table 3).

We initially assessed whether the mapped reads could facilitate closure of any of the 164 interstitial euchromatic gaps within the human reference genome (GRCh37). We extended into gap regions using a reiterative map-and-assemble strategy, in which SMRT whole-genome sequencing (WGS) reads mapping to each edge of a gap were assembled into a new high-quality consensus, which, in turn, served as a template for recruiting additional sequence reads for assembly (Supplementary Information). Using this approach, we closed 50 gaps and extended into 40 others (60 boundaries), adding 398 kb and 721 kb of novel sequence to the genome, respectively (Supplementary Table 4). The closed gaps in the human genome were enriched for simple repeats, long tandem repeats, and high (G+C) content (Fig. 1) but also included novel exons (Supplementary Table 20) and putative regulatory sequences based on DNase I hypersensitivity and chromatin immunoprecipitation followed by high-throughput DNA sequencing (ChIP-seq) analysis (Supplementary Information). We identified a significant 15-fold enrichment of short tandem repeats (STRs) when compared to a random sample (P < 0.00001) (Fig. 1a). A total of 78% (39 out of 50) of the closed gap sequences were composed of 10% or more of STRs. The STRs were frequently embedded in longer, more complex, tandem arrays of degenerate repeats reaching up to 8,000 bp in length (Extended Data Fig. 1a-c), some of which bore resemblance to sequences known to be toxic to Escherichia coli [16]. Because most human reference sequences [17,18] have been derived from clones propagated in E. coli, it is perhaps not surprising that the application of a long-read sequence technology to uncloned DNA would resolve such gaps. Moreover, the length and complex degeneracy of these STRs embedded within (G+C)-rich DNA probably thwarted efforts to follow up most of these by PCR amplification and sequencing.

Next, we developed a computational pipeline (Extended Data Fig. 2) to characterize structural variation systematically (structural variation defined here as differences ≥50 bp in length, including deletions, duplications, insertions and inversions [7]). Structural variants were discovered by mapping SMRT sequencing reads to the human reference genome [11]

(1) Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA. (2) Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA. (3) Dipartimento di Biologia, Università degli Studi di Bari Aldo Moro, Bari 70125, Italy. (4) Department of Pathology, University of Pittsburgh, Pittsburgh, Pennsylvania 15261, USA. (5) Pacific Biosciences of California, Inc., Menlo Park, California 94025, USA.

[Figure 1 panels: a, proportion of region with simple repeats, gaps vs. reference (P < 2.2 × 10^-16); b, (G+C) content (0 to 100) of reference flank, gap closure and tandem repeat, for gap only, tandem repeats, gap without tandem repeats and sampled reference (P = 0.02712, P = 0.00003, P < 0.00001).]

Figure 1 | Sequence content of gap closures. a, Gap closures are enriched for simple repeats compared to equivalently sized regions randomly sampled from GRCh37. b, Human genome gaps typically consist of (G+C)-rich sequence (yellow) flanking complex (A+T)-rich STRs (green) (empirical P value; Supplementary Information). Red line indicates genomic (G+C) content.

Things that are not there

Steinberg K.M. et al. Genome Research (2014)

reference assembly, many groups have described shortcomings of this resource, including remaining gaps, single-nucleotide errors, or gross misassembly due to complex haplotypic variation (Eichler et al. 2004; Doggett et al. 2006; Kidd et al. 2010; Chen and Butte 2011; The 1000 Genomes Project Consortium 2012). Both gaps and misassembled regions often arise because the DNA sequence used for the assembly was from multiple diploid sources containing complex structural variation. Because such loci often contain medically relevant gene families, it is important to resolve variation at these sites, as the structural and single-nucleotide diversity is likely associated with clinical phenotypes (Eichler et al. 2004). Thus, to resolve structurally complex regions and provide a more effective reference resource for such loci, we combined WGS data and BAC sequences from a haploid DNA source to create a single haplotype assembly of the human genome.

Haplotype information is critical to interpreting clinical and personal genomic information as well as genetic diversity and ancestry data, and most previously sequenced individual human genomes are not haploresolved. The current reference human genome sequence represents a mosaic that further complicates haplotyping; within a BAC clone there is a single haplotype representation, but haplotypes can switch at BAC clone junctions. By utilizing an essentially haploid DNA source, we resolved a single haplotype across complex regions of the genome where the reference genome contained a mixture of haplotypes from various sources and/or contained unresolved gaps. For example, a gap on Chromosome 4p14 in GRCh37 (Chr 4: 40296397-40297096) was completely resolved using CHM1 WGS data. The gap was flanked by repetitive elements that were not traversed by a clone. This region has subsequently been updated with a complete tiling path in GRCh38.

Figure 5. Overview of the Chr 11 (NC_018922.2) 1.9-Mb region, exhibiting three alignment bins with a large number of PacBio cliff reads where the alignment coverage dropped off sharply. WGS component (light green lines) boundaries flanked by such reads are marked with red dashed lines. The ends of each component at the boundary are labeled with letters to show orientation. Pairs of alignments corresponding to three different PacBio reads are marked in yellow, green, and dark blue. These alignments overlap by < 10% on each of the reads. The split alignments for these three reads suggest that the two WGS components marked in purple should be inverted and translocated as indicated by the arrow at the top of the image. The other PacBio reads in these bins exhibit the same pattern of split alignments, which supports the proposed reordering and orientation of the WGS components. The bottom light green lines show a proposed tiling path with the orientation corrected; the letters indicate where each end of the initial tiling path components should be placed.

CHM1 assembly of the human genome. Genome Research, www.genome.org. Published by Cold Spring Harbor Laboratory Press.

Summary

Genome size and repeat content can be estimated without an assembly.
Removing adapters and trimming low-QV bases is good unless the assembly program does error correction (EC) itself.
Assess the levels of heterozygosity in your target genome before you assemble (or sequence) it and set your expectations accordingly.
Choose an assembler that excels in the area you are interested in (e.g., coverage, continuity, or number of error-free bases) and make libraries for it.
Interested in doing just coding potential analyses (e.g., training a gene finder, studying codon usage bias, looking for intron-specific motifs)? => Consider studying exome assemblies.
Or consider a proxy: study a species that is evolutionarily close enough and whose genome is of good quality.

Settle on an assembly so science can continue!

Know the future

A vision into the future

Acknowledgements

Olga Vinnere Pettersson
Björn Nysted
Ola Spujth
Henrik Lantz
Jacques Dainat
Francesco Vezzi
BGI
Jon Badalamenti (Bond Lab)
Stephan C. Schuster (Penn U)