genome assembly: then and now (with notes) — v1.2

Genome assembly: then and now Keith Bradnam Image from Wellcome Trust v1.2 Author: Keith Bradnam, Genome Center, UC Davis This work is licensed under a Creative Commons Attribution 4.0 International License. This was a talk given on 2014-09-17 for the Genome Center’s Bioinformatics Core as part of a 1 week workshop on learning the command-line. Other versions of this talk are probably available at slideshare.com

Upload: keith-bradnam

Post on 24-Jan-2015

DESCRIPTION

This was a talk given on 2014-09-17 for the Genome Center’s Bioinformatics Core as part of a 1 week workshop. It concerns the Assemblathon projects as well as other aspects relating to genome assembly. A version of this talk is also available on Slideshare without notes. Note, this is an evolving talk. There are older and newer versions of the talk also available on slideshare.

TRANSCRIPT

Page 1: Genome assembly: then and now (with notes) — v1.2

Genome assembly: then and now

Keith Bradnam

Image from Wellcome Trust

v1.2

Author: Keith Bradnam, Genome Center, UC Davis This work is licensed under a Creative Commons Attribution 4.0 International License. This was a talk given on 2014-09-17 for the Genome Center’s Bioinformatics Core as part of a 1 week workshop on learning the command-line. Other versions of this talk are probably available at slideshare.com

Page 2: Genome assembly: then and now (with notes) — v1.2

Image from flickr.com/photos/dougitdesign/5613967601/

Contents

Sequencing 101
Genome assembly: then
Genome assembly: now
Assemblathons
Intermission
Advice

Page 3: Genome assembly: then and now (with notes) — v1.2

Sequencing 101: A, C, G, T...

Image from nlm.nih.gov

This is Fred Sanger, who invented the sequencing technology that helped sequence most of the good-quality genomes that are out there. He also won two Nobel Prizes.

Page 4: Genome assembly: then and now (with notes) — v1.2

Read

Most sequencing technologies start with a sequencing read. A read could be as short as 25 bp (Solexa sequencing from a few years ago), or >25,000 bp (e.g. with PacBio or Oxford Nanopore). The record read length is currently held by PacBio and is over 50,000 bp.

Page 5: Genome assembly: then and now (with notes) — v1.2

Read pair

Most sequencing is done with pairs of connected reads, separated by a short interval whose approximate length is known. Not all reads will have this exact ‘insert size’. There can be a LOT of variation. Read pairs can also overlap with each other.

Page 6: Genome assembly: then and now (with notes) — v1.2

Read pair

Mate pair

Mate pairs, also known as jumping pairs, have much larger inserts (thousands or tens of thousands of bp), but it is hard to make good mate pair libraries. Having very large inserts is very useful for the purposes of genome assembly. Again, there is a lot of variation in the actual size of inserts (as determined by mapping mate pairs back to a known reference).

Page 7: Genome assembly: then and now (with notes) — v1.2

If you sequence a lot of read pairs, hopefully they will overlap with each other and allow you to start making contiguous sequences...

Page 8: Genome assembly: then and now (with notes) — v1.2

Contigs

...which are better known as contigs.

Page 11: Genome assembly: then and now (with notes) — v1.2

Scaffold

[Figure: contigs joined into a scaffold, with the unknown gap shown as a run of Ns]

Mate pairs — or other information — can hopefully be used to connect contigs together into scaffolds. The unknown gap between contigs is replaced with unknown bases (Ns). Some scaffold sequences can therefore end up containing a lot of Ns.

Page 13: Genome assembly: then and now (with notes) — v1.2

Assembly size

[Figure: a toy assembly drawn as 12 scaffolds, some of which contain runs of Ns. Scaffold lengths in Mbp: 70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5, 5. Total: 200 Mbp]

Assembly size is simply the sum of all scaffolds or contigs that are included in the final genome assembly. If you are calculating the assembly size from scaffolds, then some fraction of that final size will come from the Ns in scaffold sequences.

Here we have a toy genome assembly, with 12 scaffolds totaling 200 Mbp.
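As a rough illustration of that sum, here is a minimal Python sketch. It assumes the scaffolds are already in memory as plain strings; the function name `assembly_size` and the tiny example sequences are made up for this example and are not from the talk.

```python
def assembly_size(scaffolds):
    """Return (total length, number of N bases) for a list of sequences."""
    total = sum(len(seq) for seq in scaffolds)
    n_bases = sum(seq.upper().count("N") for seq in scaffolds)
    return total, n_bases


# Hypothetical toy scaffolds (real ones are millions of bases long).
scaffolds = ["ACGTNNNNNACGT", "GGCCTTAA", "ACNNNGT"]

total, n_bases = assembly_size(scaffolds)
print(f"Assembly size: {total} bp, of which {n_bases} bp are Ns "
      f"({100 * n_bases / total:.1f}%)")
```

Reporting the N count alongside the total makes it obvious how much of the "size" is real sequence.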

Page 16: Genome assembly: then and now (with notes) — v1.2

N50 length

[Figure: the toy assembly of 12 scaffolds, 200 Mbp in total. Running total so far: 70 Mbp (the longest scaffold)]

The most widely used measure to describe genome assemblies is the N50 length of scaffolds or contigs. This is essentially a weighted median, designed to be more informative than a crude mean length (which is not very useful if you end up with thousands of very short scaffolds/contigs). To calculate the N50 scaffold length, start with the length of the longest scaffold...

Page 18: Genome assembly: then and now (with notes) — v1.2

N50 length

[Figure: the toy assembly of 12 scaffolds, 200 Mbp in total. Running total after the two longest scaffolds (70 + 25): 95 Mbp]

…if this length does not exceed 50% of the total assembly size (50% is why it is called N50), then proceed to the next longest scaffold, and add that length to a running total.

Page 20: Genome assembly: then and now (with notes) — v1.2

N50 length

[Figure: the toy assembly of 12 scaffolds, 200 Mbp in total. Running total after the three longest scaffolds (70 + 25 + 20): 115 Mbp]

After looking at 3 scaffolds, we have exceeded 50% of the total assembly size.

Page 21: Genome assembly: then and now (with notes) — v1.2

N50 length

[Figure: the toy assembly of 12 scaffolds, 200 Mbp in total]

The length of the contig or scaffold that takes you past 50% is what is reported as the N50 length. So here, we have an N50 length of 20 Mbp.
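The walk described above can be sketched in a few lines of Python. This is only a sketch: the function name `n50` is my own, the scaffold lengths are the ones from the toy assembly on the slide, and I am assuming the common "at least half" convention (the running total must reach 50% of the assembly size), which only matters when the total lands exactly on 50%.

```python
def n50(lengths):
    """Return the N50 length: walk from longest to shortest until the
    running total reaches at least half of the summed length."""
    half = sum(lengths) / 2
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    raise ValueError("no lengths given")


# Scaffold lengths (Mbp) of the toy assembly.
toy = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5, 5]

print(sum(toy))  # total assembly size: 200
print(n50(toy))  # running totals 70, 95, 115 -> N50 is 20
```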

Page 24: Genome assembly: then and now (with notes) — v1.2

N50 length

[Figure: the toy assembly with the two shortest scaffolds removed, leaving 10 scaffolds. Lengths in Mbp: 70, 25, 20, 15, 15, 15, 10, 10, 5, 5]

N50 may be more robust than using a simple mean length, but it can still be easily manipulated. What if we excluded the two shortest scaffolds from our assembly?

Page 27: Genome assembly: then and now (with notes) — v1.2

N50 length

[Figure: the remaining 10 scaffolds, 190 Mbp in total]

Now the total assembly size is 10 Mbp smaller, which is only a 5% reduction, but the N50 increases to 25 Mbp...a 25% increase. If these were two different assemblies and you only saw an N50 of 25 Mbp vs N50 of 20 Mbp, you might think that the first assembly was much better.
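To see the effect with numbers, here is a small Python sketch using the toy scaffold lengths from the slides. The helper `n50` is my own naming and assumes the "at least half" convention; with 190 Mbp the running total hits exactly 50% at the second scaffold, so that convention is what gives the 25 Mbp quoted above.

```python
def n50(lengths):
    """N50 length, using the 'running total reaches at least half' rule."""
    half = sum(lengths) / 2
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    raise ValueError("no lengths given")


# Toy assembly (lengths in Mbp), then the same assembly minus its two
# shortest scaffolds.
full = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5, 5]
trimmed = sorted(full, reverse=True)[:-2]

size_change = 100 * (sum(full) - sum(trimmed)) / sum(full)
n50_change = 100 * (n50(trimmed) - n50(full)) / n50(full)

print(f"Size: {sum(full)} -> {sum(trimmed)} Mbp ({size_change:.0f}% smaller)")
print(f"N50:  {n50(full)} -> {n50(trimmed)} Mbp ({n50_change:.0f}% larger)")
```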

Page 30: Genome assembly: then and now (with notes) — v1.2

N50 for two assemblies

208 Mbp 190 Mbp

N50 = 15 Mbp N50 = 25 Mbp

Here are another two fictional assemblies. The first assembly now has a lower N50 value, but this is purely because it contains more sequence (which comes from very short scaffolds). Do you want more sequence in your assembly, or fewer but longer sequences?

Page 33: Genome assembly: then and now (with notes) — v1.2

NG50 for two assemblies

Expected genome size = 250 Mbp

We prefer a measure called NG50. This does not use the assembly size, but instead uses the known (or estimated) genome size (the 'G' in NG50 refers to the Genome). We first used this measure in the Assemblathon 1 paper and (thankfully) it has seen some adoption by the assembly community.

Page 35: Genome assembly: then and now (with notes) — v1.2

NG50 for two assemblies

Expected genome size = 250 Mbp

NG50 = 15 Mbp   NG50 = 15 Mbp

The NG50 of these two assemblies is now the same. We think that NG50 is a fairer way of comparing genome assemblies that might differ in their total size.
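The only change from N50 is the denominator. The sketch below is an illustration, not code from the Assemblathon papers: the function names are mine, and because the scaffold lengths of the 208 Mbp assembly are not given in these notes, it reuses the earlier toy assembly (200 Mbp) and its trimmed version (190 Mbp) together with the 250 Mbp expected genome size from the slide.

```python
def nx50(lengths, reference_size):
    """Length of the sequence that takes the running total to at least
    half of reference_size, or None if the sequences never get there."""
    half = reference_size / 2
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return None


def n50(lengths):
    return nx50(lengths, sum(lengths))  # reference = assembly size


def ng50(lengths, genome_size):
    return nx50(lengths, genome_size)   # reference = genome size


full = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5, 5]  # 200 Mbp
trimmed = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5]     # 190 Mbp
genome_size = 250                                    # Mbp, from the slide

for name, lengths in [("full", full), ("trimmed", trimmed)]:
    print(name, "N50 =", n50(lengths), "NG50 =", ng50(lengths, genome_size))
```

Returning `None` when the assembly covers less than half of the genome is a design choice of this sketch; it avoids reporting a number that has no meaning.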

Page 36: Genome assembly: then and now (with notes) — v1.2

People always seem to want higher N50 values, so I recently published a tool called 'N50 Booster!!!' that can increase the N50 length of any genome assembly.

Page 37: Genome assembly: then and now (with notes) — v1.2

$ n50_booster.pl c_japonica.WS230.genomic.fa

Before:
==============
Total assembly size = 166256191 bp
N50 length = 94149 bp

Boosting N50...please wait

After:
==============
Total assembly size = 166256191 bp
N50 length = 104766 bp

Improvement in N50 length = 10617 bp

See file c_japonica.WS230.genomic.fa.n50 for your new (and improved) assembly

This is some real output from my 'N50 Booster!!!' script. In this case, it increased the N50 length of the Caenorhabditis japonica assembly from 94.1 Kbp to 104.8 Kbp. Note that, amazingly, the assembly size remains the same!

Page 39: Genome assembly: then and now (with notes) — v1.2

Notice the date? This was indeed an April Fool's prank, but the script is only being slightly dishonest. The first thing it does is simply discard the shortest 25% of all sequences. Then it adds an equivalent length of Ns to some of the remaining contigs (to preserve the assembly size). Manipulating assemblies like this is unscientific, but most genome assemblers will remove some of the shortest sequences. Which ones should you keep or remove?
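To show why this trick works, here is a hypothetical re-creation of the two steps in Python. It is not the n50_booster.pl script: it works on lengths only, the function names are mine, and putting all of the padding on the longest surviving sequence is an assumption (the notes only say the Ns go to "some of the remaining contigs"). The input is the toy assembly from earlier in the talk.

```python
def n50(lengths):
    """N50 length, using the 'running total reaches at least half' rule."""
    half = sum(lengths) / 2
    running_total = 0
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    raise ValueError("no lengths given")


def boost(lengths, drop_fraction=0.25):
    """Drop the shortest sequences, then pad the longest survivor with Ns
    (here: just add to its length) so the total size is unchanged."""
    ordered = sorted(lengths, reverse=True)
    n_drop = int(len(ordered) * drop_fraction)
    kept = ordered[:len(ordered) - n_drop]
    removed = sum(ordered[len(ordered) - n_drop:])
    if kept:
        kept[0] += removed  # assumption: all padding goes to one sequence
    return kept


toy = [70, 25, 20, 15, 15, 15, 10, 10, 5, 5, 5, 5]  # Mbp
boosted = boost(toy)

print(sum(toy), n50(toy))          # same size before...
print(sum(boosted), n50(boosted))  # ...and after, but a higher N50
```

Nothing about the underlying sequence got better here, which is exactly the point.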

Page 40: Genome assembly: then and now (with notes) — v1.2

You should check that high N50 values are not simply due to lots of Ns in the scaffolds!

You should always look at your assembly before you do anything with it!

Page 44: Genome assembly: then and now (with notes) — v1.2

Assembly 'x'

Size: 859 Mbp

Number of scaffolds: 28

N50 = 70.3 Mbp

Ns = 90.6% !!!

This assembly, which I was asked to run CEGMA on, turned out to be 91% N! It is not going to be good for anything. There are lots of other ambiguity characters too (e.g. R for puRine).
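A check like this is quick to do yourself. Below is a minimal Python sketch, assuming the assembly is available as FASTA-formatted text; the function names and the tiny example record are invented for illustration. For a real genome you would read the file line by line rather than hold it in one string.

```python
from collections import Counter


def base_composition(fasta_text):
    """Count every sequence character in FASTA text, skipping header lines."""
    counts = Counter()
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        counts.update(line.upper())
    return counts


def n_and_ambiguity_percent(counts):
    """Return (% N, % other non-ACGT characters such as R or Y)."""
    total = sum(counts.values())
    other = sum(n for base, n in counts.items() if base not in "ACGTN")
    return 100 * counts["N"] / total, 100 * other / total


# Hypothetical miniature assembly.
fasta = """>scaffold_1
ACGTNNNNNN
NNRYACGT
>scaffold_2
NNNNACGTNN
"""

counts = base_composition(fasta)
n_pct, other_pct = n_and_ambiguity_percent(counts)
print(f"Ns = {n_pct:.1f}%, other ambiguity characters = {other_pct:.1f}%")
```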

Page 47: Genome assembly: then and now (with notes) — v1.2

Basic assembly metrics

Metric            Description
Assembly size     With or without very short contigs?
N50 / NG50        For contigs and/or scaffolds
Coverage          When compared to a reference sequence
Errors            Base errors from alignment to reference sequence and/or input read data
Number of genes   From comparison to reference transcriptome and/or set of known genes

And many, many more...

Apart from assembly size, and N50/NG50 length, there are many other ways to describe a genome assembly.

Page 49: Genome assembly: then and now (with notes) — v1.2

Genome assembly: Back in the day...

1998

How were genomes assembled back in the late 1990s when genome sequencing projects were starting to make the news?

Page 56: Genome assembly: then and now (with notes) — v1.2

Genome assembly: then

Genetic maps ✓
Physical maps ✓
Understanding of target genome ✓
Haploid / low heterozygosity genome ✓
Accurate & long reads ✓
Resources (time, money, people) ✓

Genome sequencing projects often had a fantastic amount of supporting material which helped put the genome together. They were further helped by targeting genomes which had low heterozygosity. And of course this was all done with Sanger sequencing which gave long, accurate reads.

Page 57: Genome assembly: then and now (with notes) — v1.2

So what was the result of spending millions of dollars to assemble genomes of well-characterized species, with accurate long reads, and detailed maps???

So hopefully this gave us a useful set of finished genomes, right?

Page 60: Genome assembly: then and now (with notes) — v1.2

✤ 2000: published genome size = 125 Mbp

✤ 2007: genome size = 157 Mbp

✤ 2012: genome size = 135 Mbp

✤ Amount sequenced = 119 Mbp

✤ Ns = 0.2% of genome

Arabidopsis thaliana

Many published genome sizes are based on estimates, which can be wrong. As more and more of the Arabidopsis genome was sequenced, the estimate of how big it is had to be revised. So between 2000 and 2007 they produced more sequence, but paradoxically the genome became less complete because the estimate of its size went up. Now the estimate has come back down again. But the genome remains unfinished.
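The arithmetic behind that paradox is simple. The Python sketch below holds the amount sequenced fixed at the 119 Mbp shown on the slide and divides it by each published size estimate; that is a simplifying assumption for illustration (the amount sequenced at each date is not given in these notes), and the function name is my own.

```python
def percent_complete(sequenced_mbp, estimated_size_mbp):
    """Apparent completeness: sequenced bases as a % of the size estimate."""
    return 100 * sequenced_mbp / estimated_size_mbp


sequenced_mbp = 119                                # from the slide
estimates_mbp = {2000: 125, 2007: 157, 2012: 135}  # from the slide

for year, size in estimates_mbp.items():
    pct = percent_complete(sequenced_mbp, size)
    print(f"{year}: {pct:.1f}% of an estimated {size} Mbp")
```

The same sequence looks anywhere from roughly three-quarters to 95% complete depending on which estimate you believe.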

Page 62: Genome assembly: then and now (with notes) — v1.2

Drosophila melanogaster

✤ Genome published 1998

✤ Heterochromatin finished 2007

✤ Ns = 4% of genome

The fly genome was 'finished' in 1998. But this was only really the easy-to-sequence portion of the genome (the euchromatin). The trickier heterochromatin was sequenced as a separate project that didn't finish until almost a decade later. The fly genome remains unfinished.

Page 66: Genome assembly: then and now (with notes) — v1.2

Caenorhabditis elegans

✤ Genome published 1998

✤ 2004: last N removed

✤ 1998–2014: genome sequence changes

✤ 558 insertions (Nov 2012)

✤ 230 deletions (Nov 2012)

✤ 614 substitutions (Nov 2012)

The worm genome has no unknown bases in it. However, since the publication of the genome sequence, the genome has continued to be refined as errors are corrected. The last big batch of changes occurred fairly recently (November 2012). So almost 15 years after the genome was published, it was still possible to find over 1,400 errors in one of the best-characterized genome sequences that exist.

Page 68: Genome assembly: then and now (with notes) — v1.2

Saccharomyces cerevisiae

✤ Genome published 1997

✤ 12 Mbp genome

✤ 1,653 changes to genome since 1997

✤ Last changes made in 2011

Likewise in yeast. The first eukaryotic genome sequence continues to receive fixes to correct the sequence. The last set of changes was made in 2011. These changes affected coding sequences, not just intergenic and intronic DNA.

Page 69: Genome assembly: then and now (with notes) — v1.2

Genome assembly: then

Genetic maps ✓
Physical maps ✓
Understanding of target genome ✓
Haploid / low heterozygosity genome ✓
Accurate & long reads ✓
Resources (time, money, people) ✓

And all of this was done in an era when we had all of these supporting materials.

Page 70: Genome assembly: then and now (with notes) — v1.2

Genetic maps ✗

Physical maps ✗

Understanding of target genome ✗

Haploid / low heterozygosity genome ✗

Accurate & long reads ✗

Resources (time, money, people) ✗

Genome assembly: now

We don't have these now! Genome sequencing no longer requires an international consortium; it could be a project for a grad student.

Page 71: Genome assembly: then and now (with notes) — v1.2

Assembling & finishing a genome is not easy!

It was never easy, even when we had access to lots of resources to help us put together genomes. And it is not easy now. Don't be fooled into thinking that, because there are many published genome sequences, these sequences represent the absolute ideal genome sequence.

And don't be fooled into thinking that, just because you can afford to sequence a genome, you will have the resources to make a useful assembly from that sequence data.

Page 72: Genome assembly: then and now (with notes) — v1.2

Assemblathons: A new idea is born

Image from flickr.com/photos/dullhunk/4422952630

Page 74: Genome assembly: then and now (with notes) — v1.2

If you sequence 10,000 genomes...
...you need to assemble 10,000 genomes

The Assemblathon was born out of the Genome 10K project.

Page 76: Genome assembly: then and now (with notes) — v1.2

How many assembly tools are out there?

bambus2, Ray, Celera, MIRA, ALLPATHS-LG, SGA, Curtain, Metassembler, Phusion, ABySS, Amos, Arapan, CLC, Cortex, DNAnexus, DNA Dragon, Edena, Forge, Geneious, IDBA, Newbler, PRICE, PADENA, PASHA, Phrap, TIGR, Sequencher, SeqMan NGen, SHARCGS, SOPRA, SSAKE, SPAdes, Taipan, VCAKE, Velvet, Arachne, PCAP, GAM, Monument, Atlas, ABBA, Anchor, ATAC, Contrail, DecGPU, GenoMiner, Lasergene, PE-Assembler, Pipeline Pilot, QSRA, SeqPrep, SHORTY, fermi, Telescoper, Quast, SCARPA, Hapsembler, HapCompass, HaploMerger, SWiPS, GigAssembler, MSR-CA, MaSuRCA, GARM, Cerulean, TIGRA, ngsShoRT, PERGA, SOAPdenovo, REAPR, FRCBam, EULER-SR, SSPACE, Opera, mip, gapfiller, image, PBJelly, HGAP, FALCON, Dazzler, GGAKE, A5, CABOG, SHRAP, SR-ASM, SuccinctAssembly, SUTTA, Ragout, Tedna, Trinity, SWAP-Assembler, SILP3, AutoAssemblyD, KGBAssembler, MetAMOS, iMetAMOS, MetaVelvet-SL, KmerGenie, Nesoni, Pilon, Platanus, CGAL, GAGM, Enly, BESST, Khmer, GRIT, IDBA-MTP, dipSPAdes, WhatsHap, SHEAR, ELOPER, OMACC, Omega, GABenchToB, HiPGA, SAGE, HyDA-Vista, MHAP, Mapsembler 2, GAML, SAT-Assembler, RAMPART, VICUNA

There are many, many tools out there for assembling, or helping to assemble, a genome sequence (there are 125 on this page). People may not have the time, patience, or expertise to try more than a handful of these. But…

Page 78: Genome assembly: then and now (with notes) — v1.2

Which is the best?

…people want to know which one is the best!

Page 80: Genome assembly: then and now (with notes) — v1.2

All published since August 14th, 2014!

At the time I presented this talk, six new papers had recently been published, all describing new tools to help make genome assemblies. These papers were all published within about a month of each other. Genome assembly is a hard field to stay on top of!

Page 85: Genome assembly: then and now (with notes) — v1.2

Comparing assemblers

✤ Can't fairly compare two assemblers if they:

✤ produced assemblies from different species

✤ assembled same species, but used sequence data from different sequencing technologies

✤ used same sequencing technologies but have different sequence libraries

✤ Even using different options for the same assembler may produce very different assemblies!

It is not always straightforward to compare two tools if they were used on different species or on different datasets from the same species.

Page 87: Genome assembly: then and now (with notes) — v1.2

The PRICE genome assembler has 52 command-line options!!!

How many of them are you going to learn?

This assembler has 52 command-line options! Not all of these will affect the resulting assembly, but many of them will.

Page 88: Genome assembly: then and now (with notes) — v1.2

A genome assembly competition

That's where the Assemblathon came in.

Page 89: Genome assembly: then and now (with notes) — v1.2

An attempt to standardize some aspects of the genome assembly process

Genome assembly contests

Others have been trying to do the same thing, e.g. GAGE and dnGASP. If you can at least give different assemblers the same input sequence data, you can start to take account of one of the biggest variables in genome assembly.

Page 90: Genome assembly: then and now (with notes) — v1.2

Assemblathon 1

✤ 2010–2011

✤ Used synthetic data

✤ Small genome (~100 Mbp)

✤ We knew the answer

It is easier to judge a tool when you know what the final answer should look like. However, many people that work on developing assemblers would prefer to work with real data…

Page 91: Genome assembly: then and now (with notes) — v1.2

…which is where Assemblathon 2 came in.

Page 92: Genome assembly: then and now (with notes) — v1.2

Published in GigaScience, July 2013

The paper was formally published in the journal GigaScience in mid 2013…

Page 93: Genome assembly: then and now (with notes) — v1.2

First published on arXiv.org, Jan 2013

…but we first published the paper on arXiv.org and ensured that we uploaded updates as the paper changed.

Page 94: Genome assembly: then and now (with notes) — v1.2

Attracted lots of interest, and provoked lots of commentary

Many blogs commented on the paper.

Page 95: Genome assembly: then and now (with notes) — v1.2

The Altmetric site, which tracks the social media engagement of academic research, reveals how much interest there has been in the paper.

Page 97: Genome assembly: then and now (with notes) — v1.2

The open nature by which we conducted the research was recognized with the 2013 BioMed Central award for Open Data. I strongly believe that trying to conduct this science in an open manner ended up making our research much more visible to the scientific community.

Page 98: Genome assembly: then and now (with notes) — v1.2

But what did the paper reveal?

Page 100: Genome assembly: then and now (with notes) — v1.2

Contest | Type of data | Number of genomes | Size of genomes | Do we know the answer?
Assemblathon 1 | Synthetic | 1 | Small | ✓
Assemblathon 2 | Real | 3 | Large | ✗

Assemblathon 2 was a much bigger contest than Assemblathon 1.

Page 101: Genome assembly: then and now (with notes) — v1.2

Melopsittacus undulatus

Boa constrictor constrictor

Maylandia zebra

These were the 3 species that were used: a budgie (Melopsittacus undulatus), a cichlid fish from Lake Malawi (Maylandia zebra), and a snake (Boa constrictor constrictor).

Page 102: Genome assembly: then and now (with notes) — v1.2

Bird

Snake

Fish

Let's simplify the names for the rest of the presentation.

Page 104: Genome assembly: then and now (with notes) — v1.2

Why these three species?

Because they were there

There is no special reason why these species were used. People had a need to sequence the genomes, and some companies were willing to donate sequences.

Page 108: Genome assembly: then and now (with notes) — v1.2

Assemble this!

Species | Estimated genome size | Illumina | Roche 454 | PacBio
Bird | 1.2 Gbp | 285x (14 libraries) | 16x (3 libraries) | 10x (2 libraries)
Fish | 1.0 Gbp | 192x (8 libraries) | - | -
Snake | 1.6 Gbp | 125x (4 libraries) | - | -

Lots of sequence data was provided for the bird. Mate pair and read pair libraries were available for all Illumina datasets.

This probably doesn’t reflect a real world scenario. Not everyone can afford over 400x of sequence coverage!

Page 111: Genome assembly: then and now (with notes) — v1.2

Who took part?

21 teams, 43 assemblies

52,013,623,777 bp of sequence

Lots of teams took part. Not just from the big sequencing/genome centers.

Page 113: Genome assembly: then and now (with notes) — v1.2

Entries

Species | Competitive entries | Evaluation entries
Bird | 12 | 3
Fish | 10 | 6
Snake | 12 | 0

In addition to competitive entries (only 1 per team), evaluation entries were allowed; these were not eligible to be declared the winner.

Page 118: Genome assembly: then and now (with notes) — v1.2

Goals

✤ Assess 'quality' of assemblies

✤ Define quality

✤ Produce ranking of assemblies for each species

✤ Produce ranking of assemblers across species?

Defining quality was really the toughest part of organizing the Assemblathon competitions. Lots of people have lots of (different) ideas as to what contributes to assembly quality.

Page 119: Genome assembly: then and now (with notes) — v1.2

Who did what?

Person/group | Jobs
Me, Ian Korf, and Joseph Fass | Perform various analyses of all assemblies
David Schwartz et al. | Produce & evaluate optical maps
Jay Shendure et al. | Produce Fosmid sequences (bird & snake only)
Martin Hunt & Thomas Otto | Perform REAPR analysis
Dent Earl & Benedict Paten | Help with meta-analysis of final rankings

Lots of different groups were involved in the organization and assessment of the Assemblathon 2 entries.

Page 120: Genome assembly: then and now (with notes) — v1.2

91 co-authors!

flickr.com/photos/jamescridland/613445810

It was hard to get agreement on how best to interpret the results. Some analyses and interpretations in the Assemblathon 2 paper ended up being compromises.

Page 121: Genome assembly: then and now (with notes) — v1.2

Results!

Page 122: Genome assembly: then and now (with notes) — v1.2

Lots of results!

A screen grab of my master spreadsheet that contains all of the numerical results. Each row represents a submitted assembly, and each column represents a different assembly metric.

Page 124: Genome assembly: then and now (with notes) — v1.2

102 different metrics!

There were a lot of metrics. Many of these were not important or highly informative (e.g. %N).

Page 125: Genome assembly: then and now (with notes) — v1.2

10 key metrics

We focused on 10 of 102 metrics that we thought were a) useful and b) captured different aspects of an assembly's quality.

Page 127: Genome assembly: then and now (with notes) — v1.2

Key metric | Description
1 | NG50 scaffold length
2 | NG50 contig length
3 | Amount of assembly in 'gene-sized' scaffolds
4 | Number of 'core genes' present
5 | Fosmid coverage
6 | Fosmid validity
7 | Short-range scaffold accuracy
8 | Optical map: level 1
9 | Optical map: levels 1–3
10 | REAPR summary score

In the remainder of this talk, I’ll just focus on a few of these metrics. See the Assemblathon 2 paper (or an older version of this talk) for more details about the other metrics.

Page 128: Genome assembly: then and now (with notes) — v1.2

1) Scaffold NG50 lengths

✤ Can calculate NG50 length for each assembly

✤ But also calculate NG60, NG70 etc.

✤ Plot all results as a graph

An N50 (or NG50) value on its own doesn't tell you that much. Ideally you should always be aware of the total assembly size and the distribution of lengths when comparing assemblies. You can do this by not only calculating NG50, but NG1..NG100. NG1 would be the length of scaffold that captures 1% of the estimated genome size (when summing scaffolds from longest to shortest).
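The NGx idea described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the Assemblathon analysis: the function name `ng_length` and the toy scaffold lengths are made up, and the choice to return 0 when the assembly never reaches the requested percentage is an assumption that matches how the NG graph is read (a series that stops before NG100 belongs to an assembly smaller than the estimated genome size).

```python
def ng_length(scaffold_lengths, genome_size, x=50):
    """Return the NGx length for an assembly.

    Scaffolds are summed from longest to shortest; NGx is the length of the
    scaffold at which the running total first reaches x% of the *estimated
    genome size* (not of the assembly size, which would give Nx).
    Returns 0 if the assembly is too small to reach that target.
    """
    target = genome_size * x / 100
    running_total = 0
    for length in sorted(scaffold_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return 0


# Hypothetical toy assembly (lengths in bp) and estimated genome size.
lengths = [400, 300, 200, 100, 50, 50]
genome_size = 1000

ng50 = ng_length(lengths, genome_size, 50)

# The whole series NG1..NG100 is what gets plotted as a graph.
ng_series = [ng_length(lengths, genome_size, x) for x in range(1, 101)]
```

Plotting `ng_series` for each assembly (with length on a log axis) gives one line per assembly, which shows the distribution of lengths rather than a single number.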

Page 129: Genome assembly: then and now (with notes) — v1.2

1) Scaffold NG50 lengths

Scaffold length is on a log axis and team identifiers are shown in the legend.

The black dashed line shows the NG50 value, but the point where each series starts on the left shows the length of the longest scaffold. Also, if the NG100 value is greater than zero, then that assembly is at least as big as the known/estimated genome size.

Page 130: Genome assembly: then and now (with notes) — v1.2

2) Contig vs scaffold NG50

We did the same thing for contig NG50 as well as scaffold NG50. The two measures are sometimes, but not always, correlated. The two highlighted data points show outliers for bird assemblies, reflecting assemblies that are good at making long contigs *or* good at making long scaffolds, but not both.

Page 137: Genome assembly: then and now (with notes) — v1.2

3) Gene-sized scaffolds

✤ Some assembly folks get a little obsessed by length

✤ How long is 'long enough' for a scaffold?

✤ What if you just wanted to find genes?

✤ Average vertebrate gene = ~25 Kbp

It is great to have long scaffolds, but maybe for many questions that you might be interested in (e.g. studying codon usage bias), you only need to have scaffolds that have a good chance of capturing a full-length gene.

Page 138: Genome assembly: then and now (with notes) — v1.2

3) Gene-sized scaffolds

The red data series shows the bird assemblies ordered by their NG50 scaffold length. The blue line shows the percentage of the estimated genome size that is present in scaffolds of 25 Kbp or longer. Most assemblies, even if they have a much shorter *average* scaffold length, may still contain many scaffolds that are long enough to contain a single gene.
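A minimal sketch of the 'gene-sized scaffolds' metric, assuming it is simply the summed length of all scaffolds of 25 Kbp or longer, expressed as a percentage of the estimated genome size. The function name and the toy numbers are hypothetical.

```python
def percent_in_gene_sized_scaffolds(scaffold_lengths, genome_size, min_length=25_000):
    """Percentage of the estimated genome size found in scaffolds >= min_length bp."""
    gene_sized_total = sum(
        length for length in scaffold_lengths if length >= min_length
    )
    return 100 * gene_sized_total / genome_size


# Hypothetical toy assembly: three scaffolds pass the 25 Kbp threshold.
lengths = [100_000, 30_000, 25_000, 10_000, 5_000]
genome_size = 200_000

percent = percent_in_gene_sized_scaffolds(lengths, genome_size)
```

Note that an assembly larger than the estimated genome size can score over 100% with this definition.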

Page 143: Genome assembly: then and now (with notes) — v1.2

4) Core genes

✤ Used CEGMA (Core Eukaryotic Gene Mapping Approach)

✤ CEGMA uses a set of 458 'Core Eukaryotic Genes' (CEGs)

✤ CEGs are conserved in: S. cerevisiae, S. pombe, A. thaliana, C. elegans, D. melanogaster, and H. sapiens

✤ How many full-length CEGs are in each assembly?

A previously developed tool (CEGMA) was used to see how many 'core genes' (extremely highly conserved) are present in each assembly. These genes have been identified in six different eukaryotes and are expected to be present in all eukaryotes.

Note that CEGMA finds genes where a full-length (or nearly full-length) gene is present within a single scaffold. Many core genes might be present, but split across scaffolds.

Page 145: Genome assembly: then and now (with notes) — v1.2

4) Core genes

Core genes (out of 458)

Species | Best individual assembly | Across all assemblies
Bird | 420 | 442
Fish | 436 | 455
Snake | 438 | 454

In the three species, most of the core genes were present across all assemblies, but individual assemblies typically lacked several core genes.

Page 146: Genome assembly: then and now (with notes) — v1.2

4) Core genes

These results show the number of CEGMA genes that were present in any one assembly as a percentage of all possible CEGMA genes (i.e. those present across all assemblies for each species).
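As a rough sketch of that percentage: take the union of core genes found across all assemblies for a species as the 'possible' set, then express each assembly's count relative to it. The gene identifiers and assembly names below are invented for illustration; real input would come from each assembly's CEGMA output.

```python
def percent_of_possible_core_genes(genes_by_assembly):
    """Map each assembly name to its core gene count as a percentage of the
    union of core genes found across all assemblies."""
    possible = set().union(*genes_by_assembly.values())
    return {
        name: 100 * len(genes) / len(possible)
        for name, genes in genes_by_assembly.items()
    }


# Hypothetical example: 5 core genes are found in at least one assembly.
genes_by_assembly = {
    "assembly_A": {"CEG1", "CEG2", "CEG3", "CEG4"},
    "assembly_B": {"CEG2", "CEG3", "CEG4", "CEG5"},
    "assembly_C": {"CEG1", "CEG5"},
}

percentages = percent_of_possible_core_genes(genes_by_assembly)
```

Using the union rather than all 458 CEGs as the denominator means an assembly is only penalized for genes that at least one other assembly managed to capture.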

Page 147: Genome assembly: then and now (with notes) — v1.2

What does this all mean?

Page 148: Genome assembly: then and now (with notes) — v1.2

102 metrics per assembly

10 key metrics

1 final ranking

Using the 10 key metrics, we combined the results to produce a single score for each assembly by which to rank them.

Page 151: Genome assembly: then and now (with notes) — v1.2

Assembly | Number of core genes | Rank | Z-score
CRACS | 438 | 1 | +0.68
SYMB | 436 | 2 | +0.59
PHUS | 435 | 3 | +0.54
BCM | 434 | 4 | +0.49
SGA | 433 | 5 | +0.44
MERAC | 430 | 6 | +0.30
ABYSS | 429 | 7 | +0.25
SOAP | 428 | 8 | +0.21
RAY | 422 | 9 | –0.08
GAM | 415 | 10 | –0.41
CURT | 360 | 11 | –3.02

Although we did take an average rank from the 10 individual rankings, we preferred to use a Z-score approach. Each assembly was scored based on the total number of standard deviations from the average of each metric. This rewards/penalizes assemblies with very high/low scores in individual metrics. The above results are from the CEGMA metric in snake (the top score of 438 matches the best individual snake assembly).
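To make the Z-score step concrete, here is a small sketch that recomputes the Z-scores for this one metric from the core gene counts shown above. It assumes the population standard deviation was used (an assumption, chosen because it reproduces the numbers on the slide); the function name is made up.

```python
from statistics import mean, pstdev


def z_scores(values_by_assembly):
    """Z-score of each assembly for one metric: (value - mean) / standard deviation.

    Assumption: population standard deviation (pstdev), not the sample version.
    """
    values = list(values_by_assembly.values())
    mu = mean(values)
    sigma = pstdev(values)
    return {name: (value - mu) / sigma for name, value in values_by_assembly.items()}


# Core gene counts from the table above.
core_genes = {
    "CRACS": 438, "SYMB": 436, "PHUS": 435, "BCM": 434, "SGA": 433,
    "MERAC": 430, "ABYSS": 429, "SOAP": 428, "RAY": 422, "GAM": 415,
    "CURT": 360,
}

scores = z_scores(core_genes)
rounded = {name: round(z, 2) for name, z in scores.items()}
```

The final score for an assembly is then the sum of its Z-scores across the key metrics. The single outlier (CURT, at about 3 standard deviations below the mean) shows how a very low score in one metric is penalized far more heavily than it would be by rank alone.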

Page 152: Genome assembly: then and now (with notes) — v1.2

This graph shows the final rankings of bird assemblies based on their sum Z-scores. Assemblies in red are the evaluation entries. The error bars reflect what would be the highest and lowest sum Z-score if we had used any of the possible combinations of 9 key metrics rather than 10. Note that the highest ranked bird assembly was an evaluation assembly by Baylor College of Medicine (BCM); their competitive entry ranked number 2.
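The error bars can be sketched as a leave-one-out calculation: drop each key metric in turn, re-sum the remaining 9 Z-scores, and record the lowest and highest totals. The Z-scores below are invented for a single hypothetical assembly, and the function name is an assumption for illustration.

```python
def sum_z_with_range(z_scores):
    """Return (sum of all Z-scores, lowest leave-one-out sum, highest leave-one-out sum)."""
    total = sum(z_scores)
    # Each leave-one-out sum is the total minus the Z-score of the dropped metric.
    leave_one_out = [total - z for z in z_scores]
    return total, min(leave_one_out), max(leave_one_out)


# Hypothetical Z-scores for one assembly across the 10 key metrics.
assembly_z = [1.0, 0.5, -0.5, 2.0, 0.25, -0.25, 1.5, 0.75, -1.0, 0.5]

total, lowest, highest = sum_z_with_range(assembly_z)
```

A wide gap between `lowest` and `highest` means the assembly's position depends heavily on one metric, which is why overlapping error bars make it hard to declare a clear winner.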

Page 157: Genome assembly: then and now (with notes) — v1.2

In fish, the BCM entry ranked 1st, though the error bars suggest there is much variability. The lack of Fosmid data means that there are only 7 key metrics rather than 10.

Page 158: Genome assembly: then and now (with notes) — v1.2

Snake seemed to be the only species where one assembler appeared to outperform all others (SGA, in this case). We will return to this issue. Note that there were no evaluation entries for snake.

Page 159: Genome assembly: then and now (with notes) — v1.2

Another way of looking at all of this data is to plot the Z-scores for each metric as a heat map (red = higher Z-scores).

Page 160: Genome assembly: then and now (with notes) — v1.2

A parallel coordinates plot is another way of trying to show all of the information at once. Although you can try to show all of the results in a single figure, it doesn't always mean that you should; this plot is perhaps not easy to make sense of.

Page 161: Genome assembly: then and now (with notes) — v1.2

What does this all mean?

Page 162: Genome assembly: then and now (with notes) — v1.2

No really, what does this all mean?

Still a bit hard to make sense of the overall rankings. What are the main findings from our paper?

Page 163: Genome assembly: then and now (with notes) — v1.2

Some conclusions

✤ Very hard to find assemblers that performed well across all 10 key metrics

✤ Assemblers that perform well in one species do not always perform as well in another

✤ Bird & snake assemblies appear better than fish

✤ No real 'winner' for bird and fish

This type of news is perhaps disappointing to many.

Page 164: Genome assembly: then and now (with notes) — v1.2

SGA — best assembler for snake?

Even if we had happened to use 9 key metrics rather than 10, and even if we threw out the metric where SGA performed the best, it would still probably rank 1st. So is that the end of the story?

Page 166: Genome assembly: then and now (with notes) — v1.2

Description | Rank of snake SGA assembly
NG50 scaffold length | 2
NG50 contig length | 5
Amount of assembly in 'gene-sized' scaffolds | 7
Number of 'core genes' present | 5
Fosmid coverage | 2
Fosmid validity | 2
Short-range scaffold accuracy | 3
Optical map: level 1 | 2
Optical map: levels 1–3 | 1
REAPR summary score | 2

SGA only ranked 1st in one of the ten key metrics and ranked 7th in another. So it is a good assembler *on average*. But if one of these metrics was highly important to you, you may want to use an assembler that ranked higher in that metric.

Page 168: Genome assembly: then and now (with notes) — v1.2

We found it interesting that the best bird assembly was the evaluation entry by Baylor College of Medicine. What is different about this entry compared to their competitive entry?

Page 174: Genome assembly: then and now (with notes) — v1.2

BCM bird assemblies

| Assembler | Final rank | NGS data used in assembly | Coverage Z-score | Validity Z-score | NG50 contig Z-score |
| --- | --- | --- | --- | --- | --- |
| BCM - evaluation | 1 | Illumina + 454 | +2.0 | +1.4 | +1.5 |
| BCM - competitive | 2 | Illumina + 454 + PacBio | –0.3 | –0.8 | +2.7 |

The only difference is that the BCM competitive entry included PacBio data, and somehow this led to the paradoxical situation where including more sequence in the assembly produced lower measures of coverage and validity (from the Fosmids), though one key metric (NG50 contig length) did improve.
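
As a rough illustration of how Z-scores let you compare metrics that are measured in different units, here is a small Python sketch. The raw values passed to `z_scores` are hypothetical, and the use of the population standard deviation is an assumption. The BCM numbers are the three Z-scores shown on the slide; adding up only these three is for illustration, since the full comparison involved ten key metrics.

```python
import statistics


def z_scores(values):
    """Convert raw metric values to Z-scores (population SD is an assumption)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical raw values for one metric across three assemblies.
example = z_scores([10.0, 20.0, 30.0])
print([round(z, 2) for z in example])

# Z-scores shown on the slide for the two BCM bird entries.
bcm = {
    "BCM - evaluation": {"coverage": 2.0, "validity": 1.4, "ng50_contig": 1.5},
    "BCM - competitive": {"coverage": -0.3, "validity": -0.8, "ng50_contig": 2.7},
}

# Sum over just these three metrics, rounded to one decimal place.
partial_sums = {name: round(sum(z.values()), 1) for name, z in bcm.items()}
print(partial_sums)
```

On these three metrics alone, the gain in NG50 contig length does not make up for the drop in Fosmid coverage and validity.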

Page 177: Genome assembly: then and now (with notes) — v1.2

BCM evaluation scaffold

NNNNNNNNNNNNNNNNNNN

BCM competition scaffold

NNNNNNNNNNNNNNNNNNN

PacBio sequence

BCM used PacBio data in a targeted way…the PacBio reads were used to help fill in the gaps in their scaffolds.

Page 179: Genome assembly: then and now (with notes) — v1.2

BCM evaluation scaffold

NNNNNNNNNNNNNNNNNNN

BCM competition scaffold

CGTCGNNATCNNGGTTACG

Mismatches from PacBio sequence penalized alignment score more than matching unknown bases

Errors in the PacBio sequence were penalized by the choice of alignment program used to align Fosmid sequences to scaffolds.
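
Here is a toy Python sketch of why a gap-filled scaffold can score worse than one full of Ns. The scoring values are hypothetical and chosen only for illustration (they are not the settings of any real alignment program), and the short sequences are made up.

```python
# Hypothetical scoring scheme: a mismatch costs more than an unknown base.
MATCH = 1
MISMATCH = -3
UNKNOWN = 0  # score when either base is N


def ungapped_score(seq_a, seq_b):
    """Score two equal-length sequences position by position."""
    total = 0
    for a, b in zip(seq_a, seq_b):
        if a == "N" or b == "N":
            total += UNKNOWN
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total


fosmid = "ACGTACGTAC"           # trusted reference sequence (made up)
scaffold_with_ns = "ACGNNNNTAC"  # gap left as unknown bases
scaffold_filled = "ACGTTCCTAC"   # gap filled, but with two wrong bases

print(ungapped_score(fosmid, scaffold_with_ns))  # 6 matches, 4 unknown
print(ungapped_score(fosmid, scaffold_filled))   # 8 matches, 2 mismatches
```

Under this scheme the filled scaffold has more correct bases but a lower score, which is the same kind of effect described in the notes.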

Page 180: Genome assembly: then and now (with notes) — v1.2

The choice of one command-line option, used by one tool in the calculation of one key metric...

...probably made enough difference to drop the PacBio-containing assembly to 2nd place.

This was actually down to the use of a single command-line option in the lastz alignment program. If we had not chosen this option, the PacBio-containing entry would have probably ranked 1st among all bird assemblies.

Page 181: Genome assembly: then and now (with notes) — v1.2

Other conclusions

✤ Different metrics tell different stories

✤ Heterozygosity was a big issue for bird & fish assemblies

✤ Final rankings very sensitive to changes in metrics

✤ N50 is a semi-useful predictor of assembly quality

The last point may disappoint some. Despite looking at many different metrics, N50 scaffold length still does a reasonable job of predicting overall quality. However...

Page 182: Genome assembly: then and now (with notes) — v1.2

...the outliers in this relationship should be noted. The highlighted bird assembly had the second highest scaffold N50 length, but ranked 6th among bird assemblies.
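
For reference, N50 and NG50 can be calculated in a few lines. This is a generic Python sketch with made-up scaffold lengths; NG50 is treated here as the same calculation using an estimated genome size in place of the total assembly size.

```python
def n50(lengths, genome_size=None):
    """Return N50 of the lengths, or NG50 if an estimated genome size is given."""
    total = genome_size if genome_size is not None else sum(lengths)
    target = total / 2
    running = 0
    # Walk from the longest sequence down until half the target size is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # assembly covers less than half of the target size


scaffolds = [80, 70, 50, 40, 30, 20, 10]  # made-up scaffold lengths (total 300)

print("N50:", n50(scaffolds))
print("NG50:", n50(scaffolds, genome_size=400))
```

Because the statistic depends only on sequence lengths, an assembly can have a high N50 and still contain plenty of errors, which is one way outliers like the highlighted bird assembly arise.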

Page 186: Genome assembly: then and now (with notes) — v1.2

Inter-specific differences matter

✤ The three species have genomes with different properties

  ✤ repeats

  ✤ heterozygosity

✤ The three genomes had very different NGS data sets

  ✤ Only bird had PacBio & 454 data

  ✤ Different insert sizes in short-insert libraries

Biological differences may account for differences in assembler performance between different species. However, the input data for each species was also very different and this may play a role as well (some assemblers prefer certain short-insert sizes).

Page 188: Genome assembly: then and now (with notes) — v1.2

The Big Conclusion

"You can't always get what you want" (Sir Michael Jagger, 1969)

People would like an assembler that consistently performs well across most (all?) metrics and across most species. We didn’t find such an assembler in the Assemblathon 2 contest.

Page 191: Genome assembly: then and now (with notes) — v1.2

What comes next?

3?

There may one day be an Assemblathon 3 but there are no immediate plans (and no funding for us at UC Davis to do so).

Page 199: Genome assembly: then and now (with notes) — v1.2

A wish list for Assemblathon 3

✤ Only have 1 species

✤ Teams have to 'buy' resources using virtual budgets

✤ Factor in CPU time/cost?

✤ Agree on metrics before evaluating assemblies!

✤ Encourage experimental assemblies

✤ Use FASTG or GFA genome assembly file format?

✤ Get someone else to write the paper!

If there is to be an Assemblathon 3, here are some things that we have learned from Assemblathon 2.

Page 200: Genome assembly: then and now (with notes) — v1.2

But maybe we don't need an Assemblathon 3?

Page 201: Genome assembly: then and now (with notes) — v1.2

nucleotid.es is a new website that aims to a) provide a catalog of modern genome assemblers and b) evaluate their performance using some standardized sets of input read data.

Page 202: Genome assembly: then and now (with notes) — v1.2

It uses Docker containers to help make the software easy for others to run. People are encouraged to upload 'dockerized' versions of their assemblers.

Page 203: Genome assembly: then and now (with notes) — v1.2

The website also allows benchmarks for different versions of the same assembler, e.g. either using different parameter options, or different pre- and post-assembly filtering steps.

Page 205: Genome assembly: then and now (with notes) — v1.2

These are some of the current 'winners' on the nucleotid.es site. Hopefully, more people will start using this site and maybe we won't ever need to have a dedicated Assemblathon 3 contest.

Page 206: Genome assembly: then and now (with notes) — v1.2

Intermission

And now a break in the scheduled program in order to let me vent a little steam.

Page 209: Genome assembly: then and now (with notes) — v1.2

NGS must die!

‘NGS’ is used to refer to everything post-Sanger

Pyrosequencing was developed ~1996

Next-generation sequencing (NGS) is heavily used as a convenient label for modern sequencing technologies. But those technologies have, in some cases, been in development since the mid 1990s. Do we refer to everything from the last 20 years as ‘next’ generation?

Page 210: Genome assembly: then and now (with notes) — v1.2

There are over 5,000 papers in Google Scholar which feature ‘Next-generation sequencing’ or ‘NGS’ in the title of the article. These titles do not help you if you are trying to find papers that focus on pyrosequencing or nanopore sequencing. How could we improve these titles?

Page 211: Genome assembly: then and now (with notes) — v1.2

In many cases, including ‘next-generation’ adds nothing to the description of the paper. Here are the same paper titles with the words 'next-generation' removed.

Page 217: Genome assembly: then and now (with notes) — v1.2

NGS madness

Next generation sequencing

aka second generation sequencing

but there’s also: third generation sequencing

fourth generation sequencing

next-next generation sequencing

next-next-next generation sequencing

Some people have tried alternative names. These are all descriptions that have been used in published papers.

Page 219: Genome assembly: then and now (with notes) — v1.2

NGS madness

| Technology | According to some papers… | According to other papers… |
| --- | --- | --- |
| Complete Genomics | 2nd generation | 3rd generation |
| Ion Torrent | 2nd generation | 3rd generation |
| PacBio | 2nd generation | 3rd generation |
| Oxford Nanopore | 3rd generation | 4th generation |

And of course, not everyone agrees on what is 2nd, 3rd, or 4th generation!

Page 220: Genome assembly: then and now (with notes) — v1.2

NGS madness

“PacBio is a 2.5th generation”

“Helicos lies between the transition of next-generation to third generation”

And of course, someone also has to be different!

Page 223: Genome assembly: then and now (with notes) — v1.2

NGS madness

There are different sequencing methodologies, and there are different sequencing platforms.

Use one or the other.

Or just say ‘current sequencing technologies’.

I would suggest that it is more helpful to refer to different sequencing technologies by their methodology (sequencing by synthesis, pyrosequencing, nanopore sequencing etc.), or by the company developing the product (PacBio, Illumina etc.).

Page 224: Genome assembly: then and now (with notes) — v1.2

Intermission

And now back to our scheduled programming.

Page 225: Genome assembly: then and now (with notes) — v1.2

My #1 piece of advice

flickr.com/julia_manzerova

If you ever have to work with genome assemblies, here is my top piece of advice.

Page 227: Genome assembly: then and now (with notes) — v1.2

flickr.com/thomashawk

Look at your data!

Look at your *input* data (what goes into the assembler) and *output* data (what comes out of the assembler). And really look at it (in a Unix terminal).

Page 228: Genome assembly: then and now (with notes) — v1.2

I am frequently asked to run CEGMA for people (to assess the completeness of their genome assembly). I track the CEGMA results (using a narrower set of 248 of the most conserved core genes) and also record the N50 scaffold length. Even with just two metrics, there is a lot of variation.

Page 236: Genome assembly: then and now (with notes) — v1.2

From a vertebrate genome assembly with 72,214 sequences…

Length of 10 shortest sequences: 100, 100, 99, 88, 87, 76, 73, 63, 12, and 3 bp!

In one particular assembly, nearly all of the sequence was represented by incredibly short scaffolds. The shortest sequence in the assembly was 3 bp. Assemblies like this are not likely to be useful for anything. Unsurprisingly, this assembly didn’t contain any core genes.
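
Checking for this sort of thing only takes a few lines of code. Below is a minimal Python sketch (one of many ways to do it) that reads FASTA-formatted sequences and lists the shortest ones. The tiny in-memory 'assembly' and the file name mentioned in the comment are made up for illustration.

```python
import io


def fasta_lengths(handle):
    """Yield (name, length) for every record in a FASTA-formatted text stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            name, length = line[1:].strip(), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


# Tiny made-up assembly; with real data, pass open("assembly.fa") instead
# ("assembly.fa" is a hypothetical file name).
demo = io.StringIO(
    ">scaffold_1\nACGTACGTAC\nACGT\n"
    ">scaffold_2\nACG\n"
    ">scaffold_3\nACGTACGTACGT\n"
)

# Sort records by length and keep (up to) the ten shortest.
lengths = sorted(fasta_lengths(demo), key=lambda record: record[1])
shortest = lengths[:10]

for name, length in shortest:
    print(f"{name}\t{length} bp")
```

If the shortest sequences in your assembly turn out to be a handful of bases long, that is worth knowing before you run anything else on it.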

Page 237: Genome assembly: then and now (with notes) — v1.2

For some of the CEGMA runs that I have made, I’ve noted which assembler was used…

Page 238: Genome assembly: then and now (with notes) — v1.2

These results show that any assembler can be used to make a bad genome assembly. There is no one assembler which consistently performs well (as assessed by these two metrics). Note that these assemblies were generated from many different species.

Page 239: Genome assembly: then and now (with notes) — v1.2

Reasons to be cheerful

flickr.com/danielygo

After sounding quite pessimistic so far, here are some more positive reasons why genome assembly might be getting better.

Page 240: Genome assembly: then and now (with notes) — v1.2

Improvements in sequencing technology will lead to improvements in genome assembly

Page 241: Genome assembly: then and now (with notes) — v1.2

Data from Lex Nederbragt’s blog, June 2014

Sequencing technologies continue to improve. 10,000 bp is sort of a ‘breakthrough’ length that would greatly assist genome assembly. Producing many reads that are >10,000 bp means that you can sequence all the way through most eukaryotic repeats (which are one of the two major scourges for genome assemblers).

Page 243: Genome assembly: then and now (with notes) — v1.2

Long-read technology

Moleculo read data from Illumina BaseSpace, July 2013

Moleculo (now owned by Illumina) can take Illumina reads and somehow (not sure anyone knows the science behind how it works) combine them to make much longer reads.

Page 244: Genome assembly: then and now (with notes) — v1.2

Long-read technology

From https://flxlexblog.wordpress.com (Lex Nederbragt's blog)

PacBio data

Library preparation is a hugely important part of the genome assembly process. The Blue Pippin library prep greatly improves the number of super long PacBio reads.

Page 245: Genome assembly: then and now (with notes) — v1.2

Long-read technology

MinION from Oxford Nanopore

Oxford Nanopore burst on to the scene and excited everyone. But people have had to wait for the chance to use MinION devices for themselves. The UC Davis Genome Center recently received 3 MinIONs as part of the early access program.

Page 249: Genome assembly: then and now (with notes) — v1.2

Where is the data?

Nick Loman published the first real-world data on June 10th

Nick Loman was the first person to publish a ‘real world’ read from these devices.

Page 250: Genome assembly: then and now (with notes) — v1.2

He also shared some of the statistics from his entire run. This nanopore sequencing technology seems limited by how large your DNA fragments are. It may be possible to generate much longer reads.

Page 252: Genome assembly: then and now (with notes) — v1.2

Nick also released the first MinION dataset on September 10th

An E. coli dataset was released to the GigaDB database (http://gigadb.org)

Page 253: Genome assembly: then and now (with notes) — v1.2

Although Illumina holds such a strong position in the world of sequencing, other companies continue to work on new sequencing technologies.

Page 254: Genome assembly: then and now (with notes) — v1.2

Base4 are developing a 'microdroplet sequencing' approach. All new technologies seem keen to target the world of 'single molecule' sequencing, with very long reads, and real-time (or 'near' real-time in this case) results.

Page 255: Genome assembly: then and now (with notes) — v1.2

PicoSeq have developed a technology called SIMDEQ (Single-molecule Magnetic Detection and Quantification) which has the potential to be used to generate long reads from single molecules. Maybe companies like Base4 and PicoSeq will never usurp Illumina, but it is good to see people trying to develop new technologies. Competition will help drive down the price of sequencing.

Page 256: Genome assembly: then and now (with notes) — v1.2

Some other ways to tackle the problems inherent in genome assembly

Page 257: Genome assembly: then and now (with notes) — v1.2

Single chromosome assembly?

Breaking the problem up into smaller chunks may be one other way of tackling the genome assembly problem (though many single chromosomes in eukaryotes are still very long).

Page 258: Genome assembly: then and now (with notes) — v1.2

Tackling heterozygosity

1000 Genomes project plans to sequence 15 'trios' in high-depth

The second major problem for genome assemblers is the heterozygosity that is present in most (diploid) genomes. The 1000 Genomes project is trying to tackle this by sequencing ‘trios’ (an individual plus their parents) and will try to use the combination of datasets to resolve the heterozygosity.
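
To show why parental data helps, here is a toy Python sketch that works out which allele a child inherited from each parent at a single site. It is a simplified illustration of the general idea only, not the method used by the 1000 Genomes project; the function name and genotypes are made up.

```python
def phase_with_parents(child, mother, father):
    """Return (maternal_allele, paternal_allele), or None if it cannot be resolved.

    Each genotype is a two-character string of alleles at one site, e.g. "AG".
    """
    first, second = child
    options = set()
    # Try both ways of assigning the child's two alleles to the two parents.
    if first in mother and second in father:
        options.add((first, second))
    if second in mother and first in father:
        options.add((second, first))
    # Exactly one consistent assignment means the site can be phased.
    return options.pop() if len(options) == 1 else None


print(phase_with_parents("AG", "AA", "AG"))  # mother can only have given A
print(phase_with_parents("AG", "GG", "AA"))  # mother can only have given G
print(phase_with_parents("AG", "AG", "AG"))  # everyone heterozygous: unresolved
```

Sites where all three individuals are heterozygous stay ambiguous in this toy version, which hints at why high-depth data from the whole trio is useful.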

Page 259: Genome assembly: then and now (with notes) — v1.2

Hi-C

✤ Nature Biotechnology, 31, 2013

✤ Burton et al.

✤ Selvaraj et al.

✤ Kaplan & Dekker

Hi-C is another new technology that might be able to improve the scaffolding step of genome assembly.

Page 261: Genome assembly: then and now (with notes) — v1.2

Kwik-E-Assembler

acgtaacacaancac gggaacnnnacatta acnactagcataata nnnnnnnnnnaacac actttaaattatatc

The future of genome assembly

Maybe one day, genome assembly will be as simple as downloading a sequence to your iPhone and clicking ‘assemble’. That day is still some time away.

Page 262: Genome assembly: then and now (with notes) — v1.2

The future of genome assembly

Currently a lot of effort is spent generating huge datasets in order to produce a final genome assembly. There are hundreds of genome assemblies out there which are very poor and incomplete. In many cases we don’t know just how good or bad they are. The pace of change in this field means it will often be easier to simply resequence and reassemble a genome rather than attempt to work with a previous genome assembly. !Even if assembly improves, there will be lots of data to manage in future. And people will have their genomes sequenced at different time points throughout their lives (part of your ‘genome checkup’ at the doctors?).

Page 263: Genome assembly: then and now (with notes) — v1.2

The future of genome assembly

✤ At some point we will look back with embarrassment at this era.

Currently a lot of effort is spent generating huge datasets in order to produce a final genome assembly. There are hundreds of genome assemblies out there which are very poor and incomplete. In many cases we don’t know just how good or bad they are. The pace of change in this field means it will often be easier to simply resequence and reassemble a genome rather than attempt to work with a previous genome assembly. !Even if assembly improves, there will be lots of data to manage in future. And people will have their genomes sequenced at different time points throughout their lives (part of your ‘genome checkup’ at the doctors?).

Page 267: Genome assembly: then and now (with notes) — v1.2

The future of genome assembly

✤ At some point we will look back with embarrassment at this era.

✤ Assembly must, and will, get better, but...

✤ ...'perfect' genomes may remain elusive.

✤ Data management will remain an issue:

✤ the human genome -> human genomes -> tissue-specific genomes

Currently a lot of effort is spent generating huge datasets in order to produce a final genome assembly. There are hundreds of genome assemblies out there which are very poor and incomplete. In many cases we don’t know just how good or bad they are. The pace of change in this field means it will often be easier to simply resequence and reassemble a genome rather than attempt to work with a previous genome assembly.

Even if assembly improves, there will be lots of data to manage in the future. And people will have their genomes sequenced at different time points throughout their lives (part of your ‘genome checkup’ at the doctor’s?).

Page 273: Genome assembly: then and now (with notes) — v1.2

Summary

✤ There is no real consensus on how to make a good genome assembly

✤ Try different assemblers, try different command-line options

✤ Decide what it is you want to get out of a genome assembly

✤ Look at your input and output data

✤ Wait 5 years and come back, we’ll (probably) have solved everything!

The last point on this slide is something that I repeat every 5 years!
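
One way to act on "try different assemblers" and "look at your output data" is to compute the same basic statistics for every assembly you produce and compare them side by side. The sketch below is only an illustration, not a method from the talk: it assumes each assembler writes its contigs or scaffolds as a plain FASTA file, and the function names (`read_fasta_lengths`, `n50`, `summarize`) are made up for this example.

```python
def read_fasta_lengths(lines):
    """Yield the length of each sequence found in FASTA-formatted lines."""
    length = None  # None until the first '>' header has been seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all bases in the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def summarize(lines):
    """Collect a few basic statistics for one assembly."""
    lengths = list(read_fasta_lengths(lines))
    return {
        "sequences": len(lengths),
        "total_bases": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


if __name__ == "__main__":
    # Toy input standing in for a real assembly file (hypothetical data).
    toy = [">a", "ACG", "TA", ">b", "ACGT", ">c", "ACG", ">d", "AC", ">e", "A"]
    print(summarize(toy))
```

For real data you would pass an open file handle instead of the toy list, once per assembler output, and compare the resulting dictionaries. A larger N50 is not automatically a better assembly (joining contigs incorrectly also raises it), so treat these numbers as a starting point for looking at your data rather than a verdict.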

Page 274: Genome assembly: then and now (with notes) — v1.2

Useful blogs/tweeps to follow

Lex Nederbragt (@lexnederbragt)

flxlexblog.wordpress.com

Nick Loman (@pathogenomenick)

pathogenomic.bham.ac.uk/blog

Mick Watson (@BioMickWatson)

biomickwatson.wordpress.com

These people use their blogs to write about the latest and greatest news in the worlds of sequencing and genome assembly. Their Twitter accounts are also worth following.

Page 275: Genome assembly: then and now (with notes) — v1.2

Thank you for listening!

@kbradnam @assemblathon

My blog: http://acgt.me

And here are some of the ways to follow what I do. My ACGT blog is an outlet for many of my frustrations about the world of genomics and bioinformatics :-)

Page 276: Genome assembly: then and now (with notes) — v1.2
Page 277: Genome assembly: then and now (with notes) — v1.2

Any questions???