Genotoul Bioinfo & Sigenae Genome assembly work Christophe Klopp June 2018



Assemblies in Toulouse

http://www1.rfi.fr/actufr/articles/061/article_33204.asp


LIPM results


Genotoul Bioinfo projects

https://umr-lstm.cirad.fr/principaux-projets/aeschynod


Sigenae projects

● Honey bee genome (GenPhyse): 250 Mb

● Catfish genome (USDA): 1 Gb

● Long collaboration with LPGP Rennes

– Major sex determinant

– JavaFish genome, in collaboration with a lab in Japan: 900 Mb

– Accepted projects
  ● SEX'N PERCH (American and European perch)
  ● STURGEONoMIX (Huso huso)
  ● GENOFISH (7 species)
  ● SIBER'SEX (Acipenser baerii)

– Range: 700 Mb to 4.5 Gb

https://oceana.org/marine-life/ocean-fishes/beluga-sturgeon


Why do biologists need good genome assemblies?

● Because it is a reference for the community working on the species, which experts can curate and improve

● It is much easier to work with:

– Many experiments can be analyzed by read-to-reference mapping:
  ● Genetic analysis: which gene explains this phenotype? What variations are found in the genome of this species?
  ● Transcriptomics: which genes are differentially expressed between these conditions?
  ● Epigenetics: which part of the genome is not accessible in these conditions?


What is a good genome assembly?

● Species community point of view:

– Representing the breeds used by the community

– Close to the population average

● Genome point of view:

– Representing all the chromosomes

– Covering all base pairs

– Containing all genes, well formed and correctly located

● Should it be unique?


What makes genomes difficult to assemble?

● Genome size

● Number of chromosomes

● Repeat content

● Repeat size (structure)

● Heterozygosity

● Ploidy

● DNA conformation

● Contamination

https://en.wikipedia.org/wiki/Genome_size


Sturgeon karyotype

http://www.unife.it/dipartimento/biologia-evoluzione/progetti/geneweb/immagini/babur73_g.jpg

2n ~ 209...249


Brief Get-Plage sequencer history

[Timeline figure, axis 2008 to 2018: Roche 454, Illumina, PB RSII, ONT]

● Illumina NovaSeq: 02/18, 6 TB/run

● ONT PromethION: 09/18, 7 TB/run


Sequencing technologies

Oxford Nanopore

● Very long reads (up to 500 kb)

● High throughput

● High error rate

● Non-random errors

Illumina

● Short reads (2 × 150 bp)

● High throughput

● Low error rate

● Random errors


10X Library preparation

http://gqinnovationcenter.com/services/sequencing/chromium.aspx?l=f

http://sjackman.ca/tigmint-slides/#/tigmint


Understanding the studied genome

https://academic.oup.com/bioinformatics/article/33/14/2202/3089939

1/ Get the genome size: http://genomesize.com/

2/ Analyze the genome k-mer content: http://qb.cshl.edu/genomescope/
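
The k-mer analysis in step 2 starts from a histogram of k-mer multiplicities. The Python sketch below shows one way such a histogram could be built for a handful of reads held in memory. It is a toy illustration only: the function names `canonical` and `kmer_spectrum` are hypothetical, and real read sets need a dedicated k-mer counter rather than a Python dictionary. The default k = 21 mirrors the value shown in the GenomeScope output later in the deck.

```python
from collections import Counter

# Translation table used to build the reverse complement of a k-mer.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k=21):
    """Map each multiplicity to the number of distinct canonical k-mers seen that often."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip windows containing ambiguous or non-ACGT characters.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    histogram = Counter(counts.values())
    return dict(sorted(histogram.items()))


if __name__ == "__main__":
    toy_reads = ["ACGTACGT"]  # made-up read, for illustration only
    print(kmer_spectrum(toy_reads, k=4))
```

In a diploid genome, a histogram like this typically shows one peak for heterozygous k-mers at about half the coverage of the homozygous peak, which is the signal a k-mer profiling tool fits.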


Pearl oyster: 2.6% heterozygosity


Impact of heterozygosity

https://academic.oup.com/bioinformatics/article/33/14/2202/3089939


Crassostrea gigas !!

GenomeScope version 1.0
k = 21

property                 min              max
Heterozygosity           6.09571%         6.17102%
Genome Haploid Length    720,368,835 bp   723,451,801 bp
Genome Repeat Length     647,690,896 bp   650,462,822 bp
Genome Unique Length     72,677,940 bp    72,988,980 bp
Model Fit                93.522%          97.9978%
Read Error Rate          0.733165%        0.733165%
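
One way to read these numbers is to express the repeat and unique lengths as fractions of the haploid length. The short Python sketch below does that arithmetic on the "max" column copied from the output above. The variable names are illustrative, and the interpretation (that the model labels most of this genome as repeated) is a reading of the figures, not a statement made on the slide.

```python
# "max" column values copied from the GenomeScope output above (in bp).
haploid_length = 723_451_801
repeat_length = 650_462_822
unique_length = 72_988_980

# Share of the modelled haploid genome assigned to each class.
repeat_fraction = repeat_length / haploid_length
unique_fraction = unique_length / haploid_length

print(f"repeat: {repeat_fraction:.1%}, unique: {unique_fraction:.1%}")
# prints "repeat: 89.9%, unique: 10.1%"
```

With about 6% heterozygosity and roughly 90% of the modelled length classed as repeat, this profile is far from the simpler pearl oyster case shown earlier.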


Our assembly strategy

[Figure: assembly workflow]

● Inputs: long reads, 10X short reads, Hi-C short reads, genetic map markers, short reads

● Steps: filtering, cleaning, correction, assembly, polishing

● Outputs: contigs, scaffolds, chromosomes


Assembly software packages

● ONT:

– Adapter trimming: porechop

– Correction: Canu

– Assembly: Canu, smartdenovo, RA

– Polishing: Racon

● 10X:

– Polishing: Pilon

– Scaffolding: arcs

● Hi-C 'chromosoming': 3d-dna

● Genetic map 'chromosoming': ALLMAPS


Contig assembly metrics

Species                  Assembler (reads)  Genome size  Coverage  Total contig size  Contig count  N50 (bp)    L50
Caenorhabditis elegans   Falcon (PB)        100 Mb       47X       101.72 Mb          122           2,022,653   17
Aeschynomene evenia      Hgap (PB)          400 Mb       120X      374.45 Mb          5,711         648,407     87
Ictalurus punctatus      Falcon (PB)        900 Mb       120X      826.04 Mb          1,554         4,431,159   50
Oryzias javanicus        Falcon (PB)        900 Mb       50X       865.44 Mb          1,286         3,821,811   59
Arabidopsis thaliana     Canu (ONT)         120 Mb       81X       121.09 Mb          197           16,062,269  5
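
For readers unfamiliar with the N50 and L50 columns: N50 is the length of the contig at which the running total, taken from the longest contig down, first reaches half of the total assembly size, and L50 is the number of contigs needed to get there. A minimal Python sketch of that definition follows. The function name `n50_l50` and the toy contig lengths are illustrative and are not taken from the assemblies in the table.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running_total = 0
    for rank, length in enumerate(ordered, start=1):
        running_total += length
        # The first contig that pushes the running total to half the
        # assembly size defines N50; its rank is L50.
        if running_total >= half_total:
            return length, rank
    raise ValueError("contig_lengths must not be empty")


if __name__ == "__main__":
    toy_lengths = [80, 70, 50, 40, 30, 20]  # made-up lengths, total 290
    print(n50_l50(toy_lengths))  # prints (70, 2)
```

A higher N50 combined with a lower L50 means the same amount of sequence is held in fewer, longer contigs.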


Scaffolding

American perch


Building chromosomes: map


Building chromosomes: Hi-C

http://www.abcam.com/protocols/getting-started-with-chromatin-conformation-capture-3c

Perca flavescens

Hi-C protocol


Polishing impact

https://busco.ezlab.org/


Assembly validation

● Assembly metrics: assemblathon_stats.pl

● Single-copy gene content evaluation: BUSCO

● Telomeric repeats
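
The telomeric repeat check can be illustrated with a small Python sketch: count copies of the telomere motif near each end of a sequence, on either strand. The slides do not say how this check was run, so everything here is an assumption made for illustration, including the function name `telomere_copies`, the 1,000 bp window and the default motif. TTAGGG is the repeat unit of vertebrate telomeres; other clades use other motifs, so it is left as a parameter.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def telomere_copies(sequence, motif="TTAGGG", window=1000):
    """Count motif copies (either strand) in the first and last `window` bases.

    Returns a (start_count, end_count) tuple. For sequences shorter than
    2 * window the two windows overlap.
    """
    reverse_complement = motif.translate(_COMPLEMENT)[::-1]
    pattern = re.compile(f"(?:{motif}|{reverse_complement})")
    sequence = sequence.upper()
    start_count = len(pattern.findall(sequence[:window]))
    end_count = len(pattern.findall(sequence[-window:]))
    return start_count, end_count


if __name__ == "__main__":
    # Made-up sequence: telomere-like repeats on both ends of filler DNA.
    toy = "CCCTAA" * 5 + "ACGT" * 100 + "TTAGGG" * 3
    print(telomere_copies(toy, window=50))  # prints (5, 3)
```

A chromosome-scale sequence with many motif copies at both ends suggests the assembly reaches the telomeres; a sequence with none at either end may be incomplete, although the absence of repeats is not proof on its own.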


Assembly validation 2: bee genome

http://dgenies.toulouse.inra.fr/


Annotation

● Multi-tissue RNA-Seq data

● De novo transcriptome assemblies

● Ab initio gene calling (intron-exon boundary training)

● Still not a very automated process


Computer science questions

● Assembly of heterozygous genomes (with blocks)

● Scaffolding with 10X reads (missing small contigs)

● Optimizing contig correction (unique path)

● Annotation: simplifying and reducing time


Conclusions

● DNA quality is crucial.

● Sequencing and assembling a large genome is now affordable for small teams (20 k€ for a 3 Gb genome).

● The PromethION sequencer will lower the price: up to 150G per run (2-3 k€).

● It is possible to reach chromosomes without producing a genetic map.

● Two sequencing technologies are needed to correct non-random sequencing errors.

● We have an open temporary position.

https://www.sfbi.fr/content/cdd-ing%C3%A9nieur-bio-informaticien-assemblage-et-annotation-de-g%C3%A9nomes-de-poissons


Acknowledgments

● Our collaborators

● Alain Roulet, Céline Lopez-Roques, Catherine Zanchetta and the rest of the Get-Plage team

● Cédric Cabau (chromosome generation and annotation)

● Erika Sallet (Fish coding sequence model)

● Jérôme Gouzy (assembly and polishing discussions)