genotoul bioinfo & sigenae genome assembly...
![Page 1: Genotoul Bioinfo & Sigenae Genome assembly workgenoweb.toulouse.inra.fr/.../Genotoul_Sigenae_genome_assembly_w… · 6 Why do biologists need good genome assemblies ? Because it is](https://reader035.vdocuments.site/reader035/viewer/2022070813/5f0c7f0d7e708231d435b187/html5/thumbnails/1.jpg)
Genotoul Bioinfo & Sigenae
Genome assembly work
Christophe Klopp, June 2018
Assemblies in Toulouse
http://www1.rfi.fr/actufr/articles/061/article_33204.asp
LIPM results
Genotoul Bioinfo projects
https://umr-lstm.cirad.fr/principaux-projets/aeschynod
Sigenae projects
● Honey bee genome (GenPhyse) : 250 Mb
● Catfish genome (USDA) : 1 Gb
● Long collaboration with LPGP Rennes
– Major sex determinant
– JavaFish genome, collaboration with a lab in Japan : 900 Mb
– Accepted projects
  ● SEX’N PERCH (American and European perch)
  ● STURGEONoMIX (Huso huso)
  ● GENOFISH (7 species)
  ● SIBER’SEX (Acipenser baerii)
– Range : 700 Mb – 4.5 Gb
https://oceana.org/marine-life/ocean-fishes/beluga-sturgeon
Why do biologists need good genome assemblies?
● Because it is a reference for the community working on the species, one that experts can review and improve
● It is much easier to work with :
– Many experiments can be analyzed by read-to-reference mapping :
  ● Genetic analysis : which gene explains this phenotype? What are the variations found in the genome of this species?
  ● Transcriptomics : which genes are differentially expressed between these conditions?
  ● Epigenetics : which part of the genome is not accessible in these conditions?
What is a good genome assembly?
● Species community point of view :
– Representing the breeds used by the community
– Roughly average for the population
● Genome point of view :
– Representing all the chromosomes
– Covering all base pairs
– Containing all genes, well formed and correctly located
● Should it be unique?
What makes genomes difficult to assemble?
● Genome size
● Number of chromosomes
● Repeat content
● Repeat size (structure)
● Heterozygosity
● Ploidy
● DNA conformation
● Contamination
https://en.wikipedia.org/wiki/Genome_size
Sturgeon karyotype
http://www.unife.it/dipartimento/biologia-evoluzione/progetti/geneweb/immagini/babur73_g.jpg
2n ~ 209...249
Brief Get-Plage sequencer history
[Timeline, 2008 to 2018 : Roche 454, Illumina, PB RSII, ONT]
Illumina NovaSeq : 02/18, 6 To/run
ONT promethion : 09/18, 7 To/run
Sequencing technologies
Oxford Nanopore
● Very long reads (up to 500 kb)
● High throughput
● High error rate
● Non-random errors
Illumina
● Short reads (2 × 150 bp)
● High throughput
● Low error rate
● Random errors
10X Library preparation
http://gqinnovationcenter.com/services/sequencing/chromium.aspx?l=f
http://sjackman.ca/tigmint-slides/#/tigmint
Understanding the studied genome
https://academic.oup.com/bioinformatics/article/33/14/2202/3089939
1/ Get the genome size : http://genomesize.com/
2/ Analyze the genome k-mer content : http://qb.cshl.edu/genomescope/
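The k-mer analysis step can be illustrated with a minimal sketch: count canonical k-mers in a set of reads, then build the multiplicity histogram (how many distinct k-mers occur once, twice, and so on) that GenomeScope takes as input. The function names are hypothetical, and on real data a dedicated counter such as Jellyfish would normally replace this pure-Python loop.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the smaller of a k-mer and its reverse complement,
    so that both strands are counted as the same k-mer."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k=21):
    """Map each multiplicity to the number of distinct canonical
    k-mers seen that many times. k-mers with non-ACGT bases are skipped."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


if __name__ == "__main__":
    # Toy reads, only to show the two-column "multiplicity count" output.
    toy_reads = ["ACGTTGCAACGTTGCA", "TTGCAACGTTGCAACG"]
    for multiplicity, n_kmers in sorted(kmer_histogram(toy_reads, k=5).items()):
        print(multiplicity, n_kmers)
```

In such a histogram, a heterozygous genome typically shows two peaks (one at half the coverage of the other), which is what the heterozygosity estimate is fitted on.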
Pearl oyster : 2.6% heterozygosity
Impact of heterozygosity
https://academic.oup.com/bioinformatics/article/33/14/2202/3089939
Crassostrea gigas !!
GenomeScope version 1.0
k = 21

property                 min              max
Heterozygosity           6.09571%         6.17102%
Genome Haploid Length    720,368,835 bp   723,451,801 bp
Genome Repeat Length     647,690,896 bp   650,462,822 bp
Genome Unique Length     72,677,940 bp    72,988,980 bp
Model Fit                93.522%          97.9978%
Read Error Rate          0.733165%        0.733165%
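The proportions implied by this GenomeScope summary can be checked with simple arithmetic. The sketch below uses the "min" column values as reported above; the variable and function names are illustrative only.

```python
# "min" column of the GenomeScope summary for Crassostrea gigas (values in bp).
HAPLOID_LENGTH = 720_368_835
REPEAT_LENGTH = 647_690_896
UNIQUE_LENGTH = 72_677_940


def fraction(part: int, whole: int) -> float:
    """Share of the haploid genome length covered by one component."""
    return part / whole


repeat_fraction = fraction(REPEAT_LENGTH, HAPLOID_LENGTH)
unique_fraction = fraction(UNIQUE_LENGTH, HAPLOID_LENGTH)

if __name__ == "__main__":
    print(f"repeat: {repeat_fraction:.1%}, unique: {unique_fraction:.1%}")
```

With these figures the model assigns roughly 90% of the estimated haploid length to repeats, alongside a heterozygosity estimate above 6%.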
Our assembly strategy
[Workflow diagram. Inputs : long reads, 10X short reads, Hi-C short reads, genetic map markers. Steps : filtering, cleaning, correction, assembly, polishing. Successive outputs : contigs, scaffolds, chromosomes.]
Assembly software packages
● ONT :
– Adapter trimming : porechop
– Correction : Canu
– Assembly : Canu, smartdenovo, RA
– Polishing : Racon
● 10X :
– Polishing : Pilon
– Scaffolding : arcs
● HiC ‘chromosoming’ : 3d-dna
● Genetic map ‘chromosoming’ : ALLMAPS
Contig assembly metrics
Species                  Assembler   Genome size   Coverage   Total contig size   Contig count   N50 (bp)     L50
Caenorhabditis elegans   Falcon PB   100 Mb        47X        101.72 Mb           122            2,022,653    17
Aeschynomene evenia      Hgap PB     400 Mb        120X       374.45 Mb           5,711          648,407      87
Ictalurus punctatus      Falcon PB   900 Mb        120X       826.04 Mb           1,554          4,431,159    50
Oryzias javanicus        Falcon PB   900 Mb        50X        865.44 Mb           1,286          3,821,811    59
Arabidopsis thaliana     Canu ONT    120 Mb        81X        121.09 Mb           197            16,062,269   5
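For readers unfamiliar with the N50 and L50 metrics reported for these assemblies, here is a minimal sketch of how they are usually computed from a list of contig lengths. The function name is hypothetical; tools such as assemblathon_stats.pl report the same kind of figures.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total,
    taken from the longest contig downwards, first reaches half of
    the assembly size. L50 is the number of contigs needed to get there.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running_total = 0
    for rank, length in enumerate(ordered, start=1):
        running_total += length
        if running_total >= half_total:
            return length, rank


if __name__ == "__main__":
    # Toy assembly of five contigs (30 bases in total).
    print(n50_l50([2, 4, 6, 8, 10]))
```

A higher N50 together with a lower L50 indicates a more contiguous assembly, which is why the Arabidopsis thaliana line (N50 above 16 Mb, L50 of 5) stands out in the table.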
Scaffolding
American perch
Building chromosomes : map
Building chromosomes : Hi-C
http://www.abcam.com/protocols/getting-started-with-chromatin-conformation-capture-3c
Perca flavescens
Hi-C protocol
Assembly validation
● Assembly metrics : assemblathon_stats.pl
● Single copy gene content evaluation : BUSCO
● Telomeric repeats
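The telomeric repeat check can be sketched as follows: count occurrences of a telomere motif, or its reverse complement, in the first and last bases of each assembled sequence. This is only an illustration under stated assumptions: the function names and the window size are hypothetical, and the default motif TTAGGG is the vertebrate repeat, so it has to be changed for other clades.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Reverse complement of an upper-case ACGT string."""
    return sequence.translate(COMPLEMENT)[::-1]


def telomere_counts(sequence: str, motif: str = "TTAGGG", window: int = 600):
    """Count motif copies (either strand) in the first and last
    `window` bases of a sequence. Returns (start_count, end_count)."""
    sequence = sequence.upper()
    motif_rc = reverse_complement(motif)

    def count(segment: str) -> int:
        return segment.count(motif) + segment.count(motif_rc)

    return count(sequence[:window]), count(sequence[-window:])


if __name__ == "__main__":
    # Toy chromosome with ten repeat copies at each end.
    toy = "CCCTAA" * 10 + "ACGT" * 500 + "TTAGGG" * 10
    print(telomere_counts(toy, window=60))
```

High counts at both ends of a scaffold suggest that it spans a chromosome from telomere to telomere, while a high count at one end only points to an incomplete arm.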
Assembly validation 2 : bee genome
http://dgenies.toulouse.inra.fr/
Annotation
● Multi-tissue RNA-Seq data
● De novo transcriptome assemblies
● Ab initio gene calling (intron-exon boundary training)
● Still not a very automatic process
Computer science questions
● Heterozygous (with blocks) genome assembly
● Scaffolding with 10X reads (missing small contigs)
● Optimizing contig correction (unique path)
● Annotation : simplifying and reducing time
Conclusions
● DNA quality is crucial.
● Sequencing and assembling a large genome is now affordable for small teams (20 k€ for a 3 Gb genome).
● The Promethion sequencer will lower the price : up to 150G per run (2-3 k€).
● It is possible to reach chromosomes without producing a genetic map.
● Two sequencing technologies are needed to correct non-random sequencing errors.
● We have an open temporary position.
https://www.sfbi.fr/content/cdd-ing%C3%A9nieur-bio-informaticien-assemblage-et-annotation-de-g%C3%A9nomes-de-poissons
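The Promethion figures quoted in the conclusions can be turned into a rough depth and cost estimate. This is a back-of-the-envelope sketch that assumes "150G" means 150 gigabases of sequence per run and reuses the 3 Gb genome from the same slide; the function names are illustrative.

```python
def coverage_per_run(run_yield_gb: float, genome_size_gb: float) -> float:
    """Sequencing depth (X) obtained from a single run."""
    return run_yield_gb / genome_size_gb


def cost_per_gb(run_cost_eur: float, run_yield_gb: float) -> float:
    """Cost in euros of one gigabase of raw sequence."""
    return run_cost_eur / run_yield_gb


RUN_YIELD_GB = 150.0    # "up to 150G per run", read as gigabases (assumption)
GENOME_SIZE_GB = 3.0    # the 3 Gb genome quoted in the conclusions

depth = coverage_per_run(RUN_YIELD_GB, GENOME_SIZE_GB)
low_cost = cost_per_gb(2000.0, RUN_YIELD_GB)
high_cost = cost_per_gb(3000.0, RUN_YIELD_GB)

if __name__ == "__main__":
    print(f"{depth:.0f}X per run, {low_cost:.1f} to {high_cost:.1f} EUR per Gb")
```

Under these assumptions a single run at the maximum quoted yield would give about 50X of long reads on a 3 Gb genome.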
Acknowledgments
● Our collaborators
● Alain Roulet, Céline Lopez-Roques, Catherine Zanchetta and the rest of the Get-Plage team
● Cédric Cabau (chromosome generation and annotation)
● Erika Sallet (Fish coding sequence model)
● Jérôme Gouzy (assembly and polishing discussions)