The tomato genome re-seq project
University of Florida
http://www.tomatogenome.net
5 February 2013, Richard Finkers & Sjaak van Heusden
Rationale
Genetic diversity in commercial tomato germplasm relatively narrow
Unexploited genetic diversity available in land races and old varieties?
Cultivated tomato has lost valuable traits during domestication
Wild species - source of genetic diversity
● Diverse habitat
● Variation in flowers and fruits
● Variation in mating systems
Most wild species can be crossed with cultivated tomato (introgression breeding)
Rationale
Tomato Genome (Re-)Sequencing Project
● Identify alleles underpinning phenotypic diversity across the entire genome and the entire tomato clade
Acknowledgement: Sjaak van Heusden, Paris market
Tomato fruit shape variation
Rodríguez et al (2011) Plant physiology 156: 275-85
EU-SOL core collection
https://www.eu-sol.wur.nl
Information:
● Marker data
● Phenotype data
● Passport data
Selected landraces for (re-)sequencing, narrowed down with increasing numbers of markers:
● 20 markers: > 7000 -> 1000 landraces
● 384 markers: 1000 -> 200 landraces
● 7500 markers: 200 -> 34 landraces
Acknowledgement: Dani Zamir et al. & Keygene N.V.
Landraces & old cultivar collection
Fruit phenotypes EU-SOL collection
Improving with exotic genetic libraries
Wild tomato species are valuable candidates for novel alleles
Dani Zamir, Nature Reviews Genetics 2, 983-989 (December 2001)
Improving with exotic genetic libraries
Moyle 2008
Phylogenetic relationships in the Solanum clade
Figure: phylogenetic tree with the (re-)sequencing collection marked across the Lycopersicon, Arcanum, Eriopersicon and Neolycopersicon groups
Tree according to Anderson et al. (2010), redrawn from Moyle 2008
Genome Alignment
Read mapping to cv. Heinz
Genome structure of wild tomato relatives?
Lycopersicon group
Arcanum group
Eriopersicon group
Neolycopersicon group
Reference genomes: De novo assembly selection
Heinz1706
LA 2157
LYC 4
LA 716
Data production
84 resequenced genomes
● 500 bp insert, 2 x 100 bp paired-end Illumina
● Average coverage 41x
3 de novo genomes (S. arcanum, S. habrochaites, S. pennellii)
● 170 bp insert, 2 x 100 bp paired-end Illumina
● 2 kb, 2 x 100 bp mate-pair Illumina
● 8 kb mate-pair (454)
● 20 kb mate-pair (454)
● Average coverage 205x
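As a rough check on the sequencing volume behind these coverage figures: average coverage is (number of reads x read length) / genome size. The sketch below assumes a genome of about 760 Mb (the Heinz draft size quoted in the genome size table of this deck). The function names are illustrative and not part of the project's pipeline.

```python
import math


def coverage(n_read_pairs, read_len, genome_size):
    """Average depth from paired-end data (two reads per pair)."""
    return 2 * n_read_pairs * read_len / genome_size


def pairs_needed(target_cov, read_len, genome_size):
    """Read pairs required to reach a target average depth, rounded up."""
    total_bases = target_cov * genome_size
    return math.ceil(total_bases / (2 * read_len))


if __name__ == "__main__":
    genome = 760_000_000  # assumed genome size in bp (Heinz draft, ~760 Mb)
    n = pairs_needed(41, 100, genome)
    print(n, "pairs of 2 x 100 bp give", coverage(n, 100, genome), "x")
```

Under these assumptions, 41x coverage of one accession corresponds to roughly 156 million 2 x 100 bp read pairs.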
Genomic sequencing libraries
K-mer graph
Figure: 31-mer histogram (x-axis: 31-mer frequency, 0 to 100; y-axis: 31-mer volume in millions, 0 to 1000), with a fitted curve (FIT) for each of the samples '001', '045', '046', '053', '054', '058', '072' and '074'
Data: 500 bp, 2x100 bp Paired-end Illumina
Acknowledgement: Theo Borm
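A k-mer histogram of this kind is built by counting every k-mer in the reads and then tallying how many distinct k-mers occur at each frequency. The sketch below is a toy illustration, not the tool used in the project. It keeps all counts in memory, ignores reverse complements, and reports the number of distinct k-mers per frequency, which may differ from the "volume" plotted on the slide.

```python
from collections import Counter


def kmer_counts(reads, k=31):
    """Count every k-mer (length-k substring) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map k-mer frequency -> number of distinct k-mers seen that often."""
    return dict(sorted(Counter(counts.values()).items()))


if __name__ == "__main__":
    # Two overlapping toy reads; k=4 only so the example stays readable.
    reads = ["ACGTACGT", "CGTACGTA"]
    print(kmer_histogram(kmer_counts(reads, k=4)))
```

For real 2 x 100 bp data at 41x coverage a dedicated k-mer counter is needed; the dictionary approach here only shows the principle.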
K-mer exploration
Fitted modes
● Homozygous
● Heterozygous
● Duplicated (2x)
Conclusions
● % heterozygosity is negligible
● Duplicated portion is not negligible
Figure: 31-mer histogram, zoomed (x-axis: 31-mer frequency, 30 to 90; y-axis: 31-mer volume in millions, 0 to 300), with a fitted curve (FIT) for each of the samples '001', '045', '046', '053', '054', '058', '072' and '074'
Genome size estimates
● Genomic k-mer based estimate
● Ignores differences in GC-AT ratio
● Underestimation

Nr  Species  Est. size (Mb)  %CP
01  SL       723             1.9
45  SP       749             1.9
46  SP       775             6.3
53  SG       728             4.4
54  SC       760             6.2
58  SA       830             3.0
72  SH       779             7.1
74  SP       962             8.6

Draft size (Mb): Heinz 760, LA1589 739
Acknowledgement: Theo Borm
The Tomato Genome Consortium Nature 485, 635–641 (2012)
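The principle behind a k-mer based size estimate can be sketched as: genome size is roughly the total number of k-mer occurrences divided by the depth of the main (homozygous) peak. The code below is a simplified illustration and not the fitting procedure used for the table above. The error cutoff `min_freq` and the toy histogram are assumptions made for the example.

```python
def estimate_genome_size(hist, min_freq=5):
    """Estimate genome size as total k-mer occurrences / main-peak depth.

    hist maps k-mer frequency -> number of distinct k-mers at that frequency.
    Frequencies below min_freq are treated as sequencing errors and skipped
    (the cutoff is an assumption chosen for illustration).
    """
    kept = {f: n for f, n in hist.items() if f >= min_freq}
    if not kept:
        raise ValueError("no k-mers at or above min_freq")
    peak = max(kept, key=kept.get)               # depth of the main peak
    total = sum(f * n for f, n in kept.items())  # k-mer occurrences kept
    return peak, total / peak


if __name__ == "__main__":
    # Toy histogram: error k-mers at 1-2x, main peak at 40x, duplicates at 80x.
    toy = {1: 1000, 2: 100, 38: 50, 39: 80, 40: 100, 41: 80, 42: 50, 80: 10}
    peak, size = estimate_genome_size(toy)
    print(f"peak depth {peak}x, estimated size {size:.0f} k-mer positions")
```

Because k-mers in the duplicated (2x) mode are weighted by their frequency, they count twice towards the estimate, which is why a non-negligible duplicated portion matters for the result.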
Optimizing assembly strategy
Checking assembly integrity
Average completeness per 10 contigs: ALL-PATHS (96.62%) CLC-BIO (74.62%)
Heinz dot plot
SL2.40 ch11 – region (1 Mbp)
Status de novo assembly genomes
Assembly                  N50         N90        Longest     Shortest  Mean     Median  N contigs  Total length
Heinz 1706 reference      16,467,796  3,041,128  42,121,211  2,000     242,428  2,847   3,223      781,345,411
S. habrochaites_allpaths  90,424      12,290     990,035     902       43,409   20,461  16,935     735,128,396
S. habrochaites_scaf      515,730     104,925    3,252,897   902       130,475  9,758   5,873      766,277,628
S. pennellii_allpaths     64,671      7,460      627,722     887       27,680   11,008  26,589     735,990,792
S. pennellii_scaf         206,135     38,969     1,269,801   887       49,209   5,932   15,886     781,730,072
S. arcanum_clc            18,651      2,524      241,690     200       2,869    428     290,145    832,461,203
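For readers less familiar with the N50 and N90 columns: Nx is the length L such that sequences of length L or longer together hold at least x% of all assembled bases. A minimal sketch of how such statistics can be computed from a list of contig or scaffold lengths follows. The function names and the toy lengths are illustrative and unrelated to the assemblies in the table.

```python
from statistics import mean, median


def nx(lengths, fraction):
    """Length L such that sequences >= L hold at least `fraction` of all bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return ordered[-1]  # only reached through floating-point rounding


def assembly_summary(lengths):
    """Collect the statistics shown in the table for one assembly."""
    return {
        "N50": nx(lengths, 0.5),
        "N90": nx(lengths, 0.9),
        "longest": max(lengths),
        "shortest": min(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "n": len(lengths),
        "total": sum(lengths),
    }


if __name__ == "__main__":
    print(assembly_summary([100, 80, 60, 40, 20]))
```

With the toy lengths, the two longest sequences (100 + 80) already cover half of the 300 bases, so N50 is 80; reaching 90% takes four sequences, so N90 is 40.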
Conclusions
● Sequencing completed
● Quality and coverage threshold satisfied
● Cleaning resequencing data completed
● De novo assembly of S. habrochaites and S. pennellii comparable with tomato reference
● De novo assembly of S. arcanum in progress
● Read mapping and SNP analysis finished
And now the fun begins...
Average SNP rate/KB (vs. SL2.40)
Homozygous vs Heterozygous feature rate
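A per-kb rate of homozygous and heterozygous variants can be derived from genotype calls roughly as follows. This is a hedged sketch, not the project's analysis: it assumes VCF-style GT strings and a known count of callable reference bases, and the function name is hypothetical.

```python
def variant_rates_per_kb(genotypes, callable_bases):
    """Return (homozygous, heterozygous) variant rates per kb.

    genotypes: iterable of VCF-style GT strings such as "0/1" or "1/1".
    callable_bases: number of reference bases that could be genotyped.
    """
    hom = het = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles or all(a == "0" for a in alleles):
            continue  # missing call or reference genotype: not a variant
        if len(set(alleles)) == 1:
            hom += 1  # e.g. 1/1
        else:
            het += 1  # e.g. 0/1 or 1/2
    per_kb = 1000 / callable_bases
    return hom * per_kb, het * per_kb


if __name__ == "__main__":
    calls = ["1/1", "0/1", "0/0", "./.", "1|1", "1/2"]
    print(variant_rates_per_kb(calls, callable_bases=2000))
```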
Exploring the FW9-2-5 locus (Lin5)
Sucrose synthase gene
Cloned from S. pennellii
Amino acid substitutions:
● 2878 (Asp in LP to Glu in LE)
● 2932 (Asp to Asn)
● 2953 (Val to Leu)
Fridman et al. Proc Natl Acad Sci U S A. 2000 Apr 25;97(9):4718-23.
FW9-2-5 variation (Lin5)
S. galapagense
Needs
● Whole genome variant catalogue
● Annotation for the three wild species genomes
● Pan genome reconstruction
● How good is our sampling?
Perspectives
Direct application for reverse genetics studies
● Use identified allelic variation
● Calculate distance based on all genes?
Better understanding of genome organization
● Improve introgression breeding
● Homozygous vs. heterozygous features
● Scan for inversions
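One possible reading of "calculate distance based on all genes" is a pairwise distance between accessions computed over all genotyped sites. The sketch below is an assumption about how that could look, not a method described in the talk. Genotypes are encoded as alternate-allele counts (0, 1 or 2 for a diploid), with None for missing calls.

```python
def allele_distance(a, b):
    """Mean per-site allele difference between two accessions, in [0, 1].

    a, b: sequences of alternate-allele counts per site (0, 1 or 2),
    with None for missing calls. Sites missing in either accession
    are skipped.
    """
    diffs = [
        abs(x - y) / 2
        for x, y in zip(a, b)
        if x is not None and y is not None
    ]
    if not diffs:
        raise ValueError("no sites called in both accessions")
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    acc1 = [0, 2, 1, None, 2]
    acc2 = [0, 0, 1, 2, 1]
    print(allele_distance(acc1, acc2))
```

Restricting the input to sites inside annotated genes would give a gene-based variant of the same distance.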
Diamond jewelry?
150 tomato genome consortium
Questions
Project site:
● http://www.tomatogenome.net
Phenotype data & Images:
● https://www.eu-sol.wur.nl
SOL100:
● http://solgenomics.net or http://solgenomics.wur.nl
Acknowledgments
Data production ● Elio Schijlen ● Bas te Lintel Hekkert
Quality control
● Saulo Aflitos
Data management and assembly ● Sandra Smit ● Jan van Haarst ● Henri van de Geest ● Lars Smits
Project management
● Sander Peters ● Richard Finkers ● Andries Koops
● Huanwen Zhu ● Minling Xiao ● Tao Ma ● Xiaoli Wang
● Jiumeng Min ● Jie Chen ● Xiaoli Wang
● Jianbo Jian ● Yadan Luo ● Li Liao ● Tina(Na) Xu