Solanaceae Comparative Genomics: PRL Lunchtime Seminar 2009
TRANSCRIPT
Brett Whitty, Buell Lab
April 10, 2013
The Solanaceae
Capsicum annuum
Nicotiana benthamiana
Nicotiana langsdorffii x N. sanderae
Nicotiana tabacum
Solanum tuberosum
Solanum lycopersicum
Petunia x hybrida
Solanum melongena
…and more
Tobacco genome is ~4,500Mb, ~1Gb euchromatin
Sequenced at NCSU, funded by Philip Morris USA ($17.6M) 2004
Methyl-filtration strategy used to sequence gene-rich regions
90% coverage of coding regions (theoretical)
856Mb (in 953,214 assemblies) of m-f reads released late 2008
http://www.pngg.org/tgi/
Tomato genome is ~950Mb, ~220Mb euchromatin
International genome sequencing project started in 2004
Target is 12 finished chromosomes
Sequencing is 41% complete
U.S. effort (chr. 1 & 10) is currently unfunded
http://www.sgn.cornell.edu/about/tomato_sequencing.pl
Potato genome is ~840Mb, ~220Mb euchromatin
International genome sequencing consortium formed in 2006
Target is 12 finished chromosomes
Sequencing is 20% complete
Our lab has been working on chromosome 6
http://www.potatogenome.net
• An integrated resource for publicly available sequence data for the Solanaceae
• Leverage partial genomic and transcriptomic sequence data to provide bioinformatics tools and data that add value to, and improve the usability of, the available data for the Solanaceae community
• Provide consistent annotation of sequence data
• Provide comparative bioinformatics analyses and displays
http://solanaceae.plantbiology.msu.edu
Solanaceae Genomic Sequence Resources
We retrieve any new Solanaceae BAC sequences from GenBank on a weekly basis
This includes sequences from our Potato chr. 6 sequencing project submitted by the sequencing center
We purposefully rely on public sequence databases as the primary repository for sequence data to support and encourage data accessibility
[Figure: Number of Solanaceae BACs in Resource Databases by Species; series: Release 1, Release 2, Release 3]
[Figure: Total Length (in Mbp) of Solanaceae BAC Sequence by Species; y-axis: Megabases; series: Release 1, Release 2, Release 3]
A Brief History of TIGR Gene Indices/TAs
TIGR Gene Indices
2005: John Quackenbush leaves TIGR (Harvard Gene Indices)
2006: Plant group creates TIGR TAs (TIGR Plant TAs)
2007: Robin Buell leaves JCVI
Childs KL, Hamilton JP, Zhu W, Ly E, Cheung F, Wu H, Rabinowicz PD, Town CD, Buell CR, Chan AP. 2007. The TIGR Plant Transcript Assemblies database. Nucleic Acids Res 35: D846-51
PlantGDB-assembled Putative Unique Transcripts (PUTs)
http://www.plantgdb.org/prj/ESTCluster/
Goal of assembly is to provide the closest approximation of a representative transcript set
Available for any plant species with >10,000 ESTs in GenBank; builds for species with <10,000 ESTs are done on request
Currently 11 Sol species have PUTs
[Figure: Number of Solanaceae Transcript Assemblies in Resource Databases by Species]
[Figure: Length (in Mbp) of Solanaceae Transcript Assembly Sequence by Species; y-axis: Megabases]
Annotation of Solanaceae BACs
Our annotation pipeline is run on all Solanaceae genomic sequences publicly available in GenBank
The MAKER gene annotation pipeline software is used to produce gene models by incorporating transcript and protein evidence with ab initio gene finder predictions; these supplement any gene models previously annotated on the assemblies in the public data, should those be present
Total Sol BACs    Total Length (bp)    Public Models*    MAKER Models
1,810             210,050,443          1,135             29,234
*Number of models annotated in BAC GenBank records; does not include public models released by ITAG through SGN
Annotation of Solanaceae BACs (2)
Other computational analyses are performed, including:
Alignment of PlantGDB-assembled Solanaceae transcripts (PUTs) to the genomic sequence using exonerate
Alignment of UniProt's SwissProt & UniRef protein databases to the genomic sequence using exonerate
BLASTP of Solanaceae gene models against model dicot proteomes (Arabidopsis, Grape, Medicago, Poplar)
InterProScan search on the models to identify functional domains
Repeat feature prediction (using RepeatMasker)
ncRNA feature prediction (using tRNAscan-SE and RNAmmer)
…and additional computational analyses
Model Dicot vs. Solanaceae Comparative Genome Browsers
We have created browsers for the public genome releases of Arabidopsis (TAIR8), Grape (v1) and Poplar (v1.1) using the Generic Genome Browser (GBrowse)
Browser tracks:
Model genome public annotation (gene models, repeat regions, etc.)
All Solanaceae PUTs (11 species) aligned to the genomic sequence using exonerate’s est2genome model with a cutoff of 70% identity/70% of the length of the PUT
All Solanaceae PUTs aligned to the model genome’s proteome using TBLASTN with a cutoff of 70% identity/70% of the length of the PUT; alignments are displayed relative to the position of each gene model in the genome
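The 70% identity / 70% length cutoff used for these tracks can be sketched as a simple filter. This is only an illustration: the `Alignment` record and `passes_cutoff` function are hypothetical names, and reading "70% of the length of the PUT" as the fraction of PUT bases covered by the alignment is an assumption, since the slides do not show the actual filtering code.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one PUT-to-genome alignment."""
    put_id: str
    put_length: int          # full length of the PUT, in bp
    aligned_length: int      # number of PUT bases covered by the alignment
    percent_identity: float  # identity over the aligned region, 0-100


def passes_cutoff(aln, min_identity=70.0, min_coverage=70.0):
    """Return True if the alignment meets both the identity and length cutoffs."""
    coverage = 100.0 * aln.aligned_length / aln.put_length
    return aln.percent_identity >= min_identity and coverage >= min_coverage


# Example: keep only the alignments that would be shown in a browser track.
alignments = [
    Alignment("PUT-1", 1000, 700, 70.0),  # exactly at both cutoffs: kept
    Alignment("PUT-2", 1000, 699, 95.0),  # covers 69.9% of the PUT: dropped
    Alignment("PUT-3", 1000, 900, 69.9),  # identity too low: dropped
]
kept = [a.put_id for a in alignments if passes_cutoff(a)]
```

Both thresholds are inclusive here; whether the original cutoff was inclusive is not stated.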
Comparative Mapping to Model Dicot Genomes by BLAST Best Hit
PUTs were mapped to Arabidopsis, Grape and Poplar genes by best TBLASTX hit with an E-value cutoff of 1e-10
                                        Arabidopsis                   Grape                         Poplar
PUTs Species                Total PUTs  # w/BLAST hit  % w/BLAST hit  # w/BLAST hit  % w/BLAST hit  # w/BLAST hit  % w/BLAST hit
Capsicum annuum             15,278      10,292         67.36%         10,481         68.60%         10,589         69.31%
Nicotiana benthamiana       18,037      10,644         59.01%         10,884         60.34%         11,036         61.19%
N. langsdorffii x sanderae  6,791       4,026          59.28%         4,032          59.37%         4,155          61.18%
Nicotiana sylvestris        7,612       4,743          62.31%         4,917          64.60%         4,965          65.23%
Nicotiana tabacum           89,461      35,736         39.95%         37,546         41.97%         37,866         42.33%
Petunia x hybrida           9,884       6,271          63.45%         6,405          64.80%         6,500          65.76%
Solanum chacoense           7,110       5,038          70.86%         5,062          71.20%         5,163          72.62%
Solanum habrochaites        4,024       3,214          79.87%         3,255          80.89%         3,271          81.29%
Solanum lycopersicum        48,945      34,275         70.03%         34,855         71.21%         35,134         71.78%
Solanum pennellii           3,718       2,676          71.97%         2,732          73.48%         2,747          73.88%
Solanum tuberosum           70,344      45,125         64.15%         46,376         65.93%         46,993         66.80%
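As an illustration of the best-hit mapping summarized in this table, here is a minimal sketch that picks one best hit per PUT from BLAST tabular output (the standard 12-column format, with E-value in column 11 and bit score in column 12). The function names, the sample identifiers, and the choice of "highest bit score among hits passing the E-value cutoff" as the definition of best hit are assumptions; the slides do not say how ties or rankings were handled.

```python
def best_hits(tabular_text, max_evalue=1e-10):
    """Return {query_id: (subject_id, evalue, bitscore)} for the best hit per query.

    Expects standard 12-column BLAST tabular output; hits with an E-value
    above max_evalue are ignored.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


def percent_with_hit(n_with_hit, n_total):
    """Percentage of PUTs with a hit, rounded to two decimals as in the table."""
    return round(100.0 * n_with_hit / n_total, 2)


# Made-up sample records (identifiers and scores are illustrative only).
SAMPLE = "\n".join([
    "PUT-1\tgeneA\t85.0\t120\t18\t0\t1\t360\t1\t360\t1e-40\t160",
    "PUT-1\tgeneB\t90.0\t130\t13\t0\t1\t390\t1\t390\t1e-60\t230",
    "PUT-2\tgeneC\t40.0\t60\t36\t0\t1\t180\t1\t180\t1e-6\t50",
])
hits = best_hits(SAMPLE)
```

With the sample above, `PUT-1` maps to `geneB` and `PUT-2` has no hit at the 1e-10 cutoff. The percentage helper reproduces the table's arithmetic, for example 10,292 of 15,278 Capsicum annuum PUTs.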
[Figure: Percentage of Solanaceae PUTs with Arabidopsis, Grape and Poplar TBLASTX Hits (at E <= 1e-10); series: Arabidopsis, Grape, Poplar]
Lineage-Specific Transcript Assemblies
All Solanaceae PUTs
TBLASTX vs. Arabidopsis Genome (TAIR8): keep PUTs with no significant hits at E <= 1e-5
TBLASTX vs. Grape Genome (v1): keep PUTs with no significant hits at E <= 1e-5
TBLASTX vs. Poplar Genome (v1.1): keep PUTs with no significant hits at E <= 1e-5
TBLASTX vs. All PUTs Excluding Solanaceae: keep PUTs with no significant hits at E <= 1e-5
BLASTX vs. Non-Solanaceae UniProt UniRef100
Putative Lineage-Specific Transcript Assemblies (no significant hits at E <= 1e-5)
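The filtering cascade on this slide can be sketched as successive set subtraction: at each stage, any PUT with a hit at E <= 1e-5 is dropped, and whatever survives every stage is called putatively lineage-specific. The function name, the input layout (one dictionary of best E-values per search), and the toy data are assumptions made for illustration, not the pipeline's actual code.

```python
def lineage_specific(put_ids, searches, max_evalue=1e-5):
    """Apply the searches in order, dropping PUTs with a significant hit.

    searches: ordered list of (database_name, {put_id: best_evalue}).
    A PUT absent from a dictionary is treated as having no hit at all.
    Returns (remaining_put_ids, [(database_name, n_remaining), ...]).
    """
    remaining = set(put_ids)
    counts = []
    for name, best_evalue in searches:
        remaining = {
            put for put in remaining
            if best_evalue.get(put, float("inf")) > max_evalue
        }
        counts.append((name, len(remaining)))
    return remaining, counts


# Toy example with four PUTs and three of the five searches.
puts = {"a", "b", "c", "d"}
searches = [
    ("Arabidopsis", {"a": 1e-20, "b": 0.5}),  # "a" has a significant hit
    ("Grape", {"b": 1e-6}),                   # "b" has a significant hit
    ("UniRef100", {"c": 1e-5}),               # E == cutoff counts as a hit
]
specific, per_stage = lineage_specific(puts, searches)
```

Because each stage only removes PUTs, the order of the searches changes the per-stage counts but not the final set.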
Lineage-Specific Transcript Assemblies (2)
                                        Putative Lineage Specific     PUT Length >200bp             With ESTScan Translations
PUT Species                 Total PUTs  #              %              #              %              #              %
Capsicum annuum             15,278      3,262          21.4%          2,648          17.3%          2,012          13.2%
Nicotiana benthamiana       18,037      5,518          30.6%          4,223          23.4%          3,381          18.7%
N. langsdorffii x sanderae  6,791       2,049          30.2%          1,455          21.4%          1,124          16.6%
Nicotiana sylvestris        7,612       1,937          25.4%          1,544          20.3%          1,458          19.2%
Nicotiana tabacum           89,461      42,102         47.1%          35,060         39.2%          29,773         33.3%
Petunia x hybrida           9,884       2,200          22.3%          1,549          15.7%          1,520          15.4%
Solanum chacoense           7,110       1,284          18.1%          1,235          17.4%          843            11.9%
Solanum habrochaites        4,024       434            10.8%          391            9.7%           254            6.3%
Solanum lycopersicum        48,945      9,850          20.1%          7,461          15.2%          5,561          11.4%
Solanum pennellii           3,718       287            7.7%           206            5.5%           179            4.8%
Solanum tuberosum           70,344      17,232         24.5%          15,323         21.8%          12,408         17.6%
SNP Identification Using Transcript Assemblies
Input is the multiple sequence alignments (MSAs) of PUT member sequences provided in the PlantGDB PUTs dataset
We use vmatch to remap ESTs that are near-identical sub-sequence matches to PUT member ESTs; these ESTs are excluded from the PlantGDB assembly process and from the PlantGDB MSA
The SNP-finding script identifies SNPs at alignment positions that meet the following criteria:
minimum read depth of 4
minimum of 2 reads supporting an alternative base
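A minimal sketch of the two criteria above, applied column by column to a multiple sequence alignment. This is not the lab's SNP-finding script: the function name, the input format (equal-length aligned strings with `-` or `.` for gaps), and the choice to treat the most frequent base in a column as the reference and every other base as "alternative" are all assumptions.

```python
from collections import Counter

SKIPPED = set("-.Nn ")  # gaps, padding and ambiguous calls do not count as reads


def call_snps(aligned_reads, min_depth=4, min_alt_support=2):
    """Return a list of (position, major_base, {alt_base: count}, depth).

    aligned_reads: equal-length strings, one per EST in the alignment.
    A column is reported when at least min_depth reads cover it and at least
    one non-majority base is supported by min_alt_support or more reads.
    """
    if not aligned_reads:
        return []
    snps = []
    for pos in range(len(aligned_reads[0])):
        counts = Counter(
            read[pos].upper() for read in aligned_reads if read[pos] not in SKIPPED
        )
        depth = sum(counts.values())
        if depth < min_depth or len(counts) < 2:
            continue
        ranked = counts.most_common()
        major = ranked[0][0]
        alts = {base: n for base, n in ranked[1:] if n >= min_alt_support}
        if alts:
            snps.append((pos, major, alts, depth))
    return snps


# Toy alignment: column 2 has three G reads and two T reads.
reads = [
    "ACGTAC",
    "ACGTAC",
    "ACTTAC",
    "ACTTAC",
    "ACGTA-",
]
snps = call_snps(reads)
```

Requiring two supporting reads means a single sequencing error in one EST cannot create a SNP call on its own.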
SNP Identification on Transcripts (2)
SNP Identification on Transcripts (3)
PUTs Species                Total PUTs  # of PUTs   % of PUTs   Total Length of   # of SNP    Average Depth of Coverage   Average Alternative
                                        w/SNP(s)    w/SNP(s)    PUTs w/SNP (bp)   Positions   at SNP Position             Base Support
Capsicum annuum             15,278      510         3.34%       460,754           1,461       16.9                        7.2
Nicotiana benthamiana       18,037      966         5.36%       1,164,075         5,106       18.7                        9.7
N. langsdorffii x sanderae  6,791       191         2.81%       143,052           821         31.1                        15.6
Nicotiana sylvestris        7,612       17          0.22%       8,756             33          6.4                         3.8
Nicotiana tabacum           89,461      2,110       2.36%       2,262,286         10,303      13.7                        7.5
Petunia x hybrida           9,884       133         1.35%       114,872           315         8.3                         3.4
Solanum chacoense           7,110       13          0.18%       10,399            48          5.4                         2.2
Solanum habrochaites        4,024       127         3.16%       157,292           695         33.8                        14.0
Solanum lycopersicum        48,945      5,198       10.62%      6,347,780         16,531      29.2                        13.5
Solanum pennellii           3,718       99          2.66%       86,679            273         35.6                        17.2
Solanum tuberosum           70,344      7,722       10.98%      8,872,526         57,705      19.0                        9.7
[Figure: Number of Putative SNPs vs. Minimum Depth of Coverage at SNP Positions; x-axis: Minimum Depth of Coverage at SNP Position (4 to >40); y-axis: Number of Putative SNP Positions; series: min 2, 4, 6, 8 and 10 alt allele depth]
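The kind of threshold sweep plotted in this figure can be sketched as a count of SNP positions that survive each combination of minimum coverage depth and minimum alternative-allele depth. The function name and input format (one `(depth, alt_support)` pair per putative SNP position) are hypothetical, and the default thresholds are read off the figure's axis and legend rather than taken from the original analysis.

```python
def sweep_thresholds(snp_positions,
                     depth_thresholds=(4, 8, 12, 16, 20, 24, 28, 32, 36, 40),
                     alt_thresholds=(2, 4, 6, 8, 10)):
    """Count SNP positions passing each (min depth, min alt depth) pair.

    snp_positions: iterable of (depth, alt_support) tuples.
    Returns {min_alt_depth: [count at each depth threshold, in order]}.
    """
    positions = list(snp_positions)
    return {
        alt_min: [
            sum(1 for depth, alt in positions if depth >= depth_min and alt >= alt_min)
            for depth_min in depth_thresholds
        ]
        for alt_min in alt_thresholds
    }


# Toy data: three putative SNP positions.
toy = [(4, 2), (10, 4), (40, 12)]
table = sweep_thresholds(toy, depth_thresholds=(4, 8), alt_thresholds=(2, 4))
```

Each returned list corresponds to one series in the plot, with counts that can only stay level or fall as the depth threshold rises.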
The Solanaceae Comparative Genomics Resource in 2009/2010
SNP prediction on genomic sequences
Gene-centric views of data and resources
Integration of Tobacco genomic sequence into site resources
Increased annotation quality
Phylogenetic analysis
Comparative synteny displays
“Next generation” Potato genome sequencing?
Other Web Resources in the Buell Lab
Please visit http://buell-lab.plantbiology.msu.edu
…also Biofuels Feedstock Genomics Resource (and more?)
Acknowledgements
PI: Robin Buell
Bioinformatics/Project Lead: Brett Whitty
Bioinformatics Programmer: Morgan Chaires
Thanks: Kevin Childs, John Hamilton, Mike Geoffroy, Steven Lundback
Funding:
Solanaceae Comparative Genomics
Potato Chromosome 6