Evaluating Genes and Transcripts (“Genebuild”)
1 of 28
Evaluating Genes and Transcripts (“Genebuild”)
2 of 28
Outline
• Ensembl gene set
• Ensembl EST genes
• Ab initio predictions
• Manual curation (Vega)
• Ensembl / Havana merged gene set
• CCDS project
3 of 28
Biological Evidence
All Ensembl gene predictions are based on experimental evidence:
• UniProt/Swiss-Prot: a manually curated database and therefore of highest accuracy
• NCBI RefSeq: a partially manually curated database
• UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features
• EMBL / GenBank / DDBJ: primary nucleotide sequence repositories
4 of 28
The Ensembl Genebuild
Genome assembly + Computer programs + Experimental evidence → Ensembl Genes
5 of 28
The Ensembl Genebuild
A new release of Ensembl doesn’t contain a new genebuild for each species!
New genebuilds are only done if there is:
• a new genome assembly
• a lot of new supporting evidence
6 of 28
Genome Assemblies
Genome assemblies are not created by Ensembl, but provided by other institutes / consortia, e.g.
• NCBI: human, mouse
• Rat Genome Sequencing Consortium: rat
• Sanger: zebrafish
• Broad Institute: mammals
• Baylor College: cow
• Washington University: chicken
etc.
7 of 28
The Ensembl Genebuild
• Targeted build: align species-specific proteins to the genome to create transcripts
• Similarity build: align proteins from closely related species to locate additional transcripts
• Add UTRs using mRNA evidence
• Eliminate redundant transcripts and create genes
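The last step, eliminating redundant transcripts and creating genes, can be illustrated with a minimal Python sketch. It is not the Ensembl pipeline code: the `Transcript` class, the "identical exon structure" redundancy test and the "overlapping exon on the same strand" clustering rule are all simplifying assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    # Hypothetical minimal model: exons are (start, end) pairs, inclusive.
    name: str
    strand: int
    exons: tuple


def remove_redundant(transcripts):
    """Keep the first transcript seen for each (strand, exon structure)."""
    seen = set()
    unique = []
    for t in transcripts:
        key = (t.strand, t.exons)
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique


def exons_overlap(a, b):
    """True if any exon of a overlaps any exon of b."""
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in a.exons
               for s2, e2 in b.exons)


def cluster_into_genes(transcripts):
    """Group transcripts that share an overlapping exon on the same strand."""
    genes = []
    for t in transcripts:
        # Existing clusters that this transcript links to (assumed rule).
        hits = [g for g in genes
                if any(t.strand == u.strand and exons_overlap(t, u) for u in g)]
        merged = [t]
        for g in hits:
            merged.extend(g)
        # Replace the linked clusters with one merged cluster.
        genes = [g for g in genes if all(g is not h for h in hits)]
        genes.append(merged)
    return genes
```

A call such as `cluster_into_genes(remove_redundant(transcripts))` then returns one list of transcripts per gene.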
8 of 28
“Special” cases
• Pseudogenes
• Non-coding RNA genes
  • sequences from RFAM and miRBase databases and covariance models
  • hand-checked set
• Ig Segment Genes (Immunoglobulin and T-cell receptor segments)
  • sequences from IMGT database and Exonerate
9 of 28
Classification of Transcripts
• Ensembl Transcripts and Proteins are mapped to UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL entries
• Genes that map to species-specific protein/mRNA records are classified as known
• Genes that do not map to species-specific protein/mRNA records are classified as novel
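The known / novel rule above can be written as a small Python sketch. The function name and the shape of `mapped_records` (database, accession, species) are assumptions made for illustration, not part of the Ensembl code base.

```python
def classify_gene(gene_species, mapped_records):
    """Return "known" if any mapped protein/mRNA record is from the
    gene's own species, otherwise "novel".

    mapped_records: iterable of (database, accession, species) tuples
    (a hypothetical record layout).
    """
    for _database, _accession, species in mapped_records:
        if species == gene_species:
            return "known"
    return "novel"
```

Under this sketch, a human gene supported only by a mouse protein would be classified as novel, and a gene with no mapped records at all would be novel as well.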
10 of 28
Names and Descriptions
• Transcript names are inferred from mapped transcripts and proteins
  • Swiss-Prot > RefSeq > TrEMBL ID
• Novel transcripts have only Ensembl identifiers
• Genes are assigned the official gene symbol if available
  • HGNC (HUGO) symbol for human genes
  • Species-specific nomenclature committees (MGI, ZFIN etc.)
  • Otherwise Swiss-Prot > RefSeq > TrEMBL ID
• Gene description is inferred from mapped database entries; the source is always given
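The naming precedence for genes on this slide can be sketched in Python as follows. The function, its parameters and the placeholder identifiers in the usage note are hypothetical; only the order (official symbol, then Swiss-Prot > RefSeq > TrEMBL, then the Ensembl identifier) comes from the slide.

```python
# Priority order taken from the slide: Swiss-Prot > RefSeq > TrEMBL.
PRIORITY = ("Swiss-Prot", "RefSeq", "TrEMBL")


def pick_gene_name(official_symbol, mapped_ids, ensembl_id):
    """Choose a display name for a gene.

    official_symbol: symbol from a nomenclature committee, or None.
    mapped_ids: dict mapping a source database name to the mapped ID.
    ensembl_id: the Ensembl identifier, used when nothing else maps.
    """
    if official_symbol:
        return official_symbol
    for source in PRIORITY:
        if source in mapped_ids:
            return mapped_ids[source]
    return ensembl_id
```

For example, `pick_gene_name(None, {"TrEMBL": "TR_ID", "RefSeq": "RS_ID"}, "ENS_ID")` picks the RefSeq ID because RefSeq outranks TrEMBL.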
11 of 28
Supporting evidence
[Figure: ExonView screenshot; labels: mRNA, peptide, UTR, coding/UTR]
12 of 28
Supporting evidence
[Figure: ContigView screenshot]
13 of 28
Configuring the Genebuild
Genebuild configured for each species
Data availability
• Targeted build most useful in human, mouse
• Similarity build most useful in C. intestinalis, mosquito
Structural issues
• Zebrafish
  • Many duplications
  • Genome from different haplotypes
• Mosquito
  • Many single-exon genes
  • Genes within genes
14 of 28
Low Coverage Genomes
• Low coverage genomes (~2x) come in lots of scaffolds: “classic” genebuild will result in many partial and fragmented genes
• Whole Genome Alignment (WGA) to an annotated reference genome: this method reduces fragmentation by piecing together scaffolds into “gene-scaffolds” that contain complete gene(s)
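The idea of piecing scaffolds together into a “gene-scaffold” can be sketched in Python. This is an illustration only: it assumes the whole genome alignment has already produced, for each scaffold, a start position on the reference and an orientation, and it joins the scaffolds with a fixed run of N characters. The function names, the input layout and the gap length are all hypothetical.

```python
# Translation table for complementing DNA, keeping N as N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_gene_scaffold(placements, scaffolds, gap_length=6):
    """Join scaffolds in reference order, separated by N padding.

    placements: list of (reference_start, scaffold_name, strand) tuples,
                assumed to come from the whole genome alignment.
    scaffolds:  dict mapping scaffold name to its sequence.
    gap_length: number of N characters between scaffolds (arbitrary here).
    """
    parts = []
    for _ref_start, name, strand in sorted(placements):
        seq = scaffolds[name]
        # Scaffolds aligned to the reverse strand are flipped first.
        parts.append(seq if strand == 1 else reverse_complement(seq))
    return ("N" * gap_length).join(parts)
```

A gene whose exons were split across two scaffolds can then be annotated on the single joined sequence instead of as two fragments.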
15 of 28
Low Coverage Genomes
[Figure: “gene-scaffold” with NNNNNN padding, shown against the reference assembly]
16 of 28
EST Gene Set
• ESTs (Expressed Sequence Tags) are single reads, high chance of sequencing mistakes
• EST libraries are regularly contaminated with genomic DNA
• Generally ~ 400 bp, so unlikely to cover a whole gene
THEREFORE
• EST gene predictions are less reliable and thus kept separate from the core Ensembl Gene Set
17 of 28
EST Gene Set
[Figure: ContigView screenshot; tracks: ESTs, EST genes]
18 of 28
Ab initio Predictions
• Predict translatable transcript structures solely on the basis of genome sequence.
• No validation with biological expression information.
• GENSCAN for vertebrate genomes
• SNAP better for invertebrates
• NB: Both programs over-predict transcript structures.
19 of 28
Ab initio Predictions
[Figure: ContigView screenshot; track: GENSCAN prediction]
20 of 28
Automatic vs Manual Annotation
Automatic Annotation
• Quick
• Use unfinished sequence or shotgun assembly
• Consistent annotation
Manual Annotation
• Slow
• Need finished sequence
• Flexible, can deal with inconsistencies
• Most rules have exceptions
• Consult publications as well as databases
21 of 28
Annotation that Causes Problems for Ensembl
• Multiple variants
• UTRs
• Pseudogenes
• Non-coding genes (ncRNAs)
• Overlapping genes, anti-sense genes
• Gene duplication events
22 of 28
Manually Curated Gene Sets
• FlyBase: fruitfly
• WormBase: C. elegans
• SGD: yeast
• Vega: human, zebrafish, mouse, dog
23 of 28
Vega Genome Browser
http://vega.sanger.ac.uk
24 of 28
Vega Transcripts
[Figure: Vega transcripts]
• Vega Havana: transcripts annotated by the Havana team at Sanger
• Vega External: transcripts annotated by other Vega teams
25 of 28
Ensembl / Havana Merge
Full-length protein-coding transcripts annotated by the Sanger Havana team (part of Vega) are merged with the human Ensembl transcript set
Transcripts:
• Ensembl/Havana: gold
• Ensembl: red / black
• Havana: blue
Genes:
• Ensembl/Havana: gold
• Ensembl: red / black
• Havana: blue
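The merge and its colour legend can be sketched in Python. The slides do not state how the pipeline decides that an Ensembl and a Havana transcript are the same, so this sketch assumes, purely for illustration, that two transcripts merge when their exon structures are identical. The function name and input layout are hypothetical; only the colours come from this slide.

```python
# Display colours as given on the slide.
COLOURS = {
    "Ensembl/Havana": "gold",
    "Ensembl": "red / black",
    "Havana": "blue",
}


def merge_transcripts(ensembl, havana):
    """Label each distinct exon structure by the set(s) it came from.

    ensembl, havana: iterables of exon structures, each a tuple of
    (start, end) pairs. Identical structures are treated as one merged
    transcript (an assumption, not the documented merge rule).
    Returns a list of (structure, source, colour) tuples.
    """
    ensembl_set, havana_set = set(ensembl), set(havana)
    labelled = []
    for structure in sorted(ensembl_set | havana_set):
        if structure in ensembl_set and structure in havana_set:
            source = "Ensembl/Havana"
        elif structure in ensembl_set:
            source = "Ensembl"
        else:
            source = "Havana"
        labelled.append((structure, source, COLOURS[source]))
    return labelled
```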
26 of 28
Ensembl / Havana Merge
[Figure: browser screenshot; labels: merged Ensembl / Havana gene, merged Ensembl / Havana transcript]
27 of 28
CCDS (Consensus Coding Sequences)
• Collaboration between NCBI, UCSC, Ensembl and Havana to produce a set of stable, reliable, complete (ATG->stop) CDS structures for human and mouse
• Long term aim is to get to a single gene set for human and mouse
• The genebuild pipeline has been modified to retain these ‘blessed’ CDSs (stored in a database for incorporation in the build)
28 of 28
Q&A
Questions and Answers