evaluating genes and transcripts (“genebuild”)

1 of 28

Evaluating Genes and Transcripts (“Genebuild”)

2 of 28

Outline

• Ensembl gene set

• Ensembl EST genes

• Ab initio predictions

• Manual curation (Vega)

• Ensembl / Havana merged gene set

• CCDS project

3 of 28

Biological Evidence

All Ensembl gene predictions are based on experimental evidence:

• UniProt/Swiss-Prot: a manually curated database and therefore of highest accuracy

• NCBI RefSeq: a partially manually curated database

• UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features

• EMBL / GenBank / DDBJ: primary nucleotide sequence repositories

4 of 28

The Ensembl Genebuild

Genome assembly + Computer programs + Experimental evidence → Ensembl Genes

5 of 28


A new release of Ensembl doesn’t contain a new genebuild for each species!

New genebuilds are only done if there is:

• a new genome assembly

• a lot of new supporting evidence

6 of 28

Genome Assemblies

Genome assemblies are not created by Ensembl, but provided by other institutes / consortia, e.g.

• NCBI: human, mouse

• Rat Genome Sequencing Consortium: rat

• Sanger: zebrafish

• Broad Institute: mammals

• Baylor College: cow

• Washington University: chicken

etc.

7 of 28


• Targeted build:

Align species-specific proteins to the genome to create transcripts

• Similarity build:

Align proteins from closely related species to locate additional transcripts

• Add UTRs using mRNA evidence

• Eliminate redundant transcripts and create genes
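The last step above (eliminate redundant transcripts, then create genes) can be sketched roughly as follows. This is only an illustration, not the actual Ensembl pipeline code: the `Transcript` class, the rule that two transcripts are redundant when their exon structures are identical, and the rule that transcripts sharing an overlapping exon on the same strand belong to one gene are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical transcript model: exons are (start, end) pairs, inclusive."""
    name: str
    strand: int
    exons: tuple


def drop_redundant(transcripts):
    """Keep the first transcript for each (strand, exon structure) combination."""
    seen = set()
    unique = []
    for t in transcripts:
        key = (t.strand, t.exons)
        if key not in seen:
            seen.add(key)
            unique.append(t)
    return unique


def exons_overlap(a, b):
    """True if any exon of a overlaps any exon of b."""
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in a.exons
               for s2, e2 in b.exons)


def cluster_into_genes(transcripts):
    """Group transcripts that share an overlapping exon on the same strand."""
    genes = []
    for t in transcripts:
        linked, rest = [], []
        for gene in genes:
            hit = any(x.strand == t.strand and exons_overlap(x, t) for x in gene)
            (linked if hit else rest).append(gene)
        # Merge every cluster the new transcript touches into one gene.
        merged = [x for gene in linked for x in gene] + [t]
        genes = rest + [merged]
    return genes
```

A call such as `cluster_into_genes(drop_redundant(transcripts))` returns a list of genes, each a list of transcripts.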

8 of 28

“Special” cases

• Pseudogenes

• Non-coding RNA genes

sequences from Rfam and miRBase databases and covariance models

hand-checked set

• Ig Segment Genes (Immunoglobulin and T-cell receptor segments)

sequences from IMGT db and Exonerate

9 of 28

Classification of Transcripts

• Ensembl Transcripts and Proteins are mapped to UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL entries

• Genes that map to species-specific protein/mRNA records are classified as known

• Genes that do not map to species-specific protein/mRNA records are classified as novel
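The known / novel rule above amounts to a single check, sketched here. The record layout (a dict with a `species` key) and the function name are hypothetical, chosen only for illustration.

```python
def classify_gene(mapped_records, species):
    """Return 'known' if any mapped protein/mRNA record is from the same species.

    mapped_records: list of dicts with a 'species' key (hypothetical layout).
    """
    same_species = any(rec["species"] == species for rec in mapped_records)
    return "known" if same_species else "novel"
```

Under this sketch, a gene supported only by records from other species, or by no records at all, comes out as novel.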

10 of 28

Names and Descriptions

• Transcript names are inferred from mapped transcripts and proteins

  • Swiss-Prot > RefSeq > TrEMBL ID

• Novel transcripts have only Ensembl identifiers

• Genes are assigned the official gene symbol if available

  • HGNC (HUGO) symbol for human genes

  • Species-specific nomenclature committees (MGI, ZFIN etc.)

• Otherwise Swiss-Prot > RefSeq > TrEMBL ID

• Gene description is inferred from mapped database entries; the source is always given
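The naming priority above can be read as a simple fallback chain. The sketch below assumes a hypothetical `xrefs` dict that maps a source name to the list of IDs mapped from that source. The function and its arguments are illustrative only and are not part of any Ensembl API.

```python
# Priority order taken from the slide: Swiss-Prot > RefSeq > TrEMBL.
SOURCE_PRIORITY = ("Swiss-Prot", "RefSeq", "TrEMBL")


def pick_name(xrefs, official_symbol=None, ensembl_id=None):
    """Return (name, source) following the priority described above.

    xrefs: dict mapping source name -> list of mapped IDs (hypothetical layout).
    official_symbol: e.g. an HGNC symbol, used first for genes when available.
    ensembl_id: fallback when nothing maps (novel genes and transcripts).
    """
    if official_symbol:
        return official_symbol, "official"
    for source in SOURCE_PRIORITY:
        ids = xrefs.get(source)
        if ids:
            return ids[0], source
    return ensembl_id, "Ensembl"
```

Returning the source alongside the name reflects the point that the source of a name or description is always given.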

11 of 28

Supporting evidence

[Figure: ExonView showing supporting evidence (mRNA and peptide alignments) against UTR and coding/UTR exons]

12 of 28

Supporting evidence (ContigView)

13 of 28

Configuring the Genebuild

The genebuild is configured for each species

Data availability

• Targeted build most useful in human, mouse

• Similarity build most useful in C. intestinalis, mosquito

Structural issues

• Zebrafish

  • Many duplications

  • Genome from different haplotypes

• Mosquito

  • Many single-exon genes

  • Genes within genes

14 of 28

Low Coverage Genomes

• Low coverage genomes (~2x) come in lots of scaffolds: “classic” genebuild will result in many partial and fragmented genes

• Whole Genome Alignment (WGA) to an annotated reference genome: this method reduces fragmentation by piecing together scaffolds into “gene-scaffolds” that contain complete gene(s)

15 of 28

Low Coverage Genomes

[Figure: a “gene-scaffold” (scaffolds joined by NNNNNN) shown against the reference assembly]

16 of 28

EST Gene Set

• ESTs (Expressed Sequence Tags) are single reads, with a high chance of sequencing errors

• EST libraries are regularly contaminated with genomic DNA

• ESTs are generally ~400 bp long, so unlikely to cover a whole gene

THEREFORE

• EST gene predictions are less reliable and thus kept separate from the core Ensembl Gene Set

17 of 28

EST Gene Set (ContigView)

[Figure: ContigView tracks showing ESTs and EST genes]

18 of 28

Ab initio Predictions

• Predict translatable transcript structures solely on the basis of genome sequence.

• No validation with biological expression information.

• GENSCAN for vertebrate genomes

• SNAP better for invertebrates

• NB: Both programs over-predict transcript structures.

19 of 28

Ab initio Predictions (ContigView)

[Figure: ContigView track showing a GENSCAN prediction]

20 of 28

Automatic vs Manual Annotation

Automatic Annotation

• Quick

• Use unfinished sequence or shotgun assembly

• Consistent annotation

Manual Annotation

• Slow

• Need finished sequence

• Flexible, can deal with inconsistencies

• Most rules have exceptions

• Consult publications as well as databases

21 of 28

Annotation that Causes Problems for Ensembl

• Multiple variants

• UTRs

• Pseudogenes

• Non-coding genes (ncRNAs)

• Overlapping genes, anti-sense genes

• Gene duplication events

22 of 28

Manually Curated Gene Sets

• FlyBase: fruitfly

• WormBase: C. elegans

• SGD: yeast

• Vega: human, zebrafish, mouse, dog

23 of 28

Vega Genome Browser

http://vega.sanger.ac.uk

24 of 28

Vega Transcripts

[Figure: Vega transcripts]

• Vega Havana: transcripts annotated by the Havana team at Sanger

• Vega External: transcripts annotated by other Vega teams

25 of 28

Ensembl / Havana Merge

Transcripts:

• Ensembl/Havana: gold

• Ensembl: red / black

• Havana: blue

Genes:

• Ensembl/Havana: gold

• Ensembl: red / black

• Havana: blue

Full-length protein-coding transcripts annotated by the Sanger Havana team (part of Vega) are merged with the human Ensembl transcript set
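The three categories above (Ensembl/Havana, Ensembl only, Havana only) can be illustrated with a small sketch. The slides do not state the merge criterion, so this example simply assumes that two transcripts merge when their exon structures are identical. The function name and the dict layout (transcript name mapped to a tuple of exon coordinates) are hypothetical.

```python
def merge_transcripts(ensembl, havana):
    """Label each transcript as 'Ensembl/Havana', 'Ensembl' or 'Havana'.

    ensembl, havana: dicts mapping transcript name -> tuple of (start, end) exons.
    Assumption: identical exon structure means the two transcripts are merged.
    """
    havana_by_structure = {exons: name for name, exons in havana.items()}
    labelled = []
    matched = set()
    for ens_name, exons in ensembl.items():
        hav_name = havana_by_structure.get(exons)
        if hav_name is not None:
            labelled.append(("Ensembl/Havana", ens_name, hav_name))
            matched.add(hav_name)
        else:
            labelled.append(("Ensembl", ens_name, None))
    # Havana transcripts with no matching Ensembl structure are kept as they are.
    for hav_name in havana:
        if hav_name not in matched:
            labelled.append(("Havana", None, hav_name))
    return labelled
```

In the browser colour scheme listed above, the first label would be drawn in gold, the second in red / black and the third in blue.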

26 of 28

Ensembl / Havana Merge

[Figure: a merged Ensembl / Havana gene and a merged Ensembl / Havana transcript]

27 of 28

CCDS (Consensus Coding Sequences)

• Collaboration between NCBI, UCSC, Ensembl and Havana to produce a set of stable, reliable, complete (ATG->stop) CDS structures for human and mouse

• Long term aim is to get to a single gene set for human and mouse

• The genebuild pipeline has been modified to retain these ‘blessed’ CDSs (stored in a database for incorporation in the build)

28 of 28

Q&A: Questions and Answers
