Evaluating Genes and Transcripts (“Genebuild”)

Upload: jacqui

Post on 27-Jan-2016

Outline

• Ensembl gene set

• Ensembl EST genes

• Ab initio predictions

• Manual curation (Vega)

• Ensembl / Havana merged gene set

• CCDS project

Biological Evidence

All Ensembl gene predictions are based on experimental evidence:

• UniProt/Swiss-Prot: a manually curated database and therefore of highest accuracy

• NCBI RefSeq: a partially manually curated database

• UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features

• EMBL / GenBank / DDBJ: primary nucleotide sequence repositories

The Ensembl Genebuild

Genome assembly + Experimental evidence + Computer programs → Ensembl Genes

The Ensembl Genebuild

A new release of Ensembl doesn’t contain a new genebuild for each species!

New genebuilds are only done if there is:

• a new genome assembly

• a lot of new supporting evidence

Genome Assemblies

Genome assemblies are not created by Ensembl, but provided by other institutes / consortia, e.g.

• NCBI: human, mouse

• Rat Genome Sequencing Consortium: rat

• Sanger: zebrafish

• Broad Institute: mammals

• Baylor College: cow

• Washington University: chicken

• etc.

The Ensembl Genebuild

• Targeted build: align species-specific proteins to the genome to create transcripts

• Similarity build: align proteins from closely related species to locate additional transcripts

• Add UTRs using mRNA evidence

• Eliminate redundant transcripts and create genes
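The last step above (eliminating redundant transcripts and creating genes) can be illustrated with a minimal sketch. This is not the Ensembl pipeline code: the `Transcript` class, the rule that two transcripts are redundant when they share strand and exon coordinates, and the rule that transcripts belong to one gene when any of their exons overlap on the same strand are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical transcript model: exons are inclusive (start, end) pairs."""
    name: str
    strand: int
    exons: tuple


def remove_redundant(transcripts):
    """Keep the first transcript seen for each (strand, exon structure)."""
    seen = {}
    for t in transcripts:
        seen.setdefault((t.strand, t.exons), t)
    return list(seen.values())


def exons_overlap(a, b):
    """True if any exon of a overlaps any exon of b (inclusive coordinates)."""
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in a.exons
               for s2, e2 in b.exons)


def cluster_into_genes(transcripts):
    """Group same-strand transcripts with overlapping exons into genes."""
    genes = []
    for t in sorted(transcripts, key=lambda t: t.exons[0][0]):
        hits, rest = [], []
        for gene in genes:
            same_gene = any(t.strand == u.strand and exons_overlap(t, u)
                            for u in gene)
            (hits if same_gene else rest).append(gene)
        # Merge every gene the new transcript touches into one cluster.
        merged = [u for gene in hits for u in gene] + [t]
        genes = rest + [merged]
    return genes


models = [
    Transcript("a", 1, ((100, 200), (300, 400))),
    Transcript("b", 1, ((100, 200), (300, 400))),  # same structure as "a"
    Transcript("c", 1, ((350, 450), (500, 600))),
    Transcript("d", -1, ((350, 450),)),
    Transcript("e", 1, ((1000, 1100),)),
]
genes = cluster_into_genes(remove_redundant(models))
gene_names = [[t.name for t in gene] for gene in genes]
```

In this toy input, "b" is dropped as a duplicate of "a", "a" and "c" share an overlapping exon and form one gene, and "d" stays separate because it lies on the opposite strand.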

“Special” cases

• Pseudogenes

• Non-coding RNA genes
  Sequences from the Rfam and miRBase databases and covariance models; hand-checked set

• Ig Segment Genes (Immunoglobulin and T-cell receptor segments)
  Sequences from the IMGT database and Exonerate

Classification of Transcripts

• Ensembl Transcripts and Proteins are mapped to UniProt/Swiss-Prot, NCBI RefSeq and UniProt/TrEMBL entries

• Genes that map to species-specific protein/mRNA records are classified as known

• Genes that do not map to species-specific protein/mRNA records are classified as novel
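The known / novel rule above can be written as a small function. This is a sketch only: the function name, the `(database, species)` record shape and the example species strings are assumptions for illustration, not part of the Ensembl code base.

```python
def classify_gene(mapped_records, species):
    """Return "known" if any mapped protein/mRNA record is from the same
    species as the gene, otherwise "novel".

    mapped_records: iterable of (database, record_species) tuples
    (a hypothetical representation of the mapping results).
    """
    for _database, record_species in mapped_records:
        if record_species == species:
            return "known"
    return "novel"


same_species = classify_gene([("UniProt/Swiss-Prot", "Homo sapiens")],
                             "Homo sapiens")
other_species = classify_gene([("NCBI RefSeq", "Mus musculus")],
                              "Homo sapiens")
unmapped = classify_gene([], "Homo sapiens")
```

Note that a gene supported only by records from another species still counts as novel under this rule.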

Names and Descriptions

• Transcript names are inferred from mapped transcripts and proteins
  • Swiss-Prot > RefSeq > TrEMBL ID

• Novel transcripts have only Ensembl identifiers

• Genes are assigned the official gene symbol if available
  • HGNC (HUGO) symbol for human genes
  • Species-specific nomenclature committees (MGI, ZFIN etc.)
  • Otherwise Swiss-Prot > RefSeq > TrEMBL ID

• Gene description is inferred from mapped database entries; the source is always given
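The naming priority above amounts to a simple ordered lookup. The sketch below assumes the mapped entries are available as a dictionary keyed by source database; the function name, argument names and the identifiers in the usage example are made-up placeholders, not real records.

```python
# Priority order taken from the slide: Swiss-Prot > RefSeq > TrEMBL.
PRIORITY = ("Swiss-Prot", "RefSeq", "TrEMBL")


def choose_name(xrefs, ensembl_id, official_symbol=None):
    """Pick a display name.

    xrefs: dict mapping source database -> mapped identifier (hypothetical).
    ensembl_id: used when nothing else is available (novel genes/transcripts).
    official_symbol: e.g. an HGNC, MGI or ZFIN symbol, for genes only.
    """
    if official_symbol:
        return official_symbol
    for source in PRIORITY:
        if source in xrefs:
            return xrefs[source]
    return ensembl_id


# Placeholder identifiers, for illustration only.
with_symbol = choose_name({"RefSeq": "REFSEQ_ID_1"}, "ENS_ID_1",
                          official_symbol="SYMBOL1")
by_priority = choose_name({"TrEMBL": "TREMBL_ID_1",
                           "RefSeq": "REFSEQ_ID_1"}, "ENS_ID_1")
novel = choose_name({}, "ENS_ID_1")
```

For transcripts, `official_symbol` is simply left unset, so the database priority alone decides the name.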

Supporting evidence

[Figure: ExonView display of supporting evidence, showing mRNA and peptide alignments against UTR and coding/UTR exons]

Supporting evidence

[Figure: ContigView display of supporting evidence]

Configuring the Genebuild

The genebuild is configured for each species.

Data availability
• Targeted build most useful in human, mouse
• Similarity build most useful in C. intestinalis, mosquito

Structural issues
• Zebrafish
  • Many duplications
  • Genome from different haplotypes
• Mosquito
  • Many single-exon genes
  • Genes within genes

Low Coverage Genomes

• Low coverage genomes (~2x) come in lots of scaffolds: “classic” genebuild will result in many partial and fragmented genes

• Whole Genome Alignment (WGA) to an annotated reference genome: this method reduces fragmentation by piecing together scaffolds into “gene-scaffolds” that contain complete gene(s)
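The idea of a “gene-scaffold” can be sketched as follows. This is a strong simplification and not the actual Ensembl method: it assumes each scaffold piece has already been aligned to the reference and is given with its reference start coordinate, ignores orientation and overlaps, and joins the pieces with a fixed-length run of N. The function name and the gap length are hypothetical.

```python
def build_gene_scaffold(pieces, gap_length=100):
    """Join scaffold pieces into one "gene-scaffold".

    pieces: list of (reference_start, sequence) tuples, where
    reference_start is the position of the piece's alignment on the
    annotated reference genome (assumed to be known already).
    Pieces are ordered by reference position and separated by N padding.
    """
    ordered = sorted(pieces, key=lambda piece: piece[0])
    padding = "N" * gap_length
    return padding.join(sequence for _start, sequence in ordered)


# Two toy pieces given out of order; the reference alignment restores order.
gene_scaffold = build_gene_scaffold([(500, "GGCC"), (10, "ATGA")],
                                    gap_length=6)
```

The point of the sketch is only that the reference alignment supplies the order in which otherwise unconnected scaffolds are pieced together, so that a gene spanning several scaffolds can be annotated as one unit.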

Low Coverage Genomes

[Figure: scaffolds joined by runs of N into a “gene-scaffold”, aligned to the reference assembly]

EST Gene Set

• ESTs (Expressed Sequence Tags) are single reads, with a high chance of sequencing mistakes

• EST libraries are regularly contaminated with genomic DNA

• ESTs are generally ~400 bp long, so unlikely to cover a whole gene

Therefore:

• EST gene predictions are less reliable and thus kept separate from the core Ensembl Gene Set

EST Gene Set

[Figure: ContigView display showing the ESTs and EST genes tracks]

Ab initio Predictions

• Predict translatable transcript structures solely on the basis of genome sequence

• No validation with biological expression information

• GENSCAN for vertebrate genomes

• SNAP better for invertebrates

• NB: Both programs over-predict transcript structures

Ab initio Predictions

[Figure: ContigView display showing a GENSCAN prediction]

Automatic vs Manual Annotation

Automatic Annotation
• Quick
• Use unfinished sequence or shotgun assembly
• Consistent annotation

Manual Annotation
• Slow
• Need finished sequence
• Flexible, can deal with inconsistencies
• Most rules have exceptions
• Consult publications as well as databases

Annotation that Causes Problems for Ensembl

• Multiple variants

• UTRs

• Pseudogenes

• Non-coding genes (ncRNAs)

• Overlapping genes, anti-sense genes

• Gene duplication events

Manually Curated Gene Sets

• FlyBase: fruit fly

• WormBase: C. elegans

• SGD: yeast

• Vega: human, zebrafish, mouse, dog

Vega Genome Browser

http://vega.sanger.ac.uk

Vega Transcripts

Vega transcripts:

• Vega Havana: transcripts annotated by the Havana team at Sanger

• Vega External: transcripts annotated by other Vega teams

Ensembl / Havana Merge

Full-length protein-coding transcripts annotated by the Sanger Havana team (part of Vega) are merged with the human Ensembl transcript set.

Transcripts:
• Ensembl/Havana: gold
• Ensembl: red / black
• Havana: blue

Genes:
• Ensembl/Havana: gold
• Ensembl: red / black
• Havana: blue
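The colour scheme can be read as a labelling of each transcript by the set(s) it comes from. The sketch below assumes, purely for illustration, that a transcript counts as merged when both sets contain exactly the same exon structure; the slides do not state the actual merge criteria, and the function and variable names are hypothetical.

```python
# Display colours as listed on the slide.
COLOURS = {
    "Ensembl/Havana": "gold",
    "Ensembl": "red / black",
    "Havana": "blue",
}


def merge_sets(ensembl, havana):
    """Label every exon structure by its source.

    ensembl, havana: sets of exon structures, each structure being a tuple
    of (start, end) pairs. Assumption: identical structures in both sets
    are treated as one merged transcript.
    """
    merged = {}
    for structure in ensembl | havana:
        if structure in ensembl and structure in havana:
            source = "Ensembl/Havana"
        elif structure in ensembl:
            source = "Ensembl"
        else:
            source = "Havana"
        merged[structure] = (source, COLOURS[source])
    return merged


shared = ((100, 200), (300, 400))
ensembl_only = ((100, 200), (300, 450))
havana_only = ((100, 250), (300, 400))
labels = merge_sets({shared, ensembl_only}, {shared, havana_only})
```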

Ensembl / Havana Merge

[Figure: display showing a merged Ensembl / Havana gene and a merged Ensembl / Havana transcript]

CCDS (Consensus Coding Sequences)

• Collaboration between NCBI, UCSC, Ensembl and Havana to produce a set of stable, reliable, complete (ATG->stop) CDS structures for human and mouse

• Long term aim is to get to a single gene set for human and mouse

• The genebuild pipeline has been modified to retain these ‘blessed’ CDSs (stored in a database for incorporation in the build)
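The “complete (ATG->stop)” criterion can be checked mechanically on a coding sequence. The following is a minimal sketch, assuming the standard genetic code and a CDS given as a plain DNA string; it is not the CCDS project's own validation, which involves further quality checks not described here.

```python
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_cds(sequence):
    """True if the sequence starts with ATG, ends with a stop codon,
    has a length divisible by three and contains no internal stop codon."""
    sequence = sequence.upper()
    if len(sequence) < 6 or len(sequence) % 3 != 0:
        return False
    codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
    has_start = codons[0] == "ATG"
    has_stop = codons[-1] in STOP_CODONS
    internal_stop = any(codon in STOP_CODONS for codon in codons[1:-1])
    return has_start and has_stop and not internal_stop


complete = is_complete_cds("ATGAAATAA")
no_stop = is_complete_cds("ATGAAA")
internal = is_complete_cds("ATGTAAAAATAA")
```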

Q & A

Questions and Answers