The Ensembl Gene set
The “Genebuild”
21 April 2008
Outline
The GeneBuild (determining the Ensembl gene set)
What does it mean for the scientist? ‘Annotation pipeline’ vs ‘manual curation’
Pseudogenes
ncRNAs
The CCDS project
Introduction
What is available?
I) Sequence assemblies from genome sequencing efforts
Gene Sequencing: the Assembly
http://seqcore.brcf.med.umich.edu/doc/educ/dnapr/sequencing.html
This generates clones, vs new sequencing methods
Clones Available
Human: Tilepath clones (used in the assembly)
Ciona intestinalis: Shotgun assembly
ContigView: Clones and Contigs
Contigs
Clones (plate/well numbers)
Ensembl Transcripts
Task:
View the tilepath clone in ContigView for the region containing the human BRCA2 gene.
Hint: Start with a search for the BRCA2 gene.
The Ensembl Geneset
How does Ensembl use mRNA and protein information along with the sequence assembly to define distinct genes on the genome?
Protein Sequence Assembly Ensembl Geneset
Once the Assembly is Imported…
Proteins/mRNAs are aligned.
These have been submitted to databases such as:
UniProt (manually curated) and
RefSeq (partially manually curated)
The Biological Evidence
All Ensembl gene predictions are based on experimental evidence:
UniProt/Swiss-Prot: a manually curated database, and therefore of highest accuracy
NCBI RefSeq: a partially manually curated database
UniProt/TrEMBL: automatically annotated translations of EMBL coding sequence (CDS) features
EMBL / GenBank / DDBJ: primary nucleotide sequence repositories
Database Relationship
[Figure: relationships between an individual lab’s submission, EMBL-Bank / GenBank / DDBJ, NCBI RefSeq, and UniProt (Swiss-Prot, TrEMBL)]
Genebuild
[Figure: Genebuild overview. Labels: Sequence (Assembly); Proteins (e.g. Swiss-Prot); mRNA; EST; EMBL-Bank / GenBank / DDBJ; Manual annotation (HAVANA); ESTgenes; Ensembl]
Why do I want to know?…
What is an Ensembl gene based on?
Ensembl genes may be based on multiple proteins/mRNAs.
Task
Look at the evidence for the human EPO gene.
What was this gene based on?
Hint: Go to Exon Information from the GeneView page
EPO gene supporting evidence
Species-Specific GeneBuilds
Pan troglodytes genes are built by projection from human genes.
Zebrafish has many gene duplications.
Homo sapiens genes must have protein evidence, not just mRNA.
Task
When was the chimpanzee (Pan troglodytes) Genebuild performed?
Can you find information as to how genes were annotated?
Hint: Look on the chimpanzee index page
External Gene Set: VEGA/Havana
Human, zebrafish, mouse and dog
Havana transcripts in blue or gold…
What are Havana transcripts?
Havana and Ensembl match
When Havana (manually curated) and Ensembl (automatic methods) predict the same transcript, base pair for base pair, the transcripts are merged and coloured gold.
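The matching rule on this slide can be illustrated with a minimal sketch. It assumes a transcript is described only by its chromosome, strand and exon coordinates; the `Transcript` class and both function names are hypothetical and are not part of the Ensembl or Havana code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Exon = Tuple[int, int]  # (start, end) in genomic coordinates, inclusive


@dataclass(frozen=True)
class Transcript:
    """Hypothetical, simplified transcript model (illustration only)."""
    chrom: str
    strand: int  # +1 or -1
    exons: Tuple[Exon, ...]


def is_exact_match(havana: Transcript, ensembl: Transcript) -> bool:
    """True only if both transcripts agree at every exon boundary."""
    return (
        havana.chrom == ensembl.chrom
        and havana.strand == ensembl.strand
        and sorted(havana.exons) == sorted(ensembl.exons)
    )


def merged_colour(havana: Transcript, ensembl: Transcript) -> Optional[str]:
    """Return "gold" for a merged transcript, or None if no merge happens."""
    return "gold" if is_exact_match(havana, ensembl) else None


# Made-up coordinates, used only to exercise the rule.
manual = Transcript("13", 1, ((100, 200), (300, 450)))
automatic_same = Transcript("13", 1, ((100, 200), (300, 450)))
automatic_shifted = Transcript("13", 1, ((100, 200), (300, 451)))

print(merged_colour(manual, automatic_same))
print(merged_colour(manual, automatic_shifted))
```

In this sketch a single base of difference at any exon boundary is enough to prevent the merge.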
Manually-curated gene sets in Ensembl
Vega (Havana): Homo sapiens, Danio rerio, Mus musculus and Canis familiaris
WormBase: Caenorhabditis elegans
FlyBase: Drosophila melanogaster
SGD: Saccharomyces cerevisiae
What Can Go Wrong?
I) A gap in the assembly
Gene might not be found in Ensembl
II) Fused genes
BLAST hit (Swiss-Prot entry)
Gene might be associated with two names
Outline
The genome sequence
The Genebuild
‘Manual curation’ by Havana
Other: EST gene set
Pseudogenes
ncRNAs
Expressed Sequence Tags vs ‘cDNA’
ESTs are annotated separately. Why?
mRNA and cDNA used in the GeneBuild: sequenced to a high standard, often complete.
EST: lower quality sequence.
‘One shot’ sequencing of cDNA from the 5’ and 3’ ends creates the EST sequence.
ESTs are only 500-800 nucleotides long.
Low quality fragments: sequence error of ~2%.
BUT ESTs confer useful expression information:
discovery of new genes, especially in diseased organisms
tissue type
timing/developmental stage
sampling of more transcripts and variants
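To put the EST figures above in perspective, the short calculation below estimates how many errors a single read would carry. It is a sketch that assumes errors are independent and spread evenly along the read, which the slides do not state; the function names are illustrative.

```python
def expected_errors(read_length: int, error_rate: float = 0.02) -> float:
    """Expected number of erroneous bases in one read."""
    return read_length * error_rate


def error_free_probability(read_length: int, error_rate: float = 0.02) -> float:
    """Chance that a read has no errors at all, assuming independent errors."""
    return (1.0 - error_rate) ** read_length


# The two read lengths quoted for ESTs, at the ~2% error rate.
for length in (500, 800):
    print(length, expected_errors(length), error_free_probability(length))
```

At a ~2% error rate this works out to about 10 expected errors in a 500-nucleotide read and about 16 in an 800-nucleotide read, so an error-free EST would be rare under these assumptions.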
Where Can I See This EST Geneset?
ContigView
Choose EST genes
EST track
Pseudogenes: ‘False’ Genes
Unprocessed: produced by gene duplication and rearrangement
Processed: produced by reverse transcription of mRNA and re-integration
[Figure: an mRNA with its poly-A tail (AAAAAA) is reverse transcribed and re-integrated as a processed pseudogene (AAAAAA)]
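The distinction between the two classes can be sketched as a toy rule. It relies on the fact that a processed pseudogene derives from a spliced mRNA, so it lacks introns and may retain a poly-A tract, whereas an unprocessed pseudogene keeps the exon-intron structure of the duplicated gene. The function names, the `min_run` threshold and the inputs are assumptions for illustration; this is not how Ensembl annotates pseudogenes.

```python
def has_poly_a_tail(sequence: str, min_run: int = 6) -> bool:
    """True if the sequence ends in at least `min_run` adenines."""
    return sequence.upper().endswith("A" * min_run)


def classify_pseudogene(exon_count: int, sequence: str) -> str:
    """Toy rule: a single exon plus a poly-A tract suggests 'processed'."""
    if exon_count == 1 and has_poly_a_tail(sequence):
        return "processed"
    if exon_count > 1:
        return "unprocessed"
    return "undetermined"


# Made-up sequences, used only to exercise the rule.
print(classify_pseudogene(1, "GATTACAAAAAAA"))
print(classify_pseudogene(4, "GATTACAGGC"))
print(classify_pseudogene(1, "GATTACAGGC"))
```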
ncRNAs (non-coding RNAs)
What types are in Ensembl?
tRNA (transfer RNA)
rRNA (ribosomal RNA)
scRNA (small cytoplasmic)
snRNA (small nuclear)
snoRNA (small nucleolar)
miRNA (microRNA)
ncRNAs (2 types)
I) RNA with low homology can be identified through conserved secondary structure (search the genome using Rfam patterns)
II) High sequence conservation (miRNA)
BLAST alignment
‘RNA fold’ applied to make sure sequences can fold (hairpin)
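The hairpin requirement in II) can be illustrated with a deliberately simple check: do the two ends of a candidate sequence pair up around a loop? This is not the RNA folding program named on the slide, which is not reproduced here. The sketch tests Watson-Crick pairing only, and `stem_length`, `min_loop` and the function name are assumptions made for the example.

```python
# Watson-Crick base pairs only; G-U wobble pairs are ignored in this sketch.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def can_form_hairpin(sequence: str, stem_length: int, min_loop: int = 3) -> bool:
    """True if the first and last `stem_length` bases can pair into a stem."""
    rna = sequence.upper().replace("T", "U")
    if len(rna) < 2 * stem_length + min_loop:
        return False  # too short to leave room for a loop
    five_prime = rna[:stem_length]
    three_prime = rna[-stem_length:]
    # Base i of the 5' arm pairs with base i counted from the 3' end.
    return all((a, b) in PAIRS for a, b in zip(five_prime, reversed(three_prime)))


# Made-up sequences, used only to exercise the check.
print(can_form_hairpin("GGGAAAUCCC", stem_length=3))
print(can_form_hairpin("GGGAAAUGGG", stem_length=3))
```

A real folding tool scores all possible structures by energy; this check only asks whether one fixed stem is possible.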
ncRNAs… where can I see them?
Find them in ContigView:
or use BioMart.
Summary – Ensembl Genes
All Ensembl genes are based on biological evidence (protein and mRNA).
One Ensembl gene may come from proteins and mRNAs in various databases.
Havana (manually curated) genes are incorporated into the Ensembl geneset, merged for human.
The CCDS set strives for consensus coding sequences across databases.
Pseudogenes and RNAs are annotated, along with a separate EST gene set.
For more on GeneBuild:
Help and Documentation (About Ensembl)
http://www.ensembl.org/info/about/docs/genome_annotation.html