introduction to the gene ontology and go annotation resources
DESCRIPTION
Rachael Huntley UniProtKB-GOA Curator EBI. Introduction to the Gene Ontology and GO annotation resources. Talk Overview. Intro to GO and GO terms. Annotating to GO. Accessing GO annotations. Practical use of GO. GO analysis tools. Precautions. Reasons for the Gene Ontology. - PowerPoint PPT PresentationTRANSCRIPT
EBI is an Outstation of the European Molecular Biology Laboratory.
Introduction to the Gene Ontologyand GO annotation resources
Rachael Huntley
UniProtKB-GOA Curator
EBI
2
Talk Overview
• Intro to GO and GO terms
• Annotating to GO
• Practical use of GO
• GO analysis tools
• Precautions
• Accessing GO annotations
3
Reasons for the Gene Ontology
www.geneontology.org
• Inconsistency in naming of biological concepts
• Large datasets need to be interpreted quickly
• Increasing amounts of biological data available
4
Search on ‘DNA repair’...get over 60,000 results
Increasing amounts of biological data available
Expansion of sequence information
5
Reasons for the Gene Ontology
• Inconsistency in naming of biological concepts
• Large datasets need to be interpreted quickly
• Increasing amounts of biological data available
6
• Need to organise the data, analyse it, share it in a timely manner to benefit other researchers
• Inconsistencies in analyses make cross-species or cross-database comparison difficult
Large datasets need to be interpreted quickly
7
Reasons for the Gene Ontology
• Inconsistency in naming of biological concepts
• Large datasets need to be interpreted quickly
• Increasing amounts of biological data available
8
English is not a very precise language
• Same name for different concepts• Different names for the same concept
Inconsistency in naming of biological concepts
?
An example …
Tactition Tactile sense
Taction
Sensory perception of touch ; GO:0050975
9
• A way to capture biological knowledge in a written and computable form
The Gene Ontology
• A set of concepts and their relationships to each other arrangedas a hierarchy
www.ebi.ac.uk/QuickGO
Less specific concepts
More specific concepts
10
The Concepts in GO
1. Molecular Function
2. Biological Process
3. Cellular Component
An elemental activity or task or job
• protein kinase activity• insulin receptor activity
A commonly recognised series of events
• cell division
Where a gene product is located
• mitochondrion
• mitochondrial matrix
• mitochondrial inner membrane
11
Anatomy of a GO term
Unique identifierUnique identifier
Term nameTerm name
DefinitionDefinitionSynonymsSynonyms
Cross-referencesCross-references
12
Ontology structure
• Directed acyclic graph
Terms can have more than one parent
• Terms are linked by relationships
is_a
part_of
regulates
+ve regulates
-ve regulateswww.ebi.ac.uk/QuickGO
13
http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
… there are more browsers available on the GO Consortium Tools page:
http://www.geneontology.org/GO.tools.shtml
Searching for GO terms
The latest OBO format Gene Ontology file can be downloaded from:
http://www.geneontology.org/ontology/gene_ontology.obo
http://www.ebi.ac.uk/QuickGO
14 http://www.ebi.ac.uk/QuickGO
The EBI's QuickGO browser
Search GO terms or proteins
15
GO Annotation
16
Reactome
http://www.geneontology.org
17
Aims of the GO project
• Compile the ontologies
- currently over 34,000 terms- constantly increasing and improving
• Annotate gene products using ontology terms
- around 30 groups provide annotations • Provide a public resource of data and tools
- regular releases of annotations- tools for browsing/querying annotations and editing the ontology
18
Gene Ontology Annotation (UniProtKB-GOA) database at the EBI
• Largest open-source contributor of annotations to GO
• Member of the GO Consortium since 2001
• Provides annotation for more than 360,000 species
• UniProtKB-GOA’s priority is to annotate the human proteome
• UniProtKB-GOA is responsible for human, chicken and bovine annotations in the GO Consortium
19
A GO annotation is … …a statement that a gene product;
1. has a particular molecular function or is involved in a particular biological process
or is located within a certain cellular component
2. as determined by a particular method
3. as described in a particular reference
P00505
Accession Name GO ID GO term name Reference Evidence code
IDAPMID:2731362aspartate transaminase activityGO:0004069GOT2
20
Electronic
Manual
GOA makes annotations using two methods;
• Quick way of producing large numbers of annotations• Annotations use less-specific GO terms
• Time-consuming process producing lower numbers of annotations
• Annotations are very detailed and accurate
21
Electronic
Manual
Evidence Codes
Inferred from Direct Assay (IDA) – e.g. enzyme assay
Inferred from Physical Interaction (IPI) – e.g. yeast 2-hybrid
Inferred from Electronic Annotation (IEA) – e.g. UniProtKB keyword mapping
Some examples…
See the full list on the GO Consortium websitehttp://www.geneontology.org/GO.evidence.shtml
22
1. Mapping of external concepts to GO terms
e.g. InterPro2GO, Swiss-Prot Keyword2GO, Enzyme Commission2GO
Annotations are high-quality and have an explanation of the method (GO_REF)
Electronic annotation by UniProtKB-GOA
Macaque
Mouse
Dog
e.g. Human
Cow
Guinea PigChimpanzee Rat
Chicken
Ensembl compara
2. Automatic transfer of annotations to orthologs
23
Mapping of concepts from UniProtKB files
GO:0004715 ; non-membrane spanning protein tyrosine kinase activity
GO:0005856: cytoskeleton
GO:0007155 ; cell adhesion
24
1. Mapping of external concepts to GO terms
e.g. InterPro2GO, Swiss-Prot Keyword2GO, Enzyme Commission2GO
Annotations are high-quality and have an explanation of the method (GO_REF)
Electronic annotation by UniProtKB-GOA
Macaque
Mouse
DogCow
Guinea PigChimpanzee Rat
Chicken
Ensembl compara
2. Automatic transfer of annotations to orthologs
...and more
e.g. Human
25
Manual annotation by UniProtKB-GOA
High–quality, specific annotations made using:
• Peer-reviewed papers
• A range of evidence codes to categorise the types of evidence found in a
paper
http://www.ebi.ac.uk/GOA
26
Finding annotations in a paper
In this study, we report the isolation and molecular characterization of the B. napus PERK1 cDNA, that is predicted to encode a novel receptor-like kinase. We have shown that like other plant RLKs, the kinase domain of PERK1 has serine/threonine kinase activity, In addition, the location of a PERK1-GTP fusion protein to the plasma membrane supports the prediction that PERK1 is an integral membrane protein…these kinases have been implicated in early stages of wound response…
Process: response to wounding GO:0009611
wound response
serine/threonine kinase activity,
Function: protein serine/threonine kinase activity GO:0004674
integral membrane protein
Component: integral to plasma membrane GO:0005887
…for B. napus PERK1 protein (Q9ARH1)
PubMed ID: 12374299
27
Additional information
• QualifiersModify the interpretation of an annotation
• NOT (protein is not associated with the GO term)
• colocalizes_with (protein associates with complex but is not a bona fide member)
• contributes_to (describes action of a complex of proteins)
• 'With' column
Can include further information on the method being referenced
e.g. the protein accession of an interacting protein
28
GO:0005739 Mitochondrion
HumanProteinAtlas
UniProtKB-GOA integrates manual annotations from external groups
LifeDB (subcellular locations)
Reactome (pathways)
also;
29
* Includes manual annotations integrated from external model organism and multi-species databases
Status of UniProtKB-GOA Annotation
0.9%
66%
159,630 910,536Manual annotations*
11,261,38298,037,844Electronic annotations
Proteins Annotations Evidence Source UniProt
Coverage
Aug 2011 Statistics
30
How to access and use
GO annotation data
31
Where can you find annotations?
UniProtKB
Ensembl
Entrez gene
32
GO Consortium website
UniProtKB-GOA website
Gene Association Files17 column files containing all information for each annotation
33
GO browsers
34 http://www.ebi.ac.uk/QuickGO
The EBI's QuickGO browser
Search GO terms or proteins
Find sets of GO annotations
35
• Access gene product functional information
• Analyse high-throughput genomic or proteomic datasets
• Validation of experimental techniques
• Get a broad overview of a proteome
• Obtain functional information for novel gene products
How scientists use the GO
Some examples…
36
Selected Gene Tree: pearson lw n3d ...Branch color classification:Set_LW_n3d_5p_...
Colored by: Copy of Copy of C5_RMA (Defa...Gene List: all genes (14010)
attacked
time
control
Puparial adhesionMolting cycleHemocyanin
Defense responseImmune responseResponse to stimulusToll regulated genesJAK-STAT regulated genes
Immune responseToll regulated genes
Amino acid catabolismLipid metobolism
Peptidase activityProtein catabolismImmune response
Selected Gene Tree: pearson lw n3d ...Branch color classification:Set_LW_n3d_5p_...
Colored by: Copy of Copy of C5_RMA (Defa...Gene List: all genes (14010)
Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.
MicroArray data analysis
Analysis of high-throughput genomic datasets
37
(Orrù et al., Molecular and Cellular Proteomics 2007)
Characterisation of proteins interacting with ribosomal protein S19
Analysis of high-throughput proteomic datasets
38
Validation of experimental techniques
(Cao et al., Journal of Proteome Research 2006)
Rat liver plasma membrane isolation
39
Obtain functional information for novel gene products
MPYVSQSQHIDRVRGAIEGRLPAPGNSSRLVSSWQRSYEQYRLDPGSVIGPRVLTSSELR DVQGKEEAFLRASGQCLARLHDMIRMADYCVMLTDAHGVTIDYRIDRDRRGDFKHAGLYI GSCWSEREEGTCGIASVLTDLAPITVHKTDHFRAAFTTLTCSASPIFAPTGELIGVLDAS AVQSPDNRDSQRLVFQLVRQSAALIEDGYFLNQTAQHWMIFGHASRNFVEAQPEVLIAFD ECGNIAASNRKAQECIAGLNGPRHVDEIFDTSAVHLHDVARTDTIMPLRLRATGAVLYAR IRAPLKRVSRSACAVSPSHSGQGTHDAHNDTNLDAISRFLHSRDSRIARNAEVALRIAGK HLPILILGETGVGKEVFAQALHASGARRAKPFVAVNCGAIPDSLIESELFGYAPGAFTGA RSRGARGKIAQAHGGTLFLDEIGDMPLNLQTRLLRVLAEGEVLPLGGDAPVRVDIDVICA THRDLARMVEEGTFREDLYYRLSGATLHMPPLRERADILDVVHAVFDEEAQSAGHVLTLD GRLAERLARFSWPGNIRQLRNVLRYACAVCDSTRVELRHVSPDVAALLAPDEAALRPALA LENDERARIVDALTRHHWRPNAAAEALGM
www.ebi.ac.uk/InterProScan
InterProScan
40
Annotating novel sequences
• Can use BLAST queries to find similar sequences with GO annotation which can be transferred to the new sequence
• Two tools currently available;
AmiGO BLAST (from GO Consortium) http://amigo.geneontology.org/cgi-bin/amigo/blast.cgi
– searches the GO Consortium database
BLAST2GO (from Babelomics) http://www.blast2go.org/
– searches the NCBI database
41
AmiGO BLAST
amigo.geneontology.org/cgi-bin/amigo/blast.cgi
Exportin-T from Pongo abelii (Sumatran orangutan)
42
Analysis of large datasets using GO
43
Using the GO to provide a functional overview for a large dataset
• Many GO analysis tools use GO slims to give a broad overview of the dataset
• GO slims are cut-down versions of the GO andcontain a subset of the terms in the whole GO
• GO slims usually contain less-specialised GO terms
44
Slimming the GO using the ‘true path rule’ Many gene products are associated with a large number of descriptive, leaf GO nodes:
45
Slimming the GO using the ‘true path rule’ …however annotations can be mapped up to a smaller set of parent GO terms:
46
GO slims
or you can make your own using;
Custom slims are available for download;
http://www.geneontology.org/GO.slims.shtml
• AmiGO's GO slimmer
• QuickGOhttp://www.ebi.ac.uk/QuickGO
http://amigo.geneontology.org/cgi-bin/amigo/slimmer
47 www.ebi.ac.uk/QuickGO
Map-up annotations with GO slims
The EBI's QuickGO browser
Search GO terms or proteins
Find sets of GO annotations
48
GO analysis tools
49
Numerous Third Party Tools
www.geneontology.org/GO.tools.shtml
50
Term enrichment
• Most popular type of GO analysis
• Determines which GO terms are more oftenassociated with a specified list of genes/proteins compared with a control list or rest of genome
• Many tools available to do this analysis
• User must decide which is best for their analysis
51
Precautions when using GO annotations for analysis
• Recommended that ‘NOT’ annotations are removed before analysis - only ~5000 out of ~100 million annotations are ‘NOT’- can confuse the analysis
• The Gene Ontology is always changing and GO annotations are continually being created
- always use a current version of both
- if publishing your analyses please report the versions/dates you used
http://www.geneontology.org/GO.cite.shtml
52
Precautions when using GO annotations for analysis
• Unannotated is not unknown
- where there is no evidence in the literature for a process, function orlocation the gene product is annotated to the appropriate ontology’sroot node with an ‘ND’ evidence code (no biological data), thereby distinguishing between unannotated and unknown
• Pay attention to under-represented GO terms- a strong under-representation of a pathway may mean that normalfunctioning of that pathway is necessary for the given condition
53
The UniProtKB-GOA group
Curators:
Software developer:
Team leaders:
Emily DimmerRachael Huntley
Rolf Apweiler
Email: [email protected]
http://www.ebi.ac.uk/GOA
Claire O’Donovan
Tony Sawford
Yasmin Alam-Faruque
54
Acknowledgements
GO ConsortiumMembers of;
UniProtKB
InterPro
IntAct
HAMAP
FundingNational Human Genome Research Institute (NHGRI)Kidney Research UKEMBL