The Gene Ontology & Gene Ontology Annotation resources

Post on 23-Jan-2017

The Gene Ontology and Gene Ontology Annotation resources
Mélanie Courtot, Ph.D.
EMBL-EBI
GO/GOA Project leader
SPOT/UniProt content teams
mcourtot@ebi.ac.uk

Industry workshop, March 17 2016

In 1999, collaboration between 3 Model Organism Databases

Ashburner et al., Nat Genet. 2000 May;25(1):25-9.

GO was originally developed as a way to query across model organism databases (MODs).

A few of them got together and realised that there was a lot more power in their data if they used a common semantics for describing protein function.

The Gene Ontology

A way to capture biological knowledge for individual gene products in a written and computable form
A set of concepts and their relationships to each other, arranged as a hierarchy
Less specific concepts -> More specific concepts
http://www.ebi.ac.uk/QuickGO

1. Molecular Function
An elemental activity or task or job
protein kinase activity
insulin receptor activity

2. Biological Process
A commonly recognized series of events
cell division

3. Cellular Component
Where a gene product is located
mitochondrion
mitochondrial matrix
mitochondrial inner membrane

Aims of the GO project
Develop the ontology
Annotate gene products using ontology terms
Provide a public resource of data and tools

Develop the ontology
An OWL ontology of >41,000 classes: biological process, cellular component, molecular function
>14,000 imported classes (CL, Uberon, ChEBI, NCBI_tax)
>136,000 logical axioms, including:
~72,000 subClassOf axioms between named GO classes
~41,000 simple existential restrictions (subClassOf R some C)
EL expressivity => fast, scalable reasoning (with ELK)
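As a rough illustration of the two axiom shapes counted above, the sketch below stores a few subClassOf edges and a few existential restrictions (subClassOf R some C) as plain Python data and walks them to collect a term's ancestors. The edges are a simplified toy built from terms named in this talk, not an extract of a GO release, and the names SUBCLASS_OF, EXISTENTIAL and ancestors are hypothetical. Real classification over GO would rely on an OWL reasoner such as ELK rather than on a hand-written traversal.

```python
# Toy axioms, simplified for illustration; NOT copied from a GO release.
# Plain subClassOf axioms between named classes: child -> parents.
SUBCLASS_OF = {
    "insulin receptor activity": ["protein kinase activity"],
    "protein kinase activity": ["molecular function"],
    "mitochondrion": ["cellular component"],
}

# Simple existential restrictions (subClassOf R some C): child -> (R, C) pairs.
EXISTENTIAL = {
    "mitochondrial matrix": [("part_of", "mitochondrion")],
    "mitochondrial inner membrane": [("part_of", "mitochondrion")],
}


def ancestors(term, relations=("part_of",)):
    """Collect every class reachable via subClassOf and the chosen relations."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        parents = list(SUBCLASS_OF.get(current, []))
        parents += [
            filler
            for relation, filler in EXISTENTIAL.get(current, [])
            if relation in relations
        ]
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("mitochondrial matrix")))
```

Passing relations=() restricts the walk to subClassOf edges only, which shows how much of the hierarchy depends on the existential restrictions.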

https://www.cs.ox.ac.uk/isg/tools/ELK/

Building the GO
The GO editorial team
Submission via GitHub, https://github.com/geneontology/
Submissions via TermGenie, http://go.termgenie.org
~80% of terms are now created this way

Annotate gene products

gene -> GO term

associated genes

GO Database

genome and protein databases

The individual genome and protein databases submit their genes and proteins annotated to GO terms to a central GO database.

The central GO database is in turn consumed by many downstream tools.

A GO annotation is a statement that a gene product:

1. has a particular molecular function, or is involved in a particular biological process, or is located within a certain cellular component
2. as described in a particular reference
3. as determined by a particular method

Accession: P00505
Name: GOT2
GO ID: GO:0004069
GO term name: aspartate transaminase activity
Reference: PMID:2731362
Evidence code: IDA
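The three parts of that statement can be sketched as a small record type. The field values below are the ones shown on the slide; the class name GOAnnotation and its describe method are hypothetical and only meant to show how term, reference and evidence travel together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOAnnotation:
    """One annotation: gene product + GO term + reference + evidence code."""

    accession: str
    name: str
    go_id: str
    go_term_name: str
    reference: str
    evidence_code: str

    def describe(self) -> str:
        # Mirror the three-part statement: what, described where, determined how.
        return (
            f"{self.name} ({self.accession}) is annotated to "
            f"{self.go_term_name} ({self.go_id}), "
            f"as described in {self.reference}, "
            f"as determined by evidence code {self.evidence_code}."
        )


# Values taken from the example on the slide.
example = GOAnnotation(
    accession="P00505",
    name="GOT2",
    go_id="GO:0004069",
    go_term_name="aspartate transaminase activity",
    reference="PMID:2731362",
    evidence_code="IDA",
)
print(example.describe())
```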

Tracking provenance
Experimental data
Computational analysis
Author statements/curator inference
(+ Inferred from electronic annotations)
http://www.evidenceontology.org/

Manual annotations
Time-consuming process producing lower numbers of annotations (~2,800 taxa covered)
More specific GO terms
Manual annotation is essential for creating predictions

Aleksandra Shypitsyna
Elena Speretta
Alex Holmes
Tony Sawford

Electronic Annotations

Quick way of producing large numbers of annotations
Annotations use less-specific GO terms
Only source of annotation for ~438,000 non-model organism species

orthology

taxon constraints

Provides automatic predictions for uncharacterized sequences

Predicts membership of protein families and presence of domains and features

manually curated mapping of families and domains to GO terms

Mappings are kept high-level: a term has to be true for all (or most) members of a family. Downstream filters can be added.

Incorrect annotations, when spotted, are fed back to improve the mapping, either by changing the mapping or by adding QC filtering downstream (e.g. taxon constraints).

The mapping of domains has recently been improved: annotations used to apply to the whole protein that contained the domain, and now apply just to the domain itself, which is much more accurate.

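The pipeline described above (predict domains, map them to GO terms through a curated table, then filter) can be sketched as follows. Everything here is hypothetical: the domain names are placeholders rather than real signature identifiers, the constraint table is a made-up example of a taxon constraint, and predict_terms is an illustrative name, not part of any real tool.

```python
# Hypothetical curated mapping: domain or family -> GO terms that hold for
# all (or most) members. Keys are placeholders, not real identifiers.
DOMAIN_TO_GO = {
    "example_kinase_domain": ["protein kinase activity"],
    "example_mito_family": ["mitochondrion"],
}

# Hypothetical taxon constraints: GO term -> taxa where it must not be used.
NEVER_IN_TAXON = {
    "mitochondrion": {"Bacteria"},
}


def predict_terms(domains, lineage):
    """Map predicted domains to GO terms, then drop terms the lineage forbids."""
    predicted = []
    for domain in domains:
        for term in DOMAIN_TO_GO.get(domain, []):
            excluded = NEVER_IN_TAXON.get(term, set())
            # QC filter: keep the term only if no taxon in the lineage is excluded.
            if excluded.isdisjoint(lineage):
                predicted.append(term)
    return predicted


domains = ["example_kinase_domain", "example_mito_family"]
print(predict_terms(domains, lineage={"Eukaryota"}))
print(predict_terms(domains, lineage={"Bacteria"}))
```

Keeping the constraint check separate from the mapping table reflects the two repair routes mentioned above: either the mapping itself is changed, or a filter is added downstream.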
Provide a public resource of data and tools

Number of annotations in UniProt-GOA database (March 2016)
Manual annotations*: 2,752,604
Electronic annotations: 269,207,317
* Includes manual annotations integrated from external model organism and specialist groups

http://www.ebi.ac.uk/GOA
https://www.ebi.ac.uk/QuickGO/

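Enrichment analysis

[Figure: share of genes per annotation category in the Sample versus the Reference; labeled shares are 40%, 20%, 20% and 20%]

=> The sample is over-enriched for [category highlighted in the figure]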
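One common way to decide whether such a difference is more than chance is a one-sided hypergeometric test. The talk does not say which statistic was used, so the sketch below is an assumption, as are the counts: it reads the figure as 4 of 10 sample genes (40%) carrying a term that 20 of 100 reference genes (20%) carry, and it assumes the sample is drawn from the reference set. The function name enrichment_p_value is hypothetical.

```python
from math import comb


def enrichment_p_value(sample_hits, sample_size, reference_hits, reference_size):
    """One-sided hypergeometric p-value: P(X >= sample_hits).

    X is the number of term-annotated genes seen when drawing sample_size
    genes without replacement from a reference of reference_size genes,
    reference_hits of which carry the term.
    """
    most_possible = min(reference_hits, sample_size)
    total_draws = comb(reference_size, sample_size)
    tail = sum(
        comb(reference_hits, i)
        * comb(reference_size - reference_hits, sample_size - i)
        for i in range(sample_hits, most_possible + 1)
    )
    return tail / total_draws


# Illustrative counts only (assumed reading of the 40% vs 20% figure).
p = enrichment_p_value(
    sample_hits=4, sample_size=10, reference_hits=20, reference_size=100
)
print(f"p-value for 4/10 in the sample vs 20/100 in the reference: {p:.4f}")
```

In practice many terms are tested at once, so a multiple-testing correction would normally be applied to the resulting p-values.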

Spinocerebellar ataxia type 28

Paola Roncaglia

Novel biomarkers of rectal radiotherapy

Biomarker for diagnosis and prognosis

Gene expression changes in diabetes

Improved network analysis

GO slims

Many gene products are associated with a large number of descriptive, leaf GO nodes; however, annotations can be mapped up to a smaller set of parent GO terms.
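Mapping up to a slim can be sketched as: take each annotated term, collect the term and all of its ancestors, and keep whichever of those belong to the slim. The hierarchy and the slim below are toy examples built from terms named in this talk, not a real GO release or an official slim, and the names PARENTS, SLIM and map_to_slim are hypothetical.

```python
# Toy hierarchy (child -> parents), simplified for illustration only.
PARENTS = {
    "mitochondrial matrix": ["mitochondrion"],
    "mitochondrial inner membrane": ["mitochondrion"],
    "mitochondrion": ["cellular component"],
    "insulin receptor activity": ["protein kinase activity"],
    "protein kinase activity": ["molecular function"],
}

# Hypothetical slim: the smaller set of parent terms we want to report on.
SLIM = {"mitochondrion", "protein kinase activity"}


def ancestors_and_self(term):
    """Return the term together with every ancestor reachable via PARENTS."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def map_to_slim(terms, slim=SLIM):
    """Map specific annotations up to the slim terms that subsume them."""
    mapped = set()
    for term in terms:
        mapped |= ancestors_and_self(term) & slim
    return mapped


annotations = [
    "mitochondrial matrix",
    "mitochondrial inner membrane",
    "insulin receptor activity",
]
print(sorted(map_to_slim(annotations)))
```

Three leaf annotations collapse to two slim terms here, which is the reduction in redundancy that a slim is meant to provide.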

Slim generation for industry
Collaboration funded by Roche
Need a custom GO slim for analysis of gene sets of interest
Needs to be descriptive enough
Without redundancy
Internal proprietary vocabulary hard to maintain
Desire to automatically map to GO
http://www.swat4ls.org/wp-content/uploads/2015/10/SWAT4LS_2015_paper_44.pdf

ROCHE CV

GSEA with full GO
GSEA with Roche CV
Courtesy Laura Badi

Mapping query: participant_OR_reg_participant some cannabinoid
Description: A process in which a cannabinoid participates, or that regulates a process in which a cannabinoid participates.

Results
We have successfully mapped 84% of terms from RCV (308/365) to OWL queries that can be used to replicate some proportion of the original manual mapping. In addition, these queries find 1000s of terms that were missed in the original mapping.

David Osumi-Sutherland

GO SLIM (generic)

GSEA comparing expression between embryo and adult liver using gene sets derived from the generic GO slim (panel A), manually mapped RCV (panel B), and semi-automated RCV (panel C). Red nodes indicate gene sets enriched in liver compared to embryos. Blue nodes indicate gene sets enriched in embryos compared to liver. The size of each node is proportional to the size of the gene set. Connecting edge thickness is a measure of the number of enriched genes in common between two gene sets.

ROCHE CV MANUAL ONLY

ROCHE CV MANUAL + AUTO

Acknowledgements
GO editors and developers
GO annotators
The Gene Ontology (GO) Consortium
Samples, Phenotype and Ontology team (Helen Parkinson)
Protein Function Content team (Claire O'Donovan)
Funding: EMBL-EBI, National Human Genome Research Institute (NHGRI)

Useful links
Ontology browser: http://www.ebi.ac.uk/ols/beta/ontologies/go
Browsing GO & annotations, GO slims: https://www.ebi.ac.uk/QuickGO/
GO Annotation: http://www.ebi.ac.uk/GOA
EBI-Roche collaboration paper: http://www.swat4ls.org/wp-content/uploads/2015/10/SWAT4LS_2015_paper_44.pdf
Contact: mcourtot@ebi.ac.uk