Introduction to the GO: a user’s guide. NCSU GO Workshop, 29 October 2009



Genomic Annotation

Genome annotation is the process of attaching biological information to genomic sequences. It consists of two main steps:

1. identifying functional elements in the genome: “structural annotation”

2. attaching biological information to these elements: “functional annotation”

Biologists often use the term “annotation” when they are referring only to structural annotation.

[Figure: DNA annotation and protein annotation of CHICK_OLF6, with data from the Ensembl Genome browser. Labels: “Structural annotation”, “Functional annotation”, TRAF 1, 2 and 3, TRAF 1 and 2, catenin.]

Structural & Functional Annotation

Structural Annotation:
• Open reading frames (ORFs) are predicted during genome assembly.
• Predicted ORFs require experimental confirmation.
• The Sequence Ontology (SO) provides a structured controlled vocabulary for sequence annotation.

Functional Annotation:
• Annotation of gene products = Gene Ontology (GO) annotation.
• Initially, predicted ORFs have no functional literature, and GO annotation relies on computational methods (rapid).
• Functional literature exists for many genes/proteins prior to genome sequencing.
• GO annotation does not rely on a completed genome sequence!

Introduction to GO

1. Bio-ontologies

2. The Gene Ontology (GO)
• a GO annotation example
• GO evidence codes
• literature biocuration & computational analysis
• ND vs no GO
• sources of GO

3. Using the GO

4. The gene association file

1. Bio-ontologies

Bio-ontologies

Bio-ontologies are used to capture biological information in a way that can be read by both humans and computers.
• necessary for high-throughput “omics” datasets
• allows data sharing across databases

Objects in an ontology (e.g. genes, cell types, tissue types, stages of development) are well defined.

The ontology shows how the objects relate to each other.

Bio-ontologies: http://www.obofoundry.org/

Ontologies

Each term in an ontology has:
• a digital identifier (for computers)
• a description (for humans)
• relationships between terms
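The three parts of an ontology term (identifier, description, relationships) can be sketched as a small data structure. This is a toy illustration only: the four terms are borrowed from the NDUFAB1 example in this guide, and the dictionary layout and the `ancestors` helper are assumptions made for the sketch, not a GO Consortium file format or API.

```python
# Toy ontology: each term has an identifier (for computers), a name (for
# humans), and typed relationships to parent terms.
# Simplified for illustration; a real GO release holds many more terms
# and relationships per term.
TERMS = {
    "GO:0008610": {"name": "lipid biosynthetic process", "parents": []},
    "GO:0006633": {
        "name": "fatty acid biosynthetic process",
        "parents": [("is_a", "GO:0008610")],
    },
    "GO:0005739": {"name": "mitochondrion", "parents": []},
    "GO:0005759": {
        "name": "mitochondrial matrix",
        "parents": [("part_of", "GO:0005739")],
    },
}


def ancestors(term_id, terms=TERMS):
    """Return the IDs of all terms reachable by following parent links."""
    found = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        for _relation, parent in terms[current]["parents"]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(ancestors("GO:0006633"))  # {'GO:0008610'}
```

Because relationships are explicit, an annotation to a specific term can also be found by a search on a broader parent term.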

2. The Gene Ontology

Functional Annotation

• The Gene Ontology (GO) is the de facto method for functional annotation.
• It is widely used for functional genomics (high throughput).
• Many tools are available for gene expression analysis using GO.
• The GO Consortium homepage: http://www.geneontology.org

GO Mapping Example

NDUFAB1 (UniProt P52505): Bovine NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa

Biological Process (BP or P)
GO:0006633   fatty acid biosynthetic process   TAS
GO:0006120   mitochondrial electron transport, NADH to ubiquinone   TAS
GO:0008610   lipid biosynthetic process   IEA

Cellular Component (CC or C)
GO:0005759   mitochondrial matrix   IDA
GO:0005747   mitochondrial respiratory chain complex I   IDA
GO:0005739   mitochondrion   IEA

Molecular Function (MF or F)
GO:0005504   fatty acid binding   IDA
GO:0008137   NADH dehydrogenase (ubiquinone) activity   TAS
GO:0016491   oxidoreductase activity   TAS
GO:0000036   acyl carrier activity   IEA


Each annotation in the example consists of:
• the aspect or ontology
• the GO:ID (unique)
• the GO term name
• the GO evidence code
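The four parts of an annotation map naturally onto a record type. The sketch below stores the NDUFAB1 annotations from the example and groups them by aspect; the `Annotation` tuple and the `by_aspect` helper are hypothetical names chosen for this illustration.

```python
from collections import defaultdict, namedtuple

# One record per annotation: aspect (P, C or F), GO ID, term name, evidence code.
Annotation = namedtuple("Annotation", "aspect go_id term evidence")

# Annotations for NDUFAB1 (UniProt P52505) as listed in the example.
NDUFAB1 = [
    Annotation("P", "GO:0006633", "fatty acid biosynthetic process", "TAS"),
    Annotation("P", "GO:0006120",
               "mitochondrial electron transport, NADH to ubiquinone", "TAS"),
    Annotation("P", "GO:0008610", "lipid biosynthetic process", "IEA"),
    Annotation("C", "GO:0005759", "mitochondrial matrix", "IDA"),
    Annotation("C", "GO:0005747", "mitochondrial respiratory chain complex I", "IDA"),
    Annotation("C", "GO:0005739", "mitochondrion", "IEA"),
    Annotation("F", "GO:0005504", "fatty acid binding", "IDA"),
    Annotation("F", "GO:0008137", "NADH dehydrogenase (ubiquinone) activity", "TAS"),
    Annotation("F", "GO:0016491", "oxidoreductase activity", "TAS"),
    Annotation("F", "GO:0000036", "acyl carrier activity", "IEA"),
]


def by_aspect(annotations):
    """Group GO IDs by aspect, preserving input order within each group."""
    grouped = defaultdict(list)
    for annotation in annotations:
        grouped[annotation.aspect].append(annotation.go_id)
    return dict(grouped)


for aspect, go_ids in by_aspect(NDUFAB1).items():
    print(aspect, go_ids)
```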


GO EVIDENCE CODES

Direct Evidence Codes
IDA - inferred from direct assay
IEP - inferred from expression pattern
IGI - inferred from genetic interaction
IMP - inferred from mutant phenotype
IPI - inferred from physical interaction

Indirect Evidence Codes

inferred from literature
IGC - inferred from genomic context
TAS - traceable author statement
NAS - non-traceable author statement
IC - inferred by curator

inferred by sequence analysis
RCA - inferred from reviewed computational analysis
IS* - inferred from sequence*
IEA - inferred from electronic annotation

* The IS codes are:
ISS - inferred from sequence or structural similarity
ISA - inferred from sequence alignment
ISO - inferred from sequence orthology
ISM - inferred from sequence model

Other
NR - not recorded (historical)
ND - no biological data available
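When filtering annotations it helps to have the evidence code groups in machine-readable form. The lookup below is a minimal sketch that mirrors the grouping on this slide; the group labels and the `evidence_group` function are assumptions made for the example, and the GO Consortium's own grouping of evidence codes may differ.

```python
# Evidence code groups, following the slide's grouping.
EVIDENCE_GROUPS = {
    "direct": {"IDA", "IEP", "IGI", "IMP", "IPI"},
    "indirect: literature": {"IGC", "TAS", "NAS", "IC"},
    "indirect: sequence analysis": {"RCA", "ISS", "ISA", "ISO", "ISM", "IEA"},
    "other": {"NR", "ND"},
}


def evidence_group(code):
    """Return the group label for an evidence code, or raise ValueError."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    raise ValueError(f"unknown evidence code: {code}")


def keep_direct(annotations):
    """Keep only (go_id, evidence) pairs backed by direct experimental evidence."""
    return [(go_id, code) for go_id, code in annotations
            if evidence_group(code) == "direct"]


pairs = [("GO:0005504", "IDA"), ("GO:0000036", "IEA"), ("GO:0016491", "TAS")]
print(keep_direct(pairs))  # [('GO:0005504', 'IDA')]
```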


Biocuration of literature
• detailed function
• “depth”
• slower (manual)

Biocuration of Literature: detailed gene function

Example: P05147 (PMID: 2976880)
• Find a paper about the protein.
• Read the paper to get experimental evidence of function.
• Use the most specific term possible.
• The experiment assayed kinase activity: use the IDA evidence code.


Sequence analysis
• rapid (computational)
• “breadth” of coverage
• less detailed

Unknown Function vs No GO

ND (no data): biocurators have tried to add GO but there is no functional data available.
• Previously: “process_unknown”, “function_unknown”, “component_unknown”
• Now: annotated to the root terms “biological process”, “molecular function”, “cellular component”

No annotations (including no “ND”): biocurators have not yet annotated the gene product.

This is important for your dataset: what % has GO?
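The question “what % has GO?” can be answered with a simple tally that keeps “ND only” separate from “no annotation at all”. This is a sketch under assumed inputs: a list of gene product IDs and a dictionary mapping each annotated ID to its evidence codes. The function name and the example accessions are hypothetical.

```python
def go_coverage(gene_products, annotations):
    """Return the percentage of gene products in each annotation category.

    annotations maps a gene product ID to the list of evidence codes of its
    GO annotations; IDs missing from the mapping have no GO at all.
    """
    counts = {"annotated": 0, "nd_only": 0, "no_go": 0}
    for gene_product in gene_products:
        codes = annotations.get(gene_product, [])
        if not codes:
            counts["no_go"] += 1        # biocurators have not looked yet
        elif all(code == "ND" for code in codes):
            counts["nd_only"] += 1      # looked, but no functional data
        else:
            counts["annotated"] += 1
    total = len(gene_products)
    if total == 0:
        return {category: 0.0 for category in counts}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical dataset of four gene products.
dataset = ["A1", "A2", "A3", "A4"]
codes_by_id = {"A1": ["IDA", "IEA"], "A2": ["ND", "ND", "ND"], "A3": ["TAS"]}
print(go_coverage(dataset, codes_by_id))
# {'annotated': 50.0, 'nd_only': 25.0, 'no_go': 25.0}
```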

Sources of GO

1. Primary sources of GO: from the GO Consortium (GOC) & GOC members
• most up to date
• most comprehensive

2. Secondary sources: other resources that use GO provided by GOC members
• public databases (e.g. NCBI, UniProtKB)
• genome browsers (e.g. Ensembl)
• array vendors (e.g. Affymetrix)
• GO expression analysis tools

Secondary Sources of GO annotation

Different tools and databases display the GO annotations differently.

Since GO terms are continually changing and GO annotations are continually added, you need to know when the GO annotations were last updated.

EXAMPLES:
• public databases (e.g. NCBI, UniProtKB)
• genome browsers (e.g. Ensembl)
• array vendors (e.g. Affymetrix)

CONSIDERATIONS:
• What is the original source?
• When was it last updated?
• Are evidence codes displayed?

For more information about GO

GO Evidence Codes: http://www.geneontology.org/GO.evidence.shtml

gene association file information: http://www.geneontology.org/GO.format.annotation.shtml

tools that use the GO: http://www.geneontology.org/GO.tools.shtml

GO Consortium wiki: http://wiki.geneontology.org/index.php/Main_Page

3. Using the GO

Use GO Browsers for:
• searching for GO terms
• searching for gene product annotation
• filtering sets of annotations and downloading results
• creating/using GO slims

GO Browsers

QuickGO Browser (EBI GOA Project)
• http://www.ebi.ac.uk/ego/
• Can search by GO Term or by UniProt ID
• Includes IEA annotations

AmiGO Browser (GO Consortium Project)
• http://amigo.geneontology.org/cgi-bin/amigo/go.cgi
• Can search by GO Term or by UniProt ID
• Does not include IEA annotations

Use GO for:
• Determining which classes of gene products are over-represented or under-represented.
• Grouping gene products by biological function.
• Relating a protein’s location to its function.
• Focusing on particular biological pathways and functions (hypothesis-driven data interrogation).

http://www.geneontology.org/

However:
• many of these tools do not support non-model organisms
• the tools have different computing requirements
• it may be difficult to determine how up-to-date the GO annotations are

You need to evaluate tools for your system.

Evaluating GO tools

Some criteria for evaluating GO tools:
1. Does it include my species of interest (or do I have to “humanize” my list)?
2. What does it require to set up (computer usage/online)?
3. What was the source for the GO (primary or secondary) and when was it last updated?
4. Does it report the GO evidence codes (and is IEA included)?
5. Does it report which of my gene products has no GO?
6. Does it report both over/under represented GO groups and how does it evaluate this?
7. Does it allow me to add my own GO annotations?
8. Does it represent my results in a way that facilitates discovery?
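Criterion 6 asks how a tool evaluates over- or under-representation. One common choice is the hypergeometric test, sketched below with the Python standard library only. This is an illustration of that one statistic, not the method of any particular GO tool; tools differ in the test they use and in how they correct for multiple testing, which this sketch does not do. The function and parameter names are assumptions.

```python
from math import comb


def over_representation_p(k, n, K, N):
    """P(X >= k): chance of seeing at least k annotated gene products.

    N: gene products in the background set
    K: background gene products annotated to the GO term
    n: gene products in the study list
    k: study-list gene products annotated to the GO term
    """
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def under_representation_p(k, n, K, N):
    """P(X <= k): chance of seeing at most k annotated gene products."""
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(0, min(k, n, K) + 1))
    return hits / comb(N, n)


# Hypothetical numbers: 3 of 10 background gene products carry the term,
# and both members of a 2-member study list carry it.
print(over_representation_p(2, 2, 3, 10))
```

A small p-value from `over_representation_p` suggests the term appears in the study list more often than expected by chance. The result is only as good as the background set and the annotations behind it, which is why the other criteria in the list matter.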

4. Gene association files

The gene association (ga) file
• standard file format used to capture GO annotation data
• tab-delimited file containing 15* fields of information:
  - information about the gene product: database, accession, name, symbol, synonyms, species
  - information about the function: GO ID, ontology, reference, evidence, qualifiers, context (with/from)
  - data about the functional annotation: date, annotator

* 2 additional fields will soon be added to capture information about isoforms and other ontologies.

[Figure: example gene association file, with columns grouped as gene product information, function information, and metadata (when & who). An additional column was added to this example.]
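A 15-field, tab-delimited line can be read with a few lines of code. The sketch below assumes the column order of the GAF 1.0 layout (check it against the format page listed under “For more information about GO”). The field names are shorthand chosen for this example. In the sample line, only the accession, symbol, name, GO ID, aspect and evidence code come from the NDUFAB1 example; the reference, date and annotator values are placeholders.

```python
# Assumed GAF 1.0 column order (15 fields).
GAF_FIELDS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]


def parse_gaf_line(line):
    """Parse one gene association line into a dict; return None for comments."""
    if line.startswith("!"):
        return None  # header and comment lines start with "!"
    values = line.rstrip("\n").split("\t")
    if len(values) < len(GAF_FIELDS):
        raise ValueError(f"expected {len(GAF_FIELDS)} fields, got {len(values)}")
    # Any extra trailing columns (newer formats) are ignored in this sketch.
    return dict(zip(GAF_FIELDS, values))


example = "\t".join([
    "UniProtKB", "P52505", "NDUFAB1", "", "GO:0005504",
    "PMID:PLACEHOLDER",   # placeholder reference, not a real citation
    "IDA", "", "F",
    "NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa",
    "", "protein", "taxon:9913",
    "20091029",           # placeholder date
    "EXAMPLE",            # placeholder annotator
])
record = parse_gaf_line(example)
print(record["db_object_symbol"], record["go_id"], record["evidence_code"])
# NDUFAB1 GO:0005504 IDA
```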

Gene association files

GO Consortium ga files
• many organism-specific files
• also includes EBI GOA files

EBI GOA ga files
• UniProt file contains GO annotation for all species represented in UniProtKB

AgBase ga files
• organism-specific files
• AgBase GOC file: submitted to GO Consortium & EBI GOA
• AgBase Community file: GO annotations not yet submitted or not supported
• all files are quality checked
