The philosophy of biocuration and its use to analyse the fission yeast genome content

Valerie Wood

What is Biocuration

Two main aspects to fission yeast curation

1. Literature curation: involves reading the full text of publications and associating novel biological information with the appropriate genes or features

2. Sequence analysis: to infer biological information for unpublished genes

• We need to make annotations as specific (complete depth) and as comprehensive (complete breadth) as possible. We need to group similar annotations consistently so users can:

i) Access required information on a gene-by-gene basis

ii) Analyse their own datasets, e.g. enrichment analysis

iii) Search for candidate genes of interest

iv) Access similar features in other organisms

The Challenges

Data gathering for genes of interest

• traditionally a small number of genes
• requires detailed literature searching
• time-consuming

Gene 1: RNA recognition motif; mRNA export; protein phosphorylation; nuclear; mitotic cell cycle; phosphorylated; …

Gene 2: SAP domain; mRNA export; nucleolar; RNA elongation (pol II); …

Gene 3: mRNA export; transcription (pol II); …

Gene 4: mRNA export; transcription; polyadenylation; …

Gene 5: mRNA export; RNA elongation; …

Gene 6: mRNA export; rRNA transcription; DNA topological change; …

Gene 5000: cell cycle; chromosome segregation; kinetochore assembly; protein localization; …

Not Scalable!

Grouping by “feature”

By establishing links between similar features, we can begin to identify trends (enrichments and depletions) across the thousands of genes typically obtained in functional genomics datasets.

mRNA export: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5

nucleolar: Gene 10, Gene 15, Gene 18, …

phosphorylated: Gene 1, Gene 7, Gene 10, …

transcription: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, …

Cell cycle: Gene 1, Gene 7, Gene 8, …

RNA recognition motif: Gene 1, Gene 7, Gene 8, …
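
Grouping by feature amounts to inverting a gene-to-annotation list into an annotation-to-gene index. The Python sketch below illustrates the idea; the gene names, the annotations and the function name `group_by_feature` are illustrative assumptions, not real curated data or an existing tool.

```python
from collections import defaultdict

# Hypothetical per-gene annotations, modelled on the slide examples.
gene_annotations = {
    "Gene 1": ["RNA recognition motif", "mRNA export", "phosphorylated"],
    "Gene 2": ["SAP domain", "mRNA export", "nucleolar"],
    "Gene 3": ["mRNA export", "transcription (pol II)"],
}


def group_by_feature(annotations):
    """Invert a gene -> features mapping into feature -> sorted list of genes."""
    grouped = defaultdict(set)
    for gene, features in annotations.items():
        for feature in features:
            grouped[feature].add(gene)
    return {feature: sorted(genes) for feature, genes in grouped.items()}


feature_index = group_by_feature(gene_annotations)
print(feature_index["mRNA export"])  # ['Gene 1', 'Gene 2', 'Gene 3']
```

The inversion only works if the same feature is always written the same way, which is why consistent annotation matters.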

The literature corpus

What is the size of the ‘annotation problem’?

Fission yeast OR pombe gives 9264

Adding “cell cycle” gives 2871

Solutions
• More curators
• Community curation

Problems
• Funders don’t want to fund curation
• Can we make the community curate?

Grant

• Additional curators (2) to ensure comprehensive and deep curation of the literature

• Software to support curation activities (including community curation)

• A computational infrastructure to integrate and display the curated data with the HTP data within Ensembl

http://www.sanger.ac.uk/Projects/S_pombe/

Need to make an intuitive web-based user interface where the community can add “consistent” and comprehensive curation

Watch this space!

Ontologies

• Ontologies provide a “controlled vocabulary” for biological knowledge

• Consistent, unambiguous descriptions

• Species independent: interpreted identically both within and between genomes, therefore enabling cross-species comparisons

• Provide a way to capture and represent biological knowledge in a computable form

• Ability to annotate to different levels of granularity, depending on what is known or what can be inferred

Ontologies Include:

1. A vocabulary of terms (names for concepts)

2. Definitions

3. Defined logical relationships to each other

bud initiation? This could be tooth bud initiation, cell bud initiation or plant bud initiation.

Conversely, different names are used for the same concept: MVB sorting, multivesicular body sorting, late endosome to vacuole transport. These alternative names are exact synonyms.

Disambiguation and Grouping

This principle applies to any type of curation. For example, when describing phenotypes, similar cells can be described as “skittle”, “bottle” or “dumbbell”.

GO is 3 ontologies:

F: molecular function (activity, GTPase, transporter, receptor)

P: biological process (cell division, transcription, gluconeogenesis)

C: cellular component (location or complex)

Demonstrating ontology principles with GO

DAG Structure

DAG (Directed Acyclic Graph): many-to-many parental relationship. Each child may have one or more parents.

Hierarchy: one-to-many parental relationship. Each child has only one parent.

Relationships between terms

[Figure: example graph with the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relationships]

Inheritance

An important feature of GO is that broader parents give rise to more specific children. When a gene is directly annotated to a term (e.g. DNA replication), it is automatically indirectly annotated to all of its parent terms.

Allows curators to assign terms at different levels of granularity, depending what is known or can be inferred
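
Inheritance through a DAG can be sketched in a few lines of Python. This is a minimal illustration rather than how GO tools implement it: the graph fragment, the `PARENTS` table and the function names are assumptions, and the sketch treats every relationship type as a plain parent link.

```python
# Hypothetical fragment of a GO-like graph: each term lists its parents.
# A term may have several parents, so this is a DAG rather than a tree.
PARENTS = {
    "cell": [],
    "membrane": ["cell"],
    "chloroplast": ["cell"],
    "chloroplast membrane": ["membrane", "chloroplast"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(direct_terms, parents=PARENTS):
    """Direct annotations plus all indirect (inherited) annotations."""
    result = set(direct_terms)
    for term in direct_terms:
        result |= ancestors(term, parents)
    return result


print(sorted(propagate({"chloroplast membrane"})))
# ['cell', 'chloroplast', 'chloroplast membrane', 'membrane']
```

A gene annotated only to the specific term is therefore also found by a query on any broader term.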


Ontologies.....

• Provide a standard for annotation

• Have 2 components: the ontology and the annotations

• Allow experimental work to be evaluated in the context of other experimental data which may be annotated at different levels of granularity

• Allow biologists to search and analyse data (particularly for identifying groups of overrepresented genes in large scale experiments)

• Become increasingly powerful as the ontologies and annotations are refined

Other annotation types

• products (special case, unique descriptors)
• annotation status
• species distribution
• orthology
• phenotype data, will use PATO
• protein modifications, will use MOD
• metabolites, will use ChEBI (Chemical Entities of Biological Interest)
• sequence features, will use SO
• protein-protein interactions, will use MI and BioGRID

Increasingly, features will be described using “cross products” derived from multiple ontologies:

e.g. “response to a specific drug” will be made with the GO biological process term “response to drug” and a drug from ChEBI

e.g. phenotypes are typically annotated using a PATO “quality” term combined with a wild-type GO process (e.g. conjugation, defective; crossover formation, abolished)
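
One way to picture a cross product is as a record that pairs a term from one ontology with a term from another. The Python sketch below is illustrative only: the class name `CrossProduct` and its fields are assumptions, and it uses term names rather than real ontology identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrossProduct:
    """A composite descriptor built from terms in two ontologies."""
    base_ontology: str      # e.g. "GO"
    base_term: str          # e.g. "conjugation"
    modifier_ontology: str  # e.g. "PATO"
    modifier_term: str      # e.g. "defective"

    def label(self):
        """Human-readable label in the 'process, quality' style used above."""
        return f"{self.base_term}, {self.modifier_term}"


phenotype = CrossProduct("GO", "conjugation", "PATO", "defective")
print(phenotype.label())  # conjugation, defective
```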


The curation process and annotation status

GO Curation Strategy

Manual curation
• Emphasis on primary literature: 1829 publications, 17655 annotations
• Manual inspection of sequence similarity: 9708 annotations

Computational mappings
• Inferred electronically: 4127 annotations

No data for F, P or C: 2542

Total: 34032

Evidence Codes Used

Code   Description                            Oct 07   Dec 08   June 09
IDA    inferred from direct assay               8618     8889      9076
IPI    inferred from physical interaction        776      991      1083
IGI    inferred from genetic interaction         901     1129      1164
TAS    traceable author statement               1089     1091      1106
IC     inferred by curator                      1073     1164      1264
ISS    inferred from sequence similarity        9045     9706      9708
IMP    inferred from mutant phenotype           1912     2328      2455
NAS    non-traceable author statement            522      595       617
IEA    inferred from electronic annotation      6397     4620      4127
ND     no data, root node annotations                              2542
IEP    inferred from expression pattern                             185
RCA    reviewed computational analysis                              702
Total                                          30333    31676     34032

Molecular Function: 9049
Biological Process: 10985
Cellular Component: 13998
Total: 34032

30,616 annotations to 3080 terms 06/06/07



GO annotation progress

Analysing the curated data

GO aspect coverage

Total 5025

All 3 aspects unknown 118

Protein Annotation Status

Status                                   Proteins   Share
experimentally characterised, known          1817   36.7 %
inferred from orthology, known               2133   43.0 %
conserved unknown                             639   12.9 %
sequence orphan                               312    6.3 %
pombe specific family                          56    1.1 %
Total                                        4957

639, of which:
• 98 Bacteria, Fungi, Plant
• 196 Fungi only
• 346 to Metazoa; of these, 235 are 1:1; of these, 131 are nuclear

Over 100 Nature papers?

The conserved “unknown” unknowns

This is the 53 at the top of the list

Splicing?

Kim D-U, Hayles J, Kim D et al (manuscript submitted)

• High level view of GO (genes annotated to granular terms are mapped to higher level terms)

• Allows users to group genes into broader categories to assess their distribution, useful for large scale, genome wide analyses or smaller gene sets

• Different annotation groups have created specific GO slims, which are available at GO’s FTP site (pombe now has an “official GO slim” which gives good coverage of high level processes).

• You can create and use your own GO slim with high level terms of interest

• CARE: not a gene product count, as gene products have multiple annotations (will explain this in the workshop)

“Slimming”

Process Super Slim

The slim term counts add up to 8454, i.e. more than the number of genes. The categories are not mutually exclusive, so it doesn’t make sense to put them in a pie chart and show them as percentages.

It is also important to show which genes are not annotated (root node annotations), and which genes are annotated to other terms but are not in the slim set.
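
The counting caveat can be made concrete with a small Python sketch. Everything here is hypothetical (gene names, terms, the choice of slim terms and the helper `slim_counts`), and each gene's term set is assumed to already include its inherited parent terms.

```python
from collections import Counter

# Hypothetical data: each gene's full set of process terms after
# propagation to parent terms (direct plus inherited annotations).
ROOT = "biological_process"
gene_terms = {
    "gene1": {"transport", "cell cycle", "transcription", ROOT},
    "gene2": {"cell cycle", "transcription", ROOT},
    "gene3": {ROOT},                      # root only: process unknown
    "gene4": {"lipid metabolism", ROOT},  # annotated, but outside the slim
}
slim = {"transport", "cell cycle", "transcription"}


def slim_counts(gene_terms, slim):
    """Count genes per slim term; a gene is counted once per slim term it hits."""
    counts = Counter()
    for terms in gene_terms.values():
        counts.update(terms & slim)
    return counts


counts = slim_counts(gene_terms, slim)
unknown = sorted(g for g, t in gene_terms.items() if t == {ROOT})
outside_slim = sorted(
    g for g, t in gene_terms.items() if t != {ROOT} and not t & slim
)

print(sum(counts.values()), len(gene_terms))  # 5 4
print(unknown, outside_slim)                  # ['gene3'] ['gene4']
```

The slim counts sum to 5 for only 4 genes, so they cannot be read as shares of the genome, and the unknown and outside-slim genes have to be reported separately.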

Term Enrichment

• Finding significantly enriched terms shared among a list of genes

• Discover what these genes may have in common

• Statistical measure of how likely it is that your differentially regulated genes fall into that category by chance
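
The slides do not name the statistic used. A common choice for this kind of question is a one-sided hypergeometric test, sketched below with the Python standard library. The function name `enrichment_p_value` and all the numbers are assumptions for illustration.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 5000 genes, 100 annotated to a term,
# 50 genes in the list of interest, 10 of them annotated to the term.
p = enrichment_p_value(5000, 100, 50, 10)
print(f"{p:.3g}")
```

A small p-value means that many annotated genes would rarely appear in the list by chance. When many terms are tested at once, a multiple-testing correction is normally applied as well.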

This is a comparative enrichment analysis (fission yeast vs. budding yeast)

It shows processes enriched in the essential gene set and in the non-essential gene set.

The enrichment also identified many enriched child terms, but the results were presented as a “slim” of the high level terms; the complete term lists are presented in the supplementary data.

Kim D-U, Hayles J, Kim D et al (manuscript submitted)

Acknowledgements

• Martin Aslett (WT Sanger UK)

• Midori Harris and the GO editorial team (EBI UK)

• Jacky Hayles (CRUK) and the deletion project consortium (Kwang-Lae Hoe)

Data mining, complex

                                 A     B     C     D     E     F     G     H     I     J
A cell division               1018   356   224    31    49     2   271   132     -     -
B transcription>translat.           1367    53    66   172     0   111    47     -     -
C cytoskeletal/morph/vmt                   842   152    32    30    78   160     -     -
D metabolic pathways                             800   196    61    36    52     -     -
E mitochondrial translation                            732    98    47    14     -     -
F membrane transport                                         299     6     2     -     -
G stress                                                           422    65     -     -
H signal transduction                                                    369     -     -
I other                                                                        323     -
J none                                                                               988

What: You can data mine the entire genome to find overlaps and intersections between terms of interest to target genes for further study
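
Overlaps between terms of interest reduce to set intersections. The Python sketch below builds an upper-triangular overlap table from hypothetical gene sets. In this sketch the diagonal holds each category's size, which may differ from the convention used in the slide's table.

```python
from itertools import combinations_with_replacement

# Hypothetical gene sets for three broad categories.
categories = {
    "cell division": {"g1", "g2", "g3"},
    "stress": {"g3", "g4"},
    "signal transduction": {"g2", "g3", "g5"},
}


def overlap_matrix(categories):
    """Upper-triangular overlap counts: diagonal cells hold category sizes,
    off-diagonal cells hold the number of genes shared by two categories."""
    names = list(categories)
    return {
        (a, b): len(categories[a] & categories[b])
        for a, b in combinations_with_replacement(names, 2)
    }


matrix = overlap_matrix(categories)
print(matrix[("cell division", "stress")])  # 1
```

Genes that sit in an unexpected intersection are natural candidates for further study.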

UPDATE

• A gene product can have several functions, cellular locations and be involved in many processes

• Annotation of a gene product to one ontology is independent from its annotation to other ontologies

• Annotations are only to terms reflecting a normal activity or location

• Usage of ‘unknown’ GO terms

Additional points

Modifying the interpretation of an annotation: the Qualifier column

1. NOT
• a gene product is NOT associated with the GO term
• used to document conflicting claims in the literature

2. Contributes to
• distinguishes between individual subunit functions and whole complex functions
• used with GO Function Ontology

3. Colocalizes with
• transiently or peripherally associated with an organelle or complex
• used with GO Component Ontology

Electronic Annotations

Fatty acid biosynthesis (Swiss-Prot Keyword) → GO: fatty acid biosynthesis (GO:0006633)

EC:6.4.1.2 (EC number) → GO: acetyl-CoA carboxylase activity (GO:0003989)

IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry) → GO: acetyl-CoA carboxylase activity (GO:0003989)

Unknown vs. Unannotated

• Direct root node annotations are used when the curator has determined that there is no existing literature to support an annotation.
  – Biological process GO:0000004
  – Molecular function GO:0005554
  – Cellular component GO:0008372

• NOT the same as having no annotation at all
  – No annotation means that no one has looked yet
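
The distinction can be expressed as a three-way classification. The Python sketch below uses the root term IDs listed above. The function name `aspect_status` and the status labels are illustrative assumptions.

```python
# Root term IDs as listed on the slide, keyed by aspect.
ROOT_TERMS = {
    "P": "GO:0000004",  # biological process
    "F": "GO:0005554",  # molecular function
    "C": "GO:0008372",  # cellular component
}


def aspect_status(term_ids, aspect):
    """Classify one aspect of a gene product from its annotated term IDs."""
    if not term_ids:
        return "unannotated"  # no one has looked yet
    if set(term_ids) == {ROOT_TERMS[aspect]}:
        return "unknown"      # a curator looked and found no supporting data
    return "annotated"


print(aspect_status([], "P"))              # unannotated
print(aspect_status(["GO:0000004"], "P"))  # unknown
print(aspect_status(["GO:0006633"], "P"))  # annotated
```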

GO aspect coverage (old)

[Figure: Venn diagram of GO aspect coverage; region labels as extracted: 14672679, 3279 (3455), 191, 54, 18, 993]

Function: 3542 (includes protein binding)

Biological Process: 4019

Cellular Component: 4821

All three aspects unknown: 105 (564 S. cerevisiae)

Total: 5004 (5780 S. cerevisiae)
