The philosophy of biocuration and its use to analyse the fission yeast genome content
Valerie Wood
What is Biocuration?
Two main aspects to fission yeast curation:
1. Literature curation: involves reading the full text of publications and associating novel biological information with the appropriate genes or features
2. Sequence analysis: to infer biological information for unpublished genes
• We need to make annotations as specific (complete depth) and as comprehensive (complete breadth) as possible. We need to group similar annotations consistently so users can:
i) Access required information on a gene-by-gene basis
ii) Analyse their own datasets, e.g. enrichment
iii) Search for candidate genes of interest
iv) Access similar features in other organisms
The Challenges
Data gathering for genes of interest
• traditionally a small number of genes
• requires detailed literature searching
• time-consuming
Gene 1: RNA recognition motif, mRNA export, protein phosphorylation, nuclear, mitotic cell cycle, phosphorylated, ....
Gene 2: SAP domain, mRNA export, nucleolar, RNA elongation (pol II), …
Gene 3: mRNA export, transcription (pol II), …
Gene 4: mRNA export, transcription polyadenylation, …
Gene 5: mRNA export, RNA elongation, …
Gene 6: mRNA export, rRNA transcription, DNA topological change, …
Gene 5000: cell cycle, chromosome segregation, kinetochore assembly, protein localization, …
Not Scalable!
Grouping by “feature”
By establishing links between similar features we can begin to identify trends (enrichments and depletions) in the thousands of genes typically obtained in functional genomics datasets
mRNA export: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5
nucleolar: Gene 10, Gene 15, Gene 18, …
phosphorylated: Gene 1, Gene 7, Gene 10, …
transcription: Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, ..
Cell cycle: Gene 1, Gene 7, Gene 8, …
RNA recognition motif: Gene 1, Gene 7, Gene 8, …
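Moving from the per-gene lists to the per-feature lists is an inversion of a mapping. A minimal Python sketch follows, using a few of the illustrative genes and features from the slides; the function name group_by_feature is an assumption made for this example, not part of any curation tool.

```python
from collections import defaultdict

# Per-gene annotations, copied from the illustrative lists on the slide.
gene_features = {
    "Gene 1": ["RNA recognition motif", "mRNA export", "protein phosphorylation"],
    "Gene 2": ["SAP domain", "mRNA export", "nucleolar"],
    "Gene 3": ["mRNA export", "transcription (pol II)"],
}


def group_by_feature(gene_features):
    """Invert a gene -> features mapping into feature -> sorted list of genes."""
    grouped = defaultdict(set)
    for gene, features in gene_features.items():
        for feature in features:
            grouped[feature].add(gene)
    return {feature: sorted(genes) for feature, genes in grouped.items()}


feature_genes = group_by_feature(gene_features)
print(feature_genes["mRNA export"])
```

Once the data are keyed by feature, asking which genes share a feature is a single lookup, which is what makes the approach scale to thousands of genes.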
The literature corpus
What is the size of the ‘annotation problem’?
Fission yeast OR pombe gives 9264
Adding “cell cycle” gives 2871
Solutions
• More curators
• Community curation
Problems
• Funders don’t want to fund curation
• Can we make the community curate?
Grant
• Additional curators (2) to ensure comprehensive and deep curation of the literature
• Software to support curation activities (including community curation)
• A computational infrastructure to integrate and display the curated data with the HTP data within Ensembl
http://www.sanger.ac.uk/Projects/S_pombe/
Need to make an intuitive web-based user interface where the community can add “consistent” and comprehensive curation
Watch this space!
Ontologies
• Ontologies provide a “controlled vocabulary” for biological knowledge
• Consistent, unambiguous descriptions
• Species independent, interpreted identically both within and between genomes, therefore enabling cross-species comparisons
• Provide a way to capture and represent biological knowledge in a computable form
• Ability to annotate to different levels of granularity depending on what is known or what can be inferred
Ontologies Include:
1. A vocabulary of terms (names for concepts)
2. Definitions
3. Defined logical relationships to each other
The same name can refer to different concepts: bud initiation? Tooth bud initiation, cell bud initiation, plant bud initiation
Conversely, different names are used for the same concept: MVB sorting, multivesicular body sorting, late endosome to vacuole transport. Alternative names are exact synonyms
Disambiguation and Grouping
This principle applies to any type of curation. For example, when describing phenotypes, similar cells can be described as “skittle”, “bottle” or “dumbbell”
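The synonym side of disambiguation can be sketched as a lookup table that maps every alternative name onto one canonical term. The three names below come from the slide; which of them is treated as the canonical term name is an assumption made for this example. The ambiguous label "bud initiation" is deliberately left out of the vocabulary, so it is rejected rather than guessed at.

```python
# Canonical term -> exact synonyms. Names are from the slide; the choice of
# canonical name is an assumption for illustration.
SYNONYMS = {
    "late endosome to vacuole transport": [
        "MVB sorting",
        "multivesicular body sorting",
    ],
}


def build_lookup(synonyms):
    """Map every name (canonical or synonym, lower-cased) to its canonical term."""
    lookup = {}
    for canonical, alternatives in synonyms.items():
        for name in [canonical, *alternatives]:
            lookup[name.lower()] = canonical
    return lookup


LOOKUP = build_lookup(SYNONYMS)


def normalise(label):
    """Return the canonical term for a label, or None if it is not in the vocabulary."""
    return LOOKUP.get(label.strip().lower())


print(normalise("MVB sorting"))
print(normalise("bud initiation"))
```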
GO is 3 ontologies:
F: molecular function (activity, GTPase, transporter, receptor)
P: biological process (cell division, transcription, gluconeogenesis)
C: cellular component (location or complex)
Demonstrating ontology principles with GO
DAG Structure
DAG (Directed Acyclic Graph): many-to-many parental relationship; each child may have one or more parents
Hierarchy: one-to-many parental relationship; each child has only one parent
Relationships between terms
Example terms: cell; membrane; chloroplast; mitochondrial membrane; chloroplast membrane
Relationship types: is-a, part-of
Inheritance
An important feature of GO is that broader parents give rise to more specific children.
When a gene is directly annotated to a term (e.g. DNA replication), it is automatically indirectly annotated to all of its parent terms.
Allows curators to assign terms at different levels of granularity, depending on what is known or can be inferred
gene A
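The inheritance rule can be sketched as a walk up the DAG from each directly annotated term. The term names below are taken from the relationships figure; the parent links between them are an assumption made for illustration, and is-a and part-of links are treated alike here.

```python
# Toy DAG: child -> parents. Term names come from the slide's figure; the
# edges are an assumption made for illustration.
PARENTS = {
    "chloroplast membrane": ["membrane", "chloroplast"],
    "mitochondrial membrane": ["membrane"],
    "membrane": ["cell"],
    "chloroplast": ["cell"],
    "cell": [],
}


def ancestors(term, parents):
    """All terms reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(direct, parents):
    """Expand direct annotations (gene -> terms) to include every ancestor term."""
    return {
        gene: set(terms).union(*(ancestors(term, parents) for term in terms))
        for gene, terms in direct.items()
    }


direct = {"gene A": {"chloroplast membrane"}}
full = propagate(direct, PARENTS)
print(sorted(full["gene A"]))
```

Because a child can have more than one parent, the walk keeps a set of visited terms so that a shared ancestor such as "cell" is only counted once.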
Ontologies.....
• Provide a standard for annotation
• Have 2 components: the ontology and the annotations
• Allow experimental work to be evaluated in the context of other experimental data which may be annotated at different levels of granularity
• Allow biologists to search and analyse data (particularly for identifying groups of overrepresented genes in large scale experiments)
• Become increasingly powerful as the ontologies and annotations are refined
Other annotation types
• products (special case, unique descriptors)
• annotation status
• species distribution
• orthology
• phenotype data, will use PATO
• protein modifications, will use MOD
• metabolites, will use ChEBI (Chemical Entities of Biological Interest)
• sequence features, will use SO
• protein-protein interactions, will use MI and BioGRID
Increasingly, features will be described using “cross products” derived from multiple ontologies:
e.g. “response to a specific drug” will be made with the GO biological process term “response to drug” and a drug from ChEBI
e.g. phenotypes are typically annotated using a PATO “quality” term combined with a wild-type GO process (e.g. conjugation, defective; crossover formation, abolished)
The curation process and annotation status
GO Curation Strategy
Manual Curation
• Emphasis on primary literature: 1829 publications, 17655 annotations
• Manual inspection of sequence similarity: 9708 annotations
Computational Mappings
• Inferred electronically: 4127 annotations
No data for F, P or C: 2542
Total: 34032
Evidence Codes Used
Code  Description                          Oct 07  Dec 08  June 09
IDA   inferred from direct assay             8618    8889     9076
IPI   inferred from physical interaction      776     991     1083
IGI   inferred from genetic interaction       901    1129     1164
TAS   traceable author statement             1089    1091     1106
IC    inferred by curator                    1073    1164     1264
ISS   inferred from sequence similarity      9045    9706     9708
IMP   inferred from mutant phenotype         1912    2328     2455
NAS   non-traceable author statement          522     595      617
IEA   inferred from electronic annotation    6397    4620     4127
Total                                       30333   31676    34032
Additional codes (single value given): ND no data, root node annotations 2542; IEP 185; RCA 702
Molecular Function: 9049
Biological Process: 10985
Cellular Component: 13998
Total 34032
30,616 annotations to 3080 terms 06/06/07
31,676 annotations to 3263 terms 13/12/08
34,035 annotations to 3361 terms 16/06/09
GO annotation progress
Analysing the curated data
GO aspect coverage
Total 5025
All 3 aspects unknown 118
Protein Annotation Status
Chart categories: experimentally characterised, known; inferred from orthology, known; conserved unknown; sequence orphan; pombe specific family
Chart values, largest first: 2133 (43.0 %), 1817 (36.7 %), 639 (12.9 %), 312 (6.3 %), 56 (1.1 %)
Total 4957
639
98 Bacteria, Fungi, Plant
196 Fungi only
346 to Metazoa; of these, 235 are 1:1; of these, 131 are nuclear
Over 100 Nature papers?
The conserved “unknown” unknowns
This is the 53 at the top of the list
Splicing?
Kim D-U, Hayles J, Kim D et al (manuscript submitted)
• High level view of GO (genes annotated to granular terms are mapped to higher level terms)
• Allows users to group genes into broader categories to assess their distribution, useful for large scale, genome wide analyses or smaller gene sets
• Different annotation groups have created specific GO slims, which are available at GO’s FTP site (pombe now has an “official GO slim” which gives good coverage of high level processes).
• You can create and use your own GO slim with high level terms of interest
• CARE: not a gene product count, as gene products have multiple annotations (will explain this in the workshop)
“Slimming”
Process Super Slim
The slim term counts add up to 8454, i.e. more than the number of genes. The categories are not mutually exclusive, so it doesn’t make sense to put them in a pie chart and show them as percentages.
It is also important to show which genes are not annotated (root node annotations) and which genes are not in the slim set but are annotated to other terms.
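A small sketch of slimming shows why the counts exceed the number of genes and how the two leftover groups can be reported. The slim terms, gene names and annotations are hypothetical, and the annotations are assumed to be already propagated to parent terms.

```python
ROOT = "biological_process"
SLIM = ["cell cycle", "transcription", "transport"]  # hypothetical slim terms

# Hypothetical annotations, already propagated up to the root term.
gene_terms = {
    "gene1": {"cell cycle", "transcription", ROOT},
    "gene2": {"transport", ROOT},
    "gene3": {ROOT},                         # root only: process unknown
    "gene4": {"signal transduction", ROOT},  # annotated, but outside the slim
}


def slim_report(gene_terms, slim, root):
    """Count genes per slim term and list genes the slim does not cover."""
    counts = {term: 0 for term in slim}
    unknown, outside_slim = [], []
    for gene, terms in gene_terms.items():
        hits = [term for term in slim if term in terms]
        for term in hits:
            counts[term] += 1
        if not hits:
            if terms <= {root}:
                unknown.append(gene)
            else:
                outside_slim.append(gene)
    return counts, unknown, outside_slim


counts, unknown, outside_slim = slim_report(gene_terms, SLIM, ROOT)
print(counts, unknown, outside_slim)
```

Here gene1 is counted under two slim terms, so the counts sum to 3 although only 2 genes are covered, which is the same effect as the total on the slide.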
Term Enrichment
• Finding significantly enriched terms shared among a list of genes
• Discover what these genes may have in common
• Statistical measure of how likely it is that your differentially regulated genes fall into that category by chance
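The slides do not say which statistic is used. One common choice for this kind of "by chance" question is a one-sided hypergeometric test, sketched below with the standard library only. The function and parameter names are assumptions made for this example, and no multiple-testing correction is applied.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement from
    `population` genes, of which `annotated` carry the term of interest."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 10 genes in the genome, 5 annotated to the term,
# 5 genes in the list of interest, all 5 of them annotated to the term.
p = enrichment_p_value(population=10, annotated=5, sample=5, hits=5)
print(p)
```

A small p-value means the overlap is unlikely to arise from a random gene list of the same size. In practice one value is computed per term, so the results need correcting for the number of terms tested.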
This is a comparative enrichment analysis (fission yeast vs. budding yeast)
It shows processes enriched in the essential gene set and in the non-essential gene set.
The enrichment also identified many child terms which were enriched, but the results were presented as a “slim” of the high level terms, and the complete term lists are presented in supplementary data
Kim D-U, Hayles J, Kim D et al (manuscript submitted)
Acknowledgements
• Martin Aslett (WT Sanger UK)
• Midori Harris and the GO editorial team (EBI UK)
• Jacky Hayles (CRUK) and the deletion project consortium (Kwang Lae-Hoe)
Data mining, complex
                                A     B     C     D     E     F     G     H     I     J
A cell division              1018   356   224    31    49     2   271   132     -     -
B transcription>translat.          1367    53    66   172     0   111    47     -     -
C cytoskeletal/morph/vmt                  842   152    32    30    78   160     -     -
D metabolic pathways                            800   196    61    36    52     -     -
E mitochondrial translation                           732    98    47    14     -     -
F membrane transport                                        299     6     2     -     -
G stress                                                          422    65     -     -
H signal transduction                                                   369     -     -
I other                                                                       323     -
J none                                                                              988
What: You can data mine the entire genome to find overlaps and intersections between terms of interest to target genes for further study
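An overlap table of this shape can be sketched as pairwise set intersections. The gene sets below are hypothetical. Reading the diagonal as the size of each category is an assumption about how the slide's table was built.

```python
from itertools import combinations_with_replacement


def overlap_matrix(category_genes):
    """Upper-triangular counts of genes shared by each pair of categories.
    The (a, a) entries are simply the size of category a."""
    names = list(category_genes)
    return {
        (a, b): len(category_genes[a] & category_genes[b])
        for a, b in combinations_with_replacement(names, 2)
    }


# Hypothetical category -> gene set data.
categories = {
    "cell division": {"g1", "g2", "g3"},
    "stress": {"g2", "g4"},
    "membrane transport": {"g5"},
}
matrix = overlap_matrix(categories)
print(matrix[("cell division", "stress")])
```

Replacing len(...) with the intersection itself gives the gene lists behind each cell, which are the candidates to target for further study.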
UPDATE
• A gene product can have several functions, cellular locations and be involved in many processes
• Annotation of a gene product to one ontology is independent from its annotation to other ontologies
• Annotations are only to terms reflecting a normal activity or location
• Usage of ‘unknown’ GO terms
Additional points
1. NOT
• a gene product is NOT associated with the GO term
• used to document conflicting claims in the literature
2. Contributes to
• distinguishes between individual subunit functions and whole complex functions
• used with GO Function Ontology
3. Colocalizes with
• transiently or peripherally associated with an organelle or complex
• used with GO Component Ontology
Modifying the interpretation of an annotation: the Qualifier column
Fatty acid biosynthesis (Swiss-Prot Keyword) → GO:Fatty acid biosynthesis (GO:0006633)
EC:6.4.1.2 (EC number) → GO:acetyl-CoA carboxylase activity (GO:0003989)
IPR000438: Acetyl-CoA carboxylase carboxyl transferase beta subunit (InterPro entry) → GO:acetyl-CoA carboxylase activity (GO:0003989)
Electronic Annotations
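Electronic annotation of this kind amounts to a table lookup from external identifiers to GO terms. The sketch below uses the three mappings shown on the slide. The key prefixes, the gene name and the unmapped identifier are hypothetical, and the real mapping files are far larger.

```python
# External identifier -> GO id, using the three mappings shown on the slide.
# The key prefixes ("SP_KW:", "InterPro:") are a hypothetical naming scheme.
XREF_TO_GO = {
    "SP_KW:Fatty acid biosynthesis": "GO:0006633",
    "EC:6.4.1.2": "GO:0003989",
    "InterPro:IPR000438": "GO:0003989",
}


def electronic_annotations(gene_xrefs, xref_to_go):
    """Derive IEA annotations as (gene, GO id, evidence code, source xref) rows."""
    rows = []
    for gene, xrefs in gene_xrefs.items():
        for xref in xrefs:
            go_id = xref_to_go.get(xref)
            if go_id is not None:
                rows.append((gene, go_id, "IEA", xref))
    return rows


# Hypothetical gene carrying two mapped identifiers and one unmapped one.
gene_xrefs = {"geneX": ["EC:6.4.1.2", "InterPro:IPR000438", "UNMAPPED:example"]}
rows = electronic_annotations(gene_xrefs, XREF_TO_GO)
for row in rows:
    print(row)
```

Keeping the source identifier in each row records why the annotation was made, so the same GO term reached through two routes stays traceable.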
Unknown vs. Unannotated
• Direct root node annotations are used when the curator has determined that there is no existing literature to support an annotation.
– Biological process GO:0000004
– Molecular function GO:0005554
– Cellular component GO:0008372
• NOT the same as having no annotation at all
– No annotation means that no one has looked yet
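The distinction can be made explicit when querying annotations. The sketch below uses the three root identifiers listed on the slide; the data layout (aspect letter mapped to a set of directly annotated GO ids) and the status labels are assumptions made for this example.

```python
# Root ("unknown") term for each aspect, as listed on the slide.
ROOT_TERMS = {
    "P": "GO:0000004",  # biological process
    "F": "GO:0005554",  # molecular function
    "C": "GO:0008372",  # cellular component
}


def aspect_status(annotations, aspect):
    """Classify one aspect of one gene.

    `annotations` maps an aspect letter to the set of directly annotated GO ids.
    """
    terms = annotations.get(aspect, set())
    if not terms:
        return "unannotated"  # no one has looked yet
    if terms == {ROOT_TERMS[aspect]}:
        return "unknown"      # a curator looked and found nothing to support a term
    return "annotated"


# Hypothetical gene: process curated as unknown, function known, component not yet curated.
gene = {"P": {"GO:0000004"}, "F": {"GO:0003989"}}
print([aspect_status(gene, aspect) for aspect in "FPC"])
```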
GO aspect coverage (old)
All three aspects unknown: 105 (564 S. cerevisiae)
Function: 3542 (includes protein binding)
Biological Process: 4019
Cellular Component: 4821
Total: 5004 (5780 S. cerevisiae)
[Venn diagram; unlabelled region values: 14672679, 3279 (3455), 191, 54, 18, 993]