the gene ontology categorizer

The Gene Ontology Categorizer. C.A. Joslyn 1, S.M. Mniszewski 1, A. Fulmer 2 and G. Heaton 3. 1 Computer and Computational Sciences, Los Alamos National Laboratory, 2 Corporate Biotechnology, Miami Valley Labs and 3 Corporate Functions-IT, Procter & Gamble, USA (Bioinformatics, Vol. 20, Suppl. 1, 2004, p. i169-i177)


Page 1: The Gene Ontology Categorizer

The Gene Ontology Categorizer

C.A. Joslyn1, S.M. Mniszewski1, A. Fulmer2 and G. Heaton3

1Computer and Computational Sciences, Los Alamos National Laboratory, 2Corporate Biotechnology, Miami Valley Labs and 3Corporate Functions-IT, Procter & Gamble, USA (Bioinformatics, Vol. 20, Suppl. 1, 2004, p. i169-i177)

Page 2: The Gene Ontology Categorizer

2/25

Abstract (1/2)

Given a list of genes of interest, what are the best nodes of the GO to summarize or categorize that list?

The question arises from a drug discovery process, in which we wish to understand the overall effect of some cell treatment or condition by identifying ‘where’ in the GO the differentially expressed genes fall.

Page 3: The Gene Ontology Categorizer

3/25

Abstract (2/2)

View bio-ontologies more as combinatorially structured databases than as facilities for logical inference, and draw on the discrete mathematics of finite partially ordered sets (posets) to develop data representations and algorithms appropriate for the GO.

Issues: the categorization task, distances in ontologies, and ontology merger and exchange.

Page 4: The Gene Ontology Categorizer

4/25

1. Introduction (1/3)

After a gene expression experiment involving high-throughput microarrays, a biomedical researcher needs to extract useful information on the types of biological processes affected in the experiment.

The categorization task arises when the researcher wants to take the names of some genes and understand their overall function by examining their distribution through the GO: are they localized, grouped in distinct areas, or spread uniformly?

Page 5: The Gene Ontology Categorizer

5/25

1. Introduction (2/3)

The Gene Ontology Categorizer (GOC) applies novel research in the discrete mathematics of posets for semantic hierarchies to GO analysis.

Represent the GO as a poset ontology, then use pseudo-distances between comparable nodes to develop scoring functions. Finally, cluster the resulting rank-ordered list to produce a ranked list of appropriate summarizing nodes within the GO, which act as functional hypotheses about the characteristics of the genes expressed.

Page 6: The Gene Ontology Categorizer

6/25

1. Introduction (3/3)

GO analysis weaknesses

Many researchers consider the GO simply as a list of categories, ignoring any structural relationships among the categories.

Even the researchers whose treatment is closest in spirit to the authors' consider the GO primarily as a tree, or else cast it as a graph in order to determine distances between nodes.

Page 7: The Gene Ontology Categorizer

7/25

2. Methodology (1/2)

A finite partially ordered set (poset) is a mathematical structure $\mathcal{P} = \langle P, \leq \rangle$, where $P$ is a finite set and $\leq \subseteq P^2$ is a reflexive, anti-symmetric, transitive binary relation on $P$.

Every poset can be drawn as a digraph with no cycles. Posets are more general than trees or lattices in that collections of nodes can have multiple parents.

The GO is a pair of directed acyclic graphs (DAGs), one for the is-a links and one for the has-part links.

Page 8: The Gene Ontology Categorizer

8/25

Page 9: The Gene Ontology Categorizer

9/25

2. Methodology (2/2)

$P_{GO}$ is the set of nodes such as ‘DNA unwinding’ and ‘DNA replication’. The ordering $\leq$ in ‘DNA repair $\leq$ DNA metabolism’ represents that DNA repair is a kind of DNA metabolism.

GO, cast as a pair of posets $\mathcal{P}_{is} = \langle P_{GO}, \leq_{is} \rangle$ and $\mathcal{P}_{has} = \langle P_{GO}, \leq_{has} \rangle$ for the two kinds of relations, is a large, taxonomically organized semantic hierarchy.

This paper treats the two kinds of links as equivalent: $\mathcal{P}_{GO} = \langle P_{GO}, \leq_{GO} \rangle$, where $\leq_{GO} = \leq_{is} \cup \leq_{has}$.
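As a rough illustration of merging the two link types into one order, the sketch below takes the union of some is-a and has-part links, closes it reflexively and transitively, and checks the three poset axioms. The specific links, the (child, parent) pair convention and the helper names are assumptions made for the example; they are not read from the GO files.

```python
from itertools import product

def reflexive_transitive_closure(nodes, links):
    """Return all pairs (a, b) with a <= b implied by the direct links."""
    leq = {(n, n) for n in nodes} | set(links)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (a, b), (c, d) in product(list(leq), repeat=2):
            if b == c and (a, d) not in leq:
                leq.add((a, d))
                changed = True
    return leq

def is_partial_order(nodes, leq):
    """Check reflexivity, anti-symmetry and transitivity of a relation."""
    reflexive = all((n, n) in leq for n in nodes)
    antisymmetric = all(a == b for (a, b) in leq if (b, a) in leq)
    transitive = all((a, d) in leq
                     for (a, b) in leq for (c, d) in leq if b == c)
    return reflexive and antisymmetric and transitive

# Hypothetical (child, parent) links, chosen only for illustration.
nodes = {"DNA metabolism", "DNA repair", "DNA replication", "DNA unwinding"}
is_a = {("DNA repair", "DNA metabolism"),
        ("DNA replication", "DNA metabolism")}
has_part = {("DNA unwinding", "DNA replication")}

# Treat both link types as one order: <=_GO is the union of the two.
leq_go = reflexive_transitive_closure(nodes, is_a | has_part)
print(is_partial_order(nodes, leq_go))
```

The quadratic closure loop is only meant for toy inputs; an ontology the size of the GO would call for a graph traversal instead.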

Page 10: The Gene Ontology Categorizer

10/25

2.1 Poset theory (1/3)

Two nodes $p_1, p_2 \in P$ are comparable, denoted $p_1 \sim p_2$, if either $p_1 \leq p_2$ or $p_2 \leq p_1$. A chain $C \subseteq P$ is a collection of pairwise comparable nodes. Height $H(\mathcal{P})$ is the size of the largest chain.

Two nodes $p_1, p_2 \in P$ are non-comparable if $p_1 \not\sim p_2$. An antichain is a collection of pairwise non-comparable nodes. Width $W(\mathcal{P})$ is the size of the largest antichain.
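The definitions of comparability, height and width can be tried out on a small poset. The sketch below is a minimal illustration under stated assumptions: the five-node poset and its (child, parent) links are hypothetical, and the width search is brute force, so it is only suitable for very small examples.

```python
from itertools import combinations

# Hypothetical poset given by direct (child, parent) links.
links = {("a", "c"), ("b", "c"), ("c", "top"), ("d", "top")}
nodes = {n for link in links for n in link}

def below(p, q):
    """True if p <= q, i.e. q is reachable from p by following links upward."""
    return p == q or any(c == p and below(par, q) for c, par in links)

def comparable(p, q):
    return below(p, q) or below(q, p)

def height():
    """Number of nodes in the largest chain."""
    def longest_up(p):
        return 1 + max((longest_up(par) for c, par in links if c == p),
                       default=0)
    return max(longest_up(p) for p in nodes)

def width():
    """Number of nodes in the largest antichain (brute force)."""
    for k in range(len(nodes), 0, -1):
        for subset in combinations(sorted(nodes), k):
            if all(not comparable(p, q)
                   for p, q in combinations(subset, 2)):
                return k
    return 0

print(comparable("a", "top"), comparable("a", "b"), height(), width())
```

Here the largest chain is a, c, top and the largest antichain is a, b, d.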

Page 11: The Gene Ontology Categorizer

11/25

2.1 Poset theory (2/3)

Given two comparable nodes $p_1 \leq p_2$, the set of all nodes ‘between’ them is the interval $[p_1, p_2] = \{p : p_1 \leq p \leq p_2\}$, which is equivalent to the set of all chains between $p_1$ and $p_2$, denoted $\mathcal{C}(p_1, p_2)$.

The vector of chain lengths $\vec{h}(p_1, p_2) = \langle |C| \rangle_{C \in \mathcal{C}(p_1, p_2)}$ is the collection of the lengths of all these chains.

The minimum and maximum chain lengths between $p_1$ and $p_2$ are $h_*(p_1, p_2) = \min_{C \in \mathcal{C}(p_1, p_2)} |C|$ and $h^*(p_1, p_2) = \max_{C \in \mathcal{C}(p_1, p_2)} |C|$, respectively.

Page 12: The Gene Ontology Categorizer

12/25

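2.1 Poset theory (3/3)

$P = \{1, A, B, \ldots, K\}$. B and J are noncomparable, while A and B are comparable, with $A \leq B$.

$[A, B] = \{A, F, G, H, I, B\}$ consists of the three chains $\mathcal{C}(A, B) = \{A \leq F \leq B,\ A \leq G \leq B,\ A \leq H \leq I \leq B\}$.

$\vec{h}(A, B) = \langle 2, 2, 3 \rangle$ with $h_*(A, B) = 2$, $h^*(A, B) = 3$.

$H(\mathcal{P}) = 5$ (a maximal chain is $D \leq E \leq I \leq C \leq 1$) and $W(\mathcal{P}) = 5$ (the largest antichain is $\{F, G, H, E, J\}$).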
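The interval example can be reproduced with a short path enumeration. This sketch uses only the upward links implied by the three chains listed for [A, B]; the rest of the example poset is not reconstructed here. It assumes that chain length counts links rather than nodes, which is the convention that matches the vector ⟨2, 2, 3⟩. The names `up` and `chains` are illustrative.

```python
# Upward links reconstructed from the three chains between A and B.
up = {
    "A": ["F", "G", "H"],
    "F": ["B"],
    "G": ["B"],
    "H": ["I"],
    "I": ["B"],
    "B": [],
}

def chains(p1, p2):
    """All chains from p1 up to p2, each as a list of nodes."""
    if p1 == p2:
        return [[p2]]
    return [[p1] + rest
            for parent in up.get(p1, [])
            for rest in chains(parent, p2)]

all_chains = chains("A", "B")
lengths = sorted(len(c) - 1 for c in all_chains)  # length = number of links
interval = set().union(*all_chains)               # nodes between A and B
h_min, h_max = min(lengths), max(lengths)

print(lengths, h_min, h_max, sorted(interval))
```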

Page 13: The Gene Ontology Categorizer

13/25

Poset statistics of the GO

Page 14: The Gene Ontology Categorizer

14/25

2.2 Methods (1/4)

Define a POSet Ontology (POSO) as $\mathcal{O} = \langle \mathcal{P}, X, F \rangle$, where $X$ is a finite, non-empty set of labels, and $F: X \to 2^P$ is an annotation function mapping each label $x \in X$ to a collection of nodes $F(x) \subseteq P$.

E.g. $X = \{a, b, \ldots, j\}$, $F(b) = \{A, E, F\}$.

In GOC, $\mathcal{O}_{GO} = \langle \mathcal{P}_{GO}, X_{GO}, F_{GO} \rangle$, where the gene products $X_{GO}$ and annotations $F_{GO}$ are provided by the GO file.
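One way to hold a POSO in memory is a small container with the nodes, the direct links, and a dictionary for the annotation function. The sketch below is an assumption about representation, not the GOC implementation: the class name `Poso`, the method names, and the annotation for label `c` are hypothetical, while `F(b) = {A, E, F}` comes from the example above.

```python
from dataclasses import dataclass, field

@dataclass
class Poso:
    nodes: set
    links: set                                       # direct (child, parent) pairs
    annotations: dict = field(default_factory=dict)  # label -> set of nodes

    def F(self, label):
        """Annotation function: nodes that the label is mapped to."""
        return self.annotations.get(label, set())

    def annotated_nodes(self, labels):
        """Union of F(x) over a collection of requested labels."""
        found = set()
        for x in labels:
            found |= self.F(x)
        return found

example = Poso(
    nodes={"1"} | set("ABCDEFGHIJK"),
    # Only the links implied by the chains between A and B are listed.
    links={("A", "F"), ("F", "B"), ("A", "G"), ("G", "B"),
           ("A", "H"), ("H", "I"), ("I", "B")},
    annotations={
        "b": {"A", "E", "F"},   # from the slide example
        "c": {"B"},             # hypothetical annotation
    },
)

print(sorted(example.annotated_nodes(["b", "c"])))
```

A label with no annotations simply maps to the empty set, which keeps queries over arbitrary gene lists from failing.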

Page 15: The Gene Ontology Categorizer

15/25

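2.2 Methods (2/4)

A pseudo-distance function $\delta: P^2 \to \mathbb{R}$

• The minimum path length: $\delta_m = h_*$

• The maximum path length: $\delta_x = h^*$

• The average of extreme path lengths: $(h_* + h^*)/2$

• The average of all path lengths

$h_*(p_1, p_2) \leq \delta(p_1, p_2) \leq h^*(p_1, p_2)$. A normalized distance is defined as $\bar{\delta} = \delta / H(\mathcal{P})$.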
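The four pseudo-distances can be computed directly from a vector of chain lengths. The sketch below applies them to the vector ⟨2, 2, 3⟩ and the height 5 from the earlier example; the function name and dictionary keys are illustrative, and the ‘average of all path lengths’ is taken to be the arithmetic mean, which is an assumption.

```python
from statistics import mean

def pseudo_distances(chain_lengths, poset_height):
    """Return the four pseudo-distances and their height-normalized versions."""
    h_min, h_max = min(chain_lengths), max(chain_lengths)
    distances = {
        "min": h_min,                        # minimum path length
        "max": h_max,                        # maximum path length
        "avg_extremes": (h_min + h_max) / 2, # average of extreme path lengths
        "avg_all": mean(chain_lengths),      # average of all path lengths
    }
    normalized = {k: v / poset_height for k, v in distances.items()}
    return distances, normalized

d, nd = pseudo_distances([2, 2, 3], 5)
for name in d:
    print(name, d[name], round(nd[name], 3))
```

Every value lies between the minimum and maximum chain lengths, as the bound above requires.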

Page 16: The Gene Ontology Categorizer

16/25

2.2 Methods (3/4)

A scoring function $S_Y(p)$ returns the weighted rank of a node $p \in P$ based on the requested nodes $Y$.

Two kinds of scores:

• An unnormalized score $S_Y: P \to \mathbb{R}^+$, which returns an ‘absolute’ number

• A normalized score, which returns a ‘relative’ number

Page 17: The Gene Ontology Categorizer

17/25

2.2 Methods (4/4)

$s \in \{\ldots, -1, 0, 1, 2, 3, \ldots\}$, where low $s$ emphasizes coverage, and high $s$ emphasizes specificity. Let $r = 2^s$; then we have four scoring functions:

• Unnormalized distance and unnormalized score

• Unnormalized distance and normalized score

• Normalized distance and unnormalized score

• Normalized distance and normalized score

Page 18: The Gene Ontology Categorizer

18/25

Cluster heads are marked with +, and secondaries with -.

Page 19: The Gene Ontology Categorizer

19/25

3. Expert validation (1/2)

An experienced molecular immunologist constructed two nonoverlapping lists of genes: KT1, a list of 242 genes involved in immune processes; and KT4, a list of 147 genes involved in cell–cell/cell–matrix interactions.

KT1, KT4 and KT1∪KT4 provided three queries for GOC into the BP branch of the GO, using $\delta_m$, $s = 7$ and one of the four scoring functions.

Page 20: The Gene Ontology Categorizer

20/25

3. Expert validation (2/2)

Two assessed values:

• Utility (1=low to 5=high): Did the cluster terms provide a useful description of a specific biological process?

• Expectation (1=high to 5=low): Was the identified biological process expected for the genes in the query?

Page 21: The Gene Ontology Categorizer

21/25

Page 22: The Gene Ontology Categorizer

22/25

4. Formal validation (1/3)

An independent source of annotations of collections of GO nodes: the InterPro project, which catalogs assignments of protein families, domains and functional sites to GO IDs.

E.g. ‘phosphofructokinase’ is InterPro ID IPR000023, and is annotated to GO:0006096=‘glycolysis’, GO:0003872=‘6-phosphofructokinase activity’, and GO:0005945=‘6-phosphofructokinase complex’. It also maps to 175 proteins. Thus the validation task is to make these 175 proteins a GOC query, and see how well the cluster heads match against the set of GO IDs {GO:0006096, GO:0003872, GO:0005945}.

Page 23: The Gene Ontology Categorizer

23/25

4. Formal validation (2/3)

In the run, there were 4,866 InterPro IDs with GO annotations, with 11,370 mappings to GO nodes and 787,760 mappings to proteins in total. Of these proteins, the authors were able to locate 778,494 (about 99%) with GO annotations.

Page 24: The Gene Ontology Categorizer

24/25

4. Formal validation (3/3)

• Immediate family: child/parent/sibling.

• Extended family: grandparent/grandchild/cousin/aunt/uncle/niece/nephew
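The two family categories can be sketched as a check on how many links separate two nodes from a shared ancestor. The code below is only an illustration under stated assumptions: the toy DAG and node names are hypothetical, and the exact definitions (sibling as shared parent, cousin as shared grandparent, aunt/uncle as a parent's sibling) are a plausible reading of the category names rather than the paper's formal definitions.

```python
# Hypothetical DAG: each node maps to its direct parents.
parents = {
    "p1": ["root"], "p2": ["root"],
    "c1": ["p1"], "c2": ["p1"], "c3": ["p2"],
    "x": ["y"],
}

def up(node, k):
    """Nodes exactly k links above the given node."""
    level = {node}
    for _ in range(k):
        level = {p for n in level for p in parents.get(n, ())}
    return level

def relation(a, b):
    """Classify b relative to a as 'same', 'immediate', 'extended' or 'none'."""
    if a == b:
        return "same"
    # Immediate family: parent, child, or sibling (shared parent).
    if b in up(a, 1) or a in up(b, 1) or up(a, 1) & up(b, 1):
        return "immediate"
    # Extended family: grandparent/grandchild, aunt/uncle/niece/nephew, cousin.
    if (b in up(a, 2) or a in up(b, 2)
            or up(a, 2) & up(b, 1) or up(a, 1) & up(b, 2)
            or up(a, 2) & up(b, 2)):
        return "extended"
    return "none"

for pair in [("c1", "c2"), ("c1", "root"), ("c1", "p2"), ("c1", "c3"), ("c1", "x")]:
    print(pair, relation(*pair))
```

In a validation run, a cluster head returned by GOC would play the role of `a` and an InterPro-annotated GO node the role of `b`.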

Page 25: The Gene Ontology Categorizer

25/25

5. Conclusions

The GOC methodology provides a valid and useful approach to categorization in the GO.

Future work

• Methodological development in combinatorial approaches to data analysis, including distances between noncomparable nodes, interval-valued measures of ‘level’ in posets, algorithms for poset width calculation and poset matching.

• Expansion to other ontologies.

• Continuation of work in textual approaches, mapping back and forth from semantic relations among GO nodes to those among its lexical components.