Post on 03-Jan-2016

The Gene Ontology Categorizer

C.A. Joslyn (1), S.M. Mniszewski (1), A. Fulmer (2) and G. Heaton (3)

(1) Computer and Computational Sciences, Los Alamos National Laboratory, (2) Corporate Biotechnology, Miami Valley Labs and (3) Corporate Functions-IT, Procter & Gamble, USA

(Bioinformatics, Vol. 20, Suppl. 1, 2004, pp. i169-i177)

2/25

Abstract (1/2)

Given a list of genes of interest, what are the best nodes of the GO to summarize or categorize that list?

In the drug discovery process, we wish to understand the overall effect of some cell treatment or condition by identifying ‘where’ in the GO the differentially expressed genes fall.

3/25

Abstract (2/2)

View bio-ontologies more as combinatorially structured databases than as facilities for logical inference, and draw on the discrete mathematics of finite partially ordered sets (posets) to develop data representations and algorithms appropriate for the GO.

Issues: the categorization task, distances in ontologies, and ontology merger and exchange.

4/25

1. Introduction (1/3)

When a gene expression experiment involves high-throughput microarrays, a biomedical researcher needs to extract useful information on the types of biological processes affected in the experiment.

The categorization task arises when the researcher wants to take the names of some genes and understand their overall function by examining their distribution through the GO: are they localized, grouped in distinct areas or spread uniformly?

5/25

1. Introduction (2/3)

The Gene Ontology Categorizer (GOC) applies novel research in the discrete mathematics of posets for semantic hierarchies to GO analysis.

Represent the GO as a poset ontology, then use pseudo-distances between comparable nodes to develop scoring functions. Finally, cluster the resulting rank-ordered list of nodes to produce a ranked list of appropriate summarizing nodes within the GO, which act as functional hypotheses about the characteristics of the genes expressed.

6/25

1. Introduction (3/3)

GO analysis weaknesses

Many researchers consider the GO simply as a list of categories, ignoring any structural relationships among the categories.

Even those researchers whose treatment is closest in spirit to the authors' either consider the GO primarily as a tree or cast it as a graph for determining distances between nodes.

7/25

2. Methodology (1/2)

A finite partially ordered set (poset) is a mathematical structure $\mathcal{P} = \langle P, \le \rangle$, where $P$ is a finite set and $\le \subseteq P^2$ is a reflexive, antisymmetric, transitive binary relation on $P$.

Every poset can be drawn as a digraph with no cycles. Posets are more general than trees or lattices in that collections of nodes can have multiple parents.

The GO is a pair of directed acyclic graphs (DAGs), one for the is-a links and one for the has-part links.
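The step from a DAG of links to the order relation can be sketched in a few lines of Python. This is a minimal illustration, not the GOC implementation: the function names are hypothetical, and the three toy links reuse GO term names from the slides without claiming to reproduce the real GO structure.

```python
from itertools import product

def reflexive_transitive_closure(nodes, edges):
    """Return the set of pairs (a, b) meaning a <= b, given child -> parent edges."""
    leq = {(n, n) for n in nodes}   # reflexivity: every node is <= itself
    leq |= set(edges)               # the direct links of the DAG
    changed = True
    while changed:                  # naive fixpoint; adequate for small examples
        changed = False
        for (a, b), (c, d) in product(list(leq), repeat=2):
            if b == c and (a, d) not in leq:
                leq.add((a, d))     # transitivity: a <= b and b <= d give a <= d
                changed = True
    return leq

def is_antisymmetric(leq):
    """True if a <= b and b <= a only ever hold when a == b."""
    return all(a == b or (b, a) not in leq for (a, b) in leq)

# Toy child -> parent links (an assumption for illustration only).
nodes = {"DNA metabolism", "DNA repair", "DNA replication", "DNA unwinding"}
edges = {
    ("DNA repair", "DNA metabolism"),
    ("DNA replication", "DNA metabolism"),
    ("DNA unwinding", "DNA replication"),
}
leq = reflexive_transitive_closure(nodes, edges)
```

Because the input has no cycles, the closure is antisymmetric and therefore a partial order; a cyclic input would fail the `is_antisymmetric` check.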

8/25

9/25

2. Methodology (2/2)

$P_{GO}$ is the set of GO nodes, such as ‘DNA unwinding’ and ‘DNA replication’. The ordering $\le$ in ‘DNA repair $\le$ DNA metabolism’ represents that DNA repair is a kind of DNA metabolism.

The GO, cast as a pair of posets $\mathcal{P}_{is} = \langle P_{GO}, \le_{is} \rangle$ and $\mathcal{P}_{has} = \langle P_{GO}, \le_{has} \rangle$ for the two kinds of relations, is a large, taxonomically organized semantic hierarchy.

The paper treats the two kinds of links as equivalent: $\mathcal{P}_{GO} = \langle P_{GO}, \le_{GO} \rangle$, where $\le_{GO} = \le_{is} \cup \le_{has}$.

10/25

2.1 Poset theory (1/3)

Two nodes $p_1, p_2 \in P$ are comparable, denoted $p_1 \sim p_2$, if either $p_1 \le p_2$ or $p_2 \le p_1$.

A chain $C \subseteq P$ is a collection of pairwise comparable nodes. The height $H(\mathcal{P})$ is the size of the largest chain.

Two nodes $p_1, p_2 \in P$ are noncomparable if $p_1 \not\sim p_2$.

An antichain is a collection of pairwise noncomparable nodes. The width $W(\mathcal{P})$ is the size of the largest antichain.
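As a rough illustration of these definitions, the sketch below computes height and width by brute force on a hypothetical four-node "diamond" poset. The node names and helper functions are assumptions made for the example, and exhaustive subset search is only practical for very small posets.

```python
from itertools import combinations

# Hypothetical toy poset: bottom <= left, bottom <= right, left <= top, right <= top.
parents = {"bottom": {"left", "right"}, "left": {"top"}, "right": {"top"}, "top": set()}

def ancestors(node):
    """All nodes q with node <= q, including node itself."""
    seen, stack = {node}, [node]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def comparable(a, b):
    return b in ancestors(a) or a in ancestors(b)

def is_chain(subset):
    return all(comparable(a, b) for a, b in combinations(subset, 2))

def is_antichain(subset):
    return all(not comparable(a, b) for a, b in combinations(subset, 2))

def largest(predicate):
    """Size of the largest subset of nodes satisfying predicate (exhaustive search)."""
    nodes = list(parents)
    for size in range(len(nodes), 0, -1):
        if any(predicate(subset) for subset in combinations(nodes, size)):
            return size
    return 0

height = largest(is_chain)      # largest chain, e.g. bottom <= left <= top
width = largest(is_antichain)   # largest antichain, here {left, right}
```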

11/25

2.1 Poset theory (2/3)

Given two comparable nodes $p_1 \le p_2$, the set of all nodes ‘between’ them is the interval $[p_1, p_2] = \{p : p_1 \le p \le p_2\}$, which is equivalent to the set of all chains between $p_1$ and $p_2$, denoted $\mathcal{C}(p_1, p_2)$.

The vector of chain lengths $h(p_1, p_2)$ collects the lengths $|C|$ of all chains $C \in \mathcal{C}(p_1, p_2)$.

The minimum and maximum chain lengths between $p_1$ and $p_2$ are $h_*(p_1, p_2) = \min_{C \in \mathcal{C}(p_1, p_2)} |C|$ and $h^*(p_1, p_2) = \max_{C \in \mathcal{C}(p_1, p_2)} |C|$, respectively.

12/25

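2.1 Poset theory (3/3)

Example: $P = \{1, A, B, \ldots, K\}$.

$B$ and $J$ are noncomparable, while $A$ and $B$ are comparable ($A \le B$).

$[A, B] = \{A, F, G, H, I, B\}$ consists of the three chains $\mathcal{C}(A, B) = \{A \le F \le B,\ A \le G \le B,\ A \le H \le I \le B\}$.

$h(A, B) = \langle 2, 2, 3 \rangle$, with $h_*(A, B) = 2$ and $h^*(A, B) = 3$. Chain lengths count links, so $A \le F \le B$ has length 2.

$H(\mathcal{P}) = 5$ (a maximal chain is $D \le E \le I \le C \le 1$) and $W(\mathcal{P}) = 5$ (the largest antichain is $\{F, G, H, E, J\}$).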
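The chain vector for this example can be reproduced with a short path enumeration. The sketch below is an illustration only: it encodes just the links that can be read off the chains quoted on the slide, the link from B to 1 is an assumption (1 is taken to be the top node), and the rest of the figure's poset is not reconstructed.

```python
# Child -> parents links recoverable from the quoted chains.
# "B" -> "1" is an assumption; nodes J and K from the figure are omitted.
parents = {
    "A": ["F", "G", "H"],
    "F": ["B"], "G": ["B"],
    "H": ["I"], "I": ["B", "C"],
    "D": ["E"], "E": ["I"],
    "B": ["1"], "C": ["1"], "1": [],
}

def chains(low, high):
    """All upward paths from low to high, each as a list of nodes."""
    if low == high:
        return [[low]]
    return [[low] + rest for parent in parents[low] for rest in chains(parent, high)]

def chain_lengths(low, high):
    """Vector of chain lengths, counting links rather than nodes."""
    return sorted(len(chain) - 1 for chain in chains(low, high))

h = chain_lengths("A", "B")
h_min, h_max = min(h), max(h)
```

Here `h` comes out as `[2, 2, 3]`, matching the vector on the slide, with `h_min` and `h_max` giving the two extreme lengths.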

13/25

Poset statistics of the GO

14/25

2.2 Methods (1/4)

Define a POSet Ontology (POSO) as $\mathcal{O} = \langle \mathcal{P}, X, F \rangle$, where $X$ is a finite, non-empty set of labels, and $F: X \to 2^P$ is an annotation function mapping each label $x \in X$ to a collection of nodes $F(x) \subseteq P$.

E.g. $X = \{a, b, \ldots, j\}$, $F(b) = \{A, E, F\}$.

In GOC, $\mathcal{O}_{GO} = \langle \mathcal{P}_{GO}, X_{GO}, F_{GO} \rangle$, where the gene products $X_{GO}$ and annotations $F_{GO}$ are provided by the GO file.
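A dictionary is one straightforward way to hold an annotation function like $F$. In the sketch below only `F["b"]` is taken from the slide; the other entries, and the helper names `invert` and `requested_nodes`, are made up for illustration.

```python
from collections import defaultdict

# Annotation function F: label -> set of nodes.
# Only F["b"] comes from the slide; "a" and "c" are hypothetical.
F = {
    "a": {"B"},
    "b": {"A", "E", "F"},
    "c": {"F", "J"},
}

def invert(annotation):
    """Map each node to the set of labels annotated to it."""
    node_to_labels = defaultdict(set)
    for label, nodes in annotation.items():
        for node in nodes:
            node_to_labels[node].add(label)
    return dict(node_to_labels)

def requested_nodes(labels, annotation):
    """Count how many query labels hit each node (unknown labels are ignored)."""
    counts = defaultdict(int)
    for label in labels:
        for node in annotation.get(label, ()):
            counts[node] += 1
    return dict(counts)

hits = requested_nodes(["b", "c"], F)
```

A query list of gene labels then turns into a multiset of requested nodes, which is the kind of input a scoring step would work from.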

15/25

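2.2 Methods (2/4)

A pseudo-distance is a function $\delta: P^2 \to \mathbb{R}$ defined between comparable nodes. Four choices:

The minimum path length, $\delta_m = h_*$

The maximum path length, $\delta_x = h^*$

The average of the extreme path lengths, $(h_* + h^*)/2$

The average of all path lengths, i.e. the mean of the entries of $h(p_1, p_2)$

In every case $h_*(p_1, p_2) \le \delta(p_1, p_2) \le h^*(p_1, p_2)$. A normalized distance is defined as $\bar{\delta} = \delta / H(\mathcal{P})$.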
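The four pseudo-distances reduce to simple statistics over the chain-length vector. The following sketch assumes that vector has already been computed; the function name and dictionary keys are hypothetical, not names from the paper.

```python
def pseudo_distances(chain_lengths, height=None):
    """Four pseudo-distances from a vector of chain lengths.

    If `height` (the poset height H) is given, every value is divided by it
    to obtain the normalized distance.
    """
    lengths = list(chain_lengths)
    shortest, longest = min(lengths), max(lengths)
    distances = {
        "min": shortest,                          # minimum path length
        "max": longest,                           # maximum path length
        "avg_extremes": (shortest + longest) / 2, # average of the two extremes
        "avg_all": sum(lengths) / len(lengths),   # average over all chains
    }
    if height is not None:
        distances = {name: value / height for name, value in distances.items()}
    return distances

# Chain lengths <2, 2, 3> and height 5 are the values from the worked example.
raw = pseudo_distances([2, 2, 3])
normalized = pseudo_distances([2, 2, 3], height=5)
```

Every value in `raw` lies between the minimum and maximum chain length, which is the bound stated on the slide.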

16/25

2.2 Methods (3/4)

A scoring function $S_Y(p)$ returns the weighted rank of a node $p \in P$ based on the requested nodes $Y$.

Two kinds of scores:

An unnormalized score $S_Y : P \to \mathbb{R}^+$, which returns an ‘absolute’ number

A normalized score, which returns a ‘relative’ number

17/25

2.2 Methods (4/4)

$s \in \{\ldots, -1, 0, 1, 2, 3, \ldots\}$, where low $s$ emphasizes coverage and high $s$ emphasizes specificity.

Let $r = 2^s$; then we have four scoring functions, one for each combination:

Unnormalized distance and unnormalized score

Unnormalized distance and normalized score

Normalized distance and unnormalized score

Normalized distance and normalized score
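To make the coverage-versus-specificity trade-off concrete, here is a deliberately simplified sketch. It is not the paper's scoring function, whose formulas are not reproduced in this transcript. It assumes one plausible form: each requested node below a candidate contributes its hit count divided by (distance + 1) raised to the power r = 2^s. The hierarchy, hit counts and function names are all hypothetical.

```python
# Hypothetical child -> parents links and hit counts for requested nodes.
parents = {"x": ["m"], "y": ["m"], "m": ["top"], "z": ["top"], "top": []}
hits = {"x": 2, "y": 1, "z": 1}

def min_dist(low, high):
    """Minimum number of links from low up to high, or None if low is not below high."""
    if low == high:
        return 0
    options = [min_dist(parent, high) for parent in parents[low]]
    options = [d for d in options if d is not None]
    return 1 + min(options) if options else None

def score(node, s):
    """Assumed illustrative score: hits at or below node, discounted by (distance + 1) ** r."""
    r = 2.0 ** s
    total = 0.0
    for requested, count in hits.items():
        d = min_dist(requested, node)
        if d is not None:
            total += count / (d + 1) ** r
    return total

def ranked(s):
    """Nodes ordered from highest to lowest score for a given s."""
    return sorted(parents, key=lambda node: score(node, s), reverse=True)

coverage_first = ranked(-10)   # very low s: distance barely matters
specific_first = ranked(3)     # high s: distant hits are heavily discounted
```

Under this assumed form, a very low s ranks the node covering the most hits first, while a high s ranks the most heavily hit specific node first, which is the qualitative behaviour the slide describes.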

18/25

Cluster heads are marked with +, and secondaries with -.

19/25

3. Expert validation (1/2)

An experienced molecular immunologist constructed two nonoverlapping lists of genes: KT1, a list of 242 genes involved in immune processes; and KT4, a list of 147 genes involved in cell–cell/cell–matrix interactions.

KT1, KT4 and KT1 ∪ KT4 provided three queries for GOC into the BP branch of the GO, using the distance $\delta_m$, $s = 7$ and one of the scoring functions defined above.

20/25

3. Expert validation (2/2)

Two assessed values:

Utility (1=low to 5=high): Did the cluster terms provide a useful description of a specific biological process?

Expectation (1=high to 5=low): Was the identified biological process expected for the genes in the query?

21/25

22/25

4. Formal validation (1/3)

An independent source of annotations of collections of GO nodes: the InterPro project, which catalogs assignments of protein families, domains and functional sites to GO IDs.

E.g. ‘phosphofructokinase’ is InterPro ID IPR000023, and is annotated to GO:0006096=‘glycolysis’, GO:0003872=‘6-phosphofructokinase activity’, and GO:0005945=‘6-phosphofructokinase complex’. It also maps to 175 proteins. Thus the validation task is to make these 175 proteins a GOC query, and see how well cluster heads match against the set of GO IDs {GO:0006096, GO:0003872, GO:0005945}.

23/25

4. Formal validation (2/3)

In the run, there were 4,866 InterPro IDs with GO annotations, with 11,370 mappings to GO nodes and 787,760 mappings to proteins in total. Of these proteins, the authors were able to locate 778,494 (about 99%) with GO annotations.

24/25

4. Formal validation (3/3)

• Immediate family: child/parent/sibling.

• Extended family: grandparent/grandchild/cousin/aunt/uncle/niece/nephew
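One way to classify how a cluster head relates to an expected GO node is to compare their parent and grandparent sets. The sketch below is an assumption about how such family relations might be defined in a hierarchy with multiple parents, not the procedure used in the paper. The toy hierarchy and the function names are hypothetical.

```python
# Hypothetical child -> parents links for a toy hierarchy.
parents = {
    "root": set(),
    "p1": {"root"}, "p2": {"root"},
    "c1": {"p1"}, "c2": {"p1"}, "c3": {"p2"},
}

def up(nodes):
    """Union of the parents of every node in `nodes`."""
    return set().union(*(parents[n] for n in nodes))

def relation(a, b):
    """Classify how node a relates to node b; the first matching rule wins."""
    pa, pb = parents[a], parents[b]
    gpa, gpb = up(pa), up(pb)
    if a == b:
        return "identity"
    if a in pb:
        return "parent"
    if b in pa:
        return "child"
    if pa & pb:
        return "sibling"          # a and b share a parent
    if a in gpb:
        return "grandparent"
    if b in gpa:
        return "grandchild"
    if pa & gpb:
        return "aunt/uncle"       # a is a sibling of one of b's parents
    if gpa & pb:
        return "niece/nephew"     # one of a's parents is a sibling of b
    if gpa & gpb:
        return "cousin"           # a and b share a grandparent
    return "unrelated"
```

For example, `relation("c1", "c2")` gives `"sibling"` (immediate family), while `relation("c1", "c3")` gives `"cousin"` (extended family).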

25/25

5. Conclusions

The GOC methodology provides a valid and useful approach to categorization in the GO.

Future work:

Methodological development in combinatorial approaches to data analysis, including distances between noncomparable nodes, interval-valued measures of ‘level’ in posets, algorithms for poset width calculation and poset matching.

Expansion to other ontologies.

Continuation of work in textual approaches, mapping back and forth from semantic relations among GO nodes to those among its lexical components.
