predicting gene functions from text using a cross-species approach emilia stoica and marti hearst...

Predicting Gene Functions from Text

Using a Cross-Species Approach

Emilia Stoica and Marti HearstSchool of Information

University of California, Berkeley

Research Supported by NSF DBI-0317510 and a gift from Genentech

Goal

Annotate genes with functional information derived from journal articles.

Gene Ontology (GO)

Gene Ontology (GO) controlled vocabulary for functional annotation

~ 17,600 terms (circa July 2004) Organized into 3 distinct acyclic

graphs molecular functions biological processes cellular locations

More general terms are “parents” of less general terms: development (GO:0007275) is the parent

of embryonic development (GO:0001756)

Challenges GO tokens might not appear explicitly Example: PubMed 10692450

GO:0008285: negative regulation of cell proliferationOccurs as: inhibition of cell proliferation

GO tokens might not occur contiguously Example: PubMed 10734056, GO:0007186:

G-protein coupled receptor protein signaling pathway Occurs as:

Results indicate that CCR1-mediated responses are regulated …in the signaling pathway, by receptor phosphorylation at the level of receptor G/protein coupling … CCR1 binds MIP-1 alpha.

Challenges

The simplest strategy (assigning GO codes to genes simply because the GO tokens occur near the gene) yields a large number of false positives.

Issues: a) The text does not contain evidence to

support the annotation,b) The text contains evidence for the

annotation, but the curator knows the gene to be involved in a function that is more general or more specific than the GO code matched in text.

Challenges

GO contains hints about what kinds of evidence are required for annotation, e.g.: The text should mention co-

purification, co-immunoprecipitation experiments

Requiring these evidence terms does not seem to improve algorithms.

Related Work Mainly in the context of BioCreative competition

(2004) Chiang and Yu 2003, 2004:

Find phrase patterns commonly used in sentences describing gene functions

(e.g., “gene plays an important role in”, “gene is involved in”) Final assignments made with a Naïve Bayes

classifier Ray and Craven 2004, 2005:

Learn a statistical model for each GO code (which words are likely to co-occur in the paragraphs containing GO codes);

Decide among candidates via a multinomial Naïve Bayes classifier

Rice et al. 2004: Train an SVM for each GO code. Target genes assigned best-scoring GO code.

Related Work, cont.

Couto et al. 2004 Determine if the “information content” of the

matching GO terms is larger than for all the candidate GO terms.

Verspoor et al. 2004 Expand GO tokens with words that frequently

co-occur in a training set; use a categorizer that explores the structure of the Gene Ontology to find best hits.

Ehler and Ruch 2004: Treat each document as a query to be

categorized Create a score based on a combination of

pattern matching and TF*IDF weighting Annotate gene with top-scoring GO codes.

Our Approach

Two main contributions: Use cross-species information (CSM) Check for biological (in)

consistencies (CSC)

Cross-Species MatchMain Idea

Use orthologous genes [Genes of different species that have

evolved directly from a common ancestor.] Assumption:

Since there is an overlap between the genomes of the two species, their orthologs may share some functions, and consequently some GO codes

Idea: to predict GO codes for target genes in target species, use the GO codes assigned to their orthologous genes We use Mouse vs. Human genes

General procedure

Analyze text at sentence level Eliminate stop words, punctuation

characters and divide the text into tokens using space as delimiter

Normalize and match different variations of gene names using the algorithm of Bhalotia et al.’03

For every sentence that contains the target gene: A GO code is matched if the sentence

contains a percentage of GO tokens larger than a threshold (0.75 for CSM and 1 for CSC)

Cross Species Match Algorithm

CSM(g, a): For a target gene g, search in article a for only the GO codes annotated to its ortholog

If at least 75% of the GO code terms are found in a sentence containing the gene name, the code is matched.

Note: we must eliminate annotations of orthologs marked with IEA and ISS codes to avoid circular references.

Cross-Species Correlation Main Idea

Observation: Since GO codes indicate gene function, it is logical

for some to often co-occur in annotations and for others to rarely do so.

Assumption: If one GO code tends to occur in the orthologous

genes’ annotations when another one does not, then assume the second is not a valid assignment for the target species

Example: If text seems to contain evidence for rRNA

transcription (GO:0009303) nucleolus (GO:0005737) and extracellular (GO:0005576), then extracellular is suspicious.

The algorithm identifies the “suspicious” cases.

Cross-Species Correlation Algorithm For every pair of GO codes in the

orthologous genes database, compute a X2 coefficient.

N: the total number of GO codes O11: # of times the ortholog is annotated with both

GO1 and GO2

O12: # of times the ortholog is annotated with GO1 but not GO2

O21: # of times the ortholog is annotated with GO2 but not GO1

O12: # of times the ortholog is not annotated with GO1 or GO2

)(*)(*)(*)(

)**(*)**(*

2221221221111211

2112221121122211

OOOOOOOO

OOOOOOOON

X2

Cross-Species Correlation Algorithm

M(g,a) = GO codes matched in article a for gene g

O(g) = GO codes assigned to the ortholog of g

o = size of O(g), p = percentage (0.2) For every potentially matching GO code GO1

in M(g,a) For every GO code GO2 in O(g)

Count how often X2(GO1,GO2) is significant

If this count is < p*o then assume GO1 is not valid.

Else assign GO1 to g

Information Flow

Evaluation using BioCreative

Task 2.2: Annotate 138 human genes with GO

codes using 99 full text articles; For each annotation, provide the

passage of text that the annotation was based upon.

Annotations from participants were manually judged by human curators

A prediction was considered “perfect” if the text passage contained the gene name, and provided evidence for annotating the

gene with the GO code

Results on BioCreative

Our research was conducted after the competition had past, so our annotations could not be judged by the same curators

Used the “perfect predictions” (unfair to our system; ignores relevant

predictions we find that other systems do not)

Our prediction is correct if it matches a perfect prediction (e.g., vhl is annotated with transcription (GO:0006350) in PubMed 12169961 “vhl inhibits transcription elongation, mRNA stability and PKC activity”)

BioCreative Results

System Precision

TP (Recall)

F-measure

CSM 0.36 16 (0.07) 0.11

CSC 0.18 44 (0.19) 0.18

CSM+CSC 0.24 51 (0.21)

0.23

Ray and Craven

0.21 52 (0.22) 0.22

Chiang and Yu

0.33 37 (0.16) 0.21

Ehler and Ruch

0.12 78 (0.33) 0.18

Couto et al. 0.09 58 (0.25) 0.13

Verspoor et al.

0.06 19 (0.08) 0.07

Rice et al. 0.04 16 (0.07) 0.05

Results on Larger Dataset

A much larger test set has been made publicly available by Chiang and Yu.

EBI human test set 4,410 genes 13,626 GO code annotations

MGI mouse test set 2,188 genes 6,338 GO code annotations

Note that Chiang and Yu used the same data for both training and testing.

Results on EBI Human and MGI datasets

EBI human: 4,410 genes and 5,714 abstracts MGI: 2,188 genes and 1,947 abstracts

Dataset

System Precision

Recall F-measure

EBI CSM 0.29 0.03 0.06

CSM+CSC 0.16 0.09 0.12

Chiang and Yu

0.32 0.06 0.11

MGI CSM 0.33 0.05 0.09

CSC+CSC 0.17 0.12 0.14Chiang and Yu

0.33 0.05 0.09

Conclusions and Future Work

We propose an algorithm that annotates genes with GO codes using the information available from other species

Experimental results on three datasets show that our algorithm consistently achieves higher F-measures than other solutions

Future improvements to our algorithm: - combine or use a voting scheme between the predictions our system makes and the predictions of a machine learning system

- investigate how effective are other genes with sequences similar to the target gene (but not orthologous to the gene) for predicting the GO codes

Thank you!

http://biotext.berkeley.edu

Research Supported by NSF DBI-0317510 and a gift from Genentech

predicting gene functions from text using a cross-species approach emilia stoica and marti hearst...

Documents

gene ontology

genentech slide

evidence terms

predicting gene functions

general terms

consistencies csc slide

functional information

functional annotation