similarity search in large sets of genes using semantic similarity of gene ontology annotations ...

Similarity Search in Large Datasets using Gene Ontology

COMPUTATIONAL INFORMATICS

Heiko Müller, David Rozado, Mat Cook, Ashfaqur Rahman

Gene01: ACGGTAGGCTAGACTAGATATTAACG

Gene02: CCTGAGTACCTGGACTAGATAC

Gene03: GATGCGGTTACGTACGATCCATGGA

Gene04: CATTTATTATATATACGCGCGCGA

Gene05: TTTCGATAGGGGATATATTAACGCCG

Gene06: GTAGGTAGGTGGAGGCCCGCAGACGC

Gene07: GATAGACTCGCGCCGATATATAG

Gene08: ATATATTTCCTAGATCGAGAGATAC

Gene09: GATAGGTTAATTAATTTCCTATAT

Gene10: TGGATTGGATAGCGCGATAGATC

Gene11: AAAAGTCGATAAGGCTAGAGCTAG

Gene12: GGATATAGATATATCTAGATATC

Gene13: CGATATAGCCCTCTAGAGATACTTT

Gene14: GATACCCGCGATATATCAT

Gene15: TAGATCCCCGAGATAGAGACT

Gene16: CACCATAGAAGACTGATCGAGATAG

Gene01: GGCTAGACTAGATATTAACGACGGTA

Gene02: AGTACCTGGACTAGCCTGTAC

Gene03: GATGCGGTTACGCCATTACGAT

Gene04: GATATATATATATACGCGCGCGA

Gene05: CATTTATGGGATATATTAACGCCG

Gene06: GTAGGTAGGTGGAGGCCCGCAGACGC

Gene07: GATAGACTCGCGCCGATATATAG

Gene08: TCCTAGATCAGATCGAGAGATAC

Gene09: GATAGGTTAATTAATTTCCTATAT

Gene10: GCGATCCTATGGATAGCAGATC

Gene11: AAAAGTCGATAAGGCTAGAGCTAG

Gene12: GGATATAGATATATCTAGATATC

Gene13: CGATATAGCCAGAAGTCGAACTTT

Gene14: GATACCCGCGCTCTATATATCAT

Gene15: TAGATCCCCGAGATAGAGACT

Gene16: CACCATAGAAGACTGATCGAGATAG

N. perurans N. pemaquidensis

Compare sets of genes and gene products to discover:

1. Similarities between them.

2. The most dissimilar genes in each dataset.


Semantic Similarity Search

Algorithms for Comparing Large Datasets

Results

Semantic Similarity Search in Large Datasets | Heiko Müller

Gene Ontology (GO)


Example from Molecular Function ontology

GO Annotations


GOA(g1) = {GO:0055100, GO:0070122}

True Path Rule: “[...] the pathway from a child term all the way up to its top-level parent(s) must always be true”.
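One practical consequence of the True Path Rule is that a gene annotated with a term is implicitly annotated with all of that term's ancestors. The sketch below shows this propagation on a toy hierarchy. The term names and the `PARENTS` map are hypothetical placeholders, not real GO identifiers or relationships.

```python
# Hypothetical child -> parents map (placeholder names, not real GO terms).
PARENTS = {
    "t_root": [],
    "t_binding": ["t_root"],
    "t_protein_binding": ["t_binding"],
    "t_hormone_binding": ["t_binding"],
    "t_specific": ["t_protein_binding", "t_hormone_binding"],
}


def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(annotations, parents):
    """Expand a set of direct annotations to include all ancestor terms."""
    closed = set()
    for term in annotations:
        closed |= ancestors(term, parents)
    return closed


closure = propagate({"t_specific"}, PARENTS)
print(sorted(closure))
```

Annotating a gene with `t_specific` alone yields all five terms of the toy hierarchy after propagation.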

Semantic Similarity


GOA(g1) = {GO:0055100, GO:0070122}

GOA(g2) = {GO:0030332, GO:0070012}

• Annotations provide an objective representation to compare genes on functional aspects.

• A semantic similarity measure quantifies the relationship between (sets of) GO terms.

sim(g1, g2) = ?

Term Specificity

less similar

more similar

Corpus-based: $ic(t) = -\log(P(t))$

Structure-based: $ic(t) = depth(t) \cdot \left(1 - \frac{\log(desc(t) + 1)}{\log(total\_terms)}\right)$

Quantify semantics or information content (ic) of GO terms.
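The two notions of term specificity can be sketched in a few lines. This is a minimal illustration, assuming the corpus-based measure is the negative log of a term's relative annotation frequency and the structure-based measure combines term depth with the number of descendants. The function names, parameter names, and the choice of logarithm base are assumptions; the slides do not state a base.

```python
import math


def ic_corpus(term_count, total_count, base=math.e):
    """Corpus-based IC: -log(P(t)), with P(t) = term_count / total_count.

    The logarithm base is an assumption and only rescales the values.
    """
    return -math.log(term_count / total_count, base)


def ic_structure(depth, n_descendants, total_terms):
    """Structure-based IC: depth(t) * (1 - log(desc(t) + 1) / log(total_terms)).

    The ratio of logarithms does not depend on the base.
    """
    return depth * (1.0 - math.log(n_descendants + 1) / math.log(total_terms))


# A term used in 1 of 8 annotations carries 3 bits in base 2.
print(ic_corpus(1, 8, base=2))
# A leaf term (no descendants) at depth 3 keeps its full depth as IC.
print(ic_structure(3, 0, 100))
```

In both variants, rare or deep terms receive a high ic value and general terms near the root receive a low one.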

Group-wise Semantic Similarity


GOA(g1) = {GO:0055100, GO:0070122}

GOA(g2) = {GO:0030332, GO:0070012}

IC(g1) = 10.6609

IC(g2) = 9.7925

IC(g1 ∩ g2) = 2.7925

sim(g1, g2) = 0.2736

$sim(g_1, g_2) = \frac{1}{2} \left( \frac{IC(g_1 \cap g_2)}{IC(g_1)} + \frac{IC(g_1 \cap g_2)}{IC(g_2)} \right)$

Group-wise Similarity

X. Chen et al., Gene, 509 (2012)
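As a quick check of the numbers on this slide, the group-wise similarity can be computed as the average of the shared IC relative to each gene's total IC. The function name `groupwise_sim` is an assumption for illustration; the three input values are the ones given on the slide.

```python
def groupwise_sim(ic_g1, ic_g2, ic_shared):
    """Average of the shared IC relative to each gene's total IC."""
    return 0.5 * (ic_shared / ic_g1 + ic_shared / ic_g2)


# IC(g1), IC(g2), and the IC of the shared terms, as given on the slide.
sim = groupwise_sim(10.6609, 9.7925, 2.7925)
print(round(sim, 4))  # 0.2736
```

Two genes with identical annotations share all of their IC and therefore reach the maximum similarity of 1.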



Gene Identifier Sets

D1

1 = g11: GO:0003824, GO:0005488
2 = g12: GO:0016787, GO:0042562
3 = g13: GO:0008233, GO:0031406
4 = g14: GO:0005515, GO:0016787
5 = g15: GO:0055100, GO:0070122

D2

1 = g21: GO:0003824, GO:0005488
2 = g22: GO:0016829, GO:0042562
3 = g23: GO:0043168, GO:0008233
4 = g24: GO:0055100, GO:0070012
5 = g25: GO:0004325, GO:0043177

[Figure: GO terms labeled with the identifier sets of the D1 and D2 genes annotated with them, e.g. "5 | 4", "2-5 | 3-4", "1-5 | 1-5"]

Exhaustive Search

Term        IC      GIDS-D1  GIDS-D2
GO:0070012  4.0000  -        4
GO:0070122  3.5212  5        -
GO:0055100  3.0000  5        4
GO:0004325  2.7734  -        5
GO:0031406  2.2228  3        -
GO:0043177  1.6616  3        5
GO:0008233  1.6305  3,5      3-4
GO:0043168  1.3777  3        3
GO:0042562  1.3472  2,5      2,4
GO:0036094  0.8873  3        5
GO:0043167  0.8624  3        3
GO:0016829  0.6347  -        2,5
GO:0005515  0.5123  4-5      4
GO:0016787  0.4144  2-5      3-4
GO:0005488  0.1898  1-5      1-5
GO:0003824  0.0455  1-5      1-5

[Figure: 5 x 5 matrix IC-D12 (D1 genes 1-5 against D2 genes 1-5) with the vectors IC-D1 and IC-D2; values shown: 4, 3.52, 3, 7, 6.52]
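The exhaustive search over the term table can be sketched as a single pass that accumulates three quantities: the total IC per D1 gene, the total IC per D2 gene, and the shared IC per gene pair. This is a minimal sketch under assumptions, not the authors' implementation. For rows that list only one identifier set, the assignment to D1 or D2 is inferred from the annotation lists of the two datasets, and the names `ROWS` and `exhaustive` are illustrative.

```python
from collections import defaultdict

# (term, IC, genes in D1, genes in D2), taken from the term table.
ALL = {1, 2, 3, 4, 5}
ROWS = [
    ("GO:0070012", 4.0000, set(), {4}),
    ("GO:0070122", 3.5212, {5}, set()),
    ("GO:0055100", 3.0000, {5}, {4}),
    ("GO:0004325", 2.7734, set(), {5}),
    ("GO:0031406", 2.2228, {3}, set()),
    ("GO:0043177", 1.6616, {3}, {5}),
    ("GO:0008233", 1.6305, {3, 5}, {3, 4}),
    ("GO:0043168", 1.3777, {3}, {3}),
    ("GO:0042562", 1.3472, {2, 5}, {2, 4}),
    ("GO:0036094", 0.8873, {3}, {5}),
    ("GO:0043167", 0.8624, {3}, {3}),
    ("GO:0016829", 0.6347, set(), {2, 5}),
    ("GO:0005515", 0.5123, {4, 5}, {4}),
    ("GO:0016787", 0.4144, {2, 3, 4, 5}, {3, 4}),
    ("GO:0005488", 0.1898, ALL, ALL),
    ("GO:0003824", 0.0455, ALL, ALL),
]


def exhaustive(rows):
    """Accumulate IC per gene and per gene pair, then derive similarities."""
    ic_d1 = defaultdict(float)   # total IC of each D1 gene
    ic_d2 = defaultdict(float)   # total IC of each D2 gene
    ic_d12 = defaultdict(float)  # shared IC of each (D1 gene, D2 gene) pair
    for _term, ic, genes1, genes2 in rows:
        for a in genes1:
            ic_d1[a] += ic
        for b in genes2:
            ic_d2[b] += ic
        for a in genes1:
            for b in genes2:
                ic_d12[(a, b)] += ic
    sims = {
        (a, b): 0.5 * (shared / ic_d1[a] + shared / ic_d2[b])
        for (a, b), shared in ic_d12.items()
    }
    return ic_d1, ic_d2, ic_d12, sims


ic_d1, ic_d2, ic_d12, sims = exhaustive(ROWS)
print(round(ic_d1[5], 4), round(ic_d2[4], 4), round(ic_d12[(5, 4)], 4))
```

The cost of the inner double loop grows with the product of the two identifier sets of a term, which is why general terms near the bottom of the table dominate the runtime.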

Similarity-based Ranking


sim(g1,g2) = 1

sim(g3,g4) = 0.82

$simrank(g_1, g_2) = IC(g_1 \cap g_2) \cdot sim(g_1, g_2)$

simrank(g1,g2) = 0.2353

simrank(g3,g4) = 14.0304
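The ranking score weights the similarity by the amount of shared information, so a perfect match on very general terms ranks below a good match on specific terms. A minimal sketch follows; the function name and argument order are assumptions. The second call uses IC totals obtained by summing the term ICs of the toy table for D1 gene 5 and D2 gene 4, which is a derivation made here rather than a value printed on the slide.

```python
def simrank(ic_shared, ic_a, ic_b):
    """Shared IC multiplied by the group-wise similarity of the two genes."""
    sim = 0.5 * (ic_shared / ic_a + ic_shared / ic_b)
    return ic_shared * sim


# Identical annotations on two general terms: sim = 1, but little shared IC.
print(round(simrank(0.2353, 0.2353, 0.2353), 4))  # 0.2353

# D1 gene 5 against D2 gene 4 in the toy table (IC sums derived from the table).
print(round(simrank(7.1397, 10.6609, 11.1397), 2))  # 4.68
```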

Top-k Search

Term        IC      GIDS-D1  GIDS-D2
GO:0070012  4.0000  -        4
GO:0070122  3.5212  5        -
GO:0055100  3.0000  5        4
GO:0004325  2.7734  -        5
GO:0031406  2.2228  3        -
GO:0043177  1.6616  3        5
GO:0008233  1.6305  3,5      3-4
GO:0043168  1.3777  3        3
GO:0042562  1.3472  2,5      2,4
GO:0036094  0.8873  3        5
GO:0043167  0.8624  3        3
GO:0016829  0.6347  -        2,5
GO:0005515  0.5123  4-5      4
GO:0016787  0.4144  2-5      3-4
GO:0005488  0.1898  1-5      1-5
GO:0003824  0.0455  1-5      1-5

Top-5 after each step (D1 gene, D2 gene: simrank)

Step 1
5,4: 4.68
5,3: 0.82
5,2: 0.68
5,1: 0.12
5,5: 0.01

Step 2
5,4: 4.68
3,3: 3.36
3,5: 1.04
5,3: 0.82
5,2: 0.68

Step 3
5,4: 4.68
3,3: 3.36
2,2: 1.19
2,4: 1.18
3,5: 1.04

Gene   1     2     3     4     5
IC-D1  0.24  2     9.29  1.16  10.7
IC-D2  0.24  2.22  4.52  11.1  6.19
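The step lists are consistent with a procedure that visits the D1 genes in the order in which they first appear when scanning the terms by descending IC (5, then 3, then 2), scores each visited gene against all D2 genes, and keeps the k best pairs. The sketch below reproduces the three Top-5 lists under that assumption. The slides do not show the stopping rule that lets a top-k search end early, so this sketch simply visits every gene. `ROWS`, `first_seen_order`, and `top_k_steps` are illustrative names, and rows with a single identifier set are assigned to D1 or D2 based on the annotation lists.

```python
import heapq

# (term, IC, genes in D1, genes in D2), taken from the term table.
ALL = {1, 2, 3, 4, 5}
ROWS = [
    ("GO:0070012", 4.0000, set(), {4}),
    ("GO:0070122", 3.5212, {5}, set()),
    ("GO:0055100", 3.0000, {5}, {4}),
    ("GO:0004325", 2.7734, set(), {5}),
    ("GO:0031406", 2.2228, {3}, set()),
    ("GO:0043177", 1.6616, {3}, {5}),
    ("GO:0008233", 1.6305, {3, 5}, {3, 4}),
    ("GO:0043168", 1.3777, {3}, {3}),
    ("GO:0042562", 1.3472, {2, 5}, {2, 4}),
    ("GO:0036094", 0.8873, {3}, {5}),
    ("GO:0043167", 0.8624, {3}, {3}),
    ("GO:0016829", 0.6347, set(), {2, 5}),
    ("GO:0005515", 0.5123, {4, 5}, {4}),
    ("GO:0016787", 0.4144, {2, 3, 4, 5}, {3, 4}),
    ("GO:0005488", 0.1898, ALL, ALL),
    ("GO:0003824", 0.0455, ALL, ALL),
]


def first_seen_order(rows):
    """D1 genes in order of first appearance, scanning terms by descending IC."""
    order = []
    for _term, _ic, genes1, _genes2 in sorted(rows, key=lambda r: -r[1]):
        for gene in sorted(genes1):
            if gene not in order:
                order.append(gene)
    return order


def top_k_steps(rows, k):
    """Return the top-k list of (simrank, (d1_gene, d2_gene)) after each step."""
    ic1, ic2 = {}, {}
    for _term, ic, genes1, genes2 in rows:
        for a in genes1:
            ic1[a] = ic1.get(a, 0.0) + ic
        for b in genes2:
            ic2[b] = ic2.get(b, 0.0) + ic

    heap = []   # min-heap holding at most k (score, pair) entries
    steps = []
    for a in first_seen_order(rows):
        shared = {}  # shared IC between D1 gene a and each D2 gene
        for _term, ic, genes1, genes2 in rows:
            if a in genes1:
                for b in genes2:
                    shared[b] = shared.get(b, 0.0) + ic
        for b, s in shared.items():
            score = s * 0.5 * (s / ic1[a] + s / ic2[b])
            item = (score, (a, b))
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
        steps.append(sorted(heap, reverse=True))
    return steps


steps = top_k_steps(ROWS, k=5)
for number, ranking in enumerate(steps[:3], start=1):
    print("Step", number, [(pair, round(score, 2)) for score, pair in ranking])
```

A bounded min-heap keeps the update per candidate pair at O(log k), so only the weakest current entry has to be compared against each new score.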



Results

Runtime: MF (438,406 entries with GO annotations)

Dataset: UniProt Swiss-Prot (Rel. 2014_02)

Method      Runtime
Baseline    > 2 days
Exhaustive  ~45 min
Top 10,000  2.5 to 4.5 min
Top 1,000   1 to 3.5 min
Top 100     15 sec to 2.5 min


Results (cont.)

• Compare the Top 10,000 matches against the results of BLASTing Swiss-Prot against itself (E-value threshold $10^{-4}$).
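The comparison itself amounts to a set difference over unordered pairs: count the semantically similar pairs that have no corresponding BLAST hit. The sketch below shows one way to do this. The function names and the protein identifiers are hypothetical, and parsing of actual BLAST output is left out.

```python
def normalize(pair):
    """Treat (a, b) and (b, a) as the same unordered pair."""
    return tuple(sorted(pair))


def not_in_blast(semantic_pairs, blast_pairs):
    """Return the semantic matches that do not occur among the BLAST hits."""
    blast = {normalize(p) for p in blast_pairs}
    return [p for p in semantic_pairs if normalize(p) not in blast]


# Hypothetical identifiers, for illustration only.
top_matches = [("P1", "P2"), ("P3", "P4"), ("P5", "P6")]
blast_hits = [("P2", "P1"), ("P7", "P8")]
missing = not_in_blast(top_matches, blast_hits)
print(len(missing), missing)
```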


How does it compare to sequence similarity search?

[Figure: bar chart of the number of similar pairs in the Top 10,000 that are not included in the BLAST results (y-axis 0 to 8000), for MF-ALL and MF-CUR, with the series CORPUS and STRUCTURE]

Heiko Müller

e [email protected] t +61 3 6232 5575

COMPUTATIONAL INFORMATICS

Thank you
