similarity search in large sets of genes using semantic similarity of gene ontology annotations ...
DESCRIPTION
Over the past years, more than 30 different semantic similarity measures for GO annotations have been proposed. In the first part of the presentation I will give an overview of the strengths and weaknesses of these methods. In the second part I will talk about our efforts to develop algorithms that allow time-efficient semantic similarity search in large sets of genes (e.g. UniProt).
TRANSCRIPT
Similarity Search in Large Datasets using Gene Ontology
COMPUTATIONAL INFORMATICS
Heiko Müller, David Rozado, Mat Cook, Ashfaqur Rahman
Gene01: ACGGTAGGCTAGACTAGATATTAACG
Gene02: CCTGAGTACCTGGACTAGATAC
Gene03: GATGCGGTTACGTACGATCCATGGA
Gene04: CATTTATTATATATACGCGCGCGA
Gene05: TTTCGATAGGGGATATATTAACGCCG
Gene06: GTAGGTAGGTGGAGGCCCGCAGACGC
Gene07: GATAGACTCGCGCCGATATATAG
Gene08: ATATATTTCCTAGATCGAGAGATAC
Gene09: GATAGGTTAATTAATTTCCTATAT
Gene10: TGGATTGGATAGCGCGATAGATC
Gene11: AAAAGTCGATAAGGCTAGAGCTAG
Gene12: GGATATAGATATATCTAGATATC
Gene13: CGATATAGCCCTCTAGAGATACTTT
Gene14: GATACCCGCGATATATCAT
Gene15: TAGATCCCCGAGATAGAGACT
Gene16: CACCATAGAAGACTGATCGAGATAG
Gene01: GGCTAGACTAGATATTAACGACGGTA
Gene02: AGTACCTGGACTAGCCTGTAC
Gene03: GATGCGGTTACGCCATTACGAT
Gene04: GATATATATATATACGCGCGCGA
Gene05: CATTTATGGGATATATTAACGCCG
Gene06: GTAGGTAGGTGGAGGCCCGCAGACGC
Gene07: GATAGACTCGCGCCGATATATAG
Gene08: TCCTAGATCAGATCGAGAGATAC
Gene09: GATAGGTTAATTAATTTCCTATAT
Gene10: GCGATCCTATGGATAGCAGATC
Gene11: AAAAGTCGATAAGGCTAGAGCTAG
Gene12: GGATATAGATATATCTAGATATC
Gene13: CGATATAGCCAGAAGTCGAACTTT
Gene14: GATACCCGCGCTCTATATATCAT
Gene15: TAGATCCCCGAGATAGAGACT
Gene16: CACCATAGAAGACTGATCGAGATAG
N. perurans N. pemaquidensis
Compare sets of genes and gene products to discover:
1. Similarities between them.
2. The most dissimilar genes in each dataset.
Semantic Similarity Search
Algorithms for Comparing Large Datasets
Results
Semantic Similarity Search in Large Datasets | Heiko Müller
Gene Ontology (GO)
Example from Molecular Function ontology
GO Annotations
GOA(g1) = {GO:0055100, GO:0070122}
True Path Rule
“[...] the pathway from a child term all the way up to its top-level parent(s) must always be true”.
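A minimal sketch of what the True Path Rule implies for annotations: a gene annotated with a term is implicitly annotated with every ancestor of that term. The parent links and term names below are hypothetical placeholders, not taken from the real GO graph.

```python
# Hypothetical parent links (child -> parents); not real GO terms.
PARENTS = {
    "term_d": {"term_b", "term_c"},
    "term_b": {"term_a"},
    "term_c": {"term_a"},
    "term_a": set(),  # top-level term
}


def with_ancestors(terms, parents=PARENTS):
    """Return the given terms plus every term on a path to a top-level parent."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(parents.get(term, ()))
    return closed


propagated = with_ancestors({"term_d"})
print(sorted(propagated))
```

Under this reading, GOA(g1) would be expanded in the same way before two genes are compared.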
Semantic Similarity
GOA(g1) = {GO:0055100, GO:0070122}
GOA(g2) = {GO:0030332, GO:0070012}
• Annotations provide an objective representation to compare genes on functional aspects.
• A semantic similarity measure quantifies relationships between (sets of) GO terms.
sim(g1, g2) = ?
Term Specificity
less similar
more similar
Corpus-based: $ic(t) = -\log(P(t))$
Structure-based: $ic(t) = depth(t) \cdot \left(1 - \frac{\log(desc(t) + 1)}{\log(total\_terms)}\right)$
Quantify semantics or information content (ic) of GO terms.
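The two kinds of information content can be sketched as follows. This assumes the structure-based formula reads depth(t) · (1 − log(desc(t) + 1) / log(total_terms)) and uses base-2 logarithms for the corpus-based variant; the slide does not state the base, and the function and parameter names are illustrative.

```python
import math


def ic_corpus(term_count, total_count):
    """Corpus-based IC: -log2 of the annotation probability P(t) (base 2 assumed)."""
    return -math.log2(term_count / total_count)


def ic_structure(depth, n_descendants, total_terms):
    """Structure-based IC: depth(t) * (1 - log(desc(t) + 1) / log(total_terms))."""
    return depth * (1.0 - math.log(n_descendants + 1) / math.log(total_terms))


# A term used by 1 of 16 annotated genes is more specific than one used by all 16.
print(ic_corpus(1, 16), ic_corpus(16, 16))
# A leaf term (no descendants) at depth 3 keeps its full depth as IC.
print(ic_structure(3, 0, 1000))
```

Both variants give low values to terms near the root and high values to rare or deep terms, which is what the "less similar" and "more similar" labels rely on.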
Group-wise Semantic Similarity
GOA(g1) = {GO:0055100, GO:0070122}
GOA(g2) = {GO:0030332, GO:0070012}
IC(g1) = 10.6609
IC(g2) = 9.7925
IC(g1 ∩ g2) = 2.7925
sim(g1, g2) = 0.2736
$sim(g_1, g_2) = \frac{1}{2} \left( \frac{IC(g_1 \cap g_2)}{IC(g_1)} + \frac{IC(g_1 \cap g_2)}{IC(g_2)} \right)$
Group-wise Similarity
X. Chen et al., Gene, 509 (2012)
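The numbers on this slide can be checked with a short sketch. It takes the three IC values as given and assumes the similarity is the average of the shared IC relative to each gene's own IC; the function name is illustrative.

```python
def groupwise_sim(ic_g1, ic_g2, ic_shared):
    """Average of the shared IC relative to each gene's own IC."""
    return 0.5 * (ic_shared / ic_g1 + ic_shared / ic_g2)


# IC values from the slide for g1, g2 and their shared terms.
sim = groupwise_sim(ic_g1=10.6609, ic_g2=9.7925, ic_shared=2.7925)
print(round(sim, 4))
```

Rounded to four decimals this gives the 0.2736 shown for sim(g1, g2).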
Semantic Similarity Search
Algorithms for Comparing Large Datasets
Results
Gene Identifier Sets
D1
1 = g11: GO:0003824, GO:0005488
2 = g12: GO:0016787, GO:0042562
3 = g13: GO:0008233, GO:0031406
4 = g14: GO:0005515, GO:0016787
5 = g15: GO:0055100, GO:0070122
D2
1 = g21: GO:0003824, GO:0005488
2 = g22: GO:0016829, GO:0042562
3 = g23: GO:0043168, GO:0008233
4 = g24: GO:0055100, GO:0070012
5 = g25: GO:0004325, GO:0043177
Exhaustive Search
Term        IC      GIDS-D1  GIDS-D2
GO:0070012  4.0000  -        4
GO:0070122  3.5212  5        -
GO:0055100  3.0000  5        4
GO:0004325  2.7734  -        5
GO:0031406  2.2228  3        -
GO:0043177  1.6616  3        5
GO:0008233  1.6305  3,5      3-4
GO:0043168  1.3777  3        3
GO:0042562  1.3472  2,5      2,4
GO:0036094  0.8873  3        5
GO:0043167  0.8624  3        3
GO:0016829  0.6347  -        2,5
GO:0005515  0.5123  4-5      4
GO:0016787  0.4144  2-5      3-4
GO:0005488  0.1898  1-5      1-5
GO:0003824  0.0455  1-5      1-5
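[Figure: vectors IC-D1 and IC-D2 and matrix IC-D12 (gene ids 1-5 of D1 against gene ids 1-5 of D2), shown partially filled with the values 4, 3.52, 3, 7 and 6.52.]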
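The exhaustive search can be sketched as one pass over the term list that accumulates IC-D1, IC-D2 and the shared IC for every gene pair. This is an illustration under assumptions, not the authors' implementation: rows with a single gene id set are assigned to D1 or D2 according to the annotation lists of g11-g15 and g21-g25, and the helper names are made up.

```python
# Rows of the slide's table: (term, IC, gene ids in D1, gene ids in D2).
# Column placement of single entries is inferred from the D1/D2 annotations.
TABLE = [
    ("GO:0070012", 4.0000, "", "4"),
    ("GO:0070122", 3.5212, "5", ""),
    ("GO:0055100", 3.0000, "5", "4"),
    ("GO:0004325", 2.7734, "", "5"),
    ("GO:0031406", 2.2228, "3", ""),
    ("GO:0043177", 1.6616, "3", "5"),
    ("GO:0008233", 1.6305, "3,5", "3-4"),
    ("GO:0043168", 1.3777, "3", "3"),
    ("GO:0042562", 1.3472, "2,5", "2,4"),
    ("GO:0036094", 0.8873, "3", "5"),
    ("GO:0043167", 0.8624, "3", "3"),
    ("GO:0016829", 0.6347, "", "2,5"),
    ("GO:0005515", 0.5123, "4-5", "4"),
    ("GO:0016787", 0.4144, "2-5", "3-4"),
    ("GO:0005488", 0.1898, "1-5", "1-5"),
    ("GO:0003824", 0.0455, "1-5", "1-5"),
]


def ids(spec):
    """Expand a gene id set such as '2-5' or '3,5' into a list of ids."""
    out = []
    for part in filter(None, spec.split(",")):
        lo, _, hi = part.partition("-")
        out.extend(range(int(lo), int(hi or lo) + 1))
    return out


def exhaustive(table, n=5):
    """Accumulate per-gene IC and pairwise shared IC, then score all pairs."""
    ic1 = [0.0] * (n + 1)  # index 0 unused so gene ids index directly
    ic2 = [0.0] * (n + 1)
    shared = [[0.0] * (n + 1) for _ in range(n + 1)]
    for _term, ic, spec1, spec2 in table:
        genes1, genes2 = ids(spec1), ids(spec2)
        for a in genes1:
            ic1[a] += ic
        for b in genes2:
            ic2[b] += ic
        for a in genes1:
            for b in genes2:
                shared[a][b] += ic
    sim = {}
    for a in range(1, n + 1):
        for b in range(1, n + 1):
            s = shared[a][b]
            sim[(a, b)] = 0.5 * (s / ic1[a] + s / ic2[b])
    return ic1, ic2, shared, sim


ic1, ic2, shared, sim = exhaustive(TABLE)
print([round(x, 2) for x in ic1[1:]])
print([round(x, 2) for x in ic2[1:]])
```

The cost is dominated by the nested loop over both gene id sets of each term, which is why general terms such as GO:0005488 (all genes in both datasets) are the expensive ones.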
Similarity-based Ranking
sim(g1,g2) = 1
sim(g3,g4) = 0.82
$simrank(g_1, g_2) = IC(g_1 \cap g_2) \cdot sim(g_1, g_2)$
simrank(g1,g2) = 0.2353
simrank(g3,g4) = 14.0304
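As a worked check, assuming simrank is the product of the shared information content and the similarity, the slide's values imply the shared IC of each pair. The result for g3 and g4 is approximate because sim(g3,g4) is only given to two decimals.

```latex
\begin{align*}
simrank(g_1, g_2) &= IC(g_1 \cap g_2) \cdot sim(g_1, g_2) = IC(g_1 \cap g_2) \cdot 1 = 0.2353 \\
IC(g_3 \cap g_4) &= \frac{simrank(g_3, g_4)}{sim(g_3, g_4)} = \frac{14.0304}{0.82} \approx 17.11
\end{align*}
```

The identical pair g1, g2 shares only very general terms and therefore ranks far below g3, g4, although its similarity is higher.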
Top-k Search
Term        IC      GIDS-D1  GIDS-D2
GO:0070012  4.0000  -        4
GO:0070122  3.5212  5        -
GO:0055100  3.0000  5        4
GO:0004325  2.7734  -        5
GO:0031406  2.2228  3        -
GO:0043177  1.6616  3        5
GO:0008233  1.6305  3,5      3-4
GO:0043168  1.3777  3        3
GO:0042562  1.3472  2,5      2,4
GO:0036094  0.8873  3        5
GO:0043167  0.8624  3        3
GO:0016829  0.6347  -        2,5
GO:0005515  0.5123  4-5      4
GO:0016787  0.4144  2-5      3-4
GO:0005488  0.1898  1-5      1-5
GO:0003824  0.0455  1-5      1-5
Top-5 pairs (gene id in D1, gene id in D2) with simrank after each step:
Step 1: (5,4) 4.68; (5,3) 0.82; (5,2) 0.68; (5,1) 0.12; (5,5) 0.01
Step 2: (5,4) 4.68; (3,3) 3.36; (3,5) 1.04; (5,3) 0.82; (5,2) 0.68
Step 3: (5,4) 4.68; (3,3) 3.36; (2,2) 1.19; (2,4) 1.18; (3,5) 1.04

Gene id  1     2     3     4     5
IC-D1    0.24  2     9.29  1.16  10.7
IC-D2    0.24  2.22  4.52  11.1  6.19
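A top-k search over the example data might look like the sketch below. It is a guess at the general idea, not the authors' algorithm: genes of D1 are scanned by decreasing IC, the k best pairs are kept in a min-heap, and the scan stops once a gene's IC cannot beat the current k-th best. That stopping rule relies only on simrank(a, b) ≤ IC(a) and is my own simple bound. Single-column table entries are assigned to D1 or D2 according to the annotation lists, and all helper names are illustrative.

```python
import heapq

# (term, IC, gene ids in D1, gene ids in D2); column placement is inferred.
TABLE = [
    ("GO:0070012", 4.0000, "", "4"), ("GO:0070122", 3.5212, "5", ""),
    ("GO:0055100", 3.0000, "5", "4"), ("GO:0004325", 2.7734, "", "5"),
    ("GO:0031406", 2.2228, "3", ""), ("GO:0043177", 1.6616, "3", "5"),
    ("GO:0008233", 1.6305, "3,5", "3-4"), ("GO:0043168", 1.3777, "3", "3"),
    ("GO:0042562", 1.3472, "2,5", "2,4"), ("GO:0036094", 0.8873, "3", "5"),
    ("GO:0043167", 0.8624, "3", "3"), ("GO:0016829", 0.6347, "", "2,5"),
    ("GO:0005515", 0.5123, "4-5", "4"), ("GO:0016787", 0.4144, "2-5", "3-4"),
    ("GO:0005488", 0.1898, "1-5", "1-5"), ("GO:0003824", 0.0455, "1-5", "1-5"),
]


def ids(spec):
    """Expand a gene id set such as '2-5' or '3,5' into a list of ids."""
    out = []
    for part in filter(None, spec.split(",")):
        lo, _, hi = part.partition("-")
        out.extend(range(int(lo), int(hi or lo) + 1))
    return out


def gene_terms(col):
    """Map gene id -> {term: IC} for table column col (2 = D1, 3 = D2)."""
    genes = {}
    for row in TABLE:
        for g in ids(row[col]):
            genes.setdefault(g, {})[row[0]] = row[1]
    return genes


def simrank(t1, t2):
    """Shared IC times the group-wise similarity of two term sets."""
    shared = sum(ic for term, ic in t1.items() if term in t2)
    sim = 0.5 * (shared / sum(t1.values()) + shared / sum(t2.values()))
    return shared * sim


def top_k(d1, d2, k):
    """Return the k best (simrank, (id in D1, id in D2)) entries, best first."""
    heap = []  # min-heap, heap[0] is the current k-th best
    for a in sorted(d1, key=lambda g: -sum(d1[g].values())):
        # simrank(a, b) <= IC(a): no remaining gene can enter a full heap.
        if len(heap) == k and sum(d1[a].values()) <= heap[0][0]:
            break
        for b in d2:
            item = (simrank(d1[a], d2[b]), (a, b))
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
    return sorted(heap, reverse=True)


d1, d2 = gene_terms(2), gene_terms(3)
result = top_k(d1, d2, 5)
for score, pair in result:
    print(pair, round(score, 2))
```

On this data the scan visits the D1 genes in the order 5, 3, 2, 4 and stops before gene 1; the final list equals the Step 3 list on the slide.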
Semantic Similarity Search
Algorithms for Comparing Large Datasets
Results
Results
Runtime – MF (438,406 entries with GO annotations)
UniProt – Swiss-Prot (Rel. 2014_02)
Method      Runtime
Baseline    > 2 days
Exhaustive  ~ 45 min.
Top 10,000  2.5 – 4.5 min.
Top 1,000   1 – 3.5 min.
Top 100     15 sec. – 2.5 min.
Results (cont.)
• Compare Top 10,000 matches against results from ‘BLASTing’ Swiss-Prot against itself (e = 10^-4).
How does it compare to sequence similarity search?
[Bar chart: number of similar pairs in the Top 10,000 that are not included in the BLAST results (y-axis from 0 to 8,000), shown for MF-ALL and MF-CUR with the CORPUS and STRUCTURE measures.]
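The quantity plotted in the chart can be sketched as a set difference between the top-ranked pairs and the BLAST hit pairs. Treating pairs as unordered is an assumption, and the accessions and function name below are hypothetical.

```python
def pairs_missing_from_blast(top_pairs, blast_hits):
    """Count top-ranked pairs that have no BLAST hit in either direction."""
    blast = {frozenset(hit) for hit in blast_hits}
    return sum(1 for pair in top_pairs if frozenset(pair) not in blast)


# Hypothetical accessions, only to show the call.
top_pairs = [("P1", "P2"), ("P3", "P4"), ("P5", "P6")]
blast_hits = [("P2", "P1"), ("P7", "P8")]
missing = pairs_missing_from_blast(top_pairs, blast_hits)
print(missing)
```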