similarity search in large sets of genes using semantic similarity of gene ontology annotations ...

Similarity Search in Large Datasets using Gene Ontology COMPUTATIONAL INFORMATICS Heiko Müller, David Rozado, Mat Cook, Ashfaqur Rahman

Upload: australian-bioinformatics-network

Post on 10-May-2015


DESCRIPTION

Over the past years, more than 30 different semantic similarity measures for GO annotations have been proposed. In the first part of the presentation I will give an overview of the strengths and weaknesses of these methods. In the second part I will talk about our efforts to develop algorithms that allow time-efficient semantic similarity search in large sets of genes (e.g. UniProt).

TRANSCRIPT

Page 1: Similarity search in large sets of genes using semantic similarity of gene ontology annotations   heiko muller

Similarity Search in Large Datasets using Gene Ontology

COMPUTATIONAL INFORMATICS

Heiko Müller, David Rozado, Mat Cook, Ashfaqur Rahman

Page 2:

Gene01: ACGGTAGGCTAGACTAGATATTAACG

Gene02: CCTGAGTACCTGGACTAGATAC

Gene03: GATGCGGTTACGTACGATCCATGGA

Gene04: CATTTATTATATATACGCGCGCGA

Gene05: TTTCGATAGGGGATATATTAACGCCG

Gene06: GTAGGTAGGTGGAGGCCCGCAGACGC

Gene07: GATAGACTCGCGCCGATATATAG

Gene08: ATATATTTCCTAGATCGAGAGATAC

Gene09: GATAGGTTAATTAATTTCCTATAT

Gene10: TGGATTGGATAGCGCGATAGATC

Gene11: AAAAGTCGATAAGGCTAGAGCTAG

Gene12: GGATATAGATATATCTAGATATC

Gene13: CGATATAGCCCTCTAGAGATACTTT

Gene14: GATACCCGCGATATATCAT

Gene15: TAGATCCCCGAGATAGAGACT

Gene16: CACCATAGAAGACTGATCGAGATAG

Gene01: GGCTAGACTAGATATTAACGACGGTA

Gene02: AGTACCTGGACTAGCCTGTAC

Gene03: GATGCGGTTACGCCATTACGAT

Gene04: GATATATATATATACGCGCGCGA

Gene05: CATTTATGGGATATATTAACGCCG

Gene06: GTAGGTAGGTGGAGGCCCGCAGACGC

Gene07: GATAGACTCGCGCCGATATATAG

Gene08: TCCTAGATCAGATCGAGAGATAC

Gene09: GATAGGTTAATTAATTTCCTATAT

Gene10: GCGATCCTATGGATAGCAGATC

Gene11: AAAAGTCGATAAGGCTAGAGCTAG

Gene12: GGATATAGATATATCTAGATATC

Gene13: CGATATAGCCAGAAGTCGAACTTT

Gene14: GATACCCGCGCTCTATATATCAT

Gene15: TAGATCCCCGAGATAGAGACT

Gene16: CACCATAGAAGACTGATCGAGATAG

N. perurans N. pemaquidensis

Compare sets of genes and gene products to discover:

1. Similarities between them.

2. The most dissimilar genes in each dataset.

Page 3:

Semantic Similarity Search

Algorithms for Comparing Large Datasets

Results

Semantic Similarity Search in Large Datasets | Heiko Müller

Page 4:

Gene Ontology (GO)

Example from Molecular Function ontology

Page 5:

GO Annotations

GOA(g1) = {GO:0055100, GO:0070122}

True Path Rule: “[...] the pathway from a child term all the way up to its top-level parent(s) must always be true”.
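
The true path rule implies that a gene annotated with a term is implicitly annotated with every ancestor of that term. A minimal Python sketch of this expansion follows. The `PARENTS` map and its term names are hypothetical placeholders invented for illustration; they are not the real GO graph.

```python
# Hypothetical toy hierarchy: child term -> set of parent terms.
# The term names are placeholders, not real GO identifiers.
PARENTS = {
    "T:root": set(),
    "T:binding": {"T:root"},
    "T:catalytic": {"T:root"},
    "T:ion_binding": {"T:binding"},
    "T:hydrolase": {"T:catalytic"},
    "T:peptidase": {"T:hydrolase"},
}


def ancestors(term, parents=PARENTS):
    """Return the term together with every term on a path to the root."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def expand_annotations(terms, parents=PARENTS):
    """Apply the true path rule: add all ancestors of each annotated term."""
    expanded = set()
    for term in terms:
        expanded |= ancestors(term, parents)
    return expanded


if __name__ == "__main__":
    print(sorted(expand_annotations({"T:peptidase", "T:ion_binding"})))
```

Comparing the expanded sets rather than the direct annotations lets two genes share general terms even when their most specific terms differ.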

Page 6:

Semantic Similarity

GOA(g1) = {GO:0055100, GO:0070122}

GOA(g2) = {GO:0030332, GO:0070012}

• Annotations provide an objective representation to compare genes on functional aspects.

• A semantic similarity measure quantifies relationships between (sets of) GO terms.

sim(g1, g2) = ?

Page 7:

Term Specificity

Quantify semantics or information content (ic) of GO terms.

[Figure labels: less similar, more similar]

Corpus-based: $ic(t) = -\log(P(t))$

Structure-based: $ic(t) = \mathrm{depth}(t) \cdot \left(1 - \frac{\log(\mathrm{desc}(t) + 1)}{\log(\mathrm{total\_terms})}\right)$
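
The two ic definitions on this slide can be sketched in Python as follows. The sketch reads the corpus-based measure as ic(t) = -log(P(t)) and the structure-based measure as ic(t) = depth(t) * (1 - log(desc(t) + 1) / log(total_terms)). The base-2 logarithm and the estimate of P(t) as the fraction of annotated genes carrying term t are assumptions; the slides do not state either.

```python
import math


def corpus_ic(term_count, total_genes):
    """Corpus-based ic(t) = -log2(P(t)).

    Assumption: P(t) is estimated as the fraction of annotated genes
    that carry term t after ancestor expansion.
    """
    return -math.log2(term_count / total_genes)


def structure_ic(depth, n_descendants, total_terms):
    """Structure-based ic(t) = depth(t) * (1 - log(desc(t) + 1) / log(total_terms)).

    The ratio of logarithms does not depend on the base.
    """
    return depth * (1.0 - math.log(n_descendants + 1) / math.log(total_terms))


if __name__ == "__main__":
    # A term carried by 1 of 16 genes is more informative than one carried by 8 of 16.
    print(corpus_ic(1, 16), corpus_ic(8, 16))
    # A deep leaf term (no descendants) scores its full depth.
    print(structure_ic(3, 0, 100))
```

Under both definitions, rare or deep terms receive high ic and terms near the root receive ic close to zero.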

Page 8:

Group-wise Semantic Similarity

GOA(g1) = {GO:0055100, GO:0070122}

GOA(g2) = {GO:0030332, GO:0070012}

Page 9:

Group-wise Similarity

$$sim(g_1, g_2) = \frac{1}{2}\left(\frac{IC(g_1 \cap g_2)}{IC(g_1)} + \frac{IC(g_1 \cap g_2)}{IC(g_2)}\right)$$

X. Chen et al., Gene, 509 (2012)

IC(g1) = 10.6609

IC(g2) = 9.7925

IC(g1 ∩ g2) = 2.7925

sim(g1, g2) = 0.2736
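
The slide's numbers can be checked with a short Python sketch of the group-wise measure, read here as sim(g1, g2) = 1/2 * (IC(g1 ∩ g2) / IC(g1) + IC(g1 ∩ g2) / IC(g2)). The helper `set_ic`, which sums term ic values over an annotation set, is an assumption about how IC(g) is obtained; the function names are illustrative.

```python
def set_ic(terms, ic):
    """Assumed definition: IC of a term set is the sum of its terms' ic values."""
    return sum(ic[t] for t in terms)


def groupwise_sim(ic_g1, ic_g2, ic_shared):
    """Average of the shared IC relative to each gene's own IC."""
    return 0.5 * (ic_shared / ic_g1 + ic_shared / ic_g2)


if __name__ == "__main__":
    # Values from the slide: IC(g1), IC(g2), IC(g1 ∩ g2).
    print(round(groupwise_sim(10.6609, 9.7925, 2.7925), 4))  # 0.2736
```

The result is 1 when both genes carry exactly the same terms and approaches 0 when they share only uninformative terms.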

Page 10:

Semantic Similarity Search

Algorithms for Comparing Large Datasets

Results


Page 11:

Gene Identifier Sets

D1
1 = g11: GO:0003824, GO:0005488
2 = g12: GO:0016787, GO:0042562
3 = g13: GO:0008233, GO:0031406
4 = g14: GO:0005515, GO:0016787
5 = g15: GO:0055100, GO:0070122

D2
1 = g21: GO:0003824, GO:0005488
2 = g22: GO:0016829, GO:0042562
3 = g23: GO:0043168, GO:0008233
4 = g24: GO:0055100, GO:0070012
5 = g25: GO:0004325, GO:0043177

[Figure: GO term graph with each term labelled by its gene identifier sets in D1 and D2, e.g. 1-5, 2-5, 3-4, 3,5.]
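
A gene identifier set (GIDS) maps each GO term to the identifiers of the genes annotated with it, i.e. an inverted index. The Python sketch below builds such an index for D1 and renders sets in the compact range notation used on the slides. It indexes only the direct annotations listed above; propagating genes to ancestor terms would require the GO hierarchy, which is not reproduced here. The function names are illustrative.

```python
from collections import defaultdict

# Direct annotations of dataset D1, as listed on the slide.
D1 = {
    1: ["GO:0003824", "GO:0005488"],
    2: ["GO:0016787", "GO:0042562"],
    3: ["GO:0008233", "GO:0031406"],
    4: ["GO:0005515", "GO:0016787"],
    5: ["GO:0055100", "GO:0070122"],
}


def build_gids(dataset):
    """Invert gene -> terms into term -> set of gene identifiers."""
    index = defaultdict(set)
    for gene_id, terms in dataset.items():
        for term in terms:
            index[term].add(gene_id)
    return dict(index)


def format_gids(gene_ids):
    """Render a gene identifier set compactly, e.g. {1, 2, 3, 5} -> '1-3,5'."""
    ids = sorted(gene_ids)
    if not ids:
        return ""
    parts = []
    start = prev = ids[0]
    for gid in ids[1:]:
        if gid == prev + 1:
            prev = gid
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = gid
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


if __name__ == "__main__":
    for term, gids in sorted(build_gids(D1).items()):
        print(term, format_gids(gids))
```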

Page 12:

Exhaustive Search

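Term | IC | GIDS-D1 | GIDS-D2
GO:0070012 | 4.0000 | (none) | 4
GO:0070122 | 3.5212 | 5 | (none)
GO:0055100 | 3.0000 | 5 | 4
GO:0004325 | 2.7734 | (none) | 5
GO:0031406 | 2.2228 | 3 | (none)
GO:0043177 | 1.6616 | 3 | 5
GO:0008233 | 1.6305 | 3,5 | 3-4
GO:0043168 | 1.3777 | 3 | 3
GO:0042562 | 1.3472 | 2,5 | 2,4
GO:0036094 | 0.8873 | 3 | 5
GO:0043167 | 0.8624 | 3 | 3
GO:0016829 | 0.6347 | (none) | 2,5
GO:0005515 | 0.5123 | 4-5 | 4
GO:0016787 | 0.4144 | 2-5 | 3-4
GO:0005488 | 0.1898 | 1-5 | 1-5
GO:0003824 | 0.0455 | 1-5 | 1-5

[Figure: arrays IC-D1 and IC-D2 (indexed by gene 1 to 5) and the 5 × 5 matrix IC-D12, with the intermediate values 4, 3.52, 3, 7 and 6.52.]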
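
The exhaustive search can be sketched in Python as a single pass over the term table that accumulates each term's IC into per-gene totals (IC-D1, IC-D2) and into a pairwise total of shared IC (IC-D12), followed by the group-wise similarity for every pair. This is an illustrative reading of the slide, not the authors' implementation. Rows that list only one gene set were assigned to D1 or D2 by matching them against the gene lists of the previous slide, which is an assumption.

```python
from itertools import product

# Rows of the slide's table: (term, ic, GIDS-D1, GIDS-D2).
# Assumption: single-set rows were assigned to D1 or D2 by matching
# them against the gene lists of the previous slide.
ALL = {1, 2, 3, 4, 5}
TERMS = [
    ("GO:0070012", 4.0000, set(), {4}),
    ("GO:0070122", 3.5212, {5}, set()),
    ("GO:0055100", 3.0000, {5}, {4}),
    ("GO:0004325", 2.7734, set(), {5}),
    ("GO:0031406", 2.2228, {3}, set()),
    ("GO:0043177", 1.6616, {3}, {5}),
    ("GO:0008233", 1.6305, {3, 5}, {3, 4}),
    ("GO:0043168", 1.3777, {3}, {3}),
    ("GO:0042562", 1.3472, {2, 5}, {2, 4}),
    ("GO:0036094", 0.8873, {3}, {5}),
    ("GO:0043167", 0.8624, {3}, {3}),
    ("GO:0016829", 0.6347, set(), {2, 5}),
    ("GO:0005515", 0.5123, {4, 5}, {4}),
    ("GO:0016787", 0.4144, {2, 3, 4, 5}, {3, 4}),
    ("GO:0005488", 0.1898, ALL, ALL),
    ("GO:0003824", 0.0455, ALL, ALL),
]


def exhaustive(terms):
    """Accumulate IC-D1, IC-D2 and IC-D12, then score every gene pair."""
    ic_d1, ic_d2, ic_d12 = {}, {}, {}
    for _term, ic, gids1, gids2 in terms:
        for g in gids1:
            ic_d1[g] = ic_d1.get(g, 0.0) + ic
        for g in gids2:
            ic_d2[g] = ic_d2.get(g, 0.0) + ic
        # Every (D1 gene, D2 gene) pair that shares this term gains its IC.
        for pair in product(gids1, gids2):
            ic_d12[pair] = ic_d12.get(pair, 0.0) + ic
    sim = {
        (a, b): 0.5 * (shared / ic_d1[a] + shared / ic_d2[b])
        for (a, b), shared in ic_d12.items()
    }
    return ic_d1, ic_d2, ic_d12, sim


if __name__ == "__main__":
    ic_d1, ic_d2, ic_d12, sim = exhaustive(TERMS)
    print({g: round(v, 2) for g, v in sorted(ic_d1.items())})
    print({g: round(v, 2) for g, v in sorted(ic_d2.items())})
```

The inner loop touches every pair in the product of the two gene sets of a term, so general terms with large gene sets dominate the cost on large datasets.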

Page 13:

Similarity-based Ranking

$simrank(g_1, g_2) = IC(g_1 \cap g_2) \cdot sim(g_1, g_2)$

sim(g1, g2) = 1, simrank(g1, g2) = 0.2353

sim(g3, g4) = 0.82, simrank(g3, g4) = 14.0304
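
The ranking score weights the similarity by the shared information content, simrank(g1, g2) = IC(g1 ∩ g2) * sim(g1, g2), so that pairs sharing only general terms rank low even when their similarity is 1. A minimal Python sketch follows. The function names are illustrative, and the second example uses IC values computed by summing the term table of the Exhaustive Search slide.

```python
def groupwise_sim(ic_g1, ic_g2, ic_shared):
    """Group-wise similarity: mean of shared IC relative to each gene's IC."""
    return 0.5 * (ic_shared / ic_g1 + ic_shared / ic_g2)


def simrank(ic_g1, ic_g2, ic_shared):
    """Similarity weighted by the amount of shared information content."""
    return ic_shared * groupwise_sim(ic_g1, ic_g2, ic_shared)


if __name__ == "__main__":
    # Identical but uninformative annotations: sim = 1, simrank = 0.2353.
    print(round(simrank(0.2353, 0.2353, 0.2353), 4))
    # D1 gene 5 vs. D2 gene 4 from the term table: simrank is about 4.68.
    print(round(simrank(10.6609, 11.1397, 7.1397), 2))
```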

Page 14:

Top-k Search

[Table: Term, IC, GIDS-D1, GIDS-D2 for the 16 terms GO:0070012 (IC 4.0000) to GO:0003824 (IC 0.0455), identical to the table on the Exhaustive Search slide.]

Top-5 list after each step (pair = D1 gene, D2 gene; score = simrank):

Step 1 | Step 2 | Step 3
5,4 4.68 | 5,4 4.68 | 5,4 4.68
5,3 0.82 | 3,3 3.36 | 3,3 3.36
5,2 0.68 | 3,5 1.04 | 2,2 1.19
5,1 0.12 | 5,3 0.82 | 2,4 1.18
5,5 0.01 | 5,2 0.68 | 3,5 1.04

Gene | 1 | 2 | 3 | 4 | 5
IC-D1 | 0.24 | 2 | 9.29 | 1.16 | 10.7
IC-D2 | 0.24 | 2.22 | 4.52 | 11.1 | 6.19
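
The step trace is consistent with processing D1 genes in descending order of IC (genes 5, 3, 2) while maintaining a top-5 list. One plausible way to implement such a search is sketched below in Python. The pruning bound is an assumption, not taken from the slides: because the shared IC of a pair cannot exceed min(IC(g1), IC(g2)), evaluating simrank at that value gives an upper bound, and pairs whose bound cannot beat the current k-th best score are skipped. The column assignment of single-set rows in `TERMS` is likewise inferred from the gene lists.

```python
import heapq

# (term, ic, GIDS-D1, GIDS-D2), copied from the term table.
# Assumption: single-set rows were assigned to D1 or D2 via the gene lists.
ALL = {1, 2, 3, 4, 5}
TERMS = [
    ("GO:0070012", 4.0000, set(), {4}),
    ("GO:0070122", 3.5212, {5}, set()),
    ("GO:0055100", 3.0000, {5}, {4}),
    ("GO:0004325", 2.7734, set(), {5}),
    ("GO:0031406", 2.2228, {3}, set()),
    ("GO:0043177", 1.6616, {3}, {5}),
    ("GO:0008233", 1.6305, {3, 5}, {3, 4}),
    ("GO:0043168", 1.3777, {3}, {3}),
    ("GO:0042562", 1.3472, {2, 5}, {2, 4}),
    ("GO:0036094", 0.8873, {3}, {5}),
    ("GO:0043167", 0.8624, {3}, {3}),
    ("GO:0016829", 0.6347, set(), {2, 5}),
    ("GO:0005515", 0.5123, {4, 5}, {4}),
    ("GO:0016787", 0.4144, {2, 3, 4, 5}, {3, 4}),
    ("GO:0005488", 0.1898, ALL, ALL),
    ("GO:0003824", 0.0455, ALL, ALL),
]


def gene_terms(terms, column):
    """Invert the table: gene id -> {term: ic} for dataset column 0 or 1."""
    genes = {}
    for name, ic, *gids in terms:
        for g in gids[column]:
            genes.setdefault(g, {})[name] = ic
    return genes


def simrank(ic1, ic2, shared):
    """Shared IC times the group-wise similarity."""
    return shared * 0.5 * (shared / ic1 + shared / ic2)


def top_k(terms, k):
    """Return the k best (score, (g1, g2)) pairs, highest score first."""
    d1, d2 = gene_terms(terms, 0), gene_terms(terms, 1)
    ic1 = {g: sum(t.values()) for g, t in d1.items()}
    ic2 = {g: sum(t.values()) for g, t in d2.items()}
    heap = []  # min-heap, so heap[0] is the current k-th best score
    for g1 in sorted(d1, key=ic1.get, reverse=True):
        for g2 in d2:
            # Assumed bound: shared IC is at most min(IC(g1), IC(g2)).
            bound = simrank(ic1[g1], ic2[g2], min(ic1[g1], ic2[g2]))
            if len(heap) == k and bound <= heap[0][0]:
                continue  # this pair cannot enter the top-k list
            shared = sum(ic for t, ic in d1[g1].items() if t in d2[g2])
            score = simrank(ic1[g1], ic2[g2], shared)
            if len(heap) < k:
                heapq.heappush(heap, (score, (g1, g2)))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, (g1, g2)))
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    for score, pair in top_k(TERMS, 5):
        print(pair, round(score, 2))
```

On this toy data the sketch returns the same five pairs and scores as the Step 3 list. How much the bound prunes on real data depends on the distribution of gene IC values.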

Page 15:

Semantic Similarity Search

Algorithms for Comparing Large Datasets

Results


Page 16:

Results

UniProt – Swiss-Prot (Rel. 2014_02)

Runtime – MF (438,406 entries with GO annotations)

Baseline | Exhaustive | Top 10,000 | Top 1,000 | Top 100
> 2 days | ~45 min. | 2.5 – 4.5 min. | 1 – 3.5 min. | 15 sec. – 2.5 min.


Page 17:

Results (cont.)

How does it compare to sequence similarity search?

• Compare Top 10,000 matches against results from ‘BLASTing’ Swiss-Prot against itself (e = 10^-4).

[Bar chart: number of similar pairs in the Top 10,000 that are not included in the BLAST results (y-axis, 0 to 8,000), shown for MF-ALL and MF-CUR; legend: CORPUS, STRUCTURE.]

Page 18:

Heiko Müller

e [email protected] t +61 3 6232 5575

COMPUTATIONAL INFORMATICS

Thank you