Sarah Cohen-Boulakia, Laboratoire de Recherche en Informatique, CNRS UMR 8623, Université Paris-Sud, Université Paris-Saclay

Post on 19-Oct-2021

Page 2: Sarah Cohen-Boulakia

Understanding Life Sciences

Progress in multiple domains: biology, chemistry, maths, computer science…

Emergence of new technologies: next-generation sequencing…

Increasing volumes of raw data

All stored in Web data sources

Raw data are not sufficient
◦ Data annotated by experts
◦ Bioinformatics analysis of data
◦ New data sources

Concrete example: Querying NCBI Entrez

http://www.ncbi.nlm.nih.gov/gquery/ (« Gquery NCBI » on Google)

Page 3: Sarah Cohen-Boulakia

The National Center for Biotechnology Information (NCBI) provides access to biomedical and genomic information through its Web portal

33 databases queried

Query: breast cancer

Result: 14 772 genes
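The slides run this search through the Web portal. The same kind of keyword search can also be scripted. Below is a minimal sketch, assuming the public NCBI E-utilities ESearch endpoint; the helper name `build_gene_query` is hypothetical and not part of the talk. It only builds the request URL and sends nothing.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (assumed here as the scripted
# counterpart of the Web portal used in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_query(keyword: str, max_results: int = 20) -> str:
    """Return an ESearch URL that looks up `keyword` in the Gene database."""
    params = {
        "db": "gene",          # database to search
        "term": keyword,       # free-text keyword, e.g. a disease name
        "retmode": "json",     # ask for a JSON reply
        "retmax": max_results, # number of identifiers to return
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_gene_query("breast cancer")
print(url)
```

Fetching this URL should return a JSON document whose `esearchresult` entry carries the hit count and the list of gene identifiers. The count obtained today will probably differ from the figure on the slide.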

Page 4: Sarah Cohen-Boulakia

• Content of a biological database = a set of pages (a book!)

• Each database is focused on one biological entity (here, Gene)

• Each page describes one instance (i.e., one gene)

Page 5: Sarah Cohen-Boulakia

Various quality features
◦ Last update (freshness)
◦ Status of the sequence (reliability)
◦ « Volume of information » (completeness)

+ Reputation of the data source, number of incoming links (paths) to reach the data…

Page 6: Sarah Cohen-Boulakia

… to determine the importance of results (obtained in several ways) and get complete sets of answers

Query: mammalian carcinoma

Result: 771 genes, not all included in the previous result!

Page 7: Sarah Cohen-Boulakia

NCBI ranks answers for each keyword by relevance
◦ Number of occurrences of the keyword (breast cancer) in the files

Other approaches have been tested
◦ Combining various criteria (weighted functions): freshness first, then reputation of the source, then…
◦ Following PageRank- or ObjectRank-like solutions (Biozon, Biorank…)

How to deal with several sets of answers?
◦ How to aggregate several rankings?
◦ Answers obtained by several keywords should have higher priority…

Wanted: a ranking solution exploiting many criteria
◦ Problem: how to combine all such criteria?

Alternative: consensus rankings?

Page 8: Sarah Cohen-Boulakia

Need for consensus rankings to order biological data

Definition of the problem

Experimental Study of existing approaches

ConQuR-Bio

Conclusion

Page 9: Sarah Cohen-Boulakia

Input: a set of rankings (A to D are four genes)
◦ Ranking 1 = [A, D, C, B]
◦ Ranking 2 = [B, A, D, C]
◦ Ranking 3 = [D, A, B, C]

Output: one consensus ranking
◦ Consensus = [A, D, B, C]

A consensus ranking makes the most of several rankings, each obtained by a given ranking method
◦ Puts emphasis on their common points
◦ Does not give too much importance to data classified “good” by only one or a few ranking methods

The optimal consensus is a ranking that is closest to the input rankings: this requires a distance

Page 12: Sarah Cohen-Boulakia

The Kendall-τ distance D(π,σ) counts the number of pairs of elements that are inverted (i.e., in the opposite order) between two rankings

Kemeny Score

Optimal Consensus (median)
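In the usual notation these three notions can be written as follows. The symbols are assumptions for this sketch: π[i] is the position of element i in ranking π, and 𝓡 is the set of input rankings.

```latex
D(\pi,\sigma) = \bigl|\{\, (i,j) : i < j,\ 
  (\pi[i] < \pi[j] \wedge \sigma[i] > \sigma[j]) \vee
  (\pi[i] > \pi[j] \wedge \sigma[i] < \sigma[j]) \,\}\bigr|

K(\pi,\mathcal{R}) = \sum_{\sigma \in \mathcal{R}} D(\pi,\sigma)

\pi^{*} = \arg\min_{\pi} K(\pi,\mathcal{R})
```

The Kemeny score K sums the distances from a candidate to every input ranking, and an optimal consensus (median) π* is any permutation that minimizes it.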

Complexity [Dwork et al. 2001, Biedl et al. 2009]
◦ NP-hard for an even number of permutations ≥ 4

Approximation and heuristics
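For a handful of elements the optimal consensus can still be found by exhaustive search. The sketch below is only an illustration on the four-gene example (A to D) given earlier. The function names are hypothetical, and the brute-force search over all n! permutations is not one of the algorithms compared in the talk.

```python
from itertools import combinations, permutations


def kendall_tau(pi, sigma):
    """Number of element pairs ordered differently in permutations pi and sigma."""
    pos_pi = {e: i for i, e in enumerate(pi)}
    pos_sigma = {e: i for i, e in enumerate(sigma)}
    return sum(
        1
        for a, b in combinations(pi, 2)
        if (pos_pi[a] - pos_pi[b]) * (pos_sigma[a] - pos_sigma[b]) < 0
    )


def kemeny_score(candidate, rankings):
    """Sum of the Kendall-tau distances from candidate to every input ranking."""
    return sum(kendall_tau(candidate, r) for r in rankings)


def optimal_consensus(rankings):
    """Exhaustive search over all n! permutations; only usable for tiny n."""
    elements = sorted(rankings[0])
    return min(permutations(elements), key=lambda c: kemeny_score(c, rankings))


rankings = [list("ADCB"), list("BADC"), list("DABC")]
best = optimal_consensus(rankings)
print(best, kemeny_score(best, rankings))  # ('A', 'D', 'B', 'C') 4
```

The search recovers the consensus [A, D, B, C] with a Kemeny score of 4. The factorial cost of this approach is the reason approximations and heuristics are needed for realistic inputs.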

Page 13: Sarah Cohen-Boulakia

Real data sets
◦ Have equalities: several items tied (ex aequo)
◦ May be incomplete: rankings are not all on the same sets of elements

Unification process
◦ One unifying bucket at the end of each ranking, with the missing elements
◦ Elements in the unifying bucket are equally unimportant

R1 = [A, {B,C}, D, {E}]
R2 = [{A,D}, E, {B,C}]
R3 = [{C,E}, A, B, {D}]

Generalized Kendall-τ G(r,s) counts the number of pairs of elements
◦ inverted between the two rankings r and s
◦ tied in only one of the two rankings

Complexity: finding an optimal consensus with ties is at least as difficult as in the permutation case [VLDB15]

Optimal consensus implemented [VLDB15]
◦ Provides exact solutions up to n = 40 elements

Approximation and heuristics
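The unification process and the generalized distance can be sketched as follows. Rankings are modelled as lists of buckets (Python sets), and every disagreement costs 1. This sketch assumes that the last bucket of R1 and R2 in the example above is the unifying bucket, so the incomplete inputs used here are reconstructed rather than taken from the slides. The function names are hypothetical.

```python
from itertools import combinations


def unify(ranking, universe):
    """Append one final bucket holding the elements missing from `ranking`."""
    present = set().union(*ranking)
    missing = set(universe) - present
    return ranking + [missing] if missing else list(ranking)


def generalized_kendall_tau(r, s):
    """Count pairs inverted between r and s, or tied in exactly one of them.

    Both rankings are lists of buckets (sets) over the same elements.
    """
    pos_r = {e: i for i, bucket in enumerate(r) for e in bucket}
    pos_s = {e: i for i, bucket in enumerate(s) for e in bucket}
    cost = 0
    for a, b in combinations(sorted(pos_r), 2):
        dr = pos_r[a] - pos_r[b]
        ds = pos_s[a] - pos_s[b]
        inverted = dr * ds < 0                # strictly opposite orders
        tied_in_one = (dr == 0) != (ds == 0)  # tied in r or in s, not both
        if inverted or tied_in_one:
            cost += 1
    return cost


universe = "ABCDE"
r1 = unify([{"A"}, {"B", "C"}, {"D"}], universe)  # E is missing
r2 = unify([{"A", "D"}, {"E"}], universe)         # B and C are missing
print(r1)
print(generalized_kendall_tau(r1, r2))  # 5
```

Between R1 and R2, four pairs are inverted (B-D, B-E, C-D, C-E) and one pair (A-D) is tied only in R2, which gives a distance of 5.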

Page 14: Sarah Cohen-Boulakia

MEDRank [Fagin et al. 2003]

CopelandMethod [Copeland et al. 1951]

BordaCount [Borda 1781]

MC4 [Dwork et al. 2001]

Positional approaches

B&B [Ali et al. 2012]

ChanasBoth [Coleman et al. 2009]

Chanas [Chanas et al. 1996]

PNE (exact) [Conitzer, et al. 2006]

KwikSort [Ailon et al. 2008]

RepeatChoice [Ailon et al. 2010]

permutations

Kendall-τ distance

Pick-A-Perm [Ailon et al. 2008]

Ailon3/2 [Fagin et al. 2004]

BioConsert [Cohen-Boulakia et al. 2011]

FaginDyn [Fagin et al. 2004]

Generalized Kendall-τ distance
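As an illustration of a positional approach, here is a minimal sketch of BordaCount in its textbook form for permutations. It is not the tie-adapted version evaluated in [VLDB15], and the alphabetical tie-break is an assumption of this sketch.

```python
def borda_count(rankings):
    """Order elements by the sum of their positions over all input permutations."""
    total = {}
    for ranking in rankings:
        for position, element in enumerate(ranking):
            total[element] = total.get(element, 0) + position
    # A smaller total position means the element is ranked earlier;
    # equal totals are broken alphabetically (an arbitrary choice).
    return sorted(total, key=lambda e: (total[e], e))


rankings = [list("ADCB"), list("BADC"), list("DABC")]
print(borda_count(rankings))  # ['A', 'D', 'B', 'C']
```

On the four-gene example the position sums are A=2, D=3, B=5 and C=8, so this simple method happens to return the same order as the optimal consensus. That agreement is not guaranteed in general.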

Page 15: Sarah Cohen-Boulakia

Compatibility with ties

Ability to consider the cost of (un)tying

[VLDB15]

Page 16: Sarah Cohen-Boulakia

Optimal solution can be found up to n ≈ 40

Comparison of 12 algorithms
◦ Re-implemented and adapted to ties
◦ Evaluated on quality (gap with respect to the optimal consensus c*) and time

Real datasets
◦ Extracted from previous publications
◦ 7 datasets: EachMovie, F1, GiantSlalom, SkiCross/Jumping, WebCommunities, WebSearch, BioMedical

Synthetic datasets
1. Uniformly generated with MuPAD-Combinat
2. With increasing levels of similarity
3. Similarity and unification process

Precise understanding of the impact of the similarity and unification process

[VLDB15]

Page 17: Sarah Cohen-Boulakia

● BioConsert can be used in a very large majority of cases
● For very large data sets (> 30 000 elements), KwikSort can be preferred
● If there is a need to speed up, then
   ● In case of few equalities, use BordaCount
   ● Otherwise use MEDRank
   ● Alternatively: use both algorithms and pick the best

[Figure: algorithms placed by quality (gap, from high quality to low quality) against time]

http://rank-aggregation-with-ties.lri.fr/

[VLDB15]

Page 18: Sarah Cohen-Boulakia

Need for consensus rankings to order biological data

Definition of the problem

Experimental Study of existing approaches

ConQuR-Bio

Conclusion

Page 19: Sarah Cohen-Boulakia

Which genes are associated with a given disease?

Synonyms: breast cancer vs mammalian carcinoma (14 772 vs 771 genes, not all included)

Abbreviations: attention deficit hyperactivity disorders vs ADHD (109 vs 144 genes, 74 common)

Linguistic variations: tumour vs tumor (& breast cancer): 681 vs 291 genes

More precise reformulations: colorectal cancer vs Lynch syndrome (+6 new genes)

Finding all synonyms is time-consuming

Querying with all synonyms provides huge amounts of data, which have to be ranked…

[DILS14]

Page 22: Sarah Cohen-Boulakia

I) Reformulations (synonyms) can be automatically generated
• Input: the user's keyword
• Automatic search of synonyms in major biomedical terminologies
• Output: set of synonyms of the input keyword

II) DB querying is automated (based on keywords)
• Input: set of synonyms obtained in I)
• Querying gene databases with each synonym (#queries = #synonyms)
• Output: set of rankings (ranked genes), one ranking per synonym

III) Aggregating using a series of consensus algorithms with a variant of the Generalized Kendall-τ distance

[DILS14]
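The three steps can be sketched end to end as below. Everything here is a toy stand-in: the terminology, the gene database, the identifiers G1 to G4 and all function names are hypothetical, and step III uses a simple Borda-style sum instead of the series of consensus algorithms mentioned above.

```python
# Toy stand-ins (hypothetical data): a terminology and a gene database.
TERMINOLOGY = {"disease x": ["disease x", "dx", "x syndrome"]}
GENE_DB = {
    "disease x": ["G1", "G2", "G3"],
    "dx": ["G2", "G1", "G4"],
    "x syndrome": ["G1", "G4"],
}


def reformulate(keyword):
    """Step I: return the keyword's synonyms (the keyword itself included)."""
    return TERMINOLOGY.get(keyword, [keyword])


def query(synonym):
    """Step II: return the ranked genes for one synonym."""
    return GENE_DB.get(synonym, [])


def aggregate(rankings):
    """Step III (stand-in): Borda-style consensus; missing genes rank last."""
    universe = sorted({g for r in rankings for g in r})
    total = dict.fromkeys(universe, 0)
    for r in rankings:
        for g in universe:
            # A gene absent from r is placed just after r's last gene,
            # which mimics a unifying bucket at the end of the ranking.
            total[g] += r.index(g) if g in r else len(r)
    return sorted(universe, key=lambda g: (total[g], g))


def consensus_for(keyword):
    """Run steps I to III: one query, hence one ranking, per synonym."""
    rankings = [query(s) for s in reformulate(keyword)]
    return aggregate(rankings)


print(consensus_for("disease x"))  # ['G1', 'G2', 'G4', 'G3']
```

G4 ends up ahead of G3 because it is returned by two of the three reformulations, which matches the earlier remark that answers obtained by several keywords should have higher priority.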

Page 23: Sarah Cohen-Boulakia

Highly accessed by members of APHP (Hospitals), Institut Curie, Institut Pasteur…

Page 24: Sarah Cohen-Boulakia

Ranking biological data is a complex task
◦ Many quality criteria, difficult to combine
◦ Consensus ranking approaches

Classification of consensus ranking algorithms
◦ Rankings with ties, incomplete data sets

Study of distances and normalization techniques
◦ Complexity results, new distances considered, new heuristics

Guide in the choice of distances and algorithms

Rank-n-ties platform available

Concrete application to real biological data
◦ Consensus of query reformulations
◦ Currently under evaluation based on real datasets (Leukemia)

Increasing number of reformulations + increasing number of answers: ranking big data sets

Page 25: Sarah Cohen-Boulakia

Bryan Brancotte

Mastodon Project QualibioConsensus: LRI (Paris-Sud), LIGM (Marne-la-Vallée), IFB (Institut Français de Bioinformatique)

Other partners: Univ. Montréal, Institut Curie, APHP Georges Pompidou, APHP Paul Brousse

Page 26: Sarah Cohen-Boulakia

[VLDB15] Rank aggregation with ties: Experiments and Analysis
B. Brancotte, B. Yang, G. Blin, S. Cohen-Boulakia, A. Denise, S. Hamel. Proceedings of the VLDB Endowment 8 (11), 1202-1213

Interrogation de bases de données biologiques publiques par reformulation de requêtes et classement des résultats avec ConQuR-Bio [Querying public biological databases through query reformulation and result ranking with ConQuR-Bio]
B. Brancotte, B. Rance, A. Denise, S. Cohen-Boulakia. JOBIM (Journées Ouvertes Biologie Informatique Mathématiques), 2015

[DILS14] ConQuR-Bio: Consensus ranking with query reformulation for biological data
B. Brancotte, B. Rance, A. Denise, S. Cohen-Boulakia. International Conference on Data Integration in the Life Sciences (DILS), 128-142, 2014

Using medians to generate consensus rankings for biological data
S. Cohen-Boulakia, A. Denise, S. Hamel. International Conference on Scientific and Statistical Database Management (SSDBM), 2011