
Sarah Cohen-Boulakia

Laboratoire de Recherche en Informatique

CNRS UMR 8623

Université Paris-Sud, Université Paris-Saclay


Understanding Life Sciences

Progress in multiple domains: biology, chemistry, maths, computer science…

Emergence of new technologies: Next generation sequencing,…

Increasing volumes of raw data

All stored in Web data sources

Raw data are not sufficient

Data annotated by experts

Bioinformatics analysis of data

New data sources

Concrete example: Querying NCBI Entrez

http://www.ncbi.nlm.nih.gov/gquery/ (or search « Gquery NCBI » on Google)


National Center for Biotechnology Information (NCBI) provides access to biomedical and genomic information through its Web portal

33 databases queried

Breast cancer

14 772 genes
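The same kind of keyword query can also be issued programmatically. The sketch below is a minimal illustration, assuming NCBI's E-utilities `esearch` endpoint rather than the Gquery web page shown in the talk. The helper name `build_gene_search_url` and the default `retmax` value are illustrative choices. The code only builds the request URL and does not send it.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (programmatic counterpart of the web portal).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_search_url(keyword, retmax=20):
    """Build an esearch URL that looks up `keyword` in the Gene database.

    `retmax` (number of identifiers returned) is an illustrative default.
    """
    params = {"db": "gene", "term": keyword, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_gene_search_url("breast cancer")
```

Sending one such request per reformulation of the keyword would give one ranked list of gene identifiers per query.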


• Content of a biological database = a set of pages (a book!)

• Each database is focused on one biological entity (here, Gene)

• Each page describes one instance (i.e., one gene)


Last update (freshness)

Status of the sequence (reliability)

Various quality features

« volume of information » (completeness)

+ Reputation of the data source, number of incoming links (paths) to reach the data…


… to determine the importance of results (obtained by several ways) and get complete sets of answers

Mammalian carcinoma

771 genes, not all included in the previous result!


NCBI ranks answers for each keyword using relevance

◦ Number of occurrences of the keyword (breast cancer) in the files

Other approaches have been tested

◦ Combining various criteria (weighted functions): freshness first, then reputation of the source, then…

◦ Following PageRank or ObjectRank-like solutions (Biozon, Biorank…)

How to deal with several sets of answers?

◦ How to aggregate several rankings?

◦ Answers obtained by several keywords should have higher priority…

Wanted: ranking solution exploiting many criteria

◦ Problem: how to combine all such criteria?

Alternative: Consensus rankings?


Need for consensus rankings to order biological data

Definition of the problem

Experimental Study of existing approaches

ConQuR-Bio

Conclusion


Input: a set of rankings (A to D are four genes)

◦ Ranking 1 = [A, D, C, B]

◦ Ranking 2 = [B, A, D, C]

◦ Ranking 3 = [D, A, B, C]

Output: one consensus ranking

◦ Consensus = [A, D, B, C]

A consensus ranking makes the most of several rankings, each obtained by a given ranking method

◦ Puts emphasis on their common points

◦ Does not give too much weight to data ranked as good by only one or a few ranking methods

The optimal consensus is a ranking that is closest to the input rankings, which requires a distance


The Kendall-τ distance D(π,σ) counts the number of pairs of elements that are inverted (i.e., in opposite order) between two rankings π and σ
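The pair-counting definition can be sketched directly in code. This is a minimal quadratic-time illustration, assuming both rankings are permutations of the same elements. The function name `kendall_tau_distance` is an illustrative choice, not taken from the talk.

```python
from itertools import combinations


def kendall_tau_distance(pi, sigma):
    """Count the pairs of elements that appear in opposite order in pi and sigma."""
    pos_pi = {x: i for i, x in enumerate(pi)}
    pos_sigma = {x: i for i, x in enumerate(sigma)}
    return sum(
        1
        for a, b in combinations(pi, 2)
        # A negative product means the pair is ordered differently in the two rankings.
        if (pos_pi[a] - pos_pi[b]) * (pos_sigma[a] - pos_sigma[b]) < 0
    )


ranking_1 = ["A", "D", "C", "B"]
ranking_2 = ["B", "A", "D", "C"]
d12 = kendall_tau_distance(ranking_1, ranking_2)  # pairs (A,B), (D,B), (C,B) are inverted
```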





Kemeny score: K(π, R) = Σ D(π, σ), summed over all input rankings σ in the set R

Optimal consensus (median): a ranking π* that minimizes K(π, R)

Complexity [Dwork et al. 2001, Biedl et al. 2009]

NP-hard for an even number of permutations ≥ 4

Approximation and heuristics
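For very small inputs, an optimal consensus can be found by exhaustive search. The sketch below applies this to the four-gene example (rankings 1 to 3). It assumes that the Kemeny score of a candidate is the sum of its Kendall-τ distances to the input rankings. The function names are illustrative, and enumerating all n! permutations is only feasible for a handful of elements, which is why approximations and heuristics are needed in practice.

```python
from itertools import combinations, permutations


def kendall_tau(pi, sigma):
    """Number of pairs ordered differently in the two permutations."""
    pos_pi = {x: i for i, x in enumerate(pi)}
    pos_sigma = {x: i for i, x in enumerate(sigma)}
    return sum(
        1
        for a, b in combinations(pi, 2)
        if (pos_pi[a] - pos_pi[b]) * (pos_sigma[a] - pos_sigma[b]) < 0
    )


def kemeny_score(candidate, rankings):
    """Sum of Kendall-tau distances from the candidate to every input ranking."""
    return sum(kendall_tau(candidate, r) for r in rankings)


def brute_force_consensus(rankings):
    """Try every permutation and keep one with the smallest Kemeny score."""
    elements = sorted(rankings[0])
    return min(permutations(elements), key=lambda c: kemeny_score(c, rankings))


rankings = [
    ["A", "D", "C", "B"],
    ["B", "A", "D", "C"],
    ["D", "A", "B", "C"],
]
consensus = list(brute_force_consensus(rankings))
score = kemeny_score(consensus, rankings)
```

On this input the search returns [A, D, B, C], the consensus given in the example, with a Kemeny score of 4.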


Real data sets

◦ Have equalities (ties): several items ex aequo

◦ May be incomplete: the rankings are not all over the same set of elements

Unification process: one unifying bucket, containing the missing elements, is added at the end of each ranking

Generalized Kendall-τ distance G(r,s) counts the number of pairs of elements

◦ inverted between two rankings r and s

◦ tied in only one of the two rankings
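The two cases above can be sketched as follows. This is a minimal illustration in which a ranking with ties is represented as a list of buckets (an assumed encoding) and both kinds of disagreement cost 1. The function names are illustrative, and variants that weight the cost of tying or untying differently are not covered.

```python
from itertools import combinations


def bucket_positions(ranking):
    """Map each element to the index of the bucket that contains it."""
    return {x: i for i, bucket in enumerate(ranking) for x in bucket}


def generalized_kendall_tau(r, s):
    """Count pairs that are inverted, or tied in exactly one of the two rankings."""
    pos_r, pos_s = bucket_positions(r), bucket_positions(s)
    total = 0
    for a, b in combinations(sorted(pos_r), 2):
        diff_r = pos_r[a] - pos_r[b]
        diff_s = pos_s[a] - pos_s[b]
        if diff_r * diff_s < 0:
            total += 1  # strictly opposite order
        elif (diff_r == 0) != (diff_s == 0):
            total += 1  # tied in only one of the two rankings
    return total


r1 = [["A"], ["B", "C"], ["D"], ["E"]]
r2 = [["A", "D"], ["E"], ["B", "C"]]
g12 = generalized_kendall_tau(r1, r2)
```

Here the pair (A, D) is tied only in the second ranking, and the pairs (B, D), (B, E), (C, D), (C, E) are inverted, giving a distance of 5.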

Complexity: finding an optimal consensus with ties is at least as hard as in the permutation case [VLDB15]

Optimal consensus implemented [VLDB15]

◦ Provides exact solutions up to n = 40 elements

Approximation and heuristics

R1 = [A, {B,C}, D, {E}]

R2 = [{A,D}, E, {B,C}]

R3 = [{C,E}, A, B, {D}]

Elements in the unifying bucket are equally unimportant
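The unification process can be sketched as below. The input rankings are an assumption: they are the rankings R1 to R3 with their presumed unifying buckets ({E} for R1, {D} for R3) taken out, since the slide only shows the unified form. The function name `unify` and the list-of-buckets encoding are illustrative.

```python
def unify(rankings):
    """Append to each ranking one final bucket holding the elements it is missing."""
    universe = {x for r in rankings for bucket in r for x in bucket}
    unified = []
    for r in rankings:
        present = {x for bucket in r for x in bucket}
        missing = sorted(universe - present)
        # Add the unifying bucket only when the ranking actually lacks elements.
        unified.append([list(bucket) for bucket in r] + ([missing] if missing else []))
    return unified


raw = [
    [["A"], ["B", "C"], ["D"]],        # assumed original form of R1 (E missing)
    [["A", "D"], ["E"], ["B", "C"]],   # R2 already covers all five elements
    [["C", "E"], ["A"], ["B"]],        # assumed original form of R3 (D missing)
]
unified = unify(raw)
```

After unification all rankings are over the same five elements, so the generalized Kendall-τ distance is defined for every pair of rankings.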


MEDRank [Fagin et al. 2003]

CopelandMethod [Copeland et al. 1951]

BordaCount [Borda 1781]

MC4 [Dwork et al. 2001]

Positional approaches

B&B [Ali et al. 2012]

ChanasBoth [Coleman et al. 2009]

Chanas [Chanas et al. 1996]

PNE (exact) [Conitzer, et al. 2006]

KwikSort [Ailon et al. 2008]

RepeatChoice [Ailon et al. 2010]

permutations

Kendall-τ distance

Pick-A-Perm [Ailon et al. 2008]

Ailon3/2 [Fagin et al. 2004]

BioConsert [Cohen-Boulakia et al. 2011]

FaginDyn [Fagin et al. 2004]

Generalized Kendall-τ distance


Compatibility with ties

Ability to consider the cost of (un)tying

[VLDB15]


Optimal solution can be found up to n ≈ 40

Comparison of 12 algorithms

◦ Re-implemented and adapted to ties

◦ Evaluated on quality (gap with respect to the optimal consensus c*) and time

Real datasets

◦ Extracted from previous publications

◦ 7 datasets: EachMovie, F1, GiantSlalom, SkiCross/Jumping, WebCommunities, WebSearch, BioMedical

Synthetic datasets

1. Uniformly generated with MuPad Combinat

2. With increasing levels of similarity

3. Similarity and unification process

Precise understanding of the impact of the similarity and unification process

[VLDB15]


● BioConsert can be used in a very large majority of the cases

● For very large data sets (> 30 000 elements), KwikSort can be preferred

● If there is a need to speed up:

◦ In case of few equalities, use BordaCount

◦ Otherwise, use MEDRank

◦ Alternatively, use both algorithms and pick the best

Figure: algorithms positioned by time (→) against quality (gap), from high quality to low quality

http://rank-aggregation-with-ties.lri.fr/

[VLDB15]





Finding all synonyms is time-consuming

Querying using all synonyms provides huge numbers of result sets, which have to be ranked…

Synonyms: breast cancer vs mammalian carcinoma (14 772 vs 771 genes, not all included)

Abbreviations: attention deficit hyperactivity disorders vs ADHD (109 vs 144 genes, 74 common)

Linguistic variations: tumour vs tumor (& breast cancer): 681 vs 291 genes

More precise reformulations: colorectal cancer vs Lynch syndrome (+6 new genes)

Which genes are associated with a given disease?

[DILS14]


I) Reformulations (synonyms) can be automatically generated

• Input: the user's keyword

• Automatic search of synonyms in major biomedical terminologies

• Output: set of synonyms of the input keyword

[DILS14]






II) DB querying is automated (based on keywords)

• Input: set of synonyms obtained in I)

• Querying gene databases with each synonym (#queries = #synonyms)

• Output: set of rankings (ranked genes), one ranking per synonym

[DILS14]


III) Aggregating using a series of consensus algorithms with a variant of the generalized Kendall-τ distance
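The three steps can be chained as in the sketch below, which is an illustration under strong assumptions and not ConQuR-Bio's implementation. The synonym table and query results are hypothetical toy data with placeholder gene names (`g1` to `g4`), all function names are illustrative, and step III uses a simple Borda-style positional score as a stand-in for the consensus algorithms used by the actual tool.

```python
from collections import defaultdict

# Hypothetical toy data standing in for terminology lookups and gene database queries.
TOY_SYNONYMS = {"disease": ["disease", "synonym 1", "synonym 2"]}
TOY_RESULTS = {
    "disease": [["g1"], ["g2", "g3"]],
    "synonym 1": [["g2"], ["g1"], ["g4"]],
    "synonym 2": [["g1"], ["g2"], ["g4"]],
}


def reformulate(keyword):
    """Step I: return the keyword together with its synonyms."""
    return TOY_SYNONYMS.get(keyword, [keyword])


def query_genes(term):
    """Step II: return one ranking with ties (a list of buckets) for one query."""
    return TOY_RESULTS.get(term, [])


def unify(rankings):
    """Add to each ranking a final bucket with the elements it does not contain."""
    universe = {x for r in rankings for bucket in r for x in bucket}
    unified = []
    for r in rankings:
        missing = sorted(universe - {x for bucket in r for x in bucket})
        unified.append([list(b) for b in r] + ([missing] if missing else []))
    return unified


def borda_style_consensus(rankings):
    """Step III (stand-in): score each element by how many elements precede it."""
    scores = defaultdict(int)
    for r in rankings:
        before = 0
        for bucket in r:
            for x in bucket:
                scores[x] += before
            before += len(bucket)
    # Elements with equal total scores end up tied in the same bucket.
    buckets = defaultdict(list)
    for x, s in scores.items():
        buckets[s].append(x)
    return [sorted(buckets[s]) for s in sorted(buckets)]


def consensus_for(keyword):
    rankings = [query_genes(term) for term in reformulate(keyword)]
    return borda_style_consensus(unify(rankings))


result = consensus_for("disease")
```

With this toy data, `g1` and `g2` are returned by every query and come first, while `g3` and `g4`, each missing from some rankings, end up tied in the last bucket.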







Highly accessed by members of APHP (Hospitals), Institut Curie, Institut Pasteur…


Ranking biological data is a complex task

◦ Many quality criteria, difficult to combine

◦ Consensus ranking approaches

Classification of consensus ranking algorithms

◦ Rankings with ties, incomplete data sets

Study of distances and normalization techniques

◦ Complexity results, new distances considered, new heuristics

Guide in the choice of distances and algorithms

Rank-n-ties platform available

Concrete application to real biological data

◦ Consensus of query reformulations

◦ Currently under evaluation based on real datasets (Leukemia)

Increasing number of reformulations + increasing number of answers: ranking big data sets


Bryan Brancotte

Mastodon Project QualibioConsensus: LRI (Paris-Sud), LIGM (Marne-la-vallée), IFB (Institut Français de Bioinformatique)

Other partners: Univ. Montréal, Institut Curie, APHP Georges Pompidou, APHP Paul Brousse


[VLDB15] Rank aggregation with ties: Experiments and Analysis. B Brancotte, B Yang, G Blin, S Cohen-Boulakia, A Denise, S Hamel. Proceedings of the VLDB Endowment 8 (11), 1202-1213

Interrogation de bases de données biologiques publiques par reformulation de requêtes et classement des résultats avec ConQuR-Bio. B Brancotte, B Rance, A Denise, S Cohen-Boulakia. JOBIM (Journées Ouvertes Biologie Informatique Mathématiques). 2015

[DILS14] Conqur-bio: Consensus ranking with query reformulation for biological data. B Brancotte, B Rance, A Denise, S Cohen-Boulakia. International Conference on Data Integration in the Life Sciences (DILS), 128-142. 2014

Using medians to generate consensus rankings for biological data. S Cohen-Boulakia, A Denise, S Hamel. International Conference on Scientific and Statistical Database Management (SSDBM). 2011

https://scholar.google.com/citations?view_op=view_citation&hl=en&user=8QeCBlwAAAAJ&sortby=pubdate&citation_for_view=8QeCBlwAAAAJ:iH-uZ7U-co4C

https://scholar.google.com/citations?view_op=view_citation&hl=en&user=8QeCBlwAAAAJ&sortby=pubdate&citation_for_view=8QeCBlwAAAAJ:isC4tDSrTZIC

https://scholar.google.com/citations?view_op=view_citation&hl=en&user=8QeCBlwAAAAJ&sortby=pubdate&citation_for_view=8QeCBlwAAAAJ:RHpTSmoSYBkC

https://scholar.google.com/citations?view_op=view_citation&hl=en&user=8QeCBlwAAAAJ&cstart=20&sortby=pubdate&citation_for_view=8QeCBlwAAAAJ:4TOpqqG69KYC
