Sarah Cohen-Boulakia
Laboratoire de Recherche en Informatique
CNRS UMR 8623
Université Paris-Sud, Université Paris-Saclay
Sarah Cohen-Boulakia, Université Paris-Sud 2
Understanding Life Sciences
Progress in multiple domains: biology, chemistry, maths, computer sciences…
Emergence of new technologies: next-generation sequencing, …
Increasing volumes of raw data
All stored in Web data sources
Raw data are not sufficient
Data annotated by experts
Bioinformatics analysis of data
New data sources
Concrete example: Querying NCBI Entrez
http://www.ncbi.nlm.nih.gov/gquery/ (search "Gquery NCBI" on Google)
National Center for Biotechnology Information (NCBI) provides access to biomedical and genomic information through its Web portal
33 databases queried
Query: breast cancer
14,772 genes
• Content of a biological database = a set of pages (a book!)
• Each database is focused on one biological entity (here, Gene)
• Each page describes one instance (i.e., one gene)
Last update (freshness)
Status of the sequence (reliability)
Various quality features
« volume of information » (completeness)
+ Reputation of the data source, number of incoming links (paths) to reach the data…
… to determine the importance of results (obtained in several ways) and get complete sets of answers
Query: mammalian carcinoma
771 genes, not all included in the previous result!
NCBI ranks answers for each keyword using relevance
◦ Number of occurrences of the keyword (breast cancer) in the files
Other approaches have been tested
◦ Combining various criteria (weighted functions)
  Freshness first, then reputation of the source, then…
◦ Following PageRank- or ObjectRank-like solutions (Biozon, Biorank…)
How to deal with several sets of answers?
◦ How to aggregate several rankings?
  Answers obtained by several keywords should have higher priority…
Wanted: a ranking solution exploiting many criteria
◦ Problem: how to combine all such criteria?
Alternative: consensus rankings?
Need for consensus rankings to order biological data
Definition of the problem
Experimental Study of existing approaches
ConQuR-Bio
Conclusion
Input: a set of rankings (A to D are four genes)
◦ Ranking 1 = [A, D, C, B]
◦ Ranking 2 = [B, A, D, C]
◦ Ranking 3 = [D, A, B, C]
Output: one consensus ranking
◦ Consensus = [A, D, B, C]
A consensus ranking makes the most of several rankings, each obtained by a given ranking method
◦ Puts emphasis on their common points
◦ Does not give too much importance to data classified "good" by only one or a few ranking methods
The optimal consensus is a ranking that is the closest to the input rankings, which requires a distance
The Kendall-τ distance D(π,σ)
counts the number of pairs of elements that are inverted
(i.e., in the opposite order) between two rankings π and σ
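The pair-counting definition can be sketched in a few lines of Python. This is only an illustration for rankings without ties over the same elements; the function name `kendall_tau` is chosen here and does not come from the slides. It is applied to Ranking 1 and Ranking 2 of the four-gene example.

```python
from itertools import combinations


def kendall_tau(pi, sigma):
    """Count the pairs of elements that appear in opposite order in pi and sigma.

    Assumes both rankings are permutations of the same elements (no ties).
    """
    pos_pi = {e: i for i, e in enumerate(pi)}
    pos_sigma = {e: i for i, e in enumerate(sigma)}
    return sum(
        1
        for a, b in combinations(pi, 2)
        # The product is negative exactly when the pair is inverted.
        if (pos_pi[a] - pos_pi[b]) * (pos_sigma[a] - pos_sigma[b]) < 0
    )


ranking_1 = ["A", "D", "C", "B"]
ranking_2 = ["B", "A", "D", "C"]
print(kendall_tau(ranking_1, ranking_2))  # 3: the pairs (A,B), (D,B), (C,B)
```

This direct version examines every pair, so it is quadratic in the number of elements; that is enough for illustration.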
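Kemeny score: K(π, R) = sum of D(π,σ) over all input rankings σ in R
Optimal consensus (median): a ranking π* that minimizes the Kemeny score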
Complexity [Dwork et al. 2001, Biedl et al. 2009]
NP-hard for an even number of permutations ≥ 4
Approximation and heuristics
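For very small inputs, the optimal consensus can be found by exhaustive search, which makes the definitions concrete. The sketch below is a brute-force illustration only, not the exact method referred to in the talk; the names `kemeny_score` and `optimal_consensus` are chosen here. It tries every permutation, so it is usable only for a handful of elements.

```python
from itertools import combinations, permutations


def kendall_tau(pi, sigma):
    """Number of pairs ordered differently in two tie-free rankings."""
    pos_pi = {e: i for i, e in enumerate(pi)}
    pos_sigma = {e: i for i, e in enumerate(sigma)}
    return sum(
        1
        for a, b in combinations(pi, 2)
        if (pos_pi[a] - pos_pi[b]) * (pos_sigma[a] - pos_sigma[b]) < 0
    )


def kemeny_score(candidate, rankings):
    """Sum of the Kendall-tau distances from candidate to every input ranking."""
    return sum(kendall_tau(candidate, r) for r in rankings)


def optimal_consensus(rankings):
    """Brute force: try all n! permutations and keep one with minimal score."""
    elements = sorted(rankings[0])
    return min(permutations(elements), key=lambda c: kemeny_score(c, rankings))


rankings = [
    ["A", "D", "C", "B"],
    ["B", "A", "D", "C"],
    ["D", "A", "B", "C"],
]
best = optimal_consensus(rankings)
print(list(best), kemeny_score(best, rankings))  # ['A', 'D', 'B', 'C'] 4
```

On the four-gene example this recovers the consensus [A, D, B, C] with a Kemeny score of 4. The factorial growth of the search space is the reason approximation algorithms and heuristics are needed in practice.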
Real data sets
◦ Have equalities: several items are tied (ex aequo)
◦ May be incomplete: the rankings are not all over the same set of elements
Unification process
◦ One unifying bucket at the end of each ranking with the missing elements
◦ Elements in the unifying bucket are equally unimportant
R1 = [A, {B,C}, D, {E}]
R2 = [{A,D}, E, {B,C}]
R3 = [{C,E}, A, B, {D}]
Generalized Kendall-τ G(r,s) counts the number of pairs of elements
◦ inverted between two rankings r and s
◦ tied in only one of the two rankings
Complexity: finding an optimal consensus with ties is at least as difficult as in the permutation case [VLDB15]
Optimal consensus implemented [VLDB15]
◦ Provides exact solutions up to n = 40 elements
Approximation and heuristics
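The unification process and the generalized distance can be sketched as follows. A ranking with ties is represented here as a list of buckets (Python sets), which is a representation chosen for this illustration. The sketch assumes a unit cost both for an inverted pair and for a pair tied in only one ranking; weighted variants would need a different cost. The names `unify` and `generalized_kendall_tau` are hypothetical.

```python
from itertools import combinations


def unify(rankings):
    """Append to each ranking one final bucket holding the elements it lacks."""
    universe = set().union(*(bucket for r in rankings for bucket in r))
    unified = []
    for r in rankings:
        missing = universe - set().union(*r)
        unified.append(list(r) + ([missing] if missing else []))
    return unified


def bucket_index(ranking):
    """Map each element to the index of the bucket that contains it."""
    return {e: i for i, bucket in enumerate(ranking) for e in bucket}


def generalized_kendall_tau(r, s):
    """Count pairs inverted between r and s, or tied in exactly one of them.

    Assumes r and s rank the same elements and that both costs equal 1.
    """
    pos_r, pos_s = bucket_index(r), bucket_index(s)
    cost = 0
    for a, b in combinations(sorted(pos_r), 2):
        dr = pos_r[a] - pos_r[b]
        ds = pos_s[a] - pos_s[b]
        if dr * ds < 0:                # opposite order in r and s
            cost += 1
        elif (dr == 0) != (ds == 0):   # tied in only one of the two rankings
            cost += 1
    return cost


r1 = [{"A"}, {"B", "C"}, {"D"}, {"E"}]
r2 = [{"A", "D"}, {"E"}, {"B", "C"}]
print(generalized_kendall_tau(r1, r2))  # 5

print(unify([[{"A"}, {"B"}], [{"C"}]]))
```

For R1 and R2 the five counted pairs are (A,D), tied only in R2, and (B,D), (B,E), (C,D), (C,E), which are inverted. The pair (B,C) is tied in both rankings and costs nothing.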
MEDRank [Fagin et al. 2003]
CopelandMethod [Copeland et al. 1951]
BordaCount [Borda 1781]
MC4 [Dwork et al. 2001]
Positional approaches
B&B [Ali et al. 2012]
ChanasBoth [Coleman et al. 2009]
Chanas [Chanas et al. 1996]
PNE (exact) [Conitzer, et al. 2006]
KwikSort [Ailon et al. 2008]
RepeatChoice [Ailon et al. 2010]
permutations
Kendall-τ distance
Pick-A-Perm [Ailon et al. 2008]
Ailon3/2 [Fagin et al. 2004]
BioConsert [Cohen-Boulakia et al. 2011]
FaginDyn [Fagin et al. 2004]
Generalized Kendall-τ distance
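As an illustration of a positional approach, here is a minimal sketch of a Borda-style count. It handles only rankings without ties over the same elements and is not the tie-adapted version evaluated in the study; the function name and the alphabetical tie-breaking rule are assumptions made for this example.

```python
def borda_count(rankings):
    """Order elements by the sum of their positions over all input rankings.

    Assumes every ranking is a permutation of the same elements (no ties).
    Elements with equal scores are ordered alphabetically, an arbitrary choice.
    """
    scores = {}
    for ranking in rankings:
        for position, element in enumerate(ranking):
            scores[element] = scores.get(element, 0) + position
    return sorted(scores, key=lambda e: (scores[e], e))


rankings = [
    ["A", "D", "C", "B"],
    ["B", "A", "D", "C"],
    ["D", "A", "B", "C"],
]
print(borda_count(rankings))  # ['A', 'D', 'B', 'C']
```

With position sums A=2, D=3, B=5 and C=8, the result on the four-gene example coincides with the optimal consensus. That agreement is not guaranteed in general, since a positional method never computes the Kendall-τ distance.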
Compatibility with ties
Ability to consider the cost of (un)tying
[VLDB15]
Optimal solution can be found up to n ≈ 40
Comparison of 12 algorithms
◦ Re-implemented and adapted to ties
◦ Evaluated on
  Quality: gap (with respect to the optimal consensus c*)
  Time
Real datasets
◦ Extracted from previous publications
◦ 7 datasets: EachMovie, F1, GiantSlalom, SkiCross/Jumping, WebCommunities, WebSearch, BioMedical
Synthetic datasets
1. Uniformly generated with MuPad Combinat
2. With increasing levels of similarity
3. Similarity and unification process
Precise understanding of the impact of the similarity and unification process
[VLDB15]
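The slide does not spell out how the gap is computed. A common way to define it, assumed here, is the relative excess of a consensus score over the optimal score: gap = K(c) / K(c*) - 1. The helper below is a hypothetical sketch of that assumed definition.

```python
def gap(score, optimal_score):
    """Relative excess of a consensus score over the optimal score.

    Assumed definition: score / optimal_score - 1, so 0.0 means optimal.
    """
    if optimal_score == 0:
        # All inputs agree with the optimum; only a zero score is optimal.
        if score == 0:
            return 0.0
        raise ValueError("gap is undefined when the optimal score is 0")
    return score / optimal_score - 1


# Four-gene example: [D, A, B, C] has Kemeny score 5, the optimum has score 4.
print(gap(5, 4))  # 0.25
```

Under this definition, a heuristic that returned [D, A, B, C] on the four-gene example would be 25% worse than the optimal consensus [A, D, B, C].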
● BioConsert can be used in a very large majority of the cases
● For very large data sets (> 30,000 elements)
● KwikSort can be preferred
● If there is a need to speed up, then
● In case of few equalities, use BordaCount
● Otherwise, use MEDRank
● Alternatively, use both algorithms and pick the best
[Figure: algorithms compared by time and by gap, from high quality to low quality]
http://rank-aggregation-with-ties.lri.fr/
[VLDB15]
Need for consensus rankings to order biological data
Definition of the problem
Experimental Study of existing approaches
ConQuR-Bio
Conclusion
Which genes are associated with a given disease?
Synonyms: breast cancer vs mammalian carcinoma (14,772 vs 771 genes, not all included)
Abbreviations: attention deficit hyperactivity disorders vs ADHD (109 vs 144 genes, 74 in common)
Linguistic variations: tumour vs tumor (& breast cancer): 681 vs 291 genes
More precise reformulations: colorectal cancer vs Lynch syndrome (+6 new genes)
Finding all synonyms is time-consuming
Querying with all synonyms provides huge amounts of data sets, which have to be ranked…
[DILS14]
I) Reformulations (synonyms) can be automatically generated
• Input: the user's keyword
• Automatic search of synonyms in major biomedical terminologies
• Output: set of synonyms of the input keyword
[DILS14]
II) DB querying is automated (based on keywords)
• Input: set of synonyms obtained in I)
• Querying gene databases with each synonym
(#queries = #synonyms)
• Output: sets of rankings (ranked genes), one ranking per synonym
[DILS14]
III) Aggregating using a series of consensus algorithms with a variant of the Generalized Kendall-τ distance
[DILS14]
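The three steps can be sketched as a small pipeline. Everything below is a hypothetical illustration, not the ConQuR-Bio implementation: the three callables stand in for the terminology lookup, the gene database query and the consensus algorithm, and the demo data (keyword, synonyms, gene identifiers G1 to G3) is invented so the sketch can run offline.

```python
def consensus_pipeline(keyword, find_synonyms, query_genes, aggregate):
    """Chain the three steps: reformulate, query once per synonym, aggregate."""
    synonyms = list(find_synonyms(keyword))         # step I
    rankings = [query_genes(s) for s in synonyms]   # step II: #queries = #synonyms
    return aggregate(rankings)                      # step III


# --- Invented stand-ins for the demo, not real terminologies or databases ---
TOY_SYNONYMS = {"disease X": ["disease X", "dx"]}
TOY_RESULTS = {"disease X": ["G1", "G2", "G3"], "dx": ["G2", "G3"]}


def toy_aggregate(rankings):
    """Placeholder consensus: sum of positions, absent genes placed last."""
    genes = sorted({g for r in rankings for g in r})

    def score(gene):
        return sum(r.index(gene) if gene in r else len(r) for r in rankings)

    return sorted(genes, key=lambda g: (score(g), g))


result = consensus_pipeline(
    "disease X",
    find_synonyms=lambda k: TOY_SYNONYMS.get(k, [k]),
    query_genes=lambda s: TOY_RESULTS.get(s, []),
    aggregate=toy_aggregate,
)
print(result)  # ['G2', 'G1', 'G3']
```

Passing the three steps as parameters keeps the sketch independent of any particular terminology service, database API or consensus algorithm. The placeholder aggregation treats genes absent from a ranking as tied at its end, in the spirit of the unifying bucket.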
Highly accessed by members of APHP (Hospitals), Institut Curie, Institut Pasteur…
Ranking biological data is a complex task
◦ Many quality criteria, difficult to combine → consensus ranking approaches
Classification of consensus ranking algorithms
◦ Rankings with ties, incomplete data sets
Study of distances and normalization techniques
◦ Complexity results, new distances considered, new heuristics
Guide in the choice of distances and algorithms
Rank-n-ties platform available
Concrete application to real biological data
◦ Consensus of query reformulations
◦ Currently under evaluation based on real datasets (Leukemia)
Increasing number of reformulations + increasing number of answers → ranking big data sets
Bryan Brancotte
Mastodon Project QualibioConsensus: LRI (Paris-Sud), LIGM (Marne-la-Vallée), IFB (Institut Français de Bioinformatique)
Other partners: Univ. Montréal, Institut Curie, APHP Georges Pompidou, APHP Paul Brousse
Rank aggregation with ties: Experiments and Analysis
B. Brancotte, B. Yang, G. Blin, S. Cohen-Boulakia, A. Denise, S. Hamel. Proceedings of the VLDB Endowment 8 (11), 1202-1213. 2015
Interrogation de bases de données biologiques publiques par reformulation de requêtes et classement des résultats avec ConQuR-Bio
B. Brancotte, B. Rance, A. Denise, S. Cohen-Boulakia. JOBIM (Journées Ouvertes Biologie Informatique Mathématiques). 2015
ConQuR-Bio: Consensus ranking with query reformulation for biological data
B. Brancotte, B. Rance, A. Denise, S. Cohen-Boulakia. International Conference on Data Integration in the Life Sciences (DILS), 128-142. 2014
Using medians to generate consensus rankings for biological data
S. Cohen-Boulakia, A. Denise, S. Hamel. International Conference on Scientific and Statistical Database Management (SSDBM). 2011