Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik

Presented by Weimin He, CSE@UTA

04/19/23 Weimin He CSE@UTA

Outline

Motivation
Problem Definition
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and open problems

Motivating example

Realtor DB: Table D = (TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)

SQL query: Select * From D Where City=“Seattle” AND View=“Waterfront”

Motivation

Many-answers problem

Two alternative solutions:
  Query reformulation
  Automatic ranking

Apply probabilistic model in IR to DB tuple ranking

Problem Definition

Given a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am}, and a query

Q: SELECT * FROM D WHERE X1=x1 AND … AND Xs=xs

where each Xi is an attribute from A and xi is a value in its domain.

The set of attributes X = {X1, …, Xs} is known as the set of attributes specified by the query, while the set Y = A – X is known as the set of unspecified attributes.

Let S ⊆ {t1, …, tn} be the answer set of Q.

How should the tuples in S be ranked so that the top-k tuples are returned to the user?
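The definitions above can be made concrete with a small sketch. The in-memory table, the helper names `answer_set` and `unspecified_attributes`, and the sample rows are illustrative assumptions, not part of the original system; a real deployment would issue the SQL query against the RDBMS.

```python
def answer_set(table, query):
    """Return the tuples that satisfy every attribute=value condition in `query`."""
    return [t for t in table if all(t.get(a) == v for a, v in query.items())]


def unspecified_attributes(attributes, query):
    """Y = A - X: the attributes the query does not mention."""
    return [a for a in attributes if a not in query]


# Hypothetical toy instance of the realtor table D (only a few attributes shown).
A = ["City", "View", "SchoolDistrict", "BoatDock"]
D = [
    {"TID": 1, "City": "Seattle", "View": "Waterfront", "SchoolDistrict": "Excellent", "BoatDock": "Yes"},
    {"TID": 2, "City": "Seattle", "View": "Street", "SchoolDistrict": "Good", "BoatDock": "No"},
    {"TID": 3, "City": "Bellevue", "View": "Waterfront", "SchoolDistrict": "Excellent", "BoatDock": "Yes"},
    {"TID": 4, "City": "Seattle", "View": "Waterfront", "SchoolDistrict": "Average", "BoatDock": "No"},
]

# Q: SELECT * FROM D WHERE City='Seattle' AND View='Waterfront'
Q = {"City": "Seattle", "View": "Waterfront"}

S = answer_set(D, Q)               # answer set of Q
Y = unspecified_attributes(A, Q)   # unspecified attributes
print([t["TID"] for t in S], Y)
```

The ranking problem is then to order the tuples of `S` using only the values they take on the attributes in `Y`.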

System Architecture

Intuition for Ranking Function

Select * From D Where City=“Seattle” And View=“Waterfront”

Score of a result tuple t depends on:
  Global score: global importance of unspecified attribute values
    E.g., homes with good school districts are globally desirable
  Conditional score: correlations between specified and unspecified attribute values
    E.g., Waterfront ⇒ BoatDock

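Probabilistic Model in IR

Bayes’ Rule:

$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$

Product Rule:

$$p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c)$$

Document t, query Q
R: relevant document set
$\bar{R} = D - R$: irrelevant document set

$$Score(t) \propto \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\, p(R) / p(t)}{p(t \mid \bar{R})\, p(\bar{R}) / p(t)} \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}$$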

Adaptation of PIR to DB

Tuple t is considered as a document

Partition t into t(X) and t(Y)
t(X) and t(Y) are written as X and Y
Derive from the initial scoring function until the final ranking function is obtained

Preliminary Derivation

Limited Independence Assumptions

Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed

$$p(X \mid C) = \prod_{x \in X} p(x \mid C)$$

$$p(Y \mid C) = \prod_{y \in Y} p(y \mid C)$$

Continuing Derivation

Workload-based Estimation of $p(y \mid R)$

Assume a collection of “past” queries exists in the system

Workload W is represented as a set of “tuples”

Given query Q and specified attribute set X, approximate R as all query “tuples” in W that also request X

All properties of the relevant tuple set R can be obtained by examining only the subset of the workload that contains queries that also request X

$$p(y \mid R) \approx p(y \mid X, W)$$

Final Ranking Function
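As a hedged sketch, and assuming the derivation combines the limited independence assumptions with the workload approximation of $p(y \mid R)$, a ranking function built from the four atomic probabilities used in the rest of the talk would take the following form. This is a reconstruction consistent with those quantities, not a quotation of the slide.

```latex
Score(t) \;\propto\;
  \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)}
  \;\cdot\;
  \prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}
```

Under this reading, the first product plays the role of the global score (how much more often a value y is requested than it occurs) and the second plays the role of the conditional score (how strongly y is tied to the specified values x).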

Pre-computing Atomic Probabilities in Ranking Function

$p(y \mid W)$: relative frequency of y in W
$p(y \mid D)$: relative frequency of y in D
$p(x \mid y, W)$: (# of tuples in W that contain x, y) / total # of tuples in W
$p(x \mid y, D)$: (# of tuples in D that contain x, y) / total # of tuples in D
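A minimal sketch of this pre-computation, assuming both the workload W and the table D are available as lists of attribute-to-value dictionaries. The function name `atomic_probabilities` and the toy workload are hypothetical. The pair quantity follows the definition given above (pair count over the total number of tuples); a strict conditional probability would instead divide by the number of tuples that contain y, which is a one-line change.

```python
from collections import Counter
from itertools import permutations


def atomic_probabilities(rows):
    """Return (single, pair) relative frequencies over `rows`.

    single[(att, val)]          = count of rows containing att=val / len(rows)
    pair[((ax, vx), (ay, vy))]  = count of rows containing both     / len(rows)
    """
    n = len(rows)
    single, pair = Counter(), Counter()
    for row in rows:
        items = list(row.items())
        single.update(items)
        # Ordered pairs of distinct attribute values occurring in the same row.
        pair.update(permutations(items, 2))
    p_single = {key: count / n for key, count in single.items()}
    p_pair = {key: count / n for key, count in pair.items()}
    return p_single, p_pair


# Hypothetical workload: each past query is recorded as a partial "tuple".
W = [
    {"View": "Waterfront", "BoatDock": "Yes"},
    {"View": "Waterfront"},
    {"SchoolDistrict": "Excellent"},
    {"View": "Waterfront", "BoatDock": "Yes"},
]

p_y_W, p_xy_W = atomic_probabilities(W)
print(p_y_W[("View", "Waterfront")])
print(p_xy_W[(("View", "Waterfront"), ("BoatDock", "Yes"))])
```

Running the same function over D yields the data-side quantities, so the four tables can be produced with two passes.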

Example for Computing Atomic Probabilities

Select * From D Where City=“Seattle” And View=“Waterfront”

Y={SchoolDistrict, BoatDock, …}

D=10,000 W=1000 W{excellent}=10 W{waterfront & yes}=5

p(excellent|W)=10/1000=0.01
p(excellent|D)=10/10,000=0.001
p(waterfront|yes,W)=5/1000=0.005
p(waterfront|yes,D)=5/10,000=0.0005
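The arithmetic of this example can be checked with a few lines of Python. The variable names are illustrative, and, as in the example, the same two counts are divided by the size of W and by the size of D.

```python
n_D = 10_000                   # number of tuples in D
n_W = 1_000                    # number of "tuples" in the workload W
count_excellent = 10           # tuples containing SchoolDistrict=excellent
count_waterfront_and_yes = 5   # tuples containing View=waterfront and BoatDock=yes

p_excellent_W = count_excellent / n_W
p_excellent_D = count_excellent / n_D
p_waterfront_yes_W = count_waterfront_and_yes / n_W
p_waterfront_yes_D = count_waterfront_and_yes / n_D

for name, value in [
    ("p(excellent|W)", p_excellent_W),
    ("p(excellent|D)", p_excellent_D),
    ("p(waterfront|yes,W)", p_waterfront_yes_W),
    ("p(waterfront|yes,D)", p_waterfront_yes_D),
]:
    print(f"{name} = {value:g}")
```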

Indexing Atomic Probabilities

$p(y \mid W)$
  {AttName, AttVal, Prob}
  B+ tree index on (AttName, AttVal)

$p(y \mid D)$
  {AttName, AttVal, Prob}
  B+ tree index on (AttName, AttVal)

$p(x \mid y, W)$
  {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}
  B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

$p(x \mid y, D)$
  {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}
  B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)
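One way to picture these auxiliary tables is the following sketch using Python's built-in `sqlite3` module. The table names `GlobalW` and `CondW`, the index names, the sample rows, and the choice of putting x in the Left columns and y in the Right columns are assumptions made for illustration. SQLite's ordinary B-tree indexes stand in here for the B+ tree indexes of the original SQL Server setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- p(y | W): one row per distinct attribute value
    CREATE TABLE GlobalW (AttName TEXT, AttVal TEXT, Prob REAL);
    CREATE INDEX idx_GlobalW ON GlobalW (AttName, AttVal);

    -- p(x | y, W): one row per pair of attribute values
    CREATE TABLE CondW (
        AttNameLeft TEXT, AttValLeft TEXT,
        AttNameRight TEXT, AttValRight TEXT,
        Prob REAL
    );
    CREATE INDEX idx_CondW
        ON CondW (AttNameLeft, AttValLeft, AttNameRight, AttValRight);
    """
)

# Hypothetical sample rows.
conn.execute("INSERT INTO GlobalW VALUES (?, ?, ?)",
             ("SchoolDistrict", "Excellent", 0.01))
conn.execute("INSERT INTO CondW VALUES (?, ?, ?, ?, ?)",
             ("View", "Waterfront", "BoatDock", "Yes", 0.005))


def lookup_global(att, val):
    """Point lookup of p(y | W); returns None when the value was never seen."""
    row = conn.execute(
        "SELECT Prob FROM GlobalW WHERE AttName = ? AND AttVal = ?", (att, val)
    ).fetchone()
    return row[0] if row else None


def lookup_cond(x_att, x_val, y_att, y_val):
    """Point lookup of p(x | y, W) keyed on all four index columns."""
    row = conn.execute(
        "SELECT Prob FROM CondW WHERE AttNameLeft = ? AND AttValLeft = ? "
        "AND AttNameRight = ? AND AttValRight = ?",
        (x_att, x_val, y_att, y_val),
    ).fetchone()
    return row[0] if row else None


print(lookup_global("SchoolDistrict", "Excellent"))
print(lookup_cond("View", "Waterfront", "BoatDock", "Yes"))
```

The two tables for D would have the same shape; keeping the index columns in lookup order lets each probability be fetched with a single index probe.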

Scan Algorithm

Preprocessing - Atomic Probabilities Module
  Computes and indexes the quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for all distinct values x and y

Execution
  Select tuples that satisfy the query
  Scan and compute score for each result tuple
  Return top-K tuples
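The execution phase can be sketched as follows. The scoring function is an assumption: it multiplies the ratios P(y | W)/P(y | D) and P(x | y, W)/P(x | y, D) over the unspecified values y and specified values x. The in-memory dictionaries stand in for the indexed probability tables, and all names and numbers are hypothetical.

```python
import heapq


def score(t, query, p_y_W, p_y_D, p_xy_W, p_xy_D):
    """Assumed score: product of workload/data probability ratios."""
    s = 1.0
    for y_att, y_val in t.items():
        if y_att == "TID" or y_att in query:
            continue  # only unspecified attribute values contribute
        y = (y_att, y_val)
        s *= p_y_W[y] / p_y_D[y]
        for x in query.items():
            s *= p_xy_W[(x, y)] / p_xy_D[(x, y)]
    return s


def scan_top_k(table, query, k, probs):
    """Select matching tuples, score each one, and keep the k best."""
    answers = [t for t in table if all(t[a] == v for a, v in query.items())]
    return heapq.nlargest(k, answers, key=lambda t: score(t, query, *probs))


# Hypothetical toy data: one specified attribute, one unspecified attribute.
table = [
    {"TID": 1, "View": "Waterfront", "BoatDock": "Yes"},
    {"TID": 2, "View": "Waterfront", "BoatDock": "No"},
    {"TID": 3, "View": "Street", "BoatDock": "No"},
]
query = {"View": "Waterfront"}
x = ("View", "Waterfront")
yes, no = ("BoatDock", "Yes"), ("BoatDock", "No")
probs = (
    {yes: 0.4, no: 0.1},             # P(y | W)
    {yes: 0.2, no: 0.8},             # P(y | D)
    {(x, yes): 0.3, (x, no): 0.05},  # P(x | y, W)
    {(x, yes): 0.1, (x, no): 0.2},   # P(x | y, D)
)

top = scan_top_k(table, query, k=2, probs=probs)
print([t["TID"] for t in top])
```

Every tuple in the answer set is scored, so the cost grows linearly with the size of the answer set. A production version would also need smoothing for values missing from the probability tables, which this sketch omits.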

Beyond Scan Algorithm

Scan algorithm is inefficient
  Many tuples in the answer set

Another extreme
  Pre-compute top-K tuples for all possible queries
  Still infeasible in practice

Trade-off solution
  Pre-compute ranked lists of tuples for all possible atomic queries
  At query time, merge ranked lists to get top-K tuples

Two kinds of Ranked List

CondList Cx
  {AttName, AttVal, TID, CondScore}
  B+ tree index on (AttName, AttVal, CondScore)

GlobList Gx
  {AttName, AttVal, TID, GlobScore}
  B+ tree index on (AttName, AttVal, GlobScore)

Index Module

List Merge Algorithm
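A minimal sketch of how pre-computed ranked lists might be merged at query time, assuming a threshold-style merge in the spirit of Fagin's Threshold Algorithm and assuming that a tuple's final score is the product of its scores in the participating lists. The function name, the list contents, and the stopping rule are illustrative assumptions rather than the exact algorithm of the talk.

```python
import heapq
import math


def list_merge_top_k(lists, k):
    """Merge ranked lists of (tid, score) pairs, each sorted by score descending.

    Scores must be positive. A tuple qualifies only if it appears in every list.
    """
    lookup = [dict(ranked) for ranked in lists]  # random access by TID
    seen, top = set(), []                        # top is a min-heap of (score, tid)
    for depth in range(min(len(ranked) for ranked in lists)):
        for ranked in lists:                     # sorted access, one row per list
            tid = ranked[depth][0]
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in table for table in lookup):
                total = math.prod(table[tid] for table in lookup)
                if len(top) < k:
                    heapq.heappush(top, (total, tid))
                else:
                    heapq.heappushpop(top, (total, tid))
        # No unseen tuple can score higher than the product of the current row.
        threshold = math.prod(ranked[depth][1] for ranked in lists)
        if len(top) == k and top[0][0] >= threshold:
            break
    return [(tid, total) for total, tid in sorted(top, reverse=True)]


# Hypothetical lists for the query City=Seattle AND View=Waterfront.
cond_seattle = [(1, 3.0), (4, 2.0), (2, 0.5)]
cond_waterfront = [(4, 4.0), (1, 1.5), (3, 1.0)]
glob_seattle = [(1, 2.0), (2, 1.8), (4, 1.0)]

result = list_merge_top_k([cond_seattle, cond_waterfront, glob_seattle], k=1)
print(result)
```

Because all scores are positive, the product is monotone in each list's score, which is what makes the early stop safe: once the k-th best score reaches the threshold, deeper rows cannot change the result.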

Experimental Setup

Datasets:
  MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
  Internet Movie Database (http://www.imdb.com)

Software and Hardware:
  Microsoft SQL Server 2000 RDBMS
  P4 2.8-GHz PC, 1 GB RAM
  C#, connected to RDBMS through DAO

Quality Experiments

Conducted on Seattle Homes and Movies tables

Collect a workload from users

Compare the Conditional Ranking Method in the paper with the Global Method [CIDR03]

Quality Experiment - Average Precision

For each query Qi, generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples

Let each user mark 10 tuples in Hi as most relevant to Qi

Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
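The exact matching formula is not spelled out here. A simple reading, assumed for this sketch, is the fraction of the algorithm's top 10 that the user also marked as relevant. The function name and the sample TIDs are hypothetical.

```python
def precision_at_k(user_marked, returned, k=10):
    """Fraction of the top-k returned tuples that the user marked as relevant."""
    return len(set(user_marked) & set(returned[:k])) / k


# Hypothetical TIDs: the user marked tuples 1..10; the algorithm returned 5 of them.
user_marked = list(range(1, 11))
returned = [1, 2, 3, 4, 5, 11, 12, 13, 14, 15]

print(precision_at_k(user_marked, returned))
```

Averaging this value over all queries and users gives one number per ranking algorithm, which is what allows the two methods to be compared.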

Quality Experiment - Fraction of Users Preferring Each Algorithm

5 new queries
Users were given the top-5 results

Performance Experiments

Datasets

Table            NumTuples    Database Size (MB)
Seattle Homes    17463        1.936
US Homes         1380762      140.432

Compare 2 algorithms:
  Scan algorithm
  List Merge algorithm

Performance Experiments – Pre-computation Time

Performance Experiments – Execution Time

Conclusion and Open Problems

Automatic ranking for many-answers

Adaptation of PIR to DB

Open problems:
  Multiple-table queries
  Non-categorical attributes
