probabilistic ranking of database query results

Post on 31-Dec-2015

Probabilistic Ranking of Database Query Results

Surajit Chaudhuri, Microsoft Research
Gautam Das, Microsoft Research
Vagelis Hristidis, Florida International University
Gerhard Weikum, MPI Informatik

Presented by Weimin He, CSE@UTA

04/19/23 Weimin He CSE@UTA 2

Outline

Motivation
Problem Definition
System Architecture
Construction of Ranking Function
Implementation
Experiments
Conclusion and Open Problems

Motivating example

Realtor DB: Table D = (TID, Price, City, Bedrooms, Bathrooms, LivingArea, SchoolDistrict, View, Pool, Garage, BoatDock)

SQL query: Select * From D Where City=Seattle AND View=Waterfront

Motivation

Many-answers problem

Two alternative solutions:
Query reformulation
Automatic ranking

Apply probabilistic model in IR to DB tuple ranking

Problem Definition

Given a database table D with n tuples {t1, …, tn} over a set of m categorical attributes A = {A1, …, Am}

and a query Q: SELECT * FROM D WHERE X1=x1 AND … AND Xs=xs

where each Xi is an attribute from A and xi is a value in its domain.

The set of attributes X = {X1, …, Xs} is known as the set of attributes specified by the query, while the set Y = A – X is known as the set of unspecified attributes.

Let S ⊆ {t1, …, tn} be the answer set of Q.

How to rank tuples in S and return top-k tuples to the user?

System Architecture

Intuition for Ranking Function

Select * From D Where City=“Seattle” And View=“Waterfront”

Score of a Result Tuple t depends on:

Global Score: Global Importance of Unspecified Attribute Values
E.g., Homes with good school districts are globally desirable

Conditional Score: Correlations between Specified and Unspecified Attribute Values
E.g., Waterfront ⇒ BoatDock

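Probabilistic Model in IR

Bayes’ Rule:

$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)}$$

Product Rule:

$$p(a, b \mid c) = p(a \mid c)\, p(b \mid a, c)$$

Document t, Query Q
R: Relevant document set
$\bar{R} = D - R$: Irrelevant document set

$$\mathrm{Score}(t) = \frac{p(R \mid t)}{p(\bar{R} \mid t)} = \frac{p(t \mid R)\, p(R) / p(t)}{p(t \mid \bar{R})\, p(\bar{R}) / p(t)} \propto \frac{p(t \mid R)}{p(t \mid \bar{R})}$$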

Vagelis Hristidis
Let's see how adapting PIR techniques to our problem lets us create a ranking function.

Adaptation of PIR to DB

Tuple t is considered as a document

Partition t into t(X) and t(Y)
t(X) and t(Y) are written as X and Y

Derive from initial scoring function until final ranking function is obtained

Preliminary Derivation

Limited Independence Assumptions

Given a query Q and a tuple t, the X (and Y) values within themselves are assumed to be independent, though dependencies between the X and Y values are allowed

$$p(X \mid C) = \prod_{x \in X} p(x \mid C)$$

$$p(Y \mid C) = \prod_{y \in Y} p(y \mid C)$$

Continuing Derivation

Workload-based Estimation of p(y | R)

Assume a collection of “past” queries exists in the system

Workload W is represented as a set of “tuples”

Given query Q and specified attribute set X, approximate R as all query “tuples” in W that also request X

All properties of the relevant tuple set R can be obtained by examining only the subset of the workload that contains queries that also request X

$$p(y \mid R) \approx p(y \mid X, W)$$
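
The approximation of R by the matching part of the workload can be sketched as follows. This is a minimal illustration, assuming each past query is stored as a dict from attribute name to requested value and that "requests X" means the query specifies the same values as Q; the function name `estimate_p_y_given_r` and the toy workload are hypothetical and not taken from the presentation.

```python
from typing import Dict, List, Tuple

Query = Dict[str, str]   # one workload "tuple": attribute name -> requested value
Item = Tuple[str, str]   # (attribute name, attribute value)


def estimate_p_y_given_r(workload: List[Query], specified: Query, y: Item) -> float:
    """Estimate p(y | R) as p(y | X, W): the share of past queries that
    request all specified values X and also request the value y."""
    attr, val = y
    # Assumption: a past query "requests X" if it contains every specified value.
    relevant = [q for q in workload
                if all(q.get(a) == v for a, v in specified.items())]
    if not relevant:
        return 0.0  # no evidence in the workload
    return sum(1 for q in relevant if q.get(attr) == val) / len(relevant)


# Hypothetical toy workload of four past queries.
W = [
    {"View": "Waterfront", "BoatDock": "Yes"},
    {"View": "Waterfront", "BoatDock": "Yes", "City": "Seattle"},
    {"View": "Waterfront", "SchoolDistrict": "Excellent"},
    {"City": "Seattle", "BoatDock": "Yes"},
]

# Three queries request View=Waterfront; two of them also request BoatDock=Yes.
est = estimate_p_y_given_r(W, {"View": "Waterfront"}, ("BoatDock", "Yes"))
```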

Final Ranking Function
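
The formula on this slide was an image and did not survive in the transcript. A form consistent with the four atomic quantities p(y | W), p(y | D), p(x | y, W) and p(x | y, D) that the deck pre-computes is given below; it is stated here as an assumption, not as a transcription of the slide.

$$\mathrm{Score}(t) \propto \prod_{y \in Y} \frac{p(y \mid W)}{p(y \mid D)} \; \prod_{y \in Y} \prod_{x \in X} \frac{p(x \mid y, W)}{p(x \mid y, D)}$$

Under this reading, the first product would correspond to the global score (importance of unspecified values) and the double product to the conditional score (correlations between specified and unspecified values).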

Pre-computing Atomic Probabilities in Ranking Function

p(y | W): Relative frequency in W

p(y | D): Relative frequency in D

p(x | y, W): (# of tuples in W that contain x, y) / total # of tuples in W

p(x | y, D): (# of tuples in D that contain x, y) / total # of tuples in D

Example for Computing Atomic Probabilities

Select * From D Where City=“Seattle” And View=“Waterfront”

Y={SchoolDistrict, BoatDock, …}

D=10,000 W=1000 W{excellent}=10 W{waterfront & yes}=5

p(excellent|W) = 10/1000 = 0.01
p(excellent|D) = 10/10,000 = 0.001
p(waterfront|yes,W) = 5/1000 = 0.005
p(waterfront|yes,D) = 5/10,000 = 0.0005
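
The relative-frequency computations can be sketched in Python as follows. This is a minimal illustration, assuming tuples (of D or of W) are dicts from attribute name to value; the function names `p_value` and `p_pair` and the toy rows are hypothetical. `p_pair` follows the definition as stated on the slide: tuples containing both values, divided by all tuples.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]     # one tuple of D, or one workload "tuple" of W
Item = Tuple[str, str]   # (attribute name, attribute value)


def p_value(rows: List[Row], y: Item) -> float:
    """Relative frequency of the attribute value y among rows."""
    attr, val = y
    return sum(1 for r in rows if r.get(attr) == val) / len(rows)


def p_pair(rows: List[Row], x: Item, y: Item) -> float:
    """Fraction of rows that contain both x and y (the slide's definition)."""
    (xa, xv), (ya, yv) = x, y
    both = sum(1 for r in rows if r.get(xa) == xv and r.get(ya) == yv)
    return both / len(rows)


# Hypothetical toy workload of four past queries.
W = [
    {"View": "Waterfront", "BoatDock": "Yes"},
    {"View": "Waterfront", "SchoolDistrict": "Excellent"},
    {"City": "Seattle", "SchoolDistrict": "Excellent"},
    {"City": "Seattle"},
]

p_excellent = p_value(W, ("SchoolDistrict", "Excellent"))           # 2 of 4 rows
p_wf_dock = p_pair(W, ("View", "Waterfront"), ("BoatDock", "Yes"))  # 1 of 4 rows
```

The same two functions apply unchanged to the data table D, since both D and W are treated as collections of attribute-value tuples.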

Indexing Atomic Probabilities

p(y | W): {AttName, AttVal, Prob}
B+ tree index on (AttName, AttVal)

p(y | D): {AttName, AttVal, Prob}
B+ tree index on (AttName, AttVal)

p(x | y, W): {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}
B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

p(x | y, D): {AttNameLeft, AttValLeft, AttNameRight, AttValRight, Prob}
B+ tree index on (AttNameLeft, AttValLeft, AttNameRight, AttValRight)

Scan Algorithm

Preprocessing - Atomic Probabilities Module
Computes and Indexes the Quantities P(y | W), P(y | D), P(x | y, W), and P(x | y, D) for All Distinct Values x and y

Execution
Select Tuples that Satisfy the Query
Scan and Compute Score for Each Result-Tuple
Return Top-K Tuples
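
The execution phase can be sketched as follows. This is a rough illustration under stated assumptions, not the presenters' implementation: the score is assumed to be the product of p(y|W)/p(y|D) over unspecified values y, times p(x|y,W)/p(x|y,D) over all pairs of specified x and unspecified y; the pre-computed probabilities are held in plain dicts rather than indexed tables; and `EPS`, `score`, `scan_top_k` and the toy data are hypothetical names and values.

```python
import heapq
from typing import Dict, List, Tuple

Row = Dict[str, str]
Item = Tuple[str, str]  # (attribute name, attribute value)
EPS = 1e-6  # assumed stand-in for probabilities missing from the lookup tables


def score(row: Row, query: Row, p_y: dict, p_xy: dict) -> float:
    """Assumed score: prod_y p(y|W)/p(y|D) * prod_{x,y} p(x|y,W)/p(x|y,D)."""
    xs = sorted(query.items())
    ys = [(a, v) for a, v in sorted(row.items())
          if a not in query and a != "TID"]
    s = 1.0
    for y in ys:
        s *= p_y.get(("W", y), EPS) / p_y.get(("D", y), EPS)
        for x in xs:
            s *= p_xy.get(("W", x, y), EPS) / p_xy.get(("D", x, y), EPS)
    return s


def scan_top_k(table: List[Row], query: Row, k: int,
               p_y: dict, p_xy: dict) -> List[Row]:
    # Step 1: select the tuples that satisfy every condition of the query.
    answers = [r for r in table
               if all(r.get(a) == v for a, v in query.items())]
    # Steps 2 and 3: score each result tuple and keep the k best.
    return heapq.nlargest(k, answers, key=lambda r: score(r, query, p_y, p_xy))


# Hypothetical toy data.
D = [
    {"TID": "t1", "City": "Seattle", "View": "Waterfront", "BoatDock": "Yes"},
    {"TID": "t2", "City": "Seattle", "View": "Waterfront", "BoatDock": "No"},
    {"TID": "t3", "City": "Bellevue", "View": "Waterfront", "BoatDock": "Yes"},
]
Q = {"City": "Seattle", "View": "Waterfront"}
P_Y = {("W", ("BoatDock", "Yes")): 0.4, ("D", ("BoatDock", "Yes")): 0.2,
       ("W", ("BoatDock", "No")): 0.1, ("D", ("BoatDock", "No")): 0.8}
P_XY = {("W", ("View", "Waterfront"), ("BoatDock", "Yes")): 0.3,
        ("D", ("View", "Waterfront"), ("BoatDock", "Yes")): 0.1}

top = scan_top_k(D, Q, 1, P_Y, P_XY)  # t1 outranks t2; t3 fails the query
```

Every answer tuple is scored, so the cost grows with the size of the answer set.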

Beyond Scan Algorithm

Scan algorithm is inefficient
Many tuples in the answer set

Another extreme
Pre-compute top-K tuples for all possible queries
Still infeasible in practice

Trade-off solution
Pre-compute ranked lists of tuples for all possible atomic queries
At query time, merge ranked lists to get top-K tuples

Two kinds of Ranked List

CondList Cx
{AttName, AttVal, TID, CondScore}
B+ tree index on (AttName, AttVal, CondScore)

GlobList Gx
{AttName, AttVal, TID, GlobScore}
B+ tree index on (AttName, AttVal, GlobScore)

Index Module

List Merge Algorithm
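
The figure for this slide is not reproduced in the transcript. As a rough illustration of merging pre-computed ranked lists at query time, the sketch below uses a generic threshold-style merge; it is a swapped-in textbook technique, not necessarily the algorithm on the slide. It assumes each list holds (TID, score) pairs sorted by descending positive score, that a tuple's final score is the product of its scores across all lists, and that a tuple missing from any list is not an answer. The name `merge_top_k` and the toy lists are hypothetical.

```python
import heapq
from typing import Dict, List, Tuple

RankedList = List[Tuple[str, float]]  # (TID, score), sorted by descending score


def merge_top_k(lists: List[RankedList], k: int) -> List[Tuple[str, float]]:
    """Threshold-style merge: stop once no unseen tuple can enter the top k."""
    if k <= 0 or not lists:
        return []
    lookup: List[Dict[str, float]] = [dict(lst) for lst in lists]  # random access
    seen = set()
    top: List[Tuple[float, str]] = []  # min-heap holding the best k (score, TID)
    depth = max(len(lst) for lst in lists)
    for i in range(depth):
        threshold = 1.0  # upper bound on the score of any tuple not yet seen
        for lst in lists:
            if i >= len(lst):
                threshold = 0.0  # list exhausted: unseen tuples cannot be in it
                continue
            tid, last = lst[i]  # sorted access
            threshold *= last
            if tid in seen:
                continue
            seen.add(tid)
            if all(tid in lk for lk in lookup):
                total = 1.0
                for lk in lookup:
                    total *= lk[tid]
                heapq.heappush(top, (total, tid))
                if len(top) > k:
                    heapq.heappop(top)  # drop the current worst
        if len(top) == k and top[0][0] >= threshold:
            break  # early termination: the top k cannot change any more
    return [(tid, s) for s, tid in sorted(top, reverse=True)]


# Hypothetical ranked lists for one query.
cond = [("t1", 3.0), ("t2", 1.5), ("t3", 0.5)]
glob = [("t2", 4.0), ("t1", 1.0), ("t3", 0.2)]

result = merge_top_k([cond, glob], 2)
```

In this toy run the merge stops after reading two entries per list, without ever reaching t3, which is the saving over a full scan that the trade-off solution aims for.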

Experimental Setup

Datasets:
MSR HomeAdvisor Seattle (http://houseandhome.msn.com/)
Internet Movie Database (http://www.imdb.com)

Software and Hardware:
Microsoft SQL Server 2000 RDBMS
P4 2.8-GHz PC, 1 GB RAM
C#, Connected to RDBMS through DAO

Quality Experiments

Conducted on Seattle Homes and Movies tables

Collect a workload from users

Compare Conditional Ranking Method in the paper with the Global Method [CIDR03]

Quality Experiment-Average Precision

For each query Qi , generate a set Hi of 30 tuples likely to contain a good mix of relevant and irrelevant tuples

Let each user mark 10 tuples in Hi as most relevant to Qi

Measure how closely the 10 tuples marked by the user match the 10 tuples returned by each algorithm
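
One simple way to quantify how closely the two sets of 10 tuples match is the fraction of returned tuples that the user also marked. The slide does not give the exact formula, so the sketch below is an assumption; the function name `overlap_precision` and the tuple IDs are hypothetical.

```python
from typing import Iterable


def overlap_precision(marked: Iterable[str], returned: Iterable[str]) -> float:
    """Fraction of the returned tuples that the user marked as relevant."""
    returned_set = set(returned)
    if not returned_set:
        return 0.0
    return len(set(marked) & returned_set) / len(returned_set)


# Hypothetical example: the user marked t1..t10, the algorithm returned t6..t15.
marked = [f"t{i}" for i in range(1, 11)]
returned = [f"t{i}" for i in range(6, 16)]

precision = overlap_precision(marked, returned)  # 5 shared tuples out of 10
```

Averaging this value over all queries and users would give one precision figure per ranking algorithm.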

Quality Experiment- Fraction of Users Preferring Each Algorithm

5 new queries
Users were given the top-5 results

Performance Experiments

Datasets

Table            NumTuples    Database Size (MB)
Seattle Homes    17463        1.936
US Homes         1380762      140.432

Compare 2 Algorithms:
Scan algorithm
List Merge algorithm

Performance Experiments – Pre-computation Time

Performance Experiments – Execution Time

Performance Experiments – Execution Time

Performance Experiments – Execution Time

Conclusion and Open Problems

Automatic ranking for many-answers

Adaptation of PIR to DB

Open problems:
Multiple-table queries
Non-categorical attributes
