
Efficient IR-Style Keyword Search over Relational Databases

• Vagelis Hristidis

University of California, San Diego

• Luis Gravano

Columbia University

• Yannis Papakonstantinou

University of California, San Diego

Motivation

• Keyword search is the dominant information discovery method in documents

• Increasing amount of data is stored in databases

• Plain text coexists with structured data

Motivation

• Up until recently, information discovery in databases required:
  – Knowledge of the schema
  – Knowledge of a query language (e.g., SQL)
  – Knowledge of the role of the keywords

• Goal: Enable IR-style keyword search over DBMSs without the above requirements

IR-Style Search over DBMSs

• IR keyword search well developed for document search

• Modern DBMSs offer IR-style keyword search over individual text attributes

• What is the equivalent of a document in a database?

Example – Complaints Database Schema

Complaints(prodId, custId, date, comments)

Products(prodId, manufacturer, model)

Customers(custId, name, occupation)

Example - Complaints Database Data

Complaints

tupleId   prodId   custId   date        comments
c1        p121     c3232    6-30-2002   “disk crashed after just one week of moderate use on an IBM Netvista X41”
c2        p131     c3131    7-3-2002    “lower-end IBM Netvista caught fire, starting apparently with disk”
c3        p131     c3143    8-3-2002    “IBM Netvista unstable with Maxtor HD”

Products

tupleId   prodId   manufacturer   model
p1        p121     “Maxtor”       “D540X”
p2        p131     “IBM”          “Netvista”
p3        p141     “Tripplite”    “Smart 700VA”

Customers

tupleId   custId   name           occupation
u1        c3232    “John Smith”   “Software Engineer”
u2        c3131    “Jack Lucas”   “Architect”
u3        c3143    “John Mayer”   “Student”

Example – Keyword Query [Maxtor Netvista]














Keyword Query Semantics (definition of “document” in databases)

Keywords are:
• in the same tuple
• in the same relation
• in tuples connected through primary-foreign key relationships

Score of result:
• distance of keywords within a tuple
• distance between keywords in terms of primary-foreign key connections
• IR-style score of result tree















Results: (1) c3, (2) p2→c3, (3) p1→c1

Note: result (2) ranks higher than result (3) because the attribute score of comments for c3 is higher than that for c1.

Result of Keyword Query

Result is tree T of tuples where:

• each edge corresponds to a primary-foreign key relationship

• no tuple of T is redundant (minimality); that is, no tuple with zero score can be a leaf

• Query semantics:
  – “AND” query semantics: every query keyword appears in T
  – “OR” query semantics: some query keywords might be missing from T
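The conditions above can be checked mechanically. The following is a minimal sketch, not the authors' implementation: the function name `is_valid_result`, the dictionary `keywords_in`, and the shortcut "a tuple has zero score exactly when it contains no query keyword" are assumptions made for illustration. The caller is assumed to pass a connected tree whose edges already follow primary-foreign key relationships.

```python
from collections import Counter

def is_valid_result(nodes, edges, keywords_in, query, semantics="AND"):
    """Check minimality and AND/OR keyword coverage for a candidate tree T.

    nodes: tuple ids in T; edges: (id, id) pairs along key relationships;
    keywords_in: tuple id -> set of query keywords found in that tuple.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # A leaf has at most one neighbor; a single-node tree is its own leaf.
    leaves = [n for n in nodes if degree[n] <= 1]
    # Minimality: a leaf without any query keyword (zero score) is redundant.
    if any(not (keywords_in.get(n, set()) & query) for n in leaves):
        return False
    covered = set().union(*(keywords_in.get(n, set()) for n in nodes)) & query
    if semantics == "AND":
        return covered == query
    return bool(covered)  # OR: at least one keyword present

query = {"maxtor", "netvista"}
keywords_in = {
    "c1": {"netvista"},
    "c3": {"maxtor", "netvista"},
    "p1": {"maxtor"},
    "u3": set(),
}
print(is_valid_result(["c3"], [], keywords_in, query))                      # True
print(is_valid_result(["p1", "c1"], [("p1", "c1")], keywords_in, query))    # True
print(is_valid_result(["c3", "u3"], [("c3", "u3")], keywords_in, query))    # False
print(is_valid_result(["c1"], [], keywords_in, query, semantics="OR"))      # True
```

The third call fails because u3 is a leaf that contributes no keyword, so the tree c3, u3 is not minimal.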

Score of Result T

• Combining function Score combines scores of attribute values of T

• One reasonable choice:

$\mathrm{Score}(T) = \frac{\sum_{a \in T} \mathrm{Score}(a)}{\mathrm{size}(T)}$

• Attribute value scores Score(a) are calculated using the DBMS's IR “datablades”, which generally use well-known tf-idf IR ranking functions

Note: attributes are a better scoring unit than tuples because they represent the minimal unit of information. Attributes are grouped into tuples based on schema design decisions.
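The combining function can be sketched in a few lines. This is an illustrative sketch only: the function name `combined_score` and the dictionary representation of a tuple are assumptions, size(T) is taken to be the number of tuples in T, and the attribute scores are the ones used in the deck's [Maxtor Netvista] example.

```python
from fractions import Fraction

def combined_score(tree):
    """Sum the attribute-value scores of all tuples in T, divided by size(T).

    tree: non-empty list of tuples, each a dict of attribute name -> IR score.
    """
    if not tree:
        raise ValueError("a result tree must contain at least one tuple")
    total = sum(score for tup in tree for score in tup.values())
    return total / len(tree)

# Attribute scores for query [Maxtor Netvista], as given in the example.
c1 = {"comments": Fraction(1, 3)}
c3 = {"comments": Fraction(4, 3)}
p1 = {"manufacturer": Fraction(1), "model": Fraction(0)}
p2 = {"manufacturer": Fraction(0), "model": Fraction(1)}

print(combined_score([c3]))      # 4/3
print(combined_score([p2, c3]))  # 7/6
print(combined_score([p1, c1]))  # 2/3
```

Exact fractions are used so that the printed values can be compared directly with the hand-computed scores.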

Shortcomings of Prior Work

• Simplistic ranking methods (e.g., based only on the size of the connecting tree) that ignore well-studied IR ranking strategies

• No straightforward extension to improve efficiency by returning just the top-k results

• Poor handling of free-text attributes: for example, two attributes that contain a keyword are treated equally regardless of their total number of words [DBXplorer, DISCOVER]














Complaints scores: c1 = 1/3, c2 = 1/3, c3 = 4/3

Products scores: p1 = 1, p2 = 1, p3 = 0

Results: (1) c3, (2) p2→c3, (3) p1→c1

Score(c3) = 4/3

Score(p2→c3) = (1 + 4/3)/2 = 7/6

Score(p1→c1) = (1 + 1/3)/2 = 4/6

Note: result (2) ranks higher than result (3) because the attribute score of comments for c3 is higher than that for c1.

Architecture

• The user submits keywords, e.g., [Maxtor Netvista]

• The IR Engine (using the IR Index) produces the tuple sets:
  ComplaintsQ = [(c3, comments, 1.33), (c1, comments, 0.33), (c2, comments, 0.33)]
  ProductsQ = [(p1, manufacturer, 1), (p2, model, 1)]

• The Candidate Network Generator (using the Database Schema) produces the candidate networks:
  ComplaintsQ ⋈ ProductsQ
  ComplaintsQ ⋈ Customers{} ⋈ ComplaintsQ
  ComplaintsQ ⋈ Products{} ⋈ ComplaintsQ
  ...

• The Execution Engine runs parameterized, prepared SQL queries against the Database:
  ... SELECT * FROM ComplaintsQ c, ProductsQ p WHERE c.prodId = p.prodId AND c.prodId = ? AND c.custId = ?; ...

• The top-k joining trees of tuples are returned to the user: c3, p2→c3, p1→c1
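To make the execution step concrete, here is a small sketch that runs the parameterized query for the candidate network joining ComplaintsQ and ProductsQ. It uses Python's built-in `sqlite3` module as a stand-in for the commercial DBMS; materializing the tuple sets as tables with a `score` column, the `tree_score` expression, and the helper name `run_cn` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ComplaintsQ (tupleId TEXT, prodId TEXT, custId TEXT, score REAL);
CREATE TABLE ProductsQ   (tupleId TEXT, prodId TEXT, score REAL);
INSERT INTO ComplaintsQ VALUES
  ('c3', 'p131', 'c3143', 1.33),
  ('c1', 'p121', 'c3232', 0.33),
  ('c2', 'p131', 'c3131', 0.33);
INSERT INTO ProductsQ VALUES
  ('p1', 'p121', 1.0),
  ('p2', 'p131', 1.0);
""")

# Same join and parameters as the query on the slide; the score expression
# (sum of attribute scores divided by the tree size, 2) is added here.
CN_QUERY = """
SELECT p.tupleId, c.tupleId, (p.score + c.score) / 2.0 AS tree_score
FROM ComplaintsQ c, ProductsQ p
WHERE c.prodId = p.prodId AND c.prodId = ? AND c.custId = ?
"""

def run_cn(prod_id, cust_id):
    """Execute the prepared CN query for one (prodId, custId) binding."""
    return conn.execute(CN_QUERY, (prod_id, cust_id)).fetchall()

rows = run_cn("p131", "c3143")
for product, complaint, tree_score in rows:
    print(f"{product}->{complaint} score={tree_score:.3f}")
```

Binding the parameters to the key values of c3 returns the single joining tree p2→c3, which is how a pipelined algorithm can probe for the partners of one newly retrieved tuple.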

Candidate Network Generator

• Find all trees of tuple sets (free or non-free) that may produce a result, based on DISCOVER's CN generator [VLDB 2002]

• Use a single non-free tuple set for each relation
  – allows “OR” semantics
  – fewer CNs are generated
  – extra filtering step required for “AND” semantics

Candidate Network Generator Example

For query [Maxtor Netvista], CNs:
• ComplaintsQ
• ProductsQ
• ComplaintsQ ⋈ ProductsQ
• ComplaintsQ ⋈ Customers{} ⋈ ComplaintsQ
• ComplaintsQ ⋈ Products{} ⋈ ComplaintsQ

Non-CNs:
• ComplaintsQ ⋈ Customers{} ⋈ Complaints{}
• ProductsQ ⋈ Complaints{} ⋈ ProductsQ


Execution Algorithms

• Users usually want top-k results.

• Hence, submitting to the DBMS a SQL query for each CN (Naïve algorithm), as prior works do, is inefficient.

• When queries produce at most very few results, the Naïve algorithm is efficient, since it fully exploits the DBMS.

• Monotonic combining functions: if results T, T' have the same schema and Score(ai) ≤ Score(a'i) for every attribute, then Score(T) ≤ Score(T')

Sparse Algorithm: Example Execution

ComplaintsQ

tupleId   score
c2        7
c1        5
c3        1

ProductsQ

tupleId   score
p1        9
p2        6
p3        1

CN                        results   score           MFS
ProductsQ                 p1        9               9
ComplaintsQ               c2        7               7
ComplaintsQ ⋈ ProductsQ   p1→c1     (9+5)/2 = 7     (9+7)/2 = 8

• Best when the query produces at most a few results

Note: improves on the Naïve algorithm by discarding any CN with no chance of producing top-k results, given the results so far. Computes the Maximum Possible Score for each CN C by combining the top tuples of the non-free tuple sets of C.
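The pruning idea can be sketched as follows. This is a simplified illustration, not the authors' implementation: each CN is evaluated by a plain Python callback instead of a SQL query, the joining pairs (p1, c1) and (p2, c2) are assumed from the deck's examples, and the names `sparse_top_k` and `max_possible_score` are hypothetical.

```python
import heapq

complaints = {"c2": 7, "c1": 5, "c3": 1}
products = {"p1": 9, "p2": 6, "p3": 1}
joins = [("p1", "c1"), ("p2", "c2")]  # assumed joining pairs

def max_possible_score(tuple_sets):
    """Combine the top tuple of every non-free tuple set of the CN."""
    return sum(max(ts.values()) for ts in tuple_sets) / len(tuple_sets)

cns = [
    ("ComplaintsQ", [complaints],
     lambda: [((c,), s) for c, s in complaints.items()]),
    ("ProductsQ", [products],
     lambda: [((p,), s) for p, s in products.items()]),
    ("ComplaintsQ JOIN ProductsQ", [complaints, products],
     lambda: [((p, c), (products[p] + complaints[c]) / 2) for p, c in joins]),
]

def sparse_top_k(cns, k):
    top = []        # min-heap holding the best k (score, result) pairs so far
    executed = []   # names of the CNs that were actually evaluated
    ranked = sorted(cns, key=lambda cn: max_possible_score(cn[1]), reverse=True)
    for name, tuple_sets, evaluate in ranked:
        # Stop once the k-th best result is at least the bound of this CN;
        # the remaining CNs have even lower bounds.
        if len(top) == k and top[0][0] >= max_possible_score(tuple_sets):
            break
        executed.append(name)
        for result, score in evaluate():
            if len(top) < k:
                heapq.heappush(top, (score, result))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, result))
    return sorted(top, reverse=True), executed

results, executed = sparse_top_k(cns, k=2)
print(results)   # [(9, ('p1',)), (7.0, ('p1', 'c1'))]
print(executed)  # ['ProductsQ', 'ComplaintsQ JOIN ProductsQ']
```

With k = 2, the CN ComplaintsQ (bound 7) is never executed because the second-best result found so far already scores 7.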

Single Pipelined Algorithm: Example Execution

CN: ComplaintsQ ⋈ ProductsQ

ComplaintsQ

tupleId   score
c2        7
c1        5
c3        1

ProductsQ

tupleId   score
p1        9
p2        6
p3        1

At each step, get the next tuple from the most promising non-free tuple set:

• MPFS = Max[(5+9)/2, (7+6)/2] = 7
• MPFS = Max[(1+9)/2, (7+6)/2] = 6.5; results queue: p1→c1 (score 7); output: p1→c1
• MPFS = Max[(1+9)/2, (7+1)/2] = 5; results queue: p2→c2 (score 6.5); output: p2→c2

Note: operates on a single CN C. Reads and processes a minimum prefix from each non-free tuple set of C (ordered by descending score). Keeps the Maximum Possible Future Score (MPFS) of C. For each newly retrieved tuple t, submits all possible parameterized queries joining t to the already retrieved tuples of the other non-free tuple sets.
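The example execution can be reproduced with a short sketch. It is an illustration under stated assumptions rather than the authors' code: it handles only a CN with two non-free tuple sets, it looks up joining pairs in an in-memory set instead of submitting parameterized SQL queries, and the joining pairs (c1, p1) and (c2, p2) as well as the name `single_pipelined` are assumed.

```python
import heapq

def single_pipelined(ts_a, ts_b, joins, k):
    """ts_a, ts_b: non-empty lists of (tupleId, score), sorted by descending score.
    joins: set of (id_from_a, id_from_b) pairs that join."""
    sets = [ts_a, ts_b]
    pos = [1, 1]                    # tuples retrieved so far from each set
    queue, output, trace = [], [], []

    def push_if_joining(a, b):
        if (a[0], b[0]) in joins:
            heapq.heappush(queue, (-(a[1] + b[1]) / 2, (a[0], b[0])))

    def mpfs(i):
        # Best score any result using the next unseen tuple of set i can reach.
        if pos[i] >= len(sets[i]):
            return float("-inf")
        return (sets[i][pos[i]][1] + sets[1 - i][0][1]) / 2

    push_if_joining(ts_a[0], ts_b[0])
    while len(output) < k:
        bounds = [mpfs(0), mpfs(1)]
        bound = max(bounds)
        trace.append(bound)
        # Output queued results that no future result can beat.
        while queue and len(output) < k and -queue[0][0] >= bound:
            neg_score, result = heapq.heappop(queue)
            output.append((result, -neg_score))
        if len(output) == k or bound == float("-inf"):
            break
        i = bounds.index(bound)     # most promising tuple set
        new = sets[i][pos[i]]
        pos[i] += 1
        for old in sets[1 - i][:pos[1 - i]]:
            if i == 0:
                push_if_joining(new, old)
            else:
                push_if_joining(old, new)
    return output, trace

complaints = [("c2", 7), ("c1", 5), ("c3", 1)]
products = [("p1", 9), ("p2", 6), ("p3", 1)]
joins = {("c1", "p1"), ("c2", "p2")}  # assumed joining pairs

output, trace = single_pipelined(complaints, products, joins, k=2)
print(output)  # [(('c1', 'p1'), 7.0), (('c2', 'p2'), 6.5)]
print(trace)   # [7.0, 6.5, 5.0]
```

The trace matches the three MPFS values of the example, and the two results are released in the same order as on the slide.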

Global Pipelined Algorithm : Example Execution

• Queue of CN processes, ordered by ascending MPFS: C4, C5, C1, C3, C2

• Processing Unit: C3, with MPFS3 = 3.5

Complaints

tupleId   score
c2        7
c1        5
c3        1

Products

tupleId   score
p1        9
p2        6
p3        1

Results queue

result    score
p1→c1     7
p2→c2     6.5

• Global MPFS = max(MPFSi) over all CNs Ci

• Best when the query produces many results.

Note: executes an instance of the Single Pipelined algorithm for each CN in a round-robin manner. Keeps a global MPFS = max(MPFSi) over all CNs Ci. In each step, retrieves the next tuple from the most promising tuple set of the most promising CN.
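The scheduling loop can be sketched in isolation. To keep the example short, each CN process below is deliberately reduced to a CN with a single non-free tuple set, so its MPFS is simply the score of its next unseen tuple; the class `SingleSetCN`, the function `global_pipelined`, and this simplification are assumptions for illustration, not the authors' design.

```python
import heapq

class SingleSetCN:
    """Simplified CN process: one non-free tuple set, sorted by descending score."""

    def __init__(self, name, tuples):
        self.name = name
        self.tuples = tuples
        self.pos = 0

    @property
    def mpfs(self):
        if self.pos >= len(self.tuples):
            return float("-inf")
        return self.tuples[self.pos][1]

    def step(self):
        """Retrieve the next tuple and return the results it produces."""
        result = self.tuples[self.pos]
        self.pos += 1
        return [result]

def global_pipelined(processes, k):
    queue, output = [], []
    while len(output) < k:
        best = max(processes, key=lambda p: p.mpfs)  # most promising CN
        global_mpfs = best.mpfs                      # max(MPFSi) over all CNs
        # Release results that no CN can still beat.
        while queue and len(output) < k and -queue[0][0] >= global_mpfs:
            neg_score, result = heapq.heappop(queue)
            output.append((result, -neg_score))
        if len(output) == k or global_mpfs == float("-inf"):
            break
        for result, score in best.step():
            heapq.heappush(queue, (-score, result))
    return output

processes = [
    SingleSetCN("ComplaintsQ", [("c2", 7), ("c1", 5), ("c3", 1)]),
    SingleSetCN("ProductsQ", [("p1", 9), ("p2", 6), ("p3", 1)]),
]
top3 = global_pipelined(processes, k=3)
print(top3)  # [('p1', 9), ('c2', 7), ('p2', 6)]
```

A full version would plug Single Pipelined instances for multi-relation CNs into the same loop; only the `mpfs` and `step` interface matters to the scheduler.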

Hybrid Algorithm

• Estimate the number of results.
  – For “OR” semantics, use the DBMS estimator
  – For “AND” semantics, probabilistically adjust the DBMS estimator.

• If at most a few query results expected, then use Sparse Algorithm.

• If many query results expected, then use Global Pipelined Algorithm.
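The selection rule amounts to a single comparison. The sketch below assumes the result estimate is already available and introduces a hypothetical cutoff parameter, `few_results_threshold`; the slides do not state the actual cutoff or how the “AND” adjustment is computed.

```python
def choose_algorithm(estimated_results, few_results_threshold):
    """Pick the execution algorithm from an estimate of the number of results.

    few_results_threshold is a hypothetical tuning parameter.
    """
    if estimated_results <= few_results_threshold:
        return "Sparse"
    return "Global Pipelined"

print(choose_algorithm(3, few_results_threshold=10))     # Sparse
print(choose_algorithm(5000, few_results_threshold=10))  # Global Pipelined
```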


Related Work

• DBXplorer [ICDE 2002], DISCOVER [VLDB 2002]
  – Similar three-step architecture
  – Score = 1/size(T)
  – Only AND semantics
  – No straightforward extension for efficient top-k execution (they submit a SQL query for each CN)

• BANKS [ICDE 2002], Goldman et al. [VLDB 1998]
  – Database viewed as graph
  – No use of schema

• Florescu et al. [WWW 2000], XQuery Full-Text

• Ilyas et al. [VLDB 2003], J* algorithm [VLDB 2001]
  – Top-k algorithms for join queries

Experiments – DBLP Dataset

Schema:
C(cid, name): Conference
Y(yid, year, cid): Year
P(pid, title, yid): Paper
A(aid, name): Author
PP(pid1, pid2)
PA(pid, aid)

DBLP contains few citation edges.

Synthetic citation edges were added such that the average number of citations is 20.

Final dataset is 56MB.

Experiments were run over a state-of-the-art commercial RDBMS.

OR Semantics: Effect of Maximum Allowed CN Size

[Figure: average execution time (msec, log scale from 10 to 1,000,000) versus max CN size (2 to 7) for Naive, Sparse, SA, SASymmetric, GA, GASymmetric, and Hybrid]

Average execution time of 100 2-keyword top-10 queries

Note: only results for OR semantics are shown in the presentation.

OR Semantics: Effect of Number of Objects Requested k

Average execution time of 100 2-keyword queries with maximum candidate-network size of 6

[Figure: average execution time (msec, log scale from 10 to 1,000,000) versus k (1 to 20) for Naive, Sparse, SA, SASymmetric, GA, GASymmetric, and Hybrid]

OR Semantics: Effect of Number of Query Keywords

Average execution time of 100 top-10 queries with maximum candidate-network size of 6

[Figure: average execution time (msec, log scale from 100 to 100,000) versus number of query keywords (2 to 5) for Naive, Sparse, GA, GASymmetric, and Hybrid]

Conclusions

• Extend IR-style ranking to databases.

• Exploit text-search capabilities of modern DBMSs to generate results of higher quality.

• Support both “AND” and “OR” semantics.

• Achieve substantial speedup over prior work via pipelined top-k query processing algorithms.

Questions?

Compare algorithms wrt Result size

[Figure, two panels: execution time (msec, log scale from 1 to 100,000) versus total # results for GA and Sparse. Panel labeled OR-semantics: total # results from 0 to 6000. Panel labeled AND-semantics: total # results from 0 to 700.]

Max CN size = 6, top-10, 2 keywords

Ranking Functions

• Proposed algorithms support tuple monotone combining functions

• That is, if results T, T' have same schema and for every attribute Score(ai)≤Score(a'i) then Score(T)≤Score(T')
