efficient ir-style keyword search over relational databases vagelis hristidis university of...
TRANSCRIPT
Efficient IR-Style Keyword Searchover Relational Databases
• Vagelis Hristidis
University of California, San Diego
• Luis Gravano
Columbia University
• Yannis Papakonstantinou
University of California, San Diego
Motivation
• Keyword search is the dominant information discovery method in documents
• Increasing amount of data is stored in databases
• Plain text coexists with structured data
Motivation
• Up until recently, information discovery in databases required:– Knowledge of schema– Knowledge of a query language (e.g., SQL)– Knowledge of the role of the keywords
• Goal: Enable IR-style keyword search over DBMSs without the above requirements
IR-Style Search over DBMSs
• IR keyword search well developed for document search
• Modern DBMSs offer IR-style keyword search over individual text attributes
• What is equivalent to document in databases?
Example – Complaints Database Schema
ComplaintsprodIdcustIddatecomments
ProductsprodIdmanufacturermodel
CustomerscustIdnameoccupation
Example - Complaints Database Data
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 “disk crashed after just one week of moderate use on an IBM Netvista X41”
c2 p131 c3131 7-3-2002 “lower-end IBM Netvista caught fire, starting apparently with disk”
c3 p131 c3143 8-3-2002 “IBM Netvista unstable with Maxtor HD”
tupleId prodId manufacturer model
p1 p121 “Maxtor” “D540X”
p2 p131 “IBM” “Netvista”
p3 p141 “Tripplite” “Smart 700VA”
tupleId custId name occupation
u1 c3232 “John Smith”
“Software Engineer”
u2 c3131 “Jack Lucas”
“Architect”
u3 c3143 “John Mayer”
“Student”
Complaints
Customers Products
Example – Keyword Query [Maxtor Netvista]
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 “disk crashed after just one week of moderate use on an IBM Netvista X41”
c2 p131 c3131 7-3-2002 “lower-end IBM Netvista caught fire, starting apparently with disk”
c3 p131 c3143 8-3-2002 “IBM Netvista unstable with Maxtor HD”
tupleId prodId manufacturer model
p1 p121 “Maxtor” “D540X”
p2 p131 “IBM” “Netvista”
p3 p141 “Tripplite” “Smart 700VA”
tupleId custId name occupation
u1 c3232 “John Smith”
“Software Engineer”
u2 c3131 “Jack Lucas”
“Architect”
u3 c3143 “John Mayer”
“Student”
Complaints
Customers Products
Keyword Query Semantics (definition of “document” in databases)
Keywords are:• in same tuple• in same relation• in tuples connected through primary-foreign key
relationships
Score of result:• distance of keywords within a tuple• distance between keywords in terms of primary-
foreign key connections• IR-style score of result tree
Example – Keyword Query [Maxtor Netvista]
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 “disk crashed after just one week of moderate use on an IBM Netvista X41”
c2 p131 c3131 7-3-2002 “lower-end IBM Netvista caught fire, starting apparently with disk”
c3 p131 c3143 8-3-2002 “IBM Netvista unstable with Maxtor HD”
tupleId prodId manufacturer model
p1 p121 “Maxtor” “D540X”
p2 p131 “IBM” “Netvista”
p3 p141 “Tripplite” “Smart 700VA”
tupleId custId name occupation
u1 c3232 “John Smith”
“Software Engineer”
u2 c3131 “Jack Lucas”
“Architect”
u3 c3143 “John Mayer”
“Student”
Complaints
Customers Products
Results: (1) c3, (2) p2 c3, (3) p1 c1
Result of Keyword Query
Result is tree T of tuples where:
• each edge corresponds to a primary-foreign key relationship
• no tuple of T is redundant (minimality)
• - “AND” query semantics: Every query keyword appears in T
- “OR” query semantics: Some query keywords might be missing from T
Score of Result T
• Combining function Score combines scores of attribute values of T
• One reasonable choice:
Score=aTScore(a)/size(T)
• Attribute value scores Score(a) calculated using the DBMS's IR “datablades”
Shortcomings of Prior Work
• Simplistic ranking methods (e.g., based only on size of connecting tree), ignoring well-studied IR ranking strategies
• No straightforward extension to improve efficiency by returning just top-k results
• Not good in handling free-text attributes[DBXplorer,DISCOVER]
Example – Keyword Query [Maxtor Netvista]
tupleId prodId custId date comments
c1 p121 c3232 6-30-2002 “disk crashed after just one week of moderate use on an IBM Netvista X41”
c2 p131 c3131 7-3-2002 “lower-end IBM Netvista caught fire, starting apparently with disk”
c3 p131 c3143 8-3-2002 “IBM Netvista unstable with Maxtor HD”
tupleId prodId manufacturer model
p1 p121 “Maxtor” “D540X”
p2 p131 “IBM” “Netvista”
p3 p141 “Tripplite” “Smart 700VA”
tupleId custId name occupation
u1 c3232 “John Smith”
“Software Engineer”
u2 c3131 “Jack Lucas”
“Architect”
u3 c3143 “John Mayer”
“Student”
Complaints
Customers Products
Results: (1) c3, (2) p2 c3, (3) p1 c1
Score(c3) = 4/3
Score(p2 c3) = (1+4/3)/2 = 7/6
Score(p1 c1) = (1+1/3)/2 = 4/6
score
1/3
1/3
4/3
score
1
1
0
Architecture
Keywords
IR Engine
Tuple Sets
Candidate NetworkGenerator
DatabaseSchema
Execution Engine
Database
Candidate Networks
Parameterized,Prepared
SQL Queries
Top-k Joining Treesof Tuples
User
IR Index
ComplaintsQ = [(c3,comments,1.33), (c1,comments,0.33), (c2,comments,0.33)] ProductsQ = [(p1,manufacturer,1), (p2,model,1)]
ComplaintsQ ProductsQ
ComplaintsQ ProductsQ
ComplaintsQ Customer{} ComplaintsQ
ComplaintsQ Product{} ComplaintsQ
...SELECT * FROM ComplaintsQ c, ProductsQ pWHERE c.prodId = p.prodId AND c.prodId=? AND c.custId = ?;...
[Maxtor Netvista]
c3p2 c3p1 c2
Architecture
Keywords
IR Engine
Tuple Sets
Candidate NetworkGenerator
DatabaseSchema
Execution Engine
Database
Candidate Networks
Parameterized,Prepared
SQL Queries
Top-k Joining Treesof Tuples
User
IR Index
ComplaintsQ = [(c3,comments,1.33), (c1,comments,0.33), (c2,comments,0.33)] ProductsQ = [(p1,manufacturer,1), (p2,model,1)]
ComplaintsQ ProductsQ
ComplaintsQ ProductsQ
ComplaintsQ Customer{} ComplaintsQ
ComplaintsQ Product{} ComplaintsQ
...SELECT * FROM ComplaintsQ c, ProductsQ pWHERE c.prodId = p.prodId AND c.prodId=? AND c.custId = ?;...
[Maxtor Netvista]
c3p2 c3p1 c2
Candidate Network Generator
• Find all trees of tuple sets (free or non-free) that may produce a result, based on DISCOVER's CN generator [VLDB 2002]
• Use single non-free tuple set for each relation– allows “OR” semantics
– fewer CNs are generated
– extra filtering step required for “AND” semantics
Candidate Network Generator Example
For query [Maxtor Netvista], CNs:• ComplaintsQ • ProductsQ
• ComplaintsQ ProductsQ
• ComplaintsQ Customer{} ComplaintsQ
• ComplaintsQ Product{} ComplaintsQ
Non-CNs:• ComplaintsQ Customer{} Complaints{}
• Product Q Complaints{} ProductQ
Architecture
Keywords
IR Engine
Tuple Sets
Candidate NetworkGenerator
DatabaseSchema
Execution Engine
Database
Candidate Networks
Parameterized,Prepared
SQL Queries
Top-k Joining Treesof Tuples
User
IR Index
c3p2 c3p1 c2
ComplaintsQ = [(c3,comments,1.33), (c1,comments,0.33), (c2,comments,0.33)] ProductsQ = [(p1,manufacturer,1), (p2,model,1)]
ComplaintsQ ProductsQ
ComplaintsQ ProductsQ
ComplaintsQ Customer{} ComplaintsQ
ComplaintsQ Product{} ComplaintsQ
...SELECT * FROM ComplaintsQ c, ProductsQ pWHERE c.prodId = p.prodId AND c.prodId=? AND c.custId = ?;...
[Maxtor Netvista]
Execution Algorithms
• Users usually want top-k results.• Hence, submitting to DBMS a SQL query for
each CN (Naïve algorithm) is inefficient.• When queries produce at most very few
results, Naïve algorithm is efficient, since it fully exploits DBMS.
• Monotonic combining functions: if results T, T' have same schema and for every attribute Score(ai)≤Score(a'i) then Score(T)≤Score(T')
Sparse Algorithm: Example Execution
CN results score MFS
ProductsQ
ComplaintsQ
ComplaintsQ ProductsQ
ComplaintsQ tupleId score c2 7 c1 5 c3 1
ProductsQ tupleId score p1 9 p2 6 p3 1
c2 7 7
p1 9 9
c1 p1 (9+5)/2=7 (9+7)/2 = 8
•Best when query produces at most a few results
Single Pipelined Algorithm: Example Execution
ComplaintsQ tupleId score c2 7 c1 5 c3 1
ProductsQ tupleId score p1 9 p2 6 p3 1
CN: ComplaintsQ ProductsQ
MPFS = Max[(5+9)/2, (7+6)/2]=7Max[(1+9)/2, (7+6)/2]=6.5 result score
Results queue
p1→c1 7
Output: p1→c1
Max[(1+9)/2, (7+1)/2]=5
p2→c2 6.5
p2→c2
Get next tuple from
most promising non-free tuple set
Global Pipelined Algorithm : Example Execution
C4C5C1C3C2
C3
Queue of CN processesordered by ascending
MPFS
Processing Unit
Output
MPFS3 = 3.5
Complaints tupleId score c2 7 c1 5 c3 1
Products tupleId score p1 9 p2 6 p3 1
scoreresult scoreresult
Results queue
p1→c1 7
p2→c2 6.5
global MPFS=max(MPFSi) over all CNs Ci
• Best when query produces many results.
Hybrid Algorithm
• Estimate number of results.– For “OR”-semantics, use DBMS estimator– For “AND”-semantics, probabilistically
adjust DBMS estimator.
• If at most a few query results expected, then use Sparse Algorithm.
• If many query results expected, then use Global Pipelined Algorithm.
Related Work• DBXplorer [ICDE 2002], DISCOVER [VLDB 2002]
– Similar three-step architecture– Score = 1/size(T)– Only AND semantics– No straightforward extension for efficient top-k execution
• BANKS [ICDE 2002], Goldman et al. [VLDB 1998]– Database viewed as graph– No use of schema
• Florescu et al. [WWW 2000], XQuery Full-Text• Ilyas et al. [VLDB 2003], J* algorithm [VLDB 2001]
– Top-k algorithms for join queries
Experiments – DBLP DatasetC(cid,name)
Y(yid,year,cid)
P(pid,title,yid)
A(aid,name)
PP(pid1,pid2)
PA(pid,aid)
DBLP contains few citation edges.
Synthetic citation edges were added such that average # citations is 20.
Final dataset is 56MB.
Experiments run over state-of-the-art commercial RDBMS.
C: Conference
Y: Year
P: Paper
A: Author
OR Semantics: Effect of Maximum Allowed CN Size
10
100
1000
10000
100000
1000000
2 3 4 5 6 7
max CN size
mse
c
Naive Sparse SA SASymmetric GA GASymmetric Hybrid
Average execution time of 100 2-keyword top-10 queries
OR Semantics: Effect of Number of Objects Requested k
Average execution time of 100 2-keyword queries with maximum candidate-network size of 6
10
100
1000
10000
100000
1000000
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
k
ms
ec
Naive Sparse SA SASymmetric GA GASymmetric Hybrid
OR Semantics: Effect of Number of Query Keywords
Average execution time of 100 top-10 queries with maximum candidate-network size of 6
100
1000
10000
100000
2 3 4 5#keywords
mse
c
Naive Sparse GA GASymmetric Hybrid
Conclusions
• Extend IR-style ranking to databases.• Exploit text-search capabilities of modern
DBMSs, to generate results of higher quality.• Support both “AND” and “OR” semantics.• Achieve substantial speedup over prior work
via pipelined top-k query processing algorithms.
Questions?
Compare algorithms wrt Result size
1
10
100
1000
10000
100000
0 100 500 1000 2000 6000total # results
mse
c
GA Sparse
OR-semantics
1
10
100
1000
10000
100000
0 5 20 100 200 700
total # results
mse
c
GA Sparse
Max CN size = 6, top-10, 2 keywords, OR-semantics
AND-semantics
Ranking Functions
• Proposed algorithms support tuple monotone combining functions
• That is, if results T, T' have same schema and for every attribute Score(ai)≤Score(a'i) then Score(T)≤Score(T')