

SPARK: Top-k Keyword Query in Relational Databases∗

Yi Luo    Xuemin Lin    Wei Wang
University of New South Wales
Australia
{luoyi, lxue, weiw}@cse.unsw.edu.au

Xiaofang Zhou
University of Queensland
Australia
[email protected]

ABSTRACT
With the increasing amount of text data stored in relational databases, there is a demand for RDBMS to support keyword queries over text data. As a search result is often assembled from multiple relational tables, traditional IR-style ranking and query evaluation methods cannot be applied directly.

In this paper, we study the effectiveness and the efficiency issues of answering top-k keyword queries in relational database systems. We propose a new ranking formula by adapting existing IR techniques based on a natural notion of virtual document. Compared with previous approaches, our new ranking method is simple yet effective, and agrees with human perceptions. We also study efficient query processing methods for the new ranking method, and propose algorithms that have minimal accesses to the database. We have conducted extensive experiments on large-scale real databases using two popular RDBMSs. The experimental results demonstrate significant improvement over the alternative approaches in terms of retrieval effectiveness and efficiency.

Categories and Subject Descriptors
H.2.4 [Database Management]: Query Processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Experimentation, Performance

Keywords
top-k, keyword search, relational database, information retrieval

1. INTRODUCTION
Integration of DB and IR technologies has been an active research topic recently [6]. One fundamental driving force is the fact that more and more text data are now stored in relational databases. Examples include commercial applications such as customer relation management systems (CRM), and personal or social applications such as Web blogs and wiki sites. Since the dominant form of querying free text is through keyword search, there is a natural demand for relational databases to support effective and efficient IR-style keyword queries.

∗This work was supported by ARC Grant DP0666428 and UNSW Goldstar Grant.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD'07, June 11–14, 2007, Beijing, China.
Copyright 2007 ACM 978-1-59593-686-8/07/0006 ...$5.00.

In this paper, we focus on the problem of supporting effective and efficient top-k keyword search in relational databases. While many RDBMSs support full-text search, they only allow retrieving relevant tuples from within the same relation. A unique feature of keyword search over RDBMSs is that search results are often assembled from relevant tuples in several relations such that they are inter-connected and collectively relevant to the keyword query [1, 3]. Supporting such a feature has a number of advantages. Firstly, data may have to be split and stored in different relations due to database normalization requirements. Such data will not be returned if keyword search is limited to only single relations. Secondly, it lowers the barrier for casual users to search databases, as it does not require users to have knowledge about query languages and database schemas. Thirdly, it helps to reveal interesting or unexpected relationships among entities [19]. Lastly, for websites with database back-ends, it provides a more flexible search method than the existing solution that uses a fixed set of pre-built template queries. For example, we issued a search of "2001 hanks" using the search interface on imdb.com, and failed to find relevant answers (Figure 1). In contrast, the same search on our system (on a database populated with imdb.com's data) will return the results shown in Table 1, where relevant tuples from multiple relations (marked in bold font) are joined together to form a meaningful answer to the query.

Figure 1: Searching “2001 hanks” on imdb.com

There has been much related work dedicated to keyword search in databases recently [12, 1, 16, 15, 3, 18, 19, 20, 22]. Among them, [15] first incorporates a state-of-the-art IR ranking formula to address the retrieval effectiveness issue. It also presents several efficient query execution algorithms optimized for returning the top-k relevant answers. The ranking formula is subsequently improved by Liu et al. [22] by using several refined weighting schemes. BANKS [3] and BANKS2 [18] took another approach by modelling the database content as a graph and proposed sophisticated ranking and query execution algorithms. Recently, [19, 20] studied theoretical aspects of efficient query processing for top-k keyword queries.

Table 1: Top-3 Search Results on Our System (query keyword matches highlighted in the original)

1  Movies: "Primetime Glick" (2001) Tom Hanks/Ben Stiller (#2.1)
2  Movies: "Primetime Glick" (2001) Tom Hanks/Ben Stiller (#2.1) ← ActorPlay: Character = Himself → Actors: Hanks, Tom
3  Actors: John Hanks ← ActorPlay: Character = Alexander Kerst → Movies: Rosamunde Pilcher - Wind uber dem Fluss (2001)

Despite the previous studies, there are still several issues with existing ranking methods, some of which may even lead to search results contradictory to human perception. In this paper, we analyze the shortcomings of previous approaches and propose a new ranking method by adapting existing IR ranking methods and principles to our problem based on a virtual document model. Our ranking method also takes into consideration other factors (e.g., completeness and size of a result). Another feature is the use of a single tuning parameter to inject AND or OR semantics into the ranking formula. The technical challenge with the new ranking method is that the final score of an answer is aggregated from the scores of its constituent tuples, yet the final score is not monotonic with respect to any of its sub-components. Existing work on top-k query optimization cannot be immediately applied, as it relies on the monotonicity of the rank aggregation function. Therefore, we also study efficient query processing methods optimized for our non-monotonic ranking function. We propose a skyline sweeping algorithm that achieves minimal database probing by using a monotonic score upper bounding function for our ranking formula. We also explore the idea of employing another non-monotonic upper bounding function to further reduce unnecessary database accesses, which results in the block pipeline algorithm. We have conducted extensive experiments on large-scale real databases on two popular RDBMSs. The experimental results demonstrate that our proposed approach is superior to the previous methods in terms of effectiveness and efficiency.

We summarize our contributions as:

• We propose a novel and nontrivial ranking method that adapts the state-of-the-art IR ranking methods to ranking heterogeneous joined results of database tuples. The new method addresses an important deficiency in the previous methods and results in substantial improvement of the quality of search results.
• We propose two algorithms, skyline sweeping and block pipeline, to provide an efficient query processing mechanism based on our new ranking method. The key challenge is that the non-monotonic nature of our ranking function renders existing top-k query processing techniques inapplicable. Our new algorithms are based on several novel score upper bounding functions. They also have the desirable feature of interacting minimally with the databases.
• We conduct comprehensive experiments on large-scale real databases containing up to ten million tuples. Our experiment results demonstrated that the new ranking method outperformed alternative approaches, and that our query processing algorithms delivered superior performance to previous ones.

Similar to previous work [1, 15, 22], our system is designed to work on top of a relational DBMS that supports full-text query and inverted indexes. Our current implementation supports both Oracle and MySQL. The rationale for choosing this architecture over other alternatives (such as graph-based methods [3, 18, 9]) is that (a) relational databases are widely used in enterprises and over the Internet, and a sizeable percentage of information stored inside RDBMSs is text; and (b) it is easier to exploit many features offered by RDBMSs, e.g., data storage and recovery, index building and maintenance, and sophisticated query processing and optimization capabilities. Nonetheless, we believe our research into the RDBMS-based approach is complementary to alternative approaches.

The rest of the paper is organized as follows: Section 2 provides an overview of the problem and existing solutions. Section 3 presents our new ranking method and Section 4 introduces two query processing algorithms optimized for efficient top-k retrieval. Experimental results are reported in Section 5. We introduce related work in Section 6 and Section 7 concludes the paper. A full version of the paper can be found in [23].

2. PRELIMINARIES

2.1 Problem Overview and Problem Definition
We consider a relational schema R as a set of relations {R_1, R_2, . . . , R_|R|}. These relations are interconnected at the schema level via foreign key to primary key references. We denote Ri → Rj if Ri has a set of foreign key attribute(s) referencing Rj's primary key attribute(s), following the convention in drawing relational schema graphs. For simplicity, we assume all primary key and foreign key attributes consist of a single attribute, and that there is at most one foreign key to primary key relationship between any two relations. We do not impose such limitations in our implementation. A query Q consists of (1) a set of distinct keywords, i.e., Q = {w_1, w_2, . . . , w_|Q|}; and (2) a parameter k indicating that a user is only interested in the top-k results ranked by the relevance scores associated with each result. Ties can be broken arbitrarily. A user can also specify AND or OR semantics for the query, which mandates that a result must or may not match all the keywords, respectively. The default mode is the OR semantics, to allow more flexible result ranking [15].

A result of a top-k keyword query is a tree, T, of tuples, such that each leaf node of T contains at least one of the query keywords, and each pair of adjacent tuples in T is connected via a foreign key to primary key relationship. We call such an answer tree a joined tuple tree (JTT). The size of a JTT is the number of tuples (i.e., nodes) in the tree. Note that we allow two tuples in a JTT to belong to the same relation. Each JTT belongs to the results produced by a relational algebra expression — we just replace each tuple with its relation name and impose a full-text selection condition on the relation if the tuple is a leaf node. Such a relational algebra expression (or its SQL equivalent) is also termed a Candidate Network (CN) [16]. Relations in the CN are also called tuple sets. There are two kinds of tuple sets: those that are constrained by keyword selection conditions are called non-free tuple sets (denoted as R^Q) and the others are called free tuple sets (denoted as R). Every JTT as an answer to a query has a relevance score, which, intuitively, indicates how relevant the JTT is to the query. Conceptually, all JTTs of a query are sorted in descending order of their scores and only those with the top-k highest scores are returned.

Example 1. In this paper, we use the same running example as the previous work [15] (shown in Figure 2).

In the example, R = {P, C, U}.^1 The foreign key to primary key relationships are: C → P and C → U. A user wants to retrieve the top-3 answers to the query "maxtor netvista".

^1 Initials of relation names are used as shorthands (except that we use U to denote Customers).


Complaints
rid  prodId  custId  date       comments
c1   p121    c3232   6-30-2002  disk crashed after just one week of moderate use on an IBM Netvista X41
c2   p131    c3131   7-3-2002   lower-end IBM Netvista caught fire, starting apparently with disk
c3   p131    c3143   8-3-2002   IBM Netvista unstable with Maxtor HD

Products
rid  prodId  manufacturer  model
p1   p121    Maxtor        D540X
p2   p131    IBM           Netvista
p3   p141    Tripplite     Smart 700VA

Customers
rid  custId  name        occupation
u1   c3232   John Smith  Software Engineer
u2   c3131   Jack Lucas  Architect
u3   c3143   John Mayer  Student

Figure 2: A Running Example from [15] (Query is "maxtor netvista"; Matches are Underlined)

Some example JTTs include: c3, c3 → p2, c1 → p1, c2 → p2, and c2 → p2 ← c3. The first JTT belongs to CN C^Q; the next three JTTs belong to CN C^Q → P^Q; and the last JTT belongs to CN C^Q → P^Q ← C^Q. Note that c3 → u3 is not a valid JTT for the query, as the leaf node u3 does not contribute a match to the query.

A possible answer to this top-3 query may be: c3, c3 → p2, and c1 → p1. We believe that most users will prefer c1 → p1 to c2 → p2, because the former complaint is really about an IBM Netvista equipped with a Maxtor disk, whereas it is not certain whether Product p2 mentioned in the latter JTT is equipped with a Maxtor hard disk or not.

2.2 Overview of Existing Solutions
We will use the running example to briefly introduce the basic ideas of existing query processing and ranking methods.

Given the query keywords, it is easy to find relations that contain at least one tuple that matches at least one search keyword, if the system supports full-text query and inverted indexes. The matched tuples from those relations form the non-free tuple sets, and are usually ordered in descending order of their IR-style relevance scores. The challenge is to find inter-connected tuples that collectively form valid JTTs. Given the schema of the database, we can enumerate all possible relational algebra expressions (i.e., CNs) such that each of them might generate an answer to the query.

Example 2. For the query "maxtor netvista", only P and C have tuples matching at least one keyword of the query. The non-free tuple set of C is C^Q = [c3, c2, c1], and the non-free tuple set of P is P^Q = [p1, p2]. The free tuple set of U is U itself. While C^Q → P^Q might produce an answer, C^Q → U cannot produce any valid answer (i.e., JTT), as the joining U tuple won't contribute any keyword match to the query. However, note that other larger CNs whose query expressions contain that of C^Q → U (e.g., C^Q → U ← C^Q) may still produce an answer.

DISCOVER [16] proposed a breadth-first CN enumeration algorithm that is both sound and complete. The algorithm essentially enumerates all subgraphs of size k that do not violate any pruning rules. The algorithm varies k from 1 to some search range threshold M. Three pruning rules are used; they are listed below. We also show the trace of the CN generation algorithm running on our example (Table 2).

Rule 1 Prune duplicate CNs.
Rule 2 Prune non-minimal CNs, i.e., CNs containing at least one leaf node which does not contain a query keyword.
Rule 3 Prune CNs of the form R^Q ← S∗ → R^Q. The rationale is that any tuple s ∈ S∗ (S∗ may be a free or non-free tuple set) which has a foreign key pointing to a tuple in R^Q must point to the same tuple in R^Q.

Table 2: Enumerating CNs. P and C Both Match the Two Query Keywords. Invalid CNs Are Marked 'n' in the Valid? Column. (We Omit CNs Pruned by Rule 1)

Schema: P^Q, C^Q, U

Size  CN ID  CN                   Valid?  Violates
1     CN1    P^Q                  Y
1     CN2    C^Q                  Y
2     CN3    P^Q ← C^Q            Y
2            C^Q → U              n       Rule (2)
3            P^Q ← C^Q → U        n       Rule (2)
3            P^Q ← C^Q → P^Q      n       Rule (3)
3     CN4    C^Q → P^Q ← C^Q      Y
3            U ← C^Q → U          n       Rules (2, 3)
4     ...    ...                  ...     ...

Four valid CNs (CN1 to CN4) are found in the above example. Each CN naturally corresponds to a database query. E.g., CN3 corresponds to the following SQL statement in Oracle's syntax:

SELECT *
FROM   Products P, Complaints C
WHERE  P.prodId = C.prodId
  AND (CONTAINS(P.manufacturer, 'maxtor, netvista') > 0
       OR CONTAINS(P.model, 'maxtor, netvista') > 0)
  AND  CONTAINS(C.comments, 'maxtor, netvista') > 0

To find the top-k answers to the query, a naïve solution is to issue an SQL query for each CN and union the results to find the top-k results by their relevance scores. DISCOVER2 [15] introduces two alternative query evaluation strategies: the sparse and global pipeline algorithms, both optimized for stopping the query execution immediately after the true top-k-th result can be determined.^2 The basic idea is to use an upper bounding function to bound the scores of potential answers from each CN (either before execution or in the middle of its execution). The upper bound score ensures that any potential result from future execution of a CN will not have a higher score. Thus the algorithm can stop early if the current top-k-th result has a score no smaller than the upper bound scores of all CNs. We note that this is the main optimization technique for other variants of top-k queries too [11, 25, 5].

^2 In this paper, we name the system in [15] as DISCOVER2. A hybrid algorithm that selects either the sparse or the global pipeline algorithm for a query based on selectivity estimation is also proposed in [15]. It is discussed and compared against in Section 5.


The sparse algorithm executes one CN at a time and updates the current top-k results; it uses the above-mentioned criterion to stop query execution early. The global pipeline algorithm adopts a more aggressive optimization: it does not execute a CN in full; instead, at each iteration, it (a) first selects the most promising CN, i.e., the CN with the highest upper bound score; and (b) admits the next unseen tuple from one of the CN's non-free tuple sets and joins the new tuple with all the already seen tuples in all the other non-free tuple sets. As such, the query processing strategy (of a single CN) is similar to that of the ripple join [14].

2.3 Overview of Our Solution
In this paper, we assume that the DBMS can efficiently locate the matching tuples for each search keyword and form the non-free tuple sets. We will focus on the following two sub-problems: (a) how to score a JTT, and (b) how to generate and order the SQL queries for the CNs of a query, such that minimal database accesses (also called probes) are required before the top-k results are returned.

The first problem is studied in the next section. The second problem is addressed in Section 4.

3. RANKING FUNCTION
Due to the fuzzy nature of keyword queries, retrieval effectiveness is vital to keyword search on RDBMSs. The initial attempt was a simple ranking by the size of CNs [1, 16]. DISCOVER2 later proposed a ranking formula based on the state-of-the-art IR scoring function [15]. More recently, several sophisticated improvements to the ranking formula in [15] have been suggested [22].

In this section, we first motivate our work by presenting observations that reveal several problems in the existing schemes. We then discuss our solutions to address them.

3.1 Problems with Existing Ranking Functions
The basic idea of the ranking method used in DISCOVER2 [15] (and its variant [22]) is to

1. assign each tuple in the JTT a score using a standard IR-ranking formula (or its variants); and

2. combine the individual scores together using a score aggregation function, comb(·), to obtain the final score. Only monotonic aggregation functions, e.g., SUM, have been considered.^3

For example, the IR-style ranking function used in DISCOVER2 is adapted from the TF-IDF ranking formula as:^4

score(T, Q) = \sum_{t \in T} score(t, Q)

score(t, Q) = \sum_{w \in t \cap Q} \frac{1 + \ln(1 + \ln(tf_w(t)))}{(1 - s) + s \cdot \frac{dl_t}{avdl_t}} \cdot \ln(idf_w), \quad \text{where } idf_w = \frac{N_{Rel(t)} + 1}{df_w(Rel(t))},

tf_w(t) denotes the number of times a keyword w appears in a database tuple t, dl_t denotes the length of the text attribute of a tuple t, avdl_t is the average length of the text attribute in the relation which t belongs to (i.e., Rel(t)), N_{Rel(t)} denotes the number of tuples in Rel(t), and df_w(Rel(t)) denotes the number of tuples in Rel(t) that contain keyword w. The score of a JTT is the sum of the local scores of every tuple in the JTT.

^3 The aggregation function used in [22] is not monotonic. However, query processing issues with this non-monotonic aggregation function are not discussed.
^4 To obtain the final score of a JTT, score(T, Q) needs to be further normalized by T's size, i.e., multiplied by another factor 1/size(T).
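To make the scheme above concrete, the following Python sketch (ours, not from the paper; names and the default s = 0.2 are illustrative) computes the local tuple score and the SUM-aggregated JTT score; tf maps each query keyword to its frequency in the tuple.

import math

def discover2_tuple_score(tf, dl, avdl, idf, s=0.2):
    # Local IR score of one tuple: sum over query keywords w with tf[w] > 0.
    score = 0.0
    norm = (1 - s) + s * dl / avdl          # document length normalization
    for w, f in tf.items():
        if f > 0:
            score += (1 + math.log(1 + math.log(f))) / norm * math.log(idf[w])
    return score

def discover2_jtt_score(tuples, idf, s=0.2):
    # JTT score = SUM of the local scores of its tuples (monotonic aggregation).
    return sum(discover2_tuple_score(tf, dl, avdl, idf, s)
               for (tf, dl, avdl) in tuples)

# With ln(idf) = 1.0 for both keywords (idf = e) and dl = avdl, as in Table 3 below,
# c3 = ({'maxtor': 1, 'netvista': 1}, 1, 1) scores 2.0, p2 scores 1.0, so c3 -> p2 scores 3.0.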

Table 3: Different Scoring Functions Produce Different Rankings (ln(idf_maxtor) = ln(idf_netvista) = 1.0 and dl_t = avdl_t)

CN        t    tf_maxtor  tf_netvista  Score_t  Score_T  Our Score
c3 → p2   c3   1          1            2.0      3.0      1.13
          p2   0          1            1.0
c1 → p1   c1   0          1            1.0      2.0      0.98
          p1   1          0            1.0
c2 → p2   c2   0          1            1.0      2.0      0.44
          p2   0          1            1.0

We illustrate an inherent problem in the above framework using the running example in Figure 2. The query is "maxtor netvista". Let us consider the CN C^Q → P^Q. If the CN is executed completely, it will produce 3 results. In Table 3, we list the detailed steps to obtain the scores (Score_T) according to the above-mentioned method. For example, c3 → p2 consists of two tuples, c3 and p2, belonging to C and P, respectively. c3 contains one maxtor and one netvista, while p2 contains one netvista only. For simplicity, we do not consider length normalization in the example (i.e., setting dl_t = avdl_t for all t), and assume that the ln(idf) values of both keywords are 1. Therefore, we can calculate score(c3, Q) as 1 + ln(1 + ln(tf_maxtor(c3))) + 1 + ln(1 + ln(tf_netvista(c3))) = 2.0, and score(p2, Q) = 1 + ln(1 + ln(tf_netvista(p2))) = 1.0. The final score of the joined tuple tree, score(c3 → p2), is 2.0 + 1.0 = 3.0. Similarly, c1 → p1 and c2 → p2 both have the same score of 2.0 and are thus both ranked second.

However, a careful inspection of the latter two results reveals that c1 → p1 in fact matches both search keywords while c2 → p2 matches only one keyword (netvista), albeit twice. We believe that most users will find the former answer more relevant to the query than the latter one. In fact, it is not hard to construct an extreme example where DISCOVER2's ranking contradicts human perception by ranking results that contain many occurrences of one search keyword over results that contain all or most search keywords, but only once.

There are two reasons for the above-mentioned ranking problem. Firstly, when a user inputs short queries, there is a strong implicit tendency for the user to prefer answers matching the query completely to those matching the query partially. We propose a completeness factor in Section 3.3 to quantify this factor. Secondly, the framework of combining local IR ranking scores has an inherent side effect of overly rewarding contributions of the same keyword in different tuples of the same JTT.

We note that a similar observation and remedy about the need of non-linear term frequency attenuation was also made by IR researchers [26]. The difference is that the same approach is motivated by the semantics of our search problem; in addition, our problem is more general and a number of other modifications to the IR ranking function are made (e.g., inverse document frequencies and document length normalization for each CN).

3.2 Modelling a Joined Tuple Tree as a Virtual Document
We propose a solution based on the idea of modelling a JTT as a virtual document. Consequently, the entire set of results produced by a CN will be modelled as a document collection. The rationale is that most of the CNs carry certain distinct semantics. E.g., C^Q → P^Q gives all details about complaints and their related products that are collectively relevant to the query Q and form integral logical information units. In fact, the actual lodgment of a complaint would contain both the product information and the detailed comment — it was split into multiple tables due to the normalization requirement imposed by the physical implementation of the RDBMSs.

A very similar notion of virtual document was proposed in [30]. Our definition differs from [30] in that ours is query-specific and dynamic. For example, a customer tuple is only joined with complaints matching the query to form a virtual document at run-time, rather than being joined with all the complaints as in [30].

By adopting such a model, we could naturally compute the IR-style relevance score without using an esoteric score aggregation function. More specifically, we assign an IR ranking score to a JTT T as

score_a(T, Q) = \sum_{w \in T \cap Q} \frac{1 + \ln(1 + \ln(tf_w(T)))}{(1 - s) + s \cdot \frac{dl_T}{avdl_{CN^*(T)}}} \cdot \ln(idf_w),    (1)

where tf_w(T) = \sum_{t \in T} tf_w(t),  idf_w = \frac{N_{CN^*(T)} + 1}{df_w(CN^*(T))},

CN(T) denotes the CN which the JTT T belongs to, and CN∗(T) is identical to CN(T) except that all full-text selection conditions are removed. CN∗(T) is also written as CN∗ if there is no ambiguity.

Example 3. Consider the CN C^Q → P^Q; its CN∗ is C → P (i.e., C ⋈ P) in Table 3. N_{CN∗} = 3, df_maxtor = 2, and df_netvista = 3.

In our proposed method, the contributions of the same keyword in different relations are first combined and then attenuated by the term frequency normalization. Therefore, tf_maxtor(c2 → p2) = 0 and tf_netvista(c2 → p2) = 2, while tf_maxtor(c1 → p1) = 1 and tf_netvista(c1 → p1) = 1. According to Equation (1) and omitting the size normalization, score_a(c2 → p2) = 0.44, while score_a(c1 → p1) = 0.98. Thus, c1 → p1 is ranked higher than c2 → p2, which agrees with human judgments.^5
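For concreteness, a minimal Python sketch of Equation (1) (our illustration; names are ours): the term frequencies of each keyword are summed over all tuples of the JTT before the non-linear attenuation is applied once.

import math

def spark_score_a(tuple_tfs, dl_T, avdl_cn, idf, s=0.2):
    # Aggregate term frequencies over the whole JTT (the virtual document).
    tf_T = {}
    for tf in tuple_tfs:
        for w, f in tf.items():
            tf_T[w] = tf_T.get(w, 0) + f
    # Then apply the attenuation and length normalization once, per keyword.
    norm = (1 - s) + s * dl_T / avdl_cn
    score = 0.0
    for w, f in tf_T.items():
        if f > 0:
            score += (1 + math.log(1 + math.log(f))) / norm * math.log(idf[w])
    return score

# e.g. for c2 -> p2 (netvista twice, maxtor absent) the summed tf is {'netvista': 2},
# which is attenuated once rather than rewarding the keyword in both tuples separately.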

There are still two technical issues remaining: how to obtain df_w(CN∗) and N_{CN∗}, and how to obtain avdl_{CN∗}. No doubt computing df_w(CN∗) and N_{CN∗} exactly would incur prohibitive cost. Therefore, we resort to an approximate solution: we estimate p = df_w(CN∗) / N_{CN∗}, such that the idf value of the term in CN∗ can be approximated as 1/p. Consider a CN∗ = R_1 ⋈ R_2 ⋈ . . . ⋈ R_l, and denote the percentage of tuples in R_j that match at least keyword w as p_w(R_j). We can derive

\frac{df_w(CN^*)}{N_{CN^*} + 1} \approx \frac{df_w(CN^*)}{N_{CN^*}} = p \approx 1 - \prod_j (1 - p_w(R_j))

by assuming that (a) N_{CN∗} is a large number, and (b) tuples matching keyword w are uniformly and independently distributed in each relation R_j. In a similar fashion, we estimate avdl_{CN∗} as \sum_j avdl_{R_j}. From our preliminary experimental study [23], these approximations usually have acceptable accuracy (within 30% relative error) for small CNs of size up to 3. It is our future work to study more robust estimation techniques or alternative approaches.
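A small sketch of these estimations (ours; it assumes the per-relation match percentages p_w(R_j) and the average text lengths are available from the full-text index statistics):

def estimate_idf_in_cn(p_w_per_relation):
    # p ~ 1 - prod_j (1 - p_w(R_j)); the idf of w in CN* is approximated by 1/p.
    miss = 1.0
    for pw in p_w_per_relation:
        miss *= (1.0 - pw)
    p = 1.0 - miss
    return 1.0 / p if p > 0 else float('inf')   # no relation matches w

def estimate_avdl_in_cn(avdl_per_relation):
    # avdl of CN* is approximated by the sum of the per-relation average lengths.
    return sum(avdl_per_relation)

# For CN* = C join P and w = 'maxtor' in Figure 2: p_w(C) = 1/3 and p_w(P) = 1/3,
# so p ~ 1 - (2/3)*(2/3) = 5/9 and idf_w ~ 1.8 (the exact value is 3/2, see Example 3).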

3.3 Other Ranking Factors

Completeness Factor. As motivated in Section 3.1, we believe that users usually prefer documents matching many query keywords to those matching only a few keywords. To quantify this factor, we propose to multiply a completeness factor into the raw IR ranking score. We note that the same intuition has been recognized by IR researchers when studying ranking for short queries [31, 27].

^5 c3 → p2 will have score 1.13, which still makes it ranked as the first result.

Our proposed completeness factor is derived from the extended Boolean model [28]. The central idea of the extended Boolean model is to map each document into a point in an m-dimensional space [0, 1]^m, if there are m keywords in the query Q. A document d will have a large coordinate value on a dimension if it has high relevance to the corresponding keyword. As we prefer documents containing all the keywords, the ideal answer should be located at the position P_ideal = [1, . . . , 1] (m ones). In our virtual document model, a JTT is a document and can be projected into this m-dimensional space just as a normal document. We thus use the distance of a document to the ideal position, P_ideal, as the completeness value of the JTT. More specifically, we use the L^p distance and normalize the value into [0, 1]. The completeness factor, score_b, is then defined as:

score_b(T, Q) = 1 - \left( \frac{\sum_{1 \le i \le m} (1 - T.i)^p}{m} \right)^{\frac{1}{p}},    (2)

where T.i denotes the normalized term frequency of a JTT T with respect to keyword w_i, i.e.,

T.i = \frac{tf_{w_i}(T)}{\max_{1 \le j \le m} tf_{w_j}(T)} \cdot \frac{idf_{w_i}}{\max_{1 \le j \le m} idf_{w_j}}.

In Equation (2), p is a tuning parameter. It smoothly shifts the completeness factor from being biased towards the OR semantics to the AND semantics as p increases from 1.0 to ∞. To see this, consider p → ∞: the completeness factor essentially becomes min_{1≤i≤m} T.i, which essentially gives a 0 score to a result failing to match all the search keywords. In our experiments, we observed that a p value of 2.0 is already good enough to enforce the AND semantics for almost all the queries tested.
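The completeness factor can be sketched as follows (our Python illustration; idf maps every query keyword to its idf in CN∗, and p is the tuning parameter of Equation (2)):

def spark_score_b(tf_T, idf, p=2.0):
    # L^p distance of the JTT's normalized coordinates T.i to the ideal point
    # [1, ..., 1], mapped into [0, 1]; larger p moves towards AND semantics.
    keywords = list(idf)
    m = len(keywords)
    max_tf = max(tf_T.get(w, 0) for w in keywords) or 1
    max_idf = max(idf[w] for w in keywords)
    dist = 0.0
    for w in keywords:
        t_i = (tf_T.get(w, 0) / max_tf) * (idf[w] / max_idf)
        dist += (1.0 - t_i) ** p
    return 1.0 - (dist / m) ** (1.0 / p)

# A result matching both keywords once beats one matching a single keyword twice:
# spark_score_b({'maxtor': 1, 'netvista': 1}, idf) > spark_score_b({'netvista': 2}, idf).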

Apart from the nice theoretical properties, the ability to switch between AND and OR semantics is a salient feature for query processing. It enables a unified framework optimized for top-k query processing for both AND and OR semantics. In contrast, previous approaches are either optimized for the AND semantics [1] or for the OR semantics [15].

Size Normalization Factor. The size of the CN or JTT is also an important factor. A larger JTT tends to have more occurrences of keywords. A straightforward normalization by 1/size(CN) [15] usually penalizes too much for even moderate-sized CNs. We experimentally found that the following size normalization factor works well:

score_c = (1 + s_1 - s_1 \cdot size(CN)) \cdot (1 + s_2 - s_2 \cdot size(CN_{nf}))    (3)

where size(CN_{nf}) is the number of non-free tuple sets of the CN. In our experiments, we found that s_1 = 0.15 and s_2 = 1/(|Q| + 1) yielded good retrieval results for most of the queries.

3.4 The Final Scoring Function
In summary, our ranking method can be conceptually thought of as first merging all the tuples in a JTT into a virtual document, and then obtaining its IR ranking score (Equation (1)), its completeness factor score (Equation (2)), and its size normalization factor score (Equation (3)). The final score of the JTT, score(T, Q), is the product of the three scores:

score(T, Q) = score_a(T, Q) \cdot score_b(T, Q) \cdot score_c(T, Q)
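Putting the three factors together — a sketch that reuses spark_score_a and spark_score_b from the earlier snippets and fixes s_2 = 1/(|Q| + 1) as suggested above:

def spark_score_c(size_cn, size_cn_nf, query_len, s1=0.15):
    # Size normalization factor of Equation (3).
    s2 = 1.0 / (query_len + 1)
    return (1 + s1 - s1 * size_cn) * (1 + s2 - s2 * size_cn_nf)

def spark_score(tuple_tfs, dl_T, avdl_cn, idf, size_cn, size_cn_nf, s=0.2, p=2.0):
    # Final score of a JTT = score_a * score_b * score_c (Section 3.4).
    tf_T = {}
    for tf in tuple_tfs:
        for w, f in tf.items():
            tf_T[w] = tf_T.get(w, 0) + f
    return (spark_score_a(tuple_tfs, dl_T, avdl_cn, idf, s)
            * spark_score_b(tf_T, idf, p)
            * spark_score_c(size_cn, size_cn_nf, len(idf)))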


4. Top-k JOIN ALGORITHM
While effectiveness of keyword search is certainly the most important factor, we believe that the efficiency of query processing is also a critical issue. Query execution time will become prohibitively large for large databases if the query processing algorithm is not fully optimized for the ranking function and top-k queries. In this section, we propose two efficient query processing algorithms for our newly proposed ranking function.

4.1 Dealing with Non-monotonic Scoring Function
The technical challenge of query processing mainly lies with the non-monotonic scoring function (mainly the score_a(·) and score_b(·) functions) used in our ranking method. To the best of our knowledge, none of the existing top-k query processing methods deals with non-monotonic scoring functions. We use the single pipeline algorithm [15] to illustrate the challenge and motivate our algorithms.

Figure 3: Query Evaluation Strategies ((a) Single Pipeline; (b) Skyline Sweeping)

Example 4. Figure 3(a) illustrates a snapshot of running the single pipeline algorithm on the CN C → P. Assume that we have processed the light gray area marked with "I" (i.e., the rectangle up to (c[i], p[j])). We use the notation c[i] to denote the i-th tuple in the non-free tuple set C^Q in descending order of their scores.

In the figure, hollow circles denote candidates that we have examined but that did not produce any result, filled circles denote joined results, and hollow triangles denote candidates that have not been examined.

Assume that a user asks for the top-2 results, and we have already found two results x and y. The single pipeline algorithm needs to decide whether to continue the query execution (and check more candidates) or stop and return {x, y} as the query result. DISCOVER2 needs to bound the maximum score that any unseen candidate can achieve. If the last seen candidate is c[i] → p[j], then the upper bound is max(score(p[1], c[i+1]), score(p[j+1], c[1])). This is true because DISCOVER2 uses a monotonic scoring function (SUM), and therefore score(p[1], c[i+1]) ≥ score(p[u], c[v]) (u ≥ 1, v ≥ i+1) and score(p[j+1], c[1]) ≥ score(p[u], c[v]) (u ≥ j+1, v ≥ 1); the combination of the right-hand sides of the above two inequalities covers all the unseen candidates (i.e., those marked as triangles).

We note that, with our new ranking method, the score of a JTT is not monotonic with respect to the scores of its constituent tuples. For example, consider the last two JTTs in Table 3. If we assume idf_netvista > idf_maxtor, then score(c2) = score(c1) but score(p2) > score(p1). However, we have score(c2 → p2) < score(c1 → p1), even if we do not impose the penalty from the completeness factor. Consequently, previous algorithms on top-k query processing cannot be immediately applied, and a naïve approach would have to produce all the results to find the top-k results.

Our solution, which underlies both of our proposed algorithms, is based on the observation that if we can find a (preferably tight) monotonic upper bounding function for the actual scoring function, we can stop the query processing early too. We derive such an upper bounding function for our ranking function in the following.

Let us denote a JTT as T, which consists of tuples t_1, . . . , t_m. Without loss of generality, we assume every t_i is from a non-free tuple set; otherwise, we just omit it from the subsequent formulas. Let

sumidf = \sum_{w \in CN(T) \cap Q} idf_w    and    watf(t_i) = \frac{\sum_{w \in t_i \cap Q} (tf_w(t_i) \cdot idf_w)}{sumidf}

(i.e., the pseudo weighted average tf of tuple t_i). Then we have the following lemma.

Lemma 1. score_a(T, Q) (Equation (1)) can be bounded by the function uscore_a(T, Q) = \frac{1}{1 - s} \cdot \min(A, B), where

A = sumidf \cdot \left( 1 + \ln\left( 1 + \ln\left( \sum_{t_i \in T \cap Q} watf(t_i) \right) \right) \right)

B = sumidf \cdot \sum_{t_i \in T \cap Q} watf(t_i).

In addition, the bound is tight.

A tight upper bound for the completeness factor (denoted as uscore_b) can be determined given the keywords matched in each non-free tuple set of a CN. The size normalization factor is also a constant for a given CN. Therefore, we have the following theorem to upper bound the score of a JTT.

Theorem 1.

score(T, Q) ≤ uscore(T, Q), where    (4)

uscore(T, Q) = uscore_a(T, Q) \cdot uscore_b(CN(T), Q) \cdot score_c(CN(T), Q),

and for a given CN, the upper bound score is monotonic with respect to watf(t_i) (t_i ∈ T).

This result immediately suggests that we should sort all the tuples (t_i) in the non-free tuple sets of a CN in decreasing order of their watf(t_i) values (rather than their local IR scores as used in previous work), such that we can obtain an upper bound score for all the unseen candidates.

Example 5. Continuing the previous example, assume that we have ordered all tuples t in C^Q and P^Q in descending order of their watf(t) values. Then the score of the unseen candidates in the CN X = C^Q → P^Q is bounded by M · uscore_b(X, Q) · score_c(X, Q), where M = max(uscore_a(c[i+1], p[1]), uscore_a(c[1], p[j+1])).
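The bound of Lemma 1 is cheap to maintain because watf depends only on a tuple's own term frequencies. A sketch (ours; idf_w holds the per-keyword weights used in the scoring formula, and when the summed watf falls below 1 we fall back to B, which remains a valid, if looser, bound):

import math

def watf(tuple_tf, idf_w):
    # Pseudo weighted average term frequency of a single tuple.
    sumidf = sum(idf_w.values())
    return sum(f * idf_w[w] for w, f in tuple_tf.items() if w in idf_w) / sumidf

def uscore_a(watf_values, idf_w, s=0.2):
    # Monotonic upper bound of score_a (Lemma 1): (1 / (1 - s)) * min(A, B).
    sumidf = sum(idf_w.values())
    t = sum(watf_values)              # sum of watf(t_i) over the candidate's tuples
    b = sumidf * t
    if t >= 1.0:
        a = sumidf * (1 + math.log(1 + math.log(t)))
        bound = min(a, b)
    else:
        bound = b                     # A may be undefined here; B still bounds score_a
    return bound / (1 - s)

# Tuples in each non-free tuple set are sorted by descending watf, so the two frontier
# candidates of Example 5 bound every unseen candidate of the CN.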

4.2 Skyline Sweeping Algorithm
Based on Theorem 1, we could modify the existing single or global pipeline algorithms such that they correctly compute the top-k answers for our new ranking function. However, the single/global pipeline algorithms may incur many unnecessary join checks. Therefore, we design a new algorithm, skyline sweeping, that is guaranteed not to incur any unnecessary checking and thus has the minimal number of accesses to the database.

Example 6. Consider the single pipeline algorithm running on the example in Figure 3(a). Assume that the algorithm has processed c[1] . . . c[i] on the non-free tuple set C^Q and p[1] . . . p[j] on P^Q. If the algorithm cannot stop, it will pick up either c[i+1] or p[j+1]. If it picks c[i+1], j probing queries will be sent to verify whether c[i+1] joins with p[k], where 1 ≤ k ≤ j.

It is obvious that some of these j queries might be unnecessary. For example, if c[i+1] joins with p[1] and its real score is higher than the upper bound scores of the rest of the candidates, then the other j−1 probes are unnecessary.


We propose an algorithm designed to minimize the number of join checking operations, which typically dominate the cost of the algorithm. Our intuition is that if there are two candidates x and y and the upper bound score of x is higher than that of y, y should not be checked unless x has been checked. Therefore, we should arrange all the candidates to be checked according to their upper bound scores. A naïve strategy is to calculate the upper bound scores for all the candidates, sort them according to the upper bound scores, and check them one by one in this optimal order. This will incur an excessive amount of unnecessary work, since not all the candidates need to be checked.

We take the following approach. We define a dominance relationship among candidates. Denote by x.d_i the order (i.e., according to their watf values) of candidate x on the non-free tuple set d_i. If x.d_i ≤ y.d_i for all non-free tuple sets d_i, then uscore(x) ≥ uscore(y). This enables us to compute the upper bound scores and check candidates in a lazy fashion: immediately after we check a candidate x, we push all the other candidates directly dominated by x into a priority queue ordered by descending upper bound score. It can be shown that the candidates in the queue form a skyline [4], and the skyline sweeps across the Cartesian space of the CN as the algorithm progresses, hence the name of the algorithm.

Algorithm 1 Skyline Sweeping Algorithm

1:  Q.push((1, 1, . . . , 1), calc_uscore((1, 1, . . . , 1)))   {an m-dimensional point of all 1s}
2:  top-k ← ∅
3:  while top-k[k].score < Q.head().uscore do
4:    head ← Q.pop_max()
5:    r ← executeSQL(formQuery(head))
6:    if r ≠ nil then
7:      top-k.insert(r, score(r))
8:    for i ← 1 to m do
9:      t ← head.dup()
10:     t.i ← t.i + 1
11:     Q.push(t, calc_uscore(t))   {according to Equation (4)}
12:     if head.i > 1 then
13:       break
14: return top-k

The algorithm is shown in Algorithm 1. A result list, top-k, contains no more than k results ordered by descending real score. The main data structure is a priority queue, Q, containing all the candidates (which are mapped to multi-dimensional points) in descending order of their upper bound scores. The algorithm also maintains the invariant that the candidate at the head of the priority queue has the highest upper bound score among all candidates in the CN. The invariant is maintained by (a) pushing the candidate formed by the top tuple from all dimensions into the queue (Line 1), and (b) whenever a candidate is popped from the queue, pushing its adjacent candidates into the queue together with their upper bounds (Lines 8–13). The algorithm stops when the real score of the current top-k-th result is no smaller than the upper bound score of the head element of the priority queue; the latter is exactly the upper bound score of all the unprocessed candidates.

A technical point is that we should avoid inserting the same candidate multiple times into the queue. Doing duplicate checking is inefficient in terms of time and space. We adopt a space partitioning method to totally avoid generating duplicate candidates. This is implemented in Lines 12–13 using the same ideas as [25]. For example, in Figure 3(b), assume the order of the dimensions is P, C. Both z′ and z′′ are adjacent candidates to z, but only z′ will be pushed into Q when z is examined by the algorithm.
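A compact Python sketch of Algorithm 1 for a single CN (our illustration, not the authors' implementation). A candidate is an m-tuple of 1-based positions into the watf-sorted non-free tuple sets; calc_uscore returns its upper bound score and probe issues the corresponding join query, returning the real score or None. Bounds checking against the tuple-set sizes is omitted for brevity.

import heapq

def skyline_sweep(m, calc_uscore, probe, k):
    start = (1,) * m
    heap = [(-calc_uscore(start), start)]        # max-priority queue via negated scores
    topk = []                                     # (real score, candidate), descending
    while heap and (len(topk) < k or topk[k - 1][0] < -heap[0][0]):
        neg_u, cand = heapq.heappop(heap)
        r = probe(cand)                           # one database probe (join check)
        if r is not None:
            topk.append((r, cand))
            topk.sort(reverse=True)
            topk = topk[:k]
        for i in range(m):                        # push adjacent candidates (Lines 8-13)
            nxt = list(cand)
            nxt[i] += 1
            heapq.heappush(heap, (-calc_uscore(tuple(nxt)), tuple(nxt)))
            if cand[i] > 1:                       # space partitioning: no duplicates
                break
    return topk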

Theorem 2. The skyline sweeping algorithm has the minimal number of probes to the database.

4.2.1 Generalizing to Multiple CNs
The skyline sweeping algorithm can be easily generalized to support more than one CN. The only modification is to change the initialization step: we just push the top candidate of each CN into the priority queue Q.

4.3 Block Pipeline Algorithm
We present another algorithm to further improve on the performance of the skyline sweeping algorithm. We observe that the aggregation function we use is non-monotonic, yet, in order to stop execution early, we have to use a monotonic upper bounding function to bound it. As such, the upper bound may be rather loose in places. We illustrate this observation using a one-dimensional example in Figure 4. Since the upper bounding function, uscore, must be monotonic, even if we use more complex functions, it won't be able to approximate the concave part of the real scoring function (score) well for x between 3 and 7.

Figure 4: Deficiency of Bounding a Non-monotonic Function Using a Monotonic Function (score(x) vs. uscore(x))

Large gaps between the upper bound scores and the corresponding real scores cause two problems in the skyline sweeping algorithm: (a) it is harder to stop the execution, as the upper bound of unprocessed candidates may be much higher than their real scores, and consequently higher than the real score of the top-k-th result; and (b) the order of the probes is not optimal, as the algorithm will perform a large number of probes and only obtain candidates with rather low real scores, which cannot contribute to the final top-k answer.

In order to address the above problems, we propose a novel block pipeline algorithm. The central idea of the algorithm is to employ another, local, non-monotonic upper bounding function that bounds the real score of JTTs more accurately. As such, we will check the most promising candidates first and thus further reduce the number of probes to the database.

To illustrate the idea, we define several concepts first. Consider a non-free tuple set R^Q and a query Q = {w_1, . . . , w_m}. We define the signature of a tuple t in R^Q as an ordered sequence of term frequencies for all the query keywords, i.e., ⟨tf_{w_1}(t), . . . , tf_{w_m}(t)⟩. Then, we can partition each R^Q into a number of strata such that all tuples within the same stratum have the same signature (also called the signature of the stratum). For a given CN, the partitioning of its non-free tuple sets naturally induces a partitioning of all the join candidates. We call each partition of the join candidates a block. The signatures of the strata that form a block b can be summed up as ⟨\sum_{t_i \in T} tf_{w_1}(t_i), . . . , \sum_{t_i \in T} tf_{w_m}(t_i)⟩ to form the signature of the block (denoted as sig(b)).

If two candidates in the same block both pass the join test, they should have similar real scores, as they agree on the term frequencies of all the query keywords (and thus the completeness wrt. the query), and on the size of the result. This observation helps to derive a much tighter upper bounding function, bscore, for any candidate T within the same block via the block signature:

bscore(b, Q) = \left( \sum_{w \in Q \cap b} \frac{1 + \ln(1 + \ln(sig_w(b)))}{1 - s} \cdot \ln(idf_w) \right) \cdot score_b(b, Q) \cdot score_c(CN(T), Q)    (5)

We note that this new bounding function, albeit being tighter (as it is no larger than uscore(T, Q) defined in Lemma 1), cannot be directly used to derive the stopping condition for top-k query processing algorithms, as it is not monotonic with respect to any single computable measure of its non-free tuple sets.
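A sketch of the block-level bound (ours; block_sig maps each query keyword to the summed term frequency recorded in the block signature, and the block's completeness factor and the CN's size factor are passed in precomputed):

import math

def bscore(block_sig, idf, score_b_block, score_c_cn, s=0.2):
    # Tighter than uscore: uses the exact summed tf of the block, but it is
    # not monotonic in any single per-tuple measure.
    total = 0.0
    for w, f in block_sig.items():
        if f > 0:
            total += (1 + math.log(1 + math.log(f))) / (1 - s) * math.log(idf[w])
    return total * score_b_block * score_c_cn

# Block I of Example 7 below (signature <2, 0>, ln(idf_w1) = 1.1, s = 0.2,
# completeness 0.5, size factor 1.0) evaluates to ~1.05, matching the table there.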

Algorithm 2 Block Pipeline Algorithm

Input: CN is the set of CNs.
Description:
1:  Q ← ∅
2:  for all cn ∈ CN do
3:    b ← the first block of cn
4:    b.status ← USCORE
5:    Q.push(b, calc_uscore(b))   {according to Equation (4)}
6:  while top-k[k].score < Q.head().getScore() do
7:    head ← Q.pop_max()
8:    if head.status = USCORE then
9:      head.status ← BSCORE
10:     Q.push(head, calc_bscore(head))   {based on Equation (5)}
11:     for all adjacent blocks b′ of head, enumerated in a non-redundant way, do
12:       b′.status ← USCORE
13:       Q.push(b′, calc_uscore(b′))
14:   else if head.status = BSCORE then
15:     R ← executeSQL(formQuery(head))
16:     for all results t ∈ R do
17:       t.status ← SCORE
18:       Q.push(t, calc_score(t))   {compute the real score}
19:   else
20:     Insert head into top-k
21: return top-k

We introduce a solution using lazy block calculation and integrate it seamlessly with the monotonic upper bounding score function (Equation (4)). Algorithm 2 describes the pseudo-code of the block pipeline algorithm. Intuitively, the algorithm is "unwilling" to issue a database probing query if the current top-ranked item in the priority queue is only associated with its upper bound score (uscore), as that score might not be close enough to its real score. Our non-monotonic bounding function plays its role here by re-inserting the item back into the priority queue, but with its bscore (Lines 9–10).
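The control flow of Algorithm 2 can be sketched as follows (again our illustration; calc_uscore and calc_bscore score a block, adjacent enumerates its neighbouring blocks non-redundantly, and probe executes the block's SQL query and returns (result, real score) pairs):

import heapq

USCORE, BSCORE, SCORE = 0, 1, 2

def block_pipeline(first_blocks, calc_uscore, calc_bscore, adjacent, probe, k):
    heap, counter, topk = [], 0, []

    def push(score, status, payload):
        nonlocal counter
        heapq.heappush(heap, (-score, counter, status, payload))
        counter += 1                               # tie-breaker; payloads need not be comparable

    for b in first_blocks:                         # the first block of each CN
        push(calc_uscore(b), USCORE, b)
    while heap and len(topk) < k:
        neg, _, status, payload = heapq.heappop(heap)
        if status == USCORE:                       # re-insert with the tighter bscore first
            push(calc_bscore(payload), BSCORE, payload)
            for nb in adjacent(payload):
                push(calc_uscore(nb), USCORE, nb)
        elif status == BSCORE:                     # only now is the block probed
            for result, real_score in probe(payload):
                push(real_score, SCORE, result)
        else:                                      # a real score outranks every remaining bound
            topk.append((-neg, payload))
    return topk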

Theorem 3. The block pipeline algorithm will never be worse than the skyline sweeping algorithm in terms of the number of probes to the database. When the score aggregation function is non-monotonic, there exists a database instance such that the block pipeline algorithm will check fewer candidates than the skyline sweeping algorithm.

Example 7. Consider the example in Figure 3(a) and assume that we have only i+1 tuples in the non-free tuple set C^Q and j+1 tuples in P^Q. Further assume that both c[1], . . . , c[i] and p[1], . . . , p[j] are tuples matching the same keyword, w1, once (and thus form two strata), and that both c[i+1] and p[j+1] match w2 once (and form another two strata). Assume the idf value of w1 is higher than that of w2, and hence the strata containing matches of w1 are ranked above those matching w2. This gives us four blocks. E.g., block I is [c[1] . . . c[i]] × [p[1] . . . p[j]], and its block signature is ⟨2, 0⟩. Similarly, blocks II and III have the same block signature, ⟨1, 1⟩.

We assume ln(idf_w1) = 1.1, ln(idf_w2) = 1.0, the completeness factor, score_b, is 0.5 (for the blocks that match only one of the two keywords), the size normalization factor, score_c, is 1.0, and s is 0.2. We calculate the bscores and uscores for each block in the following table:

Block  bscore  uscore
I      1.05    2.74
II     2.63    2.63
III    2.63    2.63
IV     0.95    2.50

The skyline sweeping algorithm will inspect tuples in Block I first, then Blocks II and III, as tuples in Block I all have higher uscores than those in Blocks II or III. However, all answers in Block I, if any, will have rather low scores (no higher than 1.05), and are not likely to become top-k results.

In contrast, in the block pipeline algorithm, even though Block I is pushed into the queue first (Lines 3–5), it is re-inserted with its bscore (calculated by calc_bscore) of 1.05. Blocks II and III will go through the same process, but they will both be associated with a bscore of 2.63. Thus they will both be checked against the database before any candidate in Block I. Furthermore, if k results are found after evaluating the candidates in Blocks II and III and the real score of the top-k-th result is higher than 1.05, the block pipeline algorithm can terminate immediately.

4.4 Optimizations and Discussions

4.4.1 Using Range Parametric Queries
Since our system is external to the RDBMS kernel, executing an SQL query and fetching its results require inter-process communication and suffer from DBMS internal overheads. If the database is large and the query is complex, a large number of probing queries will be sent to the DBMS. In addition, join selectivity is usually very low, and most of such probing queries will return an empty result set. Therefore, there will be substantial overhead if we issue a parametric query to check each candidate.

In our implementation, we adopt the optimization of grouping candidates into ranges and issuing a single parametric query to the database. The selection condition can be rewritten either as a collection of primary key selections obtained from the inverted index and combined using OR, or by creating a temporary table and using a range selection condition on the row number. Both the global pipeline and block pipeline algorithms can benefit from this optimization.

Another benefit of this optimization is that the RDBMS might answer such a query more efficiently (e.g., by using indexes or by avoiding re-accessing/scanning tables multiple times).
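As an illustration of such a grouped probe on the running example (ours; rid is the tuple identifier of Figure 2, and an IN list is used as shorthand for the OR-combined primary key selections):

def batched_probe_sql(c_rids, p_rids):
    # One probing query checks a whole batch of (Complaints, Products) candidates,
    # instead of one parametric query per candidate pair.
    c_list = ", ".join("'%s'" % r for r in c_rids)
    p_list = ", ".join("'%s'" % r for r in p_rids)
    return ("SELECT * FROM Complaints C, Products P "
            "WHERE C.prodId = P.prodId "
            "AND C.rid IN (%s) AND P.rid IN (%s)" % (c_list, p_list))

# batched_probe_sql(['c1', 'c2', 'c3'], ['p1', 'p2']) verifies six candidates in one round trip.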

4.4.2 Instance Optimality
Instance optimality is a notion proposed by Fagin et al. [11] to assert that the cost of an algorithm is bounded by a constant factor of that of any other correct algorithm on all database instances. This notion is widely studied and adopted in most top-k query processing work.

We note that although the skyline sweeping algorithm can be shown to be instance-optimal, this notion of optimality is not helpful in our problem setting. Consider a single CN. If the skyline sweeping algorithm accesses d_i tuples from each of the m non-free tuple sets, and we let d = max_{1≤i≤m} d_i, we can show that any other algorithm must access at least d tuples. Therefore, the total cost in terms of tuple accesses of the skyline sweeping algorithm can be bounded by an m-factor of that of other algorithms. However, the dominant cost in our problem setting is the cost of probing the database. For a large CN with a number of free and non-free tuple sets, each probe is a complex query involving joins of multiple relations. In contrast, sequentially accessing tuples in the non-free tuple sets is practically an inexpensive in-memory operation. Considering the sketch of the proof of instance-optimality above, it is possible that the skyline sweeping algorithm has to probe the database O(d^m) times, while another algorithm only needs to probe the database O(d) times. As a result, the cost ratio cannot be bounded by a constant factor.

It is our future work to consider an appropriate notion of optimality using a cost model based on the number of database probes.

5. EXPERIMENTS
In order to evaluate the effectiveness and the efficiency of the proposed methods, we have conducted extensive experiments on large-scale real datasets under a number of configurations.

The datasets we used include the Internet Movie Database (IMDB)^6, DBLP data (DBLP) [7], and Mondial^7. All are real datasets. IMDB and DBLP are much larger than Mondial, and results on the Mondial dataset are similar to those on IMDB and DBLP, so we omit Mondial in the rest of the discussion^8. For the IMDB dataset, we converted a subset of its raw text files into relational tables. Schemas and statistics of the two datasets can be found in Tables 4(a) and 4(b).

We manually picked a large number of queries for each dataset. We tried to include a wide variety of keywords and their combinations in the query sets, e.g., in terms of the selectivity of keywords, the size of the most relevant answers, the number of potentially relevant answers, etc. We focus on a subset of the queries here. There are 22 queries for the IMDB dataset (IQ1 to IQ22) with query lengths ranging from 2 to 3. There are 18 queries for the DBLP dataset (DQ1 to DQ18) with query lengths ranging from 2 to 4.

Table 4: Dataset Statistics (Text Attributes Are Underlined)

(a) IMDB Dataset

Relation Schema                         # Tuples
movies(mID, name)                        833,512
direct(mID, dID)                         561,173
directors(dID, name)                     121,928
actressplay(asID, charactor, mID)      2,262,149
actresses(asID, name)                    445,020
actorplay(atID, charactor, mID)        4,244,600
actors(atID, name)                       741,449
genres(mID, genre)                       629,195
Total Number of Tuples                 9,839,026

(b) DBLP Dataset

Relation Schema                                                              # Tuples
InProceeding(InProceedingId, Title, Pages, URL, ProceedingId)                 212,273
Person(PersonId, Name)                                                        174,709
RelationPersonInProceeding(InProceedingId, PersonId)                          491,777
Proceeding(ProceedingId, Title, EditorId, PublisherId, SeriesId, Year, Url)     3,007
Publisher(PublisherId, Name)                                                       86
Series(SeriesId, Title, Url)                                                       24
Total Number of Tuples                                                        881,867

6 http://www.imdb.com/interfaces
7 http://www.dbis.informatik.uni-goettingen.de/Mondial/
8 Results on the Mondial dataset can be found in the full paper [23].

The databases used are the Linux versions of Oracle 10g Express and MySQL v5.0.18, both with their default configurations. Indexes were built on all primary key and foreign key attributes. For most of the queries, similar results were obtained on the two systems, so we focus on the results on Oracle. We implemented the Sparse and global pipeline (GP) algorithms, and our skyline sweep (SS) and block pipeline (BP) algorithms. Note that we can lower bound the execution time of the Hybrid algorithm [15] by the minimum of the running times of Sparse and GP. We improved the original GP algorithm by using the range parametric query optimization (Section 4.4.1); without this optimization, the original GP algorithm would have sent an excessive number of queries to the database and incurred significant overhead. Unless specified explicitly, all algorithms ran using the OR semantics.

All algorithms were implemented using JDK 1.5 and JDBC. All experiments were run on a PC with a 1.8GHz CPU and 512MB memory running Debian GNU/Linux 3.1. The database server and the client were run on the same PC. All algorithms were run in warm buffer mode with the Java JIT enabled.

To measure the effectiveness, we adopt two metrics used in the previous study [22]: (a) the number of top-1 answers that are relevant (#Rel), and (b) the reciprocal rank (R-Rank). In order to select the relevant answer, we ran all the algorithms for the same query and merged their top-20 results. Then we manually judged and picked the relevant answer(s) for each query. The relevant answer(s) must satisfy two conditions: it must match all the search keywords, and its size must be the smallest. For example, the manually marked relevant answer for the query "nikos clique" is a paper named "Constraint-Based Algorithms for Computing Clique Intersection Joins" written by "Nikos Mamoulis". When measuring the reciprocal rank, we search for the first relevant answer in the top-20 results. In case none of the top-20 answers is relevant, we upper bound the R-Rank value by 1/(#uniq_score + 1), where #uniq_score is the number of unique scores in the top-20 results. We did not use recall and the subsequent MAP measures, as we found the recall values are not comparable across different top-k queries.^9 To measure efficiency, we measure the average elapsed times of the algorithms over several runs.
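For clarity, here is a minimal sketch of how the R-Rank of a single query could be computed under these rules, including the upper bound used when no relevant answer appears in the top-20. The class, method, and parameter names are ours, not the paper's.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch: R-Rank of one query's ranked result list under the rules above.
public final class RRank {
    // results: answer ids in rank order (top-20); scores: their scores in the
    // same order; relevant: the manually judged relevant answer ids.
    public static double rrank(List<String> results, List<Double> scores, Set<String> relevant) {
        for (int i = 0; i < results.size(); i++) {
            if (relevant.contains(results.get(i))) {
                return 1.0 / (i + 1);        // reciprocal rank of the first relevant answer
            }
        }
        // No relevant answer among the top-20: upper bound by 1 / (#uniq_score + 1).
        Set<Double> uniq = new LinkedHashSet<Double>(scores);
        return 1.0 / (uniq.size() + 1);
    }
}
```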

5.1 Effectiveness

We show the reciprocal ranks of [15], [22], and our proposed method on the DBLP dataset in Table 5. For [22], we used all four normalizations, but not the phrase-based ranking. For our ranking method, we varied the tuning parameter p from 1.0 to 2.0, representing the change of preference from the OR-semantics to the AND-semantics. The results show that our R-Rank is higher than those of the other methods, as our method returns the relevant result as the top-1 result for 16 out of the 18 DBLP queries when p = 1.0. While [15] and [22] tie in the #Rel measure, [22] actually performs better than [15], because it often returns relevant answer(s) within the top-5 results, while [15] often fails to find any relevant answer in the top-20 results. This is reflected in their R-Rank measures. Similar results were obtained on the IMDB dataset.

Manual inspection of the top-20 answers returned by the algorithms reveals some interesting "features" of the ranking methods. Due to the inherent bias in [15]'s rank aggregation method and its extremely harsh penalty on CN sizes, it tends to return results that have only partial matches to the query or small-sized results. [22] proposed using a soft CN size normalization and a non-linear rank aggregation method; consequently, it tends to return large-sized results that match most of the keywords. Our method seems to strike a good balance between the completeness of the matches and the size of the results. For instance, we show the top-1 results returned by all ranking methods for DQ1 on DBLP in Table 6.

^9 For example, it is not hard to construct a query that returns a large number (N) of papers written by a particular author and published in a particular conference. Even an ideal algorithm can only achieve a recall of k/N, which could be arbitrarily small.

Table 5: Effectiveness on the DBLP Dataset Based on Top-20 Results

          [15]       [22]       p = 1.0   p = 1.4   p = 2.0
#Rel      2          2          16        16        18
R-Rank    ≤ 0.243    ≤ 0.333    0.926     0.935     1

Table 7: p’s Impact on R-Rank

Dataset QueryID p = 1 p = 1.4 p = 2.0

DBLP DQ9 1/3 1/2 1DBLP DQ17 1/3 1/3 1IMDB IQ10 1 1 1IMDB IQ17 1/3 1/3 1IMDB IQ19 1/2 1 1IMDB IQ21 1/2 1/2 1/2

We also conducted experiments varying p from 1.0 to 2.0, which injects more AND semantics into our ranking method. As the default p = 1.0 already returns relevant results for most queries, we only list the queries whose result qualities (R-Rank values) are affected by the varying p in Table 7. With an increasing value of p, the R-Rank values for most such queries increase. This is because we start to penalize more heavily results that do not match all the keywords. For example, when p = 1.0, the relevant answer for DQ9 is only ranked third, while the top-1 answer matches all but one keyword. When p increases to 1.4, the relevant answer moves up to second. Finally, when p reaches 2.0, it is ranked as the top answer.

5.2 Efficiency

We show the running times for all queries on the DBLP and IMDB datasets in Figures 5(a) to 5(d) for k = 10 and k = 1. Note that the y-axis is in logarithmic scale. We can make the following observations:

• BP is usually the fastest algorithm on both the DBLP and IMDB datasets. The speedup is most substantial on hard queries, e.g., DQ7, DQ13, and DQ17. BP can achieve up to two orders of magnitude speedup against the better of Sparse and GP (and thus against the lower bound of the Hybrid algorithm). BP can return top-10 answers within 2 seconds for 89% of the queries on the DBLP dataset and 77% of the queries on the IMDB dataset, which is 10 times larger.
• SS usually outperforms Sparse and GP, with only a few losses to Sparse, even for k = 20. When k is small, SS shows more performance advantages. SS can achieve up to one order of magnitude speedup against the better of Sparse and GP.
• There is no sure winner between Sparse and GP. In general, while Sparse might lose for small k values or easy queries, its performance does not deteriorate too much for large k or hard queries.
• All algorithms are more responsive for smaller k values. We note that since our ranking function usually returns the relevant answer as the top-1 answer, the execution time for the top-1 answer is an important indicator of system performance from the user's perspective.

We plotted the execution times with different k values for all queries. We selected three representative figures from the DBLP query set and show the results in Figures 5(e) to 5(g). In general, the costs of all algorithms increase with the increasing k value, as more candidate answers need to be found and compared. Some of the queries are amenable to top-k optimized algorithms (e.g., DQ11), where GP, SS, and BP all perform significantly better than Sparse. There are also queries where GP performs well for small k values but jumps to a large running cost afterwards (e.g., DQ15); in contrast, although both SS and BP also exhibit a similar increase in query time, they still outperform Sparse. There are also hard queries where neither Sparse nor GP runs well (e.g., DQ13). SS still delivers acceptable performance for small to medium k values, and BP maintains outstanding performance.

We also ran all the algorithms using the AND semantics for top-1 results on DBLP. All algorithms find the relevant results as the top-1 results, so we focus on the execution time, which is plotted in Figure 5(h). Similar conclusions can be drawn about the relative performance of the algorithms.

The fundamental reason for the superior performance of the SS and BP algorithms is that they avoid many unnecessary database probes. To verify this, we recorded the number of candidates each algorithm checked against the database (QSize) and the number of queries sent to the database (QNum), and plot them in Figures 5(i) and 5(j), respectively. Specifically, we choose SS as the baseline algorithm (since it has the minimal QSize without utilizing a second upper bounding function) and calculate the ratio of probes of the other algorithms over this baseline number. We can observe from Figure 5(i) that Sparse usually has to examine many candidates, as it cannot stop early until the complete query of a CN has been executed. GP is only slightly better than Sparse, partly because we specify a rather large k value, hence GP's performance drops quickly. The BP algorithm makes use of additional upper bounding functions and delays probing the database as much as possible; as a result, it usually examines fewer candidates than SS. In terms of query numbers, as expected, Sparse always sends a small number of queries. Interestingly, SS sends the largest number of queries in all the cases compared to GP and BP. The reason is that both GP and BP can examine a number of candidates together in one single query using our range parametric query optimization; in contrast, SS examines candidates in an ad-hoc manner and such optimization cannot be applied.

We broke down the elapsed time for all four algorithms. We summed up the time used by the RDBMS to process queries and divided it by the total elapsed time of the queries. The result for the DBLP dataset is shown in Figure 5(k). Overall, the dominant cost for all algorithms is the DBMS query processing time. GP's cost is almost entirely dominated by DBMS query processing time, as GP only needs to keep a few data structures (the current tuple in each of the non-free tuple sets of the CNs) and does not have expensive calculation or data structure maintenance overhead. The DBMS query time ratios of Sparse and SS come second, for different reasons: Sparse's overhead mainly comes from the need to calculate the IR scores for all results returned by its large-sized queries, while SS needs to spend time maintaining the priority queue. The BP algorithm is still dominated by DBMS query processing time (about 71% on average), but to the least extent. This is because, intuitively, BP spends more time in its internal calculation (of upper bounding scores) to avoid expensive database probes.

6. RELATED WORK

Keyword Search Systems. There have been several existing works supporting free-style keyword search on database contents.


Table 6: Top-1 Result for DQ1 (nikos clique) on DBLP

Method   Size   Top-1 Result
[15]     1      InProceeding: Clique-to-Clique Distance Computation Using a Specific Architecture
[22]     6      Person: Nikos Karatzas ← Proceeding → Series ← Proceeding ← InProceeding: Maximum Clique Transversals; InProceeding: On ... Clique-Width and ...
Ours     3      Person: Nikos Mamoulis ← RPI → InProceeding: Constraint-Based Algorithms for Computing Clique Intersection Joins

Early work includes [12, 13], where entities "near" the occurrences of the search keywords are ranked and returned. The idea has been extended to finding virtual entities consisting of inter-connected tuples that collectively contain all the keywords [1, 16, 3, 18]. Recent work focuses on bringing more effective ranking from the IR literature and on the related query processing methods [15, 22]. Specifically, [22] improves the ranking method in [15] with the following normalizations: tuple tree size normalization, refined document length normalization, document frequency normalization, and inter-document weight normalization. Some advanced features, such as schema term awareness and phrase-based ranking, are also proposed. Our work can be viewed as a further improvement along the line of enhancing retrieval effectiveness. [30] is the first to propose a materialization-based approach to answering keyword queries on top of RDBMSs. Most recently, keyword search has also been studied in several generalized contexts [29, 21].

Top-k Query Processing. Top-k query processing has also been extensively studied in the literature. Fagin et al. [10, 11] introduced a set of novel algorithms, assuming that sorted access and/or random access to the objects is available on each attribute. A number of improvements have been suggested after Fagin's seminal work, for example, minimizing the computational cost [24] and minimizing the I/O cost [2]. Other approaches to the problem include building indexes [32] or materialized views [8].

A generalized version of Fagin's top-k problem is to consider complex relationships among objects. [25] studied finding top-k joined objects and proposed the J∗ algorithm. A number of improvements were suggested by Ilyas et al. [17]. Chang et al. considered more general predicates, and an optimal algorithm, MPro, was proposed based on the necessary probing principle [5]. However, all the above work assumes a monotonic aggregation function to rank the objects, while our ranking function is non-monotonic and existing methods cannot be immediately applied.

7. CONCLUSIONS

In this paper, we studied supporting effective and efficient top-k keyword queries over relational databases. We proposed a new ranking method that adapts state-of-the-art IR ranking functions and principles to ranking trees of joined database tuples. Our ranking method also has several salient features over existing ones. We also studied query processing methods tailored for our non-monotonic ranking function. Two algorithms were proposed that aggressively minimize database probes. We have conducted extensive experiments on large-scale real databases. The experimental results confirm that our ranking method achieves high precision with high efficiency, scaling to databases with tens of millions of tuples.

Acknowledgement

We would like to thank Nino Svonja for his contribution in the early stage of this work. We also want to thank the anonymous reviewers who provided helpful suggestions on the paper.

8. REFERENCES

[1] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases. In ICDE, pages 5–16, 2002.
[2] H. Bast, D. Majumdar, R. Schenkel, M. Theobald, and G. Weikum. IO-Top-k: Index-access optimized top-k query processing. In VLDB, pages 475–486, 2006.
[3] G. Bhalotia, A. Hulgeri, C. Nakhe, S. Chakrabarti, and S. Sudarshan. Keyword searching and browsing in databases using BANKS. In ICDE, pages 431–440, 2002.
[4] S. Borzsonyi, D. Kossmann, and K. Stocker. The skyline operator. In ICDE, pages 421–430, 2001.
[5] K. C.-C. Chang and S. Hwang. Minimal probing: supporting expensive predicates for top-k queries. In SIGMOD, pages 346–357, 2002.
[6] S. Chaudhuri, R. Ramakrishnan, and G. Weikum. Integrating DB and IR technologies: What is the sound of one hand clapping? In CIDR, pages 1–12, 2005.
[7] R. Cyganiak. D2RQ benchmarking. http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2rq/benchmarks/.
[8] G. Das, D. Gunopulos, N. Koudas, and D. Tsirogiannis. Answering top-k queries using views. In VLDB, pages 451–462, 2006.
[9] B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin. Finding top-k min-cost connected trees in databases. In ICDE, 2007.
[10] R. Fagin. Combining fuzzy information from multiple systems. J. Comput. Syst. Sci., 58(1):83–99, 1999.
[11] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS, 2001.
[12] R. Goldman, N. Shivakumar, S. Venkatasubramanian, and H. Garcia-Molina. Proximity search in databases. In VLDB, 1998.
[13] T. Grabs, K. Bohm, and H.-J. Schek. PowerDB-IR: Information retrieval on top of a database cluster. In CIKM, pages 411–418, 2001.
[14] P. J. Haas and J. M. Hellerstein. Ripple joins for online aggregation. In SIGMOD, pages 287–298, 1999.
[15] V. Hristidis, L. Gravano, and Y. Papakonstantinou. Efficient IR-style keyword search over relational databases. In VLDB, 2003.
[16] V. Hristidis and Y. Papakonstantinou. DISCOVER: Keyword search in relational databases. In VLDB, pages 670–681, 2002.
[17] I. F. Ilyas, W. G. Aref, and A. K. Elmagarmid. Supporting top-k join queries in relational databases. VLDB Journal, 13(3):207–221, 2004.
[18] V. Kacholia, S. Pandit, S. Chakrabarti, S. Sudarshan, R. Desai, and H. Karambelkar. Bidirectional expansion for keyword search on graph databases. In VLDB, pages 505–516, 2005.
[19] B. Kimelfeld and Y. Sagiv. Efficient engines for keyword proximity search. In WebDB, pages 67–72, 2005.
[20] B. Kimelfeld and Y. Sagiv. Finding and approximating top-k answers in keyword proximity search. In PODS, pages 173–182, 2006.
[21] G. Koutrika, A. Simitsis, and Y. Ioannidis. Precis: The essence of a query answer. In ICDE, 2006.
[22] F. Liu, C. T. Yu, W. Meng, and A. Chowdhury. Effective keyword search in relational databases. In SIGMOD, pages 563–574, 2006.
[23] Y. Luo, X. Lin, W. Wang, and X. Zhou. SPARK: Top-k keyword query in relational databases. Technical Report 0708, School of Computer Science and Engineering, University of New South Wales, 2007.
[24] N. Mamoulis, K. H. Cheng, M. L. Yiu, and D. W. Cheung. Efficient aggregation of ranked inputs. In ICDE, 2006.


[Figure 5: Evaluation of Query Processing Performance. Panels: (a) Query Time (ms), DBLP, k = 10; (b) Query Time (ms), IMDB, k = 10; (c) Query Time (ms), DBLP, k = 1; (d) Query Time (ms), IMDB, k = 1; (e) Query Time (ms) vs. k, DBLP, DQ11, k = 1 to 20; (f) Query Time (ms) vs. k, DBLP, DQ13, k = 1 to 20; (g) Query Time (ms) vs. k, DBLP, DQ15, k = 1 to 20; (h) Query Time (ms), DBLP, k = 1, AND Semantics; (i) Query Size Ratio vs. SS, DBLP, k = 20; (j) Number of Queries, DBLP, k = 20; (k) Ratio of DBMS Query Processing Time over Overall Elapsed Time, DBLP, k = 20. In the per-query panels, the i-th query is DQi (DBLP) or IQi (IMDB); time axes are in logarithmic scale. Algorithms compared: Sparse, GP, SS, BP.]

[25] A. Natsev, Y.-C. Chang, J. R. Smith, C.-S. Li, and J. S. Vitter. Supporting incremental join queries on ranked inputs. In VLDB, pages 281–290, 2001.
[26] S. E. Robertson, H. Zaragoza, and M. J. Taylor. Simple BM25 extension to multiple weighted fields. In CIKM, pages 42–49, 2004.
[27] D. E. Rose and D. R. Cutting. Ranking for usability: Enhanced retrieval for short queries. Technical Report 163, Apple Technical Report, 1996.
[28] G. Salton, E. A. Fox, and H. Wu. Extended Boolean information retrieval. Communications of the ACM, 26(11):1022–1036, 1983.
[29] M. Sayyadan, H. LeKhac, A. Doan, and L. Gravano. Efficient keyword search across heterogeneous relational databases. In ICDE, 2007.
[30] Q. Su and J. Widom. Indexing relational database content offline for efficient keyword-based search. In IDEAS, 2005.
[31] R. Wilkinson, J. Zobel, and R. Sacks-Davis. Similarity measures for short queries. In TREC, 1995.
[32] D. Xin, C. Chen, and J. Han. Towards robust indexing for ranked queries. In VLDB, pages 235–246, 2006.