On the Complexity of Query Result Diversification

TING DENG, RCBD and SKLSDE, Beihang University

WENFEI FAN, Informatics, University of Edinburgh, and RCBD and SKLSDE, Beihang University

Query result diversification is a bi-criteria optimization problem for ranking query results. Given a database D, a query Q and a positive integer k, it is to find a set of k tuples from Q(D) such that the tuples are as relevant as possible to the query, and at the same time, as diverse as possible to each other. Subsets of Q(D) are ranked by an objective function defined in terms of relevance and diversity. Query result diversification has found a variety of applications in databases, information retrieval and operations research.

This paper investigates the complexity of result diversification for relational queries. (1) We identify three problems in connection with query result diversification, to determine whether there exists a set of k tuples that is ranked above a bound with respect to relevance and diversity, to assess the rank of a given k-element set, and to count how many k-element sets are ranked above a given bound based on an objective function. (2) We study these problems for a variety of query languages and for the three objective functions proposed in [Gollapudi and Sharma 2009]. We establish the upper and lower bounds of these problems, all matching, for both combined complexity and data complexity. (3) We also investigate several special settings of these problems, identifying tractable cases. Moreover, (4) we re-investigate these problems in the presence of compatibility constraints commonly found in practice, and provide their complexity in all these settings.

Categories and Subject Descriptors: H.2.3 [DATABASE MANAGEMENT]: Languages

General Terms: Design, Algorithms, Theory

Additional Key Words and Phrases: Result diversification, relevance, diversity, recommender systems, database queries, combined complexity, data complexity, counting problems

1. INTRODUCTION

Result diversification for relational queries is a bi-criteria optimization problem. Given a query Q, a database D and a positive integer k, it is to find a set U of k tuples in the query result Q(D) such that the tuples in U are as relevant as possible to query Q, and at the same time, as diverse as possible to each other. More specifically, we want to find a set U ⊆ Q(D) such that |U| = k, and the value F(U) of U is maximum. Here F(·) is called an objective function. It is defined on sets of tuples from Q(D), in terms of a relevance function δrel(·, ·) and a distance function δdis(·, ·), where

— for each tuple t ∈ Q(D), δrel(t, Q) is a number indicating the relevance of answer t to query Q, such that the higher δrel(t, Q) is, the more relevant t is to Q; and

— for all tuples t1, t2 ∈ Q(D), δdis(t1, t2) is the distance between t1 and t2, such that the larger δdis(t1, t2) is, the more diverse the answers t1 and t2 are.

Deng and Fan are supported in part by 973 Program 2014CB340302. Fan is also supported in part by NSFC 61133002, 973 Program 2012CB316200, Guangdong Innovative Research Team Program 2011D005 and Shenzhen Peacock Program 1105100030834361, and EPSRC EP/J015377/1.

Author’s addresses: T. Deng; W. Fan, School of Informatics, Laboratory for Foundations of Computer Science, Informatics Forum, 10 Crichton Street, Edinburgh EH8 9AB, Scotland, UK. Email: [email protected]

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].

© YYYY ACM 1539-9087/YYYY/01-ARTA $15.00 DOI 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000

ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.


In particular, three generic objective functions have been proposed and studied in [Gollapudi and Sharma 2009] based on an axiom system, namely, max-sum diversification, max-min diversification and mono-objective formulation. Each of these functions is defined in terms of generic functions δrel(·, ·) and δdis(·, ·), with a parameter λ ∈ [0, 1] specifying the tradeoff between relevance and diversity.

Query result diversification aims to improve user satisfaction by remedying the over-specification problem of retrieving too homogeneous answers. The diversity of query answers is measured in terms of (1) contents, to include items that are dissimilar to each other, (2) novelty, to retrieve items that contain new information not found in previous results, and (3) coverage, to cover items in different categories [Drosou and Pitoura 2010]. It has proven effective in Web search [Gollapudi and Sharma 2009; Vieira et al. 2011], recommender systems [Yu et al. 2009b; Zhang and Hurley 2008; Ziegler et al. 2005], databases [Demidova et al. 2010; Liu et al. 2009; Vee et al. 2008], sponsored-search advertising [Feuerstein et al. 2007], and in operations research and finance (see [Drosou and Pitoura 2010; Minack et al. 2009] for surveys).

This paper investigates the complexity of result diversification analysis for relational queries. While there has been a host of work on result diversification, the previous work has mostly focused on diversity and relevance metrics, and on algorithms for computing diverse results [Drosou and Pitoura 2010; Minack et al. 2009]. Few complexity results have been developed for query result diversification, and the known results are mostly lower bounds (NP-hardness) [Agrawal et al. 2009; Gollapudi and Sharma 2009; Liu et al. 2009; Stefanidis et al. 2010; Vieira et al. 2011]. Furthermore, these results are established by assuming that query result Q(D) is already known. In other words, the prior work conducts diversification in two steps: first compute Q(D), and then rank k-element subsets of Q(D) and find a set with the maximum F(·) value. The known complexity results are for the second step only, based on a specific objective function F(·). However, it is typically expensive to compute Q(D). To avoid the overhead, we want to combine the two steps by embedding diversification in query evaluation, and stop as soon as top-ranked results are found based on F(·) (i.e., early termination), rather than retrieve the entire Q(D) in advance [Demidova et al. 2010]. Nonetheless, the complexity of such a query result diversification process has not been studied.

This highlights the need for establishing the complexity of query result diversification, both upper bounds and lower bounds, when Q(D) is not provided, and for different query languages and various objective functions. Indeed, to develop practical algorithms for computing diverse query results, we have to understand the impact of query languages and objective functions on the complexity of result diversification.

Example 1.1. Consider a recommender system to help people find gifts for various events or occasions, e.g., FindGift¹. Its underlying database D0 consists of two relations specified by the following relation schemas:

catalog(item, type, price, inStock),

history(item, buyer, recipient, gender, age, rel, event, rating).

Here each catalog tuple specifies an item for a present, its type (e.g., jewelry, book), price, and the number of the item in stock. Purchase history is recorded by relation history: a history tuple indicates that a buyer bought an item for a recipient specified by gender, age and relationship with the buyer, for an event (e.g., birthday, wedding, holiday), as well as a rating given by the buyer in the range [1, 5].

¹ http://www.findgift.com.


Peter wants to use the engine to find a Christmas gift for his 14-year-old niece Grace, in the price range of [$20, $30]. His request can be converted to a query Q0 defined on database D0. The relevance δrel(t, Q0) of a tuple t returned by Q0(D0) can be assessed by using the information from relation history, by taking into account previous presents purchased for girls 12–16 years old by the girls’ relatives for holidays, as well as the ratings by those buyers. The distance (diversity) δdis(t1, t2) between two items t1 and t2 returned by Q0(D0) can be estimated by considering the differences between their types. Peter wants the system to recommend a set of 10 items from Q0(D0) such that on the one hand, those items are as fit as possible as a Christmas present for a teenage girl, and on the other hand, are as dissimilar as possible to cover a wide range of choices.

The computational complexity of processing such requests depends on both the queries expressing users’ requests and the objective function used by the system.

(1) Query languages. Query Q0 can be expressed as a conjunctive query (CQ). Nonetheless, if Peter wants a new gift that is different from previous gifts he gave to Grace, we need first-order logic (FO) to express Q0, by using negation on relation history. In practice one cannot expect that Q0(D0) is already computed when Peter submits his request. As remarked earlier, it is too costly to compute Q0(D0) first and then pick a top set of k items from Q0(D0). Instead, we want to embed result diversification in the evaluation of Q0, and ideally, find a satisfactory set of k items without retrieving the entire set Q0(D0) and paying its cost. A question concerns what difference CQ and FO make on the complexity of processing such requests when Q0(D0) is not necessarily available. One naturally wants to know whether the complexity is introduced by the query languages or is inherent to result diversification.

(2) Objective functions. Consider the objective function by max-sum diversification proposed in [Gollapudi and Sharma 2009] and revised in [Vieira et al. 2011]:

$$F_{\mathrm{MS}}(U) = (k-1)(1-\lambda)\cdot\sum_{t\in U}\delta_{\mathrm{rel}}(t,Q) \;+\; \lambda\cdot\sum_{t,t'\in U}\delta_{\mathrm{dis}}(t,t'),$$

where U is a set of tuples in Q(D). To assess the diversity, FMS(U) only requires computing δdis(t, t′) for t and t′ in a given k-element set U ⊆ Q(D). Similarly, the objective function by max-min diversification is defined as [Gollapudi and Sharma 2009]:

$$F_{\mathrm{MM}}(U) = (1-\lambda)\cdot\min_{t\in U}\delta_{\mathrm{rel}}(t,Q) \;+\; \lambda\cdot\min_{t,t'\in U,\ t\neq t'}\delta_{\mathrm{dis}}(t,t').$$

In contrast, consider the mono-objective formulation of [Gollapudi and Sharma 2009]:

$$F_{\mathrm{mono}}(U) = \sum_{t\in U}\Bigl((1-\lambda)\cdot\delta_{\mathrm{rel}}(t,Q) \;+\; \frac{\lambda}{|Q(D)|-1}\cdot\sum_{t'\in Q(D)}\delta_{\mathrm{dis}}(t,t')\Bigr).$$

It asks for δdis(t, t′) for each t ∈ U and for all t′ ∈ Q(D), i.e., the average dissimilarity w.r.t. all other results in Q(D) [Minack et al. 2009]. The question is what different impacts FMS(·), FMM(·) and Fmono(·) have on the complexity of diversification. □
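To make the three objective functions of Example 1.1 concrete, the following is a minimal Python sketch that evaluates them on a toy, already materialized result set. It is an illustration under stated assumptions, not code from the paper: the function names, the toy tuples, and the choices of `rel` and `dis` are hypothetical, each unordered pair {t, t′} is assumed to be counted once in the max-sum term, and δdis(t, t) is assumed to be 0.

```python
from itertools import combinations


def f_max_sum(U, rel, dis, lam):
    """Max-sum: (k-1)(1-lam) * total relevance + lam * total pairwise distance."""
    k = len(U)
    relevance = sum(rel(t) for t in U)
    # Assumption: each unordered pair {t, t'} contributes once.
    diversity = sum(dis(t, u) for t, u in combinations(U, 2))
    return (k - 1) * (1 - lam) * relevance + lam * diversity


def f_max_min(U, rel, dis, lam):
    """Max-min: (1-lam) * least relevance + lam * least pairwise distance (needs k >= 2)."""
    min_rel = min(rel(t) for t in U)
    min_dis = min(dis(t, u) for t, u in combinations(U, 2))
    return (1 - lam) * min_rel + lam * min_dis


def f_mono(U, QD, rel, dis, lam):
    """Mono-objective: per-tuple score that looks at ALL of Q(D), summed over U."""
    n = len(QD)  # assumes |Q(D)| >= 2
    return sum(
        (1 - lam) * rel(t) + lam / (n - 1) * sum(dis(t, u) for u in QD)
        for t in U
    )


# Hypothetical stand-in for Q0(D0): (item, type, average rating).
QD = [
    ("necklace", "jewelry", 5),
    ("ring", "jewelry", 4),
    ("novel", "book", 4),
    ("puzzle", "toy", 3),
]


def rel(t):
    return t[2]  # relevance taken to be the rating (an assumption)


def dis(t, u):
    return 0 if t[1] == u[1] else 1  # items differ only by type (an assumption)


U = [QD[0], QD[2]]  # {necklace, novel}
print(f_max_sum(U, rel, dis, 0.5), f_max_min(U, rel, dis, 0.5), f_mono(U, QD, rel, dis, 0.5))
```

Note that `f_max_sum` and `f_max_min` only touch tuples in U, whereas `f_mono` takes the whole result set `QD` as an argument, which mirrors the contrast drawn in the example.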

To the best of our knowledge, no prior work has answered these questions. These issues require a full treatment for different query languages and objective functions, to find out where the complexity of query result diversification arises.

Contributions. We study several fundamental problems in connection with result diversification for relational queries, and establish their upper bounds and lower bounds, all matching, for a variety of query languages and objective functions.

Diversification problems. We identify three problems for query result diversification. Given a query Q, a database D, an objective function F(·), and a positive integer k,

(1) the query result diversification problem (QRD) is a decision problem to determine whether there exists a k-element set U ⊆ Q(D) such that F(U) ≥ B for a given bound B, i.e., whether there exists a set U that satisfies the users’ need at all;

(2) the diversity ranking problem (DRP) is to decide whether a given k-element set U ⊆ Q(D) is among top-r ranked sets, such that there exist no more than r − 1 sets S ⊆ Q(D) of k elements with F(S) > F(U); as advocated in [Jin and Patel 2011], a decision procedure for DRP can help us assess how well a given k-element set U satisfies the users’ request, and help vendors evaluate their products w.r.t. users’ need; and

(3) the result diversity counting problem (RDC) is to count the number of k-element sets U ⊆ Q(D) such that F(U) ≥ B for a given bound B. It is a counting problem that helps us find out how many k-element sets can be extracted from Q(D) and be suggested to the users, and provides guidance for recommender systems to adjust their stock.
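The three problem statements above can be pinned down with a brute-force Python sketch. It assumes that Q(D) has already been materialized as a list, which is exactly the assumption this paper avoids, and it enumerates all k-element subsets, so it runs in exponential time; it is meant only to make the definitions unambiguous. The names `qrd`, `drp`, `rdc` and the toy objective `spread` are hypothetical.

```python
from itertools import combinations


def qrd(QD, k, F, B):
    """QRD: is there a k-element set U of Q(D) with F(U) >= B?"""
    return any(F(U) >= B for U in combinations(QD, k))


def drp(QD, U, F, r):
    """DRP: is U among the top-r sets, i.e. do at most r-1 k-element sets beat it?"""
    target = F(U)
    better = sum(1 for S in combinations(QD, len(U)) if F(S) > target)
    return better <= r - 1


def rdc(QD, k, F, B):
    """RDC: how many k-element sets U of Q(D) have F(U) >= B?"""
    return sum(1 for U in combinations(QD, k) if F(U) >= B)


def spread(U):
    """Toy diversity-only objective: the smallest pairwise gap in U."""
    return min(abs(a - b) for a, b in combinations(U, 2))


QD = [1, 2, 4, 8]  # hypothetical query result
print(qrd(QD, 2, spread, 7), drp(QD, (2, 8), spread, 2), rdc(QD, 2, spread, 4))
```

Any objective function can be passed as `F`, provided it accepts a tuple of result tuples and returns a number.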

Complexity results. For all these problems we establish their combined complexity and data complexity (i.e., when both data D and query Q may vary, and when Q is fixed while D may vary, respectively; see [Abiteboul et al. 1995]). We parameterize these problems with various query languages, including conjunctive queries (CQ), unions of conjunctive queries (UCQ), positive existential FO queries (∃FO+) and first-order logic queries (FO) [Abiteboul et al. 1995], all with built-in predicates =, ≠, <, ≤, >, ≥. These languages have been used in query result diversification tools, e.g., CQ [Chen and Li 2007], ∃FO+ [Vee et al. 2008] and FO [Demidova et al. 2010]. For each of these query languages, we study these problems with each of the objective functions proposed by [Gollapudi and Sharma 2009], i.e., objective functions defined in terms of max-sum diversification, max-min diversification, and mono-objective formulation.

We provide a comprehensive account of upper and lower bounds for these problems, all matching when the problems are intractable; that is, we show that such a problem is C-hard and is in C for a complexity class C that is NP or beyond in the polynomial hierarchy. We also study special cases of these problems, such as when either only diversity or only relevance is considered, when Q is an identity query, and when k is a predefined constant. We identify practical tractable cases. It should be remarked that all the previous complexity results (NP-hardness) are established for a special case of QRD studied in this work only, namely, when Q is an identity query.

Compatibility constraints. We also re-investigate these problems in the presence of compatibility constraints [Koutrika et al. 2009; Lappas et al. 2009; Parameswaran et al. 2010; Parameswaran et al. 2011; Xie et al. 2012]. Such constraints are defined on a set U of top-k items, to specify what items have to be taken together and what items conflict with each other, among other things. The need for such constraints is evident in practice. For instance, when Peter buys a Christmas gift for Grace, he also wants to buy a Christmas card together; when one selects a course A for an undergraduate package, she has to include all the prerequisites of A in the package [Koutrika et al. 2009; Parameswaran et al. 2010]; and when one forms a basketball team, he would like to get recommendation for a center, two forwards and two point guards [Lappas et al. 2009]. Important as they are, however, little previous work has studied the impact of compatibility constraints on the analyses of query result diversification.

We propose a class Cm of compatibility constraints for query result diversification, which suffices to express compatibility requirements we commonly encounter in practice. The constraints of Cm are of a restricted form of tuple generating dependencies (see, e.g., [Abiteboul et al. 1995]), and can be validated in PTIME, i.e., given a set Σ of constraints in Cm and a dataset U, it is in PTIME to decide whether U satisfies Σ. We investigate the impact of such constraints on the combined complexity and data complexity of QRD, DRP and RDC, for query languages ranging over CQ, UCQ, ∃FO+ and FO, and when the objective function is FMS, FMM or Fmono. We also study these problems in all the special settings mentioned above, when a set of compatibility constraints of Cm is additionally imposed on the selected sets of query results.
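As a rough illustration of why validating such constraints is cheap, the Python sketch below checks two simplified kinds of requirements against a candidate set U: "if a is selected then b must be selected" and "a and b must not both be selected". This is a hypothetical stand-in chosen for readability; it is not the class Cm, whose precise form is defined later in the paper, and the names `satisfies`, `requires` and `conflicts` are assumptions.

```python
def satisfies(U, requires, conflicts):
    """Check a candidate set U against simplified compatibility constraints.

    requires:  pairs (a, b) meaning "if a is in U then b must be in U".
    conflicts: pairs (a, b) meaning "a and b must not both be in U".
    One pass over the constraints, so time is linear in |U| + |constraints|.
    """
    chosen = set(U)
    for a, b in requires:
        if a in chosen and b not in chosen:
            return False
    for a, b in conflicts:
        if a in chosen and b in chosen:
            return False
    return True


# Hypothetical constraints in the spirit of the gift example.
requires = [("gift", "card")]    # a gift must come with a card
conflicts = [("card", "ecard")]  # do not pick both a card and an e-card

print(satisfies({"gift", "card"}, requires, conflicts))
print(satisfies({"gift"}, requires, conflicts))
```

Checking one candidate set is easy; the difficulty studied in the paper comes from searching over the candidate sets that must pass such a check.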

Impact. These results tell us where the complexity arises (see Tables I, II and III in Section 10 for a detailed summary of the complexity bounds).

(1) Query languages LQ. Query languages may dominate the combined complexity of result diversification. For objective functions defined in terms of max-sum or max-min diversification, QRD, DRP and RDC are NP-complete, coNP-complete and #·NP-complete, respectively, when LQ is CQ. In contrast, when it comes to FO, these problems become PSPACE-complete, PSPACE-complete and #·PSPACE-complete, respectively. That said, the presence of disjunction in LQ does not complicate the diversification analyses. Indeed, these problems remain NP-complete, coNP-complete and #·NP-complete, respectively, when LQ is either UCQ or ∃FO+.

In contrast, different query languages have no impact on the data complexity of these problems, as expected. Indeed, for max-sum or max-min diversification, QRD, DRP and RDC are NP-complete, coNP-complete and #·NP-complete, respectively, and for mono-objective formulation, they are in PTIME (polynomial time), PTIME and #·P-complete, respectively, no matter whether LQ is CQ or FO. Intuitively, a naive algorithm for QRD works in two steps: first compute Q(D), and then find whether there exists a k-element set U from Q(D) such that F(U) ≥ B; similarly for DRP and RDC. When Q is fixed as in the setting of data complexity analysis, Q(D) can be computed in PTIME regardless of what query language LQ we use to express Q. The data complexity of the problems arises from the second step, i.e., the diversification computation.

(2) Objective functions F(·). When F(·) is Fmono, however, the objective function dominates the complexity: QRD, DRP and RDC are PSPACE-complete, PSPACE-complete and #·PSPACE-complete, respectively, no matter whether LQ is CQ or FO. Contrast these with their counterparts given above for FMS and FMM. The impact of F(·) is even more evident on the data complexity. As remarked earlier, for FMS and FMM, these problems are NP-complete, coNP-complete and #·NP-complete, respectively, for data complexity, whereas they are in PTIME, PTIME and #·P-complete, respectively, for Fmono.

(3) Diversity vs. relevance. The complexity is mostly introduced by the diversity requirement. This is consistent with the observation of [Vieira et al. 2011], which studied a special case of QRD when F(·) is FMS. Indeed, when the relevance function δrel(·, ·) is absent, the combined and data complexity bounds remain unchanged for all these problems, for any of the three objective functions. In contrast, when the distance function δdis(·, ·) is dropped, QRD and DRP become tractable when data complexity is considered. Moreover, for Fmono, when δdis(·, ·) is absent, the combined complexity of QRD, DRP and RDC becomes NP-complete, coNP-complete and #·NP-complete, down from PSPACE-complete, PSPACE-complete and #·PSPACE-complete, respectively. In particular, for FMS (resp. FMM), one can draw an analogy between δrel(·, ·) and sorting with a target weight, and between δdis(·, ·) and partitioning with dispersed objects, which are requirements of the (resp. Maximum) Dispersion Problem [Prokopyev et al. 2009].

(4) Compatibility constraints. Although the constraints of Cm are simple enough to be validated in PTIME, their presence complicates the analyses of QRD, DRP and RDC, to an extent. Indeed, all tractable cases of these problems in the absence of compatibility constraints become intractable when constraints of Cm are present. These include (a) the data complexity analyses of QRD, DRP and RDC when F(·) is Fmono, or when F(·) is FMS or FMM defined in terms of the relevance function δrel(·, ·) only; and (b) the combined complexity of these problems for identity queries when F(·) is Fmono. The only exception is the case when the bound k on the number of selected tuples is a constant; in this case, the data complexity analyses of these problems are tractable no matter whether the constraints of Cm are present or not.
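Returning to item (2), one way to see why QRD with Fmono is tractable in terms of data complexity is that Fmono(U) is a sum of per-tuple scores, and the score of a tuple t depends on t and on Q(D) but not on the other members of U. Once Q(D) is available, which takes polynomial time for a fixed query, the best k-element set consists of the k highest-scoring tuples. The Python sketch below illustrates this intuition only; it is not the paper's proof, it assumes no compatibility constraints, distinct hashable tuples and δdis(t, t) = 0, and all names in it are hypothetical.

```python
import heapq


def mono_scores(QD, rel, dis, lam):
    """Per-tuple mono-objective score; depends on t and on all of Q(D), not on U."""
    n = len(QD)  # assumes |Q(D)| >= 2 and distinct, hashable tuples
    return {
        t: (1 - lam) * rel(t) + lam / (n - 1) * sum(dis(t, u) for u in QD)
        for t in QD
    }


def best_mono_set(QD, k, rel, dis, lam):
    """Return a k-element set maximizing F_mono together with its value."""
    scores = mono_scores(QD, rel, dis, lam)
    top = heapq.nlargest(k, QD, key=scores.__getitem__)
    return top, sum(scores[t] for t in top)


def qrd_mono(QD, k, rel, dis, lam, B):
    """QRD for F_mono on a materialized Q(D): compare the best value with B."""
    _, best = best_mono_set(QD, k, rel, dis, lam)
    return best >= B


QD = [1, 2, 4, 8]  # hypothetical query result


def rel(t):
    return t  # toy relevance


def dis(t, u):
    return abs(t - u)  # toy distance


top, value = best_mono_set(QD, 2, rel, dis, 0.5)
print(sorted(top), value)
```

Computing all scores takes O(|Q(D)|²) evaluations of `dis`. No analogous shortcut is suggested here for max-sum or max-min diversification, where the contribution of a tuple depends on which other tuples are selected.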

These results reveal the impacts of various factors on the complexity of query result diversification. In particular, the results tell us that the complexity of these problems for CQ, UCQ and ∃FO+ may be inherent to result diversification itself, rather than a consequence of the complexity of the query languages. From the results we can see that these problems are intricate and mostly intractable. This highlights the need for developing efficient heuristic (approximation whenever possible) algorithms for them.

Organization. We discuss related work in Section 2, and present a general model for query result diversification in Section 3. Problems QRD, DRP and RDC are formulated in Section 4, and their combined complexity and data complexity are established in Sections 5, 6 and 7, respectively. Section 8 studies special cases of these problems, and Section 9 revisits these problems in the presence of compatibility constraints. Finally, Section 10 identifies directions for future work. Due to the space constraint we defer the proofs of the results of Sections 8 and 9 to the electronic appendix.

2. RELATED WORK

This paper is an extension of our earlier work [vld ] by including the following. (1) Detailed proofs of all the results (Sections 5, 6, 7 and 8), which were not presented in [vld ]. A variety of techniques are used to prove these results, including counting arguments, a wide range of reductions and constructive proofs with algorithms. (2) An extension of the query result diversification model with compatibility constraints, and the (combined and data) complexity of all these problems in the presence of compatibility constraints (Section 9). These are among the first results for incorporating compatibility constraints into query result diversification.

This work is also related to prior work on result diversification (for search and queries), recommender systems and top-k query answering, discussed as follows.

Diversification. Diversification has been studied for Web search [Agrawal et al. 2009; Borodin et al. 2012; Capannini et al. 2011; Gollapudi and Sharma 2009; Vieira et al. 2011], recommender systems [Yu et al. 2009a; 2009b; Zhang and Hurley 2008; Ziegler et al. 2005], structured databases [Demidova et al. 2010; Fraternali et al. 2012; Liu et al. 2009; Vee et al. 2008] and sponsored-search advertising [Feuerstein et al. 2007] (see [Drosou and Pitoura 2010; Minack et al. 2009] for surveys). The previous work has mostly focused on metrics for assessing relevance and diversity, and optimization techniques for computing diverse answers. The prior work often adopts specific objective functions based on the similarity of, e.g., taxonomy [Ziegler et al. 2005], explanations [Yu et al. 2009a], features [Vee et al. 2008] or locations [Fraternali et al. 2012]. A general model for result diversification was proposed in [Gollapudi and Sharma 2009] based on an axiom system, along with the three objective functions mentioned earlier. A minor revision of max-sum diversification of [Gollapudi and Sharma 2009] was presented in [Vieira et al. 2011]. This work extends the model of [Gollapudi and Sharma 2009] by incorporating queries (and compatibility constraints). Like in [Borodin et al. 2012], we focus on the objective functions proposed in [Gollapudi and Sharma 2009].

The complexity of result diversification has been studied in [Agrawal et al. 2009; Gollapudi and Sharma 2009; Liu et al. 2009; Stefanidis et al. 2010; Vieira et al. 2011], which differ from this work in the following.

(1) The previous work provided lower bounds (NP-hardness) but stopped short of giving a matching upper bound. In contrast, we provide a complete picture of matching upper and lower bounds, for both combined and data complexity.

(2) The prior work assumed that the search space Q(D) is already computed, and is taken as input. As remarked earlier, this assumption is not very realistic in practice. In contrast, we treat Q and D as input instead of Q(D), and investigate the impact of query languages on the complexity of diversification. As will be seen later, the complexity bounds of these problems when Q(D) is not available are quite different from their counterparts when Q(D) is assumed in place (i.e., when Q is an identity query).

(3) The previous work focused on a special case of QRD, when Q is an identity query. It is one of the special cases studied in Section 8 of this paper. Note that the intractability of QRD for max-sum or max-min diversification given in the prior work [Drosou and Pitoura 2009; Gollapudi and Sharma 2009; Vieira et al. 2011] may be adapted to establish the data complexity of QRD in these settings. Nonetheless, the detailed proofs are not given in those papers. Further, for mono-objective formulation, no previous work has studied the complexity of QRD for identity queries, which will be shown to be in PTIME in this work. Moreover, we are not aware of any complexity results for DRP and RDC published by previous work, although DRP was advocated in [Jin and Patel 2011].

(4) This work also considers several special cases of diversification (Section 8), to identify tractable cases and the impact of diversity and relevance requirements on the complexity of the diversification analyses. Moreover, we also study diversification in the presence of compatibility constraints, about which we are not aware of any prior work.

Recommender problems. Recommender systems (a.k.a. recommender engines and recommendation platforms) are to recommend information items or social elements that are likely to be of interest to users (see [Adomavicius and Tuzhilin 2005] for a survey). There has been a host of work on recommender systems [Deng et al. 2012; Amer-Yahia 2011; Lappas et al. 2009; Koutrika et al. 2009; Parameswaran et al. 2011; Xie et al. 2012], studying item and package recommendation. Given a query Q, a database D of items and a utility (scoring) function f(·) defined on items, item recommendation is to find top-k items from Q(D) ranked by f(·), for a given positive integer k. Package recommendation takes as additional input a set Σ of compatibility constraints, two functions cost(·) and val(·) defined on sets of items, and a bound C. It is to find top-k packages of items such that each package satisfies Σ, its cost does not exceed C, and its val is among the k highest. Here a package is a set of items that has a variable size.

There is an intimate connection between recommendation and diversification: both aim to recommend top-k (sets of) items from the result Q(D) of query Q in D. Moreover, diversification has been used in recommender systems to rectify the problem of retrieving too homogeneous results. However, there are subtle differences between them.

(1) Item recommendation is a single-criterion optimization problem based on a utility function f(·) defined on individual items. In contrast, query result diversification is a bi-criteria optimization problem based on a relevance function δrel(·, ·) and a distance function δdis(·, ·) defined on sets of items. In particular, the distance function δdis(U) assesses the diversity of elements in a set U, and is not expressible as a utility function.

(2) Package recommendation is to find top-k sets of items with variable sizes, which are ranked by val(·), subject to compatibility constraints Σ and aggregate constraints defined in terms of cost(·) and bound C, where cost(·) and val(·) are generic PTIME computable functions [Deng et al. 2012]. In contrast, query result diversification is to find a single set of k items, based on a particular objective function F(·). When F(·) is max-sum or max-min diversification, diversification can be viewed as a special case of package recommendation for finding a single set of a fixed size k, based on a particular F(·), and in the absence of aggregate constraints. As a consequence of the specific restrictions of F(·), the lower bounds developed for package recommendation do not carry over to its counterpart for diversification, and conversely, the upper bounds for diversification may not be tight for package recommendation. When F(·) is mono-objective, F(U) is not even expressible in the model of recommendation, since it assesses the diversity of elements in a set U with all tuples in Q(D), and is not in PTIME in |U|.

There has been work on the complexity of recommendation analyses [Amer-Yahia et al. 2013; Deng et al. 2012; Lappas et al. 2009; Koutrika et al. 2009; Parameswaran et al. 2011; Xie et al. 2012]. In addition to the different settings of recommendation and diversification remarked earlier, this work differs from the prior work in the following.

(3) Problems QRD and DRP studied in this paper have not been considered in the previous work for recommendation. That said, the results of this work on these problems may be of interest to the study of recommendation.

(4) Problem RDC considered here is similar to a counting problem studied in [Deng et al. 2012] for recommendation. However, given the different settings remarked earlier, RDC differs from that counting problem in both complexity bounds and proofs. Indeed, the counting problem for recommendation is #·coNP-complete when LQ is CQ, UCQ or ∃FO+ [Deng et al. 2012]. In contrast, as will be seen in Section 7, for the same query languages, (a) RDC is #·NP-complete when F(·) is FMS or FMM, while #·coNP = #·NP if and only if P = NP [Durand et al. 2005]; and (b) RDC is #·PSPACE-complete when F(·) is Fmono, substantially more intriguing than the problem studied in [Deng et al. 2012]. Furthermore, the proofs of this paper have to be tailored to the three objective functions, as opposed to the proofs of [Deng et al. 2012]. Indeed, the proofs for FMS and FMM are quite different from their counterparts for Fmono, as indicated by the different combined complexity bounds in these settings.

Compatibility constraints have been studied for package recommendation [Amer-Yahia et al. 2013; Koutrika et al. 2009; Lappas et al. 2009; Parameswaran et al. 2010; Parameswaran et al. 2011; Xie et al. 2012]. As remarked above, the prior results do not carry over to the diversification analysis in the presence of compatibility constraints.

Top-k query answering. Top-k query answering is to retrieve top-k tuples from query results, ranked by a scoring function. It typically assumes that the attributes of tuples are already sorted, and studies how to combine different ratings of the attributes for the same tuple based on a (monotonic) scoring function. A number of top-k query evaluation algorithms have been developed (e.g., [Fagin et al. 2003; Jin and Patel 2011; Li et al. 2005; Schnaitter and Polyzotis 2008]; see [Ilyas et al. 2008] for a survey), focusing on how to achieve early termination and reduce random access. This work differs from the prior work in the following. (a) A scoring function for top-k query answering is defined on individual items, as opposed to the distance function δdis(·) and the objective function F(·) defined on sets of items. (b) We focus on the complexity of diversification problems rather than the efficiency or optimization of query evaluation.

3. DIVERSIFICATION AND OBJECTIVE FUNCTIONS

We first present a model for query result diversification, by extending the model of [Gollapudi and Sharma 2009]. We then review the three objective functions proposed by [Gollapudi and Sharma 2009], which are used to define diversification.

3.1. Query Result Diversification

As remarked earlier, query result diversification aims to improve user satisfaction when computing answers to a query Q in a database D. We specify database D with a relational schema R = (R1, . . . , Rn), where each relation schema Ri is defined over a fixed set of attributes. We consider query Q expressed in a query language LQ.

Diversification. Given Q, D, a positive integer k and an objective function F(·), query result diversification aims to find a set U ⊆ Q(D) such that (a) |U| = k, and (b) F(U) is maximum, i.e., for all other sets U′ ⊆ Q(D), if |U′| = k then F(U) ≥ F(U′). Here F(·) is an objective function defined on sets of tuples of RQ, where RQ denotes the schema of query result Q(D), such that given any set U of tuples of RQ, F(U) returns a non-negative real number. In other words, F(·) is defined on subsets U ⊆ Q(D). We write F(·) as F when it is clear from the context.

Intuitively, query result diversification is to retrieve a set U of k answers to Q in D such that the tuples in U are as relevant as possible to Q and meanwhile, as diverse as possible. It extends the notion of result diversification given in [Gollapudi and Sharma 2009] by taking query Q and D as input, rather than assuming that Q(D) is already computed. The notion of [Gollapudi and Sharma 2009] is a special case of query result diversification, when Q is an identity query, i.e., when Q(D) = D is given as input.

Query result diversification is a bi-criteria optimization problem characterized by an objective function F, which is defined in terms of a relevance function δrel(·, ·) and a distance function δdis(·, ·); these functions are presented as follows.

Relevance functions and distance functions. A relevance function δrel(·, ·) is defined on tuples of schema RQ and queries in LQ. It specifies the relevance of a tuple t of RQ to a query Q ∈ LQ. More specifically, δrel(t, Q) is a non-negative real number such that the larger δrel(t, Q) is, the more relevant the answer t is to query Q.

A distance function δdis(·, ·) is a binary function defined on tuples of schema RQ. It specifies the diversity between two tuples t, s ∈ Q(D): δdis(t, s) is a non-negative real number such that the larger δdis(t, s) is, the more diverse (dissimilar) t and s are to each other. We assume that δdis(·, ·) is symmetric, i.e., δdis(t, s) = δdis(s, t) for all tuples t, s of RQ. Moreover, δdis(t, t) = 0, i.e., the distance between a tuple and itself is 0.

We simply assume that δrel(·, ·) and δdis(·, ·) are PTIME computable functions, as commonly found in practice, and focus on their generic properties. We also write δrel(·, ·) and δdis(·, ·) as δrel and δdis, respectively, if it is clear from the context.

Example 3.1. Recall the request of Peter for shopping a gift for Grace described in Example 1.1. The request can be expressed as a query Q0 in FO as follows:

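\[
Q_0(n) = \exists t, p, s \, \Bigl( \mathit{catalog}(n, t, p, s) \wedge p \le 30 \wedge p \ge 20 \wedge \forall n', b, r, g, a, x, e, y \; \neg \bigl( \mathit{history}(n', b, r, g, a, x, e, y) \wedge b = \mathit{id}_P \wedge r = \text{``Grace''} \wedge n = n' \bigr) \Bigr),
\]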

where idP denotes Peter’s buyer id. The query selects gifts in the price range [$20, $30] that Peter has not previously purchased for Grace.

As remarked in Example 1.1, for each gift t ∈ Q0(D0), the relevance δrel(t, Q0) of t to Q0 can be assessed in terms of the rating of t if t appears in the history relation. For instance, δrel(t, Q0) is high if t was presented as a gift for a girl of age [11, 14] by a relative for a holiday, and was rated high. If t is not in history, δrel(t, Q0) takes a default value.

For tuples t, s ∈ Q0(D0), δdis(t, s) can be defined in terms of the difference between their types, e.g., δdis(t, s) = 2 if t is in the “artsy” category and s is “educational”, and δdis(t, s) = 1 if t is of type “jewelry” and s is of type “fashion”. The types can be classified into various categories and brands, and δdis(t, s) is defined accordingly. □

3.2. Objective Functions

An objective function F is defined by means of a relevance function δrel and a distance function δdis. As in [Borodin et al. 2012], we focus in this work on the objective functions proposed by [Gollapudi and Sharma 2009].

Consider δrel and δdis, a parameter λ ∈ [0, 1] to balance relevance and diversity, a query Q, a database D and a positive integer k. Let U ⊆ Q(D) be a set of tuples with |U| = k. A minor revision of max-sum diversification of [Gollapudi and Sharma 2009] was given in [Vieira et al. 2011] by associating (1 − λ) with the relevance component, which allows us to study two extreme cases: diversity only (i.e., when λ = 1), and relevance only (i.e., when λ = 0). Along the same line as [Vieira et al. 2011], we consider minor variations of the max-min diversification and mono-objective functions of [Gollapudi and Sharma 2009]. We present the revised objective functions as follows.

Max-sum diversification. The first objective is to maximize the sum of the relevance and dissimilarity of the selected set U [Gollapudi and Sharma 2009; Vieira et al. 2011]:

\[
F_{\mathrm{MS}}(U) = (k-1)(1-\lambda) \cdot \sum_{t \in U} \delta_{\mathrm{rel}}(t, Q) + \lambda \cdot \sum_{t, t' \in U} \delta_{\mathrm{dis}}(t, t').
\]

Here FMS(U) measures both the relevance of the tuples in U to query Q, and the diversity among the k tuples in U. Following [Gollapudi and Sharma 2009], we balance the two components by scaling the relevance sum by k − 1, since the relevance sum ranges over k numbers while the diversity sum is over k(k − 1) numbers (note that the same effect may also be achieved by tailoring λ; we adopt k − 1 here to simplify the discussion).

As observed by [Gollapudi and Sharma 2009; Vieira et al. 2011], when the objective function is FMS, result diversification can be modeled as the Dispersion Problem studied in operations research [Prokopyev et al. 2009], when Q is an identity query.
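As an illustration, FMS can be evaluated directly from its definition once the tuples of U are at hand. The following Python sketch is not taken from the cited work; the function name f_ms and the callables rel (standing for δrel(·, Q)) and dis (standing for δdis) are assumptions made for the example. The diversity sum ranges over ordered pairs of distinct tuples, i.e., over k(k − 1) numbers as noted above; pairs (t, t) are omitted since δdis(t, t) = 0.

```python
from itertools import permutations

def f_ms(U, rel, dis, lam):
    """Max-sum objective F_MS(U) for a list U of k distinct tuples.

    rel(t) stands for delta_rel(t, Q); dis(t, s) stands for delta_dis(t, s).
    Both are hypothetical callables supplied by the caller.
    """
    k = len(U)
    relevance = sum(rel(t) for t in U)
    # Ordered pairs of distinct tuples: k(k-1) terms.
    diversity = sum(dis(t, s) for t, s in permutations(U, 2))
    return (k - 1) * (1 - lam) * relevance + lam * diversity

# Toy data (assumed): three tuples with relevance 1, 2, 3 and unit distances.
relevance = {"a": 1.0, "b": 2.0, "c": 3.0}
score = f_ms(["a", "b", "c"], relevance.get, lambda t, s: 1.0, lam=0.5)
```

With λ = 0.5 the relevance part contributes 2 · 0.5 · 6 = 6 and the diversity part 0.5 · 6 = 3 in this toy instance.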

Max-min diversification. The second objective is to maximize the minimum relevance and dissimilarity of the selected set [Gollapudi and Sharma 2009]:

\[
F_{\mathrm{MM}}(U) = (1-\lambda) \cdot \min_{t \in U} \delta_{\mathrm{rel}}(t, Q) + \lambda \cdot \min_{t, t' \in U,\, t \neq t'} \delta_{\mathrm{dis}}(t, t').
\]

Here FMM(U) is computed in terms of both the minimum relevance of the k tuples in U to query Q, and the minimum distance between any two tuples in U. In contrast to FMS(U), FMM(U) tends to penalize a set U that includes a single item irrelevant to Q, or that contains a pair of homogeneous items, even if all the rest are diverse. As shown in [Gollapudi and Sharma 2009], diversification by FMM can be expressed as the Maxmin Dispersion Problem [Prokopyev et al. 2009] if Q is an identity query.
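The penalizing effect of FMM can be seen in a small Python sketch of the definition. The names f_mm, rel and dis are assumptions for the example, and the toy distances are made up; since δdis is symmetric, unordered pairs of distinct tuples suffice for the minimum.

```python
from itertools import combinations

def f_mm(U, rel, dis, lam):
    """Max-min objective F_MM(U); assumes U has at least two distinct tuples.

    rel(t) stands for delta_rel(t, Q); dis(t, s) stands for delta_dis(t, s).
    """
    min_relevance = min(rel(t) for t in U)
    # delta_dis is symmetric, so each unordered pair is inspected once.
    min_distance = min(dis(t, s) for t, s in combinations(U, 2))
    return (1 - lam) * min_relevance + lam * min_distance

# Toy data (assumed): "a" and "c" are close to each other, which caps F_MM.
relevance = {"a": 1.0, "b": 2.0, "c": 3.0}
dist = {frozenset("ab"): 2.0, frozenset("ac"): 1.0, frozenset("bc"): 2.0}
score = f_mm(["a", "b", "c"], relevance.get,
             lambda t, s: dist[frozenset((t, s))], lam=0.5)
```

Here a single close pair ("a", "c") determines the diversity component, regardless of how far apart the other pairs are.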

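Mono-objective formulation. The last one is given as [Gollapudi and Sharma 2009]:

\[
F_{\mathrm{mono}}(U) = \sum_{t \in U} \Bigl( (1-\lambda) \cdot \delta_{\mathrm{rel}}(t, Q) + \frac{\lambda}{|Q(D)| - 1} \cdot \sum_{t' \in Q(D)} \delta_{\mathrm{dis}}(t, t') \Bigr).
\]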

As opposed to FMS(U) and FMM(U), which compute intra-list diversity, Fmono(U) measures the “global” diversity of each tuple t ∈ U by taking the mean of its distance to all tuples in the entire set Q(D), rather than its distances to the tuples in the selected set U [Gollapudi and Sharma 2009]. It computes the average dissimilarity of tuples in U by comparing them uniformly with all other results in Q(D) [Minack et al. 2009], to assess the novelty and coverage of the items in U. In this formulation, the distance function behaves similarly to the relevance function; hence it is named “mono”. In contrast with FMS and FMM, Fmono(U) does not reduce to facility dispersion.
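The dependence of Fmono on the whole of Q(D) is visible in the following Python sketch of the definition. It assumes that Q(D) has already been materialized as a list answers with more than one tuple; the names f_mono, answers, rel and dis are hypothetical.

```python
def f_mono(U, answers, rel, dis, lam):
    """Mono-objective F_mono(U).

    `answers` is the materialized query result Q(D), with |Q(D)| > 1;
    rel(t) stands for delta_rel(t, Q) and dis(t, s) for delta_dis(t, s).
    """
    n = len(answers)
    total = 0.0
    for t in U:
        # Mean distance from t to the other answers (delta_dis(t, t) = 0).
        global_diversity = sum(dis(t, s) for s in answers) / (n - 1)
        total += (1 - lam) * rel(t) + lam * global_diversity
    return total

# Toy data (assumed): answers are points on a line, distance is |t - s|.
answers = [0, 1, 2, 3]
score = f_mono([0, 3], answers, lambda t: 1.0, lambda t, s: abs(t - s), lam=0.5)
```

Each tuple of U contributes independently of the other members of U, which is why the distance term behaves like a second relevance score.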

Example 3.2. Consider the query Q0, database D0, and the relevance and distance functions δrel and δdis described in Example 3.1. Assume that k = 10. Then

(1) With FMS, query result diversification aims to find a set U1 of 10 gifts from the query result Q0(D0) such that the weighted sum of the relevance values of the selected gifts in U1 to Q0 and the dissimilarity values among the gifts in U1 is maximum.

(2) The objective function FMM is to find a set U2 of 10 gifts from Q0(D0) such that the weighted sum of the minimum relevance of the gifts in U2 to Q0 and the minimum distance between pairs of gifts in U2 is maximum.

(3) The objective function Fmono is to find a set U3 of 10 gifts from Q0(D0) such that the weighted sum of the relevance values of the gifts in U3 to Q0 and the mean of the distances between the selected gifts in U3 and all candidate gifts in the entire set Q0(D0) is maximized. In particular, here the diversity criterion is to assess the coverage of various gifts in the entire set Q0(D0) by the chosen set U3. □

Remarks. Observe the following.

(1) Objective functions FMS, FMM and Fmono are defined in terms of two criteria: relevance δrel and diversity δdis. The larger the parameter λ is, the more weight we place on the diversity of the results selected. When λ = 0, FMS, FMM and Fmono measure the relevance only. On the other hand, when λ = 1, these objective functions are defined in terms of δdis only and assess the diversity alone.

(2) For a given set U ⊆ Q(D), FMS(U) and FMM(U) are PTIME computable as long as δrel and δdis are PTIME computable. In contrast, when it comes to mono-objective, Fmono(U) may not be PTIME computable when Q and D are given as input but Q(D) is not assumed available, as commonly found in practice. Indeed, for each tuple t ∈ U, Fmono(U) has to compute δdis(t, t′) when t′ ranges over all tuples in Q(D).

4. REASONING ABOUT RESULT DIVERSIFICATION

In this section we first identify three problems in connection with query result diversification, for which the complexity will be provided in the next five sections. We then demonstrate possible applications of the complexity analyses of these problems.

4.1. Decision and Counting Problems

Consider a database D, a query Q in a language LQ, a positive integer k, and an objective function F defined with relevance and distance functions δrel and δdis.

The query result diversification problem. We start with a decision problem, referred to as the query result diversification problem and denoted by QRD(LQ, F). To formulate this problem, we need the following notations.

We call a set U ⊆ Q(D) a candidate set for (Q, D, k) if |U| = k. Given a real number B as a bound, we refer to a candidate set U as a valid set for (Q, D, k, F, B) if F(U) ≥ B. That is, the F value of U is large enough to meet the objective B.

Given these, QRD(LQ, F) is stated as follows.

QRD(LQ, F): The query result diversification problem.
INPUT: A database D, a query Q ∈ LQ, an objective function F, a real number B and a positive integer k ≥ 1.
QUESTION: Does there exist a valid set for (Q, D, k, F, B)?

Observe that QRD(LQ, F) is the decision version of the function problem for computing a top-ranked set U based on F, and is fundamental to understanding the complexity of query result diversification. As remarked earlier, we simply consider generic PTIME functions δrel and δdis when defining F.
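To make the notions of candidate set and valid set concrete, the following Python sketch decides QRD by exhaustive search. It is only an illustration under strong assumptions: Q(D) is taken to be already materialized as a list answers of distinct tuples, F is any callable on k-element tuples, and the search enumerates all candidate sets, so it takes time exponential in k. The names qrd_brute_force and min_gap are hypothetical.

```python
from itertools import combinations

def qrd_brute_force(answers, k, F, B):
    """Return True iff some candidate set U (|U| = k) is valid, i.e., F(U) >= B."""
    return any(F(U) >= B for U in combinations(answers, k))

def min_gap(U):
    """Toy objective: F_MM with lambda = 1 and delta_dis(t, s) = |t - s| on numbers."""
    return min(abs(t - s) for t, s in combinations(U, 2))

# Toy instance (assumed): four numeric answers, select k = 3 of them.
answers = [1, 4, 6, 9]
exists_3 = qrd_brute_force(answers, 3, min_gap, B=3)  # e.g. U = (1, 4, 9)
exists_4 = qrd_brute_force(answers, 3, min_gap, B=4)
```

In this toy instance a valid set exists for B = 3 but none exists for B = 4, since every 3-element subset contains two numbers at distance at most 3.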

The diversity ranking problem. In practice, given a candidate set U picked by users or produced by a system, we want to assess how well U meets a diversification objective and hence, satisfies the users’ need. This suggests that we study another decision problem, referred to as the diversity ranking problem and denoted by DRP(LQ, F), to assess the rank of a given candidate set based on F.

To state this problem, we use the following notion of ranks. Consider a candidate set U and a positive integer r. We say that the rank of U is r, denoted by rank(U) = r, if there exists a collection S of r − 1 distinct candidate sets for (Q, D, k) such that (a) for all S ∈ S, F(S) > F(U); and (b) for any candidate set S′ for (Q, D, k), if S′ ∉ S, then F(U) ≥ F(S′). That is, there exist exactly r − 1 candidate sets for (Q, D, k) that are ranked above U based on F. Obviously, the smaller rank(U) is, the higher U is ranked.

Assume a positive integer r that is a constant. We state DRP(LQ, F) as follows.

DRP(LQ, F): The diversity ranking problem.
INPUT: A database D, a query Q ∈ LQ, an objective function F, a positive integer k ≥ 1 and a candidate set U for (Q, D, k).
QUESTION: Does rank(U) ≤ r?

This problem was advocated in [Jin and Patel 2011], but its complexity was not settled before. The need for studying this problem is evident, as will be elaborated in Section 4.2.

In practice, rank r is below a threshold (e.g., top 20) and hence, is typically treated as a constant. One may also want to treat r as part of the input rather than a constant. We will elaborate its impact on the complexity bounds of DRP in Section 6.
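The definition of rank(U) amounts to counting the candidate sets that score strictly higher than U. The Python sketch below spells this out by exhaustive enumeration, again assuming that Q(D) is materialized as a list answers and that F is a callable on k-element tuples; the names rank, drp and min_gap are hypothetical, and the enumeration is exponential in k.

```python
from itertools import combinations

def rank(U, answers, k, F):
    """rank(U) = 1 + number of candidate sets S for (Q, D, k) with F(S) > F(U)."""
    target = F(U)
    return 1 + sum(1 for S in combinations(answers, k) if F(S) > target)

def drp(U, answers, k, F, r):
    """Decide whether rank(U) <= r."""
    return rank(U, answers, k, F) <= r

def min_gap(U):
    """Toy objective: F_MM with lambda = 1 and delta_dis(t, s) = |t - s| on numbers."""
    return min(abs(t - s) for t, s in combinations(U, 2))

# Toy instance (assumed): the candidate sets score 2, 3, 3, 2 respectively.
answers = [1, 4, 6, 9]
top = rank((1, 4, 9), answers, 3, min_gap)
low = rank((1, 4, 6), answers, 3, min_gap)
```

Note that ties do not lower the rank: both (1, 4, 9) and (1, 6, 9) have rank 1 here, while the two remaining candidate sets have rank 3.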

The result diversity counting problem. Given an objective B, one often wants to know how many valid sets are out there and hence, can be selected and recommended. This suggests that we study the counting problem below, referred to as the result diversity counting problem and denoted by RDC(LQ, F).

RDC(LQ, F): The result diversity counting problem.
INPUT: A database D, a query Q ∈ LQ, an objective function F, a real number B and a positive integer k ≥ 1.
QUESTION: How many valid sets are there for (Q, D, k, F, B)?

That is, RDC(LQ, F) is to count the number of candidate sets in D that satisfy the users’ request. An effective counting procedure is useful in practice (see Section 4.2).
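For comparison with the two decision problems, the counting problem can be sketched in Python in the same exhaustive style. As before, the sketch assumes a materialized Q(D) (the list answers) and a callable F; rdc_count and min_gap are hypothetical names, and the running time is exponential in k.

```python
from itertools import combinations

def rdc_count(answers, k, F, B):
    """Count the valid sets for (Q, D, k, F, B): candidate sets U with F(U) >= B."""
    return sum(1 for U in combinations(answers, k) if F(U) >= B)

def min_gap(U):
    """Toy objective: F_MM with lambda = 1 and delta_dis(t, s) = |t - s| on numbers."""
    return min(abs(t - s) for t, s in combinations(U, 2))

# Toy instance (assumed): the four candidate sets score 2, 3, 3, 2.
answers = [1, 4, 6, 9]
count_3 = rdc_count(answers, 3, min_gap, B=3)
count_2 = rdc_count(answers, 3, min_gap, B=2)
```

QRD asks whether this count is nonzero, which is why a counting procedure subsumes the decision procedure.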

Parameters of the problems. We study these problems for (a) objective functions F ranging over max-sum diversification FMS, max-min diversification FMM, and mono-objective formulation Fmono (Section 3), and for (b) query languages LQ ranging over the following (see, e.g., [Abiteboul et al. 1995] for details of these languages):

(1) conjunctive queries (CQ), built up from atomic formulas with constants and variables, i.e., relation atoms in database schema R and built-in predicates (=, ≠, <, ≤, >, ≥), by closing under conjunction ∧ and existential quantification ∃;

(2) union of conjunctive queries (UCQ) Q1 ∪ · · · ∪ Qr, where Qi is in CQ for i ∈ [1, r];

(3) positive existential FO queries (∃FO+), built from atomic formulas by closing under ∧, disjunction ∨ and ∃; and

(4) first-order logic queries (FO) built from atomic formulas using ∧, ∨, negation ¬, ∃ and universal quantification ∀.

That is, FO is relational algebra, CQ is the class of SPC queries supporting selection, projection and Cartesian product, UCQ is the class of SPCU queries, and ∃FO+ is the fragment of relational algebra with selection, projection, Cartesian product and union.

To the best of our knowledge, no prior work has studied the complexity of DRP and RDC. When it comes to QRD, only a special case was studied, when LQ consists of identity queries and F is FMS or FMM. No prior work has considered QRD when LQ is CQ or beyond, or when F is Fmono.

4.2. Applications of Diversification Analyses

Before we establish the complexity of these problems, we first demonstrate how their analyses can be used in practice. The complexity bounds of these problems are not only of theoretical interest, but may also help practitioners when developing diversification models and algorithms. As remarked in Section 2, query result diversification is needed for Web search, recommendation systems and advertising, among other things.

Guidance for what features should be supported in a system. When developing a system that supports query result diversification, one has to decide the following. What query language should be supported? What diversification function should be adopted? Would a relevance function alone suffice in the application so that one does not have to pay the cost introduced by distance functions? Would a fixed set of queries suffice for users to express their requests? Are compatibility constraints a must? The complexity study of diversification problems in different settings may help the system vendor strike a balance between the cost of diversification analyses and the expressive power needed. For instance, if users of the system are to use only a fixed set of FO queries, adopt a mono-objective function and need no compatibility constraints, then the diversification analyses are in PTIME (see data complexity in Table I, Section 10). In contrast, if the system allows users to issue arbitrary CQ queries, then one should be prepared for higher cost (PSPACE-complete, Table I). In the latter case, the developers of the system should go for heuristic algorithms for diversification analyses rather than for exact PTIME algorithms, as suggested by the lower bounds of the problems.

Assessing the stock of an e-commerce system. Suppose that a recommendation system maintains a collection D of items for sale. When a decision procedure for QRD constantly fails to find desired item sets upon frequent users’ requests, or a procedure for DRP finds that those items recommended by the system are often ranked low, then the manager of the system should consider adjusting the stock in D. As another example, when a procedure for RDC finds a large stock of popular items, the manager may consider advertising those items to particular user groups.

Assessing the choices of users. A user of an e-commerce system may employ a procedure for DRP to assess how good a set of items of her choice is, before she commits to the purchase. Moreover, she could use a procedure for RDC to find out the variety of choices that meet her need, and hence, choose one from them.

5. THE QUERY RESULT DIVERSIFICATION PROBLEM

We start with the decision problem QRD. We first establish its combined complexity in Section 5.1 and data complexity in Section 5.2. We will then identify and study its special cases in Section 8. The complexity bounds of QRD in various settings are depicted in Fig. 1, annotated with their corresponding theorems, where each arrow indicates how the complexity of QRD is reduced in different settings (similarly for Figures 3 and 4 to be given in Sections 6 and 7, respectively). Both combined complexity and data complexity are summarized, when LQ ranges over the query languages given in Section 4 (here CQ/FO denotes either CQ or FO), objective function F is FMS, FMM (shown in Fig. 1(a)) or Fmono (Fig. 1(b)), and when certain conditions are imposed in special settings (here λ = 0 means that F is defined with a relevance function only, constant k indicates the setting in which the number of items to be selected is a constant, and identity queries mean that LQ consists of identity queries only). They reveal the impacts of various factors on the complexity of QRD.

Fig. 1. The complexity bounds of QRD

(a) F is FMS or FMM
PSPACE-complete (Th. 5.1): FO, combined
NP-complete (Th. 5.1): CQ/∃FO+, combined
NP-complete (Th. 5.4): CQ/FO, data
PTIME (Th. 8.2): CQ/FO, λ = 0, data
PTIME (Cor. 8.4): CQ/FO, constant k, data

(b) F is Fmono
PSPACE-complete (Th. 5.2): CQ/FO, combined
PTIME (Th. 5.4): CQ/FO, data
PTIME (Cor. 8.1): CQ/FO, identity queries, combined
NP-complete (Th. 8.2): CQ/FO, λ = 0, combined
PTIME (Th. 8.2): CQ/FO, λ = 0, data

5.1. The Combined Complexity of QRD

Consider a naive algorithm for QRD: first compute Q(D), and then rank k-element sets U of Q(D) based on F(U); after these steps we simply pick the top-ranked set U and check whether F(U) ≥ B. One might think that the complexity of QRD would equal the higher complexity of the two steps. The result below tells us that this is the case when F is FMS or FMM. However, when F is Fmono, the story is quite different.

(1) When the objective is max-sum or max-min diversification, query language LQ dominates the combined complexity of QRD: it is NP-complete for CQ, UCQ and ∃FO+, but is PSPACE-complete for FO. That is, while the presence of disjunction in UCQ and ∃FO+ does not make it harder than CQ, negation in FO complicates the analysis.

(2) When F is Fmono, the problem becomes more intricate for CQ, UCQ and ∃FO+: it is already PSPACE-complete, the same as its complexity for FO. Note that the membership problem for CQ is NP-complete (see the statement of the problem shortly). Moreover, Fmono is in PTIME after Q(D) is computed (see Corollary 8.1 for details). In contrast, QRD(CQ, Fmono) is PSPACE-complete. Hence in this case, the complexity is inherent to query result diversification, and is not equal to the higher complexity of the two steps given above. This is because the mono-objective formulation requires aggregating distances between elements in U and all tuples in Q(D), and is more costly to compute than FMS and FMM, as remarked in Section 3.

Below we first study QRD(LQ, F ) when F is FMS or FMM.

THEOREM 5.1. The combined complexity of QRD(LQ, FMS) and QRD(LQ, FMM) is

- NP-complete when LQ is CQ, UCQ or ∃FO+; and
- PSPACE-complete when LQ is FO. □

To verify the lower bounds, we use reductions from the following problems.

(1) 3SAT: Given a formula ϕ = C1 ∧ . . . ∧ Cl in which each clause Ci is a disjunction of three variables or their negations taken from X = {x1, . . . , xm}, it is to decide whether ϕ is satisfiable. It is known that 3SAT is NP-complete (cf. [Papadimitriou 1994]).

(2) The membership problem for FO: Given an FO query Q, a database D and a tuple s, it is to decide whether s ∈ Q(D). This problem is PSPACE-complete [Vardi 1982].

Given these, we prove Theorem 5.1 as follows.

PROOF. We study QRD(LQ, FMS) and QRD(LQ, FMM) first for CQ, UCQ and ∃FO+, and then for FO.

(1) When LQ is CQ, UCQ or ∃FO+. It suffices to prove that QRD(CQ, FMS) and QRD(CQ, FMM) are NP-hard and that QRD(∃FO+, FMS) and QRD(∃FO+, FMM) are in NP.

(1.1) Lower bound. We show that QRD(CQ, FMS) and QRD(CQ, FMM) are NP-hard by reductions from the 3SAT problem, even when λ = 1 and Q is an identity query.

We first consider QRD(CQ, FMS). Given an instance ϕ = C1 ∧ . . . ∧ Cl of 3SAT over variables X = {x1, . . . , xm}, we define a database D, a fixed CQ query Q, functions δrel, δdis and FMS, and constants B and k. We show that ϕ is satisfiable if and only if there exists a valid set U for (Q, D, k, FMS, B). Assume w.l.o.g. that l > 1.

(1) The database D has a single relation IC specified by RC(cid, L1, V1, L2, V2, L3, V3) and populated as follows. For each i ∈ [1, l], let clause Ci be li1 ∨ li2 ∨ li3. For each truth assignment µi of the variables in the literals of Ci that makes Ci true, we add a tuple (i, xj, vj, xk, vk, xl, vl) to IC, where xj is the variable of literal li1, i.e., xj = li1 if li1 ∈ X and xj is the variable negated in li1 otherwise; and vj = µi(xj); similarly for xk, xl, vk and vl. Here IC encodes all satisfying truth assignments for the clauses separately. This does not give rise to an exponential blow-up, as each clause has only three variables: at most 8 tuples are included in IC for each Ci.

(2) We define the query Q as the identity query on RC instances.

(3) We define δrel to be a constant function that returns 1 for each tuple t of RQ. For each pair of distinct tuples t and s of RQ, we define δdis(t, s) = 1 in case that (a) t[cid] ≠ s[cid] (i.e., t and s correspond to distinct clauses of ϕ) and (b) t and s have the same value for each variable appearing in both t and s. For any other pair of tuples t′ and s′ of RQ, we define δdis(t′, s′) = 0. Furthermore, we use λ = 1. Then for each set U of tuples of RQ, we have that FMS(U) = ∑t,s∈U δdis(t, s).

(4) We use k = l, i.e., we consider only valid sets of l tuples, one for each clause of ϕ.

Finally, let B = l · (l − 1). Obviously, for any subset U of Q(D) with |U| = l, FMS(U) ≥ B if and only if every clause has at least one satisfying assignment which is encoded by one tuple in U, and all these assignments are consistent over the variables.

We verify that ϕ is satisfiable if and only if there is a valid set U for (Q,D, k, FMS, B).

⇒ Assume that ϕ is satisfiable. Then there exists a truth assignment µ0X of X variables such that every clause Cj of ϕ is satisfied by µ0X. Let U consist of l tuples of RQ, one for each clause, in which the values for the variables in X agree with µ0X. Then FMS(U) = l · (l − 1) ≥ B by the definition of FMS.

⇐ Conversely, assume that ϕ is not satisfiable. Suppose by contradiction that there exists a set U ⊆ Q(D) such that |U| = l and FMS(U) ≥ B. Then U consists of l tuples that correspond to l pairwise distinct clauses in ϕ and that all agree on the values of variables in X, by the definition of δdis. Let µX be the truth assignment of variables in X such that for each xi ∈ X, µX(xi) equals the value of xi in tuples in U. It is easy to see that µX satisfies all clauses in ϕ. This leads to a contradiction.

We next show that QRD(CQ, FMM) is NP-hard, also by reduction from 3SAT. Given an instance ϕ = C1 ∧ . . . ∧ Cl of 3SAT, we construct the same D, Q, δrel, δdis as given above, and set k = l. Furthermore, we let λ = 1, and hence FMM(U) = min_{t,s∈U, t≠s} δdis(t, s) for each set U of tuples of RQ. Finally, we let B = 1. It is easy to see that µ^0_X is a truth assignment of the X variables that makes ϕ true if and only if there exists a set U consisting of l tuples, one for each clause, in which the values for the variables agree with µ^0_X, such that U ⊆ Q(D), |U| = k and FMM(U) ≥ B.
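The two reductions from 3SAT can be checked by brute force on small formulas. The following is a minimal sketch, not the authors' construction verbatim: it assumes clauses are given as triples of signed integers, and it stores a tuple of IC as a pair (cid, assignment) rather than in the flat layout of schema RC. All function names are hypothetical.

```python
from itertools import combinations, product

def build_database(clauses):
    """I_C: one tuple per clause and per truth assignment satisfying that clause.

    A clause is a triple of nonzero ints: +j stands for x_j, -j for its negation.
    A tuple is stored as (cid, ((var, value), ...)), an assumed compact layout.
    """
    db = []
    for cid, clause in enumerate(clauses, start=1):
        variables = sorted({abs(lit) for lit in clause})
        for values in product((0, 1), repeat=len(variables)):
            mu = dict(zip(variables, values))
            if any(mu[abs(lit)] == int(lit > 0) for lit in clause):
                db.append((cid, tuple(mu.items())))
    return db

def delta_dis(t, s):
    """1 iff t and s encode distinct clauses and agree on every shared variable."""
    if t[0] == s[0]:
        return 0
    mu_t, mu_s = dict(t[1]), dict(s[1])
    shared = mu_t.keys() & mu_s.keys()
    return int(all(mu_t[x] == mu_s[x] for x in shared))

def f_ms(U):
    """F_MS with lambda = 1: sum of delta_dis over ordered pairs of U."""
    return sum(delta_dis(t, s) for t in U for s in U)

def f_mm(U):
    """F_MM with lambda = 1: minimum delta_dis over pairs of distinct tuples."""
    return min(delta_dis(t, s) for t in U for s in U if t != s)

def has_valid_set(clauses, F, B):
    """Exhaustive search for a k-element set U of Q(D) with F(U) >= B, k = l."""
    k = len(clauses)
    return any(F(U) >= B for U in combinations(build_database(clauses), k))

def satisfiable(clauses):
    """Reference truth-table check, used only to compare against the reduction."""
    n = max(abs(lit) for clause in clauses for lit in clause)
    return any(
        all(any(bits[abs(lit) - 1] == int(lit > 0) for lit in clause)
            for clause in clauses)
        for bits in product((0, 1), repeat=n)
    )
```

For a formula with l clauses, `has_valid_set(clauses, f_ms, l * (l - 1))` and `has_valid_set(clauses, f_mm, 1)` should agree with `satisfiable(clauses)`. The exhaustive search is exponential and is meant only to illustrate the equivalence on toy inputs.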


(1.2) Upper bound. We show that QRD(∃FO+, F) is in NP when F is FMS or FMM, by giving an NP algorithm. Given a query Q in ∃FO+, a database D, a positive integer k, an objective function F and a constant B, the algorithm checks whether there exists a set U valid for (Q, D, k, F, B), as follows:

1. guess k CQ queries from Q, and for each CQ query, guess a tableau from D (see, e.g., [Abiteboul et al. 1995] for tableaux); these tableaux yield a set U ⊆ Q(D);

2. check whether |U| = k and whether F(U) ≥ B; if so, return “yes”, and otherwise reject the guess and go back to step 1.

Clearly, step 2 is in PTIME since FMS(U) and FMM(U) are PTIME computable. Thus the algorithm is in NP, and so are QRD(∃FO+, FMS) and QRD(∃FO+, FMM).

(2) When LQ is FO. We next study QRD(FO, FMS) and QRD(FO, FMM).

(2.1) Lower bound. We first show that QRD(FO, FMS) is PSPACE-hard even when λ = 0 and k is a constant, by reduction from the membership problem for FO. Given an instance (Q, D, s) of the membership problem for FO, we define a database D′ = (D, I01), where I01 = {(1), (0)} is a unary relation specified by schema R01(X), encoding the Boolean domain. Moreover, we define a query Q′ in FO as follows:

Q′(x⃗, c) = Q(x⃗) ∧ R01(c).

Clearly, if s ∈ Q(D), Q′ must return two tuples (s, 1) and (s, 0). We define δrel((s, 1), Q′) = 1, and for any other tuple t of RQ′, δrel(t, Q′) = 0. Furthermore, we define δdis as a constant function that returns 0 for each pair of tuples of RQ′. We use λ = 0. Then FMS(U) = (k − 1) · ∑_{t∈U} δrel(t, Q′) for a set U of k tuples. Finally, we let k = 2 and B = 1.

We show that s ∈ Q(D) if and only if there exists a valid set for (Q′, D′, k, FMS, B). First assume that s ∈ Q(D). Then there is a valid set U = {(s, 1), (s, 0)} for (Q′, D′, k, FMS, B). Conversely, if s ∉ Q(D), then by the definition of Q′, (s, 1) is not in Q′(D′). Thus, by the definition of FMS, there is no set U ⊆ Q′(D′) such that |U| = 2 and FMS(U) ≥ B.
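The reduction for FMS can be illustrated as follows. This is a sketch under a simplifying assumption: the answer set Q(D) is handed over already materialized as a Python set, which hides exactly the part that is PSPACE-hard for FO. The function names are hypothetical.

```python
from itertools import combinations

def reduce_membership(q_of_d, s):
    """Return Q'(D') = Q(D) x {1, 0} and the relevance function of the reduction."""
    answers = [(t, c) for t in q_of_d for c in (1, 0)]

    def delta_rel(t):
        # Only the tuple (s, 1) is relevant; every other tuple scores 0.
        return 1 if t == (s, 1) else 0

    return answers, delta_rel

def f_ms(U, delta_rel):
    """F_MS with lambda = 0: only the term (k - 1) * sum of delta_rel remains."""
    return (len(U) - 1) * sum(delta_rel(t) for t in U)

def has_valid_set(q_of_d, s, k=2, B=1):
    """Exhaustive search for a k-element subset U of Q'(D') with F_MS(U) >= B."""
    answers, delta_rel = reduce_membership(q_of_d, s)
    return any(f_ms(U, delta_rel) >= B for U in combinations(answers, k))
```

With k = 2 and B = 1, `has_valid_set(q_of_d, s)` holds exactly when s is in `q_of_d`, mirroring the argument above.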

We next consider QRD(FO, FMM). Given an instance (Q, D, s) of the membership problem for FO, we define the same D′, Q′, δrel and δdis as given above. Furthermore, we set λ = 0, B = 1 and k = 1. Then FMM(U) = min_{t∈U} δrel(t, Q′). It is easy to see that s ∈ Q(D) if and only if there exists a set U valid for (Q′, D′, k, FMM, B), along the same lines as the argument given above for QRD(FO, FMS).

(2.2) Upper bound. We next provide a PSPACE algorithm for QRD(FO, F), when F is FMS or FMM. The algorithm works as follows:

1. guess a set U consisting of k distinct tuples of RQ;

2. check whether U ⊆ Q(D) and F(U) ≥ B; if so, return “yes”, and otherwise reject the guess and go back to step 1.

The algorithm is in NPSPACE since step 2 is in PSPACE. Indeed, checking whether U ⊆ Q(D) is in PSPACE when Q is an FO query, and checking F(U) ≥ B is in PTIME when F is FMS or FMM. Thus the algorithm is in NPSPACE = PSPACE. As a result, QRD(FO, FMS) and QRD(FO, FMM) are in PSPACE. □

We next establish the combined complexity of QRD(LQ, Fmono).

THEOREM 5.2. The combined complexity of QRD(LQ, Fmono) is PSPACE-complete, no matter whether LQ is CQ, UCQ, ∃FO+ or FO. □

In contrast to its counterparts for FMS and FMM, the lower bound proof for Fmono needs a counting argument and is more involved. It is verified by reduction from Q3SAT: given a sentence ϕ = P1x1 . . . Pmxmψ, where Pi is either ∀ or ∃ and ψ is an instance of 3SAT over variables in X = {x1, . . . , xm}, Q3SAT is to decide whether ϕ is true. It is known to be PSPACE-complete (cf. [Papadimitriou 1994]).

[Tree diagram: a binary tree with levels x1, x2, x3, x4, edges labeled 1 or 0, and leaves t1, . . . , t16.]

ϕ = ∃x1∀x2∃x3∀x4ψ, ψ = (x1 ∨ x2 ∨ ¬x3) ∧ (¬x2 ∨ ¬x3 ∨ x4).

l = 3, P4 = ∀:
δdis(t1, t2) = 0 ⇐ ψ[t1/x⃗] = 1, ψ[t2/x⃗] = 0,
δdis(t3, t4) = 1 ⇐ ψ[t3/x⃗] = 1, ψ[t4/x⃗] = 1,
δdis(t5, t6) = 1 ⇐ ψ[t5/x⃗] = 1, ψ[t6/x⃗] = 1,
δdis(t7, t8) = 1 ⇐ ψ[t7/x⃗] = 1, ψ[t8/x⃗] = 1,
δdis(t9, t10) = 0 ⇐ ψ[t9/x⃗] = 1, ψ[t10/x⃗] = 0,
δdis(t11, t12) = 1 ⇐ ψ[t11/x⃗] = 1, ψ[t12/x⃗] = 1,
δdis(t13, t14) = 0 ⇐ ψ[t13/x⃗] = 0, ψ[t14/x⃗] = 0,
δdis(t15, t16) = 1 ⇐ ψ[t15/x⃗] = 1, ψ[t16/x⃗] = 1.

l = 2, P3 = ∃:
δdis(ti, tj) = 1, i ∈ [1, 2], j ∈ [3, 4] ⇐ δdis(t1, t2) = 0, δdis(t3, t4) = 1,
δdis(ti, tj) = 1, i ∈ [5, 6], j ∈ [7, 8] ⇐ δdis(t5, t6) = 1, δdis(t7, t8) = 1,
δdis(ti, tj) = 1, i ∈ [9, 10], j ∈ [11, 12] ⇐ δdis(t9, t10) = 0, δdis(t11, t12) = 1,
δdis(ti, tj) = 1, i ∈ [13, 14], j ∈ [15, 16] ⇐ δdis(t13, t14) = 0, δdis(t15, t16) = 1.

l = 1, P2 = ∀:
δdis(ti, tj) = 1, i ∈ [1, 4], j ∈ [5, 8] ⇐ δdis(t1, t4) = 1, δdis(t5, t8) = 1,
δdis(ti, tj) = 1, i ∈ [9, 12], j ∈ [13, 16] ⇐ δdis(t9, t12) = 1, δdis(t13, t16) = 1.

l = 0, P1 = ∃:
δdis(ti, tj) = 1, i ∈ [1, 8], j ∈ [9, 16] ⇐ δdis(t1, t8) = 1, δdis(t9, t16) = 1.

Fig. 2. Example distance function δdis when m = 4 in the lower bound proof of Theorem 5.2

To give the reduction, we need some notations and a lemma. Consider an instance ϕ of Q3SAT. We want to use “Boolean” tuples t = (u1, . . . , um) to encode a truth assignment for variables in X of ϕ, where ui is either 0 or 1 for i ∈ [1, m]. Denote by t^l the prefix (u1, . . . , ul) of length l, for l ∈ [0, m − 1]. Consider a pair of tuples t = (u1, . . . , um) and s = (v1, . . . , vm) such that t^l = s^l but ul+1 ≠ vl+1, i.e., t and s agree on the first l attribute values but differ in the (l + 1)-th attribute. Note that t^l and s^l define a truth assignment of variables xi for i ∈ [1, l], referred to as the truth assignment encoded by t^l. In the reduction, we want to define a distance function δdis such that δdis(t, s) = 1 if and only if formula Pl+1xl+1 . . . Pmxmψ is satisfied by the truth assignment encoded by prefix t^l. More specifically, we define δdis inductively from l = m − 1 down to l = 0.


(i) When l = m − 1, i.e., t and s differ only in their last attribute, we define δdis(t, s) = 1 if either (a) Pm = ∀ and both the truth assignments encoded by t and s make ψ true, or (b) Pm = ∃, and there exists at least one truth assignment µX (encoded by either t or s) such that µX satisfies ψ. For any other tuples t′ and s′, we define δdis(t′, s′) = 0.

(ii) When m − 2 ≥ l ≥ 0, we define δdis(t, s) = 1 (a) when Pl+1 = ∀, and δdis((t^l, 1, 1, . . . , 1), (t^l, 1, 0, . . . , 0)) = 1 and δdis((s^l, 0, 1, . . . , 1), (s^l, 0, 0, . . . , 0)) = 1; or (b) when Pl+1 = ∃, and at least one of δdis((t^l, 1, 1, . . . , 1), (t^l, 1, 0, . . . , 0)) and δdis((s^l, 0, 1, . . . , 1), (s^l, 0, 0, . . . , 0)) is 1. For any other tuples t′ and s′, we define δdis(t′, s′) = 0.

An example δdis is depicted in Fig. 2, for a sentence ϕ with 4 variables.

Such distance functions have the following property. That is, the cases of t^{l+1} and s^{l+1} (branches of truth assignments t^l) given above suffice to ensure Lemma 5.3.

LEMMA 5.3. Consider any pair of Boolean tuples t = (u1, . . . , um) and s = (v1, . . . , vm) that encode truth assignments of ϕ = Pl+1xl+1 . . . Pmxmψ. If t^l = s^l and ul+1 ≠ vl+1 for some 0 ≤ l ≤ m − 1, then δdis(t, s) = 1 if and only if ϕ is true under µ^l_X, where µ^l_X is the truth assignment for variables x1, . . . , xl encoded by t^l. □

PROOF. We prove Lemma 5.3 by induction on l from l = m − 1 down to l = 0. When l = m − 1, by the definition of δdis, we have that δdis((t^{m−1}, 1), (t^{m−1}, 0)) = 1 if and only if (a) when Pm = ∀, the truth assignments encoded by (t^{m−1}, 1) and (t^{m−1}, 0) both satisfy ψ; and (b) when Pm = ∃, at least one of the truth assignments encoded by (t^{m−1}, 1) and (t^{m−1}, 0) makes ψ true. Hence the statement holds in the base case.

To give more intuition for the inductive step, we also illustrate the case when l = m − 2. For any two tuples t = (u1, . . . , um) and s = (v1, . . . , vm) such that t^{m−2} = s^{m−2} and um−1 ≠ vm−1, by the definition of δdis, we have that δdis(t, s) = 1 if and only if (a) when Pm−1 = ∀, δdis((t^{m−2}, 1, 1), (t^{m−2}, 1, 0)) = 1 and δdis((t^{m−2}, 0, 1), (t^{m−2}, 0, 0)) = 1; and (b) when Pm−1 = ∃, we have that either δdis((t^{m−2}, 1, 1), (t^{m−2}, 1, 0)) = 1 or δdis((t^{m−2}, 0, 1), (t^{m−2}, 0, 0)) = 1. Then case (a) holds if and only if (i) when Pm = ∀, the truth assignments encoded by (t^{m−2}, 1, 1) and (t^{m−2}, 1, 0) both make ψ true; similarly for (t^{m−2}, 0, 1) and (t^{m−2}, 0, 0); and (ii) when Pm = ∃, there exists at least one truth assignment (encoded by (t^{m−2}, 1, 1) or (t^{m−2}, 1, 0)) that satisfies ψ; similarly for (t^{m−2}, 0, 1) and (t^{m−2}, 0, 0). Clearly, case (a) (resp. case (b)) holds if and only if ∀xm−1Pmxm ψ (resp. ∃xm−1Pmxm ψ) is true under the truth assignment encoded by t^{m−2}.

Now we prove the inductive step. Assume that when l = p and p ∈ [1, m − 3], Lemma 5.3 holds. That is, for all tuples t = (u1, . . . , um) and s = (v1, . . . , vm) such that t^p = s^p but up+1 ≠ vp+1, δdis(t, s) = 1 if and only if Pp+1xp+1 . . . Pmxm ψ is true under the truth assignment encoded by t^p for variables x1, . . . , xp. We next consider the case when l = p − 1. Let t and s be an arbitrary pair of tuples that agree on the first p − 1 attributes but disagree on the p-th attribute. By the definition of δdis, we have that δdis(t, s) = 1 if and only if (c) when Pp = ∀, δdis((t^{p−1}, 1, 1, . . . , 1), (t^{p−1}, 1, 0, . . . , 0)) = 1 and δdis((t^{p−1}, 0, 1, . . . , 1), (t^{p−1}, 0, 0, . . . , 0)) = 1; and (d) when Pp = ∃, either δdis((t^{p−1}, 1, 1, . . . , 1), (t^{p−1}, 1, 0, . . . , 0)) = 1 or δdis((t^{p−1}, 0, 1, . . . , 1), (t^{p−1}, 0, 0, . . . , 0)) = 1.

By the induction hypothesis, for l = p, we know that case (c) holds if and only if Pp+1xp+1 . . . Pmxm ψ is true under the truth assignment of variables x1, . . . , xp−1 encoded by t^{p−1}, no matter whether xp is 1 or 0. The latter holds if and only if ∀xpPp+1xp+1 . . . Pmxm ψ is true under the truth assignment of x1, . . . , xp−1 encoded by t^{p−1}. Similarly, case (d) is satisfied if and only if ∃xpPp+1xp+1 . . . Pmxm ψ is true under the truth assignment of x1, . . . , xp−1 encoded by t^{p−1}. This proves Lemma 5.3. □
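The inductive definition of δdis and the statement of Lemma 5.3 can be checked mechanically on the sentence of Fig. 2. The sketch below makes several assumptions: quantifier prefixes are written as strings over 'A' (for ∀) and 'E' (for ∃), and ψ is taken to be (x1 ∨ x2 ∨ ¬x3) ∧ (¬x2 ∨ ¬x3 ∨ x4), a reading of the figure in which the negations are inferred from the listed truth values. All names are hypothetical.

```python
from itertools import product

QUANTS = "EAEA"  # prefix of the example sentence: exists x1, forall x2, exists x3, forall x4

def psi(x):
    """Assumed matrix of the example sentence; x is a 0/1 tuple (x1, x2, x3, x4)."""
    return bool((x[0] or x[1] or not x[2]) and (not x[1] or not x[2] or x[3]))

def delta_dis(t, s, quants, psi):
    """Distance defined inductively on the length l of the common prefix of t and s."""
    if t == s:
        return 0
    m = len(t)
    l = next(i for i in range(m) if t[i] != s[i])  # t and s agree on the first l attributes
    prefix = t[:l]
    if l == m - 1:
        # Case (i): t and s differ only in their last attribute.
        vals = [int(psi(prefix + (1,))), int(psi(prefix + (0,)))]
    else:
        # Case (ii): look up the two representative pairs one level further down.
        rest = m - l - 2
        vals = [
            delta_dis(prefix + (b, 1) + (1,) * rest,
                      prefix + (b, 0) + (0,) * rest, quants, psi)
            for b in (1, 0)
        ]
    return int(all(vals)) if quants[l] == "A" else int(any(vals))

def holds(prefix, quants, psi):
    """Truth of P_{l+1} x_{l+1} ... P_m x_m psi under the assignment given by prefix."""
    l = len(prefix)
    if l == len(quants):
        return psi(prefix)
    branches = [holds(prefix + (b,), quants, psi) for b in (1, 0)]
    return all(branches) if quants[l] == "A" else any(branches)
```

Enumerating `product((1, 0), repeat=4)` lists the tuples in the order t1, . . . , t16 of Fig. 2. Lemma 5.3 then says that `delta_dis(t, s, QUANTS, psi)` equals `int(holds(t[:l], QUANTS, psi))`, where l is the length of the common prefix of t and s.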

Leveraging Lemma 5.3, we next prove Theorem 5.2.


PROOF. It suffices to show that QRD(CQ, Fmono) is PSPACE-hard, even when λ = 1 and k is a constant, and that QRD(FO, Fmono) is in PSPACE.

(1) Lower bound. We show that QRD(CQ, Fmono) is PSPACE-hard by reduction from the Q3SAT problem. Given an instance ϕ = P1x1 . . . Pmxmψ of Q3SAT with variables in X = {x1, . . . , xm}, we construct a CQ query Q, a database D, functions δrel, δdis and Fmono, and let k = 1 and B = 1. We prove that ϕ is true if and only if there exists a valid set U for (Q, D, k, Fmono, B).

(1) Database D has a single unary relation I01 = {(1), (0)} specified by schema R01(X), encoding the Boolean domain as in the proof of Theorem 5.1 for the FO case.

(2) We define the CQ query Q as follows:

Q(x⃗) = R01(x1) ∧ . . . ∧ R01(xm),

where x⃗ = (x1, . . . , xm). Intuitively, query Q generates all truth assignments for variables in X. Note that |Q(D)| = 2^m.

(3) The function δrel is a constant function that returns 1 for each tuple of RQ. We use the distance function δdis given above, and set λ = 1.

We next verify that there exists a valid set U for (Q, D, k, Fmono, B) if and only if ϕ is true, by giving a counting argument.

⇒ Assume that there exists a valid set U = {t} for (Q, D, k = 1, Fmono, B = 1). Then by λ = 1, Fmono(U) = (1/(2^m − 1)) · ∑_{s∈Q(D)} δdis(t, s) ≥ 1. Thus ∑_{s∈Q(D)} δdis(t, s) ≥ 2^m − 1. Since |Q(D)| = 2^m, for each tuple s ∈ Q(D), if s ≠ t, δdis(t, s) must be 1 by the definition of δdis. In particular, there exists a tuple s′ such that s′ differs from t in the first attribute and δdis(t, s′) = 1. Then by Lemma 5.3, we have that ϕ = P1x1 . . . Pmxmψ is true.

⇐ Conversely, assume that ϕ is true. We show that there must exist a valid set U. By Lemma 5.3, for each pair of tuples t and s such that t^1 ≠ s^1, we have that δdis(t, s) = 1. Moreover, since ϕ is true, no matter what P1 is, there must exist a truth assignment µx1 for variable x1 such that P2x2 . . . Pmxmψ is true under µx1. Then again by Lemma 5.3, for each pair of tuples t′ = (u′1, . . . , u′m) and s′ = (v′1, . . . , v′m), δdis(t′, s′) = 1 if u′1 = v′1 = µx1 but u′2 ≠ v′2. Furthermore, since P2x2 . . . Pmxmψ is true under the truth assignment µx1 for variable x1, there must exist a truth assignment µx2 of variable x2 such that P3x3 . . . Pmxmψ is true under µx1 and µx2. Similarly, we get truth assignments µxl for variable xl, where l ∈ [3, m], such that for each pair of tuples t and s, δdis(t, s) = 1 if t^l = s^l = (µx1, µx2, . . . , µxl) and ul+1 ≠ vl+1, for l ∈ [0, m − 1]. Let t∗ = (µx1, . . . , µxm) and U = {t∗}. Then by the definition of Q, t∗ ∈ Q(D). Obviously, |U| = 1. To see that Fmono(U) = (1/(2^m − 1)) · ∑_{s∈Q(D)} δdis(t∗, s) ≥ B = 1, we compute the number of tuples s such that δdis(t∗, s) = 1. As discussed earlier, for each tuple s = (v1, . . . , vm), δdis(t∗, s) = 1 if there exists l ∈ [0, m − 1] such that (t∗)^l = s^l but µxl+1 ≠ vl+1. It is easy to see that for each such l there are 2^m/2^{l+1} = 2^{m−l−1} tuples s of this form. Thus the number of tuples s such that δdis(t∗, s) = 1 is ∑_{l=0}^{m−1} 2^{m−l−1} = 2^{m−1} + . . . + 2 + 1 = 2^m − 1. Recall that we set k = 1 and B = 1. Hence Fmono(U) = (1/(2^m − 1)) · ∑_{s∈Q(D)} δdis(t∗, s) = 1 ≥ B.
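The counting argument can be replayed by exhaustive search for small m. The sketch below is only an illustration: it uses the characterization of Lemma 5.3 directly as the definition of δdis, encodes quantifier prefixes as strings over 'A' and 'E', and accepts any Python predicate as ψ. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def holds(prefix, quants, psi):
    """Truth of P_{l+1} x_{l+1} ... P_m x_m psi under the assignment given by prefix."""
    l = len(prefix)
    if l == len(quants):
        return bool(psi(prefix))
    branches = [holds(prefix + (b,), quants, psi) for b in (1, 0)]
    return all(branches) if quants[l] == "A" else any(branches)

def delta_dis(t, s, quants, psi):
    """Lemma 5.3 used as a definition: only the common prefix of t and s matters."""
    if t == s:
        return 0
    l = next(i for i in range(len(t)) if t[i] != s[i])
    return int(holds(t[:l], quants, psi))

def f_mono(t, quants, psi):
    """F_mono({t}) with lambda = 1 and |Q(D)| = 2^m."""
    m = len(quants)
    total = sum(delta_dis(t, s, quants, psi) for s in product((0, 1), repeat=m))
    return Fraction(total, 2 ** m - 1)

def has_valid_set(quants, psi, B=1):
    """Is there a single tuple t (k = 1) with F_mono({t}) >= B?"""
    m = len(quants)
    return any(f_mono(t, quants, psi) >= B for t in product((0, 1), repeat=m))

def first_difference_counts(t):
    """counts[l] = number of tuples s whose first difference from t is at attribute l + 1."""
    m = len(t)
    counts = [0] * m
    for s in product((0, 1), repeat=m):
        if s != t:
            counts[next(i for i in range(m) if t[i] != s[i])] += 1
    return counts
```

For any tuple t of length m, `first_difference_counts(t)` returns the values 2^{m−l−1} for l = 0, . . . , m − 1, which sum to 2^m − 1; and `has_valid_set(quants, psi)` agrees with `holds((), quants, psi)`.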

(2) Upper bound. We show that QRD(FO, Fmono) is in PSPACE. Let MQ be the binary representation of the tuple t of RQ in which each attribute takes the largest constant from the active domain. Note that the size of t is bounded by arity(RQ) and adom(Q, D), where arity(RQ) denotes the arity of the schema RQ, and adom(Q, D) is the set of constants appearing in D or Q. We present an NPSPACE algorithm as follows:

1. guess a set U of k tuples of RQ;

2. if U ⊆ Q(D), continue; otherwise, reject the guess and go back to step 1;


3. let n denote the number of tuples in Q(D), and let variable dist keep track of the sum of distances between each pair of tuples t ∈ U and s ∈ Q(D), both represented in binary; initially, we set n = 0 and dist = 0;

4. for each tuple s ∈ [0, MQ] (encoded in binary with necessary delimiters) do:
   a. check whether s ∈ Q(D); if so, continue; otherwise, move on to the next tuple s in step 4;
   b. for each tuple t ∈ U, dist := dist + δdis(t, s);
   c. increase n by 1;

5. compute Fmono(U) = (1 − λ) · ∑_{t∈U} δrel(t, Q) + (λ/(n − 1)) · dist; check whether Fmono(U) ≥ B; if so, return “yes”.

Observe the following. (a) Step 2 is in PSPACE for FO queries. (b) Steps 4 and 5 can also be done in polynomial space when tuples t, counter n and sum dist are encoded in binary. Indeed, step 4(a) is in PSPACE for FO queries. Moreover, steps 4(b), 4(c) and 5 can be done in PSPACE because we consider tuples in binary coding. Hence the algorithm is in NPSPACE = PSPACE. Thus QRD(FO, Fmono) is in PSPACE. □
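The deterministic part of the algorithm (the enumeration of candidate tuples and the computation of Fmono(U) with a counter and a running sum) can be sketched as follows. The sketch assumes a membership test `in_answer` standing in for the check s ∈ Q(D), integer-valued distances, and the normalization λ/(|Q(D)| − 1) from the definition of Fmono. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def f_mono_streaming(U, adom, arity, in_answer, delta_rel, delta_dis, lam):
    """Compute F_mono(U) while holding only one candidate tuple at a time.

    adom      -- active domain (a set of constants)
    arity     -- arity of the answer schema
    in_answer -- assumed oracle deciding whether a tuple is in Q(D)
    """
    n, dist = 0, 0
    for s in product(sorted(adom), repeat=arity):   # candidates are generated lazily
        if not in_answer(s):                        # step 4(a)
            continue
        dist += sum(delta_dis(t, s) for t in U)     # step 4(b)
        n += 1                                      # step 4(c)
    relevance = sum(delta_rel(t) for t in U)
    diversity = Fraction(dist, n - 1) if n > 1 else Fraction(0)
    return (1 - lam) * relevance + lam * diversity  # step 5
```

Only the counter n, the sum dist and the current candidate s are kept between iterations, which is what keeps the space usage polynomial even when Q(D) itself is exponentially large.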

5.2. The Data Complexity of QRD

We next re-investigate QRD for data complexity, i.e., the complexity of evaluating a fixed query for variable database inputs: query Q is predefined and fixed, while database D may vary (see [Abiteboul et al. 1995] for details). Data complexity is also of practical interest since in many real-life applications, one often uses a fixed set of queries, e.g., Web forms, while the database may be frequently updated.

From the result below we can see that when the objective is given by FMS and FMM, fixing query Q does not reduce the complexity of QRD for CQ, UCQ and ∃FO+: the problem remains NP-complete, the same as its combined complexity. In contrast, fixed queries do simplify the analysis of the problem in the following cases:

(1) when LQ is FO and F is FMS or FMM, QRD(LQ, F) becomes NP-complete; and

(2) when F is Fmono, the problem becomes tractable.

Contrast these with the PSPACE-completeness of their combined complexity (Theorems 5.1 and 5.2). These demonstrate that when Q is fixed, objective functions determine the data complexity, while query languages have no impact.

THEOREM 5.4. The data complexity of QRD(LQ, FMS) and QRD(LQ, FMM) is NP-complete, while that of QRD(LQ, Fmono) is in PTIME, when LQ is CQ, UCQ, ∃FO+ or FO. □

PROOF. We first study the data complexity of QRD(LQ, FMS) and QRD(LQ, FMM) for CQ, UCQ, ∃FO+ and FO, and then investigate that of QRD(LQ, Fmono).

(1) When F is FMS or FMM. It suffices to prove that QRD(CQ, F) is NP-hard and that QRD(FO, F) is in NP. Recall that the lower bounds of QRD(CQ, FMS) and QRD(CQ, FMM) of Theorem 5.1 are shown by using a fixed identity query. Thus the lower bounds hold here, i.e., QRD(CQ, FMS) and QRD(CQ, FMM) are NP-hard. For the upper bound, note that the algorithm given in the proof of Theorem 5.1 for FO can carry over here. Clearly, its step 2 is in PTIME since Q(D) is PTIME computable for a fixed FO query Q, and F is also in PTIME when F is FMS or FMM. Thus the algorithm is in NP for fixed FO queries, and hence so is the data complexity of QRD(FO, FMS) and QRD(FO, FMM).

[Diagram of complexity bounds in two panels: (a) When F is FMS or FMM; (b) When F is Fmono. The boxes of the diagram read as follows.]
PSPACE-complete (Th. 6.1): FO, combined
coNP-complete (Th. 6.1): CQ/∃FO+, combined
coNP-complete (Th. 6.4): CQ/FO, data
PTIME (Th. 8.2): CQ/FO, λ=0, data
PTIME (Cor. 8.4): CQ/FO, constant k, data
PSPACE-complete (Th. 6.2): CQ/FO, combined
PTIME (Th. 6.4): CQ/FO, data
PTIME (Cor. 8.4): CQ/FO, identity queries, combined
coNP-complete (Th. 8.2): CQ/FO, λ=0, combined
PTIME (Th. 8.2): CQ/FO, λ=0, data

Fig. 3. The complexity bounds of DRP

(2) When F is Fmono. We give a PTIME algorithm to check whether there exists a set U valid for (Q, D, k, Fmono, B). Given a fixed Q, and D, Fmono, k and B, the algorithm first computes the entire set Q(D) of query answers. It then checks whether Q(D) contains at least k tuples, and if so, it picks k tuples from Q(D) with the largest “score” in terms of relevance and diversity. Finally, it checks whether the sum of the scores of these tuples is at least the bound B. It returns “yes” if so and “no” otherwise. When Q is fixed, all these can be done in PTIME. In contrast to FMS and FMM, Fmono allows us to compute the scores by inspecting each tuple in Q(D) individually, rather than by checking each k-element subset of Q(D). More specifically, the algorithm works as follows:

1. compute Q(D), in PTIME;

2. check whether |Q(D)| ≥ k; if so, continue; otherwise return “no”;

3. for each tuple t ∈ Q(D), compute v(t) = (1 − λ) · δrel(t, Q) + (λ/(|Q(D)| − 1)) · ∑_{s∈Q(D)} δdis(t, s), in PTIME;

4. let vsum be the sum of the k largest values in U, where U = {v(t) | t ∈ Q(D)}, and check whether vsum ≥ B; if so, return “yes”; otherwise, return “no”.

The algorithm is in PTIME. Indeed, since Q is fixed, Q(D) is PTIME computable. Thus steps 1-3 are all in PTIME. Hence so is the data complexity of QRD(FO, Fmono). □
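The four steps translate almost line by line into code. The following is a minimal sketch, assuming that Q(D) has already been computed and is passed in as a list, that δrel and δdis are ordinary Python functions with integer or rational values, and that λ is given as a Fraction. The function names are hypothetical; the brute-force variant is included only as a cross-check of the top-k selection.

```python
from fractions import Fraction
from itertools import combinations

def score(t, answers, lam, delta_rel, delta_dis):
    """v(t) = (1 - lam) * delta_rel(t) + lam / (|Q(D)| - 1) * sum of distances from t."""
    n = len(answers)
    spread = sum(delta_dis(t, s) for s in answers)
    diversity = Fraction(spread, n - 1) if n > 1 else Fraction(0)
    return (1 - lam) * delta_rel(t) + lam * diversity

def qrd_mono(answers, k, B, lam, delta_rel, delta_dis):
    """Steps 2-4: pick the k tuples of Q(D) with the largest scores, compare with B."""
    if len(answers) < k:
        return False
    scores = sorted(
        (score(t, answers, lam, delta_rel, delta_dis) for t in answers),
        reverse=True,
    )
    return sum(scores[:k]) >= B

def qrd_mono_bruteforce(answers, k, B, lam, delta_rel, delta_dis):
    """Exponential cross-check: inspect every k-element subset of Q(D)."""
    return any(
        sum(score(t, answers, lam, delta_rel, delta_dis) for t in U) >= B
        for U in combinations(answers, k)
    )
```

Because Fmono(U) is a sum of per-tuple scores that do not depend on the other members of U, taking the k largest scores is optimal, which is why the polynomial procedure and the exhaustive one return the same answer.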

6. THE DIVERSITY RANKING PROBLEM

We next study the decision problem DRP. We give its combined complexity in Section 6.1 and data complexity in Section 6.2, and will investigate its special cases in Section 8. Along the same lines as Fig. 1, we summarize the complexity bounds of DRP in various settings in Fig. 3, which depicts the impact of various parameters.

All the complexity bounds of this section remain intact when rank r is taken as input of DRP instead of a constant, except the PTIME bound of the data complexity for Fmono (Theorem 6.4; see details there).

6.1. The Combined Complexity of DRP

While DRP does not reduce to QRD (or vice versa), the two decision problems are connected. Given an instance D, Q, F, k and U ⊆ Q(D) of DRP, one can construct an instance D, Q, F, k and B = F(U) of QRD such that rank(U) ≤ r if and only if there exist no more than r − 1 subsets U′ of Q(D) with F(U′) > B. From this an algorithm for DRP immediately follows, by using a QRD oracle: first guess r subsets U′ of Q(D) with k elements each, and then check whether F(U′) > B by calling (a minor revision of) a procedure for QRD; return “no” if so. In light of the connection, it is not surprising that DRP and QRD behave similarly in their combined complexity analyses.
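The connection can be made concrete with an exhaustive sketch. It assumes that rank(U) is one plus the number of k-element subsets of Q(D) that score strictly higher than U, which is how the equivalence above reads; Q(D) is passed in as a list, and FMS with λ = 1 serves as the example objective. All names are hypothetical.

```python
from itertools import combinations

def f_ms(U, delta_dis):
    """F_MS with lambda = 1: sum of distances over ordered pairs of U."""
    return sum(delta_dis(t, s) for t in U for s in U)

def rank(U, answers, k, F):
    """1 + number of k-element subsets S of Q(D) with F(S) > F(U) (assumed definition)."""
    bound = F(U)
    better = sum(1 for S in combinations(answers, k) if F(S) > bound)
    return better + 1

def drp(U, answers, k, F, r):
    """Decide whether rank(U) <= r by counting the sets that beat U."""
    return rank(U, answers, k, F) <= r
```

For example, with answers [1, 2, 4, 8], k = 2 and the absolute difference as distance, the set (1, 8) is ranked first, while (4, 8) is beaten by (1, 8) and (2, 8) and is therefore ranked third.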

(1) When F is FMS or FMM, LQ dominates the combined complexity of DRP.

(2) When F is Fmono, LQ has no impact on the combined complexity of DRP.

We first study the combined complexity of DRP(LQ, FMS) and DRP(LQ, FMM).

THEOREM 6.1. The combined complexity of DRP(LQ, FMS) and DRP(LQ, FMM) is

- coNP-complete when LQ is CQ, UCQ or ∃FO+, and
- PSPACE-complete when LQ is FO. □

The proof is similar to the proof of Theorem 5.1 for QRD. In particular, the lower bounds are verified by reductions from the complements of 3SAT and the membership problem for FO. Note that, however, the connection between QRD and DRP is not strong enough for us to carry (the dual of) the complexity bounds of QRD to DRP: (1) DRP does not inherit the lower bound of QRD since QRD does not reduce to DRP, and (2) for the upper bound, the algorithm outlined above gives us Σ^p_2 by calling the QRD oracle, not coNP. Here Σ^p_2 is the class of languages recognized by a nondeterministic Turing machine with an NP oracle, i.e., NP^NP [Papadimitriou 1994].

PROOF. We first show that DRP(LQ, FMS) and DRP(LQ, FMM) are coNP-complete when LQ is CQ, UCQ or ∃FO+, and then prove that they are PSPACE-complete for FO.

(1) When LQ is CQ, UCQ or ∃FO+. It suffices to show that DRP(LQ, FMS) and DRP(LQ, FMM) are coNP-hard for CQ and that they are in coNP for ∃FO+.

(1.1) Lower bound. We show that DRP(CQ, FMS) and DRP(CQ, FMM) are already coNP-hard for fixed identity queries, λ = 1 and r = 1, by reductions from the complement of 3SAT. We first verify this for DRP(CQ, FMS). Given an instance ϕ of 3SAT, where ϕ = C1 ∧ . . . ∧ Cl is defined over variables in X = {x1, . . . , xm}, we define a database D, a CQ query Q, functions δrel, δdis and FMS, a set U and a positive integer k. We show that ϕ is not satisfiable if and only if rank(U) ≤ r.

To give the reduction, we construct a new formula ϕ′ such that ϕ′ is satisfiable if and only if ϕ is satisfiable. Let z be a fresh variable that is not in the set X of variables in ϕ. We define ϕ′ = (ϕ ∨ z) ∧ ¬z = ∧_{i=1}^{l} (Ci ∨ z) ∧ ¬z. It is easy to see that for a truth assignment µX of X variables, µX satisfies ϕ if and only if µX makes ϕ′ true with z = 0. Moreover, there always exists a truth assignment that makes ϕ′ false. Indeed, when setting z to be 1, ϕ′ is false under any truth assignment of X. We next give the reduction.

(1) The database D includes a single relation IC specified by the following schema: RC(cid, L1, V1, L2, V2, L3, V3, Z, VZ, A), populated as follows. Let C′i = l^i_1 ∨ l^i_2 ∨ l^i_3 ∨ z be the i-th clause of ϕ′, where i ∈ [1, l]. For any possible truth assignment µi of variables in the literals of C′i, we add a tuple (i, xk, vk, xl, vl, xm, vm, z, vz, a), where xk = l^i_1 if l^i_1 ∈ X, and xk = ¬l^i_1 if l^i_1 = ¬xk. We set vk = µi(xk); similarly for xl, xm, z and vl, vm and vz. Moreover, we set a = 1 if the truth assignment µi satisfies C′i; otherwise we set a = 0. Further, for ¬z, we add two tuples (l + 1, e1, f1, e2, f2, e3, f3, z, 1, 0) and (l + 1, e1, f1, e2, f2, e3, f3, z, 0, 1), where all ei and fi are distinct constants that are not in X ∪ {z, 0, 1}.

(2) We define Q as the identity query on RC instances.

(3) Let k = l + 1. We let the set U consist of l + 1 tuples from D, one for each clause in ϕ′, such that all variables in X and z are set to be 1. Obviously, U ⊆ Q(D).

(4) We define the relevance function δrel to be a constant function that returns 1 for each tuple t of RQ. Moreover, for each pair of tuples t and s of RQ, we define δdis(t, s) = 1 if (i) t[cid] ≠ s[cid], i.e., t and s encode two different clauses in ϕ′; (ii) t and s have the same value for each variable appearing in both t and s; and (iii) t[A] = s[A] = 1, i.e., t and s encode truth assignments that satisfy their corresponding clauses, respectively. Furthermore, for any other pair of tuples t′ and s′, we define δdis(t′, s′) = 0. Moreover, we let λ = 1. Then for each set S of tuples of RQ, FMS(S) = ∑_{t,s∈S} δdis(t, s).

We next show that rank(U) = 1 ≤ r if and only if ϕ is not satisfiable.

⇒ Assume that ϕ is satisfiable. Then there exists a truth assignment µ^0_X for the X variables that satisfies ϕ. We show that rank(U) ≥ 2 > r = 1. Let U0 consist of l + 1 tuples, one for each clause in ϕ′, such that the values of all variables in X agree with µ^0_X and z is set to be 0. Obviously, for any two different tuples t and s in U0, we have that δdis(t, s) = 1 by the definition of δdis. Then FMS(U0) = (l + 1) · l. Note that for each tuple t ∈ U, t[A] = 1 if t ≠ (l + 1, e1, f1, e2, f2, e3, f3, z, 1, 0), and moreover, for any two tuples t, s ∈ U such that t[A] = s[A] = 1, we have that δdis(t, s) = 1, by the definition of δdis. Then FMS(U) = l · (l − 1). Putting these together, we have that rank(U) ≥ 2 > r = 1.

⇐ Conversely, assume that ϕ is not satisfiable. Then there exists no truth assignment µX of the X variables that satisfies ϕ. It is easy to see that for each candidate set S for (Q, D, k), there exist at most l tuples t ∈ S such that t[A] = 1, and thus FMS(S) ≤ l · (l − 1) = FMS(U). Therefore, rank(U) = 1 ≤ r = 1.

We show that DRP(CQ, FMM) is also coNP-hard by reduction from the complement of 3SAT. Given an instance ϕ of 3SAT, we construct the same ϕ′, D, Q, k, r, δrel(·, ·), U and λ = 1 as above. We define a distance function δ′dis such that for any pair of distinct tuples t and s of RQ, (i) δ′dis(t, s) = 2 if δdis(t, s) = 1 and t, s ∉ U; (ii) δ′dis(t, s) = 1 if t, s ∈ U; and (iii) for any other t′ and s′, δ′dis(t′, s′) = 0. Then for each k-element set S of tuples of schema RQ, FMM(S) = min_{t,s∈S, t≠s} δ′dis(t, s). We show that this is indeed a reduction. By the definitions of δ′dis and FMM, for each candidate set S for (Q, D, k), (i) FMM(S) = 2 if S encodes a truth assignment of X that satisfies ϕ with z = 0; (ii) FMM(S) = 1 if S = U; and otherwise (iii) FMM(S) = 0. Then along the same lines as for DRP(CQ, FMS), one can verify that ϕ is not satisfiable if and only if rank(U) = 1 ≤ r = 1.

(1.2) Upper bound. We show that when F is FMS or FMM, DRP(∃FO+, F) is in coNP, by giving an NP algorithm for its complement. Given Q, D, F, U and k, the algorithm checks whether rank(U) > r, i.e., whether there exist r candidate sets for (Q, D, k) such that for each such set S, F(S) > F(U).

1. make r guesses such that each time, guess k CQ queries from Q, and for each CQ query, guess a tableau from D; these tableaux yield a subset of Q(D); collect the r subsets in 𝒮;

2. check whether 𝒮 contains r distinct sets and whether for each S ∈ 𝒮, |S| = k;

3. for each set S ∈ 𝒮, check whether F(S) > F(U); if so, return “no”.

The algorithm is in NP since step 2 is in PTIME; moreover, step 3 is also in PTIME since F is PTIME computable when F is FMS or FMM. Hence the problem is in coNP.

(2) When LQ is FO. We show that DRP is PSPACE-complete for FMS and FMM.

(2.1) Lower bound. We show that for FO, DRP is already PSPACE-hard when λ = 0 and r = 1, by reductions from the complement of the membership problem for FO.

We first consider DRP(FO, FMS). Given an instance (Q, D, s) of the membership problem for FO, we define a database D′ = (D, I01), where I01 = {(1), (0)} is an instance of schema R01(X) encoding the Boolean domain. We define a query Q′ as follows:

Q′(x⃗, z, c) = (Q(x⃗) ∨ (R01(z) ∧ z = 1)) ∧ R01(c).

Clearly, no matter whether s ∈ Q(D) or not, (s, 1, 1) and (s, 1, 0) are in Q′(D′). Moreover, when s ∈ Q(D), (s, 0, 1) and (s, 0, 0) must be in Q′(D′). We define δrel((s, 0, 1), Q′) = δrel((s, 0, 0), Q′) = 3, δrel((s, 1, 1), Q′) = δrel((s, 1, 0), Q′) = 2, and for any other tuple t of RQ′, δrel(t, Q′) = 1. Furthermore, we define δdis as a constant function that returns 0 for each pair of tuples of RQ′. We set λ = 0 and k = 2; hence, for each set S of k tuples of RQ′, FMS(S) = (k − 1) · ∑_{t∈S} δrel(t, Q′). Note that FMS({(s, 0, 1), (s, 0, 0)}) = 6, FMS({(s, 1, 1), (s, 1, 0)}) = 4, FMS({t, t′}) = 5 if t ∈ {(s, 0, 1), (s, 0, 0)} and t′ ∈ {(s, 1, 1), (s, 1, 0)}, and for any other pair of tuples t and t′, FMS({t, t′}) = 2. Finally, let r = 1 and U = {(s, 1, 1), (s, 1, 0)}. As remarked earlier, U ⊆ Q′(D′).


We show that s /∈ Q(D) if and only if rank(U) ≤ r. First assume that s /∈ Q(D). Then(s, 0, 1) and (s, 0, 0) are not in Q′(D′). Thus rank(U) = 1 ≤ r by the definition of FMS.Conversely, if s ∈ Q(D), (s, 0, 1) and (s, 0, 0) are both in Q′(D′), and rank(U) > r = 1.

We next show that DRP(FO, FMM) is PSPACE-hard, also by reduction from the complement of the membership problem for FO. Given an instance (Q, D, s) of that problem, we define the same Q′, D′, δrel, λ = 0 and r = 1 as above. Then for each set S of tuples of RQ′, FMM(S) = min_{t∈S} δrel(t, Q′). Let U = {(s, 1, 1)}. Then along the same line as above, one can verify that s ∉ Q(D) if and only if rank(U) = 1 ≤ r for (Q′, D′, k, FMM).

(2.2) Upper bound. We next present a PSPACE algorithm for the complement of DRP(FO, F) when F is FMS or FMM. The algorithm works as follows:

1. guess r distinct sets such that each such set S consists of k different tuples of RQ;
2. denote by 𝒮 the collection of the r sets; for each S ∈ 𝒮, check whether S ⊆ Q(D); if so, continue;
3. check whether for all S ∈ 𝒮, F(S) > F(U); if so, return “no”.

The algorithm is in NPSPACE = PSPACE since step 2 is in PSPACE for FO queries and step 3 is in PTIME since FMS and FMM are PTIME computable; hence so is the problem. □
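To make the rank test concrete, the following is a minimal deterministic sketch, not the nondeterministic PSPACE procedure itself: it enumerates all k-element subsets of an already computed Q(D), which takes exponential time in general. It assumes that rank(U) ≤ r holds exactly when fewer than r candidate sets score strictly above U, which is how the algorithm for the complement is phrased. The names `rank_at_most`, `f_ms_lambda0`, `gadget` and `delta_rel` are illustrative only, and the demo data mimics the FMS gadget of the lower bound proof (λ = 0, k = 2) with one extra hypothetical tuple.

```python
from itertools import combinations


def f_ms_lambda0(S, delta_rel):
    """F_MS with lambda = 0: (k - 1) times the sum of relevance over the set."""
    return (len(S) - 1) * sum(delta_rel(t) for t in S)


def rank_at_most(U, query_result, k, r, F):
    """True iff fewer than r k-element subsets of Q(D) score strictly above U."""
    target = F(frozenset(U))
    better = 0
    for S in combinations(sorted(query_result), k):
        if F(frozenset(S)) > target:
            better += 1
            if better >= r:          # r witnesses found: rank(U) > r
                return False
    return True


def gadget(s_in_result):
    """Hypothetical Q'(D'): the tuples of the proof plus one unrelated tuple."""
    tuples = {("s", 1, 1), ("s", 1, 0), ("other", 1, 1)}
    if s_in_result:                  # (s, 0, *) appear only when s is in Q(D)
        tuples |= {("s", 0, 1), ("s", 0, 0)}
    return tuples


def delta_rel(t):
    """Relevance values used in the lower bound proof."""
    if t[0] == "s" and t[1] == 0:
        return 3
    if t[0] == "s" and t[1] == 1:
        return 2
    return 1


F = lambda S: f_ms_lambda0(S, delta_rel)
U = {("s", 1, 1), ("s", 1, 0)}
print(rank_at_most(U, gadget(False), 2, 1, F))  # s not in Q(D)
print(rank_at_most(U, gadget(True), 2, 1, F))   # s in Q(D)
```

In this toy instance U is top-ranked exactly when the tuples (s, 0, 1) and (s, 0, 0) are absent, mirroring the equivalence "s ∉ Q(D) if and only if rank(U) ≤ r" used in the reduction.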

We next study the combined complexity of DRP(LQ, Fmono).

THEOREM 6.2. The combined complexity of DRP(LQ, Fmono) is PSPACE-complete when LQ is CQ, UCQ, ∃FO+ or FO. □

The proof is a variation of the one for Theorem 5.2. The lower bound is also verified by reduction from Q3SAT. Given an instance ϕ = P1x1 . . . Pmxm ψ of Q3SAT over variables X = {x1, . . . , xm}, we define a distance function δ∗dis similar to its counterpart given for QRD. More specifically, let t = (u1, . . . , um) and s = (v1, . . . , vm) be any pair of tuples encoding two truth assignments of X variables, such that t^l = s^l but u_{l+1} ≠ v_{l+1}, where t^l and s^l are the prefixes of t and s of length l, respectively. We want to use δ∗dis to ensure that δ∗dis(t, s) > 0 if and only if formula P_{l+1}x_{l+1} . . . Pmxm ψ is satisfied by the truth assignment encoded by prefix t^l, for variables x1, . . . , xl. To do so, denote by t̂ the m-arity tuple (1, . . . , 1). We define δ∗dis by revising δdis given in the proof of Theorem 5.2 for QRD(CQ, Fmono), such that

(i) δ∗dis(t̂, s) = (1/2) · δdis(t̂, s) for all tuples s = (1, v2, . . . , vm) with vi ∈ {0, 1} for i ∈ [2, m];

(ii) δ∗dis(t̂, s) = 2 · δdis(t̂, s) for all s = (0, v2, . . . , vm) with vi ∈ {0, 1} for i ∈ [2, m]; and

(iii) for any other pair of tuples t′ and s′ of RQ, δ∗dis(t′, s′) = δdis(t′, s′).

It is easy to see that for each pair t and s of m-arity tuples given above, δdis(t, s) = 1 if and only if δ∗dis(t, s) > 0. Along the same line as Lemma 5.3, one can show the following.

LEMMA 6.3. Consider any pair of tuples t = (u1, . . . , um) and s = (v1, . . . , vm) that encode truth assignments of ϕ = P1x1 . . . Pmxm ψ, where t^l = s^l and u_{l+1} ≠ v_{l+1} for some 0 ≤ l ≤ m − 1. Let µ^l_X be a truth assignment of variables x1, . . . , xl encoded by t^l. Then the following statements are equivalent: (1) P_{l+1}x_{l+1} . . . Pmxm ψ is satisfied by µ^l_X, (2) δ∗dis(t, s) > 0, and (3) there exist two tuples t′ = (u′1, . . . , u′m) and s′ = (v′1, . . . , v′m) for ϕ such that δ∗dis(t′, s′) > 0, where t′^l = s′^l = µ^l_X but u′_{l+1} ≠ v′_{l+1}. □

Capitalizing on Lemma 6.3, we next prove Theorem 6.2.

PROOF. We show that DRP(CQ, Fmono) is PSPACE-hard and that DRP(FO, Fmono) is in PSPACE. The lower bound holds even when λ = 1 and k is a constant.


(1) Lower bound. We show the lower bound by reduction from Q3SAT. Given an instance ϕ = P1x1 . . . Pmxm ψ of Q3SAT, we construct the same query Q, database D and relevance function δrel as their counterparts for QRD(CQ, Fmono), and let k = 1 and r = 1. Furthermore, let U = {t̂}, where t̂ is the m-arity tuple (1, . . . , 1) in Q(D) given above. Finally, we set λ = 1. Then for each set S of tuples of RQ, Fmono(S) = (1/(2^m − 1)) · Σ_{t∈S, s∈Q(D)} δ∗dis(t, s), where δ∗dis is defined as above.

We show that ϕ is true if and only if rank(U) = 1 ≤ r = 1.

⇒ Assume first that ϕ is true, and let t̂ denote the tuple (1, . . . , 1). Then by Lemma 6.3 we have that δ∗dis(t̂, s) = 2 for each tuple s = (0, v2, . . . , vm), where vi ∈ {0, 1} for i ∈ [2, m]. Note that there exist 2^{m−1} such tuples s in total. Thus Σ_{s∈Q(D)} δ∗dis(t̂, s) ≥ 2 · 2^{m−1} = 2^m. We next prove that for any other tuple t that is distinct from t̂, we have that Σ_{s∈Q(D)} δ∗dis(t, s) ≤ 2^m, and thus rank(U) = 1 by the definition of Fmono. Indeed, for any tuple t ≠ t̂, there exist at most 2^m − 1 tuples s ≠ t such that δ∗dis(t, s) > 0 (recall that |Q(D)| = 2^m).

Consider the following two cases. For any tuple t = (u1, . . . , um) with t ≠ t̂, (a) if u1 = 0, then by the definition of δ∗dis, Σ_{s∈Q(D)} δ∗dis(t, s) = Σ_{s∈Q(D)\{t̂}} δ∗dis(t, s) + δ∗dis(t, t̂) ≤ (2^m − 2) + 2 = 2^m; and (b) otherwise, Σ_{s∈Q(D)} δ∗dis(t, s) = Σ_{s∈Q(D)\{t̂}} δ∗dis(t, s) + δ∗dis(t, t̂) ≤ (2^m − 2) + 1/2 = 2^m − 3/2 < 2^m. As a result, we have that rank(U) = 1 ≤ r = 1.

⇐ Conversely, assume that ϕ is false, and let t̂ denote the tuple (1, . . . , 1). Let l0 be the minimum value in [0, m − 1] such that there exists a tuple s = (v1, . . . , vm) with δdis(t̂, s) = 1, where s^{l0} = (1, . . . , 1) and v_{l0+1} ≠ 1. Then by Lemma 6.3, l0 is also the minimum value in [0, m − 1] such that δdis(t′, s′) = 1 for any pair of tuples t′ = (u′1, . . . , u′m) and s′ = (v′1, . . . , v′m), where t′^{l0} = s′^{l0} = (1, . . . , 1) but u′_{l0+1} ≠ v′_{l0+1}. Note that there exist at most 2^{m−l0} − 1 tuples s such that δ∗dis(t̂, s) > 0, where s^{l0} = (1, . . . , 1) and v_{l0+1} ≠ 1 (recall that there are at most 2^{m−l0} tuples s such that s^{l0} = (1, . . . , 1) and v_{l0+1} ≠ 1). Then we have that Σ_{s∈Q(D)} δ∗dis(t̂, s) ≤ (1/2) · (2^{m−l0} − 1) = 2^{m−l0−1} − 1/2. We next show that there exists a tuple t∗ that is distinct from t̂ such that Σ_{s∈Q(D)} δ∗dis(t∗, s) ≥ 2^{m−l0−1}, and hence rank(U) > r = 1. Indeed, let t∗ = (u∗1, . . . , u∗m) be a tuple such that (t∗)^{l0−1} = (1, . . . , 1) but u∗_{l0} ≠ 1. Then there exist at least 2^{m−l0−1} tuples s = (v1, . . . , vm) such that δdis(t∗, s) > 0, where (t∗)^{l0} = s^{l0} but u∗_{l0+1} ≠ v_{l0+1}, and moreover, for each such tuple s, δ∗dis(t∗, s) = 1 by the definition of δ∗dis. Thus we have that Σ_{s∈Q(D)} δ∗dis(t∗, s) ≥ 2^{m−l0−1}. Hence by the definition of Fmono, Fmono(U) < Fmono(U′) for each set U′ consisting of a single tuple t∗ given above. Hence rank(U) > r = 1.

(2) Upper bound. We show that DRP(FO, Fmono) is in PSPACE. Recall the algorithm given earlier for DRP(FO, F) for FMS and FMM. Obviously, the algorithm also works here. We show that it is in PSPACE. Indeed, since Fmono is PSPACE computable (as argued in the proof of Theorem 5.2), step 3 of that algorithm is in PSPACE here; moreover, step 2 is in PSPACE. Thus the algorithm is in NPSPACE = PSPACE. □

6.2. The Data Complexity of DRP

One might be tempted to think that fixing Q would make DRP(LQ, F) easier. Nevertheless, DRP(LQ, F) becomes simpler only (1) when F is Fmono, or (2) when LQ is FO, and F is FMS or FMM. Its data complexity remains the same as its combined complexity when LQ is CQ, UCQ or ∃FO+, for FMS and FMM (see Theorem 6.4). This is consistent with the data complexity analysis of QRD(LQ, F) (Theorem 5.4).


THEOREM 6.4. For CQ, UCQ, ∃FO+ and FO, the data complexity of DRP(LQ, FMS) and DRP(LQ, FMM) is coNP-complete, while that of DRP(LQ, Fmono) is in PTIME. □

As remarked earlier, all the results of this section remain valid even when r is treated as part of the input for DRP rather than a constant, except the data complexity of DRP for Fmono. Indeed, if r is not a constant and if the numeric value is encoded in binary instead of unary, the PTIME algorithm to be presented below for the Fmono case becomes pseudo-polynomial time rather than PTIME. The other proofs of Theorem 6.4 are quite similar to their counterparts for Theorem 5.4.

PROOF. We first investigate the data complexity of DRP when F is FMS or FMM. We then study it when F is Fmono.

(1) When F is FMS or FMM. It suffices to prove that DRP(CQ, FMS) and DRP(CQ, FMM) are coNP-hard and that DRP(FO, FMS) and DRP(FO, FMM) are in coNP for fixed queries. Recall that the lower bounds of DRP(CQ, FMS) and DRP(CQ, FMM) for Theorem 6.1 are established by using a fixed identity query. Thus the lower bounds hold here. Hence DRP(CQ, FMS) and DRP(CQ, FMM) are both coNP-hard. For the upper bound, the algorithm given in the proof of Theorem 6.1 for FO and for FMS and FMM works here, which is to check whether rank(U) > r. We show that the algorithm is in NP. Indeed, Q(D) is PTIME computable when Q is a fixed FO query. Hence its step 2 is in PTIME. Moreover, step 3 is in PTIME for FMS and FMM since F is PTIME computable. Thus the algorithm is in NP, and as a result, DRP(FO, FMS) and DRP(FO, FMM) are in coNP.

(2) When F is Fmono. We show that DRP(FO, Fmono) is in PTIME by giving a PTIME algorithm. Given Q, D, k, r, δrel, δdis, Fmono, and a set U, the algorithm returns “yes” if rank(U) ≤ r, and returns “no” otherwise.

Intuitively, the algorithm first finds a collection 𝒮 of top-r candidate sets for (Q, D, k) based on Fmono. Let 𝒮 = {S1, . . . , Sr}, where Fmono(S1) ≥ . . . ≥ Fmono(Sr). Then we simply need to check whether Fmono(U) < Fmono(Sr); the algorithm returns “no” if so, and “yes” otherwise. To see this, observe that for each candidate set V ∉ 𝒮, there exists no set S ∈ 𝒮 such that Fmono(V) > Fmono(S). After 𝒮 is found, consider the following cases. (a) If U ∈ 𝒮, then there exist at most r − 1 candidate sets V such that Fmono(V) > Fmono(U), and thus rank(U) ≤ r. (b) If U ∉ 𝒮 but Fmono(U) = Fmono(Sr), then similarly, rank(U) ≤ r. (c) If U ∉ 𝒮 and Fmono(U) < Fmono(Sr), then rank(U) > r since there exist at least r candidate sets V such that Fmono(V) > Fmono(U). Thus the algorithm checks which condition of (a), (b) or (c) is satisfied by 𝒮 and U. It returns “yes” in both cases (a) and (b), and returns “no” in case (c).

We next show how to compute collection 𝒮. We construct 𝒮 by adding sets S1, . . . , Sr to 𝒮, one set or multiple sets at a time. Let FindNext(Q, D, Fmono, 𝒮, 𝒮′, k, l, l′) be a procedure that, given Q, D, Fmono, k, l and a collection 𝒮 that consists of top-l candidate sets for (Q, D, k), returns a collection 𝒮′ such that 𝒮 ⊂ 𝒮′ and 𝒮′ is a collection of top-l′ candidate sets for (Q, D, k), where 1 ≤ l < l′ ≤ r, by finding one or more candidate sets S ∉ 𝒮. Procedure FindNext(Q, D, Fmono, 𝒮, 𝒮′, k, l, l′) works as follows. Let 𝒮 = {S1, . . . , Sl} (l > 0). For each tuple t ∈ Q(D), let v(t) = (1 − λ) · δrel(t, Q) + (λ/(|Q(D)| − 1)) · Σ_{t′∈Q(D)} δdis(t, t′). Then it carries out the following steps.

1. For each Si ∈ 𝒮, find sets V by replacing one tuple t in Si with another tuple s in Q(D) \ Si, where v(s) ≤ v(t), such that sets V have the highest Fmono values among all such new sets. More specifically, do the following:
   a. For each t ∈ Si and s ∈ Q(D) such that s ∉ Si and v(s) ≤ v(t), we get the candidate set Si(s, t) by replacing t in Si with s. Obviously, Fmono(Si(s, t)) ≤ Fmono(Si). Denote by Ei(t) the collection of such sets Si(s, t) with the highest Fmono value such that Si(s, t) ∉ 𝒮. Note that |Ei(t)| ≥ 0. When all tuples t in Si are processed, we get k sets Ei(t), one for each t ∈ Si. Let Ei be the collection of candidate sets in ⋃_{t∈Si} Ei(t) with the highest Fmono value. Note that for each set V ∈ Ei, we have that Fmono(V) ≤ Fmono(Si) and V ∉ 𝒮.
   b. Let E be the collection of sets in E1 ∪ . . . ∪ El with the highest Fmono value. We will show that 𝒮 ∪ E must consist of the top-l′ candidate sets. Note that |𝒮 ∪ E| may be greater than r while |𝒮| < r.
   c. Check whether |𝒮 ∪ E| > r. If so, we get 𝒮′ by adding only r − |𝒮| sets from E to 𝒮, picked randomly as they have the same F value; otherwise, let 𝒮′ be 𝒮 ∪ E.
2. Return 𝒮′.

Capitalizing on FindNext(), the algorithm for DRP with fixed FO queries is given as follows, which returns “yes” if rank(U) ≤ r, and returns “no” otherwise:

1. compute Q(D); for each t ∈ Q(D), compute v(t) = (1 − λ) · δrel(t, Q) + (λ/(|Q(D)| − 1)) · Σ_{t′∈Q(D)} δdis(t, t′); sort tuples in Q(D) in descending order based on their v() values;
2. let 𝒮 be the collection consisting of top-l candidate sets for (Q, D, k); initially, 𝒮 contains only the set S1 that consists of the first k tuples in the sorted Q(D); obviously, S1 is the top-1 candidate set; moreover, let l = 1;
3. while |𝒮| < r, do the following:
   a. let 𝒮′ = FindNext(Q, D, Fmono, 𝒮, 𝒮′, k, l, l′);
   b. let 𝒮 be 𝒮′;
4. when |𝒮| = r, check whether condition (a) or (b) given above is satisfied by 𝒮 and U; if so, return “yes”; otherwise return “no”.
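The steps above can be sketched in a few lines of Python. This is an illustrative reading of the procedure rather than a reference implementation: it assumes that Fmono(S) is the sum of v(t) over t ∈ S (which agrees with the λ = 1 instances used in the lower bound proofs), that Q(D) has already been computed, and that ties in E are broken by a fixed order instead of randomly. The function names (`scores`, `f_mono`, `find_next`, `rank_at_most_r`) are hypothetical, and the extra early exit for the case where fewer than r candidate sets exist is an addition not discussed in the text.

```python
from fractions import Fraction


def scores(tuples, delta_rel, delta_dis, lam):
    """v(t) = (1 - lam)*delta_rel(t) + lam/(|Q(D)| - 1) * sum of delta_dis(t, t').

    Assumes |Q(D)| > 1.
    """
    n = len(tuples)
    return {
        t: (1 - lam) * delta_rel(t)
        + Fraction(lam) / (n - 1) * sum(delta_dis(t, u) for u in tuples)
        for t in tuples
    }


def f_mono(S, v):
    """Assumed form of F_mono: the sum of the per-tuple scores v(t)."""
    return sum(v[t] for t in S)


def find_next(collection, v, r):
    """One FindNext call: add the best one-tuple replacements not yet collected."""
    seen = set(collection)
    best_val, best_sets = None, set()
    for S in collection:
        for t in S:
            for s in v:
                if s in S or v[s] > v[t]:
                    continue                     # only swaps with v(s) <= v(t)
                cand = (S - {t}) | {s}
                if cand in seen:
                    continue
                val = f_mono(cand, v)
                if best_val is None or val > best_val:
                    best_val, best_sets = val, {cand}
                elif val == best_val:
                    best_sets.add(cand)
    room = r - len(collection)                   # step 1c: never exceed r sets
    extra = sorted(best_sets, key=sorted)[:room]  # deterministic tie-breaking
    return collection + extra


def rank_at_most_r(U, v, k, r):
    """Return True iff rank(U) <= r under the assumptions stated above."""
    ordered = sorted(v, key=lambda t: v[t], reverse=True)
    collection = [frozenset(ordered[:k])]        # S1: the k best tuples
    while len(collection) < r:
        extended = find_next(collection, v, r)
        if len(extended) == len(collection):     # fewer than r candidate sets
            return True
        collection = extended
    worst_top = min(f_mono(S, v) for S in collection)
    return f_mono(U, v) >= worst_top             # cases (a) and (b) vs. (c)


v = {"a": 5, "b": 4, "c": 3, "d": 1}
print(rank_at_most_r(frozenset("bc"), v, k=2, r=3))
print(rank_at_most_r(frozenset("bc"), v, k=2, r=2))
```

With these scores the 2-element sets rank as {a, b}, {a, c}, {b, c}, {a, d}, {b, d}, {c, d}, so {b, c} passes the test for r = 3 and fails it for r = 2.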

We show that the algorithm is correct and is in PTIME. Obviously it is correct if and only if FindNext(Q, D, Fmono, 𝒮, 𝒮′, k, l, l′) finds top-l′ candidate sets. Assume w.l.o.g. that |Q(D)| > k. We show that FindNext finds top-l′ candidate sets by induction on l such that when FindNext finds top-l′ candidate sets 𝒮′ based on the top-l candidate sets 𝒮 (where 1 < l < l′), then FindNext can find the top-l′′ candidate sets 𝒮′′ based on 𝒮′ for some l′′ > l′. Obviously, we need only to consider the case when l′′ ≤ r, i.e., 𝒮′′ contains all the new candidate sets with the highest Fmono value found by FindNext.

When l = 1, 𝒮 contains only the set S1 that consists of the first k tuples in the sorted Q(D). Based on 𝒮, FindNext finds all sets V by replacing tuples t ∈ S1 one at a time with a tuple s ∈ Q(D) \ S1, where v(s) ≤ v(t). Denote by E the collection consisting of all such sets V that have the highest Fmono value. Let 𝒮′ = 𝒮 ∪ E. We show that for any candidate set V′ ∉ 𝒮′, Fmono(V′) ≤ Fmono(S) for each set S ∈ 𝒮′. Note that Fmono(V′) ≤ Fmono(S1) since S1 is the top-1 candidate set. Now we only need to prove that Fmono(V′) ≤ Fmono(S) for each set S ∈ E. Consider the following two cases. (a) There exist tuples t ∈ S1 and s ∈ Q(D) \ S1 such that v(s) ≤ v(t) and V′ is obtained by replacing t in S1 with s. Recall that V′ ∉ E since V′ ∉ 𝒮′. Thus we have that Fmono(V′) ≤ Fmono(S) for each set S ∈ E, since all sets in E have the highest Fmono value among the sets obtained by a one-tuple replacement. (b) The set V′ is not one given in (a) above. Then V′ must contain no more than k − 2 tuples in S1. Assume that V′ is obtained by replacing tuples t1, . . . , tj in S1 with s1, . . . , sj, where j ∈ [2, k], such that s1, . . . , sj are not in S1 and for each i ∈ [1, j], v(si) ≤ v(ti). Let V′′ be the set obtained by replacing only one tuple t1 in S1 with s1. Obviously, Fmono(V′) ≤ Fmono(V′′). Moreover, we have that Fmono(V′′) ≤ Fmono(S) for each set S ∈ E, since all the sets in E have the highest Fmono value. Thus Fmono(V′) ≤ Fmono(S) for each S ∈ E. Hence 𝒮′ contains the top-(1 + |E|) candidate sets.

Assume that when l = n and n ∈ [2, r], FindNext finds the top-l′ candidate sets 𝒮′ based on the top-l candidate sets 𝒮 for some l′ > l. For the inductive step, we consider the case when l = l′. Let the collection E′ consist of all sets V with the highest Fmono value found by FindNext based on the top-l′ candidate sets 𝒮′, following the one-tuple-at-a-time replacement strategy. Let 𝒮′′ = 𝒮′ ∪ E′. We show that 𝒮′′ contains the top-(l′ + |E′|) candidate sets. It suffices to prove that for each candidate set V′ ⊆ Q(D) such that V′ ∉ 𝒮′′, Fmono(V′) ≤ Fmono(S) for each S ∈ 𝒮′′. Note that by the induction hypothesis, for each set S ∈ 𝒮′, Fmono(V′) ≤ Fmono(S) since 𝒮′ contains the top-l′ candidate sets. Thus we need only to show that Fmono(V′) ≤ Fmono(S) for each set S ∈ E′. This can be verified along the same lines as the argument for expanding l = 1 above, examining cases (a) and (b) given there. This verifies the correctness of the algorithm.

We next show that the algorithm is in PTIME. Note that procedure FindNext is called at most r − 1 times. In each call, for each set S in 𝒮, there are at most k · |Q(D)| replacements of tuples in S. Thus, at most O(r · k · |Q(D)|) time is needed for each call of FindNext. Putting these together, the algorithm is in O((r − 1) · r · k · |Q(D)|) time. Since Q is fixed, r is a constant and |Q(D)| is bounded by a polynomial in |D|, the algorithm is in PTIME. Therefore, the data complexity of DRP(FO, Fmono) is in PTIME.

We remark that the algorithm given above is in pseudo-polynomial time if r is not a constant and if r is encoded in binary. It is in PTIME when k is a constant. □

7. THE RESULT DIVERSITY COUNTING PROBLEM

We now study the counting problem RDC. We establish its combined complexity in Section 7.1 and data complexity in Section 7.2; we will also identify and investigate its special cases in Section 8. Along the same lines as Fig. 1, we depict in Fig. 4 the connections between complexity bounds of RDC in various settings.

7.1. The Combined Complexity of RDC

We first study the combined complexity of RDC. Here we use the framework of predicate-based counting classes introduced in [Hemaspaandra and Vollmer 1995]. For a complexity class C of decision problems, #·C is the class of all counting problems associated with a predicate RL that satisfies the following conditions:

— RL is polynomially balanced, i.e., there exists a polynomial q such that for all strings x and y, if RL(x, y) is true then |y| ≤ q(|x|); and

— the following decision problem is in class C: “given x and y, it is to decide whether RL(x, y)”.

A counting problem is to compute the cardinality of the set {y | RL(x, y)}, i.e., it is to find how many y there are such that predicate RL(x, y) is satisfied.

We show that when the objective is for max-sum or max-min diversification, the problem becomes harder for FO than for CQ, UCQ and ∃FO+. In contrast, when the objective is for mono-objective formulation, Fmono has greater impact on the complexity than LQ: it remains #·PSPACE-complete when LQ ranges over CQ, UCQ, ∃FO+ and FO. This is consistent with its counterparts for QRD and DRP.

The results are verified by parsimonious reductions. A parsimonious reduction from a counting problem #A to a counting problem #B is a PTIME function σ such that for all x, |{y | (x, y) ∈ A}| = |{z | (σ(x), z) ∈ B}|, i.e., σ is a bijection [Durand et al. 2005].

Below we first study RDC(LQ, F ) when F is FMS or FMM.

THEOREM 7.1. The combined complexity of RDC(LQ, FMS) and RDC(LQ, FMM) is

— #·NP-complete when LQ is CQ, UCQ or ∃FO+, and
— #·PSPACE-complete when LQ is FO.

All the results hold under parsimonious reductions. □


[Figure 4: diagram relating the complexity bounds of RDC in the settings of Sections 7 and 8, in two panels: (a) when F is FMS or FMM; (b) when F is Fmono.]

Fig. 4. The complexity bounds of RDC

I01:   X
       1
       0

I∨:    B   A1  A2
       0   0   0
       1   0   1
       1   1   0
       1   1   1

I∧:    B   A1  A2
       0   0   0
       0   0   1
       0   1   0
       1   1   1

I¬:    A   Ā
       0   1
       1   0

Fig. 5. Relation instances used in the lower bound proofs of Theorem 7.1.

The lower bounds are verified by reductions from the following problems.

(1) #Σ1SAT: Given an existentially quantified Boolean formula of the form ϕ(X, Y) = ∃X ψ(X, Y), where ψ(X, Y) is of the form C1 ∧ . . . ∧ Cl, and each Ci is a disjunction of variables or negated variables taken from X = {x1, . . . , xm} and Y = {y1, . . . , yn}, it is to count the number of truth assignments of Y that satisfy ϕ. It is known that #Σ1SAT is #·NP-complete [Durand et al. 2005].

(2) #QBF: Given a Boolean formula of the form ϕ = ∃X ∀y1 P2y2 · · · Pnyn ψ, where Pi ∈ {∃, ∀} for i ∈ [2, n], and ψ is a quantifier-free Boolean formula over the variables in X = {x1, . . . , xm} and Y = {y1, . . . , yn}, it is to count the number of truth assignments of X variables that satisfy ϕ. It is known to be #·PSPACE-complete [Ladner 1989].
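As a small illustration of what #QBF asks for, the sketch below counts the truth assignments of X under which the quantified suffix over Y holds. It is a brute-force evaluator that takes exponential time and is only meant to make the counting semantics tangible; the encoding of the prefix as a string over "A" (for ∀) and "E" (for ∃), the function names and the example formula are all assumptions made for this example.

```python
from itertools import product


def holds(prefix, psi, x, y=()):
    """Evaluate P_{l+1} y_{l+1} ... P_n y_n psi under x and the partial assignment y."""
    l = len(y)
    if l == len(prefix):
        return bool(psi(x, y))
    branches = (holds(prefix, psi, x, y + (b,)) for b in (0, 1))
    return all(branches) if prefix[l] == "A" else any(branches)


def count_qbf(prefix, psi, m):
    """Number of assignments of x1..xm that make the quantified formula true."""
    return sum(1 for x in product((0, 1), repeat=m) if holds(prefix, psi, x))


def example_psi(x, y):
    # Hypothetical matrix: (x1 or y1) and (y2 <-> x2)
    return bool(x[0] or y[0]) and y[1] == x[1]


# "AE" stands for the prefix: for all y1, there exists y2
print(count_qbf("AE", example_psi, m=2))
```

For this example the formula holds exactly for the assignments with x1 = 1, since y1 = 0 falsifies the first conjunct otherwise, while y2 can always be chosen equal to x2.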

Given these, we prove Theorem 7.1 as follows.

PROOF. We start with RDC for CQ, UCQ and ∃FO+, and then study it for FO.

(1) When LQ is CQ, UCQ or ∃FO+. It suffices to show that RDC(CQ, FMS) and RDC(CQ, FMM) are #·NP-hard and that RDC(∃FO+, FMS) and RDC(∃FO+, FMM) are in #·NP.

(1.1) Lower bound. We first show that RDC(CQ, FMS) is #·NP-hard even when λ = 0 and k is a constant, by parsimonious reductions from #Σ1SAT. Given an instance ϕ(X, Y) = ∃X ψ(X, Y) of #Σ1SAT, we define a database D, a CQ query Q, functions δrel, δdis and FMS, a positive integer k and a real number B. We show that the number of valid sets for (Q, D, k, FMS, B) equals the number of truth assignments of Y variables that satisfy ϕ. In particular, we set k = 2 and B = 3.

(1) The database consists of four relations I01, I∨, I∧ and I¬ as shown in Fig. 5, specified by schemas R01(X), R∨(B, A1, A2), R∧(B, A1, A2) and R¬(A, Ā), respectively. Here I01 encodes the Boolean domain, and I∨, I∧ and I¬ encode disjunction, conjunction and negation, respectively, such that we can express ϕ′ below in CQ with these relations.

(2) To define the CQ query Q, we first construct a new formula ϕ′ as follows:

    ϕ′(y⃗) = ∃x⃗, z ((ψ(x⃗, y⃗) ∨ z) ∧ ¬z),

where z is a new variable not in X ∪ Y. It can be verified that a truth assignment µY of Y variables satisfies ϕ if and only if µY makes ϕ′ true when z is set to be 0.

We now define the CQ query Q as follows:

    Q(y⃗, z, a) = ∃x⃗ (QX(x⃗) ∧ QY(y⃗) ∧ R01(z) ∧ Q1(x⃗, y⃗, z, a)).

Here x⃗ = (x1, . . . , xm) and y⃗ = (y1, . . . , yn). Queries QX and QY generate all truth assignments of variables in X and Y, respectively, by means of Cartesian products of R01. Sub-query Q1 leverages R∨, R∧ and R¬ to encode the formula ϕ′. The semantics of Q1 is that for a given truth assignment µX of variables in X, µY for Y and µz for z, which are encoded by tuples tX, tY and tz, respectively, Q1(tX, tY, tz, a) returns a = 1 if (ψ ∨ z) ∧ ¬z is satisfied by µX, µY and µz; and it returns a = 0 otherwise. Intuitively, Q returns all tuples (tY, tz, a) such that Q1 returns a = 1 if the truth assignment µY of Y variables (encoded by tY) and µz of z (represented by tz) satisfy ϕ′.

(3) We define (a) δrel((tY, 0, 1), Q) = 1, (b) δrel((1, . . . , 1, 0), Q) = 2 and (c) δrel(t, Q) = 0 for any other tuple t of RQ. Furthermore, we define δdis as a constant function that returns 0 for each pair of tuples of RQ, and let λ = 0. Then for each set U of tuples of RQ with k tuples, FMS(U) = (k − 1) · Σ_{t∈U} δrel(t, Q).

We show that the construction above is indeed a parsimonious reduction. To see this, observe that for each set U ⊆ Q(D) such that |U| = 2, FMS(U) ≥ 3 if and only if U = {(tY, 0, 1), (1, . . . , 1, 0)}, by the definition of FMS and δrel given above, where tY encodes a truth assignment of Y variables that satisfies ϕ′ with z = 0. Moreover, as discussed earlier, a truth assignment µY makes ϕ true if and only if µY satisfies ϕ′ when z = 0. Thus, the number of truth assignments of Y variables that satisfy ϕ equals the number of valid sets U for (Q, D, k, FMS, B).
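The counting equality can be checked on a small instance. The sketch below is only an illustration under several assumptions: it computes Q(D) directly in Python instead of evaluating a CQ over I01, I∨, I∧ and I¬; it reads the conjunct on z as a negation, i.e., Q1 tests (ψ ∨ z) ∧ ¬z; and it reads the tuple (1, . . . , 1, 0) as tY = (1, . . . , 1), z = 1, a = 0. The example formula ψ and all function names are hypothetical.

```python
from itertools import combinations, product


def query_result(psi, m, n):
    """Tuples (y1..yn, z, a) returned by Q, computed directly for illustration."""
    result = set()
    for y in product((0, 1), repeat=n):
        for z in (0, 1):
            for x in product((0, 1), repeat=m):
                a = int(bool(psi(x, y) or z) and z == 0)   # (psi or z) and not z
                result.add(y + (z, a))
    return result


def delta_rel(t, n):
    """Relevance function of step (3) of the reduction."""
    z, a = t[n], t[n + 1]
    if (z, a) == (0, 1):
        return 1
    if t == (1,) * (n + 1) + (0,):
        return 2
    return 0


def count_valid_sets(psi, m, n, k=2, bound=3):
    """Number of k-subsets U of Q(D) with F_MS(U) >= bound, for lambda = 0."""
    tuples = sorted(query_result(psi, m, n))
    return sum(
        1
        for U in combinations(tuples, k)
        if (k - 1) * sum(delta_rel(t, n) for t in U) >= bound
    )


def count_satisfying_y(psi, m, n):
    """Number of Y assignments for which some X assignment satisfies psi."""
    return sum(
        1
        for y in product((0, 1), repeat=n)
        if any(psi(x, y) for x in product((0, 1), repeat=m))
    )


def example_psi(x, y):
    # Hypothetical CNF: (x1 or y1) and (not x1 or y2)
    return bool((x[0] or y[0]) and ((not x[0]) or y[1]))


print(count_valid_sets(example_psi, m=1, n=2), count_satisfying_y(example_psi, m=1, n=2))
```

For this formula only the assignment y1 = y2 = 0 fails, so both counts agree, as a parsimonious reduction requires.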

We next show that RDC(CQ, FMM) is #·NP-hard, also by parsimonious reduction from #Σ1SAT. Given an instance ϕ(X, Y) = ∃X ψ(X, Y) of #Σ1SAT, we define the same formula ϕ′, database D, query Q and function δdis as given above. Furthermore, we define δrel((tY, 0, 1), Q) = 1 and for any other tuple t of RQ, δrel(t, Q) = 0. Let λ = 0 and k = 1. Then for each set U of tuples of RQ, FMM(U) = min_{t∈U} δrel(t, Q). Finally, we set B = 1.

We show that this also makes a parsimonious reduction. For each set U ⊆ Q(D) such that |U| = k = 1 and FMM(U) ≥ B = 1, U consists of a single tuple (tY, 0, 1) such that tY encodes a truth assignment of Y variables that satisfies ϕ. So the number of valid sets for (Q, D, k, FMM, B) equals the number of truth assignments of Y that satisfy ϕ.

(1.2) Upper bound. We show that RDC(∃FO+, F) is in #·NP, when F is FMS or FMM. It suffices to show that it is in NP to verify whether a given set U is valid for (Q, D, k, F, B), by the definition of #·NP. Indeed, given a set U, it is in NP to check whether U ⊆ Q(D) for ∃FO+, in PTIME to check whether |U| = k, and in PTIME to check whether F(U) ≥ B since F is PTIME computable when F is FMS or FMM. Thus RDC(∃FO+, F) is in #·NP.

(2) When LQ is FO. We next study RDC(FO, FMS) and RDC(FO, FMM).

(2.1) Lower bound. We verify that RDC(FO, FMS) and RDC(FO, FMM) are #·PSPACE-hard even when λ = 0 and k is a constant, by parsimonious reductions from #QBF.

We start with RDC(FO, FMS). Given an instance ϕ of #QBF as described above, we construct a database D, a query Q, and functions δrel, δdis and FMS, and set k = 2 and B = 3. We show that the number of valid sets for (Q, D, k, FMS, B) equals the number of truth assignments of X that satisfy ϕ.

(1) The database consists of the four relations I01, I∨, I∧ and I¬ shown in Fig. 5, specified by schemas R01(X), R∨(B, A1, A2), R∧(B, A1, A2) and R¬(A, Ā), respectively.

(2) To define the query Q, we construct a new formula ϕ′ from ϕ as before:

    ϕ′ = ∃X ∀y1 P2y2 · · · Pnyn ((ψ ∨ z) ∧ ¬z),

where z is a new variable not in X ∪ Y. A truth assignment µX of X satisfies ϕ if and only if µX makes ϕ′ true with z = 0. Let ψ′ = (ψ ∨ z) ∧ ¬z. We define the query Q as follows:

    Q(x⃗, z, b) = ∀y1 P2y2 · · · Pnyn (QX(x⃗) ∧ QY(y⃗) ∧ R01(z) ∧ Qψ′(x⃗, y⃗, z, b)).

Here x⃗ = (x1, . . . , xm) and y⃗ = (y1, . . . , yn). Queries QX and QY generate all truth assignments of variables in X and Y, respectively, by means of Cartesian products of R01. Furthermore, query Qψ′(x⃗, y⃗, z, b) encodes the truth value of ψ′ for given truth assignments µX of X variables, µY of Y variables and µz of variable z, such that it returns b = 1 if ψ′ is true under µX, µY and µz; otherwise it returns b = 0. Intuitively, Q returns all tuples (tX, tz, b) such that Q returns b = 1 if the truth assignments µX (encoded by tX) and µz (encoded by tz) satisfy ϕ′; and Q returns b = 0 otherwise.

(3) We define (a) δrel((tX, 0, 1), Q) = 1, (b) δrel((1, . . . , 1, 0), Q) = 2 and (c) for any other tuple t of RQ, δrel(t, Q) = 0. Furthermore, we define δdis as a constant function that returns 0 for each pair of tuples of RQ. Finally, we set λ = 0. Then for each set U of tuples of RQ with k tuples, FMS(U) = (k − 1) · Σ_{t∈U} δrel(t, Q).

Then along the same line as the proof of RDC(CQ, FMS) given earlier, one can readily verify that the number of truth assignments of X variables that satisfy ϕ is equal to the number of sets U valid for (Q, D, FMS, k, B).

We now show that RDC(FO, FMM) is #·PSPACE-hard, also by parsimonious reduction from #QBF. Given an instance ϕ of #QBF, we construct the same formula ϕ′, database D, query Q, and function δdis as above. Moreover, we define δrel((tX, 0, 1), Q) = 1 for each m-arity tuple tX that encodes a truth assignment of X variables, and δrel(t, Q) = 0 for any other tuple t of RQ. Let λ = 0, k = 1 and B = 1. Then for each set U consisting of k tuples of RQ, FMM(U) = min_{t∈U} δrel(t, Q). Then following the proof for RDC(FO, FMS), one can verify that the number of truth assignments of X variables that satisfy ϕ equals the number of sets U valid for (Q, D, FMM, k, B).

(2.2) Upper bound. To verify that RDC(FO, F) is in #·PSPACE for FMS and FMM, we only need to show that verifying whether a given set U is valid is in PSPACE. Indeed, given a set U, it is in PSPACE to check whether U ⊆ Q(D) and it is in PTIME to check whether |U| = k and F(U) ≥ B when F is FMS or FMM. □

We next investigate the combined complexity of RDC(LQ, F ) when F is Fmono.

THEOREM 7.2. The combined complexity of RDC(LQ, Fmono) is #·PSPACE-complete under parsimonious reductions, when LQ is CQ, UCQ, ∃FO+ or FO. □

We verify the lower bound by parsimonious reduction from #QBF. The proof extends the counting argument given in the proof of Theorem 5.2. Given an instance ϕ = ∃X ∀y1 P2y2 . . . Pnyn ψ(X, Y) of #QBF over variables X = {x1, . . . , xm} and Y = {y1, . . . , yn}, we define a distance function δ∗∗dis similar to its counterpart δdis given for QRD. More specifically, let t = (u1, . . . , u_{m+n}) and s = (v1, . . . , v_{m+n}) be any pair of (m + n)-arity tuples that encode two truth assignments of X ∪ Y variables, such that t^{m+l} = s^{m+l} and u_{m+l+1} ≠ v_{m+l+1}, where t^{m+l} and s^{m+l} are the prefixes of t and s of length m + l, respectively. We want to use δ∗∗dis to ensure that δ∗∗dis(t, s) > 0 if and only if formula P_{l+1}y_{l+1} . . . Pnyn ψ is satisfied by the truth assignment encoded by t^{m+l}, for variables x1, . . . , xm, y1, . . . , yl. We define δ∗∗dis by revising δdis as follows.

(a) We let δ∗∗dis(t, s) = 0 if t^m ≠ s^m, i.e., t^m and s^m encode distinct truth assignments of X variables. Otherwise, let t^m be an arbitrary m-arity tuple that encodes a truth assignment of X, and let t̂ = (t^m, 1, . . . , 1). For any tuple s with s^m = t^m, we define

(b) δ∗∗dis(t̂, s) = (1/2) · δdis(t̂, s) if s = (t^m, 1, v2, . . . , vn);

(c) δ∗∗dis(t̂, s) = 4 · δdis(t̂, s) if s = (t^m, 0, v2, . . . , vn); and moreover,

(d) for any other pair of tuples t′ and s′, we let δ∗∗dis(t′, s′) = δdis(t′, s′).

It is easy to see that for each pair t and s of (m + n)-arity tuples given above, δdis(t, s) = 1 if and only if δ∗∗dis(t, s) > 0, for l ∈ [0, n − 1]. Moreover, along the same line as Lemma 5.3, one can readily verify the following property of δ∗∗dis(t, s).

LEMMA 7.3. Consider an instance ϕ = ∃X ∀y1 P2y2 . . . Pnyn ψ(X, Y) of #QBF, a truth assignment µ^m_X of X variables and a truth assignment µ^l_Y of variables in {y1, . . . , yl}, for some l ∈ [0, n − 1]. For any pair of tuples t = (u1, . . . , u_{m+n}) and s = (v1, . . . , v_{m+n}) that encode truth assignments of variables in ϕ, if t^{m+l} = s^{m+l} = (µ^m_X, µ^l_Y) but u_{m+l+1} ≠ v_{m+l+1}, then P_{l+1}y_{l+1} . . . Pnyn ψ is true under µ^m_X and µ^l_Y if and only if δ∗∗dis(t, s) > 0, where t^{m+l} and s^{m+l} both encode µ^m_X and µ^l_Y. □

Based on Lemma 7.3, we next prove Theorem 7.2.

PROOF. It suffices to show that RDC(CQ, Fmono) is #·PSPACE-hard and RDC(FO, Fmono) is in #·PSPACE. The lower bound holds even when λ = 1 and k = 1.

(1) Lower bound. We verify the lower bound by parsimonious reduction from #QBF. Given an instance $\varphi = \exists X \forall y_1 P_2 y_2 \ldots P_n y_n\, \psi(X, Y)$ of #QBF, where $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_n\}$, we construct a database D, a CQ query Q, functions $\delta_{rel}$, $\delta^{**}_{dis}$ and $F_{mono}$, a positive integer k and a real number B, such that the number of truth assignments of X that satisfy $\varphi$ equals the number of valid sets U for (Q, D, k, Fmono, B). Intuitively, the reduction assures that for each truth assignment $\mu_X$ of X variables that satisfies $\varphi$, $\mu_X$ corresponds to a tuple $(t_X, 1, \ldots, 1)$ such that $U = \{(t_X, 1, \ldots, 1)\}$ is valid for (Q, D, k, Fmono, B), where $t_X$ is an m-arity tuple encoding $\mu_X$; moreover, there exists no other valid set for (Q, D, k, Fmono, B).

(1) The database D consists of a single relation I01 = {(0), (1)} of Fig. 5, specified by schema R01(X) and encoding the Boolean domain.

(2) We define the CQ query Q as follows:

$$Q(\vec{x}, \vec{y}) = \bigwedge_{i \in [1,m]} R_{01}(x_i) \wedge \bigwedge_{j \in [1,n]} R_{01}(y_j).$$

Here $\vec{x} = (x_1, \ldots, x_m)$ and $\vec{y} = (y_1, \ldots, y_n)$. That is, Q generates all truth assignments of variables in X ∪ Y. Obviously, $|Q(D)| = 2^{m+n}$.

(3) We define relevance function δrel to be a constant function that returns 1 for any set U of tuples of RQ, and use the distance function $\delta^{**}_{dis}$ given above. We set λ = 1, k = 1 and $B = 2^{n+1}/(2^{m+n} - 1)$. Then for any set U consisting of k tuples of RQ, we have that

$$F_{mono}(U) = \frac{1}{2^{m+n} - 1} \cdot \sum_{t \in U,\, s \in Q(D)} \delta^{**}_{dis}(t, s).$$
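As a concrete illustration of steps (2) and (3), the following Python sketch enumerates Q(D) for the Boolean relation and computes the bound B exactly. The function names `boolean_answers` and `threshold` are hypothetical and chosen only for this illustration; they are not part of the construction.

```python
from fractions import Fraction
from itertools import product


def boolean_answers(m: int, n: int) -> list[tuple[int, ...]]:
    """Q(D) for I01 = {(0), (1)}: every (m+n)-ary tuple over {0, 1}."""
    return list(product((0, 1), repeat=m + n))


def threshold(m: int, n: int) -> Fraction:
    """The bound B = 2^(n+1) / (2^(m+n) - 1), kept exact as a fraction."""
    return Fraction(2 ** (n + 1), 2 ** (m + n) - 1)


answers = boolean_answers(m=2, n=3)
print(len(answers))      # number of tuples in Q(D), i.e. 2^(m+n)
print(threshold(2, 3))   # the bound B for m = 2, n = 3
```

Since |Q(D)| grows as 2^(m+n), materializing Q(D) like this is only feasible for toy values of m and n.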

We next show that the number of truth assignments of X that satisfy ϕ is the same as the number of valid sets U for (Q, D, k, Fmono, B). More specifically, we verify that if a truth assignment $\mu^m_X$ of X variables satisfies ϕ, then $\{(t^m, 1, \ldots, 1)\}$ is a valid set for (Q, D, k, Fmono, B), where $t^m$ encodes $\mu^m_X$, and moreover, there exists no other tuple $s$ such that $s^m$ encodes $\mu^m_X$ and $\{s\}$ is valid for (Q, D, k, Fmono, B). If this holds, then the encoding is a parsimonious reduction from #QBF.


Assume first that the truth assignment $\mu^m_X$ of X variables satisfies ϕ. Let $t = (t^m, 1, \ldots, 1)$ such that $t^m$ encodes $\mu^m_X$. Then by Lemma 7.3, for each tuple $s = (t^m, 0, v_2, \ldots, v_n)$, we have that $\delta^{**}_{dis}(t, s) = 4$. Note that there exist $2^{n-1}$ such tuples $s$. Then we have that $\sum_{s \in Q(D)} \delta^{**}_{dis}(t, s) \geq 4 \cdot 2^{n-1} = 2^{n+1}$. Thus $F_{mono}(\{t\}) \geq 2^{n+1}/(2^{m+n} - 1) = B$. Therefore, $\{t\}$ is a valid set for (Q, D, k, Fmono, B). We next prove that for any other tuple $t'$ that is distinct from $t$ such that $t'^{m} = t^m$ (i.e., $t'^{m}$ encodes $\mu^m_X$), we have that $F_{mono}(\{t'\}) < B$. Indeed, by the definition of $\delta^{**}_{dis}$, for each tuple $t' = (t^m, u_{m+1}, \ldots, u_{m+n})$ such that $t' \neq t$, there are at most $2^n - 1$ tuples $s$ such that $\delta^{**}_{dis}(t', s) > 0$, and moreover, for each such tuple $s$, $s^m = t^m$. Consider the following two cases. For any tuple $t' = (u_1, \ldots, u_{m+n})$ such that $t' \neq t$ and $t'^{m} = t^m$, (a) if $u_{m+1} = 0$, then by the definition of $\delta^{**}_{dis}$,

$$\sum_{s \in Q(D)} \delta^{**}_{dis}(t', s) = \sum_{s \in Q(D) \setminus \{t\}} \delta^{**}_{dis}(t', s) + \delta^{**}_{dis}(t', t) \leq 2^n - 2 + 4 = 2^n + 2 < 2^{n+1}$$

(resp. $= 2^{n+1}$) when n > 1 (resp. when n = 1); and (b) otherwise,

$$\sum_{s \in Q(D)} \delta^{**}_{dis}(t', s) = \sum_{s \in Q(D) \setminus \{t\}} \delta^{**}_{dis}(t', s) + \delta^{**}_{dis}(t', t) \leq 2^n - 2 + 1/2 = 2^n - 3/2 < 2^{n+1}.$$

As a result, for each truth assignment $\mu^m_X$ that satisfies ϕ, there exists one and only one tuple $t = (t^m, 1, \ldots, 1)$ such that $\{t\}$ is valid for (Q, D, k, Fmono, B), where $t^m$ encodes $\mu^m_X$.

(2) Upper bound. We show that RDC(FO, Fmono) is in #·PSPACE. It suffices to prove that it is in PSPACE to verify whether a given set U is valid for (Q, D, k, Fmono, B). Indeed, consider the algorithm given in the proof of Theorem 5.2 for QRD(FO, Fmono). Then its steps 4 and 5 can be used to check whether a given set U is valid. As shown there, steps 4 and 5 are in PSPACE. Therefore, RDC(FO, Fmono) is in #·PSPACE. □

7.2. The Data Complexity of RDC

We next show that fixing queries reduces the complexity of RDC, to an extent:

(1) when F is FMS or FMM, the problem becomes #P-complete under parsimonious reductions, down from #·NP-complete (for CQ, UCQ and ∃FO+) and #·PSPACE-complete (for FO), as opposed to Theorem 7.1; and

(2) when F is Fmono, RDC(LQ, F) is #P-complete under polynomial Turing reductions, rather than #·PSPACE-complete, in contrast to Theorem 7.2.

Here #P is the class of functions that count the number of accepting paths of nondeterministic PTIME Turing machines, in the machine-based framework of [Valiant 1979]. It is known that #P = #·P [Durand et al. 2005], where #·P is the predicate-based counting class defined with a PTIME predicate (see Section 7.1).

Recall that a counting problem #A is polynomial Turing reducible to #B if there exists a PTIME function σ such that for all x, |{y | (x, y) ∈ A}| is PTIME computable by making multiple calls to an oracle that computes |{z | (σ(x), z) ∈ B}|. We have so far used parsimonious reductions (Section 7.1), which are stronger than polynomial Turing reductions, i.e., a parsimonious reduction from #A to #B is also a polynomial Turing reduction from #A to #B, but not necessarily vice versa.

Note that a parsimonious reduction from #A to #B is also a PTIME reduction from its decision problem A to the decision problem B of #B. Hence if the decision problem B of #B is in PTIME, #B cannot be #P-complete under parsimonious reductions unless P = NP: for many NP-complete problems, e.g., 3SAT, their counting problems are in #P, and a parsimonious reduction from such a counting problem to #B would yield a PTIME algorithm for the NP-complete problem. This is precisely the case for RDC(LQ, Fmono) when LQ is CQ, UCQ, ∃FO+ or FO (data complexity). Indeed, its decision problem QRD(LQ, Fmono) is in PTIME (Theorem 5.4). Thus RDC(LQ, Fmono) is #P-complete under polynomial Turing reductions, but not under parsimonious reductions (unless P = NP). In contrast, for FMS or FMM, RDC(LQ, F) is #P-complete under parsimonious reductions.


We first study the data complexity of RDC(LQ, FMS) and RDC(LQ, FMM).

THEOREM 7.4. The data complexity of RDC(LQ, FMS) and RDC(LQ, FMM) is #P-complete under parsimonious reductions for CQ, UCQ, ∃FO+ and FO. □

The lower bound is verified by parsimonious reduction from #SAT: given an instance ϕ(X) = C1 ∧ · · · ∧ Cl of 3SAT over variables X = {x1, . . . , xm}, #SAT is to count the number of truth assignments of X that satisfy ϕ. It is known that #SAT is #P-complete (cf. [Papadimitriou 1994]). Given these, we prove Theorem 7.4 as follows.

PROOF. It suffices to show that RDC(CQ, FMS) and RDC(CQ, FMM) are #P-hard under parsimonious reductions, and that RDC(FO, FMS) and RDC(FO, FMM) are in #P.

(1) Lower bound. We show that RDC(CQ, FMS) is #P-hard, even when λ = 1, by parsimonious reduction from #SAT. Given an instance ϕ(X) of #SAT, we define the same D, Q, λ, δrel, δdis and FMS as their counterparts given in the proof of Theorem 5.4 for QRD(CQ, FMS), and let k = l and B = l · (l − 1). Recall that in that proof, for each set U ⊆ Q(D) such that |U| = l, FMS(U) ≥ l · (l − 1) if and only if tuples in U encode a truth assignment of X variables that satisfies ϕ. Thus the number of valid sets for (Q, D, k, FMS, B) equals the number of truth assignments of X that satisfy ϕ.

We next show that RDC(CQ, FMM) is #P-hard, also by parsimonious reduction from #SAT. Given an instance ϕ of #SAT, we construct the same D, Q, λ, δrel, δdis and FMM as their counterparts used in the proof of Theorem 5.4 for QRD(CQ, FMM), and let k = l and B = 1. Recall that in that proof, for each set U ⊆ Q(D) such that |U| = l, FMM(U) ≥ 1 if and only if the tuples in U encode a truth assignment of X variables that satisfies ϕ. Thus the number of valid sets for (Q, D, k, FMM, B) is equal to the number of truth assignments of X variables that satisfy ϕ.

(2) Upper bound. We show that RDC(FO, F) is in #P for max-sum and max-min diversification. To do this, we only need to show that it is in PTIME to verify whether a given set U is valid for (Q, D, k, F, B), when F = FMS or F = FMM, by the definition of #·P (recall that #·P = #P). Indeed, given a set U, it is in PTIME to check whether U ⊆ Q(D) for a fixed query Q in FO, and is in PTIME to check whether |U| = k. Moreover, checking whether F(U) ≥ B is also in PTIME when F is FMS or FMM. Thus RDC(FO, F) is in #P. □
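The upper-bound argument rests on a membership check that runs in PTIME for a fixed query. A minimal Python sketch of that check, together with a brute-force counter built on it, is given below. It assumes that Q(D) has already been materialized as a collection of hashable tuples and that the objective F is supplied as a Python callable; the names `is_valid` and `count_valid_sets` are illustrative only. The counter enumerates all k-element subsets and is therefore exponential in k; it is meant to make the definition of RDC concrete, not to suggest an efficient algorithm.

```python
from itertools import combinations


def is_valid(U, answers, k, F, B) -> bool:
    """U is valid iff U is a k-element subset of Q(D) with F(U) >= B."""
    U = frozenset(U)
    return U <= frozenset(answers) and len(U) == k and F(U) >= B


def count_valid_sets(answers, k, F, B) -> int:
    """Brute-force RDC: count the k-element subsets of Q(D) with F >= B."""
    return sum(
        1 for U in combinations(set(answers), k) if F(frozenset(U)) >= B
    )


# Toy instance: answers are integers and F simply sums them. This F is an
# assumption made for illustration; it is not F_MS, F_MM or F_mono.
answers = [1, 2, 3, 4]
print(is_valid({1, 4}, answers, k=2, F=sum, B=5))
print(count_valid_sets(answers, k=2, F=sum, B=5))
```

Each call to `is_valid` costs one subset test, one size test and one evaluation of F, which mirrors the three PTIME checks in the proof.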

We next investigate the data complexity of RDC(LQ, Fmono).

THEOREM 7.5. The data complexity of RDC(LQ, Fmono) is #P-complete under polynomial Turing reductions for CQ, UCQ, ∃FO+ and FO. □

To show the lower bound, we reduce from the following problems, and use a lemma.

(1) #SSP (the #subset sum problem): Given a finite set W, a function π : W → N and a natural number d ∈ N, it is to count the number of subsets T ⊆ W such that $\sum_{w \in T} \pi(w) = d$. It is known that #SSP is #P-complete under parsimonious reductions [Berbeglia and Hahn 2010].

(2) #SSPk: Given a finite set W, a function π : W → N and natural numbers d, l ∈ N, #SSPk is to count the number of subsets T ⊆ W such that |T| = l and $\sum_{w \in T} \pi(w) = d$.

We first verify that #SSPk is #P-hard by parsimonious reduction from #SSP and then show that RDC(CQ, Fmono) is #P-hard by Turing reduction from #SSPk.

LEMMA 7.6. The #SSPk problem is #P-complete under parsimonious reductions. □


PROOF. To see that #SSPk is in #P, observe that given a finite set W, a function π : W → N, a subset T ⊆ W, d ∈ N and l ∈ N, it is in PTIME to check whether |T| = l and $\sum_{w \in T} \pi(w) = d$. Thus #SSPk is in #P by the definition of #·P (recall #·P = #P).

We next show that #SSPk is #P-hard by parsimonious reduction from #SSP. Given an instance W, π and d of #SSP, we construct W′, π′, d′ and l such that the number of subsets T ⊆ W with $\sum_{w \in T} \pi(w) = d$ equals the number of T′ ⊆ W′ with |T′| = l and $\sum_{w' \in T'} \pi'(w') = d'$. Let W = {w1, . . . , wn} (hence |W| = n). Denote by m the number of decimal digits in $\sum_{w \in W} \pi(w)$. Then for each subset T of W, $\sum_{w \in T} \pi(w)$ can also be represented as an m-digit integer. We next give the reduction.

(1) We define W′ = {(wi, 1), (wi, 0) | i ∈ [1, n]}. That is, we include two elements (wi, 1) and (wi, 0) in W′ for each wi ∈ W, where i ∈ [1, n].

(2) We define π′ as a function from W′ to N. Intuitively, for each w′ ∈ W′, we define π′(w′) to be an (n + m)-digit integer, where the first n digits in π′(w′) encode wi for i ∈ [1, n], where w′ = (wi, 1) or w′ = (wi, 0); moreover, the last m-digit integer in π′(w′) either equals π(wi) if w′ = (wi, 1), or equals 0 if w′ = (wi, 0). More specifically, for each w′ ∈ W′, we define π′(w′) to be an (n + m)-digit integer $u_1 \ldots u_n v_1 \ldots v_m$, such that (a) if w′ = (wi, 1) for some wi ∈ W, then $u_i = 1$, and for each j ∈ [1, n], if j ≠ i, then $u_j = 0$; and moreover, the m-digit integer $v_1 \ldots v_m$ equals π(wi); and (b) if w′ = (wj, 0) for some wj ∈ W, then $u_j = 1$, and for each i ∈ [1, n], if i ≠ j, then $u_i = 0$, and furthermore, for all i ∈ [1, m], $v_i = 0$.

(3) We define d′ to be an (n + m)-digit integer $u^*_1 \ldots u^*_n v^*_1 \ldots v^*_m$, where for each i ∈ [1, n], $u^*_i = 1$, and moreover, the m-digit integer $v^*_1 \ldots v^*_m$ equals d. Indeed, d is an m-digit integer since $d \leq \sum_{w \in W} \pi(w)$. Finally, we define l = n, i.e., l = |W|.

We next show that this is indeed a parsimonious reduction, i.e., the number of subsets T ⊆ W such that $\sum_{w \in T} \pi(w) = d$ is equal to the number of subsets T′ ⊆ W′ with |T′| = l and $\sum_{w' \in T'} \pi'(w') = d'$. Assume that there exists a set T′ ⊆ W′ such that |T′| = l = n and $\sum_{w' \in T'} \pi'(w') = d'$. Then by the definitions of d′ and π′, the sum of the last m-digit integers in all π′(w′) for w′ ∈ T′ is equal to d, and moreover, $d = \sum_{(w_i, 1) \in T'} \pi(w_i)$. Let T be the set consisting of all wi such that (wi, 1) ∈ T′. Then $\sum_{w_i \in T} \pi(w_i) = \sum_{(w_i, 1) \in T'} \pi(w_i) = d$. Conversely, assume that there exists a subset T ⊆ W such that $\sum_{w \in T} \pi(w) = d$. Define $T' = \{(w_i, 1) \mid w_i \in T\} \cup \{(w_j, 0) \mid w_j \in W \setminus T\}$. Obviously, |T′| = l = n. Again by the definitions of π′ and d′, one can verify that $\sum_{w' \in T'} \pi'(w') = d'$, since the first n digits of the sum are all 1 and its last m digits encode $\sum_{(w_i, 1) \in T'} \pi(w_i) = d$. Moreover, the first n digits of d′ force every such T′ to contain exactly one of (wi, 1) and (wi, 0) for each i ∈ [1, n], so the mapping from T to T′ is a bijection between the two sets of solutions. Hence the reduction is parsimonious.

Putting these together, we have that #SSPk is #P-complete. □
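The reduction in the proof of Lemma 7.6 can be sketched in Python as follows. The sketch represents W by a list of non-negative integer weights (so π is the list itself), builds the (n + m)-digit decimal encodings as ordinary integers, and uses brute-force counters to compare the two solution counts on a small instance. All function names are hypothetical, and the sketch assumes 0 ≤ d ≤ Σ π(w) over all w ∈ W, as in the proof.

```python
from itertools import combinations


def count_ssp(weights, d):
    """Brute-force #SSP: subsets (of positions) whose weights sum to d."""
    n = len(weights)
    return sum(
        1
        for size in range(n + 1)
        for T in combinations(range(n), size)
        if sum(weights[i] for i in T) == d
    )


def count_ssp_k(weights, d, l):
    """Brute-force #SSPk: l-element subsets whose weights sum to d."""
    return sum(
        1
        for T in combinations(range(len(weights)), l)
        if sum(weights[i] for i in T) == d
    )


def reduce_ssp_to_ssp_k(weights, d):
    """Map a #SSP instance (W, pi, d) to a #SSPk instance (W', pi', d', l)."""
    total = sum(weights)
    if not 0 <= d <= total:
        raise ValueError("the sketch assumes 0 <= d <= sum of all weights")
    n = len(weights)
    m = len(str(total))              # number of decimal digits of the total
    shift = 10 ** m                  # leaves room for the m-digit suffix
    new_weights = []
    for i, w in enumerate(weights):
        tag = 10 ** (n - 1 - i) * shift   # prefix digit u_i = 1, others 0
        new_weights.append(tag + w)       # pi'((w_i, 1)): suffix is pi(w_i)
        new_weights.append(tag)           # pi'((w_i, 0)): suffix is 0
    d_prime = (10 ** n - 1) // 9 * shift + d   # prefix 1...1, suffix d
    return new_weights, d_prime, n             # l = n


weights, d = [3, 5, 2, 5], 10
new_weights, d_prime, l = reduce_ssp_to_ssp_k(weights, d)
print(count_ssp(weights, d), count_ssp_k(new_weights, d_prime, l))
```

Because every prefix digit of a sum of n encoded weights is at most 2, no carries occur in base 10, which is why the prefix 1...1 of d′ forces exactly one of (wi, 1) and (wi, 0) to be chosen for each i.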

Using Lemma 7.6, we next prove Theorem 7.5.

PROOF. It suffices to show that RDC(CQ, Fmono) is #P-hard under polynomial Turing reductions, even when λ = 0, and that RDC(FO, Fmono) is in #P.

(1) Lower bound. We show that RDC(CQ, Fmono) is #P-hard by polynomial Turing reduction from #SSPk, which has been shown #P-complete by Lemma 7.6. Denote by COUNTRDC(Q, D, k, Fmono, B) the oracle that, given Q, D, k, Fmono and B, returns the number of valid sets U for (Q, D, k, Fmono, B). To show that #SSPk is polynomial time Turing reducible to RDC(CQ, Fmono), it suffices to show that there exists a PTIME algorithm for computing #SSPk by calling COUNTRDC(Q, D, k, Fmono, B) a polynomial number of times, by the notion of polynomial Turing reductions given earlier.

Observe the following. Given a finite set W, a function π : W → N and natural numbers d and l, let X be the number of subsets T of W such that $\sum_{w \in T} \pi(w) = d$ and |T| = l, Y be the number of subsets T′ of W such that $\sum_{w \in T'} \pi(w) \geq d$ and |T′| = l, and


Z be the number of subsets T′′ of W such that $\sum_{w \in T''} \pi(w) \geq d + 1$ and |T′′| = l. Then X = Y − Z. Based on this, we construct the polynomial Turing reduction from #SSPk to RDC(CQ, Fmono) as follows. Given an instance W, π, l and d of #SSPk, we first construct Q, D, δrel, δdis, Fmono, k and B in PTIME, such that the number of subsets T of W with |T| = l and $\sum_{w \in T} \pi(w) \geq d$ equals the number of valid sets U for (Q, D, k, Fmono, B). As discussed above, we can find the solution for #SSPk, i.e., the number of subsets T of W with |T| = l and $\sum_{w \in T} \pi(w) = d$, by calling the oracle COUNTRDC twice, for computing the numbers Y and Z of valid sets for (Q, D, k, Fmono, B) and (Q, D, k, Fmono, B + 1), respectively.

We next give the transformation from #SSPk to RDC(CQ, Fmono).

(1) The database D consists of a single relation IW = {(w) | w ∈ W} of schema RW(W).

(2) We define query Q as the identity query on RW instances.

(3) We define δrel as follows. For each tuple t = (w) ∈ Q(D), we let δrel(t, Q) = π(w). Furthermore, for any other tuple t′ of RQ, we define δrel(t′, Q) = 0. Moreover, we take δdis as a constant function that returns 0 for each pair of tuples of RQ. We set λ = 0, and hence for each set U of tuples of RQ, $F_{mono}(U) = \sum_{t \in U} \delta_{rel}(t, Q)$.

(4) Finally, we set k = l and B = d.

We next show that the number of subsets T of W such that |T| = l and $\sum_{w \in T} \pi(w) \geq d$ equals the number of valid sets U for (Q, D, k, Fmono, B). Note that k = l and B = d. Then by the definition of δrel, for each set U ⊆ Q(D) such that |U| = k, $F_{mono}(U) = \sum_{(w) \in U} \pi(w) \geq B$ if and only if for the set T = {w | (w) ∈ U}, we have that |T| = l and $\sum_{w \in T} \pi(w) \geq d$. From this we have a PTIME algorithm for computing #SSPk by calling COUNTRDC(Q, D, k, Fmono, B) (denoted by Y) and COUNTRDC(Q, D, k, Fmono, B + 1) (denoted by Z). Then Y − Z is the solution for #SSPk.
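The two oracle calls can be sketched in Python as follows. Here `count_at_least` is a hypothetical brute-force stand-in for the oracle COUNTRDC on the instance constructed above (identity query, δrel = π, λ = 0), so that a valid set is simply an l-element subset of W whose total weight reaches the bound; a real oracle would not be implemented by enumeration. The sketch assumes integer weights, which is what lets the bound B + 1 express "strictly greater than B".

```python
from itertools import combinations


def count_at_least(weights, l, bound):
    """Stand-in oracle: number of l-element subsets with total weight >= bound."""
    return sum(
        1
        for T in combinations(range(len(weights)), l)
        if sum(weights[i] for i in T) >= bound
    )


def count_ssp_k_via_oracle(weights, d, l):
    """#SSPk from two oracle calls: (count with sum >= d) - (count with sum >= d + 1)."""
    at_least_d = count_at_least(weights, l, d)
    at_least_d_plus_1 = count_at_least(weights, l, d + 1)
    return at_least_d - at_least_d_plus_1


print(count_ssp_k_via_oracle([3, 5, 2, 5], d=7, l=2))
```

The subtraction is what makes this a Turing reduction rather than a parsimonious one: the answer is computed from two oracle results instead of being read off a single instance.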

(2) Upper bound. We verify that RDC(FO, Fmono) is in #P, by showing that it is in PTIME to verify whether a given set U is valid for (Q, D, k, Fmono, B). Indeed, Q(D) and Fmono are both PTIME computable since Q is fixed. Thus it is in PTIME to check whether U ⊆ Q(D), |U| = k and Fmono(U) ≥ B. Hence RDC(FO, Fmono) is in #P. □

Summary. Taking the results of Sections 5, 6 and 7 together, we can find the following.

(1) Both query languages and objective functions have an impact on the combined complexity of query result diversification. More specifically, (a) when F is FMS or FMM, the diversification problems for FO have a higher combined complexity than their counterparts for CQ, UCQ and ∃FO+; and (b) when LQ is CQ, UCQ or ∃FO+, Fmono makes the diversification problems harder than FMS and FMM.

(2) When the objective is given by mono-objective formulation, the objective function dominates the combined complexity. Indeed, the combined complexity bounds of these problems are independent of whether we take CQ, UCQ, ∃FO+ or FO as LQ.

(3) When it comes to data complexity, query languages make no difference, while objective functions determine the complexity. Indeed, the data complexity bounds of these problems remain unchanged when LQ is CQ, UCQ, ∃FO+ or FO. Moreover, when the objective is given by Fmono, QRD and DRP become tractable, but it is not the case when the objective is for FMS or FMM. That is, the data complexity is inherent to result diversification itself, rather than a consequence of the complexity of query languages.


8. SPECIAL CASES OF QUERY RESULT DIVERSIFICATION

In this section we identify and investigate several special cases of QRD, DRP and RDC. The reason for studying these is twofold. (1) The results of Sections 5, 6 and 7 tell us that these problems have rather high complexity. This suggests that we find their special yet practical cases that are tractable. (2) We want to further understand the impact of various parameters of these problems on their complexity, including query languages with low complexity, relevance functions δrel, distance functions δdis, and the bound k for selecting query answers.

Due to the space constraint, we refer the interested reader to the electronic appendix for the proofs of the results to be presented in this section and Section 9.

Identity queries. We first consider the case when LQ consists of identity queries only, i.e., when query Q is of the following form:

Q(x) = R(x),

where R is a relation atom, and |x| is the arity of R. Note that for any instance D of schema R, D = Q(D). As remarked earlier, in this setting QRD was shown to be NP-hard by [Gollapudi and Sharma 2009] when the objective is for max-sum diversification and max-min diversification. No previous work has studied QRD for mono-objective formulation, or DRP and RDC for any of the three objective functions.

We show that identity queries reduce the complexity of these problems to an extent.

(1) When the objective is given by mono-objective formulation, QRD and DRP are tractable, as opposed to the NP-hardness of QRD for FMS and FMM [Gollapudi and Sharma 2009], and RDC becomes #P-complete. In contrast, these problems are PSPACE-complete, PSPACE-complete, and #·PSPACE-complete (Theorems 5.2, 6.2 and 7.2), respectively, when LQ is CQ. This further verifies that query languages have an impact on the complexity of diversification.

(2) In contrast, when the objective is for max-sum or max-min diversification, the combined complexity and data complexity of these problems are the same as their counterparts when LQ is CQ. In other words, in this setting, query languages with a low complexity (for its membership problem) do not simplify the analyses of diversification.

COROLLARY 8.1. For identity queries, the combined complexity and data complexity of QRD, DRP and RDC coincide. More specifically,

— QRD(LQ, FMS) and QRD(LQ, FMM) are NP-complete,
— DRP(LQ, FMS) and DRP(LQ, FMM) are coNP-complete, and
— RDC(LQ, FMS) and RDC(LQ, FMM) are #P-complete under parsimonious reductions,

for both combined complexity and data complexity, while

— QRD(LQ, Fmono) is in PTIME,
— DRP(LQ, Fmono) is in PTIME, and
— RDC(LQ, Fmono) is #P-complete under polynomial Turing reductions,

for both combined complexity and data complexity, which are the same as their data complexity given in Theorems 5.4, 6.4 and 7.5, respectively. □

When λ = 0. We next focus on the impact of the relevance and diversity requirements on the complexity of query result diversification analyses. We first consider the case when λ = 0, i.e., the objective function F is defined in terms of the relevance function δrel only. We find that the diversity requirement has higher impact on the complexity than relevance. Indeed, dropping distance functions δdis simplifies the analyses of these problems to an extent. This is consistent with the observation of [Vieira et al. 2011].

(1) When the objective function is FMS or FMM, QRD and DRP become tractable for


fixed Q. Moreover, RDC is in FP for FMM, where FP is the class of all functions that can be computed in PTIME (cf. [Papadimitriou 1994]). That is, these problems have lower data complexity.

(2) When the objective is given by Fmono, the combined complexity analyses of these problems become simpler, when LQ is CQ, UCQ or ∃FO+.

THEOREM 8.2. When λ = 0, for FMS and FMM, the combined complexity bounds of QRD, DRP and RDC remain the same as their counterparts given in Theorems 5.1, 6.1 and 7.1, respectively. In contrast, when LQ is CQ, UCQ, ∃FO+ or FO, the data complexity bounds of these problems are

— in PTIME for QRD(LQ, FMS) and QRD(LQ, FMM),
— in PTIME for DRP(LQ, FMS) and DRP(LQ, FMM), and
— #P-complete for RDC(LQ, FMS) under polynomial Turing reductions, but in FP for RDC(LQ, FMM).

For Fmono, the combined complexity becomes

— NP-complete for QRD(LQ, Fmono) when LQ is CQ, UCQ or ∃FO+, and PSPACE-complete when LQ is FO;
— coNP-complete for DRP(LQ, Fmono) when LQ is CQ, UCQ or ∃FO+, and PSPACE-complete for FO; and
— #·NP-complete for RDC(LQ, Fmono) when LQ is CQ, UCQ or ∃FO+, and #·PSPACE-complete for FO.

The data complexity bounds of these problems remain the same as their counterparts given in Theorems 5.4, 6.4, 7.4 and 7.5, respectively, when LQ is CQ, UCQ, ∃FO+ or FO. □

When λ = 1. In contrast to Theorem 8.2, we show below that dropping the relevance function δrel does not simplify the analyses. Indeed, when the objective function is defined with only the diversity function δdis, both the combined complexity and data complexity of QRD, DRP and RDC remain the same as their counterparts when both relevance and diversity are taken into account. This further verifies that the diversity requirement δdis dominates the complexity of these problems. These results, however, need new proofs and are not corollaries of the previous results.

THEOREM 8.3. When λ = 1, the combined complexity of Theorems 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2 and the data complexity of Theorems 5.4, 6.4, 7.4 and 7.5 remain unchanged for QRD, DRP and RDC, respectively. □

When k is a predefined constant. Finally, we study the impact of the cardinality |U| of selected sets U of query answers on the analyses of query result diversification. When |U| is fixed to be a predefined constant k, the result below tells us the following.

(1) When Q is also fixed, QRD, DRP and RDC are all tractable. That is, fixing the size of U simplifies their data complexity analyses.

(2) In contrast, fixing k does not simplify the combined complexity analyses of these problems. Indeed, all the combined complexity bounds of these problems remain the same as their counterparts when k is not required to be a constant.

COROLLARY 8.4. For a predefined constant k,

— the combined complexity bounds given in Theorems 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2 are unchanged for QRD, DRP and RDC, respectively; and

— the data complexity is in
  — PTIME for QRD,


  — PTIME for DRP, and
  — FP for RDC,

no matter whether for FMS, FMM or Fmono, and for CQ, UCQ, ∃FO+ or FO.

Summary. From the results that we have got so far, we can see the following.

(1) The impact of LQ. When the objective function is given by FMS or FMM, QRD, DRP and RDC have higher combined complexity for FO than for CQ, UCQ and ∃FO+ (Theorems 5.1, 6.1 and 7.1). In contrast, the complexity bounds remain intact for CQ or the class of identity queries (Corollary 8.1). When considering Fmono, the combined complexity bounds of these problems remain unchanged for all query languages CQ, UCQ, ∃FO+ and FO (Theorems 5.2, 6.2 and 7.2). In contrast, for the class of identity queries, these problems become simpler (Corollary 8.1). Note that the query languages have no impact on the data complexity of these problems (Theorems 5.4, 6.4, 7.4, 7.5 and Corollary 8.1).

(2) The impact of δrel and δdis. The complexity of diversification also arises from the diversity requirement. The absence of the distance function δdis simplifies (a) the data complexity analyses of QRD and DRP for FMS or FMM, (b) the data complexity of RDC for FMM, and (c) the combined complexity analyses of all these problems for Fmono (Theorem 8.2). In contrast, the absence of δrel has no impact on the complexity (Theorem 8.3).

(3) The impact of k. When k is a fixed constant, the data complexity analyses of QRD, DRP and RDC become tractable, no matter whether the objective function is given by FMS, FMM or Fmono (Corollary 8.4).

9. INCORPORATING COMPATIBILITY CONSTRAINTS

In this section, we study the impact of compatibility constraints on the analyses of query result diversification. We first introduce a class of compatibility constraints, and extend the diversification model of Section 3 by incorporating these constraints. In the presence of such constraints, we then re-investigate QRD, DRP and RDC in all the settings of the previous sections (Sections 5, 6, 7 and 8).

Compatibility constraints. We first define a class of compatibility constraints. Consider a database D, a query Q in a language LQ, and a predefined constant m ≥ 2. We define a class Cm of compatibility constraints on subsets U ⊆ Q(D). Let RQ denote the schema of query results Q(D). A constraint ϕ in Cm is of the form:

$$\forall t_1, \ldots, t_l : R_Q \big( \chi(t_1, \ldots, t_l) \rightarrow \exists s_1, \ldots, s_h : R_Q\ \xi(t_1, \ldots, t_l, s_1, \ldots, s_h) \big).$$

Here (1) l and h are in the range [0, m], (2) ti and sj are tuple variables denoting a tuple of RQ, and (3) χ and ξ are conjunctions of predicates of the form (a) ρ[A] = ϱ[B] or ρ[A] ≠ ϱ[B], or (b) ρ[A] = c or ρ[A] ≠ c, where A and B are attributes in RQ, ρ and ϱ range over tuples ti and sj for i ∈ [0, l] and j ∈ [0, h], and c is a constant.

We say that a set U ⊆ Q(D) satisfies ϕ, denoted by U |= ϕ, if for all tuples t1, . . . , tl in U that satisfy the predicates in χ following the standard semantics of first-order logic, there must exist tuples s1, . . . , sh in U such that all the predicates in ξ are also satisfied. We say that U satisfies a set Σ of constraints in Cm, denoted by U |= Σ, if for each ϕ ∈ Σ, U |= ϕ.

Class Cm suffices to express compatibility constraints commonly found in practice, to specify what items should be picked together when we select top-k tuples, and what items have conflict with each other, as illustrated by the following example.

Example 9.1. Consider a query Q1 posed on database D1 to find items for shopping. The selected items are specified by a relation schema RQ1 with attributes item, price,


etc. (see Example 1.1). The compatibility constraint ρ1 below is defined on subsets of results Q1(D1). It assures that if one buys items a and b, then she also needs to buy c. That is, in the set U of top-k items recommended, if a and b are in U, then so is c.

$$\rho_1 = \forall t_1, t_2 : R_{Q_1} \big( (t_1[\mathit{item}] = a \wedge t_2[\mathit{item}] = b) \rightarrow \exists s : R_{Q_1}\, (s[\mathit{item}] = c) \big).$$

As another example, consider a query Q2 posed on a database D2 for course selection [Koutrika et al. 2009; Parameswaran et al. 2010]. The schema of query result Q2(D2) is denoted by RQ2, including attributes id and title, among other things. The compatibility constraint ρ2 below is defined on instances of RQ2, i.e., sets U ⊆ Q2(D2). It asserts that if course CS450 is taken, then so must be its prerequisites CS220 and CS350.

$$\rho_2 = \forall t_1 : R_{Q_2} \big( t_1[\mathit{id}] = \mathrm{CS450} \rightarrow \exists s_1, s_2 : R_{Q_2}\, (s_1[\mathit{id}] = \mathrm{CS220} \wedge s_2[\mathit{id}] = \mathrm{CS350}) \big).$$

Now consider a query Q3 posed on a database D3 for basketball team formation [Lappas et al. 2009]. The schema of Q3(D3) is RQ3, including attributes id, position, etc. We use the following constraint ρ3 to assure that at most two centers are needed for the team, i.e., no more than two centers may be included in any top-k set U ⊆ Q3(D3).

$$\rho_3 = \forall t_1, t_2, t_3 : R_{Q_3} \big( t_1[\mathit{position}] = \mathrm{center} \wedge t_2[\mathit{position}] = \mathrm{center} \wedge t_1[\mathit{id}] \neq t_2[\mathit{id}] \wedge t_1[\mathit{id}] \neq t_3[\mathit{id}] \wedge t_2[\mathit{id}] \neq t_3[\mathit{id}] \rightarrow t_3[\mathit{position}] \neq \mathrm{center} \big).$$

Observe that compatibility constraints ρ1, ρ2 and ρ3 may not be expressible in the query languages LQ for Q1, Q2 and Q3, when, e.g., LQ is CQ. □

One can see that constraints of Cm have a form similar to tuple generating dependencies (TGDs) that have been well studied for databases (see, e.g., [Abiteboul et al. 1995] for TGDs), except that the number of tuples in each constraint of Cm is bounded by a predefined constant m, as is the case for our familiar functional dependencies and inclusion dependencies, which are bounded by constant 2 and can be expressed in Cm when m ≥ 2. Moreover, one can readily verify that constraints of Cm can be validated in PTIME, i.e., for any set Σ ⊆ Cm and any set U ⊆ Q(D), it takes at most PTIME in |U| and |Σ| to determine whether U |= Σ, because the number of tuples in each ϕ ∈ Σ is bounded by m.
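To make the PTIME validation concrete, here is a minimal Python sketch of checking U |= ϕ for a single constraint of Cm. It assumes a hypothetical encoding in which tuples are dictionaries keyed by attribute name and the conjunctions χ and ξ are passed as Python predicates `chi` and `xi`; this encoding is an illustration, not part of the formal model. The check enumerates at most |U|^l · |U|^h combinations, which is polynomial in |U| because l and h are bounded by the constant m.

```python
from itertools import product
from typing import Callable, Sequence

Row = dict  # one tuple of R_Q, keyed by attribute name (assumed encoding)


def satisfies(
    U: Sequence[Row],
    l: int,
    h: int,
    chi: Callable[[tuple], bool],
    xi: Callable[[tuple, tuple], bool],
) -> bool:
    """Check U |= forall t_1..t_l (chi -> exists s_1..s_h xi) by enumeration."""
    for ts in product(U, repeat=l):
        if not chi(ts):
            continue
        # product(U, repeat=0) yields one empty tuple, so h = 0 also works.
        if not any(xi(ts, ss) for ss in product(U, repeat=h)):
            return False
    return True


# The prerequisite constraint of Example 9.1: CS450 requires CS220 and CS350.
def chi_prereq(ts):
    return ts[0]["id"] == "CS450"


def xi_prereq(ts, ss):
    return ss[0]["id"] == "CS220" and ss[1]["id"] == "CS350"


with_prereqs = [{"id": "CS450"}, {"id": "CS220"}, {"id": "CS350"}]
missing_one = [{"id": "CS450"}, {"id": "CS220"}]
print(satisfies(with_prereqs, 1, 2, chi_prereq, xi_prereq))
print(satisfies(missing_one, 1, 2, chi_prereq, xi_prereq))
```

Checking a whole set Σ amounts to calling `satisfies` once per constraint, so the total cost stays polynomial in |U| and |Σ|.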

Query result diversification revisited. We are now ready to revise query result diversification by incorporating constraints of Cm. Given a query Q in a query language LQ, a database D, a positive integer k, an objective function F, and in addition, a set Σ of compatibility constraints in Cm defined on subsets of Q(D), query result diversification in the presence of compatibility constraints aims to find a set U ⊆ Q(D) such that (a) |U| = k, (b) F(U) is maximum, and moreover, (c) U |= Σ. Compared to diversification in the absence of compatibility constraints (see Section 3), the top-k set U selected is additionally required to satisfy all the constraints in Σ.

In the presence of Σ, we revise the following notions introduced in Section 4.

(1) Given Q, D, k and Σ, we say that a set U ⊆ Q(D) is a candidate set for (Q, D, Σ, k) if |U| = k and U |= Σ.

(2) Given a real number B and an objective function F, we say that a set U ⊆ Q(D) is valid for (Q, D, Σ, k, F, B) if |U| = k, U |= Σ and F(U) ≥ B.

(3) We say that rank(U) = r for a positive integer r if there exists a collection $\mathcal{S}$ of r − 1 distinct candidate sets for (Q, D, Σ, k) such that (a) for all $S \in \mathcal{S}$, F(S) > F(U); and (b) for any candidate set S′ for (Q, D, Σ, k), if $S' \notin \mathcal{S}$, then F(U) ≥ F(S′).
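Read operationally, the definition says that rank(U) is one more than the number of candidate sets that score strictly higher than U. A brute-force Python sketch of this reading is given below. It assumes that Q(D) is given as a collection of distinct hashable tuples, that F is a callable on frozensets, and that Σ is represented by a hypothetical predicate `satisfies_sigma`; all of these names are illustrative. The enumeration is exponential in k and serves only to make the definition concrete.

```python
from itertools import combinations


def rank(U, answers, k, F, satisfies_sigma):
    """rank(U) = 1 + number of candidate sets S with F(S) > F(U)."""
    U = frozenset(U)
    if len(U) != k or not U <= set(answers) or not satisfies_sigma(U):
        raise ValueError("U must be a candidate set")
    score = F(U)
    better = 0
    for S in combinations(set(answers), k):
        S = frozenset(S)
        if satisfies_sigma(S) and F(S) > score:
            better += 1
    return better + 1


answers = [1, 2, 3, 4]


def no_conflict(S):
    # Toy constraint: items 3 and 4 must not be selected together.
    return not {3, 4} <= S


print(rank({1, 4}, answers, 2, sum, lambda S: True))
print(rank({1, 4}, answers, 2, sum, no_conflict))
```

In the toy run, adding the constraint removes a higher-scoring set from the candidates, so the rank of the same U improves.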

Based on these, we revise the statements of problems QRD, DRP and RDC as follows.

(1) Problem QRD(LQ, F) is to decide, given D, Q ∈ LQ, F, B and in addition, a set Σ of compatibility constraints in Cm, whether there exists a valid set for (Q, D, Σ, k, F, B).


(2) Problem DRP(LQ, F) is to decide, given D, Q ∈ LQ, F, B, Σ, and a candidate set U for (Q, D, Σ, k), whether rank(U) ≤ r, where r is a positive integer constant.

(3) Problem RDC(LQ, F) is to count the number of valid sets for (Q, D, Σ, k, F, B).
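To make the three problem statements concrete, the following sketch solves each of them by exhaustive enumeration. It is not one of the algorithms behind the complexity bounds: it assumes Q(D) is already available as `answers`, treats Σ as a hypothetical predicate, and takes time exponential in k. The function names and the toy instance are illustrative assumptions.

```python
from itertools import combinations

def valid_sets(answers, k, satisfies, F, B):
    """Yield every valid set for (Q, D, Sigma, k, F, B), given Q(D) as `answers`."""
    for c in combinations(answers, k):
        U = frozenset(c)
        if satisfies(U) and F(U) >= B:
            yield U

def qrd(answers, k, satisfies, F, B):
    """QRD: does at least one valid set exist?"""
    return any(True for _ in valid_sets(answers, k, satisfies, F, B))

def rdc(answers, k, satisfies, F, B):
    """RDC: how many valid sets are there?"""
    return sum(1 for _ in valid_sets(answers, k, satisfies, F, B))

def drp(U, answers, k, satisfies, F, r):
    """DRP: is rank(U) <= r, i.e. do fewer than r candidate sets beat U?"""
    f_u = F(frozenset(U))
    better = sum(1 for c in combinations(answers, k)
                 if satisfies(frozenset(c)) and F(frozenset(c)) > f_u)
    return better + 1 <= r

# Hypothetical instance: answers are integers, F is their sum, and the
# constraint forbids selecting two consecutive integers.
answers = [1, 2, 3, 4, 5]
F = sum
no_neighbours = lambda U: all(abs(a - b) != 1 for a, b in combinations(U, 2))
```

With k = 2 and B = 7, the valid sets are {2, 5} and {3, 5}, so QRD answers "yes" and RDC returns 2; the set {2, 5} has rank 2 because only {3, 5} scores higher.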

We next investigate these problems in the presence of compatibility constraints. We first establish their combined complexity, and then provide their data complexity. Finally, we study these problems in the special cases identified in Section 8.

Combined complexity. The good news is that compatibility constraints do not complicate the combined complexity analyses of the result diversification problems. Indeed, constraints of Cm can be validated in PTIME, and hence all the upper bounds given in Sections 5, 6 and 7 for combined complexity remain intact.

COROLLARY 9.2. In the presence of compatibility constraints of Cm, the combined complexity bounds of Theorems 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2 remain unchanged for QRD, DRP and RDC. □

Data complexity. In the presence of compatibility constraints, we study the data complexity of these problems, i.e., when query Q and compatibility constraints Σ are predefined and fixed, while database D may vary. We show that the presence of Σ makes the problems harder, to an extent.

(1) When the objective is given by the mono-objective formulation, QRD and DRP become NP-complete and coNP-complete, respectively, as opposed to their tractability in the absence of compatibility constraints (Theorems 5.4 and 6.4), and RDC becomes #P-complete under parsimonious reductions, rather than under polynomial Turing reductions (Theorem 7.5). That is, although compatibility constraints of Cm can be validated in PTIME, they impose additional requirements on the selection of top-k sets based on Fmono and hence complicate the analyses of these problems when query Q is fixed. These data complexity results hold for each of CQ, UCQ, ∃FO+ and FO.

(2) In contrast, when the objective is for max-sum or max-min diversification, the data complexity results of these problems remain the same as their counterparts in the absence of compatibility constraints.

THEOREM 9.3. In the presence of compatibility constraints of Cm, the data complexity bounds of Theorems 5.4, 6.4 and 7.4 remain unchanged for QRD, DRP and RDC, respectively, for FMS and FMM. However, for Fmono,

— QRD becomes NP-complete;
— DRP becomes coNP-complete; and
— RDC becomes #P-complete under parsimonious reductions. □
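A toy instance helps to see why constraints change the picture for Fmono. The sketch assumes that Fmono(U) decomposes into a sum of per-tuple scores that do not depend on the other members of U, so that without constraints the k highest-scoring tuples are optimal. The scores, the pairwise conflicts (assumed to be expressible in Cm) and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical per-tuple scores and pairwise incompatibilities.
scores = {"t1": 5.0, "t2": 4.0, "t3": 3.0, "t4": 1.0}
conflicts = {frozenset({"t1", "t2"}), frozenset({"t1", "t3"})}

def f_mono(U):
    """Objective assumed to be a plain sum of per-tuple scores."""
    return sum(scores[t] for t in U)

def satisfies(U):
    """No two selected tuples may form a conflicting pair."""
    return not any(frozenset(p) in conflicts for p in combinations(U, 2))

def top_k_unconstrained(k):
    """Optimal without constraints: the k tuples with the highest scores."""
    return frozenset(sorted(scores, key=scores.get, reverse=True)[:k])

def best_constrained(k):
    """Exhaustive search over candidate sets (exponential in k)."""
    cands = [frozenset(c) for c in combinations(scores, k)
             if satisfies(frozenset(c))]
    return max(cands, key=f_mono) if cands else None
```

Here the score-ordered pair {t1, t2} violates a constraint, and the best candidate set {t2, t3} does not even contain the top-scoring tuple t1, so sorting by score no longer suffices.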

Special cases. We next investigate the special cases of Section 8 in the presence of compatibility constraints of Cm. We show that compatibility constraints make the analyses of query result diversification more complicated: all the tractable cases we have seen in Section 8 except one (when the bound k is a constant) become intractable, although the compatibility constraints of Cm are simple enough to be validated in PTIME.

Identity queries. We first consider the case when LQ consists of identity queries (see Section 8 for the details of identity queries). In this setting, compatibility constraints complicate the combined and data complexity analyses of query result diversification when F is Fmono. Indeed, QRD(LQ, Fmono), DRP(LQ, Fmono) and RDC(LQ, Fmono) become NP-complete, coNP-complete and #P-complete under parsimonious reductions, respectively, for both their combined complexity and data complexity, as opposed to PTIME, PTIME and #P-complete under polynomial Turing reductions, respectively, in


the absence of such constraints (Corollary 8.1). This is because for an identity query Q and a database D, while Q(D) = D and Fmono(U) for U ⊆ Q(D) is computable in PTIME, the additional requirements imposed by compatibility constraints make checking and counting valid sets more intricate, similar to the complication introduced by the constraints to the data complexity analyses of these problems (Theorem 9.3). In contrast, when F is FMS or FMM, even the data complexity analyses of these problems are already intractable in the absence of compatibility constraints, and the compatibility constraints do not increase their complexity bounds.

COROLLARY 9.4. For identity queries, in the presence of compatibility constraints of Cm, both the combined complexity and data complexity of Corollary 8.1 remain unchanged for QRD, DRP, and RDC for FMS and FMM.

However, when it comes to Fmono,

— QRD becomes NP-complete;
— DRP becomes coNP-complete; and
— RDC becomes #P-complete under parsimonious reductions,

for both combined and data complexity. □

When λ = 0. We next study the impact of the compatibility constraints on the complexity of query result diversification analyses when λ = 0, i.e., when the objective function F is defined in terms of the relevance function δrel only. The results below tell us the following. The presence of compatibility constraints has no impact on the combined complexity of QRD(LQ, F), DRP(LQ, F) and RDC(LQ, F), but the constraints do make the data complexity analyses harder.

COROLLARY 9.5. For λ = 0, in the presence of compatibility constraints of Cm, the combined complexity bounds given in Theorem 8.2 remain unchanged for QRD, DRP and RDC, while the data complexity becomes

— NP-complete for QRD;
— coNP-complete for DRP; and
— #P-complete for RDC under parsimonious reductions,

for each of FMS, FMM and Fmono, and for each of CQ, UCQ, ∃FO+ and FO. □

When λ = 1. Similarly, when F is defined in terms of the distance function δdis only, compatibility constraints complicate the data complexity analyses of QRD, DRP and RDC for Fmono. For FMS and FMM, these problems are already NP-complete, coNP-complete and #P-complete under parsimonious reductions, respectively, in the absence of compatibility constraints (Theorem 8.3), and these data complexity bounds remain intact in the presence of the constraints.

COROLLARY 9.6. For λ = 1, in the presence of compatibility constraints of Cm, the combined complexity bounds given in Theorem 8.3 remain unchanged for QRD, DRP and RDC. The data complexity bounds of Theorem 8.3 remain unchanged for FMS and FMM. In contrast, for Fmono and for CQ, UCQ, ∃FO+ and FO,

— QRD is NP-complete;
— DRP is coNP-complete; and
— RDC is #P-complete under parsimonious reductions. □

When k is a predefined constant. In contrast, compatibility constraints do not complicate the analyses of query result diversification when the bound k is a constant.


Table I. Combined complexity and data complexity ((∗): known for the lower bound)

Objective functions | Languages | QRD(LQ, F) (Th. 5.1, 5.2, 5.4) | DRP(LQ, F) (Th. 6.1, 6.2, 6.4) | RDC(LQ, F) (Th. 7.1, 7.2, 7.4, 7.5)

Combined complexity
FMS and FMM | CQ, UCQ, ∃FO+ | NP-complete(∗) | coNP-complete | #·NP-complete
FMS and FMM | FO | PSPACE-complete | PSPACE-complete | #·PSPACE-complete
Fmono | CQ, UCQ, ∃FO+, FO | PSPACE-complete | PSPACE-complete | #·PSPACE-complete

Data complexity
FMS and FMM | CQ, UCQ, ∃FO+, FO | NP-complete(∗) | coNP-complete | #P-complete (parsimonious)
Fmono | CQ, UCQ, ∃FO+, FO | PTIME | PTIME | #P-complete (Turing)

COROLLARY 9.7. For a predefined constant k, in the presence of compatibility constraints of Cm, the combined complexity and data complexity of Corollary 8.4 remain unchanged for QRD, DRP and RDC, respectively. □

Summary. The results of this section show the impact of compatibility constraints of Cm on the complexity of query result diversification as follows.

(1) Although the compatibility constraints of Cm can be validated in PTIME, their presence complicates the data complexity analyses of QRD(LQ, F), DRP(LQ, F) and RDC(LQ, F), to an extent. More specifically, when these problems are tractable in the absence of the constraints, they become intractable when the constraints are present. The impact is particularly evident when F is Fmono (Theorem 9.3), or when λ = 0 and F is either FMS or FMM (Corollary 9.5).

(2) When it comes to the combined complexity, the presence of compatibility constraints makes QRD(LQ, F), DRP(LQ, F) and RDC(LQ, F) harder when F is Fmono and when LQ consists of identity queries only (Corollary 9.4). The constraints have no impact on the combined complexity analyses when F is FMS or FMM.

(3) When the bound k on the cardinality |U| of selected sets U is a constant, the complexity results are quite robust: both the combined complexity and data complexity of QRD(LQ, F), DRP(LQ, F) and RDC(LQ, F) remain intact no matter whether compatibility constraints of Cm are present or absent (Corollary 9.7).

10. CONCLUSIONS

We have extended the result diversification model of [Gollapudi and Sharma 2009] by incorporating queries Q, without assuming the entire set Q(D) of query answers as input. We have identified three decision and counting problems in connection with query result diversification, namely, QRD(LQ, F), DRP(LQ, F) and RDC(LQ, F). We have established the upper and lower bounds of these problems, all matching, for both combined complexity and data complexity, when the query language LQ is CQ, UCQ, ∃FO+ or FO, and when F ranges over all three objective functions FMS, FMM and Fmono given in [Gollapudi and Sharma 2009]. We have also studied special cases of these problems, and identified tractable cases. In addition, we have investigated the impact of compatibility constraints on the analyses of query result diversification.

The main complexity results are summarized in Table I, annotated with their corresponding theorems. The complexity bounds of special cases are shown in Table II, followed by Table III for the complexity results in the presence of compatibility constraints that differ from their counterparts in the absence of constraints. The tables


Table II. Special cases

Conditions | Complexity | QRD(LQ, F) | DRP(LQ, F) | RDC(LQ, F)
Identity queries; F is Fmono (Cor. 8.1) | Combined | PTIME | PTIME | #P-complete (Turing)
λ = 0; F is FMS (Th. 8.2) | Data | PTIME | PTIME | #P-complete (Turing)
λ = 0; F is FMM (Th. 8.2) | Data | PTIME | PTIME | FP
λ = 0; CQ, UCQ, ∃FO+; F is Fmono (Th. 8.2) | Combined | NP-complete | coNP-complete | #·NP-complete
k is a constant; F is FMS, FMM or Fmono (Cor. 8.4) | Data | PTIME | PTIME | FP

Table III. Complexity results in the presence of compatibility constraints

Conditions | Complexity | QRD(LQ, F) | DRP(LQ, F) | RDC(LQ, F)
F is Fmono (Th. 9.3) | Data | NP-complete | coNP-complete | #P-complete (parsimonious)
Identity queries; F is Fmono (Cor. 9.4) | Combined / Data | NP-complete | coNP-complete | #P-complete (parsimonious)
λ = 0; F is FMS, FMM, Fmono (Cor. 9.5) | Data | NP-complete | coNP-complete | #P-complete (parsimonious)
λ = 1; F is Fmono (Cor. 9.6) | Data | NP-complete | coNP-complete | #P-complete (parsimonious)

tell us the impact of various factors on the complexity of diversification analyses, such as query languages LQ, objective functions F, relevance and distance functions δrel and δdis, bound k on the number of answers, and compatibility constraints. As annotated in Table I, among all these results, only the NP lower bound of QRD(LQ, F) was known prior to this work, when F is FMS or FMM, for CQ, UCQ and ∃FO+.

Several extensions are targeted for future work. First, diversification analyses are mostly intractable. We need to identify more special cases that are practical and tractable. Second, we need to develop heuristic algorithms (approximation whenever possible) for those intractable cases. Third, the study should be extended to other objective functions. Fourth, we have only considered a simple class Cm of compatibility constraints that can be validated in PTIME. More expressive constraint languages should be developed if the need for such languages emerges from practice. Finally, in practice one may want to incorporate user preferences [Chen and Li 2007; Stefanidis et al. 2010] into the diversification model. While we may encode certain preferences in, e.g., the relevance and distance functions, this issue deserves a full treatment.

REFERENCES

ABITEBOUL, S., HULL, R., AND VIANU, V. 1995. Foundations of Databases. Addison-Wesley.

ADOMAVICIUS, G. AND TUZHILIN, A. 2005. Towards the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. and Data Eng. 17, 6, 734–749.

AGRAWAL, R., GOLLAPUDI, S., HALVERSON, A., AND LEONG, S. 2009. Diversifying search results. In Proc. Int. Conf. Web Search and Web Data Mining.

AMER-YAHIA, S. 2011. Recommendation projects at Yahoo! IEEE Data Eng. Bull. 34, 2, 69–77.

AMER-YAHIA, S., BONCHI, F., CASTILLO, C., FEUERSTEIN, E., MENDEZ-DIAZ, I., AND ZABALA, P. 2013. Complexity and algorithms for composite retrieval. In Proc. Int. Conf. on World Wide Web.


BERBEGLIA, G. AND HAHN, G. 2010. Counting feasible solutions of the traveling salesman problem with pickups and deliveries is #P-complete. Discrete Applied Mathematics 157, 11, 2541–2547.

BORODIN, A., LEE, H. C., AND YE, Y. 2012. Max-sum diversification, monotone submodular functions and dynamic updates. In Proc. ACM SIGACT-SIGMOD Symp. on Principles of Database Systems.

CAPANNINI, G., NARDINI, F. M., PEREGO, R., AND SILVESTRI, F. 2011. Efficient diversification of Web search results. In Proc. Int. Conf. on Very Large Data Bases.

CHEN, Z. AND LI, T. 2007. Addressing diverse user preferences in SQL-query-result navigation. In Proc. ACM SIGMOD Int. Conf. on Management of Data.

DEMIDOVA, E., FANKHAUSER, P., ZHOU, X., AND NEJDL, W. 2010. DivQ: Diversification for keyword search over structured databases. In Proc. Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval.

DENG, T., FAN, W., AND GEERTS, F. 2012. On the complexity of package recommendation problems. In Proc. ACM SIGACT-SIGMOD Symp. on Principles of Database Systems.

DROSOU, M. AND PITOURA, E. 2009. Diversity over continuous data. IEEE Data Eng. Bull. 32, 4.

DROSOU, M. AND PITOURA, E. 2010. Search result diversification. SIGMOD Record 39, 1, 41–47.

DURAND, A., HERMANN, M., AND KOLAITIS, P. G. 2005. Subtractive reductions and complete problems for counting complexity classes. Theor. Comp. Sci. 340, 3, 496–513.

FAGIN, R., LOTEM, A., AND NAOR, M. 2003. Optimal aggregation algorithms for middleware. J. Comp. and System Sci. 66, 4, 614–656.

FEUERSTEIN, E., HEIBER, P. A., MARTINEZ-VIADEMONTE, J., AND BAEZA-YATES, R. A. 2007. New stochastic algorithms for scheduling ads in sponsored search. In Proc. Latin American Web Congress.

FRATERNALI, P., MARTINENGHI, D., AND TAGLIASACCHI, M. 2012. Top-k bounded diversification. In Proc. ACM SIGMOD Int. Conf. on Management of Data.

GOLLAPUDI, S. AND SHARMA, A. 2009. An axiomatic approach for result diversification. In Proc. Int. Conf. on World Wide Web.

HEMASPAANDRA, L. A. AND VOLLMER, H. 1995. The satanic notations: Counting classes beyond #P and other definitional adventures. SIGACT News 26, 1, 2–13.

ILYAS, I. F., BESKALES, G., AND SOLIMAN, M. A. 2008. A survey of top-k query processing techniques in relational database systems. ACM Comput. Surv. 40, 4, 11:1–11:58.

JIN, W. AND PATEL, J. M. 2011. Efficient and generic evaluation of ranked queries. In Proc. ACM SIGMOD Int. Conf. on Management of Data.

KOUTRIKA, G., BERCOVITZ, B., AND GARCIA-MOLINA, H. 2009. FlexRecs: expressing and combining flexible recommendations. In Proc. ACM SIGMOD Int. Conf. on Management of Data.

LADNER, R. E. 1989. Polynomial space counting problems. SIAM J. Comput. 18, 6, 1087–1097.

LAPPAS, T., LIU, K., AND TERZI, E. 2009. Finding a team of experts in social networks. In Proc. Int. Conf. on Knowledge Discovery and Data Mining.

LI, C., SOLIMAN, M. A., CHANG, K. C.-C., AND ILYAS, I. F. 2005. RankSQL: Supporting ranking queries in relational database management systems. In Proc. Int. Conf. on Very Large Data Bases.

LIU, Z., SUN, P., AND CHEN, Y. 2009. Structured search result differentiation. In Proc. Int. Conf. on Very Large Data Bases.

MINACK, E., DEMARTINI, G., AND NEJDL, W. 2009. Current approaches to search result diversification. In Proc. Int. Workshop on Living Web.

PAPADIMITRIOU, C. H. 1994. Computational Complexity. Addison-Wesley.

PARAMESWARAN, A. G., GARCIA-MOLINA, H., AND ULLMAN, J. D. 2010. Evaluating, combining and generalizing recommendations with prerequisites. In Proc. Int. Conf. on Information and Knowledge Management.

PARAMESWARAN, A. G., VENETIS, P., AND GARCIA-MOLINA, H. 2011. Recommendation systems with complex constraints: A course recommendation perspective. ACM Trans. Information Syst. 29, 4.

PROKOPYEV, O. A., KONG, N., AND MARTINEZ-TORRES, D. L. 2009. The equitable dispersion problem. European Journal of Operational Research 197, 1, 59–67.

SCHNAITTER, K. AND POLYZOTIS, N. 2008. Evaluating rank joins with optimal cost. In Proc. ACM SIGACT-SIGMOD Symp. on Principles of Database Systems.

STEFANIDIS, K., DROSOU, M., AND PITOURA, E. 2010. Perk: Personalized keyword search in relational databases through preferences. In Proc. Int. Conf. on Extending Database Technology.

VALIANT, L. 1979. The complexity of computing the permanent. Theoretical Computer Science 8, 2, 189–201.

VARDI, M. Y. 1982. The complexity of relational query languages. In Proc. Annual ACM Symp. on Theory of Computing.


VEE, E., SRIVASTAVA, U., SHANMUGASUNDARAM, J., BHAT, P., AND YAHIA, S. A. 2008. Efficient computation of diverse query results. In Proc. Int. Conf. on Data Engineering.

VIEIRA, M. R., RAZENTE, H. L., BARIONI, M. C. N., HADJIELEFTHERIOU, M., SRIVASTAVA, D., JR., C. T., AND TSOTRAS, V. J. 2011. On query result diversification. In Proc. Int. Conf. on Data Engineering.

XIE, M., LAKSHMANAN, L. V. S., AND WOOD, P. T. 2012. Composite recommendations: From items to packages. Frontiers of Computer Science 6, 3, 264–277.

YU, C., LAKSHMANAN, L., AND AMER-YAHIA, S. 2009a. It takes variety to make a world: Diversification in recommender systems. In Proc. Int. Conf. on Extending Database Technology. 368–378.

YU, C., LAKSHMANAN, L. V., AND AMER-YAHIA, S. 2009b. Recommendation diversification using explanations. In Proc. Int. Conf. on Data Engineering.

ZHANG, M. AND HURLEY, N. 2008. Avoiding monotony: Improving the diversity of recommendation lists. In Proc. ACM Conf. on Recommender Systems.

ZIEGLER, C.-N., MCNEE, S. M., KONSTAN, J. A., AND LAUSEN, G. 2005. Improving recommendation lists through topic diversification. In Proc. Int. Conf. on World Wide Web.


Online Appendix to: On the Complexity of Query Result Diversification

TING DENG, RCBD and SKLSDE, Beihang University

WENFEI FAN, Informatics, University of Edinburgh, and RCBD and SKLSDE, Beihang University

A. PROOFS OF SECTION 8

COROLLARY 8.1. For identity queries, the combined complexity and data complexity of QRD, DRP and RDC coincide. More specifically,

— QRD(LQ, FMS) and QRD(LQ, FMM) are NP-complete,
— DRP(LQ, FMS) and DRP(LQ, FMM) are coNP-complete, and
— RDC(LQ, FMS) and RDC(LQ, FMM) are #P-complete under parsimonious reductions,

for both combined complexity and data complexity, while

— QRD(LQ, Fmono) is in PTIME,
— DRP(LQ, Fmono) is in PTIME, and
— RDC(LQ, Fmono) is #P-complete under polynomial Turing reductions,

for both combined complexity and data complexity, which are the same as their data complexity given in Theorems 5.4, 6.4 and 7.5, respectively. □

PROOF. For identity queries, we first study the combined and data complexity of QRD, DRP and RDC for FMS or FMM. We then investigate these problems for Fmono.

(1) When F is FMS or FMM. The data complexity proofs of Theorems 5.4, 6.4 and 7.4 for QRD(LQ, F), DRP(LQ, F) and RDC(LQ, F) when F is FMS or FMM use a fixed identity query as Q. Hence the lower bounds hold here. Moreover, for the upper bound, Theorems 5.1 and 6.1 tell us that QRD(LQ, F) and DRP(LQ, F) are in NP and coNP, respectively, for CQ. Since CQ subsumes identity queries, QRD(LQ, F) and DRP(LQ, F) are in NP and coNP, respectively, for identity queries. Furthermore, the proof of Theorem 7.1 shows that if Q(D) is PTIME computable, such as when Q is an identity query, the problem of verifying whether a given set is valid for (Q, D, k, F, B) is in PTIME. Hence, RDC(LQ, F) is in #·P (i.e., #P) for identity queries.

(2) When F is Fmono. Consider the PTIME algorithms given in the proofs of Theorems 5.4 and 6.4 for the data complexity analyses of QRD(FO, Fmono) and DRP(FO, Fmono), respectively. Note that Q(D) and Fmono are PTIME computable when Q is an identity query. Thus these PTIME algorithms also work here, and their combined complexity and data complexity coincide to be in PTIME for identity queries.

We next consider RDC(LQ, Fmono). It suffices to show that RDC(LQ, Fmono) is #P-hard for fixed identity queries, and that it is in #P for identity queries that are not necessarily fixed. Observe the following. (a) The lower bound proof of Theorem 7.5 for RDC(CQ, Fmono) uses a fixed identity query as Q. Thus the lower bound holds here. (b) It is in PTIME to check whether a given set is valid for (Q, D, k, Fmono, B) as Q(D) and Fmono are PTIME computable for identity queries. Thus RDC(LQ, Fmono) is in #P.

This completes the proof of Corollary 8.1. □



THEOREM 8.2. When λ = 0, for FMS and FMM, the combined complexity bounds of QRD, DRP and RDC remain the same as their counterparts given in Theorems 5.1, 6.1 and 7.1, respectively. In contrast, when LQ is CQ, UCQ, ∃FO+ or FO, the data complexity bounds of these problems are

— in PTIME for QRD(LQ, FMS) and QRD(LQ, FMM),
— in PTIME for DRP(LQ, FMS) and DRP(LQ, FMM), and
— #P-complete for RDC(LQ, FMS) under polynomial Turing reductions, but in FP for RDC(LQ, FMM).

For Fmono, the combined complexity becomes

— NP-complete for QRD(LQ, Fmono) when LQ is CQ, UCQ or ∃FO+, and PSPACE-complete when LQ is FO;
— coNP-complete for DRP(LQ, Fmono) when LQ is CQ, UCQ or ∃FO+, and PSPACE-complete for FO; and
— #·NP-complete for RDC(LQ, Fmono) when LQ is CQ, UCQ or ∃FO+, and #·PSPACE-complete for FO.

The data complexity bounds of these problems remain the same as their counterparts given in Theorems 5.4, 6.4, 7.4 and 7.5, respectively, when LQ is CQ, UCQ, ∃FO+ or FO. □

PROOF. When λ = 0, we first study QRD(LQ, F), DRP(LQ, F) and RDC(LQ, F) for FMS and FMM. We then investigate them when F is Fmono.

(1) When F is FMS or FMM. We start with the combined complexity analyses.

(1.1) Combined complexity. In these settings, we first prove that QRD(LQ, F), DRP(LQ, F) and RDC(LQ, F) are NP-complete, coNP-complete and #·NP-complete for CQ, UCQ and ∃FO+, respectively. We then show that they are PSPACE-complete, PSPACE-complete and #·PSPACE-complete for FO, respectively.

(1.1.1) When LQ is CQ, UCQ or ∃FO+.

(A) QRD(LQ, F). It suffices to show that QRD(LQ, FMS) and QRD(LQ, FMM) are NP-hard for CQ, and that they are in NP for ∃FO+.

Lower bound. We show that QRD(CQ, FMS) and QRD(CQ, FMM) are NP-hard by reductions from the 3SAT problem, even when λ = 0 and k is a constant.

We first consider QRD(CQ, FMS). Given an instance ϕ of 3SAT over variables {x1, . . . , xm}, we define a database D, a CQ query Q, a real number B, a positive integer k and two functions δrel and δdis (for FMS), such that ϕ is satisfiable if and only if there exists a valid set U for (Q, D, k, FMS, B). In particular, we take k = 2 and B = 1. That is, U consists of two tuples only and FMS(U) must be no less than 1.

(1) The database D is specified by a single relation schema R01, with its corresponding instance I01 = {(1), (0)}, encoding the Boolean domain.

(2) We define the query Q in CQ as follows:

$Q(\vec{x}) = R_{01}(x_1) \wedge \ldots \wedge R_{01}(x_m).$

Here $\vec{x} = (x_1, \ldots, x_m)$ and Q generates all truth assignments of X variables. Let RQ denote the schema of the query result Q(D).

(3) For each tuple t of RQ, we define δrel(t, Q) = 1 if the truth assignment µX encoded by tuple t makes ϕ true, and let δrel(t, Q) = 0 otherwise. Furthermore, we take δdis as a constant function that returns 0 for each pair of tuples t and t′ of RQ. We set λ = 0. Then for each set U of k tuples of RQ, $F_{MS}(U) = (k-1)\cdot\sum_{t\in U}\delta_{rel}(t, Q)$ (see Section 3). That is, FMS is defined in terms of δrel alone (and hence we use k = 2).


We show that ϕ is satisfiable if and only if there is a valid set U for (Q, D, k, FMS, B).

⇒ First assume that ϕ is satisfiable. Then there exists a truth assignment $\mu_X^0$ of X variables satisfying ϕ. Let U = {t0, t}, where tuple t0 encodes $\mu_X^0$ and t is an arbitrary tuple in Q(D) that is distinct from t0. Obviously, U ⊆ Q(D), |U| = 2, and moreover, FMS(U) ≥ 1 = B since δrel(t0, Q) = 1. That is, U is a valid set for (Q, D, k, FMS, B).

⇐ Conversely, assume that ϕ is not satisfiable. Then for any tuple t of RQ, δrel(t, Q) = 0 by the definition of δrel. Thus for each set U of two tuples t and t′ of RQ, FMS(U) = 0 < B by the definition of FMS. Hence there exists no valid set for (Q, D, k, FMS, B).
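The reduction can be checked on small formulas with a brute-force sketch. The encoding of clauses as lists of signed integers, and the names `holds`, `reduce_3sat_to_qrd` and `has_valid_set`, are assumptions made for illustration. Unlike the reduction in the proof, which only writes down D and Q, the sketch materializes Q(D) explicitly and is therefore exponential in m.

```python
from itertools import combinations, product

def holds(clauses, assignment):
    """Evaluate a CNF formula; literal i stands for x_i and -i for its negation."""
    return all(any((assignment[abs(lit) - 1] == 1) == (lit > 0) for lit in clause)
               for clause in clauses)

def reduce_3sat_to_qrd(clauses, m):
    """Build (Q(D), F_MS, k, B) as in the proof, with lambda = 0."""
    answers = list(product((0, 1), repeat=m))          # all truth assignments
    delta_rel = lambda t: 1 if holds(clauses, t) else 0
    f_ms = lambda U: (len(U) - 1) * sum(delta_rel(t) for t in U)
    return answers, f_ms, 2, 1                          # k = 2, B = 1

def has_valid_set(answers, F, k, B):
    """Exhaustively test whether some k-element subset reaches the bound B."""
    return any(F(U) >= B for U in combinations(answers, k))

# A satisfiable instance and an unsatisfiable one over three variables
# (the latter contains all eight sign patterns, so every assignment fails).
sat = [[1, 2, 3], [-1, 2, 3]]
unsat = [[s1 * 1, s2 * 2, s3 * 3] for s1, s2, s3 in product((1, -1), repeat=3)]
```

On these inputs, a valid set exists for `sat` and none exists for `unsat`, matching the two directions of the argument.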

For QRD(CQ, FMM), given an instance ϕ of 3SAT, we construct the same D, Q, δrel, δdis as their counterparts given above, and let k = 1 and B = 1. Furthermore, for each set U of tuples of RQ, we set λ = 0 and hence have that FMM(U) = min_{t∈U} δrel(t, Q). Then along the same lines as above, one can readily verify that ϕ is satisfiable if and only if there exists a valid set U for (Q, D, k, FMM, B).

Upper bound. The algorithms given in the proof of Theorem 5.1 remain intact in the special case when λ = 0. Thus QRD(∃FO+, FMS) and QRD(∃FO+, FMM) are in NP.

(B) DRP(LQ, F). It suffices to show that DRP(LQ, FMS) and DRP(LQ, FMM) are coNP-hard for CQ and that they are in coNP for ∃FO+.

Lower bound. We verify that DRP(CQ, FMS) and DRP(CQ, FMM) are coNP-hard, when λ = 0 and k is a constant, by reductions from the complement of 3SAT, which is known to be coNP-complete (cf. [Papadimitriou 1994]).

We first study DRP(CQ, FMS). Given an instance ϕ = C1 ∧ . . . ∧ Cl of 3SAT over variables X, we define a database D, a CQ query Q, functions δrel, δdis and FMS, a set U ⊆ Q(D), and a positive integer k. We prove that rank(U) ≤ r if and only if ϕ is not satisfiable. In particular, we set k = 2 and r = 1. That is, the set U consists of two tuples only and has the highest rank.

Before giving the reduction, we first define $\varphi' = (\varphi \vee z) \wedge \bar{z} = \bigwedge_{i=1}^{l}(C_i \vee z) \wedge \bar{z}$, where z is a fresh variable that is not in the set X of variables in ϕ. As discussed in the proof of Theorem 6.1 for DRP(CQ, FMS), for a truth assignment µX of X variables, µX satisfies ϕ if and only if µX makes ϕ′ true with z = 0, and moreover, when setting z to be 1, ϕ′ is false under any truth assignment of X variables.

We next give the reduction as follows.

(1) The database consists of four relations I01, I∨, I∧ and I¬ as shown in Fig. 5, specified by schemas R01(X), R∨(B, A1, A2), R∧(B, A1, A2) and R¬(A, Ā), respectively. Here I01 encodes the Boolean domain, and I∨, I∧ and I¬ encode disjunction, conjunction and negation, respectively, such that ϕ and ϕ′ can be expressed in CQ with these relations.

(2) We define the CQ query Q as follows:

$Q(b, c) = \exists \vec{x}\, \exists z\, \big((Q_X(\vec{x}) \wedge Q_{\varphi'}(\vec{x}, z, b)) \wedge R_{01}(c)\big).$

Here $\vec{x} = (x_1, \ldots, x_m)$. Query QX(x⃗) generates all truth assignments of X variables, by means of Cartesian products of R01. The sub-query Qϕ′(x⃗, z, b) encodes the truth value of ϕ′ (i.e., b), for a given truth assignment µX represented by x⃗ and truth assignment µz for z, such that b = 1 if (µX, µz) satisfies ϕ′, and b = 0 otherwise. Obviously, Qϕ′(x⃗, z, b) can be expressed in CQ in terms of R∨, R∧, and R¬. Observe that given D, Q(D) is a subset of {(1, 1), (1, 0), (0, 1), (0, 0)}. Let U = {(0, 1), (0, 0)}. As remarked above, ϕ′ is false under the truth assignments when z = 1; hence we have that U ⊆ Q(D).

(3) We define δrel((1, 0), Q) = δrel((1, 1), Q) = 2 and δrel((0, 1), Q) = δrel((0, 0), Q) = 1. Furthermore, we use a constant function δdis that returns 0 for each pair of tuples t and s of


RQ. We set λ = 0. Then for each set S of k tuples of RQ, $F_{MS}(S) = (k-1)\cdot\sum_{t\in S}\delta_{rel}(t, Q)$ (see Section 3; recall k = 2). Thus FMS({(1, 1), (1, 0)}) = 4, FMS({(0, 1), (0, 0)}) = 2, and FMS({t, s}) = 3 if t ∈ {(1, 1), (1, 0)} and s ∈ {(0, 0), (0, 1)}.

We next show that ϕ is not satisfiable if and only if rank(U) ≤ r.

⇒ Assume that ϕ is not satisfiable. Then there exists no truth assignment µX of X variables that satisfies ϕ. Thus, tuples (1, 1) and (1, 0) cannot be in the answer Q(D) to query Q in D. Therefore, rank(U) = 1 ≤ r, by the definition of FMS.

⇐ Conversely, assume that ϕ is satisfiable. Then there exists a truth assignment $\mu_X^0$ of X variables that satisfies ϕ. Thus (1, 1) and (1, 0) must be in Q(D), by the definition of query Q. Hence rank(U) > 1 = r, by the definition of FMS.
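The rank argument can likewise be replayed on small inputs. In the sketch below, the query is evaluated directly in Python rather than through the relations of the reduction, clauses are encoded as lists of signed integers, and the names `holds`, `query_answers` and `rank` are hypothetical. The sketch assumes ϕ′ = (ϕ ∨ z) ∧ ¬z, which is the reading under which ϕ′ agrees with ϕ when z = 0 and is false when z = 1.

```python
from itertools import combinations, product

def holds(clauses, assignment):
    """Evaluate a CNF formula; literal i stands for x_i and -i for its negation."""
    return all(any((assignment[abs(lit) - 1] == 1) == (lit > 0) for lit in clause)
               for clause in clauses)

def query_answers(clauses, m):
    """Q(D): all pairs (b, c), where b is the value of phi' = (phi or z) and not z."""
    result = set()
    for mu in product((0, 1), repeat=m):
        for z in (0, 1):
            b = 1 if (holds(clauses, mu) or z == 1) and z == 0 else 0
            for c in (0, 1):
                result.add((b, c))
    return result

# Relevance values fixed by the reduction; delta_dis is constantly 0 and lambda = 0.
DELTA_REL = {(1, 0): 2, (1, 1): 2, (0, 1): 1, (0, 0): 1}

def f_ms(S):
    return (len(S) - 1) * sum(DELTA_REL[t] for t in S)

def rank(U, answers, k):
    """1 + number of k-element subsets of Q(D) scoring strictly above U."""
    f_u = f_ms(U)
    return 1 + sum(1 for S in combinations(sorted(answers), k) if f_ms(S) > f_u)

U = ((0, 1), (0, 0))
sat = [[1, 2, 3], [-1, 2, 3]]
unsat = [[s1 * 1, s2 * 2, s3 * 3] for s1, s2, s3 in product((1, -1), repeat=3)]
```

For `unsat`, Q(D) contains only (0, 0) and (0, 1), so U has rank 1; for `sat`, all four pairs appear and five sets score above U.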

We next show that DRP(CQ, FMM) is coNP-hard, also by reduction from the com-plement of 3SAT. Given an instance ϕ of 3SAT, we construct the same ϕ′, D, Q,δrel, and δdis as their counterparts for FMS given above, and let U = {(0, 1)} andk = r = 1. Furthermore, we set λ = 0. Then for each set S of tuples of RQ, FMM(S)= mint∈S δrel(t, Q). Then along the same lines as the argument for FMS given above,one can easily verify that rank(U) ≤ r if and only if ϕ is not satisfiable.

Upper bound. The algorithms given in the proof of Theorem 6.1 obviously work in the special case when λ = 0. Thus DRP(∃FO+, FMS) and DRP(∃FO+, FMM) are in coNP.

(C) RDC(LQ, F). Recall the lower bounds of RDC(LQ, F) given in Theorem 7.1 for FMS and FMM when LQ ranges over CQ, UCQ or ∃FO+. Those bounds are established by taking λ = 0 and hence hold here. For the upper bounds, the algorithms given there obviously remain intact in the special case when λ = 0.

(1.1.2) When LQ is FO. The lower bounds of QRD(FO, F), DRP(FO, F) and RDC(FO, F) given in Theorems 5.1, 6.1 and 7.1 for FMS and FMM are established by taking λ = 0. As a result, those lower bounds hold here. For the upper bounds, the algorithms given there obviously remain intact in the special case when λ = 0.

(1.2) Data complexity. It suffices to show that QRD(FO, F) and DRP(FO, F) are in PTIME when F is FMS or FMM, RDC(LQ, FMS) is #P-complete under polynomial Turing reductions for CQ, UCQ, ∃FO+ and FO, and RDC(FO, FMM) is in FP.

(1.2.1) QRD(FO, FMS). We develop a PTIME algorithm for QRD(FO, FMS) when λ = 0. Recall that $F_{MS}(U) = (k-1)\cdot\sum_{t\in U}\delta_{rel}(t, Q)$ for each set U of k tuples of schema RQ in this setting. Hence we develop an algorithm that works as follows:

1. compute Q(D) and sort the tuples in Q(D) in descending order based on δrel;
2. check whether |Q(D)| ≥ k; if so, continue; otherwise, return “no”;
3. let U be the set consisting of the first k tuples in the sorted Q(D); check whether FMS(U) ≥ B; if so, return “yes”; otherwise, return “no”.

It is easy to verify that the algorithm is correct. Moreover, the algorithm is in PTIME. Indeed, steps 1 and 2 are in PTIME since it is in PTIME to compute Q(D) for a fixed FO query Q, and step 3 is in PTIME because FMS is PTIME computable in this setting.
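The three steps can be sketched as follows. This is an illustrative sketch only: it assumes Q(D) has already been computed and is passed in as `answers`, and the names `qrd_ms_lambda0` and `delta_rel` are hypothetical.

```python
def qrd_ms_lambda0(answers, delta_rel, k, B):
    """Decide QRD for F_MS at lambda = 0, given the precomputed answer set Q(D).

    answers   -- the tuples of Q(D)
    delta_rel -- callable mapping a tuple to its relevance score
    """
    # Step 2: a valid set needs k tuples.
    if len(answers) < k:
        return False
    # Steps 1 and 3: the k most relevant tuples maximise F_MS when lambda = 0.
    top_k = sorted(answers, key=delta_rel, reverse=True)[:k]
    return (k - 1) * sum(delta_rel(t) for t in top_k) >= B
```

The greedy choice is safe here because, at λ = 0, the objective is a positive multiple of the summed relevance, so no other k-element set can score higher than the top k tuples.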

(1.2.2) QRD(FO, FMM). When λ = 0, $F_{MM}(U) = \min_{t\in U}\delta_{rel}(t, Q)$, and it is PTIME computable. It is easy to see that the PTIME algorithm for QRD(FO, FMS) given above also works here. Hence QRD(FO, FMM) is also in PTIME.

(1.2.3) DRP(FO, FMS) and DRP(FO, FMM). When λ = 0, for each k-element set U of tuples of schema RQ, $F_{MS}(U) = (k-1)\cdot\sum_{t\in U}\delta_{rel}(t, Q)$, and $F_{MM}(U) = \min_{t\in U}\delta_{rel}(t, Q)$. Recall the PTIME algorithm given in the proof of Theorem 6.4 for DRP(FO, Fmono). Obviously,


if we define v(t) = δrel(t, Q) for each tuple t ∈ Q(D) and use FMS instead of Fmono in the algorithm, the algorithm can be used for DRP(FO, FMS), and is still in PTIME. Similarly, when using function v given above and FMM instead of Fmono, the algorithm also works for DRP(FO, FMM), and is also in PTIME.

(1.2.4) RDC(FO, FMS). It suffices to show that RDC(CQ, FMS) is #P-hard under polynomial Turing reductions and that RDC(FO, FMS) is in #P.

Lower bound. We show that RDC(CQ, FMS) is #P-hard by polynomial Turing reduction from #SSPk, which is #P-complete by Lemma 7.6. Along the same line as the proof of Theorem 7.5 for RDC(CQ, Fmono), we construct a polynomial Turing reduction as follows. Given an instance W, π, l and d of #SSPk, we define the same transformation from #SSPk to RDC(CQ, FMS) as given in the proof of Theorem 7.5 for RDC(CQ, Fmono), except the following: (a) when λ = 0, for each set U consisting of k tuples of RQ, $F_{MS}(U) = (k-1)\cdot\sum_{t\in U}\delta_{rel}(t, Q)$; and (b) we let k = l and B = (l − 1) · d. It is easy to see that the number of subsets T of W with |T| = l and $\sum_{w\in T}\pi(w) \ge d$ equals the number of valid sets U for (Q, D, k, FMS, B). As discussed there, we can find the solution to #SSPk, i.e., the number of subsets T of W with |T| = l and $\sum_{w\in T}\pi(w) = d$, by calling the oracle COUNTRDC twice, to compute the numbers X and Y of valid sets for (Q, D, k, FMS, B) and (Q, D, k, FMS, B + 1), respectively; the solution to #SSPk is simply X − Y. Here COUNTRDC(Q, D, k, FMS, B) is the oracle that, given Q, D, k, FMS and B, returns the number of valid sets U for (Q, D, k, FMS, B).
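The two-call pattern can be illustrated with a brute-force stand-in for the oracle. This is a sketch under stated assumptions, not the construction of Theorem 7.5: weights are integers, l ≥ 2, each element of W is represented directly by its weight π(w), and `count_rdc` (a hypothetical name) enumerates subsets exhaustively rather than running in polynomial time.

```python
from itertools import combinations

def count_rdc(weights, k, B):
    """Stand-in for COUNTRDC: number of k-subsets U with F_MS(U) >= B at lambda = 0.

    Each position in `weights` is one tuple whose relevance is its weight.
    """
    return sum(1 for U in combinations(weights, k) if (k - 1) * sum(U) >= B)

def count_ssp_k(weights, l, d):
    """Number of l-subsets whose weights sum to exactly d, via two oracle calls."""
    B = (l - 1) * d
    X = count_rdc(weights, l, B)      # subsets with sum >= d
    Y = count_rdc(weights, l, B + 1)  # subsets with sum >= d + 1
    return X - Y
```

The second call counts sums of at least d + 1 because, with integer weights, F_MS takes only multiples of l − 1, so no value lies strictly between (l − 1)·d and (l − 1)·(d + 1).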

Upper bound. To see that RDC(FO, FMS) is in #P, we only need to show that it is in PTIME to verify whether a given set U is valid for (Q, D, k, FMS, B). Indeed, Q(D) is PTIME computable since Q is fixed, and moreover, FMS is also PTIME computable.

(1.2.5) RDC(FO, FMM). When λ = 0, we show that RDC(FO, FMM) is in FP for fixed queries Q, by giving an FP algorithm. Given Q, D, k, FMM, and B, the algorithm returns the number of valid sets for (Q, D, k, FMM, B). Recall that $F_{MM}(U) = \min_{t\in U}\delta_{rel}(t, Q)$ when λ = 0 (see Section 3). The algorithm works as follows:

1. compute Q(D) and sort tuples in Q(D) in descending order based on their δrel values; let $Q(D) = \{t_1, \ldots, t_{|Q(D)|}\}$, where $\delta_{rel}(t_i, Q) \ge \delta_{rel}(t_j, Q)$ when i ≤ j;
2. check whether $\delta_{rel}(t_i, Q) < B$ for all i ∈ [1, |Q(D)|]; if so, return 0; otherwise continue;
3. let $t_i$ be the tuple such that $\delta_{rel}(t_i, Q) \ge B$ and $\delta_{rel}(t_{i+1}, Q) < B$ (taking i = |Q(D)| if no tuple falls below B); check whether i ≥ k; if so, return $\binom{i}{k}$, represented in binary; otherwise return 0.

Here step 1 is in PTIME since Q(D) is PTIME computable when Q is fixed. Step 2 is obviously in PTIME when λ = 0. Moreover, step 3 is in PTIME since the output is represented in binary. Thus the algorithm is in PTIME.
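A compact sketch of this counting procedure is given below. It assumes Q(D) is already available as `answers`; the function name is hypothetical. Sorting is unnecessary in code, since only the number i of tuples with relevance at least B matters.

```python
from math import comb

def rdc_mm_lambda0(answers, delta_rel, k, B):
    """Count k-subsets U of Q(D) with F_MM(U) = min relevance >= B (lambda = 0)."""
    # i is the number of tuples that individually meet the bound B.
    i = sum(1 for t in answers if delta_rel(t) >= B)
    # The minimum over U is >= B exactly when every member of U is,
    # so the valid sets are the k-subsets of those i tuples.
    return comb(i, k)  # comb returns 0 when k > i
```

Python integers are unbounded, which mirrors the requirement that the binomial coefficient be returned in binary.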

(2) When F is Fmono. We first study the combined complexity of QRD(LQ, Fmono), DRP(LQ, Fmono) and RDC(LQ, Fmono). We then consider their data complexity.

(2.1) Combined complexity. Note that when λ = 0, $F_{mono}(U) = \sum_{t\in U}\delta_{rel}(t, Q)$, and $F_{MS}(U) = (k-1)\cdot\sum_{t\in U}\delta_{rel}(t, Q)$ for each set U of k tuples of RQ. As a result, when λ = 0 and k = 2, Fmono and FMS are the same function. Recall that in the lower bound proofs of Theorem 8.2 (for QRD(LQ, FMS) and DRP(LQ, FMS)) and Theorem 7.1 (for RDC(LQ, FMS)), we set λ = 0 and k = 2. Thus the lower bounds given there hold here for Fmono. Moreover, the algorithms given in those upper bound proofs carry over to the special case for λ = 0. Hence the upper bounds hold here.
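Spelled out, the coincidence of the two functions is a one-line substitution of k = 2 into the λ = 0 forms recalled above:

```latex
\[
  F_{MS}(U) \;=\; (k-1)\cdot\sum_{t\in U}\delta_{rel}(t,Q)
  \;\overset{k=2}{=}\; \sum_{t\in U}\delta_{rel}(t,Q)
  \;=\; F_{mono}(U).
\]
```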

(2.2) Data complexity. We show that when λ = 0, the data complexity bounds of QRD(LQ, Fmono), DRP(LQ, Fmono) and RDC(LQ, Fmono) remain the same as their counterparts given in Theorems 5.4, 6.4 and 7.5, respectively. Obviously, the PTIME


algorithms given for Theorems 5.4 and 6.4 carry over to the special case when λ = 0. For RDC(LQ, Fmono), recall that its lower bound proof for Theorem 7.5 uses λ = 0. Hence the lower bound remains intact here. Moreover, the algorithm given there obviously also works in the special case when λ = 0.

This completes the proof of Theorem 8.2. □

THEOREM 8.3. When λ = 1, the combined complexity of Theorems 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2 and the data complexity of Theorems 5.4, 6.4, 7.4 and 7.5 remain unchanged for QRD, DRP and RDC, respectively. □

PROOF. When λ = 1, we first study QRD, DRP and RDC for FMS and FMM. We then investigate these problems when F is Fmono.

(1) When F is FMS or FMM. We first study the combined complexity, and then investigate the data complexity of these problems in these settings.

(1.1) Combined complexity. We first consider CQ, UCQ and ∃FO+, and then FO.

(1.1.1) When LQ is CQ, UCQ or ∃FO+. The lower bounds of QRD(CQ, F) and DRP(CQ, F) given in Theorems 5.1 and 6.1 for FMS and FMM are established by taking λ = 1. As a result, these lower bounds hold here. For the upper bounds, the algorithms given there obviously remain intact in the special case when λ = 1.

We next only need to show that RDC(LQ, FMS) and RDC(LQ, FMM) are #·NP-complete for CQ, UCQ and ∃FO+, when λ = 1.

Lower bound. We first show that RDC(CQ, FMS) is #·NP-hard by parsimonious reduction from #Σ1SAT (see Section 7.1). Given an instance ϕ(X, Y) = ∃X(C1 ∧ . . . ∧ Cl) of #Σ1SAT, we use the same reduction given in the proof of Theorem 7.1 for RDC(CQ, FMS), except the following: (i) we define δdis((tY, 0, 1), (1, . . . , 1, 0)) = 1, and for any other pair of tuples t and s, we define δdis(t, s) = 0; (ii) we set λ = 1 and k = 2; hence for each set U of tuples of RQ, $F_{MS}(U) = \sum_{t,s\in U}\delta_{dis}(t, s)$; and (iii) we set B = 1. Then one can verify that the number of valid sets for (Q, D, k, FMS, B) is equal to the number of truth assignments of Y that satisfy ϕ. This follows from the fact that for each set U ⊆ Q(D) such that |U| = k = 2, FMS(U) ≥ B = 1 if and only if U = {(tY, 0, 1), (1, . . . , 1, 0)}, where the truth assignment encoded by tY satisfies ϕ.

We next show that RDC(CQ, FMM) is #·NP-hard, also by parsimonious reduction from #Σ1SAT. Given an instance ϕ(X, Y) of #Σ1SAT, we use the same reduction as given above for RDC(CQ, FMS) except that when λ = 1, $F_{MM}(U) = \min_{t,s\in U,\, t\neq s}\delta_{dis}(t, s)$ for each set U consisting of k tuples of RQ. Then along the same line as the proof given above, one can show that the number of valid sets for (Q, D, k, FMM, B) equals the number of truth assignments of Y that satisfy ϕ.

Upper bound. It is easy to see that when λ = 1, it is still in NP to verify whether a given set U is valid for (Q, D, k, F, B) for ∃FO+, when F is FMS or FMM, following the proof of Theorem 7.1. Hence RDC(∃FO+, F) is in #·NP in this case.

(1.1.2) When LQ is FO. We show that when λ = 1, QRD(FO, F) and DRP(FO, F) are PSPACE-complete, and RDC(FO, F) is #·PSPACE-complete, for FMS and FMM.

(A) Lower bound. We first prove the lower bounds.

(a) QRD(FO, FMS). We show that when λ = 1, QRD(FO, FMS) is PSPACE-hard by reduction from the membership problem for FO. Given an instance (Q, D, s) of the membership problem, we use the same reduction as the one given in Theorem 5.1 for QRD(FO, FMS), except the following: (i) we define δdis((s, 0), (s, 1)) = 1, and for any other


two tuples t and t′ of RQ, δdis(t, t′) = 0; (ii) we set λ = 1, and hence, for each set U of tuples of RQ, $F_{MS}(U) = \sum_{t,t'\in U}\delta_{dis}(t, t')$; and (iii) we set B = 1. Recall that k = 2.

We next show that s ∈ Q(D) if and only if there exists a valid set for (Q′, D′, k, FMS, B). First assume that s ∈ Q(D). Then there exists a set U = {(s, 1), (s, 0)} valid for (Q′, D′, k, FMS, B) by the definition of δdis. Conversely, if s ∉ Q(D), then tuples (s, 1) and (s, 0) are not in Q′(D′). Thus for each pair of tuples t and t′ in Q′(D′), FMS({t, t′}) = 0 < B.

(b) QRD(FO, FMM). We show that QRD(FO, FMM) is PSPACE-hard, also by reduction from the membership problem for FO queries. Given an instance (Q, D, s) of the membership problem, we use the same reduction given above for QRD(FO, FMS) except that when setting λ = 1, for each set U of tuples of RQ, $F_{MM}(U) = \min_{t,t'\in U,\, t\neq t'}\delta_{dis}(t, t')$. Then along the same line as above, one can readily verify that s ∈ Q(D) if and only if there exists a valid set U for (Q′, D′, k, FMM, B).

(c) DRP(FO, FMS). We show that DRP(FO, FMS) is PSPACE-hard by reduction from the complement of the membership problem for FO. Given an instance (Q, D, s) of the membership problem, we use the same reduction given in the proof of Theorem 6.1 for DRP(FO, FMS), except the following: (i) we define δdis((s, 1, 1), (s, 1, 0)) = 1, δdis((s, 0, 1), (s, 0, 0)) = 2, and for any other two tuples t and t′ of RQ, we let δdis(t, t′) = 0; and (ii) we set λ = 1. Recall that k = 2, r = 1 and U = {(s, 1, 1), (s, 1, 0)}. Hence for each set S of tuples of RQ, we have that $F_{MS}(S) = \sum_{t,t'\in S}\delta_{dis}(t, t')$. It is easy to see that s ∉ Q(D) if and only if rank(U) = 1 ≤ r.

(d) DRP(FO, FMM). We show that DRP(FO, FMM) is PSPACE-hard, also by reduction from the complement of the membership problem for FO. Given an instance (Q, D, s) of the membership problem for FO, we use the same reduction given above for DRP(FO, FMS), except that when λ = 1, $F_{MM}(S) = \min_{t,t'\in S,\, t\neq t'}\delta_{dis}(t, t')$ for each set S of tuples of RQ′. Then one can verify that s ∉ Q(D) if and only if rank(U) = 1 ≤ r.

(e) RDC(FO, FMS). We show that when λ = 1, RDC(FO, FMS) is #·PSPACE-hard by parsimonious reduction from #QBF (see Section 7.1). Given an instance ϕ = ∃X ∀y1 P2y2 · · · Pnyn ψ of #QBF, we use the same reduction given in the proof of Theorem 7.1 for RDC(FO, FMS), except the following: (i) we define δdis((tX, 0, 1), (1, . . . , 1, 0)) = 1, where tX is a truth assignment of the X variables that satisfies ϕ, and for any other pair of tuples t and t′, we define δdis(t, t′) = 0; (ii) we set λ = 1 and k = 2, and thus for each set U of tuples of RQ, $F_{MS}(U) = \sum_{t,t'\in U}\delta_{dis}(t, t')$; and moreover, (iii) we set B = 1. Then along the same line as the proof of Theorem 7.1 for RDC(FO, FMS), one can verify that the number of valid sets for (Q, D, k, FMS, B) equals the number of truth assignments of X that satisfy ϕ.

(f) RDC(FO, FMM). We show that RDC(FO, FMM) is #·PSPACE-hard, also by parsimonious reduction from #QBF. Given an instance ϕ = ∃X ∀y1 P2y2 · · · Pnyn ψ of #QBF, we use the same reduction given above for RDC(FO, FMS), except the following: (i) when setting λ = 1, $F_{MM}(U) = \min_{t,s\in U,\, t\neq s}\delta_{dis}(t, s)$ for each set U of tuples of RQ; and (ii) we set B = 1. Then one can readily verify that the number of valid sets for (Q, D, k, FMM, B) equals the number of truth assignments of X that satisfy ϕ.

(B) Upper bound. We show that when λ = 1, QRD(FO, F), DRP(FO, F) and RDC(FO, F) are in PSPACE, PSPACE and #·PSPACE, respectively, when F is FMS or FMM. Obviously, the algorithms given in the proofs of Theorems 5.1 and 6.1 for QRD(FO, F) and DRP(FO, F), respectively, work here, where F is FMS or FMM. Thus QRD(FO, F) and DRP(FO, F) are both in PSPACE. Moreover, it is easy to see that when λ = 1, it is still in PSPACE to verify whether a set U is a valid set for (Q, D, k, F, B) for FO,


when F is FMS or FMM. Hence RDC(FO, F) is in #·PSPACE for max-sum or max-min diversification, when λ = 1.

(1.2) Data complexity. The lower bounds of QRD(LQ, F), DRP(LQ, F) and RDC(LQ, F) for fixed queries hold here, as their proofs for Theorems 5.4, 6.4, 7.4 and 7.5, respectively, use λ = 1. The algorithms given there for the upper bounds also work here.

(2) When F is Fmono. For the combined complexity, the lower bound proofs of Theorems 5.2, 6.2 and 7.2 for QRD(LQ, Fmono), DRP(LQ, Fmono) and RDC(LQ, Fmono), respectively, are verified by using λ = 1. Hence, those lower bounds hold here. Moreover, all the upper bounds given there carry over to the special case when λ = 1.

For the data complexity, the PTIME upper bounds of Theorems 5.4 and 6.4 for QRD(LQ, Fmono) and DRP(LQ, Fmono), respectively, obviously carry over to their special case when λ = 1. We next show that when λ = 1, the data complexity of RDC(LQ, Fmono) is #P-complete for CQ, UCQ, ∃FO+ and FO, under polynomial Turing reductions. It suffices to show that RDC(CQ, Fmono) is #P-hard under polynomial Turing reductions, and that RDC(FO, Fmono) is in #P, for fixed queries.

(2.1) Lower bound. We show that when λ = 1, RDC(CQ, Fmono) is #P-hard by polynomial Turing reduction from #SSPk, which is shown to be #P-complete by Lemma 7.6. Along the same line as the proof of Theorem 7.5, we construct a polynomial Turing reduction as follows. Given an instance W, π, l and d of #SSPk, we define Q, D, δrel, δdis, Fmono, k and B, such that the number of subsets T of W with |T| = l and $\sum_{w\in T}\pi(w) \ge d$ equals the number of valid sets U for (Q, D, k, Fmono, B). As discussed there, we can find the solution to #SSPk, i.e., the number of subsets T of W with |T| = l and $\sum_{w\in T}\pi(w) = d$, by calling the oracle COUNTRDC twice, to compute the numbers X and Y of valid sets for (Q, D, k, Fmono, B) and (Q, D, k, Fmono, B + 1), respectively. Here COUNTRDC(Q, D, k, Fmono, B) is the oracle that, given Q, D, k, Fmono and B, returns the number of valid sets U for (Q, D, k, Fmono, B).

We next give the transformation from #SSPk to RDC(CQ, Fmono), for λ = 1.

(1) For each w ∈ W, let w′ be a distinct element not in W. The database D consists of a single relation IW = {(w), (w′) | w ∈ W}, specified by schema RW(W).

(2) We define query Q as the identity query on RW instances. Then |Q(D)| = 2|W |.

(3) We define δrel as a constant function that returns 1 for each tuple of RQ. Moreover, we define δdis((w), (w′)) = π(w), and for any other pair of tuples t and t′ of RQ, we define δdis(t, t′) = 0. We set λ = 1, and thus for each set U of tuples of RQ, $F_{mono}(U) = \frac{1}{2|W|-1}\cdot\sum_{t\in U,\, s\in Q(D)}\delta_{dis}(t, s)$.

(4) Finally, we set k = 2l and B = d/(2|W | − 1).

We only need to show that the number of subsets T of W with |T| = l and $\sum_{w\in T}\pi(w) \ge d$ is equal to the number of valid sets U for (Q, D, k, Fmono, B). Recall that k = 2l and B = d/(2|W| − 1). Assume that there exists a set U ⊆ Q(D) with |U| = k and Fmono(U) ≥ B. Then by the definition of δdis, $F_{mono}(U) = \frac{1}{2|W|-1}\cdot\sum_{(w)\in U,\,(w')\in U}\delta_{dis}((w), (w')) = \frac{1}{2|W|-1}\cdot\sum_{(w)\in U}\pi(w) \ge B$. Then for the set T = {w | (w) ∈ U}, we have that $\sum_{w\in T}\pi(w) \ge d$. Conversely, for a subset T of W with |T| = l and $\sum_{w\in T}\pi(w) \ge d$, the set U = {(w), (w′) | w ∈ T} is valid for (Q, D, k, Fmono, B).

(2.2) Upper bound. When λ = 1, RDC(FO, Fmono) is in #P, since it is in PTIME to check whether a set U is valid for (Q, D, k, Fmono, B) for fixed FO queries.

This completes the proof of Theorem 8.3. □


COROLLARY 8.4. For a predefined constant k,

— the combined complexity bounds given in Theorems 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2 are unchanged for QRD, DRP and RDC, respectively; and

— the data complexity is in
  — PTIME for QRD,
  — PTIME for DRP, and
  — FP for RDC,

no matter whether for FMS, FMM or Fmono, and for CQ, UCQ, ∃FO+ or FO.

PROOF. We study QRD, DRP and RDC when k is a predefined constant, i.e., we consider only candidate sets U with a constant size. We first establish their combined complexity, and then investigate their data complexity.

(1) Combined complexity. We first show the lower bounds, followed by upper bounds.

(1.1) Lower bound. Observe that the lower bound proofs of QRD(LQ, F), DRP(LQ, F) and RDC(LQ, F) given for Theorems 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2 are established by using k = 2 when F is FMS, and by setting k = 1 when F is FMM or Fmono, except those for QRD(LQ, F) and DRP(LQ, F) when F is FMS or FMM for CQ, UCQ and ∃FO+. Hence these lower bounds hold here. Moreover, QRD(LQ, F) and DRP(LQ, F) are also shown to be NP-hard and coNP-hard, respectively, in Theorem 8.2 for CQ, UCQ and ∃FO+, again by using k = 2 when F is FMS and k = 1 when F is FMM. Hence these lower bounds hold here as well.

(1.2) Upper bound. The upper bounds of QRD(LQ, F), DRP(LQ, F) and RDC(LQ, F) given for Theorems 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2 obviously remain intact in the special case when k is a constant.

(2) Data complexity. We show that QRD(FO, F) and DRP(FO, F) are in PTIME, and RDC(FO, F) is in FP for fixed FO queries, when F is FMS, FMM or Fmono.

(2.1) QRD(FO, F). We show that QRD(FO, F) is in PTIME by giving a PTIME algorithm. Given Q, D, δrel, δdis, F, k and B, the algorithm checks whether there exists a valid set U for (Q, D, k, F, B). It works as follows:

1. compute Q(D);
2. enumerate all subsets U of Q(D) such that |U| = k; denote by S the collection of all such sets U;
3. check whether there exists a set U ∈ S such that F(U) ≥ B; if so, return “yes”; otherwise, return “no”.

We show the algorithm is in PTIME when F is FMS, FMM or Fmono. Indeed, Q(D) is PTIME computable since Q is fixed. Moreover, there are only polynomially many sets U in S that need to be checked in step 3 since k is a constant. Obviously, FMS(U) and FMM(U) are PTIME computable; Fmono is also PTIME computable since it is in PTIME to compute Q(D). Thus step 3 can be done in PTIME. Hence QRD(FO, F) is in PTIME for fixed queries and constant k, when F is FMS, FMM or Fmono.

(2.2) DRP(FO, F). We show that DRP(FO, F) is in PTIME by giving a PTIME algorithm. Given Q, D, δrel, δdis, F, U, k and r, the algorithm works as follows:

1. compute Q(D), in PTIME;
2. enumerate all subsets V of Q(D) such that |V| = k; denote by S the collection of all such sets V; sort all sets in S in descending order based on their F values;
3. check whether rank(U) ≤ r; if so, return “yes”; otherwise return “no”.


To see that the algorithm is in PTIME, note that it is in PTIME to compute Q(D) since Q is fixed, and that there are only polynomially many sets in S. Moreover, F is PTIME computable for FMS, FMM and Fmono when Q is fixed. Thus steps 2 and 3 are also in PTIME. In particular, step 3 can be easily done by counting the number of sets V in the sorted collection S that have F(V) ≥ F(U). Hence the algorithm is in PTIME.

(2.3) RDC(FO, F). We show that RDC(FO, F) is in FP by giving a PTIME algorithm. Given Q, D, δrel, δdis, F, k and B, the algorithm works as follows:

1. compute Q(D), in PTIME;
2. enumerate all subsets U of Q(D) such that |U| = k and sort them in descending order based on their F values;
3. count and return the number of distinct valid sets for (Q, D, k, F, B).

Step 1 is in PTIME since Q is fixed. Moreover, there are polynomially many subsets U in step 2 since k is a constant. Thus the algorithm is in PTIME, and the problem is in FP.

This completes the proof of Corollary 8.4. □
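The three enumeration procedures of this proof share one loop over the k-element subsets of Q(D), and can be sketched together as follows. This is an illustration under assumptions: Q(D) is passed in as a list `answers` of distinct tuples, F is any callable objective, the function names are hypothetical, and `drp` reads rank(U) as one plus the number of sets scoring strictly above U.

```python
from itertools import combinations

def candidate_sets(answers, k):
    """All k-element subsets of Q(D); at most |Q(D)|**k of them for constant k."""
    return combinations(answers, k)

def qrd(answers, F, k, B):
    """Is there a k-element set U with F(U) >= B?"""
    return any(F(U) >= B for U in candidate_sets(answers, k))

def drp(answers, F, U, k, r):
    """Is rank(U) <= r, with rank assumed to be 1 + #sets strictly better than U?"""
    target = F(U)
    better = sum(1 for S in candidate_sets(answers, k) if F(S) > target)
    return better + 1 <= r

def rdc(answers, F, k, B):
    """How many k-element sets U have F(U) >= B?"""
    return sum(1 for U in candidate_sets(answers, k) if F(U) >= B)
```

Sorting the candidate sets, as in step 2 of the procedures, is not needed for correctness in this sketch; a single pass suffices for each problem.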

B. PROOFS OF SECTION 9

COROLLARY 9.2. In the presence of compatibility constraints of Cm, the combined complexity bounds of Theorems 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2 remain unchanged for QRD, DRP and RDC. □

PROOF. Observe that the lower bounds of QRD, DRP and RDC given in Theorems 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2, respectively, are established when the compatibility constraints are absent. These bounds are obviously intact in the more general setting when compatibility constraints are present.

Below we show that the upper bounds given there also remain intact in the presence of compatibility constraints Σ. To this end, we make minor changes to the algorithms given there by considering candidate sets for (Q, D, Σ, k). We show that the revised algorithms still work here and their complexity remains unchanged.

(1) QRD(LQ, F). When F is FMS or FMM, we revise the algorithms given for Theorem 5.1 as follows: (a) step 2 of the algorithm for QRD(∃FO+, F) checks whether |U| = k, F(U) ≥ B and U |= Σ; (b) in step 2, the algorithm for QRD(FO, F) also checks whether U ⊆ Q(D), F(U) ≥ B and U |= Σ. When F is Fmono, we revise the algorithm given in Theorem 5.2 such that in step 2, the algorithm checks whether U ⊆ Q(D) and U |= Σ. Clearly, all the revised algorithms work here. Moreover, they are still in NP, PSPACE and PSPACE, respectively, since U |= Σ can be checked in PTIME. Thus the upper bounds given in Theorems 5.1 and 5.2 carry over here.

(2) DRP(LQ, F). When F is FMS or FMM, we revise the algorithms given for Theorem 6.1 such that (a) in step 2, the algorithm for DRP(∃FO+, F) checks whether for each set S ∈ S, |S| = k and S |= Σ; (b) in step 2, the algorithm for DRP(FO, F) checks whether for each set S ∈ S, S ⊆ Q(D) and S |= Σ. When F is Fmono, step 2 of the algorithm given for Theorem 6.2 checks whether for each S ∈ S, S ⊆ Q(D) and S |= Σ. Since checking whether S |= Σ is in PTIME, all the revised algorithms are still in coNP, PSPACE and PSPACE, respectively. Thus the upper bounds given in Theorems 6.1 and 6.2 remain valid here.

(3) RDC(LQ, F). It suffices to show the following: (a) when F is FMS or FMM, it is in NP and in PSPACE to check if a set U is valid for (Q, D, Σ, k, F, B) for ∃FO+ queries Q and FO queries Q, respectively, and (b) when F is Fmono, it is in PSPACE to check whether


a set U is valid for (Q, D, Σ, k, F, B) for FO queries Q. Since it is in PTIME to check whether a set U satisfies Σ (i.e., U |= Σ), both statements (a) and (b) hold here.

This completes the proof of Corollary 9.2. □
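The revised checks all add one conjunct, U |= Σ, to the validity test. The sketch below illustrates that shape only. It assumes Q(D) is given as `answers`, and it models each constraint in Σ as a hypothetical Python predicate over a pair of tuples, which is a simplification of the class Cm rather than its definition.

```python
def is_valid(U, answers, F, k, B, constraints):
    """Check |U| = k, U subset of Q(D), F(U) >= B, and U |= Sigma.

    constraints -- iterable of predicates rho(t, s) -> bool, each required
                   to hold for every ordered pair of tuples in U.
    """
    U = frozenset(U)
    return (
        len(U) == k
        and U <= frozenset(answers)
        and F(U) >= B
        and all(rho(t, s) for rho in constraints for t in U for s in U)
    )
```

The constraint test inspects at most |Σ| · |U|² pairs, which is the sense in which it adds only polynomial overhead to each algorithm.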

THEOREM 9.3. In the presence of compatibility constraints of Cm, the data complexity bounds of Theorems 5.4, 6.4 and 7.4 remain unchanged for QRD, DRP and RDC, respectively, for FMS and FMM. However, for Fmono,

— QRD becomes NP-complete;
— DRP becomes coNP-complete; and
— RDC becomes #P-complete under parsimonious reductions. □

PROOF. The lower bounds of QRD(LQ, F), DRP(LQ, F) and RDC(LQ, F) given in Theorems 5.4, 6.4 and 7.4, respectively, are established in the absence of compatibility constraints Σ, for FMS and FMM. These lower bounds obviously hold in the more general setting when Σ may be present. In addition, the upper bounds given there also carry over to the setting when compatibility constraints Σ are present. Indeed, along the same line as the proof of Corollary 9.2, we can revise the algorithms given in the upper bound proofs of Theorems 5.4, 6.4 and 7.4 by considering candidate sets for (Q, D, Σ, k). The revised algorithms still work in the presence of Σ with the same complexity, since one can check whether U |= Σ in PTIME. Hence all the upper bounds still hold here.

We next show that for Fmono, the data complexity of QRD, DRP and RDC is NP-complete, coNP-complete and #P-complete under parsimonious reductions, respectively, in the presence of Σ, for CQ, UCQ, ∃FO+ and FO.

(1) QRD(LQ, Fmono). It suffices to show that QRD(CQ, Fmono) is NP-hard and that QRD(FO, Fmono) is in NP.

Lower bound. We verify that QRD(CQ, Fmono) is NP-hard by reduction from 3SAT. Given an instance ϕ = C1 ∧ . . . ∧ Cl of 3SAT defined over variables X = {x1, . . . , xm}, we construct a database D, a fixed CQ query Q, functions δrel, δdis and Fmono, a set of fixed compatibility constraints Σ, a positive integer k and a real number B. We show that ϕ is satisfiable if and only if there exists a valid set U for (Q, D, Σ, k, Fmono, B).

(1) We use the database D given in the proof of Theorem 5.1 for QRD(CQ, FMS), over schema RC(cid, L1, V1, L2, V2, L3, V3). That is, for each clause Ci, there are tuples in D to encode all truth assignments for variables in Ci that make Ci true.

(2) The query Q is an identity query on instances of RC .

(3) For each tuple t ∈ D, we define δrel(t, Q) = 1, and for any other tuple t′ of RQ, we let δrel(t′, Q) = 0. We define δdis as a constant function that returns 0 for each pair of tuples t and s of RQ. We set λ = 0. Then for each set U of k tuples of RQ, one can see that $F_{mono}(U) = \sum_{t\in U}\delta_{rel}(t, Q)$.

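(4) We define Σ consisting of 10 constraints, given as follows:

\[ \rho_1 : \forall t_1, t_2 : R_Q \Bigl( t_1[\mathit{cid}] = t_2[\mathit{cid}] \rightarrow \bigwedge_{A \in R_C} t_1[A] = t_2[A] \Bigr), \]

\[ \rho_{ij} : \forall t_1, t_2 : R_Q \bigl( t_1[L_i] = t_2[L_j] \rightarrow t_1[V_i] = t_2[V_j] \bigr), \quad i, j \in [1, 3]. \]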

Intuitively, constraint ρ1 states that tuples in a set U have pairwise distinct cid-values, and the set {ρij | i, j ∈ [1, 3]} ensures that for each pair of tuples t and s in U, t and s agree on the values of their common variables. Note that all these constraints are defined on the fixed schema RC (since Q is an identity query on RC), and have


the form of functional dependencies (see, e.g., [Abiteboul et al. 1995] for functional dependencies). Intuitively, these compatibility constraints are used to assure that the k-element sets U to be picked encode a valid truth assignment for variables X.

(5) Finally, we take k = B = l. That is, we only consider sets U that consist of l tuples, one for each clause in ϕ.

We next show that the formula ϕ is satisfiable if and only if there exists a valid set U for (Q, D, Σ, k, F, B), i.e., the construction given above is indeed a reduction.

⇒ Assume first that ϕ is satisfiable. Then there exists a truth assignment $\mu^0_X$ of X variables such that every clause Cj of ϕ is true under $\mu^0_X$. Let U consist of l tuples of RQ, one for each clause, in which the values for the variables in X agree with $\mu^0_X$. Obviously, U |= Σ, and moreover, Fmono(U) = l ≥ B by the definition of Fmono. Therefore, U is a valid set for (Q, D, Σ, k, F, B).

⇐ Conversely, assume that ϕ is not satisfiable. Suppose by contradiction that thereexists a set U ⊆ Q(D) such that |U | = l, U |= Σ, and Fmono(U) ≥ B. Then from thetuples in U , we can construct a valid truth assignment µX of variables in X thatsatisfies all clauses in ϕ, by the definition of δrel. This leads to a contradiction.
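The correspondence between valid sets and satisfying assignments can be explored on small formulas with the sketch below. It is an illustration, not the proof: it assumes each clause mentions three distinct variables, represents a literal as a hypothetical pair (variable, polarity) with polarity 1 for a positive literal and 0 for a negated one, reads ρij as requiring two tuples to agree on every shared variable, and searches exhaustively, so it takes exponential time in general.

```python
from itertools import product

def build_database(clauses):
    """Tuples (cid, L1, V1, L2, V2, L3, V3): satisfying local assignments per clause."""
    D = []
    for cid, clause in enumerate(clauses, start=1):
        names = [var for var, _ in clause]
        for vals in product((0, 1), repeat=3):
            # Keep the assignment if it makes at least one literal true.
            if any(v == pol for v, (_, pol) in zip(vals, clause)):
                D.append((cid, names[0], vals[0], names[1], vals[1], names[2], vals[2]))
    return D

def compatible(t, s):
    """rho_ij: t and s assign the same value to every variable they share."""
    a = dict(zip(t[1::2], t[2::2]))
    b = dict(zip(s[1::2], s[2::2]))
    return all(b[x] == v for x, v in a.items() if x in b)

def has_valid_set(clauses):
    """Is there U with |U| = l, U |= Sigma and F_mono(U) >= B = l (lambda = 0)?"""
    D = build_database(clauses)
    # rho_1 together with |U| = l forces exactly one tuple per clause id.
    groups = [[t for t in D if t[0] == cid] for cid in range(1, len(clauses) + 1)]

    def extend(chosen, rest):
        if not rest:
            return True  # F_mono(U) = |U| = l, since delta_rel is 1 on D
        return any(
            extend(chosen + [t], rest[1:])
            for t in rest[0]
            if all(compatible(t, s) for s in chosen)
        )

    return extend([], groups)
```

For example, `has_valid_set` returns `True` on a single clause and `False` on the eight clauses that contain every sign pattern over the same three variables, in line with the equivalence just shown.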

Upper bound. Observe that the algorithm for QRD(FO, Fmono) revised in the proof of Corollary 9.2 works here. Clearly, its step 2 is in PTIME since one can check whether U |= Σ in PTIME, Q(D) is PTIME computable for a fixed FO query Q, and moreover, Fmono(U) is also in PTIME. Thus the algorithm is in NP when Q is fixed.

(2) DRP(LQ, Fmono). It suffices to show that DRP(CQ, Fmono) is coNP-hard and that DRP(FO, Fmono) is in coNP.

Lower bound. We verify the lower bound by reduction from the complement of 3SAT. Given an instance ϕ = C1 ∧ . . . ∧ Cl of 3SAT over variables X = {x1, . . . , xm}, we construct a database D, a fixed CQ query Q, functions δrel, δdis and Fmono, a fixed set Σ of compatibility constraints, a positive integer k, and a set U. We show that ϕ is not satisfiable if and only if rank(U) ≤ r. We use constant r = 1.

(1) We define the same formula $\phi' = (\phi \vee z) \wedge \bar{z} = \bigwedge_{i=1}^{l}(C_i \vee z) \wedge \bar{z}$, and the same database D (a single relation) over schema RC(cid, L1, V1, L2, V2, L3, V3, Z, VZ, A) as their counterparts given in the proof of Theorem 6.1. Here D consists of tuples that encode, for each clause Ci ∨ z, all truth assignments µi for the three variables in Ci and z, and the truth value of the clause Ci ∨ z under µi. It also includes two extra tuples (l + 1, e1, f1, e2, f2, e3, f3, z, 1, 0) and (l + 1, e1, f1, e2, f2, e3, f3, z, 0, 1) for $\bar{z}$, where all ei and fi are distinct constants that are not in X ∪ {z, 0, 1}.

(2) The query Q is an identity query on instances of RC.

(3) Let U consist of l + 1 tuples from D, one for each clause in ϕ′, such that all variables in X and z are set to be 1.

(4) Along the same line as the proof of QRD(CQ, Fmono) given above, we define constraints Σ to ensure that for any set U ⊆ Q(D), if U |= Σ, then the tuples in U have pairwise distinct cid-values and moreover, for each pair of tuples t and s in U, t and s agree on the values of their common variables.

(5) We define function δrel such that for each tuple t ∈ D, δrel(t, Q) = 1 if t[A] = 1; and for any other tuple t′ of RQ, we let δrel(t′, Q) = 0. We set λ = 0. Then for each set U of tuples of schema RQ, $F_{mono}(U) = \sum_{t\in U}\delta_{rel}(t, Q)$.

(6) Finally, we set k = l + 1.

We next show that this is indeed a reduction.


⇒ Assume first that ϕ is satisfiable. Then there exists a truth assignment $\mu^0_X$ for X variables that satisfies ϕ. We show that rank(U) ≥ 2 > r = 1. Let $U^0$ consist of l + 1 tuples, one for each clause in ϕ′, such that the values of all the variables in X agree with $\mu^0_X$ and z is set to be 0. Obviously, for any tuple t in $U^0$, we have that δrel(t, Q) = 1 by the definition of δrel. Then $F_{mono}(U^0) = l + 1$. Note that for each tuple t ∈ U, t[A] = 1 if t ≠ (l + 1, e1, f1, e2, f2, e3, f3, z, 1, 0). Thus Fmono(U) = l. Putting these together, we have that rank(U) ≥ 2 > r = 1.

⇐ Conversely, assume that ϕ is not satisfiable. Then there exists no truth assignment µX of X variables that satisfies ϕ. It is easy to see that for each candidate set S for (Q, D, k), there exist at most l tuples t ∈ S such that t[A] = 1, and thus Fmono(S) ≤ l = Fmono(U). Therefore, rank(U) = 1 ≤ r = 1.
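The construction and the two directions of the argument can be checked by brute force on very small formulas. The following is only a sketch under stated assumptions: the tuple layout (cid, L1, V1, L2, V2, L3, V3, Z, VZ, A) is inferred from the two extra tuples in step (1); rank(U) is taken to be one plus the number of candidate sets that satisfy Σ and score strictly higher than U, which is how the proof uses it; and clauses with a repeated variable are allowed purely to keep the unsatisfiable example tiny. The function names are hypothetical.

```python
from itertools import combinations, product

def build_db(clauses):
    """Encode phi' = (phi or z) and (not z) as tuples
    (cid, L1, V1, L2, V2, L3, V3, Z, VZ, A).
    A clause is a list of three (variable, is_positive) literals."""
    db = []
    for cid, clause in enumerate(clauses, start=1):
        names = sorted({v for v, _ in clause})
        for bits in product((0, 1), repeat=len(names) + 1):
            mu = dict(zip(names, bits))      # values of the clause variables
            vz = bits[-1]                    # value of z
            holds = vz == 1 or any(mu[v] == int(pos) for v, pos in clause)
            row = [cid]
            for v, _ in clause:
                row += [v, mu[v]]
            db.append(tuple(row) + ("z", vz, int(holds)))
    # The two extra tuples for the clause (not z), with padding constants.
    pad = ("e1", "f1", "e2", "f2", "e3", "f3")
    last = len(clauses) + 1
    db.append((last, *pad, "z", 1, 0))
    db.append((last, *pad, "z", 0, 1))
    return db

def satisfies_sigma(tuples):
    """Pairwise distinct cid-values and agreement on common variables."""
    seen_cids, mu = set(), {}
    for t in tuples:
        if t[0] in seen_cids:
            return False
        seen_cids.add(t[0])
        for var, val in zip(t[1:9:2], t[2:9:2]):
            if mu.setdefault(var, val) != val:
                return False
    return True

def f_mono(tuples):
    """F_mono for lambda = 0, with delta_rel(t, Q) = t[A]."""
    return sum(t[9] for t in tuples)

def rank(u, db, k):
    """Assumed definition: 1 + number of strictly better candidate sets."""
    base = f_mono(u)
    better = sum(1 for s in combinations(db, k)
                 if satisfies_sigma(s) and f_mono(s) > base)
    return 1 + better

def all_ones_set(db, num_clauses):
    """The set U of step (3): every variable and z set to 1."""
    u = [t for t in db
         if t[0] <= num_clauses and all(v == 1 for v in t[2:9:2])]
    u.append(next(t for t in db if t[0] == num_clauses + 1 and t[8] == 1))
    return u

# Unsatisfiable toy formula: (x or x or x) and (not x or not x or not x).
unsat = [[("x", True)] * 3, [("x", False)] * 3]
db_unsat = build_db(unsat)
print(rank(all_ones_set(db_unsat, 2), db_unsat, k=3))

# Satisfiable toy formula: (x or y or not w).
sat = [[("x", True), ("y", True), ("w", False)]]
db_sat = build_db(sat)
print(rank(all_ones_set(db_sat, 1), db_sat, k=2))
```

The exhaustive enumeration is exponential in k and serves only to illustrate the equivalence on toy inputs; it says nothing about how the problem would be solved in practice.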

Upper bound. Clearly, the algorithm for DRP(FO, FMS) given in the proof of Corollary 9.2 also works here. Indeed, its step 2 is in PTIME since it is in PTIME to check whether U |= Σ, Q(D) is PTIME computable for a fixed FO query Q, and moreover, Fmono(U) is also PTIME computable. Thus the algorithm is in coNP here.

(3) RDC(LQ, Fmono). It suffices to show that RDC(CQ, Fmono) is #P-hard under parsimonious reductions, and that RDC(FO, Fmono) is in #P.

Lower bound. We show that RDC(CQ, Fmono) is #P-hard by parsimonious reduction from #SAT (see Section 7 for the details of #SAT). Given an instance ϕ(X) = C1 ∧ · · · ∧ Cl of 3SAT over variables X, we construct the same D, Q, Σ, λ, k, B, and functions δrel, δdis and Fmono as their counterparts given above for QRD(CQ, Fmono). We let k = l. It is easy to verify that µX is a truth assignment of X variables that satisfies ϕ(X) if and only if there exists a valid set U for (Q, D, Σ, k, Fmono, B) encoding µX, i.e., U consists of l tuples in D, one for each clause, in which the values for the variables in X agree with µX. Thus this is indeed a parsimonious reduction. Hence RDC(CQ, Fmono) is #P-hard under parsimonious reductions.

Upper bound. We only need to show that it is in PTIME to check whether a set U is valid for (Q, D, Σ, k, Fmono, B) for a fixed FO query Q. Indeed, for a set U ⊆ Q(D), it is in PTIME to check whether |U| = k, U |= Σ and Fmono(U) ≥ B; in particular, Q(D) is PTIME computable, and thus Fmono is PTIME computable. Hence RDC(FO, Fmono) is in #P.
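The validity test behind this upper bound can be sketched as follows. This is a minimal illustration, not the paper's algorithm: Σ and the objective function are passed in as hypothetical Python callables `sigma` and `f`, since the concrete form of constraints in Cm is not reproduced here, and the counter simply enumerates all k-element subsets of Q(D).

```python
from itertools import combinations

def is_valid(u, k, sigma, f, bound):
    """The three checks named above: |U| = k, U |= Sigma, F(U) >= B."""
    return len(u) == k and sigma(u) and f(u) >= bound

def count_valid_sets(answers, k, sigma, f, bound):
    """Brute-force RDC: count the valid k-element subsets of Q(D)."""
    return sum(1 for u in combinations(answers, k)
               if is_valid(u, k, sigma, f, bound))

# Toy data (hypothetical): tuples are (group, item, relevance).
answers = [("a", "x", 1), ("a", "y", 1), ("b", "x", 1), ("b", "y", 0)]

def sigma(u):
    # Hypothetical compatibility constraint: no two tuples share a group.
    return len({t[0] for t in u}) == len(u)

def f(u):
    # F_mono with lambda = 0: the sum of relevance scores.
    return sum(t[2] for t in u)

print(count_valid_sets(answers, k=2, sigma=sigma, f=f, bound=2))
```

Each call to `is_valid` runs in polynomial time whenever `sigma` and `f` do, which is the property the membership argument relies on; the enumeration around it is what a #P machine replaces by nondeterministic guessing.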

This completes the proof of Theorem 9.3. □

COROLLARY 9.4. For identity queries, in the presence of compatibility constraints of Cm, both the combined complexity and data complexity of Corollary 8.1 remain unchanged for QRD, DRP, and RDC for FMS and FMM.

However, when it comes to Fmono,

—QRD becomes NP-complete;
—DRP becomes coNP-complete; and
—RDC becomes #P-complete under parsimonious reductions,

for both combined and data complexity. □

PROOF. In the presence of a set Σ of compatibility constraints, for identity queries we first study the combined and data complexity of QRD, DRP and RDC for FMS and FMM. We then investigate these problems for Fmono.

(1) When F is FMS or FMM. Observe the following. (a) We have shown in the proof of Corollary 8.1 that QRD, DRP and RDC are NP-hard, coNP-hard and #P-hard under parsimonious reductions, respectively, for fixed identity queries, when Σ is absent. Thus the lower bounds carry over to the more general setting when Σ may be present. (b) Corollary 9.2 tells us that when Σ is present, QRD, DRP and RDC are in NP, coNP and #P, respectively, when Q is a CQ query that is not necessarily fixed. Observe that CQ subsumes identity queries. Hence for identity queries, QRD, DRP and RDC are in NP, coNP and #P, respectively, in the presence of Σ. Therefore, for identity queries, the combined complexity and data complexity of QRD, DRP and RDC are NP-complete, coNP-complete and #P-complete under parsimonious reductions, respectively, in the presence of Σ.

(2) When F is Fmono. Observe the following. (a) We have shown in the proof of Theorem 9.3 that QRD, DRP and RDC are NP-hard, coNP-hard and #P-hard under parsimonious reductions, respectively, for Fmono, by using a fixed identity query Q in the presence of Σ. Thus these lower bounds hold here. (b) The upper bounds given in the proof of Theorem 9.3 for QRD, DRP and RDC carry over here, since Q(D) is PTIME computable when Q is an identity query, even if it is not fixed. From these we can see that for identity queries, the combined and data complexity of QRD, DRP and RDC are NP-complete, coNP-complete and #P-complete under parsimonious reductions, respectively, in the presence of Σ.

This completes the proof of Corollary 9.4. □

COROLLARY 9.5. For λ = 0, in the presence of compatibility constraints of Cm, the combined complexity bounds given in Theorem 8.2 remain unchanged for QRD, DRP and RDC, while the data complexity becomes

—NP-complete for QRD;
—coNP-complete for DRP; and
—#P-complete for RDC under parsimonious reductions,

for each of FMS, FMM and Fmono, and for CQ, UCQ, ∃FO+ and FO. □

PROOF. For λ = 0, we first study the combined complexity of QRD, DRP and RDC, and then investigate their data complexity.

(1) Combined complexity. Observe the following. (a) The lower bounds given in Theorem 8.2 for QRD, DRP and RDC, respectively, are established when Σ is absent. Hence these lower bounds remain intact when Σ is possibly present. (b) The upper bounds given there carry over here when Σ is present. Indeed, along the same line as the proof of Corollary 9.2, we can revise the algorithms of Theorem 8.2 such that the revised algorithms work in the presence of Σ, with the same complexity since whether U |= Σ can be checked in PTIME.

(2) Data complexity. We first study QRD, DRP and RDC for FMS and FMM, and then consider these problems for Fmono.

(2.1) When F is FMS or FMM. We show that in the presence of a set Σ of compatibility constraints, for λ = 0, QRD, DRP and RDC are NP-complete, coNP-complete and #P-complete under parsimonious reductions, respectively, for fixed queries in CQ, UCQ, ∃FO+ and FO.

(2.1.1) Lower bound. It suffices to verify that when F is FMS or FMM, QRD(CQ, F), DRP(CQ, F) and RDC(CQ, F) are NP-hard, coNP-hard and #P-hard under parsimonious reductions, respectively, when λ = 0 and Σ is present.

We first show that QRD(CQ, FMS) and QRD(CQ, FMM) are NP-hard by reduction from 3SAT. Given an instance ϕ of 3SAT, we construct the same D, Q, Σ and functions δrel and δdis as their counterparts given in the proof of Theorem 9.3 for QRD(CQ, Fmono). Recall that δrel is defined as follows: δrel(t, Q) = 1 if t ∈ D = Q(D), and for any other tuple t′ of RQ, δrel(t′, Q) = 0. Thus, when λ = 0, for each set U consisting of k tuples in D, $F_{MS}(U) = (k - 1) \cdot \sum_{t \in U} \delta_{rel}(t, Q) = (k - 1) \cdot F_{mono}(U)$, and $F_{MM}(U) = \min_{t \in U} \delta_{rel}(t, Q) = (1/k) \cdot F_{mono}(U)$. Moreover, the lower bound of QRD(CQ, Fmono) in Theorem 9.3 is verified by setting λ = 0. Thus for QRD(CQ, FMS), we let k = l and B = (l − 1) · l; and for QRD(CQ, FMM), we let k = l and B = 1. Along the same line as the proof of Theorem 9.3, we can show that ϕ is satisfiable if and only if there exists a set U valid for (Q, D, Σ, k, F, B), when F is FMS or FMM.
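The choice of the bounds B follows from the λ = 0 forms of the three objective functions when every tuple of D has relevance 1. The sketch below only implements those λ = 0 special cases as they are stated in this proof; the function names and the toy set are hypothetical.

```python
def f_mono(u, delta_rel):
    """F_mono at lambda = 0: the sum of relevance scores."""
    return sum(delta_rel(t) for t in u)

def f_ms(u, delta_rel):
    """F_MS at lambda = 0: (k - 1) times the sum of relevance scores."""
    return (len(u) - 1) * sum(delta_rel(t) for t in u)

def f_mm(u, delta_rel):
    """F_MM at lambda = 0: the minimum relevance score."""
    return min(delta_rel(t) for t in u)

def delta_rel(t):
    # Every tuple of D = Q(D) has relevance 1, as recalled above.
    return 1

l = 4
u = [("t", i) for i in range(l)]   # any k = l tuples of D

print(f_mono(u, delta_rel), f_ms(u, delta_rel), f_mm(u, delta_rel))
```

For a set of k = l tuples of D this gives Fmono(U) = l, FMS(U) = (l − 1) · l and FMM(U) = 1, which are exactly the thresholds B chosen for the two reductions.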

We next verify that DRP(CQ, FMS) and DRP(CQ, FMM) are coNP-hard by reduction from the complement of 3SAT. Given an instance ϕ of 3SAT, we use the same reduction given in the proof of Theorem 9.3 for DRP(CQ, Fmono), except the following: for each set U of k tuples of schema RQ, (a) $F_{MS}(U) = (k - 1) \cdot \sum_{t \in U} \delta_{rel}(t, Q)$; and (b) $F_{MM}(U) = \min_{t \in U} \delta_{rel}(t, Q)$. Recall that we show the lower bound of DRP(CQ, Fmono) in the proof of Theorem 9.3 by using λ = 0. Thus along the same line as that proof, one can verify that ϕ is not satisfiable if and only if rank(U) ≤ r, for FMS and FMM.

Finally, we show that RDC(CQ, FMS) and RDC(CQ, FMM) are #P-hard by parsimonious reductions from #SAT. We have shown in the proof of Theorem 9.3 that RDC(CQ, Fmono) is #P-hard under parsimonious reductions for fixed queries, when λ = 0. Given an instance ϕ(X) of #SAT over variables X, along the same line as the analysis in the proof of QRD(CQ, FMS) and QRD(CQ, FMM) given above, we can use the same reduction given in the proof of Theorem 9.3 for RDC(CQ, Fmono), except the following: (a) for RDC(CQ, FMS), we let k = l and B = (l − 1) · l; and (b) for RDC(CQ, FMM), we let k = l and B = 1. Similarly, we can show that the number of truth assignments of the X variables that satisfy ϕ(X) equals the number of valid sets for (Q, D, Σ, k, F, B), when F is FMS or FMM.

(2.1.2) Upper bound. We only need to show that QRD(FO, F), DRP(FO, F) and RDC(FO, F) are in NP, coNP and #P, respectively, when F is FMS or FMM. Clearly, the upper bounds given in Theorem 9.3 for QRD(LQ, F), DRP(LQ, F) and RDC(LQ, F) for FMS and FMM carry over to the special case when λ = 0.

(2.2) When F is Fmono. Observe the following. (a) The lower bounds given in Theorem 9.3 for the data complexity of QRD(LQ, Fmono), DRP(LQ, Fmono) and RDC(LQ, Fmono) are established by using λ = 0. Thus the lower bounds hold here. (b) All the algorithms for these problems given there work in the special case when λ = 0. Thus the upper bounds carry over here. Hence when Σ is present, the data complexity of QRD(LQ, Fmono), DRP(LQ, Fmono) and RDC(LQ, Fmono) is NP-complete, coNP-complete and #P-complete under parsimonious reductions, respectively.

This completes the proof of Corollary 9.5. □

COROLLARY 9.6. For λ = 1, in the presence of compatibility constraints of Cm, the combined complexity bounds given in Theorem 8.3 remain unchanged for QRD, DRP and RDC. The data complexity bounds of Theorem 8.3 remain unchanged for FMS and FMM. In contrast, for Fmono and for CQ, UCQ, ∃FO+ and FO,

—QRD is NP-complete;
—DRP is coNP-complete; and
—RDC is #P-complete under parsimonious reductions. □


PROOF. When λ = 1 and Σ is present, we study the combined complexity of QRD, DRP and RDC, followed by their data complexity.

(1) Combined complexity. Observe the following. The lower bounds given in Theorem 8.3 for QRD, DRP and RDC are established when Σ is absent. Thus the lower bounds hold in the more general setting when Σ is possibly present. Furthermore, the upper bounds given in Corollary 9.2 for QRD, DRP and RDC carry over here to the special case when λ = 1.

(2) Data complexity. We next study the data complexity for FMS, FMM, and Fmono.

(2.1) When F is FMS or FMM. In this setting, observe the following. (a) The lower bounds given in Theorem 8.3 for QRD(LQ, F), DRP(LQ, F) and RDC(LQ, F) are established when Σ is absent. Thus the lower bounds also hold in the presence of Σ. (b) The algorithms given in the proof of Theorem 9.3 for QRD, DRP and RDC work in the special case when λ = 1. Therefore, the upper bounds given there remain valid in this special setting.

(2.2) When F is Fmono. We show that QRD(LQ, Fmono), DRP(LQ, Fmono) and RDC(LQ, Fmono) are NP-complete, coNP-complete and #P-complete under parsimonious reductions, respectively, for fixed queries in CQ, UCQ, ∃FO+ and FO.

(2.2.1) Lower bound. We first study the lower bounds.

We show that QRD(CQ, Fmono) is NP-hard by reduction from 3SAT. Given an instance ϕ of 3SAT, we use the same reduction given in the proof of Theorem 9.3 for QRD(CQ, Fmono), except the following: (a) δdis is a constant function that returns 1 for any two different tuples of schema RQ; (b) when λ = 1, for any set U of tuples of schema RQ, $F_{mono}(U) = \frac{\lambda}{|Q(D)| - 1} \cdot \sum_{t \in U,\, t' \in Q(D)} \delta_{dis}(t, t')$; and (c) k = l and $B = l \cdot (l - 1)/(|Q(D)| - 1)$. Along the same line as the proof given there, one can readily verify that ϕ is satisfiable if and only if there exists a valid set U for (Q, D, Σ, k, Fmono, B).
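To see how the λ = 1 form of Fmono behaves under a constant distance function, the following sketch evaluates it exactly with rational arithmetic. It assumes δdis(t, t) = 0, which the text does not state explicitly, and uses a hypothetical six-tuple answer set; the names are illustrative only.

```python
from fractions import Fraction

def delta_dis(t, s):
    # Constant distance 1 between different tuples.
    # Assumption (not stated in the text): a tuple is at distance 0 from itself.
    return 0 if t == s else 1

def f_mono_lambda1(u, answers, dis):
    """F_mono at lambda = 1: (1 / (|Q(D)| - 1)) * sum over t in U, t' in Q(D)."""
    scale = Fraction(1, len(answers) - 1)
    return scale * sum(dis(t, s) for t in u for s in answers)

answers = [("t", i) for i in range(6)]   # stands in for Q(D)
l = 3
u = answers[:l]                          # some set of k = l tuples

value = f_mono_lambda1(u, answers, delta_dis)
bound = Fraction(l * (l - 1), len(answers) - 1)
print(value, bound, value >= bound)
```

Under these assumptions each tuple of U contributes (|Q(D)| − 1)/(|Q(D)| − 1) = 1, so a set of l tuples scores l, which is at least B whenever l ≤ |Q(D)|. In this sketch the bound is therefore met by every l-element subset of Q(D), and the compatibility constraints Σ are what separate valid sets from invalid ones.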

We next show that DRP(CQ, Fmono) is coNP-hard by reduction from the complement of 3SAT. Given an instance ϕ of 3SAT, we construct the same D, Q, U, Σ and r as their counterparts given in the proof of Theorem 9.3 for DRP(CQ, Fmono), and moreover, we define the same function Fmono as the one given there when λ = 1. Along the same line as that proof, one can verify that ϕ is not satisfiable if and only if rank(U) ≤ r = 1.

We verify that RDC(CQ, Fmono) is #P-hard by parsimonious reduction from #SAT. Given an instance ϕ(X) of #SAT, we construct the same D, Q, Σ, k, B and function Fmono as their counterparts given above for QRD(CQ, Fmono). It is easy to verify that µX is a truth assignment of X variables that satisfies ϕ(X) if and only if there exists a valid set U for (Q, D, Σ, k, Fmono, B) that encodes µX, such that U consists of l tuples in D, one for each clause, in which the values for the variables in X agree with µX. Thus it is indeed a parsimonious reduction. From this it follows that RDC(CQ, Fmono) is #P-hard under parsimonious reductions.

(2.2.2) Upper bound. We have already shown in Theorem 9.3 that QRD(FO, Fmono), DRP(FO, Fmono) and RDC(FO, Fmono) are in NP, coNP and #P, respectively, in the presence of Σ. Thus the upper bounds carry over here to the special case when λ = 1.

This completes the proof of Corollary 9.6. □

COROLLARY 9.7. For a predefined constant k, in the presence of compatibility constraints of Cm, the combined complexity and data complexity of Corollary 8.4 remain unchanged for QRD, DRP and RDC, respectively. □


PROOF. We first study the combined complexity. Observe the following. (a) The lower bounds given in Corollary 8.4 for QRD, DRP and RDC are established for a (fixed) constant k when compatibility constraints are absent. Thus the lower bounds remain intact in the more general setting when compatibility constraints are present. (b) The upper bounds given there also carry over here. This follows from the fact that the algorithms developed in the proof of Corollary 9.2 for QRD, DRP and RDC in the presence of compatibility constraints obviously work in the special case when k is a constant.

For the data complexity, consider the algorithms given in the proof of Corollary 8.4 for QRD(FO, F), DRP(FO, F) and RDC(FO, F). Along the same line as the proof of Corollary 9.2, we revise these algorithms to inspect candidate sets for (Q, D, Σ, k), by additionally checking whether U |= Σ. Clearly, the revised algorithms work here and the problems are still tractable, since it is in PTIME to check whether U |= Σ.
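The reason a constant k keeps the data complexity tractable can be sketched as follows: Q(D) with n tuples has C(n, k) ≤ n^k subsets of size k, a polynomial number once k is fixed, and each one is inspected with PTIME checks. The code below is an illustration for QRD only, not the algorithm of Corollary 8.4; `sigma`, `f` and the toy data are hypothetical.

```python
from itertools import combinations
from math import comb

def qrd_constant_k(answers, k, sigma, f, bound):
    """Return some valid set for fixed k, or None if there is none.
    At most C(n, k) candidate sets are inspected."""
    for u in combinations(answers, k):
        if sigma(u) and f(u) >= bound:
            return u
    return None

# Toy data (hypothetical): answers are integers scored by their sum.
answers = list(range(10))

def sigma(u):
    # Hypothetical compatibility constraint: no two consecutive integers.
    return all(abs(a - b) != 1 for a, b in combinations(u, 2))

print(comb(len(answers), 2))                          # candidate sets for k = 2
print(qrd_constant_k(answers, 2, sigma, sum, bound=16))
print(qrd_constant_k(answers, 2, sigma, sum, bound=17))
```

The extra test `sigma(u)` is the only change relative to the constraint-free setting, which mirrors the revision described in the proof.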

This completes the proof of Corollary 9.7. □
