

    Top-k Publish-Subscribe for Social Annotation of News

Alexander Shraer, Maxim Gurevich, Marcus Fontoura, Vanja Josifovski

    {shralex, gmax, marcusf, vanjaj}@google.com

Google, Inc. (Work done while the authors were at Yahoo! Inc.)

    ABSTRACT

Social content, such as Twitter updates, often has the quickest first-hand reports of news events, as well as numerous commentaries that are indicative of the public view of such events. As such, social updates provide a good complement to professionally written news articles. In this paper we consider the problem of automatically annotating news stories with social updates (tweets), at a news website serving a high volume of pageviews. The high rate of both the pageviews (millions to billions a day) and of the incoming tweets (more than 100 million a day) makes real-time indexing of tweets ineffective, as this requires an index that is both queried and updated extremely frequently. The rate of tweet updates makes caching techniques almost unusable, since the cache would become stale very quickly.

We propose a novel architecture where each story is treated as a subscription for tweets relevant to the story's content, and new algorithms that efficiently match tweets to stories, proactively maintaining the top-k tweets for each story. Such top-k pub-sub consumes only a small fraction of the resource cost of alternative solutions, and can be applicable to other large-scale content-based publish-subscribe problems. We demonstrate the effectiveness of our approach on real-world data: a corpus of news stories from Yahoo! News and a log of Twitter updates.

1. INTRODUCTION

Micro-blogging services such as twitter.com are becoming an integral part of the news consumption experience on the web. With over 100 million users, Twitter often has the quickest first-hand reports of news events, as well as numerous commentaries that are indicative of the public view of the events. As such, micro-blogging services provide a good complement to professionally written stories published by news services. Recent events in North Africa illustrate the effectiveness of Twitter and other micro-blogging services in providing news coverage of events not covered by the traditional media [17].


A popular emerging approach to combining traditional and social news content is annotating news stories with related micro-blogs such as Twitter updates. There are several technical difficulties in building an efficient system for such social news annotation. One of the key challenges is that tweets arrive in real time and in very high volume, more than 100 million a day. As recency is one of the key indicators of relevance for tweets, news stories need to be annotated in real time. Second, large news websites serve a high number of news pageviews that must be served with low latency (fractions of a second). In this context we consider a system that sustains hundreds of millions to billions of serving requests per day. Finally, there is a non-trivial number of unique stories that need to be annotated, ranging in the hundreds of thousands.

In this paper we propose a top-k publish-subscribe approach for efficiently annotating news stories with social content in real time. To be able to cope with the high scale of updates (tweets) and story requests, we use news stories as subscriptions, and tweets as published items in a pub-sub system. In traditional pub-sub systems, published items trigger subscriptions when they match a subscription's predicate. In a top-k pub-sub, each subscription (story) scores published items (tweets), in our case based on the content overlap between the story and a tweet. A subscription is triggered by a new published item only if the item scores higher than the k-th top-scored item previously published for this specific subscription.

For each story, we maintain the current result set of top-k items, reducing the story serving cost to an in-memory table lookup made to fetch this set. In the background, on the arrival of a new tweet, we identify the stories that the tweet is related to, and adjust their result sets accordingly. We show how top-k pub-sub makes news annotation feasible from an efficiency standpoint for a range of scoring functions. In this paper we do not address the issue of ranking quality; however, the presented system can accommodate most of the popular ranking functions, including cosine similarity, BM25, and language model scoring [3]. Moreover, our system can be used as a first phase of selecting annotation candidates, from which the final annotation set can be determined using other methods, which may, for example, take context and user preferences into account (e.g., using Machine Learning).

The pub-sub approach is more suitable for high volumes of updates and requests than the traditional pull approach, where tweets are indexed using real-time indexing and news pageview requests are issued as queries at serving time, for the following two reasons: first, due to the real-time indexing component, cached results would be invalidated very frequently; and second, due to the order-of-magnitude higher number of serving requests than tweet updates, the cost of query evaluation would be very high. The combination of these two issues would dramatically increase the cost of serving. Our evaluation shows that on average, only a very small fraction of tweets related to a story end up annotating the story, and that the cache invalidation rate in a real-time indexing approach would be 3 to 5 orders of magnitude higher than actually required in this setting.

Our approach is applicable to other pub-sub scenarios, where subscriptions are triggered not only based on a predicate match, but also on their relationship with previously seen items. Examples include content-based RSS feed subscriptions, systems for combining editorial and user-generated content under high query volume, or updating cached results of head queries in a search engine. Even in cases when the stream of published items is not as high as in the case of Twitter, the pub-sub approach offers lower serving cost, since the processing is done on arrival of the published items, while at query time the precomputed result is simply fetched from memory. Another advantage of this approach is that it allows the matching to be done off-line using more complex matching logic than in the standard caching approaches, where the cache is filled by results produced by online evaluation. Top-k pub-sub has been considered previously in a similar context [16], for personalized filtering of event streams. The work presented in this paper allows for orders of magnitude larger subscription sizes with orders of magnitude better processing times (Section 5 provides a detailed discussion of previous work).

Even in a pub-sub setting, there are still scalability issues with processing incoming tweets and updating the annotation lists associated with news stories. Classical document retrieval achieves scale and low latency by using two families of top-k algorithms: document-at-a-time (DAAT) and term-at-a-time (TAAT). In this work we show how to adapt these algorithms to the pub-sub setting. We furthermore examine optimizations of top-k pub-sub algorithms, achieving in some cases a reduction of the processing time by up to 89%. The key insight that allows for this improvement is maintaining threshold scores that new tweets would need to meet in order to enter the current result sets of stories. Intuitively, if the upper bound on a tweet's score is below the threshold, the tweet will not enter the result set and thus we can skip the full computation of the story-tweet score. Score computation is the key part of the processing cost, and thus by skipping a significant fraction of score computations we reduce the CPU usage and the processing time of incoming tweets accordingly. Efficiently maintaining these thresholds for ranges of stories allows applying DAAT and TAAT skipping optimizations, saving up to 95% of score computations, which results in a significant reduction of processing latency.

    In summary, our main contributions are as follows:

• We show how the top-k pub-sub paradigm can be used for annotating news stories with social updates in real time, by indexing the news stories as subscriptions and processing tweets as published items. This approach removes the task of matching tweets with news articles from the critical path of serving the articles, allowing for efficient serving, and at the same time guarantees maximal freshness of the served annotations.

• We introduce novel algorithms for top-k pub-sub that allow for orders of magnitude larger subscriptions with a significant reduction in query processing times compared to previous approaches.

• We adapt the prevalent top-k document retrieval algorithms to the pub-sub setting and demonstrate variations that reduce the processing time by up to 89%.

• Our experimental evaluation validates the feasibility of the approach over real-world-size corpora of news and tweets.

The paper proceeds as follows. In the next section we formulate news annotation as a top-k publish-subscribe problem and overview the proposed system architecture. In Section 3 we describe our algorithms. Section 4 presents an experimental evaluation demonstrating the feasibility of the proposed approach and the benefit of the algorithmic improvements. Section 5 gives an overview of related approaches and discusses alternative solutions. We conclude in Section 6.

    2. NEWS ANNOTATION AS PUB-SUB

2.1 Annotating news stories with tweets

We consider a news website serving a collection S of news stories. A story served at time t is annotated with the set of k most relevant social updates (tweets) received up to time t. Formally, given the set Ut of updates at serving time t, story s is annotated with a set of top-k updates Rs^t (we omit superscript t when clear from the context) according to the following scoring function:

score(s, u, t) = cs(s, u) · rs(tu, t),

where cs is a content-based score function, rs is a recency score function, and tu is the creation time of update u. In general, we assume cs to be from a family of state-of-the-art IR scoring functions such as cosine similarity or BM25, and rs to monotonically decrease with t − tu, at the same rate for all tweets. We say that tweet u is related to story s if cs(s, u) > 0.

Content-based score. In this work we consider two popular IR relevance functions: cosine similarity and BM25. We adopt a variant of cosine similarity similar to the one used in the open-source Lucene search engine (lucene.apache.org):

cs(s, u) = Σi ui · idf²(i) · si / |s|,

where si (resp. ui) is the frequency of term i in the content of s (resp. u), |s| is the length of s, and idf(i) = 1 + log(|S| / (1 + |{s ∈ S | si > 0}|)) is the inverse document frequency of i in S. With slight abuse of notation we refer to the score contribution of an individual term ui by cs(s, ui); e.g., in the above function cs(s, ui) = ui · idf²(i) · si / |s|.

The BM25 score function is defined as follows:

cs(s, u) = Σi ui · idf(i) · si · (k1 + 1) / (si + k1 · (1 − b + b · |s| / avg_{s′∈S} |s′|)),

where k1 and b are parameters of the function (typical values are k1 = 2, b = 0.75).
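To make the partial-score structure of these two functions concrete, here is a minimal Python sketch; the dictionary-based representation (story["tf"], story["len"]) and all function names are our illustrative choices, not the paper's implementation:

```python
import math

def idf(term, stories):
    # idf(i) = 1 + log(|S| / (1 + |{s in S : s_i > 0}|))
    df = sum(1 for s in stories if term in s["tf"])
    return 1.0 + math.log(len(stories) / (1.0 + df))

def cosine_score(story, tweet_tf, idf_of):
    # cs(s, u) = sum_i u_i * idf(i)^2 * s_i / |s|
    return sum(u_i * idf_of[i] ** 2 * story["tf"].get(i, 0) / story["len"]
               for i, u_i in tweet_tf.items())

def bm25_score(story, tweet_tf, idf_of, avg_len, k1=2.0, b=0.75):
    # cs(s, u) = sum_i u_i * idf(i) * s_i * (k1 + 1)
    #            / (s_i + k1 * (1 - b + b * |s| / avg|s'|))
    total = 0.0
    for i, u_i in tweet_tf.items():
        s_i = story["tf"].get(i, 0)
        total += (u_i * idf_of[i] * s_i * (k1 + 1.0)
                  / (s_i + k1 * (1.0 - b + b * story["len"] / avg_len)))
    return total
```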


While these are simplistic scoring functions, they are based on query-document overlap and can be implemented as dot products, similarly to other popular scoring functions, and incur a similar runtime cost. Designing a high-quality scoring function for matching tweets to stories is beyond the scope of this paper. We note that these scoring functions can be used in first-phase retrieval, producing a large set of annotation candidates, after which a second phase may employ an arbitrary scoring function (based, for example, on Machine Learning) to produce the final ordering of results and determine the annotations to be displayed.

Recency score. Social updates like tweets are often tied (explicitly or implicitly) to some specific event, and their relevance to current events declines as time passes. In our experiments we thus discount the scores of older tweets by a factor of 2 every time interval τ (a parameter), i.e., we use an exponential decay recency score:

rs(tu, t) = 2^((tu − t)/τ).

Although we consider the above exponential decay function, our algorithms can support other monotonically decreasing functions. Note, however, that the efficient score computation method described in Section 2.4 may not be applicable to some functions.

2.2 A top-k pub-sub approach

We focus on high-volume websites serving millions to billions of daily pageviews. The typical arrival rate of tweets is 100 million a day, while new stories are added at a rate of thousands to tens of thousands a day. We focus on annotating the "head" stories that get the majority of pageviews; these are typically new stories describing recent events, published during the current day or the few preceding days. It is these stories whose annotation has to be updated frequently as new related tweets arrive.

Pageviews are by far the most frequent events in the system. We are thus looking for a scalable solution that would do as little work as possible on each pageview. It therefore makes sense to maintain the current, up-to-date annotations (sets of tweets) for each story. Let Rs be the set of up-to-date top-k tweets for a story s ∈ S (i.e., at time t the top-k tweets from Ut). For each arriving tweet we identify the stories it should annotate, and add the tweet to these stories' result sets. On pageviews, the precomputed annotations Rs are fetched with no additional processing overhead.

The architecture we propose is described in Figure 1. The Story Index is the main component we develop in this paper. It indexes stories in S, is queried by the incoming tweets, and updates the current top-k tweets Rs for each story. We describe the Story Index in detail in the following sections. A complementary Tweet Index can be maintained and used to initialize the annotations of new stories that are being added to the system. Such initialization (and hence the Tweet Index) is optional, and one may prefer to consider only newly arriving tweets for annotations. We note, however, that initializing the list of annotations sets a higher bar for newly incoming tweets, which is beneficial to the user and also allows for faster queries of S (optimizations presented later in this section allow skipping the evaluation of an incoming tweet against stories for which the tweet cannot improve the annotation set).

We note that our approach implements a content-based publish-subscribe system where stories are standing subscriptions on tweets. It is referred to as a push pub-sub system, where updates are proactively computed and pushed to the subscribers (precomputed annotations in our case). For a more detailed discussion of previously proposed content-based pub-sub systems, see Section 5.2.

Figure 1: A pub-sub based solution.

This design has three main scenarios: (1) every new tweet is used as a query for the Story Index and, for every story s, if it is part of the top-k results for s, we add it to Rs; we also add the new tweet to the Tweet Index; (2) for every new story we query the Tweet Index and retrieve the top-k tweets, which are used to initialize Rs; we also add the new story to the Story Index; (3) for every pageview we simply fetch the top-k set of tweets Rs.

The major advantages of this solution are the following: (1) the Story Index is queried frequently, but it is updated infrequently; (2) for the Tweet Index, the opposite happens: it is updated frequently but queried only for new stories, which are orders of magnitude less frequent than tweet updates; (3) the pageviews, which are the most frequent event, are served very efficiently, since we only need to return the precomputed set of tweets Rs.

2.3 The Story Index

The main idea behind the Story Index is to index stories instead of tweets, and to run tweets as queries on that index. The inverted index is one of the most popular data structures for information retrieval. The content of the documents (stories in our case) is indexed in an inverted index structure, which is a collection of posting lists L1, L2, . . . , Lm, typically corresponding to terms (or, more generally, features) in the story corpus. A list Li contains postings of the form ⟨s, ps(s, i)⟩ for each story that contains term i, where s is a story identifier and ps(s, i) = cs(s, ui)/ui is the partial score, i.e., the score contribution of term i to the full score cs(s, ·). For example, for cosine similarity, ps(s, i) = idf²(i) · si / |s|. The factor ui multiplies the partial score at evaluation time, giving cs(s, ui). Postings in each list are sorted by ascending story identifier. Given a query (in our case a social update) u, a scoring function cs, and k, a typical IR retrieval algorithm, shown in Algorithm 1, traverses the inverted index of the corpus S and returns the top-k stories for u, that is, the stories in S with the highest value of cs(s, u).
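As an illustration of this layout, a story index with precomputed cosine partial scores could be built as follows (a sketch; the names are ours):

```python
from collections import defaultdict

def build_story_index(stories, idf_of):
    # One posting list per term i; a posting is (story_id, ps(s, i)) with
    # ps(s, i) = idf(i)^2 * s_i / |s| for the cosine variant above.
    index = defaultdict(list)
    for sid, story in stories.items():
        for term, s_i in story["tf"].items():
            index[term].append((sid, idf_of[term] ** 2 * s_i / story["len"]))
    for postings in index.values():
        postings.sort()  # postings sorted by ascending story identifier
    return index
```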

Note that the above-described semantics is different from what we want to achieve. We do not want to find the top-k stories for a given tweet, but rather all stories for which the tweet is among their top-k tweets.

  • 7/28/2019 TopK Publish Subscribe

    4/12

    Algorithm 1 Generic top-k retrieval algorithm

1: Input: Index of S
2: Input: Query u
3: Input: Number of results k
4: Output: R, a min-heap of size k
5: Let L1, L2, . . . , L|u| be the posting lists of terms in u
6: R ← ∅
7: for every story s ∈ ∪i Li do
8:   Attempt inserting ⟨s, cs(s, u)⟩ into R
9: return R

This difference precludes using off-the-shelf retrieval algorithms.

Algorithm 2 shows the top-k pub-sub semantics. Given a tweet u and the current top-k sets Rs1, . . . , Rsn for all stories, the tweet u must be inserted into all sets for which u ranks among the top-k matching tweets. (Here we ignore the recency score rs and handle it in Section 2.4.)

    Algorithm 2 Generic pub-sub based algorithm

1: Input: Index of S
2: Input: Query u
3: Input: Rs1, Rs2, . . . , Rsn, min-heaps of size k for all stories in S
4: Output: Updated min-heaps Rs1, Rs2, . . . , Rsn
5: Let L1, L2, . . . , L|u| be the posting lists of terms in u
6: for every story s ∈ ∪i Li do
7:   Attempt inserting ⟨u, cs(s, u)⟩ into Rs
8: return Rs1, Rs2, . . . , Rsn
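A direct Python rendering of Algorithm 2's semantics, assuming the index sketch above and one min-heap of (score, tweet_id) pairs per story (e.g., a defaultdict(list)); all names are illustrative:

```python
import heapq
from collections import defaultdict

def match_tweet(index, tweet_id, tweet_tf, result_heaps, k):
    # Accumulate cs(s, u) for every story s sharing a term with tweet u,
    # then attempt to insert u into each R_s (a size-k min-heap).
    scores = defaultdict(float)
    for term, u_i in tweet_tf.items():
        for sid, ps in index.get(term, ()):
            scores[sid] += u_i * ps
    for sid, score in scores.items():
        heap = result_heaps[sid]
        if len(heap) < k:
            heapq.heappush(heap, (score, tweet_id))
        elif score > heap[0][0]:  # u beats the k-th best tweet for s
            heapq.heapreplace(heap, (score, tweet_id))
```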

2.4 The recency function

Until now we focused on content-based scoring only and ignored the recency score. Recall that our recency score function, rs(tu, t) = 2^((tu − t)/τ), decays exponentially with the time gap between the creation time tu of the tweet and the pageview time t. It is easy to see that this function satisfies the following invariant:

Observation 2.1. As t grows, the relative ranking between the scores of past tweets does not change.

The above invariant means that we do not need to recompute scores and rerank tweets in Rs between updates caused by new tweets.

However, it might seem that whenever we attempt to insert a new tweet into Rs, we have to recompute the scores of tweets that are already in Rs in order to be able to compare these scores to the score of the new tweet. Fortunately, this recomputation can also be avoided by writing the recency score as

rs(tu, t) = 2^(tu/τ) / 2^(t/τ),

and noting that the denominator 2^(t/τ) depends only on the current time t, and at any given time is equal for all tweets and all stories. Thus, and since we do not use absolute score values beyond the relative ranking of tweets, we can replace 2^(t/τ) with the constant 1, giving the following recency function:

rs(tu) = 2^(tu/τ).

The above function depends only on the creation time of the tweet and thus does not have to be recomputed later when we attempt to insert new tweets. (Since scores now grow exponentially as new tweets arrive, they may grow beyond the available numerical precision, in which case a pass over all tweets in all Rs is required, subtracting a constant from all values of tu and recomputing the scores.)

To detach accounting for the recency score from the retrieval algorithm, when a new tweet arrives we compute its rs(tu) and use it as a multiplier of the term weights in the tweet's query vector u, i.e., we use 2^(tu/τ) · u to query the inverted index. Clearly, when computing the tweet's content-based score cs with such a query vector, we get the desired final score:

cs(s, 2^(tu/τ) · u) = Σi 2^(tu/τ) · cs(s, ui) = 2^(tu/τ) · cs(s, u) = score(s, u, t).
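In code, the whole recency treatment reduces to scaling the tweet's term weights once, on arrival. A minimal sketch, where τ = 3600 seconds and the base epoch are illustrative parameters of our own choosing:

```python
BASE = 1_500_000_000.0  # illustrative epoch; advanced periodically to keep
                        # the exponents small (cf. the precision note above)

def scaled_query_vector(tweet_tf, t_u, tau=3600.0):
    # Multiply every term weight of u by rs(t_u) = 2^(t_u / tau), so the
    # scores stored in the R_s heaps never need rescoring as time passes.
    factor = 2.0 ** ((t_u - BASE) / tau)
    return {term: factor * u_i for term, u_i in tweet_tf.items()}
```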

3. RETRIEVAL ALGORITHMS FOR TOP-K PUB-SUB

In this section we show an adaptation of several popular top-k retrieval strategies to the pub-sub setting, and then evaluate their performance empirically in Section 4. Although top-k retrieval algorithms have been evaluated extensively, [8, 23, 18, 7, 22] to name a few, the different setting we consider necessitates a separate evaluation. We first describe an implementation of the pub-sub retrieval algorithm (Algorithm 2) using the term-at-a-time strategy (TAAT).

3.1 TAAT for pub-sub

In term-at-a-time algorithms, posting lists corresponding to query terms are processed sequentially, while accumulating the partial scores of all documents encountered in the lists. After traversing all the lists, the accumulated scores are equal to the full query-document scores (cs(s, u)); documents that did not appear in any of the posting lists have a zero score.

A top-k retrieval algorithm then picks the k documents with the highest accumulated scores and returns them as the query result. In our setting, where the query is a tweet and the documents are stories, the new tweet u may end up being added to Rs of any story s for which score(s, u, t) > 0. Thus, instead of picking the top-k stories with the highest scores, we attempt to add u to Rs of all stories having a positive accumulated score, as shown in Algorithm 3, where θs denotes the minimal score of a tweet in Rs (recall that ui denotes the term weight of term i in tweet u).

3.2 TAAT for pub-sub with skipping

An optimization often implemented in retrieval algorithms is skipping some postings, or entire posting lists, when the scores computed so far indicate that no documents in the skipped postings can make it into the result set. One such optimization was proposed by Buckley and Lewit [8]. Let ms(Li) = max_s ps(s, i) be the maximal partial score in list Li. The algorithm of Buckley and Lewit sorts posting lists in the descending order of their maximal score, and processes them sequentially until either exhausting all lists or satisfying an early-termination condition, in which case the remaining lists are skipped and the current top-k results are returned.



    Algorithm 3 TAAT for pub-sub

1: Input: Index of S
2: Input: Query u
3: Input: Rs1, Rs2, . . . , Rsn, min-heaps of size k for all stories in S
4: Output: Updated min-heaps Rs1, Rs2, . . . , Rsn
5: Let L1, L2, . . . , L|u| be the posting lists of terms in u, in the descending order of their maximal score
6: A[s] ← 0 for all s   // accumulators vector
7: for i ∈ [1, 2, . . . , |u|] do
8:   for ⟨s, ps(s, i)⟩ ∈ Li do
9:     A[s] ← A[s] + ui · ps(s, i)
10: for every s such that A[s] > 0 do
11:   θs ← min. score of a tweet in Rs if |Rs| = k, 0 otherwise
12:   if θs < A[s] then
13:     if |Rs| = k then
14:       Remove the least-scored tweet from Rs
15:     Add ⟨u, A[s]⟩ to Rs
16: return Rs1, Rs2, . . . , Rsn

The early-termination condition ensures that no documents other than the current top-k can make it into the true top-k results of the query. This condition is satisfied when the k-th highest accumulated score is greater than the upper bound on the scores of other documents that are currently not among the top-k ones, calculated as the (k+1)-th highest accumulated score plus the sum of the maximal scores of the remaining lists. More formally, let the next list to be evaluated be Li, and denote by Ak the k-th highest accumulated score. Then, lists Li, Li+1, . . . , L|u| can be safely skipped if

Ak > Ak+1 + Σj≥i uj · ms(Lj).

In our setting, since we are not interested in the top-k stories but in the top-k tweets for each story, we cannot use the above condition, and we develop a different condition suitable to our problem. In order to skip list Li, we have to make sure that tweet u will not make it into Rs of any story s in Li. In other words, the upper bound on the score of u has to be below θs for every s ∈ Li:

A1 + Σj≥i uj · ms(Lj) ≤ min_{s∈Li} θs.   (1)

When this condition does not hold, we process Li as shown in Algorithm 3, lines 8-9. When it holds, we can skip list Li and proceed to list Li+1, check the condition again, and so on. Note that such skipping makes some accumulated scores inaccurate (lower than they should be). Observe, however, that these are the scores of exactly the stories in Li that we skipped because tweet u would not make it into their Rs sets even with the full score. Thus, making the accumulated score of these stories lower does not change the outcome of the algorithm.
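Condition (1) itself is a one-line check; a sketch with our own argument names:

```python
def can_skip_list(a1, remaining_max_mass, min_theta_in_list):
    # a1: highest accumulated score so far; remaining_max_mass:
    # sum_{j>=i} u_j * ms(L_j). If even this upper bound on the score of
    # u cannot beat the smallest threshold theta_s in L_i, skip the list.
    return a1 + remaining_max_mass <= min_theta_in_list
```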

    3.2.1 Efficient fine-grained skipping

Although Condition 1 allows us to skip the whole list Li, it is less likely to hold for longer lists, while skipping such lists is what could make the bigger difference for the evaluation time. Even a single story with θs = 0 in the middle of a list would prevent skipping that list. We thus resort to a more fine-grained skipping strategy: we skip a segment of a list until the first story that violates Condition 1, i.e., the first s in Li for which A1 + Σj≥i uj · ms(Lj) > θs. We then process that story by updating its score in the accumulators (line 9 in Algorithm 3), and then again look for the next story in the list that violates the condition. We thus need a primitive next(Li, pos, UB) that, given a list Li, a starting position pos in that list, and the value of UB = A1 + Σj≥i uj · ms(Lj), returns the next story s in Li such that UB > θs.

Note that next(Li, pos, UB) has to be more efficient than just traversing the stories in Li and comparing their θs to UB, as this would take the same number of steps as the original algorithm would perform traversing Li. We thus use a tree-based data structure for each list Li that supports two operations: next(pos, UB), corresponding to the next primitive defined above, and update(s, θs), which updates the data structure when θs of a story s in Li changes. Specifically, for every posting list Li we build a balanced binary tree Ii where leaves represent the postings s1, s2, . . . , s|Li| in Li and store their corresponding θs values. Each internal node n in Ii stores n.θ, the minimum θ value of its subtree. The subtree rooted at n contains postings with indices in the range n.range_start to n.range_end, and we say that n is responsible for these indices. Figure 2 shows a possible tree Ii for Li with five postings.

Figure 2: Example of a tree Ii representing a list Li with 5 postings. (Leaves store the θ values of postings P1-P5; each internal node stores the minimum θ of the index range, marked range_start to range_end, that it is responsible for.)

Algorithm 4 presents the pseudo-code for next(pos, UB) on a tree Ii. It uses a recursive subroutine findMaxInterval, which gets a node as a parameter (and pos and UB as implicit parameters) and returns endIndex, the maximal index of a story s in Li which appears at least in position pos in Li and for which θs ≥ UB (this is the last story we can safely skip). If node.θ ≥ UB (line 9), all stories in the subtree rooted at node can be skipped. Otherwise, we check whether pos is smaller than the last index for which node's left child is responsible (line 12). If so, we proceed by finding the maximal index in the left subtree that can be skipped, by invoking findMaxInterval recursively with node's left child as the parameter. If the maximal index to be skipped is not the last in node's left subtree (line 14), we surely cannot skip any postings in the right subtree. If all postings in the left subtree can be skipped, or in case pos is bigger than all indices in node's left subtree, the last posting to be skipped may be in node's right subtree. We therefore proceed by invoking findMaxInterval with node's right child as the parameter.

  • 7/28/2019 TopK Publish Subscribe

    6/12

    Algorithm 4 Pseudo-code for operation next of tree Ii

1: Input: pos ∈ [1, |Li|]
2: Input: UB
3: Output: next(Li, pos, UB)
4: endIndex ← findMaxInterval(Ii.root)
5: if (endIndex = |Li|) return ∞   // skip the remainder of Li
6: if (endIndex = ⊥) return pos   // no skipping is possible
7: return endIndex + 1

8: procedure findMaxInterval(node)
9:   if (node.θ ≥ UB) return node.range_end
10:  if (isLeaf(node)) return ⊥
11:  p ← ⊥
12:  if (pos ≤ node.left.range_end) then
13:    p ← findMaxInterval(node.left)
14:    if (p < node.left.range_end) return p
15:  q ← findMaxInterval(node.right)
16:  if (q = ⊥) return p
17:  return q


In case no skipping is possible, the top-level call to findMaxInterval returns ⊥, and next in turn returns pos. If findMaxInterval returns the last index in Li, next returns ∞, indicating that we can skip over all postings in Li. Otherwise, any position endIndex returned by findMaxInterval is the last position that can be skipped, and thus next returns endIndex + 1.

Although findMaxInterval may proceed by recursively traversing both the left and the right child of node (in lines 13 and 15, respectively), observe that the right subtree is traversed only in two cases: 1) if the left subtree is not traversed, i.e., the condition in line 12 evaluates to false, or 2) if the left child is examined but the condition in line 9 evaluates to true, indicating that the whole left subtree can be safely skipped. In both cases, the traversal may examine the left child of a node, but may not go any deeper into the left subtree. Thus, it is easy to see that next(pos, UB) takes O(log |Li|) steps. update(s, θs) is performed by finding the leaf corresponding to s and updating the values stored at each node on the path from this leaf to the root of Ii.

In order to minimize the memory footprint, we use a standard technique to embed such a tree into an array of size 2|Li|. We optimize further by making each leaf in Ii responsible for a range of l consecutive postings in Li (instead of a single posting) and using the lowest θs of a story in this range as the value stored in the leaf. While this modification slightly reduces the number of postings the algorithm skips, it reduces the memory footprint of the trees by a factor of l and the lookup complexity by O(log l), thus being overall beneficial.
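Below is a compact array-embedded version of such a tree in Python, as a sketch of the idea rather than the paper's implementation: positions are 0-based, leaves are padded to a power of two, and next returns the first position at or after pos that cannot be skipped (or the list length if the whole remainder can be):

```python
class SkipTree:
    # Leaf j stores theta of the j-th posting; each internal node stores
    # the minimum theta of its subtree, embedded in an array of size 2n.
    def __init__(self, thetas):
        n = 1
        while n < len(thetas):
            n *= 2
        self.n, self.size = n, len(thetas)
        self.min = [float("inf")] * (2 * n)  # padding leaves never match
        self.min[n:n + len(thetas)] = thetas
        for i in range(n - 1, 0, -1):
            self.min[i] = min(self.min[2 * i], self.min[2 * i + 1])

    def update(self, pos, theta):
        # Called when theta_s of the posting at position pos changes.
        i = pos + self.n
        self.min[i] = theta
        while i > 1:
            i //= 2
            self.min[i] = min(self.min[2 * i], self.min[2 * i + 1])

    def next(self, pos, ub):
        # First position >= pos with theta < ub; self.size if none exists.
        i = pos + self.n
        while self.min[i] >= ub:     # current subtree is fully skippable
            while i % 2 == 1:        # climb while i is a right child
                i //= 2
            if i <= 1:
                return self.size     # skipped everything to the end
            i += 1                   # move to the right sibling's subtree
        while i < self.n:            # descend to the leftmost leaf < ub
            i *= 2
            if self.min[i] >= ub:
                i += 1
        return i - self.n
```

Here update walks one leaf-to-root path and next visits at most two nodes per level, matching the O(log |Li|) bounds above; the leaf-per-range-of-l optimization would store the minimum θ of l consecutive postings in each leaf.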

We did not investigate methods for choosing the optimal value of l, but found experimentally that setting l between 32 and 1024 (depending on the index size) results in an acceptable memory-performance tradeoff. Algorithm 5 maintains a set I of such trees, consults it to allow skipping over intervals of posting lists as described above, and updates the affected trees once θs for some s changes. Note that when such a change occurs, we must update all trees which contain s (Algorithm 6, lines 9 and 10). Enumerating these trees is equivalent to maintaining a forward index whose size is of the same order as the size of the inverted index of S.

    Algorithm 5 Skipping TAAT for pub-sub

1: Input: Index of S
2: Input: Query u
3: Input: Rs1, Rs2, . . . , Rsn, min-heaps of size k for all stories in S
4: Output: Updated min-heaps Rs1, Rs2, . . . , Rsn
5: Let L1, L2, . . . , L|u| be the posting lists of terms in u, in the descending order of their maximal score
6: Let I1, I2, . . . , I|u| be the trees for the posting lists
7: A[s] ← 0 for all s   // accumulators vector
8: for i ∈ [1, 2, . . . , |u|] do
9:   UB ← A1 + Σj≥i uj · ms(Lj)
10:  pos ← Ii.next(1, UB)
11:  while pos ≤ |Li| do
12:    ⟨s, ps(s, i)⟩ ← posting at position pos in Li
13:    A[s] ← A[s] + ui · ps(s, i)
14:    pos ← Ii.next(pos + 1, UB)
15: for every s such that A[s] > 0 do
16:   processScoredResult(s, u, A[s], Rs, I)
17: return Rs1, Rs2, . . . , Rsn

Algorithm 6 A procedure that attempts to insert a tweet u into Rs and updates trees

1: Procedure processScoredResult(s, u, score, Rs, I)
2: θs ← min. score of a tweet in Rs if |Rs| = k, 0 otherwise
3: if θs < score then
4:   if |Rs| = k then
5:     Remove the least-scored tweet from Rs
6:   Add ⟨u, score⟩ to Rs
7:   θ′s ← min. score of a tweet in Rs if |Rs| = k, 0 otherwise
8:   if θ′s ≠ θs then
9:     for j ∈ terms of s do
10:      Ij.update(s, θ′s)

To increase skipping, we use an optimization that orders story ids in the ascending order of their θs. This reduces the chance of encountering a stray story with low θs in a range of stories with high θs in a posting list, thus allowing longer skips. Such a (re)ordering can be performed periodically, as the θs of stories change. We do not further explore this optimization, and in our evaluation we ordered stories only once, at the beginning of the evaluation.

3.3 DAAT for pub-sub

Document-at-a-time (DAAT) is an alternative strategy where the current top-k documents are maintained in a min-heap, and each document encountered in one of the lists is fully scored and considered for insertion into the current top-k. Algorithm 7 traverses the posting lists in parallel, while each list maintains a current position. We denote the current position in list L by L.curPosition, the current story by L.cur, and the partial score of the current story by L.curPs. The current story with the lowest id is picked and scored, and the lists where it was the current story are advanced to the next posting. The advantage compared to TAAT is that there is no need to maintain a potentially large set of partially accumulated scores.


    Algorithm 7 DAAT for pub-sub

1: Input: Index of S
2: Input: Query u
3: Input: Rs1, Rs2, . . . , Rsn, min-heaps of size k for all stories in S
4: Output: Updated min-heaps Rs1, Rs2, . . . , Rsn
5: Let L1, L2, . . . , L|u| be the posting lists of terms in u
6: for i ∈ [1, 2, . . . , |u|] do
7:   Reset the current position in Li to the first posting
8: while not all lists exhausted do
9:   s ← min_{1≤i≤|u|} Li.cur
10:  score ← 0
11:  for i ∈ [1, 2, . . . , |u|] do
12:    if Li.cur = s then
13:      score ← score + ui · Li.curPs
14:      Advance by 1 the current position in Li
15:  θs ← min. score of a tweet in Rs if |Rs| = k, 0 otherwise
16:  if θs < score then
17:    if |Rs| = k then
18:      Remove the least-scored tweet from Rs
19:    Add ⟨u, score⟩ to Rs
20: return Rs1, Rs2, . . . , Rsn
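A Python sketch of this traversal, assuming the posting lists and matching term weights of the tweet have already been looked up, and that result_heaps maps a story id to its min-heap (names ours):

```python
import heapq

def daat_match(posting_lists, weights, tweet_id, result_heaps, k):
    # posting_lists: one sorted list of (story_id, partial_score) per term;
    # weights: the matching term weights u_i of the tweet.
    cur = [0] * len(posting_lists)
    while True:
        alive = [pl[c][0] for pl, c in zip(posting_lists, cur) if c < len(pl)]
        if not alive:
            break
        s = min(alive)                    # lowest current story id
        score = 0.0
        for i, pl in enumerate(posting_lists):
            if cur[i] < len(pl) and pl[cur[i]][0] == s:
                score += weights[i] * pl[cur[i]][1]
                cur[i] += 1               # advance this list past s
        heap = result_heaps[s]
        if len(heap) < k:
            heapq.heappush(heap, (score, tweet_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, tweet_id))
```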

3.4 DAAT for pub-sub with skipping

Similarly to TAAT algorithms, it is possible to skip postings in DAAT as well. One of the popular algorithms is WAND [7]. In each iteration it orders the posting lists in the ascending order of the current document id and looks for the pivot list, the first list Li such that the sum of the maximal scores in lists L1, . . . , Li−1 is below the lowest score θ in the current top-k:

Σj<i uj · ms(Lj) < θ.


Figure 10: The effect of pageview and invalidation rates. (x-axis: pageviews per minute of the most popular 1K stories; series: cache invalidations per minute and Tweet index queries per minute.)

Figure 11 compares the overall cost of the caching approach and our solution. Here the cost is the product of the query rate to the Tweet and Story index, respectively, and a cost factor of 1 for the Story index and 10 for the Tweet index. Expectedly, the cost of our solution does not depend on the pageview rate but only on the incoming tweet rate (24K per minute). The cost of caching is lower for low pageview rates, and higher for rates above 2,400 (for all the 1K popular stories combined). Recall that major news websites receive on the order of hundreds of millions of pageviews daily, hence hundreds of thousands of pageviews per minute.

Figure 11: Comparison of query costs. (x-axis: pageviews per minute of the most popular 1K stories; series: Tweet index cost and Story index cost.)

A cost analysis like the one shown in Figure 11 can guide the choice between our approach and caching, or, in a hybrid approach, can help select the number of most-popular stories whose annotations are maintained using pub-sub, while the annotations of tail stories are maintained using caching. Finally, we note an additional important factor that one must take into account: the cost of querying the Tweet index using the caching approach is incurred online, during a pageview, whereas querying the Story index occurs when a new tweet arrives and has no effect on pageview latency.
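As a back-of-the-envelope rendering of this comparison (the rates and cost factors are the ones quoted above; the formulas are our reading of Figure 11):

```python
TWEET_RATE = 24_000       # incoming tweets per minute (Section 5.1)
STORY_INDEX_COST = 1      # relative cost of one Story index query
TWEET_INDEX_COST = 10     # relative cost of one Tweet index query

# Pub-sub: every tweet queries the Story index; pageviews are lookups,
# so the cost is independent of the pageview rate.
pubsub_cost = TWEET_RATE * STORY_INDEX_COST            # 24,000

# Caching: each cache invalidation triggers a Tweet index query, so at
# low pageview rates the cost grows with the pageview rate.
def caching_cost(tweet_index_queries_per_min):
    return tweet_index_queries_per_min * TWEET_INDEX_COST

# Break-even: 24,000 * 1 / 10 = 2,400 Tweet index queries per minute,
# matching the crossover rate of 2,400 noted above.
```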

5.2 Content-based publish-subscribe

In most previous algorithms and systems for content-based publish-subscribe (e.g., [1, 4, 10, 11, 12, 21]), published items trigger subscriptions when they match a subscription's predicate. In contrast, top-k pub-sub scores published items for each subscription, in our case based on the content overlap between a story and a tweet, and triggers a subscription only if the item ranks among the top-k published items.

The notion of top-k publish-subscribe (more specifically, top-k/w publish-subscribe) was introduced in [20]. In top-k/w publish-subscribe a subscriber receives only those publications that are among the k most relevant ones published in a time window w, where k and w are constants defined for each subscription [20, 15]. In this model, the relevance of an event remains constant during the time window, and once its lifetime exceeds w the event simply expires (i.e., the relevance becomes zero). The place of the expired object is then populated with the most relevant unexpired object by a re-evaluation algorithm. The sliding-window model was previously extensively studied in the context of continuous top-k queries in data streams (see [2] for a survey). Solutions in this model face the challenge of identifying and keeping track of all objects that may become sufficiently relevant at some point in the future due to the expiration of older objects (see, e.g., [6]), or alternatively use constrained memory and provide probabilistic guarantees [20, 15]. A recent work [16] proposed a solution for the top-k/w publish-subscribe problem based on the Threshold Algorithm (TA) [13], which is similar in spirit to our solution, but relies on keeping posting lists sorted according to the current minimum score in the top-k sets of subscriptions, which changes frequently.

We propose a different model, which is more natural for tweets (and perhaps for other published content) and does not require re-evaluation, where the published events (e.g., tweets) do not have a fixed expiration time. Instead, time is a part of the relevance score, which gradually decays with time. The decay rate is the same for all published events (objects) for a given subscription, and therefore older events retire from the top-k only when new events that score higher arrive and take their place. This makes re-evaluation unnecessary, and does not require storing previously seen events unless they are currently in the top-k. This, together with DAAT and TAAT algorithms that do not require re-sorting posting lists and thus are more efficient in our setting, makes our solution more than an order of magnitude faster in a similar setting, on comparable hardware, and on a somewhat larger dataset than [16]. Specifically, the algorithms in [16] were evaluated on shorter subscriptions of 3 to 5 random terms selected from a relatively small set of 657 unique terms, whereas the average subscription in our smallest index, Keywords, had 16 terms from a set of 83,109 unique terms.

Query indexing is a popular approach, especially in the context of continuous queries over event streams, where it is simply impossible to store all events. Previous works, however, focused on low-dimensional data. Earlier works employed indexing techniques that performed well with up to 10 dimensions but performed worse than a linear scan of the queries for a higher number of dimensions [24], and later works, such as VA-files [24] (used, e.g., in [20, 6]), were shown to perform well with up to 50 dimensions. It was also shown that the latency of matching events to subscriptions in the top-k/w model increases linearly with the number of dimensions [20]. Finally, the number of supported subscriptions was mostly relatively low (up to a few thousand in [20, 6]). Our work considers highly-dimensional data, and we evaluate our approach on news articles and tweets in English (therefore the number of dimensions is in the hundreds of thousands), with 100K subscriptions (news articles).

A different notion of ranked content-based pub-sub systems was introduced in [19]. This system produces a ranked list of subscriptions for a given item, whereas we produce a ranked set of published items for each subscription. In [19], subscriptions were defined using a more general query language, allowing ranges over numeric attributes to be specified. To support such complex queries, custom indices and algorithms were designed, unlike our approach of adapting standard inverted indexing and top-k retrieval algorithms.

An interesting related problem is providing search results retroactively for standing user interests or queries, e.g., as in Google Alerts [14].


Yang et al. [25] identify such queries automatically from user search logs. Such systems typically re-execute standing queries periodically against the search engine to find new relevant results [25]. Although the problem we consider in this paper is substantially different, we believe that our approach can provide an alternative that does not require re-execution: each standing query can be indexed similarly to the way we index news articles, and new documents can be matched to standing queries similarly to the matching of relevant tweets to stories in our system. Then, if the new document scores among the top-k documents maintained for this query, the user issuing the query can be notified. Comparing these approaches in practice is an interesting direction for future work.

6. CONCLUSION

In this paper we dealt with the problem of real-time annotation of online news stories with tweets and introduced a solution using the top-k pub-sub paradigm. Annotations are related to stories by building an index of the news stories as subscriptions and evaluating incoming tweets as published content. This approach is more efficient for high-volume websites than the classical solution based on real-time incremental indexing of tweets: we match tweets with news articles when new tweets arrive, and not during the serving of pageviews. Our solution proactively maintains the annotation sets of stories under a high volume of Twitter updates, allowing for efficient serving of pageviews while guaranteeing maximal freshness of annotations. We presented variations of four prevalent top-k document retrieval algorithms adapted to the publish-subscribe setting and showed how this adaptation leads to a very significant reduction in processing time. Evaluation on a real-world corpus of news stories and on a log of tweets validated the effectiveness of our approach.

    7. ACKNOWLEDGEMENTS

We thank Edward Bortnikov, Ronny Lempel, and Benjamin Reed for helpful discussions and valuable comments.

    8. REFERENCES

[1] M. K. Aguilera, R. E. Strom, D. C. Sturman, M. Astley, and T. D. Chandra. Matching events in a content-based subscription system. In PODC, pages 53-61, 1999.

[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, pages 1-16, 2002.

[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.

[4] G. Banavar, T. Chandra, B. Mukherjee, J. Nagarajarao, R. E. Strom, and D. C. Sturman. An efficient multicast protocol for content-based publish-subscribe systems. In ICDCS, pages 262-272, 1999.

[5] R. Blanco, E. Bortnikov, F. Junqueira, R. Lempel, L. Telloli, and H. Zaragoza. Caching search engine results over incremental indices. In WWW, pages 82-89, 2010.

[6] C. Böhm, B. C. Ooi, C. Plant, and Y. Yan. Efficiently processing continuous k-NN queries on data streams. In ICDE, pages 156-165, 2007.

[7] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. Efficient query evaluation using a two-level retrieval process. In CIKM, pages 426-434, 2003.

[8] C. Buckley and A. F. Lewit. Optimization of inverted vector searches. In SIGIR, pages 97-110, 1985.

[9] D. Carmel and E. Amitay. Juru at TREC 2006: TAAT versus DAAT in the Terabyte track. In TREC, 2006.

[10] A. Carzaniga and A. L. Wolf. Forwarding in a content-based network. In SIGCOMM, pages 163-174, 2003.

[11] Y. Diao, M. Altinel, M. J. Franklin, H. Zhang, and P. Fischer. Path sharing and predicate evaluation for high-performance XML filtering. TODS, 28:467-516, 2003.

[12] F. Fabret, H. A. Jacobsen, F. Llirbat, J. Pereira, K. A. Ross, and D. Shasha. Filtering algorithms and implementation for very fast publish/subscribe systems. In SIGMOD, pages 115-126, 2001.

[13] R. Fagin, A. Lotem, and M. Naor. Optimal aggregation algorithms for middleware. In PODS, pages 102-113, 2001.

[14] Google Alerts. http://alerts.google.com/.

[15] P. Haghani, S. Michel, and K. Aberer. Evaluating top-k queries over incomplete data streams. In CIKM, pages 877-886, 2009.

[16] P. Haghani, S. Michel, and K. Aberer. The gist of everything new: personalized top-k processing over web 2.0 streams. In CIKM, pages 489-498, 2010.

[17] B. Keane. Twitter v the MSM: covering Gaddafi's war against reality. http://www.crikey.com.au/2011/03/21/twitter-v-the-msm-covering-gaddafis-war-against-reality/, 2011.

[18] P. Lacour, C. Macdonald, and I. Ounis. Efficiency comparison of document matching techniques. In Efficiency Issues in Information Retrieval Workshop, European Conference for Information Retrieval, 2008.

[19] A. Machanavajjhala, E. Vee, M. N. Garofalakis, and J. Shanmugasundaram. Scalable ranked publish/subscribe. PVLDB, 1(1):451-462, 2008.

[20] K. Pripuzic, I. P. Zarko, and K. Aberer. Top-k/w publish/subscribe: finding k most relevant publications in sliding time window w. In DEBS, pages 127-138, 2008.

[21] A. C. Snoeren, K. Conley, and D. K. Gifford. Mesh-based content routing using XML. In SOSP, pages 160-173, 2001.

[22] T. Strohman, H. R. Turtle, and W. B. Croft. Optimization strategies for complex queries. In SIGIR, pages 219-225, 2005.

[23] H. R. Turtle and J. Flood. Query evaluation: strategies and optimizations. Information Processing and Management, 31(6):831-850, 1995.

[24] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194-205, 1998.

[25] B. Yang and G. Jeh. Retroactive answering of search queries. In WWW, pages 457-466, 2006.