[undergraduate thesis] final defense presentation on cloud publish/subscribe model for top-k...

1

Cloud based publish/subscribe model for Top-k matching over continuous data streams

Author: Y.S. Horawalavithana (10002103)

Supervisor: Dr. D.N. Ranasinghe

U/Graduate Thesis Defense, January 23, 2015

UNIVERSITY OF COLOMBO SCHOOL OF COMPUTING
SCS 4001: INDIVIDUAL PROJECT

2

Overview
• Motivation
• Target
• Design & Architecture
• Related work
• Dynamic Diversification
• Incremental Top-k
• Implementation
• Evaluation
• Conclusion
• Future work

3

Motivation – “Big Filter”


4

Boolean publish/subscribe

Drawbacks
• A subscriber may be either overloaded with publications or receive too few publications
• Different matching publications cannot be compared, as ranking functions are not defined
• Partial matching between subscriptions and publications is not supported


5

Top-k publish/subscribe
• Expressive stateful query processing systems
• User defined parameter k restricts the delivered publications
• Pub/Sub Matching
  - Top-k pub/sub scoring or ranking
• Pub/Sub Indexing
  - Indexing to support personalized subscriptions
  - Indexing to support continuous Top-k publications retrieval


http://googleresearch.blogspot.com/2014/06/influential-papers-for-2013.html

6

Target

1. How can we define an efficient scoring algorithm that integrates query-independent and query-dependent score metrics?
   - Relevance, Freshness & Diversity

2. How can we adapt the existing indexing data structures used in state-of-the-art publish/subscribe systems to support Top-k matching queries under
   a) large subscription volume,
   b) high event rate, and
   c) the variety of subscribable attributes?


7

Scope
• Optimize the Top-k heuristic for a specific domain
  - E-commerce with buyers & sellers
• Subscriptions & publications follow a pre-defined data structure
• The number of incoming publications is modeled as a Poisson random variable
• Retrieve Top-k publications against subscriptions, not the reverse


8

Design & Architecture

[Figure: system architecture. Component labels: Subscription Store, Subscription Indexing, Personalized Subscription, Publication Stream, Relevance Matching, Relevancy, Matching Publication Store, Publication (Relevance Score), Publication Indexing, Dissimilarity, Top-k Continuous Diversity, Sliding window, Expire, Expire Publication Store, Top-k Notification Store, Event Delivery, Notification (×3)]

9

Related work: General Top-k publish/subscribe

Compared on: subscription, timing policy, diversity / scoring metric, subscription indexing method, incremental publication indexing, architecture.

• PrefSIENA (Drosou, ACM DEBS 2009): preferential subscription; sliding window; relevancy + MAXMIN diversity; subscription covering; centralized message-brokers
• RRPS (Lu, ICCSA 2009): normal subscription; continuous; QoS; centralized
• DaZaLaPs (Pripuzi, IS 2012): normal subscription; sliding window; relevancy; grid based; P2P
• Top-k pub/sub (Shraer [Google], VLDB 2014): normal subscription; continuous; relevancy + freshness; tree based; TAAT & DAAT; centralized
• Our model: personalized subscription space; sliding window; MAXDIVREL diversity; inverted-list based; hashing based; cloud based


10

Sliding window Top-k computation

[Figure: matching publication stream P1, P2, …, P10, … shown at three window positions, with publications marked Expired, Active or Top-k; the Top-2 results at the three positions are (P5, P1), (P5, P6) and (P5, P9); jumping step (h) illustrated with h=1 and h=3]

Top-k notifications delivery
• On-demand
• Pro-active
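A minimal sketch of the sliding-window idea above, assuming each matching publication already carries a score. The function name, the window size, and the scores are hypothetical; the window is simply re-ranked at every position, which is the baseline the later slides try to avoid.

```python
from typing import List, Tuple

def sliding_top_k(stream: List[Tuple[str, float]], window: int, k: int, step: int = 1):
    """Return the ids of the top-k publications at each window position.

    stream : (publication id, score) pairs in arrival order
    window : number of active publications per window
    step   : jumping step h between consecutive window positions
    """
    results = []
    for start in range(0, len(stream) - window + 1, step):
        active = stream[start:start + window]      # older items have expired
        top = sorted(active, key=lambda p: p[1], reverse=True)[:k]
        results.append([pid for pid, _ in top])
    return results

# Hypothetical scores, chosen only for illustration.
stream = [("P1", 0.8), ("P2", 0.1), ("P3", 0.2), ("P4", 0.3), ("P5", 0.9),
          ("P6", 0.7), ("P7", 0.4), ("P8", 0.5), ("P9", 0.6), ("P10", 0.1)]

top2_h1 = sliding_top_k(stream, window=5, k=2, step=1)
top2_h3 = sliding_top_k(stream, window=5, k=2, step=3)
```

With h=1 every arrival triggers a new ranking; with h=3 the window jumps three publications at a time, so fewer rankings are produced.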

11

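Relevancy: Personalized Subscription space

[Figure: personalized subscription space. Labels with preference values: Carrier = AT&T (0.4), Brand = HTC (0.3), Storage (0.7), Carrier = Verizon (0.5), Storage 32GB (0.2), Storage 32GB (0.6), Brand = HTC (0.3); scores shown: 1.75, 1.3, 2.3, 2.52; action: Subscribe]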

12

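Relevancy: Personalized Subscription space

[Figure: personalized subscription space. Labels: Carrier = Verizon, Storage 32GB, Carrier = AT&T, Storage, Brand = HTC; scores shown: 2, 2.5, 1.75, 1.3, 2.3; item shown: Carrier = Verizon, Color = White, OS = Android, Storage = 16GB, Brand = HTC; action: Subscribe]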


13

Subscription Indexing: Modified opIndex
• Based on inverted lists (posting lists)
• Two-level partitioning
  - Attribute posting list
  - Operator posting list
• Locate satisfying subscription tuples
• Relevancy score
  - By satisfying relations
  - By satisfying subscription tuples
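The two-level partitioning can be pictured with a small inverted index keyed first by attribute and then by operator. This is only a simplified sketch, not the modified opIndex itself: the class name `TwoLevelIndex` is hypothetical, and scoring a subscription by the fraction of its predicates that a publication satisfies is an assumption made for illustration.

```python
from collections import defaultdict
import operator

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt,
       "<=": operator.le, ">=": operator.ge}

class TwoLevelIndex:
    """Hypothetical inverted index: attribute -> operator -> posting list."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(list))
        self.sizes = {}  # subscription id -> number of predicates

    def add(self, sub_id, predicates):
        """predicates: list of (attribute, operator, value) tuples."""
        self.sizes[sub_id] = len(predicates)
        for attr, op, value in predicates:
            self.postings[attr][op].append((value, sub_id))

    def match(self, publication):
        """Return {subscription id: fraction of its predicates satisfied}."""
        hits = defaultdict(int)
        for attr, pub_value in publication.items():
            # First level: attribute; second level: operator.
            for op, entries in self.postings.get(attr, {}).items():
                for value, sub_id in entries:
                    if OPS[op](pub_value, value):
                        hits[sub_id] += 1
        return {s: n / self.sizes[s] for s, n in hits.items()}

index = TwoLevelIndex()
index.add("s1", [("carrier", "=", "Verizon"), ("storage", ">=", 32)])
index.add("s2", [("brand", "=", "HTC")])
scores = index.match({"carrier": "Verizon", "storage": 16, "brand": "HTC"})
```

Only posting lists for attributes that appear in the publication are visited, and a partially satisfied subscription (here `s1`) still receives a score, which is the partial matching that the Boolean model lacks.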


14

Freshness
• When the window becomes larger, older publications may prevent newer publications from entering the Top-k results
• Lease relevancy scores? But then scores have to be re-calculated
• Forward decaying!

Fresh-relevancy score = relevancy score × freshness score
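A sketch of forward decay, assuming an exponential decay function, a fixed landmark time, and that the relevancy and freshness scores combine by multiplication. The function names and the rate `lam` are hypothetical. The point is that the freshness weight depends only on the arrival time, so stored scores never need to be re-calculated as time advances.

```python
import math

def forward_weight(arrival: float, landmark: float, lam: float = 0.1) -> float:
    """g(t_i - L) for an exponential g; fixed once the item has arrived."""
    return math.exp(lam * (arrival - landmark))

def fresh_relevancy(relevancy: float, arrival: float, landmark: float,
                    lam: float = 0.1) -> float:
    """Relevancy weighted by the forward-decay freshness weight."""
    return relevancy * forward_weight(arrival, landmark, lam)

def normalized(score: float, now: float, landmark: float, lam: float = 0.1) -> float:
    """Divide by g(now - L) only when an absolute value is needed.

    The divisor is the same for every item, so it never changes the ranking.
    """
    return score / forward_weight(now, landmark, lam)

old = fresh_relevancy(0.9, arrival=0.0, landmark=0.0)   # older, more relevant
new = fresh_relevancy(0.8, arrival=5.0, landmark=0.0)   # newer, less relevant
```

In this example the newer publication outranks the older one even though its raw relevancy is lower.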


15

Diversity: Top-k representative set

Method to retrieve Top-k publications from matching publications

[Figure: representative Top-k. Panels: Drawback (without diversity); What we want (with diversity)]


16

MAX* k-diversity problem

Find:

$$S^* = \operatorname*{argmax}_{S \subseteq P,\ |S| = k} f(S, d)$$

where
1. P = {p1, …, pn}
2. k ≤ n
3. d: a distance metric
4. f: a diversity function


17

Proposed: MAXDIVREL k-diversity problem

$$S^* = \operatorname*{argmax}_{S \subseteq P} f(S, d, r) = \operatorname*{argmax}_{S \subseteq P} \frac{g(S, d, r)}{h(S, d, r)}$$

• g(S, d, r): maximize the dis-similarity & relevancy in S
• h(S, d, r): minimize the dis-similarity & relevancy in P − S

where
1. P = {p1, …, pn}
2. d: a distance metric
3. r: a relevance metric
4. f: a diversity function


18

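Formal Definition: MAXDIVREL k-diversity

$$g(S, d, r) = \operatorname{argmax} \frac{1}{|S|} \sum_{p_i, p_j \in S} \frac{r(p_i)}{r(p_j)}\, d(p_i, p_j), \quad \text{independence holds}$$

$$h(S, d, r) = \operatorname{argmin} \frac{1}{|P - S|} \sum_{p_i \in S,\ p_j \in P - S} \frac{r(p_i)}{r(p_j)}\, d(p_i, p_j), \quad \text{dominance holds}$$

where
1. P = {p1, …, pn}
2. d: a distance metric
3. r: a relevance metric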

Independence condition:

Dominance condition:


19

NP-Hardness: Minimum independent-dominating set

$$\mathrm{Neighborhood}(p_i) = \{\, p_j \mid d(p_i, p_j) \le \alpha,\ p_i \ne p_j \,\}$$

[Figure: publication space with publications p1–p5 and radius α, and the corresponding graph model with vertices v1–v5; four example vertex selections, three labelled "Independent, dominating" and one labelled "Dominating, not independent"]


20

NAÏVE Greedy

$$\operatorname{argmax}\ r(p_i)^2 \sum_{p_j \in N(p_i)} r(p_j) \times d(p_i, p_j)$$
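A rough sketch of a greedy pass that builds an independent, dominating set: repeatedly pick the best-scoring publication and drop everything within distance α of it. The selection loop is a generic greedy scheme, not necessarily the thesis implementation. The scoring function is passed in as a parameter because the exact form of the slide's score is uncertain; `slide_order_score` below is one assumed reading of it, and all names and data are hypothetical.

```python
def naive_greedy(items, dist, alpha, score):
    """Greedy independent-dominating set over `items`.

    dist(i, j)     -> distance between two items
    score(i, nbrs) -> value used to pick the next item, given its
                      neighbours among the items still uncovered
    """
    remaining = sorted(items)
    selected = []
    while remaining:
        pool = remaining

        def neighbours(i):
            return [j for j in pool if j != i and dist(i, j) <= alpha]

        best = max(pool, key=lambda i: score(i, neighbours(i)))
        covered = set(neighbours(best)) | {best}   # best dominates its neighbourhood
        selected.append(best)
        remaining = [j for j in pool if j not in covered]
    return selected

# Hypothetical one-dimensional publication space.
position = {"a": 0.0, "b": 0.1, "c": 0.2, "d": 1.0, "e": 1.1}
relevance = {"a": 0.5, "b": 0.9, "c": 0.4, "d": 0.7, "e": 0.6}

def dist(i, j):
    return abs(position[i] - position[j])

def slide_order_score(i, nbrs):
    # Assumed reading of the slide: r(p_i)^2 * sum of r(p_j) * d(p_i, p_j).
    return relevance[i] ** 2 * sum(relevance[j] * dist(i, j) for j in nbrs)

result = naive_greedy(position, dist, 0.25, slide_order_score)
```

Every call recomputes neighbourhoods from scratch, which is what makes this approach expensive when it has to be repeated for each window.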


21

Handling streaming publications

[Figure: publication space p1–p5 with radius α and its graph model v1–v5; after a new publication p6 arrives, the graph model contains vertex v6]

Continuity Requirements
1. Durability
   An item selected as diverse in one window may still be selected in a later window, provided it has not expired and the other valid items in that window fail to compete with it.
2. Order
   The publication stream follows chronological order. We avoid selecting an item j as diverse later when we have already selected an item i that is not older than j.


22

MAXDIVREL continuous k-diversity

[Figure: matching publication stream P1, P2, P3, P4, …, Pj, Pj+1, … with the i-th window and the (i+1)-th window; MAXDIVREL k-diversity applied to each window yields S*_i and S*_{i+1}; constraints shown: Independence, Dominance, Durability, Order]

• Straightforward solution: apply the naïve greedy method at each instance
• Proposed: an incremental index mechanism that avoids the curse of re-calculating neighborhoods


23

Locality Sensitive Hashing (LSH)

Simple idea: if two points are close together, then after a “projection” operation these two points will remain close together


24

LSH Analysis For any given points

• Hash function h is (, ) sensitive, • Ideally we need• to be large• to be small


25

LSH in MAXDIVREL: Publications as categorical data


26

LSH in MAXDIVREL: Characteristic Matrix


27

LSH in MAXDIVREL: Minhashing
• No publications any more! A signature represents each publication
• Technique
  - Randomly permute the rows of the characteristic matrix m times
  - For each publication's column, take the number of the first row, in the permuted order, in which the column has a 1
• Advantage: reduces the dimensions to a small minhash signature

[Figure: first permutation of rows of the characteristic matrix]

28

LSH in MAXDIVREL: Signature Matrix

Fast-minhashing
• Select m random hash functions to model the effect of m random permutations
• Mathematically proved only when the number of rows is a prime
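A small sketch of fast minhashing, assuming hash functions of the common form h(row) = (a·row + b) mod p, where p is the (prime) number of rows. With p prime and a ≠ 0, each such function is a true permutation of the row numbers, which is one way to read the primality condition above. The function names and example data are hypothetical.

```python
import random

def make_hash_functions(m: int, prime: int, seed: int = 0):
    """Draw m (a, b) pairs defining h(row) = (a * row + b) % prime, a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(m)]

def minhash_signature(rows_with_one, hash_functions, prime: int):
    """Signature of one publication column.

    rows_with_one: row numbers (0 .. prime-1) where the column holds a 1.
    For each hash function, keep the smallest hashed row number, i.e. the
    first row in the simulated permuted order that contains a 1.
    """
    return [min((a * r + b) % prime for r in rows_with_one)
            for a, b in hash_functions]

def estimated_similarity(sig_x, sig_y) -> float:
    """Fraction of signature positions on which two columns agree."""
    return sum(1 for u, v in zip(sig_x, sig_y) if u == v) / len(sig_x)

PRIME = 11                                   # hypothetical number of rows
hashes = make_hash_functions(m=4, prime=PRIME, seed=1)
sig_x = minhash_signature({0, 3, 7}, hashes, PRIME)
sig_y = minhash_signature({0, 3, 8}, hashes, PRIME)
```

The characteristic matrix is never permuted physically; only the m hashed row numbers are compared, so a signature costs m passes over the rows that contain a 1.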


29

LSH in MAXDIVREL: LSH Buckets
• Take r-sized signature vectors from the m-sized minhash signature
• Map them into L hash tables, each with an arbitrary number b of buckets
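The bucket mapping can be sketched as follows, assuming the signature is cut into L consecutive slices of r values and that each slice is hashed into one of b buckets of its own table. The function name and the use of Python's built-in `hash` on a tuple of integers are illustrative choices, not taken from the slides.

```python
def bucket_ids(signature, r: int, L: int, b: int):
    """Return one bucket number per hash table for a minhash signature.

    Table t uses the r-sized slice signature[t*r : (t+1)*r].
    """
    if len(signature) < r * L:
        raise ValueError("signature too short for L slices of size r")
    ids = []
    for t in range(L):
        band = tuple(signature[t * r:(t + 1) * r])
        ids.append(hash(band) % b)   # hashing a tuple of ints is deterministic
    return ids

# Hypothetical 6-value signatures: the first slice is identical, so the two
# publications are guaranteed to share a bucket in table 0.
x = [1, 2, 3, 4, 5, 6]
y = [1, 2, 9, 9, 5, 7]
ids_x = bucket_ids(x, r=2, L=3, b=16)
ids_y = bucket_ids(y, r=2, L=3, b=16)
```

Two publications become candidate neighbours as soon as they share a bucket in at least one table.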


30

LSH in MAXDIVREL: How to select L, r?

For two vectors x,y

1. ?

2)

31

LSH in MAXDIVREL: Analysis

For publications x & y whose minhash values agree with probability s:
• At a particular hash table
  - x & y map into the same bucket: $s^r$
  - x & y do not map into the same bucket: $1 - s^r$
• At L hash tables
  - x & y do not map into the same bucket in any table: $(1 - s^r)^L$
• True near neighbors are unlikely to be unlucky in all the projections
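The effect of r and L can be checked numerically with the standard minhash-LSH estimate. Here s is assumed to be the probability that a single minhash value agrees for x and y, and collisions caused by the finite number of buckets are ignored; the function name is hypothetical.

```python
def candidate_probability(s: float, r: int, L: int) -> float:
    """Probability that x and y share a bucket in at least one of L tables.

    s ** r           : all r values of one slice agree (same bucket in a table)
    (1 - s ** r) ** L: no table puts them in the same bucket
    """
    return 1.0 - (1.0 - s ** r) ** L

near = candidate_probability(0.8, r=5, L=20)   # similar pair
far = candidate_probability(0.2, r=5, L=20)    # dissimilar pair
```

Raising r makes each table more selective, while raising L gives near neighbours more chances to meet, so the two parameters are tuned together.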


32

LSH in MAXDIVREL: Batch-wise Top-k computation
• Bucket “Winner”: the publication with the highest relevancy score in a bucket
• The winner is dominant and represents its bucket neighborhood
• Top-k: the k “winners” that have a majority of votes
• The k winners are independent

[Figure: publications P_A, P_B, …, P_H, … in the i-th window]
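A sketch of the batch-wise voting step, assuming each publication in the window already has one bucket number per hash table. Each bucket elects its most relevant member and every election counts as one vote. The names, the tie-breaking rules and the data are hypothetical, and independence between the returned winners is not checked here.

```python
from collections import Counter, defaultdict

def batch_top_k(bucket_ids, relevance, k):
    """Rank bucket winners by the number of buckets they win.

    bucket_ids: {publication: [bucket in table 0, bucket in table 1, ...]}
    relevance : {publication: relevancy score}
    """
    buckets = defaultdict(list)
    for pub, ids in bucket_ids.items():
        for table, bucket in enumerate(ids):
            buckets[(table, bucket)].append(pub)

    votes = Counter()
    for members in buckets.values():
        winner = max(members, key=lambda p: (relevance[p], p))
        votes[winner] += 1          # one vote per bucket won

    ranked = sorted(votes, key=lambda p: (-votes[p], -relevance[p], p))
    return ranked[:k]

# Hypothetical window with two hash tables.
bucket_ids = {"A": [0, 1], "B": [0, 1], "C": [2, 1], "D": [3, 4]}
relevance = {"A": 0.9, "B": 0.5, "C": 0.7, "D": 0.6}
top2 = batch_top_k(bucket_ids, relevance, k=2)
```

Here A wins both of its buckets and D wins both of its own, so they outrank C, which wins only one bucket.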


33

LSH in MAXDIVREL: Incremental Top-k computation

1. New publication i arrives
2. Update the i-th characteristic vector (Characteristic Matrix)
3. Generate the i-th minhash signature (Signature Matrix)
4. Map the signature into L hash-tables
5. Update the “Winner” at each bucket the signature maps into
6. Vote for the Top-k candidate
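The incremental update can be sketched as a small class that keeps one winner per bucket and a running vote count, so that a new publication touches only the L buckets it maps into. The class name is hypothetical, bucket numbers are assumed to be computed elsewhere, and expiry of old publications is left out to keep the sketch short.

```python
from collections import Counter

class IncrementalIndex:
    """Hypothetical per-bucket winner table with running vote counts."""

    def __init__(self):
        self.winner = {}        # (table, bucket) -> (relevance, publication)
        self.votes = Counter()  # publication -> number of buckets it wins

    def insert(self, pub, relevance, bucket_ids):
        """bucket_ids: one bucket number per hash table for this publication."""
        for key in enumerate(bucket_ids):       # key is (table, bucket)
            current = self.winner.get(key)
            if current is None or relevance > current[0]:
                if current is not None:
                    self.votes[current[1]] -= 1  # previous winner loses a vote
                self.winner[key] = (relevance, pub)
                self.votes[pub] += 1

    def top_k(self, k):
        live = [(v, p) for p, v in self.votes.items() if v > 0]
        return [p for v, p in sorted(live, key=lambda t: (-t[0], t[1]))[:k]]

index = IncrementalIndex()
index.insert("A", 0.9, [0, 1])
index.insert("B", 0.5, [0, 1])
before = index.top_k(1)
index.insert("F", 0.95, [0, 7])   # new arrival displaces A in one bucket
after = index.top_k(2)
```

No neighbourhood is recomputed on arrival; only the affected buckets change their vote.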

34

LSH in MAXDIVREL: When new publication F arrives…
• Only buckets will vote
• Follow continuity requirements
  - Durability
  - Order

[Figure: publications P_A, P_B, …, P_H, … with the i-th window and the (i+1)-th window]


35

Implementation


36

Cloud service modules

[Figures: cloud service diagrams. Source: Amazon Kinesis; Source: Amazon ElastiCache]


37

Evaluation: Dataset

Amazon on-line marketplace data available on 17th – 19th November 2014
• Publication stream
• Zipfian subscriptions
• Normalized preferences

Zipf distribution parameters:
N - number of elements in distribution,
k - rank of element
s - value of exponent
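For reference, the parameters N, k and s match the usual Zipf probability mass function, f(k; s, N) = (1/k^s) / Σ_{n=1..N} (1/n^s). The sketch below assumes that this standard form is the one intended; the function name is hypothetical.

```python
def zipf_pmf(k: int, s: float, N: int) -> float:
    """Probability of the element with rank k among N elements, exponent s."""
    normalizer = sum(1.0 / n ** s for n in range(1, N + 1))
    return (1.0 / k ** s) / normalizer

# Hypothetical example: 10 ranked elements with exponent 1.
probabilities = [zipf_pmf(k, s=1.0, N=10) for k in range(1, 11)]
```

On a log-log plot of probability against rank, such a distribution appears as a straight line with slope −s.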

38

Evaluation: Methodology
• Subscriber Effectiveness
  - Quality
  - Accuracy
  - Resiliency
  - Freshness
• Performance & Efficiency
  - Index construction time
  - Top-k matching time

Platform: Amazon AWS Linux based micro-node instances, each with 2.3 GHz, 8GB memory

Algorithms are implemented in Java


39

Subscriber Effectiveness: Quality or natural behavior

Testing the Zipf or power law hypothesis on the distribution of ranked results (KS test)
i. Fitting power law
ii. Goodness of fit tests
iii. Alternative distributions

Compute 19030 ranked distributions over a 100K publication stream
• Under different subscriber views
• Under different sized sliding window instances

[Figure: sample distribution of ranked votes; x-axis: log (rank), y-axis: log zipf_prob(rank)]

N - number of elements in distribution,
k - rank of element
s - value of exponent

40

Subscriber Effectiveness: i. Fitting power law

Zipf exponent values


41


Illustration of Zipf exponent values convergence


42


Zipf exponent values under different similarity threshold


43

Subscriber Effectiveness: ii. Goodness of fit tests

P-values of KS test under different subscriber views


44

Subscriber Effectiveness: iii. Testing alternative distributions


45

Subscriber Effectiveness: Other diversity based methods
• P-dispersion problem
  - MAXMIN
  - MAXSUM
• Minimum independent-dominating set problem
  - MAXDIVREL
  - DisC

For an even comparison, relevancy is combined into every diversity method to achieve a bi-criteria objective.

[Figure: average Zipf law exponent in comparison with other methods]


46

Subscriber Effectiveness: Other diversity based methods
• P-dispersion problem
  - MAXMIN
  - MAXSUM
• Minimum independent-dominating set problem
  - MAXDIVREL
  - DisC

[Figure: a comparison of average Zipf law exponent with other methods]


47

Subscriber Effectiveness: Accuracy of Top-k results

LSH Index vs. NAÏVE Rank probability Diversity probability

Accuracy on similarity threshold = 0.55 Accuracy on similarity threshold = 0.85


48

Subscriber Effectiveness: Resiliency of Top-k results

Getting Top-k publications (Unordered) Getting Top-k publications (Ordered)

49

Performance: Subscription index update time

[Figures: index construction time on opIndex vs. modified opIndex; opIndex vs. modified opIndex]


50

Efficiency: Initial matching time at modified opIndex

Initial matching time under different size of subscription spaces Initial matching time under different size of publications


51

Performance & Efficiency: LSH Index

BLSH index construction + update time on different number of minhash functions

Number of minhash functions (m) =

How much accuracy do we sacrifice by comparing small minhash signatures?


52

Performance & Efficiency: ILSH vs. BLSH vs. NAÏVE

[Figure: publication stream P1, P2, …, P8, …; the label "BLSH or NAIVE" appears at four window positions and the label "ILSH" appears once]


53

Performance & Efficiency: BLSH vs. NAÏVE

log (Ranking time) on number of publications with D=250 log (Ranking time) on number of publications with D=500


54

Performance & Efficiency: ILSH vs. BLSH vs. NAÏVE

log (Ranking time) on number of publications with D=250 log (Ranking time) on number of publications with D=500


55

Conclusions
• Diversified results produced by MAXDIVREL, based on the independent-dominating set problem
  - Exhibit stronger natural behavior than methods based on the p-dispersion problem
• Relevancy is an important factor to employ in distance based diversity methods
  - Always has the tendency to produce a diverse set of personalized results
• Absolute ranks are sensitive to the preference value
  - While keeping the deviation small among relative ranks


56

Conclusions (Ctd.)
• Locality Sensitive Hashing (LSH) indexing method
  - Produces the MAXDIVREL diverse set of results at an average of 70% accuracy relative to the naïve method
  - Reduces the matching time very significantly compared with the NAÏVE method
• Further refined by its incremental version
  - For handling streaming publications
  - Avoids the curse of re-computing neighborhoods
• No such k to restrict the delivery of Top publications
  - Given a window size & delivery method, the model can produce the best diverse set of personalized results to represent the set of all matching publications at a given instance

57

Major Contributions
• Dynamic diversification method based on the independent-dominating set problem
  - We introduced a novel diversity definition based on representative neighborhoods, called MAXDIVREL k-diversity, employing relevancy
• Index based diversification approach to rank results incrementally
  - We proposed a novel, hashing based index approach to solve the MAXDIVREL continuous k-diversity problem, based on the Locality Sensitive Hashing (LSH) technique
• Advanced evaluation method to measure the quality of diverse results
  - First significant attempt to model the natural behavior of diversity methods in the pub/sub community

58

Future work
• Explore other suitable use-cases to apply the proposed model & develop prototype applications, e.g.
  - Personalized newspaper for every Facebook user
  - Diverse set of personalized Twitter trends
  - Social annotation of news stories
• Exploit overlap among diversified results of users who have similar interests
• Employ existing implicit methods to extract human preferences, e.g. click stream analytics
• Develop an LSH based index over a multi-threaded distributed environment


59

Q&A

THANK YOU!

60

Appendix: Freshness

Mean delay between publications = 5000ms

[Figure: a comparison between relevancy scores after being influenced by freshness]

61

Appendix: NAÏVE Ranking time

Average naïve Top-k matching time in comparison with size D of publications

62

Appendix: BLSH Ranking time

Average BLSH Top-k matching time in comparison with size D of publications

63

Appendix: ILSH Ranking time

Average ILSH Top-k matching time in comparison with size D of publications
