[undergraduate thesis] final defense presentation on cloud publish/subscribe model for top-k...


Upload: yasanka-sameera-horawalavithana

Post on 09-Aug-2015


1

Cloud based publish/subscribe model for Top-k matching over continuous data streams

Author: Y.S. Horawalavithana, 10002103

Supervisor: Dr. D.N. Ranasinghe

Undergraduate Thesis Defense, January 23, 2015

UNIVERSITY OF COLOMBO SCHOOL OF COMPUTING, SCS 4001: INDIVIDUAL PROJECT

2

Overview
• Motivation
• Target
• Design & Architecture
• Related work
• Dynamic Diversification
• Incremental Top-k
• Implementation
• Evaluation
• Conclusion
• Future work

3

Motivation: "Big Filter"

1.Motivation 2.Target 3.Design & Architecture 4.Related work 5.Dynamic Diversification 6.Incremental Top-k 7.Implementation 8.Evaluation 9.Conclusion 10.Future Work

4

Boolean publish/subscribe

Drawbacks
• A subscriber may be either overloaded with publications or receive too few publications
• It is impossible to compare different matching publications, as ranking functions are not defined
• Partial matching between subscriptions and publications is not supported

5

Top-k publish/subscribe
• Expressive stateful query processing systems
• User-defined parameter k restricts the delivered publications

Pub/Sub Matching
• Top-k pub/sub scoring or ranking

Pub/Sub Indexing
• Indexing to support personalized subscriptions
• Indexing to support continuous Top-k publication retrieval

6

Target

1. How can we define an efficient scoring algorithm that integrates both query-independent and query-dependent score metrics?
   Relevance, Freshness & Diversity

2. How can we adapt the existing indexing data structures used in state-of-the-art publish/subscribe systems to support Top-k matching queries under
   a) large subscription volume,
   b) high event rate, and
   c) the variety of subscribable attributes?

7

Scope
• Optimize the Top-k heuristic for a specific domain: e-commerce with buyers & sellers
• Subscriptions & publications follow a pre-defined data structure
• The number of incoming publications follows a Poisson distribution
• Retrieve Top-k publications against subscriptions, not the reverse

8

Design & Architecture

[Figure: architecture diagram. Component labels: Subscription Store, Subscription Indexing, Relevance Matching, Publication Stream, Matching Publication Store (publications with relevance scores), Publication Indexing, continuous Top-k Diversity (dissimilarity and relevancy, per personalized subscription), Top-k Notification Store, Event Delivery (notifications), Expire Publication Store, Sliding window.]

9

Related work: General Top-k publish/subscribe

Pub/sub model | Subscription | Timing policy | Diversity / scoring metric | Subscription indexing method | Incremental publication indexing | Architecture
PrefSIENA (Drosou, ACM DEBS 2009) | Preferential subscription | Sliding window | Relevancy + MAXMIN diversity | Subscription covering | - | Centralized message-brokers
RRPS (Lu, ICCSA 2009) | Normal | Continuous | QoS | - | - | Centralized
DaZaLaPs (Pripužić, IS 2012) | Normal | Sliding window | Relevancy | Grid based | - | P2P
Top-k pub/sub (Shraer [Google], VLDB 2014) | Normal | Continuous | Relevancy + Freshness | Tree based | TAAT & DAAT | Centralized
Our model | Personalized subscription space | Sliding window | MAXDIVREL diversity | Inverted-list based | Hashing based | Cloud based

10

Sliding window Top-k computation

[Figure: a window sliding over the matching publication stream P1, P2, ..., P10, shown for jumping steps h = 1 and h = 3. Publications are marked Expired, Active or Top-k; the Top-2 sets shown are {P5, P1}, {P5, P6} and {P5, P9}.]

• Jumping step (h)
• Top-k notification delivery: on-demand or pro-active
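The sliding window with a jumping step can be sketched in a few lines of Python. This is only an illustration: the function name `sliding_top_k`, the window size and the scores are assumptions, not values from the thesis, and the window is recomputed from scratch at every step.

```python
import heapq

def sliding_top_k(stream, window_size, h, k):
    """Yield the Top-k (by score) of each window over a scored stream.

    stream: list of (publication_id, score) pairs in arrival order.
    h: jumping step, i.e. how many publications the window advances.
    """
    for start in range(0, len(stream) - window_size + 1, h):
        # Publications before `start` have expired; the slice is the active set.
        active = stream[start:start + window_size]
        yield heapq.nlargest(k, active, key=lambda pub: pub[1])

# Illustrative scores only (hypothetical).
stream = [("P1", 0.8), ("P2", 0.1), ("P3", 0.3), ("P4", 0.2),
          ("P5", 0.9), ("P6", 0.7), ("P7", 0.4)]
results = list(sliding_top_k(stream, window_size=5, h=1, k=2))
```

With h = 1 the window advances one publication at a time; a larger h trades freshness of the delivered Top-k for fewer recomputations.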

11

Relevancy: Personalized Subscription space

[Figure: personalized subscription space. Preference labels with weights: Carrier = AT&T (0.4), Brand = HTC (0.3), Storage (0.7), Carrier = Verizon (0.5), Storage 32GB (0.2), Storage 32GB (0.6), Brand = HTC (0.3). Score labels: 1.75, 1.3, 2.3, 2.52. Action label: Subscribe.]

12

Relevancy: Personalized Subscription space

[Figure: the same subscription space with preference labels Carrier = Verizon, Storage 32GB, Carrier = AT&T, Storage, Brand = HTC and score labels 2, 2.5, 1.75, 1.3, 2.3. A publication with Carrier = Verizon, Color = White, OS = Android, Storage = 16GB, Brand = HTC is matched against it. Action label: Subscribe.]

13

Subscription Indexing: Modified opIndex
• Based on inverted lists (posting lists)
• Two-level partitioning
  • Attribute posting list
  • Operator posting list
• Locate satisfying subscription tuples
• Relevancy score
  • By satisfying relations
  • By satisfying subscription tuples
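A minimal sketch of a two-level inverted index (attribute, then operator) with partial-match scoring is given here in Python. It is not the modified opIndex itself: the class name `InvertedIndex`, the operator set and the rule "score = sum of the weights of the satisfied predicates" are assumptions made for illustration.

```python
import operator
from collections import defaultdict

# Supported comparison operators (an assumed, minimal set).
OPS = {"=": operator.eq, "<=": operator.le, ">=": operator.ge}

class InvertedIndex:
    def __init__(self):
        # attribute -> operator -> posting list of (value, subscription_id, weight)
        self.postings = defaultdict(lambda: defaultdict(list))

    def subscribe(self, sub_id, predicates):
        """predicates: iterable of (attribute, operator, value, weight)."""
        for attr, op, value, weight in predicates:
            self.postings[attr][op].append((value, sub_id, weight))

    def match(self, publication):
        """Return {subscription_id: summed weight of satisfied predicates}."""
        scores = defaultdict(float)
        for attr, pub_value in publication.items():
            # Only posting lists for attributes present in the publication are visited.
            for op, entries in self.postings.get(attr, {}).items():
                for value, sub_id, weight in entries:
                    if OPS[op](pub_value, value):
                        scores[sub_id] += weight
        return dict(scores)

index = InvertedIndex()
index.subscribe("s1", [("carrier", "=", "Verizon", 0.5), ("storage_gb", ">=", 32, 0.2)])
index.subscribe("s2", [("brand", "=", "HTC", 0.3)])
scores = index.match({"carrier": "Verizon", "brand": "HTC", "storage_gb": 16})
```

Subscription s1 still receives a score although its storage predicate fails, which is the partial matching that a Boolean matcher cannot express.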

14

Freshness
• When the window becomes larger, older publications may prevent newer publications from entering the Top-k results
• Lease relevancy scores? The scores would then have to be re-calculated
• Forward decaying!

Fresh-relevancy score = relevancy score × freshness score
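The appeal of forward decay is that each publication gets a ranking key that never needs to be re-calculated as time advances. The sketch below assumes an exponential decay function g(a) = exp(rate * a) measured from a fixed landmark time; the slides do not state which decay function or parameters the thesis uses, so `rate`, `landmark` and both function names are hypothetical.

```python
import math

def forward_decay_key(relevancy, arrival_time, landmark, rate):
    """Static ranking key: relevancy * g(arrival_time - landmark), g(a) = exp(rate * a)."""
    return relevancy * math.exp(rate * (arrival_time - landmark))

def fresh_relevancy(relevancy, arrival_time, now, landmark, rate):
    """Decayed score at time `now`: the static key normalised by g(now - landmark)."""
    key = forward_decay_key(relevancy, arrival_time, landmark, rate)
    return key / math.exp(rate * (now - landmark))

# An older, more relevant publication against a newer, less relevant one.
old_key = forward_decay_key(0.9, arrival_time=0.0, landmark=0.0, rate=0.1)
new_key = forward_decay_key(0.6, arrival_time=10.0, landmark=0.0, rate=0.1)
```

Because the normaliser depends only on `now`, it is the same for every publication, so ordering by the static key gives the same ranking as ordering by the decayed score at any later time.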

15

Diversity: Top-k representative set

Method to retrieve Top-k publications from matching publications

[Figure: representative Top-k. Panels: drawback (without diversity); what we want (with diversity).]

16

MAX* k-diversity problem

Find:

$$S^{*} = \operatorname*{argmax}_{S \subseteq P,\ |S| = k} f(S, d)$$

where
1. P = {p1, …, pn}
2. k ≤ n
3. d: a distance metric
4. f: a diversity function

17

Proposed: MAXDIVREL k-diversity problem

$$S^{*} = \operatorname*{argmax}_{S \subseteq P} f(S, d, r) = \operatorname*{argmax}_{S \subseteq P} \frac{g(S, d, r)}{h(S, d, r)}$$

g(S, d, r): maximize the dis-similarity & relevancy in S
h(S, d, r): minimize the dis-similarity & relevancy in P − S

where
1. P = {p1, …, pn}
2. d: a distance metric
3. r: a relevance metric
4. f: a diversity function

18

Formal Definition: MAXDIVREL k-diversity

[Equations garbled in extraction. Recoverable structure: g(S, d, r) is an argmax of a cardinality-normalized sum over pairs p_i, p_j ∈ S (i ≠ j) of terms in r(p_i), r(p_j) and d(p_i, p_j), such that independence holds; h(S, P, d, r) is an argmin of a cardinality-normalized sum over p_i ∈ S, p_j ∈ P − S of terms in r(p_i), r(p_j) and d(p_i, p_j), such that dominance holds.]

where
1. P = {p1, …, pn}
2. d: a distance metric
3. r: a relevance metric

Independence condition:

Dominance condition:

19

NP-Hardness: Minimum independent-dominating set

[Figure: publication space with publications p1 to p5 and radius α, and the corresponding graph model with vertices v1 to v5. Four example vertex subsets are shown: three labelled "Independent, dominating" and one labelled "Dominating, not independent".]

$$\mathrm{Neighborhood}(p_i) = \{\, p_j \mid d(p_i, p_j) \le \alpha,\ p_i \ne p_j \,\}$$

20

NAÏVE Greedy

[Equation garbled in extraction. Recoverable terms: an argmax over p_i that combines r(p_i), an exponent or factor 2, and the neighborhood sum $\sum_{p_j \in N(p_i)} r(p_j) \times d(p_i, p_j)$.]
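For intuition, a generic greedy heuristic for picking an independent, dominating subset is sketched below in Python. It is not the thesis's NAÏVE greedy: the selection score on the slide did not survive extraction, so this sketch simply ranks by relevancy alone, which is an assumption. The function name `greedy_diverse` and the toy data are hypothetical.

```python
def greedy_diverse(pubs, relevancy, distance, alpha, k=None):
    """Repeatedly pick the most relevant remaining publication and drop its neighbourhood.

    Two publications are neighbours when distance(p, q) <= alpha. With k=None
    the loop runs until nothing remains, so the result is independent (no two
    selected items are neighbours) and dominating (every discarded item is a
    neighbour of some selected item).
    """
    remaining = sorted(pubs, key=lambda p: relevancy[p], reverse=True)
    selected = []
    while remaining and (k is None or len(selected) < k):
        best = remaining[0]
        selected.append(best)
        # Discard the chosen item and everything within its alpha-neighbourhood.
        remaining = [p for p in remaining[1:] if distance(best, p) > alpha]
    return selected

# Toy one-dimensional publication space (hypothetical positions and scores).
position = {"a": 0, "b": 1, "c": 5, "d": 6, "e": 10}
relevancy = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.7, "e": 0.5}
chosen = greedy_diverse(position, relevancy,
                        lambda p, q: abs(position[p] - position[q]), alpha=2)
```

Passing a value for k stops early, in which case the dominance property is no longer guaranteed.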

21

Handling streaming publications

[Figure: publication space p1 to p5 with radius α and its graph v1 to v5; after a new publication p6 arrives, the graph gains vertex v6.]

Continuity Requirements
1. Durability: an item selected as diverse in one window may remain selected in the next window if it has not expired and the other valid items in that window fail to out-compete it.
2. Order: the publication stream follows chronological order. We avoid selecting an item j as diverse later when we have already selected an item i that is not older than j.

22

MAXDIVREL continuous k-diversity

[Figure: the matching publication stream P1, P2, ..., Pj, Pj+1, ... with the ith and (i+1)th windows. MAXDIVREL k-diversity is solved in each window, giving S*_i and S*_{i+1}, under four constraints: Independence, Dominance, Durability and Order.]

• Straightforward solution: apply the naïve greedy method at each instance
• Proposed: an incremental index mechanism that avoids the curse of re-calculating neighborhoods

23

Locality Sensitive Hashing (LSH)

Simple idea: if two points are close together, then after a "projection" operation these two points will remain close together.

24

LSH Analysis

For any given points
• Hash function h is ( , ) sensitive
• Ideally we need
  • to be large
  • to be small

25

LSH in MAXDIVREL: Publications as categorical data

26

LSH in MAXDIVREL: Characteristic Matrix
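One common way to turn categorical publications into a characteristic matrix is to treat every distinct attribute=value pair as a row and every publication as a column. The Python sketch below assumes that encoding; the slide's own matrix was an image and is not recoverable, and the helper names are hypothetical.

```python
def tokens_of(publication):
    """Represent a categorical publication as a set of attribute=value tokens."""
    return {f"{attr}={value}" for attr, value in publication.items()}

def characteristic_matrix(publications):
    """Return (row labels, 0/1 matrix): rows are tokens, columns are publications."""
    sets = [tokens_of(p) for p in publications]
    rows = sorted(set().union(*sets))
    matrix = [[1 if row in s else 0 for s in sets] for row in rows]
    return rows, matrix

rows, matrix = characteristic_matrix([
    {"brand": "HTC", "carrier": "Verizon"},
    {"brand": "HTC", "carrier": "AT&T"},
])
```

Each column is then a sparse 0/1 vector, and the Jaccard similarity of two columns measures how many attribute values two publications share.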

27

LSH in MAXDIVREL: Minhashing

• No publications any more! A signature represents each publication.
• Technique
  • Randomly permute the rows of the characteristic matrix m times.
  • For each permutation, take the number of the first row, in the permuted order, in which the publication's column has a 1.
• Advantage: reduces the dimensions to a small minhash signature.

[Figure: first permutation of rows of the characteristic matrix]

28

LSH in MAXDIVREL: Signature Matrix

Fast minhashing
• Select m random hash functions to model the effect of m random permutations.
• Mathematically proved only when the number of rows is a prime.
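Fast minhashing can be sketched with linear hash functions of the form (a * row + b) mod p. When p is prime and a is non-zero, such a function is a permutation of 0..p-1, which is why the prime condition matters. The modulus, seed and function names below are illustrative assumptions, not the thesis's parameters.

```python
import random

PRIME = 2_147_483_647  # 2**31 - 1, a prime assumed to exceed every row index

def make_hash_functions(m, seed=0):
    """Draw m random (a, b) pairs defining h(row) = (a * row + b) % PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(m)]

def minhash_signature(row_indices, hash_functions):
    """row_indices: the rows of the characteristic matrix where the column has a 1.

    For each hash function, the minimum hashed row index plays the role of
    "first row with a 1 in the permuted order".
    """
    return [min((a * r + b) % PRIME for r in row_indices) for a, b in hash_functions]

hashes = make_hash_functions(m=8)
sig_small = minhash_signature({1, 2, 3}, hashes)
sig_large = minhash_signature({1, 2, 3, 7, 9}, hashes)
```

The fraction of positions in which two signatures agree estimates the Jaccard similarity of the underlying sets, so comparing m integers replaces comparing full columns.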

29

LSH in MAXDIVREL: LSH Buckets

• Take r-sized signature vectors from the m-sized minhash signature.
• Map them into L hash tables, each with an arbitrary number b of buckets.
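A possible reading of this step, sketched in Python: cut the signature into L consecutive pieces of r values and hash each piece into one of b buckets of its own table. Whether the thesis takes consecutive pieces or samples r positions at random is not stated on the slide, so the slicing scheme and the name `bucket_keys` are assumptions.

```python
def bucket_keys(signature, r, L, b):
    """Map a minhash signature to one bucket per hash table.

    Uses the first r * L signature values: table t receives the values at
    positions [t * r, (t + 1) * r), hashed into one of b buckets.
    """
    if len(signature) < r * L:
        raise ValueError("signature must hold at least r * L values")
    keys = []
    for table in range(L):
        piece = tuple(signature[table * r:(table + 1) * r])
        keys.append((table, hash(piece) % b))
    return keys

keys_x = bucket_keys([1, 2, 3, 4], r=2, L=2, b=16)
keys_y = bucket_keys([1, 2, 9, 9], r=2, L=2, b=16)
```

Two publications whose signatures agree on all r values of some piece land in the same bucket of that table and become candidate neighbors.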

30

LSH in MAXDIVREL: How to select L, r?

For two vectors x, y

[Equations 1 and 2 lost in extraction]

31

LSH in MAXDIVREL: Analysis

For two vectors x, y

For publications x & y
• At a particular hash table
  • x & y map into the same bucket:
  • x & y do not map into the same bucket:
• At L hash tables
  • x & y do not map into the same bucket:
  • 1 − (expression lost in extraction)

True near neighbors are unlikely to be unlucky in all the projections.
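The slide's formulas were lost, but the textbook minhash-LSH analysis matches its outline. Assuming two publications agree on any single minhash value with probability s (their Jaccard similarity), they share a bucket in one table with probability s^r, miss in one table with probability 1 - s^r, miss in all L tables with probability (1 - s^r)^L, and so collide at least once with probability 1 - (1 - s^r)^L. The Python sketch below evaluates that last expression; the function name is hypothetical.

```python
def candidate_probability(s, r, L):
    """Probability that two items with similarity s share a bucket in at least one of L tables."""
    same_bucket_one_table = s ** r
    miss_all_tables = (1.0 - same_bucket_one_table) ** L
    return 1.0 - miss_all_tables

near = candidate_probability(0.8, r=4, L=10)  # similar pair
far = candidate_probability(0.3, r=4, L=10)   # dissimilar pair
```

Increasing r makes each table more selective, while increasing L gives near neighbors more chances to collide, which is the trade-off behind choosing L and r.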

32

LSH in MAXDIVREL: Batch-wise Top-k computation

• Bucket "winner": the publication with the highest relevancy score in the bucket
• The winner is dominant and represents its bucket neighborhood
• Top-k "winners": the k winners that hold a majority of votes; the k winners are independent

[Figure: publications P_A, P_B, ..., P_H in the ith window]

33

LSH in MAXDIVREL: Incremental Top-k computation

1. New publication i arrives
2. Update the ith characteristic vector (Characteristic Matrix)
3. Generate the ith minhash signature (Signature Matrix)
4. Map the signature into L hash-tables
5. Update the "winner" at each bucket the signature maps into
6. Vote for the Top-k candidates
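The winner-and-vote idea can be sketched as a small Python class. Only the buckets a new publication maps into are touched on insert. The class name `WinnerIndex`, the (table, bucket) key format and the tie handling are assumptions, and expiry of old publications (the Durability and Order requirements) is deliberately left out.

```python
from collections import Counter

class WinnerIndex:
    def __init__(self):
        # bucket key, e.g. (table, bucket) -> (relevancy, publication id) of the current winner
        self.winner = {}

    def insert(self, pub_id, relevancy, keys):
        """Update the winner only in the buckets the new publication maps into."""
        for key in keys:
            current = self.winner.get(key)
            if current is None or relevancy > current[0]:
                self.winner[key] = (relevancy, pub_id)

    def top_k(self, k):
        """Each bucket votes for its winner; return the k publications with most votes."""
        votes = Counter(pub_id for _, pub_id in self.winner.values())
        return [pub_id for pub_id, _ in votes.most_common(k)]

index = WinnerIndex()
index.insert("A", 0.9, [(0, 1), (1, 2)])
index.insert("B", 0.5, [(0, 1), (1, 3)])
index.insert("C", 0.7, [(0, 4), (1, 3)])
top2 = index.top_k(2)
```

Publication B loses both of its buckets to more relevant neighbors, so it collects no votes and is represented by A and C in the result.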

34

LSH in MAXDIVREL: When new publication F arrives…

• Only buckets will vote
• Follow continuity requirements: Durability, Order

[Figure: publications P_A, P_B, ..., P_H with the ith and (i+1)th windows]

35

Implementation


36

Cloud service modules

[Figures; sources: Amazon Kinesis, Amazon ElastiCache]

37

Evaluation: Dataset

• Publication stream: Amazon online marketplace data as available on 17th to 19th November 2014
• Zipfian subscriptions with normalized preferences, where

N - number of elements in distribution,

k - rank of element

s - value of exponent
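The three symbols N, k and s are the parameters of the standard Zipf probability mass function, f(k; s, N) = (1 / k^s) / Σ_{n=1..N} (1 / n^s). Assuming that is the formula the slide displayed, it can be evaluated in Python as follows; the function name `zipf_pmf` is hypothetical.

```python
def zipf_pmf(k, s, N):
    """Probability of the element of rank k under a Zipf law with exponent s over N elements."""
    normaliser = sum(1.0 / n ** s for n in range(1, N + 1))
    return (1.0 / k ** s) / normaliser

# Probabilities of the three ranks for s = 1, N = 3.
probabilities = [zipf_pmf(k, s=1.0, N=3) for k in (1, 2, 3)]
```

A larger exponent s concentrates more probability on the top ranks, which is what the fitted exponents in the evaluation quantify.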

38

Evaluation: Methodology

• Subscriber Effectiveness: Quality, Accuracy, Resiliency, Freshness
• Performance & Efficiency: Index construction time, Top-k matching time
• Platform: Amazon AWS Linux based micro-node instances, each with 2.3 GHz, 8GB memory
• Algorithms are implemented in Java

39

Subscriber Effectiveness: Quality or natural behavior

• Testing the Zipf or power-law hypothesis on the distribution of ranked results (KS test)
  i. Fitting power law
  ii. Goodness-of-fit tests
  iii. Alternative distributions
• Compute 19030 ranked distributions over a 100K publication stream
  • Under different subscriber views
  • Under different sized sliding window instances

[Figure: sample distribution of ranked votes; axes: log zipf_prob(rank) against log(rank)]

N - number of elements in distribution,

k - rank of element

s - value of exponent

40

Subscriber Effectiveness: i. Fitting power law

[Figure: Zipf exponent values]

41

Subscriber Effectiveness: i. Fitting power law

[Figure: illustration of the convergence of Zipf exponent values]

42

Subscriber Effectiveness: i. Fitting power law

[Figure: Zipf exponent values under different similarity thresholds]

43

Subscriber Effectiveness: ii. Goodness of fit tests

[Figure: p-values of the KS test under different subscriber views]

44

Subscriber Effectiveness: iii. Testing alternative distributions

45

Subscriber Effectiveness: Other diversity based methods

• P-dispersion problem: MAXMIN, MAXSUM
• Minimum independent-dominating set problem: MAXDIVREL, DisC

For an even comparison, relevancy is combined into every diversity method to achieve a bi-criteria objective.

[Figure: average Zipf law exponent in comparison with other methods]

46

Subscriber Effectiveness: Other diversity based methods

• P-dispersion problem: MAXMIN, MAXSUM
• Minimum independent-dominating set problem: MAXDIVREL, DisC

[Figure: a comparison of average Zipf law exponent with other methods]

47

Subscriber Effectiveness: Accuracy of Top-k results

LSH Index vs. NAÏVE
• Rank probability
• Diversity probability

[Figures: accuracy at similarity threshold = 0.55; accuracy at similarity threshold = 0.85]

48

Subscriber Effectiveness: Resiliency of Top-k results

[Figures: getting Top-k publications (unordered); getting Top-k publications (ordered)]

49

Performance: Subscription index update time

opIndex vs. modified opIndex

[Figure: index construction time on opIndex vs. modified opIndex]

50

Efficiency: Initial matching time at modified opIndex

[Figures: initial matching time under different sizes of subscription spaces; initial matching time under different sizes of publication sets]

51

Performance & Efficiency: LSH Index

[Figure: BLSH index construction + update time for different numbers of minhash functions]

Number of minhash functions (m) =

How much accuracy do we sacrifice by comparing small minhash signatures?

52

Performance & Efficiency: ILSH vs. BLSH vs. NAÏVE

[Figure: over the publication stream P1, P2, ..., P8, four separate "BLSH or NAIVE" runs are shown along the stream against a single ILSH run.]

53

Performance & Efficiency: BLSH vs. NAÏVE

[Figures: log(ranking time) against number of publications with D = 250; log(ranking time) against number of publications with D = 500]

54

Performance & Efficiency: ILSH vs. BLSH vs. NAÏVE

[Figures: log(ranking time) against number of publications with D = 250; log(ranking time) against number of publications with D = 500]

55

Conclusions
• Diversified results produced by MAXDIVREL, which is based on the independent-dominating set problem
  • Exhibit stronger natural behavior than methods based on the p-dispersion problem
• Relevancy is an important factor to employ in distance based diversity methods
  • It always tends to produce a diverse set of personalized results
• Absolute ranks are sensitive to the preference value, while the deviation among relative ranks stays small

56

Conclusions (Ctd.)
• Locality Sensitive Hashing (LSH) indexing method
  • Produces the MAXDIVREL diverse set of results at 70% average accuracy relative to the naïve method
  • Reduces the matching time significantly compared with the NAÏVE method
  • Is further refined by its incremental version, which handles streaming publications and avoids the curse of re-computing neighborhoods
• No fixed k is needed to restrict the delivery of top publications
  • Given a window size & delivery method, the model can produce the best diverse set of personalized results to represent the set of all matching publications at a given instance

57

Major Contributions
• Dynamic diversification method based on the independent-dominating set problem
  • We introduced a novel diversity definition based on representative neighborhoods, called MAXDIVREL k-diversity, which employs relevancy.
• Index based diversification approach to rank results incrementally
  • We proposed a novel hashing based index approach, built on the Locality Sensitive Hashing (LSH) technique, to solve the MAXDIVREL continuous k-diversity problem.
• Advanced evaluation method to measure the quality of diverse results
  • First significant attempt to model the natural behavior of diversity methods in the pub/sub community

58

Future work
• Explore other suitable use-cases for the proposed model & develop prototype applications, e.g.
  • Personalized newspaper for every Facebook user
  • Diverse set of personalized Twitter trends
  • Social annotation of news stories
• Exploit overlap among the diversified results of users who have similar interests
• Employ existing implicit methods to extract human preferences, e.g. click-stream analytics
• Develop the LSH based index over a multi-threaded distributed environment

59

Q&A

THANK YOU!

60

Appendix: Freshness

Mean delay between publications = 5000 ms

[Figure: a comparison between relevancy scores after the influence of freshness]

61

Appendix: NAÏVE Ranking time

[Figure: average naïve Top-k matching time in comparison with size D of publications]

62

Appendix: BLSH Ranking time

[Figure: average BLSH Top-k matching time in comparison with size D of publications]

63

Appendix: ILSH Ranking time

[Figure: average ILSH Top-k matching time in comparison with size D of publications]