Fast Indexes and Algorithms for Set Similarity Selection Queries

M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava

Strings as sets

s1 = “Main St. Maine”:
• ‘Main’ ‘St.’ ‘Maine’
• ‘Mai’ ‘ain’ ‘in ’ ‘n S’ ‘ St’ ‘St.’ ‘t. ’ …

s2 = “Main St. Main”:
• ‘Main’ ‘St.’ ‘Main’

How similar are s1 and s2?
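
The two decompositions on this slide (word tokens and overlapping 3-grams) can be reproduced with a few lines of Python. This is only a sketch: the function names `word_tokens` and `qgrams` are illustrative and do not come from the slides.

```python
def word_tokens(s):
    """Split a string into whitespace-delimited word tokens."""
    return s.split()


def qgrams(s, q=3):
    """Return every overlapping substring of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


s1 = "Main St. Maine"
s2 = "Main St. Main"

print(word_tokens(s1))  # ['Main', 'St.', 'Maine']
print(word_tokens(s2))  # ['Main', 'St.', 'Main']
print(qgrams(s1)[:7])   # ['Mai', 'ain', 'in ', 'n S', ' St', 'St.', 't. ']
```

Note that the token collection for s2 contains ‘Main’ twice, which is where term frequency enters.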

TF/IDF weighted similarity

Inverse Document Frequency (idf):
• ‘Main’ is common
• ‘Maine’ is not
• idf(t) = log2[1 + N / df(t)]

Term Frequency (tf):
• ‘Main’ appears twice in s2

Similarity:
• Inner Product
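
A minimal sketch of the weighting described above, assuming that N is the number of strings in the collection and df(t) is the number of strings containing token t. The four-string collection is made up for illustration, and scaling each vector to unit length before taking the inner product is an assumption, since the slide only says "Inner Product".

```python
import math
from collections import Counter


def idf_table(collection):
    """idf(t) = log2(1 + N / df(t)); df(t) counts the strings that contain t."""
    n = len(collection)
    df = Counter(t for tokens in collection for t in set(tokens))
    return {t: math.log2(1 + n / d) for t, d in df.items()}


def tfidf_vector(tokens, idf):
    """Weight each token by tf * idf and scale the vector to unit length."""
    tf = Counter(tokens)
    vec = {t: tf[t] * idf[t] for t in tf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()}


def inner_product(u, v):
    """Sum the products of weights over the tokens both vectors share."""
    return sum(w * v[t] for t, w in u.items() if t in v)


# Hypothetical collection, already split into word tokens.
collection = [
    ["Main", "St.", "Maine"],
    ["Main", "St.", "Main"],
    ["Main", "Ave."],
    ["Park", "St."],
]
idf = idf_table(collection)
v1 = tfidf_vector(collection[0], idf)
v2 = tfidf_vector(collection[1], idf)
print(idf["Main"] < idf["Maine"])  # True: 'Main' is the more common token
print(inner_product(v1, v2))
```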

Is TF important?

Information retrieval:
• Given a query string, retrieve relevant documents

Relational databases:
• Given a query string, retrieve relevant strings

In practice TF is small in many applications

IDF similarity

Query q = {t1, …, tn}

Set s = {r1, …, rm}

Length $len(s) = \left(\sum_{t \in s} idf(t)^2\right)^{1/2}$

$I(q, s) = \sum_{t \in s \cap q} idf(t)^2 \,/\, \big(len(s)\, len(q)\big)$

IDF is as good as TF/IDF in practice!
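
The IDF measure on this slide can be computed directly from its definition. The sketch below assumes a small made-up collection of token sets and that every query token occurs somewhere in the collection; the helper names are illustrative.

```python
import math
from collections import Counter


def idf_table(collection):
    """idf(t) = log2(1 + N / df(t)) over a list of token sets."""
    n = len(collection)
    df = Counter(t for s in collection for t in s)
    return {t: math.log2(1 + n / d) for t, d in df.items()}


def length(s, idf):
    """len(s) = square root of the sum of idf(t)^2 over the tokens of s."""
    return math.sqrt(sum(idf[t] ** 2 for t in s))


def idf_similarity(q, s, idf):
    """I(q, s) = sum of idf(t)^2 over shared tokens / (len(s) * len(q))."""
    shared = sum(idf[t] ** 2 for t in q & s)
    return shared / (length(s, idf) * length(q, idf))


# Hypothetical collection of token sets.
collection = [
    {"Main", "St.", "Maine"},
    {"Main", "St."},
    {"Main", "Ave."},
    {"Park", "St."},
]
idf = idf_table(collection)
q = {"Main", "St.", "Maine"}
for s in collection:
    print(sorted(s), round(idf_similarity(q, s, idf), 3))
```

A set compared with itself scores 1, and the score drops as fewer (or more common) tokens are shared.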

How can I build an index?

Let w(t, s) = idf(t) / len(s)

Then $I(q, s) = \sum_{t \in q \cap s} w(t, s)\, w(t, q)$

So
• Decompose strings into tokens
• Compute the idf of each token
• Create one inverted list per token

Sort lists by string id: Do a merge join

Sort lists by w: Run TA/NRA
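
The index construction steps above might look as follows. This is a sketch under stated assumptions: the collection is made up, `build_index` is an illustrative name, and each list entry is stored as a `(string id, w(t, s))` pair so that the same entries can be sorted either by id or by weight.

```python
import math
from collections import Counter, defaultdict


def build_index(collection):
    """Return (idf, lists); lists[t] has one (sid, w(t, s)) entry per set containing t."""
    n = len(collection)
    df = Counter(t for s in collection for t in s)
    idf = {t: math.log2(1 + n / d) for t, d in df.items()}
    lists = defaultdict(list)
    for sid, s in enumerate(collection):
        len_s = math.sqrt(sum(idf[t] ** 2 for t in s))
        for t in s:
            lists[t].append((sid, idf[t] / len_s))
    return idf, lists


# Hypothetical collection of token sets.
collection = [
    {"Main", "St.", "Maine"},
    {"Main", "St."},
    {"Main", "Ave."},
    {"Park", "St."},
]
idf, lists = build_index(collection)

# The two sort orders mentioned on the slide.
by_id = {t: sorted(entries) for t, entries in lists.items()}
by_w = {t: sorted(entries, key=lambda e: -e[1]) for t, entries in lists.items()}

# Scoring a query only touches the lists of its own tokens.
q = {"Main", "St.", "Maine"}
len_q = math.sqrt(sum(idf[t] ** 2 for t in q))
scores = defaultdict(float)
for t in q:
    for sid, w_ts in by_id[t]:
        scores[sid] += w_ts * (idf[t] / len_q)  # w(t, s) * w(t, q)
print(dict(scores))
```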

Example: Sort by id
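
The figure for this slide did not survive extraction. As a stand-in, here is a sketch of a merge join over id-sorted lists: the lists are merged by string id, contributions for the same id are summed, and ids whose total reaches the threshold are reported. The weights are hypothetical and the function names are illustrative.

```python
import heapq
from itertools import groupby


def scaled(entries, w_tq):
    """Yield (sid, w(t, s) * w(t, q)) for one inverted list."""
    for sid, w_ts in entries:
        yield sid, w_ts * w_tq


def merge_join(query_lists, threshold):
    """query_lists: (w_tq, entries) pairs; entries are (sid, w_ts) sorted by sid."""
    streams = [scaled(entries, w_tq) for w_tq, entries in query_lists]
    merged = heapq.merge(*streams, key=lambda e: e[0])
    results = []
    for sid, group in groupby(merged, key=lambda e: e[0]):
        score = sum(part for _, part in group)
        if score >= threshold:
            results.append((sid, score))
    return results


# Hypothetical weights for a two-token query.
query_lists = [
    (0.6, [(0, 0.6), (2, 0.8)]),
    (0.8, [(0, 0.8), (1, 1.0)]),
]
matches = merge_join(query_lists, threshold=0.7)
print([sid for sid, _ in matches])  # [0, 1]
```

Every entry of every query list is read once, which is the cost the weight-sorted algorithms try to avoid.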

Example: Sort by w

NRA:
• Round robin list accesses
• Main memory hash table
• Computes lower and upper bounds per entry
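
A simplified sketch of the NRA idea for a threshold query, not the exact algorithm from the talk. It assumes each list is sorted by decreasing w(t, s), keeps partial scores in a hash table, and uses the last weight read from each list as the bound on anything still unread. The input weights are hypothetical, and the function returns the number of entries read so the early stop is visible.

```python
def nra(query_lists, tau):
    """Return (ids with score >= tau, number of entries read).

    query_lists: one (w_tq, entries) pair per query token, where entries is a
    list of (sid, w_ts) sorted by decreasing w_ts.
    """
    n = len(query_lists)
    pos = [0] * n
    # frontier[i] bounds the contribution of any entry not yet read from list i.
    frontier = [w_tq * lst[0][1] if lst else 0.0 for w_tq, lst in query_lists]
    seen = {}  # hash table: sid -> {list index: contribution read so far}

    def upper(parts):
        # Lower bound plus the best possible contribution of unseen lists.
        return sum(parts.get(i, frontier[i]) for i in range(n))

    while True:
        progressed = False
        for i, (w_tq, lst) in enumerate(query_lists):  # round robin
            if pos[i] < len(lst):
                sid, w_ts = lst[pos[i]]
                pos[i] += 1
                frontier[i] = w_tq * w_ts
                seen.setdefault(sid, {})[i] = frontier[i]
                progressed = True
            else:
                frontier[i] = 0.0  # list exhausted
        unseen_may_qualify = sum(frontier) >= tau
        undecided = any(sum(p.values()) < tau <= upper(p) for p in seen.values())
        if not progressed or not (unseen_may_qualify or undecided):
            break
    result = sorted(sid for sid, p in seen.items() if sum(p.values()) >= tau)
    return result, sum(pos)


# Hypothetical weights for a two-token query; 8 entries in total.
query_lists = [
    (0.6, [(2, 0.8), (0, 0.6), (5, 0.1), (6, 0.1)]),
    (0.8, [(1, 1.0), (0, 0.8), (7, 0.1), (8, 0.1)]),
]
ids, reads = nra(query_lists, tau=0.7)
print(ids, reads)  # [0, 1] 6
```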

Semantic properties of IDF

Order Preservation:
• For all t1, t2: if w(t1, s) < w(t1, r), then w(t2, s) < w(t2, r)

Length Boundedness:
• Query q, set s, threshold $\tau$
– $I(q, s) \ge \tau \Rightarrow \tau\, len(q) \le len(s) \le len(q) / \tau$
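
Both properties follow from the definitions of w(t, s) and I(q, s) given earlier. The derivation below is a sketch, writing $\tau$ for the threshold and assuming that s and r both contain t1 and t2:

```latex
% Order Preservation: w(t, s) = idf(t) / len(s) with idf(t) > 0, so
w(t_1, s) < w(t_1, r)
  \iff len(s) > len(r)
  \iff w(t_2, s) < w(t_2, r)

% Length Boundedness: the shared tokens are a subset of both s and q, so
\sum_{t \in q \cap s} idf(t)^2 \le \min\big(len(s)^2,\ len(q)^2\big)

I(q, s) \le \frac{\min\big(len(s)^2,\ len(q)^2\big)}{len(s)\, len(q)}
        = \min\!\left(\frac{len(s)}{len(q)},\ \frac{len(q)}{len(s)}\right)

% Hence I(q, s) >= tau forces both ratios to be at least tau:
I(q, s) \ge \tau \implies \tau\, len(q) \le len(s) \le \frac{len(q)}{\tau}
```

In other words, every list sorted by w is also sorted by increasing set length, and only the slice of lengths between the two bounds can hold answers.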

Improved NRA

Order Preservation determines if a given set appears in a list or not
• ti: encounter s1, then s2
• tk: encounter s2 first, so s1 cannot appear in tk

Length Boundedness restricts the search to a small portion of each list

Something surprising

Lemma: NRA reads arbitrarily more elements than iNRA

Lemma: NRA reads arbitrarily more elements than any algorithm that uses the Length Boundedness property

Any other strategies?

NRA style is breadth-first

Try depth-first:
• Sort query lists in decreasing idf order
– Let q = {t1, …, tn} and idf(t1) > idf(t2) > … > idf(tn)
• Let $\lambda_i$ be the maximum length a set s in ti can have s.t. $I(q, s) \ge \tau$, assuming that s exists in all lists tk with k ≥ i
– $\lambda_i = \sum_{i \le k \le n} idf(t_k)^2 \,/\, \big(\tau\, len(q)\big)$
• $\lambda_i$ is a natural cutoff point
• $\lambda_1 > \lambda_2 > \dots > \lambda_n$
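
The cutoff points can be computed with one backward pass over the query's idf values. This sketch assumes the cutoff for list i is the suffix sum of squared idfs divided by (threshold × len(q)), which is what the maximum-length argument above yields; the names `cutoffs` and `tau`, and the idf values, are illustrative.

```python
import math


def cutoffs(query_idfs, tau):
    """query_idfs: idf values of the query tokens in decreasing order.

    Returns one cutoff length per list, in the same order.
    """
    len_q = math.sqrt(sum(v * v for v in query_idfs))
    suffix = 0.0
    out = []
    for v in reversed(query_idfs):
        suffix += v * v  # sum of idf(t_k)^2 for k >= i
        out.append(suffix / (tau * len_q))
    return out[::-1]


# Hypothetical idf values for a three-token query.
idfs = [3.0, 2.0, 1.0]
lams = cutoffs(idfs, tau=0.8)
print([round(x, 3) for x in lams])
```

Under this formula the first cutoff equals len(q) divided by the threshold, which is the Length Boundedness upper bound, and each later cutoff is strictly smaller.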

Shortest-First

Sort q = {t1, …, tn} in decreasing idf order

Let C be the candidate set

For 1 <= i <= n
• Skip to first entry with $len(s) \ge \tau\, len(q)$
• Compute $\lambda_i$
• Let $\lambda_i = \min(\lambda_i, len(q) / \tau)$
• Repeat
– s = pop next element from ti
– Maintain lower/upper bounds of entries in C
• Until len(s) > max(max len C, $\lambda_i$)
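
The procedure above can be sketched in Python as follows. This is one interpretation, not the authors' implementation: it assumes each list stores `(len(s), sid)` pairs sorted by length (sufficient because w(t, s) = idf(t) / len(s)), uses the suffix-sum cutoff for each list, and uses made-up idf values. `EPS` is a tolerance added here for floating-point comparisons.

```python
import bisect
import math

EPS = 1e-12  # tolerance for floating-point comparisons at the boundaries


def shortest_first(query, lists, tau):
    """Return {sid: I(q, s)} for every set s with I(q, s) >= tau.

    query: {token: idf}; lists[token]: [(len_s, sid), ...] sorted by len_s.
    """
    terms = sorted(query.items(), key=lambda p: -p[1])  # decreasing idf
    len_q = math.sqrt(sum(v * v for _, v in terms))
    # suffix[i] = sum of idf(t_k)^2 for k >= i
    suffix = [0.0] * (len(terms) + 1)
    for i in range(len(terms) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + terms[i][1] ** 2
    cand = {}  # sid -> [len_s, sum of idf^2 over shared tokens seen so far]
    for i, (t, v) in enumerate(terms):
        lst = lists.get(t, [])
        lam = min(suffix[i] / (tau * len_q), len_q / tau) + EPS
        max_len_c = max((c[0] for c in cand.values()), default=0.0)
        # Skip to the first entry with len(s) >= tau * len(q).
        start = bisect.bisect_left(lst, (tau * len_q - EPS,))
        for len_s, sid in lst[start:]:
            if len_s > max(max_len_c, lam):
                break
            if sid in cand:
                cand[sid][1] += v * v
            elif len_s <= lam:  # new candidates must be within the cutoff
                cand[sid] = [len_s, v * v]
        # Drop candidates whose upper bound can no longer reach tau.
        cand = {sid: c for sid, c in cand.items()
                if (c[1] + suffix[i + 1]) / (c[0] * len_q) >= tau - EPS}
    return {sid: c[1] / (c[0] * len_q) for sid, c in cand.items()}


idf = {"a": 3.0, "b": 2.0, "c": 1.0}  # hypothetical idf values
sets = [{"a", "b", "c"}, {"a", "b"}, {"c"}, {"b", "c"}, {"a"}]
lists = {}
for sid, s in enumerate(sets):
    len_s = math.sqrt(sum(idf[t] ** 2 for t in s))
    for t in s:
        lists.setdefault(t, []).append((len_s, sid))
for entries in lists.values():
    entries.sort()

result = shortest_first(idf, lists, tau=0.8)  # query q = {a, b, c}
print(sorted(result))  # [0, 1, 4]
```

Each list is scanned to its own depth before the next one is opened, and the candidate table only needs a running sum per entry.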

Comparison with NRA

Lemma: Let q = {t1, …, tn} and let d be the maximum depth to which SF descends over all lists. In the worst case, iNRA reads (d – 1)(n – 1) more elements than SF

But surprisingly

A hybrid strategy

Run iNRA normally

Use $\lambda_i$ and max len C to stop reading from a particular list
• This guarantees that iNRA stops with or before SF

Drawback of NRA variants:
• Very high bookkeeping cost compared to SF

Experiments

DBLP, IMDB and YellowPages datasets

Actors, movies, authors, businesses etc.

Vary threshold, query size, query strings and mistakes

Test wall-clock time, pruning power

Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, Improved SQL-based

Wall-clock time vs. Threshold

Wall-clock time vs. Query size

[Figure: series TA, NRA, Sort-by-id, iTA, SF]

Space

Conclusion

Proposed a simplified TF/IDF measure

Identified strong monotonicity properties

Used the properties to design efficient algorithms

SF works best overall in practice
• Achieves sub-second answers in most practical cases

Q&A

Pruning power vs. Threshold

Pruning power vs. Query size

[Figure: series NRA, TA, iTA]