efficient similarity search : arbitrary similarity measures, arbitrary composition

32
Ef cient Similarity Search : Arbitrary Similarity Measures, Arbitrary Composition Date: 2011/10/31 Source: Dustin Lange et. al (CIKM’11) Speaker:Chiang,guang-ting Advisor: Dr. Koh. Jia-ling 1

Upload: vesta

Post on 09-Feb-2016

76 views

Category:

Documents


1 download

DESCRIPTION

Efficient Similarity Search : Arbitrary Similarity Measures, Arbitrary Composition. Date: 2011/10/31 Source: Dustin Lange et. al (CIKM’11) Speaker:Chiang,guang-ting Advisor: Dr. Koh . Jia -ling. I ndex. Introduction Problem Definition Evaluating Query Plans Query Plan Optimization - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Efficient Similarity Search :Arbitrary Similarity Measures, Arbitrary Composition

Date: 2011/10/31Source: Dustin Lange et. al (CIKM’11)Speaker:Chiang,guang-tingAdvisor: Dr. Koh. Jia-ling

1

Page 2: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Index

• Introduction• Problem Definition• Evaluating Query Plans• Query Plan Optimization• Evaluation • Conclusion

2

Page 3: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Introduction• Many commercial applications maintain a person/customer

data set that typically contains information such as the name, the date of birth, and the address of individuals that are somehow related to the company.

• Searching require:1. answered extremely fast.2. answered result that might be relevent with query.

3

Page 4: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Introduction

• Motivation:• For these and similar problem settings, a common approach

is to define several similarity measures for different aspects of the problem.

• Goal:• To combine such similarity measures, there exist different

approaches.• perform efficient search based on a set of similarity measures

and an arbitrary composition technique.

4

Page 5: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Filter-and-refine Search

5

Introduction

query Database result

Database Data base

Filter criterion

Query plan

1.Contain correct similar record2.Missing data should be as low as possible3.List should be as low as possible

Page 6: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Problem Definition• Composed similarity measures• Query plan optimization

6

Page 7: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Composed similarity measuresFirstName LastName Birthdate zip

7

Problem Definition

Jarco-winklerdistance

Jarco-winklerdistance

Relative distance

Numeric difference

𝑆𝑖𝑚𝐹𝑖𝑟𝑠𝑡𝑁𝑎𝑚𝑒 𝑆𝑖𝑚𝐿𝑎𝑠𝑡𝑁𝑎𝑚𝑒 𝑆𝑖𝑚𝑧𝑖𝑝𝑆𝑖𝑚 h𝐵𝑖𝑟𝑡 𝑑𝑎𝑡𝑒[0,1]

𝑆𝑖𝑚𝑜𝑣𝑒𝑟𝑎𝑙𝑙

Weighted sum

Page 8: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Query plan optimization

8

Problem Definition

𝑆𝑖𝑚𝑝 (𝑞 ,𝑟 )≥ 𝜙𝑝 𝑝 ≥𝜙𝑝

Query plan template:a combination of attribute predicates with the logical operators conjunction and disjunction.

Query plan :all threshold variables in a query plan template areassigned a value.

Example :Query plan template( FirstName ≥ Zip ≥ ) ( LastName ≥ ∧ ∨

∧ BirthDate ≥ )Query plan:( FirstName ≥ 0.9 Zip ≥ 1.0 ) ( LastName ≥ 0.8 BirthDate ≥ 0.85 )∧ ∨ ∧

Page 9: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

• To evaluate query plans, we define different performance metrics.1. completeness comp(qp) :

a query plan qp is the expected proportion of correct query/result record pairs that are covered by qp.

2. costs costs(qp) :a query plan qp is the expected average number of records

in R that are covered by qp.

• Given a set R of records and a cost threshold C, the query plan optimization problem is to find the query plan qp that maximizes comp(qp) with costs(qp) < C.

9

Query plan optimizationProblem Definition

Page 10: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Evaluating Query Plans• Completeness Estimation• How many matches can be found with this query plan?

• Cost Estimation• How large is the cardinality of an average query result with this

query plan?

10

Page 11: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Completeness Estimation• Gathering Training Data :• Real Training Data:

• In our use case, we have a set of 2m queries, most of them with results manually labeled as correct. This is the ideal situation, where we can rely on a manually labeled set of query/result record pairs.

11

Evaluating Query Plans

Page 12: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Completeness Estimation• Estimation

• T : training instance• evaluates to true iff the pair (q, r) is covered by qp (i. e., if the

query plan predicates are fulfilld).

12

Evaluating Query Plans

Page 13: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Completeness Estimation• Estimation• Basic algorithm: Iterate over the list T and count all query/result record pairs that are covered by qp.• Speed up the basic algorithm: For each distinct base similarity value combination, we save its count, thus eliminating all “duplicate” combinations.

• Reduce the number of value combinations by rounding to two decimal places.

• Reduce the amount of value combinations to be checked from 2.0m to 4.4k (a reduction rate of 99.8 %).

13

Evaluating Query Plans

Page 14: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Cost Estimation

14

Evaluating Query Plans

Costly task

Naïve:

Propose:

Quickly

Page 15: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Cost Estimation• Similarity histograms• For each analyzed value, we calculate the similarity to all

other values of the attribute.

15

Evaluating Query Plans

Page 16: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Cost Estimation• Similarity histograms• For each possible similarity, averag the measured numbers of

records with similar values.

16

Evaluating Query Plans

Page 17: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Cost Estimation• Combining attribute estimations• We multiply the probability with the cardinality of the complete

record set to determine expected average costs: U: universe set : a subset of U

• an attribute predicate p ≥ in abbreviate as .• Example: LastName ≥ 0.9 City ≥ 0.95∧

17

Evaluating Query Plans

Page 18: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Cost Estimation• Example: ( (A B) (C D) (E F ) ) ∧ ∨ ∧ ∨ ∧ P ( A ≥ B ≥ ∧ )

18

Evaluating Query Plans

Page 19: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Cost Estimation• Example: ( (A B) (C D) (E F ) ) ∧ ∨ ∧ ∨ ∧ P (A ≥ B ≥ ∨ )

P (A ≥ B ≥ ∨ C ≥ )∨=P (A ≥ ) + P (B ≥ ) + P (C ≥ )- P (A ≥ B ≥ ∧ ) - P (A ≥ C ≥ ) - P (B ≥ ∧ C ≥ )∧+ P (A ≥ B ≥ ∧ C ≥ )∧

19

Evaluating Query Plans

Page 20: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Cost Estimation

20

Evaluating Query Plans

Page 21: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Query Plan Optimization• Best query plan, i. e., the plan with highest completeness

below a cost threshold.

• How??1. Iterate over the set of all possible query templates and then

optimize thresholds for each query plan template.2. Then,got a optimal query plan per query plan template. 3. From these query plans, can simply select the overall best

query plan.

• Cost too much.

21

Page 22: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Query Plan Optimization• Observations-cost

22

Page 23: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Query Plan Optimization• Observations-completeness

23

Page 24: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Query Plan Optimization• Top neighborhood algorithm• The algorithm should only evaluate as few query plans as possible

and determine an overall good solution.

24

………. W(q6) W(q4) W(q1)

………. W(q5) W(q2)

W(q10)

……… W(q3)

……..

W(qn) ……..

0 0

1

1

1

1

0

1

1

(A B) (C D) ∧ ∨ ∧D

C

B

A

Page 25: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Query Plan Optimization

25

0 0

1

1

1

1

(A B) (C D) ∧ ∨ ∧D

C

B

A

• A query plan qp has a neighborhoodN(qp) that contains all query plans that can be constructed by lowering one thresholds of qp by one step

Ex: by 0.01

Page 26: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Query Plan Optimization

26

………. N(W6) N(W4) N(W1)

………. N(W5) N(W2)

N(W10)

……… N(W3)

……..

N(Wm) ……..

0 0

1

1

1

1

0

1

1

(A B) (C ∧ ∨ ∧D)

D

C

B

A • Window W = {q1 . . . qn }.

• Top t neighborhood .

• The cost threshold C also

determines the stopping

criterion of our algorithm.φ =0.6

Page 27: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Query Plan Optimization

27

Page 28: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

28

Page 29: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Real-world Similarity Search• 210 query plan templates• t = 100 (a window size that results in acceptable runtime of our

algorithm) • C = 1000 (a cost threshold that is acceptable for our use case). • Randomly selected 50,000 queries from our test data set.

29

Page 30: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Conclusion• Introduced a novel approach to efficient similarity search for

composed similarity measures. • Showed how to efficiently compute completeness and costs,

two important performance metrics for query plans based on a set of similarity indexes.

• The resulting trade-off between completeness and costs has been solved with the approximative top neighborhood algorithm.

• Showed that the algorithm efficiently determines near-optimal query plans for real-world data.

30

Page 31: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Thx for your listening …..

31

Page 32: Efficient Similarity Search  : Arbitrary Similarity  Measures, Arbitrary Composition

Speed up the basic algorithmAmount FirstName LastName Birthdate zip

12345 EDDY MA 1/1 890

22678 JACK JOHNSON 2/13 888

1423665 KOBE CHU 7/7 235

158489 JACK JOHNSON 2/13 888

9527 JACK JOHNSON 2/13 888

32

Query plan: (FirstName ≥ 0.8)Query : Jackson Mark Susan Jackson Maggie

3

delete