Dictionary Matching and Indexing with Edits and Don’t Cares Richard Cole NYU Lee-Ad Gottlieb NYU Moshe Lewenstein Bar-Ilan

Post on 06-Feb-2016

Page 1: Dictionary Matching and Indexing with Edits and Don’t Cares

Dictionary Matching and Indexing with Edits and Don’t Cares

Richard Cole (NYU)

Lee-Ad Gottlieb (NYU)

Moshe Lewenstein (Bar-Ilan)

Page 2: Dictionary Matching and Indexing with Edits and Don’t Cares

Pattern Matching

Various problems of the following flavor: preprocess a text t, or a collection of strings d1,…,dx, so that given a query string p, all matches with the text can be found quickly.

Indexing

Dictionary queries

Dictionary matching

All-to-all matching

Page 3: Dictionary Matching and Indexing with Edits and Don’t Cares

Pattern Matching

Dictionary queries.

Dictionary: Bate, Beat, Boat, Boot

Query: Beta

Page 4: Dictionary Matching and Indexing with Edits and Don’t Cares

Pattern Matching

Dictionary matching.

Dictionary: Bate, Beat, Boat, Boot

Text: The fish beat my boot.

Page 5: Dictionary Matching and Indexing with Edits and Don’t Cares

Pattern Matching

Text indexing.

Text: abracadabra

Query: ra (occurs twice in the text)

Page 6: Dictionary Matching and Indexing with Edits and Don’t Cares

Pattern Matching

All-to-all matching.

Bate Beat Boat Boot

bat boots be

Page 7: Dictionary Matching and Indexing with Edits and Don’t Cares

Previous Work

Dictionary Queries

[Figure: trie built from the dictionary Bate, Beat, Boat, Boot, searched with the query Beta.]

Page 9: Dictionary Matching and Indexing with Edits and Don’t Cares

Suffix Tree

Text Indexing

[Figure: suffix tree with edges labeled g and o. String shown on the slide: Oogogoogogogoggogogg]

Page 13: Dictionary Matching and Indexing with Edits and Don’t Cares

Approximate Matches

Wildcards (don’t cares): Boat vs. Bo*t

Substitutions: Boat vs. Boot

Edits (insertions and deletions): Boat vs. B_at
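The three kinds of approximate match on this slide can be written as plain predicates. This is a minimal sketch, not part of the talk: the function names are hypothetical, a wildcard is assumed to match exactly one character, and the edit distance shown is the usual one that also allows substitutions.

```python
from typing import Optional


def wildcard_match(pattern: str, word: str, wildcard: str = "*") -> bool:
    """True if word equals pattern, with each wildcard matching exactly one character."""
    return len(pattern) == len(word) and all(
        p == wildcard or p == w for p, w in zip(pattern, word)
    )


def substitutions(a: str, b: str) -> Optional[int]:
    """Number of mismatching positions; None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # match or substitute
        prev = cur
    return prev[-1]
```

With the slide's examples, `wildcard_match("Bo*t", "Boat")` holds, `substitutions("Boat", "Boot")` is 1, and `edit_distance("Boat", "Bat")` is 1 (one deletion).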

Page 14: Dictionary Matching and Indexing with Edits and Don’t Cares

Previous Work – Best Results

Indexing and Dictionary Matching (edits): Buchsbaum, Goodrich, Westbrook.

k=1: p log log n + occ query time, n log n space

Dictionary Queries (substitutions): Brodal, Gasieniec.

k=1: p + occ query time, n space

Page 15: Dictionary Matching and Indexing with Edits and Don’t Cares

Previous Work – Basic Intuition

Text: abracadabra

Build a suffix tree for:

abracadab abracada abracad abraca abrac abra abr ab a

And for:

a ar arb arba arbad arbada arbadac arbadaca arbadacar

Query: abrac*dabra

Page 16: Dictionary Matching and Indexing with Edits and Don’t Cares

New Results

Indexing, Dictionary Queries, Dictionary Matches

Substitutions, k < log n:

Query time: p + [(c1 log n)^k log log n] / k! + occ

Space: n (c2 log n)^k / k!

Edits, k < log n:

Query time: p + [(c3 log n)^k log log n] / k! + 3^k occ

Space: n (c4 log n)^k / k!

Wildcards in pattern, k < log n:

Query time: p + 2^k log log n / k! + occ

Space: n + (k + log n)^k / k!

Page 17: Dictionary Matching and Indexing with Edits and Don’t Cares

Dictionary Wildcard Queries

Three data structures for dictionary wildcard queries

Naïve: O(n) space, kp query time

Less-naïve: O(n^(1+k)) space, p query time

New data structure: O(n log^k n) space, 2^k p query time

Page 18: Dictionary Matching and Indexing with Edits and Don’t Cares

Naïve Approach

[Figure: trie of the dictionary, searched step by step with the query string.]

Query string: *it

Query time: k p
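The naïve approach can be sketched as a plain trie search that branches into every child whenever the query has a wildcard. This is an illustration only: the dictionary words, the `$` end marker and the function names are assumptions, not taken from the slides.

```python
END = "$"  # end-of-word marker (assumed not to occur in dictionary words)


def build_trie(words):
    """Plain trie as nested dicts; node[END] holds the word ending at that node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def naive_wildcard_query(root, pattern, wildcard="*"):
    """Return all dictionary words matching pattern, trying every child at a wildcard."""
    results = []

    def walk(node, i):
        if i == len(pattern):
            if END in node:
                results.append(node[END])
            return
        ch = pattern[i]
        if ch == wildcard:
            # No extra structure: every child has to be explored.
            for label, child in node.items():
                if label != END:
                    walk(child, i + 1)
        elif ch != END and ch in node:
            walk(node[ch], i + 1)

    walk(root, 0)
    return sorted(results)
```

The index is just the trie, so space stays linear, but every wildcard multiplies the number of paths the query has to follow.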

Page 25: Dictionary Matching and Indexing with Edits and Don’t Cares

Less-Naïve Approach

[Figure: the dictionary trie with wildcard (*) subtrees added, searched with the query string.]

Query string: *it

Query time: p

Space: O(n^(1+k))
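One way to read the less-naïve approach is that every node gets a `*` edge leading to a merged copy of all its children, so a query follows a single path. The sketch below assumes that reading; the names, the `$` and `*` markers, and the nested-dict layout are illustrative choices, and queries are assumed to contain at most k wildcards.

```python
END = "$"   # end-of-word marker (assumed not to occur in words)
STAR = "*"  # wildcard edge label (assumed not to occur in words)


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node.setdefault(END, set()).add(word)
    return root


def merge(tries):
    """Return a fresh trie that is the union of the given tries."""
    grouped = {}
    for trie in tries:
        for label, value in trie.items():
            grouped.setdefault(label, []).append(value)
    return {
        label: set().union(*values) if label == END else merge(values)
        for label, values in grouped.items()
    }


def add_wildcards(node, k):
    """Attach a '*' subtree (all children merged) at every node, nested k deep."""
    if k == 0:
        return
    children = [child for label, child in node.items() if label != END]
    if not children:
        return
    star = merge(children)          # copy made before any child is augmented
    for child in children:
        add_wildcards(child, k)
    add_wildcards(star, k - 1)      # one wildcard has been used up inside '*'
    node[STAR] = star


def query(root, pattern):
    """Follow exactly one path; a '*' in the pattern follows the '*' edge."""
    node = root
    for ch in pattern:
        if ch == END or ch not in node:
            return []
        node = node[ch]
    return sorted(node.get(END, ()))


def count_nodes(node):
    return 1 + sum(count_nodes(c) for label, c in node.items() if label != END)
```

The query never branches, which is where the p query time comes from; the price is that each `*` subtree duplicates everything below its node, and `count_nodes` makes that growth visible on small inputs.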

Page 32: Dictionary Matching and Indexing with Edits and Don’t Cares

New Approach

[Figure: the dictionary trie with a wildcard (*) subtree added, searched with the query string.]

Query string: *it

Query time: 2^k p
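A sketch of the new approach, assuming it works as the space analysis describes: the wildcard subtree at a node merges all children except the heaviest, and a wildcard in the query is answered by searching both the heaviest child and the wildcard subtree. Names, markers and the sample dictionary are hypothetical.

```python
END = "$"   # end-of-word marker (assumed not to occur in words)
STAR = "*"  # wildcard label (assumed not to occur in words)


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node.setdefault(END, set()).add(word)
    return root


def merge(tries):
    """Return a fresh trie that is the union of the given tries."""
    grouped = {}
    for trie in tries:
        for label, value in trie.items():
            grouped.setdefault(label, []).append(value)
    return {
        label: set().union(*values) if label == END else merge(values)
        for label, values in grouped.items()
    }


def weight(node):
    """Number of dictionary words stored at or below this (unaugmented) node."""
    return sum(len(v) if label == END else weight(v) for label, v in node.items())


def add_wildcards(node, k):
    """Store (heaviest child label, merge of the other children) at every node."""
    if k == 0:
        return
    children = {label: child for label, child in node.items() if label != END}
    if not children:
        return
    heavy = max(children, key=lambda label: weight(children[label]))
    star = merge([child for label, child in children.items() if label != heavy])
    for child in children.values():
        add_wildcards(child, k)
    add_wildcards(star, k - 1)
    node[STAR] = (heavy, star)


def query(root, pattern):
    """At each wildcard, search two places: the heaviest child and the '*' subtree."""
    found = set()

    def walk(node, i):
        if i == len(pattern):
            found.update(node.get(END, ()))
            return
        ch = pattern[i]
        if ch == STAR:
            if STAR in node:
                heavy, star = node[STAR]
                walk(node[heavy], i + 1)
                walk(star, i + 1)
        elif ch != END and ch in node:
            walk(node[ch], i + 1)

    walk(root, 0)
    return sorted(found)
```

Each wildcard doubles the number of paths followed instead of multiplying it by the number of children, which matches the 2^k factor in the query time. Queries are assumed to contain at most the k wildcards the index was built for.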

Page 40: Dictionary Matching and Indexing with Edits and Don’t Cares

Space Analysis

Create a wildcard subtree at each node in the original trie. The heaviest child is not in the wildcard tree.

Look at any leaf of the trie. How many of its ancestors were not the heaviest child?

At most log2 n, so it appears in at most log n wildcard trees.

Space: n log n for one wildcard, n log^k n for k wildcards.
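The counting argument can be checked on a small trie. Stepping to a child that is not the heaviest at least halves the number of words below, so no word can have more than log2 n such steps on its path. The sketch below counts them; the helper names and the sample words are assumptions.

```python
import math

END = "$"  # end-of-word marker (assumed not to occur in words)


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def weight(node):
    """Number of dictionary words stored at or below this node."""
    return sum(1 if label == END else weight(child) for label, child in node.items())


def light_ancestor_counts(root):
    """Map each word to the number of non-heaviest children on its root-to-leaf path."""
    counts = {}

    def walk(node, prefix, light):
        if END in node:
            counts[prefix] = light
        children = {label: c for label, c in node.items() if label != END}
        if not children:
            return
        heavy = max(children, key=lambda label: weight(children[label]))
        for label, child in children.items():
            walk(child, prefix + label, light + (label != heavy))

    walk(root, "", 0)
    return counts


words = ["far", "fat", "fit", "pin", "pay", "sit"]
counts = light_ancestor_counts(build_trie(words))
bound = math.log2(len(words))
```

Each count is the number of wildcard trees that contain a copy of that word when k = 1, so summing the counts gives the extra space for one level of wildcard trees.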

Page 41: Dictionary Matching and Indexing with Edits and Don’t Cares

Edit Distance

Wildcards are (algorithmically) the simplest type of approximate search.

What issues come up when dealing with substitutions, insertions and deletions?

Page 42: Dictionary Matching and Indexing with Edits and Don’t Cares

Substitution Search

[Figure: trie with edges labeled a and b, searched step by step with the query string.]

Query string: aab
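Without any extra structure, a search that allows substitutions has to try every child at every position and charge one substitution whenever the edge label differs from the query character. A minimal sketch of that baseline, with a hypothetical dictionary over the letters a and b:

```python
END = "$"  # end-of-word marker (assumed not to occur in words)


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def substitution_query(root, pattern, k):
    """Map each dictionary word within k substitutions of pattern to its mismatch count."""
    found = {}

    def walk(node, i, used):
        if i == len(pattern):
            if END in node:
                found[node[END]] = used
            return
        for label, child in node.items():
            if label == END:
                continue
            cost = used + (label != pattern[i])  # a differing label costs one substitution
            if cost <= k:
                walk(child, i + 1, cost)

    walk(root, 0, 0)
    return found


trie = build_trie(["aab", "abb", "aba", "bbb"])
```

Unlike a wildcard, a substitution can occur at any position of the query, which is why the search branches at every node rather than only at marked positions.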

Page 50: Dictionary Matching and Indexing with Edits and Don’t Cares

Substitution Tree

[Figure: trie with edges labeled a and b, with a substitution tree added.]

Query string: aab

Page 52: Dictionary Matching and Indexing with Edits and Don’t Cares

Deletion Tree

[Figure: trie with edges labeled a, b and c, and the deletion tree built from it.]

Page 54: Dictionary Matching and Indexing with Edits and Don’t Cares

Insertion Tree

[Figure: trie with edges labeled a, b and c, and the insertion tree built from it.]

Page 56: Dictionary Matching and Indexing with Edits and Don’t Cares

Grouping

[Figure: trie with edges labeled a and b; its substitution trees are merged into groups.]

Query string: aab

Page 60: Dictionary Matching and Indexing with Edits and Don’t Cares

Analysis

Can’t merge along all possible paths of the original trie: too expensive.

Merge along centroid paths. Centroid paths always follow the heaviest child.

Any path from root to leaf traverses at most log n centroid paths.
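The centroid path decomposition can be sketched directly: start a path at the root, keep following the heaviest child, and start a new path at every other child. The code below is an illustration with assumed names and a hypothetical dictionary; node identity is represented by the prefix that leads to the node.

```python
import math

END = "$"  # end-of-word marker (assumed not to occur in words)


def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def weight(node):
    """Number of dictionary words stored at or below this node."""
    return sum(1 if label == END else weight(child) for label, child in node.items())


def centroid_paths(root):
    """Split the trie into centroid paths; each path lists the prefixes of its nodes."""
    paths = []

    def start_path(node, prefix):
        path = [prefix]
        paths.append(path)
        while True:
            children = {label: c for label, c in node.items() if label != END}
            if not children:
                return
            heavy = max(children, key=lambda label: weight(children[label]))
            for label, child in children.items():
                if label != heavy:
                    start_path(child, prefix + label)  # lighter child: new path
            node, prefix = children[heavy], prefix + heavy
            path.append(prefix)                        # heaviest child: same path

    start_path(root, "")
    return paths


def paths_crossed(word, paths):
    """Number of centroid paths touched on the way from the root to word."""
    prefixes = {word[:i] for i in range(len(word) + 1)}
    return sum(1 for path in paths if prefixes.intersection(path))


words = ["far", "fat", "fit", "pin", "pay", "sit"]
paths = centroid_paths(build_trie(words))
```

Leaving a centroid path means stepping to a non-heaviest child, which at least halves the weight, so a root-to-leaf walk leaves a path at most log2 n times. Counting the path it starts on, `paths_crossed` is therefore at most log2 n + 1.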

Page 70: Dictionary Matching and Indexing with Edits and Don’t Cares

Grouping

Suppose a search reached up to the 7th edge with no substitutions… then we search only three substitution trees.

Space increase: log n factor
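The slides do not spell out how the groups are formed. One reading that reproduces both numbers on this slide is a dyadic grouping: the edges of a path are grouped into aligned blocks of sizes 1, 2, 4, …, with one merged substitution tree per block. The sketch below assumes that reading; the function names are hypothetical.

```python
def prefix_groups(m):
    """Cover edges 1..m with aligned power-of-two blocks, largest first."""
    groups, start = [], 1
    bit = 1 << (m.bit_length() - 1) if m else 0
    while bit:
        if m & bit:
            groups.append((start, start + bit - 1))
            start += bit
        bit >>= 1
    return groups


def groups_containing(edge, path_length):
    """All blocks (one per block size) that contain the given edge."""
    out, size = [], 1
    while size <= path_length:
        first = (edge - 1) // size * size + 1
        out.append((first, min(first + size - 1, path_length)))
        size *= 2
    return out
```

Under this assumption, `prefix_groups(7)` yields the three blocks (1,4), (5,6) and (7,7), matching "only three substitution trees", and every edge belongs to about log n blocks, matching the log n factor in space.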

Page 74: Dictionary Matching and Indexing with Edits and Don’t Cares

Analysis

[Figure: root-to-leaf path with weights w1, w2, w3, w4 marked along it.]

log n searches

log n searches

log n searches

Total number of searches: log n * log n = log^2 n

Page 75: Dictionary Matching and Indexing with Edits and Don’t Cares

Analysis

For k=1:

For each centroid path traversed, log n substitution subtree searches.

A path to a leaf traverses at most log n centroid paths.

This gives log^2 n searches, reduced to log n searches using balanced grouping.

More generally: log^k n searches.

Using a Y-fast trie, each search takes log log n time, for a total of log^k n log log n.

Page 76: Dictionary Matching and Indexing with Edits and Don’t Cares

More Rigorous Analysis

Balanced Search Tree

Page 81: Dictionary Matching and Indexing with Edits and Don’t Cares

More Rigorous Analysis

Weight Balanced Search Tree

O(log(W/w)) levels

Page 82: Dictionary Matching and Indexing with Edits and Don’t Cares

More Rigorous Analysis

For a segment of a centroid path whose top has weight W and bottom has weight w, we do about log(W/w) searches.

Page 83: Dictionary Matching and Indexing with Edits and Don’t Cares

Analysis

[Figure: root-to-leaf path with weights w1, w2, w3, w4 marked along it.]

log(w1/w2) searches

log(w2/w3) searches

log(w3/w4) searches

Total number of searches: log(w1/w2) + log(w2/w3) + log(w3/w4) = log(w1/w4)
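Written out, the sum telescopes because the intermediate weights cancel. The final bound by log n is an added step that assumes w1 is at most n (the weight at the root) and w4 is at least 1:

```latex
\begin{align*}
\log\frac{w_1}{w_2} + \log\frac{w_2}{w_3} + \log\frac{w_3}{w_4}
  &= \log\!\left(\frac{w_1}{w_2}\cdot\frac{w_2}{w_3}\cdot\frac{w_3}{w_4}\right) \\
  &= \log\frac{w_1}{w_4} \;\le\; \log n
\end{align*}
```

This is how the log^2 n count from the simpler analysis drops to log n for a single substitution.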

Page 84: Dictionary Matching and Indexing with Edits and Don’t Cares

More Rigorous Analysis

Time for one match: log^k n log log n / k!

Space: n (c log n)^k / k! for some constant c

Page 85: Dictionary Matching and Indexing with Edits and Don’t Cares

Open Problem

Dynamic search structure. Requires a less strict notion of “centroid path”?