Geometric Matching on Sequential Data

Veli Mäkinen

AG Genominformatik

Technische Fakultät

Universität Bielefeld

Stringology Haifa 2005 Geometric matching on sequential data 2

Introduction

Motivation: To study problems in the intersection of geometry and stringology.

Applications to time-series data.

Three problems

1D point set matching under translations (Akutsu, COCOON’04).

1D point set matching under translations, scaling and noise (Böcker & Mäkinen, EuroCG’05)

2D point set matching under translations (Ukkonen & Lemström & Mäkinen, 2003 + Cieliebak & Mäkinen, 2005).

1D point set matching under translations

Two point sets A and B of sizes m and n.

Problem 1a: Find the largest common point set of f(A) and B over translations f.

Problem 1b: Find the largest common point set of f(A) and a contiguous subset of B.

Let k be the number of unmatched points.

Example

[Figure: point sets B and A, and the translated set f(A).]

Problem 1a: k=3. Problem 1b: k=1.

Solutions

Trivial in O(m^2 n log n) time.

Easy in O(mn log m) time.

Akutsu gives an O(k^3 + n log n) time solution.

Akutsu’s solution

Use differential encoding for A and B:

A' = a_2 - a_1, a_3 - a_2, ..., a_m - a_{m-1},

B' = b_2 - b_1, b_3 - b_2, ..., b_n - b_{n-1}.

Construct the suffix tree T of A'#B'$.

Preprocess T for LCA queries.

Akutsu’s solution...

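Let Jump(a_i, b_j) = h, where h is the largest integer such that

a_{i+t} - a_i = b_{j+t} - b_j for all 0 <= t < h,

i.e. the points a_i, ..., a_{i+h-1} and b_j, ..., b_{j+h-1} coincide under one translation.

Jump(a_i, b_j) can be computed in O(1) time with an LCA query on T.

[Figure: the run b_j ... b_{j+h-1} aligned with the run a_i ... a_{i+h-1}.]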
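A minimal Python sketch of what Jump computes, assuming it returns the length h of the longest runs a_i, ..., a_{i+h-1} and b_j, ..., b_{j+h-1} whose successive differences agree. It compares the differential encodings directly, so each call costs O(h) time; the O(1) bound relies on the suffix tree with LCA preprocessing, which is not reproduced here. The function names and the 0-based indexing are illustrative choices, not taken from the slides.

```python
def diff_encode(points):
    """Differences between consecutive points of a sorted sequence."""
    return [q - p for p, q in zip(points, points[1:])]


def jump(a, b, i, j):
    """Largest h such that a[i:i+h] and b[j:j+h] agree up to one translation.

    Naive stand-in for the suffix-tree/LCA query: walks the two differential
    encodings from positions i and j until they disagree (0-based indices).
    """
    da, db = diff_encode(a), diff_encode(b)
    h = 1  # the single points a[i] and b[j] always match under some translation
    while (i + h - 1 < len(da) and j + h - 1 < len(db)
           and da[i + h - 1] == db[j + h - 1]):
        h += 1
    return h
```

For example, with a = [1, 3, 4, 9] and b = [10, 12, 13, 20, 22], the first three points of each set have the same gaps (2, then 1), so jump(a, b, 0, 0) is 3.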

Akutsu’s solution...

Observation: One of the first k+1 points in both A and B must match.

Each match defines a translation.

For each translation, one needs at most k+1 queries to Jump() to find out whether there is a large enough overlap.
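The candidate enumeration can be sketched in Python as follows. This is only an illustration under stated assumptions: k is taken to count the unmatched points of A and B together, both lists are sorted and free of duplicates, and each candidate is verified with a set lookup in O(m) time instead of the k+1 Jump() queries. The name best_translation is hypothetical.

```python
def best_translation(a, b, k):
    """Try the translations induced by the first k+1 points of a and of b.

    Returns (translation, common_count) for the best candidate that leaves at
    most k points unmatched in total, or None if no candidate qualifies.
    Verification uses a set lookup in O(m) per candidate, not Jump queries.
    """
    b_set = set(b)
    best = None
    for ai in a[: k + 1]:
        for bj in b[: k + 1]:
            t = bj - ai  # translation that moves ai onto bj
            common = sum(1 for x in a if x + t in b_set)
            unmatched = len(a) + len(b) - 2 * common
            if unmatched <= k and (best is None or common > best[1]):
                best = (t, common)
    return best
```

Only (k+1)^2 translations are tried, which is where the dependence on k rather than on m and n comes from.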

Akutsu’s solution...

Theorem 1: Problem 1a can be solved in O(k^3 + n log n) time and Problem 1b in O(k^2 n + n log n) time.

Akutsu also gives reductions from 2D/3D problems to 1D achieving good bounds.

Linear 1D point set matching

Let us consider a generalization where we also allow scaling and noise.

We search for the best linear mapping from point set A to point set B:
- the maximum number of points of A should move close to points of B.

Example

[Figure: point sets A and B.]

Example...

[Figure: point sets A and B, and the mapped set f(A).]

Linear 1D point set matching...

There is an optimum mapping such that two points of A are mapped to exactly ε-distance from some points of B.

The first of these fixes the translation, the second fixes the scale around the new origin defined by the translation.

Example

[Figure: point sets A and B and the mapped set f(A); a distance of 2ε is marked.]

Degenerate solution!

[Figure: B, A and f(A) in the degenerate case; a distance of 2ε is marked.]

One-to-one mapping

To avoid the degenerate solution, one needs a better definition for the mapping searched for.

Hence, we search for a mapping producing maximum size one-to-one matching between the points (Problem 2).

[Figure: f(A) and B, with distances of 2ε marked.]

Solving one-to-one case

Consider a fixed translation and scale.

Construct a bipartite graph having edges between points of f(A) and B that are within ε-distance of each other.

Solve the maximum matching problem on this graph.

[Figure: f(A) and B, with distances of 2ε marked.]
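For a fixed translation and scale, the construction can be sketched in Python as below. The sketch assumes f(x) = scale * x + shift and that an edge exists when |f(a) - b| <= eps; it uses a plain augmenting-path matching (Kuhn's algorithm) as one possible choice for the maximum matching step, which the slides leave open. All names are illustrative.

```python
def max_matching_fixed_map(a, b, scale, shift, eps):
    """Size of a maximum one-to-one matching between f(a) and b,
    where f(x) = scale * x + shift and a pair may match if |f(x) - y| <= eps.
    """
    fa = [scale * x + shift for x in a]
    # Bipartite graph: adj[i] lists the indices of points of b near f(a[i]).
    adj = [[j for j, y in enumerate(b) if abs(fx - y) <= eps] for fx in fa]
    match_of_b = [-1] * len(b)  # match_of_b[j] = index in a matched to b[j]

    def augment(i, seen):
        """Try to find an augmenting path starting from a[i]."""
        for j in adj[i]:
            if j in seen:
                continue
            seen.add(j)
            if match_of_b[j] == -1 or augment(match_of_b[j], seen):
                match_of_b[j] = i
                return True
        return False

    return sum(1 for i in range(len(a)) if augment(i, set()))
```

With a = [3, 1], b = [2, 4], the identity mapping and eps = 1, the point 3 first takes 2 and is then rerouted to 4 so that 1 can take 2, giving a matching of size 2.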

Solving one-to-one case...

Repeating the algorithm on each relevant translation and scale gives the optimum solution.

The overall time complexity is O((mn)^2 g(mn)), where g(x) is the complexity of the maximum matching algorithm on a graph with x edges.
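One way to read this bound, assuming that each relevant translation and each relevant scale is fixed by one pair of points (a, b) placed at distance exactly ε, so that there are O(mn) choices of each:

```latex
\underbrace{O(mn)}_{\text{translations}} \cdot
\underbrace{O(mn)}_{\text{scales per translation}} \cdot
\underbrace{g(mn)}_{\text{one maximum matching}}
= O\!\left((mn)^2\, g(mn)\right)
```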

Solving one-to-one case faster

Consider a fixed translation, and sort the relevant scales from smallest to largest.

Observation [Alt et al. 88]: The graph G_i corresponding to the ith scale differs from the graph G_{i-1} of the (i-1)th scale by one edge.

The maximum matching on G_i can be found by searching for an augmenting path in G_{i-1} with that one edge added or deleted.

Solving one-to-one case faster...

Incremental computation gives an O((mn)^3) time solution.

Theorem 2: Problem 2 can be solved in O((mn)^2 (m+n)) time.

To obtain the result, we exploit the monotonicity of the match graph.

Staircase property

[Figure: f_i(A) against B.]

Greedy algorithm is enough

[Figure: B against f_i(A).]
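The slides do not spell out the greedy algorithm. One plausible reading, sketched here in Python, is a left-to-right scan over the two sorted point lists that matches the leftmost remaining points whenever they are within distance eps. The name greedy_match and the parameter eps are assumptions for illustration.

```python
def greedy_match(fa, b, eps):
    """Maximum one-to-one matching size between sorted point lists fa and b,
    where two points may be matched if they are within distance eps.
    Runs in O(len(fa) + len(b)) time with two pointers.
    """
    i = j = matched = 0
    while i < len(fa) and j < len(b):
        if abs(fa[i] - b[j]) <= eps:
            matched += 1
            i += 1
            j += 1
        elif fa[i] < b[j]:
            i += 1  # fa[i] is too far left of every remaining point of b
        else:
            j += 1  # b[j] is too far left of every remaining point of fa
    return matched
```

Because both lists are sorted and every point has the same tolerance, a skipped point can never be matched later, which is why a single linear scan suffices in this sketch.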

scale i => scale i+1

[Figure: B against f_{i+1}(A).]

scale i+1

[Figure: B against f_{i+1}(A).]

scale i+1 => scale i+2

[Figure: B against f_{i+2}(A).]

scale i+2

[Figure: B against f_{i+2}(A).]

Observation - open question

Observation: With only translations and noise, we obtain O(mn(m+n)) time.

The staircase matrix changes by only one cell when moving from one scale to the next.

Question: Can one update the greedy path incrementally?

An O(1)-time update would imply that adding noise does not make the problem any harder.
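The O(mn(m+n)) bound for translations and noise can be illustrated with the following Python sketch. It assumes that an optimal translation can be taken from the O(mn) values b - a - eps and b - a + eps, and it checks each candidate with a two-pointer scan in O(m+n) time. Integer coordinates are used so that the boundary comparisons are exact; the function name is hypothetical and both lists are assumed sorted and non-empty.

```python
def best_translation_with_noise(a, b, eps):
    """Return (count, t) maximising the one-to-one matching between a + t and b.

    a and b are sorted, non-empty lists; a pair matches if |x + t - y| <= eps.
    Candidates are the O(mn) interval endpoints y - x - eps and y - x + eps,
    and each one is checked by an O(m + n) two-pointer scan.
    """
    def greedy(t):
        i = j = matched = 0
        while i < len(a) and j < len(b):
            d = a[i] + t - b[j]
            if abs(d) <= eps:
                matched, i, j = matched + 1, i + 1, j + 1
            elif d < 0:
                i += 1
            else:
                j += 1
        return matched

    candidates = {y - x + s for x in a for y in b for s in (-eps, eps)}
    # Ties on the count are broken in favour of the larger translation.
    return max((greedy(t), t) for t in sorted(candidates))
```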

2D point set matching

[Figure: 2D point sets B and A, and the translated set f(A).]

Solutions

Easy in O(mn log m) time: construct the set of mn translation vectors, sort it, and find the most frequently repeated element.

Also possible in O(mn) time with a naive string matching type of algorithm.
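The sort-based approach can be sketched in Python as follows. Note that this sketch sorts all mn vectors at once, which costs O(mn log(mn)) time rather than the O(mn log m) stated above; reaching the stated bound would need a merge of m pre-sorted lists, which is omitted here. Points are assumed to be distinct (x, y) tuples, and the function name is illustrative.

```python
from itertools import groupby


def most_common_translation(a, b):
    """Return (vector, count) for the translation matching most points of a to b.

    a and b are lists of (x, y) tuples. All len(a) * len(b) difference vectors
    are built and sorted, then the longest run of equal vectors is reported.
    """
    vectors = sorted((bx - ax, by - ay) for ax, ay in a for bx, by in b)
    best_vec, best_count = None, 0
    for vec, run in groupby(vectors):
        count = sum(1 for _ in run)
        if count > best_count:
            best_vec, best_count = vec, count
    return best_vec, best_count
```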

Naive point set matching

[Figure: point sets A and B.]

Remark: This is the fastest known algorithm for this problem!

Restricted case?

Would the problem become easier if there were no other points inside the area of matches?

[Figure: the area covered by f(A).]

Restricted case?

The restricted 1D case is extremely easy:
- Exact string matching on the differentially encoded sequences.
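A possible Python sketch of the restricted 1D case: both point sets are differentially encoded and the encoding of A is searched for in the encoding of B. Knuth-Morris-Pratt is used here as one standard exact matching algorithm; the slides do not name a specific one. The sketch assumes sorted inputs and at least two points in A, and the function names are illustrative.

```python
def diff_encode(points):
    """Differences between consecutive points of a sorted sequence."""
    return [q - p for p, q in zip(points, points[1:])]


def restricted_occurrences(a, b):
    """Indices j such that a, translated, equals b[j : j + len(a)] exactly.

    Runs Knuth-Morris-Pratt on the differential encodings; assumes a and b
    are sorted and a has at least two points.
    """
    pat, text = diff_encode(a), diff_encode(b)
    # Failure function of the pattern.
    fail = [0] * len(pat)
    k = 0
    for q in range(1, len(pat)):
        while k and pat[q] != pat[k]:
            k = fail[k - 1]
        if pat[q] == pat[k]:
            k += 1
        fail[q] = k
    # Scan the text, reporting the start index of every full match.
    hits, k = [], 0
    for pos, sym in enumerate(text):
        while k and sym != pat[k]:
            k = fail[k - 1]
        if sym == pat[k]:
            k += 1
        if k == len(pat):
            hits.append(pos - len(pat) + 1)
            k = fail[k - 1]
    return hits
```

The total running time is linear in m + n, since the encoding and the scan each make one pass over the data.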

Easier on grid points

Easier on grid points...

The problem becomes a special case of two-dimensional exact string matching.

Can be solved in O(N^2) time on a text grid of size N×N and a pattern grid of size M×M.

Notice that the run-length encoded representation of the rows of the matrix is of size O(n).
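To illustrate why the encoding has size O(n), here is a small Python sketch that run-length encodes the rows of a binary grid given only its 1-bits as (row, column) pairs. It assumes one particular encoding, namely the length of the run of 0s preceding each 1-bit, with trailing 0s and empty rows left out; the slides do not fix these details, and the function name is hypothetical.

```python
from collections import defaultdict


def run_length_rows(points):
    """Map each non-empty row to the lengths of its 0-runs before each 1-bit.

    points is an iterable of (row, col) grid coordinates with col >= 0.
    Every point contributes exactly one number, so the total size is O(n).
    """
    cols_by_row = defaultdict(list)
    for row, col in points:
        cols_by_row[row].append(col)
    encoded = {}
    for row, cols in cols_by_row.items():
        cols.sort()
        prev = -1
        runs = []
        for col in cols:
            runs.append(col - prev - 1)  # zeros between the previous 1-bit and this one
            prev = col
        encoded[row] = runs
    return encoded
```

For instance, a row with 1-bits in columns 2, 5 and 6 (the bit string 0010011) is encoded as [2, 2, 0].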

Easier on grid points...

The algorithm of Amir & Landau & Sokol, 2002, for run-length compressed 2D search can be applied:
- Time complexity O(M^2 + n). (Can it be reduced to O(m^2 + n)?)

What about Bird-Baker?

Our idea for solving the problem is to modify the Bird-Baker algorithm to work directly on point sets.

As a preliminary tool, we need an Aho-Corasick automaton that recognizes run-length encoded binary strings.

Run-length encoding

[Figure: two rows of points labelled 5.7, 12.2 and 3.1, 9.3, ...; the first row is encoded as 0^{5.7} 1 0^{12.2} ...]

Modified Aho-Corasick automaton

Proposition: There is an automaton accepting a set of run-length encoded binary strings with the following properties:
- O(m log m) construction time, where m is the number of 1-bits in the set.
- Reading a fail-link in O(log m) time.
- Scanning a string with n 1-bits in O(n log m) time.

Bird-Baker on point sets

Now we can build our automaton on the rows of set A and scan the rows of set B with it.

Let R be the set of positions where a row of A was accepted inside the rows of B.

After sorting R by columns, we can test in O(|R|) time if any column of R contains the correct sequence of accepting states.

Bird-Baker on point sets

The overall running time is O(n log m + |R| log |R|).

Unfortunately, there are examples where |R| = Θ(mn) :-(

Hence, it is still open whether (even) the restricted case has an o(mn) solution.