a new top-down algorithm for tree inclusion

A New Top-down Algorithm for Tree Inclusion

Dr. Yangjun Chen

Dept. Applied Computer Science,

University of Winnipeg

515 Portage Ave.

Winnipeg, Manitoba, Canada R3B 2E9

Outline

• Motivation

• Basic algorithm for tree inclusion problem
- Definition
- Algorithm description

• Improvements

• Summary

Given two ordered labeled trees P and T, called the pattern and the target, respectively, an interesting problem is: can we obtain pattern P by deleting some nodes from target T? That is, is there a sequence v1, ..., vk of nodes such that for

T0 = T and Ti+1 = delete(Ti, vi+1) for i = 0, ..., k - 1,

we have Tk = P? If this is the case, we say that P is included in T, that T contains P, or that T covers P.
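As a point of reference, the question above can be answered by a direct recursion over forests. The sketch below is not the algorithm presented in these slides; it is a simple exponential-time check, useful only as an oracle on small inputs. The (label, children) tuple encoding and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

# A tree is written as (label, [child trees]); this encoding is an assumption.
Tree = Tuple[str, list]


def forest_included(pattern: List[Tree], target: List[Tree]) -> bool:
    """Return True if `pattern` can be obtained from `target` by deleting nodes."""
    if not pattern:
        return True            # delete every remaining target node
    if not target:
        return False           # pattern nodes left, nothing to match them with
    (p_label, p_kids), p_rest = pattern[0], pattern[1:]
    (t_label, t_kids), t_rest = target[0], target[1:]
    # Option 1: keep the first target root as the image of the first pattern root.
    if (p_label == t_label
            and forest_included(p_kids, t_kids)
            and forest_included(p_rest, t_rest)):
        return True
    # Option 2: delete the first target root; its children take its place.
    return forest_included(pattern, t_kids + t_rest)


def tree_included(pattern: Tree, target: Tree) -> bool:
    return forest_included([pattern], [target])


# Made-up example trees (not taken from the slides).
T = ("a", [("c", [("b", []), ("d", [])]), ("e", [])])
P = ("a", [("b", []), ("d", []), ("e", [])])
print(tree_included(P, T))  # True: deleting c from T yields P
```

Because sibling order must be preserved, a pattern such as a with children d, b (in that order) is not included in this T.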

Motivation

[Figure: a target tree T rooted at a, with nodes b, c, d, e and f, shown next to the tree produced by delete(T, c).]
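The delete operation can be sketched as follows, assuming the usual meaning for ordered trees: the deleted node disappears and its children are spliced, in order, into its parent's child list at its former position. The tuple encoding, the unique-label assumption, and the example tree are all hypothetical and are not taken from the figure.

```python
def delete(tree, label):
    """Return a copy of `tree` in which the non-root node `label` is removed.

    Trees are (label, [children]) tuples; labels are assumed to be unique.
    The children of the removed node take its place among its siblings.
    """
    node_label, children = tree
    new_children = []
    for child in children:
        if child[0] == label:
            new_children.extend(child[1])      # splice grandchildren in place
        else:
            new_children.append(delete(child, label))
    return (node_label, new_children)


# Made-up example: c has children b and d, followed by sibling e.
T = ("a", [("c", [("b", []), ("d", [])]), ("e", [])])
print(delete(T, "c"))
```

Applying delete repeatedly corresponds to the sequence T0, T1, ..., Tk in the problem statement.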

Motivation

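[Figure: linguistic analysis. A pattern parse tree (node labels s, vp, v, n, adv; words “reads” and “book”) is shown next to the parse tree of the sentence “The student reads the interesting book again and again” (node labels s, np, vp, det, n, v, adj, adv).]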

Definition 1 Let F and G be labeled ordered forests. We define an ordered embedding (φ, G, F) as an injective function φ: V(G) → V(F) such that for all nodes v, u ∈ V(G),

i) label(v) = label(φ(v)); (label preservation condition)
ii) v is an ancestor of u iff φ(v) is an ancestor of φ(u); (ancestor condition)
iii) v is to the left of u iff φ(v) is to the left of φ(u). (sibling condition)

Tree inclusion algorithm Definition

[Figure: example forests G (node labels a, b, b) and F (node labels a, d, b, e, b, b).]
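The three conditions of Definition 1 can be checked mechanically for a candidate mapping. The sketch below is an illustration under assumptions: forests are lists of (label, children) tuples, nodes are identified by their preorder index, and the mapping is a dict called f. The ancestor and left-of relations are derived from preorder and postorder numbers. The example forests are made up and are not the ones in the figure.

```python
def number(forest):
    """Return [label, preorder, postorder] for each node, indexed by preorder."""
    info = []
    post = [0]

    def visit(node):
        label, kids = node
        idx = len(info)
        info.append([label, idx, None])
        for kid in kids:
            visit(kid)
        info[idx][2] = post[0]
        post[0] += 1

    for tree in forest:
        visit(tree)
    return info


def is_ancestor(n, a, b):
    return n[a][1] < n[b][1] and n[a][2] > n[b][2]


def is_left_of(n, a, b):
    return n[a][1] < n[b][1] and n[a][2] < n[b][2]


def is_ordered_embedding(f, G, F):
    """Check that dict f (node of G -> node of F) satisfies conditions i) to iii)."""
    g, h = number(G), number(F)
    # f must be defined on every node of G and be injective.
    if set(f) != set(range(len(g))) or len(set(f.values())) != len(g):
        return False
    for v in f:
        if g[v][0] != h[f[v]][0]:                                 # i) labels
            return False
        for u in f:
            if is_ancestor(g, v, u) != is_ancestor(h, f[v], f[u]):  # ii)
                return False
            if is_left_of(g, v, u) != is_left_of(h, f[v], f[u]):    # iii)
                return False
    return True


G = [("a", [("b", []), ("b", [])])]
F = [("a", [("d", [("b", [])]), ("b", [])])]   # preorder: a=0, d=1, b=2, b=3
print(is_ordered_embedding({0: 0, 1: 2, 2: 3}, G, F))
```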

Algorithm

Tree inclusion algorithm

1. Let T = <t; T1, ..., Tk> (k ≥ 1) be a tree and G = <P1, ..., Pl> (l ≥ 1) be a forest. We handle G as a tree P = <pv; P1, ..., Pl>, where pv represents a virtual node, matching any node in T.

2. Consider a node v in P with children v1, ..., vj. We use a pair <i, v> (i ≤ j) to represent an ordered forest containing the first i subtrees of v: <P[v1], ..., P[vi]>. Then, <j, pv> represents the first j trees in G.

[Figure: a node v in P with children v1, ..., vi, ..., vk; the pair <i, v> covers the subtrees rooted at v1, ..., vi.]
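The pair notation can be made concrete with a small sketch. It assumes trees are (label, children) tuples; the helper names and the label chosen for the virtual node are hypothetical.

```python
def with_virtual_root(forest):
    """Wrap a forest G = <P1, ..., Pl> under a virtual node pv."""
    return ("pv", list(forest))   # the label "pv" is a placeholder, not a real label


def forest_prefix(i, v):
    """Return the forest denoted by <i, v>: the first i subtrees of node v."""
    _, children = v
    if not 0 <= i <= len(children):
        raise ValueError("i must lie between 0 and the outdegree of v")
    return children[:i]


G = [("x", []), ("y", [("z", [])]), ("w", [])]
pv = with_virtual_root(G)
print(forest_prefix(2, pv))   # the first two trees of G
```

With this encoding, a return value <i, pv> says that the first i trees of G are covered, while <i, v> for a node v inside P1 refers to a prefix of v's own subtrees.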

Algorithm


3. In addition, h(v) represents the height of v in a tree, and δ(v) represents a link from v in P to the leaf node on the left-most path in P[v].

Let v’ be a leaf node in P. We denote by δ-1(v’) the set of nodes v such that δ(v) = v’.

δ-1(v3) = {v1, v2, v3}

[Figure: a tree P with nodes v1, ..., v5 and the links δ(v1) and δ(v2), both of which lead to the leaf v3.]
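The left-most-leaf link and its inverse can be computed in one traversal. The sketch below assumes (label, children) tuples with unique labels and calls the two maps delta and delta_inv; the shape of the example tree is a guess chosen only so that the inverse image of v3 is {v1, v2, v3}, as stated on the slide.

```python
def leftmost_leaf_links(tree):
    """Return (delta, delta_inv) for a tree with unique labels.

    delta[v] is the leaf on the left-most path of the subtree rooted at v;
    delta_inv[leaf] is the set of nodes whose link ends at that leaf.
    """
    delta = {}

    def visit(node):
        label, kids = node
        leaf = label                      # a leaf links to itself
        for idx, kid in enumerate(kids):
            kid_leaf = visit(kid)
            if idx == 0:
                leaf = kid_leaf           # inherit from the first child
        delta[label] = leaf
        return leaf

    visit(tree)
    delta_inv = {}
    for v, leaf in delta.items():
        delta_inv.setdefault(leaf, set()).add(v)
    return delta, delta_inv


# Assumed shape: v1 has children v2 and v5; v2 has children v3 and v4.
P = ("v1", [("v2", [("v3", []), ("v4", [])]), ("v5", [])])
delta, delta_inv = leftmost_leaf_links(P)
print(sorted(delta_inv["v3"]))   # ['v1', 'v2', 'v3']
```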

The tree inclusion checking is done by calling two functions recursively:

top-down(T, G),
bottom-up(T’, G),

where T is a tree, and T’ and G are two forests.

Algorithm


Each of the two functions returns a pair <i, v> with v being pv or a node on the left-most path in P1.

T = <t; T1, ..., Tk>

T’ = <T1’, ..., Tk’>

G = <P1, ..., Pl>

Function: top-down(T, G)

In top-down(T, G), two cases will be handled.

Case 1: G = <P1>; or G = <P1, ..., Pl> (l > 1), but |T| ≤ |P1| + |P2|.

In this case, we try to find a pair <i, v> such that T contains the first i subtrees of v, where v = pv, or v ∈ δ-1(v’) and v’ is the leaf node on the left-most path in P1.

[Figure: the two situations of Case 1. Left: T (root t) against G = <P1> under pv. Right: T (root t) against G = <P1, P2, ..., Pl> under pv, with |T| ≤ |P1| + |P2|.]



i) If t is a leaf node, we check whether label(t) = label(δ(p1)), where p1 is the root of P1. If so, return <1, parent of δ(p1)>. Otherwise, return <0, parent of δ(p1)>.




ii) If |T| < |P1| or height(t) < height(p1), we make a recursive call top-down(T, <P11, ..., P1j>), where <P11, ..., P1j> is the forest of the subtrees of p1. The return value of top-down(T, <P11, ..., P1j>) is used as the return value of top-down(T, G).

[Figure: Case 1 with |T| < |P1|: T (root t) is compared with the subtrees P11, ..., P1j of p1, the root of P1 under pv.]



iii) If |T| ≥ |P1| (but |T| ≤ |P1| + |P2|) and height(t) ≥ height(p1), two cases need to be considered:

• label(t) = label(p1). Call bottom-up(<T1, ..., Tk>, <P11, ..., P1j>).

• label(t) ≠ label(p1). Call bottom-up(<T1, ..., Tk>, <P1>).

[Figure: Case 1, two sub-cases. With label(t) = label(p1), the subtrees T1, ..., Tk of t are compared with the subtrees P11, ..., P1j of p1. With label(t) ≠ label(p1), T1, ..., Tk are compared with P1 itself.]

In both sub-cases, assume that the return value is <i, v>. A further check needs to be conducted:

• If label(t) = label(v) and i = the outdegree of v, the return value should be <1, v’s parent>.

• Otherwise, the return value remains <i, v>.

[Figure: Case 1: the root t of T against a node v in P1 (root p1), for label(t) = label(v) and for label(t) ≠ label(v).]
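The further check on the returned pair amounts to a small adjustment step: when the subtrees of t cover all children of v and t carries v's label, the whole of P[v] is covered, which is the first subtree of v's parent. The sketch below illustrates this under assumptions; the PNode class and the function name adjust_result are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False, repr=False)
class PNode:
    """A pattern node with a parent pointer (illustrative structure)."""
    label: str
    children: List["PNode"] = field(default_factory=list)
    parent: Optional["PNode"] = None

    def add(self, child: "PNode") -> "PNode":
        child.parent = self
        self.children.append(child)
        return child


def adjust_result(t_label: str, result: Tuple[int, PNode]) -> Tuple[int, PNode]:
    """Apply the check: <i, v> becomes <1, parent(v)> if t matches v completely."""
    i, v = result
    if t_label == v.label and i == len(v.children):
        return (1, v.parent)
    return (i, v)


pv = PNode("pv")                 # virtual root
p1 = pv.add(PNode("a"))
p1.add(PNode("b"))
p1.add(PNode("c"))

lifted = adjust_result("a", (2, p1))
print(lifted[0], lifted[1].label)   # 1 pv
```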



Case 2: G = <P1, ..., Pl> (l > 1), and |T| > |P1| + |P2|. In this case, we call bottom-up(<T1, ..., Tk>, G). Assume that the return value is <i, v>. The following checks are then conducted.


[Figure: Case 2: G = <P1, P2, ..., Pl> under pv, and T with root t and subtrees T1, T2, ..., Tk, where |T| > |P1| + |P2|.]



iv) If v = p1’s parent, the return value is the same as <i, v>.

v) If v ≠ p1’s parent, check whether label(t) = label(v) and i = the outdegree of v. If so, the return value is changed to <1, v’s parent>. Otherwise, the return value remains <i, v>.
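The branching in top-down(T, G) depends only on a few sizes and heights, so it can be summarized as a dispatch function. This is a sketch of the case selection alone, not of the recursive work inside each branch. The function name, its parameters, and the returned tags are hypothetical, and the comparison operators follow the reading |T| ≤ |P1| + |P2| for Case 1 and |T| > |P1| + |P2| for Case 2.

```python
def top_down_branch(t_size, t_height, t_is_leaf, p_sizes, p1_height):
    """Name the branch of top-down(T, G) that applies.

    t_size, t_height : |T| and height(t)
    t_is_leaf        : whether the root t of T is a leaf
    p_sizes          : [|P1|, |P2|, ...] for G = <P1, ..., Pl>
    p1_height        : height(p1)
    """
    case1 = len(p_sizes) == 1 or t_size <= p_sizes[0] + p_sizes[1]
    if not case1:
        return "2"        # call bottom-up(<T1, ..., Tk>, G), then checks iv) and v)
    if t_is_leaf:
        return "1.i"      # compare label(t) with the label of P1's left-most leaf
    if t_size < p_sizes[0] or t_height < p1_height:
        return "1.ii"     # recurse into the subtrees of p1
    return "1.iii"        # call bottom-up on the subtrees of t


print(top_down_branch(10, 3, False, [3, 5], 1))   # "2": 10 > 3 + 5
```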


[Figure: the two outcomes in Case 2 for G = <P1, P2, ..., Pl> under pv. Left: v = p1’s parent = pv. Right: v ≠ p1’s parent.]

Function: bottom-up(T’, G)


bottom-up(T’, G) is designed to handle the case that both T’ and G are forests. Let T’ = <T1, ..., Tk> and G = <P1, ..., Pq>. In bottom-up(T’, G), we make a series of calls top-down(Tl, <Pjl, ..., Pq>), where l = 1, ..., k, j1 = 0, and j1 ≤ j2 ≤ ... ≤ jh ≤ q (for some h ≤ k), controlled as follows.

[Figure: the forests T’ = <T1, T2, ..., Ti, ..., Tk> and G = <P1, ..., Pi, ..., Pq>, with a call top-down(Tl, <Pjl, ..., Pq>).]



1. Two index variables l, j are used to scan T1, ..., Tk and P1, ..., Pq, respectively.

2. Let <il, vl> be the return value of top-down(Tl, <Pj, ..., Pq>). If vl = pj’s parent, set j to be j + il - 1. Otherwise, j is not changed. Set l to be l + 1. Go to (2).

3. The loop terminates when all Tl’s or all Pj’s are examined.




• If j > 0 when the loop terminates, bottom-up(T’, G) returns <j, p1’s parent>.

• Otherwise, j = 0. In this case, we continue searching for a pair <i, v> such that T’ contains the first i subtrees of v, where v ∈ δ-1(v’) and v’ is the leaf node on the left-most path in P1, as described below.

[Figure: the forests T’ = <T1, T2, ..., Ti, ..., Tk> and G = <P1, ..., Pi, ..., Pj, ..., Pq>.]

i) Let <i1, v1>, <i2, v2>, ..., <ik, vk> be the respective return values of

top-down(T1, <P1, ..., Pq>),
top-down(T2, <P1, ..., Pq>),
... ...
top-down(Tk, <P1, ..., Pq>).

Since j = 0, each vl ∈ δ-1(v’) (l = 1, ..., k).

[Figure: the nodes v1, v2, ..., vk inside P1.]

ii) If each il = 0, return a pair whose first component is 0 and whose second component is a special node considered to be a descendant of any node in G. Otherwise, find the first vg with children w1, ..., wh such that vg is not a descendant of any other vj, and ig > 0. Call

bottom-up(<Tg+1, ..., Tk>, <P[wig+1], ..., P[wh]>).

• Let <x, y> be its return value. If y = vg, then the return value of bottom-up(T’, G) is set to be <ig + x, vg>.

• Otherwise, the return value is <ig, vg>.

[Figure: the forest <T1, T2, ..., Tg, Tg+1, ..., Tk> and P1 with the nodes v1, ..., vg, ..., vk; ig marks the first ig subtrees of vg.]
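Choosing vg from the return values can be sketched as below. Two points are assumptions rather than statements from the slides: nodes are identified by labels with a child-to-parent dict, and "any other vj" is read as "any other returned node with ij > 0". The function names are hypothetical.

```python
def is_proper_descendant(u, v, parent):
    """True if u lies strictly below v; `parent` maps each node to its parent."""
    u = parent.get(u)
    while u is not None:
        if u == v:
            return True
        u = parent.get(u)
    return False


def pick_vg(results, parent):
    """Return the index g of the first pair (i_g, v_g) with i_g > 0 whose v_g is
    not a proper descendant of another returned node with i_j > 0.
    Return None if every i_l is 0."""
    for g, (i_g, v_g) in enumerate(results):
        if i_g == 0:
            continue
        dominated = any(
            i_j > 0 and is_proper_descendant(v_g, v_j, parent)
            for j, (i_j, v_j) in enumerate(results)
            if j != g
        )
        if not dominated:
            return g
    return None


# Left-most path pv -> p1 -> v2 -> v3 (made-up names).
parent = {"p1": "pv", "v2": "p1", "v3": "v2"}
results = [(1, "v3"), (0, "v2"), (2, "v2"), (1, "v2")]
print(pick_vg(results, parent))   # 2
```

In this example the first pair is skipped because v3 lies below v2, the second because its count is 0, so the third pair supplies <ig, vg>.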

Further improvements


In the case j = 0:

Let <i1, v1>, ..., <ik, vk> be the return values of top-down(T1, <P1, ..., Pq>), ..., top-down(Tk, <P1, ..., Pq>). We find the first vg such that it is not a descendant of any other vj and ig > 0. Then,

bottom-up(<Tg+1, ..., Tk>, <P[wig+1], ..., P[wh]>)

is invoked. This shows that none of the return values except <ig, vg> is used in the subsequent computation. Thus, the work spent computing such values should be avoided.


Let <ij, vj> be the return value of top-down(Tj, <P1, ..., Pq>) such that ij > 0 and vj is p1 or a descendant of p1. Then, during the execution of top-down(Tj+1, <P1, ..., Pq>), once we have detected that it can only produce a return value <ij+1, vj+1> with vj+1 being a descendant of vj, we should stop the corresponding computation immediately, since this return value will not be used in the subsequent search. For this purpose, we change top-down(Tj+1, <P1, ..., Pq>) to top-down(Tj+1, <P1, ..., Pq>, vj), with vj being used to transfer information, called a controlling node.

Further improvements


Assume that in the execution of top-down(Tj+1, <P1, ..., Pq>, vj), we have the following function calls:

top-down(Tj+1,1, <P1, ..., Pq>, u1) returns <a1, u1>,
top-down(Tj+1,2, <P1, ..., Pq>, u2) returns <a2, u2>,
… …

with all ui’s being proper descendants of vj. Then the bottom-up function call with some ui as a controlling node,

bottom-up(<Tj+1,i, ... >, <… …>, ui),

should not be conducted.

Summary

• An efficient method for the tree inclusion problem
- O(|T|·min{DP, |leaves(P)|}) time and
- O(|T| + |P|) space,
where DP is the height of P and leaves(P) is the set of the leaf nodes of P.

• Future work
- adapt the algorithm to a data stream environment
- adapt the algorithm to an indexing environment

Thank you.
