

An UpDown Directed Acyclic Graph Approach for Sequential Pattern Mining

Jinlin Chen, Member, IEEE

Abstract—Traditional pattern growth-based approaches for sequential pattern mining derive length-(k+1) patterns based on the projected databases of length-k patterns recursively. At each level of recursion, they unidirectionally grow the length of detected patterns by one along the suffix of detected patterns, which needs k levels of recursion to find a length-k pattern. In this paper, a novel data structure, UpDown Directed Acyclic Graph (UDDAG), is invented for efficient sequential pattern mining. UDDAG allows bidirectional pattern growth along both ends of detected patterns. Thus, a length-k pattern can be detected in ⌊log₂ k⌋ + 1 levels of recursion at best, which results in fewer levels of recursion and faster pattern growth. When minSup is large such that the average pattern length is close to 1, UDDAG and PrefixSpan have similar performance because the problem degrades into a frequent item counting problem. However, UDDAG scales up much better. It often outperforms PrefixSpan by almost one order of magnitude in scalability tests. UDDAG is also considerably faster than Spade and LapinSpam. Except for extreme cases, UDDAG uses memory comparable to that of PrefixSpan and less memory than Spade and LapinSpam. Additionally, the special features of UDDAG enable its extension toward applications that involve searching in large spaces.

Index Terms—Data mining algorithm, directed acyclic graph, performance analysis, sequential pattern, transaction database.


1 INTRODUCTION

SEQUENTIAL pattern mining is an important data mining problem, which detects frequent subsequences in a sequence database. A major technique for sequential pattern mining is pattern growth. Traditional pattern growth-based approaches (e.g., PrefixSpan) derive length-(k+1) patterns based on the projected databases of a length-k pattern recursively. At each level of recursion, the length of detected patterns is grown by 1, and patterns are grown unidirectionally along the suffix direction. Consequently, we need k levels of recursion to mine a length-k pattern, which is expensive due to the large number of recursive database projections.

In this paper, a new approach based on UpDown Directed Acyclic Graph (UDDAG) is proposed for fast pattern growth. UDDAG is a novel data structure, which supports bidirectional pattern growth from both ends of detected patterns. With UDDAG, at level-i recursion, we may grow the length of patterns by 2^(i−1) at most. Thus, a length-k pattern can be detected in ⌊log₂ k⌋ + 1 levels of recursion at minimum, which results in a better scale-up property for UDDAG compared to PrefixSpan.
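To make the recursion-depth claim concrete, the following minimal Python sketch (our own illustration, not code from the paper) compares the k levels required by unidirectional growth with the best-case ⌊log₂ k⌋ + 1 levels of bidirectional growth:

```python
import math

def unidirectional_levels(k):
    """Unidirectional growth adds one item per recursion level,
    so a length-k pattern needs k levels."""
    return k

def bidirectional_best_case_levels(k):
    """Bidirectional growth can at best double the detected pattern
    length at each level: floor(log2(k)) + 1 levels for length k."""
    return math.floor(math.log2(k)) + 1

# e.g., a length-100 pattern: 100 levels versus 7 levels at best
```

For k = 8, unidirectional growth needs 8 levels, while the best case here is ⌊log₂ 8⌋ + 1 = 4.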

Our extensive experiments clearly demonstrated the strength of UDDAG with its bidirectional pattern growth strategy. When minSup is very large such that the average length of patterns is very small (close to 1), UDDAG and PrefixSpan have similar performance because in this case, the problem degrades into a basic frequent item counting problem. However, UDDAG scales up much better compared to PrefixSpan. It often outperforms PrefixSpan by one order of magnitude in our scalability tests. UDDAG is also considerably faster than two other representative algorithms, Spade and LapinSpam. Except for some extreme cases, the memory usage of UDDAG is comparable to that of PrefixSpan. UDDAG generally uses less memory than Spade and LapinSpam.

UDDAG may be extended to other areas where efficient searching in large search spaces is necessary.

The rest of the paper is organized as follows: Section 2 defines the problem and discusses related work. Section 3 presents the motivation of our approach. Section 4 defines UDDAG-based pattern mining. Performance evaluation is presented in Section 5. Discussions on time and space complexity are presented in Section 6. Finally, we conclude the paper and discuss future work in Section 7.

2 PROBLEM STATEMENT AND RELATED WORK

2.1 Problem Statement

Let I = {i1, i2, ..., in} be a set of items. An item set is a subset of I, denoted by (x1, x2, ..., xk), where xi ∈ I, i ∈ {1, ..., k}. Without loss of generality, in this paper, we use nonnegative integers to represent items, and assume that items in an item set are sorted in ascending order. We omit the parentheses for an item set with only one item. A sequence s is a list of item sets, denoted by <s1 s2 ... sm>, where si is an item set, si ⊆ I, i ∈ {1, ..., m}. The number of instances of item sets in s is called the length of s.

Given two sequences a = <a1 a2 ... aj> and b = <b1 b2 ... bk>, if k ≥ j and there exist integers 1 ≤ i1 < i2 < ... < ij ≤ k such that a1 ⊆ bi1, a2 ⊆ bi2, ..., aj ⊆ bij, then a is a subsequence of b and b a supersequence of a. In this case, a is also said to be contained in b, denoted by a ⊑ b.
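The containment relation above can be checked greedily: match each item set of a to the earliest subsequent item set of b that is a superset of it. The sketch below is our own illustration (item sets are modeled as Python sets; this representation is not prescribed by the paper):

```python
def contains(b, a):
    """True if sequence a is a subsequence of sequence b.
    Sequences are lists of item sets (Python sets); each a_i must be
    a subset of some b_j with strictly increasing j. Greedy
    earliest-match is sufficient for this containment test."""
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

# Hypothetical example: <(1,2) 3 1 3 4> contains <(1,2) 3> but not <3 (1,2)>
b = [{1, 2}, {3}, {1}, {3}, {4}]
```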

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 7, JULY 2010 913

The author is with the Computer Science Department, Queens College, City University of New York, 65-30 Kissena Blvd., Flushing, NY 11367. E-mail: [email protected].

Manuscript received 5 Jan. 2008; revised 20 Dec. 2008; accepted 8 May 2009; published online 28 May 2009. Recommended for acceptance by S. Chakravarthy. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2008-01-0010. Digital Object Identifier no. 10.1109/TKDE.2009.135.

1041-4347/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.


A sequence database is a set of tuples <sid, s>, where sid is a sequence id and s is a sequence. A tuple <sid, s> is said to contain a sequence α if α ⊑ s.

The absolute support of a sequence α in a sequence database D is defined as SupD(α) = |{<sid, s> | (α ⊑ s) ∧ (<sid, s> ∈ D)}|, and the relative support of α is defined as SupD(α)/|D|. In this paper, we will use absolute and relative supports interchangeably. Given a positive value minSup as the support threshold, α is called a sequential pattern in D if SupD(α) ≥ minSup.
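Under this definition, support counting can be sketched as follows (our own illustration on a small hypothetical database; each tuple contributes at most 1 to a pattern's support regardless of repeated occurrences):

```python
def contains(b, a):
    """Greedy subsequence-containment test; item sets are Python sets."""
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

def absolute_support(db, alpha):
    """Number of tuples <sid, s> in db whose sequence contains alpha."""
    return sum(1 for sid, s in db if contains(s, alpha))

# Hypothetical database of (sid, sequence) tuples
db = [(1, [{1, 2}, {3}]), (2, [{1}, {2}, {3}]), (3, [{2}, {4}])]
# relative support would be absolute_support(db, alpha) / len(db)
```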

Problem Statement. Given a sequence database D and the minimum support threshold, sequential pattern mining is to find the complete set of sequential patterns (denoted by P) in the database. (Note: in this paper, we will always use D as a sequence database and P as the complete set of sequential patterns in D.)

Example 1. Given D as shown in Table 1 and minSup = 2, the length of sequence 1 is 5. <(1,2) 3> is a pattern because it is contained in both sequences 1 and 3. <(1,3)> occurs twice in sequence 1; however, sequence 1 only contributes 1 to the support of <(1,3)>. <1 (2,3) 1> is a subsequence of sequences 1 and 2.

2.2 Related Work

The problem of sequential pattern mining was introduced by Agrawal and Srikant [1]. Among the many algorithms proposed to solve the problem, GSP [17] and PrefixSpan [13], [14] represent two major types of approaches: a priori-based and pattern growth-based.

The a priori principle states that any supersequence of a nonfrequent sequence cannot be frequent. A priori-based approaches can be considered breadth-first traversal algorithms because they construct all length-k patterns before constructing length-(k+1) patterns.

The AprioriAll algorithm [1] is one of the earliest a priori-based approaches. It first finds all frequent item sets, transforms the database so that each transaction is replaced by all the frequent item sets it contains, and then finds patterns. The GSP algorithm [16] is an improvement over AprioriAll. To reduce candidates, GSP only creates a new length-k candidate when there are two frequent length-(k−1) sequences with the prefix of one equal to the suffix of the other. To test whether a candidate is a frequent length-k pattern, the support of each length-k candidate is counted by examining all the sequences. The PSP algorithm [12] is similar to GSP except that the placement of candidates is improved through a prefix tree arrangement to speed up pattern discovery. The SPIRIT algorithm [9] uses regular expressions as constraints and develops a family of algorithms for pattern mining under constraints based on the a priori rule. The SPaRSe algorithm [3] improves GSP by using both candidate generation and projected databases to achieve higher efficiency under high pattern density conditions.

The approaches above represent databases horizontally. In [4] and [19], databases are transformed into a vertical layout consisting of items' id-lists. The Spade algorithm [19] joins id-list pairs to form sequence lattices that group candidate sequences such that each group can be stored in memory. Spade then searches for patterns across each sequence lattice. In Spade, candidates are generated and tested on the fly to avoid storing candidates, but merging the id-lists of frequent sequences is costly for a large number of candidates. To reduce this cost, the SPAM algorithm [4] adopts the lattice concept but represents each id-list as a vertical bitmap. SPAM is more efficient than Spade for mining long patterns if all the bitmaps can be stored in memory. However, it generally consumes more memory. LapinSpam [20] improves SPAM by using the last-position information of items to avoid the ANDing operation or comparison at each iteration of the support counting process.
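As a sketch of the vertical layout idea, the following (our own simplified illustration, not Spade's actual implementation) maps each item to an id-list of (sequence id, position) pairs:

```python
from collections import defaultdict

def to_vertical(db):
    """Build a Spade-style vertical layout: each item maps to an
    id-list of (sequence id, position) pairs recording where the
    item occurs in the horizontal database."""
    idlists = defaultdict(list)
    for sid, seq in db:
        for pos, itemset in enumerate(seq):
            for item in itemset:
                idlists[item].append((sid, pos))
    return dict(idlists)

# A two-sequence toy database (hypothetical data)
db = [(1, [{1}, {2}]), (2, [{2}, {1}])]
```

Joining the id-lists of two items then amounts to comparing positions within the same sequence id, which is the basis of Spade's candidate testing.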

One major problem of a priori-based approaches is that a combinatorially explosive number of candidate sequences may be generated in a large sequence database, especially when long patterns exist.

Pattern growth approaches can be considered depth-first traversal algorithms, as they recursively generate the projected database for each length-k pattern to find length-(k+1) patterns. They focus the search on a restricted portion of the initial database to avoid the expensive candidate generation and test step.

The FreeSpan algorithm [10] first projects a database into multiple smaller databases based on frequent items. Patterns are found by recursively growing subsequence fragments in each projected database. Based on a similar projection technique, the same authors proposed the PrefixSpan algorithm [13], [14], which outperforms FreeSpan by projecting only effective postfixes.

One major concern with PrefixSpan is that it may generate multiple projected databases, which is expensive when long patterns exist. The MEMISP algorithm [11] uses memory indexes instead of projected databases to detect patterns. It uses a find-then-index technique to recursively find the items that constitute a frequent sequence and constructs a compact index set that indicates the set of data sequences for further exploration. As a result of effective index advancing, fewer and shorter data sequences need to be processed as the discovered patterns become longer. MEMISP is faster than the basic PrefixSpan algorithm but slower than PrefixSpan when the pseudoprojection technique is used.

Among the various approaches, PrefixSpan is one of the most influential and efficient in terms of both time and space. Some approaches may achieve better performance under special circumstances; however, the overall performance of PrefixSpan is among the best. For example, LAPIN [21] is more efficient for dense data sets with long patterns but less efficient in other cases. Besides, it consumes much more memory than PrefixSpan. FSPM [18] is claimed to be faster than PrefixSpan in many cases. However, the sequences that FSPM mines contain only a single item in each item set. In this sense, FSPM is not a pattern mining algorithm as we discuss here.


TABLE 1
An Example Sequence Database


SPAM outperforms the basic PrefixSpan but is much slower than PrefixSpan with the pseudoprojection technique [17].

3 MOTIVATION

Pattern growth-based approaches recursively grow the length of detected patterns. At each level of recursion, the algorithms first partition the solution space into disjoint subspaces. For each subspace, a projected database (or a variation, e.g., a memory index) is created, based on which a detection strategy (e.g., frequent prefix counting, memory index counting, etc.) is applied to grow existing patterns. Projection and support counting are the two major costs of pattern growth-based approaches.

In PrefixSpan, patterns are partitioned based on common prefixes and grown unidirectionally along the suffix direction of detected patterns. At each level of recursion, the length of detected patterns is only grown by 1. If we can grow patterns bidirectionally along both ends of detected patterns, we may grow patterns in parallel at each level of recursion. The motivation of this paper is to find suitable partitioning, projection, and detection strategies that allow for faster pattern growth.

To support bidirectional pattern growth, instead of partitioning patterns based on common prefixes, we can partition them based on common root items. For a database with n different frequent items (without loss of generality, we assume that these items are 1, 2, ..., n), its patterns can be divided into n disjoint subsets. The ith subset (1 ≤ i ≤ n) is the set of patterns that contain i (the root item of the subset) and items smaller than i. Since any pattern in subset i contains i, to detect the ith subset, we need only check the subset of tuples whose sequences contain i in database D, i.e., the projected database of i, or iD. In the ith subset, each pattern can be divided into two parts, the prefix and suffix of i. Since all items in the ith subset are no larger than i, we exclude items that are larger than i from iD.

Example 2. Given the following database:

1. <9 4 5 8 3 6>;
2. <3 9 4 5 8 3 1 5>;
3. <3 8 2 4 6 3 9>;
4. <2 8 4 3 6>;
5. <9 6 3>,

8D is,

1. <4 5 8 3 6>;
2. <3 4 5 8 3 1 5>;
3. <3 8 2 4 6 3>;
4. <2 8 4 3 6>.
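The projection step of this example can be sketched as follows (sequences of single-item item sets are modeled as plain integer lists; the function name is our own):

```python
def projected_db(db, root):
    """Build the root-projected database iD: keep only tuples whose
    sequences contain the root item, and drop items larger than the
    root, since they cannot occur in the ith subset of patterns."""
    out = []
    for sid, seq in db:
        if root in seq:
            out.append((sid, [x for x in seq if x <= root]))
    return out

# The Example 2 database, as (sid, sequence) tuples
db = [(1, [9, 4, 5, 8, 3, 6]),
      (2, [3, 9, 4, 5, 8, 3, 1, 5]),
      (3, [3, 8, 2, 4, 6, 3, 9]),
      (4, [2, 8, 4, 3, 6]),
      (5, [9, 6, 3])]
```

Sequence 5 contains no 8 and is excluded; the 9s are dropped because 9 > 8.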

If minSup is 2, the 8th subset of patterns is

{<8>, <3 8>, <4 8>, <5 8>, <4 5 8>, <8 3>, <8 4>, <8 6>, <8 3 6>, <8 4 3>, <8 4 6>, <3 8 3>, <4 8 3>, <5 8 3>, <4 5 8 3>}.

Observing the patterns in the 8th subset, except for <8>, which only contains 8 and can be derived directly, all other patterns can be clustered and derived as follows:

1. {<3 8>, <4 8>, <5 8>, <4 5 8>}, the patterns with 8 at the end. This cluster can be derived based on the prefix subsequences of 8 in 8D, or Pre(8D), which is:

1. <4 5>;
2. <3 4 5>;
3. <3>; and
4. <2>.

By concatenating the patterns (<3>, <4>, <5>, <4 5>) of Pre(8D) with 8, we can derive the patterns in this cluster.

2. {<8 3>, <8 4>, <8 6>, <8 3 6>, <8 4 3>, <8 4 6>}, the patterns with 8 at the beginning. This cluster can be derived based on the suffix subsequences of 8 in 8D, or Suf(8D), which is:

1. <3 6>;
2. <3 1 5>;
3. <2 4 6 3>;
4. <4 3 6>.

By concatenating 8 with the patterns (<3>, <4>, <6>, <3 6>, <4 3>, <4 6>) of Suf(8D), we can derive the patterns in this cluster.

3. {<3 8 3>, <4 8 3>, <5 8 3>, <4 5 8 3>}, the patterns with 8 in between the beginning and end of each pattern. This cluster can be mined based on the patterns in Pre(8D) and Suf(8D). In this case, a pattern (e.g., <4 8 3>) can be derived by concatenating a pattern of Pre(8D) (e.g., <4>) with the root item 8 and a pattern of Suf(8D) (e.g., <3>).
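Case 3 candidates are formed by concatenating a Pre(8D) pattern, the root item, and a Suf(8D) pattern. A minimal sketch (our own code) that enumerates the candidate pairs of this example:

```python
def concat(pre, root, suf):
    """Case 3 candidate: prefix pattern + root item + suffix pattern,
    e.g., <4> . <8> . <3>  ->  <4 8 3>."""
    return pre + [root] + suf

# Patterns detected from Pre(8D) and Suf(8D) in Example 2
pre_patterns = [[3], [4], [5], [4, 5]]
suf_patterns = [[3], [4], [6], [3, 6], [4, 3], [4, 6]]

# Every (prefix pattern, suffix pattern) pair is a possible candidate
candidates = [concat(p, 8, s) for p in pre_patterns for s in suf_patterns]
```

This enumeration yields the 4 × 6 = 24 candidates mentioned below, most of which are not actual patterns.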

Note: In case a pattern belongs to more than one cluster, it can be derived separately in each cluster. Duplicated patterns can be eliminated by a set union operation.

Here, the major difficulty is case 3. In Example 2, we have four patterns from Pre(8D) and six from Suf(8D). Intuitively, each pattern pair (one from Pre(8D) and one from Suf(8D)) is a possible candidate for case 3. Direct evaluation of every pair can be expensive (24 candidates in this example). If we can decrease the number of candidates for evaluation, we will be able to recursively detect patterns in cases 1 and 2 using similar strategies, and eventually find all the patterns in the 8th subset efficiently.

Based on the a priori rule, if the concatenation of a pattern from Pre(8D) (e.g., <4>) with a pattern from Suf(8D) (e.g., <6>), i.e., <4 8 6> (the root item 8 is added implicitly), is not a pattern, then the concatenation of any pattern in Pre(8D) that contains <4> (e.g., <4 5>) with any pattern in Suf(8D) that contains <6> (e.g., <3 6>) is also not a pattern.

On the other hand, given a pattern s from Pre(8D) (e.g., <4 5>), the valid patterns from Suf(8D) for s must also be valid for any pattern from Pre(8D) that is contained in s (e.g., <4>, <5>). Therefore, to check the candidate patterns from Suf(8D) for s, we need only check the intersection of the valid pattern sets from Suf(8D) for the patterns in Pre(8D) that are contained in s. Here, the valid pattern sets from Suf(8D) for <4> and <5> are both {<3>}, and the intersection of the two sets is {<3>}, which means that we need only verify <4 5> with <3>.
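This intersection strategy can be sketched as follows (our own illustration; `valid` is a hypothetical map from an already-verified prefix pattern to its valid suffix patterns, with patterns stored as tuples):

```python
def suffix_candidates(contained_prefixes, valid):
    """Candidates from Suf(8D) for a prefix pattern s: intersect the
    known valid suffix-pattern sets of the patterns contained in s.
    `valid` maps a prefix pattern (tuple) to its valid suffix patterns."""
    sets = [valid[tuple(p)] for p in contained_prefixes]
    return set.intersection(*sets) if sets else set()

# The paper's example: <4 5> contains <4> and <5>, whose valid suffix
# sets are both {<3>}; only <4 5>.<8>.<3> then needs verification.
valid = {(4,): {(3,)}, (5,): {(3,)}}
```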

The strategies above can effectively decrease the number of candidates for case 3. One challenging issue is how to efficiently find and represent the contain relationship between patterns. To solve this problem, we can use a



directed acyclic graph (DAG) that represents patterns as vertexes and contain relationships as directed edges between vertexes. Such a DAG can be recursively constructed in an efficient way to derive the contain relationship of patterns (see Section 4.3). By representing the contain relationship of patterns from Pre(8D) with a DAG (Up DAG) and the contain relationship of patterns from Suf(8D) with another DAG (Down DAG), we can decrease the number of candidates by using these DAGs based on the strategies discussed above. Fig. 1 shows the Up and Down DAGs for the patterns in Pre(8D) and Suf(8D). In the DAGs, each vertex represents a pattern with occurrence information, i.e., the ids of tuples containing the pattern. A directed edge means that the pattern of the destination vertex contains the pattern of the source vertex.

To mine the patterns in the ith subset, first, we perform a level-1 projection to get iD. At this stage, the only length-1 pattern in the ith subset, <i>, is detected. We then perform level-2 projections on Pre(iD) and Suf(iD), respectively, based on which we can detect length-2 (cases 1 and 2) and length-3 patterns (case 3). We then perform level-3 projections to detect length-3, 4, 5, 6, and 7 patterns, and continue this process to find all the patterns in the ith subset. If the maximal pattern length is k, then at worst, we project k levels; but at best, we only project ⌊log₂ k⌋ + 1 levels, which is much less than in previous approaches.

In the example above, each item set has exactly one item. In practice, an item set may have multiple items. Most previous approaches detect frequent item sets with multiple items simultaneously when detecting sequential patterns. In our approach, we first detect frequent item sets and transform the database based on the frequent item sets. We then detect patterns on the transformed database using UDDAG. Our strategy of detecting frequent item sets first is the same as that of AprioriAll. In Section 6.1, we will discuss the impact of this strategy in detail.

In our previous work [6], we presented an UpDown Tree data structure to detect contiguous sequential patterns (in which no gap is allowed for a sequence to contain the pattern). However, the UpDown Tree is substantially different from the UpDown DAG in this paper. In addition to the different internal data structures, a major difference is that the UpDown Tree is a compressed representation of the projected databases, while UDDAG represents the containing relationship of detected patterns.

4 UPDOWN DIRECTED ACYCLIC GRAPH-BASED SEQUENTIAL PATTERN MINING

This section presents the UDDAG-based pattern mining approach, which first transforms a database based on frequent item sets, then partitions the problem, and finally detects each subset using UDDAG.

4.1 Database Transformation

Definition 1 (Frequent item set). The absolute support for an item set in a sequence database is the number of tuples whose sequences contain the item set. An item set with a support no less than minSup is called a frequent item set (FI).

Based on frequent item sets, we transform each sequence in a database D into an alternative representation. First, we assign a unique id to each FI in D. We then replace each item set in each sequence with the ids of all the FIs contained in that item set.

For example, for the database in Table 1, the FIs are: (1), (2), (3), (4), (5), (6), (1,2), (2,3). By assigning a unique id to each FI, e.g., (1)-1, (1,2)-2, (2)-3, (2,3)-4, (3)-5, (4)-6, (5)-7, (6)-8, we can transform the database as shown in Table 2 (infrequent items are eliminated).
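Using the id assignment of this example, the transformation can be sketched as follows (our own code; item sets are modeled as Python sets, and dropping emptied item sets is our reading of "infrequent items are eliminated"):

```python
# FI -> id mapping from the running example
fi_ids = {frozenset({1}): 1, frozenset({1, 2}): 2, frozenset({2}): 3,
          frozenset({2, 3}): 4, frozenset({3}): 5, frozenset({4}): 6,
          frozenset({5}): 7, frozenset({6}): 8}

def transform_itemset(itemset):
    """Replace an item set with the sorted ids of every FI it
    contains; infrequent items match no FI and contribute nothing."""
    s = frozenset(itemset)
    return sorted(i for fi, i in fi_ids.items() if fi <= s)

def transform_sequence(seq):
    """Transform each item set, dropping item sets left empty."""
    return [ids for ids in (transform_itemset(it) for it in seq) if ids]
```

For instance, the item set (1,2) contains the FIs (1), (1,2), and (2), so it is replaced by the ids 1, 2, 3.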

Definition 2 (Item pattern). An item pattern is a sequential pattern with exactly one item in every item set it contains.

Lemma 1 (Transformed database). Let D be a database and P be the complete set of sequential patterns in D, and let D' be its transformed database. Substituting the ids of each item pattern contained in D' with the corresponding item sets, and denoting the resulting pattern set by P', we have P = P'.

Proof. Let p be a pattern in P, and let ip be the item pattern derived by replacing each item set in p with the corresponding id in D'. Since the id of an item set i exists at the same position in D' as that of i in D, the support of ip in D' is the same as that of p in D. Thus, ip is an item pattern in D'. Substituting each id in ip with the corresponding item set, and denoting the resulting pattern by ip', we have ip' = p. Based on the definition of P', we have ip' ∈ P'. Thus, p ∈ P' and P ⊆ P'. Similarly, P' ⊆ P. All together, P = P'. □

Based on Lemma 1, mining patterns from D is equivalent to mining item patterns from D'. Below, we focus on mining item patterns from D' and represent frequent item sets by their ids. For brevity, we still use frequent item sets instead of ids, pattern instead of item pattern, D instead of D', and P instead of P'.

4.2 Problem Partitioning

Lemma 2 (Problem partitioning). Let {x1, x2, ..., xt} be the frequent item sets in a database D, x1 < x2 < ... < xt. The complete set of patterns (P) in D can be divided into t disjoint subsets. The ith subset (denoted by Pxi, 1 ≤ i ≤ t) is the set of patterns that contain xi and FIs smaller than xi.

916 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 7, JULY 2010

Fig. 1. Example Up/Down DAGs of patterns from Pre(8D)/Suf(8D). (a) Example Up DAG. (b) Example Down DAG.

TABLE 2
Transformed Database


Proof. First, we create t empty sets. Next, we move the patterns that contain xt from P to Pxt, and in the remaining P, we move all the patterns that contain xt−1 to Pxt−1. We continue this until moving all the patterns that contain x1 to Px1. Now P is empty because any pattern can only contain FIs in {x1, x2, ..., xt}. Thus, P = Px1 ∪ ... ∪ Pxt.

Given two integers i and j, 1 ≤ i < j ≤ t, for every pk ∈ Pxi, pk ∉ Pxj because pk cannot contain xj, which is contained in every pattern in Pxj. Similarly, for every pl ∈ Pxj, pl ∉ Pxi because pl contains xj, which is larger than the largest element contained in any pattern in Pxi. Therefore, Pxi ∩ Pxj = ∅, i.e., all the subsets of P are disjoint. □

Based on Lemma 2, the problem of pattern mining can bepartitioned into mining subsets of patterns.

4.3 UDDAG-Based Pattern Mining

Definition 3 (Projected database). The collection of all the tuples whose sequences contain an item set x in a database D is called the x-projected database, denoted by xD.

Lemma 3 (Projected database). Let D be a database and x an item set, and let P and P' be the complete sets of patterns in D and xD, respectively. We have Px = P'x.

Proof. Since any tuple in xD also exists in D, P'x ⊆ Px. For every p ∈ Px, <x> ⊑ p. Thus, any tuple that contains p also contains x, and any tuple that does not contain x does not contain p. Therefore, p can only be detected from the collection of all the tuples that contain x, i.e., xD. Therefore, Px ⊆ P'x. All together, Px = P'x. □

Based on Lemma 3, Px can be mined from xD.

Definition 4 (Prefix/suffix subsequence/tuple). Given a frequent item set x and a tuple <sid, s> in a database, s = <s1 s2 ... sj>, if x ∈ si, 1 ≤ i ≤ j, then sp = <s1 s2 ... si−1> is the prefix subsequence of x in s, and ss = <si+1 si+2 ... sj> is the suffix subsequence of x in s. <sid, sp> is the prefix tuple of x, and <sid, ss> is the suffix tuple of x.

Definition 5 (Prefix/suffix-projected database). The collection of all the prefix/suffix tuples of a frequent item set x in xD is called the prefix/suffix-projected database of x, denoted by Pre(xD)/Suf(xD).

Definition 6 (Sequence concatenation). Given two sequences a = <a1 ... ai> and b = <b1 ... bj>, the sequence concatenation of a and b, denoted by a.b, is defined as <a1 ... ai b1 ... bj>.

If an FI x occurs multiple times in a sequence, then each occurrence has its own prefix/suffix subsequence. For example, in sequence <3 5 6 3 5 6>, 3 has two suffix subsequences (<5 6 3 5 6> and <5 6>). If both subsequences contain a pattern (e.g., <5 6>), they only contribute 1 to the count of the pattern.
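The per-occurrence prefix/suffix subsequences of Definition 4 can be sketched as follows (our own illustration using the <3 5 6 3 5 6> example; item sets are modeled as Python sets, and the function name is ours):

```python
def pre_suf_subsequences(s, x):
    """All (prefix, suffix) subsequence pairs of item set x in
    sequence s, one pair per occurrence of x (Definition 4)."""
    pairs = []
    for i, itemset in enumerate(s):
        if x <= itemset:
            pairs.append((s[:i], s[i + 1:]))
    return pairs

# <3 5 6 3 5 6>: 3 occurs twice, giving two suffix subsequences,
# <5 6 3 5 6> and <5 6>
s = [{3}, {5}, {6}, {3}, {5}, {6}]
```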

Theorem 1 (Pattern mining). Let x be an FI and xD be its projected database, and let P, PPre, and PSuf be the complete sets of sequential patterns in xD, Pre(xD), and Suf(xD), respectively. We have Px ⊆ Q, where Q = {<x>} ∪ Q1 ∪ Q2 ∪ Q3, and

Q1 = {pk.<x> | pk ∈ PPre}; Q2 = {<x>.pk | pk ∈ PSuf};
Q3 = {pk.<x>.pi | pk ∈ PPre, pi ∈ PSuf}.

Proof. For every pj ∈ P, pj = <a1 a2 ... an>, based on the position of x in pj and the length of pj, we have the following cases:

1. n = 1; x is the only item set of pj; pj = <x>.
2. n > 1; x only exists at the beginning and/or end of pj, i.e., a1 = x and/or an = x, and aj ≠ x, j ∈ {2, 3, ..., n−1}.
3. n > 1; ∃ m ∈ {2, 3, ..., n−1}, am = x, i.e., x exists in between the beginning and end of pj.

For case 1, since <x> ∈ Q, we have pj ∈ Q.

For case 2, if x only resides at the beginning of pj, let p'j = <a2 ... an>. For each occurrence of pj in xD, there is a corresponding occurrence of p'j in Suf(xD); thus, p'j ∈ PSuf, and pj ∈ Q2. Similarly, if x only resides at the end of pj, then pj ∈ Q1. If x resides at both the beginning and end of pj, we have pj ∈ Q1 and pj ∈ Q2.

For case 3, let p'j = <a1 a2 ... am−1>. Each occurrence of pj in xD corresponds to a prefix subsequence of x, which is contained in Pre(xD); thus, p'j ∈ PPre. Let p''j = <am+1 am+2 ... an>. Each occurrence of p''j in xD corresponds to a suffix subsequence of x, which is contained in Suf(xD); thus, p''j ∈ PSuf. Since pj = p'j.<x>.p''j, we have pj ∈ Q3.

All together, we have Px ⊆ Q. □

Based on Theorem 1, we can detect Px based on PPre and PSuf, which can be recursively derived. Here, case 1 is obvious. Case 2 is directly based on PPre and PSuf. Case 3 is complicated due to a potentially large number of candidates. Below, we define UDDAG to decrease the number of candidates.
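The candidate set Q of Theorem 1 can be materialized directly. In this sketch (our own, not the paper's code), patterns are encoded as tuples of FI ids, so sequence concatenation is tuple concatenation:

```python
def candidate_set(x, p_pre, p_suf):
    """Q = {<x>} ∪ Q1 ∪ Q2 ∪ Q3 from Theorem 1.
    x is a single FI id; p_pre/p_suf are sets of patterns (tuples of ids)."""
    q = {(x,)}                                             # case 1: <x>
    q |= {pk + (x,) for pk in p_pre}                       # Q1: pk.<x>
    q |= {(x,) + pk for pk in p_suf}                       # Q2: <x>.pk
    q |= {pk + (x,) + pi for pk in p_pre for pi in p_suf}  # Q3: pk.<x>.pi
    return q

# With x = 8, PPre = {<7>}, and PSuf = {<3>, <5>}:
q = candidate_set(8, {(7,)}, {(3,), (5,)})
```

The point of the UDDAG machinery that follows is precisely to avoid enumerating all of Q3, whose size is |PPre| × |PSuf| in the worst case.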

Definition 7 (UpDown directed acyclic graph). Given an FI x and xD, an UpDown Directed Acyclic Graph based on Px, denoted by x-UDDAG, is derived as follows:

1. Each pattern in Px corresponds to a vertex in x-UDDAG. <x> corresponds to the root vertex, denoted by v_x. For a vertex v in x-UDDAG, op(v) denotes the pattern corresponding to v. For any p ∈ Px, ov(p) denotes the vertex corresponding to p.
2. Let PU be the set of length-2 patterns ending with x in Px. For each p ∈ PU, let v_u = ov(p) and add a directed edge from v_x to v_u. v_u is called an up root child of v_x.
3. Let PD be the set of length-2 patterns starting with x in Px. For each p ∈ PD, let v_d = ov(p) and add a directed edge from v_x to v_d. v_d is called a down root child of v_x.
4. Each up/down root child v_u/v_d of v_x also corresponds to an UDDAG (defined recursively using rules 1-4), denoted by xU-UDDAG and xD-UDDAG. For any v_1 ∈ VU and v_2 ∈ VD, where VU/VD is the set of all the vertexes in xU-UDDAG/xD-UDDAG, assume that op(v_1) = <i_1 i_2 ... i_m x> and op(v_2) = <x j_1 j_2 ... j_n>. If there exists p ∈ Px with p = <i_1 i_2 ... i_m x j_1 j_2 ... j_n>, let v_3 = ov(p), add a directed edge from v_1 to v_3, and add another directed edge from v_2 to v_3; here, v_1/v_2 is the up/down parent of v_3, and v_3 is the UpDown child of v_1 and v_2.

Note: If v_3 corresponds to multiple up and down parents, only one pair (randomly selected) is linked.

CHEN: AN UPDOWN DIRECTED ACYCLIC GRAPH APPROACH FOR SEQUENTIAL PATTERN MINING 917


Definition 8 (Occurrence set). The occurrence set of a vertex v in a database D (denoted by OS_D(v)) is the set of sequence ids of the tuples in D that contain op(v).

The data structure of a vertex in UDDAG is as follows:

class UDVertex {
  UDVertex upParent, downParent;
  List upChildren, downChildren, upDownChildren;
  int[] pattern; // pattern sequence
  int[] occurs;  // occurrence set
}
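For readers following along in Python rather than the Java-style listing above, an equivalent vertex record might look like this (field names mirror the listing; this rendering is ours, not the authors' code):

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class UDVertex:
    pattern: Tuple[int, ...]                  # pattern sequence (ids of FIs)
    occurs: frozenset                         # occurrence set (sequence ids)
    up_parent: Optional["UDVertex"] = None
    down_parent: Optional["UDVertex"] = None
    up_children: List["UDVertex"] = field(default_factory=list)
    down_children: List["UDVertex"] = field(default_factory=list)
    up_down_children: List["UDVertex"] = field(default_factory=list)

# A root vertex for pattern <8> occurring in sequences 1, 3, and 4:
root = UDVertex(pattern=(8,), occurs=frozenset({1, 3, 4}))
```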

In an UDDAG, if there is a directed path from vertex v_1 to v_2, v_2 is called reachable from v_1. The UDDAG for all the patterns in Pre(xD)/Suf(xD) is called the Up/Down DAG of x. The set of vertexes of an UDDAG (Up/Down DAG) is denoted by V (VU/VD).

Definition 9 (Valid down vertex set). Given a vertex v in the Up DAG of x, the valid down vertex set of v (VDVS_v) is defined as VDVS_v = { v' | (v' ∈ VD) ∧ (op(v).<x>.op(v') ∈ Px) }.

Definition 10 (Parent valid down vertex set). Given a vertex v in the Up DAG of x, the parent valid down vertex set of v (PVDVS_v) is defined as follows:

1. If v has no parent (i.e., v is the root vertex), PVDVS_v = VD.
2. If v has one parent, PVDVS_v is the VDVS of the parent.
3. Otherwise, PVDVS_v is the intersection of the VDVSs of the parents.

Lemma 4. VDVS_v ⊆ PVDVS_v.

Proof. If v has no parent, PVDVS_v = VD. Based on Definition 9, VDVS_v ⊆ VD. Therefore, VDVS_v ⊆ PVDVS_v.

If v has one or more parents, then for any v' ∈ VDVS_v, op(v).<x>.op(v') ∈ Px. Based on the a priori rule, for any sp' that is a subsequence of op(v), sp'.<x>.op(v') ∈ Px. If v'' is a parent of v, op(v'') is a subsequence of op(v). Therefore, op(v'').<x>.op(v') ∈ Px; thus, v' belongs to the VDVS of every parent v'', and hence v' ∈ PVDVS_v. Therefore, VDVS_v ⊆ PVDVS_v. □

Based on Lemma 4, to detect VDVS_v, we need only examine the vertexes in PVDVS_v.
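Definition 10 amounts to a small piece of set arithmetic. A sketch (our own helper; each parent is represented here as a dict holding its already-computed `vdvs` set of down-vertex ids):

```python
def pvdvs(up_parents, vd_all):
    """Parent valid down vertex set (Definition 10).
    up_parents: the parents of v in the Up DAG; vd_all: all down vertexes VD."""
    if not up_parents:                   # root child: PVDVS = VD
        return set(vd_all)
    result = set(up_parents[0]["vdvs"])  # one parent: that parent's VDVS
    for p in up_parents[1:]:             # several parents: intersect VDVSs
        result &= p["vdvs"]
    return result
```

By Lemma 4, the true VDVS of v is always a subset of this returned set, so only its members need to be checked.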

Lemma 5. Given a vertex v in the Up DAG of x and PVDVS_v, for any v' ∈ PVDVS_v, if v' ∉ VDVS_v, then for any vertex v'' in the Down DAG of x reachable from v', v'' ∉ VDVS_v.

Proof. Since v' ∉ VDVS_v, op(v).<x>.op(v') ∉ Px. Thus, |OS(v) ∩ OS(v')| < minSup. Since v'' is reachable from v', op(v') is a subsequence of op(v''). Thus, OS(v'') ⊆ OS(v'), and |OS(v) ∩ OS(v'')| < minSup. Therefore, op(v).<x>.op(v'') ∉ Px, and v'' ∉ VDVS_v. □

Based on Lemma 5, if v' does not belong to VDVS_v, then all the vertexes reachable from v' do not belong to VDVS_v. Lemmas 4 and 5 help eliminate candidates for case 3. Lemma 6 further evaluates candidate patterns.

Lemma 6. Given a vertex v in the Up DAG of x and a vertex v' in PVDVS_v, let IS be the intersection set of the occurrence sets of v and v'. If |IS| ≥ minSup, and for every tuple whose id is contained in IS, x occurs exactly once in the corresponding sequence, then op(v).<x>.op(v') ∈ Px.

Proof. For a tuple <sid, s>, if sid ∈ IS, then s contains op(v) and op(v'). Since x occurs once in s, op(v) occurs before x in s, and op(v') occurs after x. Thus, s contains op(v).<x>.op(v'). Since |IS| ≥ minSup, at least minSup tuples contain op(v).<x>.op(v'). Thus, op(v).<x>.op(v') ∈ Px. □

Lemma 6 evaluates candidates for Px when x occurs once in each sequence in IS. If x occurs more than once in a sequence, we need to further verify whether the sequence really contains op(v).<x>.op(v'). For example, in sequence <5 3 5 2 5>, <5 3> is the prefix of the second occurrence of 5, and <3 5 2 5> is the suffix of the first occurrence of 5. Because of this, a candidate pattern <5 3 5 3 5 2 5> may be mistakenly considered as being contained in the sequence. Thus, we need further verification.
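The extra verification step is a standard subsequence test. A sketch for single-item patterns (our helper; for the paper's item-set patterns the equality test would become a subset test):

```python
def contains(seq, pat):
    """True if pat is a subsequence of seq (one item per position)."""
    it = iter(seq)
    return all(item in it for item in pat)  # `in` advances the iterator

# The occurrence-set intersection alone would admit <5 3 5 3 5 2 5> as a
# candidate in <5 3 5 2 5> (prefix <5 3> of the second 5 combined with
# suffix <3 5 2 5> of the first 5), but the subsequence test rejects it.
```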

To minimize the effort of pattern detection in this case, we build Pre(xD)/Suf(xD) as follows: 1) if x occurs only once in a sequence, directly add its prefix/suffix tuple to Pre(xD)/Suf(xD); 2) if x occurs more than once in a sequence, add the prefix tuple of the last occurrence of x to Pre(xD) and the suffix tuple of the first occurrence of x to Suf(xD). Denote the derived prefix/suffix-projected databases by Pre''(xD)/Suf''(xD), let Ppre/Psuf be the complete sets of patterns in Pre''(xD)/Suf''(xD), and let R = {<x>} ∪ R1 ∪ R2 ∪ R3, where

R1 = { sp_k.<x> | sp_k ∈ Ppre }; R2 = { <x>.sp_k | sp_k ∈ Psuf }; R3 = { sp_k.<x>.sp_j | sp_k ∈ Ppre, sp_j ∈ Psuf }.

We then have the following theorem:

Theorem 2. Px ⊆ R ⊆ Q (Q is defined in Theorem 1).

Proof. The proof of Px ⊆ R is similar to that of Px ⊆ Q in Theorem 1. The only difference is that in Pre(xD)/Suf(xD), the prefix/suffix tuple of every occurrence of x is contained for multiple occurrences of x in the same sequence, while in Pre''(xD)/Suf''(xD), only the last prefix/first suffix tuple is contained. Based on the definition, if multiple prefix/suffix tuples from the same sequence contain the same pattern, only one is counted for support. By including the last prefix/first suffix tuple in Pre''(xD)/Suf''(xD), we can guarantee not miscounting the support of any pattern, because the sequences of all other prefix/suffix tuples are contained in the last prefix/first suffix tuple. Therefore, Px ⊆ R.

Since every tuple in Pre''(xD)/Suf''(xD) also exists in Pre(xD)/Suf(xD), we have R ⊆ Q. □
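The Pre''(xD)/Suf''(xD) construction (last prefix, first suffix) is easy to sketch for single-item sequences (helper name `project` is ours):

```python
def project(db, x):
    """Build Pre''(xD) and Suf''(xD) per Theorem 2: when x occurs several
    times in a sequence, keep the prefix of the LAST occurrence and the
    suffix of the FIRST occurrence only."""
    pre, suf = [], []
    for sid, seq in db:
        occ = [i for i, item in enumerate(seq) if item == x]
        if not occ:
            continue
        pre.append((sid, seq[:occ[-1]]))     # prefix tuple (last occurrence)
        suf.append((sid, seq[occ[0] + 1:]))  # suffix tuple (first occurrence)
    return pre, suf

# In <5 3 5>, 5 occurs twice: last prefix is <5 3>, first suffix is <3 5>.
pre, suf = project([(4, [5, 3, 5])], 5)
```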

Based on the lemmas and theorems above, we first give an example to illustrate UDDAG-based pattern mining, and then present the algorithm in detail.

Example 3 (UDDAG). For the sample database in Table 1, if minSup = 2, its patterns can be mined as follows:

1. Database transformation. See Table 2 in Section 4.1.

2. Pattern partitioning. P is partitioned into eight subsets: the one containing 1 (P1), the one containing 2 and smaller ids (P2), ..., and the one containing 8 and smaller ids (P8).

918 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 7, JULY 2010

3. Finding subsets of patterns. To detect Px, we first detect patterns in Pre(xD) and Suf(xD) and then combine them to derive Px. This is a recursive process, because for Pre(xD) and Suf(xD), we perform the same action until reaching the base case, where the projected database has no frequent item set.

a. Finding P8. First, we build Pre(8D) and Suf(8D), which are, Pre(8D): 1) <1 (1,2,3,4,5) (1,5) 6>, 3) <(7,8) (1,2,3)>, 4) <7>; Suf(8D): 1) <>, 3) <(1,2,3) (6,8) 5 3>, 4) <5 3 5>.

Let PP be all the patterns in Pre(8D). Since the FIs in Pre(8D) are (1), (2), (3), and (7), we can partition PP into four subsets, PP7, PP3, PP2, and PP1. First, we detect PP7. Since the prefix-projected database of 7 in Pre(8D) is empty, and the suffix-projected database of 7 is: 3) <(1,2,3)>, 4) <>, the only pattern in PP7 is <7>. Similarly, PP3 = {<3>}, PP2 = {<2>}, and PP1 = {<1>}. Thus, PP = {<1>, <2>, <3>, <7>}.

Let PS be all the patterns in Suf(8D). Since the FIs in Suf(8D) are (3) and (5), we can partition PS into two subsets, PS5 and PS3. First, we detect PS5. The prefix-projected database of 5 in Suf(8D) is: 3) <(1,2,3) (6,8)>, 4) <5 3>, which contains a pattern <3>. The suffix-projected database of 5 in Suf(8D) is: 3) <3>, 4) <3 5>, which also contains a pattern <3>. Since both databases have patterns, we need to consider case 3, i.e., whether concatenating <3> with root 5 and <3> is also a pattern. Here, the occurrence set of <3> in the prefix-projected database of 5 is {3, 4}, and the occurrence set of <3> in the suffix-projected database of 5 is also {3, 4}. Thus, their intersection set is {3, 4}, which means that the support of <3 5 3> is at most 2. However, since 5 occurs twice in tuple 4, we need to check whether it really contains <3 5 3>, which is not true by verification. Thus, the support of <3 5 3> is 1, and it is not a pattern. Therefore, PS5 = {<3 5>, <5 3>, <5>}. Similarly, PS3 = {<3>}. All together, PS = {<3 5>, <5 3>, <5>, <3>}.

Next, we detect P8 based on the Up and Down DAGs of 8 (Figs. 2a and 2b) by evaluating each candidate vertex pair. First, we detect the VDVSs for the length-1 patterns in Pre(8D), i.e., up vertexes 1, 2, 3, and 7. For vertex 1, first, we check its combination with down vertex 3; the intersection of the occurrence sets is {3}. Thus, the corresponding support is at most 1, which is not a valid combination. Similarly, up vertex 1 and down vertex 5 are also an invalid combination. Based on Lemma 5, all the children of down vertex 5 are not valid for up vertex 1. Therefore, VDVS1 = ∅. Similarly, VDVS3 = ∅, and VDVS7 = {ov(<3>), ov(<5>), ov(<5 3>)}.

Since no length-2 pattern exists in Pre(8D), the detection stops. Eventually, we have

P8 = {<8>, <1 8>, <2 8>, <3 8>, <7 8>, <8 3>, <8 5>, <8 3 5>, <8 5 3>, <7 8 3>, <7 8 5>, <7 8 5 3>}.

The 8-UDDAG based on the detected patterns in P8 is shown in Fig. 2c.

Note: Based on Lemma 1, here, we actually detect item patterns. Integers in the patterns are ids of FIs.

b. Similarly, we have

P7 = {<7>, <7 1>, <7 3>, <7 1 3>, <7 5>, <7 1 5>, <7 3 5>, <7 5 3>};
P6 = {<6>, <1 6>, <2 6>, <3 6>, <6 3>, <6 5>, <6 5 3>, <3 6 5>, <2 6 5>, <1 6 5>};
P5 = {<5>, <1 5>, <2 5>, <3 5>, <5 5>, <1 3 5>, <1 5 5>, <5 1>, <5 3>, <1 5 1>, <1 5 3>};
P4 = {<4>, <1 4>, <4 1>, <1 4 1>};
P3 = {<3>, <1 3>, <3 1>, <1 3 1>};
P2 = {<2>};
P1 = {<1>, <1 1>}.


Fig. 2. UpDown DAG for P8. (a) Up DAG of patterns in Pre(8D). (b) Down DAG of patterns in Suf(8D). (c) 8-UDDAG.


4. The complete set of patterns is the union of all the subsets of patterns detected above.

Algorithm 1. UDDAG-based pattern mining.

Input: A database D and the minimum support minSup
Output: P, the complete set of patterns in D
Method: findP(D, minSup){
  P = ∅
  FISet = D.getAllFI(minSup); D.transform();
  for each FI x in FISet{
    UDVertex rootVT = new UDVertex(x)
    findP(D.getPreD(x), rootVT, up, minSup)
    findP(D.getSufD(x), rootVT, down, minSup)
    findPUDDAG(rootVT)
    P = P ∪ rootVT.getAllPatterns()
  }
}

The algorithm first calls subroutine getAllFI to detect all the FIs (an adapted version of the FP-growth* algorithm [8] is used to detect FIs in our implementation).

The algorithm then transforms the database. A directed acyclic graph is built to represent the containment relationships of the FIs. For each (sorted) item set, we check all its FIs with children in the DAG and verify whether the FI corresponding to each child is valid in the item set. If so, we add the id of the child to the item set and further check the children of that child.
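The effect of the transformation can be sketched as follows (our own simplification: the paper walks a containment DAG of FIs to skip redundant checks, whereas this sketch naively scans every FI):

```python
def transform(db, fis):
    """Replace each item set in db with the sorted ids of the FIs it contains.
    fis maps FI id -> set of items. A real implementation would follow the
    FI containment DAG instead of scanning all FIs per item set."""
    out = []
    for sid, seq in db:
        new_seq = [sorted(fid for fid, items in fis.items()
                          if items <= itemset) for itemset in seq]
        out.append((sid, new_seq))
    return out

# With FIs 1={a}, 2={b}, 3={a,b}, the item set {a,b} contains FIs 1, 2, and 3.
db_t = transform([(1, [{"a", "b"}, {"a"}])],
                 {1: {"a"}, 2: {"b"}, 3: {"a", "b"}})
```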

Based on the transformed database, for each FI x, the algorithm creates a root vertex for <x>, detects all the patterns in the prefix-projected database and suffix-projected database of x, creates the x-UDDAG, detects Px using the x-UDDAG, and adds Px to P.

Subroutine: findP(PD, rootVT, type, minSup){
  FISet = PD.getAllFI(minSup);
  for each FI x in FISet{
    UDVertex curVT = new UDVertex(x, rootVT)
    if(type == up) rootVT.addUpChild(curVT)
    else rootVT.addDownChild(curVT)
    findP(PD.getPreD(x), curVT, up, minSup)
    findP(PD.getSufD(x), curVT, down, minSup)
    findPUDDAG(curVT)
  }
}

This subroutine detects all the patterns whose ids are no larger than the root of the projected database. The parameters are as follows:

1. PD is the projected database.
2. rootVT is the vertex for the root item of PD.
3. type (up/down) indicates a prefix/suffix PD.
4. minSup is the support threshold.

The subroutine first detects all the FIs in PD. For each FI x, it creates a new vertex as the up/down child (based on type) of the root vertex. It then recursively detects all the patterns in PD, similarly to findP(D, minSup).

Subroutine: findPUDDAG(rootVT){
  upQueue.enQueue(rootVT.upChildren)
  while(!upQueue.isEmpty()){
    UDVertex upVT = upQueue.deQueue()
    if(upVT.upParent == rootVT)
      downQueue.enQueue(rootVT.downChildren)
    else if(upVT.downParent == null)
      downQueue.enQueue(upVT.upParent.VDVS)
    else
      downQueue.enQueue(upVT.upParent.VDVS ∩ upVT.downParent.VDVS)
    while(!downQueue.isEmpty()){
      UDVertex downVT = downQueue.deQueue()
      if(isValid(upVT, downVT)){
        UDVertex curVT = new UDVertex(upVT, downVT)
        upVT.addVDVS(downVT)
        if(upVT.upParent == rootVT)
          downQueue.enQueue(downVT.children)
      }
    }
    if(upVT.VDVS.size > 0) upQueue.enQueue(upVT.children)
  }
}

Subroutine findPUDDAG detects all the case 3 patterns in a projected database using the UpDown DAG. The parameter rootVT is the root vertex of the recursively constructed UpDown DAG.

It first enqueues all the up children of the root vertex into an upQueue. For each vertex upVT in the upQueue, it enqueues the PVDVS of upVT into a downQueue as follows: if upVT is a root child of rootVT, it enqueues all the down children of rootVT into the downQueue; else if upVT has only one parent, it enqueues the VDVS of the parent into the downQueue; else it enqueues the intersection of the VDVSs of the parents into the downQueue.

For each vertex downVT in the downQueue of upVT, if upVT and downVT correspond to a valid pattern, it creates a new vertex whose parents are upVT and downVT, and adds downVT to the VDVS of upVT. It further enqueues all the children of downVT into the downQueue if upVT is an up child of rootVT.

Finally, if the size of the VDVS of upVT is not 0, the subroutine enqueues all the children of upVT into the upQueue for further examination.

Theorem 3 (UDDAG). A sequence is a sequential pattern if and only if UDDAG says so.

Proof sketch. Based on Theorem 2, Px ⊆ R. In Algorithm 1, every candidate in R is checked either directly or indirectly based on Lemmas 4, 5, and 6 (cases 1 and 2 are checked directly in subroutine findP, and case 3 is checked in subroutine findPUDDAG). Therefore, a sequence is a sequential pattern if UDDAG says so. Since all the candidates in R are verified in Algorithm 1, we can guarantee that UDDAG identifies the complete set of patterns in D. □

4.4 Detailed Implementation Strategies

The major costs of our approach are database projection and candidate pattern verification. Below, we discuss the implementation strategies for these two issues.



4.4.1 Pseudoprojection

To reduce the number and size of projected databases, we adopt a pseudoprojection technique similar to that of PrefixSpan. One major difference is that we register the ids of sequences and both the starting and ending positions of the projected subsequences in the original sequences. The reason is that we project a sequence bidirectionally.

Example 4 (Pseudoprojection). Using pseudoprojection, Pre(8D) and Suf(8D) in Example 3 are shown in Table 3, where $ indicates that 8 has an occurrence in the current sequence but its projected prefix/suffix is empty. Note that for multiple occurrences of 8 in a sequence (e.g., sequence 3), we only register the last prefix and the first suffix, based on Theorem 2.
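Pseudoprojection can be sketched as follows (our own helper; the concrete sequence used below is illustrative, chosen only to be consistent with the Pre(8D)/Suf(8D) entries quoted in Example 3, since Table 1 is not reproduced here):

```python
def pseudo_project(db, x):
    """Pseudoprojection: instead of copying subsequences, register the
    sequence id plus the start and end positions of each projected
    subsequence, since UDDAG projects a sequence bidirectionally.
    Multiple occurrences of x follow Theorem 2 (last prefix, first suffix)."""
    pre, suf = [], []
    for sid, seq in db:
        occ = [i for i, item in enumerate(seq) if item == x]
        if not occ:
            continue
        pre.append((sid, 0, occ[-1]))            # prefix = seq[0:occ[-1]]
        suf.append((sid, occ[0] + 1, len(seq)))  # suffix = seq[occ[0]+1:]
    return pre, suf

# Suppose sequence 4 were <7 8 5 3 5>; projecting on item 8 registers the
# prefix range <7> and the suffix range <5 3 5> as (sid, start, end) triples.
pre, suf = pseudo_project([(4, [7, 8, 5, 3, 5])], 8)
```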

4.4.2 Verification of Candidate Patterns

To verify whether an up vertex and a down vertex in an UDDAG correspond to a valid pattern, we derive the support of the candidate from the size of the intersection set of the up and down vertexes' occurrence sets. Two approaches are provided in our implementation.

The first approach is bit vector-based. We represent each occurrence set with a bit vector and perform an AND operation on the two bit vectors. The size of the intersection set is derived by counting the number of 1s in the resulting bit vector. Several approaches exist for efficiently counting 1 bits in a bit vector [5], [15]. In our implementation, we use the arithmetic logic-based approach [15].

For example, 8D in Example 3 has three sequences (1, 3, and 4). The up vertex 1 in Fig. 2 occurs in sequences 1 and 3; thus, the bit vector representation of its occurrence set is 110. The down vertex 5 occurs in sequences 3 and 4, and its bit vector is 011. The AND result of the two bit vectors is 010, which has only one 1 bit. This means that the support of <1 8 5> in 8D is at most 1, so it is not a pattern.

The second approach is co-occurrence counting-based.

Given Pre(xD) and Suf(xD), we derive a co-occurrence count for each ordered pair of FIs (one from a prefix and the other from the corresponding suffix) by enumerating every ordered pair and incrementing the corresponding co-occurrence count by 1. If the co-occurrence count of a pair is less than minSup, the pair is an invalid candidate.

For example, given a 9-projected database with the following sequences: 1) <5 3 9 6 8>, 2) <3 5 9 8>, and 3) <6 9 8>, we have co-occurring pairs (5 6), (5 8), (3 6), (3 8) for sequence 1, (3 8), (5 8) for sequence 2, and (6 8) for sequence 3. If minSup is 2, only (3 8) and (5 8) (both co-occur twice) will be considered as candidates. The other pairs are discarded because they occur only once.
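Both verification strategies can be sketched compactly (our own helpers; the paper uses an arithmetic-logic popcount [15], whereas this sketch relies on Python's built-in bit counting):

```python
from collections import Counter

def support_upper_bound(occ_up, occ_down):
    """Bit-vector approach: AND the occurrence bit vectors, count 1 bits."""
    bv_up = sum(1 << sid for sid in occ_up)
    bv_down = sum(1 << sid for sid in occ_down)
    return bin(bv_up & bv_down).count("1")

def cooccurrence_counts(pre_db, suf_db):
    """Co-occurrence approach: count each ordered (prefix FI, suffix FI)
    pair at most once per sequence id."""
    counts, suf = Counter(), dict(suf_db)
    for sid, prefix in pre_db:
        for a in set(prefix):
            for b in set(suf.get(sid, [])):
                counts[(a, b)] += 1
    return counts

# Up vertex 1 occurs in sequences {1, 3}; down vertex 5 in {3, 4}:
bound = support_upper_bound({1, 3}, {3, 4})  # support of <1 8 5> is at most 1

# The 9-projected database of the co-occurrence example:
pre9 = [(1, [5, 3]), (2, [3, 5]), (3, [6])]
suf9 = [(1, [6, 8]), (2, [8]), (3, [8])]
counts = cooccurrence_counts(pre9, suf9)     # only (3,8) and (5,8) reach 2
```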

5 PERFORMANCE EVALUATION

We conducted an extensive set of experiments to compare our approach with other representative algorithms. All the experiments were performed on a Windows Server 2003 machine with a 3.0 GHz Quad Core Intel Xeon processor and 16 GB of memory. The algorithms we compared are PrefixSpan, Spade, and LapinSpam, which were all implemented in C++ by their authors (minor changes were made to adapt Spade to Windows). Two versions of UDDAG were tested: UDDAG-bv uses bit vectors to verify candidates, and UDDAG-co uses co-occurrence counts to verify candidates whenever possible.

We performed two studies using the same data generator as in [14]: 1) a comparative study, which uses data sets similar to those in [14]; and 2) a scalability study. The data sets were generated by keeping all but one of the parameters shown in Table 4 fixed and exploring different values for the remaining one. We present the experiment results in this section and give more discussion in Section 6. (Note: The default value for T is 2.8 for scalability testing on I, to allow higher values of I to be tested.)

5.1 Experiment Results for Comparative Study

First, we tested data set C10S8T8I8 with 10 k sequences and 1,000 different items. Fig. 3 shows the distribution of pattern lengths. Figs. 4 and 5 show the time and memory usage of the algorithms at different minSup values.

Fig. 4 shows that the UDDAG algorithms are the fastest, while LapinSpam is the slowest. When minSup is large (e.g., 3 percent), UDDAG-bv (0.16 s) and UDDAG-co (0.17 s) are slightly faster than PrefixSpan (0.19 s) and Spade (0.25 s), but more than 10 times faster than LapinSpam (1.9 s). When minSup is 0.5 percent, UDDAG-bv (3 s) and UDDAG-co (2.9 s) are much faster than all the other algorithms.


TABLE 4. Parameters for Generating Data Sets

Fig. 3. Distribution of pattern lengths of data set C10S8T8I8.

TABLE 3. Pre(8D)/Suf(8D) Based on Pseudoprojection


The UDDAG algorithms use less memory than PrefixSpan when minSup is large (≥1 percent). When minSup is less than 1 percent, they use more memory because of the extra memory needed for the UDDAG, which grows as the number of patterns increases. The memory usage of the UDDAG-based approaches is generally less than that of Spade and much less than that of LapinSpam. Since LapinSpam crashed on large data sets, in the following tests, we only show the results of the other four algorithms.

Second, we tested data set C200S10T2.5I1.25 with 200 k sequences and 10,000 different items. Fig. 6 shows the distribution of pattern lengths. Figs. 7 and 8 show the time and memory usage of the algorithms. The processing times have a similar order as in the first test. When minSup is 1 percent, the algorithms have similar running times. As minSup decreases, the processing times of PrefixSpan and Spade grow faster than those of UDDAG-bv and UDDAG-co. When minSup is 0.1 percent, UDDAG-bv (8.5 s) and UDDAG-co (8.7 s) are almost four times faster than PrefixSpan (32 s) and three times faster than Spade (23 s).

UDDAG-bv and UDDAG-co use less memory than PrefixSpan, except when minSup is 1 percent. The memory usage of Spade is the highest.

Next, we tested a denser data set, C200S10T5I2.5, with 200,000 sequences and 10,000 different items. Fig. 9 shows the distribution of pattern lengths. Figs. 10 and 11 show the time and memory usage of the algorithms. The processing times show a similar order as in the previous experiments. When minSup is 1 percent, the algorithms have similar running times. As minSup decreases, the times of PrefixSpan and Spade grow faster than those of UDDAG-bv and UDDAG-co. When minSup is 0.25 percent, UDDAG-bv (49 s) and UDDAG-co (50 s) are four times faster than PrefixSpan (195 s) and more than two times faster than Spade (118 s).


Fig. 5. Memory usage on data set C10S8T8I8.

Fig. 6. Distribution of pattern lengths of data set C200S10T2.5I1.25.

Fig. 7. Time usage on data set C200S10T2.5I1.25.

Fig. 8. Memory usage on data set C200S10T2.5I1.25.

Fig. 9. Distribution of pattern lengths of data set C200S10T5I2.5.

Fig. 4. Time usage on data set C10S8T8I8.


When minSup is large (>0.375 percent), UDDAG-bv and UDDAG-co have similar memory usage as PrefixSpan. However, when minSup is less than 0.375 percent, they use more memory due to the extremely large number of patterns in this data set at low minSup. The memory usage of Spade is the highest when minSup is larger than 0.375 percent.

5.2 Experiment Results for Scalability Study

This section studies the impact of different parameters of the data sets on the performance of each algorithm. The default absolute support threshold is 100.

First, we examine the performance of the algorithms with different numbers of sequences (C) under two different minSup settings. Figs. 12 and 13 show the performance of the algorithms when minSup is 100, and Figs. 14 and 15 show the performance when minSup is 400. When minSup is 100, the UDDAG algorithms are about 10 times faster than PrefixSpan and 3-4 times faster than Spade. When minSup is 400, Spade is the slowest. The UDDAG algorithms have similar performance as PrefixSpan for small data sets (100 K and 200 K). However, when the data sets get larger, UDDAG outperforms PrefixSpan by growing margins.

The UDDAG algorithms have similar memory usage as PrefixSpan. Spade consumes more memory than the other algorithms in most cases.

Figs. 16 and 17 show the performance of the algorithms on data sets with different numbers of items (N). The time usage of PrefixSpan and Spade grows as N increases. On the contrary, the time usage of the UDDAG approaches generally decreases as N increases. They outperform PrefixSpan by about an order of magnitude, on average, and are 3-4 times faster than Spade.

UDDAG-bv and UDDAG-co use similar memory as PrefixSpan and less memory than Spade.


Fig. 11. Memory usage on data set C200S10T5I2.5.

Fig. 12. Time usage on different sequence numbers (minSup = 100).

Fig. 13. Memory usage on different sequence numbers (minSup = 100).

Fig. 14. Time usage on different sequence numbers (minSup = 400).

Fig. 15. Memory usage on different sequence numbers (minSup = 400).

Fig. 10. Time usage on data set C200S10T5I2.5.


Figs. 18 and 19 show the performance of the algorithms on data sets with different average numbers of transactions in a sequence (S). UDDAG-bv and UDDAG-co are faster than PrefixSpan by about one order of magnitude, and they outperform Spade by about 3-4 times. The time usage of PrefixSpan increases faster than that of UDDAG as S increases.

UDDAG-bv and UDDAG-co use similar memory as PrefixSpan and less memory than Spade.

Figs. 20 and 21 show the performance of the algorithms on data sets with different average numbers of items in a transaction (T). The UDDAG algorithms outperform PrefixSpan by about one order of magnitude, on average, and outperform Spade by about 2-4 times.

The UDDAG algorithms use similar memory as PrefixSpan and less memory than Spade when T is 2. However, they use more memory as T grows.

Figs. 22 and 23 show the performance of the algorithms on data sets with different average numbers of transactions (L) in a sequential pattern. When L is 2, the UDDAG algorithms are slightly faster than PrefixSpan and two times faster than Spade. However, when L is 8, they are about an order of magnitude faster than PrefixSpan and 3.5 times faster than Spade.

UDDAG-bv and UDDAG-co use similar memory as PrefixSpan, and they use less memory than Spade.


Fig. 17. Memory usage on different number of items.

Fig. 18. Time usage on different average number of transactions in a sequence.

Fig. 19. Memory usage on different average number of transactions in a sequence.

Fig. 20. Time usage on different average number of items in a transaction.

Fig. 21. Memory usage on different average number of items in a transaction.

Fig. 22. Time usage on different average number of transactions in a pattern.

Fig. 16. Time usage on different number of items.


Figs. 24 and 25 show the performance of the algorithms on data sets with different average numbers of items (I) in a transaction of patterns. The UDDAG algorithms outperform PrefixSpan by about one order of magnitude and outperform Spade by about three times.

When I is small (e.g., <1.4), the UDDAG algorithms use similar memory as PrefixSpan and less memory than Spade. However, when I is larger, they use more memory due to the extremely large number of patterns.

6 DISCUSSION

6.1 Multi-Item Frequent Item Set Detection

UDDAG and PrefixSpan take different approaches to detecting FIs with multiple items. PrefixSpan detects multi-item FIs in each projected database while detecting sequential patterns. UDDAG detects all the FIs before pattern detection. Below, we examine the impact of this strategy on its performance.

Table 5 shows the relative time (RT) of FI detection (as well as database transformation) with respect to the total time usage of UDDAG-bv for the tests in Section 5. Tables 5a, 5b, and 5c show that RT generally decreases as minSup decreases (except for the first minSup value in each test). Similarly, Tables 5d, 5e, 5g, 5h, 5i, and 5j show that RT generally decreases as the corresponding parameter increases. The only exception is Table 5f, where RT remains almost the same for different numbers of items.

Table 5 shows that FI detection consumes around 10 percent of the total time, on average, which is insignificant to the overall performance of UDDAG.


Fig. 24. Time usage on different average number of items in a transaction in sequential patterns.

Fig. 25. Memory usage on different average number of items in a transaction in sequential patterns.

TABLE 5. Relative Time Consumption of FI Detection

(a) Comparative study data set C10S8T8I8. (b) Comparative study data set C200S10T2.5I1.25. (c) Comparative study data set C200S10T5I2.5. (d) Scalability study on different number of sequences (C), minSup = 400. (e) Scalability study on different number of sequences (C), minSup = 100. (f) Scalability study on different number of items (N). (g) Scalability study on different number of transactions in a sequence (S). (h) Scalability study on different average number of items in a transaction (T). (i) Scalability study on different average number of transactions in a pattern (L). (j) Scalability study on different average number of items in a transaction of the patterns (I).

Fig. 23. Memory usage on different average number of transactions in a pattern.


As discussed in Section 3, AprioriAll also adopts a similar solution path, i.e., detecting FIs separately before pattern detection. However, in practice, AprioriAll is very slow. There are two major reasons:

1. The approach AprioriAll takes for FI detection is very inefficient. Based on [1], AprioriAll uses the Apriori algorithm [2] for FI detection. This approach is extremely slow compared to state-of-the-art solutions. Based on our tests [7] and the FIMI tests [8], Apriori-based algorithms are considerably slower than FP-growth* in many cases (often one or two orders of magnitude slower). Since these tests include the time for writing the detected patterns to a file (which may be significant when a large number of FIs are involved), the actual gain of FP-growth over Apriori may be even higher if file output is not needed (which is the case in this paper). In addition, the Apriori algorithms tested in [7] and [8] are state-of-the-art versions, which are themselves considerably faster than the original Apriori algorithm implemented in [2]. Besides, in our implementation, we have made some adaptations to the existing state-of-the-art FP-growth approach to make it even faster. Altogether, the Apriori approach for FI detection adopted in [2] may be significantly slower than the FP-growth approach adopted in our algorithms, which contributes to the inefficiency of AprioriAll.

2. The original AprioriAll algorithm's candidate generation (by joining and pruning) and support counting (by checking all the sequences for supported patterns) strategies are extremely slow, especially for large databases with many patterns.

6.2 Time Complexity

Multi-item FI detection, database projection, and pattern detection account for the majority of the time usage of our approach.

6.2.1 Multi-Item FI Detection

As discussed in Section 6.1, the time UDDAG uses for FI detection is around 10 percent of the total time. This is insignificant, and thus does not have a big impact on the overall performance of UDDAG.

6.2.2 Database Projection

The goal of database projection is to find the occurrence information of FIs in a projected database and further derive a projected database for each FI. The time for database projection is proportional to the total time of checking items in projected databases. Given a sequence, if the longest pattern in the sequence has length M, the maximal number of levels of projection we may perform on the sequence is M. This means that each item in the sequence is checked at most M times. Given a database, the total number of items is C·S·T. Let L be the average length of detected patterns; then, on average, an item is checked at most L times, and the total number of item instances we check is at most L·C·S·T. Practically, it is close to O((log L)·C·S·T), because the minimal number of levels of projection needed to detect a pattern of length M is about ⌊log2 M⌋ + 1.
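The ⌊log2 M⌋ + 1 figure follows from the case 3 recursion p'.<x>.p'': at each level, a detected pattern of length m can at best grow to 2m + 1. A quick check (our own sketch):

```python
import math

def min_levels(k):
    """Minimal recursion depth to detect a length-k pattern when each
    level combines a prefix pattern and a suffix pattern around the root
    FI, growing a length-m result to at most 2m + 1."""
    levels, length = 1, 1          # level 1 detects <x> itself
    while length < k:
        length = 2 * length + 1    # p'.<x>.p'' at best
        levels += 1
    return levels

# After d levels the maximal length is 2^d - 1, so the minimal depth is
# ceil(log2(k + 1)), which equals floor(log2(k)) + 1 for every k >= 1.
```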

Using PrefixSpan, the total number of levels of projection is always M when detecting a length-M pattern. Thus, its projection time is O(L·CST). The projection complexity of UDDAG is similar to that of PrefixSpan in the worst case. However, on average, it is much lower.

The above analysis is verified by our experimental results. Figs. 12, 14, 18, and 20 show that the processing time of both UDDAG and PrefixSpan scales up quasilinearly when C, S, and T increase. However, when L increases, Fig. 22 clearly shows that PrefixSpan scales up almost linearly, while UDDAG-based approaches scale up much more slowly (close to O(log L)).
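The contrast between the two recursion depths can be sketched in a few lines. This is an illustrative sketch of the counting argument only, not code from the paper; the function names are ours:

```python
import math

def levels_unidirectional(m):
    # PrefixSpan grows a detected pattern by one item per recursion
    # level, so a length-m pattern needs m levels of projection.
    return m

def levels_bidirectional(m):
    # UDDAG can grow a detected pattern at both ends, at best roughly
    # doubling its length per level, so a length-m pattern can be
    # detected in floor(log2(m)) + 1 levels in the best case.
    return math.floor(math.log2(m)) + 1

for m in (1, 2, 8, 100):
    print(m, levels_unidirectional(m), levels_bidirectional(m))
```

For a length-100 pattern, the best case drops from 100 levels to 7, which is why the average projection cost tends toward O((log L)·CST) rather than O(L·CST).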

6.2.3 Pattern Detection

The major cost of pattern detection in UDDAG is the evaluation of candidate patterns of case 3. Different approaches such as UDDAG-bv and UDDAG-co may have different efficiency, as shown in Figs. 4, 7, 10, etc. Lemmas 4 and 5 state that the validity of children vertex candidates can be inferred from that of parent vertex candidates. As the average length of patterns becomes longer, the number of children vertex candidates in UDDAG also becomes larger, which helps to eliminate unnecessary candidate checking. The longer L is, the more effective the evaluation of case 3 candidates will be. This is verified in Fig. 22.
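The pruning idea behind Lemmas 4 and 5 can be sketched as follows. This is a hedged illustration under our own assumptions: we assume a child vertex's candidate is only worth a support check if the parent-vertex candidates it is derived from were found valid; the data layout and names are ours, not the paper's exact UDDAG structure:

```python
def prune_candidates(child_candidates, valid_parent_patterns):
    """child_candidates: list of (candidate, parent_patterns) pairs.
    valid_parent_patterns: set of patterns confirmed valid at parents."""
    kept = []
    for cand, parents in child_candidates:
        # A candidate whose parent candidates were invalid cannot be
        # valid, so it is skipped without any support counting.
        if all(p in valid_parent_patterns for p in parents):
            kept.append(cand)  # still needs a real support check
    return kept

valid = {("a",), ("b",)}
cands = [(("a", "b"), [("a",), ("b",)]),
         (("a", "c"), [("a",), ("c",)])]
print(prune_candidates(cands, valid))  # → [('a', 'b')]
```

Only ("a", "b") survives, since ("c",) was not valid at its parent; the invalid candidate never reaches the (expensive) support-counting step.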

Since PrefixSpan does not generate candidate patterns, the cost of pattern detection in PrefixSpan is limited to FI counting in each projected database, which is an advantage over UDDAG. However, in practice, UDDAG performs better due to the following reasons:

1. The special data structure UDDAG eliminates unnecessary candidates (based on Lemmas 4 and 5).

2. Projected databases shrink much faster compared to PrefixSpan. First, UDDAG has fewer levels of projections than PrefixSpan, on average. Second, with UDDAG, at each level of recursion, we project a database into prefix- and suffix-projected databases. Each sequence in the prefix- and suffix-projected databases has, on average, half the length of the original sequence. Thus, at level-k projection, the average sequence length is T/2^(k−1) in UDDAG, while in PrefixSpan, the average sequence length is T − (T/L)k. Therefore, the average number of instances of FIs in a projected database at level k in UDDAG is much smaller than that in PrefixSpan, which leads to more efficient database projection and FI counting.
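The two shrink rates above can be compared with a short sketch. This only tabulates the average-length formulas from the text; the function names are illustrative:

```python
def avg_len_uddag(T, k):
    # UDDAG halves a sequence into prefix and suffix parts at each
    # level, so the average length at level k is T / 2**(k-1).
    return T / 2 ** (k - 1)

def avg_len_prefixspan(T, L, k):
    # PrefixSpan removes roughly T/L items per level when the average
    # pattern length is L, giving T - (T/L)*k at level k.
    return max(T - (T / L) * k, 0)

T, L = 64, 8
for k in range(1, 6):
    print(k, avg_len_uddag(T, k), avg_len_prefixspan(T, L, k))
```

With T = 64 and L = 8, UDDAG's average length at level 3 is 16 versus PrefixSpan's 40, which illustrates why UDDAG's projected databases shrink geometrically rather than linearly.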

One additional fact is that when minSup is large enough such that the average pattern length is close to 1, the problem of sequential pattern mining degrades into a frequent item counting problem, and PrefixSpan and UDDAG have similar performance. For example, in Fig. 3, the average pattern lengths are 1.44 and 1.60 for minSup values of 3 and 2.5 percent, respectively. The time usages of PrefixSpan and UDDAG are very close, as shown in Fig. 4. Similar observations can be found in Figs. 7, 10, 14, and 22 for large minSup values/small pattern lengths.

6.3 Space Complexity

In UDDAG, the problem of finding all the patterns in a database is partitioned into finding the subsets of patterns defined in Lemma 1. Thus, the maximal memory usage for finding all the patterns is max(M1, M2, ..., Mt), where Mi is the maximal memory usage for detecting subset i. Mi is mainly used to store the projected database and the UpDown DAG, whose size is decided by the total number of vertexes. Besides, we also need to store the transformed database during the whole pattern mining process.

926 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 7, JULY 2010

The size of the transformed database is decided by the size of the original database as well as the characteristics of the FIs. If the average length of the FIs is small, then the size of the transformed database is close to that of the original database. The size of the transformed database increases as the average length, total number, and support of multi-item FIs increase. This is verified in Fig. 25 (where the average length of FIs increases) and Fig. 11 (where the number and support of multi-item FIs increase dramatically as minSup decreases).

For projected databases, given a level-1 projected database with C sequences, if the length of the longest pattern is M, then the maximal level of projections is M. Using Pseudoprojection, at each level of projection, we store the beginning and ending positions as well as sequence ids; therefore, the maximal number of integers we need to recursively store is 3CM. The actual memory usage may be much smaller because 1) the size of the projected databases gets smaller as the recursion level increases and 2) the total number of levels of real projection may be much smaller than M.
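The 3CM bound above follows from the per-sequence bookkeeping of pseudoprojection. The sketch below is a minimal illustration under our own naming; only the three-integers-per-projected-sequence accounting is taken from the text:

```python
def pseudo_project(seq_id, begin, end):
    # A pseudoprojected sequence is not copied; it is represented by
    # three integers pointing into the original database.
    return (seq_id, begin, end)

def max_integers_stored(C, M):
    # C sequences, at most M recursive levels of projection, and
    # 3 integers per sequence per level: at most 3*C*M integers.
    return 3 * C * M

print(pseudo_project(0, 2, 9))       # → (0, 2, 9)
print(max_integers_stored(1000, 5))  # → 15000
```

In practice the stored count is well below this bound, since projected databases shrink with each level and few patterns require all M levels.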

The cost of storing the UDDAG is proportional to the maximal number of patterns in a subset. Generally, this cost is relatively small compared to storing the databases. However, if the number of patterns is extremely large, this cost may also increase significantly, as shown in Fig. 11. In addition, this feature of UDDAG may also cause a jitter effect on memory usage in scalability tests. Fig. 23, the memory usage for different average numbers of transactions in a pattern (L), shows an example of such an effect. When L = 5, the memory usage (23.6 MB) of UDDAG-co is higher than that for the data sets of L = 4 (16.7 MB) and L = 6 (20.9 MB). The reason is that each testing data set is generated independently. The largest subsets of patterns in some data sets may be smaller/larger than those in their neighboring data sets, which results in smaller/larger memory consumption. A similar effect can also be found in Fig. 17.

PrefixSpan does not need additional space for a UDDAG, and it needs less space for storing the whole database, as it stores the original database instead of the transformed database; however, it may need more memory for projected databases because of more levels of projection. Overall, the memory usage of UDDAG is comparable to that of PrefixSpan, as shown in Figs. 5, 8, 11, 13, 15, 17, etc. UDDAG may use more memory than PrefixSpan in extreme cases when a significant number of patterns exist in a subset or when the average length of FIs is large and the number/support of multi-item FIs is extremely large, as shown in Figs. 11 and 25.

7 CONCLUSIONS AND FUTURE WORK

In this paper, a novel data structure, UDDAG, is invented for efficient pattern mining. The new approach grows patterns from both ends (prefixes and suffixes) of detected patterns, which results in faster pattern growth because of fewer levels of database projection compared to traditional approaches. Extensive experiments on both comparative and scalability studies have been performed to evaluate the proposed algorithm.

In terms of time efficiency, when minSup is very large such that the average length of patterns is close to 1, UDDAG and PrefixSpan have similar performance because, in this case, the problem becomes a simple frequent item counting problem (practically not interesting for sequential pattern mining). However, UDDAG's running time grows much more slowly than PrefixSpan's as the data scales up. It often outperforms PrefixSpan by one order of magnitude in our scalability tests. Experiments also show that UDDAG is considerably faster than two other representative algorithms, Spade and LapinSpam. In addition, UDDAG also demonstrated satisfactory scale-up properties with respect to various parameters such as the total number of sequences, the total number of items, the average lengths of sequences, etc.

The memory usage of UDDAG-based approaches is generally comparable to that of PrefixSpan. UDDAG-based approaches may use more memory in extreme cases when a significant number of patterns exist in a subset or when the average length of FIs is large and the number/support of multi-item FIs is extremely large. UDDAG generally uses less memory than Spade and LapinSpam.

One major feature of UDDAG is that it supports efficient pruning of invalid candidates. This represents a promising approach for applications involving searching in large spaces. Thus, it has great potential in related areas of data mining and artificial intelligence. In the future, we expect to further improve UDDAG-based pattern mining algorithms as follows: 1) Currently, FI detection is independent of pattern mining. Practically, the knowledge gained from FI detection may be useful for pattern mining. In the future, we will integrate the two solutions so that they can benefit from each other. 2) Different candidate verification strategies may have different impacts on the efficiency of the algorithm. In the future, we will study more efficient verification strategies. 3) The UDDAG has a big impact on memory usage when the number of patterns in a subset is extremely large. In the future, we will find a more efficient way to store the UDDAG.

We will also extend our approach to other types of sequential pattern mining problems, e.g., mining with constraints, closed and maximal pattern mining, approximate pattern mining, domain-specific pattern mining, etc.

We also expect to extend the UDDAG-based approach to other areas where large search spaces are involved and pruning of the search space is necessary.

ACKNOWLEDGMENTS

The author is grateful for the insightful comments of the anonymous reviewers. The author would also like to thank Ping Zhong, Terry Cook, and Anne Moroney for their help on the draft. This work was supported in part by the PSC-CUNY Research Grant (PSCREG-38-892) and a Queens College Research Enhancement Grant.

REFERENCES

[1] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. Int’l Conf. Data Eng. (ICDE ’95), pp. 3-14, 1995.


[2] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 20th Int’l Conf. Very Large Data Bases (VLDB), pp. 487-499, 1994.

[3] C. Antunes and A.L. Oliveira, “Generalization of Pattern-Growth Methods for Sequential Pattern Mining with Gap Constraints,” Proc. Int’l Conf. Machine Learning and Data Mining 2003, pp. 239-251, 2003.

[4] J. Ayres, J. Gehrke, T. Yu, and J. Flannick, “Sequential Pattern Mining Using a Bitmap Representation,” Proc. Int’l Conf. Knowledge Discovery and Data Mining 2002, pp. 429-435, 2002.

[5] S. Berkovich, G. Lapir, and M. Mack, “A Bit-Counting Algorithm Using the Frequency Division Principle,” Software: Practice and Experience, vol. 30, no. 14, pp. 1531-1540, 2000.

[6] J. Chen and T. Cook, “Mining Contiguous Sequential Patterns from Web Logs,” Proc. World Wide Web Conf. (WWW ’07) Poster Session, May 2007.

[7] J. Chen and K. Xiao, “BISC: A Binary Itemset Support Counting Approach Towards Efficient Frequent Itemset Mining,” to be published in ACM Trans. Knowledge Discovery in Data.

[8] G. Grahne and J. Zhu, “Efficiently Using Prefix-Trees in Mining Frequent Itemsets,” Proc. Workshop Frequent Itemset Mining Implementations (FIMI ’03), 2003.

[9] M. Garofalakis, R. Rastogi, and K. Shim, “SPIRIT: Sequential Pattern Mining with Regular Expression Constraints,” Proc. Int’l Conf. Very Large Data Bases (VLDB ’99), pp. 223-234, 1999.

[10] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.C. Hsu, “FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining,” Proc. ACM SIGKDD, pp. 355-359, 2000.

[11] M.Y. Lin and S.Y. Lee, “Fast Discovery of Sequential Patterns through Memory Indexing and Database Partitioning,” J. Information Science and Eng., vol. 21, pp. 109-128, 2005.

[12] F. Masseglia, F. Cathala, and P. Poncelet, “The PSP Approach for Mining Sequential Patterns,” Proc. European Symp. Principle of Data Mining and Knowledge Discovery, pp. 176-184, 1998.

[13] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.C. Hsu, “PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth,” Proc. 2001 Int’l Conf. Data Eng. (ICDE ’01), pp. 215-224, 2001.

[14] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.C. Hsu, “Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach,” IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, pp. 1424-1440, Nov. 2004.

[15] E.M. Reingold, J. Nievergelt, and N. Deo, Combinatorial Algorithms: Theory and Practice. Prentice-Hall, Inc., 1977.

[16] R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and Performance Improvements,” Proc. Int’l Conf. Extending Database Technology 1996, pp. 3-17, 1996.

[17] K. Wang, Y. Xu, and J.X. Yu, “Scalable Sequential Pattern Mining for Biological Sequences,” Proc. 2004 ACM Int’l Conf. Information and Knowledge Management, pp. 178-187, 2004.

[18] J. Wang, Y. Asanuma, E. Kodama, T. Takata, and J. Li, “Mining Sequential Patterns More Efficiently by Reducing the Cost of Scanning Sequence Databases,” IPSJ Trans. Database, vol. 47, no. 12, pp. 3365-3379, 2006.

[19] M. Zaki, “Spade: An Efficient Algorithm for Mining Frequent Sequences,” Machine Learning, vol. 40, pp. 31-60, 2001.

[20] Z. Zhang and M. Kitsuregawa, “LAPIN-SPAM: An Improved Algorithm for Mining Sequential Pattern,” Proc. Int’l Special Workshop Databases for Next Generation Researchers, pp. 8-11, Apr. 2005.

[21] Z. Zhang, Y. Wang, and M. Kitsuregawa, “Effective Sequential Pattern Mining Algorithms for Dense Database,” Proc. Japanese Nat’l Data Eng. Workshop (DEWS ’06), 2006.

Jinlin Chen received the bachelor of engineering and bachelor of economics degrees in 1994 and the PhD degree in automatic control in 1999 from Tsinghua University, China. He is a faculty member in the Computer Science Department, Queens College, the City University of New York. Previously, he was a visiting professor at the University of Pittsburgh and a researcher at Microsoft Research Asia. His research interests include Web information modeling and processing, information retrieval, and data mining. He is a member of the IEEE and the ACM.

