CloSpan: Mining Closed Sequential Patterns in Large Datasets
Xifeng Yan, Jiawei Han and Ramin Afshar
Proceedings of the 2003 SIAM International Conference on Data Mining (SDM'03), pp. 166-177, San Francisco, CA, May 2003.
Advisor: Professor Hsin-Hsi Chen
Reporter: Clarence Min-Chi Hsieh
Natural Language Processing Laboratory, Dept. of Computer Science and Info. Engineering, NTU
2006/01/10

Copyright © Natural Language Processing Lab., NTU, 2006

Outline
Introduction
Search Space Pruning
CloSpan
Experimental Results
Conclusions

Introduction
Apriori-like algorithms generate a huge set of candidate sequences.
  Ex. With 1000 frequent sequences of length 1, there are 1000×1000 + (1000×999)/2 = 1,499,500 length-2 candidate sequences.
They need many database scans during mining.
  Ex. For the sequential pattern {(abc)(abc)(abc)(abc)(abc)}, an Apriori-based method must scan the database at least 15 times.
They have difficulty mining long sequential patterns.
  Ex. With a single sequence of length 100 and min_sup = 1, the candidate sequences number 100 at length 1, 14950 at length 2, …, for a total of 2^100 - 1 ≈ 10^30.
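
The candidate counts quoted above can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `length2_candidates` is made up for illustration and does not come from the paper.

```python
from math import comb


def length2_candidates(n: int) -> int:
    """Length-2 candidates built from n frequent length-1 sequences.

    n * n ordered pairs <xy> (two elements, including <xx>) plus
    C(n, 2) unordered pairs <(xy)> (one element with two items).
    """
    return n * n + comb(n, 2)


print(length2_candidates(1000))  # the 1000-item example
print(length2_candidates(100))   # the length-100 example
print(2**100 - 1 > 10**30)       # the total is slightly above 10^30
```

The two calls reproduce the figures 1,499,500 and 14950 from the slide.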

Introduction (Cont.)
Definition
– Sequence, Elements, Subsequence and Sequential Pattern
A sequence: <(ef)(ab)(df)cb>
Elements: items within an element are listed alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup_count = 2, <(ab)c> is a sequential pattern.
A sequence database:
SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
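
The subsequence and support definitions can be sketched in Python as follows. The helper names (`parse`, `is_subsequence`, `support`) and the choice of representing each element as a `frozenset` of single-character items are assumptions made for this illustration.

```python
def parse(text):
    """Parse slide notation such as 'a(abc)(ac)d(cf)' into a list of item sets."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))
            i += 1
    return seq


def is_subsequence(pattern, sequence):
    """True if the elements of pattern map, in order, into supersets in sequence."""
    pos = 0
    for element in pattern:
        # Greedy leftmost match: skip elements that do not contain this one.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(pattern, database):
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in database)


db = [parse(s) for s in ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]
print(is_subsequence(parse("a(bc)dc"), db[0]))
print(support(parse("(ab)c"), db))
```

On the database above, `<(ab)c>` is contained in sequences 10 and 30, which gives the support of 2 that makes it a sequential pattern.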

Introduction (Cont.)
Definition
– Frequent Sequential Pattern (FS)
  Includes all the sequences whose support is no less than min_sup.
– Closed Frequent Sequential Pattern (CS)
  Includes only the frequent sequences that have no super-sequence with the same support.
CS ⊆ FS

Introduction (Cont.)
Example – FS & CS
ID   Sequence
0    (af)dea
1    eab
2    e(abf)(bde)
min_sup_count = 2
FS: a:3, b:2, d:2, e:3, f:2, ab:2, ad:2, ae:2, (af):2, ea:3, eb:2, fd:2, fe:2, (af)d:2, (af)e:2, eab:2
CS: ea:3, (af)d:2, (af)e:2, eab:2
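
As a sanity check, CS can be derived from the FS listing by discarding every pattern that has a super-sequence with the same support. The sketch below is a brute-force filter over the listed patterns, not the CloSpan procedure; `parse`, `is_subsequence` and `closed_patterns` are illustrative names.

```python
def parse(text):
    """Parse notation such as '(af)d' into a list of item sets."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(frozenset(text[i + 1:j]))
            i = j + 1
        else:
            seq.append(frozenset(text[i]))
            i += 1
    return seq


def is_subsequence(pattern, sequence):
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


# FS exactly as listed on the slide (pattern -> support).
fs = {"a": 3, "b": 2, "d": 2, "e": 3, "f": 2, "ab": 2, "ad": 2, "ae": 2,
      "(af)": 2, "ea": 3, "eb": 2, "fd": 2, "fe": 2, "(af)d": 2, "(af)e": 2, "eab": 2}


def closed_patterns(frequent):
    """Keep the patterns with no other listed super-sequence of equal support."""
    parsed = {name: parse(name) for name in frequent}
    closed = {}
    for p, sup in frequent.items():
        absorbed = any(
            q != p and frequent[q] == sup and is_subsequence(parsed[p], parsed[q])
            for q in frequent
        )
        if not absorbed:
            closed[p] = sup
    return closed


print(closed_patterns(fs))
```

The filter keeps exactly the four patterns listed under CS.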

Introduction (Cont.)
Definition
– Prefix and Postfix (Projection)
<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>.
Given sequence <a(abc)(ac)d(cf)>:
Prefix   Postfix / Projection
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>
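
The projections in the table can be reproduced with a small function. This is a minimal sketch that assumes leftmost matching of the prefix and writes each element as a string of alphabetically sorted items; `postfix` and `show` are hypothetical helper names.

```python
def postfix(sequence, prefix):
    """Postfix of `sequence` with respect to a non-empty `prefix`.

    Both are lists of elements, each a string of sorted items, for example
    ["a", "abc", "ac", "d", "cf"]. Returns None if the prefix does not occur.
    """
    pos = 0
    for element in prefix:
        while pos < len(sequence) and not set(element) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return None
        pos += 1
    # Items left over in the last matched element are kept with a "_" marker.
    last_item = prefix[-1][-1]
    rest = "".join(x for x in sequence[pos - 1] if x > last_item)
    head = ["_" + rest] if rest else []
    return head + list(sequence[pos:])


def show(elements):
    """Render a list of elements in the slide notation."""
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in elements) + ">"


seq = ["a", "abc", "ac", "d", "cf"]
for prefix in (["a"], ["a", "a"], ["a", "b"]):
    print(show(prefix), show(postfix(seq, prefix)))
```

Each printed pair matches one row of the prefix/postfix table.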

Introduction (Cont.)
Definition
– Sequence $s = \langle t_1, t_2, \ldots, t_m \rangle$
– An item $\alpha$
– I-Step extension
  $s \diamond_i \alpha = \langle t_1, t_2, \ldots, t_m \cup \{\alpha\} \rangle$
  Ex: <(ae)> is an I-Step extension of <(a)>
– S-Step extension
  $s \diamond_s \alpha = \langle t_1, t_2, \ldots, t_m, \{\alpha\} \rangle$
  Ex: <(a)(e)> is an S-Step extension of <(a)>
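
The two extension types can be written down directly. In this sketch a sequence is a list of tuples of items; the ordering check in `i_step` is an assumption that follows from items within an element being listed alphabetically.

```python
def i_step(s, item):
    """I-Step: add `item` to the last element of s."""
    *head, last = s
    if any(item <= x for x in last):
        # Assumed restriction: the new item must sort after the existing ones.
        raise ValueError("item must be greater than every item of the last element")
    return [*head, last + (item,)]


def s_step(s, item):
    """S-Step: append a new single-item element to s."""
    return [*s, (item,)]


print(i_step([("a",)], "e"))
print(s_step([("a",)], "e"))
```

The first call builds `<(ae)>` from `<(a)>`, the second builds `<(a)(e)>`.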

Introduction (Cont.)
Definition
– Prefix Search Tree
  (each node is reached from its parent by the labeled extension; subscript i = I-Step, s = S-Step)
<>
  a_s: <(a)>
    b_i: <(ab)>
      a_s: <(ab)(a)>
      b_s: <(ab)(b)>
    a_s: <(a)(a)>
    b_s: <(a)(b)>
      c_i: <(a)(bc)>
      d_i: <(a)(bd)>
  b_s: <(b)>

Search Space Pruning
Definition
– Common Prefix
  Example
  – $D_s$ = {de(af), de(fg)}
  – <de> is a common prefix, so $s \diamond \langle e \rangle$ is not closed and it is unnecessary to extend $s \diamond \langle e \rangle$
– Partial Order
  Example
  – Before projecting $D$ into $D_a, D_b, D_d, D_e, D_f$
  – a is always before f in all the sequences
  – Need not search any sequence beginning with f

Search Space Pruning (Cont.)
Definition
– $\mathcal{I}(D)$: total number of items in $D$
– Equivalence of Projected Database
  For two sequences $s$ and $s'$ with $s \sqsubseteq s'$: $D_s = D_{s'} \Leftrightarrow \mathcal{I}(D_s) = \mathcal{I}(D_{s'})$
  Example
  – $D_{(af)} = D_f$ = {de, (de)}
  – $\mathcal{I}(D_{(af)}) = \mathcal{I}(D_f) = 4$
ID   Sequence
0    (af)dea
1    eab
2    e(abf)(bde)
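
The sizes in this example can be recomputed with a short script. It is a sketch under two assumptions: projection uses the leftmost match of the prefix, and the quoted sizes count only items that occur in at least min_sup_count postfixes (with items of a leading "_" element counted separately). The names `project`, `prune` and `size` are illustrative.

```python
from collections import Counter

DB = [["af", "d", "e", "a"], ["e", "a", "b"], ["e", "abf", "bde"]]


def project(database, prefix):
    """Postfixes of all sequences containing the non-empty prefix."""
    result = []
    for seq in database:
        pos = 0
        for element in prefix:
            while pos < len(seq) and not set(element) <= set(seq[pos]):
                pos += 1
            if pos == len(seq):
                break  # prefix not contained in this sequence
            pos += 1
        else:
            rest = "".join(x for x in seq[pos - 1] if x > prefix[-1][-1])
            result.append((["_" + rest] if rest else []) + seq[pos:])
    return result


def tagged_items(postfix):
    """Yield items, with '_' prepended for items of a leading '_' element."""
    for element in postfix:
        tag = "_" if element.startswith("_") else ""
        for item in element.lstrip("_"):
            yield tag + item


def prune(projected, min_sup):
    """Remove items that occur in fewer than min_sup postfixes."""
    counts = Counter(k for p in projected for k in set(tagged_items(p)))
    pruned = []
    for p in projected:
        kept = []
        for element in p:
            tag = "_" if element.startswith("_") else ""
            items = "".join(x for x in element.lstrip("_") if counts[tag + x] >= min_sup)
            if items:
                kept.append(tag + items)
        pruned.append(kept)
    return pruned


def size(projected):
    """Total number of items in a projected database."""
    return sum(len(e.lstrip("_")) for p in projected for e in p)


d_af = prune(project(DB, ["af"]), 2)
d_f = prune(project(DB, ["f"]), 2)
print(d_af, size(d_af))
print(d_f, size(d_f))
```

Both projected databases come out as {de, (de)} with four items, so comparing the two sizes is enough to conclude that the databases are equal.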

Search Space Pruning (Cont.)
Definition
– Early Termination by Equivalence
  For two sequences $s$ and $s'$ with $s \sqsubseteq s'$ and $\mathcal{I}(D_s) = \mathcal{I}(D_{s'})$: $\forall \gamma$, $support(s \diamond \gamma) = support(s' \diamond \gamma)$
  Example: $\mathcal{I}(D_{(af)}) = \mathcal{I}(D_f)$
  – (af)d and (af)e are frequent
  – support((af)d) = support(fd)
  – support((af)e) = support(fe)
  – The supports of fd and fe do not have to be counted separately

Search Space Pruning (Cont.)
Definition
– Backward Sub-Pattern
  If $s < s'$, $s \sqsupseteq s'$ and $\mathcal{I}(D_s) = \mathcal{I}(D_{s'})$: stop searching any descendant of $s'$ in the prefix search tree.
  [Figure: prefix search tree over items a and f, with nodes marked $s$ and $s'$]

Search Space Pruning (Cont.)
Definition
– Backward Super-Pattern
  If $s < s'$, $s \sqsubseteq s'$ and $\mathcal{I}(D_s) = \mathcal{I}(D_{s'})$: transplant the descendants of $s$ to $s'$ instead of searching any descendant of $s'$ in the prefix search tree.
  [Figure: prefix search tree over items b and e, with nodes marked $s$ and $s'$]

Search Space Pruning (Cont.)
Definition
– Partial Prefix Sequence Lattice
  [Figure: search space drawn as a prefix sequence lattice rooted at <>, with edges labeled f_i, f_s, a_s, e_s, b_s, d_s]
  $\mathcal{I}(D_{eb}) = \mathcal{I}(D_b)$
  $\mathcal{I}(D_{eab}) = \mathcal{I}(D_{ab})$
  $\mathcal{I}(D_{af}) = \mathcal{I}(D_f)$

CloSpan
CloSpan($s$, $D_s$, min_sup, $L$)
– Input: a sequence $s$, a projected DB $D_s$, and min_sup
– Output: the prefix search lattice $L$
– Check whether a discovered sequence $s'$ exists s.t. either $s \sqsubseteq s'$ or $s' \sqsubseteq s$, and $\mathcal{I}(D_s) = \mathcal{I}(D_{s'})$;
– if such a super-pattern or sub-pattern exists then modify the link in $L$, return;
– else insert $s$ into $L$;
– scan $D_s$ once, find every frequent item $\alpha$ such that $s$ can be extended to ($s \diamond_i \alpha$), or $s$ can be extended to ($s \diamond_s \alpha$);
– if no valid $\alpha$ available then return;
– for each valid $\alpha$ do I-Step: call CloSpan($s \diamond_i \alpha$, $D_{s \diamond_i \alpha}$, min_sup, $L$);
– for each valid $\alpha$ do S-Step: call CloSpan($s \diamond_s \alpha$, $D_{s \diamond_s \alpha}$, min_sup, $L$);
– return;
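
The recursive growth by I-Step and S-Step extensions can be sketched in Python. To be clear about what this is: a simplified brute-force stand-in, not CloSpan itself. It recounts support against the whole database instead of using projected databases, it has no lattice and no equivalence pruning, and it finds the closed patterns with a filter afterwards. All names are illustrative.

```python
def is_subsequence(pattern, sequence):
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def mine(database, min_sup):
    """Enumerate all frequent sequences by pattern growth."""
    items = sorted({x for seq in database for element in seq for x in element})
    found = {}

    def support(pattern):
        return sum(is_subsequence(pattern, seq) for seq in database)

    def grow(s):
        for item in items:
            candidates = [s + (frozenset([item]),)]            # S-Step
            if s and item > max(s[-1]):
                candidates.append(s[:-1] + (s[-1] | {item},))  # I-Step
            for candidate in candidates:
                sup = support(candidate)
                if sup >= min_sup:
                    found[candidate] = sup
                    grow(candidate)  # only frequent patterns are extended

    grow(())
    return found


def closed_only(found):
    """Drop every pattern that has a super-sequence with the same support."""
    return {p: sup for p, sup in found.items()
            if not any(q != p and found[q] == sup and is_subsequence(p, q) for q in found)}


raw = [["af", "d", "e", "a"], ["e", "a", "b"], ["e", "abf", "bde"]]
db = [[frozenset(e) for e in seq] for seq in raw]
frequent = mine(db, 2)
closed = closed_only(frequent)
print(len(frequent), len(closed))
```

On the three-sequence example database with min_sup_count = 2 this finds 16 frequent and 4 closed sequences. CloSpan aims to reach the same closed set while skipping most of the enumeration.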

CloSpan (Cont.)
Hash for Fast Condition Checking
Hash table: <key, s>, with entries < $\mathcal{I}(D_s)$, s >
[Figure: prefix search lattice rooted at <> (edges f_i, a_s, e_s, b_s, a_s, d_s, e_s) next to the hash table; empty buckets are nil]
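
One way to picture the hash table is as a map from the projected-database size to the sequences already discovered with that size. The class below is a hypothetical sketch: the name `SizeIndex`, its methods and the default of 4 buckets are assumptions for illustration, and the sizes in the usage lines are the ones quoted in the slides.

```python
from collections import defaultdict


def is_subsequence(pattern, sequence):
    pos = 0
    for element in pattern:
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def seq(*elements):
    """Build a sequence, e.g. seq("af", "d") for <(af)d>."""
    return tuple(frozenset(e) for e in elements)


class SizeIndex:
    """Hash table of <I(D_s), s> entries, bucketed by I(D_s) mod buckets."""

    def __init__(self, buckets=4):
        self.buckets = buckets
        self.table = defaultdict(list)

    def insert(self, s, size):
        self.table[size % self.buckets].append((size, s))

    def find(self, s, size):
        """Look for an earlier sequence with equal size that contains s or is contained in s."""
        for prev_size, prev in self.table[size % self.buckets]:
            if prev_size != size:
                continue  # same bucket, different key
            if is_subsequence(s, prev):
                return "backward sub-pattern", prev
            if is_subsequence(prev, s):
                return "backward super-pattern", prev
        return None


index = SizeIndex()
index.insert(seq("a"), 8)
index.insert(seq("af"), 4)
index.insert(seq("a", "b"), 0)
print(index.find(seq("f"), 4))
print(index.find(seq("e", "a", "b"), 0))
print(index.find(seq("e"), 6))
```

Only sequences whose key equals the new sequence's size are compared, so most of the containment tests are avoided.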

CloSpan (Cont.)
Example
ID   Sequence
0    (af)dea
1    eab
2    e(abf)(bde)
min_sup_count = 2
Hash function: mod 4
Frequent items: a:3, b:2, d:2, e:3, f:2

CloSpan (Cont.)
Example (Cont.)
Projected DB   Sequences                Frequent items only     Local item counts        $\mathcal{I}(D_s)$
$D_a$          (_f)dea, b, (_bf)(bde)   (_f)de, b, (_f)(bde)    (_f):2, b:2, d:2, e:2    8
$D_b$          (_f)(bde)                X                       X                        0
$D_d$          ea, (_e)                 X                       X                        0
$D_e$          a, ab, (abf)(bde)        a, ab, (ab)b            a:3, b:2                 6
$D_f$          dea, (bde)               de, (de)                d:2, e:2                 4
Hash table (mod 4): buckets 0, 1, 2, 3 are all nil; the lattice contains only the root <>.

CloSpan (Cont.)
Example (Cont.)
$D_a$ = (_f)de, b, (_f)(bde); frequent items (_f):2, b:2, d:2, e:2; $\mathcal{I}(D_s)$ = 8
8 mod 4 = 0
Lattice: node a_s:3 is added under <> with key 8 (hash bucket 0).

CloSpan (Cont.)
Example (Cont.)
Projected DB   Sequences    Frequent items only   Local item counts   $\mathcal{I}(D_s)$
$D_{(af)}$     de, (bde)    de, (de)              d:2, e:2            4
$D_{ab}$       de           X                     X                   0
$D_{ad}$       e, e         e, e                  e:2                 2
$D_{ae}$       (empty)      X                     X                   0

CloSpan (Cont.)
Example (Cont.)
$D_{(af)}$ = de, (de); frequent items d:2, e:2; $\mathcal{I}(D_s)$ = 4
4 mod 4 = 0
Lattice: node f_i:2 is added under a_s:3 with key 4 (hash bucket 0).

CloSpan (Cont.)
Example (Cont.)
Projected DB   Sequences   Frequent items only   Local item counts   $\mathcal{I}(D_s)$
$D_{(af)d}$    e, (_e)     X                     X                   0
$D_{(af)e}$    (empty)     X                     X                   0

CloSpan (Cont.)
Example (Cont.)
$D_{(af)d}$: no frequent items; $\mathcal{I}(D_s)$ = 0; 0 mod 4 = 0
Lattice: node d_s:2 is added under f_i:2 with key 0 (hash bucket 0).
$D_{(af)e}$: no frequent items; $\mathcal{I}(D_s)$ = 0; 0 mod 4 = 0
Lattice: node e_s:2 is added under f_i:2 with key 0 (hash bucket 0).

CloSpan (Cont.)
Example (Cont.)
$D_{ab}$: no frequent items; $\mathcal{I}(D_s)$ = 0; 0 mod 4 = 0
Lattice: node b_s:2 is added under a_s:3 with key 0 (hash bucket 0).

CloSpan (Cont.)
Example (Cont.)
$D_b$: no frequent items; $\mathcal{I}(D_s)$ = 0; 0 mod 4 = 0
The lattice is unchanged.
$D_d$: no frequent items; $\mathcal{I}(D_s)$ = 0; 0 mod 4 = 0
The lattice is unchanged.

CloSpan (Cont.)
Example (Cont.)
$D_e$ = a, ab, (ab)b; frequent items a:3, b:2; $\mathcal{I}(D_s)$ = 6
6 mod 4 = 2
Lattice: node e_s:3 is added under <> with key 6 (hash bucket 2).

CloSpan (Cont.)
Example (Cont.)
Projected DB   Sequences    Frequent items only   Local item counts   $\mathcal{I}(D_s)$
$D_{ea}$       b, (_b)b     b, b                  b:2                 2
$D_{eb}$       (empty)      X                     X                   0
$D_{ea}$: 2 mod 4 = 2
Lattice: node a_s:3 is added under e_s:3 with key 2 (hash bucket 2).

CloSpan (Cont.)
Example (Cont.)
$D_{eab}$: no frequent items; $\mathcal{I}(D_s)$ = 0; 0 mod 4 = 0
Lattice: node b_s:2 is added under the a_s:3 node of the e_s:3 branch.

CloSpan (Cont.)
Example (Cont.)
$D_{eb}$: no frequent items; $\mathcal{I}(D_s)$ = 0; 0 mod 4 = 0
The lattice is unchanged.

CloSpan (Cont.)
Example (Cont.)
$D_f$ = de, (de); frequent items d:2, e:2; $\mathcal{I}(D_s)$ = 4
4 mod 4 = 0
The lattice is unchanged.

CloSpan (Cont.)
Example (Cont.)
Final lattice:
<>
  a_s:3
    f_i:2
      d_s:2
      e_s:2
    b_s:2
  e_s:3
    a_s:3
      b_s:2
Closed sequences: (af)d:2, (af)e:2, eab:2, ea:3

Experimental Results
Synthetic Data
– Parameters
  D: Number of sequences (in thousands)
  C: Average itemsets per sequence
  T: Average items per itemset
  N: Number of different items (in thousands)
  S: Average itemsets in maximal sequences
  I: Average items in maximal sequences
– Two Data Sets
  D10 C10 T2.5 N10 S6 I2.5
  D5 C20 T20 N10 S20 I20
Real world datasets
– KDDCup2000 – Gazelle Click Stream
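
The synthetic dataset names encode the six parameters listed above. As a reading aid, the sketch below splits such a name into labeled values; the dictionary keys and the function name `parse_dataset` are made up for this illustration.

```python
import re

# Hypothetical field names for the six parameter letters.
FIELDS = {
    "D": "sequences_thousands",
    "C": "avg_itemsets_per_sequence",
    "T": "avg_items_per_itemset",
    "N": "distinct_items_thousands",
    "S": "avg_itemsets_in_maximal_sequences",
    "I": "avg_items_in_maximal_sequences",
}


def parse_dataset(name):
    """Turn a name like 'D10 C10 T2.5 N10 S6 I2.5' into a dict of floats."""
    pairs = re.findall(r"([DCTNSI])(\d+(?:\.\d+)?)", name)
    return {FIELDS[letter]: float(value) for letter, value in pairs}


print(parse_dataset("D10 C10 T2.5 N10 S6 I2.5"))
print(parse_dataset("D5 C20 T20 N10 S20 I20"))
```

For instance, D10 stands for 10 thousand sequences and T2.5 for an average of 2.5 items per itemset.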

Experimental Results (Cont.)
Synthetic Data: D10 C10 T2.5 N10 S6 I2.5

Experimental Results (Cont.)
Synthetic Data: D5 C20 T20 N10 S20 I20

Experimental Results (Cont.)
Real world datasets
– KDDCup2000
  29,369 sequences
  35,722 sessions
  87,546 page views
  The average number of sessions in a sequence is around 1
  The average number of page views in a session is 2
  The largest session contains 342 views
  The longest sequence has 140 sessions
  The largest sequence contains 651 page views

Experimental Results (Cont.)

Conclusions
CloSpan mines frequent closed sequences efficiently.
CloSpan outperforms PrefixSpan.

Lexicographic Order
Definition
– Lexicographic Order
  $t = \{i_1, i_2, \ldots, i_k\}$, $i_1 \le i_2 \le \ldots \le i_k$
  $t' = \{j_1, j_2, \ldots, j_l\}$, $j_1 \le j_2 \le \ldots \le j_l$
  $t < t'$ iff either of the following is true:
  – For some $h$, $0 \le h \le \min\{k, l\}$, we have $i_r = j_r$ for $r < h$, and $i_h < j_h$, or
  – $k < l$, and $i_1 = j_1, i_2 = j_2, \ldots, i_k = j_k$
Example
– (a,f) < (b,f)
– (a,b) < (a,b,c)
– (a,b,c) < (b,c)
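
The two cases of the definition translate into a short comparison function. This is a sketch that assumes each itemset is given as a tuple of items already in sorted order; `itemset_less` is an illustrative name.

```python
def itemset_less(t, u):
    """Lexicographic order on itemsets written as sorted tuples of items."""
    for x, y in zip(t, u):
        if x != y:
            return x < y    # case 1: the first differing position decides
    return len(t) < len(u)  # case 2: t is a proper prefix of u


print(itemset_less(("a", "f"), ("b", "f")))
print(itemset_less(("a", "b"), ("a", "b", "c")))
print(itemset_less(("a", "b", "c"), ("b", "c")))

# Python's built-in tuple comparison follows the same rule.
print(("a", "b", "c") < ("b", "c"))
```

All three examples from the slide evaluate to True.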

Sequence Lexicographic Order
Definition
– Sequence Lexicographic Order
  If $s' = s \diamond p$, then $s < s'$
  If $s = \alpha \diamond_i p$ and $s' = \alpha \diamond_s p'$, then no matter what the order relation between $p$ and $p'$ is, $s < s'$
  If $s = \alpha \diamond_i p$ and $s' = \alpha \diamond_i p'$, then $p < p'$ indicates $s < s'$
  If $s = \alpha \diamond_s p$ and $s' = \alpha \diamond_s p'$, then $p < p'$ indicates $s < s'$
Example
– (ab) < (ab)(a)
– (ac) < (a)(d), (ad) < (a)(c)
– (ab) < (ac)
– (a)(b) < (a)(c)

Lexicographic Sequence Tree
Definition
– Lexicographic Sequence Tree
<>
  <(a)>
    <(ab)>
      <(ab)(a)>
      <(ab)(b)>
    <(a)(a)>
    <(a)(b)>
      <(a)(bc)>
      <(a)(bd)>
  <(b)>

Search Space Pruning
Definition
– Common Prefix
  Given a subsequence $s$ and its projected database $D_s$: if $\exists \varepsilon$ such that $\varepsilon$ is a common prefix for all the sequences with the same extension type (either itemset-extension or sequence-extension) in $D_s$, then $\forall \beta$, if $s \diamond \beta$ is closed, $\varepsilon$ must be a prefix of $\beta$; we need not search $s \diamond \beta$ and its descendants except the branch of $s \diamond \varepsilon$.
  Example
  – $D_s$ = {de(af), de(fg)}
  – <de> is a common prefix, so $s \diamond \langle e \rangle$ is not closed and it is unnecessary to extend $s \diamond \langle e \rangle$
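
Finding the common prefix in this example takes only a few lines. The sketch below is deliberately simplified: it compares whole elements only and ignores the distinction between extension types made in the definition, so it illustrates the idea rather than the full pruning rule. `common_prefix` is an illustrative name.

```python
def common_prefix(projected):
    """Longest run of leading elements shared by every sequence."""
    prefix = []
    for elements in zip(*projected):
        if any(e != elements[0] for e in elements):
            break  # the sequences diverge at this position
        prefix.append(elements[0])
    return prefix


d_s = [["d", "e", "af"], ["d", "e", "fg"]]
print(common_prefix(d_s))
```

For $D_s$ = {de(af), de(fg)} the result is the prefix <de>, so only the branch that extends $s$ with d and then e needs to be searched.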

Search Space Pruning (Cont.)
CommonPrefix
– An intermediate algorithm
– Adopts the PrefixSpan framework plus the common prefix pruning technique
– Outperforms PrefixSpan

Search Space Pruning (Cont.)
Definition
– Partial Order
  Given a sequence $s$ and its projected database $D_s$: if among all the sequences in $D_s$ an item $\alpha$ always occurs before an item $\beta$ (either in the same itemset for all sequences in $D_s$ or in a different itemset, but not both), then $D_{s \diamond \beta} = D_{s \diamond \alpha \diamond \beta}$, so $s \diamond \beta$ is not closed. Need not search any sequence in the branch of $s \diamond \beta$.
  Example
  – Before projecting $D$ into $D_a, D_b, D_d, D_e, D_f$
  – a is always before f in all the sequences
  – Need not search any sequence beginning with f