clospan: mining closed sequential patterns in large datasets xifeng yan, jiawei han and ramin afshar...

51
CloSpan: Mining Closed Sequential Patterns in Large Datasets. Xifeng Yan, Jiawei Han and Ramin Afshar. Proceedings of 2003 SIAM International Conference on Data Mining (SDM'03), pp. 166-177, San Francisco, CA, May 2003. Advisor: Professor Hsin-Hsi Chen. Reporter: Clarence Min-Chi Hsieh. Natural Language Processing Laboratory, Dept. of Computer Science and Info. Engineering, NTU. 2006/01/10


CloSpan: Mining Closed Sequential Patterns in Large Datasets

Xifeng Yan, Jiawei Han and Ramin Afshar

Proceedings of 2003 SIAM International Conference on Data Mining (SDM'03), pp. 166-177, San Francisco, CA, May 2003.

Advisor: Professor Hsin-Hsi Chen
Reporter: Clarence Min-Chi Hsieh

Natural Language Processing Laboratory, Dept. of Computer Science and Info. Engineering, NTU

2006/01/10

Slide - 2  Copyright © Natural Language Processing Lab., NTU, 2006

Reporter: Clarence Min-Chi Hsieh  CloSpan: Mining Closed Sequential Patterns in Large Datasets

Outline
– Introduction
– Search Space Pruning
– CloSpan
– Experimental Results
– Conclusions

Introduction

Apriori-like algorithms generate a huge set of candidate sequences.
– Ex. With 1000 frequent sequences of length 1, there are 1000×1000 + (1000×999)/2 = 1,499,500 candidate sequences.

Mining requires many scans of the database.
– Ex. For the sequential pattern {(abc)(abc)(abc)(abc)(abc)}, an Apriori-based method must scan the database at least 15 times.

Long sequential patterns are difficult to mine.
– Ex. With only a single sequence of length 100 and min_sup = 1, there are 100 length-1 candidate sequences, 14950 length-2 candidate sequences, …, for a total of 2^100 − 1 ≈ 10^30.
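
The candidate counts quoted above can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `length2_candidates` is introduced here for illustration and does not come from the slides.

```python
from math import comb

def length2_candidates(n):
    """Length-2 candidates built from n frequent items:
    n*n ordered pairs <xy> plus n*(n-1)/2 unordered itemsets <(xy)>."""
    return n * n + n * (n - 1) // 2

print(length2_candidates(1000))  # 1499500
print(length2_candidates(100))   # 14950

# Total number of candidates for one sequence of 100 distinct items:
# the sum of C(100, i) for i = 1..100, which equals 2^100 - 1.
total = sum(comb(100, i) for i in range(1, 101))
print(total == 2**100 - 1)       # True
print(f"{total:.2e}")            # about 1.27e+30
```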

Introduction (Cont.)

Definition
– Sequence, Elements, Subsequence and Sequential Pattern
  A sequence: <(ef)(ab)(df)cb>
  Elements: items within an element are listed alphabetically
  <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
  Given support threshold min_sup_count = 2, <(ab)c> is a sequential pattern

A sequence database

SID   sequence
10    <a(abc)(ac)d(cf)>
20    <(ad)c(bc)(ae)>
30    <(ef)(ab)(df)cb>
40    <eg(af)cbc>
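
The subsequence and support definitions can be made concrete with a small Python sketch. The helper names `parse`, `contains` and `support` are assumptions made for this illustration; sequences are stored as tuples of item tuples, and containment uses a greedy leftmost match.

```python
def parse(text):
    """Turn 'a(abc)(ac)d(cf)' into (('a',), ('a','b','c'), ('a','c'), ('d',), ('c','f'))."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(tuple(text[i + 1:j]))
            i = j + 1
        else:
            seq.append((text[i],))
            i += 1
    return tuple(seq)

def contains(seq, pat):
    """True if pat is a subsequence of seq: each element of pat must be a
    subset of a distinct element of seq, in the same order."""
    pos = 0
    for elem in pat:
        while pos < len(seq) and not set(elem) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def support(db, pat):
    """Number of sequences in db that contain pat."""
    return sum(contains(s, pat) for s in db)

# The sequence database from the slide (SIDs 10, 20, 30, 40).
DB = [parse(s) for s in ("a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc")]

print(contains(DB[0], parse("a(bc)dc")))  # True
print(support(DB, parse("(ab)c")))        # 2, so <(ab)c> is a sequential pattern
```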

Introduction (Cont.)

Definition
– Frequent Sequential Pattern (FS)
  Includes all the sequences whose support is no less than min_sup
– Closed Frequent Sequential Pattern (CS)
  Includes no sequence which has a super-sequence with the same support
– CS ⊆ FS

Introduction (Cont.)

Example – FS & CS

ID   Sequence
0    (af)dea
1    eab
2    e(abf)(bde)

min_sup_count = 2

FS: a:3, b:2, d:2, e:3, f:2, ab:2, ad:2, ae:2, (af):2, ea:3, eb:2, fd:2, fe:2, (af)d:2, (af)e:2, eab:2

CS: ea:3, (af)d:2, (af)e:2, eab:2
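
The FS and CS sets of this example can be reproduced with a brute-force sketch. This is not CloSpan: it grows patterns one item at a time, counts support by rescanning the whole database, and filters out non-closed patterns at the end. All function names are assumptions made for this illustration.

```python
def parse(text):
    """Turn 'e(abf)(bde)' into (('e',), ('a','b','f'), ('b','d','e'))."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(tuple(text[i + 1:j]))
            i = j + 1
        else:
            seq.append((text[i],))
            i += 1
    return tuple(seq)

def show(seq):
    return "".join(e[0] if len(e) == 1 else "(" + "".join(e) + ")" for e in seq)

def contains(seq, pat):
    """Greedy leftmost subsequence test on itemset sequences."""
    pos = 0
    for elem in pat:
        while pos < len(seq) and not set(elem) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def mine_frequent(db, min_sup):
    """Enumerate every frequent pattern by growing a new element or
    enlarging the last element, keeping only frequent candidates."""
    items = sorted({i for s in db for e in s for i in e})
    found = {}

    def grow(pat):
        cands = [pat + ((i,),) for i in items]  # append a new one-item element
        if pat:  # enlarge the last element with a larger item
            last = pat[-1]
            cands += [pat[:-1] + (last + (i,),) for i in items if i > last[-1]]
        for cand in cands:
            sup = sum(contains(s, cand) for s in db)
            if sup >= min_sup:
                found[cand] = sup
                grow(cand)

    grow(())
    return found

def closed_only(freq):
    """Keep patterns with no proper super-sequence of equal support."""
    return {p: n for p, n in freq.items()
            if not any(q != p and m == n and contains(q, p) for q, m in freq.items())}

DB = [parse(s) for s in ("(af)dea", "eab", "e(abf)(bde)")]
FS = mine_frequent(DB, 2)
CS = closed_only(FS)
print(len(FS))                                         # 16
print(sorted(f"{show(p)}:{n}" for p, n in CS.items())) # ['(af)d:2', '(af)e:2', 'ea:3', 'eab:2']
```

The brute-force version rescans the database for every candidate, which is exactly the cost that projection and the pruning rules in the later slides aim to avoid.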

Introduction (Cont.)

Definition
– Prefix and Postfix (Projection)
  <a>, <aa>, <a(ab)> and <a(abc)> are prefixes of sequence <a(abc)(ac)d(cf)>

Given sequence <a(abc)(ac)d(cf)>

Prefix   Postfix / Projection
<a>      <(abc)(ac)d(cf)>
<aa>     <(_bc)(ac)d(cf)>
<ab>     <(_c)(ac)d(cf)>
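
The projection table can be reproduced with a short sketch. It assumes the postfix is taken with respect to the leftmost occurrence of the prefix; `parse`, `show` and `project` are names introduced here for illustration, and "_" marks the leftover part of the element that matched the end of the prefix.

```python
def parse(text):
    """Turn 'a(abc)(ac)d(cf)' into a tuple of item tuples."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(tuple(text[i + 1:j]))
            i = j + 1
        else:
            seq.append((text[i],))
            i += 1
    return tuple(seq)

def show(seq):
    return "".join(e[0] if len(e) == 1 else "(" + "".join(e) + ")" for e in seq)

def project(seq, prefix):
    """Postfix of seq after the leftmost occurrence of prefix, or None
    when seq does not contain prefix."""
    pos = 0
    for elem in prefix:
        while pos < len(seq) and not set(elem) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return None
        pos += 1
    matched = seq[pos - 1]
    # Items of the matched element that come after the prefix's last item.
    rest = tuple(i for i in matched if i > max(prefix[-1]))
    head = (("_",) + rest,) if rest else ()
    return head + seq[pos:]

s = parse("a(abc)(ac)d(cf)")
for p in ("a", "aa", "ab"):
    print(p, "->", show(project(s, parse(p))))
# a -> (abc)(ac)d(cf)
# aa -> (_bc)(ac)d(cf)
# ab -> (_c)(ac)d(cf)
```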

Introduction (Cont.)

Definition
– Given a sequence s = <t_1, t_2, …, t_m> and an item α
– I-Step extension
  s ⋄i α = <t_1, t_2, …, t_m ∪ {α}>
  Ex: <(ae)> is an I-Step extension of <(a)>
– S-Step extension
  s ⋄s α = <t_1, t_2, …, t_m, {α}>
  Ex: <(a)(e)> is an S-Step extension of <(a)>
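
The two extension steps are easy to state in code. In this sketch a sequence is a tuple of sorted item tuples; the function names `i_step` and `s_step` are chosen here for illustration.

```python
def i_step(seq, item):
    """I-Step: add item to the last element of seq."""
    if not seq:
        raise ValueError("an I-Step needs a non-empty sequence")
    last = tuple(sorted(set(seq[-1]) | {item}))
    return seq[:-1] + (last,)

def s_step(seq, item):
    """S-Step: append item as a new one-item element."""
    return seq + ((item,),)

a = (("a",),)
print(i_step(a, "e"))  # (('a', 'e'),)       i.e. <(ae)>
print(s_step(a, "e"))  # (('a',), ('e',))    i.e. <(a)(e)>
```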

Introduction (Cont.)

Definition
– Prefix Search Tree
  Each edge is labelled with the added item and the kind of step (s = S-Step, i = I-Step).

<>
  a_s → <(a)>
    b_i → <(ab)>
      a_s → <(ab)(a)>
      b_s → <(ab)(b)>
    a_s → <(a)(a)>
    b_s → <(a)(b)>
      c_i → <(a)(bc)>
      d_i → <(a)(bd)>
  b_s → <(b)>

Search Space Pruning

Definition
– Common Prefix
  Example
  – D_s = {de(af), de(fg)}
  – Every sequence in D_s starts with <de>, so s ⋄ <e> is not closed and it is unnecessary to extend it
– Partial Order
  Example
  – Before projecting D into D_a, D_b, D_d, D_e, D_f
  – a is always before f in all the sequences
  – Need not search any sequence beginning with f

Search Space Pruning (Cont.)

Definition
– I(D): total number of items in D
– Equivalence of Projected Database
  For two sequences s and s' with s ⊑ s': D_s = D_s' ⇔ I(D_s) = I(D_s')
  Example
  – D_(af) = D_f = {de, (de)}
  – I(D_(af)) = I(D_f) = 4

ID   Sequence
0    (af)dea
1    eab
2    e(abf)(bde)
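
The equivalence example can be checked with a sketch that projects the three-sequence database and counts items. It makes two simplifying assumptions: projection uses the leftmost occurrence of the prefix, and items that are infrequent inside a projected database are removed before counting, which is how the slide's value of 4 arises. All function names are introduced here for illustration.

```python
from collections import Counter

def parse(text):
    """Turn 'e(abf)(bde)' into a tuple of item tuples."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(tuple(text[i + 1:j]))
            i = j + 1
        else:
            seq.append((text[i],))
            i += 1
    return tuple(seq)

def project(seq, prefix):
    """Postfix of seq after the leftmost occurrence of prefix, or None."""
    pos = 0
    for elem in prefix:
        while pos < len(seq) and not set(elem) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return None
        pos += 1
    rest = tuple(i for i in seq[pos - 1] if i > max(prefix[-1]))
    return ((("_",) + rest,) if rest else ()) + seq[pos:]

def projected_db(db, prefix):
    return [p for p in (project(s, parse(prefix)) for s in db) if p is not None]

def drop_infrequent(pdb, min_sup):
    """Remove items that occur in fewer than min_sup projected sequences.
    Items inside a '_' element are counted separately from plain items."""
    def keys(seq):
        return {(e[0] == "_", i) for e in seq for i in e if i != "_"}
    counts = Counter(k for s in pdb for k in keys(s))
    pruned = []
    for s in pdb:
        new = []
        for e in s:
            kept = tuple(i for i in e if i == "_" or counts[(e[0] == "_", i)] >= min_sup)
            if kept and kept != ("_",):
                new.append(kept)
        pruned.append(tuple(new))
    return pruned

def size(pdb):
    """I(D): total number of items, ignoring the '_' placeholder."""
    return sum(1 for s in pdb for e in s for i in e if i != "_")

DB = [parse(s) for s in ("(af)dea", "eab", "e(abf)(bde)")]
D_af = drop_infrequent(projected_db(DB, "(af)"), 2)
D_f = drop_infrequent(projected_db(DB, "f"), 2)
print(D_af == D_f, size(D_af), size(D_f))  # True 4 4
```

Comparing two integers is far cheaper than comparing two projected databases, which is why the equivalence matters for the pruning rules that follow.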

Search Space Pruning (Cont.)

Definition
– Early Termination by Equivalence
  Given two sequences s and s' with s ⊑ s' and I(D_s) = I(D_s'),
  then for every γ, support(s ⋄ γ) = support(s' ⋄ γ)
  Example
  – I(D_(af)) = I(D_f)
  – (af)d and (af)e are frequent
  – support((af)d) = support(fd)
  – support((af)e) = support(fe)
  – The supports of fd and fe therefore follow without searching below f
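
The support equalities in this example can be confirmed directly on the three-sequence database used in the FS & CS example. The sketch below counts support by scanning the database, so it only verifies the claim; it does not implement the early termination itself. The helper names are assumptions made for this illustration.

```python
def parse(text):
    """Turn '(af)dea' into a tuple of item tuples."""
    seq, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            seq.append(tuple(text[i + 1:j]))
            i = j + 1
        else:
            seq.append((text[i],))
            i += 1
    return tuple(seq)

def contains(seq, pat):
    """Greedy leftmost subsequence test on itemset sequences."""
    pos = 0
    for elem in pat:
        while pos < len(seq) and not set(elem) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

def support(db, text):
    return sum(contains(s, parse(text)) for s in db)

DB = [parse(s) for s in ("(af)dea", "eab", "e(abf)(bde)")]
for tail in ("d", "e"):
    print(tail, support(DB, "(af)" + tail), support(DB, "f" + tail))
# d 2 2
# e 2 2
```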

Search Space Pruning (Cont.)

Definition
– Backward Sub-Pattern
  s < s', s ⊒ s' and I(D_s) = I(D_s')
  Stop searching any descendant of s' in the prefix search tree

Figure: prefix search tree fragment over the items a and f with the nodes s and s' marked, before and after pruning.

Search Space Pruning (Cont.)

Definition
– Backward Super-Pattern
  s < s', s ⊑ s' and I(D_s) = I(D_s')
  Transplant the descendants of s to s' instead of searching any descendant of s' in the prefix search tree

Figure: prefix search tree fragment over the items b and e with the nodes s and s' marked, before and after transplanting.

Search Space Pruning (Cont.)

Definition
– Partial Prefix Sequence Lattice
  Search space

Figure: prefix search lattice rooted at <>, with edges labelled f_i, f_s, a_s, e_s, b_s and d_s. The figure is annotated with:
– I(D_eb) = I(D_b)
– I(D_eab) = I(D_ab)
– I(D_(af)) = I(D_f)

CloSpan

CloSpan(s, D_s, min_sup, L)
– Input: a sequence s, a projected DB D_s, and min_sup
– Output: the prefix search lattice L
– Check whether a discovered sequence s' exists such that either s ⊑ s' or s' ⊑ s, and I(D_s) = I(D_s');
– if such a super-pattern or sub-pattern exists then
    modify the link in L, return;
– else insert s into L;
– scan D_s once, find every frequent item α such that
    s can be extended to (s ⋄i α), or
    s can be extended to (s ⋄s α);
– if no valid α is available then
    return;
– for each valid α (I-Step) do
    call CloSpan(s ⋄i α, D_(s ⋄i α), min_sup, L);
– for each valid α (S-Step) do
    call CloSpan(s ⋄s α, D_(s ⋄s α), min_sup, L);
– return;

CloSpan (Cont.)

Hash for Fast Condition Checking

Figure: the nodes of the prefix search lattice are registered in a hash table whose entries are <key, s> pairs with key = I(D_s); empty buckets hold nil.
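
A minimal sketch of such a hash index is given below. It is an illustration under stated assumptions, not the authors' data structure: the class name `SizeIndex`, its methods and the default of four buckets (borrowed from the "mod 4" example that follows) are all hypothetical. The lookup returns a previously registered sequence with the same I(D_s) that is a super-pattern or sub-pattern of the query.

```python
from collections import defaultdict

def contains(seq, pat):
    """Greedy leftmost subsequence test on itemset sequences."""
    pos = 0
    for elem in pat:
        while pos < len(seq) and not set(elem) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True

class SizeIndex:
    """Hash table of <I(D_s), s> entries, bucketed by I(D_s) mod buckets."""

    def __init__(self, buckets=4):
        self.buckets = buckets
        self.table = defaultdict(list)

    def insert(self, seq, size):
        self.table[size % self.buckets].append((size, seq))

    def find_equivalent(self, seq, size):
        """Return ('super', s') if a registered s' contains seq,
        ('sub', s') if seq contains s', each with equal size, else None."""
        for other_size, other in self.table[size % self.buckets]:
            if other_size != size:
                continue  # same bucket but a different key
            if contains(other, seq):
                return "super", other
            if contains(seq, other):
                return "sub", other
        return None

index = SizeIndex()
af = (("a", "f"),)
index.insert((("a",),), 8)   # I(D_a) = 8
index.insert(af, 4)          # I(D_(af)) = 4

print(index.find_equivalent((("f",),), 4))  # ('super', (('a', 'f'),))
print(index.find_equivalent((("e",),), 6))  # None
```

Only sequences that land in the same bucket and carry the same key are compared, so most containment tests are skipped.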

CloSpan (Cont.)

Example

ID   Sequence
0    (af)dea
1    eab
2    e(abf)(bde)

min_sup_count = 2
Hash function: mod 4
Frequent items: a:3, b:2, d:2, e:3, f:2

CloSpan (Cont.)

Example (Cont.)

Projected databases of the frequent items (X marks a projected database with no frequent item):

D_s   Projection               After removing infrequent items   Frequent items          I(D_s)
D_a   (_f)dea, b, (_bf)(bde)   (_f)de, b, (_f)(bde)              (_f):2, b:2, d:2, e:2   8
D_b   (_f)(bde)                X                                 X                       0
D_d   ea, (_e)                 X                                 X                       0
D_e   a, ab, (abf)(bde)        a, ab, (ab)b                      a:3, b:2                6
D_f   dea, (bde)               de, (de)                          d:2, e:2                4

Hash table: buckets 0, 1, 2 and 3 all hold nil; the lattice contains only the root <>.

CloSpan (Cont.)

Example (Cont.)

D_a = {(_f)de, b, (_f)(bde)}, frequent items (_f):2, b:2, d:2, e:2, I(D_a) = 8, 8 mod 4 = 0

Lattice: <> → a_s:3
Hash table: the key 8 goes to bucket 0.

CloSpan (Cont.)

Example (Cont.)

Projected databases below a (X marks a projected database with no frequent item):

D_s      Projection   After removing infrequent items   Frequent items   I(D_s)
D_(af)   de, (bde)    de, (de)                          d:2, e:2         4
D_ab     (_de)        X                                 X                0
D_ad     e, (_e)      X                                 X                0
D_ae     (empty)      X                                 X                0

CloSpan (Cont.)

Example (Cont.)

– Insert (af): D_(af) = {de, (de)}, frequent items d:2, e:2, I(D_(af)) = 4, 4 mod 4 = 0.
  Lattice: <> → a_s:3 → f_i:2. The key 4 goes to bucket 0.
– Project below (af): D_(af)d = {e, (_e)} and D_(af)e is empty. Neither has a frequent item (X), so I(D_(af)d) = I(D_(af)e) = 0.
– Insert (af)d: I(D_(af)d) = 0, 0 mod 4 = 0.
  Lattice: <> → a_s:3 → f_i:2 → d_s:2.
– Insert (af)e: I(D_(af)e) = 0, 0 mod 4 = 0.
  Lattice: f_i:2 now has the children d_s:2 and e_s:2.

CloSpan (Cont.)

Example (Cont.)

– Insert ab: D_ab has no frequent item (X), I(D_ab) = 0, 0 mod 4 = 0.
  Lattice: a_s:3 now has the children f_i:2 and b_s:2.
– Check b: D_b has no frequent item (X), I(D_b) = 0, 0 mod 4 = 0. No new node is drawn in the lattice.
– Check d: D_d has no frequent item (X), I(D_d) = 0, 0 mod 4 = 0. No new node is drawn in the lattice.

CloSpan (Cont.)

Example (Cont.)

– Insert e: D_e = {a, ab, (ab)b}, frequent items a:3, b:2, I(D_e) = 6, 6 mod 4 = 2.
  Lattice: the root gains the child e_s:3. The key 6 goes to bucket 2.
– Project below e: D_ea = {b, (_b)b}, which becomes {b, b} after removing infrequent items, frequent item b:2, I(D_ea) = 2. D_eb has no frequent item (X), I(D_eb) = 0.
– Insert ea: I(D_ea) = 2, 2 mod 4 = 2.
  Lattice: e_s:3 → a_s:3. Bucket 2 now holds the keys 2 and 6.

CloSpan (Cont.)

Example (Cont.)

– Project below ea: D_eab has no frequent item (X), I(D_eab) = 0, 0 mod 4 = 0.
– Insert eab. Lattice: e_s:3 → a_s:3 → b_s:2.
– Check eb: D_eb has no frequent item (X), I(D_eb) = 0, 0 mod 4 = 0. No new node is drawn in the lattice.

CloSpan (Cont.)

Example (Cont.)

– Check f: D_f = {de, (de)}, frequent items d:2, e:2, I(D_f) = 4, 4 mod 4 = 0. No new node is drawn in the lattice.

Final lattice:

<>
  a_s:3
    f_i:2
      d_s:2
      e_s:2
    b_s:2
  e_s:3
    a_s:3
      b_s:2

Closed sequential patterns: (af)d:2, (af)e:2, eab:2, ea:3

Experimental Results

Synthetic Data
– Parameters
  D: number of sequences (in thousands)
  C: average itemsets per sequence
  T: average items per itemset
  N: number of different items (in thousands)
  S: average itemsets in maximal sequences
  I: average items in maximal sequences
– Two data sets
  D10 C10 T2.5 N10 S6 I2.5
  D5 C20 T20 N10 S20 I20

Real-world datasets
– KDDCup2000 – Gazelle Click Stream

Experimental Results (Cont.)

Synthetic Data
– Figure: results on D10 C10 T2.5 N10 S6 I2.5
– Figure: results on D5 C20 T20 N10 S20 I20

Experimental Results (Cont.)

Real-world datasets
– KDDCup2000
  29,369 sequences
  35,722 sessions
  87,546 page views
  The average number of sessions in a sequence is around 1
  The average number of page views in a session is 2
  The largest session contains 342 views
  The longest sequence has 140 sessions
  The largest sequence contains 651 page views

Experimental Results (Cont.)

Conclusions
– CloSpan mines frequent closed sequences efficiently.
– CloSpan outperforms PrefixSpan.

Lexicographic Order

Definition
– Lexicographic Order
  t = {i_1, i_2, …, i_k}, i_1 ≤ i_2 ≤ … ≤ i_k
  t' = {j_1, j_2, …, j_l}, j_1 ≤ j_2 ≤ … ≤ j_l
  t < t' iff either of the following is true:
  – For some h, 0 ≤ h ≤ min{k, l}, we have i_r = j_r for r < h, and i_h < j_h, or
  – k < l, and i_1 = j_1, i_2 = j_2, …, i_k = j_k

Example
– (a,f) < (b,f)
– (a,b) < (a,b,c)
– (a,b,c) < (b,c)
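
The itemset order defined above coincides with the built-in comparison of sorted tuples in Python. The sketch below spells out the two cases of the definition in a function named `itemset_less` (a name chosen here for illustration) and checks it against the three examples.

```python
def itemset_less(t, u):
    """t < u: the first differing position decides; otherwise the
    shorter itemset, being a proper prefix of the other, is smaller."""
    for x, y in zip(t, u):
        if x != y:
            return x < y
    return len(t) < len(u)

examples = [
    (("a", "f"), ("b", "f")),
    (("a", "b"), ("a", "b", "c")),
    (("a", "b", "c"), ("b", "c")),
]
for t, u in examples:
    # The explicit definition and Python's tuple comparison agree.
    print(t, "<", u, itemset_less(t, u), t < u)
```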


Sequence Lexicographic Order

Definition
– Sequence Lexicographic Order
  If $s' = s \diamond p$, then $s < s'$
  If $s = \alpha \diamond_i p$ and $s' = \alpha \diamond_s p'$, no matter what the order relation between $p$ and $p'$ is, $s < s'$
  If $s = \alpha \diamond_i p$ and $s' = \alpha \diamond_i p'$, $p < p'$ indicates $s < s'$
  If $s = \alpha \diamond_s p$ and $s' = \alpha \diamond_s p'$, $p < p'$ indicates $s < s'$

Example
– (ab) < (ab)(a)
– (ac) < (a)(d), (ad) < (a)(c)
– (ab) < (ac)
– (a)(b) < (a)(c)
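
One way to check these rules is to flatten each sequence into a list of extension steps and compare the lists. The sketch below is an illustration under stated assumptions, not the authors' implementation: a sequence is a list of itemsets, each itemset is a string of single-character items, and the helper names `steps` and `seq_less` are hypothetical. An itemset-extension step is encoded as 0 and a sequence-extension step as 1, so that itemset-extensions sort first.

```python
def steps(seq):
    """Flatten a sequence (list of itemsets) into (kind, item) extension steps.

    kind 1: sequence-extension, the item opens a new itemset.
    kind 0: itemset-extension, the item joins the current itemset.
    """
    out = []
    for itemset in seq:
        for pos, item in enumerate(sorted(itemset)):
            out.append((1 if pos == 0 else 0, item))
    return out


def seq_less(s, s_prime):
    """Return True if s precedes s_prime in sequence lexicographic order."""
    # Python list comparison places a proper prefix first (first rule) and,
    # at the first differing step, compares the kind before the item, so an
    # itemset-extension precedes a sequence-extension (second rule).
    return steps(s) < steps(s_prime)


print(seq_less(["ab"], ["ab", "a"]))   # (ab) < (ab)(a)
print(seq_less(["ac"], ["a", "d"]))    # (ac) < (a)(d)
print(seq_less(["ad"], ["a", "c"]))    # (ad) < (a)(c)
print(seq_less(["ab"], ["ac"]))        # (ab) < (ac)
print(seq_less(["a", "b"], ["a", "c"]))  # (a)(b) < (a)(c)
```

Each call corresponds to one of the examples on the slide and evaluates to `True` under this encoding.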


Lexicographic Sequence Tree

Definition
– Lexicographic Sequence Tree (nodes listed level by level)

<>
<(a)> <(b)>
<(ab)> <(a)(a)> <(a)(b)>
<(ab)(a)> <(ab)(b)> <(a)(bc)> <(a)(bd)>
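
The levels of the tree can be generated by extending each node with one item at a time. The sketch below assumes a sequence is a tuple of itemsets, each itemset a tuple of sorted single-character items. The function name `children` and the explicit `items` alphabet are hypothetical and serve only to illustrate how a node's children are ordered.

```python
def children(seq, items):
    """List the children of a node in the lexicographic sequence tree.

    Itemset-extensions are listed before sequence-extensions, matching the
    sequence lexicographic order.
    """
    items = sorted(items)
    result = []
    if seq:
        last = seq[-1]
        # Itemset-extension: add an item larger than the last item of the
        # final itemset, so the itemset stays sorted.
        for x in items:
            if x > last[-1]:
                result.append(seq[:-1] + (last + (x,),))
    # Sequence-extension: append a new itemset holding a single item.
    for x in items:
        result.append(seq + ((x,),))
    return result


root = ()
print(children(root, "ab"))          # <(a)>, <(b)>
print(children((("a",),), "ab"))     # <(ab)>, <(a)(a)>, <(a)(b)>
print(children((("a", "b"),), "ab")) # <(ab)(a)>, <(ab)(b)>
```

With the alphabet {a, b}, the calls reproduce the left part of the tree on the slide. The nodes <(a)(bc)> and <(a)(bd)> would appear once c and d are added to the alphabet.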


Search Space Pruning

Definition
– Common Prefix
  Given a subsequence $s$ and its projected database $D_s$, if $\exists \alpha$ such that $\alpha$ is a common prefix for all the sequences with the same extension type (either itemset-extension or sequence-extension) in $D_s$, then $\forall \beta$, if $s \diamond \beta$ is closed, $\alpha$ must be a prefix of $\beta$. We need not search $s \diamond \beta$ and its descendants except the branch of $s \diamond \alpha$.

Example
– $D_s$ = {de(af), de(fg)}
– The common prefix is <de>, so $s \diamond$ <e> is not closed and it is unnecessary to extend $s \diamond$ <e>
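
The common prefix in the example can be found by comparing the projected sequences step by step. The sketch below is a simplified illustration, not the CommonPrefix algorithm itself. It assumes every sequence in the projected database starts with a sequence-extension (it ignores suffixes that begin inside an itemset), that a sequence is a list of itemsets written as strings, and that the helper names `steps` and `common_prefix` are hypothetical.

```python
def steps(seq):
    """Flatten a sequence into (kind, item) steps: 1 opens a new itemset, 0 joins it."""
    out = []
    for itemset in seq:
        for pos, item in enumerate(sorted(itemset)):
            out.append((1 if pos == 0 else 0, item))
    return out


def common_prefix(projected_db):
    """Return the longest run of extension steps shared by all projected sequences."""
    step_lists = [steps(s) for s in projected_db]
    prefix = []
    # zip stops at the shortest sequence, so the prefix never overruns any of them.
    for column in zip(*step_lists):
        if any(step != column[0] for step in column):
            break
        prefix.append(column[0])
    return prefix


# D_s = {de(af), de(fg)} from the slide.
d_s = [["d", "e", "af"], ["d", "e", "fg"]]
print(common_prefix(d_s))  # two sequence-extension steps: d, then e
```

For this database the shared steps are `(1, 'd')` and `(1, 'e')`, which is the prefix <de>. An extension of s that skips d, such as <e>, therefore cannot yield a closed sequence.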


Search Space Pruning (Cont.)

CommonPrefix
– An intermediate algorithm that adopts the PrefixSpan framework plus the common prefix pruning technique
– Outperforms PrefixSpan


Search Space Pruning (Cont.)

Definition
– Partial Order
  Given a sequence $s$ and its projected database $D_s$, if among all the sequences in $D_s$ an item $\alpha$ always occurs before an item $\beta$ (either in the same itemset for all sequences in $D_s$ or in a different itemset, but not both), then $D_{s \diamond \beta} = D_{s \diamond \alpha \diamond \beta}$, so $s \diamond \beta$ is not closed. We need not search any sequence in the branch of $s \diamond \beta$.

Example
– Before projecting $D$ into $D_a$, $D_b$, $D_d$, $D_e$, $D_f$
– a is always before f in all the sequences
– Need not search any sequence beginning with f
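
The partial order condition can be tested with a single scan of the database. The sketch below is a simplified reading of the definition, not the authors' code. It looks only at the first occurrence of the later item in each sequence and skips sequences that do not contain it. The database `db` is a made-up toy example rather than the one used in the slides, and the names `relation` and `always_before` are hypothetical.

```python
def relation(seq, a, b):
    """Describe how item a relates to the first occurrence of item b in seq."""
    seen_a = False
    for itemset in seq:
        if b in itemset:
            if a in itemset and a < b:
                return "same"      # a precedes b inside the same itemset
            return "earlier" if seen_a else None
        if a in itemset:
            seen_a = True
    return "absent"                # b does not occur in this sequence


def always_before(db, a, b):
    """True if a precedes b in one consistent way in every sequence containing b."""
    rels = {relation(s, a, b) for s in db} - {"absent"}
    # Exactly one kind of relation ("same" or "earlier", but not both) must hold.
    return len(rels) == 1 and None not in rels


# Hypothetical toy database: each sequence is a list of itemsets.
db = [["a", "bc", "f"], ["ad", "f", "b"], ["e", "a", "fg"]]
print(always_before(db, "a", "f"))  # a precedes f in every sequence
print(always_before(db, "b", "f"))  # b follows f in the second sequence
```

In the toy database the first call returns `True`, so sequences beginning with f need not be searched. The second call returns `False`, so no pruning follows for the pair (b, f).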