Mining Sequential Patterns with Constraints in Large Database

Jian Pei, Jiawei Han, Wei Wang

Proc. of the 2002 IEEE International Conference on Data Mining (ICDM’02)

Adviser: Jia-Ling Koh

Speaker: Yu-ting Kung




Introduction

In past studies, two problems remain:

1. Many practical constraints are not covered

2. There is no systematic method for pushing various constraints into the mining process

In this paper:

Develop a framework, prefix-growth, built on a prefix-monotone property

Constraints can be pushed deep into sequential pattern mining effectively and efficiently under this new framework


Categories of constraints

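1. Item constraints

$C_{item}(\alpha) \equiv (\varphi i : 1 \le i \le len(\alpha),\ \alpha[i]\ \theta\ V)$, or

$C_{item}(\alpha) \equiv (\varphi i : 1 \le i \le len(\alpha),\ \alpha[i] \cap V \ne \emptyset)$,

where $V$ is a subset of items, $\varphi \in \{\forall, \exists\}$, and $\theta \in \{\subseteq, \not\subseteq, \supseteq, \not\supseteq, \in, \notin\}$

For example: $C_{bookstore}(\alpha) \equiv (\forall i : 1 \le i \le len(\alpha),\ \alpha[i] \subseteq B)$

2. Length constraint

The number of transactions or occurrences of items

For example: $C_{len}(\alpha) \equiv (len(\alpha) \ge 50)$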
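As a rough illustration of the item and length constraints on this slide, the following Python sketch checks them on a single candidate pattern. It assumes a pattern is represented as a list of itemsets; the function names and the sample pattern are hypothetical and are not taken from the paper.

```python
from typing import FrozenSet, List

# Assumed representation: a sequence is an ordered list of itemsets (transactions).
Sequence = List[FrozenSet[str]]


def every_itemset_within(alpha: Sequence, v: FrozenSet[str]) -> bool:
    """Item constraint of the form (forall i: alpha[i] is a subset of V)."""
    return all(itemset <= v for itemset in alpha)


def some_itemset_overlaps(alpha: Sequence, v: FrozenSet[str]) -> bool:
    """Item constraint of the form (exists i: alpha[i] intersects V)."""
    return any(itemset & v for itemset in alpha)


def length(alpha: Sequence, count_items: bool = False) -> int:
    """Length as number of transactions, or as number of item occurrences."""
    return sum(len(s) for s in alpha) if count_items else len(alpha)


# Hypothetical pattern <a(bc)e>
alpha = [frozenset({"a"}), frozenset({"b", "c"}), frozenset({"e"})]
print(every_itemset_within(alpha, frozenset("abce")))
print(some_itemset_overlaps(alpha, frozenset("cz")))
print(length(alpha), length(alpha, count_items=True))
```

A length constraint is then just a comparison on `length(alpha)`, in whichever of the two senses the user intends.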


Categories of constraints (Cont.)

3. Super-pattern constraint

$C_{pat}(\alpha) \equiv (\exists \gamma \in P\ \text{s.t.}\ \gamma \sqsubseteq \alpha)$, where $P$ is a given set of patterns

For example: $C_{pat}(\alpha) \equiv \langle (PC)(digital\_camera) \rangle \sqsubseteq \alpha$

4. Aggregate constraint

Aggregate function: sum, avg, max, min, etc.

For example: we want sequential patterns where the average price of all the items in each pattern is over $100
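The super-pattern and aggregate constraints can be sketched in the same style. The code below is only an illustration under assumed names: `is_subsequence` tests whether one pattern is contained in another, and `average_value` computes the average over all item occurrences. The price table is invented for the example.

```python
from typing import Dict, FrozenSet, List

Sequence = List[FrozenSet[str]]


def is_subsequence(gamma: Sequence, alpha: Sequence) -> bool:
    """True if the itemsets of gamma embed, in order, into distinct itemsets of alpha."""
    pos = 0
    for g in gamma:
        # Greedily take the earliest itemset of alpha that contains g.
        while pos < len(alpha) and not g <= alpha[pos]:
            pos += 1
        if pos == len(alpha):
            return False
        pos += 1
    return True


def satisfies_super_pattern(alpha: Sequence, patterns: List[Sequence]) -> bool:
    """Super-pattern constraint: alpha contains at least one pattern from the given set."""
    return any(is_subsequence(gamma, alpha) for gamma in patterns)


def average_value(alpha: Sequence, value: Dict[str, float]) -> float:
    """Average value over all item occurrences in a non-empty pattern."""
    items = [item for itemset in alpha for item in itemset]
    return sum(value[item] for item in items) / len(items)


# Hypothetical data
wanted = [[frozenset({"PC"}), frozenset({"digital_camera"})]]
alpha = [frozenset({"PC", "mouse"}), frozenset({"printer"}), frozenset({"digital_camera"})]
price = {"PC": 900.0, "mouse": 20.0, "printer": 150.0, "digital_camera": 300.0}

print(satisfies_super_pattern(alpha, wanted))
print(average_value(alpha, price) > 100)
```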


Categories of constraints (Cont.)

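5. Regular expression constraints

Constraints specified as a regular expression. For example:

Travel (New York | New York City) (Hotels | Hotels and Motels | Lodging)

6. Duration constraints

A sequence $\alpha$ satisfies a duration constraint $Dur(\alpha)\ \theta\ \Delta t$, with $\theta \in \{\le, \ge\}$, if and only if

$|\{\beta \in SDB \mid \exists\, 1 \le i_1 < \cdots < i_{len(\alpha)} \le len(\beta)\ \text{s.t.}\ \alpha[1] \subseteq \beta[i_1], \ldots, \alpha[len(\alpha)] \subseteq \beta[i_{len(\alpha)}],\ \text{and}\ (\beta[i_{len(\alpha)}].time - \beta[i_1].time)\ \theta\ \Delta t\}| \ge min\_sup$

7. Gap constraints

For example: find purchasing patterns such that “the gap between each pair of consecutive purchases is less than 1 month”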
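A duration constraint changes how support is counted: a data sequence only counts if the pattern occurs in it within the allowed time span. The sketch below assumes the "at most" form of the constraint, timestamped transactions sorted by time, and non-empty patterns; `min_duration` and `duration_support` are hypothetical helper names.

```python
from typing import FrozenSet, List, Optional, Tuple

Pattern = List[FrozenSet[str]]
TimedSequence = List[Tuple[int, FrozenSet[str]]]  # (time, itemset), sorted by time


def min_duration(alpha: Pattern, beta: TimedSequence) -> Optional[int]:
    """Smallest (last time - first time) over all embeddings of alpha in beta, or None."""
    best = None
    for start, (t0, first) in enumerate(beta):
        if not alpha[0] <= first:
            continue
        # For a fixed start, matching the rest as early as possible ends earliest.
        pos, end_time, matched = start + 1, t0, True
        for a in alpha[1:]:
            while pos < len(beta) and not a <= beta[pos][1]:
                pos += 1
            if pos == len(beta):
                matched = False
                break
            end_time = beta[pos][0]
            pos += 1
        if matched and (best is None or end_time - t0 < best):
            best = end_time - t0
    return best


def duration_support(alpha: Pattern, sdb: List[TimedSequence], max_duration: int) -> int:
    """Number of data sequences containing alpha within max_duration time units."""
    count = 0
    for beta in sdb:
        d = min_duration(alpha, beta)
        if d is not None and d <= max_duration:
            count += 1
    return count


a, b = frozenset({"a"}), frozenset({"b"})
sdb = [
    [(1, a), (5, a), (6, b)],  # shortest occurrence of <ab> spans 1 time unit
    [(1, a), (9, b)],          # spans 8 time units
    [(2, b), (3, a)],          # <ab> does not occur
]
print(duration_support([a, b], sdb, max_duration=2))
```

A gap constraint could be checked along similar lines by bounding the difference between consecutive matched timestamps instead of the first and last.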


Characterization of constraints

Anti-monotonic

If a sequence α satisfies C, then every non-empty subsequence of α also satisfies C

For example: dur(α) < 3

Monotonic

If a sequence α satisfies CM, then every super-sequence of α also satisfies CM

For example: len(α) >= 10, super-pattern constraints

Succinct constraint

For example: item constraint


Characterization of constraints (Cont.)


Prefix-Monotone Property

Prefix anti-monotonic: for each sequence α satisfying the constraint, so does every prefix of α

Prefix monotonic: for each sequence α satisfying the constraint, so does every sequence having α as a prefix

A constraint is called prefix-monotone if it is prefix anti-monotonic or prefix monotonic
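The two definitions can be probed with a small brute-force check. The sketch below assumes sequences of single items (tuples of letters) and enumerates all sequences up to a small length; a bounded search like this can refute a property but cannot prove it in general. All names and item values are hypothetical.

```python
from itertools import product
from typing import Callable, Iterator, Tuple

Seq = Tuple[str, ...]


def all_sequences(alphabet: str, max_len: int) -> Iterator[Seq]:
    """Every non-empty sequence over the alphabet with at most max_len items."""
    for n in range(1, max_len + 1):
        yield from product(alphabet, repeat=n)


def is_prefix_anti_monotonic(constraint: Callable[[Seq], bool], alphabet: str, max_len: int) -> bool:
    """Whenever a sequence satisfies the constraint, every non-empty proper prefix must too."""
    return all(
        constraint(seq[:k])
        for seq in all_sequences(alphabet, max_len)
        if constraint(seq)
        for k in range(1, len(seq))
    )


def is_prefix_monotonic(constraint: Callable[[Seq], bool], alphabet: str, max_len: int) -> bool:
    """Whenever a sequence satisfies the constraint, every longer sequence it prefixes must too."""
    return all(
        constraint(ext)
        for seq in all_sequences(alphabet, max_len)
        if constraint(seq)
        for ext in all_sequences(alphabet, max_len)
        if len(ext) > len(seq) and ext[: len(seq)] == seq
    )


value = {"a": 10, "b": 40}  # hypothetical item values


def avg_at_most_25(seq: Seq) -> bool:
    return sum(value[i] for i in seq) / len(seq) <= 25


print(is_prefix_anti_monotonic(lambda s: len(s) <= 2, "ab", 3))  # length upper bound
print(is_prefix_monotonic(lambda s: len(s) >= 2, "ab", 3))       # length lower bound
print(is_prefix_anti_monotonic(avg_at_most_25, "ab", 3), is_prefix_monotonic(avg_at_most_25, "ab", 3))
```

Under this check the two length bounds behave as prefix anti-monotonic and prefix monotonic respectively, while the average constraint fails both tests.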


Theorem

All the commonly used constraints discussed above, except for g_sum and average, have the prefix-monotone property


Push Prefix-Monotone Constraints into Sequential Pattern Mining

Regular expression constraint: $C \equiv \langle a * \{bb \mid (bc)d \mid dd\} \rangle$

Min_sup = 2


Push Prefix-Monotone Constraints into Sequential Pattern Mining (Cont.)

Mining steps:

1. Find length-1 sequential patterns and remove irrelevant sequences

<a>, <b>, <c>, <d>, <e> are identified as length-1 patterns; the infrequent item <f> is removed

The sequence with S_id = 10 is removed because it fails the constraint

2. Divide the set of sequential patterns into subsets without overlap: prefix <a>, prefix <b>, prefix <c>, prefix <d>, prefix <e>

The subsets with prefix <b>, <c>, <d>, <e> are pruned, since only patterns with prefix <a> can satisfy the constraint


Push Prefix-Monotone Constraints into Sequential Pattern Mining (Cont.)

3. Construct the <a>-projected database and mine it

SDB|<a> = {<(_b)(bc)dd>, <(_e)(abc)(dd)>, <ddcb>}

Locally frequent items that satisfy the constraint give prefix <ab>, prefix <ac>, prefix <ad>

4. Recursive mining

To mine patterns with prefix <ab>, <ac>, <ad>, form the corresponding projected databases and mine them recursively

5. Final patterns output: {<a(bc)d>, <add>}
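The mining steps above can be sketched as a constrained prefix-projection search. This is a simplified illustration, not the paper's implementation: it assumes every transaction holds a single item (so sequences are plain strings), it skips the up-front removal of data sequences that fail the constraint, and the toy database, the regular expression `a.*(bb|dd)`, and the names `hopeful` and `satisfies` are all invented for the example.

```python
import re
from collections import Counter
from typing import Callable, List, Tuple


def project(db: List[str], item: str) -> List[str]:
    """Projected database: the suffix after the first occurrence of item in each sequence."""
    return [s[s.index(item) + 1:] for s in db if item in s]


def prefix_growth(
    db: List[str],
    min_sup: int,
    hopeful: Callable[[str], bool],    # prefix anti-monotone test: may this prefix still lead to a valid pattern?
    satisfies: Callable[[str], bool],  # full constraint, checked before a pattern is output
    prefix: str = "",
) -> List[Tuple[str, int]]:
    results = []
    # Support of prefix+item is the number of projected sequences containing item.
    counts = Counter(item for s in db for item in set(s))
    for item in sorted(counts):
        if counts[item] < min_sup:
            continue
        new_prefix = prefix + item
        if not hopeful(new_prefix):
            continue  # prune the whole subset of patterns sharing this prefix
        if satisfies(new_prefix):
            results.append((new_prefix, counts[item]))
        results.extend(prefix_growth(project(db, item), min_sup, hopeful, satisfies, new_prefix))
    return results


# Hypothetical single-item database
db = ["abce", "eabbcdd", "caeabcdd", "addcb"]
patterns = prefix_growth(
    db,
    min_sup=2,
    hopeful=lambda p: p.startswith("a"),
    satisfies=lambda p: re.fullmatch(r"a.*(bb|dd)", p) is not None,
)
print(patterns)
```

Because `hopeful` rejects every length-1 prefix other than `a`, the projected databases for the other prefixes are never built, which mirrors the pruning in step 2.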


Handling Tough aggregate constraint

Constraint: $C \equiv avg(\alpha) \le 25$

Min_sup = 2

Item i is called a small item if its value i.value <= 25; otherwise, it is called a big item
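To see why an average constraint is tough, note that a prefix violating it can become valid again once small items are appended, so a violating prefix cannot simply be discarded. The sketch below assumes the constraint has the form avg ≤ 25 and uses invented item values. The `can_prune` test is one simple sound check chosen for illustration, not necessarily the pruning rule used in the paper.

```python
from typing import Dict, Iterable

BOUND = 25


def average(pattern: str, value: Dict[str, float]) -> float:
    """Average item value of a non-empty single-item pattern."""
    return sum(value[i] for i in pattern) / len(pattern)


def is_small(item: str, value: Dict[str, float], bound: float = BOUND) -> bool:
    """Small items have value <= bound; all other items are big."""
    return value[item] <= bound


def can_prune(prefix: str, remaining_items: Iterable[str], value: Dict[str, float], bound: float = BOUND) -> bool:
    """Sound pruning test for avg(pattern) <= bound.

    If the prefix already exceeds the bound and every item that could still be
    appended has value >= bound, no extension can bring the average down to the bound.
    """
    if average(prefix, value) <= bound:
        return False
    return all(value[i] >= bound for i in remaining_items)


value = {"a": 10, "b": 20, "c": 30, "d": 50}  # hypothetical item values

print(average("d", value) <= BOUND)    # the prefix <d> violates the constraint...
print(average("daa", value) <= BOUND)  # ...but the extension <daa> satisfies it
print(can_prune("d", {"c"}, value))    # only big items left: safe to prune
print(can_prune("d", {"a"}, value))    # a small item is still available: keep growing
```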


Experimental results

Compare the efficiency of mining sequential patterns without constraint


Experimental results (Cont.)

Compare the efficiency of mining sequential patterns with constraints

Capability of GSP and prefix-growth on pushing an anti-monotone constraint (dur() <= t)


Experimental results (Cont.)

Experimental results on mining with regular expression constraint


Experimental results (Cont.)

Scalability of prefix-growth with constraint avg() ≤ v

Number of projected databases in prefix-growth with constraint avg() ≤ v


Experimental results (Cont.)

Scalability of prefix-growth w.r.t. support threshold


Experimental results (Cont.)

Scalability of prefix-growth w.r.t. database size


Conclusion

Prefix-monotone property covers many commonly used constraints

Experimental results and the performance study show that prefix-growth is efficient and scalable in mining large databases