1
Mining Sequential Patterns with Constraints in Large Database
Jian Pei, Jiawei Han, Wei Wang
Proc. of the 2002 IEEE International Conference on Data Mining (ICDM’02)
Adviser: Jia-Ling Koh
Speaker: Yu-ting Kung
2
Introduction
In past studies, two problems remain:
1. Many practical constraints are not covered
2. There is no systematic method to push various constraints into the mining process
In this paper:
A framework, prefix-growth, is developed based on a prefix-monotone property
Constraints can be effectively and efficiently pushed deep into sequential pattern mining under this new framework
3
Categories of constraints
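1. Item constraints
$C_{item}(\alpha) \equiv (\varphi i : 1 \le i \le len(\alpha),\ \alpha[i]\ \theta\ V)$, or
$C_{item}(\alpha) \equiv (\varphi i : 1 \le i \le len(\alpha),\ \alpha[i] \cap V \ne \emptyset)$,
where $V$ is a subset of items, $\varphi \in \{\forall, \exists\}$ and $\theta \in \{\subseteq, \not\subseteq, \supseteq, \not\supseteq, \in, \notin\}$
For example: $C_{bookstore}(\alpha) \equiv (\forall i : 1 \le i \le len(\alpha),\ \alpha[i] \subseteq B)$, where $B$ is the set of items of interest (bookstores)
2. Length constraint
The number of transactions or occurrences of items in a pattern
For example: $C_{len}(\alpha) \equiv (len(\alpha) \ge 50)$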
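The two constraint categories above can be checked with a few lines of code. The following is a minimal sketch, assuming a sequence is represented as a list of itemsets and that length counts transactions; the function names and the sample data are hypothetical and only illustrate one "for all / subset" instance and one "exists / overlap" instance of the item constraint.

```python
from typing import FrozenSet, List

# Assumption: a sequence is an ordered list of itemsets (one per transaction).
Sequence = List[FrozenSet[str]]


def item_constraint_all_subset(alpha: Sequence, v: FrozenSet[str]) -> bool:
    """Item constraint with quantifier 'for all' and relation 'subset of V'."""
    return all(itemset <= v for itemset in alpha)


def item_constraint_exists_overlap(alpha: Sequence, v: FrozenSet[str]) -> bool:
    """Item constraint, second form: some itemset shares an item with V."""
    return any(itemset & v for itemset in alpha)


def length_constraint(alpha: Sequence, min_len: int) -> bool:
    """Length constraint, with length counted as the number of transactions."""
    return len(alpha) >= min_len


# Hypothetical data: B is the set of bookstore items.
alpha = [frozenset({"book_a"}), frozenset({"book_b", "book_c"})]
bookstore = frozenset({"book_a", "book_b", "book_c"})

print(item_constraint_all_subset(alpha, bookstore))
print(item_constraint_exists_overlap(alpha, frozenset({"cd"})))
print(length_constraint(alpha, 2))
```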
4
Categories of constraints (Cont.)
3. Super-pattern constraint
$C_{pat}(\alpha) \equiv (\exists \gamma \in P\ \text{s.t.}\ \gamma \sqsubseteq \alpha)$, where $P$ is a given set of patterns
For example: $C_{pat}(\alpha) \equiv \langle (PC)(digital\_camera) \rangle \sqsubseteq \alpha$
4. Aggregate constraint
Aggregate function: sum, avg, max, min, etc.
For example: find sequential patterns where the average price of all the items in each pattern is over $100
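A super-pattern constraint reduces to a subsequence test, and an aggregate constraint to a computation over the items of a pattern. The sketch below assumes the list-of-itemsets representation; the prices and helper names are hypothetical and are not taken from the paper.

```python
from typing import Dict, FrozenSet, List

Sequence = List[FrozenSet[str]]


def is_subsequence(gamma: Sequence, alpha: Sequence) -> bool:
    """True if each itemset of gamma is contained, in order, in a distinct itemset of alpha."""
    pos = 0
    for g in gamma:
        # Greedily take the earliest itemset of alpha that contains g.
        while pos < len(alpha) and not g <= alpha[pos]:
            pos += 1
        if pos == len(alpha):
            return False
        pos += 1
    return True


def super_pattern_constraint(alpha: Sequence, patterns: List[Sequence]) -> bool:
    """Satisfied if alpha contains at least one pattern of the given set P."""
    return any(is_subsequence(gamma, alpha) for gamma in patterns)


def avg_price_over(alpha: Sequence, price: Dict[str, float], threshold: float) -> bool:
    """Aggregate constraint: the average price of all items in alpha exceeds threshold."""
    items = [item for itemset in alpha for item in itemset]
    return bool(items) and sum(price[i] for i in items) / len(items) > threshold


# Hypothetical pattern set P and price table.
P = [[frozenset({"PC"}), frozenset({"digital_camera"})]]
price = {"PC": 900.0, "printer": 80.0, "digital_camera": 300.0, "bag": 20.0}
alpha = [frozenset({"PC"}), frozenset({"printer"}), frozenset({"digital_camera", "bag"})]

print(super_pattern_constraint(alpha, P))
print(avg_price_over(alpha, price, 100.0))
```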
5
Categories of constraints (Cont.)
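5. Regular expression constraints
Constraints specified as a regular expression
For example: Travel (New York | New York City) (Hotels | Hotels and Motels | Lodging)
6. Duration constraints
A sequence $\alpha$ satisfies the constraint if and only if
$|\{\beta \in SDB \mid \exists\, 1 \le i_1 < \cdots < i_{len(\alpha)} \le len(\beta)\ \text{s.t.}\ \alpha[1] \sqsubseteq \beta[i_1], \ldots, \alpha[len(\alpha)] \sqsubseteq \beta[i_{len(\alpha)}],\ \text{and}\ (\beta[i_{len(\alpha)}].time - \beta[i_1].time)\ \theta\ \Delta t\}| \ge min\_sup$
where $\theta$ is a comparison operator and $\Delta t$ is a given time span
7. Gap constraints
For example: find purchasing patterns such that “the gap between each consecutive purchases is less than 1 month”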
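The duration constraint counts only those database sequences in which the pattern occurs within the allowed time span. The following is a minimal sketch for the "at most" case, assuming each transaction carries a numeric timestamp; `min_duration` and `support_within` are hypothetical helper names, and the data is made up for illustration.

```python
from typing import FrozenSet, List, Optional, Tuple

# Assumption: a timestamped sequence is an ordered list of (time, itemset) pairs.
TimedSequence = List[Tuple[float, FrozenSet[str]]]
Pattern = List[FrozenSet[str]]


def min_duration(alpha: Pattern, beta: TimedSequence) -> Optional[float]:
    """Smallest time span of any occurrence of a non-empty alpha in beta, or None."""
    best: Optional[float] = None
    for start, (t_first, itemset) in enumerate(beta):
        if not alpha[0] <= itemset:
            continue
        # For a fixed first match, greedy matching ends as early as possible.
        pos, t_last, matched = start + 1, t_first, True
        for a in alpha[1:]:
            while pos < len(beta) and not a <= beta[pos][1]:
                pos += 1
            if pos == len(beta):
                matched = False
                break
            t_last = beta[pos][0]
            pos += 1
        if matched and (best is None or t_last - t_first < best):
            best = t_last - t_first
    return best


def support_within(alpha: Pattern, sdb: List[TimedSequence], max_dur: float) -> int:
    """Number of sequences containing alpha with duration at most max_dur."""
    count = 0
    for beta in sdb:
        d = min_duration(alpha, beta)
        if d is not None and d <= max_dur:
            count += 1
    return count


def fs(*items: str) -> FrozenSet[str]:
    return frozenset(items)


alpha = [fs("a"), fs("b")]
sdb = [
    [(0, fs("a")), (5, fs("b")), (6, fs("a")), (7, fs("b"))],
    [(0, fs("a")), (10, fs("b"))],
]
print(support_within(alpha, sdb, 3))
```

The pattern is then frequent under the duration constraint when this count reaches min_sup.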
6
Characterization of constraints
Anti-monotonic
If a sequence $\alpha$ satisfies $C$, then every non-empty subsequence of $\alpha$ also satisfies $C$
For example: $dur(\alpha) < 3$
Monotonic
If a sequence $\alpha$ satisfies $C_M$, then every super-sequence of $\alpha$ also satisfies $C_M$
For example: $len(\alpha) \ge 10$, super-pattern constraints
Succinct constraint
For example: item constraint
7
Characterization of constraints (Cont.)
8
Prefix-Monotone Property
Prefix anti-monotonic: for each sequence $\alpha$ satisfying the constraint, so does every prefix of $\alpha$
Prefix monotonic: for each sequence $\alpha$ satisfying the constraint, so does every sequence having $\alpha$ as a prefix
A constraint is called prefix-monotone if it is prefix anti-monotonic or prefix monotonic.
9
Theorem
All the commonly used constraints discussed above, except for g_sum and average, have the prefix-monotone property
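A brute-force search over a small universe of sequences gives some intuition for why the average constraint is an exception. The sketch below assumes one item per transaction and hypothetical item values; a counterexample refutes a property, while the absence of one on a finite universe is only supporting evidence, not a proof.

```python
from itertools import product

VALUE = {"a": 10, "b": 40}  # hypothetical item values


def all_sequences(items, max_len):
    """All sequences (as tuples) over items with length 1..max_len."""
    return [seq for n in range(1, max_len + 1) for seq in product(items, repeat=n)]


def prefix_anti_monotone_counterexample(constraint, universe):
    """Return (seq, prefix) where seq satisfies the constraint but a prefix does not."""
    for seq in universe:
        if constraint(seq):
            for k in range(1, len(seq)):
                if not constraint(seq[:k]):
                    return seq, seq[:k]
    return None


def prefix_monotone_counterexample(constraint, universe):
    """Return (seq, longer) where seq satisfies the constraint but an extension does not."""
    for seq in universe:
        if constraint(seq):
            for longer in universe:
                if len(longer) > len(seq) and longer[:len(seq)] == seq and not constraint(longer):
                    return seq, longer
    return None


def len_at_most_2(seq):
    return len(seq) <= 2


def len_at_least_2(seq):
    return len(seq) >= 2


def avg_at_most_25(seq):
    return sum(VALUE[i] for i in seq) / len(seq) <= 25


universe = all_sequences("ab", 3)
print(prefix_anti_monotone_counterexample(len_at_most_2, universe))  # no counterexample found
print(prefix_monotone_counterexample(len_at_least_2, universe))      # no counterexample found
print(prefix_anti_monotone_counterexample(avg_at_most_25, universe))
print(prefix_monotone_counterexample(avg_at_most_25, universe))
```

On this universe the two length constraints each keep one of the two properties, while the average constraint loses both.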
10
Push Prefix-Monotone Constraints into Sequential Pattern Mining
Regular expression constraint: C = <a*{bb | (bc)d | dd}>
Min_sup = 2
11
Push Prefix-Monotone Constraints into Sequential Pattern Mining (Cont.)
Mining steps:
1. Find length-1 sequential patterns and remove irrelevant sequences
<a>, <b>, <c>, <d>, <e> are identified as length-1 patterns; infrequent item <f> is removed
The sequence with S_id = 10 is removed because it fails the constraint
2. Divide the set of sequential patterns into non-overlapping subsets by prefix: prefix <a>, prefix <b>, prefix <c>, prefix <d>, prefix <e>
Subsets whose prefix cannot lead to a pattern satisfying the constraint are pruned
12
Push Prefix-Monotone Constraints into Sequential Pattern Mining (Cont.)
3. Construct the <a>-projected database and mine it
SDB|<a> = {<(_b)(bc)dd>, <(_e)(abc)(dd)>, <ddcb>}
Locally frequent items that also satisfy the constraint give prefix <ab>, prefix <ac>, prefix <ad>
4. Recursive mining
To mine patterns with prefix <ab>, <ac>, <ad>, form the corresponding projected databases and mine them recursively
5. Final patterns output: {<a(bc)d>, <add>}
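The mining steps above can be sketched in code under a strong simplification: every transaction holds exactly one item, so itemset extensions such as (bc) are not handled. This is not the paper's full algorithm; `is_legal_prefix`, `satisfies`, and the toy database are hypothetical and stand in for a prefix anti-monotonic pruning test and the final constraint check.

```python
from collections import Counter


def prefix_growth(sdb, min_sup, is_legal_prefix, satisfies):
    """Simplified prefix-growth over sequences with one item per transaction."""
    results = []

    def mine(prefix, projected):
        # Count each item at most once per projected sequence.
        counts = Counter()
        for suffix in projected:
            counts.update(set(suffix))
        for item, sup in sorted(counts.items()):
            if sup < min_sup:
                continue  # locally infrequent item
            new_prefix = prefix + (item,)
            if not is_legal_prefix(new_prefix):
                continue  # no pattern with this prefix can satisfy the constraint
            if satisfies(new_prefix):
                results.append((new_prefix, sup))
            # Project on the first occurrence of the item and recurse.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(new_prefix, new_projected)

    mine((), [tuple(s) for s in sdb])
    return results


# Hypothetical constraint: a pattern must start with 'a' and end with 'd'.
def starts_with_a(p):
    return p[0] == "a"


def starts_with_a_ends_with_d(p):
    return p[0] == "a" and p[-1] == "d"


sdb = ["abcd", "acbd", "bad", "aed"]
print(prefix_growth(sdb, 3, starts_with_a, starts_with_a_ends_with_d))
```

Because prefixes starting with b or d are rejected before projection, their projected databases are never built, which is where the savings come from.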
13
Handling tough aggregate constraint
Constraint: $C \equiv avg(\alpha) \le 25$, Min_sup = 2
An item i is called a small item if its value i.value <= 25; otherwise, it is called a big item
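The small/big split can be illustrated with a short sketch, assuming the constraint requires the average value to be at most 25 and using hypothetical item values. It relies only on a simple observation: a sequence made entirely of big items cannot support any pattern with an average of at most 25, so it can be dropped up front. This is an illustration of the definition, not the paper's complete treatment of the constraint.

```python
def split_items(value, threshold=25):
    """Partition items into small (value <= threshold) and big (value > threshold)."""
    small = {item for item, v in value.items() if v <= threshold}
    big = set(value) - small
    return small, big


def drop_all_big_sequences(sdb, small):
    """Keep only sequences that contain at least one small item."""
    return [seq for seq in sdb if any(item in small for item in seq)]


# Hypothetical item values; sequences use one item per transaction.
value = {"a": 10, "b": 20, "c": 30, "d": 50}
small, big = split_items(value)
sdb = ["cd", "acd", "bd"]

print(sorted(small), sorted(big))
print(drop_all_big_sequences(sdb, small))
```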
14
Experimental results
Compare the efficiency of mining sequential patterns without constraint
15
Experimental results (Cont.)
Compare the efficiency of mining sequential patterns with constraints
Capability of GSP and prefix-growth on pushing anti-monotone constraint ($dur(\alpha) \le t$)
16
Experimental results (Cont.)
Experimental results on mining with regular expression constraint
17
Experimental results (Cont.)
Scalability of prefix-growth with constraint $avg(\alpha) \le v$
Number of projected databases in prefix-growth with constraint $avg(\alpha) \le v$
18
Experimental results (Cont.)
Scalability of prefix-growth w.r.t. support threshold
19
Experimental results (Cont.)
Scalability of prefix-growth w.r.t. database size
20
Conclusion
Prefix-monotone property covers many commonly used constraints
Experimental results and the performance study show that prefix-growth is efficient and scalable in mining large databases