ACCTG 6910 Building Enterprise &
Business Intelligence Systems (e.bis)
Sequential Pattern Mining
Olivia R. Liu Sheng, Ph.D.
Emma Eccles Jones Presidential Chair of Business
Sequential Patterns
Given: A Transaction Database { cid, tid, date, item }
Find: inter-transaction patterns among customers
Example: customers typically rent “Star Wars”, then “Empire Strikes Back”, and then “Return of the Jedi”.
Sequential Patterns
cid tid date item
1 1 01/01/2000 30
1 2 01/02/2000 90
2 3 01/01/2000 40,70
2 4 01/02/2000 30
2 5 01/03/2000 40,60,70
3 6 01/01/2000 30,50,70
4 7 01/01/2000 30
4 8 01/02/2000 40,70
4 9 01/03/2000 90
5 10 01/01/2000 90
Sequential Patterns
Itemset: a non-empty set of items,
e.g., {30}, {40,70}.
Sequence: an ordered list of itemsets,
e.g., <{30} {40,70}>, <{40,70} {30}>.
The size of a sequence is the number of itemsets in that sequence.
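One possible way to hold these definitions in code is sketched below. The choice of `frozenset` for an itemset and `tuple` for a sequence, and the helper name `size`, are assumptions made for illustration and do not come from the slides.

```python
# Assumed representation: an itemset is a frozenset of item ids,
# a sequence is a tuple of itemsets (so order matters).
s1 = (frozenset({30}), frozenset({40, 70}))   # <{30} {40,70}>
s2 = (frozenset({40, 70}), frozenset({30}))   # <{40,70} {30}>


def size(sequence):
    """Size of a sequence = number of itemsets it contains."""
    return len(sequence)


print(size(s1))   # 2
print(s1 == s2)   # False: same itemsets, different order
```

Tuples keep the order of the itemsets, while frozensets ignore the order of items inside an itemset, which matches the two definitions above.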
Sequential Patterns
cid tid date item
1 1 01/01/2000 30
1 2 01/02/2000 90
2 3 01/01/2000 40,70
2 4 01/02/2000 30
2 5 01/03/2000 40,60,70
3 6 01/01/2000 30,50,70
4 7 01/01/2000 30
4 8 01/02/2000 40,70
4 9 01/03/2000 90
5 10 01/01/2000 90
Each transaction of a customer can be viewed as an itemset.
A customer's sequence contains that customer's itemsets, ordered by transaction date.
Sequential Patterns
cid customer sequence
1 <{30} {90} >
2 <{40,70} {30} {40,60,70}>
3 <{30,50,70}>
4 <{30} {40,70} {90}>
5 <{90}>
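The conversion from the transaction table to the customer sequences above can be sketched as follows. The function name `customer_sequences` and the assumption that dates are written as MM/DD/YYYY are illustrative and not stated on the slides.

```python
from collections import defaultdict
from datetime import datetime

# (cid, tid, date, items) rows copied from the example transaction table.
transactions = [
    (1, 1, "01/01/2000", {30}),
    (1, 2, "01/02/2000", {90}),
    (2, 3, "01/01/2000", {40, 70}),
    (2, 4, "01/02/2000", {30}),
    (2, 5, "01/03/2000", {40, 60, 70}),
    (3, 6, "01/01/2000", {30, 50, 70}),
    (4, 7, "01/01/2000", {30}),
    (4, 8, "01/02/2000", {40, 70}),
    (4, 9, "01/03/2000", {90}),
    (5, 10, "01/01/2000", {90}),
]


def customer_sequences(rows):
    """Group rows by customer and order each customer's itemsets by date."""
    by_customer = defaultdict(list)
    for cid, _tid, date, items in rows:
        when = datetime.strptime(date, "%m/%d/%Y")  # assumed date format
        by_customer[cid].append((when, frozenset(items)))
    return {
        cid: tuple(items for _when, items in sorted(entries, key=lambda e: e[0]))
        for cid, entries in by_customer.items()
    }


seqs = customer_sequences(transactions)
for cid, seq in sorted(seqs.items()):
    print(cid, [sorted(itemset) for itemset in seq])
```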
Sequential Patterns
Sequence <a1 a2 … an> is contained in sequence <b1 b2 … bm> if there exist indexes i1 < i2 < … < in such that
a1 ⊆ b_i1, a2 ⊆ b_i2, …, and an ⊆ b_in.
E.g., <{3} {4,5} {8}> is contained in <{3,8} {4,5,6} {8}>
Is <{3} {4,5} {8}> contained in <{7} {3,8} {9} {4,5,6} {8}>?
Is <{3} {4,5} {8}> contained in <{7} {9} {4,5,6} {3,8} {8}>?
Is <{3} {4,5} {8}> contained in <{7} {9} {3,8} {4,5,6}>?
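The containment test can be checked mechanically. The sketch below is one possible implementation (the function name `is_contained` is an assumption): it scans b from left to right and matches each itemset of a against the earliest later itemset of b that is a superset of it.

```python
def is_contained(a, b):
    """Return True if sequence a is contained in sequence b.

    Both arguments are lists of sets. Each itemset of a must be a subset
    of some itemset of b, and the matched positions must strictly increase.
    """
    pos = 0
    for itemset in a:
        # Skip itemsets of b that cannot host the current itemset of a.
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1  # the next itemset of a must match strictly later in b
    return True


a = [{3}, {4, 5}, {8}]
print(is_contained(a, [{3, 8}, {4, 5, 6}, {8}]))            # True
print(is_contained(a, [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
print(is_contained(a, [{7}, {9}, {4, 5, 6}, {3, 8}, {8}]))  # False
print(is_contained(a, [{7}, {9}, {3, 8}, {4, 5, 6}]))       # False
```

The third case fails because {4,5} only appears before {3}, and the fourth because no itemset containing 8 follows {4,5,6}.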
Sequential Patterns
cid customer sequence
1 <{30} {90} >
2 <{40,70} {30} {40,60,70}>
3 <{30,50,70}>
4 <{30} {40,70} {90}>
5 <{90}>
A customer supports sequence s if s is contained in the
sequence for this customer.
E.g., customers 1 and 4 support sequence <{30} {90}>
Sequential Patterns
cid customer sequence
1 <{30} {90} >
2 <{40,70} {30} {40,60,70}>
3 <{30,50,70}>
4 <{30} {40,70} {90}>
5 <{90}>
The support for a sequence s is defined as the fraction of
total customers who support s .
E.g., customers 1 and 4 support sequence <{30} {90}>
Supp(<{30} {90}>) = 2/5 = 40%
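The support calculation above can be reproduced with a short sketch. The names `supporters` and `support` are illustrative assumptions, and the containment helper is one possible implementation of the definition given earlier.

```python
# Customer sequences from the example table (lists of sets).
customers = {
    1: [{30}, {90}],
    2: [{40, 70}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def is_contained(a, b):
    """True if every itemset of a fits, in order, inside itemsets of b."""
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def supporters(seq, db):
    """Customer ids whose sequence contains seq."""
    return [cid for cid, s in db.items() if is_contained(seq, s)]


def support(seq, db):
    """Fraction of all customers who support seq."""
    return len(supporters(seq, db)) / len(db)


print(supporters([{30}, {90}], customers))  # [1, 4]
print(support([{30}, {90}], customers))     # 0.4
```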
Sequential Patterns
cid customer sequence
1 <{30} {90} >
2 <{40,70} {30} {40,60,70}>
3 <{30,50,70}>
4 <{30} {40,70} {90}>
5 <{90}>
Supp(<{40,70}>) = 2/5 = 40% (sequence support: fraction of the 5 customers)
Supp({40,70}) = 3/10 = 30% (itemset support: fraction of the 10 transactions)
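The two figures differ only in what is counted. A minimal sketch, assuming the itemset support is computed over individual transactions as in association rule mining:

```python
# Customer sequences from the example table (lists of sets).
customers = {
    1: [{30}, {90}],
    2: [{40, 70}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
# Flatten to the ten individual transactions.
transactions = [t for seq in customers.values() for t in seq]

itemset = {40, 70}

# Sequence support of <{40,70}>: customers with at least one transaction
# containing the itemset, divided by the number of customers.
seq_support = sum(
    any(itemset <= t for t in seq) for seq in customers.values()
) / len(customers)

# Itemset support of {40,70}: transactions containing the itemset,
# divided by the number of transactions.
item_support = sum(itemset <= t for t in transactions) / len(transactions)

print(seq_support, item_support)  # 0.4 0.3
```

Customer 2 contributes two matching transactions but counts only once toward the sequence support.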
Sequential Patterns Mining
Given: A Transaction Database { cid, tid, date, item }
Find: All sequences whose support is at least the user-specified minimum support (these are called large sequences)
Apriori property: if a sequence is large, then all sequences contained in that sequence must also be large.
Sequential Patterns Mining
1. Identify all large 1-sequences.
2. Repeat until there are no more candidate k-sequences:
2-1. Identify all candidate k-sequences using the large (k-1)-sequences.
Join: two large (k-1)-sequences, L1 and L2, are joinable if they satisfy the following conditions:
L1(1)=L2(1) and L1(2)=L2(2) and … and L1(k-2)=L2(k-2) and L1(k-1)≠L2(k-1)
2-2. Prune: remove candidate k-sequences generated in step 2-1 that have sub-sequences that are not large.
2-3. Determine the large k-sequences from the candidate k-sequences.
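The level-wise procedure on this slide can be sketched end to end on the running example. Everything named here (`mine`, `join`, `prune`, `large_1_sequences`) is a hypothetical helper written for illustration. Two simplifying assumptions: large 1-sequences are found by brute-force enumeration of every subset of every transaction, and the loop stops when no large k-sequence is found, which gives the same result as stopping on an empty candidate set.

```python
from itertools import combinations, permutations

# Customer sequences from the running example.
RAW = {
    1: [{30}, {90}],
    2: [{40, 70}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
DB = {cid: tuple(frozenset(t) for t in seq) for cid, seq in RAW.items()}


def is_contained(a, b):
    pos = 0
    for itemset in a:
        while pos < len(b) and not itemset <= b[pos]:
            pos += 1
        if pos == len(b):
            return False
        pos += 1
    return True


def support(seq, db):
    return sum(is_contained(seq, s) for s in db.values()) / len(db)


def large_1_sequences(db, min_sup):
    # Step 1 (brute force): try every non-empty subset of every transaction.
    itemsets = {
        frozenset(c)
        for seq in db.values()
        for t in seq
        for n in range(1, len(t) + 1)
        for c in combinations(sorted(t), n)
    }
    return [(i,) for i in itemsets if support((i,), db) >= min_sup]


def join(large_prev):
    # Step 2-1: same first k-2 itemsets, different last itemset.
    return [
        l1 + (l2[-1],)
        for l1, l2 in permutations(large_prev, 2)
        if l1[:-1] == l2[:-1] and l1[-1] != l2[-1]
    ]


def prune(candidates, large_prev):
    # Step 2-2: every (k-1)-sub-sequence (one itemset dropped) must be large.
    large = set(large_prev)
    return [
        c for c in candidates
        if all(c[:i] + c[i + 1:] in large for i in range(len(c)))
    ]


def mine(db, min_sup):
    large = large_1_sequences(db, min_sup)
    levels = []
    while large:
        levels.append(large)
        candidates = prune(join(large), large)
        # Step 2-3: keep candidates that meet the minimum support.
        large = [c for c in candidates if support(c, db) >= min_sup]
    return levels


levels = mine(DB, 0.4)
print([len(level) for level in levels])  # [5, 4]
```

With a minimum support of 40% the sketch finds five large 1-sequences and four large 2-sequences, and no candidate 3-sequence survives pruning.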
Sequential Patterns Mining
cid customer sequence
1 <{30} {90} >
2 <{40,70} {30} {40,60,70}>
3 <{30,50,70}>
4 <{30} {40,70} {90}>
5 <{90}>
Minimum Support: 40%
Sequential Patterns Mining
Large 1-Sequence:
<{30}> support=4/5=80%
<{40}> support=2/5=40%
<{70}> support=3/5=60%
<{90}> support=3/5=60%
<{40,70}> support=2/5=40%
cid customer sequence
1 <{30} {90} >
2 <{40,70} {30} {40,60,70}>
3 <{30,50,70}>
4 <{30} {40,70} {90}>
5 <{90}>
Minimum Support: 40%
Sequential Patterns Mining
Large 1-Sequence:
<{30}> support=4/5=80%
<{40}> support=2/5=40%
<{70}> support=3/5=60%
<{90}> support=3/5=60%
<{40,70}> support=2/5=40%
Candidate 2-Sequence:
<{30} {40}> <{30} {70}> <{30} {90}> <{30} {40,70}>
<{40} {30}> <{40} {70}> <{40} {90}> <{40} {40,70}>
<{70} {30}> <{70} {40}> <{70} {90}> <{70} {40,70}>
<{90} {30}> <{90} {40}> <{90} {70}> <{90} {40,70}>
<{40,70} {30}> <{40,70} {40}> <{40,70} {70}> <{40,70} {90}>
Sequential Patterns Mining
Large 2-Sequence:
<{30} {40}> support=2/5=40%
<{30} {70}> support=2/5=40%
<{30} {90}> support=2/5=40%
<{30} {40,70}> support=2/5=40%
Candidate 2-Sequence:
<{30} {40}> <{30} {70}> <{30} {90}> <{30} {40,70}>
<{40} {30}> <{40} {70}> <{40} {90}> <{40} {40,70}>
<{70} {30}> <{70} {40}> <{70} {90}> <{70} {40,70}>
<{90} {30}> <{90} {40}> <{90} {70}> <{90} {40,70}>
<{40,70} {30}> <{40,70} {40}> <{40,70} {70}> <{40,70} {90}>
Sequential Patterns Mining
Candidate 3-Sequence:
<{30} {40} {70}> <{30} {40} {40,70}>
<{30} {70} {40}> <{30} {70} {40,70}>
<{30} {40,70} {40}> <{30} {40,70} {70}>
<{30} {40} {90}> <{30} {90} {40}>
<{30} {70} {90}> <{30} {90} {70}>
<{30} {90} {40,70}> <{30} {40,70} {90}>
Large 2-Sequence:
<{30} {40}> support=2/5=40%
<{30} {70}> support=2/5=40%
<{30} {90}> support=2/5=40%
<{30} {40,70}> support=2/5=40%
Prune:
All sub-sequences of a candidate k-sequence must be large.
Candidate 3-Sequence after pruning:
None remain, e.g., <{30} {40} {70}> is pruned because <{40} {70}> is not large. Stop.
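To see why pruning leaves nothing on this slide, the sketch below rebuilds the twelve joined candidates from the four large 2-sequences and reports, for each, the sub-sequences that are not large. The helper name `missing_subsequences` is an assumption, and a sub-sequence is taken here to mean a candidate with exactly one itemset dropped.

```python
from itertools import permutations

f = frozenset
large_2 = [
    (f({30}), f({40})),
    (f({30}), f({70})),
    (f({30}), f({90})),
    (f({30}), f({40, 70})),
]

# Join: same first itemset, different last itemset.
candidates_3 = [
    l1 + (l2[-1],)
    for l1, l2 in permutations(large_2, 2)
    if l1[:-1] == l2[:-1] and l1[-1] != l2[-1]
]


def missing_subsequences(candidate, large_prev):
    """Sub-sequences of candidate (one itemset dropped) that are not large."""
    large = set(large_prev)
    subs = [candidate[:i] + candidate[i + 1:] for i in range(len(candidate))]
    return [s for s in subs if s not in large]


def show(seq):
    return "<" + " ".join("{" + ",".join(map(str, sorted(s))) + "}" for s in seq) + ">"


survivors = []
for c in candidates_3:
    missing = missing_subsequences(c, large_2)
    if missing:
        print(show(c), "pruned: not large", [show(m) for m in missing])
    else:
        survivors.append(c)

print(len(candidates_3), len(survivors))  # 12 0
```

Every candidate has the form <{30} x y>, and dropping {30} leaves <x y>, which never starts with {30} and so is not among the large 2-sequences.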
Summary
• What is a sequential pattern?
• What is support for a sequential pattern?
• How to mine sequential patterns?
• What are the similarities and dissimilarities between association rules and sequential patterns mining?