mining frequent patterns from very high dimensional data: a top-down row enumeration approach

26
Yi-Husi Lee 1 Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach Hongyan Liu Jiawei Han Dong Xin Zheng Shao2 From SDM06

Upload: shaw

Post on 14-Jan-2016

23 views

Category:

Documents


0 download

DESCRIPTION

Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach. Hongyan Liu Jiawei Han Dong Xin Zheng Shao2 From SDM06. Outline. Introduction Preliminaries Algorithm Experimental Conclusion. Introduction. What is very high data ? - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 1

Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Hongyan LiuJiawei Han

Dong XinZheng Shao2

From SDM06

Page 2: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 2

Outline

Introduction

Preliminaries

Algorithm

Experimental

Conclusion

Page 3: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 3

Introduction

What is very high data ?

Different from transactional data set, very high dimensional data like microarray data usually does not have somany rows (samples) but have a large number of columns (genes).

Page 4: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 4

Preliminaries

Table and Transposed TableLet minsup = 2,

Table T Transposed Table TT

Page 5: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 5

Preliminaries(cont.)

ClosureGiven an itemset I ⊆ I, a rowset S ⊆ S and R ⊆ S × I is a relation, we define

r(I) = { ri ∈ S | ∀ij ∈ I , (ri, ij) ∈ R }i(S) = { ij ∈ I | ∀ri ∈ S , (ri, ij) ∈ R }

thenC(I) = i(r(I))

C(S) = r(i(S))

Closed itemset and closed rowsetAn itemset I is called a closed itemset iff I = C(I).Likewise, a rowset S is called a closed rowset iff S=C(S).

Page 6: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 6

Preliminaries(cont.)

ExampleEX1:{b1, c2}

r({b1, c2}) = {2, 4}i({2, 4}) = {b1, c2, d2}

Then C({b1, c2}) = {b1, c2, d2} ≠ {b1, c2}Therefore, {b1, c2} is not a closed itemset. EX2:{1,2}

i({1, 2}) = {a1, b1} r({a1, b1}) = {1, 2, 3}

Then C({1, 2}) = {1, 2, 3} ≠ {1,2} So rowset {1, 2} is not a Closed rowset, but apparently {1, 2, 3} is.

Page 7: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 7

Preliminaries(cont.)

Bottom-up Search StrategyBy bottom-up we mean that along every search the row enumeration space from small rowsets to large ones.

It is monotonic in terms of bottom-up search order, it is hard to prune the row enumeration search space early.

Memory cost for this kind of bottom-up search is also big.

As the minsup increase , the time need to complete the mining process can not decrease rapidly.

Page 8: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 8

Preliminaries(cont.)

Top-down Search StrategyA top-down search method and a novel row enumeration tree are propose to take advantage of the pruning power of minimum support threshold.

A new method, called closeness-checking, is developed to check efficiently and effectively whether a pattern is closed. Unlike other existing closeness-checking methods, it does not need to scan the mining data set, nor the result set, and is easy to integrate with the top-down search process.

Page 9: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 9

Preliminaries(cont.)

X-excluded transposed tableGiven a rowset x = {ri1, ri2, …, rik} with an order such that ri1 > ri2

… > rik, a minimum support threshold minsup and its parent tableTT|p, an x-excluded transposed table TT|x is a table in which each each tuple contains rids less than any of rids in x, and at the same time tuple contains rids less than any of rids in x, and at the same time contains all of the rids greater than any of rids in x.contains all of the rids greater than any of rids in x. Rowset x iscalled an excluded rowset.

Page 10: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 10

Preliminaries(cont.)

Procedure to get X-excluded transposed table

Example: find table TT|5, TT|54 & TT|4, let the minsup=2

The original transposed table TT|Φ

Page 11: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 11

Preliminaries(cont.)

EX1: TT|5

TT|5 is easy to find, for each tuple in table TT|Φ containing rid 5, delete 5 and copy it to TT|5, then chick if the size of every tuple is great than minsup. If not delete it from TT|5

TT|5

Page 12: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 12

Preliminaries(cont.)

EX2: TT|54

First: Each tuple in table TT|5 is a candidate tuple of TT|54 as there is no rid greater than 5

Second: After excluding rids not less than 4, the table is show below

Last: Since tuple c2 only contains one rid, it does not satisfy minsup and delete it from TT|54

after prune

TT|54 without prune TT|54

Page 13: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 13

Preliminaries(cont.)

EX3:TT|4

TT|4 is not the same of TT|54 and TT|5, because it is not contains the greatest rid. Follow the definition (and at the same time contains all of the rids greater than any of rids in x), so we can find the tuple which has both 4 and 5.But tuple a2 does not satisfy minsup after excluding rid 4, so only tuple left in TT|4. Although the current size of tuple c2 in TT|4 is 1, but it actual size is 2 since it contains rid 5 which is not listed in TT|4.

after prune

Page 14: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 14

Algorithm

Closeness-Checking MethodTo avoid generating all frequent itemsets during the mining process, it is important to perform the closeness checking as early as possible during mining.

Lemma1.Given an itemset I ⊆ I and a rowset S ⊆ S, the following two equations hold:

r(I) = r(i(r(I)))i(S) = i(r(i(S)))

Page 15: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 15

Algorithm(cont.)

Lemma2.In transposed table TT, a rowset S ⊆ S is closed iff it can be represented by an intersection of a set of tuples, that is:

∃I ⊆ I, s.t. S = r(I) = ∩j r({ij})where ij ∈ I, and I = {i1, i2, …, il}

Lemma3.Given a rowset S ⊆ S, in transposed table TT, for every tuple ij containing S, which means ij ∈ i(S), if S ≠ ∩jr({ij}), where ij ∈ i(S), then S is not closed.

Page 16: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 16

Algorithm(cont.)

Skip-rowset and merge of x-excluded transposed tableIn order to speed up this kind of checking, we add some additional information for each x-excluded transposed table.

TT|54 TT|54 with skip rowset TT|54 after merge

Page 17: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 17

Algorithm(cont.)

TD-Closedn: the number of rowset

FCP: frequent closed patterns

excludedsize: is the numberof rids excluded from current table

cMinsup: A dynamically changingminimum support threshold

Page 18: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 18

Algorithm(cont.)

Pruning strategy 1: If excludedSize is equal to or greater than (n–minsup), then there is no need to do any further recursive call of TopDownMine. ExcludedSize is the number of rids excluded from current table. If it is not less than (n–minsup), the size of each rowset in current transposed table must be less than minsup, so these rowsets are impossible to become large.

Page 19: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 19

Algorithm(cont.)

Pruning strategy 2: If an x-excluded transposed table contains only one tuple, it is not necessary to do further recursive call to deal with its child transposed tables. Of course, before return according to pruning strategy 2, the current rowset S might be a closed rowset, so if the skip-rowset is empty, we need to output it first.

EX1.(pruning strategy 2)For table TT|4, there is only one tuple in this table. After we check this tuple to see if it is closed (it is not closed apparently as its skip-rowset is not empty), we do not need to produce any child excluded transposed table from it.

Page 20: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 20

Algorithm(cont.)

Pruning strategy 3: Each tuple t containing rid y in TT|x will be deleted from TT|x∪y if size of t (that is the number of rids t contains) equals cMinsup.

We can get these two tables, TT|x∪y and TT|x’, bysteps as follows. First, for each tuple t in TT|x containing rid y, delete y from it, and copy it to TT|x’.And then check if the size of t is less than cMinsup. Ifnot, keep it and in the meantime put y in the skiprowset,otherwise get rid of it from TT|x. Finally TT|x becomes TT|x∪y, and at the same time TT|x’ is obtained.

Page 21: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 21

Algorithm(cont.)

Ex2.(pruning strategy 2)Let cMinsup=2

TT|x’

TT|x

TT|x∪y

Page 22: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 22

Algorithm(cont.)

Pruning strategy 4: The step is the major output step. If there is a tuple with empty skip-rowset and the largest size, say k, and containing rid k in TT|x∪y, then output it. Since this tuple has the largest size among all of the tuples in TT|x∪y, there is no other tuple that will contribute to it.Therefore, it can be output.

EX3.(pruning strategy 3) In table TT|54, itemset a1b1 corresponding to rowset {1,2,3} can be output as its size is 3 which is larger than the size of the other two tuples, and it also contains the largest rid 3.

Pruning strategy 5: It conducts two recursive calls to deal with TT|x∪y and TT|x’ respectively.

Page 23: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 23

Experimental

Synthetic Data Set

Page 24: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 24

Experimental(cont.)

Page 25: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 25

Experimental(cont.)

Real Data Sets

Page 26: Mining Frequent Patterns from Very High Dimensional Data: A Top-Down Row Enumeration Approach

Yi-Husi Lee 26

Conclusion A new algorithm, TD-Close, is designed and implemented for min

ing a complete set of frequent closed itemsets from high dimensional data.

Several pruning strategies are developed to speed up searching. Both analysis and experimental study show that these methods are effective and useful.

Future work includes integrating top-down row enumeration method and column row enumeration method for frequent pattern mining from both long and deep large datasets, and mining classification rules based on association rules using top-down searching strategy.