
Page 1: Ambiguous Frequent Itemset Mining and Polynomial Delay Enumeration

May 25, 2008, PAKDD 2008

Takeaki Uno (1), Hiroki Arimura (2)

(1) National Institute of Informatics, JAPAN (The Graduate University for Advanced Science)

(2) Hokkaido University, JAPAN

Page 2: Frequent Pattern Mining

• Problem of finding all frequently appearing patterns in a given database

database: transaction database (itemsets), trees, graphs, vectors

patterns: itemsets, trees, paths/cycles, graphs, geometric graphs, …

[Figure: two example databases, genome sequences and experiment records, with the frequently appearing patterns extracted from each, e.g. "experiment 1 ●, experiment 3 ▲", "experiment 2 ●, experiment 4 ●" from the experiments, and "ATGCAT", "CCCGGGTAA", "GGCGTTA", "ATAAGGG" from the genome]

Page 3: Research on Pattern Mining

• There are many studies and applications on itemsets, sequences, trees, graphs, and geometric graphs

• Thanks to efficient algorithms, we would say that any simple structure can be enumerated in practically short time

• One of the next problems is “how to handle noise, error, and ambiguity”

the usual “inclusion” is too strict

we want to find patterns “mostly” included in many records

We consider ambiguous appearance of patterns

Page 4: Related Works on Ambiguity

• It is popular to detect “ambiguous XXXX”

dense substructures: clustering, community discovery, …

homology search on genome sequences

• Heuristic search is popular because of the difficulty of modeling and computation

   Advantage: usually works efficiently

   Problem: not easy to understand “what is found”

    additional conditions cost much more (for each solution)

• Here we look at the problem from an “algorithmic point of view”

(efficient models arising from efficient computation)

Page 5: Itemset Mining

• In this talk, we focus on itemset mining

transaction database D: each record, called a transaction, is a subset of the itemset E, that is, ∀T ∈ D, T ⊆ E

Occ(P): set of transactions including P

frq(P) = |Occ(P)|: #transactions including P

P is a frequent itemset ⇔ frq(P) ≥ σ (σ is the minimum support)

• The problem is to enumerate all frequent itemsets in D

We introduce ambiguous inclusion for frequent itemset mining
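The definitions of Occ(P) and frq(P) above can be sketched in a few lines of Python. This is only an illustration: the toy database `D` and the function names `occ`, `frq`, and `is_frequent` are assumptions made for this example and do not come from the talk.

```python
# Toy transaction database (hypothetical data, not from the talk).
D = [{1, 3, 4}, {2, 4, 5}, {1, 2}, {1, 2, 4}]


def occ(P, D):
    """Occ(P): indices of the transactions that include every item of P."""
    return [i for i, T in enumerate(D) if P <= T]


def frq(P, D):
    """frq(P) = |Occ(P)|."""
    return len(occ(P, D))


def is_frequent(P, D, sigma):
    """P is a frequent itemset iff frq(P) >= sigma (the minimum support)."""
    return frq(P, D) >= sigma


print(occ({1, 2}, D))             # [2, 3]
print(frq({4}, D))                # 3
print(is_frequent({1, 2}, D, 2))  # True
```

Here inclusion is exact (`P <= T`); the rest of the talk relaxes exactly this test.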

Page 6: Related Works

• fault-tolerant pattern, degenerate pattern, soft occurrence, etc.; mainly two approaches

(1) generalize inclusion:

(1-a) a transaction includes the itemset if the ratio of included items ≥ θ
   loses monotonicity; no subset may be frequent in the worst case
   several heuristic-search-based algorithms

(1-b) a transaction includes the itemset if at most k items are not included
   satisfies monotonicity; very many small itemsets are frequent
   maximal enumeration, or complete enumeration with small k

[Example: transactions {1,2}, {2,3}, {1,3} with θ = 66%]
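One plausible reading of the small example above is that it illustrates how approach (1-a) loses monotonicity. The sketch below counts, for an itemset P, the transactions that contain at least θ of its items. The name `ratio_frq` and the integer percentage `THETA_PCT` are assumptions made for this illustration.

```python
# Transactions from the example on this slide; theta = 66%.
D = [{1, 2}, {2, 3}, {1, 3}]
THETA_PCT = 66


def ratio_frq(P, D, theta_pct=THETA_PCT):
    """Number of transactions t with |t ∩ P| / |P| >= theta (integer arithmetic)."""
    return sum(1 for t in D if 100 * len(t & P) >= theta_pct * len(P))


print(ratio_frq({1, 2, 3}, D))  # 3: every transaction holds 2 of the 3 items
print(ratio_frq({1, 2}, D))     # 1: only {1,2} holds at least 66% of {1,2}
print(ratio_frq({1}, D))        # 2
```

With a minimum support of 3, {1,2,3} is frequent under this relaxed inclusion while none of its non-empty proper subsets is, which is the worst case mentioned in (1-a).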

Page 7: Related Works 2

(2) find pairs of an itemset and a transaction set such that only a few (item, transaction) pairs among them do not satisfy inclusion

   equivalent to finding a dense submatrix, or a dense bicluster

very many equivalent patterns will be found

   mainly, heuristic search for finding one such dense substructure

• ambiguity on the transaction set

an itemset can have many partners

We introduce a new model for (2) to avoid redundancy, and propose an efficient depth-first-search-type algorithm

[Figure: transaction-item matrix, with items on one axis and transactions on the other]

Page 8: Average Inclusion

• inclusion ratio of t for P ⇔ |t ∩ P| / |P|

• average inclusion ratio of transaction set T for P

 ⇔ average of the inclusion ratio over all transactions in T

∑_{t∈T} |t ∩ P| / ( |P| × |T| )

equivalent to dense submatrix/subgraph of the transaction-item inclusion matrix/graph

• For a density threshold θ, the maximum co-occurrence size cov(P) of itemset P ⇔ maximum size of a transaction set s.t. average inclusion ratio ≥ θ

[Example: for the transactions {1,3,4}, {2,4,5}, {1,2}, the average inclusion ratio over all three transactions is 50% for {2,3}, 50% for {4,5}, and 66% for {1,2}]
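The average inclusion ratio can be checked numerically on the three transactions of this slide. The sketch below is an illustration only; the function name `avg_inclusion` is an assumption, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Transactions from the example on this slide.
D = [{1, 3, 4}, {2, 4, 5}, {1, 2}]


def avg_inclusion(P, T):
    """Sum over t in T of |t ∩ P|, divided by |P| * |T|."""
    return Fraction(sum(len(t & P) for t in T), len(P) * len(T))


for P in ({2, 3}, {4, 5}, {1, 2}):
    print(sorted(P), avg_inclusion(P, D))
# [2, 3] 1/2
# [4, 5] 1/2
# [1, 2] 2/3
```

The values 1/2, 1/2, and 2/3 correspond to the 50%, 50%, and 66% shown in the example.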

Page 9: Problem Definition

• For a density threshold θ, the maximum co-occurrence size cov(P) of itemset P ⇔ maximum size of a transaction set s.t. average inclusion ratio ≥ θ

• Ambiguous frequent itemset: itemset P s.t. cov(P) ≥ σ (σ: minimum support)

• Ambiguous frequent itemsets are not monotone!

[Example: transactions {1,3,4}, {2,4,5}, {1,2} with θ = 66%: cov({3}) = 1, cov({2}) = 3, cov({1,3}) = 2, cov({1,2}) = 3]

Ambiguous frequent itemset enumeration: the problem of outputting all ambiguous frequent itemsets for a given database D, density threshold θ, and minimum support σ

The goal is to develop an efficient algorithm for this problem
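The cov values of the example can be reproduced with a short sketch. It relies on one observation: for a fixed size k, the best transaction set consists of the k transactions sharing the most items with P, so it is enough to scan the prefixes of the sorted list. The function name `cov` and the integer percentage `THETA_PCT` are assumptions made for this illustration.

```python
# Transactions from the example on this slide; theta = 66%.
D = [{1, 3, 4}, {2, 4, 5}, {1, 2}]
THETA_PCT = 66


def cov(P, D, theta_pct=THETA_PCT):
    """Largest k such that the k best transactions have average inclusion ratio >= theta."""
    hits = sorted((len(t & P) for t in D), reverse=True)
    best, total = 0, 0
    for k, h in enumerate(hits, start=1):
        total += h
        # Integer form of: total / (|P| * k) >= theta
        if 100 * total >= theta_pct * len(P) * k:
            best = k
    return best


print(cov({3}, D), cov({2}, D), cov({1, 3}, D), cov({1, 2}, D))  # 1 3 2 3
```

With σ = 2, {1,3} is an ambiguous frequent itemset while its subset {3} is not, which is the non-monotonicity stated above.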

Page 10: Hardness for Branch-and-Bound

• A straightforward approach to this problem is branch-and-bound

• In each iteration, divide the problem into two non-empty subproblems by the inclusion of an item

[Figure: branching on the inclusion of items i1, i2, …]

Checking the existence of an ambiguous frequent itemset is NP-complete (Theorem 1)

Page 11: Is This Really Hard?

• We proved NP-hardness for "very dense graphs"

unclear for graphs of middle density

polynomial-time enumeration is not ruled out

[Figure: density threshold axis from θ = 0 to θ = 1, with regions marked "easy", "hard", and "???"]

polynomial time in (input size) + (output size)

Page 12: Efficient Algorithm: Idea of Reverse Search

• We do not use branch-and-bound, but use reverse search

• Define an acyclic parent-child relation on all objects to be found

Recursively find children to search; thus an algorithm for finding all children is sufficient

[Figure: objects connected by the parent-child relation]

Depth-first search on the rooted tree induced by the relation

Page 13: Neighboring Relation

• AmbiOcc(P) of an ambiguous frequent itemset P

⇔ the lexicographically minimum one among the transaction sets whose average inclusion ratio ≥ θ and size = cov(P)

• e*(P): the item e in P s.t. the number of transactions in AmbiOcc(P) including e is the minimum (ties are broken by taking the minimum index)

• the parent Prt(P) of P: P \ e*(P)

Example (θ = 66%, σ = 4):

A: 1,3,4,7
B: 2,4,5
C: 1,2,7
D: 1,4,5,7
E: 2,3,6
F: 3,4,6

{1,4,5}: transactions in decreasing order of inclusion ratio: D; A, B; C, F; E

AmbiOcc({1,4,5}) = {D,A,B,C}

e*({1,4,5}) = 5, Prt({1,4,5}) = {1,4}

AmbiOcc({1,4}) = {D,A,B,C,F}
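The example above can be replayed in code. The sketch below is an assumption-laden illustration: it computes AmbiOcc by the greedy choice described later in the talk (more shared items first, ties broken by the smaller transaction name), and the function names `ambi_occ`, `e_star`, and `parent` are made up for this example.

```python
# Example database from this slide; theta = 66%.
D = {"A": {1, 3, 4, 7}, "B": {2, 4, 5}, "C": {1, 2, 7},
     "D": {1, 4, 5, 7}, "E": {2, 3, 6}, "F": {3, 4, 6}}
THETA_PCT = 66


def ambi_occ(P, D, theta_pct=THETA_PCT):
    """Greedy AmbiOcc(P): longest prefix of the sorted transactions with average ratio >= theta."""
    order = sorted(D, key=lambda name: (-len(D[name] & P), name))
    best, total = 0, 0
    for k, name in enumerate(order, start=1):
        total += len(D[name] & P)
        if 100 * total >= theta_pct * len(P) * k:
            best = k
    return sorted(order[:best])


def e_star(P, D):
    """Item of P included in the fewest transactions of AmbiOcc(P); ties go to the smallest item."""
    occ = ambi_occ(P, D)
    return min(P, key=lambda e: (sum(e in D[name] for name in occ), e))


def parent(P, D):
    """Prt(P) = P without e*(P)."""
    return P - {e_star(P, D)}


P = {1, 4, 5}
print(ambi_occ(P, D))          # ['A', 'B', 'C', 'D']
print(e_star(P, D))            # 5
print(parent(P, D))            # {1, 4}
print(ambi_occ({1, 4}, D))     # ['A', 'B', 'C', 'D', 'F']
```

Item 5 appears in only two transactions of AmbiOcc({1,4,5}) (B and D), while items 1 and 4 appear in three each, so 5 is the item removed to obtain the parent.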

Page 14: Properties of Parent

• The parent Prt(P) of P: P \ e*(P)

uniquely defined

• The average inclusion ratio of AmbiOcc(P) does not decrease when e*(P) is removed from P

hence cov(Prt(P)) ≥ cov(P), and Prt(P) is an ambiguous frequent itemset

• |Prt(P)| < |P| (the parent is always smaller)

   the relation is acyclic, and induces a tree (rooted at φ)

Example (θ = 66%, σ = 4; transactions A: 1,3,4,7, B: 2,4,5, C: 1,2,7, D: 1,4,5,7, E: 2,3,6, F: 3,4,6): AmbiOcc({1,4,5}) = {D,A,B,C}, e*({1,4,5}) = 5, Prt({1,4,5}) = {1,4}, AmbiOcc({1,4}) = {D,A,B,C,F}

Page 15: Enumeration Tree

• The relation is acyclic, and induces a tree (rooted at φ)

• We call the tree the enumeration tree

Example (θ = 66%, σ = 4; transactions A: 1,3,4,7, B: 2,4,5, C: 1,2,7, D: 1,4,5,7, E: 2,3,6, F: 3,4,6)

[Figure: enumeration tree; nodes shown in the figure, by itemset size]

size 0: φ

size 1: {1}, {2}, {3}, {4}, {7}

size 2: {1,4}, {1,7}, {3,4}, {4,5}, {4,7}

size 3: {1,2,7}, {1,3,4}, {1,3,7}, {1,4,5}, {1,4,7}, {1,5,7}, {3,4,7}, {4,5,7}

size 4: {1,3,4,7}, {1,4,5,7}

Page 16: Listing Children

• To perform a depth-first search on the enumeration tree, all we have to do is “find all children of a given itemset”

• P = Prt(P’) is obtained by removing an item from P’

a child P’ of P is obtained by adding an item to P

to find all children, we examine all possible items

[Figure: itemsets in the enumeration tree rooted at φ]

Page 17: Check Candidates

• An item addition does not always yield a child

     They are just “candidates”

• If the parent of a candidate P’ = P∪e is P (i.e., e*(P’) = e),

P’ is a child of P

we check this by computing e*(P∪e) for each candidate P∪e

[Figure: candidate itemsets in the enumeration tree rooted at φ]

Theorem: Enumeration is done in O(||D||n) time for each ambiguous frequent itemset

Page 18: Algorithm Description

Algorithm AFIM ( P: pattern, D: database )

  output P

  compute cov(P∪e) for all items e not in P

  for each e s.t. cov(P∪e) ≥ σ do

    compute AmbiOcc(P∪e)

    compute e*(P∪e)

    if e*(P∪e) = e then call AFIM ( P∪e, D )

  done
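The pseudocode above can be turned into a small runnable sketch on the six-transaction example used earlier in the talk. This is a naive illustration under stated assumptions, not the authors' C implementation: AmbiOcc is computed by the greedy order (more shared items first, then smaller transaction index), cov(P) is taken as the size of that greedy set, and all function and variable names are made up for this example. A brute-force pass over all subsets is included only as a cross-check.

```python
from itertools import combinations

# Example database from the slides (transactions A..F, in that order).
D = [{1, 3, 4, 7}, {2, 4, 5}, {1, 2, 7}, {1, 4, 5, 7}, {2, 3, 6}, {3, 4, 6}]
ITEMS = sorted(set().union(*D))
THETA_PCT, SIGMA = 66, 4   # theta = 66%, sigma = 4


def ambi_occ(P):
    """Greedy AmbiOcc(P) as a list of transaction indices; its length is cov(P)."""
    order = sorted(range(len(D)), key=lambda i: (-len(D[i] & P), i))
    best, total = 0, 0
    for k, i in enumerate(order, start=1):
        total += len(D[i] & P)
        if 100 * total >= THETA_PCT * len(P) * k:   # integer form of avg >= theta
            best = k
    return order[:best]


def e_star(P, occ):
    """Item of P contained in the fewest transactions of occ (ties: smallest item)."""
    return min(P, key=lambda e: (sum(e in D[i] for i in occ), e))


def afim(P, out):
    """Reverse search: output P, then recurse into every child of P."""
    out.append(P)
    for e in ITEMS:
        if e in P:
            continue
        child = P | {e}
        occ = ambi_occ(child)
        # The candidate is a child of P iff it is frequent and its parent is P.
        if len(occ) >= SIGMA and e_star(child, occ) == e:
            afim(child, out)


found = []
afim(frozenset(), found)

# Cross-check against brute force over all non-empty subsets of the items.
brute = {frozenset(c) for r in range(1, len(ITEMS) + 1)
         for c in combinations(ITEMS, r) if len(ambi_occ(frozenset(c))) >= SIGMA}
print("matches brute force:", set(found) - {frozenset()} == brute)
```

The sketch recomputes everything from scratch for each candidate, so it does not reflect the per-itemset time bound of the talk; it only shows that checking e*(P∪e) = e is enough to visit each ambiguous frequent itemset exactly once.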

Page 19: Efficient Computation of cov’s

• For efficient computation, we classify transactions by inclusion ratio

• When we compute cov(P∪e), we compute the intersection of each group and Occ(e)

the inclusion ratio increases for the transactions included in Occ(e)

by moving such transactions, the classification for P∪e is obtained

• This task for all items is done efficiently by Delivery, which takes O(||G||) time, where ||G|| is the sum of transaction sizes in group G

the computation of cov(P∪e) can be done in linear time

[Figure: transactions grouped by number of missing items: 0 miss, 1 miss, 2 miss, 3 miss, 4 miss, 5 miss]
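The grouping idea can be sketched as follows, under several assumptions: transactions are grouped by the number of items they share with P (the mirror image of the "miss" counts in the figure), the occurrence lists Occ(e) are built in one pass in the spirit of Delivery, and only the transactions in Occ(e) are moved up by one group when e is added. The names `occurrences`, `cov_from_groups`, and `cov_all_extensions` are made up for this illustration, and the code is not tuned to the stated time bounds.

```python
from collections import Counter

# Example database from the slides (transactions A..F, in that order).
D = [{1, 3, 4, 7}, {2, 4, 5}, {1, 2, 7}, {1, 4, 5, 7}, {2, 3, 6}, {3, 4, 6}]
THETA_PCT = 66


def occurrences(D):
    """Occ(e) for every item e, built in one pass over the database."""
    occ = {}
    for i, t in enumerate(D):
        for e in t:
            occ.setdefault(e, []).append(i)
    return occ


def cov_from_groups(groups, p_size, theta_pct=THETA_PCT):
    """groups[h] = number of transactions sharing exactly h items with the itemset."""
    size, total = 0, 0
    for h in range(p_size, -1, -1):          # best groups first
        for _ in range(groups.get(h, 0)):
            if 100 * (total + h) < theta_pct * p_size * (size + 1):
                return size                   # averages only get worse from here
            size, total = size + 1, total + h
    return size


def cov_all_extensions(P, D):
    """cov(P ∪ e) for every item e not in P, reusing the classification for P."""
    hits = [len(t & P) for t in D]            # classification for P
    base = Counter(hits)
    result = {}
    for e, occ_e in occurrences(D).items():
        if e in P:
            continue
        groups = Counter(base)
        for i in occ_e:                       # only Occ(e) moves up one group
            groups[hits[i]] -= 1
            groups[hits[i] + 1] += 1
        result[e] = cov_from_groups(groups, len(P) + 1)
    return result


covs = cov_all_extensions({1, 4}, D)
print(covs[5], covs[7])  # 4 5
```

For P = {1,4}, adding item 5 moves only transactions B and D up by one group, and the resulting cov of 4 agrees with the AmbiOcc({1,4,5}) example earlier in the talk.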

Page 20: Computing AmbiOcc and e*

• Computation of AmbiOcc(P∪e) needs a greedy choice of transactions, in decreasing order of (inclusion ratio, index)

• Computation of e*(P∪e) needs the intersection of AmbiOcc(P∪e) and Occ(i) for each i ∈ P; this is done by Delivery

O(||D||) time is needed in the worst case

• However, when cov(P) is small, not so many transactions may be scanned; thus we expect that the average computation time is not so long

Page 21: Bottom-wideness

• The depth-first search generates several recursive calls in each iteration

the recursion tree grows exponentially going down

computation time is dominated by the lowest levels

• Computation time per iteration decreases going down

Near the bottom levels, the number of transactions scanned may be close to σ; thus an iteration may take O(σt) time, where t is the average size of transactions

[Figure: recursion tree; iterations near the root take a long time, iterations near the bottom take a short time]

Page 22: Computational Experiments

CPU: Pentium M 1.1GHz
memory: 256MB
OS: Windows XP + Cygwin
Code: C
Compiler: gcc 2.3

• Test instances are taken from benchmark datasets for frequent itemset mining

Page 23: BMS-WebView 2

• Real-world web access data (sparse; transaction size = 4.5)

[Figure: BMS-WebView2. x-axis: support (1%, 0.50%, 0.30%, 0.15%, 0.05%); y-axis: time (sec) / number, log scale from 0.1 to 10000000; series: LCM time, 1.0 number, 1.0 time, 1.0 time/M, 0.9 number, 0.9 time, 0.9 time/M, 0.8 number, 0.8 time, 0.8 time/M]

Page 24: Mushroom

• Real-world machine learning data on mushrooms (density = 1/3)

[Figure: Mushroom. x-axis: support (80%, 70%, 60%, 50%, 40%, 30%, 20%); y-axis: time (sec) / number, log scale from 0.01 to 10000000; series: LCM time, 1.0 number, 1.0 time, 1.0 time/M, 0.9 number, 0.9 time, 0.9 time/M, 0.8 number, 0.8 time, 0.8 time/M]

Page 25: Possibility for Further Improvements

• Ratio of unnecessary operations, non-maximal patterns

[Figure: Mushroom. x-axis: support (80%, 70%, 60%, 50%, 40%, 30%); y-axis: ratio, log scale from 1 to 100; series: 0.9 max, 0.9 prt, 0.9 occ, 0.8 max, 0.8 prt, 0.8 occ]

Page 26: Conclusion

• Introduced a new model for frequent itemset mining with an ambiguous inclusion relation, which avoids redundancy

• Showed a hardness result for branch-and-bound

• Showed efficiency on practical (sparse) datasets

Future Works:

• Reduce the time complexity and fill the gap between theory and practice

• Efficient models and computation for maximal ones

• Application of the technique to other problems

(ambiguous pattern mining for graph, tree, vector data, etc.)