a parallel association rule mining algorithm for corpus

A MPI-based Parallel Association Rule Mining (ARM) Algorithm for Corpus

Shankai Yan, 8 November 2014

Email: [email protected]

School of Software Engineering, South China University of Technology, Guangzhou, Guangdong

Presentation Outline

Application of ARM for Corpus

Serial Algorithm Description

Parallel Algorithm Description

Experiments

Application of ARM for Corpus

Detecting Privacy Leaks

Talent Recruitment


Detecting Privacy Leaks

Richard Chow, Philippe Golle, Jessica Staddon. Detecting Privacy Leaks Using Corpus-based Association Rules. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.893-901, 2008.

Talent Recruitment

The DISCOTEX System (job postings)

Raymond J. Mooney, Un Yong Nahm. Text Mining with Information Extraction. Proceedings of the 4th International MIDP Colloquium, pp.141-160, 2003.

Shankai Yan, Pingjian Zhang. A Fast Association Rule Mining Algorithm for Corpus. International Conference on Intelligent Systems and Knowledge Engineering, pp.449-459, 2013.


Serial Algorithm Description

Hash Inverted Index Construction
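
The slide's construction details did not survive in this transcript. As a minimal sketch, assuming the index maps each term to the set of documents that contain it, the idea can be illustrated as follows. The function name `build_inverted_index` and the toy corpus are hypothetical and not taken from the talk.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of IDs of the documents containing it."""
    index = defaultdict(set)  # a Python dict is itself a hash table
    for doc_id, terms in enumerate(docs):
        for term in terms:
            index[term].add(doc_id)
    return index


# Hypothetical toy corpus: each document is already split into terms.
docs = [["mpi", "corpus", "rule"], ["corpus", "rule"], ["mpi", "rule"]]
index = build_inverted_index(docs)
print(sorted(index["corpus"]))  # IDs of the documents containing "corpus"
```

With such an index, the support of a term is simply the size of its posting set.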


k-Frequent Itemsets Generation

Association Rules Generation
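
The transcript keeps only the names of these two steps. The sketch below shows one generic way they can be done on top of an inverted index: a level-wise (Apriori-style) search in which the support of an itemset is the size of the intersection of its posting sets, followed by rule extraction with a confidence threshold. It is an illustration under those assumptions, not necessarily the procedure of the cited paper; the function names, the toy index and both thresholds are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(index, min_support):
    """Return {itemset: support} for every itemset found in at least min_support documents."""
    # Level 1: single terms whose posting set is large enough.
    level = {frozenset([t]): docs for t, docs in index.items()
             if len(docs) >= min_support}
    supports = {}
    while level:
        supports.update({items: len(docs) for items, docs in level.items()})
        next_level = {}
        # Join two k-itemsets that differ in one item into a (k+1)-itemset.
        for (a, docs_a), (b, docs_b) in combinations(level.items(), 2):
            union = a | b
            if len(union) == len(a) + 1 and union not in next_level:
                docs = docs_a & docs_b  # documents containing every item of the union
                if len(docs) >= min_support:
                    next_level[union] = docs
        level = next_level
    return supports


def association_rules(supports, min_confidence):
    """Return (lhs, rhs, confidence) with confidence = support(lhs + rhs) / support(lhs)."""
    rules = []
    for items, supp in supports.items():
        for r in range(1, len(items)):
            for lhs in combinations(sorted(items), r):
                lhs = frozenset(lhs)
                conf = supp / supports[lhs]  # every subset of a frequent itemset is frequent
                if conf >= min_confidence:
                    rules.append((set(lhs), set(items - lhs), conf))
    return rules


# Hypothetical inverted index: term -> set of document IDs.
index = {"mpi": {0, 2}, "corpus": {0, 1}, "rule": {0, 1, 2}}
supports = frequent_itemsets(index, min_support=2)
rules = association_rules(supports, min_confidence=0.8)
for lhs, rhs, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), conf)
```

Because supports come from set intersections, the corpus is not rescanned for each candidate itemset.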


Parallel Algorithm Description

Pipeline diagram (labels only): Corpus, 1-Frequent Itemsets, 1-Frequent Itemsets, Association Rules

Steps:

Input Data Decomposition

Hash Inverted Index Synchronization

Communication Pattern

Association Rules Generation


Input Data Decomposition

Adjust the capacity bound C of the first-fit decreasing (FFD) bin-packing algorithm to find the minimum C for which the number of bins equals the number of processes.

Example: Find a strategy to dispatch documents of sizes [13, 7, 20, 13, 12, 7, 12] to 4 processes.

Trial bounds shown in the figure: C=30, C=20, C=25


Hash Inverted Index Synchronization
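
The content of this slide did not survive in the transcript. One plausible reading, stated here purely as an assumption, is that each process builds an index over its own share of the documents and the partial indexes are then merged so that posting sets become global. The sketch below simulates that merge in a single process: the MPI exchange is replaced by a plain list, document IDs are assumed to be globally unique, and `merge_indexes` is a hypothetical name.

```python
from collections import defaultdict


def merge_indexes(partial_indexes):
    """Union the posting sets of every term across all partial indexes."""
    merged = defaultdict(set)
    for partial in partial_indexes:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return dict(merged)


# Hypothetical partial indexes, one per process (rank).
rank0 = {"mpi": {0}, "rule": {0, 1}}
rank1 = {"corpus": {2}, "rule": {2}}
merged = merge_indexes([rank0, rank1])
print(sorted(merged["rule"]))  # global posting set of "rule"
```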


Communication Pattern


Experiments

Data set: Sogou Labs Corpus (http://www.sogou.com/labs/resources.html)

Small: 10^3 documents, 2.4 MB, 15710 terms

Medium: 10^4 documents, 31.2 MB, 35617 terms

Large: 10^5 documents, 368 MB, 135527 terms

Figure: elapsed time (s) versus node number (serial, 1, 3, 5, 7) for small(MPI), medium(MPI), small(MPI+OpenMP) and medium(MPI+OpenMP); y-axis from 0 to 35 s.

Parallel Efficiency:

small(MPI) [1.39%, 6.72%]

small(MPI+OpenMP) [2.15%, 7.19%]

medium(MPI) [1.54%, 7.37%]

medium(MPI+OpenMP) [2.25%, 7.94%]
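
The transcript does not say how parallel efficiency was computed. Assuming the standard definition, with $T_{\text{serial}}$ the elapsed time of the serial run and $T_p$ the elapsed time on $p$ processing units, the reported ranges would correspond to:

```latex
E(p) = \frac{S(p)}{p} = \frac{T_{\text{serial}}}{p \, T_p}
```

Whether $p$ counts nodes or cores in the MPI+OpenMP runs is not stated, so the symbols here are assumptions for illustration only.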

Figure: elapsed time (s) versus node number (serial, 1, 3, 5, 7) for large(MPI) and large(MPI+OpenMP); y-axis from 0 to 4000 s.

Parallel Efficiency:

large(MPI) [9.67%, 27.05%]

large(MPI+OpenMP) [61.00%, 70.48%]

Thanks For Listening

Shankai Yan, 8 November 2014