An MPI-based Parallel Association Rule Mining (ARM) Algorithm for Corpus
Shankai Yan, 8 November 2014
Email: [email protected]
School of Software Engineering, South China University of Technology, Guangzhou, Guangdong
Presentation Outline
Application of ARM for Corpus
Serial Algorithm Description
Parallel Algorithm Description
Experiments
Application of ARM for Corpus
Detecting Privacy Leaks
Talent Recruitment
Detecting Privacy Leaks
Richard Chow, Philippe Golle, Jessica Staddon. Detecting Privacy Leaks Using Corpus-based Association Rules. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.893-901, 2008.
Talent Recruitment
The DISCOTEX System (job postings)
Raymond J. Mooney, Un Yong Nahm. Text Mining with Information Extraction. Proceedings of the 4th International MIDP Colloquium, pp.141-160, 2003.
Shankai Yan, Pingjian Zhang. A Fast Association Rule Mining Algorithm for Corpus. International Conference on Intelligent Systems and Knowledge Engineering, pp.449-459, 2013.
Serial Algorithm Description
Hash Inverted Index Construction
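The slide names this step without describing it. As a minimal sketch, assuming the index maps each term to the set of documents containing it, a Python dict (itself a hash table) can stand in for the hash inverted index. The function name and the whitespace tokenizer are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        # Assumption: terms are lowercase whitespace-separated tokens.
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)

docs = [
    "parallel mining of rules",
    "mining text corpus",
    "parallel text mining",
]
index = build_inverted_index(docs)
print(sorted(index["mining"]))    # [0, 1, 2]
print(sorted(index["parallel"]))  # [0, 2]
```

With such an index, the support of a term is simply the size of its document set.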
k-Frequent Itemsets Generation
Association Rules Generation
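The two steps above are listed without detail, so the following is only a sketch under stated assumptions: itemsets are grown level by level, the support of an itemset is the size of the intersection of its members' document sets in an inverted index, and a rule is kept when its confidence reaches a threshold. All names and thresholds are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(index, min_support):
    """Level-wise search over an inverted index {term: set of doc ids}."""
    current = {frozenset([term]): docs
               for term, docs in index.items() if len(docs) >= min_support}
    result = dict(current)
    while current:
        candidates = {}
        for (a, docs_a), (b, docs_b) in combinations(current.items(), 2):
            union = a | b
            # Join two k-itemsets that differ in exactly one item.
            if len(union) == len(a) + 1 and union not in candidates:
                shared = docs_a & docs_b  # documents containing every item
                if len(shared) >= min_support:
                    candidates[union] = shared
        result.update(candidates)
        current = candidates
    return result

def association_rules(itemsets, min_confidence):
    """Emit (antecedent, consequent, confidence) for each qualifying split."""
    rules = []
    for itemset, docs in itemsets.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                # Every subset of a frequent itemset is itself frequent.
                confidence = len(docs) / len(itemsets[antecedent])
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules

example_index = {
    "parallel": {0, 2},
    "mining": {0, 1, 2},
    "text": {1, 2},
    "corpus": {1},
}
itemsets = frequent_itemsets(example_index, min_support=2)
rules = association_rules(itemsets, min_confidence=0.9)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), confidence)
```

In this toy index, only "parallel" and "text" each imply "mining" with full confidence; the reverse rules fall below the threshold.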
Parallel Algorithm Description
[Figure: flow diagram from Corpus through 1-Frequent Itemsets to Association Rules]
Input Data Decomposition
Hash Inverted Index Synchronization
Communication Pattern
Association Rules Generation
Input Data Decomposition
Adjust the capacity bound C of the first-fit decreasing (FFD) bin-packing algorithm to find the minimum C for which the number of bins equals the number of processes.
Example: Find a strategy to dispatch documents of sizes [13, 7, 20, 13, 12, 7, 12] to 4 processes.
[Figure: FFD packings for C=30, C=20, and C=25]
Hash Inverted Index Synchronization
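The slide does not spell out the synchronization protocol. One plausible reading is that each process indexes its own share of the documents and the partial indexes are then combined so that global supports can be computed. The sketch below simulates that merge in plain Python, with a list of dicts standing in for the per-process indexes; it makes no MPI calls, and every name in it is an assumption.

```python
def merge_indexes(local_indexes):
    """Union the posting sets of per-process inverted indexes."""
    merged = {}
    for local in local_indexes:
        for term, doc_ids in local.items():
            merged.setdefault(term, set()).update(doc_ids)
    return merged

def frequent_terms(merged, min_support):
    """Terms whose global document count reaches the support threshold."""
    return {term for term, doc_ids in merged.items() if len(doc_ids) >= min_support}

# Hypothetical partial indexes built by two processes over disjoint documents.
rank0 = {"mining": {0, 1}, "parallel": {0}}
rank1 = {"mining": {2}, "text": {2, 3}}

merged = merge_indexes([rank0, rank1])
print(sorted(merged["mining"]))           # [0, 1, 2]
print(sorted(frequent_terms(merged, 2)))  # ['mining', 'text']
```

Note that "mining" is frequent only globally: neither process alone could decide that from its own share of the documents when the threshold is 3, which is why some form of synchronization is needed before pruning.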
Communication Pattern
Experiments
Data set: Sogou Labs Corpus (http://www.sogou.com/labs/resources.html)
Data set   Documents   Size      Terms
Small      10^3        2.4 MB    15710
Medium     10^4        31.2 MB   35617
Large      10^5        368 MB    135527
[Figure: elapse time (s) versus node number (serial, 1, 3, 5, 7) for small(MPI), medium(MPI), small(MPI+OpenMP), and medium(MPI+OpenMP)]
Parallel Efficiency:
small(MPI) [1.39%, 6.72%]
small(MPI+OpenMP) [2.15%, 7.19%]
medium(MPI) [1.54%, 7.37%]
medium(MPI+OpenMP) [2.25%, 7.94%]
[Figure: elapse time (s) versus node number (serial, 1, 3, 5, 7) for large(MPI) and large(MPI+OpenMP)]
Parallel Efficiency:
large(MPI) [9.67%, 27.05%]
large(MPI+OpenMP) [61.00%, 70.48%]
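The slides report parallel efficiency as ranges without giving the formula. Under the usual definition (speedup divided by the number of workers), it could be computed as in the sketch below. The timings are hypothetical and are not taken from the experiments, and whether the authors count nodes or cores as workers is not stated.

```python
def speedup(serial_time, parallel_time):
    """Ratio of serial to parallel elapsed time."""
    return serial_time / parallel_time

def parallel_efficiency(serial_time, parallel_time, workers):
    """Speedup per worker; 1.0 would mean perfect linear scaling."""
    return speedup(serial_time, parallel_time) / workers

# Hypothetical timings: 100 s serial, 40 s on 4 workers.
efficiency = parallel_efficiency(100.0, 40.0, 4)
print(f"{efficiency:.2%}")  # 62.50%
```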
Thanks For Listening
Shankai Yan, 8 November 2014