a fast parallel algorithm for discovering frequent patterns


A Fast Parallel Algorithm for Discovering Frequent Patterns

Kawuu W. Lin
Department of Computer Science and Information Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, Taiwan, [email protected]

Yu-Chin Luo
Department of Computer Science and Information Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, Taiwan, [email protected]

Abstract
Fast discovery of frequent patterns is among the most extensively discussed problems in data mining because of its wide applications. As the size of the database increases, the computation time and the required memory increase severely. The difficulty of mining large databases has motivated research on parallel and distributed algorithms. Most past studies parallelize the computation by dividing the database and distributing the parts to other nodes for mining. This approach might leak data and is evidently not suitable for sensitive domains such as health care. In this paper, we propose a novel data mining algorithm named FD-Mine that efficiently utilizes the nodes to discover frequent patterns in cloud computing environments with data privacy preserved. Through empirical evaluations on various simulation conditions, the proposed FD-Mine delivers excellent performance in terms of scalability and execution time.

Keywords: Data mining; cloud computing; association rule mining; frequent pattern mining; privacy preserved

I. Introduction
With the progress of information technology, data mining techniques have been extensively applied in various domains. The goal of data mining is to discover hidden, useful information in large databases. The discovered information can support decision processes, aid commercial promotion, and so forth. Data mining includes four main topics: association rule mining [2], sequential pattern mining [3], clustering [11] and classification [5]. Among these, frequent pattern mining, i.e. association rule mining and sequential pattern mining, is the most discussed because of its wide applications.

The basic idea of frequent pattern mining is to discover the patterns whose frequency of appearance in the database is greater than a specific threshold. An association rule is defined as X=>Y, where X and Y are sets of items. Association rule mining discovers the sets of items that tend to be associated with each other in the database. The studies on association rule mining can be classified into two types: 1) the generate-and-test [2] (Apriori-like) approach and 2) the frequent pattern growth approach [6] (FP-growth-like). The Apriori-like methods iteratively generate candidate itemsets of size (k+1) from frequent itemsets of size k and scan the database repeatedly to test the frequency of each candidate itemset. These methods inevitably suffer from the large number of candidate itemsets, especially when the support threshold is small. For this reason, Han et al. [6] proposed a novel data structure, named frequent pattern tree (FP-tree), in which the transactions are compressed and stored. A mining algorithm, namely FP-growth, was also proposed for discovering the frequent patterns from the FP-tree. FP-growth needs only two scans of the physical database and therefore greatly improves the execution time.

As the size of the database increases, the computation time and the required memory increase severely. Many studies on association rule mining were proposed mainly to improve the execution time. In the past decades, parallel and distributed computing (PDC) techniques have attracted extensive attention for their ability to manage and process large amounts of data. The difficulty of mining large databases has motivated research on parallel and distributed algorithms to solve the problem [7], [8], [10], [13], [14]. The main approach of the existing studies is to divide the database and then distribute each part to nodes or processors for mining, with the goal of distributing the computation load. During the mining process, the nodes exchange the required transactions with each other. The workload of exchanging data among nodes becomes heavy when the average transaction length is long or the database is large. Although many algorithms have been proposed, the execution efficiency of frequent pattern mining remains a challenge because of the data explosion. In addition to the exchange workload, data privacy is also a major concern, since this kind of algorithm duplicates the database to every node in the PDC architecture. This approach is evidently not suitable for sensitive domains such as health care.

In this paper, we propose a novel data mining method named FD-Mine that efficiently utilizes the cloud nodes to quickly discover frequent patterns in cloud computing environments with data privacy preserved. Through empirical evaluations on

    Authorized licensed use limited to: LA TROBE UNIVERSITY. Downloaded on June 13,2010 at 08:00:36 UTC from IEEE Xplore. Restrictions apply.


various simulation conditions, the proposed FD-Mine delivers excellent performance in terms of scalability and execution time.

The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 proposes the architecture and presents the data mining algorithm. Section 4 reports the empirical performance evaluation. Section 5 gives the conclusions.
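To make the notion of a frequent pattern from the introduction concrete, the following is a minimal brute-force sketch, not the method of this paper. The function names and the toy database are hypothetical, and the sketch treats an itemset as frequent when its support is greater than or equal to the threshold, which is an assumption about the boundary case.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Enumerate every itemset whose support reaches `min_support` (brute force)."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, k):
            s = support(combo, transactions)
            if s >= min_support:
                result[combo] = s
                found = True
        if not found:
            # No frequent k-itemset exists, so no larger itemset can be frequent.
            break
    return result

# Hypothetical toy database of four transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
patterns = frequent_itemsets(transactions, min_support=0.5)
```

The enumeration tests every candidate against the whole database, which is exactly the cost that grows quickly as the support threshold shrinks and that FP-growth avoids by working on a compressed tree.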

II. Related Work
To improve the performance of association rule mining, many researchers have tried to distribute the mining computation over more than one processor or node. In [9], the authors proposed a parallel algorithm named Parallel FP-tree (PFP-tree), based on the FP-tree data structure, for mining frequent patterns on message passing multiprocessor systems. The algorithm divides the database into several non-overlapping parts according to the number of available processors, and lets each processor construct its FP-tree by exchanging the necessary information with other processors. Because the algorithm is performed on a single node, the data exchange takes place within that node, so the overhead might not be severe. To parallelize frequent pattern mining, past studies relied mainly on the database dividing method [4], [15]. The database is divided equally or by some criterion, and each part is sent to a node for mining. An approach that duplicates the database to other nodes risks leaking the data, so data privacy cannot be preserved.

Note that in cloud computing environments network latency is an important issue that should be carefully considered. The targeted database in mining applications is generally large. Transmitting the database and exchanging large amounts of data over the internet greatly degrades performance. In [12], the proposed method, named QFP-growth, divides the database equally and constructs FP-trees from the assigned parts of the database. The FP-trees are then merged into one FP-tree to complete the mining task.

The data transmission overhead was studied in [14]. The authors observed that the time spent exchanging transactions is much greater than the mining time. To exchange transactions efficiently among nodes in the database dividing approach, TPFP-tree was proposed, which uses a transaction identification set (Tidset) to select transactions directly instead of scanning the physical database. The Tidset is a table recording the IDs of the transactions that contain a certain item, so the memory required by the Tidset is the same size as the assigned partial database. Therefore, TPFP-tree is bound by the size of the targeted database.

To balance the computing load of TPFP-tree, the authors of [15] proposed the BTP-tree algorithm, a balanced Tidset-based parallel FP-tree algorithm for mining frequent patterns. The algorithm divides the database equally into p parts, where p is the number of nodes. The partial databases are sent to the nodes individually. Each node establishes the Tidset and header table in accordance with its assigned database. A global header table, named GHT, is derived by gathering the header tables of all nodes into one table and filtering out the items with support smaller than the threshold. Before executing the mining task, the BTP-tree algorithm calculates a performance index for each node and records the sum of the performance indexes. A mining task is then separated into p sub-tasks, where the load of each task is measured by the number of items in the header table. The task assignment is decided by the performance indexing mechanism. After the task assignment, each node constructs its Tidset for fast selection. The required transactions are exchanged among nodes to generate the new sub-databases by referring to the items of the header tables. Finally, FP-growth is performed on each node to discover the frequent patterns, and the frequent patterns are gathered from all the nodes to obtain the complete set.
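The Tidset used by TPFP-tree and BTP-tree can be illustrated with a small sketch. The function names and the toy partial database below are hypothetical and are not taken from [14] or [15]; the sketch only shows the idea of mapping each item to the IDs of the transactions that contain it, so that transactions can be selected without rescanning the database.

```python
from collections import defaultdict

def build_tidset(transactions):
    """Map each item to the set of IDs of the transactions that contain it."""
    tidset = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            tidset[item].add(tid)
    return dict(tidset)

def tids_containing_all(tidset, items):
    """IDs of the transactions that contain every item in `items`."""
    sets = [tidset.get(item, set()) for item in items]
    return set.intersection(*sets) if sets else set()

# Hypothetical partial database: transaction ID -> items.
partial_db = {
    1: {"a", "b"},
    2: {"b", "c"},
    3: {"a", "c"},
    4: {"a", "b", "c"},
}
tidset = build_tidset(partial_db)
selected = tids_containing_all(tidset, ["a", "b"])

# The Tidset holds one entry per (item, transaction) occurrence,
# so it grows with the size of the partial database.
entries = sum(len(tids) for tids in tidset.values())
```

The `entries` count equals the total number of item occurrences in the partial database, which is why the memory needed by the Tidset tracks the database size.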

III. Proposed Algorithm: FD-Mine
In this section, we describe the proposed algorithm, which efficiently distributes the computation in cloud computing environments. The cloud architecture for mining frequent patterns is introduced in Section 3.1. In Section 3.2, we formulate the problem. The details of the proposed algorithm are described in Section 3.3.

3.1 Proposed Cloud Architecture for Frequent Pattern Mining
In cloud computing environments, data privacy is an important issue. Since the clouds are physically distributed and each cloud node provides only its computation ability, the trustworthiness of the nodes cannot be guaranteed. Therefore, to preserve data privacy, only a node that is known to be safe, rather than every node, can access the database. In our architecture, we call this node the trusted node or kernel node, and the cloud in which it is located the kernel cloud. For efficient data transmission among clouds, each cloud is designed to have only one node that connects to other clouds, named the connection node and abbreviated as conn-node. If a node N needs data from the trusted node, N asks the conn-node of its cloud whether the conn-node already has the data. If it does, N downloads the data from the conn-node via the intranet. Otherwise, the data is first duplicated to the conn-node via the internet, and then N downloads it from the conn-node via the intranet. With this transmission policy, each cloud receives the data over the internet at most once, which keeps the network latency low.
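The transmission policy of Section 3.1 behaves like a per-cloud cache held by the conn-node. The following is a minimal sketch under that reading; the class names, the `fetch` function and the transfer counter are hypothetical and only simulate the policy in a single process.

```python
class TrustedNode:
    """Holds the data; counts how often it is sent over the internet."""

    def __init__(self, data):
        self.data = data
        self.internet_transfers = 0

    def send(self, key):
        self.internet_transfers += 1
        return self.data[key]


class ConnNode:
    """The single node of a cloud that talks to other clouds."""

    def __init__(self, name):
        self.name = name
        self.cache = {}


def fetch(key, conn_node, trusted_node):
    """Return `key` for a computing node, going through its conn-node."""
    if key not in conn_node.cache:
        # Internet transfer: happens at most once per cloud and key.
        conn_node.cache[key] = trusted_node.send(key)
    # Intranet transfer: the computing node reads from its conn-node.
    return conn_node.cache[key]


trusted = TrustedNode({"FPT": b"compressed-fp-tree-bytes"})
conn_a = ConnNode("cloud-A")
first = fetch("FPT", conn_a, trusted)   # triggers one internet transfer
second = fetch("FPT", conn_a, trusted)  # served from the conn-node
```

Under this policy the number of internet transfers grows with the number of clouds rather than with the number of computing nodes.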


[Figure 1 diagram. Legend: Physical Machine; Database; Trusted Node (Virtual Machine); Connection Node (Virtual Machine); Computing Node (Virtual Machine).]

Figure 1. Proposed architecture for frequent pattern mining.

In this architecture, each conn-node maintains a table that records the status of the nodes in its cloud. The recorded information for each node contains the node's ID and its availability. All of the tables are gathered in the kernel node, so that the kernel node has complete information about the computation ability in terms of available nodes. The information is updated periodically.

3.2 Mining frequent patterns in cloud computing environments
One characteristic of the proposed algorithm is that data privacy is preserved. Unlike the parallel Apriori-like algorithms [4], which need to duplicate the database to remote nodes, or the BTP-tree algorithm [15], which distributes parts of the database directly to cloud nodes, only the kernel node is permitted to access the database in our architecture and algorithm. Besides the privacy leak of the conventional algorithms, the time required to duplicate the physical database is considerable.

The data structure used by the proposed algorithm is based on that of FP-growth. The FP-tree is a data structure that stores the frequent items in compressed form. Because the items with support smaller than the support threshold are filtered out and the filtered transactions are merged into the FP-tree, retrieving the complete transaction of any user from the FP-tree is impossible. Moreover, because the FP-tree is often implemented as a linked list and our algorithm also compresses the FP-tree with ZIP to reduce the transmission time, the transactions cannot be reconstructed. Data privacy can therefore be preserved.
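The two-scan FP-tree construction that Section 3.2 relies on can be sketched as follows. This is a minimal illustration in the spirit of the construction by Han et al. [6], not the authors' Java implementation; the class and function names, the tie-breaking rule and the toy database are assumptions.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, a count, and links to parent/children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    """Build an FP-tree and its header table with two scans of the database."""
    # Scan 1: count each item and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    # Order items by descending count (ties broken by name, an assumption).
    ordered_items = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(ordered_items)}

    root = FPNode(None)
    header = {item: [] for item in ordered_items}
    # Scan 2: insert each transaction, minus its infrequent items, into the tree.
    for t in transactions:
        path = sorted((i for i in set(t) if i in rank), key=rank.get)
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

# Hypothetical toy database; "d" occurs once and is filtered out.
db = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "c"]]
root, header = build_fp_tree(db, min_count=2)
```

In the resulting tree the transaction `["a", "d"]` is indistinguishable from a transaction that contained only `a`, which illustrates why complete original transactions cannot be read back from the tree.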

3.3 FD-Mine algorithm
The purpose of FD-Mine is fast mining. In cloud computing environments, distributing the mining computation involves data transmission over the network. In BTP-tree [15], the database is divided equally into several parts and sent to the available nodes. The nodes then request the required data from each other to finish the mining task. Since the database is often large, this approach not only leaks the data but also incurs a lot of data transmission over the network, so its performance is expected to be poor.

An intuitive way to save time is to minimize the amount of data transmission. FD-Mine is designed to transmit as little data as possible in order to save network latency and disk I/O time. The algorithm is presented in Figure 2 and described below. The trusted node TN follows the FP-tree construction algorithm to scan the database twice and constructs the corresponding FP-tree, which is stored in TN (line 1). The next step is to obtain the header table HT (line 2) and to divide HT into |N| disjoint sets, stored in IS (line 3). Since the frequent patterns are not predictable, HT is divided randomly with the goal of balancing the load of each node. For execution efficiency, the most important issue is that the amount of data transmission should be minimized. To minimize it, the FP-tree constructed on TN is duplicated to each idle node. In cloud computing environments, we also consider the problem of network latency. Since internet latency is always larger than intranet latency, the FP-tree duplication should be done over the intranet.

Algorithm FD-Mine
Input: A transaction database DB, a minimum support threshold ξ, the trusted node TN, and a set of nodes N with cloud architecture C
Output: The complete set of frequent patterns, FP
1   TN.FPT ← constructFPTree(DB, ξ)   // TN reads the DB and constructs the corresponding FP-tree
2   HT ← getHT(FPT)                   // Obtain the header table of FPT
3   IS ← divideHT(|N|)                // Randomly divide the items of HT into |N| disjoint sets
4   FOR i = 1 TO |IS|
5     n ← selectNode(N, i)            // Select the ith node
6     cn ← selectConnNode(n, C)       // Select the conn-node of n
7     IF (isExistFPT(cn) == FALSE)
8       cn.FPT ← TN.FPT               // Duplicate FPT from TN if cn does not have FPT
9     ENDIF
10    n.FPT ← cn.FPT                  // Duplicate FPT from the conn-node of n
11    is_i ← getSet(IS, i)            // Obtain the ith set of IS
12    fp_i ← N_i.BatchFPGrowth(is_i)  // Batch-run FP-growth for each conditional item in is_i to mine the frequent patterns
13    FP ← FP ∪ fp_i
14  ENDFOR
15  RETURN FP

Figure 2. FD-Mine Algorithm.
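The control flow of Figure 2 can be simulated in a single process as a rough sketch. This is not the authors' implementation: the function names are hypothetical, the loop over item sets stands in for the remote nodes, and, for brevity, `mine_for_item` enumerates patterns by brute force over the transactions, whereas in FD-Mine each node would run FP-growth on its own copy of the FP-tree and would never see the database.

```python
import random
from collections import Counter
from itertools import combinations

def divide_header_table(items, n_nodes, seed=None):
    """Randomly split the header-table items into n_nodes disjoint sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_nodes] for i in range(n_nodes)]

def mine_for_item(item, transactions, rank, min_count):
    """Frequent patterns whose lowest-ranked item is `item`.

    Brute-force stand-in for FP-growth on the conditional tree of `item`.
    """
    base = [frozenset(i for i in t if i in rank and rank[i] < rank[item])
            for t in transactions if item in t]
    prefix_items = sorted({i for t in base for i in t})
    patterns = {}
    for k in range(len(prefix_items) + 1):
        for combo in combinations(prefix_items, k):
            count = sum(1 for t in base if t.issuperset(combo))
            if count >= min_count:
                patterns[frozenset(combo) | {item}] = count
    return patterns

def fd_mine_sketch(transactions, min_count, n_nodes, seed=0):
    """Partition the frequent items, mine each part, and union the results."""
    counts = Counter(i for t in transactions for i in set(t))
    frequent = sorted((i for i in counts if counts[i] >= min_count),
                      key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(frequent)}
    item_sets = divide_header_table(frequent, n_nodes, seed)
    result = {}
    for item_set in item_sets:      # each iteration models one node
        for item in item_set:       # batch-run over the node's items
            result.update(mine_for_item(item, transactions, rank, min_count))
    return result

# Hypothetical toy database.
db = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
one_node = fd_mine_sketch(db, min_count=2, n_nodes=1)
three_nodes = fd_mine_sketch(db, min_count=2, n_nodes=3)
```

Every frequent pattern has exactly one lowest-ranked item, so the per-item results are disjoint and their union is the complete set regardless of how the header table is split. That is why a random division only affects load balance, not correctness.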


For this reason, the FP-tree duplication is processed as follows. First, the algorithm selects an idle node n (line 5) and selects the connection node cn of n from the cloud architecture C (line 6). If cn has no duplicated FP-tree, TN duplicates one to cn (lines 7 to 9). To minimize the transmission overhead, the FP-tree should be compressed in advance. Afterwards, node n obtains the compressed FP-tree via the intranet and decompresses it (line 10). After receiving the FP-tree, node n is assigned a subset of IS (line 11) and batch-runs FP-growth for each conditional item in the subset to mine the frequent patterns (lines 12 to 13). Each node therefore needs only one data transmission, i.e. the FP-tree duplication, and that transmission takes place over the intranet to minimize the network latency. After all of the |N| disjoint sets are processed, the frequent patterns are returned (line 15).

IV. Experimental Results
To evaluate the performance of the proposed algorithm, we use IBM's Quest Synthetic Data Generator [1] to generate the workload data for mining. The experiments were conducted on a cloud system with three clouds. The first cloud contains four nodes, including the kernel node, each equipped with an E8400 2.4 GHz CPU, 1 GB of available RAM and 320 GB of disk storage. The second and third clouds contain four and three nodes respectively, each equipped with a P8600 2.4 GHz CPU, 1 GB of available RAM and 160 GB of disk storage. Note that the kernel node is responsible for receiving the requests and is not used for mining. Therefore, ten nodes in total can be used for mining in the system. Since there are very few parallel and privacy-preserving algorithms for frequent pattern mining, and since most of the existing parallel algorithms follow the database dividing approach, we select BTP-tree for comparison, which is one of the most efficient algorithms that can parallelize the mining task on grid systems. Both FD-Mine and BTP-tree were implemented in Java, and the message passing among nodes and the remote function calls were implemented with Java RMI.

4.1 Effects of varying the number of cloud nodes
In the following experiments, we investigate the performance of FD-Mine in terms of execution time by varying the number of cloud nodes from 1 to 10. The performance results for database T20.I5.N100K.D100K are described. The support threshold is set to 0.03%, which is a very small value, in order to stress both algorithms, FD-Mine and BTP-tree. Figure 3 shows that the required execution time of FD-Mine and BTP-tree decreases as the number of nodes increases. The execution time of FD-Mine is almost the same as that of BTP-tree when only one node is available. This is expected because both of them perform FP-growth on a single node. The execution time of FD-Mine is slightly more than that of BTP-tree when the number of nodes is 2 or 3. This is because the time spent on FP-tree compression and decompression is more than the time needed to directly transmit the divided parts of the database. When there are more than 3 nodes, FD-Mine shows the advantage of sending after compression: less time is required to complete the whole mining task.

[Figure 3 plot: execution time versus number of nodes.]

Figure 3. The execution time for FD-Mine and BTP-tree with number of nodes varied on dataset T20.I5.N100K.D100K.

Figure 4 shows the impact on execution time when the average transaction length is increased to 40. FD-Mine delivers better performance than BTP-tree when the number of nodes is greater than 2. The reason is that BTP-tree, as a database dividing approach, needs the nodes to exchange transactions with each other, and its performance suffers from the large number of exchanged transactions.

Figure 5 shows the performance of FD-Mine and BTP-tree when the number of transactions is set to 200K. In this experiment, FD-Mine outperforms BTP-tree when the number of nodes is greater than 2, which demonstrates the intrinsic drawback of the database dividing approach. In this series of experiments, FD-Mine not only preserves data privacy but also delivers better execution time than BTP-tree, especially when the database is large.

4.2 Effects of varying the parameters of dataset
In this section, we study the effects of varying the support threshold and the parameters of the data generator, namely the number of transactions and the average transaction length. Two algorithms, FD-Mine and BTP-tree, are compared in the experiment.
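The experiments refer to datasets by names such as T20.I5.N100K.D100K and to support thresholds given in percent. As a reading aid, the sketch below decodes such a name and converts a percentage threshold into an absolute transaction count. The text confirms that T is the average transaction length and D the number of transactions; reading I as the average size of the maximal potentially frequent itemsets and N as the number of items follows the usual convention of the Quest generator and is an assumption here. The function names are hypothetical, and the count assumes a pattern is frequent when it reaches the threshold.

```python
import math
import re
from fractions import Fraction

def parse_dataset_name(name):
    """Decode a Quest-style dataset name such as 'T20.I5.N100K.D100K'."""
    match = re.fullmatch(r"T(\d+)\.I(\d+)\.N(\d+)K\.D(\d+)K", name)
    if match is None:
        raise ValueError(f"unrecognized dataset name: {name!r}")
    t, i, n, d = (int(g) for g in match.groups())
    return {
        "avg_transaction_length": t,
        "avg_pattern_length": i,       # assumed meaning of I
        "num_items": n * 1000,         # assumed meaning of N
        "num_transactions": d * 1000,
    }

def min_support_count(num_transactions, threshold_percent):
    """Smallest transaction count that reaches the given percentage.

    `threshold_percent` is passed as a string (e.g. "0.03") so that the
    arithmetic is exact rather than subject to floating-point rounding.
    """
    exact = Fraction(threshold_percent) * num_transactions / 100
    return math.ceil(exact)

params = parse_dataset_name("T20.I5.N100K.D100K")
count = min_support_count(params["num_transactions"], "0.03")
```

For the 100K-transaction dataset, a threshold of 0.03% corresponds to only 30 transactions, which shows how small the threshold used in the experiments is.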


[Figure 4 plot: execution time versus number of nodes.]

Figure 4. The execution time for FD-Mine and BTP-tree with number of nodes varied on dataset T40.I5.N100K.D100K.

[Figure 5 plot: execution time versus number of nodes.]

Figure 5. The execution time for FD-Mine and BTP-tree with number of nodes varied on dataset T20.I5.N100K.D200K.

In Figure 6, we explore the impact on execution time of varying the support threshold from 0.05% to 0.01% with ten cloud nodes. FD-Mine always requires less time than BTP-tree. The efficiency of FD-Mine is mainly achieved by reducing the transmission overhead and the disk I/O time. In the experiment, the time required by FD-Mine is only about 82% of the execution time of BTP-tree on average.

[Figure 6 plot: execution time versus support threshold (%).]

Figure 6. The execution time for FD-Mine and BTP-tree with support threshold varied on dataset T20.I5.N100K.D100K.

V. Conclusions
In this paper, we have presented an algorithm named FD-Mine that efficiently utilizes the cloud nodes to discover frequent patterns in cloud computing environments with data privacy preserved. The proposed FD-Mine is composed of two algorithms, namely HD-Mine and FD-Mine. Conventional algorithms are limited by the available memory when mining datasets with a large number of frequent patterns. The proposed HD-Mine is able to discover the frequent patterns from this kind of dataset by merging the memory of several nodes. The proposed FD-Mine focuses on the fast discovery of frequent patterns by utilizing the cloud nodes, and is useful for applications that emphasize real-time mining. Another important characteristic of the proposed algorithms is that data privacy is preserved. Unlike the parallel Apriori-like algorithms, which need to duplicate the database to remote nodes, or the BTP-tree algorithm, which distributes parts of the database directly to cloud nodes, the database is never duplicated and only the kernel node is permitted to access it in our architecture and algorithms. Through empirical evaluations on various simulation conditions, the proposed FD-Mine delivers excellent performance in terms of scalability and execution time.

Acknowledgement
This research was partially supported by National Science Council, Taiwan, ROC under Grant No. 97-2218-E-151-003-MY2.

References
[1] R. Agrawal and R. Srikant. Quest Synthetic Data Generator. IBM Almaden Research Center, San Jose, California, http://www.almaden.ibm.com/cs/quest/syndata.html.
[2] R. Agrawal, T. Imielinski, A. Swami. Mining association rules between sets of items in large databases. In: Proc. ACM SIGMOD Intl. Conf. Management Data, 1993.
[3] R. Agrawal, R. Srikant, Mining Sequential Patterns, in: Proc. of the 11th Int'l Conf. on Data Engineering, 1995, pp. 3-14.
[4] R. Agrawal, John C. Shafer, "Parallel Mining of Association Rules", IEEE Transactions on Knowledge and Data Engineering, December 1996.
[5] R. J. Bayardo, Jr., Brute-force mining of high-confidence classification rules. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD'97), Newport Beach, California, USA.
[6] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns Without Candidate Generation. Proc. of ACM Int. Conf. on Management of Data (SIGMOD), 1-12, 2000.
[7] J.D. Holt, S.M. Chung, "Parallel mining of association rules from text databases on a cluster of workstations," Proceedings of 18th International Symposium on Parallel and Distributed Processing, 2004, pp. 86.
[8] P. Iko and M. Kitsuregawa, "Shared Nothing Parallel Execution of FPgrowth." DBSJ Letters, Volume 2, No. 1, 2003, pp. 43-46.
[9] A. Javed, A. Khokhar, "Frequent Pattern Mining on Message Passing Multiprocessor Systems," Distributed and Parallel Databases, Volume 16, Issue 3, 2004, pp. 321-334.
[10] T. Li, S. Zhu, M. Ogihara, "A New Distributed Data Mining Model Based on Similarity," Symposium on Applied Computing, 2003, pp. 432-436.
[11] M. Ester, H.-P. Kriegel, J. Sander, X. Xu: "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, 1996, pp. 226-231.
[12] Y. Qiu, Y. J. Lan and Q. S. Xie, "An improved algorithm of mining from FP-tree," Proceedings of the Third International Conference on Machine Learning and Cybernetics, pp. 26-29, 2004.
[13] E.-H. S. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. IEEE Transactions on Knowledge and Data Engineering, 12(3):352-377, 2000.
[14] J. Zhou, K.-M. Yu, "Tidset-based Parallel FP-tree Algorithm for the Frequent Pattern Mining Problem on PC Clusters," Lecture Notes in Computer Science 5036, 2008, pp. 18-28.
[15] J. Zhou, K.-M. Yu, Balanced Tidset-based Parallel FP-tree Algorithm for the Frequent Pattern Mining on Grid System, Fourth International Conference on Semantics, Knowledge and Grid, 2008.
