April 2010, 17(2): 122–126 www.sciencedirect.com/science/journal/10058885 www.buptjournal.cn/xben

The Journal of China Universities of Posts and Telecommunications

Distributed reduction algorithm on grid service
DENG Song, WANG Ru-chuan, FU Xiong

College of Computer, Nanjing University of Posts and Telecommunications, Nanjing 210003, China

Received date: 30-01-2009; Corresponding author: WANG Ru-chuan, E-mail: [email protected]; DOI: 10.1016/S1005-8885(09)60457-X

Abstract

Pretreatment of massive, high dimensional data plays an important role in data mining in grid environments. To solve the optimal reduction problem effectively, a distributed reduction algorithm on grid service is presented. It combines grid services with a novel reduction algorithm based on gene expression programming (GEP), called RA-GEP. Simulation experiments show that, for massive or high dimensional data sets, the proposed algorithm has advantages in speed and solution quality over traditional attribute reduction algorithms based on intelligent computing.

Keywords gene expression programming, distributed reduction, grid service, intelligent computing

1 Introduction

Mass data storage in grid environments is highly important. To achieve better storage, data partitioning is required, and the rough set (RS) has been introduced into data partitioning. However, partitioning massive, high dimensional data is time-consuming and difficult, so attribute reduction needs to be performed before partitioning. Moreover, if distributed massive data were gathered for centralized processing, a series of problems concerning the security and privacy of the distributed data would arise and plenty of network bandwidth would be required. In contrast, the grid provides a favorable computing platform for distributed reduction.

Attribute reduction is an important pretreatment step for high dimensional and massive data mining. Attribute reduction extracts from the original data a subset of attributes that reflects the information of all attributes without affecting the decisions made on the original data. Traditional methods include principal component analysis (PCA) [1], singular value decomposition (SVD) [2] and RS [3–4]. The former two methods inevitably lose part of the original information, whereas attribute reduction based on rough sets does not change the decision rules of the reduced data. Because attribute reduction is not unique, it is desirable to find the best reduction, the one containing the fewest attributes, in order to decrease complexity. Nevertheless, the computational complexity of attribute reduction grows exponentially with the size of the decision table, and computing the best reduction has been proved to be NP-hard [3]. Hence, heuristic algorithms are adopted to solve this problem [5].

Traditional attribute reduction algorithms find the best reduction using heuristic information such as the discernibility matrix and attribute importance [3–4]. Many researchers have also proposed obtaining the best attribute reduction by intelligent computation. In Ref. [6], Zhai et al. put forward a feature extraction algorithm based on the genetic algorithm and RS, designing a fitness function that reflects attribute features. Attribute reduction based on ant colony algorithms was put forward in Refs. [7–9]. Hedar et al. applied tabu search to find the best attribute reduction [10]; it was shown that, under the same conditions, attribute reduction based on tabu search performs as well as other intelligent algorithms. In Ref. [11], the authors adopted particle swarm optimization to solve attribute reduction; experiments showed that it is effective compared with traditional attribute reduction based on the positive region and the discernibility matrix. However, for complicated problems the implementation of these algorithms is complex, and they are not efficient for attribute reduction on high dimensional data sets.


GEP [12–13] combines the advantages of the genetic algorithm (GA) and genetic programming (GP). As a self-adaptive, stochastic algorithm of high efficiency, it has proven effective in solving NP-hard problems. To overcome the drawbacks of traditional attribute reduction algorithms, this article puts forward RA-GEP. Based on RA-GEP and grid services, a distributed reduction algorithm on grid service (DRGS) is then presented.

The remainder of the article is organized as follows. RA-GEP is introduced in Sect. 2. Sect. 3 proposes DRGS. Sect. 4 presents comparative experiments and performance analysis. Sect. 5 concludes the article.

2 RA-GEP

2.1 Preliminaries

This subsection introduces related concepts of RS. For convenience and clarity, the following definitions are given [3–4].

Let a decision table be $T = \langle U, C \cup D, V, f \rangle$, where $U$, called the universe, is a nonempty set of objects; $R = C \cup D$ is the attribute set of the sample data; $C = \{c_1, c_2, \ldots, c_n\}$ is the set of condition attributes; $D = \{d_1, d_2, \ldots, d_m\}$ is the set of decision attributes; $V = \bigcup_{r \in R} V_r$ is a nonempty set of attribute values, with value domain $V_r$ for every $r \in R$; and $f: U \times R \to V$ is a function that assigns an attribute value to every $x$ in $U$, that is, $f(x, r) \in V_r$ holds for all $r \in R$, $x \in U$.

Definition 1 Positive region: let a decision table be $T = \langle U, C \cup D, V, f \rangle$ with $R = C \cup D$. For any $X \subseteq U$, $\underline{R}(X) = \bigcup \{Y \in U/R : Y \subseteq X\}$ is called the R-positive region of $X$ and is denoted by $\mathrm{POS}_R(X)$.

Definition 2 Dependence degree: let a decision table be $T = \langle U, C \cup D, V, f \rangle$. The dependence degree $r_C(D)$ of the decision attributes $D$ on the condition attributes $C$ equals $\mathrm{card}(\mathrm{POS}_C(D)) / \mathrm{card}(U)$, where $\mathrm{card}(\ast)$ denotes the number of elements of $\ast$.

Definition 3 Consistency (coordination): let a decision table be $T = \langle U, C \cup D, V, f \rangle$. If objects with the same condition attribute values always have the same decision attribute values, then $T$ is consistent.

For convenience of analysis, attribute reduction is discussed only for consistent decision tables.

Definition 4 Optimal attribute reduction: let a condition attribute set $C'$ be a reduction of decision table $T$. The reduction $C'$ that contains the minimum number of condition attributes is called the optimal attribute reduction of $T$.

Lemma 1 Let the decision table $T = \langle U, C \cup D, V, f \rangle$ be consistent. For any $c \in C$, construct $T' = \langle U, (C - \{c\}) \cup D, V, f \rangle$; if $T'$ is consistent, then the condition attribute $c$ is reducible.

Proof According to Definitions 2 and 3 and the given conditions, the following equivalence can be deduced:

$r_C(D) = 1 \Leftrightarrow \dfrac{\mathrm{card}(\mathrm{POS}_C(D))}{\mathrm{card}(U)} = 1 \Leftrightarrow \mathrm{card}(\mathrm{POS}_C(D)) = \mathrm{card}(U)$   (1)

Because $T'$ is consistent, Eq. (2) can also be obtained:

$r_{C-\{c\}}(D) = 1 \Leftrightarrow \dfrac{\mathrm{card}(\mathrm{POS}_{C-\{c\}}(D))}{\mathrm{card}(U)} = 1 \Leftrightarrow \mathrm{card}(\mathrm{POS}_{C-\{c\}}(D)) = \mathrm{card}(U)$   (2)

From Eqs. (1) and (2), $\mathrm{POS}_C(D) = \mathrm{POS}_{C-\{c\}}(D)$. That is, the condition attribute $c$ is reducible.
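To make Definitions 1–3 and Lemma 1 concrete, the following Python sketch computes the dependence degree $r_C(D)$ on a small toy decision table and uses it to test whether a condition attribute is reducible. It is only an illustration of the definitions above; the toy table, attribute names and function names are ours, not the paper's.

from collections import defaultdict

def dependence_degree(table, cond, dec):
    """table: list of dicts (objects); cond, dec: lists of attribute names."""
    # Group objects by their condition-attribute values (the classes of U/C).
    blocks = defaultdict(list)
    for row in table:
        blocks[tuple(row[a] for a in cond)].append(row)
    # A class belongs to POS_C(D) iff all of its objects share the same
    # decision values, i.e. it is contained in a single class of U/D.
    pos = sum(len(b) for b in blocks.values()
              if len({tuple(r[a] for a in dec) for r in b}) == 1)
    return pos / len(table)

def is_reducible(table, cond, dec, c):
    """Lemma 1: c is reducible if the table stays consistent without c."""
    reduced = [a for a in cond if a != c]
    return dependence_degree(table, reduced, dec) == 1.0

# Toy decision table with condition attributes a, b, c and decision d.
T = [
    {"a": 0, "b": 0, "c": 1, "d": "yes"},
    {"a": 0, "b": 1, "c": 1, "d": "no"},
    {"a": 1, "b": 0, "c": 0, "d": "yes"},
    {"a": 1, "b": 1, "c": 0, "d": "no"},
]
print(dependence_degree(T, ["a", "b", "c"], ["d"]))  # 1.0: T is consistent
print(is_reducible(T, ["a", "b", "c"], ["d"], "c"))  # True: c can be dropped
print(is_reducible(T, ["a", "b", "c"], ["d"], "b"))  # False: b is needed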

2.2 GEP code

GEP adopts a linear code of fixed length to represent an individual, which is called a chromosome in GEP. A chromosome can consist of one or more genes. Each gene is composed of a gene head h and a gene tail t, where both operators and terminals may appear in h, while t may contain only terminals.

In RA-GEP, a special operator '∪' of arity two is introduced into the gene head, and the gene tail consists of condition attributes randomly drawn from the decision table.

Let a decision table be $T = \langle U, C \cup D, V, f \rangle$ with $C = \{c_1, c_2, \ldots, c_n\}$ and $a_i \in C$, $i \in [1, n]$. The operator '∪' obeys the following rules:
Rule 1: $a_i \cup a_i = \{a_i\}$;
Rule 2: $a_i \cup a_j = \{a_i, a_j\}$, $i \neq j$.

Example 1 Let a decision table be $T = \langle U, C \cup D, V, f \rangle$ with condition attribute set $C = \{a, b, c, d, e, f, g, h\}$; a single-gene RA-GEP chromosome over these attributes is shown in Fig. 1.

Fig. 1 Single-gene chromosome of RA-GEP

As shown in Fig. 1, such reduction chromosomes are valid and represent the condition attribute combinations generated during evolution. In RA-GEP, to evaluate the fitness of a reduction chromosome, the chromosome must first be mapped into an expression tree in a fixed order.
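As an illustration of this encoding, the sketch below decodes a single gene, a head of union operators followed by a tail of condition attributes, into the attribute set it represents, by reading the gene breadth-first into an expression tree (as GEP does) and applying Rules 1 and 2. It is a minimal sketch under the coding rules above; the ASCII symbol 'U' stands in for '∪', and the example gene is hypothetical.

def decode(gene):
    """Decode a Karva-notation gene into the set of condition attributes it
    represents. 'U' stands for the union operator (arity two); every other
    symbol is a terminal, i.e. a condition attribute."""
    def arity(symbol):
        return 2 if symbol == "U" else 0

    levels, used = [[gene[0]]], 1
    # Build the expression tree level by level (breadth-first), as in GEP.
    while any(arity(s) > 0 for s in levels[-1]):
        need = sum(arity(s) for s in levels[-1])
        levels.append(list(gene[used:used + need]))
        used += need
    # With union as the only operator, evaluating the tree just collects
    # every terminal in it, which is exactly what Rules 1 and 2 produce.
    return {s for level in levels for s in level if arity(s) == 0}

# Hypothetical single-gene chromosome: head "UUUa" (length 4), tail "bcdef".
gene = list("UUUabcdef")
print(decode(gene))  # {'a', 'b', 'c', 'd'}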


2.3 GEP population

For high dimensional data, the genes in RA-GEP become excessively long, which decreases the efficiency of GEP evolution and weakens the ability of the RA-GEP encoding to reach the optimal reduction. Therefore, to reduce the gene length, a dynamic population creation strategy (DPCS) is proposed in RA-GEP.

The basic steps of DPCS are as follows (see the sketch after the algorithm):
Algorithm 1 DPCS
Input: OldPopulation
Output: NewPopulation
Step 1 Sort all individuals by fitness value;
Step 2 Select individuals by tournament selection and put them into set S;
Step 3 According to Rules 1 and 2, map S to a condition attribute set C′;
Step 4 Generate the new population.
DPCS can effectively handle high dimensional data: the chromosome length decreases continuously as GEP evolves and gradually reaches an equilibrium state.
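A minimal Python sketch of DPCS is given below. It assumes each individual is represented by its fitness value together with the list of condition attributes decoded from its chromosome; the tournament size and all names are illustrative rather than taken from the paper.

import random

def dpcs(old_population, tournament_size=3):
    """Dynamic population creation strategy (DPCS), sketched.
    old_population: list of (fitness, decoded_terminals) pairs, where
    decoded_terminals lists the condition attributes appearing in the
    chromosome, possibly with duplicates."""
    # Step 1: sort all individuals by fitness value (best first).
    ranked = sorted(old_population, key=lambda ind: ind[0], reverse=True)
    new_population = []
    for _ in range(len(ranked)):
        # Step 2: tournament selection puts the winner into set S.
        contenders = random.sample(ranked, tournament_size)
        fitness, attrs = max(contenders, key=lambda ind: ind[0])
        # Step 3: Rules 1 and 2 reduce S to a condition attribute set C',
        # i.e. duplicated attributes collapse, so the re-encoded gene only
        # needs |C'| terminals and the chromosome becomes shorter.
        reduced = set(attrs)
        # Step 4: the new population is built from the shortened individuals.
        new_population.append((fitness, reduced))
    return new_population

# Hypothetical usage: three individuals over attributes a..e.
pop = [(0.91, ["a", "b", "a", "c"]),
       (0.87, ["a", "c", "d", "e", "d"]),
       (0.95, ["b", "d", "b"])]
print(dpcs(pop))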

2.4 RA-GEP

Computing an optimal attribute reduction has been proved to be NP-hard. GEP [12–13] has strong global search ability and therefore considerable potential for solving NP-hard problems.

The steps of RA-GEP are as follows (a sketch is given after the algorithm):
Algorithm 2 RA-GEP
Input: decision table $T = \langle U, C \cup D, V, f \rangle$, GEP parameters (GEPParas)
Output: the best attribute reduction of the grid node (bestAR)
Step 1 Initialize the population;
Step 2 While (gen < MaxGen) {
Step 3   Evaluate the population;
Step 4   Perform selection, mutation and one-point recombination;
Step 5   Select the individual with the maximum fitness value as the current condition attribute sequence;
Step 6   Call Algorithm 1 to generate the new population; }
Step 7 Return bestAR.
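The sketch below shows one way the RA-GEP loop could be realized. For brevity it evolves decoded attribute sets rather than full Karva chromosomes, and it assumes a fitness function that rewards decision table coordination (dependence degree equal to 1) and penalizes the number of attributes, since the paper does not spell these details out here; all parameter values and helper names are illustrative.

import random
from collections import defaultdict

def dependence_degree(table, cond, dec):
    """r_C(D) = card(POS_C(D)) / card(U), as in Definition 2."""
    blocks = defaultdict(list)
    for row in table:
        blocks[tuple(row[a] for a in cond)].append(row)
    pos = sum(len(b) for b in blocks.values()
              if len({tuple(r[a] for a in dec) for r in b}) == 1)
    return pos / len(table)

def fitness(attrs, table, cond, dec):
    """Assumed fitness: reward a consistent reduced table (coordination)
    and prefer fewer condition attributes."""
    if not attrs:
        return 0.0
    return dependence_degree(table, sorted(attrs), dec) + (1 - len(attrs) / len(cond))

def ra_gep(table, cond, dec, pop_size=50, max_gen=200, mut_rate=0.05):
    """Sketch of the RA-GEP loop: selection, mutation, one-point
    recombination, then a DPCS-style rebuild of the population."""
    pop = [set(random.sample(cond, random.randint(1, len(cond))))
           for _ in range(pop_size)]
    best = max(pop, key=lambda a: fitness(a, table, cond, dec))
    for _ in range(max_gen):
        # Tournament selection.
        parents = [max(random.sample(pop, 3), key=lambda a: fitness(a, table, cond, dec))
                   for _ in range(pop_size)]
        nxt = []
        for i in range(pop_size):
            a, b = list(parents[i]), list(parents[(i + 1) % pop_size])
            # One-point recombination on the decoded attribute lists.
            cut = random.randint(0, min(len(a), len(b)))
            child = set(a[:cut] + b[cut:]) or {random.choice(cond)}
            # Mutation: toggle one randomly chosen condition attribute.
            if random.random() < mut_rate:
                child ^= {random.choice(cond)}
            nxt.append(child or {random.choice(cond)})
        # DPCS-style step: duplicates already collapse because individuals
        # are attribute sets; keep the best individual found so far (elitism).
        pop = nxt + [set(best)]
        cand = max(pop, key=lambda a: fitness(a, table, cond, dec))
        if fitness(cand, table, cond, dec) > fitness(best, table, cond, dec):
            best = cand
    return best

# Hypothetical usage with the toy table T from the sketch in Sect. 2.1:
# ra_gep(T, ["a", "b", "c"], ["d"]) typically converges to {"b"}.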

3 DRGS

3.1 Idea of algorithm

In more and more applications, massive data are stored in a distributed fashion. If these data were gathered for centralized processing and reduction, a series of security and privacy problems would arise and a considerable amount of network bandwidth would be required. Grid computing provides a platform for distributed data processing and connects various distributed nodes. The platform not only shares the computing and storage resources of individual nodes, but also facilitates the processing of distributed data, which saves network bandwidth and improves data processing power.

For convenience of simulation, the distributed data sets in DRGS are partitioned horizontally; that is, every node holds the same attributes. DRGS proceeds as follows: first, the grid service layer locates the data resources and the computing nodes; second, the RA-GEP algorithm is called on each node to compute its reduction. A sketch of this horizontal partitioning is given below.
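As a small, illustrative sketch (not the paper's implementation) of horizontal partitioning, the rows of a decision table are split across nodes while every node keeps the full attribute set; the row data below are made up.

def partition_horizontally(table, num_nodes):
    """Split a decision table into num_nodes horizontal fragments: every
    fragment keeps all attributes but only a slice of the objects."""
    return [table[i::num_nodes] for i in range(num_nodes)]

# Hypothetical usage: six sub-data sets, as in the experiments of Sect. 4.
rows = [{"a": i % 2, "b": i % 3, "d": i % 2} for i in range(12)]
fragments = partition_horizontally(rows, 6)
print([len(f) for f in fragments])  # [2, 2, 2, 2, 2, 2]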

3.2 Description of algorithm

The whole reduction process is decomposed into multiple sub-processes, which are wrapped as grid services and deployed on grid nodes that work together to improve reduction efficiency in the grid environment. DRGS has two parts: a client side and a server side. The former provides a grid portal where users enter the various parameters, and the latter supplies the grid services.

The steps of DRGS are as follows (a sketch of the client side is given after the algorithm):
Algorithm 3 DRGS
Input: decision table $T = \langle U, C \cup D, V, f \rangle$, GEP parameters (GEPParas)
Output: the best attribute reduction of every grid node (bestAR)
1) Server nodes
Step 1 ReceivePara(T, GEPParas, i, GSH); // receive parameters from the client
Step 2 Call Algorithm 2 to compute the reduction of the local data set;
2) Client node
Step 3 T = Read(Sample); // read the sample data and construct decision table T
Step 4 for (int i = 0; i < gridnodes; i++) {
Step 5   TransPara(T, GEPParas, i, GSH); // transport parameters to server node i
Step 6   Return bestAR[i]; } // return the best reduction of every node to the client
The communication overhead of DRGS is concentrated on transmitting the GEP subpopulations to the secondary nodes, which compute the local optimal reductions.
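The sketch below mimics the client loop of Algorithm 3. The paper deploys its reduction services with WS-Core (GT4); since those service stubs are not shown here, the sketch simulates the remote calls with a thread pool and a placeholder invoke_reduction_service, so every name in it (including ra_gep_stub and the parameter dictionary) is illustrative rather than the paper's API.

from concurrent.futures import ThreadPoolExecutor

def ra_gep_stub(table_fragment, gep_paras):
    # Stand-in for Algorithm 2 running on a server node.
    return {"node_rows": len(table_fragment), "reduction": ["a", "b"]}

def invoke_reduction_service(gsh, table_fragment, gep_paras):
    """Placeholder for the grid-service call (TransPara/ReceivePara in
    Algorithm 3). In the real system this would invoke a service identified
    by its grid service handle (GSH); here it calls a local stand-in."""
    return ra_gep_stub(table_fragment, gep_paras)

def drgs_client(fragments, gep_paras, gshs):
    """Client side of DRGS: send one horizontal fragment plus the GEP
    parameters to each grid node, then collect bestAR from every node."""
    with ThreadPoolExecutor(max_workers=len(gshs)) as pool:
        futures = [pool.submit(invoke_reduction_service, gsh, frag, gep_paras)
                   for gsh, frag in zip(gshs, fragments)]
        return [f.result() for f in futures]

# Hypothetical usage with six simulated nodes.
gep_paras = {"pop_size": 500, "mutation_rate": 0.044, "max_gen": 10000}
fragments = [[{"a": 0, "b": 1, "d": "yes"}]] * 6
print(drgs_client(fragments, gep_paras, gshs=[f"node-{i}" for i in range(6)]))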


4 Experiments and analysis

To evaluate the performance of RA-GEP and DRGS, three experiments were conducted on the following platform: P4 1.8 GHz CPU, 1 GB RAM, JDK 1.5, Windows XP, and WS-Core 4.0.2.

Parameters of the experimental data sets are shown in Table 1.

The initial parameters of RA-GEP are shown in Table 2.

Table 1 Data sets
Data name             Number of condition attributes   Number of samples
Connect-4             42                               67 557
Census-Income (KDD)   40                               299 285
KDD Cup 1999          42                               4 000 000

Table 2 GEP initial parameters
Data name             Gene head                  Gene tail              Gene size   Head length   Population size   Mutation rate   Recombination rate   Max generation
Connect-4             ∪ + condition attributes   Condition attributes   1           41            500               0.044           0.33                 10 000
Census-Income (KDD)   ∪ + condition attributes   Condition attributes   3           39            500               0.044           0.33                 10 000
KDD Cup 1999          ∪ + condition attributes   Condition attributes   3           41            500               0.044           0.33                 10 000

Experiment 1 For the data sets above and the GEP parameters in Table 2, a comparative analysis of solution quality and efficiency was made between RA-GEP and four other algorithms, with RA-GEP running on a single computer.

To make a more detailed comparison, the number of attributes in the best reduction obtained for each data set is shown in Fig. 2. RA-GEP outperforms the other four algorithms in solution quality on all tested data sets, mainly because RA-GEP has simple coding, decoding and population generation and effective genetic operations. Fig. 3 shows that the average computational time of RA-GEP is lower than that of all the other algorithms on all tested data sets, mainly because RA-GEP spends less time on decoding, population updating and fitness evaluation.

Fig. 2 Comparison of solution quality between RA-GEP and the other four algorithms

Fig. 3 Comparison of average computational time between RA-GEP and the other four algorithms

Experiment 2 Experiment 1 shows that GEP is advantageous for obtaining the best attribute reduction, and the reduction does not change the inherent properties of the sample data. To demonstrate this, the average classification accuracy of the Naive Bayes algorithm on the three data sets in Table 1 is shown in Table 3.

Table 3 Comparison of average classification accuracy before and after reduction
Data set              Before reduction   After reduction   Error
Connect-4             0.812              0.8211            0.0091
Census-Income (KDD)   0.7684             0.7709            0.0025
KDD Cup 1999          0.856              0.8581            0.0021

As Table 3 shows, the error in average classification accuracy before and after reduction is negligible on the three data sets, with a maximum of only about 0.0091. This is mainly because GEP only removes redundant condition attributes and does not change the inherent properties of these data sets.

Experiment 3 To show the advantage of the grid environment, DRGS uses the grid platform to perform attribute reduction in parallel across distributed nodes, which not only improves reduction efficiency but also greatly reduces the average reduction time. Each data set is split into six sub-data sets, and the average reduction times for different numbers of grid nodes are compared, as shown in Fig. 4.

Fig. 4 shows that, as the number of grid nodes increases, the average reduction time decreases dramatically for all data sets; it decreases fastest for the KDD Cup 1999 and Census-Income (KDD) data sets. This is because, in a LAN, the transmission time for these two data sets is negligible compared with the computing time.


Fig. 4 Comparison of average reduction time for different numbers of grid nodes

5 Conclusions

To partition massive data and store them in a grid environment, reduction is very important. This article proposes DRGS, which combines grid services with RA-GEP. Decision table coordination is used to design a new fitness function, and a dynamic population creation strategy is proposed in RA-GEP to accelerate GEP evolution, improve the treatment of high dimensional data sets and raise evolution efficiency. Simulations show that RA-GEP has an obvious advantage in solution quality and efficiency over traditional attribute reduction algorithms, and that the average reduction time of DRGS decreases dramatically as the number of grid nodes increases.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (60973139, 60773041), the Natural Science Foundation of Jiangsu Province (BK2008451), and the Innovation Project for University of Jiangsu Province (CX09B_153Z, CX08B-086Z).

References

1. Gao H B, Hong W X, Cui J X, et al. Optimization of principal component analysis in feature extraction. Proceedings of the International Conference on Mechatronics and Automation (ICMA’07), Aug 5−8, 2007, Harbin, China. Piscataway, NJ, USA: IEEE, 2007: 3128−3132

2. Baranyi P, Yam Y, Yang C T, et al. SVD based reduction for subdivided rule bases. Proceedings of the 9th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE’00): Vol 2, May 7−10, 2000, San Antonio, TX, USA. Piscataway, NJ, USA: IEEE, 2000: 712−716

3. Pawlak Z. Rough sets. International Journal of Computer and Information Sciences, 1982, 11(5): 341−356

4. Moshkov M, Skowron A, Suraj Z. Maximal consistent extensions of information systems relative to their theories. Information Sciences, 2008,178(12): 2600−2620

5. Hu X H, Cercone N. Learning in relational databases: a rough set approach. Computational Intelligence, 1995, 11(2): 323−338

6. Zhai L Y, Khoo L P, Fok S C. Feature extraction using rough set theory and genetic algorithms—an application for the simplification of product quality evaluation. Computers and Industrial Engineering, 2002, 43(4): 661−676

7. Jensen R, Shen Q. Finding rough set reducts with ant colony optimization. Proceedings of the 2003 UK Workshop on Computational Intelligence (UKCI’03), Sep 1−3 2003, Bristol, UK. 2003: 15−22

8. Ke L G, Feng Z R, Ren Z G. An efficient ant colony optimization approach to attribute reduction in rough set theory. Pattern Recognition Letters, 2008, 29(9): 1351−1357

9. Deng T Q, Yang C D, Zhang Y T, et al. An improved ant colony optimization applied to attributes reduction. Fuzzy Information and Engineering: Vol 1. Proceedings of the 3rd International Conference on Fuzzy Information and Engineering of China (ACFIE’08), Dec 5−8, 2008, Haikou, China. Berlin, Germany: Springer, 2009: 1−6

10. Hedar A R, Wang J, Fukushima M. Tabu search for attribute reduction in rough set theory. Soft Computing, 2008, 12(9): 909−918

11. Wang X Y, Yang J, Teng X L, et al. Feature selection based on rough sets and particle swarm optimization. Pattern Recognition Letters, 2007, 28(4): 459−471

12. Ferreira C. Gene expression programming: a new adaptive algorithm for solving problems. Complex Systems, 2001, 13(2): 87−129

13. Xu K L, Liu Y T, Tang R, et al. A novel method for real parameter optimization based on gene expression programming. Applied Soft Computing, 2009, 9(2): 725−737

(Editor: WANG Xu-ying)