
Analysis of Random Projections Fuzzy and Map Reducing k-Nearest Neighbourhood Algorithm for Big Data Classification

B. Sudha, M.Phil Research Scholar, Department of Computer Science, Alagappa University, Karaikudi, India
Dr. E. Ramaraj, Professor and Head, Department of Computer Science, Alagappa University, Karaikudi, India

Abstract- The k-Nearest Neighbour classifier is one of the data mining techniques used for classification and regression. However, applying classification methods to large amounts of data has become an important task in a large number of real-world applications. This topic is known as Big Data classification, where standard data mining techniques normally fail to cope with such volumes of data. A possible approach to the classification of high-dimensional datasets is to combine a typical classifier, in our case the fuzzy k-nearest neighbour classifier (FKNN), with feature reduction by random projection (RP). As opposed to PCA, where one projection matrix is computed based on least-squares optimality, in RP a projection matrix is chosen at random multiple times. Since the random projection procedure is repeated many times, the question is how to aggregate the values of the classifier obtained in each projection. The map phase decides the k nearest neighbours in different splits of the data; the reduce stage then computes the definitive neighbours from the list produced by the map phase. This structured model allows the k-nearest neighbour classifier to scale to datasets of arbitrary size, simply by adding more computing nodes if necessary. This paper presents a fusion strategy for RP FKNN, denoted RPFKNN. The fusion strategy is based on the class membership values produced by FKNN and the classification accuracy in each projection, and the parallel implementation provides exactly the same classification rate as the original k-NN model.

Keywords: Big Data; fuzzy k-nearest neighbour; FKNN; random projections; ensemble FKNN; Principal Component Analysis.

I. INTRODUCTION

Big Data classification is becoming a necessary task in a wide variety of fields such as social media, the stock and share market, biomedicine, marketing, RFID data, etc. Recent advances in data collection in many of these fields have resulted in a continuous increase in the data that we have to manage. In this paper we present an approach to the classification of high-dimensional datasets that combines the fuzzy k-nearest neighbour algorithm (FKNN) with feature reduction by random projection (RP). As opposed to PCA, where one projection matrix is computed based on least-squares optimality, in RP a random projection matrix is computed multiple times, denoted here as N (the number of RPs). Principal component analysis (PCA) is a way to reduce data dimensionality: PCA projects high-dimensional data to a lower dimension in the least-squares sense, so it captures large variability in the data and ignores small variability. As the FKNN algorithm requires all samples to be loaded into memory at the same time, for n data points from a large space R^p, FKNN has a memory requirement of the order of O(np). As a result of each map, the k nearest neighbours together with their computed distance values are emitted to the reduce stage. The reduce phase then determines the final k nearest neighbours from the list provided by the maps.

II. FUNDAMENTALS OF RANDOM PROJECTIONS

The theoretical background of random projections was first introduced by Johnson and Lindenstrauss. Essentially, the Johnson-Lindenstrauss (JL) lemma states that, with a given probability p, the pairwise distances of a set of points X = {x1, …, xn} ∈ R^p (the upspace) are mostly preserved when X is randomly projected into a downspace R^q with q << p, that is, when the feature space dimensionality is reduced from p to q. If Y = {y1, …, yn} ∈ R^q is the projection of X into R^q, then Y = XR/√q, where R is a random projection matrix. In other words, we do not have to compute a very expensive projection matrix as in PCA to reduce the dimensionality of the feature space: we can obtain reasonable results in a probabilistic sense just by choosing one at random. In this paper we use the projection matrix proposed by Achlioptas, stated as Theorem 1.

Theorem 1: Consider a set of n points in R^p, represented as an n × p matrix X. Given ε, β > 0, let

\[ q_0 = \frac{4 + 2\beta}{\varepsilon^2/2 - \varepsilon^3/3} \, \log n \qquad (1) \]



For integers q ≥ q0, let R be a p × q random matrix with R(i, j) = r_ij, where {r_ij} are independent random variables drawn from the following probability distribution:

\[ r_{ij} = \begin{cases} +1 & \text{with probability } 1/2 \\ -1 & \text{with probability } 1/2 \end{cases} \qquad (2) \]

If Y = XR/√q is the projection of the n points into R^q, then, with probability p > 1 − n^(−β), for all i, j ∈ [1, n] the distances in R^q are within 1 ± ε of the ones in R^p, that is:

\[ (1 - \varepsilon)\,\lVert X_i - X_j \rVert^2 \;\le\; \lVert Y_i - Y_j \rVert^2 \;\le\; (1 + \varepsilon)\,\lVert X_i - X_j \rVert^2 \qquad (3) \]
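To make the construction concrete, the following is a minimal Python sketch (NumPy assumed; the 200-point matrix X is a toy placeholder, not one of the paper's datasets) that builds a ±1 matrix as in Eq. (2), projects the data as Y = XR/√q, and empirically counts how many pairwise squared distances satisfy the bound (3).

```python
import numpy as np

def random_projection(X, q, rng=None):
    """Project the rows of X (n x p) into R^q using a +/-1 matrix as in Eq. (2)."""
    rng = np.random.default_rng(rng)
    R = rng.choice([-1.0, 1.0], size=(X.shape[1], q))   # r_ij = +/-1, each with probability 1/2
    return X @ R / np.sqrt(q)                           # Y = XR / sqrt(q)

def pairwise_sq_dists(A):
    """Matrix of squared Euclidean distances between the rows of A."""
    sq = np.sum(A ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * (A @ A.T), 0.0)

def preservation_rate(X, Y, eps):
    """Fraction of pairs i < j that satisfy the two-sided bound of Eq. (3)."""
    iu = np.triu_indices(len(X), k=1)
    ratio = pairwise_sq_dists(Y)[iu] / pairwise_sq_dists(X)[iu]
    return np.mean((ratio >= 1 - eps) & (ratio <= 1 + eps))

# toy check: 200 points in R^1000 projected down to q = 50
X = np.random.default_rng(0).normal(size=(200, 1000))
print(preservation_rate(X, random_projection(X, q=50, rng=1), eps=0.25))
```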

The number given by Eq. (1), q0, is called the JL bound and it guarantees the distance preservation (3) as long as the downspace dimension is larger than it. Interestingly enough, the JL bound q0 does not depend on the upspace dimension p. For example, for n = 1,000,000 and ε = β = 0.25 we obtain a JL bound of q0 = 2387, while for n = 1000 we get q0 = 1194. In both cases we see that the JL bound is still a large number.
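As a quick check of Eq. (1), a small Python sketch of the JL bound (the natural logarithm is assumed, since it reproduces the two values quoted above):

```python
import math

def jl_bound(n, eps, beta):
    """JL bound q0 of Eq. (1): smallest downspace dimension guaranteeing Eq. (3)."""
    return (4 + 2 * beta) / (eps ** 2 / 2 - eps ** 3 / 3) * math.log(n)

print(round(jl_bound(1_000_000, 0.25, 0.25)))   # 2387
print(round(jl_bound(1_000, 0.25, 0.25)))       # 1194
```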

III. RANDOM PROJECTION FUZZY K-NEAREST NEIGHBOUR (RPFKNN)

FUZZY

Fuzzy logic is a form of many-valued logic in which the truth values of variables may be any real number between 0 and 1. By contrast, in Boolean logic, the truth values of variables may only be the integer values 0 or 1.

A. FKNN

Assume a given set of n points represented by a set of vectors X = {x1, …, xn}. The points belong to c classes. The membership of point x_j, j ∈ [1, n], in class i ∈ [1, c] is denoted u_ij ∈ [0, 1], where Σ_{i=1}^{c} u_ij = 1. For FKNN, the class memberships u_ij can be computed in multiple ways: 1. crisp membership in a class, i.e., u_kj = 1 if class(x_j) = k and u_ij = 0 if i ≠ k; 2. based on the distance to the class means; 3. based on the distance to the K closest samples. In this paper we use the first modality. After the memberships U = {u_ij}, i = 1,…,c, j = 1,…,n, of the training set X are computed, the class memberships u_i, i ∈ [1, c], of an unknown vector x are computed as:

\[ u_i(x) = \frac{\sum_{j=1}^{K} u_{ij}\,\bigl(1/\lVert x - x_j \rVert^{1/(m-1)}\bigr)}{\sum_{j=1}^{K} \bigl(1/\lVert x - x_j \rVert^{1/(m-1)}\bigr)} \qquad (4) \]

where {x1, …, xk} is the set of K nearest points to x, and m is a fuzzifier typically chosen as m=2. So Eq. (4) produces a membership vector u∈ 𝑅𝑐 for each unknown datum x.
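A minimal sketch of Eq. (4) (NumPy assumed; it uses the exponent 1/(m-1) exactly as Eq. (4) is written, m = 2, and crisp training memberships as in the first modality above):

```python
import numpy as np

def fknn_membership(x, X_train, U_train, K=5, m=2.0, eps=1e-12):
    """Class memberships u_i(x) of Eq. (4) for an unknown vector x.

    X_train: (n, p) training vectors; U_train: (n, c) memberships u_ij.
    """
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists)[:K]                          # K nearest training points
    w = 1.0 / (dists[nn] ** (1.0 / (m - 1.0)) + eps)    # weights 1/||x - x_j||^(1/(m-1))
    return (U_train[nn] * w[:, None]).sum(axis=0) / w.sum()   # vector in R^c, sums to 1

# crisp memberships (modality 1): u_kj = 1 if class(x_j) = k, else 0
labels = np.array([0, 0, 1, 1, 2])
U_train = np.eye(3)[labels]
X_train = np.random.default_rng(0).normal(size=(5, 4))
print(fknn_membership(np.zeros(4), X_train, U_train, K=3))
```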

B. RPFKNN: The idea of RPFKNN is to provide an efficient fusion strategy for the classification results obtained in each downspace produced by a random projection. By classifying an unknown vector y with FKNN in N random subspaces, we obtain a matrix of N membership vectors u_i. The question is how to compute the final class assignment of y based on the N vectors u_i. Several strategies are possible for aggregating the class memberships u_i (given by Eq. (4)) across projections:

Strategy 1 (S1): find the class of y as argmax(u_i) for each random projection i and then use a majority vote across projections.
Strategy 2 (S2, aka Borda or Olympic count): rank the class memberships in u_i for each projection, award points inversely proportional to the class ranking, then compute the sum S of the points across projections. The class of y is argmax_{i=1,…,c}(S), i.e. the index of the class with most points.
Strategy 3 (S3): sum the class memberships across projections, u_f = sum_{i=1,…,N}(u_i). The class of y is argmax_{i=1,…,c}(u_f), i.e. the index of the largest cumulative membership.
Strategy 4 (S4): sum the memberships across projections (as in S3) but weighted by the accuracy of the FKNN in each projection. The accuracy of projection i is computed as a_i = #(correct classifications)/n. Compute u_f = sum_{i=1,…,N}(a_i · u_i). The class assignment is argmax_{i=1,…,c}(u_f).
Strategy 5 (S5): generate a set of N projections, choose the best one and then apply FKNN in that single downspace. No fusion is needed.
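For illustration, a minimal sketch of the four fusion rules S1-S4 (NumPy assumed; U is a hypothetical N × c array stacking the membership vectors u_i of one test point across the N projections, and acc holds the per-projection accuracies a_i used by S4):

```python
import numpy as np

def fuse(U, acc=None, strategy="S4"):
    """Aggregate per-projection memberships U (N x c) into a single class index."""
    N, c = U.shape
    if strategy == "S1":                        # majority vote of per-projection argmax
        return np.bincount(U.argmax(axis=1), minlength=c).argmax()
    if strategy == "S2":                        # Borda count: points inverse to class ranking
        ranks = U.argsort(axis=1).argsort(axis=1)    # 0 = lowest membership, c-1 = highest
        return ranks.sum(axis=0).argmax()
    if strategy == "S3":                        # sum of memberships across projections
        return U.sum(axis=0).argmax()
    if strategy == "S4":                        # accuracy-weighted sum of memberships
        return (np.asarray(acc)[:, None] * U).sum(axis=0).argmax()
    raise ValueError(strategy)
```

Strategy S5 needs no fusion, so it is not included in the sketch.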

The generic RPFKNN algorithm has the following steps:

Step 0. Choose a number of random projections N and the downspace dimension q. Input a set of n p-dimensional data points X(n, p) and a related set L(n) of c class labels.
Step 1. At each iteration i ∈ [1, N], produce a random projection matrix R_i(p, q) using Eq. (2).
Step 2. Compute the projection of X as Y_i = X R_i/√q as in Theorem 1. For each vector y in Y_i, using Y_i − {y} (leave-one-out cross-validation):
Step 3. Compute the class memberships u_i ∈ R^c using Eq. (4).
Step 4. Concatenate all individual memberships u_i in a matrix U_{f,i}(N × c), that is, [U_{f,i}] = KNN(Y_i).
Step 4*. For strategy S4, compute the accuracy a_i.
End For
Step 5. Fuse the membership matrices of all projections, U_{f,i}, i ∈ [1, N], using one of the strategies S1-S4.

Step 6. Compute the final accuracy of the fusion strategy, a.
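Putting the previous sketches together, a hypothetical leave-one-out driver for Steps 0-6 (it reuses the random_projection, fknn_membership and fuse sketches given earlier; it illustrates the generic loop rather than reproducing the paper's implementation):

```python
import numpy as np

def rpfknn(X, labels, N=30, q=50, K=5, strategy="S4", seed=0):
    """Generic RPFKNN loop: project N times, classify with FKNN, fuse (Steps 0-6)."""
    rng = np.random.default_rng(seed)
    n, c = len(X), labels.max() + 1
    U_train = np.eye(c)[labels]                       # crisp memberships (Step 0)
    memberships = np.zeros((N, n, c))
    acc = np.zeros(N)
    for i in range(N):                                # Steps 1-4
        Y = random_projection(X, q, rng=rng)
        for t in range(n):                            # leave-one-out in the downspace
            keep = np.arange(n) != t
            memberships[i, t] = fknn_membership(Y[t], Y[keep], U_train[keep], K=K)
        acc[i] = np.mean(memberships[i].argmax(axis=1) == labels)   # Step 4*
    preds = [fuse(memberships[:, t], acc, strategy) for t in range(n)]   # Step 5
    return np.mean(np.asarray(preds) == labels)       # Step 6: accuracy of the fusion
```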


To decide which aggregation strategy among S1-S4 produces the best results in RPFKNN, we performed the following experiments.

IV. MR-KNN: A MAPREDUCE IMPLEMENTATION FOR K-NN

In this section we explain how to parallelize the k-NN algorithm based on MapReduce.

A. Map phase

Let TR be a training set and TS a test set of arbitrary sizes, stored in HDFS as single files. The map phase starts by splitting the TR set into a given number of disjoint subsets. Each map task (Map1, Map2, ..., Mapm) will create an associated TR_j, where 1 ≤ j ≤ m, with the samples of the chunk into which the training set file is divided. Map_j corresponds to the j-th data chunk of h/m HDFS blocks. As an outcome, each map processes approximately the same number of training instances.

When each map has formed its corresponding TR_j set, we compute the distance of each test example x_t against the instances of TR_j. The class labels of the k closest neighbours of each test example, together with their distances, are saved. As a result, we obtain a matrix CD_j of <class, distance> pairs with dimension n × k. Therefore, row i contains the distances and classes of the k nearest neighbours of the i-th test example. It is noteworthy that every row is ordered in ascending order with respect to the distance to the neighbour, so that Dist(neigh_1) < Dist(neigh_2) < … < Dist(neigh_k).
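For illustration, a minimal Python sketch of what a single map task computes (TR_j is a hypothetical list of (vector, class) pairs and TS a list of test vectors; this is a sequential sketch, not the actual Hadoop map code):

```python
import heapq

def map_task(TS, TR_j, k):
    """Emit, for each test example, the k tentative <class, distance> pairs found
    in this map's training chunk TR_j, sorted by ascending distance."""
    CD_j = []
    for x in TS:
        dists = [(sum((a - b) ** 2 for a, b in zip(x, xj)) ** 0.5, cls)
                 for xj, cls in TR_j]
        neigh = heapq.nsmallest(k, dists)             # k closest instances in this chunk
        CD_j.append([(cls, d) for d, cls in neigh])   # row i: <class, distance> pairs
    return CD_j                                       # emitted with key = idMapper
```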

Algorithm 1 contains the pseudo-code of the map function. As each map finishes its computation, the results are forwarded to a single reduce task. The output key of every map is set to an identifier value of the mapper.

Algorithm 1: Map function
Require: TS, k
1: Constitute TR_j with the instances of split j
2: for i = 0 to i < size(TS) do
3:    Compute k-NN(x_test,i, TR_j, k)
4:    for n = 0 to n < k do
5:       CD_j(i, n) = <Class(neigh_n), Dist(neigh_n)>_i
6:    end for
7: end for
8: key = idMapper
9: EMIT(<key, CD_j>)

B. Reduce phase

The reduce phase consists of determining which of the tentative k nearest neighbours coming from the maps are the closest ones for the complete TS. Given that we aim to design a model that can scale to training sets of arbitrary size and that is independent of the selected number of neighbours, we carefully implemented this operation by taking advantage of the setup, reduce, and cleanup procedures of the reduce task.

FIG. 1. Flowchart of the proposed MR-kNN algorithm

Setup: As a requirement, this function needs to know the size of TS, but it does not need to read the set itself. The class-distance matrix of the reducer is initialized with random values for the classes and positive infinity for the distances.

Algorithm 2: Reduce operation
Require: size(TS), k, CD_j
Require: the Setup procedure has been launched
1: for i = 0 to i < size(TS) do
2:    cont = 0
3:    for n = 0 to n < k do
4:       if CD_j(i, cont).Dist < CD_reducer(i, n).Dist then
5:          CD_reducer(i, n) = CD_j(i, cont)
6:          cont++
7:       end if
8:    end for
9: end for

Reduce: when the map phase finishes, the class-distance matrix CD_reducer is updated by comparing its current distance values with the matrix that comes from every Map_j, that is, CD_j. This consists of merging two sorted lists until k values are kept. Thus, for each test example x_test, we compare every distance value of the neighbours one by one, starting from the closest neighbour. If the distance is smaller than the current value, the class and the distance at this matrix position are updated with the corresponding values; otherwise we proceed with the following value. This operation does not need to send any <key, value> pairs.
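A minimal sketch of this per-example update (plain Python; cd_reducer is the reducer's current row, initialised with infinite distances as in the Setup step, and cd_j is the row received from map j; both are ascending lists of (class, distance) pairs):

```python
def reduce_row(cd_reducer, cd_j, k):
    """Merge two ascending <class, distance> lists, keeping the k closest overall."""
    merged, a, b = [], 0, 0
    while len(merged) < k:
        if cd_j[a][1] < cd_reducer[b][1]:     # compare distances, closest first
            merged.append(cd_j[a]); a += 1
        else:
            merged.append(cd_reducer[b]); b += 1
    return merged

# toy usage with k = 3: current best values vs. candidates from a new map
current = [("A", 0.4), ("B", 0.9), ("A", float("inf"))]
incoming = [("B", 0.2), ("A", 1.1), ("B", 1.5)]
print(reduce_row(current, incoming, k=3))     # [('B', 0.2), ('A', 0.4), ('B', 0.9)]
```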


V. EXPERIMENTAL RESULTS

Some random projections are better than others. To compare the behaviour of random projections in the context of the FKNN algorithm, we constructed a Gaussian mixture dataset. The dataset has n = 1000 data points from R^p with p = 1000, generated from a Gaussian mixture with m = 3 components in proportions (p1, p2, p3) = (0.25, 0.5, 0.25). The parameters of the three mixture components are shown in Table I.

TABLE I. Means and standard deviations of the X2 dataset

Component              1                     2                     3
Means                  (2, 2, …, 2)_1000     (0, 0, …, 0)_1000     (−2, −2, …, −2)_1000
Standard deviations    (1, 1, …, 1)_1000     (2, 2, …, 2)_1000     (3, 3, …, 3)_1000
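For reference, a minimal sketch of how such a mixture dataset can be generated (NumPy assumed; the proportions, means and standard deviations are those of Table I):

```python
import numpy as np

def make_x2(n=1000, p=1000, seed=0):
    """n points in R^p drawn from the 3-component Gaussian mixture of Table I."""
    rng = np.random.default_rng(seed)
    props = [0.25, 0.50, 0.25]
    means = [2.0, 0.0, -2.0]          # component means (mu, mu, ..., mu)
    stds  = [1.0, 2.0, 3.0]           # component standard deviations (s, s, ..., s)
    labels = rng.choice(3, size=n, p=props)
    X = np.empty((n, p))
    for comp in range(3):
        idx = labels == comp
        X[idx] = rng.normal(means[comp], stds[comp], size=(idx.sum(), p))
    return X, labels
```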

We note that, for our dataset, the JL bound for ε = β = 0.25 is q0 ≈ 1200 (actually larger than the upspace dimension!). In the experiments that follow we generate our random projections using Eq. (2). The scatter plot of the X2 dataset for two random projections in R^2 is shown in Fig. 2.

FIG. 2. Two random projections of X2 in R^2: a bad one (left), a better one (right).

To shed a little more light on the projection process, we plot the distribution of the ratios of the corresponding pairwise distances in R^q and R^p, i.e. {d_ij}_Rq / {d_ij}_Rp. Theorem 1 guarantees that most of the values of this distribution lie within (1 − ε, 1 + ε) with probability p. In Fig. 3 we plot this distribution for 10 random projections at q = 2 (left) and q = 50 (right).

The distribution bounds get tighter as the downspace dimension increases. However, we see that most values for q = 50 are already within an ε ≈ 0.25, even though q = 50 is far below the JL bound of 1200. Another fact to note for q = 2 is that the peak of some distributions is smaller than one, which means that some projections produce a contraction of the distances at low q values.

FIG. 3. Pairwise distance ratio distributions for 10 random projections, for q = 2 (left) and q = 50 (right).

We also mention that the dotted red distribution (data10) in Fig. 3 (right) corresponds to the picture from Fig. 2 (right), while the one denoted data1 (leftmost in Fig. 3, right) generated Fig. 2 (left).

Based on Fig. 3, some ideas are to compute the mean or the skewness ("skew") of the distance ratio distribution, or the correlation ("corr") between {d_ij}_Rq and {d_ij}_Rp. Table II shows these three RP quality measures computed for the projections of Fig. 3 (left), that is, at q = 2.

TABLE II. Values of the 3 quality measures for 10 RPs at q = 2

Projection    1      2      3      4      5      6      7      8      9      10
Corr          0.65   0.30   0.74   0.71   0.46   0.66   0.44   0.72   0.35   0.81
Mean          0.95   0.76   1.05   1.03   0.84   0.97   0.81   1.03   0.76   1.23
Skew          0.40   0.80   0.24   0.29   0.63   0.35   0.67   0.27   0.85   0.09
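A minimal sketch of these three quality measures (NumPy and SciPy assumed; dist_up and dist_down are hypothetical arrays holding the pairwise distances {d_ij} in R^p and in R^q for one projection):

```python
import numpy as np
from scipy.stats import skew

def rp_quality(dist_up, dist_down):
    """Corr, mean and skew quality measures of one random projection."""
    ratio = dist_down / dist_up                     # {d_ij}_Rq / {d_ij}_Rp
    corr = np.corrcoef(dist_up, dist_down)[0, 1]    # correlation between the two distance sets
    return {"corr": corr, "mean": ratio.mean(), "skew": skew(ratio)}
```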

To compare the fusion alternatives S1 to S4, we performed the same experiment as above (dataset X1, N = 30 and variable q) and display the results in Table III.

TABLE III. Comparison of classification rates for fusion strategies S1-S4 for X1 at different lowspace dimensions q



q                           2      10     20     50     100
S1 (Max and Vote)           0.64   0.95   0.94   0.96   0.98
S2 (Borda)                  0.66   0.68   0.68   0.70   0.74
S3 (Sum and Max)            0.63   0.95   0.96   0.97   0.98
S4 (Weighted Sum and Max)   0.75   0.97   0.97   0.98   0.99

From Table III we can see that S4 provides the best results for all lowspace dimensions.

TABLE IV. Comparison of classification rates for fusion strategies S1-S4 and PCA for ACT at different lowspace dimensions q

q                           2      10     20     50     100
S1 (Max and Vote)           0.36   0.58   0.53   0.6    0.53
S2 (Borda)                  0.06   0.08   0.1    0.08   0.06
S3 (Sum and Max)            0.40   0.58   0.53   0.5    0.49
S4 (Weighted Sum and Max)   0.08   0.38   0.47   0.52   0.49
PCA                         0.46   0.72   0.67   0.6    0.57

We can make several observations based on Table IV. First, most strategies achieve their maximum accuracy at a q between 10 and 50, far below the JL bound. So fewer than 10 dimensions are not enough, but more than 50 actually hurt the performance of the classifier.

All our experiments so far were performed with N = 30 random projections. In Table V we show the results obtained on ACT at q = 2 for multiple numbers of RPs.

TABLE V. Influence of the number of projections N on accuracy

N                           10     20     25     30
S4 (Weighted Sum and Max)   0.8    0.91   0.95   0.98
S1 (Max and Vote)           0.22   0.24   0.31   0.36

From Table V it seems that N = 30 is again a suitable number for both S1 and S4. However, it is surprising how low the accuracy of S1 is.

A. Experimental Framework

Accuracy: it counts the number of correct classifications with respect to the total number of instances.
Runtime: we count the time spent by MR-kNN in the map and reduce phases, as well as the total time needed to classify the whole TS set. The total time includes all the computations performed by the MapReduce framework (communications).

Speedup: it checks the efficiency of a parallel algorithm in comparison with its sequential version. In our experiments we compute the speedup achieved depending on the number of mappers:

\[ \text{Speedup} = \frac{\text{reference time}}{\text{parallel time}} \qquad (5) \]

where the reference time is the runtime of the sequential version and the parallel time is the runtime achieved by its parallel version.

In these experiments, the total times are compared to those obtained with the sequential version to compute the speedup.

This experimental study focuses on analyzing the effect of the number of mappers (16, 32, 64, 128, and 256) and the number of neighbours (1, 3, 5, and 7) in the proposed MR-kNN model.

TABLE VI. Sequential k-NN performance on MapReduce

No. of neighbours   AccTest   Runtime (ms)

1 0.5012 10237.0058

3 0.4958 10354.8375

5 0.5375 10472.1990

7 0.5487 10585.0380

Table VI collects the average accuracy (AccTest) on the test partitions and the runtime (in milliseconds) obtained by the standard k-NN algorithm, according to the number of neighbours.

TABLE VII. Sequential fuzzy k-NN performance with random projection

No. of neighbours   AccTest   Runtime (ms)

1 0.7015 10731.0058

3 0.6942 10621.8495

5 0.5355 10688.1680

7 0.8478 10877.0470

Table VII collects the average accuracy (AccTest) on the test partitions and the runtime (in milliseconds) obtained by the random projections fuzzy k-nearest neighbour algorithm.


TABLE VIII. Comparison between the random projections fuzzy and MapReduce k-nearest neighbour algorithms

Comparison     Random projections fuzzy k-NN   MapReduce algorithm k-NN
Accuracy       2.4790                          2.779
Runtime (ms)   42.91807                        41.64908
Memory (GB)    372.648                         288.192

Table VIII shows the comparison of the random projections fuzzy k-NN and the MapReduce algorithm k-NN based on accuracy, runtime in milliseconds and memory in GB.

FIG. 4. Random projections fuzzy k-nearest neighbour

FIG. 5. MapReduce algorithm k-nearest neighbour

FIG. 6. Comparison of the MapReduce and random projections fuzzy k-nearest neighbour algorithms

Fig. 6 gives a clear picture of the comparison. Based on it, the MapReduce k-nearest neighbour algorithm shows the best accuracy (2.779), a runtime of 41.64908 milliseconds and a memory usage of 288.192 GB.

VI. CONCLUSION

In this contribution we presented a comparison of the random projection fuzzy and MapReduce approaches that enable the k-nearest neighbour technique to deal with large-scale Big Data classification problems. The existing k-NN algorithm is efficient for classification, but its performance is affected by the features used for classification; when analyzing large-scale datasets, feature selection plays an important role prior to the classification process. Based on this comparison, the random projection fuzzy k-NN has higher runtime (ms) and memory (GB) and a lower accuracy level, whereas the MapReduce algorithm k-NN has a higher accuracy level and lower runtime (ms) and memory usage. Fig. 6 shows the variation between the random projection fuzzy k-NN and the MapReduce algorithm k-NN. As per the experimental analysis, the MapReduce algorithm k-NN performs as well as the random projection fuzzy k-NN.

REFERENCES

[1] Zhou Z., Yang C., Wen C., "Random projection based k nearest neighbor rule for semiconductor process fault detection", Proceedings of the 33rd Chinese Control Conference, July 28-30, 2014, Nanjing, China.

[2] M. Popescu, J. M. Keller, J. Bezdek, A. Zare, "Random projections fuzzy c-means (RPFCM) for big data clustering", IEEE International Conference on Fuzzy Systems, Istanbul, Turkey, 2015.

[3] T. Yokoyama, Y. Ishikawa, and Y. Suzuki, "Processing all k-nearest neighbor queries in Hadoop", in Web-Age Information Management, ser. Lecture Notes in Computer Science, H. Gao, L. Lim, W. Wang, C. Li, and L. Chen, Eds. Springer Berlin Heidelberg, 2012, vol. 7418, pp. 346-351.

[4] Keller J. M., Gray M. R., and Givens J. A., Jr., "A Fuzzy K-Nearest Neighbor Algorithm", IEEE Transactions on Systems, Man, and Cybernetics, Vol. 15, No. 4, pp. 580-585, 1985.

