2013 Annual International Conference on Emerging Research Areas (AICERA) / 2013 International Conference on Microelectronics, Communication and Renewable Energy (ICMiCR-2013)


International Conference on Microelectronics, Communication and Renewable Energy (ICMiCR-2013)

Compression of behavioral data using clustering technique

Biku Abraham Dept. of Computer Applications Saintgits College of Engineering

Kottayam, Kerala [email protected]

Dr. Varghese Paul Dept. of CSE/IT

TocH Institute of Science & Technology Kochi, Kerala

Abstract— An efficient method for the compression of behavioral data is presented in this paper. Such data sets, which occur in a wide variety of applications, pose some of the most significant challenges in data mining. We propose a method that expresses the likeness of data. Clustering, a statistical data-reduction technique, is extensively used in management science to reduce behavioral data. The use of clustering in data compression is a largely unexplored area with great potential for further research. This study is an attempt in that direction, testing the hypothesis that the greater the number of clusters, the smaller the deviation from the original data.

Keywords—compression; cluster; behavioral data; centroid; pattern recognition.

I. INTRODUCTION

Reducing the size of data is the process generally known as data compression. It removes redundant data in order to resize a file into a compressed form: a compressed ASCII file contains the same information as the original, but in much less space. This reduction increases the memory available for use, and networks with limited-bandwidth channels can exploit the same idea when transmitting messages. A reduction in storage space is an obvious benefit; reduced data-transfer time and faster query evaluation in text-retrieval systems are corollaries of the compression process. Compression also shortens data transfers by packing more information content into each volume of data sent.

A cluster is a group of similar objects, and cluster analysis is a multivariate procedure ideally suited to segmentation applications in marketing research. Cluster analysis is mainly used for taxonomy description, data simplification, and relationship identification. It is important to weigh theoretical, conceptual, and practical considerations when selecting clustering variables, since the selected variables characterize the individuals being clustered. Partitions and trees are the two common clustering structures. A cluster is a subset of the universe under observation or experimentation. A partition is a family

of clusters with the property that each object lies in exactly one member of the partition. A tree is a family of clusters that includes the set of all objects and in which any two clusters are either disjoint or one includes the other. A partition, with the set of all objects added, is a tree. Since the 1970s, many market segmentation studies have relied on cluster algorithms. After cluster analysis, each participant lies in a single segment together with others whose needs closely match; a measure of variation within the segment shows the distance or deviation.

Clustering algorithms are also used in pattern recognition, system modeling, image processing, communication, and data mining in the engineering field. Parametric and non-parametric algorithms are the two approaches to clustering. In the parametric approach, some pre-defined probability density functions are assumed in order to optimize the clusters. In contrast, the non-parametric approach assumes no a priori information about the data distribution [1].

II. LITERATURE SURVEY

There are many studies on the basic strategy of using data compression in classification and in machine learning tasks. A compressed representation of an object is possible only when some recurring patterns or statistical regularities are detected. Drawing on past studies, a distance or (dis)similarity measure between pairs of data points is suggested for use in many data compression algorithms [2]. The coding method of even-bits marking and selective output inversion uses unique code words: odd bits signify the lengths of runs, and even bits mark the ends of code words. A selective-output-inversion structure is introduced to increase the probability of 0s. Compared with other schemes, this method gives a very good compression ratio and needs only a low hardware overhead [3].

Most traditional algorithms discuss the relationship between data points and clusters. This is approached in two ways: "Every data may belong to only one cluster or each data may belong to more than one cluster with a certain degree of membership. The second method is more general and its value has not yet been clearly proven" [4].

978-1-4673-5149-2/13/$31.00 ©2013 IEEE


Compression algorithms based on the Burrows–Wheeler Transform (BWT) exploit the fact that the output of the BWT exhibits a local similarity (occurrences of a given symbol tend to occur in clusters) and therefore turns out to be highly compressible. Several authors refer to this property as the ''clustering effect'' of the BWT. Many studies establish analytical upper bounds on the compression ratio of BWT-based compressors in terms of the k-th order empirical entropy Hk of the input string. "Recall that, under the hypothesis of the Markovian nature of the input string w, Hk(w) gives a lower bound on the compression ratio of any encoder that is allowed to use only the context of length k preceding character σ in order to encode it" [1].
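The ''clustering effect'' can be seen directly with a naive transform. The sketch below (illustrative only, not from the cited paper) sorts all rotations of a sentinel-terminated string and reads off the last column:

```python
def bwt(s):
    # Naive Burrows-Wheeler transform: sort all rotations of the
    # sentinel-terminated string and take the last character of each.
    assert s.endswith("$")  # '$' as a unique, lexicographically smallest sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(bwt("banana$"))  # annb$aa -- equal symbols cluster together
```

The output is a permutation of the input in which equal symbols tend to sit side by side, which is exactly what makes it amenable to run-length or move-to-front coding.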

III. METHODOLOGY

Data compression is an important technique both for saving space and for convenience in transmitting data for various purposes. However, compression of behavioral data has not been explored so far, even though such data has specific and unique characteristics. Clustering, a statistical data-reduction technique, is extensively used in management science to reduce behavioral data. The use of clustering in data compression is a largely unexplored area with great potential for further research. This study is an attempt in that direction.

We establish the hypothesis that the greater the number of clusters, the smaller the deviation from the original data. To test this hypothesis, we used an original data set containing different brands of cars and their related characteristics. We found that the standard deviation of the differences between the original data set and the cluster centroids of the various variables is smallest when the number of clusters is largest; it tends to decrease as the number of clusters increases. This points to the data-reduction potential hidden in the clustering method.

Experiments were run from two clusters up to seven clusters, and the corresponding standard deviations were observed. To begin, a cluster analysis with 2 clusters was performed using the K-means method. The estimated cluster centroids were used to assign the average value of each variable addressed. The differences between the original observations and the centroid values were then calculated. The file was subsequently split by cluster to calculate the standard deviation within each cluster, and finally the average of the standard deviations over the clusters was computed.
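The procedure above can be sketched in code. The data, the simple k-means implementation, and the pooled within-cluster standard deviation below are illustrative assumptions, not the authors' exact data or tooling:

```python
# Minimal sketch of the clustering-based compression experiment.
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct random rows.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each row to its nearest centroid (squared Euclidean distance).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

def avg_within_cluster_std(X, k):
    centroids, labels = kmeans(X, k)
    # "Compress" by imputing each row's centroid, then measure the error.
    diffs = X - centroids[labels]
    # Per-cluster std of the differences (pooled over variables), averaged.
    stds = [diffs[labels == j].std() for j in range(k) if np.any(labels == j)]
    return float(np.mean(stds))

# Hypothetical two-variable data, loosely in the spirit of price and engine.
rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(25.0, 10.0, 200), rng.normal(3.0, 0.8, 200)])
errors = {k: avg_within_cluster_std(X, k) for k in range(2, 8)}
```

As in the paper, the error tends to shrink as the number of clusters grows, since each centroid then stands for a tighter group of observations.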

IV. RESULTS

For the experiments this research used cross-sectional data on the features of various passenger car brands.

The data is real and records horsepower, price, mpg, brand, width, length, fuel capacity, etc. The first experiment used two variables, price and engine. This was followed by a second-stage experiment with

three variables: horsepower, price, and engine. In the final stage the variables used were mpg, price, engine, and horsepower. For each set of variables, seven experimental conditions were run, from two-cluster up to seven-cluster solutions, and the data generated was used to analyze the deviations in the respective groups. The data is first classified into two clusters; the average values are then found and imputed into the data, and the standard deviation within each cluster is computed. The same procedure is repeated for three clusters and so on up to seven clusters. The results show that as the number of clusters increases, the standard deviation decreases.
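The paper does not quantify the storage saving, but the idea of keeping only the centroids plus one cluster label per record can be sketched as follows; the byte sizes and counts are illustrative assumptions, not figures from the study:

```python
# Hypothetical storage accounting for centroid-based compression: store
# k centroids (k * n_vars floats) plus one small label per row, instead
# of all n_rows * n_vars floats.
def compression_ratio(n_rows, n_vars, k, float_bytes=8, label_bytes=1):
    original = n_rows * n_vars * float_bytes
    compressed = k * n_vars * float_bytes + n_rows * label_bytes
    return original / compressed

# e.g. 200 cars, 2 variables, 7 clusters
print(round(compression_ratio(200, 2, 7), 2))  # 10.26
```

The trade-off is visible in the formula: more clusters cost a few more stored centroids but, per the results below, reduce the reconstruction error.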

TABLE I. TWO VARIABLE COMPARISON

No. of     Average          Standard Deviation
clusters   price   engine   price   engine
2          12.17   0.863    12.2    0.9
3          9.337   0.633    9.3     0.6
4          7.512   0.56     7.5     0.6
5          6.168   0.63     6.2     0.6
6          5.534   0.552    5.5     0.6
7          5.652   0.427    5.7     0.4

TABLE II. THREE VARIABLE COMPARISON

No. of     Average                  Standard Deviation
clusters   price   engine   hp      price   engine   hp
2          28.05   3.11     188.8   10.1    0.7      36.7
3          34.85   3.5      213.4   9.3     0.7      31.3
4          31.21   3.253    198.4   7.5     0.6      24.7
5          36.86   3.514    216.4   6.1     0.7      28.3
6          36.56   3.485    216.3   5.1     0.6      27.0
7          41.39   4.053    246.6   5.2     0.5      19.3


TABLE III. FOUR VARIABLE COMPARISON

No. of     Average                          Standard Deviation
clusters   price   engine   hp      mpg    price   engine   hp     mpg
2          27.39   3.055    185.1   23.85  10.6    0.7      37.4   3.304
3          34.41   3.483    211.5   22.73  9.7     0.7      31.3   2.555
4          29.62   3.12     190     24.57  7.8     0.6      28.6   2.86
5          29.09   3.186    190.2   23.87  6.8     0.6      28.6   2.304
6          34.61   3.442    208.1   23.2   5.5     0.7      30.9   2.338
7          39.72   4.016    239.5   22.27  5.6     0.5      23.2   2.164

From the above tables it is clear that there is a negative relationship between the number of clusters and the standard deviations.

Figure 1. Standard deviation of price and engine in the two-variable case

Figure 2. Standard deviation of price, engine and hp in the three-variable case

The graphs show the negative correlation between the standard deviations of the variables and the number of clusters. Figure 1 shows only two variables, price and engine. With only two clusters the standard deviation of price is 12.2, whereas with seven clusters it is reduced to 5.7, a drop of more than 50%. The standard deviation of engine is 0.9 with two clusters and 0.4 with seven clusters.
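The size of the reduction in the price deviation can be checked directly from the Table I figures:

```python
# Reduction in the standard deviation of price from 2 clusters to 7 (Table I).
sd_two, sd_seven = 12.2, 5.7
drop = (sd_two - sd_seven) / sd_two
print(f"{drop:.1%}")  # 53.3% -- a little over half
```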

Figure 3. Standard deviation of price, engine, hp and mpg in the four-variable case

V. CONCLUSIONS

The above study reveals that as the number of clusters increases, the standard deviation among the cluster members tends to decrease. This indicates the possibility of using the method to delete a large data set from the servers and later reproduce it with minimal error.

VI. REFERENCES

[1] A. Restivo and G. Rosone, "Balancing and clustering of words in the Burrows–Wheeler transform," Theoretical Computer Science, vol. 412, pp. 3019–3032, 2011.

[2] A. Bratko, B. Filipič, G. V. Cormack, T. R. Lynam, and B. Zupan, "Spam filtering using statistical data compression models," Journal of Machine Learning Research, vol. 7, pp. 2673–2698, 2006.

[3] W. Zhan, "A scheme of test data compression based on coding of even bits marking and selective output inversion," Computers & Electrical Engineering, 2010, doi:10.1016/j.compeleceng.2010.01.002.

[4] X.-L. Yang, Q. Song, and Y.-L. Wu, "A robust deterministic annealing algorithm for data clustering," Data & Knowledge Engineering, vol. 62, pp. 84–100, 2007.