

Global Principal Component Analysis for Dimensionality Reduction in Distributed Data Mining

Hairong Qi, Tse-Wei Wang, J. Douglas Birdwell

University of Tennessee
Knoxville, TN 37996, USA

Previous data mining activities have mostly focused on mining a centralized database. One big problem with a centralized database is its limited scalability. Because of the distributed nature of many businesses and the exponentially increasing amount of data generated from numerous sources, a distributed database becomes an attractive alternative. The challenge in distributed data mining is how to learn as much knowledge from distributed databases as we do from a centralized database without consuming too much communication bandwidth. Both unsupervised classification (clustering) and supervised classification are common practices in data mining applications, where dimensionality reduction is a necessary step. Principal component analysis is a popular technique used in dimensionality reduction. This paper develops a distributed principal component analysis algorithm which derives the global principal components from distributed databases based on the integration of local covariance matrices. We prove that for homogeneous databases, the algorithm derives global principal components that are exactly the same as those calculated from a centralized database. We also provide a quantitative measurement of the error introduced in the recompiled global principal components when the databases are heterogeneous.

    I INTRODUCTION

Data mining is a technology that deals with the discovery of hidden knowledge, unexpected patterns, and new rules from large databases. In an information society where we are drowning in information but starved for knowledge [8], data mining provides an effective means to analyze uncontrolled and unorganized data and turn them into meaningful knowledge.

The development of different data mining technologies has been spurred since the early 90s. Grossman [4] classified data mining systems into three generations: the first generation develops single algorithms or collections of data mining algorithms to mine vector-valued data. The second generation supports the mining of larger datasets and datasets in higher dimensions. It also includes developing data mining schema and data mining languages to integrate mining into database management systems.


The third generation provides distributed data mining in a transparent fashion. Current commercially available data mining systems mainly belong to the first generation. With the advances in computer networking and information technology, new challenges are brought to the data mining community, which we summarize as follows: 1) large datasets with increased complexity (high dimension); 2) new data types including object-valued attributes, unstructured data (textual data, images, etc.), and semi-structured data (html-tagged data); 3) geographically distributed data locations with heterogeneous data schema; 4) dynamic environments with data items updated in real time; and 5) progressive data mining, which returns quick, partial, or approximate results that can be fine-tuned later in support of more active interaction between users and data mining systems.

The focus of previous data mining research has been on centralized databases. One big problem with a centralized database is its limited scalability. On the other hand, many databases nowadays tend to be maintained distributively, not only because many businesses have a distributed nature, but also because growth can be sustained more gracefully in a distributed system. This paper discusses the problem of distributed data mining (DDM) from geographically distributed data locations, with databases being either homogeneous or heterogeneous.

Data mining in distributed systems can be carried out in two different fashions. In the first, data from distributed locations are transferred to a central processing center, where the distributed databases are combined into a data warehouse before any further processing is done. During this process, large amounts of data are moved through the network. The second framework is to carry out local data mining first. Global knowledge can then be derived by integrating the partial knowledge obtained from the local databases. It is expected that by integrating knowledge instead of data, network bandwidth can be saved and the computational load can be more evenly distributed. Since the partial knowledge only reflects properties of the local database, how to integrate this partial knowledge into global knowledge that represents the characteristics of the overall data collection remains a problem. Guo et al. [5] argued that in distributed classification problems, the classification error of a global model should, at worst, be the same as the average classification error of the local models and, at best, be lower than the error of the non-distributed model learned on the same domain.

Popularly used data mining techniques include association rule discovery [11], clustering (unsupervised classification), and supervised classification. With the growth of distributed databases, distributed approaches implementing all three techniques have been developed since the early 90s.

Chan and Stolfo proposed a distributed meta-learning algorithm based on the JAM system [2], which is one of the earliest distributed data mining systems developed. JAM [10] stands for Java Agents for Meta-learning. It is a multi-agent framework that carries out meta-learning for fraud detection in banking systems and intrusion detection for network security. In the distributed meta-learning system, classifiers are first derived from different training datasets using different classification algorithms. These base classifiers are then collected and combined by another learning process, the meta-learning process, to


generate a meta-classifier that integrates the separately learned classifiers. Guo and Sutiwaraphun proposed a similar approach named distributed classification with knowledge probing (DCKP) [5]. The difference between DCKP and meta-learning lies in the second learning phase and the form of the final results. In DCKP, the second learning phase is performed on a probing set whose class values are the combinations of predictions from the base classifiers. The result is one descriptive model at the base level rather than the meta level. The performance reported from the empirical studies of both approaches varies from dataset to dataset. Most of the time, the distributed approach performs worse than the non-distributed approach. Recently, there has been significant progress in DDM, and there are approaches dealing with massive datasets that do better than the non-distributed learned model [9].

Kargupta et al. [7] proposed collective data mining (CDM) to learn, by inductive learning, a function which approximates the actual relationship between data attributes. The key idea of CDM is to represent this function as a weighted summation over an orthonormal basis. Each local dataset generates its own weights corresponding to the same basis. Cross terms in the function can be solved when the local weights are collected at a central site. Kargupta also studied distributed clustering using collective principal component analysis (PCA) [6]. Collective PCA has the same objective as global PCA. However, in collective PCA, local principal components, as well as sampled data items from the local datasets, need to be sent to a central site in order to derive the global principal components that can be applied to all datasets. In global PCA, no data items from the local databases are needed in the derivation of the global principal components.

Except for the CDM approach proposed by Kargupta, most current DDM methods deal with only homogeneous databases.

Almost all DDM algorithms need to transfer some data items from the local databases in order to derive the global model. The objective of global PCA is to derive the exact or high-precision global model, from homogeneous or heterogeneous databases respectively, without the transfer of any local data items.
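To make the homogeneous case concrete before the formal development, the integration of local covariance matrices can be sketched in a few lines. The following Python fragment is a minimal illustration of standard covariance pooling, not the authors' implementation; the function name and the (count, mean, covariance) summary format are ours:

```python
import numpy as np

def pooled_covariance(stats):
    """Combine per-site (n_i, mean_i, cov_i) summaries into the global
    mean and covariance without moving any raw records (sketch).

    cov_i is assumed to be the unbiased local covariance of site i.
    """
    n = sum(n_i for n_i, _, _ in stats)               # total record count
    mu = sum(n_i * m_i for n_i, m_i, _ in stats) / n  # global mean
    d = stats[0][1].shape[0]
    scatter = np.zeros((d, d))
    for n_i, m_i, S_i in stats:
        dm = (m_i - mu).reshape(-1, 1)
        # local scatter plus a correction for the shift of the local mean
        scatter += (n_i - 1) * S_i + n_i * (dm @ dm.T)
    return mu, scatter / (n - 1)

# The global principal components are the eigenvectors of the pooled covariance:
# eigvals, eigvecs = np.linalg.eigh(pooled_covariance(stats)[1])
```

Because only d-dimensional summaries travel, the communication cost is independent of the number of records held at each site.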

    II PRINCIPAL COMPONENT ANALYSIS

Principal component analysis (PCA) is a popular technique for dimensionality reduction, which, in turn, is a necessary step in classification [3]. It constructs a representation of the data with a set of orthogonal basis vectors that are the eigenvectors of the covariance matrix generated from the data; these can also be derived from a singular value decomposition. By projecting the data onto the dominant eigenvectors, the dimension of the original dataset can be reduced with little loss of information.

In the PCA-relevant literature, PCA is often presented using the eigenvalue/eigenvector approach on the covariance matrix, but in efficient computation related to PCA, it is the singular value decomposition (SVD) of the data matrix that is used. The relationship between the eigen-decomposition of the covariance matrix and the SVD of the data matrix is presented below to make the connection. In this paper, the eigen-decomposition of the covariance matrix and the SVD of the data matrix


    are used interchangeably.

Let X be the data repository with m records of dimension d (m ≥ d). Assume the dataset is mean-centered, so that E[X] = 0. A modern PCA method is based on finding the singular values and orthonormal singular vectors of the matrix X, as shown in Eq. 1,

$$X = U \Sigma V^T \qquad (1)$$

where U and V hold the left and right singular vectors of X, and $\Sigma$ is a diagonal matrix with positive singular values $\sigma_1, \sigma_2, \ldots, \sigma_d$ ($d = \mathrm{rank}(X)$, assuming $\sigma_1 > \sigma_2 > \cdots > \sigma_j > \cdots > \sigma_d$).
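As a concrete illustration of Eq. 1, the short sketch below computes the SVD of a mean-centered data matrix with NumPy; it is an illustrative fragment of ours, not code from the paper. The rows of $V^T$ are the principal directions, and the singular values measure how much of the data each direction captures:

```python
import numpy as np

def pca_via_svd(X):
    """SVD-based PCA of an m x d data matrix (illustrative sketch).

    Returns U, the singular values s (in descending order), and Vt,
    whose rows are the principal directions, after mean-centering
    so that E[X] = 0.
    """
    Xc = X - X.mean(axis=0)                            # enforce E[X] = 0
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # X = U @ diag(s) @ Vt
    return U, s, Vt

# Projecting onto the k dominant directions reduces the dimension to k:
# X_reduced = (X - X.mean(axis=0)) @ pca_via_svd(X)[2][:k].T
```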

Usually, the first component in Eq. 8, $u_1 \sigma_1 v_1^T$, contains most of the information of X and would thus be a good estimate of X. In other words, besides transferring the singular vectors and singular values, we also transfer $u_1$, the first column vector of the left singular matrix of X. The loss of information from estimating X using only the first component in Eq. 8 can be formulated as in Eq. 9,

$$\epsilon = \frac{\sum_{j=2}^{d} \sigma_j}{\sum_{j=1}^{d} \sigma_j} \qquad (9)$$

Therefore, the amount of data transferred among heterogeneous databases is on the order of $O(rd^2 + m)$. The more $u_j$'s are transferred, the more accurate the estimate of X, but the more data need to be transferred as well.
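The rank-t estimate and its information loss are easy to state in code. The sketch below (our names, assuming the SVD factors from the earlier fragment) generalizes Eq. 9 to keeping the first t components:

```python
import numpy as np

def truncated_estimate(U, s, Vt, t):
    """Rank-t estimate of X from its SVD factors, in the spirit of Eq. 8:
    X_hat = sum over j <= t of u_j * sigma_j * v_j^T (sketch)."""
    return U[:, :t] @ np.diag(s[:t]) @ Vt[:t, :]

def information_loss(s, t):
    """epsilon = (sum of the trailing singular values sigma_{t+1}..sigma_d)
    over (sum of all singular values); Eq. 9 is the t = 1 case."""
    return s[t:].sum() / s.sum()
```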


Applying the above analysis to Eq. 7, we have

$$\mathrm{cov}\left([X \; Y]\right) = \frac{1}{m-1} \begin{bmatrix} (m-1)\,\mathrm{cov}(X) & \hat{X}^T Y \\ Y^T \hat{X} & (m-1)\,\mathrm{cov}(Y) \end{bmatrix}$$

where $\hat{X} = \sum_{i=1}^{t} u_{x_i} \sigma_{x_i} v_{x_i}^T$ approximates X, and t is the number of $u_j$'s transferred. The loss of information is then calculated by

$$\epsilon = \frac{\sum_{j=t+1}^{d} \sigma_j}{\sum_{j=1}^{d} \sigma_j}$$
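Assembling this combined covariance is mechanical once the cross terms are approximated. The sketch below is our illustration, not the paper's code: X and Y are mean-centered attribute blocks over the same m records held at different sites, covX and covY are the exact local covariances, and Xhat is the rank-t estimate of X rebuilt at Y's site from the transferred singular vectors and values:

```python
import numpy as np

def combined_covariance(covX, covY, Xhat, Y, m):
    """Covariance of the concatenated attribute blocks [X Y] (sketch).

    Diagonal blocks use the exact local covariances; the off-diagonal
    cross terms use Xhat in place of X. Assumes mean-centered data and
    the unbiased 1/(m-1) convention.
    """
    cross = Xhat.T @ Y                                # approximates X^T Y
    top = np.hstack([(m - 1) * covX, cross])
    bottom = np.hstack([cross.T, (m - 1) * covY])
    return np.vstack([top, bottom]) / (m - 1)
```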

    V EXPERIMENTS AND RESULTS

The experiments are done on three data sets (Abalone, Pageblocks, and Mfeat) from the UCI Machine Learning Repository [1]. We use Abalone and Pageblocks to simulate the homogeneous distributed environment, and Mfeat the heterogeneous distributed environment. The details of all data sets are shown in Table 1. We adopted two metrics to evaluate the performance of global PCA: the classification accuracy, and the Euclidean distance between the major components derived from the global PCA and a local PCA respectively. For simplicity, we choose the minimum distance classifier, where a sample is assigned to a class if its distance to the class mean is the minimum. In each subset, 30% of the data are used as the training set and the other 70% as the test set.

Data Set     Nr. of Attributes   Nr. of Classes   Nr. of Samples
Abalone              8                 29              4177
Pageblocks          10                  5              5473
Mfeat              649                 10              2000

Table 1: Details of the data sets used in the simulation.

Here, we outline the PCA and classification processes designed in our experiments, given the distributed data sets (including both the training and test sets). Assume $X_i$ is a subset at location i, $X_i^{Tr}$ the training set, and $X_i^{Te}$ the test set, where $X_i = [X_i^{Tr} \; X_i^{Te}]^T$.

Step 1: Apply PCA on $X_i^{Tr}$ to derive the principal components, and use those components which keep most of the information as $P_i$. A parameter indicating the information loss ($\epsilon$) is used to control how many components need to be used. In all the experiments, an information loss ratio of 10% is used.

Step 2: Project both $X_i^{Tr}$ and $X_i^{Te}$ onto $P_i$ to reduce the dimension of the original data set, and get $PX_i^{Tr}$ and $PX_i^{Te}$ respectively.


Step 3: Use $PX_i^{Tr}$ and $PX_i^{Te}$ as the local model for local classification. The local classification accuracy is averaged and compared to the classification accuracy derived from the global PCA (a code sketch of Steps 1-3 is given after this list).

Step 4: Calculate the distance between the major component in $P_i$ and that in the global principal components for performance evaluation.
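As referenced above, the following Python sketch illustrates Steps 1-3 at one local site. It is our illustration under stated assumptions: X_train, y_train, and X_test are hypothetical arrays standing in for $X_i^{Tr}$, its labels, and $X_i^{Te}$, and the 10% information-loss budget follows Step 1:

```python
import numpy as np

def choose_components(s, max_loss=0.10):
    """Smallest k whose information loss (trailing singular value mass
    over total mass) stays within max_loss, mirroring Step 1 (sketch)."""
    total = s.sum()
    for k in range(1, len(s) + 1):
        if s[k:].sum() / total <= max_loss:
            return k
    return len(s)

def min_distance_classify(train_proj, labels, test_proj):
    """Minimum distance classifier: assign each test sample to the class
    whose training-set mean is nearest in the projected space."""
    classes = np.unique(labels)
    means = np.stack([train_proj[labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_proj[:, None, :] - means[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

# Steps 1-3 at one site (hypothetical data): PCA, project, classify.
mean = X_train.mean(axis=0)
U, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
P = Vt[:choose_components(s)].T          # local principal components P_i
pred = min_distance_classify((X_train - mean) @ P, y_train, (X_test - mean) @ P)
```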

    A Global PCA for Distributed Homogeneous Databases

Abalone is a data set used to predict the age of abalone from physical measurements. It contains 4177 samples from 29 classes. We randomly divide the whole data set into 50 homogeneous subsets of the same size. All the subsets have the same number of attributes (or features). Pageblocks is a data set used to classify all the blocks of the page layout of a document that has been detected by a segmentation process. It has 5473 samples from 5 classes. We also randomly divide this data set into 50 homogeneous subsets.

Figures 3 and 4 show the performance comparisons with respect to the classification accuracy and the Euclidean distance on the Abalone data set. We observe (Fig. 3) that even though some of the local classification accuracies are higher than the accuracy using the global PCA, the average local accuracy (0.8705) is 7.8% lower than the global classification accuracy (0.9444). Similar patterns can be observed in Figs. 5 and 6, which are results generated from the Pageblocks data set. For this data set, the global classification accuracy (0.7545) is 23% higher than the averaged local classification accuracy (0.6135).

    B Global PCA for Distributed Heterogeneous Databases

Mfeat is a data set that consists of features of handwritten numerals (0-9) extracted from a collection of Dutch utility maps. Six different feature selection algorithms are applied, and the features are saved in six data files:

1. mfeat-fac: 216 profile correlations

2. mfeat-fou: 76 Fourier coefficients of the character

3. mfeat-kar: 64 Karhunen-Loève coefficients

4. mfeat-mor: 6 morphological features

5. mfeat-pix: 240 pixel averages in 2 x 3 windows

6. mfeat-zer: 47 Zernike moments

Each data file has 2000 samples, and corresponding samples in different feature sets (files) correspond to the same original character. We use these six feature files to simulate a distributed heterogeneous environment.

Figure 7 shows a comparison between the global and local classifications. Notice that the global classification accuracy is calculated under the assumption that no information is lost, that is, all the $u_j$'s are transferred and the local data set is accurately regenerated.


Figure 3: Classification accuracy comparison (x-axis: local data sets; y-axis: classification accuracy). Note: the upper solid straight line indicates the classification accuracy based on the global principal components. The lower dashed straight line is the averaged local classification accuracy based on local principal components at each of the 50 subsets (Abalone).

Figure 4: Euclidean distance between the major components derived from the global PCA and the local PCA (x-axis: local data sets; y-axis: distance between the global and local PCA) (Abalone).


Figure 5: Classification accuracy comparison (x-axis: local data sets; y-axis: classification accuracy). Note: the upper solid straight line indicates the classification accuracy based on the global principal components. The lower dashed straight line is the averaged local classification accuracy based on local principal components at each of the 50 subsets (Pageblocks).

Figure 6: Euclidean distance between the major components derived from the global PCA and the local PCA (x-axis: local data sets; y-axis: distance between the global and local PCA) (Pageblocks).


Figure 7: Classification accuracy comparison (x-axis: local data sets; y-axis: classification accuracy). Note: the upper solid straight line indicates the classification accuracy based on the global principal components (calculated based on all features). The lower dashed straight line is the averaged local classification accuracy based on local principal components at each of the 6 subsets (Mfeat).

However, in real applications, this is very inefficient, since it consumes a tremendous amount of network bandwidth and computing resources. Figure 8 shows the trade-off between the classification accuracy and the number of $u_j$'s being transferred between local data sets. We use

$$\epsilon = \frac{\sum_{j=t+1}^{d} \sigma_j}{\sum_{j=1}^{d} \sigma_j}$$

to calculate the information loss, where t is the number of $u_j$'s transferred. We observe that when only one $u_j$ is transferred, the information loss is about 40%, but the classification accuracy is, interestingly, a little higher than that calculated with all $u_j$'s transferred. As the number of transferred $u_j$'s increases to 10 and 20, the information loss drops to about 15% and 10% respectively, but the classification accuracy does not change; in fact, it converges to the accuracy derived when all $u_j$'s are transferred. Figure 8 is a good example of the effectiveness of the first component ($u_1 \sigma_1 v_1^T$) in approximating the original data set (Eq. 8).


Figure 8: Effect of the number of left singular vectors ($u_j$) transferred. Top-left: information loss ($\epsilon$) vs. t. Bottom-left: Euclidean distance between the major component derived from the global PCA with t $u_j$'s transferred and the major component derived from the global PCA with all $u_j$'s transferred. Bottom-right: classification accuracy vs. t.


    VI CONCLUSION

This paper discusses the problem of distributed data mining. It develops an algorithm to derive the global principal components from distributed databases by mainly transferring the singular vectors and singular values of the local datasets. When the databases are homogeneous, the derived global principal components are exactly the same as those calculated from a centralized database. When the databases are heterogeneous, the global principal components cannot be exact using the same algorithm. We quantitatively analyze the error introduced with respect to different amounts of local data transferred.

    References

[1] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, Irvine, Department of Information and Computer Sciences.

[2] P. Chan and S. Stolfo. Toward parallel and distributed learning by meta-learning. In Working Notes of the AAAI Workshop on Knowledge Discovery in Databases, pages 227-240. AAAI, 1993.

[3] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, Inc., 2nd edition, 2001.

[4] R. L. Grossman. Data mining: challenges and opportunities for data mining during the next decade. http://www.lac.uic.edu, May 1997.

[5] Y. Guo and J. Sutiwaraphun. Advances in Distributed Data Mining, chapter 1: Distributed Classification with Knowledge Probing, pages 1-25. AAAI, 2001.

[6] H. Kargupta, W. Huang, K. Sivakumar, and E. Johnson. Distributed clustering using collective principal component analysis. Under consideration for publication in Knowledge and Information Systems, 2000.

[7] H. Kargupta, B. Park, D. Hershberger, and E. Johnson. Advances in Distributed Data Mining, chapter: Collective data mining: a new perspective toward distributed data mining. AAAI Press, 2002. Submitted for publication.

[8] J. Naisbitt and P. Aburdene. Megatrends 2000: Ten New Directions for the 1990s. Morrow, New York, 1990.

[9] F. J. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2):131-169, 1999.

[10] S. Stolfo et al. JAM: Java agents for meta-learning over distributed databases. In D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, editors,


Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 74-81, Menlo Park, CA, 1997. AAAI Press.

[11] M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, pages 14-25, October-December 1999.
