Global Principal Component Analysis for
Dimensionality Reduction in Distributed Data
Mining
Hairong Qi, Tsei-Wei Wang, J. Douglas Birdwell
University of Tennessee
Knoxville, TN 37996, USA
Previous data mining activities have mostly focused on mining a centralized database. One big problem with a centralized database is its limited scalability. Because of the distributed nature of many businesses and the exponentially increasing amount of data generated from numerous sources, a distributed database becomes an attractive alternative. The challenge in distributed data mining is how to learn as much knowledge from distributed databases as we do from a centralized database without consuming too much communication bandwidth. Both unsupervised classification (clustering) and supervised classification are common practices in data mining applications, where dimensionality reduction is a necessary step. Principal component analysis is a popular technique used in dimensionality reduction. This paper develops a distributed principal component analysis algorithm which derives the global principal components from distributed databases based on the integration of local covariance matrices. We prove that for homogeneous databases, the algorithm derives global principal components that are exactly the same as those calculated from a centralized database. We also provide a quantitative measurement of the error introduced in the derived global principal components when the databases are heterogeneous.
I INTRODUCTION
Data mining is a technology that deals with the discovery of hidden knowledge, unexpected patterns, and new rules in large databases. In an information society where we are drowning in information but starved for knowledge [8], data mining provides an effective means to analyze uncontrolled and unorganized data and turn them into meaningful knowledge.
The development of different data mining technologies has been spurred since the early 90s. Grossman [4] classified data mining systems into three generations: The first generation develops single data mining algorithms, or collections of them, to mine vector-valued data. The second generation supports mining of larger datasets and datasets in higher dimensions. It also includes developing data mining schemas and data mining languages to integrate mining into database management systems.
The third generation provides distributed data mining in a transparent fashion. Currently available commercial data mining systems mainly belong to the first generation.
With the advances in computer networking and information technology, new challenges are brought to the data mining community, which we summarize as follows: 1) large datasets with increased complexity (high dimension); 2) new data types, including object-valued attributes, unstructured data (textual data, images, etc.), and semi-structured data (HTML-tagged data); 3) geographically distributed data locations with heterogeneous data schemas; 4) dynamic environments with data items updated in real time; and 5) progressive data mining, which returns quick, partial, or approximate results that can be fine-tuned later in support of more active interaction between users and data mining systems.
The focus of previous data mining research has been on centralized databases. One big problem with a centralized database is its limited scalability. On the other hand, many databases nowadays tend to be maintained distributively, not only because many businesses have a distributed nature, but also because growth can be sustained more gracefully in a distributed system. This paper discusses the problem of distributed data mining (DDM) from geographically distributed data locations, with databases being either homogeneous or heterogeneous.
Data mining in distributed systems can be carried out in two different fashions. In the first, data from distributed locations are transferred to a central processing center, where the distributed databases are combined into a data warehouse before any further processing is done. During this process, large amounts of data are moved through the network. The second framework is to carry out local data mining first. Global knowledge can then be derived by integrating the partial knowledge obtained from the local databases. It is expected that by integrating knowledge instead of data, network bandwidth can be saved and computational load can be more evenly distributed. Since the partial knowledge only reflects properties of the local database, how to integrate the partial knowledge into global knowledge that represents the characteristics of the overall data collection remains a problem. Guo et al. argued in [5] that in distributed classification problems, the classification error of a global model should, at worst, be the same as the average classification error of the local models and, at best, be lower than the error of the non-distributed model learned on the same domain.
Popular data mining techniques include association rule discovery [11], clustering (unsupervised classification), and supervised classification. With the growth of distributed databases, distributed approaches to all three techniques have been developed since the early 90s.
Chan and Stolfo proposed a distributed meta-learning algorithm based on the JAM system [2], one of the earliest distributed data mining systems developed. JAM [10] stands for Java Agents for Meta-learning. It is a multi-agent framework that carries out meta-learning for fraud detection in banking systems and intrusion detection for network security. In the distributed meta-learning system, classifiers are first derived from different training datasets using different classification algorithms. These base classifiers are then collected and combined by another learning process, the meta-learning process, to
generate a meta-classifier that integrates the separately learned classifiers. Guo and Sutiwaraphun proposed a similar approach named distributed classification with knowledge probing (DCKP) [5]. The difference between DCKP and meta-learning lies in the second learning phase and the form of the final results. In DCKP, the second learning phase is performed on a probing set whose class values are combinations of predictions from the base classifiers. The result is one descriptive model at the base level rather than at the meta level. The performance reported from the empirical studies of both approaches varies from dataset to dataset. Most of the time, the distributed approach performs worse than the non-distributed approach. Recently, there has been significant progress in DDM, and there are approaches dealing with massive datasets that do better than the non-distributed learned model [9].
Kargupta et al. [7] proposed collective data mining (CDM) to learn, by inductive learning, a function which approximates the actual relationship between data attributes. The key idea of CDM is to represent this function as a weighted summation over an orthonormal basis. Each local dataset generates its own weights corresponding to the same basis. Cross terms in the function can be solved when the local weights are collected at a central site. Kargupta also studied distributed clustering using collective principal component analysis (PCA) [6]. Collective PCA has the same objective as global PCA. However, in collective PCA, local principal components, as well as sampled data items from the local datasets, need to be sent to a central site in order to derive the global principal components that can be applied to all datasets. In global PCA, no data items from the local databases are needed in the derivation of the global principal components.
Except for the CDM approach proposed by Kargupta, most current DDM methods deal only with homogeneous databases. Almost all DDM algorithms need to transfer some data items from the local databases in order to derive the global model. The objective of global PCA is to derive the exact or high-precision global model, from homogeneous or heterogeneous databases respectively, without the transfer of any local data items.
II PRINCIPAL COMPONENT ANALYSIS
Principal component analysis (PCA) is a popular technique for dimensionality reduction, which, in turn, is a necessary step in classification [3]. It constructs a representation of the data with a set of orthogonal basis vectors that are the eigenvectors of the covariance matrix generated from the data; these can also be derived from a singular value decomposition. By projecting the data onto the dominant eigenvectors, the dimension of the original dataset can be reduced with little loss of information.
In the PCA literature, PCA is often presented using the eigenvalue/eigenvector decomposition of the covariance matrix. But in efficient computation related to PCA, it is the singular value decomposition (SVD) of the data matrix that is used. The relationship between the eigen-decomposition of the covariance matrix and the SVD of the data matrix is presented below to make the connection. In this paper, the eigen-decomposition of the covariance matrix and the SVD of the data matrix
are used interchangeably.
Let X be the data repository with m records of dimension d (m ≥ d). Assume the dataset is mean-centered so that E[X] = 0. A modern PCA method is based on finding the singular values and orthonormal singular vectors of the matrix X, as shown in Eq. 1,

    X = U Σ V^T                                    (1)

where U and V are the left and right singular matrices of X, and Σ is a diagonal matrix of positive singular values σ_1, σ_2, ..., σ_d (d = rank(X)), assuming σ_1 > σ_2 > ... > σ_j > ... > σ_d.
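To make this connection concrete, the following is a minimal sketch (ours, using NumPy; it is not code from the paper) of deriving principal components via the SVD of a mean-centered data matrix and projecting the data onto the dominant components:

import numpy as np

def pca_via_svd(X, k):
    # Mean-center so that E[X] = 0, as assumed in Eq. 1.
    Xc = X - X.mean(axis=0)
    # X = U @ diag(s) @ Vt; the rows of Vt are the eigenvectors of cov(X).
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T              # d x k matrix of dominant components
    return Xc @ P, P          # projected data (m x k) and the components

# Example: reduce 8-dimensional records to 3 dimensions.
X = np.random.randn(100, 8)
Z, P = pca_via_svd(X, 3)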
Usually, the first component in Eq. 8 (u_1 σ_1 v_1^T) contains most of the information in X and is thus a good estimate of X. In other words, besides transferring the singular vectors and singular values, we also transfer u_1, the first column vector of the left singular matrix of X. The loss of information incurred by estimating X using only the first component in Eq. 8 can be formulated as Eq. 9,

    ε = ( ∑_{j=2}^{d} σ_j ) / ( ∑_{j=1}^{d} σ_j )          (9)

Therefore, the amount of data transferred among heterogeneous databases is on the order of O(rd^2 + m). The more u_j's are transferred, the more accurate the estimate of X becomes, but the more data need to be transferred as well.
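As an illustration (again our sketch, not the paper's code), the rank-t estimate X̂ = ∑_{j=1}^{t} u_j σ_j v_j^T and the loss ε of Eq. 9 could be computed as follows:

import numpy as np

def rank_t_estimate(X, t):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the top-t components: X_hat = sum_j u_j * sigma_j * v_j^T.
    X_hat = U[:, :t] @ np.diag(s[:t]) @ Vt[:t]
    # Eq. 9, generalized to t terms: fraction of singular value mass discarded.
    eps = s[t:].sum() / s.sum()
    return X_hat, eps

X = np.random.randn(200, 20)
X_hat, eps = rank_t_estimate(X, 1)   # t = 1 keeps only u_1*sigma_1*v_1^T (Eq. 8)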
Applying the above analysis to Eq. 7, we have

    cov([X Y]) = 1/(m-1) [ m·cov(X)   X̂^T Y
                           Y^T X̂      m·cov(Y) ]

where X̂ = ∑_{i=1}^{t} u_{x_i} σ_{x_i} v_{x_i}^T approximates X and t is the number of u_j's transferred. The loss of information is then calculated by

    ε = ( ∑_{j=t+1}^{d} σ_j ) / ( ∑_{j=1}^{d} σ_j )
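For illustration, here is a sketch of assembling the combined covariance matrix above (our own code, assuming two sites holding the same m records with different attribute sets, and assuming the local covariances use the 1/m convention so that m·cov(X) = X^T X):

import numpy as np

def combined_covariance(cov_X, cov_Y, X_hat, Y):
    # cov_X (dx x dx) and cov_Y (dy x dy): local covariances, computed on site.
    # X_hat (m x dx): rank-t reconstruction of X from the transferred
    #                 u_j, sigma_j, v_j; Y (m x dy): data held locally at site 2.
    m = Y.shape[0]
    cross = X_hat.T @ Y                       # approximates the cross term X^T Y
    top = np.hstack([m * cov_X, cross])
    bottom = np.hstack([cross.T, m * cov_Y])
    return np.vstack([top, bottom]) / (m - 1)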
V EXPERIMENTS AND RESULTS
The experiments are done on three data sets (Abalone, Pageblocks, and Mfeat) from the UCI Machine Learning Repository [1]. We use Abalone and Pageblocks to simulate the homogeneous distributed environment, and Mfeat the heterogeneous distributed environment. The details of all data sets are shown in Table 1. We adopted two metrics to evaluate the performance of global PCA: the classification accuracy, and the Euclidean distance between the major components derived from the global PCA and a local PCA respectively. For simplicity, we choose the minimum-distance classifier, where a sample is assigned to the class whose mean is closest to it. In each subset, 30% of the data are used as the training set, and the other 70% as the test set.
Data Set     Nr. of Attributes   Nr. of Classes   Nr. of Samples
Abalone      8                   29               4177
Pageblocks   10                  5                5473
Mfeat        649                 10               2000

Table 1: Details of the data sets used in the simulation.
Here, we outline the PCA and classification processes designed in our experiments, given the distributed data sets (including both the training and test sets); a code sketch of these steps follows the list. Assume X_i is the subset at location i, X_i^Tr the training set, and X_i^Te the test set, where X_i = [X_i^Tr  X_i^Te]^T.

Step 1: Apply PCA on X_i^Tr to derive the principal components, and keep those components which retain most of the information as P_i. A parameter indicating the information loss (ε) is used to control how many components need to be used. In all the experiments, an information loss ratio of 10% is used.

Step 2: Project both X_i^Tr and X_i^Te onto P_i to reduce the dimension of the original data set, obtaining PX_i^Tr and PX_i^Te respectively.
Step 3: Use PX_i^Tr and PX_i^Te as the local model for local classification. The local classification accuracies are averaged and compared to the classification accuracy derived from the global PCA.

Step 4: Calculate the distance between the major component in P_i and that in the global principal components for performance evaluation.
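A compact sketch of Steps 1-3 (our illustration; the ε = 10% threshold follows the text, while the helper names and the NumPy-based nearest-class-mean classifier are ours):

import numpy as np

def local_pca(X_train, max_loss=0.10):
    # Step 1: keep the leading components so the information loss stays below 10%.
    mu = X_train.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    loss = 1.0 - np.cumsum(s) / s.sum()   # loss after keeping 1, 2, ... components
    k = int(np.argmax(loss <= max_loss)) + 1
    return mu, Vt[:k].T                    # training mean and P_i (d x k)

def min_distance_classify(Z_train, y_train, Z_test):
    # Steps 2-3: assign each projected test sample to the class with the nearest mean.
    labels = np.unique(y_train)
    means = np.array([Z_train[y_train == c].mean(axis=0) for c in labels])
    dists = np.linalg.norm(Z_test[:, None, :] - means[None, :, :], axis=2)
    return labels[dists.argmin(axis=1)]

# Usage at each site i:
#   mu, P = local_pca(X_tr)
#   pred = min_distance_classify((X_tr - mu) @ P, y_tr, (X_te - mu) @ P)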
A Global PCA for Distributed Homogeneous Databases
Abalone is a data set used to predict the age of abalone from physical measurements. It contains 4177 samples from 29 classes. We randomly divide the whole data set into 50 homogeneous subsets of the same size. All the subsets have the same number of attributes (or features). Pageblocks is a data set used to classify all the blocks of the page layout of a document detected by a segmentation process. It has 5473 samples from 5 classes. We also randomly divide this data set into 50 homogeneous subsets.

Figures 3 and 4 show the performance comparisons with respect to classification accuracy and Euclidean distance on the Abalone data set. We observe (Fig. 3) that even though some of the local classification accuracies are higher than the accuracy using the global PCA, the average local accuracy (0.8705) is 7.8% lower than the global classification accuracy (0.9444). Similar patterns can be observed in Figs. 5 and 6, which are results generated from the Pageblocks data set. For this data set, the global classification accuracy (0.7545) is 23% higher than the averaged local classification accuracy (0.6135).
B Global PCA for Distributed Heterogeneous Databases
Mfeat is a data set that consists of features of handwritten numerals (0-9) extracted from a collection of Dutch utility maps. Six different feature selection algorithms are applied, and the features are saved in six data files:

1. mfeat-fac: 216 profile correlations
2. mfeat-fou: 76 Fourier coefficients of the character shapes
3. mfeat-kar: 64 Karhunen-Loève coefficients
4. mfeat-mor: 6 morphological features
5. mfeat-pix: 240 pixel averages in 2 x 3 windows
6. mfeat-zer: 47 Zernike moments

Each data file has 2000 samples, and corresponding samples in different feature sets (files) correspond to the same original character. We use these six feature files to simulate a distributed heterogeneous environment.
Figure 7 shows a comparison between the global and local classifications. Notice that the global classification accuracy is calculated under the assumption that no information is lost, that is, all the u_j's are transferred and the local data set
[Plot: classification accuracy vs. local data sets (1-50), Abalone.]
Figure 3: Classification accuracy comparison. Note: the upper solid straight line indicates the classification accuracy based on the global principal components. The lower dashed straight line is the averaged local classification accuracy based on local principal components at each of the 50 subsets (Abalone).
[Plot: distance between the global and local PCA vs. local data sets (1-50), Abalone.]
Figure 4: Euclidean distance between the major components derived from the global PCA and the local PCA (Abalone).
[Plot: classification accuracy vs. local data sets (1-50), Pageblocks.]
Figure 5: Classification accuracy comparison. Note: the upper solid straight line indicates the classification accuracy based on the global principal components. The lower dashed straight line is the averaged local classification accuracy based on local principal components at each of the 50 subsets (Pageblocks).
[Plot: distance between the global and local PCA vs. local data sets (1-50), Pageblocks.]
Figure 6: Euclidean distance between the major components derived from the global PCA and the local PCA (Pageblocks).
[Plot: classification accuracy vs. local data sets (1-6), Mfeat.]
Figure 7: Classification accuracy comparison. Note: the upper solid straight line indicates the classification accuracy based on the global principal components (calculated based on all features). The lower dashed straight line is the averaged local classification accuracy based on local principal components at each of the 6 subsets (Mfeat).
is accurately regenerated. However, in real applications this is very inefficient, since it consumes a tremendous amount of network bandwidth and computing resources. Figure 8 shows the trade-off between the classification accuracy and the number of u_j's being transferred between local data sets. We use

    ε = ( ∑_{j=t+1}^{d} σ_j ) / ( ∑_{j=1}^{d} σ_j )

to calculate the information loss, where t is the number of u_j's transferred. We observe that when only one u_j is transferred, the information loss is about 40%, but the classification accuracy is, interestingly, a little higher than that calculated with all u_j's transferred. As the number of transferred u_j's increases to 10 and 20, the information loss drops to about 15% and 10% respectively, but the classification accuracy does not change; it converges to the accuracy derived when all u_j's are transferred. Figure 8 is a good example of the effectiveness of the first component (u_1 σ_1 v_1^T) in approximating the original data set (Eq. 8).
[Three-panel plot vs. t, the number of u_j transferred (t = 1, 10, 20).]
Figure 8: Effect of the number of left singular vectors (u_j) transferred. Top-left: information loss (ε) vs. t. Bottom-left: Euclidean distance between the major component derived from the global PCA with t u_j's transferred and the major component derived from the global PCA with all u_j's transferred. Bottom-right: classification accuracy vs. t.
VI CONCLUSION
This paper discusses the problem of distributed data mining. It develops an algorithm to derive the global principal components from distributed databases by mainly transferring the singular vectors and singular values of the local datasets. When the databases are homogeneous, the derived global principal components are exactly the same as those calculated from a centralized database. When the databases are heterogeneous, the global principal components cannot be derived exactly using the same algorithm; we quantitatively analyze the error introduced with respect to different amounts of local data transferred.
References
[1] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/mlearn/MLRepository.html. University of California, Irvine, Department of Information and Computer Sciences.

[2] P. Chan and S. Stolfo. Toward parallel and distributed learning by meta-learning. In Working Notes AAAI Workshop on Knowledge Discovery in Databases, pages 227-240. AAAI, 1993.

[3] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, Inc., 2nd edition, 2001.

[4] R. L. Grossman. Data mining: challenges and opportunities for data mining during the next decade. http://www.lac.uic.edu, May 1997.

[5] Y. Guo and J. Sutiwaraphun. Advances in Distributed Data Mining, chapter 1: Distributed Classification with Knowledge Probing, pages 1-25. AAAI, 2001.

[6] H. Kargupta, W. Huang, K. Sivakumar, and E. Johnson. Distributed clustering using collective principal component analysis. Under consideration for publication in Knowledge and Information Systems, 2000.

[7] H. Kargupta, B. Park, D. Hershberger, and E. Johnson. Advances in Distributed Data Mining, chapter: Collective data mining: a new perspective toward distributed data mining. AAAI Press, 2002. Submitted for publication.

[8] J. Naisbitt and P. Aburdene. Megatrends 2000: Ten New Directions for the 1990s. Morrow, New York, 1990.

[9] F. J. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2):131-169, 1999.

[10] S. Stolfo et al. JAM: Java agents for meta-learning over distributed databases. In D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, editors,
Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 74-81, Menlo Park, CA, 1997. AAAI Press.

[11] M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, pages 14-25, October-December 1999.