

Global Principal Component Analysis for Dimensionality Reduction in Distributed Data Mining

Hairong Qi, Tse-Wei Wang, J. Douglas Birdwell

University of Tennessee
Knoxville, TN 37996, USA

Previous data mining activities have mostly focused on mining a centralized database. One big problem with a centralized database is its limited scalability. Because of the distributed nature of many businesses and the exponentially increasing amount of data generated from numerous sources, a distributed database becomes an attractive alternative. The challenge in distributed data mining is how to learn as much knowledge from distributed databases as we do from a centralized database without consuming too much communication bandwidth. Both unsupervised classification (clustering) and supervised classification are common practices in data mining applications, where dimensionality reduction is a necessary step. Principal component analysis is a popular technique used in dimensionality reduction. This paper develops a distributed principal component analysis algorithm which derives the global principal components from distributed databases based on the integration of local covariance matrices. We prove that for homogeneous databases, the algorithm derives global principal components that are exactly the same as those calculated from a centralized database. We also provide a quantitative measurement of the error introduced in the recompiled global principal components when the databases are heterogeneous.

    I INTRODUCTION

Data mining is a technology that deals with the discovery of hidden knowledge, unexpected patterns, and new rules from large databases. In an information society where we are drowning in information but starved for knowledge [8], data mining provides an effective means to analyze uncontrolled and unorganized data and turn them into meaningful knowledge.

The development of different data mining technologies has been spurred since the early 90s. Grossman [4] classified data mining systems into three generations: the first generation develops single algorithms or collections of data mining algorithms to mine vector-valued data. The second generation supports the mining of larger datasets and datasets in higher dimensions. It also includes developing data mining schema and data mining languages to integrate mining into database management systems.


The third generation provides distributed data mining in a transparent fashion. Current commercially available data mining systems mainly belong to the first generation. With the advances in computer networking and information technology, new challenges are brought to the data mining community, which we summarize as follows: 1) large datasets with increased complexity (high dimension); 2) new data types including object-valued attributes, unstructured data (textual data, images, etc.), and semi-structured data (html-tagged data); 3) geographically distributed data locations with heterogeneous data schema; 4) dynamic environments with data items updated in real time; and 5) progressive data mining, which returns quick, partial, or approximate results that can be fine-tuned later in support of more active interaction between users and data mining systems.

The focus of previous data mining research has been on centralized databases. One big problem with a centralized database is its limited scalability. On the other hand, many databases nowadays tend to be maintained distributively, not only because many businesses have a distributed nature, but also because growth can be sustained more gracefully in a distributed system. This paper discusses the problem of distributed data mining (DDM) from geographically distributed data locations, with databases being either homogeneous or heterogeneous.

Data mining in distributed systems can be carried out in two different fashions. In the first, data from distributed locations are transferred to a central processing center, where the distributed databases are combined into a data warehouse before any further processing is done. During this process, large amounts of data are moved through the network. The second framework is to carry out local data mining first. Global knowledge can then be derived by integrating the partial knowledge obtained from the local databases. It is expected that by integrating knowledge instead of data, network bandwidth can be saved and the computational load can be more evenly distributed. Since the partial knowledge only reflects properties of the local database, how to integrate this partial knowledge into global knowledge that represents the characteristics of the overall data collection remains a problem. Guo et al. [5] argued that in distributed classification problems, the classification error of a global model should, at worst, be the same as the average classification error of the local models and, at best, be lower than the error of the non-distributed model learned on the same domain.

Popularly used data mining techniques include association rule discovery [11], clustering (unsupervised classification), and supervised classification. With the growth of distributed databases, distributed approaches implementing all three techniques have been developed since the early 90s.

Chan and Stolfo proposed a distributed meta-learning algorithm based on the JAM system [2], which is one of the earliest distributed data mining systems developed. JAM [10] stands for Java Agents for Meta-learning. It is a multi-agent framework that carries out meta-learning for fraud detection in banking systems and intrusion detection for network security. In the distributed meta-learning system, classifiers are first derived from different training datasets using different classification algorithms. These base classifiers are then collected and combined by another learning process, the meta-learning process, to


generate a meta-classifier that integrates the separately learned classifiers. Guo and Sutiwaraphun proposed a similar approach named distributed classification with knowledge probing (DCKP) [5]. The difference between DCKP and meta-learning lies in the second learning phase and the form of the final results. In DCKP, the second learning phase is performed on a probing set whose class values are the combinations of predictions from the base classifiers. The result is one descriptive model at the base level rather than the meta level. The performance reported from the empirical studies of both approaches varies from dataset to dataset. Most of the time, the distributed approach performs worse than the non-distributed approach. Recently, there has been significant progress in DDM, and there are approaches dealing with massive datasets that do better than the non-distributed learned model [9].

Kargupta et al. [7] proposed collective data mining (CDM) to learn, by inductive learning, a function which approximates the actual relationship between data attributes. The key idea of CDM is to represent this function as a weighted summation over an orthonormal basis. Each local dataset generates its own weights corresponding to the same basis. Cross terms in the function can be solved when the local weights are collected at a central site. Kargupta also studied distributed clustering using collective principal component analysis (PCA) [6]. Collective PCA has the same objective as global PCA. However, in collective PCA, local principal components, as well as sampled data items from the local datasets, need to be sent to a central site in order to derive the global principal components that can be applied to all datasets. In global PCA, no data items from the local databases are needed in the derivation of the global principal components.

Except for the CDM approach proposed by Kargupta, most current DDM methods deal with only homogeneous databases.

Almost all DDM algorithms need to transfer some data items from the local databases in order to derive the global model. The objective of global PCA is to derive the exact or high-precision global model, from homogeneous or heterogeneous databases respectively, without the transfer of any local data items.
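To make the homogeneous case concrete before the formal development, the integration of local covariance matrices can be sketched in a few lines. The following Python fragment is a minimal illustration of standard covariance pooling, not the authors' implementation; the function name and the (count, mean, covariance) summary format are ours:

```python
import numpy as np

def pooled_covariance(stats):
    """Combine per-site (n_i, mean_i, cov_i) summaries into the global
    mean and covariance without moving any raw records (sketch).

    cov_i is assumed to be the unbiased local covariance of site i.
    """
    n = sum(n_i for n_i, _, _ in stats)               # total record count
    mu = sum(n_i * m_i for n_i, m_i, _ in stats) / n  # global mean
    d = stats[0][1].shape[0]
    scatter = np.zeros((d, d))
    for n_i, m_i, S_i in stats:
        dm = (m_i - mu).reshape(-1, 1)
        # local scatter plus a correction for the shift of the local mean
        scatter += (n_i - 1) * S_i + n_i * (dm @ dm.T)
    return mu, scatter / (n - 1)

# The global principal components are the eigenvectors of the pooled covariance:
# eigvals, eigvecs = np.linalg.eigh(pooled_covariance(stats)[1])
```

Because only d-dimensional summaries travel, the communication cost is independent of the number of records held at each site.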

    II PRINCIPAL COMPONENT ANALYSIS

Principal component analysis (PCA) is a popular technique for dimensionality reduction, which, in turn, is a necessary step in classification [3]. It constructs a representation of the data with a set of orthogonal basis vectors that are the eigenvectors of the covariance matrix generated from the data; these can also be derived from a singular value decomposition. By projecting the data onto the dominant eigenvectors, the dimension of the original dataset can be reduced with little loss of information.

In the PCA-relevant literature, PCA is often presented using the eigenvalue/eigenvector approach on the covariance matrix, but in efficient computation related to PCA, it is the singular value decomposition (SVD) of the data matrix that is used. The relationship between the eigen-decomposition of the covariance matrix and the SVD of the data matrix is presented below to make the connection. In this paper, the eigen-decomposition of the covariance matrix and the SVD of the data matrix


    are used interchangeably.

Let X be the data repository with m records of dimension d (m ≥ d). Assume the dataset is mean-centered, so that E[X] = 0. A modern PCA method is based on finding the singular values and orthonormal singular vectors of the matrix X, as shown in Eq. 1,

$$X = U \Sigma V^T \qquad (1)$$

where U and V hold the left and right singular vectors of X, and $\Sigma$ is a diagonal matrix with positive singular values $\sigma_1, \sigma_2, \ldots, \sigma_d$ ($d = \mathrm{rank}(X)$, assuming $\sigma_1 > \sigma_2 > \cdots > \sigma_j > \cdots > \sigma_d$).
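As a concrete illustration of Eq. 1, the short sketch below computes the SVD of a mean-centered data matrix with NumPy; it is an illustrative fragment of ours, not code from the paper. The rows of $V^T$ are the principal directions, and the singular values measure how much of the data each direction captures:

```python
import numpy as np

def pca_via_svd(X):
    """SVD-based PCA of an m x d data matrix (illustrative sketch).

    Returns U, the singular values s (in descending order), and Vt,
    whose rows are the principal directions, after mean-centering
    so that E[X] = 0.
    """
    Xc = X - X.mean(axis=0)                            # enforce E[X] = 0
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # X = U @ diag(s) @ Vt
    return U, s, Vt

# Projecting onto the k dominant directions reduces the dimension to k:
# X_reduced = (X - X.mean(axis=0)) @ pca_via_svd(X)[2][:k].T
```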

Usually, the first component in Eq. 8, $u_1 \sigma_1 v_1^T$, contains most of the information of X and would thus be a good estimate of X. In other words, besides transferring the singular vectors and singular values, we also transfer $u_1$, the first column vector of the left singular matrix of X. The loss of information from estimating X using only the first component in Eq. 8 can be formulated as in Eq. 9,

$$\epsilon = \frac{\sum_{j=2}^{d} \sigma_j}{\sum_{j=1}^{d} \sigma_j} \qquad (9)$$

Therefore, the amount of data transferred among heterogeneous databases is on the order of $O(rd^2 + m)$. The more $u_j$'s are transferred, the more accurate the estimate of X, but the more data need to be transferred as well.
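The rank-t estimate and its information loss are easy to state in code. The sketch below (our names, assuming the SVD factors from the earlier fragment) generalizes Eq. 9 to keeping the first t components:

```python
import numpy as np

def truncated_estimate(U, s, Vt, t):
    """Rank-t estimate of X from its SVD factors, in the spirit of Eq. 8:
    X_hat = sum over j <= t of u_j * sigma_j * v_j^T (sketch)."""
    return U[:, :t] @ np.diag(s[:t]) @ Vt[:t, :]

def information_loss(s, t):
    """epsilon = (sum of the trailing singular values sigma_{t+1}..sigma_d)
    over (sum of all singular values); Eq. 9 is the t = 1 case."""
    return s[t:].sum() / s.sum()
```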


Applying the above analysis to Eq. 7, we have

$$\mathrm{cov}\left([X \; Y]\right) = \frac{1}{m-1} \begin{bmatrix} (m-1)\,\mathrm{cov}(X) & \hat{X}^T Y \\ Y^T \hat{X} & (m-1)\,\mathrm{cov}(Y) \end{bmatrix}$$

where $\hat{X} = \sum_{i=1}^{t} u_{x_i} \sigma_{x_i} v_{x_i}^T$ approximates X, and t is the number of $u_j$'s transferred. The loss of information is then calculated by

$$\epsilon = \frac{\sum_{j=t+1}^{d} \sigma_j}{\sum_{j=1}^{d} \sigma_j}$$
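Assembling this combined covariance is mechanical once the cross terms are approximated. The sketch below is our illustration, not the paper's code: X and Y are mean-centered attribute blocks over the same m records held at different sites, covX and covY are the exact local covariances, and Xhat is the rank-t estimate of X rebuilt at Y's site from the transferred singular vectors and values:

```python
import numpy as np

def combined_covariance(covX, covY, Xhat, Y, m):
    """Covariance of the concatenated attribute blocks [X Y] (sketch).

    Diagonal blocks use the exact local covariances; the off-diagonal
    cross terms use Xhat in place of X. Assumes mean-centered data and
    the unbiased 1/(m-1) convention.
    """
    cross = Xhat.T @ Y                                # approximates X^T Y
    top = np.hstack([(m - 1) * covX, cross])
    bottom = np.hstack([cross.T, (m - 1) * covY])
    return np.vstack([top, bottom]) / (m - 1)
```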

    V EXPERIMENTS AND RESULTS

The experiments are done on three data sets (Abalone, Pageblocks, and Mfeat) from the UCI Machine Learning Repository [1]. We use Abalone and Pageblocks to simulate the homogeneous distributed environment, and Mfeat the heterogeneous distributed environment. The details of all data sets are shown in Table 1. We adopted two metrics to evaluate the performance of global PCA: the classification accuracy, and the Euclidean distance between the major components derived from the global PCA and a local PCA respectively. For simplicity, we choose the minimum distance classifier, where a sample is assigned to a class if its distance to the class mean is the minimum. In each subset, 30% of the data are used as the training set and the other 70% as the test set.

Data Set     Nr. of Attributes   Nr. of Classes   Nr. of Samples
Abalone              8                 29              4177
Pageblocks          10                  5              5473
Mfeat              649                 10              2000

Table 1: Details of the data sets used in the simulation.

Here, we outline the PCA and classification processes designed in our experiments, given the distributed data sets (including both the training and test sets). Assume $X_i$ is a subset at location i, $X_i^{Tr}$ the training set, and $X_i^{Te}$ the test set, where $X_i = [X_i^{Tr} \; X_i^{Te}]^T$.

Step 1: Apply PCA on $X_i^{Tr}$ to derive the principal components, and use those components which keep most of the information as $P_i$. A parameter indicating the information loss ($\epsilon$) is used to control how many components need to be used. In all the experiments, an information loss ratio of 10% is used.

Step 2: Project both $X_i^{Tr}$ and $X_i^{Te}$ onto $P_i$ to reduce the dimension of the original data set, and get $PX_i^{Tr}$ and $PX_i^{Te}$ respectively.


Step 3: Use $PX_i^{Tr}$ and $PX_i^{Te}$ as the local model for local classification. The local classification accuracy is averaged and compared to the classification accuracy derived from the global PCA (a code sketch of Steps 1-3 is given after this list).

Step 4: Calculate the distance between the major component in $P_i$ and that in the global principal components for performance evaluation.
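As referenced above, the following Python sketch illustrates Steps 1-3 at one local site. It is our illustration under stated assumptions: X_train, y_train, and X_test are hypothetical arrays standing in for $X_i^{Tr}$, its labels, and $X_i^{Te}$, and the 10% information-loss budget follows Step 1:

```python
import numpy as np

def choose_components(s, max_loss=0.10):
    """Smallest k whose information loss (trailing singular value mass
    over total mass) stays within max_loss, mirroring Step 1 (sketch)."""
    total = s.sum()
    for k in range(1, len(s) + 1):
        if s[k:].sum() / total <= max_loss:
            return k
    return len(s)

def min_distance_classify(train_proj, labels, test_proj):
    """Minimum distance classifier: assign each test sample to the class
    whose training-set mean is nearest in the projected space."""
    classes = np.unique(labels)
    means = np.stack([train_proj[labels == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_proj[:, None, :] - means[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

# Steps 1-3 at one site (hypothetical data): PCA, project, classify.
mean = X_train.mean(axis=0)
U, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
P = Vt[:choose_components(s)].T          # local principal components P_i
pred = min_distance_classify((X_train - mean) @ P, y_train, (X_test - mean) @ P)
```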

    A Global PCA for Distributed Homogeneous Databases

Abalone is a data set used to predict the age of abalone from physical measurements. It contains 4177 samples from 29 classes. We randomly divide the whole data set into 50 homogeneous subsets of the same size. All the subsets have the same number of attributes (or features). Pageblocks is a data set used to classify all the blocks of the page layout of a document that has been detected by a segmentation process. It has 5473 samples from 5 classes. We also randomly divide this data set into 50 homogeneous subsets.

Figures 3 and 4 show the performance comparisons with respect to the classification accuracy and the Euclidean distance on the Abalone data set. We observe (Fig. 3) that even though some of the local classification accuracies are higher than the accuracy using the global PCA, the average local accuracy (0.8705) is 7.8% lower than the global classification accuracy (0.9444). Similar patterns can be observed in Figs. 5 and 6, which are results generated from the Pageblocks data set. For this data set, the global classification accuracy (0.7545) is 23% higher than the averaged local classification accuracy (0.6135).

    B Global PCA for Distributed Heterogeneous Databases

Mfeat is a data set that consists of features of handwritten numerals (0-9) extracted from a collection of Dutch utility maps. Six different feature selection algorithms are applied, and the features are saved in six data files:

1. mfeat-fac: 216 profile correlations

2. mfeat-fou: 76 Fourier coefficients of the character

3. mfeat-kar: 64 Karhunen-Loève coefficients

4. mfeat-mor: 6 morphological features

5. mfeat-pix: 240 pixel averages in 2 x 3 windows

6. mfeat-zer: 47 Zernike moments

Each data file has 2000 samples, and corresponding samples in different feature sets (files) correspond to the same original character. We use these six feature files to simulate a distributed heterogeneous environment.

Figure 7 shows a comparison between the global and local classifications. Notice that the global classification accuracy is calculated under the assumption that no information is lost, that is, all the $u_j$'s are transferred and the local data set is accurately regenerated.


Figure 3: Classification accuracy comparison (x-axis: local data sets; y-axis: classification accuracy). Note: the upper solid straight line indicates the classification accuracy based on the global principal components. The lower dashed straight line is the averaged local classification accuracy based on local principal components at each of the 50 subsets (Abalone).

Figure 4: Euclidean distance between the major components derived from the global PCA and the local PCA (x-axis: local data sets; y-axis: distance between the global and local PCA) (Abalone).


Figure 5: Classification accuracy comparison (x-axis: local data sets; y-axis: classification accuracy). Note: the upper solid straight line indicates the classification accuracy based on the global principal components. The lower dashed straight line is the averaged local classification accuracy based on local principal components at each of the 50 subsets (Pageblocks).

Figure 6: Euclidean distance between the major components derived from the global PCA and the local PCA (x-axis: local data sets; y-axis: distance between the global and local PCA) (Pageblocks).


Figure 7: Classification accuracy comparison (x-axis: local data sets; y-axis: classification accuracy). Note: the upper solid straight line indicates the classification accuracy based on the global principal components (calculated based on all features). The lower dashed straight line is the averaged local classification accuracy based on local principal components at each of the 6 subsets (Mfeat).

However, in real applications, this is very inefficient, since it consumes a tremendous amount of network bandwidth and computing resources. Figure 8 shows the trade-off between the classification accuracy and the number of $u_j$'s being transferred between local data sets. We use

$$\epsilon = \frac{\sum_{j=t+1}^{d} \sigma_j}{\sum_{j=1}^{d} \sigma_j}$$

to calculate the information loss, where t is the number of $u_j$'s transferred. We observe that when only one $u_j$ is transferred, the information loss is about 40%, but the classification accuracy is, interestingly, a little higher than that calculated with all $u_j$'s transferred. As the number of transferred $u_j$'s increases to 10 and 20, the information loss drops to about 15% and 10% respectively, but the classification accuracy does not change; in fact, it converges to the accuracy derived when all $u_j$'s are transferred. Figure 8 is a good example of the effectiveness of the first component ($u_1 \sigma_1 v_1^T$) in approximating the original data set (Eq. 8).


Figure 8: Effect of the number of left singular vectors ($u_j$) transferred. Top-left: information loss ($\epsilon$) vs. t. Bottom-left: Euclidean distance between the major component derived from the global PCA with t $u_j$'s transferred and the major component derived from the global PCA with all $u_j$'s transferred. Bottom-right: classification accuracy vs. t.


    VI CONCLUSION

This paper discusses the problem of distributed data mining. It develops an algorithm to derive the global principal components from distributed databases by mainly transferring the singular vectors and singular values of the local datasets. When the databases are homogeneous, the derived global principal components are exactly the same as those calculated from a centralized database. When the databases are heterogeneous, the global principal components cannot be exact using the same algorithm. We quantitatively analyze the error introduced with respect to different amounts of local data transferred.

    References

[1] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, Irvine, Department of Information and Computer Sciences.

[2] P. Chan and S. Stolfo. Toward parallel and distributed learning by meta-learning. In Working Notes of the AAAI Workshop on Knowledge Discovery in Databases, pages 227-240. AAAI, 1993.

[3] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, Inc., 2nd edition, 2001.

[4] R. L. Grossman. Data mining: challenges and opportunities for data mining during the next decade. http://www.lac.uic.edu, May 1997.

[5] Y. Guo and J. Sutiwaraphun. Advances in Distributed Data Mining, chapter 1: Distributed Classification with Knowledge Probing, pages 1-25. AAAI, 2001.

[6] H. Kargupta, W. Huang, K. Sivakumar, and E. Johnson. Distributed clustering using collective principal component analysis. Under consideration for publication in Knowledge and Information Systems, 2000.

[7] H. Kargupta, B. Park, D. Hershberger, and E. Johnson. Advances in Distributed Data Mining, chapter: Collective data mining: a new perspective toward distributed data mining. AAAI Press, 2002. Submitted for publication.

[8] J. Naisbitt and P. Aburdene. Megatrends 2000: Ten New Directions for the 1990s. Morrow, New York, 1990.

[9] F. J. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2):131-169, 1999.

[10] S. Stolfo et al. JAM: Java agents for meta-learning over distributed databases. In D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, editors,


Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 74-81, Menlo Park, CA, 1997. AAAI Press.

[11] M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, pages 14-25, October-December 1999.
