Given a data set with n objects and k ≤ n, the number of desired clusters, partitioning algorithms partition the objects into k clusters. The clusters are formed in order to optimize an objective criterion such as distance. Each object is assigned to the closest cluster. Clusters are typically represented either by the mean of the objects assigned to the cluster (k-means [Mac67]) or by one representative object of the cluster (k-medoid [KR90]). CLARANS (Clustering Large Applications based upon RANdomized Search) [NH94] is a partitioning clustering algorithm developed for large data sets, which uses a randomized and bounded search strategy to improve the scalability of the k-medoid approach. CLARANS enables the detection of outliers, and its computational complexity is about O(n²). The performance of CLARANS can be improved by exploiting spatial data structures such as R*-trees.

Hierarchical clustering algorithms work by grouping data objects into a hierarchy (e.g., a tree) of clusters. The hierarchy can be formed top-down (divisive hierarchical methods) or bottom-up (agglomerative hierarchical methods). Hierarchical methods rely on a distance function to measure the similarity between clusters. These methods do not scale well with the number of data objects; their computational complexity is usually O(n²). Some newer methods such as BIRCH [ZRL96] and CURE [GRS98] attempt to address the scalability problem and improve the quality of clustering results for hierarchical methods. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an efficient divisive hierarchical algorithm. It has O(n) computational complexity, can work with a limited amount of memory, and has efficient I/O. It uses a special data structure, the CF-tree (Cluster Feature tree), for storing summary information about subclusters of objects. The CF-tree structure can be seen as a multilevel compression of the data that attempts to preserve the clustering structure inherent in the data set. Because of the similarity measure it uses to determine the data items to be compressed, BIRCH only performs well on data sets with spherical clusters. CURE (Clustering Using REpresentatives) is an O(n²) algorithm that produces high-quality clusters in the presence of outliers and can identify clusters of complex shapes and different sizes. It employs a hierarchical clustering approach that uses a fixed number of representative points to define a cluster instead of a single centroid or object. CURE handles large data sets through a combination of random sampling and partitioning. Since CURE uses only a random sample of the data set, it manages to achieve good scalability for large data sets. CURE reports better times than BIRCH on the same benchmark data.

Locality-based clustering algorithms group neighboring data objects into clusters based on local conditions. These algorithms allow clustering to be performed in one scan of the data set. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [EKSX96] is a typical representative of this group of algorithms. It regards clusters as dense regions of objects in the input space that are separated by regions of low density. DBSCAN's basic idea is that the density of points in a radius around each point in a cluster has to be above a certain threshold. It grows a cluster as long as, for each data point within the cluster, a neighborhood of a given radius contains at least a minimum number of points. DBSCAN has computational complexity O(n²). If a spatial index is used, the computational complexity is O(n log n). The clustering generated by DBSCAN is very sensitive to parameter choice. OPTICS (Ordering Points To Identify Clustering Structures) is another locality-based clustering algorithm. It computes an augmented cluster ordering for automatic and interactive clustering analysis. OPTICS has the same computational complexity as DBSCAN.
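As a brief illustration (not taken from the paper), scikit-learn's DBSCAN implementation exposes exactly the two quantities mentioned above: the neighborhood radius (eps) and the minimum number of points (min_samples). The data and parameter values below are arbitrary; the snippet simply shows how the resulting clustering changes with the radius, which is the parameter sensitivity referred to in the text.

```python
# Minimal sketch of DBSCAN's radius / minimum-points parameters using
# scikit-learn; data and parameter values are arbitrary examples.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus sparse background noise.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(200, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(200, 2)),
    rng.uniform(low=-2, high=7, size=(50, 2)),
])

for eps in (0.2, 0.5, 1.5):  # neighborhood radius
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters, {np.sum(labels == -1)} noise points")
```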

In general, partitioning, hierarchical, and locality-based clustering algorithms do not scale well with the number of objects in the data set. To improve efficiency, data summarization techniques integrated with the clustering process have been proposed. Besides the above-mentioned BIRCH and CURE algorithms, examples include: active data clustering [HB97], ScalableKM [BFR98], and simple single pass k-means [FLE00]. Active data clustering utilizes principles from sequential experimental design in order to interleave data generation and data analysis. It infers from the available data not only the grouping structure of the data, but also which data are most relevant for the clustering problem. The inferred relevance of the data is then used to control the re-sampling of the data set. ScalableKM requires at most one scan of the data set. The method identifies data points that can be effectively compressed, data points that must be maintained in memory, and data points that can be discarded. The algorithm operates within the confines of a limited memory buffer. Unfortunately, the compression schemes used by ScalableKM can introduce significant overhead. The simple single pass k-means algorithm is a simplification of ScalableKM. Like ScalableKM, it also uses a data buffer of fixed size. Experiments indicate that the simple single pass k-means algorithm is several times faster than standard k-means while producing clusterings of comparable quality.

None of the above-mentioned methods is fully effective when clustering high dimensional data. Methods that rely on near or nearest neighbor information do not work well in high dimensional spaces. In high dimensional data sets, because the space is so sparsely populated, it is very unlikely that data points are significantly nearer to each other than the average distance between data points. As a result, as the dimensionality of the space increases, the difference between the distance to the nearest and the farthest neighbor of a data object goes to zero [BGRS99, HAK00].
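A small numerical experiment (our own illustration of the cited result, with arbitrary sample sizes) makes the effect concrete: for points drawn uniformly at random, the relative gap between the nearest and the farthest neighbor distance shrinks as the dimensionality grows.

```python
# Illustration of distance concentration in high dimensions: the relative
# difference between nearest and farthest neighbor distances shrinks as the
# number of dimensions d grows. Purely synthetic example.
import numpy as np

rng = np.random.default_rng(0)
n = 500  # number of data points

for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(X - query, axis=1)
    nearest, farthest = dists.min(), dists.max()
    print(f"d={d:5d}  (farthest - nearest) / nearest = "
          f"{(farthest - nearest) / nearest:.3f}")
```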

Grid-based clustering algorithms do not suffer from the nearest neighbor problem in high dimensional spaces. Examples include STING (STatistical INformation Grid) [WYM97], CLIQUE [AGGR98], DENCLUE [HK98], WaveCluster [SCZ98], and MAFIA (Merging Adaptive Finite Intervals And is more than a clique) [NGC99]. These methods divide the input space into hyper-rectangular cells, discard the low-density cells, and then combine adjacent high-density cells to form clusters. Grid-based methods are capable of discovering clusters of any shape and are also reasonably fast. However, none of these methods addresses how to efficiently cluster very large data sets that do not fit in memory. Furthermore, these methods only work well with input spaces of low to moderate dimensionality. As the dimensionality of the space increases, grid-based methods face serious problems: the number of cells grows exponentially and finding adjacent high-density cells to form clusters becomes prohibitively expensive [HK99].
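To make the growth rate explicit (an illustrative calculation, not part of the original text): with a regular grid of b bins per dimension, the number of cells is b^d, which quickly becomes unmanageable.

```python
# Number of hyper-rectangular cells in a regular grid with b bins per
# dimension is b**d; it grows exponentially with the dimensionality d.
bins_per_dim = 10
for d in (2, 5, 10, 20):
    print(f"d={d:2d}: {bins_per_dim ** d:.3e} cells")
```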


In order to address the curse of dimensionality, a number of algorithms have focused on data projections in subspaces. Examples include PROCLUS [APYP99], OptiGrid [HK99], and ORCLUS [AY00]. PROCLUS uses axis-parallel partitions to identify subspace clusters. ORCLUS uses generalized projections to identify subspace clusters. OptiGrid is an especially interesting algorithm due to its simplicity and its ability to find clusters in high dimensional spaces in the presence of noise. OptiGrid constructs a grid partitioning of the data by calculating the partitioning hyperplanes using contracting projections of the data. OptiGrid looks for hyperplanes that satisfy two requirements: 1) separating hyperplanes should cut through regions of low density relative to the surrounding regions; and 2) separating hyperplanes should place individual clusters into different partitions. The first requirement aims at preventing oversplitting, that is, a cutting plane should not split a cluster. The second requirement attempts to achieve good cluster discrimination, that is, the cutting plane should contribute to finding the individual clusters. OptiGrid recursively constructs a multidimensional grid by partitioning the data using a set of cutting hyperplanes, each of which is orthogonal to at least one projection. At each step, the generation of the set of candidate hyperplanes is controlled by two threshold parameters. The implementation of OptiGrid described in the paper uses axis-parallel partitioning hyperplanes. The authors show that the error introduced by axis-parallel partitioning decreases exponentially with the number of dimensions in the data space. This validates the use of axis-parallel projections as an effective approach for separating clusters in high dimensional spaces. OptiGrid, however, has two main shortcomings: it is sensitive to parameter choice and it does not prescribe a strategy for efficiently handling data sets that do not fit in memory.

To overcome both the scalability problems associated with large amounts of data and the difficulties posed by high dimensional input spaces, this paper introduces a new clustering algorithm called O-Cluster (Orthogonal partitioning CLUSTERing). This new clustering method combines a novel active sampling technique with an axis-parallel partitioning strategy to identify continuous areas of high density in the input space. The method operates on a limited memory buffer and requires at most a single scan through the data.

    2. The O-Cluster Algorithm

O-Cluster is a method that builds upon the contracting projection concept introduced by OptiGrid. Our algorithm makes two major contributions:

• It proposes the use of a statistical test to validate the quality of a cutting plane. Such a test proves crucial for identifying good splitting points along data projections and makes possible the automated selection of high quality separators.

• It can operate on a small buffer containing a random sample from the original data set. Active sampling ensures that partitions are provided with additional data points if more information is needed to evaluate a cutting plane. Partitions that do not have ambiguities are frozen and the data points associated with them are removed from the active buffer.


O-Cluster operates recursively. It evaluates possible splitting points for all projections in a partition, selects the best one, and splits the data into two new partitions. The algorithm proceeds by searching for good cutting planes inside the newly created partitions. Thus O-Cluster creates a hierarchical tree structure that tessellates the input space into rectangular regions. Figure 1 provides an outline of the O-Cluster algorithm.
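The following is a minimal sketch (our illustration, not the authors' code) of the kind of binary partition tree this recursive procedure produces; find_best_split is a hypothetical stand-in for the histogram-based search described in the processing stages below.

```python
# Sketch of the binary clustering tree built by recursive axis-parallel
# splitting. Each node stores the split dimension and split value; leaves
# correspond to the rectangular regions that tessellate the input space.
# find_best_split is a hypothetical stand-in for Steps 2-4 below.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Partition:
    points: np.ndarray                  # rows currently assigned to this region
    split_dim: Optional[int] = None     # dimension of the cutting plane, if any
    split_value: Optional[float] = None
    left: Optional["Partition"] = None
    right: Optional["Partition"] = None

def grow(partition: Partition, find_best_split) -> None:
    """Recursively split a partition while valid cutting planes are found."""
    split = find_best_split(partition.points)   # returns (dim, value) or None
    if split is None:
        return                                  # leaf: frozen or ambiguous
    dim, value = split
    mask = partition.points[:, dim] <= value
    partition.split_dim, partition.split_value = dim, value
    partition.left = Partition(points=partition.points[mask])
    partition.right = Partition(points=partition.points[~mask])
    grow(partition.left, find_best_split)
    grow(partition.right, find_best_split)
```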

[Figure 1 is a block diagram with the following flow: 1. Load buffer; 2. Compute histograms for active partitions; 3. Find 'best' splitting points for active partitions; 4. Flag ambiguous and 'frozen' partitions; if splitting points exist, 5. Split active partitions and return to Step 2; otherwise, if ambiguous partitions exist and unseen data exist, 6. Reload buffer and return to Step 2; otherwise, EXIT.]

Figure 1: O-Cluster algorithm block diagram.


    The main processing stages are as follows:

    1. Load data buffer: If the entire data set does not fit in the buffer, a random

    sample is used. O-Cluster assigns all points from the initial buffer to a root

    partition.

2. Compute histograms for active partitions: The goal is to determine a set of projections for the active partitions and compute histograms along these projections. Any partition that represents a leaf in the clustering hierarchy and is not explicitly marked ambiguous or frozen is considered active. The process whereby an active partition becomes ambiguous or frozen is explained in Step 4. It is essential to compute histograms that provide good resolution but also have data artifacts smoothed out. A number of studies have addressed the problem of how many bins can be supported by a given distribution [Sco79, Wan96]. Based on these studies, a reasonable, simple approach would be to make the number of bins inversely proportional to the standard deviation of the data along a given dimension and directly proportional to N^(1/3), where N is the number of points inside a partition. Alternatively, one can use a global binning strategy and coarsen the histograms as the number of points inside the partitions decreases. O-Cluster is robust with respect to different binning strategies as long as the histograms do not significantly undersmooth or oversmooth the distribution density.

3. Find 'best' splitting points for active partitions: For each histogram, O-Cluster attempts to find the best valid cutting plane, if one exists. A valid cutting plane passes through a point of low density (a valley) in the histogram. Additionally, the point of low density should be surrounded on both sides by points of high density (peaks). O-Cluster attempts to find a pair of peaks with a valley between them where the difference between the peak and the valley histogram counts is statistically significant. Statistical significance is tested using a standard χ² test:

χ² = 2(observed - expected)² / expected ≥ χ²_{α,1},

where the observed value is equal to the histogram count of the valley and the expected value is the average of the histogram counts of the valley and the lower peak. The current implementation uses a 95% confidence level (χ²_{0.05,1} = 3.843). Since multiple splitting points per partition can be found to be valid separators according to this test, O-Cluster chooses the one where the valley has the lowest histogram count as the best splitting point. Thus the cutting plane goes through the area with the lowest density. (A small illustrative sketch of this splitting-point search is given after the description of the processing stages.)

4. Flag ambiguous and frozen partitions: If no valid splitting points are found, O-Cluster checks whether the χ² test would have found a valid splitting point at a lower confidence level (e.g., 90%, with χ²_{0.1,1} = 2.706). If that is the case, the current partition can be considered ambiguous: more data points are needed to establish the quality of the splitting point. If no splitting points were found and there is no ambiguity, the partition can be marked as frozen and the records associated with it marked for deletion from the active buffer.

    5. Split active partitions: If a valid separator exists, the data points are split along

    the cutting plane and two new active partitions are created from the original

    partition. For each new partition the processing proceeds recursively from Step 2.

6. Reload buffer: This step can take place after all recursive partitioning on the current buffer has completed. If all existing partitions are marked as frozen and/or there are no more data points available, the algorithm exits. Otherwise, if some partitions are marked as ambiguous and additional unseen data records exist, O-Cluster proceeds with reloading the data buffer. The new data replace records belonging to frozen partitions. When new records are read in, only data points that fall inside ambiguous partitions are placed in the active buffer. New records falling within a frozen partition are not loaded into the buffer. If it is desirable to maintain statistics of the data points falling inside partitions (including the frozen partitions), such statistics can be continuously updated with the reading of each new record. Loading of new records continues until either: 1) the active buffer is filled again; 2) the end of the data set is reached; or 3) a reasonable number of records have been read, even if the active buffer is not full and there are more data. The reason for the last condition is that if the buffer is relatively large and there are many points marked for deletion, it may take a long time to fill the entire buffer with data from the ambiguous regions. To avoid excessive reloading under these circumstances, the buffer reloading process is terminated after reading a number of records equal to the data buffer size. Once the buffer reload is completed, the algorithm proceeds from Step 2. The algorithm requires, at most, a single pass through the entire data set.
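To make Steps 2-4 concrete, the following is a minimal sketch under our own assumptions (Scott's rule for the bin count, the lower flanking peak as the reference value, and the 3.843 and 2.706 thresholds quoted above); it is an illustration, not the authors' implementation.

```python
# Illustrative sketch of the splitting-point search in Steps 2-4 (our own
# simplification). It histograms one projection of a partition, looks for a
# valley flanked by two peaks, and applies the chi-square test quoted above.
import numpy as np

CHI2_95 = 3.843   # chi-square threshold, 1 d.o.f., 95% confidence (Step 3)
CHI2_90 = 2.706   # lower confidence level used to flag ambiguity (Step 4)

def chi2_statistic(valley_count, lower_peak_count):
    """chi^2 = 2 * (observed - expected)^2 / expected, as defined in Step 3."""
    observed = valley_count
    expected = (valley_count + lower_peak_count) / 2.0
    return 2.0 * (observed - expected) ** 2 / expected if expected > 0 else 0.0

def evaluate_projection(values):
    """Return ('split', value), ('ambiguous', None), or ('frozen', None)."""
    n = len(values)
    std = values.std()
    data_range = values.max() - values.min()
    if n < 2 or std == 0 or data_range == 0:
        return "frozen", None                       # degenerate projection
    # Step 2: bins proportional to N^(1/3) and inversely proportional to the
    # standard deviation; Scott's rule [Sco79] is one such choice.
    bin_width = 3.49 * std * n ** (-1.0 / 3.0)
    n_bins = max(2, int(np.ceil(data_range / bin_width)))
    counts, edges = np.histogram(values, bins=n_bins)

    valid, near_valid = [], False
    for i in range(1, len(counts) - 1):
        left_peak = counts[:i].max()
        right_peak = counts[i + 1:].max()
        valley = counts[i]
        if valley >= min(left_peak, right_peak):
            continue                                # not a valley between peaks
        stat = chi2_statistic(valley, min(left_peak, right_peak))
        if stat >= CHI2_95:
            valid.append((valley, (edges[i] + edges[i + 1]) / 2.0))
        elif stat >= CHI2_90:
            near_valid = True                       # would pass at 90% only

    if valid:                                       # lowest valley wins (Step 3)
        return "split", min(valid)[1]
    if near_valid:
        return "ambiguous", None                    # wait for more data (Step 6)
    return "frozen", None
```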

In addition to the major differences from OptiGrid noted at the beginning of this section, there are two other important distinctions:

• OptiGrid's choice of a valid cutting plane depends on a pair of global parameters: noise level and maximal splitting density. Those two parameters act as thresholds for identifying valid splitting points. In OptiGrid, histogram peaks are required to be above the noise level parameter while histogram valleys need to have density lower than the maximum splitting density. The maximum splitting density should be set above the noise level threshold (personal communication with OptiGrid's authors). Finding correct values for these parameters is critical for OptiGrid's performance. O-Cluster's χ² test for splitting points eliminates the need for preset thresholds: the algorithm can find valid cutting planes at any density level within a histogram. While not strictly necessary for O-Cluster's operation, it was found useful, in the course of algorithm development, to introduce a parameter called sensitivity (ρ). Analogous to OptiGrid's noise level, the role of this parameter is to suppress the creation of arbitrarily small clusters by setting a minimum count for O-Cluster's histogram peaks. The effect of ρ is illustrated in Section 4.


• While OptiGrid attempts to find good cutting planes that optimally traverse the input space, it is prone to oversplitting. By design, OptiGrid can partition simultaneously along several cutting planes. This may result in the creation of clusters (with few points) that need to be subsequently removed. Additionally, OptiGrid works with histograms that undersmooth the distribution density (personal communication with OptiGrid's authors). Undersmoothed histograms and the threshold-based mechanism of splitting point identification can lead to the creation of separators that cut through clusters. These issues may not necessarily be a serious hindrance in the OptiGrid framework, since the algorithm attempts to construct a multidimensional grid where the highly populated cells are interpreted as clusters. O-Cluster, on the other hand, attempts to create a binary clustering tree whose leaves are regions with flat or unimodal density functions. Only a single cutting plane is applied at a time and the quality of the splitting point is statistically validated.

O-Cluster functions optimally for large-scale data sets with many records and high dimensionality. It is desirable to work with a sufficiently large active buffer in order to calculate high quality histograms with good resolution. High dimensionality has been shown to significantly reduce the chance of cutting through data when using axis-parallel cutting planes [HK99]. There is no special handling for missing values: if a data record has missing values, the record does not contribute to the histogram counts along the corresponding dimensions. However, if a missing value is needed to assign the record to a partition, the record is not assigned and it is marked for deletion from the active buffer.

    3. O-Cluster Complexity

O-Cluster can use an arbitrary set of projections. Our current implementation is restricted to projections that are axis-parallel. The histogram computation step is of complexity O(N×d), where N is the number of data points in the buffer and d is the number of dimensions. The selection of the best splitting point for a single dimension is O(b), where b is the average number of histogram bins in a partition. Choosing the best splitting point over all dimensions is O(d×b). The assignment of data points to newly created partitions requires a comparison of an attribute value to the splitting point, and its complexity has an upper bound of O(N). Loading new records into the data buffer requires their insertion into the relevant partitions. The complexity associated with scoring a record depends on the depth of the binary clustering tree (s). The upper limit for filling the whole active buffer is O(N×s). The depth of the tree depends on the data set.

In general, the total complexity can be approximated as O(N×d). It is shown in Section 4 that O-Cluster scales linearly with the number of records and the number of dimensions.
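As a rough illustration of how these terms combine (our own arithmetic, with arbitrary example values), the N×d histogram term dominates the per-pass cost:

```python
# Rough operation-count estimate for one buffer pass, following the complexity
# terms listed in Section 3 (illustrative only; constant factors omitted).
def ocluster_cost_estimate(N, d, b, s):
    histogram_cost = N * d          # one bin increment per point per dimension
    split_search_cost = d * b       # scan each dimension's histogram once
    assignment_cost = N             # compare one attribute per point to the split
    reload_cost = N * s             # route each reloaded record down the tree
    return histogram_cost + split_search_cost + assignment_cost + reload_cost

# Example: 100,000 buffered points, 10 dimensions, ~50 bins, tree depth ~7.
print(ocluster_cost_estimate(100_000, 10, 50, 7))  # dominated by the N*d term
```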

    4. Empirical Results

This section illustrates the general behavior of O-Cluster and evaluates the correctness of its solutions. The first series of tests was carried out on a two-dimensional data set, DS3 [ZRL96]. This is a particularly challenging benchmark. The low number of dimensions makes the use of any axis-parallel partitioning algorithm problematic. Also, the data set consists of 100 spherical clusters that vary significantly in their size and density. The number of points per cluster is a random number in the range [0, 2000] drawn from a uniform distribution, and the variance across dimensions for each cluster is a random number in the range [0, 2], also drawn from a uniform distribution.

    4.1. O-Cluster on DS3

Figure 2 depicts the partitions found by O-Cluster on the DS3 data set. The centers of the original clusters are marked with squares while the centroids of the points assigned to each O-Cluster partition are represented by stars.

Figure 2: O-Cluster partitions on the DS3 data set. The grid depicts the splitting planes found by O-Cluster. Squares represent the original cluster centroids, stars (*) represent the centroids of the points belonging to an O-Cluster partition; recall = 71%, precision = 97%.

Although O-Cluster does not function optimally when the dimensionality is low, it produces a good set of partitions. It is noteworthy that O-Cluster finds cutting planes at different levels of density and successfully identifies nested clusters. Axis-parallel splits in low dimensions can easily lead to the creation of artifacts where cutting planes have to cut through parts of a cluster and data points are assigned to incorrect partitions. Such artifacts can either result in centroid imprecision or lead to further partitioning and the creation of spurious clusters. For example, in Figure 2 O-Cluster creates 73 partitions. Of these, 71 contain the centroid of at least one of the original clusters. The remaining 2 partitions were produced due to artifacts created by splits going through clusters.

In general, there are two potential sources of imprecision in the algorithm: 1) O-Cluster may fail to create partitions for all original clusters; and/or 2) O-Cluster may create spurious partitions that do not correspond to any of the original clusters. To measure these two effects separately, we use two metrics borrowed from the information retrieval domain: recall is defined as the percentage of the original clusters that were found and assigned to partitions; precision is defined as the percentage of the found partitions that contain at least one original cluster centroid. That is, in Figure 2 O-Cluster found 71 out of 100 original clusters (a recall of 71%), and 71 of the 73 partitions created contained at least one centroid of the original clusters (a precision of 97%).
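A small sketch of one plausible way to compute these two metrics from the output (our own reading; the bounding-box representation of a partition and the rule that each partition is credited with at most one original centroid are our assumptions, not the authors' evaluation code):

```python
# Recall: fraction of original cluster centroids credited to some partition;
# Precision: fraction of partitions containing at least one original centroid.
# A partition is represented here by its bounding box (low, high) per dimension.
import numpy as np

def contains(box, point):
    low, high = box
    return bool(np.all(point >= low) and np.all(point <= high))

def recall_and_precision(original_centroids, partition_boxes):
    credited = set()                      # partitions already matched to a cluster
    found = 0
    for c in original_centroids:
        for i, box in enumerate(partition_boxes):
            if i not in credited and contains(box, c):
                credited.add(i)
                found += 1
                break
    hit = sum(any(contains(box, c) for c in original_centroids)
              for box in partition_boxes)
    recall = found / len(original_centroids)
    precision = hit / len(partition_boxes)
    return recall, precision
```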

    4.2. The Sensitivity Parameter

The effect of creating spurious clusters due to splitting artifacts can be alleviated by using O-Cluster's sensitivity (ρ) parameter. ρ is a parameter in the [0, 1] range that is inversely related to the minimum count required to find a histogram peak. A value of 0 requires the histogram peaks to surpass the count corresponding to a global uniform level per dimension. The global uniform level is defined as the average histogram count that would have been observed if the data points in the buffer were drawn from a uniform distribution. A value of 0.5 sets the minimum histogram count for a peak to 50% of the global uniform level. A value of 1 removes the restrictions on peak histogram counts, and the splitting point identification relies solely on the χ² test. The results shown in Figure 2 were produced with ρ = 0.95.
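A sketch of how the minimum peak count could be derived from ρ, using the linear relationship that matches the three reference points given above (ρ = 0, 0.5, 1); this is our reading, not code from the paper:

```python
# Minimum histogram count required for a peak, as a function of the
# sensitivity parameter rho in [0, 1]. The global uniform level is the average
# count expected if the buffered points were spread uniformly over the bins.
# Linear in rho, matching the rho = 0, 0.5, and 1 cases described above.
def min_peak_count(rho, n_points_in_buffer, n_bins):
    global_uniform_level = n_points_in_buffer / n_bins
    return (1.0 - rho) * global_uniform_level

# Example: 100,000 buffered points, 50 bins along a dimension.
for rho in (0.0, 0.5, 0.95, 1.0):
    print(rho, min_peak_count(rho, 100_000, 50))
```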

Figure 3 illustrates the effect of changing ρ. Increasing ρ enables O-Cluster to grow the clustering hierarchy deeper and thus obtain improved recall. However, values of ρ that are too high may result in excessive splitting and thus poor precision. It should be noted that the effect of ρ is magnified by the particular characteristics of the DS3 data set. The 2D dimensionality leads to splitting artifacts that become the main source of oversplitting. Additionally, the original clusters in the DS3 data set vary significantly in their number of records, and low ρ values can filter out some of the weaker clusters. Higher dimensionality and more evenly represented clusters reduce O-Cluster's sensitivity to ρ.


Figure 3: Effect of the sensitivity parameter. The grid depicts the splitting planes found by O-Cluster. Squares represent the original cluster centroids, stars (*) represent the centroids of the points belonging to an O-Cluster partition. (a) ρ = 0, recall = 22%, precision = 100%; (b) ρ = 0.5, recall = 40%, precision = 100%; (c) ρ = 0.75, recall = 56%, precision = 100%; (d) ρ = 1, recall = 72%, precision = 84%.

    4.3. The Effect of Dimensionality

In order to illustrate the benefits of higher dimensionality, the DS3 data set was extended to 5 and 10 dimensions. ρ was set to 1 for both experiments. Figure 4 shows the 2D projection of the data set, the original cluster centroids, and the centroids of O-Cluster's partitions in the plane specified by the original two dimensions. The O-Cluster grid is not included since the cutting planes in higher dimensions could not be plotted in a meaningful way. It can be seen that O-Cluster's accuracy (both recall and precision) improves dramatically with increased dimensionality. The main reason for the remarkably good performance is that higher dimensionality allows O-Cluster to find cutting planes that do not produce splitting artifacts.


Figure 4: Effect of dimensionality. Squares represent the original cluster centroids, stars (*) represent the centroids of the points belonging to an O-Cluster partition. (a) dimensionality = 5, recall = 99%, precision = 96%; (b) dimensionality = 10, recall = 100%, precision = 100%.

    4.4. The Effect of Uniform Noise

O-Cluster shares one remarkable feature with OptiGrid: its resistance to uniform noise. To test O-Cluster's robustness to uniform noise, a synthetic data set consisting of 100,000 points was generated. It consisted of 50 spherical clusters, with variance in the range [0, 2], each represented by 2,000 points. To introduce uniform noise into the data set, a certain percentage of the original records were replaced by records drawn from a uniform distribution on each dimension. O-Cluster was tested with 25%, 50%, 75%, and 90% noise. For example, when the percentage of noise was 90%, the original clusters were represented by 10,000 points (200 on average per cluster) and the remaining 90,000 points were uniform noise. All experiments were run with ρ = 0.8. Figure 5 illustrates O-Cluster's performance under noisy conditions. O-Cluster's accuracy degrades very gracefully with an increased percentage of background noise. Higher dimensionality provides a slight advantage when handling noise.

    Figure 5: Effect of uniform noise. (a) Recall for 5 and 10 dimensions; (b) Precision for 5 and 10

    dimensions.


It should also be noted that once background noise is introduced, the centroids of the partitions produced by O-Cluster are offset from the original cluster centroids. In order to identify the original centers, it is necessary to discount the background noise from the histograms and compute centroids on the remaining points. This can be accomplished by filtering out the histogram bins that fall below a level corresponding to the average bin count for the partition.
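A minimal sketch of this bin-filtering idea (our illustration, shown for a single dimension): bins whose counts fall at or below the partition's average bin count are treated as background noise and excluded before the centroid is recomputed.

```python
# Recompute a 1-D partition centroid after discarding histogram bins whose
# counts fall at or below the average bin count for the partition (treated
# here as the background noise level). Illustrative only.
import numpy as np

def denoised_centroid(values, n_bins=50):
    counts, edges = np.histogram(values, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    keep = counts > counts.mean()          # drop bins at or below the average
    if not np.any(keep):
        return values.mean()               # fall back to the plain centroid
    return np.average(centers[keep], weights=counts[keep])
```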

    4.5. O-Cluster Scalability

The next series of tests addresses O-Cluster's scalability with increasing numbers of records and dimensions. All data sets used in the experiments consisted of 50 clusters. All 50 clusters were correctly identified in each test. When measuring scalability with an increasing number of records, the number of dimensions was set to 10. When measuring scalability with increasing dimensionality, the number of records was set to 100,000. Figure 6 shows that there is a clear linear dependency of O-Cluster's processing time on both the number of records and the number of dimensions. In general, these timing results can be improved significantly because the algorithm was implemented as a PL/SQL package in an ORACLE 9i database; there is an overhead associated with the fact that PL/SQL is an interpreted language.

    Figure 6: Scalability. (a) Scalability with number of records (10 dimensions); (b) Scalability with number

    of dimensions (100,000 records).

    4.6. Working with a Limited Buffer Size

In all tests described so far, O-Cluster had a sufficiently large buffer to accommodate the entire data set. The next set of results illustrates O-Cluster's behavior when the algorithm is required to have a small memory footprint, such that the active buffer can contain only a fraction of the entire data set. This series of tests reuses the data set described in Section 4.4 (50 clusters, 2,000 points each, 10 dimensions). For all tests, ρ was set to 0.8. Figure 7 shows the timing and recall numbers for different buffer sizes (0.5%, 0.8%, 1%, 5%, and 10% of the entire data set). Very small buffer sizes may require multiple refills. For example, the described experiment showed that when the buffer size was 0.5%, O-Cluster needed to refill it 5 times; when the buffer size was 0.8% or 1%, O-Cluster had to refill it once. For larger buffer sizes, no refills were necessary. As a result, using a 0.8% buffer proves to be slightly faster than using a 0.5% buffer. If no buffer refills were required (buffer size greater than 1%), O-Cluster followed a linear scalability pattern, as shown in the previous section. Regarding O-Cluster's accuracy, buffer sizes under 1% proved to be too small for the algorithm to find all existing clusters. For a buffer size of 0.5%, O-Cluster found 41 out of 50 clusters (82% recall), and for a buffer size of 0.8%, O-Cluster found 49 out of 50 clusters (98% recall). Larger buffer sizes allowed O-Cluster to correctly identify all original clusters. For all buffer sizes (including buffer sizes smaller than 1%) precision was 100%.

Figure 7: Buffer size. (a) Time scalability; (b) Recall.

5. Conclusions

The majority of existing clustering algorithms encounter serious scalability and/or accuracy related problems when used on data sets with a large number of records and/or dimensions. We propose a new clustering algorithm, O-Cluster, capable of efficiently and effectively clustering large high dimensional data sets. It relies on a novel active sampling approach and uses an axis-parallel partitioning scheme to identify hyper-rectangular regions of unimodal density in the input feature space. O-Cluster has good accuracy and scalability, is robust to noise, automatically detects the number of clusters in the data, and can successfully operate with limited memory resources.

    Currently we are extending O-Cluster in a number of ways, including:

• Parallel implementation. The results presented in this paper used a serial implementation of O-Cluster. Performance can be significantly improved by parallelizing the following steps of O-Cluster:

  o Buffer filling;

  o Histogram computation and splitting point determination;

  o Assigning records to partitions.

• Cluster representation through rules: especially useful for noisy cases where centroids do not characterize a cluster well.

• Probabilistic modeling and scoring with missing values: missing values can be a problem during record assignment.

• Handling categorical and mixed (categorical and numerical) data sets.

These extensions will be reported in a future paper.


    References

[AGGR98] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pages 94-105, 1998.

[APYP99] C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pages 61-72, 1999.

[AY00] C. C. Aggarwal and P. S. Yu. Finding generalized projected clusters in high dimensional spaces. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'00), pages 70-81, 2000.

[BFR98] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 8-15, 1998.

[BGRS99] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbor meaningful? In Proc. 7th Int. Conf. on Database Theory (ICDT'99), pages 217-235, 1999.

[EKSX96] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. 1996 Int. Conf. Knowledge Discovery and Data Mining (KDD'96), pages 226-231, 1996.

[FLE00] F. Farnstrom, J. Lewis, and C. Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2:51-57, 2000.

[GRS98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pages 73-84, 1998.

[HAK00] A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high dimensional spaces? In Proc. 26th Int. Conf. on Very Large Data Bases (VLDB'00), pages 506-515, 2000.

[HB97] T. Hofmann and J. Buhmann. Active data clustering. In Advances in Neural Information Processing Systems (NIPS'97), pages 528-534, 1997.

[HK98] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 58-65, 1998.

[HK99] A. Hinneburg and D. A. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In Proc. 25th Int. Conf. on Very Large Data Bases (VLDB'99), pages 506-517, 1999.

[KR90] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, 1990.

[Mac67] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Math. Statist. Prob., 1:281-297, 1967.

[NGC99] H. Nagesh, S. Goil, and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report 9906-010, Northwestern University, June 1999.

[NH94] R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In Proc. 1994 Int. Conf. on Very Large Data Bases (VLDB'94), pages 144-155, 1994.

[Sco79] D. W. Scott. Multivariate density estimation. New York: John Wiley & Sons, 1979.

[SCZ98] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proc. 1998 Int. Conf. on Very Large Data Bases (VLDB'98), pages 428-439, 1998.

[Wan96] M. P. Wand. Data-based choice of histogram bin width. The American Statistician, 51:59-64, 1996.

[WYM97] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proc. 1997 Int. Conf. on Very Large Data Bases (VLDB'97), pages 186-195, 1997.

[ZRL96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'96), pages 103-114, 1996.