o Cluster Algorithm 129983
7/31/2019
Given a data set with n objects and k, the number of desired clusters, partitioning
algorithms partition the objects into k clusters. The clusters are formed so as to
optimize an objective criterion such as distance. Each object is assigned to the closest
cluster. Clusters are typically represented either by the mean of the objects assigned to
the cluster (k-means [Mac67]) or by one representative object of the cluster (k-medoid
[KR90]). CLARANS (Clustering Large Applications based upon RANdomized Search)
[NH94] is a partitioning clustering algorithm developed for large data sets, which uses a
randomized and bounded search strategy to improve the scalability of the k-medoid
approach. CLARANS enables the detection of outliers, and its computational complexity
is about O(n²). CLARANS performance can be improved by exploiting spatial data
structures such as R*-trees.
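As a minimal illustration of the partitioning idea, the k-means assignment/update loop can be sketched as follows (an illustrative toy, not CLARANS or the k-medoid variants discussed above; all names are ours):

```python
import random

def kmeans(points, k, iters=20):
    """Minimal k-means sketch: repeatedly assign each object to the
    closest cluster, then represent each cluster by the mean of the
    objects assigned to it."""
    centroids = random.sample(points, k)          # random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Objective criterion: squared Euclidean distance to each centroid.
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: a cluster is represented by its mean (k-means);
        # k-medoid would instead pick a representative object.
        centroids = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters
```

The k-medoid variant differs only in the update step, which restricts each cluster representative to be one of the actual objects.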
Hierarchical clustering algorithms work by grouping data objects into a hierarchy (e.g., a
tree) of clusters. The hierarchy can be formed top-down (divisive hierarchical methods)
or bottom-up (agglomerative hierarchical methods). Hierarchical methods rely on a
distance function to measure the similarity between clusters. These methods do not scale
well with the number of data objects. Their computational complexity is usually O(n²).
Some newer methods such as BIRCH [ZRL96] and CURE [GRS98] attempt to address
the scalability problem and improve the quality of clustering results for hierarchical
methods. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an
methods. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an
efficient divisive hierarchical algorithm. It has O(n) computational complexity, can work
with a limited amount of memory, and has efficient I/O. It uses a special data structure, the
CF-tree (Cluster Feature tree), for storing summary information about subclusters of objects.
The CF-tree structure can be seen as a multilevel compression of the data that attempts to
preserve the clustering structure inherent in the data set. Because of the similarity
measure it uses to determine the data items to be compressed, BIRCH only performs well
on data sets with spherical clusters. CURE (Clustering Using REpresentatives) is an
O(n²) algorithm that produces high-quality clusters in the presence of outliers, and can
identify clusters of complex shapes and different sizes. It employs a hierarchical
clustering approach that uses a fixed number of representative points to define a cluster
instead of a single centroid or object. CURE handles large data sets through a
combination of random sampling and partitioning. Since CURE uses only a random
sample of the data set, it manages to achieve good scalability for large data sets. CURE
reports better times than BIRCH on the same benchmark data.
Locality-based clustering algorithms group neighboring data objects into clusters based
on local conditions. These algorithms allow clustering to be performed in one scan of the
data set. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
[EKSX96] is a typical representative of this group of algorithms. It regards clusters as
dense regions of objects in the input space that are separated by regions of low density.
DBSCAN's basic idea is that the density of points in a radius around each point in a
cluster has to be above a certain threshold. It grows a cluster as long as, for each data
point within this cluster, a neighborhood of a given radius contains at least a minimum
number of points. DBSCAN has computational complexity O(n²). If a spatial index is
used, the computational complexity is O(n log n). The clustering generated by DBSCAN
is very sensitive to parameter choice. OPTICS (Ordering Points To Identify Clustering
Structures) is another locality-based clustering algorithm. It computes an augmented
cluster ordering for automatic and interactive clustering analysis. OPTICS has the same
computational complexity as DBSCAN.
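The density criterion described above can be sketched directly (a toy, index-free implementation with the O(n²) neighborhood queries mentioned in the text; `eps` and `min_pts` stand for the radius and the minimum number of points):

```python
def dbscan(points, eps, min_pts):
    """Toy DBSCAN sketch: a point whose eps-neighborhood contains at
    least min_pts points seeds a cluster; the cluster then grows by
    absorbing the neighborhoods of every core point reached.
    Leftover points are labeled -1 (noise/outliers)."""
    def neighbors(i):
        # Brute-force O(n) range query; a spatial index would give O(log n).
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)      # None = unvisited, -1 = noise
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:       # density below threshold: noise (for now)
            labels[i] = -1
            continue
        cluster += 1                   # start a new dense region
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # border point reclaimed from noise
            elif labels[j] is None:
                labels[j] = cluster
                nj = neighbors(j)
                if len(nj) >= min_pts: # j is itself a core point: keep growing
                    queue.extend(nj)
    return labels
```

On two dense groups plus an isolated point, the isolated point receives the noise label while each group forms its own cluster.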
In general, partitioning, hierarchical, and locality-based clustering algorithms do not scale
well with the number of objects in the data set. To improve efficiency, data
summarization techniques integrated with the clustering process have been proposed.
Besides the above-mentioned BIRCH and CURE algorithms, examples include: active
data clustering [HB97], ScalableKM [BFR98], and simple single pass k-means [FLE00].
Active data clustering utilizes principles from sequential experimental design in order to
interleave data generation and data analysis. It infers from the available data not only the
grouping structure in the data, but also which data are most relevant for the clustering
problem. The inferred relevance of the data is then used to control the re-sampling of the
data set. ScalableKM requires at most one scan of the data set. The method identifies data
points that can be effectively compressed, data points that must be maintained in
memory, and data points that can be discarded. The algorithm operates within the
confines of a limited memory buffer. Unfortunately, the compression schemes used by
ScalableKM can introduce significant overhead. The simple single pass k-means
algorithm is a simplification of ScalableKM. Like ScalableKM, it also uses a data buffer
of fixed size. Experiments indicate that the simple single pass k-means algorithm is
several times faster than standard k-means while producing clusterings of comparable
quality.
None of the above-mentioned methods are fully effective when clustering high
dimensional data. Methods that rely on near or nearest neighbor information do not work
well in high dimensional spaces. In high dimensional data sets, because the space is so
sparsely populated, it is very unlikely that data points are significantly nearer to each
other than the average distance between data points. As a result, as the dimensionality of
the space increases, the difference between the distance to the nearest and the farthest
neighbor of a data object goes to zero [BGRS99, HAK00].
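This concentration effect is easy to reproduce empirically (an illustrative experiment of our own, not taken from [BGRS99, HAK00]; the function name and sample sizes are ours):

```python
import math
import random

def distance_contrast(dim, n=200, seed=1):
    """Draw n random points in the dim-dimensional unit hypercube and
    return the relative gap between the farthest and nearest of them
    from a random query point: (d_max - d_min) / d_min."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = sorted(math.dist(query, [rng.random() for _ in range(dim)])
                   for _ in range(n))
    return (dists[-1] - dists[0]) / dists[0]
```

For a fixed sample size, the contrast is typically well above 1 in two dimensions and falls well below 1 by a few hundred dimensions: nearest and farthest neighbors become nearly indistinguishable.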
Grid-based clustering algorithms do not suffer from the nearest neighbor problem in high
dimensional spaces. Examples include STING (STatistical INformation Grid)
[WYM97], CLIQUE [AGGR98], DENCLUE [HK98], WaveCluster [SCZ98], and
MAFIA (Merging Adaptive Finite Intervals And is more than a clique) [NGC99]. These
methods divide the input space into hyper-rectangular cells, discard the low-density cells,
and then combine adjacent high-density cells to form clusters. Grid-based methods are
capable of discovering clusters of any shape and are also reasonably fast. However, none
of these methods address how to efficiently cluster very large data sets that do not fit in
memory. Furthermore, these methods only work well with input spaces of low to
moderate dimensionality. As the dimensionality of the space increases, grid-based
methods face serious problems: the number of cells grows exponentially, and finding
adjacent high-density cells to form clusters becomes prohibitively expensive
[HK99].
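The cell explosion is simple arithmetic (our illustrative numbers, not taken from the cited papers):

```python
def grid_cells(bins_per_dim, dims):
    """A regular grid over a d-dimensional space has bins_per_dim ** dims
    hyper-rectangular cells, i.e., exponential growth in dimensionality."""
    return bins_per_dim ** dims

def adjacent_cells(dims):
    """An interior cell has 3 ** dims - 1 adjacent cells, so even the
    'combine adjacent high-density cells' step grows exponentially."""
    return 3 ** dims - 1
```

With only 10 bins per dimension, a 2-dimensional grid has 100 cells, while a 20-dimensional one has 10^20 cells, vastly more cells than data points, almost all of them empty.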
In order to address the curse of dimensionality, several algorithms have focused on
data projections in subspaces. Examples include PROCLUS [APYP99], OptiGrid
[HK99], and ORCLUS [AY00]. PROCLUS uses axis-parallel partitions to identify
subspace clusters. ORCLUS uses generalized projections to identify subspace clusters.
OptiGrid is an especially interesting algorithm due to its simplicity and its ability to find
clusters in high dimensional spaces in the presence of noise. OptiGrid constructs a grid-
partitioning of the data by calculating the partitioning hyperplanes using contracting
projections of the data. OptiGrid looks for hyperplanes that satisfy two requirements:
1) separating hyperplanes should cut through regions of low density relative to the
surrounding regions; and 2) separating hyperplanes should place individual clusters into
different partitions. The first requirement aims at preventing oversplitting, that is, a
cutting plane should not split a cluster. The second requirement attempts to achieve good
cluster discrimination, that is, the cutting plane should contribute to finding the individual
clusters. OptiGrid recursively constructs a multidimensional grid by partitioning the data
using a set of cutting hyperplanes, each of which is orthogonal to at least one projection.
At each step, the generation of the set of candidate hyperplanes is controlled by two
threshold parameters. The implementation of OptiGrid described in the paper used axis-
parallel partitioning hyperplanes. The authors show that the error introduced by axis-
parallel partitioning decreases exponentially with the number of dimensions in the data
space. This validates the use of axis-parallel projections as an effective approach for
separating clusters in high dimensional spaces. OptiGrid, however, has two main
shortcomings. It is sensitive to parameter choice and it does not prescribe a strategy to
efficiently handle data sets that do not fit in memory.
To overcome both the scalability problems associated with large amounts of data and
high dimensional data input space, this paper introduces a new clustering algorithm
called O-Cluster (Orthogonal partitioning CLUSTERing). This new clustering method
combines a novel active sampling technique with an axis-parallel partitioning strategy to
identify continuous areas of high density in the input space. The method operates on a
limited memory buffer and requires at most a single scan through the data.
2. The O-Cluster Algorithm
O-Cluster is a method that builds upon the contracting projection concept introduced by
OptiGrid. Our algorithm makes two major contributions:
- It proposes the use of a statistical test to validate the quality of a cutting plane.
Such a test proves crucial for identifying good splitting points along data
projections and makes possible the automated selection of high quality separators.
- It can operate on a small buffer containing a random sample from the original data
set. Active sampling ensures that partitions are provided with additional data
points if more information is needed to evaluate a cutting plane. Partitions that do
not have ambiguities are frozen, and the data points associated with them are
removed from the active buffer.
O-Cluster operates recursively. It evaluates possible splitting points for all projections in
a partition, selects the best one, and splits the data into two new partitions. The
algorithm proceeds by searching for good cutting planes inside the newly created
partitions. Thus O-Cluster creates a hierarchical tree structure that tessellates the input
space into rectangular regions. Figure 1 provides an outline of the O-Cluster algorithm.
[Figure 1 block diagram, in text form: 1. Load buffer -> 2. Compute histograms for
active partitions -> 3. Find 'best' splitting points for active partitions -> 4. Flag
ambiguous and 'frozen' partitions -> if splitting points exist, 5. Split active partitions
and return to step 2; otherwise, if ambiguous partitions exist and unseen data exist,
6. Reload buffer and return to step 2; otherwise EXIT.]
Figure 1: O-Cluster algorithm block diagram.
The main processing stages are as follows:
1. Load data buffer: If the entire data set does not fit in the buffer, a random
sample is used. O-Cluster assigns all points from the initial buffer to a root
partition.
2. Compute histograms for active partitions: The goal is to determine a set of
projections for the active partitions and compute histograms along these
projections. Any partition that represents a leaf in the clustering hierarchy and is
not explicitly marked 'ambiguous' or 'frozen' is considered active. The process
whereby an active partition becomes 'ambiguous' or 'frozen' is explained in Step 4.
It is essential to compute histograms that provide good resolution but also that
have data artifacts smoothed out. A number of studies have addressed the problem
of how many bins can be supported by a given distribution [Sco79, Wan96]. Based
on these studies, a reasonable, simple approach would be to make the number of
bins inversely proportional to the standard deviation of the data along a given
dimension and directly proportional to N^(1/3), where N is the number of points
inside a partition. Alternatively, one can use a global binning strategy and coarsen
the histograms as the number of points inside the partitions decreases. O-Cluster
is robust with respect to different binning strategies as long as the histograms do
not significantly undersmooth or oversmooth the distribution density.
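One concrete reading of this binning heuristic, in the spirit of Scott's rule [Sco79] (a sketch; the constant 3.49 is Scott's, not something the text prescribes, and the function is our illustration rather than the paper's implementation):

```python
def bin_count(values):
    """Pick a bin count proportional to N**(1/3) and inversely
    proportional to the standard deviation, following Scott's rule:
    bin width h = 3.49 * sigma * N**(-1/3)."""
    n = len(values)
    mean = sum(values) / n
    sigma = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if sigma == 0:
        return 1                         # degenerate (constant) dimension
    h = 3.49 * sigma * n ** (-1 / 3)     # bin width shrinks as N grows
    return max(1, round((max(values) - min(values)) / h))
```

For evenly spread data, octupling the number of points doubles N^(1/3) and hence roughly doubles the bin count, giving finer histograms for better-populated partitions.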
3. Find 'best' splitting points for active partitions: For each histogram, O-Cluster
attempts to find the 'best' valid cutting plane, if any exists. A valid cutting plane
passes through a point of low density (a valley) in the histogram. Additionally, the
point of low density should be surrounded on both sides by points of high density
(peaks). O-Cluster attempts to find a pair of peaks with a valley between them
where the difference between the peak and valley histogram counts is
statistically significant. Statistical significance is tested using a standard χ² test:

    χ² = 2 (observed − expected)² / expected ≥ χ²(α, 1),

where the observed value is equal to the histogram count of the valley and the
expected value is the average of the histogram counts of the valley and the lower
peak. The current implementation uses a 95% confidence level (χ²(0.05, 1) = 3.843).
Since multiple splitting points can be found to be valid separators per partition
according to this test, O-Cluster chooses the one where the valley has the lowest
histogram count as the 'best' splitting point. Thus the cutting plane passes
through the area with the lowest density.
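The test above can be written down directly (a sketch; the function and argument names are ours, the critical value 3.843 is the one quoted in the text):

```python
CHI2_95_DF1 = 3.843   # critical value quoted in the text (alpha = 0.05, 1 d.o.f.)

def valid_splitting_point(left_peak, valley, right_peak, critical=CHI2_95_DF1):
    """Test whether the valley count differs significantly from the
    LOWER of the two surrounding peak counts: observed is the valley
    count, expected is the average of the valley and the lower peak."""
    observed = valley
    expected = (valley + min(left_peak, right_peak)) / 2.0
    if expected == 0:
        return False                     # empty region: nothing to test
    chi2 = 2.0 * (observed - expected) ** 2 / expected
    return chi2 >= critical

# A deep valley between two strong peaks is accepted; a shallow dip is not:
# valid_splitting_point(200, 30, 180)  -> True   (chi2 ~ 107)
# valid_splitting_point(50, 45, 60)    -> False  (chi2 ~ 0.26)
```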
4. Flag ambiguous and 'frozen' partitions: If no valid splitting points are found,
O-Cluster checks whether the χ² test would have found a valid splitting point at a
lower confidence level (e.g., 90%, with χ²(0.10, 1) = 2.706). If that is the case, the
current partition can be considered ambiguous. More data points are needed to
establish the quality of the splitting point. If no splitting points were found and
there is no ambiguity, the partition can be marked as 'frozen' and the records
associated with it marked for deletion from the active buffer.
5. Split active partitions: If a valid separator exists, the data points are split along
the cutting plane and two new active partitions are created from the original
partition. For each new partition the processing proceeds recursively from Step 2.
6. Reload buffer: This step can take place after all recursive partitioning on the
current buffer has completed. If all existing partitions are marked as 'frozen'
and/or there are no more data points available, the algorithm exits. Otherwise, if
some partitions are marked as ambiguous and additional unseen data records
exist, O-Cluster proceeds with reloading the data buffer. The new data replace
records belonging to 'frozen' partitions. When new records are read in, only data
points that fall inside ambiguous partitions are placed in the active buffer. New
records falling within a 'frozen' partition are not loaded into the buffer. If it is
desirable to maintain statistics of the data points falling inside partitions
(including the 'frozen' partitions), such statistics can be continuously updated
as each new record is read. Loading of new records continues until either: 1) the
active buffer is filled again; 2) the end of the data set is reached; or 3) a reasonable
number of records have been read, even if the active buffer is not full and there
are more data. The reason for the last condition is that if the buffer is relatively
large and there are many points marked for deletion, it may take a long time to
fill the entire buffer with data from the ambiguous regions. To avoid excessive
reloading under these circumstances, the buffer reloading process is terminated
after reading through a number of records equal to the data buffer size. Once the
buffer reload is completed, the algorithm proceeds from Step 2. The
algorithm requires, at most, a single pass through the entire data set.
In addition to the major differences from OptiGrid noted in the beginning of this
section, there are two other important distinctions:
- OptiGrid's choice of a valid cutting plane depends on a pair of global parameters:
noise level and maximal splitting density. These two parameters act as thresholds
for identifying valid splitting points. In OptiGrid, histogram peaks are required to
be above the noise level parameter, while histogram valleys need to have density
lower than the maximum splitting density. The maximum splitting density should
be set above the noise level threshold (personal communication with OptiGrid's
authors). Finding correct values for these parameters is critical for OptiGrid's
performance. O-Cluster's χ² test for splitting points eliminates the need for preset
thresholds: the algorithm can find valid cutting planes at any density level within
a histogram. While not strictly necessary for O-Cluster's operation, it was found
useful, in the course of algorithm evolution, to introduce a parameter called
sensitivity (ρ). Analogous to OptiGrid's noise level, the role of this parameter is to
suppress the creation of arbitrarily small clusters by setting a minimum count for
O-Cluster's histogram peaks. The effect of ρ is illustrated in Section 4.
- While OptiGrid attempts to find good cutting planes that optimally traverse the
input space, it is prone to oversplitting. By design, OptiGrid can partition
simultaneously along several cutting planes. This may result in the creation of
clusters (with few points) that need to be subsequently removed. Additionally,
OptiGrid works with histograms that undersmooth the distribution density
(personal communication with OptiGrid's authors). Undersmoothed histograms
and the threshold-based mechanism of splitting point identification can lead to the
creation of separators that cut through clusters. These issues are not necessarily a
serious hindrance in OptiGrid's framework, since the algorithm attempts to
construct a multidimensional grid where the highly populated cells are interpreted
as clusters. O-Cluster, on the other hand, attempts to create a binary clustering
tree whose leaves are regions with flat or unimodal density functions. Only a
single cutting plane is applied at a time, and the quality of the splitting point is
statistically validated.
O-Cluster functions optimally for large-scale data sets with many records and high
dimensionality. It is desirable to work with a sufficiently large active buffer in order to
calculate high quality histograms with good resolution. High dimensionality has been
shown to significantly reduce the chance of cutting through data when using axis-parallel
cutting planes [HK99]. There is no special handling for missing values: if a data record
has missing values, this record does not contribute to the histogram counts along the
corresponding dimensions. However, if a missing value is needed to assign the record to
a partition, the record is not assigned and is marked for deletion from the active buffer.
3. O-Cluster Complexity
O-Cluster can use an arbitrary set of projections. Our current implementation is restricted
to projections that are axis-parallel. The histogram computation step has complexity
O(N×d), where N is the number of data points in the buffer and d is the number of
dimensions. The selection of the best splitting point for a single dimension is O(b), where
b is the average number of histogram bins in a partition. Choosing the best splitting point
over all dimensions is O(d×b). The assignment of data points to newly created partitions
requires a comparison of an attribute value to the splitting point, and its complexity has
an upper bound of O(N). Loading new records into the data buffer requires their insertion
into the relevant partitions. The complexity associated with scoring a record depends
on the depth of the binary clustering tree (s). The upper limit for filling the whole active
buffer is O(N×s). The depth of the tree depends on the data set.
In general, the total complexity can be approximated as O(N×d). It is shown in Section 4
that O-Cluster scales linearly with the number of records and the number of dimensions.
4. Empirical Results
This section illustrates the general behavior of O-Cluster and evaluates the correctness of
its solutions. The first series of tests was carried out on a two-dimensional data set, DS3
[ZRL96]. This is a particularly challenging benchmark. The low number of dimensions
makes the use of any axis-parallel partitioning algorithm problematic. Also, the data set
consists of 100 spherical clusters that vary significantly in their size and density. The
number of points per cluster is a random number in the range [0, 2000] drawn from a
uniform distribution and the variance across dimensions for each cluster is a random
number in the range [0, 2], also drawn from a uniform distribution.
4.1. O-Cluster on DS3
Figure 2 depicts the partitions found by O-Cluster on the DS3 data set. The centers of the
original clusters are marked with squares, while the centroids of the points assigned to
each O-Cluster partition are represented by stars.
Figure 2: O-Cluster partitions on the DS3 data set. The grid depicts the splitting planes found by O-
Cluster. Squares represent the original cluster centroids, stars (*) represent the centroids of the points
belonging to an O-Cluster partition; recall = 71%, precision = 97%.
Although O-Cluster does not function optimally when the dimensionality is low, it
produces a good set of partitions. It is noteworthy that O-Cluster finds cutting planes at
different levels of density and successfully identifies nested clusters. Axis-parallel splits
in low dimensions can easily lead to the creation of artifacts where cutting planes have to
cut through parts of a cluster and data points are assigned to incorrect partitions. Such
artifacts can either result in centroid imprecision or lead to further partitioning and
creation of spurious clusters. For example, in Figure 2 O-Cluster creates 73 partitions. Of
these, 71 contain the centroids of at least one of the original clusters. The remaining 2
partitions were produced due to artifacts created by splits going through clusters.
In general, there are two potential sources of imprecision in the algorithm: 1) O-Cluster
may fail to create partitions for all original clusters; and/or 2) O-Cluster may create
spurious partitions that do not correspond to any of the original clusters. To measure
these two effects separately, we use two metrics borrowed from the information retrieval
domain: recall is defined as the percentage of the original clusters that were found and
assigned to partitions; precision is defined as the percentage of the found partitions that
contain at least one original cluster centroid. That is, in Figure 2 O-Cluster found 71 out
of 100 original clusters (a recall of 71%), and 71 out of the 73 partitions created
contained at least one centroid of the original clusters (a precision of 97%).
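Under these definitions, the Figure 2 numbers follow directly (worked arithmetic; the function name is ours):

```python
def recall_precision(clusters_found, good_partitions, total_clusters, total_partitions):
    """Recall: fraction of the original clusters captured by a partition.
    Precision: fraction of the created partitions that contain at least
    one original cluster centroid."""
    return clusters_found / total_clusters, good_partitions / total_partitions

# Figure 2: 71 of 100 original clusters found, and 71 of the 73 created
# partitions contain an original centroid.
r, p = recall_precision(71, 71, 100, 73)   # r = 0.71, p rounds to 0.97
```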
4.2. The Sensitivity Parameter
The effect of creating spurious clusters due to splitting artifacts can be alleviated by using
O-Cluster's sensitivity parameter (ρ). ρ is a parameter in the [0, 1] range that is inversely
proportional to the minimum count required to find a histogram peak. A value of 0
requires the histogram peaks to surpass the count corresponding to a global uniform level
per dimension. The global uniform level is defined as the average histogram count that
would have been observed if the data points in the buffer were drawn from a uniform
distribution. A value of 0.5 sets the minimum histogram count for a peak to 50% of the
global uniform level. A value of 1 removes the restrictions on peak histogram counts, and
the splitting point identification relies solely on the χ² test. The results shown in Figure 2
were produced with ρ = 0.95.
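One plausible reading of this mapping from ρ to a peak threshold, as a sketch (the function and argument names are ours):

```python
def min_peak_count(rho, n_points, n_bins):
    """Map the sensitivity rho in [0, 1] to a minimum histogram-peak
    count. The global uniform level is the average bin count expected
    if the buffered points were spread uniformly over the bins; a peak
    must exceed (1 - rho) times that level. rho = 0 demands the full
    uniform level, rho = 1 lifts the restriction entirely."""
    uniform_level = n_points / n_bins
    return (1.0 - rho) * uniform_level

# 10,000 buffered points over 50 bins give a uniform level of 200:
# rho = 0   -> threshold 200 (strictest)
# rho = 0.5 -> threshold 100 (50% of the uniform level)
# rho = 1   -> threshold 0   (the chi-square test alone decides)
```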
Figure 3 illustrates the effect of changing ρ. Increasing ρ enables O-Cluster to grow the
clustering hierarchy deeper and thus obtain improved recall. However, values of ρ that
are too high may result in excessive splitting and thus poor precision. It should be noted
that the effect of ρ is magnified by the particular characteristics of the DS3 data set. The
2D dimensionality leads to splitting artifacts that become the main reason for
oversplitting. Additionally, the original clusters in the DS3 data set vary significantly in
their number of records, and low ρ values can filter out some of the weaker clusters.
Higher dimensionality and more evenly represented clusters reduce O-Cluster's
sensitivity to ρ.
Figure 3: Effect of the sensitivity parameter. The grid depicts the splitting planes found by O-Cluster.
Squares represent the original cluster centroids, stars (*) represent the centroids of the points belonging
to an O-Cluster partition. (a) ρ = 0, recall = 22%, precision = 100%; (b) ρ = 0.5, recall = 40%,
precision = 100%; (c) ρ = 0.75, recall = 56%, precision = 100%; (d) ρ = 1, recall = 72%, precision = 84%.
4.3. The Effect of Dimensionality
In order to illustrate the benefits of higher dimensionality, the DS3 data set was extended
to 5 and 10 dimensions. ρ was set to 1 for both experiments. Figure 4 shows the 2D
projection of the data set, the original cluster centroids, and the centroids of O-Cluster's
partitions in the plane specified by the original two dimensions. The O-Cluster grid is
not included, since the cutting planes in higher dimensions could not be plotted in a
meaningful way. It can be seen that O-Cluster's accuracy (both recall and precision)
improves dramatically with increased dimensionality. The main reason for the
remarkably good performance is that higher dimensionality allows O-Cluster to find
cutting planes that do not produce splitting artifacts.
Figure 4: Effect of dimensionality. Squares represent the original cluster centroids, stars (*) represent
the centroids of the points belonging to an O-Cluster partition. (a) dimensionality = 5, recall = 99%,
precision = 96%; (b) dimensionality = 10, recall = 100%, precision = 100%.
4.4. The Effect of Uniform Noise
O-Cluster shares one remarkable feature with OptiGrid: its resistance to uniform noise.
To test O-Cluster's robustness to uniform noise, a synthetic data set consisting of 100,000
points was generated. It consisted of 50 spherical clusters, with variance in the range [0,
2], each represented by 2,000 points. To introduce uniform noise to the data set, a certain
percentage of the original records were replaced by records drawn from a uniform
distribution on each dimension. O-Cluster was tested with 25%, 50%, 75%, and 90%
noise. For example, when the percentage of noise was 90%, the original clusters were
represented by 10,000 points (200 on average per cluster) and the remaining 90,000
points were uniform noise. All experiments were run with ρ = 0.8. Figure 5 illustrates
O-Cluster's performance under noisy conditions. O-Cluster's accuracy degrades very
gracefully with an increased percentage of background noise. Higher dimensionality
provides a slight advantage when handling noise.
Figure 5: Effect of uniform noise. (a) Recall for 5 and 10 dimensions; (b) Precision for 5 and 10
dimensions.
It should also be noted that once background noise is introduced, the centroids of the
partitions produced by O-Cluster are offset from the original cluster centroids. In order to
identify the original centers, it is necessary to discount the background noise from the
histograms and compute centroids on the remaining points. This can be accomplished by
filtering out the histogram bins that fall below a level corresponding to the average
bin count for the partition.
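A one-dimensional sketch of this filtering step (our illustration; the helper name and the histogram values below are hypothetical):

```python
def denoised_centroid(bin_centers, bin_counts):
    """Drop histogram bins whose count falls below the partition's
    average bin count (treated as the background-noise level), then
    compute the centroid from the surviving bins only."""
    avg = sum(bin_counts) / len(bin_counts)
    kept = [(c, n) for c, n in zip(bin_centers, bin_counts) if n >= avg]
    total = sum(n for _, n in kept)
    return sum(c * n for c, n in kept) / total
```

With one-sided noise, the plain weighted centroid is pulled toward the noisy region, while the filtered centroid stays on the cluster's peak.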
4.5. O-Cluster Scalability
The next series of tests addresses O-Cluster's scalability with increasing numbers of
records and dimensions. All data sets used in the experiments consisted of 50 clusters. All
50 clusters were correctly identified in each test. When measuring scalability with an
increasing number of records, the number of dimensions was set to 10. When measuring
scalability with increasing dimensionality, the number of records was set to 100,000.
Figure 6 shows a clear linear dependency of O-Cluster's processing time on both the
number of records and the number of dimensions. In general, these timing results could
be improved significantly, because the algorithm was implemented as a PL/SQL package
in an ORACLE 9i database and there is overhead associated with the fact that PL/SQL
is an interpreted language.
Figure 6: Scalability. (a) Scalability with number of records (10 dimensions); (b) Scalability with number
of dimensions (100,000 records).
4.6. Working with a Limited Buffer Size
In all tests described so far, O-Cluster had a sufficiently large buffer to accommodate the
entire data set. The next set of results illustrates O-Cluster's behavior when the algorithm
is required to have a small memory footprint, such that the active buffer can contain only
a fraction of the entire data set. This series of tests reuses the data set described in Section
4.4 (50 clusters, 2,000 points each, 10 dimensions). For all tests, ρ was set to 0.8. Figure 7
shows the timing and recall numbers for different buffer sizes (0.5%, 0.8%, 1%, 5%, and
10% of the entire data set). Very small buffer sizes may require multiple refills. For
example, the described experiment showed that when the buffer size was 0.5%, O-Cluster
needed to refill it 5 times; when the buffer size was 0.8% or 1%, O-Cluster had to refill it
once. For larger buffer sizes, no refills were necessary. As a result, using a 0.8% buffer
proves to be slightly faster than using a 0.5% buffer. If no buffer refills were required
(buffer size greater than 1%), O-Cluster followed a linear scalability pattern, as shown in
the previous section. Regarding O-Cluster's accuracy, buffer sizes under 1% proved to be
too small for the algorithm to find all existing clusters. For a buffer size of 0.5%, O-Cluster
found 41 out of 50 clusters (82% recall), and for a buffer size of 0.8%, O-Cluster found 49
out of 50 clusters (98% recall). Larger buffer sizes allowed O-Cluster to correctly identify
all original clusters. For all buffer sizes (including buffer sizes smaller than 1%) precision
was 100%.
Figure 7: Buffer size. (a) Time scalability; (b) Recall.
5. Conclusions
The majority of existing clustering algorithms encounter serious scalability and/or
accuracy related problems when used on data sets with a large number of records and/or
dimensions. We propose a new clustering algorithm, O-Cluster, capable of efficiently and
effectively clustering large high dimensional data sets. It relies on a novel active
sampling approach and uses an axis-parallel partitioning scheme to identify hyper-
rectangular regions of unimodal density in the input feature space. O-Cluster has good
accuracy and scalability, is robust to noise, automatically detects the number of clusters
in the data, and can successfully operate with limited memory resources.
Currently we are extending O-Cluster in a number of ways, including:
- Parallel implementation. The results presented in this paper used a serial
implementation of O-Cluster. Performance can be significantly improved by
parallelizing the following steps of O-Cluster:
  o Buffer filling;
  o Histogram computation and splitting point determination;
  o Assigning records to partitions.
- Cluster representation through rules: especially useful for noisy cases, when
centroids do not characterize a cluster well.
- Probabilistic modeling and scoring with missing values: missing values can be a
problem during record assignment.
- Handling categorical and mixed (categorical and numerical) data sets.
These extensions will be reported in a future paper.
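One of the steps listed above, histogram computation, is naturally parallel across dimensions. The sketch below is illustrative only (not the authors' parallel implementation) and uses a Python thread pool to compute per-dimension histograms concurrently; NumPy does the per-axis counting, so the threads can overlap the numerical work.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_histograms(points, bins=20, workers=4):
    """Compute one histogram per dimension concurrently -- one illustrative
    way to parallelize O-Cluster's histogram-computation step."""
    def one_dim(d):
        return np.histogram(points[:, d], bins=bins)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves dimension order in the returned list
        return list(pool.map(one_dim, range(points.shape[1])))
```

The same decomposition applies to splitting-point determination, since candidate splits in different dimensions are evaluated independently.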
References

[AGGR98] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic
subspace clustering of high dimensional data for data mining applications.
In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data
(SIGMOD'98), pages 94-105, 1998.
[APYP99] C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast
algorithms for projected clustering. In Proc. 1999 ACM-SIGMOD Int. Conf.
Management of Data (SIGMOD'99), pages 61-72, 1999.
[AY00] C. C. Aggarwal and P. S. Yu. Finding generalized projected clusters in high
dimensional spaces. In Proc. 2000 ACM-SIGMOD Int. Conf. Management
of Data (SIGMOD'00), pages 70-81, 2000.
[BFR98] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large
databases. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining
(KDD'98), pages 8-15, 1998.
[BGRS99] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest
neighbor meaningful? In Proc. 7th Int. Conf. on Database Theory
(ICDT'99), pages 217-235, 1999.
[EKSX96] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. In Proc. 1996 Int. Conf.
Knowledge Discovery and Data Mining (KDD'96), pages 226-231, 1996.
[FLE00] F. Farnstrom, J. Lewis, and C. Elkan. Scalability for clustering algorithms
revisited. SIGKDD Explorations, 2:51-57, 2000.
[GRS98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm
for large databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of
Data (SIGMOD'98), pages 73-84, 1998.
[HAK00] A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in
high dimensional spaces? In Proc. 26th Int. Conf. on Very Large Data Bases
(VLDB'00), pages 506-515, 2000.
[HB97] T. Hofmann and J. Buhmann. Active data clustering. In Advances in Neural
Information Processing Systems (NIPS'97), pages 528-534, 1997.
[HK98] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large
multimedia databases with noise. In Proc. 1998 Int. Conf. Knowledge
Discovery and Data Mining (KDD'98), pages 58-65, 1998.
[HK99] A. Hinneburg and D. A. Keim. Optimal grid-clustering: Towards breaking
the curse of dimensionality in high-dimensional clustering. In Proc. 25th
Int. Conf. on Very Large Data Bases (VLDB'99), pages 506-517, 1999.
[KR90] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction
to Cluster Analysis. New York: John Wiley & Sons, 1990.
[Mac67] J. MacQueen. Some methods for classification and analysis of multivariate
observations. In Proc. 5th Berkeley Symp. on Math. Statist. and Prob.,
1:281-297, 1967.
[NGC99] H. Nagesh, S. Goil, and A. Choudhary. MAFIA: Efficient and scalable
subspace clustering for very large data sets. Technical Report 9906-010,
Northwestern University, June 1999.
[NH94] R. Ng and J. Han. Efficient and effective clustering methods for spatial data
mining. In Proc. 1994 Int. Conf. on Very Large Data Bases (VLDB'94),
pages 144-155, 1994.
[Sco79] D. W. Scott. Multivariate Density Estimation. New York: John Wiley &
Sons, 1979.
[SCZ98] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-
resolution clustering approach for very large spatial databases. In Proc.
1998 Int. Conf. on Very Large Data Bases (VLDB'98), pages 428-439, 1998.
[Wan96] M. P. Wand. Data-based choice of histogram bin width. The American
Statistician, 51:59-64, 1996.
[WYM97] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid
approach to spatial data mining. In Proc. 1997 Int. Conf. on Very Large
Data Bases (VLDB'97), pages 186-195, 1997.
[ZRL96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data
clustering method for very large databases. In Proc. 1996 ACM-SIGMOD
Int. Conf. Management of Data (SIGMOD'96), pages 103-114, 1996.