Given a data set with n objects and k ≤ n, the number of desired clusters, partitioning algorithms partition the objects into k clusters. The clusters are formed in order to optimize an objective criterion such as distance. Each object is assigned to the closest cluster. Clusters are typically represented either by the mean of the objects assigned to the cluster (k-means [Mac67]) or by one representative object of the cluster (k-medoid [KR90]). CLARANS (Clustering Large Applications based upon RANdomized Search) [NH94] is a partitioning clustering algorithm developed for large data sets, which uses a randomized and bounded search strategy to improve the scalability of the k-medoid approach. CLARANS enables the detection of outliers, and its computational complexity is about O(n²). The performance of CLARANS can be improved by exploiting spatial data structures such as R*-trees.

Hierarchical clustering algorithms work by grouping data objects into a hierarchy (e.g., a tree) of clusters. The hierarchy can be formed top-down (divisive hierarchical methods) or bottom-up (agglomerative hierarchical methods). Hierarchical methods rely on a distance function to measure the similarity between clusters. These methods do not scale well with the number of data objects; their computational complexity is usually O(n²). Some newer methods such as BIRCH [ZRL96] and CURE [GRS98] attempt to address the scalability problem and improve the quality of clustering results for hierarchical methods. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an efficient divisive hierarchical algorithm. It has O(n) computational complexity, can work with a limited amount of memory, and has efficient I/O. It uses a special data structure, the CF-tree (Cluster Feature tree), for storing summary information about subclusters of objects. The CF-tree structure can be seen as a multilevel compression of the data that attempts to preserve the clustering structure inherent in the data set. Because of the similarity measure it uses to determine the data items to be compressed, BIRCH only performs well on data sets with spherical clusters. CURE (Clustering Using REpresentatives) is an O(n²) algorithm that produces high-quality clusters in the presence of outliers and can identify clusters of complex shapes and different sizes. It employs a hierarchical clustering approach that uses a fixed number of representative points to define a cluster instead of a single centroid or object. CURE handles large data sets through a combination of random sampling and partitioning. Since CURE uses only a random sample of the data set, it manages to achieve good scalability for large data sets. CURE reports better times than BIRCH on the same benchmark data.

Locality-based clustering algorithms group neighboring data objects into clusters based on local conditions. These algorithms allow clustering to be performed in one scan of the data set. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [EKSX96] is a typical representative of this group of algorithms. It regards clusters as dense regions of objects in the input space that are separated by regions of low density. DBSCAN's basic idea is that the density of points in a radius around each point in a cluster has to be above a certain threshold. It grows a cluster as long as, for each data point within the cluster, a neighborhood of a given radius contains at least a minimum number of points. DBSCAN has computational complexity O(n²). If a spatial index is used, the computational complexity is O(n log n). The clustering generated by DBSCAN is very sensitive to parameter choice. OPTICS (Ordering Points To Identify Clustering Structures) is another locality-based clustering algorithm. It computes an augmented cluster ordering for automatic and interactive clustering analysis. OPTICS has the same computational complexity as DBSCAN.
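As a brief illustration (not taken from the paper), scikit-learn's DBSCAN implementation exposes exactly the two quantities mentioned above: the neighborhood radius (eps) and the minimum number of points (min_samples). The data and parameter values below are arbitrary; the snippet simply shows how the resulting clustering changes with the radius, which is the parameter sensitivity referred to in the text.

```python
# Minimal sketch of DBSCAN's radius / minimum-points parameters using
# scikit-learn; data and parameter values are arbitrary examples.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus sparse background noise.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(200, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(200, 2)),
    rng.uniform(low=-2, high=7, size=(50, 2)),
])

for eps in (0.2, 0.5, 1.5):  # neighborhood radius
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters, {np.sum(labels == -1)} noise points")
```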

In general, partitioning, hierarchical, and locality-based clustering algorithms do not scale well with the number of objects in the data set. To improve efficiency, data summarization techniques integrated with the clustering process have been proposed. Besides the above-mentioned BIRCH and CURE algorithms, examples include: active data clustering [HB97], ScalableKM [BFR98], and simple single pass k-means [FLE00]. Active data clustering utilizes principles from sequential experimental design in order to interleave data generation and data analysis. It infers from the available data not only the grouping structure of the data, but also which data are most relevant for the clustering problem. The inferred relevance of the data is then used to control the re-sampling of the data set. ScalableKM requires at most one scan of the data set. The method identifies data points that can be effectively compressed, data points that must be maintained in memory, and data points that can be discarded. The algorithm operates within the confines of a limited memory buffer. Unfortunately, the compression schemes used by ScalableKM can introduce significant overhead. The simple single pass k-means algorithm is a simplification of ScalableKM. Like ScalableKM, it also uses a data buffer of fixed size. Experiments indicate that the simple single pass k-means algorithm is several times faster than standard k-means while producing clusterings of comparable quality.

None of the above-mentioned methods is fully effective when clustering high dimensional data. Methods that rely on near or nearest neighbor information do not work well in high dimensional spaces. In high dimensional data sets, because the space is so sparsely populated, it is very unlikely that data points are significantly nearer to each other than the average distance between data points. As a result, as the dimensionality of the space increases, the difference between the distance to the nearest and the farthest neighbor of a data object goes to zero [BGRS99, HAK00].
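A small numerical experiment (our own illustration of the cited result, with arbitrary sample sizes) makes the effect concrete: for points drawn uniformly at random, the relative gap between the nearest and the farthest neighbor distance shrinks as the dimensionality grows.

```python
# Illustration of distance concentration in high dimensions: the relative
# difference between nearest and farthest neighbor distances shrinks as the
# number of dimensions d grows. Purely synthetic example.
import numpy as np

rng = np.random.default_rng(0)
n = 500  # number of data points

for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(X - query, axis=1)
    nearest, farthest = dists.min(), dists.max()
    print(f"d={d:5d}  (farthest - nearest) / nearest = "
          f"{(farthest - nearest) / nearest:.3f}")
```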

Grid-based clustering algorithms do not suffer from the nearest neighbor problem in high dimensional spaces. Examples include STING (STatistical INformation Grid) [WYM97], CLIQUE [AGGR98], DENCLUE [HK98], WaveCluster [SCZ98], and MAFIA (Merging Adaptive Finite Intervals And is more than a clique) [NGC99]. These methods divide the input space into hyper-rectangular cells, discard the low-density cells, and then combine adjacent high-density cells to form clusters. Grid-based methods are capable of discovering clusters of any shape and are also reasonably fast. However, none of these methods addresses how to efficiently cluster very large data sets that do not fit in memory. Furthermore, these methods only work well with input spaces of low to moderate dimensionality. As the dimensionality of the space increases, grid-based methods face serious problems: the number of cells grows exponentially and finding adjacent high-density cells to form clusters becomes prohibitively expensive [HK99].
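To make the growth rate explicit (an illustrative calculation, not part of the original text): with a regular grid of b bins per dimension, the number of cells is b^d, which quickly becomes unmanageable.

```python
# Number of hyper-rectangular cells in a regular grid with b bins per
# dimension is b**d; it grows exponentially with the dimensionality d.
bins_per_dim = 10
for d in (2, 5, 10, 20):
    print(f"d={d:2d}: {bins_per_dim ** d:.3e} cells")
```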


In order to address the curse of dimensionality, a number of algorithms have focused on data projections in subspaces. Examples include PROCLUS [APYP99], OptiGrid [HK99], and ORCLUS [AY00]. PROCLUS uses axis-parallel partitions to identify subspace clusters. ORCLUS uses generalized projections to identify subspace clusters. OptiGrid is an especially interesting algorithm due to its simplicity and its ability to find clusters in high dimensional spaces in the presence of noise. OptiGrid constructs a grid partitioning of the data by calculating the partitioning hyperplanes using contracting projections of the data. OptiGrid looks for hyperplanes that satisfy two requirements: 1) separating hyperplanes should cut through regions of low density relative to the surrounding regions; and 2) separating hyperplanes should place individual clusters into different partitions. The first requirement aims at preventing oversplitting, that is, a cutting plane should not split a cluster. The second requirement attempts to achieve good cluster discrimination, that is, the cutting plane should contribute to finding the individual clusters. OptiGrid recursively constructs a multidimensional grid by partitioning the data using a set of cutting hyperplanes, each of which is orthogonal to at least one projection. At each step, the generation of the set of candidate hyperplanes is controlled by two threshold parameters. The implementation of OptiGrid described in the paper uses axis-parallel partitioning hyperplanes. The authors show that the error introduced by axis-parallel partitioning decreases exponentially with the number of dimensions in the data space. This validates the use of axis-parallel projections as an effective approach for separating clusters in high dimensional spaces. OptiGrid, however, has two main shortcomings: it is sensitive to parameter choice and it does not prescribe a strategy for efficiently handling data sets that do not fit in memory.

To overcome both the scalability problems associated with large amounts of data and the difficulties posed by high dimensional input spaces, this paper introduces a new clustering algorithm called O-Cluster (Orthogonal partitioning CLUSTERing). This new clustering method combines a novel active sampling technique with an axis-parallel partitioning strategy to identify continuous areas of high density in the input space. The method operates on a limited memory buffer and requires at most a single scan through the data.

    2. The O-Cluster Algorithm

O-Cluster is a method that builds upon the contracting projection concept introduced by OptiGrid. Our algorithm makes two major contributions:

• It proposes the use of a statistical test to validate the quality of a cutting plane. Such a test proves crucial for identifying good splitting points along data projections and makes possible the automated selection of high quality separators.

• It can operate on a small buffer containing a random sample from the original data set. Active sampling ensures that partitions are provided with additional data points if more information is needed to evaluate a cutting plane. Partitions that do not have ambiguities are frozen and the data points associated with them are removed from the active buffer.


O-Cluster operates recursively. It evaluates possible splitting points for all projections in a partition, selects the best one, and splits the data into two new partitions. The algorithm proceeds by searching for good cutting planes inside the newly created partitions. Thus O-Cluster creates a hierarchical tree structure that tessellates the input space into rectangular regions. Figure 1 provides an outline of the O-Cluster algorithm.
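The following is a minimal sketch (our illustration, not the authors' code) of the kind of binary partition tree this recursive procedure produces; find_best_split is a hypothetical stand-in for the histogram-based search described in the processing stages below.

```python
# Sketch of the binary clustering tree built by recursive axis-parallel
# splitting. Each node stores the split dimension and split value; leaves
# correspond to the rectangular regions that tessellate the input space.
# find_best_split is a hypothetical stand-in for Steps 2-4 below.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Partition:
    points: np.ndarray                  # rows currently assigned to this region
    split_dim: Optional[int] = None     # dimension of the cutting plane, if any
    split_value: Optional[float] = None
    left: Optional["Partition"] = None
    right: Optional["Partition"] = None

def grow(partition: Partition, find_best_split) -> None:
    """Recursively split a partition while valid cutting planes are found."""
    split = find_best_split(partition.points)   # returns (dim, value) or None
    if split is None:
        return                                  # leaf: frozen or ambiguous
    dim, value = split
    mask = partition.points[:, dim] <= value
    partition.split_dim, partition.split_value = dim, value
    partition.left = Partition(points=partition.points[mask])
    partition.right = Partition(points=partition.points[~mask])
    grow(partition.left, find_best_split)
    grow(partition.right, find_best_split)
```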

[Figure 1 is a block diagram with the following flow: 1. Load buffer; 2. Compute histograms for active partitions; 3. Find 'best' splitting points for active partitions; 4. Flag ambiguous and 'frozen' partitions; if splitting points exist, 5. Split active partitions and return to Step 2; otherwise, if ambiguous partitions exist and unseen data exist, 6. Reload buffer and return to Step 2; otherwise, EXIT.]

Figure 1: O-Cluster algorithm block diagram.


    The main processing stages are as follows:

    1. Load data buffer: If the entire data set does not fit in the buffer, a random

    sample is used. O-Cluster assigns all points from the initial buffer to a root

    partition.

2. Compute histograms for active partitions: The goal is to determine a set of projections for the active partitions and compute histograms along these projections. Any partition that represents a leaf in the clustering hierarchy and is not explicitly marked ambiguous or frozen is considered active. The process whereby an active partition becomes ambiguous or frozen is explained in Step 4. It is essential to compute histograms that provide good resolution but also have data artifacts smoothed out. A number of studies have addressed the problem of how many bins can be supported by a given distribution [Sco79, Wan96]. Based on these studies, a reasonable, simple approach would be to make the number of bins inversely proportional to the standard deviation of the data along a given dimension and directly proportional to N^(1/3), where N is the number of points inside a partition. Alternatively, one can use a global binning strategy and coarsen the histograms as the number of points inside the partitions decreases. O-Cluster is robust with respect to different binning strategies as long as the histograms do not significantly undersmooth or oversmooth the distribution density.

3. Find 'best' splitting points for active partitions: For each histogram, O-Cluster attempts to find the best valid cutting plane, if one exists. A valid cutting plane passes through a point of low density (a valley) in the histogram. Additionally, the point of low density should be surrounded on both sides by points of high density (peaks). O-Cluster attempts to find a pair of peaks with a valley between them where the difference between the peak and the valley histogram counts is statistically significant. Statistical significance is tested using a standard χ² test:

χ² = 2(observed - expected)² / expected ≥ χ²_{α,1},

where the observed value is equal to the histogram count of the valley and the expected value is the average of the histogram counts of the valley and the lower peak. The current implementation uses a 95% confidence level (χ²_{0.05,1} = 3.843). Since multiple splitting points per partition can be found to be valid separators according to this test, O-Cluster chooses the one where the valley has the lowest histogram count as the best splitting point. Thus the cutting plane goes through the area with the lowest density. (A small illustrative sketch of this splitting-point search is given after the description of the processing stages.)

4. Flag ambiguous and frozen partitions: If no valid splitting points are found, O-Cluster checks whether the χ² test would have found a valid splitting point at a lower confidence level (e.g., 90%, with χ²_{0.1,1} = 2.706). If that is the case, the current partition can be considered ambiguous: more data points are needed to establish the quality of the splitting point. If no splitting points were found and there is no ambiguity, the partition can be marked as frozen and the records associated with it marked for deletion from the active buffer.

    5. Split active partitions: If a valid separator exists, the data points are split along

    the cutting plane and two new active partitions are created from the original

    partition. For each new partition the processing proceeds recursively from Step 2.

6. Reload buffer: This step can take place after all recursive partitioning on the current buffer has completed. If all existing partitions are marked as frozen and/or there are no more data points available, the algorithm exits. Otherwise, if some partitions are marked as ambiguous and additional unseen data records exist, O-Cluster proceeds with reloading the data buffer. The new data replace records belonging to frozen partitions. When new records are read in, only data points that fall inside ambiguous partitions are placed in the active buffer. New records falling within a frozen partition are not loaded into the buffer. If it is desirable to maintain statistics of the data points falling inside partitions (including the frozen partitions), such statistics can be continuously updated with the reading of each new record. Loading of new records continues until either: 1) the active buffer is filled again; 2) the end of the data set is reached; or 3) a reasonable number of records have been read, even if the active buffer is not full and there are more data. The reason for the last condition is that if the buffer is relatively large and there are many points marked for deletion, it may take a long time to fill the entire buffer with data from the ambiguous regions. To avoid excessive reloading under these circumstances, the buffer reloading process is terminated after reading a number of records equal to the data buffer size. Once the buffer reload is completed, the algorithm proceeds from Step 2. The algorithm requires, at most, a single pass through the entire data set.
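To make Steps 2-4 concrete, the following is a minimal sketch under our own assumptions (Scott's rule for the bin count, the lower flanking peak as the reference value, and the 3.843 and 2.706 thresholds quoted above); it is an illustration, not the authors' implementation.

```python
# Illustrative sketch of the splitting-point search in Steps 2-4 (our own
# simplification). It histograms one projection of a partition, looks for a
# valley flanked by two peaks, and applies the chi-square test quoted above.
import numpy as np

CHI2_95 = 3.843   # chi-square threshold, 1 d.o.f., 95% confidence (Step 3)
CHI2_90 = 2.706   # lower confidence level used to flag ambiguity (Step 4)

def chi2_statistic(valley_count, lower_peak_count):
    """chi^2 = 2 * (observed - expected)^2 / expected, as defined in Step 3."""
    observed = valley_count
    expected = (valley_count + lower_peak_count) / 2.0
    return 2.0 * (observed - expected) ** 2 / expected if expected > 0 else 0.0

def evaluate_projection(values):
    """Return ('split', value), ('ambiguous', None), or ('frozen', None)."""
    n = len(values)
    std = values.std()
    data_range = values.max() - values.min()
    if n < 2 or std == 0 or data_range == 0:
        return "frozen", None                       # degenerate projection
    # Step 2: bins proportional to N^(1/3) and inversely proportional to the
    # standard deviation; Scott's rule [Sco79] is one such choice.
    bin_width = 3.49 * std * n ** (-1.0 / 3.0)
    n_bins = max(2, int(np.ceil(data_range / bin_width)))
    counts, edges = np.histogram(values, bins=n_bins)

    valid, near_valid = [], False
    for i in range(1, len(counts) - 1):
        left_peak = counts[:i].max()
        right_peak = counts[i + 1:].max()
        valley = counts[i]
        if valley >= min(left_peak, right_peak):
            continue                                # not a valley between peaks
        stat = chi2_statistic(valley, min(left_peak, right_peak))
        if stat >= CHI2_95:
            valid.append((valley, (edges[i] + edges[i + 1]) / 2.0))
        elif stat >= CHI2_90:
            near_valid = True                       # would pass at 90% only

    if valid:                                       # lowest valley wins (Step 3)
        return "split", min(valid)[1]
    if near_valid:
        return "ambiguous", None                    # wait for more data (Step 6)
    return "frozen", None
```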

In addition to the major differences from OptiGrid noted at the beginning of this section, there are two other important distinctions:

• OptiGrid's choice of a valid cutting plane depends on a pair of global parameters: noise level and maximal splitting density. Those two parameters act as thresholds for identifying valid splitting points. In OptiGrid, histogram peaks are required to be above the noise level parameter while histogram valleys need to have density lower than the maximum splitting density. The maximum splitting density should be set above the noise level threshold (personal communication with OptiGrid's authors). Finding correct values for these parameters is critical for OptiGrid's performance. O-Cluster's χ² test for splitting points eliminates the need for preset thresholds: the algorithm can find valid cutting planes at any density level within a histogram. While not strictly necessary for O-Cluster's operation, it was found useful, in the course of algorithm development, to introduce a parameter called sensitivity (ρ). Analogous to OptiGrid's noise level, the role of this parameter is to suppress the creation of arbitrarily small clusters by setting a minimum count for O-Cluster's histogram peaks. The effect of ρ is illustrated in Section 4.


• While OptiGrid attempts to find good cutting planes that optimally traverse the input space, it is prone to oversplitting. By design, OptiGrid can partition simultaneously along several cutting planes. This may result in the creation of clusters (with few points) that need to be subsequently removed. Additionally, OptiGrid works with histograms that undersmooth the distribution density (personal communication with OptiGrid's authors). Undersmoothed histograms and the threshold-based mechanism of splitting point identification can lead to the creation of separators that cut through clusters. These issues may not necessarily be a serious hindrance in the OptiGrid framework, since the algorithm attempts to construct a multidimensional grid where the highly populated cells are interpreted as clusters. O-Cluster, on the other hand, attempts to create a binary clustering tree whose leaves are regions with flat or unimodal density functions. Only a single cutting plane is applied at a time and the quality of the splitting point is statistically validated.

O-Cluster functions optimally for large-scale data sets with many records and high dimensionality. It is desirable to work with a sufficiently large active buffer in order to calculate high quality histograms with good resolution. High dimensionality has been shown to significantly reduce the chance of cutting through data when using axis-parallel cutting planes [HK99]. There is no special handling for missing values: if a data record has missing values, the record does not contribute to the histogram counts along the corresponding dimensions. However, if a missing value is needed to assign the record to a partition, the record is not assigned and it is marked for deletion from the active buffer.

    3. O-Cluster Complexity

O-Cluster can use an arbitrary set of projections. Our current implementation is restricted to projections that are axis-parallel. The histogram computation step is of complexity O(N×d), where N is the number of data points in the buffer and d is the number of dimensions. The selection of the best splitting point for a single dimension is O(b), where b is the average number of histogram bins in a partition. Choosing the best splitting point over all dimensions is O(d×b). The assignment of data points to newly created partitions requires a comparison of an attribute value to the splitting point, and its complexity has an upper bound of O(N). Loading new records into the data buffer requires their insertion into the relevant partitions. The complexity associated with scoring a record depends on the depth of the binary clustering tree (s). The upper limit for filling the whole active buffer is O(N×s). The depth of the tree depends on the data set.

In general, the total complexity can be approximated as O(N×d). It is shown in Section 4 that O-Cluster scales linearly with the number of records and the number of dimensions.
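As a rough illustration of how these terms combine (our own arithmetic, with arbitrary example values), the N×d histogram term dominates the per-pass cost:

```python
# Rough operation-count estimate for one buffer pass, following the complexity
# terms listed in Section 3 (illustrative only; constant factors omitted).
def ocluster_cost_estimate(N, d, b, s):
    histogram_cost = N * d          # one bin increment per point per dimension
    split_search_cost = d * b       # scan each dimension's histogram once
    assignment_cost = N             # compare one attribute per point to the split
    reload_cost = N * s             # route each reloaded record down the tree
    return histogram_cost + split_search_cost + assignment_cost + reload_cost

# Example: 100,000 buffered points, 10 dimensions, ~50 bins, tree depth ~7.
print(ocluster_cost_estimate(100_000, 10, 50, 7))  # dominated by the N*d term
```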

    4. Empirical Results

This section illustrates the general behavior of O-Cluster and evaluates the correctness of its solutions. The first series of tests was carried out on a two-dimensional data set, DS3 [ZRL96]. This is a particularly challenging benchmark. The low number of dimensions makes the use of any axis-parallel partitioning algorithm problematic. Also, the data set consists of 100 spherical clusters that vary significantly in their size and density. The number of points per cluster is a random number in the range [0, 2000] drawn from a uniform distribution, and the variance across dimensions for each cluster is a random number in the range [0, 2], also drawn from a uniform distribution.

    4.1. O-Cluster on DS3

Figure 2 depicts the partitions found by O-Cluster on the DS3 data set. The centers of the original clusters are marked with squares while the centroids of the points assigned to each O-Cluster partition are represented by stars.

Figure 2: O-Cluster partitions on the DS3 data set. The grid depicts the splitting planes found by O-Cluster. Squares represent the original cluster centroids, stars (*) represent the centroids of the points belonging to an O-Cluster partition; recall = 71%, precision = 97%.

Although O-Cluster does not function optimally when the dimensionality is low, it produces a good set of partitions. It is noteworthy that O-Cluster finds cutting planes at different levels of density and successfully identifies nested clusters. Axis-parallel splits in low dimensions can easily lead to the creation of artifacts where cutting planes have to cut through parts of a cluster and data points are assigned to incorrect partitions. Such artifacts can either result in centroid imprecision or lead to further partitioning and the creation of spurious clusters. For example, in Figure 2 O-Cluster creates 73 partitions. Of these, 71 contain the centroid of at least one of the original clusters. The remaining 2 partitions were produced due to artifacts created by splits going through clusters.

In general, there are two potential sources of imprecision in the algorithm: 1) O-Cluster may fail to create partitions for all original clusters; and/or 2) O-Cluster may create spurious partitions that do not correspond to any of the original clusters. To measure these two effects separately, we use two metrics borrowed from the information retrieval domain: recall is defined as the percentage of the original clusters that were found and assigned to partitions; precision is defined as the percentage of the found partitions that contain at least one original cluster centroid. That is, in Figure 2 O-Cluster found 71 out of 100 original clusters (a recall of 71%), and 71 of the 73 partitions created contained at least one centroid of the original clusters (a precision of 97%).
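A small sketch of one plausible way to compute these two metrics from the output (our own reading; the bounding-box representation of a partition and the rule that each partition is credited with at most one original centroid are our assumptions, not the authors' evaluation code):

```python
# Recall: fraction of original cluster centroids credited to some partition;
# Precision: fraction of partitions containing at least one original centroid.
# A partition is represented here by its bounding box (low, high) per dimension.
import numpy as np

def contains(box, point):
    low, high = box
    return bool(np.all(point >= low) and np.all(point <= high))

def recall_and_precision(original_centroids, partition_boxes):
    credited = set()                      # partitions already matched to a cluster
    found = 0
    for c in original_centroids:
        for i, box in enumerate(partition_boxes):
            if i not in credited and contains(box, c):
                credited.add(i)
                found += 1
                break
    hit = sum(any(contains(box, c) for c in original_centroids)
              for box in partition_boxes)
    recall = found / len(original_centroids)
    precision = hit / len(partition_boxes)
    return recall, precision
```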

    4.2. The Sensitivity Parameter

The effect of creating spurious clusters due to splitting artifacts can be alleviated by using O-Cluster's sensitivity (ρ) parameter. ρ is a parameter in the [0, 1] range that is inversely related to the minimum count required to find a histogram peak. A value of 0 requires the histogram peaks to surpass the count corresponding to a global uniform level per dimension. The global uniform level is defined as the average histogram count that would have been observed if the data points in the buffer were drawn from a uniform distribution. A value of 0.5 sets the minimum histogram count for a peak to 50% of the global uniform level. A value of 1 removes the restrictions on peak histogram counts, and the splitting point identification relies solely on the χ² test. The results shown in Figure 2 were produced with ρ = 0.95.
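A sketch of how the minimum peak count could be derived from ρ, using the linear relationship that matches the three reference points given above (ρ = 0, 0.5, 1); this is our reading, not code from the paper:

```python
# Minimum histogram count required for a peak, as a function of the
# sensitivity parameter rho in [0, 1]. The global uniform level is the average
# count expected if the buffered points were spread uniformly over the bins.
# Linear in rho, matching the rho = 0, 0.5, and 1 cases described above.
def min_peak_count(rho, n_points_in_buffer, n_bins):
    global_uniform_level = n_points_in_buffer / n_bins
    return (1.0 - rho) * global_uniform_level

# Example: 100,000 buffered points, 50 bins along a dimension.
for rho in (0.0, 0.5, 0.95, 1.0):
    print(rho, min_peak_count(rho, 100_000, 50))
```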

Figure 3 illustrates the effect of changing ρ. Increasing ρ enables O-Cluster to grow the clustering hierarchy deeper and thus obtain improved recall. However, values of ρ that are too high may result in excessive splitting and thus poor precision. It should be noted that the effect of ρ is magnified by the particular characteristics of the DS3 data set. The 2D dimensionality leads to splitting artifacts that become the main source of oversplitting. Additionally, the original clusters in the DS3 data set vary significantly in their number of records, and low ρ values can filter out some of the weaker clusters. Higher dimensionality and more evenly represented clusters reduce O-Cluster's sensitivity to ρ.


Figure 3: Effect of the sensitivity parameter. The grid depicts the splitting planes found by O-Cluster. Squares represent the original cluster centroids, stars (*) represent the centroids of the points belonging to an O-Cluster partition. (a) ρ = 0, recall = 22%, precision = 100%; (b) ρ = 0.5, recall = 40%, precision = 100%; (c) ρ = 0.75, recall = 56%, precision = 100%; (d) ρ = 1, recall = 72%, precision = 84%.

    4.3. The Effect of Dimensionality

In order to illustrate the benefits of higher dimensionality, the DS3 data set was extended to 5 and 10 dimensions. ρ was set to 1 for both experiments. Figure 4 shows the 2D projection of the data set, the original cluster centroids, and the centroids of O-Cluster's partitions in the plane specified by the original two dimensions. The O-Cluster grid is not included since the cutting planes in higher dimensions could not be plotted in a meaningful way. It can be seen that O-Cluster's accuracy (both recall and precision) improves dramatically with increased dimensionality. The main reason for the remarkably good performance is that higher dimensionality allows O-Cluster to find cutting planes that do not produce splitting artifacts.


Figure 4: Effect of dimensionality. Squares represent the original cluster centroids, stars (*) represent the centroids of the points belonging to an O-Cluster partition. (a) dimensionality = 5, recall = 99%, precision = 96%; (b) dimensionality = 10, recall = 100%, precision = 100%.

    4.4. The Effect of Uniform Noise

O-Cluster shares one remarkable feature with OptiGrid: its resistance to uniform noise. To test O-Cluster's robustness to uniform noise, a synthetic data set consisting of 100,000 points was generated. It consisted of 50 spherical clusters, with variance in the range [0, 2], each represented by 2,000 points. To introduce uniform noise into the data set, a certain percentage of the original records were replaced by records drawn from a uniform distribution on each dimension. O-Cluster was tested with 25%, 50%, 75%, and 90% noise. For example, when the percentage of noise was 90%, the original clusters were represented by 10,000 points (200 on average per cluster) and the remaining 90,000 points were uniform noise. All experiments were run with ρ = 0.8. Figure 5 illustrates O-Cluster's performance under noisy conditions. O-Cluster's accuracy degrades very gracefully with an increased percentage of background noise. Higher dimensionality provides a slight advantage when handling noise.

    Figure 5: Effect of uniform noise. (a) Recall for 5 and 10 dimensions; (b) Precision for 5 and 10

    dimensions.


It should also be noted that once background noise is introduced, the centroids of the partitions produced by O-Cluster are offset from the original cluster centroids. In order to identify the original centers, it is necessary to discount the background noise from the histograms and compute centroids on the remaining points. This can be accomplished by filtering out the histogram bins that fall below a level corresponding to the average bin count for the partition.
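A minimal sketch of this bin-filtering idea (our illustration, shown for a single dimension): bins whose counts fall at or below the partition's average bin count are treated as background noise and excluded before the centroid is recomputed.

```python
# Recompute a 1-D partition centroid after discarding histogram bins whose
# counts fall at or below the average bin count for the partition (treated
# here as the background noise level). Illustrative only.
import numpy as np

def denoised_centroid(values, n_bins=50):
    counts, edges = np.histogram(values, bins=n_bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    keep = counts > counts.mean()          # drop bins at or below the average
    if not np.any(keep):
        return values.mean()               # fall back to the plain centroid
    return np.average(centers[keep], weights=counts[keep])
```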

    4.5. O-Cluster Scalability

The next series of tests addresses O-Cluster's scalability with increasing numbers of records and dimensions. All data sets used in the experiments consisted of 50 clusters. All 50 clusters were correctly identified in each test. When measuring scalability with an increasing number of records, the number of dimensions was set to 10. When measuring scalability with increasing dimensionality, the number of records was set to 100,000. Figure 6 shows that there is a clear linear dependency of O-Cluster's processing time on both the number of records and the number of dimensions. In general, these timing results can be improved significantly because the algorithm was implemented as a PL/SQL package in an ORACLE 9i database; there is an overhead associated with the fact that PL/SQL is an interpreted language.

    Figure 6: Scalability. (a) Scalability with number of records (10 dimensions); (b) Scalability with number

    of dimensions (100,000 records).

    4.6. Working with a Limited Buffer Size

In all tests described so far, O-Cluster had a sufficiently large buffer to accommodate the entire data set. The next set of results illustrates O-Cluster's behavior when the algorithm is required to have a small memory footprint, such that the active buffer can contain only a fraction of the entire data set. This series of tests reuses the data set described in Section 4.4 (50 clusters, 2,000 points each, 10 dimensions). For all tests, ρ was set to 0.8. Figure 7 shows the timing and recall numbers for different buffer sizes (0.5%, 0.8%, 1%, 5%, and 10% of the entire data set). Very small buffer sizes may require multiple refills. For example, the described experiment showed that when the buffer size was 0.5%, O-Cluster needed to refill it 5 times; when the buffer size was 0.8% or 1%, O-Cluster had to refill it once. For larger buffer sizes, no refills were necessary. As a result, using a 0.8% buffer proves to be slightly faster than using a 0.5% buffer. If no buffer refills were required (buffer size greater than 1%), O-Cluster followed a linear scalability pattern, as shown in the previous section. Regarding O-Cluster's accuracy, buffer sizes under 1% proved to be too small for the algorithm to find all existing clusters. For a buffer size of 0.5%, O-Cluster found 41 out of 50 clusters (82% recall), and for a buffer size of 0.8%, O-Cluster found 49 out of 50 clusters (98% recall). Larger buffer sizes allowed O-Cluster to correctly identify all original clusters. For all buffer sizes (including buffer sizes smaller than 1%) precision was 100%.

Figure 7: Buffer size. (a) Time scalability; (b) Recall.

5. Conclusions

The majority of existing clustering algorithms encounter serious scalability and/or accuracy related problems when used on data sets with a large number of records and/or dimensions. We propose a new clustering algorithm, O-Cluster, capable of efficiently and effectively clustering large high dimensional data sets. It relies on a novel active sampling approach and uses an axis-parallel partitioning scheme to identify hyper-rectangular regions of unimodal density in the input feature space. O-Cluster has good accuracy and scalability, is robust to noise, automatically detects the number of clusters in the data, and can successfully operate with limited memory resources.

    Currently we are extending O-Cluster in a number of ways, including:

• Parallel implementation. The results presented in this paper used a serial implementation of O-Cluster. Performance can be significantly improved by parallelizing the following steps of O-Cluster:

  o Buffer filling;

  o Histogram computation and splitting point determination;

  o Assigning records to partitions.

• Cluster representation through rules: especially useful for noisy cases where centroids do not characterize a cluster well.

• Probabilistic modeling and scoring with missing values: missing values can be a problem during record assignment.

• Handling categorical and mixed (categorical and numerical) data sets.

These extensions will be reported in a future paper.


    References

[AGGR98] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pages 94-105, 1998.

[APYP99] C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), pages 61-72, 1999.

[AY00] C. C. Aggarwal and P. S. Yu. Finding generalized projected clusters in high dimensional spaces. In Proc. 2000 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'00), pages 70-81, 2000.

[BFR98] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 8-15, 1998.

[BGRS99] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbor meaningful? In Proc. 7th Int. Conf. on Database Theory (ICDT'99), pages 217-235, 1999.

[EKSX96] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. 1996 Int. Conf. Knowledge Discovery and Data Mining (KDD'96), pages 226-231, 1996.

[FLE00] F. Farnstrom, J. Lewis, and C. Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2:51-57, 2000.

[GRS98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), pages 73-84, 1998.

[HAK00] A. Hinneburg, C. C. Aggarwal, and D. A. Keim. What is the nearest neighbor in high dimensional spaces? In Proc. 26th Int. Conf. on Very Large Data Bases (VLDB'00), pages 506-515, 2000.

[HB97] T. Hofmann and J. Buhmann. Active data clustering. In Advances in Neural Information Processing Systems (NIPS'97), pages 528-534, 1997.

[HK98] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 58-65, 1998.

[HK99] A. Hinneburg and D. A. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality in high-dimensional clustering. In Proc. 25th Int. Conf. on Very Large Data Bases (VLDB'99), pages 506-517, 1999.

[KR90] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley & Sons, 1990.

[Mac67] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Math. Statist. Prob., 1:281-297, 1967.

[NGC99] H. Nagesh, S. Goil, and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report 9906-010, Northwestern University, June 1999.

[NH94] R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In Proc. 1994 Int. Conf. on Very Large Data Bases (VLDB'94), pages 144-155, 1994.

[Sco79] D. W. Scott. Multivariate density estimation. New York: John Wiley & Sons, 1979.

[SCZ98] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proc. 1998 Int. Conf. on Very Large Data Bases (VLDB'98), pages 428-439, 1998.

[Wan96] M. P. Wand. Data-based choice of histogram bin width. The American Statistician, 51:59-64, 1996.

[WYM97] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proc. 1997 Int. Conf. on Very Large Data Bases (VLDB'97), pages 186-195, 1997.

[ZRL96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'96), pages 103-114, 1996.