3.4 density and grid methods
Clustering: Density and Grid Based
Density based methods
Clusters are dense regions of objects; low density regions are treated as noise.
DBSCAN: Density Based Spatial Clustering of Applications with Noise
OPTICS: Ordering Points To Identify the Clustering Structure
DENCLUE: DENsity Based CLUstEring
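DBSCAN
Cluster: a maximal set of density connected points.
Grows regions with sufficiently high density into clusters.
ε-neighborhood: the objects within radius ε of an object.
MinPts and core object: an object is a core object if its ε-neighborhood contains at least MinPts objects.
Directly density reachable: an object p is directly density reachable from object q if p is within the ε-neighborhood of q and q is a core object.
[Figure: object p in the ε-neighborhood of core object q, with MinPts = 5]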
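The definitions of core object and directly density reachable can be checked on a toy data set. The following is a minimal sketch, not a reference implementation: the function names and the sample points are illustrative, Euclidean distance is assumed, and an object is counted as a member of its own ε-neighborhood (one common convention). It needs Python 3.8 or later for `math.dist`.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, x in enumerate(points) if math.dist(points[i], x) <= eps]

def is_core(points, i, eps, min_pts):
    """A core object has at least min_pts objects in its eps-neighborhood."""
    return len(region_query(points, i, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p lies in the eps-neighborhood of q and q is a core object."""
    return is_core(points, q, eps, min_pts) and math.dist(points[p], points[q]) <= eps

# Illustrative data: four corners of a unit square, its centre, and one point off to the side.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (2.2, 0.5)]
EPS, MIN_PTS = 1.5, 5

print(is_core(points, 2, EPS, MIN_PTS))                        # corner (1, 0)
print(is_core(points, 5, EPS, MIN_PTS))                        # the side point
print(directly_density_reachable(points, 5, 2, EPS, MIN_PTS))  # side point from corner
print(directly_density_reachable(points, 2, 5, EPS, MIN_PTS))  # corner from side point
```

The last two calls illustrate that the relation is not symmetric: a border object can be directly density reachable from a core object without the reverse holding.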
DBSCAN: density reachable
An object p is density reachable from q if there is a chain of objects p1, …, pn with p1 = q and pn = p such that pi+1 is directly density reachable from pi.
[Figure: chain from q through p1 to p]
DBSCAN: density connected
An object p is density connected to object q if there is an object o such that both p and q are density reachable from o.
[Figure: p and q are both density reachable from o]
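Density reachability is a chain of directly-density-reachable steps, so it can be computed with a breadth-first search that only expands core objects. The sketch below is one possible way to express the two definitions; the helper names and the sample data are assumptions made for illustration, and an object is counted in its own ε-neighborhood.

```python
import math
from collections import deque

def neighbourhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, x in enumerate(points) if math.dist(points[i], x) <= eps]

def reachable_from(points, q, eps, min_pts):
    """Set of indices density reachable from q (empty when q is not a core object)."""
    reached, expanded, queue = set(), set(), deque([q])
    while queue:
        cur = queue.popleft()
        if cur in expanded:
            continue
        expanded.add(cur)
        nbrs = neighbourhood(points, cur, eps)
        if len(nbrs) < min_pts:
            continue                      # not a core object: the chain stops here
        reached.update(nbrs)              # everything in a core neighbourhood is reached
        queue.extend(j for j in nbrs if j not in expanded)
    return reached

def density_connected(points, p, q, eps, min_pts):
    """True if some object o exists from which both p and q are density reachable."""
    return any({p, q} <= reachable_from(points, o, eps, min_pts)
               for o in range(len(points)))

# Illustrative data: five points on a line plus one far-away point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
EPS, MIN_PTS = 1.0, 3

print(sorted(reachable_from(points, 1, EPS, MIN_PTS)))   # from a core object
print(sorted(reachable_from(points, 0, EPS, MIN_PTS)))   # from a border object
print(density_connected(points, 0, 4, EPS, MIN_PTS))     # two border objects
```

In this data the two end points 0 and 4 are border objects: neither is density reachable from the other, yet they are density connected through the core objects between them.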
DBSCAN
Arbitrarily select a point p.
Retrieve all points density-reachable from p with respect to ε and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
Complexity: O(n log n) with a spatial index, O(n²) without one.
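The procedure above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the original authors' code: the function name `dbscan`, the `NOISE` label and the sample data are made up for the example, Euclidean distance is used, and the neighborhood query is a brute-force scan, which corresponds to the O(n²) case.

```python
import math

NOISE = -1  # label used here for points that end up in no cluster

def dbscan(points, eps, min_pts):
    """Return one cluster label per point (0, 1, ... or NOISE)."""
    labels = [None] * len(points)            # None means "not visited yet"

    def region(i):
        # Brute-force eps-neighborhood; a spatial index would replace this scan.
        return [j for j, x in enumerate(points) if math.dist(points[i], x) <= eps]

    cluster = 0
    for p in range(len(points)):
        if labels[p] is not None:
            continue
        nbrs = region(p)
        if len(nbrs) < min_pts:
            labels[p] = NOISE                # may be relabelled later as a border point
            continue
        labels[p] = cluster                  # p is a core point: start a new cluster
        seeds = list(nbrs)
        while seeds:
            q = seeds.pop()
            if labels[q] == NOISE:
                labels[q] = cluster          # border point reached from a core point
            if labels[q] is not None:
                continue                     # already assigned
            labels[q] = cluster
            q_nbrs = region(q)
            if len(q_nbrs) >= min_pts:       # q is also a core point: keep growing
                seeds.extend(q_nbrs)
        cluster += 1
    return labels

# Illustrative data: two small squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

With these parameters the two squares form two clusters and the isolated point is labelled as noise.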
OPTICS: A Cluster-Ordering Method
OPTICS: Ordering Points To Identify the Clustering Structure
Produces a special order of the database with respect to its density-based clustering structure.
Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure.
Can be represented graphically or using visualization techniques.
OPTICS
In DBSCAN, for a constant MinPts value, density based clusters with respect to a higher density (lower value of ε) are completely contained in the density connected sets obtained with respect to a lower density.
DBSCAN is extended so that objects are processed in a specific order.
Selects an object that is density-reachable with respect to the lowest ε value.
Core distance of an object p: the smallest ε' value that makes p a core object (undefined if p is not a core object with respect to ε).
Reachability distance of an object q with respect to p: max(core-distance of p, d(p, q)) (undefined if p is not a core object).
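The two distances can be computed directly from their definitions. The sketch below is illustrative only: the function names and data are assumptions, `None` stands for "undefined", Euclidean distance is used, and p is counted as a member of its own neighborhood.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Smallest radius that makes points[p] a core object, or None if that radius exceeds eps."""
    dists = sorted(math.dist(points[p], x) for x in points)  # includes p itself at distance 0
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]

def reachability_distance(points, q, p, eps, min_pts):
    """max(core-distance of p, d(p, q)), or None if p is not a core object."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[p], points[q]))

# Illustrative data: four points on a line and one far-away point.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
EPS, MIN_PTS = 4.0, 3

print(core_distance(points, 0, EPS, MIN_PTS))              # 2.0
print(reachability_distance(points, 1, 0, EPS, MIN_PTS))   # 2.0: the core distance dominates
print(reachability_distance(points, 3, 0, EPS, MIN_PTS))   # 3.0: the actual distance dominates
print(core_distance(points, 4, EPS, MIN_PTS))              # None: not a core object
```

The second call shows why the maximum is taken: an object closer to p than p's core distance still receives the core distance as its reachability distance.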
OPTICS
Complexity: O(n log n) when a spatial index is used.
[Figure: OPTICS reachability plot, showing the reachability-distance (undefined for some objects) against the cluster-order of the objects]
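A cluster ordering with its reachability distances can be produced by a small variant of the DBSCAN loop that always processes the pending object with the smallest reachability distance next. The following is a simplified sketch under assumptions, not the published algorithm in full: the name `optics_order` and the data are illustrative, `math.inf` stands for an undefined reachability distance, and a linear scan replaces the priority queue and spatial index a practical version would use.

```python
import math

def optics_order(points, eps, min_pts):
    """Return (order, reachability), both listed in cluster order."""
    n = len(points)
    reach = [math.inf] * n                   # math.inf plays the role of "undefined"
    done = [False] * n
    order = []

    def neighbours(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    def core_dist(i, nbrs):
        if len(nbrs) < min_pts:
            return math.inf
        return sorted(math.dist(points[i], points[j]) for j in nbrs)[min_pts - 1]

    for start in range(n):
        if done[start]:
            continue
        seeds = [start]
        while seeds:
            cur = min(seeds, key=lambda j: reach[j])   # smallest reachability first
            seeds.remove(cur)
            done[cur] = True
            order.append(cur)
            nbrs = neighbours(cur)
            cd = core_dist(cur, nbrs)
            if cd == math.inf:
                continue                               # not a core object: nothing to update
            for j in nbrs:
                if done[j]:
                    continue
                reach[j] = min(reach[j], max(cd, math.dist(points[cur], points[j])))
                if j not in seeds:
                    seeds.append(j)
    return order, [reach[i] for i in order]

# Illustrative data: two groups of three points on a line.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
order, reachability = optics_order(points, eps=3.0, min_pts=2)
print(order)
print(reachability)
```

In a reachability plot, the undefined (infinite) values mark where a new dense region starts, and the low values that follow form the valley belonging to that cluster.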
DENCLUE: using density functions
DENsity-based CLUstEring
Major features:
Solid mathematical foundation
Good for data sets with large amounts of noise
Allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
But needs a large number of parameters
DENCLUE
Influence function: describes the impact of a data point within its neighborhood.
x, y: objects in F^d, the d-dimensional input space.
Influence of object y on x is:
\[ f_B^y(x) = f_B(x, y) \]
It can be determined by the distance d(x, y) between the two objects, for example with a square wave influence function
\[ f_{square}(x, y) = \begin{cases} 0 & \text{if } d(x, y) > \sigma \\ 1 & \text{otherwise} \end{cases} \]
or a Gaussian influence function
\[ f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}} \]
Overall density of the data space can be calculated as the sum of the influence functions of all data points.
Clusters can be determined mathematically by identifying density attractors. Density attractors are local maxima of the overall density function.
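The influence functions and the overall density translate almost literally into code. This is a small sketch for illustration; the function names, the parameter name `sigma` and the sample data are assumptions, and d(x, y) is taken to be the Euclidean distance.

```python
import math

def square_influence(x, y, sigma):
    """Square wave influence: 0 if d(x, y) > sigma, 1 otherwise."""
    return 0.0 if math.dist(x, y) > sigma else 1.0

def gaussian_influence(x, y, sigma):
    """Gaussian influence: exp(-d(x, y)^2 / (2 * sigma^2))."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma, influence=gaussian_influence):
    """Overall density at x: the sum of the influences of all data points on x."""
    return sum(influence(x, y, sigma) for y in data)

# Illustrative data set.
data = [(0, 0), (1, 0), (5, 5)]

print(density((0, 0), data, sigma=1.5, influence=square_influence))  # two points within 1.5
print(density((0, 0), data, sigma=1.0))                              # Gaussian density
print(density((3, 3), data, sigma=1.0))                              # lower: between the groups
```

With the square wave function the density is simply a count of the points within distance sigma, whereas the Gaussian function gives a smooth density surface.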
DENCLUE
Density attractor: a local maximum of the overall density function.
A point x is said to be density attracted to a density attractor x* if there exists a set of points x0, x1, …, xk such that x0 = x, xk = x*, and the gradient of the density function at xi-1 points in the direction of xi.
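The chain x0, x1, …, xk can be generated by a hill-climbing procedure that repeatedly takes a small step along the gradient of the density function. The sketch below assumes a Gaussian influence function and a fixed step size; the names `climb`, `step` and `max_iter` and the data are illustrative choices, not part of the slides.

```python
import math

def gaussian_density(x, data, sigma):
    """Sum of Gaussian influences of all data points on x."""
    return sum(math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2)) for y in data)

def density_gradient(x, data, sigma):
    """Gradient of the Gaussian density at x."""
    grad = [0.0] * len(x)
    for y in data:
        w = math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2)) / sigma ** 2
        for k in range(len(x)):
            grad[k] += w * (y[k] - x[k])
    return grad

def climb(x, data, sigma, step=0.05, max_iter=1000):
    """Follow the gradient from x until the density stops increasing."""
    x = list(x)
    for _ in range(max_iter):
        g = density_gradient(x, data, sigma)
        norm = math.hypot(*g)
        if norm == 0.0:
            break
        nxt = [xi + step * gi / norm for xi, gi in zip(x, g)]
        if gaussian_density(nxt, data, sigma) <= gaussian_density(x, data, sigma):
            break                         # x is (approximately) a local maximum
        x = nxt
    return tuple(x)

# Illustrative data: two groups of four points.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]

a = climb((0.2, 0.1), data, sigma=1.0)
b = climb((9.5, 9.0), data, sigma=1.0)
print(a, b)
```

Points whose climbs end at (approximately) the same attractor are density attracted to it, which is the basis for grouping them into one cluster.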
DENCLUE
Center defined cluster: for a density attractor x*, the subset of points that are density attracted by x*, where the density function at x* is no less than a threshold ξ. Other points are outliers.
Arbitrary shape cluster: for a set of density attractors, a set of subsets C, each density attracted to its respective attractor. There should be a path from each density attractor to another along which the density function value at each point is no less than ξ.
Grid Based Methods
Use a multi-resolution grid data structure.
Quantize space into a finite number of cells that form a grid structure.
Fast processing time.
STING
WaveCluster
CLIQUE: CLustering In QUEst
STING
STatistical Information Grid
Spatial area is divided into rectangular cells.
Several levels of cells, at different levels of resolution.
A high level cell is partitioned into several lower level cells.
Statistical attributes are stored in each cell: mean, maximum, minimum.
STING
Parameters of higher level cells are computed from those at lower levels.
To answer queries:
Identify the level at which to start.
Estimate each cell's relevance to the query.
Process only the relevant cells at the next lower level.
Continue down to the lowest level.
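The bottom-up computation of cell parameters and the top-down query processing can be sketched as follows. This is an illustration under assumptions rather than the STING implementation: each parent cell is assumed to have exactly four children, a point count is stored alongside mean, minimum and maximum so that means can be merged, and the query ("which lowest-level cells may contain a value of at least some threshold") and all names are made up for the example.

```python
import math
from dataclasses import dataclass

@dataclass
class CellStats:
    n: int        # number of objects in the cell
    mean: float
    lo: float     # minimum attribute value
    hi: float     # maximum attribute value

def leaf_stats(values):
    """Statistics of a lowest-level cell, computed from its attribute values."""
    if not values:
        return CellStats(0, 0.0, math.inf, -math.inf)
    return CellStats(len(values), sum(values) / len(values), min(values), max(values))

def merge(children):
    """Statistics of a parent cell, computed only from its children's statistics."""
    n = sum(c.n for c in children)
    mean = sum(c.n * c.mean for c in children) / n if n else 0.0
    return CellStats(n, mean, min(c.lo for c in children), max(c.hi for c in children))

def build_levels(leaves):
    """levels[0] is the finest grid, levels[-1] the single top-level cell."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        g, half = levels[-1], len(levels[-1]) // 2
        levels.append([[merge([g[2*i][2*j], g[2*i][2*j+1], g[2*i+1][2*j], g[2*i+1][2*j+1]])
                        for j in range(half)] for i in range(half)])
    return levels

def query(levels, threshold):
    """Lowest-level cells whose maximum is at least threshold, found top-down."""
    frontier = [(0, 0)]
    for lvl in range(len(levels) - 1, 0, -1):
        nxt = []
        for i, j in frontier:
            if levels[lvl][i][j].hi >= threshold:        # cell is relevant: descend
                nxt += [(2*i + a, 2*j + b) for a in (0, 1) for b in (0, 1)]
        frontier = nxt
    return [(i, j) for i, j in frontier if levels[0][i][j].hi >= threshold]

# Illustrative 4 x 4 grid; each entry lists the attribute values falling in that cell.
raw = [[[1.0, 2.0], [3.0], [], [4.0]],
       [[2.0], [], [5.0, 9.0], [1.0]],
       [[], [1.0], [2.0], []],
       [[3.0], [2.0, 2.0], [], [8.0]]]
levels = build_levels([[leaf_stats(c) for c in row] for row in raw])
root = levels[-1][0][0]
print(root.n, root.lo, root.hi)
print(query(levels, 8.0))
```

The grid statistics are built once, independently of any query; each query then touches only the cells that remain relevant at each level.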
STING
Computation is query independent.
Parallel processing is supported.
Data is processed in a single pass.
Quality depends on granularity.
WaveCluster
A multi-resolution clustering approach which applies a wavelet transform to the feature space.
A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands.
Both grid-based and density-based.
Input parameters:
the number of grid cells for each dimension,
the wavelet, and
the number of applications of the wavelet transform.
WaveCluster
Using the wavelet transform to find clusters:
Summarise the data by imposing a multidimensional grid structure onto the data space.
These multidimensional spatial data objects are represented in an n-dimensional feature space.
Apply the wavelet transform on the feature space to find the dense regions in the feature space.
Apply the wavelet transform multiple times, which results in clusters at different scales from fine to coarse.
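The steps above can be illustrated on a small two-dimensional grid of point counts. This is a deliberately simplified sketch, not the WaveCluster algorithm as published: it assumes the Haar wavelet, keeps only the low-frequency sub-band (computed here as a 2 x 2 block average, which equals the Haar approximation coefficient up to a constant factor), and finds clusters as 4-connected groups of cells above a hypothetical density threshold. All names and data are illustrative.

```python
def haar_approx(grid):
    """One Haar transform level, low-frequency sub-band only (2 x 2 block averages)."""
    h, w = len(grid) // 2, len(grid[0]) // 2
    return [[(grid[2*i][2*j] + grid[2*i][2*j+1]
              + grid[2*i+1][2*j] + grid[2*i+1][2*j+1]) / 4.0
             for j in range(w)] for i in range(h)]

def connected_components(grid, threshold):
    """Label 4-connected groups of cells with value >= threshold (0 = no cluster)."""
    h, w = len(grid), len(grid[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for i in range(h):
        for j in range(w):
            if grid[i][j] >= threshold and labels[i][j] == 0:
                count += 1
                labels[i][j] = count
                stack = [(i, j)]
                while stack:
                    a, b = stack.pop()
                    for da, db in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        u, v = a + da, b + db
                        if (0 <= u < h and 0 <= v < w
                                and grid[u][v] >= threshold and labels[u][v] == 0):
                            labels[u][v] = count
                            stack.append((u, v))
    return labels, count

# Illustrative quantized feature space: number of points per grid cell.
counts = [[5, 6, 0, 0, 0, 0, 0, 0],
          [6, 7, 0, 0, 0, 1, 0, 0],
          [0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 1, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 4, 5, 6, 5],
          [0, 0, 0, 0, 5, 6, 5, 4],
          [0, 0, 0, 0, 0, 0, 0, 0],
          [0, 1, 0, 0, 0, 0, 0, 0]]

fine = haar_approx(counts)            # first application: 4 x 4
coarse = haar_approx(fine)            # second application: 2 x 2
labels, n_clusters = connected_components(fine, threshold=1.0)
print(labels, n_clusters)
```

The isolated single points are averaged down below the threshold, which is how the low-pass filtering suppresses outliers, and applying `haar_approx` again gives a coarser view of the same data.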
[Figure: Quantization]
[Figure: Transformation]
WaveCluster
Reasons for using the wavelet transformation in clustering:
Unsupervised clustering: it uses filters to emphasize regions where points cluster, while simultaneously suppressing weaker information at their boundaries.
Effective removal of outliers
Multi-resolution
Cost efficiency
Major features:
Complexity O(N)
Detects arbitrarily shaped clusters at different scales
Not sensitive to noise, not sensitive to input order
Only applicable to low dimensional data
CLIQUE (CLustering In QUEst)
Automatically identifies subspaces of a high dimensional data space that allow better clustering than the original space.
CLIQUE can be considered as both density-based and grid-based.
It partitions each dimension into the same number of equal length intervals.
It partitions an m-dimensional data space into non-overlapping rectangular units.
A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter.
A cluster is a maximal set of connected dense units within a subspace.
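Finding dense units can be sketched for one- and two-dimensional subspaces. This is a simplified illustration, not the full CLIQUE algorithm: it stops at two dimensions, does not merge connected dense units into clusters, and assumes every dimension has the same value range. The names `n_intervals` and `density_threshold` are hypothetical stand-ins for the input parameters, and the data is made up.

```python
from collections import Counter
from itertools import combinations

def unit_index(value, lo, hi, n_intervals):
    """Interval number (0 .. n_intervals - 1) of value when [lo, hi] is cut into equal parts."""
    k = int((value - lo) / (hi - lo) * n_intervals)
    return min(k, n_intervals - 1)        # value == hi falls into the last interval

def dense_units(points, dims, lo, hi, n_intervals, density_threshold):
    """Units of the subspace `dims` holding more than density_threshold of all points."""
    counts = Counter(tuple(unit_index(p[d], lo, hi, n_intervals) for d in dims)
                     for p in points)
    n = len(points)
    return {u: c / n for u, c in counts.items() if c / n > density_threshold}

def dense_subspaces(points, lo, hi, n_intervals, density_threshold):
    """Dense units in every 1-D subspace and in every promising 2-D subspace."""
    m = len(points[0])
    one_d = {d: dense_units(points, (d,), lo, hi, n_intervals, density_threshold)
             for d in range(m)}
    two_d = {}
    for a, b in combinations(range(m), 2):
        # A 2-D unit can only be dense if its projections on both dimensions are dense,
        # so subspaces with a dimension that has no dense unit are skipped.
        if one_d[a] and one_d[b]:
            units = dense_units(points, (a, b), lo, hi, n_intervals, density_threshold)
            if units:
                two_d[(a, b)] = units
    return one_d, two_d

# Illustrative 3-D data: dimensions 0 and 1 form two groups, dimension 2 is spread out.
points = [(1.0, 1.5, 0.5), (1.2, 1.1, 2.5), (1.5, 1.8, 4.5), (1.1, 1.3, 6.5), (1.7, 1.6, 8.5),
          (7.0, 7.5, 1.5), (7.2, 7.1, 3.5), (7.5, 7.8, 5.5), (7.1, 7.3, 7.5), (7.7, 7.6, 9.5)]
one_d, two_d = dense_subspaces(points, lo=0.0, hi=10.0, n_intervals=5, density_threshold=0.3)
print(one_d)
print(two_d)
```

Here only the subspace formed by dimensions 0 and 1 contains dense units, and the two dense units found there are not adjacent, so they would form two separate clusters in that subspace.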