density-based clustering · main clustering approaches partitioning method →constructs partitions...
TRANSCRIPT
Density-Based ClusteringIzabela Moise, Evangelos Pournaras, Dirk Helbing
Izabela Moise, Evangelos Pournaras, Dirk Helbing 1
Reminder
Unsupervised data miningX Clustering→ k -Means
Izabela Moise, Evangelos Pournaras, Dirk Helbing 2
Main Clustering Approaches
• Partitioning method→ constructs partitions of data points→ evaluates the partitions by some criterion→ k -means, k -medoids
• Density-based method:→ based on connectivity and density functions→ DBSCAN, DJCluster
Izabela Moise, Evangelos Pournaras, Dirk Helbing 3
Density-Based Clustering
Izabela Moise, Evangelos Pournaras, Dirk Helbing 4
Density-Based Clustering
Density-Based Clustering
locates regions of high density that are separated from one anotherby regions of low density.
Izabela Moise, Evangelos Pournaras, Dirk Helbing 4
Main principles
• Two parameters:1. maximum radius of the neighbourhood→ Eps2. minimum number of points in an Eps neighbourhood of a point→ MinPts
• NEps(p) : {q ∈ D s.t . dist(p, q) ≤ Eps}• Key idea: the density of the neighbourhood has to exceed
some threshold.
• The shape of a neighbourhood depends on the dist function
Izabela Moise, Evangelos Pournaras, Dirk Helbing 5
Main principles
• Two parameters:1. maximum radius of the neighbourhood→ Eps2. minimum number of points in an Eps neighbourhood of a point→ MinPts
• NEps(p) : {q ∈ D s.t . dist(p, q) ≤ Eps}• Key idea: the density of the neighbourhood has to exceed
some threshold.
• The shape of a neighbourhood depends on the dist function
Izabela Moise, Evangelos Pournaras, Dirk Helbing 5
Main principles
• Two parameters:1. maximum radius of the neighbourhood→ Eps2. minimum number of points in an Eps neighbourhood of a point→ MinPts
• NEps(p) : {q ∈ D s.t . dist(p, q) ≤ Eps}• Key idea: the density of the neighbourhood has to exceed
some threshold.
• The shape of a neighbourhood depends on the dist function
Izabela Moise, Evangelos Pournaras, Dirk Helbing 5
Core, Border and Noise/Outlier
1
1Jing Gao, SUNY BuffaloIzabela Moise, Evangelos Pournaras, Dirk Helbing 6
Directly Density-Reachable
Directly density-reachable:→ A point p is directly density-reachable from a point q wrt. Eps,MinPts if:
1. p ∈ NEps(q) and
2. |NEps(q)| ≥ MinPts
Izabela Moise, Evangelos Pournaras, Dirk Helbing 7
Directly Density-Reachable
Directly density-reachable:→ A point p is directly density-reachable from a point q wrt. Eps,MinPts if:
1. p ∈ NEps(q) and
2. |NEps(q)| ≥ MinPts
Izabela Moise, Evangelos Pournaras, Dirk Helbing 7
Density-Reachable
• Density-reachable:→ A point p is density-reachable from a point q wrt. Eps,MinPts if there is a chain of points p1, ..., pn, withp1 = q, pn = p, s.t .pi+1 is directly density reachable from pi
• transitive but not symmetric
Izabela Moise, Evangelos Pournaras, Dirk Helbing 8
Density-Connected
Density-connected:→ A point p is density-connected from a point q wrt. Eps, MinPts ifthere is a point o s.t. p and q are density-reachable from o wrt. Epsand MinPts
Izabela Moise, Evangelos Pournaras, Dirk Helbing 9
Density-Connected
Density-connected:→ A point p is density-connected from a point q wrt. Eps, MinPts ifthere is a point o s.t. p and q are density-reachable from o wrt. Epsand MinPts→ symmetric
Izabela Moise, Evangelos Pournaras, Dirk Helbing 9
Density-Connected
Density-connected:→ A point p is density-connected from a point q wrt. Eps, MinPts ifthere is a point o s.t. p and q are density-reachable from o wrt. Epsand MinPts→ symmetric
Izabela Moise, Evangelos Pournaras, Dirk Helbing 9
DBSCAN - Density-Based Spatial Clustering of Applicationswith Noise
Izabela Moise, Evangelos Pournaras, Dirk Helbing 10
Main Principles
One of the most cited clustering algorithms
Main principle:
a cluster is defined as a maximal set of density-connected points.
• Discovers clusters of arbitrary shapes (spherical, elongated,linear), and noise
• Works with spatial datasets:→ geomarketing, tomography, satellite images
• Requires only two parameters (no prior knowledge of thenumber of clusters)
Izabela Moise, Evangelos Pournaras, Dirk Helbing 11
Definition: Cluster
2
2Erik Kropat, University of the Bundeswehr Munich
Izabela Moise, Evangelos Pournaras, Dirk Helbing 12
Definition: Noise
3
3Erik Kropat, University of the Bundeswehr Munich
Izabela Moise, Evangelos Pournaras, Dirk Helbing 13
The Algorithm
1. Randomly select a point p
2. Retrieve all points density-reachable from p wrt. Eps andMinPts
3. If p is a core point, a cluster is formed
4. If p is a border point, no points are density-reachable from p→visit the next data point
5. Continue the process until all points have been processed
Izabela Moise, Evangelos Pournaras, Dirk Helbing 14
Selecting Eps and MinPts
The two parameters can be determined by a heuristic
Observation:• For points in a cluster their k -th nearest neighbours are at
roughly the same distance.
• Noise points have the k -th nearest neighbour at fartherdistance.
Izabela Moise, Evangelos Pournaras, Dirk Helbing 15
4
4Erik Kropat, University of the Bundeswehr Munich
Izabela Moise, Evangelos Pournaras, Dirk Helbing 16
5
5Erik Kropat, University of the Bundeswehr Munich
Izabela Moise, Evangelos Pournaras, Dirk Helbing 17
6
6Erik Kropat, University of the Bundeswehr Munich
Izabela Moise, Evangelos Pournaras, Dirk Helbing 18
Pros and Cons
Pros:
X discovers clusters of arbitrary shapes
X handles noise
X needs density parameters as termination condition
Izabela Moise, Evangelos Pournaras, Dirk Helbing 19
Pros and Cons
Cons:
X cannot handle varying densities
X sensitive to parameters→ hard to determine the correct set ofparameters
X sampling affects density measures
Izabela Moise, Evangelos Pournaras, Dirk Helbing 20