Post on 29-Jan-2016
by Timofey Shulepov
Clustering Algorithms
Clustering - main features
Clustering – a data mining technique
Def.: classification of objects into sets by the traits common among the objects within a set, but not shared between different sets.
Usage: Statistical Data Analysis, Machine Learning, Data Mining, Pattern Recognition, Image Analysis, Bioinformatics
Types of clustering
Hierarchical – finding new clusters using previously found ones
Partitional – finding all clusters at once
Self-Organizing Maps
Hybrids (incremental)
Concept of distance measure
Distance measure – determines how the similarity of two elements is calculated.
Similarity is expressed in terms of a distance function.
Distance functions vary significantly for interval-scaled, categorical, and other variables.
Examples of distance functions: Euclidean distance, Manhattan distance, etc.
Distance functions, in more detail
Euclidean distance – aka "as the crow flies", or 2-norm distance. The most commonly used one, the usually implied distance measurement (a ruler between two dots).
Manhattan distance – aka "taxicab" or 1-norm distance. Going from A to B via street intersections (sort of).
Maximum norm – aka Chebyshev or infinity-norm distance: the largest difference along any single coordinate.
Mahalanobis distance – similar to Euclidean, but it accounts for the correlations within the data set, and is scale-invariant.
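The three simpler measures above can be sketched in a few lines of plain Python (function names are my own; Mahalanobis is omitted since it additionally needs the inverse covariance matrix of the data set):

```python
import math

def euclidean(a, b):
    # 2-norm: straight-line ("as the crow flies") distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # 1-norm: sum of per-coordinate differences ("taxicab" routing).
    return sum(abs(x - y) for x, y in zip(a, b))

def maximum_norm(a, b):
    # Infinity-norm: the largest difference along any single coordinate.
    return max(abs(x - y) for x, y in zip(a, b))

print(euclidean((0, 0), (3, 4)))     # 5.0
print(manhattan((0, 0), (3, 4)))     # 7
print(maximum_norm((0, 0), (3, 4)))  # 4
```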
Garcia
Hierarchical Clustering
Hierarchical clustering
Result: given the input set S, the goal is to produce a hierarchy (dendrogram) in which nodes represent subsets of S, simulating the structure found in S.
Can be agglomerative or divisive:
Agglomerative – "bottom-up": begin with each element as a separate cluster, and merge upward.
Divisive – "top-down": begin with one large set, and divide it into smaller sets.
Agglomerative Hierarchical Clustering
1. Place each instance of S in its own cluster (singleton), creating the list of clusters L (initially, the leaves of T): L = S1, S2, S3, ..., Sn.
2. Compute a merging cost function between every pair of elements in L to find the two closest clusters {Si, Sj}, which will be the cheapest pair to merge.
3. Remove Si and Sj from L.
4. Merge Si and Sj to create a new internal node Sij in T, which will be the parent of Si and Sj in the resulting tree.
5. Repeat from (2) until only one set remains.
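Steps 1-5 above can be sketched in plain Python; as an illustration I use single-linkage (distance between the closest pair of points) as the merging cost function, which is one common choice, not one the slides prescribe:

```python
import math

def dist(a, b):
    # Euclidean distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2):
    # Merging cost: distance between the two closest points of c1 and c2.
    return min(dist(p, q) for p in c1 for q in c2)

def agglomerative(points):
    # Step 1: each point starts in its own singleton cluster (leaves of T).
    clusters = [[p] for p in points]
    merges = []  # record of the dendrogram's internal nodes, in merge order
    while len(clusters) > 1:
        # Step 2: find the cheapest pair {Si, Sj} to merge.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        # Steps 3-4: remove Si and Sj from L, add the merged node Sij.
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        # Step 5: the loop repeats until one cluster remains.
    return merges

tree = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)])
print(len(tree))  # 3 merges for 4 points; tree[-1] contains all points
```

This naive version recomputes all pairwise costs each round; real implementations cache a distance matrix instead.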
K-Clustering
K-clustering algorithm
Result: given the input set S and a fixed integer k, a partition of S into k subsets must be returned.
K-means clustering is the most common partitioning algorithm.
K-clustering algo cont'd
1. Select k initial cluster centroids: c1, c2, c3, ..., ck.
2. Assign each instance x in S to the cluster whose centroid is nearest to x.
3. For each cluster, re-compute its centroid based on the elements it contains.
4. Go to (2) until convergence is achieved.
Garcia
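A minimal sketch of these four steps in plain Python; initializing the centroids by sampling k data points is one common choice the slides leave open, and all names are my own:

```python
import math
import random

def dist(a, b):
    # Euclidean distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=100):
    # Step 1: pick k initial centroids (here: k distinct sample points).
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 2: assign each instance to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop once the centroids no longer move (convergence).
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, clusters = kmeans(points, 2)
print(sorted(centroids))  # [(0.0, 0.5), (10.0, 10.5)]
```

Note that k-means converges to a local optimum, so the result can depend on the initial centroids.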
Self-Organizing Maps
Def.: a group of several connected nodes mapped into a k-dimensional space following some specific geometrical topology (grids, rings, lines, ...). Initially placed at random and iteratively adjusted according to the distribution of examples (input) along the k-dimensional space.
Garcia
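The slides give no algorithm for SOMs, so here is an illustrative sketch only, a minimal 1-D map (nodes on a line), with all parameter names and decay schedules being my own assumptions: nodes start at random, and each iteration pulls the best-matching node and its grid neighbours toward a sampled example, with influence shrinking over grid distance and over time.

```python
import math
import random

def som_train(data, n_nodes=4, epochs=200, lr=0.5, radius=1.0):
    # Minimal 1-D self-organizing map: the topology is a line of nodes.
    random.seed(0)  # fixed seed so the sketch is reproducible
    dim = len(data[0])
    # Nodes start at random positions in the input space.
    nodes = [[random.random() for _ in range(dim)] for _ in range(n_nodes)]
    for t in range(epochs):
        frac = t / epochs  # training progress, 0 -> 1
        x = random.choice(data)
        # Best-matching unit: the node closest to the example.
        bmu = min(range(n_nodes),
                  key=lambda i: sum((nodes[i][d] - x[d]) ** 2 for d in range(dim)))
        # Pull the BMU and its grid neighbours toward the example; influence
        # decays with grid distance and the radius shrinks over time.
        for i in range(n_nodes):
            g = abs(i - bmu)  # distance along the 1-D grid topology
            sigma = radius * (1 - frac) + 1e-3
            influence = math.exp(-g * g / (2 * sigma * sigma))
            for d in range(dim):
                nodes[i][d] += lr * (1 - frac) * influence * (x[d] - nodes[i][d])
    return nodes

nodes = som_train([(0.2, 0.2), (0.8, 0.8)], n_nodes=4)
```

After training, the node positions approximate the distribution of the examples while preserving the chosen grid topology.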
Annotated Bibliography
Wikipedia – http://en.wikipedia.org/wiki/Data_clustering#Types_of_clustering
Enrique Blanco Garcia – http://genome.imim.es/~eblanco/seminars/docs/clustering/index_types.html#hierarchy