Algorithm Design (12): Clustering Algorithms
Takashi Chikayama
School of Engineering
The University of Tokyo
Clustering
- Grouping objects that are similar
- Treating similar things the same is abstraction
- Whether items are similar depends on the viewpoint
  e.g., one 1000-yen bill and a set of two 500-yen coins are the same in monetary value, but not similar in weight
- A representative unsupervised machine learning method
  - Learning classification from non-classified data
Data Representation
Data items are represented by a set of features
- Quantitative features
  - Position, length, weight, time, temperature, frequency, price, …
  - Quantitative resemblance may be meaningful
  - Statistical indices such as averages are meaningful
- Qualitative features
  - Shape, nationality, profession, department, …
  - Only match or non-match may be meaningful
Forms of Clusters
- Non-hierarchical clustering
  - No structure except for a flat set of clusters
- Hierarchical clustering
  - Clusters are nested, i.e., a data item may belong to a cluster of clusters (of clusters of …)
Merits and Demerits of Hierarchical Clustering
○ Can adjust cluster sizes by controlling the number of hierarchy layers
○ Can be used with any similarity criterion
⇒ Wide application areas
× No definite criterion on how far clusters are to be divided
× Clusters once formed are not reconsidered
Two Principles of Hierarchical Clustering
- Bottom-up or agglomerative clustering
  - Form clusters by grouping similar items
  - Form upper-level clusters by grouping similar clusters
  - Repeat this procedure until all items are included in the top-level cluster
- Top-down or divisive clustering
  - Divide all items into groups of similar items
  - Divide the items in a group into subgroups
  - Repeat this procedure until all groups consist of a single item
Agglomerative Clustering
- Group similar objects into clusters
- Group similar clusters into higher-level clusters
Similarity Indicators
Some indication of similarity is required
- Usually a distance metric is used that quantitatively tells how different two items are
- The metric d should satisfy
  1. d(x, y) = 0 iff x = y ; identity of indiscernibles
  2. d(x, y) = d(y, x) ; symmetry
  3. d(x, y) + d(y, z) ≥ d(x, z) ; triangle inequality
- Non-negativity of the metric follows from these axioms:
  2 d(x, y) = d(x, y) + d(y, x) ≥ d(x, x) = 0
  ∴ d(x, y) ≥ 0
Representative Metrics
- Euclidean distance: √( Σ_k (x_k − y_k)² )
- Manhattan distance: Σ_k |x_k − y_k|
- Mahalanobis distance: Euclidean distance adjusted with correlations of different features
- Hamming distance: the number of differing features; applicable to non-quantitative feature spaces
- Edit distance (aka Levenshtein distance): the minimum number of edit operations needed to transform one item into the other
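The metrics above can be sketched in a few lines of Python (pure-Python illustrations; the function names are my own, not from any library):

```python
# Minimal sketches of the distance metrics listed above.
from math import sqrt

def euclidean(x, y):
    # square root of the sum of squared per-feature differences
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # sum of absolute per-feature differences
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    # number of positions where the features differ
    return sum(1 for a, b in zip(x, y) if a != b)

def edit_distance(s, t):
    # Levenshtein distance by dynamic programming, row by row
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (a != b)))   # substitution
        prev = cur
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` is 3: two substitutions and one insertion.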
Normalization
Normalization is often needed before defining distances on quantitative features
- Different features may have different units
  E.g., statures (heights) and weights of humans cannot be treated the same
- Different features may have different ranges of variation
  E.g., the tallest man is not a hundred times taller than the shortest, but the richest man may have a million (or more) times more money than the poorest
Normalization into Standard Scores
Transforming the values of each feature into standard scores is a widely used normalization method
- Average: μ = (1/N) Σ_{i=1..N} x_i
- Variance: V_x = (1/N) Σ_{i=1..N} (x_i − μ)²
- Standard deviation: σ = √( (1/N) Σ_{i=1..N} (x_i − μ)² )
- Standard score: T_i = (x_i − μ) / σ
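The formulas above translate directly into a short Python sketch (the function name is my own; no external libraries assumed):

```python
# Standard-score (z-score) normalization following the slide's formulas.
from math import sqrt

def standard_scores(xs):
    n = len(xs)
    mu = sum(xs) / n                              # average
    var = sum((x - mu) ** 2 for x in xs) / n      # variance
    sigma = sqrt(var)                             # standard deviation
    return [(x - mu) / sigma for x in xs]         # standard scores
```

After this transform, each feature has average 0 and standard deviation 1, so features with different units can be compared.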
Non-Linear Transform
- The transform into standard scores is linear
- A non-linear transform (such as a logarithmic one), or using ranking instead of raw values, is more appropriate in certain cases
[Figure: relative frequency distribution of households by income class — 2010 Comprehensive Survey of Living Conditions, Ministry of Health, Labour and Welfare]
With a linear transform, families with 3M yen of income are assumed to be more similar to families with no income than to families with 7M yen of income. Using ranking would be more appropriate.
Agglomerative Clustering: Procedure
1. Every item is a cluster by itself
2. The closest pair of clusters forms an upper-level cluster
3. Repeat until a cluster covering all items is formed
[Figure: dendrogram over items a–g; pairs merge into bc, de, and fg, then abc, abcde, and finally abcdefg]
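The three steps above can be sketched directly in Python. This sketch uses single linkage (minimum item-to-item distance, defined on the next slide) as the inter-cluster metric; the function name and the `target` stopping parameter are my own additions for illustration:

```python
# A minimal agglomerative clustering sketch with single linkage.
def agglomerate(items, dist, target=1):
    clusters = [[x] for x in items]          # 1. every item is a cluster
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: minimum distance between member items
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best                       # 2. merge the closest pair
        clusters[i] += clusters.pop(j)
    return clusters                          # 3. repeated until `target` clusters remain
```

With `target=1` it builds the full hierarchy implicitly; stopping earlier yields a flat clustering at that level.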
Metrics Between Clusters
- Similarity metrics between clusters should also satisfy the distance axioms
- Can be defined via the distances between items of the different clusters:
  - Minimum: single linkage clustering
  - Maximum: complete linkage clustering
  - Average: average linkage clustering
- Can be defined as the distance between representative items
  - Median, centroid (average of items), etc.
Ward’s Minimum Variance Criterion
Joe Ward, 1963
"Choose the pair of clusters that gives, when merged, the minimum increase of the sum of the intra-cluster variances"
- Less susceptible to exceptional items
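Ward's increase in the sum of squared errors when merging clusters A and B has a standard closed form, Δ(A, B) = |A||B| / (|A| + |B|) · ‖centroid(A) − centroid(B)‖². A sketch (function names are mine; items are assumed to be feature vectors given as lists):

```python
# Ward's criterion: increase in total within-cluster SSE when merging a and b.
def centroid(cluster):
    n = len(cluster)
    return [sum(x[k] for x in cluster) / n for k in range(len(cluster[0]))]

def ward_increase(a, b):
    ca, cb = centroid(a), centroid(b)
    d2 = sum((p - q) ** 2 for p, q in zip(ca, cb))     # squared centroid distance
    return len(a) * len(b) / (len(a) + len(b)) * d2
```

Agglomerative clustering under Ward's criterion merges, at each step, the pair with the smallest `ward_increase`.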
Hierarchical Divisive Clustering
- Divide items into classes based on their similarity
- Divide the classes further
- Repeat until a certain criterion is satisfied
Repeated Bisection
Repetitively divide the item set into two
- For example, by applying the k-means algorithm (explained later) with k = 2 repetitively
- Which cluster to divide:
  - The one with the largest number of items
  - The one with the largest radius
  - The one with the largest sum or average of variance
- When to stop dividing:
  - When the largest cluster has fewer items than a threshold
  - When the largest radius or variance is less than a threshold
Merits and Demerits of Hierarchical Divisive Clustering
- No prior assumption on the number of clusters
  - Some criteria on when to stop dividing are needed, though
- Can be parallelized easily
  - Divisions of different clusters are independent
- Once a division is decided, it will not be changed
Non-Hierarchical Divisive Clustering
- Decide the number of clusters first
- Define the cost function of the clustering
  - For example, the sum of the variances of the clusters
The clustering with the minimum cost is desirable, but finding it is difficult due to computational costs
⇒ Iterative improvement methods are often used
- Decide on some initial clustering
- Move items from one cluster to another to lower the cost
The solution is suboptimal but often acceptable
Stochastic Data Models
- Assume that the data items are samples from a number of different stochastic models
- Clustering is then inferring which model each item belongs to
- Inference is made by maximizing the likelihood of observing the actual data
- The steps usually taken are:
  - Infer the parameters of the stochastic models, and
  - Infer which of the models each item belongs to
Mixture Distribution Model
- Items are assumed to be samples of different probability distributions
- Inference of the parameters of the original distributions is attempted
Stochastic Data Models
- Let τ_j denote the probability of an item belonging to model j
- Let Pr(x_i | C_j) denote the probability that item x_i is observed when it belongs to model j
- The likelihood that the set of items X is observed is:
  L(X | C) = Π_{i,j} τ_j Pr(x_i | C_j)
- Model parameters are estimated to maximize this
- Items are clustered so that the likelihood of each item is maximized under the estimated parameters
Expectation Maximization (EM) Algorithm
Arthur Dempster, et al. 1977
An algorithm to maximize the likelihood of probabilistic models with hidden parameters
- Two different sets of values have to be obtained:
  - The set of hidden parameters of the distributions
  - Which items are samples of which distribution
- The algorithm repetitively obtains better estimates of these two in turn
- It has many application areas other than clustering
EM Algorithm: Procedure
θ: set of parameters of the distributions
X: observed set of data
Z: some unknown data about the nature of X
1. An initial parameter set θ is somehow given
2. (E step) Under the parameter set θ and the observed X, find the Z that maximizes the likelihood
3. (M step) Using the Z obtained in step 2 and X, compute a better estimate of θ
4. Repeat from step 2 until some convergence criterion is reached
Clustering with EM Algorithm
- First, decide the number of clusters and the shape of the distributions (Gaussian, for example)
1. Decide the initial θ, the parameters of the distributions
   For Gaussian: averages and variances
2. Compute the clustering Z that gives the largest likelihood under θ
3. Compute the θ that gives the largest likelihood for the actual data set X and its clustering Z
4. If some convergence condition is not yet satisfied, go back to step 2 with this new θ
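The four steps above can be sketched for a 1-D two-component Gaussian mixture. This sketch uses hard assignments in the E step (the "classification EM" simplification; the usual EM uses soft responsibilities), and the function names and fixed iteration count are my own:

```python
# EM-style clustering sketch: 1-D, two Gaussian components, hard E step.
from math import exp, sqrt, pi

def gauss(x, mu, var):
    # Gaussian density with mean mu and variance var
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def em_cluster(xs, mu, iters=20):
    mu = list(mu)               # 1. initial means are given; unit variances assumed
    var = [1.0, 1.0]
    z = [0] * len(xs)
    for _ in range(iters):
        # 2. E step: assign each item to the more likely component
        z = [0 if gauss(x, mu[0], var[0]) >= gauss(x, mu[1], var[1]) else 1
             for x in xs]
        # 3. M step: re-estimate means and variances from the assignment
        for j in (0, 1):
            pts = [x for x, zj in zip(xs, z) if zj == j] or [mu[j]]
            mu[j] = sum(pts) / len(pts)
            var[j] = max(sum((x - mu[j]) ** 2 for x in pts) / len(pts), 1e-6)
    return z, mu                # 4. here we simply run a fixed number of rounds
```

As the slides note, the result depends on the initial θ; well-separated initial means converge quickly.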
Merits and Demerits of EM Algorithm
- No guarantee of convergence to the parameter set giving the maximum likelihood
  - May converge to a local maximum
  - The result depends on the initial value of θ
  - Starting with different initial values may alleviate the problem
- Advantageous for simple distributions, as each step can have low computational cost; the Gaussian mixture is typical
Clustering with EM Algorithm: An Example
[Figure: eruption data of the Old Faithful Geyser (Yellowstone National Park)]
k-Means Clustering Algorithm
A non-hierarchical clustering algorithm
1. Start with k random clusters
2. Compute the centroids of clusters
3. Make each item belong to the cluster with the nearest centroid
4. Repeat from step 2 until some criterion is satisfied
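The four steps of k-means can be sketched as follows for 1-D items (the function name, the seed parameter, and the re-seeding of empty clusters with a random item are my own illustrative choices):

```python
# A k-means sketch following the steps above, for 1-D items.
import random

def kmeans(items, k, iters=100, seed=0):
    rng = random.Random(seed)
    assign = [rng.randrange(k) for _ in items]       # 1. k random clusters
    for _ in range(iters):
        # 2. centroids of the current clusters
        cents = []
        for j in range(k):
            pts = [x for x, a in zip(items, assign) if a == j]
            # an empty cluster gets a random item as its centroid (one
            # simple strategy; the next slide discusses alternatives)
            cents.append(sum(pts) / len(pts) if pts else rng.choice(items))
        # 3. reassign each item to the cluster with the nearest centroid
        new = [min(range(k), key=lambda j: (x - cents[j]) ** 2) for x in items]
        if new == assign:                            # 4. stop at a fixed point
            return assign, cents
        assign = new
    return assign, cents
```

On well-separated data the procedure converges in a few rounds; as noted below, the result in general depends on the initial clustering.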
Handling Empty Clusters
When empty clusters appear:
- Create a new cluster consisting of a single item
- There are several strategies for choosing the item:
  - The item furthest from the centroid of its own cluster
  - First choose the cluster with the largest sum of squared errors, then choose its item furthest from the centroid
A Fuzzy Version of k-Means: Fuzzy C-Means
- Each data item belongs fuzzily to multiple clusters
- Item x_i belongs to cluster k with weight u_k(i), where Σ_k u_k(i) = 1
- Centroids are computed with the weights u_k(i):
  c_k = Σ_i x_i·u_k(i) / Σ_i u_k(i)
- Given the new centroids, the u_k(i) are updated to be inversely proportional to the distance from the new centroids
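One round of the update above can be sketched for 1-D items. The weights here are simply inverse distances normalized to sum to 1, as the slide states; the common fuzzy C-means formulation adds a fuzzifier exponent m, omitted here, and the function name and `eps` guard against division by zero are my own:

```python
# One fuzzy C-means round: membership weights, then weighted centroids.
def fcm_step(items, cents, eps=1e-9):
    # membership weights u_k(i), inversely proportional to distance,
    # normalized so that each item's weights sum to 1
    u = []
    for x in items:
        inv = [1.0 / (abs(x - c) + eps) for c in cents]
        s = sum(inv)
        u.append([w / s for w in inv])
    # weighted centroids: c_k = sum_i x_i * u_k(i) / sum_i u_k(i)
    new_cents = []
    for k in range(len(cents)):
        num = sum(x * u[i][k] for i, x in enumerate(items))
        den = sum(u[i][k] for i in range(len(items)))
        new_cents.append(num / den)
    return u, new_cents
```

Alternating these two computations until the centroids stop moving gives the fuzzy counterpart of the k-means loop.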
Merits and Demerits of k-Means Clustering
- Simple and easily parallelized
- The result heavily depends on the initial clustering
  Likely to converge to a local optimum, which may be quite far from the global optimum
- No good criterion for k, the number of clusters
- No attention paid to the balance of cluster sizes
- Heavily influenced by exceptional items
- Only applicable to quantitative data
  - For example, if a feature is nationality, there is no such thing as the "mean of nationalities"
k-Medoid Algorithm
Medoid: the item representing a cluster
e.g., the item with the minimum sum of distances to the other items in the cluster
The medoid is used in place of the centroid of k-means
- Can be used for non-quantitative data
- Less influenced by exceptional items
[Figure: an exceptional item pulls the centroid away from the group, while the medoid stays among the ordinary items]
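The medoid definition above is a one-liner in Python (function name is mine; any distance function works, which is what makes the method applicable to non-quantitative data):

```python
# The medoid: the item with the minimum sum of distances to the others.
def medoid(cluster, dist):
    return min(cluster, key=lambda x: sum(dist(x, y) for y in cluster))
```

For the cluster {0, 1, 2, 100} with absolute difference as the distance, the medoid is 1, while the centroid 25.75 is dragged far off by the exceptional item 100.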
How to Decide the Number of Clusters
Many proposals, but no definitive one
- An empirical rule of thumb: k = √(n/2)
- Elbow criterion (aka knee criterion)
  - At some point, increasing the number of clusters no longer increases the information given by which cluster an item belongs to
[Figure: ratio of variance explained plotted against the number of clusters, with an "elbow" where the curve flattens]
Shapes of Clusters
- Clusters may have a variety of shapes
- k-means and similar methods fit only clusters with elliptical shapes
[Figure: ring-shaped data; a ring is the most appropriate cluster shape in this case]
Density-Based Clustering
- Repeatedly add items in the neighborhood to clusters
⇒ Clusters grow to follow high-density areas
○ Clusters can have arbitrary shapes
○ Exceptional items do not disturb the clustering
  ∵ As exceptional items are not in the neighborhood of any other item, they do not participate in cluster formation
Density-Based Clustering: DBSCAN
Martin Ester et al., 1996
- ε-neighborhood: the set of items within distance ε
- Core object: an item with more items in its ε-neighborhood than a threshold MinPts
- An item y is said to be density-reachable from a core object x if there is a sequence of core objects, starting at x and ending in the ε-neighborhood of y, in which adjacent items are in each other's ε-neighborhood
- Two items x and y are said to be density-connected if both are density-reachable from the same core object
- A cluster consists of density-connected items
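The definitions above can be turned into a compact sketch. This version uses an O(n²) neighborhood search rather than the spatial index that gives DBSCAN its O(n log n) behavior, counts an item in its own ε-neighborhood, and labels noise as −1; the function name and these conventions are my own:

```python
# A compact DBSCAN sketch: eps-neighborhoods, core objects, cluster growth.
def dbscan(items, dist, eps, min_pts):
    n = len(items)
    # naive O(n^2) eps-neighborhood computation (each item includes itself)
    nbrs = [[j for j in range(n) if dist(items[i], items[j]) <= eps]
            for i in range(n)]
    core = [len(nbrs[i]) >= min_pts for i in range(n)]
    label = [-1] * n                     # -1 = noise / unassigned
    cid = 0
    for i in range(n):
        if label[i] != -1 or not core[i]:
            continue
        label[i] = cid                   # start a new cluster at a core object
        frontier = list(nbrs[i])
        while frontier:                  # add all density-reachable items
            j = frontier.pop()
            if label[j] != -1:
                continue
            label[j] = cid
            if core[j]:                  # clusters expand only through core objects
                frontier.extend(nbrs[j])
        cid += 1
    return label
```

Items that end up labeled −1 are exactly the exceptional items that belong to no ε-neighborhood of a core object, matching the noise-tolerance claim above.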
Clustering with DBSCAN
[Figure: items with their ε-neighborhoods; core objects marked, with MinPts = 2]
Merits and Demerits of DBSCAN
○ Relatively small computational cost
  With low dimensionality, O(n log n)
⇒ Can be applied to large-scale data
× No good criteria for deciding the algorithm parameters ε and MinPts
⇒ The characteristics of the analyzed data have to be known to a certain extent beforehand
× Two overlapping groups cannot be separated
× No good mathematical model that grounds the algorithm
Measures of Cluster Validity
- External: based on information supplied externally
  - Data labels given to items, for example
  - Useful for evaluating clustering algorithms
- Internal: using only the analyzed data set itself
  - Sum of squared errors
  - The ratio of pairs of nearest items belonging to the same cluster
In general, evaluating the results of clustering is a very difficult task
Summary of This Course
- Computational complexity theory
  - Orders of computation: Ω, Θ, O
  - Amortized complexity
- A variety of algorithms for different problems
  - Combinatorial optimization
  - Search and compaction of symbol strings
  - Memory management
  - Graph algorithms
  - Computational geometry
  - Clustering
Many other areas were not touched in this class
Important Points
- Algorithm selection dominates the computation cost
  - Orders of complexity are of primary importance
  - Constant factors are only secondary
- Strategies:
  - Using appropriate data structures
  - Preprocessing for efficiency
  - Concentrating on differences
  - Exploiting the characteristics of data
  - Mathematical models
  - Conversion to problems with efficient algorithms
Why the Order is Important
- Problem sizes change over time
  - Performance improvements (Moore's law): 4 times in 3 years → 1024 times in 15 years
  - Software is used longer than you may think
    The same software may be used for tens of years
E.g., two algorithms A: O(n log n); B: O(n²)
- B runs ten times faster than A now
- 15 years later, the data becomes 1000 times larger
- A needs about ten thousand times more computation
- B needs one million times more computation
⇒ Relative to B, A becomes faster by two orders of magnitude: from ten times slower to ten times faster
Selecting Data Structures
Data structures should be selected depending on which kinds of accesses are required to be efficient
- Arrays are compact and allow random access, but extension is not easy?
⇒ Array extension by doubling has amortized constant cost per insertion
- Structures linked with pointers are flexible, but no random access is possible?
⇒ Providing additional index arrays may allow the required random access
Preprocessing May Make Algorithms More Efficient
- Values used repeatedly can be stored in memory
  e.g., dynamic programming
- Information can be represented in a variety of forms
  - Tables of numbers, added flags, …
  - As the structures themselves, such as in graphs
- The cost of preprocessing should also be considered
  - Isn't the preprocessing cost too high?
  - Is the result of the preprocessing used repeatedly?
Focusing on Differences May Lead to Efficiency
Finding all the possible plays at a given board game position
- Computing them from scratch each turn is costly
A slight change of the game position makes only a slight change to the set of possible moves
⇒ Handle the changes!
- For the initial position, compute all the possible moves
- With each move of the game, compute the plays that the move enables and disables
⇒ This may be much faster than computing all the move candidates from scratch each time
Pay Attention to Data
If some cases are known to be more frequent, improvements should be made for those cases
- How frequent are they?
  Even if the 99% case is made a hundred times faster, if the 1% case is made a hundred times slower, the whole system becomes slower
- Are the frequent cases always frequent?
  There might be critical situations where the frequencies are quite different
Converting to a Problem with Efficient Algorithms
Computing a specific item of a sequence defined by a recurrence formula:
  x_0 = c_0,  x_1 = c_1,  x_{n+2} = a·x_{n+1} + b·x_n + d
For the Fibonacci sequence:
  x_0 = 1,  x_1 = 1,  x_{n+2} = x_{n+1} + x_n
How can we compute the n'th item x_n?
- Naïve application of the definition leads to an algorithm of exponential order
- Computing the items one after another from the beginning, this can be improved to O(n)
- Is this the best possible?
Solving Recurrence Formulae
The recurrence formula of the last page can be expressed in the form of a matrix multiplication:

  ( x_{n+2} )   ( a  b  d )   ( x_{n+1} )
  ( x_{n+1} ) = ( 1  0  0 ) × (   x_n   )
  (    1    )   ( 0  0  1 )   (    1    )

With this and the values x_0 = c_0 and x_1 = c_1, we have

  ( x_{n+1} )   ( a  b  d )^n   ( c_1 )
  (   x_n   ) = ( 1  0  0 )   × ( c_0 )
  (    1    )   ( 0  0  1 )     (  1  )

which may look like it only adds complication, but …
Computing Powers Efficiently
m^n can be computed with O(log n) multiplications
- If n is a power of 2, the sequence of multiplications m, m^2 = m × m, m^4 = m^2 × m^2, … computes m^n with O(log n) multiplications
- Any n can be expressed as a sum of powers of 2
  For example, m^105 = m^64 × m^32 × m^8 × m^1
- Using this, for arbitrary n, the required number of multiplications can be made O(log n)
The same technique can be used for powers of matrices
⇒ Computing the n'th item of a sequence defined by a recurrence formula has complexity O(log n)
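Combining the matrix form of the previous slide with repeated squaring gives the O(log n) method. A sketch for the Fibonacci case x_0 = x_1 = 1, using a 2×2 matrix since the recurrence has no constant term (function names are mine):

```python
# The n'th item of a linear recurrence via fast matrix exponentiation.
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(M, n):
    # repeated squaring: O(log n) matrix multiplications
    size = len(M)
    R = [[int(i == j) for j in range(size)] for i in range(size)]  # identity
    while n > 0:
        if n & 1:
            R = mat_mul(R, M)
        M = mat_mul(M, M)
        n >>= 1
    return R

def fib(n):
    # [x_{n+1}, x_n] = [[1,1],[1,0]]^n applied to [x_1, x_0] = [1, 1]
    P = mat_pow([[1, 1], [1, 0]], n)
    return P[1][0] * 1 + P[1][1] * 1   # second row yields x_n
```

Each doubling of n adds only one or two matrix multiplications, so even very large indices are cheap (ignoring the growth of the numbers themselves).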
Shortcomings of Traditional Complexity Theories
- Complexity theories assume random access memory
- In reality, computer memory is not flat
  - Clock speeds and circuit density are increasing, while the total physical sizes of high-end computers are getting larger
  - Relative memory access latency is thus increasing
Memory Hierarchy to Solve the Problem
A combination of memories with
- smaller capacity and faster access, and
- larger capacity and slower access
Frequently accessed data are stored in the faster memory
⇒ With temporal and spatial access locality, the system is expected to run fast
The behavior cannot be analyzed precisely with theories based on random access memory
[Figure: memory hierarchy from the CPU through primary, secondary, and tertiary caches to main memory and disks; speed decreases and capacity increases down the hierarchy]
Working Set Model
Peter Denning, end of the 1960's
A correction to the complexity theory to reflect the memory hierarchy
- The working set is the set of memory areas accessed in a short period of time
- If its size exceeds the size of a certain layer of the memory hierarchy, the program suddenly runs much slower
- Complexity analyses based on random access are only applicable as long as the working set size fits within a certain memory layer
Shortcomings of the Working Set Model
The validity of the analysis is limited by factors of the concrete system it is applied to:
- The memory structure of the computer system
- The amount of data to be processed
⇒ When the situation changes, the analysis may become invalid
An important subject to be solved
Course Project Subject
- Choose an arbitrary problem to which algorithms introduced in this course are applicable
- Choose sets of input data for performance evaluation
- Write programs using two or more algorithms for the problem, and compare their performance
  - Compare naïve and more sophisticated algorithms
  - Compare results on multiple sets of data, differing in size and/or nature
Project Report
Your report should include the following
1. Description of the problem
2. Description of the two or more algorithms tried
   One may be a naïve algorithm
3. Summary of the data used for performance evaluation
4. Performance results
5. Reasoning about the differences (or similarities) in performance
Appendix: your programs
Submitting Your Reports
Send it via email to the following address
- The subject should include the course title "Algorithm Design"
- Write either in English or in Japanese
- Write your name, lab, department, and student ID number at the top of the email text
- The report should be attached to the email; its format should be PDF, RTF, or MS Word
- Graphs showing performance are recommended
- Deadline: Feb 10th, 9:00 PM JST