Algorithm Design (12): Clustering Algorithms
Takashi Chikayama
School of Engineering
The University of Tokyo
Clustering
- Grouping objects that are similar
- Treating similar things the same is abstraction
- Whether items are similar depends on the viewpoint
  e.g., one 1000-yen bill and a set of two 500-yen coins are the same in monetary value, but not similar in weight
- A representative unsupervised machine learning method
  - Learning classification from non-classified data
Data Representation
Data items are represented by a set of features
- Quantitative features
  - Position, length, weight, time, temperature, frequency, price, …
  - Quantitative resemblance may be meaningful
  - Statistical indices such as averages are meaningful
- Qualitative features
  - Shape, nationality, profession, department, …
  - Only match or non-match may be meaningful
Forms of Clusters
- Non-hierarchical clustering
  - No structure except for a flat set of clusters
- Hierarchical clustering
  - Clusters are nested, i.e., a data item may belong to a cluster of clusters (of clusters of …)
Merits and Demerits of Hierarchical Clustering
○ Can adjust cluster sizes by controlling the number of hierarchy layers
○ Can be used with any similarity criterion
⇒ Wide application areas
× No definite criterion on how far clusters are to be divided
× Clusters once formed are not reconsidered
Two Principles of Hierarchical Clustering
- Bottom-up or agglomerative clustering
  - Form clusters by grouping similar items
  - Form upper-level clusters by grouping similar clusters
  - Repeat this procedure until all items are included in the top-level cluster
- Top-down or divisive clustering
  - Divide all items into groups of similar items
  - Divide the items in a group into subgroups
  - Repeat this procedure until all groups consist of a single item
Agglomerative Clustering
- Group similar objects into clusters
- Group similar clusters into higher-level clusters
Similarity Indicators
Some indication of similarity is required
- Usually a distance metric is used that quantitatively tells how different two items are
- The metric d should satisfy
  1. d(x, y) = 0 iff x = y ; identity of indiscernibles
  2. d(x, y) = d(y, x) ; symmetry
  3. d(x, y) + d(y, z) ≥ d(x, z) ; triangle inequality
- Non-negativity of the metric follows from these axioms:
  2 d(x, y) = d(x, y) + d(y, x) ≥ d(x, x) = 0
  ∴ d(x, y) ≥ 0
Representative Metrics
- Euclidean distance: √( Σ_k (x_k − y_k)² )
- Manhattan distance: Σ_k |x_k − y_k|
- Mahalanobis distance: Euclidean distance adjusted with correlations of different features
- Hamming distance: the number of differing features; applicable to non-quantitative feature spaces
- Edit distance (aka Levenshtein distance): the minimum number of edit operations needed to transform one item into the other
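The metrics above can be sketched in a few lines of Python (pure-Python illustrations; the function names are my own, not from any library):

```python
# Minimal sketches of the distance metrics listed above.
from math import sqrt

def euclidean(x, y):
    # square root of the sum of squared per-feature differences
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # sum of absolute per-feature differences
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    # number of positions where the features differ
    return sum(1 for a, b in zip(x, y) if a != b)

def edit_distance(s, t):
    # Levenshtein distance by dynamic programming, row by row
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (a != b)))   # substitution
        prev = cur
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` is 3: two substitutions and one insertion.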
Normalization
Normalization is often needed before defining distances on quantitative features
- Different features may have different units
  E.g., statures (heights) and weights of humans cannot be treated the same
- Different features may have different ranges of variation
  E.g., the tallest man is not a hundred times taller than the shortest, but the richest man may have a million (or more) times more money than the poorest
Normalization into Standard Scores
Transforming the values of each feature into standard scores is a widely used normalization method
- Average: μ = (1/N) Σ_{i=1..N} x_i
- Variance: V_x = (1/N) Σ_{i=1..N} (x_i − μ)²
- Standard deviation: σ = √( (1/N) Σ_{i=1..N} (x_i − μ)² )
- Standard score: T_i = (x_i − μ) / σ
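The formulas above translate directly into a short Python sketch (the function name is my own; no external libraries assumed):

```python
# Standard-score (z-score) normalization following the slide's formulas.
from math import sqrt

def standard_scores(xs):
    n = len(xs)
    mu = sum(xs) / n                              # average
    var = sum((x - mu) ** 2 for x in xs) / n      # variance
    sigma = sqrt(var)                             # standard deviation
    return [(x - mu) / sigma for x in xs]         # standard scores
```

After this transform, each feature has average 0 and standard deviation 1, so features with different units can be compared.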
Non-Linear Transform
- The transform into standard scores is linear
- A non-linear transform (such as a logarithmic one), or using ranking instead of raw values, is more appropriate in certain cases
[Figure: relative frequency distribution of households by income class — 2010 Comprehensive Survey of Living Conditions, Ministry of Health, Labour and Welfare]
With a linear transform, families with 3M yen of income are assumed to be more similar to families with no income than to families with 7M yen of income. Using ranking would be more appropriate.
Agglomerative Clustering: Procedure
1. Every item is a cluster by itself
2. The closest pair of clusters forms an upper-level cluster
3. Repeat until a cluster covering all items is formed
[Figure: dendrogram over items a–g; pairs merge into bc, de, and fg, then abc, abcde, and finally abcdefg]
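The three steps above can be sketched directly in Python. This sketch uses single linkage (minimum item-to-item distance, defined on the next slide) as the inter-cluster metric; the function name and the `target` stopping parameter are my own additions for illustration:

```python
# A minimal agglomerative clustering sketch with single linkage.
def agglomerate(items, dist, target=1):
    clusters = [[x] for x in items]          # 1. every item is a cluster
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: minimum distance between member items
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best                       # 2. merge the closest pair
        clusters[i] += clusters.pop(j)
    return clusters                          # 3. repeated until `target` clusters remain
```

With `target=1` it builds the full hierarchy implicitly; stopping earlier yields a flat clustering at that level.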
Metrics Between Clusters
- Similarity metrics between clusters should also satisfy the distance axioms
- Can be defined via the distances between items of the different clusters:
  - Minimum: single linkage clustering
  - Maximum: complete linkage clustering
  - Average: average linkage clustering
- Can be defined as the distance between representative items
  - Median, centroid (average of items), etc.
Ward’s Minimum Variance Criterion
Joe Ward, 1963
"Choose the pair of clusters that gives, when merged, the minimum increase of the sum of the intra-cluster variances"
- Less susceptible to exceptional items
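Ward's increase in the sum of squared errors when merging clusters A and B has a standard closed form, Δ(A, B) = |A||B| / (|A| + |B|) · ‖centroid(A) − centroid(B)‖². A sketch (function names are mine; items are assumed to be feature vectors given as lists):

```python
# Ward's criterion: increase in total within-cluster SSE when merging a and b.
def centroid(cluster):
    n = len(cluster)
    return [sum(x[k] for x in cluster) / n for k in range(len(cluster[0]))]

def ward_increase(a, b):
    ca, cb = centroid(a), centroid(b)
    d2 = sum((p - q) ** 2 for p, q in zip(ca, cb))     # squared centroid distance
    return len(a) * len(b) / (len(a) + len(b)) * d2
```

Agglomerative clustering under Ward's criterion merges, at each step, the pair with the smallest `ward_increase`.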
Hierarchical Divisive Clustering
- Divide items into classes based on their similarity
- Divide the classes further
- Repeat until a certain criterion is satisfied
Repeated Bisection
Repetitively divide the item set into two
- For example, by applying the k-means algorithm (explained later) with k = 2 repetitively
- Which cluster to divide:
  - The one with the largest number of items
  - The one with the largest radius
  - The one with the largest sum or average of variance
- When to stop dividing:
  - When the largest cluster has fewer items than a threshold
  - When the largest radius or variance is less than a threshold
Merits and Demerits of Hierarchical Divisive Clustering
- No prior assumption on the number of clusters
  - Some criteria on when to stop dividing are needed, though
- Can be parallelized easily
  - Divisions of different clusters are independent
- Once a division is decided, it will not be changed
Non-Hierarchical Divisive Clustering
- Decide the number of clusters first
- Define the cost function of the clustering
  - For example, the sum of the variances of the clusters
The clustering with the minimum cost is desirable, but finding it is difficult due to computational costs
⇒ Iterative improvement methods are often used
- Decide on some initial clustering
- Move items from one cluster to another to lower the cost
The solution is suboptimal but often acceptable
Stochastic Data Models
- Assume that the data items are samples from a number of different stochastic models
- Clustering is then inferring which model each item belongs to
- Inference is made by maximizing the likelihood of observing the actual data
- The steps usually taken are:
  - Infer the parameters of the stochastic models, and
  - Infer which of the models each item belongs to
Mixture Distribution Model
- Items are assumed to be samples of different probability distributions
- Inference of the parameters of the original distributions is attempted
Stochastic Data Models
- Let τ_j denote the probability of an item belonging to model j
- Let Pr(x_i | C_j) denote the probability that item x_i is observed when it belongs to model j
- The likelihood that the set of items X is observed is:
  L(X | C) = Π_{i,j} τ_j Pr(x_i | C_j)
- Model parameters are estimated to maximize this
- Items are clustered so that the likelihood of each item is maximized under the estimated parameters
Expectation Maximization (EM) Algorithm
Arthur Dempster, et al. 1977
An algorithm to maximize the likelihood of probabilistic models with hidden parameters
- Two different sets of values have to be obtained:
  - The set of hidden parameters of the distributions
  - Which items are samples of which distribution
- The algorithm repetitively obtains better estimates of these two in turn
- It has many application areas other than clustering
EM Algorithm: Procedure
θ: set of parameters of the distributions
X: observed set of data
Z: some unknown data about the nature of X
1. An initial parameter set θ is somehow given
2. (E step) Under the parameter set θ and the observed X, find the Z that maximizes the likelihood
3. (M step) Using the Z obtained in step 2 and X, compute a better estimate of θ
4. Repeat from step 2 until some convergence criterion is reached
Clustering with EM Algorithm
- First, decide the number of clusters and the shape of the distributions (Gaussian, for example)
1. Decide the initial θ, the parameters of the distributions
   For Gaussian: averages and variances
2. Compute the clustering Z that gives the largest likelihood under θ
3. Compute the θ that gives the largest likelihood for the actual data set X and its clustering Z
4. If some convergence condition is not yet satisfied, go back to step 2 with this new θ
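The four steps above can be sketched for a 1-D two-component Gaussian mixture. This sketch uses hard assignments in the E step (the "classification EM" simplification; the usual EM uses soft responsibilities), and the function names and fixed iteration count are my own:

```python
# EM-style clustering sketch: 1-D, two Gaussian components, hard E step.
from math import exp, sqrt, pi

def gauss(x, mu, var):
    # Gaussian density with mean mu and variance var
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def em_cluster(xs, mu, iters=20):
    mu = list(mu)               # 1. initial means are given; unit variances assumed
    var = [1.0, 1.0]
    z = [0] * len(xs)
    for _ in range(iters):
        # 2. E step: assign each item to the more likely component
        z = [0 if gauss(x, mu[0], var[0]) >= gauss(x, mu[1], var[1]) else 1
             for x in xs]
        # 3. M step: re-estimate means and variances from the assignment
        for j in (0, 1):
            pts = [x for x, zj in zip(xs, z) if zj == j] or [mu[j]]
            mu[j] = sum(pts) / len(pts)
            var[j] = max(sum((x - mu[j]) ** 2 for x in pts) / len(pts), 1e-6)
    return z, mu                # 4. here we simply run a fixed number of rounds
```

As the slides note, the result depends on the initial θ; well-separated initial means converge quickly.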
Merits and Demerits of EM Algorithm
- No guarantee of convergence to the parameter set giving the maximum likelihood
  - May converge to a local maximum
  - The result depends on the initial value of θ
  - Starting with different initial values may alleviate the problem
- Advantageous for simple distributions, as each step can have low computational cost; the Gaussian mixture is typical
Clustering with EM Algorithm: An Example
[Figure: eruption data of the Old Faithful Geyser (Yellowstone National Park)]
k-Means Clustering Algorithm
A non-hierarchical clustering algorithm
1. Start with k random clusters
2. Compute the centroids of clusters
3. Make each item belong to the cluster with the nearest centroid
4. Repeat from step 2 until some criterion is satisfied
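The four steps of k-means can be sketched as follows for 1-D items (the function name, the seed parameter, and the re-seeding of empty clusters with a random item are my own illustrative choices):

```python
# A k-means sketch following the steps above, for 1-D items.
import random

def kmeans(items, k, iters=100, seed=0):
    rng = random.Random(seed)
    assign = [rng.randrange(k) for _ in items]       # 1. k random clusters
    for _ in range(iters):
        # 2. centroids of the current clusters
        cents = []
        for j in range(k):
            pts = [x for x, a in zip(items, assign) if a == j]
            # an empty cluster gets a random item as its centroid (one
            # simple strategy; the next slide discusses alternatives)
            cents.append(sum(pts) / len(pts) if pts else rng.choice(items))
        # 3. reassign each item to the cluster with the nearest centroid
        new = [min(range(k), key=lambda j: (x - cents[j]) ** 2) for x in items]
        if new == assign:                            # 4. stop at a fixed point
            return assign, cents
        assign = new
    return assign, cents
```

On well-separated data the procedure converges in a few rounds; as noted below, the result in general depends on the initial clustering.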
Handling Empty Clusters
When empty clusters appear:
- Create a new cluster consisting of a single item
- There are several strategies for choosing the item:
  - The item furthest from the centroid of its own cluster
  - First choose the cluster with the largest sum of squared errors, then choose its item furthest from the centroid
A Fuzzy Version of k-Means: Fuzzy C-Means
- Each data item belongs fuzzily to multiple clusters
- Item x_i belongs to cluster k with weight u_k(i), where Σ_k u_k(i) = 1
- Centroids are computed with the weights u_k(i):
  c_k = Σ_i x_i·u_k(i) / Σ_i u_k(i)
- Given the new centroids, the u_k(i) are updated to be inversely proportional to the distance from the new centroids
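One round of the update above can be sketched for 1-D items. The weights here are simply inverse distances normalized to sum to 1, as the slide states; the common fuzzy C-means formulation adds a fuzzifier exponent m, omitted here, and the function name and `eps` guard against division by zero are my own:

```python
# One fuzzy C-means round: membership weights, then weighted centroids.
def fcm_step(items, cents, eps=1e-9):
    # membership weights u_k(i), inversely proportional to distance,
    # normalized so that each item's weights sum to 1
    u = []
    for x in items:
        inv = [1.0 / (abs(x - c) + eps) for c in cents]
        s = sum(inv)
        u.append([w / s for w in inv])
    # weighted centroids: c_k = sum_i x_i * u_k(i) / sum_i u_k(i)
    new_cents = []
    for k in range(len(cents)):
        num = sum(x * u[i][k] for i, x in enumerate(items))
        den = sum(u[i][k] for i in range(len(items)))
        new_cents.append(num / den)
    return u, new_cents
```

Alternating these two computations until the centroids stop moving gives the fuzzy counterpart of the k-means loop.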
Merits and Demerits of k-Means Clustering
- Simple and easily parallelized
- The result heavily depends on the initial clustering
  Likely to converge to a local optimum, which may be quite far from the global optimum
- No good criterion for k, the number of clusters
- No attention paid to the balance of cluster sizes
- Heavily influenced by exceptional items
- Only applicable to quantitative data
  - For example, if a feature is nationality, there is no such thing as the "mean of nationalities"
k-Medoid Algorithm
Medoid: the item representing a cluster
e.g., the item with the minimum sum of distances to the other items in the cluster
The medoid is used in place of the centroid of k-means
- Can be used for non-quantitative data
- Less influenced by exceptional items
[Figure: an exceptional item pulls the centroid away from the group, while the medoid stays among the ordinary items]
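The medoid definition above is a one-liner in Python (function name is mine; any distance function works, which is what makes the method applicable to non-quantitative data):

```python
# The medoid: the item with the minimum sum of distances to the others.
def medoid(cluster, dist):
    return min(cluster, key=lambda x: sum(dist(x, y) for y in cluster))
```

For the cluster {0, 1, 2, 100} with absolute difference as the distance, the medoid is 1, while the centroid 25.75 is dragged far off by the exceptional item 100.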
How to Decide the Number of Clusters
Many proposals, but no definitive one
- An empirical rule of thumb: k = √(n/2)
- Elbow criterion (aka knee criterion)
  - At some point, increasing the number of clusters no longer increases the information given by which cluster an item belongs to
[Figure: ratio of variance explained plotted against the number of clusters, with an "elbow" where the curve flattens]
Shapes of Clusters
- Clusters may have a variety of shapes
- k-means and similar methods fit only clusters with elliptical shapes
[Figure: ring-shaped data; a ring is the most appropriate cluster shape in this case]
Density-Based Clustering
- Repeatedly add items in the neighborhood to clusters
⇒ Clusters grow to follow high-density areas
○ Clusters can have arbitrary shapes
○ Exceptional items do not disturb the clustering
  ∵ As exceptional items are not in the neighborhood of any other item, they do not participate in cluster formation
Density-Based Clustering: DBSCAN
Martin Ester et al., 1996
- ε-neighborhood: the set of items within distance ε
- Core object: an item with more items in its ε-neighborhood than a threshold MinPts
- An item y is said to be density-reachable from a core object x if there is a sequence of core objects, starting at x and ending in the ε-neighborhood of y, in which adjacent items are in each other's ε-neighborhood
- Two items x and y are said to be density-connected if both are density-reachable from the same core object
- A cluster consists of density-connected items
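The definitions above can be turned into a compact sketch. This version uses an O(n²) neighborhood search rather than the spatial index that gives DBSCAN its O(n log n) behavior, counts an item in its own ε-neighborhood, and labels noise as −1; the function name and these conventions are my own:

```python
# A compact DBSCAN sketch: eps-neighborhoods, core objects, cluster growth.
def dbscan(items, dist, eps, min_pts):
    n = len(items)
    # naive O(n^2) eps-neighborhood computation (each item includes itself)
    nbrs = [[j for j in range(n) if dist(items[i], items[j]) <= eps]
            for i in range(n)]
    core = [len(nbrs[i]) >= min_pts for i in range(n)]
    label = [-1] * n                     # -1 = noise / unassigned
    cid = 0
    for i in range(n):
        if label[i] != -1 or not core[i]:
            continue
        label[i] = cid                   # start a new cluster at a core object
        frontier = list(nbrs[i])
        while frontier:                  # add all density-reachable items
            j = frontier.pop()
            if label[j] != -1:
                continue
            label[j] = cid
            if core[j]:                  # clusters expand only through core objects
                frontier.extend(nbrs[j])
        cid += 1
    return label
```

Items that end up labeled −1 are exactly the exceptional items that belong to no ε-neighborhood of a core object, matching the noise-tolerance claim above.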
Clustering with DBSCAN
[Figure: items with their ε-neighborhoods; core objects marked, with MinPts = 2]
Merits and Demerits of DBSCAN
○ Relatively small computational cost
  With low dimensionality, O(n log n)
⇒ Can be applied to large-scale data
× No good criteria for deciding the algorithm parameters ε and MinPts
⇒ The characteristics of the analyzed data have to be known to a certain extent beforehand
× Two overlapping groups cannot be separated
× No good mathematical model that grounds the algorithm
Measures of Cluster Validity
- External: based on information supplied externally
  - Data labels given to items, for example
  - Useful for evaluating clustering algorithms
- Internal: using only the analyzed data set itself
  - Sum of squared errors
  - The ratio of pairs of nearest items belonging to the same cluster
In general, evaluating the results of clustering is a very difficult task
Summary of This Course
- Computational complexity theory
  - Orders of computation: Ω, Θ, O
  - Amortized complexity
- A variety of algorithms for different problems
  - Combinatorial optimization
  - Search and compaction of symbol strings
  - Memory management
  - Graph algorithms
  - Computational geometry
  - Clustering
Many other areas were not touched in this class
Important Points
- Algorithm selection dominates the computation cost
  - Orders of complexity are of primary importance
  - Constant factors are only secondary
- Strategies:
  - Using appropriate data structures
  - Preprocessing for efficiency
  - Concentrating on differences
  - Exploiting the characteristics of data
  - Mathematical models
  - Conversion to problems with efficient algorithms
Why the Order is Important
- Problem sizes change over time
  - Performance improvements (Moore's law): 4 times in 3 years → 1024 times in 15 years
  - Software is used longer than you may think
    The same software may be used for tens of years
E.g., two algorithms A: O(n log n); B: O(n²)
- B runs ten times faster than A now
- 15 years later, the data becomes 1000 times larger
- A needs about ten thousand times more computation
- B needs one million times more computation
⇒ Relative to B, A becomes faster by two orders of magnitude: from ten times slower to ten times faster
Selecting Data Structures
Data structures should be selected depending on which kinds of accesses are required to be efficient
- Arrays are compact and allow random access, but extension is not easy?
⇒ Array extension by doubling has amortized constant cost per insertion
- Structures linked with pointers are flexible, but no random access is possible?
⇒ Providing additional index arrays may allow the required random access
Preprocessing May Make Algorithms More Efficient
- Values used repeatedly can be stored in memory
  e.g., dynamic programming
- Information can be represented in a variety of forms
  - Tables of numbers, added flags, …
  - As the structures themselves, such as in graphs
- The cost of preprocessing should also be considered
  - Isn't the preprocessing cost too high?
  - Is the result of the preprocessing used repeatedly?
Focusing on Differences May Lead to Efficiency
Finding all the possible plays at a given board game position
- Computing them from scratch each turn is costly
A slight change of the game position makes only a slight change to the set of possible moves
⇒ Handle the changes!
- For the initial position, compute all the possible moves
- With each move of the game, compute the plays that the move enables and disables
⇒ This may be much faster than computing all the move candidates from scratch each time
Pay Attention to Data
If some cases are known to be more frequent, improvements should be made for those cases
- How frequent are they?
  Even if the 99% case is made a hundred times faster, if the 1% case is made a hundred times slower, the whole system becomes slower
- Are the frequent cases always frequent?
  There might be critical situations where the frequencies are quite different
Converting to a Problem with Efficient Algorithms
Computing a specific item of a sequence defined by a recurrence formula:
  x_0 = c_0,  x_1 = c_1,  x_{n+2} = a·x_{n+1} + b·x_n + d
For the Fibonacci sequence:
  x_0 = 1,  x_1 = 1,  x_{n+2} = x_{n+1} + x_n
How can we compute the n'th item x_n?
- Naïve application of the definition leads to an algorithm of exponential order
- Computing the items one after another from the beginning, this can be improved to O(n)
- Is this the best possible?
Solving Recurrence Formulae
The recurrence formula of the last page can be expressed in the form of a matrix multiplication:

  ( x_{n+2} )   ( a  b  d )   ( x_{n+1} )
  ( x_{n+1} ) = ( 1  0  0 ) × (   x_n   )
  (    1    )   ( 0  0  1 )   (    1    )

With this and the values x_0 = c_0 and x_1 = c_1, we have

  ( x_{n+1} )   ( a  b  d )^n   ( c_1 )
  (   x_n   ) = ( 1  0  0 )   × ( c_0 )
  (    1    )   ( 0  0  1 )     (  1  )

which may look like it only adds complication, but …
Computing Powers Efficiently
m^n can be computed with O(log n) multiplications
- If n is a power of 2, the sequence of multiplications m, m^2 = m × m, m^4 = m^2 × m^2, … computes m^n with O(log n) multiplications
- Any n can be expressed as a sum of powers of 2
  For example, m^105 = m^64 × m^32 × m^8 × m^1
- Using this, for arbitrary n, the required number of multiplications can be made O(log n)
The same technique can be used for powers of matrices
⇒ Computing the n'th item of a sequence defined by a recurrence formula has complexity O(log n)
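Combining the matrix form of the previous slide with repeated squaring gives the O(log n) method. A sketch for the Fibonacci case x_0 = x_1 = 1, using a 2×2 matrix since the recurrence has no constant term (function names are mine):

```python
# The n'th item of a linear recurrence via fast matrix exponentiation.
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(M, n):
    # repeated squaring: O(log n) matrix multiplications
    size = len(M)
    R = [[int(i == j) for j in range(size)] for i in range(size)]  # identity
    while n > 0:
        if n & 1:
            R = mat_mul(R, M)
        M = mat_mul(M, M)
        n >>= 1
    return R

def fib(n):
    # [x_{n+1}, x_n] = [[1,1],[1,0]]^n applied to [x_1, x_0] = [1, 1]
    P = mat_pow([[1, 1], [1, 0]], n)
    return P[1][0] * 1 + P[1][1] * 1   # second row yields x_n
```

Each doubling of n adds only one or two matrix multiplications, so even very large indices are cheap (ignoring the growth of the numbers themselves).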
Shortcomings of Traditional Complexity Theories
- Complexity theories assume random access memory
- In reality, computer memory is not flat
  - Clock speeds and circuit density are increasing, while the total physical sizes of high-end computers are getting larger
  - Relative memory access latency is thus increasing
Memory Hierarchy to Solve the Problem
A combination of memories with
- smaller capacity and faster access, and
- larger capacity and slower access
Frequently accessed data are stored in the faster memory
⇒ With temporal and spatial access locality, the system is expected to run fast
The behavior cannot be analyzed precisely with theories based on random access memory
[Figure: memory hierarchy from the CPU through primary, secondary, and tertiary caches to main memory and disks; speed decreases and capacity increases down the hierarchy]
Working Set Model
Peter Denning, end of the 1960's
A correction to the complexity theory to reflect the memory hierarchy
- The working set is the set of memory areas accessed in a short period of time
- If its size exceeds the size of a certain layer of the memory hierarchy, the program suddenly runs much slower
- Complexity analyses based on random access are only applicable as long as the working set size fits within a certain memory layer
Shortcomings of the Working Set Model
The validity of the analysis is limited by factors of the concrete system it is applied to:
- The memory structure of the computer system
- The amount of data to be processed
⇒ When the situation changes, the analysis may become invalid
An important subject to be solved
Course Project Subject
- Choose an arbitrary problem to which algorithms introduced in this course are applicable
- Choose sets of input data for performance evaluation
- Write programs using two or more algorithms for the problem, and compare their performance
  - Compare naïve and more sophisticated algorithms
  - Compare results on multiple sets of data, differing in size and/or nature
Project Report
Your report should include the following
1. Description of the problem
2. Description of the two or more algorithms tried
   One may be a naïve algorithm
3. Summary of the data used for performance evaluation
4. Performance results
5. Reasoning about the differences (or similarities) in performance
Appendix: your programs
Submitting Your Reports
Send it via email to the following address
- The subject should include the course title "Algorithm Design"
- Write either in English or in Japanese
- Write your name, lab, department, and student ID number at the top of the email text
- The report should be attached to the email; its format should be PDF, RTF, or MS Word
- Graphs showing performance are recommended
- Deadline: Feb 10th, 9:00 PM JST