clustering algorithms - maya ackerman · • there are clustering algorithms for a wide variety of...

Margareta Ackerman

Clustering Algorithms

• As we discussed last class, there are MANY clustering algorithms, and new ones are proposed all the time.

• They are very different from each other!

A sea of algorithms

• There are clustering algorithms for a wide variety of input and output types. Today, we will focus on the most popular one.

• Input: The input is (X, d) and k, where
  1. X is a set of elements (think of it as the labels of the points)
  2. d: X × X → R+ is a dissimilarity function
  3. k is the number of desired clusters, 1 ≤ k ≤ |X|

• Output: A partition of X into k sets {C1, C2, …, Ck} where
  1. Ci ∩ Cj is empty for all i ≠ j
  2. C1 ∪ C2 ∪ … ∪ Ck = X

Input/output
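The two output conditions can be checked mechanically. The following is a minimal sketch, assuming each cluster is a Python `set`; the function name `is_valid_clustering` is made up for illustration and does not come from the slides.

```python
def is_valid_clustering(X, clusters, k):
    """Return True if `clusters` is a partition of X into k sets.

    Assumes X is a collection of hashable elements and `clusters`
    is a list of Python sets (an illustrative representation).
    """
    if len(clusters) != k:
        return False
    # Condition 1: Ci and Cj share no element whenever i != j.
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if clusters[i] & clusters[j]:
                return False
    # Condition 2: the union of all clusters is exactly X.
    return set().union(*clusters) == set(X)
```

For example, `is_valid_clustering({"a", "b", "c"}, [{"a"}, {"b", "c"}], 2)` holds, while overlapping clusters or clusters that miss a point are rejected.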

• Start by placing each point in its own cluster

• Then, merge the two “closest” clusters

• Continue to merge two “closest” clusters until exactly k clusters remain

Linkage-Based Algorithms

Start by placing each point in its own cluster
Calculate and store the distance between each pair of clusters
While there are more than k clusters
  - Let A, B be the two closest clusters
  - Add cluster A ∪ B
  - Remove clusters A and B
  - Find the distance between A ∪ B and all other clusters

Linkage-Based Algorithms: More detail

• How do we define the distance between clusters?
• Common examples:
  – Single-linkage: min between-cluster distance
  – Average-linkage: average between-cluster distance
  – Complete-linkage: max between-cluster distance

Examples of linkage-based algorithms
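The partitional linkage-based procedure and the three linkage functions can be sketched as follows. This is an illustrative version, not the lecture's implementation: the function names are invented here, and for brevity it recomputes cluster distances in every round instead of storing and updating them.

```python
from itertools import combinations

def single_linkage(A, B, d):
    # Min between-cluster distance.
    return min(d(a, b) for a in A for b in B)

def average_linkage(A, B, d):
    # Average between-cluster distance.
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

def complete_linkage(A, B, d):
    # Max between-cluster distance.
    return max(d(a, b) for a in A for b in B)

def linkage_clustering(X, d, k, linkage=single_linkage):
    """Merge the two closest clusters until exactly k remain.

    X: iterable of distinct hashable points, d: dissimilarity function,
    linkage: one of the three functions above.
    """
    # Start by placing each point in its own cluster.
    clusters = [frozenset([x]) for x in X]
    while len(clusters) > k:
        # Let A, B be the two closest clusters under the chosen linkage.
        A, B = min(combinations(clusters, 2),
                   key=lambda pair: linkage(pair[0], pair[1], d))
        # Remove clusters A and B, then add A ∪ B.
        clusters = [C for C in clusters if C not in (A, B)]
        clusters.append(A | B)
    return clusters
```

Passing `linkage=average_linkage` or `linkage=complete_linkage` switches the merge criterion without changing the loop.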

Linkage-based algorithms are often applied in the hierarchical setting, where the algorithm outputs an entire tree of clusterings.

Hierarchical linkage-based algorithms are similar to the partitional versions we saw here (more about the hierarchical setting later).

Hierarchical algorithms

Perhaps the most popular clustering algorithm.

Often applied to data in Euclidean space.

K-means

Given a clustering {C1, C2, …, Ck}, the k-means objective function is

$$\sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2$$

where µi is the mean of Ci. That is,

$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

The ideal goal is to find a clustering with the minimum k-means cost. But that can take too long (it’s NP-hard).

So instead, we apply a heuristic: An algorithm that, in practice, tends to find clusterings with low k-means cost.

K-means Objective Function
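As a small sketch of how the k-means cost of a given clustering could be computed, assuming points are tuples of numbers in Euclidean space and each cluster is a non-empty list of such tuples (the names `mean` and `kmeans_cost` are illustrative):

```python
def mean(cluster):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def kmeans_cost(clusters):
    """Sum, over all clusters, of squared Euclidean distances to the cluster mean."""
    total = 0.0
    for C in clusters:
        mu = mean(C)
        total += sum(sum((a - b) ** 2 for a, b in zip(x, mu)) for x in C)
    return total
```

For the clustering `[[(0, 0), (2, 0)], [(5, 5)]]`, the first cluster has mean (1, 0) and contributes 1 + 1, and the singleton contributes 0.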

Pick k points (call them “centers”)
Until convergence:
  Assign each point to its closest center. This gives us k clusters.
  Compute the mean of each cluster
  Let these means be the new centers

The algorithm converges when the clusters don’t change in two consecutive iterations.

Lloyd’s method
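The loop above might look like this in code. It is a minimal sketch under a few assumptions that the slides do not specify: points are tuples of numbers, the initial centers are supplied by the caller, a `max_iter` cap is added as a safeguard, and a cluster that becomes empty simply keeps its old center.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def lloyd(points, centers, max_iter=100):
    """Run Lloyd's method from the given initial centers.

    Returns (assignment, centers); assignment[j] is the index of the
    center that points[j] ended up with.
    """
    centers = list(centers)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its closest center.
        new_assignment = [
            min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points
        ]
        # Converged: the clusters did not change in two consecutive iterations.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Compute the mean of each cluster; these become the new centers.
        for i in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old center
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centers
```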

How could we initialize the centers?

Furthest centroids:
  Pick one random center c1.
  Set c2 to the furthest point from c1.
  Set ci to the point with the largest minimum distance from any center already chosen.

Variations of Lloyd’s method
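A sketch of the furthest-centroids initialization, assuming the centers are chosen from the data points themselves and that ties are broken by list order (both are assumptions; the function name and the `rng` parameter are illustrative):

```python
import random

def furthest_centers(points, k, d, rng=random):
    """Pick k initial centers from `points` by the furthest-centroids rule.

    d is a dissimilarity function; rng is any object with a `choice`
    method, e.g. the `random` module or a seeded `random.Random`.
    """
    # Pick one random center c1.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Next center: the point whose distance to its nearest
        # already-chosen center is largest.
        nxt = max(points, key=lambda p: min(d(p, c) for c in centers))
        centers.append(nxt)
    return centers
```

The returned list can be passed as the starting centers to an implementation of Lloyd's method.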

How could we initialize the centers?

Random: Pick k random initial centers.

Using this approach, we might end up in a “local optimum.”

So, we run the algorithm many times (~100) to completion and pick the minimum cost clustering.

Variations of Lloyd’s method
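Random initialization with restarts can be sketched as below. This is a self-contained illustration, not a reference implementation: it bundles a compact Lloyd's loop, assumes points are tuples of numbers and that initial centers are sampled from the data, and the names `kmeans_random_restarts`, `runs` and `seed` are made up for the example.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def lloyd(points, centers, max_iter=100):
    """Compact Lloyd's method; returns (assignment, centers)."""
    centers, assignment = list(centers), None
    for _ in range(max_iter):
        new = [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
               for p in points]
        if new == assignment:  # clusters unchanged: converged
            break
        assignment = new
        for i in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # assumption: empty clusters keep their old center
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centers

def kmeans_random_restarts(points, k, runs=100, seed=0):
    """Run Lloyd's method `runs` times from random centers; keep the cheapest result."""
    rng = random.Random(seed)
    best = None
    for _ in range(runs):
        init = rng.sample(points, k)  # Random: pick k random initial centers
        assignment, centers = lloyd(points, init)
        # k-means cost of the clustering this run ended with.
        cost = sum(sq_dist(p, centers[a]) for p, a in zip(points, assignment))
        if best is None or cost < best[0]:
            best = (cost, assignment, centers)
    return best  # (cost, assignment, centers)
```

The default of 100 runs follows the "~100" suggested above; fewer runs trade solution quality for time.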

• Picking random centers works VERY WELL in practice.
• In particular, it works much better than furthest centroids.
• It works so well that “k-means” is synonymous with this approach.
• Does Lloyd’s method with random centers always find the optimal k-means solution? No.
• We will see other ways to initialize Lloyd’s method.

Lloyd’s method with random centers

K-median

Like k-means, except that we do not square the distance to the center.

Given a clustering {C1, C2, …, Ck}, the k-median objective function is

$$\sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|$$

where µi is the mean of Ci. That is,

$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

K-medoids

Like k-means, except that the centers must be part of the data set.

Given a clustering {C1, C2, …, Ck}, the k-medoids objective function is

$$\sum_{i=1}^{k} \sum_{x \in C_i} \|x - c_i\|^2$$

where $c_i \in C_i$ is the point that minimizes the above sum.
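To make the k-medoids objective concrete, here is a small sketch that picks, for each cluster, the member minimizing the sum of squared distances and adds up the resulting cost. It assumes points are tuples of numbers and clusters are non-empty lists; the function names are illustrative.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmedoids_cost(clusters):
    """Return (cost, medoids) for a clustering under the k-medoids objective."""
    total = 0.0
    medoids = []
    for C in clusters:
        # The center must be a data point: take the member of C that
        # minimizes the sum of squared distances to all points in C.
        c = min(C, key=lambda cand: sum(sq_dist(x, cand) for x in C))
        medoids.append(c)
        total += sum(sq_dist(x, c) for x in C)
    return total, medoids
```

For the cluster `[(0, 0), (1, 0), (5, 0)]` the candidate sums are 26, 17 and 41, so the medoid is (1, 0), whereas the mean (2, 0) is not a data point.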

Min-sum

Given a clustering {C1, C2, …, Ck}, the min-sum objective function is

$$\sum_{i=1}^{k} \sum_{x, y \in C_i} d(x, y)$$
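A one-function sketch of the min-sum cost, which needs only the dissimilarity function d and no centers. It reads the inner sum as ranging over ordered pairs (x, y), so each unordered pair is counted twice; that reading is an assumption, and the function name is illustrative.

```python
def min_sum_cost(clusters, d):
    """Sum of d(x, y) over all ordered pairs of points sharing a cluster."""
    # Ordered pairs: every unordered pair {x, y} contributes d(x, y) + d(y, x),
    # and the pairs (x, x) contribute d(x, x), which is 0 for usual dissimilarities.
    return sum(d(x, y) for C in clusters for x in C for y in C)
```

With `d = lambda a, b: abs(a - b)`, the clustering `[[0, 1, 3], [10]]` has within-cluster pair distances 1, 3 and 2, each counted twice.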

[Figure panels: Single-linkage, k-means]

Differences in Input/Output Behavior of Clustering Algorithms

Single-linkage, average-linkage, complete-linkage, min-diameter

k-means, k-median, k-medoids

Differences in Input/Output Behavior of Clustering Algorithms

There are a wide variety of clustering algorithms, which can produce very different clusterings.

How should a user decide which algorithm to use for a given application?

The User’s Dilemma

Users rely on cost-related considerations: running times, space usage, software purchasing costs, etc.

There is inadequate emphasis on input-output behaviour.

Clustering Algorithm Selection

A framework that lets a user utilize prior knowledge to select an algorithm

• Identify properties that distinguish between the input-output behaviours of different clustering paradigms

• The properties should be:
  1. Intuitive and “user-friendly”
  2. Useful for distinguishing clustering algorithms

Framework for Algorithm Selection (Ackerman, Ben-David, and Loker, NIPS 2010)

The goal is to understand fundamental differences between clustering methods, and convey them formally, clearly, and as simply as possible.

Framework for Algorithm Selection

(✓ = satisfies the property, ✗ = does not)

                  Local  Outer Con.  Inner Con.  Consistent  Refin. Preserv  Order Inv.  Rich  Outer Rich  Rep Ind  Scale Inv
Single linkage    ✓      ✓           ✓           ✓           ✓               ✓           ✓     ✓           ✓        ✓
Average linkage   ✓      ✓           ✗           ✗           ✓               ✗           ✓     ✓           ✓        ✓
Complete linkage  ✓      ✓           ✗           ✗           ✓               ✓           ✓     ✓           ✓        ✓
K-means           ✓      ✓           ✗           ✗           ✗               ✗           ✓     ✓           ✓        ✓
K-medoids         ✓      ✓           ✗           ✗           ✗               ✗           ✓     ✓           ✓        ✓
Min-Sum           ✓      ✓           ✓           ✓           ✗               ✗           ✓     ✓           ✓        ✓
Ratio-cut         ✗      ✗           ✓           ✓           ✗               ✗           ✓     ✓           ✓        ✓
Normalized-cut    ✗      ✗           ✗           ✗           ✗               ✗           ✓     ✓           ✓        ✓

Property-based classification for fixed k (Ackerman, Ben-David, and Loker, NIPS 2010)

Kleinberg’s Axioms are consistent when k is given

Kleinberg’s axioms for fixed k

Single linkage satisfies ALL of these properties!

So should we just use Single linkage all the time?

It’s not a good clustering algorithm in practice!

Single-linkage satisfies everything

                  Local  Outer Con.  Inner Con.  Consistent  Refin. Preserv  Order Inv.  Rich  Outer Rich  Rep Ind  Scale Inv
Single linkage    ✓      ✓           ✓           ✓           ✓               ✓           ✓     ✓           ✓        ✓

Despite much work on clustering properties, some basic questions remained unanswered.

Consider some of the most popular clustering methods: k-means, single-linkage, average-linkage, etc.

• How do these algorithms differ in their input-output behavior?
• What are the advantages of k-means over other methods?
• We were missing some key properties.

More on that in our next class!

What’s Left To Be Done?