
Page 1: (source: edithlaw.ca/.../w19/lectures/13-unsupervised-learning.pdf)

CS480 Introduction to Machine Learning
Unsupervised Learning

Edith Law

Page 2:

Supervised Learning

• In supervised learning, a teacher provides the correct labels; finding such a teacher may be difficult, expensive, or impossible.
• Unsupervised learning is about learning without a teacher.

Page 3:

Unsupervised Learning

In unsupervised learning, the data consists only of examples, not the corresponding labels.

Our job is to make sense of, or find some pattern of regularity in, the data, even though no one has provided the correct labels.

For example, we might want to do:
• clustering: automatically partition the data into groups.
• dimensionality reduction: project high-dimensional data into a lower-dimensional space so that it can be more easily visualized.

Page 4:

Overview

• Clustering (K-Means)
• Hierarchical Clustering
• PCA


Page 6:

A Simple Clustering Example

• A fruit merchant approaches you with a set of apples to classify according to their variety.
  – Tells you there are five varieties of apples in the basket.
  – Tells you the weight and colour of each apple in the basket.

• Can you label each apple with the correct variety?
  – What would you need to know / assume?

Page 7:

A Simple Clustering Example

• Data = ⟨x1, ?⟩, ⟨x2, ?⟩, …, ⟨xn, ?⟩

• You know there are 5 varieties.

• Assume each variety generates apples according to a (variety-specific) 2D Gaussian distribution.
  - If you know µi, σi² for each class, it's easy to classify the apples.
  - If you know the class of each apple, it's easy to estimate µi, σi².

• What if we know neither?

Page 8:

Chicken and Egg Problem

In unsupervised clustering, the goal is to find clusters in the data.

We represent each cluster by its cluster center.
• If we know the cluster centers, we can assign each point to its nearest cluster.
• If we know which points belong to which clusters, then we can compute the centers.

This is a chicken-and-egg problem, which can be solved via iteration:
• guess cluster centres
• assign each point to the closest centre
• recompute the centres
• repeat until the centres stop moving

This iterative process is the idea behind the K-means algorithm.

Page 9:

A Simple Algorithm: K-means clustering

• Objective: Cluster n instances into K distinct classes.
• Preliminaries:
  – Step 1: Pick the desired number of clusters, K.
  – Step 2: Assume a parametric distribution for each class (e.g., Gaussian).
  – Step 3: Randomly initialize the parameters of the K distributions.
• Iterate, until convergence:
  – Step 4: Assign instances to the most likely classes based on the current parametric distributions.
  – Step 5: Estimate the parametric distribution of each class based on the latest assignment.
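To make Steps 3–5 concrete, here is a minimal NumPy sketch of the plain (Euclidean) version, where each class is summarized by its center only; the function name and convergence test are our own illustrative choices, not from the slides.

    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        """Minimal K-means: X has shape (n, m); returns (centers, assignments)."""
        rng = np.random.default_rng(seed)
        # Step 3: initialize centers at k randomly chosen data points.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iters):
            # Step 4: assign each point to its nearest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # Step 5: recompute each center as the centroid of the points it owns.
            new_centers = np.array([
                X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
                for j in range(k)])
            if np.allclose(new_centers, centers):  # centres stopped moving
                break
            centers = new_centers
        return centers, assign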

Page 10:

K-means algorithm

1. Ask user how many clusters.

Image courtesy of Andrew Moore, Carnegie Mellon U.

This data could easily be modeled by Gaussians.

Page 11:

K-means algorithm

1. Ask user how many clusters.
2. Randomly guess k centers: { µ1, …, µk } (assume σ² is known).

This data could easily be modeled by Gaussians.

Image courtesy of Andrew Moore, Carnegie Mellon U.

Page 12:

K-means algorithm

1. Ask user how many clusters.
2. Randomly guess k centers: { µ1, …, µk } (assume σ² is known).
3. Assign each data point to the closest center.

This data could easily be modeled by Gaussians.

Image courtesy of Andrew Moore, Carnegie Mellon U.

Page 13:

K-means algorithm

1. Ask user how many clusters.
2. Randomly guess k centers: { µ1, …, µk } (assume σ² is known).
3. Assign each data point to the closest center.
4. Each centre finds the centroid of the points it owns.
5. Repeat!

This data could easily be modeled by Gaussians.

Image courtesy of Andrew Moore, Carnegie Mellon U.

Page 14:

K-Means Clustering (Daumé's Version)

Page 15:

K-means algorithm starts

Image courtesy of Andrew Moore, Carnegie Mellon U.

(Pelleg and Moore, 1999)

https://dl.acm.org/citation.cfm?id=312248

Pages 16–24: K-means algorithm continues (iterations 2–9) and terminates. (Image sequence courtesy of Andrew Moore, Carnegie Mellon U.)

Page 25:

K-Means is an instance of the EM Algorithm

• Objective: Cluster n instances into K distinct classes.
• Preliminaries:
  – Step 1: Pick the desired number of clusters, K.
  – Step 2: Assume a parametric distribution for each class (e.g., Gaussian).
  – Step 3: Randomly initialize the parameters of the K distributions.
• Iterate, until convergence:
  – Step 4: Assign instances to the most likely classes based on the current parametric distributions (expectation step).
  – Step 5: Estimate the parametric distribution of each class based on the latest assignment (maximization step).
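As a concrete (and hedged) illustration of the full EM version, with soft assignments and per-class Gaussian parameters, scikit-learn's GaussianMixture runs exactly this expectation/maximization loop; the two-blob dataset below is synthetic and purely for demonstration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Synthetic data: two Gaussian blobs (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM iterations
    labels = gmm.predict(X)  # hard assignments (argmax of the expectation step)
    print(gmm.means_)        # estimated class means (maximization-step output)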

Page 26:

Properties of K-means

Does it converge? Yes, but to a local optimum (proof in Daumé).

How long does it take to converge? In practice, very quickly (usually fewer than 20 iterations). Each iteration costs O(knm):
• k = #centers
• n = #datapoints
• m = dimensionality of data
In theory, though, the number of iterations needed can be exponential in the number of data points.

Does it converge to the right answer? It is not guaranteed to converge to the "right answer", partly because we have no way of knowing what the right answer is.

Page 27:

Properties of K-means

Rapid convergence depends on initialization.

• Can use random restarts (e.g., run the algorithm 10 times with different initializations) to get a better local optimum.

• Alternatively, choose your initial centers carefully:
  - Place µ1 on top of a randomly chosen datapoint.
  - Place µ2 on top of the datapoint that is furthest from µ1.
  - Place µ3 on top of the datapoint that is furthest from both µ1 and µ2.
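A minimal sketch of that careful-initialization heuristic (sometimes called furthest-first traversal; the function name is ours):

    import numpy as np

    def furthest_first_init(X, k, seed=0):
        """Pick k initial centers: the first at random, each subsequent one
        the datapoint furthest from all centers chosen so far."""
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            # distance from each point to its nearest already-chosen center
            d = np.linalg.norm(
                X[:, None, :] - np.array(centers)[None, :, :], axis=2).min(axis=1)
            centers.append(X[d.argmax()])  # furthest point becomes the next center
        return np.array(centers)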

Page 28:

K-means: Choosing K

A common approach is to search over many solutions (i.e., with different K) and find the one that minimizes a certain criterion, e.g., the Bayesian Information Criterion (BIC):

$K^* = \arg\min_K \; L_K + \lambda \, mK \log N$

where
• $L_K$ is a measure of the quality of the clustering (e.g., the sum of squared distances between each data point and its assigned center),
• K = # centers, m = # dimensions, N = # data points (so mK is the number of center parameters).

From: http://www.cs.cmu.edu/~./awm/tutorials/kmeans11.pdf

Page 29:

K-means: How to Choose K

Let $I_k$ be the set of indices of data points belonging to cluster $C_k$, and let $n_k$ be the number of data points in $C_k$.

• Within-Cluster Scatter (how tightly grouped the clusters are):

  $W(K) = \sum_{k=1}^{K} \sum_{i \in I_k} \| x_i - \bar{x}_k \|^2$

• Between-Cluster Scatter (how spread apart the clusters are from each other):

  $B(K) = \sum_{k=1}^{K} n_k \| \bar{x}_k - \bar{x} \|^2$

where $\bar{x}_k = \frac{1}{n_k} \sum_{i \in I_k} x_i$ and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.

Page 30:

K-means: How to Choose K

Goal: the clustering assignment should simultaneously have a small within-cluster scatter W and a large between-cluster scatter B.

The Calinski-Harabasz index combines the two:

$CH(K) = \frac{B(K)/(K-1)}{W(K)/(N-K)}$

where, as before,

$W(K) = \sum_{k=1}^{K} \sum_{i \in I_k} \| x_i - \bar{x}_k \|^2, \qquad B(K) = \sum_{k=1}^{K} n_k \| \bar{x}_k - \bar{x} \|^2$

Choose the K (upper-bounded by $K_{\max}$) with the largest CH(K) score:

$K^* = \arg\max_{K \in \{2, \ldots, K_{\max}\}} CH(K)$
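A hedged sketch of this selection procedure using scikit-learn, which ships a calinski_harabasz_score; the three-blob dataset and the choice Kmax = 8 are illustrative only.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])  # 3 blobs

    scores = {}
    for K in range(2, 9):  # K = 2, ..., Kmax
        labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
        scores[K] = calinski_harabasz_score(X, labels)  # [B/(K-1)] / [W/(N-K)]

    K_star = max(scores, key=scores.get)  # the K with the largest CH(K)
    print(K_star)  # expected: 3 for this synthetic data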

Page 31:

Overview

• Clustering (K-Means)
• Hierarchical Clustering
• PCA

Page 32:

Hierarchical Clustering

• A hierarchy of clusters, where the clusters at each level are created by merging clusters from the next lower level.

• Two general approaches:
  – Bottom-up (agglomerative): recursively merge a pair of clusters.
  – Top-down (divisive): recursively split the existing clusters.

• Use a dissimilarity measure to select split/merge pairs:
  – Measure pairwise distance between any points in the 2 clusters, e.g., Euclidean distance, Manhattan distance.
  – Measure distance over entire clusters using a linkage criterion, e.g., min/max/mean over pairs of points.

Page 33:

Hierarchical Clustering

There is a hierarchical sequence of clustering assignments, which can be represented as a dendrogram.

A B C D E F G

A (B F) C D E G

(A E) (B F) C D G

(A E) (B F) (C G) D

((A E) (C G)) (B F) D

(((A E) (C G)) (B F)) D

((((A E) (C G)) (B F)) D)

Page 34:

Hierarchical Clustering Forms Dendrograms

A dendrogram is a tree where each node represents a group:

• leaf: a group with a single data point.

• root: a group containing the whole dataset.

• internal node: has two child nodes representing the groups that were merged to form it.

Each internal node is drawn at a height proportional to the dissimilarity between its two children; assume that the leaf nodes are at height zero.

Page 35:

Linkage Functions

• Linkage: a function d(G, H) that takes two groups G, H as input and computes a dissimilarity score between them.

• The clustering process will result in different dendrograms depending on the choice of linkage function we use to measure dissimilarity between groups.

Page 36:

Linkage Functions (images from Manning et al., 2008)

• single linkage (i.e., nearest-neighbor linkage): the dissimilarity between G and H is the smallest dissimilarity between two points in the opposite groups:

  $d_{\mathrm{single}}(G, H) = \min_{i \in G, \, j \in H} d_{ij}$

• complete linkage (i.e., furthest-neighbor linkage): the dissimilarity between G and H is the largest dissimilarity between two points in the opposite groups:

  $d_{\mathrm{complete}}(G, H) = \max_{i \in G, \, j \in H} d_{ij}$

Page 37:

Linkage Function

• average linkage: the dissimilarity between G and H is the average dissimilarity over all points in opposite groups:

  $d_{\mathrm{average}}(G, H) = \frac{1}{n_G \cdot n_H} \sum_{i \in G, \, j \in H} d_{ij}$
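A hedged sketch of these three linkage choices using SciPy's agglomerative-clustering routines (the random data is purely illustrative); scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding dendrogram.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))  # 20 illustrative 2-D points

    # Each method corresponds to one linkage function defined above.
    for method in ("single", "complete", "average"):
        Z = linkage(X, method=method)  # the (n-1) merges, i.e., the dendrogram
        labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
        print(method, labels)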

Page 38:

Example: Complete Linkage

From: http://www.econ.upf.edu/~michael/stanford/maeb7.pdf

Step 1: look for the most similar pair (the lowest dissimilarity score).

Page 39:

Example: Complete Linkage

Step 2: join B and F at level 0.20. This forms a node.

Page 40:

Example: Complete Linkage

Step 3: calculate the dissimilarity between each data point x and the merged pair (B, F).
- complete linkage means the dissimilarity is the max of d(x, B) and d(x, F)
- e.g., d(A, B) = 0.5, d(A, F) = 0.6250, therefore d(A, (B, F)) = 0.6250.

Page 41:

Example: Complete Linkage

This is what the table looks like after re-calculating the dissimilarity scores between each point and the merged (B, F) pair.

Step 4: Repeat the process. Find the smallest dissimilarity score.

Page 42:

Example: Complete Linkage

Step 5: join A and E at level 0.25.

Page 43:

Example: Complete Linkage

Step 6: calculate the dissimilarity between each data point x and the merged pair (A, E).
- e.g., the dissimilarity between (A, E) and (B, F) is the max of d(A, (B, F)) = 0.6250 and d(E, (B, F)) = 0.7778.

Page 44:

Example: Complete Linkage

This is what the table looks like after re-calculating the dissimilarity scores between each point and the merged (A, E) pair.

Pages 45–51: Example: Complete Linkage (image-only slides showing the remaining merge steps).

Page 52:

Single vs Complete Linkage

From: https://www-users.cs.umn.edu/~kumar001/dmbook/ch7_clustering.pdf

single linkage:
• sensitive to noise/outliers
• clusters tend to be elliptical, long, and skinny

complete linkage:
• less sensitive to noise/outliers
• clusters tend to be tight, compact, globular

Page 53:

Hierarchical Clustering of News Articles
http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html

Page 54:

Overview

• Clustering (K-Means)
• Hierarchical Clustering
• PCA

Page 55:

Dimensionality Reduction

How do we automatically detect and remove redundant dimensions? We look for a vector u that points in the direction of maximal variance!

[Figure: 2-D scatter plot of data points with axes "Skill" and "Enjoyment"; $u_1$ points in the direction of maximal variance, $u_2$ is orthogonal to it.]

Page 56:

Direction of Maximum Variance

Page 57:

Direction of Maximum Variance

Page 58:

Principal Component Analysis: The Problem

• Find orthonormal basis vectors $U = [u^{(1)} \; u^{(2)} \; \ldots \; u^{(K)}]$, where $K \ll n$:

  $z = U^T x$, where $z_k = (u^{(k)})^T x$

• Reconstructed data points:

  $\hat{x} = \sum_{k=1}^{K} z_k u^{(k)}$

• Cost function: reconstruction error

  $J = \frac{1}{n} \sum_{i=1}^{n} \| x_i - \hat{x}_i \|^2$

• Want: $\min_U J$

Page 59:

Principal Component Analysis: The Solution

• The solution turns out to be the first K eigenvectors of the data covariance matrix (see [B] Sec. 12.1).

• Closed form: use Singular Value Decomposition (SVD) on the covariance matrix.

• Other PCA formulations: maximizing the variance of the projected data (see [D] Sec. 15.2).

Page 60:

Principal Component Analysis: The Algorithm

• normalize features (ensure every feature has zero mean) and optionally scale features

• compute the covariance matrix

• compute its eigenvectors

• keep the first k eigenvectors and project to get the new features z
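A minimal NumPy sketch of these four steps (the function name and the choice of k are illustrative):

    import numpy as np

    def pca(X, k):
        """Project (n, m) data X onto its top-k principal components."""
        Xc = X - X.mean(axis=0)           # normalize: zero-mean each feature
        C = np.cov(Xc, rowvar=False)      # m x m covariance matrix
        evals, evecs = np.linalg.eigh(C)  # eigh, since C is symmetric
        order = np.argsort(evals)[::-1]   # sort eigenvalues, largest first
        U = evecs[:, order[:k]]           # keep the first k eigenvectors
        return Xc @ U                     # the new features z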

Page 61:

Principal Component Analysis: The Algorithm

Page 62:

PCA: Example (Smith, 2002)

Page 63:

PCA: Example (Smith, 2002)

Step 1: subtract the mean of each dimension from the data along that dimension:

  $x_i - \bar{x}, \quad y_i - \bar{y} \qquad \forall i = 1, \ldots, 10$

Page 64:

PCA: Example (Smith, 2002)

Step 2: Calculate the covariance matrix.

• Covariance measures how two variables vary from the mean with respect to each other:

  $\mathrm{var}(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{n - 1}, \qquad \mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$

• The covariance matrix captures the covariance values between all dimensions:

  $C = \begin{pmatrix} \mathrm{cov}(x, x) & \mathrm{cov}(x, y) \\ \mathrm{cov}(y, x) & \mathrm{cov}(y, y) \end{pmatrix} = \begin{pmatrix} 0.6166 & 0.6154 \\ 0.6154 & 0.7166 \end{pmatrix}$

The off-diagonal values are positive, indicating that x increases as y increases.

Page 65:

PCA: Example (Smith, 2002)

Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix.

• An eigenvector v of a linear transformation A is a vector that, upon transformation, does not change direction: $Av = \lambda v$.

  $\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}$

• Here, the associated eigenvalue is 4.

• All eigenvectors of a symmetric matrix (such as a covariance matrix) are orthogonal (perpendicular) to each other, so we can re-express the data using the eigenvectors as the new axes!
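A quick NumPy check of this example (illustrative only):

    import numpy as np

    A = np.array([[2.0, 3.0],
                  [2.0, 1.0]])
    v = np.array([3.0, 2.0])
    print(A @ v)                # [12.  8.] == 4 * v, so v is an eigenvector, λ = 4
    print(np.linalg.eig(A)[0])  # both eigenvalues of A: 4 and -1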

Page 66:

PCA: Example (Smith, 2002)

Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix.

• The unit eigenvectors and the corresponding eigenvalues:

  $\text{eigenvalues} = \begin{pmatrix} 0.049 \\ 1.284 \end{pmatrix}, \qquad \text{eigenvectors} = \begin{pmatrix} -0.7352 & -0.6779 \\ 0.6779 & -0.7352 \end{pmatrix}$

Page 67:

PCA: Example (Smith, 2002)

Step 4: Choose components.

• The eigenvector with the highest eigenvalue is the principal component of the dataset.

• Form a matrix of k eigenvectors, ordered by eigenvalue from largest to smallest:

  $W = [\mathrm{eig}_1, \mathrm{eig}_2, \ldots, \mathrm{eig}_k]$

• If k < m, then you are essentially discarding some dimensions, e.g.,

  $W = \begin{pmatrix} -0.6779 & -0.7352 \\ -0.7352 & 0.6779 \end{pmatrix} \; \text{(both components)} \qquad W = \begin{pmatrix} -0.6779 \\ -0.7352 \end{pmatrix} \; \text{(principal component only)}$

Page 68:

PCA: Example (Smith, 2002)

Step 5: Derive the new dataset.

• Multiply the transpose of W on the left of the mean-adjusted dataset, transposed:

  $\begin{pmatrix} -0.6779 & -0.7352 \\ -0.7352 & 0.6779 \end{pmatrix} \begin{pmatrix} 0.69 & -1.31 & 0.39 & 0.09 & 1.29 & 0.49 & 0.19 & -0.81 & -0.31 & -0.71 \\ 0.49 & -1.21 & 0.99 & 0.29 & 1.09 & 0.79 & -0.31 & -0.81 & -0.31 & -1.01 \end{pmatrix}$
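A short NumPy check of Steps 2–5, using only the mean-adjusted data from the slide as input (eigenvector signs may come out flipped relative to the slide):

    import numpy as np

    # Mean-adjusted data from the slide: row 0 is x, row 1 is y.
    data = np.array([
        [0.69, -1.31, 0.39, 0.09, 1.29, 0.49, 0.19, -0.81, -0.31, -0.71],
        [0.49, -1.21, 0.99, 0.29, 1.09, 0.79, -0.31, -0.81, -0.31, -1.01]])

    C = np.cov(data)                  # Step 2: matches [[0.6166, 0.6154], [0.6154, 0.7166]]
    evals, evecs = np.linalg.eigh(C)  # Step 3: eigenvalues 0.049 and 1.284 (ascending)
    W_T = evecs[:, ::-1].T            # Step 4: eigenvectors as rows, largest eigenvalue first
    new_data = W_T @ data             # Step 5: the transformed dataset
    print(C, evals, new_data, sep="\n")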

Pages 69–70: PCA: Example (Smith, 2002). Step 5: Derive the new dataset (image-only slides showing the derived dataset).

Page 71:

PCA: Example (Smith, 2002)

To reconstruct the data:

• We can get exactly the original data back if we use all the eigenvectors,

• but we will have lost some information if we use only some of the eigenvectors.

Page 72:

Uses of Dimensionality Reduction

• Compression

• Visualization: pick dimension = 2 or 3.

• Pre-processing (to avoid the curse of dimensionality)

https://en.wikipedia.org/wiki/Eigenface

Page 73:

What you should know

• K-means clustering and its properties
• Hierarchical clustering and different linkage functions
• How to cluster a toy dataset using K-means / hierarchical clustering
• Procedures and applications of Principal Component Analysis