
Page 1: (source: edithlaw.ca/.../w19/lectures/13-unsupervised-learning.pdf)

CS480 Introduction to Machine Learning
Unsupervised Learning

Edith Law

Page 2:

Supervised Learning

• In supervised learning, a teacher provides the correct labels; finding such a teacher may be difficult, expensive, or impossible.
• Unsupervised learning is about learning without a teacher.

Page 3:

Unsupervised Learning

In unsupervised learning, the data consists only of examples, not the corresponding labels.

Our job is to make sense of, or find some pattern of regularity in, the data, even though no one has provided the correct labels.

For example, we might want to do:
• clustering: automatically partition the data into groups.
• dimensionality reduction: project high-dimensional data into a lower-dimensional space so that it can be more easily visualized.

Page 4:

Overview

• Clustering (K-Means)
• Hierarchical Clustering
• PCA


Page 6:

A Simple Clustering Example

• A fruit merchant approaches you with a set of apples to classify according to their variety.
  – Tells you there are five varieties of apples in the basket.
  – Tells you the weight and colour of each apple in the basket.

• Can you label each apple with the correct variety?
  – What would you need to know / assume?

Page 7:

A Simple Clustering Example

• Data = ⟨x1, ?⟩, ⟨x2, ?⟩, …, ⟨xn, ?⟩

• You know there are 5 varieties.

• Assume each variety generates apples according to a (variety-specific) 2D Gaussian distribution.
  - If you know µi, σi² for each class, it's easy to classify the apples.
  - If you know the class of each apple, it's easy to estimate µi, σi².

• What if we know neither?

Page 8:

Chicken and Egg Problem

In unsupervised clustering, the goal is to find clusters in the data.

We represent each cluster by its cluster center.
• If we know the cluster centers, we can assign each point to its nearest cluster.
• If we know which points belong to which clusters, then we can compute the centers.

This is a chicken-and-egg problem, which can be solved via iteration:
• guess cluster centres
• assign each point to the closest centre
• recompute the centres
• repeat until the centres stop moving

This iterative process is the idea behind the K-means algorithm.

Page 9:

A Simple Algorithm: K-means clustering

• Objective: Cluster n instances into K distinct classes.
• Preliminaries:
  – Step 1: Pick the desired number of clusters, K.
  – Step 2: Assume a parametric distribution for each class (e.g., Gaussian).
  – Step 3: Randomly initialize the parameters of the K distributions.
• Iterate, until convergence:
  – Step 4: Assign instances to the most likely classes based on the current parametric distributions.
  – Step 5: Estimate the parametric distribution of each class based on the latest assignment.
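To make Steps 3–5 concrete, here is a minimal NumPy sketch of the plain (Euclidean) version, where each class is summarized by its center only; the function name and convergence test are our own illustrative choices, not from the slides.

    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        """Minimal K-means: X has shape (n, m); returns (centers, assignments)."""
        rng = np.random.default_rng(seed)
        # Step 3: initialize centers at k randomly chosen data points.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iters):
            # Step 4: assign each point to its nearest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            assign = dists.argmin(axis=1)
            # Step 5: recompute each center as the centroid of the points it owns.
            new_centers = np.array([
                X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
                for j in range(k)])
            if np.allclose(new_centers, centers):  # centres stopped moving
                break
            centers = new_centers
        return centers, assign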

Page 10:

K-means algorithm

1. Ask user how many clusters.

Image courtesy of Andrew Moore, Carnegie Mellon U.

This data could easily be modeled by Gaussians.

Page 11:

K-means algorithm

1. Ask user how many clusters.
2. Randomly guess k centers: { µ1, …, µk } (assume σ² is known).

This data could easily be modeled by Gaussians.

Image courtesy of Andrew Moore, Carnegie Mellon U.

Page 12:

K-means algorithm

1. Ask user how many clusters.
2. Randomly guess k centers: { µ1, …, µk } (assume σ² is known).
3. Assign each data point to the closest center.

This data could easily be modeled by Gaussians.

Image courtesy of Andrew Moore, Carnegie Mellon U.

Page 13:

K-means algorithm

1. Ask user how many clusters.
2. Randomly guess k centers: { µ1, …, µk } (assume σ² is known).
3. Assign each data point to the closest center.
4. Each centre finds the centroid of the points it owns.
5. Repeat!

This data could easily be modeled by Gaussians.

Image courtesy of Andrew Moore, Carnegie Mellon U.

Page 14:

K-Means Clustering (Daumé's Version)

Page 15:

K-means algorithm starts

Image courtesy of Andrew Moore, Carnegie Mellon U.

(Pelleg and Moore, 1999)

https://dl.acm.org/citation.cfm?id=312248

Pages 16–24: K-means algorithm continues (iterations 2–9) and terminates. (Image sequence courtesy of Andrew Moore, Carnegie Mellon U.)

Page 25:

K-Means is an instance of the EM Algorithm

• Objective: Cluster n instances into K distinct classes.
• Preliminaries:
  – Step 1: Pick the desired number of clusters, K.
  – Step 2: Assume a parametric distribution for each class (e.g., Gaussian).
  – Step 3: Randomly initialize the parameters of the K distributions.
• Iterate, until convergence:
  – Step 4: Assign instances to the most likely classes based on the current parametric distributions (expectation step).
  – Step 5: Estimate the parametric distribution of each class based on the latest assignment (maximization step).
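As a concrete (and hedged) illustration of the full EM version, with soft assignments and per-class Gaussian parameters, scikit-learn's GaussianMixture runs exactly this expectation/maximization loop; the two-blob dataset below is synthetic and purely for demonstration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Synthetic data: two Gaussian blobs (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # EM iterations
    labels = gmm.predict(X)  # hard assignments (argmax of the expectation step)
    print(gmm.means_)        # estimated class means (maximization-step output)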

Page 26:

Properties of K-means

Does it converge? Yes, but to a local optimum (proof in Daumé).

How long does it take to converge? In practice, very quickly (usually fewer than 20 iterations). Each iteration costs O(knm):
• k = #centers
• n = #datapoints
• m = dimensionality of data
In theory, though, the number of iterations needed can be exponential in the number of data points.

Does it converge to the right answer? It is not guaranteed to converge to the "right answer", partly because we have no way of knowing what the right answer is.

Page 27:

Properties of K-means

Rapid convergence depends on initialization.

• Can use random restarts (e.g., run the algorithm 10 times with different initializations) to get a better local optimum.

• Alternatively, choose your initial centers carefully:
  - Place µ1 on top of a randomly chosen datapoint.
  - Place µ2 on top of the datapoint that is furthest from µ1.
  - Place µ3 on top of the datapoint that is furthest from both µ1 and µ2.
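A minimal sketch of that careful-initialization heuristic (sometimes called furthest-first traversal; the function name is ours):

    import numpy as np

    def furthest_first_init(X, k, seed=0):
        """Pick k initial centers: the first at random, each subsequent one
        the datapoint furthest from all centers chosen so far."""
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            # distance from each point to its nearest already-chosen center
            d = np.linalg.norm(
                X[:, None, :] - np.array(centers)[None, :, :], axis=2).min(axis=1)
            centers.append(X[d.argmax()])  # furthest point becomes the next center
        return np.array(centers)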

Page 28:

K-means: Choosing K

A common approach is to search over many solutions (i.e., with different K) and find the one that minimizes a certain criterion, e.g., the Bayesian Information Criterion (BIC):

$K^* = \arg\min_K \; L_K + \lambda \, mK \log N$

where
• $L_K$ is a measure of the quality of the clustering (e.g., the sum of squared distances between each data point and its assigned center),
• K = # centers, m = # dimensions, N = # data points (so mK is the number of center parameters).

From: http://www.cs.cmu.edu/~./awm/tutorials/kmeans11.pdf

Page 29:

K-means: How to Choose K

Let $I_k$ be the set of indices of data points belonging to cluster $C_k$, and let $n_k$ be the number of data points in $C_k$.

• Within-Cluster Scatter (how tightly grouped the clusters are):

  $W(K) = \sum_{k=1}^{K} \sum_{i \in I_k} \| x_i - \bar{x}_k \|^2$

• Between-Cluster Scatter (how spread apart the clusters are from each other):

  $B(K) = \sum_{k=1}^{K} n_k \| \bar{x}_k - \bar{x} \|^2$

where $\bar{x}_k = \frac{1}{n_k} \sum_{i \in I_k} x_i$ and $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$.

Page 30:

K-means: How to Choose K

Goal: the clustering assignment should simultaneously have a small within-cluster scatter W and a large between-cluster scatter B.

The Calinski-Harabasz index combines the two:

$CH(K) = \frac{B(K)/(K-1)}{W(K)/(N-K)}$

where, as before,

$W(K) = \sum_{k=1}^{K} \sum_{i \in I_k} \| x_i - \bar{x}_k \|^2, \qquad B(K) = \sum_{k=1}^{K} n_k \| \bar{x}_k - \bar{x} \|^2$

Choose the K (upper-bounded by $K_{\max}$) with the largest CH(K) score:

$K^* = \arg\max_{K \in \{2, \ldots, K_{\max}\}} CH(K)$
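A hedged sketch of this selection procedure using scikit-learn, which ships a calinski_harabasz_score; the three-blob dataset and the choice Kmax = 8 are illustrative only.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in (0, 4, 8)])  # 3 blobs

    scores = {}
    for K in range(2, 9):  # K = 2, ..., Kmax
        labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
        scores[K] = calinski_harabasz_score(X, labels)  # [B/(K-1)] / [W/(N-K)]

    K_star = max(scores, key=scores.get)  # the K with the largest CH(K)
    print(K_star)  # expected: 3 for this synthetic data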

Page 31:

Overview

• Clustering (K-Means)
• Hierarchical Clustering
• PCA

Page 32:

Hierarchical Clustering

• A hierarchy of clusters, where the clusters at each level are created by merging clusters from the next lower level.

• Two general approaches:
  – Bottom-up (agglomerative): recursively merge a pair of clusters.
  – Top-down (divisive): recursively split the existing clusters.

• Use a dissimilarity measure to select split/merge pairs:
  – Measure pairwise distance between any points in the 2 clusters, e.g., Euclidean distance, Manhattan distance.
  – Measure distance over entire clusters using a linkage criterion, e.g., min/max/mean over pairs of points.

Page 33:

Hierarchical Clustering

There is a hierarchical sequence of clustering assignments, which can be represented as a dendrogram.

A B C D E F G

A (B F) C D E G

(A E) (B F) C D G

(A E) (B F) (C G) D

((A E) (C G)) (B F) D

(((A E) (C G)) (B F)) D

((((A E) (C G)) (B F)) D)

Page 34:

Hierarchical Clustering Forms Dendrograms

A dendrogram is a tree where each node represents a group:

• leaf: a group with a single data point.

• root: a group containing the whole dataset.

• internal node: has two child nodes representing the groups that were merged to form it.

Each internal node is drawn at a height proportional to the dissimilarity between its two children; assume that the leaf nodes are at height zero.

Page 35:

Linkage Functions

• Linkage: a function d(G, H) that takes two groups G, H as input and computes a dissimilarity score between them.

• The clustering process will result in different dendrograms depending on the choice of linkage function we use to measure dissimilarity between groups.

Page 36:

Linkage Functions (images from Manning et al., 2008)

• single linkage (i.e., nearest-neighbor linkage): the dissimilarity between G and H is the smallest dissimilarity between two points in the opposite groups:

  $d_{\mathrm{single}}(G, H) = \min_{i \in G, \, j \in H} d_{ij}$

• complete linkage (i.e., furthest-neighbor linkage): the dissimilarity between G and H is the largest dissimilarity between two points in the opposite groups:

  $d_{\mathrm{complete}}(G, H) = \max_{i \in G, \, j \in H} d_{ij}$

Page 37:

Linkage Function

• average linkage: the dissimilarity between G and H is the average dissimilarity over all points in opposite groups:

  $d_{\mathrm{average}}(G, H) = \frac{1}{n_G \cdot n_H} \sum_{i \in G, \, j \in H} d_{ij}$
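A hedged sketch of these three linkage choices using SciPy's agglomerative-clustering routines (the random data is purely illustrative); scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding dendrogram.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))  # 20 illustrative 2-D points

    # Each method corresponds to one linkage function defined above.
    for method in ("single", "complete", "average"):
        Z = linkage(X, method=method)  # the (n-1) merges, i.e., the dendrogram
        labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
        print(method, labels)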

Page 38:

Example: Complete Linkage

From: http://www.econ.upf.edu/~michael/stanford/maeb7.pdf

Step 1: look for the most similar pair (the lowest dissimilarity score).

Page 39:

Example: Complete Linkage

Step 2: join B and F at level 0.20. This forms a node.

Page 40:

Example: Complete Linkage

Step 3: calculate the dissimilarity between each data point x and the merged pair (B, F).
- complete linkage means the dissimilarity is the max of d(x, B) and d(x, F)
- e.g., d(A, B) = 0.5, d(A, F) = 0.6250, therefore d(A, (B, F)) = 0.6250.

Page 41:

Example: Complete Linkage

This is what the table looks like after re-calculating the dissimilarity scores between each point and the merged (B, F) pair.

Step 4: Repeat the process. Find the smallest dissimilarity score.

Page 42:

Example: Complete Linkage

Step 5: join A and E at level 0.25.

Page 43:

Example: Complete Linkage

Step 6: calculate the dissimilarity between each data point x and the merged pair (A, E).
- e.g., the dissimilarity between (A, E) and (B, F) is the max of d(A, (B, F)) = 0.6250 and d(E, (B, F)) = 0.7778.

Page 44:

Example: Complete Linkage

This is what the table looks like after re-calculating the dissimilarity scores between each point and the merged (A, E) pair.

Pages 45–51: Example: Complete Linkage (image-only slides showing the remaining merge steps).

Page 52:

Single vs Complete Linkage

From: https://www-users.cs.umn.edu/~kumar001/dmbook/ch7_clustering.pdf

single linkage:
• sensitive to noise/outliers
• clusters tend to be elliptical, long, and skinny

complete linkage:
• less sensitive to noise/outliers
• clusters tend to be tight, compact, globular

Page 53:

Hierarchical Clustering of News Articles
http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html

Page 54:

Overview

• Clustering (K-Means)
• Hierarchical Clustering
• PCA

Page 55:

Dimensionality Reduction

How do we automatically detect and remove redundant dimensions? We look for a vector u that points in the direction of maximal variance!

[Figure: 2-D scatter plot of data points with axes "Skill" and "Enjoyment"; $u_1$ points in the direction of maximal variance, $u_2$ is orthogonal to it.]

Page 56:

Direction of Maximum Variance

Page 57:

Direction of Maximum Variance

Page 58:

Principal Component Analysis: The Problem

• Find orthonormal basis vectors $U = [u^{(1)} \; u^{(2)} \; \ldots \; u^{(K)}]$, where $K \ll n$:

  $z = U^T x$, where $z_k = (u^{(k)})^T x$

• Reconstructed data points:

  $\hat{x} = \sum_{k=1}^{K} z_k u^{(k)}$

• Cost function: reconstruction error

  $J = \frac{1}{n} \sum_{i=1}^{n} \| x_i - \hat{x}_i \|^2$

• Want: $\min_U J$

Page 59:

Principal Component Analysis: The Solution

• The solution turns out to be the first K eigenvectors of the data covariance matrix (see [B] Sec. 12.1).

• Closed form: use Singular Value Decomposition (SVD) on the covariance matrix.

• Other PCA formulations: maximizing the variance of the projected data (see [D] Sec. 15.2).

Page 60:

Principal Component Analysis: The Algorithm

• normalize features (ensure every feature has zero mean) and optionally scale features

• compute the covariance matrix

• compute its eigenvectors

• keep the first k eigenvectors and project to get the new features z
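A minimal NumPy sketch of these four steps (the function name and the choice of k are illustrative):

    import numpy as np

    def pca(X, k):
        """Project (n, m) data X onto its top-k principal components."""
        Xc = X - X.mean(axis=0)           # normalize: zero-mean each feature
        C = np.cov(Xc, rowvar=False)      # m x m covariance matrix
        evals, evecs = np.linalg.eigh(C)  # eigh, since C is symmetric
        order = np.argsort(evals)[::-1]   # sort eigenvalues, largest first
        U = evecs[:, order[:k]]           # keep the first k eigenvectors
        return Xc @ U                     # the new features z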

Page 61:

Principal Component Analysis: The Algorithm

Page 62:

PCA: Example (Smith, 2002)

Page 63:

PCA: Example (Smith, 2002)

Step 1: subtract the mean of each dimension from the data along that dimension:

  $x_i - \bar{x}, \quad y_i - \bar{y} \qquad \forall i = 1, \ldots, 10$

Page 64:

PCA: Example (Smith, 2002)

Step 2: Calculate the covariance matrix.

• Covariance measures how two variables vary from the mean with respect to each other:

  $\mathrm{var}(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{n - 1}, \qquad \mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$

• The covariance matrix captures the covariance values between all dimensions:

  $C = \begin{pmatrix} \mathrm{cov}(x, x) & \mathrm{cov}(x, y) \\ \mathrm{cov}(y, x) & \mathrm{cov}(y, y) \end{pmatrix} = \begin{pmatrix} 0.6166 & 0.6154 \\ 0.6154 & 0.7166 \end{pmatrix}$

The off-diagonal values are positive, indicating that x increases as y increases.

Page 65:

PCA: Example (Smith, 2002)

Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix.

• An eigenvector v of a linear transformation A is a vector that, upon transformation, does not change direction: $Av = \lambda v$.

  $\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}$

• Here, the associated eigenvalue is 4.

• All eigenvectors of a symmetric matrix (such as a covariance matrix) are orthogonal (perpendicular) to each other, so we can re-express the data using the eigenvectors as the new axes!
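A quick NumPy check of this example (illustrative only):

    import numpy as np

    A = np.array([[2.0, 3.0],
                  [2.0, 1.0]])
    v = np.array([3.0, 2.0])
    print(A @ v)                # [12.  8.] == 4 * v, so v is an eigenvector, λ = 4
    print(np.linalg.eig(A)[0])  # both eigenvalues of A: 4 and -1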

Page 66:

PCA: Example (Smith, 2002)

Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix.

• The unit eigenvectors and the corresponding eigenvalues:

  $\text{eigenvalues} = \begin{pmatrix} 0.049 \\ 1.284 \end{pmatrix}, \qquad \text{eigenvectors} = \begin{pmatrix} -0.7352 & -0.6779 \\ 0.6779 & -0.7352 \end{pmatrix}$

Page 67:

PCA: Example (Smith, 2002)

Step 4: Choose components.

• The eigenvector with the highest eigenvalue is the principal component of the dataset.

• Form a matrix of k eigenvectors, ordered by eigenvalue from largest to smallest:

  $W = [\mathrm{eig}_1, \mathrm{eig}_2, \ldots, \mathrm{eig}_k]$

• If k < m, then you are essentially discarding some dimensions, e.g.,

  $W = \begin{pmatrix} -0.6779 & -0.7352 \\ -0.7352 & 0.6779 \end{pmatrix} \; \text{(both components)} \qquad W = \begin{pmatrix} -0.6779 \\ -0.7352 \end{pmatrix} \; \text{(principal component only)}$

Page 68:

PCA: Example (Smith, 2002)

Step 5: Derive the new dataset.

• Multiply the transpose of W on the left of the mean-adjusted dataset, transposed:

  $\begin{pmatrix} -0.6779 & -0.7352 \\ -0.7352 & 0.6779 \end{pmatrix} \begin{pmatrix} 0.69 & -1.31 & 0.39 & 0.09 & 1.29 & 0.49 & 0.19 & -0.81 & -0.31 & -0.71 \\ 0.49 & -1.21 & 0.99 & 0.29 & 1.09 & 0.79 & -0.31 & -0.81 & -0.31 & -1.01 \end{pmatrix}$
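A short NumPy check of Steps 2–5, using only the mean-adjusted data from the slide as input (eigenvector signs may come out flipped relative to the slide):

    import numpy as np

    # Mean-adjusted data from the slide: row 0 is x, row 1 is y.
    data = np.array([
        [0.69, -1.31, 0.39, 0.09, 1.29, 0.49, 0.19, -0.81, -0.31, -0.71],
        [0.49, -1.21, 0.99, 0.29, 1.09, 0.79, -0.31, -0.81, -0.31, -1.01]])

    C = np.cov(data)                  # Step 2: matches [[0.6166, 0.6154], [0.6154, 0.7166]]
    evals, evecs = np.linalg.eigh(C)  # Step 3: eigenvalues 0.049 and 1.284 (ascending)
    W_T = evecs[:, ::-1].T            # Step 4: eigenvectors as rows, largest eigenvalue first
    new_data = W_T @ data             # Step 5: the transformed dataset
    print(C, evals, new_data, sep="\n")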

Pages 69–70: PCA: Example (Smith, 2002). Step 5: Derive the new dataset (image-only slides showing the derived dataset).

Page 71:

PCA: Example (Smith, 2002)

To reconstruct the data:

• We can get exactly the original data back if we use all the eigenvectors,

• but we will have lost some information if we use only some of the eigenvectors.

Page 72:

Uses of Dimensionality Reduction

• Compression

• Visualization: pick dimension = 2 or 3.

• Pre-processing (to avoid the curse of dimensionality)

https://en.wikipedia.org/wiki/Eigenface

Page 73:

What you should know

• K-means clustering and its properties
• Hierarchical clustering and different linkage functions
• How to cluster a toy dataset using K-means / hierarchical clustering
• Procedures and applications of Principal Component Analysis