
Applied Machine Learning
Lecture 10: Unsupervised learning

Selpi (selpi@chalmers.se)

The slides are further development of Richard Johansson's slides

February 25, 2020

Overview

Different learning approaches

Unsupervised learning

Clustering

modeling distributions

dimensionality reduction

Different learning approaches based on supervision


Overview

Different learning approaches

Unsupervised learning

Clustering

modeling distributions

dimensionality reduction

Unsupervised learning

- Why do this?
  - hand-labeled data might not be available, or might be expensive and time-consuming to produce
- Goal:
  - to discover structure in the data, in order to summarise, explore, visualise, and understand it

Applications of unsupervised learning

Main types of unsupervised approaches

- grouping the data: clustering
- modeling the statistical distribution of the data
- dimensionality reduction: projecting a high-dimensional dataset to a lower dimensionality

Overview

Different learning approaches

Unsupervised learning

Clustering

modeling distributions

dimensionality reduction

clustering

- clustering describes the data by forming groups (partitions) or hierarchies

[figures: a scatter plot of an example clustering, and a dendrogram of a hierarchical clustering whose leaves are labelled with language codes (lv et fi lt mt pl cs sk sl ro en fr it es pt hu de nl da sv)]

What do we need to be able to cluster data?

- Euclidean distance
- Manhattan distance

(both are illustrated in the short sketch below)
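A minimal sketch (not from the original slides) of computing both distance measures with scikit-learn's pairwise_distances; the two toy points are made up for illustration:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# two toy points, chosen so the distances are easy to check by hand
X = np.array([[0.0, 0.0],
              [3.0, 4.0]])

# Euclidean (L2) distance: sqrt(3^2 + 4^2) = 5
print(pairwise_distances(X, metric="euclidean"))

# Manhattan (L1) distance: |3| + |4| = 7
print(pairwise_distances(X, metric="manhattan"))
```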

Different clustering methods

- Partitional
  - K-means
  - Graph-based
  - EM
- Hierarchical
  - Single linkage
  - Average linkage
  - Median linkage
  - Complete linkage
  - Centroid linkage
  - Ward's method
- ...
- Neural network

in scikit-learn

http://scikit-learn.org/stable/modules/clustering.html

flat clustering

[figures: two scatter plots illustrating flat clusterings of example data]

flat clustering methods: overview

K-means (1)

- each cluster is represented by its centroid µ_i
- minimize the within-cluster sum of squares:

  argmin_S ∑_{i=1}^{k} ∑_{x ∈ S_i} ‖x − µ_i‖²

Randomly initialise k cluster centroids
repeat until converged:
    assign each training example to the cluster with the nearest centroid
    move each centroid to the centre (mean) of its cluster

K-means (2)

- How to determine the number of clusters?
  - elbow method, trial and error, use domain knowledge
- Advantages
  - simple, easy to understand, fast, robust enough, good for distinct clusters
- Disadvantages
  - relies on the choice of k, can get stuck in local minima, depends on initialisation, will cluster even uniform data, cannot handle outliers, sensitive to scaling
- See Demo! (a minimal code sketch is also given below)
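As an illustration (not part of the original slides), a minimal scikit-learn sketch of fitting K-means; the within-cluster sum of squares is exposed as inertia_, which is what the elbow method plots against k. The blob data is synthetic:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with three well-separated clusters (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# elbow method: record the within-cluster sum of squares for a range of k
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)   # look for the k where the decrease flattens out

# fit the chosen model and read off assignments and centroids
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = km.labels_
centroids = km.cluster_centers_
```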


DBSCAN

[source]

Parameters: eps and min_samples

1) Randomly select a point P.
2) Retrieve all points directly density-reachable from P w.r.t. eps (the radius to search for neighbours).
   - If P is a core point, a cluster is formed. Recursively find all points density-connected to P and assign them to the same cluster as P.
   - If P is not a core point, iterate through the remaining unvisited points in the dataset.
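A minimal scikit-learn sketch of DBSCAN (not from the slides), using the two parameters eps and min_samples described above; the moon-shaped toy data is chosen because K-means handles it poorly:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# non-convex toy data where density-based clustering shines
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# eps = neighbourhood radius, min_samples = points required to form a core point
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

labels = db.labels_   # one cluster index per point; -1 marks noise/outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```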

DBSCAN example

[figure: scatter plot of an example DBSCAN clustering]

See demo!

Discussion

- why is clustering hard?
  - difficult to interpret and evaluate the results

Evaluating clustering algorithms

- how could we evaluate the result produced by a clustering algorithm?
  - check the silhouette score
  - compare with gold standard data (if any)

Evaluating flat clustering (1): internal consistency

[figures: two scatter plots of example clusterings, illustrating internal consistency]

Cluster evaluation: the silhouette score

- Measures how close each object is to its own cluster compared to other clusters.
- For each data point x_i, the silhouette score is defined as

  s_i = (b_i − a_i) / max(a_i, b_i)

  where
  - a_i is the average distance to the other members of the same cluster
  - b_i is the minimal average distance to another cluster
- then we take the average over all the data points
- in scikit-learn: sklearn.metrics.silhouette_score (sketched below)
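A minimal sketch of using the silhouette score in scikit-learn (not from the slides); here K-means is run on synthetic blob data and the average silhouette is compared across candidate values of k:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# synthetic data with three clusters (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# a higher average silhouette suggests a better clustering
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))
```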

example

[figures: two example clusterings of the same data, one with silhouette = 0.68 and one with silhouette = 0.42]

Evaluating flat clustering (2): comparing to a gold standard

- scikit-learn contains several evaluation metrics for this scenario
  - http://scikit-learn.org/stable/modules/classes.html#clustering-metrics
  - http://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation

- for instance, the adjusted Rand score:

[source]
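A minimal illustration (not from the slides) of the adjusted Rand score in scikit-learn; the gold and predicted labels below are invented toy values, chosen to show that the score only cares about the grouping, not about how the clusters are numbered:

```python
from sklearn.metrics import adjusted_rand_score

# gold-standard labels vs. predicted cluster indices (toy example)
gold = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 2]

# same grouping under a different numbering -> perfect score of 1.0
print(adjusted_rand_score(gold, pred))
```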

hierarchical clustering

See demo!
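Since the demo itself is not included in the slides, here is a minimal scikit-learn sketch of agglomerative (bottom-up) hierarchical clustering with Ward's method, on synthetic blob data:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# small synthetic dataset (illustrative only)
X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Ward's method; other linkage options include "single", "average", "complete"
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(agg.labels_)
```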

Co-clustering

[source]

Overview

Different learning approaches

Unsupervised learning

Clustering

modeling distributions

dimensionality reduction

modeling distributions

- we are given a dataset and now we need to
  - visualize the distribution
  - find the most probable region in the data
  - randomly generate new synthetic data from the same distribution
  - are there any exceptional data points?
  - for some new data point x, does it seem plausible that it comes from the same distribution?

[source]

- we need to model the statistical distribution of the data
  - how would we solve this using an approach we've seen in a basic stats class?
  - . . . and what are the limitations of that method?


kernel density estimation (2)

[source]
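A minimal sketch of kernel density estimation with scikit-learn's KernelDensity (not from the slides); the one-dimensional sample and the bandwidth value are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# 1-D sample from a mixture of two Gaussians (illustrative only)
rng = np.random.RandomState(0)
X = np.concatenate([rng.normal(-2, 0.5, 200),
                    rng.normal(3, 1.0, 300)]).reshape(-1, 1)

# Gaussian-kernel density estimate; bandwidth controls the amount of smoothing
kde = KernelDensity(kernel="gaussian", bandwidth=0.4).fit(X)

# score_samples returns log-density; low values flag unusual points
grid = np.linspace(-5, 7, 5).reshape(-1, 1)
print(np.exp(kde.score_samples(grid)))
```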

novelty detection, anomaly detection, . . .

[source]

Applications: fraud detection, cyber security, manufacturing processes, machine monitoring.

Anomaly detection vs supervised

- Anomaly detection: very small number of positive examples, large number of negative examples; future anomalies are unlikely to be similar to the training set
- Supervised: large number of positives and negatives; future examples are likely to be similar to the training set

in scikit-learn

http://scikit-learn.org/stable/modules/outlier_detection.html

one-class SVM

- the one-class SVM tries to find a decision boundary that encloses the training data (except a fraction ν)

[source]

- it can be used for novelty and outlier detection (see the sketch below)
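A minimal scikit-learn sketch of a one-class SVM used for novelty detection (not from the slides); the training sample and the value of ν (nu) are illustrative:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# "normal" training data concentrated around the origin (illustrative only)
rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(200, 2)

# nu is (an upper bound on) the fraction of training points left outside the boundary
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

X_new = np.array([[0.1, -0.2],   # similar to the training data
                  [4.0, 4.0]])   # far away: should be flagged
print(oc_svm.predict(X_new))     # +1 = inlier, -1 = novelty/outlier
```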

one-class SVM (formally)

Anomaly detection - choosing features

- Plot a histogram of the data and check whether it looks Gaussian. If it does not, apply a transformation that makes the data look more Gaussian (e.g., take log(x) and plot the data again), as sketched below.
- log(x) can be replaced with some other function; whichever transformation makes the data look most Gaussian is the one whose output you use as a feature.
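A small illustrative sketch of this recipe (not from the slides): the skewed log-normal feature below is made up, and the point is simply to compare histograms of the raw and log-transformed values:

```python
import numpy as np
import matplotlib.pyplot as plt

# a skewed, clearly non-Gaussian feature (illustrative only)
rng = np.random.RandomState(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.hist(x, bins=50)            # raw feature: long right tail
ax2.hist(np.log(x), bins=50)    # log-transformed feature: roughly Gaussian
plt.show()
```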

evaluation of anomaly detection systems

- how do you think it should be done?

Overview

Different learning approaches

Unsupervised learning

Clustering

modeling distributions

dimensionality reduction

dimensionality reduction

- reducing a high-dimensional dataset to a low-dimensional one
- why?
  - visualizing, understanding
  - reducing the need for storage
  - making supervised/unsupervised algorithms run faster
  - making supervised/unsupervised learning easier

Principal Component Analysis

[source]
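A minimal scikit-learn PCA sketch (not from the slides), projecting the four-dimensional Iris data onto its two principal components:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                     # 150 samples, 4 features

# project onto the two directions of largest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                        # (150, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component
```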

autoencoders: dimensionality reduction using a NN

- for several Keras implementations, see https://blog.keras.io/building-autoencoders-in-keras.html (a minimal sketch is also given below)
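A minimal sketch in the spirit of the Keras blog post linked above: a single-hidden-layer dense autoencoder. The dimensions (e.g. 784 for flattened 28×28 images) and the loss are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

input_dim, encoding_dim = 784, 32        # compress 784 dimensions down to 32

inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(encoding_dim, activation="relu")(inputs)    # encoder
decoded = layers.Dense(input_dim, activation="sigmoid")(encoded)   # decoder

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# train the network to reconstruct its own input:
#   autoencoder.fit(X, X, epochs=..., batch_size=...)
# the 32-dimensional code is the reduced representation of the data
encoder = keras.Model(inputs, encoded)
```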

Review from today's lecture

- Characterise essential differences between supervised, unsupervised, and semi-supervised learning approaches
- Discuss several unsupervised learning methods
- Compare and evaluate different clustering methods
- Compare and evaluate different distance measures

Next two lectures

- 28 Feb: Ethics in Machine Learning
  - will be given by Vilhelm Verendel (https://www.chalmers.se/en/staff/Pages/vilhelm-gustav-verendel.aspx)

- 3 March: Dealing with time series data
