clustering and data mining in r - workshop...

25
Clustering and Data Mining in R Clustering and Data Mining in R Workshop Supplement Thomas Girke December 10, 2011

Upload: truongdang

Post on 14-Feb-2019

241 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Clustering and Data Mining in RWorkshop Supplement

Thomas Girke

December 10, 2011

Page 2: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Introduction

Data PreprocessingData TransformationsDistance MethodsCluster Linkage

Hierarchical ClusteringApproachesTree Cutting

Non-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques

Page 3: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Introduction

OutlineIntroduction

Data PreprocessingData TransformationsDistance MethodsCluster Linkage

Hierarchical ClusteringApproachesTree Cutting

Non-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques

Page 4: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Introduction

What is Clustering?

I Clustering is the classification of data objects into similaritygroups (clusters) according to a defined distance measure.

I It is used in many fields, such as machine learning, datamining, pattern recognition, image analysis, genomics,systems biology, etc.

Page 5: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Introduction

Why Clustering and Data Mining in R?

I Efficient data structures and functions for clustering.

I Efficient environment for algorithm prototyping andbenchmarking.

I Comprehensive set of clustering and machine learning libraries.

I Standard for data analysis in many areas.

Page 6: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Data Preprocessing

OutlineIntroduction

Data PreprocessingData TransformationsDistance MethodsCluster Linkage

Hierarchical ClusteringApproachesTree Cutting

Non-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques

Page 7: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Data Preprocessing

Data Transformations

Data Transformations Choice depends on data set!

I Center & standardize1. Center: subtract from each vector its mean2. Standardize: devide by standard deviation

⇒ Mean = 0 and STDEV = 1I Center & scale with the scale() fuction

1. Center: subtract from each vector its mean2. Scale: divide centered vector by their root mean square (rms)

xrms =

√√√√ 1

n − 1

n∑i=1

xi 2

⇒ Mean = 0 and STDEV = 1

I Log transformation

I Rank transformation: replace measured values by ranks

I No transformation

Page 8: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Data Preprocessing

Distance Methods

Distance Methods List of most common ones!

I Euclidean distance for two profiles X and Y

d(X ,Y ) =

√√√√ n∑i=1

(xi − yi )2

Disadvantages: not scale invariant, not for negative correlations

I Maximum, Manhattan, Canberra, binary, Minowski, ...

I Correlation-based distance: 1− r

I Pearson correlation coefficient (PCC)

r =n∑n

i=1 xiyi −∑n

i=1 xi∑n

i=1 yi√(∑n

i=1 x2i − (

∑ni=1 xi )

2)(∑n

i=1 y2i − (

∑ni=1 yi )

2)

Disadvantage: outlier sensitiveI Spearman correlation coefficient (SCC)

Same calculation as PCC but with ranked values!

Page 9: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Data Preprocessing

Cluster Linkage

Cluster Linkage

Single Linkage

Complete Linkage

Average Linkage

Page 10: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Hierarchical Clustering

OutlineIntroduction

Data PreprocessingData TransformationsDistance MethodsCluster Linkage

Hierarchical ClusteringApproachesTree Cutting

Non-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques

Page 11: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Hierarchical Clustering

Hierarchical Clustering Steps

1. Identify clusters (items) with closest distance

2. Join them to new clusters

3. Compute distance between clusters (items)

4. Return to step 1

Page 12: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Hierarchical Clustering

Hierarchical Clustering Agglomerative Approach

g1 g2 g3 g4 g50.1

g1 g2 g3 g4 g50.1

0.4

g1 g2 g3 g4 g50.1

0.4

0.6

0.5

(a)

(b)

(c)

Page 13: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Hierarchical Clustering

Approaches

Hierarchical Clustering Approaches

1. Agglomerative approach (bottom-up)

hclust() and agnes()

2. Divisive approach (top-down)

diana()

Page 14: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Hierarchical Clustering

Tree Cutting

Tree Cutting to Obtain Discrete Clusters

1. Node height in tree

2. Number of clusters

3. Search tree nodes by distance cutoff

Page 15: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Non-Hierarchical Clustering

OutlineIntroduction

Data PreprocessingData TransformationsDistance MethodsCluster Linkage

Hierarchical ClusteringApproachesTree Cutting

Non-Hierarchical ClusteringK-MeansPrincipal Component AnalysisMultidimensional ScalingBiclusteringMany Additional Techniques

Page 16: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Non-Hierarchical Clustering

Non-Hierarchical Clustering

Selected Examples

Page 17: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Non-Hierarchical Clustering

K-Means

K-Means Clustering

1. Choose the number of k clusters

2. Randomly assign items to the k clusters

3. Calculate new centroid for each of the k clusters

4. Calculate the distance of all items to the k centroids

5. Assign items to closest centroid

6. Repeat until clusters assignments are stable

Page 18: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Non-Hierarchical Clustering

K-Means

K-Means

X

X

X

XX

X

X

X

X

(a)

(b)

(c)

Page 19: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Non-Hierarchical Clustering

Principal Component Analysis

Principal Component Analysis (PCA)

Principal components analysis (PCA) is a data reduction techniquethat allows to simplify multidimensional data sets to 2 or 3dimensions for plotting purposes and visual variance analysis.

Page 20: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Non-Hierarchical Clustering

Principal Component Analysis

Basic PCA Steps

I Center (and standardize) dataI First principal component axis

I Accross centroid of data cloudI Distance of each point to that line is minimized, so that it

crosses the maximum variation of the data cloud

I Second principal component axisI Orthogonal to first principal componentI Along maximum variation in the data

I 1st PCA axis becomes x-axis and 2nd PCA axis y-axis

I Continue process until the necessary number of principalcomponents is obtained

Page 21: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Non-Hierarchical Clustering

Principal Component Analysis

PCA on Two-Dimensional Data Set

1st

2nd

1st

2nd

Page 22: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Non-Hierarchical Clustering

Principal Component Analysis

Identifies the Amount of Variability between Components

Example

Principal Component 1st 2nd 3rd OtherProportion of Variance 62% 34% 3% rest

1st and 2nd principal components explain 96% of variance.

Page 23: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Non-Hierarchical Clustering

Multidimensional Scaling

Multidimensional Scaling (MDS)

I Alternative dimensionality reduction approach

I Represents distances in 2D or 3D space

I Starts from distance matrix (PCA uses data points)

Page 24: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Non-Hierarchical Clustering

Biclustering

BiclusteringFinds in matrix subgroups of rows and columns which are as similar aspossible to each other and as different as possible to the remaining datapoints.

Unclustered ⇒ Clustered

Page 25: Clustering and Data Mining in R - Workshop Supplementfaculty.ucr.edu/~tgirke/HTML_Presentations/Manuals/Clustering/... · Clustering and Data Mining in R Workshop Supplement Thomas

Clustering and Data Mining in R

Non-Hierarchical Clustering

Many Additional Techniques

Remember: There Are Many Additional Techniques!

Continue with R manual section:”Clustering and Data Mining”