4. Cluster (Nov 2013)
-
Data Mining, Chapter 4: Cluster Analysis, Part 1
Introduction to Data Mining with Case Studies. Author: G. K. Gupta. Prentice Hall India, 2006.
-
Cluster Analysis
Unsupervised classification.
Aims to decompose or partition a data set into clusters such that each cluster is similar within itself but as dissimilar as possible from the other clusters.
Inter-cluster (between-group) distance is maximized and intra-cluster (within-group) distance is minimized.
Differs from classification: there are no predefined classes and no training data, and we may not even know how many clusters to expect.
-
Cluster Analysis
Not a new field: it is a branch of statistics, and many old algorithms are available.
Renewed interest due to data mining and machine learning.
New algorithms are being developed, in particular to deal with large amounts of data.
Evaluating a clustering method is difficult.
-
Classification Example
-
Clustering Example (From text by Roiger and Geatz)
-
Cluster Analysis
As in classification, each object has several attributes.
Some methods require that the number of clusters be specified by the user and that a seed to start each cluster be given.
The objects are then clustered based on their similarity.
-
Desired Features of Cluster Analysis
Scalability: to deal with both small and large problems.
Only one scan of the dataset: the cost of I/O is huge, so the data should not be scanned more than once.
Ability to stop and resume: large datasets need a lot of processor time, so it should be possible to stop the task and resume it later.
Minimal input parameters: the user should not be expected to have extensive domain knowledge of the data.
Robustness: able to deal with noise, outliers and missing values.
Ability to discover clusters of different shapes.
Different data types: able to deal with boolean and categorical values as well.
Result independent of the order of the input data.
-
Questions
How to define a cluster?
How to find whether objects are similar or not?
How to define some concept of distance between individual objects and sets of objects?
-
Types of Data
Interval-scaled data - continuous variables on a roughly linear scale, e.g. marks, age, weight. Units can affect the analysis: distance in metres is obviously a larger number than in kilometres. May need to scale or normalise the data (how?).
Binary data - many variables are binary, e.g. gender, married or not, u/g or p/g. How to compute distance between binary variables?
-
Types of Data
Nominal data - similar to binary but can take more than two states, e.g. colour, staff position. How to compute distance between two objects with nominal data?
Ordinal data - similar to nominal but the different values are ordered in a meaningful sequence.
Ratio-scaled data - nonlinear scale data.
-
Distance
A simple, well-understood concept. Distance has the following properties (assume x and y to be vectors):
distance is always non-negative
distance from x to x is zero
distance from x to y cannot be greater than the sum of the distances from x to z and from z to y
distance from x to y is the same as from y to x.
Examples: the absolute value of the difference |x - y| (the Manhattan distance) or the square root of (x - y)^2 (the Euclidean distance).
-
Distance
Three common distance measures are:
1. Manhattan distance, or the sum of the absolute differences: D(x, y) = sum_i |x_i - y_i|
2. Euclidean distance: D(x, y) = sqrt( sum_i (x_i - y_i)^2 )
3. Maximum difference: D(x, y) = max_i |x_i - y_i|
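A minimal sketch of these three measures in Python (the function names are my own, not from the text):

```python
import math

def manhattan(x, y):
    # Sum of absolute differences over all attributes.
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    # Square root of the sum of squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def max_difference(x, y):
    # Largest absolute difference over any single attribute.
    return max(abs(a - b) for a, b in zip(x, y))
```

For the student records used later, e.g. S1 = (18, 73, 75, 57) and S4 = (20, 55, 55, 55), manhattan gives 42 and euclidean about 27.1, matching the worked tables.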
-
Distance between Clusters
A more difficult concept. Common choices:
Single link
Complete link
Average link
We will return to this.
-
Categorical Data
Difficult to find a sensible distance measure between, say, China and Australia, or between a male and a female, or between a student doing BIT and one doing MIT.
One possibility is to convert all categorical data into numeric data (does that make sense?). Another possibility is to treat the distance between two categorical values as a function that only takes the value 1 or 0.
-
Distance
Using the approach of treating the distance between two categorical values as 1 or 0, it is possible to compare two objects whose attribute values are all categorical.
Essentially it is just counting how many attribute values are the same.
Consider the example on the next page and compare one patient with another.
-
Distance (From text by Roiger and Geatz)
-
Types of Data
One possibility for binary attributes:
q = number of attributes for which both objects have 1 (or yes)
t = number of attributes for which both objects have 0 (or no)
r, s = counts of the attributes for which the values do not match
d = (r + s) / (q + r + s + t)
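This coefficient can be sketched as follows (the function name is mine, not from the text):

```python
def binary_distance(a, b):
    # a, b: equal-length sequences of 0/1 attribute values.
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # both 1 (yes)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)  # both 0 (no)
    r_plus_s = sum(1 for x, y in zip(a, b) if x != y)      # mismatches
    return r_plus_s / (q + r_plus_s + t)
```

For example, binary_distance([1, 1, 0, 0], [1, 0, 0, 1]) is 0.5, since two of the four attributes match.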
-
Distance Measure
We need to look carefully at the scales of the different attribute values. For example, GPA varies from 0 to 4 while the age attribute varies from, say, 20 to 50.
The total distance, the sum of the distances between all attribute values, is then dominated by the age attribute and is affected little by variations in GPA. To avoid such problems the data should be scaled or normalized (perhaps so that all data varies between -1 and 1).
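One common way to do this is min-max scaling; a sketch, assuming the [-1, 1] target range suggested above:

```python
def rescale(values, new_min=-1.0, new_max=1.0):
    # Linearly map the observed range [min, max] onto [new_min, new_max].
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]
```

After rescaling, an age column spanning 20-50 and a GPA column spanning 0-4 both vary between -1 and 1, so neither dominates the total distance.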
-
Distance Measure
The distance measures we have discussed so far give equal importance to all attributes.
One may want to assign weights to the attributes indicating their importance. The distance measure then needs to use these weights in computing the total distance.
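For example, a weighted Euclidean distance (a sketch; the weight vector is an assumption chosen by the analyst):

```python
import math

def weighted_euclidean(x, y, weights):
    # weights[i] expresses the relative importance of attribute i.
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))
```

With all weights equal to 1 this reduces to the ordinary Euclidean distance; setting a weight to 0 removes that attribute from the comparison entirely.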
-
Clustering
It is not always easy to predict how many clusters to expect from a set of data.
One may choose a number of clusters based on some knowledge of the data. The result may be an acceptable set of clusters. This does not, however, mean that a different number of clusters would not provide an even better set of clusters. Choosing the number of clusters is a difficult problem.
-
Types of Methods
Partitioning methods - given n objects, make k (k <= n) partitions (clusters) of the data and use an iterative relocation method. It is assumed that each cluster has at least one object and each object belongs to only one cluster.
Hierarchical methods - start with one cluster and then split it into smaller clusters, or start with each object in an individual cluster and then try to merge similar clusters.
-
Types of Methods
Density-based methods - for each data point in a cluster, at least a minimum number of points must exist within a given radius.
Grid-based methods - the object space is divided into a grid.
Model-based methods - a model is assumed, perhaps based on a probability distribution.
-
Partitional Methods
The K-Means Method: perhaps the most commonly used method.
Select the number of clusters k, then choose k seeds randomly as the starting points of these clusters.
Each object is assigned to the cluster whose seed is closest to it, based on some distance measure. Once all the members have been allocated, the mean value of each cluster is computed and these means essentially become the new seeds.
-
The K-Means Method
Using the mean value of each cluster, all the members are now re-allocated to the clusters. In most situations some members will change clusters, unless the first guesses of the seeds were very good. This process continues until no changes take place in the cluster memberships. Different starting seeds obviously may lead to different clusters.
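The loop described above can be sketched as follows (a minimal illustration, not the book's code; Euclidean distance is used, as K-Means requires):

```python
import math
import random

def kmeans(points, k, max_iter=100):
    # Choose k of the objects at random as the starting seeds.
    means = random.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Allocate each object to the cluster with the nearest mean.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, means[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no membership changed, so the clusters are stable
        assignment = new_assignment
        # The mean of each cluster's members becomes its new seed.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                means[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, means
```

Because the seeds are random, different runs may settle into different local minima, as the slides note.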
-
The K-Means Method
Essentially the algorithm tries to build clusters with a high level of similarity within clusters and a low level of similarity between clusters. Similarity measurement is based on the mean values and the algorithm tries to minimize the squared-error function.
This method requires that the means of the variables can be computed.
The method is scalable and efficient. The algorithm does not find a global minimum; rather, it terminates at a local minimum.
-
Example

Student  Age  Marks1  Marks2  Marks3
S1       18   73      75      57
S2       18   79      85      75
S3       23   70      70      52
S4       20   55      55      55
S5       22   85      86      87
S6       19   91      90      89
S7       20   70      65      65
S8       21   53      56      59
S9       19   82      82      60
S10      40   76      60      78
-
Example
Let the three seeds be the first three records:

Student  Age  Mark1  Mark2  Mark3
S1       18   73     75     57
S2       18   79     85     75
S3       23   70     70     52

Now we compute the distances between the objects based on the four attributes. K-Means requires Euclidean distances, but we use the sum of absolute differences as well, to show how the distance metric can change the results. The table on the next page shows the distances and the allocation of each object to its nearest seed.
-
Distances

Student  Age  Marks1  Marks2  Marks3  Manhattan  Cluster  Euclidean  Cluster
S1       18   73      75      57      Seed 1     C1       Seed 1     C1
S2       18   79      85      75      Seed 2     C2       Seed 2     C2
S3       23   70      70      52      Seed 3     C3       Seed 3     C3
S4       20   55      55      55      42/76/36   C3       27/43/22   C3
S5       22   85      86      87      57/23/67   C2       34/14/41   C2
S6       19   91      90      89      66/32/82   C2       40/19/47   C2
S7       20   70      65      65      23/41/21   C3       13/24/14   C1
S8       21   53      56      59      44/74/40   C3       28/42/23   C3
S9       19   82      82      60      20/22/36   C1       12/16/19   C1
S10      47   75      76      77      60/52/58   C2       33/34/32   C3

(Distances are given to Seed 1 / Seed 2 / Seed 3.)
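The S4 row of the table can be checked with a few lines of Python (a verification sketch using the distance definitions given earlier):

```python
import math

seeds = {"C1": (18, 73, 75, 57), "C2": (18, 79, 85, 75), "C3": (23, 70, 70, 52)}
s4 = (20, 55, 55, 55)

# Manhattan: sum of absolute differences to each seed.
manhattan = {c: sum(abs(a - b) for a, b in zip(s4, s)) for c, s in seeds.items()}
# Euclidean: straight-line distance to each seed, rounded as in the table.
euclidean = {c: round(math.dist(s4, s)) for c, s in seeds.items()}

print(manhattan)  # {'C1': 42, 'C2': 76, 'C3': 36} -> S4 goes to C3
print(euclidean)  # {'C1': 27, 'C2': 43, 'C3': 22} -> S4 goes to C3
```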
-
Example
Note the different allocations made by the two distance metrics. K-Means uses the Euclidean distance.
The means of the new clusters are also different, as shown on the next slide. We now start with the new means and compute Euclidean distances. The process continues until there is no change.
-
Cluster Means

Manhattan:
Cluster  Age   Mark1  Mark2  Mark3
C1       18.5  77.5   78.5   58.5
C2       24.8  82.8   80.3   82
C3       21    62     61.5   57.8

Original seeds:
Seed 1   18    73     75     57
Seed 2   18    79     85     75
Seed 3   23    70     70     52

Euclidean:
C1       19.0  75.0   74.0   60.7
C2       19.7  85.0   87.0   83.7
C3       26.0  63.5   60.3   60.8
-
Distances

Cluster  Age   Marks1  Marks2  Marks3
C1       18.5  77.5    78.5    58.5
C2       24.8  82.8    80.3    82
C3       21    62      61.5    57.8

Student  Age  Marks1  Marks2  Marks3  Distance  Cluster
S1       18   73      75      57      8/52/36   C1
S2       18   79      85      75      30/18/63  C2
S3       23   70      70      52      22/67/28  C1
S4       20   55      55      55      46/91/26  C3
S5       22   85      86      87      51/7/78   C2
S6       19   91      90      89      60/15/93  C2
S7       20   70      65      60      19/56/22  C1
S8       21   53      56      59      44/89/22  C3
S9       19   82      82      60      16/32/48  C1
S10      47   75      76      77      52/63/43  C3

(Distances are given to C1 / C2 / C3.)
-
Comments
Euclidean distance is used in the last slide since that is what K-Means requires.
The results of using the Manhattan distance and the Euclidean distance are quite different.
The last iteration changed the cluster membership of S3 from C3 to C1. New means are now computed, followed by the computation of new distances. The process continues until there is no change.
-
Distance Metric
The most important issue in methods like K-Means is the distance metric, although K-Means is married to the Euclidean distance.
There is no real definition of what makes a metric a good one, but the example shows that different metrics, used on the same data, can produce different results, making it very difficult to decide which result is the best.
Whatever metric is used is somewhat arbitrary, although extremely important. Euclidean distance is used in the next example.
-
Example
This example deals with data about countries.
Three seeds, France, Indonesia and Thailand, were chosen randomly. Euclidean distances were computed and nearest neighbours were put together in clusters.
The result is shown on the second slide. Another iteration did not change the clusters.
-
Another Example
-
Example
-
Notes
The scales of the attributes in the examples are different; this impacts the results.
The attributes are not independent. For example, there is a strong correlation between Olympic medals and income.
The examples are very simple and too small to illustrate the methods fully.
The number of clusters can be difficult to choose since there is no generally accepted procedure for determining it. Maybe use trial and error, then decide based on within-group and between-group distances.
How to interpret the results?
-
Validating Clusters
A difficult problem.
Statistical testing may be possible: one could test the hypothesis that there are no clusters.
Maybe data is available where the clusters are known.
-
Hierarchical Clustering
Quite different from partitioning. Involves gradually merging objects into clusters (called agglomerative) or dividing large clusters into smaller ones (called divisive).
We consider one method.
-
Agglomerative Clustering
The algorithm normally starts with each cluster consisting of a single data point. Using a measure of distance, the algorithm merges the two clusters that are nearest, thus reducing the number of clusters. The process continues until all the data points are in one cluster.
The algorithm requires that we be able to compute distances between two objects and also between two clusters.
-
Distance between Clusters
Single link method (nearest neighbour) - the distance between two clusters is the distance between the two closest points, one from each cluster. Chains can be formed (why?) and a long string of points may be assigned to the same cluster.
Complete linkage method (furthest neighbour) - the distance between two clusters is the distance between the two furthest points, one from each cluster. Does not allow chains.
-
Distance between Clusters
Centroid method - the distance between two clusters is the distance between the two centroids, or centres of gravity, of the two clusters.
Unweighted pair-group average method - the distance between two clusters is the average distance between all pairs of objects in the two clusters. This means p*n distances need to be computed if p and n are the numbers of objects in the two clusters.
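The four inter-cluster measures can be sketched as follows (function names are mine; Euclidean point distance is assumed):

```python
import math

def single_link(A, B):
    # Distance between the two closest points, one from each cluster.
    return min(math.dist(a, b) for a in A for b in B)

def complete_link(A, B):
    # Distance between the two furthest points, one from each cluster.
    return max(math.dist(a, b) for a in A for b in B)

def average_link(A, B):
    # Average over all len(A) * len(B) pairwise distances.
    return sum(math.dist(a, b) for a in A for b in B) / (len(A) * len(B))

def centroid_link(A, B):
    # Distance between the two centres of gravity.
    ca = tuple(sum(col) / len(A) for col in zip(*A))
    cb = tuple(sum(col) / len(B) for col in zip(*B))
    return math.dist(ca, cb)
```

For any two clusters, single link <= average link <= complete link, which is one way to see why single link is prone to chaining.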
-
Algorithm
1. Each object is assigned to a cluster of its own.
2. Build a distance matrix by computing distances between every pair of objects.
3. Find the two nearest clusters.
4. Merge them and remove them from the list.
5. Update the matrix by including the distance of all objects from the new cluster.
6. Continue until all objects belong to one cluster.
The next slide shows an example of the final result of hierarchical clustering.
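The steps above can be sketched as follows (a minimal illustration; `cluster_dist` is any of the inter-cluster measures discussed earlier, supplied by the caller):

```python
def agglomerative(points, cluster_dist):
    # Step 1: each object starts in a cluster of its own.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Steps 2-3: find the pair of clusters that are nearest.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_dist(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        # Steps 4-5: merge them and carry on with one cluster fewer.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges  # the sequence of merges defines the dendrogram
```

For example, with a single-link Manhattan `cluster_dist`, the 1-D points 0, 1, 10, 11 first merge 0 with 1, then 10 with 11, then the two pairs.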
-
Hierarchical Clustering
http://www.resample.com/xlminer/help/HClst/HClst_ex.htm
-
Example
We now consider the simple example about students' marks, using the Manhattan distance as the distance between two objects.
We will use the centroid method for the distance between clusters.
We first calculate a matrix of distances.
-
Example

Student  Age  Marks1  Marks2  Marks3
S1       18   73      75      57
S2       18   79      85      75
S3       23   70      70      52
S4       20   55      55      55
S5       22   85      86      87
S6       19   91      90      89
S7       20   70      65      60
S8       21   53      56      59
S9       19   82      82      60
S10      47   75      76      77
-
Distance Matrix
-
Example
The matrix gives the distance of each object from every other object.
The smallest distance, 8, is between object S4 and object S8. They are combined into a cluster, which is placed where S4 was.
Distances from this cluster are computed and the distance matrix is updated.
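The claim about the closest pair can be verified directly (a check sketch using the table's data and the Manhattan distance):

```python
students = {
    "S1": (18, 73, 75, 57), "S2": (18, 79, 85, 75), "S3": (23, 70, 70, 52),
    "S4": (20, 55, 55, 55), "S5": (22, 85, 86, 87), "S6": (19, 91, 90, 89),
    "S7": (20, 70, 65, 60), "S8": (21, 53, 56, 59), "S9": (19, 82, 82, 60),
    "S10": (47, 75, 76, 77),
}

def manhattan(x, y):
    # Sum of absolute differences over the four attributes.
    return sum(abs(a - b) for a, b in zip(x, y))

# All pairwise distances, then the smallest one.
names = list(students)
pairs = [(manhattan(students[a], students[b]), a, b)
         for i, a in enumerate(names) for b in names[i + 1:]]
print(min(pairs))  # (8, 'S4', 'S8')
```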
-
Updated Matrix
-
Note
The smallest distance is now 15, between objects S5 and S6. They are combined into a cluster, and S5 and S6 are removed.
Distances from this cluster are computed and the distance matrix is updated.
-
Updated Matrix
-
Next
Look at the shortest distance again. S3 and S7 are a distance of 16 apart. We merge them and put the cluster C3 where S3 was. The updated distance matrix is given on the next slide. It shows that C2 and S1 have the smallest distance and are merged in the next step.
We stop short of finishing the example.
-
Updated Matrix
-
Another Example
-
Distance Matrix (Euclidean Distance)
-
Next Steps
Malaysia and the Philippines have the shortest distance (France and the UK are also very close).
In the next step we combine Malaysia and the Philippines and replace the two countries by this cluster.
Then update the distance matrix.
Then choose the shortest distance, and so on, until all countries are in one large cluster.
-
Divisive Clustering
The opposite of the previous method; less commonly used.
We start with one cluster that has all the objects and seek to split it into components, which are themselves then split into even smaller components.
The divisive method may be based on using one variable at a time or all the variables together.
The algorithm terminates when a termination condition is met or when each cluster has only one object.
-
Hierarchical Clustering
The ordering produced can be useful in gaining some insight into the data.
The major difficulty is that once an object is allocated to a cluster it cannot be moved to another cluster, even if the initial allocation was incorrect.
Different distance metrics can produce different results.
-
Challenges
Scalability - would the method work with millions of objects (e.g. cluster analysis of Web pages)?
Ability to deal with different types of attributes.
Discovery of clusters with arbitrary shape - algorithms often discover spherical clusters, which are not always suitable.
-
Challenges
Minimal requirements for domain knowledge to determine input parameters - input parameters are often required, e.g. how many clusters. Algorithms can be sensitive to these inputs.
Ability to deal with noisy or missing data.
Insensitivity to the order of input records - the order of the input should not influence the result of the algorithm.
-
Challenges
High dimensionality - data can have many attributes, e.g. Web pages.
Constraint-based clustering - sometimes clusters may need to satisfy constraints.