evaluating performance for data mining techniques 1

Evaluating Performance for Data Mining Techniques

1

Evaluating Numeric Output

• Mean absolute error (MAE)

• Mean square error (MSE)

• Root mean square error (RMSE)

2

Mean Absolute Error (MAE)

The average absolute difference between classifier predicted output and actual output.

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\mathrm{Desired}_i - \mathrm{Actual}_i\right|$$

3

Mean Square Error (MSE)

The average of the squared differences between classifier predicted output and actual output.

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{Desired}_i - \mathrm{Actual}_i\right)^2$$

4

Root Mean Square Error (RMSE)

The square root of the mean square error.

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{Desired}_i - \mathrm{Actual}_i\right)^2}$$
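The three error measures can be sketched in a few lines of Python. This is only an illustration: the function names and the sample values are assumptions chosen to keep the arithmetic easy to follow, not part of the original slides.

```python
import math

def mean_absolute_error(desired, actual):
    """Average of |desired_i - actual_i| over all N instances."""
    return sum(abs(d - a) for d, a in zip(desired, actual)) / len(desired)

def mean_square_error(desired, actual):
    """Average of (desired_i - actual_i)^2 over all N instances."""
    return sum((d - a) ** 2 for d, a in zip(desired, actual)) / len(desired)

def root_mean_square_error(desired, actual):
    """Square root of the mean square error."""
    return math.sqrt(mean_square_error(desired, actual))

# Hypothetical outputs, used only for illustration.
desired = [3.0, 5.0, 2.0, 7.0]
actual = [2.0, 5.0, 4.0, 8.0]

mae = mean_absolute_error(desired, actual)      # (1 + 0 + 2 + 1) / 4 = 1.0
mse = mean_square_error(desired, actual)        # (1 + 0 + 4 + 1) / 4 = 1.5
rmse = root_mean_square_error(desired, actual)  # sqrt(1.5)
print(mae, mse, rmse)
```

Because MSE squares each difference, a single large error raises it far more than it raises MAE; RMSE brings the result back to the units of the output.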

5

Clustering Techniques

6


• Clustering Techniques apply some measure of similarity to divide instances of the data to be analyzed into disjoint partitions

• The partitions are generalized by computing a group mean for each cluster or by listing a most typical subset of instances from each cluster

7


• 1st approach: unsupervised clustering

• 2nd approach: to partition data in a hierarchical fashion where each level of the hierarchy is a generalization of the data at some level of abstraction.

8


9

The K-Means Algorithm

• The K-means algorithm is a simple (but widely used) statistical clustering technique, which is used for unsupervised clustering

• The K-means algorithm divides instances of the data to be analyzed into disjoint K partitions (clusters).

• Proposed by S.P. Lloyd in 1957, first published in 1982.

10

The K-Means Algorithm

1. Choose a value for K, the total number of clusters.

2. Randomly choose K points as cluster centers.

3. Assign the remaining instances to their closest cluster center (for example, using Euclidean distance as a criterion).

4. Calculate a new cluster center for each cluster.

5. Repeat steps 3 and 4 until the cluster centers do not change.
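The steps listed here can be sketched as a short Python function. This is a minimal illustration under stated assumptions, not a reference implementation: the name `kmeans`, the `rng` and `max_iter` parameters, and the rule of keeping the old center when a cluster ends up empty are all choices made for this sketch. The sample points are the six instances of Table 3.6.

```python
import math
import random

def kmeans(points, k, rng=None, max_iter=100):
    """Cluster a list of coordinate tuples into k groups (steps 2 to 5)."""
    rng = rng or random.Random()
    centers = rng.sample(points, k)  # step 2: K random instances as centers
    clusters = []
    for _ in range(max_iter):  # max_iter is a safety limit (assumption)
        # Step 3: assign every instance to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Step 4: each new center is the mean of its cluster members.
        # An empty cluster keeps its previous center (assumption).
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        # Step 5: stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# The six instances of Table 3.6.
points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]
centers, clusters = kmeans(points, 2, rng=random.Random(0))
print(centers)
print(clusters)
```

Passing a seeded `random.Random` makes a run repeatable; different seeds may produce different final clusterings.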

11

The K-Means Algorithm: Analysis

• Choose a value for K, the total number of clusters – this step requires an initial discussion about how many clusters can be distinguished within a data set

12


• Randomly choose K points as cluster centers – the initial cluster centers are selected randomly, but this is not essential if K was chosen properly; the resulting clustering in this case should not depend on the selection of the initial cluster centers

13


• Calculate a new cluster center for each cluster – new cluster centers are the means of the cluster members that were assigned to their clusters in the previous step

14

The K-Means Algorithm: Analysis

• Repeat steps 3 and 4 until the cluster centers do not change – the process of instance classification and cluster center computation continues until an iteration of the algorithm shows no change in the cluster centers.

• The algorithm terminates after j iterations if for each cluster Ci all instances found in Ci after iteration j-1 remain in cluster Ci upon the completion of iteration j

15

Euclidean Distance

Euclidean distance between two n-dimensional vectors

$$X = (x_1, \ldots, x_n); \quad Y = (y_1, \ldots, y_n)$$

is determined as

$$D(X, Y) = \sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2}$$
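As a small illustration, the distance can be computed directly from the definition. The function name `euclidean_distance` is an assumption for this sketch; Python 3.8 and later also provide `math.dist`, which returns the same value.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two vectors of equal dimension."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Instances 1 and 6 of Table 3.6: (1.0, 1.5) and (5.0, 6.0).
d = euclidean_distance((1.0, 1.5), (5.0, 6.0))  # sqrt(4.0^2 + 4.5^2) = sqrt(36.25)
print(d)
```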

16

Cluster Quality

• How can we evaluate cluster quality and reliability?

• One evaluation method, which is more suitable for clusters of about equal size, is to calculate the sum of squared errors, that is, the squared differences between the instances of each cluster and their cluster center. Smaller values indicate clusters of higher quality.

17

Cluster Quality

• Another evaluation method is to calculate the mean squared error between the instances of each cluster and their cluster center. Smaller values indicate clusters of higher quality.

18

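Optimal Clustering Criterion

Clustering is considered optimal when the average (taken over all clusters) mean square deviation of the cluster members from their center,

$$\frac{1}{K}\sum_{j=1}^{K}\frac{1}{N_j}\sum_{i=1}^{N_j} D^2\left(\mathrm{Center}_j, X_{ij}\right),$$

is either:

• minimal over several (s) experiments:

$$\min_{s\ \text{experiments}}\ \frac{1}{K}\sum_{j=1}^{K}\frac{1}{N_j}\sum_{i=1}^{N_j} D^2\left(\mathrm{Center}_j^{s}, X_{ij}^{s}\right)$$

• or less than some predetermined acceptable value.

Here K is the number of clusters, N_j is the number of instances in cluster j, X_ij is the i-th instance of cluster j, and D is the Euclidean distance.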
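The criterion can be illustrated with a short Python sketch that scores several clusterings and keeps the one with the smallest value. The function name is an assumption made for this sketch; the data are the instances of Table 3.6 and the three partitions reported in Table 3.7, which appear later in these slides.

```python
def average_mean_square_deviation(clusters):
    """(1/K) * sum over clusters of the mean squared distance to the center."""
    total = 0.0
    for members in clusters:
        center = [sum(col) / len(members) for col in zip(*members)]
        squared = [sum((v - c) ** 2 for v, c in zip(p, center)) for p in members]
        total += sum(squared) / len(members)
    return total / len(clusters)

# Instances of Table 3.6, keyed by instance number.
points = {1: (1.0, 1.5), 2: (1.0, 4.5), 3: (2.0, 1.5),
          4: (2.0, 3.5), 5: (3.0, 2.5), 6: (5.0, 6.0)}

# The three K = 2 partitions listed in Table 3.7.
experiments = {
    1: [[2, 4, 6], [1, 3, 5]],
    2: [[1, 3], [2, 4, 5, 6]],
    3: [[1, 2, 3, 4, 5], [6]],
}

scores = {
    s: average_mean_square_deviation([[points[i] for i in ids] for ids in parts])
    for s, parts in experiments.items()
}
best = min(scores, key=scores.get)
print(scores)  # approximately {1: 2.417, 2: 2.055, 3: 0.96}
print(best)    # 3
```

A single-member cluster contributes zero deviation, which is one reason this measure can favor partitions with very unequal cluster sizes.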

19

An Example Using

the K-Means Algorithm

20

Table 3.6 • K-Means Input Values

Instance   X     Y
1          1.0   1.5
2          1.0   4.5
3          2.0   1.5
4          2.0   3.5
5          3.0   2.5
6          5.0   6.0

21

[Figure: plot of f(x) versus x; horizontal axis x from 0 to 6, vertical axis f(x) from 0 to 7]

22

Table 3.7 • Several Applications of the K-Means Algorithm (K = 2)

Outcome   Cluster Centers   Cluster Points    Square Error
1         (2.67, 4.67)      2, 4, 6           14.50
          (2.00, 1.83)      1, 3, 5
2         (1.5, 1.5)        1, 3              15.94
          (2.75, 4.125)     2, 4, 5, 6
3         (1.8, 2.7)        1, 2, 3, 4, 5     9.60
          (5, 6)            6
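The cluster centers and square error values of Table 3.7 can be checked with a few lines of Python. The sketch assumes that the square error is the sum, over both clusters, of the squared Euclidean distances between each instance and its cluster center; this reading reproduces the three reported values. The helper names are assumptions made for the sketch.

```python
def center_of(members):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(col) / len(members) for col in zip(*members))

def square_error(clusters):
    """Sum of squared distances from every instance to its cluster center."""
    total = 0.0
    for members in clusters:
        center = center_of(members)
        total += sum(sum((v - c) ** 2 for v, c in zip(p, center)) for p in members)
    return total

# Instances of Table 3.6, keyed by instance number.
points = {1: (1.0, 1.5), 2: (1.0, 4.5), 3: (2.0, 1.5),
          4: (2.0, 3.5), 5: (3.0, 2.5), 6: (5.0, 6.0)}

# Cluster memberships of the three outcomes in Table 3.7.
outcomes = {
    1: [[2, 4, 6], [1, 3, 5]],
    2: [[1, 3], [2, 4, 5, 6]],
    3: [[1, 2, 3, 4, 5], [6]],
}

errors = {}
for outcome, parts in outcomes.items():
    clusters = [[points[i] for i in ids] for ids in parts]
    errors[outcome] = square_error(clusters)
    print(outcome, [center_of(c) for c in clusters], round(errors[outcome], 2))
```

Rounded to two decimals, the errors are 14.5, 15.94, and 9.6, matching the table.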

23

[Figure: plot of f(x) versus x; horizontal axis x from 0 to 6, vertical axis f(x) from 0 to 7]

24

Unsupervised Model Evaluation

25

The K-Means Algorithm: General Considerations

• Requires real-valued data.

• We must select the number of clusters present in the data.

• Works best when the clusters that exist in the data are of approximately equal size. If an optimal solution is represented by clusters of unequal size, the K-Means algorithm is not likely to find it.

• Attribute significance cannot be determined.

• A supervised data mining tool must be used to gain insight into the nature of the clusters formed by a clustering tool.

26

Supervised Learning for Unsupervised Model Evaluation

• Designate each formed cluster as a class and assign each class an arbitrary name.

• Choose a random sample of instances from each class for supervised learning.

• Build a supervised model from the chosen instances. Employ the remaining instances to test the correctness of the model.
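The procedure above can be sketched in Python as follows. Several things here are assumptions made for illustration: the function name `evaluate_clusters`, the 50% training fraction, the use of a nearest-centroid classifier as the supervised model (the slides do not name one), and the sample data, which are hypothetical well-separated clusters.

```python
import math
import random

def evaluate_clusters(clusters, train_fraction=0.5, rng=None):
    """Treat each cluster as a class, train on a random sample of each
    class, and return the accuracy on the remaining instances."""
    rng = rng or random.Random()
    train, test = {}, []
    for name, members in clusters.items():
        shuffled = members[:]
        rng.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * train_fraction))
        train[name] = shuffled[:cut]                    # random sample per class
        test += [(p, name) for p in shuffled[cut:]]     # held-out instances

    # Supervised model (assumption): nearest class centroid.
    centroids = {
        name: tuple(sum(col) / len(pts) for col in zip(*pts))
        for name, pts in train.items()
    }

    def predict(p):
        return min(centroids, key=lambda name: math.dist(p, centroids[name]))

    if not test:
        return float("nan")
    correct = sum(predict(p) == name for p, name in test)
    return correct / len(test)

# Hypothetical clustering result; "C1" and "C2" are arbitrary class names.
clusters = {
    "C1": [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (1.0, 2.0)],
    "C2": [(8.0, 8.0), (9.0, 8.5), (8.5, 9.0), (9.0, 9.0)],
}
accuracy = evaluate_clusters(clusters, rng=random.Random(0))
print(accuracy)  # 1.0 for these well-separated clusters
```

High test accuracy suggests the clusters are well defined; low accuracy suggests the clusters overlap or do not reflect real structure in the data.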

27
