Evaluating Performance for Data Mining Techniques 1

Upload: russell-willis

Post on 24-Dec-2015


Evaluating Numeric Output

• Mean absolute error (MAE)

• Mean square error (MSE)

• Root mean square error (RMSE)


Mean Absolute Error (MAE)

The average absolute difference between classifier predicted output and actual output.

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\mathrm{Desired}_i - \mathrm{Actual}_i\right|$$


Mean Square Error (MSE)

The average of the squared differences between classifier predicted output and actual output.

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{Desired}_i - \mathrm{Actual}_i\right)^2$$


Root Mean Square Error (RMSE)

The square root of the mean square error.

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\mathrm{Desired}_i - \mathrm{Actual}_i\right)^2}$$
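The three measures can be computed directly from their definitions. The following is a minimal Python sketch; the function names and the sample values are illustrative assumptions and do not come from the slides.

```python
import math

def mean_absolute_error(desired, actual):
    """Average of |desired_i - actual_i| over all N instances."""
    return sum(abs(d - a) for d, a in zip(desired, actual)) / len(desired)

def mean_square_error(desired, actual):
    """Average of (desired_i - actual_i)^2 over all N instances."""
    return sum((d - a) ** 2 for d, a in zip(desired, actual)) / len(desired)

def root_mean_square_error(desired, actual):
    """Square root of the mean square error."""
    return math.sqrt(mean_square_error(desired, actual))

# Made-up values for illustration only (not from the slides).
desired = [3.0, 5.0, 2.0, 7.0]
actual = [2.0, 5.0, 4.0, 8.0]

mae = mean_absolute_error(desired, actual)
mse = mean_square_error(desired, actual)
rmse = root_mean_square_error(desired, actual)
print(mae, mse, rmse)
```

Because the errors are squared before averaging, MSE and RMSE penalize a few large errors more heavily than MAE does.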


Clustering Techniques


• Clustering techniques apply some measure of similarity to divide the instances of the data to be analyzed into disjoint partitions

• The partitions are generalized by computing a group mean for each cluster or by listing a most typical subset of instances from each cluster


• 1st approach: unsupervised clustering

• 2nd approach: hierarchical clustering, which partitions the data so that each level of the hierarchy is a generalization of the data at some level of abstraction.


The K-Means Algorithm

• The K-means algorithm is a simple (but widely used) statistical clustering technique, which is used for unsupervised clustering

• The K-means algorithm divides the instances of the data to be analyzed into K disjoint partitions (clusters).

• Proposed by S.P. Lloyd in 1957, first published in 1982.


The K-Means Algorithm

1. Choose a value for K, the total number of clusters.

2. Randomly choose K points as cluster centers.

3. Assign the remaining instances to their closest cluster center (for example, using Euclidean distance as a criterion).

4. Calculate a new cluster center for each cluster.

5. Repeat steps 3-4 until the cluster centers do not change.
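The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation. The function names, the `seed` and `max_iter` parameters, the handling of an empty cluster and the sample points are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, seed=None, max_iter=100):
    """Return (centers, clusters) for a list of points given as tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 2: random initial centers
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign every instance to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # Step 4: the new center of each cluster is the mean of its members.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])   # assumption: keep the old center if a cluster is empty
        # Step 5: stop when the cluster centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Made-up, well-separated points for illustration (not from the slides).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, clusters = k_means(points, k=2, seed=0)
print(centers)
print(clusters)
```

The sketch compares squared distances, which gives the same nearest center as the Euclidean distance itself and avoids the square root.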


The K-Means Algorithm: Analysis

• Choose a value for K, the total number of clusters – this step requires an initial discussion about how many clusters can be distinguished within a data set


• Randomly choose K points as cluster centers – the initial cluster centers are selected randomly. If K was chosen properly, the resulting clustering should ideally not depend on the selection of the initial cluster centers; in practice, different initial centers can lead to different final clusters, so the algorithm is often run several times


• Calculate a new cluster center for each cluster – the new cluster centers are the means of the instances that were assigned to each cluster in the previous step

• Repeat steps 3-4 until the cluster centers do not change – the process of instance classification and cluster center computation continues until an iteration of the algorithm shows no change in the cluster centers.

• The algorithm terminates after j iterations if, for each cluster Ci, all instances found in Ci after iteration j-1 remain in cluster Ci upon the completion of iteration j.


Euclidean Distance

The Euclidean distance between two n-dimensional vectors $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$ is determined as

$$D(X, Y) = \sqrt{(x_1 - y_1)^2 + \dots + (x_n - y_n)^2}$$
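As a small illustration, the distance can be written as a one-line Python function. The function name and the sample vectors are assumptions chosen for the example.

```python
import math

def euclidean_distance(x, y):
    """D(X, Y) = sqrt((x_1 - y_1)^2 + ... + (x_n - y_n)^2)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Made-up vectors for illustration: the legs 3 and 4 give a distance of 5.
d = euclidean_distance((0.0, 0.0), (3.0, 4.0))
print(d)  # 5.0
```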


Cluster Quality

• How can we evaluate the quality of a clustering and its reliability?

• One evaluation method, which is more suitable for clusters of about equal size, is to calculate the sum of squared differences between the instances of each cluster and their cluster center. Smaller values indicate clusters of higher quality.

• Another evaluation method is to calculate the mean squared difference between the instances of each cluster and their cluster center. Smaller values indicate clusters of higher quality.


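Optimal Clustering Criterion

Clustering is considered optimal when the average (taken over all clusters) mean square deviation of the cluster members from their center,

$$\frac{1}{K}\sum_{j=1}^{K}\frac{1}{N_j}\sum_{i=1}^{N_j} D^2\left(\mathrm{Center}_j, X_{ij}\right),$$

is either:

minimal over several (s) experiments,

$$\min_{s\ \text{experiments}}\ \frac{1}{K}\sum_{j=1}^{K}\frac{1}{N_j}\sum_{i=1}^{N_j} D^2\left(\mathrm{Center}_j^{s}, X_{ij}^{s}\right),$$

or less than some predetermined acceptable value.

Here K is the number of clusters, $N_j$ is the number of instances in cluster j, $X_{ij}$ is the i-th instance of cluster j, and D is the Euclidean distance.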
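The criterion can be evaluated for several candidate clusterings, keeping the one with the smallest value. The sketch below is only an illustration. It assumes that each cluster center is the mean of its members, and the function names and the two hand-made "experiments" are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance D^2 between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def cluster_center(members):
    """Center of a cluster, taken here as the mean of its members."""
    return tuple(sum(c) / len(members) for c in zip(*members))

def clustering_criterion(clusters):
    """Average over the K clusters of the mean squared distance to the center."""
    total = 0.0
    for members in clusters:
        center = cluster_center(members)
        total += sum(squared_distance(center, x) for x in members) / len(members)
    return total / len(clusters)

def best_experiment(experiments):
    """Return the clustering with the minimal criterion value."""
    return min(experiments, key=clustering_criterion)

# Two made-up clusterings of the same four points (not from the slides).
experiment_a = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
experiment_b = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

print(clustering_criterion(experiment_a))  # 1.0
print(clustering_criterion(experiment_b))  # 25.0
best = best_experiment([experiment_a, experiment_b])
```

The alternative stopping rule from the slide, accepting any clustering whose criterion value falls below a chosen threshold, would replace `min` with a simple comparison.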


An Example Using the K-Means Algorithm


Table 3.6 • K-Means Input Values

Instance   X     Y
1          1.0   1.5
2          1.0   4.5
3          2.0   1.5
4          2.0   3.5
5          3.0   2.5
6          5.0   6.0


[Figure: plot of f(x) against x; x axis from 0 to 6, f(x) axis from 0 to 7]


Table 3.7 • Several Applications of the K-Means Algorithm (K = 2)

Outcome   Cluster Centers   Cluster Points   Square Error
1         (2.67, 4.67)      2, 4, 6          14.50
          (2.00, 1.83)      1, 3, 5
2         (1.5, 1.5)        1, 3             15.94
          (2.75, 4.125)     2, 4, 5, 6
3         (1.8, 2.7)        1, 2, 3, 4, 5    9.60
          (5, 6)            6
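The Square Error column can be checked against the input values of Table 3.6. The sketch below assumes that "square error" means the sum, over both clusters, of the squared Euclidean distances between each instance and its cluster center. That reading is an assumption, and the helper names are illustrative.

```python
# Input values from Table 3.6: instance number -> (X, Y).
instances = {
    1: (1.0, 1.5), 2: (1.0, 4.5), 3: (2.0, 1.5),
    4: (2.0, 3.5), 5: (3.0, 2.5), 6: (5.0, 6.0),
}

# Cluster memberships of the three outcomes in Table 3.7.
outcomes = {
    1: [[2, 4, 6], [1, 3, 5]],
    2: [[1, 3], [2, 4, 5, 6]],
    3: [[1, 2, 3, 4, 5], [6]],
}

def center_of(ids):
    """Mean of the listed instances."""
    pts = [instances[i] for i in ids]
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def square_error(partition):
    """Sum of squared distances from every instance to its cluster center."""
    total = 0.0
    for ids in partition:
        cx, cy = center_of(ids)
        total += sum((instances[i][0] - cx) ** 2 + (instances[i][1] - cy) ** 2
                     for i in ids)
    return total

errors = {outcome: square_error(p) for outcome, p in outcomes.items()}
for outcome, err in errors.items():
    print(f"outcome {outcome}: square error = {err:.2f}")

best = min(errors, key=errors.get)  # outcome with the smallest square error
```

Under this reading the computed values agree with the table (14.50, 15.94 and 9.60), and outcome 3 has the smallest square error.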


[Figure: plot of f(x) against x; x axis from 0 to 6, f(x) axis from 0 to 7]


Unsupervised Model Evaluation


The K-Means Algorithm: General Considerations

• Requires real-valued data.

• We must select the number of clusters present in the data.

• Works best when the clusters that exist in the data are of approximately equal size. If an optimal solution is represented by clusters of unequal size, the K-Means algorithm is not likely to find the best solution.

• Attribute significance cannot be determined.

• A supervised data mining tool must be used to gain insight into the nature of the clusters formed by a clustering tool.


Supervised Learning for Unsupervised Model Evaluation

• Designate each formed cluster as a class and assign each class an arbitrary name.

• Choose a random sample of instances from each class for supervised learning.

• Build a supervised model from the chosen instances. Employ the remaining instances to test the correctness of the model.
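The procedure above can be sketched as follows. The slides do not name a particular supervised learner, so a nearest-class-mean classifier is used here purely as a stand-in. The function names, the `train_fraction` and `seed` parameters and the sample clusters are likewise assumptions.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def evaluate_clusters(clusters, train_fraction=0.5, seed=0):
    """clusters maps an arbitrary class name to the instances of one cluster."""
    rng = random.Random(seed)
    train, test = {}, []
    for name, members in clusters.items():
        shuffled = members[:]
        rng.shuffle(shuffled)                      # random sample from each class
        n_train = max(1, int(len(shuffled) * train_fraction))
        train[name] = shuffled[:n_train]
        test.extend((x, name) for x in shuffled[n_train:])
    # Stand-in supervised model: one mean per class, learned from training data only.
    model = {name: mean_point(pts) for name, pts in train.items()}

    def predict(x):
        return min(model, key=lambda name: squared_distance(x, model[name]))

    # Test the model on the remaining instances.
    correct = sum(predict(x) == name for x, name in test)
    return correct / len(test) if test else float("nan")

# Made-up clusters with arbitrary class names (not from the slides).
clusters = {
    "C1": [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (2.0, 1.0)],
    "C2": [(8.0, 8.0), (9.0, 8.5), (8.5, 9.0), (9.0, 9.0)],
}
accuracy = evaluate_clusters(clusters)
print(accuracy)
```

Test-set accuracy serves here as the measure of correctness; any other supervised model could be substituted for the nearest-mean rule.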
