
Page 1: Data Mining Using Conceptual Clustering

Data Mining Using Conceptual Clustering

By Trupti Kadam

Page 2: Data Mining Using Conceptual Clustering

What is Data Mining?

• Many definitions:
  – Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Page 3: Data Mining Using Conceptual Clustering

Origins of Data Mining

• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

• Traditional techniques may be unsuitable due to:
  – Enormity of data
  – High dimensionality of data
  – Heterogeneous, distributed nature of data

[Diagram: data mining at the intersection of machine learning/pattern recognition, statistics/AI, and database systems]

Page 4: Data Mining Using Conceptual Clustering

Clustering Definition

• Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  – Data points in one cluster are more similar to one another.
  – Data points in separate clusters are less similar to one another.
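As a concrete illustration of this definition, here is a minimal Python sketch that groups points by a distance threshold; the data, the threshold, and the greedy strategy are illustrative assumptions, not part of the slides.

import math

def distance(p, q):
    # Euclidean distance as the (dis)similarity measure between two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def threshold_clusters(points, max_dist):
    # Greedy, order-dependent grouping: a point joins the first cluster
    # containing some point within max_dist, otherwise it starts a new one.
    clusters = []
    for p in points:
        for cluster in clusters:
            if any(distance(p, q) <= max_dist for q in cluster):
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return clusters

# Illustrative data: two tight groups and one distant outlier
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (25, 25)]
print(threshold_clusters(points, max_dist=2.0))
# [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11)], [(25, 25)]]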

Page 5: Data Mining Using Conceptual Clustering

Conceptual Clustering

• Unsupervised, spontaneous: categorizes or postulates concepts without a teacher
• Forms a classification tree: all initial observations start in the root, and new children are created using a single attribute (not good), attribute combinations (all), information metrics, etc. Each node is a class.
• Must decide the quality of a class partition and its significance (noise)
• Many models use search to discover hierarchies that satisfy some heuristic within and/or between clusters: similarity, cohesiveness, etc.
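A minimal Python sketch of the kind of node such a classification tree could maintain, with per-attribute value counts that give each class a probabilistic description. The class and method names are illustrative assumptions, not from the slides.

from collections import defaultdict

class ConceptNode:
    # One class (concept) in the classification tree.
    def __init__(self):
        self.count = 0                      # objects covered by this concept
        self.children = []                  # child concepts
        # value_counts[attribute][value] -> how often that value occurs here
        self.value_counts = defaultdict(lambda: defaultdict(int))

    def incorporate(self, instance):
        # Update this concept's statistics with one attribute-value instance.
        self.count += 1
        for attr, value in instance.items():
            self.value_counts[attr][value] += 1

    def p_value(self, attr, value):
        # P(attribute = value | this concept)
        return self.value_counts[attr][value] / self.count if self.count else 0.0

root = ConceptNode()
root.incorporate({"color": "red", "shape": "round"})
root.incorporate({"color": "red", "shape": "square"})
print(root.p_value("color", "red"))    # 1.0
print(root.p_value("shape", "round"))  # 0.5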

Page 6: Data Mining Using Conceptual Clustering

Concept Under CC

Page 7: Data Mining Using Conceptual Clustering

Concept Hierarchy

Page 8: Data Mining Using Conceptual Clustering

Contd..

• Suppose we choose 6 as the threshold value for similarity. After deleting redundant clusters, the algorithm produces five distinct clusters (1,2), (3,4), (5,6,7,8), (5,6), (5,7,8), and a hierarchy is formed as follows:

Page 9: Data Mining Using Conceptual Clustering

Contd..

Page 10: Data Mining Using Conceptual Clustering

The COBWEB Conceptual Clustering Algorithm

• The COBWEB algorithm was developed by machine learning researchers in the 1980s for clustering objects in an object-attribute data set.

• The COBWEB algorithm yields a clustering dendrogram, called a classification tree, that characterizes each cluster with a probabilistic description.

Page 11: Data Mining Using Conceptual Clustering

Contd..

• When given a new instance, COBWEB considers the overall quality of either placing the instance in an existing category or modifying the hierarchy

• The criterion COBWEB uses for evaluating the quality of the classification is called category utility

Page 12: Data Mining Using Conceptual Clustering

Category utility

• Developed in research on human categorization (Gluck and Corter, 1985)

• Category utility attempts to maximize both the probability that two objects in the same category have values in common and the probability that objects in different categories will have different property values.

• The Manhattan distance or Euclidean distance formula is used to measure cohesion among clusters.

Page 13: Data Mining Using Conceptual Clustering

Category utility

• P(C_k) represents the size (prior probability) of cluster C_k.

• P(A_i = V_ij) represents the probability of attribute A_i taking on value V_ij over the entire data set.

• P(A_i = V_ij | C_k) is the conditional probability of A_i taking the same value within cluster C_k.
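The formula itself appears only as an image on the slide; the standard per-cluster form of category utility (Gluck and Corter, 1985; Fisher, 1987), which combines these three quantities, is:

CU(C_k) = P(C_k) \sum_{i} \sum_{j} \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]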

Page 14: Data Mining Using Conceptual Clustering

• To evaluate an entire partition made up of K clusters, we use the average CU over the K clusters
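Assuming the per-cluster form above, the partition score is the average over the K clusters:

CU(\{C_1, \ldots, C_K\}) = \frac{1}{K} \sum_{k=1}^{K} P(C_k) \sum_{i} \sum_{j} \left[ P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right]

A minimal Python sketch of this computation follows; the function name and the toy data are illustrative assumptions, not taken from the slides.

from collections import Counter
from itertools import chain

def category_utility(clusters):
    # Average category utility of a partition; each object is a dict of
    # attribute -> value, and `clusters` is a list of lists of objects.
    objects = list(chain.from_iterable(clusters))
    n = len(objects)
    attrs = {a for o in objects for a in o}
    # P(A_i = V_ij) over the whole data set
    overall = {a: Counter(o[a] for o in objects) for a in attrs}
    total = 0.0
    for cluster in clusters:
        p_k = len(cluster) / n
        inner = 0.0
        for a in attrs:
            within = Counter(o[a] for o in cluster)
            inner += sum((c / len(cluster)) ** 2 for c in within.values())
            inner -= sum((c / n) ** 2 for c in overall[a].values())
        total += p_k * inner
    return total / len(clusters)

# Illustrative toy data: two clusters that separate cleanly on "color"
clusters = [
    [{"color": "red", "shape": "round"}, {"color": "red", "shape": "square"}],
    [{"color": "blue", "shape": "round"}, {"color": "blue", "shape": "square"}],
]
print(category_utility(clusters))  # 0.25: a clearly informative partition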

Page 15: Data Mining Using Conceptual Clustering

The Classification Tree Generated by the COBWEB Algorithm

Page 16: Data Mining Using Conceptual Clustering

• COBWEB performs a hill-climbing search of the space of possible taxonomies (trees) using category utility to evaluate and select possible categorizations

Page 17: Data Mining Using Conceptual Clustering

– Initializes the taxonomy to a single category whose features are those of the first example

– For each subsequent example, the algorithm begins with the root category and moves through the tree

– At each level it uses category utility to evaluate the resulting taxonomies and chooses among:

1. Placing the example in the best existing category
2. Adding a new category containing the example
3. Merging two existing categories and adding the example to the merged category
4. Splitting an existing category into its children and placing the example into the best category in the tree

Page 18: Data Mining Using Conceptual Clustering

• Insertion means that the new object is inserted into one of the existing child nodes. The COBWEB algorithm evaluates the respective CU function value of inserting the new object into each of the existing child nodes and selects the one with the highest score.

• The COBWEB algorithm also considers creating a new child node specifically for the new object.

Page 19: Data Mining Using Conceptual Clustering

• The COBWEB algorithm considers merging the two existing child nodes with the highest and second highest scores.

[Diagram: child nodes A and B of parent P are merged into a new node N, which becomes their parent under P]

Page 20: Data Mining Using Conceptual Clustering

• The COBWEB algorithm considers splitting the existing child node with the highest score.

[Diagram: child node N of parent P is split, promoting its children A and B to children of P]

Page 21: Data Mining Using Conceptual Clustering

The COBWEB Algorithm

Cobweb(N, I)
  If N is a terminal node,
    Then Create-new-terminals(N, I)
         Incorporate(N, I).
  Else Incorporate(N, I).
    For each child C of node N,
      Compute the score for placing I in C.
    Let P be the node with the highest score W.
    Let Q be the node with the second highest score.
    Let X be the score for placing I in a new node R.
    Let Y be the score for merging P and Q into one node.
    Let Z be the score for splitting P into its children.
    If W is the best score,
      Then Cobweb(P, I) (place I in category P).
    Else if X is the best score,
      Then initialize R's probabilities using I's values
           (place I by itself in the new category R).
    Else if Y is the best score,
      Then let O be Merge(P, Q, N)
           Cobweb(O, I).
    Else if Z is the best score,
      Then Split(P, N)
           Cobweb(N, I).

Input: The current node N in the concept hierarchy.
       An unclassified (attribute-value) instance I.

Results: A concept hierarchy that classifies the instance.

Top-level call: Cobweb(Top-node, I).

Variables: C, P, Q, and R are nodes in the hierarchy.
           W, X, Y, and Z are clustering (partition) scores.
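As a rough illustration of this control loop, here is a minimal Python sketch of a single, non-recursive COBWEB-style decision: it scores placing an instance in each existing child, in a new child, or in a merger of the two best hosts, using a condensed version of the category-utility sketch above. The names and the simplified merge are illustrative assumptions (splitting is omitted, since this flat sketch has no grandchildren); this is not COBWEB's actual implementation.

from collections import Counter
from copy import deepcopy

def partition_cu(clusters):
    # Average category utility of a partition; each cluster is a list of
    # attribute -> value dicts.
    objs = [o for c in clusters for o in c]
    n = len(objs)
    attrs = {a for o in objs for a in o}
    overall = {a: Counter(o[a] for o in objs) for a in attrs}
    cu = 0.0
    for c in clusters:
        within = sum((cnt / len(c)) ** 2
                     for a in attrs for cnt in Counter(o[a] for o in c).values())
        base = sum((cnt / n) ** 2 for a in attrs for cnt in overall[a].values())
        cu += (len(c) / n) * (within - base)
    return cu / len(clusters)

def cobweb_step(children, instance):
    # One COBWEB-style decision over the children of a single node.
    options = []
    # 1. Place the instance in each existing child and score the result.
    for i in range(len(children)):
        trial = deepcopy(children)
        trial[i].append(instance)
        options.append((partition_cu(trial), trial))
    # 2. Create a new child holding only the instance.
    options.append((partition_cu(children + [[instance]]), children + [[instance]]))
    # 3. Merge the two best hosts, then add the instance to the merged child.
    if len(children) >= 2:
        ranked = sorted(range(len(children)), key=lambda i: options[i][0], reverse=True)
        p, q = ranked[0], ranked[1]
        merged = [c for i, c in enumerate(children) if i not in (p, q)]
        merged.append(children[p] + children[q] + [instance])
        options.append((partition_cu(merged), merged))
    # Keep whichever option scores highest on category utility.
    return max(options, key=lambda t: t[0])[1]

children = [[{"color": "red"}, {"color": "red"}], [{"color": "blue"}]]
print(cobweb_step(children, {"color": "blue"}))
# The new blue instance joins the existing blue child.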

Page 22: Data Mining Using Conceptual Clustering

• Limitations of COBWEB:
  – The assumption that the attributes are independent of each other is often too strong, because correlations may exist
  – Not suitable for clustering large database data: skewed trees and expensive probability distributions

Page 23: Data Mining Using Conceptual Clustering

ITERATE

The algorithm has three primary steps:

1. Derive a classification tree using category utility as a criterion function for grouping instances.
2. Extract a good initial partition of data from the classification tree as a starting point to focus the search for desirable groupings or clusters.
3. Iteratively redistribute data objects among the groupings to achieve maximally separable clusters.

Page 24: Data Mining Using Conceptual Clustering

Derivation of classification tree

Page 25: Data Mining Using Conceptual Clustering

The initial partition structure is extracted by comparing the CU value of classes or nodes along a path in the classification tree. For any path from root to leaf of a classification tree, this value initially increases and then drops.
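A minimal Python sketch of this extraction rule, simplified so that the search descends while some child improves on the parent's CU; the node layout (dicts with name, cu, children) and the CU values are illustrative assumptions, not from the slides.

def extract_partition(node):
    # Return the frontier of nodes where category utility stops increasing:
    # keep descending while some child improves on the parent's CU,
    # otherwise the current node becomes one initial cluster.
    children = node.get("children", [])
    if children and any(c["cu"] > node["cu"] for c in children):
        clusters = []
        for c in children:
            clusters.extend(extract_partition(c))
        return clusters
    return [node["name"]]

# Illustrative tree: CU rises from the root to B and C, then drops at their children
tree = {"name": "root", "cu": 0.10, "children": [
    {"name": "B", "cu": 0.30, "children": [
        {"name": "B1", "cu": 0.20, "children": []},
        {"name": "B2", "cu": 0.25, "children": []}]},
    {"name": "C", "cu": 0.40, "children": [
        {"name": "C1", "cu": 0.35, "children": []}]},
]}
print(extract_partition(tree))  # ['B', 'C']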

Page 26: Data Mining Using Conceptual Clustering

Extraction of a good initial partition

Page 27: Data Mining Using Conceptual Clustering

Iteratively redistribute data objects

• The iterative redistribution operator is applied to maximize the cohesion measure for individual classes in the partition.

• The redistribution operator assigns object d to the class k for which the category match measure CM_dk is maximum.
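The slides do not spell out CM_dk, so the sketch below uses a stand-in match score (the average within-cluster probability of the object's own attribute values) purely to illustrate the redistribution loop; it is an assumption, not ITERATE's actual category match measure.

from collections import Counter

def match_score(obj, cluster):
    # Stand-in for CM_dk: average probability, within the cluster,
    # of the object's own attribute values.
    if not cluster:
        return 0.0
    score = 0.0
    for attr, value in obj.items():
        counts = Counter(o.get(attr) for o in cluster)
        score += counts[value] / len(cluster)
    return score / len(obj)

def redistribute(clusters, max_rounds=10):
    # Repeatedly move each object to the cluster it matches best,
    # stopping when an entire pass makes no moves.
    for _ in range(max_rounds):
        moved = False
        for i, cluster in enumerate(clusters):
            for obj in list(cluster):
                others = [c if j != i else [o for o in c if o is not obj]
                          for j, c in enumerate(clusters)]
                best = max(range(len(clusters)), key=lambda k: match_score(obj, others[k]))
                if best != i:
                    cluster.remove(obj)
                    clusters[best].append(obj)
                    moved = True
        if not moved:
            break
    return clusters

clusters = [[{"color": "red"}, {"color": "blue"}], [{"color": "blue"}]]
print(redistribute(clusters))
# The stray blue object moves to the blue cluster.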

Page 28: Data Mining Using Conceptual Clustering

Evaluating Cluster Partitions

• To assess the result of a clustering operation, we adopt a measure known as cohesion, which measures the degree of similarity between objects in the same class.

• The increase in predictability for an object d assigned to cluster k, M_dk, is defined as follows:

Page 29: Data Mining Using Conceptual Clustering

THANK YOU