CS690L Data Mining: Classification
Reference:
J. Han and M. Kamber, Data Mining: Concepts and Techniques
Yong Fu: http://web.umr.edu/~yongjian/cs401dm/
Classification
• Classification determines the class or category of an object based on its properties
• Two stages (a minimal sketch of both follows this list)
  – Learning stage: construction of a classification function or model
  – Classification stage: prediction of the classes of objects using the function or model
• Tools for classification
  – Decision trees
  – Bayesian networks
  – Neural networks
  – Regression
• Problem
  – Given a set of objects whose classes are known, called the training set, derive a classification model which can correctly classify future objects
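A minimal sketch of the two stages, assuming scikit-learn's DecisionTreeClassifier and toy attribute vectors; neither the library nor the data comes from the slides.

# Learning stage + classification stage, sketched with scikit-learn
# (an assumption; the slides name no library). Objects are vectors of
# attribute values; the class attribute is the value to be predicted.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training set: attribute vectors with known classes.
X_train = [[0, 1], [1, 0], [1, 1], [0, 0]]
y_train = ["P", "DP", "P", "DP"]

# Learning stage: construct the classification model.
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Classification stage: predict the classes of future objects.
print(model.predict([[0, 1], [1, 0]]))  # e.g. ['P' 'DP']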
Classification: Decision Tree
• Classification model: decision tree
• Method: top-down induction of decision trees
• Data representation:
  – Every object is represented by a vector of values on a fixed set of attributes. If a relation is defined on the attributes, an object is a tuple in the relation.
  – A special attribute, called the class attribute, tells the group/category the object belongs to; it is the dependent attribute to be predicted
• Learning stage:
  – Induction of a decision tree that classifies the training set
• Classification stage:
  – The decision tree classifies new objects.
An Example
• Definitions: A decision tree is a tree in which each non-leaf node corresponds to an attribute of the objects, and each branch from a non-leaf node to its children represents a value of that attribute. Each leaf node in a decision tree is labeled by a class of the objects.
• Classification using decision trees: Starting from the root, an object follows a path to a leaf node, taking branches according to its attribute values along the way; the leaf gives the class of the object (sketched below).
• Alternative view of a decision tree
  – Node/branch: discrimination test
  – Node: subset of objects satisfying the tests on its path
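A sketch of this traversal, with the tree stored as nested dicts. Both the representation and the hand-built tree are assumptions for illustration; the tree shown is consistent with the partition tables later in these slides, but the slide's own tree did not survive extraction.

# Classify an object by walking a decision tree stored as nested dicts
# (a hypothetical representation; the slides do not prescribe one).
def classify(tree, obj):
    # A leaf is a plain class label.
    if not isinstance(tree, dict):
        return tree
    # Follow the branch matching the object's value for this attribute.
    return classify(tree["branches"][obj[tree["attribute"]]], obj)

# Hand-built tree consistent with the weather example (hypothetical).
tree = {"attribute": "Outlook",
        "branches": {
            "Overcast": "P",
            "Sunny": {"attribute": "Hum>75",
                      "branches": {True: "DP", False: "P"}},
            "Rainy": {"attribute": "Wind",
                      "branches": {True: "DP", False: "P"}}}}

print(classify(tree, {"Outlook": "Sunny", "Hum>75": False}))  # P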
Decision Tree Induction
• Induction of decision trees:
Starting from the training set, recursively select attributes to split nodes, thus partitioning the objects
  – Termination condition: when to stop splitting a node
  – Selection of the attribute for the splitting test:
    • Best split
    • A measure for splitting?
• ID3 algorithm
  – Selection: attribute information gain
  – Termination condition: all objects are in a single class
ID3 Algorithm
ID3 Algorithm (Cont)
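The algorithm bodies on these two slides did not survive extraction. Below is a reconstruction of standard ID3 from the description above: split on the attribute with maximum information gain, stop when all objects are in a single class (a majority-class fallback, an addition, handles running out of attributes). It produces trees in the nested-dict form used in the earlier traversal sketch.

# A reconstruction of ID3 (a sketch, not the slides' own listing).
from collections import Counter
from math import log2

def entropy(objects, class_attr):
    # Ent(C) = -sum p_i log2 p_i over the class distribution.
    counts = Counter(obj[class_attr] for obj in objects)
    n = len(objects)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def gain(objects, attr, class_attr):
    # Gain(C, A) = Ent(C) - expected entropy after splitting on A.
    n = len(objects)
    expected = sum(
        (len(subset) / n) * entropy(subset, class_attr)
        for subset in ([o for o in objects if o[attr] == v]
                       for v in {o[attr] for o in objects}))
    return entropy(objects, class_attr) - expected

def id3(objects, attributes, class_attr):
    classes = {obj[class_attr] for obj in objects}
    # Termination condition: all objects are in a single class.
    if len(classes) == 1:
        return classes.pop()
    # Fallback (not in the slides): no attribute left -> majority class.
    if not attributes:
        return Counter(o[class_attr] for o in objects).most_common(1)[0][0]
    # Selection: attribute with maximum information gain.
    best = max(attributes, key=lambda a: gain(objects, a, class_attr))
    rest = [a for a in attributes if a != best]
    return {"attribute": best,
            "branches": {v: id3([o for o in objects if o[best] == v],
                                rest, class_attr)
                         for v in {o[best] for o in objects}}}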
Example
• Information content of C (Expected information for the classification)
I(P) = Ent(C) = -((9/14) log2(9/14) + (5/14) log2(5/14)) = 0.940
• For each attribute Ai
  – Step 1: Compute the entropy of each subset induced by Ai
    Ent(Sunny) = -((2/5) log2(2/5) + (3/5) log2(3/5)) = 0.97
    Ent(Rainy) = 0.97
    Ent(Overcast) = 0
  – Step 2: Compute the expected entropy (expected information based on the partitioning into subsets by Ai)
    Ent(C, Outlook) = (5/14) Ent(Sunny) + (5/14) Ent(Rainy) + (4/14) Ent(Overcast)
                    = (5/14)(0.97) + (5/14)(0.97) + (4/14)(0) = 0.69
  – Step 3: Gain(C, Outlook) = Ent(C) - Ent(C, Outlook) = 0.940 - 0.69 = 0.25
• Select the attribute that maximizes information gain
• Build a node for the selected attribute
• Recursively build nodes (the arithmetic above is checked numerically in the sketch below)
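A quick numeric check of the entropy and gain arithmetic above; ent() is a hypothetical helper, not from the slides.

# Numeric check of the Outlook example (ent() is hypothetical).
from math import log2

def ent(*fractions):
    # Entropy of a class distribution given as fractions; 0*log2(0) := 0.
    return -sum(p * log2(p) for p in fractions if p > 0)

ent_C = ent(9/14, 5/14)                         # Ent(C) = 0.940
ent_C_outlook = ((5/14) * ent(2/5, 3/5)         # Sunny
               + (5/14) * ent(3/5, 2/5)         # Rainy
               + (4/14) * ent(4/4))             # Overcast
print(round(ent_C, 2), round(ent_C_outlook, 2),
      round(ent_C - ent_C_outlook, 2))          # 0.94 0.69 0.25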
Example: Decision Tree Building
Level 1: Decision Tree Building

The root node tests Outlook; its three branches (Sunny, Overcast, Rainy) partition the training set into the subsets below.

Outlook = Sunny:
  Temp  Hum  Wind   Class
  85    85   False  DP
  80    90   True   DP
  72    95   False  DP
  69    70   False  P
  75    70   True   P

Outlook = Overcast:
  Temp  Hum  Wind   Class
  83    88   False  P
  64    65   True   P
  72    90   False  P
  81    75   False  P

Outlook = Rainy:
  Temp  Hum  Wind   Class
  70    96   False  P
  71    80   False  P
  72    70   True   DP
  75    80   False  P
  71    96   True   DP
Decision Tree
Generated Rules
C4.5 Extensions to ID3
• Gain ratio: Gain favors attributes with many values.
  GainRatio(C, A) = Gain(C, A) / Ent(P), where P = (|T1|/|C|, |T2|/|C|, ..., |Tm|/|C|)
  and the Ti are the partitions of C based on the objects' values of A. E.g.
  GainRatio(Outlook) = Gain(Outlook) / -{(5/14) log2(5/14) + (5/14) log2(5/14) + (4/14) log2(4/14)}
  (checked in the sketch after this list)
• Missing values:
  – Consider only the objects for which the attribute is defined.
• Continuous attributes:
  – Consider all binary splits A <= ai and A > ai, where ai is the i-th value of A.
  – Compute the gain or gain ratio and choose the split that maximizes it.
• Over-fitting: Change the termination condition: if a subtree is dominated by one class, stop splitting.
• Tree pruning: Replace a subtree by a single leaf node when doing so reduces the expected classification error.
• Rule deriving: A rule corresponds to a path from the root to a leaf; the LHS is the conjunction of the tests along the path and the RHS is the class prediction.
• Rule simplification: Remove some conditions from the LHS.
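The gain-ratio check promised above, reusing the hypothetical ent() helper; Gain(Outlook) = 0.25 is taken from the earlier ID3 example.

# Gain ratio for Outlook (a sketch; ent() as in the earlier check).
from math import log2

def ent(*fractions):
    return -sum(p * log2(p) for p in fractions if p > 0)

gain_outlook = 0.25                 # Gain(C, Outlook) from the ID3 example
split_info = ent(5/14, 5/14, 4/14)  # Ent(P) over the partition sizes, ~1.58
print(round(gain_outlook / split_info, 2))  # ~0.16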
Evaluation of Decision Tree Methods
• Complexity
• Expressive power
• Robustness
• Effectiveness