CS690L Data Mining: Classification
Reference: J. Han and M. Kamber, Data Mining: Concepts and Techniques
Yong Fu: http://web.umr.edu/~yongjian/cs401dm/


Page 1

CS690L Data Mining: Classification

Reference:

J. Han and M. Kamber, Data Mining: Concepts and Techniques

Yong Fu: http://web.umr.edu/~yongjian/cs401dm/

Page 2

Classification

• Classification: determine the class or category of an object based on its properties
• Two stages (see the sketch after this list)
  – Learning stage: construction of a classification function or model
  – Classification stage: prediction of the classes of new objects using the function or model
• Tools for classification
  – Decision trees
  – Bayesian networks
  – Neural networks
  – Regression
• Problem
  – Given a set of objects whose classes are known, called the training set, derive a classification model which can correctly classify future objects
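The two stages map directly onto the fit/predict split used by most classification libraries. A minimal sketch in Python using scikit-learn's DecisionTreeClassifier (scikit-learn and the toy data are my own illustration, not part of the slides):

    from sklearn.tree import DecisionTreeClassifier

    # Learning stage: induce a model from a training set with known classes.
    # Toy data (made up): each row is [humidity, windy]; y holds the class labels.
    X_train = [[85, 0], [90, 1], [70, 0], [65, 1]]
    y_train = ["DontPlay", "DontPlay", "Play", "Play"]

    model = DecisionTreeClassifier()
    model.fit(X_train, y_train)

    # Classification stage: predict the class of a future (unseen) object.
    print(model.predict([[75, 0]]))   # e.g. ['Play']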

Page 3

Classification: Decision Tree

• Classification model: decision tree
• Method: Top-Down Induction of Decision Trees
• Data representation:
  – Every object is represented by a vector of values on a fixed set of attributes. If a relation is defined on the attributes, an object is a tuple in the relation.
  – A special attribute, the class attribute, gives the group/category the object belongs to; it is the dependent attribute to be predicted.
• Learning stage:
  – Induction of a decision tree that classifies the training set
• Classification stage:
  – The decision tree classifies new objects.

Page 4

An Example

• Definitions: A decision tree is a tree in which each non-leaf node corresponds to an attribute of the objects, and each branch from a non-leaf node to a child represents a value of that attribute. Each leaf node is labeled by a class of the objects.
• Classification using decision trees: Starting from the root, an object follows a path to a leaf node, taking branches according to its attribute values along the way; the leaf gives the class of the object (see the sketch after this list).
• Alternative view of a decision tree
  – Node/branch: a discrimination test
  – Node: the subset of objects satisfying the test
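A minimal sketch of classification by walking such a tree, with a non-leaf stored as an (attribute, branches) pair and a leaf as a class label. The representation and the example tree are made up for illustration, not the tree induced from the slides' data:

    # A decision tree as nested pairs: ("attribute", {value: subtree, ...}); a leaf is a class label.
    tree = ("Outlook", {
        "Sunny":    ("Wind", {True: "DP", False: "P"}),
        "Overcast": "P",
        "Rainy":    ("Wind", {True: "DP", False: "P"}),
    })

    def classify(obj, node):
        """Follow branches according to the object's attribute values until a leaf is reached."""
        while isinstance(node, tuple):          # non-leaf: (attribute, branches)
            attribute, branches = node
            node = branches[obj[attribute]]     # take the branch for this object's value
        return node                             # leaf: the class label

    print(classify({"Outlook": "Sunny", "Wind": False}, tree))   # -> P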

Page 5

Decision Tree Induction

• Induction of decision trees:

Starting from the training set, recursively select attributes to split nodes, thus partitioning the objects
  – Termination condition: when to stop splitting a node
  – Selection of the attribute for the splitting test:
    • Best split
    • A measure for splitting?
• ID3 algorithm
  – Selection: the attribute with the highest information gain
  – Termination condition: all objects in a node belong to a single class

Page 6

ID3 Algorithm
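The body of this slide (presumably the ID3 pseudocode) did not survive extraction. Below is a minimal Python sketch of the induction loop as described on the previous slide, assuming categorical attributes and records stored as dicts with a "Class" key; the helper names are mine:

    import math
    from collections import Counter

    def entropy(objects, class_attr="Class"):
        """Ent(C) = -sum p_i log2 p_i over the class distribution of the objects."""
        counts = Counter(obj[class_attr] for obj in objects)
        total = len(objects)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    def gain(objects, attr, class_attr="Class"):
        """Gain(C, A) = Ent(C) - sum_i (|Ti|/|C|) Ent(Ti), Ti being the partition by A."""
        total = len(objects)
        remainder = 0.0
        for value in {obj[attr] for obj in objects}:
            subset = [obj for obj in objects if obj[attr] == value]
            remainder += (len(subset) / total) * entropy(subset, class_attr)
        return entropy(objects, class_attr) - remainder

    def id3(objects, attributes, class_attr="Class"):
        classes = {obj[class_attr] for obj in objects}
        if len(classes) == 1:                    # termination: all objects in a single class
            return classes.pop()
        if not attributes:                       # no attribute left: fall back to majority class
            return Counter(obj[class_attr] for obj in objects).most_common(1)[0][0]
        best = max(attributes, key=lambda a: gain(objects, a, class_attr))
        branches = {}
        for value in {obj[best] for obj in objects}:
            subset = [obj for obj in objects if obj[best] == value]
            branches[value] = id3(subset, [a for a in attributes if a != best], class_attr)
        return (best, branches)                  # same (attribute, branches) shape as the sketch above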

Page 7

ID3 Algorithm (Cont)

Page 8

Example
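The training table for this example did not survive extraction. The records below are reconstructed from the three per-Outlook subsets shown on the "Level 1: Decision Tree Building" slide further down; the Python representation and key names are a convenience, not the slides' notation:

    # Class labels as abbreviated in the slides: P = Play, DP = Don't Play.
    weather = [
        # Outlook = Sunny
        {"Outlook": "Sunny", "Temp": 85, "Hum": 85, "Wind": False, "Class": "DP"},
        {"Outlook": "Sunny", "Temp": 80, "Hum": 90, "Wind": True,  "Class": "DP"},
        {"Outlook": "Sunny", "Temp": 72, "Hum": 95, "Wind": False, "Class": "DP"},
        {"Outlook": "Sunny", "Temp": 69, "Hum": 70, "Wind": False, "Class": "P"},
        {"Outlook": "Sunny", "Temp": 75, "Hum": 70, "Wind": True,  "Class": "P"},
        # Outlook = Overcast
        {"Outlook": "Overcast", "Temp": 83, "Hum": 88, "Wind": False, "Class": "P"},
        {"Outlook": "Overcast", "Temp": 64, "Hum": 65, "Wind": True,  "Class": "P"},
        {"Outlook": "Overcast", "Temp": 72, "Hum": 90, "Wind": False, "Class": "P"},
        {"Outlook": "Overcast", "Temp": 81, "Hum": 75, "Wind": False, "Class": "P"},
        # Outlook = Rainy
        {"Outlook": "Rainy", "Temp": 70, "Hum": 96, "Wind": False, "Class": "P"},
        {"Outlook": "Rainy", "Temp": 71, "Hum": 80, "Wind": False, "Class": "P"},
        {"Outlook": "Rainy", "Temp": 72, "Hum": 70, "Wind": True,  "Class": "DP"},
        {"Outlook": "Rainy", "Temp": 75, "Hum": 80, "Wind": False, "Class": "P"},
        {"Outlook": "Rainy", "Temp": 71, "Hum": 96, "Wind": True,  "Class": "DP"},
    ]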

Page 9

• Information content of C (expected information for the classification)
  I(P) = Ent(C) = -{(9/14) log2(9/14) + (5/14) log2(5/14)} = 0.940
• For each attribute Ai
  – Step 1: Compute the entropy of each subset produced by splitting on Ai
    Ent(Sunny) = -((2/5) log2(2/5) + (3/5) log2(3/5)) = 0.97
    Ent(Rainy) = 0.97
    Ent(Overcast) = 0
  – Step 2: Compute the entropy of the split (the expected information based on the partitioning into subsets by Ai)
    Ent(C, Outlook) = (5/14) Ent(Sunny) + (5/14) Ent(Rainy) + (4/14) Ent(Overcast)
                    = (5/14)(0.97) + (5/14)(0.97) + (4/14)(0) = 0.69
  – Step 3: Gain(C, Outlook) = Ent(C) - Ent(C, Outlook) = 0.940 - 0.69 = 0.25
• Select the attribute that maximizes information gain
• Build a node for the selected attribute
• Recursively build nodes (a worked computation in code follows below)
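A check of these numbers, reusing the entropy and gain helpers sketched under "ID3 Algorithm" and the reconstructed weather records above (both are my reconstructions, not the slides' own code):

    print(round(entropy(weather), 3))            # 0.94  : Ent(C)
    print(round(gain(weather, "Outlook"), 3))    # ~0.25 : Gain(C, Outlook)
    print(round(gain(weather, "Wind"), 3))       # smaller, so Outlook is chosen at the root
    # (Temp and Hum are continuous; the C4.5 slide below handles those via binary splits.)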

Example: Decision Tree Building

Page 10

Level 1: Decision Tree Building

The training set split on Outlook:

Outlook = Sunny
Temp  Hum  Wind   Class
85    85   False  DP
80    90   True   DP
72    95   False  DP
69    70   False  P
75    70   True   P

Outlook = Overcast
Temp  Hum  Wind   Class
83    88   False  P
64    65   True   P
72    90   False  P
81    75   False  P

Outlook = Rainy
Temp  Hum  Wind   Class
70    96   False  P
71    80   False  P
72    70   True   DP
75    80   False  P
71    96   True   DP
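The same split, reproduced from the reconstructed records above (a sketch; weather is the list defined under "Example"):

    from collections import defaultdict

    subsets = defaultdict(list)
    for obj in weather:
        subsets[obj["Outlook"]].append(obj)      # partition on the chosen attribute

    for value, subset in subsets.items():
        labels = [obj["Class"] for obj in subset]
        print(value, len(subset), labels)        # Sunny 5 [...], Overcast 4 [...], Rainy 5 [...]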

Page 11

Decision Tree

Page 12

Generated Rules
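The rule listing on this slide did not survive extraction. Based on the subsets above, a tree rooted at Outlook would presumably yield one rule per root-to-leaf path, roughly along these lines; the Humidity threshold and the exact structure are my guesses, not the slides':

    def predict(obj):
        # Each branch corresponds to one rule: LHS = conjunction of tests, RHS = class.
        if obj["Outlook"] == "Overcast":
            return "P"                               # IF Outlook=Overcast THEN Play
        if obj["Outlook"] == "Sunny":
            return "DP" if obj["Hum"] > 75 else "P"  # IF Outlook=Sunny AND Hum>75 THEN Don't Play
        if obj["Outlook"] == "Rainy":
            return "DP" if obj["Wind"] else "P"      # IF Outlook=Rainy AND Wind=True THEN Don't Play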

Page 13

ID3 Algorithm

Page 14

C4.5 Extensions to ID3

• Gain ratio: information gain favors attributes with many values.
  GainRatio(C, A) = Gain(C, A) / Ent(P), where P = (|T1|/|C|, |T2|/|C|, ..., |Tm|/|C|) and the Ti are the partitions of C based on the objects' values of A.
  e.g. GainRatio(Outlook) = Gain(Outlook) / -{(5/14) log2(5/14) + (5/14) log2(5/14) + (4/14) log2(4/14)}
  (A sketch of this computation follows the list.)
• Missing values:
  – Consider only objects for which the attribute is defined.
• Continuous attributes:
  – Consider all binary splits A <= ai and A > ai, where ai is the ith value of A.
  – Compute the gain or gain ratio and choose the split that maximizes it.
• Over-fitting: Change the termination condition; if a subtree is dominated by a single class, stop splitting.
• Tree pruning: Replace a subtree by a single leaf node when the expected classification error can thereby be reduced.
• Rule deriving: A rule corresponds to a path from the root to a leaf; the LHS is the conjunction of the tests along the path and the RHS is the class prediction.
• Rule simplification: Remove some conditions from the LHS.
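A minimal sketch of the gain ratio, again reusing the gain helper and the reconstructed weather records from earlier (the helper names are mine):

    import math
    from collections import Counter

    def split_info(objects, attr):
        """Ent(P) of the partition sizes: -sum (|Ti|/|C|) log2(|Ti|/|C|)."""
        total = len(objects)
        counts = Counter(obj[attr] for obj in objects)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    def gain_ratio(objects, attr):
        return gain(objects, attr) / split_info(objects, attr)

    print(round(split_info(weather, "Outlook"), 2))   # ~1.58
    print(round(gain_ratio(weather, "Outlook"), 2))   # ~0.16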

Page 15

Evaluation of Decision Tree Methods

• Complexity

• Expressive power

• Robustness

• Effectiveness