Decision tree

A decision tree is a classifier in the form of a tree structure, where each node is either:

o A leaf node, which indicates the value of the target attribute (class) of the examples, or
o A decision node, which specifies a test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.

A decision tree classifies an example by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.
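As a small illustration (not part of the original notes), suppose a tree is stored as nested dictionaries of the form {attribute: {value: subtree-or-leaf}}; this representation, the function name classify, and the toy tree below are my own choices. Classifying an example is then just a walk from the root down to a leaf:

def classify(tree, example):
    """Follow decision nodes until a leaf (a class label) is reached."""
    while isinstance(tree, dict):                   # decision node
        attribute = next(iter(tree))                # attribute tested at this node
        tree = tree[attribute][example[attribute]]  # take the branch matching the example's value
    return tree                                     # leaf node: the predicted class

# Hypothetical one-attribute tree, just to show the shape of the data:
toy_tree = {"weather": {"sunny": "tennis", "rainy": "cinema"}}
print(classify(toy_tree, {"weather": "sunny"}))     # tennis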

The strengths of decision tree methods are:

o Decision trees are able to generate understandable rules.
o Decision trees perform classification without requiring much computation.
o Decision trees are able to handle both continuous and categorical variables.
o Decision trees provide a clear indication of which fields are most important for prediction or classification.

The weaknesses of decision tree methods are:

o Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
o Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
o Decision trees can be computationally expensive to train.

Uses: Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal. Another use of decision trees is as a descriptive means for calculating conditional probabilities.

Advantages of decision trees:

o They are simple to understand and interpret.
o They have value even with little hard data.

Disadvantages of decision trees:

o For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of attributes with more levels.

Decision tree induction is a typical inductive approach to learning classification knowledge. The key requirements for mining with decision trees are:

o Attribute-value description.
o Predefined classes (target attribute values).
o Discrete classes.
o Sufficient data.

Entropy characterizes the (im)purity of an arbitrary collection of examples. Given a set S containing only positive and negative examples of some target concept (a two-class problem), the entropy of S relative to this simple, binary classification is defined as:

Entropy(S) = -p_p log2(p_p) - p_n log2(p_n)

where p_p is the proportion of positive examples in S and p_n is the proportion of negative examples in S. Suppose S is a collection of 25 examples, including 15 positive and 10 negative examples [15+, 10-]. Then the entropy of S relative to this classification is

Entropy(S) = -(15/25) log2(15/25) - (10/25) log2(10/25) = 0.971

The entropy is 0 if all members of S belong to the same class. The entropy is 1 (its maximum) when the collection contains an equal number of positive and negative examples. If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1. Figure 1 shows the form of the entropy function relative to a binary classification, as p_p varies between 0 and 1.

Figure 1: The entropy function relative to a binary classification, as the proportion of positive examples p_p varies between 0 and 1.

If the target attribute can take on c different values, then the entropy of S relative to this c-wise classification is defined as

Entropy(S) = - Σ (i = 1 to c) p_i log2(p_i)

where p_i is the proportion of S belonging to class i.
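As a concrete check of this definition and of the [15+, 10-] example above (my addition, not part of the original notes), a small Python sketch; the function name entropy and the use of collections.Counter are my own choices:

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a collection of examples, given the list of their class labels."""
    total = len(labels)
    counts = Counter(labels)                                   # number of examples per class
    return -sum((n / total) * log2(n / total) for n in counts.values())

# The [15+, 10-] collection from above: the entropy comes out at about 0.971.
print(entropy(["+"] * 15 + ["-"] * 10))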

Information gain measures how well a given attribute separates the training examples according to their target classification. This measure is used to select among the candidate attributes at each step while growing the tree.

Information gain is simply the expected reduction in entropy caused by partitioning the examples according to this attribute. More precisely, the information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as

Gain(S, A) = Entropy(S) - Σ (v in Values(A)) (|Sv|/|S|) Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v (i.e., Sv = {s ∈ S | A(s) = v}).
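Continuing the sketch (again my addition), information gain can be computed directly from this definition. Each example is assumed to be a dictionary of attribute values with its class stored under a separate target key, and the entropy helper from the previous snippet is assumed to be in scope; none of these conventions come from the notes themselves.

def gain(examples, attribute, target="class"):
    """Gain(S, A): expected reduction in entropy from partitioning examples on attribute."""
    # entropy() is the helper sketched after the entropy definition above.
    labels = [e[target] for e in examples]
    remainder = 0.0
    for v in {e[attribute] for e in examples}:                        # values of A observed in S
        subset = [e[target] for e in examples if e[attribute] == v]   # labels of Sv
        remainder += (len(subset) / len(examples)) * entropy(subset)  # (|Sv|/|S|) * Entropy(Sv)
    return entropy(labels) - remainder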

This process continues for each new leaf node until either of two conditions is met:

1. Every attribute has already been included along this path through the tree, or
2. The training examples associated with this leaf node all have the same target attribute value (i.e., their entropy is zero).

Avoiding over-fitting the data

Growing the tree just deeply enough to perfectly classify the training examples can lead to difficulties when there is noise in the data, or when the number of training examples is too small to produce a representative sample of the true target function. In either of these cases, this simple algorithm can produce trees that over-fit the training examples.

Over-fitting is a significant practical difficulty for decision tree learning and many other learning methods. There are several approaches to avoiding over-fitting in decision tree learning. These can be grouped into two classes:

o Approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data,

o Approaches that allow the tree to over-fit the data and then post-prune the tree.
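The notes leave it at this classification of approaches. Purely as an illustration of the two ideas, and assuming scikit-learn (which the notes never mention), early stopping can be expressed through growth limits such as max_depth and min_samples_leaf, while post-pruning can be expressed through cost-complexity pruning with ccp_alpha:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Early stopping ("pre-pruning"): refuse to grow the tree beyond fixed limits.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning: compute the cost-complexity pruning path of the fully grown tree,
# then refit with a pruning strength chosen from that path.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
post_pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)

print(pre_pruned.get_depth(), post_pruned.get_depth())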

Entropy

Putting together a decision tree is all a matter of choosing which attribute to test at each node in the tree. We shall define a measure called information gain which will be used to decide which attribute to test at each node. Information gain is itself calculated using a measure called entropy, which we first define for the case of a binary decision problem and then define for the general case.

Given a binary categorization C, and a set of examples S for which the proportion of examples categorized as positive by C is p+ and the proportion categorized as negative by C is p-, the entropy of S is:

Entropy(S) = -p+ log2(p+) - p- log2(p-)

The reason we defined entropy first for a binary decision problem is that it is easier to get an impression of what it is trying to calculate. Tom Mitchell puts this quite well:

"In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples."

Given an arbitrary categorization C into categories c1, ..., cn, and a set of examples S for which the proportion of examples in ci is pi, the entropy of S is:

Entropy(S) = - Σ (i = 1 to n) pi log2(pi)

Information Gain

The information gain of attribute A, relative to a collection of examples S, is calculated as:

Gain(S, A) = Entropy(S) - Σ (v in Values(A)) (|Sv|/|S|) Entropy(Sv)

The information gain of an attribute can be seen as the expected reduction in entropy caused by knowing the value of attribute A.

An Example Calculation

As an example, suppose we are working with a set of examples, S = {s1,s2,s3,s4} categorized into a binary categorization of positives and negatives, such that s1 is positive and the rest are negative. Suppose further that we want to calculate the information gain of an attribute, A, and that A can take the values {v1,v2,v3}. Finally, suppose that:

s1 takes value v2 for A
s2 takes value v2 for A
s3 takes value v3 for A
s4 takes value v1 for A

To work out the information gain for A relative to S, we first need to calculate the entropy of S. To use our formula for binary categorizations, we need to know the proportion of positives in S and the proportion of negatives. These are given as: p+ = 1/4 and p- = 3/4. So, we can calculate:

Entropy(S) = -(1/4)log2(1/4) -(3/4)log2(3/4)

= -(1/4)(-2) -(3/4)(-0.415)

= 0.5 + 0.311 = 0.811

Note that, to do this calculation with your calculator, you may need to remember that log2(x) = ln(x)/ln(2), where ln(2) is the natural log of 2. Next, we need to calculate the weighted Entropy(Sv) for each value v = v1, v2, v3, noting that the weighting involves multiplying by (|Svi|/|S|). Remember also that Sv is the set of examples from S which have value v for attribute A. This means that:

Sv1 = {s4}, Sv2 = {s1, s2}, Sv3 = {s3}.

We now need to carry out the following calculations:

(|Sv1|/|S|) * Entropy(Sv1) = (1/4) * (-(0/1)log2(0/1) - (1/1)log2(1/1))
                           = (1/4) * (-0 - (1)log2(1)) = (1/4) * (-0 - 0) = 0

(|Sv2|/|S|) * Entropy(Sv2) = (2/4) * (-(1/2)log2(1/2) - (1/2)log2(1/2))
                           = (1/2) * (-(1/2)*(-1) - (1/2)*(-1)) = (1/2) * (1) = 1/2

(|Sv3|/|S|) * Entropy(Sv3) = (1/4) * (-(0/1)log2(0/1) - (1/1)log2(1/1))
                           = (1/4) * (-0 - (1)log2(1)) = (1/4) * (-0 - 0) = 0

Note that we have taken 0 log2(0) to be zero, which is standard. In our calculation, we only required log2(1) = 0 and log2(1/2) = -1. We now add these three values together and subtract the total from our earlier calculation of Entropy(S) to give the final result:

Gain(S,A) = 0.811 - (0 + 1/2 + 0) = 0.311
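This arithmetic can be double-checked with the entropy and gain sketches from earlier (assumed to be in scope; the dictionary encoding of s1 to s4 below is mine):

# s1 is positive, s2-s4 are negative; their values for A are as listed above.
S = [
    {"A": "v2", "class": "+"},   # s1
    {"A": "v2", "class": "-"},   # s2
    {"A": "v3", "class": "-"},   # s3
    {"A": "v1", "class": "-"},   # s4
]

print(round(entropy([e["class"] for e in S]), 3))   # 0.811
print(round(gain(S, "A"), 3))                       # 0.311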

We now look at how information gain can be used in practice in an algorithm to construct decision trees.

The ID3 algorithm: Given a set of examples, S, categorized in categories ci, then:

1. Choose the root node to be the attribute, A, which scores the highest for information gain relative to S.

2. For each value v that A can possibly take, draw a branch from the node.

3. For each branch from A corresponding to value v, calculate Sv. Then:

o If Sv is empty, choose the category c_default which contains the most examples from S, and put this as the leaf node category which ends that branch.

o If Sv contains only examples from a category c, then put c as the leaf node category which ends that branch.

o Otherwise, remove A from the set of attributes which can be put into nodes. Then put a new node in the decision tree, where the new attribute being tested in the node is the one which scores highest for information gain relative to Sv (note: not relative to S). This new node starts the cycle again (from step 2), with S replaced by Sv in the calculations, and the tree gets built recursively in this way.

The algorithm terminates either when all the attributes have been exhausted, or the decision tree perfectly classifies the examples.
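For concreteness, here is one way this algorithm might look in Python (my own sketch, reusing the gain helper from earlier; the nested-dictionary tree representation matches the classify sketch near the start of these notes). Branches are only drawn for attribute values that actually occur in the examples, so the empty-Sv case from step 3 does not arise in this simplified version.

def id3(examples, attributes, target="class"):
    """Build a decision tree as nested dicts: {attribute: {value: subtree-or-leaf}}."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                       # all examples in one category: leaf node
        return labels[0]
    if not attributes:                              # attributes exhausted: majority category
        return max(set(labels), key=labels.count)

    # Root of this (sub)tree: the attribute with the highest gain relative to examples.
    best = max(attributes, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    for v in {e[best] for e in examples}:           # one branch per observed value of best
        subset = [e for e in examples if e[best] == v]
        remaining = [a for a in attributes if a != best]
        tree[best][v] = id3(subset, remaining, target)   # recurse with S replaced by Sv
    return tree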

A worked example

We will stick with our weekend example. Suppose we want to train a decision tree using the following instances:

Weekend (Example)   Weather   Parents   Money   Decision (Category)
W1                  Sunny     Yes       Rich    Cinema
W2                  Sunny     No        Rich    Tennis
W3                  Windy     Yes       Rich    Cinema
W4                  Rainy     Yes       Poor    Cinema
W5                  Rainy     No        Rich    Stay in
W6                  Rainy     Yes       Poor    Cinema
W7                  Windy     No        Poor    Cinema
W8                  Windy     No        Rich    Shopping
W9                  Windy     Yes       Rich    Cinema
W10                 Sunny     No        Rich    Tennis

The first thing we need to do is work out which attribute will be put into the node at the top of our tree: either weather, parents or money. To do this, we need to calculate:

Entropy(S) = -p_cinema log2(p_cinema) - p_tennis log2(p_tennis) - p_shopping log2(p_shopping) - p_stay_in log2(p_stay_in)
           = -(6/10) log2(6/10) - (2/10) log2(2/10) - (1/10) log2(1/10) - (1/10) log2(1/10)
           = -(6/10)(-0.737) - (2/10)(-2.322) - (1/10)(-3.322) - (1/10)(-3.322)
           = 0.4422 + 0.4644 + 0.3322 + 0.3322
           = 1.571

and we need to determine the best of:

Gain(S, weather) = 1.571 - (|Ssun|/10)*Entropy(Ssun) - (|Swind|/10)*Entropy(Swind) - (|Srain|/10)*Entropy(Srain)
                 = 1.571 - (0.3)*Entropy(Ssun) - (0.4)*Entropy(Swind) - (0.3)*Entropy(Srain)
                 = 1.571 - (0.3)*(0.918) - (0.4)*(0.811) - (0.3)*(0.918)
                 = 0.70

Gain(S, parents) = 1.571 - (|Syes|/10)*Entropy(Syes) - (|Sno|/10)*Entropy(Sno)
                 = 1.571 - (0.5)*0 - (0.5)*1.922
                 = 1.571 - 0.961 = 0.61

Gain(S, money) = 1.571 - (|Srich|/10)*Entropy(Srich) - (|Spoor|/10)*Entropy(Spoor)
               = 1.571 - (0.7)*(1.842) - (0.3)*0
               = 1.571 - 1.2894 = 0.2816

This means that the first node in the decision tree will be the weather attribute. As an exercise, convince yourself why this scored (slightly) higher than the parents attribute - remember what entropy means and look at the way information gain is calculated.
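These numbers can be reproduced with the earlier sketches (my addition; the weekend table is encoded below as dictionaries, and the entropy and gain helpers from the earlier snippets are assumed to be in scope):

weekends = [
    {"weather": "sunny", "parents": "yes", "money": "rich", "class": "cinema"},   # W1
    {"weather": "sunny", "parents": "no",  "money": "rich", "class": "tennis"},   # W2
    {"weather": "windy", "parents": "yes", "money": "rich", "class": "cinema"},   # W3
    {"weather": "rainy", "parents": "yes", "money": "poor", "class": "cinema"},   # W4
    {"weather": "rainy", "parents": "no",  "money": "rich", "class": "stay in"},  # W5
    {"weather": "rainy", "parents": "yes", "money": "poor", "class": "cinema"},   # W6
    {"weather": "windy", "parents": "no",  "money": "poor", "class": "cinema"},   # W7
    {"weather": "windy", "parents": "no",  "money": "rich", "class": "shopping"}, # W8
    {"weather": "windy", "parents": "yes", "money": "rich", "class": "cinema"},   # W9
    {"weather": "sunny", "parents": "no",  "money": "rich", "class": "tennis"},   # W10
]

print(round(entropy([e["class"] for e in weekends]), 3))        # 1.571
for a in ("weather", "parents", "money"):
    print(a, round(gain(weekends, a), 2))                       # weather 0.7, parents 0.61, money 0.28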

From the weather node, we draw a branch for each of the values that weather can take: sunny, windy and rainy.

Now we look at the first branch. Ssunny = {W1, W2, W10}. This is not empty, so we do not put a default categorisation leaf node here. The categorisations of W1, W2 and W10 are Cinema, Tennis and Tennis respectively. As these are not all the same, we cannot put a categorisation leaf node here. Hence we put an attribute node here, which we will leave blank for the time being.

Looking at the second branch, Swindy = {W3, W7, W8, W9}. Again, this is not empty, and the examples do not all belong to the same class, so we put an attribute node here, left blank for now. The same situation occurs with the third branch, so our amended tree has the weather node at the root with a blank attribute node at the end of each of the three branches.

Now we have to fill in the choice of attribute A, which we know cannot be weather, because we've already removed that from the list of attributes to use. So, we need to calculate the values for Gain(Ssunny, parents) and Gain(Ssunny, money). Firstly, Entropy(Ssunny) = 0.918. Next, we set S to be Ssunny = {W1,W2,W10} (and, for this part of the branch, we will ignore all the other examples). In effect, we are interested only in this part of the table:

Weekend (Example)   Weather   Parents   Money   Decision (Category)
W1                  Sunny     Yes       Rich    Cinema
W2                  Sunny     No        Rich    Tennis
W10                 Sunny     No        Rich    Tennis

Gain(Ssunny, parents) = 0.918 - (|Syes|/|S|)*Entropy(Syes) - (|Sno|/|S|)*Entropy(Sno)
                      = 0.918 - (1/3)*0 - (2/3)*0
                      = 0.918

Gain(Ssunny, money) = 0.918 - (|Srich|/|S|)*Entropy(Srich) - (|Spoor|/|S|)*Entropy(Spoor)
                    = 0.918 - (3/3)*0.918 - (0/3)*0
                    = 0.918 - 0.918 = 0

Notice that Entropy(Syes) and Entropy(Sno) were both zero, because Syes contains examples which are all in the same category (cinema), and Sno similarly contains examples which are all in the same category (tennis). This should make it more obvious why we use information gain to choose attributes to put in nodes.

Given our calculations, attribute A should be taken as parents. The two values of parents are yes and no, and we will draw a branch from the node for each of these. Remembering that we replaced the set S by the set Ssunny, looking at Syes, we see that the only example of this is W1. Hence, the branch for yes stops at a categorization leaf, with the category being Cinema. Also, Sno contains W2 and W10, but these are in the same category (Tennis). Hence the branch for no also ends at a categorization leaf. Hence our upgraded tree looks like this:

(Tree so far: weather at the root; under the sunny branch, a parents node whose yes branch ends in a Cinema leaf and whose no branch ends in a Tennis leaf. The windy and rainy branches still end in blank attribute nodes.)
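To close the loop (again my addition, assuming the id3 sketch and the weekends list from the earlier snippets are in scope), building the whole tree programmatically reproduces this structure, along with sub-trees for the remaining windy and rainy branches:

tree = id3(weekends, attributes=["weather", "parents", "money"])
print(tree)
# One possible result (dictionary ordering aside):
# {'weather': {'sunny': {'parents': {'yes': 'cinema', 'no': 'tennis'}},
#              'windy': {'parents': {'yes': 'cinema',
#                                    'no': {'money': {'rich': 'shopping', 'poor': 'cinema'}}}},
#              'rainy': {'parents': {'yes': 'cinema', 'no': 'stay in'}}}}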
