
Published on ONLamp.com (http://www.onlamp.com/)

Building Decision Trees in Python
by Christopher Roach
02/09/2006

You have a great idea to start selling the most marvelous widget ever known. You're so sure of your enterprising idea that you decide to go into business for yourself and begin manufacturing said widgets. A few years pass and you're a success. However, lately you've noticed a slump in sales, and you decide that you need a better way of focusing your marketing dollars toward your target audience. How do you do this?

This article introduces a popular and easy-to-use datamining tool called a decision tree that should help you solve your marketing dilemma.

Decision trees are a topic of artificial intelligence. More specifically, they belong to the subfield of machine learning. This is due to their ability to learn--through example--to classify individual records in a data set.

This article will give you a closer look at what decision trees are, and how they can help you in your endeavors to more effectively target your marketing campaign toward your core customers. In so doing, I'll go over a few different uses for decision trees (aside from the aforementioned marketing scenario), and I'll discuss one popular heuristic that can be used during the learning process to create the most effective decision tree. Finally, I'll implement a simple decision tree program using the Python programming language.

If you've ever had any interest in machine learning or artificial intelligence; if you find the idea of writing a program that has the ability to learn simply fascinating; or if you just happen to manufacture the world's best widgets, then this article is for you.

Decision Tree ... What's That?

Decision trees fall under the subfield of machine learning within the larger field of artificial intelligence. Decision trees are mainly used for classification purposes, but they are also helpful in uncovering features of data that were previously unrecognizable to the eye. Thus, they can be very useful in datamining activities as well as data classification. They work well in many different areas, from typical business scenarios (such as the widget example described previously) to airplane autopilots and medical diagnoses.


A decision tree is essentially a series of if-then statements that, when applied to a record in a data set, results in the classification of that record. Therefore, once you've created your decision tree, you will be able to run a data set through the program and get a classification for each individual record within the data set. What this means to you, as a manufacturer of quality widgets, is that the program you create from this article will be able to predict the likelihood of each user, within a data set, purchasing your finely crafted product.

Though the classification of data is the driving force behind creating a decision tree program, it is not what makes a decision tree special. The beauty of decision trees lies in their ability to learn. When finished, you will be able to feed your program a test set of data, and it essentially will learn how to classify future sets of data from the examples.

Hopefully, if I've done my job well enough, you're champing at the bit to start coding. However, before you go any further with your code monkey endeavors, you need a good idea of what a decision tree looks like. It's one of those data structures that is easiest to understand with a good visual representation. Figure 1 contains a graphical depiction of a decision tree.

Figure 1. A decision tree

Notice that it actually is a true tree structure. Because of this, you can use recursive techniques to both create and traverse the decision tree. For that reason, you could use just about any tree representation that you remember from your Data Structures course to represent the decision tree. In this article, though, I'm going to keep everything very simple and create my decision tree out of only Python's built-in dictionary object.
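
To make that concrete, here is a rough sketch of what such a dictionary could look like for a tree shaped like the one in Figure 1. The branches shown are only illustrative (the real tree is built automatically by the code later in the article), but the shape is the point: interior nodes are single-key dictionaries, and leaves are plain strings.

# Illustrative only: a nested-dictionary tree in the spirit of Figure 1.
# Interior nodes look like {attribute: {value: subtree}}; leaves are strings.
tree = {
    "Age": {
        "36-55": {"Marital Status": {"single": "will buy",
                                     "married": "won't buy"}},
        "< 18":  {"Income": {"low": "will buy",
                             "high": "won't buy"}},
        "18-35": "won't buy",
        # ... remaining branches omitted
    }
}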

Fleshing Out the Scenario

If you recall the example scenario, you are the manufacturer of the world's finest widgets and are currently looking for a way to focus your marketing efforts to more closely entice your target demographic. To do this, you need a set of test data to use to train the decision tree program.

I assume that, being the concerned entrepreneur that you undoubtedly are, you've already gathered plenty of demographic information through, perhaps, anonymous email surveys. Now, what you need to do is organize all this data into a large set of user records. Here is a table containing a sampling of the information you collected during your email survey:

Age     Education    Income  Marital Status  Purchase?
36-55   master's     high    single          will buy
18-35   high school  low     single          won't buy
36-55   master's     low     single          will buy
18-35   bachelor's   high    single          won't buy
< 18    high school  low     single          will buy
18-35   bachelor's   high    married         won't buy
36-55   bachelor's   low     married         won't buy
> 55    bachelor's   high    single          will buy
36-55   master's     low     married         won't buy
> 55    master's     low     married         will buy
36-55   master's     high    single          will buy
> 55    master's     high    single          will buy
< 18    high school  high    single          won't buy
36-55   master's     low     single          will buy
36-55   high school  low     single          will buy
< 18    high school  low     married         will buy
18-35   bachelor's   high    married         won't buy
> 55    high school  high    married         will buy
> 55    bachelor's   low     single          will buy
36-55   high school  high    married         won't buy
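
The code later in this article indexes each record by attribute name (record[target_attr], record[attr]), so a natural way to hold this survey data in Python is as a list of dictionaries. The key spellings below are my own assumption; the article's test.py may name them differently.

# First two survey records as dictionaries keyed by attribute name
# (key spellings are illustrative).
data = [
    {"Age": "36-55", "Education": "master's", "Income": "high",
     "Marital Status": "single", "Purchase?": "will buy"},
    {"Age": "18-35", "Education": "high school", "Income": "low",
     "Marital Status": "single", "Purchase?": "won't buy"},
    # ... one dictionary per row of the table
]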

Given the decision tree in Figure 1 and the set of data, it should be somewhat easy to see just how a decision tree can classify records in a data set. Starting with the top node (Age), check the value of the first record in the field matching that of the top node (in this case, 36-55). Then follow the link to the next node in the tree (Marital Status) and repeat the process until you finally reach a leaf node (a node with no children). This leaf node holds the answer to the question of whether the user will buy your product. (In this example, the user will buy, because his marital status is single.) It's also quite easy to see that this type of operation lends itself to a recursive process (not necessarily the most efficient way to program, I know--but a very elegant way).
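
Because the tree is just nested dictionaries, that walk can be written as a short recursive function. The sketch below is my own illustration of the idea, not code from the article's tarball (which supplies its own classification helper):

def classify(tree, record):
    # Leaves are plain values such as "will buy"; interior nodes are
    # single-key dictionaries of the form {attribute: {value: subtree}}.
    if not isinstance(tree, dict):
        return tree
    attr = list(tree.keys())[0]          # attribute tested at this node
    subtree = tree[attr][record[attr]]   # follow the branch for this record
    return classify(subtree, record)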

The decision tree in the figure is just one of many decision tree structures you could create to solve the marketing problem. The task of finding the optimal decision tree is an intractable problem. For those of you who have taken an analysis of algorithms course, you no doubt recognize this term. For those of you who haven't had this pleasure (he says, gritting his teeth), essentially what this means is that as the amount of test data used to train the decision tree grows, the amount of time it takes to find the optimal tree grows as well--exponentially. While it may be nearly impossible to find the smallest (or more fittingly, the shallowest) decision tree in a respectable amount of time, it is possible to find a decision tree that is "small enough" using special heuristics. It is the job of the heuristic you choose to accomplish this task by choosing the "next best" attribute by which to divide the data set according to some predefined criteria. There are many such heuristics (the gain ratio used by C4.5 and C5.0, the Gini index, and others). However, for this article I've used one of the more popular heuristics for choosing "next best" attributes, based on some of the ideas found in information theory. The ID3 heuristic (short for Iterative Dichotomiser 3) uses the concept of entropy to calculate which attribute is best to use for dividing the data into subgroups.

The next section quickly covers the basic idea behind how this heuristic works. Don't worry; it's not too much math. Following this discussion, you'll finally get a chance to get your hands dirty by writing the code that will create the decision tree and classify the users in your data set as a "will buy" or "won't buy," thereby making your company instantly more profitable.

The ID3 Heuristic

Physics uses the term entropy to describe the amount of disorder inherent within a system. In information theory, this term has a similar meaning--it is the measure of the disorder in a set of data. The ID3 heuristic uses this concept to come up with the "next best" attribute in the data set to use as a node, or decision criterion, in the decision tree. The idea behind the ID3 heuristic is to find the attribute that most lowers the entropy of the data set, thereby reducing the amount of information needed to completely describe each piece of data. By following this heuristic, you will essentially be finding the best attribute for classifying the records in the data set (in terms of the reduction in the amount of information needed to describe the remaining division of the data).

Information theory uses the log function with a base of 2 to determine the number of bits necessary to represent a piece of information. If you remember from early math education, the log function finds the exponent in an equation such as 2^x = 8. In this equation, x is equal to 3. The exponent in that equation is easy enough to see, but what about a more difficult example, such as 2^x = 8,388,608? By using the logarithm function with a base of 2--log2(8,388,608)--you can find that the exponent x is equal to 23. Thus you need 23 bits of information to properly represent 8,388,608 different numbers. This is the basic idea behind the entropy measurement in the ID3 algorithm. In other words, you are trying to find the attribute that best reduces the amount of information you need to classify your data.
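
If you want to verify that arithmetic, Python's math module does it directly (this one-liner is just an illustration, not part of the article's code):

import math
# 2 ** 23 == 8,388,608, so 23 bits are needed to distinguish that many values.
print(round(math.log(8388608, 2)))   # 23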

The first step in finding the entropy for the data set is to calculate the probability of each value occurring in the target attribute. In the sample data above, 12 of the 20 records are classified as "will buy" and 8 as "won't buy," which gives probabilities of 0.6 and 0.4, respectively.

The next step in finding the entropy for the data set is to find the number of bits needed to represent each of the probabilities we calculated in the previous step. This is where you use the logarithm function. For the example above, the number of bits needed to represent the probability of each value occurring in the target attribute is log2(0.6) = -0.737 for "will buy" and log2(0.4) = -1.322 for "won't buy."

Now that you have the number of bits needed to represent the probability of each value occurring in the data set, all that's left to do is sum this up and, voilà, you have the entropy for the data set! Right? Not exactly. Before you do this, there is one more step. You need to go through and weight each of these numbers before summing them. To do so, multiply each amount that you found in the previous step by the probability of that value occurring, and then multiply the outcome by -1 to make the number positive. Once you've done this, the summation should look like (-0.6 * -0.737) + (-0.4 * -1.322) = 0.97. Thus, 0.97 is the entropy for the data set in the table above.
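
As a quick sanity check, the same calculation can be done in a couple of lines of Python (again, just an illustration rather than part of the article's source):

import math

probs = [0.6, 0.4]  # P("will buy") and P("won't buy") from the 20-record table
entropy = sum(-p * math.log(p, 2) for p in probs)
print(round(entropy, 2))  # 0.97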

That's all there is to finding the entropy for a set of data. You use the same equation to calculate the entropy of each subset of data in the gain equation; the only difference is that you apply it to a smaller group of records--those that share a particular value of the attribute being evaluated--rather than to the whole data set.

The next step in the ID3 heuristic is to calculate the information gain that each attribute affords if it is the next decision criterion in the decision tree. If you understood the first step of calculating the entropy, then this step should be a breeze. Essentially, all you need to do to find the gain for a specific attribute is find the entropy measurement for that attribute using the process described in the last few paragraphs (find the entropy of the subset of data for each value of the chosen attribute, weight each one by the fraction of records that take that value, and sum them all), and subtract this value from the entropy for the entire data set. The decision tree algorithm follows this process for each attribute in the data set, and the attribute with the highest gain will be the one chosen as the next node in the decision tree.

That's the prose explanation. For those of you a bit more mathematically minded, the equations in Figure 2 and Figure 3 are the entropy and information gain for the data set.

Figure 2. The entropy equation

Figure 3. The information gain equation
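
The figures are images in the original article; for reference, the standard forms of the two equations--consistent with the entropy and gain functions shown below--are as follows, where S is the data set, p_c is the fraction of records whose target attribute takes the value c, and S_v is the subset of S in which attribute A has the value v:

$$\mathrm{Entropy}(S) = -\sum_{c} p_c \log_2 p_c$$

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$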

Entropy and gain are the only two methods you need in the ID3 module. If you understood the concepts of entropy and information gain, then you understand the final pieces of the puzzle.

Just as a quick note: if you didn't totally understand the section on the ID3 heuristic, don't worry--several good web sites go over the ID3 heuristic in more detail. (One very good site in particular is decisiontrees.net, created by Michael Nashvili.) Also, keep in mind that it's just one of several heuristics that you can use to decide the "next best" node in the decision tree. The most important thing is to understand the inner workings of the decision tree algorithm. In the end, if you don't understand ID3, you can always just plug in another heuristic or create your own.

The Decision Tree Learning Algorithm

With most of the preliminary information out of the way, you can now look at the actual decision tree algorithm. The following code listing is the main function used to create your decision tree:

def create_decision_tree(data, attributes, target_attr, fitness_func):
    """
    Returns a new decision tree based on the examples given.
    """
    data = data[:]
    vals = [record[target_attr] for record in data]
    default = majority_value(data, target_attr)

    # If the dataset is empty or the attributes list is empty, return the
    # default value. When checking the attributes list for emptiness, we
    # need to subtract 1 to account for the target attribute.
    if not data or (len(attributes) - 1) <= 0:
        return default
    # If all the records in the dataset have the same classification,
    # return that classification.
    elif vals.count(vals[0]) == len(vals):
        return vals[0]
    else:
        # Choose the next best attribute to best classify our data
        best = choose_attribute(data, attributes, target_attr, fitness_func)

        # Create a new decision tree/node with the best attribute and an empty
        # dictionary object--we'll fill that up next.
        tree = {best: {}}

        # Create a new decision tree/sub-node for each of the values in the
        # best attribute field
        for val in get_values(data, best):
            # Create a subtree for the current value under the "best" field
            subtree = create_decision_tree(
                get_examples(data, best, val),
                [attr for attr in attributes if attr != best],
                target_attr,
                fitness_func)

            # Add the new subtree to the empty dictionary object in our new
            # tree/node we just created.
            tree[best][val] = subtree

    return tree

The create_decision_tree function starts off by declaring three variables: data, vals, and default. The first, data, is just a copy of the data list being passed into the function. I do this because Python passes all mutable data types, such as dictionaries and lists, by reference; it's a good rule of thumb to make a copy of any of these in order to keep from accidentally altering the original data. vals is a list of all the values in the target attribute for each record in the data set, and default holds the default value that is returned from the function when the data set is empty. That is simply the value of the target attribute with the highest frequency, and thus the best guess for when the decision tree is unable to classify a record.
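
majority_value itself is one of the helper functions that ships with the article's source rather than being listed here; a minimal sketch of what it might look like, assuming the same record-dictionary format, is:

def majority_value(data, target_attr):
    # Return the most frequent value of the target attribute in the data set.
    counts = {}
    for record in data:
        counts[record[target_attr]] = counts.get(record[target_attr], 0) + 1
    return max(counts, key=counts.get) if counts else None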

The next lines are the real nitty-gritty of the algorithm. The algorithm makes use of recursion to create the decision tree, and as such it needs a base case (or, in this case, two base cases) to prevent it from entering an infinite recursive loop. What are the base cases for this algorithm? For starters, if either the data or attributes list is empty, then the algorithm has reached a stopping point. The first if-then statement takes care of this case. If either list is empty, then the algorithm returns a default value. (Actually, for the attributes list, check to see whether it has only one attribute in it, because the attributes list also contains the target attribute, which the decision tree never splits on; it is what the tree should predict.) It returns the value with the highest frequency in the data set for the target attribute. The only other case to worry about is when the remaining records in the data list all have the same value for the target attribute, in which case the algorithm returns that value.

Those are the base cases. What about the recursive case? Well, when everything else is normal (that is, the data and attributes lists are not empty and the records in the list of data still have multiple values for the target attribute), the algorithm needs to choose the "next best" attribute for classifying the test data and add it to the decision tree. The choose_attribute function is responsible for picking the "next best" attribute for classifying the records in the test data set. It does so using whatever heuristic you've chosen--in this case, ID3: the fitness_func parameter passed to choose_attribute is a pointer to the gain function described in the next section. By passing in a pointer to the gain function, you've effectively separated the code for choosing the next attribute from the code for assembling the decision tree, which makes it possible, and extremely easy, to switch out the ID3 heuristic for another heuristic you prefer with only a minimal amount of change to the code. After this, the code creates a new decision tree containing only the newly selected "best" attribute. Then the recursion takes place. In other words, each of the subtrees is created by making a recursive call to the create_decision_tree function and adding the returned tree to the newly created tree in the last step.

The first step in this process is getting the "next best" attribute from the set of available attributes. The call to choose_attribute takes care of this step. The next step is to create a new decision tree containing the chosen attribute as the root node. All that remains to do after this is to create the subtrees for each of the values in the best attribute. The get_values function cycles through each of the records in the data set and returns a list containing the unique values for the chosen attribute. Next, the code loops through each of these unique values and creates a subtree for them by making a recursive call to the create_decision_tree function. The call to get_examples just returns a list of all the records in the data set that have the value val for the attribute defined by the best variable. This list of examples is passed to the create_decision_tree function along with the list of remaining attributes (minus the currently selected "next best" attribute). The call to create_decision_tree will return the subtree for the remaining list of attributes and the subset of data passed into it. All that's left is to add each of these subtrees to the current decision tree and return it.
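
get_values and get_examples live in the helper code that accompanies the article; based on the description above, minimal versions could look something like this (the real ones in d_tree.py may differ in detail):

def get_values(data, attr):
    # Unique values taken by attr across all records, in first-seen order.
    values = []
    for record in data:
        if record[attr] not in values:
            values.append(record[attr])
    return values

def get_examples(data, attr, value):
    # All records whose attr field equals the given value.
    return [record for record in data if record[attr] == value]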

If you're not used to recursion, this process can seem a bit strange. Take some time to look over the code and make sure that you understand what is happening here. Create a little script to run the function and print out the tree (or just alter test.py to do so), so you can get a better idea of how it's functioning. It's really a good idea to take your time and make sure you understand what's happening, because many programming problems lend themselves to a recursive solution--you just may be adding a very important tool to your programming arsenal.

That's about all there is to the algorithm; everything else is just helper functions. Most of them should be fairly self-explanatory, with the exception of the ID3 heuristic.

Implementing the ID3 Heuristic

The ID3 heuristic uses the concept of entropy to formulate the gain in information received by choosing a particular attribute to be the next node in the decision tree. Here's the entropy function:

import math  # required for math.log below

def entropy(data, target_attr):
    """
    Calculates the entropy of the given data set for the target attribute.
    """
    val_freq = {}
    data_entropy = 0.0

    # Calculate the frequency of each of the values in the target attr
    # (dict.has_key is Python 2 syntax; in Python 3, use
    # `record[target_attr] in val_freq` instead)
    for record in data:
        if (val_freq.has_key(record[target_attr])):
            val_freq[record[target_attr]] += 1.0
        else:
            val_freq[record[target_attr]] = 1.0

    # Calculate the entropy of the data for the target attribute
    for freq in val_freq.values():
        data_entropy += (-freq/len(data)) * math.log(freq/len(data), 2)

    return data_entropy

Just like the create_decision_tree function, the first thing the entropy function does is create the variables it uses throughout the algorithm. The first is a dictionary object called val_freq to hold all the values found in the data set passed into this function and the frequency at which each value appears in the data set. The other variable is data_entropy, which holds the ongoing calculation of the data's entropy value.

The next section of code adds each of the values in the data set to the val_freq dictionary and calculates the corresponding frequency for each value. It does so by looping through each of the records in the data set and checking the val_freq dictionary object to see if the current value already resides within it. If it does, it increments the frequency for the current value; otherwise, it adds the current value to the dictionary object and initializes its frequency to 1. The final portion of the code is responsible for actually calculating the entropy measurement (using the equation in Figure 2) with the frequencies stored in the val_freq dictionary object.

That was easy, wasn't it? That's only the first half of the ID3 heuristic. Now that you know how to calculate the amount of disorder in a set of data, you need to take those calculations and use them to find the amount of information gain you will get by using an attribute in the decision tree. The information gain function is very similar to the entropy function. Here's the code that calculates this measurement:

def gain(data, attr, target_attr):
    """
    Calculates the information gain (reduction in entropy) that would
    result by splitting the data on the chosen attribute (attr).
    """
    val_freq = {}
    subset_entropy = 0.0

    # Calculate the frequency of each of the values in the chosen attribute
    for record in data:
        if (val_freq.has_key(record[attr])):
            val_freq[record[attr]] += 1.0
        else:
            val_freq[record[attr]] = 1.0

    # Calculate the sum of the entropy for each subset of records weighted
    # by their probability of occurring in the training set.
    for val in val_freq.keys():
        val_prob = val_freq[val] / sum(val_freq.values())
        data_subset = [record for record in data if record[attr] == val]
        subset_entropy += val_prob * entropy(data_subset, target_attr)

    # Subtract the entropy of the chosen attribute from the entropy of the
    # whole data set with respect to the target attribute (and return it)
    return (entropy(data, target_attr) - subset_entropy)

Once again, the code starts by calculating the frequency of each of the values of the chosen attribute in the data set. Following this, it calculates the entropy for the data set with the new division of data derived by using the chosen attribute, attr, to classify the records in the data set. Subtracting that from the original entropy of the current data set gives the gain in information (or reduction in disorder, if you prefer to think in those terms) that you get by choosing that attribute as the next node in the decision tree.

That is essentially all there is to it. You still need some code that cycles through each attribute and calculates its information gain measure and chooses the best one, but that part of the code should be somewhat obvious; it's just a matter of repeatedly calling the gain function on each attribute and keeping track of the attribute with the best score. That said, I leave it as a challenge for you to look over the rest of the helper functions in the accompanying source code and figure out each one.
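
For reference, that attribute-selection loop (the choose_attribute helper called from create_decision_tree) might look roughly like the following; this is my own sketch based on the description above, not the version shipped in d_tree.py:

def choose_attribute(data, attributes, target_attr, fitness_func):
    # Score every candidate attribute (except the target) with the supplied
    # fitness function and return the one with the highest information gain.
    best_attr = None
    best_score = 0.0
    for attr in attributes:
        if attr == target_attr:
            continue
        score = fitness_func(data, attr, target_attr)
        if best_attr is None or score > best_score:
            best_attr, best_score = attr, score
    return best_attr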

Datamining with Decision Trees

Aside from classifying data, decision trees are also useful for discerning patterns in data, commonly referred to as datamining. Just by glancing over the decision tree figure at the beginning of this article, you can quickly pull out a few significant trends in the data you've collected.

The most obvious trend in the data is that young adults (ages 18 to 35) seem not to buy your widget at all, regardless of their other attributes. This could lead you to drop that group from your marketing campaign in an effort to save money, and to focus your marketing dollars on the age groups that do tend to purchase the widget. On the other hand, it could also lead you to change your marketing strategy altogether to try to convince those reticent customers to buy your product, in the hopes of opening up new sources of revenue.

By looking a little closer at the decision tree, you'll also notice that with youths, income is the largest determinant in their decision to buy your widget. As youths from low-income households tend to buy the widget and those from high-income families tend not to, it may be possible that your product is not trendy enough for higher-income kids. You may want to push the widget into higher-priced, trendier retail stores in an effort to popularize it with higher-income kids and capture a portion of that market. With middle-aged consumers, marital status seems to be the discriminating factor. Because single consumers are more likely to buy your product, perhaps it would be a good idea to stress its utility to both sexes and maybe even point out its usefulness in a family setting, if adding more consumers from the married crowd is important to you.

The main point here is that, even with only 20 records in the data set, none of these patterns is easy to find with the naked eye. With hundreds, thousands, or even millions of records in a data set, spotting these trends becomes nearly impossible. By using the decision tree algorithm, you can not only predict the likelihood of a person buying your product, you can also spot significant patterns in your collected test data that can help you to better mold your marketing practices, and thus infuse your revenue stream with plenty of new customers.

Conclusion

As I stated earlier, the rest of the code is basically just helper functions for the decision tree algorithm. I am hoping they will be fairly self-explanatory. Download the decision tree source code to see the rest of the functions that help create the decision tree.

The tarball contains three separate source code files. If you want to try out the algorithm and see how it works, just uncompress the source and run the test.py file. All it does is create a set of test data (the data you saw earlier in this article) that it uses to create a decision tree. Then it creates another set of sample data whose records it classifies using the decision tree it created with the test data in the first step.

The other two source files are the code for the decision tree algorithm and the ID3 heuristic. The first file, d_tree.py, contains the create_decision_tree function and all the helper functions associated with it. The second file contains all the code that implements the ID3 heuristic and is called, appropriately enough, id3.py. The reason for this division is that the decision tree learning algorithm is a well-established algorithm with little need for change. However, there exist many heuristics that can be used for choosing the "next best" attribute, and by placing this code in its own file, you are able to try out other heuristics by just adding another file and including it in place of id3.py in the file that makes the call to the create_decision_tree function. (In this case, that file is test.py.)
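
To illustrate that wiring, here is a hypothetical few lines in the spirit of test.py, using the same record-dictionary format shown earlier. The module and function names come from the article, but the attribute spellings and the call shape are my assumptions, so the real script may differ:

from d_tree import create_decision_tree
from id3 import gain

# Two survey records, keyed by attribute name (spellings assumed).
training_data = [
    {"Age": "36-55", "Education": "master's", "Income": "high",
     "Marital Status": "single", "Purchase?": "will buy"},
    {"Age": "18-35", "Education": "high school", "Income": "low",
     "Marital Status": "single", "Purchase?": "won't buy"},
]
attributes = ["Age", "Education", "Income", "Marital Status", "Purchase?"]

# Pass the ID3 gain function as the fitness function; swapping in a different
# heuristic just means importing another scoring function and passing it here.
tree = create_decision_tree(training_data, attributes, "Purchase?", gain)
print(tree)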

I've had fun running through this first foray into the world of artificial intelligence with you. I hope you've enjoyed this tutorial and had plenty of success in getting your decision tree up and running. If so, and you find yourself thirsting for more AI-related topics, such as genetic algorithms, neural networks, and swarm intelligence, then keep your eyes peeled for my next installment in this series on Python AI programming.

Until next time ... I wish you all the best in your programming endeavors.

Christopher Roach recently graduated with a master's in computer science and currently works in Florida as a software engineer at a government communications corporation.

Related Reading

Python Pocket Reference, by Mark Lutz


Copyright © 2009 O'Reilly Media, Inc.
