it/cs 811 principles of machine learning and inference

2002, G. Tecuci, Learning Agents Laboratory

Learning Agents Laboratory

Computer Science Department

George Mason University

Prof. Gheorghe Tecuci

Inductive Learning from Examples: Decision tree learning


Overview

The decision tree learning problem

The basic ID3 learning algorithm

Discussion and refinement of the ID3 method

Applicability of decision tree learning

Exercises

Recommended reading



The decision tree learning problem

Given
• language of instances: feature value vectors
• language of generalizations: decision trees
• a set of positive examples (E1, ..., En) of a concept
• a set of negative examples (C1, ..., Cm) of the same concept
• learning bias: preference for shorter decision trees

Determine
• a concept description in the form of a decision tree which is a generalization of the positive examples that does not cover any of the negative examples


Examples

Illustration

height  hair   eyes   class
short   blond  blue   +
tall    blond  brown  -
tall    red    blue   +
short   dark   blue   -
tall    dark   blue   -
tall    blond  blue   +
tall    dark   brown  -
short   blond  brown  -

Feature vector representation of examples. That is, there is a fixed set of attributes, each attribute taking values from a specified set.
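One possible encoding of this representation is sketched below. The names ATTRIBUTES, EXAMPLES and is_well_formed are hypothetical, chosen only for illustration; the values come from the table above.

```python
# Fixed set of attributes, each with its specified set of values.
ATTRIBUTES = {
    "height": {"short", "tall"},
    "hair": {"blond", "red", "dark"},
    "eyes": {"blue", "brown"},
}

# Each example pairs a feature vector with its class ("+" or "-").
ROWS = [
    ("short", "blond", "blue", "+"), ("tall", "blond", "brown", "-"),
    ("tall", "red", "blue", "+"), ("short", "dark", "blue", "-"),
    ("tall", "dark", "blue", "-"), ("tall", "blond", "blue", "+"),
    ("tall", "dark", "brown", "-"), ("short", "blond", "brown", "-"),
]
EXAMPLES = [({"height": h, "hair": c, "eyes": e}, label)
            for h, c, e, label in ROWS]


def is_well_formed(instance):
    """True if every attribute is present and takes one of its allowed values."""
    return (instance.keys() == ATTRIBUTES.keys()
            and all(instance[a] in ATTRIBUTES[a] for a in ATTRIBUTES))


print(all(is_well_formed(x) for x, _ in EXAMPLES))  # True
```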

Decision tree concept

hair
  dark: -
  red: +
  blond: eyes
    blue: +
    brown: -

Classifying (short, blond, blue): hair = blond, then eyes = blue, so (short, blond, blue) is +.
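As a rough sketch, classification with such a tree can be written as a walk from the root to a leaf. The nested-tuple encoding and the names TREE and classify are assumptions made for this illustration, not part of the original material.

```python
# A decision node is (attribute, {value: subtree}); a leaf is "+" or "-".
TREE = ("hair", {
    "dark": "-",
    "red": "+",
    "blond": ("eyes", {"blue": "+", "brown": "-"}),
})


def classify(tree, instance):
    """Follow the branch matching the instance's value until a leaf is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree


print(classify(TREE, {"height": "short", "hair": "blond", "eyes": "blue"}))  # +
```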


What is the logical expression represented by the decision tree?

Decision tree concept:

hair
  dark: -
  red: +
  blond: eyes
    blue: +
    brown: -

Disjunction of conjunctions (one conjunction per path to a + node):

(hair = red) ∨ [(hair = blond) & (eyes = blue)]
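The logical expression can be checked against the eight examples of the illustration. This is a small sketch; the function name concept and the TABLE list are hypothetical, and the rows repeat the (hair, eyes, class) values of the example table.

```python
def concept(hair, eyes):
    """(hair = red) OR [(hair = blond) AND (eyes = blue)]"""
    return hair == "red" or (hair == "blond" and eyes == "blue")


# (hair, eyes, class) for the eight examples of the illustration
TABLE = [
    ("blond", "blue", "+"), ("blond", "brown", "-"), ("red", "blue", "+"),
    ("dark", "blue", "-"), ("dark", "blue", "-"), ("blond", "blue", "+"),
    ("dark", "brown", "-"), ("blond", "brown", "-"),
]

# The expression is true exactly for the positive examples.
agrees = all(concept(hair, eyes) == (label == "+") for hair, eyes, label in TABLE)
print(agrees)  # True
```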


Feature-value representation

Is the feature value representation powerful enough?

If the training set (i.e., the set of positive and negative examples from which the tree is learned) contains a positive example and a negative example that have identical values for each attribute, it is impossible to differentiate between the instances with reference only to the given attributes.

In such a case the attributes are inadequate for the training set and for the induction task.
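A minimal sketch of how inadequate attributes could be detected: look for a feature vector that occurs with both classes. The function name conflicting_examples and the sample data are hypothetical.

```python
def conflicting_examples(examples):
    """Return the feature vectors that appear with more than one class."""
    labels_seen = {}
    for features, label in examples:
        key = tuple(sorted(features.items()))
        labels_seen.setdefault(key, set()).add(label)
    return [dict(key) for key, labels in labels_seen.items() if len(labels) > 1]


adequate = [
    ({"hair": "red", "eyes": "blue"}, "+"),
    ({"hair": "dark", "eyes": "blue"}, "-"),
]
# Same attribute values as the first example, but the opposite class.
inadequate = adequate + [({"hair": "red", "eyes": "blue"}, "-")]

print(conflicting_examples(adequate))    # []
print(conflicting_examples(inadequate))  # [{'eyes': 'blue', 'hair': 'red'}]
```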


Feature-value representation (cont.)

When could a decision tree be built?

If the attributes are adequate, it is always possible to construct a decision tree that correctly classifies each instance in the training set.

So what is the difficulty in learning a decision tree?

The problem is that there are many such correct decision trees, and the task of induction is to construct a decision tree that correctly classifies not only instances from the training set but other (unseen) instances as well.



The basic ID3 learning algorithm

• Let C be the set of training examples
• If all the examples in C are positive then create a node with label +
• If all the examples in C are negative then create a node with label -
• If there is no attribute left then create a node with the same label as the majority of examples in C
• Otherwise:
  - select the best attribute A and create a decision node, where v1, v2, ..., vk are the values of A
  - partition the examples into subsets C1, C2, ..., Ck according to the values of A
  - apply the algorithm recursively to each of the sets Ci which is not empty
  - for each Ci which is empty create a node with the same label as the majority of examples in C

[Figure: decision node A with one branch for each value v1, v2, ..., vk]
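The steps above can be sketched in Python as follows. This is an illustrative sketch, not a reference implementation: the tree encoding, the function names, and the choice of "best attribute" as the one leaving the least residual information are assumptions made here.

```python
from collections import Counter
from math import log2


def entropy(examples):
    """Average information (in bits) needed to classify an example."""
    counts = Counter(label for _, label in examples)
    return -sum(c / len(examples) * log2(c / len(examples))
                for c in counts.values())


def residual(examples, attribute):
    """Expected information still needed after testing `attribute`."""
    total = 0.0
    for value in {x[attribute] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attribute] == value]
        total += len(subset) / len(examples) * entropy(subset)
    return total


def majority(examples):
    """Most frequent label; ties go to the label seen first."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def id3(examples, attributes):
    """Return a leaf label or a decision node (attribute, {value: subtree}).

    `attributes` maps each unused attribute to the list of its values.
    """
    labels = {label for _, label in examples}
    if len(labels) == 1:          # all positive or all negative
        return labels.pop()
    if not attributes:            # no attribute left
        return majority(examples)
    # Assumed selection criterion: least residual information.
    best = min(attributes, key=lambda a: residual(examples, a))
    remaining = {a: vs for a, vs in attributes.items() if a != best}
    branches = {}
    for value in attributes[best]:
        subset = [(x, y) for x, y in examples if x[best] == value]
        # An empty subset becomes a leaf with the majority label of C.
        branches[value] = id3(subset, remaining) if subset else majority(examples)
    return best, branches


ATTRIBUTES = {"height": ["short", "tall"],
              "hair": ["blond", "red", "dark"],
              "eyes": ["blue", "brown"]}
ROWS = [("short", "blond", "blue", "+"), ("tall", "blond", "brown", "-"),
        ("tall", "red", "blue", "+"), ("short", "dark", "blue", "-"),
        ("tall", "dark", "blue", "-"), ("tall", "blond", "blue", "+"),
        ("tall", "dark", "brown", "-"), ("short", "blond", "brown", "-")]
EXAMPLES = [({"height": h, "hair": c, "eyes": e}, label)
            for h, c, e, label in ROWS]

tree = id3(EXAMPLES, ATTRIBUTES)
print(tree)
```

On the eight height/hair/eyes examples this sketch tests hair at the root and eyes under the blond branch.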


Features selection: information theory

Let us consider a set S containing objects from n classes S1, ..., Sn, so that the probability that an object belongs to class Si is pi.

According to information theory, the amount of information needed to identify the class of one particular member of S is:

$$I_i = -\log_2 p_i$$

Intuitively, Ii represents the number of yes/no questions required to identify the class Si of a given element in S.

The average amount of information needed to identify the class of an element in S is:

$$-\sum_{i=1}^{n} p_i \log_2 p_i$$
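A short numeric sketch of these two quantities. The function names are hypothetical, and treating a term with probability 0 as contributing 0 is the usual convention, assumed here.

```python
from math import log2


def information(p):
    """Information, in bits, needed to identify a class of probability p."""
    return -log2(p)


def average_information(probabilities):
    """-sum(p_i * log2(p_i)), with zero-probability terms counted as 0."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


print(information(0.25))                # 2.0
print(average_information([0.5, 0.5]))  # 1.0
print(average_information([0.25] * 4))  # 2.0
```

With four equally likely classes, two yes/no questions are enough to identify the class, which matches the value 2.0.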


Features selection: the best attribute

Let us suppose that the decision tree is to be built from a training set C consisting of p positive examples and n negative examples.

The average amount of information needed to classify an instance from C is

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

If attribute A with values {v1, v2,...,vk} is used for the root of the decision tree, it will partition C into {C1, C2,...,Ck}, where each Ci contains pi positive examples and ni negative examples.

The expected information required to classify an instance in Ci is I(pi, ni). The expected amount of information required to classify an instance after the value of the attribute A is known is therefore:

$$I_{res}(A) = \sum_{i=1}^{k} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

The information gained by branching on A is: gain(A) = I(p, n) - Ires(A)
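The three quantities I(p, n), Ires(A) and gain(A) can be computed directly from the counts. The sketch below uses hypothetical function names and describes a partition as a list of (pi, ni) pairs; the example counts are made up for illustration.

```python
from math import log2


def info(p, n):
    """I(p, n); a zero count contributes nothing to the sum."""
    total = p + n
    return -sum(c / total * log2(c / total) for c in (p, n) if c > 0)


def residual_info(partition):
    """Ires(A) for a partition given as a list of (p_i, n_i) pairs."""
    total = sum(p_i + n_i for p_i, n_i in partition)
    return sum((p_i + n_i) / total * info(p_i, n_i) for p_i, n_i in partition)


def gain(partition):
    """gain(A) = I(p, n) - Ires(A), with p and n summed over the partition."""
    p = sum(p_i for p_i, _ in partition)
    n = sum(n_i for _, n_i in partition)
    return info(p, n) - residual_info(partition)


print(gain([(2, 0), (0, 2)]))  # 1.0: the attribute separates the classes
print(gain([(1, 1), (1, 1)]))  # 0.0: the attribute tells us nothing
```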


Features selection: the heuristic

What would be a good heuristic?

Choose the attribute which leads to the greatest information gain.

The information gained by branching on A is:

gain(A) = I(p, n) - Ires(A)

Why is this a heuristic and not a guaranteed method?


Features selection: algorithm optimization

How could we optimize the algorithm for determining the best attribute?

Hint: The information gained by branching on A is: gain(A) = I(p, n) - Ires(A)

Since I(p, n) is constant for all attributes, maximizing the gain is equivalent to minimizing Ires(A), which in turn is equivalent to minimizing the following expression:

$$\sum_{i=1}^{k}\left(-p_i\log_2\frac{p_i}{p_i+n_i} - n_i\log_2\frac{n_i}{p_i+n_i}\right)$$

where pi is the number of positive examples in Ci and ni is the number of negative examples in Ci; if pi = 0 or ni = 0 then the corresponding term in the sum is 0.

ID3 examines all candidate attributes and chooses A to maximize gain(A) (or minimize Ires(A)), forms the tree as above, and then uses the same process recursively to form decision trees for the residual subsets C1, C2, ..., Ck.
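The equivalence can be checked by expanding Ires(A) with the definition of I(pi, ni). The derivation below uses only the symbols p, n, pi, ni and k defined earlier:

```latex
\begin{aligned}
I_{res}(A) &= \sum_{i=1}^{k} \frac{p_i+n_i}{p+n}\, I(p_i, n_i) \\
           &= \sum_{i=1}^{k} \frac{p_i+n_i}{p+n}
              \left(-\frac{p_i}{p_i+n_i}\log_2\frac{p_i}{p_i+n_i}
                    -\frac{n_i}{p_i+n_i}\log_2\frac{n_i}{p_i+n_i}\right) \\
           &= \frac{1}{p+n}\sum_{i=1}^{k}
              \left(-p_i\log_2\frac{p_i}{p_i+n_i}
                    -n_i\log_2\frac{n_i}{p_i+n_i}\right)
\end{aligned}
```

Because the factor 1/(p+n) is the same for every attribute, it can be dropped when attributes are compared, which saves one division per candidate attribute.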


Examples

Illustration of the method

height  hair   eyes   class
short   blond  blue   +
tall    blond  brown  -
tall    red    blue   +
short   dark   blue   -
tall    dark   blue   -
tall    blond  blue   +
tall    dark   brown  -
short   blond  brown  -

1. Find the attribute that maximizes the information gain:

$$gain(A) = I(p, n) - \sum_{i=1}^{k} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

I(3+, 5-) = -3/8 log2(3/8) - 5/8 log2(5/8) = 0.954434003

Height: short (1+, 2-), tall (2+, 3-)
Gain(height) = 0.954434003 - 3/8*I(1+, 2-) - 5/8*I(2+, 3-)
= 0.954434003 - 3/8(-1/3 log2(1/3) - 2/3 log2(2/3)) - 5/8(-2/5 log2(2/5) - 3/5 log2(3/5)) = 0.003228944

Hair: blond (2+, 2-), red (1+, 0-), dark (0+, 3-)
Gain(hair) = 0.954434003 - 4/8(-2/4 log2(2/4) - 2/4 log2(2/4)) - 1/8(-1/1 log2(1/1) - 0) - 3/8(0 - 3/3 log2(3/3))
= 0.954434003 - 0.5 = 0.454434003

Eyes: blue (3+, 2-), brown (0+, 3-)
Gain(eyes) = 0.954434003 - 5/8(-3/5 log2(3/5) - 2/5 log2(2/5)) - 3/8(0)
= 0.954434003 - 0.606844122 = 0.347589881

“Hair” is the best attribute.
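The three gains can be recomputed from the table with a few lines of Python. This is only a checking sketch; ROWS, COLUMNS, info and gain are names introduced here.

```python
from math import log2

# (height, hair, eyes, class) rows of the example table
ROWS = [
    ("short", "blond", "blue", "+"), ("tall", "blond", "brown", "-"),
    ("tall", "red", "blue", "+"), ("short", "dark", "blue", "-"),
    ("tall", "dark", "blue", "-"), ("tall", "blond", "blue", "+"),
    ("tall", "dark", "brown", "-"), ("short", "blond", "brown", "-"),
]
COLUMNS = {"height": 0, "hair": 1, "eyes": 2}


def info(rows):
    """I(p, n) for the positive and negative counts found in `rows`."""
    total = len(rows)
    positives = sum(1 for r in rows if r[3] == "+")
    return -sum(c / total * log2(c / total)
                for c in (positives, total - positives) if c > 0)


def gain(rows, attribute):
    """Information gained by branching on `attribute`."""
    col = COLUMNS[attribute]
    residual = 0.0
    for value in {r[col] for r in rows}:
        subset = [r for r in rows if r[col] == value]
        residual += len(subset) / len(rows) * info(subset)
    return info(rows) - residual


for attribute in COLUMNS:
    print(f"{attribute}: {gain(ROWS, attribute):.4f}")
# height: 0.0032
# hair: 0.4544
# eyes: 0.3476
```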


Examples

Illustration of the method (cont.)

2. “Hair” is the best attribute. Build the tree using it.

hair
  dark:
    short, dark, blue: -
    tall, dark, blue: -
    tall, dark, brown: -
  red:
    tall, red, blue: +
  blond:
    short, blond, blue: +
    tall, blond, brown: -
    tall, blond, blue: +
    short, blond, brown: -



3. Select the best attribute for the set of examples:

short, blond, blue: +
tall, blond, brown: -
tall, blond, blue: +
short, blond, brown: -

I(2+, 2-) = -2/4 log2(2/4) - 2/4 log2(2/4) = -log2(1/2) = 1

Height: short (1+, 1-) tall(1+, 1-)

Eyes: blue (2+, 0-) brown(0+, 2-)

Gain(height) = 1 – 2/4*I(1+,1-) – 2/4*I(1+,1-) = 1 - I(1+,1-) = 1-1 = 0

Gain(eyes) = 1 – 2/4*I(2+,0-) – 2/4*I(0+,2-) = 1 – 0 – 0 = 1

“Eyes” is the best attribute.



4. “Eyes” is the best attribute. Expand the tree using it:

hair
  dark:
    short, dark, blue: -
    tall, dark, blue: -
    tall, dark, brown: -
  red:
    tall, red, blue: +
  blond: eyes
    blue:
      short, blond, blue: +
      tall, blond, blue: +
    brown:
      tall, blond, brown: -
      short, blond, brown: -



5. Build the decision tree:

hair
  dark: -
  red: +
  blond: eyes
    blue: +
    brown: -

What induction hypothesis is made?



How could we transform a tree into a set of rules?

hair
  dark: -
  red: +
  blond: eyes
    blue: +
    brown: -

Answer:

IF (hair = red) THEN positive example

IF [(hair = blond) & (eyes = blue)] THEN positive example
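The transformation amounts to collecting the tests along each path from the root to a leaf. A minimal sketch, assuming a nested-tuple encoding of the tree; TREE and tree_to_rules are hypothetical names.

```python
# A decision node is (attribute, {value: subtree}); a leaf is "+" or "-".
TREE = ("hair", {
    "dark": "-",
    "red": "+",
    "blond": ("eyes", {"blue": "+", "brown": "-"}),
})


def tree_to_rules(tree, conditions=()):
    """Yield (conditions, label) for every root-to-leaf path."""
    if isinstance(tree, str):
        yield list(conditions), tree
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))


for conditions, label in tree_to_rules(TREE):
    if label == "+":
        antecedent = " & ".join(f"({a} = {v})" for a, v in conditions)
        print(f"IF {antecedent} THEN positive example")
# IF (hair = red) THEN positive example
# IF (hair = blond) & (eyes = blue) THEN positive example
```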

Why should we make such a transformation?


Learning from noisy data

What errors could be found in an example (also called noise in data)?

• errors in the values of attributes (due to measurements or subjective judgments);

• errors in the classification of the instances (for instance a negative example that was considered a positive example).

What are the effects of noise?

How to change the ID3 algorithm to deal with noise?


How to deal with noise?

What are the effects of noise?

Noise may cause the attributes to become inadequate.

Noise may lead to decision trees of spurious complexity (overfitting).

How to change the ID3 algorithm to deal with noise?

The algorithm must be able to work with inadequate attributes, because noise can cause even the most comprehensive set of attributes to appear inadequate.

The algorithm must be able to decide that testing further attributes will not improve the predictive accuracy of the decision tree. For instance, it should refrain from increasing the complexity of the decision tree to accommodate a single noise-generated special case.


How to deal with an inadequate attribute set? (inadequacy due to noise)

A collection C of instances may contain representatives of both classes, yet further testing of C may be ruled out, either because the attributes are inadequate and unable to distinguish among the instances in C, or because each attribute has been judged to be irrelevant to the class of instances in C.

In this situation it is necessary to produce a leaf labeled with class information, but the instances in C are not all of the same class.

What class to assign a leaf node that contains both + and - examples?


What class to assign a leaf node that contains both + and - examples?

Approaches:

1. The notion of class could be generalized from a binary value (0 for negative examples and 1 for positive examples) to a number in the interval [0; 1]. In such a case, a class of 0.8 would be interpreted as 'belonging to class P with probability 0.8'.

2. Opt for the more numerous class, i.e. assign the leaf to class P if p>n, to class N if p<n, and to either if p=n.

The first approach minimizes the sum of the squares of the error over objects in C.

The second approach minimizes the sum of the absolute errors over objects in C. If the aim is to minimize expected error, the second approach might be anticipated to be superior.
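The two error claims can be illustrated numerically for one leaf. The counts p = 4 and n = 1 are made up, and the function names are hypothetical; positives are scored as 1 and negatives as 0, as in the first approach.

```python
def squared_error(value, p, n):
    """Sum of squared errors if the leaf outputs `value` for p positives and n negatives."""
    return p * (1 - value) ** 2 + n * (0 - value) ** 2


def absolute_error(value, p, n):
    """Sum of absolute errors for the same leaf output."""
    return p * abs(1 - value) + n * abs(0 - value)


p, n = 4, 1  # hypothetical leaf with 4 positive and 1 negative example
candidates = [i / 100 for i in range(101)]  # leaf outputs 0.00, 0.01, ..., 1.00

best_squared = min(candidates, key=lambda v: squared_error(v, p, n))
best_absolute = min(candidates, key=lambda v: absolute_error(v, p, n))
print(best_squared)   # 0.8, i.e. p / (p + n)
print(best_absolute)  # 1.0, i.e. the more numerous class
```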


How to avoid overfitting the data?

One says that a hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances.

How to avoid overfitting?

• Stop growing the tree before it overfits;
• Allow the tree to overfit and then prune it.

How to determine the correct size of the tree?

Use a testing set of examples, separate from the training set, to compare the likely errors of various trees.
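Comparing candidate trees on examples that were not used for training could look like the sketch below. Both trees and the held-out examples are made up for illustration, and the names classify and error_rate are assumptions.

```python
def classify(tree, instance):
    """Walk a tree of (attribute, {value: subtree}) nodes down to a leaf label."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree


def error_rate(tree, examples):
    """Fraction of the examples that the tree misclassifies."""
    wrong = sum(1 for x, y in examples if classify(tree, x) != y)
    return wrong / len(examples)


# Hypothetical candidates: a full tree and a pruned version of it.
full_tree = ("hair", {"dark": "-", "red": "+",
                      "blond": ("eyes", {"blue": "+", "brown": "-"})})
pruned_tree = ("hair", {"dark": "-", "red": "+", "blond": "+"})

# Hypothetical held-out examples, not used to build either tree.
held_out = [
    ({"hair": "blond", "eyes": "brown"}, "-"),
    ({"hair": "blond", "eyes": "blue"}, "+"),
    ({"hair": "dark", "eyes": "blue"}, "-"),
    ({"hair": "red", "eyes": "brown"}, "+"),
]

best = min([full_tree, pruned_tree], key=lambda t: error_rate(t, held_out))
print(error_rate(full_tree, held_out), error_rate(pruned_tree, held_out))  # 0.0 0.25
```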


Rule post pruning to avoid overfitting the data

Rule post pruning algorithm:

1. Infer a decision tree
2. Convert the tree into a set of rules
3. Prune (generalize) the rules by removing antecedents as long as this improves their accuracy
4. Sort the rules by their accuracy and use this order in classification

Compare tree pruning with rule post pruning.
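Steps 3 and 4 of rule post pruning can be sketched as follows. Measuring accuracy on a separate set of examples is an assumption of this sketch (other accuracy estimates are possible), and the rules and data are made up for illustration.

```python
def matches(conditions, instance):
    """True if the instance satisfies every (attribute, value) antecedent."""
    return all(instance[a] == v for a, v in conditions)


def accuracy(rule, examples):
    """Fraction of the examples covered by the rule that carry its label."""
    conditions, label = rule
    covered = [y for x, y in examples if matches(conditions, x)]
    return sum(1 for y in covered if y == label) / len(covered) if covered else 0.0


def prune_rule(rule, examples):
    """Greedily drop antecedents while doing so strictly improves accuracy."""
    conditions, label = list(rule[0]), rule[1]
    improved = True
    while improved and conditions:
        improved = False
        current = accuracy((conditions, label), examples)
        for c in conditions:
            shorter = [d for d in conditions if d != c]
            if accuracy((shorter, label), examples) > current:
                conditions, improved = shorter, True
                break
    return conditions, label


def post_prune(rules, examples):
    """Prune every rule, then sort the rules by decreasing accuracy."""
    pruned = [prune_rule(r, examples) for r in rules]
    return sorted(pruned, key=lambda r: accuracy(r, examples), reverse=True)


# Hypothetical examples, including one noisy case (the second one).
DATA = [
    ({"height": "tall", "hair": "blond", "eyes": "blue"}, "+"),
    ({"height": "tall", "hair": "blond", "eyes": "blue"}, "-"),
    ({"height": "short", "hair": "blond", "eyes": "blue"}, "+"),
    ({"height": "short", "hair": "blond", "eyes": "blue"}, "+"),
    ({"height": "tall", "hair": "dark", "eyes": "brown"}, "-"),
    ({"height": "short", "hair": "blond", "eyes": "brown"}, "-"),
]
RULES = [
    ([("hair", "blond"), ("eyes", "blue"), ("height", "tall")], "+"),
    ([("hair", "dark")], "-"),
]

for conditions, label in post_prune(RULES, DATA):
    print(conditions, label)
```

In this made-up data the height test is dropped from the first rule, because the shorter rule is right on 3 of the 4 examples it covers instead of 1 of 2.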


How to use continuous attributes?

Transform a continuous attribute into a discrete one.

Give an example of such a transformation.
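One common way to do this is to pick a threshold t and replace the continuous attribute by the Boolean test "value >= t". The sketch below tries the midpoints between neighboring values whose classes differ and keeps the one with the highest information gain; the temperature data and all names are hypothetical.

```python
from math import log2


def info(labels):
    """Average information needed to identify the class of one of the labels."""
    total = len(labels)
    return -sum(labels.count(c) / total * log2(labels.count(c) / total)
                for c in set(labels))


def best_threshold(values, labels):
    """Return (threshold, gain) for the best split of the form value >= threshold."""
    pairs = sorted(zip(values, labels))
    base = info([y for _, y in pairs])
    best = None
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or v1 == v2:
            continue  # only boundaries between different classes are candidates
        threshold = (v1 + v2) / 2
        below = [y for v, y in pairs if v < threshold]
        above = [y for v, y in pairs if v >= threshold]
        gain = base - (len(below) * info(below)
                       + len(above) * info(above)) / len(pairs)
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best


temperature = [40, 48, 60, 72, 80, 90]       # hypothetical continuous attribute
labels = ["-", "-", "+", "+", "+", "-"]
threshold, gain = best_threshold(temperature, labels)
print(threshold)  # 54.0

# The discrete attribute that replaces the continuous one:
discrete = [value >= threshold for value in temperature]
```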


How to deal with missing attribute values?

Estimate the value from the values of the other examples.

How?
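One simple possibility, sketched here as an assumption rather than as the method of the course, is to fill a missing value with the most common value of that attribute among the other examples of the same class. Missing values are encoded as None; fill_missing and the data are hypothetical.

```python
from collections import Counter


def fill_missing(examples, attribute):
    """Replace None values of `attribute` by the most common known value
    among the examples with the same class label.

    Assumes at least one example of that class has a known value.
    """
    filled = []
    for features, label in examples:
        if features[attribute] is None:
            known = [x[attribute] for x, y in examples
                     if y == label and x[attribute] is not None]
            estimate = Counter(known).most_common(1)[0][0]
            features = {**features, attribute: estimate}
        filled.append((features, label))
    return filled


examples = [
    ({"hair": "dark", "eyes": "blue"}, "-"),
    ({"hair": "dark", "eyes": "brown"}, "-"),
    ({"hair": "blond", "eyes": "brown"}, "-"),
    ({"hair": None, "eyes": "blue"}, "-"),   # hair value is missing
    ({"hair": "red", "eyes": "blue"}, "+"),
]
print(fill_missing(examples, "hair")[3])  # ({'hair': 'dark', 'eyes': 'blue'}, '-')
```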


Comparison with the candidate elimination algorithm

Generalization language
ID3 – disjunctions of conjunctions
CE – conjunctions

Use of examples
ID3 – all at the same time (can deal with noise and missing values)
CE – one at a time (can determine the most informative example)

Search strategy
ID3 – hill climbing (may not find the concept but only an approximation)
CE – exhaustive search

Bias
ID3 – preference bias (Occam’s razor)
CE – representation bias



What problems are appropriate for decision tree learning?

Problems for which:

Instances can be represented by attribute-value pairs

Disjunctive descriptions may be required to represent the learned concept

Training data may contain errors

Training data may contain missing attribute values


What practical applications could you envision?

Classify:

- Patients by their disease;

- Equipment malfunctions by their cause;

- Loan applicants by their likelihood to default on payments.


Which are the main features of decision tree learning?

May employ a large number of examples.

Discovers efficient classification trees that are theoretically justified.

Learns disjunctive concepts.

Is limited to attribute-value representations.

Has a non incremental nature (there are however also incremental versions that are less efficient).

The tree representation is not very understandable.

The method is limited to learning classification rules.

The method was successfully applied to complex real world problems.



Exercise

Build two different decision trees corresponding to the examples and counterexamples from the following table.

Indicate the concept represented by each decision tree.

food        medium      type      class      +/-  example
herbivore   land        harmless  mammal     +    deer (e1)
carnivore   land        harmful   mammal     -    lion (c1)
omnivorous  water       harmless  fish       +    goldfish (e2)
herbivore   amphibious  harmless  amphibian  -    frog (c2)
omnivorous  air         harmless  bird       -    parrot (c3)
carnivore   land        harmful   reptile    +    cobra (e3)
carnivore   land        harmless  reptile    -    lizard (c4)
omnivorous  land        moody     mammal     +    bear (e4)

Apply the ID3 algorithm to build the decision tree corresponding to the examples and counterexamples from the above table.


Exercise

Consider the following positive and negative examples of a concept

shape  size   class
ball   large  +      e1
brick  small  -      c1
cube   large  -      c2
ball   small  +      e2

and the following background knowledge

any-shape
  ball
  brick
  cube
  star

any-size
  large
  medium
  small

a) You will be required to learn this concept by applying two different learning methods, the Induction of Decision Trees method, and the Version Space (candidate elimination) method. Do you expect to learn the same concept with each method or different concepts? Explain in detail your prediction (you will need to consider various aspects like the instance space, the hypothesis space, and the method of learning).

b) Learn the concept represented by the above examples by applying:
- the Induction of Decision Trees method;
- the Version Space method.

c) Explain the results obtained in b) and compare them with your predictions.

d) What will be the results of learning with the above two methods if only the first three examples are available?


Exercise

Consider the following positive and negative examples of a concept

workstation  software        printer      class
maclc        macwrite        laserwriter  +      e1
sun          frame-maker     laserwriter  +      e2
hp           accounting      laserjet     -      c1
sgi          spreadsheet     laserwriter  -      c2
macII        microsoft-word  proprinter   +      e3

and the following background knowledge

[Figure: generalization hierarchies. Node labels: something; any-workstation, sun, hp, mac, sgi, vax, macplus, maclc, macII; op-system, unix, vms, mac-os; any-software, publishing-sw, page-maker, frame-maker, microsoft-word, mac-write, spreadsheet, accounting; any-printer, laserwriter, xerox, proprinter, laserjet, microlaser]

a) Build two decision trees corresponding to the above examples. Indicate the concept represented by each decision tree. In principle, how many different decision trees could you build?

b) Learn the concept represented by the above examples by applying the Version Space method. Which is the learned concept if only the first four examples are available?

c) Compare and justify the obtained results.


Exercise

True or false: If decision tree D2 is an elaboration of D1 (according to ID3), then D1 is more general than D2.


Recommended reading

Mitchell T.M., Machine Learning, Chapter 3: Decision tree learning, pp. 52-80, McGraw Hill, 1997.

Quinlan J.R., Induction of decision trees, in Machine Learning Journal, 1:81-106. Also in Shavlik J. and Dietterich T. (eds), Readings in Machine Learning, Morgan Kaufmann, 1990.

Barr A., Cohen P., and Feigenbaum E. (eds), The Handbook of Artificial Intelligence, vol. III, pp. 406-410, Morgan Kaufmann, 1982.

Elwyn Edwards, Information Transmission, Chapter 4: Uncertainty, pp. 28-39, Chapman and Hall, 1964.
