Learning (CSI 4106, Winter 2005)

Learning decision trees

A concept can be represented as a decision tree, built from examples, as in this problem of estimating credit risk by considering four features of a potential applicant. Such data can be derived from the history of credit applications.


Learning decision trees (2)

At every level, one feature value is selected.


Learning decision trees (3)

Usually many different decision trees are possible for the same data, with varying average cost of classification. Not every feature has to appear in a tree.


Learning decision trees (4)

The ID3 algorithm (its industrial-strength successor is called C5.0)

• If all examples are in the same class, build a leaf with this class. (If, for example, we have no historical data that record low or moderate risk, we can only learn that everything is high-risk.)

• Otherwise, if no more features can be used, build a leaf with a disjunction of the classes of the examples. (We might have data that only allow us to distinguish low risk from high and moderate risk.)

• Otherwise, select a feature for the root; partition the examples on the values of this feature; recursively build decision trees for all partitions; attach them to the root.

(This is a greedy algorithm: a form of hill climbing.)
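The recursion above can be sketched directly in Python. This is only an illustration, not code from the course: the example format (a dict of feature values plus a "class" key) and the select_feature parameter are assumptions of ours, with select_feature standing for the information-gain heuristic developed on the next slides.

def id3(examples, features, select_feature):
    # Case 1: all examples are in the same class -- build a leaf with that class.
    classes = {e["class"] for e in examples}
    if len(classes) == 1:
        return ("leaf", classes)
    # Case 2: no features left -- build a leaf with a disjunction of the classes.
    if not features:
        return ("leaf", classes)
    # Case 3: greedily pick a feature, partition the examples on its values,
    # build the subtrees recursively, and attach them to the root node.
    best = select_feature(examples, features)
    subtrees = {}
    for value in {e[best] for e in examples}:
        partition = [ex for ex in examples if ex[best] == value]
        subtrees[value] = id3(partition, features - {best}, select_feature)
    return ("node", best, subtrees)

Here features is a set of feature names, so features - {best} removes the feature just used; a leaf built from several classes plays the role of the disjunction mentioned above.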


Learning decision trees (5)

Two partially constructed decision trees.


Learning decision trees (6)

We saw that the same data can be turned into different trees. The question is which trees are better.

Essentially, the choice of the feature for the root is important. We want to select a feature that gives the most information.

Information in a set of disjoint classes

C = {c1, ..., cn}

is defined by this formula:

I(C) = Σ -p(ci) log2 p(ci)

p(ci) is the probability that an example is in class ci.

The information is measured in bits.
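Transcribed into Python, the formula is a one-line sum (a sketch; the function name is ours, and classes with zero probability are skipped because they contribute nothing):

from math import log2

def information(probabilities):
    # I(C) = sum over classes of -p(ci) * log2 p(ci), measured in bits.
    return sum(-p * log2(p) for p in probabilities if p > 0)

For two equally likely classes, information([0.5, 0.5]) gives 1.0 bit.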


Learning decision trees (7)

Let us consider our credit risk data. There are 14 examples in three risk classes: 6 examples have high risk, 3 have moderate risk, and 5 have low risk. Assuming that each example is equally likely, the class probabilities are as follows:

p(high) = 6/14, p(moderate) = 3/14, p(low) = 5/14

Information contained in this partition:

I(RISK) = - (6/14) log2 (6/14) - (3/14) log2 (3/14) - (5/14) log2 (5/14) ≈ 1.531 bits
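A quick numerical check in Python (a sketch; only the class counts from this slide are used):

from math import log2

counts = {"high": 6, "moderate": 3, "low": 5}
total = sum(counts.values())                                   # 14 examples
i_risk = sum(-(n / total) * log2(n / total) for n in counts.values())
print(round(i_risk, 3))                                        # 1.531 bits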


Learning decision trees (8)

Let feature F be at the root, and let e1, ..., em be the partitions of the examples on this feature.

Information needed to build a tree for partition ei is I(ei).

Expected information needed to build the whole tree is a weighted average of I(ei).

Let |s| be the cardinality of set s, and let N = |e1| + ... + |em| be the total number of examples.

Expected information is defined by this formula:

E(F) = Σ (|ei| / N) * I(ei)
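A sketch of this weighted average in Python, with each partition given as a list of class labels (the representation and names are ours):

from collections import Counter
from math import log2

def information(probabilities):
    return sum(-p * log2(p) for p in probabilities if p > 0)

def expected_information(partitions):
    # partitions: one list of class labels per value of the feature F.
    total = sum(len(part) for part in partitions)
    expected = 0.0
    for part in partitions:
        probs = [n / len(part) for n in Counter(part).values()]
        expected += (len(part) / total) * information(probs)
    return expected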


Learning decision trees (9)

In our data, there are three partitions based on income:

e1 = {1, 4, 7, 11}, |e1| = 4, I(e1) = 0.0
All four examples have high risk, so I(e1) = -1 * log2 1 = 0.

e2 = {2, 3, 12, 14}, |e2| = 4, I(e2) = 1.0
Two examples have high risk, two have moderate:
I(e2) = - 1/2 log2 1/2 - 1/2 log2 1/2 = 1.

e3 = {5, 6, 8, 9, 10, 13}, |e3| = 6, I(e3) ≈ 0.65
One example is in one risk class and the other five in another:
I(e3) = - 1/6 log2 1/6 - 5/6 log2 5/6 ≈ 0.65.

The expected information needed to complete the tree with income as the root feature is:

E(INCOME) = 4/14 * 0.0 + 4/14 * 1.0 + 6/14 * 0.65 ≈ 0.564 bits
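The same arithmetic in Python, as a quick check (a sketch; 0.65 is replaced by the exact value of I(e3)):

from math import log2

i_e1 = 0.0                                       # four examples, one class
i_e2 = 1.0                                       # four examples split 2/2
i_e3 = -(1/6) * log2(1/6) - (5/6) * log2(5/6)    # six examples split 1/5, about 0.65
e_income = (4/14) * i_e1 + (4/14) * i_e2 + (6/14) * i_e3
print(round(e_income, 3))                        # 0.564 bits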


Learning decision trees (10)

Now we define the information gain from selecting feature F for tree-building, given a set of classes C.

G(F) = I(C) - E(F)

For our sample data and for F = income, we get this:

G(INCOME) = I(RISK) - E(INCOME) ≈ 1.531 - 0.564 bits = 0.967 bits.

Our analysis will be complete, and our choice clear, after we have similarly considered the remaining three features. The values are as follows:

G(COLLATERAL) ≈ 0.756 bits
G(DEBT) ≈ 0.581 bits
G(CREDIT HISTORY) ≈ 0.266 bits

That is, we should choose INCOME as the criterion in the root of the best decision tree that we can construct.
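With the gains computed, choosing the root is a single comparison. A Python sketch, using the values quoted above:

# Information gain of each candidate root feature, in bits (from this slide).
gains = {
    "INCOME": 0.967,
    "COLLATERAL": 0.756,
    "DEBT": 0.581,
    "CREDIT HISTORY": 0.266,
}
root = max(gains, key=gains.get)
print(root)   # INCOME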


Explanation-based learning

A target concept
The learning system finds an “operational” definition of this concept, expressed in terms of some primitives. The target concept is represented as a predicate.

A training example
This is an instance of the target concept. It takes the form of a set of simple facts, not all of them necessarily relevant to the theory.

A domain theory
This is a set of rules, usually in predicate logic, that can explain how the training example fits the target concept.

Operationality criteria
These are the predicates (features) that should appear in an effective definition of the target concept.


Explanation-based learning (2)

A classic example: a theory and an instance of a cup. A cup is a container for liquids that can be easily lifted. It has some typical parts, such as a handle and a bowl. Bowls, the actual containers, must be concave. Because a cup can be lifted, it should be light. And so on.

The target concept is cup(X).

The domain theory has five rules.

liftable( X ) ∧ holds_liquid( X ) → cup( X )

part( Z, W ) ∧ concave( W ) ∧ points_up( W ) → holds_liquid( Z )

light( X ) ∧ part( X, handle ) → liftable( X )

small( A ) → light( A )

made_of( A, feathers ) → light( A )
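For concreteness, the same five rules could be written down as Horn clauses in Python (the representation below is our own sketch, not part of the slides; a real EBL system would run unification and backward chaining over such clauses):

# Each rule is a pair (head, body): the head holds whenever every
# literal in the body holds. Upper-case strings play the role of variables.
domain_theory = [
    (("cup", "X"),          [("liftable", "X"), ("holds_liquid", "X")]),
    (("holds_liquid", "Z"), [("part", "Z", "W"), ("concave", "W"), ("points_up", "W")]),
    (("liftable", "X"),     [("light", "X"), ("part", "X", "handle")]),
    (("light", "A"),        [("small", "A")]),
    (("light", "A"),        [("made_of", "A", "feathers")]),
]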


Explanation-based learning (3)

The training example lists nine facts (some of them are not relevant):

cup( obj1 )            small( obj1 )
part( obj1, handle )   owns( bob, obj1 )
part( obj1, bottom )   part( obj1, bowl )
points_up( bowl )      concave( bowl )
color( obj1, red )

Operationality criteria require a definition in terms of structural properties of objects (part, points_up, small, concave).


Explanation-based learning (4)

Step 1: prove the target concept using the training example


Explanation-based learning (5)

Step 2: generalize the proof. Constants from the domain theory, for example handle, are not generalized.


Explanation-based learning (6)

Step 3: take the definition “off the tree”: keep only the root and the leaves.

In our example, we get this rule:

small( X ) ∧ part( X, handle ) ∧ part( X, W ) ∧ concave( W ) ∧ points_up( W ) → cup( X )
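As a sanity check, the learned operational rule can be applied directly to the nine training facts. The Python sketch below is our own illustration (facts as tuples, a hand-coded test for the rule), not part of the slides:

# The nine facts of the training example, as (predicate, arguments...) tuples.
facts = {
    ("cup", "obj1"), ("small", "obj1"),
    ("part", "obj1", "handle"), ("owns", "bob", "obj1"),
    ("part", "obj1", "bottom"), ("part", "obj1", "bowl"),
    ("points_up", "bowl"), ("concave", "bowl"),
    ("color", "obj1", "red"),
}

def is_cup(x, facts):
    # Learned rule: small(X) and part(X, handle) and part(X, W)
    # and concave(W) and points_up(W) imply cup(X).
    if ("small", x) not in facts or ("part", x, "handle") not in facts:
        return False
    parts_of_x = [f[2] for f in facts if len(f) == 3 and f[0] == "part" and f[1] == x]
    return any(("concave", w) in facts and ("points_up", w) in facts
               for w in parts_of_x)

print(is_cup("obj1", facts))   # True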