Learning (CSI 4106, Winter 2005)

Learning decision trees

A concept can be represented as a decision tree, built from examples, as in this problem of estimating credit risk by considering four features of a potential applicant. Such data can be derived from the history of credit applications.


Learning decision trees (2)

At every level, one feature value is selected.


Learning decision trees (3)

Usually many different decision trees are possible for the same data, with varying average cost of classification. Not every feature has to appear in a tree.


Learning decision trees (4)

The ID3 algorithm (its industrial-strength successor is called C5.0)

• If all examples are in the same class, build a leaf with this class. (If, for example, we have no historical data that record low or moderate risk, we can only learn that everything is high-risk.)

• Otherwise, if no more features can be used, build a leaf with a disjunction of the classes of the examples. (We might have data that only allow us to distinguish low risk from high and moderate risk.)

• Otherwise, select a feature for the root; partition the examples on the values of this feature; recursively build decision trees for all partitions; attach them to the root.

(This is a greedy algorithm: a form of hill climbing.)
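The recursion above can be sketched directly in Python. This is only an illustration, not code from the course: the example format (a dict of feature values plus a "class" key) and the select_feature parameter are assumptions of ours, with select_feature standing for the information-gain heuristic developed on the next slides.

def id3(examples, features, select_feature):
    # Case 1: all examples are in the same class -- build a leaf with that class.
    classes = {e["class"] for e in examples}
    if len(classes) == 1:
        return ("leaf", classes)
    # Case 2: no features left -- build a leaf with a disjunction of the classes.
    if not features:
        return ("leaf", classes)
    # Case 3: greedily pick a feature, partition the examples on its values,
    # build the subtrees recursively, and attach them to the root node.
    best = select_feature(examples, features)
    subtrees = {}
    for value in {e[best] for e in examples}:
        partition = [ex for ex in examples if ex[best] == value]
        subtrees[value] = id3(partition, features - {best}, select_feature)
    return ("node", best, subtrees)

Here features is a set of feature names, so features - {best} removes the feature just used; a leaf built from several classes plays the role of the disjunction mentioned above.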


Learning decision trees (5)

Two partially constructed decision trees.


Learning decision trees (6)

We saw that the same data can be turned into different trees. The question is which trees are better.

Essentially, the choice of the feature for the root is important. We want to select a feature that gives the most information.

Information in a set of disjoint classes

C = {c1, ..., cn}

is defined by this formula:

I(C) = Σ -p(ci) log2 p(ci)

p(ci) is the probability that an example is in class ci.

The information is measured in bits.
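Transcribed into Python, the formula is a one-line sum (a sketch; the function name is ours, and classes with zero probability are skipped because they contribute nothing):

from math import log2

def information(probabilities):
    # I(C) = sum over classes of -p(ci) * log2 p(ci), measured in bits.
    return sum(-p * log2(p) for p in probabilities if p > 0)

For two equally likely classes, information([0.5, 0.5]) gives 1.0 bit.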


Learning decision trees (7)

Let us consider our credit risk data. There are 14 examples in three risk classes: 6 examples have high risk, 3 have moderate risk, and 5 have low risk. Assuming that each example is equally likely, the class probabilities are as follows:

p(high) = 6/14, p(moderate) = 3/14, p(low) = 5/14

Information contained in this partition:

I(RISK) = - (6/14) log2 (6/14) - (3/14) log2 (3/14) - (5/14) log2 (5/14) ≈ 1.531 bits
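A quick numerical check in Python (a sketch; only the class counts from this slide are used):

from math import log2

counts = {"high": 6, "moderate": 3, "low": 5}
total = sum(counts.values())                                   # 14 examples
i_risk = sum(-(n / total) * log2(n / total) for n in counts.values())
print(round(i_risk, 3))                                        # 1.531 bits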


Learning decision trees (8)

Let feature F be at the root, and let e1, ..., em be the partitions of the examples on this feature.

Information needed to build a tree for partition ei is I(ei).

Expected information needed to build the whole tree is a weighted average of I(ei).

Let |s| be the cardinality of set s, and let N = |e1| + ... + |em| be the total number of examples.

Expected information is defined by this formula:

E(F) = Σ (|ei| / N) * I(ei)
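A sketch of this weighted average in Python, with each partition given as a list of class labels (the representation and names are ours):

from collections import Counter
from math import log2

def information(probabilities):
    return sum(-p * log2(p) for p in probabilities if p > 0)

def expected_information(partitions):
    # partitions: one list of class labels per value of the feature F.
    total = sum(len(part) for part in partitions)
    expected = 0.0
    for part in partitions:
        probs = [n / len(part) for n in Counter(part).values()]
        expected += (len(part) / total) * information(probs)
    return expected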


Learning decision trees (9)

In our data, there are three partitions based on income:

e1 = {1, 4, 7, 11}, |e1| = 4, I(e1) = 0.0
All four examples have high risk, so I(e1) = -1 * log2 1 = 0.

e2 = {2, 3, 12, 14}, |e2| = 4, I(e2) = 1.0
Two examples have high risk, two have moderate:
I(e2) = - 1/2 log2 1/2 - 1/2 log2 1/2 = 1.

e3 = {5, 6, 8, 9, 10, 13}, |e3| = 6, I(e3) ≈ 0.65
One example is in one risk class and the other five in another:
I(e3) = - 1/6 log2 1/6 - 5/6 log2 5/6 ≈ 0.65.

The expected information needed to complete the tree with income as the root feature is:

E(INCOME) = 4/14 * 0.0 + 4/14 * 1.0 + 6/14 * 0.65 ≈ 0.564 bits
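The same arithmetic in Python, as a quick check (a sketch; 0.65 is replaced by the exact value of I(e3)):

from math import log2

i_e1 = 0.0                                       # four examples, one class
i_e2 = 1.0                                       # four examples split 2/2
i_e3 = -(1/6) * log2(1/6) - (5/6) * log2(5/6)    # six examples split 1/5, about 0.65
e_income = (4/14) * i_e1 + (4/14) * i_e2 + (6/14) * i_e3
print(round(e_income, 3))                        # 0.564 bits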


Learning decision trees (10)

Now we define the information gain from selecting feature F for tree-building, given a set of classes C.

G(F) = I(C) - E(F)

For our sample data and for F = income, we get this:

G(INCOME) = I(RISK) - E(INCOME) ≈ 1.531 - 0.564 bits = 0.967 bits.

Our analysis will be complete, and our choice clear, after we have similarly considered the remaining three features. The values are as follows:

G(COLLATERAL) ≈ 0.756 bits
G(DEBT) ≈ 0.581 bits
G(CREDIT HISTORY) ≈ 0.266 bits

That is, we should choose INCOME as the criterion in the root of the best decision tree that we can construct.
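With the gains computed, choosing the root is a single comparison. A Python sketch, using the values quoted above:

# Information gain of each candidate root feature, in bits (from this slide).
gains = {
    "INCOME": 0.967,
    "COLLATERAL": 0.756,
    "DEBT": 0.581,
    "CREDIT HISTORY": 0.266,
}
root = max(gains, key=gains.get)
print(root)   # INCOME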


Explanation-based learning

A target concept
The learning system finds an “operational” definition of this concept, expressed in terms of some primitives. The target concept is represented as a predicate.

A training example
This is an instance of the target concept. It takes the form of a set of simple facts, not all of them necessarily relevant to the theory.

A domain theory
This is a set of rules, usually in predicate logic, that can explain how the training example fits the target concept.

Operationality criteria
These are the predicates (features) that should appear in an effective definition of the target concept.


Explanation-based learning (2)

A classic example: a theory and an instance of a cup. A cup is a container for liquids that can be easily lifted. It has some typical parts, such as a handle and a bowl. Bowls, the actual containers, must be concave. Because a cup can be lifted, it should be light. And so on.

The target concept is cup(X).

The domain theory has five rules.

liftable( X ) ∧ holds_liquid( X ) → cup( X )

part( Z, W ) ∧ concave( W ) ∧ points_up( W ) → holds_liquid( Z )

light( X ) ∧ part( X, handle ) → liftable( X )

small( A ) → light( A )

made_of( A, feathers ) → light( A )
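For concreteness, the same five rules could be written down as Horn clauses in Python (the representation below is our own sketch, not part of the slides; a real EBL system would run unification and backward chaining over such clauses):

# Each rule is a pair (head, body): the head holds whenever every
# literal in the body holds. Upper-case strings play the role of variables.
domain_theory = [
    (("cup", "X"),          [("liftable", "X"), ("holds_liquid", "X")]),
    (("holds_liquid", "Z"), [("part", "Z", "W"), ("concave", "W"), ("points_up", "W")]),
    (("liftable", "X"),     [("light", "X"), ("part", "X", "handle")]),
    (("light", "A"),        [("small", "A")]),
    (("light", "A"),        [("made_of", "A", "feathers")]),
]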


Explanation-based learning (3)

The training example lists nine facts (some of them are not relevant):

cup( obj1 )            small( obj1 )
part( obj1, handle )   owns( bob, obj1 )
part( obj1, bottom )   part( obj1, bowl )
points_up( bowl )      concave( bowl )
color( obj1, red )

Operationality criteria require a definition in terms of structural properties of objects (part, points_up, small, concave).


Explanation-based learning (4)

Step 1: prove the target concept using the training example


Explanation-based learning (5)

Step 2: generalize the proof. Constants from the domain theory, for example handle, are not generalized.


Explanation-based learning (6)

Step 3: take the definition “off the tree”: keep only the root and the leaves.

In our example, we get this rule:

small( X ) ∧ part( X, handle ) ∧ part( X, W ) ∧ concave( W ) ∧ points_up( W ) → cup( X )
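As a sanity check, the learned operational rule can be applied directly to the nine training facts. The Python sketch below is our own illustration (facts as tuples, a hand-coded test for the rule), not part of the slides:

# The nine facts of the training example, as (predicate, arguments...) tuples.
facts = {
    ("cup", "obj1"), ("small", "obj1"),
    ("part", "obj1", "handle"), ("owns", "bob", "obj1"),
    ("part", "obj1", "bottom"), ("part", "obj1", "bowl"),
    ("points_up", "bowl"), ("concave", "bowl"),
    ("color", "obj1", "red"),
}

def is_cup(x, facts):
    # Learned rule: small(X) and part(X, handle) and part(X, W)
    # and concave(W) and points_up(W) imply cup(X).
    if ("small", x) not in facts or ("part", x, "handle") not in facts:
        return False
    parts_of_x = [f[2] for f in facts if len(f) == 3 and f[0] == "part" and f[1] == x]
    return any(("concave", w) in facts and ("points_up", w) in facts
               for w in parts_of_x)

print(is_cup("obj1", facts))   # True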