
Page 1

Decision Tree Learning
Mitchell, Chapter 3

CptS 570 Machine Learning
School of EECS
Washington State University

Page 2

Outline

- Decision tree representation
- ID3 learning algorithm
- Entropy and information gain
- Overfitting
- Enhancements

Page 3

Decision Tree for PlayTennis

Page 4

Decision Trees

Decision tree representation:
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification

How would we represent:
- ∧, ∨, XOR
- (A ∧ B) ∨ (C ∧ ¬D ∧ E)
- M of N

Page 5

When to Consider Decision Trees

- Instances describable by attribute-value pairs
- Target function is discrete valued
- Disjunctive hypothesis may be required
- Possibly noisy training data

Examples:
- Equipment or medical diagnosis
- Credit risk analysis
- Modeling calendar scheduling preferences

Page 6

Top-Down Induction of Decision Trees

Main loop (ID3, Table 3.1):
- A ← the “best” decision attribute for the next node
- Assign A as the decision attribute for node
- For each value of A, create a new descendant of node
- Sort training examples to the leaf nodes
- If the training examples are perfectly classified, then STOP; else iterate over the new leaf nodes

(a Python sketch of this loop follows below)
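A minimal sketch of this main loop, not taken from the slides: training examples are assumed to be dicts mapping attribute names to values, and `best_attribute` is a placeholder for the gain-based selection defined on the following slides.

```python
from collections import Counter

def id3(examples, target, attributes, best_attribute):
    """Top-down induction of a decision tree (ID3 main loop, sketch)."""
    labels = [ex[target] for ex in examples]
    # Training examples perfectly classified: stop with a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to test: fall back to the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # A <- the "best" decision attribute for this node.
    A = best_attribute(examples, target, attributes)
    node = {A: {}}
    rest = [a for a in attributes if a != A]
    # For each value of A, create a new descendant and sort examples to it.
    for v in set(ex[A] for ex in examples):
        subset = [ex for ex in examples if ex[A] == v]
        node[A][v] = id3(subset, target, rest, best_attribute)
    return node
```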

Page 7

Which Attribute is Best?

Page 8

Entropy

- S is a sample of training examples
- p⊕ is the proportion of positive examples in S
- p⊖ is the proportion of negative examples in S
- Entropy measures the impurity of S

Entropy(S) ≡ – p⊕ log2 p⊕ – p⊖ log2 p⊖

Page 9

Entropy

Entropy(S) = expected number of bits needed to encode the class (⊕ or ⊖) of a randomly drawn member of S (under the optimal, shortest-length code)

Why? Information theory:

Optimal length code assigns (– log2 p) bits to message having probability p

So, the expected number of bits to encode ⊕ or ⊖ of a random member of S is:

p⊕ (– log2 p⊕) + p⊖ (– log2 p⊖)

Entropy(S) ≡ – p⊕ log2 p⊕ – p⊖ log2 p⊖
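A small sketch of this definition in Python (a helper of my own, not from the slides); it covers the two-class case above and generalizes to any number of classes.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = - sum over classes of p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The [9+, 5-] PlayTennis sample used on the next slides:
# entropy(["Yes"] * 9 + ["No"] * 5)  ->  approximately 0.940 bits
```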

Page 10

Information Gain

Gain(S,A) = expected reduction in entropy due to sorting on attribute A

Gain(S, A) ≡ Entropy(S) − ∑v ∈ Values(A) (|Sv| / |S|) Entropy(Sv)
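A matching sketch of Gain(S, A), reusing the `entropy` helper from the previous sketch; examples are again assumed to be attribute-value dicts.

```python
def information_gain(examples, target, A):
    """Gain(S, A): entropy of S minus the weighted entropy of the partitions S_v."""
    S = [ex[target] for ex in examples]
    gain = entropy(S)
    for v in set(ex[A] for ex in examples):
        Sv = [ex[target] for ex in examples if ex[A] == v]
        gain -= (len(Sv) / len(S)) * entropy(Sv)
    return gain

# A possible best_attribute for the id3() sketch:
# lambda exs, t, attrs: max(attrs, key=lambda a: information_gain(exs, t, a))
```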

Page 11

Training Examples: PlayTennis

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Page 12

Selecting the Next Attribute
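Running the `information_gain` sketch over the 14 training examples above reproduces the root-node gains given in Mitchell. The list `days` is assumed to hold the table rows as dicts; it is not defined on the slides.

```python
# days = [{"Day": "D1", "Outlook": "Sunny", "Temperature": "Hot",
#          "Humidity": "High", "Wind": "Weak", "PlayTennis": "No"}, ...]
for A in ["Outlook", "Humidity", "Wind", "Temperature"]:
    print(A, information_gain(days, "PlayTennis", A))

# Gain(S, Outlook)     ≈ 0.246
# Gain(S, Humidity)    ≈ 0.151
# Gain(S, Wind)        ≈ 0.048
# Gain(S, Temperature) ≈ 0.029   (values as reported in Mitchell)
# Outlook has the highest gain, so it becomes the root attribute.
```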

Page 13

Selecting the Next Attribute

Page 14

Hypothesis Space Search by ID3

Page 15

Hypothesis Space Search by ID3

- Hypothesis space is complete!
  - Target function is surely in there…
- Outputs a single hypothesis (which one?)
  - Can't play 20 questions…
- No backtracking
  - Local minima…
- Statistically-based search choices
  - Robust to noisy data…
- Inductive bias: “prefer shortest tree”

Page 16

Inductive Bias in ID3

Note: H is the power set of instances X

Unbiased? Not really…

Preference for short trees with high information gain attributes near the root

Bias is a preference for some hypotheses, rather than a restriction on the hypothesis space H

Occam's razor:

Prefer the shortest hypothesis that fits the data

Page 17

Occam’s Razor

Why prefer short hypotheses?

Argument in favor:
- There are fewer short hypotheses than long hypotheses
- A short hypothesis that fits the data is unlikely to be a coincidence
- A long hypothesis that fits the data might be a coincidence

Argument opposed:
- There are many ways to define small sets of hypotheses
- E.g., all trees with a prime number of nodes that use attributes beginning with “Z”
- What's so special about small sets based on the size of the hypothesis?

Page 18

Overfitting in Decision Trees

Consider adding noisy training example #15:
(<Sunny, Hot, Normal, Strong>, PlayTennis = No)

What effect on earlier tree?

Page 19

Overfitting

Consider the error of hypothesis h over:
- Training data: error_train(h)
- Entire distribution D of data: error_D(h)

Hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h’ ∈ H such that

error_train(h) < error_train(h’)  and  error_D(h) > error_D(h’)

Page 20

Overfitting in Decision Tree Learning

Page 21

Avoiding Overfitting

How can we avoid overfitting?
- Stop growing when the data split is not statistically significant
- Grow the full tree, then post-prune

How to select the “best” tree:
- Measure performance over training data
- Measure performance over a separate validation data set
- MDL: minimize size(tree) + size(misclassifications(tree))

Page 22

Reduced-Error Pruning

Split data into training and validation sets

Do until further pruning is harmful:
- Evaluate the impact on the validation set of pruning each possible node (plus those below it)
- Greedily remove the one that most improves validation set accuracy (see the sketch below)

Produces the smallest version of the most accurate subtree

What if data is limited?
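A rough sketch of this pruning loop. The helpers `accuracy`, `internal_nodes`, and `prune_to_leaf` are hypothetical placeholders for tree evaluation and node replacement; none of these names come from the slides.

```python
def reduced_error_prune(tree, validation, accuracy, internal_nodes, prune_to_leaf):
    """Greedily prune nodes while validation accuracy does not get worse (sketch)."""
    while True:
        base = accuracy(tree, validation)
        best_tree, best_acc = None, base
        # Evaluate the impact of pruning each internal node (and the subtree below it).
        for node in internal_nodes(tree):
            candidate = prune_to_leaf(tree, node)  # replace subtree by a majority-class leaf
            acc = accuracy(candidate, validation)
            if acc >= best_acc:
                best_tree, best_acc = candidate, acc
        # Stop as soon as further pruning would be harmful.
        if best_tree is None:
            return tree
        tree = best_tree
```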

Page 23

Effect of Reduced-Error Pruning

Page 24

Rule Post-Pruning

Generate the decision tree, then:
- Convert the tree to an equivalent set of rules
- Prune each rule independently of the others
- Sort the final rules into the desired sequence for use

Perhaps most frequently used method (e.g., C4.5)

Page 25

Converting a Tree to Rules

If (Outlook = Sunny) ∧ (Humidity = High) Then PlayTennis = No

If (Outlook = Sunny) ∧ (Humidity = Normal) Then PlayTennis = Yes

…
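One way to do this conversion mechanically, as a sketch over the dict-based trees produced by the earlier `id3` sketch (this is an illustration, not C4.5's actual implementation):

```python
def tree_to_rules(tree, conditions=()):
    """Return one (preconditions, class) rule per root-to-leaf path."""
    if not isinstance(tree, dict):          # leaf node: emit a rule
        return [(list(conditions), tree)]
    (A, branches), = tree.items()           # single decision attribute at this node
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((A, value),))
    return rules

# e.g. ([("Outlook", "Sunny"), ("Humidity", "High")], "No")
# corresponds to: If (Outlook = Sunny) ∧ (Humidity = High) Then PlayTennis = No
```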

Page 26

Continuous Valued Attributes

Create a discrete attribute to test continuous values

Temperature = 82.5
(Temperature > 72.3) = true, false

Temperature:  40   48   60   72   80   90
PlayTennis:   No   No   Yes  Yes  Yes  No
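A small sketch of the usual way to pick candidate thresholds (midpoints between adjacent values where the class label changes); the best candidate is then chosen by information gain.

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

temps = [40, 48, 60, 72, 80, 90]
play  = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, play))   # [54.0, 85.0]
# Temperature > 54 turns out to have the higher information gain of the two.
```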

Page 27

Attributes with Many Values

Problem:
- If an attribute has many values, Gain will select it
- Imagine using Date = Jun_3_1996 as an attribute

One approach: Use GainRatio instead

GainRatio(S, A) ≡ Gain(S, A) / SplitInformation(S, A)

SplitInformation(S, A) ≡ − ∑i=1..c (|Si| / |S|) log2 (|Si| / |S|)

where Si is the subset of S for which A has value vi (i = 1, …, c, the c values of A)
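A sketch of both quantities, reusing `information_gain` from the earlier sketch and the same dict-based examples (the helper names are my own):

```python
import math
from collections import Counter

def split_information(examples, A):
    """SplitInformation(S, A): entropy of S with respect to the values of A."""
    n = len(examples)
    counts = Counter(ex[A] for ex in examples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(examples, target, A):
    si = split_information(examples, A)
    # Guard against division by zero when A takes a single value in S.
    return information_gain(examples, target, A) / si if si > 0 else 0.0
```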

Page 28

Attributes with Costs

Consider:
- Medical diagnosis: BloodTest has cost $150
- Robotics: Width_from_1ft has cost 23 sec.

How to learn a consistent tree with low expected cost?

One approach: replace Gain by a cost-sensitive measure

- Tan and Schlimmer (1990): Gain(S, A)^2 / Cost(A)
- Nunez (1988): (2^Gain(S, A) − 1) / (Cost(A) + 1)^w

where w ∈ [0, 1] determines the importance of cost
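As a sketch, both replacements can be computed from a plain gain value; the function name and signature below are my own, not from the slides.

```python
def cost_sensitive_gain(gain, cost, method="tan_schlimmer", w=0.5):
    """Cost-sensitive alternatives to Gain cited on the slide (sketch)."""
    if method == "tan_schlimmer":              # Tan and Schlimmer (1990)
        return gain ** 2 / cost
    return (2 ** gain - 1) / (cost + 1) ** w   # Nunez (1988); w in [0, 1] weights cost
```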

Page 29

Unknown Attribute Values

What if some examples are missing values of A?

Use the training example anyway, and sort it through the tree:
- If node n tests A, assign the most common value of A among the other examples sorted to node n (sketched below)
- Assign the most common value of A among the other examples with the same target value
- Assign probability pi to each possible value vi of A
  - Assign fraction pi of the example to each descendant in the tree

Classify new examples in the same fashion
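A sketch of the simplest strategy above (most common value of A among the other examples at the node); the helper name is my own.

```python
from collections import Counter

def fill_missing(examples, A):
    """Replace missing values of A by the most common observed value of A."""
    known = [ex[A] for ex in examples if ex.get(A) is not None]
    if not known:
        return examples
    mode = Counter(known).most_common(1)[0][0]
    filled = []
    for ex in examples:
        ex = dict(ex)                  # copy so the original row is untouched
        if ex.get(A) is None:
            ex[A] = mode
        filled.append(ex)
    return filled
```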

Page 30

Summary: Decision-Tree Learning

- Most popular symbolic learning method
- Learns discrete-valued functions
- Information-theoretic heuristic
- Handles noisy data
- Decision trees are completely expressive
- Biased towards simpler trees
- ID3 → C4.5 → C5.0 (www.rulequest.com)
- J48 (WEKA) ≈ C4.5