
CSC 311: Introduction to Machine Learning
Lecture 2 - Decision Trees

Amir-massoud Farahmand & Emad A.M. Andrews

University of Toronto


Today

Decision Trees
- Simple but powerful learning algorithm
- One of the most widely used learning algorithms in Kaggle competitions
- Lets us introduce ensembles, a key idea in ML

Useful information-theoretic concepts (entropy, mutual information, etc.)


Decision Trees

Decision trees make predictions by recursively splitting on different attributes according to a tree structure.

Example: classifying fruit as an orange or lemon based on height and width

[Figure: decision tree with Yes/No branches at each split]



Decision Trees

For continuous attributes, split based on less than or greater than some threshold

Thus, the input space is divided into regions with boundaries parallel to the axes


Example with Discrete Inputs

What if the attributes are discrete?

Attributes: [table from Russell & Norvig omitted]

Decision Tree: Example with Discrete Inputs

Possible tree to decide whether to wait (T) or not (F)


Decision Trees

Internal nodes test attributes

Branching is determined by attribute value

Leaf nodes are outputs (predictions)


Expressiveness

Discrete-input, discrete-output case:

- Decision trees can express any function of the input attributes
- Example: for Boolean functions, each truth table row corresponds to a path to a leaf

Continuous-input, continuous-output case:

- Can approximate any function arbitrarily closely

Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples

[Slide credit: S. Russell]


Decision Tree: Classification and Regression

Each path from root to a leaf defines a region R_m of input space

Let {(x^(m1), t^(m1)), . . . , (x^(mk), t^(mk))} be the training examples that fall into R_m

Classification tree:
- discrete output
- leaf value y_m typically set to the most common value in {t^(m1), . . . , t^(mk)}

Regression tree:
- continuous output
- leaf value y_m typically set to the mean value in {t^(m1), . . . , t^(mk)}

Note: We will focus on classification


How do we Learn a Decision Tree?

How do we construct a useful decision tree?


Learning Decision Trees

Learning the simplest (smallest) decision tree that correctly classifies the training set is an NP-complete problem (if you are interested, check Hyafil & Rivest '76).

Resort to a greedy heuristic! Start with an empty decision tree and the complete training set

- Split on the “best” attribute, i.e., partition the dataset
- Recurse on the subpartitions

When should we stop?

Which attribute is the “best” (and where should we split, if continuous)?

- Choose based on accuracy?
- Loss: misclassification error
- Say region R is split into R1 and R2 based on loss L(R)
- Accuracy gain is L(R) − (|R1| L(R1) + |R2| L(R2)) / (|R1| + |R2|)


Choosing a Good Split

Why isn’t accuracy a good measure?

Classify by the majority class; the loss is the misclassification error.

Is this split good? Zero accuracy gain

L(R) − (|R1| L(R1) + |R2| L(R2)) / (|R1| + |R2|) = 49/149 − (50 × 0 + 99 × (49/99)) / 149 = 0

But we have reduced our uncertainty about whether a fruit is a lemon!
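
As a sanity check, here is a small Python sketch (not from the slides) that reproduces this computation from the class counts the example implies: the root holds 100 oranges and 49 lemons, the left leaf 50 oranges, and the right leaf 50 oranges and 49 lemons.

```python
def misclass_error(counts):
    """Misclassification error of a region: 1 minus the majority-class fraction."""
    return 1.0 - max(counts) / sum(counts)

def accuracy_gain(parent, children):
    """L(R) - (|R1| L(R1) + |R2| L(R2)) / (|R1| + |R2|), given per-class counts."""
    n = sum(sum(c) for c in children)
    weighted = sum(sum(c) * misclass_error(c) for c in children) / n
    return misclass_error(parent) - weighted

# Root: 100 oranges, 49 lemons; left leaf: 50 oranges; right leaf: 50 oranges, 49 lemons.
print(accuracy_gain([100, 49], [[50, 0], [50, 49]]))  # ~0.0: no accuracy gain
```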


Choosing a Good Split

How can we quantify uncertainty in prediction for a given leaf node?

- All examples in leaf have the same class: good (low uncertainty)
- Each class has the same number of examples in leaf: bad (high uncertainty)

Idea: Use counts at leaves to define probability distributions, and use information theory to measure uncertainty


Flipping Two Different Coins

Q: Which coin is more uncertain?

Sequence 1: 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... ?

Sequence 2: 0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 0 1 ... ?

[Figure: histograms of the two sequences: 16 zeros and 2 ones versus 8 zeros and 10 ones]


Quantifying Uncertainty

Entropy is a measure of expected “surprise”: How uncertain are we of the value of a draw from this distribution?

H(X) = −E_{X∼p}[log2 p(X)] = −∑_{x∈X} p(x) log2 p(x)

For a coin with outcome probabilities 8/9 and 1/9:
−(8/9) log2(8/9) − (1/9) log2(1/9) ≈ 1/2 bit

For a coin with outcome probabilities 4/9 and 5/9:
−(4/9) log2(4/9) − (5/9) log2(5/9) ≈ 0.99 bits

Averages over information content of each observation

Unit = bits (based on the base of the logarithm)

A fair coin flip has 1 bit of entropy
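
As an illustration (not part of the slides), a minimal Python helper that reproduces the coin entropies above:

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0   : a fair coin has 1 bit of entropy
print(entropy([8/9, 1/9]))   # ~0.50 : heavily biased coin
print(entropy([4/9, 5/9]))   # ~0.99 : nearly fair coin
```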


Quantifying Uncertainty

H(X) = −∑_{x∈X} p(x) log2 p(x)

[Figure: entropy (bits) of a coin flip as a function of the probability p of heads]


Entropy

“High Entropy”:
- Variable has a uniform-like distribution
- Flat histogram
- Values sampled from it are less predictable

“Low Entropy”:
- Distribution of variable has peaks and valleys
- Histogram has lows and highs
- Values sampled from it are more predictable

[Slide credit: Vibhav Gogate]


Entropy of a Joint Distribution

Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}

              Cloudy    Not Cloudy
Raining       24/100    1/100
Not Raining   25/100    50/100

H(X,Y) = −∑_{x∈X} ∑_{y∈Y} p(x,y) log2 p(x,y)
        = −(24/100) log2(24/100) − (1/100) log2(1/100) − (25/100) log2(25/100) − (50/100) log2(50/100)
        ≈ 1.56 bits
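
A quick numeric check of this value, as a sketch in Python:

```python
import math

p_xy = [24/100, 1/100, 25/100, 50/100]        # joint probabilities from the table
print(-sum(p * math.log2(p) for p in p_xy))   # ~1.56 bits
```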


Specific Conditional Entropy

Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}

              Cloudy    Not Cloudy
Raining       24/100    1/100
Not Raining   25/100    50/100

What is the entropy of cloudiness Y , given that it is raining?

H(Y|X = raining) = −∑_{y∈Y} p(y|raining) log2 p(y|raining)
                 = −(24/25) log2(24/25) − (1/25) log2(1/25)
                 ≈ 0.24 bits

We used: p(y|x) = p(x,y) / p(x), and p(x) = ∑_y p(x,y) (sum over a row)
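
This conditioning step can be checked directly; a small sketch (variable names are mine, not the course's): normalize the "raining" row of the joint table to get p(y | raining), then take its entropy.

```python
import math

raining_row = [24/100, 1/100]        # p(raining, cloudy), p(raining, not cloudy)
p_raining = sum(raining_row)         # marginal p(raining) = 25/100, the row sum
p_y_given_raining = [p / p_raining for p in raining_row]
print(-sum(p * math.log2(p) for p in p_y_given_raining))   # ~0.24 bits
```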


Conditional Entropy

              Cloudy    Not Cloudy
Raining       24/100    1/100
Not Raining   25/100    50/100

The expected conditional entropy:

H(Y|X) = E_{x∼p(x)}[H(Y|X = x)]
        = ∑_{x∈X} p(x) H(Y|X = x)
        = −∑_{x∈X} ∑_{y∈Y} p(x,y) log2 p(y|x)
        = −E_{(X,Y)∼p(x,y)}[log2 p(Y|X)]


Conditional Entropy

Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}

              Cloudy    Not Cloudy
Raining       24/100    1/100
Not Raining   25/100    50/100

What is the entropy of cloudiness, given the knowledge of whether or not it is raining?

H(Y|X) = ∑_{x∈X} p(x) H(Y|X = x)
        = 1/4 · H(cloudy | raining) + 3/4 · H(cloudy | not raining)
        ≈ 0.75 bits
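
The same weighted average, as a short Python sketch over the joint table above:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

rows = {"raining": [24/100, 1/100], "not raining": [25/100, 50/100]}
# H(Y|X) = sum_x p(x) H(Y | X = x), where p(x) is the row sum.
H_Y_given_X = sum(sum(row) * entropy([p / sum(row) for p in row]) for row in rows.values())
print(H_Y_given_X)   # ~0.75 bits
```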


Conditional Entropy


Some useful properties for the discrete case:

- H is always non-negative.
- Chain rule: H(X,Y) = H(X|Y) + H(Y) = H(Y|X) + H(X).
- If X and Y are independent, then X does not tell us anything about Y: H(Y|X) = H(Y).
- If X and Y are independent, then H(X,Y) = H(X) + H(Y).
- But Y tells us everything about Y: H(Y|Y) = 0.
- By knowing X, we can only decrease uncertainty about Y: H(Y|X) ≤ H(Y).

Exercise: Verify these!

The figure is reproduced from Fig. 8.1 of MacKay, Information Theory, Inference, and ...
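
As a partial step toward the exercise, here is a numeric spot-check of the chain rule on the rain/cloud table (a sketch, not a proof):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_xy = [[24/100, 1/100],     # raining:     [cloudy, not cloudy]
        [25/100, 50/100]]    # not raining: [cloudy, not cloudy]

H_XY = entropy([p for row in p_xy for p in row])
H_X = entropy([sum(row) for row in p_xy])                    # marginal over raininess
H_Y_given_X = sum(sum(row) * entropy([p / sum(row) for p in row]) for row in p_xy)
print(H_XY, H_Y_given_X + H_X)   # both ~1.56 bits, so H(X,Y) = H(Y|X) + H(X) here
```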


Information Gain

              Cloudy    Not Cloudy
Raining       24/100    1/100
Not Raining   25/100    50/100

How much information about cloudiness do we get by discovering whether it is raining?

IG(Y|X) = H(Y) − H(Y|X) ≈ 0.25 bits

This is called the information gain in Y due to X, or the mutual information of Y and X

If X is completely uninformative about Y: IG(Y|X) = 0

If X is completely informative about Y: IG(Y|X) = H(Y)
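
Combining the previous computations for the table above, a short sketch that reproduces the 0.25-bit figure:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_xy = [[24/100, 1/100],     # raining
        [25/100, 50/100]]    # not raining

H_Y = entropy([sum(col) for col in zip(*p_xy)])   # marginal entropy of cloudiness, ~1.0
H_Y_given_X = sum(sum(row) * entropy([p / sum(row) for p in row]) for row in p_xy)
print(H_Y - H_Y_given_X)   # IG(Y|X) ~ 0.25 bits
```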


Revisiting Our Original Example

Information gain measures the informativeness of a variable, which is exactly what we desire in a decision tree attribute!

What is the information gain of this split?

Let Y be the r.v. denoting lemon or orange, let B be the r.v. denoting whether the left or right branch is taken, and treat counts as probabilities.

Root entropy: H(Y) = −(49/149) log2(49/149) − (100/149) log2(100/149) ≈ 0.91

Leaf entropies: H(Y|B = left) = 0, H(Y|B = right) ≈ 1

IG(Y|B) = H(Y) − H(Y|B)
        = H(Y) − {H(Y|B = left) P(B = left) + H(Y|B = right) P(B = right)}
        ≈ 0.91 − (0 · 1/3 + 1 · 2/3) ≈ 0.24 > 0
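
The same calculation from raw class counts, as a sketch; it gives roughly 0.25 bits, and the slide's 0.24 comes from rounding H(Y) to 0.91 and the right-leaf entropy to 1:

```python
import math

def entropy(counts):
    """Entropy (bits) of the empirical class distribution given class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

root = [49, 100]                  # 49 lemons, 100 oranges at the root
left, right = [0, 50], [49, 50]   # left leaf: 50 oranges; right leaf: 49 lemons, 50 oranges

n = sum(root)
H_Y_given_B = sum(left) / n * entropy(left) + sum(right) / n * entropy(right)
print(entropy(root) - H_Y_given_B)   # ~0.25 > 0
```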


Constructing Decision Trees

[Figure: decision tree with Yes/No branches at each split]

At each level, one must choose:

1. Which variable to split.
2. Possibly where to split it.

Choose them based on how much information we would gain from the decision! (Choose the attribute that gives the best gain.)


Decision Tree Construction Algorithm

Simple, greedy, recursive approach, builds up tree node-by-node

Start with empty decision tree and complete training set

- Split on the most informative attribute, partitioning the dataset
- Recurse on the subpartitions

Possible termination condition: end if all examples in the current subpartition share the same class
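
A compact sketch of this greedy procedure for discrete attributes (illustrative only; the helper names and the dict-based tree representation are assumptions of mine, not the course's code):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, labels, attr):
    """Information gain of splitting `data` (a list of dicts) on a discrete attribute."""
    n = len(labels)
    cond = 0.0
    for value in set(x[attr] for x in data):
        subset = [t for x, t in zip(data, labels) if x[attr] == value]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def build_tree(data, labels, attrs):
    """Greedy recursion; stop when the node is pure or no attributes remain."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]          # leaf: majority class
    best = max(attrs, key=lambda a: info_gain(data, labels, a))
    children = {}
    for value in set(x[best] for x in data):
        idx = [i for i, x in enumerate(data) if x[best] == value]
        children[value] = build_tree([data[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
    return {"split_on": best, "children": children}
```

A practical implementation would usually also cap the tree depth or require a minimum number of examples per leaf, which connects to the overfitting discussion later.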


Back to Our Example

Attributes: [from: Russell & Norvig]


Attribute Selection

IG(Y|X) = H(Y) − H(Y|X)

IG(Type) = 1 − [ (2/12) H(Y|Fr.) + (2/12) H(Y|It.) + (4/12) H(Y|Thai) + (4/12) H(Y|Bur.) ] = 0

IG(Patrons) = 1 − [ (2/12) H(0, 1) + (4/12) H(1, 0) + (6/12) H(2/6, 4/6) ] ≈ 0.541
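
Checking the Patrons number (a sketch; the group sizes 2, 4, and 6 and their class proportions are taken from the formula above):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

IG_patrons = 1 - (2/12 * entropy([0, 1])
                  + 4/12 * entropy([1, 0])
                  + 6/12 * entropy([2/6, 4/6]))
print(IG_patrons)   # ~0.541
```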


Which Tree is Better?


What Makes a Good Tree?

Not too small: need to handle important but possibly subtle distinctions in data

Not too big:
- Computational efficiency (avoid redundant, spurious attributes)
- Avoid over-fitting training examples
- Human interpretability

“Occam’s Razor”: find the simplest hypothesis that fits the observations

- Useful principle, but hard to formalize (how to define simplicity?)
- See Domingos, 1999, “The role of Occam’s razor in knowledge discovery”

We desire small trees with informative nodes near the root


Decision Tree Miscellany

Problems:

- You have exponentially less data at lower levels
- A large tree can overfit the data
- Greedy algorithms don’t necessarily yield the global optimum
- Mistakes at top-level propagate down the tree

Handling continuous attributes

- Split based on a threshold, chosen to maximize information gain (see the sketch below)

There are other criteria used to measure the quality of a split, e.g., the Gini index

Trees can be pruned in order to make them less complex

Decision trees can also be used for regression on real-valued outputs. Choose splits to minimize squared error, rather than maximize information gain.
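
As referenced in the continuous-attribute bullet above, here is a sketch of threshold selection for a single continuous attribute: sort by the attribute, scan midpoints between consecutive distinct values, and keep the threshold with the highest information gain. The fruit widths below are made up for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information gain) of the best axis-aligned split."""
    pairs = sorted(zip(values, labels))
    xs, ys = [v for v, _ in pairs], [t for _, t in pairs]
    best_gain, best_t = 0.0, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                         # no boundary between equal values
        t = (xs[i] + xs[i - 1]) / 2
        left, right = ys[:i], ys[i:]
        gain = entropy(ys) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# Hypothetical fruit widths (cm) and labels:
print(best_threshold([6.0, 6.5, 7.0, 8.0, 8.5],
                     ["lemon", "lemon", "lemon", "orange", "orange"]))  # (7.5, ~0.97)
```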


Comparison to k-NN

Advantages of decision trees over k-NN

Good with discrete attributes

Easily deals with missing values (just treat as another value)

Robust to scale of inputs; only depends on ordering

Good when there are lots of attributes, but only a few are important

Fast at test time

More interpretable


Comparison to k-NN

Advantages of k-NN over decision trees

Able to handle attributes/features that interact in complex ways

Can incorporate interesting distance measures, e.g., shape contexts.
