
CSC 2515 Lecture 2: Decision Trees and Ensembles

Marzyeh Ghassemi

Material and slides developed by Roger Grosse, University of Toronto


Overview

Decision Trees

- Simple but powerful learning algorithm
- One of the most widely used learning algorithms in Kaggle competitions

Lets us introduce ensembles, a key idea in ML more broadly

Useful information theoretic concepts (entropy, mutual information, etc.)


Decision Trees

[Figure: an example decision tree; each internal node asks a question and the branches are labeled Yes / No]


Decision trees make predictions by recursively splitting on different attributes according to a tree structure.


Example with Discrete Inputs

What if the attributes are discrete?

Attributes:


Decision Tree: Example with Discrete Inputs

The tree to decide whether to wait (T) or not (F)


Decision Trees


Internal nodes test attributes

Branching is determined by attribute value

Leaf nodes are outputs (predictions)


Decision Tree: Classification and Regression

Each path from root to a leaf defines a region R_m of input space.

Let {(x^(m_1), t^(m_1)), ..., (x^(m_k), t^(m_k))} be the training examples that fall into R_m.

Classification tree:
- discrete output
- leaf value y_m typically set to the most common value in {t^(m_1), ..., t^(m_k)}

Regression tree:
- continuous output
- leaf value y_m typically set to the mean value in {t^(m_1), ..., t^(m_k)}

Note: We will focus on classification.

[Slide credit: S. Russell]


Expressiveness

Discrete-input, discrete-output case:

- Decision trees can express any function of the input attributes
- E.g., for Boolean functions, truth table row → path to leaf

Continuous-input, continuous-output case:

- Can approximate any function arbitrarily closely

Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples.

[Slide credit: S. Russell]


Questions?


How do we Learn a Decision Tree?

How do we construct a useful decision tree?


Learning Decision Trees

Learning the simplest (smallest) decision tree is an NP-complete problem [if you are interested, check: Hyafil & Rivest '76]

Resort to a greedy heuristic:

- Start from an empty decision tree
- Split on the "best" attribute
- Recurse

Which attribute is the "best"?

- Choose based on accuracy?


Choosing a Good Split

Why isn’t accuracy a good measure?

Is this split good? Zero accuracy gain.

Instead, we will use techniques from information theory.

Idea: Use counts at leaves to define probability distributions, so we can measure uncertainty.


Choosing a Good Split

Which attribute is better to split on, X1 or X2?

- Deterministic: good (all are true or false; just one class in the leaf)
- Uniform distribution: bad (all classes in leaf equally probable)
- What about distributions in between?

Note: Let’s take a slight detour and remember concepts from information theory

[Slide credit: D. Sontag]


We Flip Two Different Coins

Sequence 1: 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 ... ?

Sequence 2: 0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 0 1 ... ?

[Histograms of the outcomes: the first coin produced 16 zeros and 2 ones; the second produced 8 zeros and 10 ones]


Quantifying Uncertainty

Entropy is a measure of expected “surprise”:

H(X) = −∑_{x∈X} p(x) log₂ p(x)

For the first coin, with p(0) = 8/9 and p(1) = 1/9:
−(8/9) log₂(8/9) − (1/9) log₂(1/9) ≈ 1/2

For the second coin, with p(0) = 4/9 and p(1) = 5/9:
−(4/9) log₂(4/9) − (5/9) log₂(5/9) ≈ 0.99

Measures the information content of each observation

Unit = bits

A fair coin flip has 1 bit of entropy
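To make the definition concrete, here is a small Python sketch (ours, not from the slides); the entropy helper simply takes a list of probabilities.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The two coins from the previous slide
print(entropy([8/9, 1/9]))   # ~0.50 bits
print(entropy([4/9, 5/9]))   # ~0.99 bits
print(entropy([0.5, 0.5]))   # 1.0 bit (fair coin)
```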


Quantifying Uncertainty

H(X) = −∑_{x∈X} p(x) log₂ p(x)

[Plot: entropy of a coin flip as a function of the probability p of heads, rising from 0 at p = 0 to 1 bit at p = 0.5 and back to 0 at p = 1]


Entropy

"High Entropy":
- Variable has a uniform-like distribution
- Flat histogram
- Values sampled from it are less predictable

"Low Entropy":
- Distribution of variable has many peaks and valleys
- Histogram has many lows and highs
- Values sampled from it are more predictable

[Slide credit: Vibhav Gogate]


Entropy of a Joint Distribution

Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}

              Cloudy    Not Cloudy
Raining       24/100      1/100
Not Raining   25/100     50/100

H(X, Y) = −∑_{x∈X} ∑_{y∈Y} p(x, y) log₂ p(x, y)
        = −(24/100) log₂(24/100) − (1/100) log₂(1/100) − (25/100) log₂(25/100) − (50/100) log₂(50/100)
        ≈ 1.56 bits


Specific Conditional Entropy

Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}

              Cloudy    Not Cloudy
Raining       24/100      1/100
Not Raining   25/100     50/100

What is the entropy of cloudiness Y, given that it is raining?

H(Y | X = x) = −∑_{y∈Y} p(y | x) log₂ p(y | x)
             = −(24/25) log₂(24/25) − (1/25) log₂(1/25)
             ≈ 0.24 bits

We used p(y | x) = p(x, y) / p(x), and p(x) = ∑_y p(x, y) (sum over a row).


Conditional Entropy

              Cloudy    Not Cloudy
Raining       24/100      1/100
Not Raining   25/100     50/100

The expected conditional entropy:

H(Y | X) = ∑_{x∈X} p(x) H(Y | X = x)
         = −∑_{x∈X} ∑_{y∈Y} p(x, y) log₂ p(y | x)


Conditional Entropy

Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}

              Cloudy    Not Cloudy
Raining       24/100      1/100
Not Raining   25/100     50/100

What is the entropy of cloudiness, given the knowledge of whether or not it is raining?

H(Y | X) = ∑_{x∈X} p(x) H(Y | X = x)
         = (1/4) H(cloudy | is raining) + (3/4) H(cloudy | not raining)
         ≈ 0.75 bits


Conditional Entropy

Some useful properties:

- H is always non-negative
- Chain rule: H(X, Y) = H(X | Y) + H(Y) = H(Y | X) + H(X)
- If X and Y are independent, then X doesn't tell us anything about Y: H(Y | X) = H(Y)
- But Y tells us everything about Y: H(Y | Y) = 0
- By knowing X, we can only decrease uncertainty about Y: H(Y | X) ≤ H(Y)


Information Gain

              Cloudy    Not Cloudy
Raining       24/100      1/100
Not Raining   25/100     50/100

How much information about cloudiness do we get by discovering whether it is raining?

IG(Y | X) = H(Y) − H(Y | X)
          ≈ 0.25 bits

This is called the information gain in Y due to X, or the mutual information of Y and X.

If X is completely uninformative about Y: IG(Y | X) = 0

If X is completely informative about Y: IG(Y | X) = H(Y)
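These quantities can be checked directly from the joint table; here is a small sketch (ours, not from the slides) computing H(Y), H(Y | X), and the information gain in Python.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(x, y) from the rain/cloudiness table
joint = {('raining', 'cloudy'): 24/100, ('raining', 'not cloudy'): 1/100,
         ('not raining', 'cloudy'): 25/100, ('not raining', 'not cloudy'): 50/100}

p_x = {}                                   # marginal over rain (sum over each row)
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

H_Y = entropy([joint[('raining', y)] + joint[('not raining', y)]
               for y in ('cloudy', 'not cloudy')])                   # ~1.00 bit
H_Y_given_X = sum(p_x[x] * entropy([joint[(x, y)] / p_x[x]
                                    for y in ('cloudy', 'not cloudy')])
                  for x in p_x)                                       # ~0.75 bits
print(H_Y - H_Y_given_X)                                              # IG ~ 0.25 bits
```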


Questions?


Revisiting Our Original Example

Information gain measures the informativeness of a variable, which is exactly what we desire in a decision tree attribute!

What is the information gain of this split?

Root entropy: H(Y) = −(49/149) log₂(49/149) − (100/149) log₂(100/149) ≈ 0.91

Leaf entropies: H(Y | left) = 0, H(Y | right) ≈ 1.

IG(split) ≈ 0.91 − (1/3 · 0 + 2/3 · 1) ≈ 0.24 > 0


Constructing Decision Trees


At each level, one must choose:

1. Which variable to split.
2. Possibly where to split it.

Choose them based on how much information we would gain from the decision! (Choose the attribute that gives the highest gain.)


Decision Tree Construction Algorithm

Simple, greedy, recursive approach that builds up the tree node by node:

1. pick an attribute to split at a non-terminal node

2. split examples into groups based on attribute value

3. for each group:

- if no examples: return majority from parent
- else if all examples in same class: return class
- else loop to step 1
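A minimal sketch of this greedy procedure for discrete attributes (ours, not from the slides), assuming each example is an (attribute-dict, label) pair; the helper names entropy, info_gain, and build_tree are our own.

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attr):
    """Information gain of splitting the examples on one discrete attribute."""
    labels = [t for _, t in examples]
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [t for x, t in examples if x[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(examples, attrs, parent_majority=None):
    if not examples:                        # no examples: return majority from parent
        return parent_majority
    labels = [t for _, t in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:  # pure leaf, or nothing left to split on
        return majority
    best = max(attrs, key=lambda a: info_gain(examples, a))
    branches = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, t) for x, t in examples if x[best] == value]
        branches[value] = build_tree(subset, [a for a in attrs if a != best], majority)
    return (best, branches)                 # internal node: (attribute, value -> subtree)
```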


Back to Our Example

Attributes: [from: Russell & Norvig]


Attribute Selection

IG(Y) = H(Y) − H(Y | X)

IG(type) = 1 − [ (2/12) H(Y | Fr.) + (2/12) H(Y | It.) + (4/12) H(Y | Thai) + (4/12) H(Y | Bur.) ] = 0

IG(Patrons) = 1 − [ (2/12) H(0, 1) + (4/12) H(1, 0) + (6/12) H(2/6, 4/6) ] ≈ 0.541


Which Tree is Better?


What Makes a Good Tree?

Not too small: need to handle important but possibly subtle distinctions in data

Not too big:
- Computational efficiency (avoid redundant, spurious attributes)
- Avoid over-fitting training examples
- Human interpretability

"Occam's Razor": find the simplest hypothesis that fits the observations
- Useful principle, but hard to formalize (how to define simplicity?)
- See Domingos, 1999, "The role of Occam's razor in knowledge discovery"

We desire small trees with informative nodes near the root


Decision Tree Miscellany

Problems:

- You have exponentially less data at lower levels
- Too big of a tree can overfit the data
- Greedy algorithms don't necessarily yield the global optimum

Handling continuous attributes:
- Split based on a threshold, chosen to maximize information gain (see the sketch below)

Decision trees can also be used for regression on real-valued outputs. Choose splits to minimize squared error, rather than maximize information gain.
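For the continuous case, here is a minimal sketch (ours) of choosing a threshold for a single continuous feature by information gain; candidate thresholds are midpoints between consecutive distinct sorted values, and best_threshold is a hypothetical helper name.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (information gain, threshold) for the best binary split of one feature."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (0.0, None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                               # no threshold between equal values
        thresh = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for v, t in pairs if v <= thresh]
        right = [t for v, t in pairs if v > thresh]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if gain > best[0]:
            best = (gain, thresh)
    return best
```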


Comparison to k-NN

Advantages of decision trees over KNN

Good when there are lots of attributes, but only a few are important

Good with discrete attributes

Easily deals with missing values (just treat as another value)

Robust to scale of inputs

Fast at test time

More interpretable

Advantages of KNN over decision trees

Few hyperparameters

Able to handle attributes/features that interact in complex ways (e.g. pixels)

Can incorporate interesting distance measures (e.g. shape contexts)

Typically make better predictions in practice

- As we'll see next lecture, ensembles of decision trees are much stronger. But they lose many of the advantages listed above.


Questions?


Ensembles and Bagging


Ensemble methods: Overview

An ensemble of predictors is a set of predictors whose individual decisions are combined in some way to classify new examples

- E.g., (possibly weighted) majority vote

For this to be nontrivial, the classifiers must differ somehow, e.g.

- Different algorithm
- Different choice of hyperparameters
- Trained on different data
- Trained with different weighting of the training examples

Ensembles are usually trivial to implement. The hard part is deciding what kind of ensemble you want, based on your goals.


Ensemble methods: Overview

This lecture: bagging

- Train classifiers independently on random subsets of the training data.

Later lecture: boosting

- Train classifiers sequentially, each time focusing on training examples that the previous ones got wrong.

Bagging and boosting serve very different purposes. To understand this, we need to take a detour to understand the bias and variance of a learning algorithm.


Bias and Variance


Loss Functions

A loss function L(y, t) defines how bad it is if the algorithm predicts y, but the target is actually t.

Example: 0-1 loss for classification

L_{0-1}(y, t) = 0 if y = t, and 1 if y ≠ t

- Averaging the 0-1 loss over the training set gives the training error rate, and averaging over the test set gives the test error rate.

Example: squared error loss for regression

L_SE(y, t) = (1/2)(y − t)²

- The average squared error loss is called mean squared error (MSE).
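A quick numerical check of the two loss definitions (ours), using small arrays as stand-in predictions and targets.

```python
import numpy as np

y = np.array([1, 0, 1, 1])          # predictions
t = np.array([1, 1, 1, 0])          # targets
print((y != t).mean())              # average 0-1 loss, i.e. error rate: 0.5
print(0.5 * ((y - t) ** 2).mean())  # average squared error with the 1/2 convention: 0.25
```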


Bias-Variance Decomposition

Recall that overly simple models underfit the data, and overly complex models overfit.

We can quantify this effect in terms of the bias/variance decomposition.

- Bias and variance of what?


Bias-Variance Decomposition: Basic Setup

Suppose the training set D consists of pairs (x_i, t_i) sampled independently and identically (i.i.d.) from a single data-generating distribution p_data.

Pick a fixed query point x (denoted with a green x).

Consider an experiment where we sample lots of training sets independently from p_data.


Bias-Variance Decomposition: Basic Setup

Let's run our learning algorithm on each training set, and compute its prediction y at the query point x.

We can view y as a random variable, where the randomness comes from the choice of training set.

The classification accuracy is determined by the distribution of y.


Bias-Variance Decomposition: Basic Setup

Here is the analogous setup for regression:

Since y is a random variable, we can talk about its expectation, variance, etc.


Bias-Variance Decomposition: Basic Setup

Recap of basic setup:

Notice: y is independent of t. (Why?)

This gives a distribution over the loss at x, with expectation E[L(y, t) | x].

For each query point x, the expected loss is different. We are interested in minimizing the expectation of this with respect to x ∼ p_data.


Bayes Optimality

For now, focus on squared error loss, L(y, t) = (1/2)(y − t)².

A first step: suppose we knew the conditional distribution p(t | x). What value y should we predict?

- Here, we are treating t as a random variable and choosing y.

Claim: y* = E[t | x] is the best possible prediction.

Proof:

E[(y − t)² | x] = E[y² − 2yt + t² | x]
                = y² − 2y E[t | x] + E[t² | x]
                = y² − 2y E[t | x] + E[t | x]² + Var[t | x]
                = y² − 2y y* + y*² + Var[t | x]
                = (y − y*)² + Var[t | x]


Bayes Optimality

E[(y − t)² | x] = (y − y*)² + Var[t | x]

The first term is nonnegative, and can be made 0 by setting y = y*.

The second term corresponds to the inherent unpredictability, or noise, of the targets, and is called the Bayes error.

- This is the best we can ever hope to do with any learning algorithm. An algorithm that achieves it is Bayes optimal.
- Notice that this term doesn't depend on y.

This process of choosing a single value y* based on p(t | x) is an example of decision theory.


Bayes Optimality

Now return to treating y as a random variable (where the randomness comes from the choice of dataset).

We can decompose the expected loss (suppressing the conditioning on x for clarity):

E[(y − t)²] = E[(y − y*)²] + Var(t)
            = E[y*² − 2 y* y + y²] + Var(t)
            = y*² − 2 y* E[y] + E[y²] + Var(t)
            = y*² − 2 y* E[y] + E[y]² + Var(y) + Var(t)
            = (y* − E[y])²  +  Var(y)  +  Var(t)
                 [bias]       [variance]  [Bayes error]


Bayes Optimality

E[(y − t)²] = (y* − E[y])²  +  Var(y)  +  Var(t)
                 [bias]       [variance]  [Bayes error]

We just split the expected loss into three terms:

- bias: how wrong the expected prediction is (corresponds to underfitting)
- variance: the amount of variability in the predictions (corresponds to overfitting)
- Bayes error: the inherent unpredictability of the targets

Even though this analysis only applies to squared error, we often loosely use "bias" and "variance" as synonyms for "underfitting" and "overfitting".
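To make the decomposition concrete, here is a small simulation (ours, not from the slides) that estimates the three terms at one query point, assuming a true function sin(2πx), Gaussian noise, and a deliberately underfit degree-1 polynomial; all names here are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true regression function E[t | x]
sigma = 0.3                           # noise standard deviation, so Var(t | x) = sigma**2
x0 = 0.25                             # fixed query point

def train_and_predict(n=20, degree=1):
    """Sample one training set and return the fitted model's prediction at x0."""
    x = rng.uniform(0, 1, n)
    t = f(x) + rng.normal(0, sigma, n)
    return np.polyval(np.polyfit(x, t, degree), x0)

preds = np.array([train_and_predict() for _ in range(2000)])
bias2 = (f(x0) - preds.mean()) ** 2   # (y* - E[y])^2
variance = preds.var()                # Var(y)
bayes = sigma ** 2                    # Var(t)
print(bias2, variance, bayes)         # their sum approximates E[(y - t)^2] at x0
```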


Bias/Variance Decomposition: Another Visualization

We can visualize this decomposition in output space, where the axes correspond to predictions on the test examples.

If we have an overly simple model (e.g. KNN with large k), it might have

- high bias (because it's too simplistic to capture the structure in the data)
- low variance (because there's enough data to get a stable estimate of the decision boundary)


Bias/Variance Decomposition: Another Visualization

If you have an overly complex model (e.g. KNN with k = 1), it might have

- low bias (since it learns all the relevant structure)
- high variance (it fits the quirks of the data you happened to sample)


Bagging

Now, back to bagging!


Bagging: Motivation

Suppose we could somehow sample m independent training sets from p_data.

We could then compute the prediction y_i based on each one, and take the average y = (1/m) ∑_{i=1}^m y_i.

How does this affect the three terms of the expected loss?

- Bayes error: unchanged, since we have no control over it
- Bias: unchanged, since the averaged prediction has the same expectation
  E[y] = E[(1/m) ∑_{i=1}^m y_i] = E[y_i]
- Variance: reduced, since we're averaging over independent samples
  Var[y] = Var[(1/m) ∑_{i=1}^m y_i] = (1/m²) ∑_{i=1}^m Var[y_i] = (1/m) Var[y_i]
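A quick numerical illustration of the variance term (ours): averaging m independent predictions shrinks the variance by a factor of m.

```python
import numpy as np

rng = np.random.default_rng(0)
m, trials = 10, 100_000
# Stand-ins for predictions y_i from m independent training sets, each with variance 1
y = rng.normal(0.0, 1.0, size=(trials, m))
print(y[:, 0].var())         # ~1.0 : variance of a single prediction
print(y.mean(axis=1).var())  # ~0.1 : variance of the average of m = 10 predictions
```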


Bagging: The Idea

In practice, running an algorithm separately on independently sampled datasets is very wasteful!

Solution: bootstrap aggregation, or bagging.
- Take a single dataset D with n examples.
- Generate m new datasets, each by sampling n training examples from D, with replacement.
- Average the predictions of models trained on each of these datasets (see the sketch below).

The bootstrap is one of the most important ideas in all of statistics!
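A minimal, model-agnostic sketch of this procedure (ours, not the slides' code): fit_fn stands for any training routine that returns a prediction function, and class labels are assumed to be nonnegative integers so the majority vote can use bincount.

```python
import numpy as np

def bagging_fit(X, t, fit_fn, m=50, rng=None):
    """Train m models, each on a bootstrap resample of (X, t)."""
    rng = rng or np.random.default_rng(0)
    n = len(X)
    models = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)          # n draws with replacement
        models.append(fit_fn(X[idx], t[idx]))
    return models

def bagging_predict(models, X):
    """Combine the ensemble by (unweighted) majority vote over class labels."""
    votes = np.stack([predict(X) for predict in models]).astype(int)  # (m, n_test)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```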


Bagging: The Idea

Problem: the datasets are not independent, so we don't get the 1/m variance reduction.

- Possible to show that if the sampled predictions have variance σ² and correlation ρ, then
  Var((1/m) ∑_{i=1}^m y_i) = (1/m)(1 − ρ)σ² + ρσ².

Ironically, it can be advantageous to introduce additional variability into your algorithm, as long as it reduces the correlation between samples.

- Intuition: you want to invest in a diversified portfolio, not just one stock.
- Can help to average over multiple algorithms, or multiple configurations of the same algorithm.
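The formula can be checked numerically; in this sketch (ours) the "predictions" are equicorrelated Gaussians built from a shared component plus independent noise.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma, rho, trials = 10, 1.0, 0.3, 200_000
z = rng.normal(0, sigma, size=(trials, 1))        # shared component gives correlation rho
eps = rng.normal(0, sigma, size=(trials, m))      # independent components
y = np.sqrt(rho) * z + np.sqrt(1 - rho) * eps     # each y_i has variance sigma^2, correlation rho
print(y.mean(axis=1).var())                       # empirical variance of the average
print((1 - rho) * sigma**2 / m + rho * sigma**2)  # formula: 0.37
```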


Random Forests

Random forests = bagged decision trees, with one extra trick to decorrelate the predictions

When choosing each node of the decision tree, choose a random set of d input features, and only consider splits on those features

Random forests are probably the best black-box machine learning algorithm; they often work well with no tuning whatsoever.

- one of the most widely used algorithms in Kaggle competitions
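For reference, scikit-learn's RandomForestClassifier implements this recipe (bootstrap resampling plus a random feature subset at each split); a minimal usage sketch, with the toy dataset and settings chosen by us:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators = number of bagged trees; max_features = size of the random feature
# subset considered at each split (the "extra trick" that decorrelates the trees)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))   # test accuracy
```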


Summary

Bagging reduces overfitting by averaging predictions.

Used in most competition winners

I Even if a single model is great, a small ensemble usually helps.

Limitations:

- Does not reduce bias.
- There is still correlation between classifiers.

Random forest solution: Add more randomness.
