Big classification problems
How to solve a classification problem with 45 class levels using Random Forests

Nicholas L. Crookston
Gerald E. Rehfeldt
US Forest Service, Rocky Mountain Research Station, Moscow, ID

Western Mensurationists
Missoula, MT
June 20-22, 2010

Contents
• Problem (we have 45 class levels, that’s a lot)
• Solution (we broke the problem into many subsets and formed an ensemble classifier)
• Results (very good, and we have a measure of extrapolation)
• Discussion

Problem
• We desire to predict the biotic community as a function of climate.
• There are 45 biotic communities of interest. Brown, D.E., F. Reichenbacher, S.E. Franson. 1998. A classification of North American biotic communities. University of Utah Press, Salt Lake City. 141 pp.

Problem
• In a 2006 effort on a subset of these communities, we had great results using: Breiman, Leo. 2001. Random Forests. Machine Learning 45:5-32.
• These results were published in: Rehfeldt, G.E., N.L. Crookston, M.V. Warwell and J.S. Evans. 2006. Empirical analyses of plant-climate relationships for the western United States. Int. J. Plant Sci. 167, 1123-1150.

Random Forests
• A Random Forest (RF) is a set of classification or regression trees (CART).
• RF builds many trees; each one minimizes the classification error on a bootstrap sample of the training data.
• The randomForest package supports up to 32 class levels, but when there are more than 10, it uses a sampling scheme for each tree.
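As a minimal sketch of the bootstrap step mentioned above (not the randomForest implementation), each tree's training sample can be drawn with replacement from the rows of the training data. The function name `bootstrap_sample` and the toy rows are hypothetical.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(0)           # fixed seed only so the sketch is repeatable
training_rows = list(range(10))  # stand-in for rows of training data
sample = bootstrap_sample(training_rows, rng)

# The sample has the same size as the training data, but some rows
# usually repeat and others are left out.
print(len(sample), len(set(sample)))
```

Each tree in a forest would be fit to its own sample drawn this way.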

Random Forests -- continued
• To classify a new observation:
  – RF puts the new observation down each of the trees in the forest.
  – Each tree gives a classification; the classification is a vote.
  – The forest chooses the class having the most votes over all the trees.
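The voting rule can be sketched in a few lines. This is an illustration only: the function name `forest_vote` and the class labels are hypothetical, and ties here simply go to the class seen first, which is a simplification.

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Return the class with the most votes across the trees."""
    votes = Counter(tree_predictions)
    winner, _count = votes.most_common(1)[0]
    return winner

# Hypothetical votes from a 7-tree forest for one observation.
votes = ["pine", "fir", "pine", "grass", "pine", "fir", "pine"]
print(forest_vote(votes))
```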

Problem -- continued

• We have 45 class levels, over the limit of 32 in package randomForest!

• We desire to make predictions using future climates.

• RF might predict nonsense answers for future climatic conditions that are unique with respect to the training data.

• These are extrapolations we need to detect.

Solution -- Steps

1. Training data: ~1.6 million obs, 35 climate variables from the Moscow climate model.
2. We created 100 Random Forests.
3. To create 1 of the forests:
   a. Sample 9 of 45 class levels (without replacement).
   b. Make a copy of the training data.
   c. Recode the biotic community in this copy; keep as is if code is one of the 9 in the sample, otherwise change the observed class to “other”.
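The sampling and recoding steps above can be sketched as follows. This is a sketch under stated assumptions, not the authors' code: the community codes are taken to be the integers 1 to 45, and the names `sample_class_subsets` and `recode` and the toy class column are hypothetical.

```python
import random

N_FORESTS = 100         # number of Random Forests in the ensemble
N_CLASSES = 45          # biotic community codes, assumed to be 1..45
CLASSES_PER_FOREST = 9  # class levels kept in each forest

def sample_class_subsets(rng):
    """For each forest, sample 9 of the 45 codes without replacement."""
    codes = list(range(1, N_CLASSES + 1))
    return [set(rng.sample(codes, CLASSES_PER_FOREST))
            for _ in range(N_FORESTS)]

def recode(observed_codes, kept):
    """Copy the class column, replacing codes outside `kept` with "other"."""
    return [code if code in kept else "other" for code in observed_codes]

rng = random.Random(2010)  # fixed seed only so the sketch is repeatable
subsets = sample_class_subsets(rng)

observed = [1, 7, 7, 23, 45, 12]        # toy stand-in for the class column
recoded = recode(observed, subsets[0])  # training labels for the first forest
print(sorted(subsets[0]), recoded)
```

Each recoded copy then has at most 10 class levels (the 9 kept codes plus “other”), which is well inside the limit of a single forest.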

Steps -- continued

4. Fit each of the 100 RFs.
5. To make a prediction:

a. Put the new case down all 100 RFs, providing a vector of 100 predictions for the case.

b. Count the number of predictions by biotic community code, including “other”. This gives a table of codes and counts that has 46 rows (one for each community code plus “other”).

Steps -- continued

c. Divide the counts for each code by the number of RFs that contained the code.

d. The ensemble classification is the class value corresponding to the maximum of these quotients.
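The prediction steps (count, divide, take the maximum) can be sketched as below. The function name `ensemble_classify` and the toy inputs are hypothetical; in practice the `predictions` list would come from putting the new case down each fitted forest.

```python
from collections import Counter

def ensemble_classify(predictions, forest_classes):
    """Combine one prediction per forest into a single class.

    predictions    -- list with one predicted label per forest
    forest_classes -- list of sets; the codes each forest was trained on
    """
    counts = Counter(predictions)

    # Number of forests that could have predicted each code.
    n_containing = Counter()
    for kept in forest_classes:
        n_containing.update(kept)
    n_containing["other"] = len(forest_classes)  # every forest has "other"

    # Votes for a code divided by the number of forests containing it.
    quotients = {code: counts[code] / n_containing[code]
                 for code in n_containing}
    best = max(quotients, key=quotients.get)
    return best, quotients

# Toy case: 4 forests and 3 codes, values made up for illustration.
forest_classes = [{1, 2}, {1, 3}, {2, 3}, {1, 2}]
predictions = [1, 1, "other", 2]
best, q = ensemble_classify(predictions, forest_classes)
print(best, q)
```

Dividing by the number of forests that contain a code matters because each code appears in a different number of forests, so raw counts are not comparable across codes.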

Example 1 (contemporary climate):

Code    Number predicted   Number of forests   Quotient
1       20                 25                  0.80
2       3                  34                  0.09
3       8                  33                  0.24
4       2                  29                  0.07
Other   6                  100                 0.06

Example 2 (future climate 1):

Code    Number predicted   Number of forests   Quotient
1       20 -> 4            25                  0.16
2       3 -> 25            34                  0.74
3       8 -> 4             33                  0.12
4       2 -> 3             29                  0.10
Other   6 -> 20            100                 0.20

Example 3 (future climate 2):

Code    Number predicted   Number of forests   Quotient
1       20 -> 8            25                  0.32
2       3 -> 4             34                  0.12
3       8 -> 4             33                  0.12
4       2 -> 3             29                  0.10
Other   6 -> 40            100                 0.40
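The quotients in the three examples can be checked with a short script. The counts are copied from the tables; only the five rows shown are used, so the sketch assumes that the codes not listed have smaller quotients. The helper name `winner` is hypothetical.

```python
# Number of forests containing each code, from the example tables.
forests = {"1": 25, "2": 34, "3": 33, "4": 29, "Other": 100}

# Number of forests predicting each code, per example.
examples = {
    "contemporary": {"1": 20, "2": 3, "3": 8, "4": 2, "Other": 6},
    "future 1": {"1": 4, "2": 25, "3": 4, "4": 3, "Other": 20},
    "future 2": {"1": 8, "2": 4, "3": 4, "4": 3, "Other": 40},
}

def winner(counts):
    """Return the code with the largest quotient, and that quotient."""
    quotients = {code: n / forests[code] for code, n in counts.items()}
    best = max(quotients, key=quotients.get)
    return best, quotients[best]

results = {name: winner(counts) for name, counts in examples.items()}
for name, (best, q) in results.items():
    note = " (read as extrapolation)" if best == "Other" else ""
    print(f"{name}: code {best}, quotient {q:.2f}{note}")
```

Under that assumption, code 1 wins for the contemporary climate, code 2 wins for future climate 1, and “other” wins for future climate 2.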

Results
• We interpret predictions of “other” to indicate extrapolation.
• For this work, extrapolation indicates there is no biotic community in our study area that corresponds to the (new) climate.

• It is not a perfect indication of extrapolation.

Results
• Application to Brown’s biotic communities
  – All of North America
  – Prediction of community as a function of climatic metrics
  – Mapped at 0.0083333 arc degrees (~1 km²)
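As a rough check on the "~1 km²" figure, the grid spacing can be converted to kilometres. This sketch assumes a spherical Earth with a mean radius of 6371 km; the function name `cell_size_km` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius
CELL_DEG = 0.0083333      # grid spacing from the slide (about 30 arc-seconds)

def cell_size_km(lat_deg):
    """Approximate north-south and east-west cell size at a latitude."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    north_south = CELL_DEG * km_per_degree
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west

for lat in (0, 45):
    ns, ew = cell_size_km(lat)
    print(f"lat {lat}: {ns:.2f} km x {ew:.2f} km = {ns * ew:.2f} km^2")
```

The cell is a little under 1 km on a side at the equator and narrows east to west at higher latitudes, so "~1 km²" is a nominal value.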

No analog: contemporary

No analog: 2030

No analog: 2090

Map panels: Canadian, Princeton, Hadley

Discussion / Conclusion
• The method can be used on larger problems and perhaps with CART-based methods other than Random Forests.

• One could add samples that are actually other, that is, not any of those of interest.

• Random Forests remains a very important tool in our tool set.