Big classification problems
How to solve a classification problem with 45 class levels using Random Forests
Nicholas L. Crookston
Gerald E. Rehfeldt
US Forest Service, Rocky Mountain Research Station, Moscow, ID
Western Mensurationists
Missoula, MT
June 20-22, 2010

Contents
• Problem (we have 45 class levels, that’s a lot)
• Solution (we broke the problem into many subsets and formed an ensemble classifier)
• Results (very good, and we have a measure of extrapolation)
• Discussion

Problem
• We desire to predict the biotic community as a function of climate.
• There are 45 biotic communities of interest:
  Brown, D.E., F. Reichenbacher, S.E. Franson. 1998. A classification of North American biotic communities. University of Utah Press, Salt Lake City. 141 pp.

Problem
• In a 2006 effort on a subset of these communities, we had great results using:
  Breiman, Leo. 2001. Random Forests. Machine Learning 45:5-32.
• These results were published in:
  Rehfeldt, G.E., N.L. Crookston, M.V. Warwell and J.S. Evans. 2006. Empirical analyses of plant-climate relationships for the western United States. Int. J. Plant Sci. 167, 1123-1150.

Random Forests
• A Random Forest (RF) is a set of classification or regression trees (CART).
• RF builds many trees, each of which minimizes the classification error on a bootstrap sample of the training data.
• Package randomForest supports up to 32 class levels, but when there are over 10, it uses a sampling scheme for each tree.
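
The bootstrap sampling mentioned above can be sketched as follows. This is only an illustration of drawing a sample with replacement, not the internals of package randomForest; the function name `bootstrap_sample` and the toy rows are assumptions introduced here.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (a bootstrap sample)."""
    return rng.choices(rows, k=len(rows))

rng = random.Random(1)          # seeded so the sketch is repeatable
training_rows = list(range(20)) # toy stand-in for training observations

# One bootstrap sample per tree; here 5 hypothetical trees.
samples = [bootstrap_sample(training_rows, rng) for _ in range(5)]
```

Each sample has the same size as the training data, but some rows appear more than once and others not at all.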

Random Forests -- continued
• To classify a new observation:
  – RF puts the new observation down each of the trees in the forest.
  – Each tree gives a classification; that classification is a vote.
  – The forest chooses the class having the most votes over all the trees.
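
The voting rule above can be sketched in a few lines of Python. This is a minimal illustration, not the randomForest implementation: the "trees" are hypothetical stand-in functions, and the names `forest_vote`, `temp`, and `precip` are assumptions introduced here.

```python
from collections import Counter

def forest_vote(trees, observation):
    """Put the observation down every tree and return (winning class, vote table)."""
    votes = Counter(tree(observation) for tree in trees)
    # most_common(1) returns the class with the most votes;
    # ties go to the class that was counted first.
    winner, _ = votes.most_common(1)[0]
    return winner, votes

# Hypothetical stand-in "trees": each maps an observation to a class label.
trees = [
    lambda x: "A" if x["temp"] > 10 else "B",
    lambda x: "A" if x["precip"] < 500 else "B",
    lambda x: "B",
]

winner, votes = forest_vote(trees, {"temp": 12, "precip": 300})
```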

Problem -- continued
• We have 45 class levels, over the limit of 32 in package randomForest!
• We desire to make predictions using future climates.
• RF might predict nonsense answers for future climatic conditions that are unique with respect to the training data.
• These are extrapolations we need to detect.

Solution -- Steps
1. Training data: ~1.6 million obs, 35 climate variables from the Moscow climate model.
2. We created 100 Random Forests.
3. To create 1 of the forests:
   a. Sample 9 of 45 class levels (without replacement).
   b. Make a copy of the training data.
   c. Recode the biotic community in this copy; keep as is if code is one of the 9 in the sample, otherwise change the observed class to “other”.
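
The sampling and recoding in steps a to c can be sketched as below. This is a toy illustration under stated assumptions: the community codes are taken to be the integers 1 to 45, the label list is random stand-in data, and `recode_for_one_forest` is a name introduced here. Fitting a forest to each recoded copy is not shown.

```python
import random

def recode_for_one_forest(labels, all_codes, n_keep=9, rng=random):
    """Sample n_keep class codes without replacement; recode all others to 'other'.

    Returns the kept codes and a recoded copy of the labels (the input is not modified).
    """
    kept = set(rng.sample(sorted(all_codes), n_keep))
    recoded = [code if code in kept else "other" for code in labels]
    return kept, recoded

all_codes = range(1, 46)   # assumption: 45 community codes numbered 1..45
rng = random.Random(2010)  # seeded so the sketch is repeatable
# Toy stand-in for the observed biotic community of each training observation.
labels = [rng.choice(list(all_codes)) for _ in range(1000)]

# One (kept codes, recoded labels) pair per forest, 100 forests as in the steps above.
subsets = [recode_for_one_forest(labels, all_codes, rng=rng) for _ in range(100)]
```

Each recoded copy has at most 10 class levels (9 sampled codes plus “other”), which keeps every individual forest under the class-level limit.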

Steps -- continued
4. Fit each of the 100 RFs.
5. To make a prediction:
   a. Put the new case down all 100 RFs, providing a vector of 100 predictions for the case.
   b. Count the number of predictions by biotic community code, including “other”. This gives a table of codes and counts that has 46 rows (one for each community code plus “other”).

Steps -- continued
   c. Divide the counts for each code by the number of RFs that contained the code.
   d. The ensemble classification is the class value corresponding to the maximum of these quotients.
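
The counting, dividing, and maximizing in the prediction steps can be sketched as follows. This is a minimal sketch with toy numbers (10 forests instead of 100); the function name `ensemble_classify` and the input layout are assumptions introduced here. Every forest contains “other”, so its divisor is the total number of forests.

```python
from collections import Counter

def ensemble_classify(predictions, forests_containing):
    """Return (ensemble class, quotients) for one case.

    predictions: one predicted code per forest.
    forests_containing: number of forests whose sampled codes include each code.
    """
    counts = Counter(predictions)                      # count predictions by code
    quotients = {code: n / forests_containing[code]    # divide by forests containing the code
                 for code, n in counts.items()}
    best = max(quotients, key=quotients.get)           # class with the maximum quotient
    return best, quotients

# Toy case with 10 hypothetical forests.
predictions = [1, 1, 1, "other", "other", "other", "other", 2, "other", "other"]
forests_containing = {1: 3, 2: 2, "other": 10}

best, quotients = ensemble_classify(predictions, forests_containing)
```

In this toy case “other” has the most raw predictions (6 of 10), but code 1 was predicted by all 3 forests that could predict it, so its quotient (1.0) beats that of “other” (0.6).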

Example 1 (contemporary climate):
Code    Number Predicted    Number Forests    Quotient
1       20                  25                0.80
2       3                   34                0.09
3       8                   33                0.24
4       2                   29                0.07
Other   6                   100               0.06

Example 2 (future climate 1):
Code    Number Predicted    Number Forests    Quotient
1       20 -> 4             25                0.16
2       3 -> 25             34                0.74
3       8 -> 4              33                0.12
4       2 -> 3              29                0.10
Other   6 -> 20             100               0.20

Example 3 (future climate 2):
Code    Number Predicted    Number Forests    Quotient
1       20 -> 8             25                0.32
2       3 -> 4              34                0.12
3       8 -> 4              33                0.12
4       2 -> 3              29                0.10
Other   6 -> 40             100               0.40

Results
• We interpret predictions of “other” to indicate extrapolation.
• For this work, extrapolation indicates there is no biotic community in our study area that corresponds to the (new) climate.
• It is not a perfect indication of extrapolation.
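
As a worked check of this interpretation, the counts from Examples 1 and 3 can be run through the quotient rule. The sketch below uses only the five rows shown in those tables (the full tables have 46 rows); the function names are assumptions introduced here.

```python
def quotients(counts, n_forests):
    """Divide each code's prediction count by the number of forests containing it."""
    return {code: counts[code] / n_forests[code] for code in counts}

def is_extrapolation(counts, n_forests):
    """Flag the case as an extrapolation when 'other' has the largest quotient."""
    q = quotients(counts, n_forests)
    return max(q, key=q.get) == "other"

# Rows shown in the example tables: code -> number of forests containing the code.
n_forests = {1: 25, 2: 34, 3: 33, 4: 29, "other": 100}
example_1 = {1: 20, 2: 3, 3: 8, 4: 2, "other": 6}   # contemporary climate
example_3 = {1: 8, 2: 4, 3: 4, 4: 3, "other": 40}   # future climate 2

flag_1 = is_extrapolation(example_1, n_forests)  # False: code 1 leads with 20/25 = 0.80
flag_3 = is_extrapolation(example_3, n_forests)  # True: "other" leads with 40/100 = 0.40
```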

Results
• Application to Brown’s biotic communities
  – All of North America
  – Prediction of community as a function of climatic metrics
  – Mapped at 0.0083333 arc degrees (~1 km²)

Figure: No analog: contemporary
Figure: No analog: 2030
Figure: No analog: 2090
Figure panel labels: Canadian, Princeton, Hadley

Discussion / Conclusion
• The method can be used on larger problems and perhaps with CART-based methods other than Random Forests.
• One could add samples that are actually “other”, that is, not any of those of interest.
• Random Forests remains a very important tool in our tool set.