Decision Jungles
Tobias Pohlen
March 8, 2015
Outline
- Literature
- Introduction
- Training
- Implementation Details
- Experiments and Results
Literature
- Jamie Shotton, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, and Antonio Criminisi. Decision Jungles: Compact and rich models for classification. Advances in Neural Information Processing Systems 26, pages 234-242. Curran Associates, Inc., 2013.
- Jamie Shotton, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, and Antonio Criminisi. Decision Jungles: Compact and rich models for classification. Supplemental material. 2013.
- Piotr Dollár. Piotr's Image and Video Matlab Toolbox (PMT). http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html
The Classification Problem
Objective
- Solve the multiclass classification problem

Definition (Multiclass classification problem)
Given
- a training set X = {(x_1, y_1), ..., (x_N, y_N)} ⊂ R^n × {1, ..., C}
- training examples x_i ∈ R^n
- class labels y_i ∈ {1, ..., C}
Problem
- Assign a previously unseen data point x to one of the classes 1, ..., C
Binary Decision Trees
Definition (Binary decision tree)
A binary decision tree is a binary tree G = (V, E) with the following properties:
An internal node v is augmented with
- feature dimension d_v ∈ {1, ..., n}
- threshold θ_v ∈ R
A leaf node v is augmented with
- a class label c_v
- or a class histogram h_v : {1, ..., C} → R

[Figure: example decision tree — root x_1 ≤ 3; left child x_2 ≤ 2 with leaves 1 and 3; right child x_2 ≤ 3.5 with leaves 2 and 3.]
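To make the definition concrete, the following is a minimal C++ sketch of such an augmented node. The struct layout and field names are my own illustration, not taken from the slides or from any particular library:

#include <vector>

// Hypothetical node of a binary decision tree for C classes. An internal
// node carries a split (featureDim, threshold); a leaf carries a class
// histogram from which a class label can be derived.
struct TreeNode {
    int featureDim = -1;            // d_v in {0, ..., n-1}; -1 marks a leaf
    double threshold = 0.0;         // theta_v
    int left = -1, right = -1;      // indices of the child nodes
    std::vector<double> histogram;  // h_v: one entry per class (leaves only)

    bool isLeaf() const { return featureDim < 0; }
};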
Classifying Data Points
Definition (Classifier semantics)
A data point x ∈ R^n is assigned to a class by passing it along the tree according to the splits defined by d_v and θ_v.

Example
Classify x = (x_1, x_2)^T = (2, 4)^T

[Figure: the example tree from above — root x_1 ≤ 3; left child x_2 ≤ 2 with leaves 1 and 3; right child x_2 ≤ 3.5 with leaves 2 and 3. The point x satisfies x_1 ≤ 3 and violates x_2 ≤ 2, so it reaches the leaf labeled 3.]
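Classification is then a simple walk from the root to a leaf. The following sketch continues the hypothetical TreeNode above, with nodes stored in a vector and the root at index 0, and returns the class with the largest histogram entry at the leaf that x reaches:

#include <algorithm>
#include <vector>

// Walk x from the root to a leaf and return the majority class of that leaf.
int classify(const std::vector<TreeNode>& nodes, const std::vector<double>& x) {
    int v = 0;
    while (!nodes[v].isLeaf()) {
        // Follow the split: left if x_{d_v} <= theta_v, right otherwise.
        v = (x[nodes[v].featureDim] <= nodes[v].threshold) ? nodes[v].left
                                                           : nodes[v].right;
    }
    const std::vector<double>& h = nodes[v].histogram;
    return static_cast<int>(std::max_element(h.begin(), h.end()) - h.begin());
}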
Random Decision Trees
Let E be some objective function.

Deterministic decision trees
At each node v, determine d_v and θ_v such that
E(d_v, θ_v) = min_{d ∈ {1,...,n}, θ ∈ R} E(d, θ)

Random decision trees
At each node v, determine d_v and θ_v such that
E(d_v, θ_v) = min_{d ∈ F, θ ∈ R} E(d, θ)
where
- F ⊆ {1, ..., n} is a random selection of features
Random Forests
Definition
A random forest F = (G_1, ..., G_m) is an ensemble of random decision trees G_i.

Classification
A data point x ∈ R^n is assigned to the class that receives the most votes among the trees.
Random Forests: Discussion
- Initially proposed by Breiman in 2001 [1]
- High classification accuracy by learning uncorrelated trees
- Fast training due to random feature selection
- Fast evaluation
Random Forests: Problem
- High memory consumption: O(2^d)
- Memory consumption grows exponentially with the depth d of the trees
- Especially a problem in memory-constrained scenarios, e.g.
  - embedded systems
  - mobile devices
Decision DAGs: Concept
Idea: Instead of a tree graph, use a directed acyclic graph (DAG).

[Figure: a depth-3 decision tree — root x_1 ≤ 3; second level x_2 ≤ 2 and x_2 ≤ 3.5; third level x_1 ≤ 1, x_2 ≤ 3 (twice) and x_1 ≤ 5; leaves labeled 1, 2, and 3 — next to the equivalent decision DAG, in which nodes of the same level share child nodes, so the third level has only the three splits x_1 ≤ 1, x_2 ≤ 3, x_1 ≤ 5 and the leaves 1, 2, 3 appear only once.]
Decision DAGs: Memory Consumption
Control the memory consumption by limiting the width of the DAG via a merging schedule
s : N → N, d ↦ s(d)
If s(d) ≤ S for all d, then the memory consumption is O(dS), where d is the depth.

A typical choice is
s : N → N, d ↦ s(d) = min(2^d, 2^D)
where D ∈ N is a constant.

Example (D = 7)
s : N → N, d ↦ min(2^d, 128)

[Figure: the example decision DAG from above, whose width is bounded per level.]
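Since min(2^d, 2^D) = 2^min(d, D), the schedule is a one-liner. This is my own sketch, not code from the slides:

#include <algorithm>
#include <cstddef>

// Hypothetical merging schedule s(d) = min(2^d, 2^D): the level width grows
// like that of a full binary tree until it is capped at 2^D nodes.
std::size_t mergingSchedule(unsigned d, unsigned D = 7) {
    return std::size_t{1} << std::min(d, D);
}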
Decision DAGs: Parameters
Definition
A decision DAG is a directed acyclic graph G = (V, E) with the following properties:
An internal node v is augmented with
- feature dimension d_v ∈ {1, ..., n}
- threshold θ_v ∈ R
- left child node l_v ∈ V
- right child node r_v ∈ V

[Figure: the example decision DAG from above.]
Decision Jungles
Definition
A random decision DAG is a decision DAG whose parameters are sampled from some probability distribution.

Definition
A decision jungle J = (G_1, ..., G_m) is an ensemble of random decision DAGs G_i.

Decision jungles were proposed by J. Shotton et al. at NIPS 2013 [2].
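To make the DAG semantics concrete, here is a hedged sketch of how a decision jungle could classify a point: a DAG node is like a tree node but stores explicit child indices (several parents may share a child), and the jungle sums the leaf histograms of its DAGs. Types and names are my own illustration; LibJungle's actual interfaces may differ.

#include <algorithm>
#include <vector>

// Hypothetical decision-DAG node; children are explicit indices because
// several parents may point at the same child node.
struct DagNode {
    int featureDim = -1;            // -1 marks a leaf
    double threshold = 0.0;
    int left = -1, right = -1;      // indices into the node array
    std::vector<double> histogram;  // class histogram (leaves only)
};

using Dag = std::vector<DagNode>;   // root at index 0

// Sum the per-class leaf histograms over all DAGs and return the argmax class.
int classifyJungle(const std::vector<Dag>& jungle,
                   const std::vector<double>& x, int numClasses) {
    std::vector<double> votes(numClasses, 0.0);
    for (const Dag& dag : jungle) {
        int v = 0;
        while (dag[v].featureDim >= 0)  // descend to a leaf
            v = (x[dag[v].featureDim] <= dag[v].threshold) ? dag[v].left
                                                           : dag[v].right;
        for (int c = 0; c < numClasses; ++c)
            votes[c] += dag[v].histogram[c];
    }
    return static_cast<int>(
        std::max_element(votes.begin(), votes.end()) - votes.begin());
}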
Decision DAGs: Training
Binary decision trees
At each node v optimize
- the feature d_v
- the threshold θ_v

Decision DAGs
At each node v optimize
- the feature d_v
- the threshold θ_v
- the left child node l_v
- the right child node r_v

Conclusion
The graph structure and the thresholds/features need to be optimized simultaneously.
Decision DAGs: Training
Technically, this graph is also a decision DAG. Shotton et al. assumed a level-wise graph structure for optimization.

[Figure: the example decision DAG from above, arranged in levels.]
Decision DAGs: Training
The DAG is trained level-wise. Let s be a merging schedule (e.g. s(d) = min(2^d, 128)).

1: G ← ({root}, ∅)
2: for d = 1, 2, ... do
3:     Add s(d) new nodes to G
4:     Initialize the parameters of the former leaf nodes
5:     Optimize the parameters of the former leaf nodes
6: end for
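A C++ skeleton of this loop might look as follows. It reuses the hypothetical Dag and mergingSchedule sketches from above; initializeLevel and optimizeLevel are placeholder names for lines 4 and 5 (the latter being the LSEARCH routine discussed below):

#include <cstddef>
#include <utility>
#include <vector>

// A labeled training point (hypothetical helper type).
struct Sample { std::vector<double> x; int y; };

// Placeholders for lines 4 and 5 of the algorithm: wire the former leaves to
// the new child nodes and pick initial splits, then run the optimization.
void initializeLevel(Dag&, const std::vector<int>&, const std::vector<int>&) {}
void optimizeLevel(Dag&, const std::vector<int>&, const std::vector<int>&,
                   const std::vector<Sample>&) {}

Dag trainDag(const std::vector<Sample>& data, unsigned maxDepth, unsigned D = 7) {
    Dag dag(1);                      // line 1: G <- ({root}, {})
    std::vector<int> level = {0};    // node indices of the current level
    for (unsigned d = 1; d <= maxDepth; ++d) {
        std::vector<int> next;       // line 3: add s(d) new nodes to G
        for (std::size_t i = 0; i < mergingSchedule(d, D); ++i) {
            next.push_back(static_cast<int>(dag.size()));
            dag.emplace_back();
        }
        initializeLevel(dag, level, next);      // line 4
        optimizeLevel(dag, level, next, data);  // line 5
        level = std::move(next);
    }
    return dag;
}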
Decision DAGs: Training
[Figure: animation of level-wise training — each level is first added with undetermined splits (x_? ≤ ?) and then optimized; the final DAG has root x_2 ≤ 4, second level x_2 ≤ 1 and x_1 ≤ 6, third level x_1 ≤ −1, x_2 ≤ 0 and x_1 ≤ 1, and leaves labeled 1, 2, 1.]
Level Optimization
Naming convention: p_1, ..., p_k are the parent nodes (the current level) and c_1, ..., c_l the child nodes (the next level). Each parent p_i has
- feature dimension d_{p_i}
- threshold θ_{p_i}
- left child node l_{p_i}
- right child node r_{p_i}
S_{p_i} and S_{c_j} are the training sets at nodes p_i and c_j respectively.
Objective Function I
Goal: Find the optimal parameters for the parent nodes in terms of an objective function E.

Definition
Let X ⊂ R^n × {1, ..., C} be a training set. The entropy H(X) is defined as
H(X) = − ∑_{i=1}^{C} p(i) log_2 p(i)
where
p(i) = |{(x, y) ∈ X : y = i}| / |X|
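A minimal sketch of the entropy computation, assuming class labels in {0, ..., C−1}:

#include <cmath>
#include <vector>

// Shannon entropy H(X) = -sum_i p(i) log2 p(i) of a multiset of class labels.
// Empty classes contribute nothing to the sum.
double entropy(const std::vector<int>& labels, int numClasses) {
    std::vector<double> count(numClasses, 0.0);
    for (int y : labels) count[y] += 1.0;
    double h = 0.0;
    for (double c : count)
        if (c > 0.0) {
            double p = c / static_cast<double>(labels.size());
            h -= p * std::log2(p);
        }
    return h;
}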
Objective Function II
The objective function E is defined in terms of the entropies at the child nodes:
E(Θ_1, ..., Θ_k) = ∑_{i=1}^{l} |S_{c_i}| H(S_{c_i})
where
- Θ_i = (d_{p_i}, θ_{p_i}, l_{p_i}, r_{p_i}) are the parameters of p_i
Objective Function III
The connection between Θ_1, ..., Θ_k and S_{c_1}, ..., S_{c_l} becomes apparent when looking at the definition of S_{c_i}:
S_{c_i} = ⋃_{j=1,...,k : l_{p_j} = c_i} {(x, y) ∈ S_{p_j} : x_{d_{p_j}} ≤ θ_{p_j}} ∪ ⋃_{j=1,...,k : r_{p_j} = c_i} {(x, y) ∈ S_{p_j} : x_{d_{p_j}} > θ_{p_j}}
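Putting the last two formulas together, E can be evaluated by routing every parent's training samples to the children and summing the weighted child entropies. This sketch is my own, building on the hypothetical types and the entropy helper above; it takes the sample indices of each S_{p_j} as input:

#include <cstddef>
#include <vector>

// Evaluate E(Theta_1, ..., Theta_k) = sum_i |S_{c_i}| H(S_{c_i}) by routing
// each parent's samples through its split to its left or right child.
double objective(const Dag& dag, const std::vector<int>& parents,
                 const std::vector<int>& children,
                 const std::vector<std::vector<int>>& parentSampleIdx,
                 const std::vector<Sample>& data, int numClasses) {
    // Labels of S_{c_i} for every child node.
    std::vector<std::vector<int>> childLabels(children.size());
    std::vector<int> childSlot(dag.size(), -1);
    for (std::size_t i = 0; i < children.size(); ++i)
        childSlot[children[i]] = static_cast<int>(i);

    for (std::size_t j = 0; j < parents.size(); ++j) {
        const DagNode& p = dag[parents[j]];
        for (int idx : parentSampleIdx[j]) {
            int child = (data[idx].x[p.featureDim] <= p.threshold) ? p.left
                                                                   : p.right;
            childLabels[childSlot[child]].push_back(data[idx].y);
        }
    }
    double e = 0.0;
    for (const std::vector<int>& labels : childLabels)
        if (!labels.empty())
            e += static_cast<double>(labels.size()) * entropy(labels, numClasses);
    return e;
}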
LSEARCH Optimization Algorithm
1: function LSEARCH(Θ_{p_1}, ..., Θ_{p_k})
2:     while something changes do
3:         for i = 1, ..., k do
4:             F ← random feature selection
5:             (d_{p_i}, θ_{p_i}) ← argmin_{d ∈ F, θ ∈ R} E(..., Θ_{p_{i−1}}, (d, θ, l_{p_i}, r_{p_i}), Θ_{p_{i+1}}, ...)
6:         end for
7:         for i = 1, ..., k do
8:             l_{p_i} ← argmin_{l = c_1,...,c_l} E(..., Θ_{p_{i−1}}, (d_{p_i}, θ_{p_i}, l, r_{p_i}), Θ_{p_{i+1}}, ...)
9:             r_{p_i} ← argmin_{r = c_1,...,c_l} E(..., Θ_{p_{i−1}}, (d_{p_i}, θ_{p_i}, l_{p_i}, r), Θ_{p_{i+1}}, ...)
10:        end for
11:    end while
12:    return Θ_{p_1}, ..., Θ_{p_k}
13: end function
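The part that is new compared to tree training is the second for-loop, the reassignment of child pointers; the split-optimization half works like an ordinary tree split search restricted to a random feature subset (see the threshold optimization below). Here is a hedged sketch of one reassignment pass; note that a change is kept only if it lowers E, which is exactly the property the termination proof below relies on:

#include <vector>

// One pass of the child-reassignment loop (lines 7-10): for each parent, try
// every child as its left and then as its right successor, keeping a
// reassignment only if it lowers the objective. Returns whether anything changed.
bool reassignChildren(Dag& dag, const std::vector<int>& parents,
                      const std::vector<int>& children,
                      const std::vector<std::vector<int>>& parentSampleIdx,
                      const std::vector<Sample>& data, int numClasses) {
    bool changed = false;
    for (int pi : parents) {
        for (int* edge : {&dag[pi].left, &dag[pi].right}) {
            int bestChild = *edge;
            double bestE = objective(dag, parents, children, parentSampleIdx,
                                     data, numClasses);
            for (int c : children) {
                *edge = c;  // tentatively reroute this edge to child c
                double e = objective(dag, parents, children, parentSampleIdx,
                                     data, numClasses);
                if (e < bestE) { bestE = e; bestChild = c; changed = true; }
            }
            *edge = bestChild;
        }
    }
    return changed;
}

The outer while-loop of LSEARCH would alternate such passes with the per-parent split searches until neither changes any parameter.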
Intermediate Discussion
This is where the technical section of the paper ends.

Open questions
- Does the algorithm converge to a local minimum?
- Does the algorithm terminate in a finite number of steps?
- How can the two minimization steps be implemented efficiently?

Main issue
- There is no code available

In the following, I present the findings of my research.
Termination Theorem
Theorem
The LSEARCH algorithm terminates.
Termination Theorem Proof I
Proof
- E takes on a finite number of discrete values
- There are only finitely many combinations of d_v, l_v and r_v
- There are infinitely many choices for θ_v
- We can quotient R by the following equivalence relation:
x ∼ y :⇔ ∀λ ∈ [0,1] : E(d, x, l, r) = E(d, λx + (1−λ)y, l, r)
- R/∼ is finite
- Hence the joint parameter space is finite
Termination Theorem Proof II
Proof
- If the algorithm did not terminate, it would cycle through some configurations
- Let γ_1, ..., γ_r be those configurations of Θ_1, ..., Θ_k
- γ_{i+1} = LSEARCH(γ_i) and γ_1 = LSEARCH(γ_r)
- Observation: parameters only change when the objective function decreases
- Hence
E(γ_1) > E(γ_2) > ... > E(γ_{r−1}) > E(γ_r)
Termination Theorem Proof III
Proof
- But because of the cycling, it must also hold that
E(γ_r) > E(γ_1)
- Therefore
E(γ_1) > E(γ_1)
- This is a contradiction. Hence, the algorithm terminates.
Optimality Theorem
From the termination proof, the following theorem follows immediately.

Theorem
The LSEARCH optimization algorithm converges to a local minimum of the objective function in a finite number of iterations.

Proof
Termination theorem + only parameters which decrease the objective function are accepted.
Implementation
- Efficiently implementing the algorithm is not trivial
- Evaluating the objective function is expensive:
E(Θ_1, ..., Θ_k) = ∑_{i=1}^{l} |S_{c_i}| H(S_{c_i})
  - First the S_{c_i} have to be determined
  - Then the entropies have to be calculated
- Exploit the problem structure in order to find an efficient implementation
Threshold Optimization I
1: (d_{p_i}, θ_{p_i}) ← argmin_{d ∈ F, θ ∈ R} E(..., Θ_{p_{i−1}}, (d, θ, l_{p_i}, r_{p_i}), Θ_{p_{i+1}}, ...)

First we note that only S_{l_{p_i}} and S_{r_{p_i}} can change.

Corollary
It holds that
argmin_{d ∈ F, θ ∈ R} E(..., Θ_{p_{i−1}}, (d, θ, l_{p_i}, r_{p_i}), Θ_{p_{i+1}}, ...)
= argmin_{d ∈ F, θ ∈ R} ∑_{j=1}^{l} |S_{c_j}| H(S_{c_j})
= argmin_{d ∈ F, θ ∈ R} |S_{l_{p_i}}| H(S_{l_{p_i}}) + |S_{r_{p_i}}| H(S_{r_{p_i}})
Threshold Optimization II
[Figure: the parent nodes p_1, ..., p_k, with the edge from p_i to its right child r_{p_i} highlighted.]

Observation
- There is only a constant contribution from the other parents
- Only the contribution from p_i varies

Idea
- Precompute the contribution from the other parent nodes in histograms
Threshold Optimization III
Testing multiple thresholds for a fixed feature dimension efficiently:
- Sort the training set according to the feature dimension
- Successively test thresholds between neighboring points

[Figure: data points on the x_d axis with the candidate threshold θ sweeping between neighboring points.]

- At each iteration, only a single data point moves from the right to the left child node
- This technique is due to Piotr Dollár [3]
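Here is a hedged sketch of this sweep with incrementally updated class histograms, so every candidate threshold is scored in O(C) instead of re-partitioning the data. Variable names are my own, and the constant histogram contributions of the other parents from the previous slide are omitted for brevity:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Entropy of a class histogram holding `total` samples.
static double histEntropy(const std::vector<double>& h, double total) {
    double e = 0.0;
    for (double c : h)
        if (c > 0.0) e -= (c / total) * std::log2(c / total);
    return e;
}

// Sweep all thresholds for one fixed feature dimension d: sort by x_d, then
// move one point at a time from the right histogram to the left one and score
// |S_l| H(S_l) + |S_r| H(S_r). Returns the best (threshold, objective) pair.
std::pair<double, double> sweepThresholds(std::vector<Sample> samples, int d,
                                          int numClasses) {
    std::sort(samples.begin(), samples.end(),
              [d](const Sample& a, const Sample& b) { return a.x[d] < b.x[d]; });
    std::vector<double> left(numClasses, 0.0), right(numClasses, 0.0);
    for (const Sample& s : samples) right[s.y] += 1.0;

    double bestE = std::numeric_limits<double>::infinity();
    double bestTheta = 0.0;
    for (std::size_t i = 0; i + 1 < samples.size(); ++i) {
        left[samples[i].y] += 1.0;   // one data point moves right -> left
        right[samples[i].y] -= 1.0;
        if (samples[i].x[d] == samples[i + 1].x[d]) continue;  // no split between equal values
        double nl = static_cast<double>(i + 1);
        double nr = static_cast<double>(samples.size() - i - 1);
        double e = nl * histEntropy(left, nl) + nr * histEntropy(right, nr);
        if (e < bestE) {
            bestE = e;
            bestTheta = 0.5 * (samples[i].x[d] + samples[i + 1].x[d]);
        }
    }
    return {bestTheta, bestE};
}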
Threshold Optimization: Discussion
In summary
- Precompute the contributions from the other parents
- Successively test different thresholds
- These steps allow us to evaluate the objective function at each iteration in constant time

Notes
- The steps are proven to be correct
- The derivations are rather technical
- See the seminar paper for formal details
Experiments from the Paper
Kinect Body Dataset [5]
- Estimate a human pose from a single depth image
- 31 classes

[Figure: example depth images. Image by Shotton et al. [2]]
Results: Test Accuracy
[Figure: test accuracy results from the paper. Image by Shotton et al. [2]]
Results: Feature Evaluations
[Figure: feature evaluation results from the paper. Image by Shotton et al. [2]]
Interpretation
Conclusions
- Decision DAGs trained using the LSEARCH algorithm...
  - consume less memory than binary decision trees
  - perform significantly better than trees of the same size (i.e. the same number of nodes)
- The proposed DAG structure works better than trees of fixed width
  - Fixed-width tree: at each level, only split the M nodes that have the highest entropy

Questions
- How do decision jungles perform compared to random forests (disregarding model size)?
- Training time?
- Absolute test accuracy?
- Evaluation time?
My Experiments
- Decision jungle results are obtained using my LibJungle C++ library [4]
  - Efficient multi-threaded implementation of decision jungles
- Baseline results are obtained using Piotr Dollár's MATLAB Toolbox [3]
  - Very efficient and well-tested implementation of random forests
  - Fair comparison: crucial parts are implemented in C
Evaluation Data: MNIST Data Set
- Handwritten digits 0-9 (10 classes)
  - Grayscale images
  - 28×28 pixels
- 60,000 training images
- 10,000 test images
- Available at http://yann.lecun.com/exdb/mnist/
Experiment 1: Iteration Limit
Algorithm
1: function LSEARCH(Θ_{p_1}, ..., Θ_{p_k})
2:     while something changes do
3:         ...
4:     end while
5:     return Θ_{p_1}, ..., Θ_{p_k}
6: end function

Experiment
- We set an iteration limit on the outer while-loop of the LSEARCH optimization algorithm
- Evaluate the performance of a single DAG vs. a single tree
Results: Test Accuracy
[Plot: test accuracy (0.85 to 1) vs. maximum number of iterations (0 to 60) for a single DAG and a single tree.]
Results: Depth
[Plot: depth (20 to 60) vs. maximum number of iterations (0 to 60) for a single DAG and a single tree.]
Results: Training Time
[Plot: training time in seconds (0 to 400) vs. maximum number of iterations (0 to 60) for a single DAG and a single tree.]
Results: Convergence Speed
[Plot: training error vs. number of levels trained (0 to 50) for iteration limits I = 5, I = 15, I = 55 and a single tree.]
Experiment 1: Interpretation
Pros
- DAGs outperform trees by a large margin
- DAGs consume considerably less memory

Cons
- Evaluation time for DAGs is twice the time for trees
- Training DAGs takes significantly longer than training trees
Experiment 2: Ensembles
Question
- How do decision jungles perform compared to random forests?

Experiment
- Train up to 30 DAGs/trees
- Evaluate the performance of the ensemble each time after adding a DAG/tree
- Perform the experiment for different depth limits (10, 15, 45)
- Perform the experiment with and without bagging
Results: Without Bagging
[Plot: test accuracy (0.8 to 1) vs. ensemble size (5 to 30) for depth limits L = 10, L = 15, L = 45, compared against a random forest.]
Results: With Bagging
[Plot: test accuracy (0.8 to 1) vs. ensemble size (5 to 30) for depth limits L = 10, L = 15, L = 45, compared against a random forest.]
Experiment 3
Algorithm
1: G ← ({root}, ∅)
2: for d = 1, 2, ... do
3:     Add s(d) new nodes to G
4:     Initialize the parameters of the former leaf nodes
5:     Optimize the parameters of the former leaf nodes
6: end for

Two possibilities
- Initialize the parameters randomly
- Initialize l_{p_i} and r_{p_i} such that parent nodes with high entropy do not have common child nodes

Goal
- Speed up convergence
Results: Convergence Speed
[Plot: training error (0 to 0.3) vs. levels trained (10 to 50) for random and deterministic initialization.]
Results: Test Accuracy
[Plot: test accuracy (0.7 to 1) vs. levels trained (10 to 30) for random and deterministic initialization.]
Experiment 4: Various Data Sets
We compare the test accuracy of decision jungles and random forests.

Data set             Size            Features   Attributes   #DAGs
MNIST                60,000/10,000   784        numerical    8/15
USPS                 3,823/1,797     64         numerical    8/15
CONNECT 4            67,557/-        42 (126)   categorical  8/15
LETTER RECOGNITION   20,000/-        16         numerical    8/15
SHUTTLE              43,500/14,500   9          numerical    8/15

Data sets are from the UCI Machine Learning Repository [6].
Experiment 4: Results I
Decision jungles (8 DAGs) vs. random forests (8 trees):

Data set             Jungle mean  Jungle stdev.  Forest mean  Forest stdev.
MNIST                95.72%       0.13%          95.14%       0.20%
USPS                 94.65%       0.50%          94.44%       0.30%
CONNECT 4            81.17%       0.22%          80.99%       0.46%
LETTER RECOGNITION   94.73%       0.57%          94.29%       0.43%
SHUTTLE              99.98%       0.01%          99.99%       0.00%

DAGs are trained without bagging.
Experiment 4: Results II
Decision jungles (15 DAGs) vs. random forests (15 trees):

Data set             Jungle mean  Jungle stdev.  Forest mean  Forest stdev.
MNIST                96.38%       0.09%          96.23%       0.16%
USPS                 95.95%       0.20%          95.93%       0.52%
CONNECT 4            81.98%       0.15%          81.47%       0.66%
LETTER RECOGNITION   95.73%       0.55%          95.58%       0.48%
SHUTTLE              99.99%       0.01%          99.99%       0.01%

DAGs are trained without bagging.
LibJungle C++ Library
- C++ implementation of decision jungles
- Implements all speed-ups discussed in the seminar paper
- Can be used as a static library
- Open source license (BSD)
- Available at https://bitbucket.org/geekStack/libjungle
Summary
- Goal: find a memory-efficient alternative to random forests
- Idea: use DAGs and limit their width
- Train ensembles of random decision DAGs (called decision jungles)
- Train a DAG level-wise by minimizing an objective function
- Efficiently implement the optimization using histograms
- Decision jungles perform as well as random forests
  - Sometimes even better
  - Evaluation is twice as expensive
  - Training takes significantly longer
Questions are welcome.
Seminar paper available at geekstack.net/paper
Thanks for your attention!
Further Reading I
[1] Leo Breiman. Random Forests. Machine Learning 45, 2001.
[2] Jamie Shotton, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, and Antonio Criminisi. Decision Jungles: Compact and Rich Models for Classification. Advances in Neural Information Processing Systems 26, 2013.
[3] Piotr Dollár. Piotr's Image and Video Matlab Toolbox (PMT). http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html
Further Reading II
[4] Tobias Pohlen. LibJungle - Decision Jungle Library. https://bitbucket.org/geekStack/libjungle
[5] Jamie Shotton, Ross Girshick, Andrew Fitzgibbon, Toby Sharp, Mat Cook, Mark Finocchio, Richard Moore, Pushmeet Kohli, Antonio Criminisi, Alex Kipman, and Andrew Blake. Efficient Human Pose Estimation from Single Depth Images. IEEE Trans. Pattern Anal. Mach. Intell., 35, pages 2821-2840, 2013.
[6] K. Bache and M. Lichman. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml