Decision Jungles
Tobias Pohlen
March 8, 2015
Outline
- Literature
- Introduction
- Training
- Implementation Details
- Experiments and Results
Literature
- Jamie Shotton, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, and Antonio Criminisi. Decision Jungles: Compact and rich models for classification. Advances in Neural Information Processing Systems 26, pages 234-242. Curran Associates, Inc., 2013.
- Jamie Shotton, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, and Antonio Criminisi. Decision Jungles: Compact and rich models for classification. Supplemental material. 2013.
- Piotr Dollár. Piotr's Image and Video Matlab Toolbox (PMT). http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html
The Classification Problem
Objective
- Solve the multiclass classification problem

Definition (Multiclass classification problem)
Given
- a training set X = {(x_1, y_1), ..., (x_N, y_N)} ⊂ R^n × {1, ..., C}
- training examples x_i ∈ R^n
- class labels y_i ∈ {1, ..., C}
Problem
- Assign a previously unseen data point x to one of the classes 1, ..., C
Binary Decision Trees
Definition (Binary decision tree)
A binary decision tree is a binary tree G = (V, E) with the following properties:
An internal node v is augmented with
- feature dimension d_v ∈ {1, ..., n}
- threshold θ_v ∈ R
A leaf node v is augmented with
- a class label c_v
- or a class histogram h_v : {1, ..., C} → R

[Figure: example decision tree — root x_1 ≤ 3; left child x_2 ≤ 2 with leaves 1 and 3; right child x_2 ≤ 3.5 with leaves 2 and 3.]
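To make the definition concrete, the following is a minimal C++ sketch of such an augmented node. The struct layout and field names are my own illustration, not taken from the slides or from any particular library:

#include <vector>

// Hypothetical node of a binary decision tree for C classes. An internal
// node carries a split (featureDim, threshold); a leaf carries a class
// histogram from which a class label can be derived.
struct TreeNode {
    int featureDim = -1;            // d_v in {0, ..., n-1}; -1 marks a leaf
    double threshold = 0.0;         // theta_v
    int left = -1, right = -1;      // indices of the child nodes
    std::vector<double> histogram;  // h_v: one entry per class (leaves only)

    bool isLeaf() const { return featureDim < 0; }
};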
Classifying Data Points
Definition (Classifier semantics)
A data point x ∈ R^n is assigned to a class by passing it along the tree according to the splits defined by d_v and θ_v.

Example
Classify x = (x_1, x_2)^T = (2, 4)^T

[Figure: the example tree from above — root x_1 ≤ 3; left child x_2 ≤ 2 with leaves 1 and 3; right child x_2 ≤ 3.5 with leaves 2 and 3. The point x satisfies x_1 ≤ 3 and violates x_2 ≤ 2, so it reaches the leaf labeled 3.]
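Classification is then a simple walk from the root to a leaf. The following sketch continues the hypothetical TreeNode above, with nodes stored in a vector and the root at index 0, and returns the class with the largest histogram entry at the leaf that x reaches:

#include <algorithm>
#include <vector>

// Walk x from the root to a leaf and return the majority class of that leaf.
int classify(const std::vector<TreeNode>& nodes, const std::vector<double>& x) {
    int v = 0;
    while (!nodes[v].isLeaf()) {
        // Follow the split: left if x_{d_v} <= theta_v, right otherwise.
        v = (x[nodes[v].featureDim] <= nodes[v].threshold) ? nodes[v].left
                                                           : nodes[v].right;
    }
    const std::vector<double>& h = nodes[v].histogram;
    return static_cast<int>(std::max_element(h.begin(), h.end()) - h.begin());
}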
Random Decision Trees
Let E be some objective function.

Deterministic decision trees
At each node v, determine d_v and θ_v such that
E(d_v, θ_v) = min_{d ∈ {1,...,n}, θ ∈ R} E(d, θ)

Random decision trees
At each node v, determine d_v and θ_v such that
E(d_v, θ_v) = min_{d ∈ F, θ ∈ R} E(d, θ)
where
- F ⊆ {1, ..., n} is a random selection of features
Random Forests
Definition
A random forest F = (G_1, ..., G_m) is an ensemble of random decision trees G_i.

Classification
A data point x ∈ R^n is assigned to the class that receives the most votes among the trees.
Random Forests: Discussion
- Initially proposed by Breiman in 2001 [1]
- High classification accuracy by learning uncorrelated trees
- Fast training due to random feature selection
- Fast evaluation
Random Forests: Problem
- High memory consumption: O(2^d)
- Memory consumption grows exponentially with the depth d of the trees
- Especially a problem in memory-constrained scenarios, e.g.
  - embedded systems
  - mobile devices
Decision DAGs: Concept
Idea: Instead of a tree graph, use a directed acyclic graph (DAG).

[Figure: a depth-3 decision tree — root x_1 ≤ 3; second level x_2 ≤ 2 and x_2 ≤ 3.5; third level x_1 ≤ 1, x_2 ≤ 3 (twice) and x_1 ≤ 5; leaves labeled 1, 2, and 3 — next to the equivalent decision DAG, in which nodes of the same level share child nodes, so the third level has only the three splits x_1 ≤ 1, x_2 ≤ 3, x_1 ≤ 5 and the leaves 1, 2, 3 appear only once.]
Decision DAGs: Memory Consumption
Control the memory consumption by limiting the width of the DAG via a merging schedule
s : N → N, d ↦ s(d)
If s(d) ≤ S for all d, then the memory consumption is O(dS), where d is the depth.

A typical choice is
s : N → N, d ↦ s(d) = min(2^d, 2^D)
where D ∈ N is a constant.

Example (D = 7)
s : N → N, d ↦ min(2^d, 128)

[Figure: the example decision DAG from above, whose width is bounded per level.]
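Since min(2^d, 2^D) = 2^min(d, D), the schedule is a one-liner. This is my own sketch, not code from the slides:

#include <algorithm>
#include <cstddef>

// Hypothetical merging schedule s(d) = min(2^d, 2^D): the level width grows
// like that of a full binary tree until it is capped at 2^D nodes.
std::size_t mergingSchedule(unsigned d, unsigned D = 7) {
    return std::size_t{1} << std::min(d, D);
}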
Decision DAGs: Parameters
Definition
A decision DAG is a directed acyclic graph G = (V, E) with the following properties:
An internal node v is augmented with
- feature dimension d_v ∈ {1, ..., n}
- threshold θ_v ∈ R
- left child node l_v ∈ V
- right child node r_v ∈ V

[Figure: the example decision DAG from above.]
Decision Jungles
Definition
A random decision DAG is a decision DAG whose parameters are sampled from some probability distribution.

Definition
A decision jungle J = (G_1, ..., G_m) is an ensemble of random decision DAGs G_i.

Decision jungles were proposed by J. Shotton et al. at NIPS 2013 [2].
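To make the DAG semantics concrete, here is a hedged sketch of how a decision jungle could classify a point: a DAG node is like a tree node but stores explicit child indices (several parents may share a child), and the jungle sums the leaf histograms of its DAGs. Types and names are my own illustration; LibJungle's actual interfaces may differ.

#include <algorithm>
#include <vector>

// Hypothetical decision-DAG node; children are explicit indices because
// several parents may point at the same child node.
struct DagNode {
    int featureDim = -1;            // -1 marks a leaf
    double threshold = 0.0;
    int left = -1, right = -1;      // indices into the node array
    std::vector<double> histogram;  // class histogram (leaves only)
};

using Dag = std::vector<DagNode>;   // root at index 0

// Sum the per-class leaf histograms over all DAGs and return the argmax class.
int classifyJungle(const std::vector<Dag>& jungle,
                   const std::vector<double>& x, int numClasses) {
    std::vector<double> votes(numClasses, 0.0);
    for (const Dag& dag : jungle) {
        int v = 0;
        while (dag[v].featureDim >= 0)  // descend to a leaf
            v = (x[dag[v].featureDim] <= dag[v].threshold) ? dag[v].left
                                                           : dag[v].right;
        for (int c = 0; c < numClasses; ++c)
            votes[c] += dag[v].histogram[c];
    }
    return static_cast<int>(
        std::max_element(votes.begin(), votes.end()) - votes.begin());
}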
Decision DAGs: Training
Binary decision trees
At each node v optimize
- the feature d_v
- the threshold θ_v

Decision DAGs
At each node v optimize
- the feature d_v
- the threshold θ_v
- the left child node l_v
- the right child node r_v

Conclusion
The graph structure and the thresholds/features need to be optimized simultaneously.
Decision DAGs: Training
Technically, this graph is also a decision DAG. Shotton et al. assumed a level-wise graph structure for optimization.

[Figure: the example decision DAG from above, arranged in levels.]
Decision DAGs: Training
The DAG is trained level-wise. Let s be a merging schedule (e.g. s(d) = min(2^d, 128)).

1: G ← ({root}, ∅)
2: for d = 1, 2, ... do
3:     Add s(d) new nodes to G
4:     Initialize the parameters of the former leaf nodes
5:     Optimize the parameters of the former leaf nodes
6: end for
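A C++ skeleton of this loop might look as follows. It reuses the hypothetical Dag and mergingSchedule sketches from above; initializeLevel and optimizeLevel are placeholder names for lines 4 and 5 (the latter being the LSEARCH routine discussed below):

#include <cstddef>
#include <utility>
#include <vector>

// A labeled training point (hypothetical helper type).
struct Sample { std::vector<double> x; int y; };

// Placeholders for lines 4 and 5 of the algorithm: wire the former leaves to
// the new child nodes and pick initial splits, then run the optimization.
void initializeLevel(Dag&, const std::vector<int>&, const std::vector<int>&) {}
void optimizeLevel(Dag&, const std::vector<int>&, const std::vector<int>&,
                   const std::vector<Sample>&) {}

Dag trainDag(const std::vector<Sample>& data, unsigned maxDepth, unsigned D = 7) {
    Dag dag(1);                      // line 1: G <- ({root}, {})
    std::vector<int> level = {0};    // node indices of the current level
    for (unsigned d = 1; d <= maxDepth; ++d) {
        std::vector<int> next;       // line 3: add s(d) new nodes to G
        for (std::size_t i = 0; i < mergingSchedule(d, D); ++i) {
            next.push_back(static_cast<int>(dag.size()));
            dag.emplace_back();
        }
        initializeLevel(dag, level, next);      // line 4
        optimizeLevel(dag, level, next, data);  // line 5
        level = std::move(next);
    }
    return dag;
}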
Decision DAGs: Training
[Figure: animation of level-wise training — each level is first added with undetermined splits (x_? ≤ ?) and then optimized; the final DAG has root x_2 ≤ 4, second level x_2 ≤ 1 and x_1 ≤ 6, third level x_1 ≤ −1, x_2 ≤ 0 and x_1 ≤ 1, and leaves labeled 1, 2, 1.]
Level Optimization
Naming convention: p_1, ..., p_k are the parent nodes (the current level) and c_1, ..., c_l the child nodes (the next level). Each parent p_i has
- feature dimension d_{p_i}
- threshold θ_{p_i}
- left child node l_{p_i}
- right child node r_{p_i}
S_{p_i} and S_{c_j} are the training sets at nodes p_i and c_j respectively.
Objective Function I
Goal: Find the optimal parameters for the parent nodes in terms of an objective function E.

Definition
Let X ⊂ R^n × {1, ..., C} be a training set. The entropy H(X) is defined as
H(X) = − ∑_{i=1}^{C} p(i) log_2 p(i)
where
p(i) = |{(x, y) ∈ X : y = i}| / |X|
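A minimal sketch of the entropy computation, assuming class labels in {0, ..., C−1}:

#include <cmath>
#include <vector>

// Shannon entropy H(X) = -sum_i p(i) log2 p(i) of a multiset of class labels.
// Empty classes contribute nothing to the sum.
double entropy(const std::vector<int>& labels, int numClasses) {
    std::vector<double> count(numClasses, 0.0);
    for (int y : labels) count[y] += 1.0;
    double h = 0.0;
    for (double c : count)
        if (c > 0.0) {
            double p = c / static_cast<double>(labels.size());
            h -= p * std::log2(p);
        }
    return h;
}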
Objective Function II
The objective function E is defined in terms of the entropies at the child nodes:
E(Θ_1, ..., Θ_k) = ∑_{i=1}^{l} |S_{c_i}| H(S_{c_i})
where
- Θ_i = (d_{p_i}, θ_{p_i}, l_{p_i}, r_{p_i}) are the parameters of p_i
Objective Function III
The connection between Θ_1, ..., Θ_k and S_{c_1}, ..., S_{c_l} becomes apparent when looking at the definition of S_{c_i}:
S_{c_i} = ⋃_{j=1,...,k : l_{p_j} = c_i} {(x, y) ∈ S_{p_j} : x_{d_{p_j}} ≤ θ_{p_j}} ∪ ⋃_{j=1,...,k : r_{p_j} = c_i} {(x, y) ∈ S_{p_j} : x_{d_{p_j}} > θ_{p_j}}
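Putting the last two formulas together, E can be evaluated by routing every parent's training samples to the children and summing the weighted child entropies. This sketch is my own, building on the hypothetical types and the entropy helper above; it takes the sample indices of each S_{p_j} as input:

#include <cstddef>
#include <vector>

// Evaluate E(Theta_1, ..., Theta_k) = sum_i |S_{c_i}| H(S_{c_i}) by routing
// each parent's samples through its split to its left or right child.
double objective(const Dag& dag, const std::vector<int>& parents,
                 const std::vector<int>& children,
                 const std::vector<std::vector<int>>& parentSampleIdx,
                 const std::vector<Sample>& data, int numClasses) {
    // Labels of S_{c_i} for every child node.
    std::vector<std::vector<int>> childLabels(children.size());
    std::vector<int> childSlot(dag.size(), -1);
    for (std::size_t i = 0; i < children.size(); ++i)
        childSlot[children[i]] = static_cast<int>(i);

    for (std::size_t j = 0; j < parents.size(); ++j) {
        const DagNode& p = dag[parents[j]];
        for (int idx : parentSampleIdx[j]) {
            int child = (data[idx].x[p.featureDim] <= p.threshold) ? p.left
                                                                   : p.right;
            childLabels[childSlot[child]].push_back(data[idx].y);
        }
    }
    double e = 0.0;
    for (const std::vector<int>& labels : childLabels)
        if (!labels.empty())
            e += static_cast<double>(labels.size()) * entropy(labels, numClasses);
    return e;
}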
LSEARCH Optimization Algorithm
1: function LSEARCH(Θ_{p_1}, ..., Θ_{p_k})
2:     while something changes do
3:         for i = 1, ..., k do
4:             F ← random feature selection
5:             (d_{p_i}, θ_{p_i}) ← argmin_{d ∈ F, θ ∈ R} E(..., Θ_{p_{i−1}}, (d, θ, l_{p_i}, r_{p_i}), Θ_{p_{i+1}}, ...)
6:         end for
7:         for i = 1, ..., k do
8:             l_{p_i} ← argmin_{l = c_1,...,c_l} E(..., Θ_{p_{i−1}}, (d_{p_i}, θ_{p_i}, l, r_{p_i}), Θ_{p_{i+1}}, ...)
9:             r_{p_i} ← argmin_{r = c_1,...,c_l} E(..., Θ_{p_{i−1}}, (d_{p_i}, θ_{p_i}, l_{p_i}, r), Θ_{p_{i+1}}, ...)
10:        end for
11:    end while
12:    return Θ_{p_1}, ..., Θ_{p_k}
13: end function
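The part that is new compared to tree training is the second for-loop, the reassignment of child pointers; the split-optimization half works like an ordinary tree split search restricted to a random feature subset (see the threshold optimization below). Here is a hedged sketch of one reassignment pass; note that a change is kept only if it lowers E, which is exactly the property the termination proof below relies on:

#include <vector>

// One pass of the child-reassignment loop (lines 7-10): for each parent, try
// every child as its left and then as its right successor, keeping a
// reassignment only if it lowers the objective. Returns whether anything changed.
bool reassignChildren(Dag& dag, const std::vector<int>& parents,
                      const std::vector<int>& children,
                      const std::vector<std::vector<int>>& parentSampleIdx,
                      const std::vector<Sample>& data, int numClasses) {
    bool changed = false;
    for (int pi : parents) {
        for (int* edge : {&dag[pi].left, &dag[pi].right}) {
            int bestChild = *edge;
            double bestE = objective(dag, parents, children, parentSampleIdx,
                                     data, numClasses);
            for (int c : children) {
                *edge = c;  // tentatively reroute this edge to child c
                double e = objective(dag, parents, children, parentSampleIdx,
                                     data, numClasses);
                if (e < bestE) { bestE = e; bestChild = c; changed = true; }
            }
            *edge = bestChild;
        }
    }
    return changed;
}

The outer while-loop of LSEARCH would alternate such passes with the per-parent split searches until neither changes any parameter.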
Intermediate Discussion
This is where the technical section of the paper ends.

Open questions
- Does the algorithm converge to a local minimum?
- Does the algorithm terminate in a finite number of steps?
- How can the two minimization steps be implemented efficiently?

Main issue
- There is no code available

In the following, I present the findings of my research.
Termination Theorem
Theorem
The LSEARCH algorithm terminates.
Termination Theorem Proof I
Proof
- E takes on a finite number of discrete values
- There are only finitely many combinations of d_v, l_v and r_v
- There are infinitely many choices for θ_v
- We can quotient R by the following equivalence relation:
x ∼ y :⇔ ∀λ ∈ [0,1] : E(d, x, l, r) = E(d, λx + (1−λ)y, l, r)
- R/∼ is finite
- Hence the joint parameter space is finite
Termination Theorem Proof II
Proof
- If the algorithm did not terminate, it would cycle through some configurations
- Let γ_1, ..., γ_r be those configurations of Θ_1, ..., Θ_k
- γ_{i+1} = LSEARCH(γ_i) and γ_1 = LSEARCH(γ_r)
- Observation: parameters only change when the objective function decreases
- Hence
E(γ_1) > E(γ_2) > ... > E(γ_{r−1}) > E(γ_r)
Termination Theorem Proof III
Proof
- But because of the cycling, it must also hold that
E(γ_r) > E(γ_1)
- Therefore
E(γ_1) > E(γ_1)
- This is a contradiction. Hence, the algorithm terminates.
Optimality Theorem
From the termination proof, the following theorem follows immediately.

Theorem
The LSEARCH optimization algorithm converges to a local minimum of the objective function in a finite number of iterations.

Proof
Termination theorem + only parameters which decrease the objective function are accepted.
Implementation
- Efficiently implementing the algorithm is not trivial
- Evaluating the objective function is expensive:
E(Θ_1, ..., Θ_k) = ∑_{i=1}^{l} |S_{c_i}| H(S_{c_i})
  - First the S_{c_i} have to be determined
  - Then the entropies have to be calculated
- Exploit the problem structure in order to find an efficient implementation
Threshold Optimization I
1: (d_{p_i}, θ_{p_i}) ← argmin_{d ∈ F, θ ∈ R} E(..., Θ_{p_{i−1}}, (d, θ, l_{p_i}, r_{p_i}), Θ_{p_{i+1}}, ...)

First we note that only S_{l_{p_i}} and S_{r_{p_i}} can change.

Corollary
It holds that
argmin_{d ∈ F, θ ∈ R} E(..., Θ_{p_{i−1}}, (d, θ, l_{p_i}, r_{p_i}), Θ_{p_{i+1}}, ...)
= argmin_{d ∈ F, θ ∈ R} ∑_{j=1}^{l} |S_{c_j}| H(S_{c_j})
= argmin_{d ∈ F, θ ∈ R} |S_{l_{p_i}}| H(S_{l_{p_i}}) + |S_{r_{p_i}}| H(S_{r_{p_i}})
Threshold Optimization II
[Figure: the parent nodes p_1, ..., p_k, with the edge from p_i to its right child r_{p_i} highlighted.]

Observation
- There is only a constant contribution from the other parents
- Only the contribution from p_i varies

Idea
- Precompute the contribution from the other parent nodes in histograms
Threshold Optimization III
Testing multiple thresholds for a fixed feature dimension efficiently:
- Sort the training set according to the feature dimension
- Successively test thresholds between neighboring points

[Figure: data points on the x_d axis with the candidate threshold θ sweeping between neighboring points.]

- At each iteration, only a single data point moves from the right to the left child node
- This technique is due to Piotr Dollár [3]
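Here is a hedged sketch of this sweep with incrementally updated class histograms, so every candidate threshold is scored in O(C) instead of re-partitioning the data. Variable names are my own, and the constant histogram contributions of the other parents from the previous slide are omitted for brevity:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// Entropy of a class histogram holding `total` samples.
static double histEntropy(const std::vector<double>& h, double total) {
    double e = 0.0;
    for (double c : h)
        if (c > 0.0) e -= (c / total) * std::log2(c / total);
    return e;
}

// Sweep all thresholds for one fixed feature dimension d: sort by x_d, then
// move one point at a time from the right histogram to the left one and score
// |S_l| H(S_l) + |S_r| H(S_r). Returns the best (threshold, objective) pair.
std::pair<double, double> sweepThresholds(std::vector<Sample> samples, int d,
                                          int numClasses) {
    std::sort(samples.begin(), samples.end(),
              [d](const Sample& a, const Sample& b) { return a.x[d] < b.x[d]; });
    std::vector<double> left(numClasses, 0.0), right(numClasses, 0.0);
    for (const Sample& s : samples) right[s.y] += 1.0;

    double bestE = std::numeric_limits<double>::infinity();
    double bestTheta = 0.0;
    for (std::size_t i = 0; i + 1 < samples.size(); ++i) {
        left[samples[i].y] += 1.0;   // one data point moves right -> left
        right[samples[i].y] -= 1.0;
        if (samples[i].x[d] == samples[i + 1].x[d]) continue;  // no split between equal values
        double nl = static_cast<double>(i + 1);
        double nr = static_cast<double>(samples.size() - i - 1);
        double e = nl * histEntropy(left, nl) + nr * histEntropy(right, nr);
        if (e < bestE) {
            bestE = e;
            bestTheta = 0.5 * (samples[i].x[d] + samples[i + 1].x[d]);
        }
    }
    return {bestTheta, bestE};
}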
Threshold Optimization: Discussion
In summary
- Precompute the contributions from the other parents
- Successively test different thresholds
- These steps allow us to evaluate the objective function at each iteration in constant time

Notes
- The steps are proven to be correct
- The derivations are rather technical
- See the seminar paper for formal details
Experiments from the Paper
Kinect Body Dataset [5]
- Estimate a human pose from a single depth image
- 31 classes

[Figure: example depth images. Image by Shotton et al. [2]]
Results: Test Accuracy
[Figure: test accuracy results from the paper. Image by Shotton et al. [2]]
Results: Feature Evaluations
[Figure: feature evaluation results from the paper. Image by Shotton et al. [2]]
Interpretation
Conclusions
- Decision DAGs trained using the LSEARCH algorithm...
  - consume less memory than binary decision trees
  - perform significantly better than trees of the same size (i.e. the same number of nodes)
- The proposed DAG structure works better than trees of fixed width
  - Fixed-width tree: at each level, only split the M nodes that have the highest entropy

Questions
- How do decision jungles perform compared to random forests (disregarding model size)?
- Training time?
- Absolute test accuracy?
- Evaluation time?
My Experiments
- Decision jungle results are obtained using my LibJungle C++ library [4]
  - Efficient multi-threaded implementation of decision jungles
- Baseline results are obtained using Piotr Dollár's MATLAB Toolbox [3]
  - Very efficient and well-tested implementation of random forests
  - Fair comparison: crucial parts are implemented in C
Evaluation Data: MNIST Data Set
- Handwritten digits 0-9 (10 classes)
  - Grayscale images
  - 28×28 pixels
- 60,000 training images
- 10,000 test images
- Available at http://yann.lecun.com/exdb/mnist/
Experiment 1: Iteration Limit
Algorithm
1: function LSEARCH(Θ_{p_1}, ..., Θ_{p_k})
2:     while something changes do
3:         ...
4:     end while
5:     return Θ_{p_1}, ..., Θ_{p_k}
6: end function

Experiment
- We set an iteration limit on the outer while-loop of the LSEARCH optimization algorithm
- Evaluate the performance of a single DAG vs. a single tree
Results: Test Accuracy
[Plot: test accuracy (0.85 to 1) vs. maximum number of iterations (0 to 60) for a single DAG and a single tree.]
Results: Depth
[Plot: depth (20 to 60) vs. maximum number of iterations (0 to 60) for a single DAG and a single tree.]
Results: Training Time
[Plot: training time in seconds (0 to 400) vs. maximum number of iterations (0 to 60) for a single DAG and a single tree.]
Results: Convergence Speed
[Plot: training error vs. number of levels trained (0 to 50) for iteration limits I = 5, I = 15, I = 55 and a single tree.]
Experiment 1: Interpretation
Pros
- DAGs outperform trees by a large margin
- DAGs consume considerably less memory

Cons
- Evaluation time for DAGs is twice the time for trees
- Training DAGs takes significantly longer than training trees
Experiment 2: Ensembles
Question
- How do decision jungles perform compared to random forests?

Experiment
- Train up to 30 DAGs/trees
- Evaluate the performance of the ensemble each time after adding a DAG/tree
- Perform the experiment for different depth limits (10, 15, 45)
- Perform the experiment with and without bagging
Results: Without Bagging
[Plot: test accuracy (0.8 to 1) vs. ensemble size (5 to 30) for depth limits L = 10, L = 15, L = 45, compared against a random forest.]
Results: With Bagging
[Plot: test accuracy (0.8 to 1) vs. ensemble size (5 to 30) for depth limits L = 10, L = 15, L = 45, compared against a random forest.]
Experiment 3
Algorithm
1: G ← ({root}, ∅)
2: for d = 1, 2, ... do
3:     Add s(d) new nodes to G
4:     Initialize the parameters of the former leaf nodes
5:     Optimize the parameters of the former leaf nodes
6: end for

Two possibilities
- Initialize the parameters randomly
- Initialize l_{p_i} and r_{p_i} such that parent nodes with high entropy do not have common child nodes

Goal
- Speed up convergence
Results: Convergence Speed
[Plot: training error (0 to 0.3) vs. levels trained (10 to 50) for random and deterministic initialization.]
Results: Test Accuracy
[Plot: test accuracy (0.7 to 1) vs. levels trained (10 to 30) for random and deterministic initialization.]
Experiment 4: Various Data Sets
We compare the test accuracy of decision jungles and random forests.

Data set             Size            Features   Attributes   #DAGs
MNIST                60,000/10,000   784        numerical    8/15
USPS                 3,823/1,797     64         numerical    8/15
CONNECT 4            67,557/-        42 (126)   categorical  8/15
LETTER RECOGNITION   20,000/-        16         numerical    8/15
SHUTTLE              43,500/14,500   9          numerical    8/15

Data sets are from the UCI Machine Learning Repository [6].
Experiment 4: Results I
Decision jungles (8 DAGs) vs. random forests (8 trees):

Data set             Jungle mean  Jungle stdev.  Forest mean  Forest stdev.
MNIST                95.72%       0.13%          95.14%       0.20%
USPS                 94.65%       0.50%          94.44%       0.30%
CONNECT 4            81.17%       0.22%          80.99%       0.46%
LETTER RECOGNITION   94.73%       0.57%          94.29%       0.43%
SHUTTLE              99.98%       0.01%          99.99%       0.00%

DAGs are trained without bagging.
Experiment 4: Results II
Decision jungles (15 DAGs) vs. random forests (15 trees):

Data set             Jungle mean  Jungle stdev.  Forest mean  Forest stdev.
MNIST                96.38%       0.09%          96.23%       0.16%
USPS                 95.95%       0.20%          95.93%       0.52%
CONNECT 4            81.98%       0.15%          81.47%       0.66%
LETTER RECOGNITION   95.73%       0.55%          95.58%       0.48%
SHUTTLE              99.99%       0.01%          99.99%       0.01%

DAGs are trained without bagging.
LibJungle C++ Library
- C++ implementation of decision jungles
- Implements all speed-ups discussed in the seminar paper
- Can be used as a static library
- Open source license (BSD)
- Available at https://bitbucket.org/geekStack/libjungle
Summary
- Goal: find a memory-efficient alternative to random forests
- Idea: use DAGs and limit their width
- Train ensembles of random decision DAGs (called decision jungles)
- Train a DAG level-wise by minimizing an objective function
- Efficiently implement the optimization using histograms
- Decision jungles perform as well as random forests
  - Sometimes even better
  - Evaluation is twice as expensive
  - Training takes significantly longer
Questions are welcome.
Seminar paper available at geekstack.net/paper
Thanks for your attention!
Further Reading I
[1] Leo Breiman. Random Forests. Machine Learning 45, 2001.
[2] Jamie Shotton, Toby Sharp, Pushmeet Kohli, Sebastian Nowozin, John Winn, and Antonio Criminisi. Decision Jungles: Compact and Rich Models for Classification. Advances in Neural Information Processing Systems 26, 2013.
[3] Piotr Dollár. Piotr's Image and Video Matlab Toolbox (PMT). http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html
Further Reading II
[4] Tobias Pohlen. LibJungle - Decision Jungle Library. https://bitbucket.org/geekStack/libjungle
[5] Jamie Shotton, Ross Girshick, Andrew Fitzgibbon, Toby Sharp, Mat Cook, Mark Finocchio, Richard Moore, Pushmeet Kohli, Antonio Criminisi, Alex Kipman, and Andrew Blake. Efficient Human Pose Estimation from Single Depth Images. IEEE Trans. Pattern Anal. Mach. Intell., 35, pages 2821-2840, 2013.
[6] K. Bache and M. Lichman. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml