
Stochastic Tree Ensembles for Regularized Supervised Learning

Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn

November 5, 2019

The University of Chicago Booth School of Business


Elements of supervised learning

• x ∈ R^p is a vector of covariates; the outcome y is continuous (regression) or discrete (classification).

• Goal: predict y given x.

• We want to learn the functional form f (·) in

y = f (x)    (1)

• Linear model: y = xβ (or logistic regression)

• Non-parametric models: trees, deep learning


Simulation and Empirical Results


Regression Simulation

We draw 30 covariates x from U(0, 1) and set y = f (x) + ε, where ε ∼ N(0, σ²) and σ² = κ Var(f ); κ is the noise-to-signal ratio.

Name           Function
Linear         xᵗγ;   γj = −2 + 4(j − 1)/(d − 1)
Single index   10√a + sin(5a);   a = Σ_{j=1}^{10} (xj − γj)²;   γj = −1.5 + (j − 1)/3
Trig + poly    5 sin(3x1) + 2x2² + 3x3x4
Max            max(x1, x2, x3)

We compare with neural networks, random forests, and XGBoost with and without cross-validation.
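For concreteness, a small Python sketch of this data-generating process (an illustration, not the authors' benchmarking code; the function and variable names are ours):

```python
import numpy as np

def simulate(n=10_000, d=30, kappa=1.0, target="max", seed=0):
    """Draw x ~ U(0,1)^d and y = f(x) + eps with Var(eps) = kappa * Var(f)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n, d))
    if target == "linear":
        gamma = -2 + 4 * (np.arange(1, d + 1) - 1) / (d - 1)
        f = x @ gamma
    elif target == "single_index":
        gamma = -1.5 + (np.arange(1, 11) - 1) / 3
        a = ((x[:, :10] - gamma) ** 2).sum(axis=1)
        f = 10 * np.sqrt(a) + np.sin(5 * a)
    elif target == "trig_poly":
        f = 5 * np.sin(3 * x[:, 0]) + 2 * x[:, 1] ** 2 + 3 * x[:, 2] * x[:, 3]
    elif target == "max":
        f = np.max(x[:, :3], axis=1)
    else:
        raise ValueError(target)
    sigma2 = kappa * f.var()                  # noise variance scales with Var(f)
    y = f + rng.normal(scale=np.sqrt(sigma2), size=n)
    return x, y, f

x, y, f = simulate(kappa=10.0, target="linear")
```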


Simulation, signal to noise = 1:1, κ = 1 and n = 10K

Function       Method             RMSE    Seconds
Linear         XBART              1.74    20
Linear         XGBoost Tuned      2.63    64
Linear         XGBoost Untuned    3.23    < 1
Linear         Random Forest      3.56    6
Linear         BART               1.50    117
Linear         Neural Network     1.39    26
Trig + Poly    XBART              1.31    17
Trig + Poly    XGBoost Tuned      2.08    61
Trig + Poly    XGBoost Untuned    2.70    < 1
Trig + Poly    Random Forest      3.04    6
Trig + Poly    BART               1.30    115
Trig + Poly    Neural Network     3.96    26
Max            XBART              0.39    16
Max            XGBoost Tuned      0.42    62
Max            XGBoost Untuned    0.79    < 1
Max            Random Forest      0.41    6
Max            BART               0.44    114
Max            Neural Network     0.40    30
Single Index   XBART              2.27    17
Single Index   XGBoost Tuned      2.65    61
Single Index   XGBoost Untuned    3.65    < 1
Single Index   Random Forest      3.45    6
Single Index   BART               2.03    116
Single Index   Neural Network     2.76    28


Simulation, signal to noise = 1:10, κ = 10 and n = 10K

Function       Method             RMSE     Seconds
Linear         XBART              5.07     16
Linear         XGBoost Tuned      8.04     61
Linear         XGBoost Untuned    21.25    < 1
Linear         Random Forest      6.52     6
Linear         BART               6.64     111
Linear         Neural Network     7.39     12
Trig + Poly    XBART              4.94     16
Trig + Poly    XGBoost Tuned      7.16     61
Trig + Poly    XGBoost Untuned    17.97    < 1
Trig + Poly    Random Forest      6.34     7
Trig + Poly    BART               6.15     110
Trig + Poly    Neural Network     8.20     13
Max            XBART              1.94     16
Max            XGBoost Tuned      2.76     60
Max            XGBoost Untuned    7.18     < 1
Max            Random Forest      2.30     6
Max            BART               2.46     111
Max            Neural Network     2.98     15
Single Index   XBART              7.13     16
Single Index   XGBoost Tuned      10.61    61
Single Index   XGBoost Untuned    28.68    < 1
Single Index   Random Forest      8.99     6
Single Index   BART               8.69     111
Single Index   Neural Network     9.43     14


Larger Simulation, κ = 1, signal to noise = 1:1

RMSE, with run time in seconds in parentheses.

              n      XBART          XGB+CV         XGB          NN
Linear        10k    1.74 (20)      2.63 (64)      3.23 (0)     1.39 (26)
              50k    1.04 (180)     1.99 (142)     2.56 (4)     0.66 (28)
              250k   0.67 (1774)    1.50 (1399)    2.00 (55)    0.28 (40)
Max           10k    0.39 (16)      0.42 (62)      0.79 (0)     0.40 (30)
              50k    0.25 (134)     0.29 (140)     0.58 (4)     0.20 (32)
              250k   0.14 (1188)    0.21 (1554)    0.41 (60)    0.16 (44)
Single Index  10k    2.27 (17)      2.65 (61)      3.65 (0)     2.76 (28)
              50k    1.54 (153)     1.61 (141)     2.81 (4)     1.93 (31)
              250k   1.14 (1484)    1.18 (1424)    2.16 (55)    1.67 (41)
Trig + Poly   10k    1.31 (17)      2.08 (61)      2.70 (0)     3.96 (26)
              50k    0.74 (147)     1.29 (141)     1.67 (4)     3.33 (29)
              250k   0.45 (1324)    0.82 (1474)    1.11 (59)    2.56 (41)


Larger Simulation, κ = 10, signal to noise = 1:10

RMSE, with run time in seconds in parentheses.

              n      XBART          XGB+CV         XGB           NN
Linear        10k    5.07 (16)      8.04 (61)      21.25 (0)     7.39 (12)
              50k    3.16 (135)     5.47 (140)     16.17 (4)     3.62 (14)
              250k   2.03 (1228)    3.15 (1473)    11.49 (54)    1.89 (19)
Max           10k    1.94 (16)      2.76 (60)      7.18 (0)      2.98 (15)
              50k    1.22 (133)     1.85 (139)     5.49 (4)      1.63 (16)
              250k   0.75 (1196)    1.05 (1485)    3.85 (54)     0.85 (22)
Single Index  10k    7.13 (16)      10.61 (61)     28.68 (0)     9.43 (14)
              50k    4.51 (133)     6.91 (139)     21.18 (4)     6.42 (16)
              250k   3.06 (1214)    4.10 (1547)    14.82 (54)    4.72 (21)
Trig + Poly   10k    4.94 (16)      7.16 (61)      17.97 (0)     8.20 (13)
              50k    3.01 (132)     4.92 (139)     13.30 (4)     5.53 (14)
              250k   1.87 (1216)    3.17 (1462)    9.37 (49)     4.13 (20)


Motivation

Grow a tree recursively

[Figure, built over several slides: the (x1, x2) feature space is split recursively. Candidate cut points on x1 (0.3, 0.5, 0.7) are shown before the root settles on the split x1 < 0.5 (yes → leaf µ1, no → right child); candidate cut points on x2 are then shown before the right child settles on x2 < 0.7, producing leaves µ2 and µ3.]

The prediction for a new observation is the parameter of the leaf it falls into (µ3 in the illustrated case).

[Figure 9.2 from The Elements of Statistical Learning: the top-right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data; the top-left panel shows a general partition that cannot be obtained by recursive binary splitting; the bottom-left panel shows the tree corresponding to the top-right partition, and the bottom-right panel shows a perspective plot of the prediction surface.]

A tree is essentially a step function: it partitions on one variable at a time and predicts with a constant in each region. Figure from Friedman, Hastie, and Tibshirani (2001).


Sum of trees

[Figure: two axis-aligned partitions of the (x1, x2) space, one with leaf values µ1, µ2, µ3 and one with leaf values θ1, θ2, θ3, are added; the result is a finer partition whose regions take values such as θ1 + µ1, θ1 + µ2, θ2 + µ2, θ2 + µ3, θ3 + µ1, θ3 + µ2 and θ3 + µ3.]

A sum of trees (a tree ensemble, or forest) implies an extra level of smoothing.


Why a tree / forest

• Widely used: more than half of the winning entries in Kaggle data mining competitions use variants of tree ensemble methods.

• Invariant to scaling of the input variables; no need to worry about feature normalization.

• Learns high-order interactions between features.

• Random forests take the average of multiple trees, grown independently.

• Boosting takes the sum of multiple trees, grown sequentially.


Random Forest

[Figure, repeated over several build slides: one-dimensional data (x ∈ [0, 1] on the horizontal axis, y on the vertical axis) with random forest fits overlaid.]

Boosting

[Figure, repeated over several build slides: one-dimensional data (x ∈ [0, 1], y) with boosting fits overlaid as trees are added sequentially.]


Classification and Regression Trees (CART)

• Probably the most popular tree-growing algorithm.

• Grow the tree until it is very large, then prune it back.

• The CART split criterion minimizes the L2 loss

Σ_{i ∈ left} (yi − ȳleft)² + Σ_{i ∈ right} (yi − ȳright)²

• Split points and leaf parameters might conspire to make a bad split point look better than it is.

• The split criterion is optimized; what should we do when two split points have nearly identical criterion values, say 10.00 and 9.99?


Intuition for XBART

• A randomized algorithm: split points are sampled rather than optimized.

• A new split criterion with a probabilistic interpretation.

• Early stopping, rather than pruning after over-growing.

• Split nodes and estimate leaf parameters separately.

• Tree ensemble.


Comparison of tree-based algorithms

                     CART             RF               XGB              XBART                   BART
Leaf parameters      optimized        optimized        optimized        integrated out at       integrated out at
                     with splits      with splits      with splits      split, then sampled     split, then sampled
Criterion            likelihood       likelihood       likelihood       marginal likelihood     marginal likelihood
Aggregation          no               aggregation      no               aggregation             aggregation
                                      of trees                          of forests              of forests
Sequential fitting   no               no               yes              yes                     yes
Iterations           no               no               no               yes                     no
Recursion            yes              yes              yes              yes                     no


XBART regression


Split criterion of Gaussian regression

• Assume a Gaussian likelihood N(µ, σ²) on one leaf node.

• Prior µ ∼ N(0, τ).

The integrated likelihood is

p(Y | τ, σ²) = ∫ N(Y | µ, σ²In) N(µ | 0, τ) dµ = N(0, τJJᵗ + σ²In).

Re-arranging terms, ignoring constants in the density and taking logarithms,

log m(s | τ, σ²) = log( σ² / (σ² + τn) ) + ( τ / ( σ²(σ² + τn) ) ) s²,

where s = Σ y is the sum of all y in the leaf node.
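As a concrete illustration, the criterion can be computed directly from a leaf's sufficient statistics. A minimal Python sketch, following the formula above with its constants dropped:

```python
import numpy as np

def log_marginal(s, n, tau, sigma2):
    """log m(s | tau, sigma^2) up to an additive constant, where
    s = sum of y in the leaf and n = number of observations in it."""
    return np.log(sigma2 / (sigma2 + tau * n)) + tau * s**2 / (sigma2 * (sigma2 + tau * n))
```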


Split criterion of a single tree

• A split point candidate cjk partitions the current node into left and right child nodes.

• The two child nodes have sufficient statistics s^l_jk and s^r_jk.

• Assume the data in the two child nodes are independent.

The joint integrated likelihood of the two child nodes is

l(cjk) := m( s^l_jk | τ, σ² ) m( s^r_jk | τ, σ² ).

This is the split criterion for cjk.


Why integrated-likelihood

• Most tree algorithms optimize the split point and leaf parameters simultaneously.

• A bad split point can look better than it is, in collusion with the leaf parameters, because of random noise in the data.

• We split nodes and estimate leaf parameters separately.


No-split option (Regularization)

Furthermore, the split criterion for the no-split option is defined as

lstop = |C| ( (1 + d)^β / α − 1 ) m( sall | Φ, Ψ )

• d is the depth of the current node.

• sall is the sufficient statistic of all data in the current node.

• α, β are hyper-parameters.


Weight of no-split option

• Each split point candidate has prior weight proportional to 1.

• The no-split option has prior weight proportional to |C| ( (1 + d)^β / α − 1 ).

• There are |C| split candidates in total.

Therefore the implied prior probability of splitting is

split weight / (split weight + no-split weight) = |C| / ( |C| ( (1 + d)^β / α − 1 ) + |C| ) = α (1 + d)^{−β},

which is the same as the prior probability of a split in BART.


Sample one split point (or stop)

Sample one of them according to the probabilities

P(cjk) = l(cjk) / ( Σ_{cjk ∈ C} l(cjk) + lstop ),    P(stop) = lstop / ( Σ_{cjk ∈ C} l(cjk) + lstop ).
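A minimal sketch of this sampling step, assuming the criteria are held on the log scale (as in the formulas above) so they can be exponentiated stably before normalization; the function name is ours.

```python
import numpy as np

def sample_cutpoint(log_crit, log_stop, rng):
    """Draw a cut point index, or None for the no-split option, with
    probability proportional to exp(log criterion)."""
    logw = np.append(log_crit, log_stop)     # last entry = no-split weight
    w = np.exp(logw - logw.max())            # subtract max for stability
    w /= w.sum()
    k = rng.choice(len(w), p=w)
    return None if k == len(w) - 1 else k

rng = np.random.default_rng(1)
choice = sample_cutpoint(np.array([-3.2, -2.9, -4.1]), log_stop=-3.0, rng=rng)
```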


Split criterion of a single tree

Consider a split point candidate: it partitions the space into left and right children.

[Figure, built over three slides: the (x1, x2) space divided by a candidate cut point; the left child contributes m( s^l_jk | Φ, Ψ ) and the right child contributes m( s^r_jk | Φ, Ψ ) to the criterion.]


No-split option

If we stop splitting, all data are left in a single node.

[Figure: the undivided (x1, x2) space, with no-split weight |C| ( (1 + d)^β / α − 1 ) m( s∅ | Φ, Ψ ).]


GrowFromRoot algorithm

In addition to the no-split option, we keep the usual stopping conditions: maximum tree depth, minimum leaf size, etc. The algorithm for growing a single tree (GrowFromRoot) is:

1. Start from the root node.

2. Evaluate the split criterion for all split point candidates and the no-split option, then sample one of them.

   If no-split is drawn or another stopping criterion is reached: update the leaf parameter and return.

   Else: split the current node into left and right children and repeat step 2 for both child nodes.
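A schematic Python version of GrowFromRoot under these rules, reusing `log_marginal` and `sample_cutpoint` from the earlier sketches; the quantile cut-point grid and the leaf draw are simplified inline versions of steps described on later slides, and the bookkeeping is much lighter than in the actual implementation.

```python
import numpy as np

def grow_from_root(x, y, depth, tau, sigma2, alpha, beta, rng,
                   max_depth=10, min_leaf=5):
    """Recursively grow one tree; returns a nested dict representing the tree."""

    def leaf():
        prec = 1.0 / tau + len(y) / sigma2                 # posterior precision of mu
        return {"leaf": rng.normal(y.sum() / sigma2 / prec, np.sqrt(1.0 / prec))}

    def cutpoints(col, m=20):                              # quantile grid (see the
        return np.unique(np.quantile(col, np.linspace(0.05, 0.95, m)))  # cut-point slide)

    n = len(y)
    if depth >= max_depth or n < 2 * min_leaf:
        return leaf()

    cands, log_crit = [], []
    for j in range(x.shape[1]):
        for c in cutpoints(x[:, j]):
            go_left = x[:, j] <= c
            nl = int(go_left.sum())
            if min_leaf <= nl <= n - min_leaf:
                cands.append((j, c))
                log_crit.append(log_marginal(y[go_left].sum(), nl, tau, sigma2)
                                + log_marginal(y[~go_left].sum(), n - nl, tau, sigma2))
    if not cands:
        return leaf()

    # no-split weight |C| ((1+d)^beta / alpha - 1) m(s_all), on the log scale
    log_stop = (np.log(len(cands)) + np.log((1 + depth) ** beta / alpha - 1)
                + log_marginal(y.sum(), n, tau, sigma2))
    k = sample_cutpoint(np.array(log_crit), log_stop, rng)
    if k is None:                                          # no-split drawn: stop early
        return leaf()
    j, c = cands[k]
    go_left = x[:, j] <= c
    return {"var": j, "cut": c,
            "left": grow_from_root(x[go_left], y[go_left], depth + 1,
                                   tau, sigma2, alpha, beta, rng, max_depth, min_leaf),
            "right": grow_from_root(x[~go_left], y[~go_left], depth + 1,
                                    tau, sigma2, alpha, beta, rng, max_depth, min_leaf)}
```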


XBART forest

Next is the algorithm for growing a forest. We fit a sum of trees (regression) or a product of trees (classification).

Suppose we sample I forests with L trees in each forest.

For iter in 1 to I:
    For h in 1 to L:
        GrowFromRoot fits the target r_h^{iter}.
        Update r_{h+1}^{iter}, the target for the next tree to fit.
        Update the other non-tree parameters Ψ.
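A sketch of this outer loop in Python, assuming the `grow_from_root` sketch above and the `sample_sigma2` draw sketched a few slides below; `predict_tree` and the inverse-Gamma hyperparameters a, b are illustrative additions of ours.

```python
import numpy as np

def predict_tree(tree, x):
    """Evaluate a tree returned by grow_from_root at each row of x."""
    if "leaf" in tree:
        return np.full(len(x), tree["leaf"])
    go_left = x[:, tree["var"]] <= tree["cut"]
    out = np.empty(len(x))
    out[go_left] = predict_tree(tree["left"], x[go_left])
    out[~go_left] = predict_tree(tree["right"], x[~go_left])
    return out

def xbart_fit(x, y, num_sweeps=40, num_trees=50, burnin=15, seed=0):
    """Sweep over the forest, re-fitting each tree to the residual of the others."""
    rng = np.random.default_rng(seed)
    tau, sigma2 = y.var() / num_trees, y.var()
    alpha, beta = 0.95, 1.25
    a, b = 3.0, 1.0                                   # illustrative inverse-Gamma prior
    fits = np.tile(y / num_trees, (num_trees, 1))     # each tree starts by fitting Y/L
    kept = []
    for sweep in range(num_sweeps):
        for h in range(num_trees):
            target = y - fits.sum(axis=0) + fits[h]   # partial residual r_h
            tree = grow_from_root(x, target, 0, tau, sigma2, alpha, beta, rng)
            fits[h] = predict_tree(tree, x)
            sigma2 = sample_sigma2(y - fits.sum(axis=0), a, b, rng)  # non-tree parameter
        if sweep >= burnin:
            kept.append(fits.sum(axis=0).copy())      # keep post-burn-in forest fits
    return np.mean(kept, axis=0)                      # in-sample posterior-mean fit
```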


Take regression (additive trees) as an example, with three trees.

• In the first sweep, the first tree fits the target R^1_1 = (1/3)Y, with fitted values R̂^1_1; the second tree fits R^1_2 = (1/3)Y, with fitted values R̂^1_2; and the third tree fits R^1_3 = (1/3)Y, with fitted values R̂^1_3.

• In the next sweep of the forest, the first tree is re-fit to the target R^2_1 = Y − R̂^1_2 − R̂^1_3, giving fitted values R̂^2_1; the second tree is re-fit to R^2_2 = Y − R̂^2_1 − R̂^1_3, giving R̂^2_2; and the third tree is re-fit to R^2_3 = Y − R̂^2_1 − R̂^2_2, giving R̂^2_3.


Update leaf parameter µ and residual variance

We assume a µ_lb ∼ N(0, τ) prior on each leaf and update it by

µ_lb ∼ N( ( Σ y / σ² ) / ( 1/τ + n_lb/σ² ),  1 / ( 1/τ + n_lb/σ² ) ),

where Σ y is the sum of the data in the leaf and n_lb is the number of observations in the leaf.

Assume the usual inverse-Gamma prior on σ² and update it in between trees:

σ² ∼ inverse-Gamma( N + a,  r^{(iter)ᵗ}_h r^{(iter)}_h + b ),

where r^{(iter)}_h is the total residual of all trees.
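The two conditional draws written out directly from these formulas (a sketch; the actual implementation may scale or parameterize them differently).

```python
import numpy as np

def sample_leaf_mean(y_leaf, tau, sigma2, rng):
    """mu_lb | data ~ N( (sum y / sigma^2) / (1/tau + n/sigma^2),
                          1 / (1/tau + n/sigma^2) )."""
    prec = 1.0 / tau + len(y_leaf) / sigma2
    return rng.normal((y_leaf.sum() / sigma2) / prec, np.sqrt(1.0 / prec))

def sample_sigma2(resid, a, b, rng):
    """sigma^2 | data ~ inverse-Gamma(N + a, resid' resid + b):
    draw a Gamma with that shape and rate, then invert it."""
    return 1.0 / rng.gamma(len(resid) + a, 1.0 / (resid @ resid + b))
```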


Default Hyperparameters

We recommend

• L = 50, 100 or 200

• α = 0.95,

• β = 1.25 and

• τ = var(y)/L.

Lower β permits deeper trees (BART’s default is β = 2).

This τ dictates that a priori each tree accounts for 1/L of the observed variance of y.

Our default suggestion is just 40 sweeps through the data,

discarding the first 15 as burn-in.


Final prediction

Given I iterations of the algorithm, keep the final I − I0 forests for prediction, where I0 < I denotes the length of the burn-in period.

Average over the retained draws of the Markov chain:

f̄(X) = ( 1 / (I − I0) ) Σ_{k > I0}^{I} f^{(k)}(X),

where f^{(k)} denotes one sampled forest, i.e. the sum of the trees in that forest.


Variable importance

We also keep track of variable importance. Count how many times each variable is used as a split variable in tree l at iteration k; call this count vector w^{(k)}_l. Update the running counts

w ← w − w^{(k−1)}_l + w^{(k)}_l.

The sampling weights are then redrawn as w̄ ∼ Dirichlet(w), and for the next tree we subsample mtry variables for consideration with probabilities w̄.

w gives a natural measure of variable importance.
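A sketch of this weight bookkeeping, assuming per-tree split-count vectors are available; the names are ours, and w should be kept strictly positive (e.g. initialized to ones).

```python
import numpy as np

def update_split_weights(w, counts_old, counts_new, mtry, rng):
    """Swap tree l's old split counts for its new ones, redraw the sampling
    probabilities, and pick mtry variables for the next tree."""
    w = w - counts_old + counts_new              # running count vector
    probs = rng.dirichlet(w)                     # w_bar ~ Dirichlet(w)
    chosen = rng.choice(len(w), size=mtry, replace=False, p=probs)
    return w, probs, chosen
```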


Generic XBART and Classification


Generic split criterion

It is natural to extend the split criterion to other likelihoods.

• Define a likelihood L(y; µ, Ψ) on one leaf node.

• µ is the leaf parameter; Ψ collects the other model parameters (given and fixed during the tree-growing process).

• Assume a prior π(µ | Φ) on µ.

The integrated likelihood is

m(s | Φ, Ψ) := ∫ L(y; µ, Ψ) π(µ | Φ) dµ,

where s is a sufficient statistic of the data y falling in the current node.


Split criterion of multi-class classification

• Classification with C categories. Suppose each xi is observed ni times in the data.

• The response yij is the number of observations with covariate xi in category j, where 1 ≤ i ≤ n and 1 ≤ j ≤ C.

• So Σ_{j=1}^{C} yij = ni. If xi is continuous, then ni = 1.

• The probability that a response with covariate vector xi belongs to category j is

πj(xi) = f^{(j)}(xi) / Σ_{h=1}^{C} f^{(h)}(xi).


Split criterion of multi-class classification

We assume the logarithm of the regression function is a sum of trees,

log( f^{(j)}(x) ) = Σ_{l=1}^{L} g( x; T^{(j)}_l, µ^{(j)}_l ),

which leads to a multinomial logistic trees model

πj(xi) = exp[ Σ_{l=1}^{L} g( x; T^{(j)}_l, µ^{(j)}_l ) ] / Σ_{h=1}^{C} exp[ Σ_{l=1}^{L} g( x; T^{(h)}_l, µ^{(h)}_l ) ].

Let λ_lb = exp(µ_lb); then

f(x) = exp[ Σ_{l=1}^{L} g( x; Tl, µl ) ] = Π_{l=1}^{L} g( x; Tl, Λl ),

where g(x; Tl, Λl) = λ_lb if x ∈ A_lb, for 1 ≤ b ≤ Bl.


Split criterion of multi-class classification

The likelihood of each covariate value is

pMN(yi) = ( ni choose yi1 yi2 · · · yiC ) Π_{j=1}^{C} f^{(j)}(xi)^{yij} / [ Σ_{j=1}^{C} f^{(j)}(xi) ]^{ni}.

We apply the data augmentation strategy of Murray (2017): introducing a latent variable φi, the augmented likelihood is

pMN(yi, φi) = ( ni choose yi1 yi2 · · · yiC ) [ Π_{j=1}^{C} f^{(j)}(xi)^{yij} ] ( φi^{ni−1} / Γ(ni) ) exp[ −φi Σ_{j=1}^{C} f^{(j)}(xi) ]

            = ( ni choose yi1 yi2 · · · yiC ) ( φi^{ni−1} / Γ(ni) ) Π_{j=1}^{C} f^{(j)}(xi)^{yij} exp[ −φi f^{(j)}(xi) ].

The augmentation introduces one latent variable φi per data observation, 1 ≤ i ≤ N.


Split criterion of multi-class classification

Assume an independent conjugate prior λ_lb ∼ Gamma(a1, a2) for each leaf parameter in Λl. The integrated likelihood is

L(Tl; T(l), Λ(l), θ, y) = ∫ L(Tl, Λl; T(l), Λ(l), θ, y) p(Λl) dΛl

                        = ∫ cl Π_{b=1}^{Bl} λ_lb^{r_lb} exp[ −s_lb λ_lb ] · λ_lb^{a1−1} a2^{a1} e^{−a2 λ_lb} / Γ(a1) dλ_lb

                        ∝ Π_{b=1}^{Bl} Γ(a1 + r_lb) / (a2 + s_lb)^{a1 + r_lb}.

The integrated likelihood of one leaf is

m( s_lb | a1, a2, {φi}_{i=1}^{N} ) = Γ(a1 + r_lb) / (a2 + s_lb)^{a1 + r_lb}.


Sampling non-tree parameters in multi-class classification

The sampling steps for the leaf and non-tree parameters are:

• For 1 ≤ j ≤ C, update each leaf parameter of f^{(j)} independently when updating leaf parameters in the GrowFromRoot algorithm:

λ_lb ∼ Gamma( a1 + r_lb, a2 + s_lb ).

• For 1 ≤ i ≤ N, update the latent variable after updating a tree in the XBART forest algorithm:

φi ∼ Gamma( ni, Σ_{j=1}^{C} f^{(j)}(xi) ).
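The same two conditional draws in code form (a sketch; the leaf statistics r_lb, s_lb and the per-class fits f^(j)(x_i) are assumed to be precomputed as on the previous slides).

```python
import numpy as np

def sample_leaf_lambda(r_lb, s_lb, a1, a2, rng):
    """lambda_lb | data ~ Gamma(shape = a1 + r_lb, rate = a2 + s_lb)."""
    return rng.gamma(a1 + r_lb, 1.0 / (a2 + s_lb))   # numpy takes scale = 1/rate

def sample_phi(n_i, f_values_at_xi, rng):
    """phi_i | data ~ Gamma(shape = n_i, rate = sum_j f^(j)(x_i))."""
    return rng.gamma(n_i, 1.0 / np.sum(f_values_at_xi))
```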


Some Theory


Markov chain

The algorithm sampling the forest is a finite-state Markov chain with a stationary distribution.

• Each iteration relies only on the previous iteration, not on all the forests before it.

• The set of forest states is finite because of the maximum depth and the finite set of split point candidates.

• The probability of drawing a single tree is defined as a product of integrated likelihoods over its cut points, which is non-zero.

• There is at least one way to move from any forest to any other: re-grow the trees one by one.


Consistency of a single regression tree

• We prove consistency of a single regression tree, building on the consistency result for CART (Scornet et al. 2015).

• The framework of the proof is the same; we verify that XBART satisfies the key lemmas related to its specific split criterion function.

• Random sampling of the split point can be converted to optimization via the perturbed-max lemma.


Perturb max lemma

Lemma

Suppose that at a specific node there are |C| finite split-point candidates cjk, and we are interested in drawing one of them according to the probabilities P(cjk) = exp(l(cjk)) / Σ_{cjk ∈ C} exp(l(cjk)). We have

P( cjk = arg max_{cjk ∈ C} { l(cjk) + γjk } ) = exp(l(cjk)) / Σ_{cjk ∈ C} exp(l(cjk)),

where the γjk are independent random draws from the Gumbel(0, 1) distribution with density p(x) = exp(−x − exp(−x)).

Random sampling is therefore equivalent to optimization with an additional randomly drawn perturbation.
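This is the familiar Gumbel-max trick. A quick numerical check (not part of the proof) that the perturbed argmax frequencies match the softmax probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
l = np.array([1.0, 0.2, -0.5])                 # l(c_jk) for three candidates
softmax = np.exp(l) / np.exp(l).sum()          # target sampling probabilities

gumbel = rng.gumbel(size=(200_000, l.size))    # gamma_jk ~ Gumbel(0, 1)
freq = np.bincount(np.argmax(l + gumbel, axis=1), minlength=l.size) / 200_000

print(softmax)   # approximately [0.598, 0.269, 0.133]
print(freq)      # empirical argmax frequencies agree closely
```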


Perturb max lemma

Following the perturbed-max lemma, we optimize

arg max_{cjk ∈ C} { l(cjk) + γjk },

which is equivalent to

arg max_{cjk ∈ C} { l(cjk)/n + γjk/n }.

Letting n → ∞, our empirical split criterion function Ln(x) converges to the theoretical version

L*(j, cjk) = (1/σ²) P( x^{(j)} ≤ cjk ) [ E( y | x^{(j)} ≤ cjk ) ]² + (1/σ²) P( x^{(j)} > cjk ) [ E( y | x^{(j)} > cjk ) ]².

This theoretical split criterion is the same as that of CART.


Assumption

We prove consistency of a single tree in the regression setting.

Assumption (A1)

y = Σ_{j=1}^{p} fj( x^{(j)} ) + ε,

where x = ( x^{(1)}, · · · , x^{(p)} ) is uniformly distributed on [0, 1]^p and ε ∼ N(0, σ²).


Main theorem

Suppose dn is a sequence of maximum tree depths and each tree is fully grown. We have:

Theorem

Assume (A1) holds. If n → ∞, dn → ∞ and (2^{dn} − 1)(log n)⁹ / n → 0, then a single XBART tree is consistent in the sense that

lim_{n→∞} E[ fn(x) − f(x) ]² = 0.


An important proposition

The total variation of the true function f within a leaf node A is

∆(f, A) = sup_{x, x′ ∈ A} |f(x) − f(x′)|,

and An(x, Θ) denotes the leaf node that x falls in.

Proposition

Assume (A1) holds. For all ρ, ξ > 0, there exists N ∈ ℕ* such that, for all n > N,

P[ ∆(f, An(x, Θ)) ≤ ξ ] ≥ 1 − ρ.

As n → ∞, the variation of the true function over every leaf node becomes arbitrarily small: either the leaf size shrinks to zero, or the true function is flat on the leaf.


Proof of proposition

The proof of the proposition relies on three key lemmas, which we verify are valid for XBART in the paper.

1. The proposition is true for a tree grown with the theoretical split criterion.

2. A tree grown with the empirical split criterion is close enough to the theoretical tree as n → ∞.


Future Research

• XBART enjoys great prediction accuracy and is fast.

• Application to Bayesian causal forests, including future empirical work in causal inference.

• Prove consistency of the XBART forest.


Thanks

Download XBART at xbart.ai


Extra slides


Classification Results

We compare XBART with other methods on 20 datasets from the UCI machine learning repository. All datasets have 3 to 6 categories and 100 to 3,000 observations.

The goal is to demonstrate that the default settings of XBART give reasonable performance compared with other approaches.


Accuracy; an asterisk marks the best method on each dataset.

Dataset                      rf              gbm             mno             svm             nnet            xbart
balance-scale                0.848 (0.023)   0.925 (0.010)   0.897 (0.021)   0.909 (0.025)   0.961 (0.019)*  0.912 (0.011)
car                          0.983 (0.006)*  0.979 (0.008)   0.834 (0.019)   0.774 (0.033)   0.947 (0.015)   0.938 (0.018)
cardiotocography-3clases     0.937 (0.009)   0.949 (0.009)*  0.894 (0.011)   0.911 (0.011)   0.909 (0.013)   0.931 (0.011)
contrac                      0.546 (0.024)   0.557 (0.023)   0.516 (0.028)   0.551 (0.024)   0.556 (0.028)*  0.324 (0.058)
dermatology                  0.970 (0.016)   0.972 (0.020)*  0.968 (0.020)   0.759 (0.024)   0.970 (0.022)   0.972 (0.018)*
glass                        0.798 (0.062)*  0.771 (0.055)   0.622 (0.066)   0.679 (0.054)   0.673 (0.064)   0.702 (0.076)
heart-cleveland              0.578 (0.033)   0.586 (0.039)   0.587 (0.039)   0.620 (0.038)*  0.603 (0.052)   0.583 (0.034)
heart-va                     0.357 (0.071)*  0.320 (0.067)   0.349 (0.069)   0.315 (0.069)   0.302 (0.08)    0.308 (0.064)
iris                         0.948 (0.034)   0.945 (0.034)   0.965 (0.029)*  0.947 (0.034)   0.954 (0.045)   0.954 (0.033)
lymphography                 0.866 (0.063)*  0.853 (0.057)   0.821 (0.069)   0.850 (0.062)   0.818 (0.077)   0.835 (0.058)
pittsburg-bridges-MATERIAL   0.840 (0.058)   0.844 (0.049)   0.834 (0.061)   0.860 (0.048)*  0.824 (0.066)   0.849 (0.046)
pittsburg-bridges-REL-L      0.725 (0.084)*  0.681 (0.093)   0.650 (0.087)   0.692 (0.082)   0.659 (0.091)   0.680 (0.083)
pittsburg-bridges-SPAN       0.637 (0.098)   0.648 (0.101)   0.675 (0.100)   0.681 (0.099)*  0.647 (0.102)   0.628 (0.100)
pittsburg-bridges-TYPE       0.609 (0.088)*  0.581 (0.089)   0.549 (0.089)   0.540 (0.075)   0.565 (0.092)   0.585 (0.093)
seeds                        0.940 (0.030)   0.940 (0.031)   0.948 (0.035)*  0.929 (0.029)   0.944 (0.034)   0.945 (0.042)
synthetic-control            0.984 (0.012)   0.971 (0.015)   0.984 (0.012)   0.716 (0.024)   0.987 (0.011)*  0.983 (0.017)
teaching                     0.622 (0.085)*  0.557 (0.086)   0.526 (0.072)   0.547 (0.073)   0.525 (0.087)   0.491 (0.086)
vertebral-column-3clases     0.847 (0.038)   0.829 (0.038)   0.861 (0.033)*  0.839 (0.042)   0.859 (0.033)   0.842 (0.039)
wine-quality-red             0.702 (0.021)*  0.631 (0.026)   0.597 (0.024)   0.576 (0.026)   0.597 (0.025)   0.613 (0.022)
wine                         0.985 (0.018)*  0.979 (0.021)   0.979 (0.026)   0.980 (0.025)   0.977 (0.027)   0.969 (0.032)


Tree is Prone to Overfitting

[Figure, four panels: the Boston housing data, Boston$lstat (horizontal axis) versus Boston$medv (vertical axis), illustrating that a single tree is prone to overfitting.]


Regularization

Three secrets of a successful tree model.

Regularization, regularization and regularization.

The strategies for preventing overfitting are diverse, but the basic

idea is to favor smaller trees.


Split criterion of multi-class classification

Let f^{(j)}_{−l}(x) = Π_{h ≠ l} g^{(j)}(x; Th, Λh) be the total fit of all trees except the l-th one. The likelihood over all n covariate values then has the form

L(Tl, Λl; T(l), Λ(l), y) = Π_{i=1}^{n} wi f^{(j)}(xi)^{yij} exp[ vi f^{(j)}(xi) ]

                         = cl Π_{b=1}^{Bl} λ_lb^{r_lb} exp[ −s_lb λ_lb ],

where Bl is the number of leaf nodes of the l-th tree and

cl = Π_{i=1}^{n} ( ni choose yi1 yi2 · · · yiC ) ( φi^{ni−1} / Γ(ni) ) f^{(j)}_{−l}(xi)^{yij},

r_lb = Σ_{i : xi ∈ A_lb} yij,    s_lb = Σ_{i : xi ∈ A_lb} φi f^{(j)}_{−l}(xi).


Adaptive Cut-points

Cut-points are defined via evenly spaced quantiles of the observed values of X, and at each recursion the cut-points are redefined on the data in the current node.

[Figure, built over three slides: a function f(x) plotted over a grid x = 0, 1, ..., 8 of candidate cut-points; after a split, the grid "zooms in" on the selected sub-interval. This is easy because the data are kept ordered.]
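A sketch of quantile-based candidate cut-points recomputed on the data reaching the current node; the helper name is ours.

```python
import numpy as np

def candidate_cutpoints(col, num_cutpoints=20):
    """Evenly spaced quantiles of the observations reaching the current node.
    Because each child node recomputes the grid on its own subset of the
    (already sorted) data, the candidate set automatically 'zooms in'."""
    qs = np.linspace(0, 1, num_cutpoints + 2)[1:-1]   # interior quantiles only
    return np.unique(np.quantile(col, qs))
```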


Prune CART

Loop over all possible sub-trees T and calculate the loss function

Cα(T) = C(T) + α|T|,

where C(T) is the prediction error of sub-tree T, |T| is a measure of tree complexity, and α is a tuning parameter. Pick the sub-tree with the lowest loss.
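A toy illustration of this pruning rule, assuming each candidate sub-tree's prediction error and size have already been computed:

```python
def best_subtree(subtrees, alpha):
    """subtrees: list of (C(T), |T|) pairs, i.e. prediction error and size.
    Return the index of the sub-tree minimizing C_alpha(T) = C(T) + alpha * |T|."""
    costs = [err + alpha * size for err, size in subtrees]
    return min(range(len(costs)), key=costs.__getitem__)

# Example: best_subtree([(10.0, 8), (11.5, 4), (14.0, 1)], alpha=0.5) returns 1.
```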


Gradient boosting trees

Obj = Σ_{i=1}^{n} loss(yi, ŷi) + Σ_{l=1}^{L} Ω( g(Tl, µl) )

Obj^t = Σ_{i=1}^{n} [ 2( ŷi^{(t−1)} − yi ) gt(xi) + gt(xi)² ] + Ω(gt) + const
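A small sketch evaluating the displayed objective for a candidate tree fit g_t under squared loss (the names and the penalty argument are ours):

```python
import numpy as np

def boosting_objective_t(y, y_hat_prev, g_t, omega_g_t):
    """Objective for the t-th tree as displayed above (squared loss, constants
    dropped): sum_i [ 2(yhat_i^(t-1) - y_i) g_t(x_i) + g_t(x_i)^2 ] + Omega(g_t)."""
    return float(np.sum(2.0 * (y_hat_prev - y) * g_t + g_t ** 2) + omega_g_t)
```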


Gyorfi et al. (2006)

Theorem (Gyorfi et al. (2006))

Assume that

1. lim_{n→∞} βn = ∞;

2. lim_{n→∞} E[ inf_{m ∈ Mn(Θ), ||m||∞ ≤ βn} EX[ m(x) − f(x) ]² ] = 0;

3. for all L > 0,

   lim_{n→∞} E[ sup_{m ∈ Mn(Θ), ||m||∞ ≤ βn} | (1/an) Σ_{i=1}^{n} [ m(xi) − yi,L ]² − E[ m(x) − yL ]² | ] = 0.

Then

lim_{n→∞} E[ Tβn fn(X, Θ) − f(X) ]² = 0.


Lemma 1

Lemma

Assume that (A1) holds. Then, for all x ∈ [0, 1]^p, ∆(f, A*_k(x, Θ)) → 0 almost surely as k → ∞.


Lemma 2

Lemma

Assume that (A1) holds. Fix x ∈ [0, 1]^p, k ∈ ℕ*, and let ξ > 0. Then Ln,k(x, ·) is stochastically equicontinuous on A^ξ_k(x); that is, for all α, ρ > 0, there exists δ > 0 such that

lim_{n→∞} P[ sup_{ ||ck − c′k||∞ ≤ δ,  ck, c′k ∈ A^ξ_k(x) } | Ln,k(x, ck) − Ln,k(x, c′k) | > α ] ≤ ρ.


Lemma 3

Lemma

Assume that (A1) holds. Fix ξ, ρ > 0 and k ∈ ℕ*. Then there exists N ∈ ℕ* such that, for all n ≥ N,

P[ c∞( ck,n(x, Θ), A*_k(x, Θ) ) ≤ ξ ] ≥ 1 − ρ.


References

Jingyu He, Saar Yalov, and P. Richard Hahn. XBART: Accelerated Bayesian Additive Regression Trees. AISTATS 2019.

Jingyu He, Saar Yalov, Jared Murray, and P. Richard Hahn. Stochastic Tree Ensembles for Regularized Supervised Learning. Technical report.
