Stochastic Tree Ensembles
for Regularized Supervised Learning
Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn
November 5, 2019
The University of Chicago Booth School of Business
Elements of supervised learning
• x ∈ Rᵖ is a vector of covariates; the outcome y is continuous (regression) or discrete (classification).
• Goal: predict y given x.
• We want to learn the functional form f(·): y = f(x) (1)
• Linear model: y = xβ (or logistic regression for classification)
• Non-parametric models: trees, deep learning
Simulation and Empirical Results
Regression Simulation
We draw 30 covariates x from U(0, 1) and set y = f(x) + ε, where ε ∼ N(0, σ²) and σ² = κVar(f), so κ controls the noise level (signal-to-noise ratio 1:κ).

| Name | Function |
|---|---|
| Linear | xᵗγ; γ_j = −2 + 4(j − 1)/(d − 1) |
| Single index | 10√a + sin(5a); a = Σ_{j=1}^{10} (x_j − γ_j)²; γ_j = −1.5 + (j − 1)/3 |
| Trig + poly | 5 sin(3x₁) + 2x₂² + 3x₃x₄ |
| Max | max(x₁, x₂, x₃) |

We compare against neural networks, random forests, and XGBoost with and without cross-validation.
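The data-generating process above is concrete enough to sketch directly. The function below is illustrative (`make_data` and its argument names are not from the deck), and `d` in the linear coefficients is taken to be the number of covariates, p = 30.

```python
import numpy as np

def make_data(f_name, n=10_000, p=30, kappa=1.0, seed=0):
    """Simulate y = f(x) + eps with Var(eps) = kappa * Var(f)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n, p))
    if f_name == "linear":
        gamma = -2 + 4 * np.arange(p) / (p - 1)     # gamma_j = -2 + 4(j-1)/(d-1)
        f = x @ gamma
    elif f_name == "single_index":
        gamma = -1.5 + np.arange(10) / 3            # gamma_j = -1.5 + (j-1)/3
        a = ((x[:, :10] - gamma) ** 2).sum(axis=1)
        f = 10 * np.sqrt(a) + np.sin(5 * a)
    elif f_name == "trig_poly":
        f = 5 * np.sin(3 * x[:, 0]) + 2 * x[:, 1] ** 2 + 3 * x[:, 2] * x[:, 3]
    elif f_name == "max":
        f = x[:, :3].max(axis=1)
    else:
        raise ValueError(f_name)
    sigma = np.sqrt(kappa * f.var())                # sigma^2 = kappa * Var(f)
    return x, f + rng.normal(scale=sigma, size=n), f
```

Setting `kappa=10` reproduces the low signal-to-noise (1:10) setting used later.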
Simulation: signal-to-noise 1:1 (κ = 1), n = 10K

| Function | Method | RMSE | Seconds |
|---|---|---|---|
| Linear | XBART | 1.74 | 20 |
| Linear | XGBoost Tuned | 2.63 | 64 |
| Linear | XGBoost Untuned | 3.23 | < 1 |
| Linear | Random Forest | 3.56 | 6 |
| Linear | BART | 1.50 | 117 |
| Linear | Neural Network | 1.39 | 26 |
| Trig + Poly | XBART | 1.31 | 17 |
| Trig + Poly | XGBoost Tuned | 2.08 | 61 |
| Trig + Poly | XGBoost Untuned | 2.70 | < 1 |
| Trig + Poly | Random Forest | 3.04 | 6 |
| Trig + Poly | BART | 1.30 | 115 |
| Trig + Poly | Neural Network | 3.96 | 26 |
| Max | XBART | 0.39 | 16 |
| Max | XGBoost Tuned | 0.42 | 62 |
| Max | XGBoost Untuned | 0.79 | < 1 |
| Max | Random Forest | 0.41 | 6 |
| Max | BART | 0.44 | 114 |
| Max | Neural Network | 0.40 | 30 |
| Single Index | XBART | 2.27 | 17 |
| Single Index | XGBoost Tuned | 2.65 | 61 |
| Single Index | XGBoost Untuned | 3.65 | < 1 |
| Single Index | Random Forest | 3.45 | 6 |
| Single Index | BART | 2.03 | 116 |
| Single Index | Neural Network | 2.76 | 28 |
Simulation: signal-to-noise 1:10 (κ = 10), n = 10K

| Function | Method | RMSE | Seconds |
|---|---|---|---|
| Linear | XBART | 5.07 | 16 |
| Linear | XGBoost Tuned | 8.04 | 61 |
| Linear | XGBoost Untuned | 21.25 | < 1 |
| Linear | Random Forest | 6.52 | 6 |
| Linear | BART | 6.64 | 111 |
| Linear | Neural Network | 7.39 | 12 |
| Trig + Poly | XBART | 4.94 | 16 |
| Trig + Poly | XGBoost Tuned | 7.16 | 61 |
| Trig + Poly | XGBoost Untuned | 17.97 | < 1 |
| Trig + Poly | Random Forest | 6.34 | 7 |
| Trig + Poly | BART | 6.15 | 110 |
| Trig + Poly | Neural Network | 8.20 | 13 |
| Max | XBART | 1.94 | 16 |
| Max | XGBoost Tuned | 2.76 | 60 |
| Max | XGBoost Untuned | 7.18 | < 1 |
| Max | Random Forest | 2.30 | 6 |
| Max | BART | 2.46 | 111 |
| Max | Neural Network | 2.98 | 15 |
| Single Index | XBART | 7.13 | 16 |
| Single Index | XGBoost Tuned | 10.61 | 61 |
| Single Index | XGBoost Untuned | 28.68 | < 1 |
| Single Index | Random Forest | 8.99 | 6 |
| Single Index | BART | 8.69 | 111 |
| Single Index | Neural Network | 9.43 | 14 |
Larger simulation: κ = 1 (signal-to-noise 1:1). Entries are RMSE, with runtime in seconds in parentheses.

| Function | n | XBART | XGB+CV | XGB | NN |
|---|---|---|---|---|---|
| Linear | 10k | 1.74 (20) | 2.63 (64) | 3.23 (0) | 1.39 (26) |
| Linear | 50k | 1.04 (180) | 1.99 (142) | 2.56 (4) | 0.66 (28) |
| Linear | 250k | 0.67 (1774) | 1.50 (1399) | 2.00 (55) | 0.28 (40) |
| Max | 10k | 0.39 (16) | 0.42 (62) | 0.79 (0) | 0.40 (30) |
| Max | 50k | 0.25 (134) | 0.29 (140) | 0.58 (4) | 0.20 (32) |
| Max | 250k | 0.14 (1188) | 0.21 (1554) | 0.41 (60) | 0.16 (44) |
| Single Index | 10k | 2.27 (17) | 2.65 (61) | 3.65 (0) | 2.76 (28) |
| Single Index | 50k | 1.54 (153) | 1.61 (141) | 2.81 (4) | 1.93 (31) |
| Single Index | 250k | 1.14 (1484) | 1.18 (1424) | 2.16 (55) | 1.67 (41) |
| Trig + Poly | 10k | 1.31 (17) | 2.08 (61) | 2.70 (0) | 3.96 (26) |
| Trig + Poly | 50k | 0.74 (147) | 1.29 (141) | 1.67 (4) | 3.33 (29) |
| Trig + Poly | 250k | 0.45 (1324) | 0.82 (1474) | 1.11 (59) | 2.56 (41) |
Larger simulation: κ = 10 (signal-to-noise 1:10). Entries are RMSE, with runtime in seconds in parentheses.

| Function | n | XBART | XGB+CV | XGB | NN |
|---|---|---|---|---|---|
| Linear | 10k | 5.07 (16) | 8.04 (61) | 21.25 (0) | 7.39 (12) |
| Linear | 50k | 3.16 (135) | 5.47 (140) | 16.17 (4) | 3.62 (14) |
| Linear | 250k | 2.03 (1228) | 3.15 (1473) | 11.49 (54) | 1.89 (19) |
| Max | 10k | 1.94 (16) | 2.76 (60) | 7.18 (0) | 2.98 (15) |
| Max | 50k | 1.22 (133) | 1.85 (139) | 5.49 (4) | 1.63 (16) |
| Max | 250k | 0.75 (1196) | 1.05 (1485) | 3.85 (54) | 0.85 (22) |
| Single Index | 10k | 7.13 (16) | 10.61 (61) | 28.68 (0) | 9.43 (14) |
| Single Index | 50k | 4.51 (133) | 6.91 (139) | 21.18 (4) | 6.42 (16) |
| Single Index | 250k | 3.06 (1214) | 4.10 (1547) | 14.82 (54) | 4.72 (21) |
| Trig + Poly | 10k | 4.94 (16) | 7.16 (61) | 17.97 (0) | 8.20 (13) |
| Trig + Poly | 50k | 3.01 (132) | 4.92 (139) | 13.30 (4) | 5.53 (14) |
| Trig + Poly | 250k | 1.87 (1216) | 3.17 (1462) | 9.37 (49) | 4.13 (20) |
Motivation
Grow a tree recursively

[Animated diagrams over the (x₁, x₂) feature space: a first split x₁ < c is considered at candidate cutpoints 0.3, 0.5, 0.7, giving leaves µ₁, µ₂. After fixing x₁ < 0.5, the right child is split on x₂ < c at candidate cutpoints 0.3, 0.5, 0.7, giving leaves µ₁, µ₂, µ₃. A new observation is routed down the tree; its prediction here is µ₃.]
Grow a tree recursively
[FIGURE 9.2 (Hastie, Tibshirani, and Friedman): Partitions and CART. The top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. The top left panel shows a general partition that cannot be obtained from recursive binary splitting. The bottom left panel shows the tree corresponding to the partition in the top right panel, and a perspective plot of the prediction surface appears in the bottom right panel.]
A tree is essentially a step function: it partitions on one variable at a time and predicts with a constant in each region. Picture from Friedman, Hastie, and Tibshirani (2001).
Sum of trees
[Diagram: two single-split trees, one with leaves µ₁, µ₂, µ₃ over a partition at (c, d) and one with leaves θ₁, θ₂, θ₃ over a partition at (e, f), are added; the resulting grid partition has cell values θᵢ + µⱼ.]

A sum of trees (a tree ensemble, or forest) implies an extra level of smoothing.
Why a tree / forest
• Widely used: more than half of the winning entries in Kaggle data-mining competitions use variants of tree ensemble methods.
• Invariant to scaling of the input variables; no need to worry about feature normalization.
• Learns high-order interactions between features.
• Random forests average many trees, all grown independently.
• Boosting sums many trees, grown sequentially.
Random Forest

[Animated plots: a noisy one-dimensional (x, y) scatter, with random-forest fits overlaid in successive frames.]
Boosting

[Animated plots: the same (x, y) scatter, with boosted fits overlaid in successive frames as trees are added.]
Classification and Regression Trees (CART)
• Probably the most popular tree-growing algorithm.
• Grow the tree until it is very large, then prune it.
• The CART split criterion minimizes the L2 loss
Σ_{i∈left} (yᵢ − ȳ_left)² + Σ_{i∈right} (yᵢ − ȳ_right)²
• Split points and leaf parameters can conspire to make a bad split point look better than it is.
• The criterion is optimized: but what if two split points have nearly identical scores, say 10.00 and 9.99?
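The L2 criterion above is easy to compute directly. A minimal sketch for one feature; `l2_split_loss` and `best_cut` are illustrative helper names, not from the deck.

```python
import numpy as np

def l2_split_loss(y, left_mask):
    """CART L2 criterion for one candidate split: within-child squared error."""
    loss = 0.0
    for part in (y[left_mask], y[~left_mask]):
        if part.size:
            loss += ((part - part.mean()) ** 2).sum()
    return loss

def best_cut(x, y, cuts):
    """Greedy CART choice: the cutpoint minimizing the L2 criterion."""
    return min(cuts, key=lambda c: l2_split_loss(y, x < c))
```

Note that `best_cut` always returns a single winner, even when two cutpoints score 10.00 and 9.99; that hard argmin is exactly what XBART's sampling replaces.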
Intuition for XBART
• A randomized algorithm: split points are sampled rather than optimized.
• A new split criterion with a probabilistic interpretation.
• Early stopping, rather than pruning after over-growing.
• Splitting and leaf-parameter estimation are done separately.
• Tree ensembles.
Comparison of tree-based algorithms
| | CART | RF | XGB | XBART | BART |
|---|---|---|---|---|---|
| Leaf parameters | optimized with splits | optimized with splits | optimized with splits | integrated out at split, then sampled | integrated out at split, then sampled |
| Criterion | likelihood | likelihood | likelihood | marginal likelihood | marginal likelihood |
| Aggregation | no | of trees | no | of forests | of forests |
| Sequential fitting | no | no | yes | yes | yes |
| Iterations | no | no | no | yes | no |
| Recursion | yes | yes | yes | yes | no |
XBART regression
Split criterion of Gaussian regression
• Assume a Gaussian likelihood N(µ, σ²) on one leaf node.
• Prior µ ∼ N(0, τ).
The integrated likelihood is
p(Y | τ, σ²) = ∫ N(Y | µJ, σ²Iₙ) N(µ | 0, τ) dµ = N(0, τJJᵗ + σ²Iₙ),
where J is the n-vector of ones. Rearranging terms, ignoring constants in the density, and taking the logarithm,
log m(s | τ, σ²) = ½ log( σ² / (σ² + τn) ) + τ s² / ( 2σ²(σ² + τn) ),
where s = Σy, the sum of all y in the leaf node.
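The log marginal above is a one-liner; a minimal sketch, assuming the usual ½ factors from the Gaussian integral, with `log_marginal` as an illustrative name.

```python
import math

def log_marginal(s, n, tau, sigma2):
    """Log integrated likelihood of a leaf (constants dropped):
    s = sum of responses in the leaf, n = number of observations,
    prior mu ~ N(0, tau), likelihood N(mu, sigma2)."""
    v = sigma2 + tau * n
    return 0.5 * math.log(sigma2 / v) + tau * s * s / (2.0 * sigma2 * v)
```

The criterion depends on the leaf data only through s and n, which is what makes evaluating every candidate cutpoint cheap.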
Split criterion of a single tree
• A split point candidate c_jk partitions the current node into left and right children.
• The two children have sufficient statistics s_jk^l and s_jk^r.
• Assume data in the two leaf nodes are independent.
The joint integrated likelihood of the two children is
l(c_jk) := m(s_jk^l | τ, σ²) m(s_jk^r | τ, σ²).
This is the split criterion for c_jk.
Why integrated-likelihood
• Most tree algorithms optimize the split point and the leaf parameters simultaneously.
• A bad split point can look better than it is in collusion with the leaf parameters, because of random noise in the data.
• We instead split nodes and estimate leaf parameters separately.
No-split option (Regularization)
Furthermore, the split criterion for the no-split option is defined as
l_stop = |C| ( (1 + d)^β / α − 1 ) m(s_all | Φ, Ψ)
• d is the depth of the current node.
• s_all is the sufficient statistic of all data in the node.
• α, β are hyperparameters.
Weight of no-split option
• Each split point candidate has prior weight proportional to 1.
• The no-split option has prior weight proportional to |C|( (1 + d)^β/α − 1 ).
• There are |C| split candidates in total.
Therefore the implied prior probability of splitting is
split weight / (split weight + no-split weight) = |C| / ( |C|((1 + d)^β/α − 1) + |C| ) = α(1 + d)^(−β),
which is the same as the prior probability of a split in BART.
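The algebra above can be checked numerically. A small sketch with illustrative names, using the defaults α = 0.95, β = 1.25 recommended later in the deck.

```python
def no_split_weight(C, d, alpha, beta):
    """Prior weight of the no-split option at depth d with |C| candidates."""
    return C * ((1 + d) ** beta / alpha - 1)

def split_prob(C, d, alpha=0.95, beta=1.25):
    """Implied prior probability of splitting: C / (C + no-split weight)."""
    return C / (C + no_split_weight(C, d, alpha, beta))
```

Note that |C| cancels, so the implied split probability α(1 + d)^(−β) depends only on depth, matching the BART prior.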
Sample one split point (or stop)
Sample one of them according to the probabilities
P(c_jk) = l(c_jk) / ( Σ_{c_jk ∈ C} l(c_jk) + l_stop ),  P(stop) = l_stop / ( Σ_{c_jk ∈ C} l(c_jk) + l_stop ).
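A sketch of this sampling step; working in log space with a log-sum-exp shift is an implementation choice for numerical stability, not something the slide specifies, and `sample_split` is an illustrative name.

```python
import numpy as np

def sample_split(log_liks, log_l_stop, rng):
    """Sample a split-candidate index, or None for stop, with probability
    proportional to the exponentiated criteria (log-sum-exp for stability)."""
    logw = np.append(log_liks, log_l_stop)   # last slot is the stop option
    logw -= logw.max()                       # shift so exp() cannot overflow
    w = np.exp(logw)
    idx = rng.choice(len(w), p=w / w.sum())
    return None if idx == len(w) - 1 else int(idx)
```

Two candidates scoring 10.00 and 9.99 are now chosen with nearly equal probability, instead of the argmax always winning.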
Split criterion of a single tree

Consider a split point candidate: it partitions the space into a left and a right child.

[Diagrams: the (x₁, x₂) plane is split in two; the left region contributes m(s_jk^l | Φ, Ψ) and the right region m(s_jk^r | Φ, Ψ) to the criterion.]
No-split option
If we stop splitting, it is equivalent to leaving all data in one node.

[Diagram: the undivided (x₁, x₂) plane, with no-split weight |C|( (1 + d)^β/α − 1 ) m(s_∅ | Φ, Ψ).]
GrowFromRoot algorithm
In addition to the no-split option, we keep the usual stopping conditions (maximum tree depth, minimum leaf size, etc.). The algorithm for growing a single tree (GrowFromRoot):
1. Start from the root node.
2. Evaluate the split criterion for all split point candidates and the no-split option, and sample one of them.
If no-split is drawn or another stopping condition is reached, update the leaf parameter and return.
Else, split the current node into left and right children and repeat step 2 for each child.
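The recursion above can be sketched for a single feature under the Gaussian criterion. All names and defaults here are illustrative, and the leaf stores a shrunk posterior-mean estimate rather than a posterior draw, to keep the sketch short.

```python
import numpy as np

def log_marg(s, n, tau, sigma2):
    """Gaussian leaf log marginal likelihood (constants dropped)."""
    v = sigma2 + tau * n
    return 0.5 * np.log(sigma2 / v) + tau * s * s / (2 * sigma2 * v)

def grow_from_root(x, y, depth, rng, tau=0.1, sigma2=1.0,
                   alpha=0.95, beta=1.25, max_depth=4, min_leaf=5):
    """Sketch of GrowFromRoot for one feature: sample a cutpoint or stop."""
    cuts = np.unique(x)[1:]                    # candidate cutpoints
    n = len(y)
    leaf = {"leaf": y.sum() / (sigma2 / tau + n)}   # shrunk leaf mean
    if depth >= max_depth or n < 2 * min_leaf or len(cuts) == 0:
        return leaf
    # criterion per cutpoint: product of the two children's marginals
    logls = np.array([log_marg(y[x < c].sum(), (x < c).sum(), tau, sigma2)
                      + log_marg(y[x >= c].sum(), (x >= c).sum(), tau, sigma2)
                      for c in cuts])
    log_stop = (np.log(len(cuts)) + np.log((1 + depth) ** beta / alpha - 1)
                + log_marg(y.sum(), n, tau, sigma2))
    logw = np.append(logls, log_stop)
    w = np.exp(logw - logw.max())
    idx = rng.choice(len(w), p=w / w.sum())
    if idx == len(w) - 1:                      # no-split option drawn
        return leaf
    c = cuts[idx]
    return {"cut": c,
            "left": grow_from_root(x[x < c], y[x < c], depth + 1, rng,
                                   tau, sigma2, alpha, beta, max_depth, min_leaf),
            "right": grow_from_root(x[x >= c], y[x >= c], depth + 1, rng,
                                    tau, sigma2, alpha, beta, max_depth, min_leaf)}

def predict(tree, xi):
    while "cut" in tree:
        tree = tree["left"] if xi < tree["cut"] else tree["right"]
    return tree["leaf"]
```

On data with a clear jump at x = 0.5, the sampled tree almost always splits near the jump and then draws the no-split option in the homogeneous children.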
XBART forest
Next, the algorithm for growing a forest. We fit a sum of trees (regression) or a product of trees (classification).
Suppose we sample I forests with L trees in each forest.
For iter in 1 to I:
  For h in 1 to L:
    GrowFromRoot fits the target r_h^(iter).
    Update r_{h+1}^(iter), the target for the next tree to fit.
  Update the other, non-tree parameters Ψ.
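The target bookkeeping of this loop can be sketched with a stand-in fitter. Here `fit_stump` is a deterministic one-split stump replacing GrowFromRoot, so only the residual-target updates mirror the algorithm; in the first sweep each tree fits Y/L, afterwards each tree fits the leave-one-tree-out residual.

```python
import numpy as np

def fit_stump(x, y):
    """Stand-in for GrowFromRoot: a deterministic one-split L2 stump."""
    cuts = np.unique(x)[1:]
    def loss(c):
        left, right = y[x < c], y[x >= c]
        return (((left - left.mean()) ** 2).sum()
                + ((right - right.mean()) ** 2).sum())
    best = min(cuts, key=loss)
    lmean, rmean = y[x < best].mean(), y[x >= best].mean()
    return lambda xq, b=best, l=lmean, r=rmean: np.where(xq < b, l, r)

def xbart_sweeps(x, y, L=3, I=5):
    """Bookkeeping of the stochastic-backfitting loop over I sweeps."""
    fits = [np.zeros_like(y) for _ in range(L)]   # current fitted values
    for it in range(I):
        for h in range(L):
            if it == 0:
                target = y / L                    # first sweep fits Y/L
            else:
                target = y - (sum(fits) - fits[h])  # leave-one-tree-out residual
            fits[h] = fit_stump(x, target)(x)
    return sum(fits)
```

Even with stumps, a few sweeps of these residual updates fit a smooth curve far better than any single stump could.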
XBART forest

Take regression (additive trees) as an example, with three trees.

• First sweep: the first tree fits target R_1^(1) = Y/3; denote its fitted value R̂_1^(1). The second tree fits R_2^(1) = Y/3 with fitted value R̂_2^(1), and the third fits R_3^(1) = Y/3 with fitted value R̂_3^(1).
• Second sweep: re-fit the first tree to R_1^(2) = Y − R̂_2^(1) − R̂_3^(1), giving R̂_1^(2); the second tree to R_2^(2) = Y − R̂_1^(2) − R̂_3^(1), giving R̂_2^(2); and the third to R_3^(2) = Y − R̂_1^(2) − R̂_2^(2), giving R̂_3^(2).
Update leaf parameter µ and residual variance
We assume a µ_lb ∼ N(0, τ) prior on each leaf and update it by
µ_lb ∼ N( (Σy/σ²) / (1/τ + n_lb/σ²), 1 / (1/τ + n_lb/σ²) )
where Σy sums over all data in the leaf and n_lb is its number of observations.
Assume a standard inverse-gamma prior on σ² and update it in between trees:
σ² ∼ inverse-Gamma( N + a, r^(iter)ᵗ r^(iter) + b )
where r^(iter) is the total residual across all trees.
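Both conditional updates are conjugate and cheap to draw. A sketch with illustrative names; the inverse-gamma is parameterized as on the slide (shape N + a, rate rᵗr + b), noting that other write-ups include factors of ½.

```python
import numpy as np

def draw_leaf_mu(y_leaf, tau, sigma2, rng):
    """Conjugate normal draw for a leaf mean:
    variance v = 1/(1/tau + n/sigma2), mean m = v * sum(y)/sigma2."""
    n = len(y_leaf)
    v = 1.0 / (1.0 / tau + n / sigma2)
    m = v * y_leaf.sum() / sigma2
    return rng.normal(m, np.sqrt(v))

def draw_sigma2(resid, a, b, rng):
    """Inverse-gamma draw for sigma^2: shape N + a, rate resid'resid + b.
    An IG draw is the rate divided by a unit-scale gamma draw."""
    shape = len(resid) + a
    rate = resid @ resid + b
    return rate / rng.gamma(shape)
```

With many observations in a leaf, the draw for µ concentrates near the shrunk sample mean, which is the regularization the τ prior provides.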
Default Hyperparameters
We recommend
• L = 50, 100, or 200 trees,
• α = 0.95,
• β = 1.25, and
• τ = var(y)/L.
A lower β permits deeper trees (BART's default is β = 2).
This τ dictates that a priori each tree accounts for 1/L of the observed variance.
Our default suggestion is just 40 sweeps through the data,
discarding the first 15 as burn-in.
Final prediction
Given I iterations of the algorithm, keep the final I − I₀ forests for prediction, where I₀ < I denotes the length of the burn-in period. Average over the time domain of the Markov chain:
f̄(X) = (1/(I − I₀)) Σ_{k > I₀} f^(k)(X),
where f^(k) denotes one sampled forest, i.e. the sum of trees in that forest.
Variable importance
We also keep track of variable importance. Count how many times each variable is used as a split variable in tree l at iteration k; call this vector w_l^(k). Update the count parameter
w ← w − w_l^(k−1) + w_l^(k).
A weight vector is then resampled as w̃ ∼ Dirichlet(w), and for the next tree we subsample mtry variables for consideration with probabilities w̃.
w gives a natural measure of variable importance.
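A sketch of this weight update; the names are illustrative, and the count vector is assumed to be initialized at ones so the Dirichlet parameters stay positive.

```python
import numpy as np

def update_var_weights(w_counts, counts_old, counts_new, mtry, rng):
    """Swap tree l's old split counts for its new ones, resample variable
    probabilities from Dirichlet(w), then subsample mtry variables."""
    w_counts = w_counts - counts_old + counts_new
    probs = rng.dirichlet(w_counts)
    chosen = rng.choice(len(probs), size=mtry, replace=False, p=probs)
    return w_counts, probs, chosen
```

Variables that keep getting chosen as split variables accumulate large counts and are offered to later trees more often, which is what makes w a usable importance measure.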
Generic XBART and Classification
![Page 62: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/62.jpg)
Generic split criterion
It is natural to extend the split criterion to other likelihoods.
• Define a likelihood $L(y; \mu, \Psi)$ on one leaf node.
• $\mu$ is the leaf parameter; $\Psi$ collects other model parameters (given and fixed
during the tree-growing process).
• Assume a prior $\pi(\mu \mid \Phi)$ on $\mu$.
The integrated likelihood is

$$m(s \mid \Phi, \Psi) := \int L(y; \mu, \Psi)\, \pi(\mu \mid \Phi)\, d\mu,$$

where $s$ is a sufficient statistic of the data $y$ falling in the current node.
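For intuition, here is a minimal sketch of one such integrated likelihood in the Gaussian case (leaf mean $\mu \sim N(0, \tau)$, noise variance $\sigma^2$ fixed; this is the standard normal-normal marginal, written in terms of the sufficient statistics $n$ and $s = \sum_i y_i$):

```python
import numpy as np

def log_marginal_gaussian(y, sigma2, tau):
    """log m(s) for y_i ~ N(mu, sigma2) with prior mu ~ N(0, tau)."""
    y = np.asarray(y, dtype=float)
    n, s = len(y), y.sum()
    return (-0.5 * n * np.log(2 * np.pi * sigma2)
            - y @ y / (2 * sigma2)
            + 0.5 * np.log(sigma2 / (sigma2 + n * tau))
            + tau * s**2 / (2 * sigma2 * (sigma2 + n * tau)))

# Sanity check: as tau -> 0 the marginal approaches the iid N(0, sigma2) likelihood.
print(log_marginal_gaussian([0.5, -0.2], sigma2=1.0, tau=1e-12))
```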
32
![Page 63: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/63.jpg)
Split criterion of multi-class classification
• Classification with C categories. Suppose each $x_i$ is observed $n_i$
times in the data.
• The response $y_{ij}$ is the number of observations with covariate
$x_i$ in category $j$, where $1 \le i \le n$ and $1 \le j \le C$.
• So $\sum_{j=1}^C y_{ij} = n_i$. If $x_i$ is continuous, then $n_i = 1$.
• The probability that a response with covariate vector $x_i$ belongs to
category $j$ is

$$\pi_j(x_i) = \frac{f^{(j)}(x_i)}{\sum_{h=1}^C f^{(h)}(x_i)}.$$
33
![Page 64: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/64.jpg)
Split criterion of multi-class classification
We assume the logarithm of the regression function is a sum of trees,

$$\log\big(f^{(j)}(x)\big) = \sum_{l=1}^L g\big(x; T_l^{(j)}, \mu_l^{(j)}\big),$$

which leads to a multinomial logistic trees model

$$\pi_j(x_i) = \frac{\exp\big[\sum_{l=1}^L g\big(x; T_l^{(j)}, \mu_l^{(j)}\big)\big]}{\sum_{h=1}^C \exp\big[\sum_{l=1}^L g\big(x; T_l^{(h)}, \mu_l^{(h)}\big)\big]}.$$

Let $\lambda_{lb} = \exp(\mu_{lb})$, so that

$$f(x) = \exp\Big[\sum_{l=1}^L g(x; T_l, \mu_l)\Big] = \prod_{l=1}^L g(x; T_l, \Lambda_l),$$

where $g(x; T_l, \Lambda_l) = \lambda_{lb}$ if $x \in A_{lb}$ for $1 \le b \le B_l$.
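In this parameterization each class score is a product of leaf parameters, and the class probabilities come from normalizing across classes. A small sketch (names are ours):

```python
import numpy as np

def class_probs(leaf_params):
    """leaf_params[j][l]: leaf value lambda for tree l of class j at a point x.
    f^(j)(x) is the product over trees; pi_j normalizes across classes."""
    f = np.array([np.prod(lams) for lams in leaf_params])
    return f / f.sum()

# Two classes, two trees each: f = (2*3, 1*4) = (6, 4) -> probabilities (0.6, 0.4)
print(class_probs([[2.0, 3.0], [1.0, 4.0]]))
```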
34
![Page 65: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/65.jpg)
Split criterion of multi-class classification
The likelihood of each covariate value is

$$p_{MN}(y_i) = \binom{n_i}{y_{i1}\, y_{i2} \cdots y_{iC}} \frac{\prod_{j=1}^C f^{(j)}(x_i)^{y_{ij}}}{\big[\sum_{j=1}^C f^{(j)}(x_i)\big]^{n_i}}.$$

We apply the data augmentation strategy of Murray (2017). Introducing
a latent variable $\phi_i$, the augmented likelihood is

$$p_{MN}(y_i, \phi_i) = \binom{n_i}{y_{i1}\, y_{i2} \cdots y_{iC}} \prod_{j=1}^C f^{(j)}(x_i)^{y_{ij}} \cdot \frac{\phi_i^{n_i-1}}{\Gamma(n_i)} \exp\Big[-\phi_i \sum_{j=1}^C f^{(j)}(x_i)\Big]$$

$$= \binom{n_i}{y_{i1}\, y_{i2} \cdots y_{iC}} \frac{\phi_i^{n_i-1}}{\Gamma(n_i)} \prod_{j=1}^C f^{(j)}(x_i)^{y_{ij}} \exp\big[-\phi_i f^{(j)}(x_i)\big].$$

The augmentation introduces one latent variable $\phi_i$ per data
observation, $1 \le i \le N$.
35
![Page 66: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/66.jpg)
Split criterion of multi-class classification
Assume an independent conjugate prior $\lambda_{lb} \sim \text{Gamma}(a_1, a_2)$ for each
leaf parameter in $\Lambda_l$. The integrated likelihood is

$$\mathcal{L}(T_l; T_{(l)}, \Lambda_{(l)}, \theta, y) = \int \mathcal{L}(T_l, \Lambda_l; T_{(l)}, \Lambda_{(l)}, \theta, y)\, p(\Lambda_l)\, d\Lambda_l$$

$$= \int c_l \prod_{b=1}^{B_l} \lambda_{lb}^{r_{lb}} \exp[-s_{lb}\lambda_{lb}] \cdot \frac{\lambda_{lb}^{a_1-1} a_2^{a_1} e^{-a_2\lambda_{lb}}}{\Gamma(a_1)}\, d\lambda_{lb}$$

$$\propto \prod_{b=1}^{B_l} \frac{\Gamma(a_1 + r_{lb})}{(a_2 + s_{lb})^{a_1 + r_{lb}}}.$$

The integrated likelihood of one leaf is

$$m\big(s_{lb} \mid a_1, a_2, \{\phi_i\}_{i=1}^N\big) = \frac{\Gamma(a_1 + r_{lb})}{(a_2 + s_{lb})^{a_1 + r_{lb}}}.$$
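This per-leaf marginal is easy to evaluate in log space. A sketch using the standard library's `math.lgamma` for numerical stability:

```python
import math

def log_leaf_marginal(r, s, a1, a2):
    """log m(s_lb) = log Gamma(a1 + r) - (a1 + r) * log(a2 + s)."""
    return math.lgamma(a1 + r) - (a1 + r) * math.log(a2 + s)

# Larger counts r with small s raise the criterion, favoring that split.
print(log_leaf_marginal(r=3, s=1.0, a1=1.0, a2=1.0))
```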
36
![Page 67: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/67.jpg)
Sampling non-tree parameters in multi-class classification
The sampling steps for the leaf and non-tree parameters are:
• For $1 \le j \le C$, update each leaf parameter of $f^{(j)}$
independently when updating leaf parameters in the GrowFromRoot
algorithm:

$$\lambda_{lb} \sim \text{Gamma}(a_1 + r_{lb},\; a_2 + s_{lb}).$$

• For $1 \le i \le N$, update the latent variable after updating a tree in
the XBART forest algorithm:

$$\phi_i \sim \text{Gamma}\Big(n_i,\; \sum_{j=1}^C f^{(j)}(x_i)\Big).$$
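Both full conditionals are ordinary Gamma draws. A sketch with numpy (note that numpy's generator uses a shape/scale parameterization, so the rate above becomes `1/rate` here):

```python
import numpy as np

rng = np.random.default_rng(42)

def draw_leaf_param(r, s, a1, a2, rng):
    """lambda_lb ~ Gamma(shape = a1 + r, rate = a2 + s)."""
    return rng.gamma(shape=a1 + r, scale=1.0 / (a2 + s))

def draw_latent(n_i, total_f, rng):
    """phi_i ~ Gamma(shape = n_i, rate = sum_j f^(j)(x_i))."""
    return rng.gamma(shape=n_i, scale=1.0 / total_f)

lam = draw_leaf_param(r=3, s=1.0, a1=1.0, a2=1.0, rng=rng)
phi = draw_latent(n_i=1, total_f=2.5, rng=rng)
print(lam > 0 and phi > 0)  # Gamma draws are strictly positive
```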
37
![Page 68: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/68.jpg)
Some Theory
![Page 69: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/69.jpg)
Markov chain
The algorithm sampling the forest is a finite-state Markov chain with a
stationary distribution.
• Each iteration depends only on the previous iteration, not on all
forests before that.
• The forest state space is finite because of the maximum depth and the
finite set of split-point candidates.
• The probability of drawing a single tree is the product of the
integrated likelihoods over all cut points, which is non-zero.
• There is at least one path from any forest to any other: re-grow the
trees one by one.
38
![Page 70: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/70.jpg)
Consistency of a single regression tree
• We prove consistency of a single regression tree based on the
consistency result for CART (Scornet et al. 2015).
• The framework of the proof is the same; we verify that XBART
satisfies the key lemmas for its specific split criterion function.
• Random sampling of split points can be converted to optimization
via the perturb-max lemma.
39
![Page 71: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/71.jpg)
Perturb max lemma
Lemma
Suppose at a specific node there are $|C|$ finite split-point
candidates $c_{jk}$, and we draw one of them with
probability $P(c_{jk}) = \frac{\exp(\ell(c_{jk}))}{\sum_{c_{jk} \in C} \exp(\ell(c_{jk}))}$. We have

$$P\Big(c_{jk} = \arg\max_{c_{jk} \in C}\, \ell(c_{jk}) + \gamma_{jk}\Big) = \frac{\exp(\ell(c_{jk}))}{\sum_{c_{jk} \in C} \exp(\ell(c_{jk}))},$$

where the $\gamma_{jk}$ are independent random draws from the
Gumbel(0, 1) distribution with density
$p(x) = \exp(-x - \exp(-x))$.

Random sampling is equivalent to optimization with an additional
randomly drawn constant.
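This Gumbel-max equivalence can be checked empirically (a sketch; `l` is an arbitrary vector of split-criterion values standing in for $\ell(c_{jk})$):

```python
import numpy as np

rng = np.random.default_rng(0)
l = np.array([0.5, 1.5, -0.3])            # split-criterion values l(c_jk)
softmax = np.exp(l) / np.exp(l).sum()     # target sampling probabilities

# Gumbel-max: argmax of l + Gumbel(0,1) noise follows the softmax distribution.
draws = 200_000
gumbel = rng.gumbel(size=(draws, len(l)))
counts = np.bincount(np.argmax(l + gumbel, axis=1), minlength=len(l))
print(np.abs(counts / draws - softmax).max() < 0.01)
```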
40
![Page 72: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/72.jpg)
Perturb max lemma
Following the perturb-max lemma, we optimize

$$\arg\max_{c_{jk} \in C}\, \ell(c_{jk}) + \gamma_{jk},$$

which is equivalent to

$$\arg\max_{c_{jk} \in C}\, \frac{\ell(c_{jk})}{n} + \frac{\gamma_{jk}}{n}.$$

Letting $n \to \infty$, our empirical split criterion function $L_n(x)$ converges
to the theoretical version

$$L^*(j, c_{jk}) = \frac{1}{\sigma^2} P\big(x^{(j)} \le c_{jk}\big)\big[E\big(y \mid x^{(j)} \le c_{jk}\big)\big]^2 + \frac{1}{\sigma^2} P\big(x^{(j)} > c_{jk}\big)\big[E\big(y \mid x^{(j)} > c_{jk}\big)\big]^2.$$

This theoretical split criterion is the same as CART’s.
41
![Page 73: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/73.jpg)
Assumption
We prove consistency of a single tree in regression setting.
Assumption (A1)
$$y = \sum_{j=1}^p f_j\big(x^{(j)}\big) + \varepsilon,$$

where $x = (x^{(1)}, \cdots, x^{(p)})$ is uniformly distributed on $[0,1]^p$ and
$\varepsilon \sim N(0, \sigma^2)$.
42
![Page 74: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/74.jpg)
Main theorem
Suppose $d_n$ is a sequence of maximum tree depths and each tree is fully
grown. We have:
Theorem
Assume (A1) holds. Let $n \to \infty$, $d_n \to \infty$ and
$(2^{d_n} - 1)(\log n)^9 / n \to 0$; then a single XBART tree is consistent in the
sense that

$$\lim_{n \to \infty} E\big[f_n(x) - f(x)\big]^2 = 0.$$
43
![Page 75: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/75.jpg)
An important proposition
The total variation of the true function $f$ within leaf node $A$ is

$$\Delta(f, A) = \sup_{x, x' \in A} |f(x) - f(x')|.$$

$A_n(x, \Theta)$ is the leaf node that $x$ falls in.
Proposition
Assume (A1) holds. For all $\rho, \xi > 0$, there exists $N \in \mathbb{N}^*$ such
that, for all $n > N$,

$$P\big[\Delta(f, A_n(x, \Theta)) \le \xi\big] \ge 1 - \rho.$$

As $n \to \infty$, the variation of the true function within every leaf node
becomes arbitrarily small, either because the leaf shrinks to zero or
because the true function is flat there.
44
![Page 76: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/76.jpg)
Proof of proposition
The proof of the proposition proceeds via three key lemmas, which we
verify for XBART in the paper.
1. The proposition is true for a tree grown with the theoretical split
criterion.
2. A tree grown with the empirical split criterion is close enough to
the theoretical tree as $n \to \infty$.
45
![Page 77: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/77.jpg)
Future Research
• XBART enjoys excellent prediction accuracy and is fast.
• Application to Bayesian causal forests, including future
empirical work in causal inference.
• Prove consistency of the full XBART forest.
46
![Page 79: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/79.jpg)
Extra slides
48
![Page 80: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/80.jpg)
Classification Results
We compare XBART with other methods on 20 datasets from the
UCI machine learning repository. All datasets have 3 to 6
categories and 100 to 3,000 observations.
The goal is to demonstrate that the default settings of XBART
perform reasonably compared to the other approaches.
49
![Page 81: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/81.jpg)
Classification Results
| Dataset | rf | gbm | mno | svm | nnet | xbart |
|---|---|---|---|---|---|---|
| balance-scale | 0.848 (0.023) | 0.925 (0.010) | 0.897 (0.021) | 0.909 (0.025) | 0.961 (0.019)* | 0.912 (0.011) |
| car | 0.983 (0.006)* | 0.979 (0.008) | 0.834 (0.019) | 0.774 (0.033) | 0.947 (0.015) | 0.938 (0.018) |
| cardiotocography-3clases | 0.937 (0.009) | 0.949 (0.009)* | 0.894 (0.011) | 0.911 (0.011) | 0.909 (0.013) | 0.931 (0.011) |
| contrac | 0.546 (0.024) | 0.557 (0.023) | 0.516 (0.028) | 0.551 (0.024) | 0.556 (0.028)* | 0.324 (0.058) |
| dermatology | 0.970 (0.016) | 0.972 (0.020)* | 0.968 (0.020) | 0.759 (0.024) | 0.970 (0.022) | 0.972 (0.018)* |
| glass | 0.798 (0.062)* | 0.771 (0.055) | 0.622 (0.066) | 0.679 (0.054) | 0.673 (0.064) | 0.702 (0.076) |
| heart-cleveland | 0.578 (0.033) | 0.586 (0.039) | 0.587 (0.039) | 0.620 (0.038)* | 0.603 (0.052) | 0.583 (0.034) |
| heart-va | 0.357 (0.071)* | 0.320 (0.067) | 0.349 (0.069) | 0.315 (0.069) | 0.302 (0.080) | 0.308 (0.064) |
| iris | 0.948 (0.034) | 0.945 (0.034) | 0.965 (0.029)* | 0.947 (0.034) | 0.954 (0.045) | 0.954 (0.033) |
| lymphography | 0.866 (0.063)* | 0.853 (0.057) | 0.821 (0.069) | 0.850 (0.062) | 0.818 (0.077) | 0.835 (0.058) |
| pittsburg-bridges-MATERIAL | 0.840 (0.058) | 0.844 (0.049) | 0.834 (0.061) | 0.860 (0.048)* | 0.824 (0.066) | 0.849 (0.046) |
| pittsburg-bridges-REL-L | 0.725 (0.084)* | 0.681 (0.093) | 0.650 (0.087) | 0.692 (0.082) | 0.659 (0.091) | 0.680 (0.083) |
| pittsburg-bridges-SPAN | 0.637 (0.098) | 0.648 (0.101) | 0.675 (0.100) | 0.681 (0.099)* | 0.647 (0.102) | 0.628 (0.100) |
| pittsburg-bridges-TYPE | 0.609 (0.088)* | 0.581 (0.089) | 0.549 (0.089) | 0.540 (0.075) | 0.565 (0.092) | 0.585 (0.093) |
| seeds | 0.940 (0.030) | 0.940 (0.031) | 0.948 (0.035)* | 0.929 (0.029) | 0.944 (0.034) | 0.945 (0.042) |
| synthetic-control | 0.984 (0.012) | 0.971 (0.015) | 0.984 (0.012) | 0.716 (0.024) | 0.987 (0.011)* | 0.983 (0.017) |
| teaching | 0.622 (0.085)* | 0.557 (0.086) | 0.526 (0.072) | 0.547 (0.073) | 0.525 (0.087) | 0.491 (0.086) |
| vertebral-column-3clases | 0.847 (0.038) | 0.829 (0.038) | 0.861 (0.033)* | 0.839 (0.042) | 0.859 (0.033) | 0.842 (0.039) |
| wine-quality-red | 0.702 (0.021)* | 0.631 (0.026) | 0.597 (0.024) | 0.576 (0.026) | 0.597 (0.025) | 0.613 (0.022) |
| wine | 0.985 (0.018)* | 0.979 (0.021) | 0.979 (0.026) | 0.980 (0.025) | 0.977 (0.027) | 0.969 (0.032) |

\* marks the best method per dataset.
50
50
![Page 82: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/82.jpg)
Tree is Prone to Overfitting
[Figure: four scatter plots of Boston$medv against Boston$lstat with fitted trees, illustrating overfitting.]
51
![Page 83: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/83.jpg)
Regularization
Three secrets of a successful tree model.
Regularization, regularization and regularization.
The strategies for preventing overfitting are diverse, but the basic
idea is to favor smaller trees.
52
![Page 84: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/84.jpg)
Split criterion of multi-class classification
Let $f_{-l}^{(j)}(x) = \prod_{h \ne l} g^{(j)}(x; T_h, \Lambda_h)$ be the total fit of all trees
except the $l$-th one. The likelihood of all $n$ covariates has the form

$$\mathcal{L}(T_l, \Lambda_l; T_{(l)}, \Lambda_{(l)}, y) = \prod_{i=1}^n w_i\, f^{(j)}(x_i)^{y_{ij}} \exp\big[-v_i f^{(j)}(x_i)\big] = c_l \prod_{b=1}^{B_l} \lambda_{lb}^{r_{lb}} \exp[-s_{lb}\lambda_{lb}],$$

where $B_l$ is the number of leaf nodes of the $l$-th tree and

$$c_l = \prod_{i=1}^n \binom{n_i}{y_{i1}\, y_{i2} \cdots y_{iC}} \frac{\phi_i^{n_i-1}}{\Gamma(n_i)} f_{-l}^{(j)}(x_i)^{y_{ij}},$$

$$r_{lb} = \sum_{i: x_i \in A_{lb}} y_{ij}, \qquad s_{lb} = \sum_{i: x_i \in A_{lb}} \phi_i f_{-l}^{(j)}(x_i).$$
53
![Page 85: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/85.jpg)
Adaptive Cut-points
Cut-points are defined via evenly spaced quantiles of the observed
data $X$.
At each recursion, the cut-points are redefined.
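Quantile-based cut-point candidates can be sketched with numpy (the function name is ours):

```python
import numpy as np

def cutpoint_candidates(x, num_cutpoints):
    """Candidates at evenly spaced interior quantiles of the node's data."""
    qs = np.linspace(0, 1, num_cutpoints + 2)[1:-1]  # drop the endpoints
    return np.quantile(x, qs)

x = np.arange(9.0)  # 0, 1, ..., 8
print(cutpoint_candidates(x, num_cutpoints=3))  # [2. 4. 6.]
```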
[Figure: cut-point candidates at evenly spaced quantiles of x under f(x).]
54
![Page 86: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/86.jpg)
Adaptive Cut-points
54
![Page 87: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/87.jpg)
Adaptive Cut-points
“Zoom in.” This is easy because the data are ordered.
54
![Page 88: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/88.jpg)
Prune CART
Loop over all possible sub-trees $T$ and calculate the loss function

$$C_\alpha(T) = C(T) + \alpha|T|,$$

where $C(T)$ is the prediction error of sub-tree $T$ and $|T|$ is a
measure of tree complexity. $\alpha$ is a tuning parameter. Pick the
sub-tree with the lowest loss.
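The selection rule can be sketched in a few lines (`error` and `num_leaves` stand in for $C(T)$ and $|T|$ of a candidate sub-tree):

```python
def cost_complexity(error, num_leaves, alpha):
    """C_alpha(T) = C(T) + alpha * |T|."""
    return error + alpha * num_leaves

# Candidate sub-trees as (error, leaves); pick the one with the lowest C_alpha.
candidates = [(10.0, 8), (12.0, 4), (15.0, 2)]
best = min(candidates, key=lambda t: cost_complexity(*t, alpha=1.0))
print(best)  # (12.0, 4): moderate error and moderate size wins at alpha = 1
```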
55
![Page 89: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/89.jpg)
Gradient boosting trees
$$\text{Obj} = \sum_{i=1}^n \text{loss}(y_i, \hat{y}_i) + \sum_{l=1}^L \Omega(g(T_l, \mu_l))$$

$$\text{Obj}^{(t)} = \sum_{i=1}^n \Big[2\big(\hat{y}^{(t-1)} - y_i\big)\, g_t(x_i) + g_t(x_i)^2\Big] + \Omega(g_t) + \text{const}$$
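Under squared-error loss, minimizing this objective means each new tree $g_t$ fits the current residuals. A toy sketch where the base learner is just a constant (standing in for a regression tree, illustrative only):

```python
import numpy as np

def boost(y, rounds=10, lr=0.5):
    """Toy gradient boosting: each round fits the residuals y - y_hat
    with a trivial base learner (here a constant, in place of g_t)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.zeros_like(y)
    for _ in range(rounds):
        g_t = (y - y_hat).mean()  # minimizer of the quadratic objective in g_t
        y_hat += lr * g_t         # shrinkage step
    return y_hat

y = [1.0, 3.0]
print(boost(y))  # predictions approach the minimizer (the mean of y, 2.0)
```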
56
![Page 90: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/90.jpg)
Gyorfi et al. (2006)
Theorem (Gyorfi et al. (2006))
Assume that
1. $\lim_{n\to\infty} \beta_n = \infty$;
2. $\lim_{n\to\infty} E\Big[\inf_{\substack{m \in M_n(\Theta) \\ \|m\|_\infty \le \beta_n}} E_X[m(x) - f(x)]^2\Big] = 0$;
3. for all $L > 0$,

$$\lim_{n\to\infty} E\Bigg[\sup_{\substack{m \in M_n(\Theta) \\ \|m\|_\infty \le \beta_n}} \bigg|\frac{1}{a_n}\sum_{i=1}^n \big[m(x_i) - y_{i,L}\big]^2 - E\big[m(x) - y_L\big]^2\bigg|\Bigg] = 0.$$

Then

$$\lim_{n\to\infty} E\big[T_{\beta_n} f_n(X, \Theta) - f(X)\big]^2 = 0.$$
57
![Page 91: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/91.jpg)
Lemma 1
Lemma
Assume that (A1) holds. Then for all $x \in [0,1]^p$,
$\Delta(f, A_k^*(x, \Theta)) \to 0$ almost surely as $k \to \infty$.
58
![Page 92: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/92.jpg)
Lemma 2
Lemma
Assume that (A1) holds. Fix $x \in [0,1]^p$, $k \in \mathbb{N}^*$ and let $\xi > 0$.
Then $L_{n,k}(x, \cdot)$ is stochastically equicontinuous on $A_k^\xi(x)$; that is,
for all $\alpha, \rho > 0$, there exists $\delta > 0$ such that

$$\lim_{n\to\infty} P\Bigg[\sup_{\substack{\|c_k - c'_k\|_\infty \le \delta \\ c_k, c'_k \in A_k^\xi(x)}} \big|L_{n,k}(x, c_k) - L_{n,k}(x, c'_k)\big| > \alpha\Bigg] \le \rho.$$
59
![Page 93: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/93.jpg)
Lemma 3
Lemma
Assume that (A1) holds. Fix $\xi, \rho > 0$ and $k \in \mathbb{N}^*$. Then there
exists $N \in \mathbb{N}^*$ such that, for all $n \ge N$,

$$P\big[c_\infty(c_{k,n}(x, \Theta), A_k^*(x, \Theta)) \le \xi\big] \ge 1 - \rho.$$
60
![Page 94: Stochastic Tree Ensembles for Regularized Supervised Learning · Stochastic Tree Ensembles for Regularized Supervised Learning Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn](https://reader035.vdocuments.site/reader035/viewer/2022071004/5fc10f392ceac73fb37251b9/html5/thumbnails/94.jpg)
References
Jingyu He, Saar Yalov, and P. Richard Hahn. XBART: Accelerated
Bayesian additive regression trees. AISTATS 2019.
Jingyu He, Saar Yalov, Jared Murray, and P. Richard Hahn. Stochastic
tree ensembles for regularized supervised learning. Technical
report.
61