
Stochastic Tree Ensembles for Regularized Supervised Learning

Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn

November 5, 2019

The University of Chicago Booth School of Business


Elements of supervised learning

• x ∈ R^p is a vector of covariates; the outcome y is continuous (regression) or discrete (classification).

• Goal: predict y given x.

• We want to learn the functional form f (·) in

y = f (x)    (1)

• Linear model: y = xβ (or logistic regression)

• Non-parametric models: trees, deep learning


Simulation and Empirical Results


Regression Simulation

We draw 30 covariates x from U(0, 1) and set y = f (x) + ε, where ε ∼ N(0, σ²) and σ² = κ Var(f ); κ is the noise-to-signal ratio.

Name           Function
Linear         xᵗγ;   γj = −2 + 4(j − 1)/(d − 1)
Single index   10√a + sin(5a);   a = Σ_{j=1}^{10} (xj − γj)²;   γj = −1.5 + (j − 1)/3
Trig + poly    5 sin(3x1) + 2x2² + 3x3x4
Max            max(x1, x2, x3)

We compare with neural networks, random forests, and XGBoost with and without cross-validation.
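For concreteness, a small Python sketch of this data-generating process (an illustration, not the authors' benchmarking code; the function and variable names are ours):

```python
import numpy as np

def simulate(n=10_000, d=30, kappa=1.0, target="max", seed=0):
    """Draw x ~ U(0,1)^d and y = f(x) + eps with Var(eps) = kappa * Var(f)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n, d))
    if target == "linear":
        gamma = -2 + 4 * (np.arange(1, d + 1) - 1) / (d - 1)
        f = x @ gamma
    elif target == "single_index":
        gamma = -1.5 + (np.arange(1, 11) - 1) / 3
        a = ((x[:, :10] - gamma) ** 2).sum(axis=1)
        f = 10 * np.sqrt(a) + np.sin(5 * a)
    elif target == "trig_poly":
        f = 5 * np.sin(3 * x[:, 0]) + 2 * x[:, 1] ** 2 + 3 * x[:, 2] * x[:, 3]
    elif target == "max":
        f = np.max(x[:, :3], axis=1)
    else:
        raise ValueError(target)
    sigma2 = kappa * f.var()                  # noise variance scales with Var(f)
    y = f + rng.normal(scale=np.sqrt(sigma2), size=n)
    return x, y, f

x, y, f = simulate(kappa=10.0, target="linear")
```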


Simulation, signal to noise = 1:1, κ = 1 and n = 10K

Function       Method             RMSE    Seconds
Linear         XBART              1.74    20
Linear         XGBoost Tuned      2.63    64
Linear         XGBoost Untuned    3.23    < 1
Linear         Random Forest      3.56    6
Linear         BART               1.50    117
Linear         Neural Network     1.39    26
Trig + Poly    XBART              1.31    17
Trig + Poly    XGBoost Tuned      2.08    61
Trig + Poly    XGBoost Untuned    2.70    < 1
Trig + Poly    Random Forest      3.04    6
Trig + Poly    BART               1.30    115
Trig + Poly    Neural Network     3.96    26
Max            XBART              0.39    16
Max            XGBoost Tuned      0.42    62
Max            XGBoost Untuned    0.79    < 1
Max            Random Forest      0.41    6
Max            BART               0.44    114
Max            Neural Network     0.40    30
Single Index   XBART              2.27    17
Single Index   XGBoost Tuned      2.65    61
Single Index   XGBoost Untuned    3.65    < 1
Single Index   Random Forest      3.45    6
Single Index   BART               2.03    116
Single Index   Neural Network     2.76    28


Simulation, signal to noise = 1:10, κ = 10 and n = 10K

Function       Method             RMSE     Seconds
Linear         XBART              5.07     16
Linear         XGBoost Tuned      8.04     61
Linear         XGBoost Untuned    21.25    < 1
Linear         Random Forest      6.52     6
Linear         BART               6.64     111
Linear         Neural Network     7.39     12
Trig + Poly    XBART              4.94     16
Trig + Poly    XGBoost Tuned      7.16     61
Trig + Poly    XGBoost Untuned    17.97    < 1
Trig + Poly    Random Forest      6.34     7
Trig + Poly    BART               6.15     110
Trig + Poly    Neural Network     8.20     13
Max            XBART              1.94     16
Max            XGBoost Tuned      2.76     60
Max            XGBoost Untuned    7.18     < 1
Max            Random Forest      2.30     6
Max            BART               2.46     111
Max            Neural Network     2.98     15
Single Index   XBART              7.13     16
Single Index   XGBoost Tuned      10.61    61
Single Index   XGBoost Untuned    28.68    < 1
Single Index   Random Forest      8.99     6
Single Index   BART               8.69     111
Single Index   Neural Network     9.43     14


Larger Simulation, κ = 1, signal to noise = 1:1

RMSE, with run time in seconds in parentheses.

              n      XBART          XGB+CV         XGB          NN
Linear        10k    1.74 (20)      2.63 (64)      3.23 (0)     1.39 (26)
              50k    1.04 (180)     1.99 (142)     2.56 (4)     0.66 (28)
              250k   0.67 (1774)    1.50 (1399)    2.00 (55)    0.28 (40)
Max           10k    0.39 (16)      0.42 (62)      0.79 (0)     0.40 (30)
              50k    0.25 (134)     0.29 (140)     0.58 (4)     0.20 (32)
              250k   0.14 (1188)    0.21 (1554)    0.41 (60)    0.16 (44)
Single Index  10k    2.27 (17)      2.65 (61)      3.65 (0)     2.76 (28)
              50k    1.54 (153)     1.61 (141)     2.81 (4)     1.93 (31)
              250k   1.14 (1484)    1.18 (1424)    2.16 (55)    1.67 (41)
Trig + Poly   10k    1.31 (17)      2.08 (61)      2.70 (0)     3.96 (26)
              50k    0.74 (147)     1.29 (141)     1.67 (4)     3.33 (29)
              250k   0.45 (1324)    0.82 (1474)    1.11 (59)    2.56 (41)


Larger Simulation, κ = 10, signal to noise = 1:10

RMSE, with run time in seconds in parentheses.

              n      XBART          XGB+CV         XGB           NN
Linear        10k    5.07 (16)      8.04 (61)      21.25 (0)     7.39 (12)
              50k    3.16 (135)     5.47 (140)     16.17 (4)     3.62 (14)
              250k   2.03 (1228)    3.15 (1473)    11.49 (54)    1.89 (19)
Max           10k    1.94 (16)      2.76 (60)      7.18 (0)      2.98 (15)
              50k    1.22 (133)     1.85 (139)     5.49 (4)      1.63 (16)
              250k   0.75 (1196)    1.05 (1485)    3.85 (54)     0.85 (22)
Single Index  10k    7.13 (16)      10.61 (61)     28.68 (0)     9.43 (14)
              50k    4.51 (133)     6.91 (139)     21.18 (4)     6.42 (16)
              250k   3.06 (1214)    4.10 (1547)    14.82 (54)    4.72 (21)
Trig + Poly   10k    4.94 (16)      7.16 (61)      17.97 (0)     8.20 (13)
              50k    3.01 (132)     4.92 (139)     13.30 (4)     5.53 (14)
              250k   1.87 (1216)    3.17 (1462)    9.37 (49)     4.13 (20)


Motivation

Grow a tree recursively

[Figure, built over several slides: the (x1, x2) feature space is split recursively. Candidate cut points on x1 (0.3, 0.5, 0.7) are shown before the root settles on the split x1 < 0.5 (yes → leaf µ1, no → right child); candidate cut points on x2 are then shown before the right child settles on x2 < 0.7, producing leaves µ2 and µ3.]

The prediction for a new observation is the parameter of the leaf it falls into (µ3 in the illustrated case).

[Figure 9.2 from The Elements of Statistical Learning: the top-right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data; the top-left panel shows a general partition that cannot be obtained by recursive binary splitting; the bottom-left panel shows the tree corresponding to the top-right partition, and the bottom-right panel shows a perspective plot of the prediction surface.]

A tree is essentially a step function: it partitions on one variable at a time and predicts with a constant in each region. Figure from Friedman, Hastie, and Tibshirani (2001).


Sum of trees

[Figure: two axis-aligned partitions of the (x1, x2) space, one with leaf values µ1, µ2, µ3 and one with leaf values θ1, θ2, θ3, are added; the result is a finer partition whose regions take values such as θ1 + µ1, θ1 + µ2, θ2 + µ2, θ2 + µ3, θ3 + µ1, θ3 + µ2 and θ3 + µ3.]

A sum of trees (a tree ensemble, or forest) implies an extra level of smoothing.


Why a tree / forest

• Widely used: more than half of the winning entries in Kaggle data mining competitions use variants of tree ensemble methods.

• Invariant to scaling of the input variables; no need to worry about feature normalization.

• Learns high-order interactions between features.

• Random forests take the average of multiple trees, grown independently.

• Boosting takes the sum of multiple trees, grown sequentially.


Random Forest

[Figure, repeated over several build slides: one-dimensional data (x ∈ [0, 1] on the horizontal axis, y on the vertical axis) with random forest fits overlaid.]

Boosting

[Figure, repeated over several build slides: one-dimensional data (x ∈ [0, 1], y) with boosting fits overlaid as trees are added sequentially.]


Classification and Regression Trees (CART)

• Probably the most popular tree-growing algorithm.

• Grow the tree until it is very large, then prune it back.

• The CART split criterion minimizes the L2 loss

Σ_{i ∈ left} (yi − ȳleft)² + Σ_{i ∈ right} (yi − ȳright)²

• Split points and leaf parameters might conspire to make a bad split point look better than it is.

• The split criterion is optimized; what should we do when two split points have nearly identical criterion values, say 10.00 and 9.99?


Intuition for XBART

• A randomized algorithm: split points are sampled rather than optimized.

• A new split criterion with a probabilistic interpretation.

• Early stopping, rather than pruning after over-growing.

• Split nodes and estimate leaf parameters separately.

• Tree ensemble.


Comparison of tree-based algorithms

                     CART             RF               XGB              XBART                   BART
Leaf parameters      optimized        optimized        optimized        integrated out at       integrated out at
                     with splits      with splits      with splits      split, then sampled     split, then sampled
Criterion            likelihood       likelihood       likelihood       marginal likelihood     marginal likelihood
Aggregation          no               aggregation      no               aggregation             aggregation
                                      of trees                          of forests              of forests
Sequential fitting   no               no               yes              yes                     yes
Iterations           no               no               no               yes                     no
Recursion            yes              yes              yes              yes                     no


XBART regression


Split criterion of Gaussian regression

• Assume a Gaussian likelihood N(µ, σ²) on one leaf node.

• Prior µ ∼ N(0, τ).

The integrated likelihood is

p(Y | τ, σ²) = ∫ N(Y | µ, σ²In) N(µ | 0, τ) dµ = N(0, τJJᵗ + σ²In).

Re-arranging terms, ignoring constants in the density and taking logarithms,

log m(s | τ, σ²) = log( σ² / (σ² + τn) ) + ( τ / ( σ²(σ² + τn) ) ) s²,

where s = Σ y is the sum of all y in the leaf node.
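As a concrete illustration, the criterion can be computed directly from a leaf's sufficient statistics. A minimal Python sketch, following the formula above with its constants dropped:

```python
import numpy as np

def log_marginal(s, n, tau, sigma2):
    """log m(s | tau, sigma^2) up to an additive constant, where
    s = sum of y in the leaf and n = number of observations in it."""
    return np.log(sigma2 / (sigma2 + tau * n)) + tau * s**2 / (sigma2 * (sigma2 + tau * n))
```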


Split criterion of a single tree

• A split point candidate cjk partitions the current node into left and right child nodes.

• The two child nodes have sufficient statistics s^l_jk and s^r_jk.

• Assume the data in the two child nodes are independent.

The joint integrated likelihood of the two child nodes is

l(cjk) := m( s^l_jk | τ, σ² ) m( s^r_jk | τ, σ² ).

This is the split criterion for cjk.


Why integrated-likelihood

• Most tree algorithms optimize the split point and leaf parameters simultaneously.

• A bad split point can look better than it is, in collusion with the leaf parameters, because of random noise in the data.

• We split nodes and estimate leaf parameters separately.


No-split option (Regularization)

Furthermore, the split criterion for the no-split option is defined as

lstop = |C| ( (1 + d)^β / α − 1 ) m( sall | Φ, Ψ )

• d is the depth of the current node.

• sall is the sufficient statistic of all data in the current node.

• α, β are hyper-parameters.


Weight of no-split option

• Each split point candidate has prior weight proportional to 1.

• The no-split option has prior weight proportional to |C| ( (1 + d)^β / α − 1 ).

• There are |C| split candidates in total.

Therefore the implied prior probability of splitting is

split weight / (split weight + no-split weight) = |C| / ( |C| ( (1 + d)^β / α − 1 ) + |C| ) = α (1 + d)^{−β},

which is the same as the prior probability of a split in BART.


Sample one split point (or stop)

Sample one of them according to the probabilities

P(cjk) = l(cjk) / ( Σ_{cjk ∈ C} l(cjk) + lstop ),    P(stop) = lstop / ( Σ_{cjk ∈ C} l(cjk) + lstop ).
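A minimal sketch of this sampling step, assuming the criteria are held on the log scale (as in the formulas above) so they can be exponentiated stably before normalization; the function name is ours.

```python
import numpy as np

def sample_cutpoint(log_crit, log_stop, rng):
    """Draw a cut point index, or None for the no-split option, with
    probability proportional to exp(log criterion)."""
    logw = np.append(log_crit, log_stop)     # last entry = no-split weight
    w = np.exp(logw - logw.max())            # subtract max for stability
    w /= w.sum()
    k = rng.choice(len(w), p=w)
    return None if k == len(w) - 1 else k

rng = np.random.default_rng(1)
choice = sample_cutpoint(np.array([-3.2, -2.9, -4.1]), log_stop=-3.0, rng=rng)
```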


Split criterion of a single tree

Consider a split point candidate: it partitions the space into left and right children.

[Figure, built over three slides: the (x1, x2) space divided by a candidate cut point; the left child contributes m( s^l_jk | Φ, Ψ ) and the right child contributes m( s^r_jk | Φ, Ψ ) to the criterion.]


No-split option

If we stop splitting, all data are left in a single node.

[Figure: the undivided (x1, x2) space, with no-split weight |C| ( (1 + d)^β / α − 1 ) m( s∅ | Φ, Ψ ).]


GrowFromRoot algorithm

In addition to the no-split option, we keep the usual stopping conditions: maximum tree depth, minimum leaf size, etc. The algorithm for growing a single tree (GrowFromRoot) is:

1. Start from the root node.

2. Evaluate the split criterion for all split point candidates and the no-split option, then sample one of them.

   If no-split is drawn or another stopping criterion is reached: update the leaf parameter and return.

   Else: split the current node into left and right children and repeat step 2 for both child nodes.
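A schematic Python version of GrowFromRoot under these rules, reusing `log_marginal` and `sample_cutpoint` from the earlier sketches; the quantile cut-point grid and the leaf draw are simplified inline versions of steps described on later slides, and the bookkeeping is much lighter than in the actual implementation.

```python
import numpy as np

def grow_from_root(x, y, depth, tau, sigma2, alpha, beta, rng,
                   max_depth=10, min_leaf=5):
    """Recursively grow one tree; returns a nested dict representing the tree."""

    def leaf():
        prec = 1.0 / tau + len(y) / sigma2                 # posterior precision of mu
        return {"leaf": rng.normal(y.sum() / sigma2 / prec, np.sqrt(1.0 / prec))}

    def cutpoints(col, m=20):                              # quantile grid (see the
        return np.unique(np.quantile(col, np.linspace(0.05, 0.95, m)))  # cut-point slide)

    n = len(y)
    if depth >= max_depth or n < 2 * min_leaf:
        return leaf()

    cands, log_crit = [], []
    for j in range(x.shape[1]):
        for c in cutpoints(x[:, j]):
            go_left = x[:, j] <= c
            nl = int(go_left.sum())
            if min_leaf <= nl <= n - min_leaf:
                cands.append((j, c))
                log_crit.append(log_marginal(y[go_left].sum(), nl, tau, sigma2)
                                + log_marginal(y[~go_left].sum(), n - nl, tau, sigma2))
    if not cands:
        return leaf()

    # no-split weight |C| ((1+d)^beta / alpha - 1) m(s_all), on the log scale
    log_stop = (np.log(len(cands)) + np.log((1 + depth) ** beta / alpha - 1)
                + log_marginal(y.sum(), n, tau, sigma2))
    k = sample_cutpoint(np.array(log_crit), log_stop, rng)
    if k is None:                                          # no-split drawn: stop early
        return leaf()
    j, c = cands[k]
    go_left = x[:, j] <= c
    return {"var": j, "cut": c,
            "left": grow_from_root(x[go_left], y[go_left], depth + 1,
                                   tau, sigma2, alpha, beta, rng, max_depth, min_leaf),
            "right": grow_from_root(x[~go_left], y[~go_left], depth + 1,
                                    tau, sigma2, alpha, beta, rng, max_depth, min_leaf)}
```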


XBART forest

Next is the algorithm for growing a forest. We fit a sum of trees (regression) or a product of trees (classification).

Suppose we sample I forests with L trees in each forest.

For iter in 1 to I:
    For h in 1 to L:
        GrowFromRoot fits the target r_h^{iter}.
        Update r_{h+1}^{iter}, the target for the next tree to fit.
        Update the other non-tree parameters Ψ.
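A sketch of this outer loop in Python, assuming the `grow_from_root` sketch above and the `sample_sigma2` draw sketched a few slides below; `predict_tree` and the inverse-Gamma hyperparameters a, b are illustrative additions of ours.

```python
import numpy as np

def predict_tree(tree, x):
    """Evaluate a tree returned by grow_from_root at each row of x."""
    if "leaf" in tree:
        return np.full(len(x), tree["leaf"])
    go_left = x[:, tree["var"]] <= tree["cut"]
    out = np.empty(len(x))
    out[go_left] = predict_tree(tree["left"], x[go_left])
    out[~go_left] = predict_tree(tree["right"], x[~go_left])
    return out

def xbart_fit(x, y, num_sweeps=40, num_trees=50, burnin=15, seed=0):
    """Sweep over the forest, re-fitting each tree to the residual of the others."""
    rng = np.random.default_rng(seed)
    tau, sigma2 = y.var() / num_trees, y.var()
    alpha, beta = 0.95, 1.25
    a, b = 3.0, 1.0                                   # illustrative inverse-Gamma prior
    fits = np.tile(y / num_trees, (num_trees, 1))     # each tree starts by fitting Y/L
    kept = []
    for sweep in range(num_sweeps):
        for h in range(num_trees):
            target = y - fits.sum(axis=0) + fits[h]   # partial residual r_h
            tree = grow_from_root(x, target, 0, tau, sigma2, alpha, beta, rng)
            fits[h] = predict_tree(tree, x)
            sigma2 = sample_sigma2(y - fits.sum(axis=0), a, b, rng)  # non-tree parameter
        if sweep >= burnin:
            kept.append(fits.sum(axis=0).copy())      # keep post-burn-in forest fits
    return np.mean(kept, axis=0)                      # in-sample posterior-mean fit
```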


Take regression (additive trees) as an example, with three trees.

• In the first sweep, the first tree fits the target R^1_1 = (1/3)Y, with fitted values R̂^1_1; the second tree fits R^1_2 = (1/3)Y, with fitted values R̂^1_2; and the third tree fits R^1_3 = (1/3)Y, with fitted values R̂^1_3.

• In the next sweep of the forest, the first tree is re-fit to the target R^2_1 = Y − R̂^1_2 − R̂^1_3, giving fitted values R̂^2_1; the second tree is re-fit to R^2_2 = Y − R̂^2_1 − R̂^1_3, giving R̂^2_2; and the third tree is re-fit to R^2_3 = Y − R̂^2_1 − R̂^2_2, giving R̂^2_3.


Update leaf parameter µ and residual variance

We assume a µ_lb ∼ N(0, τ) prior on each leaf and update it by

µ_lb ∼ N( ( Σ y / σ² ) / ( 1/τ + n_lb/σ² ),  1 / ( 1/τ + n_lb/σ² ) ),

where Σ y is the sum of the data in the leaf and n_lb is the number of observations in the leaf.

Assume the usual inverse-Gamma prior on σ² and update it in between trees:

σ² ∼ inverse-Gamma( N + a,  r^{(iter)ᵗ}_h r^{(iter)}_h + b ),

where r^{(iter)}_h is the total residual of all trees.
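The two conditional draws written out directly from these formulas (a sketch; the actual implementation may scale or parameterize them differently).

```python
import numpy as np

def sample_leaf_mean(y_leaf, tau, sigma2, rng):
    """mu_lb | data ~ N( (sum y / sigma^2) / (1/tau + n/sigma^2),
                          1 / (1/tau + n/sigma^2) )."""
    prec = 1.0 / tau + len(y_leaf) / sigma2
    return rng.normal((y_leaf.sum() / sigma2) / prec, np.sqrt(1.0 / prec))

def sample_sigma2(resid, a, b, rng):
    """sigma^2 | data ~ inverse-Gamma(N + a, resid' resid + b):
    draw a Gamma with that shape and rate, then invert it."""
    return 1.0 / rng.gamma(len(resid) + a, 1.0 / (resid @ resid + b))
```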


Default Hyperparameters

We recommend

• L = 50, 100 or 200

• α = 0.95,

• β = 1.25 and

• τ = var(y)/L.

Lower β permits deeper trees (BART’s default is β = 2).

This τ dictates that a priori each tree accounts for 1/L of the observed variance of y.

Our default suggestion is just 40 sweeps through the data,

discarding the first 15 as burn-in.


Final prediction

Given I iterations of the algorithm, keep the final I − I0 forests for prediction, where I0 < I denotes the length of the burn-in period.

Average over the retained draws of the Markov chain:

f̄(X) = ( 1 / (I − I0) ) Σ_{k > I0}^{I} f^{(k)}(X),

where f^{(k)} denotes one sampled forest, i.e. the sum of the trees in that forest.


Variable importance

We also keep track of variable importance. Count how many times each variable is used as a split variable in tree l at iteration k; call this count vector w^{(k)}_l. Update the running counts

w ← w − w^{(k−1)}_l + w^{(k)}_l.

The sampling weights are then redrawn as w̄ ∼ Dirichlet(w), and for the next tree we subsample mtry variables for consideration with probabilities w̄.

w gives a natural measure of variable importance.
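A sketch of this weight bookkeeping, assuming per-tree split-count vectors are available; the names are ours, and w should be kept strictly positive (e.g. initialized to ones).

```python
import numpy as np

def update_split_weights(w, counts_old, counts_new, mtry, rng):
    """Swap tree l's old split counts for its new ones, redraw the sampling
    probabilities, and pick mtry variables for the next tree."""
    w = w - counts_old + counts_new              # running count vector
    probs = rng.dirichlet(w)                     # w_bar ~ Dirichlet(w)
    chosen = rng.choice(len(w), size=mtry, replace=False, p=probs)
    return w, probs, chosen
```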


Generic XBART and Classification


Generic split criterion

It is natural to extend the split criterion to other likelihoods.

• Define a likelihood L(y; µ, Ψ) on one leaf node.

• µ is the leaf parameter; Ψ collects the other model parameters (given and fixed during the tree-growing process).

• Assume a prior π(µ | Φ) on µ.

The integrated likelihood is

m(s | Φ, Ψ) := ∫ L(y; µ, Ψ) π(µ | Φ) dµ,

where s is a sufficient statistic of the data y falling in the current node.


Split criterion of multi-class classification

• Classification with C categories. Suppose each xi is observed ni times in the data.

• The response yij is the number of observations with covariate xi in category j, where 1 ≤ i ≤ n and 1 ≤ j ≤ C.

• So Σ_{j=1}^{C} yij = ni. If xi is continuous, then ni = 1.

• The probability that a response with covariate vector xi belongs to category j is

πj(xi) = f^{(j)}(xi) / Σ_{h=1}^{C} f^{(h)}(xi).


Split criterion of multi-class classification

We assume the logarithm of the regression function is a sum of trees,

log( f^{(j)}(x) ) = Σ_{l=1}^{L} g( x; T^{(j)}_l, µ^{(j)}_l ),

which leads to a multinomial logistic trees model

πj(xi) = exp[ Σ_{l=1}^{L} g( x; T^{(j)}_l, µ^{(j)}_l ) ] / Σ_{h=1}^{C} exp[ Σ_{l=1}^{L} g( x; T^{(h)}_l, µ^{(h)}_l ) ].

Let λ_lb = exp(µ_lb); then

f(x) = exp[ Σ_{l=1}^{L} g( x; Tl, µl ) ] = Π_{l=1}^{L} g( x; Tl, Λl ),

where g(x; Tl, Λl) = λ_lb if x ∈ A_lb, for 1 ≤ b ≤ Bl.


Split criterion of multi-class classification

The likelihood of each covariate value is

pMN(yi) = ( ni choose yi1 yi2 · · · yiC ) Π_{j=1}^{C} f^{(j)}(xi)^{yij} / [ Σ_{j=1}^{C} f^{(j)}(xi) ]^{ni}.

We apply the data augmentation strategy of Murray (2017): introducing a latent variable φi, the augmented likelihood is

pMN(yi, φi) = ( ni choose yi1 yi2 · · · yiC ) [ Π_{j=1}^{C} f^{(j)}(xi)^{yij} ] ( φi^{ni−1} / Γ(ni) ) exp[ −φi Σ_{j=1}^{C} f^{(j)}(xi) ]

            = ( ni choose yi1 yi2 · · · yiC ) ( φi^{ni−1} / Γ(ni) ) Π_{j=1}^{C} f^{(j)}(xi)^{yij} exp[ −φi f^{(j)}(xi) ].

The augmentation introduces one latent variable φi per data observation, 1 ≤ i ≤ N.


Split criterion of multi-class classification

Assume an independent conjugate prior λ_lb ∼ Gamma(a1, a2) for each leaf parameter in Λl. The integrated likelihood is

L(Tl; T(l), Λ(l), θ, y) = ∫ L(Tl, Λl; T(l), Λ(l), θ, y) p(Λl) dΛl

                        = ∫ cl Π_{b=1}^{Bl} λ_lb^{r_lb} exp[ −s_lb λ_lb ] · λ_lb^{a1−1} a2^{a1} e^{−a2 λ_lb} / Γ(a1) dλ_lb

                        ∝ Π_{b=1}^{Bl} Γ(a1 + r_lb) / (a2 + s_lb)^{a1 + r_lb}.

The integrated likelihood of one leaf is

m( s_lb | a1, a2, {φi}_{i=1}^{N} ) = Γ(a1 + r_lb) / (a2 + s_lb)^{a1 + r_lb}.


Sampling non-tree parameters in multi-class classification

The sampling steps for the leaf and non-tree parameters are:

• For 1 ≤ j ≤ C, update each leaf parameter of f^{(j)} independently when updating leaf parameters in the GrowFromRoot algorithm:

λ_lb ∼ Gamma( a1 + r_lb, a2 + s_lb ).

• For 1 ≤ i ≤ N, update the latent variable after updating a tree in the XBART forest algorithm:

φi ∼ Gamma( ni, Σ_{j=1}^{C} f^{(j)}(xi) ).
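The same two conditional draws in code form (a sketch; the leaf statistics r_lb, s_lb and the per-class fits f^(j)(x_i) are assumed to be precomputed as on the previous slides).

```python
import numpy as np

def sample_leaf_lambda(r_lb, s_lb, a1, a2, rng):
    """lambda_lb | data ~ Gamma(shape = a1 + r_lb, rate = a2 + s_lb)."""
    return rng.gamma(a1 + r_lb, 1.0 / (a2 + s_lb))   # numpy takes scale = 1/rate

def sample_phi(n_i, f_values_at_xi, rng):
    """phi_i | data ~ Gamma(shape = n_i, rate = sum_j f^(j)(x_i))."""
    return rng.gamma(n_i, 1.0 / np.sum(f_values_at_xi))
```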


Some Theory


Markov chain

The algorithm sampling the forest is a finite-state Markov chain with a stationary distribution.

• Each iteration relies only on the previous iteration, not on all the forests before it.

• The set of forest states is finite because of the maximum depth and the finite set of split point candidates.

• The probability of drawing a single tree is defined as a product of integrated likelihoods over its cut points, which is non-zero.

• There is at least one way to move from any forest to any other: re-grow the trees one by one.


Consistency of a single regression tree

• We prove consistency of a single regression tree, building on the consistency result for CART (Scornet et al. 2015).

• The framework of the proof is the same; we verify that XBART satisfies the key lemmas related to its specific split criterion function.

• Random sampling of the split point can be converted to optimization via the perturbed-max lemma.


Perturb max lemma

Lemma

Suppose that at a specific node there are |C| finite split-point candidates cjk, and we are interested in drawing one of them according to the probabilities P(cjk) = exp(l(cjk)) / Σ_{cjk ∈ C} exp(l(cjk)). We have

P( cjk = arg max_{cjk ∈ C} { l(cjk) + γjk } ) = exp(l(cjk)) / Σ_{cjk ∈ C} exp(l(cjk)),

where the γjk are independent random draws from the Gumbel(0, 1) distribution with density p(x) = exp(−x − exp(−x)).

Random sampling is therefore equivalent to optimization with an additional randomly drawn perturbation.
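This is the familiar Gumbel-max trick. A quick numerical check (not part of the proof) that the perturbed argmax frequencies match the softmax probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
l = np.array([1.0, 0.2, -0.5])                 # l(c_jk) for three candidates
softmax = np.exp(l) / np.exp(l).sum()          # target sampling probabilities

gumbel = rng.gumbel(size=(200_000, l.size))    # gamma_jk ~ Gumbel(0, 1)
freq = np.bincount(np.argmax(l + gumbel, axis=1), minlength=l.size) / 200_000

print(softmax)   # approximately [0.598, 0.269, 0.133]
print(freq)      # empirical argmax frequencies agree closely
```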


Perturb max lemma

Following the perturbed-max lemma, we optimize

arg max_{cjk ∈ C} { l(cjk) + γjk },

which is equivalent to

arg max_{cjk ∈ C} { l(cjk)/n + γjk/n }.

Letting n → ∞, our empirical split criterion function Ln(x) converges to the theoretical version

L*(j, cjk) = (1/σ²) P( x^{(j)} ≤ cjk ) [ E( y | x^{(j)} ≤ cjk ) ]² + (1/σ²) P( x^{(j)} > cjk ) [ E( y | x^{(j)} > cjk ) ]².

This theoretical split criterion is the same as that of CART.


Assumption

We prove consistency of a single tree in the regression setting.

Assumption (A1)

y = Σ_{j=1}^{p} fj( x^{(j)} ) + ε,

where x = ( x^{(1)}, · · · , x^{(p)} ) is uniformly distributed on [0, 1]^p and ε ∼ N(0, σ²).


Main theorem

Suppose dn is a sequence of maximum tree depths and each tree is fully grown. We have:

Theorem

Assume (A1) holds. If n → ∞, dn → ∞ and (2^{dn} − 1)(log n)⁹ / n → 0, then a single XBART tree is consistent in the sense that

lim_{n→∞} E[ fn(x) − f(x) ]² = 0.


An important proposition

The total variation of the true function f within a leaf node A is

∆(f, A) = sup_{x, x′ ∈ A} |f(x) − f(x′)|,

and An(x, Θ) denotes the leaf node that x falls in.

Proposition

Assume (A1) holds. For all ρ, ξ > 0, there exists N ∈ ℕ* such that, for all n > N,

P[ ∆(f, An(x, Θ)) ≤ ξ ] ≥ 1 − ρ.

As n → ∞, the variation of the true function over every leaf node becomes arbitrarily small: either the leaf size shrinks to zero, or the true function is flat on the leaf.


Proof of proposition

The proof of the proposition relies on three key lemmas, which we verify are valid for XBART in the paper.

1. The proposition is true for a tree grown with the theoretical split criterion.

2. A tree grown with the empirical split criterion is close enough to the theoretical tree as n → ∞.


Future Research

• XBART enjoys great prediction accuracy and is fast.

• Application to Bayesian causal forests, including future empirical work in causal inference.

• Prove consistency of the XBART forest.


Thanks

Download XBART at xbart.ai


Extra slides


Classification Results

We compare XBART with other methods on 20 datasets from the UCI machine learning repository. All datasets have 3 to 6 categories and 100 to 3,000 observations.

The goal is to demonstrate that the default settings of XBART give reasonable performance compared with other approaches.


Accuracy; an asterisk marks the best method on each dataset.

Dataset                      rf              gbm             mno             svm             nnet            xbart
balance-scale                0.848 (0.023)   0.925 (0.010)   0.897 (0.021)   0.909 (0.025)   0.961 (0.019)*  0.912 (0.011)
car                          0.983 (0.006)*  0.979 (0.008)   0.834 (0.019)   0.774 (0.033)   0.947 (0.015)   0.938 (0.018)
cardiotocography-3clases     0.937 (0.009)   0.949 (0.009)*  0.894 (0.011)   0.911 (0.011)   0.909 (0.013)   0.931 (0.011)
contrac                      0.546 (0.024)   0.557 (0.023)   0.516 (0.028)   0.551 (0.024)   0.556 (0.028)*  0.324 (0.058)
dermatology                  0.970 (0.016)   0.972 (0.020)*  0.968 (0.020)   0.759 (0.024)   0.970 (0.022)   0.972 (0.018)*
glass                        0.798 (0.062)*  0.771 (0.055)   0.622 (0.066)   0.679 (0.054)   0.673 (0.064)   0.702 (0.076)
heart-cleveland              0.578 (0.033)   0.586 (0.039)   0.587 (0.039)   0.620 (0.038)*  0.603 (0.052)   0.583 (0.034)
heart-va                     0.357 (0.071)*  0.320 (0.067)   0.349 (0.069)   0.315 (0.069)   0.302 (0.08)    0.308 (0.064)
iris                         0.948 (0.034)   0.945 (0.034)   0.965 (0.029)*  0.947 (0.034)   0.954 (0.045)   0.954 (0.033)
lymphography                 0.866 (0.063)*  0.853 (0.057)   0.821 (0.069)   0.850 (0.062)   0.818 (0.077)   0.835 (0.058)
pittsburg-bridges-MATERIAL   0.840 (0.058)   0.844 (0.049)   0.834 (0.061)   0.860 (0.048)*  0.824 (0.066)   0.849 (0.046)
pittsburg-bridges-REL-L      0.725 (0.084)*  0.681 (0.093)   0.650 (0.087)   0.692 (0.082)   0.659 (0.091)   0.680 (0.083)
pittsburg-bridges-SPAN       0.637 (0.098)   0.648 (0.101)   0.675 (0.100)   0.681 (0.099)*  0.647 (0.102)   0.628 (0.100)
pittsburg-bridges-TYPE       0.609 (0.088)*  0.581 (0.089)   0.549 (0.089)   0.540 (0.075)   0.565 (0.092)   0.585 (0.093)
seeds                        0.940 (0.030)   0.940 (0.031)   0.948 (0.035)*  0.929 (0.029)   0.944 (0.034)   0.945 (0.042)
synthetic-control            0.984 (0.012)   0.971 (0.015)   0.984 (0.012)   0.716 (0.024)   0.987 (0.011)*  0.983 (0.017)
teaching                     0.622 (0.085)*  0.557 (0.086)   0.526 (0.072)   0.547 (0.073)   0.525 (0.087)   0.491 (0.086)
vertebral-column-3clases     0.847 (0.038)   0.829 (0.038)   0.861 (0.033)*  0.839 (0.042)   0.859 (0.033)   0.842 (0.039)
wine-quality-red             0.702 (0.021)*  0.631 (0.026)   0.597 (0.024)   0.576 (0.026)   0.597 (0.025)   0.613 (0.022)
wine                         0.985 (0.018)*  0.979 (0.021)   0.979 (0.026)   0.980 (0.025)   0.977 (0.027)   0.969 (0.032)


Tree is Prone to Overfitting

[Figure, four panels: the Boston housing data, Boston$lstat (horizontal axis) versus Boston$medv (vertical axis), illustrating that a single tree is prone to overfitting.]


Regularization

Three secrets of a successful tree model.

Regularization, regularization and regularization.

The strategies for preventing overfitting are diverse, but the basic

idea is to favor smaller trees.


Split criterion of multi-class classification

Let f^{(j)}_{−l}(x) = Π_{h ≠ l} g^{(j)}(x; Th, Λh) be the total fit of all trees except the l-th one. The likelihood over all n covariate values then has the form

L(Tl, Λl; T(l), Λ(l), y) = Π_{i=1}^{n} wi f^{(j)}(xi)^{yij} exp[ vi f^{(j)}(xi) ]

                         = cl Π_{b=1}^{Bl} λ_lb^{r_lb} exp[ −s_lb λ_lb ],

where Bl is the number of leaf nodes of the l-th tree and

cl = Π_{i=1}^{n} ( ni choose yi1 yi2 · · · yiC ) ( φi^{ni−1} / Γ(ni) ) f^{(j)}_{−l}(xi)^{yij},

r_lb = Σ_{i : xi ∈ A_lb} yij,    s_lb = Σ_{i : xi ∈ A_lb} φi f^{(j)}_{−l}(xi).


Adaptive Cut-points

Cut-points are defined via evenly spaced quantiles of the observed values of X, and at each recursion the cut-points are redefined on the data in the current node.

[Figure, built over three slides: a function f(x) plotted over a grid x = 0, 1, ..., 8 of candidate cut-points; after a split, the grid "zooms in" on the selected sub-interval. This is easy because the data are kept ordered.]
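A sketch of quantile-based candidate cut-points recomputed on the data reaching the current node; the helper name is ours.

```python
import numpy as np

def candidate_cutpoints(col, num_cutpoints=20):
    """Evenly spaced quantiles of the observations reaching the current node.
    Because each child node recomputes the grid on its own subset of the
    (already sorted) data, the candidate set automatically 'zooms in'."""
    qs = np.linspace(0, 1, num_cutpoints + 2)[1:-1]   # interior quantiles only
    return np.unique(np.quantile(col, qs))
```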


Prune CART

Loop over all possible sub-trees T and calculate the loss function

Cα(T) = C(T) + α|T|,

where C(T) is the prediction error of sub-tree T, |T| is a measure of tree complexity, and α is a tuning parameter. Pick the sub-tree with the lowest loss.
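A toy illustration of this pruning rule, assuming each candidate sub-tree's prediction error and size have already been computed:

```python
def best_subtree(subtrees, alpha):
    """subtrees: list of (C(T), |T|) pairs, i.e. prediction error and size.
    Return the index of the sub-tree minimizing C_alpha(T) = C(T) + alpha * |T|."""
    costs = [err + alpha * size for err, size in subtrees]
    return min(range(len(costs)), key=costs.__getitem__)

# Example: best_subtree([(10.0, 8), (11.5, 4), (14.0, 1)], alpha=0.5) returns 1.
```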


Gradient boosting trees

Obj = Σ_{i=1}^{n} loss(yi, ŷi) + Σ_{l=1}^{L} Ω( g(Tl, µl) )

Obj^t = Σ_{i=1}^{n} [ 2( ŷi^{(t−1)} − yi ) gt(xi) + gt(xi)² ] + Ω(gt) + const
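A small sketch evaluating the displayed objective for a candidate tree fit g_t under squared loss (the names and the penalty argument are ours):

```python
import numpy as np

def boosting_objective_t(y, y_hat_prev, g_t, omega_g_t):
    """Objective for the t-th tree as displayed above (squared loss, constants
    dropped): sum_i [ 2(yhat_i^(t-1) - y_i) g_t(x_i) + g_t(x_i)^2 ] + Omega(g_t)."""
    return float(np.sum(2.0 * (y_hat_prev - y) * g_t + g_t ** 2) + omega_g_t)
```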


Gyorfi et al. (2006)

Theorem (Gyorfi et al. (2006))

Assume that

1. lim_{n→∞} βn = ∞;

2. lim_{n→∞} E[ inf_{m ∈ Mn(Θ), ||m||∞ ≤ βn} EX[ m(x) − f(x) ]² ] = 0;

3. for all L > 0,

   lim_{n→∞} E[ sup_{m ∈ Mn(Θ), ||m||∞ ≤ βn} | (1/an) Σ_{i=1}^{n} [ m(xi) − yi,L ]² − E[ m(x) − yL ]² | ] = 0.

Then

lim_{n→∞} E[ Tβn fn(X, Θ) − f(X) ]² = 0.


Lemma 1

Lemma

Assume that (A1) holds. Then, for all x ∈ [0, 1]^p, ∆(f, A*_k(x, Θ)) → 0 almost surely as k → ∞.


Lemma 2

Lemma

Assume that (A1) holds. Fix x ∈ [0, 1]^p, k ∈ ℕ*, and let ξ > 0. Then Ln,k(x, ·) is stochastically equicontinuous on A^ξ_k(x); that is, for all α, ρ > 0, there exists δ > 0 such that

lim_{n→∞} P[ sup_{ ||ck − c′k||∞ ≤ δ,  ck, c′k ∈ A^ξ_k(x) } | Ln,k(x, ck) − Ln,k(x, c′k) | > α ] ≤ ρ.


Lemma 3

Lemma

Assume that (A1) holds. Fix ξ, ρ > 0 and k ∈ ℕ*. Then there exists N ∈ ℕ* such that, for all n ≥ N,

P[ c∞( ck,n(x, Θ), A*_k(x, Θ) ) ≤ ξ ] ≥ 1 − ρ.


References

Jingyu He, Saar Yalov, and P. Richard Hahn. XBART: Accelerated Bayesian Additive Regression Trees. AISTATS 2019.

Jingyu He, Saar Yalov, Jared Murray, and P. Richard Hahn. Stochastic Tree Ensembles for Regularized Supervised Learning. Technical report.
