gbm.more gbm in h2o


Uploaded by srisatish-ambati, 26-Jan-2015


Page 1: Gbm.more GBM in H2O

H2O – The Open Source Math Engine

H2O and Gradient Boosting

Page 2: Gbm.more GBM in H2O

What is Gradient Boosting

gbm is a boosted ensemble of decision trees, fitted in a forward stagewise fashion to minimize a loss function

i.e. gbm is a sum of decision trees

each new tree corrects errors of the previous forest
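As a rough sketch of that idea (not H2O's implementation), the loop below boosts shallow scikit-learn regression trees on squared-error residuals, so each new tree is fit to the errors of the forest built so far; the synthetic data and parameter values are only illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)   # noisy nonlinear target

learn_rate, num_trees = 0.1, 100
pred = np.zeros_like(y)                    # start from a zero predictor
trees = []
for m in range(num_trees):
    residual = y - pred                    # errors of the current forest
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learn_rate * tree.predict(X)   # the new tree corrects those errors
    trees.append(tree)

print("final training MSE:", np.mean((y - pred) ** 2))
```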

Page 3: Gbm.more GBM in H2O

Why gradient boosting

Performs variable selection during the fitting process
• Highly collinear explanatory variables
  - glm: backwards/forwards selection is unstable

Interactions: will search to a specified depth

Captures nonlinearities in the data
• e.g. airlines on-time performance: gbm captures a change in 2001 without the analyst having to specify it

Page 4: Gbm.more GBM in H2O

Why gradient boosting, more

Will naturally handle unscaled data (unlike glm, particularly with L1/L2 penalties)

Handles ordinal data, e.g. income: [$10k, $20k], ($20k, $40k], ($40k, $100k], ($100k, inf)

Relatively insensitive to long-tailed distributions and outliers

Page 5: Gbm.more GBM in H2O

gradient boosting works well

On the right dataset, gbm classification will outperform both glm and random forest

Demonstrates good performance on various classification problems
• Hugh Miller, team leader of the winning KDD Cup 2009 Slow Challenge entry: gbm was the main model used to predict telco customer churn

• KDD Cup 2013 - Author-Paper Identification Challenge - 3 of the 4 winners incorporated gbm

• many kaggle winners

• results at previous employers

Page 6: Gbm.more GBM in H2O

Inference algorithm (simplified)

1. Initialize K predictors f_{k,m=0}(x)

2. for m = 1:num_trees
   a. normalize the current predictions into class probabilities p_k
   b. for k = 1:num_classes
      i. compute the pseudo-residual r = y – p_k
      ii. fit a regression tree to targets r with data X
      iii. for each terminal region, compute the multiplier that minimizes the deviance loss
      iv. f_{k,m+1}(x) = f_{k,m}(x) + region multiplier

(a code sketch of this loop follows)
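A minimal Python sketch of the loop above, not H2O's implementation: scikit-learn regression trees serve as base learners, integer class labels 0..K-1 are assumed, and the per-region deviance multiplier of step iii is approximated by a single learning-rate shrinkage.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_multiclass(X, y, num_classes, num_trees=100, learn_rate=0.1, max_depth=3):
    n = X.shape[0]
    Y = np.eye(num_classes)[y]            # one-hot targets, shape (n, K)
    F = np.zeros((n, num_classes))        # step 1: initialize the K predictors
    forests = []
    for m in range(num_trees):            # step 2
        # a. normalize current predictions into class probabilities (softmax)
        P = np.exp(F - F.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        stage = []
        for k in range(num_classes):      # b. loop over classes
            r = Y[:, k] - P[:, k]         # i.  pseudo-residual r = y - p_k
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)   # ii.
            # iii./iv. a full implementation refits each terminal region with a
            # one-step Newton multiplier for the multinomial deviance; here a
            # global shrinkage stands in for that per-region multiplier.
            F[:, k] += learn_rate * tree.predict(X)
            stage.append(tree)
        forests.append(stage)
    return forests
```

`forests` holds num_trees stages of K trees each; prediction is the softmax of the accumulated tree sums for a new x.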

Page 7: Gbm.more GBM in H2O

Regression tree, 1

[Figure: a regression tree partitions the (X1, X2) plane into rectangular regions R1–R4; the split values 2, 7, 1 are from the original diagram]
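To make that picture concrete, here is a small sketch (synthetic data, made-up split locations) that fits a depth-2 regression tree on two features and prints the axis-aligned splits that carve the (X1, X2) plane into four regions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 2))                         # features X1, X2
y = np.where(X[:, 0] < 5, 1.0, 3.0) + np.where(X[:, 1] < 4, 0.0, 2.0)

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
# print the learned partition: each leaf corresponds to one region R_i
print(export_text(tree, feature_names=["X1", "X2"]))
```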

Page 8: Gbm.more GBM in H2O

Regression tree, 2

1-level regression tree: 2 terminal nodes, split decision: minimize squared error

[Figure: the data (9 observations) and the resulting errors]
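A hedged sketch of that split rule: try every candidate threshold on one feature and keep the one that minimizes the total squared error of the two leaf means. The 9-observation dataset below is made up for illustration.

```python
import numpy as np

def best_stump_split(x, y):
    best = (None, np.inf)                      # (threshold, total squared error)
    for t in np.unique(x)[:-1]:                # candidate split points
        left, right = y[x <= t], y[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[1]:
            best = (t, sse)
    return best

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)        # 9 observations
y = np.array([1.2, 1.0, 1.3, 0.9, 4.8, 5.1, 5.0, 4.7, 5.2])
print(best_stump_split(x, y))                  # threshold with the lowest squared error
```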

Page 9: Gbm.more GBM in H2O

but has pain points

Slow to fit

Slow to predict

Data size limitations: often downsampling required

Many implementations are single-threaded

Parameters are difficult to understand

Fit with searching, choose with holdout (a sketch of this search follows the list):
• Interaction levels / depths: [1, 5, 10, 15]
• trees: [10, 100, 1000, 5000]
• learning rate: [0.1, 0.01, 0.001]
• this is often an overnight job
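A sketch of that grid-plus-holdout search, using scikit-learn's GradientBoostingClassifier as a stand-in for any single-machine gbm implementation; the grids mirror the bullets above, with the 5000-tree setting dropped so the example finishes in reasonable time.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.25, random_state=0)

grid = {
    "max_depth": [1, 5, 10, 15],          # interaction levels / depths
    "n_estimators": [10, 100, 1000],      # number of trees (5000 omitted to keep this quick)
    "learning_rate": [0.1, 0.01, 0.001],
}
search = GridSearchCV(GradientBoostingClassifier(), grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_hold, y_hold))   # choose with the holdout set
```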

Page 10: Gbm.more GBM in H2O

h2o can help

multicore

distributed

parallel
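For reference, a minimal sketch of the same kind of model in the h2o Python client; the import path and parameter names (H2OGradientBoostingEstimator, ntrees, max_depth, learn_rate) are from current h2o-py and may differ from the API this deck originally targeted, and the toy data stands in for a real distributed import.

```python
import h2o
import pandas as pd
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from sklearn.datasets import make_classification

h2o.init()   # starts or attaches to a local multi-threaded H2O cluster

# toy in-memory data; in practice h2o.import_file() loads data in parallel across the cluster
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
df = pd.DataFrame(X, columns=[f"x{i}" for i in range(10)])
df["y"] = y
frame = h2o.H2OFrame(df)
frame["y"] = frame["y"].asfactor()          # treat the target as categorical

gbm = H2OGradientBoostingEstimator(ntrees=100, max_depth=5, learn_rate=0.1)
gbm.train(x=[f"x{i}" for i in range(10)], y="y", training_frame=frame)
print(gbm.auc(train=True))
```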

Page 11: Gbm.more GBM in H2O

Questions?

Page 12: Gbm.more GBM in H2O

gbm intuition

Why should this work well?

Page 13: Gbm.more GBM in H2O

Universe is sparse. Life is messy. Data is sparse & messy. - Lao Tzu