XGBOOST: A SCALABLE TREE BOOSTING SYSTEM
ADVISOR: JIA-LING KOH   SPEAKER: YIN-HSIANG LIAO   2018/04/17, FROM KDD 2016
Outline
Introduction
Method
Experiment
Conclusion
Introduction
Regression tree
CART (split by Gini index)
Boosting
An ensemble method: an iterative procedure that adaptively changes the distribution of training examples.
AdaBoost
Introduction
The most important factor behind XGBoost: scalability.
It scales to billions of examples.
Introduction
A practical choice:
Used by 17 out of 29 winning solutions on Kaggle in 2015.
All of the top-10 teams in KDD Cup 2015 used XGBoost.
T-brain: used by the top-3 teams.
Applications: ad click-through rate prediction, malware classification, customer behavior prediction, etc.
Method
Tree ensemble model:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}$$
The prediction for instance i is the sum of the leaf weights assigned to it by each of the K trees.
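As a toy illustration of this sum-of-trees prediction (a minimal sketch, not XGBoost's internal representation; the stump trees and their weights below are made up):

```python
# Minimal sketch: an ensemble prediction is the sum of leaf weights,
# one leaf weight contributed by each tree. Toy stumps for illustration.

def stump(feature, threshold, left_weight, right_weight):
    """Return a one-split tree: x -> leaf weight."""
    def predict(x):
        return left_weight if x[feature] < threshold else right_weight
    return predict

# A tiny "ensemble" of two hypothetical trees.
trees = [
    stump(feature=0, threshold=0.5, left_weight=-0.2, right_weight=0.3),
    stump(feature=1, threshold=1.0, left_weight=0.1, right_weight=-0.1),
]

def ensemble_predict(x):
    # y_hat(x) = sum_k f_k(x)
    return sum(tree(x) for tree in trees)

print(ensemble_predict([0.7, 0.4]))  # 0.3 + 0.1 = 0.4
```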
Method
Regularized objective function:
$$\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$$
l is a differentiable convex loss function; Ω penalizes model complexity, where T is the number of leaves and w are the leaf weights.
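A minimal numeric sketch of this objective, assuming squared-error loss for l and made-up leaf weights; function names such as `complexity` are mine, not the paper's:

```python
import numpy as np

# Sketch of L = sum_i l(y_hat_i, y_i) + sum_k [gamma*T_k + 0.5*lambda*||w_k||^2],
# with squared-error loss assumed for illustration.

def complexity(leaf_weights, gamma=1.0, lam=1.0):
    """Omega(f) = gamma * (number of leaves) + 0.5 * lambda * ||w||^2."""
    w = np.asarray(leaf_weights, dtype=float)
    return gamma * w.size + 0.5 * lam * np.sum(w ** 2)

def objective(y, y_hat, trees_leaf_weights, gamma=1.0, lam=1.0):
    loss = np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2)   # convex, differentiable
    penalty = sum(complexity(w, gamma, lam) for w in trees_leaf_weights)
    return loss + penalty

print(objective(y=[1.0, 0.0], y_hat=[0.8, 0.1],
                trees_leaf_weights=[[-0.2, 0.3], [0.1, -0.1]]))
```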
Objective function
Method
Gradient tree boosting: the model is trained in an additive manner.
$$\mathcal{L}^{(t)} = \sum_i l\bigl(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t)$$
At step t, the usual prediction $\hat{y}_i^{(t-1)}$ is kept fixed and the tree $f_t$ that most improves the objective is added.
Method
Additive training (boosting): $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$
Method
Second-order Taylor expansion of the objective:
$$\mathcal{L}^{(t)} \simeq \sum_i \bigl[\, l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \,\bigr] + \Omega(f_t)$$
where $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ are the first- and second-order gradients of the loss.
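For concreteness, a small sketch of the first- and second-order gradients g_i and h_i, assuming squared-error loss (the paper's derivation is loss-agnostic):

```python
import numpy as np

# For l = (y - y_hat)^2 the gradients w.r.t. the previous prediction are
# g_i = 2*(y_hat_i - y_i) and h_i = 2 (a constant).

def grad_hess_squared_error(y, y_hat_prev):
    y = np.asarray(y, dtype=float)
    y_hat_prev = np.asarray(y_hat_prev, dtype=float)
    g = 2.0 * (y_hat_prev - y)          # g_i = d l / d y_hat
    h = np.full_like(y, 2.0)            # h_i = d^2 l / d y_hat^2
    return g, h

g, h = grad_hess_squared_error(y=[1.0, 0.0, 1.0], y_hat_prev=[0.5, 0.2, 0.9])
print(g, h)   # [-1.   0.4 -0.2] [2. 2. 2.]
```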
Method
Define $I_j = \{\, i \mid q(x_i) = j \,\}$, the instance set of leaf j (the instances $x_i$ that fall in leaf j), and let T be the number of leaves. Grouping the Taylor-expanded objective by leaf gives
$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \Bigl[ \Bigl(\sum_{i \in I_j} g_i\Bigr) w_j + \tfrac{1}{2} \Bigl(\sum_{i \in I_j} h_i + \lambda\Bigr) w_j^2 \Bigr] + \gamma T$$
Method
For a fixed tree structure q, the optimal weight of leaf j is
$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad G_j = \sum_{i \in I_j} g_i,\quad H_j = \sum_{i \in I_j} h_i$$
and the corresponding optimal objective value, which scores the quality of the tree structure, is
$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
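A sketch of these two formulas in code, assuming the per-instance g_i, h_i and a fixed leaf assignment are given (the helper name `leaf_weights_and_score` is mine):

```python
import numpy as np

def leaf_weights_and_score(g, h, leaf_of, T, lam=1.0, gamma=1.0):
    """Optimal leaf weights w_j* and the structure score for a fixed tree q."""
    g, h, leaf_of = map(np.asarray, (g, h, leaf_of))
    G = np.bincount(leaf_of, weights=g, minlength=T)   # G_j = sum of g_i in leaf j
    H = np.bincount(leaf_of, weights=h, minlength=T)   # H_j = sum of h_i in leaf j
    w_star = -G / (H + lam)                            # w_j* = -G_j / (H_j + lambda)
    score = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * T
    return w_star, score

w, score = leaf_weights_and_score(
    g=[-1.0, 0.4, -0.2], h=[2.0, 2.0, 2.0], leaf_of=[0, 1, 0], T=2)
print(w, score)
```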
Method
So, once the tree structure is known, we can compute its optimal value. The problem becomes: which tree is the best?
In practice, trees are grown greedily, one split at a time. The loss reduction of splitting a leaf into left and right children is
$$\mathcal{L}_{split} = \frac{1}{2}\Bigl[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \Bigr] - \gamma$$
i.e., (left subtree + right subtree) minus the parent. The larger the better; the value can be negative.
Greedy strategy: at each node, take the split with the largest loss reduction.
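A sketch of the greedy split search on a single feature using this gain formula (a simplified illustration, not the library's implementation; `best_split_1d` is a made-up name):

```python
import numpy as np

def gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0):
    """Loss reduction of a split: left + right - parent, minus gamma."""
    return 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                  - (G_L + G_R)**2 / (H_L + H_R + lam)) - gamma

def best_split_1d(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy: sort by feature value, scan, accumulate G_L and H_L."""
    order = np.argsort(x)
    x, g, h = np.asarray(x)[order], np.asarray(g)[order], np.asarray(h)[order]
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best = (-np.inf, None)
    for i in range(len(x) - 1):
        G_L += g[i]; H_L += h[i]
        if x[i] == x[i + 1]:
            continue                      # cannot split between equal values
        score = gain(G_L, H_L, G - G_L, H - H_L, lam, gamma)
        if score > best[0]:
            best = (score, (x[i] + x[i + 1]) / 2.0)
    return best                           # (best gain, split threshold)

print(best_split_1d(x=[0.1, 0.4, 0.8, 0.9],
                    g=[-1.0, -0.8, 0.6, 0.9],
                    h=[2.0, 2.0, 2.0, 2.0]))
```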
Method
Two further techniques to prevent overfitting:
Shrinkage: scale the weights of each newly added tree by a factor η (the learning rate).
Column subsampling: build each tree on a random subset of the features.
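In the xgboost library these two knobs correspond to the `eta` (shrinkage) and `colsample_bytree` (column subsampling) parameters; a small illustrative configuration on random data, with arbitrary values:

```python
import numpy as np
import xgboost as xgb

# Illustrative configuration only; data is random and values are arbitrary.
X = np.random.rand(100, 5)
y = np.random.rand(100)
dtrain = xgb.DMatrix(X, label=y)

params = {
    "eta": 0.1,                # shrinkage: scale each new tree's contribution
    "colsample_bytree": 0.8,   # column subsampling per tree
    "lambda": 1.0,             # L2 penalty on leaf weights
    "gamma": 0.0,              # penalty per leaf
    "max_depth": 4,
}
booster = xgb.train(params, dtrain, num_boost_round=50)
```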
Method
Split-finding algorithms:
Basic exact greedy algorithm.
Approximate algorithm, which proposes candidate split points either globally (once per tree) or locally (re-proposed after each split).
Split Finding
Method
Basic exact greedy algorithm: for each feature, sort the instances by feature value and scan them in order, enumerating every possible split and accumulating the gradient statistics needed for the gain.
When to stop?
Method
The exact greedy algorithm is good because it considers all possible splits, but when the data cannot fit in memory, thrashing slows the system down.
Approximation: propose only a limited set of candidate split points instead of enumerating all of them.
Method
Local vs. global proposal:
Global: candidate points are proposed once per tree; fewer proposal steps, but more candidate points are needed.
Local: candidates are re-proposed after each split; more proposal steps, but fewer candidate points suffice.
Method
Weighted quantile sketch: choose candidate split points so that each interval between candidates carries the same "impact" on the objective function. The second-order gradient h_i acts as the weight of instance i, since the objective can be rewritten as a weighted squared loss.
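A simplified illustration of the idea, using exact weighted quantiles on in-memory data rather than the paper's mergeable sketch data structure; h_i plays the role of the instance weight:

```python
import numpy as np

def weighted_quantile_candidates(x, h, eps=0.25):
    """Propose candidates so each interval holds ~eps of the total h-weight."""
    order = np.argsort(x)
    x, h = np.asarray(x, dtype=float)[order], np.asarray(h, dtype=float)[order]
    rank = np.cumsum(h) / h.sum()            # weighted rank in (0, 1]
    targets = np.arange(eps, 1.0, eps)       # e.g. 0.25, 0.5, 0.75
    idx = np.searchsorted(rank, targets)
    return np.unique(x[idx])                 # candidate split values

x = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1]
h = [1.0, 1.0, 4.0, 1.0, 1.0, 1.0]           # the heavy instance pulls candidates toward it
print(weighted_quantile_candidates(x, h))
```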
Method
Sparsity-aware split finding. Possible reasons for sparsity:
Missing values
Frequent zero entries
Artifacts of feature engineering (such as one-hot encoding)
Solution: learn a default direction in each tree node.
Method
Sparsity-aware split finding:
Sort criterion: missing values go last; splits are enumerated only over the present values.
Learn the best default direction for each feature by trying both options, sending the missing entries left and then right, and keeping the one with the larger gain.
Method
Non-present entries are treated as missing values.
The algorithm only iterates over present entries, so its cost is linear in the number of non-missing entries.
About 50x faster than the naive version on the Allstate dataset.
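A sketch of how the default direction can be learned for one fixed split, assuming missing entries are encoded as NaN (a simplified stand-in for the paper's procedure, with made-up helper names):

```python
import numpy as np

def gain(G_L, H_L, G_R, H_R, lam=1.0):
    return 0.5 * (G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam)
                  - (G_L + G_R)**2 / (H_L + H_R + lam))

def best_default_direction(x, g, h, threshold, lam=1.0):
    """For a fixed split threshold, send the missing entries (NaN) left and then
    right, and keep the direction that gives the larger gain."""
    x, g, h = (np.asarray(a, dtype=float) for a in (x, g, h))
    present = ~np.isnan(x)
    xp, gp, hp = x[present], g[present], h[present]
    go_left = xp < threshold
    G_L, H_L = gp[go_left].sum(), hp[go_left].sum()
    G_R, H_R = gp[~go_left].sum(), hp[~go_left].sum()
    G_m, H_m = g[~present].sum(), h[~present].sum()          # stats of missing entries
    gain_left = gain(G_L + G_m, H_L + H_m, G_R, H_R, lam)    # missing go left
    gain_right = gain(G_L, H_L, G_R + G_m, H_R + H_m, lam)   # missing go right
    return ("left", gain_left) if gain_left >= gain_right else ("right", gain_right)

print(best_default_direction(
    x=[0.1, np.nan, 0.8, np.nan, 0.9],
    g=[-1.0, -0.7, 0.6, 0.8, 0.9],
    h=[2.0, 2.0, 2.0, 2.0, 2.0],
    threshold=0.5))
```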
Method
The most time-consuming part of tree learning is sorting the data.
Idea: sort each column just once, and store the data in an in-memory unit called a block.
System Design
Method
Each block stores the data in compressed sparse column (CSC) format, with each column pre-sorted by feature value.
Different blocks can be distributed across machines, or stored on disk in the out-of-core setting.
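A toy sketch of the idea behind the block layout (not the actual CSC implementation): sort each feature column once, keep the row indices in sorted order, and reuse them for every subsequent scan:

```python
import numpy as np

# Toy sketch: pre-sort each feature column once and keep, for each feature,
# the row indices in sorted order. Split finding can then scan these index
# lists repeatedly without re-sorting.
X = np.array([[0.9, 1.2],
              [0.1, 3.4],
              [0.5, 0.7]])

# "Block": for each feature, row indices sorted by that feature's value.
sorted_index = {j: np.argsort(X[:, j]) for j in range(X.shape[1])}

g = np.array([0.3, -1.0, 0.2])   # gradient statistics, stored per row

# A linear scan over feature 0 in sorted order visits the gradient array
# through the pre-sorted indices:
for i in sorted_index[0]:
    print(i, X[i, 0], g[i])
```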
Method
The block structure helps split finding, but reading gradient statistics in sorted-index order is a non-contiguous memory access pattern, which causes cache misses.
Solution (cache-aware prefetching): allocate an internal buffer in each thread, fetch the gradient statistics into it in mini-batches, and accumulate from the buffer.
Method
Block size matters (it bounds the maximum number of examples per block).
Blocks that are too small give each thread too little work and make parallelization inefficient.
Blocks that are too large cause cache misses, because the gradient statistics no longer fit in the CPU cache.
Balance! (The paper settles on 2^16 examples per block.)
Method
Out-of-core computation:
Block compression: blocks are compressed by column on disk and decompressed on the fly by an independent thread while being read. Ex: [0, 2, 2, 0, 1, 2]
Block sharding: the data is sharded onto multiple disks, and a prefetch thread is assigned to each disk.
Experiment
Classification:
GBM (R) is fast partly because it expands only one branch of a tree.
The other two systems compared (XGBoost and scikit-learn) expand the full tree.
Experiment
Learning to rank:
Compared against pGBRT, the best previously published system for this task.
pGBRT only supports the approximate algorithm.
Experiment
Out-of-core experiment:
Block compression gives about a 3x speedup.
Sharding onto two disks gives about a 2x speedup.
Conclusion
The most important feature: scalability!
Lessons from building XGBoost:
Sparsity-aware split finding, the weighted quantile sketch, cache-aware access, and parallelization are what make it scale.
Fin.