Ensembling & Boosting: An Introduction
TRANSCRIPT
-
Ensembling & Boosting
Wayne Chen, 2016-08
-
Deep Learning ML
XGBoost : Kaggle Winning Solution
Giuliano Janson: won two competitions and retired from Kaggle
Persistence: every Kaggler nowadays can put up a great model in a few hours and usually achieve 95% of the final score. Only persistence will get you the remaining 5%.
Ensembling: you need to know how to do it "like a pro". Forget about averaging models. Nowadays many Kagglers build meta-models, and even meta-meta-models.
-
Why Ensemble is needed?
Occam's Razor
An explanation of the data should be made as simple as possible, but no simpler.
Simple is good. But:
Training data might not provide sufficient information for choosing a single best learner.
The search processes of the learning algorithms might be imperfect (it is difficult to reach a unique best hypothesis).
The hypothesis space being searched might not contain the true target function.
-
ID3, C4.5, and CART are tree-based methods built on entropy.
Example: 14 samples, 5 positive (1M, 4F) and 9 negative (6M, 3F).
E_all = -5/14 * log2(5/14) - 9/14 * log2(9/14)
Entropy is 1 for a 50%-50% split, 0 for a 100%-0% split.
Information gain of a split attribute:
E_gender = P(M) * E(1,6) + P(F) * E(4,3)
Gain = E_all - E_gender
http://www.saedsayad.com/decision_tree.htm
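The entropy and gain numbers above can be checked with a short script (plain Python, base-2 logs as on the slide):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 14 samples: 5 in one class, 9 in the other
e_all = entropy([5, 9])

# Split by gender: the M bucket holds (1, 6), the F bucket holds (4, 3)
e_gender = (7 / 14) * entropy([1, 6]) + (7 / 14) * entropy([4, 3])

gain = e_all - e_gender
print(round(e_all, 3), round(e_gender, 3), round(gain, 3))  # 0.94 0.789 0.152
```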
-
http://blogs.sas.com/content/jmp/2013/03/25/partitioning-a-quadratic-in-jmp/
-
Boost Ensemble
-
Ensemble
Try different models: decision tree, NN, SVM, regression, ...
You can even ensemble Kaggle submission CSV files. It works! Majority voting:
Three models, each 70% accurate: a majority-vote ensemble reaches ~78%. Averaging predictions often reduces overfit.
http://mlwave.com/kaggle-ensembling-guide/
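The ~78% figure follows from assuming the three models err independently; a quick enumeration of the outcomes confirms it:

```python
from itertools import product

p = 0.7  # accuracy of each of three independent models

# Probability that a majority (>= 2 of 3) of the models is correct
ensemble_acc = sum(
    (p if a else 1 - p) * (p if b else 1 - p) * (p if c else 1 - p)
    for a, b, c in product([True, False], repeat=3)
    if a + b + c >= 2
)
print(round(ensemble_acc, 4))  # 0.784
```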
-
Ensemble
Kobe, Curry, LBJ
Uncorrelated models usually perform better: base learners should be as accurate as possible and as diverse as possible. Combine them by majority vote or weighted averaging, e.g. a voting ensemble of a RandomForest and a GradientBoostingMachine.
Highly correlated models (ground truth = 1111111111):
1111111100 = 80% accuracy
1111111100 = 80% accuracy
1011111100 = 70% accuracy
Majority vote: 1111111100 = 80% accuracy
Less correlated models:
1111111100 = 80% accuracy
0111011101 = 70% accuracy
1000101111 = 60% accuracy
Majority vote: 1111111101 = 90% accuracy
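A small sketch (the helper names are illustrative) that reproduces the bit-string vote above:

```python
def majority_vote(predictions):
    """Column-wise majority vote over equal-length 0/1 prediction strings."""
    return "".join(max(set(col), key=col.count) for col in zip(*predictions))

truth = "1111111111"

def accuracy(pred):
    return sum(a == b for a, b in zip(pred, truth)) / len(truth)

correlated = ["1111111100", "1111111100", "1011111100"]
diverse = ["1111111100", "0111011101", "1000101111"]

print(majority_vote(correlated), accuracy(majority_vote(correlated)))  # 1111111100 0.8
print(majority_vote(diverse), accuracy(majority_vote(diverse)))        # 1111111101 0.9
```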
-
Ensemble
Random forests randomly sample not only the data but also the features.
Majority vote over the trees; minimal tuning; performance surpasses many more complex methods.
n: subsample size
m: subfeature set size
Other knobs: tree size, tree number
http://www.slideshare.net/0xdata/jan-vitek-distributedrandomforest522013
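A minimal random-forest sketch on synthetic data, assuming scikit-learn (the dataset and parameter values here are illustrative, not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# n_estimators ~ tree number; max_features ~ subfeature set size (m).
# Defaults already work well, which is the "minimal tuning" point above.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
scores = cross_val_score(rf, X, y, cv=5)
print(scores.mean())
```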
-
A base learner is an individual model in an ensemble (e.g. a decision tree or a simple neural network), trained by a base learning algorithm.
Boosting - boosts weak learners into strong learners (sequential learners).
Bagging - like RandomForest: sampling from the data or the features (parallel learners).
Stacking - employs different learning algorithms to train the individual learners, which are then combined by a second-level learner called a meta-learner.
Ensemble
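A stacking sketch, assuming scikit-learn's StackingClassifier and synthetic data; the particular base learners and meta-learner are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Different base learning algorithms train the individual learners...
base_learners = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", SVC(random_state=0)),
]
# ...and a second-level meta-learner combines their predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())
stack.fit(X_tr, y_tr)
print(stack.score(X_te, y_te))
```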
-
Bagging Ensemble (Bootstrap Aggregating)
Draw m bootstrap samples from the training set, and train a base learner on each by calling the base learning algorithm.
Each model is trained on its own sample of the data.
Example: Cherkauer (1996) combined 32 neural networks trained on different input features.
Other sources of randomness: random initialization in backpropagation for NNs, random feature selection for trees.
Combine the learners by majority voting.
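A hand-rolled bagging sketch on synthetic data, assuming scikit-learn decision trees as the base learners:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
m = 25  # number of bootstrap samples / base learners
all_preds = []
for _ in range(m):
    # Bootstrap sample: draw len(X_tr) indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx])
    all_preds.append(tree.predict(X_te))

# Majority vote across the m trees (labels are 0/1, so vote = mean >= 0.5)
votes = np.mean(all_preds, axis=0)
y_pred = (votes >= 0.5).astype(int)
print((y_pred == y_te).mean())
```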
-
Boost Family
AdaBoost (Adaptive Boosting) Gradient Tree Boosting XGBoost
Combination of Additive Models
Bagging can significantly reduce the variance Boosting can significantly reduce the bias
-
http://slideplayer.com/slide/4816467/
AdaBoost initially assigns equal weights to all training examples, then increases the weights of incorrectly classified examples at each round.
-
Adaboost
http://www.37steps.com/exam/adaboost_comp/html/adaboost_comp.html
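One AdaBoost round on toy data, showing the weight update described above (standard exponential-loss form; the example labels and predictions are made up):

```python
import numpy as np

# Toy round: 10 examples with labels in {-1, +1}, all starting at equal weight
y_true = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])
y_pred = np.array([1, 1, 1, -1, -1, -1, -1, -1, -1, 1])  # weak learner's output
w = np.full(10, 1 / 10)

# Weighted error of the weak learner and its coefficient alpha
err = w[y_pred != y_true].sum()
alpha = 0.5 * np.log((1 - err) / err)

# Increase weights of misclassified examples, decrease the rest, renormalize.
# After this step the misclassified set always carries half the total weight.
w = w * np.exp(-alpha * y_true * y_pred)
w = w / w.sum()
print(err, np.round(w, 3))
```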
-
Gradient Boosting
Additive training: each new predictor is optimized by moving in the opposite direction of the gradient to minimize the loss function.
GBDT typically uses shallow trees (e.g. depth 5-10). Boosted trees go by many names: GBDT, GBRT, MART, LambdaMART.
-
Gradient Boosting Model Steps
Objective = training loss + regularization on the leaf weights (a weighted cost score per leaf).
Additive training: each round adds one new tree to reduce the current error.
Grow the new tree greedily, starting from a single leaf.
Use the gradient of the loss to update the leaf weights.
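A from-scratch sketch of those steps for squared loss, where the negative gradient is simply the residual (assumes scikit-learn regression trees as base learners; the toy target is illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression: learn y = sin(x) by adding one small tree per round
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0])

nu = 0.1              # shrinkage (learning rate)
F = np.zeros(len(y))  # the additive model starts at 0
trees = []
for _ in range(100):
    residual = y - F  # negative gradient of squared loss at the current model
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    F += nu * tree.predict(X)  # additive update, scaled by the shrinkage
    trees.append(tree)

print(np.mean((y - F) ** 2))  # training MSE shrinks round after round
```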
-
Training Tips
Shrinkage
Reduces the influence of each individual tree and leaves space for future trees to improve the model.
It is better to improve the model by many small steps than by large steps.
Other tips: subsampling, early stopping, post-pruning.
-
In 2015, among 29 published Kaggle challenge-winning solutions, 17 used XGBoost (deep neural nets: 11).
In KDD Cup 2015, every winning solution in the leaderboard top 10 mentioned it.
Scalability enables data scientists to process hundreds of millions of examples on a desktop.
System design: OpenMP for CPU multi-threading, the DMatrix data structure, cache-aware and sparsity-aware algorithms.
XGBoost
-
Column Block for Parallel Learning
The most time-consuming part of tree learning is getting the data into sorted order. XGBoost stores the data in in-memory blocks in a compressed column format, with each column sorted by the corresponding feature value. Further optimizations: block compression and block sharding.
-
Results
-
Use it in Python
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=8,
    scale_pos_weight=1,
    seed=27)
gamma : Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight : Minimum sum of instance weight(hessian) needed in a child.
colsample_bytree : Subsample ratio of columns when constructing each tree.
-
Ensemble in Kaggle
Voting ensembles, Weighted majority vote, Bagged Perceptrons, Rank averaging, Historical ranks, Stacked & Blending (Netflix)
-
A voting ensemble of around 30 convnets: the best single model scored 0.93170; the final ensemble score was 0.94120.
Ensemble in Kaggle
-
No Free Lunch
An ensemble is usually much better than a single learner (bias-variance tradeoff): boost it or average-vote it.
But the result is not understandable -- like a DNN or a non-linear SVM.
There is no ensemble method that outperforms all other ensemble methods consistently.
Selecting some base learners instead of using all of them to compose the ensemble can be a better choice -- selective ensembles.
XGBoost (tabular data) vs. deep learning (more and more complex data, harder tuning).
-
Reference
Gradient Boosting Machines, a Tutorial - Alexey Natekin and Alois Knoll
XGBoost: A Scalable Tree Boosting System - Tianqi Chen
NTU CMLab tutorials: http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/
Kaggle Ensembling Guide: http://mlwave.com/kaggle-ensembling-guide/