TRANSCRIPT
Data Science Competition
February 25, 2017
The 27th Annual KSEA South-Western Regional Conference
Jeong-Yoon Lee, Ph.D.
Chief Data Scientist, Conversion Logic
Ph.D. in Computer Science, USC
M.S. in Electrical Engineering, USC
B.S. in Electrical Engineering, SNU
KDD Cup Winner 2012 & 2015
Top 10, Kaggle 2015
Why Data Science Competition
Why Compete
• For fun
• For experience
• For learning
• For networking
Fun
• Competing with others
• Incremental improvement
Experience
Learning
Networking
Data Science Competition
Data Science Competitions
• KDD Cup: since 1997
• Netflix Prize: 2006 - 2009
• Kaggle: since 2010
Competition Structure
• Training data: features and labels provided
• Test data: features provided; competitors submit predictions for the labels
• Submissions are scored on a public leaderboard (LB) during the competition; final rankings use the private LB score
Kaggle
• 250+ competitions since 2010
• 500K+ users
• 50K+ competitors
• $3MM+ in prizes paid out
Misconceptions on Competitions
• No ETL
• No EDA
• Not worth it
• Not for production
No ETL? - Deloitte Western Australia Rental Prices
No ETL? - Outbrain Click Prediction
202B page views. 16.9MM clicks. 700MM users. 560 sites.
No ETL? - YouTube-8M Video Understanding Challenge
1.7TB frame-level data. 31GB video-level data.
No EDA?
• Most competitions provide actual data with labels - typical EDA applies
• Anonymized data - calls for more creative EDA
o People decode age, states, time intervals, income, etc.
Not worth it?
• Performance matters
• You walk easier when you can run
Not for Production?
• Kaggle Kernel
o Max execution time: 10 minutes
o Max file output: 500MB
o Memory limit: 8GB
Ensemble Pipeline at Conversion Logic
Best Practices
• Feature Engineering
• Algorithms
• Cross Validation
• Ensemble
Feature Engineering
• Numerical - Log, Log(1 + x), Normalization, Binarization
• Categorical - One-hot-encode, TF-IDF (text), Weight-of-Evidence
• Timeseries - Stats, FFT, MFCC, ERP (EEG)
• Numerical/Timeseries to Categorical - RF/GBM*
• Dimensionality Reduction - PCA, SVD, Autoencoder
* http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
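The numerical and categorical transforms above can be sketched with pandas and NumPy; the tiny DataFrame below is hypothetical, used only to illustrate log(1 + x), normalization, and one-hot encoding:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one heavy-tailed numerical column, one categorical.
df = pd.DataFrame({
    "price": [10.0, 100.0, 1000.0],
    "state": ["CA", "NY", "CA"],
})

# Numerical: log(1 + x) compresses heavy-tailed values.
df["price_log1p"] = np.log1p(df["price"])

# Numerical: normalization to zero mean, unit variance.
df["price_norm"] = (df["price"] - df["price"].mean()) / df["price"].std()

# Categorical: one-hot encoding.
df = pd.get_dummies(df, columns=["state"])

print(df.columns.tolist())
```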
Algorithms

Algorithm | Tool | Note
Gradient Boosting Machine | XGBoost, LightGBM | The most popular algorithm in competitions
Random Forests | Scikit-Learn, randomForest | Used to be popular before GBM
Extremely Randomized Trees | Scikit-Learn |
Neural Networks / Deep Learning | Keras, MXNet, CNTK, Torch | Blends well with GBM. Best at image and speech recognition competitions
Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensembles.
Support Vector Machine | Scikit-Learn |
FTRL | Vowpal Wabbit | Competitive solution for CTR estimation competitions
Factorization Machine | libFM | Winning solution for KDD Cup 2012
Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions (Criteo, Avazu)
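As a minimal illustration of the table's most popular family, here is a gradient boosting sketch using scikit-learn's GradientBoostingClassifier on synthetic data; XGBoost and LightGBM expose a similar fit/predict interface with the same core knobs:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Key GBM knobs: number of trees, learning rate, tree depth.
gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
gbm.fit(X_tr, y_tr)
score = gbm.score(X_te, y_te)  # accuracy on the held-out split
print(score)
```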
Cross Validation
Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
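Stratified k-fold splitting as described above can be sketched with scikit-learn; the imbalanced label array below is synthetic, standing in for a dropout indicator:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced label: 90 negatives, 10 positives
# (standing in for a dropout indicator).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rates = []
for fold, (tr_idx, va_idx) in enumerate(skf.split(X, y)):
    # Each validation fold preserves the 10% positive rate of the full data.
    rates.append(y[va_idx].mean())
    print(fold, len(va_idx), y[va_idx].mean())
```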
Ensemble
* for other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
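A minimal stacking sketch, one of the ensemble types covered in the guide above: base models generate out-of-fold predictions, and a linear meta-model blends them. The data and model choices here are illustrative assumptions, not the speaker's actual pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic data (illustrative only).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Level 1: base models generate out-of-fold probability predictions, so the
# meta-model never sees predictions made on data the base model trained on.
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    LogisticRegression(max_iter=1000),
]
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Level 2: a simple linear meta-model blends the base predictions.
meta = LogisticRegression(max_iter=1000)
score = cross_val_score(meta, meta_features, y, cv=5).mean()
print(score)
```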
KDD Cup 2015 Solution
Why Competition
• For fun
• For experience
• For learning
• For networking
One Last Thing
Google: 20K applications per week
Conversion Logic: 200 applications per week
Thank You