TRANSCRIPT
Data Science Competition
February 25, 2017
The 27th Annual KSEA South-Western Regional Conference
Jeong-Yoon Lee, Ph.D.
Chief Data Scientist, Conversion Logic
Ph.D. in Computer Science, USC
M.S. in Electrical Engineering, USC
B.S. in Electrical Engineering, SNU
KDD Cup Winner 2012 & 2015
Top 10, Kaggle 2015
Why Data Science Competition
Why Compete
• For fun
• For experience
• For learning
• For networking
Fun
• Competing with others
• Incremental improvement
Experience
Learning
Networking
Data Science Competition
Data Science Competitions
• KDD Cup: since 1997
• Netflix Prize: 2006 - 2009
• Kaggle: since 2010
Competition Structure
• Training data: features and labels provided
• Test data: features provided; competitors submit predictions for the labels
• Submissions are scored on a public leaderboard (LB) during the competition; final rankings use the private LB score
Kaggle
• 250+ competitions since 2010
• 500K+ users
• 50K+ competitors
• $3MM+ in prizes paid out
Misconceptions on Competitions
• No ETL
• No EDA
• Not worth it
• Not for production
No ETL? - Deloitte Western Australia Rental Prices
No ETL? - Outbrain Click Prediction
202B page views. 16.9MM clicks. 700MM users. 560 sites.
No ETL? - YouTube-8M Video Understanding Challenge
1.7TB frame-level data. 31GB video-level data.
No EDA?
• Most competitions provide actual data with labels - typical EDA applies
• Anonymized data - calls for more creative EDA
o People decode age, states, time intervals, income, etc.
Not worth it?
• Performance matters
• You walk easier when you can run
Not for Production?
• Kaggle Kernel
o Max execution time: 10 minutes
o Max file output: 500MB
o Memory limit: 8GB
Ensemble Pipeline at Conversion Logic
Best Practices
• Feature Engineering
• Algorithms
• Cross Validation
• Ensemble
Feature Engineering
• Numerical - Log, Log(1 + x), Normalization, Binarization
• Categorical - One-hot-encode, TF-IDF (text), Weight-of-Evidence
• Timeseries - Stats, FFT, MFCC, ERP (EEG)
• Numerical/Timeseries to Categorical - RF/GBM*
• Dimensionality Reduction - PCA, SVD, Autoencoder
* http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
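The numerical and categorical transforms above can be sketched with pandas and NumPy; the tiny DataFrame below is hypothetical, used only to illustrate log(1 + x), normalization, and one-hot encoding:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one heavy-tailed numerical column, one categorical.
df = pd.DataFrame({
    "price": [10.0, 100.0, 1000.0],
    "state": ["CA", "NY", "CA"],
})

# Numerical: log(1 + x) compresses heavy-tailed values.
df["price_log1p"] = np.log1p(df["price"])

# Numerical: normalization to zero mean, unit variance.
df["price_norm"] = (df["price"] - df["price"].mean()) / df["price"].std()

# Categorical: one-hot encoding.
df = pd.get_dummies(df, columns=["state"])

print(df.columns.tolist())
```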
Algorithms

Algorithm | Tool | Note
Gradient Boosting Machine | XGBoost, LightGBM | The most popular algorithm in competitions
Random Forests | Scikit-Learn, randomForest | Used to be popular before GBM
Extremely Randomized Trees | Scikit-Learn |
Neural Networks / Deep Learning | Keras, MXNet, CNTK, Torch | Blends well with GBM. Best at image and speech recognition competitions
Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensembles.
Support Vector Machine | Scikit-Learn |
FTRL | Vowpal Wabbit | Competitive solution for CTR estimation competitions
Factorization Machine | libFM | Winning solution for KDD Cup 2012
Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions (Criteo, Avazu)
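As a minimal illustration of the table's most popular family, here is a gradient boosting sketch using scikit-learn's GradientBoostingClassifier on synthetic data; XGBoost and LightGBM expose a similar fit/predict interface with the same core knobs:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Key GBM knobs: number of trees, learning rate, tree depth.
gbm = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
gbm.fit(X_tr, y_tr)
score = gbm.score(X_te, y_te)  # accuracy on the held-out split
print(score)
```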
Cross Validation
Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
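Stratified k-fold splitting as described above can be sketched with scikit-learn; the imbalanced label array below is synthetic, standing in for a dropout indicator:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced label: 90 negatives, 10 positives
# (standing in for a dropout indicator).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rates = []
for fold, (tr_idx, va_idx) in enumerate(skf.split(X, y)):
    # Each validation fold preserves the 10% positive rate of the full data.
    rates.append(y[va_idx].mean())
    print(fold, len(va_idx), y[va_idx].mean())
```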
Ensemble
* for other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
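A minimal stacking sketch, one of the ensemble types covered in the guide above: base models generate out-of-fold predictions, and a linear meta-model blends them. The data and model choices here are illustrative assumptions, not the speaker's actual pipeline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic data (illustrative only).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Level 1: base models generate out-of-fold probability predictions, so the
# meta-model never sees predictions made on data the base model trained on.
base_models = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    LogisticRegression(max_iter=1000),
]
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Level 2: a simple linear meta-model blends the base predictions.
meta = LogisticRegression(max_iter=1000)
score = cross_val_score(meta, meta_features, y, cv=5).mean()
print(score)
```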
KDD Cup 2015 Solution
Why Competition
• For fun
• For experience
• For learning
• For networking
One Last Thing
Google: 20K applications per week
Conversion Logic: 200 applications per week
Thank You