Data Science Competition 2. 25. 2017 The 27th Annual KSEA South-Western Regional Conference Jeong-Yoon Lee, Ph.D.

Upload: jeong-yoon-lee

Post on 19-Mar-2017

TRANSCRIPT

Page 1: Data Science Competition

Data Science Competition

2. 25. 2017

The 27th Annual KSEA South-Western Regional Conference

Jeong-Yoon Lee, Ph.D.

Page 2: Data Science Competition

Chief Data Scientist, Conversion Logic

Ph.D. in Computer Science, USC

M.S. in Electrical Engineering, USC

B.S. in Electrical Engineering, SNU

KDD Cup Winner 2012 & 2015

Top 10, Kaggle 2015

Jeong-Yoon Lee, Ph.D.

Page 3: Data Science Competition

Why Data Science Competition

Page 4: Data Science Competition

Why Compete

• For fun

• For experience

• For learning

• For networking

Page 5: Data Science Competition

Fun

• Competing with others

• Incremental improvement

Page 6: Data Science Competition

Experience

Page 7: Data Science Competition

Learning

Page 8: Data Science Competition

Learning

Page 9: Data Science Competition

Networking

Page 10: Data Science Competition

Page 11: Data Science Competition

Data Science Competition

Page 12: Data Science Competition

Data Science Competitions

Since 1997

2006 - 2009

Since 2010

Page 13: Data Science Competition

Competition Structure

Training Data

Test Data

[Diagram labels: Feature, Label; Provided, Submission, Public LB Score, Private LB Score]
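
The labels on this slide suggest the usual setup: labels are provided only for the training data, and a submission of test-set predictions is scored on a public leaderboard (LB) subset during the competition and on a private subset for the final ranking. A minimal sketch of that scoring split follows. The 30% public fraction, the accuracy metric, and the names `split_public_private` and `accuracy` are assumptions for illustration, not details from the talk.

```python
import random


def split_public_private(n_test, public_fraction=0.3, seed=0):
    """Randomly assign test row indices to a public and a private subset."""
    indices = list(range(n_test))
    random.Random(seed).shuffle(indices)
    n_public = int(n_test * public_fraction)
    return sorted(indices[:n_public]), sorted(indices[n_public:])


def accuracy(y_true, y_pred, indices):
    """Fraction of correct predictions over the given row indices."""
    return sum(y_true[i] == y_pred[i] for i in indices) / len(indices)


# Toy data: hidden test labels (known only to the organizer) and one submission.
hidden_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
submission = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

public_idx, private_idx = split_public_private(len(hidden_labels))
public_score = accuracy(hidden_labels, submission, public_idx)    # visible during the competition
private_score = accuracy(hidden_labels, submission, private_idx)  # decides the final ranking
print(public_score, private_score)
```

Because the two subsets are disjoint, tuning against the public score alone can overfit it, which is one reason local cross validation matters.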

Page 14: Data Science Competition

Kaggle

• 250+ competitions since 2010

• 500K+ users

• 50K+ competitors

• $3MM+ in prizes paid out

Page 15: Data Science Competition

Kaggle

Page 16: Data Science Competition

Kaggle

Page 17: Data Science Competition

Misconceptions on Competitions

Page 18: Data Science Competition

Misconceptions on Competitions

• No ETL

• No EDA

• Not worth it

• Not for production

Page 19: Data Science Competition

No ETL? - Deloitte Western Australia Rental Prices

Page 20: Data Science Competition

No ETL? - Outbrain Click Prediction

2B page views. 16.9MM clicks. 700MM users. 560 sites

Page 21: Data Science Competition

No ETL? - YouTube-8M Video Understanding Challenge

1.7TB feature-level data. 31GB video-level data.

Page 22: Data Science Competition

No ETL?

Page 23: Data Science Competition

No EDA?

• Most competitions provide actual labels - typical EDA

• Anonymized data - more creative EDA

o People decode age, states, time intervals, income, etc.

Page 24: Data Science Competition

No EDA?

• Anonymized data - more creative EDA

Page 25: Data Science Competition

Not worth it?

• Performance matters

• You walk easier when you can run

Page 26: Data Science Competition

Not for Production?

• Kaggle Kernel

o Max execution time: 10 minutes

o Max file output: 500MB

o Memory limit: 8GB

Page 27: Data Science Competition

Ensemble Pipeline at Conversion Logic

Page 28: Data Science Competition

Best Practices

Page 29: Data Science Competition

Best Practices

• Feature Engineering

• Algorithms

• Cross Validation

• Ensemble

Page 30: Data Science Competition

Feature Engineering

• Numerical - Log, Log(1 + x), Normalization, Binarization

• Categorical - One-hot-encode, TF-IDF (text), Weight-of-Evidence

• Timeseries - Stats, FFT, MFCC, ERP (EEG)

• Numerical/Timeseries to Categorical - RF/GBM*

• Dimensionality Reduction - PCA, SVD, Autoencoder

* http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
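
A few of the numerical and categorical transforms listed on this slide can be sketched in plain Python. This is an illustrative sketch only: the function names, the smoothing constant, and the sign convention for Weight-of-Evidence (log of the positive-class share over the negative-class share) are assumptions, and in practice a library such as Scikit-Learn would usually do this work.

```python
import math
from collections import Counter


def log1p_transform(values):
    """Log(1 + x): compresses heavy-tailed, non-negative values."""
    return [math.log1p(v) for v in values]


def min_max_normalize(values):
    """Rescale values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def binarize(values, threshold):
    """1 if the value exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


def one_hot_encode(categories):
    """Return the sorted vocabulary and one indicator row per input value."""
    vocab = sorted(set(categories))
    index = {c: i for i, c in enumerate(vocab)}
    rows = []
    for c in categories:
        row = [0] * len(vocab)
        row[index[c]] = 1
        rows.append(row)
    return vocab, rows


def weight_of_evidence(categories, labels, smoothing=0.5):
    """WoE per category: ln(P(category | label=1) / P(category | label=0)).

    Additive smoothing (an assumption here) avoids division by zero.
    """
    pos = Counter(c for c, y in zip(categories, labels) if y == 1)
    neg = Counter(c for c, y in zip(categories, labels) if y == 0)
    vocab = set(categories)
    pos_total = sum(pos.values()) + smoothing * len(vocab)
    neg_total = sum(neg.values()) + smoothing * len(vocab)
    return {
        c: math.log(((pos[c] + smoothing) / pos_total)
                    / ((neg[c] + smoothing) / neg_total))
        for c in vocab
    }


spend = [0, 9, 99, 999]
print(log1p_transform(spend))
print(min_max_normalize(spend))
print(binarize(spend, threshold=50))
print(one_hot_encode(["red", "blue", "red"]))
print(weight_of_evidence(["a", "a", "b", "b"], [1, 1, 0, 1]))
```

Weight-of-Evidence uses the target, so in a competition setting it is normally computed on training folds only to avoid leaking labels into validation data.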

Page 31: Data Science Competition

Algorithms

Algorithm | Tool | Note
Gradient Boosting Machine | XGBoost, LightGBM | The most popular algorithm in competitions
Random Forests | Scikit-Learn, randomForest | Used to be popular before GBM
Extremely Randomized Trees | Scikit-Learn |
Neural Networks / Deep Learning | Keras, MXNet, CNTK, Torch | Blends well with GBM. Best at image and speech recognition competitions
Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensemble.
Support Vector Machine | Scikit-Learn |
FTRL | Vowpal Wabbit | Competitive solution for CTR estimation competitions
Factorization Machine | libFM | Winning solution for KDD Cup 2012
Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions (Criteo, Avazu)
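
To make the FTRL entry more concrete, here is a minimal pure-Python sketch of per-coordinate FTRL-Proximal for logistic regression on sparse binary features, the kind of online learner often used for CTR estimation. It is not Vowpal Wabbit's implementation. The class name, the hyperparameter values, and the toy click data are assumptions for illustration.

```python
import math
from collections import defaultdict


class FTRLProximal:
    """Per-coordinate FTRL-Proximal logistic regression (binary features)."""

    def __init__(self, alpha=0.5, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # adjusted gradient sums
        self.n = defaultdict(float)  # squared gradient sums

    def _weight(self, key):
        """Closed-form weight for one coordinate; L1 zeroes small weights."""
        z = self.z.get(key, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        sign = 1.0 if z > 0 else -1.0
        n = self.n.get(key, 0.0)
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, features):
        """Predicted click probability for a list of active feature names."""
        score = sum(self._weight(k) for k in features)
        score = max(min(score, 35.0), -35.0)  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, features, label):
        """One online step on a single example; returns the prior prediction."""
        p = self.predict(features)
        g = p - label  # log-loss gradient for a feature whose value is 1
        for k in features:
            w = self._weight(k)
            sigma = (math.sqrt(self.n[k] + g * g) - math.sqrt(self.n[k])) / self.alpha
            self.z[k] += g - sigma * w
            self.n[k] += g * g
        return p


# Toy stream: ad A is always clicked, ad B never is.
clicks = [(["bias", "ad=A"], 1), (["bias", "ad=B"], 0)]
model = FTRLProximal()
for _ in range(100):
    for features, label in clicks:
        model.update(features, label)

p_a = model.predict(["bias", "ad=A"])
p_b = model.predict(["bias", "ad=B"])
print(p_a, p_b)
```

Real CTR pipelines typically hash feature names into a fixed-size index space and set `l1` above zero to keep the model sparse.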

Page 32: Data Science Competition

Cross Validation

Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
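
A stratified five-fold split of this kind can be sketched as follows. This is a minimal standard-library version that assumes a binary label where 1 marks a dropout; the function name `stratified_kfold` and the toy label counts are hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_folds=5, seed=42):
    """Return a fold id per row so each fold keeps roughly the overall label rate."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    fold_of = [None] * len(labels)
    for y in sorted(by_label):
        rows = by_label[y]
        rng.shuffle(rows)
        # Deal the shuffled rows of each class round-robin across folds.
        for position, i in enumerate(rows):
            fold_of[i] = position % n_folds
    return fold_of


labels = [1] * 20 + [0] * 80  # toy data with a 20% dropout rate
fold_of = stratified_kfold(labels)

fold_stats = []
for fold in range(5):
    valid = [i for i, f in enumerate(fold_of) if f == fold]
    train = [i for i, f in enumerate(fold_of) if f != fold]
    rate = sum(labels[i] for i in valid) / len(valid)
    fold_stats.append((len(train), len(valid), rate))
print(fold_stats)
```

Each fold serves once as validation data while the other four are used for training, and reusing the same fold assignment for every model keeps their validation scores comparable.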

Page 33: Data Science Competition
Page 34: Data Science Competition

Ensemble

* for other types of ensemble, see http://mlwave.com/kaggle-ensembling-guide/
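
The slide's figure is not part of this transcript, so the ensemble method it showed is not known. As a hedged illustration of two simple ensemble types, the sketch below blends the predictions of two hypothetical models by weighted averaging and by rank averaging. The model names and prediction values are made up.

```python
def average_blend(prediction_lists, weights=None):
    """Weighted average of several models' predictions, row by row."""
    n_models = len(prediction_lists)
    weights = weights or [1.0] * n_models
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*prediction_lists)
    ]


def to_ranks(predictions):
    """Replace scores by their rank scaled to [0, 1] (ties broken by position)."""
    order = sorted(range(len(predictions)), key=lambda i: predictions[i])
    denom = max(len(predictions) - 1, 1)
    ranks = [0.0] * len(predictions)
    for rank, i in enumerate(order):
        ranks[i] = rank / denom
    return ranks


def rank_average(prediction_lists):
    """Average ranks instead of raw scores; useful for rank-based metrics like AUC."""
    return average_blend([to_ranks(p) for p in prediction_lists])


gbm_pred = [0.9, 0.2, 0.6, 0.4]  # hypothetical GBM predictions
nn_pred = [0.8, 0.1, 0.3, 0.7]   # hypothetical neural network predictions

blended = average_blend([gbm_pred, nn_pred])
ranked = rank_average([gbm_pred, nn_pred])
print(blended, ranked)
```

Blend weights are normally chosen on out-of-fold predictions from cross validation, not on the public leaderboard score.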

Page 35: Data Science Competition

KDDCup 2015 Solution

Page 36: Data Science Competition

Why Competition

• For fun

• For experience

• For learning

• For networking

Page 37: Data Science Competition

One Last Thing

Google: 20K applications per week

Conversion Logic: 200 applications per week

Page 38: Data Science Competition

Thank You