DATA SCIENCE COMPETITION
Conversion Logic @ Whisper
2.1.2017

Uploaded by Jeong-Yoon Lee

Posted on 10-Feb-2017



ATTRIBUTION

DATA SCIENCE COMPETITION

• Since 1997
• 2006 - 2009
• Since 2010

COMPETITION STRUCTURE

• Training Data: both features and labels are provided.
• Test Data: only features are provided; competitors submit predicted labels.
• Submissions are scored against the withheld test labels: a subset is reported as the public LB score during the competition, and final rankings use the remaining held-out subset, the private LB score.
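The public/private leaderboard mechanics above can be sketched in a few lines of Python. This is a minimal illustration, not from the deck; the accuracy metric, the 30% public fraction, and all names are assumptions made here:

```python
import random

def leaderboard_scores(y_true, y_pred, public_frac=0.3, seed=0):
    """Split test rows into public/private portions and score each,
    mimicking a public LB shown during the competition and a private LB
    used for the final ranking. Accuracy is used here for simplicity;
    real competitions define their own metric."""
    idx = list(range(len(y_true)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * public_frac)
    public, private = idx[:cut], idx[cut:]

    def acc(rows):
        return sum(y_true[i] == y_pred[i] for i in rows) / len(rows)

    return acc(public), acc(private)

y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]
pub, prv = leaderboard_scores(y_true, y_pred)
```

Because the public score is computed on a small subset, it can diverge from the private score, which is why overfitting the public LB is a classic competition pitfall.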

KAGGLE

• 237 competitions since 2010

• 500K+ users

• 50K+ competitors

• $3MM+ in prize money paid out



WHY COMPETITION

• For fun

• For experience

• For learning

• For networking

FUN

• Competing with others

• Incremental improvement

EXPERIENCE


LEARNING

NETWORKING


BS ON COMPETITIONS

• No ETL

• No EDA

• Not worth it

• Not for production


NO ETL?

• Deloitte Western Australia Rental Prices

• Outbrain Click Prediction: 2B page views, 16.9MM clicks, 700MM users, 560 sites

NO EDA?

• Most competitions provide actual labels, so typical EDA applies

• Anonymized data calls for more creative EDA: people decode age, states, time intervals, income, etc.

NOT WORTH IT?

• Performance matters

• You can walk more easily once you know how to run

NOT FOR PRODUCTION?

• Kaggle Kernels
  • Max execution time: 10 minutes
  • Max file output: 500MB
  • Memory limit: 8GB

ENSEMBLE PIPELINE AT CL



BEST PRACTICES

• Feature Engineering

• Algorithms

• Cross Validation

• Ensemble


FEATURE ENGINEERING

• Numerical: Log, Log(1 + x), Normalization, Binarization

• Categorical: One-hot encoding, TF-IDF (text), Weight-of-Evidence

• Timeseries: Stats, FFT, MFCC, ERP (EEG)

• Numerical/Timeseries to Categorical: RF/GBM*

• Dimensionality Reduction: PCA, SVD, Autoencoder

* http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
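A few of the transforms listed above can be sketched in plain Python. These are illustrative helpers, not the presenter's code; the eps smoothing in the Weight-of-Evidence helper is one common choice among several:

```python
import math

def log1p(x):
    """Log(1 + x): compresses heavy-tailed numerical features."""
    return math.log1p(x)

def binarize(x, threshold=0.0):
    """Numerical -> 0/1 indicator of exceeding a threshold."""
    return 1 if x > threshold else 0

def one_hot(value, vocabulary):
    """Categorical -> indicator vector over a fixed vocabulary."""
    return [1 if value == v else 0 for v in vocabulary]

def weight_of_evidence(pos, neg, total_pos, total_neg, eps=0.5):
    """WoE for one category: log of (event rate / non-event rate),
    with eps smoothing to avoid division by zero."""
    return math.log(((pos + eps) / total_pos) / ((neg + eps) / total_neg))

encoded = one_hot("red", ["red", "green", "blue"])
```

A category seen mostly with positive labels gets a positive WoE, mostly-negative categories a negative one, so the encoding carries target information into a single numerical column.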

ALGORITHMS

Algorithm                         | Tool                        | Note
Gradient Boosting Machine         | XGBoost, LightGBM           | The most popular algorithm in competitions
Random Forests                    | Scikit-Learn, randomForest  |
Extremely Randomized Trees        | Scikit-Learn                |
Neural Networks / Deep Learning   | Keras, MXNet                | Blends well with GBM. Best at image recognition and NLP competitions.
Logistic/Linear Regression        | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensembles.
Support Vector Machine            | Scikit-Learn                |
FTRL                              | Vowpal Wabbit               | Competitive solution for CTR estimation competitions
Factorization Machine             | libFM                       | Winning solution for KDD Cup 2012
Field-aware Factorization Machine | libFFM                      | Winning solution for CTR estimation competitions (Criteo, Avazu)
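The table names XGBoost/LightGBM as the workhorse GBM tools. As a hedged illustration of what gradient boosting itself does, here is a minimal pure-Python sketch using decision stumps and squared loss; it is not the presenter's code and no substitute for the real libraries:

```python
def fit_stump(x, residuals):
    """Best single-threshold regression stump on one feature (squared loss)."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda xi: lmean if xi <= t else rmean

def gbm_fit(x, y, n_rounds=20, lr=0.3):
    """Gradient boosting for squared loss: each round fits a stump to the
    current residuals and adds it, scaled by a learning rate."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: base + lr * sum(s(xi) for s in stumps)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gbm_fit(x, y)
```

Each round corrects what the previous rounds got wrong, which is the property that makes GBM so effective on the tabular data typical of competitions.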

CROSS VALIDATION

Training data are split into five folds in which both the sample size and the dropout rate (the positive-class ratio of the target) are preserved, i.e. stratified K-fold.
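A minimal sketch of such a stratified split, assuming binary labels and round-robin fold assignment. This is illustrative only; in practice a library routine such as scikit-learn's StratifiedKFold does this:

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=5, seed=42):
    """Assign each row to one of k folds so that every fold preserves
    the overall class ratio (e.g. the dropout rate) approximately."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)            # shuffle within each class
        for j, i in enumerate(idxs):
            folds[j % k].append(i)   # deal rows out round-robin
    return folds

labels = [1] * 20 + [0] * 80   # 20% "dropout" rate
folds = stratified_kfold(labels)
```

With 100 rows at a 20% positive rate, each of the five folds gets 20 rows and 4 positives, so every validation fold sees the same dropout rate as the full training set.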

Ensemble Model Training


ENSEMBLE

* For other types of ensembles, see http://mlwave.com/kaggle-ensembling-guide/
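As a hedged sketch of the two simplest ensembles, averaging and rank-averaging model predictions (illustrative only; a stacking pipeline would instead train a meta-model on out-of-fold predictions):

```python
def average_ensemble(preds, weights=None):
    """Row-wise (weighted) mean of several models' predictions."""
    n_models = len(preds)
    weights = weights or [1.0 / n_models] * n_models
    return [sum(w * p[i] for w, p in zip(weights, preds))
            for i in range(len(preds[0]))]

def rank_average(preds):
    """Average normalized ranks instead of raw scores; useful when models
    output scores on different scales (common in AUC-scored competitions)."""
    def ranks(p):
        order = sorted(range(len(p)), key=lambda i: p[i])
        r = [0.0] * len(p)
        for rank, i in enumerate(order):
            r[i] = rank / (len(p) - 1)
        return r
    return average_ensemble([ranks(p) for p in preds])

model_a = [0.9, 0.2, 0.6]
model_b = [0.7, 0.1, 0.8]
blend = average_ensemble([model_a, model_b])
```

Averaging works because diverse models make partly uncorrelated errors, which is why blends of GBM, neural networks, and linear models routinely beat any single member.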

KDDCUP 2015 SOLUTION


WHY COMPETITION

• For fun

• For experience

• For learning

• For networking