data science competition
COMPETITION STRUCTURE
[Diagram: Training Data and Test Data, each with Feature and Label columns. Labeled regions: Provided, Submission, Public LB Score, Private LB Score.]
NO EDA?
• Most competitions provide actual labels - typical EDA
• Anonymized data - more creative EDA
• People decode age, states, time intervals, income, etc.
NOT FOR PRODUCTION?
• Kaggle Kernel
• Max execution time: 10 minutes
• Max file output: 500MB
• Memory limit: 8GB
FEATURE ENGINEERING
• Numerical - Log, Log(1 + x), Normalization, Binarization
• Categorical - One-hot-encode, TF-IDF (text), Weight-of-Evidence
• Timeseries - Stats, FFT, MFCC, ERP (EEG)
• Numerical/Timeseries to Categorical - RF/GBM*
• Dimensionality Reduction - PCA, SVD, Autoencoder
* http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf
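A few of the numerical and categorical transforms listed above can be sketched in plain Python. This is a minimal illustration, not a recipe from the slides: the function names, the sample values, and the choice of zero-mean/unit-variance scaling as the "Normalization" are all assumptions made for the example.

```python
import math

def log1p_transform(values):
    """Log(1 + x): compresses a heavy right tail and is defined at x = 0."""
    return [math.log1p(v) for v in values]

def standardize(values):
    """Normalization, here taken to mean zero mean and unit variance (assumed convention)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for constant columns
    return [(v - mean) / std for v in values]

def binarize(values, threshold=0.0):
    """Binarization: 1 if the value exceeds the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

def one_hot(categories):
    """One-hot encoding: one indicator column per distinct level, levels sorted."""
    levels = sorted(set(categories))
    index = {level: i for i, level in enumerate(levels)}
    rows = []
    for c in categories:
        row = [0] * len(levels)
        row[index[c]] = 1
        rows.append(row)
    return levels, rows

# Hypothetical columns, for illustration only.
income = [0.0, 9.0, 99.0, 999.0]
color = ["red", "blue", "red", "green"]

log_income = log1p_transform(income)
z_income = standardize(income)
high_income = binarize(income, threshold=50.0)
levels, color_onehot = one_hot(color)
```

In a competition these statistics (mean, standard deviation, category levels) would normally be computed on the training data and then reused on the test data, so that both are encoded the same way.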
ALGORITHMS
Algorithm | Tool | Note
Gradient Boosting Machine | XGBoost, LightGBM | The most popular algorithm in competitions
Random Forests | Scikit-Learn, randomForest |
Extremely Randomized Trees | Scikit-Learn |
Neural Networks / Deep Learning | Keras, MXNet | Blends well with GBM. Best at image recognition and NLP competitions.
Logistic/Linear Regression | Scikit-Learn, Vowpal Wabbit | Fastest. Good for ensembles.
Support Vector Machine | Scikit-Learn |
FTRL | Vowpal Wabbit | Competitive solution for CTR estimation competitions
Factorization Machine | libFM | Winning solution for KDD Cup 2012
Field-aware Factorization Machine | libFFM | Winning solution for CTR estimation competitions (Criteo, Avazu)
CROSS VALIDATION
Training data are split into five folds where the sample size and dropout rate are preserved (stratified).
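The stratified five-fold split described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions: the function name `stratified_kfold`, the seed, and the toy labels (20 dropouts out of 100 samples) are hypothetical and not taken from the slides.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, n_splits=5, seed=42):
    """Return n_splits lists of sample indices with class proportions preserved.

    Indices of each class are shuffled and then dealt round-robin across
    the folds, so every fold receives (almost) the same share of each class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds = [[] for _ in range(n_splits)]
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    return folds

# Toy labels: 1 = dropout, 0 = no dropout (hypothetical 20% dropout rate).
labels = [1] * 20 + [0] * 80
folds = stratified_kfold(labels, n_splits=5)

for k, val_idx in enumerate(folds):
    held_out = set(val_idx)
    # Train on the other four folds, validate on this one.
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    rate = sum(labels[i] for i in val_idx) / len(val_idx)
    print(f"fold {k}: train={len(train_idx)} valid={len(val_idx)} dropout_rate={rate:.2f}")
```

With Scikit-Learn available, `sklearn.model_selection.StratifiedKFold(n_splits=5, shuffle=True)` provides the same kind of split without hand-written code.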