
Beat the benchmark.

Getting started with competitive data mining

By Maheshakya Wijewardena

What is competitive data mining and why?

● There is a gap between those who have data and those who can analyze it.
● Organizations need to make use of their massive amounts of data, but with less expenditure.
● Competitions promote and expand research on applications and data models.
● Challenges organized by SIGKDD, ICML, PAKDD, ECML, NIPS, etc. promote these practices.
● Find talent, attract skills... e.g., Facebook, Yahoo, Yelp, ...

What is competitive data mining and why?

“I keep saying the sexy job in the next ten years will be statisticians.”

– Hal Varian, Google chief economist, 2009

What is competitive data mining and why?

● Kaggle?

“[Kaggle] is a platform for data prediction competitions that allows organizations to post their data and have it scrutinized by the world’s best data scientists.”

– verbatim, from Kaggle

An outline

● Types of challenges
● Understanding the challenge
● Setting things up
● Analyzing data
● Data preprocessing
● Training models
● Validating models
● ML/Statistics packages
● Conclusion

Types of competitions

Those well-known tasks you find in a data mining class...

● Most of them are classification
  ○ Binary or probability
  ○ Rarely multiclass
● Time series forecasting
  ○ Predict for some period ahead
  ○ Seasonal patterns
● Anomaly detection

The majority of competitions focus on the results, not the process.

But some give high priority to the process: scalability, technical feasibility, complexity, etc. (often for recruitment and research).

Before you start...

Be aware of the structure of data mining competitions on Kaggle.

Always remember that the purpose of a predictive model is to predict on data that we have not seen!

Understand what it is about

● Read the problem until you understand it completely.
● Keep an eye on the forum, always - know how other competitors think.
● Check dataset sizes! Can you handle them?
● Competitive advantage - try to get some domain knowledge, though it is not strictly necessary.
● How do they evaluate, and on what criterion? (A metric sketch follows this list.)
  ○ Area under the ROC curve
  ○ MSE
  ○ False positive/negative rate
  ○ Precision-recall
  ○ ...
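A minimal sketch of computing some of these metrics with scikit-learn; the tiny arrays are placeholders standing in for real validation labels and predictions:

import numpy as np
from sklearn.metrics import (roc_auc_score, mean_squared_error,
                             precision_score, recall_score)

y_true = np.array([0, 0, 1, 1, 1, 0])                # ground-truth labels (illustrative)
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])  # predicted probabilities (illustrative)
y_pred = (y_score >= 0.5).astype(int)                # hard predictions at threshold 0.5

print("AUC:      ", roc_auc_score(y_true, y_score))
print("MSE:      ", mean_squared_error(y_true, y_score))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))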


Setting things up...

● Boil the problem down into sections
● Organize your team - divide the work
● Look at the benchmark code - a good starting point, but it’s not enough!
● Look at sample submission files

And most importantly,

● Set up an environment in which you can iterate and test new ideas rapidly


Analyzing Data

KNOW THY DATA!!!


Analyzing Data

● Get to know your data (see the exploration sketch below)
  ○ Raw data - image, video, text - do I need to perform feature extraction too?
  ○ Numerical, categorical
● Visualize! - histograms, pie charts, cluster diagrams, ...
  ○ Advanced - vector quantization, self-organizing maps (SOM)
● Missing values
● Class imbalance
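A quick exploration sketch with pandas and matplotlib; the file name "train.csv" and the "target" column are assumptions about a typical competition dataset:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")         # hypothetical competition training file

print(df.dtypes)                      # numerical vs. categorical columns
print(df.isnull().sum())              # missing values per column
print(df["target"].value_counts())    # class imbalance at a glance ("target" is assumed)

df.hist(figsize=(12, 8))              # quick histograms of the numeric features
plt.tight_layout()
plt.show()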


Feature engineering and Data Preprocessing

Typical preprocessing techniques:

● Handle missing values - keep, discard, or impute (see the sketch after this list)
● Resample - up/downsampling
● Encoding
  ○ Label encoding
  ○ One-hot encoding / bit maps
● For text - TF-IDF, feature hashing, bag of words, ...
● Dimensionality reduction - PCA, SVD, ...
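A sketch of several of these steps with scikit-learn and pandas; the toy columns ("age", "city", "text") are purely illustrative:

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA

df = pd.DataFrame({"age": [25, None, 40],
                   "city": ["NY", "SF", "NY"],
                   "text": ["cheap flights", "hotel deals", "cheap hotel"]})

# Impute missing numeric values with the column mean
df[["age"]] = SimpleImputer(strategy="mean").fit_transform(df[["age"]])

# Label encoding (ordinal integers) vs. one-hot encoding (bit maps)
df["city_label"] = LabelEncoder().fit_transform(df["city"])
city_onehot = OneHotEncoder().fit_transform(df[["city"]]).toarray()

# TF-IDF for text, then PCA for dimensionality reduction
tfidf = TfidfVectorizer().fit_transform(df["text"]).toarray()
reduced = PCA(n_components=2).fit_transform(tfidf)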


Feature engineering and Data Preprocessing

Feature engineering is a bit trickier…

● Identify the most important/impactful features (see the RFE sketch below).
  ○ Feature selection
  ○ Strong dependency on the learning algorithm
  ○ Recursive feature elimination
● Eliminate trivially irrelevant features - IDs, timestamps (sometimes)
● Derived features?
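A minimal recursive feature elimination sketch; the synthetic dataset stands in for real competition features:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# RFE repeatedly refits the estimator and drops the weakest features,
# illustrating the strong dependency on the chosen learning algorithm
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)
print("Selected features:", selector.support_)
print("Feature ranking:  ", selector.ranking_)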


Important!

Make sure you have the evaluation metric implemented yourself.

When evaluating your models:

● A simple training/validation split is not enough.
  ○ K-fold cross-validation uses every fraction of the data for training while still holding out a sample each time.
● Always have a separate holdout set that you do not touch at all during the model building process (see the sketch below).
  ○ Including preprocessing
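A sketch of that scheme, assuming scikit-learn and a synthetic dataset: carve off the holdout set first, then cross-validate only on the remaining training data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# The holdout set is never touched during model building (including preprocessing)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0)
print("5-fold CV scores:", cross_val_score(model, X_train, y_train, cv=5))

# Only when a model looks good is it scored on the untouched holdout set
model.fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_hold, y_hold))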


Typical model building process

[Flow diagram] Split the data into a training set and a holdout set for validation. Preprocess the training set and train the model on it; preprocess the holdout set the same way and evaluate the model on it (sketched in code below).

Training set → Preprocess → Train model → Evaluate on the holdout set
  Good? → Implement the model
  Bad? → Be brave and scrap the model!
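A sketch of that loop in code, assuming scikit-learn models and a purely illustrative score threshold:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.2, random_state=0)

THRESHOLD = 0.85    # illustrative assumption, not a recommendation
kept = []
for model in [LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0),
              GradientBoostingClassifier(random_state=0)]:
    model.fit(X_train, y_train)            # train on the training set
    score = model.score(X_hold, y_hold)    # evaluate on the holdout set
    print(type(model).__name__, score)
    if score >= THRESHOLD:
        kept.append(model)                 # good? implement the model
    # bad? be brave and scrap the model!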

Training models

● Learning algorithm - select carefully based on the problem
● Hyperparameter tuning (see the grid search sketch below)
  ○ Grid search
  ○ Randomized search
  ○ Manual?
● Be aware of overfitting!
● Ensemble methods:
  ○ Bagging
  ○ Boosting
  ○ Model ensembling - convex combinations

No matter what models you train, winning solutions will always be ensembles.
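A minimal grid search and convex-combination sketch; the parameter grid and blend weights are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Hyperparameter tuning via grid search (randomized search works analogously)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [100, 300], "max_depth": [None, 10]}, cv=5)
grid.fit(X_tr, y_tr)
rf = grid.best_estimator_

gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Model ensembling: a convex combination (weights sum to 1) of probabilities
blend = 0.6 * rf.predict_proba(X_te)[:, 1] + 0.4 * gb.predict_proba(X_te)[:, 1]
print("Blended probabilities, first 5:", np.round(blend[:5], 3))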


Model Validation

● Get your model’s score from your evaluator.
  ○ Bad? Keep it aside and design a new model.
  ○ Good? Go ahead and predict for the test set.
● Even if an individual model performs poorly, it might fit gracefully into an ensemble.
● Confusion matrix
● Try to visualize predicted vs. actual (sketched below)
  ○ With each feature
  ○ Gives you insight into which characteristics of the features make the model better or worse

● Keep records.
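A sketch of the confusion matrix and a predicted-vs-actual plot against one feature, on synthetic data:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

print(confusion_matrix(y_te, y_pred))    # rows: actual, columns: predicted

# Predicted vs. actual against a single feature
plt.scatter(X_te[:, 0], y_te, alpha=0.5, label="actual")
plt.scatter(X_te[:, 0], y_pred, alpha=0.5, marker="x", label="predicted")
plt.xlabel("feature 0")
plt.legend()
plt.show()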


Final steps...

Submissions:

● Try to submit something every day - know your position
● Keep updated
● Don’t make changes to your model that yield only slight improvements on the public leaderboard - often a trap!

Don’t forget the forum!

● If you have something interesting, share it with others - but not everything ;)
● Good Kagglers always give something back


About ML/Stat packages...

● Machine learning packages:
  ○ R
  ○ scikit-learn
  ○ pylearn
  ○ mlpack
  ○ Shogun
  ○ Spark/H2O - scalable, distributed processing, but limited functionality
● Statistics
  ○ Again, R
  ○ statsmodels
● Data manipulation
  ○ Again, R
  ○ pandas, NumPy, SciPy
● Visualization
  ○ Again, R
  ○ matplotlib

Sometimes,

● Deep learning - Theano
● NLP - NLTK

Emerging - Julia

Conclusion

● First, try out some “getting started” competitions - take advantage of them
● When analyzing data, be patient and meticulous
● Visualize!
● (Some) domain knowledge would be useful
● Feature engineering is (often) the key
● Have the discipline to maintain a proper validation framework
● Be brave!
● Learn from others
● “Right” models
● Use ML/Stat packages effectively
● Good coding/data manipulation and software engineering best practices
● Avoid overfitting!
● Luck....


No Free Lunch


?


References

1. Kaggle, https://www.kaggle.com/
2. Krishna Shankar, Hitchhiker’s Guide to Kaggle, http://www.slideshare.net/ksankar/oscon-kaggle20
3. Beth Schultz, 10 Tips for Winning a Data Science Competition, http://www.allanalytics.com/author.asp?doc_id=268513
4. Owen Zhang, “Tips for Data Science Competitions”, http://www.slideshare.net/OwenZhang2/tips-for-data-science-competitions
5. Parsec Labs, https://www.parseclabs.com/knowthydata
