WHAT DOES IT TAKE TO WIN THE KAGGLE/YANDEX COMPETITION

Christophe Bourguignat
Kenji Lefèvre-Hasegawa
Paul Masurel @Dataiku
Matthieu Scordia @Dataiku

OUTLINE OF THE TALK

• Review of the Kaggle/Yandex Challenge
• How we worked (team work & tools)
• The winning model

GOAL

Re-rank the URLs returned by Yandex according to the personal preferences of each user.

[Figure: a result list (url1, url2, url3, url4) re-ranked as (url3, url2, url1, url4)]

ML CHALLENGE

Predict the relevance of each URL to the user and re-rank the result set accordingly.

The Kaggle/Yandex challenge

GIVEN
• 30 days of logs (train: 27 days, test: 3 days)
• Users' historical queries, clicks & dwell-times
• Prior activity in the test session: queries, clicks & dwell-times

SIZE
• 15 GB


[Figure: timeline of a user's queries (Q Q Q Q), followed by the test session (Q Q T ?)]

QUALITY METRIC

• One test query per user, in the last 3 days
• The NDCG metric penalizes relevance errors on top-ranked URLs
• No A/B test
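The behaviour of the metric can be sketched in a few lines. This is a minimal illustration of the usual NDCG formula (gain 2^rel - 1, logarithmic position discount), not the official evaluation code; the relevance grades 0 to 2 and the example orderings are assumptions made for the example.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical grades for four URLs, in the order they are displayed.
ideal_order = [2, 1, 1, 0]
top_swapped = [1, 2, 1, 0]     # mistake in the first two positions
bottom_swapped = [2, 1, 0, 1]  # mistake in the last two positions

for name, grades in [("ideal", ideal_order),
                     ("top swapped", top_swapped),
                     ("bottom swapped", bottom_swapped)]:
    print(f"{name}: NDCG = {ndcg(grades):.3f}")
```

The same swap costs more when it happens at the top of the list than at the bottom, which is the property the bullet above refers to.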


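[Figure: three rankings of url1 to url4, (url1, url2, url3, url4), (url3, url2, url1, url4) and (url1, url2, url4, url3), labeled "Kaggle", "Prediction" and "Another ranking", with the verdicts "OK" and "BAD"]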

TEAM DATAIKU SCIENCE STUDIO / KAGGLE

• Christophe Bourguignat Engineer, Data enthusiast

• Kenji Lefèvre-Hasegawa Ph.D. math, new to ML

• Paul Masurel Software Engineer @dataiku

• Matthieu Scordia Data Scientist @dataiku

First meeting: October 16th, 2013

How we worked (Team work & tools)

WE’VE USED

• Related papers (mainly Microsoft's)
• 12 cores, 64 GB
• Python scikit-learn
• Dataiku Science Studio
• Java RankLib


DATAIKU SCIENCE STUDIO


LEARNING

Team members work independently

Original train

Split train & validation

Labels

Features & labels
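The "split train & validation" step can be pictured as follows. This is only a sketch of one plausible approach: holding out the last days of the training log, to mirror the 27-day/3-day split of the competition. The `day` and `session_id` fields and the choice of 3 validation days are assumptions, not details taken from the slides.

```python
def split_by_day(sessions, n_validation_days=3):
    """Hold out the sessions of the last days as a validation set."""
    last_day = max(s["day"] for s in sessions)
    cutoff = last_day - n_validation_days
    train = [s for s in sessions if s["day"] <= cutoff]
    validation = [s for s in sessions if s["day"] > cutoff]
    return train, validation


# Toy log: one hypothetical session per listed day.
sessions = [{"session_id": i, "day": day}
            for i, day in enumerate([1, 5, 12, 24, 25, 26, 27])]

train, validation = split_by_day(sessions)
print(len(train), "training sessions,", len(validation), "validation sessions")
```

Features would then be computed from the earlier days only, and labels from the held-out days, so that validation scores behave like leaderboard scores.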

FEATURES CONSTRUCTION

Team members work independently

Features

DATA DRIVEN COMPUTATION

HOW MUCH WORK?
• 960+ emails
• 360+ features
• 50+ ideas grid-tuned (300+ models fitted)
• Server heavily loaded during the last 3 weeks
• 56 Kaggle submissions
• 196 teams, 264 players, 3570 submissions


[Figure: leaderboard position over time (periods of 1/2 month, 1 week, 1 week, 1 week): Top 25, Top 10, 5th, 1st, 3rd, 1st. On 2014-01-01, the future top 2 & 3 enter the race.]

PROBLEM ANALYSIS

Query

Result Set
• Rank
• URL snippet quality
• URL is skipped, clicked or missed

Reading URL
• URL & domain pertinence, with dwell-time

CLICK
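The skipped / clicked / missed distinction can be made concrete with a small sketch. It assumes a top-down reading model: a URL shown above the lowest click but not clicked counts as "skipped", and a URL below the lowest click counts as "missed" (presumably never examined). The slides do not spell out the exact definitions, so this reading and the function name are assumptions.

```python
def label_results(urls, clicked):
    """Label each displayed URL as 'clicked', 'skipped' or 'missed'.

    Assumption: the user scans the list top-down and stops at the
    lowest-ranked click. With no click at all, every URL is 'missed'.
    """
    clicked = set(clicked)
    click_positions = [i for i, url in enumerate(urls) if url in clicked]
    last_click = max(click_positions) if click_positions else -1

    labels = {}
    for i, url in enumerate(urls):
        if url in clicked:
            labels[url] = "clicked"
        elif i < last_click:
            labels[url] = "skipped"
        else:
            labels[url] = "missed"
    return labels


labels = label_results(["url1", "url2", "url3", "url4"], clicked=["url3"])
print(labels)
```

Here url1 and url2 are skipped, url3 is clicked and url4 is missed.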

The winning model

FEATURES

Features:
• Rank
• User habits, query specificity (entropy, frequency, …)
• Snippet pertinence
• Missed, skipped, clicked
• URL & domain pertinence

Variants of … & Clicked:
• Probability, stimuli frequency, Mean Reciprocal Rank (MRR)
• For each user: history, previous activity in the test session, and the aggregate of both
• For all users
• Computed over all queries and over the same query
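A rough sketch of how such aggregate features could be computed from a log of impressions is given below. The record layout (`user`, `query`, `url`, `rank`, `outcome`), the function names, the reading of MRR as the mean of 1/rank over the impressions where the event occurred, and the reading of query entropy as the entropy of clicked URLs are all assumptions made for the illustration.

```python
import math
from collections import Counter, defaultdict


def event_features(impressions, event="clicked"):
    """Per-URL frequency, event probability and MRR for one kind of event."""
    shown = defaultdict(int)
    hits = defaultdict(int)
    reciprocal_rank_sum = defaultdict(float)
    for imp in impressions:
        url = imp["url"]
        shown[url] += 1
        if imp["outcome"] == event:
            hits[url] += 1
            reciprocal_rank_sum[url] += 1.0 / imp["rank"]
    return {
        url: {
            "frequency": shown[url],
            "probability": hits[url] / shown[url],
            "mrr": reciprocal_rank_sum[url] / hits[url] if hits[url] else 0.0,
        }
        for url in shown
    }


def click_entropy(impressions, query):
    """Entropy (in bits) of the clicked URLs of a query; low means specific."""
    counts = Counter(imp["url"] for imp in impressions
                     if imp["query"] == query and imp["outcome"] == "clicked")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Toy log with hypothetical users, queries and URLs.
impressions = [
    {"user": "u1", "query": "q1", "url": "a", "rank": 1, "outcome": "clicked"},
    {"user": "u1", "query": "q1", "url": "b", "rank": 2, "outcome": "clicked"},
    {"user": "u2", "query": "q1", "url": "b", "rank": 1, "outcome": "skipped"},
    {"user": "u2", "query": "q1", "url": "a", "rank": 2, "outcome": "clicked"},
    {"user": "u2", "query": "q2", "url": "a", "rank": 1, "outcome": "skipped"},
    {"user": "u2", "query": "q2", "url": "c", "rank": 2, "outcome": "clicked"},
]

all_users = event_features(impressions)
only_u1 = event_features([imp for imp in impressions if imp["user"] == "u1"])
print(all_users["a"], click_entropy(impressions, "q1"))
```

Filtering the impressions before calling `event_features` (by user, by query, or by session) gives the per-user, all-users and same-query variants listed above.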

MODELS

• Random Forest (predicted class probabilities) + ordering that maximizes E(NDCG)

• LambdaMART (gradient boosted trees optimized for NDCG): WINS!
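The first approach can be sketched as follows, under stated assumptions. Given per-URL class probabilities (for instance the output of scikit-learn's `RandomForestClassifier.predict_proba`), each URL gets an expected gain, and URLs are sorted by it. Sorting by expected gain maximizes the expected DCG; treating that as a stand-in for E(NDCG) ignores the normalization term, which is a simplification. The grades 0 to 2 and the probability values are invented for the example, and this is not necessarily the exact procedure the team used.

```python
def expected_gain(probabilities):
    """Expected DCG gain, given P(grade = k) for k = 0, 1, 2, ..."""
    return sum(p * (2 ** grade - 1) for grade, p in enumerate(probabilities))


def rerank(urls, class_probabilities):
    """Order URLs by decreasing expected gain."""
    scored = zip(urls, (expected_gain(p) for p in class_probabilities))
    return [url for url, _ in
            sorted(scored, key=lambda pair: pair[1], reverse=True)]


urls = ["url1", "url2", "url3", "url4"]
# Hypothetical probabilities of grades 0, 1 and 2 for each URL.
probas = [
    [0.7, 0.2, 0.1],  # url1: expected gain 0.5
    [0.5, 0.4, 0.1],  # url2: expected gain 0.7
    [0.2, 0.3, 0.5],  # url3: expected gain 1.8
    [0.9, 0.1, 0.0],  # url4: expected gain 0.1
]

new_order = rerank(urls, probas)
print(new_order)
```

LambdaMART, by contrast, optimizes a ranking objective directly during training rather than re-ordering classifier outputs afterwards.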

Kaggle/Yandex Top 1 then 3rd

QUESTIONS

?

Thank you !
