WHAT DOES IT TAKE TO WIN THE KAGGLE/YANDEX COMPETITION
Feedback on how we won the Kaggle/Yandex competition
Christophe Bourguignat
Kenji Lefèvre-Hasegawa
Paul Masurel @Dataiku
Matthieu Scordia @Dataiku
OUTLINE OF THE TALK
• Review of the Kaggle/Yandex Challenge
• How we worked (team work & tools)
• The winning model
GOAL
Re-rank the URLs returned by Yandex according to the personal preferences of the users
[Figure: a Yandex result list (url1, url2, url3, url4) re-ranked as (url3, url2, url1, url4)]
ML CHALLENGE
Predict the relevance of each URL for the user and re-rank the result set accordingly
The Kaggle/Yandex challenge
GIVEN
• 30 days of logs (train: 27 days, test: 3 days)
• Users' historical queries, clicks & dwell times
• Test sessions' prior activity: queries, clicks & dwell times
SIZE
• 15 GB
[Figure: sequences of queries. User history: Q Q Q Q. Test session: Q Q T ?]
QUALITY METRIC
• One test query per user, in the last 3 days
• NDCG metric: penalizes relevance errors most on the top-ranked URLs
• No A/B test
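To make the metric concrete, here is a minimal sketch of NDCG. The gain `2**rel - 1`, the `log2` discount, the cutoff `k=10` and the relevance grades are common conventions assumed for illustration; the slides do not state the exact variant used by the competition.

```python
import math

def dcg(relevances, k=10):
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))

def ndcg(relevances, k=10):
    """DCG divided by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical grades: one relevant URL (grade 2) among four results.
# Pushing it one rank down costs more near the top than near the bottom.
loss_at_top = ndcg([2, 0, 0, 0]) - ndcg([0, 2, 0, 0])
loss_at_bottom = ndcg([0, 0, 2, 0]) - ndcg([0, 0, 0, 2])
print(loss_at_top, loss_at_bottom)
```

The same one-rank mistake is penalized several times more at rank 1 than at rank 3, which is why the top of the list matters most.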
[Figure: three rankings of url1 to url4, labeled "Prediction", "Another ranking" and "Kaggle"; one is marked OK and one BAD]
TEAM DATAIKU SCIENCE STUDIO / KAGGLE
• Christophe Bourguignat: Engineer, data enthusiast
• Kenji Lefèvre-Hasegawa: Ph.D. in math, new to ML
• Paul Masurel: Software Engineer @Dataiku
• Matthieu Scordia: Data Scientist @Dataiku
First meeting: October 16th, 2013
How we worked (Team work & tools)
WE’VE USED
• Related papers (mainly Microsoft's)
• 12 cores, 64 GB
• Python scikit-learn
• Dataiku Science Studio
• Java RankLib
DATAIKU SCIENCE STUDIO
LEARNING
Team members work independently
[Diagram: original train; split train & validation; labels; features & labels]
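The train/validation split can be sketched as follows. This is an illustration only: holding out the last 3 days of the training period (mirroring the 3-day test period) is an assumption, and the `day` and `session_id` fields are hypothetical names, not the actual log schema.

```python
def split_by_day(sessions, n_validation_days=3):
    """Hold out the most recent days as validation, keep earlier days for training."""
    last_day = max(s["day"] for s in sessions)
    cutoff = last_day - n_validation_days
    train = [s for s in sessions if s["day"] <= cutoff]
    validation = [s for s in sessions if s["day"] > cutoff]
    return train, validation

# Toy data: one hypothetical session per day over the 27 training days.
sessions = [{"session_id": i, "day": day} for i, day in enumerate(range(1, 28))]
train, validation = split_by_day(sessions)
print(len(train), len(validation))
```

Splitting on time rather than at random keeps the validation set in the same position relative to the history as the real test set.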
FEATURES CONSTRUCTION
Team members work independently
Features
DATA DRIVEN COMPUTATION
HOW MUCH WORK?
• 960+ emails
• 360+ features
• 50+ ideas tuned by grid search (300+ models fitted)
• Server heavily loaded during the last 3 weeks
• 56 Kaggle submissions
• 196 teams, 264 players, 3570 submissions
[Figure: leaderboard position over time (1/2 month, then three 1-week periods): Top 25, Top 10, 5th, 1st, 3rd, 1st. Marker at 2014-01-01: future top 2 & 3 enter the race]
PROBLEM ANALYSIS
Query
Result set
• Rank
• URL snippet quality
• URL is skipped, clicked or missed
Reading URL
• URL & domain relevance, measured with dwell time
CLICK
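The three outcomes can be sketched as below. The definitions are an assumption made for illustration (the slide only names them): a URL is "skipped" if it was not clicked but a lower-ranked URL was, and "missed" if it was not clicked and sits below the last click.

```python
def label_urls(ranked_urls, clicked_urls):
    """Label each displayed URL as clicked, skipped or missed."""
    clicked = set(clicked_urls)
    clicked_positions = [i for i, url in enumerate(ranked_urls) if url in clicked]
    last_click = max(clicked_positions, default=-1)  # -1 when nothing was clicked
    labels = {}
    for i, url in enumerate(ranked_urls):
        if url in clicked:
            labels[url] = "clicked"
        elif i < last_click:
            labels[url] = "skipped"   # seen (presumably) but passed over
        else:
            labels[url] = "missed"    # below the last click, or no click at all
    return labels

labels = label_urls(["url1", "url2", "url3", "url4"], ["url3"])
print(labels)
```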
The winning model
FEATURES
Features:
• Rank
• User habits, query specificity (entropy, frequency, …)
• Snippet relevance
• Missed, Skipped, Clicked
• URL & domain relevance
Variants of Missed, Skipped & Clicked:
• Probability, stimuli frequency, Mean Reciprocal Rank (MRR)
• For each user: history, previous activity in the test session, and the aggregate of both
• For all users
• Computed over all queries & over the same query
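As an illustration of these variants, the sketch below derives the three statistics for one outcome from a list of past displays of a URL. The exact definitions are assumptions: stimuli frequency is taken as the number of displays, probability as a smoothed share of displays with the outcome, and MRR as the mean of 1/rank over those displays. The `prior` and `n_outcomes` smoothing parameters are hypothetical.

```python
def outcome_features(impressions, outcome="clicked", prior=1.0, n_outcomes=3):
    """impressions: list of (rank, outcome) pairs, with 1-based ranks."""
    n_shown = len(impressions)
    ranks = [rank for rank, o in impressions if o == outcome]
    # Additive smoothing keeps the estimate sane when there is little history.
    probability = (len(ranks) + prior) / (n_shown + prior * n_outcomes)
    mrr = sum(1.0 / r for r in ranks) / len(ranks) if ranks else 0.0
    return {"stimuli_freq": n_shown, "probability": probability, "mrr": mrr}

# Hypothetical history of one URL for one user.
history = [(1, "clicked"), (2, "skipped"), (4, "clicked"), (3, "missed")]
features = outcome_features(history)
print(features)
```

Running the same function on different slices of the logs (one user, all users, same query, all queries) yields the families of features listed above.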
MODELS
• Random Forest (predicted class probabilities) + maximize E(NDCG)
• LambdaMART (gradient boosted trees optimized for NDCG) WINS!
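One way to turn class probabilities into a ranking is to sort URLs by expected gain, sketched below. This is not necessarily the team's exact procedure: the grades `(0, 1, 2)` and the gain `2**grade - 1` are assumptions, and sorting by expected gain maximizes expected DCG, which only approximates E(NDCG) because the normalizer depends on the unknown labels. The probabilities are made up; in practice they would come from a classifier such as a random forest's `predict_proba`.

```python
def rerank_by_expected_gain(urls, class_probas, grades=(0, 1, 2)):
    """Sort URLs by the expected gain of their predicted relevance grade."""
    gains = [2 ** g - 1 for g in grades]
    expected = [sum(p * g for p, g in zip(row, gains)) for row in class_probas]
    # Stable sort, highest expected gain first.
    order = sorted(range(len(urls)), key=lambda i: -expected[i])
    return [urls[i] for i in order]

urls = ["url1", "url2", "url3", "url4"]
# One row per URL: hypothetical P(grade 0), P(grade 1), P(grade 2).
probas = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.3, 0.5],
    [0.9, 0.1, 0.0],
]
reranked = rerank_by_expected_gain(urls, probas)
print(reranked)
```

LambdaMART avoids this two-step approach by optimizing a ranking objective tied to NDCG directly during boosting.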
Kaggle/Yandex Top 1 then 3rd
QUESTIONS?
Thank you!