Page 1
Machine Learning the Product
Boxun Zhang, Spotify
Page 2
about me
Data Scientist at Spotify
• Model and understand user retention
• General measurement and behavior analysis
Ph.D. in CS from TU Delft
• Studied user behaviors in BitTorrent
Page 3
A/B testing is a popular method for learning your product, but traditional A/B testing techniques are
insufficient for learning from your A/B test.
Page 4
In this presentation, I will introduce a different method used at Spotify for analyzing A/B tests.
Page 5
learning your product is crucial
• measure performance
• identify problems
• discover opportunities
Page 9
everyone knows A/B testing
more or less
Page 12
a hypothesis
a change
control vs. test
Page 15
measure effect
test vs. control
Page 17
is this good enough?
Page 19
with traditional A/B testing techniques, we can only learn from an A/B test in a rather superficial way
we can measure the size of the effect, but often don’t know the cause of the effect
Page 21
product can be complex
Page 22
Credit: Image by Eric Long, National Air and Space Museum, Smithsonian Institution
Page 23
user behavior can be complex, too
Page 25
sometimes, we introduce big changes
Page 27
product change → behavior changes (change, change, change, …) → effect
Page 29
we need to ask a different question
Page 30
which behavior change has the biggest impact on the effect?
Page 31
impact ≈ predictive power
Page 32
our brain is great, but not at this
Page 35
A/B test → machine learning problem
Page 36
1. determine type of problem
Page 37
classification: retention, conversion, click-through rate
regression: lifetime value, session length
Page 40
2. prepare user-level measurement
Page 41
group level vs. user level
Page 42
product: change
behavior: change_1, change_2, change_3, …, change_n
measurement: m_1, m_2, m_3, …, m_p
target: effect
p > n
Page 46
measurements → machine learning features
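The step above, turning raw A/B test logs into one row of measurements (m_1 … m_p) per user, can be sketched as follows. This is a minimal illustration, not Spotify's pipeline: the log schema, column names, and aggregations are all assumptions.

```python
import pandas as pd

# Hypothetical per-event logs from an A/B test; all column names are illustrative.
events = pd.DataFrame({
    "user_id":        [1, 1, 2, 2, 3],
    "group":          ["test", "test", "control", "control", "test"],
    "session_length": [12.0, 30.0, 5.0, 8.0, 20.0],
    "skipped":        [0, 1, 1, 0, 0],
})

# Aggregate event-level logs into one row of measurements per user:
# each resulting column is a candidate machine-learning feature (m_1 .. m_p).
features = events.groupby(["user_id", "group"]).agg(
    total_session_length=("session_length", "sum"),
    skip_rate=("skipped", "mean"),
).reset_index()

print(features)
```

The target column (retention, conversion, etc.) would be joined onto this table the same way, one value per user.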
Page 47
actionable features vs. non-actionable features
Page 48
actionable features: session length, skip rate, offline streams
non-actionable features: country, age, gender
Page 50
actionable features are used to build models
non-actionable features are used to segment users
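A minimal sketch of that split, using the example features from the slides; the data values and the per-country segmentation loop are assumptions for illustration.

```python
import pandas as pd

# Illustrative user-level table; values are made up.
users = pd.DataFrame({
    "session_length":  [12.0, 5.0, 20.0],
    "skip_rate":       [0.5, 0.5, 0.0],
    "offline_streams": [3, 0, 7],
    "country":         ["SE", "US", "SE"],
    "age":             [25, 40, 31],
})

# Actionable features (behaviors the product can influence) feed the model;
# non-actionable features (fixed user attributes) define user segments.
actionable = ["session_length", "skip_rate", "offline_streams"]
non_actionable = ["country", "age"]

X = users[actionable]  # model inputs
for country, segment in users.groupby("country"):
    print(country, len(segment))  # e.g. fit one model per segment
```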
Page 52
this is essentially feature engineering, and we need rich, high-quality features
Page 53
“…some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.”
— Pedro Domingos
A few useful things to know about machine learning. Pedro Domingos.
Page 54
3. model selection
Page 55
gradient boosting machine [1]
XGBoost [2]: a better variant of GBM
[1] Greedy function approximation: A gradient boosting machine. Jerome H. Friedman
[2] XGBoost: A Scalable Tree Boosting System. Tianqi Chen, Carlos Guestrin
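Training such a model on the user-level table is a few lines. As a sketch, this uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost, on synthetic data where the target is driven by two of the features; the data, target, and parameters are all assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic user-level data: 500 users, 10 measurements each.
X = rng.normal(size=(500, 10))
# Hypothetical binary target (e.g. retained or not),
# driven mostly by measurements 0 and 3 plus noise.
y = (X[:, 0] + 2 * X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# GBM as in reference [1]; swap in xgboost.XGBClassifier for the XGBoost of [2].
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```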
Page 56
this is not a Kaggle competition
Page 57
prediction is not the goal
Page 58
feature/variable importance is
Page 59
feature importance is measured as the improvement in accuracy brought by a feature to the branches it is on, then summarized over all the trees
https://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html
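Reading the importance scores off a trained model is straightforward. This sketch again uses scikit-learn's gain-based `feature_importances_` as a stand-in for XGBoost's importance report, on synthetic data where one feature drives the effect by construction; everything here is an illustrative assumption.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
# Feature 2 drives the synthetic "effect"; the other features are noise.
y = (2 * X[:, 2] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# feature_importances_ sums each feature's accuracy gain over all trees,
# analogous to the gain-based importance described for XGBoost above.
ranked = np.argsort(model.feature_importances_)[::-1]
print("most important feature:", ranked[0])
```

Here the top-ranked feature recovers the known driver, which is exactly the property the method relies on when the drivers are unknown.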
Page 60
once a model is trained and validated, the features of highest importance are the ones with the biggest impact on the target/effect
Page 61
examine top features
compare between test and control
Page 63
derive informative, high-level metrics
Page 65
this can be an iterative process, as we can build new models to understand which features have a big impact on the top features
Page 66
why not linear models?
Page 67
linear models have relatively weak predictive power, and are not robust
Page 68
why not Random Forests?
Page 69
feature importance is inaccurate (biased) for correlated features
Bias in random forest variable importance measures: Illustrations, sources and a solution. Carolin Strobl, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn
Page 70
applications at Spotify
activation, retention, churn