Introduction to Uplift Modelling: An online gaming application
A few words about me
• Senior Data Scientist at Dataiku (worked on churn prediction, fraud detection, bot detection, recommender systems, graph analytics, smart cities, … )
• Occasional Kaggle competitor
• Mostly code in Python and SQL
• Twitter @prrgutierrez
Plan • Introduction / Client situation
• Uplift use case examples
• Uplift modelling
• Uplift evaluation & results
Client situation
• French online gaming company (RPG)
• A lot of users are leaving
• Let's build a churn prediction model!
• Target: the user does not come back within 14 or 28 days (14 days of absence -> 80% chance of not coming back; 28 days of absence -> 90% chance of not coming back)
• Features:
  • Connection features:
    • Time played in the last 1, 7, 15, 30, … days
    • Time since last connection
    • Connection frequency
    • Days of the week / hours of the day played
  • Equivalent features for payments and subscriptions
  • Age, sex, country
  • Number of accounts, is a bot, …
  • No in-game features (no data)
Client situation
• Model results:
  • AUC 0.88
  • Very stable model
• Marketing actions:
  • 7 different actions based on customer segmentation (offers, promotions, …)
  • A/B test -> -5% churn for people contacted by email
• Going further:
  • Feature engineering: guilds, close network, in-game actions, …
  • Study long-term churn …
Client situation
• But wait!
• Strong hypothesis: target the people who are most likely to churn
• What is the gain per person for an action?
  • $c$: cost of the action
  • $v_i$: value of customer $i$
  • $X$: independent variables
  • $T$: "treated" population; $C$: "control" population
  • $Y = 1$ if the customer churns, $Y = 0$ otherwise
• Value with action: $E_T(V_i) = v_i (1 - P_T(Y = 1 | X)) - c$
• Value without action: $E_C(V_i) = v_i (1 - P_C(Y = 1 | X))$
• Gain (if $v_i$ is independent of the treatment): $E(G_i) = v_i (P_C(Y = 1 | X) - P_T(Y = 1 | X)) - c$
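To make the gain formula concrete, here is a minimal sketch that evaluates $E(G_i) = v_i (P_C - P_T) - c$ for three customers. The customer names, values, probabilities and cost are all hypothetical numbers chosen for illustration; they do not come from the client data.

```python
def expected_gain(value, cost, p_churn_control, p_churn_treated):
    """Expected gain E(G_i) = v_i * (P_C(Y=1|X) - P_T(Y=1|X)) - c for one customer."""
    return value * (p_churn_control - p_churn_treated) - cost


# Hypothetical customers: (value v_i, P_C(Y=1|X), P_T(Y=1|X))
customers = {
    "persuadable": (100.0, 0.60, 0.30),      # the action halves the churn risk
    "already_lost": (100.0, 0.95, 0.95),     # the action changes nothing
    "annoyed_by_spam": (100.0, 0.20, 0.35),  # the action increases the churn risk
}
COST = 5.0  # hypothetical cost c of one action

gains = {
    name: expected_gain(v, COST, p_c, p_t)
    for name, (v, p_c, p_t) in customers.items()
}
```

With these made-up numbers only the first customer has a positive expected gain, even though the second one is by far the most likely to churn.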
• Objective: maximize this gain, $E(G_i) = v_i (P_C(Y = 1 | X) - P_T(Y = 1 | X)) - c$
• Targeting highly probable churners -> minimize $P_T(Y = 1 | X)$, but not the difference!
• Intuitive examples:
  • $P_C(Y = 1) < P_T(Y = 1)$: the action is expected to make the situation worse. Spam?
  • $P_C(Y = 1) \approx P_T(Y = 1)$: the user does not care, or is already lost
• Uplift = model $\Delta P$, the difference between the treated and control probabilities
Uplift
• Model the effect of the action
• 4 groups of customers / patients:
  • 1. Responded because of the action (the people we want)
  • 2. Responded, but would have responded anyway (unnecessary costs)
  • 3. Did not respond and the action had no impact (unnecessary costs)
  • 4. Did not respond because the action had a negative impact (negative impact)
• Incomplete knowledge
Uplift Examples
• Healthcare:
  • A typical medical trial:
    • Treatment group: gets the treatment
    • Control group: gets a placebo (or another treatment)
    • Do a statistical test to show that the treatment is better than the placebo
  • With uplift modeling, we can find out for whom the treatment works best
  • Personalized medicine
  • Ex: what is the gain in survival probability? -> classification/uplift problem
Uplift Examples
• Churn:
  • E-gaming
  • Other ex: Coyote
• Retail:
  • Compare coupon campaigns
Uplift Examples
• Mailing: Hillstrom challenge
  • 2 campaigns:
    • One men's email
    • One women's email
  • Question: which people should be targeted, i.e. which ones have the best response rate?
Uplift Examples
• Common pattern:
  • Experiment or A/B testing -> test and control groups
  • Warning: the control group can easily be biased, for example:
    • The most probable churners are targeted and the control group is the rest
    • Only the people who come to a shop are called
  • Limited experiment trial -> no bandit algorithm (once a medical experiment is done, you don't continue the "exploration") -> feedback is relatively large and discrete in time.
Uplift modelling • Three main methods :
• Two models approach
• Class variable modification
• Modification of existing machine learning models
Uplift modelling: Two model approach
• Build a model on the treatment group to get $P_T(Y | X)$
• Build a model on the control group to get $P_C(Y | X)$
• Set: $\Delta P = P_T(Y | X) - P_C(Y | X)$
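The two model approach can be sketched in a few lines. This is only an illustration under strong assumptions: the "model" is a per-segment response rate over a single categorical feature, standing in for whatever classifier would actually be trained, and the toy data and the name `fit_rate_model` are hypothetical.

```python
from collections import defaultdict


def fit_rate_model(rows):
    """Stand-in 'model': estimate P(Y=1 | X) as the response rate per value of X."""
    counts = defaultdict(lambda: [0, 0])  # x -> [number of positives, number of rows]
    for x, y in rows:
        counts[x][0] += y
        counts[x][1] += 1
    return {x: pos / n for x, (pos, n) in counts.items()}


# Hypothetical toy data: (segment, response) pairs for each group
treatment = [("new", 1), ("new", 1), ("new", 0),
             ("old", 0), ("old", 1), ("old", 0), ("old", 0)]
control = [("new", 0), ("new", 1), ("new", 0), ("new", 0),
           ("old", 1), ("old", 0), ("old", 1), ("old", 0)]

p_t = fit_rate_model(treatment)  # estimate of P_T(Y=1 | X)
p_c = fit_rate_model(control)    # estimate of P_C(Y=1 | X)

# Delta P = P_T(Y|X) - P_C(Y|X), for segments seen in both groups
uplift = {x: p_t[x] - p_c[x] for x in p_t if x in p_c}
```

Any classifier that outputs probabilities could replace `fit_rate_model`; the only requirement is that the two models are fitted separately on the treatment and control data and scored on the same customers.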
Uplift modelling: Two model approach
• Advantages:
  • Standard ML models can be used
  • In theory, two good estimators -> a good uplift model
  • Works well in practice
  • Generalizes easily to regression and multi-treatment
• Drawbacks:
  • The difference of two estimators is probably not the best estimator of the difference
  • The two classifiers can ignore the weaker uplift signal (since it's not their target)
  • An algorithm focusing on estimating the difference should perform better
Uplift modelling: Class variable modification
• Introduced in Jaskowski, Jaroszewicz 2012
• Allows any classifier to be adapted to uplift modeling
• Let $G \in \{T, C\}$ denote the group membership (Treatment or Control)
• Let's define the new target variable:
  $$Z = \begin{cases} 1 & \text{if } G = T \text{ and } Y = 1 \\ 1 & \text{if } G = C \text{ and } Y = 0 \\ 0 & \text{otherwise} \end{cases}$$
• This corresponds to flipping the target in the control dataset.
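The definition of $Z$ translates directly into code. The sketch below assumes the group is encoded as the strings "T" and "C" and the outcome as 0/1; that encoding and the function name are illustrative choices, not something fixed by the slides.

```python
def transform_target(group, y):
    """Return Z: keep Y for treated rows ("T"), flip it for control rows ("C")."""
    if group not in ("T", "C"):
        raise ValueError(f"unknown group: {group!r}")
    return y if group == "T" else 1 - y


# All four (G, Y) combinations
rows = [("T", 1), ("T", 0), ("C", 1), ("C", 0)]
z = [transform_target(g, y) for g, y in rows]
```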
Uplift modelling: Class variable modification
• Why does it work?
  $P(Z = 1 | X) = P_T(Y = 1 | X) P(G = T | X) + P_C(Y = 0 | X) P(G = C | X)$
• By design (A/B test warning!), $G$ should be independent from $X$, so:
  $P(Z = 1 | X) = P_T(Y = 1 | X) P(G = T) + P_C(Y = 0 | X) P(G = C)$
• Possibly with a reweighting of the datasets, we should have $P(G = T) = P(G = C) = 1/2$, thus:
  $2 P(Z = 1 | X) = P_T(Y = 1 | X) + P_C(Y = 0 | X)$
Uplift modelling: Class variable modification
• Why does it work?
  Thus $2 P(Z = 1 | X) = P_T(Y = 1 | X) + P_C(Y = 0 | X) = P_T(Y = 1 | X) + 1 - P_C(Y = 1 | X)$
  so $\Delta P = 2 P(Z = 1 | X) - 1$
  And sorting by $P(Z = 1 | X)$ is the same as sorting by $\Delta P$.
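The identity can be checked numerically. In this sketch the probability pairs are hypothetical, and `p_z` simply evaluates the formula for $P(Z = 1 | X)$ under the stated assumptions ($G$ independent of $X$, $P(G = T) = 1/2$).

```python
def p_z(p_t, p_c, p_treated=0.5):
    """P(Z=1|X) = P_T(Y=1|X) P(G=T) + P_C(Y=0|X) P(G=C), with G independent of X."""
    return p_t * p_treated + (1.0 - p_c) * (1.0 - p_treated)


# Hypothetical (P_T(Y=1|X), P_C(Y=1|X)) pairs for three customers
customers = [(0.30, 0.10), (0.50, 0.50), (0.20, 0.40)]

delta_p = [p_t - p_c for p_t, p_c in customers]            # direct definition
from_z = [2 * p_z(p_t, p_c) - 1 for p_t, p_c in customers]  # via the Z target

# Ranking customers by P(Z=1|X) or by Delta P gives the same order
indices = range(len(customers))
same_ranking = (
    sorted(indices, key=lambda i: delta_p[i])
    == sorted(indices, key=lambda i: p_z(*customers[i]))
)
```

If `p_treated` is not 0.5 the relation $\Delta P = 2 P(Z = 1 | X) - 1$ no longer holds as written, which is why the slides mention reweighting the datasets.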
Uplift modelling: Class variable modification
• Summary:
  • Flip the class for the control dataset
  • Concatenate the treatment and control datasets
  • Build a classifier
  • Target the users with the highest probability
• Advantages:
  • Any classifier can be used
  • Directly predicts uplift (and not each class separately)
  • Single model on a larger dataset (instead of two small ones)
• Drawbacks:
  • Complex decision surface -> the model can perform poorly
  • Interpretation: what is AUC in this case?
Uplift modeling: Other methods
• Based on decision trees:
  • Rzepakowski Jaroszewicz 2012: new decision tree split criterion based on information theory
  • Soltys Rzepakowski Jaroszewicz 2013: ensemble methods for uplift modeling
(out of today's scope)
Evaluation
• We used:
  • 2 model approach -> AUC? Not very informative.
  • 1 model approach -> does AUC mean something?
  • How can we evaluate / compare them?
• Cross validation:
  • 4 datasets: treatment/control x train/test
• Problem:
  • We don't have a clear 0/1 target.
  • We would need to know, for each customer, both the response to treatment and the response to control -> not possible
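The four datasets (treatment/control x train/test) could be produced along these lines. This is a sketch only: the row layout (a dict with a "group" key), the 30% test fraction and the function name `split_by_group` are assumptions made for the example.

```python
import random


def split_by_group(rows, test_fraction=0.3, seed=0):
    """Split rows into train/test separately inside each group ("T" or "C")."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {}
    for group in ("T", "C"):
        members = [r for r in rows if r["group"] == group]
        rng.shuffle(members)
        n_test = int(round(len(members) * test_fraction))
        splits[(group, "test")] = members[:n_test]
        splits[(group, "train")] = members[n_test:]
    return splits


# Hypothetical customers: even ids treated, odd ids in control
rows = [{"id": i, "group": "T" if i % 2 == 0 else "C"} for i in range(20)]
splits = split_by_group(rows)
```

Splitting inside each group keeps the treatment/control proportions identical in train and test, which matters for the uplift metrics computed on the test side.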
Evaluation
• Gain for a group of customers:
  • Gain for the 10% highest-scoring customers = % of successes among the top 10% treated customers − % of successes among the top 10% control customers
• Uplift curve?
  • Difference between two lift curves
  • Interpretation: net gain in success rate if a given percentage of the population is treated
  • Problem 1: no theoretical maximum
  • Problem 2: weird behaviour for two wizard models
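The gain for the top-scoring customers can be computed as below. This is a minimal sketch: the (score, outcome) layout, the toy data and the name `top_fraction_uplift` are hypothetical, and the top fraction is taken separately inside the treated and control groups, as in the definition above.

```python
def top_fraction_uplift(treated, control, fraction=0.1):
    """Success rate of the top-scored treated rows minus that of the top-scored controls.

    `treated` and `control` are lists of (score, outcome) pairs, outcome in {0, 1}.
    """
    def top_rate(rows):
        ranked = sorted(rows, key=lambda r: r[0], reverse=True)
        k = max(1, int(len(ranked) * fraction))  # at least one row
        return sum(y for _, y in ranked[:k]) / k

    return top_rate(treated) - top_rate(control)


# Hypothetical scored test data
treated = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0),
           (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 0), (0.05, 0)]
control = [(0.9, 0), (0.8, 1), (0.7, 0), (0.6, 0), (0.5, 1),
           (0.4, 0), (0.3, 0), (0.2, 0), (0.1, 0), (0.05, 0)]

gain_top_20 = top_fraction_uplift(treated, control, fraction=0.2)
```

Evaluating this function for a range of fractions gives the points of the uplift curve.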
Evaluation: Qini
• Qini measure:
  • Similar to Gini (area under the lift curve). Lift curve <-> Qini curve
  • Parametric curve defined by: $f(t) = Y_T(t) - Y_C(t) \, N_T(t) / N_C(t)$
  • When taking the first $t$ observations:
    • $Y_T(t)$ is the total number of 1s seen in target (treated) observations
    • $Y_C(t)$ is the total number of 1s seen in control observations
    • $N_T(t)$ is the total number of target (treated) observations
    • $N_C(t)$ is the total number of control observations
• Balanced setting: $f(t) = Y_T(t) - Y_C(t)$
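A Qini curve can be computed with one pass over the ranked observations. The sketch below uses the usual scaling of the control count by $N_T(t) / N_C(t)$; the row layout, the toy data and the convention used while no control observation has been seen yet ($f(t) = Y_T(t)$) are assumptions of this example.

```python
def qini_curve(rows):
    """Return f(t) = Y_T(t) - Y_C(t) * N_T(t) / N_C(t) for t = 1..len(rows).

    `rows` is a list of (score, group, outcome) with group "T" or "C".
    While no control row has been seen yet, the control term is taken as 0.
    """
    ranked = sorted(rows, key=lambda r: r[0], reverse=True)
    y_t = y_c = n_t = n_c = 0
    curve = []
    for _, group, outcome in ranked:
        if group == "T":
            n_t += 1
            y_t += outcome
        else:
            n_c += 1
            y_c += outcome
        scaled_control = y_c * n_t / n_c if n_c else 0.0
        curve.append(y_t - scaled_control)
    return curve


# Hypothetical scored test set mixing treated and control customers
rows = [(0.9, "T", 1), (0.8, "C", 0), (0.7, "T", 1),
        (0.6, "C", 1), (0.5, "T", 0), (0.4, "C", 0)]
curve = qini_curve(rows)
```

In the balanced setting, where treated and control counts grow at the same rate, the scaling factor is close to 1 and the curve reduces to $Y_T(t) - Y_C(t)$.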
Evaluation: Qini
• Personal intuition:
  • We can't know everything:
    • Treated who convert, not treated who don't convert: what would have happened otherwise?
  • But we don't want to see (in our top list):
    • Treated not converting
    • Not treated converting
  • In the top $t$ we want to minimize: $N_T(t) - Y_T(t) + Y_C(t)$
  • Very similar to lift, taking into account only negative examples.
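The quantity to minimize counts the two unwanted cases in the top $t$: treated customers who do not convert ($N_T(t) - Y_T(t)$) and control customers who convert ($Y_C(t)$). A small sketch, with a hypothetical row layout and toy data:

```python
def unwanted_count(rows, t):
    """N_T(t) - Y_T(t) + Y_C(t) among the t highest-scored rows.

    `rows` is a list of (score, group, outcome) with group "T" or "C".
    """
    top = sorted(rows, key=lambda r: r[0], reverse=True)[:t]
    treated_not_converting = sum(1 for _, g, y in top if g == "T" and y == 0)
    control_converting = sum(1 for _, g, y in top if g == "C" and y == 1)
    return treated_not_converting + control_converting


# Hypothetical scored test set
rows = [(0.9, "T", 1), (0.8, "C", 0), (0.7, "T", 1),
        (0.6, "C", 1), (0.5, "T", 0), (0.4, "C", 0)]
unwanted_top_4 = unwanted_count(rows, 4)
unwanted_all = unwanted_count(rows, 6)
```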
Evaluation: Qini
• Best model:
  • Take first all positives in target, and last all positives in control.
  • No theoretical best model:
    • Depends on the possibility of a negative effect
    • Displayed for no negative effect
• Random model:
  • Corresponds to the global effect of the treatment
• Hillstrom dataset:
  • For women, the models are comparable and useful
  • For men, there are no clear individuals to target
Evaluation: Qini
• Back to our study:
  • Class modification performs best
  • Two models approach performs poorly
• A/B test failure:
  • Control dataset is way too small!
  • Class modification model is very close to lift
  • Two models approach is slightly better than random
  -> need to redo the A/B test.
Conclusion
• Uplift:
  • Surprisingly little literature and few examples
  • The theory is rather easy to test:
    • Two models
    • Class modification
  • The intuition and evaluation are not easy to grasp
• On the client side:
  • I haven't lost hope that we'll do the A/B test again
  • A good lead for selecting the best offer for a customer
A few references
• Data:
  • Churn in gaming: WOWAH dataset (blog post to come)
  • Uplift for healthcare: Colon dataset
  • Uplift in mailing: Hillstrom data challenge
  • Uplift in general: simulated data (blog post to come)
• Application:
  • Uplift modeling for clinical trial data (Jaskowski, Jaroszewicz)
  • Uplift Modeling in Direct Marketing (Rzepakowski, Jaroszewicz)
• Modeling techniques:
  • Rzepakowski Jaroszewicz 2011 (decision trees)
  • Soltys Rzepakowski Jaroszewicz 2013 (ensemble for uplift)
  • Jaskowski Jaroszewicz 2012 (class modification model)
• Evaluation:
  • Using Control Groups to Target on Predicted Lift (Radcliffe)
  • Testing a New Metric for Uplift Models (Mesalles Naranjo)
Thank you for your attention !