From practice to theory in learning from massive data, by Charles Elkan at BigMine16
From practice to theory in learning from massive data
Charles Elkan
Amazon Fellow
August 14, 2016
Important
Information here is already public.
Opinions are mine, not Amazon’s.
Outline
Only 30 minutes!
1. Detecting anomalies in streaming data
2. Making Spark usable for real-time predictions
3. Amazon’s most important algorithm for recommendations
4. Uplift: We want causation, not merely correlation
From practice to theory
From theory to practice
Now for everyone!
From practice to practice
Academic versus applied
In theory, researchers favor simplicity. In practice, they don’t.
In industry, simplicity genuinely wins.
Example: Desiderata for recommender systems:
1. Respect the privacy of users; don’t be creepy.
2. Make recommendations understandable.
3. Make them responsive to the user’s most recent interests.
4. Generate them with millisecond latency.
Amazon’s most important recommender system
1. Respect the privacy of users; don’t be creepy.
2. Make recommendations understandable.
3. Make them responsive to the user’s most recent interests.
4. Generate them with millisecond latency.
What data scientists do every day
Let x be a user and let R = 0 or 1 be a response. For example, R=1 means the user buys shoes in the next month.
Routinely, we train models to predict the probability p(R=1|x).
We send messages and coupons to users with high p(R=1|x).
Is p(R=1|x) actually useful?
In principle, no. "Our goal is not to predict the future; it is to change the future."
• Merely predicting user behavior is of limited interest.
We want to select treatments that influence users.
• T = t means we choose treatment t.
• For each available t, compute p(R=1|x,T=t).
• Choose the t that gives highest probability.
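The selection rule above can be sketched in a few lines, assuming a separate response model has already been trained for each treatment. The `models` dictionary, the treatment names, and the toy probability functions are hypothetical illustrations, not anything described in the talk.

```python
from typing import Callable, Dict

User = Dict[str, float]
Model = Callable[[User], float]  # returns an estimate of p(R=1 | x, T=t)

def best_treatment(x: User, models: Dict[str, Model]) -> str:
    """Return the treatment t with the highest estimated p(R=1 | x, T=t)."""
    return max(models, key=lambda t: models[t](x))

# Toy stand-ins for trained models (hypothetical numbers).
models = {
    "no_email": lambda x: 0.10,
    "coupon": lambda x: 0.10 + 0.05 * x["recent_visits"],
    "newsletter": lambda x: 0.18,
}

casual = {"recent_visits": 1.0}
frequent = {"recent_visits": 3.0}

print(best_treatment(casual, models))
print(best_treatment(frequent, models))
```

In this toy setup the best treatment differs between the two users, which is the situation where per-user selection pays off.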
The risk of ignoring uplift
Users are ranked by p(R=1|x), shown by the brown line. The blue dashed line shows p(R=1|x,T=t).
The treatment t has a negative effect for users in the top 5%: p(R=1|x,T=t) < p(R=1|x).
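To make the risk concrete, here is a small sketch that ranks the same users two ways: by p(R=1|x) and by the difference p(R=1|x,T=t) - p(R=1|x). The three users and their probabilities are invented for illustration and are not read off the figure.

```python
def uplift(p_treated: float, p_baseline: float) -> float:
    """Change in response probability attributed to the treatment."""
    return p_treated - p_baseline

# (user id, p(R=1|x), p(R=1|x,T=t)); hypothetical values.
users = [
    ("u1", 0.90, 0.80),
    ("u2", 0.50, 0.65),
    ("u3", 0.20, 0.22),
]

# Ranking by predicted response puts u1 first.
by_response = sorted(users, key=lambda u: u[1], reverse=True)
# Ranking by uplift puts the user the treatment helps most first.
by_uplift = sorted(users, key=lambda u: uplift(u[2], u[1]), reverse=True)
# Users for whom the treatment lowers the response probability.
harmed = [uid for uid, p, p_t in users if uplift(p_t, p) < 0]

print([u[0] for u in by_response], [u[0] for u in by_uplift], harmed)
```

Here the user with the highest p(R=1|x) is also the one the treatment harms, so targeting by p(R=1|x) alone would send the message to exactly the wrong person.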
Politicians know this …
If you are a Republican, don’t target confirmed Democrat voters! Instead:
• Send persuasive messages to undecided voters.
• Send “get out the vote” messages to confirmed supporters.
• Send “please donate” messages to these people also.
A common scenario for uplift
Many treatments are almost free to apply, such as sending email.
The uplift question is then which treatment is most effective.
For each user x, we want to know which t has highest value p(R=1|x,T=t).
Keep in mind: The same treatment may be the best for all x.
A public dataset
Published by Kevin Hillstrom, former VP of database marketing at Nordstrom.
Studied in several published papers on uplift, notably by Nicholas Radcliffe, professor at the University of Edinburgh.
• 64,000 past customers of an e-commerce site selling clothing.
• Randomized to no email, men’s email, or women’s email.
• Three outcomes: binary visit, binary purchase, and numerical spend.
Looking at the data
Treatments have a larger effect on “visit” than on “purchase given visit” or on “spend given purchase.”
We'll analyze uplift (i.e., the causal influence of treatments) for visits.
Table from Hillstrom’s MineThatData email analytics challenge by Radcliffe.
The linear probability model
Assume the linear function p(R=1|x) = b_0 + ∑_i b_i x_i.
• Find coefficients b_i to minimize square loss.
Square loss is proper, so predicted probabilities are calibrated.
Avoid overfitting and predictions <0 or >1 by not having too many predictors.
Commonly used in econometrics, not in ML. In practice, often quite similar to logistic regression.
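As a minimal sketch of fitting a linear probability model, the code below solves the least-squares normal equations in plain Python. The function names and the toy data are assumptions made for illustration; in practice one would likely call a library least-squares routine instead.

```python
def fit_linear_probability_model(X, r):
    """Least-squares fit of r ~ b0 + sum_i b_i * x_i; returns [b0, b1, ...]."""
    rows = [[1.0] + [float(v) for v in x] for x in X]  # prepend intercept column
    d = len(rows[0])
    # Augmented normal equations [A^T A | A^T r].
    M = [
        [sum(row[i] * row[j] for row in rows) for j in range(d)]
        + [sum(row[i] * ri for row, ri in zip(rows, r))]
        for i in range(d)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda k: abs(M[k][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; drop one")
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [v / scale for v in M[col]]
        for k in range(d):
            if k != col:
                factor = M[k][col]
                M[k] = [vk - factor * vc for vk, vc in zip(M[k], M[col])]
    return [M[i][d] for i in range(d)]

def predict(coefs, x):
    """Predicted probability b0 + sum_i b_i * x_i (not clipped to [0, 1])."""
    return coefs[0] + sum(b * v for b, v in zip(coefs[1:], x))

# Toy data (hypothetical): one binary predictor, "received an email".
# 1 of 10 untreated users responds; 3 of 10 treated users respond.
X = [[0]] * 10 + [[1]] * 10
r = [1] + [0] * 9 + [1] * 3 + [0] * 7
coefs = fit_linear_probability_model(X, r)
print(coefs)
```

With a single binary predictor the fitted intercept is the response rate of the untreated group and the coefficient is the difference between the two group rates, which is why the coefficients of this model can be read directly as percentage-point effects.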
probability of visit = 7.5% + …
  + 6.5% IF (men’s past AND men’s email)
  + 6.6% IF (women’s past AND men’s email)
  + 6.1% IF (women’s past AND women’s email)
Including treatment indicators M and W
The men’s email is effective for customers who have previously purchased men’s or women’s clothing.
The women’s email is not effective for customers who have previously purchased only men’s clothing.
Optimal treatment policy:
• If only men’s previous purchases: send men’s email.
• If only women’s purchases: send either email.
• If both: send men’s email.
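One way to see where this policy could come from is to add up the treatment-interaction terms quoted for the linear probability model (+6.5%, +6.6%, +6.1%). The sketch below rests on two assumptions that the slides do not confirm: the terms hidden behind the "…" do not involve either email, and the combination of men’s past purchases with the women’s email has no term, so its effect is taken as zero. The dictionary keys and function name are hypothetical.

```python
# Interaction terms quoted in the talk, in percentage points.
# Assumption: no other term in the model involves an email treatment.
EFFECT = {
    ("mens_past", "mens_email"): 6.5,
    ("womens_past", "mens_email"): 6.6,
    ("womens_past", "womens_email"): 6.1,
}

def email_effect(history, email):
    """Sum the interaction terms that apply to a customer with this history."""
    return sum(EFFECT.get((h, email), 0.0) for h in history)

segments = {
    "only_mens": ["mens_past"],
    "only_womens": ["womens_past"],
    "both": ["mens_past", "womens_past"],
}

effects = {
    name: {e: email_effect(history, e) for e in ("mens_email", "womens_email")}
    for name, history in segments.items()
}
for name, by_email in effects.items():
    print(name, by_email)
```

Under these assumptions the men’s email is ahead by roughly 7 points for customers with both histories and by 6.5 points for men’s-only customers, while for women’s-only customers the two emails differ by only about half a point, which is consistent with "send either email".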
Hypothesis: Women tend to buy clothing for their families, but men tend to buy clothing only for themselves.
Validation
How can we confirm that we have found an optimal policy?
Approach:
1. Train models of response for each treatment.
2. For each user x in a test set, plot both predicted probabilities.
3. Three separate test sets: users who previously purchased only women’s clothing, only men’s, or both.
4. The latter two sets should show p(R=1|x, T=M) > p(R=1|x, T=W) for most x.
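The validation steps above can be sketched as follows. Two substitutions are made to keep the example small and dependency-free: a per-feature response-rate table stands in for the random forests used in the talk, and instead of plotting, each test set is summarized by the share of users with p(R=1|x, T=M) > p(R=1|x, T=W). All data here is invented for illustration and is not the Hillstrom dataset; the feature layout `(bought_mens, bought_womens, is_new)` is also an assumption.

```python
from collections import defaultdict

def fit_rate_model(rows):
    """Stand-in classifier: mean response for each distinct feature tuple."""
    totals = defaultdict(lambda: [0, 0])  # features -> [count, responders]
    for x, r in rows:
        totals[x][0] += 1
        totals[x][1] += r
    overall = sum(r for _, r in rows) / len(rows)

    def predict(x):
        n, s = totals.get(x, (0, 0))
        return s / n if n else overall  # fall back to the overall rate
    return predict

def make_rows(x, n, responders):
    """n training rows with features x, of which `responders` have R=1."""
    return [(x, 1)] * responders + [(x, 0)] * (n - responders)

def segment(x):
    mens, womens, _ = x
    return "both" if mens and womens else "only_mens" if mens else "only_womens"

def share_preferring_m(test_users, p_m, p_w):
    """Per segment, the share of test users with p_m(x) > p_w(x)."""
    wins, counts = defaultdict(int), defaultdict(int)
    for x in test_users:
        counts[segment(x)] += 1
        wins[segment(x)] += p_m(x) > p_w(x)
    return {s: wins[s] / counts[s] for s in counts}

# Hypothetical responder counts per 100 users, keyed by feature tuple.
m_counts = {(1, 0, 0): 20, (1, 0, 1): 15, (0, 1, 0): 18,
            (0, 1, 1): 12, (1, 1, 0): 25, (1, 1, 1): 20}
w_counts = {(1, 0, 0): 12, (1, 0, 1): 10, (0, 1, 0): 17,
            (0, 1, 1): 13, (1, 1, 0): 18, (1, 1, 1): 14}

# Step 1: one response model per treatment.
p_m = fit_rate_model([row for x, k in m_counts.items() for row in make_rows(x, 100, k)])
p_w = fit_rate_model([row for x, k in w_counts.items() for row in make_rows(x, 100, k)])

# Steps 2-4: compare predictions on held-out users, split by purchase history.
test_users = list(m_counts)  # one test user per feature tuple
shares = share_preferring_m(test_users, p_m, p_w)
print(shares)
```

In this toy data the men’s email wins for every test user in the men’s-only and both segments, and for only half of the women’s-only users, which mirrors the pattern the approach is designed to check.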
Results using random forests:
Lower two panels: As expected, p(R=1|x, T=M) > p(R=1|x, T=W).
Top panel: The two treatments M and W are equally effective.
What comes next?
Conclusion: Indeed, one treatment (the men’s email) can be optimal for all customers.
The step beyond uplift modeling is reinforcement learning: Learning a sequence of actions that is best for each user.
• The goal is to maximize total lifetime reward from each customer.
• Learn simultaneously how customers evolve and how they respond to actions that we take.
Questions?