Talks@Coursera - A/B Testing @ Internet Scale
DESCRIPTION
This tech talk describes how to build an experimentation platform that can handle large-scale experiments. It also discusses several best practices in designing and analyzing online experiments, learned from companies like Coursera, Microsoft, and LinkedIn.

About the Speakers

Ya Xu has been working in the domain of online A/B testing for over 4 years. She currently leads a team of engineers and data scientists building a world-class online A/B testing platform at LinkedIn. She also spearheads taking LinkedIn's A/B testing culture to the next level by evangelizing best practices and pushing for broad-based platform adoption. She holds a Ph.D. in Statistics from Stanford University.

Chuong (Tom) Do leads a team of data engineers and analysts on the Analytics team at Coursera, which is responsible for data infrastructure and quantitative analysis in support of the product and business. He completed his Ph.D. in Computer Science at Stanford University in 2009 and worked as a scientist at the personal genetics company 23andMe until 2012; his research has collectively spanned machine learning, computational biology, and statistical genetics.

TRANSCRIPT
A/B Testing @ Internet Scale
Ya Xu 8/12/2014 @ Coursera
A/B Testing in One Slide
[Diagram: users are split (e.g. 80% / 20%) between a Control and a Treatment version of a page, such as a “Join now” button variant; collect results to determine which one is better]
Outline
§ Culture Challenge
  – Why A/B testing
  – What to A/B test
§ Building a scalable experimentation system
§ Best practices
Why A/B Testing
Amazon Shopping Cart Recommendation
• At Amazon, Greg Linden had this idea of showing recommendations based on cart items
• Trade-offs
  • Pro: cross-sell more items (increase average basket size)
  • Con: distract people from checking out (reduce conversion)
• HiPPO (Highest Paid Person’s Opinion): stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
MSN Real Estate
§ “Find a house” widget variations
§ Revenue to MSN is generated every time a user clicks the search/find button
[Screenshots: widget variants A and B]
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
Take-away
Experiments are the only way to prove causality.
Use A/B testing to:
§ Guide product development
§ Measure impact (assess ROI)
§ Gain “real” customer feedback
What to A/B Test
Ads CTR Drop
[Chart: CTR of profile top ads shows a sudden drop on 11/11/2013]
Root-Cause
[Screenshot: a 5-pixel shift in the navigation bar relative to the profile top ads]
What to A/B Test
§ Evaluating new ideas:
  – Visual changes
  – Complete redesign of a web page
  – Relevance algorithms
  – …
§ Platform changes
§ Code refactoring
§ Bug fixes
Test Everything!
Startups vs. Big Websites
§ Do startups have enough users to A/B test?
  – Startups typically look for larger effects
  – Detecting a 0.5% difference instead of a 5% difference requires roughly 100 times more users (see the sketch below)
§ Startups should establish an A/B testing culture early
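The “100 times more users” point follows from the required sample size growing with one over the squared effect size; a back-of-the-envelope check (the 16σ²/δ² heuristic for ~80% power at α = 0.05, and the numbers below, are illustrative rather than from the talk):

```python
def approx_sample_size_per_variant(sigma: float, delta: float) -> float:
    """Rule-of-thumb sample size per variant for ~80% power at alpha = 0.05."""
    return 16 * sigma**2 / delta**2

# A 10x smaller detectable effect requires ~100x more users per variant.
print(approx_sample_size_per_variant(sigma=1.0, delta=0.05))    # ~6,400
print(approx_sample_size_per_variant(sigma=1.0, delta=0.005))   # ~640,000
```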
A Scalable Experimentation System
A/B Testing 3 Steps
Design • What/Whom to experiment on
Deploy • Code deployment
Analyze • Impact on metrics
A/B Testing Platform Architecture
1. Experiment Management
2. Online Infrastructure
3. Offline Analysis
Example: Bing A/B
1. Experiment Management
§ Define experiments
  – Whom to target?
  – How to split traffic?
§ Start/stop an experiment
§ Important additions:
  – Define success criteria
  – Power analysis
2. Online Infrastructure
1) Hash & partition: random & consistent (sketch below)
2) Deploy: server-side, as a change to
  – The default configuration (Bing)
  – The default code path (LinkedIn)
3) Data logging
[Diagram: Hash(ID) maps each member into the 0–100% range, which is partitioned into buckets (e.g. 20% each) for Treatment 1, Treatment 2, and Control]
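As a rough illustration of the hash & partition step (not LinkedIn's or Bing's actual code), hashing a member ID together with a per-experiment key gives assignments that are random across members yet consistent for the same member. The function name, key format, and 20/20/60 split below are assumptions:

```python
import hashlib

def assign_variant(member_id: str, experiment: str, buckets: int = 100) -> str:
    """Hash (experiment, member) into [0, buckets) and map the bucket to a variant.

    The hash is deterministic, so the same member always lands in the same bucket
    (consistent), and a good hash spreads members uniformly (random).
    """
    key = f"{experiment}:{member_id}".encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % buckets
    if bucket < 20:
        return "treatment1"
    elif bucket < 40:
        return "treatment2"
    else:
        return "control"

# The assignment never changes between requests for the same member.
print(assign_variant("member-42", "new-join-button"))
```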
Hash & Partition @ Scale (I)
§ Pure bucket system (Google/Bing before 200X)
[Diagram: the 0–100% hash space is divided among experiments, e.g. Exp. 1 and Exp. 2 get 20% each and Exp. 3 gets 60%, with Exp. 3’s share further split into red/green/yellow variants at 15%/15%/30%]
• Does not scale
• Traffic management becomes difficult
Hash & Partition @ Scale (II)
§ Fully overlapping system
[Diagram: Exp. 1 and Exp. 2 each span the full 0–100% range and are split independently into their own variants (control/A1/B1 and control/A2/B2)]
• Each experiment gets 100% of traffic
• A user is in “all” experiments simultaneously
• Randomization between experiments is independent (unique hash ID per experiment)
• Cannot avoid interactions between experiments
Hash & Partition @ Scale (III)
§ Hybrid: Layer + Domain
• Centralized management (Bing)
  • A central experimentation team creates/manages layers and domains
• Decentralized management (LinkedIn)
  • Each experiment is one “layer” by default
  • The experimenter controls the hash ID to create a “domain” (see the sketch below)
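A minimal sketch of why salting the hash with a per-layer (or per-domain) hash ID keeps randomization independent between experiments; the layer names and bucket count are made up for illustration:

```python
import hashlib

def layer_bucket(member_id: str, hash_id: str, buckets: int = 1000) -> int:
    """Bucket a member within one layer/domain, keyed by that layer's hash ID."""
    key = f"{hash_id}:{member_id}".encode("utf-8")
    return int(hashlib.sha1(key).hexdigest(), 16) % buckets

# Because the hash IDs differ, a member's bucket in one layer tells you nothing
# about their bucket in another, so every experiment can use 100% of traffic.
member = "member-42"
print(layer_bucket(member, hash_id="search-ranking-layer"))
print(layer_bucket(member, hash_id="homepage-redesign-layer"))
```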
Data Logging
§ Trigger-based logging (see the sketch below)
  – Log whether a request is actually affected by the experiment
  – Log for both factual & counterfactual exposures
[Diagram: of all 300MM+ LinkedIn members, only those visiting the contacts page are triggered into the experiment]
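One possible shape of trigger-based logging, sketched in Python: emit an exposure record only when the request actually hits the experimented feature, and emit it for every variant so factual and counterfactual exposures are both captured. The event fields and function names are assumptions, not LinkedIn's actual schema:

```python
import json
import sys
import time

def log_trigger(sink, member_id: str, experiment: str, variant: str) -> None:
    """Emit one trigger event. Called only when the request reaches the
    experimented feature (e.g. the contacts page), for EVERY variant, so
    triggered control members are identifiable offline."""
    event = {
        "ts": time.time(),
        "member_id": member_id,
        "experiment": experiment,
        "variant": variant,
    }
    sink.write(json.dumps(event) + "\n")

# Example: the variant would come from the hash & partition step.
log_trigger(sys.stdout, "member-42", "contacts-redesign", "control")
```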
3. Automated Offline Analysis
§ Large-scale data processing, e.g. daily @ LinkedIn:
  – 200+ experiments
  – 700+ metrics
  – Billions of experiment trigger events
§ Statistical analysis (see the sketch below)
  – Metrics design
  – Statistical significance test (p-value, confidence interval)
  – Deep-dive: slicing & dicing capability
§ Monitoring & alerting
  – Data quality
  – Early termination
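The per-metric significance test can be sketched as a two-sample comparison. This version uses Welch's t-test and a normal-approximation confidence interval, one common choice rather than a description of the actual pipeline; the simulated data is illustrative only:

```python
import numpy as np
from scipy import stats

def compare_metric(treatment: np.ndarray, control: np.ndarray, alpha: float = 0.05):
    """Return the % delta, p-value, and confidence interval for one metric."""
    delta = treatment.mean() - control.mean()
    pct_delta = 100.0 * delta / control.mean()
    # Welch's t-test: does not assume equal variances between the two groups.
    t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
    se = np.sqrt(treatment.var(ddof=1) / len(treatment) +
                 control.var(ddof=1) / len(control))
    z = stats.norm.ppf(1 - alpha / 2)
    ci = (delta - z * se, delta + z * se)
    return pct_delta, p_value, ci

# Example with simulated per-member metric values.
rng = np.random.default_rng(0)
t = rng.normal(10.2, 2.0, 5000)
c = rng.normal(10.0, 2.0, 5000)
print(compare_metric(t, c))
```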
Best Practices
Example: Unified Search
What to Experiment?
Measure one change at a time.
[Diagram: 50% of En-US traffic gets unified search plus experiments 1+2+…+N; the other 50% stays on pre-unified search]
What to Measure?
§ Success metrics: summarize whether treatment is better
§ Puzzling example:
  – Key metrics for Bing: number of searches & revenue
  – A ranking bug in an experiment resulted in poor search results
  – Number of searches up +10% and revenue up +30%
Success metrics should reflect long-term impact.
Scientific Experiment Design
§ How long to run the experiment?
§ How much traffic to allocate to treatment?
Story:
§ Site speed matters
  – Bing: +100 msec = -0.6% revenue
  – Amazon: +100 msec = -1.0% revenue
  – Google: +100 msec = -0.2% queries
§ But not for Etsy.com? “Faster results better? … meh”
Power
§ Power: the chance of detecting a difference when there really is one.
§ Two reasons your feature doesn’t move metrics:
  1. No “real” impact
  2. Not enough power
Properly power up your experiment!
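One way to properly power up an experiment is to solve for the required sample size before launch. A sketch using statsmodels, where the baseline rate, detectable lift, power, and alpha are placeholder values, not recommendations from the talk:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Placeholder assumption: baseline conversion 10%, smallest lift we care about is to 10.5%.
effect = proportion_effectsize(0.105, 0.10)
n_per_variant = NormalIndPower().solve_power(effect_size=effect, power=0.8, alpha=0.05)
print(f"~{int(n_per_variant):,} members needed per variant")
```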
Statistical Significance
§ Which experiment has a bigger impact?

              Experiment 1                Experiment 2
  Pageviews   1.5%                        12.9%
  Revenue     0.8% (stat. significant)    2.4%

§ Must consider statistical significance
  – A 12.9% delta can still be noise!
  – Identify signal from noise; focus on the “real” movers
  – Ensure results are reproducible
Multiple Testing
§ Famous xkcd comic on Jelly Beans
Multiple Testing Concerns
§ Multiple ramps
  – Pre-decide a ramp to base decisions on (e.g. 50/50)
§ Multiple “peeks”
  – Rely on full-week results
§ Multiple variants
  – Choose the best, then rerun to see if it replicates
§ Multiple metrics (see below)
An irrelevant metric is statistically significant. What to do?
§ Which metric?
§ How “significant”? (p-value)
[Diagram: tiered p-value thresholds by metric relevance: 1st order metrics (directly impacted by the experiment) require p < 0.05, 2nd order metrics (maybe impacted) require p < 0.01, and all remaining metrics require p < 0.001]
Watch out for multiple testing
With 100 metrics, how many would you see as stat. significant even if your experiment does NOTHING? About 5 (each metric has a 5% chance of crossing p < 0.05 by chance; see the simulation below).
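A quick simulation of that point: with 100 metrics and no real effect anywhere, roughly 5 cross p < 0.05 by chance; a correction such as Bonferroni (one option among several) brings the false positives back down. The metric distributions and sample sizes below are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_metrics, n, alpha = 100, 10_000, 0.05

false_positives = 0
corrected_false_positives = 0
for _ in range(n_metrics):
    # Treatment and control drawn from the SAME distribution: no real effect.
    t = rng.normal(0.0, 1.0, n)
    c = rng.normal(0.0, 1.0, n)
    p = stats.ttest_ind(t, c).pvalue
    false_positives += p < alpha
    corrected_false_positives += p < alpha / n_metrics  # Bonferroni correction

print(false_positives)            # ~5 on average
print(corrected_false_positives)  # usually 0
```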
References
§ Tang, Diane, et al. “Overlapping Experiment Infrastructure: More, Better, Faster Experimentation.” Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2010.
§ Kohavi, Ron, et al. “Online Controlled Experiments at Large Scale.” Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2013.
§ LinkedIn blog post: http://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin
Additional Resources: RecSys’14 A/B testing workshop