A/B Testing @ Internet Scale
Ya Xu, 8/12/2014 @ Coursera

DESCRIPTION

Talks@Coursera: This tech talk describes how to build an experimentation platform that can handle large-scale experiments. It also discusses several best practices for designing and analyzing online experiments, learned at companies such as Coursera, Microsoft, and LinkedIn.

About the speakers:

Ya Xu has worked in online A/B testing for over 4 years. She currently leads a team of engineers and data scientists building a world-class online A/B testing platform at LinkedIn, and she spearheads taking LinkedIn's A/B testing culture to the next level by evangelizing best practices and pushing for broad-based platform adoption. She holds a Ph.D. in Statistics from Stanford University.

Chuong (Tom) Do leads a team of data engineers and analysts on the Analytics team at Coursera, which is responsible for data infrastructure and quantitative analysis in support of the product and business. He completed his Ph.D. in Computer Science at Stanford University in 2009 and worked as a scientist at the personal genetics company 23andMe until 2012; his research has spanned machine learning, computational biology, and statistical genetics.

TRANSCRIPT

Page 1: A/B Testing @ Internet Scale

Ya Xu, 8/12/2014 @ Coursera

Page 2: A/B Testing in One Slide

[Diagram: traffic is randomly split (e.g. 20% / 80%) between Control and Treatment, each seeing a different version of the "Join now" page]

Collect results to determine which one is better

Page 3: Outline

§  Culture Challenge
   –  Why A/B testing
   –  What to A/B test
§  Building a scalable experimentation system
§  Best practices

Page 4: Why A/B Testing

Page 5: Amazon Shopping Cart Recommendation

•  At Amazon, Greg Linden had this idea of showing recommendations based on cart items

•  Trade-offs
   •  Pro: cross-sell more items (increase average basket size)
   •  Con: distract people from checking out (reduce conversion)

•  HiPPO (Highest Paid Person’s Opinion): stop the project

From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html

http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx

Page 6: MSN Real Estate

§  “Find a house” widget variations
§  Revenue to MSN is generated every time a user clicks the search/find button

[Image: widget variants A and B]

http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx

Page 7: Take-away

Experiments are the only way to prove causality.

Use A/B testing to:
§  Guide product development
§  Measure impact (assess ROI)
§  Gain “real” customer feedback

Page 8: What to A/B Test

Page 9: Ads CTR Drop

[Chart: CTR of profile top ads over time, with a sudden drop on 11/11/2013]

Page 10: Root-Cause

5 Pixels!!

[Screenshot: a 5-pixel layout shift between the navigation bar and the profile top ads]

Page 11: What to A/B Test

§  Evaluating new ideas:
   –  Visual changes
   –  Complete redesign of web page
   –  Relevance algorithms
   –  …
§  Platform changes
§  Code refactoring
§  Bug fixes

Test Everything!

Page 12: Startups vs. Big Websites

§  Do startups have enough users to A/B test?
   –  Startups typically look for larger effects
   –  Detecting a 0.5% difference instead of a 5% difference requires roughly 100 times more users, since required sample size scales as 1/(effect size)^2 (see the sketch below)
§  Startups should establish an A/B testing culture early
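As a back-of-the-envelope check of that claim, here is a minimal sketch (assuming a two-sided two-proportion z-test at alpha = 0.05 with 80% power; the 10% baseline conversion rate is illustrative):

```python
from scipy.stats import norm

def sample_size_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-proportion z-test."""
    p_treat = p_base * (1 + rel_lift)
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = norm.ppf(power)           # power requirement
    var = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return (z_alpha + z_beta) ** 2 * var / (p_base - p_treat) ** 2

# A big site chasing a 0.5% relative lift vs. a startup chasing 5%,
# both on a 10% baseline conversion rate:
print(f"{sample_size_per_arm(0.10, 0.005):,.0f}")  # ~5.7 million users per arm
print(f"{sample_size_per_arm(0.10, 0.05):,.0f}")   # ~58 thousand per arm: ~100x fewer
```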


Page 13: A Scalable Experimentation System

Page 14: A/B Testing 3 Steps

1.  Design: what/whom to experiment on
2.  Deploy: code deployment
3.  Analyze: impact on metrics

Page 15: A/B Testing Platform Architecture

1.  Experiment Management
2.  Online Infrastructure
3.  Offline Analysis

Example: Bing A/B

Page 16: 1. Experiment Management

§  Define experiments
   –  Whom to target?
   –  How to split traffic?
§  Start/stop an experiment
§  Important additions:
   –  Define success criteria
   –  Power analysis
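To make the management piece concrete, here is a sketch of what an experiment definition could capture; the schema and field names are hypothetical, not Bing's or LinkedIn's actual format:

```python
# Hypothetical experiment definition: all field names are illustrative.
experiment = {
    "name": "contacts-page-redesign",
    "targeting": {"country": "US", "interface_locale": "en_US"},  # whom to target
    "variants": {"control": 0.80, "treatment": 0.20},             # how to split traffic
    "success_criteria": {
        "metric": "contacts_page_ctr",
        "min_detectable_lift": 0.02,  # power analysis input: smallest lift worth detecting
        "alpha": 0.05,
        "power": 0.80,
    },
    "state": "running",  # start/stop switch
}
```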

Page 17: 2. Online Infrastructure

1)  Hash & partition: random & consistent
2)  Deploy: server-side, as a change to
    –  the default configuration (Bing)
    –  the default code path (LinkedIn)
3)  Data logging

[Diagram: Hash(ID) maps each user onto the 0–100% range, which is partitioned into segments, e.g. Treatment1 (20%), Treatment2 (20%), and Control]
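A minimal sketch of this hash & partition step (the MD5 salt scheme and the allocation format are assumptions for illustration, not any specific platform's implementation):

```python
import hashlib

def assign_variant(member_id, experiment, allocation):
    """Consistently map a user to a variant.

    allocation: ordered list of (variant, fraction) pairs; fractions
    sum to at most 1.0, with the remainder unallocated.
    """
    # Salting with the experiment name makes randomization independent
    # across experiments while staying consistent for a given user.
    key = f"{experiment}:{member_id}".encode()
    bucket = int.from_bytes(hashlib.md5(key).digest()[:8], "big") / 2**64  # uniform in [0, 1)

    cumulative = 0.0
    for variant, fraction in allocation:
        cumulative += fraction
        if bucket < cumulative:
            return variant
    return None  # user falls outside the experiment's allocation

print(assign_variant(12345, "contacts-page-redesign",
                     [("treatment1", 0.2), ("treatment2", 0.2), ("control", 0.6)]))
```

Hashing the ID is what makes assignment both random and consistent: the same member lands in the same bucket on every request, with no per-user assignment state to store.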

Page 18: Hash &amp; Partition @ Scale (I)

§ Pure bucket system (Google/Bing before 200X)

[Diagram: the 0–100% hash range is carved up among experiments, e.g. Exp. 1 (20%), Exp. 2 (20%), Exp. 3 (60%); within an experiment, variants red (15%), green (15%), yellow (30%)]

•  Does not scale
•  Traffic management

Page 19: Hash &amp; Partition @ Scale (II)

§  Fully overlapping system

[Diagram: Exp. 1 and Exp. 2 each span the full 0–100% range, each with its own partitions (Exp. 1: control, A1, B1; Exp. 2: A2, B2, control)]

•  Each experiment gets 100% traffic
•  A user is in “all” experiments simultaneously
•  Randomization between experiments is independent (unique hashID per experiment)
•  Cannot avoid interaction between experiments
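Continuing the earlier assign_variant sketch, a quick illustrative check of the independence property: because each experiment hashes with its own salt, knowing a user's variant in one experiment tells you nothing about their variant in another.

```python
from collections import Counter

# Reuses assign_variant from the online-infrastructure sketch above.
split = [("control", 0.5), ("treatment", 0.5)]
pairs = Counter(
    (assign_variant(uid, "exp1", split), assign_variant(uid, "exp2", split))
    for uid in range(100_000)
)
print(pairs)  # all four (exp1, exp2) combinations occur ~25,000 times each
```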

Page 20: Hash &amp; Partition @ Scale (III)

§ Hybrid: Layer + Domain

•  Centralized management (Bing)
   •  A central experimentation team creates/manages layers and domains
•  De-centralized management (LinkedIn)
   •  Each experiment is one “layer” by default
   •  The experimenter controls the hashID to create a “domain”
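A toy sketch of the layered scheme, again reusing assign_variant (the layer names and two-stage hashing are assumptions for illustration): experiments within a layer split that layer's traffic and are therefore mutually exclusive, while separate layers hash independently and overlap freely.

```python
def assign_in_layer(member_id, layer, experiments):
    """Two-stage assignment: pick at most one experiment per layer,
    then pick a variant within that experiment."""
    # Stage 1: experiments in the same layer split the layer's traffic,
    # so no user is ever in two of them at once (avoids interaction).
    exp = assign_variant(member_id, f"layer:{layer}", experiments)
    if exp is None:
        return None, None
    # Stage 2: within the chosen experiment, hash again with a different
    # salt to pick control vs. treatment.
    variant = assign_variant(member_id, f"exp:{exp}",
                             [("control", 0.5), ("treatment", 0.5)])
    return exp, variant

# Two experiments sharing one layer never see the same user;
# experiments in different layers overlap freely.
print(assign_in_layer(12345, "search-ranking", [("exp-A", 0.5), ("exp-B", 0.5)]))
```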

Page 21: Data Logging

§  Trigger: the subset of users actually affected by the experiment

§  Trigger-based logging
   –  Log whether a request is actually affected by the experiment
   –  Log for both factual &amp; counter-factual assignments

[Diagram: of all 300MM+ LinkedIn members, only the triggered subset, members visiting the contacts page, enters the analysis]
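A hedged sketch of trigger-based logging (the event schema and serving function are made up for illustration): the point is that the trigger event is recorded whether the variant served is the factual treatment experience or the counterfactual control one, so analysis can compare exactly the triggered populations.

```python
import json, time

def serve_contacts_page(member_id, log):
    """Serve the page and emit a trigger event (hypothetical schema)."""
    variant = assign_variant(member_id, "contacts-page-redesign",
                             [("control", 0.8), ("treatment", 0.2)])
    if variant is not None:
        # Factual for treatment members (they see the new page); counter-
        # factual for control members (they *would* have been affected).
        # Logging both restricts the analysis to triggered members only.
        log.append(json.dumps({
            "ts": time.time(),
            "member": member_id,
            "experiment": "contacts-page-redesign",
            "variant": variant,
            "triggered": True,
        }))
    return variant
```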

Page 22: 3. Automated Offline Analysis

§  Large-scale data processing, e.g. daily @ LinkedIn:
   –  200+ experiments
   –  700+ metrics
   –  billions of experiment trigger events
§  Statistical analysis
   –  Metrics design
   –  Statistical significance testing (p-value, confidence interval)
   –  Deep-dive: slicing &amp; dicing capability
§  Monitoring &amp; alerting
   –  Data quality
   –  Early termination
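As a hedged sketch of the per-metric significance computation (a plain two-sample z-test for a proportion metric; production platforms layer variance reduction and slicing on top):

```python
import math
from scipy.stats import norm

def ab_test(successes_c, n_c, successes_t, n_t, alpha=0.05):
    """Two-sample z-test for a proportion metric: delta, p-value, CI."""
    p_c, p_t = successes_c / n_c, successes_t / n_t
    delta = p_t - p_c
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    p_value = 2 * (1 - norm.cdf(abs(delta / se)))  # two-sided
    half = norm.ppf(1 - alpha / 2) * se            # CI half-width
    return delta, p_value, (delta - half, delta + half)

delta, p, ci = ab_test(successes_c=1000, n_c=50_000, successes_t=1100, n_t=50_000)
print(f"delta={delta:+.4f}, p={p:.4f}, 95% CI=({ci[0]:+.4f}, {ci[1]:+.4f})")
```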

Page 23: Best Practices

Page 24: Example: Unified Search

Page 25: What to Experiment?

Measure one change at a time.

[Diagram: 50% of En-US traffic receives unified search (changes 1+2+…N bundled); the other 50% stays on pre-unified search]

Page 26: What to Measure?

§  Success metrics: summarize whether the treatment is better

§  Puzzling example:
   –  Key metrics for Bing: number of searches &amp; revenue
   –  A ranking bug in an experiment resulted in poor search results
   –  Number of searches went up +10% and revenue up +30%
   –  Why? Users had to issue more queries, and clicked more ads, to find what they wanted

Success metrics should reflect long-term impact

Page 27: Scientific Experiment Design

§  How long to run the experiment?
§  How much traffic to allocate to treatment?

Story:
§  Site speed matters
   –  Bing: +100 msec = -0.6% revenue
   –  Amazon: +100 msec = -1.0% revenue
   –  Google: +100 msec = -0.2% queries
§  But not for Etsy.com? “Faster results better? … meh”

Page 28: Power

§ Power: the chance of detecting a difference when there really is one.

§  Two reasons your feature doesn’t move metrics:
   1.  No “real” impact
   2.  Not enough power

Properly power up your experiment!
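A small simulation of the second failure mode (all numbers illustrative): a real 2% relative lift on a 10% baseline exists, but an underpowered experiment usually fails to detect it.

```python
import numpy as np

rng = np.random.default_rng(0)

def detection_rate(n_per_arm, p_c=0.10, rel_lift=0.02, runs=2000):
    """Fraction of simulated experiments whose z-test reaches p < 0.05."""
    p_t = p_c * (1 + rel_lift)
    c = rng.binomial(n_per_arm, p_c, size=runs) / n_per_arm
    t = rng.binomial(n_per_arm, p_t, size=runs) / n_per_arm
    se = np.sqrt(c * (1 - c) / n_per_arm + t * (1 - t) / n_per_arm)
    return np.mean(np.abs(t - c) / se > 1.96)

print(detection_rate(10_000))    # ~7%: the real lift is almost always missed
print(detection_rate(400_000))   # ~85%: adequately powered
```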

Page 29: Statistical Significance

§ Which experiment has a bigger impact?

             Experiment 1    Experiment 2
Pageviews    1.5%            12.9%
Revenue      0.8%            2.4%

Page 30: Statistical Significance

§ Which experiment has a bigger impact?

             Experiment 1               Experiment 2
Pageviews    1.5%                       12.9%
Revenue      0.8% (stat. significant)   2.4%

Page 31: Statistical Significance

§  Must consider statistical significance
   –  A 12.9% delta can still be noise!
   –  Identify signal from noise; focus on the “real” movers
   –  Ensure results are reproducible

             Experiment 1               Experiment 2
Pageviews    1.5%                       12.9%
Revenue      0.8% (stat. significant)   2.4%

Page 32: Multiple Testing

§ Famous xkcd comic on Jelly Beans


Page 33: Multiple Testing Concerns

§  Multiple ramps
   –  Pre-decide a ramp to base the decision on (e.g. 50/50)
§  Multiple “peeks”
   –  Rely on “full”-week results (see the simulation below)
§  Multiple variants
   –  Choose the best, then rerun to see if it replicates
§  Multiple metrics
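To illustrate why repeated peeking is a concern, here is a small simulation (traffic and duration are illustrative): even with no true effect, checking significance every day and stopping at the first p < 0.05 inflates the false positive rate well above the nominal 5%.

```python
import numpy as np

rng = np.random.default_rng(1)

def peeking_false_positive_rate(days=14, users_per_day=5_000, p=0.10, runs=2000):
    """Simulate an A/A test, testing after each day; stop at first p < 0.05."""
    false_pos = 0
    for _ in range(runs):
        c = t = n = 0
        for _ in range(days):
            c += rng.binomial(users_per_day, p)  # control conversions
            t += rng.binomial(users_per_day, p)  # treatment conversions (same p!)
            n += users_per_day
            pc, pt = c / n, t / n
            se = np.sqrt(pc * (1 - pc) / n + pt * (1 - pt) / n)
            if abs(pt - pc) / se > 1.96:  # a "significant" peek
                false_pos += 1
                break
    return false_pos / runs

# A single pre-committed end-of-experiment test would be ~5%;
# daily peeking pushes this to roughly 15-25% in this setup.
print(peeking_false_positive_rate())
```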

Page 34:

An irrelevant metric is statistically significant. What to do?
§  Which metric?
§  How “significant”? (p-value)

[Diagram: tiered significance thresholds]
   1st order metrics (directly impacted by the exp.):  p-value &lt; 0.05
   2nd order metrics (maybe impacted by the exp.):     p-value &lt; 0.01
   All metrics:                                        p-value &lt; 0.001

Watch out for multiple testing

With 100 metrics, how many would you see stat. significant even if your experiment does NOTHING? On average, 5 (at the p &lt; 0.05 level).
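A two-line check of that arithmetic by simulation (100 independent null metrics, illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
# Under the null, p-values are uniform on [0, 1], so about 5 of 100
# land below 0.05 purely by chance.
print((rng.uniform(size=100) < 0.05).sum())                          # ~5 on one draw
print((rng.uniform(size=(10_000, 100)) < 0.05).sum(axis=1).mean())   # ~5.0 on average
```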

Page 35: References

§  Tang, Diane, et al. “Overlapping Experiment Infrastructure: More, Better, Faster Experimentation.” Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.

§  Kohavi, Ron, et al. “Online Controlled Experiments at Large Scale.” Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.

§  LinkedIn blog post: http://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin

Additional Resources: RecSys’14 A/B testing workshop
