Talks@Coursera - A/B Testing @ Internet Scale
DESCRIPTION
This tech talk describes how to build an experimentation platform that can handle large-scale experiments. It also discusses several best practices in designing and analyzing online experiments, learned from companies like Coursera, Microsoft, and LinkedIn.

About the Speakers

Ya Xu has been working in the domain of online A/B testing for over 4 years. She currently leads a team of engineers and data scientists building a world-class online A/B testing platform at LinkedIn. She also spearheads taking LinkedIn's A/B testing culture to the next level by evangelizing best practices and pushing for broad-based platform adoption. She holds a Ph.D. in Statistics from Stanford University.

Chuong (Tom) Do leads a team of data engineers and analysts on the Analytics team at Coursera, which is responsible for data infrastructure and quantitative analysis in support of the product and business. He completed his Ph.D. in Computer Science at Stanford University in 2009 and worked as a scientist at the personal genetics company 23andMe until 2012; his research has collectively spanned machine learning, computational biology, and statistical genetics.

TRANSCRIPT
A/B Testing @ Internet Scale
Ya Xu 8/12/2014 @ Coursera
A/B Testing in One Slide
[Diagram: users are split (e.g. 80% / 20%) between a Control and a Treatment version of a page, such as a “Join now” button variant; collect results to determine which one is better]
Outline
§ Culture Challenge
  – Why A/B testing
  – What to A/B test
§ Building a scalable experimentation system
§ Best practices
Why A/B Testing
Amazon Shopping Cart Recommendation
• At Amazon, Greg Linden had this idea of showing recommendations based on cart items
• Trade-offs
  • Pro: cross-sell more items (increase average basket size)
  • Con: distract people from checking out (reduce conversion)
• HiPPO (Highest Paid Person’s Opinion): stop the project
From Greg Linden’s Blog: http://glinden.blogspot.com/2006/04/early-amazon-shopping-cart.html
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
MSN Real Estate
§ “Find a house” widget variations
§ Revenue to MSN is generated every time a user clicks the search/find button
[Screenshots: widget variants A and B]
http://www.exp-platform.com/Documents/2012-08%20Puzzling%20Outcomes%20KDD.pptx
Take-away
Experiments are the only way to prove causality.
Use A/B testing to:
§ Guide product development
§ Measure impact (assess ROI)
§ Gain “real” customer feedback
What to A/B Test
Ads CTR Drop
[Chart: CTR of profile top ads shows a sudden drop on 11/11/2013]
Root-Cause
[Screenshot: a 5-pixel shift in the navigation bar relative to the profile top ads]
What to A/B Test
§ Evaluating new ideas:
  – Visual changes
  – Complete redesign of a web page
  – Relevance algorithms
  – …
§ Platform changes
§ Code refactoring
§ Bug fixes
Test Everything!
Startups vs. Big Websites
§ Do startups have enough users to A/B test?
  – Startups typically look for larger effects
  – Detecting a 0.5% difference instead of a 5% difference requires roughly 100 times more users (see the sketch below)
§ Startups should establish an A/B testing culture early
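The “100 times more users” point follows from the required sample size growing with one over the squared effect size; a back-of-the-envelope check (the 16σ²/δ² heuristic for ~80% power at α = 0.05, and the numbers below, are illustrative rather than from the talk):

```python
def approx_sample_size_per_variant(sigma: float, delta: float) -> float:
    """Rule-of-thumb sample size per variant for ~80% power at alpha = 0.05."""
    return 16 * sigma**2 / delta**2

# A 10x smaller detectable effect requires ~100x more users per variant.
print(approx_sample_size_per_variant(sigma=1.0, delta=0.05))    # ~6,400
print(approx_sample_size_per_variant(sigma=1.0, delta=0.005))   # ~640,000
```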
A Scalable Experimentation System
A/B Testing 3 Steps
Design • What/Whom to experiment on
Deploy • Code deployment
Analyze • Impact on metrics
A/B Testing Platform Architecture
1. Experiment Management
2. Online Infrastructure
3. Offline Analysis
Example: Bing A/B
1. Experiment Management
§ Define experiments
  – Whom to target?
  – How to split traffic?
§ Start/stop an experiment
§ Important additions:
  – Define success criteria
  – Power analysis
2. Online Infrastructure
1) Hash & partition: random & consistent (sketch below)
2) Deploy: server-side, as a change to
  – The default configuration (Bing)
  – The default code path (LinkedIn)
3) Data logging
[Diagram: Hash(ID) maps each member into the 0–100% range, which is partitioned into buckets (e.g. 20% each) for Treatment 1, Treatment 2, and Control]
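As a rough illustration of the hash & partition step (not LinkedIn's or Bing's actual code), hashing a member ID together with a per-experiment key gives assignments that are random across members yet consistent for the same member. The function name, key format, and 20/20/60 split below are assumptions:

```python
import hashlib

def assign_variant(member_id: str, experiment: str, buckets: int = 100) -> str:
    """Hash (experiment, member) into [0, buckets) and map the bucket to a variant.

    The hash is deterministic, so the same member always lands in the same bucket
    (consistent), and a good hash spreads members uniformly (random).
    """
    key = f"{experiment}:{member_id}".encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % buckets
    if bucket < 20:
        return "treatment1"
    elif bucket < 40:
        return "treatment2"
    else:
        return "control"

# The assignment never changes between requests for the same member.
print(assign_variant("member-42", "new-join-button"))
```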
Hash & Partition @ Scale (I)
§ Pure bucket system (Google/Bing before 200X)
[Diagram: the 0–100% hash space is divided among experiments, e.g. Exp. 1 and Exp. 2 get 20% each and Exp. 3 gets 60%, with Exp. 3’s share further split into red/green/yellow variants at 15%/15%/30%]
• Does not scale
• Traffic management becomes difficult
Hash & Partition @ Scale (II)
§ Fully overlapping system
[Diagram: Exp. 1 and Exp. 2 each span the full 0–100% range and are split independently into their own variants (control/A1/B1 and control/A2/B2)]
• Each experiment gets 100% of traffic
• A user is in “all” experiments simultaneously
• Randomization between experiments is independent (unique hash ID per experiment)
• Cannot avoid interactions between experiments
Hash & Partition @ Scale (III)
§ Hybrid: Layer + Domain
• Centralized management (Bing)
  • A central experimentation team creates/manages layers and domains
• Decentralized management (LinkedIn)
  • Each experiment is one “layer” by default
  • The experimenter controls the hash ID to create a “domain” (see the sketch below)
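A minimal sketch of why salting the hash with a per-layer (or per-domain) hash ID keeps randomization independent between experiments; the layer names and bucket count are made up for illustration:

```python
import hashlib

def layer_bucket(member_id: str, hash_id: str, buckets: int = 1000) -> int:
    """Bucket a member within one layer/domain, keyed by that layer's hash ID."""
    key = f"{hash_id}:{member_id}".encode("utf-8")
    return int(hashlib.sha1(key).hexdigest(), 16) % buckets

# Because the hash IDs differ, a member's bucket in one layer tells you nothing
# about their bucket in another, so every experiment can use 100% of traffic.
member = "member-42"
print(layer_bucket(member, hash_id="search-ranking-layer"))
print(layer_bucket(member, hash_id="homepage-redesign-layer"))
```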
Data Logging
§ Trigger-based logging (see the sketch below)
  – Log whether a request is actually affected by the experiment
  – Log for both factual & counterfactual exposures
[Diagram: of all 300MM+ LinkedIn members, only those visiting the contacts page are triggered into the experiment]
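One possible shape of trigger-based logging, sketched in Python: emit an exposure record only when the request actually hits the experimented feature, and emit it for every variant so factual and counterfactual exposures are both captured. The event fields and function names are assumptions, not LinkedIn's actual schema:

```python
import json
import sys
import time

def log_trigger(sink, member_id: str, experiment: str, variant: str) -> None:
    """Emit one trigger event. Called only when the request reaches the
    experimented feature (e.g. the contacts page), for EVERY variant, so
    triggered control members are identifiable offline."""
    event = {
        "ts": time.time(),
        "member_id": member_id,
        "experiment": experiment,
        "variant": variant,
    }
    sink.write(json.dumps(event) + "\n")

# Example: the variant would come from the hash & partition step.
log_trigger(sys.stdout, "member-42", "contacts-redesign", "control")
```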
3. Automated Offline Analysis
§ Large-scale data processing, e.g. daily @ LinkedIn:
  – 200+ experiments
  – 700+ metrics
  – Billions of experiment trigger events
§ Statistical analysis (see the sketch below)
  – Metrics design
  – Statistical significance test (p-value, confidence interval)
  – Deep-dive: slicing & dicing capability
§ Monitoring & alerting
  – Data quality
  – Early termination
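The per-metric significance test can be sketched as a two-sample comparison. This version uses Welch's t-test and a normal-approximation confidence interval, one common choice rather than a description of the actual pipeline; the simulated data is illustrative only:

```python
import numpy as np
from scipy import stats

def compare_metric(treatment: np.ndarray, control: np.ndarray, alpha: float = 0.05):
    """Return the % delta, p-value, and confidence interval for one metric."""
    delta = treatment.mean() - control.mean()
    pct_delta = 100.0 * delta / control.mean()
    # Welch's t-test: does not assume equal variances between the two groups.
    t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
    se = np.sqrt(treatment.var(ddof=1) / len(treatment) +
                 control.var(ddof=1) / len(control))
    z = stats.norm.ppf(1 - alpha / 2)
    ci = (delta - z * se, delta + z * se)
    return pct_delta, p_value, ci

# Example with simulated per-member metric values.
rng = np.random.default_rng(0)
t = rng.normal(10.2, 2.0, 5000)
c = rng.normal(10.0, 2.0, 5000)
print(compare_metric(t, c))
```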
Best Practices
Example: Unified Search
What to Experiment?
Measure one change at a time.
[Diagram: 50% of En-US traffic gets unified search plus experiments 1+2+…+N; the other 50% stays on pre-unified search]
What to Measure?
§ Success metrics: summarize whether treatment is better
§ Puzzling example:
  – Key metrics for Bing: number of searches & revenue
  – A ranking bug in an experiment resulted in poor search results
  – Number of searches up +10% and revenue up +30%
Success metrics should reflect long-term impact.
Scientific Experiment Design
§ How long to run the experiment?
§ How much traffic to allocate to treatment?
Story:
§ Site speed matters
  – Bing: +100 msec = -0.6% revenue
  – Amazon: +100 msec = -1.0% revenue
  – Google: +100 msec = -0.2% queries
§ But not for Etsy.com? “Faster results better? … meh”
Power
§ Power: the chance of detecting a difference when there really is one.
§ Two reasons your feature doesn’t move metrics:
  1. No “real” impact
  2. Not enough power
Properly power up your experiment!
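One way to properly power up an experiment is to solve for the required sample size before launch. A sketch using statsmodels, where the baseline rate, detectable lift, power, and alpha are placeholder values, not recommendations from the talk:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Placeholder assumption: baseline conversion 10%, smallest lift we care about is to 10.5%.
effect = proportion_effectsize(0.105, 0.10)
n_per_variant = NormalIndPower().solve_power(effect_size=effect, power=0.8, alpha=0.05)
print(f"~{int(n_per_variant):,} members needed per variant")
```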
Statistical Significance
§ Which experiment has a bigger impact?

              Experiment 1                Experiment 2
  Pageviews   1.5%                        12.9%
  Revenue     0.8% (stat. significant)    2.4%

§ Must consider statistical significance
  – A 12.9% delta can still be noise!
  – Identify signal from noise; focus on the “real” movers
  – Ensure results are reproducible
Multiple Testing
§ Famous xkcd comic on Jelly Beans
Multiple Testing Concerns
§ Multiple ramps
  – Pre-decide a ramp to base decisions on (e.g. 50/50)
§ Multiple “peeks”
  – Rely on full-week results
§ Multiple variants
  – Choose the best, then rerun to see if it replicates
§ Multiple metrics (see below)
An irrelevant metric is statistically significant. What to do?
§ Which metric?
§ How “significant”? (p-value)
[Diagram: tiered p-value thresholds by metric relevance: 1st order metrics (directly impacted by the experiment) require p < 0.05, 2nd order metrics (maybe impacted) require p < 0.01, and all remaining metrics require p < 0.001]
Watch out for multiple testing
With 100 metrics, how many would you see as stat. significant even if your experiment does NOTHING? About 5 (each metric has a 5% chance of crossing p < 0.05 by chance; see the simulation below).
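A quick simulation of that point: with 100 metrics and no real effect anywhere, roughly 5 cross p < 0.05 by chance; a correction such as Bonferroni (one option among several) brings the false positives back down. The metric distributions and sample sizes below are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_metrics, n, alpha = 100, 10_000, 0.05

false_positives = 0
corrected_false_positives = 0
for _ in range(n_metrics):
    # Treatment and control drawn from the SAME distribution: no real effect.
    t = rng.normal(0.0, 1.0, n)
    c = rng.normal(0.0, 1.0, n)
    p = stats.ttest_ind(t, c).pvalue
    false_positives += p < alpha
    corrected_false_positives += p < alpha / n_metrics  # Bonferroni correction

print(false_positives)            # ~5 on average
print(corrected_false_positives)  # usually 0
```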
References
§ Tang, Diane, et al. “Overlapping Experiment Infrastructure: More, Better, Faster Experimentation.” Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2010.
§ Kohavi, Ron, et al. “Online Controlled Experiments at Large Scale.” Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2013.
§ LinkedIn blog post: http://engineering.linkedin.com/ab-testing/xlnt-platform-driving-ab-testing-linkedin
Additional Resources: RecSys’14 A/B testing workshop