Challenging Problems in Online Controlled Experiments Slides at https://bit.ly/CODE2015Kohavi Ron Kohavi, Distinguished Engineer, General Manager, Analysis and Experimentation, Microsoft


Page 1: Challenging Problems in Online Controlled Experiments

Challenging Problems in Online Controlled Experiments

Slides at https://bit.ly/CODE2015Kohavi

Ron Kohavi, Distinguished Engineer, General Manager, Analysis and Experimentation, Microsoft

Page 2: Challenge 1: Sessions/User as Metric

Search engines (Bing, Google) are evaluated on query share (distinct queries) and revenue as long-term goals.
Observation: a ranking bug in an experiment resulted in very poor search results.
Degraded (algorithmic) search results cause users to search more to complete their task, and ads appear more relevant.
Distinct queries went up over 10%, and revenue went up over 30%.
If we optimize for these metrics, we should fire the relevance team.
What metric should we use as the OEC (Overall Evaluation Criterion) for a search engine?


Page 3: OEC for Search Engines

Analyzing queries per month, we have the decomposition

Queries/Month = Queries/Session × Sessions/User × Users/Month

where a session begins with a query and ends with 30 minutes of inactivity. (Ideally, we would look at tasks, not sessions.)
Key observation: we want users to find answers and complete tasks quickly, so Queries/Session should be smaller.
In a controlled experiment, the variants get (approximately) the same number of users by design, so the last term is about equal.
The OEC should therefore include the middle term: Sessions/User.
This seems like an ideal metric for many sites, not just Bing: increased sessions/visits.
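
As an aside (not from the slides), here is a minimal sketch of the 30-minute-inactivity sessionization described above; the (user_id, timestamp) record format is an assumption for illustration.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # a session ends after 30 minutes of inactivity

def sessions_per_user(query_log):
    """Count sessions per user from (user_id, unix_timestamp) query events.

    A new session starts when the gap since the user's previous query
    exceeds SESSION_GAP_SECONDS.
    """
    by_user = defaultdict(list)
    for user_id, ts in query_log:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, timestamps in by_user.items():
        timestamps.sort()
        count = 1  # the first query always opens a session
        for prev, cur in zip(timestamps, timestamps[1:]):
            if cur - prev > SESSION_GAP_SECONDS:
                count += 1
        sessions[user_id] = count
    return sessions
```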


Page 4: Challenge: Statistical Power

The t-statistic used to evaluate statistical significance is defined as t = Δ / σ̂_Δ, where Δ is the difference between a metric for the two variants, and σ̂_Δ is the (estimated) standard deviation of that difference.
The relative confidence-interval size is proportional to CV/√n, where CV is the coefficient of variation of the metric and n is the sample size.
Normally, as the experiment runs longer and more users are admitted, the confidence interval should shrink.
Here is a graph of the relative confidence-interval size for Sessions/User over a month (not reproduced in this transcript).
It is NOT shrinking as expected: the CV, normally fixed, is growing at roughly the same rate as √n.
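
Not from the slides: a minimal sketch of these quantities under the usual assumptions (normal approximation, two equal-sized variants of n users each).

```python
import math

def t_statistic(mean_t, mean_c, std_t, std_c, n_t, n_c):
    """t = delta / estimated standard deviation of the delta (Welch-style)."""
    delta = mean_t - mean_c
    se_delta = math.sqrt(std_t**2 / n_t + std_c**2 / n_c)
    return delta / se_delta

def relative_ci_half_width(cv, n, z=1.96):
    """Relative (to the mean) 95% CI half-width for the delta between two
    equal-sized variants of n users each: z * CV * sqrt(2/n).
    If CV grows like sqrt(n), this quantity stops shrinking as n grows."""
    return z * cv * math.sqrt(2.0 / n)
```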

Page 5: Why is this Important?

Given that this metric is Bing's "north star," everyone tries to improve it.
Degradations in Sessions/User (commonly due to serious bugs) quickly become stat-sig, indicating abandonment.
Positive movements are extremely rare: about two ideas a year are "certified" as having moved Sessions/User positively (out of ~10,000 experiments, about 1,000-2,000 are successful on other OEC metrics), so about 0.02% of the time.
Certification involves a very low p-value (rare) and, more commonly, replication.
Challenges:
Can we improve the sensitivity? We published a paper on using pre-experiment data, CUPED, which really helped here (a sketch follows below). Other ideas?
Is there a similar metric that is more sensitive?
Is it possible that this metric just can't be moved much? Unlikely: comScore reports Sessions/User for Bing and Google, and there is a factor-of-two gap.
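
For reference, a minimal sketch (not from the slides) of the CUPED adjustment: use the same metric computed over a pre-experiment period as a covariate to reduce variance. The array names are assumptions for the example.

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED-adjusted metric values.

    y: metric during the experiment (one value per user)
    x: the same metric for the same users over the pre-experiment period
    Returns y - theta * (x - mean(x)), which has the same mean as y
    but lower variance when x and y are correlated.
    """
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# In practice theta is usually estimated once from pooled data and applied
# to both treatment and control before running the usual t-test.
```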


Page 6: Challenge 2: NHST and P-values

NHST = Null Hypothesis Statistical Testing, the "standard" model commonly used.
A p-value <= 0.05 is the "standard" threshold for rejecting the null hypothesis.
The p-value is often misinterpreted. Here are some incorrect statements from Steven Goodman's "A Dirty Dozen":
1. If P = .05, the null hypothesis has only a 5% chance of being true.
2. A non-significant difference (e.g., P > .05) means there is no difference between groups.
3. P = .05 means that we have observed data that would occur only 5% of the time under the null hypothesis.
4. P = .05 means that if you reject the null hypothesis, the probability of a type I error (false positive) is only 5%.
The problem is that the p-value gives us Prob(X >= x | H_0), whereas what we want is Prob(H_0 | X = x).


Page 7: Why is this Important?

Take Sessions/User, a metric that historically moves positively 0.02% of the time at Bing.
With standard p-value computations, 5% of experiments will show a stat-sig movement, half of those positive.
99.6% of the time, a stat-sig movement with p-value = 0.05 does not mean that the idea improved Sessions/User (a worked example follows below).
An initial way to address this is Bayesian: see "Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments" at http://exp-platform.com for recent work by Alex Deng.
Basically, we use historical data to set priors. But this assumes the new idea behaves like prior ones.
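
A back-of-the-envelope version of the claim above (not from the slides); the 80% power figure is my assumption, and different assumptions shift the exact percentage.

```python
# Probability that a positive stat-sig result reflects a real improvement,
# given a very low prior rate of true improvements.
prior_true = 0.0002      # historically ~0.02% of ideas truly improve Sessions/User
alpha_positive = 0.025   # under H0, 5% of experiments are stat-sig; half positive
power = 0.8              # assumed power to detect a true improvement

p_true_and_sig = prior_true * power
p_false_and_sig = (1 - prior_true) * alpha_positive

posterior = p_true_and_sig / (p_true_and_sig + p_false_and_sig)
print(f"P(real improvement | positive stat-sig) ~ {posterior:.1%}")
# ~0.6%, i.e., roughly 99.4% of positive stat-sig movements are not real
# improvements, in line with the ~99.6% figure on the slide.
```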


Page 8: Challenge 3: Duration / Novelty Effects

How long do experiments need to run?
We normally run experiments for one to two weeks. When we suspect novelty (or primacy) effects, we run longer.
At Bing, despite running some experiments for 3 months, we rarely see significant changes over time; we never saw a positive stat-sig result turn into a negative stat-sig result, for example.
Google reported significant long-term impact of showing more ads, for example in the KDD 2015 paper by Hohnhold et al., "Focusing on the Long-term: It's Good for Users and Business."
We ran the same experiment and reached very different conclusions:
o We saw Sessions/User decline. When that happens, most metrics are invalid, as users are abandoning.
o Long-term experiments with cookie-based user identification have strong selection bias, as they require the same user to visit over a long period. Users erase cookies, lose cookies, etc.


Page 9: Challenge 4: OEC

OEC is the Overall Evaluation Criterion.
Desiderata:
o A single (weighted) metric that determines whether to ship a new feature (or not).
o Measurable using short-term metrics, but predictive of long-term goals (e.g., lifetime value).
What are the properties of good OECs?
o Hard example: support.microsoft.com as a web site. Is more time on the site better or worse?
o Is there an OEC for Word? Excel? Or do these have to be feature-specific?
How do we value things like "standards" or "designs"? "Modern" web design does not use underlines for links (example on the next slide).


Page 10: Underlines (left) or no underlines (right)


Underlines (left) or no underlines (right)


Page 11: Challenge 4: OEC (cont.)

All our key Bing metrics show that underlines help users, including:
o Users click faster (time to successful click is shorter with underlines).
o Users click more on the page (everywhere) with underlines. This includes ads, so we make more money when there are underlines.
o Multi-month experiments show that the deltas hold over time (no strong primacy effect, where after a couple of months users would become as efficient without underlines).
Google's metrics were also negative (per my discussions with Google analysts).
Designers claim the page is "cleaner" and that "underlines are so 1990s."
Google made the change in March 2014. Yandex followed.
Yahoo! underlines ads, but not algorithmic results, thus shifting clicks to monetizable elements (making more money) at the expense of consistency.
Bing made the change in the summer of 2015. Baidu, Ask, and AOL still underline.
Is the industry standard wise, or is it a fashion, like the use of Flash 10 years ago?
How do we trade off key metrics against hard-to-quantify notions like "cleaner" and "industry standard"?


Page 12: Challenge 5: Leaks Due to Shared Resources

Shared resources are a problem for controlled experiments.
They violate Rubin's SUTVA (Stable Unit Treatment Value Assumption), because resources used by the treatment can impact the control.
Example: LRU caches are often used (least-recently-used elements are replaced).
For correctness, caches must be experiment-aware, as the cached elements often depend on experiments (e.g., search output depends on the query term plus backend experiments). The experiments that a user is in are therefore part of the cache key (a sketch follows below).
If control and treatment are of different sizes (e.g., control is 90%, treatment is 10%), then control has a big advantage because its elements are cached more, leading to performance improvements.
We usually run experiments with equal sizes because of this (e.g., 10% vs. 10%), even though control could be 90% and we would reduce variance.
(Side note: with users falling into multiple experiments, "whole-page" caches become heavily fragmented, making them useless. If users are in 15 concurrent experiments (numberlines) with 5 treatments each, the cache is fragmented by a factor of 5^15 ≈ 30 billion, rendering it useless.)
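
A minimal sketch (not from the slides) of what "experiment-aware" means for a cache key; the function and field names are assumptions for illustration.

```python
import hashlib

def cache_key(query, experiment_assignments):
    """Build a cache key that includes the user's experiment assignments.

    experiment_assignments: mapping of numberline/experiment id -> treatment id,
    e.g., {"ranker_exp_42": 3, "ui_exp_17": 1}. Users in different treatments
    must not share cached results, since backend output can differ per treatment.
    """
    # Sort so the same set of assignments always yields the same key.
    assignment_part = ";".join(
        f"{exp}={treatment}"
        for exp, treatment in sorted(experiment_assignments.items())
    )
    raw = f"{query}|{assignment_part}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Fragmentation: with 15 concurrent numberlines of 5 treatments each, a
# whole-page cache splits into 5**15 (about 3.05e10) distinct key spaces.
```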


Page 13: Challenge 5: Leaks Due to Shared Resources (cont.)

Unsolved examples:
o Treatment has a bug and uses disk I/O much more heavily. As the disk is a shared resource, this slows down control (a leak), and the delta in many metrics does not reflect the issue.
o Treatment causes the server to crash. As the system is distributed and reasonably resilient to crashes, requests are routed to other servers and the overall system (usually) survives. You don't see it in the experiment results, i.e., the results look like an A/A test, because crashes take down the machine, so requests from control users die too.
Solutions are not obvious:
o Deploy experiments to a subset of machines and look for impact on those machines? This would work if there were only a few experiments, but there are about 300 concurrent experiments running.
o Deploy to a single data center and then scale out. This is what we do today, but when crashes take several hours to impact the overall system (such as with a memory leak), near-real-time monitoring does not catch it.


Page 14: Challenge 6: Interesting Segments

When an A/B experiment completes, can we provide interesting insights by finding segments of users where the delta was much larger or smaller than the mean?
We should be able to apply machine-learning methods (a simple baseline sketch follows below).
Initial ideas, such as Susan Athey's work, create high-variance labels.
There are issues with multiple hypothesis testing / overfitting.
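
Not from the slides: a naive baseline that scans predefined segments for unusual deltas and applies a Bonferroni correction, to make the multiple-testing concern concrete. The data layout (one row per user in a pandas DataFrame) is an assumption.

```python
from scipy import stats

def scan_segments(df, segment_col, metric_col, variant_col, alpha=0.05):
    """Flag segments whose treatment-control delta is stat-sig after a
    Bonferroni correction for the number of segments scanned.

    df: pandas DataFrame with one row per user and columns for the segment
    (e.g., browser, country), the metric, and the variant label
    ('treatment' or 'control').
    """
    segments = df[segment_col].unique()
    threshold = alpha / len(segments)  # Bonferroni correction
    flagged = []
    for seg in segments:
        sub = df[df[segment_col] == seg]
        t = sub.loc[sub[variant_col] == "treatment", metric_col]
        c = sub.loc[sub[variant_col] == "control", metric_col]
        if len(t) < 2 or len(c) < 2:
            continue
        stat, p = stats.ttest_ind(t, c, equal_var=False)
        if p < threshold:
            flagged.append((seg, t.mean() - c.mean(), p))
    return flagged
```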


Page 15: Challenges not Raised

Experiments in social-network settings (Facebook, LinkedIn) commonly violate SUTVA.
Papers are being published on these; I have very limited experience here.
User identifiers changing (non-logged-in users log in): the user identifier is the key used to determine the treatment assignment.
See Ya Xu's papers in the main KDD conference and the Social Recommender Systems workshop on A/B testing challenges in social networks.
