a/b testing framework design

www.abingo.org. A/B Testing Framework Design Issues, by Patrick McKenzie, 2010. Please use or send to people who'd benefit. (This presentation is meant to be read. It is released under the Creative Commons By Attribution license – feel free to spread it or use it.)

Upload: patrick-mckenzie

Post on 11-Apr-2017



A/B Testing Frameworks

• Why You Should Care

• Core Use Scenarios

• A/B Test Lifecycle

• Design Decisions

• Technical Considerations

• API Considerations

Why You Should Care

There is a paucity of A/B testing frameworks.

"I can probably name a dozen different systems for building high scale applications (distributed storage, message queues, caching layers, search engines, etc), but I can’t name a single AB testing framework other than Google Website Optimizer. That seems like a serious inversion of priorities for most startups."

http://www.tomkleinpeter.com/2009/01/21/where-are-the-ab-testing-frameworks/

Why You Should Care

• A/B testing helps you validate your hypotheses about customers and product.

• A/B testing is drop-dead easy if your tech supports it.

• You won't do it otherwise, because it feels like boring busywork.

"The goal is to have split-testing be a continuous part of our development process, so much so that it is considered a completely routine part of developing a new feature. In fact, I've seen this approach work so well that it would be considered weird and kind of silly for anyone to ship a new feature without subjecting it to a split-test. That's when this approach can pay huge dividends."

Eric Ries, in a blog post

Why You Should Care

• There are only two decent A/B test frameworks for Rails. Both are less than 9 months old.

• There are (to the best of my knowledge) no OSS frameworks for Java, Python, etc.

• You should write one. V1.0 can be done in 10 man-hours in modern MVC frameworks. It will be the best ROI you ever get.

• This presentation hopes to save you time by telling you where the hard decisions are.

Three Use Scenarios

• Customers interacting with site.

• Implementers coding A/B test.

• Somebody interpreting results.

User View of A/B Test

(What Cindy Sees)

User View of A/B Test

(What Bob Sees)

Key Points For Users

• Users get consistent behavior. Cindy always sees her alternative. Bob always sees his.

• A/B test doesn't break usage of site. (Sounds obvious, can be non-trivial. Test for interactions!)

• Ending A/B test doesn't break site.

Did you know that in Google Website Optimizer users can bookmark individual A/B alternatives because they have distinct URLs? And that after the test is over they may 404? Yeah. Don't do that.

What Developers See

• One line to add a test.

• One line to track it.

• No thought required beyond creating alternatives.

What Internal Customers See

• Simple, clear, actionable results.

• Stats 101 not required.

Your marketing team might know math. That doesn't mean they should have to.

A/B Test Lifecycle

• Come up with alternatives.

• Code alternatives.

• Test alternatives.

• Deploy to site.

• Users interact with alternatives.

• Analyze results.

• End test.

When designing your A/B testing framework, keep in mind that you'll be doing all of the above. Eliminate as much friction from each step as possible – this decreases total time through the loop.

Come up with alternatives.

• Not generally a technical problem.

• Inspiration can come from anywhere – a blog post, a passing fancy, customer comments.

• Should never have to say "We can't do that!"

• Strong recommendation: If we pay your salary, you are authorized to test.

Customers do not think in terms of Model/View/Controller interfaces. They just want to know what the app can do. You should be able to A/B test from any point in the app.

Code Alternatives

• Programming is hard, but you have to do it anyway.

• Programming A/B tests is easy – a one-liner and an if statement.

• Testing framework handles all bookkeeping – programmers never care.

• Re-use conversion code. Typical businesses have lots of tests, few defined conversions. No need to reinvent the wheel every single time.

Test Alternatives

• A/B tests are live code. They can have bugs. You should be able to unit test like normal.

• Helpful for developers to have access to quick "switch what test I'm seeing" functionality. Simplest example: manually add a parameter to the URL (&exampleTest=altA). Turn this feature off in production.

• Be careful of test interactions. They are very easy to create once you start testing behavior in addition to display.
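A minimal Python sketch of the URL-parameter switch, under stated assumptions: the function name `forced_alternative` and the `allow_overrides` flag are hypothetical, and only the `exampleTest=altA` parameter comes from the slide.

```python
from urllib.parse import parse_qs, urlparse


def forced_alternative(url, test_name, alternatives, allow_overrides=True):
    """Return the alternative a developer forced via the query string, or None.

    allow_overrides is a hypothetical switch; pass False in production.
    """
    if not allow_overrides:
        return None
    params = parse_qs(urlparse(url).query)
    requested = params.get(test_name, [None])[0]
    # Ignore values that are not real alternatives, so a typo cannot break the page.
    return requested if requested in alternatives else None


url = "http://example.com/page?foo=1&exampleTest=altA"
choice = forced_alternative(url, "exampleTest", ["altA", "altB"])
```

When the function returns None, the framework would fall through to its normal choice for the current identity.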

Deploy to site.

• Avoid pointless work here. "Push code live, test starts automatically" is the ideal.

• Testing framework should handle its own setup the first time a test is called. After that, re-use.

• Note that this decision is going to be made thousands or hundreds of thousands of times, possibly right after you push live: consider the performance implications.

• You can make the code default to the old version and control start/stop of the test via a dashboard. Could be worth it, but it adds complexity.

Users interact with alternatives.

• Happily, this takes very little work for you...

• … except when it creates Heisenbugs.

• In addition to thorough testing, make sure your "What The User Is Seeing" feature (you have one, right?) reflects their A/B tests.

Analyze results.

• The stats behind A/B tests may not be well understood. Impress on your internal customers that the stats are real, measured, and actionable. It doesn't matter if they think it is magic as long as they trust the magic.

• Do significance testing so it isn't magic.

• Doing significance testing is grunt work: let the computer do it.

• Spend the extra time to make the internal dashboard pretty. People trust pretty things.

• A/B tests are not a good place to dig for data. One glance tells you all you need.

End test

• Simple solution: rip code out, test stops.

• The simple solution requires a redeploy. In the event of a bug or a strong test result ("Oh my God what were we thinking!?!") you might want an immediate end button on the dashboard. Be able to specify which alternative to keep.

• Automatic end of test? Probably a misfeature, but easy to implement.

• Ending test should switch all users to winner (or else you get to support old tests until doomsday). However, users have memories.

• Negatively affected users (e.g. you end test in favor of higher price, user planning on buying later saw lower price) may be mad. Not big problem, but be ready.
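One way to get "ending the test switches all users to the winner" without a redeploy is a lookup that runs before the normal choice. This is a Python sketch with hypothetical names (`end_test`, `alternative_for`, the in-memory `_winners` dict); a dashboard button would call something like `end_test`.

```python
import hashlib

_winners = {}  # test name -> alternative everyone sees once the test has ended


def end_test(name, winner):
    """Record the alternative to keep; in practice this would live in a shared store."""
    _winners[name] = winner


def alternative_for(name, identity, alternatives):
    """Return the winner for an ended test, otherwise the user's usual alternative."""
    if name in _winners:
        return _winners[name]
    digest = hashlib.md5(f"{name}:{identity}".encode("utf-8")).hexdigest()
    return alternatives[int(digest, 16) % len(alternatives)]


prices = ["$20", "$25"]
before = alternative_for("price", "cindy", prices)
end_test("price", "$25")
after_cindy = alternative_for("price", "cindy", prices)
after_bob = alternative_for("price", "bob", prices)
```

The application code keeps calling the same function, so the old branches can be ripped out at leisure.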

Design Considerations

• Tracking and managing identity.

• How to choose alternatives by identity.

• Where to store test participation.

• Where to store alternatives.

• Stats is hard, let's go shopping.

• Presenting results.

Tracking Identity

• Cindy is Cindy, Bob is Bob, Cindy should always see Cindy's tests.

• Cindy is not a cookie. Cindy is not a session. Cindy is freaking Cindy. Even when she is on a different computer.

• You already have identity via user authentication. You probably want to punt the identity problem there: have the authentication system inform the framework of the current user's identity.

• Important edge case: new user signup should persist “identity” from anonymous visitor to identifiable user.

Tracking Identity

• Easiest identity is random number thrown into cookie. Associate with user accounts. Restore on login. Bam, done.

• However, you will occasionally have A/B test conversions outside of Cindy's HTTP cycle. (e.g. Purchase notification comes from Paypal, not from Cindy. Cindy calls up to place order.) Think it through – not terribly difficult if you plan for it.
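The "random number in a cookie, associated with the account, restored on login" idea can be sketched in Python as follows. The class, the `ab_identity` cookie name, and the dicts standing in for the cookie jar and the users table are all assumptions for illustration.

```python
import secrets


class IdentityTracker:
    """Keeps one A/B identity per person across browsers (illustrative sketch)."""

    def __init__(self):
        self.identity_by_user = {}  # stand-in for a column on the users table

    def identity_for_visitor(self, cookies):
        """Return the visitor's identity, minting a random one if the cookie is missing."""
        if "ab_identity" not in cookies:
            cookies["ab_identity"] = secrets.token_hex(16)
        return cookies["ab_identity"]

    def on_signup(self, user_id, cookies):
        """Persist the anonymous visitor's identity onto the new account."""
        self.identity_by_user[user_id] = self.identity_for_visitor(cookies)

    def on_login(self, user_id, cookies):
        """Restore the stored identity, overwriting whatever this browser had."""
        stored = self.identity_by_user.get(user_id)
        if stored is None:
            self.on_signup(user_id, cookies)
        else:
            cookies["ab_identity"] = stored
        return cookies["ab_identity"]


tracker = IdentityTracker()
home_cookies = {}
anonymous_id = tracker.identity_for_visitor(home_cookies)
tracker.on_signup("cindy", home_cookies)

work_cookies = {"ab_identity": "some-other-random-value"}
restored_id = tracker.on_login("cindy", work_cookies)
```

Because the identity is stored against the account, a conversion that arrives outside Cindy's HTTP cycle can be credited by looking up `identity_by_user` from whatever user reference the notification carries.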

How To Choose Alternatives

• If you have N alternatives, picking randomly and persisting it by identity works decently.

• Another approach: MD5(identity) % number_of_alts. Saves space (marginally).

• You don't need to save which alternative Cindy is seeing as long as you can reproduce it.
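The MD5(identity) % number_of_alts approach, as a short Python sketch. Mixing the test name into the hash is an addition not on the slide, made so that the same person does not land in the same slot of every test; `choose_alternative` is a hypothetical name.

```python
import hashlib


def choose_alternative(identity, test_name, alternatives):
    """Deterministically map an identity to one alternative, with no storage needed."""
    # Assumption: salting with the test name keeps assignments independent across tests.
    digest = hashlib.md5(f"{test_name}:{identity}".encode("utf-8")).hexdigest()
    return alternatives[int(digest, 16) % len(alternatives)]


alts = ["A", "B", "C"]
first_visit = choose_alternative("cindy", "headline", alts)
second_visit = choose_alternative("cindy", "headline", alts)
```

The catch is that the answer changes if the list of alternatives changes mid-test, which is one reason to persist the choice anyway.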


Where to store test participation

• Cookie/session bad idea: Cindy will log in at work tomorrow. She should see consistent behavior.

• Cache (memcached) possible, but if Cindy is evicted from cache or cache resets, tough for Cindy and tough for you.

• Persistent data store best bet. Will talk about specific data stores later in slides.

Where to store alternatives

• Many approaches. Whatever works for you.

• A/Bingo puts alternatives directly in code. Easiest place, always right in front of developer, no conceptual overhead.

• Vanity puts alternatives in special experiment files. Arguably cleaner code, but you have to context-switch.

• Google Website Optimizer has you define alternatives on a web form. Great for marketing department at insurance company. Don't do this. Greatly limits possibilities, increases integration work, blows testing to heck and back.

Doing Stats

• If possible, call out to dedicated stats modules/libraries to do stats.

• Many types of possible stats for A/B testing. Pick one, stick with it. I use Z-scores because a) I remember them and b) implementation was drop-dead easy.

• Sadly, Ruby lacks many good stats libraries. Oh, to be a Perl programmer...

• This subject worth its own presentation. See Ben Tilly. http://elem.com/~btilly/effective-ab-testing/
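For a sense of how little code a Z-score takes, here is a Python sketch of one common form, the two-proportion z-test with an unpooled standard error. The slides do not say which variant is meant, so treat the formula choice and the function names as assumptions.

```python
import math


def z_score(conversions_a, participants_a, conversions_b, participants_b):
    """Z-score for the difference in conversion rate between alternatives B and A."""
    rate_a = conversions_a / participants_a
    rate_b = conversions_b / participants_b
    # Unpooled standard error of the difference between two proportions.
    std_err = math.sqrt(
        rate_a * (1 - rate_a) / participants_a
        + rate_b * (1 - rate_b) / participants_b
    )
    if std_err == 0:
        raise ValueError("no variation in either alternative; cannot compute z-score")
    return (rate_b - rate_a) / std_err


def two_sided_p_value(z):
    """Probability of a |z| at least this large under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


# Example numbers are invented: 100/1000 conversions for A, 130/1000 for B.
z = z_score(100, 1000, 130, 1000)
p = two_sided_p_value(z)
```

A dedicated stats library is still the better home for this if your language has one.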

Presenting Results

• Text is easy! Graphs not quite.

• Google's confidence bars are sexy... and pretty useless.

• Use simple, human language to describe what confidence intervals and statistical significance mean.

• De-emphasize null results (A > B but not statistically significantly so) but don't hide them. (After all, the fact that "this test was too close to call" tells you something useful.)

Technical Considerations

• Less than 1,000 visitors per hour? Skip these slides.

• A/B testing turns performance assumptions on their head: heavy writes in very bursty fashion ("as soon as the test goes live"), very non-relational data, fairly infrequent reads (~3X writes on my site), extraordinarily infrequent use of summary statistics.

• Practically tailor-made for key/value store, not so much for SQL.

Queries You Have To Answer FAST

• Who is Cindy? (user → identity)

• Is Cindy participating in Test X?

• If so, what alternative has she seen?

• If not, what alternative should she see?

• Record fact that Cindy is participating in Test X.

• Has Cindy converted in Test X?

• Record fact that Cindy converted for Test X.

Queries You Can Answer Leisurely

• How many people have participated in Experiment X?

• How many saw Alternative A?

• Umm, do that stats magic for me.

Query You Will NEVER ASK

• Who saw Alternative A in Experiment X?

Possible Architectures

• Summary statistics (participant counts & conversion counts) in MySQL table with "fairly few" rows. Simple increment statements for updates.

• Participation information (Cindy, Experiment X, Alternative A) in key/value store.

• Or whole thing in key/value store.
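A rough Python sketch of the split described above: per-person participation in a key/value store, and a handful of counters for summary statistics. Plain dicts and a Counter stand in for the real stores, and the key format and function names are invented for illustration.

```python
from collections import Counter

participation = {}   # stand-in for a key/value store: key -> alternative name
converted = set()    # stand-in for "has Cindy converted in Test X?" keys
summary = Counter()  # stand-in for the small table of participant/conversion counts


def _key(test, identity):
    return f"ab:{test}:{identity}"


def record_participation(test, identity, alternative):
    """Store the person's alternative once and bump the participant count once."""
    key = _key(test, identity)
    if key not in participation:
        participation[key] = alternative
        summary[(test, alternative, "participants")] += 1
    return participation[key]


def record_conversion(test, identity):
    """Count a conversion at most once, and only for actual participants."""
    key = _key(test, identity)
    alternative = participation.get(key)
    if alternative is None or key in converted:
        return False
    converted.add(key)
    summary[(test, alternative, "conversions")] += 1
    return True


seen = record_participation("headline", "cindy", "A")
seen_again = record_participation("headline", "cindy", "B")
first_conversion = record_conversion("headline", "cindy")
repeat_conversion = record_conversion("headline", "cindy")
```

Every fast query is a single key lookup, and the leisurely ones only read the counters, which is why nothing here needs to answer "who saw Alternative A?".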

Quick Speed Improvement for SQL

• Give each of your alternatives a unique string ID like MD5(experiment name, alternative name). Calculate that in application code. Index on that column.

• UPDATE alternatives SET participants = participants + 1 WHERE lookup_code = 'CALCULATED IN APPLICATION';

• This avoids having to translate human name in code to ID in table. (Or having to use multi-column index for lookup.)

• Note: I am not a very good guy with DBs, but I am informed this is fairly fast. Test for yourself.
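The lookup-code trick, sketched in Python. To stay self-contained this uses the standard library's sqlite3 with an in-memory database rather than MySQL, and the table layout, the "/" separator inside the hash, and the function names are assumptions; only the UPDATE statement mirrors the slide.

```python
import hashlib
import sqlite3


def lookup_code(experiment, alternative):
    """Compute the row's key in application code, so no name-to-ID query is needed."""
    return hashlib.md5(f"{experiment}/{alternative}".encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE alternatives ("
    " lookup_code TEXT PRIMARY KEY,"  # the primary key doubles as the index
    " participants INTEGER NOT NULL DEFAULT 0,"
    " conversions INTEGER NOT NULL DEFAULT 0)"
)
conn.execute(
    "INSERT INTO alternatives (lookup_code) VALUES (?)",
    (lookup_code("button_color", "red"),),
)


def record_participant(connection, experiment, alternative):
    """Single indexed UPDATE; the database does the increment."""
    connection.execute(
        "UPDATE alternatives SET participants = participants + 1 WHERE lookup_code = ?",
        (lookup_code(experiment, alternative),),
    )
    connection.commit()


record_participant(conn, "button_color", "red")
record_participant(conn, "button_color", "red")
count = conn.execute(
    "SELECT participants FROM alternatives WHERE lookup_code = ?",
    (lookup_code("button_color", "red"),),
).fetchone()[0]
```

Whether this beats a multi-column index on your database is worth measuring rather than assuming.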

Specific Key/Value Store Recommendations

• MySQL with big string columns for key, value: ewwwwww. I mean, ewwwwww.

• Memcached: Acceptable (and fast) but not persistent. Also tends to only go down when the server does. For A/B testing, you might just re-run all in-progress tests if it dies.

• MemcacheDB: Tried it. Has unacceptable performance when BerkeleyDB flushes to disk. (5 seconds+!)

• Redis: Tried it. Not in production yet. My recommendation – very fast. Vanity also uses it.

API Considerations

Only need to expose two methods:

• ab_test(name, alternatives, conversion_name)

• conversion(conversion_name)

Note lack of identity in method calls. Let the framework worry about that.

How you specify alternatives is up to you. An array of strings is easy to understand.
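To make the two-method surface concrete, here is a toy in-memory version in Python. It is a sketch under stated assumptions, not how A/Bingo or Vanity is implemented: the class name, its attributes, and the hash-based choice are invented for illustration. Only the two method signatures come from the slide.

```python
import hashlib
from collections import Counter, defaultdict


class ABFramework:
    """Toy framework: identity is set once per request, never passed to the API."""

    def __init__(self):
        self.identity = None                         # set by the auth layer per request
        self.assignment = {}                         # (identity, test) -> alternative
        self.converted = set()                       # (identity, test) pairs already counted
        self.tests_for_conversion = defaultdict(set) # conversion name -> test names
        self.participants = Counter()                # (test, alternative) -> count
        self.conversions = Counter()                 # (test, alternative) -> count

    def ab_test(self, name, alternatives, conversion_name):
        """Return this identity's alternative; bookkeeping happens as a side effect."""
        self.tests_for_conversion[conversion_name].add(name)
        key = (self.identity, name)
        if key not in self.assignment:
            digest = hashlib.md5(f"{name}:{self.identity}".encode("utf-8")).hexdigest()
            chosen = alternatives[int(digest, 16) % len(alternatives)]
            self.assignment[key] = chosen
            self.participants[(name, chosen)] += 1
        return self.assignment[key]

    def conversion(self, conversion_name):
        """Credit every test listening for this conversion, once per identity."""
        for name in self.tests_for_conversion[conversion_name]:
            key = (self.identity, name)
            if key in self.assignment and key not in self.converted:
                self.converted.add(key)
                self.conversions[(name, self.assignment[key])] += 1


framework = ABFramework()
framework.identity = "cindy"
colors = ["red", "green"]
shown = framework.ab_test("button_color", colors, "signup")
shown_again = framework.ab_test("button_color", colors, "signup")
framework.conversion("signup")
framework.conversion("signup")
```

Keeping the conversion name separate from the test name is what lets many tests share one `conversion("signup")` call.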

Consuming API

ab_test(name, alternatives, conversion_name) returns the chosen alternative, handles all bookkeeping as side effect.

Typically:

if (ab_test(...) == "something") {
  # do something
} else {
  # do something else
}

Fun opportunity for blocks/binding if your language supports that.
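Python has no blocks, but callables give a similar effect. In this sketch `ab_dispatch` is a hypothetical helper, and `ab_test` is a stub that always returns the first alternative so the example runs on its own; a real framework would choose by identity and record participation.

```python
def ab_test(name, alternatives, conversion_name):
    """Stub for illustration only: always picks the first alternative."""
    return alternatives[0]


def ab_dispatch(name, handlers, conversion_name):
    """Run the callable registered for whichever alternative the framework picks."""
    chosen = ab_test(name, list(handlers), conversion_name)
    return handlers[chosen]()


button_text = ab_dispatch(
    "signup_button",
    {
        "plain": lambda: "Sign up",
        "free": lambda: "Sign up free",
    },
    "signup",
)
```

The alternatives and their code then live in one place, with no if/else chain to keep in sync.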

Got Questions?

Great A/B testing resources:

• Eric Ries (startuplessonslearned.com) – heavy on motivation, less on stats/design decisions

• #abtests and @abtests on Twitter. Good community, many ideas for inspiration.

• http://abtests.com – ditto

• http://www.bingocardcreator.com/abingo/resources – links I use when I forget the math.

• http://www.kalzumeus.com – my blog

• [email protected] – I'm always happy to chat about A/B testing, with anybody. Potentially available for consulting.