The Anatomy of an A/B Test - JSConf Colombia Workshop
A/B testing workshop “In God we trust, all others must bring data”
JSConf Colombia Workshop 2015
@shiota github.com/eshiota
slideshare.net/eshiota eshiota.com
A/B
A/B tests measure how a new idea (version B/variant/test) performs against an existing implementation (version A/base/control).
Buy now versus Buy now

(a coin flip sends 50% of users to each variant)
When the user sees or is affected by the idea, they are tracked and become part of the test.
Buy now
Buy now
track(my_experiment)
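The split and the tracking call can be sketched as a deterministic 50/50 assignment: hashing the user id together with the experiment name keeps each user in the same variant on every visit. This `track_experiment(experiment, user_id)` helper is a hypothetical illustration, not the deck's actual implementation.

```ruby
require "digest"

# Minimal sketch of a deterministic 50/50 split. Hashing the experiment
# name together with the user id means the same user always lands in the
# same variant, without storing any state.
def track_experiment(experiment, user_id)
  digest = Digest::MD5.hexdigest("#{experiment}:#{user_id}")
  digest.to_i(16).even? ? "a" : "b"
end
```

In a real system the assignment would also be recorded, so the user counts as "part of the test" from that moment on.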
Data about the website is generated as users browse through pages and do their tasks.
product added to cart
number of products added
purchase finished
average price per purchase
number of products seen
user has logged in
used guest checkout
customer service calls
…
When there’s enough information to make a decision, you can either stop the test (keeping version A) or choose version B, directing all traffic to it.
Buy now (A) versus Buy now (B)

Duration: 14 days
Visitors: 45,140 (22,570 per variant)

                      A             B
Number of purchases:  339 (1.5%)    407 (1.8%)   (20% up)
Average price:        144,500 COP   147,390 COP  (2% up)
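Whether an uplift like this is real or just noise can be checked with a standard two-proportion z-test. A minimal sketch, not tied to any particular testing tool, applied to the purchase counts from the slide:

```ruby
# Two-proportion z-test for conversion uplift.
# conv_* are conversion counts, n_* are visitors per variant.
def z_score(conv_a, n_a, conv_b, n_b)
  p_a = conv_a.to_f / n_a
  p_b = conv_b.to_f / n_b
  pooled = (conv_a + conv_b).to_f / (n_a + n_b)
  standard_error = Math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
  (p_b - p_a) / standard_error
end

# The slide's numbers: 339 vs 407 purchases, 22,570 visitors each.
z = z_score(339, 22_570, 407, 22_570)
# |z| above 1.96 corresponds to p < 0.05 (two-tailed)
```

For these numbers z comes out around 2.5, above the conventional 1.96 cutoff, so the purchase uplift is statistically significant at the 5% level.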
Buy now versus Buy now (coin flip: 50% / 50%)

B wins: Buy now (B) now receives 100% of traffic
“But my design is obviously more beautiful and intuitive than what we have now! Why should I run an A/B test?” — the majority of designers
Quiz time! (prizes included)
A: Raise your left hand B: Raise your right hand
Neutral: Don’t raise your hands

Which performed better?
Reduced bounce rate by 1.7%

Which performed better?
Increased CTR by 203%

Which performed better?
43.4% more purchases

Which performed better?
Both were statistically equivalent
Intuition vs. Historical Analysis vs. Experimentation
We have a 2/3 chance of being wrong when trusting our intuition.
People behave differently each season/month/day of the week.
Different cultures lead to different patterns of usage.
Data analysis alone provides correlation but not causation.
Running your A/B test (in 5 simple steps)
Step 1: Hypothesis
Analyse all possible inputs to come up with a hypothesis to work on.
• Usability research
• Benchmarking
• Surveys
• Data mining
• Previous experiments
Hypothesis:
“If users from South American countries relate more to the website, they will book more.”
Step 2: Idea
Idea:
“If we add the country’s flag next to the website’s logo, users will relate more to the brand.”
Step 3: Setup
• Who will participate?
• What is the primary metric?
• Any secondary impacts?
• How will it be implemented?
• Users from Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Guyana, Paraguay, Peru, Suriname, Uruguay and Venezuela, on all platforms
• Conversion (net bookings) uplift is expected
• We expect more returning customers
<h1 class="main-header__logo logo">
  <% if user.is_from_south_america && track_experiment(:header_flag_for_south_america) == "b" %>
    <span class="main-header__logo__country-flag">
      <%= user.country %>
    </span>
  <% end %>
  <%= image_tag "logo.png" %>
</h1>
Step 4: Monitoring
Keep checking the metrics to see if anything’s terribly wrong.
Avoid checking too often; let your test accumulate enough users and enough runtime.
Step 5: Data, decisions, and next steps
When you reach the expected runtime, number of visitors, or effect size, look at the data and make a decision.
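The "expected number of visitors" should be fixed before the test starts. One common rule of thumb for 80% power at 5% significance is n ≈ 16 · p(1−p) / δ² per variant; the helper name below is illustrative:

```ruby
# Rough pre-test sample size estimate per variant (80% power, 5%
# significance), using the common n ≈ 16 * p(1-p) / delta^2 rule
# of thumb. Decide this number before the test starts, not during.
def sample_size_per_variant(baseline_rate, min_detectable_diff)
  (16 * baseline_rate * (1 - baseline_rate) / min_detectable_diff**2).ceil
end

# e.g. baseline 1.5% conversion, detecting a 0.3 percentage point uplift:
sample_size_per_variant(0.015, 0.003)
```

For a low baseline rate like 1.5%, detecting a small absolute uplift needs tens of thousands of visitors per variant, which is why small-traffic sites struggle to run meaningful tests.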
[Screenshot: Optimizely dashboard]
• How were the primary and secondary metrics impacted?
• What were the results isolated by each country?
• What were the results isolated by each language?
• Did any particular platform (desktop, mobile devices, tablets) perform better?
• Was the impact on returning customers any higher than first time visitors?
Based on the gathered data, plan for next steps.
• Should we add copy next to the flag?
• Should we add a tooltip to the flag?
• Should we increase/decrease the flag size?
• Should we restrict it to desktop users only?
• Should we try this for a single country, or other countries?
What can you test?
(almost) Everything.
You can test a small design change.
You can test large design changes.
You can test different copy.
“Submit” versus “Book now”
You can test technical improvements and measure page load time, repaints/reflows, and conversion impact.
jQuery 1.11.3 versus jQuery 2.1.3
You can even test back-end optimisations and measure page load time, rendering time, CPU and memory usage, etc.
if track_experiment(:my_optimized_query)
  @users = my_optimized_query
else
  @users = do_the_normal_thing
end
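To get the load-time numbers for such a back-end test, either branch can be wrapped in a timer; a sketch using Ruby's standard Benchmark module, where the two query methods and the 50/50 `track_experiment` helper are stand-in stubs, not the deck's real code:

```ruby
require "benchmark"

# Stand-in stubs for the two code paths under comparison.
def my_optimized_query
  (1..100).to_a
end

def do_the_normal_thing
  (1..100).map { |i| i }
end

# Hypothetical 50/50 assignment, standing in for the real helper.
def track_experiment(_name)
  rand < 0.5
end

users = nil
elapsed = Benchmark.realtime do
  users = track_experiment(:my_optimized_query) ? my_optimized_query : do_the_normal_thing
end
# `elapsed` (seconds) can then be reported per variant alongside conversion.
```

Recording the elapsed time per variant lets you compare performance distributions, not just conversion, when deciding which branch wins.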
Live coding (I hope that works.)
Find the code at:
https://github.com/eshiota/ab_workshop
Additional links:
https://www.optimizely.com/ https://github.com/splitrb/split/
http://whichtestwon.com http://unbounce.com/
http://blog.booking.com/hamburger-menu.html http://blog.booking.com/concept-dne-execution.html
Gracias! (Thank you!)