The Anatomy of an A/B Test - JSConf Colombia Workshop
A/B testing workshop “In God we trust, all others must bring data”
JSConf Colombia Workshop 2015
@shiota github.com/eshiota
slideshare.net/eshiota eshiota.com
A/B
A/B tests measure how a new idea (version B/variant/test) performs against an existing implementation (version A/base/control).
Buy now versus Buy now

(a coin flip sends 50% of users to each variant)
When the user sees or is affected by the idea, they are tracked and become part of the test.
Buy now
Buy now
track(my_experiment)
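The split and the tracking call can be sketched as a deterministic 50/50 assignment: hashing the user id together with the experiment name keeps each user in the same variant on every visit. This `track_experiment(experiment, user_id)` helper is a hypothetical illustration, not the deck's actual implementation.

```ruby
require "digest"

# Minimal sketch of a deterministic 50/50 split. Hashing the experiment
# name together with the user id means the same user always lands in the
# same variant, without storing any state.
def track_experiment(experiment, user_id)
  digest = Digest::MD5.hexdigest("#{experiment}:#{user_id}")
  digest.to_i(16).even? ? "a" : "b"
end
```

In a real system the assignment would also be recorded, so the user counts as "part of the test" from that moment on.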
Data about the website is generated as users browse through pages and do their tasks.
product added to cart
number of products added
purchase finished
average price per purchase
number of products seen
user has logged in
used guest checkout
customer service calls
…
When there’s enough information to make a decision, you can either stop the test (keeping version A) or choose version B, directing all traffic to it.
Buy now (A) versus Buy now (B)

Duration: 14 days
Visitors: 45,140 (22,570 per variant)

                      A             B
Number of purchases:  339 (1.5%)    407 (1.8%)   (20% up)
Average price:        144,500 COP   147,390 COP  (2% up)
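Whether an uplift like this is real or just noise can be checked with a standard two-proportion z-test. A minimal sketch, not tied to any particular testing tool, applied to the purchase counts from the slide:

```ruby
# Two-proportion z-test for conversion uplift.
# conv_* are conversion counts, n_* are visitors per variant.
def z_score(conv_a, n_a, conv_b, n_b)
  p_a = conv_a.to_f / n_a
  p_b = conv_b.to_f / n_b
  pooled = (conv_a + conv_b).to_f / (n_a + n_b)
  standard_error = Math.sqrt(pooled * (1 - pooled) * (1.0 / n_a + 1.0 / n_b))
  (p_b - p_a) / standard_error
end

# The slide's numbers: 339 vs 407 purchases, 22,570 visitors each.
z = z_score(339, 22_570, 407, 22_570)
# |z| above 1.96 corresponds to p < 0.05 (two-tailed)
```

For these numbers z comes out around 2.5, above the conventional 1.96 cutoff, so the purchase uplift is statistically significant at the 5% level.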
Buy now versus Buy now (coin flip: 50% / 50%)

B wins: Buy now (B) now receives 100% of traffic
“But my design is obviously more beautiful and intuitive than what we have now! Why should I run an A/B test?” — the majority of designers
Quiz time! (prizes included)
A: Raise your left hand B: Raise your right hand
Neutral: Don’t raise your hands

Which performed better?
Reduced bounce rate by 1.7%

Which performed better?
Increased CTR by 203%

Which performed better?
43.4% more purchases

Which performed better?
Both were statistically equivalent
Intuition vs. Historical Analysis vs. Experimentation
We have a 2/3 chance of being wrong when trusting our intuition.
People behave differently each season/month/day of the week.
Different cultures lead to different patterns of usage.
Data analysis alone provides correlation but not causation.
Running your A/B test (in 5 simple steps)
Step 1: Hypothesis
Analyse all possible inputs to come up with a hypothesis to work on.
• Usability research
• Benchmarking
• Surveys
• Data mining
• Previous experiments
Hypothesis:
“If users from South American countries relate more to the website, they will book more.”
Step 2: Idea
Idea:
“If we add the country’s flag next to the website’s logo, users will relate more to the brand.”
Step 3: Setup
• Who will participate?
• What is the primary metric?
• Any secondary impacts?
• How will it be implemented?
• Users from Argentina, Bolivia, Brazil, Chile, Colombia, Ecuador, Guyana, Paraguay, Peru, Suriname, Uruguay and Venezuela, on all platforms
• Conversion (net bookings) uplift is expected
• We expect more returning customers
<h1 class="main-header__logo logo">
  <% if user.is_from_south_america && track_experiment(:header_flag_for_south_america) == "b" %>
    <span class="main-header__logo__country-flag">
      <%= user.country %>
    </span>
  <% end %>
  <%= image_tag "logo.png" %>
</h1>
Step 4: Monitoring
Keep checking the metrics to see if anything’s terribly wrong.
Avoid checking too often; let your test accumulate enough users and enough runtime.
Step 5: Data, decisions, and next steps
When you reach the expected runtime, number of visitors, or effect size, look at the data and make a decision.
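The "expected number of visitors" should be fixed before the test starts. One common rule of thumb for 80% power at 5% significance is n ≈ 16 · p(1−p) / δ² per variant; the helper name below is illustrative:

```ruby
# Rough pre-test sample size estimate per variant (80% power, 5%
# significance), using the common n ≈ 16 * p(1-p) / delta^2 rule
# of thumb. Decide this number before the test starts, not during.
def sample_size_per_variant(baseline_rate, min_detectable_diff)
  (16 * baseline_rate * (1 - baseline_rate) / min_detectable_diff**2).ceil
end

# e.g. baseline 1.5% conversion, detecting a 0.3 percentage point uplift:
sample_size_per_variant(0.015, 0.003)
```

For a low baseline rate like 1.5%, detecting a small absolute uplift needs tens of thousands of visitors per variant, which is why small-traffic sites struggle to run meaningful tests.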
[Screenshot: Optimizely dashboard]
• How were the primary and secondary metrics impacted?
• What were the results isolated by each country?
• What were the results isolated by each language?
• Did any particular platform (desktop, mobile devices, tablets) perform better?
• Was the impact on returning customers any higher than first time visitors?
Based on the gathered data, plan for next steps.
• Should we add copy next to the flag?
• Should we add a tooltip to the flag?
• Should we increase/decrease the flag size?
• Should we restrict it to desktop users only?
• Should we try this for a single country, or other countries?
What can you test?
(almost) Everything.
You can test a small design change.
You can test large design changes.
You can test different copy.
“Submit” versus “Book now”
You can test technical improvements and measure page load time, repaints/reflows, and conversion impact.
jQuery 1.11.3 versus jQuery 2.1.3
You can even test back-end optimisations and measure page load time, rendering time, CPU and memory usage, etc.
if track_experiment(:my_optimized_query)
  @users = my_optimized_query
else
  @users = do_the_normal_thing
end
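To get the load-time numbers for such a back-end test, either branch can be wrapped in a timer; a sketch using Ruby's standard Benchmark module, where the two query methods and the 50/50 `track_experiment` helper are stand-in stubs, not the deck's real code:

```ruby
require "benchmark"

# Stand-in stubs for the two code paths under comparison.
def my_optimized_query
  (1..100).to_a
end

def do_the_normal_thing
  (1..100).map { |i| i }
end

# Hypothetical 50/50 assignment, standing in for the real helper.
def track_experiment(_name)
  rand < 0.5
end

users = nil
elapsed = Benchmark.realtime do
  users = track_experiment(:my_optimized_query) ? my_optimized_query : do_the_normal_thing
end
# `elapsed` (seconds) can then be reported per variant alongside conversion.
```

Recording the elapsed time per variant lets you compare performance distributions, not just conversion, when deciding which branch wins.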
Live coding (I hope that works.)
Find the code at:
https://github.com/eshiota/ab_workshop
Additional links:
https://www.optimizely.com/ https://github.com/splitrb/split/
http://whichtestwon.com http://unbounce.com/
http://blog.booking.com/hamburger-menu.html http://blog.booking.com/concept-dne-execution.html
Gracias! (Thank you!)