
Jumpstart Your Experimentation Journey with Proven Optimization Practices We Know & Love

The Conversion Rate Optimization (CRO) industry has matured substantially over the past few years. It is generating results. It is receiving increased budgets. It is getting respect from key stakeholders. In the 2017 State of the Conversion Optimization Industry Report by CXL and Sentient Ascend, 45% of survey respondents reported budget increases over the prior year, and 55% said their program was more effective in 2017 than in 2016. Brooks Bell clients have had similar experiences. Testing remains the most effective and reliable way to determine which marketing strategy – from a specific promotion or targeted message to a unique website element such as a call to action (CTA) – will produce the highest return on investment (ROI).

This white paper addresses how to predetermine your test sample size, how to reach statistical significance and power within your experimentation programs, and how to deepen your understanding of minimum detectable effect (MDE) for optimum test design and results.

The Changing Landscapes of Commerce

Technology and evolving consumer behaviors are transforming everything from the way people evaluate products and services to the way they pay for the things they buy.

Economists predict e-Commerce sales will range from $427 billion to $443 billion in 2017, according to the National Retail Federation (NRF) as reported by Business Insider – a growth rate (8-12%) roughly three times that of the entire retail industry. This upward trend will only continue. In fact, BI Intelligence forecasts consumers will spend $632 billion online by 2020.

Capitalizing on e-Commerce through digital experimentation is essential for continued business growth. After conducting thousands of tests for enterprise brands over more than a decade, Brooks Bell has delivered consistent client results based on statistical methods and empirical data. Successful experimentation programs can improve conversion rates, increase sales and boost annual revenue by enhancing customers’ online experiences – whether they’re using a website, smartphone or other mobile device.

Optimization Tests Are Not All Equal

A few different test methods exist today, including Fixed Time Horizon, Sequential Testing, and Multi-Armed Bandit (auto-optimizing). Each has its own place, but the majority of digital optimization programs rely on Fixed Time Horizon testing. It is generally the most easily understood and the most common method used by Brooks Bell.

Fixed Time Horizon is a hypothesis test, relying on traditional and proven statistical processes to set the right sample size. It provides statistically valid results, based on test design inputs, after reaching a preset sample size of visitors, which allows analysts to make data-driven decisions and strategic recommendations.

Fixed Time Horizon is unique because the sample size is predetermined prior to the test launch. In addition, someone manually stops the test once it reaches that ideal sample size – the size calculated to deliver the desired significance, power and MDE.


Creating the Right Sample Size

The concept of sampling a larger population to determine consumer behavior comes from applied statistics. Statistically, the more results you collect, or the more traffic you sample, the more confident you can be in the experiment’s final results.

Establishing a large enough sample size – the number of visits or visitors exposed to the adjusted experience within the experiment – is necessary for producing valid results. Additionally, by setting the sample size in advance, experimentation teams will better understand the amount of time required to execute tests. If the sample size doesn’t accurately represent your audience, tests lose their value and often end statistically inconclusive.

Though predetermining sample size can sometimes be difficult, certain steps clarify the process. Find out how Brooks Bell analysts determine the sample size in six simple steps.
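To illustrate the arithmetic such a process rests on (a generic sketch of the standard two-proportion calculation, not Brooks Bell’s proprietary six steps; the baseline and MDE values below are hypothetical), the pre-launch sample size can be computed as follows:

```python
# A minimal sketch of the standard two-proportion sample-size calculation,
# assuming a two-sided z-test; inputs below are illustrative, not client data.
from scipy.stats import norm

def sample_size_per_variation(baseline, relative_mde, confidence=0.95, power=0.80):
    """Visitors needed per variation to detect a relative lift over baseline."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = norm.ppf(1 - (1 - confidence) / 2)  # 1.96 at 95% confidence
    z_beta = norm.ppf(power)                      # 0.84 at 80% power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1    # round up to whole visitors

# e.g. a 3% baseline conversion rate and a 5% relative MDE
print(sample_size_per_variation(0.03, 0.05))      # roughly 208,000 visitors per arm
```

Note how small baselines and small MDEs drive the requirement into the hundreds of thousands of visitors per variation, which is exactly why the sample size must be known before launch.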

Identifying a Test Winner with Confidence

Again, the Fixed Time Horizon approach is unique because the sample size is predetermined and testers manually stop the test.

Some tools distribute automatic alerts when they identify a winner. This continual monitoring leads to an inflated false positive rate. In other words, certain tools declare winners too early because the appropriate settings are not in place to end tests based on desired significance. Therefore, these tools cannot properly recommend the optimal user experience for all scenarios.
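To see why continual monitoring inflates the false positive rate, consider a quick simulation (our illustrative sketch, not from the white paper: it runs an A/A test, where no real difference exists, and peeks at the p-value after every 1,000 visitors per arm):

```python
# Simulating the "peeking" problem: in an A/A test no winner exists, yet
# checking significance at every interim point declares one far too often.
# All parameters here are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
p, step, n_checks, trials = 0.03, 1000, 20, 2000

# Cumulative conversion counts for two identical experiences (an A/A test).
a = rng.binomial(step, p, size=(trials, n_checks)).cumsum(axis=1)
b = rng.binomial(step, p, size=(trials, n_checks)).cumsum(axis=1)
n = step * np.arange(1, n_checks + 1)          # visitors per arm at each peek

pooled = (a + b) / (2 * n)
se = np.sqrt(pooled * (1 - pooled) * 2 / n)
z = (b - a) / (n * se)                         # two-proportion z statistic
pvals = 2 * norm.sf(np.abs(z))

print(f"peek at every checkpoint:   {(pvals < 0.05).any(axis=1).mean():.0%} false positives")
print(f"single fixed-horizon read:  {(pvals[:, -1] < 0.05).mean():.0%} false positives")
```

Under these assumptions the peeking strategy “finds” a winner in roughly a quarter or more of the A/A tests, while the single fixed-horizon read stays near the nominal five percent.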

For example, even weeks after tests are live, tools can calculate and provide a confidence level, defined as a percentage of certainty about the results. But realistically, the sample size was probably too low to accurately draw that conclusion. The lower the percentage of confidence, the less testing teams trust the results and the higher the margin of error. A confidence level above 95% is “best practice.” Ideally, experimentation teams want a high confidence level and a low margin of error – the amount of random sampling error in test results. Confidence asserts the likelihood (not certainty) that the result is going to occur when the adjusted experience is integrated site-wide, assuming the test had included the whole population.

Analysts, and others who review the data, also tend to check the numbers too frequently. As a result, they insert biases into their conclusions, erroneously declaring an optimal user experience. Do not assume these problems only happen in low-traffic conversion funnels. Even high-traffic pages produce varying results throughout test cycles. Data fluctuates in the first 24 hours, the first few days and even weeks into a test. The test must run its course to collect enough data to be statistically accurate. Otherwise, an experiment can be a waste of time and money.

Businesses must set measurable testing goals. If the goal is to increase online conversions, you need to establish a baseline and create a key performance indicator (KPI).

Utilizing conversion rate as a performance metric can be difficult because it is affected by every aspect of the user experience – from landing and category pages to each customer touch point. Traffic varies throughout the shopping funnel, so conversion rate should be calculated based on the funnel step being tested. The most common definition is Conversion Rate = Orders / Visits. It varies based on many factors, including retail category, shoppers’ preferred device, or even geographic location. Based on a Q4 2016 benchmark report by Monetate, the worldwide average conversion rate was 2.95%, compared to 3% for the U.S. While these numbers may seem small, any incremental improvement can create a huge ROI by increasing sales and revenue. This industry average rate will most likely continue to climb based on historic data and expected growth trends.
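As a small worked example (the funnel counts below are hypothetical, chosen only to echo the averages above), the same order volume yields very different rates depending on which funnel step you measure from:

```python
# Illustrative funnel counts showing why conversion rate should be computed
# from the funnel step under test, not always from sitewide visits.
funnel = {
    "site visits": 100_000,
    "product page visits": 40_000,
    "cart visits": 8_000,
    "orders": 2_950,
}

sitewide = funnel["orders"] / funnel["site visits"]   # the classic Orders / Visits
cart_step = funnel["orders"] / funnel["cart visits"]  # the rate a cart-page test would use

print(f"sitewide conversion rate:  {sitewide:.2%}")   # 2.95%
print(f"cart-page conversion rate: {cart_step:.2%}")  # 36.88%
```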

Ultimately, any business conducting optimization tests needs to set the right sample size prior to test launch. The correct sample size is essential to collecting enough data for tests to produce statistically valid results, thereby allowing analysts to make realistic predictions and recommendations, upon conclusion of the test, for the greatest ROI.


When it comes to building a successful testing program, one of the immediate challenges is resources. Securing enough support to design, develop, QA and analyze a test can seem impossible, especially when there is little support for, or buy-in to, the essential idea of testing. Communicating the potential value of the process is critical at this stage and, though a bit of a chicken-and-egg paradox, the program must produce tests to support the argument for more testing.

Whether the testing program is completely new or struggling to grow, there are two important factors that must be managed: team and culture.

Making Decisions Based On Data

Enterprise businesses are adopting the concept of data-informed decision-making. They are shifting their organizational cultures to ones where decisions are based on facts and insights. Optimization is a data-driven process.

In fact, optimization’s goal is to continually enhance customers’ experiences on- and offline, fostering such loyalty that they keep coming back. When done correctly, optimization removes the guesswork. Coupled with expert insights, optimization and testing data offers information about what works and what doesn’t, helping identify customer roadblocks and opportunities for revenue growth.


As previously mentioned, all optimization tests are not created equal. In order to achieve statistically reliable results – which in turn help define actionable strategies – tests must be properly designed, conducted and analyzed.

Understanding Confidence

It is considered “best practice” to end a test once it has reached the predetermined sample size, which takes into account desired confidence, desired power, MDE and data on the specific metric being tested (either response rates for binary metrics, or mean and standard deviation for continuous metrics).
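For binary metrics, a common closed form ties those inputs together (this formula is our reference addition, assuming a two-sided two-proportion z-test; the white paper itself does not spell one out):

$$
n \;=\; \frac{\left( z_{1-\alpha/2}\,\sqrt{2\,\bar{p}\,(1-\bar{p})} \;+\; z_{1-\beta}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^{2}}{(p_2 - p_1)^{2}}
$$

where \(p_1\) is the baseline response rate, \(p_2 = p_1(1 + \mathrm{MDE})\), \(\bar{p}\) is their average, \(\alpha = 1 - \text{confidence}\), \(1 - \beta\) is the power, and \(n\) is the required sample size per variation.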

Statistical confidence facilitates accurate inference for your population of customers and/or prospects. It is the likelihood that the difference in conversion between a given test variation and the baseline (or control) is not random and is not caused by chance. In statistics, confidence is a way of mathematically demonstrating that a statistic is reliable.

The confidence level targeted during test design reflects your risk tolerance and controls for Type I errors, or false positives. The higher the confidence level, the less likely you are to see unexpected results when implementing a recommended change. If testers are 95 percent confident the increase they observed is, in fact, an increase, there is still a five percent chance of a false positive result, in which the lift is actually not positive. At 80 percent confidence, there is a 20 percent probability of a winner being a false positive.

There is a tradeoff of speed for certainty. When designing a test, setting a high target for your desired confidence level means you are seeking a low false positive rate. In other words, you want to run a precise test, which requires more visitors. On the other end of the spectrum, lower confidence levels allow you to test more quickly – but will result in more false positives. The confidence level at which you test should be determined by the amount of traffic the page receives, as well as the business implications surrounding the risk threshold for that particular test.
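The tradeoff is easy to quantify. A short sketch (using the statsmodels power utilities with illustrative inputs of a 3% baseline and a 5% relative MDE, not figures from this paper) shows how the required traffic grows with the confidence target:

```python
# How required sample size scales with the confidence target, holding
# power and MDE fixed. Baseline and MDE values are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, relative_mde = 0.03, 0.05
effect = proportion_effectsize(baseline, baseline * (1 + relative_mde))

for confidence in (0.80, 0.90, 0.95, 0.99):
    n = NormalIndPower().solve_power(effect_size=effect,
                                     alpha=1 - confidence,
                                     power=0.80,
                                     alternative="two-sided")
    print(f"{confidence:.0%} confidence -> {n:,.0f} visitors per variation")
```

Moving from 80% to 99% confidence roughly two-and-a-half-times the visitors required, which is the “speed for certainty” tradeoff in concrete terms.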

Understanding Power

Optimization professionals discuss power less frequently as a design variable for tests, but it is equally as important as confidence. Setting an appropriate power level plays an important role in giving tests the requisite chance to reach a desired confidence level.

Power controls Type II errors, also known as false negatives. In simple terms, power is the probability that you will find a significant difference between a challenger and control, should that difference actually exist. As power increases, so does the probability of rejecting the null hypothesis, or detecting a statistically significant difference.

For standard A/B tests, the industry standard power level is 80 percent. This is less than the recommended confidence level of 95 percent because, for most programs, there is less risk in failing to find a significant difference than in advocating for a change when no improvement actually exists.

Similar to increasing confidence, increasing power escalates test precision, and the test will require more visitors to reach that level of precision. If your test is underpowered, you can technically end tests more quickly, but you may miss statistically significant differences that actually exist. If your test is appropriately powered, you can detect most differences that exist at your desired confidence level.

Likewise, for most projects, weighing the available options against the desired business goals helps testing teams find the right balance among confidence, power and test duration.
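To make the stakes concrete, here is a rough simulation (our illustrative sketch, with assumed values of a 3% baseline and a real 5% relative lift) of how often an underpowered test misses a lift that an appropriately powered one catches:

```python
# What 80% power buys you: with a real 5% relative lift present, an
# underpowered test misses it most of the time. All inputs are assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
p_control, relative_lift, trials = 0.03, 0.05, 2000
p_variant = p_control * (1 + relative_lift)

for n in (70_000, 208_000):      # underpowered vs. roughly 80%-powered (per arm)
    c = rng.binomial(n, p_control, trials)
    v = rng.binomial(n, p_variant, trials)
    pooled = (c + v) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (v - c) / (n * se)       # two-proportion z statistic
    detected = (2 * norm.sf(np.abs(z)) < 0.05).mean()
    print(f"n = {n:,} per arm -> lift detected in {detected:.0%} of simulated tests")
```

Under these assumptions the smaller sample detects the real lift in only about a third of the simulated tests, while the appropriately powered sample detects it about 80 percent of the time.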

Understanding Minimum Detectable Effect (MDE)

If you are involved with an online optimization program, you likely know that calculating a test duration prior to launch is a necessary step in reaching a desired level of statistical confidence, as mentioned above. It is also likely you know that defining MDE – a number that represents the relative minimum improvement you seek to detect over the control – is critical to determining accurate test duration. MDE is also known as MDL, and the two terms are used interchangeably in testing.

If you have ever been confused by this particular input in test design, you are not alone. It can be confusing.

Theoretically, an A/B test could last forever without reaching a desired level of statistical significance, or confidence. If the test has little to no impact compared to the control, it could simply run until someone becomes frustrated and ends it. Similarly, individuals can become impatient with test progression if a website, or specific set of pages, is not attracting enough visitors to collect sufficient data. Creating a well-established test design from the start minimizes, if not removes, such frustrations because it relies on data to determine test duration.

Here is the simplest way to think of MDE: it is essentially the smallest possible change in your primary KPI that your experiment can detect with statistical certainty.
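Put in code (a sketch under assumed inputs of a 3% baseline, 95% confidence and 80% power; the traffic tiers are hypothetical), the relationship runs in reverse of the sample-size calculation: fix the traffic you can afford and solve for the smallest detectable lift:

```python
# Solving for MDE given traffic: the smallest relative lift a test of this
# size can reliably detect. Baseline and traffic figures are assumptions.
import numpy as np
from statsmodels.stats.power import NormalIndPower

baseline = 0.03

for n in (50_000, 200_000, 1_000_000):     # visitors per variation
    h = NormalIndPower().solve_power(nobs1=n, alpha=0.05, power=0.80,
                                     alternative="two-sided")
    # Invert Cohen's h to recover the detectable variant rate, then the lift.
    p2 = np.sin(np.arcsin(np.sqrt(baseline)) + h / 2) ** 2
    print(f"n = {n:>9,} per arm -> MDE of about {(p2 - baseline) / baseline:.1%}")
```

The pattern is the one described above: the smaller the lift you need to detect, the more traffic the test must collect before it can be called.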

Ensuring MDE Success

There are often misconceptions associated with MDE because it is not necessarily a number that can be easily determined without historical testing data or upfront analysis. For clarification, keep these facts top of mind:

• MDE is not the lift, or conversion increase, that a testing professional wants to see.
• MDE is not a guess.
• MDE is an anticipated lift over the control that can be measured with a degree of certainty.
• MDE is the smallest possible change that would be worth investing the time and money to design the test and implement the change permanently on the site.


The main idea behind MDE is that you can detect a lift down to the specified level at the confidence and significance thresholds you set. For example, if you set a one percent MDE, you will be able to detect a one, three, or even 10 percent change in the KPI. It is also important to remember that lower MDEs require increased traffic to ensure a smaller lift is valid.

If your organization is establishing a new experimentation program and historical data is not available, Brooks Bell recommends selecting an MDE from the table below. Based on historical analysis, MDEs vary by page type due to differences in the ability to impact customer behavior at different stages of the online shopping journey.

A one percent increase in Revenue Per Visit (RPV) from the cart page, for example, is going to be much more valuable than a one percent increase in RPV from the category page, because all revenue goes through the cart, whereas not all revenue passes through a category page.

Brooks Bell conducts thousands of online experiments for enterprise-level clients, and collects data from each one. The table below presents website MDE averages by page type.

Page Type        Suggested MDE
Sitewide         2.0% - 3.0%
Homepage         1.5% - 2.5%
Category Page    2.0% - 3.0%
Product Page     1.5% - 2.5%
Cart             1.0% - 2.0%
Checkout         1.0% - 2.0%

It is considered “best practice” to use the lowest possible MDE when designing a test, while still running it within an acceptable amount of time, as previously mentioned. By selecting smaller, more conservative MDE levels, testing programs optimize the certainty of testing outcomes, learnings, and ultimately ROI.
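“An acceptable amount of time” can be estimated up front. A back-of-the-envelope sketch (assumed traffic numbers, with traffic split evenly between control and one variation) converts the required sample into calendar days:

```python
# Rough test-duration math: total required visitors divided by the daily
# traffic reaching the tested page. All figures below are illustrative.
def test_duration_days(n_per_variation, daily_page_visitors, n_variations=2):
    """Days needed to expose the full predetermined sample."""
    return n_per_variation * n_variations / daily_page_visitors

# e.g. ~208,000 visitors per arm (3% baseline, 5% relative MDE, 95%/80%)
for daily in (10_000, 30_000, 100_000):
    days = test_duration_days(208_000, daily)
    print(f"{daily:>7,} visitors/day -> about {days:.0f} days")
```

If the projected duration is unacceptably long, the practical levers are raising the MDE, lowering the confidence or power targets, or testing a higher-traffic page.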

Jumpstart Your Experimentation Journey

Experimentation programs, including optimization and testing, are transforming enterprise companies and customer experiences. No matter where you are on your experimentation journey, here are five questions to consider:

• Are you experiencing the full benefits from your optimization and testing data?
• Are you stuck on any of the areas above?
• Is your experimentation program strategic and sustainable?
• Are your tests producing the results you want?
• Do you have buy-in across all functions of your organization and all key stakeholders?

If your company is not on the cutting edge of optimization and testing, it is only a matter of time before it falls behind. We encourage you to start – or continue – the conversation within your organization to take your experimentation program to the next level. Once implemented, these proven optimization practices will help.


Brooks Bell is the leader in scaling world-class A/B testing programs.

Learn more! Visit BrooksBell.com

About Brooks Bell

Brooks Bell is the largest independent experimentation consultancy in the country, providing a team of experimentation experts who offer flexible solutions to enterprise-level clients. Together, we achieve business goals in user research, optimization, and personalization to create unforgettable, omni-channel customer journeys. For more information, please visit www.brooksbell.com.

About The Authors

Taylor Wilson is a Senior Optimization Analyst with fluency in all major testing tools and extensive experience in data analysis, data visualization and testing ideology. He leads Brooks Bell’s analytics efforts for such top Fortune 500 brands as Barnes & Noble, Toys”R”Us, Nickelodeon and OppenheimerFunds.

Shana Braun is an Optimization Analyst with Brooks Bell specializing in website analytics and experimentation strategy. She provides strategic optimization and testing consulting services, as well as experienced online insights, to enterprise-level clients across diverse industries.