Jumpstart Your Experimentation Journey with Proven
Optimization Practices We Know & Love
The Conversion Rate Optimization (CRO) industry
has matured substantially over the past few years.
It is generating results. It is receiving increased
budgets. It is getting respect from key stakeholders.
In the 2017 State of the Conversion Optimization
Industry Report by CXL and Sentient Ascend, 45%
of survey respondents reported budget increases
over the previous year, and 55% said their program was
more effective in 2017 than in 2016. Brooks Bell clients
have had similar experiences. Testing remains the
most effective and reliable way to determine which
marketing strategy – from a specific promotion or
targeted message to a unique website element such
as a call to action (CTA) – will produce the highest
return on investment (ROI).
This white paper addresses how to predetermine
your test sample size, how to reach statistical
significance and power within your experimentation
programs, and how to expand your understanding
about minimum detectable effect (MDE)
for optimum test design and results in your
experimentation programs.
The Changing Landscapes of Commerce
Technology and evolving consumer behaviors are
transforming everything from the way people
evaluate products and services to the way they pay
for the things they buy.
Economists predict e-Commerce sales will range
from $427 billion to $443 billion in 2017, according to the National
Retail Federation (NRF) as reported by Business Insider, a
growth rate (8-12%) roughly three times that of the entire retail
industry. This upward trend will only continue. In fact, Business
Intelligence forecasts consumers will spend $632 billion online
by 2020.
Capitalizing on e-Commerce through digital experimentation
is essential for continued business growth. After conducting
thousands of tests for enterprise brands, over more than a
decade, Brooks Bell has delivered consistent client results –
based on statistical methods and empirical data. Successful
experimentation programs can improve conversion rates,
increase sales and boost annual revenue by enhancing
customers’ online experiences – whether they’re using a
website, smartphone, or other mobile device.
Optimization Tests Are Not All Equal
A few different test methods exist today, including Fixed Time
Horizon, Sequential Testing, and Multi-Armed Bandit (auto-
optimizing). Each has its own place, but the majority of digital
optimization programs rely on Fixed Time Horizon testing. This
is generally the most easily understood and the most common
method used by Brooks Bell.
Fixed Time Horizon is a hypothesis test, relying on traditional
and proven statistical processes to set the right sample size.
It provides statistically valid results, based on test design
inputs, after reaching a preset sample size of visitors, which
allows analysts to make data-driven decisions and strategic
recommendations.
Fixed Time Horizon is unique because the sample size is
predetermined prior to the test launch. In addition, someone
manually stops the test when the ideal sample size reaches the
desired significance, power and MDE.
Creating the Right Sample Size
The concept of sampling a larger population to
determine consumer behavior comes from applied
statistics. Statistically, the more results you collect,
or the more traffic that you sample, the more
confident you can be in the experiment’s final
results.
Establishing a large enough sample size – the
number of visits or visitors exposed to the adjusted
experience within the experiment – is necessary for
producing valid results. Additionally, by presetting
the sample size in advance, experimentation
teams will better understand the amount of time
required to execute tests. If the sample size doesn’t
accurately represent your audience, tests lose their
value and often end with statistically inconclusive
results.
Though predetermining sample size can sometimes
be difficult, there are certain steps to clarify
the process. Find out how Brooks Bell analysts
determine the sample size in six simple steps.
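The six steps themselves are linked rather than enumerated here, but the calculation they support can be sketched in Python. This is a minimal sketch assuming a standard two-proportion, normal-approximation power calculation; the function name and example inputs are illustrative, not Brooks Bell's actual method:

```python
from statistics import NormalDist

def required_sample_size(baseline_cr, mde_relative, alpha=0.05, power=0.80):
    """Visitors needed per variation for a two-sided two-proportion test."""
    p1 = baseline_cr
    p2 = baseline_cr * (1 + mde_relative)          # expected variant rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2) + 1

# e.g. 3% baseline conversion rate, 10% relative MDE, defaults of
# 95% confidence and 80% power -> roughly 53,000 visitors per variation
n = required_sample_size(0.03, 0.10)
```

Note that every design input named in this paper — baseline rate, MDE, confidence, and power — appears in the formula, which is why changing any one of them changes the required traffic.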
Identifying a Test Winner with Confidence
Again, the Fixed Time Horizon approach is unique
because the sample size is predetermined and it
allows testers to manually stop the test.
Some tools distribute automatic alerts when they
identify a winner. This continual monitoring leads
to an inflated false positive rate. In other words,
certain tools predict winners too early because the
appropriate settings are not in place to end tests
based on desired significance. Therefore, these
tools cannot properly recommend the optimal user
experience for all scenarios.
For example, even weeks after tests are live, tools
can calculate and provide a confidence level, defined
as a percentage of certainty about the results. But
realistically, the sample size was probably too low
to accurately draw that conclusion. The lower the
percentage of confidence, the less testing teams
trust the results and the higher the margin of error.
A confidence level above 95% is “best practice.”
Ideally, experimentation teams want a high
confidence level and a low margin of error — the
amount of random sampling errors in test results.
Confidence asserts the likelihood (not certainty)
that the result is going to occur when the adjusted
experience is integrated site wide, assuming the test
included the whole population.
Analysts, and others who review the data, also tend to check
the numbers too frequently. As a result, they introduce bias
into their conclusions, erroneously declaring an optimal
user experience. Do not assume these problems only occur
for low-traffic conversion funnels. Even high-traffic pages
produce varying results throughout test cycles. Data fluctuates
in the first 24 hours, the first few days and even weeks into
a test. The test must run its course to collect enough data to
be statistically accurate. Otherwise, an experiment could be a
waste of time and money.
Businesses must set measurable testing goals. If the goal is to
increase online conversions, you need to establish a baseline
and create a key performance indicator (KPI).
Utilizing conversion rate as a performance metric can be
difficult because it is influenced by every aspect of the user
experience, from landing and category pages to each customer
touch point. Traffic varies throughout the shopping funnel, so
conversion rate should be calculated for the specific funnel step
being tested. By the most common definition, Conversion Rate =
Orders / Visits. It varies based on many factors, including retail
category, shoppers’ preferred device, or even geographic
location. Based on a Q4 2016 benchmark report by Monetate,
the worldwide average conversion rate was 2.95%, compared
to 3% for the U.S. While these numbers may seem small, any
incremental improvement can create a huge ROI by increasing
sales and revenue. This industry average rate will most likely
continue to climb based on historic data and expected growth
trends.
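As a small illustration of the formula above (Conversion Rate = Orders / Visits), with hypothetical order and visit counts for a single funnel step:

```python
def conversion_rate(orders, visits):
    """Conversion Rate = Orders / Visits (the definition used above)."""
    return orders / visits

# e.g. a funnel step with 1,475 orders from 50,000 visits
cr = conversion_rate(1475, 50_000)  # 0.0295, i.e. 2.95%
```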
Ultimately, any business conducting optimization tests needs
to set the right sample size prior to test launch. The correct
sample size is essential to collecting enough data for tests to
produce statistically valid results, thereby allowing analysts to
make realistic predictions and suggestions, upon conclusion of
the test, for the greatest ROI.
When it comes to building a successful testing program, one of
the immediate challenges is that of resources. Securing enough
support to design, develop, QA, and analyze a test can seem
impossible, especially when there is little support for or buy-in
to the essential idea of testing. Communicating the potential
value of the process is critical at this stage and, though a bit
of a chicken-egg paradox, the program must produce tests to
support the argument for more testing.
Whether the testing program is completely new or struggling to
grow, there are two important factors that must be managed:
team and culture.
Making Decisions Based On Data
Enterprise businesses are adopting the concept of
data-informed decision-making. They are shifting their
organizational cultures to those where decisions are based on
facts and insights. Optimization is a data-driven process.
In fact, optimization’s goal is to continually enhance customers’
experience on- and off-line, fostering such loyalty that they
keep coming back. When done correctly, optimization removes
the guesswork. Coupled with expert insights, optimization
and testing data offers information about what works and
what doesn’t work. It helps identify customer roadblocks and
opportunities for revenue growth.
As previously mentioned, all optimization tests are not created equal. In order to achieve statistically reliable
results – which in turn help define actionable strategies – tests must be properly designed, conducted and
analyzed.
Understanding Confidence
It is considered “best practice” to end a test once it has reached the predetermined sample size, which takes
into account desired confidence, desired power, MDE and data on the specific metric being tested (either
response rates for binary metrics, or mean and standard deviation for continuous metrics).
Statistical confidence facilitates accurate inference for your population of customers and/or prospects.
It is the likelihood the difference in conversion between a given test variation and the baseline (or control
element) is not random, nor is it caused by chance. In statistics, confidence is a way of mathematically
proving a statistic is reliable.
The confidence level targeted during the test design reflects your risk tolerance and controls for Type I errors,
or false positives. The higher the confidence level, the less likely you would see unexpected results when
implementing a recommended change. If testers are 95 percent confident the increase they observed is, in
fact, an increase, there’s still a five percent chance of a false positive result, in which the lift is actually not
positive. At 80 percent confidence, there is a 20 percent probability of a winner being a false positive.
There is a tradeoff of speed for certainty. When designing a test, setting a high target for your desired
confidence level means you are seeking a low false positive rate. In other words, you want to run a
precise test, which requires more visitors. On the other end of the spectrum, lower confidence levels allow
you to test more quickly, but will result in more false positives. The right confidence level for a given test
depends on the amount of traffic the page receives, as well as the business implications surrounding the
risk threshold for that particular test.
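As a sketch of how a confidence figure like the 95 percent threshold above can be computed from raw test counts, here is a pooled two-proportion z-test in Python. The function name and traffic numbers are hypothetical, and real programs typically rely on their testing tool's statistics engine:

```python
from statistics import NormalDist

def confidence_level(orders_a, visits_a, orders_b, visits_b):
    """Two-sided confidence that variant B truly differs from control A,
    via a pooled two-proportion z-test. Returns 1 - p_value."""
    p_a, p_b = orders_a / visits_a, orders_b / visits_b
    p_pool = (orders_a + orders_b) / (visits_a + visits_b)
    se = (p_pool * (1 - p_pool) * (1 / visits_a + 1 / visits_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return 1 - p_value

# control: 500 of 20,000 visits converted; variant: 580 of 20,000
c = confidence_level(500, 20_000, 580, 20_000)  # ~0.986, above the 0.95 bar
```

With half the traffic, the same observed lift would fall below 95 percent confidence, which is exactly the speed-versus-certainty tradeoff described above.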
Understanding Power
Optimization professionals discuss power less frequently as a design variable for tests, but it is just as
important as confidence. Setting an appropriate power level plays an important role in giving tests the
requisite chance to reach a desired confidence level.
Power controls Type II errors, also known as false
negatives. In simple terms, power is the
probability that you will find a significant difference
between a challenger and control, should that difference
actually exist. As power increases, so does the probability
of rejecting the null hypothesis, or detecting a statistically
significant difference.
For standard A/B tests, the industry standard power level is
80 percent. This is less than the recommended confidence
level of 95 percent because there is less risk involved for
most programs in failing to find a significant difference
compared to advocating for a change when improvement
does not actually exist.
Similar to increasing confidence, increasing power escalates
test precision, and the test will require more visitors to
reach that level of precision. If your test is underpowered,
you can technically end tests quicker, but you may miss
out on detecting statistically significant differences that
actually exist. If your test is appropriately powered, you
can detect most differences that exist at your desired
confidence level.
Likewise, for most projects, weighing the available
options against the desired business goals helps testing
teams find the right balance among confidence, power and
test duration.
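To make the confidence/power/traffic tradeoff concrete, the sketch below estimates the power achieved at a given sample size, using the same normal approximation as standard A/B sample-size calculators. The function name and inputs are illustrative:

```python
from statistics import NormalDist

def achieved_power(baseline_cr, mde_relative, n_per_variation, alpha=0.05):
    """Probability of detecting a true relative lift of mde_relative at the
    given per-variation sample size (two-sided, normal approximation)."""
    p1 = baseline_cr
    p2 = p1 * (1 + mde_relative)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation) ** 0.5
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)

# a test sized for 80% power hits roughly that mark at its designed size
power = achieved_power(0.03, 0.10, 53_000)  # ≈ 0.80
```

Running the same function with half the visitors shows how an underpowered test loses its chance of detecting a real lift, as described above.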
Understanding Minimum Detectable Effect (MDE)
If you are involved with an online optimization program,
you likely know calculating a test duration prior to launch
is a necessary step in reaching a desired level of statistical
confidence as mentioned above. It is also likely you know
defining MDE – a number that represents the relative
minimum improvement you seek to detect over the control
– is critical to determining accurate test duration. MDE
is also known as MDL; the two terms are used
interchangeably in testing.
If you have ever been confused by this particular input in
test design, you are not alone. It can be confusing.
Theoretically, an A/B test could last forever without
reaching a desired level of statistical significance, or
confidence. If the test has little to no impact compared
to the control, the test could simply run until someone
becomes frustrated and ends it. Similarly, individuals could
become impatient with test progression if a website, or
specific set of pages, are not attracting enough visitors
to collect sufficient data. Creating a well-established test
design from the start minimizes, if not removes, such
frustrations because it relies on data to determine test
duration.
Here is the simplest way to think of MDE. It is essentially
the smallest possible change in your primary KPI that your
experiment can detect with statistical certainty.
Ensuring MDE Success
There are often misconceptions associated with MDE
because it is not necessarily a number that can be easily
determined without historical testing data or upfront
analysis.
For clarification, keep these facts top of mind:
• MDE is not the lift, or conversion increase, that a testing professional wants to see.
• MDE is not a guess.
• MDE is an anticipated lift over the control that can be measured with a degree of certainty.
• MDE is the smallest possible change that would be worth investing the time and money to design the test and implement the change permanently on the site.
The main idea behind MDE is that you can detect a lift
down to the specified level at the significance and
power thresholds you set. For example, if you
set a one percent MDE, you will be able to detect a one,
three, or even ten percent change in the KPI. It is also
important to remember that lower MDEs require increased
traffic to ensure a smaller lift is valid.
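The traffic cost of a lower MDE can be shown with a standard two-proportion sample-size formula. In this illustrative sketch (a 3% baseline conversion rate is assumed), halving the MDE roughly quadruples the required visitors per variation:

```python
from statistics import NormalDist

def visitors_per_variation(baseline_cr, mde_relative, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-proportion A/B test."""
    p1, p2 = baseline_cr, baseline_cr * (1 + mde_relative)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return int(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2) + 1

for mde in (0.04, 0.02, 0.01):  # halving the MDE ~quadruples the traffic
    print(f"{mde:.0%} MDE -> {visitors_per_variation(0.03, mde):,} visitors")
```

This inverse-square relationship is why the suggested MDEs in the table below are lower for high-traffic, high-value steps such as cart and checkout.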
If your organization is establishing a new
experimentation program, and historical data is not
available, Brooks Bell recommends selecting an MDE
from the table below. Based on historical analysis,
MDEs vary by page type due to differences in the
ability to impact customer behavior at different stages
of the online shopping journey.
A one percent increase in Revenue Per Visit (RPV)
from the cart page, for example, is going to be much
more valuable than a one percent increase in RPV from
the category page because all revenue goes through
the cart, whereas not all revenue passes through a
category page.
Brooks Bell conducts thousands of online experiments
for enterprise-level clients, and collects data from
each one. The table below presents website MDE
averages by page type.
Page Type        Suggested MDE
Sitewide         2.0% - 3.0%
Homepage         1.5% - 2.5%
Category Page    2.0% - 3.0%
Product Page     1.5% - 2.5%
Cart             1.0% - 2.0%
Checkout         1.0% - 2.0%
It is considered “best practice” to use the lowest
possible MDE when designing a test, while still
running it within an acceptable amount of time as
previously mentioned. By selecting smaller, more
conservative MDE levels, testing programs optimize
the certainty of testing outcomes, learnings, and
ultimately ROI.
Jumpstart Your Experimentation Journey
Experimentation programs, including optimization and
testing, are transforming enterprise companies and
customer experiences. No matter where you are on
your experimentation journey, here are five questions
to consider:
• Are you experiencing the full benefits from your optimization and testing data?
• Are you stuck on any of the areas above?
• Is your experimentation program strategic and sustainable?
• Are your tests producing the results you want?
• Do you have buy-in across all functions of your organization and from all key stakeholders?
If your company is not on the cutting edge of
optimization and testing, it is only a matter of time
until it falls behind. We encourage you to start — or
continue — the conversation with your organization
to take your experimentation program to the next
level. Once implemented, these proven optimization
practices will help.
Brooks Bell is the leader in scaling world-class A/B testing programs.
Learn more! Visit BrooksBell.com
About Brooks Bell
Brooks Bell is the largest independent experimentation consultancy in the country, providing a team of
experimentation experts, who offer flexible solutions to enterprise-level clients. Together, we achieve
business goals in user research, optimization, and personalization to create unforgettable, omni-channel
customer journeys. For more information, please visit www.brooksbell.com.
About The Authors
Taylor Wilson is a Senior Optimization Analyst with fluency in all major testing tools and extensive experience
in data analysis, data visualization and testing ideology. He leads Brooks Bell’s analytics efforts for such top
Fortune 500 brands as Barnes & Noble, Toys”R”Us, Nickelodeon and OppenheimerFunds.
Shana Braun is an Optimization Analyst with Brooks Bell specializing in website analytics and
experimentation strategy. She provides strategic optimization and testing consulting services, as well as
experienced online insights to enterprise-level clients across diverse industries.