periodic measurement of advertising e ectiveness using ... · the rst two parameters in the model,...

17
Periodic Measurement of Advertising Effectiveness Using Multiple-Test-Period Geo Experiments Jon Vaver, Jim Koehler Google Inc. Abstract In a previous paper [6] we described the appli- cation of geo experiments to the measurement of advertising effectiveness. One reason this method of measurement is attractive is that it provides the rigor of a randomized experiment. However, related decisions, such as where and how to spend advertising budget, are not static. To address this issue, we extend this methodol- ogy to provide periodic (ongoing) measurement of ad effectiveness. In this approach, the test and control assignments of each geographic region ro- tate across multiple test periods, and these rota- tions provide the opportunity to generate a se- quence of measurements of campaign effective- ness. The data across test periods can also be pooled to create a single aggregate measurement of campaign effectiveness. These sequential and pooled measurements have smaller confidence in- tervals than measurements from a series of geo experiments with a single test period. Alter- natively, the same confidence interval can be achieved with a reduced magnitude and/or du- ration of ad spend change, thereby lowering the cost of measurement. The net result is a better method for periodic and isolated measurement of ad effectiveness. Keywords : ad effectiveness, advertising exper- iment, periodic measurement, experimental de- sign 1 Introduction Advertisers benefit from the ability to measure the effectiveness of their campaigns. This knowl- edge is fundamental to strategic decision mak- ing and operational efficiency and improvement. However, advertising is dynamic. Competitors come and go, product lines evolve, and consumer behavior changes. Consequently, measuring ad effectiveness is not a one-time exercise. Advertis- ers with search campaigns need to know if their bidding strategy, keyword sets, and ad creatives are having a consistently compelling impact on consumer behavior. Since an assessment of ad effectiveness is relevant for a limited amount of time, the need for measurement is ongoing. Methods of measurement need to be adapted to accommodate this persistent need for measure- ment. There are several key capabilities that a peri- odic geographically based measurement method should provide. Most importantly, the method needs to provide the ability to generate a se- quence of ad effectiveness measurements across time. Additionally, the experimental units should rotate between the test and control groups. This rotation ensures that, over time, all geographic regions, or “geos”, experience an equivalent set of campaign conditions, which bal- ances the ad spend opportunity across geos. The capability to evaluate the design of an experi- ment is also important. Understanding how the measurement uncertainty is impacted by charac- teristics such as experiment length, test fraction, and magnitude of ad spend change is critical to 1

Upload: others

Post on 10-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

Periodic Measurement of Advertising Effectiveness Using

Multiple-Test-Period Geo Experiments

Jon Vaver, Jim Koehler

Google Inc.

Abstract

In a previous paper [6] we described the appli-cation of geo experiments to the measurementof advertising effectiveness. One reason thismethod of measurement is attractive is that itprovides the rigor of a randomized experiment.However, related decisions, such as where andhow to spend advertising budget, are not static.To address this issue, we extend this methodol-ogy to provide periodic (ongoing) measurementof ad effectiveness. In this approach, the test andcontrol assignments of each geographic region ro-tate across multiple test periods, and these rota-tions provide the opportunity to generate a se-quence of measurements of campaign effective-ness. The data across test periods can also bepooled to create a single aggregate measurementof campaign effectiveness. These sequential andpooled measurements have smaller confidence in-tervals than measurements from a series of geoexperiments with a single test period. Alter-natively, the same confidence interval can beachieved with a reduced magnitude and/or du-ration of ad spend change, thereby lowering thecost of measurement. The net result is a bettermethod for periodic and isolated measurementof ad effectiveness.

Keywords: ad effectiveness, advertising exper-iment, periodic measurement, experimental de-sign

1 Introduction

Advertisers benefit from the ability to measurethe effectiveness of their campaigns. This knowl-edge is fundamental to strategic decision mak-ing and operational efficiency and improvement.However, advertising is dynamic. Competitorscome and go, product lines evolve, and consumerbehavior changes. Consequently, measuring adeffectiveness is not a one-time exercise. Advertis-ers with search campaigns need to know if theirbidding strategy, keyword sets, and ad creativesare having a consistently compelling impact onconsumer behavior. Since an assessment of adeffectiveness is relevant for a limited amountof time, the need for measurement is ongoing.Methods of measurement need to be adapted toaccommodate this persistent need for measure-ment.

There are several key capabilities that a peri-odic geographically based measurement methodshould provide. Most importantly, the methodneeds to provide the ability to generate a se-quence of ad effectiveness measurements acrosstime. Additionally, the experimental unitsshould rotate between the test and controlgroups. This rotation ensures that, over time,all geographic regions, or “geos”, experience anequivalent set of campaign conditions, which bal-ances the ad spend opportunity across geos. Thecapability to evaluate the design of an experi-ment is also important. Understanding how themeasurement uncertainty is impacted by charac-teristics such as experiment length, test fraction,and magnitude of ad spend change is critical to

1

Page 2: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

executing an effective and efficient experiment.

The application of geo experiments to the mea-surement of advertising effectiveness was de-scribed in a previous paper [6]. These exper-iments measure ad effectiveness across a singletest period. A series of single test period geo ex-periments meets the requirements above. How-ever, the time required to execute these exper-iments is less than optimal. Each experimentrequires a separate pretest period, which signif-icantly limits measurement frequency. This re-striction is particularly undesirable since ongoingmeasurement is the primary goal. The alterna-tive approach described here avoids this problemby combining the test period of one measure-ment with the pretest period of the next mea-surement. This coupling of the pretest and testperiods not only avoids the inefficiency of iso-lated pretest periods, it also uses ad spend moreefficiently to reduce the confidence interval of thead effectiveness measurements.

2 Description of Multiple-Test-Period Geo Experiments

Our objective is to measure the impact of ad-vertising on consumer behavior. Examples ofthis behavior include clicks, online and offlinesales, newsletter sign-ups, and software down-loads. We refer to the selected behavior as theresponse metric. Results of the analysis are inthe form of return on ad spend (ROAS), whichis the incremental impact that a change in adspend has on the response metric divided by thechange in ad spend.

In this paper, we describe how ad effectivenesscan be measured periodically using a multipletest-period geo experiment, which is a general-ization of the single test period geo experiment.Consequently, many of the considerations andsteps for performing periodic measurement arethe same, or similar, to those discussed previ-ously for generating an isolated measurement ofad effectiveness.

The first step is to partition the geographic re-

gion of interest, (e.g. a country), into a set ofgeos. It must be possible to target ad serving tothese geos, and track ad spend and the responsemetric at this same geo level. The location andsize of the geos can be used to mitigate potentialad serving inconsistency due to finite ad servingaccuracy and consumer travel across geo bound-aries. A process that uses optimization to gen-erate geos will be described in a future paper.In the United States, one possible set of geos isthe 210 DMAs (Designated Market Areas) de-fined by Nielson Media, which is broadly used asa geo-targeting unit by many advertising plat-forms.

The next step is to randomly assign each of theN geos to a geo-group. Randomization is an im-portant component of a successful experiment asit guards against potential hidden biases. Thatis, there could be fundamental, yet unknown,differences between the geos and how they re-spond to the treatment. Randomization helpsto keep these potential differences equally dis-tributed across the geo-groups. In an experi-ment that contains multiple test periods, we ro-tate the assignment of the test condition betweengeo groups.. If there are M geo-groups, then thetest fraction is 1/M . That is, only N/M geos areassigned to the test condition at any given pointin time. It also may be helpful to use a random-ized block design [3] in order to better balancethe geo-groups across one or more characteris-tics or demographic variables. We have foundthat grouping the geos by size prior to assign-ment can reduce the confidence interval of theROAS measurement by 10% or more.

Each experiment contains a series of distincttime periods: one pretest period and one ormore test periods (see Figure 1). During thepretest period there are no differences in cam-paign structure across geos (e.g. bidding strat-egy, keyword set, and ad creatives). All geosoperate at the same baseline level and there areno incremental differences between the test andcontrol geos in the ad spend and response metric.

In each test period, the campaigns of the geosin one geo-group are modified so that they differ

2 Google Inc.

Page 3: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

Figure 1: Diagram of a periodic geo experimentwith four test periods. Ad spend is modified in adifferent geo-group during each of the first threetest periods, and it returns to the baseline statein the final test period. There may be some delaybefore the corresponding change in a responsemetric is fully realized.

from the baseline condition. This modificationgenerates a nonzero difference in the ad spendfor these geos relative to the others. That is,the ad spend differs from what it would havebeen if these campaigns had not been modi-fied. This difference will be negative if the cam-paign change causes the ad spend to decrease(e.g. campaigns turned off), and positive if thechange causes an increase in ad spend (e.g. bidsincreased and/or keywords added).

The ad spend difference will, hopefully, generatea non zero difference in the response metric, per-haps with some time delay, ν. Each test periodextends beyond the end of the ad spend changeby ν to fully capture this incremental change inthe response metric. Total clicks (paid plus or-ganic) is an example of a response metric thatis likely to have ν = 0. Offline sales is an ex-ample of a response metric that is likely to haveν > 0, since it takes time for consumers to com-plete their research, make a decision, and thenvisit a store to make their purchase.

Each test period provides another opportunity tomeasure ROAS. So, the length of the test period

determines the frequency with which advertisingeffectiveness can be assessed. In addition to thismonitoring capability, the adjacency of these testperiod transitions also reduces the confidence in-terval of these ROAS estimates. For example,the transition from Test Period 1 to Test Period2 provides the opportunity to observe the im-pact of reducing the ad spend on the responsemetric in geo-group 2. It also provides the op-portunity to observe the impact of restoring thead spend to the baseline level in geo-group 1.The use of adjacent test periods effectively dou-bles the difference in ad spend for each test pe-riod level measurement of ROAS, except for thefirst measurement (transition from Pretest Pe-riod 0 to Test Period 1), and the last (transitionfrom Test Period 3 to Test Period 4, in the Fig-ure 1 example). This effective doubling of thead spend reduces the confidence interval of theROAS measurement by increasing the leveragein fitting the linear model described below 1. So,in Test Period 1 only the geos in geo-group 1have a test condition with reduced ad spend. InTest Period 2 the geos in geo-groups 1 and 2 havea test condition with increased and reduced adspend, respectively, and in Test Period 3 the geosin geo-groups 2 and 3 have a test condition withincreased and reduced ad spend, respectively.

3 Linear Model

After an experiment is executed, the ROAS fortest period j is generated by fitting the followinglinear model:

yi,j = β0j + β1jyi,j−1 + β2jδi,j + εi,j (1)

where i = 1, ..., N , yi,j is the aggregate of theresponse metric during test period j for geo i,δi,j is the difference between the actual ad spendin geo i and the ad spend that would have oc-curred without the campaign change associatedwith the transition to test period j, and εi,j isthe error term. We fit this model using weights

1This reduction can be characterized analytically, asdemonstrated in Appendix A.

Google Inc. 3

Page 4: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

wi = 1/yi,0 in order to control for heteroscedas-ticity caused by heterogeneity in geo size.

The first two parameters in the model, β0j andβ1j , are used to account for seasonal differencesin the response metric across periods j and j−1.The parameter of primary interest is β2j , whichis the return on ad spend (ROAS) of the responsemetric for test period j. A sequence of ROASmeasurements can be calculated by fitting thismodel separately for each test period in the ex-periment. Note that when j = 1, Equation 1matches the linear model for the single test pe-riod geo experiment described in [6].

More generally, the ROAS can be estimated bypooling the data from a set of test periods, J . Inthis situation, the model becomes

yi,j = β0j + β1jyi,j−1 + β2Jδi,j + εi,j . (2)

This model has the same form as Equation 1,except here j ranges over all of the values of J .Each combination of geo and test period pro-vides another observation. So, instead of fittingthe model withN observations, it is fit withN |J |observations. The set J can include any numberof test periods, and there is no need for these pe-riods to be consecutive, although typically theywill be. J may also include all of the test peri-ods, in which case all of the experiment data arepooled to generate a single ROAS estimate.

The values of yi,j (e.g. offline sales) are gener-ated by the advertiser’s reporting system. Thegeo level ad spend is available through the adplatform reporting system (e.g. AdWords). Theprocess for finding the ad spend counterfactualfor each test period, δi,j , is analogous to the pro-cess described in [6]. If there is no ad spend dur-ing period j − 1 then the ad spend difference intest period j, δi,j , is simply the ad spend duringtest period j. However, if the ad spend is posi-tive during period j−1 and it is either increasedor decreased, as depicted in Figure 1, then thead spend difference is found by fitting a secondlinear model:

si,j = γ0j + γ1jsi,j−1 + µi,j (3)

Here, si,j is the ad spend in geo i during test

period j and µi,j is the error term 2. Assumingsi,0 > 0, this model is fit with weights wi = 1/si,0using only the set of control geos (C).

This ad spend model characterizes the impactof seasonality on ad spend across the transitionsbetween test periods, and it is used as a coun-terfactual 3 to calculate the ad spend difference.The ad spend differences in the control and testgeos (T ) of each test period transition are:

δi,j =

{si,j − (γ0j + γ1jsi,j−1) for i ∈ T

0 for i ∈ C(4)

The zero ad spend difference in the control geosreflects the fact that these geos continue to op-erate at the baseline level across the test periodtransition. Note that, with the exception of thefirst and last test periods, all test periods willinclude δi,j that are positive and negative, sincead spend increases across the test period bound-ary for some geos while is decreases for others,as described in Section 2.

4 Example Results

We employed a geo experiment in [6] to eval-uate the potential cannibalization of cost-freeorganic clicks by paid search clicks for an ad-vertiser. We did so because the advertiser wasconcerned that consumers were clicking on paidsearch links when they would have clicked onfree organic search links had there not been paidlinks present. The goal of the experiment was tomeasure the cost per incremental click (CPIC).That is, the cost for clicks that would not haveoccurred without the search campaign. Here weshow the results of a similar geo experiment thatwas run to monitor the effectiveness of an exist-ing national search advertising campaign acrosstime.

2The error term in Equation 3 is scaled by the ROASas it propagates through to Equation 1 or 2 in an additiveway through the application of Equation 4. However,this error term is often smaller than the error term inEquations 1 and 2 by an order of magnitude, or more.

3The counterfactual is the ad spend that would haveoccurred in the absence of the campaign change acrosseach test period transition.

4 Google Inc.

Page 5: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

The measurement period lasted approximately12 weeks and included 11 test periods. Dur-ing this experiment, the advertiser’s search adswere not shown in 1/6 of the geos during eachof the first 10 test periods. Each of the six geo-groups took turns going “dark” across the exper-iment, which ensured that no geo stopped show-ing search ads for more than about a week at atime. Search ads were shown in all of the geosagain starting with the 11th test period.

Figure 2 shows individual test period level resultsgenerated using Equation 1. There is one ROASmeasurement for each test period. These mea-surements are not quite independent of one an-other because the test period associated with onemeasurement is the pretest period for the subse-quent test period. However, there is no overlapin the data used to generate the measurementsin non-adjacent test periods. The ROAS rangesfrom 1.3 to 2.9 clicks per dollar (CPIC rangesfrom $0.34 to $0.77 per incremental click). Notethat the width of the confidence interval remainsroughly the same across test periods 2 through10. It is higher in test periods 1 and 11 where theexperiment does not benefit from the effectivedoubling of the ad spend difference, as describedin Section 2.

Figure 3 shows results generated by pooling dataacross test periods using Equation 2. EachROAS measurement is generated by pooling thedata from each test period with the data fromall of the previous test periods. So, the ROASgenerated for the first period has J = {1},for the second J = {1, 2}, and for the 11th

J = {1, 2, 3, ..., 11}. The final ROAS estimateis 1.9 clicks per dollar (CPIC = $0.53 per incre-mental click). Note that the confidence intervaldecreases monotonically across the length of theexperiment as additional data are added to themodel.

In contrast to the individual test period results,this scenario provides a long-term estimate ofthe response metric. For example, the impactof shorter term factors such as the weather, or acompetitor’s promotions, in a single week mightbe smoothed across an entire quarter. It is also

Figure 2: Measurement of return on ad spendfor clicks as a function of test period. There isone ROAS measurement for each test period.

possible to balance these alternative views of thedata by pooling across a subset of test periods.For example, the ROAS can be calculated bypooling data across consecutive pairs of test peri-ods by letting J = {1, 2}, J = {2, 3}, J = {3, 4},and so on.

One potential application for the informationpresented above is the generation of combinedShewhart-CUSUM quality control charts [5].These charts are used in detection monitoring.They can identify both a sudden change (She-whart) and a gradual change (CUSUM) in theresponse metric.

5 Experimental Design

As always, design is an important step for run-ning an efficient and effective experiment. Thedesign considerations include the length of theexperiment, the test fraction, the magnitude ofthe ad spend difference, and the length of the testperiods (switching frequency). Although thereare more design considerations with a multiple-test-period geo experiment than with a single

Google Inc. 5

Page 6: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

Figure 3: Measurement of return on ad spend forclicks as a function of test period length. EachROAS measurement is generated by pooling thedata from all of the previous test periods.

test period geo experiment, the same approachthat was employed in [6] is still applicable.

Consider the matrix form of Equation 2,

Y = Xβ + ε (5)

where X = (Xj1Xj2 ...Xj|J|∆) is the concatena-tion of |J | matrices with dimension [N |J | x 2]and one matrix with dimension [N |J | x 1];

Xj1 =

1 y1,j1−1. .. .1 yN,j1−10 0. .. .. .. .. .. .. .. .. .0 0

, Xj2 =

0 0. .. .0 01 y1,j2−1. .. .1 yN,j2−10 0. .. .. .. .. .0 0

Xj|J| =

0 0. .. .. .. .. .. .. .. .. .0 01 y1,j|J|−1. .. .1 yN,j|J|−1

, ∆ =

δ1,j1..

δN,j1δ1,j2..

δN,j2...

δ1,j|J|

.

.δN,j|J|

Y =

y1,j1..

yN,j1y1,j2..

yN,j2...

y1,j|J|

.

.yN,j|J|

, ε =

ε1,j1..

εN,j1ε1,j2..

εN,j2...

ε1,j|J|

.

.εN,j|J|

.

The coefficient vector, β, is a vector with length2|J | + 1,

β =

β0,j1β1,j1β0,j2β1,j2..

β0,j|J|

β1,j|J|

β2

.

With the model in this form, the variance-covariance matrix of the weighted least squaresestimated regression coefficients is:

var(β) = σ2ε (XTWX)−1 (6)

6 Google Inc.

Page 7: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

(see [4]), where W is an [N |J | x N |J |] diagonalmatrix containing the weights wi;

W =

W 0 . . 0

0 W .. . .. . 0

0 . . . W

with

W =

w1 0 . . 00 w2 .. . .. . 00 . . . wN

.

The lower right component of the matrix inEquation 6 is the variance of β2. So,

var(β2) = σ2εadj(XTWX)N |J |,N |J |

det(XTWX)(7)

where adj(A)n,n is the n, n cofactor of the matrixA and det(A) is the determinant.

Using a set of geo-level pretest data in the re-sponse variable, it is possible to use Equation7 to estimate the width of the ROAS confidenceinterval for a specified design scenario. This pro-cess is analogous to the process described in [6].

The first step is to select a consecutive set ofdays from the pretest data to create pseudopretest and test periods. The lengths of thepseudo pretest and test periods should matchthe lengths of the corresponding periods in thehypothesized experiment. For example, an ex-periment with a 14 day pretest period and three14 day test periods should have pseudo pretestand test periods with the same lengths using 56days of data. These data are used to estimateW and all but the last column of X in Equation7.

The next step is to randomly assign each geo toa geo group. If blocking is used, as suggested inSection 2, then this random assignment shouldbe similarly constrained. It may be possible todirectly estimate the value of δi,j at the geo level.For example, if the ad spend will be turned off inthe test geos, then δi is just the average daily ad

spend for test geo i times the number of days intest period j. Otherwise, an aggregate ad spenddifference ∆j can be hypothesized for each testperiod and the geo-level ad spend difference canbe estimated using

δi,j =

{∆j(yi,0/

∑i yi,0) for i ∈ T

0 for i ∈ C (8)

The last value to estimate in Equation 7 is σε.This estimate is generated by considering the re-duced linear model;

Y = Xβ + ε (9)

where X = (Xj1Xj2 ...Xj|J|). This model has thesame form as Equation 6 except the column ofad spend difference terms, ∆, has been dropped.Fitting this model using the pseudo pretest andtest period data results in a residual variance ofσε, which is used to approximate σε.

To avoid any peculiarities associated with a par-ticular random assignment, Equation 7 is evalu-ated for many random control/test assignments.In addition, different partitions of the pretestdata are used to create the pseudo pretest andtest periods by circularly shifting the data intime by a randomly selected offset. The halfwidth estimate for the ROAS confidence interval

is 2

√var(β2), where var(β2) is the average vari-

ance of β2 across all of the random assignments.This process can be repeated across a number ofdifferent scenarios to evaluate and compare de-signs.

Figure 4 shows the confidence interval predic-tion as a function of test period for the exam-ple shown in Figure 3. The dashed line corre-sponds to the predicted confidence interval halfwidth and the solid line corresponds to resultsfrom the experiment. For this comparison, thead spend difference calculated using Equation 4was used as input to the prediction. That is, weassumed that the ad spend difference was knownwith certainty, as it will be when the pretest pe-riod spend is zero. The relatively good matchbetween these two curves demonstrates that theabsolute size of the confidence interval can be

Google Inc. 7

Page 8: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

Figure 4: ROAS confidence interval half-widthprediction across the length of the experiment.

predicted quite well, at least as long as the adspend difference can be accurately predicted. Inpractice, the accuracy of this prediction is im-pacted by the dynamic environment of the liveauctions and uncertainty in the relationship be-tween changes in campaign settings, such as bidsand keyword sets, and resulting changes in adspend. The bid simulator tool [1] and the trafficestimator tool [2] can help with ad spend predic-tion, and closely monitoring ad spend and ad-justing campaign changes in the test group dur-ing the early stages of the experiment can alsohelp realize the targeted ad spend difference.

6 Multiple vs. Single-Test-Period Geo Experiments

In this section, we compare the application ofmultiple and single-test-period geo experimentsto scenarios in which the objectives are periodicand isolated measurement. Multiple-test-periodexperiments are a better choice for periodic mea-surement, and the same is true for isolated mea-surement.

To make the comparisons more clear, all of thescenarios considered below have the same num-ber of geo groups and, for all but one scenario,the same test fraction, q = 1/3. Also, when itis nonzero, the ad spend intensity (i.e. the adspend per geo per unit time) is constant acrossscenarios. There is no delay in the impact of ad-vertising on user behavior (i.e. ν = 0) for thescenarios described in Sections 6.1 through 6.3.The experiment budget is the cost per measure-ment and each scenario has an absolute exper-iment budget (i.e. aggregate magnitude of adspend difference) of either B or 2B, dependingon whether the ad spend is pulsed or continuous.Computational results were generated using thegeos and response data from the example shownin Section 4. Analytical results were generatedusing variants of the analysis described in Ap-pendix A.

6.1 Periodic Measurement - PulsedSpend

The measurement objective in the first set ofcomparisons is to monitor ad effectiveness overtime. The most obvious approach for extend-ing single period geo experiments to this situa-tion is to run a series of consecutive experiments.In this case, each experiment has the ad spendprofile depicted in Figure 8 in Appendix B. Thetest group has an aggregate ad spend differenceof B in every 9 day test period. The following9 days are reserved for the pretest period asso-ciated with the next test period, which resultsin a pattern of pulsed ad spend. This scenariocorresponds to the first row in Table 1.

The analog for this test in the multiple-test-period paradigm corresponds to the second rowin Table 1 (also see Figure 9). The spend profileis the same as the first scenario. Most notably,the budget is still B. The primary differenceis that the 9 day period subsequent to test pe-riod i is not only used as a pretest period fortest period i + 1, it is also used to reduce theconfidence interval of the ROAS estimate associ-ated with measurement i. That is, the informa-tion provided by increasing the ad spend in geo

8 Google Inc.

Page 9: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

group 1 in transitioning to test period i is com-bined with the information provided by decreas-ing the ad spend in geo group 1 in transitioningto test period i+ 1. Pooling information in thisway reduces the confidence interval by a factorof 1/

√2. 4. Alternatively, the same confidence

interval can be achieved using only one half ofthe ad spend difference. Along with this lowercost, the ad effectiveness measurement is morerelevant to the current level of ad spend. 5

Test Ad Spend C. I.# Test Period Difference, Half

scen. Periods Length Leverage Width

1 1 9 B, B 1.59

2 2 9 B, 2B 1.18

Table 1: Periodic measurement scenarios withpulsed ad spend. See Figures 8 and 9 in Ap-pendix B for the spend profiles associated withthese scenarios.

6.2 Periodic Measurement - Continu-ous Spend

Both of the scenarios in Section 6.1 have a bud-get of B. Now consider the situation in which themeasurement interval continues to be 18 days,but we allow for a continuous change in ad spendacross time. The budget for these scenarios is2B. It is not possible to apply a single periodgeo experiment to a situation in which there is acontinuous change in ad spend across time. So, abudget-equivalent comparison is made using Sce-nario 3, which is identical to Scenario 1 exceptthat the test fraction has been doubled to achieve

4In Table 1 the confidence interval improvement forScenario 2 over Scenario 1 is slightly less than expected(1.18 versus 1.12). This discrepancy is caused by the useof an 18 day pretest period, which was chosen for its com-patibility with subsequent scenarios. It disappears if a 9day pretest period is used to match the length of the 9day test periods in Scenarios 1 and 2.

5Generally speaking, the effectiveness of judiciouslyapplied advertising spend decreases as ad spend volumeincreases. So, using a smaller ad spend difference providesa more precise measure of the marginal value of the adspend.

a budget of 2B (see Figure 10).

Test Ad Spend C. I.# Test Period Difference, Half

scen. Periods Length Leverage Width

3 1 9 2B, 2B 0.79

4 1 18 2B, 4B 0.645 2 9 2B, 4B 0.686 3 6 2B, 4B 0.667 6 3 2B, 4B 0.678 9 2 2B, 4B 0.669 18 1 2B, 4B 0.64

Table 2: Sequence of multiple-test-period scenar-ios. The leverage from the ad spend is twice aslarge as the actual ad spend when switching isused. See Figures 10 through 14 in Appendix Bfor the spend profiles associated with Scenarios3-7.

The base scenario in the multiple-test-periodparadigm is Scenario 4 in Table 2 with the cor-responding spend profile in Figure 11. In thisscenario, the test period spans the entire 18 daymeasurement interval. While the budget is 2B,the corresponding leverage is 4B because at eachtest period transition there is an increase in thespend for one geo group, and a decrease in spendfor another. This scenario has a confidence in-terval that is smaller than Scenario 1 by a factorof (1/2)

√1− q, and smaller than Scenario 3 by

a factor of√

1− q.

Additional test group rotations are included inthe 18 day measurement period in scenarios 5through 9 (also see the spend profiles in Figures12 through 14). The ROAS measurements aregenerated by pooling data across these shortertest periods. In all these cases, the ad spenddifference is 2B and the ad spend leverage is4B. The ad spend leverage remains constant be-cause the more frequent switching is offset by theshorter length of the test periods. As a result,the confidence interval remains constant acrossthese scenarios (see Figure 5). So, once a mea-surement period is established, there is no bene-fit, or harm, in rotating the test condition morefrequently with regard to the width of the confi-

Google Inc. 9

Page 10: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

Figure 5: ROAS confidence interval predictionfor the periodic measurement scenarios in Tables1 and 2.

dence interval 6. Switching more frequently pro-vides the option of reducing the measurementperiod during the analysis phase of the experi-ment, although it comes with the additional lo-gistics associated with rotating the test conditionmore frequently.

The results demonstrate that multiple-test-period experiments use budget and/or time moreefficiently than consecutive single test period ex-periments.

6.3 Isolated Measurement

Now we consider the objective of generating asingle measurement of ad effectiveness. The firstscenario considered follows the isolated measure-ment approach described in [6], which corre-sponds to the first row in Table 3. The spendprofile for this scenario is depicted in Figure 15in Appendix B. The ad spend difference acrossthe 18 day test period is 2B, as is the ad spendleverage.

In Scenarios 2-6, the 18 day measurement pe-

6See Section 6.4 for the exception to this rule.

Test Ad Spend C. I.# Test Period Difference, Half

scen. Periods Length Leverage Width

1 1 18 2B, 2.00 B 1.102 2 9 2B, 3.00 B 0.843 3 6 2B, 3.33 B 0.754 6 3 2B, 3.67 B 0.705 9 2 2B, 3.78 B 0.686 18 1 2B, 3.89 B 0.64

Table 3: Sequence of isolated measurement sce-narios. The ad spend leverage approaches twicethe value of the ad spend difference as switchingfrequency is increased. See Figures 15 through18 in Appendix B for the spend profiles associ-ated with Scenarios 1-4.

Figure 6: ROAS confidence interval predictionfor the isolated measurement scenarios in Table3.

10 Google Inc.

Page 11: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

riod is partitioned by a set of test group rota-tions. ROAS measurements are generated bypooling data across these shorter test periods. Inall of these scenarios, the ad spend difference is2B, and the ad spend leverage approaches 4B asthe switching frequency increases. The ad spendleverage is less than 4B because the first switchdoes not take place until the start of the sec-ond test period. As the switching frequency in-creases, the impact of not having a switch in thefirst test period decreases (see Figure 6).

As the switching frequency increases, the confi-dence interval decreases. In the limit, it becomessmaller than the confidence interval of Scenario1 by a factor of (1/

√2)√

1− q 7. These resultsindicate that, even when the goal is isolated mea-surement, a multiple-test-period experiment usesthe ad spend difference more efficiently than asingle test period experiment.

6.4 Implications of Delayed Ad Im-pact

In the fourth set of comparisons the objective isthe same as in Section 6.1: monitor ad effective-ness over time. However, in this case we considerthe implications of a non-zero delay for the im-pact of the advertising on the response metric(i.e. ν > 0). In this situation, more frequentrotation of the geos through the test conditionresults in a reduction in the ad spend differenceand leverage, and a correspondingly larger con-fidence interval.

The scenarios in Table 4 are the same as thefirst six scenarios in Tables 1 and 2, except hereν = 3 days. Consequently, the ad spend changeis truncated 3 days prior to the end of each testperiod to allow the full impact of the advertising

7This reduction in the confidence interval is the samereduction that would have occurred if an additional 18day “observation” period were added to the analysis. Thisadditional period would be used to observe the impact onthe response variable of returning the ad spend to thebaseline level in Group 1, similar to Scenario 2 in Table1. So, one interpretation of the benefit of switching isthat it allows the length of the analysis period to be cutin half without impacting the confidence interval.

to be realized within each test period, which issimilar to the situation depicted in Figure 1. Asa result, the ad spend difference and the ad spendleverage are less than the analogous values inTables 1 and 2.

The ad spend difference in Scenario 1 is reducedby a factor of 2/3 because of the impact delay.This ad spend reduction increases the confidenceinterval by a factor of 1/(2/3) = 1.5. The samelogic extends to scenarios 2-6. For a measure-ment period of length L, the ad spend is reducedby a factor of (L−mν)/L, where m is the numberof test periods during the measurement period 8.The confidence intervals for Scenarios 4-6 in Ta-ble 4 are larger than the confidence intervals ofthe corresponding scenarios in Table 2 by abouta factor of L/(L − mν), where L = 18, ν = 3,and m=1, 2, 3, respectively.

The confidence intervals for Scenarios 1-6 areplotted in Figure 7. Even with a delay in adimpact, the multiple-test-period alternatives toScenarios 1 and 3 (i.e. Scenario 2 for pulsed adspend difference, and Scenario 4 - for continu-ous ad spend difference) will always have a lowerconfidence interval. However, the larger confi-dence intervals of Scenarios 5 and 6 demonstratethat partitioning the measurement interval withadditional switching is not always harmless. The

8Note that mν must be less than L to avoid a situationin which some of the impact of the ad spend is sharedacross adjacent test periods.

Test Ad Spend C. I.# Test Period Difference, Half

sc. Periods Length Leverage Width

1 1 9 0.67B, 0.67B 2.372 2 9 0.67B, 1.33B 1.773 2 9 1.33B, 1.33B 1.174 2 9 1.67B, 3.33B 0.775 3 6 1.33B, 2.67B 1.026 6 3 1.00B, 2.00B 1.32

Table 4: Sequence of multiple-test-period scenar-ios in which the impact of the advertising on theresponse metric lasts up to three days.

Google Inc. 11

Page 12: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

Figure 7: ROAS confidence interval predictionfor the delayed ad impact scenarios in Table 4.

presence of a non-zero delay in ad impact makesit necessary to trade-off confidence interval sizewith measurement frequency.

7 Additional Design Notes

The approach described in Appendix A can beused to analyze a variety of design choices. Thissection includes the results of several of theseanalyses.

7.1 Impact of Modifying Test Fractionand Ad Spend Difference

Most advertisers prefer to measure ad effective-ness with as little impact as possible to theirexisting campaigns. Smaller test fractions havea smaller impact on existing campaigns. How-ever, they also generate ROAS measurementswith larger confidence intervals. Modifying thetest fraction by a factor of ft changes the ex-pected confidence interval by a factor of

√1/ft.

So, reducing the test fraction from 1/4 to 1/8will increase the confidence interval by a factor

of√

2.

Note that when the test fraction was scaled byft, the magnitude of the aggregate ad spend dif-ference was also scaled by ft. This additionalscaling keeps the average geo level ad spend dif-ference, i.e. the intensity of ad spend differ-ence, constant. However, scaling this intensityis another way to impact the confidence inter-val. Smaller intensities correspond to smallerchanges in the existing campaigns. Increasingthe ad spend difference by a factor of fδ modi-fies the expected confidence interval by a factorof 1/fδ. This means that halving the ad spenddifference will double the confidence interval.

These results indicate that changing the num-ber of test geos has less impact on the confi-dence interval than changing the leverage of thelinear model by modifying the ad spend differ-ence. However, increasing the magnitude of thead spend difference has other implications. Theefficiency of ad spend typically decreases as adspend increases. So, the ROAS associated witha large ad spend difference may be smaller thanthe ROAS associated with a smaller one. Unfor-tunately, the exact relationship between ROASand the volume of ad spend is usually unknown.Using an ad spend difference that is too largemay lower the ROAS and measure ROAS at alevel of ad spend that is not relevant to the ad-vertiser.

7.2 Trade-off: Test Fraction and TestLength

Some advertisers may want to limit the impactof running an experiment on their existing cam-paigns by using a smaller test fraction, but theymay prefer not to do so at the expense of mea-surement precision. An alternative is to offsetthe use of a smaller test fraction by increasingthe length of the measurement period. If the testfraction is scaled by ft, then the confidence in-terval can be kept constant by scaling the length

of the measurement period by f(−2/3)t . So, if the

test fraction is cut in half, then the length ofthe measurement needs to increase by a factor

12 Google Inc.

Page 13: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

of (1/2)(−2/3) ≈ 1.6 to keep the same confidenceinterval.

7.3 Impact of Geo Expansion and GeoSplitting

One potential method for reducing the confi-dence interval of ROAS measurement is to addmore geos to the experiment 9. Scaling the num-ber of geos by fg changes the expected confidenceinterval by a factor of

√1/fg. This means that

doubling the number of geos will decrease theconfidence interval by a factor of 1/

√2.

An alternative to expanding the geographic cov-erage of an experiment is to re-partition thesame geographic area into a larger number ofgeos. Once again, scaling the number of geosby fg changes the expected confidence intervalby a factor of

√1/fg. So, this approach has the

same impact as adding new geos, but it does sowithout increasing the aggregate ad spend dif-ference. The down side of increasing the num-ber of geos via geographic re-paritioning is thatsmaller geos are more likely to suffer from con-trol/test contamination. Finite geo location ac-curacy may inconsistently label consumers wholive near boundaries between control and testgeos. Consumers are also more likely to travelacross these boundaries during the course of theirdaily activities, including commuting to work.

8 Concluding Remarks

Our previous work demonstrated that geo exper-iments deserve consideration in many decision-making situations that require the measurementof ad effectiveness. They provide the rigor of arandomized experiment, and they can be appliedto a variety of user behavior while avoiding pri-vacy concerns that may be associated with alter-native approaches. Here we have demonstratedthat these experiments can also be used to track

9Decreasing the number of geos included in the exper-iment is another way to reduce the impact of the experi-ment on existing campaigns.

ad effectiveness over time. This additional stepexpands the applicability of geo experiments tothe common situation in which one time mea-surement is not sufficient to meet the needs ofadvertisers. As an added benefit of generalizingthe application of geo experiments, we also iden-tified a better framework for both periodic andisolated measurement of ad effectiveness.

Acknowledgments

We thank Tony Fagan for reviewing this workand providing valuable feedback and many help-ful suggestions.

References

[1] AdWords Help. “What is the bid simulator,and how does it work?” March 20, 2012.http://support.google.com/adwords/bin/answer.py?hl=en&answer=138148

[2] AdWords Help. “Traffic Estimator” March20, 2012.http://support.google.com/adword/bin/answer.py?hl=en&answer=6329

[3] Box, G., et al. Statistics for Experimenters.New York: John Wiley & Sons, Inc, 1978.

[4] Kutner, M.H. et al. Applied Linear Statisti-cal Models. New York: McGraw-Hill/Irwin,2005.

[5] Lucas, J.M. 1982. “Combined Shewhart-CUSUM quality control schemes.” Journalof Quality Technology 14, 51-59.

[6] Vaver, J., Koehler, J. “Measuring AdEffectiveness Using Geo Experiments.”http://googleresearch.blogspot.com/2011/12/measuring-ad-effectiveness-using-geo.html,2011.

Google Inc. 13

Page 14: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

9 Appendix A

The adjacent test periods in a multiple-test-period geo experiment allow information to beused more efficiently than in a single-test-periodexperiment. Using Equation 7, along with sev-eral reasonable assumptions, this efficiency canbe characterized analytically.

For this comparison, we consider the confidenceinterval associated with an ROAS measurementfrom a single-test-period geo experiment and itsmultiple-test-period analog; a single test periodtransition from test period j−1 to test period j,where j > 1. The length of the pretest period inthe single-test-period geo experiment is the sameas the length of test period j − 1. The lengthsof the single test period and test period j fromthe multiple-test-period experiment are also thesame, as is the associated ad spend difference.

Now, let p and q be the fraction of geos in the testgroup for the single and multiple-test-period ex-periments, respectively. Assume that all groupsof geos (i.e. the control and test groups in thesingle-test-period experiment and the geo-groupsin the multiple-test-period experiment) are sta-tistically identical. For example, the distributionof the response metric volume is the same for allgroups of geos. Similarly, the mean ad spend dif-ference in each group of test geos is the same asthe ad spend difference that would have occurredin each group of control geos, if they had beenassigned to the test condition. Let δ be the, re-alized or unrealized, ad spend difference for eachgroup of geos.

Furthermore, assume that the ad spend differ-ence of each geo in a test group is proportionalto the response metric in the pretest period. So,for the single-test-period experiment

|δi| = α yi,0 i ∈ {1, ..., N} (10)

and for the continuous experiment

|δi,j | = α yi,0 i ∈ {1, ..., N}. (11)

This assumption is reasonable since we expectgeos with larger ad spend to have a larger volume

in the response metric, and we expect campaignchanges to have a larger absolute impact on adspend in the larger geos. Going one step further,assume that the impact of the ad spend changeis small relative to the differences in the responsemetric volume across geos so that

|δi,j | = α yi,0 ≈ α yi,j i ∈ {1, ..., N}. (12)

In this analysis, the linear model is modified byignoring the less important β0j terms. So, forthe standard experiment

X =

y1,0 0. .. .

y(1−p)N,0 0

y(1−p)N+1,0 δ(1−p)N+1

. .

. .yN,0 δN

and

XTWX =

N∑i=1

wiy2i,0

N∑i=1

wiyi,0δi

N∑i=1

wiyi,0δiN∑i=1

wiδ2i

.Since wi = 1/yi,0 and |δi| = α yi,0,

N∑i=1

wiy2i,0 =

1

α

N∑i=1

|δi| =1

αN |δ|

N∑i=1

wiyi,0δi =N∑

i=(1−p)N+1

wiyi,0δi = pNδ

N∑i=1

wi(δi)2 = α

N∑i=(1−p)N+1

|δi| = αpN |δ|.

Then from Equation 7

var(β2) = σ2ε

1αN |δ|

[ 1αN |δ|][αpN |δ|]− [pNδ]2

=σ2ε

α N |δ| p(1− p). (13)

14 Google Inc.

Page 15: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

For the multiple-test-period experiment

X =

y1,j−1 0. .. .

y(1−2q)N,j−1 0

y(1−2q)N+1,j−1 δ(1−2q)N+1,j

. .

. .y(1−q)N,j−1 δ(1−q)N,jy(1−q)N+1,j−1 δ(1−q)N+1,j

. .

. .yN,j−1 δN,j

and

XTWX =

N∑i=1

wiy2i,j−1

N∑i=1

wiyi,j−1δi

N∑i=1

wiyi,j−1δiN∑i=1

wiδ2i

.Since wi = 1/yi,0 and |δi,j | = α yi,0 ≈ α yi,j−1,

N∑i=1

wiy2i,j−1 ≈

1

α

N∑i=1

|δi| =1

αN |δ|

N∑i=1

wiyi,j−1δi ≈N∑

i=(1−2q)N+1

δi = 0

N∑i=1

wi(δi)2 ≈ α

N∑i=(1−2q)N+1

|δi| = αN |δ|2q.

The off diagonal terms in XTWX are zerobecause in X the ad spend difference termsδ(1−2q)N+1,j , ..., δ(1−q)N,j have the same magni-tude as the terms δ(1−q)N,j , ..., δN,j , but with theopposite sign. Then from Equation 7,

var(β2) ≈ σ′ε2

1αN |δ|

[ 1αN |δ|][α2qNδ]

=σ′ε

2

α N |δ| 2q. (14)

With the assumption that σ2ε = σ′ε2, the ra-

tio of Equations 13 and 14 gives the ratio ofvar(β2) from the single-test-period experiment

and var(β2j) from the multiple-test-period ex-periment,

var(β2)

var(β2j)≈ 2q

p(1− p). (15)

If p = q = 1/3, then the confidence interval ofthe ROAS in the single-test-period experimentwill be

√3 ∼ 1.73 times greater than it is in the

multiple-test-period experiment. Alternatively,if we assume that both of the confidence in-tervals are the same, var(β2) = var(β2j), thenq ≈ p(1 − p)/2. So, for a case in which p = 1/2we have q = 1/8. The multiple-test-period ex-periment delivers the same confidence interval asthe single-test-period geo experiment with a testfraction that is only 1/4 as large.

10 Appendix B

This appendix contains ad spend profiles for sce-narios from Tables 1 through 3.

Figure 8: Scenario 1 from Table 1.

Google Inc. 15

Page 16: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

Figure 9: Scenario 2 from Table 1.

Figure 10: Scenario 3 from Table 2.

Figure 11: Scenario 4 from Table 2.

Figure 12: Scenario 5 from Table 2.

Figure 13: Scenario 6 from Table 2.

Figure 14: Scenario 7 from Table 2.

16 Google Inc.

Page 17: Periodic Measurement of Advertising E ectiveness Using ... · The rst two parameters in the model, 0j and 1j, are used to account for seasonal di erences in the response metric across

Figure 15: Scenario 1 from Table 3.

Figure 16: Scenario 2 from Table 3.

Figure 17: Scenario 3 from Table 3.

Figure 18: Scenario 4 from Table 3.

Google Inc. 17