
Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data

Jeff Dominitz

RAND

and

Charles F. Manski

Department of Economics and Institute for Policy Research, Northwestern University

Revised: August 2019, forthcoming in the Journal of Econometrics

Abstract

Missing data problems are ubiquitous in data collection. In surveys, these problems may arise from unit nonresponse, item nonresponse, and panel attrition. Building on the Dominitz and Manski (2017) study of choice between two or more sampling processes that differ in cost and quality, we study minimax-regret sample design in anticipation of missing data, where the collected data will be used for prediction under square loss of the values of functions of two variables. The analysis imposes no assumptions that restrict unobserved outcomes. Findings are reported for prediction of the values of linear and indicator functions using panel data with attrition. We also consider choice between a panel and repeated cross sections.

We are grateful for the comments of Max Tabord-Meehan, Kei Hirano, and a reviewer.


1. Introduction

Missing data problems are ubiquitous in data collection. In surveys, missing data may arise from

unit nonresponse, item nonresponse, and panel attrition. Researchers who want to minimize the mean square

error of estimates in surveys with missing data should be concerned with both bias and variance, as

recommended in the literature on total survey error. However, statisticians have focused on variance, as

explained by Groves and Lyberg (2010): “The total survey error format forces attention to both variance

and bias terms. . . . . . Most statistical attention to surveys is on the variance terms—largely, we suspect,

because that is where statistical estimation tools are best found” (p. 868).

Dominitz and Manski (2017) provided tools for making sample design choices that explicitly

account for both variance and bias while imposing no assumptions that restrict the unobserved outcomes.

The analysis used the Wald framework of statistical decision theory to study choice between two or more

sampling processes that differ in the cost of data collection and the quality of the data obtained, where data

quality is determined by the response rate.

We studied application of the minimax-regret criterion, which seeks a decision that is uniformly near

optimal. Initially considering the decision problem in abstraction, we observed that analytical solution is

feasible only in special cases and that numerical solution is often computationally

challenging. To make progress, we focused on minimax-regret sample design for prediction of a real-

valued outcome under square loss.

The ideal, but unknown, best predictor in this familiar setting of square loss is the population mean

outcome. In the Wald framework, the risk of a candidate predictor using finite sample data is the sum of

the population variance of the outcome and the mean square error (MSE) of the predictor as an estimate of

the mean outcome. The regret of a predictor is its MSE as an estimate of the mean. A minimax-regret

predictor minimizes maximum mean square error. Thus, minimax-regret prediction of the outcome is

equivalent to minimax estimation of the population mean.
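The decomposition described in this paragraph can be written out in one step. As a sketch, let δ denote a predictor computed from the sample and let y be a new outcome drawn independently of the sample, with μ = E(y); the symbols δ and μ are introduced here only for illustration.

```latex
\begin{aligned}
E\big[(y - \delta)^2\big]
  &= E\big[(y - \mu)^2\big] + 2\,E\big[(y - \mu)(\mu - \delta)\big] + E\big[(\mu - \delta)^2\big] \\
  &= V(y) + E\big[(\delta - \mu)^2\big].
\end{aligned}
```

The cross term vanishes because y is independent of δ and E(y − μ) = 0. The risk of the infeasible best predictor μ is V(y), so the regret of δ is the remaining term, its mean square error as an estimate of μ.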


Even though prediction under square loss is simpler to study than prediction under other loss

functions, it is still computationally challenging to determine the value and maximum regret of the predictor

that minimizes maximum regret when some data are missing. Seeking an approach that is both tractable

and reasonable, we studied prediction using the midpoint of a sample analog estimate of the identification

region for the population mean. This midpoint predictor is easy to compute, and its maximum regret has a

simple and sensible analytical form. If the identification interval for the population mean were known

rather than estimated, then its midpoint would be the minimax-regret prediction, which we find to be another

appealing aspect of this midpoint predictor.

We now build on this framework to study minimax-regret sample design in anticipation of missing

data, where the collected data will be used for prediction under square loss of functions of two variables.

Relative to our previous study, addressing this expanded prediction problem requires attention to additional

dimensions of data cost and quality. We specifically study choice of sample size for a two-period panel

with attrition. That is, we consider longitudinal data collection with a 100-percent response rate in period

1 and some nonresponse in period 2. Some findings apply as well to collection of data on two household

members and to cross-sectional surveys with item nonresponse.

Section 2 summarizes key elements of and findings from our previous study and calls attention to

some complications that must be addressed in the expanded prediction problem. Analysis may be impacted

not only by the higher dimension of the data but also by the form of the function whose value is to be

predicted. As in our previous study, one must take a stand on how the data will be used before making

sample design decisions.

Section 3 studies the maximum regret of sample designs for prediction of two types of functions.

The cases we study are prediction of the value of linear and indicator functions. In both cases, we presume

knowledge of the response rate and use of a midpoint predictor akin to that posed in the previous study.

Again, the attractions of midpoint predictors are that they are easy to compute and have sensible analytical

forms for regret. Again, it is computationally challenging to determine the value and maximum regret of


the predictor that minimizes maximum regret when the identification region for the population mean must

be estimated.

Section 4 compares the maximum regret of predictions with panel data to maximum regret of

prediction with repeated cross-sectional (RCS) data. We find that RCS data collection often yields smaller

maximum regret than a panel with equivalent sample size and cost when interest centers on prediction of

linear functions. A particularly striking result is that RCS yields smaller maximum regret than a panel with

complete response. The reason is that RCS draws an independent random sample each period, but panel

observations may be correlated across periods. However, collection of panel data is typically more

informative than RCS when the problem is to predict the value of an indicator function.

Section 5 discusses extensions of the analysis.

2. Best Prediction under Square Loss of Functions of Two Variables

Consider best prediction under square loss of a bounded real function f(y1, y2), where (y1, y2) take

values in a bounded interval on R², normalized to be the unit square [0, 1] × [0, 1]. Let P(y1, y2) be the

probability distribution of (y1, y2) in a population that is a continuum. Then the best predictor is E[f(y1,

y2)]. The regret of a predictor based on sample data is its mean square error. The subscripts 1 and 2 may

refer to time periods in a panel study, a husband and a wife in a study of households, or two different

variables associated with each individual in a cross-section.

Suppose a random sample is drawn from P(y1, y2), but there may be missing data. Let zt = 1 if yt

is observed and zt = 0 if yt is missing for t = 1, 2. In a cross-sectional survey, unit nonresponse means that

z1 = z2 = 0, whereas item nonresponse means that either (z1 = 0, z2 = 1) or (z1 = 1, z2 = 0). In a panel with

full response in the first period but some attrition in the second, attrition means that (z1 = 1, z2 = 0). We

focus on this case. Thus, we assume P(z1 = 1) = 1, but permit P(z2 = 1) < 1. We assume knowledge of the

response rate in the second period but no knowledge of the composition of nonresponse.


Assuming knowledge of P(z2 = 1) simplifies our analysis greatly. It appears that numerical

computation of maximum regret must generally be performed if P(z2 = 1) is estimated rather than known.

The assumption of a known response rate is sometimes realistic, because historical experience may give

survey designers a sense of the attrition rate to expect with various modes of survey administration (face-

to-face, telephone, internet). We do not go further and assume additional knowledge. For example, we do

not assume knowledge of the second-period response rate conditional on first-period responses.

Although we focus on collection of panel data, our analysis applies as well to household surveys in

which a researcher always interviews a specified spouse, but interviews only a subset of the other spouses.

It also applies to surveys of individuals in which all sample members respond to one question but only a

subset responds to another question. For example, the first question may ask about a non-sensitive matter

such as age or education, while the second asks about a sensitive matter such as income or drug use.

In general, survey response may vary with the process used to collect data. For example, a person

may agree to provide data in an internet survey but not to be interviewed face-to-face. To make the

dependence of response on the survey process explicit, we could denote the process by q and the missing-

data indicators by (zq1, zq2). For simplicity, we keep the q notation implicit in most of the paper.

2.1 Previous Findings

Dominitz and Manski (2017) studied minimax-regret sample design for best prediction under

square loss of a function of one variable, when a high-cost/high-quality sampling process accurately

measures the outcome of each sample member and a low-cost/low-quality sampling process has

nonresponse. We considered a cross-sectional survey, with all data obtained at t = 1. The analysis assumed

knowledge of the response rate, but it imposed no restrictions on the values of the data missing due to

nonresponse. Using the present notation, one chooses between two processes for measuring y1. The high-

cost process has P(z1 = 1) = 1, whereas the low-cost process has known P(z1 = 1) < 1. Therefore, the high-

cost process point-identifies E(y1), whereas the low-cost process partially identifies E(y1).


With a predetermined budget, the minimax-regret choice between these two sampling processes is

easy to determine when it is assumed that specific reasonable predictors will be used. We assumed that a

sample-average predictor will be calculated based on the high-cost data and a midpoint predictor will be

calculated based on the low-cost data. The midpoint predictor is the middle of a sample analog estimate of

the identification region for the population mean. Among other findings, we showed that the maximum

regret of the midpoint predictor is smaller than that of sample-average predictors that have been commonly

used by researchers who face missing data problems. The latter predictors use ignorability assumptions to

impute missing values or to motivate discarding these sample members.

The analysis generalizes to designs that combine low-cost and high-cost sampling processes, under

the assumption that the observed outcomes will be pooled. Further, when the budget is not predetermined,

the analysis shows how to choose a budget sufficient to achieve an ε-optimal design; that is, a budget

sufficient to make maximum regret less than a specified ε > 0.

2.2. Application to Functions of Two Variables

Predicting the value of a function of two variables requires attention to additional dimensions of

cost and quality, as well as to the form of the function. The general approach to analyzing the problem,

however, is unchanged. We illustrate the approach for linear and indicator functions in Section 3. In

particular, we first determine the identification region for the best predictor under the assumed data

generating process. Second, we define for any sample size a midpoint predictor based on a sample analog

of this identification region. Finally, we solve for the maximum regret of this predictor in the case of

indicator functions and an informative upper bound on maximum regret in the case of linear functions.

When f(y1, y2) is a general function, analysis similar to that performed in Dominitz and Manski

(2017) yields an outer bound on E[f(y1, y2)]; that is, a bound that holds but need not be sharp. The midpoint

predictor studied there is applicable, but it may not be the best possible. Suppose, for example, that one

knows the rate of complete response; that is, P(z1 = z2 = 1). The outer bound is obtained by (a) noting that


the value of f(y1, y2) is observed if both y1 and y2 are observed and (b) considering the value of f(y1, y2) to

be missing otherwise. The bound obtained in this manner may not be sharp because observation of either

but not both of y1 and y2 may constrain the value of f(y1, y2). Section 3 studies two classes of functions in

which this occurs.

3. Linear and Indicator Functions of Two Variables

The main new contributions of this paper are to study prediction of the values of two classes of

functions whose structure is such that observation of y1 alone may be informative about the value of f(y1,

y2). We consider prediction of the value of linear functions in Section 3.1 and indicator functions in Section

3.2. We assume complete response in period 1 and that one knows the response rate in period 2. We

suppose throughout that the midpoint of a sample analog of the identification region for E[f(y1, y2)] is used

to predict the function value.

To motivate the classes of functions we analyze, consider studies of employment dynamics, such

as those that have utilized the Panel Study of Income Dynamics for the past 50 years. Let y1 and y2 denote

the fraction of the year that a person works in years 1 and 2. Interest may center on employment change

(y2 − y1) or average employment (y1 + y2)/2. Section 3.1 covers such linear functions. Alternatively, one

may want to predict the occurrence of an event, such as employment growth. Then the objective is

prediction of the value of the indicator function 1[y2 > y1]. Section 3.2 covers prediction of indicator

functions.

3.1. Linear Functions

Let f(y1, y2) = a + b·y1 + c·y2 for known values of (a, b, c). For ease of exposition, we consider

functions where b ≥ 0 and c ≥ 0. The analysis may be extended to other cases by rearranging terms in the


limits of the identification region below and then making corresponding revisions in the definition of the

midpoint predictor and in the derivation of regret.

The best predictor is E[f(y1, y2)] = a + b·E(y1) + c·E(y2). The identification region for E[f(y1, y2)] with panel data is the interval

(1)  $\left[\, a + bE(y_1) + cE(y_2 \mid z_2 = 1)P(z_2 = 1),\;\; a + bE(y_1) + cE(y_2 \mid z_2 = 1)P(z_2 = 1) + cP(z_2 = 0) \,\right]$

     $= \left[\, a + bE(y_1) + cE(y_2 z_2),\;\; a + bE(y_1) + cE(y_2 z_2) + cP(z_2 = 0) \,\right].$

This interval may be derived by applying the Law of Iterated Expectations. The lower bound obtains when P(y2|z2 = 0) is degenerate at the value 0, the lower limit of the support of P(y2). The upper bound obtains when P(y2|z2 = 0) is degenerate at the value 1, the upper limit of the support of P(y2). The width of this interval is c·P(z2 = 0).
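The interval in (1) is simple to compute once its three inputs are in hand. The following is a minimal sketch, not code from the paper; the function name `linear_identification_interval` and the numerical inputs are hypothetical, and the sketch assumes b ≥ 0, c ≥ 0, and outcomes normalized to [0, 1].

```python
def linear_identification_interval(a, b, c, mean_y1, mean_y2_resp, resp_rate):
    """Interval (1) for E[a + b*y1 + c*y2] when y2 lies in [0, 1] and b, c >= 0.

    mean_y1      -- E(y1), point-identified under full period-1 response
    mean_y2_resp -- E(y2 | z2 = 1), the mean among period-2 respondents
    resp_rate    -- P(z2 = 1), assumed known
    """
    if b < 0 or c < 0:
        raise ValueError("this sketch covers only b >= 0 and c >= 0")
    # Lower limit: every missing y2 equals 0; upper limit: every missing y2 equals 1.
    lower = a + b * mean_y1 + c * mean_y2_resp * resp_rate
    upper = lower + c * (1.0 - resp_rate)
    return lower, upper


# Hypothetical inputs: average employment (a = 0, b = c = 1/2) with 80% response.
lower, upper = linear_identification_interval(0.0, 0.5, 0.5, 0.6, 0.7, 0.8)
print(lower, upper, upper - lower)
```

The width returned is c times the attrition rate, regardless of the other inputs, which is why the response rate alone governs the bias component of regret.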

3.1.1. Panel Data Midpoint Predictor

Let mt be the sample average of the observed values in period t; that is, $m_1 = \frac{1}{N_1}\sum_{i=1}^{N_1} y_{1i}$ and $m_2 = \frac{1}{N_1}\sum_{i=1}^{N_1} y_{2i} z_{2i}$. A sample-analog midpoint predictor is

(2)  $a + b m_1 + c m_2 + \frac{c}{2} P(z_2 = 0).$
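As an illustration, predictor (2) can be computed from a small panel as follows. This is a sketch under stated assumptions: the function name `panel_midpoint_predictor` is hypothetical, a missing period-2 outcome is encoded as `None`, and the response rate is passed in as a known quantity rather than estimated from the sample.

```python
def panel_midpoint_predictor(a, b, c, y1, y2, resp_rate):
    """Midpoint predictor (2); y2[i] is None when unit i attrits in period 2."""
    n1 = len(y1)
    if n1 == 0 or len(y2) != n1:
        raise ValueError("y1 and y2 must be non-empty and of equal length")
    m1 = sum(y1) / n1
    # m2 averages y2 * z2 over ALL n1 initial sample members, so attriters add 0.
    m2 = sum(v for v in y2 if v is not None) / n1
    return a + b * m1 + c * m2 + 0.5 * c * (1.0 - resp_rate)


# Hypothetical data: four sample members, one of whom attrits; known P(z2 = 1) = 0.75.
y1 = [0.2, 0.8, 0.5, 1.0]
y2 = [0.4, None, 0.6, 1.0]
prediction = panel_midpoint_predictor(0.0, 0.5, 0.5, y1, y2, 0.75)
print(prediction)
```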

The regret of the predictor is its mean square error, the sum of squared bias and variance. To find

the squared bias of (2) with known period-2 response rate P(z2 = 1), use the Law of Iterated Expectations

to write the best predictor as E[f(y1, y2)] = a + b·E(y1) + c·E(y2 z2) + c·E(y2(1 − z2)). Under random sampling, E(m1) = E(y1) and E(m2) = E(y2 z2). Bias arises from deviation between E(y2(1 − z2)) and the midpoint predictor’s assigned value of ½ P(z2 = 0). Squared bias is therefore

$c^2 P(z_2 = 0)^2 \left[ E(y_2 \mid z_2 = 0) - \tfrac{1}{2} \right]^2.$

The variance of the predictor is

(3)  $V\left[ a + b m_1 + c m_2 + \frac{c}{2} P(z_2 = 0) \right] = b^2 V(m_1) + c^2 V(m_2) + 2bc \operatorname{Cov}(m_1, m_2).$

Random sampling implies that $V(m_1) = \frac{1}{N_1} V(y_1)$. We show in an Appendix that

(4a)  $V(m_2) = \frac{1}{N_1} P(z_2 = 1) \left[ E(y_2^2 \mid z_2 = 1) - E(y_2 \mid z_2 = 1)^2 P(z_2 = 1) \right]$

(4b)  $\operatorname{Cov}(m_1, m_2) = \frac{1}{N_1} P(z_2 = 1) \left[ E(y_1 y_2 \mid z_2 = 1) - E(y_1) E(y_2 \mid z_2 = 1) \right].$

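Hence,

(5)  $V\left[ a + b m_1 + c m_2 + \frac{c}{2} P(z_2 = 0) \right] = \frac{1}{N_1} \Big\{ b^2 V(y_1) + c^2 P(z_2 = 1) \left[ E(y_2^2 \mid z_2 = 1) - E(y_2 \mid z_2 = 1)^2 P(z_2 = 1) \right] + 2bc\, P(z_2 = 1) \left[ E(y_1 y_2 \mid z_2 = 1) - E(y_1) E(y_2 \mid z_2 = 1) \right] \Big\}.$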
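The variance expression assembled from (4a) and (4b) can be checked exactly on any finite population. The sketch below is illustrative only: the helper names and the population `pmf` are hypothetical, and for entries with z2 = 0 the listed y2 value stands for the unobserved outcome, which enters the predictor only through y2·z2 = 0.

```python
def expect(pmf, g):
    """Expectation of g(y1, y2, z2) under a finite pmf {(y1, y2, z2): prob}."""
    return sum(p * g(y1, y2, z2) for (y1, y2, z2), p in pmf.items())


def variance_via_formulas(pmf, b, c, n1):
    """Variance of predictor (2) assembled from V(m1), (4a), and (4b)."""
    p1 = expect(pmf, lambda y1, y2, z2: z2)                       # P(z2 = 1)
    e_y1 = expect(pmf, lambda y1, y2, z2: y1)
    v_y1 = expect(pmf, lambda y1, y2, z2: y1 * y1) - e_y1 ** 2
    e_y2_r = expect(pmf, lambda y1, y2, z2: y2 * z2) / p1         # E(y2 | z2 = 1)
    e_y2sq_r = expect(pmf, lambda y1, y2, z2: y2 * y2 * z2) / p1  # E(y2^2 | z2 = 1)
    e_y1y2_r = expect(pmf, lambda y1, y2, z2: y1 * y2 * z2) / p1  # E(y1 y2 | z2 = 1)
    v_m2 = p1 * (e_y2sq_r - e_y2_r ** 2 * p1) / n1                # (4a)
    cov = p1 * (e_y1y2_r - e_y1 * e_y2_r) / n1                    # (4b)
    return b * b * v_y1 / n1 + c * c * v_m2 + 2 * b * c * cov


def variance_direct(pmf, b, c, n1):
    """Same variance from first principles: V(b*y1 + c*y2*z2) / n1."""
    t = lambda y1, y2, z2: b * y1 + c * y2 * z2
    mean = expect(pmf, t)
    return (expect(pmf, lambda y1, y2, z2: t(y1, y2, z2) ** 2) - mean ** 2) / n1


# Hypothetical population with binary outcomes and a 70% response rate.
pmf = {(0, 0, 1): 0.2, (0, 1, 1): 0.1, (1, 0, 1): 0.1, (1, 1, 1): 0.3,
       (0, 1, 0): 0.1, (1, 0, 0): 0.2}
print(variance_via_formulas(pmf, 0.5, 0.5, 50), variance_direct(pmf, 0.5, 0.5, 50))
```

Agreement of the two numbers on a given population does not prove the formulas, but it is a quick way to catch a transcription error in either one.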

To maximize regret, one can consider the variance and bias terms separately. This is so because

bias depends only on E(y2|z2 = 0), which can vary independently of all the quantities that determine

variance. Squared bias is maximized if P(y2|z2 = 0) is degenerate at 0 or 1, in which case maximum squared bias is $\frac{c^2}{4} P(z_2 = 0)^2$.


Maximum variance across all states of nature has a simple form in the polar cases of no response

and complete response in period 2. When P(z2 = 1) = 0, the variance of the predictor reduces to $\frac{1}{N_1} b^2 V(y_1)$. V(y1) is maximized at ¼ when P(y1) is Bernoulli with mean ½. Hence, maximum variance is $\frac{b^2}{4N_1}$.

When P(z2 = 1) = 1, the variance of the predictor reduces to

$\frac{1}{N_1} \left\{ b^2 V(y_1) + c^2 V(y_2 \mid z_2 = 1) + 2bc \operatorname{Cov}(y_1, y_2 \mid z_2 = 1) \right\}.$

The variance and covariance terms can each take a maximum value of ¼ when (y1, y2) take values in [0, 1] × [0, 1]. That is,

(a)  $\max_{P(\cdot)} V(y_1) = \tfrac{1}{4}$

(b)  $\max_{P(\cdot)} V(y_2 \mid z_2 = 1) = \tfrac{1}{4}$

(c)  $\max_{P(\cdot)} \left| \operatorname{Cov}(y_1, y_2 \mid z_2 = 1) \right| = \tfrac{1}{4}$

The maxima in (b) and (c) are both achieved when P(y1, y2|z2 = 1) is bivariate Bernoulli with mean (½, ½)

and covariance ¼. Finally, with P(y1, y2|z2 = 1) bivariate Bernoulli with mean (½, ½), the maximum in (a)

is achieved when P(y1|z2 = 0) is also Bernoulli with mean ½.

It appears difficult to determine maximum variance across all states of nature in non-polar cases.

However, we can determine the maximum of an informative upper bound on the variance of the midpoint

predictor. The Appendix shows that


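(6)  $V\left[ a + b m_1 + c m_2 + \frac{c}{2} P(z_2 = 0) \right] \le \frac{1}{N_1} \Big\{ b^2 V(y_1) + c^2 P(z_2 = 1) V(y_2 \mid z_2 = 1) + 2bc\, P(z_2 = 1) \operatorname{Cov}(y_1, y_2 \mid z_2 = 1) + (c^2 + 2bc) P(z_2 = 1) P(z_2 = 0) \Big\}.$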

Using the same argument as for the case with P(z2 = 1) = 1, the maximum of the upper bound on the variance is $\frac{b^2 + (c^2 + 2bc) P(z_2 = 1)}{4N_1} + \frac{1}{N_1} (c^2 + 2bc) P(z_2 = 1) P(z_2 = 0)$.

Putting the maximum of the upper bound on variance and maximum squared bias together, the

upper bound on the maximum regret of the panel data midpoint predictor is

(7)  $\frac{1}{4} \left[ \frac{b^2 + (c^2 + 2|bc|) P(z_2 = 1)}{N_1} + c^2 P(z_2 = 0)^2 \right] + \frac{1}{N_1} (c^2 + 2bc) P(z_2 = 1) P(z_2 = 0).$

Inspection of (7) reveals that the upper bound on maximum regret is decreasing in the initial sample size

N1, holding the response rate fixed. This holds because the upper bound on maximum variance declines

while maximum squared bias is unchanged.

Observe that the upper bound (7) reduces to $\frac{1}{4} \left[ \frac{b^2 + c^2 + 2|bc|}{N_1} \right]$ in the polar case where P(z2 = 1) = 1. In this case, the upper bound is achieved when P(y1, y2|z2 = 1) is bivariate Bernoulli with mean (½, ½) and covariance ¼. In the polar case of P(z2 = 1) = 0, (7) reduces to $\frac{1}{4} \left[ \frac{b^2}{N_1} + c^2 \right]$. In this case, the upper bound is achieved when P(y1) is Bernoulli with mean ½ and P(y2|z2 = 0) is degenerate at 0 or 1.
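For design comparisons it helps to have the bound (7) as a function of the response rate and the initial sample size. The sketch below is one way to code it, assuming b ≥ 0 and c ≥ 0 as in the text; the function name `linear_regret_upper_bound` and the numerical values are hypothetical.

```python
def linear_regret_upper_bound(b, c, resp_rate, n1):
    """Upper bound (7) on maximum regret of the panel midpoint predictor (2)."""
    if b < 0 or c < 0:
        raise ValueError("this sketch covers only b >= 0 and c >= 0")
    p1, p0 = resp_rate, 1.0 - resp_rate
    # Maximum of the upper bound on variance.
    variance_part = (b * b + (c * c + 2 * abs(b * c)) * p1) / (4 * n1) \
        + (c * c + 2 * b * c) * p1 * p0 / n1
    # Maximum squared bias, which does not shrink with the sample size.
    bias_part = c * c * p0 * p0 / 4
    return variance_part + bias_part


# Hypothetical example: average employment (b = c = 1/2), 100 initial sample members.
for rate in (0.0, 0.5, 0.8, 1.0):
    print(rate, linear_regret_upper_bound(0.5, 0.5, rate, 100))
```

Evaluating the function at response rates 0 and 1 reproduces the two polar expressions, and increasing `n1` with the response rate fixed lowers only the variance part.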

3.2. Indicator Functions

Suppose now that the objective is best prediction of the event that (y1, y2) take values in some set

A ⊂ [0, 1] × [0, 1]. Then f(y1, y2) is the indicator function 1[(y1, y2) ∈ A] and the best predictor is P[(y1,

y2) ∈ A]. Examples include: (a) f(y1, y2) = 1[y1 = j, y2 = k] for some values j, k ∈ [0, 1], (b) f(y1, y2) = 1[y1 = j or y2 = k] for some j, k ∈ [0, 1], (c) f(y1, y2) = 1[y2 > y1], and (d) f(y1, y2) = 1[(y1 + y2)/2 > γ] for some γ ∈ (0, 1).

For ease of exposition, we focus on settings in which observation of y1 may imply that (y1, y2) ∉ A

but cannot imply that (y1, y2) ∈ A. This holds, for example, when f(y1, y2) = 1[y1 = j, y2 = k]. Then y1 ≠ j implies that (y1, y2) ∉ A, but y1 = j does not imply that (y1, y2) ∈ A. It also holds when f(y1, y2) = 1[(y1 + y2)/2 > γ] and γ > ½. Then y1 ≤ 2γ − 1 implies that (y1 + y2)/2 ≤ γ, but y1 > 2γ − 1 does not imply that (y1 + y2)/2 > γ. The analysis may be extended to settings where observation of y1 implies that (y1, y2) ∈ A

by adding a term to the lower limit of the identification region below to account for the probability of this

event among the observations with z2 = 0 and then making corresponding revisions to the definition of the

midpoint predictor and in the derivation of regret.

To obtain the identification region for E[f(y1, y2)], define the binary random variable u as follows: u = 1 if z2 = 0 and the observed value of y1 implies (y1, y2) ∉ A, u = 0 otherwise. The identification region for P[(y1, y2) ∈ A] is the interval

(8)  $\big[\, P((y_1, y_2) \in A,\ z_2 = 1),\;\; P((y_1, y_2) \in A,\ z_2 = 1) + P(z_2 = 0,\ u = 0) \,\big].$

This interval may be derived by applying the Law of Total Probability and recalling that, when z2 = 0, the

event (y1, y2) ∈ A can occur only when u = 0.

3.2.1. Panel Data Midpoint Predictor

A midpoint predictor based on (8) is the midpoint of its sample analog, namely

(9)  $\frac{1}{N_1} \sum_{i=1}^{N_1} 1[(y_{1i}, y_{2i}) \in A,\ z_{2i} = 1] + \frac{1}{2} \cdot \frac{1}{N_1} \sum_{i=1}^{N_1} 1(z_{2i} = 0,\ u_i = 0).$
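To make predictor (9) concrete, the sketch below computes it for the employment-growth event A = {y2 > y1}, where y1 = 1 rules out growth because y2 cannot exceed 1. The function name `indicator_midpoint_predictor`, its two callable arguments, and the data are hypothetical; a missing period-2 outcome is encoded as `None`.

```python
def indicator_midpoint_predictor(y1, y2, in_set, y1_rules_out):
    """Midpoint predictor (9) for P[(y1, y2) in A].

    in_set(a, b)    -- True when (a, b) lies in A
    y1_rules_out(a) -- True when y1 = a alone implies (y1, y2) is not in A
    y2[i] is None when unit i attrits in period 2.
    """
    n1 = len(y1)
    if n1 == 0 or len(y2) != n1:
        raise ValueError("y1 and y2 must be non-empty and of equal length")
    observed_in_set = 0   # units with z2 = 1 and (y1, y2) in A
    undetermined = 0      # units with z2 = 0 and u = 0
    for a, b in zip(y1, y2):
        if b is not None:
            observed_in_set += 1 if in_set(a, b) else 0
        elif not y1_rules_out(a):
            undetermined += 1
    return observed_in_set / n1 + 0.5 * undetermined / n1


# Hypothetical panel of five people; two attrit, one of them with y1 = 1.
y1 = [0.2, 1.0, 0.5, 0.9, 1.0]
y2 = [0.6, None, 0.4, None, 1.0]
prediction = indicator_midpoint_predictor(
    y1, y2,
    in_set=lambda a, b: b > a,
    y1_rules_out=lambda a: a >= 1.0,
)
print(prediction)
```

In this sample only one of the two attriters leaves the event undetermined, so the half-weight applies to one unit rather than two.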


To solve for the maximum regret of this predictor, we first derive its squared bias and variance.

We then maximize the sum. To shorten the notation, we define two Bernoulli random variables w = 1[(y1, y2) ∈ A, z2 = 1] and x = 1(z2 = 0, u = 0). Then we rewrite (9) as

(9’)  $\frac{1}{N_1} \sum_{i=1}^{N_1} w_i + \frac{1}{2} \cdot \frac{1}{N_1} \sum_{i=1}^{N_1} x_i.$

Squared Bias

To find the squared bias, use the Law of Total Probability to write the best predictor

(10)  $P[(y_1, y_2) \in A] = P[(y_1, y_2) \in A,\ z_2 = 1] + P[(y_1, y_2) \in A,\ z_2 = 0].$

Under random sampling, E(w) = P[(y1, y2) ∈ A, z2 = 1] and E(x) = P(z2 = 0, u = 0). Therefore, bias arises from deviation between P[(y1, y2) ∈ A, z2 = 0] and P(z2 = 0, u = 0)/2. The event [(y1, y2) ∈ A, z2 = 0] implies that u = 0. Hence, P[(y1, y2) ∈ A, z2 = 0] = P[(y1, y2) ∈ A, z2 = 0, u = 0] and squared bias may be expressed as follows:

(11)  $\left\{ P[(y_1, y_2) \in A \mid z_2 = 0,\ u = 0] - \tfrac{1}{2} \right\}^2 P(z_2 = 0,\ u = 0)^2.$

Variance

To find the variance, note that, under random sampling, the midpoint predictor (9) is a linear function of the bivariate Bernoulli random variable (w, x) whose realizations are independent and identically distributed across individuals i. Let p_st = P(w = s, x = t), s ∈ {0, 1}, t ∈ {0, 1}. Note that E(w) = p10 + p11, E(x) = p01 + p11, V(w) = (p10 + p11)(1 − p10 − p11), and V(x) = (p01 + p11)(1 − p01 − p11). Analysis of bivariate Bernoulli random variables in Dai et al. (2013), equation (2.12) shows that Cov(w, x) = p11·p00 − p01·p10.


We also know that P(w = 1) ≤ P(z2 = 1), P(x = 1) ≤ P(z2 = 0), and P(w = 1, x = 1) = 0. It now follows that

- E(w) = p10 ≤ P(z2 = 1)

- E(x) = p01 ≤ P(z2 = 0)

- E(wx) = 0

- V(w) = p10(1 − p10)

- V(x) = p01(1 − p01)

- Cov(w, x) = −p01·p10

- p10 + p01 + p00 = 1

Thus, the variance of the midpoint predictor (9) can be written as follows:

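(12)  $\frac{1}{N_1} \left[ V(w) + \frac{1}{4} V(x) + 2 \cdot \frac{1}{2} \operatorname{Cov}(w, x) \right] = \frac{1}{N_1} \left[ p_{10}(1 - p_{10}) + \frac{1}{4} p_{01}(1 - p_{01}) - p_{10} p_{01} \right].$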
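Because w and x cannot both equal 1, each individual's contribution w + x/2 takes only the values 1, ½, and 0, so the variance in (12) can be verified directly from this three-point distribution. The sketch below does so; the function names and the probabilities used are hypothetical.

```python
def midpoint_variance_direct(p10, p01, n1):
    """Variance of the sample mean of w + x/2, from its three-point distribution."""
    p00 = 1.0 - p10 - p01
    if min(p10, p01, p00) < 0:
        raise ValueError("probabilities must be non-negative and sum to at most 1")
    values = {1.0: p10, 0.5: p01, 0.0: p00}   # value of w + x/2 -> probability
    mean = sum(v * p for v, p in values.items())
    second_moment = sum(v * v * p for v, p in values.items())
    return (second_moment - mean ** 2) / n1


def midpoint_variance_formula(p10, p01, n1):
    """Right-hand side of (12)."""
    return (p10 * (1 - p10) + 0.25 * p01 * (1 - p01) - p10 * p01) / n1


# Hypothetical state of nature: P(w = 1) = 0.4 and P(x = 1) = 0.2.
print(midpoint_variance_direct(0.4, 0.2, 100), midpoint_variance_formula(0.4, 0.2, 100))
```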

Maximum Regret

Summing (11) and (12), the regret of the midpoint predictor is

(13)  $\frac{1}{N_1} \left[ p_{10}(1 - p_{10}) + \frac{1}{4} p_{01}(1 - p_{01}) - p_{10} p_{01} \right] + \left\{ P[(y_1, y_2) \in A \mid z_2 = 0,\ u = 0] - \tfrac{1}{2} \right\}^2 p_{01}^2.$

Note that p01 = P(z2 = 0, u = 0) is found in both the variance and the squared bias components of regret.

Hence, in contrast to the case with linear functions, the maximum regret of the predictor of an indicator

function cannot be determined by separately maximizing variance and squared bias.

Setting P[(y1, y2) ∈ A|z2 = 0, u = 0] = 0 or 1 maximizes squared bias for any feasible value of (p10, p01) and does not affect variance. P[(y1, y2) ∈ A|z2 = 0, u = 0] is only defined if p01 > 0 but, if p01 = 0, there is no missing data problem. Hence, maximum regret for a given value of (p10, p01) is

(14)  $\frac{1}{N_1} \left[ p_{10}(1 - p_{10}) + \frac{1}{4} p_{01}(1 - p_{01}) - p_{10} p_{01} \right] + \frac{1}{4} p_{01}^2.$

The problem is to maximize (14) over the feasible range p10 ≤ P(z2 = 1) and p01 ≤ P(z2 = 0). Fix p01 at any feasible value and differentiate (14) with respect to p10. The derivative $\frac{1}{N_1}(1 - p_{01} - 2p_{10})$ is decreasing in p10. Hence, the maximum occurs at the interior solution p10 = (1 − p01)/2 if this is a feasible value of p10 and at the boundary p10 = P(z2 = 1) otherwise. Considering first the interior solution for p10, plug p10 = (1 − p01)/2 into (14) and solve the concentrated optimization problem

(15)  $\max_{p_{01}} \; \frac{1}{4} \left[ \frac{1 - p_{01}}{N_1} + p_{01}^2 \right]$  s.t.  $p_{01} \in [0, P(z_2 = 0)].$

The derivative $-\frac{1}{4N_1} + \frac{1}{2} p_{01}$ is increasing in p01. Therefore, with an interior solution for p10, regret is maximized at the boundary where either p01 = 0 or p01 = P(z2 = 0), in which case p10 = ½ or p10 = ½ P(z2 = 1), respectively. Inspection of the possible boundary solutions shows that, when P(z2 = 1) ≤ 1 − 1/N1, maximum regret occurs where p01 = P(z2 = 0) and p10 = ½ P(z2 = 1). It follows that the maximum regret of the midpoint predictor (9) in this typical setting is

(16)  $\frac{1}{4} \left[ \frac{1}{N_1} P(z_2 = 1) + P(z_2 = 0)^2 \right].$

Inspection of (16) shows that maximum regret is decreasing in the initial sample size N1, holding

the response rate fixed. Differentiation of (16) with respect to the response rate shows that, holding the

sample size fixed, maximum regret is decreasing in the response rate when P(z2 = 1) < 1 − 1/(2N1).
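The closed form (16) can be compared with a brute-force search over the feasible (p10, p01) pairs. The following sketch is a numerical check rather than a proof; the function names, the grid resolution, and the two test cases are hypothetical choices, and both cases satisfy P(z2 = 1) ≤ 1 − 1/N1.

```python
def indicator_regret(p10, p01, n1):
    """Expression (14): regret after maximizing over the unidentified probability."""
    variance = (p10 * (1 - p10) + 0.25 * p01 * (1 - p01) - p10 * p01) / n1
    return variance + 0.25 * p01 ** 2


def indicator_max_regret(resp_rate, n1):
    """Closed form (16), stated in the text for resp_rate <= 1 - 1/n1."""
    return 0.25 * (resp_rate / n1 + (1.0 - resp_rate) ** 2)


def grid_max_regret(resp_rate, n1, steps=200):
    """Maximum of (14) over a grid on p10 <= P(z2 = 1) and p01 <= P(z2 = 0)."""
    p1, p0 = resp_rate, 1.0 - resp_rate
    return max(
        indicator_regret(p1 * i / steps, p0 * j / steps, n1)
        for i in range(steps + 1)
        for j in range(steps + 1)
    )


for rate, n1 in [(0.8, 100), (0.3, 10)]:
    print(rate, n1, grid_max_regret(rate, n1), indicator_max_regret(rate, n1))
```

An even number of grid steps is used so that the candidate maximizer p10 = ½ P(z2 = 1), p01 = P(z2 = 0) is itself a grid point.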


3.2.2. Outer-Bound Midpoint Predictor

To apply the Dominitz and Manski (2017) midpoint predictor to indicator functions of two

variables, once again consider the value of f(y1, y2) to be missing when y2 is missing. Then the identification

region for E[f(y1, y2)] is the interval

(17)  $\big[\, P((y_1, y_2) \in A,\ z_2 = 1),\;\; P((y_1, y_2) \in A,\ z_2 = 1) + P(z_2 = 0) \,\big].$

A midpoint predictor based on this outer bound on E[f(y1, y2)] is

(18)  $\frac{1}{N_1} \sum_{i=1}^{N_1} 1[(y_{1i}, y_{2i}) \in A,\ z_{2i} = 1] + \frac{1}{2} \cdot \frac{1}{N_1} \sum_{i=1}^{N_1} 1(z_{2i} = 0).$

Note that (18) differs from (9) only in the arguments of the second indicator function; that is, 1(z2i = 0) versus 1(z2i = 0, ui = 0). Maximum regret of midpoint predictor (18) is identical to maximum regret of midpoint predictor (9), because maximum regret of (9) arises where p01 = P(z2 = 0), and, therefore, 1(z2i = 0) = 1(z2i = 0, ui = 0) for all i.

Equivalence of the two midpoint predictors with respect to maximum regret does not imply that

the two are equivalent in all states. In fact, midpoint predictor (9) dominates (18). That is, its regret is less

than that of (18) in states where first-period data may be informative and equals that of (18) in the "worst-case" states where the first-period data are always uninformative. The latter are states in which z2 = 0 always

implies that u = 0, so observation of the period-1 outcome is not informative about the value of E[f(y1, y2)].

Maximum regret occurs in states when the first-period data are always uninformative. Hence, the maximum

regret of (9) equals the maximum regret of (18).

3.3. Choice of Sample Design


Suppose that a set Q of sampling processes is feasible. Each q ∈ Q has a cost πq per initial sample

member and a vector of response rates P(zq1 = i, zq2 = j), where i and j equal 0 or 1. We assume for simplicity

that the cost per sample member does not depend on whether a person responds in period 2. Also for

simplicity, we consider choice between two designs, the lower-cost design having higher attrition. Let N1L

and N1H be the two initial sample sizes. The low-cost/low-quality design L has total cost πL∙N1L and

response rate ρL in period 2, whereas the high-cost/high-quality design has total cost πH∙N1H and response

rate ρH in period 2, with 0 < πL < πH and ρL < ρH < 1.

Our analysis assumes that the design is chosen ex ante with commitment, before observation of the

distribution of responses in period 1. It may sometimes be feasible to defer choice of the sampling plan for

period 2 until after the responses in period 1 have been obtained. In such cases, one might contemplate

sequential designs as well as ones that are chosen ex ante with commitment. Sequential sample design has

long been a subject of study in the literature on Bayesian decision making, where it is a classical dynamic

programming problem. However, being Bayesian requires specification of a precise prior subjective

distribution on all relevant unknown quantities. It appears challenging to characterize the properties of

sequential procedures that do not invoke prior subjective distributions, such as minimax-regret. We do not

examine sequential designs here.

In principle, one may have additional information about the sample design beyond cost and attrition

rate that would impact the calculation of maximum regret and choice among designs. For example, the

composition of those who choose to respond or not to respond in period 2 could be known to vary across

designs q and q′ even if they have identical response rates. Thus, it may be that P(zq2 = j) = P(zq′2 = j) yet P(y1, y2|zq2 = j) ≠ P(y1, y2|zq′2 = j). We assume that no such information is available prior to data

collection.

3.3.1 Allocation of a Predetermined Budget

Suppose that the objective is to predict the value of a linear or indicator function, using the relevant


midpoint predictor. Suppose that the planner has a predetermined budget B and must choose between one

of the two designs. The feasible sample sizes are N1L = INT(B/πL) for low-cost sampling and N1H =

INT(B/πH) for high-cost sampling. We henceforth ignore for simplicity the fact that sample sizes must be

integers and take the feasible low-cost sample size to be N1L = B/πL and the feasible high-cost sample size

to be N1H = B/πH .

Consider first best prediction of the value of the linear function f(y1, y2) = a + b·y1 + c·y2 for known values of (a, b, c). Using the midpoint predictor (2), the feasible low-cost and high-cost designs yield upper bounds on maximum regret of

$\frac{1}{4} \left[ \frac{b^2 + (c^2 + 2bc)\rho_L}{B/\pi_L} + c^2 (1 - \rho_L)^2 \right] + \frac{(c^2 + 2bc)\rho_L (1 - \rho_L)}{B/\pi_L}$

and

$\frac{1}{4} \left[ \frac{b^2 + (c^2 + 2bc)\rho_H}{B/\pi_H} + c^2 (1 - \rho_H)^2 \right] + \frac{(c^2 + 2bc)\rho_H (1 - \rho_H)}{B/\pi_H},$

respectively. Hence, the low-cost design has a

smaller upper bound on maximum regret when the budget is less than a certain threshold and the high-cost

design does otherwise. The threshold budget is

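(19)  $B = \dfrac{\pi_H \left[ b^2 + (c^2 + 2bc)\rho_H \right]/4 \;-\; \pi_L \left[ b^2 + (c^2 + 2bc)\rho_L \right]/4 \;+\; \pi_H (c^2 + 2bc)\rho_H (1 - \rho_H) \;-\; \pi_L (c^2 + 2bc)\rho_L (1 - \rho_L)}{c^2 \left[ (1 - \rho_L)^2 - (1 - \rho_H)^2 \right]/4}.$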

Consider now best prediction of the value of indicator function 1[(y1, y2) ∈ A]. Using midpoint

predictor (9), the feasible low-cost and high-cost designs yield maximum regret $\frac{1}{4} \left[ \frac{\rho_L}{B/\pi_L} + (1 - \rho_L)^2 \right]$ and $\frac{1}{4} \left[ \frac{\rho_H}{B/\pi_H} + (1 - \rho_H)^2 \right]$, respectively. The low-cost design has smaller maximum regret when the budget is

less than a certain threshold and the high-cost design is better otherwise. The threshold budget is

(20)  $B = \dfrac{\pi_H \rho_H - \pi_L \rho_L}{(1 - \rho_L)^2 - (1 - \rho_H)^2}.$
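The threshold budgets follow from equating the two designs' regret expressions and solving for B, since each expression has the form (variance coefficient)/B plus a bias term that does not depend on B. The sketch below carries out that calculation for both function classes; all function names and the design parameters (costs 1 and 3, response rates 0.80 and 0.85) are hypothetical.

```python
def linear_bound(budget, cost, rho, b, c):
    """Upper bound on maximum regret for a linear function, with N1 = budget / cost."""
    n1 = budget / cost
    k = c * c + 2 * b * c
    return 0.25 * ((b * b + k * rho) / n1 + c * c * (1 - rho) ** 2) + k * rho * (1 - rho) / n1


def indicator_max_regret(budget, cost, rho):
    """Maximum regret for an indicator function, with N1 = budget / cost."""
    n1 = budget / cost
    return 0.25 * (rho / n1 + (1 - rho) ** 2)


def linear_threshold(cost_l, rho_l, cost_h, rho_h, b, c):
    """Budget at which the two linear-function bounds coincide."""
    k = c * c + 2 * b * c

    def variance_coef(cost, rho):
        # Coefficient on 1/B in the bound.
        return cost * (0.25 * (b * b + k * rho) + k * rho * (1 - rho))

    bias_gap = 0.25 * c * c * ((1 - rho_l) ** 2 - (1 - rho_h) ** 2)
    return (variance_coef(cost_h, rho_h) - variance_coef(cost_l, rho_l)) / bias_gap


def indicator_threshold(cost_l, rho_l, cost_h, rho_h):
    """Budget at which the two indicator-function maximum regrets coincide."""
    return (cost_h * rho_h - cost_l * rho_l) / ((1 - rho_l) ** 2 - (1 - rho_h) ** 2)


b_ind = indicator_threshold(1.0, 0.80, 3.0, 0.85)
b_lin = linear_threshold(1.0, 0.80, 3.0, 0.85, 0.5, 0.5)
print(b_ind, b_lin)
```

Below the threshold the low-cost design's larger sample outweighs its larger worst-case bias; above it the ordering reverses. With these hypothetical numbers both designs also satisfy the sample-size condition under which the indicator-function expression was derived.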

3.3.2. Choice of Budget to Achieve ε-Optimal Prediction

Now suppose that budget is a choice variable. In principle, the planner should perform a benefit-


cost analysis. Devoting a larger budget to data collection improves prediction of outcomes but diverts

resources from other uses. The planner must resolve this tension.

Adapting arguments in Manski and Tetenov (2016) regarding sample size selection to enable ε-

optimal treatment decisions, Dominitz and Manski (2017) consider choice of a design so that the maximum

MSE of the midpoint predictor is no larger than a specified ε > 0. If the objective is prediction of the value

of linear functions, the analysis above shows that a budget of size B suffices to achieve this objective if

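(21) $\min\left\{\frac{1}{4}\left[\frac{b^2 + (c^2 + 2bc)\rho_L}{B/\pi_L} + c^2(1-\rho_L)^2\right] + \frac{(c^2 + 2bc)\rho_L(1-\rho_L)}{B/\pi_L},\; \frac{1}{4}\left[\frac{b^2 + (c^2 + 2bc)\rho_H}{B/\pi_H} + c^2(1-\rho_H)^2\right] + \frac{(c^2 + 2bc)\rho_H(1-\rho_H)}{B/\pi_H}\right\} \le \varepsilon$.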

Similarly, for indicator functions, a budget of size B suffices to achieve this objective if

(22) $\min\left\{\frac{1}{4}\left[\frac{\rho_L}{B/\pi_L} + (1-\rho_L)^2\right],\; \frac{1}{4}\left[\frac{\rho_H}{B/\pi_H} + (1-\rho_H)^2\right]\right\} \le \varepsilon$.

These budget sizes suffice for ε-optimality but may not be necessary. The smallest budgets that

enable ε-optimal prediction occur when one uses MMR predictors rather than the tractable midpoint

predictors studied in Sections 3.1 and 3.2. In the absence of knowledge of the MMR predictors, we can

provide sufficient budget sizes but not necessary ones.
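To make the sufficiency condition for indicator functions concrete, the following sketch solves the inequality for B design by design. For a single design the condition (1/4)[ρπ/B + (1 − ρ)²] ≤ ε can hold only if 4ε > (1 − ρ)², in which case it is equivalent to B ≥ ρπ / (4ε − (1 − ρ)²). The function names and the numerical values are hypothetical, and the result is a budget that suffices for the midpoint predictor, not a necessary one.

```python
import math


def indicator_max_regret(rho, pi, budget):
    """Maximum regret (1/4)[rho / (budget/pi) + (1 - rho)^2] of the midpoint predictor."""
    n = budget / pi
    return 0.25 * (rho / n + (1.0 - rho) ** 2)


def sufficient_budget(rho, pi, eps):
    """Smallest budget for which one design's maximum regret is at most eps.

    Returns math.inf when the squared-bias term alone already exceeds eps,
    so that no budget can achieve the target with this design.
    """
    slack = 4.0 * eps - (1.0 - rho) ** 2
    if slack <= 0.0:
        return math.inf
    return rho * pi / slack


def smallest_sufficient_budget(designs, eps):
    """Smallest budget for which the minimum over designs is at most eps."""
    return min(sufficient_budget(rho, pi, eps) for rho, pi in designs)


# Hypothetical designs given as (rho, pi) pairs.
designs = [(0.6, 1.0), (0.9, 2.0)]
budget = smallest_sufficient_budget(designs, eps=0.05)
```

In this illustration the high-cost design attains the target with the smaller budget, because the low-cost design's squared-bias term leaves little room for sampling variance.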

4. Panel Data versus Repeated Cross Sections

This section uses our analysis to guide sample design choice between panel data and repeated cross-

sectional (RCS) data, with complete response in each cross-section. Continuing the two-period framework

utilized above, RCS data are generated by a sampling process in which two independent random samples

are drawn, with (z1 = 1, z2 = 0) in one sample and (z1 = 0, z2 = 1) in the other; thus, P(z1 = 1, z2 = 1) = 0.

With repeated observations on individual sample members, panel data with full response point-identify the

joint distribution P(y1, y2), whereas RCS data point-identify only the period-specific marginal distributions

P(y1) and P(y2).

Much attention has been paid to estimation of dynamic models that are point-identified by RCS data; see, for example, the review in Verbeek (2008). In early work on this topic, Deaton (1985) restricted attention to linear models that may include an additive fixed effect to be “differenced out.” Then the outcome of interest is the linear function $f(y_1, y_2) = a + b y_1 + c y_2$ with b = −c. Moffitt (1993) extended the approach to some nonlinear models, focusing on binary choice models where the outcome of interest in each period is an indicator function.

Moffitt emphasized that estimation of dynamic models with RCS data is made difficult by the

“general lack of information on lagged dependent and independent variables and the consequent

unobservability of the intertemporal covariances needed to identify and estimate dynamic models” (Moffitt,

1993, p. 99). This line of research, which replaces individual observations with cohort means and uses

additional assumptions to identify the models, is often motivated by cases in which panel data are not

available. However, it has also been noted that panel data “are often inferior to the available cross-sections

in some respects” (Moffitt, 1993, p. 100), such as smaller sample sizes in each time period, lower rates of

response arising from attrition, and “large and persistent errors of measurement” (Deaton, 1985, p. 110).

Rather than try to identify conditions under which existing panel data should be preferred to

existing RCS data or vice versa, here we consider how one should design longitudinal data collection before

commencing it. The answer to this question depends crucially on how the data will be used.

4.1. Linear Functions

Under random sampling with no missing data, RCS data point-identify the expectations of linear functions of y1 and y2. As above, let $f(y_1, y_2) = a + b y_1 + c y_2$ and let $m_t$ be the sample average of the $N_t$ observed values in period t. The RCS sample-average predictor is $a + b m_1 + c m_2$. Observe that $E[m_1] = E[y_1]$, $E[m_2] = E[y_2]$, $V[m_1] = \frac{1}{N_1}V[y_1]$, $V[m_2] = \frac{1}{N_2}V[y_2]$, and $C(m_1, m_2) = 0$. Squared bias equals 0 and variance is maximized if P(y1, y2) is bivariate Bernoulli with mean (½, ½). Maximum regret is

(23) $\frac{1}{4}\left[\frac{b^2}{N_1} + \frac{c^2}{N_2}\right]$.

We may compare (23) with the maximum regret of the panel-data midpoint predictor. To make

the comparison precise, let N1 be the period-1 sample size for both RCS and panel data collection. Let N2 = N1 be the period-2 sample size in each case as well. The N2 period-2 observations form a new random sample in the RCS case and are the period-2 responders in the panel case with no attrition. Let both designs have the same cost

per observation. Thus, we compare designs that yield the same numbers of observations in each period and

have the same cost, but differ in the composition of the period-2 observations.

As previously noted, the maximum regret of the panel data midpoint predictor is $\frac{1}{4}\left[\frac{b^2 + c^2 + 2bc}{N_1}\right]$ when P(z2 = 1) = 1. Comparison with (23), when N2 = N1, reveals that the maximum regret of the panel data predictor exceeds that of the RCS predictor by $\frac{2bc}{4N_1}$. This difference is attributable to the potential covariation between the period-1 and period-2 sample averages in a panel.
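A small numerical check of this comparison, under the stated condition of no attrition and equal sample sizes, might look as follows. The function names are hypothetical, and the example restricts attention to b and c of the same sign, which is the case in which the panel expression is the larger of the two.

```python
def rcs_max_regret(b, c, n1, n2):
    """Maximum regret (1/4)[b^2/N1 + c^2/N2] of the RCS sample-average predictor."""
    return 0.25 * (b ** 2 / n1 + c ** 2 / n2)


def panel_max_regret_no_attrition(b, c, n1):
    """Maximum regret (1/4)(b^2 + c^2 + 2bc)/N1 of the panel midpoint predictor when P(z2 = 1) = 1."""
    return 0.25 * (b ** 2 + c ** 2 + 2 * b * c) / n1


# Illustrative values only: b = c = 1 and N1 = N2 = 100.
b, c, n1 = 1.0, 1.0, 100
rcs = rcs_max_regret(b, c, n1, n1)
panel = panel_max_regret_no_attrition(b, c, n1)
difference = panel - rcs  # should match 2bc / (4 N1)
```

With these values the RCS design has maximum regret 0.005 and the panel design 0.01, so the gap is 2bc/(4N1) = 0.005.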

This finding effectively turns a common argument in favor of a panel over RCS on its head. Unlike

RCS, a panel yields information on the joint distribution of outcomes across periods. But this information

has no value when the objective is to predict a linear function under square loss. Moreover, the possibility

of covariation of outcomes across periods increases the maximum variance of the panel data predictor

relative to the RCS predictor, which draws an independent random sample each period.

The comparison is not as straightforward when there is attrition, because we only have an analytical

upper bound on the maximum regret of the panel data midpoint predictor. Nonresponse in period-2 may

reduce the impact of covariation of outcomes across periods on the maximum variance of the panel data

midpoint predictor relative to the RCS predictor, but nonresponse also increases the predictor’s maximum

squared bias. In contrast, the RCS predictor is unbiased under the maintained assumptions.

4.2. Indicator Functions

Suppose now that the objective is best prediction of the event that (y1, y2) take specified values.

The best predictor is $P[y_1 = j, y_2 = k]$ for values $j, k \in [0, 1]$. With panel data, we found in (8) the identification region to be the following interval of width $P(z_2 = 0, u = 0)$:

$[P(y_1 = j, y_2 = k, z_2 = 1),\; P(y_1 = j, y_2 = k, z_2 = 1) + P(z_2 = 0, u = 0)]$

With RCS data, the Frechet bound on a joint probability using knowledge of the marginals gives

the identification region as the interval

(24) $[\max(0,\, P(y_1 = j) + P(y_2 = k) - 1),\; \min(P(y_1 = j),\, P(y_2 = k))]$.

This interval has maximum width ½, which obtains when $P(y_1 = j) = \frac{1}{2}$ and $P(y_2 = k) = \frac{1}{2}$.

Suppose one were to know the marginal probabilities $P(y_1 = j)$ and $P(y_2 = k)$. Then the minimax-regret predictor would be the midpoint of (24). Without additional prior information, the maximum regret of this midpoint predictor equals its maximum squared bias of 1/16, which obtains when $P(y_1 = j) = \frac{1}{2}$ and $P(y_2 = k) = \frac{1}{2}$.
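The width of the Fréchet interval and the resulting squared bias of the midpoint can be checked numerically. The sketch below, with hypothetical function names, evaluates the interval on a grid of marginal probabilities and locates the widest case; the grid resolution of 0.01 is an arbitrary choice for illustration.

```python
def frechet_interval(p1, p2):
    """Fréchet bounds on P(y1 = j, y2 = k) given marginals p1 = P(y1 = j), p2 = P(y2 = k)."""
    lower = max(0.0, p1 + p2 - 1.0)
    upper = min(p1, p2)
    return lower, upper


def midpoint_max_squared_bias(p1, p2):
    """Largest squared error of the interval midpoint: (half the width) squared."""
    lower, upper = frechet_interval(p1, p2)
    return ((upper - lower) / 2.0) ** 2


# Search a grid of marginals for the widest interval.
grid = [i / 100 for i in range(101)]
widest = max(
    ((frechet_interval(p1, p2)[1] - frechet_interval(p1, p2)[0], p1, p2)
     for p1 in grid for p2 in grid),
    key=lambda item: item[0],
)
max_width, argmax_p1, argmax_p2 = widest
```

On this grid the widest interval is [0, ½], attained when both marginals equal ½, and the corresponding squared bias of the midpoint is (¼)² = 1/16.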

Suppose instead that one uses information from a finite sample to estimate the marginal

probabilities. Maximum regret of a predictor using finite-sample information on the probabilities cannot

be less than maximum regret of the minimax-regret predictor using knowledge of the probabilities.

Therefore, maximum regret of a sample-analog RCS midpoint predictor must be no less than 1/16.

Recall that maximum regret of the panel data midpoint predictor is $\frac{1}{4}\left[\frac{1}{N_1} P(z_2 = 1) + P(z_2 = 0)^2\right]$. Thus, the lower bound on the maximum regret of a finite-sample RCS midpoint predictor exceeds the maximum regret of the panel data midpoint predictor when $\frac{1}{N_1} P(z_2 = 1) + P(z_2 = 0)^2 < \frac{1}{4}$. When $P(z_2 = 1) > ½$, there exists a threshold sample size such that this inequality holds for all samples larger than the threshold.
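Rearranging the inequality gives N1 > P(z2 = 1) / (¼ − P(z2 = 0)²), which has a finite solution only when P(z2 = 1) > ½. The sketch below computes the smallest integer sample size that satisfies it; the function names and the response rates used are hypothetical.

```python
import math


def panel_indicator_max_regret(n1, rho):
    """Maximum regret (1/4)[rho/N1 + (1 - rho)^2] of the panel midpoint predictor, rho = P(z2 = 1)."""
    return 0.25 * (rho / n1 + (1.0 - rho) ** 2)


def threshold_sample_size(rho):
    """Smallest integer N1 with rho/N1 + (1 - rho)^2 < 1/4, or None if rho <= 1/2."""
    gap = 0.25 - (1.0 - rho) ** 2
    if gap <= 0.0:
        return None  # the squared-bias term alone is at least 1/4
    # Smallest integer strictly greater than rho / gap.
    return math.floor(rho / gap) + 1


n_high_response = threshold_sample_size(0.8)
n_low_response = threshold_sample_size(0.6)
n_half_response = threshold_sample_size(0.5)
```

With a response rate of 0.8 very few observations are needed before the panel predictor's maximum regret falls below 1/16, and the required size grows as the response rate approaches ½.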

RCS data may be even less informative when predictors of other indicator functions discussed in

Section 3 are of interest. Consider the event [y2 > y1], whose best predictor is $P(y_2 > y_1)$. It is possible that the marginal distributions P(y1) and P(y2) identified by RCS data are compatible with both $P(y_2 > y_1)$ arbitrarily close to 0 and $P(y_2 > y_1)$ arbitrarily close to 1, as would be the case when (y1, y2) are continuously distributed on [0, 1] with P(y1) = P(y2). Suppose one were to know the distributions P(y1) and

P(y2). Then the minimax-regret predictor would be the midpoint of the identification region. Without

additional information, this identification region is the open unit interval. The minimax-regret prediction

is therefore ½, and maximum regret of this midpoint predictor equals its maximum squared bias of ¼. Thus,

the RCS data are potentially uninformative and a finite-sample RCS midpoint predictor must have

maximum regret no less than ¼, whereas maximum regret of the panel data midpoint predictor is again

$\frac{1}{4}\left[\frac{1}{N_1} P(z_2 = 1) + P(z_2 = 0)^2\right]$.

5. Conclusion

This paper continues our effort to encourage increased use of statistical decision theory to inform

the design of data collection when data quality is a decision variable. Building on our previous study, we

demonstrate how the framework may be applied to more complex design problems.

A notable general finding is that, when collecting panel data with attrition, prediction of the value

of a function of two variables is more subtle than prediction of a function of one variable. The reason is

that observation of the outcome in the first period may constrain the value of the function when the outcome

in the second period is not observed. The nature of the constraint, if any, depends on the form of the

function being predicted. Juxtaposition of linear and indicator functions demonstrates this relationship

well.

The form of the function being predicted also is important when considering choice between

collection of panel data and RCS. In the absence of restrictions on the joint distribution of outcomes, RCS

data are well-suited for prediction of linear functions but not for prediction of nonlinear functions such as

indicator functions.

When predicting linear functions, choice between panel data and RCS may depend on the relative

magnitudes of recruitment and retention costs, as well as the relationship between retention costs and the

attrition rate. The framework adopted in the study should be useful for addressing these matters and related

questions. For instance, what is the optimal length of time between interview waves, when, all else equal,

an increase in period length should increase retention costs and/or attrition? To what extent can a rotating

panel be used to optimally combine the best elements of RCS and panel? Finally, how can retrospective

questions in RCS be used in addition to or in lieu of a (rotating) panel and how does this answer depend on

the length of time between interviews given the relationship between the length of this time span and

retrospective reporting errors?

We should re-emphasize that our analysis assumes knowledge of response rates but uses no

information on how the sample design affects the composition of respondents. Such information may affect

minimax-regret choice among designs. For example, one may be able to lower maximum regret by

combining complementary designs that tend to attract different segments of the population of interest.

Appendix

Derivation of V(m2)

Recall that $m_2 = \frac{1}{N_1}\sum_{i=1}^{N_1} y_{2i} z_{2i}$. Random sampling implies that

$V(m_2) = V\left(\frac{1}{N_1}\sum_{i=1}^{N_1} y_{2i} z_{2i}\right) = \left(\frac{1}{N_1}\right)^2 V\left(y_{21}z_{21} + y_{22}z_{22} + \cdots + y_{2N_1}z_{2N_1}\right) = \left(\frac{1}{N_1}\right)^2 N_1 V(y_2 z_2) = \frac{1}{N_1} V(y_2 z_2)$.

Observe that

$E(y_2 z_2) = E(y_2|z_2 = 1)P(z_2 = 1)$

and

$E[(y_2 z_2)^2] = E(y_2^2|z_2 = 1)P(z_2 = 1)$.

Thus,

$V(y_2 z_2) = E[(y_2 z_2)^2] - [E(y_2 z_2)]^2 = E(y_2^2|z_2 = 1)P(z_2 = 1) - [E(y_2|z_2 = 1)P(z_2 = 1)]^2$

$= P(z_2 = 1)\left[E(y_2^2|z_2 = 1) - E(y_2|z_2 = 1)^2 P(z_2 = 1)\right]$.

Hence,

$V(m_2) = \frac{1}{N_1} P(z_2 = 1)\left[E(y_2^2|z_2 = 1) - E(y_2|z_2 = 1)^2 P(z_2 = 1)\right]$.

Derivation of C(m1, m2)

Recall that $m_1 = \frac{1}{N_1}\sum_{i=1}^{N_1} y_{1i}$ and $m_2 = \frac{1}{N_1}\sum_{i=1}^{N_1} y_{2i} z_{2i}$. Random sampling implies that

$C(m_1, m_2) = C\left(\frac{1}{N_1}\sum_{i=1}^{N_1} y_{1i},\; \frac{1}{N_1}\sum_{i=1}^{N_1} y_{2i} z_{2i}\right) = \frac{1}{N_1^2}\, C\left(\sum_{i=1}^{N_1} y_{1i},\; \sum_{i=1}^{N_1} y_{2i} z_{2i}\right)$

$= \frac{1}{N_1^2}\, C\left(y_{11} + y_{12} + \cdots + y_{1N_1},\; y_{21}z_{21} + y_{22}z_{22} + \cdots + y_{2N_1}z_{2N_1}\right)$

$= \frac{1}{N_1^2}\, N_1\, C(y_1, y_2 z_2)$

$= \frac{1}{N_1}\, C(y_1, y_2 z_2)$.

Observe that $E(y_2 z_2) = E(y_2|z_2 = 1)P(z_2 = 1)$ and $E(y_1 y_2 z_2) = E(y_1 y_2|z_2 = 1)P(z_2 = 1)$. Thus,

$C(y_1, y_2 z_2) = E(y_1 y_2 z_2) - E(y_1)E(y_2 z_2)$

$= E(y_1 y_2|z_2 = 1)P(z_2 = 1) - E(y_1)E(y_2|z_2 = 1)P(z_2 = 1)$

$= P(z_2 = 1)\left[E(y_1 y_2|z_2 = 1) - E(y_1)E(y_2|z_2 = 1)\right]$.

It follows that

$C(m_1, m_2) = \frac{1}{N_1} P(z_2 = 1)\left[E(y_1 y_2|z_2 = 1) - E(y_1)E(y_2|z_2 = 1)\right]$.
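The two population identities used in these derivations, for V(y2 z2) and C(y1, y2 z2), can be verified by exact enumeration on a small discrete distribution. The sketch below assumes binary (y1, y2, z2) with an arbitrary, purely illustrative probability mass function; the helper names are hypothetical.

```python
from itertools import product

# Hypothetical joint pmf over (y1, y2, z2) in {0, 1}^3; the weights are illustrative only.
weights = [3, 1, 2, 4, 1, 2, 5, 2]
total = sum(weights)
pmf = {outcome: w / total for outcome, w in zip(product((0, 1), repeat=3), weights)}


def expect(g):
    """E[g(y1, y2, z2)] under the pmf."""
    return sum(p * g(y1, y2, z2) for (y1, y2, z2), p in pmf.items())


p_z = expect(lambda y1, y2, z2: z2)    # P(z2 = 1)
e_y1 = expect(lambda y1, y2, z2: y1)   # E(y1)


def cond_expect(g):
    """E[g(y1, y2) | z2 = 1]."""
    return expect(lambda y1, y2, z2: g(y1, y2) * z2) / p_z


# Variance of y2*z2: direct definition versus the conditional-moment expression.
var_direct = expect(lambda y1, y2, z2: (y2 * z2) ** 2) - expect(lambda y1, y2, z2: y2 * z2) ** 2
var_formula = p_z * (cond_expect(lambda y1, y2: y2 ** 2)
                     - cond_expect(lambda y1, y2: y2) ** 2 * p_z)

# Covariance of y1 and y2*z2: direct definition versus the conditional-moment expression.
cov_direct = expect(lambda y1, y2, z2: y1 * y2 * z2) - e_y1 * expect(lambda y1, y2, z2: y2 * z2)
cov_formula = p_z * (cond_expect(lambda y1, y2: y1 * y2)
                     - e_y1 * cond_expect(lambda y1, y2: y2))
```

Dividing either quantity by N1 gives the corresponding expression for V(m2) or C(m1, m2) under random sampling.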

Derivation of upper bound on the variance of the midpoint predictor

The variance of the predictor is

$V\left[a + b m_1 + c m_2 + \frac{c}{2} P(z_2 = 0)\right] = \frac{1}{N_1}\left\{b^2 V(y_1) + c^2 P(z_2 = 1)\left[V(y_2|z_2 = 1)\right] + 2bc\,\mathrm{Cov}(y_1, y_2|z_2 = 1) + (c^2 + 2bc)P(z_2 = 1)P(z_2 = 0)\right\}$

References

Dai, B., S. Ding, and G. Wahba (2013), “Multivariate Bernoulli Distribution,” Bernoulli, 19, 1465-1483.

Deaton, A. (1985), “Panel Data from Time Series of Cross Sections,” Journal of Econometrics, 30, 109-126.

Dominitz, J., and C. Manski (2017), “More Data or Better Data? A Statistical Decision Problem,” Review of Economic Studies, 84, 1583-1605.

Groves, R. (2006), “Nonresponse Rates and Nonresponse Bias in Household Surveys,” Public Opinion Quarterly, 70, 646-675.

Groves, R., and L. Lyberg (2010), “Total Survey Error: Past, Present, and Future,” Public Opinion Quarterly, 74, 849-879.

Manski, C., and A. Tetenov (2016), “Sufficient Trial Size to Inform Clinical Practice,” Proceedings of the National Academy of Sciences, 113, 10518-10523.

Moffitt, R. (1993), “Identification and Estimation of Dynamic Models with a Time Series of Repeated Cross-Sections,” Journal of Econometrics, 59, 99-123.

Verbeek, M. (2008), “Pseudo-Panels and Repeated Cross-Sections,” in L. Matyas and P. Sevestre (eds.), The Econometrics of Panel Data, Berlin: Springer-Verlag, 369-383.
