Minimax-Regret Sample Design in Anticipation of Missing Data, With Application to Panel Data
Jeff Dominitz
RAND
and
Charles F. Manski Department of Economics and Institute for Policy Research, Northwestern University
Revised: August 2019, forthcoming in the Journal of Econometrics
Abstract
Missing data problems are ubiquitous in data collection. In surveys, these problems may arise from unit nonresponse, item nonresponse, and panel attrition. Building on the Dominitz and Manski (2017) study of choice between two or more sampling processes that differ in cost and quality, we study minimax-regret sample design in anticipation of missing data, where the collected data will be used for prediction under square loss of the values of functions of two variables. The analysis imposes no assumptions that restrict unobserved outcomes. Findings are reported for prediction of the values of linear and indicator functions using panel data with attrition. We also consider choice between a panel and repeated cross sections.
We are grateful for the comments of Max Tabord-Meehan, Kei Hirano, and a reviewer.
1. Introduction
Missing data problems are ubiquitous in data collection. In surveys, missing data may arise from
unit nonresponse, item nonresponse, and panel attrition. Researchers who want to minimize the mean square
error of estimates in surveys with missing data should be concerned with both bias and variance, as
recommended in the literature on total survey error. However, statisticians have focused on variance, as
explained by Groves and Lyberg (2010): “The total survey error format forces attention to both variance
and bias terms. . . . . . Most statistical attention to surveys is on the variance terms—largely, we suspect,
because that is where statistical estimation tools are best found” (p. 868).
Dominitz and Manski (2017) provided tools for making sample design choices that explicitly
account for both variance and bias while imposing no assumptions that restrict the unobserved outcomes.
The analysis used the Wald framework of statistical decision theory to study choice between two or more
sampling processes that differ in the cost of data collection and the quality of the data obtained, where data
quality is determined by the response rate.
We studied application of the minimax-regret criterion, which seeks a decision that is uniformly near
optimal. Initially considering the decision problem in abstraction, we observed that analytical solution is
feasible only in special cases and that numerical solution often is computationally
challenging. To make progress, we focused on minimax-regret sample design for prediction of a real-
valued outcome under square loss.
The ideal, but unknown, best predictor in this familiar setting of square loss is the population mean
outcome. In the Wald framework, the risk of a candidate predictor using finite sample data is the sum of
the population variance of the outcome and the mean square error (MSE) of the predictor as an estimate of
the mean outcome. The regret of a predictor is its MSE as an estimate of the mean. A minimax-regret
predictor minimizes maximum mean square error. Thus, minimax-regret prediction of the outcome is
equivalent to minimax estimation of the population mean.
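The decomposition described above can be written out in one line. As a sketch, let t denote a predictor computed from the sample (a symbol introduced here only for illustration) and assume that the outcome y to be predicted is drawn independently of the sample:

```latex
\begin{align*}
E\big[(y - t)^2\big]
  &= E\big[(y - E(y))^2\big] + E\big[(t - E(y))^2\big]
     - 2\,E\big[(y - E(y))\,(t - E(y))\big] \\
  &= V(y) + E\big[(t - E(y))^2\big],
\end{align*}
% The cross term vanishes because y and t are independent and E[y - E(y)] = 0.
% The best predictor E(y) has risk V(y), so
% regret(t) = E[(y - t)^2] - V(y) = E[(t - E(y))^2] = MSE of t as an estimate of E(y).
```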
Even though prediction under square loss is simpler to study than prediction under other loss
functions, it is still computationally challenging to determine the value and maximum regret of the predictor
that minimizes maximum regret when some data are missing. Seeking an approach that is both tractable
and reasonable, we studied prediction using the midpoint of a sample analog estimate of the identification
region for the population mean. This midpoint predictor is easy to compute, and its maximum regret has a
simple and sensible analytical form. If the identification interval for the population mean were known
rather than estimated, then its midpoint would be the minimax-regret prediction, which we find to be another
appealing aspect of this midpoint predictor.
We now build on this framework to study minimax-regret sample design in anticipation of missing
data, where the collected data will be used for prediction under square loss of functions of two variables.
Relative to our previous study, addressing this expanded prediction problem requires attention to additional
dimensions of data cost and quality. We specifically study choice of sample size for a two-period panel
with attrition. That is, we consider longitudinal data collection with a 100-percent response rate in period
1 and some nonresponse in period 2. Some findings apply as well to collection of data on two household
members and to cross-sectional surveys with item nonresponse.
Section 2 summarizes key elements of and findings from our previous study and calls attention to
some complications that must be addressed in the expanded prediction problem. Analysis may be impacted
not only by the higher dimension of the data but also by the form of the function whose value is to be
predicted. As in our previous study, one must take a stand on how the data will be used before making
sample design decisions.
Section 3 studies the maximum regret of sample designs for prediction of two types of functions.
The cases we study are prediction of the value of linear and indicator functions. In both cases, we presume
knowledge of the response rate and use of a midpoint predictor akin to that posed in the previous study.
Again, the attractions of midpoint predictors are that they are easy to compute and have sensible analytical
forms for regret. Again, it is computationally challenging to determine the value and maximum regret of
the predictor that minimizes maximum regret when the identification region for the population mean must
be estimated.
Section 4 compares the maximum regret of predictions with panel data to maximum regret of
prediction with repeated cross-sectional (RCS) data. We find that RCS data collection often yields smaller
maximum regret than a panel with equivalent sample size and cost when interest centers on prediction of
linear functions. A particularly striking result is that RCS yields smaller maximum regret than a panel with
complete response. The reason is that RCS draws an independent random sample each period, but panel
observations may be correlated across periods. However, collection of panel data is typically more
informative than RCS when the problem is to predict the value of an indicator function.
Section 5 discusses extensions of the analysis.
2. Best Prediction under Square Loss of Functions of Two Variables
Consider best prediction under square loss of a bounded real function f(y1, y2), where (y1, y2) take
values in a bounded interval on R2, normalized to be the unit square [0, 1] × [0, 1]. Let P(y1, y2) be the
probability distribution of (y1, y2) in a population that is a continuum. Then the best predictor is E[f(y1,
y2)]. The regret of a predictor based on sample data is its mean square error. The subscripts 1 and 2 may
refer to time periods in a panel study, a husband and a wife in a study of households, or two different
variables associated with each individual in a cross-section.
Suppose a random sample is drawn from P(y1, y2), but there may be missing data. Let zt = 1 if yt
is observed and zt = 0 if yt is missing for t = 1, 2. In a cross-sectional survey, unit nonresponse means that
z1 = z2 = 0, whereas item nonresponse means that either (z1 = 0, z2 = 1) or (z1 = 1, z2 = 0). In a panel with
full response in the first period but some attrition in the second, attrition means that (z1 = 1, z2 = 0). We
focus on this case. Thus, we assume P(z1 = 1) = 1, but permit P(z2 = 1) < 1. We assume knowledge of the
response rate in the second period but no knowledge of the composition of nonresponse.
Assuming knowledge of P(z2 = 1) simplifies our analysis greatly. It appears that numerical
computation of maximum regret must generally be performed if P(z2 = 1) is estimated rather than known.
The assumption of a known response rate is sometimes realistic, because historical experience may give
survey designers a sense of the attrition rate to expect with various modes of survey administration (face-
to-face, telephone, internet). We do not go further and assume additional knowledge. For example, we do
not assume knowledge of the second-period response rate conditional on first-period responses.
Although we focus on collection of panel data, our analysis applies as well to household surveys in
which a researcher always interviews a specified spouse, but interviews the other spouse in only a subset of households.
It also applies to surveys of individuals in which all sample members respond to one question but only a
subset responds to another question. For example, the first question may ask about a non-sensitive matter
such as age or education, while the second asks about a sensitive matter such as income or drug use.
In general, survey response may vary with the process used to collect data. For example, a person
may agree to provide data in an internet survey but not to be interviewed face-to-face. To make the
dependence of response on the survey process explicit, we could denote the process by q and the missing-
data indicators by (zq1, zq2). For simplicity, we keep the q notation implicit in most of the paper.
2.1 Previous Findings
Dominitz and Manski (2017) studied minimax-regret sample design for best prediction under
square loss of a function of one variable, when a high-cost/high-quality sampling process accurately
measures the outcome of each sample member and a low-cost/low-quality sampling process has
nonresponse. We considered a cross-sectional survey, with all data obtained at t = 1. The analysis assumed
knowledge of the response rate, but it imposed no restrictions on the values of the data missing due to
nonresponse. Using the present notation, one chooses between two processes for measuring y1. The high-
cost process has P(z1 = 1) = 1, whereas the low-cost process has known P(z1 = 1) < 1. Therefore, the high-
cost process point-identifies E(y1), whereas the low-cost process partially identifies E(y1).
With a predetermined budget, the minimax-regret choice between these two sampling processes is
easy to determine when it is assumed that specific reasonable predictors will be used. We assumed that a
sample-average predictor will be calculated based on the high-cost data and a midpoint predictor will be
calculated based on the low-cost data. The midpoint predictor is the middle of a sample analog estimate of
the identification region for the population mean. Among other findings, we showed that the maximum
regret of the midpoint predictor is smaller than that of sample-average predictors that have been commonly
used by researchers who face missing data problems. The latter predictors use ignorability assumptions to
impute missing values or to motivate discarding these sample members.
The analysis generalizes to designs that combine low-cost and high-cost sampling processes, under
the assumption that the observed outcomes will be pooled. Further, when the budget is not predetermined,
the analysis shows how to choose a budget sufficient to achieve an ε-optimal design; that is, a budget
sufficient to make maximum regret less than a specified ε > 0.
2.2. Application to Functions of Two Variables
Predicting the value of a function of two variables requires attention to additional dimensions of
cost and quality, as well as to the form of the function. The general approach to analyzing the problem,
however, is unchanged. We illustrate the approach for linear and indicator functions in Section 3. In
particular, we first determine the identification region for the best predictor under the assumed data
generating process. Second, we define for any sample size a midpoint predictor based on a sample analog
of this identification region. Finally, we solve for the maximum regret of this predictor in the case of
indicator functions and an informative upper bound on maximum regret in the case of linear functions.
When f(y1, y2) is a general function, analysis similar to that performed in Dominitz and Manski
(2017) yields an outer bound on E[f(y1, y2)]; that is, a bound that holds but need not be sharp. The midpoint
predictor studied there is applicable, but it may not be the best possible. Suppose, for example, that one
knows the rate of complete response; that is, P(z1 = z2 = 1). The outer bound is obtained by (a) noting that
the value of f(y1, y2) is observed if both y1 and y2 are observed and (b) considering the value of f(y1, y2) to
be missing otherwise. The bound obtained in this manner may not be sharp because observation of either
but not both of y1 and y2 may constrain the value of f(y1, y2). Section 3 studies two classes of functions in
which this occurs.
3. Linear and Indicator Functions of Two Variables
The main new contributions of this paper are to study prediction of the values of two classes of
functions whose structure is such that observation of y1 alone may be informative about the value of f(y1,
y2). We consider prediction of the value of linear functions in Section 3.1 and indicator functions in Section
3.2. We assume complete response in period 1 and that one knows the response rate in period 2. We
suppose throughout that the midpoint of a sample analog of the identification region for E[f(y1, y2)] is used
to predict the function value.
To motivate the classes of functions we analyze, consider studies of employment dynamics, such
as those that have utilized the Panel Study of Income Dynamics for the past 50 years. Let y1 and y2 denote
the fraction of the year that a person works in years 1 and 2. Interest may center on employment change
(y2 − y1) or average employment (y1 + y2)/2. Section 3.1 covers such linear functions. Alternatively, one
may want to predict the occurrence of an event, such as employment growth. Then the objective is
prediction of the value of the indicator function 1[y2 > y1]. Section 3.2 covers prediction of indicator
functions.
3.1. Linear Functions
Let $f(y_1, y_2) = a + b y_1 + c y_2$ for known values of (a, b, c). For ease of exposition, we consider
functions where b ≥ 0 and c ≥ 0. The analysis may be extended to other cases by rearranging terms in the
limits of the identification region below and then making corresponding revisions in the definition of the
midpoint predictor and in the derivation of regret.
The best predictor is $E[f(y_1, y_2)] = a + bE(y_1) + cE(y_2)$. The identification region for
$E[f(y_1, y_2)]$ with panel data is the interval
(1) $[a + bE(y_1) + cE(y_2 \mid z_2 = 1)P(z_2 = 1),\ a + bE(y_1) + cE(y_2 \mid z_2 = 1)P(z_2 = 1) + cP(z_2 = 0)]$
$= [a + bE(y_1) + cE(y_2 z_2),\ a + bE(y_1) + cE(y_2 z_2) + cP(z_2 = 0)]$.
This interval may be derived by applying the Law of Iterated Expectations. The lower bound obtains when
$P(y_2 \mid z_2 = 0)$ is degenerate at the value 0, the lower limit of the support of P(y2). The upper bound obtains
when $P(y_2 \mid z_2 = 0)$ is degenerate at the value 1, the upper limit of the support of P(y2). The width of this
interval is $cP(z_2 = 0)$.
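As a minimal sketch of interval (1), the function below evaluates the two limits from population quantities, assuming b ≥ 0, c ≥ 0 and outcomes normalized to [0, 1] as in the text. The function and argument names are hypothetical and chosen only for illustration.

```python
def linear_identification_interval(a, b, c, mean_y1, mean_y2_resp, resp_rate):
    """Limits of interval (1) for E[a + b*y1 + c*y2].

    mean_y1      : E(y1)
    mean_y2_resp : E(y2 | z2 = 1), the period-2 mean among respondents
    resp_rate    : P(z2 = 1), the known period-2 response rate
    Assumes b >= 0, c >= 0 and outcomes in [0, 1].
    """
    if b < 0 or c < 0:
        raise ValueError("this sketch covers only b >= 0 and c >= 0")
    if not 0.0 <= resp_rate <= 1.0:
        raise ValueError("resp_rate must lie in [0, 1]")
    # Lower limit: every missing y2 equals 0.
    lower = a + b * mean_y1 + c * mean_y2_resp * resp_rate
    # Upper limit: every missing y2 equals 1, which adds c * P(z2 = 0).
    upper = lower + c * (1.0 - resp_rate)
    return lower, upper


# Average employment (y1 + y2)/2 with an 80 percent period-2 response rate.
lo, hi = linear_identification_interval(0.0, 0.5, 0.5, 0.5, 0.6, 0.8)
```

The width of the returned interval is c·P(z2 = 0) regardless of the observed means, which is why the bias of any point prediction cannot be reduced by increasing the sample size.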
3.1.1. Panel Data Midpoint Predictor
Let mt be the sample average of the observed values in period t; that is, $m_1 = \frac{1}{N_1}\sum_{i=1}^{N_1} y_{1i}$ and
$m_2 = \frac{1}{N_1}\sum_{i=1}^{N_1} y_{2i} z_{2i}$. A sample-analog midpoint predictor is
(2) $a + b m_1 + c m_2 + \frac{c}{2} P(z_2 = 0)$.
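The predictor in (2) can be computed directly from a panel sample. The sketch below assumes that the period-2 response rate is known and passed in, that missing period-2 outcomes are stored as None, and that b ≥ 0 and c ≥ 0. The function name and data layout are illustrative assumptions.

```python
def panel_midpoint_predictor(a, b, c, y1, y2, z2, resp_rate):
    """Sample-analog midpoint predictor (2) for a + b*y1 + c*y2.

    y1        : period-1 outcomes, all observed
    y2        : period-2 outcomes, None where missing
    z2        : 1 if y2 is observed, 0 otherwise
    resp_rate : known P(z2 = 1), not estimated from the sample
    """
    n1 = len(y1)
    m1 = sum(y1) / n1
    # m2 averages y2*z2 over the full initial sample of size N1.
    m2 = sum(v for v, z in zip(y2, z2) if z == 1) / n1
    return a + b * m1 + c * m2 + 0.5 * c * (1.0 - resp_rate)


y1 = [0.2, 0.8, 0.5, 0.5]
y2 = [0.4, None, 1.0, 0.6]
z2 = [1, 0, 1, 1]
prediction = panel_midpoint_predictor(0.0, 0.5, 0.5, y1, y2, z2, resp_rate=0.75)
```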
The regret of the predictor is its mean square error, the sum of squared bias and variance. To find
the squared bias of (2) with known period-2 response rate P(z2 = 1), use the Law of Iterated Expectations
to write the best predictor as $E[f(y_1, y_2)] = a + bE(y_1) + cE(y_2 z_2) + cE(y_2(1 - z_2))$. Under random
sampling, $E(m_1) = E(y_1)$ and $E(m_2) = E(y_2 z_2)$. Bias arises from deviation between $E(y_2(1 - z_2))$ and
the midpoint predictor's assigned value of ½ P(z2 = 0). Squared bias is therefore
$c^2 P(z_2 = 0)^2 \left[E(y_2 \mid z_2 = 0) - \tfrac{1}{2}\right]^2$.
The variance of the predictor is
(3) $V\left[a + b m_1 + c m_2 + \frac{c}{2}P(z_2 = 0)\right] = b^2 V(m_1) + c^2 V(m_2) + 2bc\,\mathrm{Cov}(m_1, m_2)$.
Random sampling implies that $V(m_1) = \frac{1}{N_1}V(y_1)$. We show in an Appendix that
(4a) $V(m_2) = \frac{1}{N_1}P(z_2 = 1)\left[E(y_2^2 \mid z_2 = 1) - E(y_2 \mid z_2 = 1)^2 P(z_2 = 1)\right]$
(4b) $\mathrm{Cov}(m_1, m_2) = \frac{1}{N_1}P(z_2 = 1)\left[E(y_1 y_2 \mid z_2 = 1) - E(y_1)E(y_2 \mid z_2 = 1)\right]$.
Hence,
(5) $V\left[a + b m_1 + c m_2 + \frac{c}{2}P(z_2 = 0)\right] = \frac{1}{N_1}\Big\{b^2 V(y_1) + c^2 P(z_2 = 1)\left[E(y_2^2 \mid z_2 = 1) - E(y_2 \mid z_2 = 1)^2 P(z_2 = 1)\right] + 2bc\,P(z_2 = 1)\left[E(y_1 y_2 \mid z_2 = 1) - E(y_1)E(y_2 \mid z_2 = 1)\right]\Big\}$.
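The expressions in (4a) and (4b) can be checked by exact enumeration on a small discrete population. The sketch below takes N1 = 1, so that the sample averages reduce to single draws, and compares moments computed directly from y2·z2 with the conditional-moment formulas. The example distribution is made up for illustration and is not taken from the paper.

```python
def direct_moments(dist):
    """V(y2*z2) and Cov(y1, y2*z2) computed from their definitions.

    dist is a list of (probability, y1, y2, z2) atoms summing to one.
    """
    e_y1 = sum(p * y1 for p, y1, y2, z in dist)
    e_y2z = sum(p * y2 * z for p, y1, y2, z in dist)
    var = sum(p * (y2 * z) ** 2 for p, y1, y2, z in dist) - e_y2z ** 2
    cov = sum(p * y1 * y2 * z for p, y1, y2, z in dist) - e_y1 * e_y2z
    return var, cov


def formula_moments(dist):
    """The same two moments via (4a) and (4b) with N1 = 1."""
    rho = sum(p for p, y1, y2, z in dist if z == 1)  # P(z2 = 1)
    e_y1 = sum(p * y1 for p, y1, y2, z in dist)
    # Conditional moments given z2 = 1.
    e_y2 = sum(p * y2 for p, y1, y2, z in dist if z == 1) / rho
    e_y2sq = sum(p * y2 ** 2 for p, y1, y2, z in dist if z == 1) / rho
    e_y1y2 = sum(p * y1 * y2 for p, y1, y2, z in dist if z == 1) / rho
    var = rho * (e_y2sq - e_y2 ** 2 * rho)   # (4a)
    cov = rho * (e_y1y2 - e_y1 * e_y2)       # (4b)
    return var, cov


# Hypothetical population: (probability, y1, y2, z2).
population = [
    (0.3, 0.0, 1.0, 1),
    (0.3, 1.0, 0.5, 1),
    (0.2, 1.0, 0.0, 0),
    (0.2, 0.5, 1.0, 0),
]
var_direct, cov_direct = direct_moments(population)
var_formula, cov_formula = formula_moments(population)
```

For a sample of size N1 drawn at random, both moments are divided by N1, which gives the factor 1/N1 appearing in (4a), (4b), and (5).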
To maximize regret, one can consider the variance and bias terms separately. This is so because
bias depends only on E(y2|z2 = 0), which can vary independently of all the quantities that determine
variance. Squared bias is maximized if P(y2|z2 = 0) is degenerate at 0 or 1, in which case maximum squared
bias is $\frac{c^2}{4}P(z_2 = 0)^2$.
Maximum variance across all states of nature has a simple form in the polar cases of no response
and complete response in period 2. When $P(z_2 = 1) = 0$, the variance of the predictor reduces to
$\frac{1}{N_1}b^2 V(y_1)$. $V(y_1)$ is maximized at ¼ when P(y1) is Bernoulli with mean ½. Hence, maximum variance is
$\frac{b^2}{4N_1}$.
When $P(z_2 = 1) = 1$, the variance of the predictor reduces to
$\frac{1}{N_1}\left\{b^2 V(y_1) + c^2 V(y_2 \mid z_2 = 1) + 2bc\,\mathrm{Cov}(y_1, y_2 \mid z_2 = 1)\right\}$.
The variance and covariance terms can each take a maximum value of ¼ when (y1, y2) take values in [0, 1]
× [0, 1]. That is,
(a) $\max_{P(\cdot)} V(y_1) = \frac{1}{4}$
(b) $\max_{P(\cdot)} V(y_2 \mid z_2 = 1) = \frac{1}{4}$
(c) $\max_{P(\cdot)} |\mathrm{Cov}(y_1, y_2 \mid z_2 = 1)| = \frac{1}{4}$
The maxima in (b) and (c) are both achieved when P(y1, y2|z2 = 1) is bivariate Bernoulli with mean (½, ½)
and covariance ¼. Finally, with P(y1, y2|z2 = 1) bivariate Bernoulli with mean (½, ½), the maximum in (a)
is achieved when P(y1|z2 = 0) is also Bernoulli with mean ½.
It appears difficult to determine maximum variance across all states of nature in non-polar cases.
However, we can determine the maximum of an informative upper bound on the variance of the midpoint
predictor. The Appendix shows that
(6) $V\left[a + b m_1 + c m_2 + \frac{c}{2}P(z_2 = 0)\right] \le \frac{1}{N_1}\Big\{b^2 V(y_1) + c^2 P(z_2 = 1) V(y_2 \mid z_2 = 1) + 2bc\,\mathrm{Cov}(y_1, y_2 \mid z_2 = 1) + (c^2 + 2bc)P(z_2 = 1)P(z_2 = 0)\Big\}$.
Using the same argument as for the case with $P(z_2 = 1) = 1$, the maximum of the upper bound on
the variance is $\frac{b^2 + (c^2 + 2bc)P(z_2 = 1)}{4N_1} + \frac{1}{N_1}(c^2 + 2bc)P(z_2 = 1)P(z_2 = 0)$.
Putting the maximum of the upper bound on variance and maximum squared bias together, the
upper bound on the maximum regret of the panel data midpoint predictor is
(7) $\frac{1}{4}\left[\frac{b^2 + (c^2 + 2|bc|)P(z_2 = 1)}{N_1} + c^2 P(z_2 = 0)^2\right] + \frac{1}{N_1}(c^2 + 2bc)P(z_2 = 1)P(z_2 = 0)$.
Inspection of (7) reveals that the upper bound on maximum regret is decreasing in the initial sample size
N1, holding the response rate fixed. This holds because the upper bound on maximum variance declines
while maximum squared bias is unchanged.
Observe that the upper bound (7) reduces to $\frac{1}{4}\left[\frac{b^2 + c^2 + 2|bc|}{N_1}\right]$ in the polar case where P(z2 = 1) = 1.
In this case, the upper bound is achieved when P(y1, y2|z2 = 1) is bivariate Bernoulli with mean (½, ½) and
covariance ¼. In the polar case of P(z2 = 1) = 0, (7) reduces to $\frac{1}{4}\left[\frac{b^2}{N_1} + c^2\right]$. In this case, the upper bound
is achieved when P(y1) is Bernoulli with mean ½ and P(y2|z2 = 0) is degenerate at 0 or 1.
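The upper bound (7) is simple to evaluate numerically. The sketch below assumes b ≥ 0 and c ≥ 0, so that |bc| = bc, and uses a hypothetical function name. The two calls reproduce the polar cases just described.

```python
def linear_regret_upper_bound(b, c, n1, resp_rate):
    """Upper bound (7) on maximum regret of the panel midpoint predictor.

    Assumes b >= 0 and c >= 0; resp_rate is the known P(z2 = 1).
    """
    miss_rate = 1.0 - resp_rate
    k = c * c + 2.0 * b * c
    # Maximum of the upper bound on variance.
    var_bound = (b * b + k * resp_rate) / (4.0 * n1) + k * resp_rate * miss_rate / n1
    # Maximum squared bias, which does not depend on n1.
    max_sq_bias = 0.25 * c * c * miss_rate ** 2
    return var_bound + max_sq_bias


full_response = linear_regret_upper_bound(1.0, 1.0, 100, 1.0)  # (b^2 + c^2 + 2bc)/(4 N1)
no_response = linear_regret_upper_bound(1.0, 1.0, 100, 0.0)    # (b^2/N1 + c^2)/4
```

Raising n1 shrinks only the variance part, so the bound never falls below the maximum squared bias c²P(z2 = 0)²/4.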
3.2. Indicator Functions
Suppose now that the objective is best prediction of the event that (y1, y2) take values in some set
A ⊂ [0, 1] × [0, 1]. Then f(y1, y2) is the indicator function 1[(y1, y2) ∈ A] and the best predictor is P[(y1,
y2) ∈ A]. Examples include: (a) $f(y_1, y_2) = 1[y_1 = j, y_2 = k]$ for some values $j, k \in [0, 1]$, (b)
$f(y_1, y_2) = 1[y_1 = j \text{ or } y_2 = k]$ for some $j, k \in [0, 1]$, (c) $f(y_1, y_2) = 1[y_2 > y_1]$, and (d)
$f(y_1, y_2) = 1\left[\frac{y_1 + y_2}{2} > \gamma\right]$ for some $\gamma \in (0, 1)$.
For ease of exposition, we focus on settings in which observation of y1 may imply that (y1, y2) ∉ A
but cannot imply that (y1, y2) ∈ A. This holds, for example, when $f(y_1, y_2) = 1[y_1 = j, y_2 = k]$. Then y1
≠ j implies that (y1, y2) ∉ A, but y1 = j does not imply that (y1, y2) ∈ A. It also holds when
$f(y_1, y_2) = 1\left[\frac{y_1 + y_2}{2} > \gamma\right]$ and γ > ½. Then y1 ≤ 2γ − 1 implies that (y1 + y2)/2 ≤ γ, but y1 > 2γ − 1 does not imply that
(y1 + y2)/2 > γ. The analysis may be extended to settings where observation of y1 implies that (y1, y2) ∈ A
by adding a term to the lower limit of the identification region below to account for the probability of this
event among the observations with z2 = 0 and then making corresponding revisions to the definition of the
midpoint predictor and in the derivation of regret.
To obtain the identification region for $E[f(y_1, y_2)]$, define the binary random variable u as follows:
u = 1 if z2 = 0 and the observed value of y1 implies (y1, y2) ∉ A, u = 0 otherwise. The identification region
for P[(y1, y2) ∈ A] is the interval
(8) $\left[P\big((y_1, y_2) \in A, z_2 = 1\big),\ P\big((y_1, y_2) \in A, z_2 = 1\big) + P(z_2 = 0, u = 0)\right]$.
This interval may be derived by applying the Law of Total Probability and recalling that, when z2 = 0, the
event (y1, y2) ∈ A can occur only when u = 0.
3.2.1. Panel Data Midpoint Predictor
A midpoint predictor based on (8) is the midpoint of its sample analog, namely
(9) $\frac{1}{N_1}\sum_{i=1}^{N_1} 1\big[(y_{1i}, y_{2i}) \in A, z_{2i} = 1\big] + \frac{1}{2} \cdot \frac{1}{N_1}\sum_{i=1}^{N_1} 1[z_{2i} = 0, u_i = 0]$.
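A minimal sketch of predictor (9) follows. The set A and the rule that determines u are supplied as two callables, whose names are hypothetical. The example uses case (d) with γ = 0.75, for which y1 ≤ 2γ − 1 = 0.5 rules out membership in A. Missing period-2 outcomes are stored as None.

```python
def indicator_midpoint_predictor(y1, y2, z2, in_set, ruled_out_by_y1):
    """Sample-analog midpoint predictor (9) for P[(y1, y2) in A].

    in_set(y1, y2)      : True if (y1, y2) lies in A
    ruled_out_by_y1(y1) : True if y1 alone implies (y1, y2) is not in A,
                          so that u = 1 whenever z2 = 0
    """
    n1 = len(y1)
    # Complete cases known to lie in A.
    hits = sum(1 for a, b, z in zip(y1, y2, z2) if z == 1 and in_set(a, b))
    # Attriters whose period-1 outcome leaves membership in A undetermined (u = 0).
    undetermined = sum(1 for a, z in zip(y1, z2) if z == 0 and not ruled_out_by_y1(a))
    return hits / n1 + 0.5 * undetermined / n1


gamma = 0.75
y1 = [0.9, 0.4, 0.8, 0.3]
y2 = [0.9, None, None, 0.5]
z2 = [1, 0, 0, 1]
prediction = indicator_midpoint_predictor(
    y1, y2, z2,
    in_set=lambda a, b: (a + b) / 2 > gamma,
    ruled_out_by_y1=lambda a: a <= 2 * gamma - 1,
)
```

Here one complete case lies in A, one attriter is ruled out by its period-1 value, and the remaining attriter contributes one half, so the prediction is 1/4 + 1/8.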
To solve for the maximum regret of this predictor, we first derive its squared bias and variance.
We then maximize the sum. To shorten the notation, we define two Bernoulli random variables
$w = 1\big[(y_1, y_2) \in A, z_2 = 1\big]$ and $x = 1[z_2 = 0, u = 0]$. Then we rewrite (9) as
(9') $\frac{1}{N_1}\sum_{i=1}^{N_1} w_i + \frac{1}{2} \cdot \frac{1}{N_1}\sum_{i=1}^{N_1} x_i$.
Squared Bias
To find the squared bias, use the Law of Total Probability to write the best predictor
(10) $P\big[(y_1, y_2) \in A\big] = P\big[(y_1, y_2) \in A, z_2 = 1\big] + P\big[(y_1, y_2) \in A, z_2 = 0\big]$.
Under random sampling, $E(w) = P\big[(y_1, y_2) \in A, z_2 = 1\big]$ and $E(x) = P(z_2 = 0, u = 0)$. Therefore, bias
arises from deviation between $P\big[(y_1, y_2) \in A, z_2 = 0\big]$ and $P(z_2 = 0, u = 0)/2$. The event
$\big[(y_1, y_2) \in A, z_2 = 0\big]$ implies that u = 0. Hence, $P\big[(y_1, y_2) \in A, z_2 = 0\big] = P\big[(y_1, y_2) \in A, z_2 = 0, u = 0\big]$
and squared bias may be expressed as follows:
(11) $\left\{P\big[(y_1, y_2) \in A \mid z_2 = 0, u = 0\big] - 1/2\right\}^2 P(z_2 = 0, u = 0)^2$.
Variance
To find the variance, note that, under random sampling, the midpoint predictor (9) is a linear
function of the bivariate Bernoulli random variable $(w, x)$ whose realizations are independent and
identically distributed across individuals i. Let $p_{st} = P(w = s, x = t)$, $s \in \{0, 1\}$, $t \in \{0, 1\}$. Note that
$E(w) = p_{10} + p_{11}$, $E(x) = p_{01} + p_{11}$, $V(w) = (p_{10} + p_{11})(1 - p_{10} - p_{11})$, and
$V(x) = (p_{01} + p_{11})(1 - p_{01} - p_{11})$. Analysis of bivariate Bernoulli random variables in Dai et al. (2013), equation (2.12),
shows that $\mathrm{Cov}(w, x) = p_{11}p_{00} - p_{01}p_{10}$.
We also know that $P(w = 1) \le P(z_2 = 1)$, $P(x = 1) \le P(z_2 = 0)$, and $P(w = 1, x = 1) = 0$. It
now follows that
o $E(w) = p_{10} \le P(z_2 = 1)$
o $E(x) = p_{01} \le P(z_2 = 0)$
o $E(wx) = 0$
o $V(w) = p_{10}(1 - p_{10})$
o $V(x) = p_{01}(1 - p_{01})$
o $\mathrm{Cov}(w, x) = -p_{01}p_{10}$
o $p_{10} + p_{01} + p_{00} = 1$
Thus, the variance of the midpoint predictor (9) can be written as follows:
(12) $\frac{1}{N_1}\left[V(w) + \frac{1}{4}V(x) + 2 \cdot \frac{1}{2}\mathrm{Cov}(w, x)\right] = \frac{1}{N_1}\left[p_{10}(1 - p_{10}) + \frac{1}{4}p_{01}(1 - p_{01}) - p_{10}p_{01}\right]$.
Maximum Regret
Summing (11) and (12), the regret of the midpoint predictor is
(13) $\frac{1}{N_1}\left[p_{10}(1 - p_{10}) + \frac{1}{4}p_{01}(1 - p_{01}) - p_{10}p_{01}\right] + \left\{P\big[(y_1, y_2) \in A \mid z_2 = 0, u = 0\big] - 1/2\right\}^2 p_{01}^2$.
Note that $p_{01} = P(z_2 = 0, u = 0)$ is found in both the variance and the squared bias components of regret.
Hence, in contrast to the case with linear functions, the maximum regret of the predictor of an indicator
function cannot be determined by separately maximizing variance and squared bias.
Setting $P\big[(y_1, y_2) \in A \mid z_2 = 0, u = 0\big] = 0$ or 1 maximizes squared bias for any feasible value
of $(p_{10}, p_{01})$ and does not affect variance. $P\big[(y_1, y_2) \in A \mid z_2 = 0, u = 0\big]$ is only defined if $p_{01} > 0$ but, if
$p_{01} = 0$, there is no missing data problem. Hence, maximum regret for a given value of $(p_{10}, p_{01})$ is
(14) $\frac{1}{N_1}\left[p_{10}(1 - p_{10}) + \frac{1}{4}p_{01}(1 - p_{01}) - p_{10}p_{01}\right] + \frac{1}{4}p_{01}^2$.
The problem is to maximize (14) over the feasible range $p_{10} \le P(z_2 = 1)$ and $p_{01} \le P(z_2 = 0)$.
Fix p01 at any feasible value and differentiate (14) with respect to $p_{10}$. The derivative
$\frac{1}{N_1}(1 - p_{01} - 2p_{10})$ is decreasing in $p_{10}$. Hence, the maximum occurs at the interior solution $p_{10} = \frac{1 - p_{01}}{2}$
if this is a feasible value of p10 and at the boundary $p_{10} = 0$ otherwise. Considering first the interior
solution for $p_{10}$, plug $p_{10} = \frac{1 - p_{01}}{2}$ into (14) and solve the concentrated optimization problem
(15) $\max_{p_{01}} \frac{1}{4}\left[\frac{1 - p_{01}}{N_1} + p_{01}^2\right]$ s.t. $p_{01} \in [0, P(z_2 = 0)]$.
The derivative $-\frac{1}{4N_1} + \frac{1}{2}p_{01}$ is increasing in $p_{01}$. Therefore, with an interior solution for $p_{10}$, regret is
maximized at the boundary where either $p_{01} = 0$ or $p_{01} = P(z_2 = 0)$, in which case $p_{10} = \frac{1}{2}$ or
$p_{10} = \frac{1}{2}P(z_2 = 1)$, respectively. Inspection of the possible boundary solutions shows that, when
$P(z_2 = 1) \le 1 - 1/N_1$, maximum regret occurs where $p_{01} = P(z_2 = 0)$ and $p_{10} = \frac{1}{2}P(z_2 = 1)$. It follows that the
maximum regret of the midpoint predictor (9) in this typical setting is
(16) $\frac{1}{4}\left[\frac{P(z_2 = 1)}{N_1} + P(z_2 = 0)^2\right]$.
Inspection of (16) shows that maximum regret is decreasing in the initial sample size N1, holding
the response rate fixed. Differentiation of (16) with respect to the response rate shows that, holding the
sample size fixed, maximum regret is decreasing in the response rate when $P(z_2 = 1) < 1 - 1/(2N_1)$.
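The closed form (16) can be cross-checked against a brute-force search over feasible values of (p10, p01) in (14). The sketch below is a numerical sanity check under the stated condition P(z2 = 1) ≤ 1 − 1/N1, not a proof; the function names and the grid resolution are arbitrary choices made for illustration.

```python
def regret_given_p(p10, p01, n1):
    """Expression (14): regret maximized over the unidentified probability."""
    variance = (p10 * (1 - p10) + 0.25 * p01 * (1 - p01) - p10 * p01) / n1
    return variance + 0.25 * p01 ** 2


def max_regret_indicator(n1, resp_rate):
    """Closed form (16), valid when resp_rate <= 1 - 1/n1."""
    return 0.25 * (resp_rate / n1 + (1 - resp_rate) ** 2)


def grid_max_regret(n1, resp_rate, steps=200):
    """Brute-force maximum of (14) over 0 <= p10 <= P(z2=1), 0 <= p01 <= P(z2=0)."""
    best = 0.0
    for i in range(steps + 1):
        p10 = resp_rate * i / steps
        for j in range(steps + 1):
            p01 = (1 - resp_rate) * j / steps
            best = max(best, regret_given_p(p10, p01, n1))
    return best


closed_form = max_regret_indicator(100, 0.8)
grid_value = grid_max_regret(100, 0.8)
```

With N1 = 100 and an 80 percent response rate, the grid contains the maximizer p10 = 0.4, p01 = 0.2, so the two values agree up to floating-point error.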
3.2.2. Outer-Bound Midpoint Predictor
To apply the Dominitz and Manski (2017) midpoint predictor to indicator functions of two
variables, once again consider the value of f(y1, y2) to be missing when y2 is missing. Then the identification
region for E[f(y1, y2)] is the interval
(17) $\left[P\big((y_1, y_2) \in A, z_2 = 1\big),\ P\big((y_1, y_2) \in A, z_2 = 1\big) + P(z_2 = 0)\right]$.
A midpoint predictor based on this outer bound on E[f(y1, y2)] is
(18) $\frac{1}{N_1}\sum_{i=1}^{N_1} 1\big[(y_{1i}, y_{2i}) \in A, z_{2i} = 1\big] + \frac{1}{2} \cdot \frac{1}{N_1}\sum_{i=1}^{N_1} 1[z_{2i} = 0]$.
Note that (18) differs from (9) only in the arguments of the second indicator function; that is, $1[z_{2i} = 0]$
versus $1[z_{2i} = 0, u_i = 0]$. Maximum regret of midpoint predictor (18) is identical to maximum regret of
midpoint predictor (9), because maximum regret of (9) arises where $p_{01} = P(z_2 = 0)$, and, therefore,
$1[z_{2i} = 0] = 1[z_{2i} = 0, u_i = 0]$ for all i.
Equivalence of the two midpoint predictors with respect to maximum regret does not imply that
the two are equivalent in all states. In fact, midpoint predictor (9) dominates (18). That is, its regret is less
than that of (18) in states where first-period data may be informative and equals that of (18) in the "worst-case"
states where the first-period data are always uninformative. The latter are states in which z2 = 0 always
implies that u = 0, so observation of the period-1 outcome is not informative about the value of E[f(y1, y2)].
Maximum regret occurs in states where the first-period data are always uninformative. Hence, the maximum
regret of (9) equals the maximum regret of (18).
3.3. Choice of Sample Design
Suppose that a set Q of sampling processes are feasible. Each q ∈ Q has a cost πq per initial sample
member and a vector of response rates P(zq1 = i, zq2 = j), where i and j equal 0 or 1. We assume for simplicity
that the cost per sample member does not depend on whether a person responds in period 2. Also for
simplicity, we consider choice between two designs, the lower-cost design having higher attrition. Let N1L
and N1H be the two initial sample sizes. The low-cost/low-quality design L has total cost πL∙N1L and
response rate ρL in period 2, whereas the high-cost/high-quality design has total cost πH∙N1H and response
rate ρH in period 2, with 0 < πL < πH and ρL < ρH < 1.
Our analysis assumes that the design is chosen ex ante with commitment, before observation of the
distribution of responses in period 1. It may sometimes be feasible to defer choice of the sampling plan for
period 2 until after the responses in period 1 have been obtained. In such cases, one might contemplate
sequential designs as well as ones that are chosen ex ante with commitment. Sequential sample design has
long been a subject of study in the literature on Bayesian decision making, where it is a classical dynamic
programming problem. However, being Bayesian requires specification of a precise prior subjective
distribution on all relevant unknown quantities. It appears challenging to characterize the properties of
sequential procedures that do not invoke prior subjective distributions, such as minimax-regret. We do not
examine sequential designs here.
In principle, one may have additional information about the sample design beyond cost and attrition
rate that would impact the calculation of maximum regret and choice among designs. For example, the
composition of those who choose to respond or not to respond in period 2 could be known to vary across
designs $q$ and $q'$ even if they have identical response rates. Thus, it may be that $P(z_{q2} = j) = P(z_{q'2} = j)$
yet $P(y_1, y_2 \mid z_{q2} = j) \ne P(y_1, y_2 \mid z_{q'2} = j)$. We assume that no such information is available prior to data
collection.
3.3.1 Allocation of a Predetermined Budget
Suppose that the objective is to predict the value of a linear or indicator function, using the relevant
midpoint predictor. Suppose that the planner has a predetermined budget B and must choose between one
of the two designs. The feasible sample sizes are N1L = INT(B/πL) for low-cost sampling and N1H =
INT(B/πH) for high-cost sampling. We henceforth ignore for simplicity the fact that sample sizes must be
integers and take the feasible low-cost sample size to be N1L = B/πL and the feasible high-cost sample size
to be N1H = B/πH .
Consider first best prediction of the value of the linear function $f(y_1, y_2) = a + b y_1 + c y_2$ for
known values of (a, b, c). Using the midpoint predictor (2), the feasible low-cost and high-cost designs
yield upper bounds on maximum regret of
$\frac{1}{4}\left[\frac{b^2 + (c^2 + 2bc)\rho_L}{B/\pi_L} + c^2(1 - \rho_L)^2\right] + \frac{(c^2 + 2bc)\rho_L(1 - \rho_L)}{B/\pi_L}$
and
$\frac{1}{4}\left[\frac{b^2 + (c^2 + 2bc)\rho_H}{B/\pi_H} + c^2(1 - \rho_H)^2\right] + \frac{(c^2 + 2bc)\rho_H(1 - \rho_H)}{B/\pi_H}$,
respectively. Hence, the low-cost design has a
smaller upper bound on maximum regret when the budget is less than a certain threshold and the high-cost
design does otherwise. The threshold budget is
(19) $B = \dfrac{\pi_H\left[b^2 + (c^2 + 2bc)\rho_H\right]/4 - \pi_L\left[b^2 + (c^2 + 2bc)\rho_L\right]/4 + \pi_H(c^2 + 2bc)\rho_H(1 - \rho_H) - \pi_L(c^2 + 2bc)\rho_L(1 - \rho_L)}{c^2\left[(1 - \rho_L)^2 - (1 - \rho_H)^2\right]/4}$.
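The threshold can be checked numerically by solving for the budget at which the two upper bounds on maximum regret stated above coincide. The sketch below assumes b ≥ 0, c > 0 and ρL < ρH, ignores integer constraints on sample sizes as the text does, and uses hypothetical function names and made-up cost and response-rate values.

```python
def linear_bound_with_budget(b, c, budget, cost, resp_rate):
    """Upper bound on maximum regret when the initial sample size is budget/cost."""
    n1 = budget / cost
    k = c * c + 2 * b * c
    return (0.25 * ((b * b + k * resp_rate) / n1 + c * c * (1 - resp_rate) ** 2)
            + k * resp_rate * (1 - resp_rate) / n1)


def linear_threshold_budget(b, c, cost_l, rho_l, cost_h, rho_h):
    """Budget at which the low-cost and high-cost upper bounds are equal."""
    k = c * c + 2 * b * c

    def variance_weight(cost, rho):
        # Coefficient on 1/B in the upper bound for a design with this cost and response rate.
        return cost * ((b * b + k * rho) / 4 + k * rho * (1 - rho))

    numerator = variance_weight(cost_h, rho_h) - variance_weight(cost_l, rho_l)
    # Difference in maximum squared bias between the two designs.
    denominator = c * c * ((1 - rho_l) ** 2 - (1 - rho_h) ** 2) / 4
    return numerator / denominator


# Illustrative values only: average employment, costs 1 and 2, response rates 0.80 and 0.85.
b, c = 0.5, 0.5
threshold = linear_threshold_budget(b, c, 1.0, 0.80, 2.0, 0.85)
low_at_threshold = linear_bound_with_budget(b, c, threshold, 1.0, 0.80)
high_at_threshold = linear_bound_with_budget(b, c, threshold, 2.0, 0.85)
```

Below the threshold the larger sample bought by the low-cost design dominates, and above it the smaller maximum bias of the high-cost design dominates.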
Consider now best prediction of the value of indicator function 1[(y1, y2) ∈ A]. Using midpoint
predictor (9), the feasible low-cost and high-cost designs yield maximum regret $\frac{1}{4}\left[\frac{\rho_L}{B/\pi_L} + (1 - \rho_L)^2\right]$ and
$\frac{1}{4}\left[\frac{\rho_H}{B/\pi_H} + (1 - \rho_H)^2\right]$, respectively. The low-cost design has smaller maximum regret when the budget is
less than a certain threshold and the high-cost design is better otherwise. The threshold budget is
(20) $B = \dfrac{\pi_H\rho_H - \pi_L\rho_L}{(1 - \rho_L)^2 - (1 - \rho_H)^2}$.
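For indicator functions the comparison reduces to a few lines. The sketch below evaluates the two maximum-regret expressions at a given budget, computes the budget at which they are equal, and picks the design with smaller maximum regret. Names and numerical values are illustrative assumptions, and the maximum-regret formula presumes the typical setting in which each response rate is at most 1 − 1/N1.

```python
def indicator_max_regret_with_budget(budget, cost, resp_rate):
    """Maximum regret (16) with initial sample size budget/cost."""
    n1 = budget / cost
    return 0.25 * (resp_rate / n1 + (1 - resp_rate) ** 2)


def indicator_threshold_budget(cost_l, rho_l, cost_h, rho_h):
    """Budget at which the two designs have equal maximum regret; needs rho_l < rho_h."""
    return (cost_h * rho_h - cost_l * rho_l) / ((1 - rho_l) ** 2 - (1 - rho_h) ** 2)


def choose_design(budget, cost_l, rho_l, cost_h, rho_h):
    """Return 'L' or 'H', whichever design has smaller maximum regret at this budget."""
    regret_l = indicator_max_regret_with_budget(budget, cost_l, rho_l)
    regret_h = indicator_max_regret_with_budget(budget, cost_h, rho_h)
    return "L" if regret_l <= regret_h else "H"


# Illustrative values only: costs 1 and 2, response rates 0.80 and 0.85.
threshold = indicator_threshold_budget(1.0, 0.80, 2.0, 0.85)
small_budget_choice = choose_design(20.0, 1.0, 0.80, 2.0, 0.85)
large_budget_choice = choose_design(100.0, 1.0, 0.80, 2.0, 0.85)
```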
3.3.2. Choice of Budget to Achieve ε-Optimal Prediction
Now suppose that budget is a choice variable. In principle, the planner should perform a benefit-cost
analysis. Devoting a larger budget to data collection improves prediction of outcomes but diverts
resources from other uses. The planner must resolve this tension.
Adapting arguments in Manski and Tetenov (2016) regarding sample size selection to enable ε-optimal
treatment decisions, Dominitz and Manski (2017) consider choice of a design so that the maximum
MSE of the midpoint predictor is no larger than a specified ε > 0. If the objective is prediction of the value
of linear functions, the analysis above shows that a budget of size B suffices to achieve this objective if
(21) $\min\left\{\frac{1}{4}\left[\frac{b^2 + (c^2+2bc)\rho_L}{B/\pi_L} + c^2(1-\rho_L)^2\right] + \frac{(c^2+2bc)\rho_L(1-\rho_L)}{B/\pi_L},\; \frac{1}{4}\left[\frac{b^2 + (c^2+2bc)\rho_H}{B/\pi_H} + c^2(1-\rho_H)^2\right] + \frac{(c^2+2bc)\rho_H(1-\rho_H)}{B/\pi_H}\right\} \le \varepsilon$.
Similarly, for indicator functions, a budget of size B suffices to achieve this objective if
(22) $\min\left\{\frac{1}{4}\left[\frac{\rho_L}{B/\pi_L} + (1-\rho_L)^2\right],\; \frac{1}{4}\left[\frac{\rho_H}{B/\pi_H} + (1-\rho_H)^2\right]\right\} \le \varepsilon$.
These budget sizes suffice for ε-optimality but may not be necessary. The smallest budgets that
enable ε-optimal prediction occur when one uses MMR predictors rather than the tractable midpoint
predictors studied in Sections 3.1 and 3.2. In the absence of knowledge of the MMR predictors, we can
provide sufficient budget sizes but not necessary ones.
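Because each bound has the form (variance coefficient) × π / B + (squared-bias term), the smallest budget satisfying the sufficient conditions can be written in closed form for each design and then minimized across designs. The Python sketch below illustrates this under that reading. The function names are hypothetical, and the linear case assumes b, c ≥ 0 so that the variance coefficient is nonnegative.

```python
import math


def sufficient_budget(variance_coeff, bias_sq, cost, eps):
    """Smallest B with variance_coeff * cost / B + bias_sq <= eps (inf if none exists)."""
    if eps <= bias_sq:
        return math.inf  # the squared-bias term alone already reaches eps
    return variance_coeff * cost / (eps - bias_sq)


def indicator_budget(designs, eps):
    """designs: iterable of (rho, pi) pairs; smallest budget meeting the indicator condition."""
    return min(
        sufficient_budget(rho / 4, (1 - rho) ** 2 / 4, pi, eps)
        for rho, pi in designs
    )


def linear_budget(designs, eps, b, c):
    """Smallest budget meeting the linear-function condition (assumes b, c >= 0)."""
    k = c * c + 2 * b * c
    return min(
        sufficient_budget(
            (b * b + k * rho) / 4 + k * rho * (1 - rho),
            c * c * (1 - rho) ** 2 / 4,
            pi,
            eps,
        )
        for rho, pi in designs
    )


# Hypothetical designs: low cost (rho = 0.6, pi = 1), high cost (rho = 0.9, pi = 2).
example_designs = [(0.6, 1.0), (0.9, 2.0)]
budget_for_indicator = indicator_budget(example_designs, 0.05)
```

When ε is no larger than the squared-bias term of every feasible design, no finite budget satisfies the condition and the sketch returns infinity.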
4. Panel Data versus Repeated Cross Sections
This section uses our analysis to guide sample design choice between panel data and repeated
cross-sectional (RCS) data, with complete response in each cross-section. Continuing the two-period framework
utilized above, RCS data are generated by a sampling process in which two independent random samples
are drawn, with (z1 = 1, z2 = 0) in one sample and (z1 = 0, z2 = 1) in the other; thus, P(z1 = 1, z2 = 1) = 0.
With repeated observations on individual sample members, panel data with full response point-identify the
joint distribution P(y1, y2), whereas RCS data point-identify only the period-specific marginal distributions
P(y1) and P(y2).
Much attention has been paid to estimation of dynamic models that are point-identified by RCS
data; see, for example, the review in Verbeek (2008). In early work on this topic, Deaton (1985) restricted
attention to linear models that may include an additive fixed effect to be “differenced out.” Then the
outcome of interest is the linear function $f(y_1, y_2) = a + b y_1 + c y_2$ with b = −c. Moffitt (1993) extended
the approach to some nonlinear models, focusing on binary choice models where the outcome of interest in
each period is an indicator function.
Moffitt emphasized that estimation of dynamic models with RCS data is made difficult by the
“general lack of information on lagged dependent and independent variables and the consequent
unobservability of the intertemporal covariances needed to identify and estimate dynamic models” (Moffitt,
1993, p. 99). This line of research, which replaces individual observations with cohort means and uses
additional assumptions to identify the models, is often motivated by cases in which panel data are not
available. However, it has also been noted that panel data “are often inferior to the available cross-sections
in some respects” (Moffitt, 1993, p. 100), such as smaller sample sizes in each time period, lower rates of
response arising from attrition, and “large and persistent errors of measurement” (Deaton, 1985, p. 110).
Rather than try to identify conditions under which existing panel data should be preferred to
existing RCS data or vice versa, here we consider how one should design longitudinal data collection before
commencing it. The answer to this question depends crucially on how the data will be used.
4.1. Linear Functions
Under random sampling with no missing data, RCS data point-identify the expectations of linear
functions of y1 and y2. As above, let $f(y_1, y_2) = a + b y_1 + c y_2$ and let $m_t$ be the sample average of the
$N_t$ observed values in period t. The RCS sample-average predictor is $a + b m_1 + c m_2$. Observe that
$E[m_1] = E[y_1]$, $E[m_2] = E[y_2]$, $V[m_1] = \frac{1}{N_1}V[y_1]$, $V[m_2] = \frac{1}{N_2}V[y_2]$, and $C(m_1, m_2) = 0$. Squared
bias equals 0 and variance is maximized if P(y1, y2) is bivariate Bernoulli with mean (½, ½). Maximum
regret is
(23) $\frac{1}{4}\left[\frac{b^2}{N_1} + \frac{c^2}{N_2}\right]$.
We may compare (23) with the maximum regret of the panel-data midpoint predictor. To make
the comparison precise, let N1 be the period-1 sample size for both RCS and panel data collection. Let N2
= N1 be the period-2 sample size in each case as well. The N2 period-2 observations form a new random
sample in the RCS case and are the period-2 responders in the panel case with no attrition. Let both designs
have the same cost per observation. Thus, we compare designs that yield the same numbers of observations
in each period and have the same cost, but differ in the composition of the period-2 observations.
As previously noted, the maximum regret of the panel data midpoint predictor is
$\frac{1}{4}\left[\frac{b^2 + c^2 + 2bc}{N_1}\right]$
when P(z2 = 1) = 1. Comparison with (23), when N2 = N1, reveals that the maximum regret of the panel
data predictor exceeds that of the RCS predictor by $\frac{2bc}{4N_1}$. This difference is attributable to the potential
covariation between the period-1 and period-2 sample averages in a panel.
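A short Python sketch makes the comparison concrete for the full-response case with equal sample sizes. The function names and the numerical values of b, c, and N1 are illustrative assumptions only.

```python
def rcs_max_regret(b, c, n1, n2):
    """Maximum regret of the RCS sample-average predictor of a + b*y1 + c*y2."""
    return 0.25 * (b * b / n1 + c * c / n2)


def panel_max_regret_full_response(b, c, n1):
    """Maximum regret of the panel midpoint predictor when P(z2 = 1) = 1."""
    return 0.25 * (b * b + c * c + 2 * b * c) / n1


# Hypothetical values: b = 1, c = 2, N1 = N2 = 100.
gap = panel_max_regret_full_response(1.0, 2.0, 100) - rcs_max_regret(1.0, 2.0, 100, 100)
# gap matches 2*b*c / (4*N1) up to floating-point rounding.
```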
This finding effectively turns a common argument in favor of a panel over RCS on its head. Unlike
RCS, a panel yields information on the joint distribution of outcomes across periods. But this information
has no value when the objective is to predict a linear function under square loss. Moreover, the possibility
of covariation of outcomes across periods increases the maximum variance of the panel data predictor
relative to the RCS predictor, which draws an independent random sample each period.
The comparison is not as straightforward when there is attrition, because we only have an analytical
upper bound on the maximum regret of the panel data midpoint predictor. Nonresponse in period 2 may
reduce the impact of covariation of outcomes across periods on the maximum variance of the panel data
midpoint predictor relative to the RCS predictor, but nonresponse also increases the predictor’s maximum
squared bias. In contrast, the RCS predictor is unbiased under the maintained assumptions.
4.2. Indicator Functions
Suppose now that the objective is best prediction of the event that (y1, y2) take specified values.
The best predictor is $P[y_1 = j, y_2 = k]$ for values $j, k \in [0, 1]$. With panel data, we found in (8) the
identification region to be the following interval of width $P(z_2 = 0, u = 0)$:
$[P(y_1 = j, y_2 = k, z_2 = 1),\; P(y_1 = j, y_2 = k, z_2 = 1) + P(z_2 = 0, u = 0)]$.
With RCS data, the Fréchet bound on a joint probability using knowledge of the marginals gives
the identification region as the interval
(24) $\left[\max\left(0,\, P(y_1 = j) + P(y_2 = k) - 1\right),\; \min\left(P(y_1 = j),\, P(y_2 = k)\right)\right]$.
This interval has maximum width ½, which obtains when $P(y_1 = j) = \frac{1}{2}$ and $P(y_2 = k) = \frac{1}{2}$.
Suppose one were to know the marginal probabilities $P(y_1 = j)$ and $P(y_2 = k)$. Then the
minimax-regret predictor would be the midpoint of (24). Without additional prior information, the
maximum regret of this midpoint predictor equals its maximum squared bias of 1/16, which obtains when
$P(y_1 = j) = \frac{1}{2}$ and $P(y_2 = k) = \frac{1}{2}$.
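The Fréchet interval and its midpoint are simple to compute once the marginal probabilities are given. The Python sketch below uses hypothetical function names and a coarse grid search to illustrate that the width peaks at ½ and the squared half-width at 1/16.

```python
def frechet_interval(p1, p2):
    """Frechet bounds on P(y1 = j, y2 = k) given marginals p1 = P(y1 = j), p2 = P(y2 = k)."""
    lower = max(0.0, p1 + p2 - 1.0)
    upper = min(p1, p2)
    return lower, upper


def midpoint_and_max_sq_bias(p1, p2):
    """Midpoint of the interval and the squared half-width (maximum squared bias)."""
    lower, upper = frechet_interval(p1, p2)
    return (lower + upper) / 2, ((upper - lower) / 2) ** 2


# Grid search over marginals in steps of 0.01; the widest interval occurs at (0.5, 0.5).
grid = [i / 100 for i in range(101)]
widest = max(
    frechet_interval(p1, p2)[1] - frechet_interval(p1, p2)[0]
    for p1 in grid
    for p2 in grid
)
```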
Suppose instead that one uses information from a finite sample to estimate the marginal
probabilities. Maximum regret of a predictor using finite-sample information on the probabilities cannot
be less than maximum regret of the minimax-regret predictor using knowledge of the probabilities.
Therefore, maximum regret of a sample-analog RCS midpoint predictor must be no less than 1/16.
Recall that maximum regret of the panel data midpoint predictor is
$\frac{1}{4}\left[\frac{1}{N_1}P(z_2 = 1) + P(z_2 = 0)^2\right]$. Thus, the lower bound on the maximum regret of a finite-sample RCS midpoint predictor
exceeds the maximum regret of the panel data midpoint predictor when $\frac{1}{N_1}P(z_2 = 1) + P(z_2 = 0)^2 < \frac{1}{4}$.
When $P(z_2 = 1) > \frac{1}{2}$, there exists a threshold sample size such that this inequality holds for all samples
larger than the threshold.
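The threshold sample size follows from rearranging the inequality: N1 must exceed P(z2 = 1) / (¼ − P(z2 = 0)²). A minimal Python sketch, with hypothetical function names and response rates chosen only for illustration:

```python
import math


def panel_indicator_max_regret(n1, response_rate):
    """Maximum regret of the panel midpoint predictor of an indicator function."""
    return 0.25 * (response_rate / n1 + (1 - response_rate) ** 2)


def min_panel_size_beating_rcs_bound(response_rate):
    """Smallest integer N1 with panel maximum regret strictly below 1/16.

    Returns None when no such N1 exists, i.e. when P(z2 = 0)^2 >= 1/4.
    """
    slack = 0.25 - (1 - response_rate) ** 2
    if slack <= 0:
        return None
    # Smallest integer strictly greater than response_rate / slack.
    return math.floor(response_rate / slack) + 1


# With a hypothetical response rate of 0.8, four period-1 sample members suffice.
n_needed = min_panel_size_beating_rcs_bound(0.8)
```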
RCS data may be even less informative when predictors of other indicator functions discussed in
Section 3 are of interest. Consider the event [y2 > y1], whose best predictor is $P(y_2 > y_1)$. It is possible
that the marginal distributions P(y1) and P(y2) identified by RCS data are compatible with both $P(y_2 > y_1)$
arbitrarily close to 0 and $P(y_2 > y_1)$ arbitrarily close to 1, as would be the case when (y1, y2) are
continuously distributed on [0, 1] with P(y1) = P(y2). Suppose one were to know the distributions P(y1) and
P(y2). Then the minimax-regret predictor would be the midpoint of the identification region. Without
additional information, this identification region is the open unit interval. The minimax-regret prediction
is therefore ½, and maximum regret of this midpoint predictor equals its maximum squared bias of ¼. Thus,
the RCS data are potentially uninformative and a finite-sample RCS midpoint predictor must have
maximum regret no less than ¼, whereas maximum regret of the panel data midpoint predictor is again
$\frac{1}{4}\left[\frac{1}{N_1}P(z_2 = 1) + P(z_2 = 0)^2\right]$.
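One way to see why identical marginals can be compatible with P(y2 > y1) near 0 or near 1 is a shift coupling: take y1 uniform on [0, 1] and set y2 = (y1 + δ) mod 1, which is again uniform. The Python sketch below approximates this on a grid. The construction and the function names are illustrative assumptions, not taken from the analysis above.

```python
def shifted_coupling(n, delta):
    """Pairs (y1, y2) with y2 = (y1 + delta) mod 1 on an n-point grid approximating U[0, 1]."""
    y1 = [(i + 0.5) / n for i in range(n)]
    y2 = [(value + delta) % 1.0 for value in y1]
    return y1, y2


def share_increasing(y1, y2):
    """Fraction of pairs with y2 > y1, the grid analog of P(y2 > y1)."""
    return sum(second > first for first, second in zip(y1, y2)) / len(y1)


# A small shift makes almost every pair increasing; a large shift makes almost none.
y1_a, y2_a = shifted_coupling(1000, 0.01)
y1_b, y2_b = shifted_coupling(1000, 0.99)
share_small_shift = share_increasing(y1_a, y2_a)
share_large_shift = share_increasing(y1_b, y2_b)
```

In both cases the grid values of y2 are a rearrangement of those of y1, so the marginal distributions coincide even though the shares differ sharply.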
5. Conclusion
This paper continues our effort to encourage increased use of statistical decision theory to inform
the design of data collection when data quality is a decision variable. Building on our previous study, we
demonstrate how the framework may be applied to more complex design problems.
A notable general finding is that, when collecting panel data with attrition, prediction of the value
of a function of two variables is more subtle than prediction of a function of one variable. The reason is
that observation of the outcome in the first period may constrain the value of the function when the outcome
in the second period is not observed. The nature of the constraint, if any, depends on the form of the
function being predicted. Juxtaposition of linear and indicator functions demonstrates this relationship
well.
The form of the function being predicted also is important when considering choice between
collection of panel data and RCS. In the absence of restrictions on the joint distribution of outcomes, RCS
data are well-suited for prediction of linear functions but not for prediction of nonlinear functions such as
indicator functions.
When predicting linear functions, choice between panel data and RCS may depend on the relative
magnitudes of recruitment and retention costs, as well as the relationship between retention costs and the
attrition rate. The framework adopted in the study should be useful for addressing these matters and related
questions. For instance, what is the optimal length of time between interview waves, when, all else equal,
an increase in period length should increase retention costs and/or attrition? To what extent can a rotating
panel be used to optimally combine the best elements of RCS and panel? Finally, how can retrospective
questions in RCS be used in addition to or in lieu of a (rotating) panel and how does this answer depend on
the length of time between interviews given the relationship between the length of this time span and
retrospective reporting errors?
We should re-emphasize that our analysis assumes knowledge of response rates but uses no
information on how the sample design affects the composition of respondents. Such information may affect
minimax-regret choice among designs. For example, one may be able to lower maximum regret by
combining complementary designs that tend to attract different segments of the population of interest.
Appendix
Derivation of V(m2)
Recall that $m_2 = \frac{1}{N_1}\sum_{i=1}^{N_1} y_{2i} z_{2i}$. Random sampling implies that
$V(m_2) = V\left(\frac{1}{N_1}\sum_{i=1}^{N_1} y_{2i} z_{2i}\right) = \left(\frac{1}{N_1}\right)^2 V\left(y_{21}z_{21} + y_{22}z_{22} + \cdots + y_{2N_1}z_{2N_1}\right) = \left(\frac{1}{N_1}\right)^2 N_1 V(y_2 z_2) = \frac{1}{N_1}V(y_2 z_2)$.
Observe that
$E(y_2 z_2) = E(y_2 \mid z_2 = 1)P(z_2 = 1)$
and
$E[(y_2 z_2)^2] = E(y_2^2 \mid z_2 = 1)P(z_2 = 1)$.
Thus,
$V(y_2 z_2) = E[(y_2 z_2)^2] - [E(y_2 z_2)]^2 = E(y_2^2 \mid z_2 = 1)P(z_2 = 1) - [E(y_2 \mid z_2 = 1)P(z_2 = 1)]^2$
$= P(z_2 = 1)\left[E(y_2^2 \mid z_2 = 1) - E(y_2 \mid z_2 = 1)^2 P(z_2 = 1)\right]$.
Hence,
$V(m_2) = \frac{1}{N_1}P(z_2 = 1)\left[E(y_2^2 \mid z_2 = 1) - E(y_2 \mid z_2 = 1)^2 P(z_2 = 1)\right]$.
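The identity for V(y2 z2) can be checked exactly on a small discrete distribution. The Python sketch below is a sanity check under assumed probabilities, with hypothetical function names, and is not part of the derivation itself.

```python
def variance_direct(joint):
    """V(y2 * z2) computed directly; joint maps (y2, z2) -> probability."""
    mean = sum(p * y * z for (y, z), p in joint.items())
    second_moment = sum(p * (y * z) ** 2 for (y, z), p in joint.items())
    return second_moment - mean ** 2


def variance_via_conditionals(joint):
    """P(z2=1) * [E(y2^2 | z2=1) - E(y2 | z2=1)^2 * P(z2=1)]."""
    p_resp = sum(p for (y, z), p in joint.items() if z == 1)
    e_y = sum(p * y for (y, z), p in joint.items() if z == 1) / p_resp
    e_y_sq = sum(p * y * y for (y, z), p in joint.items() if z == 1) / p_resp
    return p_resp * (e_y_sq - e_y ** 2 * p_resp)


def variance_of_m2(joint, n1):
    """V(m2) for a random sample of n1 units."""
    return variance_via_conditionals(joint) / n1


# Hypothetical distribution: 70 percent respond, outcomes lie in the unit interval.
example = {(0.2, 1): 0.3, (0.9, 1): 0.4, (0.5, 0): 0.3}
```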
Derivation of C(m1, m2)
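Recall that $m_1 = \frac{1}{N_1}\sum_{i=1}^{N_1} y_{1i}$, $m_2 = \frac{1}{N_1}\sum_{i=1}^{N_1} y_{2i} z_{2i}$. Random sampling implies that
$C(m_1, m_2) = C\left(\frac{1}{N_1}\sum_{i=1}^{N_1} y_{1i},\; \frac{1}{N_1}\sum_{i=1}^{N_1} y_{2i} z_{2i}\right) = \frac{1}{N_1^2}\, C\left(\sum_{i=1}^{N_1} y_{1i},\; \sum_{i=1}^{N_1} y_{2i} z_{2i}\right)$
$= \frac{1}{N_1^2}\, C\left(y_{11} + y_{12} + \cdots + y_{1N_1},\; y_{21}z_{21} + y_{22}z_{22} + \cdots + y_{2N_1}z_{2N_1}\right)$
$= \frac{1}{N_1^2}\, N_1\, C(y_1, y_2 z_2)$
$= \frac{1}{N_1}\, C(y_1, y_2 z_2)$.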
Observe that $E(y_2 z_2) = E(y_2 \mid z_2 = 1)P(z_2 = 1)$ and $E(y_1 y_2 z_2) = E(y_1 y_2 \mid z_2 = 1)P(z_2 = 1)$. Thus,
$C(y_1, y_2 z_2) = E(y_1 y_2 z_2) - E(y_1)E(y_2 z_2)$
$= E(y_1 y_2 \mid z_2 = 1)P(z_2 = 1) - E(y_1)E(y_2 \mid z_2 = 1)P(z_2 = 1)$
$= P(z_2 = 1)\left[E(y_1 y_2 \mid z_2 = 1) - E(y_1)E(y_2 \mid z_2 = 1)\right]$.
It follows that
$C(m_1, m_2) = \frac{1}{N_1}P(z_2 = 1)\left[E(y_1 y_2 \mid z_2 = 1) - E(y_1)E(y_2 \mid z_2 = 1)\right]$.
Derivation of upper bound on the variance of the midpoint predictor
The variance of the predictor is
$V\left(a + b m_1 + c m_2 + \frac{c}{2}P(z_2 = 0)\right) = \frac{1}{N_1}\left\{b^2 V(y_1) + c^2 P(z_2 = 1)\left[E(y_2^2 \mid z_2 = 1) - E(y_2 \mid z_2 = 1)^2 P(z_2 = 1)\right] + 2bc\,P(z_2 = 1)\left[E(y_1 y_2 \mid z_2 = 1) - E(y_1)E(y_2 \mid z_2 = 1)\right]\right\}$.
Outcomes are normalized to lie in the unit interval. Therefore, we can derive an upper bound on
$E(y_2^2 \mid z_2 = 1) - E(y_2 \mid z_2 = 1)^2 P(z_2 = 1)$, as follows:
$E(y_2^2 \mid z_2 = 1) - E(y_2 \mid z_2 = 1)^2 P(z_2 = 1)$
$= E(y_2^2 \mid z_2 = 1) - E(y_2 \mid z_2 = 1)^2 P(z_2 = 1) - E(y_2 \mid z_2 = 1)^2 P(z_2 = 0) + E(y_2 \mid z_2 = 1)^2 P(z_2 = 0)$
$= V(y_2 \mid z_2 = 1) + E(y_2 \mid z_2 = 1)^2 P(z_2 = 0)$
$\le V(y_2 \mid z_2 = 1) + P(z_2 = 0)$.
We can also derive an upper bound on $E(y_1 y_2 \mid z_2 = 1) - E(y_1)E(y_2 \mid z_2 = 1)$. By the Law of Iterated
Expectations,
$E(y_1 y_2 \mid z_2 = 1) - E(y_1)E(y_2 \mid z_2 = 1)$
$= E(y_1 y_2 \mid z_2 = 1) - \left[E(y_1 \mid z_2 = 1)E(y_2 \mid z_2 = 1)P(z_2 = 1) + E(y_1 \mid z_2 = 0)E(y_2 \mid z_2 = 1)P(z_2 = 0)\right]$
$= C(y_1, y_2 \mid z_2 = 1) + E(y_1 \mid z_2 = 1)E(y_2 \mid z_2 = 1)P(z_2 = 0) - E(y_1 \mid z_2 = 0)E(y_2 \mid z_2 = 1)P(z_2 = 0)$
$= C(y_1, y_2 \mid z_2 = 1) + \left[E(y_1 \mid z_2 = 1) - E(y_1 \mid z_2 = 0)\right]E(y_2 \mid z_2 = 1)P(z_2 = 0)$
$\le C(y_1, y_2 \mid z_2 = 1) + P(z_2 = 0)$.
Inserting these upper bounds gives this upper bound on the variance of the midpoint predictor:
$V\left(a + b m_1 + c m_2 + \frac{c}{2}P(z_2 = 0)\right) \le \frac{1}{N_1}\left\{b^2 V(y_1) + c^2 P(z_2 = 1)V(y_2 \mid z_2 = 1) + 2bc\,P(z_2 = 1)\,C(y_1, y_2 \mid z_2 = 1) + (c^2 + 2bc)P(z_2 = 1)P(z_2 = 0)\right\}$.
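The inequality can be checked numerically by comparing the exact variance expression with the bound on randomly generated discrete distributions. The Python sketch below does this for N1 times each quantity. It assumes b, c ≥ 0, outcomes in the unit interval, and hypothetical helper names. It illustrates the bound and does not replace the derivation.

```python
import random


def scaled_variance_and_bound(joint, b, c):
    """Return (N1 * variance, N1 * upper bound) for a + b*m1 + c*m2 + (c/2)P(z2=0).

    joint maps (y1, y2, z2) -> probability, with y1, y2 in [0, 1] and z2 in {0, 1}.
    """
    p1 = sum(p for (_, _, z), p in joint.items() if z == 1)  # P(z2 = 1)
    p0 = 1.0 - p1                                            # P(z2 = 0)

    def mean(f, keep=lambda z: True):
        # Expectation of f(y1, y2) over the cells selected by keep(z2), renormalized.
        mass = sum(p for (_, _, z), p in joint.items() if keep(z))
        return sum(p * f(y1, y2) for (y1, y2, z), p in joint.items() if keep(z)) / mass

    def responders(z):
        return z == 1

    e_y1 = mean(lambda y1, y2: y1)
    v_y1 = mean(lambda y1, y2: y1 * y1) - e_y1 ** 2
    e_y1_r = mean(lambda y1, y2: y1, responders)
    e_y2_r = mean(lambda y1, y2: y2, responders)
    e_y2sq_r = mean(lambda y1, y2: y2 * y2, responders)
    e_y1y2_r = mean(lambda y1, y2: y1 * y2, responders)

    exact = (b * b * v_y1
             + c * c * p1 * (e_y2sq_r - e_y2_r ** 2 * p1)
             + 2 * b * c * p1 * (e_y1y2_r - e_y1 * e_y2_r))
    v_y2_r = e_y2sq_r - e_y2_r ** 2
    cov_r = e_y1y2_r - e_y1_r * e_y2_r
    bound = (b * b * v_y1 + c * c * p1 * v_y2_r + 2 * b * c * p1 * cov_r
             + (c * c + 2 * b * c) * p1 * p0)
    return exact, bound


def random_joint(rng, cells=6):
    """Random discrete distribution on (y1, y2, z2) with both z2 values present."""
    weights = [rng.random() for _ in range(cells)]
    total = sum(weights)
    return {(rng.random(), rng.random(), i % 2): w / total
            for i, w in enumerate(weights)}
```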
References
Dai, B., S. Ding, and G. Wahba (2013), “Multivariate Bernoulli Distribution,” Bernoulli, 19, 1465-1483.
Deaton, A. (1985), “Panel Data from Time Series of Cross Sections,” Journal of Econometrics, 30, 109-126.
Dominitz, J., and C. Manski (2017), “More Data or Better Data? A Statistical Decision Problem,” Review of Economic Studies, 84, 1583-1605.
Groves, R. (2006), “Nonresponse Rates and Nonresponse Bias in Household Surveys,” Public Opinion Quarterly, 70, 646-675.
Groves, R. and L. Lyberg (2010), “Total Survey Error: Past, Present, and Future,” Public Opinion Quarterly, 74, 849-879.
Manski, C. and A. Tetenov (2016), “Sufficient Trial Size to Inform Clinical Practice,” Proceedings of the National Academy of Sciences, 113, 10518-10523.
Moffitt, R. (1993), “Identification and Estimation of Dynamic Models with a Time Series of Repeated Cross-Sections,” Journal of Econometrics, 59, 99-123.
Verbeek, M. (2008), “Pseudo-Panels and Repeated Cross-Sections,” in L. Matyas and P. Sevestre (eds.) The Econometrics of Panel Data, Berlin: Springer-Verlag, 369-383.