

On Making Valid Inferences by Integrating Data from Surveys and Other Sources

J. N. K. Rao
Carleton University, Ottawa, Canada

Sankhyā B: The Indian Journal of Statistics. https://doi.org/10.1007/s13571-020-00227-w. © Indian Statistical Institute 2020.

Abstract
Survey samplers have long been using probability samples from one or more sources in conjunction with census and administrative data to make valid and efficient inferences on finite population parameters. This topic has received a lot of attention more recently in the context of data from non-probability samples such as transaction data, web surveys and social media data. In this paper, I will provide a brief overview of probability sampling methods first and then discuss some recent methods, based on models for the non-probability samples, which could lead to useful inferences from a non-probability sample by itself or when combined with a probability sample. I will also explain how big data may be used as predictors in small area estimation, a topic of current interest because of the growing demand for reliable local area statistics.

Keywords. Big data, Dual frames, Probability sampling, Non-probability sampling, Sample selection bias, Small area estimation

1 Introduction

Sample surveys have long been conducted to obtain reliable estimates of finite population descriptive parameters, such as totals, means and quantiles, and associated standard errors and normal theory confidence intervals with large enough sample sizes. Probability sampling designs and repeated sampling inference, also called the design-based approach, have played a dominant role, especially in the production of official statistics, ever since the publication of the landmark paper by Neyman (1934), which laid the theoretical foundations of the design-based approach. The design-based approach was almost universally accepted by practicing official statisticians. The early landmark contributions to the design-based approach, outlined in Section 2.1, were mostly motivated by practical and efficiency considerations. Methods for combining two or more probability samples were also developed to increase the efficiency of estimators for a given cost (Section 3).



Data collection issues have received considerable attention in recent years to control costs and maintain response rates using new modes of data collection (Section 4.1).

In spite of the efforts under the probability sampling setup to gather designed data, response rates are decreasing and costs are rising (Williams and Brick 2018). At the same time, due to technological innovations, large amounts of inexpensive data, called big data or organic data, and data from non-probability samples (especially online panels) are now accessible. Big data include the following types: administrative data, transaction data, social media data, internet of things and scrape data from websites, sensor data and satellite images. Big data and data from web panels have the potential of providing estimates in near real time, unlike traditional data derived from probability samples. Statistical agencies publishing official statistics are now taking modernization initiatives by finding new ways to integrate data from a variety of sources and produce "reliable" near real-time official statistics. However, naïve use of such data can lead to serious sample selection bias (Section 4.2) and without adjustment to reduce selection bias it can lead to the "big data paradox: the bigger the data, the surer we fool ourselves" (Meng 2018). Sections 5, 6 and 7 discuss methods for attempting to reduce sample selection bias under different scenarios. These methods build on the techniques discussed in Sections 2, 3 and 4 for making efficient inferences from probability samples.

Big data has the potential of providing good predictors for models useful in small area estimation (Section 8). In the case of small areas with very small sample sizes within areas, direct estimators based on traditional probability sampling do not provide adequate precision and it becomes necessary to use model-based methods to "borrow strength" across related areas through linking models based on suitable predictors, such as census data and administrative records. Big data predictors can be useful as additional predictors in the linking models.

2 Probability Sampling

2.1 Some early landmark contributions

Probability sampling from a finite population of units assumes that every unit $i$ in a finite population $U$ of size $N$ has a known non-zero probability of inclusion in the sample, denoted by $\pi_i$. Focusing on estimating the population total $Y = \sum_{i\in U} y_i$ or the mean $\bar{Y} = Y/N$ of a variable of interest $y$ from a probability sample $A$, Neyman (1934) laid the theoretical foundations of the design-based approach. He introduced the ideas of efficiency of design-unbiased estimators, depending on the inclusion probabilities, and optimal sample size allocation in his theory of stratified simple random sampling.


He also demonstrated that balanced purposive sampling may perform poorly if the underlying model assumptions are violated. This result is of importance in the context of making inferences from non-probability samples based on implicit or explicit models.

A general design-unbiased estimator of the population total is of the form $\hat{Y} = \sum_{i\in A} d_i y_i$, where $d_i = \pi_i^{-1}$ is called the design weight (Horvitz and Thompson 1952; Narain 1951). This estimator may be expressed as $\hat{Y} = \sum_{i\in U} (a_i d_i) y_i$, where $a_i$ is the indicator variable for the inclusion of population unit $i$ in the sample and is equal to 1 with probability $\pi_i$ and 0 with probability $1 - \pi_i$. In the case of stratified simple random sampling, the design weights are equal to the inverses of sampling fractions within strata and they may vary across strata. Design-unbiased estimators of the variance of $\hat{Y}$, denoted by $v(\hat{Y}) = s^2(\hat{Y})$, can be obtained provided all the joint inclusion probabilities are positive, as is the case under stratified simple random sampling. Sampling practitioners often use the coefficient of variation, $C(\hat{Y}) = s(\hat{Y})/\hat{Y}$, as a measure of precision of the estimator $\hat{Y}$ and construct 95% normal theory confidence intervals on the total as $\{\hat{Y} - 2s(\hat{Y}),\ \hat{Y} + 2s(\hat{Y})\}$. This assumes large samples and single-stage sampling. Neyman noted that for large enough samples the frequency of errors in the confidence statements based on all possible stratified simple random samples that could be drawn does not exceed the limit prescribed in advance "whatever the unknown properties of the finite population".
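To make this concrete, here is a minimal Python sketch (illustrative data and function names only; it assumes simple random sampling without replacement, where the design weights reduce to $d_i = N/n$ and the textbook variance estimator $N^2(1-f)s^2/n$ applies):

```python
import numpy as np

def ht_estimate_srs(y, N):
    """Horvitz-Thompson estimate of a population total under simple random
    sampling without replacement, with the unbiased variance estimator
    N^2 (1 - f) s^2 / n, the coefficient of variation s(Y_hat)/Y_hat,
    and the normal theory interval Y_hat -/+ 2 s(Y_hat)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    f = n / N                        # sampling fraction
    Y_hat = (N / n) * y.sum()        # sum of d_i * y_i with d_i = N/n
    s = np.sqrt(N**2 * (1 - f) * y.var(ddof=1) / n)
    return Y_hat, s / Y_hat, (Y_hat - 2 * s, Y_hat + 2 * s)

rng = np.random.default_rng(1)       # illustrative sample of n = 100
print(ht_estimate_srs(rng.gamma(2.0, 50.0, size=100), N=10_000))
```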

The above attractive features of probability sampling and design-based inference were recognized soon after Neyman's paper appeared. This, in turn, led to the use of probability sampling in a variety of sample surveys covering large populations, and to theoretical developments of efficient sampling designs minimizing cost subject to specified precision of the estimators, and associated estimation theory. I will now list a few important post-Neyman theoretical developments under the design-based approach.

Mahalanobis used probability sampling designs for surveys in India as early as 1937. His classic 1944 paper (Mahalanobis 1944) rigorously formulated cost and variance functions for the efficient design of sample surveys based on probability sampling. He studied simple random sampling, stratified random sampling, single-stage cluster sampling and stratified cluster sampling for the efficient design of sample surveys of different crops in Bengal, India. He also extended the theoretical setup to subsampling of sampled clusters (which he named two-stage sampling). He was instrumental in establishing the National Sample Survey (NSS) of India and the world-famous Indian Statistical Institute. The NSS is the largest multi-subject continuing survey with full-time staff, using personal interviews for socio-economic surveys and physical measurements for crop surveys.


He attracted brilliant survey statisticians to work with him, including D. B. Lahiri, M. N. Murthy and Des Raj. Hall (2003) provides a scholarly historical account of the pioneering contributions of Mahalanobis to the early developments of survey sampling theory and methods in India. P. V. Sukhatme, who studied under Neyman, also made pioneering contributions to the design and analysis of large-scale agricultural surveys in India, using stratified multi-stage sampling.

Under the leadership of Morris Hansen, survey statisticians at the U. S. Census Bureau made fundamental contributions to the theory and methods of probability sampling and design-based inference during the period 1940-1960. The most significant contributions from the Census Bureau group include the development of the basic theory of stratified two-stage cluster sampling with one cluster (or primary sampling unit) within each stratum drawn with probability proportional to size (PPS), where size measures are obtained from external data sources such as a recent census (Hansen and Hurwitz 1943). Sampled clusters may then be subsampled at a rate to provide an overall self-weighting (equal overall probability of selection) sample design. Such probability sampling designs can lead to significant variance reduction by controlling the variability arising from unequal cluster sizes. Another major contribution was the introduction of rotation sampling with partial replacement of households to handle response burden in surveys repeated over time, such as the monthly U. S. Current Population Survey (CPS) for measuring monthly unemployment rates and month-to-month changes in the labor force. Hansen et al. (1955) developed efficient composite estimators of level and change under rotation sampling. This methodology is widely used in large-scale rotating panel surveys. Yet another significant contribution is the development of a unified approach for constructing confidence intervals for quantiles, such as the median, applicable to general probability sampling designs (Woodruff 1952). This method remains a cornerstone for making inferences on population quantiles.

As described above, population information on auxiliary variables related to a variable of interest (the study variable) is often used at the design stage for stratification or PPS sampling or both. Use of information on auxiliary variables at the estimation stage was also advocated. In particular, ratio estimation based on a single auxiliary variable $x$ correlated with $y$ has been widely used. The value $x_i$ is obtained for each unit in the sample and the population total $X = \sum_{i\in U} x_i$ must be known. The sample values of $x$ are either directly observed along with the associated values of $y$ or obtained through record linkage. A ratio estimator of the total is of the form $\hat{Y}_r = (\hat{Y}/\hat{X})X = \hat{R}X$, where $\hat{X} = \sum_{i\in A} d_i x_i$ is the design-unbiased estimator of the known total $X$. It is not necessary to know all the individual population values $x_i$ to implement the ratio estimator, unlike the use of $x_i$ at the design stage.


The ratio estimator also enjoys the calibration property, namely it reduces to the known total $X$ when $y_i$ is replaced by $x_i$. It can lead to considerable gain in efficiency when $y_i$ is roughly proportional to $x_i$. The above desirable properties and the computational simplicity of the ratio estimator led to extensive use in surveys based on probability sampling. The well-known Hájek estimator of the total is a special case of the ratio estimator. It is given by $\hat{Y}_H = (\hat{Y}/\hat{N})N$, where $\hat{N} = \sum_{i\in A} d_i$. The Hájek estimator of the population mean is given by $\hat{\bar{Y}}_H = \hat{Y}/\hat{N}$; it is widely used in practice and does not require knowledge of $N$, unlike the unbiased estimator $\hat{Y}/N$.

The ratio estimator is not generally design unbiased but it is design consistent for large samples. Survey researchers do not insist on design unbiasedness (contrary to statements in some papers on inferential issues of sampling theory) because it "often results in much larger MSE than necessary" (Hansen et al. 1983). A design consistent estimator of the variance of the ratio estimator is simply obtained by replacing the variable $y_i$ in the variance estimator $v(\hat{Y})$ by the residuals $e_i = y_i - \hat{R}x_i$ for $i \in A$. Regression estimation was also studied early on but it was not widely used due to computational limitations in those days.
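A minimal sketch of the ratio estimator and its residual-based variance estimate, again under simple random sampling with illustrative data (the function and data names are mine, not from the paper):

```python
import numpy as np

def ratio_estimate_srs(y, x, X, N):
    """Ratio estimator Y_r = (Y_hat / X_hat) X under simple random sampling,
    with the design consistent variance estimate obtained by substituting
    the residuals e_i = y_i - R_hat x_i for y_i in v(Y_hat)."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    n = len(y)
    R_hat = y.sum() / x.sum()             # common design weights cancel
    e = y - R_hat * x                     # residuals replace y in v(Y_hat)
    v = N**2 * (1 - n / N) * e.var(ddof=1) / n
    return R_hat * X, np.sqrt(v)          # estimate calibrates to known X

rng = np.random.default_rng(2)
x = rng.gamma(3.0, 10.0, 200)
y = 2.0 * x + rng.normal(0.0, 5.0, 200)   # y roughly proportional to x
print(ratio_estimate_srs(y, x, X=300_000.0, N=10_000))
```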

In the early days of survey sampling, surveys were generally much simpler than they are today and data were collected through personal interviews or through mail questionnaires (possibly followed by personal interviews of a subsample of non-respondents). Physical measurements, such as those advocated by Mahalanobis for estimating crop yields in India, were also used. Response rates were generally high and the 1970s are generally regarded as the "Golden Age of Survey Research" (Singer 2016).

Together with the consolidation of the basic design-based sampling theory, attention turned to measurement or response errors in the 1940 U. S. Census (Hansen et al. 1951). Under additive response error models with minimal model assumptions on the observed responses treated as random variables, the total variance of the estimator $\hat{Y}$ is decomposed into sampling variance, simple response variance and correlated response variance (CRV) due to interviewers. The CRV can dominate the total variance if the number of interviews per interviewer is large. Partly for this reason, self-enumeration by mail was first introduced in the 1960 U. S. Census to reduce CRV. This is indeed a success story of theory influencing practice.

Prior to the work of Hansen et al. (1951), Mahalanobis (1946) proposed the ingenious method of interpenetrating subsamples to assess both sampling errors and variable interviewer errors in socio-economic surveys. He showed that both the total variance and the interviewer variance component can be estimated by assigning the subsamples at random to the interviewers.


Kalton (2019) notes that response errors remain a major concern although much research on total survey error has been conducted since the pioneering work of Mahalanobis (1946) and Hansen et al. (1951).

Nonresponse in probability surveys was also addressed in early survey sampling development. Hansen and Hurwitz (1946) proposed two-phase sampling for following up initial nonrespondents. In their application, the sample is contacted by mail in the first phase and a subsample of nonrespondents is then subjected to personal interview, assuming complete response or negligible nonresponse at the second phase. This method is currently used in the American Community Survey and it can also be regarded as an early application of what is now fashionable as a responsive or adaptive design (Tourangeau et al. 2017).

2.2 Model-assisted calibration

Formal working linear regression models, relating the study variable $y_i$ to a vector $x_i$ of auxiliary variables with known population total $X$, were studied in later years to develop efficient estimators of the total $Y$, called generalized regression estimators (GREGs). The working linear regression model is of the form $y_i = x_i'\beta + \varepsilon_i,\ i \in U$, with model errors $\varepsilon_i$ assumed to be uncorrelated with mean zero and variance proportional to a known constant $c_i$. The GREG estimator under the above working model is given by

$$\hat{Y}_{gr} = \hat{Y} + \hat{\beta}_d'\left(X - \hat{X}\right),$$

where $\hat{\beta}_d = \left(\sum_{i\in A} d_i x_i x_i'/c_i\right)^{-1}\left(\sum_{i\in A} d_i x_i y_i/c_i\right)$ is the survey weighted least squares estimator of the regression parameter $\beta$. The GREG estimator is design-consistent regardless of the validity of the working model (Fuller 1975). It can lead to significant gain in efficiency over the basic design-unbiased estimator $\hat{Y}$ if the working model provides a good fit to the sample data. The GREG estimator reduces to the ratio estimator $\hat{Y}_r$ in the special case of a scalar $x$ and regression through the origin with error variance proportional to $x$.

The GREG estimator can also be expressed as a weighted sum $\sum_{i\in A} w_i y_i$, where the weight $w_i = d_i g_i$ with $g_i = 1 + (X - \hat{X})'\left(\sum_{i\in A} d_i x_i x_i'/c_i\right)^{-1} x_i/c_i$. The adjustment factor $g_i$ ensures the calibration property $\sum_{i\in A} w_i x_i = X$, which is attractive to the user. Note that the GREG weights do not depend on the study variable as long as the same linear regression model is used for all the study variables $y$.
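The following sketch computes the GREG/calibration weights $w_i = d_i g_i$ and checks the calibration property (illustrative inputs; the helper name and data are mine):

```python
import numpy as np

def greg_weights(d, x, X, c=None):
    """GREG weights w_i = d_i g_i for a working linear model with known
    population totals X; x is the n-by-p matrix of sample auxiliaries and
    c the optional variance constants c_i (default 1)."""
    d, x = np.asarray(d, float), np.asarray(x, float)
    c = np.ones(len(d)) if c is None else np.asarray(c, float)
    T = (x * (d / c)[:, None]).T @ x       # sum_i d_i x_i x_i' / c_i
    X_hat = d @ x                          # design-unbiased estimate of X
    g = 1.0 + (x / c[:, None]) @ np.linalg.solve(T, X - X_hat)
    return d * g

# Illustrative check of the calibration property sum_i w_i x_i = X.
rng = np.random.default_rng(3)
n, N = 150, 5_000
x = np.column_stack([np.ones(n), rng.gamma(2.0, 4.0, n)])
X = np.array([float(N), N * 8.0])          # known totals (count, x total)
w = greg_weights(np.full(n, N / n), x, X)
print(np.allclose(w @ x, X))               # True: weights calibrate to X
```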

The use of a common weight $w_i$ for all study variables may be justified through a model-free calibration approach in which the weights $w_i$ for $i \in A$ are obtained by minimizing a chi-square distance measure between $d_i$ and $w_i$ subject to the calibration constraints $\sum_{i\in A} w_i x_i = X$.


The resulting calibration weights $w_i$ are in fact identical to the GREG weights $w_i$ based on the linear regression model (Deville and Särndal 1992). Calibration estimation has attracted the attention of users due to its model-free property and its ability to produce a common set of weights. Särndal (2007) says "Calibration has established itself as an important methodological instrument in large scale production of statistics". Van den Brakel and Bethlehem (2008) note that the use of common calibration weights for estimation in multipurpose surveys makes the calibration method "very attractive to produce timely official statistics in a regular production environment". The calibration approach has the potential for adjusting for selection bias due to non-probability sampling, as shown later.

Study variable-specific working models have also been studied, by assuming a mean function $E_m(y_i \mid x_i) = h(x_i)$ for $i \in U$, where $E_m$ denotes model expectation. In the parametric case $h(x_i) = h(x_i, \beta)$ for a known function $h(\cdot)$, where $\beta$ is the model parameter. Wu and Sitter (2001) estimated $\beta$ from the sample data and used $\hat{h}_i = h(x_i, \hat{\beta})$ as the predictor of $h(x_i, \beta)$ for $i \in U$, assuming all the population values $x_i$ are known, where $\hat{\beta}$ is the estimator of the model parameter. The resulting model-assisted estimator of the total is given by

$$\hat{Y}_{ma} = \sum_{i\in A} d_i\left(y_i - \hat{h}_i\right) + \sum_{i\in U}\hat{h}_i \qquad (1)$$

and the resulting calibration weights $w_i$ satisfy $\sum_{i\in A} w_i\hat{h}_i = \sum_{i\in U}\hat{h}_i$. The GREG estimator $\hat{Y}_{gr}$ is a special case of (1) when $h(x_i, \beta) = x_i'\beta$. Breidt and Opsomer (2017) review model-assisted estimation using modern prediction methods, including non-parametric regression and machine learning.

In practice, available auxiliary variables might include several categorical variables with many categories, and significant interactions may exist among the variables. For example, the Quarterly Census of Employment and Wages (QCEW) in the United States is used as a sampling frame for many establishment surveys. The frame is compiled from administrative records and contains several categorical variables such as industry code, firm ownership type, location and size class of the establishment (McConville and Toth 2018). Clearly, a linear regression working model including all categorical variables and their two factor interactions as auxiliary variables can lead to unstable GREG weights and even some negative weights. In this case, the GREG estimator can be even less efficient than the basic estimator $\hat{Y}$.


McConville and Toth (2018) proposed a model-assisted regression tree approach that can automatically account for relevant interactions, leading to significantly more efficient design-consistent post-stratified estimators than the GREG estimator obtained after variable selection from the pool of categorical variables. The GREG estimator performed poorly when the pool for variable selection also included all the two factor interactions. The tree weights are always positive and their variability is considerably smaller than the variability of the GREG weights. Regression trees use a recursive partitioning algorithm that partitions the space of predictor variables into a set of boxes (post strata) $B_1, \ldots, B_L$ such that the sample units within a box are homogeneous with respect to the study variable. The model-assisted estimator (1) with unspecified mean function reduces to the regression tree estimator of the form

$$\hat{Y}_{rt} = \sum_{l=1}^{L} N_l\left(\sum_{i\in A_l} d_i y_i \Big/ \sum_{i\in A_l} d_i\right) \qquad (2)$$

where $A_l$ is the set of sample units and $N_l$ is the number of population units belonging to poststratum $l$. It follows from (2) that the weight attached to a sample unit belonging to poststratum $l$ is $N_l d_i / \sum_{i\in A_l} d_i$ and the regression tree estimator (2) calibrates to the poststrata counts $N_l$. Note that all the population values $x_i$ need to be known in order to obtain the poststrata counts $N_l$, $l = 1, \ldots, L$, unlike in the case of the GREG estimator, which depends only on the vector of population totals $X$. Also, the tree boxes $B_1, \ldots, B_L$ depend on the values of the study variable in the sample. Establishment surveys typically provide population values of the vector $x$ at the unit level, unlike socio-economic surveys. Regression tree methods can be very useful in the context of utilizing multiple sources of auxiliary data provided they can be linked to the sample data on the study variables. McConville and Toth (2018) also provide a design consistent estimator of the variance of the regression tree estimator (2), based on Taylor linearization.
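A sketch of the estimator (2) using an off-the-shelf CART fit to form the boxes (illustrative only; this is not the McConville-Toth algorithm, and it assumes scikit-learn is available and that the full population auxiliary file x_pop can be scored):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def tree_estimate(x_sample, y, d, x_pop, max_leaf_nodes=8):
    """Regression tree estimator in the spirit of (2): each leaf of a
    weighted CART fit acts as a poststratum; N_l is counted from the
    population file and multiplied by the weighted leaf mean of y."""
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes,
                                 min_samples_leaf=10, random_state=0)
    tree.fit(x_sample, y, sample_weight=d)
    sample_leaf = tree.apply(x_sample)     # box id for each sample unit
    pop_leaf = tree.apply(x_pop)           # box id for each population unit
    total = 0.0
    for l in np.unique(sample_leaf):
        in_l = sample_leaf == l
        N_l = np.sum(pop_leaf == l)        # poststratum count N_l
        total += N_l * np.sum(d[in_l] * y[in_l]) / np.sum(d[in_l])
    return total
```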

In some applications, the number of potential auxiliary variables could be large, often correlated, and several of those variables may not have a significant relationship with the study variable. In such cases, it is desirable to select significant auxiliary variables first and then apply GREG to the reduced model. McConville et al. (2017) used a weighted lasso method which can perform model selection and coefficient estimation simultaneously by shrinking weakly related auxiliary variables to zero through a penalty function (Tibshirani 1996). Note that the model selection depends on the choice of tuning parameter(s) in the penalty function. The lasso GREG can also be expressed as a weighted sum, but the weights depend on the study variable and all the population values $x_i$ must be known, as in the case of the tree GREG. They also established design consistency of the lasso GREG estimator, assuming that the number of potential auxiliary variables is fixed as the sample size increases.


Ta et al. (2019) allowed the number of auxiliary variables to increase as the sample size increases and established design consistency of the lasso GREG estimator. McConville et al. (2017) conducted a simulation study on a forestry population dataset with a marginal model based on 12 potential covariates and a more complex model with 78 potential covariates based on the main effects and interactions of those variables. As expected, lasso GREG led to very large gains in efficiency over the customary GREG in the second case and significant gains in the first case when the sample size is not large. However, the efficiency of customary GREG can be improved in the second case by doing variable selection first and then using GREG on the selected variables.
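A sketch of a lasso-assisted estimator in this spirit, plugging an off-the-shelf lasso fit into the model-assisted form (1) (illustrative: this uses scikit-learn's Lasso with a fixed tuning parameter alpha rather than the authors' weighted lasso, and the function name is mine):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_greg_estimate(x_sample, y, d, x_pop, alpha=0.1):
    """Lasso-assisted estimator: shrink weak covariates to zero, then use
    the fitted mean h_hat in the model-assisted estimator (1). The choice
    of the tuning parameter alpha drives the variable selection."""
    fit = Lasso(alpha=alpha).fit(x_sample, y, sample_weight=d)
    h_sample = fit.predict(x_sample)        # h_hat_i on the sample
    h_pop_total = fit.predict(x_pop).sum()  # sum of h_hat_i over U
    return np.sum(d * (y - h_sample)) + h_pop_total
```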

3 Combining two independent probability samples

3.1 Double sampling

We now turn to making valid inferences by combining two independent probability samples, possibly drawn from different frames covering the same population. We focus on a simple case where a large sample $A^{(1)}$ is drawn from the first frame and inexpensive auxiliary variables $z$ are observed, and a much smaller sample $A^{(2)}$ is drawn from the second frame and both the study variable $y$ and $z$ are observed. We denote the design weights for the two samples by $d_{1i}, i \in A^{(1)}$ and $d_{2i}, i \in A^{(2)}$ respectively, the corresponding basic design unbiased estimators of the total $Z$ of the common variable $z$ by $\hat{Z}_1$ and $\hat{Z}_2$ respectively, and the estimator of the total of the study variable $y$ from the sample $A^{(2)}$ by $\hat{Y}_2$. The aim is to get more efficient estimators of $Y$ by taking advantage of $\hat{Z}_1$ obtained from the large sample $A^{(1)}$. C. Bose of the Indian Statistical Institute studied regression estimation of the total $Y$ using such double sampling designs 75 years ago (Bose 1943). However, it is only in recent years that those designs have received considerable attention in the context of combining data from two or more independent probability samples. Hidiroglou (2001) described a real application of the double sampling design in Statistics Canada's Survey of Employment, Payrolls and Hours (SEPH). In this application, the large sample $A^{(1)}$ was selected from a Canada Customs and Revenue Agency administrative data file and auxiliary variables $z$, which include the number of employees and the total amount of payroll, were collected. A much smaller sample $A^{(2)}$ was independently selected from the Statistics Canada Business Register and the study variables $y$, number of hours worked by employees and summarized earnings, were collected in addition to the variables $z$.


A GREG-type estimator is of the form $\hat{Y}_{2,gr} = \hat{Y}_2 + (\hat{Z}_1 - \hat{Z}_2)'\beta$, where $\beta$ is estimated either by a weighted least squares estimator or an "optimal" estimator minimizing the variance of $\hat{Y}_{2,gr}$ (Hidiroglou 2001). Note that the variance of $\hat{Y}_{2,gr}$ is asymptotically larger than that of the corresponding GREG with known totals $Z$ because of the extra variability due to estimating $Z$ from the larger sample. We refer the reader to Guandalini and Tillé (2017) for more recent work on double sampling designs involving two independent probability samples and relevant references to past work.
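A minimal sketch of this GREG-type estimator with $\beta$ estimated by weighted least squares in the smaller sample (illustrative; z1 and z2 are n-by-p matrices of the common auxiliaries observed in the two samples):

```python
import numpy as np

def double_sampling_greg(d2, y2, z2, d1, z1):
    """GREG-type double sampling estimator
    Y_2gr = Y2_hat + (Z1_hat - Z2_hat)' beta_hat, with beta_hat the
    survey weighted least squares fit of y on z in the small sample."""
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    y2 = np.asarray(y2, float)
    z1, z2 = np.asarray(z1, float), np.asarray(z2, float)
    Y2_hat = np.sum(d2 * y2)
    Z1_hat, Z2_hat = d1 @ z1, d2 @ z2          # estimated totals of z
    T = (z2 * d2[:, None]).T @ z2              # sum_i d_2i z_i z_i'
    beta = np.linalg.solve(T, (z2 * d2[:, None]).T @ y2)
    return Y2_hat + (Z1_hat - Z2_hat) @ beta
```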

Kim and Rao (2012) studied a related problem of integrating two independent probability samples. Here the primary interest was to create a single synthetic data set of proxy values $\tilde{y}_i$ associated with $z_i$ in the larger sample $A^{(1)}$ to produce a projection estimator of the total $Y$. The proxy values $\tilde{y}_i, i \in A^{(1)}$ are generated by first fitting a working model relating $y$ to $z$ in the smaller sample $A^{(2)}$ dataset $\{(y_i, z_i), i \in A^{(2)}\}$ and then predicting the $y_i$ associated with $z_i, i \in A^{(1)}$. This approach facilitates the use of only the synthetic data and associated design weights $d_{1i}$ reported in survey 1. Kim and Rao (2012) identified the conditions for the projection estimator $\hat{Y}_{1p} = \sum_{i\in A^{(1)}} d_{1i}\tilde{y}_i$ to be asymptotically design consistent. Variance estimation is also considered. Schenker and Raghunathan (2007) reported several applications of the synthetic data approach using a parametric model-based method to estimate the total $Y$. In one application, sample $A^{(2)}$ observed both self-reported health measurements $z_i$ and clinical measurements from physical examination $y_i$, and the much larger survey $A^{(1)}$ had only self-reported measurements $z_i$. Only the imputed, or synthetic, data from sample $A^{(1)}$ and associated survey weights are released to the users of sample $A^{(1)}$, thus minimizing disclosure risk (Reiter 2008).

3.2 Dual frame sampling designs

In this section, we give a brief account of combining data on a study variable from samples drawn from two different frames. There are two scenarios: (1) both frames are incomplete but their union is a complete frame, as often happens with list frames; and (2) frame 1 is complete (such as an area frame) and frame 2 is incomplete (such as a list frame). We assume that we can correctly ascertain to which frames a sample unit belongs. For simplicity, we focus on the second case since it has relevance to combining data from a non-probability sample with data from a probability sample drawn from the complete frame (see Section 6). The dual frame approach is efficient when it is cheaper to sample from frame 2 or when a large proportion of a rare population of interest belongs to frame 2.

Denote the sample drawn from frame 1 as $A$ and the sample from frame 2 as $B$, and the corresponding design weights as $d_{1i}$ and $d_{2i}$, respectively. We can express the population total $Y$ as $Y = \sum_{i\in U}(1-\delta_i)y_i + \sum_{i\in U}\delta_i y_i$, where $\delta_i = 1$ if unit $i$ belongs to frame 2 and $\delta_i = 0$ otherwise.


Using this representation, a class of estimators that combines data from the two samples is given by

$$\hat{Y}_H = \sum_{i\in A} d_{1i}(1-\delta_i)y_i + p\left(\sum_{i\in A}\delta_i d_{1i} y_i\right) + q\left(\sum_{i\in B} d_{2i} y_i\right), \quad p + q = 1 \qquad (3)$$

For a fixed $p$, the dual frame estimator $\hat{Y}_H$ can be written as a weighted sum with weights not depending on the study variable, that is, the same weight is used for all study variables. A simple choice is $p = 0.5$ and the resulting estimator is called a multiplicity estimator. Hartley (1962) obtained an optimal value of $p$ by minimizing the variance of the dual frame estimator $\hat{Y}_H$, but the optimal $p$ will depend on the values of the study variable in the population. A screening dual frame estimator is obtained by letting $p = 0$ in (3). In this case, the data on units from sample $A$ belonging to frame 2 are not collected and this could lead to cost saving if it is cheaper to collect data from the incomplete frame 2. For example, in agricultural surveys, farms belonging to the list frame 2 are removed from the area frame 1 before sampling commences and this could lead to considerable cost saving since the list frame is cheaper to sample and it contains the largest farms (Lohr 2011). An application of the screening dual frame estimator to combining a non-probability sample with a probability sample is given in Section 6.
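A minimal sketch of (3) (illustrative inputs; in_frame2 plays the role of $\delta_i$ for the sample A drawn from the complete frame 1):

```python
import numpy as np

def dual_frame_estimate(d1, y1, in_frame2, d2, y2, p=0.5):
    """Dual frame estimator (3). p = 0.5 gives the multiplicity
    estimator; p = 0 gives the screening estimator, which drops the
    frame 2 units of sample A entirely."""
    d1, y1 = np.asarray(d1, float), np.asarray(y1, float)
    delta = np.asarray(in_frame2, bool)
    frame1_only = np.sum(d1[~delta] * y1[~delta])   # units only in frame 1
    overlap_A = np.sum(d1[delta] * y1[delta])       # frame 2 units seen in A
    overlap_B = np.sum(np.asarray(d2, float) * np.asarray(y2, float))
    return frame1_only + p * overlap_A + (1.0 - p) * overlap_B
```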

Lohr (2011) provides an excellent account of dual frame sampling theory and its extension to more than two frames. The relative merits of several alternative estimators of the total are discussed. She also presents replication-based variance estimators, in particular jackknife variance estimators. The effect of errors in ascertaining to which frame a sample unit belongs is also studied.

4 Inference from non-probability samples

4.1 Present and future of probability sampling

Many important advances in probability sampling theory and methodology have taken place since the landmark contributions of the early period. Issues addressed include efficient sample designs and analysis of survey data taking account of survey design features. Mixed mode data collection methods, such as combining face-to-face and web, have been proposed to control costs and maintain good coverage and response rates (de Leeuw 2005). Also, responsive or adaptive designs (Groves and Heeringa 2006) and paradata collection to increase response rates and adjust for non-response bias and measurement errors have been studied.


Kalton (2019) notes that the new work on responsive designs and paradata collection "have not had great success in counteracting falling response rates and increasing costs". Other significant contributions include randomized response sampling methods to handle sensitive questions (Chaudhuri and Christofides 2013), and adaptive or network sampling when serviceable frames are not available (Thompson 2002).

Kalton (2019) mentions the use of quasi-probability sample designs that are not strictly based on probability sampling. One example is quota sampling, which may be "described as stratified sampling with a more or less nonrandom selection of units within strata" (Cochran 1977). It is often used in marketing research to reduce cost and time. A requirement of quota sampling is that the population counts in quota cells need to be known accurately. Other examples are respondent driven sampling and venue-based sampling to access hard to survey populations (Kalton 2019). Note that the data generated by probability sampling or quasi-probability sampling are "designed data" collected to address pre-specified purposes, unlike "organic data" (Groves 2011) extracted from social media (Facebook, Google and Twitter), unrestricted web surveys and other sources such as commercial transactional data and administrative records.

Steadily falling response rates, increasing costs and response burden associated with traditional sample surveys have become major concerns in highly developed countries, especially for socio-economic surveys. On the other hand, the low cost of obtaining non-probability samples of very large sizes through self-reporting Internet web surveys, social media and other external sources, and the speed with which estimates can be produced from those data, are very attractive to statistical agencies producing official statistics. As a result, those agencies are now taking modernization initiatives to find new ways to integrate data from a variety of sources and to produce "real-time" official statistics. Citro (2014) says "official statistical offices need to move from the probability sample survey paradigm for the past 75 years to a mixed model data source paradigm for the future". Holt (2007) listed five formidable challenges for official statistics: "wider, deeper, better, quicker and cheaper". Citro (2014) added "less burdensome" and "more relevant". It is doubtful that we will even come close to achieving those goals in the near future.

Administrative records are not primarily collected for statistical purposes and there is no direct control by the statistical agency, unlike for survey data. However, in some cases administrative data may be used as a substitute for some items in surveys based on probability samples to lower response burden and reduce cost. For example, in the Canada Income Survey and the Census of Population, basic income questions are skipped by accessing tax records.


Administrative records are more current than census data and can be more effective as auxiliary variables when combined with survey data, especially in small area estimation (Section 8). Kalton (2019) notes that administrative data may provide longitudinal data both for the period before and the period after the survey data collection. Limitations of data from administrative sources include possible incomplete coverage of the target population and confidentiality and privacy issues. Further, in the United States, sharing some administrative records by federal agencies is not allowed. Also, for some records held by states and localities, there is no mandate or incentive to share data with federal statistical agencies. Administrative data should be subject to the same scrutiny as survey data with regard to response errors, all the more so because they are usually compiled by a large team of clerical staff with uncertain training on quality.

Given the current initiatives to modernize official statistics through extensive use of non-probability samples and big data, one might rightly ask the question "Are probability surveys bound to disappear for the production of official statistics?" (Beaumont 2019). To answer this question, we should first examine the effects of sample selection bias in non-probability samples (Section 4.2) and the use of models to reduce selection bias. If selection bias cannot be reduced sufficiently through modeling of non-probability samples, then we may examine the utility of combining non-probability samples with probability samples in an attempt to address sample selection bias (Sections 5 and 6). Also, many major surveys conducted by federal agencies, such as monthly labor force surveys or business surveys, will continue to use probability sampling and efficient estimators based on model-assisted methods using auxiliary information. Kalton (2019) says "it is unlikely that social surveys will be replaced by administrative records, although these data can be valuable addition to surveys" and "quality of estimates from internet surveys is a concern". Couper (2013) notes some limitations of big data: often only a single variable and few covariates are observed, the data are lean on relationships, and demographic information is often lacking for a large proportion of big data. However, big data may be potentially useful in providing predictors for models used for small area estimation (Section 8).

4.2 Effect of selection bias

We first consider the case of a non-probability sample $B$ of size $N_B$ with data $\{(i, y_i), i \in B\}$, assuming no measurement errors. Let $\delta_i = 1$ if the population unit $i$ belongs to $B$ and $\delta_i = 0$ otherwise. The estimator of the population mean $\bar{Y}$ in the absence of supplementary information is given by the mean $\bar{y}_B = N_B^{-1}\sum_{i\in U}\delta_i y_i$ and, following Hartley and Ross (1954), its estimation error $\bar{y}_B - \bar{Y}$ may be expressed as the product of three terms: (1) $\mathrm{corr}(\delta, y) = \rho_{\delta,y}$, called data quality, (2) the square root of $(1-f_B)/f_B$ with $f_B = N_B/N$, called data quantity, and (3) the square root of the population variance $\sigma_y^2$, called problem difficulty (Meng 2018).


The data quality term plays the key role in determining the bias and it is approximately zero on average under simple random sampling (assuming complete response and coverage). Note that we do not have control over the participation mechanism under non-probability sampling, unlike in the case of probability sampling.

We now turn to the model MSE of $\bar{y}_B$, assuming the indicators $\delta_i$ are random with unknown nonzero participation probabilities $q_i$. Then the model MSE is given by

$$\mathrm{MSE}_\delta\left(\bar{y}_B\right) = E_\delta\left(\rho_{\delta,y}^2\right) \times f_B^{-1}(1-f_B) \times \sigma_y^2 \qquad (4)$$

where $E_\delta$ denotes the expectation with respect to the unknown participation mechanism. Meng (2018) named the first term in (4) the data defect index, $DI = E_\delta(\rho_{\delta,y}^2)$. Note that $N_B$ could be very large in the context of big data; for example, $N_B$ is five million and $N$ is 10 million, giving a sampling fraction $f_B = 1/2$. Yet the MSE given by (4) depends only on the sampling fraction $f_B$. As a result, we will only need a relatively small simple random sample of size $n$ to achieve the same MSE. In the above example, suppose that the average correlation $E_\delta(\rho_{\delta,y})$ is as small as 0.05; then the "effective" sample size $n$ of the big data is less than 400. Another important result is that the width of the confidence interval, treating the non-probability sample as a simple random sample, goes to zero as $N_B$ and $N$ increase such that the ratio $f_B$ tends to some positive limiting value; however, the interval has a small chance of covering the true value $\bar{Y}$ because it is centered at a wrong value (Meng 2018). We know this phenomenon under probability sampling when the ratio of bias to standard error is large. For example, Cochran (1977, p. 14) reports a coverage rate of 68% for nominal 95% when the ratio of bias to standard error is 1.50. For a design biased, but design consistent, estimator, such as the ratio estimator, the bias ratio goes to zero as the sample size increases.
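To make the effective sample size arithmetic concrete, the following sketch equates the MSE in (4) with the approximate variance $\sigma_y^2/n$ of a simple random sample mean and solves for $n$ (my illustration using the numbers in the text; the finite population correction, which would shrink $n$ slightly below 400, is ignored):

```python
def effective_srs_size(rho, f_B):
    """Size n of a simple random sample whose mean matches the MSE in (4):
    solve sigma^2 / n = rho^2 * (1 - f_B) / f_B * sigma^2 for n
    (sigma^2 cancels)."""
    return f_B / (rho**2 * (1.0 - f_B))

# Meng's (2018) example: N_B = 5 million out of N = 10 million (f_B = 0.5)
# with average correlation 0.05 is worth roughly an SRS of size 400.
print(effective_srs_size(rho=0.05, f_B=0.5))   # -> 400.0
```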

If the non-probability sample $B$ is subject to measurement errors and we observe $y_i^* = y_i + \varepsilon_i$, where $\varepsilon_i$ is the measurement error, then the expected bias of the estimator $\bar{Y}_B^*$ is unchanged provided $E(\varepsilon_i) = 0$; otherwise the bias will contain an additional term, $B_\varepsilon$, assuming a constant expected bias $E(\varepsilon_i) = B_\varepsilon$. On the other hand, the MSE will contain two additional terms, the contribution from measurement error $\sigma^2/N_B + B_\varepsilon^2$ and an interaction term $2B_\varepsilon[(1-f_B)/f_B]^{1/2}E_\delta(\rho_{\delta,y})$, where $\sigma^2$ is the variance of the measurement error (Biemer 2018). If $B_\varepsilon = 0$, then the interaction term is zero and the variance from the measurement error term is negligible for large $N_B$. As a result, the contribution from sample selection bias dominates the total MSE. It is unlikely that the expected bias will be small for "found data" from online sources, such as Facebook, where people may actively lie.


The above facts need to be taken into consideration when using non-probability samples without some adjustment for selection bias. Reducing bias through weighting is a common practice in the context of bias due to nonresponse. In the latter case, we have some information on the nonrespondents, such as covariates observed on all the sample units, unlike in the case of non-probability sampling. This additional information is used to estimate response probabilities, which are then used to reduce bias through weighting. We discuss methods of estimating participation probabilities in the case of non-probability sampling and using them through weighting in Section 6.

Meng (2018) studied the effect of using the weighted mean $\bar{Y}_{B\omega} = \sum_{i\in U}\omega_i\delta_i y_i / \sum_{i\in U}\omega_i\delta_i$ with arbitrary weights $\omega_i$. In this case, letting $\tilde{\delta}_i = \delta_i\omega_i$, the first term in the estimation error $\bar{Y}_{B\omega} - \bar{Y}$ is changed to $\mathrm{corr}(\tilde{\delta}, y) = \rho_{\tilde{\delta},y}$, the second term is inflated by the factor $\left[1 + CV_\omega^2/(1-f_B)\right]^{1/2} \geq 1$, where $CV_\omega$ is the coefficient of variation (CV) of the $\omega_i$ for $i \in B$, and the third term remains unchanged. Hence, it follows that if the weighting does not lead to a reduction in the first term, the estimation error is in fact larger than when not using weights, and this inflation is significant if the CV of the weights is large, as in the case of nonresponse.

Under probability sampling, the estimator $\bar{Y}_{B\omega}$ is approximately design unbiased, noting that the inclusion probabilities $\pi_i$ are known, $\omega_i = d_i = \pi_i^{-1}$ and $E_\delta(\rho_{\delta,y}) \approx 0$. However, in the case of non-probability sampling, the participation probabilities $P(\delta_i = 1) = q_i$ are unknown and we need to estimate those probabilities through models by combining the non-probability sample $B$ with an independent probability sample $A$ observing some auxiliary variables common to those associated with the non-probability sample. If the study variable is also observed in the probability sample and the units in the probability sample that do not belong to the non-probability sample can be identified, then the problem is similar to the screening dual frame estimator mentioned in Section 3.2 and no model assumptions are needed. If only the non-probability sample $B$ is observed, then population information on the auxiliary variables associated with $B$ is needed to make inference through models relating the study variable to the auxiliary variables.

5 Study variable observed in both samples

In the ideal case, the study variable $y$ is observed in the non-probability sample $B$ as well as in a probability sample $A$ of size $n$ independently selected from the target population. We make the assumption that the units in sample $A$ that do not belong to sample $B$ can be identified.


This assumption is not met in many applications, especially when the demographic information in sample $B$ is of poor quality. Using the indicator variable $\delta_i$, we can express the total $Y$ as $Y_B + Y_C$, where $Y_B = \sum_{i\in U}\delta_i y_i = \sum_{i\in B} y_i$ and $Y_C = \sum_{i\in U}(1-\delta_i)y_i$ are the totals for the units in the sample $B$ and the units not belonging to $B$, respectively. This situation can be thought of as a population composed of two strata: the stratum of the sample $B$ that is completely enumerated and a stratum of the elements not in $B$ from which a probability sample is selected. It immediately follows that a design unbiased direct estimator of the total is given by $Y_B + \hat{Y}_C$, where $\hat{Y}_C = \sum_{i\in A} d_i(1-\delta_i)y_i$ is the design unbiased estimator of $Y_C$ and $d_i$ are the design weights associated with the probability sample. This estimator is a special case of the dual frame screening estimator when all the units in the incomplete frame 2 are observed.

Sometimes, it may be desirable to reduce the big data sample size by taking a large random sample from an extremely large big data sample $B$, as in the case of transaction data. In this case, we need methods of drawing random samples from a very large big data file stored in the fast memory of the computer and whose size $N_B$ may not be known in advance. McLeod and Bellhouse (1983) proposed a convenient algorithm for drawing simple random samples from big data files.
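A standard single-pass reservoir sampling sketch in the same spirit (illustrative; this is the textbook algorithm, not necessarily the exact McLeod-Bellhouse variant):

```python
import random

def reservoir_sample(stream, n, seed=0):
    """One pass over a file of unknown size, returning a simple random
    sample of n records: record t (0-based) replaces a current sample
    element with probability n / (t + 1)."""
    rng = random.Random(seed)
    sample = []
    for t, record in enumerate(stream):
        if t < n:
            sample.append(record)
        else:
            j = rng.randint(0, t)      # uniform over 0..t inclusive
            if j < n:
                sample[j] = record
    return sample

print(reservoir_sample(range(10**6), n=5))
```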

Since the sizes $N_B$ and $N_C = N - N_B$ are known, a more efficient post-stratified estimator is given by $\hat{Y}_P = Y_B + N_C(\hat{Y}_C/\hat{N}_C)$, where $\hat{N}_C = \sum_{i\in A} d_i(1-\delta_i)$. Kim and Tam (2018) showed that with a simple random sample $A$, the post-stratified estimator achieves a large reduction in the design variance compared to the design unbiased estimator $\hat{Y} = \sum_{i\in A} d_i y_i$ based only on the probability sample $A$. In particular, if the sampling fraction $f = n/N$ is small and the population variance $\sigma_y^2$ and the variance of the population units not belonging to $B$, denoted $\sigma_{C,y}^2$, are roughly equal, then $V(\hat{Y}_P)/V(\hat{Y}) \approx 1 - W_B$, where $W_B = N_B/N$ is the proportion of units belonging to the non-probability sample $B$.

One could use calibration estimation by minimizing the chi-square distance, $\sum_{i\in A}(w_i - d_i)^2/d_i$, between the design weights $d_i$ and the calibration weights $w_i$ subject to the calibration constraints $\sum_{i\in A} w_i\delta_i = N_B$, $\sum_{i\in A} w_i(1-\delta_i) = N_C$ and $\sum_{i\in A} w_i(\delta_i y_i) = Y_B$, where $N_B$, $N_C$ and $Y_B$ are known. The resulting calibration estimator $\sum_{i\in A} w_i y_i$ is identical to the post-stratified estimator $\hat{Y}_P$ (Kim and Tam 2018). However, the main advantage of the calibration approach is that it permits the inclusion of other calibration constraints, if available. One important application of the calibration approach is the estimation of the total $Y$ when the study variable observed in the non-probability sample $B$ is subject to measurement error and what we observe is $y_i^*$ instead of $y_i$ for $i \in B$, with no measurement errors in the probability sample $A$. In this case, we minimize the chi-square distance subject to the previous constraints but replacing $y_i$ by $y_i^*$ for $i \in B$. The resulting calibration estimator is again given by $\sum_{i\in A} w_i y_i$.


The case of measurement errors only in the probability sample $A$ is more complex and requires a measurement error model. Kim and Tam (2018) also studied calibration estimation in the case of unit nonresponse in the probability sample $A$, assuming a general response model allowing the probability of response to depend on the study variable.
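A minimal sketch of the post-stratified estimator $\hat{Y}_P$ (illustrative inputs; Y_B is the complete-enumeration total from the big data file and delta marks the probability sample units that belong to $B$):

```python
import numpy as np

def poststratified_estimate(Y_B, d, y, delta, N_B, N):
    """Post-stratified estimator Y_P = Y_B + N_C * (Y_C_hat / N_C_hat),
    treating the big data sample B as a fully enumerated stratum and
    estimating the complement stratum from the probability sample."""
    d, y = np.asarray(d, float), np.asarray(y, float)
    delta = np.asarray(delta, bool)
    N_C = N - N_B                               # complement stratum size
    Y_C_hat = np.sum(d[~delta] * y[~delta])     # HT total for non-B units
    N_C_hat = np.sum(d[~delta])                 # HT count for non-B units
    return Y_B + N_C * Y_C_hat / N_C_hat
```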

Kim and Tam (2018) gave an interesting application of the above setup in official statistics. In this application, the non-probability sample (or big data) is the Australian Agricultural Census, with an 85% response rate, and the probability sample is the Rural Environment and Agricultural Commodities Survey. In this application, the study variable is subject to measurement error in the probability sample while the true value is observed in the non-probability sample.

6 Study variable not observed in the probability sample

We now turn to the case where a non-probability sample $B$ observing the study variable $y$ and a reference probability sample $A$ observing some other study variables have common covariates, denoted by $x$. The available data are denoted by $\{(i, y_i, x_i), i \in B\}$ and $\{(i, x_i), i \in A\}$. The above scenario is similar to the double sampling case of Section 3.1 but the roles of the large sample and the small sample are reversed. Here, the study variable $y$ is cheaper to collect from sample $B$, unlike in the double sampling case where the study variable is observed in a small probability sample.

First, we consider the case where the units in sample $A$ that do not belong to sample $B$ can be identified, as in Section 5. We can then use the data $\{(\delta_i, x_i), i \in A\}$ to fit a model for the participation probabilities or propensity scores $P(\delta_i = 1 \mid x_i) = q(x_i, \theta) = q_i$ in sample $B$, based on the missing at random (MAR) assumption and $q_i > 0$ for all $i \in U$. For example, one could use a logistic regression model for the binary variable $\delta_i$ and obtain estimators $\hat{q}_i = q(x_i, \hat{\theta})$ for $i \in B$, using the $x_i$ observed in the sample $B$. We then calculate a ratio estimator of the total as $\hat{Y}_{r,q} = N\left(\sum_{i\in B}\omega_i y_i\right)/\left(\sum_{i\in B}\omega_i\right)$, where $\omega_i = \hat{q}_i^{-1}$ (Kim and Wang 2019). This estimator will lead to valid inferences provided the model on the participation probabilities is correctly specified. The MAR assumption that the participation probabilities $q_i$ depend only on the observed $x_i$ is a strong assumption and difficult to validate in practice.
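A sketch of this estimator with a logistic propensity model (illustrative; it assumes the matching setup just described, and the design weights of sample A are passed as sample_weight so the fit approximates a population-level model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_ratio_estimate(x_A, delta_A, d_A, x_B, y_B, N):
    """Inverse propensity ratio estimator Y_rq = N * sum(w y) / sum(w)
    with w_i = 1 / q_hat_i: fit delta (membership in B) on x using the
    probability sample A, then weight the non-probability sample B."""
    model = LogisticRegression().fit(x_A, delta_A, sample_weight=d_A)
    q_hat = model.predict_proba(x_B)[:, 1]    # estimated q_i for i in B
    w = 1.0 / q_hat
    return N * np.sum(w * np.asarray(y_B, float)) / np.sum(w)
```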

Suppose a working population model, $E_m(y_i) = m(x_i, \beta) = m_i$ for $i \in U$, is assumed to hold for the sample $B$, where $E_m$ denotes model expectation and the mean function $m_i$ is specified.


Using the data from the sample $B$, we obtain an estimator $\hat{\beta}$ which is consistent for $\beta$ if the model is correctly specified. Then a doubly robust (DR) estimator of the total is given by

$$\hat{Y}_{DR} = \sum_{i\in B}\omega_i\left(y_i - \hat{m}_i\right) + \sum_{i\in A} d_i\hat{m}_i \qquad (5)$$

where $\hat{m}_i = m(x_i, \hat{\beta})$ denote the predicted values (Kim and Wang 2019). The estimator (5) is DR in the sense that it is consistent if either the model for the participation probabilities or the model for the study variable is correctly specified. The assumption that the model holds for the non-probability sample is also a strong assumption. DR estimators have been used in the context of nonresponse in a probability sample (Kim and Haziza 2014).

Chen et al. (2018a) proposed DR estimators of the type (5) that do not require matching the two samples; in fact the set of units in sample $A$ that also belong to sample $B$ may be empty. This scenario is more common in practice. Under this approach, the values $\{(\delta_i, x_i), i \in U\}$ are first used under the assumed participation probability model to obtain a population log likelihood $l(\theta)$, involving the unknown term $\sum_{i\in U}\log\{1 - q(x_i, \theta)\}$, and then this unknown term is replaced by its design unbiased estimator based on the probability sample $A$. This two-step procedure leads to the following pseudo log likelihood:

$$\hat{l}(\theta) = \sum_{i\in B}\mathrm{logit}\{q(x_i, \theta)\} + \sum_{i\in A} d_i\log\{1 - q(x_i, \theta)\} \qquad (6)$$

where $\mathrm{logit}(a) = \log\{a/(1-a)\}$.

For the commonly used logistic regression model, $\mathrm{logit}\{q(x_i, \theta)\} = x_i'\theta$, the score equations corresponding to (6) reduce to

$$\hat{U}(\theta) = \sum_{i\in B} x_i - \sum_{i\in A} d_i\, q(x_i, \theta)\, x_i = 0 \qquad (7)$$

Solving (7), we obtain an estimator $\hat{\theta}$ and the corresponding $\hat{q}_i = q(x_i, \hat{\theta})$. We simply use the resulting $\omega_i$ in (5) to get a DR estimator of the total. Chen et al. (2018a) derived variance estimators for the DR estimator of the total or mean.

For general $q(x_i, \theta)$, a calibration-type estimator of the model parameter $\theta$ may be obtained by solving

$$\sum_{i\in B}\left[q(x_i, \theta)\right]^{-1} x_i = \sum_{i\in A} d_i x_i \qquad (8)$$

If the population total $X$ is available from one or more external sources, such as a census, then there is no need for the probability sample and we simply replace the design unbiased estimator $\hat{X} = \sum_{i\in A} d_i x_i$ in (8) by the known total $X$ (Singh et al. 2017).


Note that the Chen et al. (2018a) estimator of $q_i$ depends on the design weights associated with the probability sample. Chen et al. (2018a) also provided asymptotically valid variance estimators of the DR estimators under the assumed models. Lee (2006) and Lee and Valliant (2009) earlier proposed the above setup, but their method of estimating the participation probabilities by letting $\delta_i = 1$ if $i \in B$ and $\delta_i = 0$ if $i \in A$ leads to biased estimates of the participation probabilities, as observed by Valliant and Dever (2011).

Yang et al. (2019) extended the DR estimators of Chen et al. (2018a) to the case of a large number of candidate covariates $x$. They used a two-step approach for variable selection and estimation of the finite population parameter, using a generalization of the lasso. Some of the commercial web panels and health records may observe many covariates, but several of those covariates may be weakly related to the study variable. In such cases, the lasso-based DR estimators of Yang et al. (2019) might be useful.

A regression type estimator of the total is obtained by estimating theunknown total of the predicted values bmi, using the probability sample(Chen et al. 2018a, b):

bYREG ¼ ∑i∈Adi bmi ð9Þ

Note that (9) could be regarded as a mass imputed estimator with missingvalues yi for i∈A replaced by the corresponding imputed values bmi . Theestimator (9) will be biased if the model for the study variable is incorrectlyspecified. A ratio type version of (9) is obtained by multiplying (9) with thescale factor N=bN, where bN ¼ ∑i∈Adi is the design unbiased estimator of N.Rivers (2007) used a non-parametric mass imputation approach that avoidsthe specification of the mean function Em(yi| xi). For each unit i in theprobability sample A, a nearest neighbor (NN) to the associated xi is foundfrom the donor set {(i, xi), i∈B}, say xl, using Euclidean distance, and theassociated yl is used as the imputed value y*i ¼ ylð Þ, leading to the massimputed estimator

$$\hat{Y}_{RI} = \sum_{i \in A} d_i\, y_i^* \tag{10}$$

The estimator (10) is based on real values obtained from the donor set, unlike (9). It is not design-model unbiased unless $E_m(y_i^* \mid x_i) = E_m(y_i \mid x_i)$. Singh et al. (2017) proposed alternative mass imputed estimators based on NN. Kim et al. (2019) studied mass imputation under a semi-parametric model for the non-probability sample B with $E(y \mid x) = m(x, \beta)$ for a known function $m(\cdot, \cdot)$ and established design-model consistency of the mass imputed estimator. They also derived variance estimators using either linearization or the bootstrap method.
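A minimal sketch of the Rivers (2007) procedure leading to (10), assuming scikit-learn for the nearest-neighbor search (function and variable names are hypothetical):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nn_mass_imputation(x_a, d_a, x_b, y_b):
    """Nearest-neighbor mass imputation as in (10): for each unit i in
    the probability sample A, find the closest donor (Euclidean
    distance on x) in the non-probability sample B and impute its y;
    the design-weighted sum over A then estimates the total Y."""
    nn = NearestNeighbors(n_neighbors=1).fit(x_b)
    _, idx = nn.kneighbors(x_a)        # nearest donor in B for each i in A
    y_star = y_b[idx.ravel()]          # imputed values y_i^*
    return np.sum(d_a * y_star)        # mass imputed estimator (10)
```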



Chen et al. (2018a) conducted a simulation study on the performance of the alternative estimators. They demonstrated that the DR estimator performs well in terms of relative bias and mean squared error (MSE) when either the assumed response model or the model for the study variable is correctly specified. On the other hand, the estimator $\hat{Y}_{r,q}$, based only on the estimated $q_i$, performs poorly when the model for the participation probabilities $q_i$ is incorrectly specified. Similarly, the regression estimator (9) performs poorly when the model for the study variable is incorrectly specified. The authors did not study the case where both models are incorrectly specified. It may be possible to develop multiply robust (MR) estimators, along the lines of Chen and Haziza (2017), by specifying multiple response models and multiple models for the study variable. Under this scenario, an estimator is MR if it performs well when at least one of the candidate models is correctly specified. Simulation results of Chen and Haziza (2017) indicated that MR estimators tend to perform well even when all the candidate models are incorrectly specified.

7 Non-probability sample alone available

We now turn to the scenario where only non-probability sample B data $\{(i, y_i, x_i),\ i \in B\}$ are available. In this case, we need some population information on the auxiliary variables $x$. It is then possible to use a model-dependent or prediction approach, which was advocated for probability samples by Royall (1970) and others. An advantage of the prediction approach for probability samples is that it provides conditional model-based inferences referring to the particular sample of units selected. Such conditional inferences may be more relevant and appealing than the unconditional repeated sampling inferences used in the design-based approach. Inferences do not depend on the sampling design if the assumed population model holds for the sample, that is, if there is no sample selection bias.

In the case of probability sampling, Little (2015) notes that sample selection bias may be reduced by explicitly incorporating design features, such as stratification and sample selection probabilities, into the model to ensure design consistency of the prediction estimators. For example, with disproportionate stratified random sampling, stratum effects are included in the model, leading to a prediction estimator that agrees with the standard design-based estimator. Unfortunately, under non-probability sampling this calibrated modeling approach is not feasible and issues of selection bias are not resolved. Little (2015) suggests incorporating auxiliary population information, for example through poststratification, to reduce the sample selection bias. Bethlehem (2016) demonstrated the effectiveness of post-stratification in reducing the selection bias in



the context of a web panel on voting (a binary variable) when the panel is post-stratified into cells defined by age (young, middle and old) and education (low, high) and the population cell counts are known. Wang et al. (2015) proposed multilevel regression and post-stratification (MRP) to forecast elections with non-representative polls and demonstrated its effectiveness in forecasting U.S. presidential elections, based on a large number of post-stratification cells with known population cell counts. Smith (1983) examined the conditions for ignoring non-random selection mechanisms, with particular attention to post-stratification and quota sampling.
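A minimal sketch of such a post-stratification adjustment with known population cell counts (illustrative only; assuming pandas, with hypothetical names and toy cells):

```python
import pandas as pd

def poststratified_mean(sample, cell_vars, y_var, pop_counts):
    """Post-stratified estimate of a population mean from a
    non-probability sample: each cell mean is weighted by the known
    population cell count N_c.  Assumes every population cell is
    observed in the sample; unobserved cells would leave the
    selection bias within them unadjusted."""
    cell_means = sample.groupby(cell_vars)[y_var].mean()
    n_pop = sum(pop_counts.values())
    return sum(pop_counts[c] * cell_means[c] for c in cell_means.index) / n_pop

# Toy example mirroring the age x education cells described above:
# panel = pd.DataFrame({"age": [...], "edu": [...], "vote": [...]})
# counts = {("young", "low"): 120_000, ("young", "high"): 80_000, ...}
# poststratified_mean(panel, ["age", "edu"], "vote", counts)
```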

Suppose the population model is given by the mean function $E_m(y_i) = h(x_i, \beta)$, $i \in U$, and the model holds for the non-probability sample B. We fit the model to the sample data to obtain an estimator $\hat{\beta}$ and predictors $\hat{h}_i = h(x_i, \hat{\beta})$ for $i \in U$, assuming the population values $x_i$ are known. In this case, the prediction estimator of the total is given by

$$\hat{Y}_{PR} = \sum_{i \in B} y_i + \sum_{i \in \tilde{B}} \hat{h}_i \tag{11}$$

where $\tilde{B}$ denotes the set of population units not belonging to B. In practice, it may not be possible to identify the units belonging to $\tilde{B}$ in the context of non-probability sampling. However, if the model is a linear regression model $h(x_i, \beta) = x_i'\beta$ with constant model variance and the intercept term is included in the model, then the prediction estimator (11) reduces to

$$\hat{Y}_{PR} = X\hat{\beta} \tag{12}$$

where $\hat{\beta}$ is the ordinary least squares estimator of $\beta$. In this special case, the prediction estimator (12) requires only the vector of population totals $X$.
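A minimal sketch of (12) (hypothetical names; NumPy assumed) makes explicit that only the vector of totals $X$ is needed, not the units outside B:

```python
import numpy as np

def prediction_total(x_b, y_b, x_totals):
    """Prediction estimator (12) under a linear model with intercept:
    fit OLS on the non-probability sample B and apply the coefficients
    to the population totals X = (N, total of x_1, ..., total of x_p).
    x_b must contain a leading column of ones, so x_totals starts
    with the population size N."""
    beta_hat, *_ = np.linalg.lstsq(x_b, y_b, rcond=None)  # OLS on B
    return x_totals @ beta_hat                            # X' beta_hat
```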

A major challenge in making inferences based only on a non-probability sample B is how to account for sample selection bias. Chen et al. (2018b) suggested using a large pool of predictors and then applying the lasso (Tibshirani 1996) to perform both variable selection and estimation of the model parameters associated with the selected variables. Their simulation study suggests that the resulting predictors might be able to account for sample selection bias. However, the chosen setups do not reflect strong sample selection bias. Also, it should be noted that the number of demographic variables available for use as predictors is limited in the context of volunteer web surveys and other non-probability samples, and not all variables may be available for all the units in the sample (Couper 2013).



8 Small area estimation

Reliable local area or subpopulation (denoted small area) statistics are needed in formulating policies and programs, allocating government funds, regional planning and making business decisions. Traditional direct estimators of small area totals or means, based on probability sampling and area-specific sample data, do not provide adequate precision because of small sample sizes within small areas. As a result, it becomes necessary to "borrow strength" across areas through linking models based on auxiliary information such as recent census counts and administrative records. Kalton (2019) notes that opposition to using models has been overcome by the growing demand for small area estimation (SAE). Rao and Molina (2015) provide a comprehensive account of model-based methods for SAE. Models used for SAE may be broadly classified into two categories: (a) area level models that relate area level direct estimators to area level covariates, and (b) unit level models that relate the observed values of the study variable to unit-specific auxiliary variables.

For simplicity, assume that simple random samples of sizes $n_i$ $(i = 1, \ldots, m)$ are drawn from $m$ areas, that $n_i > 0$ for all $i$, and that the corresponding direct estimators of the area means $\bar{Y}_i$ are the sample means $\bar{y}_i$. Let the sampling model be denoted by $\bar{y}_i = \bar{Y}_i + e_i$, where $e_i$ is the sampling error with mean zero and known variance $\psi_i$. A model linking the area means $\bar{Y}_i$ is given by $\bar{Y}_i = z_i'\beta + v_i$, where $z_i$ is a vector of area level covariates not subject to sampling or measurement errors and $v_i$ is a random area effect with mean zero and variance $\sigma_v^2$. Combining the sampling model with the linking model leads to the well-known Fay-Herriot model $\bar{y}_i = z_i'\beta + v_i + e_i$ (Fay and Herriot 1979). The "optimal" estimator of the mean $\bar{Y}_i$ derived under this model, assuming for simplicity that the model parameters are known, is a weighted combination of the sample mean $\bar{y}_i$ and a "synthetic" estimator $z_i'\beta$, with weights $\gamma_i = \sigma_v^2/(\sigma_v^2 + \psi_i)$ and $1 - \gamma_i$. It follows that the optimal estimator gives more weight to the synthetic estimator as the sampling variance increases or the sample size $n_i$ decreases. The mean squared prediction error (MSPE) of the optimal estimator is equal to $\gamma_i \psi_i$, which shows a large reduction in MSPE compared to the variance $\psi_i$ of the direct estimator $\bar{y}_i$ when $\gamma_i$ is small, that is, when the sampling variance is large. Hidiroglou et al. (2019) applied the area level model to estimate unemployment rates for cities (small areas) in Canada, using direct estimates from the Canadian Labour Force Survey and the number of employment insurance beneficiaries as the area level covariate. They evaluated the absolute relative error (ARE) by comparing the estimates to the unemployment rates from the 2016 long-form Census, treated as the gold standard. Their



results showed that the ARE of the direct estimates, averaged over all the areas, is reduced from 33.9% to 14.7% through the use of the "optimal" model-based estimates. For the 28 smallest areas, the reduction in ARE is more pronounced: from 70.4% to 17.7%.
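The weighted combination underlying such results is straightforward to compute once the model parameters are given. The sketch below (illustrative only, with known parameters and hypothetical names) returns the "optimal" estimates and the corresponding MSPE $\gamma_i \psi_i$:

```python
import numpy as np

def fay_herriot_composite(y_bar, z, beta, sigma2_v, psi):
    """'Optimal' area mean estimates under the Fay-Herriot model with
    known parameters: gamma_i * y_bar_i + (1 - gamma_i) * z_i' beta,
    where gamma_i = sigma2_v / (sigma2_v + psi_i).  Areas with large
    sampling variance psi_i are shrunk toward the synthetic part."""
    gamma = sigma2_v / (sigma2_v + psi)
    estimate = gamma * y_bar + (1.0 - gamma) * (z @ beta)
    mspe = gamma * psi          # versus psi for the direct estimator
    return estimate, mspe

# For example, with sigma2_v = 2 and psi_i = 8, gamma_i = 0.2: the
# estimate puts 80% of its weight on the synthetic component and the
# MSPE drops from 8 to 1.6.
```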

Ybarra and Lohr (2008) studied the case of area-level covariates subject to sampling error. The covariates are obtained from a much larger survey, similar to the double sampling scenario of Section 3.1. They developed "optimal" estimators under this setup, assuming the sampling variances and covariances of the area level covariates are known.

Use of area level big data as additional predictors in the area level model has the potential of improving model-based estimation. We mention three recent applications that have used big data covariates in an area level model. Marchetti et al. (2015) studied the estimation of poverty rates for local areas in the Tuscany region of Italy. In this application, the big data covariate is a mobility index based on different car journeys between locations, automatically tracked with a GPS device. Direct estimates of area poverty rates were obtained from a probability sample. The big data covariate in this application is based on a non-probability sample, which was treated as a simple random sample, and the Ybarra-Lohr method was used to estimate model-based poverty rates. The second application analyzed relative change in the percent of Spanish-speaking households in the eastern half of the United States (Porter et al. 2014). Here direct estimates for the states (small areas) were obtained from the American Community Survey (ACS) and a big data covariate was extracted from Google Trends of commonly used Spanish words available at the state level. In the third application, Schmid et al. (2017) used mobile phone data as covariates in the basic area level model to estimate literacy rates by gender at the commune level in Senegal. Direct estimates of area literacy rates were obtained from a Demographic and Health Survey based on a probability sample. It is interesting that recent census data or social media data were not available for use as covariates. The authors provide details regarding the construction of the mobile phone covariates. In another application of big data, Muhyi et al. (2019) obtained small area estimates of the electability of a candidate for Central Java Governor in 2018, using predictor variables extracted from Twitter data and direct estimates obtained from a sample survey.

Turning to a basic unit level model, the unit level sample data are given by $\{(y_{ij}, x_{ij}),\ j = 1, \ldots, n_i;\ i = 1, \ldots, m\}$, where $j$ denotes a sample unit belonging to area $i$, and the population area means $\bar{X}_i$ are assumed to be known.



In practice, $x_{ij}$ is either observed along with the study variable $y_{ij}$ or obtained from external sources through linkage. In the latter case, the observed data may be subject to linkage errors if unique unit identifiers are not available (Chambers et al. 2019). Battese et al. (1988) proposed a nested error linear regression model $y_{ij} = x_{ij}'\beta + v_i + e_{ij}$ for the case of $x_{ij}$ observed along with $y_{ij}$, where $e_{ij}$ is the unit error with mean zero and variance $\sigma_e^2$, and $v_i$ is a random area effect with mean zero and variance $\sigma_v^2$. The optimal model-based estimator is again a weighted combination of a direct estimator and a synthetic estimator. Battese et al. (1988) applied the nested error regression model to estimate county crop areas using sample survey data in conjunction with satellite information. Each county was divided into area segments and the area under corn and soybeans, taken as $y_{ij}$, was ascertained for a random sample of segments by interviewing farm operators. Auxiliary variables $x_{ij}$, in the form of the number of pixels classified as corn and soybeans, were obtained for all the area segments, including the sample segments, in each county using LANDSAT satellite readings. This is an application of big data in the form of satellite readings. Chambers et al. (2019) studied the effect of linkage errors when $x_{ij}$ is obtained from another source, such as a population register. They derived model-based estimators of the area means, taking account of linkage errors. Both Battese et al. (1988) and Chambers et al. (2019) assumed the absence of sample selection bias, in the sense that the population model holds for the sample of units within an area.

However, sampling becomes informative if the known selection probability for a sample unit is related to the associated $y_{ij}$ given $x_{ij}$. In this case, the population model may not hold for the sample, and methods that ignore informative sampling can lead to biased estimators of small area means. Pfeffermann and Sverchkov (2007) studied small area estimation under informative probability sampling by modeling the known selection probabilities for the sample as functions of the associated $x_{ij}$ and $y_{ij}$. They developed bias-adjusted estimators under their setup. Verret et al. (2015) proposed augmented models that use a suitable function of the known selection probability as an additional auxiliary variable in the sample model to account for the sample selection bias, and demonstrated that the resulting estimators of small area means can lead to considerable reduction in MSE relative to methods that ignore sample selection bias. However, neither method is applicable to data obtained from a non-probability sample, because the selection probabilities are then unknown.
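For the nested error model, the corresponding weighted combination of a survey regression component and a synthetic component can be sketched as follows (known parameters are assumed purely for illustration; in practice they are estimated, and finite population corrections are ignored here):

```python
import numpy as np

def unit_level_area_mean(y_i, x_i, xbar_pop, beta, sigma2_v, sigma2_e):
    """Area mean predictor under the nested error model
    y_ij = x_ij' beta + v_i + e_ij, with known beta and variance
    components.  gamma_i = sigma2_v / (sigma2_v + sigma2_e / n_i)
    weights a survey regression estimator against a synthetic one.
    y_i: (n_i,) sample y in the area; x_i: (n_i, p) covariates;
    xbar_pop: (p,) known population mean of x in the area."""
    n_i = len(y_i)
    gamma = sigma2_v / (sigma2_v + sigma2_e / n_i)
    survey_reg = y_i.mean() + (xbar_pop - x_i.mean(axis=0)) @ beta
    synthetic = xbar_pop @ beta
    return gamma * survey_reg + (1.0 - gamma) * synthetic
```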

In the case of sample surveys repeated over time, considerable gains in efficiency can be achieved for small area estimation by borrowing strength



across both areas and time, using extensions of the basic cross-sectional area level model. There is an extensive literature on this important topic and we refer the reader to Rao and Molina (2015, Section 8.3). Big data sources can be used to construct covariates for use in such models by combining time series and cross-sectional survey data. A referee noted that "Particularly, the high frequency of big data sources can be used to improve the timeliness of survey samples in nowcasting methods."

9 Concluding remarks

In this paper, I have discussed the effect of selection bias on inference from non-probability samples and model-based methods to reduce sample selection bias. Models that have been used for participation probabilities and for the study variable are based on strong assumptions. Understanding those assumptions and validating them is a big challenge in making reliable inferences from a non-probability sample alone, or from a non-probability sample in combination with a probability sample that collects auxiliary variables common to those observed in the non-probability sample. The prospects of success of model-based methods will increase with the availability of covariates strongly related to the study variables in the non-probability samples. The similarity of the methods for dealing with nonresponse in probability sampling and selection bias in non-probability sampling should be noted. The evidence to date is that nonresponse and calibration adjustments to compensate for nonresponse in probability samples can reduce nonresponse bias but will not eliminate it. Applying such adjustments to non-probability samples is likely to be less successful. The dilemma for analysts of non-probability samples is to assess how large the residual bias is and whether the survey estimators are "fit for purpose".

A major concern with estimates based on non-probability samples or found data is whether they are reliable enough for use as official statistics. Since analyses of trends over time are widespread, another concern is whether the non-probability samples are comparable across time. Research on the quality of the responses obtained from administrative records and from non-probability samples is needed in the same way that it is needed for the responses obtained with probability samples. However, quality measures such as total MSE, developed for the estimates derived from probability samples and extensively used by statistical agencies, may not be entirely relevant for the estimates derived from non-probability



samples. Couper (2013) says "We need other ways to quantify the risks of selection bias or non-coverage in big data or non-probability surveys." The European Statistical System (2015) has published guidelines for reporting on the quality of statistics calculated from non-probability samples and administrative sources, as well as for statistical processes involving multiple data sources. The U.S. Federal Committee on Statistical Methodology (2018) has outlined some steps that might be taken toward more transparent reporting of data quality for integrated data.

Covariates extracted from big data have the potential of providing good additional predictors in linking models used in small area estimation. We can expect to see more applications using big data predictors in small area estimation. In the time series context, big data has the potential of providing estimates for small areas over time that can improve the timeliness of survey samples using nowcasting methods, as noted by a referee.

I have not discussed other practical issues related to big data and non-probability samples, such as privacy, access and transparency, and I refer the reader to the following overview and appraisal papers: Baker et al. (2013), Brick (2011), Citro (2014), Couper (2013), Elliott and Valliant (2017), Groves (2011), Kalton (2019), Keiding and Louis (2016), Lohr and Raghunathan (2017), Mercer et al. (2017), Tam and Kim (2018) and Thompson (2019). The report by the National Academies of Sciences, Engineering, and Medicine (2017) extensively treated the privacy issue, in addition to methodology for integrating data from multiple sources.

It is unlikely that all surveys based on probability sampling, especially large-scale surveys, will be replaced by big data, non-probability samples or administrative data in the near future, because probability samples have a much wider scope, such as collecting multiple study variables to estimate relationships. For some studies, data can only be obtained in person (Kalton 2019). Of course, we should make improvements to probability sampling, such as reducing survey length and respondent burden and making increased use of technology (Couper 2013). Rao and Fuller (2017) provide some future directions. Inevitably, non-probability samples will be more widely used in the future, and we need to continue researching methods for obtaining valid (or at least acceptable) inferences from them, possibly in combination with probability samples as illustrated in this paper. Falling response rates and increasing respondent burden are often given as reasons for using non-probability samples, especially in socio-economic surveys, but those reasons do not necessarily apply to traditional sample surveys not involving people as respondents, such as agricultural and natural resources surveys.



I had many stimulating discussions on inferential issues in survey sampling with the late Professor Jayanta Ghosh while I was a visiting professor at the Indian Statistical Institute during 1968-69. He also contributed to my overview Sankhya paper (Rao 1999) by acting as a discussant and providing valuable insights and comments. In the discussion, he suggested that more complex modeling than simple linear regression might mitigate the failure of model-based methods for large probability samples. In particular, he proposed nonlinear and nonparametric regression, or linear regression models whose dimension or complexity increases with the sample size. It now appears that such methods might indeed be useful for adjusting for the sample selection bias induced by non-probability samples.

Acknowledgement. This research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada. I thank Jean-Francois Beaumont, Paul Biemer, Mike Brick, Wayne Fuller, Jack Gambino, Graham Kalton, Jae Kim, Frauke Kreuter, Sharon Lohr and Jean Opsomer for useful comments and suggestions on my paper. I also thank two referees for constructive comments.

References

Baker, R., Brick, J. M., Bates, N. A., Battaglia, M., Couper, M. P., Dever, J. A., Gile, K. J. and Tourangeau, R. (2013). Report of the AAPOR task force on non-probability sampling. J. Surv. Statist. Methodol., 1, 90-143.
Battese, G. E., Harter, R. M. and Fuller, W. A. (1988). An error component model for prediction of county crop areas using survey and satellite data. J. Am. Stat. Assoc., 83, 28-36.
Beaumont, J.-F. (2019). Are probability surveys bound to disappear for the production of official statistics? Technical Report, Statistics Canada.
Bethlehem, J. (2016). Solving the nonresponse problem with sample matching. Soc. Sci. Comput. Rev., 34, 59-77.
Biemer, P. P. (2018). Quality of official statistics: present and future. Paper presented at the International Methodology Symposium, Statistics Canada, Ottawa.
Bose, C. (1943). Note on the sampling error in the method of double sampling. Sankhya, 6, 329-330.
Brakel van den, J. A. and Bethlehem, J. (2008). Model-assisted estimators for official statistics. Discussion Paper 09002, Statistics Netherlands.
Breidt, F. J. and Opsomer, J. D. (2017). Model-assisted survey estimation with modern prediction techniques. Stat. Sci., 32, 190-205.
Brick, M. J. (2011). The future of survey sampling. Public Opin. Q., 75, 872-888.
Chambers, R. L., Fabrizi, E. and Salvati, N. (2019). Small area estimation with linked data. Technical Report, arXiv:1904.00364v1.
Chaudhuri, A. and Christofides, T. (2013). Indirect Questioning in Sample Surveys. Springer, New York.
Chen, S. and Haziza, D. (2017). Multiply robust imputation procedures for the treatment of item nonresponse in surveys. Biometrika, 104, 439-453.
Chen, Y., Li, P. and Wu, C. (2018a). Doubly robust inference with non-probability survey samples. Technical Report, arXiv:1805.06432v1 [stat.ME].
Chen, J. K. T., Valliant, R. L. and Elliott, M. R. (2018b). Model-assisted calibration of non-probability sample survey data using adaptive lasso. Surv. Methodol., 44, 117-144.
Citro, C. (2014). From multiple modes for surveys to multiple data sources for estimates. Surv. Methodol., 40, 137-161.
Cochran, W. G. (1977). Sampling Techniques, 3rd Edition. Wiley, New York.
Couper, M. P. (2013). Is the sky falling? New technology, changing media, and the future of surveys. Surv. Res. Methods, 7, 145-156.
De Leeuw, E. D. (2005). To mix or not to mix data collection modes in surveys. J. Off. Stat., 21, 233-255.
Deville, J. C. and Sarndal, C. E. (1992). Calibration estimators in survey sampling. J. Am. Stat. Assoc., 87, 376-382.
Elliott, M. R. and Valliant, R. (2017). Inference for nonprobability samples. Stat. Sci., 32, 249-264.
European Statistical System (2015). ESS Handbook for Quality Reports, 2014 Edition. Publications Office of the European Union, Luxembourg. Available at https://ec.europa.eu/eurostat/documents/3859598/6651706/KS-GQ-15-003-EN-N.pdf.
Fay, R. E. and Herriot, R. A. (1979). Estimation of income for small places: An application of James-Stein procedures to census data. J. Am. Stat. Assoc., 74, 269-277.
Federal Committee on Statistical Methodology (2018). Transparent Quality Reporting in the Integration of Multiple Data Sources: A Progress Report, 2017-2018. Federal Committee on Statistical Methodology, Washington, DC. Available at https://nces.ed.gov/FCSM/pdf/Quality_Integrated_Data.pdf.
Fuller, W. A. (1975). Regression analysis for sample survey. Sankhya Ser. C, 31, 117-132.
Groves, R. M. (2011). Three eras of survey research. Public Opin. Q., 75, 861-871 (Special 75th Anniversary Issue).
Groves, R. M. and Heeringa, S. G. (2006). Responsive design for household surveys: Tools for actively controlling survey errors and costs. J. R. Stat. Soc. Ser. A, 169, 439-457.
Guandalini, A. and Tille, Y. (2017). Design-based estimators calibrated on estimated totals from multiple surveys. Int. Stat. Rev., 85, 250-269.
Hall, P. (2003). A short prehistory of the bootstrap. Stat. Sci., 18, 158-167.
Hansen, M. H. and Hurwitz, W. N. (1943). On the theory of sampling from finite populations. Ann. Math. Stat., 14, 333-362.
Hansen, M. H. and Hurwitz, W. N. (1946). The problem of non-response in sample surveys. J. Am. Stat. Assoc., 41, 517-529.
Hansen, M. H., Hurwitz, W. N., Marks, E. S. and Mauldin, W. P. (1951). Response errors in surveys. J. Am. Stat. Assoc., 46, 147-190.
Hansen, M. H., Hurwitz, W. N., Nisselson, H. and Steinberg, J. (1955). The redesign of the census current population survey. J. Am. Stat. Assoc., 50, 701-719.
Hansen, M. H., Madow, W. G. and Tepping, B. J. (1983). An evaluation of model-dependent and probability sampling inferences in sample surveys. J. Am. Stat. Assoc., 78, 776-793.
Hartley, H. O. (1962). Multiple frame surveys. Proceedings of the Social Statistics Section, American Statistical Association, 203-206.
Hartley, H. O. and Ross, A. (1954). Unbiased ratio estimators. Nature, 174, 270-271.
Hidiroglou, M. (2001). Double sampling. Surv. Methodol., 27, 143-154.
Hidiroglou, M., Beaumont, J.-F. and Yung, W. (2019). Development of a small area estimation system at Statistics Canada. Surv. Methodol., 45, 101-126.
Holt, D. T. (2007). The official statistics Olympics challenge: Wider, deeper, quicker, better, cheaper. The American Statistician, 61, 1-8. With commentary by G. Brackstone and J. L. Norwood.
Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. J. Am. Stat. Assoc., 47, 663-685.
Kalton, G. (2019). Developments in survey research over the past 60 years: A personal perspective. Int. Stat. Rev., 87, S10-S30.
Keiding, N. and Louis, T. A. (2016). Perils and potentials of self-selected entry to epidemiological studies and surveys. J. R. Stat. Soc. Ser. A, 179, 319-376.
Kim, J. K. and Haziza, D. (2014). Doubly robust inference with missing data in survey sampling. Stat. Sin., 24, 375-394.
Kim, J. K. and Rao, J. N. K. (2012). Combining data from two independent surveys: a model-assisted approach. Biometrika, 99, 85-100.
Kim, J. K. and Tam, S.-M. (2018). Data integration by combining big data and survey sample data for finite population inference. Submitted for publication.
Kim, J. K. and Wang, Z. (2019). Sampling techniques for big data analysis. Int. Stat. Rev. (in press).
Kim, J. K., Park, S., Chen, Y. and Wu, C. (2019). Combining non-probability and probability survey samples through mass imputation. Technical Report, arXiv:1812.10694v2 [stat.ME].
Lee, S. (2006). Propensity score adjustment as a weighting scheme for voluntary panel web surveys. J. Off. Stat., 22, 329-349.
Lee, S. and Valliant, R. (2009). Estimation for volunteer panel web surveys using propensity score adjustment and calibration adjustment. Sociol. Methods Res., 37, 319-343.
Little, R. J. (2015). Calibrated Bayes, an inferential paradigm for official statistics in the era of big data. Stat. J. IAOS, 31, 555-563.
Lohr, S. L. (2011). Alternative survey sample designs: Sampling with multiple overlapping frames. Surv. Methodol., 37, 197-213.
Lohr, S. L. and Raghunathan, T. E. (2017). Combining survey data with other data sources. Stat. Sci., 32, 293-312.
Mahalanobis, P. C. (1944). On large scale sample surveys. Philos. Trans. R. Soc. B, 231, 329-351.
Mahalanobis, P. C. (1946). Recent experiments in statistical sampling in the Indian Statistical Institute. J. R. Stat. Soc., 109, 325-378.
Marchetti, S., Giusti, C., Pratesi, M., Salvati, N., Giannotti, F., Pedreschi, D., Rinzivillo, S., Pappalardo, L. and Gabrielli, L. (2015). Small area model-based estimators using big data sources. J. Off. Stat., 31, 263-281.
McConville, K. S. and Toth, D. (2018). Automated selection of post-strata using a model-assisted regression tree estimator. Scand. J. Stat. (in press).
McConville, K. S., Breidt, F. J., Lee, T. C. and Moisen, G. G. (2017). Model-assisted survey regression estimation with the lasso. J. Surv. Statist. Methodol., 5, 131-158.
McLeod, A. I. and Bellhouse, D. R. (1983). A convenient algorithm for drawing a simple random sample. Applied Statistics, 32, 182-184.
Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. Ann. Appl. Stat., 12, 685-726.
Mercer, A. W., Kreuter, F. and Stuart, E. A. (2017). Theory and practice in nonprobability surveys. Public Opin. Q., 81, 250-279.
Muhyi, F. A., Sartono, B., Sulvianti, I. D. and Kurnia, A. (2019). Twitter utilization in application of small area estimation to estimate electability of candidate Central Java governor. IOP Conf. Ser. Earth Environ. Sci., 299, 012033, 1-10.
Narain, R. D. (1951). On sampling without replacement with varying probabilities. J. Indian Soc. Agric. Stat., 3, 169-174.
National Academies of Sciences, Engineering, and Medicine (2017). Federal Statistics, Multiple Data Sources, and Privacy Protection: Next Steps. The National Academies Press, Washington, DC. https://doi.org/10.17226/24893.
Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. J. R. Stat. Soc., 97, 558-606.
Pfeffermann, D. and Sverchkov, M. (2007). Small-area estimation under informative probability sampling of areas and within the selected areas. J. Am. Stat. Assoc., 102, 1427-1439.
Porter, A. T., Holan, S. H., Wikle, C. K. and Cressie, N. (2014). Spatial Fay-Herriot models for small area estimation with functional covariates. Spat. Stat., 10, 27-42.
Rao, J. N. K. (1999). Some current trends in sample survey theory and methods (with discussion). Sankhya Ser. B, 61, 1-57.
Rao, J. N. K. and Fuller, W. A. (2017). Sample survey theory and methods: Past, present and future directions (with discussion). Surv. Methodol., 43, 145-181.
Rao, J. N. K. and Molina, I. (2015). Small Area Estimation. Wiley, Hoboken.
Reiter, J. (2008). Multiple imputation when records used for imputation are not used or disseminated for analysis. Biometrika, 95, 933-946.
Rivers, D. (2007). Sampling for web surveys. In 2007 JSM Proceedings, ASA Section on Survey Research Methods, American Statistical Association.
Royall, R. M. (1970). On finite population sampling theory under certain linear regression models. Biometrika, 57, 377-387.
Sarndal, C.-E. (2007). The calibration approach in survey theory and practice. Surv. Methodol., 33, 99-119.
Schenker, N. and Raghunathan, T. (2007). Combining information from multiple surveys to enhance estimation of measures of health. Stat. Med., 26, 1802-1811.
Schmid, T., Bruckschen, F., Salvati, N. and Zbiranski, T. (2017). Constructing sociodemographic indicators for national statistical institutes by using mobile phone data: estimating literacy rates in Senegal. J. R. Stat. Soc. Ser. A, 180, 1163-1190.
Singer, E. (2016). Reflections on surveys past and future. J. Surv. Statist. Methodol., 4, 463-475.
Singh, A. C., Beresovsky, V. and Ye, C. (2017). Estimation from purposive samples with the aid of probability supplements but without data on the study variable. In 2017 JSM Proceedings, ASA Section on Survey Research Methods, American Statistical Association.
Smith, T. M. F. (1983). On the validity of inferences from non-random samples. J. R. Stat. Soc. Ser. A, 146, 393-403.
Ta, T., Shao, J., Li, Q. and Wang, L. (2019). Generalized regression estimators with high-dimensional covariates. Stat. Sin. (in press).
Tam, S.-M. and Kim, J. K. (2018). Big data, selection bias and ethics – an official statistician's perspective. Stat. J. IAOS, 34, 577-588.
Thompson, S. K. (2002). Sampling. Wiley, New York.
Thompson, M. E. (2019). Combining data from new and traditional sources in population surveys. Int. Stat. Rev., 87, S79-S89.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B, 58, 267-288.
Tourangeau, R., Brick, M. J., Lohr, S. and Li, J. (2017). Adaptive and responsive survey designs: a review and assessment. J. R. Stat. Soc. Ser. A, 180, 203-223.
Valliant, R. and Dever, J. A. (2011). Estimating propensity adjustments for volunteer web surveys. Sociol. Methods Res., 40, 105-137.
Verret, F., Rao, J. N. K. and Hidiroglou, M. H. (2015). Model-based small area estimation under informative sampling. Surv. Methodol., 41, 333-347.
Wang, W., Rothschild, D., Goel, S. and Gelman, A. (2015). Forecasting elections with non-representative polls. Int. J. Forecast., 31, 980-991.
Williams, D. and Brick, M. J. (2018). Trends in U.S. face-to-face household survey nonresponse and level of effort. J. Surv. Statist. Methodol., 6, 186-211.
Woodruff, R. S. (1952). Confidence intervals for medians and other position measures. J. Am. Stat. Assoc., 47, 635-646.
Wu, C. and Sitter, R. R. (2001). A model-calibration approach to using complete auxiliary information from survey data. J. Am. Stat. Assoc., 96, 185-193.
Yang, S., Kim, J. K. and Song, R. (2019). Doubly robust inference when combining probability and non-probability samples with high-dimensional data. Technical Report, arXiv:1903.05212v1 [stat.ME].
Ybarra, L. M. R. and Lohr, S. L. (2008). Small area estimation when auxiliary information is measured with error. Biometrika, 95, 919-931.

Publisher's Note. Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

J. N. K. Rao, Carleton University, Ottawa, Canada
