
Using Prediction-Oriented Software for Survey Estimation

James R. Knaub, Jr.

US Dept. of Energy, Energy Information Administration, EI-53.1

ABSTRACT: Survey sampling and inference may be accomplished by solely design-based procedures, or solely model-based procedures, or by model-assisted, design-based procedures. Depending upon circumstances, there are advantages to each of these methods. There are times, particularly in (highly skewed) establishment surveys, when, either in terms of resources, and/or nonsampling error, it may not be practical to sample from among the ‘smallest’ members of the population, and solely model-based procedures may then be advantageous. Further, imputation for either a sample or a census is accomplished in the same manner. That is, whether data are ‘missing’ because of nonresponse, or because a sample is being used, a model can be used to predict values for the ‘missing’ data. It is important, therefore, to apply models to relatively homogeneous sets of data (i.e., such that the model parameters apply reasonably well). When predictions are made available for all ‘missing’ data, an estimate is available for the total. This article shows a general approach that may be used to organize such estimations for totals and their estimated variances in a flexible manner. Readily available regression software may be used and results may be easily reorganized to present various aggregations. This article explores what can be done to decompose variances, and simplify the applications. Note that the concept of model variance (Royall (1970)) as a measure of uncertainty applies equally well to the uncertainty in a reported total after imputation has been applied to a census, as it does to a sample.

The number and distribution of ‘missing’ data points can make a difference in the optimal assignment of the regression weight. This is influenced by the type of survey (establishment or household), and whether the goal is imputation or sampling. (Note that the difference between imputation and model-based sampling is only a matter of the number and distribution of ‘missing’ data points.) One must sometimes also consider whether it is useful to study a large ‘area,’ or better to consider strata.

This article contains an example of inference from model-based cutoff sampling, and an example of imputation for randomly missing data from a census. Thus, it combines the material to be found in a paper for the 1999 Joint Statistical Meetings (JSM), Proceedings of the Section on Survey Research Methods, “Using Prediction-Oriented Software for Model-Based and Small Area Estimation,” and another paper to be given at the International Conference on Survey Nonresponse (ICSN) in October 1999, in which the focus will be on imputation, “Using Prediction-Oriented Software for Estimation in the Presence of Nonresponse.” The papers for these two conferences (Knaub (1999a) and Knaub (1999b), respectively) greatly overlap, but this article provides a comprehensive view.

KEYWORDS: survey sampling; estimation; imputation; variance estimation

BACKGROUND: Cutoff model-based sampling has proven to be useful at the Energy Information Administration (EIA) for establishment surveys because of practical problems that arise when trying to sample from among the smaller members of a population. These problems involve both timeliness and nonsampling error. An obvious example of this occurred when it was observed that a small electric utility only read its meters once every three months. A census of electricity sales is performed annually, but a sample is done monthly. To ask that utility to participate in a monthly sample would not be very practical.

Since early discussions by Brewer (1963) and Royall (1970), and even comments by Cochran (1953), mentioned in Knaub (1995), most model-based sampling has probably only made use of simple linear regression, with a fixed zero intercept. (We will use a common misnomer and say “no intercept.”) At the opposite extreme, econometric applications of regression for prediction have often used more intricate models, although perhaps too often they are overspecified. Carroll and Ruppert (1988) wrote an excellent monograph, Transformation and Weighting in Regression, with great application to model-based inference as well as many other applications. Still, because model-based sampling and inference have been slow to gain acceptance among survey statisticians, we do not seem to have advanced a great deal in this area in the last 30 or more years. Royall and Cumberland (1981) studied improvement in variance estimation for the simple linear regression model, but, from Knaub (1992), page 879, Figure 1, it can be seen that in practice, improvement, if any, may be negligible. Royall has also considered the incorporation of randomization in the sampling procedure, and seems to have moved away from cutoff sampling. Brewer (1995) showed how model-based sampling and inference, and design-based sampling and inference may complement each other. Further, model-assisted, design-based sampling (see Sarndal, Swensson and Wretman (1992), Chaudhuri and Stenger (1992), et al.) has become fairly popular. Other works, such as Sweet and Sigman (1995), and Steel and Fay (1995), have also advanced the use of models in a supporting role. However, for purposes of imputation and/or cutoff sampling, model-based applications have perhaps stalled somewhat. Cutoff sampling can be very useful for highly skewed establishment surveys, not only because of practical data collection problems, but because of the efficient use it makes of resources. This should not be ignored.

Although the simple linear regression model is very useful for inference from model-based survey sampling, there are times when a multiple linear regression model may be more useful. In Knaub (1996), we see such an example. (This is expanded upon in Knaub (1997).) In such cases, one may use test data to see which of the models with one or two (or more) regressors performs best, but there can be other considerations also. In Knaub (1996), the variate of interest was electricity sales for resale among a certain class of generators. The best regressor was the same variate from a previous census. However, it was found that a lot of those generators were sporadic in their sales for resale, sometimes using all electricity for their own purposes, or perhaps not producing electricity for extended periods. Therefore, it was not uncommon to have non-zero current sales for resale, but a zero value for the corresponding regressor. However, when a second regressor, nameplate capacity, was introduced, the situation was much improved. Every generating plant must have a positive nameplate capacity value. If this procedure were not used, then all cases where a zero value for sales for resale was recorded in the previous census would need to be handled separately, perhaps as a separate stratum within the sample. A census or a design-based sample could be performed within that stratum.

NEW METHODOLOGY: For each of several models (with or without an intercept, and with one or two regressors), the author wrote test programming at the Energy Information Administration (EIA) to estimate totals and relative standard errors of the estimated totals, for multiple applications. It became obvious that it would be advantageous to utilize existing vendor-generated programming if the future held possibilities of using more regressors, or at least the need to test and consider alternatives. The SAS system is available to EIA employees, and so the use of SAS PROC REG was explored. This, however, is not to be considered an official endorsement of SAS products. Any statistical software package that will provide predicted values, a standard error or variance of the prediction error, and the mean square error (MSE) from the analysis of variance, will suffice. Thus, this paper relates to any prediction-oriented software. Such software will provide predictions, using a specified model, for every specified member of a set of potential respondents from whom data were not collected. If we add those predictions and the collected values, then we obtain exactly the same estimated total that we would have obtained using the more traditional model-based approach to inference, if the set of data used is from the population or the portion of a population that is of interest. The estimation of variance will differ, but this was studied and practical conclusions and a detailed example will follow. For now, however, note the implications of this new organization of the estimation procedure. We need not limit ourselves to sampling within a category whose total we wish to estimate! When the time comes to present an estimated total for a given category, we need only to have a collected or a predicted value for every member of the population falling within that category. We then simply add the appropriate collected and predicted numbers to obtain the estimated total. Each member will have a value for the variate of interest. If the value was not collected, then ideally it will have been estimated from the optimal regression model using the optimal corresponding sample. The data in that sample need not have been limited to the category for which we want an estimated total. The “optimal sample” would be one in which there is a compromise between sample size and heterogeneity. That is, using a larger sample by including a broader group of respondents only helps if these respondents do behave similarly under the model. (That is, the model and the parameter values chosen must relate reasonably well to all data to which the model is applied.)
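As a concrete illustration of this organization, the addition step can be sketched in a few SAS statements. This is only an outline under assumed names, not code from the surveys discussed later: ALLCASES, Y, YPRED, PG, and YBEST are hypothetical, with Y the collected value (missing where nothing was collected), YPRED a model prediction available for every record, and PG the category for which a total is wanted.

DATA ALLCASES2;
  SET ALLCASES;
  YBEST = Y;                          /* use the collected value when it exists    */
  IF Y = . THEN YBEST = YPRED;        /* otherwise use the predicted value         */
PROC SUMMARY DATA=ALLCASES2 NWAY;
  CLASS PG;
  VAR YBEST;
  OUTPUT OUT=TOTALS SUM=T_STAR;       /* T_STAR is the estimated total for each PG */
RUN;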

Let us consider a generic example. Suppose that the volume of electricity generated by coal-fired plants was to be estimated by geographic region for a recent period. Suppose that some data are collected for that period, and that regressor data on the entire population are available. If practical considerations limit us to a relatively small sample from among these establishments whose data would be highly skewed, we could do a cutoff model sample. Census division regions are groups of States that were first proposed by the Bureau of the Census, and the EIA often reports on totals by Census division region. There are also geographic regions known as North American Electric Reliability Council (NERC) regions that consist of areas of land usually covering several States each, but not always following State borders. That is, a piece of one State may be grouped with other States. Suppose that it appeared that the data would be more homogeneous by NERC region than by Census division region, and that we had enough resources to sample by NERC region. We would then collect our sample, and make predictions, within NERC regions, for each number not collected (because it was not in the sample, or because there was a nonresponse from among the sampled establishments). Then, we would add the collected and predicted numbers within a given Census division region to estimate a total there. This is straightforward, but the variance still needs to be addressed. This paper will describe a way of obtaining a variance estimate that is nearly as straightforward, and accurate enough for practical purposes. Detailed examples will be given.

One more thing should be noted at this point. If, for the electricity generated by coal-fired plants in the example just sketched, we wanted an estimated total for a specific State, we would definitely be able to supply such an estimate, even if no data were collected from establishments within that State. This means that every number added to form the estimated total could be a predicted number. However, in such a case, the estimated relative standard error of the estimated total might often be so large that we would want to withhold publishing such an estimated total. If, however, we had reason to believe that the data were all predicted by using relatively homogeneous groupings, and if the estimated relative standard error was not too large for practical purposes, given the uses likely to be made of the data, then publication of this ‘small area’ estimate might be possible.

ESTIMATION OF VARIANCE:

Here we will use $V_L^*(T^* - T)$ to represent the variance of an estimated total (or the variance of the error in estimating the total). This is a multivariate extension of $V_L$ found in Royall and Cumberland (1981).

Let

$$V_L^*(T^* - T) = \sum_r \frac{\sigma_e^{*2}}{w_i} + \Big(\sum_r 1\Big)^2 V^*(b_0) + \Big(\sum_r x_{1i}\Big)^2 V^*(b_1) + \ldots$$

so

$$V_L^*(T^* - T) = \sum_r \frac{\sigma_e^{*2}}{w_i} + (N - n)^2\, V^*(b_0) + \Big(\sum_r x_{1i}\Big)^2 V^*(b_1) + \Big(\sum_r x_{2i}\Big)^2 V^*(b_2) + \Big(\sum_r x_{3i}\Big)^2 V^*(b_3) + \ldots$$
$$\;\; + 2 (N - n)\Big(\sum_r x_{1i}\Big)\,\mathrm{COV}^*(b_0, b_1) + 2 \Big(\sum_r x_{1i}\Big)\Big(\sum_r x_{2i}\Big)\,\mathrm{COV}^*(b_1, b_2) + \ldots$$

where $\sum_r$ means to sum over the cases not in the sample (Royall (1970)). $\sigma_e^{*2}$ (Knaub (1996)) is the estimated variance of the random factor of the residual, $e_0$ (see Knaub (1993, 1995)), where the error term is $e_i = w_i^{-1/2} e_{0i}$. Also, $w_i$ is the regression weight; (N - n) is the number of members of the population that are not in the sample; the b's are regression coefficients; and the x's are regressors. $V^*$ and $\mathrm{COV}^*$ are estimates of variance and covariance.

For the case where the data element of interest is collected for all members of a population, but we wish to ‘predict’ a value for a new case, the variance of the prediction error is represented by $V_L^*(y_i^* - y_i)$. (See Maddala (1992).) This may usually be considered to be a way to predict “future observations” (Maddala (1977), page 464), but it could also be used for a single ‘missing’ observation.

[Author’s note: This method is both a small area method and an imputation procedure that would work well with any design-based sample, model-based sample, or census survey, as stated later in this article. It also simplifies data processing, working well in conjunction with graphical editing. Nonsampling error can be explored as discussed later in this article, as well as the error due to imputation (sampling error), as these errors impact on totals, as shown below. Nonsampling error could also be judged through a sensitivity analysis by using this method in conjunction with scatterplots for graphical editing. One could look for aberrant data on scatterplots with regressor data, or a function of regressor data, on the x-axis and the data of current interest on the y-axis, as well as on other scatterplots. Thus this gives rise to a unified approach to survey processing. This is proving to be useful in applications at the Energy Information Administration, where simplification is greatly needed to deal with changing and substantial needs that greatly tax the available workforce. Data manager feedback on the use of scatterplots has been informative and has shown it to be well received. That this relates well to the generalized estimation method of this article has some appeal.]

Let

$$V_L^*(y_i^* - y_i) = \frac{\sigma_e^{*2}}{w_i} + V^*(b_0) + x_{1i}^2\, V^*(b_1) + \ldots$$

then

$$\sum_r V_L^*(y_i^* - y_i) = \sum_r \frac{\sigma_e^{*2}}{w_i} + \Big(\sum_r 1\Big) V^*(b_0) + \Big(\sum_r x_{1i}^2\Big) V^*(b_1) + \ldots$$

so

$$\sum_r V_L^*(y_i^* - y_i) = \sum_r \frac{\sigma_e^{*2}}{w_i} + (N - n)\, V^*(b_0) + \Big(\sum_r x_{1i}^2\Big) V^*(b_1) + \Big(\sum_r x_{2i}^2\Big) V^*(b_2) + \Big(\sum_r x_{3i}^2\Big) V^*(b_3) + \ldots$$
$$\;\; + 2 \Big(\sum_r x_{1i}\Big)\,\mathrm{COV}^*(b_0, b_1) + 2 \Big(\sum_r x_{1i} x_{2i}\Big)\,\mathrm{COV}^*(b_1, b_2) + \ldots$$

Now consider regressors, $x_j$, where for each regressor the values are fairly constant. In the extreme, if $x_{ji} = c_j$ is a constant for all $i$ (but may be different for each value of $j$), then we have, summing over $i$ for a given $j$,

$$\Big(\sum_r x_{ji}\Big)^2 = \Big(c_j \sum_r 1\Big)^2 = c_j^2 (N - n)^2 \quad\text{and}\quad \sum_r x_{ji}^2 = c_j^2 (N - n),$$

so in the extreme, $\big(\sum_r x_{ji}\big)^2 \Big/ \sum_r x_{ji}^2$ approaches N - n.

Similarly, $\big(\sum_r x_{ki}\big)\big(\sum_r x_{li}\big) = \big[c_k (N - n)\big]\big[c_l (N - n)\big] = c_k c_l (N - n)^2$, and $\sum_r x_{ki} x_{li} = \sum_r c_k c_l = c_k c_l (N - n)$, in the extreme case.

Therefore, in general, we may approximate as follows:

$$V_L^*(T^* - T) \approx \sum_r \frac{\sigma_e^{*2}}{w_i} + \delta\,(N - n)\left[\sum_r V_L^*(y_i^* - y_i) - \sum_r \frac{\sigma_e^{*2}}{w_i}\right],$$

where $0 < \delta < 1$.
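One way to see where a δ of this form can arise is to work the special case of a single regressor with no intercept; this derivation is a sketch added here for motivation, not part of the original development. In that case the exact and summed expressions above reduce to

$$V_L^*(T^* - T) = \sum_r \frac{\sigma_e^{*2}}{w_i} + \Big(\sum_r x_i\Big)^2 V^*(b_1), \qquad \sum_r V_L^*(y_i^* - y_i) = \sum_r \frac{\sigma_e^{*2}}{w_i} + \Big(\sum_r x_i^2\Big) V^*(b_1),$$

so the approximation reproduces the exact value when

$$\delta = \frac{\big(\sum_r x_i\big)^2}{(N - n)\sum_r x_i^2},$$

which is the quantity evaluated in the artificial examples below.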


Although $V_L^*(y_i^* - y_i)$ and $\sigma_e^{*2}/w_i$ are usually nearly equal in many practical applications (the difference being negligible and not considered in Knaub (1998), where the coefficients were dealt with as if they were constants), there is a cumulative impact here that is not negligible when (N - n) becomes somewhat larger than 1. Further, the nature of δ was explored, and a few comments will now be made regarding this: When we approach the extreme case where the $x_{ji}$ are constants for all $i$, then δ approaches 1. However, such a case would not be found in practice, because there would be no information provided by such regressors. Anything approaching this situation would mean that $V_L^*(T^* - T)$ would be very large. Using both real and artificial test data, it appears that the distribution of δ peaks at about 0.3, or perhaps 0.4, for practical situations involving cutoff model sampling of highly skewed establishment data. In such cases, one could use 0.4 to help avoid understating variance, unless data are considered by strata, in which case δ = 0.3 might be used within each stratum, and then the variances corresponding to each of the strata may be added. Stratification should reduce the variances of the parameters and make δ less important to the estimate of variance of the estimated total. Using δ = 0.4 in the unstratified case is a precaution against underestimation of the uncertainty, but cannot adjust for very heterogeneous data that should have been stratified, as will be shown in an example.

Note that as (N - n) approaches 1, δ once again approaches 1. However, typically δ will quickly decrease as (N - n) becomes a little larger.

Consider the following artificial examples:

First, if we had only one regressor, and 0.9(N - n) of the ‘missing’ values, i.e., 90 percent, had exactly the same regressor value, and the other 10 percent had one other value, exactly 10 times larger, then δ would be approximately 0.331.

If, however, every value of a single regressor were distributed incrementally (i.e., a, 2a, 3a, ... , (N - n)a), then

$$\delta = \frac{\big(\sum_r i\big)^2}{(N - n)\sum_r i^2},$$

which simulation shows quickly approaches 0.75 as (N - n) becomes larger. (At (N - n) = 10, δ < 0.79.)
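Both figures can be checked with a few lines of arithmetic. The code below is only a hypothetical check (the values N - n = 100 for the first example and N - n = 10 for the second, and a = 1, are assumptions made for the check; only the proportions matter in the first case):

DATA DELTA_CHECK;
  /* Example A: 90 percent of the missing cases share one regressor value,      */
  /* and 10 percent have a value ten times larger (a = 1, N - n = 100 assumed). */
  M = 100;
  SUMX  = 0.9*M*1 + 0.1*M*10;
  SUMX2 = 0.9*M*1 + 0.1*M*100;
  DELTA_A = SUMX**2 / (M*SUMX2);     /* approximately 0.331                     */
  /* Example B: regressor values a, 2a, ..., (N - n)a, with N - n = 10.          */
  M = 10;
  SUMI  = M*(M + 1)/2;
  SUMI2 = M*(M + 1)*(2*M + 1)/6;
  DELTA_B = SUMI**2 / (M*SUMI2);     /* about 0.786, tending toward 0.75         */
  PUT DELTA_A= DELTA_B=;
RUN;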


Thus, the example with the more skewed regressor values had the smaller value for δ. Further, skewed establishment survey data were used in examples that indicated δ = 0.3 may be reasonable for cutoff model sampling.

In the case of imputation for these skewed data, assuming a more random selection of missing data, a smaller value for δ would make sense, because the missing values would tend to be more highly skewed than in a cutoff sample. Perhaps δ = 0.2 would be appropriate. (See Example 2, below.) For sampling or for imputation from a household survey, the ‘missing numbers’ may be even less skewed than the ‘missing data’ using a cutoff model for an establishment survey. Thus δ = 0.4 or greater may be more appropriate for household surveys.

ADVANTAGES OF NEW METHODOLOGY: One advantage is that data can be estimated using the most efficient groupings available. If that data grouping is just the category (State, Census division, whatever) for which we are trying to estimate a total, then we will obtain the same estimated totals as when using Brewer (1963), Royall (1970), Knaub (1996), and others. (Standard errors will be slightly different due to the approximation above.) However, this method allows the use of any data set one chooses to designate for purposes of predicting each ‘missing’ number. (A number is ‘missing’ if it was not collected/observed. This could be a number for an entity not in a sample, or for a nonrespondent in a sample or a census.) In predicting each missing number, the more relevant data that can be used, the better. That is, the larger and more homogeneous the data used in each case, the better the predictions, and the better the overall estimation of the total. Thus, this is a more powerful method than if we were to only consider ‘borrowing strength,’ a common term in small area estimation, where one may use data from a 'neighboring' area when the data in the area for which one wishes to report are too sparse.

Another advantage, the one for which this method was created, is that the model can be quickly altered when necessary. That is, regressors may be added or deleted, as well as the intercept term, and the regression weight may be more easily altered.

In practice, this means first running a program on the largest sets of data that are reasonably homogeneous. The optimal data sets for this purpose will vary with data availability. After a file is built with the resulting predicted numbers, and $V_L^*(y_i^* - y_i)$, the variance of the prediction error, and $\sigma_e^{*2}/w_i$, the MSE divided by the weight, are stored in each case, and the collected numbers are stored in the same file, a second program can regroup the data according to the categories for which we wish to estimate subtotals. Each record of the file will therefore contain either an observed or a predicted number, two variance-related numbers (each set to zero if there is an observed number), and indicators to identify the groupings used to estimate the predicted numbers, and to identify possible categories for which we may wish to estimate subtotals. This yields a highly organized and flexible file that can satisfy the customer who wants to see what we added together to obtain a given subtotal, and can be archived and later examined without ambiguity. (EIA customers have asked for this kind of information.) Further, later regrouping of data will be easy. Recently, for example, the boundary between two of the North American Electric Reliability Council (NERC) Regions changed substantially due to several companies changing their affiliations. Accounting for such a change would be easy when using this new methodology. (Note that figures (maps) are given at the end of this article which show the change in NERC regions.)

Picture a typical data file as follows, where “EG” is a category for purposes of performing predictions (an “estimation group”), and “PG” is a category for purposes of publishing subtotals (a “publication group”). Each line represents a record for a given member of the population. A $y_i$ value is any observed (or “collected”) value, and $y_i^*$ is a predicted value. Let $S_{1i}^2 = V_L^*(y_i^* - y_i)$, the variance of the prediction error, and $S_{2i}^2 = \sigma_e^{*2}/w_i$, the mean square error divided by the regression weight, for each case, $i$.

Example of a partial file:

  y_i or y_i*   S_1i   S_2i   EG   PG(a)   PG(b)   PG(c)
       6725        0      0    1       1       2       5
       4359        0      0    1       2       1       3
       1289        0      0    2       1       4       4
        497       20     17    1       1       3       2
        317       13     11    1       2       2       2
        278       10      9    1       1       3       2
        223        9      8    2       1       3       2
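Given a file of this form, the estimated subtotal and an approximate standard error for one of the publication groups can be computed in a few steps. The sketch below is only an illustrative outline, not the author's production code: the data set name FILE1 and the variable names (VALUE, S1, S2, EG, PGA) are hypothetical, δ = 0.3 is taken as a per-stratum constant as discussed above, and (N - n) is taken here to be the number of predicted records contributing to each estimation group within the publication group.

DATA WORK1;
  SET FILE1;                          /* one record per population member             */
  S1SQ = S1**2;                       /* S1 squared: variance of the prediction error */
  S2SQ = S2**2;                       /* S2 squared: MSE divided by the weight        */
  MISSFLAG = (S1 > 0);                /* 1 if the record holds a predicted value      */
PROC SUMMARY DATA=WORK1 NWAY;
  CLASS PGA EG;                       /* one stratum per estimation group within PGA  */
  VAR VALUE S1SQ S2SQ MISSFLAG;
  OUTPUT OUT=BYEG SUM=TOT SUMS1 SUMS2 NMISS;
DATA BYEG2;
  SET BYEG;                           /* approximate variance for this stratum        */
  VEG = SUMS2 + 0.3*NMISS*(SUMS1 - SUMS2);      /* delta = 0.3 assumed                */
PROC SUMMARY DATA=BYEG2 NWAY;
  CLASS PGA;
  VAR TOT VEG;
  OUTPUT OUT=BYPG SUM=T_STAR V_STAR;  /* estimated subtotal and its variance by PGA   */
DATA BYPG;
  SET BYPG;
  SE_STAR = SQRT(V_STAR);             /* estimated standard error of T_STAR           */
RUN;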

Next, suppose that we have some information on nonsampling error. Although nonsampling error is difficult to measure, tables of revision ‘errors’ (i.e., changes made) are sometimes maintained. The relative percent change between preliminary and final submissions from respondents may give some indication of the severity of nonsampling error. The S1 values are the standard errors of the prediction errors. To confuse the issue, they would be impacted by nonsampling error. In spite of the lack of information and the complicated nature of the true relationships between errors, it may be instructive to perform a data quality study occasionally, that would supplement the S1 and S2 values above so that applying the variance formula would no longer estimate only model variance, but instead would, to an extent, approximate overall error. For example, in the partial table above, we might, based on revisions, estimate that $d_i \approx 0.02 y_i + 3$, if the $d_i$ are to replace the zero values associated with each observed value, $y_i$. The S1 and S2 values above might also be replaced by $r_i \approx \left[(0.02 y_i^*)^2 + S_i^2\right]^{1/2}$, using “S” in place of either “S1” or “S2.” This would yield the following:

  y_i or y_i*   d_i or r_1i   d_i or r_2i   EG   PG(a)   PG(b)   PG(c)
       6725            137           137    1       1       2       5
       4359             90            90    1       2       1       3
       1289             29            29    2       1       4       4
        497             22            20    1       1       3       2
        317             14            13    1       2       2       2
        278             11            11    1       1       3       2
        223             10             9    2       1       3       2
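A minimal sketch of that adjustment in code follows; the data set and variable names (FILE1, VALUE, S1, S2) are hypothetical, as in the sketch above, and the constants 0.02 and 3 are simply the illustrative values used in the table:

DATA ADJUSTED;
  SET FILE1;
  IF S1 = 0 THEN D = 0.02*VALUE + 3;           /* observed value: d(i)                 */
  ELSE DO;                                     /* predicted value: r(i) for S1 and S2  */
    R1 = SQRT((0.02*VALUE)**2 + S1**2);
    R2 = SQRT((0.02*VALUE)**2 + S2**2);
  END;
RUN;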

Preliminary to Examples:

This example is for hydroelectric generation in the Western United States. Logically, it would seem that US hydroelectric generation may best be collected within the US Standard Regions for Temperature and Precipitation used by the National Climatic Data Center of the National Oceanic and Atmospheric Administration (NCDC/NOAA). A map is included below. When estimating (actually ‘predicting’) hydroelectric generation for ‘missing’ observations (i.e., for that part of the population that was not in the sample or, whether sample or census, did not respond), the regressors used here were generator nameplate capacity and previously reported generation from a census. The relationship of those numbers to a current sample or census of generation might be expected to differ by geographic region because the changes in precipitation might be expected to be similar within the NCDC regions. However, the Energy Information Administration (EIA) would publish generation numbers by Census division or NERC region. The EIA might also publish some of these numbers by State.

Collecting hydroelectric plant generation data by NCDC/NOAA US Standard Regions for Temperature and Precipitation would appear to be more logical than to collect data by Census division, since precipitation impacts upon hydroelectric generation, and one would like to have data grouped as homogeneously as possible for the largest data set possible.

The first figure below shows an NCDC/NOAA region in the West consisting of California and Nevada, and another region north of that consisting of Oregon, Washington and Idaho. In the Census division map, we see California, Oregon and Washington in one region. (That region also contains Alaska and Hawaii, but is subdivided in EIA reports so that California, Oregon and Washington make up an entire division among themselves, the Pacific Contiguous Census Division.)

What if hydroelectric generation data were collected in each of the two NCDC/NOAA regions just mentioned, and data ‘predicted’ for all members of the population for which data were not collected in those regions, and then a total was published across the Census subdivision for California, Oregon and Washington? Two examples will be shown. The first will illustrate cutoff model-based sampling and inference. The second example will illustrate the use of this method for imputation.

The image below was taken, with permission, from the NCDC/NOAA Internet Web pages:

[Figure: map of the NCDC/NOAA US Standard Regions for Temperature and Precipitation.]

Source: National Oceanic and Atmospheric Administration, National Climatic Data Center, Asheville, North Carolina. http://www.ncdc.noaa.gov/ol/climate/research/1998/ann/usrgns_pg.gif

Next, consider the Census division map (below) found on the EIA Web site at http://www.eia.doe.gov/emeu/recs/recs97/recs.html, which gives the US Bureau of the Census as its source.

[Figure: map of US Census divisions.]


Consider, for example, that a population could be subdivided into seven categories which each represent nearly homogeneous data under one model per category. Therefore, there are seven EGs. Further suppose that the population is divided into four parts for which subtotals and their variances are to be estimated. That would mean having four PGs. This could be represented visually as in the figure below:

[Figure: schematic of seven estimation groups (EGs) and four publication groups (PGs) dividing the same population differently.]


In the case at hand, for California (CA), Nevada (NV), Oregon (OR), Washington (WA) and Idaho (ID), and considering the NCDC West and Northwest regions as EG1 and EG2, respectively, and the Pacific Contiguous Census Division as the PG area, the following figure applies:

[Figure: the NCDC/NOAA West (CA, NV) and Northwest (OR, WA, ID) regions as estimation groups, with the Pacific Contiguous Census Division (CA, OR, WA) as the publication group.]


[Figure: NCDC/NOAA West region, current hydroelectric generation, y, versus past hydroelectric generation, x.]

In the first example, for cutoff model-based sampling, the “cutoff” will be that data are not collected from hydroelectric plants with less than 200 megawatts of nameplate capacity. These data were chosen for use in testing as they represent a reasonable set of real data to isolate for this purpose. The author employed a testing technique used for a number of years since being suggested by Dean Fennell at the EIA. The data in this case come from two annual censuses. Data are removed from one census to simulate a sample with the remaining data. The other census supplies regressor data. After ‘prediction’/estimation has been accomplished, the results are compared to what had been collected before the artificial ‘sample’ was formed. With enough such test data sets, one may judge the performance of both total estimation and variance estimation to some degree.

Here, the observed totals for hydroelectric generation in the regions of interest are as follows:

  Region                                            Observed Total (MWh)
  NCDC/NOAA West (CA, NV)                                 41,350,600
  NCDC/NOAA Northwest (OR, WA, ID)                       163,438,700
  Census Division Pacific Contiguous (CA, OR, WA)        188,710,100

EXAMPLE 1, SAMPLING: For the NCDC/NOAA West region, there are only 7 hydroelectric plants out of 231 that meet the 200 megawatt capacity threshold. They account for just over 20% of the generation. Two regressors are used: generation from a past census, x, and plant nameplate capacity, c. The sample (n = 7) of generation values, y, is plotted against each of those regressors below, and there is also a graph of the sample generation values against ŷ = 0.5x + 1.3c. This was a preliminary estimate of y, to be used in the regression weight.


[Figure: NCDC/NOAA West region, current hydroelectric generation, y, versus nameplate capacity, c.]

[Figure: NCDC/NOAA West region, current hydroelectric generation, y, versus ŷ (ŷ = 0.5x + 1.3c).]


[Figure: NCDC/NOAA Northwest region, current hydroelectric generation, y, versus ŷ (ŷ = x + 0.9c).]

[Figure: NCDC/NOAA Northwest region (portion), current hydroelectric generation, y, versus ŷ, on an expanded scale.]

For the NCDC/NOAA Northwest region, there are 25 hydroelectric plants out of 147 that meet the 200 megawatt capacity threshold. They account for about 85% of the generation. The same two regressors are used: generation from a past census, x, and plant nameplate capacity, c. The sample (n = 25) of generation values, y, is plotted against ŷ = x + 0.9c. Again, ŷ is a preliminary estimate of y, to be used in the regression weight.


[Figure: Pacific Contiguous Census Division, current hydroelectric generation, y, versus ŷ (ŷ = x + 0.8c).]

For the Pacific Contiguous Census Division, there are 28 hydroelectric plants out of 331 that meet the 200 megawatt capacity threshold. They account for about 70% of the generation. Two regressors are used: generation from a past census, x, and plant nameplate capacity, c. The sample (n = 28) of generation values, y, is plotted against ŷ = x + 0.8c. Once again, this was a preliminary estimate of y, to be used in the regression weight.

Following are some results found when sampling within the preceding regions in the more conventional manner, and then a result made available due to the new methodology. Using the new method, the ‘missing data’ in the Pacific Contiguous Census Division that are also contained within the NCDC/NOAA Northwest region may be ‘predicted’ more accurately because data from plants in Idaho are included. Similarly, data from Nevada supplement California data. This may not be as helpful here as it could be in many similar situations. Idaho and Nevada do not have that much additional data to offer, and Nevada and California data may not be very homogeneous, yet there still is substantial improvement found using the new method in this example. This may primarily be because the new method does not mix Washington and Oregon data with California data during ‘prediction’/estimation.


For the NCDC/NOAA West region, consisting of California (CA) and Nevada (NV), we have the following model and also some statistics from SAS PROC REG:

$$y_i = \beta_x x_i + \beta_c c_i + e_{0i}\,(0.5 x_i + 1.3 c_i)^{\gamma},$$

where γ = 0.8 is used here. This is a multiple regression extension of the format used in Knaub (1993), page 520. This is further developed in Knaub (1997). The nonrandom factor of the residual makes use of an estimate of y. A good example of a similar use is found in Knaub (1998). A discussion of the rationale for the residual, and this value of γ, will be given later.

R-square: 0.945
x coefficient: 0.45; standard error: 0.19
c coefficient: 1.41; standard error: 0.44
n = 7; N = 231

Using notation shown in Royall (1970), where $\sum_s$ means summation over a sample, and $\sum_r$ means summation over the part of the population not in the sample:

$\sum_s y_i$ = 8372, and $\sum_r y_i^*$ = 28,058, in gigawatthours (GWh), so T* = 8372 + 28,058 = 36,430, which is in substantial error considering the observed T, which was 41,351 GWh. However, when using only 7 of the observations, which cover only 8372 GWh directly, a large error is to be expected.

Now, T - T* = 4921, but (using the new methodology, with δ = 0.4) the estimated standard error is 6078. Since there are about two chances in three that an observed error would be less than one standard error, the results are quite reasonable. (Note also that the estimated standard error, using the new methodology, with δ = 0.3, is 5296.)


For the NCDC/NOAA Northwest region, consisting of Oregon (OR), Washington (WA), and Idaho (ID), we have the following model and also some statistics from SAS PROC REG:

$$y_i = \beta_x x_i + \beta_c c_i + e_{0i}\,(x_i + 0.9 c_i)^{\gamma},$$

where γ = 0.8 is used here as before.

R-square: 0.996
x coefficient: 1.01; standard error: 0.07
c coefficient: 0.90; standard error: 0.31
n = 25; N = 147

$\sum_s y_i$ = 139,580, and $\sum_r y_i^*$ = 23,086, in gigawatthours (GWh), so T* = 139,580.4 + 23,086.4 ≅ 162,667, which is not nearly in as much error as in the previous case because the sample size is larger, and the summation over the sample, $\sum_s y_i$, is much more substantial relative to the estimated total.

Now, T - T* = 772, and (using the new methodology, with δ = 0.4) the estimated standard error is 586, which would mean that an error of 772 would be fairly reasonable.


When the Pacific Contiguous Census Division region (CA, OR and WA) is used as a category from which to sample, the model does not perform as well. Although graphs for these data can be somewhat deceiving, the graphs above which compare the same range (0 to 10,000 GWh) for the Pacific Contiguous Census Division region and the NCDC/NOAA Northwest region indicate that heteroscedasticity is a more ill-defined phenomenon in the Pacific Contiguous case.

Here the model is

$$y_i = \beta_x x_i + \beta_c c_i + e_{0i}\,(x_i + 0.8 c_i)^{\gamma},$$

where γ = 0.8 is used here as before.

R-square: 0.978
x coefficient: 1.01; standard error: 0.08
c coefficient: 0.75; standard error: 0.29
n = 28; N = 331

$\sum_s y_i$ = 138,558, and $\sum_r y_i^*$ = 62,123, in GWh, so T* = 200,681.

T - T* = -11,971, but the estimated standard error using the new method, with δ = 0.4, is only 2821. Using the more traditional (but multivariate) $V_L$ (Knaub (1996, 1997)), the estimated standard error is 2443. Thus the standard error appears to be understated. (Note that Royall and Cumberland (1981) explored alternatives to $V_L$ in the case of simple linear regression, but that in the testing done in Knaub (1992), $V_L$ performed relatively well.) The poor performance here appears to be the result of sampling from within a category (the Pacific Contiguous Census Division) that is too heterogeneous. That is, we know that there are groups within this category (those that are part of the NCDC/NOAA West region and those that are part of the NCDC/NOAA Northwest region) that should be modeled using different coefficients. Here, the NCDC/NOAA West region seems to add just enough to the ‘mix’ to disturb the model that was used for the NCDC/NOAA Northwest region if one wanted to apply it to the Pacific Contiguous Census Division. This can be seen by comparing the model coefficients provided above.


So we see that if we predict a y value, $y_i^*$, and obtain the corresponding $S_{1i}^2 = V_L^*(y_i^* - y_i)$ and $S_{2i}^2 = \sigma_e^{*2}/w_i$ values for each member of the NCDC/NOAA West region for which data were not collected, and do the same for each member of the NCDC/NOAA Northwest region, as in the first two cases shown, then we can do better at estimating both T and its standard error for the Pacific Contiguous Census Division region. Here, after making the predictions, the predicted and collected data are aggregated within the Pacific Contiguous region by NCDC/NOAA regions. For each stratum, we use δ = 0.3 and γ = 0.8, as mentioned earlier. One then obtains T* = 184,597, and therefore T - T* = 188,710 - 184,597 = 4113, with an estimated standard error of 5257. To estimate 189 terawatthours as 185 is decidedly better than 201.

Note that if one compared nothing but the estimated standard errors, then T* = 200,681 would appear to be the better estimate of T, but that does not seem likely here. Other testing has been done, and the new method appears to be sound.

The data file that was created in this example, as described in the more generic and more abbreviated example earlier, contained an indicator for the NCDC/NOAA region (“1” for the West region and “2” for the Northwest region), and the State postal abbreviation codes so that one could choose only those in the Pacific Contiguous Census Division region. A code could have been assigned for the Census divisions, but it was convenient to use the State codes that happened to already be available. Of course, the collected/observed y, or the predicted y value, is present in this file for each case, as well as the S1 and S2 values corresponding to the predicted y values.


Below is an excerpt from this file:

  State   y_i or y_i*      S_1i      S_2i   NCDC/NOAA
  CA            1.149     1.315     1.308   1
  CA            0.904     1.095     1.088   1
  CA            0.790     0.933     0.899   1
  CA            0.606     0.768     0.763   1
  CA            0.353     0.485     0.472   1
  CA            0.154     0.248     0.243   1
  ID         2235.094                       2
  ID         1298.898                       2
  ID         3344.642                       2
  OR         2950.887                       2
  OR         6542.899                       2
  OR         5861.074                       2
  OR        13889.131                       2
  OR         1274.139                       2
  OR         8030.864                       2
  WA         1075.928                       2
  WA         3966.529                       2
  WA         1157.234                       2
  WA         3257.702                       2
  WA         7280.043                       2
  WA         4023.082                       2
  WA         5643.458                       2
  WA        27032.313                       2
  WA         6433.497                       2
  WA         5340.163                       2
  WA          956.149                       2
  WA         2918.442                       2
  WA        14631.013                       2
  WA         1576.052                       2
  WA         4022.615                       2
  WA         4838.530                       2
  OR         1412.538   116.838   111.946   2
  WA          959.870    85.853    82.175   2
  WA          927.307    80.949    79.954   2
  WA          757.793    68.780    68.027   2
  ID          739.563    68.763    66.746   2
  WA          716.478    65.676    65.049   2
  ID          569.037    55.317    54.087   2


[Figure: NCDC/NOAA West region, current hydroelectric generation, y, versus ŷ (ŷ = 0.631x + 1.105c).]

EXAMPLE 2, IMPUTATION:

For the NCDC West region, approximately ten percent of the data in the test data set were removed at random, as if the data were ‘missing.’ This was accomplished by selecting (as ‘missing’) each case where a given byte was a given value, when it could have assumed any single digit value. This resulted in randomly imputing for 27 of 233 hydroelectric power producers, yielding n = 206, which, in this case, covered about 93% of the generation. Using a ‘reasonable’ regression weight, with γ = 0.5 (to be discussed in the next section of this article), one obtains

T - T* = 41,351 - 41,521 = -170

σ* (δ = 0.2, γ = 0.5) = 253


For the NCDC/NOAA Northwest region, the same procedure resulted in imputing randomly for 13 of 147 hydroelectric power producers, yielding n = 134, which covers about 95% of generation.

Using the same ‘reasonable’ regression weight as before:

T - T* = 163,439 - 163,534 = -95

σ* (δ = 0.2, γ = 0.5) = 355


Now, note these figures for the Pacific Contiguous Census Division:

[Figure: Pacific Contiguous Census Division, current hydroelectric generation, y, versus ŷ (ŷ = 0.8185x + 1.37c).]

[Figure: Pacific Contiguous Census Division (portion), current hydroelectric generation, y, versus ŷ, on an expanded scale.]


REGRESSION WEIGHTS: The regression weight is a parameter that may optionally be provided to SAS PROC REG by the programmer. Using $w_i$ for the weights, the default would set $w_i$ = 1 for all $i$, which would be the homoscedastic or Ordinary Least Squares (OLS) case. For the Weighted Least Squares (WLS) case (Knaub (1997)), one must first decide with respect to what variable the variance of $y_i$ is to be considered to increase/decrease. Since we are considering multivariate cases, we could choose any regressor or combination of regressors, but one might logically consider any preliminary estimate of y, say ŷ. (This was done in the case of a number of regressors in Knaub (1998).) In the case of simple linear regression, $w_i = x_i^{-2\gamma}$ is a form that has long been considered (Cochran (1953)), and used (Brewer (1963), Royall (1970)), and was discussed in Knaub (1995). It has performed well for a variety of data. Therefore, here we use $w_i = \hat{y}_i^{-2\gamma}$. Note that this means that the error term being considered is $e_i = w_i^{-1/2} e_{0i} = \hat{y}_i^{\gamma} e_{0i}$.

What value for γ should be used? This was considered in Knaub (1997). Often, applying either the iterative weighted least squares method (Carroll and Ruppert (1988), pages 69 and 70), or the alternative method in Knaub (1993), for highly skewed establishment surveys, one will find that the data indicate that γ should be about 0.8. However, Knaub (1997) indicates that there are other considerations. In a cutoff model-based sample of highly skewed establishment data, if there is no nonresponse among the largest establishments, then a larger value for γ may be desirable, as that would put more emphasis on the smaller observations made, which would be closer to the size of those data which were not collected. This may be important if the range of ŷ impacts on the best values for the regression weight. Thus, γ = 1 may be a good choice for such sampling.

The value used for the cutoff sample in Example 1, found in this article, was not γ = 1, however, because of an additional consideration. In Example 1, the sample sizes were quite small. Larger values for γ reduce the effective sample size. This may be counterproductive when information is so scarce. Therefore, it was decided to use γ = 0.8 here. (It may have been better to have used γ = 0.5 for the part involving the NCDC/NOAA West region, and this could be done. In fact, there is no reason not to use entirely different models for each set of predictions, subsets of which can be used as strata when aggregating the totals/subtotals to be published.)


In the case of imputation for nonresponse in an establishment survey, what if large establishments behave relatively more like other large establishments? In that case, a smaller value for γ may be desirable, so that data collected from larger establishments will have more influence, assuming that some data from larger establishments will be missing and some will be present. (Note that heteroscedasticity will normally mean that the larger establishments will be ‘weighted’ less than the smaller ones. Also note that when γ = 0, one has the homoscedastic (OLS) case, and all data points are given equal weight.) In the second of the two full examples above, Example 2, dealing with imputation for randomly missing data in an establishment survey, the value used for γ was 0.5 (yielding the “ratio” estimate in Royall (1970)).

For household surveys, data are typically not as skewed as in establishment surveys, and data sets tend to be larger, whether one is dealing with sampling inference, imputation or both. When applying a linear model in such cases, it is probably best to use whatever the data indicate for γ (see Knaub (1993)). From Brewer (1999), page 38, “ ... the precise choice of ...” the regression weights “... has little influence on the accuracy of the sample estimates for large samples” in the case of model-assisted (design-based) survey sampling. That may be true for large samples in purely prediction-based inference as well. This article, however, concentrates on the generally smaller, heavily skewed establishment survey, and it particularly relates to cases where a cutoff sample is most practical. However, with possible adjustments to regression weights and to δ, the general procedure should be very widely useful, especially for imputation.

In both Example 1 and Example 2, preliminary estimates of model coefficients and the γ values as stated above were used to form the regression weights. In both examples, the model coefficients were substantially different for the two NCDC/NOAA regions, but similar for the same region across examples.

As an experiment, for data restricted to the Pacific Contiguous Census Division region in Example 1 above, estimates of γ and its standard error were done in accordance with Knaub (1997). Here, the weight involved ŷ, instead of a single regressor. Since this region apparently contained a ‘mix’ of establishments, better grouped by NCDC/NOAA regions, a ‘best’ γ value for the Pacific Contiguous Census Division region is a nebulous concept. The format of the weight may not adequately describe this hybrid situation. Low estimates of γ in this case were found. First, γ was estimated as 0.59, and this led to a standard error estimate of approximately 0.10 about another estimate of γ ≅ 0.52. Using γ = 0.59, or even 0.50, for the Pacific Contiguous Census Division region as a group, however, did not improve results over the previous estimate of T* = 200,681. (That yielded T - T* = -11,971.) Also, the standard error, which was estimated to be 2443, continues to apparently be underestimated.

For γ = 0.59, T* = 201,829, T - T* = -13,119, and the estimated standard error is only 2425.

For γ = 0.50, T* = 202,195, T - T* = -13,485, and the estimated standard error is only 2553.

SMALL AREA ESTIMATION: The reason that this method is appropriate to small area estimation, as stated earlier in this article, is that whenever there are regressor data available and models can be used to predict a response for any member of the population, then an estimated total may be produced for any subgroup. If, for example, we wished to estimate a hydroelectric generation total for a given State from which we had collected few if any responses (but we have complete regressor data), then we may do so. Accuracy may be too low to make the result useful, but this is always true for small area estimation. In this case, however, one can estimate a standard error, although the accuracy of that statistic, as well as the estimated total, would be dependent upon the accuracy of homogeneity assumptions. In the case of the data used in the above examples, it turned out that data were homogeneous by State to the extent that for those States, it would have been best to have sampled and estimated by State. Thus if all data were missing for a given State, it may be inappropriate to ‘predict’ each of those missing values using data from other States. A variance estimate would be produced, but it may not be acceptably accurate.

SUMMARY, COMMENTS AND CONCLUSIONS: Using prediction-oriented software, such as SAS PROC REG, one may take advantage of the flexibility such a package may have in specifying different models. With SAS PROC REG, one may easily specify different numbers of regressors for a linear regression, and one may specify regression weights and whether or not the intercept will be fixed at the origin. For any case where one has regressor data, but a ‘missing’ observation for the variate of interest, the software will predict a response. (One may normally use the term “missing” when referring to nonresponse only, but here the term is used more broadly. A ‘missing’ datum may be due to nonresponse when a response was sought, or because a sample was collected, and there was no plan to collect the ‘missing’ observation. The mixture of such cases may impact on one's decision as to the regression weight to be used.) If one builds a file with collected and predicted numbers for the variate of interest for all members of a population, or at least the portion of a population that is of interest, then estimates of aggregations can easily be made by adding the collected and predicted numbers that belong to the appropriate category. This means that data may be collected under one set of categories, and published under a different set of categories. For optimal efficiency, one should predict using as large a set of relatively homogeneous data as possible. By “homogeneous,” it is meant that the same model with the same coefficients would be appropriate for all members of a given subgroup of the data. Because there is a compromise to be made between sample size and the degree of homogeneity of the data, decisions like this may be somewhat subjective. Some testing may be possible, although hypothesis testing is generally misused, as discussed in Knaub (1987). Without considering a simple alternative hypothesis, such as is done in sequential hypothesis testing, or at least doing a sensitivity analysis, results are often misinterpreted. Experimenting with a number of sets of test data would be a more direct approach. For Example 1 given above, further work showed that there was enough data collected by State for California, Oregon and Washington to have estimated the Pacific Contiguous Census Division region slightly better by estimating for each of those States and adding results. However, this may not be true in data sets for other time periods, or for other regions.

If one has a very limited survey budget (leading one to expect a great deal of ‘missing’ data), then collecting hydroelectric generation data within precipitation regions that each cover multiple States may be a logical approach. Once predictions are made for every ‘missing’ datum, totals/subtotals may be estimated by adding predicted and collected numbers under any categorization requested. (Thus the change in a NERC regional border, mentioned earlier, would not be a problem.) Standard errors are estimated by use of a formula that approximates $V_L$ for each relatively homogeneous subgroup under the category for which an estimated total is to be published. The variance of the total is then estimated by adding those variances, assuming independence. Thus we have prediction/estimation categories, and we have publication categories. They may divide the population differently. Once a file is built with collected or predicted numbers for every member of a population, such a file will easily lead to estimates of totals for any (sub)group within that population. Even if there are no collected values, but only predicted values for that group, an estimate for the total of that group can be made, but the estimated standard error would probably be large. Such results may not be very reliable. If there is any reason that data for that group may behave differently, there will not be any evidence of that fact. However, if the estimated standard error is not too large for the purposes to which the data are to be used, and one states the homogeneity assumptions being made, it may be possible to provide such an estimated total to data ‘customers’ without misinforming them.

The method promoted in this article is well suited to imputation for a census, or for inference from model-based sampling and for imputation associated with such a sample, as long as suitable regressor data are available. As for imputation associated with design-based sampling, there has been other work done in that area, such as Steel and Shao (1997) and Montaquila (1999). However, the method of this article might also be applied to design-based sampling in some cases by simply adding the (model-based) variance estimate for the imputed values to the design-based variance estimate which assumed all data were observed. (According to Lee, Rancourt and Sarndal (1999), this procedure is apparently common to some extent to several methods involving single value imputation.) This may be fairly accurate in the current case, and very convenient to apply, as well as being interpretable. (However, the variance of the variance estimate may be high.) Observational errors are another matter, as discussed earlier.
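In symbols, that simple additive treatment amounts to something like the following; this is a sketch of the idea only, not a formula taken from the cited references:

$$\widehat{V}(\hat{T}) \;\approx\; \widehat{V}_{design}(\hat{T}\mid \text{all data observed}) \;+\; V_L^*\Big(\sum_{imputed\; i}(y_i^* - y_i)\Big),$$

where the second term would be computed from the approximation given earlier in this article.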

Note the change in the NERC regional map below that has been used by the EIA, as compared to the new map, now reported by the NERC, also shown (further) below.


Old NERC Map

Note that the source of this old NERC region map is an issue of the Electric Power Monthly, produced by the Energy Information Administration, Office of Coal, Nuclear, Electric and Alternative Fuels.


Newest NERC Regions (June 1999). Copied with permission.

This map also shows Canada and a small part of Mexico, but the US boundaries are clearly shown. Thus, it may be compared to the old map. Notice the changed boundary between the SPP and the SERC. This is due to several companies changing their affiliations.

Source for updated NERC map: NERC Web site, URL: http://www.nerc.com/regional/


Summary of Software Needs

Estimated Totals or Subtotals

To estimate (sub)totals, either (1) an observed or (2) a predicted value is needed for each potential respondent of interest for the variate of interest. The sum of those numbers is the estimated (sub)total. The software need only retain observed numbers, compute predicted numbers using the data and models specified by the user, and perform summations. Accounting for heteroscedasticity requires that the change in variance be considered with respect to a regressor or a function of regressors. A preliminary estimate of y is suggested for this purpose. To obtain this, a preliminary use of the prediction software (such as SAS PROC REG) would be in order.
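As a minimal sketch of that preliminary step (the data set and variable names here are illustrative assumptions chosen to parallel the worked code later in this section; the weight form w = yhat**(-2*gamma) with gamma = 0.5 corresponds to the W = YHAT**(-1.0) statement used there), one might fit an unweighted regression first and form the regression weights from its predicted values:

PROC SORT DATA=SASREG; BY EG;
PROC REG DATA=SASREG;          * preliminary, unweighted fit by prediction group;
  BY EG;
  MODEL Y = X1 X2 / NOINT;
  OUTPUT OUT=PRELIM P=YHAT0;   * predicted values are produced even where Y is missing;
RUN;
DATA SASREG2; SET PRELIM;      * regression weight from the preliminary prediction;
  IF YHAT0 > 0 THEN W = YHAT0**(-1.0);   * w = yhat**(-2*gamma) with gamma = 0.5;
RUN;

The worked example below instead plugs previously obtained coefficients into YHAT = (X1*0.83)+(X2*1.76) to form the weights, which serves the same purpose.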

Estimate of Uncertainty in the Estimated (Sub)totals

The software must be capable of estimating the variance of the prediction error for each nonrespondent, and the mean square error as found in an ANOVA table. It must then use this information, along with a specified regression weight and the number of missing observations. Estimated model variances for each stratum are considered independently and are added to obtain the overall variance estimate. Errors of data collection may also be considered.

Some SAS Software Code:

Below is example (mainframe) SAS code that, in addition to the observed y values, may provide (depending upon the version of SAS used) the predicted value, $y_i^{*}$, the variance of the prediction error, $S_{1i}^{2} = V_L\!\left(y_i^{*} - y_i\right)$, and $S_{2i}^{2} = \sigma_{e_i}^{*2}$, which is the mean square error divided by the regression weight, $w_i$, for each missing data case, $i$. Here “EG” is a categorization for purposes of performing predictions as discussed on page 8.

//SAS1 EXEC SAS,REGION=5000K,WORK='80,40',SORT=100
//IN DD DSN=JK76944.HYDRO.TEST.IMPUTE.DATA,DISP=SHR
//OUTY DD DSN=JK76944.RESULTS.DATA,DISP=MOD
//OUT DD DSN=JK76944.VLT.PRELIM.RESULTS1,DISP=SHR
//OUTN DD DSN=JK76944.VLT.PRELIM.RESULTS2,DISP=SHR
//SASLOG DD SYSOUT=*
//SASLIST DD SYSOUT=*
//SYSIN DD *
OPTIONS REPLACE LINESIZE=132;
DATA SASREG; INFILE IN;
  INPUT @6 EG $2. Y 21-30 X1 31-40 X2 91-100 @101 PG $4.;
  IF EG NE 'A' AND EG NE 'B' AND EG NE 'C' THEN DELETE;
  YHAT = 0;
  YHAT = (X1*0.83)+(X2*1.76);
PROC SORT DATA=SASREG; BY EG;
DATA D; SET SASREG; FILE OUTY;
  IF Y NE '.' THEN PUT @2 PG $4. @7 EG $2. @11 Y 10.3;
  W = YHAT**(-1.0); * imputation ....... gamma=0.5;
  SRW = SQRT(W);
PROC REG OUTEST=ODW2; BY EG;
  MODEL Y=X1 X2 / NOINT P;
  WEIGHT W;
  OUTPUT OUT=ODW P=YPW STDI=YSTIW;
PROC SORT DATA=ODW; BY EG;
DATA ODW; MERGE ODW ODW2;
DATA M; SET ODW;
  MSE = _RMSE_*_RMSE_;
  FILE OUTN;
  IF MSE='.' THEN DELETE;
  PUT @3 MSE BEST13. @19 EG $2.;
DATA A; SET ODW;
  IF Y NE '.' THEN DELETE; *IF W=0 THEN DELETE; *RMS=_RMSE_;
DATA B; SET A;
  S1 = YSTIW;
  V1 = S1**2;
DATA C; SET B; FILE OUT;
  PUT @2 PG $4. @7 EG $2. @10 SRW 9.6 @23 S1 9.3 @41 YPW 10.3;
/*
//SAS2 EXEC SAS,REGION=5000K,WORK='80,40',SORT=100
//OUTY DD DSN=JK76944.RESULTS.DATA,DISP=MOD
//IN DD DSN=JK76944.VLT.PRELIM.RESULTS1,DISP=SHR
//INMSE DD DSN=JK76944.VLT.PRELIM.RESULTS2,DISP=SHR
//SASLOG DD SYSOUT=*
//SASLIST DD SYSOUT=*
//SYSIN DD *
OPTIONS REPLACE LINESIZE=132;
DATA COMBINE1; INFILE IN;
  INPUT @2 PG $4. @7 EG $2. @10 SRW 9.6 @23 S1 9.3 @41 YPW 10.3;
  W = SRW**2;
PROC SORT; BY EG;
DATA COMBINE2; INFILE INMSE;
  INPUT @3 MSE BEST13. @19 EG $2.;
PROC SORT; BY EG;
DATA COMBINE; MERGE COMBINE1 COMBINE2; BY EG;
  V2 = MSE/W;
  S2 = SQRT(V2);
DATA ALL; SET COMBINE; FILE OUTY;
  PUT @2 PG $4. @7 EG $2. @11 YPW 10.3 @31 S1 10.3 @51 S2 10.3;
/*
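The summation step described under “Summary of Software Needs” is not shown in the job stream above. Below is a minimal sketch (an illustration only; the job name, the DD names, and the variable names VALUE and ESTTOTAL are assumptions) of how it might be completed, reading back the combined file JK76944.RESULTS.DATA, in which columns 11-20 hold either an observed Y or a predicted YPW, and summing by publication group PG. It produces estimated (sub)totals only; the standard error combination described in the text is not reproduced here.

//SAS3 EXEC SAS,REGION=5000K,WORK='80,40',SORT=100
//INY DD DSN=JK76944.RESULTS.DATA,DISP=SHR
//SYSIN DD *
OPTIONS REPLACE LINESIZE=132;
DATA PUBTOT;                      * observed and predicted records are read alike;
  INFILE INY MISSOVER;
  INPUT @2 PG $4. @7 EG $2. @11 VALUE 10.3;
PROC SORT DATA=PUBTOT; BY PG;
PROC MEANS DATA=PUBTOT NOPRINT;   * estimated (sub)total = sum of observed plus predicted;
  BY PG;
  VAR VALUE;
  OUTPUT OUT=TOTALS SUM=ESTTOTAL;
PROC PRINT DATA=TOTALS;
/*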


REFERENCES:

Brewer, K.R.W. (1963), “Ratio Estimation in Finite Populations: Some Results Deducible from the Assumption of an Underlying Stochastic Process,” Australian Journal of Statistics, 5, pp. 93-105.

Brewer, K.R.W. (1995), “Combining Design-Based and Model-Based Inference,” Business Survey Methods, ed. by B.G. Cox, D.A. Binder, B.N. Chinnappa, A. Christianson, M.J. Colledge, and P.S. Kott, John Wiley & Sons, pp. 589-606.

Brewer, K.R.W. (1999), “Design-based or Prediction-based Inference? Stratified Random vs Stratified Balanced Sampling,” International Statistical Review, 67, 1, pp. 35-47, International Statistical Institute.

Carroll, R.J., and Ruppert, D. (1988), Transformation and Weighting in Regression, Chapman & Hall.

Cochran, W.G. (1953), Sampling Techniques, 1st ed., John Wiley & Sons, (3rd ed., 1977).

Chaudhuri, A. and Stenger, H. (1992), Survey Sampling: Theory and Methods, Marcel Dekker, Inc.

Karmel, T.S., and Jain, M. (1987), “Comparison of Purposive and Random Sampling Schemes for Estimating Capital Expenditure,” Journal of the American Statistical Association, American Statistical Association, 82, pp. 52-57.

Knaub, J.R., Jr. (1987), “Practical Interpretation of Hypothesis Tests,” letter, The American Statistician, American Statistical Association, Vol. 41, No. 3 (August), pp. 246-247.

Knaub, J.R., Jr. (1992), “More Model Sampling and Analyses Applied to Electric Power Data,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 876-881.

Knaub, J.R., Jr. (1993), “Alternative to the Iterated Reweighted Least Squares Method: Apparent Heteroscedasticity and Linear Regression Model Sampling,” Proceedings of the International Conference on Establishment Surveys, American Statistical Association, pp. 520-525.

Knaub, J.R., Jr. (1995), “A New Look at ‘Portability’ for Survey Model Sampling and Imputation,” Proceedings of the Section on Survey Research Methods, Vol. II, American Statistical Association, pp. 701-705.

Knaub, J.R., Jr. (1996), “Weighted Multiple Regression Estimation for Survey Model Sampling,” InterStat, May 1996, http://interstat.stat.vt.edu/InterStat. (Note shorter, more recent version in ASA Survey Research Methods Section proceedings, 1996.)


Knaub, J.R., Jr. (1997), “Weighting in Regression for Use in Survey Methodology,” InterStat, April 1997, http://interstat.stat.vt.edu/InterStat. (Note shorter, more recent version in ASA Survey Research Methods Section proceedings, 1997.)

Knaub, J.R., Jr. (1998), “Filling in the Gaps for A Partially Discontinued Data Series,” InterStat, October 1998, http://interstat.stat.vt.edu/InterStat. (Note shorter, more recent version in ASA Business and Economic Statistics Section proceedings, 1998.)

Knaub, J.R., Jr. (1999a), “Using Prediction-Oriented Software for Model-Based and Small Area Estimation,” to appear in the 1999 Proceedings of the Section on Survey Research Methods, American Statistical Association.

Knaub, J.R., Jr. (1999b), “Using Prediction-Oriented Software for Estimation in the Presence of Nonresponse,” to be presented at the 1999 International Conference on Survey Nonresponse, American Statistical Association.

Lee, H., Rancourt, E., and Saerndal, C.-E. (1999), “Variance Estimation from Survey Data Under Single Value Imputation,” presented at the International Conference on Survey Nonresponse, Oct. 1999, to be published in a monograph.

Maddala, G.S. (1977), Econometrics, McGraw-Hill, Inc.

Maddala, G.S. (1992), Introduction to Econometrics, 2nd ed., Macmillan Pub. Co.

Montaquila, Jill M. (1999), “Variance Estimation 1: Accounting for Imputation,” Washington Statistical Society Seminar, presented October 12, 1999. Abstract available at URL: http://www.science.gmu.edu/~wss/seminar.html#991012. (This is a more general extension of work in Montaquila, J.M. and Jernigan, R.W. (1997), “Variance Estimation in the Presence of Imputed Data,” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 273-278.)

Royall, R.M. (1970), “On Finite Population Sampling Theory Under Certain Linear Regression Models,” Biometrika, 57, pp. 377-387.

Royall, R.M. and Cumberland, W.G. (1981), “An Empirical Study of the Ratio Estimator and Estimators of its Variance,” Journal of the American Statistical Association, 76, pp. 66-88.

Saerndal, C.-E., Swensson, B. and Wretman, J. (1992), Model Assisted Survey Sampling, Springer-Verlag.


Steel, P. and Fay, R.E. (1995), “Variance Estimation for Finite Populations with Imputed Data,” Proceedings of the Section on Survey Research Methods, Vol. I, American Statistical Association, pp. 374-379.

Steel, P.M. and Shao, J. (1997), “Estimation of Variance Due to Imputation in the Transportation Annual Survey (TAS),” Proceedings of the Section on Survey Research Methods, American Statistical Association, pp. 141-146.

Sweet, E.M. and Sigman, R.S. (1995), “Evaluation of Model-Assisted Procedures for Stratifying Skewed Populations Using Auxiliary Data,” Proceedings of the Section on Survey Research Methods, Vol. I, American Statistical Association, pp. 491-496.