A balanced Sampling approach for multiway stratification design for small area estimation Piero Demetrio Falorsi - Paolo Righi ISTAT


Post on 03-Jan-2016



Index

1. The issue of multivariate-multidomain sampling strategy

2. The proposed sampling strategy

3. Balanced sample for multiway stratification

4. Modified GREG estimator

5. The algorithm for the sample size definition

6. Application fields and experiments

1. The issue of multivariate-multidomain sampling strategy

When planning a sampling strategy for a survey that aims to produce estimates for several domains (defined as non-nested partitions of the population), one issue is to define the sample size so that the sampling errors of the domain estimates of several parameters are lower than given thresholds.

A sampling strategy is proposed here dealing with multivariate‑multidomain surveys when the overall sample size must satisfy budget constraints.

The standard solution of a stratification given by cross-classification of the domain variables is often not feasible because the number of strata can be larger than the overall sample size. Moreover, even if the overall sample size allows covering all the strata, the resulting allocation could lead to an inefficient design.

Figure: population; planned and actual sample with cross-classification stratification.

Example: Business Structural Statistics

36,000 cross-classification strata

Table 1.2. Number of domains of the Italian Structural Business Statistics Survey by partition

Partition: number of domains

• Economic activity class (4 digits of the NACE Rev. 1 classification): 465

• Economic activity group (3 digits of the NACE Rev. 1 classification) by size class (1): 395

• Economic activity division (2 digits of the NACE Rev. 1 classification) by region (2): 961

• Total number of estimation domains: 1,821

(1) Size classes are defined in terms of number of persons employed.

(2) Regions are 21 including autonomous provinces.

Standard strategy

The standard solution for obtaining planned domains adopts a cross-stratified sampling design, with strata obtained by combining the domains.

Consequences:

• when the population size in many strata is small, the stratification scheme could be inefficient;

• if the different partitions into domains of interest are not nested, the allocation of the sample in the cross-classified strata may be substantially different from the optimal allocation for the domains of a given partition;

• the sample size needed to cover all strata could be too large for the budget constraints of the survey;

• in surveys repeated over time, statistical burden may arise if there are strata containing only a few units in the population.

One possible solution is multi-way stratification:

Several sophisticated solutions have been proposed to keep the sample size under control in all the categories of the stratifying variables without using a cross-classification design. These methods are generally referred to as multi-way stratification techniques, and have been developed under two main approaches:

(i) Latin Squares or Latin Lattices schemes (Bryant et al., 1960; Jessen, 1970). Independence between rows and columns is assumed, and these methods work only if all the cross-strata exist in the population.

(ii) Controlled rounding problems solved via linear programming (Causey et al., 1985; Sitter and Skinner, 1994). These methods are computationally very complex, do not always reach a solution, and the inclusion probabilities (both simple and joint) cannot be computed immediately.

The main weaknesses of these approaches derive from their computational complexity; moreover, a solution is not always reached.

The aim of this work is to define a sampling strategy that is optimal with regard to both the sampling scheme and the estimator used, by exploiting the available auxiliary information in both phases:

Define a probabilistic sampling method

Realize a multiway stratification based on balanced sampling, controlling the sample size of the margin domains

Use a modified GREG estimator

Define the sample allocation, aiming at controlling the sampling errors on the margins, using a variance estimator that jointly takes into account the regression model underlying the GREG estimator and the balanced sampling design

The strategy may take into account a simple (Fay-Herriot) small area estimator

The proposed overall sampling strategy is easy to implement, and software has been developed for each phase

It is possible to extend it to different contexts (considering the anticipated variance or the use of indirect small area estimators)

It is possible to develop a sample strategy for small area estimation considering the sample and estimation phases jointly

2. The proposed sampling strategy

Notation

Denote with:

$U$ the population of size $N$;

$U_b$ the $b$-th partition of $U$ into $M_b$ domains $U_{bd}$, $b = 1, \dots, B$; $d = 1, \dots, M_b$;

$y_{r,k}$ the value of the $r$-th ($r = 1, \dots, R$) variable of interest in the $k$-th population unit;

$\delta_{bd,k}$ the domain membership indicator ($\delta_{bd,k} = 1$ if $k \in U_{bd}$ and $0$ otherwise);

$n$ the overall fixed sample size;

$t_{rbd} = \sum_{k \in U_{bd}} y_{r,k} = \sum_{k \in U} y_{r,k}\,\delta_{bd,k}$ the $r$-th parameter of interest.

3. Balanced sampling and multi-way stratification

Balanced sampling is a class of designs using auxiliary information.

Properties have been studied in the

• model based approach (Royall and Herson, 1973; Valliant et al., 2000);

• design based approach (Deville and Tillé, 2004, 2005).

In the following we consider the design based or model assisted approach

Let us define the sampling design $p(\cdot)$ with inclusion probabilities $\boldsymbol{\pi} = (\pi_1, \dots, \pi_k, \dots, \pi_N)'$ as a design which assigns a probability $p(s)$ to each sample $s$ such that

$$E(\boldsymbol{\lambda}) = \sum_{s \in S} p(s)\,\boldsymbol{\lambda} = \boldsymbol{\pi},$$

$\boldsymbol{\lambda} = (\lambda_1, \dots, \lambda_k, \dots, \lambda_N)'$ being a vector of sample indicators.

Let $\mathbf{z}_k = (z_{k1}, \dots, z_{kh}, \dots, z_{kQ})'$ be a vector of $Q$ auxiliary variables known for each unit in the population. The sampling design $p(s)$ is said to be balanced with respect to the $Q$ auxiliary variables if and only if it satisfies the balancing equations given by

$$\hat{\mathbf{t}}_{z,ht} = \sum_{k \in U} \mathbf{z}_k\, a_k\, \lambda_k = \sum_{k \in U} \mathbf{z}_k = \mathbf{t}_z,$$

$a_k = 1/\pi_k$ being the sample weight.
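The balancing equations can be checked numerically for any candidate sample. The sketch below is only an illustration on a hypothetical toy population (the names `ht_totals`, `is_balanced` and the data are assumptions, not part of the original slides): it computes the Horvitz-Thompson estimates of the auxiliary totals and compares them with the true totals.

```python
from typing import List, Sequence


def ht_totals(z: Sequence[Sequence[float]], pi: Sequence[float],
              sample: Sequence[int]) -> List[float]:
    """Horvitz-Thompson estimates of the totals of the Q auxiliary variables."""
    q = len(z[0])
    return [sum(z[k][h] / pi[k] for k in sample) for h in range(q)]


def is_balanced(z, pi, sample, tol: float = 1e-9) -> bool:
    """True when the balancing equations hold for this sample (within tol)."""
    true_totals = [sum(row[h] for row in z) for h in range(len(z[0]))]
    estimates = ht_totals(z, pi, sample)
    return all(abs(e - t) <= tol for e, t in zip(estimates, true_totals))


# Hypothetical toy population: N = 4 units, equal inclusion probabilities.
pi = [0.5, 0.5, 0.5, 0.5]
x = [1.0, 2.0, 2.0, 1.0]
# First auxiliary variable is pi_k itself (this fixes the sample size),
# the second is an arbitrary size measure x_k.
z = [[p, v] for p, v in zip(pi, x)]

balanced_sample = [0, 1]    # HT estimates (2, 6) equal the true totals
unbalanced_sample = [0, 3]  # HT estimates (2, 4): the total of x is missed
```

Only samples such as `balanced_sample` would receive positive probability under a design balanced on these two variables.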

Multi-way stratification design can represent a special case of balanced design, when for unit $k$ the auxiliary variable vector is the indicator of belonging to the domains of the different partitions multiplied by its inclusion probability.

The $\mathbf{z}$ vector, in this case, is defined as

$$\mathbf{z}_k' = \pi_k\,\boldsymbol{\delta}_k' = \pi_k\,(\delta_{11,k}, \dots, \delta_{bd,k}, \dots, \delta_{BM_B,k}) = (\underbrace{0, \dots, \pi_k, \dots, 0}_{1}, \dots, \underbrace{0, \dots, \pi_k, \dots, 0}_{b}, \dots, \underbrace{0, \dots, \pi_k, \dots, 0}_{B}).$$

The balancing equations ensure that, for each selected sample $s$, the size of the subsample $s_{bd} = s \cap U_{bd}$ is a non-random quantity equal to

$$n_{bd} = \sum_{k \in U_{bd}} \pi_k.$$
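As an illustration of this special case, the following sketch builds the auxiliary vectors (inclusion probability times domain indicators) for a hypothetical population with two partitions and compares the expected domain sample sizes with those realised by two candidate samples. The function names and the data are assumptions made for the example.

```python
def multiway_z(pi, memberships, n_domains):
    """Build z_k = pi_k * delta_k.

    memberships[k][b] is the index d of the domain of partition b that
    contains unit k; n_domains[b] is the number of domains M_b.
    """
    rows = []
    for pi_k, doms in zip(pi, memberships):
        row = []
        for d, m_b in zip(doms, n_domains):
            block = [0.0] * m_b
            block[d] = pi_k          # pi_k in the position of the unit's domain
            row.extend(block)
        rows.append(row)
    return rows


def expected_domain_sizes(z):
    """Column totals of z: the planned sizes n_bd = sum of pi_k over U_bd."""
    return [sum(col) for col in zip(*z)]


def realised_domain_sizes(z, pi, sample):
    """HT estimates of the z totals: the number of sampled units per domain."""
    q = len(z[0])
    return [sum(z[k][h] / pi[k] for k in sample) for h in range(q)]


# Hypothetical population: 8 units, two partitions with 2 domains each,
# 2 units in every cross-classified cell, pi_k = 0.5 for all units.
memberships = [(0, 0), (0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1)]
n_domains = [2, 2]
pi = [0.5] * 8
z = multiway_z(pi, memberships, n_domains)

good_sample = [0, 2, 4, 6]   # 2 units in every marginal domain
bad_sample = [0, 1, 4, 5]    # all units fall in the first domain of partition 2
```

Note that `good_sample` respects the four marginal sizes while leaving two of the four cross-classified cells... only partially covered is allowed: only the margins are controlled, not the cells.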

For multiway stratification the balancing equations become

$$\sum_{k \in U} \frac{\pi_k\,\delta_{bd,k}}{\pi_k}\,\lambda_k = \sum_{k \in U} \pi_k\,\delta_{bd,k} = n_{bd},$$

$n_{bd}$ being the sample size for the $d$-th domain of the $b$-th partition, and

$$n_{bd} = \sum_{k \in U_{bd}} \pi_k.$$

A relevant drawback of balanced sampling has always been the lack of a general procedure for selecting a multivariate balanced random sample.

Deville and Tillé (2004) proposed a sample selection method (the cube method) that draws balanced samples for a large set of auxiliary variables and with respect to different vectors of inclusion probabilities.

A free macro for the selection of balanced samples for large data sets may be downloaded (SAS or R routine): http://www.insee.fr/fr/nom_df_met/outils_stat/cube/accueil_cube.htm

Deville and Tillé (2000) show that, with our specification of the auxiliary vectors, the balancing equations can be satisfied exactly, while in general the balancing equations are only approximately respected.

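In the context of multivariate estimation, the $r$-th parameter of interest is

$$t_{rbd} = \sum_{k \in U} y_{r,k}\,\delta_{bd,k}.$$

The modified GREG estimator is (through a specific domain weight)

$$\hat{t}_{rbd,greg} = \sum_{k \in s} w_{bd,k}\, y_{r,k},$$

$$w_{bd,k} = a_k\,\delta_{bd,k} + (\mathbf{t}_{bd,x} - \hat{\mathbf{t}}_{bd,x,ht})' \Big( \sum_{j \in s} a_j\,\mathbf{x}_j \mathbf{x}_j' / c_j \Big)^{-1} a_k\,\mathbf{x}_k / c_k,$$

where $\mathbf{t}_{bd,x}$ is the domain total of the auxiliary vector $\mathbf{x}_k$ and $\hat{\mathbf{t}}_{bd,x,ht}$ its Horvitz-Thompson estimator.

The superpopulation working model is

$$y_{r,k} = \mathbf{x}_k' \boldsymbol{\beta}_r + \varepsilon_{r,k}.$$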
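A minimal NumPy sketch of the domain weights follows, assuming the weight has the form: design weight times domain indicator, plus a regression adjustment whose coefficients are fitted on the whole sample. The function name, argument names and data are hypothetical; the check at the end only verifies the calibration property that the weights reproduce the known domain totals of the auxiliary variables.

```python
import numpy as np


def modified_greg_weights(x_s, a_s, c_s, delta_s, t_x_domain):
    """Domain weights of a modified GREG estimator (illustrative sketch).

    x_s        : (n, p) auxiliary vectors of the sampled units
    a_s        : (n,)   design weights 1 / pi_k
    c_s        : (n,)   known positive constants c_k of the working model
    delta_s    : (n,)   0/1 indicator of membership in the domain
    t_x_domain : (p,)   known domain total of the auxiliary vector
    """
    t_hat = (a_s * delta_s) @ x_s                  # HT estimate of the domain total of x
    t_mat = (x_s * (a_s / c_s)[:, None]).T @ x_s   # sum over s of a_k x_k x_k' / c_k
    adj = np.linalg.solve(t_mat, t_x_domain - t_hat)
    return a_s * delta_s + (x_s @ adj) * a_s / c_s


# Hypothetical sample of n = 6 units, intercept plus one covariate.
x_s = np.array([[1, 2], [1, 3], [1, 5], [1, 7], [1, 11], [1, 13]], dtype=float)
a_s = np.full(6, 4.0)
c_s = np.ones(6)
delta_s = np.array([1, 1, 1, 0, 0, 0], dtype=float)
t_x_domain = np.array([10.0, 45.0])
y_s = np.array([3.0, 4.0, 7.0, 9.0, 15.0, 16.0])

w = modified_greg_weights(x_s, a_s, c_s, delta_s, t_x_domain)
t_greg = w @ y_s   # modified GREG estimate of the domain total of y
```

Units outside the domain generally receive a non-zero weight through the adjustment term, because the regression coefficients are estimated on the whole sample.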

4. Modified GREG estimator

Variance of the Horvitz-Thompson estimator with the balanced sampling

Deville and Tillé (2005) proposed an approximation of the variance expression for the HT estimator and the overall domain

$$V(\hat{t}_{ht} \mid \hat{\mathbf{t}}_{z,ht} = \mathbf{t}_z) \cong \frac{N}{N-Q} \sum_{k \in U} \Big(\frac{1}{\pi_k} - 1\Big) (y_k - \mathbf{z}_k' \mathbf{B}_z)^2$$

with

$$\mathbf{B}_z = \Big[ \sum_{k \in U} \mathbf{z}_k \mathbf{z}_k' \Big(\frac{1}{\pi_k} - 1\Big) \Big]^{-1} \sum_{k \in U} \Big(\frac{1}{\pi_k} - 1\Big) \mathbf{z}_k\, y_k .$$
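The approximation can be evaluated as a weighted least squares fit of the study variable on the balancing variables, with weights equal to one over the inclusion probability minus one, followed by a weighted sum of squared residuals. The sketch below is an assumption-laden illustration (hypothetical function name and data); it uses a least squares solver rather than an explicit inverse, because the multiway balancing variables are collinear when there is more than one partition.

```python
import numpy as np


def balanced_ht_variance(y, z, pi):
    """Approximate variance of the HT estimator under balanced sampling.

    y  : (N,)   study variable for all population units
    z  : (N, Q) balancing variables
    pi : (N,)   inclusion probabilities
    """
    n_units, q = z.shape
    d = 1.0 / pi - 1.0
    sw = np.sqrt(d)
    # Weighted least squares coefficient; lstsq tolerates collinear columns.
    b_z, *_ = np.linalg.lstsq(z * sw[:, None], y * sw, rcond=None)
    resid = y - z @ b_z
    return n_units / (n_units - q) * np.sum(d * resid ** 2)


# Hypothetical population: N = 6, one partition with 2 domains, pi_k = 0.5.
pi = np.full(6, 0.5)
delta = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)
z = delta * pi[:, None]
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

v = balanced_ht_variance(y, z, pi)
v_linear = balanced_ht_variance(z @ np.array([2.0, 3.0]), z, pi)
```

In this toy case the residuals are the deviations from the domain means, and a variable that is an exact linear combination of the balancing variables has zero approximate variance.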

4. Modified GREG estimator: variance

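Starting from the result by Deville (2005) it is possible to derive the approximate expression of the variance for the modified GREG estimator under balanced sampling

$$V_p(\hat{t}_{rbd,greg} \mid \hat{\mathbf{t}}_{z,ht} = \mathbf{t}_z) \cong \frac{N}{N-Q} \sum_{k \in U} \Big(\frac{1}{\pi_k} - 1\Big) \eta_{rbd,k}^2$$

being

$$\eta_{rbd,k} = \begin{cases} \varepsilon_{r,k} - \mathbf{z}_k' \mathbf{B}_{z,rbd} & \text{for } k \in U_{bd} \\ -\,\mathbf{z}_k' \mathbf{B}_{z,rbd} & \text{for } k \notin U_{bd} \end{cases}$$

and

$$\mathbf{B}_{z,rbd} = \Big[ \sum_{k \in U} \mathbf{z}_k \mathbf{z}_k' \Big(\frac{1}{\pi_k} - 1\Big) \Big]^{-1} \sum_{k \in U} \Big(\frac{1}{\pi_k} - 1\Big) \mathbf{z}_k\, \varepsilon_{r,k}\, \delta_{bd,k},$$

where $\varepsilon_{r,k} = y_{r,k} - \mathbf{x}_k' \boldsymbol{\beta}_r$ is the residual of the working model.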

5. The algorithm for the sample size definition

In order to calculate the inclusion probabilities, it is necessary to fix the sample size for each domain so that the constraints on the sampling errors are satisfied.

Considering each marginal partition separately would yield a different set of inclusion probabilities for each of them.

In our methodology we calculate a single set of inclusion probabilities through a two-step procedure:

• Optimisation (calculation of the optimal probabilities)

• Calibration (calculation of the “working” probabilities)

Optimisation: the calculation of the inclusion probabilities (sample size and domain allocation) is carried out with the aim of keeping the expected sampling errors below given thresholds, with the minimum overall sample size, for several domains and estimates:

Multi domains, multi variable

The problem is solved through the system

$$\begin{cases} \min \sum_{k \in U} \pi_k \\ \dfrac{N}{N-Q} \sum_{k \in U} \Big(\dfrac{1}{\pi_k} - 1\Big) \eta_{rbd,k}^2 \le \bar{V}_{rbd} \\ 0 < \pi_k \le 1 \quad (k = 1, \dots, N) \end{cases}$$

where $\eta_{rbd,k}$ is the residual term and $\bar{V}_{rbd}$ is the threshold fixed for the variance.

The solution can be obtained through the Chromy algorithm (the one used in the allocation software MAUSS, which can be downloaded from www.istat.it).
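To give an idea of the structure of the optimisation, the sketch below solves the problem for a single variance constraint only, where a closed form exists: the probabilities are proportional to the absolute residuals, and units whose probability would exceed one are fixed at one before the others are recomputed. This is not the Chromy algorithm and does not handle several domains and variables at once; the function names, the threshold and the residuals are assumptions for the example, and all residuals are assumed to be non-zero.

```python
def min_size_probabilities(eta, v_bar, q=0):
    """Minimise sum(pi_k) subject to one constraint
    N/(N-q) * sum((1/pi_k - 1) * eta_k**2) <= v_bar and 0 < pi_k <= 1."""
    n_units = len(eta)
    v_adj = v_bar * (n_units - q) / n_units
    abs_eta = [abs(e) for e in eta]
    capped = [False] * n_units          # units forced to pi_k = 1
    while True:
        s_abs = sum(a for a, c in zip(abs_eta, capped) if not c)
        s_sq = sum(a * a for a, c in zip(abs_eta, capped) if not c)
        scale = s_abs / (v_adj + s_sq)  # from the Lagrangian first-order conditions
        pi = [1.0 if c else a * scale for a, c in zip(abs_eta, capped)]
        over = [k for k, p in enumerate(pi) if p > 1.0]
        if not over:
            return pi
        for k in over:
            capped[k] = True


def approx_variance(pi, eta, q=0):
    """Left-hand side of the variance constraint."""
    n_units = len(eta)
    return n_units / (n_units - q) * sum((1.0 / p - 1.0) * e * e
                                         for p, e in zip(pi, eta))


eta = [1.0, 2.0, 3.0, 4.0]                        # hypothetical residual terms
pi_loose = min_size_probabilities(eta, v_bar=10.0)
pi_tight = min_size_probabilities(eta, v_bar=5.0)  # the largest unit is taken with certainty
```

With the looser threshold the expected sample size `sum(pi_loose)` is 2.5; tightening the threshold increases it.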

Calibration: the optimal inclusion probabilities lead to non-integer values for the domain sample sizes. The calibration step consists of:

• rounding the expected domain sample sizes to the next integer;

• calculating “working” probabilities nearest to the optimal ones.

The problem is defined through the system

$$\begin{cases} \min \sum_{k \in U} G(\pi_k; \tilde{\pi}_k) \\ \sum_{k \in U} \pi_k = n \\ \sum_{k \in U_{bd}} \pi_k = n_{bd} \quad (b = 1, \dots, B;\ d = 1, \dots, M_b - 1) \end{cases}$$

where $\tilde{\pi}_k$ denotes the optimal probabilities, $\pi_k$ the working probabilities and $G(\cdot\,;\cdot)$ a distance function.

The solution is obtained by means of the Newton algorithm (with some changes), the same used in the calibration software Genesees, which can be downloaded from www.istat.it.
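The calibration step can be illustrated with iterative proportional fitting (raking), which rescales the probabilities partition by partition until every domain total matches its integer target. This is a swapped-in technique chosen for brevity: it corresponds to one particular choice of distance function and is not the Newton algorithm mentioned above. The names and data are hypothetical, the targets are assumed to be already rounded and to sum to the same overall size in every partition, and the bound of one on the probabilities is not enforced.

```python
def domain_sums(pi, doms, n_dom):
    """Sum of the probabilities within each domain of one partition."""
    sums = [0.0] * n_dom
    for p, d in zip(pi, doms):
        sums[d] += p
    return sums


def rake_probabilities(pi0, memberships, targets, n_iter=100, tol=1e-10):
    """Rescale pi0 so that domain sums match the targets.

    memberships[b][k] is the domain index of unit k in partition b;
    targets[b][d] is the integer sample size n_bd.
    """
    pi = list(pi0)
    for _ in range(n_iter):
        max_gap = 0.0
        for doms, n_b in zip(memberships, targets):
            sums = domain_sums(pi, doms, len(n_b))
            max_gap = max(max_gap, max(abs(s - t) for s, t in zip(sums, n_b)))
            pi = [p * n_b[d] / sums[d] for p, d in zip(pi, doms)]
        if max_gap < tol:   # every partition already matched its targets
            break
    return pi


# Hypothetical population: 8 units, two partitions with 2 domains each.
memberships = [[0, 0, 0, 0, 1, 1, 1, 1],
               [0, 0, 1, 1, 0, 0, 1, 1]]
targets = [[2, 2], [1, 3]]     # both partitions sum to n = 4
pi_start = [0.5] * 8           # starting (for example, optimal) probabilities

pi_work = rake_probabilities(pi_start, memberships, targets)
```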


Population – Contingency table

Variable for the allocation and estimation model


k,kk, x.y 11 350 ε

$E_m(\varepsilon_{1,k}) = 0$, $E_m(\varepsilon_{1,k}\,\varepsilon_{1,l}) = 0$ for $k \neq l$,

kkm xV 5.1)( ,1

6. Application fields and experiments

Artificial data

Table: compared sampling designs and expected CV (%), artificial data.

Real data

A simulation on real enterprise data (N = 10,392) has been carried out to evaluate the effects of planning the sample size for small domains of estimation (Falorsi et al., 2006):

• U1 partition: regions (20 domains);

• U2 partition: economic activity by size class (24 domains);

• cross-classification strata with population units: 360;

• variables of interest: value added and labour cost;

• the sample sizes of the U1 and U2 partitions have been planned separately by means of a compromise allocation;

• the two allocations guarantee a CV of 34.5% for U1 and 8.7% for U2 with regard to the variable number of employees (assumed known at the sampling stage);

• the overall sample size is n = 360.

The experiment examines a situation characterizing many real survey contexts, in which the overall sample size $n$ is fixed and the marginal sample sizes are determined by a quite simple rule, a compromise between the Allocation Proportional to Population size (APP) and the uniform allocation over the domains of a given partition:

$$n_{bd} = \alpha_b\, n\, (N_{bd}/N) + (1 - \alpha_b)\, n / M_b, \qquad 0 \le \alpha_b \le 1,$$

where $N_{bd}$ is the population size of domain $U_{bd}$.

The probabilities of both designs for the U1 and U2 partitions have been obtained as the solution of the calibration problem below, where the initial probabilities are set uniformly equal to $\tilde{\pi}_k = n/N$:

$$\begin{cases} \min \sum_{k \in U} G(\pi_k; \tilde{\pi}_k) \\ \sum_{k \in U} \pi_k = n \\ \sum_{k \in U_{bd}} \pi_k = n_{bd} \quad (b = 1, \dots, B;\ d = 1, \dots, M_b - 1) \end{cases}$$
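The compromise rule is a convex combination of the proportional and the uniform allocation, so it always returns domain sizes that add up to the overall sample size. A small sketch follows; the mixing constant is called `alpha` here, and the domain population sizes are hypothetical (only their total, 10,392, and n = 360 come from the experiment).

```python
def compromise_allocation(n, domain_pop_sizes, alpha):
    """Domain sample sizes mixing proportional and uniform allocation.

    alpha = 1 gives the allocation proportional to population size,
    alpha = 0 gives the same sample size in every domain.
    """
    total_pop = sum(domain_pop_sizes)
    n_domains = len(domain_pop_sizes)
    return [alpha * n * n_d / total_pop + (1.0 - alpha) * n / n_domains
            for n_d in domain_pop_sizes]


# Hypothetical partition with three domains.
pop_sizes = [5000, 3000, 2392]
alloc_mixed = compromise_allocation(360, pop_sizes, alpha=0.5)
alloc_uniform = compromise_allocation(360, pop_sizes, alpha=0.0)
alloc_prop = compromise_allocation(360, pop_sizes, alpha=1.0)
```

The resulting sizes are generally non-integer, which is why a rounding and calibration step is still needed afterwards.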


7. Extension to the Fay Herriot Model

Let $\bar{b}$ denote the partition for which it is necessary to adopt a small area indirect estimator and let us consider the model (7.1.1) described in Rao (2005, p. 116). For the domains of the $\bar{b}$-th partition, this model may be defined as

$$\hat{\bar{t}}_{rd\bar{b},greg} = \hat{t}_{rd\bar{b},greg} / N_{d\bar{b}} = \mathbf{a}_{d\bar{b}}' \boldsymbol{\varphi}_r + h_{d\bar{b}}\, v_{rd\bar{b}} + u_{rd\bar{b}},$$

where $\mathbf{a}_{d\bar{b}}$ is a $p \times 1$ vector of area level covariates, $\boldsymbol{\varphi}_r$ is an unknown $p \times 1$ vector of regression coefficients, $h_{d\bar{b}}$ is a known quantity related to the $d\bar{b}$-th domain, $v_{rd\bar{b}} \sim \text{iid}\,(0, \sigma^2_{r\bar{b}})$ is independent of the sampling error $u_{rd\bar{b}}$, which is approximately $\text{ind}\,(0, \sigma^2_{t,rd\bar{b}})$, being

$$\sigma^2_{t,rd\bar{b}} = V_p(\hat{t}_{rd\bar{b},greg} \mid \hat{\mathbf{t}}_{z,ht} = \mathbf{t}_z) / N^2_{d\bar{b}}.$$

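For known $\sigma^2_{r\bar{b}}$ and $\sigma^2_{t,rd\bar{b}}$ values, the BLUP estimator of $t_{rd\bar{b}}$ is

$$\hat{t}_{rd\bar{b},blup} = N_{d\bar{b}} \big( \gamma_{rd\bar{b}}\, \hat{\bar{t}}_{rd\bar{b},greg} + (1 - \gamma_{rd\bar{b}})\, \mathbf{a}_{d\bar{b}}' \hat{\boldsymbol{\varphi}}_r \big),$$

with $\hat{\bar{t}}_{rd\bar{b},greg} = \hat{t}_{rd\bar{b},greg} / N_{d\bar{b}}$, being

$$\gamma_{rd\bar{b}} = \sigma^2_{r\bar{b}}\, h^2_{d\bar{b}} / (\sigma^2_{t,rd\bar{b}} + \sigma^2_{r\bar{b}}\, h^2_{d\bar{b}}).$$

The MSE of the BLUP estimator is

$$MSE(\hat{t}_{rd\bar{b},blup}) = N^2_{d\bar{b}} \Big[ \gamma_{rd\bar{b}}\, \sigma^2_{t,rd\bar{b}} + (1 - \gamma_{rd\bar{b}})^2\, \mathbf{a}_{d\bar{b}}' \Big( \sum_{j=1}^{M_{\bar{b}}} \mathbf{a}_{j\bar{b}} \mathbf{a}_{j\bar{b}}' / (\sigma^2_{t,rj\bar{b}} + \sigma^2_{r\bar{b}}\, h^2_{j\bar{b}}) \Big)^{-1} \mathbf{a}_{d\bar{b}} \Big].$$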
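The shrinkage structure of the BLUP can be sketched in a few lines of NumPy. The example assumes known variance components and estimates the regression coefficients by generalised least squares, which is the usual companion of this MSE expression but is an assumption here, since the slides do not spell it out. Function name, argument names and data are hypothetical.

```python
import numpy as np


def fay_herriot_blup(t_greg, n_dom, a, h, sigma2_v, sigma2_t):
    """BLUP of domain totals and its MSE for known variance components.

    t_greg   : (M,)   direct (GREG) estimates of the domain totals
    n_dom    : (M,)   domain population sizes
    a        : (M, p) area level covariates
    h        : (M,)   known domain constants
    sigma2_v : float  model variance of the area effects
    sigma2_t : (M,)   sampling variances of the estimated domain means
    """
    ybar = t_greg / n_dom                              # direct estimates of the means
    total_var = sigma2_t + sigma2_v * h ** 2
    gamma = sigma2_v * h ** 2 / total_var              # shrinkage factors
    info = (a / total_var[:, None]).T @ a              # sum of a a' / total variance
    phi = np.linalg.solve(info, (a / total_var[:, None]).T @ ybar)  # GLS (assumed)
    blup = n_dom * (gamma * ybar + (1.0 - gamma) * (a @ phi))
    g2 = np.einsum("dp,pq,dq->d", a, np.linalg.inv(info), a)
    mse = n_dom ** 2 * (gamma * sigma2_t + (1.0 - gamma) ** 2 * g2)
    return blup, mse, gamma, phi


# Hypothetical partition with M = 4 domains, intercept plus one covariate.
t_greg = np.array([100.0, 260.0, 330.0, 520.0])
n_dom = np.array([10.0, 20.0, 30.0, 40.0])
a = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
h = np.ones(4)
sigma2_t = np.array([1.0, 2.0, 3.0, 4.0])

blup, mse, gamma, phi = fay_herriot_blup(t_greg, n_dom, a, h, 4.0, sigma2_t)
synthetic = n_dom * (a @ phi)
```

Each BLUP lies between the direct estimate and the synthetic regression estimate, and its MSE does not exceed the variance of the direct estimator; a larger sampling variance moves the estimate towards the synthetic part, which is what allows the MSE to be controlled at the design stage.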

Looking at the previous expressions, it is possible to note that, for a given value of the variance $\sigma^2_{r\bar{b}}$, the $MSE(\hat{t}_{rd\bar{b},blup})$ can be controlled in the sampling design phase by defining a proper value of the variance $\sigma^2_{t,rd\bar{b}}$.

An iterative procedure finds the inclusion probabilities $\pi_k$ which guarantee the minimum sample size and ensure that the following constraints are respected:

$$\frac{N}{N-Q} \sum_{k \in U} \Big(\frac{1}{\pi_k} - 1\Big) \eta_{rbd,k}^2 \le \bar{V}_{rbd} \quad (\text{for } b \neq \bar{b};\ d = 1, \dots, M_b;\ r = 1, \dots, R)$$

and

$$MSE(\hat{t}_{rd\bar{b},blup}) \le \bar{V}_{rd\bar{b}} \quad (d = 1, \dots, M_{\bar{b}};\ r = 1, \dots, R).$$