Developing a theoretical framework for selective editing based on modelling and optimization

Work Session on Statistical Data Editing Budapest, 14-16 September 2015

Pedro Revilla, INE Spain

Outline

• Editing based on modelling

• Macroediting tools based on models

• Selective editing based on optimization

Editing based on modelling

• Specification of the edits based on the use of statistical models

• Essential editing criterion: data have to be consistent with the available information

• The available information is summarized in a model

• Data are consistent when they are close to the model prediction

References

- Revilla, P. and Rey, P. (1999). “Selective editing methods based on time series modelling.” UN/ECE Work Session on SDE. 2-4 June 1999, Rome.

- Revilla, P. (2002). “An E&I method based on time series modelling designed to improve timeliness.” UN/ECE Work Session on SDE. 27-29 May 2002, Helsinki.

Microediting

Edit

$P(\hat{x}_{ijt} - 1.96\,\sigma_{ij} < x_{ijt} < \hat{x}_{ijt} + 1.96\,\sigma_{ij}) = 0.95$

Microimputation

$\hat{x}_{ijt}$
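
A minimal sketch of the microediting rule above, assuming the model supplies a prediction and a standard deviation for each reported value. The function name `model_based_edit` and the figures are illustrative only, and falling back to the prediction when the edit fails is an assumption about how the microimputation is used.

```python
def model_based_edit(observed, predicted, sigma, z=1.96):
    """Check one reported value against its model-based acceptance interval.

    Returns (passes_edit, value_to_use). Using the prediction as the
    fallback value is an assumption made for this sketch.
    """
    lower = predicted - z * sigma
    upper = predicted + z * sigma
    passes = lower < observed < upper
    return passes, (observed if passes else predicted)


# Hypothetical figures: prediction 100 with sigma 5 gives the interval (90.2, 109.8).
print(model_based_edit(104.0, 100.0, 5.0))  # (True, 104.0)
print(model_based_edit(115.0, 100.0, 5.0))  # (False, 100.0)
```

The same check applies unchanged to the macroediting case, with an index and its forecast in place of the unit-level value.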

Macroediting

Edit

$P(\hat{I}_{it} - 1.96\,\sigma_i < I_{it} < \hat{I}_{it} + 1.96\,\sigma_i) = 0.95$

Macroimputation

$\hat{I}_{it}$

Selective editing tools

• Surprises

• Influences

Surprise 𝑆𝑖,𝑡 for the index 𝐼𝑖,𝑡

Relative change between the observed and the forecasted data

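$S_{i,t} = \dfrac{I_{i,t} - \hat{I}_{i,t}}{\hat{I}_{i,t}}$

Distribution

• $e_{i,t} = \ln I_{i,t} - \ln \hat{I}_{i,t} \approx \dfrac{I_{i,t} - \hat{I}_{i,t}}{\hat{I}_{i,t}}$

• $e_{i,t}$ is $N(0, \sigma_i)$, so $S_{i,t}$ is approximately $N(0, \sigma_i)$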

Confidence interval for the surprises

$P(-1.96\,\sigma_i < S_{i,t} \le 1.96\,\sigma_i) = 0.95$

Edit

Outliers can be defined as the indices whose surprise lies outside the confidence interval
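
As an illustration, the surprise and the outlier edit can be computed as below. This is a sketch with made-up index values. The names `surprise` and `is_outlier` are not from the slides, and `sigma` is assumed to come from the fitted model.

```python
def surprise(observed, forecast):
    """Relative change between the observed and the forecasted index."""
    return (observed - forecast) / forecast


def is_outlier(observed, forecast, sigma, z=1.96):
    """True when the surprise lies outside (-z*sigma, z*sigma]."""
    s = surprise(observed, forecast)
    return not (-z * sigma < s <= z * sigma)


# Hypothetical indices, both with sigma = 0.02, so the bounds are +/- 0.0392.
print(is_outlier(103.0, 100.0, 0.02))  # False (surprise 0.03)
print(is_outlier(110.0, 100.0, 0.02))  # True  (surprise 0.10)
```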

Surprises

Standard Surprise

$\dfrac{S_{i,t}}{\sigma_i} = \dfrac{I_{i,t} - \hat{I}_{i,t}}{\hat{I}_{i,t}} \cdot \dfrac{1}{\sigma_i}$

It allows the comparison between indices of different variability

Weighted Standard Surprise

$\dfrac{S_{i,t}}{\sigma_i}\, w_i = \dfrac{I_{i,t} - \hat{I}_{i,t}}{\hat{I}_{i,t}} \cdot \dfrac{w_i}{\sigma_i}$

It allows ranking indices taking into account not only the magnitude of the surprise but also the different weights
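
The ranking can be sketched as follows, assuming indices are ordered by the absolute value of their weighted standard surprise. The record layout, names and numbers are invented for the example.

```python
def weighted_standard_surprise(observed, forecast, sigma, weight):
    """Surprise divided by sigma (standard surprise), then multiplied by the weight."""
    s = (observed - forecast) / forecast
    return s / sigma * weight


def rank_by_weighted_surprise(records):
    """records: iterable of (name, observed, forecast, sigma, weight).

    Returns (name, score) pairs, largest absolute score first.
    """
    scored = [
        (name, weighted_standard_surprise(obs, fc, sigma, w))
        for name, obs, fc, sigma, w in records
    ]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical indices: a modest surprise on a heavy, stable index ("B")
# outranks a larger surprise on a light, volatile one ("C").
records = [
    ("A", 103.0, 100.0, 0.02, 5.0),
    ("B", 98.0, 100.0, 0.01, 8.0),
    ("C", 110.0, 100.0, 0.05, 1.0),
]
for name, score in rank_by_weighted_surprise(records):
    print(name, round(score, 2))
```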

Influence over the aggregated index 𝐼𝑡

The influence of an individual datum on an aggregated magnitude is the difference between the observed aggregated magnitude and the value of that same magnitude when the individual datum is not available and is replaced by its imputed value

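$\mathrm{INF}_{i_0,j_0}^{I_t} = \sum_i w_i I_{i,t} - \left[ \sum_{i \ne i_0} w_i I_{i,t} + w_{i_0} I_{i_0,t-1} \dfrac{\sum_{j \ne j_0} q_{i_0,j,t} + \hat{q}_{i_0,j_0,t}}{\sum_j q_{i_0,j,t-1}} \right]$

$= w_{i_0} I_{i_0,t-1} \dfrac{q_{i_0,j_0,t} - \hat{q}_{i_0,j_0,t}}{\sum_j q_{i_0,j,t-1}}$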

Influence factors

$\mathrm{INF}_{i_0,j_0}^{I_t} = w_{i_0} I_{i_0,t-1} \dfrac{q_{i_0,j_0,t} - \hat{q}_{i_0,j_0,t}}{\sum_j q_{i_0,j,t-1}}$

• The product (or activity) weight 𝑤𝑖0

• The index 𝐼𝑖0,𝑡−1 which “updates” the weight

• A measure of the relative discrepancy between the observed and the imputed data

$\dfrac{q_{i_0,j_0,t} - \hat{q}_{i_0,j_0,t}}{\sum_j q_{i_0,j,t-1}}$
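
A small numerical check of the influence formula. The sketch assumes that a branch index is updated as the previous index times the ratio of current to previous unit totals, which is the form implied by the derivation. All names and figures are hypothetical.

```python
def influence(weight, prev_index, q_observed, q_imputed, q_prev_total):
    """Closed form: weight * previous index * relative discrepancy."""
    return weight * prev_index * (q_observed - q_imputed) / q_prev_total


def influence_by_difference(weight, prev_index, q_now, q_prev, j0, q_imputed):
    """Same quantity from the definition: weighted branch index with the
    observed datum minus the weighted branch index with the imputed datum.
    Other branches cancel, so only branch i0 is needed."""
    q_with_imputation = list(q_now)
    q_with_imputation[j0] = q_imputed
    index_observed = prev_index * sum(q_now) / sum(q_prev)
    index_imputed = prev_index * sum(q_with_imputation) / sum(q_prev)
    return weight * (index_observed - index_imputed)


# Hypothetical branch: weight 0.2, previous index 100, three reporting units.
q_prev = [40.0, 30.0, 30.0]
q_now = [50.0, 30.0, 20.0]
closed = influence(0.2, 100.0, q_now[0], 45.0, sum(q_prev))
direct = influence_by_difference(0.2, 100.0, q_now, q_prev, 0, 45.0)
print(round(closed, 6), round(direct, 6))  # 1.0 1.0
```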

Surprises

| Sector | Actual rate | Forecasted rate | Surprise | Standard surprise | Weighted standard surprise |
|---|---|---|---|---|---|
| 4243 | 70.28 | 3.32 | 64.73 | 3.79 | 17.10 |
| 2511 | -27.73 | -3.29 | -25.25 | -3.11 | -16.93 |
| 4110 | -50.24 | -6.89 | -70.96 | -6.84 | -16.89 |
| 2514 | -15.92 | 4.64 | -19.62 | -3.00 | -16.87 |
| 2512 | 39.39 | -11.83 | 58.12 | 7.22 | 16.51 |
| 4752 | -0.74 | 2.06 | -2.75 | -1.09 | -15.66 |
| 3299 | -11.97 | 4.45 | -15.70 | -2.02 | -15.57 |
| 4751 | 22.82 | -7.36 | 32.55 | 2.34 | 14.64 |
| 3630 | -0.28 | 3.68 | -3.82 | -0.81 | -14.54 |
| 3166 | 15.97 | 5.83 | 9.58 | 1.89 | 13.92 |

Macroediting tools based on models

• We use a RegARIMA model to estimate a set of characteristics of a short term indicator

• Characteristics:

- Level behaviour

- Seasonal behaviour

- Calendar effects

- Other deterministic effects

- Outliers

- Uncertainty

References

- Revilla, P. and Rey, P. (2000). “Analysis and quality control from ARIMA modelling.” UN/ECE Work Session on SDE. 18-20 October 2000, Cardiff.

[Figure panels with a monthly axis (Jan to Dec): Gas Manufacturing, Dairy Industries, Beer Brewing, Clothing Industries]

Index of Industrial Production

| Region | Level behaviour | Seasonal behaviour | Working-days effect (%) | Easter effect (%) | February strike effect (%) | Outliers | Uncertainty (%) |
|---|---|---|---|---|---|---|---|
| National Total | Trend | Yes | 1.9 | -4.2 | -3.8 | | 2.1 |
| Andalusia | Trend | Yes | 1.7 | -2.0 | (*) +7.5 | Jan 1996 (-), Feb 1997 (+), Feb 1998 (+) | 2.7 |
| Aragón | Trend | Yes | 2.1 | -4.1 | (*) -6.6 | Dec 1994 (+), Feb 1997 (-) | 2.2 |
| Asturias | Trend | Yes | 1.3 | -4.3 | -4.4 | | 2.3 |
| Balearic Islands | Trend | Yes | 1.6 | -3.5 | | | 2.3 |
| Canary Islands | Local oscillations | No | 1.8 | -3.7 | | Mar 1994 (+), Mar 1997 (+) | 2.9 |
| Cantabria | Trend | Yes | 1.9 | -4.2 | | | 2.8 |
| Castilla-León | Trend | Yes | 1.6 | -5.2 | (*) -8.1 | Feb 1996 (-), Jul 1995 (-), Dec 1996 (-), Feb 1997 (-), Nov 1997 (-) | |

Index of Industrial Production (continued)

| Region | Level behaviour | Seasonal behaviour | Working-days effect (%) | Easter effect (%) | February strike effect (%) | Outliers | Uncertainty (%) |
|---|---|---|---|---|---|---|---|
| Castilla-La Mancha | Trend | Yes | 0.8 | 4.0 | | | 2.7 |
| Catalonia | Trend | Yes | 2.0 | -4.7 | -4.6 | | 2.4 |
| Valencian Community | Trend | Yes | 2.2 | -4.6 | -4.4 | | 1.7 |
| Estremadura | Local oscillations | No | 1.3 | | | | 4.5 |
| Galicia | Trend | Yes | 1.9 | -3.3 | (*) -5.6 | Dec 1995 (+), Feb 1997 (-) | 1.9 |
| Madrid | Trend | Yes | 1.6 | -5.2 | | | 3.0 |
| Murcia Region | Trend | Yes | 2.1 | -3.2 | | | 2.7 |
| Navarre | Trend | Yes | 2.4 | -4.7 | (*) -7.6 | Feb 1997 (-) | 2.6 |
| Basque Country | Trend | Yes | 2.4 | -3.8 | (*) -7.7 | Aug 1993 (-), Feb 1997 (-) | 2.6 |
| Rioja | Trend | Yes | 2.5 | -4.9 | | Jan 1994 (-), Dec 1994 (+) | 3.5 |

Selective editing based on optimization

• Selective editing as an optimization problem

• To reconcile two objectives (reduce editing work and maintain quality at the aggregate level)

• We determine a selection strategy that edits the minimum number of units while meeting given accuracy requirements for the aggregates

• Score functions can be derived from this approach

References

- Arbués, I., Revilla, P. and Salgado, D. (2013). “An optimization approach to selective editing.” Journal of Official Statistics (JOS).

- Arbués, I., González, M. and Revilla, P. (2012). “A class of stochastic optimization problems with application to selective data editing.” Optimization.

General approach

True ($y_k^0$), observed ($y_k^{obs}$) and edited ($y_k^{edit}$) values

The decision variables are the components of the selection strategy vector

$R^T = (R_1, R_2, \ldots, R_n)$

for the sample units $k = 1, \ldots, n$,

where $R_k = 0$ if unit $k$ is selected for interactive editing and $R_k = 1$ otherwise

Objective function to maximize

The objective function to maximize, given the available information $Z$, is

$E_m\left[\sum_k R_k \mid Z\right]$

(in matrix notation, $E_m[\mathbf{1}^T R \mid Z]$, where $\mathbf{1}$ stands for a vector of ones),

whose maximization amounts to minimizing the expected number of selected units

Constraints

Each constraint controls the loss of accuracy in terms of the chosen loss function L due to non-selected units

The two loss functions most used in practice:

- absolute loss: $L = L_1(a, b) = |a - b|$

- squared loss: $L = L_2(a, b) = (a - b)^2$

For these loss functions, each constraint can always be written as a bound on a quadratic form, denoted by $E_m[R^T \Delta R \mid Z]$

Generic optimization problem

$[P_0] \quad \max\; E_m[\mathbf{1}^T R \mid Z]$

$\text{s.t.} \quad E_m[R^T \Delta^{(q)} R \mid Z] \le \eta_q, \quad q = 1, 2, \ldots, Q$

$R \in \Omega_0$

$\Omega_0$ denotes the admissible outcome space of $R$

$q$ indexes the constraints (several constraints may arise because the questionnaire contains multiple variables of interest)

Different choices of the auxiliary information $Z$ and of the subset $S_0$ of sought selection strategies in the general problem $P_0$ lead to different versions of the optimization problem

Stochastic version

No auxiliary information is used, and the sought selection strategies are of the form

$R \in S: \quad R_k = \begin{cases} 1 & \text{if } \xi_k < Q_k \\ 0 & \text{if } \xi_k > Q_k \end{cases}$

where $\xi_k$ is a $U(0, 1)$ random variable and $Q_k = Q_k(X, Y^{obs}, S)$ is a continuous random variable
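
A minimal sketch of how a strategy of this form produces a selection, assuming the thresholds are already available as numbers in [0, 1]. The list `q` and the function name `draw_selection` are hypothetical, and how the thresholds are obtained is not covered here.

```python
import random


def draw_selection(q, rng):
    """Draw one realization of the strategy.

    For each unit, R_k = 1 (not selected) when a uniform draw falls below
    the threshold q[k], and R_k = 0 (selected for editing) otherwise.
    """
    return [1 if rng.random() < q_k else 0 for q_k in q]


# Hypothetical thresholds: unit 0 is rarely selected, unit 2 almost always.
q = [0.9, 0.5, 0.1]
rng = random.Random(0)
draws = [draw_selection(q, rng) for _ in range(10_000)]

# Share of draws in which each unit is selected; close to 1 - q[k].
selected_share = [sum(1 - r[k] for r in draws) / len(draws) for k in range(len(q))]
print([round(s, 2) for s in selected_share])
```

Under this form the expected number of selected units is the sum of 1 - Q_k, so larger thresholds mean less editing work.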

Combinatorial version

We use all the available auxiliary information together with the observed values in the sample, and do not restrict the form of the sought selection strategies ($S_0 = S$)

$[P_{co}] \quad \max\; \mathbf{1}^T r$

$\text{s.t.} \quad r^T M^{(q)} r \le m_q^2, \quad q = 1, \ldots, Q$

$r \in B^n$, where $r$ is the realized selection and $B = \{0, 1\}$

$M^{(q)}$ condenses the modelling of the measurement error

$m_q$ are bounds chosen by the statistician
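
For a handful of units the combinatorial problem can be solved by plain enumeration, which makes the formulation concrete. This brute-force sketch is only an illustration and does not scale beyond small n. It is not the algorithm of the cited papers, and the matrix and bound are invented.

```python
from itertools import product


def quadratic_form(r, m):
    """Compute r^T M r for a 0/1 tuple r and a square matrix m (list of rows)."""
    n = len(r)
    return sum(r[i] * m[i][j] * r[j] for i in range(n) for j in range(n))


def solve_combinatorial(matrices, bounds):
    """Maximize the number of non-selected units (sum of r) subject to
    r^T M^(q) r <= m_q^2 for every constraint q, by trying all 0/1 vectors."""
    n = len(matrices[0])
    best = None
    for r in product((0, 1), repeat=n):
        feasible = all(
            quadratic_form(r, m) <= bound ** 2
            for m, bound in zip(matrices, bounds)
        )
        if feasible and (best is None or sum(r) > sum(best)):
            best = r
    return best


# Hypothetical error model for three units and a single constraint (Q = 1).
m1 = [
    [4.0, 0.5, 0.0],
    [0.5, 1.0, 0.1],
    [0.0, 0.1, 0.25],
]
print(solve_combinatorial([m1], [1.3]))  # (0, 1, 1): only unit 0 is edited
```

Leaving unit 0 unedited would alone contribute 4.0 to the quadratic form, above the bound 1.3² = 1.69, so it must be selected. The other two units together contribute 1.45 and can be left unedited.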

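Score functions obtained from the optimization approach

An additional assumption is made: the cross-unit terms in each constraint are neglected

The constraints can then be rewritten as

$E_m[R^T \Delta^{(q)} R \mid Z] = E_m[R^T \mathrm{diag}(\Delta^{(q)}) \mid Z]$

which is linear in $R$, since $R_k^2 = R_k$ for binary $R_k$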

Score function

One constraint (Q = 1)

Unit $k$ is selected provided $M_{kk} > 1/\lambda^*$

$M_{kk}$ can be regarded as a single score and $1/\lambda^*$ as the threshold value.

$\lambda^* M_{kk}$ can be considered a “standardized” score, in the sense that the threshold value is generically set to 1.

Multiple constraints (Q > 1)

Each $\lambda_q^* M_{kk}^{(q)}$ is a standardized local score;

$\sum_q \lambda_q^* M_{kk}^{(q)}$ is the standardized global score, with the generic global threshold value 1.
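
A minimal sketch of the thresholding rule, assuming the starred coefficients that multiply the diagonal terms are already known and passed in as `multipliers`. That name, `diagonals` and all values are hypothetical, and how the coefficients are computed is not shown.

```python
def global_scores(diagonals, multipliers):
    """Standardized global score of each unit.

    diagonals[q][k] is the diagonal term of unit k in constraint q;
    multipliers[q] is the starred coefficient of constraint q.
    """
    n_units = len(diagonals[0])
    return [
        sum(coef * diag[k] for coef, diag in zip(multipliers, diagonals))
        for k in range(n_units)
    ]


def select_units(diagonals, multipliers, threshold=1.0):
    """Indices of the units whose global score exceeds the threshold."""
    scores = global_scores(diagonals, multipliers)
    return [k for k, score in enumerate(scores) if score > threshold]


# Hypothetical example: two constraints (Q = 2), three units.
diagonals = [
    [4.0, 1.0, 0.25],  # constraint 1
    [0.5, 3.0, 0.2],   # constraint 2
]
multipliers = [0.5, 0.25]
print(global_scores(diagonals, multipliers))  # [2.125, 1.25, 0.175]
print(select_units(diagonals, multipliers))   # [0, 1]
```

With a single constraint the same function reduces to comparing the diagonal term with the reciprocal of the coefficient.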

Evaluation

Compare the performance of the score obtained under the optimization approach to that of the score-function described, for example, in Hedlin (2003)

$\delta_i^0 = \omega_i\,|x_i^{obs} - x_i^{pre}|$, prediction: data from $t-1$

$\delta_i^1 = \omega_i\,|x_i^{obs} - x_i^{pre}|$, prediction: ARIMA model

$\delta_i^2$: optimization method

Effectiveness

$E1_j^n = \sum_{i \ge n} (\omega_{ij})^2 (x_{ij}^{obs} - x_{ij}^0)^2$

$E2_j^n = \left[ \sum_{i \ge n} \omega_{ij} (x_{ij}^{obs} - x_{ij}^0) \right]^2$

(Units arranged in descending order according to the corresponding score function)

These measures can be interpreted as estimates of the error remaining after the first n units have been edited
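
The two measures can be computed for any score function as sketched below, for a single variable. The sketch assumes that editing a unit removes its error entirely, so the sums run over the units ranked after the first `n_edited`. The function name and data are illustrative.

```python
def remaining_error(scores, weights, observed, true_values, n_edited):
    """Return (E1, E2) over the units left unedited.

    Units are ranked by descending score; the first n_edited are assumed
    to be edited (their error removed) and the rest keep their error.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    unedited = order[n_edited:]
    terms = [weights[i] * (observed[i] - true_values[i]) for i in unedited]
    e1 = sum(t * t for t in terms)  # sum of squared weighted errors
    e2 = sum(terms) ** 2            # squared sum of weighted errors
    return e1, e2


# Hypothetical data: three units with weighted errors 6.0, -0.5 and 1.0.
scores = [5.0, 1.0, 3.0]
weights = [2.0, 1.0, 1.0]
observed = [10.0, 5.0, 7.0]
true_values = [7.0, 5.5, 6.0]

print(remaining_error(scores, weights, observed, true_values, 0))  # (37.25, 42.25)
print(remaining_error(scores, weights, observed, true_values, 1))  # (1.25, 0.25)
```

Evaluating the function for every value of `n_edited` traces the remaining error as a function of the number of edited units, so different score functions can be compared on the same data.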

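Comparison of score function methods

| Score | Turnover $E1$ | Turnover $E2$ | Orders $E1$ | Orders $E2$ |
|---|---|---|---|---|
| $\delta^0$ | 0.43 | 0.44 | 1.16 | 1.33 |
| $\delta^1$ | 0.30 | 0.38 | 0.36 | 0.45 |
| $\delta^2$ | 0.21 | 0.26 | 0.28 | 0.37 |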

Figure: remaining error as a function of the number of edited questionnaires (N ≈ 9000).

Final remarks

• We have introduced theoretical frameworks using models and optimization techniques

• We treat the search for an adequate selection strategy as a generic optimization problem with a stochastic and a combinatorial version. We have shown that a certain score function provides the solution to the problem with linear constraints

• In general, the resulting selection outperforms that obtained with traditional score functions

• More methodological research and practical experience are needed
