Developing a theoretical framework for selective editing based on modelling and optimization
Work Session on Statistical Data Editing Budapest, 14-16 September 2015
Pedro Revilla, INE Spain
Outline
• Editing based on modelling
• Macroediting tools based on models
• Selective editing based on optimization
Editing based on modelling
• Specification of the edits based on the use of statistical models
• Essential editing criterion: data have to be consistent with the available information
• The available information is summarized in a model
• Data are consistent when they are close to the model prediction
References
- Revilla, P. and Rey, P. (1999). “Selective editing methods based on time series modelling.” UN/ECE Work Session on SDE. 2-4 June 1999, Rome.
- Revilla, P. (2002) “An E&I method based on time series modelling designed to improve timeliness.” UN/ECE Work Session on SDE. 27 – 29 May 2002, Helsinki.
Microediting
Edit
$$P\left(\hat{x}_{ijt} - 1.96\,\sigma_{ij} < x_{ijt} < \hat{x}_{ijt} + 1.96\,\sigma_{ij}\right) = 0.95$$
Microimputation
$$\hat{x}_{ijt}$$
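The interval edit and the imputation above can be sketched in Python as follows. This is only an illustration: the function name, its arguments and the example numbers are assumptions, and so is the choice to impute the forecast automatically whenever the edit fails.

```python
def micro_edit(observed, forecast, sigma, z=1.96):
    """Apply the interval edit to one observation.

    Returns (passes, value): `passes` is True when the observation lies
    strictly inside forecast +/- z * sigma; `value` is the observation if
    it passes and the forecast (used as the imputed value) otherwise.
    """
    lower = forecast - z * sigma
    upper = forecast + z * sigma
    passes = lower < observed < upper
    return passes, (observed if passes else forecast)


# Hypothetical numbers: forecast 100 with sigma 5 gives the interval (90.2, 109.8).
print(micro_edit(105.0, 100.0, 5.0))  # inside the interval: the datum is kept
print(micro_edit(120.0, 100.0, 5.0))  # outside: the forecast is imputed
```

The same check applies at the macro level by passing an index and its forecast instead of a single datum.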
Macroediting
Edit
$$P\left(\hat{I}_{it} - 1.96\,\sigma_{i} < I_{it} < \hat{I}_{it} + 1.96\,\sigma_{i}\right) = 0.95$$
Macroimputation
$$\hat{I}_{it}$$
Selective editing tools
• Surprises
• Influences
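Surprise $S_{i,t}$ for the index $I_{i,t}$
Relative change between the observed and the forecasted data
$$S_{i,t} = \frac{I_{i,t} - \hat{I}_{i,t}}{\hat{I}_{i,t}}$$
Distribution
• $e_{i,t} = \ln I_{i,t} - \ln \hat{I}_{i,t} \cong \dfrac{I_{i,t} - \hat{I}_{i,t}}{\hat{I}_{i,t}}$
• $e_{i,t}$ is $N(0, \sigma_i)$, so $S_{i,t}$ is approximately $N(0, \sigma_i)$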
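The approximation between the log residual and the surprise follows from a first-order expansion of the logarithm. Writing $\hat{I}_{i,t}$ for the forecast, so that $I_{i,t}/\hat{I}_{i,t} = 1 + S_{i,t}$, a short worked step (added here for illustration) is:

```latex
\begin{aligned}
e_{i,t} &= \ln I_{i,t} - \ln \hat{I}_{i,t}
         = \ln\frac{I_{i,t}}{\hat{I}_{i,t}}
         = \ln\left(1 + S_{i,t}\right) \\
        &= S_{i,t} - \tfrac{1}{2} S_{i,t}^{2} + \cdots
         \approx S_{i,t} \qquad \text{for small } |S_{i,t}|
\end{aligned}
```

For example, a surprise of 0.05 corresponds to a log residual of $\ln(1.05) \approx 0.0488$, so the two quantities are close when the forecast error is moderate.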
Confidence interval for the surprises
$$P\left(-1.96\,\sigma_i < S_{i,t} \le 1.96\,\sigma_i\right) = 0.95$$
Edit
Outliers can be defined as the indices whose surprise lies outside the confidence interval
Surprises
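Standard Surprise
$$\frac{S_{i,t}}{\sigma_i} = \frac{I_{i,t} - \hat{I}_{i,t}}{\hat{I}_{i,t}} \cdot \frac{1}{\sigma_i}$$
It allows the comparison between indices of different variability
Weighted Standard Surprise
$$\frac{S_{i,t}}{\sigma_i}\, w_i = \frac{I_{i,t} - \hat{I}_{i,t}}{\hat{I}_{i,t}} \cdot \frac{w_i}{\sigma_i}$$
It allows ranking indices taking into account not only the magnitude of the surprise but also the different weights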
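A minimal Python sketch of these quantities is given below. It computes the surprise, the standard surprise and the weighted standard surprise, flags indices whose surprise falls outside the 1.96 sigma interval, and ranks by the absolute weighted standard surprise. The record layout, the function names and the two example indices are hypothetical and serve only to illustrate the formulas.

```python
def surprise(observed, forecast):
    """Relative change between the observed and the forecasted index."""
    return (observed - forecast) / forecast


def rank_by_weighted_standard_surprise(records):
    """records: dicts with keys name, observed, forecast, sigma, weight."""
    rows = []
    for rec in records:
        s = surprise(rec["observed"], rec["forecast"])
        standard = s / rec["sigma"]
        rows.append({
            "name": rec["name"],
            "surprise": s,
            "standard": standard,
            "weighted": standard * rec["weight"],
            # Outlier edit: surprise outside +/- 1.96 * sigma.
            "outlier": abs(s) > 1.96 * rec["sigma"],
        })
    # Largest absolute weighted standard surprise first.
    return sorted(rows, key=lambda r: abs(r["weighted"]), reverse=True)


# Hypothetical indices A and B.
records = [
    {"name": "A", "observed": 110.0, "forecast": 100.0, "sigma": 0.04, "weight": 2.0},
    {"name": "B", "observed": 95.0, "forecast": 100.0, "sigma": 0.05, "weight": 8.0},
]
for row in rank_by_weighted_standard_surprise(records):
    print(row["name"], round(row["weighted"], 2), row["outlier"])
```

In this example index B is ranked first because of its larger weight, although only index A fails the outlier edit. This shows why the weighted version is useful for prioritising review.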
Influence over the aggregated index $I_t$
The influence of an individual datum over an aggregated magnitude is the difference between the observed aggregated magnitude and the value of this same magnitude when the individual datum is not available
$$INF^{I_t}_{i_0,j_0} = \sum_i w_i I_{i,t} - \left[\sum_{i \ne i_0} w_i I_{i,t} + w_{i_0} I_{i_0,t-1}\, \frac{\sum_{j \ne j_0} q_{i_0,j,t} + \hat{q}_{i_0,j_0,t}}{\sum_j q_{i_0,j,t-1}}\right]$$
$$= w_{i_0} I_{i_0,t-1}\, \frac{q_{i_0,j_0,t} - \hat{q}_{i_0,j_0,t}}{\sum_j q_{i_0,j,t-1}}$$
Influence factors
$$INF^{I_t}_{i_0,j_0} = w_{i_0} I_{i_0,t-1}\, \frac{q_{i_0,j_0,t} - \hat{q}_{i_0,j_0,t}}{\sum_j q_{i_0,j,t-1}}$$
• The product (or activity) weight $w_{i_0}$
• The index $I_{i_0,t-1}$, which “updates” the weight
• A measure of the relative discrepancy between the observed and the imputed data
$$\frac{q_{i_0,j_0,t} - \hat{q}_{i_0,j_0,t}}{\sum_j q_{i_0,j,t-1}}$$
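The influence formula can be checked numerically with a small Python sketch. It assumes, as the formula implies, that the elementary index of an activity is the previous index multiplied by the ratio of current to previous totals of the reported quantities. All names and numbers are hypothetical.

```python
def influence(weight, prev_index, q_observed, q_imputed, prev_quantities):
    """Closed form: weight * previous index * relative discrepancy."""
    return weight * prev_index * (q_observed - q_imputed) / sum(prev_quantities)


def elementary_index(prev_index, quantities, prev_quantities):
    """Assumed chaining rule: I_t = I_{t-1} * sum(q_t) / sum(q_{t-1})."""
    return prev_index * sum(quantities) / sum(prev_quantities)


# Hypothetical activity i0 with three respondents; respondent j0 = 1 is examined.
weight, prev_index = 0.2, 100.0
prev_q = [30.0, 40.0, 30.0]
curr_q = [35.0, 50.0, 25.0]
q_imputed = 40.0  # imputed value for respondent j0

# Definition: observed aggregate minus the aggregate with the datum imputed.
# Other activities contribute equally to both aggregates and cancel out.
with_imputed = [curr_q[0], q_imputed, curr_q[2]]
by_definition = weight * (
    elementary_index(prev_index, curr_q, prev_q)
    - elementary_index(prev_index, with_imputed, prev_q)
)
by_formula = influence(weight, prev_index, curr_q[1], q_imputed, prev_q)
print(by_definition, by_formula)
```

Both routes give the same value, which is the point of the simplification: only the weight, the previous index and the relative discrepancy of the single datum are needed.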
Surprises
| Sector | Actual rate | Forecasted rate | Surprise | Standard surprise | Weighted standard surprise |
|---|---|---|---|---|---|
| 4243 | 70.28 | 3.32 | 64.73 | 3.79 | 17.10 |
| 2511 | -27.73 | -3.29 | -25.25 | -3.11 | -16.93 |
| 4110 | -50.24 | -6.89 | -70.96 | -6.84 | -16.89 |
| 2514 | -15.92 | 4.64 | -19.62 | -3.00 | -16.87 |
| 2512 | 39.39 | -11.83 | 58.12 | 7.22 | 16.51 |
| 4752 | -0.74 | 2.06 | -2.75 | -1.09 | -15.66 |
| 3299 | -11.97 | 4.45 | -15.70 | -2.02 | -15.57 |
| 4751 | 22.82 | -7.36 | 32.55 | 2.34 | 14.64 |
| 3630 | -0.28 | 3.68 | -3.82 | -0.81 | -14.54 |
| 3166 | 15.97 | 5.83 | 9.58 | 1.89 | 13.92 |
Macroediting tools based on models
• We use a RegARIMA model to estimate a set of characteristics of a short-term indicator
• Characteristics:
- Level behaviour
- Seasonal behaviour
- Calendar effects
- Other deterministic effects
- Outliers
- Uncertainty
References
- Revilla, P. and Rey, P. (2000). “Analysis and quality control from ARIMA modelling”. UN/ECE Work Session on SDE. Cardiff, 18-20 October 2000.
[Figures: Gas Manufacturing, Dairy Industries, Beer Brewing, Clothing Industries; horizontal axis: months, Jan to Dec]
Index of Industrial Production

| Region | Level behaviour | Seasonal behaviour | Working-days effect % | Easter effect % | February strike effect % | Outliers | Uncertainty % |
|---|---|---|---|---|---|---|---|
| National Total | Trend | Yes | 1.9 | -4.2 | -3.8 | | 2.1 |
| Andalusia | Trend | Yes | 1.7 | -2.0 | (*) +7.5 | Jan 1996 (-), Feb 1997 (+), Feb 1998 (+) | 2.7 |
| Aragón | Trend | Yes | 2.1 | -4.1 | (*) -6.6 | Dec 1994 (+), Feb 1997 (-) | 2.2 |
| Asturias | Trend | Yes | 1.3 | -4.3 | -4.4 | | 2.3 |
| Balearic Islands | Trend | Yes | 1.6 | -3.5 | | | 2.3 |
| Canary Islands | Local oscillations | No | 1.8 | -3.7 | | Mar 1994 (+), Mar 1997 (+) | 2.9 |
| Cantabria | Trend | Yes | 1.9 | -4.2 | | | 2.8 |
| Castilla-León | Trend | Yes | 1.6 | -5.2 | (*) -8.1 | Feb 1996 (-), Jul 1995 (-), Dec 1996 (-), Feb 1997 (-), Nov 1997 (-) | |
| Castilla-La Mancha | Trend | Yes | 0.8 | 4.0 | | | 2.7 |
| Catalonia | Trend | Yes | 2.0 | -4.7 | -4.6 | | 2.4 |
| Valencian Community | Trend | Yes | 2.2 | -4.6 | -4.4 | | 1.7 |
| Estremadura | Local oscillations | No | 1.3 | | | | 4.5 |
| Galicia | Trend | Yes | 1.9 | -3.3 | (*) -5.6 | Dec 1995 (+), Feb 1997 (-) | 1.9 |
| Madrid | Trend | Yes | 1.6 | -5.2 | | | 3.0 |
| Murcia Region | Trend | Yes | 2.1 | -3.2 | | | 2.7 |
| Navarre | Trend | Yes | 2.4 | -4.7 | (*) -7.6 | Feb 1997 (-) | 2.6 |
| Basque Country | Trend | Yes | 2.4 | -3.8 | (*) -7.7 | Aug 1993 (-), Feb 1997 (-) | 2.6 |
| Rioja | Trend | Yes | 2.5 | -4.9 | | Jan 1994 (-), Dec 1994 (+) | 3.5 |
Selective editing based on optimization
• Selective editing as an optimization problem
• To reconcile two objectives (reduce editing work and maintain quality at the aggregate level)
• We determine a selection strategy that edits the minimum number of units while meeting given accuracy requirements for the aggregates
• Score functions can be obtained
References
- Arbués, I., Revilla, P. and Salgado, D. (2013). “An optimization approach to selective editing.” Journal of Official Statistics (JOS).
- Arbués, I., González, M. and Revilla, P. (2012). “A class of stochastic optimization problems with application to selective data editing.” Optimization.
General approach
True ($y_k^{0}$), observed ($y_k^{obs}$) and edited ($y_k^{edit}$) values
The ultimate variables are the components of the selection strategy vector
$R^T = (R_1, R_2, \ldots, R_n)$
for the sample units $k = 1, \ldots, n$,
where $R_k = 0$ if unit $k$ is selected for interactive editing and $R_k = 1$ otherwise
Objective function to maximize
The objective function to maximize, given the available information $Z$, is
$$E_m\left[\sum_k R_k \mid Z\right]$$
(in matrix notation, $E_m\left[\mathbf{1}^T R \mid Z\right]$, where $\mathbf{1}$ stands for a vector of ones),
whose maximization amounts to minimizing the number of selected units
Constraints
Each constraint controls the loss of accuracy, in terms of the chosen loss function $L$, due to non-selected units
The two loss functions most used in practice
- absolute loss $L = L^{(1)}(a, b) = |a - b|$
- squared loss $L = L^{(2)}(a, b) = (a - b)^2$
For these loss functions, each constraint can always be written as a bound on a quadratic form, denoted by $E_m\left[R^T \Delta R \mid Z\right]$
Generic optimization problem
$$[P_0] \quad \max\; E_m\left[\mathbf{1}^T R \mid Z\right]$$
$$\text{s.t. } E_m\left[R^T \Delta^{(q)} R \mid Z\right] \le \eta_q, \quad q = 1, 2, \ldots, Q$$
$$R \in \Omega_0$$
$\Omega_0$ denotes the admissible outcome space of $R$
$q$ indexes the constraints (several constraints may arise because the questionnaire contains multiple variables of interest)
By choosing the auxiliary information $Z$ and the subset $S_0$ of sought selection strategies in the general problem $P_0$, we obtain different versions of the optimization problem
Stochastic version
If no auxiliary information is used and the sought selection strategies are of the form
$$R \in S: \quad R_k = \begin{cases} 1 & \text{if } \xi_k < Q_k \\ 0 & \text{if } \xi_k > Q_k \end{cases}$$
where $\xi_k$ is a $U(0,1)$ random variable and $Q_k = Q_k(X, Y^{obs}, S)$ is a continuous random variable
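A randomized strategy of this form can be sketched in Python as below. The sketch takes realized values of the $Q_k$ as given inputs; how they are computed from the data is not shown here, and the function name and the example values are assumptions.

```python
import random


def draw_selection(q, rng):
    """Draw R_k = 1 (unit not selected) if xi_k < Q_k, else 0 (unit selected).

    q   : realized values of Q_k, one per unit (hypothetical inputs)
    rng : random.Random instance supplying the U(0,1) draws xi_k
    """
    return [1 if rng.random() < q_k else 0 for q_k in q]


rng = random.Random(0)
q = [0.9, 0.5, 0.1]  # hypothetical values of Q_k
r = draw_selection(q, rng)
selected_for_editing = [k for k, r_k in enumerate(r) if r_k == 0]
print(r, selected_for_editing)
```

Under this rule a unit with a large $Q_k$ is rarely selected for interactive editing, and a unit with $Q_k$ close to 0 is almost always selected.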
Combinatorial version
We use both all the available auxiliary information and the observed values in the sample, and do not restrict the form of the sought selection strategies ($S_0 = S$)
$$[P_{co}] \quad \max\; \mathbf{1}^T r$$
$$\text{s.t. } r^T M^{(q)} r \le m_q^2, \quad q = 1, \ldots, Q$$
$$r \in B^n$$
$r$ is the realized selection, with $B = \{0, 1\}$
$M^{(q)}$ condenses the modelling of the measurement error
$m_q$ are bounds chosen by the statistician
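For a handful of units the combinatorial problem can be solved by plain enumeration, which makes its structure concrete. The Python sketch below is a brute-force illustration only, not the algorithm used in the cited papers; the matrix, the bound and the function names are hypothetical, and enumeration is feasible only for very small n.

```python
from itertools import product


def quad_form(r, M):
    """Compute r^T M r for a 0/1 vector r and a square matrix M (list of lists)."""
    n = len(r)
    return sum(r[i] * M[i][j] * r[j] for i in range(n) for j in range(n))


def solve_combinatorial(M_list, bounds):
    """Maximize sum(r) subject to r^T M^(q) r <= m_q^2 for every constraint q."""
    n = len(M_list[0])
    best = None
    for r in product((0, 1), repeat=n):
        feasible = all(quad_form(r, M) <= m ** 2 for M, m in zip(M_list, bounds))
        if feasible and (best is None or sum(r) > sum(best)):
            best = r
    return best


# Hypothetical error model for three units and a single constraint (Q = 1).
M = [
    [4.00, 0.10, 0.00],
    [0.10, 1.00, 0.05],
    [0.00, 0.05, 0.25],
]
r_opt = solve_combinatorial([M], [1.2])
print(r_opt)  # r_k = 0 marks a unit selected for interactive editing
```

Here the unit with the largest diagonal entry cannot be left unedited without violating the bound, so it is the only one selected.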
Score functions obtained from the optimization approach
An additional assumption is to neglect the cross-unit terms in each constraint
Then these constraints can be rewritten as
$$E_m\left[R^T \Delta R \mid Z\right] = E_m\left[R^T \operatorname{diag}(\Delta) \mid Z\right]$$
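The step behind this rewriting can be spelled out as follows. The expansion is added for illustration; it uses the quadratic-form matrix $\Delta$ from the constraints and reads $\operatorname{diag}(\Delta)$ as the vector of diagonal entries of $\Delta$, which is an interpretation rather than something stated on the slide.

```latex
\begin{aligned}
R^T \Delta R &= \sum_k \Delta_{kk} R_k^2 + \sum_{k \ne l} \Delta_{kl} R_k R_l \\
             &\approx \sum_k \Delta_{kk} R_k^2
               \qquad \text{(cross-unit terms neglected)} \\
             &= \sum_k \Delta_{kk} R_k
               \qquad \text{(since } R_k \in \{0, 1\} \text{ implies } R_k^2 = R_k\text{)} \\
             &= R^T \operatorname{diag}(\Delta)
\end{aligned}
```

Each constraint therefore becomes linear in $R$, which is what allows a unit-by-unit score to solve the problem.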
Score function
One constraint (Q = 1)
Unit $k$ is selected provided $M_{kk} > 1/\lambda^*$
$M_{kk}$ can be regarded as a single score and $1/\lambda^*$ as the threshold value.
$\lambda^* M_{kk}$ can be considered as a “standardized” score, in the sense that the threshold value is generically set to 1.
Multiple constraints (Q > 1)
Each $\lambda_q^* M_{kk}^{(q)}$ is a standardized local score,
$\sum_q \lambda_q^* M_{kk}^{(q)}$ is the standardized global score, with the generic global threshold value 1.
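The selection rule can be sketched in Python as follows. The starred coefficients that multiply each $M_{kk}^{(q)}$ are treated as given inputs (the sketch does not compute them), and the function names and numbers are hypothetical.

```python
def global_scores(m_diag_by_constraint, multipliers):
    """Standardized global score per unit: sum over q of multiplier_q * M_kk^(q).

    m_diag_by_constraint : one list of diagonal entries M_kk^(q) per constraint q
    multipliers          : one starred coefficient per constraint (assumed given)
    """
    n = len(m_diag_by_constraint[0])
    return [
        sum(lam * m_diag[k] for lam, m_diag in zip(multipliers, m_diag_by_constraint))
        for k in range(n)
    ]


def select_units(scores, threshold=1.0):
    """Units whose standardized score exceeds the generic threshold 1."""
    return [k for k, s in enumerate(scores) if s > threshold]


# Hypothetical diagonals for two constraints and three units.
m_diag = [
    [4.0, 1.0, 0.25],  # constraint q = 1
    [0.5, 2.0, 0.10],  # constraint q = 2
]
multipliers = [0.5, 0.25]
scores = global_scores(m_diag, multipliers)
print(scores, select_units(scores))
```

With a single constraint the same functions apply with one-element lists, and the rule reduces to comparing $M_{kk}$ with the reciprocal of the coefficient.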
Evaluation
Compare the performance of the score obtained under the optimization approach with that of the score function described, for example, in Hedlin (2003)
$\delta_i^0 = \omega_i\,|x_i^{obs} - x_i^{pre}|$, prediction: data at $t-1$
$\delta_i^1 = \omega_i\,|x_i^{obs} - x_i^{pre}|$, prediction: ARIMA model
$\delta_i^2$: optimization method
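The first two scores have the same form and differ only in where the prediction comes from. A minimal Python sketch, assuming the score is the weight times the absolute difference between the observed and the predicted value, and using the previous-period value as the prediction, is shown below. The names and the data are hypothetical.

```python
def score(weight, observed, predicted):
    """Score of one unit: weight * |observed - predicted| (assumed form)."""
    return weight * abs(observed - predicted)


def rank_units(weights, observed, predicted):
    """Return unit positions sorted by descending score, together with the scores."""
    scores = [score(w, o, p) for w, o, p in zip(weights, observed, predicted)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical data: the prediction is the value reported at t-1.
weights = [1.0, 3.0, 2.0]
observed = [120.0, 101.0, 95.0]
previous = [100.0, 100.0, 100.0]
order, scores = rank_units(weights, observed, previous)
print(order, scores)
```

Replacing `previous` by model forecasts gives the second score with no other change to the code.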
Effectiveness
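$$E1_j^n = \sum_{i \ge n} (\omega_{ij})^2 \left(x_{ij}^{obs} - x_{ij}^{0}\right)^2$$
$$E2_j^n = \left(\sum_{i \ge n} \omega_{ij} \left(x_{ij}^{obs} - x_{ij}^{0}\right)\right)^2$$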
(Units arranged in descending order according to the corresponding score function)
These measures can be interpreted as estimates of the remaining error after editing the n first units
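The two effectiveness measures can be computed with a short Python sketch. It assumes 0-based ranks, so that the sum over i ≥ n covers exactly the units left unedited after the first n have been edited. The function name and the toy data are hypothetical.

```python
def remaining_error(weights, observed, true_values, scores, n):
    """Return (E1, E2) for one variable after editing the n top-ranked units.

    Units are ranked by descending score; with 0-based ranks the units at
    positions n, n+1, ... are the ones that remain unedited (an assumption
    about how the index i >= n is to be read).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    unedited = order[n:]
    weighted_errors = [weights[i] * (observed[i] - true_values[i]) for i in unedited]
    e1 = sum(err ** 2 for err in weighted_errors)  # sum of squared weighted errors
    e2 = sum(weighted_errors) ** 2                 # square of the summed weighted errors
    return e1, e2


# Hypothetical data for three units.
weights = [1.0, 1.0, 1.0]
observed = [10.0, 5.0, 3.0]
true_values = [7.0, 4.0, 3.5]
scores = [3.0, 1.0, 0.5]
for n in range(4):
    print(n, remaining_error(weights, observed, true_values, scores, n))
```

E1 ignores the sign of the errors, while E2 lets errors of opposite sign cancel in the aggregate. A good score function drives both towards zero after few edits.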
Comparison of score functions methods
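| Score | Turnover $E1$ | Turnover $E2$ | Orders $E1$ | Orders $E2$ |
|---|---|---|---|---|
| $\delta^0$ | 0.43 | 0.44 | 1.16 | 1.33 |
| $\delta^1$ | 0.30 | 0.38 | 0.36 | 0.45 |
| $\delta^2$ | 0.21 | 0.26 | 0.28 | 0.37 |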
Remaining error as a function of the edited questionnaires (N~9000).
Final remarks
• We have introduced theoretical frameworks using models and optimization techniques
• We consider the search for an adequate selection strategy as a generic optimization problem with a stochastic and a combinatorial version. We have shown that a certain score function provides the solution to the problem with linear constraints
• The selection obtained outperforms that of traditional score functions in general
• More methodological research and practical experiences are needed