![Page 1: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/1.jpg)
European Conference on Quality in Official Statistics, Rome, July 2008
Community Innovation Survey: a Flexible Approach to the Dissemination of Microdata Files for Research
Daniela Ichim
![Page 2: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/2.jpg)
Outline
• Dissemination of Microdata Files for Research
• Risk assessment
• Disclosure limitation
• Data quality
– Record linkage
– Data utility
![Page 3: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/3.jpg)
Confidentiality against Dissemination
Find the right balance!
Disclosure scenarios
![Page 4: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/4.jpg)
Community Innovation Survey
• IDENTIFYING VARIABLES (STRUCTURAL VARIABLES)
– Nace
– Nuts
– Size
– Turnover (TURN)
• CONFIDENTIAL VARIABLES (VARIABLES INVOLVED IN ANALYSES)
– Expenditures in innovation (RTOT, …)
– Number of patents, …
![Page 5: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/5.jpg)
Confounding
[Diagram: categorical and numerical variables; units labelled safe or unsafe; A, A, …, A; k-anonymity]
Unit: $t = (t^n, t^c)$
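For categorical identifying variables, a k-anonymity check amounts to counting how many units share each combination of key values and flagging the combinations that occur fewer than k times. The following is a minimal sketch of that idea; the field names and toy records are hypothetical and are not taken from the survey.

```python
from collections import Counter


def unsafe_units(records, key_vars, k):
    """Return indices of records whose key combination occurs fewer than k times."""
    keys = [tuple(r[v] for v in key_vars) for r in records]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] < k]


# Toy records with hypothetical field names and values.
records = [
    {"nace": "A", "nuts": "ITC", "size": "small"},
    {"nace": "A", "nuts": "ITC", "size": "small"},
    {"nace": "A", "nuts": "ITC", "size": "small"},
    {"nace": "B", "nuts": "ITF", "size": "large"},
]
flagged = unsafe_units(records, ["nace", "nuts", "size"], k=3)
```

With k = 3, only the last record is flagged, because its key combination is unique in the toy data.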
![Page 6: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/6.jpg)
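General risk function
a) Given a threshold $M^*$ (on units)
b) Local Outlier Factor as a measure of difference in density between a unit and its nearest neighbours
Distance between $t_1$ and $t_2$: $d(t_1, t_2)$, defined from the numerical parts $(t_1^n, t_2^n)$ and the categorical parts $(t_1^c, t_2^c)$, with an indicator taking values in $\{0, 1\}$ [full formula lost in extraction; surviving symbols: $d$, $e$, $I$]
Density around $t$: local reachability density (LRD)
$$LOF_{M^*}(t) = \frac{1}{|N_{M^*}(t)|} \sum_{t' \in N_{M^*}(t)} \frac{LRD_{M^*}(t')}{LRD_{M^*}(t)}$$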
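The Local Outlier Factor compares the density around a unit with the density around its nearest neighbours: values near 1 mean the unit sits in a region as dense as its neighbours', while large values mean it is isolated. The sketch below follows the standard LOF definition with a neighbourhood size `m` playing the role of the threshold on units. It is an illustration only: it assumes purely numerical variables and plain Euclidean distance, which is a simplification of the mixed categorical and numerical distance on the slide, and the sample points are invented.

```python
import math


def lof_scores(points, m):
    """Local Outlier Factor of each point, using its m nearest neighbours.

    Assumes Euclidean distance and no cluster of more than m identical points
    (otherwise a reachability sum would be zero).
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbours, k_dist = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[m - 1][0]  # distance to the m-th nearest neighbour
        k_dist.append(kd)
        neighbours.append([j for d, j in others if d <= kd])  # ties included

    def lrd(i):
        # Local reachability density: inverse of the mean reachability distance.
        reach = [max(k_dist[j], dist[i][j]) for j in neighbours[i]]
        return len(reach) / sum(reach)

    dens = [lrd(i) for i in range(n)]
    return [
        sum(dens[j] for j in neighbours[i]) / (len(neighbours[i]) * dens[i])
        for i in range(n)
    ]


# Four units on a unit square plus one isolated unit (invented data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
scores = lof_scores(points, m=2)
```

In this toy example the four clustered units get a score of 1 and the isolated unit gets a much larger score, which is the behaviour a density-based risk measure relies on.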
![Page 7: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/7.jpg)
Parameters
• Threshold $M^*$ - dissemination policy
• Cut-off point for density (LOF)
– quantiles
– automatic
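A quantile-based cut-off can be sketched as follows: compute a chosen percentile of the LOF scores and flag every unit above it. The function name, the default percentile, and the scores are hypothetical; the slide names quantiles as an option but does not say which quantile was used, and the "automatic" option is not sketched here.

```python
import statistics


def flag_by_percentile(scores, pct=95):
    """Flag units whose score lies strictly above the pct-th percentile."""
    if not 1 <= pct <= 99:
        raise ValueError("pct must be between 1 and 99")
    # quantiles(..., n=100) returns the 99 percentile cut points.
    cuts = statistics.quantiles(scores, n=100, method="inclusive")
    cutoff = cuts[pct - 1]
    return [s > cutoff for s in scores]


# Invented scores: nineteen ordinary units and one isolated unit.
toy_scores = [1.0] * 19 + [10.0]
flags = flag_by_percentile(toy_scores, pct=90)
```

Raising the percentile flags fewer units as risky, so this parameter trades protection against the amount of perturbation applied.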
![Page 8: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/8.jpg)
Stratification variables
Analysis by Nace
[Figure: panels "Nace A" and "all Nace"; axis label TURN]
![Page 9: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/9.jpg)
Disclosure limitation
MFR Selective masking
k-anonymity Nearest neighbour
Micro-aggregation on tails
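As an illustration of micro-aggregation restricted to a tail, the sketch below replaces the largest values of a variable by the mean of small groups of neighbouring values and leaves all other units untouched. The slide does not give the grouping rule, so the upper-tail restriction, the fixed group size `k`, and the handling of a short remainder are all assumptions, and the values are invented.

```python
def microaggregate_upper_tail(values, tail_size, k):
    """Replace the tail_size largest values by means of groups of about k neighbours."""
    if not k <= tail_size <= len(values):
        raise ValueError("need k <= tail_size <= len(values)")
    order = sorted(range(len(values)), key=lambda i: values[i])
    tail = order[len(values) - tail_size:]  # indices of the largest values
    out = list(values)
    start = 0
    while start < len(tail):
        end = start + k
        if len(tail) - end < k:  # fold a short remainder into this group
            end = len(tail)
        group = tail[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = mean
        start = end
    return out


# Invented turnover-like values with three large units.
values = [10, 12, 11, 500, 900, 700, 13]
masked = microaggregate_upper_tail(values, tail_size=3, k=3)
```

Because each group is replaced by its own mean, the unweighted total of the variable is unchanged, while the three largest units become indistinguishable from one another.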
![Page 10: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/10.jpg)
Quality assessment
Dissemination
Confidentiality
![Page 11: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/11.jpg)
Risk measure assessment
Quality of the external database
[Diagram labels: D, E]
Chambers of Commerce database
Record linkage
![Page 12: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/12.jpg)
Record linkage
M* = 3

| | 1 equal unit within 10% | less than 3 units within 10% | less than 3 units within 20% | less than 3 units within 30% |
| --- | --- | --- | --- | --- |
| NACE | 88% | 84% | 97% | 100% |
| NACEEMP | 63% | 60% (a) | 74% (a) | 87% (a) |

M* = 5

| | 1 equal unit within 10% | less than 5 units within 10% | less than 5 units within 20% | less than 5 units within 30% |
| --- | --- | --- | --- | --- |
| NACE | 88% | 73% | 87% | 96% |
| NACEEMP | 63% | 58% (a) | 70% (a) | 80% (a) |

(a) 100% for enterprises with more than 250 employees
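One way to read column headers such as "less than 3 units within 10%" is as the share of released units for which fewer than M* units in the external database have a value within a given relative tolerance. The sketch below computes that share under this reading; the interpretation, the function names, and the numbers are assumptions and do not reproduce the author's linkage procedure or the figures in the tables.

```python
def candidates_within(value, external_values, tolerance):
    """Count external values within a relative tolerance of value."""
    return sum(1 for e in external_values if abs(e - value) <= tolerance * abs(value))


def share_with_few_candidates(released, external, m_star, tolerance):
    """Share of released units with fewer than m_star external candidates."""
    few = sum(
        1 for v in released
        if candidates_within(v, external, tolerance) < m_star
    )
    return few / len(released)


# Invented values for a single stratum (for example one Nace class).
released = [100.0, 1000.0]
external = [95.0, 102.0, 108.0, 1500.0]
share = share_with_few_candidates(released, external, m_star=3, tolerance=0.10)
```

In the toy data the first unit has three external candidates within 10% and the second has none, so half of the released units have fewer than three candidates.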
![Page 13: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/13.jpg)
Information content analysis
Information preservation
• Selective masking
– Data utility
– Only identifying and confidential variables were modified.
– Only records at risk were modified.
• The weights were not modified.
– weighted totals (coherence with the already published information)
Some statistical indicators were slightly modified: variances
![Page 14: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/14.jpg)
Information content analysis
Data utility
Assessment of the perturbation impact on ratios like RTOT/TURN
[Figure panels: Original, Selective masking, Individual ranking]
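The impact of perturbation on a ratio such as RTOT/TURN can be checked by computing the ratio on the original and on the masked data and comparing the two. The sketch below assumes a weighted ratio of totals; the slide does not say whether unit-level ratios or ratios of totals were compared, and the weights and values are invented.

```python
def ratio_of_totals(rtot, turn, weights):
    """Weighted ratio of totals, e.g. innovation expenditure over turnover."""
    num = sum(w * r for w, r in zip(weights, rtot))
    den = sum(w * t for w, t in zip(weights, turn))
    return num / den


def relative_change(original, masked):
    """Absolute relative difference between the masked and original value."""
    return abs(masked - original) / abs(original)


# Invented data: only the turnover of the last unit is perturbed.
weights = [1.0, 2.0, 1.0]
rtot = [10.0, 20.0, 30.0]
turn_original = [100.0, 200.0, 300.0]
turn_masked = [100.0, 200.0, 330.0]

ratio_original = ratio_of_totals(rtot, turn_original, weights)
ratio_masked = ratio_of_totals(rtot, turn_masked, weights)
change = relative_change(ratio_original, ratio_masked)
```

Here a 10% perturbation of one unit's turnover moves the ratio by a few percent, which is the kind of effect a data utility assessment would report per masking method.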
![Page 15: Daniela Ichim](https://reader035.vdocuments.site/reader035/viewer/2022081504/568150c9550346895dbeed6d/html5/thumbnails/15.jpg)
Conclusions
1. Confidentiality: Risk measure based on the k-anonymity principle
Flexible:
a) continuous and categorical variables
b) easy to implement
c) consistent for extreme choices
2. Data utility: Selective protection to achieve the k-anonymity
3. Comparable dissemination: Control both risk of re-identification and information loss
QUALITY DIMENSIONS