Statistical Disclosure Control (SDC) for 2011 Census
Progress Update
Keith Spicer – ONS SDC Methodology
23 April 2009
CONTENTS
2011 Census: Context
2011 Census: Progress
Tabular outputs:
• Short-listed methods
• Risk Utility Framework and measures
• Registrars General statements
Microdata:
• Reflection on 2001 use of SDC
• Issues arising
2011 Census - Context
• SDC for 2011 Census outputs is a major concern for users
• Different SDC methodologies were adopted for tabular 2001 Census outputs across UK
• Late addition of small cell adjustment by ONS/NISRA resulted in high level of user confusion and dissatisfaction
• Publicised commitment to aim for a common UK SDC methodology for all 2011 Census outputs
Progress
Development of SDC Strategy
UK SDC working group, with representatives from Wales, Northern Ireland and Scotland, established to take forward methodological work
UKCDMAC subgroup set up to QA work
Methodological research:
Determine the short-list of SDC methods (Aug ‘07)
Quantitative evaluation of short-list (continuing)
Short-listed methods
PRE-TABULAR
• Record swapping
• Over-imputation
POST-TABULAR
• IACP (Invariant ABS Cell Perturbation)
Using 2001 Census tables to assess SDC methods
Record Swapping
Record in Area A (unique as the only person with 3 cars in Area A):
Age: 22, Sex: Male, Marital Status: Married, No of Cars: 3, Region: Area A
Record in Area B (matches all variables except No of Cars):
Age: 22, Sex: Male, Marital Status: Married, No of Cars: 1, Region: Area B
Treatment:
• Find a different geographical area
• Identify another individual in a different area with virtually all the same characteristics
• Swap the records
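The record swapping treatment can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for the Census: the function names, the dictionary layout and the third record are hypothetical, and the sketch assumes that "swapping" means exchanging the geography codes of two records that agree on a chosen set of matching variables.

```python
def find_swap_partner(target, records, match_vars, geo_var="region"):
    """Return the first record from a different area that agrees with
    `target` on every variable in `match_vars`, or None if there is none."""
    for candidate in records:
        if candidate is target:
            continue
        if candidate[geo_var] == target[geo_var]:
            continue  # the partner must come from a different area
        if all(candidate[v] == target[v] for v in match_vars):
            return candidate
    return None


def swap_geography(a, b, geo_var="region"):
    """Swap two records between areas by exchanging their geography codes."""
    a[geo_var], b[geo_var] = b[geo_var], a[geo_var]


# The first two records are the slide's example; the third is hypothetical.
records = [
    {"age": 22, "sex": "Male", "marital_status": "Married", "cars": 3, "region": "Area A"},
    {"age": 22, "sex": "Male", "marital_status": "Married", "cars": 1, "region": "Area B"},
    {"age": 40, "sex": "Female", "marital_status": "Single", "cars": 1, "region": "Area B"},
]

risky = records[0]  # unique: the only person with 3 cars in Area A
partner = find_swap_partner(risky, records, ["age", "sex", "marital_status"])
if partner is not None:
    swap_geography(risky, partner)
```

After the swap, the 3-car record sits in Area B and its partner in Area A, so counts on the matching variables are unchanged in both areas while the distribution of cars by area is perturbed.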
Over-Imputation
• Select set of records to be protected, either random or targeted
• Distance-based nearest neighbour used as a donor, based on a set of matching variables
Example records (identical on all variables shown except age):
25, male, single, 6 people in hhld, 0 cars, student
21, male, single, 6 people in hhld, 0 cars, student
Treatment:
• Blank out age from record
• Find a donor to impute age
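The over-imputation steps (blank out a value, then refill it from a nearest-neighbour donor) can be sketched as follows. This is an illustrative sketch only: the distance function (a simple count of mismatching variables), the function names and the third record are assumptions, and the choice of the 25-year-old as the record to protect is made purely for the example.

```python
MATCH_VARS = ["sex", "marital_status", "hhld_size", "cars", "student"]


def distance(a, b, match_vars):
    """Count the matching variables on which two records disagree."""
    return sum(a[v] != b[v] for v in match_vars)


def over_impute(records, index, target_var, match_vars):
    """Blank out `target_var` on records[index], then refill it from the
    nearest donor. Returns the donor record that was used."""
    recipient = records[index]
    recipient[target_var] = None  # blank out the value to be protected
    donors = [r for i, r in enumerate(records)
              if i != index and r[target_var] is not None]
    donor = min(donors, key=lambda r: distance(recipient, r, match_vars))
    recipient[target_var] = donor[target_var]  # impute from the donor
    return donor


# The first two records follow the slide; the third is hypothetical.
records = [
    {"age": 25, "sex": "male", "marital_status": "single",
     "hhld_size": 6, "cars": 0, "student": True},
    {"age": 21, "sex": "male", "marital_status": "single",
     "hhld_size": 6, "cars": 0, "student": True},
    {"age": 67, "sex": "female", "marital_status": "widowed",
     "hhld_size": 1, "cars": 1, "student": False},
]

donor = over_impute(records, 0, "age", MATCH_VARS)
```

Here the second record is at distance 0 from the protected record, so it is chosen as donor and the protected record's age becomes 21.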
Invariant ABS Cell Perturbation (IACP) Method
• Based on method developed by Australian Bureau of Statistics (ABS)
• Perturb each cell value in a table to create uncertainty around the true value
• This new post-tabular method preserves consistency: the same cell has the same value in every table in which it appears. However, small inconsistencies arise when cells are broken down further
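The consistency property can be illustrated with a small sketch in which the perturbation depends only on which records fall in a cell. This is loosely in the spirit of key-based cell perturbation and is not the ABS or ONS specification: the key space, the noise lookup, the decision to leave zero cells unperturbed, and all names are assumptions made for the example.

```python
import random

MODULUS = 2 ** 16                  # assumed size of the key space
NOISE = [-2, -1, 0, 0, 0, 1, 2]    # hypothetical lookup of perturbations


def assign_record_keys(n_records, seed=2011):
    """Give every record a fixed pseudo-random key."""
    rng = random.Random(seed)
    return [rng.randrange(MODULUS) for _ in range(n_records)]


def perturbed_count(cell_record_ids, record_keys):
    """Perturb a cell count using a key derived from the records in the cell.

    The same set of records always yields the same key, and therefore the
    same perturbed value, whichever table the cell appears in."""
    count = len(cell_record_ids)
    if count == 0:
        return 0  # assumption: zero cells are left unperturbed
    cell_key = sum(record_keys[i] for i in cell_record_ids) % MODULUS
    return max(0, count + NOISE[cell_key % len(NOISE)])


record_keys = assign_record_keys(6)
cell = [0, 2, 5]  # ids of the records contributing to one cell

table_1_value = perturbed_count(cell, record_keys)
table_2_value = perturbed_count(cell, record_keys)  # same cell, another table

# Breaking the cell down further: the parts are perturbed independently,
# so their sum need not equal the perturbed value of the whole cell.
part_a = perturbed_count([0, 2], record_keys)
part_b = perturbed_count([5], record_keys)
```

Because the parts of a cell receive their own keys, `part_a + part_b` may differ slightly from `table_1_value`, which mirrors the small inconsistencies noted above.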
Risk Utility Framework
Minimising risk of disclosure is important (in fact probably the most important aspect of SDC)
But so is maintaining utility of data...
The Statistical Disclosure Control Problem
[Figure: risk-utility map. Axes: Disclosure Risk (information about confidential units) and Data Utility (information about legitimate items). Points shown: Original Data, Released Data, No data. Threshold shown: Maximum Tolerable Risk.]
Risk and Utility Measures
Risk measures (original v protected):
Attribute disclosure - % protected
Group disclosure
Within group disclosure
Negative attribute disclosure
% of zeros left unchanged
Identity disclosure - % small cells unperturbed
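Two of the risk measures listed above can be computed directly by comparing an original table with its protected version, cell by cell. The sketch below is illustrative: the function names, the definition of a "small cell" as a count of 1 or 2, and the example counts are all assumptions.

```python
def pct_zeros_unchanged(original, protected):
    """Percentage of original zero cells that are still zero after protection."""
    pairs = [(o, p) for o, p in zip(original, protected) if o == 0]
    if not pairs:
        return None  # no zero cells to assess
    return 100.0 * sum(p == 0 for _, p in pairs) / len(pairs)


def pct_small_cells_unperturbed(original, protected, small=(1, 2)):
    """Percentage of small cells whose value is unchanged after protection.

    Treating counts of 1 and 2 as 'small' is an assumption of this sketch."""
    pairs = [(o, p) for o, p in zip(original, protected) if o in small]
    if not pairs:
        return None  # no small cells to assess
    return 100.0 * sum(o == p for o, p in pairs) / len(pairs)


# Hypothetical flattened table, before and after protection.
original = [0, 1, 2, 5, 0, 1, 12, 0]
protected = [0, 0, 2, 5, 1, 2, 12, 0]

zeros_unchanged = pct_zeros_unchanged(original, protected)
small_unperturbed = pct_small_cells_unperturbed(original, protected)
```

In this example two of the three zeros survive and one of the three small cells is unperturbed; lower values of both measures indicate more protection.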
Risk and Utility Measures
Utility measures (original v protected table):
Ratio of variances across variables
Association between variables – Cramér's V
Hellinger distance metric
Absolute Deviation – Relative & Absolute
Impact on totals & sub-totals
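Several of these utility measures have standard definitions that are easy to compute for a pair of tables. The sketch below uses the textbook forms of the Hellinger distance (on cell proportions), Cramér's V and the mean absolute deviation; whether the evaluation used exactly these variants is an assumption, and the example tables are hypothetical.

```python
import math


def hellinger_distance(original, protected):
    """Hellinger distance between the cell distributions of two tables
    (flattened to lists of counts and normalised to proportions)."""
    o_total, p_total = sum(original), sum(protected)
    return math.sqrt(0.5 * sum(
        (math.sqrt(o / o_total) - math.sqrt(p / p_total)) ** 2
        for o, p in zip(original, protected)))


def cramers_v(table):
    """Cramér's V for a two-way table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


def mean_absolute_deviation(original, protected):
    """Average absolute difference between original and protected cells."""
    return sum(abs(o - p) for o, p in zip(original, protected)) / len(original)


# Hypothetical 2x2 table before and after protection.
original_table = [[12, 3], [4, 9]]
protected_table = [[11, 4], [4, 9]]

flat_original = [c for row in original_table for c in row]
flat_protected = [c for row in protected_table for c in row]

hd = hellinger_distance(flat_original, flat_protected)
v_change = cramers_v(protected_table) - cramers_v(original_table)
mad = mean_absolute_deviation(flat_original, flat_protected)
```

A Hellinger distance near 0 and a small change in Cramér's V suggest that the protected table preserves both the cell distribution and the association between the two variables.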
Registrars General statements
• Commitment to aim for common UK SDC methodology
• Small counts could be included in publicly disseminated tables provided that
– There is sufficient uncertainty as to whether the count is the true value
– Creating that uncertainty does not significantly damage the data
• Key risk for 2011 output is attribute disclosure
• Their preference is for a pre-tabular method
SDC for Tabular Outputs: Next steps
Intention to go to UKCC in July 2009 with broad strategy
Additional work on level of protection necessary
Microdata: reflection on 2001 use of SDC
Ind L SAR    SAM             SL-HSAR        CAMS
PRAM         PRAM (more)     Some PRAM      -
Recode       Recode (more)   Some Recode    -
88+          88+             58+            176,157+
GOR          LA              E&W combined   LA
3% indiv     5% indiv        1% hhold       3%, 1%
EUL          EUL             SL             VML
Microdata: Issues arising I
• Protection through either access (CAMS), data perturbation (EUL samples) or a bit of both (SL-HSAR)
• PRAM (Post-Randomisation Method) perturbed variables using a transition probability matrix; most values were perturbed, if at all, by one or two categories. The goal was to treat sample uniques that are also population uniques
• How much protection is offered by EUL, SDS, VML?
• Onus on researchers to comply with conditions as well as ONS to provide access
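The PRAM idea mentioned above, in which each value is re-drawn according to a transition probability matrix, can be sketched as follows. This is a minimal illustration: the age bands, the matrix entries and the function name are hypothetical, and the sketch applies PRAM to every record rather than targeting sample uniques.

```python
import random

# Hypothetical ordered categories for one variable.
CATEGORIES = ["0-15", "16-24", "25-44", "45-64", "65+"]

# TRANSITION[i][j] = probability that category i is released as category j.
# Most mass is on the diagonal; moves are limited to one or two categories.
TRANSITION = [
    [0.90, 0.08, 0.02, 0.00, 0.00],
    [0.05, 0.88, 0.05, 0.02, 0.00],
    [0.01, 0.05, 0.88, 0.05, 0.01],
    [0.00, 0.02, 0.05, 0.88, 0.05],
    [0.00, 0.00, 0.02, 0.08, 0.90],
]


def pram(values, categories, transition, rng):
    """Re-draw each value from the row of the transition matrix that
    corresponds to its original category."""
    released = []
    for value in values:
        row = transition[categories.index(value)]
        released.append(rng.choices(categories, weights=row)[0])
    return released


rng = random.Random(2001)  # fixed seed so the sketch is reproducible
original = ["16-24", "25-44", "65+", "0-15", "45-64", "25-44"]
released = pram(original, CATEGORIES, TRANSITION, rng)
```

Because the matrix is concentrated on the diagonal, most released values equal the originals, but an intruder cannot be certain that any particular value is unperturbed.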
Microdata: Issues arising II
• Smaller sample does help (uncertainty that an individual or household is in the microdata)
• Want tabular outputs to provide “sufficient uncertainty” at all geographies – c.f. record swapping in Scotland 2001
• Over-imputation and IACP would offer some protection to microdata
• After decision on tabular outputs, need to consider any additional SDC needed for microdata products
Summary
• UK SDC Working Group in mid-June; UKCC in late July to agree strategy for tabular outputs
• Three short-listed methods
• Effect on microdata is among assessment criteria
• Choice of method for tables will influence how we protect microdata
• Likely to be a range of microdata samples – making use of either/both SDC and access conditions
• Work on specific SDC methods for microdata will progress further after decision on tabular methods
Thank you
Any questions?