Techniques to apply cell suppression to large sparse linked tables and some results using those techniques on the
2012 (US) Economic Census
Philip Steel, James Fagan, Vitoon Harusadangkul, Paul Massell,
Richard Moore Jr., John Slanta, Bei Wang
U.S. Census Bureau
Cell Suppression ca. 1931
Footnote: Alabama, 1 establishment; Arkansas, 1; District of Columbia, 4 … New Jersey, 15; … Wisconsin, 5;
Headnote: [comparability warning] Statistics are given for all States for which separate figures can be published without disclosing, exactly or approximately, the data reported by individual establishments. One of the “Other States,” namely, New Jersey, is, however, more important in the industry than any of the States shown separately.
Outline
Follow-up to P. Steel et al., "Re-development of the Cell Suppression Methodology at the US Census Bureau," UNECE, Ottawa, Canada, 28-30 October 2013
1. Some background
2. Production software capabilities
3. Large table results
4. Future research
Table Relations
A table is the set of cells defined by crossing 2 or more relations.
A relation specifies an additive relationship between sets of cells. It is most frequently represented as a margin in a table.
Two tables are linked when a cell or set of cells appears in both tables, usually as a margin in one and in the interior of another.
A table group is a minimal set of tables such that no table is linked outside the group within the publication or release.
Our cell suppression software is driven by 2 kinds of input:
1. All of the additive relations, for convenience categorized separately into files for each table dimension.
2. Cell attribute file.
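The table-group definition above amounts to finding connected components under the "linked" relation. A minimal Python sketch, assuming each table is represented simply as a set of cell identifiers; the table names, cell IDs, and the function `table_groups` are hypothetical illustrations, not the production file format or code:

```python
from collections import defaultdict

def table_groups(tables):
    """Partition tables into groups; two tables are linked if they share a cell.

    `tables` maps a table name to the set of cell IDs it contains
    (a hypothetical representation chosen for illustration).
    """
    parent = {name: name for name in tables}

    def find(t):
        # Follow parent pointers to the group representative (with path halving).
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    first_owner = {}  # cell ID -> first table seen containing it
    for name, cells in tables.items():
        for cell in cells:
            if cell in first_owner:
                # Shared cell: merge the two tables' groups.
                parent[find(name)] = find(first_owner[cell])
            else:
                first_owner[cell] = name

    groups = defaultdict(set)
    for name in tables:
        groups[find(name)].add(name)
    return sorted(sorted(g) for g in groups.values())

# Made-up example: T1's margin "tot1" appears in the interior of T2.
tables = {
    "T1": {"a", "b", "tot1"},
    "T2": {"tot1", "c", "d"},
    "T3": {"x", "y"},
}
print(table_groups(tables))
```

Each resulting group can then be processed independently, since no suppression decision in one group can affect a cell in another.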
NAICS
North American Industry Classification System
6 digit classification that allows tabulation at 2 (industry), 3 (subindustry) and other levels
Example of an additive relation:
23611,236115,236116,236117,236118
The graph structure associated with a classification system of this type is a tree
Distinct codes used for the manufacturing segment: 518 (roughly 154 relations)
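To make the relation format concrete, here is a short Python sketch that parses the example line and checks additivity for it. It assumes the first code on the line is the total and the remaining codes are its components; the cell values, the tolerance parameter, and both function names are invented for illustration.

```python
def parse_relation(line):
    """Split 'total,part1,part2,...' into (total, [parts]).

    Assumption: the first code is the aggregate, the rest are its components.
    """
    codes = [code.strip() for code in line.split(",")]
    return codes[0], codes[1:]

def is_additive(relation, values, tol=0):
    """True if the total equals the sum of its parts, within `tol`.

    A nonzero `tol` is one simple way to tolerate rounded data.
    """
    total, parts = relation
    return abs(values[total] - sum(values[p] for p in parts)) <= tol

relation = parse_relation("23611,236115,236116,236117,236118")

# Made-up cell values, for illustration only.
values = {"23611": 100, "236115": 40, "236116": 25, "236117": 30, "236118": 5}
print(relation[0], "->", relation[1])
print(is_additive(relation, values))
```

Because the classification is a tree, each code appears as a component in at most one relation of this kind.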
Geographic Publication Levels
Nation, State, County, Consolidated Statistical Metropolitan Area, Metropolitan Areas, Places, and balances.
Population varies from 100s to millions
Economic activity in a detailed NAICS code is often 0 at the Place and even County level
In excess of 16,000 codes (roughly 3000 relations)
Not a simple hierarchy
Definitions
Sensitivity rule (p-percent)
Primary suppression: a cell not shown in a publication because it failed the sensitivity rule
Secondary suppression: a cell not shown in a publication because its absence blocks recovery of a primary suppression
Target: a cell with a protection requirement
Supercell: a sensitive aggregate. An example: two single report cells in a row or column
LP: Linear Programming [problem]
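The slides name the p-percent rule without stating it. In its standard textbook form, a cell with total T and two largest contributions x1 ≥ x2 is sensitive when T − x1 − x2 < (p/100)·x1, and the shortfall is the amount of protection the cell needs. A minimal Python sketch of that standard form follows; the function name and the sample contributions are illustrative, and the production value of p and any rule variations are not given in the source.

```python
def p_percent_shortfall(contributions, p):
    """Protection required under the standard p-percent rule (0.0 if not sensitive).

    The cell is sensitive when the remainder after removing the two largest
    contributors is less than p percent of the largest contributor.
    """
    xs = sorted(contributions, reverse=True)
    if not xs:
        return 0.0
    x1 = xs[0]
    x2 = xs[1] if len(xs) > 1 else 0
    remainder = sum(xs) - x1 - x2
    return max(0.0, p * x1 / 100 - remainder)

# Made-up contributions, with p = 10 chosen only for illustration.
print(p_percent_shortfall([100], 10))         # single-report cell: sensitive
print(p_percent_shortfall([100, 5, 3], 10))   # dominated cell: sensitive
print(p_percent_shortfall([50, 30, 20], 10))  # well-spread cell: not sensitive
```

A cell with a positive shortfall becomes a primary suppression and a target with that protection requirement.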
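minimize:
\[
Y = \sum_{i=1}^{\mathit{rows}} \; \sum_{j=1}^{\mathit{cols}} \; \sum_{\substack{k=1 \\ (i,j,k) \in A}}^{\mathit{levs}} c_{i,j,k} \left( x^{(u)}_{i,j,k} + x^{(l)}_{i,j,k} \right)
\]
subject to:
(a)
\[
x^{(u)}_{i,j,1} - x^{(l)}_{i,j,1} = \sum_{\substack{k=2 \\ (i,j,k) \in A}}^{\mathit{levs}} \left( x^{(u)}_{i,j,k} - x^{(l)}_{i,j,k} \right)
\]
for i = 1, ... , rows, j = 1, ... , cols : levs > 1, ws(i,j,1) = 0
(b)
\[
x^{(u)}_{\mathit{rowrel}(ii,0),j,k} - x^{(l)}_{\mathit{rowrel}(ii,0),j,k} = \sum_{\substack{i=1 \\ (i,j,k) \in A}}^{\mathit{limr}(ii)} \left( x^{(u)}_{\mathit{rowrel}(ii,i),j,k} - x^{(l)}_{\mathit{rowrel}(ii,i),j,k} \right)
\]
for ii = 1, ... , rr, j = 1, ... , cols, k = 1, ... , levs : limr(ii) ≥ 1, ws(ii,j,k) = 0
(c)
\[
x^{(u)}_{i,\mathit{colrel}(jj,0),k} - x^{(l)}_{i,\mathit{colrel}(jj,0),k} = \sum_{\substack{j=1 \\ (i,j,k) \in A}}^{\mathit{limc}(jj)} \left( x^{(u)}_{i,\mathit{colrel}(jj,j),k} - x^{(l)}_{i,\mathit{colrel}(jj,j),k} \right)
\]
for i = 1, ... , rows, jj = 1, ... , cc, k = 1, ... , levs : limc(jj) ≥ 1, ws(i,jj,k) = 0
(d)
\[
0 \le x^{(u)}_{i,j,k} \le h_{i,j,k} ; \qquad 0 \le x^{(l)}_{i,j,k} \le h_{i,j,k}
\]
for i = 1, ... , rows, j = 1, ... , cols, k = 1, ... , levs : (i,j,k) ∈ A
(e)
\[
x^{(u)}_{\mathit{prow},\mathit{pcol},\mathit{plev}} = \mathit{prot} ; \qquad x^{(l)}_{\mathit{prow},\mathit{pcol},\mathit{plev}} = 0
\]
where:
\[
c_{i,j,k} =
\begin{cases}
\max(0, v_{i,j,k}) & \text{when } (i,j,k) \in U \\
0 & \text{when } (i,j,k) \in P \cup C
\end{cases}
\]
\[
h_{i,j,k} = \max(0, v_{i,j,k})
\]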
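As an illustration of what the constraints require, the Python sketch below checks a candidate solution against (b) through (e) for a small 2D table, so levs = 1 and constraint (a) does not apply. It reads (e) as equalities and assumes row 0 and column 0 hold the totals. The table values, the protection amount, and the function names are made up, and no solver is involved: the code only verifies a hand-built pattern and evaluates the objective.

```python
def is_feasible(xu, xl, h, target, prot):
    """Check constraints (b)-(e) for a 2D table with totals in row 0 / column 0."""
    n_rows, n_cols = len(h), len(h[0])
    # Net deviation of each cell: upward minus downward.
    d = [[xu[i][j] - xl[i][j] for j in range(n_cols)] for i in range(n_rows)]
    # (d) every deviation lies between 0 and the cell's capacity h.
    for i in range(n_rows):
        for j in range(n_cols):
            if not (0 <= xu[i][j] <= h[i][j] and 0 <= xl[i][j] <= h[i][j]):
                return False
    # (c) within each row, the column-0 total moves by the sum of the other columns.
    for i in range(n_rows):
        if d[i][0] != sum(d[i][1:]):
            return False
    # (b) within each column, the row-0 total moves by the sum of the other rows.
    for j in range(n_cols):
        if d[0][j] != sum(d[i][j] for i in range(1, n_rows)):
            return False
    # (e) the target moves up by exactly prot and not down.
    ti, tj = target
    return xu[ti][tj] == prot and xl[ti][tj] == 0

def objective(c, xu, xl):
    """Y = sum of c * (xu + xl) over all cells."""
    return sum(c[i][j] * (xu[i][j] + xl[i][j])
               for i in range(len(c)) for j in range(len(c[0])))

# Made-up additive table: row 0 and column 0 are totals.
v = [[100, 45, 55],
     [40, 10, 30],
     [60, 35, 25]]
h = [row[:] for row in v]   # capacity h = max(0, v); all values are positive here
c = [row[:] for row in v]   # cost c = max(0, v) for unsuppressed cells ...
c[1][1] = 0                 # ... and 0 for the primary (target) cell

# Candidate pattern: push 5 units around the cycle (1,1) -> (1,2) -> (2,2) -> (2,1).
xu = [[0, 0, 0], [0, 5, 0], [0, 0, 5]]
xl = [[0, 0, 0], [0, 0, 5], [0, 5, 0]]

print(is_feasible(xu, xl, h, target=(1, 1), prot=5))
print(objective(c, xu, xl))
```

The cells with nonzero deviation in a solution are the target plus its secondary suppressions; the LP picks the pattern with the smallest total cost.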
m-LP model
m pairs of (e)
Definitions continued
m-LP: an LP that protects m targets simultaneously
m-LP screener: criteria for determining whether a target cell is too close to other target cells in a group of m cells
Effective m: the actual, as opposed to the requested, number of targets
Cost: the weight given a particular cell in the LP’s objective function; roughly, a measure of the desire to retain the cell
Capacity: how much of a cell’s value can be used to protect a target in secondary suppression
Frozen cell: capacity=0 if a cell has already been published
Infeasible: a target whose LP problem has no solution, a circumstance that can be caused by frozen cells or overlapping patterns in m-LP
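The cost and capacity definitions, combined with the formulas for c and h on the LP slide, can be sketched in Python as below. This covers only the base rules; the exception handling that the production software applies is not described in the slides. The status labels "U", "P", and "C" follow the set names in the LP formulation, and the function names and sample values are hypothetical.

```python
def cell_cost(value, status):
    """Base cost c: max(0, v) for cells in U, 0 for cells in P or C.

    `status` is one of "U", "P", "C", following the set names on the LP slide.
    """
    return max(0, value) if status == "U" else 0

def cell_capacity(value, frozen=False):
    """Base capacity h: max(0, v), or 0 when the cell is frozen (already published)."""
    return 0 if frozen else max(0, value)

# Made-up cells for illustration.
print(cell_cost(250, "U"), cell_capacity(250))               # ordinary cell
print(cell_cost(250, "P"), cell_capacity(250))               # suppressed cell costs nothing to reuse
print(cell_cost(250, "U"), cell_capacity(250, frozen=True))  # frozen cell cannot absorb protection
print(cell_cost(-40, "U"), cell_capacity(-40))               # negative values are clipped to 0
```

A frozen cell with zero capacity cannot take part in any suppression pattern, which is why frozen cells can make a target infeasible.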
Methods, Programming and Data
LP methodology is straightforward; most of the (very dry) complexity is in how to handle exceptions in calculating cost and capacity
Implementation in production software is a difficult problem and doesn’t fit well with most software development models
Differences in data can produce novel circumstances and performance varies dramatically
Program features
Data structure: graph vs array
Automatic global un-duplication
Additivity check
Decomposes input into table groups
Utilizes solver’s data structures
Scalar parameter
Individual cell cost adjustment
Accommodates linked data
Program features (cont'd)
Freeze of publication status across linked tables
Tolerant of rounded data
Skip-P procedure
m for m-LP
Supercell (aggregate/company protection)
Negative values adjustment
Tolerant of non-additivity
Single “D” check
Processing the Manufacturing Geographic Area Series
NAICS (518 distinct)
Geography (16,000 distinct)
Content
  Standalone (hours, capital expenditures): 2D
  In a relation (employment, payroll and value added): 3 dimensional (3D)
Processed 3D in 3 steps
  1. 2 & 3 digit NAICS by full geography and content
  2. Freeze and process each full 3 digit tree
  3. Cleanup of infeasibles
Performance on 2D tables
table    Cartesian cells   %0     nonzero   %P     m average   opts   minutes per opt
hours    8.7 M             90.0   867,913   71.9   30.2        2688   2.76
cextot   8.3 M             91.4   711,749   83.2   21.1        2063   4.06
Without Skip-P, ‘hours’ would have required 57,000 optimizations, taking 109 days instead of 5!
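The 109-day and 5-day figures follow from the table if the average time per optimization is assumed to stay at 2.76 minutes in both cases. A quick check in Python; the helper name `days_needed` is illustrative:

```python
MINUTES_PER_DAY = 24 * 60

def days_needed(optimizations, minutes_per_opt):
    """Total run time in days, assuming a constant average time per optimization."""
    return optimizations * minutes_per_opt / MINUTES_PER_DAY

with_skip_p = days_needed(2688, 2.76)       # optimizations actually run for 'hours'
without_skip_p = days_needed(57000, 2.76)   # optimizations needed without Skip-P

print(f"with Skip-P:    {with_skip_p:.1f} days")
print(f"without Skip-P: {without_skip_p:.1f} days")
```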
Performance on 3D tables
table   Cartesian cells   %0     nonzero     %P     m average   opts   minutes per opt
emp2a   1.9 M             85.4   270,583     89.4   3.7         4465   0.28
emp3    26.7 M            89.7   2,760,737   ---    1           67     105.93
The second step for processing employment variables handles each subindustry separately; the first line represents the 1st subindustry. The second line represents the 3rd step, where we unfreeze and redo the infeasibles that occurred in the (entire) 2nd step.
Agenda for the 2017 Economic Census
Correct current defects
Test solver dependency
Adapt to a shift in emphasis from industry to product tables in the 2017 Census
Identify and review pivotal suppressions
Better employ m-LP, scaling
Investigate over-tabulation
Create a data product that synthesizes suppressed cells
Thanks for your attention