Techniques to apply cell suppression to large sparse linked tables and some results using those techniques on the 2012 (US) Economic Census

Philip Steel, James Fagan, Vitoon Harusadangkul, Paul Massell, Richard Moore Jr., John Slanta, Bei Wang

U.S. Census Bureau

Cell Suppression ca. 1931

Footnote: Alabama, 1 establishment; Arkansas, 1; District of Columbia, 4 … New Jersey, 15; … Wisconsin, 5;

Headnote: [comparability warning] Statistics are given for all States for which separate figures can be published without disclosing, exactly or approximately, the data reported by individual establishments. One of the “Other States,” namely, New Jersey, is, however, more important in the industry than any of the States shown separately.

Outline

Follow-up to P. Steel et al., “Re-development of the Cell Suppression Methodology at the US Census Bureau,” UNECE, Ottawa, Canada, 28-30 October 2013

1. Some background
2. Production software capabilities
3. Large table results
4. Future research

Table Relations

A table is the set of cells defined by crossing 2 or more relations.

A relation specifies an additive relationship between sets of cells. It is most frequently represented as a margin in a table.

Two tables are linked when a cell or set of cells appears in both tables, usually as a margin in one and in the interior of another.

A table group is a minimal set of tables such that no table is linked outside the group within the publication or release.

Our cell suppression software is driven by 2 kinds of input:

1. All of the additive relations, for convenience categorized separately into files for each table dimension.
2. Cell attribute file.
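The table-group definition above amounts to finding connected components of the "is linked to" relationship. The following is a minimal sketch of that idea, not the production code: the function name `table_groups`, the input layout (table name mapped to a set of cell identifiers) and the sample tables are all assumptions made for illustration.

```python
from collections import defaultdict

def table_groups(tables):
    """Partition tables into groups of (transitively) linked tables.

    `tables` maps a table name to the set of cell identifiers it contains
    (a hypothetical layout). Two tables count as linked when they share
    at least one cell.
    """
    parent = {name: name for name in tables}

    def find(name):
        # Follow parent pointers to the group representative (path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_owner = {}  # cell -> first table seen containing it
    for name, cells in tables.items():
        for cell in cells:
            if cell in first_owner:
                # Shared cell: merge the two tables' groups.
                parent[find(name)] = find(first_owner[cell])
            else:
                first_owner[cell] = name

    groups = defaultdict(set)
    for name in tables:
        groups[find(name)].add(name)
    return sorted(sorted(g) for g in groups.values())

# Made-up example: T1 and T2 share the cell ("311", "AL"); T3 stands alone.
tables = {
    "T1": {("311", "US"), ("311", "AL")},
    "T2": {("311", "AL"), ("3111", "AL")},
    "T3": {("42", "US")},
}
print(table_groups(tables))
```

Each resulting group can then be processed independently, since by definition no table in it is linked to a table outside it.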

NAICS

North American Industry Classification System

6-digit classification that allows tabulation at 2 (industry), 3 (subindustry) and other levels

Example of an additive relation:

23611,236115,236116,236117,236118

The graph structure associated with a classification system of this type is a tree

Distinct codes used for the manufacturing segment: 518 (roughly 154 relations)
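As an illustration of how a relation line such as the NAICS example could be read and checked, here is a small sketch. It assumes the first code on the line is the total and the remaining codes are its components; the slides do not spell out the file layout, so that reading, the helper names and the cell values are all hypothetical.

```python
def parse_relation(line):
    """Split 'total,child1,child2,...' into (total, [children]).

    Assumes the first code is the total of the codes that follow.
    """
    codes = [code.strip() for code in line.split(",")]
    return codes[0], codes[1:]

def is_additive(line, values, tol=0.0):
    """Check that the total equals the sum of its components within `tol`."""
    total, children = parse_relation(line)
    return abs(values[total] - sum(values[c] for c in children)) <= tol

relation = "23611,236115,236116,236117,236118"

# Made-up cell values, for illustration only.
values = {"23611": 100, "236115": 40, "236116": 25, "236117": 20, "236118": 15}

print(parse_relation(relation))
print(is_additive(relation, values))
```

A nonzero `tol` is one simple way to tolerate independently rounded data, where the total and its components need not sum exactly.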

Geographic Publication Levels

Nation, State, County, Consolidated Statistical Metropolitan Area, Metropolitan Areas, Places, and balances.

Population varies from 100s to millions

Economic activity in a detailed NAICS code is often 0 at the Place and even County level

In excess of 16,000 codes (roughly 3000 relations)

Not a simple hierarchy
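One way to see whether a set of relations forms a tree is to look for totals that have more than one breakdown and for components that feed more than one total. The sketch below does only that check; the function name, the (total, components) layout and the geographic labels are invented for illustration and are not taken from the Census relation files.

```python
from collections import Counter

def non_tree_codes(relations):
    """Return (totals with several breakdowns, components under several totals).

    `relations` is a list of (total, [components]) pairs. Both lists are
    empty when every total has one breakdown and every component has a
    single parent, as in a tree-shaped classification.
    """
    total_counts = Counter(total for total, _ in relations)
    child_counts = Counter(c for _, children in relations for c in children)
    repeated_totals = sorted(t for t, n in total_counts.items() if n > 1)
    shared_children = sorted(c for c, n in child_counts.items() if n > 1)
    return repeated_totals, shared_children

# Tree-shaped example: the NAICS relation shown earlier.
naics = [("23611", ["236115", "236116", "236117", "236118"])]

# Hypothetical geography: a state broken down two ways, with counties
# that also roll up into a metropolitan area.
geography = [
    ("State", ["County 1", "County 2", "County 3"]),
    ("State", ["Metro M", "Balance of State"]),
    ("Metro M", ["County 1", "County 2"]),
]

print(non_tree_codes(naics))
print(non_tree_codes(geography))
```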

Definitions

Sensitivity rule (p-percent)

Primary suppression: a cell not shown in a publication because it failed the sensitivity rule

Secondary suppression: a cell not shown in a publication because its absence blocks recovery of a primary suppression

Target: a cell with a protection requirement

Supercell: a sensitive aggregate. An example: two single report cells in a row or column

LP: Linear Programming [problem]
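For readers unfamiliar with the p-percent rule, the sketch below uses its common textbook form: a cell is sensitive when the contributions remaining after removing the two largest sum to less than p percent of the largest. The slides give neither the value of p nor the exact variant used in production, so the function name, the value of p and the sample contributions are illustrative assumptions.

```python
def fails_p_percent(contributions, p):
    """True when a cell is sensitive under the textbook p-percent rule.

    `contributions` are the individual respondent values in the cell and
    `p` is the protection percentage (both hypothetical inputs here).
    """
    ordered = sorted(contributions, reverse=True)
    if not ordered:
        return False  # an empty cell has nothing to disclose
    largest = ordered[0]
    second = ordered[1] if len(ordered) > 1 else 0.0
    # What the second-largest contributor cannot account for by itself.
    remainder = sum(ordered) - largest - second
    return remainder < (p / 100.0) * largest

print(fails_p_percent([100], 10))              # single report
print(fails_p_percent([100, 50, 3], 10))       # one dominant pair
print(fails_p_percent([100, 80, 60, 40], 10))  # many sizable contributors
```

Under this form a single-report cell always fails, which is why two single-report cells in the same row or column make a natural supercell.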

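minimize:

$$Y = \sum_{i=1}^{\mathit{rows}} \sum_{j=1}^{\mathit{cols}} \sum_{\substack{k=1 \\ (i,j,k) \in A}}^{\mathit{levs}} c_{i,j,k} \left( x^{(u)}_{i,j,k} + x^{(l)}_{i,j,k} \right)$$

subject to:

(a)
$$\sum_{\substack{k=2 \\ (i,j,k) \in A}}^{\mathit{levs}} \left( x^{(u)}_{i,j,k} - x^{(l)}_{i,j,k} \right) = x^{(u)}_{i,j,1} - x^{(l)}_{i,j,1}$$

for i = 1, ... , rows, j = 1, ... , cols : levs > 1, ws(i,j,1) = 0

(b)
$$\sum_{\substack{i=1 \\ (i,j,k) \in A}}^{\mathit{limr}(ii)} \left( x^{(u)}_{\mathit{rowrel}(ii,i),j,k} - x^{(l)}_{\mathit{rowrel}(ii,i),j,k} \right) = x^{(u)}_{\mathit{rowrel}(ii,0),j,k} - x^{(l)}_{\mathit{rowrel}(ii,0),j,k}$$

for ii = 1, ... , rr, j = 1, ... , cols, k = 1, ... , levs : limr(ii) ≥ 1, ws(ii,j,k) = 0

(c)
$$\sum_{\substack{j=1 \\ (i,j,k) \in A}}^{\mathit{limc}(jj)} \left( x^{(u)}_{i,\mathit{colrel}(jj,j),k} - x^{(l)}_{i,\mathit{colrel}(jj,j),k} \right) = x^{(u)}_{i,\mathit{colrel}(jj,0),k} - x^{(l)}_{i,\mathit{colrel}(jj,0),k}$$

for i = 1, ... , rows, jj = 1, ... , cc, k = 1, ... , levs : limc(jj) ≥ 1, ws(i,jj,k) = 0

(d)
$$0 \le x^{(u)}_{i,j,k} \le h_{i,j,k} ; \qquad 0 \le x^{(l)}_{i,j,k} \le h_{i,j,k}$$

for i = 1, ... , rows, j = 1, ... , cols, k = 1, ... , levs : (i,j,k) ∈ A

(e)
$$x^{(u)}_{\mathit{prow},\mathit{pcol},\mathit{plev}} = \mathit{prot} ; \qquad x^{(l)}_{\mathit{prow},\mathit{pcol},\mathit{plev}} = 0$$

where:

$$c_{i,j,k} = \begin{cases} \max(0, v_{i,j,k}) & \text{when } (i,j,k) \in U \\ 0 & \text{when } (i,j,k) \in P \cup C \end{cases}$$

$$h_{i,j,k} = \max(0, v_{i,j,k})$$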
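The cost and capacity definitions in the where-clause can be sketched in a few lines. This is an illustration under stated assumptions, not the production logic: the cost is taken as max(0, v) for cells labelled U and 0 for cells labelled P or C, the capacity as max(0, v), and the objective as cost times the sum of the upward and downward flows. The slide does not define the labels U, P and C, so they are used here only as opaque status codes, and the cell values and flows are made up.

```python
def cost(value, status):
    """c = max(0, v) for cells with status 'U', 0 for status 'P' or 'C'."""
    return max(0, value) if status == "U" else 0

def capacity(value):
    """h = max(0, v): a negative value offers no protection capacity."""
    return max(0, value)

def within_bounds(value, x_up, x_low):
    """Constraint (d): both flows lie between 0 and the cell's capacity."""
    h = capacity(value)
    return 0 <= x_up <= h and 0 <= x_low <= h

def objective(cells, flows):
    """Sum of cost * (x_up + x_low) over all cells (assumed objective form).

    `cells` maps (i, j, k) -> (value, status); `flows` maps (i, j, k) ->
    (x_up, x_low), with missing entries treated as zero flow.
    """
    return sum(
        cost(v, s) * sum(flows.get(key, (0, 0)))
        for key, (v, s) in cells.items()
    )

# Made-up cells and flows, for illustration only.
cells = {(1, 1, 1): (120, "U"), (1, 2, 1): (30, "P"), (2, 1, 1): (-5, "U")}
flows = {(1, 1, 1): (10, 0), (1, 2, 1): (10, 0)}

print(objective(cells, flows))
print(within_bounds(120, 10, 0), within_bounds(-5, 1, 0))
```

Because cells with status P or C cost nothing, flow through them does not increase the objective, which steers the LP toward reusing existing suppressions.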

m-LP model

m pairs of (e)

Definitions continued

m-LP: an LP that protects m targets simultaneously

m-LP screener: criteria for determining whether a target cell is too close to other target cells in a group of m cells

Effective m: the actual as opposed to the requested number of targets

Cost: the weight given a particular cell in the LP’s objective function, roughly a measure of the desire to retain the cell

Capacity: how much of a cell’s value can be used to protect a target in secondary suppression

Frozen cell: capacity=0 if a cell has already been published

Infeasible: a target whose LP problem has no solution, a circumstance that can be caused by frozen cells or overlapping patterns in m-LP

Methods, Programming and Data

LP methodology is straightforward; most of the (very dry) complexity is in how to handle exceptions in calculating cost and capacity

Implementation in production software is a difficult problem and doesn’t fit well with most software development models

Differences in data can produce novel circumstances and performance varies dramatically

Program features

Data structure: graph vs array
Automatic global un-duplication
Additivity check
Decomposes input into table groups
Utilizes solver’s data structures
Scalar parameter
Individual cell cost adjustment
Accommodates linked data

Program features (cntd)

Freeze of publication status across linked tables
Tolerant of rounded data
Skip-P procedure
m for m-LP
Supercell (aggregate/company protection)
Negative values adjustment
Tolerant of non-additivity
Single “D” check

Processing the Manufacturing Geographic Area Series

NAICS (518 distinct)
Geography (16,000 distinct)
Content
  standalone (hours, capital expenditures): 2D
  in a relation (employment, payroll and value added): 3 dimensional (3D)

Processed 3D in 3 steps
  1. 2&3 digit NAICS by full geography and content
  2. Freeze and process each full 3 digit tree
  3. Cleanup of infeasibles

Performance on 2D tables

table    Cartesian cells   %0     nonzero   %P     m ave   opts   minutes per opt
hours    8.7 M             90.0   867,913   71.9   30.2    2688   2.76
cextot   8.3 M             91.4   711,749   83.2   21.1    2063   4.06

Without Skip-P, ‘hours’ would have required 57,000 optimizations, taking 109 days instead of 5!
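The 109-day and 5-day figures follow from the ‘hours’ row: 2688 optimizations at 2.76 minutes each, against 57,000 optimizations at the same rate. The short check below reproduces the arithmetic; it assumes the per-optimization time would stay at 2.76 minutes without Skip-P, which the slide implies but does not state.

```python
MINUTES_PER_DAY = 24 * 60

def run_days(optimizations, minutes_per_opt):
    """Total run time in days for a given number of optimizations."""
    return optimizations * minutes_per_opt / MINUTES_PER_DAY

with_skip_p = run_days(2688, 2.76)       # figures from the 'hours' row
without_skip_p = run_days(57_000, 2.76)  # assumes the same minutes per opt

print(f"with Skip-P: {with_skip_p:.2f} days")
print(f"without Skip-P: {without_skip_p:.2f} days")
```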

Performance on 3D tables

table    Cartesian cells   %0     nonzero     %P     m average   opts   minutes per opt
emp2a    1.9 M             85.4   270,583     89.4   3.7         4465   0.28
emp3     26.7 M            89.7   2,760,737   ---    1           67     105.93

The second step for processing employment variables handles each subindustry separately; the first line represents the 1st subindustry. The second line represents the 3rd step, where we unfreeze and redo the infeasibles that occurred in the (entire) 2nd step.

Agenda for the 2017 Economic Census

Correct current defects
Test solver dependency
Adapt to a shift in emphasis from industry to product tables in the 2017 Census
Identify and review pivotal suppressions
Better employ m-LP, scaling
Investigate over-tabulation
Create a data product that synthesizes suppressed cells

Thanks for your attention

