
Construction of Microdata from a Set of Differentially Private Low-dimensional Contingency Tables through Solving Linear Equations with Tikhonov Regularization

Evercita C. Eugenio and Fang Liu∗

Department of Applied and Computational Mathematics and Statistics

University of Notre Dame, Notre Dame, IN 46556

∗[email protected]

August 6, 2019

Abstract

When individual-level data are shared for research and public use, they are often perturbed to provide some level of privacy protection. A simple way to perturb a high-dimensional data set where individual-level data can be easily generated with good utility is to sanitize the full contingency table or full-dimensional histogram. However, it can be costly from the data storage and memory perspective to work with full tables. In addition, most of the observed signals in the high-order interactions among all attributes are likely just sample randomness rather than being of statistical significance and rarely of interest to practitioners. We introduce a new algorithm, CIPHER, which can reproduce individual-level data from a set of meaningful differentially private low-dimensional contingency (LDC) tables constructed from the original high-dimensional data, through solving a set of linear equations with the Tikhonov regularization. CIPHER is conceptually simple and requires no more than decomposing joint probabilities via basic probability rules to construct the equation set and subsequently solving linear equations. Compared to full table sanitization, the set of LDC tables that CIPHER works with has drastically lower requirements on data storage and memory. We run experiments to compare CIPHER with the full table sanitization and the multiplicative weighting exponential mechanism (MWEM), which can also be used to generate individual-level synthetic data given a set of LDC tables. The results demonstrate that CIPHER outperforms MWEM in preserving original information at the same privacy budget and converges to the full-table sanitization in utility as the sample data size or the privacy budget increases.

Keywords: differentially private data synthesis (DIPS), multiplicative weighting, sign and statistical significance (SSS), contingency tables, data storage and memory, Laplace mechanism

1 The research is funded by the National Science Foundation grants #1546373 and #1717417.

arXiv:1812.05671v2 [cs.LG] 5 Aug 2019

1 Introduction

1.1 Background and Motivation

When releasing data sets for research and public use, protection of individual private information while still maintaining good utility of the data is of extreme importance. Even with data anonymization, it is still possible for a data intruder to identify a subject in a released data set. For example, the Netflix Prize data set that contained anonymous movie ratings of 500,000 Netflix subscribers was used in conjunction with the public IMDB database to successfully identify individual Netflix users (Narayanan and Shmatikov, 2006, 2008), uncovering political preferences and other sensitive information of the movie raters on Netflix. Other recent re-identification cases include the Washington state data for health records (Sweeney, 2013), the New York City Taxi and Limousine Commission data (Tockar, 2014), and the Australian de-identified open health dataset (Culnane et al., 2017). These examples, together with other disclosure cases, have intensified the concerns about individual privacy and call for more rigorous and mathematically sound concepts and frameworks to protect individual privacy when releasing data.

Differential privacy (DP) provides a conceptual framework to bring rigorous mathematical guarantees for privacy protection without making strong or ad-hoc assumptions about the intruder's background knowledge (Dwork et al., 2006; Dwork, 2008). There exist DP mechanisms for general query release such as the Laplace mechanism (Dwork et al., 2006), the exponential mechanism (McSherry and Talwar, 2007; McSherry, 2009), the median mechanism (Roth and Roughgarden, 2010), the Gaussian mechanism (Dwork et al., 2014; Liu, 2019), and the generalized Gaussian mechanism (Liu, 2019). There are also DP mechanisms for releasing specific statistical analyses, such as contingency tables (Barak et al., 2007), data cubes (Ding et al., 2011), empirical risk optimization (Chaudhuri et al., 2011), principal component analysis (Chaudhuri et al., 2012), high-dimensional regression (Kifer et al., 2012), graphs and social networks (Kasiviswanathan et al., 2013; Yan et al., 2016; Li et al., 2017), and deep learning (Shokri and Shmatikov, 2015; Abadi et al., 2016), among others.

One of the applications of DP is to generate differentially private individual-level synthetic data for release. Compared to releasing differentially private queries upon request, which is both burdensome for data curators and practically unsatisfactory for data users as the privacy budget can be quickly consumed with a limited number of queries, releasing differentially private individual-level data is more convenient for data curators and flexible for data users. On the other hand, differentially private data synthesis is not without limitations. First, some assumptions, whether data-dependent or data-independent, whether weak or strong, are often needed to generate synthetic data. Second, when synthetic data are large in size, it can be computationally costly to store them, especially when multiple sets are released as a way to account for the uncertainty introduced through the synthesis process.

In this paper, we propose a new data synthesis approach for multi-dimensional categorical data that neither relies on strong data-specific assumptions nor places a high demand on data storage.


1.2 Related Work

A simple way that imposes minimal assumptions on the local data and is still able to perturb multi-dimensional categorical data while maintaining good utility is the sanitization of the full cross-tabulation, from which individual-level synthetic data can be easily generated. The approach is often used as a baseline to benchmark other differentially private methods for answering queries or generating synthetic data in terms of utility. Despite its simplicity and good utility, the full table sanitization does have some drawbacks. First, the full cross-tabulation among all attributes is likely to generate a lot of empty cells, and the highest-order interactions among the attributes and the observed signals in the full table are most likely just white noise and do not represent meaningful population-level signals of statistical significance. Second, it can be costly to store or release the full table when the original data set is of moderate to high dimension. For example, the full table among p = 10 attributes with 5 levels per attribute has 5^10 = 9,765,625 cells.

If the size of the set of the cell frequencies, from which synthetic data are generated, can be reduced without affecting the population-level signals contained in the original data to a meaningful degree, it would be welcomed from a data storage perspective. There exists some work along this line. Barak et al. (2007) use Fourier transforms and linear programming to generate differentially private individual-level synthetic data from the low-order contingency tables. Though consistency, non-negativity, and privacy are ensured, solving the linear program could be a bottleneck for this algorithm, especially when p is large. Hay et al. (2010) introduce the universal histogram approach that benefits the utility of low-order histograms, but at the expense of precision of the higher-order histogram. Chen et al. (2015) use a sampling-based framework to build attribute clusters, and the synthetic data are generated from the differentially private histograms formed by the attribute clusters. The formation of optimal attribute clusters is an NP-hard problem, and the authors introduce an approximation algorithm that does not guarantee optimality, particularly if there are any non-convexity issues. Zhang et al. (2014) introduce PrivBayes to differentially privately construct a Bayesian network, from which samples are taken to release. When p or the degree of the network is large, the construction of the differentially private Bayesian network can be time consuming. Liu (2016a) proposes the model-based approach (modips) to generate differentially private synthetic data in the Bayesian framework. The modips can be computationally intensive in the large p setting. In addition, both PrivBayes and modips are subject to mis-specification of the synthesis models, which would lead to a biased synthetic sample. Abowd and Vilhuber (2008) propose to generate differentially private categorical data from the Multinomial/Dirichlet model in the Bayesian framework. McClure and Reiter (2012) propose a slightly different approach to synthesize one-dimensional binary data. Machanavajjhala et al. (2008) demonstrate that the Multinomial-Dirichlet synthesizer leads to poor inferences due to data sparsity when it is applied to release the commuting patterns of the US population data. Bowen and Liu (2016) also show that both approaches have worse performance than the full table sanitization via the Laplace mechanism and the modips approach at the same privacy budget. Hardt et al. (2012) propose the iterative Multiplicative Weights via Exponential Mechanism (MWEM) approach to generate a differentially private empirical distribution given a set of linear queries. Though MWEM was originally proposed for obtaining differentially private queries, synthetic data can be easily sampled from the differentially private empirical distribution. The MWEM algorithm achieves the near optimal bound on the l∞ error for the queries ∈ Q for an optimal number of iterations T. The downside of MWEM is that it is very sensitive to the choice of T, and choosing the optimal T can be challenging.

1.3 Our contributions

We propose a novel procedure, namely, Construction of Individual-level data from a set of differentially Private low-dimensional contingency tables tHrough solving linear Equations with Tikhonov Regularization (CIPHER), to generate differentially private empirical distributions which can be easily converted to individual-level data or microdata.

Oftentimes the population-level signals in real-life data with categorical attributes are contained in a set of low-dimensional contingency tables. For example, suppose there are p = 6 attributes in the original data. Seldom is the 6-way interaction among all 6 attributes meaningful or of interest. Meaningful signals in the data might well be summarized in a set of low-dimensional contingency tables, for example, (X1, X2) ⊥⊥ (X1, X3) ⊥⊥ (X2, X3) ⊥⊥ (X3, X4, X5) ⊥⊥ X6. In addition, it is often the case that an attribute occurs in more than one of the low-dimensional tables, such as X1, X2 and X3 in this example. If no sanitization is involved, then fitting a log-linear model with these interaction terms (e.g., X1X2 + X1X3 + X2X3 + X3X4X5 + X6) to the data would lead to consistent estimates for the full-table cell probabilities and frequencies. However, due to the injection of the differentially private noise, the marginal counts, say those of X3, would become inconsistent across the three tables that involve X3. One would need a method that can automatically correct for the inconsistency in the marginals, a goal that CIPHER can achieve by solving a set of equations, without having to explicitly incorporate the constraints.

CIPHER is conceptually simple and requires nothing more than decomposing joint probabilities via basic probability rules to construct a linear equation set Ax = b and subsequently solving the linear equations. The computational cost for solving the equation set is expected to be low once the equation sets are constructed. Since A is block-diagonal, taking the inverse of A^T A + λI is relatively cheap even if the linear equation set is large. Compared to the full table sanitization to re-generate individual-level data with privacy, the set of LDC tables that CIPHER works with has drastically lower requirements for computer storage and memory. For example, compared to the 9,765,625 cells resulting from the full table among 10 attributes with 5 levels each, there is a 95.4% and 99.99% reduction in the number of cells (down to 62,200 and 8,440, respectively) if the set of 210 four-way contingency tables or the set of 45 two-way contingency tables is used instead.

If the LDC tables are already given and differentially privately sanitized, then data users can apply CIPHER themselves to generate microdata for their analyses. During the whole CIPHER procedure, there is no probing or going back to the original data, thus DP is preserved. For data curators whose goal is to release microdata and who don't have the set of LDC tables yet, there are several options: first, to choose a set via a model selection procedure, which costs privacy budget per se; second, to leverage domain knowledge to come up with a set without relying on the specific values of the data at hand; third, to be conservative and use high-order contingency tables that are still of lower dimension than the full table. The latter two approaches do not cost privacy budget, so all of the budget can be directed toward sanitizing the LDC table set.

The remainder of the paper is organized as follows. Section 2 reviews the basic concepts in DP and some differentially private mechanisms related to this work. Section 3 introduces the CIPHER procedure and proposes the SSS (Sign and Statistical Significance) assessment to evaluate the inferences based on differentially private synthetic data against the original inferences. Section 4 compares CIPHER with several other sanitization methods on the statistical utility of the synthetic data in simulated and real-life data. Section 5 provides some concluding remarks and discusses future research directions.

2 Preliminaries

Consider a data set D. A query/statistic or a set of queries/statistics f asks specific questions about D. DP provides a rigorous and robust mathematical conceptual framework to protect individual privacy information when releasing the query results f.

Definition 1 (ε-differential privacy (Dwork et al., 2006)). A randomized mechanism R satisfies ε-differential privacy if for all data sets $D_1$ and $D_2$ differing on one element and all result subsets S to query f,

$$e^{-\varepsilon} \le \frac{\Pr[R(f(D_1)) \in S]}{\Pr[R(f(D_2)) \in S]} \le e^{\varepsilon}.$$

ε is often referred to as the privacy budget and is pre-specified. The smaller ε is, the more privacy protection is imposed on the individuals in the data, in the sense that the probabilities of getting the same sanitized query results via R for D1 and D2 get more similar. The formulation of privacy via DP is robust and guards against the worst-case scenario, as it does not impose any assumptions about the behavior or the background knowledge of data intruders.

Definition 2 (sequential composition and parallel composition (McSherry, 2009)). Let q = 1, ..., K represent a set of queries on data D and ε be the total privacy budget. Denote by $M_q$ a randomization mechanism of $\varepsilon_q$-DP. The Sequential Composition states that the sequence of $M_q(D)$ provides $\left(\sum_q \varepsilon_q\right)$-DP. The Parallel Composition states that the sequence of $M_q(X \cap D_q)$ provides ε-DP if $\{D_q\}$ are arbitrary disjoint subsets of D.

The sequential composition and parallel composition principles are very useful for tracking and counting privacy budget, and for designing differentially private mechanisms.

There are a variety of mechanisms to provide differentially private results, as alluded to in Section 1. Here we mention two of them, the Laplace mechanism and the Exponential mechanism, which will be used in the experiments in Section 4.

Definition 3 (Laplace mechanism (Dwork et al., 2006)). The ε-differentially private Laplace mechanism generates the sanitized query result as $f^*(D) = f(D) + \mathrm{Lap}(\Delta_f/\varepsilon)$, where $\Delta_f = \max_{D_1, D_2} \|f(D_1) - f(D_2)\|_1$ is the $l_1$ global sensitivity of query f, for all $D_1$, $D_2$ differing in one element.

The larger ∆f is, the more noise would be injected to f(D) to satisfy ε-DP. Generalizations of the Laplace mechanism include the Gaussian mechanism and the Generalized Gaussian mechanism that is built upon the lp norm (p ≥ 1) (Dwork et al., 2014; Liu, 2019), among others.
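Definitions 1 and 3 can be illustrated with a minimal Python sketch. It assumes a single counting query with sensitivity 1 (an assumption about the neighbouring relation, not a statement from the text); the function names, the seed and the count of 120 are made up for illustration. The sketch draws Laplace noise with scale ∆f/ε and checks numerically that the output densities for two neighbouring counts differ by a factor within [e^(-ε), e^(ε)].

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return true_value + Lap(sensitivity / epsilon), as in Definition 3."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)

def laplace_density(x, center, scale):
    """Density of the Laplace distribution centred at `center`."""
    return math.exp(-abs(x - center) / scale) / (2.0 * scale)

rng = random.Random(2019)          # fixed seed only to make the sketch repeatable
epsilon, sensitivity = 0.5, 1.0    # sensitivity 1 is an assumption for a single count
sanitized = laplace_mechanism(120.0, sensitivity, epsilon, rng)

# Density ratio of the sanitized output for neighbouring counts 120 and 121,
# evaluated at a few output values; each ratio lies in [exp(-eps), exp(eps)].
scale = sensitivity / epsilon
ratios = [laplace_density(x, 120.0, scale) / laplace_density(x, 121.0, scale)
          for x in (100.0, 119.5, 120.5, 121.0, 140.0)]
```

A smaller ε gives a larger noise scale and a tighter interval for the ratios, which matches the reading of ε as a privacy budget.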

Definition 4 (exponential mechanism (McSherry and Talwar, 2007)). Let u be a utility function that assigns a score to each possible output of a query to data D. The Exponential mechanism that satisfies ε-DP releases query result $f^*(D)$ with probability

$$\frac{\exp\left(u(f^*(D); D)\,\frac{\varepsilon}{2\delta_u}\right)}{\int \exp\left(u(f^*(D); D)\,\frac{\varepsilon}{2\delta_u}\right) d(f^*(D))},$$

where $\delta_u$ is the maximum change in score u with one element change in data D.
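For a finite set of candidate outputs, the integral in Definition 4 becomes a sum. The following Python sketch shows that discrete case only; the candidate names, the scores and the function names are hypothetical, and the scores are shifted by their maximum purely for numerical stability (the resulting probabilities are unchanged).

```python
import math
import random

def selection_probabilities(scores, epsilon, score_sensitivity):
    """Probabilities proportional to exp(epsilon * u / (2 * delta_u))."""
    top = max(scores)  # shifting by a constant cancels in the normalization
    weights = [math.exp(epsilon * (s - top) / (2.0 * score_sensitivity))
               for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def exponential_mechanism(candidates, scores, epsilon, score_sensitivity, rng):
    """Sample one candidate according to the exponential-mechanism probabilities."""
    probs = selection_probabilities(scores, epsilon, score_sensitivity)
    return rng.choices(candidates, weights=probs, k=1)[0]

rng = random.Random(7)             # fixed seed only to make the sketch repeatable
candidates = ["q1", "q2", "q3"]    # hypothetical candidate outputs
scores = [10.0, 12.0, 4.0]         # hypothetical utility scores u
probs = selection_probabilities(scores, epsilon=1.0, score_sensitivity=1.0)
picked = exponential_mechanism(candidates, scores, 1.0, 1.0, rng)
```

Candidates with higher utility are exponentially more likely to be released, and the factor 2δu in the exponent limits how much one record can shift these probabilities.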

3 CIPHER

We propose the CIPHER method to generate differentially private full tables and individual-level synthetic data from a set of LDC tables. As mentioned in Section 1, the main motivation for the development of CIPHER is the reduction of the query size to save on data storage, leveraging the common knowledge that high-order interactions in the full cross-tabulation are often meaningless and not worth preserving. Figure 1 shows the drastic reduction in the number of cells that need to be stored if the sets of 1-way, 2-way, 3-way, and 4-way LDC tables are used in place of the full table for varying p (the number of attributes in the original data). The order of the LDC tables used for getting the full table is allowed to grow with p, but again interactions of very high order are rarely of interest in real-life data and are also hard to explain, as well as analytically and computationally challenging.

[Figure 1: three panels titled 2^p, 3^p, and {2,3,4,5}^p; x-axis: dimension p (5 to 20); y-axis: log2(number of cells); curves: full table, 4-way, 3-way, 2-way, 1-way.]

Figure 1: log2(number of stored cells) for the FHD and sets of LDC tables of various dimensions vs. p
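The kind of comparison shown in Figure 1 can be sketched with a few lines of Python. The sketch rests on two simplifying assumptions that are ours: every attribute has the same number of levels, and the LDC set consists of all k-way tables, each stored as a dense array. The paper's own cell counts may be tallied differently, and the function names are hypothetical.

```python
from math import comb, log2

def full_table_cells(levels):
    """Cells in the full cross-tabulation; levels[i] = number of levels of attribute i."""
    cells = 1
    for k in levels:
        cells *= k
    return cells

def kway_set_cells(p, levels_per_attr, k):
    """Cells in the set of all k-way tables among p attributes (dense storage assumed)."""
    return comb(p, k) * levels_per_attr ** k

# log2 cell counts for binary attributes, for the dimensions on the x-axis of Figure 1.
log2_counts = {
    p: {
        "full": log2(full_table_cells([2] * p)),
        **{f"{k}-way": log2(kway_set_cells(p, 2, k)) for k in (1, 2, 3, 4)},
    }
    for p in (5, 10, 15, 20)
}
```

Under these assumptions the full table grows exponentially in p, while the set of all k-way tables grows only polynomially in p for fixed k.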

3.1 Method and Algorithm

The CIPHER algorithm is presented in Algorithm 1, followed by some remarks about the algorithm. In brief, the CIPHER procedure starts from the lowest-order contingency table(s) in a given set of LDC tables Q and arrives at a solution of the differentially private full table using a stepwise approach, without a need for complex sampling algorithms. The LDC tables in Q, which do not have to be of the same dimension, are expected to capture the important signals and relationships among the attributes in the original data. Two special cases of Q are the single p-way full table and the set of p one-way contingency tables, respectively. Forming Q can be guided by domain knowledge without having to consume the information and thus the privacy of the current data. If domain knowledge is not available or the data curator prefers to choose a set using the information of the current data, then the total privacy budget will need to be divided between the selection of Q and the CIPHER algorithm itself. In the rest of the discussion, we assume Q is preset before the application of the CIPHER algorithm.

Algorithm 1 CIPHER

1: INPUT: original data D (n × p); query set Q; privacy budget ε; number of synthetic data sets m (Remark 1); Tikhonov regularization constant λ (Remark 2).
2: Denote the lowest dimension of the LDC tables ∈ Q by $p_0$.
3: FOR l = 1, ..., m
4:   Sanitize all queries ∈ Q via a mechanism of ε-DP (e.g., $\tilde{q}_k^{(l)} = q_k + \mathrm{Lap}(0, \varepsilon/(m|Q|))$ for k = 1, ..., |Q| if the Laplace mechanism is used).
5:   FOR $j = p_0 + 1, \ldots, p$
6:     List all j-way contingency tables $T_j$.
7:     FOR each query $q_i \notin (T_{j+1} \cap Q)$, run the 5 steps below.
8:       1) Denote the set of variables that form query $q_i$ by $X_i$ and $p_i = |X_i|$.
9:       2) Randomly pick a variable out of $X_i$. WLOG, denote that variable by $X_{i1}$, and the rest of the variables by $X_{i2}, \ldots, X_{i,p_i}$. Denote the number of cells in $X_{ik}$ by $K_{ik}$ for $k = 1, \ldots, p_i$.
10:      3) For $k = 2, \ldots, (p_i - 1)$, define
           $b_k = \Pr(X_{i1} \ne K_{i1} \mid X_i \setminus (X_{i1}, X_{ik})) = \sum_{X_{ik}} \Pr(X_{i1} \ne K_{i1}, X_{ik} \mid X_i \setminus (X_{i1}, X_{ik}))$
           $= A_k z_k = \sum_{X_{ik}} \Pr(X_{ik} \mid X_i \setminus (X_{i1}, X_{ik})) \Pr(X_{i1} \ne K_{i1} \mid X_i \setminus (X_{i1}, X_{ik}), X_{ik})$,
         where $z_k$ is the conditional probability of $(X_{i1} \ne K_{i1})$ given the rest of the variables in $X_i$, $A_k$ is either observed or calculated from step j − 1, and $(X_{i1} \ne K_{i1})$ represents the vector $(X_{i1} = 1, \ldots, X_{i1} = K_{i1} - 1)$.
11:      4) Let $b = (b_1, \ldots, b_{p_i - 1})^T$, $z = (z_1, \ldots, z_{p_i - 1})^T$, and $A = \mathrm{Diag}\{A_1, \ldots, A_{p_i - 1}\}$; solve for z from Az = b with the Tikhonov regularization; that is, $z = (A^T A + \lambda I)^{-1} A^T b$, where I is the identity matrix.
12:      5) Calculate the empirical probability for $q_i$: $\Pr(X_i) = z \cdot \Pr(X_i \setminus X_{i1})$.
13:    END FOR
14:  END FOR
15:  Correct negativity and normalize the empirical joint probability $(\Pr(X))^{(l)} = (\Pr(X_1, \ldots, X_p))^{(l)}$ (Remark 3).
16:  Generate differentially private data $\tilde{D}^{(l)}$ of size n from $(\Pr(X))^{(l)}$.
17: END FOR
18: OUTPUT: m sets of differentially private data $\tilde{D}^{(1)}, \ldots, \tilde{D}^{(m)}$.

Remark 1 (number of synthetic data sets m). We recommend setting m at a small number > 1 if the released data will be used for statistical inferences. Releasing multiple sets offers a convenient way to account for the uncertainty and randomness introduced by the sanitization and synthesis procedures, coupled with proper inferential combination rules (Liu, 2016a). It is easy to implement in practice and can be viewed as a Monte Carlo approach to account for the sanitization and synthesis uncertainty. Though releasing a single set coupled with explicitly modeling the sanitization mechanism and the synthesis model can also help to accommodate the uncertainty, the modeling can be much more challenging analytically and computationally compared to releasing multiple sets. In addition, as long as m is not so large that the total privacy budget is spread too thin over the multiple sets (each synthetic set receives 1/m of the total privacy budget per the sequential composition theorem), the precision gained by averaging over m sets of synthetic data could outweigh the additional noise introduced from releasing multiple sets rather than a single set.

Remark 2 (Tikhonov regularization). The reason for using the Tikhonov regularization (aka the l2 regularization) to solve for z from Az = b is that the columns of A are linearly dependent and A^T A is not full rank. The Tikhonov regularization is known for solving ill-posed problems like Az = b when the solution z is not unique due to the singularity of A (Tikhonov, 1963; Tikhonov et al., 2013). It works by adding a small positive constant λ to the diagonal elements of A^T A and calculating z = (A^T A + λI)^{-1} A^T b. The constant λ is a tuning parameter. We found from the empirical studies that the solutions from CIPHER are relatively robust to the choice of λ and lead to similar joint distributions, except for some negligible numerical errors, as long as λ is relatively small (on the order of o(1)). Since A is block-diagonal, taking the inverse of A^T A + λI is relatively cheap computationally even if the linear equation set is large.
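As a small worked illustration (the matrix and vector are made up for exposition and do not come from the paper's experiments), take the single equation $z_1 + z_2 = 1$, that is, $A = (1 \;\; 1)$ and $b = 1$. Here $A^T A$ is singular, yet the regularized system has a unique solution:

$$A^T A + \lambda I = \begin{pmatrix} 1+\lambda & 1 \\ 1 & 1+\lambda \end{pmatrix}, \qquad (A^T A + \lambda I)^{-1} = \frac{1}{\lambda(2+\lambda)} \begin{pmatrix} 1+\lambda & -1 \\ -1 & 1+\lambda \end{pmatrix},$$

$$z = (A^T A + \lambda I)^{-1} A^T b = \frac{1}{\lambda(2+\lambda)} \begin{pmatrix} \lambda \\ \lambda \end{pmatrix} = \frac{1}{2+\lambda} \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$

As λ → 0, z approaches (1/2, 1/2), the minimum-norm solution of $z_1 + z_2 = 1$, and z changes very little across small values of λ, which is in line with the robustness to λ noted in Remark 2.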

Remark 3 (correction of non-negativity and normalization). The cell probabilities in the differentially private LDC tables in Q can be < 0 or ≥ 1. In addition, the solutions for the conditional probabilities from the linear equations in CIPHER can also be < 0 or ≥ 1. We could correct for the negativity by the truncation or the boundary inflation truncation procedures (Liu, 2016b) and normalize the probabilities every time the sanitized or solved probabilities are outside [0, 1), or we could wait until the last step of generating the full table to make one overall correction. We compared both approaches and found that oftentimes the two led to similar results, and the final overall correction in some cases led to better results. Given this and the fact that one correction is easier operationally than making multiple corrections during the CIPHER algorithm, we recommend users make one final correction when obtaining the joint distribution from the full table.
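The final correction and the sampling step (lines 15 and 16 of Algorithm 1) might look like the sketch below. It implements plain truncation at zero followed by normalization, not the boundary inflation truncation of Liu (2016b); the uniform fallback for an all-nonpositive vector, the function names and the example numbers are assumptions made for illustration.

```python
import random

def truncate_and_normalize(probs):
    """Set negative entries to zero, then rescale so that the entries sum to one."""
    clipped = [max(p, 0.0) for p in probs]
    total = sum(clipped)
    if total <= 0.0:
        # Assumption of this sketch: fall back to the uniform distribution.
        return [1.0 / len(probs)] * len(probs)
    return [p / total for p in clipped]

def sample_microdata(cells, probs, n, rng):
    """Draw n synthetic records (cell labels) from the corrected joint distribution."""
    return rng.choices(cells, weights=probs, k=n)

rng = random.Random(0)                 # fixed seed only to make the sketch repeatable
cells = ["00", "01", "10", "11"]       # cells of a hypothetical 2 x 2 full table
noisy = [0.42, -0.03, 0.35, 0.31]      # hypothetical solved probabilities before correction
probs = truncate_and_normalize(noisy)
synthetic = sample_microdata(cells, probs, n=1000, rng=rng)
```

Because both functions only touch sanitized quantities, they add no privacy cost.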

If two or more LDC tables in Q share the same variable(s), then after the sanitization, the frequencies in the LDC tables formed by the shared variables would be inconsistent. For example, suppose table T1 in set Q is a 3-way table (V1, V2, V3) and table T2 is a 3-way table (V1, V2, V4). The cell frequencies in the 2-way table (V1, V2) calculated from the two 3-way contingency tables would be the same in the original data, and so would be the cell frequencies in all the 1-way contingency tables. However, after noise is injected in the differentially private sanitization of T1 and T2, the bin counts in the table (V1, V2) calculated from T1 and T2 are not the same. Barak et al. (2007) transform the data into the Fourier domain, where adding noise will not violate consistency. However, this approach has a bottleneck in the linear programming when p is large. The CIPHER procedure does not have this issue with the way it solves for the empirical distributions. The inconsistency among the LHDs in Q, if they have some shared variables, is automatically averaged out when solving the non-full-rank linear equation set with the Tikhonov regularization.

Claim 1. The CIPHER algorithm satisfies the ε-DP.

The satisfaction of DP in CIPHER is straightforward to establish. The only time at which the original data are probed during the application of CIPHER is when the queries in Q are sanitized, and the data are accessed mK times (with K = |Q| the number of queries) with a privacy budget of ε/(mK) per access. Per the sequential composition, the total privacy budget is maintained at (mK) · ε/(mK) = ε.
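The budget accounting in Claim 1 can be made concrete with a short Python sketch of the sanitization step. The table names and counts are invented, the Laplace scale is taken as sensitivity divided by the per-access budget with sensitivity 1 (an assumption of this sketch), and `sanitize_query_set` is a hypothetical helper rather than the authors' code.

```python
import random

def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def sanitize_query_set(tables, epsilon, m, rng, sensitivity=1.0):
    """Return m noisy copies of every table and the list of budgets spent per access."""
    per_access = epsilon / (m * len(tables))   # epsilon / (m |Q|)
    scale = sensitivity / per_access           # assumed noise scale for each cell count
    releases, spent = [], []
    for _ in range(m):
        noisy_set = {}
        for name, counts in tables.items():
            noisy_set[name] = [c + laplace_noise(scale, rng) for c in counts]
            spent.append(per_access)           # one access to the original data
        releases.append(noisy_set)
    return releases, spent

rng = random.Random(1)                         # fixed seed only for repeatability
tables = {                                     # hypothetical 2-way tables, flattened
    "T(V1,V2)": [30, 20, 20, 30],
    "T(V2,V3)": [28, 22, 12, 38],
    "T(V1,V3)": [16, 34, 24, 26],
}
releases, spent = sanitize_query_set(tables, epsilon=1.0, m=5, rng=rng)
total_spent = sum(spent)                       # sequential composition over m|Q| accesses
```

Everything downstream of this step works on `releases` only, so no further budget is consumed.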

3.2 Example: Illustration of CIPHER in the 3-variable Case

We illustrate the CIPHER procedure with a simple example. Say the original data contain 3 variables (p = 3). Denote the 3 variables by V1, V2, V3 with K1, K2 and K3 levels, respectively. Let Q = {T(V1, V2), T(V2, V3), T(V1, V3)}, which contains all the 2-way contingency tables. Therefore, p0 = 2 in Algorithm 1. WLOG, suppose V3 plays the role of $X_{i1}$ in Algorithm 1. We first find the relationships among the probabilities, which are

$$\begin{cases} \Pr(V_3 \mid V_1) = \sum_{V_2} \Pr(V_3, V_2 \mid V_1) = \sum_{V_2} \Pr(V_3 \mid V_1, V_2) \Pr(V_2 \mid V_1) \\ \Pr(V_3 \mid V_2) = \sum_{V_1} \Pr(V_3, V_1 \mid V_2) = \sum_{V_1} \Pr(V_3 \mid V_1, V_2) \Pr(V_1 \mid V_2) \end{cases}$$

We now convert the above relationships into the equation set b = Az. Specifically, $b = (\Pr(V_3 \mid V_1) \setminus \Pr(V_3 = K_3 \mid V_1),\; \Pr(V_3 \mid V_2) \setminus \Pr(V_3 = K_3 \mid V_2))^T$ is a known vector of dimension $(K_1 + K_2)(K_3 - 1)$; $z = \Pr(V_3 \mid V_1, V_2) \setminus \Pr(V_3 = K_3 \mid V_1, V_2)$ is of dimension $K_1 K_2 (K_3 - 1)$; and A is a known block-diagonal matrix with $K_3 - 1$ identical blocks, where each block is a $(K_1 + K_2) \times (K_1 K_2)$ matrix comprising the coefficients (i.e., $\Pr(V_1 \mid V_2)$, $\Pr(V_2 \mid V_1)$ or 0) associated with z. After z is solved from b = Az, the joint distribution $\Pr(V_1, V_2, V_3)$ is calculated by $z \cdot \Pr(V_1, V_2)$. The experiments in Section 4 contain more complicated applications of CIPHER.
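The 3-variable construction can be traced in a self-contained Python sketch for three binary variables, so that K3 − 1 = 1 and A has a single 4 × 4 block. To keep the result checkable, the sketch starts from a made-up joint distribution and uses its exact 2-way margins; in CIPHER these margins would be the sanitized tables. The joint, the helper names and λ = 1e-6 are illustrative assumptions, and the small Gaussian-elimination solver stands in for any linear algebra routine.

```python
from itertools import product

def tikhonov_solve(A, b, lam):
    """Return z = (A^T A + lam I)^{-1} A^T b via Gaussian elimination."""
    n = len(A[0])
    M = [[sum(row[i] * row[j] for row in A) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    r = [sum(row[i] * bi for row, bi in zip(A, b)) for i in range(n)]
    for col in range(n):                       # forward elimination, partial pivoting
        piv = max(range(col, n), key=lambda k: abs(M[k][col]))
        M[col], M[piv] = M[piv], M[col]
        r[col], r[piv] = r[piv], r[col]
        for k in range(col + 1, n):
            f = M[k][col] / M[col][col]
            for j in range(col, n):
                M[k][j] -= f * M[col][j]
            r[k] -= f * r[col]
    z = [0.0] * n
    for i in reversed(range(n)):               # back substitution
        z[i] = (r[i] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def margin(dist, keep):
    """Marginal distribution over the variable positions listed in `keep`."""
    out = {}
    for cell, pr in dist.items():
        key = tuple(cell[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out

# Made-up joint Pr(V1, V2, V3) for binary variables; keys are (v1, v2, v3).
joint = {(0, 0, 0): 0.12, (0, 0, 1): 0.18, (0, 1, 0): 0.04, (0, 1, 1): 0.16,
         (1, 0, 0): 0.16, (1, 0, 1): 0.04, (1, 1, 0): 0.18, (1, 1, 1): 0.12}
p12, p13, p23 = margin(joint, (0, 1)), margin(joint, (0, 2)), margin(joint, (1, 2))
p1, p2 = margin(joint, (0,)), margin(joint, (1,))

cells = list(product((0, 1), repeat=2))        # order of the unknowns z[(v1, v2)]
A, b = [], []
for a in (0, 1):   # Pr(V3=1 | V1=a) = sum over v2 of z[a, v2] * Pr(V2=v2 | V1=a)
    A.append([p12[(v1, v2)] / p1[(a,)] if v1 == a else 0.0 for (v1, v2) in cells])
    b.append(p13[(a, 1)] / p1[(a,)])
for c in (0, 1):   # Pr(V3=1 | V2=c) = sum over v1 of z[v1, c] * Pr(V1=v1 | V2=c)
    A.append([p12[(v1, v2)] / p2[(c,)] if v2 == c else 0.0 for (v1, v2) in cells])
    b.append(p23[(c, 1)] / p2[(c,)])

z = tikhonov_solve(A, b, lam=1e-6)             # z[(v1, v2)] estimates Pr(V3=1 | v1, v2)

rebuilt = {}                                   # joint = z * Pr(V1, V2)
for (v1, v2), zi in zip(cells, z):
    rebuilt[(v1, v2, 1)] = zi * p12[(v1, v2)]
    rebuilt[(v1, v2, 0)] = (1.0 - zi) * p12[(v1, v2)]
```

The rebuilt joint reproduces the three 2-way margins up to an error of order λ, but it need not equal the original joint: the 2-way tables do not determine the 3-way interaction, and the regularized solution is one of many consistent with them.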

3.3 Differences between CIPHER and MWEM

Both CIPHER and MWEM can work with a pre-specified set of linear queries to generate an empirical distribution, but they are methodologically and algorithmically different. First, MWEM relies on an iterative multiplicative weighting procedure, whereas CIPHER is not an iterative procedure but solves one or more sets of linear equations analytically to reach the differentially private empirical joint distribution among the p variables. Second, the queries in CIPHER are sanitized through a DP mechanism (say the Laplace sanitizer) before being fed into the algorithm, and they only need to be sanitized once. By contrast, each iteration in the MWEM algorithm incurs privacy cost because it accesses the original data to fetch the query selected by the Exponential mechanism, which is subsequently sanitized by the Laplace mechanism. As a result, the two algorithms spend different amounts of privacy budget on a query for a given total privacy budget. Suppose the total budget is fixed at ε for the CIPHER and MWEM algorithms. The number of queries in Q is |Q|. If we use equal allocation of the privacy budget, then each query in Q gets a budget of ε/|Q| in the CIPHER algorithm. The sanitization of each query selected by the Exponential mechanism costs ε/(2T) in the MWEM algorithm. On the other hand, a query can be selected multiple times throughout the T iterations. Let $c_k$ denote the number of times $q_k \in Q$ is selected among the T iterations. Note that $\sum_{k=1}^{|Q|} c_k = T$. Unless $c_k/(2T) > |Q|^{-1}$, or equivalently $c_k / \sum_{k=1}^{|Q|} c_k > 2|Q|^{-1}$, the budget allocated to $q_k$ in the MWEM algorithm would always be smaller than that in CIPHER. In other words, the selection probability for a query needs to be at least double the average selection probability (1/|Q|) for the query to receive more privacy budget in the MWEM algorithm than in the CIPHER algorithm. Our own experiences from running the MWEM algorithm suggest that choosing the "right" number of iterations T for MWEM can be challenging. A T that is too small is not sufficient to allow the empirical distribution to fully capture the signals summarized in the queries, and a T that is too large would lead to a large amount of noise being injected, as the privacy budget has to be distributed across the T iterations, eventually leading to a useless synthetic data set as each iteration costs privacy.
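The budget comparison above reduces to simple arithmetic, sketched here in Python. The function names and the numbers (ε = 1, |Q| = 10, T = 20) are illustrative assumptions; the formulas are the per-query budgets ε/|Q| for CIPHER and c_k · ε/(2T) for MWEM stated in the text.

```python
def cipher_budget_per_query(epsilon, num_queries):
    """Equal allocation in CIPHER: each query receives epsilon / |Q|."""
    return epsilon / num_queries

def mwem_budget_for_query(epsilon, num_iterations, times_selected):
    """Laplace budget accumulated by a query selected c_k times at epsilon / (2T) each."""
    return times_selected * epsilon / (2.0 * num_iterations)

def mwem_gets_more(num_queries, num_iterations, times_selected):
    """True when c_k / (2T) > 1 / |Q|, i.e. MWEM spends more on this query than CIPHER."""
    return times_selected / (2.0 * num_iterations) > 1.0 / num_queries

epsilon, num_queries, num_iterations = 1.0, 10, 20
cipher_share = cipher_budget_per_query(epsilon, num_queries)
# A query selected at the average rate T / |Q| = 2 times.
mwem_average = mwem_budget_for_query(epsilon, num_iterations, times_selected=2)
# A query selected 5 times, more than double the average rate.
mwem_favoured = mwem_budget_for_query(epsilon, num_iterations, times_selected=5)
```

With these numbers a query selected at the average rate receives half the CIPHER budget under MWEM, and only queries selected more than four times out of twenty iterations come out ahead.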

4 Experiments

We run experiments with simulated and real-life data to evaluate CIPHER, and benchmark its performance against MWEM and the full table sanitization. We provide below the justification for the choice of these two methods to compare to CIPHER.

4.1 Methods for Comparison

The full table sanitization can be achieved through injecting independent Laplace noise drawn from $\mathrm{Lap}(0, \varepsilon^{-1})$ to the cell frequencies in the full table across all the attributes in a data set. Though technically there is only one query (a single histogram), the number of cells grows quickly with p (Figure 1), not to mention that a lot of cells in the full table are likely to be empty. From a statistical perspective, constructing the full table is equivalent to fitting a log-linear model with all possible interactions among all p attributes. Hay et al. (2016) (in answering 1D or 2D range queries) and Bowen and Liu (2016) show that the full table sanitization is likely to outperform or be similar to the more complex algorithms (e.g., modips, the Multinomial-Dirichlet synthesizer, DPcube, Privelet) in utility when the size of the query set is large or when n or the privacy budget is high. The flat Laplace sanitizer is therefore a useful baseline to benchmark against for other differentially private methods for generating queries or synthetic data, especially considering its simplicity for practical implementation.

The MWEM algorithm achieves the near optimal bound on the l∞ error between the original and sanitized linear queries in Q. Though originally proposed for obtaining differentially private linear queries, the MWEM algorithm is ready for generating synthetic data, assuming the queries are representative of the population-level signals in the data, given that it outputs a differentially private empirical distribution. Given that both the CIPHER and MWEM algorithms work with a pre-specified linear query set, and since MWEM achieves the near optimal l∞ error on the query set, it makes sense to compare CIPHER to MWEM to see if it can beat the MWEM procedure in the l∞ error as well as on other utility metrics.

Though there exist other methods to generate synthetic data from a set of low-dimensional queries in categorical data, the queries are often model-based (e.g., PrivBayes and MODIPS). Selection of these queries can be computationally costly, especially when the dimension of the data is high; and some of the queries used in these procedures are not linear or not straightforward to sanitize (e.g., regression coefficients from logistic regression).

All taken together, in the experiments below, we focus on the comparison between CIPHER and MWEM, using the full table sanitization as the baseline. We aim to show that CIPHER delivers better utility than MWEM, with much lower requirements on data storage than the full table sanitization.

4.2 The SSS assessment

When comparing the utility of synthetic data generated by CIPHER, MWEM, and the full table sanitization, we not only examine descriptive statistics such as the mean and the $l_p$ (p > 0) distance between the synthetic and the original data, but also examine how well the information is preserved in statistical inferences on population parameters when hypothesis testing is involved. Toward that end, we propose the SSS assessment. The first S refers to the Sign of the estimated parameter, and the second and third S's refer to the Statistical Significance of the estimated parameter. The consistency in the sign and statistical significance of the parameter estimates based on the original and synthetic data leads to seven possible scenarios, as listed in Table 1. The best scenario is when both the sign and the statistical significance of the parameter estimates from the original and synthetic data match; the worst-case scenario is when both estimates are statistically significant but have opposite signs, which entails detrimental consequences in practice. Between the two extremes, there are five other possibilities.

Table 1: Preservation of Signs and Statistical Significance on the estimated parameters (the SSS assessment)

Parameter estimates | Best | Best | II+ | I+ | Neutral | II- | I- | Worst
Matching signs between original and synthetic? | Y | Y | Y | Y | N | N | N | N
Statistical significance in original data | Y | N | Y | N | N | Y | N | Y
Statistical significance in synthetic data | Y | N | N | Y | N | N | Y | Y

• II+ and I+ indicate an increase in the Type II and Type I error rates, respectively. In both cases the signs match, but for II+ the estimate goes from significant in the original data to non-significant in the synthetic data, resulting in an inflated Type II error rate; for I+ it goes from non-significant in the original data to significant in the synthetic data, resulting in an inflated Type I error rate.

• Neutral indicates that the sign changes between the original data and the synthetic data, but the estimate is not significant in either case.

• II- indicates a sign change together with a change from significant in the original data to non-significant in the synthetic data; I- indicates a sign change together with a change from non-significant in the original data to significant in the synthetic data.

For the synthetic data, we want the probability of the best scenario to be high, followed by Neutral, II+, II-, I+, and I-, and we hope the worst-case scenario has a close-to-zero probability of occurring. We apply the SSS assessment to the data in the experiments to compare the inferences between the original data and the differentially private synthetic data.
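The seven SSS scenarios can be expressed as a small decision rule. The Python sketch below is one possible encoding, assuming each estimate comes with a boolean significance flag (for example, from a test at a fixed level); the function name and the treatment of an estimate of exactly zero as non-positive are illustrative assumptions.

```python
def sss_category(est_orig, sig_orig, est_syn, sig_syn):
    """Classify a pair of estimates into one of the seven SSS scenarios.

    est_*: point estimates from the original and synthetic data.
    sig_*: True if the corresponding estimate is statistically significant.
    """
    same_sign = (est_orig > 0) == (est_syn > 0)
    if same_sign:
        if sig_orig == sig_syn:
            return "Best"  # sign and significance both match
        # Significance lost -> inflated Type II; gained -> inflated Type I.
        return "II+" if sig_orig else "I+"
    if sig_orig and sig_syn:
        return "Worst"  # both significant, opposite signs
    if not sig_orig and not sig_syn:
        return "Neutral"  # sign flips, neither significant
    return "II-" if sig_orig else "I-"
```

Tabulating the returned labels over all coefficients and repetitions gives the scenario proportions that the SSS assessment reports.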

4.3 Experiment 1: Simulated Data

In this experiment, we use simulated data to investigate the inferential properties and the utility of the sanitized data sets generated via CIPHER, and compare them to those from the MWEM algorithm and the full table sanitization.

The simulation study examines a data scenario with 4 categorical variables, where V1 and V2 have 2 categories each and V3 and V4 have 3 categories each. The data were simulated via a sequence of multinomial logistic regression models. Specifically,

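$V_1 \sim \text{Bernoulli}(0.5)$;

$V_2 \mid V_1$ was simulated from a logistic model

$$\text{logit}(\Pr(V_2 = 1 \mid V_1)) = \beta_0 + \beta_1 V_1$$

with $\beta_0 = 0.5$ and $\beta_1 = 1$;

$V_3 \mid V_1, V_2$ was simulated from a multinomial logistic model

$$\ln\left(\frac{\Pr(V_3 = 2 \mid V_1, V_2)}{\Pr(V_3 = 1 \mid V_1, V_2)}\right) = \beta_{01} + \beta_{11} V_1 + \beta_{21} V_2$$

$$\ln\left(\frac{\Pr(V_3 = 3 \mid V_1, V_2)}{\Pr(V_3 = 1 \mid V_1, V_2)}\right) = \beta_{02} + \beta_{12} V_1 + \beta_{22} V_2$$

with $\beta_{01} = -1$, $\beta_{11} = 2$, $\beta_{21} = 1$, $\beta_{02} = 0.5$, $\beta_{12} = 1$, $\beta_{22} = -1$;

$V_4 \mid V_1, V_2, V_3$ was simulated from a multinomial logistic model

$$\ln\left(\frac{\Pr(V_4 = 2 \mid V_1, V_2, V_3)}{\Pr(V_4 = 1 \mid V_1, V_2, V_3)}\right) = \beta_{01} + \beta_{11} V_1 + \beta_{21} V_2 + \beta_{31} 1(V_3 = 1) + \beta_{41} 1(V_3 = 2)$$

$$\ln\left(\frac{\Pr(V_4 = 3 \mid V_1, V_2, V_3)}{\Pr(V_4 = 1 \mid V_1, V_2, V_3)}\right) = \beta_{02} + \beta_{12} V_1 + \beta_{22} V_2 + \beta_{32} 1(V_3 = 1) + \beta_{42} 1(V_3 = 2)$$

with $\beta_{01} = 1.5$, $\beta_{11} = -1$, $\beta_{21} = 0.5$, $\beta_{31} = 1$, $\beta_{41} = -2$, and $\beta_{02} = 1$, $\beta_{12} = -1.5$, $\beta_{22} = -0.5$, $\beta_{32} = 0.75$, $\beta_{42} = -1$.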
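The data-generating process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: V1 and V2 are coded 0/1, V3 and V4 are coded 1, 2, 3 with category 1 as the baseline, and the helper names and the seed are hypothetical.

```python
import math
import random


def softmax_draw(linear_predictors, rng):
    """Draw a category in 1..K from log-odds relative to baseline category 1."""
    weights = [1.0] + [math.exp(lp) for lp in linear_predictors]
    return rng.choices(range(1, len(weights) + 1), weights=weights)[0]


def simulate_record(rng):
    # V1 ~ Bernoulli(0.5)
    v1 = 1 if rng.random() < 0.5 else 0
    # logit Pr(V2 = 1 | V1) = 0.5 + 1 * V1
    p2 = 1.0 / (1.0 + math.exp(-(0.5 + 1.0 * v1)))
    v2 = 1 if rng.random() < p2 else 0
    # V3 | V1, V2: multinomial logit with category 1 as reference
    v3 = softmax_draw([-1 + 2 * v1 + 1 * v2,
                       0.5 + 1 * v1 - 1 * v2], rng)
    # V4 | V1, V2, V3: indicators for V3 = 1 and V3 = 2
    i1, i2 = int(v3 == 1), int(v3 == 2)
    v4 = softmax_draw([1.5 - 1 * v1 + 0.5 * v2 + 1 * i1 - 2 * i2,
                       1 - 1.5 * v1 - 0.5 * v2 + 0.75 * i1 - 1 * i2], rng)
    return (v1, v2, v3, v4)


rng = random.Random(2019)
data = [simulate_record(rng) for _ in range(200)]  # one sample with n = 200
```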

We examine two sample size scenarios, n = 200 and n = 500, each under five privacy budget scenarios ε = ($e^{-2}$, $e^{-1}$, 1, $e$, $e^{2}$). We run 1,000 repetitions for each n and ε scenario to investigate the stability of each method. A total of m = 5 synthetic data sets were generated by each of CIPHER, MWEM, and the full table sanitization, so that the uncertainty of the synthesis model and the randomness introduced by the differentially private mechanisms can be properly accounted for. Each synthetic data set has the same sample size as the original data set.

For the Laplace sanitizer, the full table across the 4 variables contains 36 cells. Laplace noise was drawn from Lap(0, $(m\varepsilon)^{-1}$) and added to each of the 36 cell counts in the full table. For the CIPHER and MWEM algorithms, we consider two different query sets Q: (1) Q3 contains all 4 three-way contingency tables among the four variables, which leads to 32 cells (88.9% of the full table); (2) Q2 contains all 6 two-way contingency tables among the four variables, which leads to 20 cells (55.6% of the full table).

For the CIPHER algorithm, we sanitized all the contingency tables in Q2 or Q3 and followed the steps in Algorithm 1 to synthesize the individual-level data. We use CIPHER 3-way and CIPHER 2-way to denote the two cases, according to whether Q3 or Q2 is used. The linear equation sets in both cases are presented in the supplementary materials. For the MWEM algorithm, the starting distribution was set as the mutually independent categorical distribution with equal probability across all categories for each of the four variables. We run both MWEM 3-way (if Q3 is used as the query set) and MWEM 2-way (if Q2 is used as the query set). The number of iterations T can greatly affect the quality of the synthetic data. Since this is a simulation study, we were able to use independent simulated data from the same model to roughly optimize T for different ε and n; specifically, T = {5, 15, 25, 60, 120} at n = 200 and T = {10, 25, 50, 100, 200} at n = 500 for ε = {$e^{-2}$, $e^{-1}$, 1, $e^{1}$, $e^{2}$}, respectively.

[Figure 2 plot omitted. Panels: TVD for One-Way Tables, TVD for Two-Way Tables, and TVD for Three-Way Tables, at n = 200 and n = 500. x-axis: privacy budget ($e^{-2}$, $e^{-1}$, $e^{0}$, $e^{1}$, $e^{2}$); y-axis: Total Variation Distance (TVD). Legend: CIPHER 3-way, CIPHER 2-way, MWEM 3-way, MWEM 2-way, full table sanitization (flat Laplace).]

Figure 2: Total Variation Distance (mean ± SD) on 1-way, 2-way and 3-way tables in Experiment 1

[Figure 3 plot omitted. Panels: n = 200 and n = 500.]

Figure 3: The SSS (Signs and Statistical Significance) assessment on the estimated regression coefficients for n = 200 and n = 500

We run three types of analyses on the synthetic data. The first two are descriptive and examine the ability of each method to recover the original information; the third is inferential: it compares analysis results between the synthetic and original data and examines the ability of the methods to preserve the population-level information for statistical inference.

In the first analysis, the average total variation distance (TVD) between the original and synthetic data sets was calculated for the cell probabilities in all three-way, two-way and one-way tables, respectively. The TVD for the cell probabilities in a table is defined as $|p - \bar{p}^*|/2$, where $p$ and $\bar{p}^*$ represent the cell probabilities in the original data and those averaged over the m synthetic data sets, respectively; the TVDs were then averaged over all k-way tables for k = 1, 2, 3.

In the second analysis, we examine the $l_\infty$ error for Q2 and Q3, respectively. MWEM is claimed to have the optimal $l_\infty$ error for the set of queries that are fed to the algorithm with an optimal T (Hardt et al., 2012).

In the third analysis, we fitted the multinomial logistic model with V4 as the outcome and V1, V2, V3 as covariates. The inferences from the m = 5 synthetic data sets were combined using the combination rule in Liu (2016a). Specifically, the final point estimate for a parameter β is

$$\bar{\beta} = m^{-1} \sum_{j=1}^{m} \hat{\beta}^{(j)},$$

where $\hat{\beta}^{(j)}$ is the MLE of β in synthetic set j, and the variance is estimated by

$$V = m^{-1} B + W, \qquad W = m^{-1} \sum_{j=1}^{m} v^2_{(j)}, \qquad B = (m-1)^{-1} \sum_{j=1}^{m} \left(\hat{\beta}^{(j)} - \bar{\beta}\right)^2,$$

where W is the average within-set variability, $v^2_{(j)}$ is the variance estimate of $\hat{\beta}^{(j)}$, and B is the between-set variability. Inferences on β are based on the t-distribution $t_\nu(\bar{\beta}, V)$ with degrees of freedom $\nu = (m-1)(1 + mW/B)^2$. The bias, root mean square error (RMSE), coverage probability (CP) and confidence interval (CI) width of the 95% CI were determined for each of the regression coefficients from the multinomial logistic regression model. We also run the SSS assessment on the regression coefficients to evaluate the consistency between synthetic and original data in the inferences on the parameters.
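The combination rule above reduces to a few arithmetic steps. The following Python sketch computes the combined point estimate, the variance V = B/m + W, and the degrees of freedom for a single parameter; the function name and the toy inputs are assumptions, and the sketch does not handle the degenerate case B = 0, where the degrees of freedom are undefined.

```python
def combine_estimates(estimates, variances):
    """Combine per-set estimates of one parameter across m synthetic sets.

    estimates: the m point estimates (one per synthetic data set).
    variances: the m within-set variance estimates.
    Returns (beta_bar, V, nu).
    """
    m = len(estimates)
    beta_bar = sum(estimates) / m                               # point estimate
    w = sum(variances) / m                                      # within-set variability W
    b = sum((e - beta_bar) ** 2 for e in estimates) / (m - 1)   # between-set variability B
    v = b / m + w                                               # V = B/m + W
    nu = (m - 1) * (1 + m * w / b) ** 2                         # degrees of freedom
    return beta_bar, v, nu


# Toy example with m = 3 (hypothetical numbers).
beta_bar, v, nu = combine_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

A 95% CI would then be formed from the t-distribution with `nu` degrees of freedom, centered at `beta_bar` with scale given by the square root of `v`.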

The results for the average TVD are presented in Figure 2. Between CIPHER and MWEM, MWEM produces similar or smaller bias compared to CIPHER when ε = $e^{-2}$, but is outperformed by CIPHER at ε > 1. There is not much difference between 3-way and 2-way CIPHER, or between 3-way and 2-way MWEM, for this analysis. The full table sanitization is the best performer overall, especially in the 3-way table case for ε ≥ $e^{-1}$. CIPHER and the full table sanitization deliver similar performance for 1-way and 2-way tables when ε ≥ e.
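For reference, the TVD metric used in this comparison can be sketched as follows. The sketch assumes that $|p - \bar{p}^*|$ denotes the sum of absolute cell-wise differences (the usual reading of total variation distance) and that each table is stored as a dictionary from cell labels to probabilities; both are assumptions made for illustration.

```python
def average_probs(tables):
    """Average cell probabilities over the m synthetic tables."""
    m = len(tables)
    cells = set().union(*tables)
    return {c: sum(t.get(c, 0.0) for t in tables) / m for c in cells}


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two sets of cell probabilities."""
    cells = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cells)


# Toy one-way table (hypothetical numbers) with m = 2 synthetic sets.
p_orig = {"a": 0.5, "b": 0.5}
p_bar = average_probs([{"a": 0.6, "b": 0.4}, {"a": 0.8, "b": 0.2}])
tvd = total_variation_distance(p_orig, p_bar)
```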

[Figure 4 plot omitted. Panels: Two-Way Tables and Three-Way Tables, at n = 200 and n = 500. x-axis: privacy budget ($e^{-2}$, $e^{-1}$, $e^{0}$, $e^{1}$, $e^{2}$); y-axis: Maximum Absolute Difference. Legend entries recovered: CIPHER 3-way, CIPHER 2-way, MWEM 3-way.]

Figure 4: $l_\infty$ (mean ± SD) for Q2 and Q3 in Experiment 1

The results for the $l_\infty$ error over the prespecified query set are given in Figure 4. The performance of MWEM does not seem to live up to the claim that it has the optimal $l_\infty$ error for the set of queries that are fed to the algorithm with an optimal T. For example, for 3-way tables, per this claim, MWEM 3-way would have produced the smallest $l_\infty$ error, which is not the case per the results. This might be because T, a hyper-parameter that is not easy to tune, was not optimized precisely. In summary, the three methods are similar at ε = $e^{-2}$, but the Laplace sanitizer edges ahead as ε increases. CIPHER also outperforms MWEM when ε > $e^{-1}$.

The results for the SSS assessment on the regression coefficients from the logistic regression are provided in Figure 3. A method with the longest red bar (the best-case scenario) and the shortest purple bar (the worst-case scenario) would be preferable. The two inflated Type I error types (I+/yellow bar and I-/blue bar) would preferably be of low probability. The two inflated Type II error or decreased power types (II+/orange bar and II-/green bar) and Neutral (gray) are acceptable. Per the criteria listed above, first, it is comforting to see that the undesirable cases (purple and blue bars) are the shortest among all 7 scenarios for each DIPS method; second, as expected, the inferences improve quickly with CIPHER and the full table sanitization, and rather slowly with MWEM, as ε increases; third, the full table sanitization is the best performer in preserving SSS, especially for medium-valued ε, followed closely by CIPHER. Finally, even for CIPHER and the full table sanitization, there are always non-ignorable proportions of II+ (and II- when ε is small), even when ε is as large as $e^{2}$, suggesting that the sanitization decreases the efficiency of the statistical inferences, which is the expected price paid for privacy protection.
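One way to compute an $l_\infty$ error of this kind is to take the largest absolute difference in cell counts over all k-way marginal tables. The Python sketch below does this for two record lists; comparing original and synthetic counts in this way, as well as the function names and toy records, are assumptions made for illustration rather than the exact procedure used in the experiment.

```python
from collections import Counter
from itertools import combinations


def marginal_counts(records, attrs):
    """Cell counts of the marginal table over the attribute indices in `attrs`."""
    return Counter(tuple(r[a] for a in attrs) for r in records)


def linf_error(original, synthetic, n_attrs, k):
    """Maximum absolute count difference over all k-way marginal tables."""
    worst = 0
    for attrs in combinations(range(n_attrs), k):
        orig = marginal_counts(original, attrs)
        syn = marginal_counts(synthetic, attrs)
        for cell in set(orig) | set(syn):
            # A Counter returns 0 for cells that are absent.
            worst = max(worst, abs(orig[cell] - syn[cell]))
    return worst


# Toy records with 3 attributes (hypothetical).
original = [(0, 0, 1), (0, 1, 1), (1, 1, 2)]
synthetic = [(0, 0, 1), (1, 1, 1), (1, 1, 2)]
err_2way = linf_error(original, synthetic, n_attrs=3, k=2)
```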

[Figure 5 plot omitted. Rows of panels: bias, Coverage Probability, rmse, log(Confidence Interval Width). Columns of panels: ε = $e^{-2}$, $e^{-1}$, 1, $e^{1}$, $e^{2}$. x-axis: $\beta_{01}$, $\beta_{11}$, $\beta_{21}$, $\beta_{31}$, $\beta_{41}$, $\beta_{02}$, $\beta_{12}$, $\beta_{22}$, $\beta_{32}$, $\beta_{42}$. Legend: CIPHER 3-way, CIPHER 2-way, MWEM 3-way, MWEM 2-way, full table sanitization, Original.]

Figure 5: Bias, root mean square error (rmse), coverage probability, and log(CI width) of 95% CI at n = 200 in Experiment 1

The results on the bias, RMSE, CP, and CI width are presented in Figures 5 and 6 for n = 200 and n = 500, respectively. First, between MWEM and CIPHER, CIPHER always delivers near-nominal CP across all examined ε and both n scenarios, while MWEM suffers severe under-coverage on some parameters. The two methods have similar bias when ε < 1, but the bias of CIPHER shrinks toward 0 for ε > 1, especially for the 3-way CIPHER, while MWEM has bias of similar magnitude across all ε values. MWEM does, however, have the smallest RMSE and CI width for ε ≤ 1. The RMSE and CI width for CIPHER decrease quickly and approach the original values with increasing ε, whereas those associated with MWEM remain largely constant. Second, CIPHER delivers similar performance to the full table sanitization for ε < $e^{-1}$, but the latter has smaller bias, RMSE, and CI width for ε > $e^{-1}$. Similar to CIPHER, the full table sanitization always has near-nominal CP. Third, the performance of all the methods improves as n increases from 200 to 500 with regard to the bias, RMSE, CP and CI width.

[Figure 6 plot omitted. Rows of panels: bias, Coverage Probability, rmse, log(Confidence Interval Width). Columns of panels: ε = $e^{-2}$, $e^{-1}$, $e^{0}$, $e^{1}$, $e^{2}$. x-axis: $\beta_{01}$, $\beta_{11}$, $\beta_{21}$, $\beta_{31}$, $\beta_{41}$, $\beta_{02}$, $\beta_{12}$, $\beta_{22}$, $\beta_{32}$, $\beta_{42}$.]

Figure 6: Bias, root mean square error (rmse), coverage probability and log(CI width) of 95% CI at n = 500 in Experiment 1.

4.4 Experiment 2: Company Bankruptcy Data

This experiment runs on a real-life qualitative bankruptcy data set. Qualitative bankruptcy data are often used for feature selection in bankruptcy prediction and to discover experts' decision rules on bankruptcy vs. non-bankruptcy given the qualitative attributes (Kim and Han, 2003; Tsai, 2009; Nagaraj and Sridhar, 2015). The data set used was collected to identify the qualitative risk factors associated with bankruptcy and is available for download from the UCI Machine Learning repository (Dheeru and Karra Taniskidou, 2017). The data contain n = 250 businesses and 7 variables (Table 2). Though the data set does not contain any identifiers, sensitive information (such as the bankruptcy evaluation or Credibility) may still be disclosed using the pseudo-identifiers left in the data (such as Industrial Risk or Competitiveness), or the data may be linked to other public data to trigger other types of information disclosure.

When applying CIPHER and MWEM to the bankruptcy data, we first decided on the set of LDC tables Q to be sanitized. We selected Q based on domain knowledge and on computational and analytical considerations when solving the linear equations. Specifically, we first created a 6-category Class/CR variable from the full cross-tabulation of Class and CR, both of which can be regarded as sensitive information and might be associated, and a 9-category IR/CO variable from the IR and CO cross-tabulation; we then applied CIPHER 2-way and MWEM 2-way to the 5 variables with 6 (Class/CR), 9 (IR/CO), 3 (OR), 3 (MR), and 3 (FF) levels, respectively. The size of Q (the number of counts) is thus 149, though technically speaking there are 10 sets of 2D histogram queries. After the synthetic data were generated, we decoupled the two sets of combined variables (Class/CR and IR/CO), so the final synthetic data set still contains all 7 attributes in the original data set. In terms of the original 7 attributes, the Q employed by the CIPHER and MWEM procedures contains one 4-way contingency table, six 3-way contingency tables, and three 2-way contingency tables. For the MWEM algorithm, we examine two iteration scenarios, T = 5 and T = 20, depending on the value of ε (T = 5 for small ε from ∼0.14 to ∼0.37 and T = 20 for larger ε from 1 to ∼2.72).

Table 2: Variables in the Bankruptcy Data

Variable | Category (Frequency)
industrial risk (IR) | positive (80), average (89), negative (81)
management risk (MR) | positive (62), average (119), negative (69)
financial flexibility (FF) | positive (57), average (119), negative (74)
credibility (CR) | positive (79), average (94), negative (77)
competitiveness (CO) | positive (91), average (103), negative (56)
operating risk (OR) | positive (79), average (114), negative (57)
Class | bankruptcy (107), non-bankruptcy (143)

For the full table sanitization, there are 1,458 cells in the cross-tabulation across the 7 attributes, which is about 10 times the number of cells for CIPHER and MWEM (149). Among the 1,458 cells, 1,355 are empty cells, which should be regarded as sample zeros, meaning that these cells are empty because of the finite sample size and are expected to change or disappear as the sample size increases or in a different sample data set. In other words, these sample-zero cells are part of the data and should be sanitized in the same way as the non-empty cells; otherwise, information about the raw data would be leaked. The same rule applies to CIPHER and MWEM when empty cells are encountered in Q.

We consider 4 privacy budget levels ε = ($e^{-2}$, $e^{-1}$, 1, $e^{1}$), and run 24 repetitions for each ε and each method to examine the stability of the methods. In each repetition, 5 synthetic data sets with n = 250 were generated. We ran a logistic regression model with "Class" as the outcome variable (bankruptcy vs. non-bankruptcy) and the other attributes as predictors, and a support vector machine (SVM) analysis to predict "Class" using the other attributes, both benchmarked against the original results. Understanding what predicts the bankruptcy status, and being able to predict the bankruptcy status with high accuracy, are what companies and banks would be interested in. In both analyses, the results from the 5 synthetic data sets were combined using the combination rule outlined in Liu (2016a).
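The cell count and the role of sample zeros can be checked with a short Python sketch. It enumerates every cell of the 7-attribute cross-tabulation (six three-level risk attributes and the two-level Class), so that empty cells appear explicitly with a count of zero and would receive noise like any other cell. The three toy records are hypothetical and are not taken from the bankruptcy data.

```python
import itertools
from collections import Counter

RISK_LEVELS = ("positive", "average", "negative")
CLASS_LEVELS = ("bankruptcy", "non-bankruptcy")
# Six qualitative risk attributes followed by Class.
LEVELS = [RISK_LEVELS] * 6 + [CLASS_LEVELS]


def full_table(records):
    """Counts for all cells of the full cross-tabulation, sample zeros included."""
    counts = Counter(records)
    return {cell: counts[cell] for cell in itertools.product(*LEVELS)}


# Hypothetical toy records, for illustration only.
toy = [
    ("positive",) * 6 + ("non-bankruptcy",),
    ("negative",) * 6 + ("bankruptcy",),
    ("negative",) * 6 + ("bankruptcy",),
]
table = full_table(toy)
n_cells = len(table)                                  # 3**6 * 2
n_empty = sum(1 for c in table.values() if c == 0)    # sample zeros
```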

In the logistic regression model, we examined the relationships of the 6 qualitative categorical covariates (IR, MR, FF, CR, CO, and OR) with the outcome variable Class to determine the odds of bankruptcy (Kim and Han, 2003). Each of the categorical covariates has three categories, and the "average" level of risk was used as the reference for each. Specifically, the model is

$$\log\left(\frac{P(\text{bankruptcy})}{1 - P(\text{bankruptcy})}\right) = \beta_0 + \beta_1 \text{IR}_N + \beta_2 \text{IR}_P + \beta_3 \text{MR}_N + \beta_4 \text{MR}_P + \beta_5 \text{FF}_N + \beta_6 \text{FF}_P + \beta_7 \text{CR}_N + \beta_8 \text{CR}_P + \beta_9 \text{CO}_N + \beta_{10} \text{CO}_P + \beta_{11} \text{OR}_N + \beta_{12} \text{OR}_P.$$

The regression coefficients β and their variance estimates were estimated using the R package logistf, which implements Firth's bias-reduced penalized-likelihood logistic regression (Heinze and Ploner, 2016). We applied the SSS assessment to the estimated parameters. The results are presented in Figure 7. The figure suggests that all three DIPS methods performed well, in the sense that the probability that they produced a "bad" estimate (the Worst, II-, and I- categories) was close to 0, and the estimates were most likely to land in the "Best" or the "Neutral" categories. The full table Laplace sanitizer had the largest chance of producing estimates in the "Best" category for ε ≥ $e^{-1}$. MWEM, regardless of ε, had around a 50% probability of landing in the "Best" category or in "Neutral". Overall, the three algorithms seemed to perform similarly per the SSS assessment.
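The dummy coding implied by the model, with "average" as the reference level, can be sketched as below. The attribute order and the negative-then-positive ordering follow the model equation; the function name and the dictionary record format are assumptions for illustration, and the model fitting itself (done with the R package logistf in the text) is not reproduced here.

```python
ATTRS = ["IR", "MR", "FF", "CR", "CO", "OR"]


def design_row(record):
    """Build the 13-element design row: intercept, then (negative, positive)
    indicators for each attribute, with "average" as the reference level."""
    row = [1]  # intercept, multiplies beta_0
    for attr in ATTRS:
        row.append(int(record[attr] == "negative"))  # e.g. IR_N
        row.append(int(record[attr] == "positive"))  # e.g. IR_P
    return row


# Hypothetical record, for illustration only.
example = {"IR": "negative", "MR": "average", "FF": "positive",
           "CR": "average", "CO": "average", "OR": "negative"}
x = design_row(example)
```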

Figure 7: The SSS assessment on the logistic regression coefficients in the bankruptcy data

Table 3: Accuracy (%) of Support Vector Machines (SVM) for Predicting "Class" in the bankruptcy data

ε | CIPHER | MWEM | full table sanitization
$e^{-2}$ | 67.8 | 50.0 | 41.1
$e^{-1}$ | 64.7 | 51.3 | 55.5
1 | 68.5 | 51.0 | 63.8
$e^{1}$ | 77.8 | 47.2 | 85.7

The prediction accuracy with the original training data is 100%.

In the SVM analysis to classify Class and determine the bankruptcy status given the six qualitative risk attributes, we randomly split the original data into a training data set of 200 samples (80% of n = 250) and a testing data set of 50 samples (20% of n = 250). We then applied CIPHER, MWEM, and the full table Laplace sanitization to the training set only to generate synthetic data, on which the SVM was trained. The SVM trained on the synthetic data from each method was applied to make predictions on the same testing set. We ran 24 repetitions and generated 5 sets of synthetic data with 1/5 of the total privacy budget per set. The prediction accuracy rates averaged over the 5 sets and 24 repetitions are presented in Table 3. CIPHER is the obvious winner for ε ≤ 1, with significantly better prediction accuracy than the other two. When ε = e, the full table sanitization is the best with ∼86% accuracy, followed by CIPHER with ∼78% accuracy. Regardless of ε, MWEM has difficulty in classifying Class, with accuracy between 45% and 55% at all the examined ε.

5 Discussion

We proposed the CIPHER algorithm to release differentially private synthetic data sets given a set of LDC tables. We also proposed the SSS assessment to evaluate the utility of the synthetic data in hypothesis testing. We compared our algorithm with the full table sanitization and the MWEM algorithm in a simulation study and on a real-life qualitative bankruptcy data set. CIPHER delivers statistical inferences on population-level parameters similar to those of the full table sanitization when ε is relatively small or large, and is somewhat inferior to the latter around medium-sized ε (in the neighborhood of 1), while working with a significantly smaller set of sanitized statistics than the full table sanitization. Though MWEM, like CIPHER, can work with a small set of statistics, the utility of its synthetic data is in general not as good as that of CIPHER.

The asymptotic version of both CIPHER and MWEM is the full table sanitization when the LDC table set contains only one query, namely the full table. If Q comprises a set of LDC tables instead of the full table, both CIPHER and MWEM have additional sources of noise compared to the full table sanitization, beyond the noise introduced by the differentially private sanitizer, which push the synthetic data further away from the original data. For CIPHER, it is the shrinkage brought by the $l_2$ regularization; for MWEM, it is the numerical errors introduced through the iterative procedure with a hard-to-choose T.

We demonstrated the implementation of CIPHER for categorical data, but the algorithm can also be used on data with numerical attributes. Rather than taking a set of LDC tables as input, the input would become a set of low-dimensional histograms. This implies that the numerical attributes need to be cut into bins before CIPHER is applied. High-dimensional histograms with good statistical properties are difficult to construct (Scott, 2015), which poses additional challenges for the full table sanitization on top of the data storage issue. Low-dimensional histograms would be more desirable from a statistical perspective, in addition to the huge saving in data storage. CIPHER can be directly applied to the set of low-dimensional histograms, following the steps in Algorithm 1, to generate the empirical joint distribution among all the attributes. For any numerical attributes involved in the synthesized histograms, one can uniformly sample from the sanitized bins to "transform" the discretized values back to numerical values for these attributes.
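The shrinkage effect of $l_2$ (Tikhonov) regularization on the solution of a linear system Ax = b can be illustrated with a small NumPy sketch. It uses the standard closed form $x = (A^\top A + \lambda I)^{-1} A^\top b$; the toy matrix, the value of λ, and the use of an identity penalty matrix are assumptions for illustration and are not taken from Algorithm 1.

```python
import numpy as np


def tikhonov_solve(A, b, lam):
    """Solve min_x ||Ax - b||^2 + lam * ||x||^2 via the normal equations."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    n_unknowns = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n_unknowns), A.T @ b)


# Toy consistent system (hypothetical): the unknowns are two probabilities.
A = [[1, 0], [0, 1], [1, 1]]
b = [0.3, 0.7, 1.0]
x_ls = tikhonov_solve(A, b, lam=0.0)   # ordinary least squares solution
x_reg = tikhonov_solve(A, b, lam=0.1)  # regularized solution, shrunk toward 0
```

With λ = 0 the exact solution is recovered, while any λ > 0 pulls the solution toward the origin, which is the additional source of deviation attributed to CIPHER above.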
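The final back-transformation step can be sketched in Python as follows: given synthetic records expressed as bin indices and the bin edges used for discretization, each value is drawn uniformly within its bin. The function name, the bin edges, and the seed are illustrative assumptions.

```python
import random


def unbin(bin_indices, edges, rng):
    """Map bin indices back to numerical values by uniform sampling within each bin.

    edges: sorted bin edges; bin i covers the interval [edges[i], edges[i + 1]].
    """
    return [rng.uniform(edges[i], edges[i + 1]) for i in bin_indices]


# Hypothetical example: three bins on [0, 30].
rng = random.Random(7)
edges = [0.0, 10.0, 20.0, 30.0]
synthetic_bins = [0, 2, 1, 1, 0]
values = unbin(synthetic_bins, edges, rng)
```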

For future work, we plan to investigate the theoretical aspects of CIPHER in terms of accuracy under certain utility criteria. In addition, we plan to apply CIPHER to more data sets of higher dimension, in terms of both the number of attributes and the number of levels per attribute, to see how CIPHER scales up in those cases.

References

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM.

Abowd, J. M. and Vilhuber, L. (2008). How protective are synthetic data? In Privacy in Statistical Databases, pages 239–246. Springer.

Barak, B., Chaudhuri, K., Dwork, C., Kale, S., McSherry, F., and Talwar, K. (2007). Privacy, accuracy, and consistency too: a holistic solution to contingency table release. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 273–282. ACM.

Bowen, C. M. and Liu, F. (2016). Comparative study of differentially private data synthesis methods. arXiv preprint arXiv:1602.01063.

Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. (2011). Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109.

Chaudhuri, K., Sarwate, A., and Sinha, K. (2012). Near-optimal differentially private principal components. In Advances in Neural Information Processing Systems, pages 989–997.

Chen, R., Xiao, Q., Zhang, Y., and Xu, J. (2015). Differentially private high-dimensional data publication via sampling-based inference. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 129–138. ACM.

Culnane, C., Rubinstein, B. I. P., and Teague, V. (2017). Health data in an open world. arXiv preprint arXiv:1712.05627v1.

Dheeru, D. and Karra Taniskidou, E. (2017). UCI machine learning repository.

Ding, B., Winslett, M., Han, J., and Li, Z. (2011). Differentially private data cubes: optimizing noise sources and consistency. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 217–228. ACM.

Dwork, C. (2008). Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer.

Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer.

Dwork, C., Roth, A., et al. (2014). The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407.

Hardt, M., Ligett, K., and McSherry, F. (2012). A simple and practical algorithm for differentially private data release. In Advances in Neural Information Processing Systems, pages 2339–2347.

Hay, M., Machanavajjhala, A., Miklau, G., Chen, Y., and Zhang, D. (2016). Principled evaluation of differentially private algorithms using dpbench. In Proceedings of the 2016 International Conference on Management of Data, pages 139–154. ACM.

Hay, M., Rastogi, V., Miklau, G., and Suciu, D. (2010). Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment, 3(1-2):1021–1032.

Heinze, G. and Ploner, M. (2016). logistf: Firth's Bias-Reduced Logistic Regression. R package version 1.22.

Kasiviswanathan, S. P., Nissim, K., Raskhodnikova, S., and Smith, A. (2013). Analyzing graphs with node differential privacy. In Theory of Cryptography, pages 457–476. Springer.

Kifer, D., Smith, A., and Thakurta, A. (2012). Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pages 25–1.

Kim, M.-J. and Han, I. (2003). The discovery of experts' decision rules from qualitative bankruptcy data using genetic algorithms. Expert Systems with Applications, 25(4):637–646.

Li, X., Yang, J., Sun, Z., and Zhang, J. (2017). Differential privacy for edge weights in social networks. Security and Communication Networks, 2017.

Liu, F. (2016a). Model-based differential private data synthesis. arXiv preprint arXiv:1606.08052.

Liu, F. (2016b). Statistical properties of sanitized results from differentially private laplace mechanisms with noninformative bounding. arXiv preprint arXiv:1607.08554.

Liu, F. (2019). Generalized gaussian mechanism for differential privacy. IEEE Transactions on Knowledge and Data Engineering, 31(4):747–756.

Machanavajjhala, A., Kifer, D., Abowd, J., Gehrke, J., and Vilhuber, L. (2008). Privacy: Theory meets practice on the map. IEEE ICDE IEEE 24th International Conference, pages 277–286.

McClure, D. and Reiter, J. P. (2012). Differential privacy and statistical disclosure risk measures: An investigation with binary synthetic data. Transactions on Data Privacy, 5(3):535–552.

McSherry, F. and Talwar, K. (2007). Mechanism design via differential privacy. In Foundations of Computer Science, 2007. FOCS'07. 48th Annual IEEE Symposium on, pages 94–103. IEEE.

McSherry, F. D. (2009). Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 19–30. ACM.

Nagaraj, K. and Sridhar, A. (2015). A predictive system for detection of bankruptcy using machine learning techniques. arXiv preprint arXiv:1502.03601.

Narayanan, A. and Shmatikov, V. (2006). How to break anonymity of the netflix prize dataset. CoRR, abs/cs/0610105.

Narayanan, A. and Shmatikov, V. (2008). Robust de-anonymization of large sparse datasets. In Security and Privacy, 2008. SP 2008. IEEE Symposium on, pages 111–125. IEEE.

Roth, A. and Roughgarden, T. (2010). Interactive privacy via the median mechanism. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 765–774. ACM.

Scott, D. W. (2015). Multivariate density estimation: theory, practice, and visualization. John Wiley & Sons.

Shokri, R. and Shmatikov, V. (2015). Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pages 1310–1321. ACM.

Sweeney, L. (2013). Matching known patients to health records in washington state data. CoRR, abs/1307.1370.

Tikhonov, A. N. (1963). On the solution of ill-posed problems and the method of regularization. Doklady Akademii Nauk, 151(3):501–504.

Tikhonov, A. N., Goncharsky, A., Stepanov, V., and Yagola, A. G. (2013). Numerical methods for the solution of ill-posed problems, volume 328. Springer Science & Business Media.

Tockar, A. (2014). Riding with the stars: Passenger privacy in the nyc taxicab dataset. https://research.neustar.biz/author/atockar/.

Tsai, C.-F. (2009). Feature selection in bankruptcy prediction. Knowledge-Based Systems, 22(2):120–127.

Yan, S., Pan, S., Zhao, Y., and Zhu, W.-T. (2016). Towards privacy-preserving data mining in online social networks: Distance-grained and item-grained differential privacy. In Australasian Conference on Information Security and Privacy, pages 141–157. Springer.

Zhang, J., Cormode, G., Procopiuc, C. M., Srivastava, D., and Xiao, X. (2014). Privbayes: Private data release via bayesian networks. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 1423–1434, New York, NY, USA. ACM.


Supplementary Materials for “Construction of Microdata from a Set of Differentially Private Low-dimensional Contingency Tables through Solving Linear Equations with Tikhonov Regularization” by Evercita C. Eugenio and Fang Liu

The supplementary materials contain additional simulation results and the derivation of the linear equation sets Ax = b for the three-variable and four-variable cases. Specifically, Tables 1 to 4 present the numerical values of the bias, RMSE, coverage probability, and confidence interval width for the results presented in Figure 2 for n = 200, and Tables 5 to 8 give the corresponding numerical values for n = 500 in the simulation study; Tables ?? and ?? present the ill-conditioned synthetic data sets in the simulation study. Section 2 includes the detailed derivation of Ax = b using several examples when p = 3 and p = 4, respectively. The four-variable case p = 4 is also what was used in the CIPHER algorithm for the simulation study.


1 Additional Simulation Results

1.1 n = 200

Table 1: Simulation Results: Bias for n = 200

| ε | Algorithm | β01 | β11 | β21 | β31 | β41 | β02 | β12 | β22 | β32 | β42 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| e^-2 | CIPHER Three-Way | -1.388 | 0.042 | 0.870 | 1.486 | -0.013 | 0.315 | -0.562 | -0.894 | 1.593 | 0.946 |
| e^-2 | CIPHER Two-Way | -1.543 | 1.808 | 1.027 | 1.635 | -0.295 | 0.257 | -0.674 | -0.856 | 1.918 | 0.843 |
| e^-2 | MWEM Three-Way (T=5) | -1.506 | 0.983 | -0.503 | -0.976 | 2.023 | -1.005 | 1.476 | 0.476 | -0.730 | 1.042 |
| e^-2 | MWEM Two-Way (T=5) | -1.465 | 1.035 | -0.479 | -1.030 | 1.968 | -0.956 | 1.543 | 0.479 | -0.793 | 0.944 |
| e^-2 | FDH Laplace Sanitizer | -1.206 | -0.971 | 0.970 | 1.168 | 0.272 | 0.366 | -0.190 | -0.610 | 1.016 | 0.519 |
| e^-1 | CIPHER Three-Way | -0.980 | 0.084 | 0.906 | 1.214 | 0.101 | 0.095 | -0.370 | -0.828 | 0.904 | 0.380 |
| e^-1 | CIPHER Two-Way | -1.125 | 1.574 | 0.927 | 1.405 | 0.070 | 0.002 | -0.413 | -1.080 | 1.375 | 0.515 |
| e^-1 | MWEM Three-Way (T=15) | -1.521 | 1.035 | -0.444 | -0.995 | 1.932 | -0.984 | 1.488 | 0.476 | -0.738 | 0.962 |
| e^-1 | MWEM Two-Way (T=15) | -1.452 | 0.909 | -0.321 | -1.058 | 2.009 | -0.989 | 1.400 | 0.534 | -0.808 | 0.995 |
| e^-1 | FDH Laplace Sanitizer | -0.704 | -0.529 | 0.680 | 0.847 | 0.401 | 0.256 | -0.223 | -0.644 | 0.332 | 0.276 |
| e^0 | CIPHER Three-Way | -0.728 | -0.275 | 0.632 | 1.050 | 0.307 | 0.108 | -0.307 | -0.852 | 0.539 | 0.221 |
| e^0 | CIPHER Two-Way | -0.795 | 1.009 | 0.830 | 1.171 | 0.363 | -0.086 | -0.580 | -1.544 | 0.694 | 0.045 |
| e^0 | MWEM Three-Way (T=25) | -1.497 | 0.962 | -0.447 | -0.965 | 1.986 | -0.977 | 1.469 | 0.487 | -0.743 | 1.009 |
| e^0 | MWEM Two-Way (T=25) | -1.372 | 0.985 | -0.292 | -0.915 | 1.920 | -1.000 | 1.460 | 0.470 | -0.720 | 0.955 |
| e^0 | FDH Laplace Sanitizer | -0.285 | -0.191 | 0.403 | 0.630 | 0.316 | 0.140 | -0.332 | -0.623 | 0.028 | -0.031 |
| e^1 | CIPHER Three-Way | -0.295 | -0.213 | 0.464 | 0.688 | 0.224 | 0.099 | -0.326 | -0.593 | 0.095 | -0.047 |
| e^1 | CIPHER Two-Way | -0.218 | 0.554 | 0.496 | 1.124 | 0.229 | -0.172 | -0.629 | -1.400 | -0.009 | -0.469 |
| e^1 | MWEM Three-Way (T=60) | -1.431 | 0.958 | -0.329 | -0.841 | 1.845 | -1.007 | 1.444 | 0.464 | -0.665 | 1.022 |
| e^1 | MWEM Two-Way (T=60) | -1.030 | 0.971 | -0.186 | -0.832 | 1.695 | -0.987 | 1.318 | 0.447 | -0.775 | 0.713 |
| e^1 | FDH Laplace Sanitizer | -0.077 | -0.023 | 0.131 | 0.246 | 0.204 | 0.049 | -0.226 | -0.346 | -0.099 | -0.067 |
| e^2 | CIPHER Three-Way | 0.060 | 0.028 | 0.136 | 0.338 | 0.147 | 0.055 | -0.208 | -0.390 | -0.196 | -0.152 |
| e^2 | CIPHER Two-Way | 0.206 | 0.455 | 0.385 | 1.053 | -0.056 | -0.148 | -0.812 | -1.216 | -0.326 | -0.605 |
| e^2 | MWEM Three-Way (T=120) | -1.388 | 0.982 | -0.194 | -0.672 | 1.839 | -0.910 | 1.354 | 0.433 | -0.693 | 0.922 |
| e^2 | MWEM Two-Way (T=120) | -0.734 | 1.013 | 0.107 | -0.767 | 1.448 | -0.977 | 1.148 | 0.403 | -0.901 | 0.818 |
| e^2 | FDH Laplace Sanitizer | 0.072 | 0.064 | 0.012 | 0.056 | 0.062 | 0.004 | -0.094 | -0.138 | -0.130 | -0.106 |


Table 2: Simulation Results: Root Mean Square Error (RMSE) for n = 200

| ε | Algorithm | β01 | β11 | β21 | β31 | β41 | β02 | β12 | β22 | β32 | β42 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| e^-2 | CIPHER Three-Way | 3.335 | 2.917 | 3.187 | 3.282 | 2.934 | 2.821 | 3.395 | 3.239 | 3.817 | 3.400 |
| e^-2 | CIPHER Two-Way | 3.303 | 3.186 | 2.689 | 2.777 | 2.513 | 2.275 | 3.090 | 2.767 | 3.821 | 2.903 |
| e^-2 | MWEM Three-Way (T=5) | 1.560 | 1.043 | 0.629 | 1.064 | 2.068 | 1.086 | 1.523 | 0.637 | 0.837 | 1.133 |
| e^-2 | MWEM Two-Way (T=5) | 1.744 | 1.271 | 0.899 | 1.290 | 2.103 | 1.338 | 1.726 | 0.908 | 1.121 | 1.280 |
| e^-2 | FDH Laplace Sanitizer | 2.829 | 3.138 | 2.424 | 2.772 | 2.067 | 2.558 | 2.750 | 3.296 | 3.172 | 3.371 |
| e^-1 | CIPHER Three-Way | 2.695 | 2.494 | 2.454 | 2.570 | 2.363 | 2.244 | 2.796 | 3.085 | 3.074 | 2.934 |
| e^-1 | CIPHER Two-Way | 2.824 | 2.895 | 2.314 | 2.514 | 2.248 | 2.076 | 2.512 | 2.576 | 2.974 | 2.345 |
| e^-1 | MWEM Three-Way (T=15) | 1.904 | 1.496 | 1.011 | 1.441 | 2.238 | 1.443 | 1.723 | 1.118 | 1.345 | 1.510 |
| e^-1 | MWEM Two-Way (T=15) | 2.606 | 2.090 | 1.933 | 2.201 | 2.811 | 2.239 | 2.226 | 1.842 | 2.066 | 2.084 |
| e^-1 | FDH Laplace Sanitizer | 1.972 | 2.201 | 1.445 | 1.796 | 1.375 | 1.458 | 2.000 | 2.583 | 2.110 | 2.323 |
| e^0 | CIPHER Three-Way | 1.837 | 1.841 | 1.516 | 1.858 | 1.504 | 1.553 | 1.890 | 2.284 | 1.999 | 1.909 |
| e^0 | CIPHER Two-Way | 2.122 | 2.165 | 1.610 | 1.869 | 1.767 | 1.630 | 1.788 | 2.479 | 2.004 | 1.911 |
| e^0 | MWEM Three-Way (T=25) | 1.639 | 1.130 | 0.772 | 1.222 | 2.095 | 1.172 | 1.575 | 0.773 | 1.017 | 1.212 |
| e^0 | MWEM Two-Way (T=25) | 2.037 | 1.505 | 1.188 | 1.767 | 2.348 | 1.959 | 1.949 | 1.333 | 1.626 | 1.742 |
| e^0 | FDH Laplace Sanitizer | 1.185 | 1.351 | 0.742 | 1.048 | 0.727 | 0.806 | 1.321 | 1.745 | 1.216 | 1.403 |
| e^1 | CIPHER Three-Way | 1.157 | 1.313 | 0.817 | 1.148 | 0.797 | 0.883 | 1.291 | 1.661 | 1.171 | 1.359 |
| e^1 | CIPHER Two-Way | 1.308 | 1.485 | 0.883 | 1.458 | 1.031 | 1.104 | 1.384 | 2.025 | 1.252 | 1.430 |
| e^1 | MWEM Three-Way (T=60) | 1.825 | 1.371 | 1.068 | 1.437 | 2.193 | 1.534 | 1.745 | 1.061 | 1.477 | 1.583 |
| e^1 | MWEM Two-Way (T=60) | 2.488 | 2.103 | 1.785 | 2.218 | 2.614 | 2.717 | 2.423 | 2.069 | 2.397 | 2.312 |
| e^1 | FDH Laplace Sanitizer | 0.754 | 0.870 | 0.495 | 0.640 | 0.544 | 0.595 | 0.875 | 1.106 | 0.788 | 0.883 |
| e^2 | CIPHER Three-Way | 0.945 | 1.020 | 0.557 | 0.708 | 0.566 | 0.605 | 1.089 | 1.273 | 0.987 | 1.019 |
| e^2 | CIPHER Two-Way | 1.014 | 1.133 | 0.580 | 1.182 | 0.604 | 0.692 | 1.270 | 1.609 | 1.025 | 1.184 |
| e^2 | MWEM Three-Way (T=120) | 1.826 | 1.361 | 0.978 | 1.403 | 2.214 | 1.534 | 1.698 | 1.121 | 1.445 | 1.594 |
| e^2 | MWEM Two-Way (T=120) | 2.282 | 1.911 | 1.800 | 2.084 | 2.425 | 2.855 | 2.255 | 2.271 | 2.614 | 2.358 |
| e^2 | FDH Laplace Sanitizer | 0.716 | 0.786 | 0.480 | 0.574 | 0.500 | 0.571 | 0.793 | 0.930 | 0.738 | 0.791 |


Table 3: Simulation Results: Coverage Probability (CP) for n = 200

| ε | Algorithm | β01 | β11 | β21 | β31 | β41 | β02 | β12 | β22 | β32 | β42 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| e^-2 | CIPHER Three-Way | 97.1 | 99.4 | 98.6 | 97.1 | 99.8 | 99.6 | 99.1 | 99.4 | 97.2 | 98.3 |
| e^-2 | CIPHER Two-Way | 97.1 | 99.9 | 98.7 | 94.0 | 99.8 | 99.6 | 99.0 | 98.8 | 96.2 | 98.2 |
| e^-2 | MWEM Three-Way (T=5) | 51.0 | 91.6 | 99.7 | 95.7 | 18.4 | 92.9 | 43.3 | 99.8 | 99.1 | 93.9 |
| e^-2 | MWEM Two-Way (T=5) | 70.5 | 94.1 | 99.7 | 95.8 | 32.0 | 95.3 | 56.1 | 99.9 | 99.4 | 96.6 |
| e^-2 | FDH Laplace Sanitizer | 90.6 | 97.7 | 93.5 | 90.0 | 99.3 | 98.9 | 98.2 | 98.5 | 93.2 | 100.0 |
| e^-1 | CIPHER Three-Way | 95.2 | 99.2 | 97.2 | 94.2 | 99.7 | 99.5 | 99.2 | 99.6 | 96.7 | 99.3 |
| e^-1 | CIPHER Two-Way | 100.0 | 100.0 | 97.4 | 93.5 | 99.7 | 99.7 | 100.0 | 100.0 | 100.0 | 100.0 |
| e^-1 | MWEM Three-Way (T=15) | 80.2 | 92.3 | 99.2 | 96.1 | 62.1 | 95.1 | 76.8 | 99.2 | 97.7 | 95.5 |
| e^-1 | MWEM Two-Way (T=15) | 93.5 | 96.8 | 99.6 | 98.0 | 81.1 | 97.1 | 86.6 | 99.6 | 99.4 | 98.5 |
| e^-1 | FDH Laplace Sanitizer | 100.0 | 99.1 | 100.0 | 92.7 | 100.0 | 99.4 | 100.0 | 99.0 | 100.0 | 99.0 |
| e^0 | CIPHER Three-Way | 92.7 | 98.4 | 94.4 | 92.4 | 99.5 | 99.2 | 98.9 | 97.6 | 100.0 | 99.1 |
| e^0 | CIPHER Two-Way | 95.4 | 99.7 | 95.2 | 90.7 | 99.8 | 99.2 | 99.1 | 94.6 | 95.8 | 99.2 |
| e^0 | MWEM Three-Way (T=25) | 74.6 | 92.5 | 99.1 | 93.8 | 48.7 | 93.9 | 70.3 | 99.5 | 98.0 | 94.4 |
| e^0 | MWEM Two-Way (T=25) | 88.0 | 95.6 | 99.7 | 97.7 | 72.0 | 96.3 | 82.5 | 99.0 | 99.1 | 97.5 |
| e^0 | FDH Laplace Sanitizer | 97.2 | 99.4 | 97.6 | 95.2 | 99.9 | 99.8 | 99.1 | 99.1 | 98.5 | 99.8 |
| e^1 | CIPHER Three-Way | 97.7 | 99.0 | 98.4 | 95.2 | 99.4 | 99.7 | 99.4 | 99.0 | 98.5 | 99.3 |
| e^1 | CIPHER Two-Way | 98.4 | 99.0 | 97.8 | 86.7 | 99.4 | 99.1 | 99.7 | 94.4 | 99.1 | 99.3 |
| e^1 | MWEM Three-Way (T=60) | 85.8 | 92.3 | 98.1 | 96.8 | 75.8 | 95.3 | 81.8 | 99.0 | 98.3 | 95.2 |
| e^1 | MWEM Two-Way (T=60) | 94.7 | 96.3 | 99.6 | 99.0 | 88.6 | 98.3 | 92.0 | 99.4 | 99.2 | 98.7 |
| e^1 | FDH Laplace Sanitizer | 99.6 | 99.8 | 99.9 | 99.0 | 99.7 | 99.9 | 99.7 | 99.3 | 99.8 | 99.7 |
| e^2 | CIPHER Three-Way | 99.3 | 99.5 | 99.4 | 98.8 | 99.7 | 99.6 | 99.5 | 99.0 | 99.4 | 99.4 |
| e^2 | CIPHER Two-Way | 99.7 | 99.8 | 99.8 | 87.1 | 99.6 | 99.9 | 99.1 | 94.7 | 99.5 | 99.7 |
| e^2 | MWEM Three-Way (T=120) | 86.3 | 92.7 | 99.3 | 96.5 | 75.7 | 95.1 | 83.3 | 98.9 | 98.1 | 95.4 |
| e^2 | MWEM Two-Way (T=120) | 94.9 | 95.4 | 99.2 | 98.8 | 92.2 | 99.0 | 95.3 | 99.4 | 99.1 | 99.0 |
| e^2 | FDH Laplace Sanitizer | 99.0 | 99.1 | 99.7 | 99.1 | 99.5 | 99.0 | 98.9 | 98.1 | 99.0 | 99.1 |


Table 4: Simulation Results: Confidence Interval Widths for n = 200

| ε | Algorithm | β01 | β11 | β21 | β31 | β41 | β02 | β12 | β22 | β32 | β42 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| e^-2 | CIPHER Three-Way | 24.176 | 21.987 | 23.682 | 22.008 | 23.293 | 21.607 | 28.030 | 25.689 | 28.136 | 25.645 |
| e^-2 | CIPHER Two-Way | 24.620 | 19.035 | 21.406 | 17.112 | 22.118 | 17.504 | 26.016 | 19.960 | 27.290 | 20.544 |
| e^-2 | MWEM Three-Way (T=5) | 3.332 | 3.010 | 3.061 | 3.458 | 3.517 | 3.265 | 3.036 | 3.072 | 3.409 | 3.468 |
| e^-2 | MWEM Two-Way (T=5) | 6.025 | 4.688 | 4.831 | 5.509 | 5.395 | 5.818 | 4.746 | 4.469 | 5.434 | 5.764 |
| e^-2 | FDH Laplace Sanitizer | 16.386 | 19.636 | 11.470 | 14.321 | 11.276 | 13.858 | 18.846 | 24.209 | 20.402 | 25.741 |
| e^-1 | CIPHER Three-Way | 19.170 | 18.475 | 16.696 | 17.197 | 18.062 | 17.335 | 22.348 | 22.543 | 23.617 | 22.110 |
| e^-1 | CIPHER Two-Way | 24.273 | 21.928 | 15.827 | 14.133 | 16.697 | 14.463 | 23.847 | 22.702 | 25.324 | 22.085 |
| e^-1 | MWEM Three-Way (T=15) | 6.650 | 5.467 | 5.487 | 6.988 | 7.333 | 6.661 | 5.664 | 5.752 | 7.171 | 7.359 |
| e^-1 | MWEM Two-Way (T=15) | 17.089 | 13.685 | 13.742 | 16.892 | 16.988 | 16.942 | 13.556 | 13.631 | 17.113 | 17.442 |
| e^-1 | FDH Laplace Sanitizer | 27.574 | 13.144 | 26.466 | 8.505 | 26.516 | 8.388 | 28.537 | 17.785 | 28.720 | 15.008 |
| e^0 | CIPHER Three-Way | 10.794 | 12.329 | 8.030 | 9.579 | 8.939 | 9.990 | 12.403 | 15.362 | 19.794 | 13.984 |
| e^0 | CIPHER Two-Way | 13.625 | 12.408 | 8.204 | 9.055 | 10.573 | 9.994 | 11.596 | 12.791 | 13.135 | 12.253 |
| e^0 | MWEM Three-Way (T=25) | 4.411 | 3.852 | 3.970 | 4.631 | 4.843 | 4.200 | 3.850 | 3.863 | 4.358 | 4.604 |
| e^0 | MWEM Two-Way (T=25) | 10.489 | 8.184 | 7.596 | 10.348 | 10.311 | 11.967 | 8.854 | 8.608 | 11.163 | 11.726 |
| e^0 | FDH Laplace Sanitizer | 6.101 | 7.461 | 3.563 | 4.556 | 3.909 | 4.631 | 7.417 | 10.107 | 6.429 | 8.187 |
| e^1 | CIPHER Three-Way | 6.515 | 7.474 | 3.900 | 5.086 | 4.288 | 4.964 | 7.353 | 9.360 | 6.703 | 7.896 |
| e^1 | CIPHER Two-Way | 7.269 | 7.574 | 4.331 | 4.872 | 5.406 | 5.603 | 7.397 | 8.727 | 7.142 | 7.630 |
| e^1 | MWEM Three-Way (T=60) | 7.238 | 6.064 | 5.642 | 8.171 | 8.166 | 7.556 | 5.993 | 5.744 | 9.029 | 8.838 |
| e^1 | MWEM Two-Way (T=60) | 19.363 | 14.160 | 13.907 | 18.644 | 19.092 | 21.151 | 16.152 | 15.595 | 21.593 | 21.131 |
| e^1 | FDH Laplace Sanitizer | 4.246 | 4.686 | 2.920 | 3.361 | 2.992 | 3.257 | 4.629 | 5.461 | 4.332 | 4.741 |
| e^2 | CIPHER Three-Way | 4.745 | 5.157 | 3.172 | 3.581 | 3.249 | 3.455 | 5.365 | 6.175 | 4.758 | 5.170 |
| e^2 | CIPHER Two-Way | 5.257 | 5.531 | 3.038 | 3.404 | 3.421 | 3.731 | 5.321 | 5.788 | 5.174 | 5.533 |
| e^2 | MWEM Three-Way (T=120) | 6.830 | 5.545 | 5.517 | 7.349 | 7.661 | 7.256 | 5.842 | 6.047 | 8.315 | 8.730 |
| e^2 | MWEM Two-Way (T=120) | 17.265 | 11.555 | 12.328 | 16.417 | 18.305 | 21.648 | 15.815 | 15.868 | 22.829 | 21.965 |
| e^2 | FDH Laplace Sanitizer | 3.322 | 3.530 | 2.768 | 3.043 | 2.798 | 2.977 | 3.515 | 3.893 | 3.306 | 3.566 |


1.2 n = 500

Table 5: Simulation Results: Bias for n = 500

| ε | Algorithm | β01 | β11 | β21 | β31 | β41 | β02 | β12 | β22 | β32 | β42 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| e^-2 | CIPHER Three-Way | -1.109 | -0.136 | 0.915 | 1.329 | 0.138 | 0.354 | -0.606 | -1.008 | 1.206 | 0.490 |
| e^-2 | CIPHER Two-Way | -1.272 | 1.016 | 0.922 | 1.449 | -0.015 | 0.174 | -0.582 | -1.084 | 1.609 | 0.687 |
| e^-2 | MWEM Three-Way (T=10) | -1.464 | 0.990 | -0.486 | -1.015 | 1.984 | -0.968 | 1.492 | 0.470 | -0.771 | 0.999 |
| e^-2 | MWEM Two-Way (T=10) | -1.450 | 0.974 | -0.449 | -0.970 | 1.985 | -0.979 | 1.484 | 0.484 | -0.727 | 0.958 |
| e^-2 | FDH Laplace Sanitizer | -0.860 | -0.584 | 0.736 | 0.931 | 0.421 | 0.278 | -0.054 | -0.762 | 0.400 | 0.170 |
| e^-1 | CIPHER Three-Way | -0.839 | -0.253 | 0.793 | 1.096 | 0.223 | 0.129 | -0.444 | -0.883 | 0.709 | 0.258 |
| e^-1 | CIPHER Two-Way | -0.952 | 0.735 | 0.833 | 1.244 | 0.189 | -0.115 | -0.497 | -1.337 | 0.946 | 0.208 |
| e^-1 | MWEM Three-Way (T=25) | -1.498 | 1.017 | -0.463 | -0.932 | 1.960 | -0.985 | 1.476 | 0.470 | -0.771 | 0.978 |
| e^-1 | MWEM Two-Way (T=25) | -1.334 | 1.015 | -0.363 | -0.976 | 1.846 | -1.041 | 1.510 | 0.442 | -0.742 | 0.934 |
| e^-1 | FDH Laplace Sanitizer | -0.427 | -0.276 | 0.426 | 0.644 | 0.382 | 0.137 | -0.306 | -0.550 | 0.096 | 0.004 |
| e^0 | CIPHER Three-Way | -0.537 | -0.366 | 0.538 | 0.815 | 0.235 | 0.097 | -0.303 | -0.557 | 0.335 | 0.137 |
| e^0 | CIPHER Two-Way | -0.490 | 0.321 | 0.593 | 1.135 | 0.142 | -0.149 | -0.470 | -1.218 | 0.366 | -0.198 |
| e^0 | MWEM Three-Way (T=50) | -1.490 | 1.010 | -0.353 | -0.911 | 1.922 | -0.970 | 1.469 | 0.482 | -0.775 | 0.991 |
| e^0 | MWEM Two-Way (T=50) | -1.317 | 0.994 | -0.207 | -0.782 | 1.859 | -0.972 | 1.407 | 0.487 | -0.738 | 0.940 |
| e^0 | FDH Laplace Sanitizer | -0.122 | -0.070 | 0.121 | 0.255 | 0.192 | 0.072 | -0.180 | -0.300 | -0.042 | -0.031 |
| e^1 | CIPHER Three-Way | -0.161 | -0.153 | 0.176 | 0.404 | 0.149 | 0.093 | -0.138 | -0.339 | 0.028 | 0.010 |
| e^1 | CIPHER Two-Way | -0.043 | 0.254 | 0.413 | 1.068 | -0.077 | -0.153 | -0.632 | -1.039 | -0.053 | -0.406 |
| e^1 | MWEM Three-Way (T=100) | -1.443 | 1.029 | -0.258 | -0.785 | 1.813 | -0.950 | 1.439 | 0.464 | -0.745 | 0.907 |
| e^1 | MWEM Two-Way (T=100) | -1.141 | 0.996 | 0.020 | -0.716 | 1.718 | -0.968 | 1.275 | 0.492 | -0.734 | 0.811 |
| e^1 | FDH Laplace Sanitizer | 0.006 | 0.015 | 0.027 | 0.045 | 0.057 | 0.008 | -0.077 | -0.100 | -0.070 | -0.061 |
| e^2 | CIPHER Three-Way | 0.061 | 0.063 | 0.030 | 0.186 | 0.044 | 0.032 | -0.088 | -0.230 | -0.120 | -0.121 |
| e^2 | CIPHER Two-Way | 0.035 | 0.229 | 0.392 | 0.956 | -0.160 | -0.092 | -0.647 | -0.909 | -0.086 | -0.310 |
| e^2 | MWEM Three-Way (T=200) | -1.405 | 1.051 | -0.042 | -0.606 | 1.655 | -0.918 | 1.375 | 0.448 | -0.729 | 0.890 |
| e^2 | MWEM Two-Way (T=200) | -0.917 | 0.987 | 0.266 | -0.578 | 1.473 | -0.926 | 1.249 | 0.449 | -0.723 | 0.782 |
| e^2 | FDH Laplace Sanitizer | 0.030 | 0.015 | -0.005 | -0.019 | 0.015 | -0.004 | -0.004 | 0.001 | -0.052 | -0.034 |


Table 6: Simulation Results: Root Mean Square Error (RMSE) for n = 500

| ε | Algorithm | β01 | β11 | β21 | β31 | β41 | β02 | β12 | β22 | β32 | β42 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| e^-2 | CIPHER Three-Way | 2.376 | 2.124 | 2.112 | 2.287 | 1.900 | 1.875 | 2.351 | 2.490 | 2.730 | 2.364 |
| e^-2 | CIPHER Two-Way | 2.305 | 1.978 | 1.855 | 2.060 | 1.599 | 1.449 | 1.880 | 2.017 | 2.478 | 1.871 |
| e^-2 | MWEM Three-Way (T=10) | 1.521 | 1.073 | 0.619 | 1.096 | 2.027 | 1.057 | 1.546 | 0.593 | 0.919 | 1.079 |
| e^-2 | MWEM Two-Way (T=10) | 1.626 | 1.221 | 0.763 | 1.147 | 2.078 | 1.243 | 1.641 | 0.811 | 0.977 | 1.182 |
| e^-2 | FDH Laplace Sanitizer | 2.129 | 2.159 | 1.537 | 1.895 | 1.439 | 1.630 | 2.155 | 2.833 | 2.303 | 2.352 |
| e^-1 | CIPHER Three-Way | 1.585 | 1.474 | 1.357 | 1.595 | 1.183 | 1.191 | 1.567 | 1.931 | 1.737 | 1.665 |
| e^-1 | CIPHER Two-Way | 1.677 | 1.572 | 1.337 | 1.667 | 1.235 | 1.209 | 1.340 | 1.912 | 1.596 | 1.318 |
| e^-1 | MWEM Three-Way (T=25) | 1.629 | 1.171 | 0.727 | 1.140 | 2.067 | 1.161 | 1.582 | 0.757 | 1.034 | 1.167 |
| e^-1 | MWEM Two-Way (T=25) | 1.947 | 1.503 | 1.131 | 1.476 | 2.206 | 1.790 | 1.902 | 1.240 | 1.390 | 1.619 |
| e^-1 | FDH Laplace Sanitizer | 1.113 | 1.336 | 0.731 | 0.998 | 0.777 | 0.881 | 1.257 | 1.704 | 1.156 | 1.418 |
| e^0 | CIPHER Three-Way | 0.962 | 0.998 | 0.754 | 1.038 | 0.626 | 0.641 | 0.915 | 1.205 | 0.926 | 0.944 |
| e^0 | CIPHER Two-Way | 0.946 | 0.915 | 0.828 | 1.340 | 0.727 | 0.838 | 0.970 | 1.589 | 0.887 | 0.897 |
| e^0 | MWEM Three-Way (T=50) | 1.626 | 1.153 | 0.684 | 1.110 | 2.039 | 1.173 | 1.577 | 0.771 | 0.998 | 1.177 |
| e^0 | MWEM Two-Way (T=50) | 1.966 | 1.439 | 1.094 | 1.418 | 2.243 | 1.656 | 1.717 | 1.186 | 1.379 | 1.502 |
| e^0 | FDH Laplace Sanitizer | 0.577 | 0.678 | 0.401 | 0.534 | 0.440 | 0.446 | 0.729 | 0.909 | 0.581 | 0.688 |
| e^1 | CIPHER Three-Way | 0.504 | 0.540 | 0.414 | 0.583 | 0.435 | 0.430 | 0.538 | 0.707 | 0.493 | 0.540 |
| e^1 | CIPHER Two-Way | 0.517 | 0.613 | 0.524 | 1.146 | 0.451 | 0.520 | 0.795 | 1.186 | 0.496 | 0.691 |
| e^1 | MWEM Three-Way (T=100) | 1.577 | 1.176 | 0.627 | 1.013 | 1.919 | 1.184 | 1.549 | 0.773 | 0.969 | 1.096 |
| e^1 | MWEM Two-Way (T=100) | 1.633 | 1.331 | 0.912 | 1.239 | 2.003 | 1.626 | 1.666 | 1.103 | 1.359 | 1.430 |
| e^1 | FDH Laplace Sanitizer | 0.377 | 0.428 | 0.313 | 0.370 | 0.316 | 0.353 | 0.430 | 0.528 | 0.386 | 0.449 |
| e^2 | CIPHER Three-Way | 0.432 | 0.457 | 0.338 | 0.407 | 0.350 | 0.372 | 0.487 | 0.592 | 0.432 | 0.468 |
| e^2 | CIPHER Two-Way | 0.443 | 0.515 | 0.458 | 0.996 | 0.360 | 0.362 | 0.758 | 1.008 | 0.435 | 0.543 |
| e^2 | MWEM Three-Way (T=200) | 1.560 | 1.196 | 0.563 | 0.868 | 1.775 | 1.116 | 1.494 | 0.717 | 0.966 | 1.096 |
| e^2 | MWEM Two-Way (T=200) | 1.380 | 1.258 | 0.814 | 1.102 | 1.749 | 1.511 | 1.510 | 0.976 | 1.302 | 1.393 |
| e^2 | FDH Laplace Sanitizer | 0.363 | 0.410 | 0.293 | 0.355 | 0.309 | 0.344 | 0.412 | 0.495 | 0.366 | 0.425 |


Table 7: Simulation Results: Coverage Probability (CP) for n = 500

| ε | Algorithm | β01 | β11 | β21 | β31 | β41 | β02 | β12 | β22 | β32 | β42 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| e^-2 | CIPHER Three-Way | 90.5 | 97.9 | 95.1 | 91.0 | 98.9 | 98.3 | 97.6 | 97.5 | 92.5 | 97.5 |
| e^-2 | CIPHER Two-Way | 92.4 | 98.6 | 94.0 | 85.4 | 98.9 | 98.7 | 96.0 | 94.5 | 86.4 | 95.3 |
| e^-2 | MWEM Three-Way (T=10) | 38.1 | 74.8 | 98.2 | 82.1 | 13.2 | 82.4 | 29.9 | 98.6 | 94.4 | 83.0 |
| e^-2 | MWEM Two-Way (T=10) | 61.5 | 83.0 | 98.8 | 90.2 | 25.7 | 86.4 | 47.5 | 98.9 | 97.5 | 89.0 |
| e^-2 | FDH Laplace Sanitizer | 86.9 | 97.5 | 88.9 | 84.4 | 97.8 | 96.8 | 97.8 | 96.8 | 90.5 | 97.3 |
| e^-1 | CIPHER Three-Way | 87.7 | 97.3 | 92.6 | 86.3 | 98.8 | 98.5 | 97.8 | 96.3 | 90.6 | 97.0 |
| e^-1 | CIPHER Two-Way | 92.6 | 98.4 | 90.3 | 82.6 | 99.3 | 98.1 | 97.8 | 90.3 | 90.0 | 97.6 |
| e^-1 | MWEM Three-Way (T=25) | 63.2 | 80.6 | 97.4 | 90.8 | 40.1 | 88.3 | 58.2 | 97.6 | 93.5 | 90.5 |
| e^-1 | MWEM Two-Way (T=25) | 81.4 | 89.4 | 98.6 | 94.0 | 60.9 | 92.8 | 73.9 | 97.6 | 97.9 | 93.9 |
| e^-1 | FDH Laplace Sanitizer | 94.0 | 98.3 | 96.2 | 92.0 | 96.9 | 98.5 | 98.8 | 96.7 | 97.2 | 98.6 |
| e^0 | CIPHER Three-Way | 93.3 | 96.2 | 95.1 | 89.8 | 99.1 | 98.9 | 99.0 | 96.2 | 96.1 | 99.1 |
| e^0 | CIPHER Two-Way | 97.6 | 99.5 | 95.8 | 79.7 | 99.4 | 98.5 | 98.2 | 89.1 | 97.1 | 99.6 |
| e^0 | MWEM Three-Way (T=50) | 64.3 | 83.2 | 97.8 | 90.5 | 43.7 | 86.8 | 56.8 | 97.0 | 93.9 | 86.3 |
| e^0 | MWEM Two-Way (T=50) | 86.8 | 90.8 | 98.6 | 95.3 | 66.3 | 93.2 | 78.4 | 98.6 | 97.8 | 94.5 |
| e^0 | FDH Laplace Sanitizer | 99.7 | 99.6 | 99.7 | 99.2 | 99.5 | 99.8 | 99.5 | 98.3 | 99.8 | 99.8 |
| e^1 | CIPHER Three-Way | 99.4 | 99.7 | 99.7 | 99.3 | 99.6 | 100.0 | 99.6 | 98.7 | 99.3 | 99.8 |
| e^1 | CIPHER Two-Way | 99.2 | 99.4 | 99.4 | 71.4 | 100.0 | 99.7 | 97.5 | 88.1 | 99.5 | 99.3 |
| e^1 | MWEM Three-Way (T=100) | 66.6 | 83.4 | 98.6 | 93.0 | 49.5 | 88.4 | 64.0 | 97.1 | 95.0 | 90.5 |
| e^1 | MWEM Two-Way (T=100) | 86.2 | 87.5 | 98.8 | 95.2 | 70.5 | 92.7 | 83.0 | 97.9 | 96.2 | 95.6 |
| e^1 | FDH Laplace Sanitizer | 100.0 | 99.9 | 100.0 | 99.9 | 100.0 | 99.9 | 99.8 | 99.3 | 99.9 | 99.8 |
| e^2 | CIPHER Three-Way | 99.7 | 99.8 | 99.9 | 99.8 | 99.9 | 99.9 | 99.5 | 99.0 | 99.6 | 99.9 |
| e^2 | CIPHER Two-Way | 100.0 | 99.5 | 99.7 | 76.8 | 100.0 | 99.9 | 96.8 | 89.7 | 99.9 | 99.4 |
| e^2 | MWEM Three-Way (T=200) | 68.5 | 81.0 | 99.4 | 95.3 | 59.2 | 88.3 | 64.8 | 97.4 | 95.2 | 90.5 |
| e^2 | MWEM Two-Way (T=200) | 89.0 | 87.9 | 98.5 | 96.9 | 76.9 | 94.0 | 83.2 | 97.7 | 97.1 | 96.2 |
| e^2 | FDH Laplace Sanitizer | 100.0 | 99.8 | 99.9 | 99.9 | 100.0 | 99.9 | 99.7 | 99.5 | 99.8 | 99.6 |


Table 8: Simulation Results: Confidence Interval Widths for n = 500

| ε | Algorithm | β01 | β11 | β21 | β31 | β41 | β02 | β12 | β22 | β32 | β42 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| e^-2 | CIPHER Three-Way | 13.450 | 13.356 | 11.739 | 11.298 | 12.138 | 11.799 | 14.900 | 15.262 | 16.651 | 15.661 |
| e^-2 | CIPHER Two-Way | 12.521 | 10.205 | 9.495 | 8.301 | 10.245 | 8.720 | 11.275 | 9.979 | 12.000 | 9.936 |
| e^-2 | MWEM Three-Way (T=10) | 2.850 | 2.643 | 2.596 | 2.927 | 2.973 | 2.838 | 2.619 | 2.602 | 2.928 | 2.935 |
| e^-2 | MWEM Two-Way (T=10) | 4.403 | 3.793 | 3.483 | 4.001 | 4.028 | 4.360 | 3.861 | 3.557 | 4.015 | 4.253 |
| e^-2 | FDH Laplace Sanitizer | 10.130 | 11.947 | 5.750 | 7.551 | 6.739 | 7.909 | 12.781 | 17.765 | 12.944 | 14.411 |
| e^-1 | CIPHER Three-Way | 8.407 | 7.701 | 5.548 | 5.971 | 6.887 | 6.384 | 7.313 | 7.733 | 7.524 | 7.289 |
| e^-1 | CIPHER Two-Way | 8.071 | 8.612 | 5.940 | 6.638 | 6.519 | 6.719 | 9.079 | 10.677 | 9.682 | 9.747 |
| e^-1 | MWEM Three-Way (T=25) | 3.825 | 3.396 | 3.363 | 3.859 | 3.813 | 3.739 | 3.431 | 3.332 | 3.811 | 4.028 |
| e^-1 | MWEM Two-Way (T=25) | 8.440 | 6.290 | 6.432 | 7.438 | 7.824 | 8.935 | 7.416 | 6.535 | 7.890 | 8.564 |
| e^-1 | FDH Laplace Sanitizer | 5.383 | 6.994 | 3.239 | 4.052 | 3.449 | 4.068 | 6.728 | 9.528 | 6.077 | 7.793 |
| e^0 | CIPHER Three-Way | 4.366 | 4.905 | 3.100 | 3.620 | 3.333 | 3.622 | 4.746 | 5.693 | 4.535 | 5.052 |
| e^0 | CIPHER Two-Way | 4.530 | 4.794 | 3.386 | 3.911 | 3.958 | 4.411 | 4.717 | 5.508 | 4.450 | 4.883 |
| e^0 | MWEM Three-Way (T=50) | 3.711 | 3.368 | 3.346 | 3.783 | 3.814 | 3.687 | 3.325 | 3.415 | 3.729 | 3.759 |
| e^0 | MWEM Two-Way (T=50) | 8.378 | 5.718 | 6.010 | 7.989 | 7.967 | 8.327 | 6.140 | 6.364 | 8.340 | 8.010 |
| e^0 | FDH Laplace Sanitizer | 3.141 | 3.570 | 2.460 | 2.796 | 2.450 | 2.673 | 3.650 | 4.371 | 3.164 | 3.669 |
| e^1 | CIPHER Three-Way | 2.888 | 3.105 | 2.487 | 2.757 | 2.548 | 2.718 | 3.173 | 3.592 | 2.902 | 3.150 |
| e^1 | CIPHER Two-Way | 2.976 | 3.217 | 2.370 | 2.696 | 2.673 | 2.969 | 3.046 | 3.478 | 3.000 | 3.303 |
| e^1 | MWEM Three-Way (T=100) | 3.650 | 3.385 | 3.365 | 3.799 | 3.739 | 3.731 | 3.396 | 3.471 | 3.806 | 3.805 |
| e^1 | MWEM Two-Way (T=100) | 6.986 | 4.763 | 5.184 | 6.366 | 6.695 | 7.998 | 5.843 | 5.902 | 7.795 | 8.276 |
| e^1 | FDH Laplace Sanitizer | 2.450 | 2.611 | 2.195 | 2.409 | 2.209 | 2.347 | 2.580 | 2.883 | 2.444 | 2.646 |
| e^2 | CIPHER Three-Way | 2.585 | 2.744 | 2.290 | 2.497 | 2.315 | 2.469 | 2.717 | 3.031 | 2.559 | 2.775 |
| e^2 | CIPHER Two-Way | 2.592 | 2.715 | 2.160 | 2.343 | 2.292 | 2.417 | 2.623 | 2.888 | 2.581 | 2.744 |
| e^2 | MWEM Three-Way (T=200) | 3.726 | 3.297 | 3.316 | 3.764 | 3.790 | 3.702 | 3.387 | 3.394 | 3.853 | 3.780 |
| e^2 | MWEM Two-Way (T=200) | 6.179 | 4.408 | 4.368 | 5.918 | 6.086 | 7.931 | 4.902 | 5.185 | 7.948 | 8.003 |
| e^2 | FDH Laplace Sanitizer | 2.345 | 2.492 | 2.144 | 2.347 | 2.164 | 2.299 | 2.469 | 2.748 | 2.343 | 2.538 |


2 Derivation of Linear Equation Sets Ax = b

In this section, we illustrate the derivation of the linear equation set given a pre-specified query set Q in the following three scenarios: (1) the 3-variable 2 × 2 × 2 case with Q = all 2-way histograms; (2) the 3-variable 2 × 3 × 3 case with Q = all 2-way histograms; (3) the 4-variable 2 × 2 × 3 × 3 case with Q = all 2-way histograms.

2.1 Three variable case (2 × 2 × 2)

In the 3-variable case 2 × 2 × 2, we first obtain

$$
\begin{aligned}
P(V_3=0 \mid V_1) &= P(V_3=0, V_2=0 \mid V_1) + P(V_3=0, V_2=1 \mid V_1) \\
&= P(V_3=0 \mid V_2=0, V_1)\,P(V_2=0 \mid V_1) + P(V_3=0 \mid V_2=1, V_1)\,P(V_2=1 \mid V_1), \\
P(V_3=0 \mid V_2) &= P(V_3=0, V_1=0 \mid V_2) + P(V_3=0, V_1=1 \mid V_2) \\
&= P(V_3=0 \mid V_1=0, V_2)\,P(V_1=0 \mid V_2) + P(V_3=0 \mid V_1=1, V_2)\,P(V_1=1 \mid V_2).
\end{aligned}
$$

Examining each scenario of $V_1$ and $V_2$, the two equations above can be expanded into four equations:

$$
\begin{aligned}
P(V_3=0 \mid V_1=0) &= P(V_3=0 \mid V_2=0, V_1=0)\,P(V_2=0 \mid V_1=0) + P(V_3=0 \mid V_2=1, V_1=0)\,P(V_2=1 \mid V_1=0) \\
P(V_3=0 \mid V_1=1) &= P(V_3=0 \mid V_2=0, V_1=1)\,P(V_2=0 \mid V_1=1) + P(V_3=0 \mid V_2=1, V_1=1)\,P(V_2=1 \mid V_1=1) \\
P(V_3=0 \mid V_2=0) &= P(V_3=0 \mid V_1=0, V_2=0)\,P(V_1=0 \mid V_2=0) + P(V_3=0 \mid V_1=1, V_2=0)\,P(V_1=1 \mid V_2=0) \\
P(V_3=0 \mid V_2=1) &= P(V_3=0 \mid V_1=0, V_2=1)\,P(V_1=0 \mid V_2=1) + P(V_3=0 \mid V_1=1, V_2=1)\,P(V_1=1 \mid V_2=1)
\end{aligned}
\tag{1}
$$

Using the sanitized values from the 2-way tables, the left-hand sides of the four equations above, $b = \big(P(V_3=0 \mid V_1=0),\ P(V_3=0 \mid V_1=1),\ P(V_3=0 \mid V_2=0),\ P(V_3=0 \mid V_2=1)\big)$, are known. Additionally, on the right-hand side, the elements $P(V_2=0 \mid V_1=0)$, $P(V_2=0 \mid V_1=1)$, $P(V_1=0 \mid V_2=0)$, $P(V_1=0 \mid V_2=1)$, $P(V_2=1 \mid V_1=0)$, $P(V_2=1 \mid V_1=1)$, $P(V_1=1 \mid V_2=0)$, and $P(V_1=1 \mid V_2=1)$ can be calculated from the sanitized 2-way tables. Therefore, Eqn (1) can be written as $b = Az$, where $z = \big(P(V_3=0 \mid V_1=0, V_2=0),\ P(V_3=0 \mid V_1=1, V_2=0),\ P(V_3=0 \mid V_1=0, V_2=1),\ P(V_3=0 \mid V_1=1, V_2=1)\big)$ and $A$ contains the known coefficients associated with $z$. Note that although there are four equations in Eqn (1), they are linearly dependent. Therefore, we apply the Tikhonov regularization to solve for the four unknowns in $z$. Once we get $z$, and hence $P(V_3=1 \mid V_1, V_2) = 1 - P(V_3=0 \mid V_1, V_2)$, we can subsequently calculate the joint probability among $(V_1, V_2, V_3)$ as $P(V_1, V_2, V_3) = P(V_3 \mid V_1, V_2)\,P(V_1, V_2)$, from which we can sample the synthetic data.
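The 2 × 2 × 2 derivation can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the joint distribution `p`, the helper `tikhonov_solve`, and the regularization constant `lam` are all hypothetical, and the exact 2-way margins of `p` stand in for the sanitized 2-way tables (no noise is added).

```python
import numpy as np

def tikhonov_solve(A, b, lam=1e-6):
    """Minimize ||A z - b||^2 + lam * ||z||^2 via the regularized normal equations."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

# Hypothetical joint distribution p[v1, v2, v3]; its 2-way margins play the
# role of the sanitized 2-way tables in this sketch.
p = np.array([[[0.10, 0.05], [0.15, 0.10]],
              [[0.05, 0.20], [0.20, 0.15]]])
p12, p13, p23 = p.sum(axis=2), p.sum(axis=1), p.sum(axis=0)

# Unknowns z = P(V3=0 | V1, V2), ordered (V1,V2) = (0,0), (1,0), (0,1), (1,1) as in the text.
order = [(0, 0), (1, 0), (0, 1), (1, 1)]
col = {pair: j for j, pair in enumerate(order)}

A = np.zeros((4, 4))
b = np.zeros(4)
for v1 in (0, 1):                      # rows for P(V3=0 | V1=v1)
    b[v1] = p13[v1, 0] / p13[v1].sum()
    for v2 in (0, 1):                  # coefficient P(V2=v2 | V1=v1)
        A[v1, col[(v1, v2)]] = p12[v1, v2] / p12[v1].sum()
for v2 in (0, 1):                      # rows for P(V3=0 | V2=v2)
    b[2 + v2] = p23[v2, 0] / p23[v2].sum()
    for v1 in (0, 1):                  # coefficient P(V1=v1 | V2=v2)
        A[2 + v2, col[(v1, v2)]] = p12[v1, v2] / p12[:, v2].sum()

z = tikhonov_solve(A, b)

# Rebuild a joint distribution: P(V1,V2,V3) = P(V3 | V1,V2) * P(V1,V2).
q = np.zeros((2, 2, 2))
for (v1, v2), j in col.items():
    q[v1, v2, 0] = z[j] * p12[v1, v2]
    q[v1, v2, 1] = (1.0 - z[j]) * p12[v1, v2]

print("rank of A:", np.linalg.matrix_rank(A))
print("z:", np.round(z, 3))
```

Because the four rows of `A` are linearly dependent, the solution is not unique; the penalty term selects a small-norm `z` that still reproduces the 2-way margins, so `q` generally differs from `p` in its three-way interaction. With noisy sanitized inputs, the solved values may need to be truncated to [0, 1] and renormalized before sampling.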


… $V_3 = 1 \mid V_1, V_2$, and $A$ contains the corresponding coefficients. We apply the Tikhonov regularization to solve for $z$ from $b = Az$.

$$
\begin{aligned}
P(V_3=0 \mid V_1=0) &= P(V_3=0 \mid V_2=0, V_1=0)\,P(V_2=0 \mid V_1=0) + P(V_3=0 \mid V_2=1, V_1=0)\,P(V_2=1 \mid V_1=0) \\
&\quad + P(V_3=0 \mid V_2=2, V_1=0)\,P(V_2=2 \mid V_1=0) \\
P(V_3=0 \mid V_1=1) &= P(V_3=0 \mid V_2=0, V_1=1)\,P(V_2=0 \mid V_1=1) + P(V_3=0 \mid V_2=1, V_1=1)\,P(V_2=1 \mid V_1=1) \\
&\quad + P(V_3=0 \mid V_2=2, V_1=1)\,P(V_2=2 \mid V_1=1) \\
P(V_3=0 \mid V_2=0) &= P(V_3=0 \mid V_1=0, V_2=0)\,P(V_1=0 \mid V_2=0) + P(V_3=0 \mid V_1=1, V_2=0)\,P(V_1=1 \mid V_2=0) \\
P(V_3=0 \mid V_2=1) &= P(V_3=0 \mid V_1=0, V_2=1)\,P(V_1=0 \mid V_2=1) + P(V_3=0 \mid V_1=1, V_2=1)\,P(V_1=1 \mid V_2=1) \\
P(V_3=0 \mid V_2=2) &= P(V_3=0 \mid V_1=0, V_2=2)\,P(V_1=0 \mid V_2=2) + P(V_3=0 \mid V_1=1, V_2=2)\,P(V_1=1 \mid V_2=2) \\
P(V_3=1 \mid V_1=0) &= P(V_3=1 \mid V_2=0, V_1=0)\,P(V_2=0 \mid V_1=0) + P(V_3=1 \mid V_2=1, V_1=0)\,P(V_2=1 \mid V_1=0) \\
&\quad + P(V_3=1 \mid V_2=2, V_1=0)\,P(V_2=2 \mid V_1=0) \\
P(V_3=1 \mid V_1=1) &= P(V_3=1 \mid V_2=0, V_1=1)\,P(V_2=0 \mid V_1=1) + P(V_3=1 \mid V_2=1, V_1=1)\,P(V_2=1 \mid V_1=1) \\
&\quad + P(V_3=1 \mid V_2=2, V_1=1)\,P(V_2=2 \mid V_1=1) \\
P(V_3=1 \mid V_2=0) &= P(V_3=1 \mid V_1=0, V_2=0)\,P(V_1=0 \mid V_2=0) + P(V_3=1 \mid V_1=1, V_2=0)\,P(V_1=1 \mid V_2=0) \\
P(V_3=1 \mid V_2=1) &= P(V_3=1 \mid V_1=0, V_2=1)\,P(V_1=0 \mid V_2=1) + P(V_3=1 \mid V_1=1, V_2=1)\,P(V_1=1 \mid V_2=1) \\
P(V_3=1 \mid V_2=2) &= P(V_3=1 \mid V_1=0, V_2=2)\,P(V_1=0 \mid V_2=2) + P(V_3=1 \mid V_1=1, V_2=2)\,P(V_1=1 \mid V_2=2)
\end{aligned}
$$
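As a rough check on the ten equations above, the following sketch assembles them as a 10 × 12 system and confirms that the true conditional probabilities satisfy it. The joint distribution `p` is a hypothetical stand-in drawn at random, its exact margins replace the sanitized 2-way tables, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical P(V1, V2, V3) with 2, 3, and 3 categories respectively.
p = rng.dirichlet(np.ones(18)).reshape(2, 3, 3)

p12 = p.sum(axis=2)               # P(V1, V2)
cond3 = p / p12[:, :, None]       # P(V3 | V1, V2), used only to verify the system

# One unknown per (v3, v1, v2) with v3 in {0, 1}; P(V3=2 | .) follows by subtraction.
idx = {}
for v3 in (0, 1):
    for v1 in range(2):
        for v2 in range(3):
            idx[(v3, v1, v2)] = len(idx)

rows, rhs = [], []
for v3 in (0, 1):
    for v1 in range(2):           # equation for P(V3=v3 | V1=v1)
        r = np.zeros(len(idx))
        for v2 in range(3):
            r[idx[(v3, v1, v2)]] = p12[v1, v2] / p12[v1].sum()
        rows.append(r)
        rhs.append(p[v1, :, v3].sum() / p12[v1].sum())
    for v2 in range(3):           # equation for P(V3=v3 | V2=v2)
        r = np.zeros(len(idx))
        for v1 in range(2):
            r[idx[(v3, v1, v2)]] = p12[v1, v2] / p12[:, v2].sum()
        rows.append(r)
        rhs.append(p[:, v2, v3].sum() / p12[:, v2].sum())

A = np.array(rows)
b = np.array(rhs)
z_true = np.array([cond3[v1, v2, v3] for (v3, v1, v2) in idx])

print("system size:", A.shape)
print("rank of A:", np.linalg.matrix_rank(A))
print("max residual at true z:", np.abs(A @ z_true - b).max())
```

The system has fewer independent equations than unknowns, which is why a regularized solver is needed instead of a direct inverse.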


2.3 Four variable case (2 × 2 × 3 × 3)

In this example, the variables $V_1$ and $V_2$ have two categories, and $V_3$ and $V_4$ have three categories. We assume the query set Q consists of all 2D histograms among the variables $V_1$, $V_2$, $V_3$, and $V_4$. (The procedures are similar if Q consists of other types of histograms, such as all 3D histograms or a mixture of 2D and 3D histograms.) CIPHER first solves for the probability distribution of all 3D histograms given the 2D histograms; the procedures are similar to the 3-variable examples in Sections 2.1 and 2.2 of the supplementary materials. Once the 3D histograms are available, the probability distribution of the four variables can be calculated given the 3D histograms. The initial equations are

$$
\begin{aligned}
P(V_4=0 \mid V_1, V_2) &= P(V_4=0, V_3=0 \mid V_1, V_2) + P(V_4=0, V_3=1 \mid V_1, V_2) + P(V_4=0, V_3=2 \mid V_1, V_2) \\
&= P(V_4=0 \mid V_3=0, V_1, V_2)\,P(V_3=0 \mid V_1, V_2) + P(V_4=0 \mid V_3=1, V_1, V_2)\,P(V_3=1 \mid V_1, V_2) \\
&\quad + P(V_4=0 \mid V_3=2, V_1, V_2)\,P(V_3=2 \mid V_1, V_2) \\
P(V_4=0 \mid V_1, V_3) &= P(V_4=0, V_2=0 \mid V_1, V_3) + P(V_4=0, V_2=1 \mid V_1, V_3) \\
&= P(V_4=0 \mid V_2=0, V_1, V_3)\,P(V_2=0 \mid V_1, V_3) + P(V_4=0 \mid V_2=1, V_1, V_3)\,P(V_2=1 \mid V_1, V_3) \\
P(V_4=0 \mid V_2, V_3) &= P(V_4=0, V_1=0 \mid V_2, V_3) + P(V_4=0, V_1=1 \mid V_2, V_3) \\
&= P(V_4=0 \mid V_1=0, V_2, V_3)\,P(V_1=0 \mid V_2, V_3) + P(V_4=0 \mid V_1=1, V_2, V_3)\,P(V_1=1 \mid V_2, V_3) \\
P(V_4=1 \mid V_1, V_2) &= P(V_4=1, V_3=0 \mid V_1, V_2) + P(V_4=1, V_3=1 \mid V_1, V_2) + P(V_4=1, V_3=2 \mid V_1, V_2) \\
&= P(V_4=1 \mid V_3=0, V_1, V_2)\,P(V_3=0 \mid V_1, V_2) + P(V_4=1 \mid V_3=1, V_1, V_2)\,P(V_3=1 \mid V_1, V_2) \\
&\quad + P(V_4=1 \mid V_3=2, V_1, V_2)\,P(V_3=2 \mid V_1, V_2) \\
P(V_4=1 \mid V_1, V_3) &= P(V_4=1, V_2=0 \mid V_1, V_3) + P(V_4=1, V_2=1 \mid V_1, V_3) \\
&= P(V_4=1 \mid V_2=0, V_1, V_3)\,P(V_2=0 \mid V_1, V_3) + P(V_4=1 \mid V_2=1, V_1, V_3)\,P(V_2=1 \mid V_1, V_3) \\
P(V_4=1 \mid V_2, V_3) &= P(V_4=1, V_1=0 \mid V_2, V_3) + P(V_4=1, V_1=1 \mid V_2, V_3) \\
&= P(V_4=1 \mid V_1=0, V_2, V_3)\,P(V_1=0 \mid V_2, V_3) + P(V_4=1 \mid V_1=1, V_2, V_3)\,P(V_1=1 \mid V_2, V_3)
\end{aligned}
$$

Examining each scenario of $V_1$, $V_2$, and $V_3$, we can expand the above 6 equations into 32 equations with 24 unknowns (the conditional probabilities). Again, the 32 equations are linearly dependent, and the rank of the system is < 24. Therefore, we apply the Tikhonov regularization to solve for $z$ from $b = Az$, where $b$ contains the left-hand sides of the 32 equations, $z$ refers to the conditional probabilities of $V_4 = 0 \mid V_1, V_2, V_3$ and $V_4 = 1 \mid V_1, V_2, V_3$, and $A$ contains the corresponding coefficients.
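The equation and unknown counts for the four-variable case can be checked with a short sketch that builds the coefficient matrix A for the last step (solving for the conditional distribution of V4 given V1, V2, V3). The distribution `p123` is a hypothetical, randomly drawn stand-in for the 3D histogram of (V1, V2, V3) obtained in the first step; the indexing scheme is an illustrative choice, not taken from the paper.

```python
import itertools
import numpy as np

levels = (2, 2, 3)                      # categories of V1, V2, V3 (V4 has 3)
rng = np.random.default_rng(2)
p123 = rng.dirichlet(np.ones(12)).reshape(levels)   # hypothetical P(V1, V2, V3)

# One unknown per (v4, v1, v2, v3) with v4 in {0, 1}: 2 * 12 = 24 unknowns.
cells = list(itertools.product(*(range(k) for k in levels)))
col = {key: j for j, key in enumerate(itertools.product((0, 1), cells))}

rows = []
for v4 in (0, 1):
    for i, j in [(0, 1), (0, 2), (1, 2)]:   # conditioning pair (Vi, Vj)
        k = 3 - i - j                        # index of the variable summed out
        pij = p123.sum(axis=k)               # P(Vi, Vj)
        for u in range(levels[i]):
            for w in range(levels[j]):
                r = np.zeros(len(col))
                for t in range(levels[k]):
                    cell = [0, 0, 0]
                    cell[i], cell[j], cell[k] = u, w, t
                    cell = tuple(cell)
                    # coefficient P(Vk=t | Vi=u, Vj=w)
                    r[col[(v4, cell)]] = p123[cell] / pij[u, w]
                rows.append(r)

A = np.array(rows)
print("equations x unknowns:", A.shape)
print("rank of A:", np.linalg.matrix_rank(A))
```

Each row holds the conditional probabilities of the summed-out variable, so every row sums to one; the printed rank falls below the number of unknowns, consistent with the need for regularization.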
