a general methodology for masking output from remote analysis … · 2013-11-13 · output from...

A GENERAL METHODOLOGY FOR MASKING OUTPUT FROM REMOTE ANALYSIS SYSTEMS Krish Muralidhar Christine O’Keefe Rathindra Sarathy



A GENERAL METHODOLOGY FOR MASKING OUTPUT FROM REMOTE ANALYSIS SYSTEMS

Krish Muralidhar

Christine O’Keefe

Rathindra Sarathy


REMOTE ANALYSIS SYSTEM

O’Keefe and Chipperfield (in press)

[Figure: flow diagram of a remote analysis system with the components Query, Dataset, Analysis, Output, Data Transformations, and Output for publication]


FOCUS OF THIS PAPER

Responses to statistical queries involving numerical variables

We explicitly do not consider tabular data release


DATA-BASED CONFIDENTIALIZATION MEASURES FOR REMOTE ANALYSIS

Input Perturbation and Data Subsetting

Restrictions on Data Transformations

[Figure: remote analysis system diagram (Query, Dataset, Analysis, Output, Data Transformations, Output for publication)]


ANALYSIS-BASED CONFIDENTIALIZATION MEASURES

Refusal to answer risky queries

Output checking

[Figure: remote analysis system diagram (Query, Dataset, Analysis, Output, Data Transformations, Output for publication)]


OUTPUT CONFIDENTIALIZATION

Modify output prior to release

[Figure: remote analysis system diagram (Query, Dataset, Analysis, Output, Data Transformations, Output for publication)]


EFFECTIVE OUTPUT MASKING

Respond to a diverse set of queries

Meaningful responses to queries

Robust

Control disclosure risk

Automated


OUTPUT MASKING MECHANISMS

Additive Perturbation

Including differential privacy

In our opinion, the applicability of differential privacy for statistical analyses involving numerical variables is open to question. We do not consider differential privacy further

Multiplicative perturbation


A SIMPLE ILLUSTRATION

Query: “What is the variance of a particular subset of the data (n = 100)?”

True response: 3.81


RESPONSE DISTRIBUTION - ADDITIVE NOISE

But which one?


RESPONSE DISTRIBUTION - MULTIPLICATIVE

But which one?
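The two perturbation mechanisms can be sketched as follows. This is only an illustration: the Gaussian noise model and the noise scales (`noise_sd`) are assumptions, not values from the slides, and choosing them is exactly the open question raised here.

```python
import random


def additive_mask(true_value, noise_sd, rng):
    """Return the true value plus zero-mean Gaussian noise (assumed noise model)."""
    return true_value + rng.gauss(0.0, noise_sd)


def multiplicative_mask(true_value, noise_sd, rng):
    """Return the true value scaled by a random factor centred on 1 (assumed noise model)."""
    return true_value * rng.gauss(1.0, noise_sd)


rng = random.Random(0)
true_variance = 3.81  # true response from the illustration

# Hypothetical noise scales; nothing in the query tells us what they should be.
print(additive_mask(true_variance, 0.5, rng))
print(multiplicative_mask(true_variance, 0.1, rng))
```

With Gaussian additive noise, a masked variance can even come out negative if the noise scale is large relative to the true value, which is one symptom of the "which one?" problem.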


DRAW FROM THE SAMPLING DISTRIBUTION

Use Chi-Square distribution to approximate the sampling distribution of the sample variance. Draw the response from this distribution.
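A minimal sketch of this draw, assuming the usual normal-theory result that (n − 1)S²/σ² follows a chi-square distribution with n − 1 degrees of freedom. The unknown σ² is replaced by the observed sample variance, which is an assumption of this sketch; the function name is hypothetical.

```python
import random


def chi_square_masked_variance(sample_variance, n, rng):
    """Draw a masked variance from the chi-square approximation.

    Assumes (n - 1) * S^2 / sigma^2 ~ chi-square(n - 1), with sigma^2
    replaced by the observed sample variance (an assumption of this sketch).
    """
    df = n - 1
    # A chi-square variate with df degrees of freedom is Gamma(shape=df/2, scale=2).
    chi2 = rng.gammavariate(df / 2.0, 2.0)
    return sample_variance * chi2 / df


rng = random.Random(0)
print(chi_square_masked_variance(3.81, 100, rng))
```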


ROBUST?

The chi-square approximation is sensitive to the normality assumption and is not very robust. The data in this case are heavily skewed.
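The sensitivity to normality can be checked with a small simulation. The sketch below compares the simulated variance of the sample variance with the value implied by the chi-square approximation. The exponential population is a hypothetical stand-in for "heavily skewed" data; the slides do not say which distribution the example data follow.

```python
import random


def sample_variance(values):
    """Unbiased sample variance (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def variance_of_sample_variance(draw, n, n_reps, rng):
    """Estimate Var(S^2) by simulating n_reps samples of size n."""
    s2 = [sample_variance([draw(rng) for _ in range(n)]) for _ in range(n_reps)]
    return sample_variance(s2)


rng = random.Random(0)
n = 100
# Hypothetical skewed population: exponential with rate 1, so sigma^2 = 1.
simulated = variance_of_sample_variance(lambda r: r.expovariate(1.0), n, 2000, rng)
# Under normality, Var(S^2) = 2 * sigma^4 / (n - 1).
chi_square_implied = 2.0 / (n - 1)
# A ratio well above 1 means the chi-square spread is too narrow for these data.
print(simulated / chi_square_implied)
```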


AN IDEAL MASKING MECHANISM

For any query, select a random sample from the relevant population (not the database), compute the value of the statistic, and release this value

Practically infeasible


ALTERNATIVE MECHANISM

For any query, derive the sampling distribution of the statistic. Randomly draw a value from this distribution. Release this value

May be feasible for some simple statistics (like the sample mean), but as our variance example illustrates, may not be possible for others

Theoretically infeasible
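For the sample mean, the one case named above as feasible, the mechanism might look like the sketch below. It assumes the central limit theorem applies, so the sampling distribution of the mean is approximated by a normal distribution centred on the observed mean with the estimated standard error; the function name is hypothetical.

```python
import math
import random
import statistics


def masked_mean(values, rng):
    """Draw a masked mean from a normal approximation to its sampling distribution."""
    n = len(values)
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return rng.gauss(mean, standard_error)


rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(100)]  # hypothetical query set
print(masked_mean(data, rng))
```

No comparably simple closed form is available for most other statistics, which is the obstacle this slide points to.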


A FEASIBLE APPROACH

Selecting a value from the sampling distribution of the statistic always provides an appropriate masked response

Problem – how do we approximate the sampling distribution of the statistic in a way that is both accurate and robust?

Solution – THE STATISTICAL BOOTSTRAP


THE STATISTICAL BOOTSTRAP (EFRON 1979)

Draw a bootstrap sample of size n, with replacement, from the original sample also of size n.

Compute value of statistic from the bootstrap sample

Repeat process of selecting bootstrap samples

The standard deviation of the values of the statistic from the bootstrap samples provides a good approximation of the standard error of the statistic

The distribution of $\hat{\theta}^* - \hat{\theta}$ provides a good approximation of the distribution of $\hat{\theta} - \theta$

$\theta$ – Parameter; $\hat{\theta}$ – Statistic; $\hat{\theta}^*$ – Bootstrap statistic
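The bootstrap steps listed on this slide can be sketched in a few lines. The function names, the number of replicates, and the example data are assumptions made for illustration.

```python
import random
import statistics


def bootstrap_replicates(values, statistic, n_boot, rng):
    """Compute the statistic on n_boot bootstrap samples of size n, drawn with replacement."""
    n = len(values)
    return [statistic(rng.choices(values, k=n)) for _ in range(n_boot)]


def bootstrap_standard_error(values, statistic, n_boot, rng):
    """Standard deviation of the bootstrap replicates, used as the standard error estimate."""
    return statistics.stdev(bootstrap_replicates(values, statistic, n_boot, rng))


rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(100)]  # hypothetical skewed sample, n = 100
print(bootstrap_standard_error(data, statistics.variance, 1000, rng))
```

The statistic is passed in as a function, so the same code serves for the variance, the mean, or any other statistic computed from the sample.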


BACK TO OUR EXAMPLE


APPROPRIATE MASKED RESPONSE

Since the bootstrap distribution of the statistic closely approximates the sampling distribution of the statistic, choosing a value randomly from the bootstrap distribution is a close approximation of choosing a value randomly from the true sampling distribution of the statistic

Closely equivalent to drawing an independent sample from the population


CHOOSING FROM THE BOOTSTRAP DISTRIBUTION

Only a single realization from the bootstrap distribution is required

A single realization from the bootstrap distribution is the result of selecting a single bootstrap sample

No need to construct the entire bootstrap distribution!


ACTUAL MASKING PROCEDURE

From the original query set, select one bootstrap sample of the same size as the original set, with replacement.

Compute the value of the statistic for this bootstrap sample.

Release the value of this statistic as the masked response.
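The three steps above can be sketched as a single function. This is an illustrative reading of the procedure, not the authors' implementation; the function name and the example data are hypothetical.

```python
import random
import statistics


def bootstrap_masked_response(query_set, statistic, rng=None):
    """Return the statistic computed on one bootstrap sample of the query set."""
    if rng is None:
        rng = random.Random()
    # Step 1: one bootstrap sample, same size as the query set, with replacement.
    resample = rng.choices(query_set, k=len(query_set))
    # Steps 2 and 3: compute the statistic and release it as the masked response.
    return statistic(resample)


rng = random.Random(0)
query_set = [rng.expovariate(1.0) for _ in range(100)]  # hypothetical skewed data, n = 100
print("true variance:  ", statistics.variance(query_set))
print("masked variance:", bootstrap_masked_response(query_set, statistics.variance, rng))
```

Only one resample is drawn per query, so the cost is essentially that of computing the statistic once.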


CHARACTERISTICS OF THE BOOTSTRAP METHOD

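The distribution of $\hat{\theta}^*$ closely approximates the sampling distribution of $\hat{\theta}$,

If $\hat{\theta}$ is an unbiased estimator, then $E(\hat{\theta}^*) = \theta$, and

Variance of $\hat{\theta}^* = \sigma_{\hat{\theta}}^2$, the variance of $\hat{\theta}$.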
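As a worked case, take the sample mean. The notation here is introduced for illustration: $x_1, \dots, x_n$ is the query set with mean $\bar{x}$ and sample variance $s^2$, and $\bar{x}^*$ is the mean of one bootstrap sample. Conditional on the query set, the moments of the bootstrap mean are:

```latex
\begin{align*}
E\left(\bar{x}^* \mid x_1,\dots,x_n\right) &= \bar{x}, \\
\operatorname{Var}\left(\bar{x}^* \mid x_1,\dots,x_n\right)
  &= \frac{1}{n}\cdot\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2
   = \frac{n-1}{n}\cdot\frac{s^2}{n} \approx \frac{s^2}{n}.
\end{align*}
```

Averaging the first line over query sets gives $E(\bar{x}^*) = E(\bar{x}) = \mu$, and the second line is, up to the factor $(n-1)/n$, the usual estimate of the variance of the sample mean.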


PERFORMANCE OF THE BOOTSTRAP METHOD

Easy implementation

Usefulness: $\hat{\theta}^*$ is a random value chosen from a distribution that closely approximates the sampling distribution of $\hat{\theta}$

Disclosure risk: Noise addition approximately equal to the standard error of the statistic $\hat{\theta}$

Robust (no distributional assumptions such as normality)

Easily automated and programmed without the need for ongoing human intervention.


FUTURE RESEARCH

Tabular data

Multiple imputation using the bootstrap

Compare with Rubin’s Bayesian bootstrap

Relationship between the bootstrap and smooth sensitivity


QUESTIONS OR COMMENTS?

Thank you