TRANSCRIPT
Differential Privacy: Is it the dawn of data science tomorrow?
Dr. Zhenjie Zhang, Advanced Digital Sciences Center
University of Illinois at Urbana-Champaign
The Force of Big Data is Huge
• Health Care
  – Disease Study
• Internet-based Economy
  – E-Commerce
  – Online Advertising
The Dark Side: Data Privacy
• Personal Sensitive Information
  – Medical Prescriptions -> Diseases
  – Movie Rental History -> Sexual Orientation
  – Trajectories -> Home Address
Privacy incident: the MGIC case
• Time: mid-1990s
• Publisher: Massachusetts Group Insurance Commission (MGIC)
• Data released: "anonymized" medical records
• Result: a PhD student at MIT was able to identify the medical record of the governor of Massachusetts
Medical Records:
Birth Date   Gender   ZIP     Disease
1960/01/01   F        10000   flu
1965/02/02   M        20000   dyspepsia
1970/03/03   F        30000   pneumonia
1975/04/04   M        40000   gastritis

Voter Registration List:
Name    Birth Date   Gender   ZIP
Alice   1960/01/01   F        10000
Bob     1965/02/02   M        20000
Cathy   1970/03/03   F        30000
David   1975/04/04   M        40000

The two tables match on {Birth Date, Gender, ZIP}.
Privacy incident: the MGIC case
• Research [Golle 06] shows that 63% of Americans can be uniquely identified by {date of birth, gender, ZIP code}
Lesson Learned
• What went wrong?
• Intuition: although the identifiers are removed from the data, some quasi-identifiers remain
• Can we solve the problem by removing quasi-identifiers?
Unfortunately, no.
Privacy incident: the AOL case
• In 2006, AOL released an "anonymized" log of its search engine to support research
• Example of the log:

User ID   Query                              Date/Time   …
4417749   "Data privacy workshop location"   …           …
4417749   "COE price"                        …           …
4417749   "Jurong Point opening hours"       …           …

• Each user only has an ID, i.e., no identifier or quasi-identifier is released
• However, The New York Times was able to identify a user from the log
Privacy incident: the AOL case
• What The New York Times did:
  – Find all log entries for AOL user 4417749
  – Multiple queries for businesses and services in Lilburn, GA (population 11K)
  – Several queries for Jarrett Arnold
  – Lilburn has 14 people with the last name Arnold
  – NYT contacted them and found out that AOL user 4417749 is Thelma Arnold
• The CTO of AOL resigned after the incident
Lesson Learned
• What went wrong?
• Intuition: although all identifiers and quasi-identifiers are removed, the users' behavior traces (i.e., their search keywords) reveal their identities
• The same problem occurred in another incident in 2006
Privacy incident: the Netflix case
• In 2006, the Netflix movie rental service released some movie ratings made by its users, for a competition with a 1M USD prize
• Example of the data:

User ID   Movie           Rating   Date
123       Scary Movie 1   5        2006.07.01
123       Scary Movie 2   4        2006.07.08
123       Scary Movie 3   4        2006.07.15

• Each user only has an ID, i.e., no identifier or quasi-identifier is released
• However, two researchers from U. Texas were able to link some users to some online identities
Privacy incident: the Netflix case
• What the researchers did:
  – Go to the movie review site IMDb and get the ratings made by IMDb users, as well as the dates
  – Match an IMDb user to a Netflix user if both users gave the same ratings to the same movies on similar dates

Netflix data:
User ID   Movie           Rating   Date
123       Scary Movie 1   5        2006.07.01
123       Scary Movie 2   4        2006.07.08
123       Scary Movie 3   4        2006.07.16

IMDb data:
IMDB ID   Movie           Rating   Date
456       Scary Movie 1   5        2006.07.01
456       Scary Movie 2   4        2006.07.09
456       Scary Movie 3   4        2006.07.15
Privacy incident: the Netflix case
• In general, 99% of users can be identified with 8 ratings and their dates
• Result: Netflix was sued; the case was settled out of court
Lessons learned
• Now we know that it is risky to publish detailed records of individual data, since
  – quasi-identifiers may reveal identities
  – behavior information may reveal identities, too
• What if we don't release detailed records, but only aggregate information?
• Answer: it could still fail to protect privacy
Agenda
• Basics of Differential Privacy
• Optimization and Use Cases
• Limitations of Differential Privacy
Is It a Privacy Leakage?
Yes and No
Why Yes?
• Every individual contributes to the average height:
  Average Height = (Total Height of Others + My Height) / Singapore Population
Why No?
• Nobody can infer your height from the statistic itself
• Let us consider a special case:
  – Average Height: 50 cm
  – Bob is 40 cm tall, and he happens to know Stuart is 50 cm tall
  – With only three people (Bob, Kevin, Stuart), Bob can now deduce Kevin's height: 3 × 50 − 40 − 50 = 60 cm
What does the story tell us?
• The background knowledge of the adversary matters
  – Adversaries are not innocent!
• The impact of an individual record matters
  – Including Kevin, the average height is 50 cm
  – Excluding Kevin, the average height is 45 cm
Differential Privacy in a Nutshell
• A question-answering interface between the data and humans
  – No direct access to the database
  – The answer is (almost) the same, regardless of the existence of any individual record in the database
  – Even if the adversary knows everybody in the database except Kevin, he cannot infer any information about Kevin by looking at the results
How to enforce Differential Privacy?
• Query: Average Height?
• Step 1 (Calculation): exact answer = 50 cm
• Step 2 (Random number generation): sensitivity = 17 cm (the maximal impact of any individual record), privacy budget = 0.5; sample a noise value, e.g., -3 cm
• Step 3 (Noise injection): randomized answer = 50 - 3 = 47 cm
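A minimal sketch of these three steps in Python (the data, sensitivity, and budget are the toy numbers from the slide; the noise is drawn with `numpy.random.laplace`):

```python
import numpy as np

def noisy_average(values, sensitivity, epsilon):
    """Laplace mechanism for an average query (toy sketch)."""
    exact = np.mean(values)                          # Step 1: calculation
    scale = sensitivity / epsilon                    # noise scale = sensitivity / budget
    noise = np.random.laplace(loc=0.0, scale=scale)  # Step 2: random number generation
    return exact + noise                             # Step 3: noise injection

heights = [40, 60, 50]  # Bob, Kevin, Stuart (toy data; exact average = 50 cm)
print(noisy_average(heights, sensitivity=17, epsilon=0.5))
```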
Privacy Budget: Tradeoff Between Utility and Privacy
• The privacy budget is a positive real number
  – How much privacy are you willing to trade for accuracy of the result?
• Example: sensitivity = 17 cm (the maximal impact of any individual record), privacy budget = 0.5, so the noise scale is 17 / 0.5 = 34 (one sampled noise value: -3 cm)
Random Number Generation
• Dwork et al. [2006]: the Laplace Mechanism
  – Given the sensitivity, generate the noise from a Laplace distribution with
    Scale = Sensitivity / Budget
[Figure: probability density of the Laplace distribution, centered at 0; x-axis from -10 to 10, density from 0 to 0.5]
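For reference, the density plotted here is presumably the Laplace density centered at 0, which for the slide's scale rule reads:

```latex
f(x) = \frac{1}{2b}\exp\!\left(-\frac{|x|}{b}\right),
\qquad b = \frac{\text{Sensitivity}}{\text{Budget}}
```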
Mathematical Interpretation of Laplace Mechanism
• Neighbor databases: databases that differ in exactly one record

Original database:
Name     Height
Bob      40
Kevin    60
Stuart   50

Neighbor (Bob removed):
Name     Height
Kevin    60
Stuart   50

Neighbor (Stuart removed):
Name     Height
Bob      40
Kevin    60

Neighbor (Kevin removed):
Name     Height
Bob      40
Stuart   50
Mathematical Interpretation of Laplace Mechanism
• Privacy Guarantee
  – Query Q on database D
  – Any neighbor database D'
  – Any possible answer R
  – Privacy budget epsilon
  – Under the Laplace mechanism, we have (see the numerical check below):
    Pr[Q(D) = R] ≤ exp(epsilon) · Pr[Q(D') = R]
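A small numerical check of this inequality, assuming Q is the toy average-height query, with exact answers 50 cm on D and 45 cm on the Kevin-less neighbor D' (so the answer changes by at most 5 cm between these neighbors):

```python
import numpy as np

def laplace_pdf(x, mu, b):
    """Density of a Laplace distribution centered at mu with scale b."""
    return np.exp(-abs(x - mu) / b) / (2 * b)

epsilon = 0.5
sensitivity = 5.0            # |Q(D) - Q(D')| for the toy averages 50 vs 45
b = sensitivity / epsilon    # Laplace scale

for R in [40.0, 47.0, 50.0, 60.0]:           # some possible randomized answers
    ratio = laplace_pdf(R, 50.0, b) / laplace_pdf(R, 45.0, b)
    assert ratio <= np.exp(epsilon) + 1e-9   # the epsilon-DP bound holds
    print(f"R={R}: density ratio {ratio:.3f} <= e^epsilon = {np.exp(epsilon):.3f}")
```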
Utility-Privacy Tradeoff of Laplace Mechanism
• Smaller privacy budget epsilon
  – The system behaves more similarly on neighbor databases
  – Higher noise: the noise scale is inversely proportional to the privacy budget
Differential Privacy is Like a Reservoir
• Each query consumes privacy budget
• When the privacy budget is used up, the database cannot be queried any more

[Diagram: the queries "Average Height?" (answer: 47 cm) and "Number of Eyes?" (answer: 5) each drain the privacy-budget reservoir; once it is empty, "Average Weight?" goes unanswered]
Why is Differential Privacy Popular?
• Almost the strongest background-knowledge assumption
• Nice composition property
• Tunable tradeoff between privacy and utility
• High efficiency
  – Sensitivity: pre-calculated
  – Epsilon: specified by the user
  – Noise generation: constant time
Agenda
• Basics of Differential Privacy
• Optimization and Use Cases
  – Histogram Publication
  – Counting Publication
  – Data Synthesis
• Limitations of Differential Privacy
Do we need optimization?
• The Laplace mechanism is universally applicable, but
  – the sensitivity is sometimes too high, e.g., for the median (see the sketch below):

Database D: R1 = 0, R2 = 0, R3 = 100, R4 = 100   Median = 50
Neighbor D' (R4 removed): R1 = 0, R2 = 0, R3 = 100   Median = 0
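The point in a few lines of Python, using the toy tables above (removing one record moves the median from 50 to 0, so Laplace noise calibrated to a sensitivity of 50 would drown the true answer):

```python
import numpy as np

db       = [0, 0, 100, 100]   # R1..R4
neighbor = [0, 0, 100]        # neighbor database: R4 removed
print(np.median(db), np.median(neighbor))   # 50.0 vs 0.0 -> sensitivity is 50
```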
Do we need optimization?
• The Laplace mechanism is universally applicable, but
  – the budget is consumed quickly under multiple queries:

        Query                         Privacy budget
Alice   What's the average height?    0.5
Bob     What's the average height?    0.5
Chris   What's the average height?    0.5
Overview of General Tricks
• Decompose the privacy budget over steps
  – Histogram publication
• Transform the original query into new queries with smaller sensitivity
  – Learning tasks, e.g., classification
  – Group counting queries
• Data transformation/compression
  – Data synthesis
Query Decomposition Trick
• Recall the composition property: if Query 1 is answered with budget epsilon1 and Query 2 with budget epsilon2, the total budget consumed is epsilon1 + epsilon2
Query Decomposition: Histogram Publication
• Histogram publication is widely used in data analysis, to support all sorts of statistical queries
• Once a histogram is constructed, we can answer max, min, median, and range-count queries without additional budget consumption

Name    Age   HIV+
Frank   42    Y
Bob     31    Y
Mary    28    Y
Dave    43    N
…       …     …
Query Decomposition: Histogram Publication
• Xu et al. [12] propose a two-step solution
  – Step 1: find the structure (bin boundaries) of the histogram
  – Step 2: add Laplace noise to the bin counts
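A minimal sketch of the second step only (the structure-finding step of Xu et al. is elided, so the age bins below are an assumption); each person falls into exactly one bin, so each count has sensitivity 1:

```python
import numpy as np

ages = np.array([42, 31, 28, 43])            # toy records from the slide
edges = [20, 30, 40, 50]                     # assumed bin structure from step 1
counts, _ = np.histogram(ages, bins=edges)   # exact count per bin

epsilon = 0.5
noisy_counts = counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
print(noisy_counts)   # published once; range counts etc. are then answered for free
```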
Query Transformation: Group Counting
• Example queries over state counts x_NY, x_NJ, x_CA, x_WA:
  – Q1 = x_NY + x_NJ + x_CA + x_WA
  – Q2 = x_NY + x_NJ
  – Q3 = x_WA
• Workload Matrix W (one row per query):
  1 1 1 1
  1 1 0 0
  0 0 0 1
• Answer = W × data vector (x_NY, x_NJ, x_CA, x_WA)
• Sensitivity = 2 (a record in NY or NJ affects both Q1 and Q2)
Query Transformation: Group Counting
• Find an approximate decomposition of the workload matrix, to reduce sensitivity: W = W' × A

Workload Matrix W:
1 1 1 1
1 1 0 0
0 0 0 1

New Workload W':
1 1 1
1 0 0
0 0 1

Full-Rank Strategy Matrix A:
1 1 0 0
0 0 1 0
0 0 0 1

• Sensitivity of A = 1
Query Transformation: Group Counting
• Generate results by adding noise to the product of the strategy matrix and the data vector:
  Answer = W' × (A × (x_NY, x_NJ, x_CA, x_WA) + Laplace noise)
• Sensitivity = 1, so the noise has a smaller scale
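A numpy sketch of the whole trick with the slide's matrices (W decomposes exactly as W' × A here, and the noise is calibrated to A's sensitivity of 1 rather than W's sensitivity of 2; the counts in x are made up):

```python
import numpy as np

W     = np.array([[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]])  # workload matrix
A     = np.array([[1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])  # strategy, sensitivity 1
W_new = np.array([[1, 1, 1], [1, 0, 0], [0, 0, 1]])           # new workload W'
assert (W_new @ A == W).all()                                 # W = W' x A

x = np.array([10, 20, 30, 40])   # toy counts x_NY, x_NJ, x_CA, x_WA
epsilon = 0.5
noisy = A @ x + np.random.laplace(scale=1.0 / epsilon, size=3)  # scale 1/eps, not 2/eps
print(W_new @ noisy)             # noisy answers to Q1, Q2, Q3 (exact: 100, 30, 40)
```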
Data Transformation: Data Synthesis
• Pipeline: Compress -> Noise Insertion -> Coefficient Cutting -> Decompress
• Example transforms: 2D wavelet, compressive sensing
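A toy illustration of that pipeline using a one-level Haar transform as a stand-in for the 2D-wavelet or compressive-sensing transform the slide refers to (an illustration of the idea only, not the published method; the data and noise scale are invented):

```python
import numpy as np

def haar(x):
    """One-level Haar transform: pairwise averages, then pairwise half-differences."""
    p = np.asarray(x, dtype=float).reshape(-1, 2)
    return np.concatenate([(p[:, 0] + p[:, 1]) / 2, (p[:, 0] - p[:, 1]) / 2])

def inverse_haar(c):
    avg, diff = np.split(c, 2)
    return np.stack([avg + diff, avg - diff], axis=1).ravel()

data = np.array([10, 12, 50, 54, 9, 11, 48, 52], dtype=float)
coeffs = haar(data)                                        # compress
coeffs += np.random.laplace(scale=2.0, size=coeffs.shape)  # noise insertion
coeffs[np.abs(coeffs) < 1.0] = 0.0                         # coefficient cutting
print(inverse_haar(coeffs))                                # decompress -> synthetic data
```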
Agenda
• Basics of Differential Privacy
• Optimization and Use Cases
• Limitations of Differential Privacy
Data Collection and Independence Assumption
• Implicit assumption
  – All individuals are independent of each other
  – Counter-examples: HIV and genetic diseases, where one person's record is correlated with others'

Name    Age   HIV+
Frank   42    Y
Bob     31    Y
Mary    28    Y
Dave    43    N
…       …     …
High Sensitivity Computation
• Non-convex optimization
  – E.g., deep learning
Operations on Database
• It is difficult to update the database
  – Can we query the database again after certain attributes are updated?
  – The general answer is no
Conclusion
• Differential privacy is the most robust privacy model known so far
• Differential privacy is practical in certain application domains, e.g., histogram and counting publication
• The applicability of differential privacy remains limited
• It is actually too difficult to understand, even for computer scientists!
Q&A