TRANSCRIPT
Differential Privacy: Is it the dawn of data science tomorrow?
Dr. Zhenjie Zhang, Advanced Digital Sciences Center
University of Illinois at Urbana-Champaign
The Force of Big Data is Huge
• Health Care
  – Disease Study
• Internet-based Economy
  – E-Commerce
  – Online Advertising
The Dark Side: Data Privacy
• Personal Sensitive Information
  – Medical Prescriptions -> Diseases
  – Movie Rental History -> Sexual Orientation
  – Trajectories -> Home Address
Privacy incident: the MGIC case
• Time: mid-1990s
• Publisher: Massachusetts Group Insurance Commission (MGIC)
• Data released: "anonymized" medical records
• Result: a PhD student at MIT was able to identify the medical record of the governor of Massachusetts
Medical Records:
Birth Date   Gender   ZIP     Disease
1960/01/01   F        10000   flu
1965/02/02   M        20000   dyspepsia
1970/03/03   F        30000   pneumonia
1975/04/04   M        40000   gastritis

Voter Registration List:
Name    Birth Date   Gender   ZIP
Alice   1960/01/01   F        10000
Bob     1965/02/02   M        20000
Cathy   1970/03/03   F        30000
David   1975/04/04   M        40000

The two tables match on {Birth Date, Gender, ZIP}.
Privacy incident: the MGIC case
• Research [Golle 06] shows that 63% of Americans can be uniquely identified by {date of birth, gender, ZIP code}
Lesson Learned
• What went wrong?
• Intuition: although the identifiers are removed from the data, some quasi-identifiers remain
• Can we solve the problem by removing quasi-identifiers?
Unfortunately, no.
Privacy incident: the AOL case
• In 2006, AOL released an "anonymized" log of its search engine to support research
• Example of the log:

User ID   Query                              Date/Time   …
4417749   "Data privacy workshop location"   …           …
4417749   "COE price"                        …           …
4417749   "Jurong Point opening hours"       …           …

• Each user only has an ID, i.e., no identifier or quasi-identifier is released
• However, The New York Times was able to identify a user from the log
Privacy incident: the AOL case
• What The New York Times did:
  – Find all log entries for AOL user 4417749
  – Multiple queries for businesses and services in Lilburn, GA (population 11K)
  – Several queries for Jarrett Arnold
  – Lilburn has 14 people with the last name Arnold
  – NYT contacted them and found out that AOL user 4417749 is Thelma Arnold
• The CTO of AOL resigned after the incident
Lesson Learned
• What went wrong?
• Intuition: although all identifiers and quasi-identifiers are removed, the users' behavior traces (i.e., their search keywords) reveal their identities
• The same problem occurred in another incident in 2006
Privacy incident: the Netflix case
• In 2006, the Netflix movie rental service released some movie ratings made by its users, for a competition with a 1M USD prize
• Example of the data:

User ID   Movie           Rating   Date
123       Scary Movie 1   5        2006.07.01
123       Scary Movie 2   4        2006.07.08
123       Scary Movie 3   4        2006.07.15

• Each user only has an ID, i.e., no identifier or quasi-identifier is released
• However, two researchers from U. Texas were able to link some users to some online identities
Privacy incident: the Netflix case
• What the researchers did:
  – Go to the movie review site IMDb and get the ratings made by IMDb users, as well as the dates
  – Match an IMDb user to a Netflix user if both users gave the same ratings to the same movies on similar dates

Netflix data:
User ID   Movie           Rating   Date
123       Scary Movie 1   5        2006.07.01
123       Scary Movie 2   4        2006.07.08
123       Scary Movie 3   4        2006.07.16

IMDb data:
IMDB ID   Movie           Rating   Date
456       Scary Movie 1   5        2006.07.01
456       Scary Movie 2   4        2006.07.09
456       Scary Movie 3   4        2006.07.15
Privacy incident: the Netflix case
• In general, 99% of users can be identified with 8 ratings and their dates
• Result: Netflix was sued; the case was settled out of court
Lessons learned
• Now we know that it is risky to publish detailed records of individual data, since
  – quasi-identifiers may reveal identities
  – behavior information may reveal identities, too
• What if we don't release detailed records, but only aggregate information?
• Answer: it could still fail to protect privacy
Agenda
• Basics of Differential Privacy
• Optimization and Use Cases
• Limitations of Differential Privacy
Is It a Privacy Leakage?
Yes and No
Why Yes?
• Every individual contributes to the average height:
  Average Height = (Total Height of Others + My Height) / Singapore Population
Why No?
• Nobody can infer your height from the statistic itself
• Let us consider a special case:
  – Average Height: 50 cm
  – Bob is 40 cm tall, and he happens to know Stuart is 50 cm tall
  – With only three people (Bob, Kevin, Stuart), Bob can now deduce Kevin's height: 3 × 50 − 40 − 50 = 60 cm
What does the story tell us?
• The background knowledge of the adversary matters
  – Adversaries are not innocent!
• The impact of an individual record matters
  – Including Kevin, the average height is 50 cm
  – Excluding Kevin, the average height is 45 cm
Differential Privacy in a Nutshell
• A question-answering interface between the data and humans
  – No direct access to the database
  – The answer is (almost) the same, regardless of the existence of any individual record in the database
  – Even if the adversary knows everybody in the database except Kevin, he cannot infer any information about Kevin by looking at the results
How to enforce Differential Privacy?
• Query: Average Height?
• Step 1 (Calculation): exact answer = 50 cm
• Step 2 (Random number generation): sensitivity = 17 cm (the maximal impact of any individual record), privacy budget = 0.5; sample a noise value, e.g., -3 cm
• Step 3 (Noise injection): randomized answer = 50 - 3 = 47 cm
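A minimal sketch of these three steps in Python (the data, sensitivity, and budget are the toy numbers from the slide; the noise is drawn with `numpy.random.laplace`):

```python
import numpy as np

def noisy_average(values, sensitivity, epsilon):
    """Laplace mechanism for an average query (toy sketch)."""
    exact = np.mean(values)                          # Step 1: calculation
    scale = sensitivity / epsilon                    # noise scale = sensitivity / budget
    noise = np.random.laplace(loc=0.0, scale=scale)  # Step 2: random number generation
    return exact + noise                             # Step 3: noise injection

heights = [40, 60, 50]  # Bob, Kevin, Stuart (toy data; exact average = 50 cm)
print(noisy_average(heights, sensitivity=17, epsilon=0.5))
```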
Privacy Budget: Tradeoff Between Utility and Privacy
• The privacy budget is a positive real number
  – How much privacy are you willing to trade for accuracy of the result?
• Example: sensitivity = 17 cm (the maximal impact of any individual record), privacy budget = 0.5, so the noise scale is 17 / 0.5 = 34 (one sampled noise value: -3 cm)
Random Number Generation
• Dwork et al. [2006]: the Laplace Mechanism
  – Given the sensitivity, generate the noise from a Laplace distribution with
    Scale = Sensitivity / Budget
[Figure: probability density of the Laplace distribution, centered at 0; x-axis from -10 to 10, density from 0 to 0.5]
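For reference, the density plotted here is presumably the Laplace density centered at 0, which for the slide's scale rule reads:

```latex
f(x) = \frac{1}{2b}\exp\!\left(-\frac{|x|}{b}\right),
\qquad b = \frac{\text{Sensitivity}}{\text{Budget}}
```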
Mathematical Interpretation of Laplace Mechanism
• Neighbor databases: databases that differ in exactly one record

Original database:
Name     Height
Bob      40
Kevin    60
Stuart   50

Neighbor (Bob removed):
Name     Height
Kevin    60
Stuart   50

Neighbor (Stuart removed):
Name     Height
Bob      40
Kevin    60

Neighbor (Kevin removed):
Name     Height
Bob      40
Stuart   50
Mathematical Interpretation of Laplace Mechanism
• Privacy Guarantee
  – Query Q on database D
  – Any neighbor database D'
  – Any possible answer R
  – Privacy budget epsilon
  – Under the Laplace mechanism, we have (see the numerical check below):
    Pr[Q(D) = R] ≤ exp(epsilon) · Pr[Q(D') = R]
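A small numerical check of this inequality, assuming Q is the toy average-height query, with exact answers 50 cm on D and 45 cm on the Kevin-less neighbor D' (so the answer changes by at most 5 cm between these neighbors):

```python
import numpy as np

def laplace_pdf(x, mu, b):
    """Density of a Laplace distribution centered at mu with scale b."""
    return np.exp(-abs(x - mu) / b) / (2 * b)

epsilon = 0.5
sensitivity = 5.0            # |Q(D) - Q(D')| for the toy averages 50 vs 45
b = sensitivity / epsilon    # Laplace scale

for R in [40.0, 47.0, 50.0, 60.0]:           # some possible randomized answers
    ratio = laplace_pdf(R, 50.0, b) / laplace_pdf(R, 45.0, b)
    assert ratio <= np.exp(epsilon) + 1e-9   # the epsilon-DP bound holds
    print(f"R={R}: density ratio {ratio:.3f} <= e^epsilon = {np.exp(epsilon):.3f}")
```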
Utility-Privacy Tradeoff of Laplace Mechanism
• Smaller privacy budget epsilon
  – The system behaves more similarly on neighbor databases
  – Higher noise: the noise scale is inversely proportional to the privacy budget
Differential Privacy is Like a Reservoir
• Each query consumes privacy budget
• When the privacy budget is used up, the database cannot be queried any more

[Diagram: the queries "Average Height?" (answer: 47 cm) and "Number of Eyes?" (answer: 5) each drain the privacy-budget reservoir; once it is empty, "Average Weight?" goes unanswered]
Why is Differential Privacy Popular?
• Almost the strongest background-knowledge assumption
• Nice composition property
• Tunable tradeoff between privacy and utility
• High efficiency
  – Sensitivity: pre-calculated
  – Epsilon: specified by the user
  – Noise generation: constant time
Agenda
• Basics of Differential Privacy
• Optimization and Use Cases
  – Histogram Publication
  – Counting Publication
  – Data Synthesis
• Limitations of Differential Privacy
Do we need optimization?
• The Laplace mechanism is universally applicable, but
  – the sensitivity is sometimes too high, e.g., for the median (see the sketch below):

Database D: R1 = 0, R2 = 0, R3 = 100, R4 = 100   Median = 50
Neighbor D' (R4 removed): R1 = 0, R2 = 0, R3 = 100   Median = 0
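The point in a few lines of Python, using the toy tables above (removing one record moves the median from 50 to 0, so Laplace noise calibrated to a sensitivity of 50 would drown the true answer):

```python
import numpy as np

db       = [0, 0, 100, 100]   # R1..R4
neighbor = [0, 0, 100]        # neighbor database: R4 removed
print(np.median(db), np.median(neighbor))   # 50.0 vs 0.0 -> sensitivity is 50
```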
Do we need optimization?
• The Laplace mechanism is universally applicable, but
  – the budget is consumed quickly under multiple queries:

        Query                         Privacy budget
Alice   What's the average height?    0.5
Bob     What's the average height?    0.5
Chris   What's the average height?    0.5
Overview of General Tricks
• Decompose the privacy budget over steps
  – Histogram publication
• Transform the original query into new queries with smaller sensitivity
  – Learning tasks, e.g., classification
  – Group counting queries
• Data transformation/compression
  – Data synthesis
Query Decomposition Trick
• Recall the composition property: if Query 1 is answered with budget epsilon1 and Query 2 with budget epsilon2, the total budget consumed is epsilon1 + epsilon2
Query Decomposition: Histogram Publication
• Histogram publication is widely used in data analysis, to support all sorts of statistical queries
• Once a histogram is constructed, we can answer max, min, median, and range-count queries without additional budget consumption

Name    Age   HIV+
Frank   42    Y
Bob     31    Y
Mary    28    Y
Dave    43    N
…       …     …
Query Decomposition: Histogram Publication
• Xu et al. [12] propose a two-step solution
  – Step 1: find the structure (bin boundaries) of the histogram
  – Step 2: add Laplace noise to the bin counts
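A minimal sketch of the second step only (the structure-finding step of Xu et al. is elided, so the age bins below are an assumption); each person falls into exactly one bin, so each count has sensitivity 1:

```python
import numpy as np

ages = np.array([42, 31, 28, 43])            # toy records from the slide
edges = [20, 30, 40, 50]                     # assumed bin structure from step 1
counts, _ = np.histogram(ages, bins=edges)   # exact count per bin

epsilon = 0.5
noisy_counts = counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
print(noisy_counts)   # published once; range counts etc. are then answered for free
```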
Query Transformation: Group Counting
• Example queries over state counts x_NY, x_NJ, x_CA, x_WA:
  – Q1 = x_NY + x_NJ + x_CA + x_WA
  – Q2 = x_NY + x_NJ
  – Q3 = x_WA
• Workload Matrix W (one row per query):
  1 1 1 1
  1 1 0 0
  0 0 0 1
• Answer = W × data vector (x_NY, x_NJ, x_CA, x_WA)
• Sensitivity = 2 (a record in NY or NJ affects both Q1 and Q2)
Query Transformation: Group Counting
• Find an approximate decomposition of the workload matrix, to reduce sensitivity: W = W' × A

Workload Matrix W:
1 1 1 1
1 1 0 0
0 0 0 1

New Workload W':
1 1 1
1 0 0
0 0 1

Full-Rank Strategy Matrix A:
1 1 0 0
0 0 1 0
0 0 0 1

• Sensitivity of A = 1
Query Transformation: Group Counting
• Generate results by adding noise to the product of the strategy matrix and the data vector:
  Answer = W' × (A × (x_NY, x_NJ, x_CA, x_WA) + Laplace noise)
• Sensitivity = 1, so the noise has a smaller scale
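A numpy sketch of the whole trick with the slide's matrices (W decomposes exactly as W' × A here, and the noise is calibrated to A's sensitivity of 1 rather than W's sensitivity of 2; the counts in x are made up):

```python
import numpy as np

W     = np.array([[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]])  # workload matrix
A     = np.array([[1, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]])  # strategy, sensitivity 1
W_new = np.array([[1, 1, 1], [1, 0, 0], [0, 0, 1]])           # new workload W'
assert (W_new @ A == W).all()                                 # W = W' x A

x = np.array([10, 20, 30, 40])   # toy counts x_NY, x_NJ, x_CA, x_WA
epsilon = 0.5
noisy = A @ x + np.random.laplace(scale=1.0 / epsilon, size=3)  # scale 1/eps, not 2/eps
print(W_new @ noisy)             # noisy answers to Q1, Q2, Q3 (exact: 100, 30, 40)
```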
Data Transformation: Data Synthesis
• Pipeline: Compress -> Noise Insertion -> Coefficient Cutting -> Decompress
• Example transforms: 2D wavelet, compressive sensing
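A toy illustration of that pipeline using a one-level Haar transform as a stand-in for the 2D-wavelet or compressive-sensing transform the slide refers to (an illustration of the idea only, not the published method; the data and noise scale are invented):

```python
import numpy as np

def haar(x):
    """One-level Haar transform: pairwise averages, then pairwise half-differences."""
    p = np.asarray(x, dtype=float).reshape(-1, 2)
    return np.concatenate([(p[:, 0] + p[:, 1]) / 2, (p[:, 0] - p[:, 1]) / 2])

def inverse_haar(c):
    avg, diff = np.split(c, 2)
    return np.stack([avg + diff, avg - diff], axis=1).ravel()

data = np.array([10, 12, 50, 54, 9, 11, 48, 52], dtype=float)
coeffs = haar(data)                                        # compress
coeffs += np.random.laplace(scale=2.0, size=coeffs.shape)  # noise insertion
coeffs[np.abs(coeffs) < 1.0] = 0.0                         # coefficient cutting
print(inverse_haar(coeffs))                                # decompress -> synthetic data
```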
Agenda
• Basics of Differential Privacy
• Optimization and Use Cases
• Limitations of Differential Privacy
Data Collection and Independence Assumption
• Implicit assumption
  – All individuals are independent of each other
  – Counter-examples: HIV and genetic diseases, where one person's record is correlated with others'

Name    Age   HIV+
Frank   42    Y
Bob     31    Y
Mary    28    Y
Dave    43    N
…       …     …
High Sensitivity Computation
• Non-convex optimization
  – E.g., deep learning
Operations on Database
• It is difficult to update the database
  – Can we query the database again after certain attributes are updated?
  – The general answer is no
Conclusion
• Differential privacy is the most robust privacy model known so far
• Differential privacy is practical in certain application domains, e.g., histogram and counting publication
• The applicability of differential privacy remains limited
• It is actually too difficult to understand, even for computer scientists!
Q&A