PrivBayes: Private Data Release via Bayesian Networks
Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao
Overview: The Problem: Private Data Release (Differential Privacy, Challenges); The Algorithm: PrivBayes (Bayesian Network, Details of PrivBayes)
![Page 1: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/1.jpg)
PrivBayes: Private Data Release via Bayesian Networks
Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao
![Page 2: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/2.jpg)
Overview
The Problem: Private Data Release
◦ Differential Privacy
◦ Challenges
The Algorithm: PrivBayes
◦ Bayesian Network
◦ Details of PrivBayes
Function $F$: Linear vs. Logarithmic
Experiments
![Page 3: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/3.jpg)
![Page 4: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/4.jpg)
Data Release
[Diagram: a sensitive database is released to a company, an institute, the public, and an adversary]
![Page 5: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/5.jpg)
Private Data Release
[Diagram: sensitive database -> synthetic database with similar properties; company: accurate inference; adversary]
How can we design such a private data release algorithm?
![Page 6: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/6.jpg)
![Page 7: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/7.jpg)
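Differential Privacy [TCC’06]
Definition of $\varepsilon$-Differential Privacy
◦ A randomized data release algorithm $A$ satisfies $\varepsilon$-differential privacy if, for any two neighboring datasets $D$ and $D'$ and for any possible synthetic data $D^*$,
$$\Pr[A(D) = D^*] \le e^{\varepsilon} \cdot \Pr[A(D') = D^*].$$
Example of two neighboring datasets, which differ only in Frank's record:

| Name | Has cancer? ($D$) | Has cancer? ($D'$) |
|---|---|---|
| Alice | Yes | Yes |
| Bob | No | No |
| Chris | Yes | Yes |
| Denise | Yes | Yes |
| Eric | No | No |
| Frank | Yes | No |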
![Page 8: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/8.jpg)
Differential Privacy [TCC’06]
A general approach to achieving differential privacy is to inject Laplace noise into the output, so that the noise covers the impact of any single individual.
More details are in the Preliminaries section of the paper.
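As a rough illustration of the Laplace mechanism mentioned on this slide, the sketch below releases a noisy count. The helper names `laplace_noise` and `noisy_count` are hypothetical, the patient list mirrors the toy table from the previous slide, and the sensitivity of 1 assumes that neighboring datasets differ in a single record.

```python
import random


def laplace_noise(scale, rng):
    """Zero-mean Laplace sample: the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    true_count = sum(1 for record in records if predicate(record))
    sensitivity = 1.0  # assumption: changing one record moves a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Toy data in the spirit of the "Has cancer?" table.
patients = [("Alice", True), ("Bob", False), ("Chris", True),
            ("Denise", True), ("Eric", False), ("Frank", True)]
rng = random.Random(0)
print(noisy_count(patients, lambda p: p[1], epsilon=0.5, rng=rng))
```

A smaller `epsilon` gives a larger noise scale, which is the privacy/accuracy trade-off the rest of the talk works around.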
![Page 9: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/9.jpg)
Our Target
Design a data release algorithm with a differential privacy guarantee.
![Page 10: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/10.jpg)
![Page 11: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/11.jpg)
Challenges of Private Data Release
To build a synthetic database, we need to understand the tuple distribution of the sensitive data.
[Diagram: sensitive database -> (convert) -> full-dim tuple distribution -> (+ noise) -> noisy distribution -> (sample) -> synthetic database]
![Page 12: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/12.jpg)
Challenges of Private Data Release
Example: the database has 10M tuples, 10 attributes (dimensions), and 20 values per attribute:
Scalability: the full distribution has $20^{10} \approx 10^{13}$ cells
◦ most of them have non-zero counts after noise injection
◦ privacy is expensive (computation, storage)
Signal-to-noise: the avg. information in each cell is $10^{7} / 10^{13} = 10^{-6}$, far below the avg. noise, which is on the order of $1/\varepsilon$ per cell
Previous solutions suffer from either the scalability or the signal-to-noise problem
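The arithmetic behind this example can be checked directly. The sketch below uses the slide's database dimensions; the value of `epsilon` and the sensitivity-1 noise scale are assumptions chosen only to make the comparison concrete.

```python
n_tuples = 10_000_000          # 10M tuples (from the slide)
n_attributes = 10
values_per_attribute = 20

# Size of the full-dimensional histogram.
cells = values_per_attribute ** n_attributes
# Average number of tuples that land in one cell.
signal_per_cell = n_tuples / cells

epsilon = 0.1                  # assumed privacy budget, for illustration only
noise_scale = 1.0 / epsilon    # Laplace scale for a sensitivity-1 count (assumption)

print(f"cells           = {cells:.3e}")
print(f"signal per cell = {signal_per_cell:.3e}")
print(f"noise scale     = {noise_scale:.1f}")
```

Under these assumptions the noise exceeds the average signal by several orders of magnitude, which is the signal-to-noise problem stated above.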
![Page 13: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/13.jpg)
![Page 14: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/14.jpg)
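PrivBayes: Dimension Reduction
[Diagram: the direct route is sensitive database -> (convert) -> full-dim tuple distribution -> (+ noise) -> noisy distribution -> (sample) -> synthetic database. PrivBayes instead goes sensitive database -> (convert) -> a set of low-dim distributions -> (+ noise) -> noisy low-dim distributions -> (sample) -> synthetic database, where the low-dim distributions approximate the full-dim tuple distribution]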
![Page 15: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/15.jpg)
PrivBayes: Dimension Reduction
The advantages of using low-dimensional distributions
◦ easy to compute
◦ small domain -> high signal density -> robust against noise
But how do we find a set of low-dim distributions that provides a good approximation to the full distribution?
![Page 16: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/16.jpg)
![Page 17: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/17.jpg)
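Bayesian Network
A 5-dimensional database with attributes age, workclass, education, title, income
[Diagram: Bayesian network over the five attributes, with one conditional distribution per node]
$\Pr[\text{age}]$, $\Pr[\text{work} \mid \text{age}]$, $\Pr[\text{edu} \mid \text{age}]$, $\Pr[\text{title} \mid \text{work}]$, $\Pr[\text{income} \mid \text{work}]$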
![Page 18: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/18.jpg)
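Bayesian Network
A 5-dimensional database with attributes age, workclass, education, title, income
$$\Pr[*] \approx \Pr[\text{age}] \cdot \Pr[\text{work} \mid \text{age}] \cdot \Pr[\text{edu} \mid \text{age}] \cdot \Pr[\text{title} \mid \text{work}] \cdot \Pr[\text{income} \mid \text{work}]$$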
![Page 19: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/19.jpg)
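Bayesian Network
[Diagram: network over age, workclass, education, title, income in which each attribute has up to two parents]
$$\Pr[*] \approx \Pr[\text{age}] \cdot \Pr[\text{edu} \mid \text{age}] \cdot \Pr[\text{work} \mid \text{age}, \text{edu}] \cdot \Pr[\text{title} \mid \text{edu}, \text{work}] \cdot \Pr[\text{income} \mid \text{work}, \text{title}]$$
The quality of the Bayesian network determines the quality of the approximation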
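To make the factorization concrete, here is a minimal sketch that evaluates $\Pr[*]$ as a product of conditionals for the one-parent network (age -> work, age -> edu, work -> title, work -> income). All attributes are treated as binary and every number in `CPT` is made up for illustration; neither comes from the talk.

```python
from itertools import product

# PARENTS[attr] lists the parents of attr in the network.
PARENTS = {
    "age": [],
    "work": ["age"],
    "edu": ["age"],
    "title": ["work"],
    "income": ["work"],
}

# CPT[attr][parent_values] = probability that attr == 1 (hypothetical numbers).
CPT = {
    "age": {(): 0.4},
    "work": {(0,): 0.3, (1,): 0.8},
    "edu": {(0,): 0.5, (1,): 0.6},
    "title": {(0,): 0.1, (1,): 0.7},
    "income": {(0,): 0.2, (1,): 0.9},
}


def joint_probability(row):
    """Pr[row] as the product of the conditionals implied by the network."""
    p = 1.0
    for attr, parents in PARENTS.items():
        p_one = CPT[attr][tuple(row[q] for q in parents)]
        p *= p_one if row[attr] == 1 else 1.0 - p_one
    return p


attrs = list(PARENTS)
total = sum(joint_probability(dict(zip(attrs, values)))
            for values in product([0, 1], repeat=len(attrs)))
print(total)  # the factorized probabilities of all 32 tuples sum to 1 (up to rounding)
```

In this toy setting the five conditionals need 9 numbers, whereas the full joint over 32 tuples needs 31, which is the dimension reduction the network buys.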
![Page 20: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/20.jpg)
![Page 21: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/21.jpg)
Outline of the Algorithm
STEP 1: Choose a suitable Bayesian network $N$
◦ must be done in a differentially private way
STEP 2: Compute the conditional distributions implied by $N$
◦ straightforward to do under differential privacy
◦ inject noise: Laplace mechanism
STEP 3: Generate synthetic data by sampling from $N$
◦ post-processing: no privacy issues
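Steps 2 and 3 can be sketched for a single edge X -> Y. This is only an illustration under stated assumptions: the function names are hypothetical, the whole budget `epsilon` is spent on one joint distribution, and the noise scale 2 / (n * epsilon) assumes that neighboring databases differ by replacing one tuple. It is not the paper's exact budget allocation.

```python
import random
from collections import Counter


def laplace(scale, rng):
    """Zero-mean Laplace sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_joint(rows, domain_x, domain_y, epsilon, rng):
    """Step 2: Laplace-perturbed joint distribution, clamped at 0 and normalized."""
    counts = Counter(rows)
    n = len(rows)
    scale = 2.0 / (n * epsilon)  # replacing one tuple moves the joint by <= 2/n in L1
    noisy = {(x, y): max(counts[(x, y)] / n + laplace(scale, rng), 0.0)
             for x in domain_x for y in domain_y}
    total = sum(noisy.values())
    if total == 0.0:  # degenerate case: fall back to the uniform distribution
        return {cell: 1.0 / len(noisy) for cell in noisy}
    return {cell: p / total for cell, p in noisy.items()}


def sample_synthetic(joint, domain_x, domain_y, size, rng):
    """Step 3: draw x from Pr[x], then y from Pr[y | x]; touches only noisy values."""
    p_x = [sum(joint[(x, y)] for y in domain_y) for x in domain_x]
    synthetic = []
    for _ in range(size):
        x = rng.choices(domain_x, weights=p_x)[0]
        # Pr[y | x] is proportional to the joint restricted to this x.
        y = rng.choices(domain_y, weights=[joint[(x, v)] for v in domain_y])[0]
        synthetic.append((x, y))
    return synthetic


rng = random.Random(1)
rows = [(0, 0)] * 50 + [(1, 1)] * 30 + [(1, 0)] * 20
joint = noisy_joint(rows, [0, 1], [0, 1], epsilon=1.0, rng=rng)
print(sample_synthetic(joint, [0, 1], [0, 1], size=5, rng=rng))
```

Because sampling reads only the noisy distribution and never the original rows, it is post-processing and costs no additional privacy budget.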
![Page 22: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/22.jpg)
Optimal Bayesian Network
Finding the optimal 1-degree Bayesian network was solved in [Chow-Liu’68]. It is a DAG of maximum in-degree 1 that maximizes the sum of the mutual information of its edges,
$$\sum_{(X,Y):\,\text{edge}} I(X, Y),$$
where
$$I(X, Y) = \sum_{y \in Y} \sum_{x \in X} \Pr[x, y] \log \frac{\Pr[x, y]}{\Pr[x]\,\Pr[y]}.$$
![Page 23: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/23.jpg)
Optimal Bayesian Network
The optimal 1-degree Bayesian network [Chow-Liu’68] is obtained by finding the maximum spanning tree, where the weight of edge $(X, Y)$ is the mutual information $I(X, Y)$.
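The edge weight can be computed directly from a joint distribution. The sketch below is a plain transcription of the mutual information formula; the function name is hypothetical and the use of base-2 logarithms (so that two perfectly correlated binary attributes score 1) is an assumption.

```python
import math


def mutual_information(joint):
    """I(X, Y) in bits for a joint distribution given as {(x, y): probability}."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    # Cells with zero probability contribute nothing to the sum.
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0.0)


perfectly_correlated = {(0, 0): 0.5, (1, 1): 0.5}
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
print(mutual_information(perfectly_correlated), mutual_information(independent))
```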
![Page 24: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/24.jpg)
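Build a Bayesian Network
Build a 1-degree BN for the following database:

| Name | A | B | C | D |
|---|---|---|---|---|
| Alan | 0 | 0 | 0 | 0 |
| Bob | 0 | 0 | 0 | 0 |
| Cykie | 1 | 1 | 1 | 0 |
| David | 0 | 0 | 0 | 0 |
| Eric | 1 | 1 | 0 | 0 |
| Frank | 1 | 1 | 0 | 0 |
| George | 0 | 0 | 0 | 0 |
| Helen | 1 | 1 | 1 | 0 |
| Ivan | 0 | 0 | 0 | 0 |
| Jack | 1 | 1 | 0 | 0 |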
![Page 25: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/25.jpg)
Build a Bayesian Network
Start from a random attribute
[Graph: nodes A, B, C, D]
![Page 26: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/26.jpg)
Build a Bayesian Network
Select next tree edge by its mutual information
[Graph: nodes A, B, C, D; the candidate edges are annotated with the joint distributions of their endpoints, with cell values (0.5, 0.5), (0.5, 0.2, 0.3), and (0.5, 0.5)]
![Page 27: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/27.jpg)
![Page 28: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/28.jpg)
![Page 29: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/29.jpg)
![Page 30: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/30.jpg)
Build a Bayesian Network
Select next tree edge by its mutual information
[Graph: nodes A, B, C, D]
DONE!
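The walkthrough above amounts to a Prim-style maximum spanning tree. The non-private sketch below runs it on the example database as transcribed earlier in this talk; the column names A to D, the starting attribute, and the base-2 logarithm are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

# Rows of the example database as transcribed; column names A-D are assumed.
ROWS = [
    (0, 0, 0, 0), (0, 0, 0, 0), (1, 1, 1, 0), (0, 0, 0, 0), (1, 1, 0, 0),
    (1, 1, 0, 0), (0, 0, 0, 0), (1, 1, 1, 0), (0, 0, 0, 0), (1, 1, 0, 0),
]
NAMES = ["A", "B", "C", "D"]


def mutual_information(i, j):
    """Empirical mutual information (in bits) between columns i and j."""
    n = len(ROWS)
    joint = Counter((row[i], row[j]) for row in ROWS)
    left = Counter(row[i] for row in ROWS)
    right = Counter(row[j] for row in ROWS)
    return sum((c / n) * math.log2(c * n / (left[x] * right[y]))
               for (x, y), c in joint.items())


def chow_liu_tree(start=0):
    """Grow a tree from `start`, always adding the candidate edge with the largest weight."""
    in_tree = [start]
    edges = []
    while len(in_tree) < len(NAMES):
        candidates = [(mutual_information(i, j), i, j)
                      for i in in_tree
                      for j in range(len(NAMES)) if j not in in_tree]
        weight, i, j = max(candidates)
        edges.append((NAMES[i], NAMES[j], weight))  # edge directed parent -> child
        in_tree.append(j)
    return edges


for parent, child, weight in chow_liu_tree():
    print(f"{parent} -> {child}: I = {weight:.3f}")
```

Each added node gets exactly one parent, so the result is a DAG of in-degree at most 1. The private version replaces the `max` with a randomized selection.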
![Page 31: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/31.jpg)
$k$-degree Bayesian Network
It is NP-hard to train the optimal $k$-degree Bayesian network when $k > 1$ [JMLR’04].
Most approximation algorithms are too complicated to be converted into private algorithms.
In our paper, we find a way to extend the Chow-Liu solution (1-degree) to higher-degree cases.
In this talk, we focus on 1-degree cases for simplicity.
![Page 32: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/32.jpg)
Private Bayesian Network
Do it under Differential Privacy!
◦ (Non-private) select the edge with maximum $I$
◦ (Private) $I$ is data-sensitive -> the best edge is also data-sensitive
Solution: randomized edge selection!
![Page 33: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/33.jpg)
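Exponential Mechanism [FOCS’07]
[Diagram: databases $D$ mapped to edges $e$]
Define $q(D, e)$: how good edge $e$ is as the result of the selection, given database $D$
Return $e$ with probability
$$\Pr[e] \propto \exp\left( \frac{\varepsilon}{2} \cdot \frac{q(D, e)}{\Delta(q)} \right), \quad \text{where } \Delta(q) = \max_{D, D', e} \left\| q(D, e) - q(D', e) \right\|_1 .$$
In the exponent, the numerator $q(D, e)$ carries the info and the denominator $\Delta(q)$ is the noise scale.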
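A minimal sketch of this selection rule follows. The function name, the candidate edge labels, and the score and sensitivity values in the usage lines are hypothetical and serve only to show the sampling step.

```python
import math
import random


def exponential_mechanism(scores, sensitivity, epsilon, rng):
    """Sample a key of `scores` with probability proportional to exp(eps/2 * q / sensitivity)."""
    candidates = list(scores)
    best = max(scores.values())
    # Subtracting the best score leaves the distribution unchanged and avoids overflow.
    weights = [math.exp(epsilon * (scores[c] - best) / (2.0 * sensitivity))
               for c in candidates]
    total = sum(weights)
    probabilities = {c: w / total for c, w in zip(candidates, weights)}
    choice = rng.choices(candidates, weights=weights, k=1)[0]
    return choice, probabilities


# Hypothetical candidate edges with made-up quality scores.
scores = {"A-B": 1.0, "A-C": 0.4, "A-D": 0.0}
choice, probabilities = exponential_mechanism(scores, sensitivity=0.1, epsilon=1.0,
                                              rng=random.Random(0))
print(choice, probabilities)
```

The larger the sensitivity relative to the spread of the scores, the flatter the distribution becomes and the less the selection depends on the data.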
![Page 34: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/34.jpg)
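Private Bayesian Network
Do it under Differential Privacy!
Select edges with the exponential mechanism
◦ define $q(\text{edge}) = I(\text{edge})$
◦ we prove $\Delta(I) = \Theta(\log n / n)$, where $n = |D|$ (Lemma 1)
$$\Pr[e] \propto \exp\left( \frac{\varepsilon}{2} \cdot \frac{I(e)}{\log n / n} \right)$$
(numerator: info; denominator: noise)
Problem solved? NO
Sensitivity (noise scale) is too large for $I$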
![Page 35: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/35.jpg)
![Page 36: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/36.jpg)
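Basic Facts

| Function | Range (scale of info) | Sensitivity (scale of noise) |
|---|---|---|
| $I$ | $\Theta(1)$ | $\Theta(\log n / n)$ |
| $F$ | $\Theta(1)$ | $\Theta(1 / n)$ |

$F$ and $I$ have a strong positive correlation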
![Page 37: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/37.jpg)
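Function $F$
IDEA: define the score $F$ to agree with $I$ at its maximum values and interpolate linearly in between
$\Pi$: the "optimal" distributions over $X, Y$ that maximize $I(X, Y)$; how far is $\Pr[x, y]$ from them?
$$F = -\frac{1}{2} \min_{\Pi:\,\text{optimal}} \left\| \Pr[x, y] - \Pi \right\|_1$$
Range of $F$: $\Theta(1)$
Sensitivity of $F$: $\Theta(1/n)$, since replacing one tuple moves $\Pr[x, y]$ by at most $2/n$ in $L_1$ distance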
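To get a feel for why the sensitivity matters, the sketch below compares the two noise scales at the size of the Adult dataset (45,222 tuples, from the Dataset slide). It is an order-of-magnitude comparison only: the constants hidden by the asymptotic notation are taken as 1, the logarithm base 2 is an assumption, and the $1/n$ figure for $F$ rests on the observation that replacing one tuple moves the empirical joint distribution by at most $2/n$ in $L_1$ distance.

```python
import math

n = 45_222  # number of tuples in the Adult dataset

# Orders of magnitude only: hidden constants are assumed to be 1.
sensitivity_i = math.log2(n) / n   # logarithmic score I
sensitivity_f = 1.0 / n            # linear score F

ratio = sensitivity_i / sensitivity_f
print(f"sensitivity of I ~ {sensitivity_i:.2e}")
print(f"sensitivity of F ~ {sensitivity_f:.2e}")
print(f"I is noisier by a factor of about {ratio:.1f}")
```

Since both scores have a range of constant size, the score with the smaller sensitivity leaves more usable signal in the exponential mechanism.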
![Page 38: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/38.jpg)
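Function $F$
$$F = -\frac{1}{2} \min_{\Pi:\,\text{optimal}} \left\| \Pr[x, y] - \Pi \right\|_1$$
[Figure: a joint distribution with cells 0.5, 0.2, 0.3 ($I = 0.4$) lies at $L_1$ distance 1.6 and 0.4 from the two optimal distributions with cells 0.5, 0.5 ($I = 1$ each), so $F = -\frac{1}{2} \cdot 0.4 = -0.2$]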
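The numbers on this slide can be reproduced with a short sketch. It assumes two attributes with equal domain sizes, in which case the optimal distributions are the permutation-like tables that put mass $1/k$ on one cell per row. The placement of the cell values 0.5, 0.2, 0.3 in `example` is an assumption chosen to match the slide's $I = 0.4$ and $F = -0.2$, and the function names are hypothetical.

```python
import math
from itertools import permutations


def mutual_information(joint):
    """I(X, Y) in bits for a joint table given as a list of rows."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for x, row in enumerate(joint)
               for y, p in enumerate(row) if p > 0.0)


def score_f(joint):
    """F = -1/2 * min over optimal Pi of ||joint - Pi||_1 for a square k x k table."""
    k = len(joint)
    best = math.inf
    for perm in permutations(range(k)):
        # Optimal Pi for equal domain sizes: mass 1/k on cell (x, perm[x]), zero elsewhere.
        distance = sum(abs(joint[x][y] - (1.0 / k if perm[x] == y else 0.0))
                       for x in range(k) for y in range(k))
        best = min(best, distance)
    return -0.5 * best


example = [[0.5, 0.2], [0.0, 0.3]]   # assumed arrangement of the slide's cell values
optimal = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_information(example), score_f(example))
print(mutual_information(optimal), score_f(optimal))
```

Both scores peak on the same distributions, but $F$ falls off linearly with the $L_1$ distance rather than logarithmically.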
![Page 39: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/39.jpg)
$F$ vs. $I$
[Scatter plot: $F$ against $I$ for random distributions, annotated with the correlation coefficient]
![Page 40: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/40.jpg)
![Page 41: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/41.jpg)
$F$ vs. $I$
[Plot: Adult dataset]
![Page 42: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/42.jpg)
Dataset
We use four datasets in our experiments
◦ Adult, NLTCS, TPC-E, BR2000
Adult dataset
◦ census data of 45,222 individuals
◦ 15 attributes: age, workclass, education, marital status, etc.
◦ tuple domain size (full-dimensional): about
![Page 43: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/43.jpg)
Counting Queries
[Plots: two panels, each for a query set of all marginals of a fixed order]
![Page 44: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/44.jpg)
Multiple SVMs
Query: build 4 classifiers
[Plots: panels "Adult, gender" and "Adult, education"]
![Page 45: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/45.jpg)
![Page 46: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/46.jpg)
Concluding Remarks
Differential privacy can be applied effectively for data release
Key ideas of the solution:
◦ Bayesian networks for dimension reduction
◦ carefully designed linear quality function for the exponential mechanism
Many open problems remain:
◦ extend to other forms of data: graph data, mobility data
◦ obtain alternate (workable) privacy definitions
Thanks!
![Page 47: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/47.jpg)
Appendix
![Page 48: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/48.jpg)
Previous Work
Privacy, accuracy, and consistency too: a holistic solution to contingency table release [PODS’07]
◦ incurs an exponential running time
◦ only optimized for low-dimensional marginals
Differentially private publication of sparse data [ICDT’12]
◦ achieves scalability, but does not help with the signal-to-noise problem
Differentially private spatial decompositions [ICDE’12]
◦ coarsens the histogram H to control the number of cells
◦ has some limits, e.g., range queries, ordinal domain
![Page 49: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/49.jpg)
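$\Pi$: Optimal Distributions
Assume that $|dom(X)| \le |dom(Y)|$. A distribution $\Pr[X, Y]$ maximizes the mutual information between $X$ and $Y$ if and only if
◦ $\Pr[x] = 1 / |dom(X)|$, for any $x \in dom(X)$;
◦ for each $y \in dom(Y)$, there is at most one $x \in dom(X)$ with $\Pr[x, y] > 0$.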
![Page 50: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/50.jpg)
Analogy: Logarithmic vs. Linear
◦ two score functions of a real argument, one logarithmic and one linear
◦ evaluated on neighboring databases
◦ sensitivity (noise): governed by the maximum of the derivative of each function
![Page 51: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/51.jpg)
Interactive Model
[Diagram: a user sends a query to a differentially private algorithm that runs over the database under a privacy budget, and receives a noisy answer]
1. The risk of a privacy breach accumulates as more queries are answered
2. It requires a specific DP algorithm for every particular query
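The first drawback can be pictured with a small budget tracker. This is a hypothetical sketch that assumes simple sequential composition, where the epsilons of successive queries add up; the class name and the numbers are illustrative only.

```python
class PrivacyBudget:
    """Tracks a total epsilon under sequential composition (per-query budgets add up)."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        """Charge one query; refuse it once the total budget would be exceeded."""
        if self.spent + epsilon > self.total + 1e-12:
            raise ValueError("privacy budget exhausted")
        self.spent += epsilon
        return self.total - self.spent


budget = PrivacyBudget(1.0)
print(budget.spend(0.4))  # remaining budget after the first query
print(budget.spend(0.4))  # remaining budget after the second query
```

Once the budget is used up, no further query can be answered, which is the reusability gap that a one-off synthetic data release avoids.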
![Page 52: PrivBayes: Private Data Release via Bayesian Networks](https://reader034.vdocuments.site/reader034/viewer/2022051402/56815ef7550346895dcdb55f/html5/thumbnails/52.jpg)
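Non-interactive Model: Data Release
[Diagram: a private data release algorithm spends the privacy budget to produce synthetic data; queries are then answered from the synthetic data, yielding noisy answers]
Reusability: only accesses the sensitive data once
Generality: supports most queries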