Anonymization Algorithms - Other techniques, metrics, and extended scenarios
Li Xiong
CS573 Data Privacy and Anonymity
So far
• k-anonymity (protects against identity disclosure)
• Anonymization algorithms: generalization and suppression; microaggregation and clustering
• Privacy principles beyond k-anonymity: l-diversity and t-closeness (protect against attribute disclosure); m-invariance (protects continuous publishing)
Agenda
• Other anonymization technique: Anatomization
• Information metrics
• Extended scenarios
Anonymization methods
• Non-perturbative (do not distort the data): generalization, suppression
• Perturbative (distort the data): microaggregation/clustering, additive noise
• Anatomization and permutation: de-associate the relationship between the QID and the sensitive attribute
tuple ID   Age  Sex  Zipcode  Disease
1 (Bob)    23   M    11000    pneumonia
2          27   M    13000    dyspepsia
3          35   M    59000    dyspepsia
4          59   M    12000    pneumonia
5          61   F    54000    flu
6          65   F    25000    stomach pain
7 (Alice)  65   F    25000    flu
8          70   F    30000    bronchitis
table 1
tuple ID  Age      Sex  Zipcode         Disease
1         [21,60]  M    [10001, 60000]  pneumonia
2         [21,60]  M    [10001, 60000]  dyspepsia
3         [21,60]  M    [10001, 60000]  dyspepsia
4         [21,60]  M    [10001, 60000]  pneumonia
5         [61,70]  F    [10001, 60000]  flu
6         [61,70]  F    [10001, 60000]  stomach pain
7         [61,70]  F    [10001, 60000]  flu
8         [61,70]  F    [10001, 60000]  bronchitis
table 2
Problems with k-anonymity and l-diversity
Query A:
SELECT COUNT(*) FROM Microdata
WHERE Disease = 'pneumonia' AND Age <= 30
  AND Zipcode BETWEEN 10001 AND 20000
Querying generalized table
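• R1 and R2 are the anonymized QID groups of table 2, viewed as rectangles in the Age-Zipcode plane
• Q is the query range
• p = Area(R1 ∩ Q)/Area(R1) = (10*10)/(50*40) = 0.05, with Age in years and Zipcode in thousands
• Estimated answer for A: 2(0.05) = 0.1, because R1 holds 2 pneumonia tuples; the actual answer from table 1 is 1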
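The estimate above can be reproduced with a short sketch. It assumes, as the area ratio does, that tuples are spread uniformly inside each generalized group; the helper `overlap` and the variable names are illustrative and not part of the slides.

```python
def overlap(lo1, hi1, lo2, hi2):
    """Number of integers shared by two closed ranges (0 if disjoint)."""
    return max(0, min(hi1, hi2) - max(lo1, lo2) + 1)

def width(r):
    """Number of integers in a closed range (lo, hi)."""
    return r[1] - r[0] + 1

# QI-groups of table 2: (Age range, Zipcode range, {disease: count})
groups = [
    ((21, 60), (10001, 60000), {"pneumonia": 2, "dyspepsia": 2}),
    ((61, 70), (10001, 60000), {"flu": 2, "stomach pain": 1, "bronchitis": 1}),
]

# Query A: Age <= 30 AND Zipcode in [10001, 20000]
q_age, q_zip = (0, 30), (10001, 20000)

estimate = 0.0
for age, zipcode, diseases in groups:
    # Fraction of the group's rectangle covered by the query range
    p = (overlap(*age, *q_age) * overlap(*zipcode, *q_zip)
         / (width(age) * width(zipcode)))
    estimate += diseases.get("pneumonia", 0) * p

print(estimate)
```

Only the first group overlaps the query range, so the estimate is the 2 pneumonia tuples of that group scaled by p = 0.05.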
Concept of the Anatomy Algorithm
• Release two tables: a quasi-identifier table (QIT) and a sensitive table (ST)
• Use the same QI-groups (which satisfy l-diversity), but in the QIT keep the exact QI values and replace the sensitive attribute values with a Group-ID column
• Then produce a sensitive table with the Disease statistics of each group
tuple ID  Age  Sex  Zipcode  Group-ID
1         23   M    11000    1
2         27   M    13000    1
3         35   M    59000    1
4         59   M    12000    1
5         61   F    54000    2
6         65   F    25000    2
7         65   F    25000    2
8         70   F    30000    2
QIT
Group-ID  Disease       Count
1         dyspepsia     2
1         pneumonia     2
2         bronchitis    1
2         flu           2
2         stomach ache  1
ST
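As a minimal sketch of how the two released tables could be derived, the snippet below starts from the rows of table 1 and an assumed partition into two groups of four tuples. The variable names (`microdata`, `group_of`, `qit`, `st`) are illustrative only.

```python
from collections import Counter

# Rows of table 1: (Age, Sex, Zipcode, Disease)
microdata = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "stomach pain"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]
# Assumed partition: tuples 1-4 form group 1, tuples 5-8 form group 2
group_of = [1, 1, 1, 1, 2, 2, 2, 2]

# QIT: exact QI values plus Group-ID, with the sensitive value dropped
qit = [row[:3] + (g,) for row, g in zip(microdata, group_of)]

# ST: one (Group-ID, Disease, Count) record per distinct disease in a group
counts = Counter((g, row[3]) for row, g in zip(microdata, group_of))
st = sorted((g, disease, c) for (g, disease), c in counts.items())

for record in st:
    print(record)
```

The QIT alone reveals nothing about diseases, and the ST alone reveals only per-group counts; the link between a person and a disease is limited to the shared Group-ID.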
Concept of the Anatomy Algorithm
• Does it satisfy k-anonymity? l-diversity?
• Query results?
tuple ID  Age  Sex  Zipcode  Group-ID
1         23   M    11000    1
2         27   M    13000    1
3         35   M    59000    1
4         59   M    12000    1
5         61   F    54000    2
6         65   F    25000    2
7         65   F    25000    2
8         70   F    30000    2
QIT
Group-ID  Disease       Count
1         dyspepsia     2
1         pneumonia     2
2         bronchitis    1
2         flu           2
2         stomach ache  1
ST
SELECT COUNT(*) FROM Microdata
WHERE Disease = 'pneumonia' AND Age <= 30
  AND Zipcode BETWEEN 10001 AND 20000
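One way to answer this query from the QIT and ST is sketched below, under the assumption that every tuple of a group is equally likely to carry each disease recorded for that group. The data structures are illustrative, and the disease names in group 1 follow table 1.

```python
from collections import Counter

# QIT rows: (Age, Sex, Zipcode, Group-ID)
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1),
    (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2),
    (65, "F", 25000, 2), (70, "F", 30000, 2),
]
# ST as a mapping (Group-ID, Disease) -> Count
st = {
    (1, "dyspepsia"): 2, (1, "pneumonia"): 2,
    (2, "bronchitis"): 1, (2, "flu"): 2, (2, "stomach ache"): 1,
}

group_size = Counter(row[3] for row in qit)

estimate = 0.0
for age, sex, zipcode, g in qit:
    # The QI predicates are checked exactly, since QIT keeps exact values
    if age <= 30 and 10001 <= zipcode <= 20000:
        # Probability that this tuple has pneumonia, given its group
        estimate += st.get((g, "pneumonia"), 0) / group_size[g]

print(estimate)
```

Two tuples of group 1 satisfy the QI predicates, and each has pneumonia with probability 2/4, which gives an estimate of 1.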
Specifications of Anatomy
• T is the microdata table to be published
• T has d QI attributes A^qi_1, A^qi_2, ..., A^qi_d and a sensitive attribute A^s
• Each A^qi_i (1 ≤ i ≤ d) is either numerical or categorical, but A^s can only be categorical because of l-diversity
• For a tuple t in T, t[i] (1 ≤ i ≤ d) is the value of t on A^qi_i, and t[d + 1] is its A^s value
• With the above, t can be regarded as a point in a (d + 1)-dimensional data space DS
Specifications of Anatomy cont.
DEFINITION 1. (Partition/QI-group)
A partition consists of several subsets of T such that each tuple belongs to exactly one subset
The subsets are known as QI-groups and are denoted QI_1, QI_2, ..., QI_m
Specifications of Anatomy cont.
DEFINITION 2. (l-diverse partition)
A partition is l-diverse if every QI-group QI_j (1 ≤ j ≤ m) satisfies
c_j(v)/|QI_j| ≤ 1/l
where v is the most frequent sensitive value in QI_j, c_j(v) is the number of tuples in QI_j with value v, and |QI_j| is the number of tuples in QI_j
Example (table 1, l = 2): c_1(dyspepsia) = c_1(pneumonia) = 2 and c_2(flu) = 2, with |QI_1| = |QI_2| = 4
so the condition holds: 2/4 ≤ 1/2
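The condition of Definition 2 can be checked mechanically. The sketch below represents each QI-group by the list of its sensitive values; the function name `is_l_diverse` is illustrative.

```python
from collections import Counter
from fractions import Fraction

def is_l_diverse(partition, l):
    """True if, in every group, the most frequent value covers at most 1/l of it."""
    for group in partition:
        most_frequent = Counter(group).most_common(1)[0][1]
        # Exact rational comparison avoids floating-point edge cases
        if Fraction(most_frequent, len(group)) > Fraction(1, l):
            return False
    return True

# Sensitive values of the two QI-groups in the running example
partition = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "stomach pain", "flu", "bronchitis"],
]

print(is_l_diverse(partition, 2))
print(is_l_diverse(partition, 3))
```

For l = 2 both groups pass with 2/4 ≤ 1/2; for l = 3 the same ratio exceeds 1/3, so the partition is 2-diverse but not 3-diverse.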
Specifications of Anatomy cont.
DEFINITION 3. (Anatomy)
Given an l-diverse partition, anatomy creates the QIT and ST tables
QIT has the schema:
(A^qi_1, A^qi_2, ..., A^qi_d, Group-ID)
ST has the schema:
(Group-ID, A^s, Count)
Privacy properties
THEOREM 1. Given a pair of QIT and ST, an adversary can infer the sensitive value of any individual with probability at most 1/l
Age  Sex  Zipcode  Group-ID  Disease      Count
23   M    11000    1         dyspepsia    2
23   M    11000    1         pneumonia    2
27   M    13000    1         dyspepsia    2
27   M    13000    1         pneumonia    2
35   M    59000    1         dyspepsia    2
35   M    59000    1         pneumonia    2
59   M    12000    1         dyspepsia    2
59   M    12000    1         pneumonia    2
61   F    54000    2         bronchitis   1
61   F    54000    2         flu          2
61   F    54000    2         stomachache  1
65   F    25000    2         bronchitis   1
65   F    25000    2         flu          2
65   F    25000    2         stomachache  1
65   F    25000    2         bronchitis   1
65   F    25000    2         flu          2
65   F    25000    2         stomachache  1
70   F    30000    2         bronchitis   1
70   F    30000    2         flu          2
70   F    30000    2         stomachache  1
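The joined table above is what an adversary obtains by matching QIT and ST on Group-ID. A minimal sketch of the resulting inference follows, assuming the adversary weights each candidate disease by its Count within the group; the function name `disease_probabilities` is illustrative.

```python
# ST records: (Group-ID, Disease, Count)
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "stomachache", 1),
]

# Group sizes follow from the counts in ST
group_size = {}
for g, _, count in st:
    group_size[g] = group_size.get(g, 0) + count

def disease_probabilities(qit_row):
    """Join one QIT row with ST and turn the counts into probabilities."""
    g = qit_row[3]
    return {disease: count / group_size[g]
            for gid, disease, count in st if gid == g}

bob = (23, "M", 11000, 1)  # Bob's row in the QIT
probs = disease_probabilities(bob)
print(probs)
print(max(probs.values()))
```

With l = 2, the largest probability for any disease in either group is 2/4, which matches the 1/l bound of Theorem 1.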
Comparison with generalization
• Compare with generalization under two assumptions:
A1: the adversary has the QI-values of the target individual
A2: the adversary also knows that the individual is definitely in the microdata
• If A1 and A2 are true, anatomy is as good as generalization: the 1/l bound holds
• If A1 is true and A2 is false, generalization is stronger
• If A1 and A2 are false, generalization is still stronger
Preserving Data Correlation
• Examine the correlation between Age and Disease in T using a probability density function (pdf)
• Example: tuple t1 (Bob)
tuple ID   Age  Sex  Zipcode  Disease
1 (Bob)    23   M    11000    pneumonia
2          27   M    13000    dyspepsia
3          35   M    59000    dyspepsia
4          59   M    12000    pneumonia
5          61   F    54000    flu
6          65   F    25000    stomach pain
7 (Alice)  65   F    25000    flu
8          70   F    30000    bronchitis
table 1
Preserving Data Correlation cont.
• To re-construct an approximate pdf of t1 from the generalization table:
tuple ID  Age      Sex  Zipcode         Disease
1         [21,60]  M    [10001, 60000]  pneumonia
2         [21,60]  M    [10001, 60000]  dyspepsia
3         [21,60]  M    [10001, 60000]  dyspepsia
4         [21,60]  M    [10001, 60000]  pneumonia
5         [61,70]  F    [10001, 60000]  flu
6         [61,70]  F    [10001, 60000]  stomach pain
7         [61,70]  F    [10001, 60000]  flu
8         [61,70]  F    [10001, 60000]  bronchitis
table 2
Preserving Data Correlation cont.
• To re-construct an approximate pdf of t1 from the QIT and ST tables:
tuple ID  Age  Sex  Zipcode  Group-ID
1         23   M    11000    1
2         27   M    13000    1
3         35   M    59000    1
4         59   M    12000    1
5         61   F    54000    2
6         65   F    25000    2
7         65   F    25000    2
8         70   F    30000    2
QIT
Group-ID  Disease       Count
1         dyspepsia     2
1         pneumonia     2
2         bronchitis    1
2         flu           2
2         stomach ache  1
ST
Preserving Data Correlation cont.
• For a more rigorous comparison, calculate the “L2 distance” between the actual pdf of t1 and each re-constructed pdf
The distance for anatomy is 0.5 while the distance for generalization is 22.5
• Anatomy provides for better re-constructions of the probability density functions of all tuples.
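The equation itself is not part of this transcript. A form consistent with the quoted value of 0.5 is the squared L2 distance below; the symbols G_t (actual pdf of tuple t) and G̃_t (re-constructed pdf) are assumptions introduced here for illustration.

```latex
\mathrm{Err}(t) = \int_{x \in DS} \bigl(\tilde{G}_t(x) - G_t(x)\bigr)^2 \, dx

% Anatomy, tuple t_1 (Age 23, pneumonia): the re-constructed pdf puts 1/2 on
% (23, dyspepsia) and 1/2 on (23, pneumonia); the actual pdf puts 1 on (23, pneumonia).
\mathrm{Err}(t_1) = \bigl(\tfrac{1}{2} - 0\bigr)^2 + \bigl(\tfrac{1}{2} - 1\bigr)^2 = 0.5
```

Under generalization the re-constructed pdf is spread over the whole Age interval [21, 60], which is why its distance from the actual pdf is much larger.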
Preserving Data Correlation cont.
• Measure the error of each tuple's re-constructed pdf by its L2 distance from the actual pdf
• Objective: obtain a QIT and ST that minimize the re-construction error (RCE), the sum of these errors over all tuples t in T
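A minimal sketch of the RCE for the running example follows. It assumes that the per-tuple error is the squared L2 distance between the actual and re-constructed distributions, and that under anatomy the error arises only on the sensitive attribute because QI values are published exactly. The function name `tuple_error` is illustrative.

```python
from collections import Counter

def tuple_error(own_value, group_values):
    """Squared L2 distance between a tuple's actual and re-constructed distributions."""
    size = len(group_values)
    counts = Counter(group_values)
    # Re-constructed probability of v is count/size; actual is 1 for own_value, else 0
    return sum((c / size - (1.0 if v == own_value else 0.0)) ** 2
               for v, c in counts.items())

# Sensitive values of the two QI-groups in the running example
partition = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "stomach pain", "flu", "bronchitis"],
]

rce = sum(tuple_error(v, group) for group in partition for v in group)

print(tuple_error("pneumonia", partition[0]))
print(rce)
```

Each tuple of the first group contributes 0.5, the value quoted for t1; the second group contributes 0.375 for each flu tuple and 0.875 for each of the other two, for a total of 4.5.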
Nearly-Optimal Anatomizing Algorithm
• The authors propose an efficient algorithm for anatomizing tables that minimizes the RCE
• The resulting QIT and ST achieve an RCE that deviates from the lower bound only by a factor of less than 1 + 1/n, where n is the size of T
• This algorithm has linear I/O complexity O(n/b) where b is the page size
Nearly-Optimal Anatomizing Algorithm cont.
PROPERTY 1. At the end of the group-creation phase, each non-empty bucket has only one tuple.
PROPERTY 2. The set S' always includes at least one QI-group.
PROPERTY 3. After the residue-assignment phase, each QI-group has at least l tuples with distinct sensitive attribute values
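The slides name the two phases but do not list the steps. The sketch below is one plausible reading, not the authors' exact procedure: it assumes tuples are first bucketed by sensitive value, that each new QI-group takes one tuple from each of the l largest buckets, and that leftover tuples go to the first group lacking their sensitive value. The function name `anatomize` is illustrative.

```python
from collections import defaultdict

def anatomize(rows, sensitive_index, l):
    """Partition rows into QI-groups with at least l distinct sensitive values each."""
    # Assumed bucketing step: one bucket per sensitive value
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[sensitive_index]].append(row)

    groups = []
    # Group-creation phase: while at least l buckets are non-empty,
    # take one tuple from each of the l currently largest buckets
    while sum(1 for b in buckets.values() if b) >= l:
        largest = sorted((v for v in buckets if buckets[v]),
                         key=lambda v: len(buckets[v]), reverse=True)[:l]
        groups.append([buckets[v].pop() for v in largest])

    # Residue-assignment phase: put each leftover tuple into a group
    # that does not yet contain its sensitive value
    for value, bucket in buckets.items():
        for row in bucket:
            target = next(g for g in groups
                          if all(r[sensitive_index] != value for r in g))
            target.append(row)
    return groups

rows = [
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"), (65, "F", 25000, "stomach pain"),
    (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis"),
]
groups = anatomize(rows, sensitive_index=3, l=2)
for g in groups:
    print([r[3] for r in g])
```

Drawing from the largest buckets first keeps the bucket sizes balanced, so few tuples are left for the residue-assignment phase. The sketch raises an error if no eligible group exists for a leftover tuple, which can happen when one sensitive value is too frequent for the chosen l.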
Experiments
• The CENSUS dataset contains the personal information of 500k American adults, with 9 discrete attributes
• Two sets of microdata tables were created
Set 1: 5 tables denoted OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute As
Set 2: 5 tables denoted SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute As
Conclusion
• Anatomy was designed to overcome the information loss of generalization while still preserving privacy
• Anatomy has a significantly lower error rate than generalization
• Several items require further research
- Multiple sensitive attributes
- Effective mining of patterns in microdata
Agenda
• Other anonymization technique: Anatomization
• Information metrics
• Extended scenarios
Information Metrics
• General purpose metrics
• Special purpose metrics
• Trade-off metrics
General Purpose Metrics
General idea: measure “similarity” between the original data and the anonymized data
• Minimal distortion metric (Samarati 2001; Sweeney 2002; Wang and Fung 2006): charge a penalty to each instance of a value generalized or suppressed (independently of other records)
• ILoss (Xiao and Tao 2006): charge a penalty when a specific value is generalized
General Purpose Metrics cont.
• Discernibility Metric (DM) (K-OPTIMIZE, Mondrian, l-diversity, …): charge a penalty to each record for being indistinguishable from other records
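As an illustration, the sketch below assumes the commonly used form of DM, in which each record is charged the size of its QI-group, so a group of size s contributes s*s. This formula is not stated on the slide, and the function name `discernibility` is illustrative.

```python
from collections import Counter

def discernibility(generalized_rows):
    """Sum over QI-groups of the squared group size."""
    # Records with identical generalized QI values are indistinguishable
    group_sizes = Counter(generalized_rows)
    return sum(size * size for size in group_sizes.values())

# Generalized QI values of table 2: (Age, Sex, Zipcode)
table2 = (
    [("[21,60]", "M", "[10001,60000]")] * 4
    + [("[61,70]", "F", "[10001,60000]")] * 4
)

print(discernibility(table2))
```

Table 2 has two groups of four records, giving 16 + 16 = 32; an ungeneralized table with all distinct records would score 8.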
Special Purpose Metrics
• Classification: classification metric (CM) (Iyengar 2002): charge a penalty for each record suppressed or generalized to a group in which the record’s class is not the majority class
• Query: query error (count queries), query imprecision (overlapped range)
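The classification metric described above can be sketched as a simple count. The records below are hypothetical, the penalty is left unnormalized, and the majority class is computed over unsuppressed records only; all three choices are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def classification_metric(records):
    """Count records that are suppressed or outside their group's majority class."""
    # records: (group id, class label, suppressed flag)
    labels = defaultdict(Counter)
    for group, label, suppressed in records:
        if not suppressed:
            labels[group][label] += 1
    majority = {g: c.most_common(1)[0][0] for g, c in labels.items()}

    penalty = 0
    for group, label, suppressed in records:
        if suppressed or label != majority[group]:
            penalty += 1
    return penalty

# Hypothetical anonymized records
records = [
    (1, "yes", False), (1, "yes", False), (1, "no", False),
    (2, "no", False), (2, "no", False), (2, "yes", True),
]

print(classification_metric(records))
```

Here one record in group 1 disagrees with the majority class and one record in group 2 is suppressed, so the penalty is 2.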
Extended Scenarios
• Multiple release publishing
• Continuous release publishing
• Collaborative/distributed publishing
Other types of data
• High-dimensional transaction data: market basket, web queries
• Moving objects data: location-based services
• Textual data