Anonymization Algorithms - Other techniques, metrics, and extended scenarios

Li Xiong

CS573 Data Privacy and Anonymity

So far

• k-anonymity (protects against identity disclosure)
• Anonymization algorithms
  - Generalization and suppression
  - Microaggregation and clustering
• Privacy principles beyond k-anonymity
  - l-diversity, t-closeness (protect against attribute disclosure)
  - m-invariance (protects continuous publishing)

Agenda

• Other anonymization technique
  - Anatomization
• Information metrics
• Extended scenarios

Anonymization methods

• Non-perturbative: don't distort the data
  - Generalization
  - Suppression
• Perturbative: distort the data
  - Microaggregation/clustering
  - Additive noise
• Anatomization and permutation
  - De-associate the relationship between the QID and the sensitive attribute

tuple ID    Age  Sex  Zipcode  Disease
1 (Bob)     23   M    11000    pneumonia
2           27   M    13000    dyspepsia
3           35   M    59000    dyspepsia
4           59   M    12000    pneumonia
5           61   F    54000    flu
6           65   F    25000    stomach pain
7 (Alice)   65   F    25000    flu
8           70   F    30000    bronchitis

table 1

tuple ID  Age      Sex  Zipcode         Disease
1         [21,60]  M    [10001, 60000]  pneumonia
2         [21,60]  M    [10001, 60000]  dyspepsia
3         [21,60]  M    [10001, 60000]  dyspepsia
4         [21,60]  M    [10001, 60000]  pneumonia
5         [61,70]  F    [10001, 60000]  flu
6         [61,70]  F    [10001, 60000]  stomach pain
7         [61,70]  F    [10001, 60000]  flu
8         [61,70]  F    [10001, 60000]  bronchitis

table 2

Problems with k-anonymity and l-diversity

Query A:

SELECT COUNT(*) FROM Microdata
WHERE Disease = 'pneumonia' AND Age <= 30
  AND Zipcode BETWEEN 10001 AND 20000

Querying generalized table

• R1 and R2 are the anonymized QID groups

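• RQ is the query range

• p = Area(R1 ∩ RQ)/Area(R1) = (10*10)/(50*40) = 0.05 (the query covers 10 of the 40 ages and 10 of the 50 thousand zipcodes spanned by R1)

• Estimated answer for A: 2 × 0.05 = 0.1 (R1 holds 2 pneumonia tuples), whereas the actual answer on table 1 is 1 (only Bob qualifies)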
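The estimate above can be reproduced with a short sketch. It assumes tuples are uniformly distributed inside each generalized rectangle. The names `overlap_len` and `estimate_count` are hypothetical, intervals are written as (low, high) pairs so that Age (20, 60) stands for [21,60], and zipcodes are in thousands.

```python
def overlap_len(a, b):
    """Length of the intersection of two intervals given as (low, high)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def estimate_count(groups, query, disease):
    """Estimate COUNT(*) on a generalized table, assuming uniform spread."""
    total = 0.0
    for g in groups:
        area, hit = 1.0, 1.0
        for attr, rng in g["box"].items():
            area *= rng[1] - rng[0]
            hit *= overlap_len(rng, query[attr])
        p = hit / area  # fraction of the group's rectangle covered by the query
        total += p * g["diseases"].get(disease, 0)
    return total


# Table 2: Age (20, 60) stands for [21,60]; Zipcode (10, 60) for [10001, 60000].
R1 = {"box": {"age": (20, 60), "zip": (10, 60)},
      "diseases": {"pneumonia": 2, "dyspepsia": 2}}
R2 = {"box": {"age": (60, 70), "zip": (10, 60)},
      "diseases": {"flu": 2, "stomach pain": 1, "bronchitis": 1}}
RQ = {"age": (0, 30), "zip": (10, 20)}  # Age <= 30, Zipcode in [10001, 20000]

estimate = estimate_count([R1, R2], RQ, "pneumonia")
print(estimate)
```

Only R1 overlaps the query range, so the estimate comes entirely from its 2 pneumonia tuples scaled by p.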

Concept of the Anatomy Algorithm

• Release 2 tables, quasi-identifier table (QIT) and sensitive table (ST)

• Use the same QI groups (satisfy l-diversity), replace the sensitive attribute values with a Group-ID column

• Then produce a sensitive table with Disease statistics

tuple ID  Age  Sex  Zipcode  Group-ID
1         23   M    11000    1
2         27   M    13000    1
3         35   M    59000    1
4         59   M    12000    1
5         61   F    54000    2
6         65   F    25000    2
7         65   F    25000    2
8         70   F    30000    2

QIT

Group-ID  Disease       Count
1         dyspepsia     2
1         pneumonia     2
2         bronchitis    1
2         flu           2
2         stomach ache  1

ST
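As a minimal sketch of the idea, the two tables can be derived from table 1 once each tuple has a Group-ID. The function name `build_qit_st` and the list `group_of` are illustrative assumptions; the grouping itself (tuples 1-4 in group 1, tuples 5-8 in group 2) is the one shown in the QIT.

```python
from collections import Counter

# Table 1, sensitive attribute (Disease) last.
microdata = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "stomach pain"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]
group_of = [1, 1, 1, 1, 2, 2, 2, 2]  # Group-ID of each tuple


def build_qit_st(rows, group_ids):
    """Split rows into a QIT (exact QI values + Group-ID) and an ST (counts)."""
    qit = [row[:-1] + (gid,) for row, gid in zip(rows, group_ids)]
    counts = Counter((gid, row[-1]) for row, gid in zip(rows, group_ids))
    st = sorted((gid, value, n) for (gid, value), n in counts.items())
    return qit, st


qit, st = build_qit_st(microdata, group_of)
for line in st:
    print(line)
```

The QIT keeps the exact QI values, which is what separates anatomy from generalization; only the link to the sensitive value is blurred.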

Concept of the Anatomy Algorithm

• Does it satisfy k-anonymity? l-diversity?

• Query results?


(QIT and ST tables as on the previous slide)

SELECT COUNT(*) FROM Microdata
WHERE Disease = 'pneumonia' AND Age <= 30
  AND Zipcode BETWEEN 10001 AND 20000
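One way to answer the query from the two published tables is sketched below: for each group, take the fraction of QIT tuples that satisfy the QI predicate and scale the ST count by it. The helper name `estimate_count` and the dictionary layout of the ST are assumptions made for illustration.

```python
# QIT rows: (Age, Sex, Zipcode, Group-ID); ST: (Group-ID, Disease) -> Count.
qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1),
    (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2),
    (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = {
    (1, "dyspepsia"): 2, (1, "pneumonia"): 2,
    (2, "bronchitis"): 1, (2, "flu"): 2, (2, "stomach pain"): 1,
}


def estimate_count(qit, st, qi_pred, disease):
    """Estimate COUNT(*) for a QI predicate combined with a disease filter."""
    sizes, hits = {}, {}
    for age, sex, zipcode, gid in qit:
        sizes[gid] = sizes.get(gid, 0) + 1
        if qi_pred(age, sex, zipcode):
            hits[gid] = hits.get(gid, 0) + 1
    return sum(
        st.get((gid, disease), 0) * hits.get(gid, 0) / sizes[gid]
        for gid in sizes
    )


estimate = estimate_count(
    qit, st,
    lambda age, sex, zipcode: age <= 30 and 10001 <= zipcode <= 20000,
    "pneumonia",
)
print(estimate)
```

In group 1, two of the four tuples satisfy the QI predicate and the ST lists 2 pneumonia cases, so the estimate is 2 × 2/4 = 1, which matches the actual answer on table 1.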

Specifications of Anatomy

• T is the microdata table to be published

• T has d QI attributes A^qi_1, A^qi_2, ..., A^qi_d and a sensitive attribute A^s

• Each A^qi_i (1 ≤ i ≤ d) is either numerical or categorical, but A^s can only be categorical because of l-diversity

• For a tuple t in T, t[i] (1 ≤ i ≤ d) is the value of t on A^qi_i, and t[d + 1] is its A^s value

• Hence t can be regarded as a point in a (d + 1)-dimensional data space DS

Specifications of Anatomy cont.

DEFINITION 1. (Partition/QI-group)

A partition consists of several subsets of T such that each tuple belongs to exactly one subset

The subsets are known as QI-groups and are denoted QI_1, QI_2, ..., QI_m


DEFINITION 2. (l-diverse partition)

A partition is l-diverse if every QI-group QI_j satisfies

c_j(v)/|QI_j| ≤ 1/l

where v is the most frequent sensitive value in QI_j, c_j(v) is the number of tuples in QI_j whose sensitive value is v, and |QI_j| is the number of tuples in QI_j

Example (the two groups of the QIT): c_1(dyspepsia) = c_1(pneumonia) = 2 and c_2(flu) = 2

|QI_1| = |QI_2| = 4

so both groups satisfy 2/4 ≤ 1/2 and the partition is 2-diverse
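Definition 2 translates directly into a check over the sensitive values of each group. The sketch below uses a hypothetical function name, `is_l_diverse`, and exact fractions to avoid rounding.

```python
from collections import Counter
from fractions import Fraction


def is_l_diverse(partition, l):
    """partition: list of QI-groups, each a list of sensitive values."""
    for group in partition:
        # c_j(v) for the most frequent sensitive value v of this group
        top_count = Counter(group).most_common(1)[0][1]
        if Fraction(top_count, len(group)) > Fraction(1, l):
            return False
    return True


partition = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],  # QI_1
    ["flu", "stomach pain", "flu", "bronchitis"],           # QI_2
]
print(is_l_diverse(partition, 2))
print(is_l_diverse(partition, 3))
```

The example partition passes for l = 2 but not for l = 3, since 2/4 exceeds 1/3.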


DEFINITION 3. (Anatomy)

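Given an l-diverse partition, anatomy creates a QIT and an ST

The QIT has the schema

(A^qi_1, A^qi_2, ..., A^qi_d, Group-ID)

and holds one row per tuple: its exact QI values plus the ID of its QI-group

The ST has the schema

(Group-ID, A^s, Count)

and holds one row per QI-group and distinct sensitive value in that group, with the number of tuples carrying that value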

Privacy properties

THEOREM 1. Given a pair of QIT and ST, an adversary can infer the sensitive value of any individual with probability at most 1/l

Joining the QIT and the ST on Group-ID yields:

Age  Sex  Zipcode  Group-ID  Disease      Count
23   M    11000    1         dyspepsia    2
23   M    11000    1         pneumonia    2
27   M    13000    1         dyspepsia    2
27   M    13000    1         pneumonia    2
35   M    59000    1         dyspepsia    2
35   M    59000    1         pneumonia    2
59   M    12000    1         dyspepsia    2
59   M    12000    1         pneumonia    2
61   F    54000    2         bronchitis   1
61   F    54000    2         flu          2
61   F    54000    2         stomachache  1
65   F    25000    2         bronchitis   1
65   F    25000    2         flu          2
65   F    25000    2         stomachache  1
65   F    25000    2         bronchitis   1
65   F    25000    2         flu          2
65   F    25000    2         stomachache  1
70   F    30000    2         bronchitis   1
70   F    30000    2         flu          2
70   F    30000    2         stomachache  1
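The bound in Theorem 1 can be illustrated on the example. The sketch assumes an adversary who knows the target's QIT row and treats every tuple in the group as equally likely to carry each listed value; the names `posterior` and `worst_case` are hypothetical.

```python
from fractions import Fraction

# ST as (Group-ID, Disease) -> Count
st = {
    (1, "dyspepsia"): 2, (1, "pneumonia"): 2,
    (2, "bronchitis"): 1, (2, "flu"): 2, (2, "stomach pain"): 1,
}


def posterior(qit_row, st):
    """Probability of each sensitive value for one QIT row (Group-ID last)."""
    gid = qit_row[-1]
    in_group = {v: c for (g, v), c in st.items() if g == gid}
    size = sum(in_group.values())
    return {v: Fraction(c, size) for v, c in in_group.items()}


def worst_case(st):
    """Largest inference probability over all groups and values."""
    sizes = {}
    for (gid, _), c in st.items():
        sizes[gid] = sizes.get(gid, 0) + c
    return max(Fraction(c, sizes[gid]) for (gid, _), c in st.items())


bob = (23, "M", 11000, 1)
print(posterior(bob, st))
print(worst_case(st))
```

For the 2-diverse example the worst case is 1/2, which is exactly 1/l.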

Comparison with generalization

• Compare with generalization under two assumptions:
  - A1: the adversary has the QI-values of the target individual
  - A2: the adversary also knows that the individual is definitely in the microdata

• If A1 and A2 are both true, anatomy is as good as generalization: the 1/l bound holds

• If A1 is true and A2 is false, generalization is stronger (anatomy publishes exact QI values, so the adversary can confirm that the individual is in the table)

• If A1 and A2 are both false, generalization is still stronger

Preserving Data Correlation

• Examine the correlation between Age and Disease in T using probability density functions (pdf)

• Example: t1, the first tuple of table 1 (Bob: Age 23, Disease pneumonia)

Preserving Data Correlation cont.

• To re-construct an approximate pdf of t1 from the generalization table (table 2): the disease is pneumonia, but the age is only known to lie in [21,60], so it is spread uniformly over that interval

• To re-construct an approximate pdf of t1 from the QIT and ST tables: the QIT gives Age 23 exactly, and the ST for group 1 gives dyspepsia or pneumonia with probability 1/2 each


• For a more rigorous comparison, calculate the “L2 distance” between the actual pdf and the re-constructed pdf with the following equation:

The distance for anatomy is 0.5, while the distance for generalization is 22.5

• Anatomy provides for better re-constructions of the probability density functions of all tuples.
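The 0.5 quoted for anatomy can be reproduced under one assumption: the pdf is treated as a discrete distribution over (Age, Disease) points and the distance is the sum of squared differences. The helper `squared_l2` is illustrative. The 22.5 quoted for generalization depends on how the generalized interval is handled, which the slides do not show, so it is not recomputed here.

```python
from fractions import Fraction


def squared_l2(true_pdf, approx_pdf):
    """Sum of squared differences between two discrete pdfs (dicts)."""
    points = set(true_pdf) | set(approx_pdf)
    return sum(
        (approx_pdf.get(p, 0) - true_pdf.get(p, 0)) ** 2 for p in points
    )


# Actual pdf of t1 (Bob): all mass on Age 23, pneumonia.
true_pdf = {(23, "pneumonia"): Fraction(1)}

# Re-construction from QIT and ST: Age 23 is exact, group 1 lists
# dyspepsia and pneumonia with 2 tuples each out of 4.
anatomy_pdf = {
    (23, "dyspepsia"): Fraction(1, 2),
    (23, "pneumonia"): Fraction(1, 2),
}

err_anatomy = squared_l2(true_pdf, anatomy_pdf)
print(err_anatomy)
```

The two terms are (1/2 − 1)² and (1/2 − 0)², which add up to 1/2.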


• Measure the error for each pdf by using the following formula:

Objective: obtain a minimal re-construction error (RCE), the error summed over all tuples t in T:
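The formulas themselves did not survive on these slides. A plausible reconstruction, assuming the error is the squared L2 distance between the actual pdf G_t and its re-construction (written here as G̃_t, a symbol introduced for illustration), is given below. It is consistent with the 0.5 quoted for anatomy, but it is a sketch rather than the slide's exact notation.

```latex
\[
  \mathrm{Err}_t \;=\; \int_{x \in DS} \bigl( \tilde{G}_t(x) - G_t(x) \bigr)^2 \, dx ,
  \qquad
  \mathrm{RCE} \;=\; \sum_{t \in T} \mathrm{Err}_t
\]
```

For categorical or discrete attributes the integral becomes a sum over the points of DS.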

Nearly-Optimal Anatomizing Algorithm

• The authors propose an efficient algorithm for anatomizing tables that minimizes the RCE

• The resulting QIT and ST achieve an RCE that deviates from the lower bound only by a factor < 1 + 1/n, where n is the size of T

• The algorithm has linear I/O complexity O(n/b), where b is the page size

Nearly-Optimal Anatomizing Algorithm cont.

PROPERTY 1. At the end of the group-creation phase, each non-empty bucket has only one tuple.

PROPERTY 2. The set S' (the QI-groups that a residue tuple can still join) always includes at least one QI-group.

PROPERTY 3. After the residue-assignment phase, each QI-group has at least l tuples with distinct sensitive attribute values.
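The slides list the properties but not the algorithm itself. The sketch below is a reconstruction of a bucket-based procedure that has these three properties. It assumes tuples are bucketed by sensitive value, that each new group takes one tuple from each of the l largest buckets, and that no sensitive value occurs in more than n/l tuples. The function name, the random choices and the `seed` parameter are illustrative assumptions.

```python
import random
from collections import defaultdict


def anatomize(rows, l, seed=0):
    """Partition rows (sensitive value last) into l-diverse QI-groups."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[-1]].append(row)

    groups = []
    # Group-creation phase: while at least l buckets are non-empty,
    # form a group with one tuple from each of the l largest buckets.
    while sum(1 for b in buckets.values() if b) >= l:
        non_empty = [v for v in buckets if buckets[v]]
        largest = sorted(non_empty, key=lambda v: -len(buckets[v]))[:l]
        group = []
        for v in largest:
            bucket = buckets[v]
            group.append(bucket.pop(rng.randrange(len(bucket))))
        groups.append(group)

    # Residue-assignment phase: each leftover tuple joins a random group
    # that does not already contain its sensitive value (the set S').
    for v, bucket in buckets.items():
        for row in bucket:
            candidates = [g for g in groups if all(r[-1] != v for r in g)]
            rng.choice(candidates).append(row)
    return groups


table1 = [
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"), (65, "F", 25000, "stomach pain"),
    (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis"),
]
groups = anatomize(table1, l=2)
for g in groups:
    print([row[-1] for row in g])
```

The groups produced this way differ from the two groups of four shown in the QIT; any partition in which every group holds at least l distinct sensitive values satisfies Definition 2.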

Experiments

• The CENSUS dataset contains the personal information of 500k American adults, described by 9 discrete attributes

• Created two sets of microdata tables

Set 1: 5 tables denoted as OCC-3, ..., OCC-7, where OCC-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Occupation as the sensitive attribute A^s

Set 2: 5 tables denoted as SAL-3, ..., SAL-7, where SAL-d (3 ≤ d ≤ 7) uses the first d attributes as QI-attributes and Salary-class as the sensitive attribute A^s

Conclusion

• Anatomy was designed to avoid the heavy information loss of generalization while still preserving privacy

• Anatomy has a significantly lower error rate as compared with generalization

• Several items require further research
  - Multiple sensitive attributes
  - Effective mining of patterns in microdata

Agenda

• Other anonymization technique
  - Anatomization
• Information metrics
• Extended scenarios

Information Metrics

• General purpose metrics
• Special purpose metrics
• Trade-off metrics

General Purpose Metrics

• General idea: measure “similarity” between the original data and the anonymized data
• Minimal distortion metric (Samarati 2001; Sweeney 2002; Wang and Fung 2006)
  - Charge a penalty to each instance of a value generalized or suppressed (independently of other records)
• ILoss (Xiao and Tao 2006)
  - Charge a penalty when a specific value is generalized

General Purpose Metrics cont.

• Discernibility Metric (DM) (K-OPTIMIZE, Mondrian, l-diversity, ...)
  - Charge a penalty to each record for being indistinguishable from other records
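A common formulation of DM charges each record the size of its QI-group, so the total is the sum of squared group sizes. The sketch below assumes that formulation and leaves out the separate penalty usually applied to suppressed records; the function name `discernibility` is illustrative.

```python
from collections import Counter


def discernibility(generalized_qids):
    """Sum over QI-groups of (group size)^2; lower means less distortion."""
    sizes = Counter(generalized_qids)
    return sum(n * n for n in sizes.values())


# Generalized QI values of table 2: two groups of four records each.
table2 = (
    [("[21,60]", "M", "[10001, 60000]")] * 4
    + [("[61,70]", "F", "[10001, 60000]")] * 4
)
dm = discernibility(table2)
print(dm)
```

For table 2 this gives 4² + 4² = 32, while a table in which every record stays distinguishable would score 8.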

Special Purpose Metrics

• Classification: Classification metric (CM) (Iyengar 2002)
  - Charge a penalty for each record suppressed or generalized to a group in which the record’s class is not the majority class
• Query
  - Query error: count queries
  - Query imprecision: overlapped range
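The classification metric can be sketched as follows, assuming a penalty of 1 per suppressed or minority-class record and normalization by the number of records. The class labels and group assignment below are hypothetical, since the running example has no class attribute, and ties for the majority class are broken by first occurrence.

```python
from collections import Counter, defaultdict


def classification_metric(records):
    """records: list of (group_id, class_label); group_id None = suppressed."""
    by_group = defaultdict(list)
    for gid, label in records:
        if gid is not None:
            by_group[gid].append(label)
    majority = {
        gid: Counter(labels).most_common(1)[0][0]
        for gid, labels in by_group.items()
    }
    penalties = sum(
        1 for gid, label in records
        if gid is None or label != majority[gid]
    )
    return penalties / len(records)


# Hypothetical data: two groups plus one suppressed record.
records = [
    (1, "yes"), (1, "yes"), (1, "no"),
    (2, "no"), (2, "no"),
    (None, "yes"),
]
cm = classification_metric(records)
print(cm)
```

Here one minority-class record in group 1 and the suppressed record are penalized, giving 2/6.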

Extended Scenarios

• Multiple release publishing
• Continuous release publishing
• Collaborative/distributed publishing

Other types of data

• High dimensional transaction data
  - Market basket, web queries
• Moving objects data
  - Location based services
• Textual data
