top-down specialization for information and privacy preservation

Top-Down Specialization for Information and

Privacy Preservation

IEEE ICDE 2005

Benjamin C. M. Fung

Simon Fraser University BC, Canada

[email protected]

Ke Wang

Simon Fraser University BC, Canada

[email protected]

Philip S. Yu

IBM T.J. Watson Research Center

[email protected]

2

Conference Paper

• Top-Down Specialization for Information and Privacy Preservation.

• Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE 2005).

• Conference paper and software:– http://www.cs.sfu.ca/~ddm– Software Privacy TDS 1.0

3

Outline

• Problem: Anonymity for Classification

• Our method: Top-Down Specialization (TDS)

• Related works

• Experimental results

• Conclusions

• Q & A

4

Motivation

• Government and business have strong motivations for data mining.

• Citizens have growing concern about protecting their privacy.

• Can we satisfy both the data mining goal and the privacy goal?

5

Scenario

• A data owner wants to release a person-specific data table to another party (or the public) for the purpose of classification without scarifying the privacy of the individuals in the released data.

Data owner Data recipients

Person-specificdata

Classification Goal

Trainingdata

Classification Algorithm Decision tree

Education

Sex

G

Female Male

Bachelors

G B

MastersDoctorate

Education Sex Age Class # of Recs.

9th F 30 0G3B 3

10th M 32 0G4B 4

11th F 35 2G3B 5

12th F 37 3G1B 4

Bachelors F 42 4G2B 6


Masters M 44 4G0B 4

Masters F 44 3G0B 3

Doctorate F 44 1G0B 1

Total: 34

Unseen data(Masters, Female)

Credit?

Good

……

Privacy Threat• If a description on (Education, Sex) is so specific that not

many people match it, releasing the table will lead to linking a unique or a small number of individuals with sensitive information.


9th F 30 0G3B 3

10th M 32 0G4B 4

11th F 35 2G3B 5

12th F 37 3G1B 4



Masters M 44 4G0B 4

Masters F 44 3G0B 3


Total: 34

Data recipientsAdversary

Education Sex Diagnosis …

Bachelors F Depression …

Bachelors M Heart disease …

Masters F Depression …

Masters F Heart disease …

Doctorate F Knee injury …

8

Privacy Goal: Anonymity• The privacy goal is specified by the anonymity on a

combination of attributes called Virtual Identifier (VID), where each description on a VID is required to be shared by at least k records in the table.

• Anonymity requirement– Consider VID1,…, VIDp. e.g., VID = {Education, Sex}.

– a(vidi) denotes the number of data records in T that share the value vidi on VIDi. e.g., vid = {Doctorate, Female}.

– A(vidi) denotes the smallest a(vidi) for any value vidi on VIDi.

– A table T satisfies the anonymity requirement

{<VID1, k1>, …, <VIDp, kp>}

if A(vidi) ≥ ki for 1 ≤ i ≤ p, where ki is the anonymity threshold on VIDi specified by the data owner.

9

Anonymity Requirement

Example: VID1 = {Education, Sex}, k1 = 4


9th F 30 0G3B 3

10th M 32 0G4B 4

11th F 35 2G3B 5

12th F 37 3G1B 4



Masters M 44 4G0B 4

Masters F 44 3G0B 3


Total: 34

a( vid1 )

3

4

5

4

10

4

3

1

A(VID1) = 1

10

Problem Definition• Anonymity for Classification:

– Given a table T, an anonymity requirement, and a taxonomy tree of each categorical attribute in UVIDj, generalize T to satisfy the anonymity requirement while preserving as much information as possible for classification.

• A VID may contain both categorical and continuous attributes.

11

Solution: GeneralizationGeneralize values in UVIDj.


9th F 30 0G3B 3

10th M 32 0G4B 4

11th F 35 2G3B 5

12th F 37 3G1B 4



Masters M 44 4G0B 4

Masters F 44 3G0B 3



9th F 30 0G3B 3

10th M 32 0G4B 4

11th F 35 2G3B 5

12th F 37 3G1B 4



Grad School M 44 4G0B 4

Grad School F 44 4G0B 4

12

Intuition1. Classification goal and privacy goal have no conflicts:

– Privacy goal: mask sensitive information, usually specific descriptions that identify individuals.

– Classification goal: extract general structures that capture trends and patterns.

2. A table contains multiple classification structures. Generalizations destroy some classification structures, but other structures emerge to help.

If generalization is “carefully” performed, identifying information can be masked while still preserving trends and patterns for classification.

13

Algorithm Top-Down Specialization (TDS)

Initialize every value in T to the top most value.

Initialize Cuti to include the top most value.

while some x UCuti is valid do

Find the Best specialization of the highest Score in UCuti.

Perform the Best specialization on T and update UCuti.

Update Score(x) and validity for x UCuti.

end while

return Generalized T and UCuti.

AgeANY

[1-99)

[1-37) [37-99)

14

Search Criteria: Score

• Consider a specialization v child(v). To heuristically maximize the information of the generalized data for achieving a given anonymity, we favor the specialization on v that has the maximum information gain for each unit of anonymity loss:

Search Criteria: Score

• Rv denotes the set of records having value v before the specialization. Rc denotes the set of records having value c after the specialization where c child(v).

• I(Rx) is the entropy of Rx [8]:

• freq(Rx, cls) is the number records in Rx having the class cls.

• Intuitively, I(Rx) measures the impurity of classes for the data records in Rx . A good specialization reduces the impurity of classes.

• Also use InfoGain(v) to determine the optimal binary split on a continuous interval. [8]

AgeANY

[1-99)

[1-37) [37-99)

To compute Score(ANY_Edu):


ANY_Edu ANY_Sex [1-99) 21G13B 34


Secondary ANY_Sex [1-99) 5G11B 16

University ANY_Sex [1-99) 16G2B 18

17

Perform the Best Specialization

• To perform the Best specialization Best child(Best), we need to retrieve RBest, the set of data records containing the value Best.

• Taxonomy Indexed PartitionS (TIPS) is a tree structure with each node representing a generalized record over UVIDj, and each child node representing a specialization of the parent node on exactly one attribute.

• Stored with each leaf node is the set of data records having the same generalized record.

Education Sex Age # of Recs.

ANY_Edu ANY_Sex [1-99) 34

ANY_Edu ANY_Sex [37-99) 22ANY_Edu ANY_Sex [1-37) 12

AgeANY

[1-99)

[1-37) [37-99)

Secondary ANY_

Sex

[1-37) 12 Secondary ANY_

Sex

[37-99) 4 University ANY_

Sex

[37-99) 18

Link[37-99)LinkSecondary

Consider VID1 = {Education, Sex}, VID2 = {Sex, Age}

19

Count Statistics

• While performing specialization, collect the following count statistics.– |Rc|, number of records containing c after specialization Best,

where c child(Best).– |Rd|, number of records containing d as if d is specialized,

where d child(c).– freq(Rc,cls), number of records in Rc having class cls.– freq(Rd,cls), number of records in Rd having class cls.– |Pd|, number of records in partition Pd where Pd is a child partition

under Pc as if c is specialized.

• Score can be updated using the above count statistics without accessing data records.

Update the Score• Update InfoGain(v): Specialization does not affect

InfoGain(v) except for each value c child(Best).

• Update AnonyLoss(v): Specialization affects Av(VIDj) which is the minimum a(vidj) after specializing Best. If attribute(v) and attribute(Best) are contained in the same VIDj, then AnonyLoss(v) has to be updated.

AgeANY

[1-99)

[1-37) [37-99)

Update the Score• Need to efficiently extract a(vidj) from TIPS for

updating Av(VIDj).

• Virtual Identifier TreeS (VITS):– VITj for VIDj = {D1,…, Dw} is a tree of w levels. The level

i > 0 represents the generalized values for Di. Each root-to-leaf path represents an existing vidj on VIDj in the generalized data, with a(vidj) stored at the leaf node.

Education Sex Age # of Recs.

ANY_Edu ANY_Sex [1-99) 34

ANY_Edu ANY_Sex [37-99) 22ANY_Edu ANY_Sex [1-37) 12

Secondary ANY_

Sex

[1-37) 12 Secondary ANY_

Sex

[37-99) 4 University ANY_

Sex

[37-99) 18

Consider VID1 = {Education, Sex}, VID2 = {Sex, Age}

Age

23

Benefits of TDS

• Handling multiple VIDs– Treating all VIDs as a single VID leads to over generalization.

• Handling both categorical and continuous attributes.– Dynamically generate taxonomy tree for continuous attributes.

• Anytime solution– User may step through each specialization to determine a

desired trade-off between privacy and accuracy.– User may stop any time and obtain a generalized table satisfying

the anonymity requirement. Bottom-up approach does not support this feature.

• Scalable computation

24

Related Works• The concept of anonymity was proposed by Dalenius [2].

• Sweeny [10] employed bottom-up generalization to achieve k-anonymity.– Single VID. Not considering specific use of data.

• Iyengar [6] proposed a genetic algorithm (GA) to address the problem of anonymity for classification.– Single VID. – GA needs 18 hours while our method only needs 7 seconds to

generalize same set of records (with comparable accuracy).

• Wang et al. [12] recently proposed bottom-up generalization to address the same problem.– Only for categorical attributes.– Does not support Anytime Solution.

25

Experimental Evaluation

• Data quality.– A broad range of anonymity requirements.– Used C4.5 and Naïve Bayesian classifiers.

• Compare with Iyengar’s genetic algorithm [6].– Results were quoted from [6].

• Efficiency and Scalability.

26

Data set

• Adult data set– Used in Iyengar [6].– Census data.– 6 continuous attributes.– 8 categorical attributes.– Two classes.– 30162 recs. for training.– 15060 recs. for testing.

Data Quality• Include the TopN most important attributes into a SingleVID, which

is more restrictive than breaking them into multiple VIDs.

Data Quality• Compare with the genetic algorithm using C4.5.• Only categorical attributes. Same taxonomy trees.

Our method

Efficiency and Scalability• Took at most 10 seconds for all previous experiments.• Replicate the Adult data set and substitute some random data.

31

Conclusions

• Quality classification and privacy preservation can coexist.

• An effective top-down method to iteratively specialize the data, guided by maximizing the information utility and minimizing privacy specificity.

• Simple but effective data structures.

• Great applicability to both public and private sectors that share information for mutual benefits.

32

Thank you.

Questions?

33

References1. R. Agrawal and S. Ramakrishnan. Privacy preserving data mining. In Proc. of the ACM

SIGMOD Conference on Management of Data, pages 439–450, Dallas, Texas, May 2000. ACM Press.

2. T. Dalenius. Finding a needle in a haystack - or identifying anonymous census record. Journal of Official Statistics, 2(3):329–336, 1986.

3. W. A. Fuller. Masking procedures for microdata disclosure limitation. Official Statistics, 9(2):383–406, 1993.

4. S. Hettich and S. D. Bay. The UCI KDD Archive, 1999. http://kdd.ics.uci.edu.

5. A. Hundepool and L. Willenborg. - and -argus: Software for statistical disclosure control. In the 3rd International Seminar on Statistical Confidentiality, Bled, 1996.

6. V. S. Iyengar. Transforming data to satisfy privacy constraints. In Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 279–288, Edmonton, AB, Canada, July 2002.

7. J. Kim and W. Winkler. Masking microdata files. In ASA Proceedings of the Section on Survey Research Methods, pages 114–119, 1995.

8. R. J. Quinlan. C4.5: Progams for Machine Learning. Morgan Kaufmann, 1993.

9. P. Samarati. Protecting respondents’ identities in microdata release. In IEEE Transactions on Knowledge Engineering, volume 13, pages 1010–1027, 2001.

10. L. Sweeney. Datafly: A system for providing anonymity in medical data. In International Conference on Database Security, pages 356–381, 1998.

34

References11. The House of Commons in Canada. The personal information protection and electronic

documents act, April 2000. http://www.privcom.gc.ca/.

12. K. Wang, P. Yu, and S. Chakraborty. Bottom-up generalization: a data mining solution to privacy protection. In Proc. of the 4th IEEE International Conference on Data Mining 2004 (ICDM 2004), November 2004.

top-down specialization for information and privacy preservation

Documents