top-down specialization for information and privacy preservation

34
Top-Down Specialization for Information and Privacy Preservation IEEE ICDE 2005 Benjamin C. M. Fung Simon Fraser University BC, Canada [email protected] Ke Wang Simon Fraser University BC, Canada [email protected] Philip S. Yu IBM T.J. Watson Research Center [email protected]

Upload: mauli

Post on 11-Jan-2016

39 views

Category:

Documents


0 download

DESCRIPTION

Top-Down Specialization for Information and Privacy Preservation. Benjamin C. M. Fung Simon Fraser University BC, Canada [email protected]. Ke Wang Simon Fraser University BC, Canada [email protected]. Philip S. Yu IBM T.J. Watson Research Center [email protected]. IEEE ICDE 2005. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Top-Down Specialization for Information and Privacy Preservation

Top-Down Specialization for Information and

Privacy Preservation

IEEE ICDE 2005

Benjamin C. M. Fung

Simon Fraser University BC, Canada

[email protected]

Ke Wang

Simon Fraser University BC, Canada

[email protected]

Philip S. Yu

IBM T.J. Watson Research Center

[email protected]

Page 2: Top-Down Specialization for Information and Privacy Preservation

2

Conference Paper

• Top-Down Specialization for Information and Privacy Preservation.

• Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE 2005).

• Conference paper and software:– http://www.cs.sfu.ca/~ddm– Software Privacy TDS 1.0

Page 3: Top-Down Specialization for Information and Privacy Preservation

3

Outline

• Problem: Anonymity for Classification

• Our method: Top-Down Specialization (TDS)

• Related works

• Experimental results

• Conclusions

• Q & A

Page 4: Top-Down Specialization for Information and Privacy Preservation

4

Motivation

• Government and business have strong motivations for data mining.

• Citizens have growing concern about protecting their privacy.

• Can we satisfy both the data mining goal and the privacy goal?

Page 5: Top-Down Specialization for Information and Privacy Preservation

5

Scenario

• A data owner wants to release a person-specific data table to another party (or the public) for the purpose of classification without scarifying the privacy of the individuals in the released data.

Data owner Data recipients

Person-specificdata

Page 6: Top-Down Specialization for Information and Privacy Preservation

Classification Goal

Trainingdata

Classification Algorithm Decision tree

Education

Sex

G

Female Male

Bachelors

G B

MastersDoctorate

Education Sex Age Class # of Recs.

9th F 30 0G3B 3

10th M 32 0G4B 4

11th F 35 2G3B 5

12th F 37 3G1B 4

Bachelors F 42 4G2B 6

Bachelors F 44 4G0B 4

Masters M 44 4G0B 4

Masters F 44 3G0B 3

Doctorate F 44 1G0B 1

Total: 34

Unseen data(Masters, Female)

Credit?

Good

……

Page 7: Top-Down Specialization for Information and Privacy Preservation

Privacy Threat• If a description on (Education, Sex) is so specific that not

many people match it, releasing the table will lead to linking a unique or a small number of individuals with sensitive information.

Education Sex Age Class # of Recs.

9th F 30 0G3B 3

10th M 32 0G4B 4

11th F 35 2G3B 5

12th F 37 3G1B 4

Bachelors F 42 4G2B 6

Bachelors F 44 4G0B 4

Masters M 44 4G0B 4

Masters F 44 3G0B 3

Doctorate F 44 1G0B 1

Total: 34

Data recipientsAdversary

Education Sex Diagnosis …

Bachelors F Depression …

Bachelors M Heart disease …

Masters F Depression …

Masters F Heart disease …

Doctorate F Knee injury …

Page 8: Top-Down Specialization for Information and Privacy Preservation

8

Privacy Goal: Anonymity• The privacy goal is specified by the anonymity on a

combination of attributes called Virtual Identifier (VID), where each description on a VID is required to be shared by at least k records in the table.

• Anonymity requirement– Consider VID1,…, VIDp. e.g., VID = {Education, Sex}.

– a(vidi) denotes the number of data records in T that share the value vidi on VIDi. e.g., vid = {Doctorate, Female}.

– A(vidi) denotes the smallest a(vidi) for any value vidi on VIDi.

– A table T satisfies the anonymity requirement

{<VID1, k1>, …, <VIDp, kp>}

if A(vidi) ≥ ki for 1 ≤ i ≤ p, where ki is the anonymity threshold on VIDi specified by the data owner.

Page 9: Top-Down Specialization for Information and Privacy Preservation

9

Anonymity Requirement

Example: VID1 = {Education, Sex}, k1 = 4

Education Sex Age Class # of Recs.

9th F 30 0G3B 3

10th M 32 0G4B 4

11th F 35 2G3B 5

12th F 37 3G1B 4

Bachelors F 42 4G2B 6

Bachelors F 44 4G0B 4

Masters M 44 4G0B 4

Masters F 44 3G0B 3

Doctorate F 44 1G0B 1

Total: 34

a( vid1 )

3

4

5

4

10

4

3

1

A(VID1) = 1

Page 10: Top-Down Specialization for Information and Privacy Preservation

10

Problem Definition• Anonymity for Classification:

– Given a table T, an anonymity requirement, and a taxonomy tree of each categorical attribute in UVIDj, generalize T to satisfy the anonymity requirement while preserving as much information as possible for classification.

• A VID may contain both categorical and continuous attributes.

Page 11: Top-Down Specialization for Information and Privacy Preservation

11

Solution: GeneralizationGeneralize values in UVIDj.

Education Sex Age Class # of Recs.

9th F 30 0G3B 3

10th M 32 0G4B 4

11th F 35 2G3B 5

12th F 37 3G1B 4

Bachelors F 42 4G2B 6

Bachelors F 44 4G0B 4

Masters M 44 4G0B 4

Masters F 44 3G0B 3

Doctorate F 44 1G0B 1

Education Sex Age Class # of Recs.

9th F 30 0G3B 3

10th M 32 0G4B 4

11th F 35 2G3B 5

12th F 37 3G1B 4

Bachelors F 42 4G2B 6

Bachelors F 44 4G0B 4

Grad School M 44 4G0B 4

Grad School F 44 4G0B 4

Page 12: Top-Down Specialization for Information and Privacy Preservation

12

Intuition1. Classification goal and privacy goal have no conflicts:

– Privacy goal: mask sensitive information, usually specific descriptions that identify individuals.

– Classification goal: extract general structures that capture trends and patterns.

2. A table contains multiple classification structures. Generalizations destroy some classification structures, but other structures emerge to help.

If generalization is “carefully” performed, identifying information can be masked while still preserving trends and patterns for classification.

Page 13: Top-Down Specialization for Information and Privacy Preservation

13

Algorithm Top-Down Specialization (TDS)

Initialize every value in T to the top most value.

Initialize Cuti to include the top most value.

while some x UCuti is valid do

Find the Best specialization of the highest Score in UCuti.

Perform the Best specialization on T and update UCuti.

Update Score(x) and validity for x UCuti.

end while

return Generalized T and UCuti.

AgeANY

[1-99)

[1-37) [37-99)

Page 14: Top-Down Specialization for Information and Privacy Preservation

14

Search Criteria: Score

• Consider a specialization v child(v). To heuristically maximize the information of the generalized data for achieving a given anonymity, we favor the specialization on v that has the maximum information gain for each unit of anonymity loss:

Page 15: Top-Down Specialization for Information and Privacy Preservation

Search Criteria: Score

• Rv denotes the set of records having value v before the specialization. Rc denotes the set of records having value c after the specialization where c child(v).

• I(Rx) is the entropy of Rx [8]:

• freq(Rx, cls) is the number records in Rx having the class cls.

• Intuitively, I(Rx) measures the impurity of classes for the data records in Rx . A good specialization reduces the impurity of classes.

• Also use InfoGain(v) to determine the optimal binary split on a continuous interval. [8]

AgeANY

[1-99)

[1-37) [37-99)

Page 16: Top-Down Specialization for Information and Privacy Preservation

To compute Score(ANY_Edu):

Education Sex Age Class # of Recs.

ANY_Edu ANY_Sex [1-99) 21G13B 34

Education Sex Age Class # of Recs.

Secondary ANY_Sex [1-99) 5G11B 16

University ANY_Sex [1-99) 16G2B 18

Page 17: Top-Down Specialization for Information and Privacy Preservation

17

Perform the Best Specialization

• To perform the Best specialization Best child(Best), we need to retrieve RBest, the set of data records containing the value Best.

• Taxonomy Indexed PartitionS (TIPS) is a tree structure with each node representing a generalized record over UVIDj, and each child node representing a specialization of the parent node on exactly one attribute.

• Stored with each leaf node is the set of data records having the same generalized record.

Page 18: Top-Down Specialization for Information and Privacy Preservation

Education Sex Age # of Recs.

ANY_Edu ANY_Sex [1-99) 34

ANY_Edu ANY_Sex [37-99) 22ANY_Edu ANY_Sex [1-37) 12

AgeANY

[1-99)

[1-37) [37-99)

Secondary ANY_

Sex

[1-37) 12 Secondary ANY_

Sex

[37-99) 4 University ANY_

Sex

[37-99) 18

Link[37-99)LinkSecondary

Consider VID1 = {Education, Sex}, VID2 = {Sex, Age}

Page 19: Top-Down Specialization for Information and Privacy Preservation

19

Count Statistics

• While performing specialization, collect the following count statistics.– |Rc|, number of records containing c after specialization Best,

where c child(Best).– |Rd|, number of records containing d as if d is specialized,

where d child(c).– freq(Rc,cls), number of records in Rc having class cls.– freq(Rd,cls), number of records in Rd having class cls.– |Pd|, number of records in partition Pd where Pd is a child partition

under Pc as if c is specialized.

• Score can be updated using the above count statistics without accessing data records.

Page 20: Top-Down Specialization for Information and Privacy Preservation

Update the Score• Update InfoGain(v): Specialization does not affect

InfoGain(v) except for each value c child(Best).

• Update AnonyLoss(v): Specialization affects Av(VIDj) which is the minimum a(vidj) after specializing Best. If attribute(v) and attribute(Best) are contained in the same VIDj, then AnonyLoss(v) has to be updated.

AgeANY

[1-99)

[1-37) [37-99)

Page 21: Top-Down Specialization for Information and Privacy Preservation

Update the Score• Need to efficiently extract a(vidj) from TIPS for

updating Av(VIDj).

• Virtual Identifier TreeS (VITS):– VITj for VIDj = {D1,…, Dw} is a tree of w levels. The level

i > 0 represents the generalized values for Di. Each root-to-leaf path represents an existing vidj on VIDj in the generalized data, with a(vidj) stored at the leaf node.

Page 22: Top-Down Specialization for Information and Privacy Preservation

Education Sex Age # of Recs.

ANY_Edu ANY_Sex [1-99) 34

ANY_Edu ANY_Sex [37-99) 22ANY_Edu ANY_Sex [1-37) 12

Secondary ANY_

Sex

[1-37) 12 Secondary ANY_

Sex

[37-99) 4 University ANY_

Sex

[37-99) 18

Consider VID1 = {Education, Sex}, VID2 = {Sex, Age}

Age

Page 23: Top-Down Specialization for Information and Privacy Preservation

23

Benefits of TDS

• Handling multiple VIDs– Treating all VIDs as a single VID leads to over generalization.

• Handling both categorical and continuous attributes.– Dynamically generate taxonomy tree for continuous attributes.

• Anytime solution– User may step through each specialization to determine a

desired trade-off between privacy and accuracy.– User may stop any time and obtain a generalized table satisfying

the anonymity requirement. Bottom-up approach does not support this feature.

• Scalable computation

Page 24: Top-Down Specialization for Information and Privacy Preservation

24

Related Works• The concept of anonymity was proposed by Dalenius [2].

• Sweeny [10] employed bottom-up generalization to achieve k-anonymity.– Single VID. Not considering specific use of data.

• Iyengar [6] proposed a genetic algorithm (GA) to address the problem of anonymity for classification.– Single VID. – GA needs 18 hours while our method only needs 7 seconds to

generalize same set of records (with comparable accuracy).

• Wang et al. [12] recently proposed bottom-up generalization to address the same problem.– Only for categorical attributes.– Does not support Anytime Solution.

Page 25: Top-Down Specialization for Information and Privacy Preservation

25

Experimental Evaluation

• Data quality.– A broad range of anonymity requirements.– Used C4.5 and Naïve Bayesian classifiers.

• Compare with Iyengar’s genetic algorithm [6].– Results were quoted from [6].

• Efficiency and Scalability.

Page 26: Top-Down Specialization for Information and Privacy Preservation

26

Data set

• Adult data set– Used in Iyengar [6].– Census data.– 6 continuous attributes.– 8 categorical attributes.– Two classes.– 30162 recs. for training.– 15060 recs. for testing.

Page 27: Top-Down Specialization for Information and Privacy Preservation

Data Quality• Include the TopN most important attributes into a SingleVID, which

is more restrictive than breaking them into multiple VIDs.

Page 28: Top-Down Specialization for Information and Privacy Preservation

Data Quality• Include the TopN most important attributes into a SingleVID, which

is more restrictive than breaking them into multiple VIDs.

Page 29: Top-Down Specialization for Information and Privacy Preservation

Data Quality• Compare with the genetic algorithm using C4.5.• Only categorical attributes. Same taxonomy trees.

Our method

Page 30: Top-Down Specialization for Information and Privacy Preservation

Efficiency and Scalability• Took at most 10 seconds for all previous experiments.• Replicate the Adult data set and substitute some random data.

Page 31: Top-Down Specialization for Information and Privacy Preservation

31

Conclusions

• Quality classification and privacy preservation can coexist.

• An effective top-down method to iteratively specialize the data, guided by maximizing the information utility and minimizing privacy specificity.

• Simple but effective data structures.

• Great applicability to both public and private sectors that share information for mutual benefits.

Page 32: Top-Down Specialization for Information and Privacy Preservation

32

Thank you.

Questions?

Page 33: Top-Down Specialization for Information and Privacy Preservation

33

References1. R. Agrawal and S. Ramakrishnan. Privacy preserving data mining. In Proc. of the ACM

SIGMOD Conference on Management of Data, pages 439–450, Dallas, Texas, May 2000. ACM Press.

2. T. Dalenius. Finding a needle in a haystack - or identifying anonymous census record. Journal of Official Statistics, 2(3):329–336, 1986.

3. W. A. Fuller. Masking procedures for microdata disclosure limitation. Official Statistics, 9(2):383–406, 1993.

4. S. Hettich and S. D. Bay. The UCI KDD Archive, 1999. http://kdd.ics.uci.edu.

5. A. Hundepool and L. Willenborg. - and -argus: Software for statistical disclosure control. In the 3rd International Seminar on Statistical Confidentiality, Bled, 1996.

6. V. S. Iyengar. Transforming data to satisfy privacy constraints. In Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 279–288, Edmonton, AB, Canada, July 2002.

7. J. Kim and W. Winkler. Masking microdata files. In ASA Proceedings of the Section on Survey Research Methods, pages 114–119, 1995.

8. R. J. Quinlan. C4.5: Progams for Machine Learning. Morgan Kaufmann, 1993.

9. P. Samarati. Protecting respondents’ identities in microdata release. In IEEE Transactions on Knowledge Engineering, volume 13, pages 1010–1027, 2001.

10. L. Sweeney. Datafly: A system for providing anonymity in medical data. In International Conference on Database Security, pages 356–381, 1998.

Page 34: Top-Down Specialization for Information and Privacy Preservation

34

References11. The House of Commons in Canada. The personal information protection and electronic

documents act, April 2000. http://www.privcom.gc.ca/.

12. K. Wang, P. Yu, and S. Chakraborty. Bottom-up generalization: a data mining solution to privacy protection. In Proc. of the 4th IEEE International Conference on Data Mining 2004 (ICDM 2004), November 2004.