

Appl. Math. Inf. Sci. 8, No. 3, 1103-1116 (2014)

Applied Mathematics & Information Sciences, An International Journal

http://dx.doi.org/10.12785/amis/080321

A Survey of Privacy Preserving Data Publishing using Generalization and Suppression

Yang Xu 1, Tinghuai Ma 2,*, Meili Tang 3 and Wei Tian 1

1 School of Computer and Software, Nanjing University of Information Science & Technology, Nanjing, 210044, China
2 Jiangsu Engineering Centre of Network Monitoring, Nanjing University of Information Science & Technology, Nanjing, 210044, China
3 School of Public Administration, Nanjing University of Information Science & Technology, Nanjing, 210044, China

* Corresponding author e-mail: [email protected]

Received: 23 May 2013, Revised: 27 Sep 2013, Accepted: 28 Sep 2013
Published online: 1 May 2014

Abstract: Nowadays, information sharing appears in our vision as an indispensable part of life, bringing about a mass of discussions about methods and techniques of privacy preserving data publishing, which are regarded as a strong guarantee against information disclosure and for the protection of individuals' privacy. Recent work focuses on proposing different anonymity algorithms for varying data publishing scenarios that satisfy privacy requirements while preserving data utility. K-anonymity has been proposed for privacy preserving data publishing; it can prevent linkage attacks by means of anonymity operations such as generalization and suppression. Numerous anonymity algorithms have been utilized for achieving k-anonymity. This paper provides an overview of the development of privacy preserving data publishing, restricted to the scope of anonymity algorithms using generalization and suppression. The privacy preserving models for attacks are introduced first, followed by an overview of several anonymity operations. The most important part is the coverage of anonymity algorithms and the information metrics that are essential ingredients of those algorithms. The conclusion and perspectives are presented finally.

Keywords: Data Publishing, Privacy Preserving, Anonymity Algorithms, Information Metric, Generalization, Suppression

1 INTRODUCTION

Due to the rapid growth of information, the demands for data collection and sharing increase sharply. A great quantity of data is used for analysis, statistics and computation to find general patterns or principles that are beneficial to social development and human progress. Meanwhile, threats appear when tremendous amounts of data become available to the public. For example, people can dig out private information by piecing together safe-seeming data, so there is a great possibility of exposing individuals' privacy. According to one study, approximately 87% of the population of the United States can be uniquely identified from a given dataset published to the public. To keep this situation from getting worse, measures have been taken by the security departments of many countries, for example, promulgating privacy regulations (e.g. the privacy regulation that is part of the Health Insurance Portability and Accountability Act in the USA [1]). The requirement for the data publisher is that the data to be published must fit predefined conditions. Identifying attributes need to be omitted from the published dataset to guarantee that individuals' privacy cannot be inferred from the dataset directly. Removing identifier attributes is just the preparation for data processing; several sanitization operations need to be done further. However, data processing may decrease data utility dramatically while data privacy is still not fully preserved.

In the face of this challenging risk, research has been proposed as a remedy for this awkward situation, targeting a balance of data utility and information privacy when publishing a dataset. The ongoing research is called Privacy Preserving Data Publishing (PPDP). In the past few years, experts have taken up the challenge and undertaken a great deal of research. Many feasible approaches have been proposed for different privacy preserving scenarios, which solve the issues in PPDP effectively. New methods and theories come out continuously through experts' efforts to improve privacy preservation.


1.1 Privacy Preserving Data Publishing

Generally, the process of Privacy Preserving Data Publishing has two phases: the data collection phase and the data publishing phase. It involves three kinds of roles: data owner, data publisher and data recipient. The relationship of the two phases and three roles in PPDP is shown in figure 1. In the data collection phase, the data publisher collects the dataset from the data owner. Then, in the data publishing phase, the data publisher sends the processed dataset to the data recipient. It is necessary to mention that the raw dataset from the data owner cannot be sent directly to the data recipient; the dataset should be processed by the data publisher first.

Fig. 1: The relationship of phases and roles in PPDP

In [2], data publishers are divided into two categories. In the untrusted model, the data publisher is tricky and likely to extract private information from the dataset. In the trusted model, the data publisher is reliable and any data in their hands is safe and without risk. Owing to differences in data publishing scenarios, shaped by varying assumptions about and requirements on the data publisher, the data recipient's purposes and other factors, [3] gives four scenarios for detailed discussion that may appear in real privacy preserving data publishing. The first scenario is the non-expert data publisher. In this scenario, the data publisher does not need specific knowledge about the research fields; what they need to do is make the published data satisfy the requirements of data utility and information privacy. The second is that the data recipient could be an attacker. This scenario is widely accepted, and many proposed solutions take it as a requisite hypothesis. The third is that the published data is not a data mining result. It indicates that the dataset provided by the data publisher in this scenario is not merely for data mining; that is to say, the published dataset is not a data mining result. The last one is truthfulness at the record level. The data publisher should guarantee the authenticity of the data to be published, whatever processing methods are used. Thus, randomization and perturbation cannot meet the requirements in this scenario.

1.2 K-Anonymity

When referring to data anonymization, the most common data form is the two-dimensional table in a relational database. For privacy preserving, the attributes of a table are divided into four categories: identifier, quasi-identifiers, non-quasi attributes and sensitive attributes. An identifier uniquely represents an individual; obviously, it should be removed before data processing. Quasi-identifiers are a specific sequence of attributes in the table that malicious attackers can take advantage of by linking the released dataset with another dataset they have already acquired, breaking privacy and eventually gaining sensitive information. Data sanitization performed by the data publisher mainly targets quasi-identifiers. Due to uncertainty about the number of quasi-identifiers, each approach to PPDP assumes the quasi-identifier sequence in advance; only in this way can the subsequent processing be carried out. Non-quasi attributes have less effect on data processing. For this reason, these attributes sometimes do not appear in the progress of data processing, which tremendously decreases memory usage and improves the performance of the proposed algorithm. A sensitive attribute contains sensitive information, such as disease or salary. Table 1(2) is a two-dimensional table to be published. According to the introduction above, we can conclude that ID is the identifier. If table 1(1) is a known table that an attacker can use as background knowledge, then Birthday, Sex and ZipCode are quasi-identifiers, Work is a non-quasi attribute and Disease is the sensitive attribute.

From the example above, we know why data processing steps mainly work on quasi-identifiers: only in this way can we reduce the correlation between the dataset to be published and other datasets. In PPDP, this process is called data anonymization. K-anonymity is an anonymization approach proposed by Samarati and Sweeney [4] in which each record in the dataset cannot be distinguished from at least (k-1) other records under the projection onto the quasi-identifiers after a series of anonymity operations (e.g. replacing a specific value with a more general value). K-anonymity assures that the probability of uniquely identifying an individual in the released dataset is no greater than 1/k. For example, in table 1, we learn that Miss Yoga has diabetes by linking the census data table with the patient data table on the Birthday, Sex and ZipCode attributes, even though the identifier has been removed. What if the link cannot uniquely determine a record? Then the attacker has no ability to identify sensitive information with full confidence. How can the patient table in Table 1 be made to meet 2-anonymity? One practical way is to replace the Birthday value with only the year and to replace the last two characters of the ZipCode attribute with *.


Table 1: Illustration of anonymization and k-anonymity

(1) Census Data

Name    Birthday    Sex     ZipCode
Myron   1990/10/01  Male    210044
Yoga    1980/05/22  Female  210022
James   1782/06/23  Male    210001
Sophie  1992/03/12  Female  210012

(2) Patient Data

ID      Work      Birthday    Sex     ZipCode  Disease
231001  Student   1990/10/01  Male    210044   Cardiopathy
231002  Clerk     1980/05/22  Female  210022   Diabetes
231003  Official  1990/08/12  Male    210021   Flu
231004  HR        1980/02/25  Female  210012   Cancer

K-anonymity has been extensively studied in recent years [5,6,7,8]. After 2-anonymization, it cannot be inferred that Miss Yoga has diabetes; she may instead have cancer, because in the patient data table there are two records that can be linked to Miss Yoga's record in the census data table. We can see that k-anonymity has an effective impact in this scenario, as the sketch below illustrates.
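To make the operation concrete, here is a minimal Python sketch (our own, not from the original paper; the helper names are ours) that applies the two generalizations just described to the patient data of Table 1 and checks the k-anonymity condition by counting equivalence classes:

```python
from collections import Counter

def generalize(record):
    """Apply the two generalizations from the text: keep only the
    birth year and mask the last two ZipCode digits."""
    rec = dict(record)
    rec["Birthday"] = rec["Birthday"][:4]        # 1990/10/01 -> 1990
    rec["ZipCode"] = rec["ZipCode"][:4] + "**"   # 210044 -> 2100**
    return rec

def is_k_anonymous(records, quasi_identifiers, k):
    """Every equivalence class over the quasi-identifiers must have
    at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(size >= k for size in groups.values())

patients = [
    {"Work": "Student",  "Birthday": "1990/10/01", "Sex": "Male",   "ZipCode": "210044", "Disease": "Cardiopathy"},
    {"Work": "Clerk",    "Birthday": "1980/05/22", "Sex": "Female", "ZipCode": "210022", "Disease": "Diabetes"},
    {"Work": "Official", "Birthday": "1990/08/12", "Sex": "Male",   "ZipCode": "210021", "Disease": "Flu"},
    {"Work": "HR",       "Birthday": "1980/02/25", "Sex": "Female", "ZipCode": "210012", "Disease": "Cancer"},
]
qi = ["Birthday", "Sex", "ZipCode"]
print(is_k_anonymous(patients, qi, 2))                           # False
print(is_k_anonymous([generalize(r) for r in patients], qi, 2))  # True: two classes of size 2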

1.3 Paper Overview

This paper mainly covers four topics: privacy models, anonymity operations, information metrics and anonymization algorithms. Because different kinds of attacks are used to steal private information, different privacy preserving models are formed for these attacks accordingly. Every privacy preserving model has its own features, so researchers propose theories and methods for each type of attack. Algorithm implementations are based on a specific theory and methodology, so each anonymity algorithm belongs to a specific privacy preserving model. Anonymity operations and information metrics are the details of the algorithms. The anonymity operation is the core of an algorithm; an algorithm usually commits to one or two operations and finally makes the processed dataset meet the privacy requirement. The information metric is incorporated into the algorithm to guide its anonymization process or execution, so that it obtains a good result rather than just an arbitrary one. Therefore, these four topics are essential parts of privacy preserving data publishing.

There are several essential operations for implementing data anonymization: generalization, suppression, anatomization, permutation and perturbation. Generalization and suppression usually replace the specific values of quasi-identifiers with more general values. Generally, a taxonomy tree structure exists for each quasi-identifier and is used for the replacement. Anatomization and permutation decouple the correlation between quasi-identifiers and the sensitive attribute by separating them into two datasets. Perturbation distorts the dataset by means of adding noise, exchanging values or generating synthetic data, which must keep some statistical properties of the original dataset.

This paper focuses on anonymization algorithms using generalization and suppression, which are the most frequently used anonymity operations for implementing k-anonymity. Chapter 2 presents the privacy preserving models of PPDP, putting emphasis on the privacy model for the record linkage attack. Chapter 3 introduces the categories of anonymity operations in PPDP. The majority of this paper introduces different information metric criteria and the relevant algorithms using generalization and suppression, which are covered in chapter 4. Finally, chapter 5 gives a summarizing conclusion.

2 PRIVACY PRESERVING MODEL FOR ATTACKS

The rigorous definition of privacy protection by Dalenius [9] is that access to the published dataset should not increase an adversary's ability to gain extra information about individuals, even in the presence of background knowledge. However, it is impossible to quantify the scope of background knowledge. Therefore, a transparent hypothesis taken by much of the PPDP literature is that the adversary has limited background knowledge. According to the adversary's attack principle, attack models can be classified into two categories: linkage attacks and probabilistic attacks.

2.1 Privacy Model for Attacks

In a linkage attack, the adversary steals sensitive information by means of linking with the released dataset. It has three types: record linkage, attribute linkage and table linkage.


That the quasi-identifiers are known by the adversary beforehand is the common characteristic of linkage attacks. Furthermore, in the record linkage and attribute linkage scenarios, the adversary also grasps the basic information of individuals and wants to learn their sensitive information, while the table linkage attack puts more emphasis on whether a known individual's information is present in the released dataset. The privacy model for record linkage will be elaborated in the next section, which is an important part of this paper.

In an attribute linkage attack, the adversary can infer sensitive information from the released dataset based on the distribution of sensitive values in the group that the individual belongs to. A successful inference is possible even on a published dataset that satisfies the qualifications of k-anonymity. The common effective solution to the attribute linkage attack is to lessen the correlation between the quasi-identifiers and sensitive attributes of the original dataset. Certainly, other models have also bloomed recently for capturing this kind of attack, such as ℓ-diversity [10] and recursive (c, ℓ)-diversity [11], (X, Y)-Anonymity [12], (a, k)-Anonymity [13], (k, e)-Anonymity [14], t-closeness by Li et al. [15], personalized privacy by Xiao and Tao [16] and so on.

Table linkage is different from both record linkage and attribute linkage. In the table linkage attack, the presence or absence of an individual's record in the released table already reveals sensitive information about that individual. Nergiz et al. proposed the theory of δ-presence to prevent table linkage and further bound the probability of inferring the occurrence of an individual's record within a given range [17].

The probabilistic attack describes scenarios in which the adversary does not immediately extract sensitive information from the released dataset; instead, the released dataset does the adversary a favor by increasing his or her background knowledge to some extent. This kind of attack is called a probabilistic attack because a visible shift toward gaining sensitive information appears after accessing the released dataset. A probabilistic attack is unlike a linkage attack, in which the adversary precisely knows individual information and then gains sensitive information by combining it with existing background knowledge; it focuses on the change in the adversary's probabilistic confidence of obtaining private information after acquiring the published dataset. A privacy model for this attack needs to ensure that the change in probabilistic confidence is relatively small after obtaining the published dataset. Some insightful notions for the probabilistic attack are (c, t)-isolation [18], ε-differential privacy [19], (d, γ)-privacy [20], distributional privacy [21] and so on. Each privacy preserving model has unique features determined by the details of the vicious attack, so the algorithms belonging to a specific privacy model are customized and targeted at settling a particular attack situation.

2.2 Privacy Model for Record Linkage

For the record linkage attack, we must first learn the definition of an equivalence class. Records whose values under the projection onto the quasi-identifiers are the same form a group, and many such groups make up the dataset; those groups are called equivalence classes. In the original dataset, the size of equivalence classes varies dramatically. In the worst situation, if a record known to the attacker matches a group in the released dataset with only one record, the private information of the individual related to that record will be leaked. For example, suppose the dataset in Table 2(1) needs to be released, it is published without any anonymity operations, and the adversary has the background knowledge of Table 2(2). We can readily find that Myron, who was born in Nanjing, China in 1990, has regular headaches, by linking the two datasets in table 2 on Birthday, Sex and ZipCode; these three attributes are the quasi-identifiers of this attack according to the definition introduced above. K-anonymity is a method against the record linkage attack which guarantees that the size of each equivalence class is greater than or equal to a given value k by replacing specific values with more general values. The probability of uniquely inferring the sensitive information of an individual known to the adversary is then at most 1/k, which safeguards individuals' privacy to a large extent. Each quasi-identifier has a taxonomy tree structure in which the generalization extent increases from the leaves to the root node. Empirically, every categorical quasi-identifier has a predetermined taxonomy tree, while the taxonomy tree of a numerical quasi-identifier is generated dynamically during the execution of the anonymity algorithm; in addition, a specific value of a numeric attribute is replaced by a well-partitioned range in generalization. The taxonomy tree structures of two quasi-identifiers are shown in figure 2. For example, in the taxonomy tree of the Job attribute, the root node ANY is more general than the node Student, and the parent node Student is more general than its child node Graduate.
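As an illustration (our own sketch, not part of the original paper), the following Python fragment groups records into equivalence classes over a given quasi-identifier list and flags the size-1 classes that make record linkage possible:

```python
from collections import defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Group records by their projection onto the quasi-identifiers;
    each group is one equivalence class."""
    classes = defaultdict(list)
    for rec in records:
        classes[tuple(rec[q] for q in quasi_identifiers)].append(rec)
    return classes

def at_risk(records, quasi_identifiers):
    """Equivalence classes of size 1 are the dangerous ones: a linkage
    on the quasi-identifiers re-identifies their record uniquely."""
    return [qi_values for qi_values, group in
            equivalence_classes(records, quasi_identifiers).items()
            if len(group) == 1]
```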

There are many exquisite methods for solving the data anonymization problem that obey the qualification of k-anonymity or its extensions; detailed descriptions follow in a subsequent chapter. With regard to k-anonymity, most recent works assume that there exists only one quasi-identifier sequence containing all possible attributes. As the number of quasi-identifiers increases, not only does it take more effort to carry out one anonymity operation, but the level of data distortion increases correspondingly. So, some researchers take the distinct standpoint of considering multiple quasi-identifier sequences, which is more flexible than a single quasi-identifier sequence. However, whichever way is chosen, determining the attributes in the quasi-identifier takes many attempts, and no method or theory can deal with all issues in this research area. Afterwards, extensions of k-anonymity were proposed, such as (X, Y)-anonymity and MultiRelational k-anonymity [22].


Fig. 2: Taxonomy tree structures of quasi-identifiers

Table 2: Illustration of record linkage

(1) Patient Data

Work      Birthday    Sex     ZipCode  Disease
Student   1990/10/01  Male    210044   Headache
Clerk     1980/05/22  Female  220022   Diabetes
Official  1990/08/12  Male    210021   Flu
HR        1980/02/25  Female  220012   Cancer

(2) Background Knowledge

Name    Birthday    Sex     ZipCode
Myron   1990/10/01  Male    210044
Yoga    1980/05/22  Female  210022
James   1782/06/23  Male    210001
Sophie  1992/03/12  Female  210012

(3) 2-Anonymous Patient Data

Work      Birthday  Sex     ZipCode  Disease
Student   1990      Male    2100**   Headache
Clerk     1980      Female  2200**   Diabetes
Official  1990      Male    2100**   Flu
HR        1980      Female  2200**   Cancer

(X, Y)-anonymity has stricter constraints than k-anonymity through additional requirements; it is for the scenario in which an individual maps to more than one record in the released dataset, and it requires that the number of distinct values of the attributes Y be greater than or equal to the given k on the projection onto X. MultiRelational k-anonymity expands the boundary of k-anonymity to anonymizing multiple datasets instead of only one. Basically, k-anonymity, (X, Y)-anonymity and MultiRelational k-anonymity constitute the theoretical basis of the privacy model for record linkage.

3 ANONYMITY OPERATIONS

A series of anonymity operations works on the original dataset to make it fulfill the privacy requirement during data anonymization. The frequently used anonymity operations are generalization, suppression, anatomization, permutation and perturbation. Diverse algorithms for privacy preserving data publishing differ in their choice of anonymity operations; to put it another way, the idea of an algorithm is based on some specific anonymity operations.

3.1 Anonymity Operations

Generalization and suppression are the most common anonymity operations used to implement k-anonymity and its extensions, and they are further depicted in the next section. In one sentence, generalization replaces specific values of quasi-identifiers with more general values. Suppression is the ultimate state of the generalization operation; it uses special symbolic characters (e.g. *, &, #) to replace the authentic value and makes the value meaningless. Unlike generalization and suppression, anatomization and permutation do not modify the original dataset but decrease the correlation between quasi-identifiers and the sensitive attribute; generally, the quasi-identifiers and the sensitive attribute are published separately. Quite a few researches make use of these two anonymity operations [23,24,25]. When the purpose is only statistical information, the perturbation operation has the merits of simplicity and efficiency. The main idea of perturbation is to substitute synthetic data for original values while ensuring the statistical characteristics of the original dataset. After the perturbation operation, the dataset is no longer a presentation of the original dataset at all, which is its remarkable trait. Adding noise, swapping data and generating synthetic data are the three common means of perturbation [26,27,28,29,30].

3.2 Generalization and Suppression

Achieving k-anonymity by generalization and suppression leads to a less precise but consistent representation of the original dataset. Comprehensive consideration needs to be given to three vital aspects of PPDP: the privacy requirement, data utility and algorithm complexity.

There are roughly four types of generalization, differing in scope and principle: full-domain generalization, subtree generalization, cell generalization and multidimensional generalization. By the way, specialization is the reverse of the generalization operation.

1) Full-domain generalization [31] was proposed in early PPDP research; it has the smallest search space of the four types of generalization, but it leads to large data distortion. The key of full-domain generalization is that the values of a quasi-identifier must all be generalized to the same level in the given taxonomy tree structure. We use the taxonomy tree structure in figure 2 to explain. Before any anonymity operation, all values stay at the bottom of the taxonomy. If node Undergraduate is generalized to its parent node Student, then node Graduate must be generalized to node Student, and at the same time nodes ITer and HR need to be generalized to node Worker.

2) Subtree generalization [32,33] has a smaller boundary than full-domain generalization. When a node in the taxonomy tree structure is generalized to its parent node, all child nodes of that parent need to be generalized to the parent as well. For example, in figure 2, if node Undergraduate is generalized to its parent node Student, node Graduate also needs to be generalized to Student to meet the requirement of subtree generalization. Unrestricted subtree generalization [34] is similar, except that siblings of the generalized node may remain unchanged: if node Undergraduate is generalized to its parent node Student, it is unnecessary to generalize node Graduate to Student.

3) Cell generalization [35] is slightly different from the schemes above: cell generalization works on single records, while full-domain generalization works on all records in the dataset. The search space of this generalization is significantly larger than that of the others, but the data distortion is relatively small. For example, in figure 2, when node Undergraduate is generalized to its parent node Student, records with the Undergraduate value may remain in the dataset. When the anonymous dataset is used for classification in data mining, it suffers from a data exploration problem; for example, a classifier may not know how to distinguish Undergraduate from Student. Such problems are common traits of local recoding schemes.

4) Multidimensional generalization [34,35,36] applies different generalizations to different combinations of quasi-identifier values. For example, in figure 2, [Undergraduate, Female] can be generalized to [Student, ANY], while [Graduate, Male] is generalized to [ANY, Male]. This scheme has less data distortion than full-domain generalization. It generalizes records by combinations of quasi-identifier values; the contrast with full-domain and cell generalization is sketched below.
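The following toy Python sketch (ours, with an assumed parent map for the Job taxonomy of figure 2) contrasts full-domain generalization, which moves a whole column up the tree, with cell generalization, which touches a single record:

```python
# A toy parent map for the Job attribute of figure 2 (our assumption).
PARENT = {"Undergraduate": "Student", "Graduate": "Student",
          "ITer": "Worker", "HR": "Worker",
          "Student": "ANY", "Worker": "ANY"}

def generalize_full_domain(values, levels=1):
    """Full-domain generalization: every value of the attribute moves up
    the same number of levels in the taxonomy tree."""
    out = list(values)
    for _ in range(levels):
        out = [PARENT.get(v, v) for v in out]
    return out

def generalize_cell(values, index, target):
    """Cell generalization: only the chosen record's value is replaced,
    so 'Undergraduate' and 'Student' may coexist in the column."""
    out = list(values)
    out[index] = target
    return out

jobs = ["Undergraduate", "Graduate", "ITer", "HR"]
print(generalize_full_domain(jobs))         # ['Student', 'Student', 'Worker', 'Worker']
print(generalize_cell(jobs, 0, "Student"))  # ['Student', 'Graduate', 'ITer', 'HR']
```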

In most generalization schemes, suppression operations are mixed into the process of data anonymization, and without any doubt there exist theories and techniques that perform data anonymization using suppression alone [37,38,39]. Like the categories of generalization, there are five kinds of suppression: attribute suppression [34], record suppression [40], value suppression [41], cell suppression [42] and multidimensional suppression [43]. Attribute suppression suppresses all values of an attribute. Record suppression means suppressing whole records. Value suppression refers to suppressing a given value everywhere in the dataset, while cell suppression, compared with cell generalization, works on a smaller scope and suppresses the given value only in some records. The two simplest kinds are sketched below.
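As a small illustration (ours, not from the paper), value suppression and record suppression can each be written in a line or two:

```python
def value_suppression(column, value, mask="*"):
    """Value suppression: hide every occurrence of a given value."""
    return [mask if v == value else v for v in column]

def record_suppression(records, is_unsafe):
    """Record suppression: drop whole records, e.g. those left in
    equivalence classes smaller than k after generalization."""
    return [r for r in records if not is_unsafe(r)]
```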

4 ALGORITHMS USING GENERALIZATION AND SUPPRESSION

Before introducing anonymity algorithms using generalization and suppression, it is necessary to define minimal anonymity and optimal anonymity. Minimal anonymity means that the original dataset, through a series of anonymity operations, satisfies the predefined requirements and the sequence of anonymity operations used cannot be cut down. Compared with minimal anonymity, the requirements of optimal anonymity algorithms are more rigorous: the dataset after the anonymity operations must satisfy the privacy requirements and, most importantly, the anonymous dataset must contain the maximal quantity of information according to the chosen information metric, which will be described below. Hence, the anonymous dataset with maximal information is chosen from a collection of anonymous datasets that all meet the given privacy requirement. However, previous works have proven that accomplishing optimal anonymity is NP-hard. It is generally accepted that generating an anonymous dataset satisfying the minimal anonymity requirement makes sense and can be achieved efficiently. The typical anonymity algorithms using only generalization and suppression are narrated below, and each belongs to either the minimal anonymity or the optimal anonymity family. First, let us see the explanation of the information metric criteria.

4.1 Information Metric Criteria

During the process of data anonymization, anonymity algorithms need to maximally ensure data utility while satisfying the privacy requirement. As a result, at each step an anonymity algorithm needs qualified metric criteria to guide its progress. As a vital ingredient of the algorithms, it is worth taking some effort to describe the common information metrics frequently used by privacy preserving algorithms.


Metrics for Privacy Preserving Data Publishing (PPDP) can be roughly divided into two categories: data metrics and search metrics. A data metric is used to compare the discrepancy between the anonymous and original datasets, which mainly concerns data utility. Search metrics are usually incorporated into anonymity algorithms to guide execution toward a superior result; the quality of a search metric directly affects the efficiency of the algorithm and the effectiveness of the anonymous result. According to the purpose of a metric, general metrics and trade-off metrics naturally appear. As the name suggests, an anonymous dataset restricted by a general purpose metric can be applied in all kinds of application areas. A trade-off metric adds some additional measurement for a specific purpose, so an anonymous dataset confined by it has a narrow scope of application but good performance there. When the application area consuming the anonymous dataset is uncertain, the best choice is a general purpose metric, on the condition that the relevant requirements are satisfied. However, to gain better analysis results from an anonymous dataset in a specific area, it is generally acknowledged that a trade-off metric should be used instead of a general purpose metric.

4.1.1 General Metric

A general metric is used in the scenario where the data publisher has no knowledge about the application area of the published dataset. For this reason, a general metric leverages all kinds of factors to fit different situations as much as possible by measuring the discrepancy between the anonymous and original datasets. In early works [42,43,44,45], the definition of a data metric was not fully formed; there was just the concept of minimal distortion that the corresponding algorithm had to comply with. For example, one measure of information loss was the number of generalized entries in the anonymous dataset, and some algorithms judged data quality by the ratio of the sizes of the anonymous and original datasets.

The Loss Metric (LM) for a record is computed by summing a normalized information loss over the quasi-identifiers. The information loss IL_attr of a categorical attribute is computed by

$IL_{attr} = \frac{N_g - 1}{N - 1}$  (1)

In the formula, N_g is the number of leaf values covered by the node to which the current value is generalized, and N is the total number of leaf values of this quasi-identifier. For example, if Undergraduate is generalized to Student in figure 2, then from the taxonomy tree structure of the Job attribute we get IL_attr = 1/3. If IL_attr equals 0, the value is not generalized. This formula only fits categorical attributes.

Fig. 3: Taxonomy tree structure of a numeric quasi-identifier

The calculation of the information loss IL_attr for a numeric attribute is slightly different from that for categorical attributes, as shown below. Usually, a value of a numeric attribute with domain [L, U] is generalized to a specific range [L_g, U_g], so the information loss of this generalization can be calculated by formula 2. For example, if the value 45 of the Age attribute in figure 3 is generalized to the range (0-45], and the total range of the Age attribute is (0-99], the information loss IL_attr of this generalization is 45/99.

$IL_{attr} = \frac{U_g - L_g}{U - L}$  (2)

The loss metric IL_record of a record is calculated as the sum of the information losses of its quasi-identifiers, under the assumption that each quasi-identifier has equal weight.

$IL_{record} = \sum_{attr \in record} IL_{attr}$  (3)

Therefore, the loss metric for the whole dataset, IL_table, is based on the information loss of the generalized records, as the formula below shows. This series of information metrics is also adopted by [46,47].

$IL_{table} = \sum_{record \in table} IL_{record}$  (4)
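A compact Python rendering of formulas (1)-(4) (our own sketch; the function names are ours) reproduces the running examples:

```python
def il_categorical(n_g, n):
    """Formula (1): n_g = leaf values covered by the generalized node,
    n = total leaf values of the quasi-identifier."""
    return (n_g - 1) / (n - 1)

def il_numeric(u_g, l_g, u, l):
    """Formula (2): generalized range [l_g, u_g] inside the domain [l, u]."""
    return (u_g - l_g) / (u - l)

def il_record(attribute_losses):
    """Formula (3): equal-weight sum over the quasi-identifiers."""
    return sum(attribute_losses)

def il_table(record_losses):
    """Formula (4): sum over all records."""
    return sum(record_losses)

# Undergraduate -> Student covers 2 of the 4 leaf jobs: (2-1)/(4-1) = 1/3
print(il_categorical(2, 4))
# Age 45 generalized to (0, 45] inside (0, 99]: 45/99
print(il_numeric(45, 0, 99, 0))
```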

Another quality measure is based on the sizes of the equivalence classes E in a dataset D. The discernibility metric C_DM assigns each record t in D a penalty determined by the size of the equivalence class containing t: if a record belongs to an equivalence class of size s, its penalty is s, and if a tuple is suppressed, it is assigned a penalty of |D|. This penalty reflects the fact that a suppressed tuple cannot be distinguished from any other tuple in the dataset. In formula 5, the first sum computes the penalties for the non-suppressed tuples and the second for the suppressed ones.

$C_{DM}(g,k) = \sum_{|E| \geq k} |E|^2 + \sum_{|E| < k} |D| \cdot |E|$  (5)

Another metric, proposed by Iyengar [32], is used in privacy preserving data mining. This classification metric assigns no penalty to an unsuppressed tuple if it belongs to the majority class within its induced equivalence class.


All other tuples are penalized with a value of 1. The minority function in formula 6 accepts a set of class-labeled tuples and returns the subset of tuples belonging to any minority class with respect to that set.

$C_{CM}(g,k) = \sum_{|E| \geq k} |minority(E)| + \sum_{|E| < k} |E|$  (6)

The normalized average equivalence class size metric C_AVG has been proposed as an alternative. The intention of this metric is to measure how well the partitioning approaches the optimal case in which each record is generalized into an equivalence class of k indistinguishable records. This sort of information metric also has many supporters [48].

$C_{AVG} = \frac{total\ records / total\ equiv\ classes}{k}$  (7)
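The two class-size metrics translate directly into code; the sketch below (ours) evaluates them on the 2-anonymous patient table of Table 2(3), which has two equivalence classes of size 2:

```python
def discernibility(class_sizes, k, dataset_size):
    """Formula (5): |E|^2 for classes of size >= k, |D|*|E| for the rest
    (treated as suppressed)."""
    return sum(s * s if s >= k else dataset_size * s for s in class_sizes)

def avg_class_size(total_records, total_classes, k):
    """Formula (7): 1.0 is the optimal case of size-k classes."""
    return (total_records / total_classes) / k

sizes = [2, 2]   # the 2-anonymous patient table of Table 2(3)
print(discernibility(sizes, k=2, dataset_size=4))  # 2^2 + 2^2 = 8
print(avg_class_size(4, 2, k=2))                   # 1.0
```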

The precision metric is based on the taxonomy tree structure. Let D(A_1, ..., A_{N_a}) be the original dataset and D′(A_1, ..., A_{N_a}) the anonymous dataset. Then

$prec(D') = 1 - \frac{\sum_{j=1}^{N_a} \sum_{i=1}^{|D|} h_{ij} / |VGH_{A_j}|}{|D| \cdot N_a}$  (8)

where N_a is the number of quasi-identifiers, h_{ij} is the height to which the value of A_j in record i has been generalized, and |VGH_{A_j}| is the total height of the taxonomy tree structure of A_j. The smaller prec(D′) is, the greater the information loss.
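Under that reading of formula (8), a small Python sketch (ours; the generalization heights used below are an assumed reading of Table 2(3)) computes the precision:

```python
def precision(heights, tree_heights, n_records):
    """Formula (8): heights[j][i] is how many levels record i was
    generalized on quasi-identifier j; tree_heights[j] is the total
    height |VGH| of that attribute's taxonomy tree."""
    n_a = len(tree_heights)
    loss = sum(heights[j][i] / tree_heights[j]
               for j in range(n_a) for i in range(n_records))
    return 1 - loss / (n_records * n_a)

# Assumed reading of Table 2(3): Birthday and ZipCode each generalized
# one level up a two-level tree, for all four records.
print(precision([[1] * 4, [1] * 4], [2, 2], 4))  # 0.5
```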

4.1.2 Trade-off Metric

A trade-off metric takes both maximal data information and minimal privacy disclosure into consideration and strikes a superior balance between the two requirements at each step of algorithm execution. The difference between trade-off metrics and general metrics is that a trade-off metric puts more emphasis on the application scope of the anonymous dataset provided by the data publisher, rather than treating the two requirements separately, while a general metric considers more the extent of data distortion between the anonymous and original datasets.

Wang et al. [49] adapt the information loss metric based on information entropy. Their application scenario is that the anonymous dataset will be used for classification in data mining, and it requires that a classification model generated from the anonymous dataset has roughly the same effectiveness as the classification model from the original dataset. Hence, the information metric needs to consider how the anonymous dataset is used for the construction of the classification model; that is to say, it needs to know the approach and mechanism for generating the model. Generating a classification model by means of a decision tree is one common approach, and information gain based on entropy is one viable measure for selecting the splitting attribute and thus generating the structure of the decision tree step by step.

It is therefore reasonable to use an entropy-based information metric to restrict the data anonymization of the original dataset at each step; doing so builds an underlying connection with the classification model throughout the whole process of data anonymization. The formula for the information loss is shown as formula 10. A generalization ({c → p}) is denoted by G; R_c denotes the set of records in the dataset containing c, and R_p denotes the set of records containing p after applying generalization G. We know that |R_p| = Σ_c |R_c|, where |x| represents the number of records in set x. The effect of a generalization G, IP(G), is evaluated by both information loss and anonymity gain.

$IP(G) = I(G)/P(G)$  (9)

The formula for the information loss I(G) is shown below.

$I(G) = Info(R_p) - \sum_c \frac{|R_c|}{|R_p|} \cdot Info(R_c)$  (10)

The formula for Info(R_x), which stands for the entropy of a record set R_x, is shown below. Selecting a quasi-identifier with high entropy means that less information is used to classify records into the relevant class label, reflecting minimal randomness, or purity, of the involved records; that is to say, records in the equivalence class have consistent class labels, the class label being an attribute of the original dataset. When the classification model is generated, the deviation of the record partition is accordingly small. freq(R_x, cls) represents the number of records in R_x with class label cls.

$Info(R_x) = -\sum_{cls} \frac{freq(R_x, cls)}{|R_x|} \times \log_2 \frac{freq(R_x, cls)}{|R_x|}$  (11)

The anonymity gain is calculated by

$P(G) = A_G(VID) - A(VID)$  (12)

VID denotes the sequence of quasi-identifiers (VID = {D_1, ..., D_k}). A(VID) and A_G(VID) denote the anonymity before and after applying generalization G. Without doubt, A_G(VID) ≥ A(VID) and A(VID) ≥ k. A_G(VID) − A(VID) is the redundancy of the anonymity extent. Redundant anonymity is unnecessary under the requirement of k-anonymity; even though it is good for privacy protection, redundancy beyond k-anonymity means more information loss. Comprehensively considering the two factors of information gain and anonymity gain leads to formula 9. The Information-Privacy metric tends to select the quasi-identifier with minimal information loss per extra increment of anonymity gain. If P(G) equals 0, IP(G) is ∞; on this occasion, the information loss I(G) alone is used as the metric to select the quasi-identifier to be generalized.


Another formula for the Information-Privacy metric is

$IP(G) = I(G) - P(G)$  (13)

However, I(G)/P(G) is superior to I(G) − P(G) in handling the different quantitative relationship. Because IP(G) considers the anonymity requirement during algorithm execution, it helps focus the search process on the privacy goal and has a certain look-ahead effect.
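The entropy pieces of formulas (9)-(11) are easy to check numerically; the Python sketch below (our own) computes Info and I(G) for a generalization that merges two child record sets, echoing the Undergraduate/Graduate → Student example:

```python
import math
from collections import Counter

def info(labels):
    """Formula (11): entropy of the class labels of a record set."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_loss(child_label_sets):
    """Formula (10): entropy of the merged set R_p minus the weighted
    entropies of the child sets R_c merged by the generalization c -> p."""
    merged = [lbl for s in child_label_sets for lbl in s]
    weighted = sum(len(s) / len(merged) * info(s) for s in child_label_sets)
    return info(merged) - weighted

def ip(i_g, p_g):
    """Formula (9): information loss per unit of anonymity gained."""
    return math.inf if p_g == 0 else i_g / p_g

# Merging Undergraduate (class labels: yes, yes) with Graduate (no) into
# Student mixes the classes, so the generalization loses ~0.918 bits.
print(info_loss([["yes", "yes"], ["no"]]))
```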

In [50], on the basis of entropy, the monotone entropy metric and the non-uniform entropy metric are proposed. They are simple variants of the entropy metric that respect monotonicity, and their definitions follow.

Definition of monotonicity: Let D be a dataset, let g(D) and g′(D) be two generalizations of D, and let Π be any measure of loss of information. Then Π is called monotone if Π(D, g(D)) ≤ Π(D, g′(D)) whenever g′(D) is a further generalization of g(D).

Definition of the monotone entropy metric: Let D = {R_1, ..., R_n} be a dataset having quasi-identifiers A_j, 1 ≤ j ≤ r, and let g(D) = {R̄_1, ..., R̄_n} be a generalization of D. Then

$\Pi_{me}(D, g(D)) = \sum_{i=1}^{n} \sum_{j=1}^{r} \Pr(\bar{R}_i(j)) \cdot H(X_j \mid \bar{R}_i(j))$  (14)

is the monotone entropy measure of the information loss caused by generalizing D into g(D). The monotone entropy metric coincides with the entropy metric when considering generalization by suppression only, while it penalizes generalizations more than the entropy measure does.

Definition of the non-uniform entropy metric: Let D = {R_1, ..., R_n} be a dataset having quasi-identifiers A_j, 1 ≤ j ≤ r, and let g(D) = {R̄_1, ..., R̄_n} be a generalization of D. Then

$\Pi_{ne}(D, g(D)) = \sum_{i=1}^{n} \sum_{j=1}^{r} -\log \Pr(R_i(j) \mid \bar{R}_i(j))$  (15)

is the non-uniform entropy metric of the information loss caused by generalizing D into g(D).

Both the entropy and the monotone entropy metric are uniform over all records. For example, suppose the domain of a quasi-identifier is {1, 2}, 99% of the records have the value 1, and the remaining 1% have the value 2. When applying the entropy or monotone entropy metric, the information loss of generalizing 1 and of generalizing 2 to their common parent is the same, which is somewhat unreasonable. The non-uniform entropy metric appears in order to address this situation.

4.2 Optimal Anonymity Algorithms

Samarati [31] proposes an algorithm that uses binary search to find a minimal full-domain generalization, implementing optimal anonymity. In [51], an algorithm called MinGen is proposed for finding a minimal generalization with minimal distortion. MinGen tries to examine all potential full-domain generalizations in order to find the optimal generalization according to the relevant information metric. However, the work itself points out that an exhaustive search of all possible generalizations is impractical even for a modest-sized dataset.

Incognito [34] is an efficient anonymity algorithm that produces minimal full-domain generalizations to achieve k-anonymity. It takes advantage of two key features of dynamic programming, namely bottom-up aggregation along dimensional hierarchies and a priori aggregate computation. Moreover, it uses a generalization lattice and three properties to facilitate finding minimal full-domain generalizations. A brief explanation of figure 4: the leftmost two structures are the taxonomy trees of the ZipCode and Sex quasi-identifiers, and the rightmost structure is the two-attribute lattice assembled from the tree structures of ZipCode and Sex. The three related properties are the generalization property, the rollup property and the subset property, shown below in detail. The Incognito algorithm generates all possible k-anonymous full-domain generalizations of the original dataset based on the subset property. It checks whether single attributes reach the requirement of k-anonymity at the start of execution, then iterates, increasing the size of the quasi-identifier subsets to form larger multi-attribute generalization lattices until it reaches the scale of the given quasi-identifiers.

Generalization Property: Let T be a dataset, and let P and Q be sets of attributes in T such that D_P <_D D_Q. If T is k-anonymous with respect to P, then T is also k-anonymous with respect to Q.

Rollup Property: Let T be a dataset, and let P and Q be sets of attributes in T such that D_P <_D D_Q. If we have f_1, the frequency set of T with respect to P, then we can generate the counts of f_2, the frequency set of T with respect to Q, by summing the counts in f_1 associated by γ with each value set of f_2.

Subset Property: Let T be a dataset, and let Q be a set of attributes in T. If T is k-anonymous with respect to Q, then T is k-anonymous with respect to any set of attributes P such that P ⊆ Q.
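The subset property is what makes a priori-style pruning possible: if a subset of the quasi-identifiers already violates k-anonymity, no superset needs to be checked. The simplified Python sketch below (ours; the real Incognito additionally searches each subset's generalization lattice) illustrates only this pruning idea:

```python
from collections import Counter
from itertools import combinations

def k_anonymous(records, attrs, k):
    """True when every equivalence class over attrs has >= k records."""
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    return all(c >= k for c in counts.values())

def anonymous_subsets(records, quasi_identifiers, k):
    """A priori-style use of the subset property: once a set of attributes
    violates k-anonymity, every superset violates it too, so supersets of
    failed sets are pruned without being checked."""
    failed, passed = [], []
    for size in range(1, len(quasi_identifiers) + 1):
        for subset in combinations(quasi_identifiers, size):
            if any(set(f) <= set(subset) for f in failed):
                continue                     # pruned by the subset property
            if k_anonymous(records, subset, k):
                passed.append(subset)
            else:
                failed.append(subset)
    return passed
```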

K-Optimize [40] implements optimal anonymity by pruning unsatisfactory released tables using a set enumeration tree. The difference between K-Optimize and Incognito is that K-Optimize uses subtree generalization and record suppression, while Incognito incorporates full-domain generalization into its scheme.

4.3 Minimal Anonymity Algorithms

The µ-argus algorithm [52] is the earliest work achieving privacy preservation by generalization and suppression. It can handle a class of disclosure control rules based on checking the frequency of certain attribute combinations (quasi-identifiers).


Fig. 4: Multi-attribute generalization lattice

It makes decisions in favor of bin sizes when generalizing and suppressing values of the quasi-identifier. The bin size represents the number of records matching the characteristics. The user provides an overall bin size and marks sensitive attributes by assigning each attribute a value ranging from 0 to 3. Then the µ-argus algorithm identifies and handles combinations (equivalence classes) of small size. The handling covers two disclosure control measures, namely subtree generalization and cell suppression, called global recoding and local suppression in that work respectively, which are applied simultaneously to the dataset. Eventually, it implements k-anonymity by computing the frequency of all 3-value combinations of domain values and greedily applying anonymity operations. The µ-argus algorithm has a serious limitation in that it is only suitable for combinations of 3 attributes; therefore the anonymous dataset may not meet the requirement of k-anonymity when the size of the quasi-identifier exceeds three.

The Datafly system [53], proposed by Sweeney in 1998, guarantees anonymity of medical data by automatically generalizing, substituting, inserting and removing specific information without losing the substantial information of the original dataset. The work summarizes three major difficulties in providing an anonymous dataset: first, anonymity is in the eye of the beholder; second, unique and unusual information appearing within the dataset is a concern when producing an anonymous dataset; and third, how to measure the degree of anonymity in released data. The Datafly system was constructed with these three difficulties in mind. Datafly is an interactive model: users provide an overall anonymity level that determines the minimum bin size allowable for each field. Like the bin size in the µ-argus algorithm, the minimal bin size reflects the smallest number of records matching the characteristics; as the minimal bin size increases, the dataset becomes more anonymous. The Datafly system calculates the size of each equivalence class induced by the quasi-identifiers, stores the sizes in an array, and generalizes those equivalence classes whose size is less than k based on a heuristic search metric that selects the quasi-identifier with the largest number of distinct values. Datafly uses full-domain generalization and record suppression to implement data anonymization.
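One round of that greedy heuristic can be sketched in a few lines of Python (ours; the real system also suppresses the equivalence classes that remain too small after generalization):

```python
from collections import Counter

def datafly_step(records, quasi_identifiers, parents, k):
    """One greedy round in the spirit of Datafly: while some equivalence
    class is smaller than k, generalize the quasi-identifier with the most
    distinct values one level up; parents[q] maps a value to its parent."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    if all(c >= k for c in counts.values()):
        return True                              # already k-anonymous
    target = max(quasi_identifiers,
                 key=lambda q: len({r[q] for r in records}))
    for r in records:
        r[target] = parents[target].get(r[target], r[target])
    return False
```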

Iyengar [32] was the first to introduce a genetic algorithm to achieve privacy preservation. The core of this work is how to model the problem of privacy preserving data publishing so that the model fits a genetic algorithm. With the help of the taxonomy tree structures, it encodes each state of generalization as a chromosome. At each iteration, new solutions (new chromosomes) are generated by means of crossover and mutation to achieve better results, where a better solution is determined by the information metric. Applying a solution mainly includes two parts, namely conducting the generalization operation (specifically subtree generalization) and carrying out suppression as needed. Executing a genetic algorithm requires setting parameters such as the population size, the probability of mutation and the number of iterations; in this work, the population size is 5000, the number of iterations is 0.5 million, and the probability of mutation is set to 0.002. It takes approximately 18 hours to transform a training set of 30k records. From the experimental results and the time consumed, we can infer that using a genetic algorithm for privacy preservation is time consuming, impractical and inefficient for very large datasets.

Wang et al. [49] propose bottom-up generalization, a data mining solution to privacy preserving data publishing. Briefly, the algorithm computes IP(G) for each generalization G, finds the best generalization G_best according to IP(G), which is the information metric of this algorithm, and applies G_best to the dataset, repeating until the anonymous dataset satisfies the requirement of k-anonymity. It is well known that calculating IP(G) for every generalization G is infeasible. In this situation, the work defines a critical generalization: a generalization G is critical if A_G(VID) > A(VID), where A(VID) and A_G(VID) are the anonymity before and after applying G. So, computing A_G(VID) for each critical generalization replaces computing IP(G) for every generalization G.


The purpose of modifying the algorithm steps is to reduce unnecessary calculation by pruning redundant generalizations, justified by proved theory. It uses a TEA (Taxonomy Encoded Anonymity) index, two observations and one theorem to prune unnecessary generalizations and improve the efficiency of the algorithm. The TEA index for the sequence of quasi-identifiers is a tree of k levels; each level of the TEA index represents the current generalization state of one quasi-identifier, and each leaf node stores the size of the equivalence class determined by the path from that leaf to the root. With the theorem shown below, it is sufficient to find the critical generalizations.

Theorem 1: G is critical only if every anonymity vid is generalized by some size-k segment of G, where k > 1. At most |VID| generalizations satisfy this "only if" condition, where |VID| denotes the number of attributes in VID (the set of quasi-identifiers).
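Putting the pieces together, the control loop of bottom-up generalization looks roughly like the following Python skeleton (ours; anonymity, apply_gen, info_loss and anonymity_gain are assumed helper functions standing in for the paper's machinery, including the TEA index and the pruning above):

```python
def bottom_up_generalization(dataset, candidates, k, anonymity, apply_gen,
                             info_loss, anonymity_gain):
    """Skeleton of the bottom-up scheme of [49]: repeatedly apply the
    generalization with the lowest IP(G) = I(G)/P(G) until the dataset
    reaches k-anonymity."""
    while anonymity(dataset) < k:
        def score(g):
            gain = anonymity_gain(dataset, g)    # P(G); > 0 only for the
            if gain <= 0:                        # critical generalizations
                return float("inf")
            return info_loss(dataset, g) / gain  # IP(G) = I(G)/P(G)
        best = min(candidates, key=score)
        dataset = apply_gen(dataset, best)
    return dataset
```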

Top-down specialization [54,55] for privacy preserving can be understood by comparison with bottom-up generalization. Specialization is the reverse operation of generalization. TDS (Top-Down Specialization) starts at the most general state of the dataset according to the taxonomy tree structures of the quasi-identifiers and chooses the best specialization attribute, measured by a trade-off metric, to specialize the dataset until further specialization would breach the anonymity requirement. Compared with bottom-up generalization, the TDS algorithm is more flexible in that it can stop at any specialized state of the dataset according to users' requirements. Another merit of TDS is that it takes multiple sequences of quasi-identifiers into account. The Mondrian multidimensional algorithm proposed by LeFevre [36] achieves minimal anonymity by means of top-down specialization; the difference from the TDS scheme is that the Mondrian multidimensional algorithm can specialize one equivalence class rather than all equivalence classes containing the specific value of the selected quasi-identifier. Fung et al. also propose a k-anonymity algorithm to perform privacy preservation for cluster analysis in [56].

Using multidimensional suppression fork-anonymityis proposed by Slava et al. [57]. The algorithm(KACTUS: K-Anonymity Classification Tree UsingSuppression) can generate effective anonymous datasetapplied for the scenario of classification model. Valuesuppression is applied only on certain tuples dependingon other quasi-identifier values. Because of only usingsuppression, it is unnecessary to build taxonomy treestructure for each quasi-identifier which is a relaxationcompared with priori works equipped with predeterminedtaxonomy tree for categorical attributes andruntime-generated taxonomy tree structure for numericattributes. The goal of this work is to generate anonymousdataset to reach the target that the classifier performancetrained on anonymous dataset is approximately similar tothe performance of classifier trained on the original

dataset. KACTUS is composed of two main phases. In the first phase, it generates a classification tree trained on the original dataset; various top-down classification inducers (e.g. the C4.5 algorithm) can be used for this. Notably, the decision tree is trained only on the given quasi-identifiers. In the second phase, the classification tree generated in the first phase is used to produce the anonymous dataset. Anonymization proceeds over the classification tree in a bottom-up manner. Every leaf node records the number of records in the equivalence class represented by the path from the root to that leaf. If all leaf nodes in the classification tree meet the requirement of k-anonymity, no quasi-identifier currently in the classification tree needs to be suppressed. If some leaves do not meet the requirement of k-anonymity, adequately pruning them can yield a new leaf that may comply with k-anonymity.
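The second phase can be pictured with the bottom-up pruning sketch below. The Node structure and the '*' marker are our hypothetical simplifications: undersized sibling leaves are merged into one '*' leaf, which stands for suppressing, in the affected records, the quasi-identifier value tested at the parent node. KACTUS itself applies more refined rules.

    class Node:
        # A classification-tree node; a leaf's count is the size of the
        # equivalence class given by the root-to-leaf path.
        def __init__(self, children=None, count=0):
            self.children = children if children is not None else {}  # value -> Node
            self.count = count

    def prune_for_k(node, k):
        if not node.children:                # leaf: nothing to prune
            return node
        node.children = {v: prune_for_k(c, k) for v, c in node.children.items()}
        small = [v for v, c in node.children.items()
                 if not c.children and c.count < k]
        if small:
            merged = sum(node.children[v].count for v in small)
            for v in small:
                del node.children[v]
            # '*' marks suppression of this node's test attribute; if the
            # merged leaf is still below k, a fuller implementation would
            # push its records further up the tree.
            node.children['*'] = Node(count=merged)
        return node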

The anonymity operations of generalization and suppression are still widely used in recent research. Generalization is employed in [58] to improve DNALA. An appropriately robust privacy model for data anonymity, called β-likeness, is proposed in [59]; one of its anonymization schemes is based on generalization. The work in [60] focuses on generalization hierarchies for numeric attributes, extends a previous method, and validates the proposal with the help of existing information metrics. An information-based algorithm using generalization and suppression is proposed by Li et al. [61] for classification utility. Sun et al. [62] improve the p-sensitive k-anonymity model and propose two privacy requirements, namely the p+-sensitive k-anonymity and (p, α)-sensitive k-anonymity properties; anonymous datasets are generated in a top-down specialization manner by specializing values step by step. Wong et al. find that anonymization error can be reduced by employing non-homogeneous generalization [63]. A generalization hierarchy is incorporated to implement user-oriented anonymization for public data in [64]. Anjum et al. [65] also use the generalization operation to cope with the problem of sequential release under arbitrary updates. Mahesh et al. [66] propose a new method to protect individuals' sensitive data from record and attribute linkage attacks by setting range values and record elimination.

5 CONCLUSION

Information sharing is becoming an indispensable part of the lives of individuals and organizations, and privacy preserving data publishing, regarded as an essential guarantee for information sharing, is receiving increasing attention from all over the world. Put simply, the role of privacy preserving data publishing is to transform the original dataset from one state to another so as to avoid privacy disclosure and withstand diverse attacks.


In this paper, we first discuss the privacy models of PPDP in detail, mainly introducing the privacy preserving models for record linkage and the anonymity operations. Information metrics with different purposes, an essential ingredient of anonymity algorithms, are collected. Subsequently, more emphasis is put on anonymization algorithms with specific anonymity operations, namely generalization and suppression. This paper may help researchers sketch the profile of anonymity algorithms for PPDP by means of generalization and suppression. Our further research directions are listed below.

A) Hybrid k-anonymity algorithms. The implementation of k-anonymity is simple and can adapt to different scenarios, so mixing k-anonymity with other anonymity techniques will be an effective scheme.

B) Background knowledge attack simulation to keep information safe. It is difficult to accurately simulate the background knowledge of attackers, yet different background knowledge causes privacy breaches to varying degrees. Part of our research will be to find ways of simulating attackers' background knowledge so as to provide all-round protection for privacy.

C) Information metrics. This paper has given an overall summary of information metrics, from which we can see that different metrics fit different PPDP scenarios. Studying new information metrics or improving existing ones will be part of further research.

D) Anonymity constraints over multiple sensitive attributes. Existing studies focus on the anonymization of a single sensitive attribute and cannot simply be shifted to the multiple sensitive attribute problem. Therefore, effective anonymity algorithms with multidimensional constraints need to be studied. In addition, the difficulties of implementing personalized anonymity efficiently and of choosing quasi-identifiers exactly are also worthy of further thought and study.

ACKNOWLEDGEMENT

This work was supported in part by the China Postdoctoral Science Foundation (No. 2012M511303), the National Science Foundation of China (No. 61173143), and the Special Public Sector Research Program of China (No. GYHY201206030), and was also supported by PAPD. The authors are grateful to the anonymous referees for a careful checking of the details and for helpful comments that improved this paper.

References

[1] M. S. Wolf, C. L. Bennett, Local perspective of the impact of the HIPAA privacy rule on research, Cancer-Philadelphia Then Hoboken, 106, 474-479 (2006).

[2] Johannes Gehrke, Models and methods for privacy-preserving data publishing and analysis, In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE), 105, (2006).

[3] B. Fung, K. Wang, R. Chen, P. Yu, Privacy-preserving data publishing: A survey of recent developments, ACM Computing Surveys, 42, 1-53 (2010).

[4] L. Sweeney, K-anonymity: A model for protecting privacy, International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems, 10, 557-570 (2002).

[5] R. Dewri, k-anonymization in the presence of publisher preferences, IEEE Transactions on Knowledge and Data Engineering, 23, 1678-1690 (2011).

[6] M. E. Nergiz, Multirelational k-anonymity, IEEE Transactions on Knowledge and Data Engineering, 21, 1104-1117 (2009).

[7] Jiuyong Li, R. C. W. Wong, A. Wai-Chee Fu, Jian Pei, Transaction anonymization by local recoding in data with attribute hierarchical taxonomies, IEEE Transactions on Knowledge and Data Engineering, 20, 1181-1194 (2008).

[8] Tamir Tassa, Arnon Mazza, k-Concealment: An alternative model of k-type anonymity, Transactions on Data Privacy, 189-222 (2013).

[9] T. Dalenius, Towards a methodology for statistical disclosure control, Statistik Tidskrift, 15, 429-444 (1977).

[10] Ahmed Abdalaal, Mehmet Ercan Nergiz, Yucel Saygin, Privacy-preserving publishing of opinion polls, Computers & Security, 143-154 (2013).

[11] A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam, l-diversity: Privacy beyond k-anonymity, In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE), 24, (2006).

[12] Ke Wang, Benjamin C. M. Fung, Anonymizing sequential releases, In Proceedings of the 12th ACM SIGKDD Conference, 414-423 (2006).

[13] Raymond Chi-Wing Wong, Jiuyong Li, Ada Wai-Chee Fu, Ke Wang, (a, k)-anonymity: An enhanced k-anonymity model for privacy preserving data publishing, In Proceedings of the 12th ACM SIGKDD Conference, 754-759 (2006).

[14] Qing Zhang, N. Koudas, D. Srivastava, Ting Yu, Aggregate query answering on anonymized tables, In Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE), 116-125 (2007).

[15] Ninghui Li, Tiancheng Li, S. Venkatasubramanian, t-closeness: privacy beyond k-anonymity and l-diversity, In Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE), 106-115 (2007).

[16] Xiaokui Xiao, Yufei Tao, Personalized privacy preservation, In Proceedings of the ACM SIGMOD Conference, 229-240 (2006).

[17] Mehmet Ercan Nergiz, Maurizio Atzori, Chris Clifton, Hiding the presence of individuals from shared databases, In Proceedings of the ACM SIGMOD Conference, 665-676 (2007).

[18] S. Chawla, C. Dwork, F. McSherry, A. Smith, H. Wee, Toward privacy in public databases, In Proceedings of the Theory of Cryptography Conference (TCC), 363-385 (2005).

[19] C. Dwork, Differential privacy, In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP), 1-12 (2006).

[20] V. Rastogi, D. Suciu, S. Hong, The boundary between privacy and utility in data publishing, In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), 531-542 (2007).

[21] A. Blum, K. Ligett, A. Roth, A learning theory approach to non-interactive database privacy, In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC), 609-618 (2008).

[22] M. E. Nergiz, C. Clifton, A. E. Nergiz, Multirelational k-anonymity, IEEE Transactions on Knowledge and Data Engineering, 21, 1104-1117 (2009).

[23] Xiangmin Ren, Research on privacy protection based on k-anonymity, Biomedical Engineering and Computer Science (ICBECS), 1-5 (2010).

[24] S. Vijayarani, A. Tamilarasi, M. Sampoorna, Analysis of privacy preserving k-anonymity methods and techniques, Communication and Computational Intelligence (INCOCCI), 540-545 (2010).

[25] Wenbing Yu, Multi-attribute generalization method in privacy preserving data publishing, eBusiness and Information System Security (EBISS), 1-4 (2010).

[26] J. Domingo-Ferrer, A survey of inference control methods for privacy preserving data mining, Advances in Database Systems, 34, 53-80 (2008).

[27] N. Shlomo, T. De Waal, Protection of micro-data subject to edit constraints against statistical disclosure, Journal of Official Statistics, 24, 229-253 (2008).

[28] I. Cano, V. Torra, Edit constraints on microaggregation and additive noise, In Proceedings of the International ECML/PKDD Conference on Privacy and Security Issues in Data Mining and Machine Learning (PSDML10), Springer, 1-14 (2010).

[29] Jing Zhang, Xiujun Gong, Zhipeng Han, Siling Feng, An improved algorithm for k-anonymity, Contemporary Research on E-business Technology and Strategy, Communications in Computer and Information Science, 352-360 (2012).

[30] Xiaoling Zhu, Tinggui Chen, Research on privacy preserving based on k-anonymity, Computer, Informatics, Cybernetics and Applications, Lecture Notes in Electrical Engineering, 107, 915-923 (2012).

[31] P. Samarati, Protecting respondents' identities in microdata release, IEEE Transactions on Knowledge and Data Engineering, 13, (2001).

[32] V. Iyengar, Transforming data to satisfy privacy constraints, In Proceedings of the ACM SIGKDD Conference, (2002).

[33] Xuyun Zhang, Chang Liu, Surya Nepal, Jinjun Chen, An efficient quasi-identifier index based approach for privacy preservation over incremental data sets on cloud, Journal of Computer and System Sciences, 542-555 (2013).

[34] K. LeFevre, D. J. DeWitt, R. Ramakrishnan, Incognito: efficient full-domain k-anonymity, In Proceedings of the ACM SIGMOD Conference, 49-60 (2005).

[35] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, A. W. C. Fu, Utility-based anonymization using local recoding, In Proceedings of the 12th ACM SIGKDD Conference, (2006).

[36] K. LeFevre, D. J. DeWitt, R. Ramakrishnan, Mondrian multidimensional k-anonymity, In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE), (2006).

[37] B. C. M. Fung, K. Wang, P. S. Yu, Anonymizing classification data for privacy preservation, IEEE Transactions on Knowledge and Data Engineering, 19, 711-725 (2007).

[38] Rui Chen, Benjamin C. M. Fung, Noman Mohammed, Bipin C. Desai, Ke Wang, Privacy-preserving trajectory data publishing by local suppression, Information Sciences, 83-97 (2013).

[39] Martin Serpell, Jim Smith, Alistair Clark, Andrea Staggemeier, A preprocessing optimization applied to the cell suppression problem in statistical disclosure control, Information Sciences, 22-32 (2013).

[40] R. J. Bayardo, R. Agrawal, Data privacy through optimal k-anonymization, In Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE), 217-228 (2005).

[41] K. Wang, B. C. M. Fung, P. S. Yu, Template-based privacy preservation in classification problems, In Proceedings of the 5th IEEE International Conference on Data Mining (ICDM), 466-473 (2005).

[42] A. Meyerson, R. Williams, On the complexity of optimal k-anonymity, In Proceedings of the 23rd ACM SIGMOD-SIGACT-SIGART PODS, 223-228 (2004).

[43] Slava Kisilevich, Lior Rokach, Yuval Elovici, Bracha Shapira, Efficient multidimensional suppression for k-anonymity, IEEE Transactions on Knowledge and Data Engineering, 22, 334-347 (2010).

[44] B. K. Tripathy, K. Kumaran, G. K. Panda, An improved l-diversity anonymization algorithm, Computer Networks and Intelligent Computing, Communications in Computer and Information Science, 157, 81-86 (2011).

[45] Asaf Shabtai, Yuval Elovici, Lior Rokach, Privacy, data anonymization, and secure data publishing, A Survey of Data Leakage Detection and Prevention Solutions, SpringerBriefs in Computer Science, 47-68 (2012).

[46] Batya Kenig, Tamir Tassa, A practical approximation algorithm for optimal k-anonymity, Data Mining and Knowledge Discovery, 25, 134-168 (2012).

[47] Florian Kohlmayer, Fabian Prasser, Claudia Eckert, Alfons Kemper, Klaus A. Kuhn, Flash: efficient, stable and optimal k-anonymity, In Proceedings of the 2012 ASE/IEEE International Conference on Social Computing and 2012 ASE/IEEE International Conference on Privacy, Security, Risk and Trust, 708-717 (2012).

[48] Y. He, J. F. Naughton, Anonymization of set-valued data via top-down, local generalization, Proceedings of the VLDB Endowment, 2, 934-945 (2009).

[49] K. Wang, P. S. Yu, S. Chakraborty, Bottom-up generalization: A data mining solution to privacy protection, In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM 2004), 249-256 (2004).

[50] A. Gionis, T. Tassa, k-Anonymization with minimal loss of information, IEEE Transactions on Knowledge and Data Engineering, 21, 206-219 (2009).

[51] L. Sweeney, Achieving k-anonymity privacy protection using generalization and suppression, International Journal on Uncertainty, Fuzziness, and Knowledge-based Systems, 10, 571-588 (2002).

[52] A. Hundepool, L. Willenborg, μ- and τ-argus: Software for statistical disclosure control, In Proceedings of the 3rd International Seminar on Statistical Confidentiality, 208-217 (1996).

[53] L. Sweeney, Datafly: A system for providing anonymity in medical data, In Proceedings of the IFIP TC11 WG11.3 11th International Conference on Database Security XI: Status and Prospects, 113, 356-381 (1998).

[54] B. C. M. Fung, K. Wang, P. S. Yu, Top-down specialization for information and privacy preservation, In Proceedings of the 21st IEEE International Conference on Data Engineering (ICDE 05), 205-216 (2005).

[55] Mingquan Ye, Xindong Wu, Xuegang Hu, Donghui Hu, Anonymizing classification data using rough set theory, Knowledge-Based Systems, 82-94 (2013).


[56] B. C. M. Fung, K. Wang, L. Wang, P. C. K. Hung, Privacy-preserving data publishing for cluster analysis, Data & Knowledge Engineering, 68, 552-575 (2009).

[57] Slava Kisilevich, Lior Rokach, Yuval Elovici, Bracha Shapira, Efficient multidimensional suppression for k-anonymity, IEEE Transactions on Knowledge and Data Engineering, 22, 334-347 (2010).

[58] Guang Li, Yadong Wang, Xiaohong Su, Improvements on a privacy-protection algorithm for DNA sequences with generalization lattices, Computer Methods and Programs in Biomedicine, 108, 1-9 (2012).

[59] Jianneng Cao, Panagiotis Karras, Publishing microdata with a robust privacy guarantee, Proceedings of the VLDB Endowment, 5, 1388-1399 (2012).

[60] Alina Campan, Nicholas Cooper, Traian Marius Truta, On-the-fly generalization hierarchies for numerical attributes revisited, Secure Data Management, Lecture Notes in Computer Science, 6933, 18-32 (2011).

[61] Jiuyong Li, Jixue Liu, Muzammil Baig, Raymond Chi-Wing Wong, Information based data anonymization for classification utility, Data & Knowledge Engineering, 70, 1030-1045 (2011).

[62] Xiaoxun Sun, Lili Sun, Hua Wang, Extended k-anonymity models against sensitive attribute disclosure, Computer Communications, 34, 526-535 (2011).

[63] Wai Kit Wong, Nikos Mamoulis, David Wai Lok Cheung, Non-homogeneous generalization in privacy preserving data publishing, In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 747-758 (2010).

[64] Shinsaku Kiyomoto, Toshiaki Tanaka, A user-oriented anonymization mechanism for public data, In Proceedings of the 5th International Workshop on Data Privacy Management and 3rd International Conference on Autonomous Spontaneous Security, 22-35 (2010).

[65] Adeel Anjum, Guillaume Raschia, Anonymizing sequential releases under arbitrary updates, In Proceedings of the Joint EDBT/ICDT 2013 Workshops, 145-154 (2013).

[66] R. Mahesh, T. Meyyappan, Anonymization technique through record elimination to preserve privacy of published data, In Proceedings of the 2013 International Conference on Pattern Recognition, Informatics and Mobile Engineering, 328-332 (2013).

Yang Xu received his Bachelor degree in Computer Science and Engineering from Nanjing University of Information Science & Technology, China, in 2011. Currently, he is a candidate for the degree of Master of Computer Science and Engineering at Nanjing University of Information Science & Technology. His research interests include privacy preserving, data publishing, etc.

Tinghuai Ma is a professor in Computer Sciences at Nanjing University of Information Science & Technology, China. He received his Bachelor (HUST, China, 1997), Master (HUST, China, 2000) and PhD (Chinese Academy of Science, 2003) degrees, and was a post-doctoral associate (AJOU University, 2004). From Nov. 2007 to Jul. 2008, he visited the China Meteorological Administration. From Feb. 2009 to Aug. 2009, he was a visiting professor in the Ubiquitous Computing Lab, Kyung Hee University. His research interests are in the areas of data mining, privacy protection in ubiquitous systems, and grid computing. He has published more than 80 journal/conference papers. He is the principal investigator of several NSF projects. He is a member of IEEE.

Meili Tang is an associate professor in the School of Public Administration at Nanjing University of Information Science & Technology, China. She received her master degree from Huazhong University of Science & Technology, China, in 2000. Her main research interests are in the areas of e-government and data publishing.

Wei Tian is an assistant professor of Computer Sciences at Nanjing University of Information Science & Technology, China. He received his master degree from Nanjing University of Information Science & Technology, China, in 2006. Now he is a doctoral candidate in Applied Meteorology at Nanjing University of Information Science & Technology. His main research interests are in the areas of cloud computing and meteorological data processing.
