A Novel Anonymization Technique for Privacy Preserving Data Publishing
Introduction
• The standard data publishing approach may provide insufficient protection, so privacy must be provided for the published microdata.
• Privacy protection is an important issue in data processing: data must be anonymized to protect privacy.
• The privacy preserving data publishing approach provides methods for publishing useful information while protecting individuals.
• Here, microdata consists of records, each containing information about an individual entity such as a person or an organization.
• The data mining community has focused on hiding sensitive rules generated from transactional databases.
• Before anonymizing the data, one can analyze its characteristics and use them to guide the anonymization.
Literature Survey
• In both the generalization and bucketization approaches, attributes are partitioned into three categories:
1) Identifiers: attributes that can uniquely identify an individual, such as Name or Social Security Number.
2) Quasi-Identifiers (QIs): attributes the adversary may already know and that can potentially identify an individual, e.g., Birthdate, Sex, and Zipcode.
3) Sensitive Attributes (SAs): attributes that are unknown to the adversary and are considered sensitive, such as Disease and Salary.
• In both generalization and bucketization, one first removes identifiers from the data and then partitions the tuples into buckets.
• Generalization then transforms the QI-values in each bucket into more general values.
• In bucketization, one instead separates the SAs from the QIs by randomly permuting the SA values within each bucket.
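The contrast between the two approaches can be sketched in Python on a toy bucket; the attribute values and the ad hoc min-max range hierarchy below are illustrative assumptions, not data from these slides.

```python
import random

# Toy bucket of tuples: QIs = (Age, Zipcode), SA = Disease (illustrative values).
bucket = [
    (22, 47906, "dyspepsia"),
    (25, 47903, "flu"),
    (33, 47905, "flu"),
    (38, 47901, "bronchitis"),
]

def generalize(bucket):
    """Generalization: replace each QI value with a coarser value
    (here, a min-max range over the bucket) so all tuples share
    the same QI signature; the SA stays attached to its tuple."""
    ages = [t[0] for t in bucket]
    zips = [t[1] for t in bucket]
    age_r = f"[{min(ages)}-{max(ages)}]"
    zip_r = f"[{min(zips)}-{max(zips)}]"
    return [(age_r, zip_r, t[2]) for t in bucket]

def bucketize(bucket, rng=None):
    """Bucketization: keep the exact QI values but randomly permute
    the SA values within the bucket, severing the QI-SA link."""
    rng = rng or random.Random(0)
    sas = [t[2] for t in bucket]
    rng.shuffle(sas)
    return [(t[0], t[1], sa) for t, sa in zip(bucket, sas)]

print(generalize(bucket))
print(bucketize(bucket))
```

Note how generalization coarsens the QIs while bucketization leaves them exact, which is why bucketization retains more utility but offers no membership protection.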
Privacy threats
• When publishing microdata, there are three types of privacy disclosure threats:
1) Membership disclosure
2) Identity disclosure
3) Attribute disclosure
Problem Specification
• The classical anonymization techniques for privacy preserving microdata publishing are:
1. Generalization
2. Bucketization
• Generalization loses a considerable amount of information, especially for high-dimensional data, due to the curse of dimensionality.
• Bucketization has better data utility than generalization, but it has some drawbacks.
• It does not prevent membership disclosure.
• In many data sets, it is unclear which attributes are QIs and which are SAs.
• It requires a clear separation between QIs and SAs.
• By separating the QI attributes from the SAs, it breaks the correlation between attributes.
• To overcome the drawbacks of the above approaches, a novel technique called slicing will be implemented.
Slicing
• Slicing partitions the dataset both vertically and horizontally.
• Attributes are grouped into columns, each containing a subset of attributes; this is the vertical partition.
• Slicing also partitions the tuples into buckets, each containing a subset of tuples; this is the horizontal partition.
• Within each bucket, the values in each column are randomly permuted to break the linking between different columns.
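These steps can be sketched in Python; the toy table, the column grouping, and the bucket split below are illustrative assumptions chosen for the example.

```python
import random

# Toy microdata: (Age, Sex, Zipcode, Disease) -- illustrative values.
tuples = [
    (22, "M", 47906, "dyspepsia"),
    (22, "F", 47906, "flu"),
    (33, "F", 47905, "flu"),
    (52, "F", 47905, "bronchitis"),
]

columns = [(0, 1), (2, 3)]            # vertical partition: (Age, Sex), (Zipcode, Disease)
buckets = [tuples[:2], tuples[2:]]    # horizontal partition into two buckets

rng = random.Random(0)
sliced = []
for bucket in buckets:
    # Within a bucket, permute each column's values independently,
    # breaking the linking between the columns.
    permuted = []
    for col in columns:
        vals = [tuple(t[i] for i in col) for t in bucket]
        rng.shuffle(vals)
        permuted.append(vals)
    sliced.append([sum(row, ()) for row in zip(*permuted)])

for b in sliced:
    print(b)
```

Each bucket still contains the same multiset of values per column, so aggregate utility is preserved, but an adversary can no longer tell which (Age, Sex) pair goes with which (Zipcode, Disease) pair.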
Slicing Algorithms
• Our algorithm consists of three phases:
Attribute partitioning
Column generalization
Tuple partitioning
Flow Chart for Project Design
Algorithm
• The algorithm maintains two data structures: a queue of buckets Q and a set of sliced buckets SB.
• Initially, Q contains only one bucket which includes all tuples and SB is empty.
• In each iteration, the algorithm removes a bucket from Q and splits the bucket into two buckets.
• If the sliced table after the split satisfies ℓ-diversity, then the algorithm puts the two buckets at the end of the queue Q.
Cont..
• Otherwise, we cannot split the bucket anymore and the algorithm puts the bucket into SB.
• When Q becomes empty, we have computed the sliced table.
• The set of sliced buckets is SB. The main part of the tuple-partition algorithm is checking whether a sliced table satisfies ℓ-diversity.
Tuple Partitioning Algorithm
Algorithm tuple-partition(T, ℓ)
1. Q = {T}; SB = ∅.
2. while Q is not empty
3.   remove the first bucket B from Q; Q = Q − {B}.
4.   split B into two buckets B1 and B2.
5.   if diversity-check(T, Q ∪ {B1, B2} ∪ SB, ℓ)
6.     Q = Q ∪ {B1, B2}.
7.   else SB = SB ∪ {B}.
8. return SB.
Diversity Check Algorithm
Algorithm diversity-check(T, T*, ℓ)
1. for each tuple t ∈ T, L[t] = ∅.
2. for each bucket B in T*
3.   record f(v) for each column value v in bucket B.
4.   for each tuple t ∈ T
5.     calculate p(t, B) and find D(t, B).
6.     L[t] = L[t] ∪ {(p(t, B), D(t, B))}.
7. for each tuple t ∈ T
8.   calculate p(t, s) for each s based on L[t].
9.   if p(t, s) ≥ 1/ℓ, return false.
10. return true.
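A runnable sketch of the two routines above, with one simplification flagged in the comments: instead of computing the matching probabilities p(t, B) and p(t, s), the stand-in check only requires that no sensitive value reaches a 1/ℓ fraction of any bucket. The median split rule and the field layout (sensitive value last) are assumptions for the example.

```python
from collections import Counter, deque

def diversity_check(buckets, ell):
    """Simplified stand-in for diversity-check(T, T*, l): pass if in
    every bucket no sensitive value (the last field) accounts for a
    1/l or greater fraction of the tuples.  The full algorithm
    instead bounds the matching probability p(t, s) per tuple."""
    for bucket in buckets:
        top = Counter(t[-1] for t in bucket).most_common(1)[0][1]
        if top * ell > len(bucket):
            return False
    return True

def tuple_partition(T, ell):
    """tuple-partition(T, l): keep splitting buckets while the
    sliced table still satisfies the diversity check."""
    Q, SB = deque([list(T)]), []
    while Q:
        B = Q.popleft()                               # remove the first bucket B from Q
        B1, B2 = B[:len(B) // 2], B[len(B) // 2:]     # split B into two buckets
        if B1 and B2 and diversity_check(list(Q) + [B1, B2] + SB, ell):
            Q.extend([B1, B2])                        # keep both halves for further splitting
        else:
            SB.append(B)                              # B cannot be split anymore
    return SB

data = [(22, "M", "flu"), (30, "F", "cancer"),
        (41, "M", "flu"), (52, "F", "bronchitis")]
print(tuple_partition(data, ell=2))
```

On this toy data with ℓ = 2, the algorithm splits the single starting bucket once and then stops, because any further split would put a lone sensitive value in a bucket of size one.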
Examples for Anonymization Techniques
• This is the original microdata table together with its anonymized versions produced by the various anonymization techniques.
• The figure above consists of QIs and an SA: Age, Sex, and Zipcode are the QIs, Disease is the SA, and the generalized table satisfies 4-anonymity.
• The dataset above shows the bucketized table, which satisfies 2-diversity.
The tables above show multiset-based generalization and one-attribute-per-column slicing; the table below shows the sliced table.
Design and Implementation
This shows the activities of the user. The user can provide privacy for the microdata by generalizing the records, dividing them into a number of buckets, and breaking the correlation between the attributes. The attributes of the table are sliced by performing a random permutation and applying a probability function.
Use case diagram: the User actor interacts with the Tuple Partitioning, Attribute Clustering, Correlation Measure, Diverse Slicing, and Data Slicing use cases.
Class diagram
It shows how the probability functions are calculated and how values are randomly permuted. It has five different classes, each with its attributes and methods describing how data is retrieved and how the methods are applied.
• CorrelationMeasure: attributes[], mscdomain[], partitionset[]; operations: RetrieveAttributes(), ComputeDomain(), CalculateMSC(), ApplyDiscretization(), PartitionsAttributeDomain()
• TuplePartitioning: tuples[], probability, ldiversity; operations: CreateQueueBuckets(), AssignTuples(), RemoveBucket(), CheckDiversity(), RecordMatchingProbability()
• AttributeClustering: centroid, attribcorellation; operations: RetrieveCorrelation(), ComputeCentroids(), CheckCorrelation(), ApplyClustering()
• DataSlicing: partSet[], tuplesAttribTemp; operations: SelectAttributes(), ReplaceMultiset(), PartitionsAttributes(), PartitionTuples(), RandomPermuteTuple()
• DiverseSlicing: buckets[], bucketID, probablity; operations: DetermineMatchingBuckets(), AssignProbability(), ComputeProbabilitySensitive(), CheckPotentialMatch()
The classes are related through 1..* associations.
Modules
• Data slicing
• Diverse slicing
• Correlation measure
• Attribute clustering
• Tuple partitioning
• The attributes of the tables are used for slicing and bucketization.
• Slicing first partitions the attributes into columns and then partitions the tuples into buckets.
• Diverse slicing extends this analysis to the general case by introducing the notion of ℓ-diverse slicing and applying the probability function to the attributes.
• Two correlation measures are used: one for measuring the correlation between two continuous attributes and one for two categorical attributes.
• After computing the correlations for each pair of attributes, we use clustering to partition the attributes into columns.
• After that, tuple partitioning is performed: the tuples are partitioned into buckets.
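For the categorical case, one standard choice, and the one the class diagram's CalculateMSC() method suggests, is the mean-square contingency coefficient; this is a sketch under that assumption.

```python
from collections import Counter

def mean_square_contingency(xs, ys):
    """Mean-square contingency coefficient between two categorical
    attributes: phi^2 = sum over cells of (p_xy - p_x p_y)^2 / (p_x p_y),
    normalized by min(d1, d2) - 1 so the result lies in [0, 1]."""
    n = len(xs)
    fx, fy = Counter(xs), Counter(ys)
    fxy = Counter(zip(xs, ys))
    phi2 = 0.0
    for x, cx in fx.items():
        for y, cy in fy.items():
            px, py = cx / n, cy / n
            pxy = fxy.get((x, y), 0) / n
            phi2 += (pxy - px * py) ** 2 / (px * py)
    d = min(len(fx), len(fy))
    return phi2 / (d - 1) if d > 1 else 0.0

# Perfectly correlated attributes score 1.0; independent ones score 0.0.
print(mean_square_contingency(list("aabb"), list("aabb")))
print(mean_square_contingency(list("aabb"), list("cdcd")))
```

The clustering step would then group attribute pairs with high coefficients into the same column, so that slicing keeps strongly correlated attributes together.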
H/W and S/W Specifications
Hardware:
(1) 2GB RAM
(2) 320GB hard disk
(3) Intel processor (P3)
Software:
(1) Visual Studio
(2) Windows XP/7
Results
• Retrieving the dataset file
• Showing the dataset files to show the path of the file
• Selecting the dataset test file
• Displaying the path of the dataset file
• Displaying the message in the textbox when the user cancels browsing
• Displaying the file path of the dataset attribute
• Displaying the attributes from the dataset
• Selecting the attribute set from the dropdown list
• Displaying the attribute domains of the attribute set
• Displaying the tuples
• Entering the number of buckets
• Displaying the generalization values in both text and table format
• Displaying the bucketization values in text format
• Displaying the values which have a clear separation between Quasi-Identifiers and Sensitive Attributes
• Displaying the diversity mapping based on the salary values
• Displaying the sliced values in the textbox
• Displaying the sliced sets in the table
• Showing the probability based on the countries and salary of all the buckets
• Providing security for the sliced sets and storing the values in the database
• Duplicating the attributes
• Displaying the duplicate attributes in the table format
Comparison
Graph showing the time consumed for the three techniques.
[Bar chart: x-axis: Anonymity, Diversity, Slicing; y-axis: Time (msec), 0 to 400]
Comparative graph showing the time consumed in Enhanced Slicing.
[Bar chart: x-axis: Slicing, Enhanced Slicing; y-axis: Time (msec), 0 to 400]
Conclusion
• A dataset is taken and anonymization techniques are applied to protect the privacy of the microdata.
• The attribute values are considered and the probability functions are applied.
• The generalization, bucketization, and slicing techniques are implemented.
• DES is used to provide security for the sliced-set table.
• Overlapping slicing is performed, which duplicates an attribute in more than one column.
• The comparison shows that slicing preserves better data utility than generalization and bucketization in terms of time consumed.
Future Work
• Slicing gives better privacy than generalization and bucketization, but there is still scope to further increase the privacy of microdata publishing in the future by using different anonymization techniques.
References
[1] T. Li, N. Li, J. Zhang, and I. Molloy, "Slicing: A New Approach for Privacy Preserving Data Publishing," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 3, Mar. 2012.
[2] A. Inan, M. Kantarcioglu, and E. Bertino, "Using Anonymized Data for Classification," Proc. IEEE 25th Int'l Conf. Data Eng. (ICDE), pp. 429-440, 2009.
[3] B.-C. Chen, K. LeFevre, and R. Ramakrishnan, "Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge," Proc. Int'l Conf. Very Large Data Bases (VLDB), pp. 770-781, 2007.
[4] C. Dwork, "Differential Privacy: A Survey of Results," Proc. Fifth Int'l Conf. Theory and Applications of Models of Computation (TAMC), pp. 1-19, 2008.
Thank you.