A Novel Anonymization Technique for Privacy Preserving Data Publishing
Introduction
• The standard data publishing approach may provide insufficient protection, so privacy must be provided for the published microdata.
• Privacy protection is an important issue in data processing: data must be anonymized to protect privacy.
• The privacy preserving data publishing approach provides methods for publishing useful information while protecting individuals.
• Here, microdata consists of records, each containing information about an individual entity such as a person or an organization.
• The data mining community has focused on hiding sensitive rules generated from transactional databases.
• Before anonymizing the data, one can analyze its characteristics and use them to guide the anonymization.
Literature Survey
• In both the generalization and bucketization approaches, attributes are partitioned into three categories:
1) Identifiers: attributes that can uniquely identify an individual, such as Name or Social Security Number.
2) Quasi-Identifiers (QIs): attributes the adversary may already know and that can potentially identify an individual, e.g., Birthdate, Sex, and Zipcode.
3) Sensitive Attributes (SAs): attributes that are unknown to the adversary and are considered sensitive, such as Disease and Salary.
• In both generalization and bucketization, one first removes identifiers from the data and then partitions the tuples into buckets.
• Generalization then transforms the QI-values in each bucket into more general values.
• In bucketization, one instead separates the SAs from the QIs by randomly permuting the SA values within each bucket.
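The contrast between the two approaches can be sketched in Python on a toy bucket; the attribute values and the ad hoc min-max range hierarchy below are illustrative assumptions, not data from these slides.

```python
import random

# Toy bucket of tuples: QIs = (Age, Zipcode), SA = Disease (illustrative values).
bucket = [
    (22, 47906, "dyspepsia"),
    (25, 47903, "flu"),
    (33, 47905, "flu"),
    (38, 47901, "bronchitis"),
]

def generalize(bucket):
    """Generalization: replace each QI value with a coarser value
    (here, a min-max range over the bucket) so all tuples share
    the same QI signature; the SA stays attached to its tuple."""
    ages = [t[0] for t in bucket]
    zips = [t[1] for t in bucket]
    age_r = f"[{min(ages)}-{max(ages)}]"
    zip_r = f"[{min(zips)}-{max(zips)}]"
    return [(age_r, zip_r, t[2]) for t in bucket]

def bucketize(bucket, rng=None):
    """Bucketization: keep the exact QI values but randomly permute
    the SA values within the bucket, severing the QI-SA link."""
    rng = rng or random.Random(0)
    sas = [t[2] for t in bucket]
    rng.shuffle(sas)
    return [(t[0], t[1], sa) for t, sa in zip(bucket, sas)]

print(generalize(bucket))
print(bucketize(bucket))
```

Note how generalization coarsens the QIs while bucketization leaves them exact, which is why bucketization retains more utility but offers no membership protection.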
Privacy threats
• When publishing microdata, there are three types of privacy disclosure threats:
1) Membership disclosure
2) Identity disclosure
3) Attribute disclosure
Problem Specification
• The classical anonymization techniques for privacy preserving microdata publishing are:
1. Generalization
2. Bucketization
• Generalization loses a considerable amount of information, especially for high-dimensional data, due to the curse of dimensionality.
• Bucketization has better data utility than generalization, but it has some drawbacks.
• It does not prevent membership disclosure.
• In many data sets, it is unclear which attributes are QIs and which are SAs.
• It requires a clear separation between QIs and SAs.
• By separating the QI attributes from the SAs, it breaks the correlation between attributes.
• To overcome the drawbacks of the above approaches, a novel technique called slicing will be implemented.
Slicing
• Slicing partitions the dataset both vertically and horizontally.
• Attributes are grouped into columns, each containing a subset of attributes; this is the vertical partition.
• Slicing also partitions the tuples into buckets, each containing a subset of tuples; this is the horizontal partition.
• Within each bucket, the values in each column are randomly permuted to break the linking between different columns.
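These steps can be sketched in Python; the toy table, the column grouping, and the bucket split below are illustrative assumptions chosen for the example.

```python
import random

# Toy microdata: (Age, Sex, Zipcode, Disease) -- illustrative values.
tuples = [
    (22, "M", 47906, "dyspepsia"),
    (22, "F", 47906, "flu"),
    (33, "F", 47905, "flu"),
    (52, "F", 47905, "bronchitis"),
]

columns = [(0, 1), (2, 3)]            # vertical partition: (Age, Sex), (Zipcode, Disease)
buckets = [tuples[:2], tuples[2:]]    # horizontal partition into two buckets

rng = random.Random(0)
sliced = []
for bucket in buckets:
    # Within a bucket, permute each column's values independently,
    # breaking the linking between the columns.
    permuted = []
    for col in columns:
        vals = [tuple(t[i] for i in col) for t in bucket]
        rng.shuffle(vals)
        permuted.append(vals)
    sliced.append([sum(row, ()) for row in zip(*permuted)])

for b in sliced:
    print(b)
```

Each bucket still contains the same multiset of values per column, so aggregate utility is preserved, but an adversary can no longer tell which (Age, Sex) pair goes with which (Zipcode, Disease) pair.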
Slicing Algorithms
• Our algorithm consists of three phases:
Attribute partitioning
Column generalization
Tuple partitioning
Flow Chart for Project Design
Algorithm
• The algorithm maintains two data structures: a queue of buckets Q and a set of sliced buckets SB.
• Initially, Q contains only one bucket which includes all tuples and SB is empty.
• In each iteration, the algorithm removes a bucket from Q and splits the bucket into two buckets.
• If the sliced table after the split satisfies ℓ-diversity, then the algorithm puts the two buckets at the end of the queue Q.
Cont..
• Otherwise, we cannot split the bucket anymore and the algorithm puts the bucket into SB.
• When Q becomes empty, we have computed the sliced table.
• The set of sliced buckets is SB. The main part of the tuple-partition algorithm is checking whether a sliced table satisfies ℓ-diversity.
Tuple Partitioning Algorithm
Algorithm tuple-partition(T, ℓ)
1. Q = {T}; SB = ∅.
2. while Q is not empty
3.   remove the first bucket B from Q; Q = Q − {B}.
4.   split B into two buckets B1 and B2.
5.   if diversity-check(T, Q ∪ {B1, B2} ∪ SB, ℓ)
6.     Q = Q ∪ {B1, B2}.
7.   else SB = SB ∪ {B}.
8. return SB.
Diversity Check Algorithm
Algorithm diversity-check(T, T*, ℓ)
1. for each tuple t ∈ T, L[t] = ∅.
2. for each bucket B in T*
3.   record f(v) for each column value v in bucket B.
4.   for each tuple t ∈ T
5.     calculate p(t, B) and find D(t, B).
6.     L[t] = L[t] ∪ {(p(t, B), D(t, B))}.
7. for each tuple t ∈ T
8.   calculate p(t, s) for each s based on L[t].
9.   if p(t, s) ≥ 1/ℓ, return false.
10. return true.
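A runnable sketch of the two routines above, with one simplification flagged in the comments: instead of computing the matching probabilities p(t, B) and p(t, s), the stand-in check only requires that no sensitive value reaches a 1/ℓ fraction of any bucket. The median split rule and the field layout (sensitive value last) are assumptions for the example.

```python
from collections import Counter, deque

def diversity_check(buckets, ell):
    """Simplified stand-in for diversity-check(T, T*, l): pass if in
    every bucket no sensitive value (the last field) accounts for a
    1/l or greater fraction of the tuples.  The full algorithm
    instead bounds the matching probability p(t, s) per tuple."""
    for bucket in buckets:
        top = Counter(t[-1] for t in bucket).most_common(1)[0][1]
        if top * ell > len(bucket):
            return False
    return True

def tuple_partition(T, ell):
    """tuple-partition(T, l): keep splitting buckets while the
    sliced table still satisfies the diversity check."""
    Q, SB = deque([list(T)]), []
    while Q:
        B = Q.popleft()                               # remove the first bucket B from Q
        B1, B2 = B[:len(B) // 2], B[len(B) // 2:]     # split B into two buckets
        if B1 and B2 and diversity_check(list(Q) + [B1, B2] + SB, ell):
            Q.extend([B1, B2])                        # keep both halves for further splitting
        else:
            SB.append(B)                              # B cannot be split anymore
    return SB

data = [(22, "M", "flu"), (30, "F", "cancer"),
        (41, "M", "flu"), (52, "F", "bronchitis")]
print(tuple_partition(data, ell=2))
```

On this toy data with ℓ = 2, the algorithm splits the single starting bucket once and then stops, because any further split would put a lone sensitive value in a bucket of size one.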
Examples for Anonymization Techniques
• This is the original microdata table together with its anonymized versions produced by the various anonymization techniques.
• The figure above consists of QIs and an SA: Age, Sex, and Zipcode are the QIs, Disease is the SA, and the generalized table satisfies 4-anonymity.
• The dataset above shows the bucketized table, which satisfies 2-diversity.
The tables above show multiset-based generalization and one-attribute-per-column slicing; the table below shows the sliced table.
Design and Implementation
This shows the activities of the user. The user can provide privacy for the microdata by generalizing the records, dividing them into a number of buckets, and breaking the correlation between the attributes. The attributes of the table are sliced by performing a random permutation and applying a probability function.
Use case diagram: the User actor interacts with the Tuple Partitioning, Attribute Clustering, Correlation Measure, Diverse Slicing, and Data Slicing use cases.
Class diagram
It shows how the probability functions are calculated and how values are randomly permuted. It has five different classes, each with its attributes and methods describing how data is retrieved and how the methods are applied.
• CorrelationMeasure: attributes[], mscdomain[], partitionset[]; operations: RetrieveAttributes(), ComputeDomain(), CalculateMSC(), ApplyDiscretization(), PartitionsAttributeDomain()
• TuplePartitioning: tuples[], probability, ldiversity; operations: CreateQueueBuckets(), AssignTuples(), RemoveBucket(), CheckDiversity(), RecordMatchingProbability()
• AttributeClustering: centroid, attribcorellation; operations: RetrieveCorrelation(), ComputeCentroids(), CheckCorrelation(), ApplyClustering()
• DataSlicing: partSet[], tuplesAttribTemp; operations: SelectAttributes(), ReplaceMultiset(), PartitionsAttributes(), PartitionTuples(), RandomPermuteTuple()
• DiverseSlicing: buckets[], bucketID, probablity; operations: DetermineMatchingBuckets(), AssignProbability(), ComputeProbabilitySensitive(), CheckPotentialMatch()
The classes are related through 1..* associations.
Modules
• Data slicing
• Diverse slicing
• Correlation measure
• Attribute clustering
• Tuple partitioning
• The attributes of the tables are used for slicing and bucketization.
• Slicing first partitions the attributes into columns and then partitions the tuples into buckets.
• Diverse slicing extends this analysis to the general case by introducing the notion of ℓ-diverse slicing and applying the probability function to the attributes.
• Two correlation measures are used: one for measuring the correlation between two continuous attributes and one for two categorical attributes.
• After computing the correlations for each pair of attributes, we use clustering to partition the attributes into columns.
• After that, tuple partitioning is performed: the tuples are partitioned into buckets.
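For the categorical case, one standard choice, and the one the class diagram's CalculateMSC() method suggests, is the mean-square contingency coefficient; this is a sketch under that assumption.

```python
from collections import Counter

def mean_square_contingency(xs, ys):
    """Mean-square contingency coefficient between two categorical
    attributes: phi^2 = sum over cells of (p_xy - p_x p_y)^2 / (p_x p_y),
    normalized by min(d1, d2) - 1 so the result lies in [0, 1]."""
    n = len(xs)
    fx, fy = Counter(xs), Counter(ys)
    fxy = Counter(zip(xs, ys))
    phi2 = 0.0
    for x, cx in fx.items():
        for y, cy in fy.items():
            px, py = cx / n, cy / n
            pxy = fxy.get((x, y), 0) / n
            phi2 += (pxy - px * py) ** 2 / (px * py)
    d = min(len(fx), len(fy))
    return phi2 / (d - 1) if d > 1 else 0.0

# Perfectly correlated attributes score 1.0; independent ones score 0.0.
print(mean_square_contingency(list("aabb"), list("aabb")))
print(mean_square_contingency(list("aabb"), list("cdcd")))
```

The clustering step would then group attribute pairs with high coefficients into the same column, so that slicing keeps strongly correlated attributes together.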
H/W and S/W Specifications
Hardware:
(1) 2GB RAM
(2) 320GB hard disk
(3) Intel processor (P3)
Software:
(1) Visual Studio
(2) Windows XP/7
Results
• Retrieving the dataset file
• Showing the dataset files to show the path of the file
• Selecting the dataset test file
• Displaying the path of the dataset file
• Displaying the message in the textbox when the user cancels browsing
• Displaying the file path of the dataset attribute
• Displaying the attributes from the dataset
• Selecting the attribute set from the dropdown list
• Displaying the attribute domains of the attribute set
• Displaying the tuples
• Entering the number of buckets
• Displaying the generalization values in both text and table format
• Displaying the bucketization values in text format
• Displaying the values which have a clear separation between Quasi-Identifiers and Sensitive Attributes
• Displaying the diversity mapping based on the salary values
• Displaying the sliced values in the textbox
• Displaying the sliced sets in the table
• Showing the probability based on the countries and salary of all the buckets
• Providing security for the sliced sets and storing the values in the database
• Duplicating the attributes
• Displaying the duplicate attributes in the table format
Comparison
Graph showing the time consumed for the three techniques.
[Bar chart: x-axis: Anonymity, Diversity, Slicing; y-axis: Time (msec), 0 to 400]
Comparative graph showing the time consumed in Enhanced Slicing.
[Bar chart: x-axis: Slicing, Enhanced Slicing; y-axis: Time (msec), 0 to 400]
Conclusion
• A dataset is taken and anonymization techniques are applied to protect the privacy of the microdata.
• The attribute values are considered and the probability functions are applied.
• The generalization, bucketization, and slicing techniques are implemented.
• DES is used to provide security for the sliced-set table.
• Overlapping slicing is performed, which duplicates an attribute in more than one column.
• The comparison shows that slicing preserves better data utility than generalization and bucketization in terms of time consumed.
Future Work
• Slicing gives better privacy than generalization and bucketization, but there is still scope to further increase the privacy of microdata publishing in the future by using different anonymization techniques.
References
[1] T. Li, N. Li, J. Zhang, and I. Molloy, "Slicing: A New Approach for Privacy Preserving Data Publishing," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 3, Mar. 2012.
[2] A. Inan, M. Kantarcioglu, and E. Bertino, "Using Anonymized Data for Classification," Proc. IEEE 25th Int'l Conf. Data Eng. (ICDE), pp. 429-440, 2009.
[3] B.-C. Chen, K. LeFevre, and R. Ramakrishnan, "Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge," Proc. Int'l Conf. Very Large Data Bases (VLDB), pp. 770-781, 2007.
[4] C. Dwork, "Differential Privacy: A Survey of Results," Proc. Fifth Int'l Conf. Theory and Applications of Models of Computation (TAMC), pp. 1-19, 2008.
Thank you.