Towards Graph Analytics and Privacy Protection
by
Dongqing Xiao
A Dissertation
Submitted to the Faculty
of the
WORCESTER POLYTECHNIC INSTITUTE
In partial fulfillment of the requirements for the
Degree of Doctor of Philosophy
in
Computer Science
APPROVED:
Professor Mohamed Y. Eltabakh, Worcester Polytechnic Institute, Advisor
Professor Elke A. Rundensteiner, Worcester Polytechnic Institute, Committee Member
Professor Craig Wills, Worcester Polytechnic Institute, Head of Department
Professor Xiangnan Kong, Worcester Polytechnic Institute, Committee Member
Dr. Yuanyuan Tian, IBM Almaden, External Committee Member
Abstract
In many prevalent application domains, such as business-to-business networks,
social networks, and sensor networks, graphs serve as a powerful model to
capture the complex relationships within them. These graphs are of significant
importance in various domains such as marketing, psychology, and system
design. The management and analysis of these graphs is a recurring research
theme. The increasing scale of data poses a new challenge to graph analysis
tasks. Meanwhile, the edge uncertainty revealed in a released graph raises
new privacy concerns for the individuals involved.
In this dissertation, we first study how to design an efficient distributed
triangle listing algorithm for web-scale graphs with MapReduce. This is a
challenging task since triangle listing requires accessing the neighbors of the
neighbors of a vertex, which may appear arbitrarily in different graph parti-
tions (poor locality in data access). We present the Bermuda method, which
effectively reduces the size of the intermediate data via redundancy elimina-
tion and sharing of messages whenever possible. Bermuda encompasses two
general optimization principles that fully utilize the locality and re-use dis-
tance of local pivot messages. Leveraging these two principles, Bermuda not
only speeds up the triangle listing computation by factors of up to 10 but
also scales to larger datasets.
Second, we focus on designing an anonymization approach that resists de-anonymization
with little utility loss over uncertain graphs. In uncertain graphs, the adver-
sary can also take advantage of the additional information in the released
uncertain graph, such as the uncertainty of edge existence, to re-identify
the graph nodes. In this research, we first show that conventional graph
anonymization techniques either fail to guarantee anonymity or deteriorate
utility over uncertain graphs. To this end, we devise a novel and efficient
framework, Chameleon, that seamlessly integrates uncertainty. First, a proper
utility evaluation model for uncertain graphs is proposed. It focuses on the
changes in uncertain graph reliability features, rather than purely on the amount
of injected noise. Second, an efficient algorithm is designed to anonymize
a given uncertain graph with relatively small utility loss, empowered by
reliability-oriented edge selection and anonymity-oriented edge perturbation.
Experiments confirm that at the same level of anonymity, Chameleon pro-
vides higher utility than adaptive versions of deterministic graph anonymiza-
tion methods.
Lastly, we consider resisting more complex re-identification risks and pro-
pose a simple-yet-effective framework, Galaxy, for anonymizing uncertain
graphs by strategically injecting edge uncertainty based on node roles. In par-
ticular, the edge modifications are bounded by the derived anonymous proba-
bilistic degree sequence. Experiments show that our method effectively generates
anonymized uncertain graphs with high utility.
Acknowledgements
The growth of my knowledge over the last few years is in large part due
to the inspiration and guidance I received from my advisor, Professor
Mohamed Eltabakh. He gave me the freedom to explore any topic in graph
analytics research, provided sound direction at every turn, and gave the prompt
feedback that pushed my research forward. I have been fortunate to have him
as my advisor. I express my sincere thanks for his support, advice, patience,
and encouragement throughout my Ph.D. career. I am grateful to Prof. Xi-
angnan Kong for always being patient and being there for our discussions.
I sincerely thank the members of my Ph.D. committee, Prof. Elke Runden-
steiner, Prof. Xiangnan Kong, and Dr. Yuanyuan Tian, for providing me with
valuable feedback during all milestones of my Ph.D. study. Their insightful
suggestions helped me improve my research and the content of this disserta-
tion. I would like to thank Prof. Elke Rundensteiner for her guidance during my
research qualification. My thanks also go to the National Science Foun-
dation (NSF) for providing funding for the computing resources used in my
dissertation.
I would like to thank my collaborators Karim Ibrahim, Hai Liu, and Pankaj
Didwania. My thanks also go to all other previous and current DSRG
members, in particular Dr. Chuan Lei, Dr. Lei Cao, Yizhou Yan, and Xiao
Qin, for their insightful discussions, helpful feedback, and friendship.
I would like to thank my family members for their patience, support, and love
during the past few years. The passion to achieve bigger and better things
that they ingrained in me drives me to reach for excellence.
My Publications
Publications Contributing to this Dissertation
In this context I have achieved research advances that are selectively included in this
dissertation as detailed below.
Topic I: Distributed Triangle Listing for Massive Graphs
Topic I of this dissertation addresses the problem of distributed triangle listing for massive
graphs.
1. Dongqing Xiao, Mohamed Y. Eltabakh, Xiangnan Kong, Bermuda: An Efficient
MapReduce Triangle Listing Algorithm for Web-Scale Graphs. SSDBM 2016,
pages 1-12.
Relationship to this dissertation: In this work, we propose the Bermuda method, which
effectively reduces the size of the intermediate data via redundancy elimination and
sharing of messages whenever possible, enabling efficient triangle listing.
Chapters 2 to 5 in Part I of this dissertation are based on this work.
Topic II: Degree Anonymization over Uncertain Graphs
Topic II of this dissertation addresses the problem of resisting degree-based de-anonymization
over anonymized uncertain graphs.
2. Dongqing Xiao, Mohamed Y. Eltabakh, Xiangnan Kong, Chameleon: Towards the
Preservation of Privacy and Reliability in Anonymized Uncertain Graphs, in sub-
mission to a major conference.
Relationship to this dissertation: In this work, we present Chameleon, the first
anonymization framework for uncertain graphs. Chameleon
constructs the anonymized uncertain graph in an iterative skeleton empowered by (1)
an efficient cost-benefit-oriented edge selection strategy to identify the candidate
edge sets for obfuscation, and (2) an efficient entropy-driven edge perturbation strategy
for maximizing the privacy gain.
Chapters 6 to 10 in Part II of this dissertation are based on this work.
Topic III: Probabilistic Degree Anonymization over Uncertain Graphs
Topic III of this dissertation addresses the novel probabilistic degree-based re-identification
problem over uncertain graphs.
3. Dongqing Xiao, Mohamed Y. Eltabakh, Xiangnan Kong, Galaxy: Resisting Proba-
bilistic Re-identification in Anonymized Uncertain Graphs, ready for submission to
a major conference.
Relationship to this dissertation: In this work, we present the Galaxy framework, which
leverages the (k, ε)-obf degree sequence to bound and guide the random perturbation-
based anonymization schemes.
Chapters 11 to 13 in Part II of this dissertation are based on this work.
Other Publications
The publications listed below correspond to other research projects I have undertaken dur-
ing my PhD at WPI, mostly on the topics of query optimization and metadata management.
4. Hai Liu, Dongqing Xiao, Pankaj Didwania, Mohamed Y. Eltabakh: Exploiting Soft
and Hard Correlations in Big Data Query Optimization. PVLDB 2016, pages 1005-
1016.
5. Karim Ibrahim, Dongqing Xiao, Mohamed Y. Eltabakh, Elevating Annotation Sum-
maries To First-Class Citizens In InsightNotes. EDBT 2015, pages 49-60.
6. Dongqing Xiao, Mohamed Y. Eltabakh: InsightNotes: summary-based annotation
management in relational databases. SIGMOD 2014, pages 661-672.
Contents
My Publications iii
List of Figures xi
List of Tables xiii
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Distributed Triangle Listing for Massive Graphs . . . . . . . . . 1
1.1.2 Uncertain Graph Anonymization . . . . . . . . . . . . . . . . . . 3
1.2 State-Of-the-Art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Distributed Triangle Listing Algorithms . . . . . . . . . . . . . . 5
1.2.2 Deterministic Graph Anonymization . . . . . . . . . . . . . . 7
1.3 Research Challenges Addressed in This Dissertation . . . . . . . . . . . 13
1.4 Proposed Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.1 Distributed Triangle Listing . . . . . . . . . . . . . . . . . . . . 16
1.4.2 Resisting Degree-based De-anonymization in Uncertain Graphs . 17
1.4.3 Resisting Probabilistic Degree-based De-anonymization in Un-
certain Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 20
I Distributed Triangle Listing With MapReduce 22
2 Bermuda Preliminaries 23
2.1 Triangle Listing Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Sequential Triangle Listing . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 MapReduce Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Triangle Listing in MapReduce . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.1 Analysis and Optimization Opportunities . . . . . . . . . . . . . 28
3 Bermuda Technique 31
3.1 Bermuda Edge-Centric Node++ . . . . . . . . . . . . . . . . . . . . . . 32
3.1.1 Analysis of Bermuda-EC . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Bermuda Vertex-Centric Node++ . . . . . . . . . . . . . . . . . . . . . . 38
3.2.1 Message Sharing Management . . . . . . . . . . . . . . . . . . . 41
3.2.1.1 Usage-Based Tracking . . . . . . . . . . . . . . . . . . 42
3.2.1.2 Bucket-Based Tracking . . . . . . . . . . . . . . . . . 43
3.2.2 Analysis of Bermuda-VC . . . . . . . . . . . . . . . . . . . . . . 44
4 Performance Evaluation 46
4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Experiment Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.1 Bermuda Technique . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.2 Effect of the number of reducers . . . . . . . . . . . . . . . . . . 49
4.2.3 Message Sharing Management . . . . . . . . . . . . . . . . . . . 51
4.2.4 Execution Time Performance . . . . . . . . . . . . . . . . . . . . 52
5 Related Works 54
II Resisting Degree-based De-anonymization in Anonymized Un-
certain Graphs 57
6 Problem Definition 58
6.1 Uncertain Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
6.2 Attack Model and Privacy Criteria . . . . . . . . . . . . . . . . . . . . . 59
6.3 Reliability-Based Utility Loss Metric . . . . . . . . . . . . . . . . . . . . 60
6.4 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7 Uncertain Graph Anonymization via Representative Instance 62
8 Chameleon Framework 64
8.1 Chameleon Iterative Skeleton . . . . . . . . . . . . . . . . . . . . . . . . 64
8.2 Hybrid Edge Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
8.2.1 Uniqueness Score . . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.2.2 Reliability Relevance . . . . . . . . . . . . . . . . . . . . . . . . 68
8.3 Reliability-oriented Edge Selection Procedure . . . . . . . . . . . . . . . 73
8.4 Anonymity-Oriented Edge Perturbing . . . . . . . . . . . . . . . . . . . 76
9 Performance Evaluation 83
9.1 Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.1.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
9.1.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.2 Performance of Uncertain Graph Anonymization . . . . . . . . . . . . . 87
9.2.1 Efficiency Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 87
9.2.2 Utility Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
10 Related Work 92
III Resisting Probabilistic Degree-based De-anonymization in Anonymized
Uncertain Graphs 96
11 Problem Definition 97
11.1 Privacy Threats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
11.2 Anonymity Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . 100
11.3 Utility Preservation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
11.4 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
12 Galaxy Techniques 104
12.1 Overview of The Galaxy Approach . . . . . . . . . . . . . . . . . . . . . 104
12.2 Probabilistic Degree Anonymization . . . . . . . . . . . . . . . . . . . . 107
12.3 Probabilistic Degree Sequence Alignment . . . . . . . . . . . . . . . . . 110
12.4 Probabilistic Anonymous Graph Construction . . . . . . . . . . . . . 112
12.5 The Anonymity-Bounded Obfuscation Algorithm . . . . . . . . . . . . . 113
13 Performance Evaluation 117
13.1 Experiment Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
13.1.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
13.1.2 Utility Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 118
13.2 Performance of Uncertain Graph Anonymization . . . . . . . . . . . . . 119
13.2.1 Efficiency Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 119
13.2.2 Utility Loss Evaluation . . . . . . . . . . . . . . . . . . . . . . . 121
IV Conclusion and Future Work 124
14 Conclusion of This Dissertation 125
15 Future Work 127
15.1 Defeating More Involved De-anonymization Attacks . . . . . . . . . . . 127
15.2 Big Graph Anonymization . . . . . . . . . . . . . . . . . . . . . . . . . 129
15.3 Learning to Anonymize Uncertain Graphs . . . . . . . . . . . . . . . . . 130
References 133
List of Figures
1.1 Examples of real-world uncertain graphs with privacy concerns. . . . . . 4
2.1 Bermuda: Adjacency List. . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Bermuda: Bermuda-EC (Edge Centric) Execution. . . . . . . . . . . . . 34
3.2 Bermuda: Bermuda-VC (Vertex Centric) Execution. . . . . . . . . . . . . 39
3.3 Bermuda: Access Patterns of Pivot Messages. . . . . . . . . . . . . . . . 42
3.4 Bermuda: The Usage of External Memory. . . . . . . . . . . . . . . . . . 43
4.1 Bermuda: Distribution of Mapper Elapsed Times. . . . . . . . . . . . . . 48
4.2 Bermuda: Disk Space vs. Memory Tradeoff. . . . . . . . . . . . . . . . . 49
4.3 Bermuda: Running Time of Bermuda-EC. . . . . . . . . . . . . . . . . . 49
4.4 Bermuda (disk-based): Varying k vs. Running Time. . . . . . . . . . . . 51
4.5 Bermuda: The Accumulation of Sharing Messages. . . . . . . . . . . . . 51
6.1 Chameleon: Privacy Risk Assessment. . . . . . . . . . . . . . . . . . . . 59
7.1 Chameleon: Representative based Anonymization (Rep-An). . . . . . . . 63
8.1 Chameleon: Edge Modifications’ Impact vs. Reliability Relevance. . . . 70
8.2 Chameleon: Sampling Estimator for ERR . . . . . . . . . . . . . . . . . 72
8.3 Chameleon: Anonymity-Oriented Edge Perturbation. . . . . . . . . . . . 77
9.1 Chameleon: Distribution of Edge Probabilities, Degrees. . . . . . . . . . 84
9.2 Chameleon: Two Terminal Reliability Discrepancy. . . . . . . . . . . . 86
9.3 Chameleon: Running Time Comparison vs. Rep-An. . . . . . . . . . . 87
9.4 Chameleon: Graph Property Preservation. . . . . . . . . . . . . . . . . . 88
9.5 Chameleon: Double Loss of Rep-An. . . . . . . . . . . . . . . . . . . . 89
9.6 Chameleon: The Gain of RS and ME. . . . . . . . . . . . . . . . . . . . 90
11.1 Galaxy: Probabilistic Degree-based De-anonymization. . . . . . . . . . . 99
11.2 Galaxy: Illustration of Convex and Non-Convex Set. . . . . . . . . . . . 102
11.3 Galaxy: Invalidity of Being Convex Set. . . . . . . . . . . . . . . . . . . 103
12.1 Galaxy Framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
12.2 Galaxy: Probabilistic Degree Sequence Approximation. . . . . . . . . . . 108
12.3 Galaxy: Fuzzy Vertex Alignments. . . . . . . . . . . . . . . . . . . . . . 110
12.4 Galaxy: Derived Perturbation Model. . . . . . . . . . . . . . . . . . . . 112
12.5 Galaxy: Anonymous Degree Sequence Realization. . . . . . . . . . . . . 114
13.1 Galaxy: Running Time Comparisons vs. Chameleon. . . . . . . . . . . 120
13.2 Galaxy: Two Terminal Reliability Preservation. . . . . . . . . . . . . . 120
13.3 Galaxy: The Change Ratio of Degree. . . . . . . . . . . . . . . . . . . . 121
13.4 Galaxy: Average Path Distance Preservation. . . . . . . . . . . . . . . . 122
13.5 Galaxy: Clustering Coefficient Preservation. . . . . . . . . . . . . . . . . 122
13.6 Galaxy: Degree Distribution Preservation. . . . . . . . . . . . . . . . . . 123
15.1 Parallel Graph Anonymization Process. . . . . . . . . . . . . . . . . . . 130
15.2 Graph Anonymization Learning Process. . . . . . . . . . . . . . . . . . . 131
List of Tables
2.1 Bermuda: Summary of Notations. . . . . . . . . . . . . . . . . . . . . 24
4.1 Bermuda: Basic Statistics about Datasets. . . . . . . . . . . . . . . . . . 47
4.2 Bermuda: Reduction Factors of Communication Cost. . . . . . . . . . . . 47
4.3 Bermuda: Effectiveness Evaluation. . . . . . . . . . . . . . . . . . . . . 52
9.1 Chameleon: Dataset Statistics and Privacy Parameters. . . . . . . . . . . 84
9.2 Chameleon: Summary of Uncertain Graph Anonymization Methods. . . . 86
10.1 Chameleon: Summary of Adversary Knowledge. . . . . . . . . . . . . . 93
10.2 Chameleon: Privacy Criteria Summary of Perturbation-based Graph Anonymiza-
tion Schemes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
10.3 Chameleon: Positioning Chameleon w.r.t State-Of-Art Techniques. . . . . 94
13.1 Galaxy: Dataset Statistics and Privacy Parameters. . . . . . . . . . . . . . 118
1
Introduction
1.1 Motivation
1.1.1 Distributed Triangle Listing for Massive Graphs
Graphs arise naturally in many real-world applications such as social networks, bio-
medical networks, and communication networks. In these applications, the graph can
often be massive, involving billions of vertices and edges. For example, Facebook’s social
network involves more than 1.23 billion users (vertices) and more than 208 billion friend-
ships (edges). Such massive graphs can easily exceed the available memory of a single
commodity computer. That is why distributed analysis of massive graphs has become an
important research area in recent years [1, 2].
Triangle listing—which involves listing all triangles in a given graph—is well identi-
fied as a building-block operation in many graph analysis and mining techniques [3, 4].
First, several graph metrics can be directly obtained from triangle listing, e.g., clustering
coefficient and transitivity. Such graph metrics have wide applications including quantify-
ing graph density, detecting spam pages in web graphs, and measuring content quality in
social networks [5]. Moreover, triangle listing has a broad range of applications including
the discovery of dense sub-graphs [4], study of motif occurrences [6], and uncovering of
hidden thematic relations in the web [3]. There is another well-known and closely-related
problem to triangle listing, which is the triangle counting problem. Clearly, solving the
triangle listing problem would automatically solve triangle counting, but not vice versa.
Compared to triangle counting, triangle listing serves a broader range of applications. For
example, motif identification [6], community detection [7], and dense subgraph dis-
covery [4] all depend on the more complex triangle listing problem.
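As a point of reference for the discussion above, the classic sequential node-iterator algorithm that these counting and listing techniques build on can be sketched in a few lines of Python. The function name and toy graph are ours, for illustration only:

```python
from itertools import combinations

def list_triangles(adj):
    """Node-iterator triangle listing: for each vertex v, test every pair
    of v's higher-numbered neighbors for the closing edge, so that each
    triangle (v, u, w) with v < u < w is reported exactly once."""
    triangles = []
    for v in adj:
        higher = sorted(n for n in adj[v] if n > v)
        for u, w in combinations(higher, 2):
            if w in adj[u]:                 # closing edge (u, w) present?
                triangles.append((v, u, w))
    return triangles

# Toy graph: a 4-cycle 0-1-2-3 plus the chord 0-2 -> two triangles.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(list_triangles(adj))                  # [(0, 1, 2), (0, 2, 3)]
```

Solving the listing problem immediately solves counting (the triangle count is just the length of the returned list), but not vice versa.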
Several techniques have been proposed for processing web-scale graphs, including
streaming algorithms [5, 8], external-memory algorithms [9, 10, 11], and distributed
parallel algorithms [12, 13]. The streaming algorithms are limited to the approximate
triangle counting problem. External-memory algorithms exploit asynchronous I/O and
multi-core parallelism for efficient triangle listing [9, 11, 14]. In spite of achieving
impressive performance, external-memory approaches assume that the input graph resides
in centralized storage, which is not the case for many emerging applications that gen-
erate graphs that are distributed in nature. More seriously, external-memory approaches
cannot easily scale up in terms of computing resources and degree of parallelization. The
work in [13] presents a parallel algorithm for exact triangle counting using the MapReduce
framework. It proposes a partitioning scheme that improves the memory re-
quirements to some extent, yet it still suffers from a huge communication cost. The work
in [12] presents an efficient MPI-based distributed-memory algorithm built on [13]
with load balancing techniques. However, as a memory-based algorithm, it suffers from
memory limitations.
In addition to these techniques, several distributed and specialized graph frameworks
have recently been proposed as general-purpose graph processing engines [1, 2, 15].
However, most of these frameworks are customized for iterative graph processing, where
distributed computations can be kept in memory for faster subsequent iterations. In
contrast, triangle listing algorithms are not iterative and would not benefit from these
optimizations.
1.1.2 Uncertain Graph Anonymization
In many prevalent application domains, such as business-to-business (B2B) networks [16],
social networks [17, 18], and sensor networks [19], graphs serve as powerful models to capture
the complex relationships inherent in these applications. Most graphs in these applications
are uncertain by nature: each edge carries a probability
representing the likelihood of its presence in the real world. This uncertainty can be due
to various reasons, ranging from the use of prediction models to infer the edges (as in
social media and B2B networks) to physical properties that affect the edges’ reliability
(as in sensor and communication networks).
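Under the usual possible-world semantics, such an uncertain graph is a distribution over deterministic graphs: each edge materializes independently with its probability. The sketch below samples possible worlds and estimates a two-terminal reliability (the utility notion used later in this dissertation) by Monte Carlo. The edge probabilities loosely follow Figure 1.1; all names and function signatures are ours:

```python
import random

# Uncertain graph: each edge exists independently with its own probability
# (possible-world semantics). Names and values are illustrative.
uncertain_edges = {
    ("Ana", "Bob"): 0.8,
    ("Bob", "Carol"): 0.1,
    ("Ana", "David"): 0.5,
    ("David", "Carol"): 0.7,
}

def sample_possible_world(edges, rng=random):
    """Draw one deterministic instance of the uncertain graph."""
    return {e for e, p in edges.items() if rng.random() < p}

def estimate_reliability(edges, s, t, trials=10_000):
    """Monte Carlo estimate of two-terminal reliability: the probability
    that s and t are connected in a randomly sampled possible world."""
    hits = 0
    for _ in range(trials):
        adj = {}
        for u, v in sample_possible_world(edges):
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        seen, stack = {s}, [s]          # DFS over the sampled edges
        while stack:
            for v in adj.get(stack.pop(), ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        hits += t in seen
    return hits / trials

# Exact value here: two edge-disjoint paths, 0.08 + 0.35 - 0.028 = 0.402.
print(estimate_reliability(uncertain_edges, "Ana", "Carol"))
```

Exact two-terminal reliability is #P-hard in general, which is why sampling-based estimators of this kind are standard for uncertain graphs.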
These rich uncertain graphs are of significant importance due to the analytics and
knowledge extraction that can be applied on them, e.g., understanding graph structures [20,
21], social interactions [22], information discovery and propagation [23], advertising and
marketing [18], among many others. Publishing such uncertain graph data would allow
a wide variety of ad hoc analyses and novel valid uses of the data, but it also raises huge
privacy concerns. This is because these uncertain graphs contain sensitive information
about the graph entities as well as their connections, whose disclosure may violate pri-
vacy regulations.
Motivation Scenario I (Social Trust Networks): In social networks, the trust and
influence relationships among users—which may greatly impact users’ behaviors—are
usually probabilistic and uncertain [18] (see Figure 1.1(a)). The existence of a trust re-
lationship depends on many factors, such as areas of expertise and emotional connec-
tions. Researchers are very interested in studying the structure of social trust networks,
in order to promote products, or choose strategies for a campaign. However, the release
[Figure 1.1 depicts two example uncertain graphs over the users Ana, Bob, Carol, David, Eve, and Grace, annotated with edge probabilities ranging from 0.1 to 0.9. Legend: trust and influence between two users; potential future business interaction.]
(a) Social Trust Network
(b) B2B Network
Figure 1.1: Examples of real-world uncertain graphs with privacy concerns.
of such uncertain graphs with simple anonymization may cause serious privacy issues.
Attackers can re-identify private and sensitive information, such as the identities of the
users and their trust relationships, from the released data.
Motivation Scenario II (B2B Networks): Another uncertain graph example comes
from business-to-business networks (see Figure 1.1(b)). In these networks, e.g., “Al-
ibaba”, nodes represent companies (or businesses in general) while edges represent the
trust and the potential of future transactions among them [16]. Such future interactions
are uncertain since they are obtained by prediction models based on historical data [24].
B2B networks can be analyzed and mined for various applications, including advertise-
ment targeting [25] and customer segmentation [26]. Certainly, information about a com-
pany’s interactions with other companies is considered sensitive data, since any leak can
be used to infer the company’s financial condition.
Motivation Scenario III (Wireless Sensor Networks): In wireless sensor networks
(WSNs), the communication network among different sensors is usually uncertain. The
existence of a communication link between sensors depends on many factors, such as
the power of the sensors and the quality of the wireless connection. Researchers are very
interested in studying the structure of the connection network in WSNs in order to improve
the design of sensor networks. However, releasing the uncertain graphs of WSNs may
cause privacy or security problems. For example, the WSNs in smart power grids are of
great value in research studies, but the release of such data can potentially cause attacks
to the power grid system, especially if attackers can re-identify the exact locations of the
sensors from the released data.
These scenarios show the immediate need for efficient uncertain graph anonymization,
i.e., protecting the sensitive information while maintaining the graph utility. In general,
the graph anonymization problem has been studied extensively and various anonymiza-
tion techniques have been proposed [27, 28, 29, 30, 31, 32, 33, 34]. However, these
techniques focus only on deterministic graphs, where edges’ presence is known with cer-
tainty.
1.2 State-Of-the-Art
1.2.1 Distributed Triangle Listing Algorithms
Triangle listing is a basic operation of graph analysis. Much research has been
conducted on this problem, which can be classified into three categories: in-memory algo-
rithms, external-memory algorithms, and distributed algorithms. Here, we briefly review
these works.
In-Memory Algorithms. The majority of previously introduced triangle listing al-
gorithms are in-memory approaches. Traditionally, they are further
classified into Node-Iterator [35, 36, 37] and Edge-Iterator [38, 39] algorithms with
respect to the iterator type. The authors of [37, 38, 39] improved the performance of
in-memory algorithms by adopting degree-based ordering. Matrix multiplication has also
been used to count triangles [35]. However, all these algorithms are inapplicable to
massive graphs that do not fit in memory.
External-Memory Algorithms. In order to handle massive graphs, several external-
memory approaches were introduced [9, 10, 11]. The common idea of these methods is:
(1) partition the input graph so that each partition fits into main memory, (2) load each
partition individually into main memory, identify all its triangles, and then remove
edges that participated in the identified triangles, and (3) once the whole graph has been
loaded into the memory buffer, merge the remaining edges and repeat the former steps
until no edges remain. These algorithms require many disk I/Os to read and write edges.
The authors of [9, 10] improved performance by reducing the amount of disk I/O and
exploiting multi-core parallelism. External-memory algorithms show great performance
in time and space. However, their parallelization is limited: they cannot easily scale up
in terms of computing resources and degree of parallelization.
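The underlying principle, keeping only a memory-sized part of the adjacency resident while the remaining edges are streamed against it, can be sketched on a single machine as follows. This is a simplified illustration of the pattern, not a faithful reproduction of any of the cited algorithms; the function name, `batch` parameter, and toy input are ours:

```python
def em_triangles(edges, batch):
    """Sketch of the external-memory pattern: keep the adjacency of a
    'memory-sized' set of pivot vertices resident, stream every edge of
    the graph against it to close wedges, then move to the next batch."""
    triangles = set()
    nodes = sorted({u for e in edges for u in e})
    for i in range(0, len(nodes), batch):
        resident = nodes[i:i + batch]           # partition held in memory
        adj = {w: set() for w in resident}
        for u, v in edges:                      # pass 1: resident adjacency
            if u in adj: adj[u].add(v)
            if v in adj: adj[v].add(u)
        for u, v in edges:                      # pass 2: stream all edges
            for w in resident:
                if u in adj[w] and v in adj[w]:  # w sees both endpoints
                    triangles.add(tuple(sorted((u, v, w))))
    return triangles

# 4-cycle 0-1-2-3 with chord 0-2: triangles (0,1,2) and (0,2,3).
print(em_triangles([(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)], batch=2))
```

Every triangle is found because each of its vertices eventually becomes a resident pivot; the real algorithms add edge removal and careful I/O scheduling to avoid the repeated full scans this sketch performs.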
Distributed Algorithms. Another promising approach to handle triangle listing on
large-scale graphs is distributed computing. Suri et al. [13] introduced two MapReduce
adaptations of the NodeIterator algorithm and the well-known Graph Partitioning (GP)
algorithm to count triangles. The Graph Partitioning algorithm uses one universal hash
partition function over nodes to distribute edges into overlapping graph partitions, then
identifies triangles over all the partitions. Park et al. [40] further generalized the Graph
Partitioning algorithm into multiple rounds, significantly increasing the size of the graphs
that can be handled on a given system. The authors compare their algorithm with the GP
algorithm [13] across various massive graphs and show speedups ranging from 2 to 5. In
this work, we show that similar or even larger speedups (from 5 to 10) can also be obtained
by directly reducing the size of the intermediate results via our methods. Teixeira et
al. [41] presented Arabesque, a distributed data processing platform for implementing
subgraph mining algorithms on top of the MapReduce framework. Arabesque automates
the process of exploring a very large number of subgraphs, including triangles. However,
these MapReduce algorithms must generate a large amount of intermediate data that
travels over the network during the shuffle operation, which degrades their performance.
Arifuzzaman et al. [12] introduced an efficient MPI-based distributed-memory parallel
algorithm (Patric) on the basis of the NodeIterator algorithm. The Patric algorithm
introduced a degree-based sorting preprocessing step for efficient set intersection, which
speeds up execution. Furthermore, several distributed solutions designed for subgraph
mining on large graphs have also been proposed [1, 42]. Shao et al. introduced the PSgl
framework to iteratively enumerate subgraph instances. Different from other parallel
approaches, the PSgl framework relies completely on graph traversal and avoids the
explicit join operation. These distributed-memory parallel algorithms achieve impressive
performance on large-scale graph mining tasks. However, these methods distribute the
data graph among the workers’ memory, so they are not suitable for processing large-scale
graphs on small clusters.
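The hash-based GP scheme described above can be simulated on a single machine as follows. This is an illustrative toy only (the function names and the `rho` parameter are ours, and the real map and reduce phases run on separate machines with weight-based de-duplication):

```python
from itertools import combinations

def gp_partitions(edges, rho):
    """Toy simulation of the Graph Partitioning (GP) scheme: hash each
    vertex into one of rho buckets; the 'reducer' for a bucket triple
    {i, j, k} receives every edge whose endpoints both hash into the
    triple and lists triangles locally over its partition."""
    bucket = lambda v: hash(v) % rho
    parts = {t: [] for t in combinations(range(rho), 3)}
    for u, v in edges:                       # map phase: replicate edges
        for t in parts:
            if bucket(u) in t and bucket(v) in t:
                parts[t].append((u, v))
    triangles = set()
    for part in parts.values():              # reduce phase: local listing
        adj = {}
        for u, v in part:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        for u, v in part:
            for w in adj[u] & adj[v]:
                triangles.add(tuple(sorted((u, v, w))))
    return triangles

# 4-cycle 0-1-2-3 with chord 0-2.
print(gp_partitions([(0, 1), (0, 2), (0, 3), (1, 2), (2, 3)], rho=5))
```

A triangle whose vertices fall into fewer than three distinct buckets is emitted by several reducers; the actual GP algorithm compensates with weights, whereas the set above simply absorbs the duplicates. The edge replication across triples is exactly the intermediate-data blow-up that Bermuda targets.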
1.2.2 Deterministic Graph Anonymization
The privacy concerns associated with data analysis over graph data have spurred recent
research. In particular, privacy disclosure risks arise when the data owner wants to
publish or share the graph data with a third party for research or business-related appli-
cations. Privacy-preserving graph publishing techniques are usually adopted to protect
privacy by masking, modifying, and/or generalizing the original data without sacrificing
much data utility.
The privacy breaches in graph data can be grouped as follows.
1. Identity disclosure occurs when the identity of an individual who is associated with
a vertex is revealed. It includes sub-categories such as vertex existence, vertex
properties, and graph metrics.
2. Attribute disclosure occurs when an attack seeks not necessarily to identify a vertex,
but to reveal sensitive labels of the vertex.
3. Link disclosure occurs when a sensitive relationship between two individuals is dis-
closed. Depending on the graph’s type, we can refine this category into link relationships,
link weights, and sensitive edge labels.
Identity disclosure often leads to attribute disclosure: identity disclosure occurs when an
individual is identified within a dataset, whereas attribute disclosure occurs when sensitive
information that the individual wished to keep private is revealed, so once a vertex is
re-identified, the sensitive data associated with it is compromised as well.
The model/assumption of prior knowledge and the quantification of utility loss play key
roles in designing effective and meaningful anonymization techniques for graph data.
Determining the knowledge of the adversary is a challenging problem. A variety of
adversary knowledge models have been proposed, each in conjunction with an attack and a
protection method. Attacks on naively anonymized network data have been developed which
can re-identify vertices and disclose edges between them. These attacks include matching
attacks, which use external knowledge of vertex features [27, 28, 43, 44]; injection attacks,
which alter the network prior to publication [45]; and auxiliary network attacks, which use
publicly available networks as an external information source [46]. To counter these attacks,
methods that introduce noise into the original data have been developed in order to
hinder the re-identification process.
Graph Anonymization Approaches In general, the state-of-the-art anonymization meth-
ods for simple graph data can be categorized as follows.
• Generalization or clustering-based approaches, which can essentially be regarded
as grouping vertices and edges into partitions called super-vertices and super-edges.
The details about individuals can be hidden properly, but the graph may shrink
considerably after anonymization, which may not be desirable for analyzing local
structures.
• Edge and vertex modification approaches, which first transform the data by modifying
(adding and/or deleting) edges or vertices and then release the perturbed data. The
data is thus made available for unconstrained analysis with existing graph mining
techniques.
• Uncertain-graph approaches, which add or remove edges “partially” by assigning
a probability to each edge in the anonymized network. Instead of creating or
deleting edges outright, the set of all possible edges is considered and a probability
is assigned to each edge.
All of the above methods first transform the data through different types of graph modi-
fication and then release the perturbed data, which is thus made available for uncon-
strained analysis. In contrast, there are “privacy preserving graph mining” methods,
which do not release the data but only the output of an analysis task; differential
privacy [47] is a well-known example. In our work, we do not consider such methods for
anonymizing uncertain graphs, since they do not allow us to release the entire network
for ad-hoc graph analysis tasks.
• Generalization approaches
Generalization approaches can essentially be regarded as grouping vertices and edges into
partitions called super-vertices and super-edges. The details about individuals can be hid-
den properly, but the graph may shrink considerably after anonymization, which may
not be desirable for analyzing local structures. All methods in this category require the
whole graph as input; consequently, they cannot deal with streaming graph data. Here,
we remind the reader that new methods can be developed using this core idea to generate
anonymous graph datasets. The first approach in this category was proposed by Hay et
al. [48]. It uses the size of the partition to ensure node anonymity. After grouping, each
super-vertex represents at least k nodes and each super-edge represents all the edges
between nodes in two super-vertices. Only the edge density is published for each partition,
so it is hard to distinguish between individuals within a partition. A similar idea was
applied to complex networks, i.e., labeled networks [49]. The underlying clustering
problem is known to be NP-hard, and researchers have presented different optimization
methods. For instance, Sihag et al. used a genetic algorithm to optimize this NP-hard
problem, which achieves a better result in terms of information loss. Unfortunately, this
method does not seem scalable to large networks.
• Edge and vertex modification approaches
Edge and vertex modification approaches anonymize a graph by modifying its edges or ver-
tices. These modifications can be made at random (referred to as randomization or random
perturbation). Random perturbation techniques are generally the simplest and have the
lowest complexity; thus, they are able to handle large networks. The first method, proposed
by Hay et al. and called Random perturbation, anonymizes unlabeled graphs using a Rand
Add/Del strategy, i.e., randomly removing p edges and then randomly adding p fake edges,
without changing the set of vertices or the total number of edges. On this basis, Ying and
Wu [32, 50] developed two algorithms designed to preserve the spectral characteristics of
the original graph, called Spctr Add/Del and Spctr Switch. Following this path, Stokes and
Torra state that an appropriate selection of the eigenvalues in the spectral method can
perturb the graph while keeping its most significant edges. This generic strategy, which
aims to preserve the most important edges in the network, tries to maximize utility while
achieving the desired privacy level. Generally, such utility-aware methods achieve lower
information loss, but at the cost of increased complexity. Another improved variation of
Random perturbation was proposed by Ying et al. [50], called Blockwise Random Add/Del.
This method divides the graph into blocks according to the degree sequence and applies
edge modifications to the vertices at high risk of re-identification, rather than at random
over the entire set of vertices. However, none of the above random perturbation techniques
offers a formal privacy guarantee.
The modification can also be performed so as to fulfill some desired constraints (re-
ferred to as constrained perturbation methods). Among these, the k-anonymity model is
the most well-known privacy notion, imported from relational data anonymization. The
k-anonymity model requires that an attacker cannot distinguish among k records even if
he manages to find a group of quasi-identifiers. Therefore, the attacker cannot re-identify
an individual with probability greater than 1/k. Structural properties can be used as
quasi-identifiers to extend k-anonymity to graph data, as in k-degree anonymity.
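As a concrete illustration of this notion applied to degrees, the following sketch (an illustrative helper, not drawn from any cited work) checks whether a degree sequence is k-degree anonymous, i.e., whether every degree value is shared by at least k vertices:

```python
from collections import Counter

def is_k_degree_anonymous(degrees, k):
    """A degree sequence is k-degree anonymous if every degree value
    appears at least k times, so each vertex shares its degree with
    at least k-1 other vertices."""
    return all(count >= k for count in Counter(degrees).values())
```

For example, the sequence [2, 2, 1, 1] is 2-degree anonymous, while [3, 2, 2, 1] is not, since the degrees 3 and 1 each occur only once.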
Constrained graph modification approaches modify the graph structure (via edge mod-
ifications) to ensure that all vertices satisfy k-anonymity. The first method was proposed
by Liu and Terzi [27]; it is based on integer linear programming and edge switching to con-
struct a new anonymous graph that is k-degree anonymous. Hartung et al. [51] showed that
k-degree anonymity becomes NP-hard on graphs with H-index three, which is quite
common for large networks. Different kinds of heuristics have been proposed to improve
on Liu and Terzi's work in terms of speed and scalability [52, 53]. For instance, Nagle
et al. [52] proposed a local anonymization algorithm based on k-degree anonymity that
focuses on obscuring structurally important vertices that are not well anonymized, thereby
reducing the cost of the overall anonymization procedure. However, the results are similar to
Liu and Terzi's algorithm in terms of information loss; namely, they suffer from a high
lower bound on utility loss.
• Uncertain graphs
Rather than anonymizing graphs by generalizing them or by adding/removing edges to sat-
isfy a privacy parameter, recent methods have explored the semantics of uncertain graphs
to achieve privacy protection. The first approach was proposed by Boldi et al. [28]. It
is based on injecting uncertainty into deterministic graphs and publishing the resulting un-
certain graphs. The authors note that, from a probabilistic perspective, adding a non-
existing edge corresponds to changing its existence probability from 0 to 1, and removing
an existing edge corresponds to changing its probability from 1 to 0. In their method,
instead of considering only binary edge probabilities, they allow probabilities to take any
value in the range [0, 1]. From the perspective of graph modification, this provides a more
fine-grained operation, “partial Add/Del Edge”, to transform the input graph into the
anonymous one, thereby reducing the information loss of the anonymization procedure.
However, this specific method ignores several opportunities for further reducing informa-
tion loss. Nguyen et al. [30] proposed a generalized obfuscation model based on uncertain
adjacency matrices that keeps expected node degrees equal to those in the original graph,
together with a generic framework for privacy and utility quantification of anonymization
methods. The same authors present another method based on maximum variance to
achieve a better trade-off between privacy and data utility (referred to as MaxVar). In
particular, they decompose the optimization problem into independent quadratic optimization
problems by dividing the large input graph into subgraphs. From the viewpoint of graph
modification, they provide a more subtle operation, “partial Switch Edge”, for anonymizing
the input graph, thereby achieving a better trade-off between privacy and utility. However,
MaxVar fails to provide a meaningful, user-tunable privacy guarantee. Moreover, both
methods assume that each edge modification has an equal impact on the graph. As shown
for the vertex and edge modification techniques discussed above, this is not always the
case, especially in large networks.
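The “partial” flavor of edge perturbation can be illustrated with a toy routine; this is a simplified stand-in using a hypothetical bounded-noise scheme, not the actual mechanisms of Boldi et al. [28] or Nguyen et al. [30]:

```python
import random

def partially_perturb(edge_probs, sigma=0.2, seed=1):
    """Toy illustration of 'partial Add/Del': instead of flipping an
    edge's existence outright (probability 1 -> 0 or 0 -> 1), nudge its
    existence probability by bounded noise, clamped to [0, 1].
    edge_probs: dict mapping an edge to its existence probability."""
    rng = random.Random(seed)
    return {e: min(1.0, max(0.0, p + rng.uniform(-sigma, sigma)))
            for e, p in edge_probs.items()}
```

The point of the illustration is only that each edge retains a probability in [0, 1] rather than a binary existence value, which is what distinguishes uncertain-graph approaches from classical edge modification.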
In summary, the privacy-preserving graph publishing problem has been extensively
studied, and many graph anonymization techniques have been proposed. However, they are
tailored towards deterministic graphs. Ignoring edge uncertainty in the anonymity and
utility-loss assessments means they neither provide sufficient anonymity nor preserve graph
utility correctly.
1.3 Research Challenges Addressed in This Dissertation
Although many effective graph anonymization techniques have been proposed for de-
terministic graphs, shifting the graph anonymization techniques to uncertain graphs is
challenging.
It is challenging to develop a proper metric for quantifying the information loss of
uncertain graph anonymization. Fundamentally, graph anonymization techniques re-
quire modifying the graph structure at some level. The goal of graph anonymization
is to balance utility and privacy. The first question we need to solve is to develop
proper metrics for quantifying the loss of information. In the context of deterministic graphs,
this problem has been extensively studied. Most previous works use the total num-
ber of modified edges to measure the utility loss [27, 28]. Researchers have argued that this
measure is not effective, as it assumes each edge modification has an equal impact on the
original graph properties [54]. They suggest studying the change in other structural
properties such as the spectrum [32], community structure [54], shortest path length, and
neighborhood overlap [33]. However, the above-mentioned metrics are all designed for
comparing deterministic graphs and cannot be used to handle uncertain graphs directly.
Thus, we need to investigate other utility metrics suitable for uncertain graphs that
capture the essence of structural properties so as to serve a wide range of analytics.
Developing metrics for quantifying the loss of information for uncertain graph mining tasks
is a challenging research problem in its own right.
It is challenging to define the adversary knowledge and privacy protection model
so that it explicitly incorporates edge uncertainty. Compared to deterministic graphs, the
publishing of uncertain graphs reveals additional information, namely the associated edge
uncertainties, which can be used by the adversary to re-identify some entities in the
released uncertain graph. Clearly, the publicly available edge uncertainty can be used to
enhance various kinds of de-anonymization attacks. The second question we need to solve
is to model how the adversary incorporates edge uncertainty into the de-anonymization
process, namely the Attack Model. In this work, we focus on structural attacks. Usually,
structural attacks proceed in the following way. Let G be a graph and let G* denote its
anonymized version. The adversary locates matching vertices in G* according to the
structural information about the target in G. If there is a limited number of matches, this
may lead to target node re-identification and to privacy infringement. In the case of
deterministic graphs, the structural information of a target in G is an assertion with
certainty, such as “Ana has 3 neighbors”, and the matching assertion can be evaluated
as True or False with certainty. In the case of uncertain graphs, matters become more
complex. First, as discussed earlier, the structural property of a target node v in the
original uncertain graph is defined as a set of observations over all the possible worlds.
Depending on the domain of the application, the adversary may have complete and exact
knowledge or only aggregated statistics of the structural property; taking node degree as
the property, the adversary may possess the exact degree distribution or only a global
statistic. Second, the matching process, which links the nodes in the perturbed graph with
the collected structural information (complete distribution or expected value), proceeds
in a different way. We need to extend the matching evaluation in uncertain graphs to
different kinds of adversary knowledge, and then design a proper privacy model on the
basis of this matching evaluation. In the context of uncertain graphs, this remains an
unexplored problem.
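For the degree property, the set of observations over all possible worlds yields a probability distribution over degrees. A minimal sketch of computing it from a node's incident edge probabilities (this is the standard Poisson-binomial dynamic program, not a method specific to this work):

```python
def degree_distribution(edge_probs):
    """Return [P(degree = 0), P(degree = 1), ...] for a node whose
    incident edges exist independently with the given probabilities
    (possible-world semantics). This is the Poisson-binomial
    distribution, computed by dynamic programming over the edges."""
    dist = [1.0]  # with no edges considered, degree is 0 with probability 1
    for p in edge_probs:
        new = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new[k] += q * (1 - p)      # edge absent in this possible world
            new[k + 1] += q * p        # edge present
        dist = new
    return dist
```

The expected degree is then the mean of this distribution; an adversary with “complete” knowledge would know the whole vector, while one with aggregated statistics would know only such summaries.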
It is challenging to design effective and efficient uncertain graph anonymization tech-
niques. Although several graph anonymization methods have been developed, they are only
applicable to deterministic graphs; anonymization of an uncertain graph is still an open
problem. Given an input graph G and a set of allowed operations O, the task of graph
anonymization is to transform G into an anonymous one by performing as few operations
as possible. The problem is known to be NP-hard when the input graph G is deterministic
and the modification operations include edge addition and edge deletion [51]. The complexity
of the uncertain graph anonymization problem falls into the same category.
1.4 Proposed Solutions
In this dissertation, we first investigate the problem of making triangle listing techniques
effective yet efficient on Web-scale graphs. Our fundamental observations and optimizations
not only speed up triangle listing with MapReduce, but can also be deployed
over other platforms such as Pregel [1], PowerGraph [2], and Spark-GraphX [15],
and can be integrated with techniques that apply a graph pre-partitioning step.
We identify novel kinds of privacy risks associated with uncertain graph publishing,
where edge uncertainty acts as powerful auxiliary information for de-anonymization.
We show that existing graph anonymization algorithms are tailored towards determin-
istic graphs: they either fail to protect privacy correctly or destroy graph utility entirely.
We address the problem of performing uncertain graph modification that provides sufficient
individual privacy protection with a minimal amount of utility loss. Our fundamental ob-
servations and optimizations can be used to improve the utility of anonymized graphs
(both deterministic and uncertain) and can be combined with other graph optimization
techniques tailored to different privacy attacks.
1.4.1 Distributed Triangle Listing
The triangle enumeration problem has also been studied in MapReduce. The main ob-
jective of these efforts is to derive efficient MapReduce algorithms requiring a very small
number of MapReduce rounds. However, since the triangle listing process requires access
to the neighbor information of neighbor vertices, a MapReduce algorithm using a small
number of rounds must generate a large amount of intermediate data that travels over the
network during the shuffle operation. Since the amount of this intermediate data can be
much larger than the input size, issues related to network performance and to system
failure may arise with massive input graphs. Indeed, the network may be subject to con-
gestion, since a large amount of data is created and sent over the network in a small time
interval (i.e., during the shuffle step), reducing the scalability and fault tolerance of the
system.
Redundancy in Communication: In plain Hadoop, each reduce instance pro-
cesses its different keys (nodes) independently of the others. Generally, this is good for
parallelization. However, in triangle listing, it induces significant redundancy in commu-
nication. In the map phase, each node sends an identical pivot message to each of its
effective neighbors in NHv, even though many of them may reside in, and get processed
by, the same reduce instance. For example, if a node v has 1,000 effective neighbors on
reduce worker j, then v sends the same message 1,000 times to reduce worker j. In web-
scale graphs, such redundancy in intermediate data can severely degrade performance,
drastically consume resources, and in some cases cause job failures.
Therefore, the network traffic can be reduced by sending only one message to the
destination reducer and then either caching it in main memory or distributing it to the
actual graph nodes within each reducer. Although this strategy seems quite simple, and
other systems such as GPS [55] and X-Pregel [56] have implemented it, the trick lies in
how to perform the caching and sharing efficiently. We propose new effective caching
strategies that maximize the sharing benefit while incurring little overhead. We also
present a novel theoretical analysis of the proposed techniques.
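The core deduplication idea can be sketched as follows; the partition function and message format are simplified placeholders, not Bermuda's actual implementation:

```python
from collections import defaultdict

def emit_pivot_messages(v, eff_neighbors, partition, num_reducers):
    """Group v's effective neighbors by destination reducer and emit a
    single shared pivot message per reducer, tagged with the local
    neighbor list, instead of one identical message per neighbor.
    `partition` stands in for MapReduce's key partitioning function."""
    targets = defaultdict(list)
    for u in eff_neighbors:
        targets[partition(u, num_reducers)].append(u)
    # One message per destination reducer rather than one per neighbor.
    return [(r, (v, neighbors)) for r, neighbors in sorted(targets.items())]
```

With 1,000 effective neighbors spread over, say, 10 reducers, this cuts v's pivot traffic from 1,000 messages to at most 10, leaving the reducer-side dispatch to local caching.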
The contributions in this area include:
1. Independence of Graph Partitioning: Bermuda does not require any special
partitioning of the graph, which suits current applications in which graph
structures are very complex and dynamically changing.
2. Awareness of Processing Order and Locality in the reduce phase: Bermuda's
efficiency and optimizations are driven by minimizing the communication overhead
and the number of messages passed over the network. Bermuda achieves these
goals by dynamically keeping track of where and when vertices will be processed
in the reduce phase and then maximizing the re-usability of information among the
vertices that will be processed together. We propose several reduce-side caching
strategies for enabling such re-usability and sharing of information.
3. Portability of Optimization: We implemented Bermuda over the MapReduce in-
frastructure. However, the proposed optimizations can be deployed over other plat-
forms such as Pregel [1], PowerGraph [2], and Spark-GraphX [15], and can be
integrated with techniques that apply a graph pre-partitioning step.
4. Scalability to Large Graphs even with Limited Compute Clusters: As our experi-
ments show, Bermuda's optimizations, especially the reduction in communication
overheads, enable scalability to very large graphs, while the state-of-the-art tech-
nique fails to finish the job given the same resources.
1.4.2 Resisting Degree-based De-anonymization in Uncertain Graphs
We design a simple but effective anonymization framework called Chameleon to anonymize
an uncertain graph with less impact on its data utility. Targeting the balance between
utility and anonymity in the context of uncertain graphs, the design of Chameleon incor-
porates edge uncertainty into its privacy risk and utility evaluation components.
In contrast to classical deterministic graph utility metrics, we propose a new util-
ity metric based on the reliability measure, a core metric in numerous uncertain
graph applications [23, 57, 58]. The anonymization process needs to change the graph
structure by modifying the edge probabilities of a subset of the edges, which entails an ex-
ponential search space. Therefore, we propose a ranking algorithm that ranks the edges
w.r.t. the impact of a change on the graph structure, which we refer to as “reliability
relevance”, and this ranking guides the edge selection process. Moreover, we pro-
pose a theoretically-founded probability-alteration strategy based on the entropy of the graph
degree sequence, which enables achieving maximum privacy gain for a given amount
of perturbation.
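The reliability measure underlying this utility metric can be estimated by sampling possible worlds. A minimal Monte Carlo sketch for two-terminal reliability follows (the sampling budget and the union-find helper are illustrative choices, not Chameleon's actual machinery):

```python
import random

def estimate_reliability(edges, s, t, trials=2000, seed=7):
    """Monte Carlo estimate of two-terminal reliability: the probability
    that s and t are connected in a possible world sampled from the
    uncertain graph. edges: list of (u, v, p) with existence probability p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        parent = {}
        def find(x):                       # union-find with path compression
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for u, v, p in edges:
            if rng.random() < p:           # edge exists in this possible world
                parent[find(u)] = find(v)
        hits += find(s) == find(t)
    return hits / trials
```

Exact reliability computation is #P-hard, which is why sampling-based estimates of this kind are common in the uncertain graph literature.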
The contributions in this area include:
• Identifying the new and important problem of uncertain graph anonymization, where
edge uncertainties need to be seamlessly integrated into the core of the anonymization
process; otherwise, either the privacy will not be protected or the utility will be
severely damaged.
• Proposing a new utility-loss metric based on the solid connectivity-based graph
model under the possible world semantics, namely the reliability discrepancy.
• Introducing a theoretically-founded criterion, called reliability relevance, that en-
codes the sensitivity of the graph edges and vertices to the possible injected per-
turbation. The criterion guides the edge selection during the anonymization
process.
• Proposing uncertainty-aware heuristics for efficient edge selection and noise injec-
tion over the input uncertain graph to achieve anonymization at a slight cost in
reliability.
• Building the Chameleon framework that integrates the aforementioned contribu-
tions. Chameleon is experimentally evaluated on several real-world datasets for
effectiveness and efficiency. The results demonstrate a significant advantage over
conventional methods that do not directly consider edge uncertainties.
1.4.3 Resisting Probabilistic Degree-based De-anonymization in Uncertain Graphs
We first introduce a probabilistic model of the node degree knowledge available to the adver-
sary, and then quantify the re-identification risk level of individuals in anonymized uncertain
graphs. We show that the risks of such attacks vary based on the uncertain graph struc-
ture. We then propose a novel approach, Galaxy, for anonymizing uncertain graph data by
judiciously modifying edges' existence likelihood.
In particular, we formalize the structural indistinguishability of a node with respect
to an adversary with locally-bounded external information, namely fuzzy equivalence. On this
basis, we extend the notion of (k, ε)-obfuscation to uncertain graphs. We provide meth-
ods for efficiently assessing the level of obfuscation achieved by an uncertain graph with
regard to the probabilistic degree property, which speeds up the learning of the anonymization
parameters. Instead of relying on heuristics to guide the perturbation scheme, Galaxy first
constructs a candidate degree sequence with surplus anonymity and leverages it to distribute
and bound the edge-uncertainty perturbation of individual vertices. This approach guarantees
the anonymity of entities in the uncertain graph and allows ad hoc analysis tasks with
relatively little utility loss.
The contributions in this area include:
• Identifying the new and important problem of uncertain graph anonymization, where
edge uncertainties need to be seamlessly integrated into the core of the anonymization
process; otherwise, either the privacy will not be protected or the utility will be
severely damaged.
• Proposing a flexible probabilistic model of external information used by an adver-
sary to attack naively-anonymized uncertain graphs based on fuzzy equivalence.
This model allows us to evaluate re-identification risk efficiently.
• Formalizing the structural indistinguishability of a node with respect to an adver-
sary with external information about its probabilistic degree, together with the extended
privacy notion of (k, ε)-obfuscation.
• Proposing an efficient algorithm, Galaxy, to achieve this privacy condition. The al-
gorithm produces an over-obfuscated degree sequence, which describes the degree
distribution of an uncertain graph with surplus anonymity. Then, an iterative process is
performed to find a better obfuscation inside its bounded probabilistic search space.
• Building the Galaxy framework that integrates the aforementioned contributions.
Galaxy is experimentally evaluated on several real-world datasets for effectiveness
and efficiency. The results demonstrate a significant advantage over conventional
methods.
1.5 Dissertation Organization
The rest of this dissertation is organized as follows. We discuss in detail the three re-
search topics of this dissertation, namely Distributed Triangle Listing with MapReduce
in Part I (Chapters 2-5), Degree Anonymization over Uncertain Graphs in Part II (Chap-
ters 6-10), and Probabilistic Degree-based Anonymization over Uncertain Graphs in Part
III (Chapters 11-13). The discussion of each of the three research topics
includes the problem formulation and analysis, a description of the proposed solution, an ex-
perimental evaluation, and lastly a discussion of related work. Chapter 14 concludes this
dissertation and Chapter 15 discusses promising future work.
2
Bermuda Preliminaries
We introduce several preliminary concepts and notations, and formally define the triangle
listing problem. We then overview existing sequential algorithms for triangle listing,
and highlight the key components of the MapReduce computing paradigm. Finally, we
present naive parallel algorithms using MapReduce and discuss the open optimization
opportunities, which will form the core of the proposed Bermuda technique.
2.1 Triangle Listing Problem
Figure 2.1: Bermuda: Adjacency List.
Suppose we have a simple undirected graph G(V,E), where V is the set of vertices
(nodes), and E is the set of edges. Let n = |V| and m = |E|. Let Nv = {u | (u, v) ∈ E}
denote the set of adjacent nodes of node v, and dv = |Nv| denote the degree of node v.

Symbol   Definition
G(V,E)   A simple graph
Nv       Adjacent nodes of v in G
NHv      Adjacent nodes of v with higher degree
dv       Degree of v in G
d̂v       Effective degree of v in G
△vuw     A triangle formed by u, v and w
△(v)     The set of triangles that contains v
△(G)     The set of all triangles in G

Table 2.1: Bermuda: Summary of Notations.
We assume that G is stored in the most popular format for graph data, i.e., the adjacency
list representation (as shown in Figure 2.1). Given any three distinct vertices u, v, w ∈
V, they form a triangle △uvw iff (u, v), (u, w), (v, w) ∈ E. We define the set of all
triangles that involve node v as △(v) = {△uvw | (v, u), (v, w), (u, w) ∈ E}. Similarly,
we define △(G) = ⋃v∈V △(v) as the set of all triangles in G. For convenience, Table 2.1
summarizes the graph notations that are frequently used in this dissertation.
DEFINITION 1. Triangle Listing Problem: Given a large-scale distributed graph
G(V,E), our goal is to report all triangles in G, i.e., △(G), in a highly distributed way.
2.2 Sequential Triangle Listing
In this section, we present a sequential triangle listing algorithm which is widely used as
the basis of parallel approaches [12, 13, 40]. In this work, we also use it as the basis of
our distributed approach.
A naive algorithm for listing triangles is as follows. For each node v ∈ V , find the set
of edges among its neighbors, i.e., pairs of neighbors that complete a triangle with node v.
Given this simple method, each triangle (u, v, w) is listed six times—all six permutations
of u, v and w. Several other algorithms have been proposed to improve on and eliminate
the redundancy of this basic method, e.g., [5, 37]. One of the algorithms, known as
NodeIterator++ [37], uses a total ordering over the nodes to avoid duplicate listing of the
same triangle. By following a specific ordering, it guarantees that each triangle is counted
only once among the six permutations. Moreover, the NodeIterator++ algorithm adopts
an interesting node ordering based on the nodes’ degrees, with ties broken by node IDs,
as defined below:
u ≻ v ⇐⇒ du > dv or (du = dv and u > v) (2.1)
This degree-based ordering improves the running time by reducing the diversity of
the effective degree d̂v. The running time of the NodeIterator++ algorithm is O(m^(3/2)). A
comprehensive analysis can be found in [37].
The standard NodeIterator++ algorithm performs the degree-based ordering compar-
ison during the final phase, i.e., the triangle listing phase. The work in [12] and [13]
further improves on this by performing the comparison u ≻ v for each edge (u, v) ∈ E
in the preprocessing step (Lines 1-3, Algorithm 1). For each node v and edge (u, v),
node u is stored in the effective list of v (NHv) if and only if u ≻ v, and hence
NHv = {u : u ≻ v and (u, v) ∈ E}. The preprocessing step cuts the storage and
memory requirement in half, since each edge is stored only once. After the preprocess-
ing step, the effective degree of nodes in G is O(√m) [37]. Its correctness proof can be
found in [12]. The modified NodeIterator++ algorithm is presented in Algorithm 1; its
correctness is stated in Theorem 1.
Theorem 1 [12]. Algorithm NodeIterator++ lists each triangle in G once and only once.
Proof: Consider a triangle (x1, x2, x3) in G, and without loss of generality,
assume x3 ≻ x2 ≻ x1. By the construction of NH in the preprocessing step, we have
Algorithm 1 NodeIterator++
Preprocessing step
1: for all (u, v) ∈ E do
2:   if u ≻ v, store u in NHv
3:   else store v in NHu
Triangle Listing
4: △(G) ← ∅
5: for all v ∈ V do
6:   for all u ∈ NHv do
7:     for all w ∈ NHv ∩ NHu do
8:       △(G) ← △(G) ∪ △vuw
x2, x3 ∈ NHx1 and x3 ∈ NHx2. When the loop in Lines 5-7 begins with v = x1 and
u = x2, w = x3 appears in the intersection of NHx1 and NHx2, and the triangle
(x1, x2, x3) is counted once. But this triangle cannot be counted for any other values
of v and u, since x1 ∉ NHx2 and x1, x2 ∉ NHx3.
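For reference, Algorithm 1 can be condensed into a short sequential Python sketch; vertex identifiers are assumed to be comparable so that ties break by ID, per the total ordering of Equation (2.1):

```python
def node_iterator_pp(adj):
    """Sequential NodeIterator++: list each triangle exactly once.
    adj: dict mapping each vertex to the set of its neighbors."""
    deg = {v: len(ns) for v, ns in adj.items()}
    def rank(v):
        # Total order of Equation (2.1): compare by degree, break ties by ID.
        return (deg[v], v)
    # Preprocessing (Lines 1-3): keep only effective neighbors of higher rank.
    eff = {v: {u for u in ns if rank(u) > rank(v)} for v, ns in adj.items()}
    # Triangle listing (Lines 4-8): intersect effective adjacency lists.
    triangles = set()
    for v in adj:
        for u in eff[v]:
            for w in eff[v] & eff[u]:
                triangles.add((v, u, w))
    return triangles
```

On the complete graph K4 this reports exactly its four triangles, each listed once.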
2.3 MapReduce Overview
MapReduce is a popular distributed programming framework for processing large
datasets [59]. MapReduce, and its open-source implementation Hadoop [60], have been
used for many important graph mining tasks [13, 40]. In this work, our algorithms are
designed and analyzed in the MapReduce framework.
Computation Model. An analytical job in MapReduce executes in two rigid phases,
called the map and reduce phases. Each phase consumes/produces records in the form of
key-value pairs—We will use the keywords pair, record, or message interchangeably to
refer to these key-value pairs. A pair is denoted as 〈k; val〉, where k is the key and val
is the value. The map phase takes one key-value pair as input at a time, and produces
zero or more output pairs. The reduce phase receives multiple key-listOfValues pairs and
produces zero or more output pairs. Between the two phases, there is an implicit phase,
called shuffling/sorting, in which the mappers’ output pairs are shuffled and sorted to
group the pairs of the same key together as input for reducers.
Bermuda will leverage and extend some of the basic functionality of MapReduce,
which are:
• Key Partitioning: Mappers employ a key partitioning function over their outputs
to partition and route the records across the reducers. By default, it is a hash-based
function, but can be replaced by any other user-defined logic.
• Multi-Key Reducers: Typically, the number of distinct keys in an application is
much larger than the number of reducers in the system. This implies that a single
reducer sequentially processes multiple keys, along with their associated groups
of values, in the same reduce instance. Moreover, the processing order is defined
by the key sorting function used in the shuffling/sorting phase. By default, a single reduce
instance processes each of its input groups in total isolation from the other groups,
with no sharing or communication.
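As a concrete mental model of these mechanics, the following toy in-memory simulator (an illustrative sketch, not Hadoop's API) mimics the map phase, hash-based key partitioning, per-reducer key sorting, and a multi-key reduce; word count serves as a smoke test.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn, num_reducers=2, partition=None):
    """Minimal in-memory model of a MapReduce job: a map phase,
    a shuffle that hash-partitions keys across reducer instances,
    and a multi-key reduce that visits each instance's keys in
    sorted order (mimicking the shuffling/sorting phase)."""
    partition = partition or (lambda k: hash(k) % num_reducers)
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for rec in records:                  # map phase
        for k, val in map_fn(rec):
            buckets[partition(k)][k].append(val)
    out = []
    for groups in buckets:               # one multi-key reduce instance each
        for k in sorted(groups):         # keys processed in sorted order
            out.extend(reduce_fn(k, groups[k]))
    return out

# Word count as a smoke test of the model.
def wc_map(line):
    for w in line.split():
        yield w, 1

def wc_reduce(word, counts):
    yield word, sum(counts)

print(dict(run_job(["a b a", "b a"], wc_map, wc_reduce)))  # counts a: 3, b: 2
```

Note that, as in the Multi-Key Reducers bullet above, each simulated reduce instance processes its keys sequentially and in total isolation; the shared-state extensions that Bermuda introduces deliberately break this isolation.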
2.4 Triangle Listing in MapReduce
Both [13] and [12] use the NodeIterator++ algorithm as the basis of their distributed
algorithms. [13] identifies the triangles by checking the existence of pivot edges, while
[12] uses set intersection of effective adjacency lists (Line 7, Algorithm 1). In this section,
we present the MapReduce version of the NodeIterator++ algorithm similar to the one
presented in [12], referred to as MR-Baseline (Algorithm 2).
The general approach is the same as in the NodeIterator++ algorithm. In the map
phase, each node v emits two types of messages. The first type is used to initialize
its own effective adjacency list on the reduce side, referred to as a core message
(Line 1, Algorithm 2). The second type is used for identifying triangles, referred to
as pivot messages (Lines 2-3, Algorithm 2). All pivot messages from v to its effective
Algorithm 2 MR-Baseline
Map: Input: 〈v; N^H_v〉
1: emit 〈v; (v, N^H_v)〉
2: for all u ∈ N^H_v do
3:     emit 〈u; (v, N^H_v)〉
Reduce: Input: [〈u; (v, N^H_v)〉]
4: initiate N^H_u
5: for all 〈u; (v, N^H_v)〉 do
6:     for all w ∈ N^H_u ∩ N^H_v do
7:         emit △vuw
adjacent nodes are identical. In the reduce phase, each node u receives a core message
from itself and pivot messages from its adjacent nodes with lower degree. Then,
each node identifies the triangles by performing a set intersection operation (Lines 5-6,
Algorithm 2).
We omit the code of the pre-processing procedure since its implementation is straightforward
in MapReduce. In addition, we exclude the pre-processing cost from further
consideration, since it is typically dominated by the actual running time of the triangle
listing algorithm, and it is the same overhead for all algorithms.
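For illustration, the MR-Baseline map and reduce sides can be sketched as plain Python generators; the small driver below stands in for the shuffle phase, and the 4-clique input (with effective adjacency lists assumed to be precomputed) is a hypothetical example.

```python
def mr_baseline_map(v, nh_v):
    """Map: one core message to v itself, plus an identical pivot
    message to every effective neighbor (Lines 1-3, Algorithm 2)."""
    yield v, (v, nh_v)             # core message
    for u in nh_v:
        yield u, (v, nh_v)         # pivot messages (redundant copies)

def mr_baseline_reduce(u, messages):
    """Reduce: intersect each pivot list with u's own effective
    adjacency list (Lines 4-7, Algorithm 2)."""
    nh_u = next(nh for v, nh in messages if v == u)  # the core message
    for v, nh_v in messages:
        if v == u:
            continue
        for w in nh_u & nh_v:
            yield (v, u, w)        # triangle (v, u, w)

# Effective adjacency lists of the 4-clique (degree order = id order).
nh = {0: {1, 2, 3}, 1: {2, 3}, 2: {3}, 3: set()}
shuffle = {}                        # stand-in for the shuffle phase
for v, nh_v in nh.items():
    for k, msg in mr_baseline_map(v, nh_v):
        shuffle.setdefault(k, []).append(msg)
triangles = [t for u, msgs in shuffle.items()
             for t in mr_baseline_reduce(u, msgs)]
print(sorted(triangles))  # [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
```

The redundancy discussed next is visible even here: node 0 ships the identical pivot list three times, once per effective neighbor.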
2.4.1 Analysis and Optimization Opportunities
The algorithm's correctness and overall computational complexity follow from the sequential
case. Our analysis thus focuses on the space usage of the intermediate data and the execution
efficiency captured in terms of wall-clock execution time. For convenience
of analysis, we assume that each edge (u, v) requires one memory word.
Intermediate Data Size. As presented in [13], the total number of intermediate
records generated by MR-Baseline can be O(m^{3/2}) in the worst case, where m is the number
of edges. The size of this intermediate data can be much larger than the original graph
size. Thus, issues related to network congestion and job failure may arise with massive input
graphs. Indeed, the network congestion resulting from transmitting a large amount of data
during the shuffle phase can be a bottleneck, degrading the performance, and limiting the
scalability of the algorithm.
Execution Time. It is far from trivial to enumerate the factors contributing to the execution
time of a MapReduce job. In this work, we consider the following two dominating factors
for the triangle listing algorithm. The first is the total size of the intermediate data
generated and shuffled between the map and reduce phases. The second is
the variance and imbalance among the mappers' workloads. We refer to the imbalanced
workload among mappers as "map skew". Map skew leads to the straggler problem, i.e.,
a few mappers take significantly longer to complete than the rest, delaying the
progress of the entire job [61, 62]. We use the variance of the map output size to measure
the imbalance among mappers: the bigger the variance of the mappers'
output sizes, the greater the imbalance and the more serious the straggler problem. The
map output variance is characterized in the following theorem.
Theorem 2 For a given graph G(V,E), let the random variable x denote the effective
degree of any vertex in G, and denote its variance by Var(x). Then the expectation
of x, E(x), equals the average degree, computed as E(x) = m/n. For typical graphs,
Var(x) ≠ 0 and E(x) ≠ 0 always hold. Since each mapper starts with approximately
the same input size (say it receives c graph nodes), the variance of the output size among
mappers is close to 4cE(x)²Var(x).
Proof: Let g(x) be the map output size generated by a single node with effective
degree x; then g(x) = x² (Lines 2-3, Algorithm 2). Thus, the total size of the map
output generated by c nodes in a single mapper is T(x) = Σ_{i=1}^{c} g(x_i). Since x_1, x_2, ..., x_c are
independent and identically distributed random variables, Var(T(x)) = c · Var(g(x)).
Applying the delta method [63] to estimate Var(g(x)):

Var(g(x)) ≈ g′(x)²Var(x) ≈ (2x)²Var(x)

The approximate variance of g(x) is then

Var(g(x)) ≈ 4E(x)²Var(x)

Hence, the variance of the total map output size among mappers is close to
4cE(x)²Var(x).
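The delta-method step can be sanity-checked numerically. The sketch below compares the exact variance of g(x) = x² with the first-order approximation 4E(x)²Var(x) for a hypothetical distribution of effective degrees concentrated around 100, where the approximation should be tight.

```python
from statistics import fmean, pvariance

# Hypothetical effective degrees: discrete uniform on 90..110,
# concentrated around 100 so the first-order (delta) expansion is tight.
xs = list(range(90, 111))
gx = [x * x for x in xs]                      # g(x) = x^2, per-node map output size

exact = pvariance(gx)                         # true Var(g(x))
approx = 4 * fmean(xs) ** 2 * pvariance(xs)   # delta method: 4 E(x)^2 Var(x)
print(round(exact / approx, 3))               # ratio close to 1
```

For heavy-tailed degree distributions, as in real web and social graphs, the approximation is looser, but the 4E(x)²Var(x) term still dominates, which is what the theorem's variance estimate relies on.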
Opportunities for Optimization: In plain Hadoop, each reduce instance processes
its different keys (nodes) independently of the others. Generally, this is good for
parallelization. However, in triangle listing, it involves significant redundancy in communication.
In the map phase, each node sends an identical pivot message to each of its
effective neighbors in N^H_v (Lines 2-3, Algorithm 2), even though many of them may reside
in, and get processed by, the same reduce instance. For example, if a node v has 1,000
effective neighbors on reduce worker j, then v sends the same message 1,000 times to
reduce worker j. In web-scale graphs, such redundancy in the intermediate data can severely
degrade the performance, drastically consume resources, and in some cases cause job
failures.
3
Bermuda Technique
With the MR-Baseline algorithm, one node needs to send the same pivot message to
multiple nodes residing in the same reducer. Therefore, the network traffic can be reduced
by sending only one message to the destination reducer, and then either caching it in
main memory or distributing it to the actual graph nodes within each reducer. Although this
strategy seems quite simple, and other systems such as GPS [55] and X-Pregel [56] have
implemented it, the trick lies in how to efficiently perform the caching and sharing. In
this section, we propose new effective caching strategies that maximize the sharing benefit
while incurring little overhead. We also present a novel theoretical analysis of the
proposed techniques.
In the frameworks of GPS and X-Pregel, the adjacency lists of high-degree nodes are
used for identifying the distinct destination reducers and distributing the message to the target
nodes on the reduce side. This method requires extensive memory and computation for
message sharing. In contrast, in Bermuda, each node uses the universal key partitioning
function to group its destination nodes. Thus, each node sends the same pivot
message to each reduce instance only once. At the same time, reduce instances adopt
different message-sharing strategies to guarantee the correctness of the algorithm. As a result,
Bermuda achieves a trade-off between reducing the network communication, which is
known to be a major bottleneck for MapReduce jobs, and increasing the processing cost and
memory utilization. We present two modified algorithms with different message-sharing
strategies.
3.1 Bermuda Edge-Centric Node++
A straightforward (and intuitive) approach for sharing the pivot messages within each
reduce instance is to organize either the pivot or core messages in main-memory for ef-
ficient random access. We propose the Bermuda Edge-Centric Node++ (Bermuda-EC)
algorithm, which is based on the observation that for a given input graph, it is common
to have the number of core messages smaller than the number of pivot messages. There-
fore, the main idea of Bermuda-EC algorithm is to first read the core messages, cache
them in memory, and then stream the pivot messages, and on-the-fly intersect the pivot
messages with the needed core messages (See Figure 3.1). The MapReduce code of the
Bermuda-EC algorithm is presented in Algorithm 3.
In order to avoid pivot message redundancy, a universal key partitioning function is
utilized by the mappers. The corresponding modification on the map side is as follows. First,
each node v employs a universal key partitioning function h() to group its destination
nodes (Line 3, Algorithm 3). This grouping captures the graph nodes that will be processed
by the same reduce instance. Then, each node v sends a pivot message including
the information of N^H_v to each non-empty group (Lines 4-6, Algorithm 3). Following this
strategy, each reduce instance receives each pivot message exactly once, even if it will be
referenced multiple times.
Moreover, we use tags to distinguish core and pivot messages; they are not listed
in the algorithm for simplicity. Combined with the MapReduce internal sorting function,
Algorithm 3 Bermuda-EC
Map: Input: (〈v; N^H_v〉)
Let h(.) be a key partitioning function into [0, k-1]
1: j ← h(v)
2: emit 〈j; (v, N^H_v)〉
3: Group the set of nodes in N^H_v by h(.)
4: for all i ∈ [0, k − 1] do
5:     if gp_i ≠ ∅ then
6:         emit 〈i; (v, N^H_v)〉
Reduce: Input: [〈i; (v, N^H_v)〉]
7: initiate all the core nodes' N^H_u in main memory
8: for all pivot messages 〈i; (v, N^H_v)〉 do
9:     for all u ∈ N^H_v with h(u) = i do
10:        for all w ∈ N^H_v ∩ N^H_u do
11:            emit △vuw
Bermuda-EC guarantees that all core messages are received by the reduce function before
any of the pivot messages as illustrated in Figure 3.1. Therefore, it becomes feasible to
cache only the core messages in memory, and then perform the intersection as the pivot
messages are received.
The corresponding modification on the reduce side is as follows. For a given reduce
instance R_i, it first reads all the core messages into main memory (Line 7, Algorithm 3).
Then, it iterates over all pivot messages. Each pivot message is intersected with the cached
core messages to identify the triangles. As presented in the MR-Baseline algorithm
(Algorithm 2), each pivot message (v, N^H_v) needs to be processed in reduce instance R_i
only for the nodes u ∈ N^H_v with h(u) = i. Interestingly, this information is encoded
within the pivot message. Thus, each pivot message is processed for all its requested core
nodes once received (Lines 9-11, Algorithm 3).
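Putting the two sides together, a minimal Python sketch of Bermuda-EC might look as follows. The tag-based core/pivot separation is made explicit here, the driver scans the core messages in a first pass instead of relying on the framework's sort order, and the 4-clique input and the partitioner h are illustrative assumptions.

```python
def bermuda_ec_map(v, nh_v, h):
    """Map side of Bermuda-EC: one core message to v's own reduce
    instance, and at most one pivot message per reduce instance
    (Lines 1-6 of Algorithm 3) instead of one per effective neighbor."""
    yield h(v), ('core', v, nh_v)
    for i in {h(u) for u in nh_v}:     # non-empty destination groups only
        yield i, ('pivot', v, nh_v)

def bermuda_ec_reduce(i, messages, h):
    """Reduce side: cache the core messages, then stream the pivot
    messages and intersect on the fly (Lines 7-11 of Algorithm 3)."""
    core = {v: nh for tag, v, nh in messages if tag == 'core'}
    for tag, v, nh_v in messages:
        if tag == 'pivot':
            for u in nh_v:
                if h(u) == i:          # u is handled by this instance
                    for w in nh_v & core[u]:
                        yield (v, u, w)

# 4-clique with effective lists; h routes node ids to k = 2 reducers.
nh = {0: {1, 2, 3}, 1: {2, 3}, 2: {3}, 3: set()}
h = lambda v: v % 2
buckets = {0: [], 1: []}
for v, nh_v in nh.items():
    for i, msg in bermuda_ec_map(v, nh_v, h):
        buckets[i].append(msg)
tri = [t for i, msgs in buckets.items()
       for t in bermuda_ec_reduce(i, msgs, h)]
print(sorted(tri))  # the same 4 triangles as MR-Baseline
```

Even on this toy graph the map phase emits 5 pivot messages instead of the 6 that MR-Baseline emits; the gap widens sharply for high-degree nodes whose neighbors concentrate on few reducers.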
Figure 3.1: Bermuda: Bermuda-EC (Edge Centric) Execution.
3.1.1 Analysis of Bermuda-EC
Extending the analysis in Section 2.4, we demonstrate that Bermuda-EC achieves improvement
over MR-Baseline w.r.t. both space usage and execution efficiency. Furthermore,
we discuss the effect of the number of reducers k on the algorithm's performance.
Theorem 3 For a given number of reducers k, we have:
• The expected total size of the map output is O(km).
• The expected size of core messages to any reduce instance is O(m/k).
Proof: As shown in Algorithm 3, the size of the map output generated by node v is
at most k · d_v. Thus, the total size of the map output T satisfies:

T < Σ_{v∈V} k·d_v = k Σ_{v∈V} d_v = km

For the second bound, observe that a random edge is present in a reduce instance R_i
and represented as a core message with probability 1/k. By linearity of
expectation, the expected number of core messages to any reduce instance is O(m · 1/k).
Space Usage. Theorem 3 shows that when k ≪ √m (the usual case for massive
graphs), the total size of the map output generated by the Bermuda-EC algorithm is
significantly less than that generated by the MR-Baseline algorithm. In other words,
Bermuda-EC is able to handle even larger graphs with limited compute clusters.
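The k·d_v cap behind Theorem 3 is easy to see on the hub example from Section 2.4.1. The sketch below counts pivot messages under both schemes for a hypothetical hub node with 1,000 effective neighbors and k = 8 reducers; the function name and partitioner are illustrative.

```python
def pivot_counts(nh, h, k):
    """Count pivot messages under MR-Baseline (one per effective
    neighbor) vs. Bermuda-EC (at most one per reduce instance)."""
    baseline = sum(len(nh_v) for nh_v in nh.values())
    bermuda = sum(len({h(u) for u in nh_v}) for nh_v in nh.values())
    return baseline, bermuda

# A hub whose 1,000 effective neighbors spread over k = 8 reducers:
# the replication factor drops from d_v = 1000 to at most k = 8.
nh = {'hub': set(range(1000))}
print(pivot_counts(nh, lambda v: hash(v) % 8, 8))  # (1000, 8)
```

This is exactly the mechanism that shrinks both the total intermediate size (to O(km)) and the mapper-output variance.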
Execution Time. A positive consequence of a smaller intermediate result is that
less time is required for generating and shuffling/sorting the data. Moreover, the
imbalance of the map outputs is also reduced significantly by limiting the replication
factor of the pivot messages to at most k. The next theorem gives the approximate variance
of the size of the intermediate results from the mappers. When k < E(x), it implies a
smaller variance among the mappers than that of the MR-Baseline algorithm. Together,
these effects let Bermuda-EC achieve better performance and scale to larger graphs compared to the
MR-Baseline algorithm.
Theorem 4 For a given graph G(V,E), let the random variable x denote the effective
degree of any node in G, and denote its variance by Var(x). Then the expectation
of x, E(x), equals the average degree, computed as E(x) = m/n. For typical graphs,
Var(x) ≠ 0 and E(x) ≠ 0 always hold. Since each mapper starts with approximately
the same input size (say it receives c graph nodes), the variance of the map output's size
under the Bermuda-EC algorithm is O(2ck²Var(x)), where k represents the number of
reducers.
Proof: Assume the number of reducers is k. Consider a graph node v with
effective degree d_v = x. Let the random variable y(x) be the number of distinct reducers
processing the effective neighbors of v; thus y(x) ≤ k. Then the size of the
map output generated by a single node v is xy, denoted g(x) (Lines 3-6, Algorithm 3).
Thus, the total size of the map output generated by c nodes in a single mapper is
T(x) = Σ_{i=1}^{c} g(x_i). Since x_1, x_2, ..., x_c are independent and identically distributed random
variables, Var(T(x)) = c · Var(g(x)). The approximate variance of g(x) is as
follows
Var(xy) = E(x²y²) − E(xy)²
        ≤ E(x²y²)
        ≤ k²E(x²)
        = k²(E(x)² + Var(x))
        < 2k²Var(x)

where the last inequality uses Var(x) > E(x)², which we now justify. As presented
in [37], E(x²) ≈ m^{3/2}/n and E(x) = m/n; thus E(x²)/E(x)² ≈ n/√m. In many real
graphs, n > 2√m, which implies n/√m > 2 and hence E(x²) > 2E(x)². Thus
Var(x) = E(x²) − E(x)² > E(x)².
We now study in more detail the effect of the parameter k (the number of reducers) on
the space and time complexity of the Bermuda-EC algorithm.
Effect on Space Usage. The number of reducers k trades off the memory used by a
single reduce instance against the size of the intermediate data generated during the MapReduce
job. The memory used by a single reducer should not exceed the available memory
of a single machine, i.e., O(m/k) should be sub-linear in the size of main memory of a
single machine. In addition, the total space used by the intermediate data must also remain
bounded, i.e., O(km) should be no larger than the total storage. Given a cluster of
machines, these two constraints define the bounds on k for a given input graph G(V,E).
Effect on Execution Time. The number of reducers k trades off the reduce computation
time against the time for shuffling and sorting. As the parallelization degree k increases,
the computational time in the reduce phase decreases. At the same time, the size of
the intermediate data, i.e., O(km), increases significantly as k increases (notice that m is
very large), and thus the communication cost becomes a bottleneck in the job's execution.
Moreover, the increasing variance among mappers, O(2ck²Var(x)), implies a more
significant straggler problem which slows down the execution progress.
In general, the Bermuda-EC algorithm favors smaller settings of k for higher efficiency,
subject to the memory bound that the expected size of the core messages, O(m/k), should
not exceed the available memory of a single reduce instance.
Unfortunately, for processing web-scale graphs such as ClueWeb with more than 80
billion edges (and a total size of approximately 700GB), which as we will show the state-of-the-art
techniques cannot actually process, the number of reducers needed by Bermuda-EC
for acceptable performance is in the order of 100s. Although this number is very
reasonable for most mid-size clusters, the intermediate results O(km) will be huge, which
leads to significant network congestion.
Disk-Based Bermuda-EC: A generalization of the proposed Bermuda-EC algorithm
that guarantees no failure, even when the core messages cannot fit in a
reducer's memory, is the Disk-Based Bermuda-EC variation. The idea is straightforward
and relies on the local disk of each reducer. The main idea is as follows:
(1) partition the core messages such that each partition fits into main memory, and (2)
buffer a group of pivot messages, then iterate over the core messages one partition at
a time, and for each partition identify the triangles as in the standard Bermuda-EC algorithm.
Obviously, such a method trades off disk I/O (pivot message scanning) against the
main-memory requirement. For a given number of reducers k, the expected size of the core
messages in a single reduce instance is O(m/k); thus the expected number of rounds is
O(m/(kM)), where M represents the size of the available main memory of a single reducer. The
expected size of the pivot messages reaches O(m). Therefore, the total disk I/O reaches O(m²/(kM)).
For massive graphs, this implies a longer execution time.
3.2 Bermuda Vertex-Centric Node++
The Bermuda-EC algorithm assumes that the core messages can fit in the memory of a
single reducer. However, it is not always guaranteed to be the case, especially in web-
scale graphs.
One crucial observation is that the access pattern of the pivot messages can be learned
and leveraged for better re-usability. In MapReduce, a single reduce instance processes
many keys (graph nodes) in a specific sequential order. This order is defined based on the
key comparator function. For example, let h() be the key partitioning function and l() be
key comparator function within the MapReduce framework, then h(u) = h(w) = i and
l(u,w) < 0 implies that the reduce instance Ri is responsible for the computations over
nodes u, w, and also the computations of node u precede that of node w. Given these
known functions, the relative order among the keys in the same reduce instance becomes
known, and the access pattern of the pivot messages can be predicted. The knowledge of
the access pattern of the pivot messages holds great promise for providing better caching
and better memory utilization.
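A small sketch of this prediction step, assuming h is the partitioner and the comparator l sorts keys in ascending numeric order (both illustrative choices):

```python
def access_pattern(v, nh_v, h, sort_key):
    """Predict, for each reduce instance i, the ordered sequence of
    core nodes that will reference v's pivot message: v's effective
    neighbors routed to i by h, in the reducer's processing order."""
    groups = {}
    for u in nh_v:
        groups.setdefault(h(u), []).append(u)
    return {i: sorted(g, key=sort_key) for i, g in groups.items()}

# Effective neighbors of a node v, spread over two reducers by parity;
# keys are assumed to be processed in ascending numeric order.
ap = access_pattern('v', {3, 8, 5, 12}, lambda u: u % 2, lambda u: u)
print(ap[0], ap[1])  # [8, 12] [3, 5]
```

The first element of each list is where the single pivot message would be keyed, and the list itself plays the role of the access pattern AP_{v,i} used by Bermuda-VC.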
Inspired by these facts, we propose the Bermuda-VC algorithm which supports ran-
dom access over the pivot messages by caching them in main-memory while streaming
in the core messages. More specifically, Bermuda-VC will reverse the assumption of
Bermuda-EC, where we now try to make the pivot messages arrive first to reducers, get
them cached and organized in memory, and then the core messages are received and pro-
cessed against the pivot messages. Although the size of the pivot messages is usually
larger than that of the core messages, their access pattern is more predictable which will
enable better caching strategies as we will present in this section. The Bermuda-VC al-
gorithm is presented in Algorithm 4.
The Bermuda-VC algorithm uses a shared buffer for caching the pivot messages. And
Algorithm 4 Bermuda-VC
Map: Input: (〈v; (N^L_v, N^H_v)〉)
Let h(.) be a key partitioning function into [0, k-1]
Let l(.) be a key comparator function
1: emit 〈v; (v, N^L_v, N^H_v)〉
2: Group the set of nodes in N^H_v by h(.)
3: for all i ∈ [0, k − 1] do
4:     if gp_i ≠ ∅ then
5:         gp_i ← sort(gp_i) based on l(.)
6:         u ← gp_i.first
7:         AP_{v,i} ← accessPattern(gp_i)
8:         emit 〈u; (v, AP_{v,i}, N^H_v)〉
Reduce: Input: [〈u; (v, AP_{v,i}, N^H_v)〉]
9: initiate the core node u's N^L_u, N^H_u in main memory
10: for all pivot messages 〈u; (v, AP_{v,i}, N^H_v)〉 do
11:     for all w ∈ N^H_v ∩ N^H_u do
12:         emit △vuw
13:     Put (v, AP_{v,i}, N^H_v) into the shared buffer
14:     N^L_u ← N^L_u − v
15: for all r ∈ N^L_u do
16:     Fetch (r, AP_{r,i}, N^H_r) from the shared buffer
17:     for all w ∈ N^H_r ∩ N^H_u do
18:         emit △ruw
Figure 3.2: Bermuda: Bermuda-VC (Vertex Centric) Execution.
then, for the reduce-side computations over a core node u, the reducer compares u’s core
message with all related pivot messages—some are associated with u’s core message,
while the rest should be residing in the shared buffer. Bermuda-VC algorithm applies
the same scheme to avoid generating redundant pivot messages. It utilizes a universal
key partitioning function to group the effective neighbors N^H_v of each node v. In order to
guarantee the availability of the pivot messages, a universal key comparator function is
utilized to sort the destination nodes in each group (Line 5, Algorithm 4). As a result,
destination nodes are sorted based on their processing order. The first node in group
gp_i indicates the earliest request of a pivot message. Hence, each node v sends a pivot
message to the first node of each non-empty group by emitting key-value pairs whose key
equals the first node's ID (Lines 6-8, Algorithm 4).
Combined with the sorting phase of the MapReduce framework, Bermuda-VC guar-
antees the availability of all needed pivot messages of any node u when u’s core message
is received by a reducer, i.e., the needed pivot messages are either associated with u itself
or associated with other nodes processed before u.
The reducers' processing mechanism is similar to that of the MR-Baseline algorithm.
Each node u reads its core message to initialize N^H_u and N^L_u (Line 9), and then it iterates
over every pivot message associated with key u, intersecting its effective adjacency list N^H_v
with N^H_u to enumerate the triangles (Lines 10-12). As discussed before, not all expected pivot
messages are carried with key u; the rest of the related pivot messages reside in the shared
buffer. Here, N^L_u is used for fetching the rest of these pivot messages (Lines 15-16, Algorithm
4) and enumerating the triangles (Lines 17-18, Algorithm 4). Moreover, the newly arriving
pivot messages associated with node u are pushed into the shared buffer for further
access by other nodes (Line 13). Figure 3.2 illustrates the reduce-side processing flow of
Bermuda-VC. In the following sections, we discuss in more detail the management
of the pivot messages in the shared buffer.
3.2.1 Message Sharing Management
Ideally, the shared buffer fits into the main memory of
each reduce instance. However, that cannot be guaranteed. In general, there are two types
of operations over the shared buffer inside a reduce instance: "Put" for adding
new incoming pivot messages into the shared buffer (Line 13), and "Get" for retrieving
the needed pivot messages (Lines 15-18). For massive graphs, the main memory may
not hold all the pivot messages. This problem is similar to the classical caching problem
studied in [64, 65], where a reuse-distance factor is used to estimate the distances between
consecutive references of a given cached element, and based on that, effective replacement
policies can be deployed. We adopt the same idea in Bermuda-VC.
Interestingly, in addition to the reuse distance, all access patterns of each pivot mes-
sage can be easily estimated in our context. The access pattern AP of a pivot message
is defined as the sequence of graph nodes (keys) that will reference this message. In
particular, the access pattern of a pivot message from node v to reduce instance Ri can
be computed based on the sorted effective nodes gpi received by Ri. Several interesting
metrics can be derived from this access pattern. For example, the first node in gpi indi-
cates the occurrence of the first reference, the size of gpi equals the cumulative reference
frequency. Such access pattern information is encoded within each pivot message (Lines
7-8, Algorithm 4). With the availability of this access pattern, effective message sharing
strategies can be deployed under limited memory.
As an illustrative example, Figure 3.3 depicts different access patterns for four pivot
messages m1, m2, m3, m4. The black bars indicate requests to the corresponding
pivot message, while the gaps represent the re-use distances (which are idle periods for
this message). Pivot messages may exhibit entirely different access patterns: e.g., pivot
message m1 is referenced only once, while others are utilized more than once, and some
pivot messages are used in a dense consecutive pattern within a short interval, e.g., m2 and
Figure 3.3: Bermuda: Access Patterns of Pivot Messages.
m3. Inspired by these observations, we propose two heuristic-based replacement policies,
namely usage-based tracking, and bucket-based tracking. They trade off the tracking
overhead with memory hits as will be described next.
3.2.1.1 Usage-Based Tracking
Given a pivot message originating from node v, its total use frequency is limited to √m,
the number of its effective neighbors, which is much smaller than the expected
number of nodes processed in a single reducer, estimated as n/k. This implies
that each pivot message may become useless (and can be discarded) as a reducer progresses,
and it is always desirable to detect the earliest time at which a pivot message can be
discarded, to maximize the memory's utilization.
The main idea of usage-based tracking is to maintain a usage counter per pivot message
in the shared buffer. The tracking is performed as follows. Each Put operation
sets the counter to the total use frequency, and only pivot messages whose usage
counter is larger than zero are added to the shared buffer. Each Get operation decrements
the counter of the target pivot message by one. Once the counter reaches zero, the
corresponding pivot message is evicted from the shared buffer.
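A minimal sketch of such a usage-counted buffer; the message keys and counts are illustrative.

```python
class UsageBuffer:
    """Shared buffer with usage-based eviction: each pivot message
    carries its total use frequency; every Get decrements it, and a
    message is evicted as soon as its remaining count hits zero."""
    def __init__(self):
        self.buf = {}                    # v -> [remaining_uses, nh_v]

    def put(self, v, nh_v, total_uses):
        if total_uses > 0:               # never cache an unused message
            self.buf[v] = [total_uses, nh_v]

    def get(self, v):
        entry = self.buf[v]
        entry[0] -= 1
        if entry[0] == 0:                # last reference: evict now
            del self.buf[v]
        return entry[1]

buf = UsageBuffer()
buf.put('m2', {1, 2}, 2)                 # m2 will be referenced twice
buf.get('m2'); print('m2' in buf.buf)    # True: one use remaining
buf.get('m2'); print('m2' in buf.buf)    # False: evicted after last use
```

The attraction of this policy is that eviction decisions cost O(1) per operation and require no global knowledge beyond the per-message frequency computed on the map side.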
The usage-based scheme may fall short in optimizing sparse and scattered access pat-
terns. For example, as shown in Figure 3.3, the reuse distance of message m4 is large.
Figure 3.4: Bermuda: The Usage of External Memory.
Therefore, the usage-based tracking strategy has to keep m4 in the shared buffer although
it will not be referenced for a long time. Worse, such scattered accesses are common
in massive graphs. Therefore, pivot messages may unnecessarily overwhelm the available
memory of each single reduce instance.
3.2.1.2 Bucket-Based Tracking
We introduce the bucket-based tracking strategy to optimize message sharing over scattered
access patterns. The main idea is to manage the access patterns of each pivot message
at a smaller granularity, called a bucket. The processing sequence of keys/nodes is
sliced into buckets, as illustrated in Figure 3.3. In this work, we use the range partitioning
method to balance the workload among buckets. Correspondingly, the usage counter of
a pivot message is defined per bucket, i.e., each message has an array of usage
counters whose length equals the number of buckets. For example, the usage count of
m4 in the first bucket is 1, while in the second bucket it is 0. Therefore, for a pivot message
that will remain idle (with no references) for a long time, its counter array will have a long
sequence of adjacent zeros. Such access pattern information can be computed in the map
function, encoded in the access pattern (Line 7, Algorithm 4), and passed to the reduce side.
The corresponding modification of the Put operation is as follows. Each new pivot
message is pushed into the shared buffer (in memory) and backed up in local files
(on disk) based on its access pattern. Figure 3.4 illustrates this procedure. For the arrival
of a pivot message with the access pattern [1, 0, .., 1, 3], the reduce instance adds
this message into the back-up files for buckets B_{p−1} (next-to-last bucket) and B_p (last bucket).
Then, at the end of each bucket's processing and before the start of processing the next
bucket, all pivot messages in the shared buffer are discarded, and a new set of pivot
messages is fetched from the corresponding back-up file into memory (see Figure 3.4).
The Bucket-based tracking strategy provides better memory utilization since it pre-
vents the long retention of unnecessary pivot messages. In addition, usage-based tracking
can be applied to each bucket to combine both benefits, which is referred to as the bucket-
usage tracking strategy.
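The bucket mechanics can be sketched as follows; a plain dict stands in for the per-bucket back-up files, and the message m4 with access counts [1, 0, 0, 1] mirrors the example above.

```python
from collections import defaultdict

class BucketBuffer:
    """Bucket-based tracking: a message's per-bucket usage counts
    decide which buckets it is backed up for; at a bucket boundary
    the in-memory buffer is swapped wholesale for the next bucket's
    working set."""
    def __init__(self):
        self.spill = defaultdict(dict)   # bucket -> {v: nh_v}; stands in for back-up files
        self.memory = {}                 # messages live for the current bucket

    def put(self, v, nh_v, counts):
        # counts[b] = number of references to v's message within bucket b
        for b, c in enumerate(counts):
            if c > 0:
                self.spill[b][v] = nh_v  # back up only where it is needed

    def next_bucket(self, b):
        # Discard the previous bucket's messages and load only the
        # messages that bucket b will actually reference.
        self.memory = dict(self.spill.pop(b, {}))

buf = BucketBuffer()
buf.put('m4', {7}, [1, 0, 0, 1])         # referenced in buckets 0 and 3 only
buf.next_bucket(0); print(sorted(buf.memory))  # ['m4']
buf.next_bucket(1); print(sorted(buf.memory))  # []
buf.next_bucket(3); print(sorted(buf.memory))  # ['m4']
```

Unlike the usage-based scheme, m4 does not occupy memory during the idle buckets 1 and 2, which is precisely the retention problem the bucket granularity eliminates.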
3.2.2 Analysis of Bermuda-VC
In this section, we show the benefits of the Bermuda-VC algorithm over the Bermuda-EC
algorithm. Furthermore, we discuss the effect of the parameter p, the number of
buckets, on the performance.
Under the same setting of the number of reducers k, the Bermuda-VC algorithm generates
more intermediate messages and takes longer execution time. First, the Bermuda-VC
algorithm generates the same number of pivot messages while generating more core
messages (i.e., the additional N^L_v for reference on the reduce side). Thus, the total size of the
extra N^L_v core messages is Σ_{v∈V} |N^L_v| = m. Such a noticeable amount of extra core messages
requires additional time for generation and shuffling. Moreover, an additional computational
overhead (Lines 13-14) is required for the message sharing management.
However, because of the proposed sharing strategies, the Bermuda-VC algorithm can
work under smaller settings of k, settings under which the Bermuda-EC algorithm
will probably fail. In such cases, the benefits brought by a smaller k
exceed the corresponding cost, and the Bermuda-VC algorithm outperforms the
Bermuda-EC algorithm.
Moreover, compared to the disk-based Bermuda-EC algorithm, the Bermuda-VC algorithm
has a relatively smaller disk I/O cost because of the predictability of the access
patterns of the pivot messages, which enables purging them early, while that is not applicable
to the core messages. Notice that, for any given reduce instance, the expected usage
count of a pivot message from u is d^H_u/k. Thus, the expected usage count of any pivot
message is E(d^H_u)/k, which equals m/(nk). Therefore, the total disk I/O for pivot messages is
at most m²/(nk), smaller than the disk I/O cost m²/(Mk) of the Bermuda-EC algorithm, where M
stands for the size of the available memory of a single machine.
Effect of the number of buckets p: At a high level, p trades off the space used by
the shared buffer against the I/O cost of writing and reading the back-up files. The Bermuda-VC
algorithm favors smaller settings of p within the capacity of main memory. As p decreases,
the expected number of reads and writes decreases; however, the total size of the pivot
messages in the shared buffer may exceed the capacity of the main memory. For a setting
of p, the expected size of the pivot messages for any bucket is O(m/(kp)). Therefore, a
viable setting satisfying O(m/(kp)) ≤ M is p ≥ O(m/(kM)). In this work, p is set to O(m/(kM)),
where m is the size of the input graph and M is the size of the available memory of a
single machine.
4
Performance Evaluation
In this section, we present an experimental evaluation of the MR-Baseline, Bermuda-EC,
and Bermuda-VC algorithms. We also compare the Bermuda algorithms against GP
(the Graph Partitioning algorithm for triangle listing) [13]. The objective of our experimental
evaluation is to show that the proposed Bermuda method improves both the time and space
costs compared to the MR-Baseline algorithm. Moreover, compared to Bermuda-EC,
Bermuda-VC achieves better performance under the proposed message caching
strategies.
All experiments are performed on a shared-nothing cluster of 30 nodes.
Each node has one quad-core Intel Core Duo 2.6GHz processor, 8GB RAM, and a
400GB disk, and the nodes are interconnected by 1Gb Ethernet. Each node runs Linux and Hadoop
version 2.4.1, and is configured to run up to 4 map and 2 reduce tasks concurrently.
The replication factor is set to 3 unless otherwise stated.
            Nodes        Undirected Edges   Avg Degree   Size
Twitter     4.2 × 10^7   2.4 × 10^9         57           24GB
Yahoo       1.9 × 10^8   9.0 × 10^9         47           67GB
ClueWeb12   9.6 × 10^8   8.2 × 10^10        85           688GB
Table 4.1: Bermuda: Basic Statistics about Datasets.
4.1 Datasets
We use three large real-world graph datasets for our evaluation. Twitter is a representative
social network that captures the currently largest micro-blogging community; edges
represent friendships among users 1. Yahoo is one of the largest real-world web graphs,
with over one billion vertices 2; edges represent links among web pages.
ClueWeb12 is a subset of a real-world web graph of six billion pages 3.
In our experiments, we consider each edge of the input to be undirected. Thus, if
an edge (u, v) appears in the input, we also add the edge (v, u) if it does not already exist.
The number of nodes varies from 4.2 ∗ 10^7 (Twitter) and 1.9 ∗ 10^8 (Yahoo) to 9.6 ∗ 10^8
(ClueWeb12), with different densities; ClueWeb12 is the largest but also the sparsest dataset.
The statistics of the three datasets are presented in Table 4.1.
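The edge symmetrization step described above can be sketched as follows (a minimal Python sketch, separate from the MapReduce pipeline itself):

```python
def symmetrize(edges):
    """Return an undirected edge set: for every (u, v) in the input,
    ensure (v, u) is present too, dropping duplicates and self-loops."""
    undirected = set()
    for u, v in edges:
        if u == v:
            continue  # skip self-loops
        undirected.add((u, v))
        undirected.add((v, u))
    return undirected

edges = [(1, 2), (2, 1), (2, 3), (3, 3)]
print(sorted(symmetrize(edges)))  # [(1, 2), (2, 1), (2, 3), (3, 2)]
```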
4.2 Experiment Result
Dataset     MR-Baseline     Bermuda-EC and -VC   Reduction Factor (RF)
Twitter     3.0 ∗ 10^11     1.2 ∗ 10^10          30
Yahoo       1.4 ∗ 10^11     1.9 ∗ 10^10          7.5
ClueWeb12   3.0 ∗ 10^12 #   2.6 ∗ 10^11          11.5
Table 4.2: Bermuda: Reduction Factors of Communication Cost.
1 http://an.kaist.ac.kr/traces/WWW2010.html
2 http://webscope.sandbox.yahoo.com, 2015
3 http://www.lemurproject.org/clueweb12/webgraph.php/
(a) Twitter (b) Yahoo
Figure 4.1: Bermuda: Distribution of Mapper Elapsed Times.
4.2.1 Bermuda Technique
Bermuda directly reduces the size of the intermediate records by removing redundancy. We
experimentally verify the reduction of the pivot messages as reported in Table 4.2. In
the case of the Twitter dataset, Bermuda's output is around 30x smaller than that generated
by the MR-Baseline algorithm. Furthermore, in the case of ClueWeb, the size of the
intermediate result generated by the MR-Baseline algorithm exceeds the available disk
capacity of the cluster; the number reported in Table 4.2 is obtained through a counter
without actual record generation. The drastic difference in the size of the pivot messages
has a large impact on the running time. In the case of Twitter, MR-Baseline takes more
than 4 hours to generate and transfer 300 billion records, whereas the Bermuda-EC
and Bermuda-VC algorithms generate and transfer only 12 billion records under the
setting k = 20, which takes 9 minutes on average.
Moreover, the Bermuda methods handle the map-side imbalance more effectively. As discussed
in Section 2.4, the size of the intermediate records generated by the MR-Baseline
algorithm heavily depends on the degree distribution of the input nodes, whereas Bermuda
mitigates the effect of skewness by limiting the replication factor of the pivot messages
to at most k. Figure 4.1 shows the distribution of the mappers' elapsed times on the Twitter
and Yahoo datasets, respectively.
Figure 4.2: Bermuda: Disk Space vs. Memory Tradeoff.
Figure 4.3: Bermuda: Running Time of Bermuda-EC.

Figure 4.1(a) illustrates the map-side imbalance problem of the MR-Baseline algorithm,
as indicated by the heavy-tailed distribution of the elapsed times (the x-axis). The
majority of the map tasks finish in less than 100 minutes, but a handful of map
tasks take more than 200 minutes. The mappers with the longest completion
times received high-degree nodes to pivot on; for a node of effective degree d,
the MR-Baseline algorithm generates O(d^2) pivot messages. Figure 4.1(a) also shows
a significantly more balanced workload distribution under the Bermuda algorithms,
indicated by a smaller spread between the fastest and slowest mappers, which is
around 10 minutes. This is because, for a node of effective degree d, Bermuda generates
only O(min(k, d)·d) pivot messages; therefore, the variance of the mappers' outputs is
significantly reduced. Figure 4.1(b) shows the same behavior on the Yahoo dataset.
This empirical observation is in accordance with our theoretical analysis and Theorems 2
and 4. Thus, the Bermuda methods outperform the MR-Baseline algorithm.
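The gap between the two bounds explains the imbalance: a short sketch contrasting the per-pivot message counts O(d^2) and O(min(k, d)·d) for a skewed degree (the accounting is simplified to raw message counts):

```python
def baseline_msgs(degree):
    # MR-Baseline sends each pivot node's adjacency to every neighbor: O(d^2)
    return degree * degree

def bermuda_msgs(degree, k):
    # Bermuda replicates a pivot message at most once per reducer: O(min(k, d) * d)
    return min(k, degree) * degree

k = 20
for d in (10, 1000, 1_000_000):
    print(d, baseline_msgs(d), bermuda_msgs(d, k))
```

For a hub of degree 10^6, the baseline emits 10^12 messages while the bounded scheme emits only 2·10^7, which is why the mapper elapsed times even out.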
4.2.2 Effect of the number of reducers
In Bermuda-EC, the number of reducers k trades off the memory used by each
reducer against the used disk space. Figure 4.2 illustrates this trade-off on the Twitter and
Yahoo datasets. Initially, as k increases, the increase in storage overhead is small
while the reduction in memory is drastic. In the case of the Yahoo dataset, as k increases,
the size of the core messages decreases and can fit in the available main memory. As
k increases further, the decrease in the memory requirements gets smaller, while the
increase in disk storage grows faster. For a given graph G(V, E) and a given cluster
of machines, the range of k is bounded by two factors: the total disk space available and
the memory available on each individual machine. In the case of Yahoo, k should be no
smaller than 20; otherwise, the core messages cannot fit into the available main memory.
Figure 4.3 illustrates the runtime of the Bermuda-EC algorithm under different settings
of k on the Twitter and Yahoo datasets. In the case of Twitter, the elapsed time
decreases as k initially increases, which we attribute to the increase in parallel computation.
As k continues to increase, this benefit disappears and the total runtime slowly
increases. We attribute the increase in execution time to two factors:
(1) the increasing size of the intermediate records, O(km), and (2) the higher variance of
the map-side workload, O(2ck^2 Var(x)). As shown in Figure 4.3, the effect of k varies
from one graph to another. In the case of the Yahoo dataset, the communication cost
dominates the overall performance early, e.g., under k = 20. We attribute these different
behaviors to the nature of the input datasets: Twitter, a typical social network,
has many candidate triangles, whereas Yahoo is a typical hyperlink network with
sparse connections and relatively few candidate triangles.
For execution efficiency, Bermuda-EC keeps the relatively small core messages
in main memory, while allowing sequential access to the relatively large pivot
messages. For a given web-scale graph, a large setting of k is required to make the core
messages fit into the available memory of an individual reducer. Unfortunately, the price
is a larger amount of intermediate data, O(km), which leads to serious network congestion
and even job failures in some cases. In the case of the ClueWeb dataset, Bermuda-EC
requires a large number of reducers, e.g., on the order of 100s, which creates a prohibitively
large amount of intermediate data (on the order of 100TBs), impractical for most
clusters.
Figure 4.4: Bermuda (disk-based): Varying k vs. Running Time.
Figure 4.5: Bermuda: The Accumulation of Sharing Messages.
Although the disk-based Bermuda-EC algorithm can work under smaller settings of
k, its efficiency is limited by the large amount of disk I/O. Figure 4.4 presents the
runtime of the disk-based Bermuda-EC variation under different settings of k on the
Yahoo dataset. When k ≥ 20, the core messages fit into memory, and the runtime is as
presented in Figure 4.3. When k equals 10, the disk-based Bermuda-EC algorithm takes
less time to generate and shuffle the intermediate data, but more time in
the reduce phase. As expected, the runtime of the reduce step increases quickly and the
benefits of the smaller settings of k disappear.
4.2.3 Message Sharing Management
In Figure 4.5, we present the empirical results of the different caching strategies on the
Yahoo dataset with k = 10. As shown in Figure 4.5, without any tracking, the size of the
sharing messages grows rapidly and overwhelms the main memory available to a given
reducer, which leads to a job failure. By tracking the re-use count, the usage-based
tracking strategy immediately discards useless pivot messages when their counters reach
zero; as a result, the memory usage grows more slowly. However, the retention of pivot
messages with long re-use distances makes this discard strategy not very effective. By
considering the access pattern of the pivot messages at a smaller granularity, the bucket-based
strategy avoids the retention of pivot messages with too-long re-use distances. As
shown in Figure 4.5, the bucket-based strategy achieves better memory utilization. In
the case of the Yahoo dataset, the size of the sharing pivot messages is practical for a
commodity machine with the bucket-based strategy. The combination of the two strategies,
i.e., the bucket-usage strategy, further reduces the size of the sharing messages by avoiding
the memory storage of idle messages inside each bucket.

Dataset     MR-Baseline   GP    Bermuda-EC   Bermuda-VC
Twitter     682           378   52           66
Yahoo       439           622   82           69
ClueWeb12   −             −     −            1528

Table 4.3: Bermuda: Effectiveness Evaluation.
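The bucket-usage strategy can be sketched as a two-level cache. This is a hypothetical simplification (message ids and payloads are elided, and the class name is invented), not the dissertation's implementation:

```python
class BucketUsageCache:
    """Sketch of the bucket-usage strategy: shared pivot messages are grouped
    per bucket and each carries a remaining re-use count; a message is dropped
    the moment its count reaches zero, and an empty bucket is reclaimed whole."""
    def __init__(self):
        self.buckets = {}  # bucket id -> {message id: remaining re-use count}

    def put(self, bucket, msg_id, reuse_count):
        self.buckets.setdefault(bucket, {})[msg_id] = reuse_count

    def use(self, bucket, msg_id):
        counts = self.buckets[bucket]
        counts[msg_id] -= 1           # one re-use consumed
        if counts[msg_id] == 0:
            del counts[msg_id]        # usage-based discard of an idle message
        if not counts:
            del self.buckets[bucket]  # bucket-level reclamation

cache = BucketUsageCache()
cache.put(0, "pivot(v1)", reuse_count=2)
cache.use(0, "pivot(v1)")
cache.use(0, "pivot(v1)")
print(0 in cache.buckets)  # False: the message and its bucket were reclaimed
```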
4.2.4 Execution Time Performance
Table 4.3 presents the runtime of all algorithms on the three datasets. For the Bermuda-EC
algorithm, the number of reducers k is set to 40 and 20 for the Twitter and Yahoo
datasets, respectively. For the Bermuda-VC algorithm, k is set to 40, 10, and 10 for the
Twitter, Yahoo, and ClueWeb datasets, respectively. The settings of the reducer number k
are determined by the cluster and the given datasets. Only Bermuda-VC manages to list
all triangles in the ClueWeb dataset, whereas MR-Baseline, GP, and Bermuda-EC fail to
finish due to the lack of disk space. As shown in Table 4.3, the Bermuda methods
outperform the other algorithms on the Twitter and Yahoo datasets. The Bermuda-EC
algorithm is more than 5x faster than the GP algorithm on the Twitter dataset.
Moreover, compared to Bermuda-EC, Bermuda-VC achieves a better trade-off between
the communication cost and the reduce-side computations; it shows better performance
on the Yahoo dataset under k = 10. Moreover,
with a relatively small cluster, Bermuda-VC can scale up to larger datasets, e.g., the
ClueWeb graph dataset (688GB), on which the other techniques fail to finish.
5
Related Works
Triangle listing is a basic operation in graph analysis. Much research has been
conducted on this problem, which can be classified into three categories: in-memory
algorithms, external-memory algorithms, and distributed algorithms. Here, we briefly
review these works.
In-Memory Algorithms. The majority of previously introduced triangle listing
algorithms are in-memory approaches. Traditionally, they are further classified into
Node-Iterator [35, 36, 37] and Edge-Iterator [38, 39] algorithms with respect to the
iterator type. The authors of [37, 38, 39] improved the performance of in-memory
algorithms by adopting degree-based ordering, and matrix multiplication has been used
to count triangles [35]. However, all these algorithms are inapplicable to massive graphs
that do not fit in memory.
External-Memory Algorithms. To handle massive graphs, several external-memory
approaches were introduced [9, 10, 11]. The common idea of these methods is to:
(1) partition the input graph so that each partition fits into main memory; (2) load each
partition individually into main memory, identify all its triangles, and then remove the
edges that participated in the identified triangles; and (3) after the whole graph is loaded
into the memory buffer once, merge the remaining edges and repeat the former steps until
no edges remain. These algorithms require many disk I/Os to read and write the edges.
The authors of [9, 10] improved the performance by reducing the amount of disk I/O
and exploiting multi-core parallelism. External-memory algorithms show great
performance in time and space; however, their parallelization is limited, and they cannot
easily scale up in terms of computing resources and parallelization degree.
Distributed Algorithms. Another promising approach to triangle listing on
large-scale graphs is distributed computing. Suri et al. [13] introduced two MapReduce
adaptations of the NodeIterator algorithm as well as the well-known Graph Partitioning (GP)
algorithm for counting triangles. The Graph Partitioning algorithm uses a universal hash
partitioning function over nodes to distribute edges into overlapping graph partitions, and
then identifies triangles within each partition. Park et al. [40] further generalized the Graph
Partitioning algorithm into multiple rounds, significantly increasing the size of the graphs
that can be handled on a given system. The authors compared their algorithm with the GP
algorithm [13] across various massive graphs and showed speedups ranging
from 2 to 5. In this work, we show that similar or even larger speedups (from 5 to 10)
can also be obtained by directly reducing the size of the intermediate results via the
Bermuda methods. Teixeira et al. [41] presented Arabesque, a distributed data processing
platform for implementing subgraph mining algorithms on the basis of the MapReduce
framework. Arabesque automates the process of exploring a very large number of
subgraphs, including triangles. However, these MapReduce algorithms must generate a
large amount of intermediate data that travels over the network during the shuffle operation,
which degrades their performance. Arifuzzaman et al. [12] introduced an efficient
MPI-based distributed-memory parallel algorithm (Patric) on the basis of the NodeIterator
algorithm. The Patric algorithm introduced a degree-based sorting preprocessing step for
efficient set intersection
operations to speed up execution. Furthermore, several distributed solutions designed for
subgraph mining on large graphs have also been proposed [1, 42]. Shao et al. introduced
the PSgL framework to iteratively enumerate subgraph instances. Different from other
parallel approaches, the PSgL framework relies completely on graph traversal and avoids
explicit join operations. These distributed-memory parallel algorithms achieve impressive
performance on large-scale graph mining tasks. However, these methods distribute the
data graph among the workers' memory; thus, they are not suitable for processing
large-scale graphs with small clusters.
6
Problem Definition
In this chapter, we present the model of uncertain graphs, the privacy criterion, and the
utility metric. We then formally formulate the uncertain graph anonymization problem.
6.1 Uncertain Graph
Let G = (V, E, p) be an uncertain graph, where V is the set of nodes, E is the set of edges,
and the function p : E → [0, 1] assigns a probability of existence to each edge, denoted as
p(e). In this work, we assume the possible-world semantics, where the edge probabilities
are independent of each other; this is a common assumption in uncertain graph
analytics [17, 18, 23, 66, 67, 68]. Specifically, the possible-world semantics interprets G as
a set of possible deterministic graphs W(G) = {G1, G2, ..., Gn}, where each deterministic
graph Gi ∈ W(G) includes all vertices of G and a subset of edges EGi ⊆ E. The
probability of observing any possible world Gi = (V, EGi) ∈ W(G) is
Pr[Gi] = ∏_{e ∈ EGi} p(e) · ∏_{e ∈ E\EGi} (1 − p(e))
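Under the independence assumption, the possible-world probability can be computed directly from the edge probabilities. A small self-contained Python sketch over a toy graph (the edge values are illustrative, not from the dissertation):

```python
from itertools import product

# Edge probabilities of a toy uncertain graph G
p = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.1}

def world_prob(present_edges):
    """Probability of one possible world: the product of p(e) for present
    edges times (1 - p(e)) for absent ones, assuming independent edges."""
    prob = 1.0
    for e, pe in p.items():
        prob *= pe if e in present_edges else (1.0 - pe)
    return prob

# Probabilities over all 2^|E| possible worlds sum to 1
total = sum(world_prob({e for e, bit in zip(p, bits) if bit})
            for bits in product([0, 1], repeat=len(p)))
print(round(total, 10))  # 1.0
```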
In this work, we assume that the input uncertain graph is undirected and contains no
self-loops or multiple edges between the same pair of vertices.

(a) An uncertain graph (b) Its degree uncertainty matrix
Figure 6.1: Chameleon: Privacy Risk Assessment.
6.2 Attack Model and Privacy Criteria
In this work, we consider the re-identification attack based on node degree, where an
adversary is assumed to have a prior knowledge of the degree of nodes in the original
uncertain graph [27]. This information can be obtained by various ways, e.g., adversary’s
malicious actions to monitor the network or from a public source [69]. The privacy cri-
terion we adopt to prevent this attack is known as the (k, ε)-obf criterion [28]. The basic
idea behind k-obf is to blend every node with other fuzzy-matching nodes so that the node
cannot be easily distinguished. K-obf is quite similar to k−anonymity but more suitable
to measure the anonymity level provided by an uncertain graph due to its foundation from
information theory [34]. Moreover, the introduction of a tolerance parameter ε allows
skipping (ignoring) up to ε ∗ |V | nodes during the anonymization process, which may
represent extreme unique nodes, e.g., Trump in a Twitter network, whose obfuscation is
almost impossible. The formal definition is as follows:
Definition 1 ((k, ε)-obf [28]) Let P be a vertex property (i.e., the vertex degree in our
work), k ≥ 1 be a desired level of anonymity, and ε > 0 be a tolerance parameter. An
anonymized uncertain graph G̃ is said to k-obfuscate a given vertex v ∈ G w.r.t. P if
the entropy H of the distribution YP(v) over the vertices of G̃ is greater than or equal
to log2 k:

H(YP(v)) ≥ log2 k.

The uncertain graph G̃ is (k, ε)-obf w.r.t. property P if it k-obfuscates at least (1 − ε)|V|
vertices of G.
Figure 6.1(b) gives an example of how to compute the degree entropy for the uncertain
graph in Figure 6.1(a). Here, the vertex property P represents the node degree. Each
row in the L.H.S. table represents the degree distribution of the corresponding node; for
example, node a has degree 0 with probability 0.006. The R.H.S. table normalizes the
values in each column (i.e., over each degree value d) to obtain the distributions YP(v).
The entropy H(YP(v)) is then computed for each degree (one column in the R.H.S. table),
as shown in the bottom row.
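The normalize-then-measure computation described above can be sketched as follows; the degree matrix below is a toy illustration, not the values of Figure 6.1:

```python
import math

def column_entropies(degree_matrix):
    """degree_matrix[v][d] = Pr[deg(v) = d]. Normalize each degree column
    into a distribution Y_P over vertices, then return its entropy per degree."""
    vertices = list(degree_matrix)
    degrees = range(len(next(iter(degree_matrix.values()))))
    entropies = {}
    for d in degrees:
        col = [degree_matrix[v][d] for v in vertices]
        total = sum(col)
        if total == 0:
            continue
        dist = [x / total for x in col]  # Y_P(v) for this degree value
        entropies[d] = -sum(q * math.log2(q) for q in dist if q > 0)
    return entropies

# Two indistinguishable vertices yield entropy log2(2) = 1 for every degree,
# i.e., they 2-obfuscate each other w.r.t. node degree.
m = {"a": [0.1, 0.9], "b": [0.1, 0.9]}
print(column_entropies(m))  # {0: 1.0, 1: 1.0}
```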
6.3 Reliability-Based Utility Loss Metric
Reliability generalizes the concept of connectivity by capturing the probability that two
given (sets of) nodes are reachable over all possible worlds of the uncertain graph.
Reliability forms the foundation of numerous uncertain graph algorithms, such as
k-nearest-neighbor search [18, 66] and shortest-path detection [23, 58]. It is of core
interest in uncertain graph analysis tasks; therefore, it should be well preserved in the
anonymization process.
Inspired by its significance, we propose a novel utility loss metric based on reliability
(a connectivity-based graph measure). Specifically, we use the two-terminal reliability
difference to reflect the impact of anonymization on the uncertain graph structure
(referred to as the Reliability Discrepancy).
Definition 2 (Two-Terminal Reliability [70]) Given an uncertain graph G and two distinct
nodes u, v ∈ V, the reliability of (u, v) is defined as:

Ru,v(G) = Σ_{G′ ∈ W(G)} IG′(u, v) · Pr[G′]

where Pr[G′] is the probability of observing G′ as one possible world of G, and IG′(u, v)
equals 1 iff u and v are contained in the same connected component of G′, and 0 otherwise.
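Since exact two-terminal reliability is known to be #P-hard in general, it is commonly estimated by sampling possible worlds. A minimal Monte Carlo sketch of Definition 2 (the function name and the toy graph are illustrative):

```python
import random

def mc_reliability(nodes, edges, u, v, samples=20000, seed=7):
    """Monte Carlo estimate of R_{u,v}(G): sample possible worlds by flipping
    each edge independently, then check u-v connectivity by graph traversal."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        adj = {n: [] for n in nodes}
        for (a, b), p in edges.items():  # materialize one possible world
            if rng.random() < p:
                adj[a].append(b)
                adj[b].append(a)
        seen, stack = {u}, [u]
        while stack:                      # DFS from u
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        hits += v in seen
    return hits / samples

# Two parallel length-2 paths from s to t, each edge with probability 0.9
edges = {("s", "a"): 0.9, ("a", "t"): 0.9, ("s", "b"): 0.9, ("b", "t"): 0.9}
est = mc_reliability(["s", "a", "b", "t"], edges, "s", "t")
# exact value: 1 - (1 - 0.81)^2 = 0.9639
```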
Definition 3 (Graph Reliability Discrepancy) The reliability discrepancy of an anonymized
graph G̃ = (V, Ẽ, p̃) w.r.t. the original graph G = (V, E, p), denoted as ∆(G̃), is defined
as the sum of the two-terminal reliability discrepancies over all node pairs (u, v) ∈ V × V:

∆(G̃) = Σ_{(u,v) ∈ V×V} |Ru,v(G̃) − Ru,v(G)|
6.4 Problem Statement
Given the above foundations, we can now formulate the addressed problem.

Problem 1 (Reliability-Preserving Uncertain Graph Anonymization) Given an uncertain
graph G = (V, E, p) and anonymization parameters k and ε, the objective is to find a
(k, ε)-obfuscated uncertain graph G̃ = (V, Ẽ, p̃) with minimal ∆(G̃). That is:

argmin_{G̃} ∆(G̃)   subject to   G̃ is (k, ε)-obf
7
Uncertain Graph Anonymization via
Representative Instance
One baseline approach for anonymizing an uncertain graph G involves two phases that
combine isolated but complementary work from the literature. The first is to transform
G into one deterministic representative graph, and the second is to use any technique from
the "uncertainty semantic-based modification" category [28, 29, 30] to generate anonymized
uncertain graphs (see the top part of Figure 7.1). Fortunately, a few techniques have
recently been proposed for extracting representative deterministic graphs from an uncertain
graph while capturing its key properties [71].
The advantage of this approach, which we refer to as Rep-An (short for Representative
Anonymization), is that it does not require new anonymization techniques.
However, the disadvantages of Rep-An are manifold. First, the input edge uncertainties
(probabilities) are no longer integrated into the anonymization process, since they are
detached from the graph in the first phase. Second, the anonymization process (the second
phase) is oblivious to the reliability metric, since its input is a made-up deterministic
graph. Third, since the two phases are isolated from each other, each phase optimizes for
Figure 7.1: Chameleon: Representative-based Anonymization (Rep-An).
a different metric. As a result, this naive Rep-An approach introduces a high level of
noise, and consequently significantly reduces the overall utility of the anonymized graph.
Figure 7.1 illustrates these limitations. The input uncertain graph (L.H.S.) has the
corresponding deterministic representative graph (middle) according to [71]. This graph
is viewed by state-of-the-art anonymization techniques as being already anonymized and
would be published as is (R.H.S. top graph). However, it is clear that an anonymized graph
with much higher utility can be generated, e.g., the R.H.S. bottom graph. In the experiment
section, we further study this approach empirically and confirm its impracticality.
8
Chameleon Framework
We first describe the general iterative skeleton of Chameleon, and then present in detail
the two core steps in each iteration, namely edge selection and perturbation injection, in
Sections 8.2 and 8.4, respectively.
8.1 Chameleon Iterative Skeleton
Algorithm 5 presents the skeleton of the framework. It takes as input the original
uncertain graph G, the adversary knowledge K representing the node degrees in G, the
privacy parameters k and ε, and two other parameters, which will be described later. The
output is an anonymized graph Gobf that has the same set of vertices but a modified set
of edges. The core of the process is the GenerateObfuscation function (Lines 3
and 8), which performs two key tasks: first, selecting a subset of edges to alter; and second,
deciding on the amount of noise to inject into these edges.
In general, G can have up to |V |(|V | − 1)/2 edges. In each execution of
GenerateObfuscation(), a relatively small candidate set of edges Ec is selected
for perturbation; its size is controlled by the input parameter c > 1 (a multiplier factor),
i.e., |Ec| = c|E|. The function then injects probability noise over each edge e ∈ Ec.
To simulate the stochastic process, the amount of injected noise re ∈ [0, 1] follows the
[0, 1]-truncated normal distribution:
Rσ(r) := Φ0,σ(r) / ∫_0^1 Φ0,σ(x) dx   if r ∈ [0, 1];   0 otherwise.   (8.1)
where Φ0,σ is the density function of a Gaussian distribution with standard deviation
σ. As σ decreases, a greater mass of Rσ is concentrated near r = 0, and thus the
amount of perturbation r is smaller. In other words, the smaller the σ, the less noise
is injected and the higher the utility of the published graph. More details on these two
steps are presented in the following two sections.
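The distribution in Equation 8.1 can be sampled, for instance, by simple rejection from the underlying Gaussian; a minimal sketch (an illustration, not the dissertation's implementation):

```python
import random

def sample_truncated_normal(sigma, rng):
    """Draw noise r from R_sigma (Eq. 8.1): a N(0, sigma^2) density
    truncated to [0, 1], via rejection sampling. Efficient enough when
    sigma is not tiny relative to the [0, 1] interval."""
    while True:
        r = rng.gauss(0.0, sigma)
        if 0.0 <= r <= 1.0:
            return r

# Smaller sigma concentrates the mass near r = 0, i.e., less injected noise
rng = random.Random(42)
mean_small = sum(sample_truncated_normal(0.05, rng) for _ in range(2000)) / 2000
mean_large = sum(sample_truncated_normal(0.5, rng) for _ in range(2000)) / 2000
```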
Algorithm 5 Chameleon Iterative Skeleton
Input: uncertain graph G, adversary knowledge K, obfuscation level k, tolerance level ε, size multiplier c, and white-noise level q
Output: the anonymized result Gobf
1:  σl ← 0; σu ← 1
2:  repeat
3:      ⟨ε̃, G̃⟩ ← GenerateObfuscation(G, k, ε, c, q, σu, K)
4:      if ε̃ = 1 (fail) then σl ← σu; σu ← 2σu
5:  until ε̃ ≠ 1
6:  repeat
7:      σ ← (σu + σl)/2
8:      ⟨ε̃, G̃⟩ ← GenerateObfuscation(G, k, ε, c, q, σ, K)
9:      if ε̃ = 1 then σl ← σ
10:     else σu ← σ; Gobf ← G̃
11: until σu − σl is sufficiently small
12: return Gobf
The iterative algorithm starts with an initial guess of the upper bound for σ, which is
σu = 1 (Line 1). It then enters a loop trying to find a successful initial (k, ε)-obf
for the graph (Lines 2-5); with each failed attempt, it doubles σu to allow more
noise to be injected (Line 4). After a successful attempt, the algorithm enters its second
loop, which is a binary search over the found lower bound σl and upper bound σu
(Lines 7-11). The binary search terminates when the search interval is sufficiently short,
and the algorithm outputs the best (k, ε)-obf graph, which is the last one obtained with
the smallest σ.
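The two loops of Algorithm 5 amount to an exponential search followed by a binary search over σ. A minimal sketch with a stubbed obfuscation callback (the callback signature is a simplification of GenerateObfuscation, and the 1.3 threshold is invented for illustration):

```python
def find_min_sigma(try_obfuscate, tol=1e-3):
    """try_obfuscate(sigma) returns (success, graph). Double sigma_u until an
    obfuscation succeeds, then binary-search the smallest sigma that works."""
    sigma_l, sigma_u = 0.0, 1.0
    ok, best = try_obfuscate(sigma_u)
    while not ok:                       # phase 1: exponential search (Lines 2-5)
        sigma_l, sigma_u = sigma_u, 2 * sigma_u
        ok, best = try_obfuscate(sigma_u)
    while sigma_u - sigma_l > tol:      # phase 2: binary search (Lines 6-11)
        mid = (sigma_u + sigma_l) / 2
        ok, g = try_obfuscate(mid)
        if ok:
            sigma_u, best = mid, g
        else:
            sigma_l = mid
    return sigma_u, best

# Stub: pretend any sigma >= 1.3 yields a (k, eps)-obfuscated graph
sigma, g = find_min_sigma(lambda s: (s >= 1.3, "Gobf" if s >= 1.3 else None))
print(round(sigma, 2))  # 1.3
```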
This general iterative skeleton is similar to the one proposed in [28]. However, the two
fundamentally differ in the core function GenerateObfuscation (Lines 3 and 8). The
method proposed in [28] has two limitations for uncertain graph anonymization.
First, it does not consider the structural relevance of edges in the critical
edge-selection step, which leads to unnecessary structural distortion. Second, its scheme
assumes the existence of edges is known with certainty, and thus fails to handle uncertain
graphs, where the existence of edges is probabilistic.
8.2 Hybrid Edge Selection
As discussed in Section 8.1, the first step inside the GenerateObfuscation() function
is to select a subset of candidate edges for obfuscation. Figuring out the optimal
subset of edges that balances the privacy gain and the utility loss is a typical combinatorial
optimization problem: it involves considering an exponential number of edge combinations,
let alone the infinite possibilities of probability values on the selected edges, which further
complicate the problem.
In the context of deterministic graphs, various heuristics have been utilized [32, 50, 72]
to alleviate this combinatorial intractability. These heuristics fall into two main categories:
(1) anonymity-oriented heuristics, which suggest injecting larger perturbations into the
edges associated with less-anonymized (more unique) nodes [27, 28, 31, 32, 33, 43, 44,
50, 54, 72, 73, 74]; and (2) utility-oriented heuristics, which suggest avoiding perturbations
of "bridge" and sensitive edges whose deletion or addition would significantly impact the
graph structure [31, 32, 33, 43, 54, 72, 74]. It is clear that these two
types are complementary to each other, and combining them introduces an added benefit,
as confirmed in practice in deterministic graph anonymization [72]. Nevertheless, these
two types of heuristics and their combination have not yet been explored in the context of
uncertain graphs.
Therefore, for the edge selection step in Chameleon, we (1) introduce a novel edge
relevance metric, called reliability relevance (RR), for quantifying the impact of edge
modifications on the overall uncertain graph reliability; (2) present an efficient algorithm
for evaluating reliability relevance; and (3) propose a hybrid heuristic that combines
uniqueness with reliability relevance. On this basis, we propose a sampling-based approach
to identify the candidate subset of edges efficiently.
8.2.1 Uniqueness Score
The uniqueness score was proposed in [28] as a relative measure indicating how common
(or unique) a given node is among the other nodes in the graph w.r.t. a specific property,
e.g., node degree. The definition is given as follows.
Definition 4 (Uniqueness Score [28]) Let P : V → ΩP be a property on the set of
nodes V of graph G, let d be a distance function on ΩP, and let θ > 0 be a positive
parameter. Then the θ-commonness of a property value ω ∈ ΩP is

Cθ(ω) = Σ_{v ∈ V} Φ0,θ(d(ω, P(v))),

while the θ-uniqueness of ω ∈ ΩP is Uθ(ω) = 1 / Cθ(ω).
In simplified terms, the commonness of the property value ω measures how typical the
value ω is among the vertices of the graph (a density estimation). The uniqueness score
assigned to each node is the inverse of its commonness score: the higher the uniqueness
score, the less protected the node and the more anonymization it eventually needs. To
attain a smooth density estimation, the normal distribution Φ is used as the kernel function,
and θ serves as a smoothing parameter that controls how far the property value ω spreads
over the domain. In the original work [28], θ is set to σ, where σ is the standard deviation
of the noise-generating Gaussian distribution (see Algorithm 5), because a larger amount
of injected uncertainty (noise) implies that the property values may spread over a larger
domain.
In Chameleon, we adopt the same metric to capture how typical a value ω (i.e., a
node degree) is among the vertices of the uncertain graph, where the ω of one vertex is
defined as a random variable, as shown in Figure 6.1(b). Since each vertex has its own
distribution, we set θ equal to the average standard deviation of the ω distributions across
all vertices of the original uncertain graph G.
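Definition 4 can be computed directly with a Gaussian kernel over pairwise property distances. A toy Python sketch (the degree values are illustrative):

```python
import math

def gaussian_kernel(x, theta):
    # Phi_{0, theta}: density of N(0, theta^2) evaluated at x
    return math.exp(-x * x / (2 * theta * theta)) / (theta * math.sqrt(2 * math.pi))

def uniqueness_scores(values, theta):
    """theta-commonness of each property value omega is a kernel density
    estimate over all vertices; uniqueness is its inverse (Definition 4)."""
    scores = {}
    for omega in values:
        commonness = sum(gaussian_kernel(abs(omega - v), theta) for v in values)
        scores[omega] = 1.0 / commonness
    return scores

# An outlier degree (50) is far less common than the clustered degrees,
# so it receives a much higher uniqueness score.
degrees = [3, 3, 4, 4, 5, 50]
u = uniqueness_scores(degrees, theta=1.0)
assert u[50] > 3 * u[4]
```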
8.2.2 Reliability Relevance
The utilization of the uniqueness score enables selecting edges that return a high privacy
gain. However, it ignores the fact that modifications on different edges may incur entirely
different utility losses. Referring to the example in Figure 8.1(a), the two vertices a and e
are assigned the same uniqueness score due to the identical probabilities associated with
their edges. As a result, the anonymity-oriented heuristics would select and perturb either
of the two edges (a, c) and (c, e) with the same probability. However, as indicated in
Figure 8.1(a), a modification to (c, e), which is the only link connecting two reliable
clusters, clearly incurs a much larger structural distortion than one to (a, c). Therefore, to
target high utility, we suggest calibrating the perturbation applied to an edge e according
to its "utility impact". Namely, if the modification of an edge e would incur a large
structural distortion, then the injected modification (if needed) should be minimal. This
calls for an effective measure of the utility loss triggered by edge modifications in the
context of uncertain graphs.
To this end, we propose a theoretically sound estimation of the "reliability deviation"
caused by individual uncertain-edge modifications in a fine-grained way (see Figure 8.1(b)).
This estimation enables us to systematically quantify the sensitivity of each edge w.r.t.
reliability. For example, the difference between edges (c, e) and (a, c) in Figure 8.1(a)
can be captured more formally by considering their reliability deviations, as illustrated in
Figure 8.1(b): a larger slope (e.g., the slope corresponding to (c, e)) indicates that small
changes to that edge's probability lead to large distortions of the reliability.
Based on the reliability deviation estimation, we introduce a new measure at the edge
level, called Edge Reliability Relevance (ERR), and an aggregated measure at the vertex
level, called Vertex Reliability Relevance (VRR), as formally defined next. These measures
enable ranking the edges targeted for obfuscation from a meaningful, utility-based
perspective.
Definition 5 (Two-Terminal Reliability Relevance) Given an uncertain graph G and two
nodes u and v, the reliability Ru,v(G), as defined in Def. 2, can be considered a multivariate
function of all the edge probabilities in G. Thus, given an uncertain edge e ∈ G, the
partial derivative of Ru,v(G) w.r.t. e's probability variable p(e), denoted as ERR^e_{u,v}(G),
represents the sensitivity of the two-terminal reliability Ru,v w.r.t. p(e) while all other
probabilities are held constant. It is defined as:

ERR^e_{u,v}(G) = ∂Ru,v(G) / ∂p(e)
Lemma 1 (Factorization Lemma) Given an uncertain graph G, the reliability of the node pair (u, v), i.e., R_{u,v}(G), can be factorized via a specific uncertain edge e as follows:

    R_{u,v}(G) = p(e) \, R_{u,v}(G_e) + (1 - p(e)) \, R_{u,v}(G_{\bar{e}})
Figure 8.1: Chameleon: Edge Modifications' Impact vs. Reliability Relevance. (a) An uncertain graph; (b) the reliability R_{a,e} vs. p(e).
where the uncertain graphs G_e and G_ē are identical to the original graph G except that e is certainly present in the former and certainly absent in the latter.
According to the factorization lemma, the partial derivative ERR^e_{u,v} can be rewritten as:

    ERR^e_{u,v}(G) = R_{u,v}(G_e) - R_{u,v}(G_{\bar{e}})
On one hand, this factorization indicates that, for a given edge e, the incurred reliability discrepancy is linear in the change of the edge probability. On the other hand, it indicates that edges at different topological locations have different reliability sensitivity. Another crucial point to highlight is that R_{u,v}(G_e) - R_{u,v}(G_ē) ≥ 0 always holds, since the set of connected pairs in G_e is guaranteed to be a superset of (or equal to) that in G_ē.
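The factorization lemma can be checked directly by exact possible-world enumeration on a toy uncertain graph. The sketch below is illustrative only: the graph and its probabilities are hypothetical, and exhaustive enumeration is feasible only for tiny graphs.

```python
from itertools import product

def connected(present_edges, u, v):
    # BFS over one sampled deterministic world
    adj = {}
    for a, b in present_edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    seen, stack = {u}, [u]
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in adj.get(x, []):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False

def reliability(probs, u, v):
    """Exact R_{u,v}: sum, over all 2^|E| possible worlds, of the
    probability of each world in which u and v are connected."""
    edges = list(probs)
    total = 0.0
    for bits in product([0, 1], repeat=len(edges)):
        pr, present = 1.0, []
        for bit, e in zip(bits, edges):
            pr *= probs[e] if bit else 1.0 - probs[e]
            if bit:
                present.append(e)
        if connected(present, u, v):
            total += pr
    return total

# hypothetical toy uncertain graph
probs = {('a', 'c'): 0.8, ('c', 'e'): 0.5, ('a', 'b'): 0.6, ('b', 'e'): 0.7}
e, p = ('c', 'e'), 0.5
R      = reliability(probs, 'a', 'e')
R_e    = reliability({**probs, e: 1.0}, 'a', 'e')   # e certainly present
R_ebar = reliability({**probs, e: 0.0}, 'a', 'e')   # e certainly absent

# the factorization lemma holds exactly, and ERR^e_{a,e} is non-negative
assert abs(R - (p * R_e + (1 - p) * R_ebar)) < 1e-12
ERR_ae = R_e - R_ebar
assert ERR_ae >= 0.0
```

The two assertions mirror the lemma and the superset argument above: reliability is multilinear in the edge probabilities, so the identity holds exactly.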
Considering a single uncertain edge e, the derivatives ERR^e_{u,v}(G) over all vertex pairs in G can be arranged in a |V| × |V| matrix, and as highlighted above, all entries of this matrix are greater than or equal to zero. By aggregating these derivatives, we can estimate the overall reliability relevance of edge e, denoted as ERR^e(G), as the sum of all the ERR^e_{u,v}
values. That is:
    ERR^e(G) = \sum_{u,v} |ERR^e_{u,v}(G)|
             = \sum_{u,v} |R_{u,v}(G_e) - R_{u,v}(G_{\bar{e}})|
             = \sum_{u,v} R_{u,v}(G_e) - \sum_{u,v} R_{u,v}(G_{\bar{e}})
Note that ERR^e equals the difference of the expected number of connected pairs between the two uncertain graphs G_e and G_ē, with edge uncertainty explicitly incorporated. In the context of edge relevance, reliability relevance can be seen as a generalization of cut-edges: it quantifies the impact of partial edge deletion or addition on connectivity in the uncertain graph. The higher an edge's reliability relevance score, the bigger the impact of perturbing that edge on the overall graph.
On the basis of these edge-level reliability relevance scores, we can now compute the vertex-level reliability relevance of a given vertex (say u) as a weighted sum of the reliability relevance of u's edges E_u:

    VRR_u(G) = \sum_{e \in E_u} p(e) \, ERR^e(G)
VRR_u(G) measures the expected impact of modifying vertex u on the graph reliability. Namely, the higher a vertex's reliability relevance, the larger the reliability distortion introduced by modifications associated with its edges.
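As a minimal illustration, VRR_u is just a p(e)-weighted sum over u's incident edges. The numbers below are hypothetical; in practice the ERR scores come from the estimation procedure described next.

```python
def vertex_reliability_relevance(incident, probs, err):
    """VRR_u = sum over u's incident edges e of p(e) * ERR^e(G)."""
    return {u: sum(probs[e] * err[e] for e in edges)
            for u, edges in incident.items()}

# hypothetical numbers, for illustration only
probs    = {('a', 'c'): 0.8, ('c', 'e'): 0.5}
err      = {('a', 'c'): 0.4, ('c', 'e'): 1.2}   # made-up ERR scores
incident = {'c': [('a', 'c'), ('c', 'e')]}

vrr = vertex_reliability_relevance(incident, probs, err)
# vrr['c'] = 0.8*0.4 + 0.5*1.2, roughly 0.92
```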
Reliability Relevance Evaluation
Given this theoretical foundation, the challenge is how to efficiently evaluate the reliability relevance of the edges in a given uncertain graph (ERR-eval). For each edge e, we need to measure the reliability difference between G_e and G_ē. This evaluation involves the two-terminal reliability problem, which is known to be NP-complete [75].
A baseline algorithm for ERR-eval is to use Monte Carlo sampling. More precisely, we sample N possible worlds of the input uncertain graph, where N is large enough (around 1,000) to guarantee high approximation accuracy. Over each sampled possible world (say G), we carry out a connected-component computation to count the number of connected pairs cc(G). Then, the count on the original uncertain graph, cc(G), can be estimated by taking the average over the sampled deterministic graphs.

Algorithm 6 Edge Reliability Relevance Evaluation
Input: G = (V, E, p); N, the number of sampled graphs
Output: ERR, the reliability relevance of the edges in G
1: CC_e ← 0, CC_ē ← 0
2: for i = 1 to N do
3:   G ← a deterministic sampled instance
4:   Ind(G) ← the edge-existence indicator of the sampled graph G
5:   cc(G) ← the number of connected pairs of G
6:   CC_e += Ind(G) · cc(G);  CC_ē += (1 − Ind(G)) · cc(G)
7: ERR = CC_e / p − CC_ē / (1 − p)

Figure 8.2: Chameleon: Sampling Estimator for ERR (the identical set of samples of G is partitioned, for a pair of edges e₁ and e₂, into the graphs containing both e₁ and e₂, only e₁, only e₂, or neither).
Theorem 5 The complexity of the baseline ERR-eval algorithm is O(|E| · N α(|V|) |E|), where α is the inverse Ackermann function.
Proof sketch: The time complexity of connected-component detection based on the union-find method is O(α(|V|) |E|) [76]. Consequently, computing the ERR for one edge over the N possible worlds takes O(N α(|V|) |E|) time, and the total time complexity for all the edges is O(|E| · N α(|V|) |E|).
Obviously, the baseline algorithm is inefficient when the input uncertain graph is very large (it is quadratic in the number of edges). Here, we present an efficient algorithm for ERR evaluation in Algorithm 6. Its basic idea is to re-use the connected-component
detection results of the samples, as illustrated in Figure 8.2. For each edge e, we group the sampled possible worlds according to the existence of e (Lines 4-6), then take the sample average of cc for each group as an accurate approximation of cc(G_e) and cc(G_ē). In this way, we bring the evaluation of edge reliability relevance into the realm of practicality.
Theorem 6 The time complexity of Algorithm 6 (ERR-eval) is O(N α(|V|) |E|), where N is the number of samples.
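The sample-reuse idea can be sketched as follows (function and variable names are ours; `cc_pairs` counts connected vertex pairs with union-find, and one shared pool of N sampled worlds serves every edge, instead of resampling per edge as the baseline does):

```python
import random
from collections import defaultdict

def cc_pairs(n, present_edges):
    """Number of connected vertex pairs, via union-find with path halving."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in present_edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    size = defaultdict(int)
    for v in range(n):
        size[find(v)] += 1
    return sum(s * (s - 1) // 2 for s in size.values())

def err_all_edges(n, probs, N=1000, seed=7):
    """Estimate ERR^e for every edge from ONE shared pool of N sampled
    worlds, grouping each world by whether it contains e (Algorithm 6)."""
    rng = random.Random(seed)
    edges = list(probs)
    # per edge: [count with e, cc-sum with e, count without e, cc-sum without e]
    stats = {e: [0, 0.0, 0, 0.0] for e in edges}
    for _ in range(N):
        present = [e for e in edges if rng.random() < probs[e]]
        cc = cc_pairs(n, present)
        ps = set(present)
        for e in edges:
            s = stats[e]
            if e in ps:
                s[0] += 1; s[1] += cc
            else:
                s[2] += 1; s[3] += cc
    err = {}
    for e, (nw, cw, no, co) in stats.items():
        avg_with = cw / nw if nw else 0.0
        avg_without = co / no if no else 0.0
        err[e] = avg_with - avg_without   # approximates E[cc(G_e)] - E[cc(G_ebar)]
    return err

# tiny sanity check: a single edge with p = 0.5 on two vertices
est = err_all_edges(2, {(0, 1): 0.5}, N=1000)
```

On the two-vertex example, every world containing the edge has exactly one connected pair and every world without it has none, so the estimate is exactly 1.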
8.3 Reliability-oriented Edge Selection Procedure
Now, we are ready to present the details of the GenerateObfuscation() function for finding a (k, ε)-obf instance for an input uncertain graph G in Algorithm 7. The function receives the parameters originally passed to the Chameleon skeleton (Algorithm 5), including a perturbation standard deviation parameter σ.
First, the function computes the uniqueness score and the reliability relevance for each node v ∈ G (Lines 1 & 2). These two invariants are crucial for our privacy-preserving and utility-preserving purposes. In order to use a "perturbation budget" σ (an amount of noise to be injected into the graph) in the most effective way, Algorithm 7 performs the following steps.
(Lines 3-4, Exclusion): Since the problem definition allows ε|V| of the nodes to remain un-obfuscated, the algorithm leverages the two invariants highlighted above, selects a set H of (ε/2)|V| nodes with the largest combined uniqueness and reliability relevance scores, and excludes them from subsequent obfuscation efforts.
(Lines 5-6, Assigning Uniqueness and Relevance Scores): The nodes not in H are the candidates for anonymization. To anonymize high-uniqueness vertices, more noise needs to be injected; thus, edges associated with those vertices need to be sampled with a higher probability. In contrast, to better preserve the graph structure, edges associated with high reliability-relevance nodes need to be sampled with a smaller probability. To implement such a sampling strategy, our algorithm assigns a probability Q_v to every v ∈ V \ H (v in V but not in H) that is proportional to v's uniqueness U_v and inversely proportional to v's reliability relevance VRR_v.

Algorithm 7 GenerateObfuscation
Input: uncertain graph G = (V, E, p); k, ε, c, q, t; and standard deviation σ
Output: a pair ⟨ε*, G*⟩ where G* is a (k, ε*)-obfuscation, or ε* = 1 if no (k, ε)-obf is found
 1: compute the uniqueness U_v for each v ∈ V
 2: compute the reliability relevance VRR_v for each v ∈ V
 3: Q_v ← U_v · VRR_v for each v ∈ V
 4: H ← the set of ⌈(ε/2)|V|⌉ nodes with the largest Q_v
 5: normalize VRR_v for v ∈ V \ H
 6: Q_v ← U_v · (1 − VRR_v) for v ∈ V \ H
 7: ε* ← 1
 8: for t times do
 9:   E_C ← E
10:   repeat
11:     randomly pick a vertex u ∈ V \ H according to Q
12:     randomly pick a vertex v ∈ V \ H according to Q
13:     if (u, v) ∈ E then
14:       E_C ← E_C \ {(u, v)} with probability p(e)
15:     else E_C ← E_C ∪ {(u, v)}
16:   until |E_C| = c|E|
17:   for all e ∈ E_C do
18:     compute σ(e)
19:     draw w uniformly at random from [0, 1]
20:     if w < q then r_e ← U(0, 1)
21:     else r_e ← R_σ(e)
22:     p̃(e) ← p(e) + (1 − 2p(e)) · r_e, yielding the candidate G̃
23:   ε′ ← anonymityCheck(G̃)
24:   if ε′ < ε* then ε* ← ε′; G* ← G̃
25: return ⟨ε*, G*⟩
(Lines 9-16, Hybrid Edge Selection): After that, the algorithm starts its t trials for finding a (k, ε)-obf. Each trial performs hybrid edge selection to select a set of candidate edges E_C, which will be subject to probability perturbation. Initially, E_C is set to E. Then, the algorithm randomly selects two distinct vertices u and v according to their assigned probabilities. The edge (u, v) is then excluded from E_C with probability p(e) if it is an edge in the original graph (Line 14); otherwise, it is added to E_C (Line 15). The process is repeated until E_C reaches the required size, which is controlled by the input parameter c as mentioned in Chapter 8. In typical uncertain graphs, the number of absent edges is usually significantly larger than the number of present uncertain edges; thus, the loop usually ends very quickly for small values of c, and the resulting set E_C includes most of the edges in E.
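The selection loop above can be sketched as follows (a simplified reading of Lines 9-16; the function and parameter names are ours, `Q` maps candidate vertices to their selection weights, and a draw with u == v is simply repeated):

```python
import random

def hybrid_edge_selection(V, E, p, Q, H, c, seed=1):
    """Grow/shrink the candidate set E_C, starting from E, to size c*|E|
    by Q-weighted vertex sampling: existing edges may be dropped with
    probability p(e); absent edges are added as perturbation candidates."""
    rng = random.Random(seed)
    cand = [v for v in V if v not in H]
    weights = [Q[v] for v in cand]
    Ec, target = set(E), round(c * len(E))
    while len(Ec) != target:
        u = rng.choices(cand, weights)[0]
        v = rng.choices(cand, weights)[0]
        if u == v:
            continue
        e = (min(u, v), max(u, v))
        if e in E:
            if rng.random() < p[e]:   # drop an existing edge w.p. p(e)
                Ec.discard(e)
        else:
            Ec.add(e)                 # add an absent edge
    return Ec

# toy usage: 6 vertices, 4 uncertain edges, size multiplier c = 1.5
V = list(range(6))
E = {(0, 1), (1, 2), (2, 3), (3, 4)}
p = {e: 0.5 for e in E}
Q = {v: 1.0 for v in V}
Ec = hybrid_edge_selection(V, E, p, Q, H=set(), c=1.5)
```

Because absent edges vastly outnumber present ones in realistic graphs, the loop terminates quickly and E_C retains most of E, matching the observation in the text.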
(Line 18, Estimating Edge Perturbation): Next, we redistribute the perturbation budget among all selected edges e ∈ E_C in proportion to their intermediate scores Q_v. Specifically, we define for each e = (u, v) ∈ E_C its uncertainty level

    Q^e := \frac{Q_u + Q_v}{2}

and then set

    \sigma(e) = \sigma |E_C| \cdot \frac{Q^e}{\sum_{e' \in E_C} Q^{e'}}

so that the average of σ(e) over all e ∈ E_C equals σ.
(Lines 19-22, Edge Probability Perturbation): This segment of the code is responsible for injecting noise into the selected edges. The details of this process are described in the following section.
(Lines 23-25, Success or Failure): Finally, if the algorithm successfully finds a (k, ε)-obfuscated graph in one of its t trials, it returns the obfuscated graph with the minimal ε. Otherwise, it indicates failure by returning ε = 1.
8.4 Anonymity-Oriented Edge Perturbing
Now, we focus on the details of injecting noise and perturbation into the set of candidate edges E_C (Lines 19-22 in Algorithm 7). There are a few techniques that inject uncertainty noise into deterministic graphs (4th cat. [28, 29, 30]). However, as discussed earlier, these techniques assume the initial state of the edges is binary, either existing or not, which is different from uncertain graphs.
Given an uncertain edge e with an initial probability p(e) in the original graph, we first estimate a perturbation level σ(e), which shapes the perturbation distribution allowed over e (Line 18 in Algorithm 7). A naive strategy to create the noise is to inject the perturbation in a random way (either addition or subtraction), as illustrated in Figure 8.3(a). However, we can theoretically prove that this "un-guided" injection is not optimal and that, with the same amount of injected noise, a better anonymization can be achieved if the injection distribution is more controlled.
We will first introduce the proposed "guided" injection method, which we refer to as anonymity-oriented perturbation, and then sketch why it works later in this section. Basically, Chameleon alters the probability of a given edge e ∈ E_C according to the following equation:

    p(e) := p(e) + (1 - 2p(e)) \cdot r_e
where the random perturbation r_e is generated as indicated in Equation 8.1. Namely, for a given edge e with probability p(e), we only consider potential edge probabilities in the limited range that is more likely to contribute to higher graph anonymity by maximizing the entropy level. In Figure 8.3(a), we show an example where the initial p(e) = 0.7 and the assigned perturbation level σ(e) = 0.5. Under the naive strategy, p(e) spreads out over the wide range [0, 1], whereas under the proposed anonymity-oriented perturbation strategy, p(e) is concentrated in a specific range that should lead
Figure 8.3: Chameleon: Anonymity-Oriented Edge Perturbation.
to a higher entropy.
Clearly, existing schemes in the literature, which are defined over deterministic graphs, become a special case of the proposed scheme (obtained by setting p(e) to either 0 or 1).
Proof Sketching the Heuristic

We proceed to briefly elaborate the rationale behind this anonymity-oriented edge perturbing scheme. The formal, detailed proof of our heuristic is available in the technical report. The core idea is to maximize the entropy of the degree uncertainty matrix (referred to as ME).
To facilitate further discussion, we consider the extreme case of k-obf, which poses a set of hard constraints on the anonymized solution. Let C denote the constraint of being k-obf, and c_v the constraint of k-obfuscating a vertex v. According to Definition 8, k-obf can be expressed as the joint satisfaction of {c_v : v ∈ V}, since the uncertain graph is said to be k-obf iff it k-obfuscates all the vertices. Formally:

    C = \prod_{v \in V} c_v    (8.2)
where

    c_v := \begin{cases} 1 & H(Y_{P(v)}) \geq \log_2 k \\ 0 & \text{otherwise} \end{cases}
In other words, given an uncertain graph, the satisfaction of C indicates whether it achieves the desired anonymity level (k-obf). However, as shown in Figure 8.3(b), a single constraint at the vertex level is either fully satisfied or fully violated, which limits the optimization opportunities of methods based on local search. In this work, we relax each individual constraint c_v to a fuzzy relation in which the satisfaction of a constraint is a continuous function of its variables' values (i.e., the entropy H(Y_{P(v)})), going from fully satisfied to fully violated:

    C_v = e^{H(Y_{P(v)}) - \log_2 |V|}    (8.3)
Theorem 7 Let Ω denote the domain of degree values in the original uncertain graph. The maximization of the provided anonymity C is equivalent to the maximization of the following function:

    \sum_{\omega \in \Omega} s(\omega) \cdot H(Y_\omega)    (8.4)
Proof Sketch: First, we can see that

    C = \prod_{v \in V} c_v = \prod_{\omega \in \Omega} \underbrace{c_\omega \cdots c_\omega}_{s(\omega)}
Taking the logarithm of both sides and combining with the approximation in Equation 8.3, we can see that

    \log(C) = \sum_{\omega \in \Omega} s(\omega) \log(c_\omega)
            = \sum_{\omega \in \Omega} s(\omega) \left[ H(Y_\omega) - \log_2 |V| \right]
            = \sum_{\omega \in \Omega} s(\omega) H(Y_\omega) - \sum_{\omega \in \Omega} s(\omega) \log_2 |V|

Therefore, after removing the constant \sum_\omega s(\omega) \log_2 |V| = |V| \log_2 |V| from log(C), our goal is actually to maximize Equation 8.4, which provides us with the relation between the global anonymity and the level of disorder of the degree uncertainty matrix.
Theorem 8 The maximization of Equation 8.4 is equivalent to the maximization of the following function:

    \sum_{\omega \in \Omega} s(\omega) \cdot H(Y_\omega) = \left[ \sum_{v \in V} H(d_v) \right] + |V| \log |V| - |V| H(\Omega)    (8.5)

This equation stems from viewing the coding length of the degree uncertainty matrix from two different perspectives (row and column).¹ It provides us with a mechanism for gaining better anonymity, namely increasing the per-vertex degree uncertainty H(d_v).
Theorem 9 As implied by the Central Limit Theorem, d_v may be approximated by the normal distribution N(µ, σ²), where µ = \sum_{e \in E_v} p(e) and σ² = \sum_{e \in E_v} p(e)(1 - p(e)). Therefore, its entropy may be approximated by the differential entropy of the normal distribution, \frac{1}{2} \ln(2\pi\sigma^2) + \frac{1}{2}. For a given p(e), the gradient of this entropy w.r.t. p(e) is proportional to 1 - 2p(e). Targeting high entropy, we apply the gradient-ascent step p(e) := p(e) + (1 - 2p(e)) \cdot r_e, achieving an increase of the degree entropy and hence an anonymity gain.
¹More details are available in the technical report.
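Theorem 9 can be written out directly in code (a sketch; the edge probabilities below are hypothetical):

```python
import math

def degree_entropy(edge_probs):
    """Normal approximation of a vertex's degree entropy:
    H(d_v) ~ 0.5 * ln(2*pi*e*sigma^2), with sigma^2 = sum p(e)(1 - p(e)).
    (Note 0.5*ln(2*pi*sigma^2) + 0.5 equals 0.5*ln(2*pi*e*sigma^2).)"""
    var = sum(p * (1.0 - p) for p in edge_probs)
    return 0.5 * math.log(2.0 * math.pi * math.e * var)

def gradient_direction(p):
    """d(sigma^2)/dp(e) = 1 - 2 p(e): the ascent direction of Theorem 9."""
    return 1.0 - 2.0 * p

# moving every p(e) toward 0.5 raises the approximate degree entropy
h_skewed  = degree_entropy([0.9] * 10)   # variance 10 * 0.09 = 0.9
h_uniform = degree_entropy([0.5] * 10)   # variance 10 * 0.25 = 2.5
```

The sign of `gradient_direction(p)` is positive below 0.5 and negative above it, which is exactly why the update p(e) + (1 − 2p(e))·r_e pushes every edge probability toward the entropy-maximizing value 0.5.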
Proof

We consider the case where a vertex can be re-identified by its degree in the context of uncertain graphs. Given an uncertain graph G = (V, E, p), let X_v(ω) denote the probability that node v has degree value ω over all the possible worlds, and let s(ω) denote the expected number of nodes with degree value ω over the vertices of G. By the sum rule of expectation,

    s(\omega) := \sum_{v \in V} X_v(\omega)

Let Y_v(ω) denote the probability that node v is the image of a target node known to have degree value ω; Y_ω thus corresponds to the posterior distribution over the nodes v given the evidence of degree value ω. Based on Bayes' theorem, we get

    Y_v(\omega) := \frac{X_v(\omega)}{s(\omega)}
Accordingly, the entropy of the distribution Y_ω over the vertices of G is defined as follows:

    H(Y_\omega) = \sum_v - Y_v(\omega) \log Y_v(\omega)
                = \sum_v - \frac{X_v(\omega)}{s(\omega)} \log \frac{X_v(\omega)}{s(\omega)}
                = \sum_v - \frac{X_v(\omega)}{s(\omega)} \left[ \log X_v(\omega) - \log s(\omega) \right]
                = \sum_v \log s(\omega) \cdot \frac{X_v(\omega)}{s(\omega)} + \sum_v - \frac{X_v(\omega)}{s(\omega)} \log X_v(\omega)
                = \log s(\omega) + \frac{1}{s(\omega)} \sum_v - X_v(\omega) \log X_v(\omega)
Thus, we get the concise expression of the weighted entropy sum as follows:

    \sum_\omega s(\omega) H(Y_\omega)
      = \sum_\omega s(\omega) \log s(\omega) + \sum_\omega \sum_v - X_v(\omega) \log X_v(\omega)
      = \sum_\omega s(\omega) \log s(\omega) + \sum_v \sum_\omega - X_v(\omega) \log X_v(\omega)    (switching the summation order)
      = \sum_\omega s(\omega) \log s(\omega) + \sum_v H(v)    (the inner sum is the entropy of v's degree)
      = \sum_\omega n P_\omega \log (n P_\omega) + \sum_v H(v)
      = n \log n - n H(\Omega) + \sum_v H(v)
where H(v) represents the entropy of the probability distribution of a given node v's degree, and H(Ω) represents the entropy of the degree distribution of the overall uncertain graph. To further simplify our problem, we assume the global degree distribution remains constant, or changes little, during the anonymization process.
Fix v ∈ G and let e_1, …, e_l be the l edges that involve v. For each 1 ≤ i ≤ l, e_i is a Bernoulli random variable that equals 1 with some probability p_i. Let d_v be the random variable corresponding to the degree of v; we have

    d_v = \sum_{i=1}^{l} e_i
Since d_v is a sum of independent random variables, it may be approximated by the normal distribution N(µ, σ²), where µ = \sum E(e_i) = \sum p_i and σ² = \sum Var(e_i) = \sum p_i (1 - p_i), as implied by the Central Limit Theorem (for l around 30 or more; for typical sizes of l in uncertain graphs, the normal approximation becomes very accurate). Recall
that uncertain-graph anonymization techniques focus on modifying the less anonymized nodes, which are usually nodes with high degree, so this approximation is reasonable. Following this path, H(v) can be approximated by the differential entropy of the normal distribution:

    H = - \int g(x) \ln g(x) \, dx
      = - \int \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \ln \left[ \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \right] dx
      = \frac{1}{2} \ln (2\pi e \sigma^2)
On this basis, we can take the partial derivative of the weighted entropy sum with respect to the uncertainty associated with a specific edge e = (u, v):

    \frac{\partial C(G)}{\partial p(e)} = \frac{\partial [H(v) + H(u)]}{\partial p(e)}
      = \frac{\partial H(v)}{\partial \sigma_v^2} \cdot \frac{\partial \sigma_v^2}{\partial p(e)} + \frac{\partial H(u)}{\partial \sigma_u^2} \cdot \frac{\partial \sigma_u^2}{\partial p(e)}
      \propto 1 - 2 p(e)

where we assume the global degree distribution changes little, i.e., H(Ω) is constant.
9 Performance Evaluation
Extensive experiments are conducted to evaluate the effectiveness and efficiency of the uncertain graph anonymization methods summarized in Table 9.2.
9.1 Experiment Settings
9.1.1 Data Collection
We tested our algorithms on three real-world uncertain graph datasets, whose characteristics are summarized in Table 9.1.
• DBLP¹ is a dataset of scientific publications and authors. In this dataset, each node represents an author, and two authors are connected by an edge if they have collaborated on a project. The uncertainty on an edge denotes the probability that the two authors will collaborate on a new project; it is obtained by a predictive model based on historical data, as described in the literature [28, 68]. The larger the number of co-authored papers, the more likely they are to collaborate on a new project.
¹http://dblp.dagstuhl.de/xml/release/dblp-2015-11-01
²https://snap.stanford.edu/data/loc-brightkite.html

• Brightkite² is a location-based social network. In this dataset, each node represents a
Table 9.1: Chameleon: Dataset Statistics and Privacy Parameters.

Dataset     # Vertices   # Edges     Avg Degree   Edge Prob (Mean)   Exp Degree (Mean)   Exp Degree (Max)   Tolerance Level ε
DBLP        824,774      5,566,096   6.75         0.46               3.1                 460                10^-4
Brightkite  58,228       214,078     7.35         0.29               2.2                 264                10^-3
PPI         12,420       397,309     63.97        0.29               19.0                483                10^-2
Figure 9.1: Chameleon: Distribution of Edge Probabilities and Degrees. (a) Edge probability distribution; (b) expected degree distribution.
user. The probability of an edge corresponds to the chance that the two users visit each other. It is obtained by a prediction model based on historical data, as described in the literature [22]. The more frequently they have visited each other, the more likely they are to visit again in the future.
• PPI¹ is a dataset of protein-protein interactions, provided by the Disease Module Identification DREAM Challenge. The probability of an edge corresponds to the confidence that the interaction actually exists, which is obtained through biological experiments.
¹https://www.synapse.org/Synapse:syn6156761/wiki/400652
Figure 9.1(a) shows the edge-probability distribution of the three datasets. Note that the DBLP dataset has only a few distinct probability values, while the Brightkite dataset's probability values are generally very small; the PPI dataset has a more uniform probability distribution. We also present the degree distributions of "unique" nodes (nodes with high degree and an obfuscation level smaller than 300). Observe that all three graphs have a heavy-tailed degree distribution; namely, they are difficult to anonymize.
9.1.2 Evaluation Metrics
The primary goal is to preserve the utility of the anonymized graph at a high level. Therefore, we measure the utility of uncertain graphs in terms of general structural properties, i.e., degree-based statistics, shortest-distance statistics, and clustering statistics, following the existing literature [28, 30].
Degree-based statistics
• Number of edges: S_NE = (1/2) Σ_{v∈V} d_v
• Average degree: S_AD = (1/n) Σ_{v∈V} d_v
• Maximal degree: S_MD = max_{v∈V} d_v
• Degree variance: S_DV = (1/n) Σ_v (d_v − S_AD)²
• Power-law exponent of the degree sequence: S_PL is the estimate of γ assuming the degree sequence follows a power law Δ(d) ≈ d^{−γ}
Shortest path-based statistics
• Average distance: S_APD is the average distance among all pairs of vertices that are path-connected.
• Effective diameter: S_ED is the 90th-percentile distance among all path-connected pairs of vertices.
Table 9.2: Chameleon: Summary of Uncertain Graph Anonymization Methods.

Method   Uncertainty-aware   Reliability-oriented   Anonymity-oriented   Publication
Rep-An   –                   –                      ✓                    The baseline [71]+[28]
ME       ✓                   –                      ✓                    Chameleon
RS       ✓                   ✓                      –                    Chameleon
RSME     ✓                   ✓                      ✓                    Chameleon

Figure 9.2: Chameleon: Two-Terminal Reliability Discrepancy.
• Connectivity length: S_CL is defined as the harmonic mean of all pairwise distances in the graph.
• Diameter: S_D is the maximum distance among all path-connected pairs of vertices.

Clustering Coefficient

• Clustering coefficient: S_CC = 3N_Δ / N_3, where N_Δ is the number of triangles and N_3 is the number of connected triples.
Except for S_NE, which can be computed exactly (for each node, summing up its adjacent edge probabilities), we rely on sampling to compute the other statistics of uncertain graphs. For each statistic, we approximate its expected value by the average value obtained over the sampled possible worlds. Here, we use 1,000 samples, since it has been shown that 1,000 samples usually suffice for the accuracy to converge [66, 68]. In particular, we use the Approximate Neighborhood Function (ANF) [77] to approximate the shortest-path-based statistics.
Figure 9.3: Chameleon: Running Time Comparison vs. Rep-An. (a) Brightkite dataset; (b) PPI dataset.
9.2 Performance of Uncertain Graph Anonymization
9.2.1 Efficiency Evaluation
In the first set of experiments, we consider four obfuscation levels, k ∈ {60, 100, 200, 300}, and the tolerance levels ε presented in Table 9.1. We experimented with the following setting: white-noise level q = 0.01, obfuscation attempts t = 5, and initial size multiplier c = 1.3. In some cases, the anonymization algorithms failed to find a proper upper bound for σ in the loop; in those cases, we increased the size multiplier c.
These obfuscation algorithms were implemented in C++ and run on an Intel Core i7 CPU (2 GHz, 6 MB cache). We report their computation times on the PPI and Brightkite datasets in Figure 9.3. Note that PPI is already (60, 0.01)-obfuscated, so no anonymization effort is needed in that case. For larger values of the anonymity level k, it takes a longer time to output a (k, ε)-obf, as shown in Figure 9.3. This is because of the increased effort (a larger size multiplier c for a more significant amount of noise) needed to achieve the higher obfuscation level (larger values of k).
Figure 9.4: Chameleon: Graph Property Preservation.

In general, the time efficiency of our Chameleon approaches is close to that of the Rep-An approach. For small values of the anonymity level k, the efficiency of the three Chameleon variants is similar; for larger values of k, RSME is faster than RS and ME. This is because the combination of the reliability-sensitive edge selection scheme (RS) and the max-entropy-based edge perturbation scheme (ME) has the benefit of maximizing the obfuscation effect within the same noise budget and keeping the graph size under control, namely, a smaller size multiplier c. Smaller values of c reduce the running time of Algorithm 7, whose main loop is over c|E| edges. This effect is evident in Figure 9.3, where the time performance of RSME is substantially better than the others.
9.2.2 Utility Loss
To assess how anonymization impacts utility, we compare the anonymized output to the original uncertain graph based on the aforementioned important graph properties. For each property, we measure it on the original graph and on the anonymized output using the sampling method, and we report the difference of their expected values.
Figure 9.5: Chameleon: Double Loss of Rep-An.

Reliability Preserving. In particular, we report the average reliability discrepancy of the anonymized graphs in Figure 9.2. The smaller the discrepancy, the better the reliability, and hence the graph structure, is preserved. In all cases, the RSME approach performs best, followed by its variants, the RS and ME approaches. The proposed uncertainty-aware approaches show improved utility preservation across uncertain graphs of different types and sizes. For instance, on the PPI dataset (k = 300), the reliability discrepancy introduced by the Chameleon approach is well below 10%, while that of Rep-An is around 20%. This improvement is even more significant on the larger datasets, e.g., the DBLP dataset.
• The ineffectiveness of Rep-An: The naive Rep-An approach significantly deteriorates data utility. The Rep-An method aims at finding a (k, ε)-obfuscation result for the representative instance G_rep instead of for the input uncertain graph, and the representative-extraction phase deteriorates data utility and anonymity at the same time. Here, an extreme case study is presented for the DBLP dataset with k = 300 and ε = 10⁻⁴. As shown in Figure 9.5(a), Rep-An introduces non-negligible edge perturbation, much larger than that of the Chameleon approaches. Compared to the original uncertain graph, G_rep provides less obfuscation to all the vertices, as witnessed by the larger number of nodes without enough anonymity k in Figure 9.5(b). Consequently, the Rep-An obfuscation result deviates far from the original uncertain graph.
Figure 9.6: Chameleon: The Gain of RS and ME. (a) RSME vs. ME; (b) RSME vs. RS.

• The gain of the RS scheme: As shown in Figure 9.6(a), modifications around different nodes in the graph have varying influence on the graph structure (reliability relevance), even with the exact same uniqueness score. The RS approach locates reliability-influential nodes via reliability relevance, an indicator of a node's spreading influence, and then prevents the edges connected to such influential nodes from undergoing large modifications. By restrictively preserving the graph skeleton, the RSME scheme is able to preserve the essence of the original uncertain graph.
• The gain of the ME scheme: The generic anonymization approach alters the probability values associated with the selected edges randomly; for each anonymization attempt, it fails to guarantee an increase in the provided anonymity. Therefore, it requires more noise than RSME, which adopts the max-entropy principle to guide edge alteration. Figure 9.6(b) shows that the RS approach introduces larger noise to the edge probabilities compared to the RSME and ME approaches.
Other Statistics Preserving. For the other statistics, we computed the average statistical error, that is, the relative absolute difference between the estimate and the real value; the smaller the error, the better the utility preservation. We present three statistics comparisons in
Figure 9.4.
In general, RSME better preserves graph properties such as degree, distance, and clustering coefficient, as shown by the smaller errors in Figure 9.4. For small values of k, e.g., k = 60, the errors introduced by the Chameleon approaches are always smaller than 1%; the larger the value of k, the larger the error introduced. The RSME approach keeps the error under control better than the others (up to 12% in the PPI dataset). Such a benefit is of particular importance for large networks, e.g., the DBLP dataset. We observed that the use of
the RS strategy also better preserves uncertain graph structures, as witnessed by the small errors in average path length and clustering coefficient. We attribute this phenomenon to the strong relation between reliability (connectivity) and the path-distance property. The reliability-sensitive edge selection strategy (RS) avoids injecting noise over influential nodes (in the structural context); thus, it can preserve the uncertain graph structure, as witnessed by the better performance in Figure 9.4.
Summary

Therefore, we can safely conclude that our experimental assessment on real-world datasets confirms the initial and driving intuition: the Chameleon approach, which explicitly incorporates edge uncertainty and the possible-world semantics into the anonymization process, significantly outperforms the baseline Rep-An approach in terms of uncertain graph utility preservation, while its efficiency is similar to that of the Rep-An approach.
Another message is that, by using fine-grained and uncertainty-aware perturbation strategies such as reliability-sensitive edge selection (RS) and max-entropy-based edge probability alteration (ME), one can achieve the same desired level of obfuscation with smaller changes to the uncertain graph, thus maintaining higher data utility.
10 Related Work
Privacy Attack Models on Graph Data: The assumptions about the adversary knowledge K, which represents the types of prior knowledge that an adversary may attain and utilize for graph de-anonymization, play a critical role in modeling privacy attacks and developing anonymization approaches. For un-weighted graphs, the common types of adversary knowledge K include attributes of vertices, links between some target individuals, vertex degree, neighborhood, embedded subgraphs, and graph metrics (e.g., betweenness, closeness, centrality) [48]. For weighted graphs, edge weights are considered the most common adversary knowledge. Table 10.1 summarizes the existing graph anonymization techniques w.r.t. their assumptions about adversary knowledge. As can be observed, vertex degree, where the adversary is assumed to have knowledge about the nodes' degrees, is the most common attack model.
Deterministic Graph Anonymization: Existing graph anonymization approaches are tailored to deterministic graphs (un-weighted and weighted), and the majority of them focus on un-weighted deterministic graphs. These approaches can be classified into four categories: (1) clustering-based generalization [48, 49, 69, 73, 74, 78], (2) edge modification [27, 43, 44, 54, 72, 79, 80], (3) edge randomization [31, 32, 33, 34], and
Table 10.1: Chameleon: Summary of Adversary Knowledge.

Adversary Knowledge      Anonymization Methods
Vertex degree            [27, 28, 30, 32, 33, 34, 48, 54, 72, 73]
Neighborhood subgraph    [32, 43, 44, 48, 69]
Vertex attribute         [29, 49, 78]
Edge weights             [31, 74, 79, 80]

Table 10.2: Chameleon: Privacy Criteria Summary of Perturbation-based Graph Anonymization Schemes.

Input Graph             Anonymized Output     Privacy Criteria             Methods
Deterministic graph     Super-graph           k-anonymity                  1st cat.
(un- & weighted)        Deterministic graph   k-anonymity                  2nd & 3rd cats.
                        Uncertain graph       k-obfuscation and variants   4th cat.
(4) Uncertainty semantic-based modifications [28, 29, 30]. Table 10.2 summarizes ex-
isting deterministic graph anonymization approaches and the privacy notions they adopt.
The generalization approaches (1st category) cluster nodes and edges into groups and
anonymize each subgraph into a super-node. Some techniques use a group size of at least k
to ensure each node is k-anonymized. However, the graph may shrink considerably in
size after anonymization, which limits the type of analytics that can be performed on the
released graph; e.g., analyzing the local structure would not be feasible.
Approaches of the remaining three categories provide anonymity by local graph mod-
ification and thus enable a wider range of analytical tasks. Works in the 2nd and 3rd cat-
egories first transform the data by different types of graph modification and then release
the perturbed output, a deterministic graph. The uncertainty semantic-based approaches
(4th category) differ in that they transform the original deterministic graph into an uncer-
tain one to be published. These techniques represent the state of the art due to their excel-
lent privacy-utility tradeoff and the flexibility brought to the solution by the fine-grained
perturbations leveraging uncertain graph semantics. Note that k-obfuscation and its
variants are used to quantify the anonymity level provided by uncertain graphs through
Table 10.3: Chameleon: Positioning Chameleon w.r.t State-Of-Art Techniques.
Input Graph                       | Adversary Knowledge / Graph Utility Model
Deterministic un-weighted graph   | Degree Sequence, Spectrum
Deterministic weighted graph      | Connectivity & Community, Graph Distance Matrix, . . .
Un-weighted uncertain graph       | [Chameleon], Reliability [Chameleon]
the information theoretic lens, as a replacement for k-anonymity. Since this category is the
closest to our proposed work, we briefly highlight several of its techniques.
The first uncertainty semantic-based modification approach was introduced by Boldi
et al. in [28]. It uses a finer-grained perturbation that adds or removes edges partially by
assigning a probability to each edge in the anonymized graph. The edge rewiring method
based on random walks (referred to as RandWalk) [29] also introduces uncertainty to edges.
Nguyen et al. [30] presented a generalized model based on uncertainty adjacency matrices
for such approaches and argued that the previous two approaches suffer from high lower
bounds on utility loss. They then introduced the MaxVar approach, which aims at maximizing
degree variance while keeping the expected degree of all nodes unchanged. By keeping the key
graph property (the degree sequence) as close as possible to that of the original graph, it is able to
achieve a better trade-off between privacy and data utility.
Weighted Graph Anonymization: Several techniques have studied the anonymiza-
tion of weighted graphs where edges as well as the corresponding weights are considered
to be sensitive [31, 74, 79, 80]. The graph modification is performed by adjusting the
weights of edges. Compared to “un-weighted” ones, they have different utility preserva-
tion objectives such as shortest paths preservation [31, 79]. Nevertheless, these techniques
are still in the context of deterministic graph anonymization.
How Chameleon is positioned compared to the State of the Art: The anonymization of
uncertain graphs has been mostly overlooked. To the best of our knowledge, Chameleon is
the first anonymization framework in that context. Compared to all existing techniques,
we address a different input graph model—which is the uncertain model. Compared to
weighted graphs, uncertain graphs represent an entirely different model with different
semantics and possible analytics [66]. Basically, "weighted" and "uncertain" are two
orthogonal properties of a graph, i.e., an uncertain graph (where edges have real-world
presence probabilities) can be either weighted (where each edge has a weight, if present) or
un-weighted. It has been discussed in [66] that casting an uncertain graph to a weighted
graph—with the goal of leveraging existing analytical techniques—leads to incorrect
semantics. Evidently, the same applies to the anonymization problem at hand. Thus,
existing techniques for weighted deterministic graphs [31, 74, 79, 80] cannot be applied
to un-weighted uncertain graphs.
Chameleon falls under the 4th category mentioned above (Uncertainty semantic-based
modifications) since it consumes and produces graphs of uncertain type. Conceptually,
various attack models and privacy criteria can be considered—the uncertain graph model
can even trigger new attack models that were not applicable in the context of deterministic
graphs. However, as a starting point, we focus on the vertex-degree (a.k.a. degree-based)
attack model [27], one of the most serious and common models, combined with the
k-obfuscation privacy criterion, which is the most suitable for uncertain graphs as shown in Ta-
ble 10.2.
11
Problem Definition
In this section, we describe the capabilities and motivations of the adversary in the context
of uncertain graphs. First, we show how the adversary may attack a naively anonymized
uncertain graph using uncertain local knowledge; we then define the node re-identification risk
and the corresponding privacy condition.
11.1 Privacy Threats
Uncertain graphs face privacy issues similar to those of deterministic graphs, i.e., the ad-
versary may have access to external information about the entities in the graph and their
relationships. This information may be obtained through malicious actions or public re-
sources. For example, in a social network, a neighbor of the victim can estimate the number
of his/her close friends. Formally, the adversary might gain a sequence of observations
about "the close friends of Ana". Equipped with such information, he may be able to
reduce the uncertainty in victim de-anonymization and threaten the privacy of entities
(individuals, companies, or sensors). For example, such an observation allows the adver-
sary to partially re-identify Ana: cand(Ana) = {a, b, c, d}. According to the similarity
between the prior knowledge of the victim individual and the candidate nodes in the released
uncertain graph w.r.t a specific subgraph signature (here, the degree distribution), candidates
are assigned different likelihoods of being the image of the victim node.
Following the literature, we model the adversary's external information as degree knowl-
edge for the vertices of the original uncertain graph G. Fix a vertex v ∈ G and let
e1, . . . , e_{n−1} be the n − 1 pairs of vertices that include v. In the uncertain graph, the
existence ei of each such pair is a Bernoulli random variable that equals 1 with probability pi.
Thus, we define the degree of v as the random variable

dv = Σ_{i=1}^{n−1} ei    (11.1)

Then, for each possible degree value ω of v, we have Xv(ω) = Pr(dv = ω). Through-
out this work, we assume the adversary has a comprehensive view of dv for the vertices
v ∈ G. As shown in Equation 11.1, there is inherent uncertainty in the adversary's knowl-
edge of the target node.
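To make Equation 11.1 concrete, the exact distribution Xv can be computed by convolving the Bernoulli edge variables one at a time. The following is a minimal sketch, not part of the dissertation's implementation; the edge probabilities are illustrative:

```python
def degree_distribution(edge_probs):
    """Exact distribution of d_v = sum of independent Bernoulli edge variables.

    edge_probs: probabilities p_i of the edges incident to v.
    Returns a list X with X[w] = Pr(d_v = w).
    """
    X = [1.0]  # with no edges considered yet, degree 0 has probability 1
    for p in edge_probs:
        nxt = [0.0] * (len(X) + 1)
        for w, prob in enumerate(X):
            nxt[w] += prob * (1 - p)   # edge absent
            nxt[w + 1] += prob * p     # edge present
        X = nxt
    return X

# illustrative incident-edge probabilities
X = degree_distribution([0.8, 0.2, 0.5])
# the entries of X sum to 1, e.g. Pr(d_v = 0) = 0.2 * 0.8 * 0.5 = 0.08
```

Each pass incorporates one more edge, so the loop runs in O(n^2) time for n incident edges.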
An adversary attempts re-identification of a target node v by using dv to locate a
candidate set. Since Ga is published, the adversary can evaluate the degree of the vertices
in Ga, looking for matches. Likewise, for any vertex va ∈ Ga, its degree evaluation
is also uncertain. In deterministic graphs, the candidate set of a target x consists of the vertices
that exactly match the property. In the uncertain graph's case, the adversary knowledge
and the public structural signature are both uncertain. Therefore, we propose the generalized
concept of a fuzzy candidate set by capturing the matching likelihood between two random
variables.

Definition 6 (Fuzzy Candidate Set under Degree) The fuzzy candidate set of a target
node v ∈ G w.r.t degree is fcand(v) = {(u, Pr(du = dv)) | u ∈ Ga}. Namely, it en-
codes each candidate node together with the confidence of the corresponding match.
[Figure omitted: (a) an anonymized uncertain graph on vertices a–f with edge probabilities 0.8, 0.2, 0.5, 0.7, 0.9, 0.1; (b) the adversary's degree distribution for Ana with probabilities 0.08, 0.08, 0.84.]
Figure 11.1: Galaxy: Probabilistic Degree-based De-anonymization.
Definition 7 (Fuzzy Equivalence) Two random variables du and dv are said to be equal
if the event du − dv = 0 occurs with probability 1. Let Ω be the domain. The probability
of du = dv can be calculated as follows:

Pr(du = dv) = Σ_{ω∈Ω} Pr(du = dv = ω) = Σ_{ω∈Ω} Pr(du = ω) Pr(dv = ω)    (11.2)
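Equation 11.2 translates directly into code. A minimal sketch, assuming the two degree distributions are independent and given as lists indexed by degree value (the distributions below are illustrative):

```python
def fuzzy_equivalence(Xu, Xv):
    """Pr(d_u = d_v) for independent discrete degree distributions,
    following Equation 11.2: sum over omega of Pr(d_u = omega) * Pr(d_v = omega)."""
    n = min(len(Xu), len(Xv))
    return sum(Xu[w] * Xv[w] for w in range(n))

# two illustrative distributions over degrees 0..2
f = fuzzy_equivalence([0.1, 0.6, 0.3], [0.2, 0.5, 0.3])
# 0.1*0.2 + 0.6*0.5 + 0.3*0.3 = 0.41
```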
Example 1 Referring to the example graph in Figure 11.1, for the target Ana, the adver-
sary has the discrete distribution of dAna shown in Figure 11.1(b) and the fuzzy candidate
set {a, b, c, d, e}.
With respect to degree, we may compute the degree uncertainty matrix XG of a given uncertain
graph G as

XG = [ x11 x12 x13 . . . x1d
       x21 x22 x23 . . . x2d
       . . .
       xn1 xn2 xn3 . . . xnd ]

where xij = Xvi(j) = Pr(dvi = j). Then, we can compute the fuzzy equivalence f(v;u) based on the product of the two
degree uncertainty matrices as

F = XG XGa^T    (11.3)

The computed values give f(v;u) for all v ∈ V and u ∈ U.
Each row of this matrix corresponds to an original vertex v ∈ V and gives the related
probabilities f(v;u) for all u ∈ U. The matrix F enables us to compute the posterior-belief
distribution Yv by normalizing the corresponding row of the matrix:

Yvi(uj) = Fi,j / Σ_{1≤l≤n} Fi,l,    1 ≤ i, j ≤ n.    (11.4)
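The matrix computation of Equations 11.3 and 11.4 can be sketched with NumPy. The two small matrices below are illustrative, not taken from the running example:

```python
import numpy as np

def posterior_belief(XG, XGa):
    """F = XG @ XGa.T gives f(v;u) per Equation 11.3; normalizing each row
    of F yields the posterior-belief distributions Y_v per Equation 11.4."""
    F = XG @ XGa.T
    return F / F.sum(axis=1, keepdims=True)

XG = np.array([[0.1, 0.9], [0.7, 0.3]])   # rows: original vertices
XGa = np.array([[0.2, 0.8], [0.6, 0.4]])  # rows: anonymized vertices
Y = posterior_belief(XG, XGa)
# each row of Y sums to 1
```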
11.2 Anonymity Measurement
In order to provide privacy protection for the vertices of uncertain graphs, we should
lower-bound the uncertainty that the adversary faces when trying to locate
the image of the target individual. We choose (k, ε)-obf [28] as the privacy condition
for the following reasons. First, k-obf [34] provides an entropy-based quantification of the
uncertainty in de-anonymization, which is more suitable for measuring the anonymity
provided by uncertain graphs. Moreover, the tolerance parameter ε allows
ignoring up to ε·|V| nodes, which may represent extremely unique nodes (e.g., Trump
in the Twitter network) whose obfuscation is almost impossible. The formal definition is
given as follows:
Definition 8 ((k, ε)-obf [28]) Let P be a vertex property (i.e., the vertex degree distribution in our
work), k ≥ 1 be a desired level of anonymity, and ε > 0 be a tolerance parameter. An
anonymized uncertain graph Ga is said to k-obfuscate a given vertex v ∈ G w.r.t P if the
entropy H(·) of the distribution of fuzzy equivalence Yv over the vertices of Ga is greater
than or equal to log2 k:

H(Yv) ≥ log2 k.
The uncertain graph Ga is (k, ε)-obf w.r.t property P if it k-obfuscates at least (1 − ε)|V|
vertices of G.
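Definition 8 can be checked mechanically. A small sketch, assuming the posterior-belief distributions have already been computed, one row per original vertex:

```python
import math

def is_k_eps_obf(Y, k, eps):
    """Check Definition 8: at least (1 - eps)|V| rows of the posterior-belief
    matrix Y must have entropy >= log2(k)."""
    def entropy(row):
        return -sum(p * math.log2(p) for p in row if p > 0)
    obfuscated = sum(1 for row in Y if entropy(row) >= math.log2(k))
    return obfuscated >= (1 - eps) * len(Y)

# a uniform posterior over 4 candidates has entropy log2(4) = 2
Y = [[0.25, 0.25, 0.25, 0.25], [1.0, 0.0, 0.0, 0.0]]
is_k_eps_obf(Y, k=4, eps=0.5)   # True: one obfuscated row out of two suffices
```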
11.3 Utility Preservation
Ideally, the anonymized graph should preserve privacy with the smallest utility loss,
permitting meaningful analysis tasks. In the uncertain graph anonymization process, we
target preserving the reliability property of a given uncertain graph. This choice is
motivated by the observation that reliability forms the foundation of numerous uncertain
graph theoretic algorithms such as k-nearest neighbors [18, 66] and shortest path detec-
tion [23, 58]. Specifically, we use the two-terminal reliability difference to reflect the
impact of anonymization on the uncertain graph structure (referred to as Reliability Discrep-
ancy). The formal definition of reliability discrepancy is available in Chapter 6.
11.4 Problem Statement
Given the above foundation, we can now formulate the addressed problem: resisting
probabilistic degree-based de-anonymization in anonymized uncertain graphs.

Problem 2 (Uncertain Graph Probabilistic Degree Anonymization) Given an uncertain
graph G = (V, E, p) and anonymization parameters k and ε, the objective is to find a
(k, ε)-obfuscated uncertain graph Ga = (V, Ea, pa) with minimal ∆(Ga). That is:

argmin_{Ga}  ∆(Ga)
subject to   Ga is (k, ε)-obf
Figure 11.2: Galaxy: Illustration of Convex and Non-Convex Set.
Non-convex Optimization Problem As described earlier, the problem of resisting prob-
abilistic degree de-anonymization over uncertain graphs is a typical constrained optimiza-
tion problem. The general form of an optimization problem is to find some s* ∈ S such
that

f(s*) = min{f(s) : s ∈ S},    (11.5)

for some feasible set S ⊂ R^n and an objective f : R^n → R. We are interested in the
convexity of this optimization model. Convex optimization problems share desir-
able properties: they can be solved quickly and reliably up to very large sizes of hundreds
of thousands of variables and constraints. However, we will show that, in general, the problem
of uncertain graph anonymization under the (k, ε) privacy notion is a typical non-convex
optimization problem, where the objective or some constraint is non-convex.
The optimization problem is called a convex optimization problem if S is a convex set
and f is a convex function defined on R^n. We start with the definition of a convex set:

Definition 9 (Convex set) A subset S ⊂ R^n is a convex set if

x, y ∈ S ⇒ λx + (1 − λ)y ∈ S    (11.6)

for all λ ∈ [0, 1].
Figure 11.2 shows a convex set and a non-convex set.
[Figure omitted: an original graph on vertices a–e and three anonymized graphs; G1 and G2 are (4, 0.2)-obf, while G3, whose edges carry probability 0.5, is not.]
Figure 11.3: Galaxy: Invalidity of Being Convex Set.
Here, we show that the solution space of (k, ε)-obf is a non-convex set by con-
structing a counter example. As shown in Figure 11.3, G1 and G2 both 4-obfuscate at least
(1 − 0.2) × 5 = 4 vertices of the original graph, since each blends any node with only
one neighbor among the 4 nodes {b, c, d, e} of the anonymized graph. Suppose the set S of
(4, 0.2)-obfuscation instances of the original graph were convex. According to Definition 9, G3 should
be an instance of S, since G3 = λG1 + (1 − λ)G2 with λ = 0.5. However, G3 fails to
4-obfuscate at least (1 − 0.2) × 5 = 4 vertices of the original graph: the candidate set
cand(degree = 1) = {(b, 1), (c, 1), (d, 0.5), (e, 0.5)}, and the entropy of its normalized distri-
bution is clearly smaller than log2 4. In summary, the set of (k, ε)-obfuscations
of a given uncertain graph is not in general convex, and the problem is a typical non-convex
optimization problem.
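The counter example can be verified numerically; the weights below encode the candidate likelihoods stated above:

```python
import math

def entropy_of_candidates(weights):
    """Entropy (bits) of the normalized matching-likelihood distribution."""
    total = sum(weights)
    return -sum(w / total * math.log2(w / total) for w in weights if w > 0)

# G1, G2: four equally likely candidates -> entropy log2(4) = 2
h12 = entropy_of_candidates([1, 1, 1, 1])
# G3 = 0.5*G1 + 0.5*G2: candidates (b,1), (c,1), (d,0.5), (e,0.5)
h3 = entropy_of_candidates([1, 1, 0.5, 0.5])
# h12 equals 2.0 while h3 falls below 2.0, so G3 is not (4, 0.2)-obf
```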
Such a problem may have multiple feasible regions and multiple locally optimal points
within each region. It can take time exponential in the number of variables and con-
straints to determine that a non-convex problem is infeasible, that the objective function
is unbounded, or that an optimal solution is the “global optimum” across all feasible re-
gions. In this work, we propose a sound framework to efficiently solve the uncertain
graph anonymization problem in a practical time.
12
Galaxy Techniques
After introducing the privacy attack model and the adaptive information-entropy-based
anonymity measure, we are ready to present our Galaxy algorithm, which anonymizes
a given uncertain graph via edge uncertainty operations with as small a utility loss as
possible. In the following, we first present the basic idea of Galaxy and then detail each main
component individually. Notice that although we focus only on degree anonymization
in this thesis, our approach is general and applicable to other k-obfuscation based
privacy protection schemes on uncertain graphs.
12.1 Overview of The Galaxy Approach
We propose a two-step approach for the Probabilistic Graph Anonymization problem, as
shown in Figure 12.1. For an input uncertain graph G = (V, E, p) with degree sequence
d and user-defined privacy criteria parameters (k, ε), we proceed as follows:

1. First, starting from d, we construct a new probabilistic degree sequence da that is
(k, ε)-obf and such that the degree anonymization cost is minimized.
[Figure omitted: the Galaxy pipeline. The original graph and the (k, ε) parameters yield a probabilistic degree sequence, which is anonymized into an anonymous probabilistic degree sequence; the anonymization process then produces candidate graphs G1 . . . Gt with approximate utility loss, followed by risk assessment and parameter updates.]
Figure 12.1: Galaxy Framework.
2. Second, given the new probabilistic degree sequence da, we construct a graph
Ga that shares the same set of vertices with the given uncertain graph.

We use dG to denote the probabilistic degree sequence of G. That is, dG is a vector of size
|V| such that dG(i) is the degree of the i-th node of G. Throughout this part, we use
d(i) and d(vi) interchangeably to denote the degree of node vi ∈ V. Clearly, in the context of
an uncertain graph, each d(v) is a random variable.
This methodology is inspired by Liu and Terzi's work [27]: first construct a hyper-representation
of anonymous graphs, i.e., a degree sequence, then construct an anonymous graph that real-
izes the hyper-representation. Their work, however, focuses on the k-degree anonymization
problem over deterministic graphs. Their algorithms are based on principles related to
the realizability of degree sequences. More specifically, in the degree anonymization phase
they target minimizing the residual degrees, namely the differences between the original
degrees and the degrees in the anonymized degree sequence. On large real-world
graphs, this generates a sequence at the expense of large residual degrees for large original
degrees, as the differences between these large original degrees are great. It also generates
the sequence with a small number of changes from the original degree sequence, as many
vertices with small original degrees are already k-anonymous. It may then be impossible
to compensate for the large residual degrees, in which case the generated degree sequence
is not realizable. A Probing scheme then operates small random changes on the degree
sequence until it is realizable and the graph is constructed. However, realizability testing
has time complexity O(|V|^2), and since Probing is invoked for a large number of repetitions,
the algorithm is very inefficient, especially on large graphs.
Our work differs in the addressed graph model and attack model: both are probabilistic.
Consequently, the output degree sequence of step 1 is also probabilistic. To the best of
our knowledge, the realizability of a probabilistic degree sequence as an uncertain graph is
an unexplored problem.
Note that step 1 requires the anonymization cost of the probabilistic degree sequence
to be minimized, which in fact translates into an approximate requirement of min-
imum changes. Step 2 tries to construct an uncertain graph with degree sequence da
that is a supergraph of (or has a large overlap in its sets of edges and edge probabilities
with) the original uncertain graph.

Considering the difficulty of obtaining an optimal solution that realizes the anony-
mous degree sequence, we account for the difficulty of realizability when constructing the probabilis-
tic (k, ε)-obf degree sequence. Inspired by the observation that the anonymous graph has
a large overlap in its set of edges and edge probabilities with the input uncertain graph,
instead of a Probing scheme, we construct the anonymized graph by operating modifi-
cations, such as partial edge addition and deletion based on uncertainty semantics, over the
given uncertain graph.
In the next sections, we develop algorithms for solving the Probabilistic Degree Anonymiza-
tion problem and the Anonymous Graph Construction problem.
12.2 Probabilistic Degree Anonymization
We give algorithms for solving the Probabilistic Degree Anonymization problem. Given
the probabilistic degree sequence d of the original input uncertain graph G = (V,E, p),
the algorithm output approximate solution of a (k, ε)-obf degree sequence da with the
minimal anonymization cost. We first show how to simplify the problem, then trans-
form it into a linear optimization problem. Then, we show how to derive the 1-Subgraph
perturbation plan on this basis.
Note that, k-obfuscation verification plays a major role in the uncertain graph anonymiza-
tion process. According to its definition, we need to compute degree uncertainty matrix
for a given perturbed uncertain graph Ga as XG, and verify k-obfuscation over the de-
rived the fuzzy equivalence matrix XGXGTa. Here, we will show the low bound based
k-obfuscation verifying.
Definition 10 An uncertain graph Ga = (V, Ea, pa) is a (k, ε)-obf of G if the probabilistic
degree sequence of Ga (k, ε)-obfuscates the degree sequence of G.

Alternatively, Definition 10 requires that for every node v ∈ V in the input uncertain graph, the
entropy of the distribution Yv over U (the set of vertices in Ga) is at least log2 k. This
property prevents the re-identification of individuals by adversaries with a probabilis-
tic prior knowledge of the degrees of specific nodes. Together, the probabilistic degree
anonymization can be expressed as
Problem 3 (Probabilistic Degree Anonymization) Given d, the probabilistic degree se-
quence of the given uncertain graph, and the privacy criteria (k, ε), construct a (k, ε)-
obf sequence da with the minimal anonymization cost w.r.t the degree sequence.

Note that d is a vector of random variables. As discussed earlier, each dv can be approx-
imated by the normal distribution N(µv, σv^2), as implied by the Central Limit Theorem.
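Under the independence assumption, the approximation uses µv = Σ pi and σv^2 = Σ pi(1 − pi), the mean and variance of a sum of independent Bernoulli variables. A minimal sketch with illustrative probabilities:

```python
def normal_approx(edge_probs):
    """Normal approximation d_v ~ N(mu, sigma^2) for a sum of independent
    Bernoulli edge variables: mu = sum(p_i), sigma^2 = sum(p_i * (1 - p_i))."""
    mu = sum(edge_probs)
    var = sum(p * (1 - p) for p in edge_probs)
    return mu, var

mu, var = normal_approx([0.8, 0.2, 0.5])
# mu = 1.5, var = 0.16 + 0.16 + 0.25 = 0.57
```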
[Figure omitted: (a) the probabilistic degree sequence, plotting standard deviation against mean degree; (b) the degree distribution of the original graph, frequency vs. node degree on log-log axes.]
Figure 12.2: Galaxy: Probabilistic Degree Sequence Approximation.
First, we approximate the probabilistic degree sequence d to be anonymized as

d = ( N(0, σ0^2), . . . , N(0, σ0^2), . . . , N(D, σD^2), . . . , N(D, σD^2) )    (12.1)

where each N(i, σi^2) appears s(i) times, and s(i) is the expected number of nodes with degree i
across possible worlds, defined over the continuous range.
Figure 12.2 shows one example of the probabilistic degree sequence d of an uncertain
graph and its approximation representation.
Since up to ε|V| vertices are allowed to remain un-obfuscated, the probabilistic degree anonymiza-
tion ignores the k-obfuscation constraint for the set of nodes with the largest uniqueness
scores, yielding a truncated degree sequence over the pre-computed range D.
Together, we translate the probabilistic degree anonymization problem as follows:

argmin_{Sa}  L1(Sa − S)
subject to   da k-obfuscates d
             Σ_{i=0}^{D} Sa(i) < Σ_{i=0}^{D} S(i)
The last inequality constraint above is due to the truncation of the degree se-
quence before anonymization. The objective function reflects utility preservation, while the other
constraints aim to provide enough anonymity to individual nodes in the input uncertain
graph.
Lemma 2 Fix v ∈ V in a given uncertain graph G. The obfuscation level offered by a given per-
turbed uncertain graph Ga is always greater than or equal to the corresponding candidate
anonymity level.
Proof: Fix v ∈ V and let F(v) = (f1, . . . , fn) denote the probability distribution
f(v; u); i.e., if the vertices in the perturbed graph are U = {u1, . . . , un}, then fi = f(v; ui).
Let Sa(v) = Σ_{i=1}^{n} fi denote the candidate anonymity level of v. For any fixed v ∈ V,
the obfuscation level offered by Ga is

H(Yv) = Σ_{i=1}^{n} −Yv(ui) log Yv(ui)
      = Σ_{i=1}^{n} −(fi / Sa(v)) log (fi / Sa(v))
      = Σ_{i=1}^{n} −(fi / Sa(v)) [log fi − log Sa(v)]
      = log Sa(v) + (1 / Sa(v)) Σ_{i=1}^{n} (−fi log fi)

Since each fi ≤ 1, the last summation is non-negative, so H(Yv) ≥ log Sa(v). That is,
the obfuscation level offered by a given perturbed uncertain graph Ga is always greater
than or equal to the logarithm of the corresponding candidate anonymity level.
Alternatively, for any fixed v ∈ V, a perturbed uncertain graph whose candidate anonymity
level is at least k k-obfuscates the given vertex v w.r.t the input uncertain graph.
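The closed form derived in the proof can be sanity-checked numerically. A small sketch with an illustrative distribution f:

```python
import math

def entropy_direct(f):
    """H(Y_v) computed directly from the normalized distribution."""
    S = sum(f)
    return -sum(fi / S * math.log2(fi / S) for fi in f if fi > 0)

def entropy_closed_form(f):
    """The closed form from the proof: log S_a(v) + (1/S_a(v)) * sum(-f_i log f_i)."""
    S = sum(f)
    return math.log2(S) + (1 / S) * sum(-fi * math.log2(fi) for fi in f if fi > 0)

f = [0.4, 0.1, 0.3, 0.2]
# the two expressions agree, and both are >= log2(S_a(v))
```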
[Figure omitted: two point-wise alignments between degree sequences 11123 and 11233; both have L1(d, da) = 2, but one has L2(d, da) = 2 while the other has L2(d, da) = 4.]
Figure 12.3: Galaxy: Fuzzy Vertex Alignments.

On this basis, we replace the obfuscation constraints with candidate anonymity constraints:
argmin_{Sa}  L1(Sa − S)
subject to   Σ_{j=0}^{D} Sa(j) f(di, dj) ≥ k,    i = 0, . . . , D
             Σ_{i=0}^{D} Sa(i) < Σ_{i=0}^{D} S(i)
We verify the convexity of the objective function and the constraints using convex
calculus rules. As shown above, each anonymity constraint can be expressed as a weighted sum
of the candidate anonymity levels and is therefore linear. The objective function
L1(Sa − S) = Σ_{i=0}^{D} |Sa(i) − S(i)| is also convex. Various contemporary methods
have been proposed for solving convex minimization problems, such as bundle methods,
subgradient projection methods, and interior-point methods; the problem can be solved
exactly, with complexity similar to linear programming.
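The convexity of the L1 objective can also be checked empirically against Definition 9's inequality. A small randomized sketch (the target sequence S and sample points are illustrative):

```python
import random

def l1(Sa, S):
    """L1 anonymization cost between a candidate sequence and the original."""
    return sum(abs(a - s) for a, s in zip(Sa, S))

# convexity check: f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
S = [5.0, 3.0, 1.0]
random.seed(0)
for _ in range(1000):
    x = [random.uniform(0, 10) for _ in S]
    y = [random.uniform(0, 10) for _ in S]
    lam = random.random()
    z = [lam * a + (1 - lam) * b for a, b in zip(x, y)]
    assert l1(z, S) <= lam * l1(x, S) + (1 - lam) * l1(y, S) + 1e-9
```

Every sampled chord lies above the function value, as convexity requires; this is a numerical illustration, not a proof.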
12.3 Probabilistic Degree Sequence Alignment
Intuitively, the alignment between the input degree sequence and the over-anonymized degree
sequence can be leveraged to guide graph perturbation. Such an alignment
should be point-wise, since the perturbed graph shares the same set of vertices with the
given uncertain graph. We target finding the point-wise alignment function A : V → U
with the minimal L2 distance w.r.t the original degree sequence:

Σ_{v∈V} (d(v) − da(A(v)))^2    (12.2)
The choice of this anonymization cost function stems from the following observation.
We show two alignments between the original degree sequence and the anonymous
one in Figure 12.3. The L1 distance of the degree sequences corresponds to the number of
edge modifications; however, it cannot distinguish the two alignments. Clearly, the L2 distance
prefers small and dispersed edge additions around vertices. Without loss of generality, we
assume that entries in d and da are ordered in decreasing order of the degree expectations
they correspond to, i.e., d(1) ≥ d(2) ≥ . . . ≥ d(n). We claim that the identity alignment
A(i) = i is an optimal alignment w.r.t d and da. Our experiments show that it performs
extremely well in practice.
Proof: Let A be any optimal alignment between d and da with A(i) = j, A(j) = i,
and i < j, and let A′ be identical to A except that A′(i) = i and A′(j) = j. The anonymization
cost introduced by A is

Σ_{l∈[1,n], l≠i,j} [d(l) − da(A(l))]^2 + [d(i) − da(j)]^2 + [d(j) − da(i)]^2

Likewise, the anonymization cost introduced by A′ is

Σ_{l∈[1,n], l≠i,j} [d(l) − da(A(l))]^2 + [d(i) − da(i)]^2 + [d(j) − da(j)]^2
[Figure omitted: matched degree values d = 5, 3, 1 and the derived perturbation model with retention rate 9/13, variation mean 2, and variation stddev 0.]
Figure 12.4: Galaxy: Derived Perturbation Model.
Their anonymization cost difference is

cost(A) − cost(A′) = [d(i) − da(j)]^2 + [d(j) − da(i)]^2 − [d(i) − da(i)]^2 − [d(j) − da(j)]^2
                   = 2 [d(i) − d(j)] [da(i) − da(j)]
                   ≥ 0

since both sequences are sorted in decreasing order and i < j. Hence swapping any such
inverted pair never increases the L2 distance, which proves the greedy choice property of
the identity alignment.
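The optimality of the identity alignment for decreasingly sorted sequences can be verified by brute force over all permutations. A small sketch with randomly generated sequences (illustrative, not the dissertation's data):

```python
import itertools
import random

def l2_cost(d, da, perm):
    """L2 alignment cost of mapping position i to position perm[i]."""
    return sum((d[i] - da[perm[i]]) ** 2 for i in range(len(d)))

random.seed(1)
d = sorted((random.random() * 10 for _ in range(6)), reverse=True)
da = sorted((random.random() * 10 for _ in range(6)), reverse=True)
identity = list(range(6))
best = min(l2_cost(d, da, p) for p in itertools.permutations(identity))
# the identity alignment attains the minimum over all 720 permutations
```

This matches the rearrangement inequality: with both sequences sorted the same way, the inner product Σ d(i)·da(π(i)) is maximized, so the L2 cost is minimized.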
12.4 Probablistic Anonymous Graph Construction
It is hard to construct from scratch an uncertain graph that realizes the generated
probabilistic degree sequence; previous work on deterministic graphs shows that this is
computationally expensive. Therefore, we turn to another avenue: anonymizing un-
certain graphs by edge uncertainty perturbation under an anonymity bound. The main
idea of the algorithm is to utilize the over-anonymized degree sequence as the anonymity
bound of the anonymization process.
For certain properties of interest, such as degree, the majority of vertices in real-world
graphs are already anonymous even without random perturbations. The reason is that
for most values of the property P, many vertices share the same value. In other
words, their point-to-point distances over the anonymous degree sequence are close to zero.
Hence, we aim at controlling the amount of applied perturbation, so that larger perturba-
Algorithm 8 Galaxy Skeleton
Input: Uncertain graph G, adversary knowledge K, obfuscation level k, tolerance level ε, size multiplier c, and white noise level q
Output: The anonymized result Gobf
1: da ← (k, ε)-obf(d)
2: σl ← 0; σu ← 1
3: repeat
4:     ⟨ε̂, G̃⟩ ← galaxyObfuscation(G, k, ε, c, q, σu, K, da)
5:     if ε̂ = 1 (fail) then σl ← σu; σu ← 2σu
6: until ε̂ ≠ 1
7: repeat
8:     σ ← (σu + σl)/2
9:     ⟨ε̂, G̃⟩ ← galaxyObfuscation(G, k, ε, c, q, σ, K, da)
10:    if ε̂ = 1 then σl ← σ
11:    else σu ← σ; Gobf ← G̃
12: until σu − σl is small enough
13: return Gobf
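The doubling-then-bisection search over the noise level σ in Algorithm 8 can be sketched in Python. Here `obfuscate` is a hypothetical stand-in for galaxyObfuscation, and the success threshold 0.3 in the toy example is arbitrary:

```python
def search_noise_level(obfuscate, tol=1e-3):
    """Binary search for the smallest noise level sigma that still yields an
    obfuscation. obfuscate(sigma) returns (eps_hat, G_hat), with eps_hat == 1
    signaling failure, mirroring Algorithm 8's use of galaxyObfuscation."""
    lo, hi = 0.0, 1.0
    # phase 1: double hi until obfuscation succeeds
    while obfuscate(hi)[0] == 1:
        lo, hi = hi, 2 * hi
    best = obfuscate(hi)[1]
    # phase 2: bisect the bracket [lo, hi]
    while hi - lo > tol:
        mid = (lo + hi) / 2
        eps_hat, G_hat = obfuscate(mid)
        if eps_hat == 1:
            lo = mid          # too little noise: keep searching above mid
        else:
            hi, best = mid, G_hat  # success: try less noise
    return hi, best

# toy stand-in: obfuscation succeeds whenever sigma >= 0.3
sigma, _ = search_noise_level(lambda s: (1, None) if s < 0.3 else (0.5, "G"))
# sigma converges to roughly 0.3
```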
tion is added at vertices with larger deviations. In particular, we suggest calibrating the
amount of perturbation applied to a vertex v according to its L1 distance w.r.t the matched
vertex, as illustrated in Figure 12.4. To simulate stochastic degree sequence point-wise
matching, we calibrate the retention rate and the deviation model as shown in Figure 12.4. In
particular, we calibrate the perturbation applied to a pair e = (u, v) ∈ EC according to
the "distance" of the two vertices u and v w.r.t their anonymous images. Namely, if the
expected shifts of both v and u are small, the perturbation re should be subtle; on the
other hand, if v and u are inter-cluster matching nodes, then the perturbation re should be
larger. To mitigate "mismatching", namely the case where the expected shifts of u and v are in
opposite directions, we adopt a majority voting rule for deciding the perturbation direction.
12.5 The Anonymity-Bounded Obfuscation Algorithm
Our algorithm for computing a (k, ε)-obf of an uncertain graph on the probabilistic node
degree is outlined as Algorithm 8. Note that, a (k, ε)-obf degree sequence provides
[Figure omitted: point-wise realization matching degree values d = 5, 3, 1 to vertices v1, v2, . . . , v17.]
Figure 12.5: Galaxy: Anonymous Degree Sequence Realization.
an upper bound on the anonymity guarantee. In other words, an obfuscation that exactly re-
alizes the (k, ε)-obf degree sequence may suffer from high utility loss. Targeting high
utility, our algorithm aims at injecting the minimal amount of noise needed to achieve
the required obfuscation. Likewise, computing the minimal amount of uncertainty is
achieved via a search on the anonymization parameters including size multiplier c and
the noise parameter σ. The search flow of Algorithm 8 is determined by the function
galaxyObfuscation, shown as Algorithm 9. Since the general iterative
skeleton is similar to Chameleon's, we omit its details here. The key difference between
Galaxy and Chameleon is the exploitation of the (k, ε)-obf degree sequence in the function
galaxyObfuscation.
The function galaxyObfuscation (Algorithm 9) aims at finding a (k, ε)-obf of
G for the given anonymization parameters (indicators of the total noise budget). Here, we
assume the utility loss is affected by the number of edge operations performed and by the
utility loss caused by each individual edge operation. In other words, we try to solve the problem by
reducing the amount of noise while performing the edge perturbations that cause
smaller utility loss with higher priority. Intuitively, to align vertices with larger distance, a
larger amount of noise is necessary; thus, edges need to be sampled with higher
probability if they are adjacent to such vertices. To handle this sampling process, our
algorithm assigns a probability Q(v) to every v ∈ V, which is proportional to the expected
L1 distance w.r.t the matched (k, ε)-obf degree sequence and inversely proportional to its reliability
Algorithm 9 GalaxyObfuscation
Input: Uncertain graph G = (V, E, p), K, k, ε, c, q, standard deviation σ, and anonymous degree sequence da
Output: A pair ⟨ε̃, G̃⟩ where G̃ is a (k, ε̃)-obfuscation, or ε̃ = 1 if no (k, ε)-obf is found
1: compute the average discrepancy ADv for v ∈ V
2: compute the reliability relevance VRRv for v ∈ V
3: Qv ← ADv · VRRv for v ∈ V
4: H ← the set of ⌈(ε/2)|V|⌉ vertices with largest Qv
5: normalize VRRv for v ∈ V \ H
6: Qv ← ADv · (1 − VRRv) for v ∈ V \ H
7: ε̃ ← 1
8: for t times do
9:     RD ← targetedDiscrepancy(d, da)
10:    EC ← E
11:    repeat
12:        randomly pick a vertex u ∈ V \ H according to Q
13:        randomly pick a vertex v ∈ V \ H according to Q
14:        if (u, v) ∈ E then
15:            with probability p(e): EC ← EC \ {(u, v)}; RD[u] += p(e); RD[v] += p(e)
16:    until |EC| = c|E|
17:    for all e = (u, v) ∈ EC do
18:        compute σ(e)
19:        draw w uniformly at random from [0, 1]
20:        if w < q then
21:            re ← U(0, 1)
22:        else
23:            re ← Rσ(e)
24:        Inde ← majorityVoting(RD(u), RD(v))
25:        re ← min(re, |RD(u)|, |RD(v)|)
26:        p(e) ← p(e) + Inde · re
27:        RD(u) −= Inde · re; RD(v) −= Inde · re
28:    ε̂ ← anonymityCheck(G̃)
29:    if ε̂ < ε̃ and ε̂ ≤ ε then ε̃ ← ε̂; G̃ ← the current perturbed graph
30: return ⟨ε̃, G̃⟩
centrality.
After that, the search for a (k, ε)-obf starts: since the algorithm is randomized and there is a
non-zero probability of failure, t attempts are executed. Each attempt begins by sampling a
vertex matching as illustrated in Figure 12.5. It then randomly selects a subset of edges EC,
initialized to E, which will be subjected to edge perturbation. The algorithm randomly selects
two distinct vertices u and v according to the probability distribution Q. The edge (u, v) is
removed from EC with its existence probability p(e), or added to EC otherwise. Note that the
corresponding impact on the node degrees is reflected in the pointwise perturbation bound.
The process is repeated until EC reaches the required size c|E|. The perturbation
re-distribution process is similar to Chameleon's; however, the amount of perturbation
applied to a specific edge (u, v) is bounded by the residual perturbation of vertices u and v
(Line 25).
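To make the candidate-selection loop concrete, the following is a minimal Python sketch of Lines 10–16 of Algorithm 9, assuming the weights Q and the probabilities p(e) are given as plain dictionaries. The function name is illustrative and the TD bookkeeping of Line 15 is omitted; this is not the actual C++ implementation.

```python
import random

def select_candidate_edges(vertices, edges, Q, c, seed=42):
    """Sketch of Algorithm 9, Lines 10-16: grow/shrink the candidate
    edge set EC by sampling vertex pairs according to the weights Q
    until |EC| = c * |E|. `edges` maps an undirected edge (u, v) with
    u < v to its existence probability p(e)."""
    rng = random.Random(seed)
    target = int(c * len(edges))
    EC = dict(edges)                       # EC is initialized to E
    weights = [Q[v] for v in vertices]
    while len(EC) != target:
        u, v = rng.choices(vertices, weights=weights, k=2)
        if u == v:
            continue                       # need two distinct vertices
        e = (min(u, v), max(u, v))
        if e in EC:
            if rng.random() < EC[e]:       # drop with its existence probability
                del EC[e]
        else:
            EC[e] = 0.0                    # a non-edge becomes a candidate
    return EC
```

With c > 1, additions dominate and the candidate set grows beyond the original edge set, matching the prose description above.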
13
Performance Evaluation
13.1 Experiment Settings
13.1.1 Data Collection
We tested our algorithm on three real-world uncertain graph datasets. The characteristics
of these datasets are summarized in Table 13.1.
• DBLP¹ is a dataset of scientific publications and authors. In this dataset, each node
represents an author. Two authors are connected by an edge if they have co-authored
a project. The uncertainty on the edge denotes the probability that the two authors will
collaborate on a new project.
• Brightkite² is a location-based social network. In this dataset, each node represents a
user. The probability of an edge corresponds to the chance that the two users visit each
other.
• PPI³ is a dataset of protein-protein interactions, provided by the Disease Module
Identification DREAM Challenge. The probability of an edge corresponds to the confidence
that the interaction actually exists, which is obtained through biological experiments.

¹http://dblp.dagstuhl.de/xml/release/dblp-2015-11-01
²https://snap.stanford.edu/data/loc-brightkite.html
³https://www.synapse.org/Synapse:syn6156761/wiki/400652

Table 13.1: Galaxy: Dataset Statistics and Privacy Parameters.

Dataset     # Vertices  # Edges    Avg Degree  Edge Prob (Mean)  Exp Degree (Mean)  Exp Degree (Max)  Tolerance level ε
DBLP        824,774     5,566,096  6.75        0.46              3.1                460               10^-4
Brightkite  58,228      214,078    7.35        0.29              2.2                264               10^-3
PPI         12,420      397,309    63.97       0.29              19.0               483               10^-2
13.1.2 Utility Evaluation
Measuring utility is hard, as there is no standard metric to capture it. An anonymization
scheme might perfectly preserve the degree distribution of a graph while damaging other
properties. Also, utility depends on how the analyst uses the data. As discussed earlier,
reliability plays a critical role in uncertain graph analysis. In addition, we look at some
fundamental utility metrics as they vary with the anonymization level. The properties we
study are:
Degree-based statistics
• Number of edges: S_NE = (1/2) Σ_{v∈V} d_v
• Average degree: S_AD = (1/n) Σ_{v∈V} d_v
• Maximal degree: S_MD = max_{v∈V} d_v
• Degree variance: S_DV = (1/n) Σ_{v∈V} (d_v − S_AD)²
• Power-law exponent of the degree sequence: S_PL is the estimate of γ assuming the
degree sequence follows a power law P(d) ≈ d^(−γ)
These statistics capture the degree distribution, which is an important measure of a small-world
graph.
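The degree-based statistics can be computed directly from a degree sequence. The following Python sketch shows the first four; the power-law exponent S_PL is omitted, since it requires a separate fitting procedure that is not specified here.

```python
def degree_statistics(degrees):
    """Degree-based utility statistics over a degree sequence d_v."""
    n = len(degrees)
    s_ne = sum(degrees) / 2                            # number of edges
    s_ad = sum(degrees) / n                            # average degree
    s_md = max(degrees)                                # maximal degree
    s_dv = sum((d - s_ad) ** 2 for d in degrees) / n   # degree variance
    return {"SNE": s_ne, "SAD": s_ad, "SMD": s_md, "SDV": s_dv}

stats = degree_statistics([1, 2, 2, 3])
# SNE = 4.0 edges, SAD = 2.0, SMD = 3, SDV = 0.5
```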
Shortest path-based statistics
• Average distance: S_APD is the average distance among all pairs of vertices that are
path-connected.
• Effective diameter: S_ED is the 90th-percentile distance among all path-connected
pairs of vertices.
• Connectivity length: S_CL is defined as the harmonic mean of all pairwise distances
in the graph.
• Diameter: S_D is the maximum distance among all path-connected pairs of vertices.
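Assuming an unweighted, undirected graph given as an adjacency dict, the four shortest-path statistics above can be sketched with an all-pairs BFS; this illustrative version takes the 90th percentile of the sorted pairwise distances as the effective diameter.

```python
from collections import deque

def path_statistics(adj):
    """Shortest-path statistics over the path-connected vertex pairs
    of an unweighted graph. Vertex ids are assumed comparable so each
    unordered pair is counted exactly once."""
    dists = []
    for s in adj:                          # BFS from every vertex
        seen = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    q.append(v)
        dists += [d for v, d in seen.items() if v > s]
    dists.sort()
    return {
        "SAPD": sum(dists) / len(dists),               # average distance
        "SED": dists[min(len(dists) - 1,
                         int(0.9 * len(dists)))],      # effective diameter
        "SCL": len(dists) / sum(1 / d for d in dists), # connectivity length
        "SD": dists[-1],                               # diameter
    }

# path graph 0-1-2: pairwise distances {1, 1, 2}
stats = path_statistics({0: [1], 1: [0, 2], 2: [1]})
# SAPD = 4/3, SED = 2, SCL = 1.2, SD = 2
```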
Clustering Coefficient
• Clustering coefficient: S_CC = 3 N_Δ / N_3, where N_Δ is the number of triangles and
N_3 is the number of connected triples.
The clustering coefficient S_CC measures the extent to which the edges of the graph “close
triangles”. Evidence suggests that in most real-world networks, and in particular social
networks, nodes tend to create tightly knit groups characterized by a relatively high density
of ties; the likelihood of a tie within such a group tends to be greater than the average
probability of a tie randomly established between two nodes. Therefore, we believe the
clustering coefficient is core to the evolution behavior of graphs (deterministic and uncertain
ones).
These properties are fundamental to the behavior of uncertain graphs, and damaging
them significantly has an adverse effect on the overall utility of the graph.
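The global clustering coefficient S_CC = 3 N_Δ / N_3 can equivalently be computed as the fraction of connected triples that are closed, since each triangle closes exactly three triples. A small Python sketch:

```python
from itertools import combinations

def global_clustering(adj):
    """Global clustering coefficient: the fraction of connected triples
    (paths of length two) that are closed into triangles."""
    closed = 0
    triples = 0
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):   # every triple centered at v
            triples += 1
            if b in adj[a]:
                closed += 1                  # the triple is closed
    # each triangle is counted once per center vertex, i.e. three times,
    # so closed == 3 * N_triangle and closed / triples == 3 * N_tri / N_3
    return closed / triples

# a triangle 0-1-2 plus a pendant vertex 3: 5 triples, 3 closed -> 0.6
cc = global_clustering({0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}})
```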
13.2 Performance of Uncertain Graph Anonymization
13.2.1 Efficiency Evaluation
In the first set of experiments, we consider five obfuscation levels, k ∈ {100, 150, 200, 250, 300},
and the tolerance levels ε given in Table 13.1. We experimented with the following setting:
white noise level q = 0.01, number of obfuscation attempts t = 5, and initial size multiplier
c = 1.3. In some cases, the anonymization algorithm failed to find a proper upper bound for
σ in the loop; in those cases, we increase the size multiplier c.
Figure 13.1: Galaxy: Running Time Comparison vs. Chameleon. ((a) DBLP, (b) Brightkite, (c) PPI; x-axis: anonymity level k; y-axis: running time in ms; curves: Chameleon and Galaxy.)
Figure 13.2: Galaxy: Two-Terminal Reliability Preservation. ((a) DBLP, (b) Brightkite, (c) PPI; x-axis: anonymity level k; y-axis: average reliability discrepancy; curves: Chameleon and Galaxy.)
These obfuscation algorithms were implemented in C++ and run on an Intel Core i7
CPU at 2 GHz with a 6 MB cache. We report their computation times over the three datasets
in Figure 13.1. For larger values of the anonymity level k, it takes longer to output a
(k, ε)-obf, as shown in Figure 13.1(c). This is because of the increased effort (a larger size
multiplier c for a more significant amount of noise) needed to achieve the higher obfuscation
level (larger values of k). In general, the time efficiency of the Galaxy algorithm is close
to that of the Chameleon approach. This confirms that the on-the-fly edge uncertainty
perturbation bounded by the anonymous degree sequence incurs only a light computational
overhead. The efficiency inherited from random perturbation schemes makes Galaxy a
practical anonymization solution, especially for large graphs.
Figure 13.3: Galaxy: The Change Ratio of Degree. ((a) DBLP, (b) Brightkite, (c) PPI; x-axis: anonymity level k; y-axis: relative error of degree; curves: Chameleon and Galaxy.)
13.2.2 Utility Loss Evaluation
Reliability Preserving In particular, we report the average reliability discrepancy of the
anonymized graphs in Figure 13.2. The smaller the discrepancy, the better the reliability,
and hence the graph structure, is preserved. The figure shows that as the requirement k
increases, the amount of distortion also increases. The proposed anonymous-degree-sequence-bounded
approach Galaxy shows improved utility preservation in most cases, across different
uncertain graph types and sizes. For instance, on the PPI dataset (k = 300), the reliability
discrepancy introduced by the Galaxy approach is well below 5%, while that of Chameleon
is around 10%. The improvement is also significant on the larger datasets, e.g., the DBLP
dataset. The perturbation model derived by anonymized degree sequence alignment guides
the edge probability perturbation wisely.
Global Statistics Preserving For the other statistics, we computed the average statistical
error, that is, the relative absolute difference between the estimated and the real value.
The smaller the error, the better the utility preservation.
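For concreteness, the error metric as used here is simply:

```python
def relative_error(estimate, real):
    """Relative absolute difference between an estimated statistic
    and its real value on the original graph (real assumed non-zero)."""
    return abs(estimate - real) / abs(real)

# e.g. an estimated average degree of 3.0 against a true value of 4.0
err = relative_error(3.0, 4.0)   # 0.25
```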
Figure 13.4 and Figure 13.5 report the comparison of the uncertain graph anonymization
approaches regarding their ability to preserve the average path distance and the clustering
coefficient, respectively. The other shortest-path-based statistics share almost exactly the
same trend as the average path distance; therefore, we omit them. For small values of k,
e.g., k = 100, the errors introduced by the Galaxy approach are always smaller
Figure 13.4: Galaxy: Average Path Distance Preservation. ((a) DBLP, (b) Brightkite, (c) PPI; x-axis: anonymity level k; y-axis: relative error of average distance; curves: Chameleon and Galaxy.)
Figure 13.5: Galaxy: Clustering Coefficient Preservation. ((a) DBLP, (b) Brightkite, (c) PPI; x-axis: anonymity level k; y-axis: relative error of clustering coefficient; curves: Chameleon and Galaxy.)
than 1%. The larger the value of k, the larger the errors introduced. Generally, at the same
level of identity obfuscation, our Galaxy method provides higher utility than the existing
Chameleon method. Take the S_APD error at k = 200 as an example. As depicted in
Figure 13.4, the change ratio of our Galaxy method is around 10% for DBLP and PPI,
while that under Chameleon is 30% for DBLP and Brightkite. Another example is the
S_CC value. As depicted in Figure 13.5, our Galaxy method causes around 0.039 and 0.08
utility loss on average for Brightkite and PPI, respectively. On the contrary, Chameleon
causes 0.149 and 0.081 utility loss on average. All these observations verify that the
“anonymous degree sequence”-bounded strategy successfully avoids futile edge uncertainty
perturbation, and that our Galaxy method is the most effective.
The behavior described for scalar statistics is also observed with vector statistics. For
example, Figure 13.6 shows the degree distribution in the original PPI graph and in its
obfuscated versions. Here, six extreme cases are presented: for k = 100, 200, 300 and ε = 10^-2
Figure 13.6: Galaxy: Degree Distribution Preservation. ((a) k=100, (b) k=200, (c) k=300; log-log axes: node degree vs. frequency; curves: Original, Chameleon, and Galaxy.)
the distribution obtained by Galaxy is qualitatively very similar to the original. Conversely,
for Chameleon, the estimated distribution is quite far from the original one and roughly
flatlines for higher-degree nodes, as such nodes are few in number and thus more affected
by the perturbation. For large values of k, the distribution shifts towards the right, and the
change is more extreme, as the proportion of non-edges introduced is orders of magnitude
higher. In contrast, our Galaxy method increases the frequency of each distinct degree only
to a minimum value, keeping the perturbation well under control.
To sum up, our experiments use different graph properties to evaluate the utility loss,
although our Chameleon and Galaxy methods are developed based on reliability models.
The experimental results clearly verify that our approaches can generate anonymized
uncertain graphs with much lower utility loss.
14
Conclusion of This Dissertation
The goal of this dissertation is to fill the void in the literature on effectively anonymizing
uncertain graphs. Uncertain graphs serve as a powerful model to capture the complicated
relationships in a wide range of applications: business-to-business (B2B) networks, social
networks, and sensor networks. These uncertain graphs are of significance in analytics and
knowledge extraction. However, such data are rarely released to the public for research due
to privacy and security concerns. Conventional approaches to graph anonymization mainly
focus on deterministic graphs. In this dissertation, we propose novel techniques and
systems to address newly identified privacy risks caused by the revealed edge uncertainty.
Within this scope, we focus on two research aspects, namely methods and systems for
resisting different types of de-anonymization in which the adversary has prior knowledge
of the victim node, such as a degree statistic or a probabilistic degree distribution.
First, we focus on the problem of identity obfuscation over uncertain graphs. We
model the adversary's prior knowledge as the degree (expectation) of a target node. To
mitigate such privacy attacks, we propose an uncertainty-semantic-based perturbation
scheme, Chameleon, that ensures a lower bound on the anonymity provided for the majority
of nodes. Chameleon offers three key innovations: (1) We introduce a new reliability-based
utility metric for capturing the structural distortion introduced by an anonymization scheme.
(2) We introduce a theoretically founded criterion, called reliability relevance, that encodes
the sensitivity of the graph edges and vertices to the possible injected perturbation. This
criterion guides the edge selection during the anonymization process. (3) We propose
uncertainty-aware heuristics for efficient edge selection and noise injection over the input
uncertain graph, achieving anonymization at a slight cost in data utility. Our comprehensive
experimental study confirms its efficiency in utility preservation compared to the
conventional methods that do not directly consider edge uncertainties.
Second, we address the problem of resisting probabilistic degree-based de-anonymization
on anonymized uncertain graphs. We model the adversary's prior knowledge as the degree
(distribution) of a target node. To address this challenge, we propose an uncertainty-semantic-based
perturbation scheme, Galaxy. Our Galaxy system offers three key innovations: (1) We adapt
the definition of k-obfuscation by introducing a fuzzy equivalence relation in the context of
uncertain graphs. (2) We propose a two-phase framework that first constructs a probabilistic
degree sequence that k-obfuscates the target nodes, and then uses the derived alignment
variation as a weighting factor to guide the perturbation for constructing the anonymous
uncertain graph. (3) We propose a light-weight anonymity quantification operation which
provides an exact evaluation w.r.t. k-obfuscation, speeding up the graph construction
process.
15
Future Work
15.1 Defeating More Involved De-anonymization Attacks
As discussed earlier, an adversary can use a handful of local structural signatures of the
nodes/communities of an uncertain graph to de-anonymize individuals in the anonymized
graph. Examples of local structural features are a node's degree, a node's clustering
coefficient, the edge density of a node's neighborhood, etc. The utilization of such local
structural features has been investigated comprehensively in the context of deterministic
graphs, since they are important auxiliary information used in de-anonymization attacks. To
date, they have not yet been considered in the context of uncertain graphs. We earlier
presented a new class of de-anonymization attacks against uncertain graphs which
incorporates the edge uncertainty in the published uncertain graphs; however, more involved
de-anonymization attack models remain unexplored.
In this dissertation, we first identified the potential privacy attacks over uncertain graphs
in which edge uncertainty can be leveraged for de-anonymization. We focus on preventing
node re-identification attacks triggered by highly revealing information, the node's degree,
which is one of the most common pieces of structural information used in de-anonymization.
Our proposed solution is tailored towards uncertain graphs but is designed to defeat only a
narrow set of attacks.
In practice, it is not realistic to assume that an adversary has only a narrow set of
prior knowledge about the original uncertain graph. Therefore, the uncertain graph
anonymization framework should consider different kinds of privacy attacks, such as those
based on the counts of small subgraphs (triangles, stars). Moreover, it is not realistic to
assume that an adversary would launch only one type of attack on uncertain graphs, i.e.,
that an adversary has only one type of structural information about the target node in the
anonymized graph. As future work, we plan to consider more powerful attacks, i.e.,
combinations of other kinds of structural information such as the number of triangles, node
betweenness, and embedded subgraph information. Therefore, an uncertain graph
anonymization framework should work under the assumption that there will be simultaneous
multiple attacks, and its techniques should be resilient to all of them. We plan to design a
generic uncertainty-semantic-aware framework that shifts existing deterministic graph
anonymization techniques to the case of uncertain graphs.
The intuition for how to adapt existing graph anonymization techniques is as follows.
We utilize the probabilistic structural signatures to partition the sensitive entities
(nodes, links, subgraphs) in the given uncertain graph into so-called anonymized fragments.
We first probe for a solution over the anonymized fragments based on the user-defined
privacy condition; in other words, an uncertain graph that realizes such an anonymized
representation can provide the desired anonymity. Then, semi-randomized uncertainty-semantic-based
perturbation algorithms can be used for probing the anonymous graph instance.
Heuristics stemming from the discrepancy could serve as weighting factors to guide the
anonymization process.
15.2 Big Graph Anonymization
Web 2.0 fueled interest in social networks. Other large graphs, for example graphs
induced by transportation routes, paths of disease outbreaks, or citation relationships among
published scientific work, have also attracted research interest. Publishing such graph data
would allow a wide variety of ad hoc analyses over real large graph datasets and fuel
valid uses of the data. Meanwhile, it also raises huge privacy concerns. Such graph data
contain sensitive information about the graph entities as well as their connections, whose
disclosure may violate privacy regulations. The privacy violation may be caused by
de-anonymization attacks such as identifying a node via its subgraph signature, or via
correlation with overlapping graphs. Although many advanced graph anonymization methods
have been proposed to mitigate such risks, the majority of existing (deterministic and
uncertain) graph anonymization algorithms are designed to defeat privacy attacks on graphs
of small or medium size. The majority of existing graph anonymization techniques are
either difficult to scale up to multi-GB or TB graph data or computationally intractable. In
summary, the increasing sizes of graph datasets present a significant challenge to graph
anonymization algorithms.
Since the amount of data to be anonymized has grossly outpaced advances in the memory
available on commodity hardware, we plan to leverage parallel graph anonymization
algorithms to defeat sensitive information disclosure. The main intuition for parallelizing
existing graph anonymization algorithms stems from the programming model of parallel
and distributed graph processing systems. The growing scale and importance of graph
data have driven the development of numerous specialized graph processing systems,
including Pregel [1], PowerGraph [2], GraphX [15], and many others [42, 55]. By exposing
specialized abstractions backed by graph-specific optimizations, these systems can
naturally express and efficiently execute iterative graph algorithms on web-scale graphs. For
Figure 15.1: Parallel Graph Anonymization Process.
example, for the sake of maximizing parallelism and scalability, Pregel [1] adopts a
non-traditional programming model in which a graph algorithm is implemented as a single
computation function written in a vertex-centric, message-passing, and bulk-synchronous
way. Interestingly, many graph anonymization methods can be naturally expressed as
vertex-centric perturbations, such as the random perturbation schemes [32, 33, 34]. Typical
graph anonymization algorithms are composed of multiple computation kernels connected by
non-trivial control flows, such as privacy risk assessment, utility cost evaluation, and
perturbation generation. Together, we believe this is a promising direction for solving the
big graph anonymization problem.
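As an illustration of why random perturbation fits this model, the following toy Python simulation expresses a hypothetical random edge-deletion scheme as a vertex program with synchronous message passing; it only mimics Pregel-style supersteps on a single machine, and all names are illustrative.

```python
import random

def bsp_random_perturbation(adj, p_flip, supersteps=1, seed=7):
    """Toy bulk-synchronous simulation of a vertex-centric random
    perturbation: in each superstep every vertex runs the same program,
    proposing edge deletions as messages, which are applied together at
    the superstep barrier."""
    rng = random.Random(seed)
    for _ in range(supersteps):
        inbox = {v: [] for v in adj}
        # compute phase: one vertex program per vertex
        for u in adj:
            for v in adj[u]:
                if u < v and rng.random() < p_flip:
                    inbox[u].append(v)      # message: drop edge (u, v)
                    inbox[v].append(u)
        # barrier: apply all proposals synchronously
        for u, drops in inbox.items():
            for v in drops:
                adj[u].discard(v)
    return adj

g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
bsp_random_perturbation(g, p_flip=0.0)   # with p_flip = 0 the graph is unchanged
```

A real deployment would replace the inner loops with the engine's compute/send-message API, but the superstep-plus-barrier structure is the same.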
15.3 Learning to Anonymize Uncertain Graphs
Data privacy is a major problem that has to be considered before releasing datasets to the
public, or even to a partner company that would compute statistics or make an in-depth
analysis of the dataset. In the context of graphs (deterministic and uncertain), many different
anonymization techniques have been proposed. However, these methods, including our
Figure 15.2: Graph Anonymization Learning Process. (Components: anonymization function, anonymization parameters, and anonymization models; privacy risk assessment and utility cost assessment feed the learned anonymization process mapping the original graph G to perturbed ones.)
works, are tailored to defeat a particular de-anonymization attack and are specific to
preserving a particular known set of properties. First, they are difficult to use in a general
context. Second, taken together, they fail to make incremental contributions to the graph
anonymization area. One promising avenue is to generalize graph anonymization techniques.
The intuition stems from the fact that uncertainty-semantic-based anonymization schemes
were proposed as a generalization of perturbation-based schemes while carrying uncertainty
similar to clustering-based ones. Thus, we suggest a parameterized anonymization function
for simulating graph anonymization schemes of different families. The remaining issue is to
simulate the parameter tuning in the anonymization procedure for balancing anonymity and
utility loss. From this perspective, a particular graph anonymization scheme provides its
choice of anonymization function and efficient (kinds of) optimizers. Interestingly, some
prior work has designed a general procedure to learn an anonymization function from a set
of training data that optimizes the balance between graph anonymity and utility loss. As
discussed earlier, we transform the graph anonymization problem into a constrained
optimization problem. Instead of learning the anonymous graph, learning the anonymization
process is promising, as illustrated in Figure 15.2. However, their work
is limited to simulating perturbation-based schemes, while random perturbation-based
anonymization schemes usually suffer from a high utility cost. Experiments [] show that it
is hard to provide anonymity while preserving utility, whereas some perturbation-based
schemes destroy utility without providing much anonymity.
Moreover, their general learning process ignores the properties of the anonymization
function and therefore suffers from an inefficient convergence rate. In particular, it is not
guaranteed to converge to a locally optimal anonymization model for large graph datasets.
Besides, this method relies on a large population of graphs with similar properties for the
training step, and it is not realistic to assume the existence and availability of such
homogeneous graph datasets. We plan to design a generic machine learning framework
for generalizing existing graph anonymization methods with more flexibility. The intuition
for how to generalize graph anonymization methods is as follows. Instead of defining
a specific anonymization function, the user defines a set of multi-class parameterized
function procedures that together control the behavior of the anonymization process.
We consider mixtures of different families of graph anonymization over the anonymization
stages. The machine learning framework will then be able to provide the best model,
corresponding to the complex anonymization procedure that obtains the best balance
between anonymization quality and utility loss in the user-defined anonymous graph space.
In summary, it would be interesting and promising to explore these three directions.
We hope that experimental evaluation can confirm the merits of the approaches and models
above. Eventually, future work in these directions will significantly enhance the applicability
of our next-generation graph anonymization system to a wider spectrum of settings.
References
[1] Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan
Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale
graph processing. In SIGMOD, pages 135–146, 2010. 1, 2, 7, 15, 17, 56, 129, 130
[2] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin.
Powergraph: Distributed graph-parallel computation on natural graphs. OSDI, pages
17–30, 2012. 1, 2, 15, 17, 129
[3] JP Eckmann and E Moses. Curvature of co-links uncovers hidden thematic layers
in the world wide web. Academy of Sciences, 2002. 1, 2
[4] S Khuller and B Saha. On finding dense subgraphs. Automata, 2009. 1, 2
[5] L Becchetti, P Boldi, C Castillo, and A Gionis. Efficient semi-streaming algorithms
for local triangle counting in massive graphs. KDD, pages 16–24, 2008. 1, 2, 25
[6] R Milo, S Shen-Orr, S Itzkovitz, N Kashtan, and D Chklovskii. Network motifs:
simple building blocks of complex networks. Academy of Sciences, 2002. 2
[7] JW Berry, B Hendrickson, RA LaViolette, and CA Phillips. Tolerating the com-
munity detection resolution limit with edge weighting. Physical Review E, 2011.
2
[8] LS Buriol, G Frahling, and S Leonardi. Counting triangles in data streams. PODS,
pages 253–262, 2006. 2
[9] Xiaocheng Hu, Yufei Tao, and Chin-Wan Chung. I/o-efficient algorithms on triangle
listing and counting. ACM Trans. Database Syst., 39(4), 2014. 2, 6, 54, 55
[10] J Kim, WS Han, S Lee, K Park, and H Yu. Opt: a new framework for overlapped
and parallel triangulation in large-scale graphs. SIGMOD, pages 637–648, 2014. 2,
6, 54, 55
[11] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. Graphchi: Large-scale graph
computation on just a pc. OSDI, pages 31–46, 2012. 2, 6, 54
[12] Shaikh Arifuzzaman, Maleq Khan, and Madhav Marathe. Patric: A parallel algo-
rithm for counting triangles in massive networks. CIKM, pages 529–538, 2013. 2,
7, 24, 25, 27, 55
[13] Siddharth Suri and Sergei Vassilvitskii. Counting triangles and the curse of the last
reducer. KDD, pages 607–614, 2011. 2, 6, 24, 25, 26, 27, 28, 46, 55
[14] Wook-Shin Han, Sangyeon Lee, Kyungyeol Park, Jeong-Hoon Lee, Min-Soo Kim,
Jinha Kim, and Hwanjo Yu. Turbograph: a fast parallel graph engine handling
billion-scale graphs in a single pc. KDD, pages 77–85, 2013. 2
[15] JE Gonzalez, RS Xin, A Dave, and D Crankshaw. Graphx: Graph processing in
a distributed dataflow framework. GRADES,SIGMOD workshop, pages 599–613,
2014. 2, 15, 17, 129
[16] Mingfeng Lin, Mei Lin, and Robert J. Kauffman. From clickstreams to search-
streams: Search network graph evidence from a b2b e-market. ICEC, 2012. 3,
4
[17] E Adar and C Re. Managing uncertainty in social networks. IEEE Data Eng. Bull.,
2007. 3, 58
[18] D Kempe, J Kleinberg, and E Tardos. Maximizing the spread of influence through
a social network. 2003. 3, 58, 60, 101
[19] Liang Zhang, Shigang Chen, Ying Jian, Yuguang Fang, and Zhen Mo. Maximizing
lifetime vector in wireless sensor networks. IEEE/ACM Trans. Netw., 2013. 3
[20] K Bollacker, C Evans, P Paritosh, and T Sturge. Freebase: a collaboratively created
graph database for structuring human knowledge. SIGMOD, 2008. 3
[21] NJ Krogan, G Cagney, H Yu, G Zhong, and X Guo. Global landscape of protein
complexes in the yeast saccharomyces cerevisiae. Nature, 2006. 3
[22] Eunjoon Cho, Seth A Myers, and Jure Leskovec. Friendship and mobility: user
movement in location-based social networks. KDD, 2011. 3, 84
[23] B Zhao, J Wang, M Li, FX Wu, and Y Pan. Detecting protein complexes based
on uncertain graph model. IEEE/ACM Transactions on Computational Biology and
Bioinformatics., 2014. 3, 18, 58, 60, 101
[24] D Liben Nowell and J Kleinberg. The link prediction problem for social networks.
The American Society for Information Science and Technology, 2007. 4
[25] Alan S. Abrahams, Eloise Coupey, Eva X. Zhong, Reza Barkhi, and Pete S. Man-
asantivongs. Audience targeting by b-to-b advertisement classification: A neural
network approach. Expert Systems with Applications, 40(8):2777 – 2791, 2013. 4
[26] T. Alsina, D.T. Wilson, S.A. Joshi, and S. Sundaresan. Targeting customer segments,
December 3 2015. US Patent App. 14/289,118. 4
[27] Kun Liu and Evimaria Terzi. Towards identity anonymization on graphs. SIGMOD,
2008. 5, 8, 11, 13, 59, 66, 92, 93, 95, 105
[28] Paolo Boldi, Francesco Bonchi, Aristides Gionis, and Tamir Tassa. Injecting uncer-
tainty in graphs for identity obfuscation. SIGMOD, 2012. 5, 8, 12, 13, 59, 62, 66,
67, 68, 76, 83, 85, 86, 93, 94, 100
[29] P Mittal, C Papamanthou, and D Song. Preserving link privacy in social network
based systems. NDSS, 2013. 5, 62, 76, 93, 94
[30] HH Nguyen, A Imine, and M Rusinowitch. Anonymizing social graphs via uncer-
tainty semantics. CCS, 2015. 5, 12, 62, 76, 85, 93, 94
[31] L Liu, J Wang, J Liu, and J Zhang. Privacy preservation in social networks with
sensitive edge weights. SDM, pages 954–965, 2009. 5, 66, 67, 92, 93, 94, 95
[32] Xiaowei Ying and Xintao Wu. Randomizing social networks: a spectrum preserving
approach. pages 739–750, 2008. 5, 10, 13, 66, 67, 92, 93, 130
[33] Mohd Ninggal and Jemal H Abawajy. Utility-aware social network graph
anonymization. J Netw Comput Appl, 2015. 5, 13, 66, 67, 92, 93, 130
[34] Francesco Bonchi, Aristides Gionis, and Tamir Tassa. Identity obfuscation in graphs
through the information theoretic lens. ICDE, 2014. 5, 59, 92, 93, 100, 130
[35] N. Alon, R. Yuster, and U. Zwick. Finding and counting given length cycles. Algo-
rithmica, 1997. 5, 54
[36] V Batagelj and A Mrvar. A subquadratic triad census algorithm for large sparse
networks with small maximum degree. Social networks, 2001. 5, 54
[37] T Schank. Algorithmic aspects of triangle-based network analysis. Phd in computer
science, 2007. 5, 25, 36, 54
[38] A Itai and M Rodeh. Finding a minimum circuit in a graph. SIAM, 1978. 5, 54
[39] N Chiba and T Nishizeki. Arboricity and subgraph listing algorithms. SIAM Journal
on Computing, 1985. 5, 54
[40] Ha-Myung Park, Francesco Silvestri, U Kang, and Rasmus Pagh. MapReduce tri-
angle enumeration with guarantees. CIKM, pages 1739–1748, 2014. 6, 24, 26,
55
[41] Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, Georgos Siganos, Mo-
hammed J. Zaki, and Ashraf Aboulnaga. Arabesque: a system for distributed graph
mining. SOSP, pages 425–440, 2015. 6, 55
[42] Yingxia Shao, Bin Cui, Lei Chen, Lin Ma, Junjie Yao, and Ning Xu. Parallel sub-
graph listing in a large-scale graph. SIGMOD, pages 625–636, 2014. 7, 56, 129
[43] Wentao Wu, Yanghua Xiao, Wei Wang, Zhenying He, and Zhihui Wang. k-
symmetry model for identity anonymization in social networks. EDBT, 2010. 8,
66, 67, 92, 93
[44] Bin Zhou and Jian Pei. Preserving privacy in social networks against neighborhood
attacks. ICDE, 2008. 8, 66, 92, 93
[45] L Backstrom, C Dwork, and J Kleinberg. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. WWW, 2007. 8
[46] Arvind Narayanan and Vitaly Shmatikov. De-anonymizing social networks. SP,
pages 173–187, 2009. 8
[47] Shixi Chen and Shuigeng Zhou. Recursive mechanism: Towards node differential
privacy and unrestricted joins. SIGMOD, pages 653–664, 2013. 9
[48] Michael Hay, Gerome Miklau, David Jensen, Philipp Weis, and Siddharth Srivastava. Anonymizing social networks. Technical report, University of Massachusetts Amherst, 2007. 10, 92, 93
[49] Smriti Bhagat, Graham Cormode, Balachander Krishnamurthy, and Divesh Srivastava. Class-based graph anonymization for social network data. PVLDB, 2009. 10, 92, 93
[50] Xiaowei Ying, Kai Pan, Xintao Wu, and Ling Guo. Comparisons of randomization
and k-degree anonymization schemes for privacy preserving social network publish-
ing. SNA-KDD, 2009. 10, 11, 66
[51] Sepp Hartung and Nimrod Talmon. The complexity of degree anonymization by
graph contractions. TAMC, 2015. 11, 15
[52] EWNI: Efficient anonymization of vulnerable individuals in social networks. PAKDD, 7302:359–370, 2012. 11
[53] Xuesong Lu, Yi Song, and Stéphane Bressan. Fast identity anonymization on graphs. DEXA, 7446:281–295, 2012. 11
[54] Yazhe Wang, Long Xie, Baihua Zheng, and Ken C. K. Lee. Utility-oriented k-
anonymization on social networks. DASFAA, 2011. 13, 66, 67, 92, 93
[55] S Salihoglu and J Widom. GPS: a graph processing system. SSDBM, 2013. 16, 31, 129
[56] Nguyen Bao and Toyotaro Suzumura. Towards highly scalable pregel-based graph
processing platform with x10. WWW, 2013. 16, 31
[57] S Asthana, OD King, FD Gibbons, and FP Roth. Predicting protein complex membership using probabilistic network reliability. Genome Research, 2004. 18
[58] J Ghosh, HQ Ngo, and S Yoon. On a routing problem within probabilistic graphs and its application to intermittently connected networks. INFOCOM, 2007. 18, 60, 101
[59] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on
large clusters. Commun. ACM, 2008. 26
[60] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, 2010. 26
[61] Anish Das Sarma, Foto N. Afrati, Semih Salihoglu, and Jeffrey D. Ullman. Upper
and lower bounds on the cost of a map-reduce computation. PVLDB, 2013. 29
[62] J Lin. The curse of zipf and limits to parallelization: A look at the stragglers problem
in mapreduce. LSDR-IR workshop, 2009. 29
[63] Gary W. Oehlert. A note on the delta method. The American Statistician, 1992. 30
[64] G Keramidas and P Petoumenos. Cache replacement based on reuse-distance prediction. ICCD, 2007. 41
[65] P Petoumenos and G Keramidas. Instruction-based reuse-distance prediction for
effective cache management. MSP, pages 60–68, 2009. 41
[66] M Potamias, F Bonchi, A Gionis, and G Kollios. K-nearest neighbors in uncertain
graphs. VLDB, 2010. 58, 60, 86, 95, 101
[67] M Hua and J Pei. Probabilistic path queries in road networks: traffic uncertainty
aware path selection. EDBT, 2010. 58
[68] R Jin, L Liu, B Ding, and H Wang. Distance-constraint reachability computation in
uncertain graphs. VLDB, 2011. 58, 83, 86
[69] Michael Hay, Gerome Miklau, David Jensen, Don Towsley, and Chao Li. Resisting structural re-identification in anonymized social networks. The VLDB Journal, 2010. 59, 92, 93
[70] Charles J. Colbourn. The combinatorics of network reliability. Oxford University Press, 1987. 60
[71] Panos Parchas, Francesco Gullo, Dimitris Papadias, and Francesco Bonchi. The pursuit of a good possible world: extracting representative instances of uncertain graphs. SIGMOD, 2014. 62, 63, 86
[72] Jordi Casas-Roma. Privacy-preserving on graphs using randomization and edge-
relevance. Modeling Decisions for Artificial Intelligence, 2015. 66, 67, 92, 93
[73] Brian Thompson and Danfeng Yao. The union-split algorithm and cluster-based anonymization of social networks. ASIACCS, 2009. 66, 92, 93
[74] Sudipto Das, Omer Egecioglu, and Amr Abbadi. Anonymizing weighted social
network graphs. ICDE, pages 904–907, 2010. 66, 67, 92, 93, 94, 95
[75] M. O. Ball. Computational complexity of network reliability analysis: An overview.
IEEE Transactions on Reliability, 1986. 71
[76] M. Fredman and M. Saks. The cell probe complexity of dynamic data structures.
STOC, 1989. 72
[77] P Boldi, M Rosa, and S Vigna. HyperANF: approximating the neighbourhood func-
tion of very large graphs on a budget. CoRR, 2011. 86
[78] G Cormode, D Srivastava, T Yu, and Q Zhang. Anonymizing bipartite graph data
using safe groupings. VLDB, 2008. 92, 93
[79] Shyue-Liang Wang, Yu-Chuan Tsai, Hung-Yu Kao, I-Hsien Ting, and Tzung-Pei Hong. Shortest paths anonymization on weighted graphs. International Journal of