
Database Fragmentation with Confidentiality Constraints: A Graph Search Approach∗

Xiaofeng Xu
Emory University
400 Dowman Dr.
Atlanta, GA

Li Xiong
Emory University
400 Dowman Dr.


Jinfei Liu
Emory University
400 Dowman Dr.


ABSTRACT

Database fragmentation is a promising approach that can be used in combination with encryption to achieve secure data outsourcing which allows clients to securely outsource their data to remote untrusted server(s) while enabling query support using the outsourced data. Given a set of confidentiality constraints, it vertically partitions the database into fragments such that the set of attributes in each constraint do not appear together in any one fragment. The optimal fragmentation problem is to find a fragmentation with minimum cost for query support. In this paper, we propose an efficient graph search based approach which obtains near optimal fragmentation. We model the fragmentation search space as a graph and propose efficient search algorithms on the graph. We present static and dynamic search strategies as well as a novel level-wise graph expansion technique which dramatically reduces the search time. Extensive experiments showed that our method significantly outperforms other state-of-the-art methods.

1. INTRODUCTION

Data security is widely recognized as a major barrier to cloud computing and other data outsourcing arrangements. Users are reluctant to place their sensitive data in the cloud due to concerns about data disclosure to potentially untrusted cloud providers and other malicious parties. The problem of secure data outsourcing or secure Data-As-a-Service (DAS) has received increasing attention in recent years [14]. The goal is to allow a client to securely outsource their data on remote untrusted server(s) while enabling computations or query support using the outsourced data [4, 21].

A common approach for secure DAS is to store entirely encrypted data on the server. Fully homomorphic encryption [11, 12, 25] allows a user to store fully encrypted data while enabling arbitrary computations on the

∗This research is supported by the AFOSR DDDAS program under grant FA9550-12-1-0240.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]’15, March 2–4, 2015, San Antonio, Texas, USA. Copyright © 2015 ACM 978-1-4503-3191-3/15/03 ...$15.00. http://dx.doi.org/10.1145/2699026.2699121.

Table 1: Sample relation and confidentiality constraints

(a) PATIENT

SSN          Name        Occup     Sickness    ZIP
485-95-5671  A.Hellman   Nurse     Obesity     30322
456-32-8672  B.Dooley    Nurse     Obesity     30322
634-34-5776  C.McKinley  Clerk     Obesity     30307
675-96-3284  B.Dooley    Lawyer    Celiac      30322
546-46-2755  E.Taylor    Manager   Latex al.   30332
784-38-6673  F.White     Designer  Pollen al.  30396

(b) Constraints

C0  {SSN}
C1  {Name, Occup}
C2  {Name, Sickness}
C3  {Occup, Sickness, ZIP}

encrypted data; however, its computation cost is prohibitive in practice. Many works have focused on supporting specific queries on encrypted data using specialized encryption schemes (e.g. [5, 19, 20, 24]). They often require weaker forms of encryption for complex queries. It is difficult for encryption alone to support versatile and efficient computation on the outsourced data with assurances of data confidentiality and data privacy.

Rekatsinas et al. [22] recently formalized the problem of privacy-aware data partitioning, where a sensitive dataset is partitioned among untrusted parties. Even earlier, database fragmentation [6, 7] was proposed as a promising approach that can be used in combination with encryption to achieve secure DAS. In many scenarios, the confidentiality requirements can be represented as a set of confidentiality constraints specifying the sensitive attributes and sensitive associations among attributes that need to be protected. Table 1(a) shows a sample database relation PATIENT with 5 attributes. Table 1(b) shows an example set of confidentiality constraints on relation PATIENT. Singleton constraints involving a single attribute, such as C0, state that the attribute itself is sensitive. Association constraints involving multiple attributes, such as C1, C2 and C3, state that the associations among these attributes are sensitive, i.e. the attributes should not be visible together. The rationale for constraints such as C3 is that the combination of Occup and ZIP may form a quasi-identifier [23], which allows linking and re-identification attacks and leads to disclosure of the association between Sickness and individual identities. While many works have focused on inference control on databases [10], which can be useful for defining such confidentiality constraints, we assume in this paper that the set of constraints is defined by the data owner, and we study the fragmentation techniques to enforce such association constraints.

Singleton constraints can be enforced via encryption of the attribute. Association constraints can be enforced via fragmentation which vertically partitions the database into fragments, such that the attributes in the same constraint will not appear together in any fragment as cleartext. A fragmentation that satisfies all confidentiality constraints is called a safe fragmentation. Table 2 shows a possible safe fragmentation over PATIENT. Design of the physical fragments Enc Fi will be explained later in Section 3. The fragmented database can then be outsourced to distributed servers. It is important to note that collusion attacks are also prevented in this mechanism. When multiple servers collude, even though they have the attributes together, they cannot link them with the same tuple; in other words, the association between attributes is still protected.

Table 2: Physical fragments

(a) Enc F1

salt  enc  Name
s1    α    A.Hellman
s2    β    B.Dooley
s3    γ    C.McKinley
s4    δ    B.Dooley
s5    ε    E.Taylor
s6    ζ    F.White

(b) Enc F2

salt  enc  Occup
s7    η    Nurse
s8    θ    Nurse
s9    ι    Clerk
s10   κ    Lawyer
s11   λ    Manager
s12   µ    Designer

(c) Enc F3

salt  enc  Sickness    ZIP
s13   ν    Obesity     30322
s14   ξ    Obesity     30322
s15   π    Obesity     30307
s16   ρ    Celiac      30322
s17   σ    Latex al.   30332
s18   τ    Pollen al.  30396

Not surprisingly, the fragmented data will introduce overhead for processing queries that involve attributes from multiple fragments. Given a cost model that measures the query processing overhead, the optimal fragmentation problem is to find a safe fragmentation with minimum cost. The optimal fragmentation problem has been shown to be NP-Hard [7, 22]. The size of the search space for the problem of $n$ attributes is the $n$th Bell number, $Bell(n)$, which grows exponentially with $n$ (e.g. $Bell(32) = 1.2806 \times 10^{26}$). Thus when $n$ is relatively large, even a single run of exhaustive search will take several years. The fragmentation problem can also be considered as a constrained clustering problem with set constraints (association constraints). However, existing constrained clustering methods only consider pairwise must-link and cannot-link constraints [27].
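To give a feeling for how quickly this search space grows, the following minimal sketch computes Bell numbers with the standard Bell triangle recurrence. The function name `bell_numbers` is ours and is not part of the paper.

```python
def bell_numbers(n_max: int) -> list[int]:
    """Return [Bell(0), ..., Bell(n_max)] using the Bell triangle."""
    bells = [1]  # Bell(0) = 1
    row = [1]    # first row of the Bell triangle
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row.
        new_row = [row[-1]]
        # Every further entry adds the entry to its left and the one above-left.
        for value in row:
            new_row.append(new_row[-1] + value)
        # The first entry of row n is Bell(n).
        bells.append(new_row[0])
        row = new_row
    return bells


if __name__ == "__main__":
    bells = bell_numbers(32)
    for n in (4, 8, 16, 32):
        print(f"Bell({n}) = {bells[n]:.4e}")
```

Python integers have arbitrary precision, so the values are exact; only the printed form is rounded.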

Several heuristic methods regarding fragmentation with confidentiality constraints have been proposed. Ciriani et al. proposed the heuristic search method [6] which constructs a fragmentation tree consisting of all possible fragmentations and searches the fragmentation tree with some heuristic pruning strategies. They proposed another hierarchical clustering algorithm [7] based on nearest joins. However, these methods tend to result in suboptimal solutions due to their greedy search strategies. Improving the performance of heuristic methods for the optimal fragmentation problem is the main contribution of this paper.

Contributions. In this paper, we propose an efficient graph search based approach for database fragmentation with confidentiality constraints. We model the search space as a graph where each vertex represents a possible solution (fragmentation) and each edge represents a transformation from one solution to another. We propose efficient search algorithms based on the general approach of local search [8] and guided local search [26] which are metaheuristic methods for solving computationally hard optimization problems. We present static and dynamic greedy search strategies as well as a novel level-wise graph expansion technique which dramatically reduces the search time. Extensive experiments showed that our method significantly outperforms other state-of-the-art methods.

The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 gives an overview of the fragmentation problem. Section 4 presents our proposed graph search models and algorithms. Section 5 presents the experimental results. In Section 6, we conclude this paper and propose some future work.

2. RELATED WORK

In this section, we review some related work about secure data outsourcing and privacy-aware fragmentation.

2.1 Secure Data Outsourcing

In attempting to enhance the security of data outsourcing, a number of different approaches based on encryption have been proposed. Techniques for supporting specific operations or search on encrypted data developed in the cryptographic community [5, 24] provide strong security guarantees. The breakthrough of the fully homomorphic encryption scheme [11, 12, 25] bears the potential to allow a cloud user to store fully encrypted data while enabling arbitrary computations on the encrypted data. The recently developed CryptDB system provides a promising on-the-fly approach that explores different layers of encryption to provide confidentiality over a database system while supporting database operations [19, 20]. Other works, for example [5, 13, 24], require weaker forms of encryption for complex queries such as joins or other queries which are costly for medium-size to very large scale data. The most essential limitation of encryption-based methods is that the computation cost on encrypted data is prohibitive in practice.

The literature following the seminal work on k-anonymity [16, 17] adopts syntactic privacy notions by considering specific attacks and assuming the attacker has limited background knowledge [10]. Differential privacy [9] is a strong semantic notion for guaranteeing privacy with arbitrary background knowledge for statistical data release. These methods typically only allow aggregates or perturbed statistics to be released and do not support transactional query processing that involves individual records.

2.2 Privacy-Aware Fragmentation

Recently, Rekatsinas et al. [22] formalized the problem of privacy-aware data partitioning, where a sensitive dataset is partitioned among untrusted parties, and proposed a general method which considered both vertical and horizontal fragmentation. A key assumption for this work is that the fragmentation servers do not collude with each other. In contrast, we focus on vertical fragmentation which is resistant to collusion. Moreover, the utility defined in [22] is the general information value in the partitioned datasets by assuming the fragments cannot be joined with each other, while our work considers the utility of query support by the fragmented data, which can be joined by trusted mediators to answer queries, in the form of data transmission cost during query processing. Below we review the works on combining vertical fragmentation and confidentiality constraints, which consider the same problem setting and are most closely related to our work.

Vertical fragmentation has been used in combination with encryption to ensure confidentiality constraints for secure data outsourcing problems. By partitioning the data into several fragments, the method of fragmentation avoids introducing perturbation or noise to the original data and supports query processing. Several heuristic methods have been proposed recently regarding the optimal fragmentation problem with confidentiality constraints. The heuristic search method [6] is based on the monotonicity of the predefined query cost and a well-designed fragmentation tree which covers the whole search space. The fragmentation tree is designed such that any offspring of an unsafe fragmentation is unsafe. Consequently, the sub-tree rooted at an unsafe fragmentation will be pruned during the search. When the volume of confidentiality constraints is sufficiently large, the pruning strategy will dramatically reduce the execution time of query processing. The hierarchical clustering method [7] for database fragmentation is based on nearest joins. In each iteration, the two attributes with the highest attribute affinity are joined if the join will not violate the confidentiality constraints. The algorithm terminates when no pair of attributes can be joined.

Different from these works, our approach models the search space as a fragmentation graph. Our search strategy allows dynamic navigation through the graph, guided by the confidentiality constraints, instead of a rigid top-down search strategy, and hence dramatically increases the chance to reach the optimal solution.

3. PROBLEM SETUP AND OVERVIEW

We first provide some preliminary definitions and present a formal problem setting.

3.1 Fragmentation

Definition 3.1 (Fragmentation [7]) Given a relation scheme $R$ and a set $A \subset R$ of attributes to be fragmented, a fragmentation of $R$ on $A$ is a set of fragments $\mathcal{F} = \{F_1, \cdots, F_k\}$ such that: 1) (Validness) $\forall F_i \in \mathcal{F}$, $F_i \subset A$; 2) (Completeness) $\forall a \in A$, $\exists F_i \in \mathcal{F} : a \in F_i$; and 3) (Disjointness) $\forall F_i, F_j \in \mathcal{F}$, $i \neq j : F_i \cap F_j = \emptyset$.

As shown in Table 2, each fragment $F_i \in \mathcal{F}$ is physically stored in a relation, denoted Enc $F_i$, called a physical fragment. It is defined on the set $\{salt, enc, a_{i_1}, \cdots, a_{i_n}\}$, where $a_{i_1}, \cdots, a_{i_n}$ are the attributes in $F_i$ in clear form and enc is the encrypted values of all other attributes in $R$ except $a_{i_1}, \cdots, a_{i_n}$, XOR'ed with the salt, which is a random value, different for each tuple, used for preventing frequency-based attacks over the encrypted values. Sensitive attributes such as SSN are encrypted in enc. In short, the data are replicated in each physical fragment, with the attributes corresponding to the fragment in clear form and the other attributes in encrypted form. The physical fragments can be distributed at different servers. Replicated data fragments enjoy two major advantages: 1) availability: failure of any fragment containing relation $R$ does not result in unavailability of $R$ in other fragments. Although some attributes in the fragments are encrypted, the completeness of the relation is still guaranteed within any single fragment; 2) reduced data transfer: relation $R$ is available locally at each site containing a replica of $R$, thus only the fragment with the minimum intermediate result contributes to the overall data transfer. Data transfer is closely related to the query cost of distributed database systems, which will be discussed later in this section. Note that the system combining fragmentation and encryption has been proved to be confidentiality preserving in [2] and [3].
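The layout of a physical fragment can be sketched as follows. This is only an illustration under our own assumptions: the helper names (`make_physical_fragment`, `decrypt_hidden`) are hypothetical, the hidden attributes are serialized as JSON, and the "cipher" is a toy SHA-256 keystream that is not secure. Unlike the description above, the sketch mixes the per-tuple salt into the keystream derivation rather than XOR'ing it with the plaintext; the purpose (equal plaintexts yield different enc values) is the same.

```python
import hashlib
import json
import secrets


def _keystream(seed: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 in counter mode). Illustration only, NOT secure."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def _xor(data: bytes, pad: bytes) -> bytes:
    return bytes(d ^ p for d, p in zip(data, pad))


def make_physical_fragment(rows, fragment, key: bytes):
    """Build records {salt, enc, <cleartext attributes of the fragment>}."""
    physical = []
    for row in rows:
        # Attributes outside the fragment are hidden inside `enc`.
        hidden = {a: v for a, v in row.items() if a not in fragment}
        plain = json.dumps(hidden, sort_keys=True).encode()
        salt = secrets.token_bytes(16)  # fresh random salt per tuple
        enc = _xor(plain, _keystream(key + salt, len(plain)))
        record = {"salt": salt, "enc": enc}
        record.update({a: row[a] for a in fragment})
        physical.append(record)
    return physical


def decrypt_hidden(record, key: bytes):
    """Recover the hidden attributes of one record (what a mediator would do)."""
    pad = _keystream(key + record["salt"], len(record["enc"]))
    return json.loads(_xor(record["enc"], pad).decode())
```

A real deployment would replace the toy keystream with an authenticated encryption scheme from a vetted cryptographic library.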

[Figure 1 diagram omitted; recovered labels: Mediator; I1, I2, ..., Ik; Trusted; Untrusted; Q, k; Q; Answer to Q; Data Transmission]

Figure 1: Query processing model

3.2 Cost Model

Figure 1 illustrates the query processing model we consider in this paper for supporting user queries using the encrypted and fragmented data. An untrusted user submits queries directly to the server that stores the fragment containing the cleartext contents to be queried, and receives the results from the corresponding server. A trusted user who is authorized to access both the cleartext and the encrypted content submits the query $Q$, together with the key $k$ needed for decrypting the data, to a trusted query mediator. The mediator transforms query $Q$ for each physical fragment with a subset of the queried attributes $\{a_{i_j}\}$ that overlaps with the attributes in each corresponding fragment. Each physical fragment then processes the query locally and the fragment with the least amount of result data transfers the result to the mediator. The mediator then decrypts the encrypted attributes using the key provided by the user, discards spurious tuples, and finally returns the query result to the user.

The cost of distributed querying systems can be expressed with respect to the response time, consisting of two parts: local processing cost and data transfer cost. However, data transfer cost is the dominant time factor in wide networks such as the Internet [18]. Next we describe the transfer cost in our query processing model.

Let $\mathcal{Q} = \{Q_1, \cdots, Q_q\}$ be a set of queries that access relation $R$ with attributes $A = \{a_1, \cdots, a_n\}$. Each query has an execution frequency $freq(Q_i)$ with $0 < freq(Q_i) < 1$, $\forall 1 \leq i \leq q$, and $\sum_{1 \leq i \leq q} freq(Q_i) = 1$. A query $Q_i$ is formed as "select $a_{i_1}, \cdots, a_{i_n}$ from $R$ where $\bigwedge_{l=1}^{n} (a_l \in V_l)$", where $V_l$ is a set of values in the domain of attribute $a_l$. For a query $Q_i$ and a fragment $F_j$, the query cost is calculated as

$$Cost(Q_i, F_j) = S(Q_i, F_j) \cdot |R| \cdot size(t_j)$$

where $S(Q_i, F_j)$ is the selectivity of the query, $|R|$ is the number of records in relation $R$, and $size(t_j)$ is the length of the queried tuple. The selectivity of an attribute $a_l$ is the ratio of the tuples satisfying the querying condition for $a_l$, i.e. $\frac{|V_l|}{|R|}$, where $|V_l|$ is the size of $V_l$. We assume the values of different attributes are distributed independently of each other; the selectivity $S(Q_i, F_j)$ is then the product of the selectivities of all attributes. If attribute $a_l$ does not appear in $F_j$, the selectivity for $a_l$ is set to 1. Since only the fragment with the smallest query result set returns the result back to the mediator, the final query cost of $Q_i$ is calculated as the minimum of the query cost over all fragments,

$$Cost(Q_i, \mathcal{F}) = \min_{1 \leq j \leq k} Cost(Q_i, F_j).$$

Thus given a fragmentation $\mathcal{F}$ and the query set $\mathcal{Q}$, the query cost for the fragmentation is

$$Cost(\mathcal{F}) = \sum_{1 \leq i \leq q} Cost(Q_i, \mathcal{F}) \cdot freq(Q_i).$$

This definition of query cost was first introduced in [6]. Here we give a simple example of computing the query cost for a single query.

Example 3.1 Given a query $Q$: "select * from PATIENT where Sickness = Obesity and Occup = Nurse", a fragmentation $\mathcal{F} = \{F_1, F_2, F_3\}$ with $F_1 = \{Name\}$, $F_2 = \{Occup\}$, $F_3 = \{Sickness, ZIP\}$, the number of records $|R| = 6$, and $size(t_j) = 1$ for each fragment, the query costs are $Cost(Q, F_1) = 1 \cdot 6 \cdot 1 = 6$, $Cost(Q, F_2) = \frac{1}{3} \cdot 6 \cdot 1 = 2$, and $Cost(Q, F_3) = \frac{1}{2} \cdot 6 \cdot 1 = 3$. So $Cost(Q, \mathcal{F}) = \min\{6, 2, 3\} = 2$.
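The arithmetic of Example 3.1 can be reproduced with a short sketch. We assume here, as the example does, that the selectivity of an attribute is the fraction of tuples whose value satisfies the query condition, and that $size(t_j) = 1$. The function names are ours; SSN is left out of the sample rows because it is handled by encryption and plays no role in the example.

```python
from fractions import Fraction

PATIENT = [
    {"Name": "A.Hellman", "Occup": "Nurse", "Sickness": "Obesity", "ZIP": "30322"},
    {"Name": "B.Dooley", "Occup": "Nurse", "Sickness": "Obesity", "ZIP": "30322"},
    {"Name": "C.McKinley", "Occup": "Clerk", "Sickness": "Obesity", "ZIP": "30307"},
    {"Name": "B.Dooley", "Occup": "Lawyer", "Sickness": "Celiac", "ZIP": "30322"},
    {"Name": "E.Taylor", "Occup": "Manager", "Sickness": "Latex al.", "ZIP": "30332"},
    {"Name": "F.White", "Occup": "Designer", "Sickness": "Pollen al.", "ZIP": "30396"},
]


def selectivity(rows, conditions, fragment):
    """Product of per-attribute selectivities; attributes outside the fragment count as 1."""
    s = Fraction(1)
    for attr, allowed in conditions.items():
        if attr in fragment:
            matching = sum(1 for r in rows if r[attr] in allowed)
            s *= Fraction(matching, len(rows))
    return s


def fragment_cost(rows, conditions, fragment, tuple_size=1):
    """Cost(Q, F_j) = S(Q, F_j) * |R| * size(t_j)."""
    return selectivity(rows, conditions, fragment) * len(rows) * tuple_size


def query_cost(rows, conditions, fragmentation):
    """Cost(Q, F): only the fragment with the smallest result ships its data."""
    return min(fragment_cost(rows, conditions, f) for f in fragmentation)


def fragmentation_cost(rows, queries, fragmentation):
    """Cost(F): frequency-weighted sum over (freq, conditions) pairs."""
    return sum(freq * query_cost(rows, cond, fragmentation) for freq, cond in queries)


if __name__ == "__main__":
    q = {"Sickness": {"Obesity"}, "Occup": {"Nurse"}}
    frags = [{"Name"}, {"Occup"}, {"Sickness", "ZIP"}]
    print([fragment_cost(PATIENT, q, f) for f in frags], query_cost(PATIENT, q, frags))
```

Exact rational arithmetic (`Fraction`) is used so that costs can be compared without floating-point noise.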

Note that the definition of query cost can change according to the query processing model and the network environment, and that our method is generic and applies to any formula of query cost.

3.3 Confidentiality Constraints

In this subsection, we formally define confidentiality constraints and safe fragmentations, based on which the optimal fragmentation problem is defined.

Definition 3.2 (Confidentiality constraint [7]) Given a set $A$ of attributes, a confidentiality constraint $C$ over $A$ is 1) a singleton set $\{a\}$, stating that the value of the attribute is sensitive; or 2) a subset $C \subset A$, stating that the association between values of the given attributes is sensitive.

Singleton constraints can only be solved by encryption; the method of fragmentation only solves non-singleton constraints. Since the satisfaction of a constraint $C_i$ implies the satisfaction of any constraint $C_j$ if $C_i \subset C_j$, we consider a well defined set of constraints $\mathcal{C} = \{C_1, C_2, \cdots, C_m\}$, i.e. $\forall C_i, C_j \in \mathcal{C}$ with $i \neq j$, we have $C_i \not\subset C_j$.

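Definition 3.3 (Safe fragmentation [7]) Given a relation schema $R$, a set $\mathcal{C} = \{C_1, C_2, \cdots, C_m\}$ of well defined constraints over $R$, and a set $A$ of attributes to be fragmented, a fragmentation $\mathcal{F} = \{F_1, \cdots, F_k\}$ is safe iff $C_i \not\subset F_j$, $\forall 1 \leq i \leq m$, $1 \leq j \leq k$, i.e. no fragment contains all attributes of any constraint.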
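The two conditions above translate directly into set operations. The following sketch uses hypothetical helper names and represents constraints and fragments as frozensets of attribute names.

```python
def is_well_defined(constraints) -> bool:
    """No constraint may be contained in another one."""
    return not any(
        i != j and ci <= cj
        for i, ci in enumerate(constraints)
        for j, cj in enumerate(constraints)
    )


def is_safe(fragmentation, constraints) -> bool:
    """Safe iff no fragment contains every attribute of some constraint."""
    return not any(c <= f for c in constraints for f in fragmentation)


if __name__ == "__main__":
    # Non-singleton constraints C1, C2, C3 of Table 1(b).
    constraints = [
        frozenset({"Name", "Occup"}),
        frozenset({"Name", "Sickness"}),
        frozenset({"Occup", "Sickness", "ZIP"}),
    ]
    table2 = [frozenset({"Name"}), frozenset({"Occup"}), frozenset({"Sickness", "ZIP"})]
    print(is_well_defined(constraints), is_safe(table2, constraints))
```

For the fragmentation of Table 2 both checks succeed, while merging Name and Occup into one fragment would violate C1.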

Definition 3.4 (Optimal fragmentation) Given a relation schema $R$, a set $\mathcal{C}$ of well defined constraints over $R$, and a set $A$ of attributes to be fragmented, the fragmentation $\mathcal{F}$ is optimal iff 1) $\mathcal{F}$ is a safe fragmentation of $R$ on $A$; and 2) for every safe fragmentation $\mathcal{F}^\star$, we have $Cost(\mathcal{F}^\star) \geq Cost(\mathcal{F})$.

We formulate the optimal fragmentation problem as the following constrained optimization problem:

$$\arg\min_{\mathcal{F} \in \mathcal{U}} \{\mathcal{F} \in \mathcal{O} : Cost(\mathcal{F})\}$$

where $\mathcal{U}$ is the universal search space and $\mathcal{O} \subset \mathcal{U}$ is the feasible space consisting of all safe fragmentations.
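For very small attribute sets the optimization problem can be solved by exhaustive enumeration, which is useful as a reference point for the heuristics. The sketch below is ours, not the authors' method: it enumerates all set partitions, keeps the safe ones, and returns one with minimum cost. Attributes are abbreviated by single letters (N, O, S, Z), and each query is given as a pair (frequency, {attribute: selectivity}) with $size(t_j) = 1$, which are assumptions made for illustration.

```python
from fractions import Fraction
from math import prod


def partitions(items):
    """Yield every partition of a list of items as a list of sets."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [part[i] | {first}] + part[i + 1:]
        # ... or into a new block of its own.
        yield part + [{first}]


def is_safe(fragmentation, constraints):
    return not any(c <= f for c in constraints for f in fragmentation)


def cost(fragmentation, queries, n_rows):
    """Cost(F) with size(t_j) = 1; attributes outside a fragment have selectivity 1."""
    return sum(
        freq * n_rows * min(
            prod((s for a, s in sel.items() if a in f), start=Fraction(1))
            for f in fragmentation
        )
        for freq, sel in queries
    )


def brute_force_optimum(attrs, constraints, queries, n_rows):
    safe = [p for p in partitions(list(attrs)) if is_safe(p, constraints)]
    return min(safe, key=lambda p: cost(p, queries, n_rows))


if __name__ == "__main__":
    constraints = [frozenset("NO"), frozenset("NS"), frozenset("OSZ")]
    queries = [(1, {"O": Fraction(1, 3), "S": Fraction(1, 2)})]
    best = brute_force_optimum("NOSZ", constraints, queries, n_rows=6)
    print(best, cost(best, queries, 6))
```

The number of candidates is the Bell number of the attribute count, so this is only practical for a handful of attributes.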

4. ALGORITHMS

In this section, we present our graph search approach which achieves near optimal solutions for the above constrained optimization problem.

4.1 Graph Search Method

We represent the search space for the optimization problem as a fragmentation graph, denoted as $G(\mathcal{U}, \mathcal{E})$, which is a graph constructed by representing the fragmentations as a set of vertices $\mathcal{U}$ and transformations between fragmentations as a set of edges $\mathcal{E}$. A fragmentation in the fragmentation graph is also called a state; a safe (unsafe) fragmentation is called a safe (unsafe) state. Two states $x_i$ and $x_j$ are neighbors, i.e. connected by an edge, denoted as $(x_i, x_j)$, if and only if they can transform to each other through an atomic operation. Here we consider jump as the atomic operation.

[Figure 2 diagram omitted; recovered state labels: $x_{p_0}$, $x_{p_1}$, $x_{p_2}$, $x_{q_2}$, $x_{q_3}$, $x_{p_3}$, $x_{p_4}$, $x^\dagger_{p_4}$, $x_{p_5}$]

Figure 2: An example of fragmentation graph

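Definition 4.1 (Jump) Given a fragmentation $\mathcal{F} \in \mathcal{U}$ with $\mathcal{F} = \{F_1, \cdots, F_k\}$, a source $F_s$ and a destination $F_d$ where $1 \leq s, d \leq k$ and $s \neq d$, and an attribute $a \in F_s$,

$$Jump(\mathcal{F}, a, d) = \{F_1, \cdots, F_s \setminus \{a\}, \cdots, F_d \cup \{a\}, \cdots, F_k\}.$$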

A jump operation transforms one fragmentation into another by moving an attribute from one fragment to another. For example, fragmentation {{NSO}{Z}} can jump to {{NS}{OZ}} by moving O from the first fragment to the second. Jump is not the only valid atomic operation for our method. If we consider join (joining two fragments into one) as the atomic operation, the graph will degenerate into a hierarchical tree and our method becomes a reincarnation of the hierarchical clustering method [7].
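The neighborhood induced by the jump operation can be enumerated as in the sketch below. It rests on two assumptions of ours: a fragmentation is represented as a frozenset of frozensets, so a source fragment that becomes empty simply disappears, and an attribute may also jump into a new, initially empty fragment (index $k+1$), as used later in the analysis of fake-ends.

```python
def neighbors(state):
    """Return the set of fragmentations reachable from `state` by one jump."""
    frags, found = list(state), set()
    for s, src in enumerate(frags):
        rest = frags[:s] + frags[s + 1:]
        for a in src:
            # What is left of the source fragment after `a` leaves it.
            remainder = [src - {a}] if len(src) > 1 else []
            # Jump into another existing fragment.
            for d, dst in enumerate(rest):
                found.add(frozenset(rest[:d] + [dst | {a}] + rest[d + 1:] + remainder))
            # Jump into a new fragment (pointless if `a` is already alone).
            if remainder:
                found.add(frozenset(rest + remainder + [frozenset({a})]))
    return found


if __name__ == "__main__":
    start = frozenset({frozenset("NSO"), frozenset("Z")})
    for n in neighbors(start):
        print(sorted("".join(sorted(f)) for f in n))
```

Returning a set removes duplicates, since different jumps can lead to the same fragmentation.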

The fragmentation graph forms the universal search space, which contains all possible fragmentations and links. Our graph search method greedily discovers a path towards the (nearly) optimal solution on the fragmentation graph.

Definition 4.2 (Solution path) Given a fragmentation graph $G(\mathcal{U}, \mathcal{E})$, a solution path $p_x : \mathcal{U} \to \{x_{p_0}, x_{p_1}, x_{p_2}, \cdots\}$ is a sequence of states defined in $\mathcal{U}$. The search strategy $\rho$ decides the next state in a solution path, i.e. $x_{p_{i+1}} = \rho(x_{p_i})$.

A solution path terminates when it cannot be extended by the corresponding search strategy. The terminating state of the solution path is the result state obtained by our graph search method. Figure 2 briefly illustrates the model of our graph search method. The solid dots and circles represent the safe states and unsafe states respectively, while neighbors are connected by lines. $\{x_{p_0}, x_{p_1}, x_{p_2}, x_{p_3}\}$ in Figure 2 shows a solution path starting from the initial state $x_{p_0}$ and terminating at $x_{p_3}$. For example, a solution path could be {{NSOZ}} → {{NSO}{Z}} → {{NS}{OZ}} → {{N}{SOZ}} in the fragmentation graph for PATIENT.

Definition 4.3 (Dominance) Given two states $x_i$ and $x_j$ in a fragmentation graph, $x_i$ dominates $x_j$, denoted as $x_i \succ x_j$, iff $\sigma(x_i) < \sigma(x_j)$, where $\sigma : \mathcal{U} \to \mathbb{R}$ is a scoring metric for each fragmentation.

Next we propose the static and dynamic search strategies based on dominance.

4.2 Static Search Strategy

A straightforward search strategy is to use the query cost of a fragmentation as the scoring metric and greedily pick a neighbor fragmentation with the minimum cost at each step. Such a search strategy is called a static search strategy since it does not vary with the number of steps in a solution path.


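Algorithm 1: GSM($A$, $\mathcal{C}$, $\mathcal{Q}$, $\mathcal{F}_0$)

i ← 1; $\mathcal{F}$ ← $\mathcal{F}_0$;    // initialization
while true do
    Min ← $\mathcal{F}$;
    forall the ($\mathcal{F}$, $\mathcal{F}^\star$) in $\mathcal{E}$ do
        if $\sigma^1(\mathcal{F}^\star) < \sigma^1(Min)$ and SatCon($\mathcal{C}$, $\mathcal{F}^\star$) = true then
            Min ← $\mathcal{F}^\star$;
    if Min = $\mathcal{F}$ then
        return Min;
    $\mathcal{F}$ ← Min; i ← i + 1;    // step on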

The corresponding scoring metric and dominance relation are called static scoring metric and static dominance relation, respectively. We denote $\sigma^1(x) = Cost(x)$ as the static scoring metric, and $\succ^1$ as the corresponding dominance relation; thus the corresponding static search strategy $\rho^1$ is defined as

$$x_{p_{i+1}} = \rho^1(\succ^1, x_{p_i}) = \min_{x \succ^1 x_{p_i}} \{x \in \mathcal{O}, (x_{p_i}, x) \in \mathcal{E} : \sigma^1(x)\}$$

Note that, with search strategy $\rho^1$, the scoring metric of the selected state is lower than that of the previous state.

Algorithm 1 shows the pseudocode of function GSM, which implements our graph search method with the static search strategy. GSM requires $A = \{a_1, \cdots, a_n\}$ (set of attributes to be fragmented), $\mathcal{C} = \{C_1, \cdots, C_m\}$ (set of well defined non-singleton constraints), $\mathcal{Q}$ (set of queries) and $\mathcal{F}_0$ (the initial state) as parameters and returns Min, which is the result fragmentation obtained by our algorithm. The algorithm first initializes the step counter $i = 1$ and sets the current state $\mathcal{F}$ to the initial state $\mathcal{F}_0$, then enters the main loop. At each step, the algorithm sets Min to the previously selected state $\mathcal{F}$, then visits each of its safe neighbors. The neighbor that has the lowest scoring metric and dominates $\mathcal{F}$ is stored in Min. If no such neighbor exists, i.e. Min = $\mathcal{F}$, the algorithm returns Min as the result. Otherwise, $\mathcal{F}$ is set to Min and the step counter $i$ is increased by 1. The function SatCon checks whether a given state satisfies all the confidentiality constraints. It is important to note that static search strategies require safe initial states, i.e. $\mathcal{F}_0 \in \mathcal{O}$. Next we propose the dynamic search strategy which does not require a safe initial state.
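A compact Python rendering of the static search may help to follow Algorithm 1. It is a sketch under our own assumptions rather than the authors' implementation: states are frozensets of frozensets, the neighborhood includes jumps into a new fragment, the scoring metric is passed in as a function, and the toy workload (one query with selectivities 1/3 on O and 1/2 on S over 6 rows) mirrors Example 3.1.

```python
from fractions import Fraction
from math import prod


def neighbors(state):
    """All fragmentations reachable by moving one attribute (one jump)."""
    frags, found = list(state), set()
    for s, src in enumerate(frags):
        rest = frags[:s] + frags[s + 1:]
        for a in src:
            remainder = [src - {a}] if len(src) > 1 else []
            for d, dst in enumerate(rest):
                found.add(frozenset(rest[:d] + [dst | {a}] + rest[d + 1:] + remainder))
            if remainder:
                found.add(frozenset(rest + remainder + [frozenset({a})]))
    return found


def is_safe(state, constraints):
    """SatCon: no fragment contains a whole constraint."""
    return not any(c <= f for c in constraints for f in state)


def cost(state, queries, n_rows):
    """Static scoring metric: the query cost of Section 3.2 with size(t_j) = 1."""
    return sum(
        freq * n_rows * min(
            prod((s for a, s in sel.items() if a in f), start=Fraction(1)) for f in state
        )
        for freq, sel in queries
    )


def gsm(start, constraints, score):
    """Greedy descent over safe neighbors; `start` must itself be safe."""
    current = start
    while True:
        best = current
        for x in neighbors(current):
            if is_safe(x, constraints) and score(x) < score(best):
                best = x
        if best == current:  # no safe neighbor dominates the current state
            return current
        current = best       # step on


if __name__ == "__main__":
    constraints = [frozenset("NO"), frozenset("NS"), frozenset("OSZ")]
    queries = [(1, {"O": Fraction(1, 3), "S": Fraction(1, 2)})]
    top = frozenset(frozenset({a}) for a in "NOSZ")  # all singletons: always safe
    result = gsm(top, constraints, lambda x: cost(x, queries, 6))
    print(sorted("".join(sorted(f)) for f in result))
```

Because the score strictly decreases at every step and the state space is finite, the loop always terminates.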

4.3 Dynamic Search Strategy

A fragmentation graph with the static search strategy may suffer from the problem of dead-ends. Figure 2 shows an example of a dead-end in our graph search method. In Figure 2, the solution path $p_x = \{x_{p_0}, x_{p_1}, x_{p_2}, x_{p_3}\}$ terminates at $x_{p_3}$, since both of its neighbors $x_{p_4}$ and $x^\dagger_{p_4}$ are unsafe. However, it is possible that one of their neighbors, $x_{p_5}$, is safe and enjoys an even lower query cost than $x_{p_3}$. The search strategy $\rho^1$ fails to detect this situation.

We resolve dead-ends by applying the theory of guided local search [26]. At the early stage, we treat the states in the fragmentation graph as transparent states, i.e. we consider the confidentiality constraints as soft constraints which can be violated with a penalty. The confidentiality constraints become harder as the algorithm proceeds and eventually become hard constraints which force the algorithm to terminate at a safe state. We use a dynamic search strategy to represent the softness of the confidentiality constraints. A scoring metric $\sigma$ is dynamic if and only if it varies with the number of steps in a solution path. We denote a dynamic scoring metric as $\sigma = [\sigma_0, \sigma_1, \sigma_2, \cdots]^T$, where $\sigma_i$ is applied at the $i$th step. The corresponding dominance relation $\succ$, denoted as $\succ = [\succ_0, \succ_1, \succ_2, \cdots]^T$, is a dynamic dominance relation and the corresponding search strategy $\rho$, denoted as $\rho(\succ, x) = [(\succ_0, x), (\succ_1, x), (\succ_2, x), \cdots]^T$, is a dynamic search strategy.

Intuitively, our dynamic strategy also greedily picks the neighbor solution with the lowest scoring metric. However, different from the static strategy, it allows an unsafe neighbor to be picked, which helps to avoid the dead-end problem. On the other hand, to guarantee the safeness of the final solution, the scoring metric includes penalties for unsafe states and the penalties increase as a function of the number of steps. Formally, our dynamic search strategy $\rho^2$ is defined as

$$x_{p_{i+1}} = \rho^2(\succ^2_i, x_{p_i}) = \min_{x \succ^2_i x_{p_i}} \{x \in \mathcal{U}, (x_{p_i}, x) \in \mathcal{E} : \sigma^2_i(x)\}$$

where

$$\sigma^2_i(x) = \frac{Cost(x) - \alpha}{\mu - \alpha} + i \cdot \gamma \cdot \frac{Penalty(x) - \beta}{\omega - \beta}$$

is a dynamic scoring metric. $Penalty(x)$ is the penalty penalizing the fragmentation $x$ for violating the confidentiality constraints in $\mathcal{C}$. $\gamma$ is the relaxation coefficient. $\frac{Cost(x) - \alpha}{\mu - \alpha}$ and $\frac{Penalty(x) - \beta}{\omega - \beta}$ are the normalized query cost and penalty, respectively. The penalty function is defined as

$$Penalty(x) = \sum_{C_i \in \mathcal{C}} \delta(C_i, x)$$

where $\delta(\cdot)$ is the indicator function, which equals 1 when constraint $C_i$ is violated in $x$ and 0 otherwise. According to the cost model defined in Section 3.2, the fragmentations $\mathcal{F}^\top = \{\{a_1\}, \{a_2\}, \cdots, \{a_n\}\}$ and $\mathcal{F}^\bot = \{\{a_1, a_2, \cdots, a_n\}\}$ have the highest and lowest query cost, respectively. On the other hand, $\mathcal{F}^\bot$ violates all the confidentiality constraints and thus suffers the highest penalty, while $\mathcal{F}^\top$ satisfies all the confidentiality constraints and thus has zero penalty. Accordingly, the regularization coefficients are set as $\mu = Cost(\mathcal{F}^\top)$, $\alpha = Cost(\mathcal{F}^\bot)$, $\omega = Penalty(\mathcal{F}^\bot)$, and $\beta = 0$.

At the early stage of the search, $i$ is small, which guarantees the softness of the confidentiality constraints. However, $i$ increases during the procedure, so the softness of the confidentiality constraints reduces accordingly. Specifically, when $i \geq \lceil \frac{\omega - \beta}{\gamma} \rceil$, the penalty will dominate the scoring metric, i.e. fragmentations with higher penalty will be assigned higher scoring metrics regardless of their query costs, thus the solution path will be forced to extend towards states with lower penalty. Different from $\rho^1$, $\rho^2$ does not require safe initial states.
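The effect of the step counter on the scoring metric can be seen in a small numerical sketch. The function names and the concrete numbers ($\mu = 2$, $\alpha = 1$, $\omega = 3$, $\gamma = 0.5$) are illustrative assumptions of ours, not values from the paper.

```python
from math import ceil


def dynamic_score(i, cost, penalty, mu, alpha, omega, gamma, beta=0.0):
    """sigma^2_i(x): normalized cost plus a penalty term that grows with step i."""
    return (cost - alpha) / (mu - alpha) + i * gamma * (penalty - beta) / (omega - beta)


def hardening_step(omega, gamma, beta=0.0):
    """Step from which the penalty term dominates: ceil((omega - beta) / gamma)."""
    return ceil((omega - beta) / gamma)


if __name__ == "__main__":
    params = dict(mu=2, alpha=1, omega=3, gamma=0.5)
    print("penalty dominates from step", hardening_step(omega=3, gamma=0.5))
    for i in (1, 3, 7):
        safe = dynamic_score(i, cost=2, penalty=0, **params)    # expensive but safe
        unsafe = dynamic_score(i, cost=1, penalty=1, **params)  # cheap but violates one constraint
        print(i, round(safe, 3), round(unsafe, 3))
```

In this toy setting the cheap unsafe state scores better at step 1, while at step 7 the safe state wins, which is the softening-then-hardening behavior described above.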

However, with search strategy $\rho^2$, a solution path may terminate at an unsafe state, which we call a fake-end.

Definition 4.4 (Fake-end) Given a fragmentation graph $G(\mathcal{U}, \mathcal{E})$ and a search strategy $\rho$, a solution path $p_x = \{x_{p_0}, x_{p_1}, \cdots, x_{p_r}\}$ enters a fake-end $x_{p_r}$ if $p_x$ terminates at $x_{p_r}$ and $x_{p_r} \notin \mathcal{O}$.

For example, in Figure 2, the solution path $\{x_{p_0}, x_{p_1}, x_{q_2}, x_{q_3}\}$ terminates at a fake-end $x_{q_3}$. Fortunately, fake-ends can be easily resolved by shifting the dynamic scoring metric.

Definition 4.5 (Shift) Given a dynamic scoring metric $\sigma$, $\sigma_i \leftarrow \sigma_{i+1}$, $\forall i > K$, $K \in \mathbb{N}$, is a shift of $\sigma$.

The shift essentially moves the scoring metric to the next step and allows the search to continue beyond a fake-end.


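Algorithm 2: GSM$^\star$($A$, $\mathcal{C}$, $\mathcal{Q}$, $\mathcal{F}_0$)

i ← 1; $\mathcal{F}$ ← $\mathcal{F}_0$;    // initialization
while true do
    Min ← $\mathcal{F}$;
    forall the ($\mathcal{F}$, $\mathcal{F}^\star$) in $\mathcal{E}$ do
        if $\sigma^2_i(\mathcal{F}^\star) < \sigma^2_i(Min)$ then
            Min ← $\mathcal{F}^\star$;
    if Min ≠ $\mathcal{F}$ then
        $\mathcal{F}$ ← Min; i ← i + 1;    // step on
        continue;
    if SatCon(Min) = true then
        return Min;
    i ← i + 1;    // shift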

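[Figure 3 diagram omitted; recovered labels: Original search space $\mathcal{U}_k$; expansion; $\mathcal{U}_{k+1}$; states $x_{p_0}$, $x_{p_1}$, $x_{p_2}$, $x_{p_3}$, $x_{p_4}$]

Figure 3: An example of graph expansion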

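Below we show that there exists a $K \in \mathbb{N}$ such that we can escape fake-ends by executing at most $K$ shifts on $\sigma^2$. Suppose a solution path enters a fake-end $x$ with $x = \mathcal{F} = \{F_1, \cdots, F_k\}$, where $k$ is the number of fragments. First we claim $k < n$. Otherwise, if $k = n$, the fragmentation can only be $\mathcal{F}^\top = \{\{a_1\}, \{a_2\}, \cdots, \{a_n\}\}$, and it has been shown that $\mathcal{F}^\top$ is safe. Thus $\mathcal{F} \neq \mathcal{F}^\top$, and accordingly $k < n$. Suppose that $\mathcal{F}$ violates the confidentiality constraint $C = \{a_1, \cdots, a_l\}$, i.e. there exists a fragment $F_j$ with $C \subset F_j$. Let $\mathcal{F}^\star = Jump(\mathcal{F}, a_i, k+1)$, $1 \leq i \leq l$; then $\mathcal{F}^\star$ is a neighbor of $\mathcal{F}$ and $C$ is solved in $\mathcal{F}^\star$. Thus $Penalty(\mathcal{F}) - Penalty(\mathcal{F}^\star) \geq 1$. Let $K > \lceil \frac{\omega - \beta}{\gamma} \rceil$ and $c_1 = Cost(\mathcal{F})$, $c_2 = Cost(\mathcal{F}^\star)$, $p_1 = Penalty(\mathcal{F})$, $p_2 = Penalty(\mathcal{F}^\star)$. Then

$$\sigma^2_K(\mathcal{F}) - \sigma^2_K(\mathcal{F}^\star) = \frac{c_1 - c_2}{\mu - \alpha} + K \cdot \gamma \cdot \frac{p_1 - p_2}{\omega - \beta}.$$

Since $\left|\frac{c_1 - c_2}{\mu - \alpha}\right| \leq 1$, $p_1 - p_2 \geq 1$, and $K > \lceil \frac{\omega - \beta}{\gamma} \rceil$, we have $\sigma^2_K(\mathcal{F}) - \sigma^2_K(\mathcal{F}^\star) > 0$, i.e. $\mathcal{F}^\star \succ^2_K \mathcal{F}$. Thus we can solve fake-ends by executing at most $K$ shifts on $\sigma^2$.

Algorithm 2 shows the pseudocode for our graph search method with the dynamic search strategy. Similar to function GSM, GSM$^\star$ visits all neighbors of the current state and finds the fragmentation which has the lowest scoring metric and dominates $\mathcal{F}$. If GSM$^\star$ enters a fake-end, it executes a shift on the scoring metric and moves on to the next iteration.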

4.4 Graph Expansion

The above methods can result in high computation cost due to the large graph size. In this subsection, we propose a level-wise graph expansion technique which dramatically reduces the execution time of our graph search method. The basic idea is to start by considering only fragmentation solutions whose number of fragments does not exceed a given maximum, and then expand to other solutions with more fragments as needed.

Consider the constrained optimization problem

$$\arg\min_{\mathcal{F} \in U_k} \{ Cost(\mathcal{F}) : \mathcal{F} \in O_k \}$$

where $O_k$ ($U_k$) is the subset of $O$ ($U$) containing the fragmentations with at most $k$ fragments, and $E_k$ is the edge set corresponding to $U_k$.

Algorithm 3: EGSM*(A, C, Q, F0)

i ← 1; k ← 2; F ← F0;                    // initialization
while true do
    Min ← F;
    forall (F, F*) in E_k do
        if σ^2_i(F*) < σ^2_i(Min) then
            Min ← F*;
    if Min ≠ F then
        F ← Min; i ← i+1;                // step on
        continue;
    if SatCon(Min) = true then
        return Min;
    i ← i+1;                             // shift
    if ReqExp(F) = true then
        k ← k+1;                         // expand

Our graph expansion technique starts by searching the subspace $U_k$ instead of the universal search space $U$, and gradually expands the subspace if needed. Since the number of neighbors of each state in $U_k$ is much smaller, the executing time of the algorithm is dramatically reduced. However, some fake-ends cannot be solved by shifts within the subspace, since the solution path cannot reach a state outside the subspace by jumps. To solve this issue, we explicitly expand the search space by increasing $k$.
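As a rough illustration of how much smaller the subspace is, the sketch below counts the fragmentations in $U_k$ under the assumption that a fragmentation is a partition of the $n$ attributes into non-empty, non-overlapping fragments. Under that assumption the count is a sum of Stirling numbers of the second kind. The helper names `stirling2` and `subspace_size` are hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n items into exactly k non-empty blocks."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    # The n-th item either joins one of k existing blocks or forms its own block.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def subspace_size(n, k):
    """Number of partitions of n attributes into at most k fragments."""
    return sum(stirling2(n, j) for j in range(1, min(k, n) + 1))

n = 10  # default number of attributes in the experiments
for k in (2, 3, n):
    print(k, subspace_size(n, k))
```

With $n = 10$, the initial subspace $U_2$ contains 512 fragmentations, whereas the full space (taking $k = n$) contains 115975.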

Figure 3 shows the scheme of our graph expansion technique. We start with the original search space $U_k$ (right), which contains the fragmentations with at most $k$ fragments. When we detect that the solution path cannot terminate within the original search space, we expand the search space to $U_{k+1}$ (left), where fragmentations can contain at most $k+1$ fragments. It is important to note that the initial state $x_0$ in $U_k$ is not guaranteed to be safe; thus the graph expansion technique cannot be used with static search strategies.

Algorithm 3 shows the pseudocode for our graph search method with graph expansion. This algorithm is quite similar to Algorithm 2 except that it conducts the search in $U_k$. For simplicity, we initially set $k = 2$. The ReqExp function decides whether an expansion is required by checking the penalty of the current state $\mathcal{F}$: it returns true if $\mathcal{F}$ has lower penalty than all its neighbors, and false otherwise.
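The control flow of Algorithm 3 can be sketched in Python as follows. This is a toy illustration under several assumptions that are not taken from the paper: `toy_cost` is a hypothetical stand-in for the query cost, the penalty is taken to be the number of violated constraints, the score is assumed to be normalized cost plus `i * gamma` times normalized penalty, the expansion test uses a non-strict comparison so that ties cannot stall the search, and a `max_iters` guard is added.

```python
def penalty(frag, constraints):
    """Number of constraints that lie entirely inside one fragment."""
    return sum(any(c <= f for f in frag) for c in constraints)

def toy_cost(frag, queries):
    """Hypothetical stand-in for Cost: fragments touched, summed over queries."""
    return sum(sum(1 for f in frag if f & q) for q in queries)

def neighbors(frag, k):
    """All fragmentations with at most k fragments reachable by one jump."""
    result = set()
    for src in frag:
        others = [f for f in frag if f != src]
        for a in src:
            rest = [src - {a}] if len(src) > 1 else []
            for dst in others:  # jump into an existing fragment
                kept = [f for f in others if f != dst]
                result.add(frozenset(kept + rest + [dst | {a}]))
            if rest and len(frag) < k:  # jump into a new fragment
                result.add(frozenset(others + rest + [frozenset({a})]))
    return result

def egsm(attrs, constraints, queries, start, gamma=0.1, max_iters=10_000):
    max_cost = len(queries) * len(attrs) or 1  # assumed normalization bounds
    max_pen = len(constraints) or 1

    def score(frag, i):
        return (toy_cost(frag, queries) / max_cost
                + i * gamma * penalty(frag, constraints) / max_pen)

    i, k, current = 1, 2, start
    for _ in range(max_iters):
        nbrs = neighbors(current, k)
        best = min(nbrs | {current}, key=lambda f: score(f, i))
        if score(best, i) < score(current, i):
            current, i = best, i + 1  # step on
            continue
        here = penalty(current, constraints)
        if here == 0:
            return current  # SatCon holds at a local minimum
        i += 1  # shift
        if all(penalty(f, constraints) >= here for f in nbrs):
            k += 1  # expand (assumed non-strict ReqExp test)
    raise RuntimeError("no safe fragmentation found within max_iters")

attrs = frozenset(range(4))
constraints = [frozenset({0, 1}), frozenset({2, 3})]
queries = [frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})]
result = egsm(attrs, constraints, queries, start=frozenset({attrs}))
print(sorted(sorted(f) for f in result))
```

The example starts from the single-fragment state, which is unsafe; the search first shifts until the penalty term outweighs the cost increase of splitting, then steps to a fragmentation in which no constraint is contained in a single fragment.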

5. EXPERIMENTAL STUDY

In this section, we evaluate our graph search method in comparison with other state-of-the-art fragmentation algorithms, including heuristic search [6] and hierarchical clustering [7]. We also compare some of the results with the optimal solution obtained from exhaustive search.

Below we list the algorithms in our comparison, which include the different versions of our proposed algorithm as well as the two state-of-the-art algorithms and the optimal solution:
1) GSM - graph search method with jump operation and static search strategy ρ1 (initial state $\mathcal{F}^\top$);
2) GSM* - graph search method with jump operation and dynamic search strategy ρ2 (initial state randomly generated in $U$);
3) EGSM* - GSM* with graph expansion technique (initial state randomly generated in $U_2$);
4) Hierarchical clustering (HC) [7] - a nearest join algorithm based on attribute affinity, which can be derived from the query set [18];
5) Heuristic search (HS) (d=x, ps=y) [6] - a search algorithm on the fragmentation tree with the complete visited subtree of depth x and y best solutions selected in each iteration. Generally, greater values of x and y lead to better solutions but longer executing time. We set x = 3 and y = 5 in our experiments;
6) Optimal solution - exhaustive search on the fragmentation tree [6].

Table 3: Default settings

Parameter   Description              Default value
n           Number of attributes     10
m           Number of constraints    10
L           Length of constraints    3
γ           Relaxation coefficient   0.1

Figure 4: Impact of relaxation coefficient. Panels: (a) query cost and (b) executing time (ms) of GSM* as γ varies from 0 to 1.

Figure 5: Compare GSMs. Panels: (a) query cost and (b) executing time (ms) of GSM, GSM*, and EGSM* for 10 to 50 attributes.

We consider query cost and executing time as the criteria to evaluate the performance of the algorithms. Our experiments are implemented in Java and executed on 64-bit Red Hat Linux with an 8-core Intel(R) Core(TM) i7-2600 CPU at 3.40 GHz. All experimental results are the average values of 30 independent trials. We conduct experiments on both synthetic and benchmark data sets.

5.1 Synthetic Data

We randomly generated 100 queries over synthetic data sets with 30K records. Without loss of generality, the size of each attribute is set to 1. The confidentiality constraints are randomly generated in this experiment. Table 3 shows the default settings of the parameters in our experiments. The following shows the experimental results with varying values of these parameters.

We first study the impact of the relaxation coefficient γ in the dynamic search strategy ρ2. Figure 4(a) shows the relation between query cost and different values of γ. Figure 4(b) illustrates the convergence rates by recording the executing time. From the figures, we find that a γ value roughly in the range [0.05, 0.15] gives the best balance between effectiveness and efficiency. Thus γ = 0.1 is chosen as the default value for the remaining experiments.

Secondly, we compare our GSM algorithms, including GSM, GSM*, and EGSM*. The results are shown in Figure 5. We can see that GSM* achieves lower query cost than GSM in similar time, and that the graph expansion technique not only reduces the query cost considerably but also dramatically saves computing time. In the remaining experiments, we compare EGSM* with other state-of-the-art methods.

Figure 6: Impact of number of attributes. Panels: (a) query cost and (b) executing time (ms) for 10 to 19 attributes. Legend: HC, HS (d=3, ps=5), EGSM*, Optimal.

Figure 7: Impact of number of constraints. Panels: (a) query cost and (b) executing time (ms) for 1 to 29 constraints.

In the third experiment, we study the impact of the number of attributes. Figures 6(a) and 6(b) show the experimental results. Figure 6(a) shows that our EGSM* achieves lower query cost than HC and HS. Figure 6(b) shows that HC is the most time efficient, followed by EGSM*, HS, and exhaustive search. We also find that EGSM* obtains near optimal results in terms of query cost.

Next we study the impact of the number of confidentiality constraints. Figures 7(a) and 7(b) show the experimental results. Figure 7(a) shows that increasing the number of confidentiality constraints raises the query cost, because a large number of confidentiality constraints will likely result in highly fragmented data. Figure 7(b) compares the corresponding executing time. It can be seen from the figures that EGSM* performs better than both HC and HS as the number of confidentiality constraints varies.

Finally, we conduct experiments with different average lengths of the confidentiality constraints. The experimental results are shown in Figures 8(a) and 8(b). Figure 8(a) shows the impact of the average length of the confidentiality constraints on query cost. Generally speaking, longer constraints result in lower query cost since the data is less fragmented. Experimental results regarding executing time can be found in Figure 8(b). Again, our EGSM* outperforms the other methods.

5.2 Benchmark Data

In this subsection, we conduct experiments with the Adult data set from the UCI benchmark¹. We choose 10 privacy-concerning attributes (including age, race, sex, etc.) and prune the records with missing values.

¹ http://archive.ics.uci.edu/ml/datasets.html

Figure 8: Impact of average length of constraints. Panels: (a) query cost and (b) executing time (ms) for average constraint lengths from 2 to 5.

Figure 9: Impact of uniqueness threshold. Panels: (a) query cost and (b) executing time (ms) for θ from 1% to 10%.

The confidentiality constraints are usually specified by the data custodians in practice. However, for the purpose of this experiment, we generated the confidentiality constraints for the Adult data set by considering the unique associations between the attributes. Given a relation $R$ and its attribute set $A$, the uniqueness of an association of attributes $S = \{a_1, a_2, \ldots, a_r\} \subset A$, denoted as $U(S)$, is the number of tuples in $R$ which have distinct values on $S$. If $U(S)/|R|$ is greater than or equal to a threshold θ, we consider $S$ as a confidentiality constraint. It is important to note that we do not need to search all attribute associations in the relation to find the confidentiality constraints. We can exploit the Apriori property [1] of the confidentiality constraints, i.e. the supersets of confidentiality constraints are also confidentiality constraints. Thus, efficient pruning approaches can be applied to accelerate the generation of confidentiality constraints.
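The constraint generation described above can be sketched as a level-wise search. The sketch assumes one reading of $U(S)$, namely the number of distinct value combinations on $S$, and it keeps only minimal constraints: once $S$ qualifies, its supersets are skipped, because any fragmentation that separates the attributes of $S$ also separates every superset of $S$. The function names, the toy relation, and the threshold value are illustrative assumptions, not the paper's implementation.

```python
from itertools import combinations

def uniqueness(rows, attrs):
    """U(S) under the assumed reading: distinct value combinations on attrs."""
    return len({tuple(row[a] for a in attrs) for row in rows})

def generate_constraints(rows, attributes, theta):
    """Level-wise search for minimal S with U(S) / |R| >= theta."""
    found = []
    for size in range(1, len(attributes) + 1):
        for cand in combinations(sorted(attributes), size):
            s = frozenset(cand)
            if any(c <= s for c in found):
                continue  # superset of a known constraint: pruned
            if uniqueness(rows, cand) / len(rows) >= theta:
                found.append(s)
    return found

# Toy relation (hypothetical data, not the Adult data set).
rows = [
    {"age": 30, "sex": "F", "race": "A"},
    {"age": 30, "sex": "M", "race": "A"},
    {"age": 40, "sex": "F", "race": "B"},
    {"age": 40, "sex": "M", "race": "B"},
]
found = generate_constraints(rows, ["age", "sex", "race"], theta=0.75)
print([sorted(s) for s in found])
```

In this toy relation no single attribute reaches the threshold, the pairs {age, sex} and {race, sex} do, and the full attribute set is never evaluated because it is a superset of a constraint that was already found.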

Figures 9(a) and 9(b) show the experimental results on the Adult data set from the UCI benchmark when varying the threshold θ from 1% to 10%. Generally speaking, a lower threshold results in higher query cost. That is because a lower threshold indicates stricter privacy requirements, which in turn reduces utility. Again, our EGSM* significantly outperforms both HC and HS.

6. CONCLUSIONS AND FUTURE WORK

Based on the theory of local search and guided local search, we proposed the graph search method for the fragmentation problem with confidentiality constraints. By modeling the optimal fragmentation problem as a path finding problem in a graph, we avoid the issue of dead-ends in traditional constrained clustering algorithms. We also proposed the fragmentation graph expansion technique, which dramatically reduces the executing time of our method.

In our future work, we will consider fragmentations with overlaps and soft confidentiality constraints. Moreover, we will explore other techniques, such as simulated annealing [15], to search for optimal solutions on the fragmentation graph. Finally, we will also apply the graph search method to solve other constrained clustering problems.

7. REFERENCES

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487–499, 1994.
[2] J. Biskup and M. Preuß. Database fragmentation with encryption: Under which semantic constraints and a priori knowledge can two keep a secret? In DBSec, pages 17–32, 2013.
[3] J. Biskup, M. Preuß, and L. Wiese. On the inference-proofness of database fragmentation satisfying confidentiality constraints. In ISC, pages 246–261, 2011.
[4] F. Bonchi, B. Malin, and Y. Saygin. Recent advances in preserving privacy when mining data. Data Knowl. Eng., 65(1):1–4, 2008.
[5] D. Boneh, G. Crescenzo, R. Ostrovsky, and G. Persiano. Public key encryption with keyword search. In EUROCRYPT, pages 506–522, 2004.
[6] V. Ciriani, S. D. C. di Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, and P. Samarati. Fragmentation design for efficient query execution over sensitive distributed databases. In ICDCS, pages 32–39, 2009.
[7] V. Ciriani, S. D. C. di Vimercati, S. Foresti, S. Jajodia, S. Paraboschi, and P. Samarati. Combining fragmentation and encryption to protect privacy in data storage. ACM Trans. Inf. Syst. Secur., 13(3), 2010.
[8] Y. Crama, A. W. J. Kolen, and E. Pesch. Local search in combinatorial optimization. In Artificial Neural Networks: An Introduction to ANN Theory and Practice, pages 157–174, 1995.
[9] C. Dwork. Differential privacy: A survey of results. In TAMC, volume 4978 of Lecture Notes in Computer Science, pages 1–19. Springer, 2008.
[10] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv., 42(4), 2010.
[11] C. Gentry. Fully homomorphic encryption using ideal lattices. In STOC, pages 169–178, 2009.
[12] C. Gentry. Computing arbitrary functions of encrypted data. Commun. ACM, 53(3):97–105, 2010.
[13] H. Hacigumus, B. Iyer, C. Li, and S. Mehrotra. Executing SQL over encrypted data in the database-service-provider model. In SIGMOD, pages 216–227, 2002.
[14] M. Hay, K. Liu, G. Miklau, J. Pei, and E. Terzi. Privacy-aware data management in information networks. In SIGMOD, pages 1201–1204, 2011.
[15] S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi, et al. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
[16] J. Liu, J. Luo, and J. Z. Huang. Rating: Privacy preservation for multiple attributes with different sensitivity requirements. In ICDMW, pages 666–673, 2011.
[17] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In ICDE, page 24, 2006.
[18] M. T. Ozsu and P. Valduriez. Principles of Distributed Database Systems, Third Edition. Springer, 2011.
[19] R. A. Popa, C. M. S. Redfield, N. Zeldovich, and H. Balakrishnan. CryptDB: protecting confidentiality with encrypted query processing. In SOSP, pages 85–100, 2011.
[20] R. A. Popa, C. M. S. Redfield, N. Zeldovich, and H. Balakrishnan. CryptDB: processing queries on an encrypted database. Commun. ACM, 55(9):103–111, 2012.
[21] C. M. Procopiuc and D. Srivastava. Efficient table anonymization for aggregate query answering. In ICDE, pages 1291–1294, 2009.
[22] T. Rekatsinas, A. Deshpande, and A. Machanavajjhala. A SPARSI: partitioning sensitive data amongst multiple adversaries. PVLDB, 6(13):1594–1605, 2013.
[23] P. Samarati. Protecting respondents' identities in microdata release. IEEE Trans. Knowl. Data Eng., 13(6):1010–1027, 2001.
[24] D. X. Song, D. Wagner, and A. Perrig. Practical techniques for searches on encrypted data. In IEEE Symposium on Security and Privacy, pages 44–55, 2000.
[25] M. van Dijk, C. Gentry, S. Halevi, and V. Vaikuntanathan. Fully homomorphic encryption over the integers. In EUROCRYPT, pages 24–43, 2010.
[26] C. Voudouris and E. P. K. Tsang. Guided local search and its application to the traveling salesman problem. European Journal of Operational Research, 113(2):469–499, 1999.
[27] K. Wagstaff and C. Cardie. Clustering with instance-level constraints. In AAAI, page 1097, 2000.
