k-nearest neighbor classification over semantically secure ... neighbor classificatio… ·...

14
k -Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data Bharath K. Samanthula, Member, IEEE, Yousef Elmehdwi, and Wei Jiang, Member, IEEE Abstract—Data Mining has wide applications in many areas such as banking, medicine, scientific research and among government agencies. Classification is one of the commonly used tasks in data mining applications. For the past decade, due to the rise of various privacy issues, many theoretical and practical solutions to the classification problem have been proposed under different security models. However, with the recent popularity of cloud computing, users now have the opportunity to outsource their data, in encrypted form, as well as the data mining tasks to the cloud. Since the data on the cloud is in encrypted form, existing privacy-preserving classification techniques are not applicable. In this paper, we focus on solving the classification problem over encrypted data. In particular, we propose a secure k-NN classifier over encrypted data in the cloud. The proposed protocol protects the confidentiality of data, privacy of user’s input query, and hides the data access patterns. To the best of our knowledge, our work is the first to develop a secure k-NN classifier over encrypted data under the semi-honest model. Also, we empirically analyze the efficiency of our proposed protocol using a real-world dataset under different parameter settings. Index Terms—Security, k-NN Classifier, Outsourced Databases, Encryption 1 I NTRODUCTION Recently, the cloud computing paradigm [1] is revo- lutionizing the organizations’ way of operating their data particularly in the way they store, access and process data. As an emerging computing paradigm, cloud computing attracts many organizations to con- sider seriously regarding cloud potential in terms of its cost-efficiency, flexibility, and offload of adminis- trative overhead. Most often, organizations delegate their computational operations in addition to their data to the cloud. Despite tremendous advantages that the cloud offers, privacy and security issues in the cloud are preventing companies to utilize those advantages. When data are highly sensitive, the data need to be encrypted before outsourcing to the cloud. However, when data are encrypted, irrespective of the underlying encryption scheme, performing any data mining tasks becomes very challenging without ever decrypting the data. There are other privacy concerns, demonstrated by the following example. Example 1: Suppose an insurance company out- sourced its encrypted customers database and rele- vant data mining tasks to a cloud. When an agent from the company wants to determine the risk level of a potential new customer, the agent can use a B. K. Samanthula is with the Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907. E-mail: [email protected]. Y. Elmehdwi and W. Jiang are with the Department of Computer Science, Missouri University of Science and Technology, 310 CS Building, 500 West 15th St., Rolla, MO 65409. Email: {ymez76, wjiang}@mst.edu. classification method to determine the risk level of the customer. First, the agent needs to generate a data record q for the customer containing certain personal information of the customer, e.g., credit score, age, marital status, etc. Then this record can be sent to the cloud, and the cloud will compute the class label for q. Nevertheless, since q contains sensitive infor- mation, to protect the customer’s privacy, q should be encrypted before sending it to the cloud. The above example shows that data mining over encrypted data (denoted by DMED) on a cloud also needs to protect a user’s record when the record is a part of a data mining process. Moreover, cloud can also derive useful and sensitive information about the actual data items by observing the data access patterns even if the data are encrypted [2], [3]. There- fore, the privacy/security requirements of the DMED problem on a cloud are threefold: (1) confidentiality of the encrypted data, (2) confidentiality of a user’s query record, and (3) hiding data access patterns. Existing work on Privacy-Preserving Data Mining (either perturbation or secure multi-party computa- tion based approach) cannot solve the DMED prob- lem. Perturbed data do not possess semantic security, so data perturbation techniques cannot be used to encrypt highly sensitive data. Also the perturbed data do not produce very accurate data mining results. Se- cure multi-party computation (SMC) based approach assumes data are distributed and not encrypted at each participating party. In addition, many interme- diate computations are performed based on non- encrypted data. As a result, in this paper, we proposed novel methods to effectively solve the DMED problem IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.

Upload: duongnhan

Post on 10-Jun-2018

238 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: k-Nearest Neighbor Classification over Semantically Secure ... Neighbor Classificatio… · k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

k -Nearest Neighbor Classification overSemantically Secure Encrypted Relational

DataBharath K. Samanthula, Member, IEEE, Yousef Elmehdwi, and Wei Jiang, Member, IEEE

Abstract —Data Mining has wide applications in many areas such as banking, medicine, scientific research and amonggovernment agencies. Classification is one of the commonly used tasks in data mining applications. For the past decade, dueto the rise of various privacy issues, many theoretical and practical solutions to the classification problem have been proposedunder different security models. However, with the recent popularity of cloud computing, users now have the opportunity tooutsource their data, in encrypted form, as well as the data mining tasks to the cloud. Since the data on the cloud is in encryptedform, existing privacy-preserving classification techniques are not applicable. In this paper, we focus on solving the classificationproblem over encrypted data. In particular, we propose a secure k-NN classifier over encrypted data in the cloud. The proposedprotocol protects the confidentiality of data, privacy of user’s input query, and hides the data access patterns. To the best of ourknowledge, our work is the first to develop a secure k-NN classifier over encrypted data under the semi-honest model. Also, weempirically analyze the efficiency of our proposed protocol using a real-world dataset under different parameter settings.

Index Terms —Security, k-NN Classifier, Outsourced Databases, Encryption

1 INTRODUCTION

Recently, the cloud computing paradigm [1] is revo-lutionizing the organizations’ way of operating theirdata particularly in the way they store, access andprocess data. As an emerging computing paradigm,cloud computing attracts many organizations to con-sider seriously regarding cloud potential in terms ofits cost-efficiency, flexibility, and offload of adminis-trative overhead. Most often, organizations delegatetheir computational operations in addition to theirdata to the cloud. Despite tremendous advantagesthat the cloud offers, privacy and security issues inthe cloud are preventing companies to utilize thoseadvantages. When data are highly sensitive, the dataneed to be encrypted before outsourcing to the cloud.However, when data are encrypted, irrespective of theunderlying encryption scheme, performing any datamining tasks becomes very challenging without everdecrypting the data. There are other privacy concerns,demonstrated by the following example.Example 1: Suppose an insurance company out-

sourced its encrypted customers database and rele-vant data mining tasks to a cloud. When an agentfrom the company wants to determine the risk levelof a potential new customer, the agent can use a

• B. K. Samanthula is with the Department of Computer Science, PurdueUniversity, 305 N. University Street, West Lafayette, IN 47907.E-mail: [email protected].

• Y. Elmehdwi and W. Jiang are with the Department of ComputerScience, Missouri University of Science and Technology, 310 CSBuilding, 500 West 15th St., Rolla, MO 65409.Email: {ymez76, wjiang}@mst.edu.

classification method to determine the risk level ofthe customer. First, the agent needs to generate a datarecord q for the customer containing certain personalinformation of the customer, e.g., credit score, age,marital status, etc. Then this record can be sent tothe cloud, and the cloud will compute the class labelfor q. Nevertheless, since q contains sensitive infor-mation, to protect the customer’s privacy, q should beencrypted before sending it to the cloud.The above example shows that data mining over

encrypted data (denoted by DMED) on a cloud alsoneeds to protect a user’s record when the record is apart of a data mining process. Moreover, cloud canalso derive useful and sensitive information aboutthe actual data items by observing the data accesspatterns even if the data are encrypted [2], [3]. There-fore, the privacy/security requirements of the DMEDproblem on a cloud are threefold: (1) confidentialityof the encrypted data, (2) confidentiality of a user’squery record, and (3) hiding data access patterns.Existing work on Privacy-Preserving Data Mining

(either perturbation or secure multi-party computa-tion based approach) cannot solve the DMED prob-lem. Perturbed data do not possess semantic security,so data perturbation techniques cannot be used toencrypt highly sensitive data. Also the perturbed datado not produce very accurate data mining results. Se-cure multi-party computation (SMC) based approachassumes data are distributed and not encrypted ateach participating party. In addition, many interme-diate computations are performed based on non-encrypted data. As a result, in this paper, we proposednovel methods to effectively solve the DMED problem

IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.

Page 2: k-Nearest Neighbor Classification over Semantically Secure ... Neighbor Classificatio… · k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

assuming that the encrypted data are outsourced toa cloud. Specifically, we focus on the classificationproblem since it is one of the most common datamining tasks. Because each classification techniquehas their own advantage, to be concrete, this paperconcentrates on executing the k-nearest neighbor clas-sification method over encrypted data in the cloudcomputing environment.

1.1 Problem Definition

Suppose Alice owns a database D of n recordst1, . . . , tn and m+ 1 attributes. Let ti,j denote the jth

attribute value of record ti. Initially, Alice encryptsher database attribute-wise, that is, she computesEpk(ti,j), for 1 ≤ i ≤ n and 1 ≤ j ≤ m + 1, wherecolumn (m + 1) contains the class labels. We assumethat the underlying encryption scheme is semanticallysecure [4]. Let the encrypted database be denoted byD′. We assume that Alice outsources D′ as well as thefuture classification process to the cloud.Let Bob be an authorized user who wants to classify

his input record q = 〈q1, . . . , qm〉 by applying the k-NN classification method based on D′. We refer tosuch a process as privacy-preserving k-NN (PPkNN)classification over encrypted data in the cloud. For-mally, we define the PPkNN protocol as:

PPkNN(D′, q)→ cq

where cq denotes the class label for q after applyingk-NN classification method on D′ and q.

1.2 Our Contributions

In this paper, we propose a novel PPkNN protocol,a secure k-NN classifier over semantically secure en-crypted data. In our protocol, once the encrypted dataare outsourced to the cloud, Alice does not participatein any computations. Therefore, no information isrevealed to Alice. In addition, our protocol meets thefollowing privacy requirements:

• Contents of D or any intermediate results shouldnot be revealed to the cloud.

• Bob’s query q should not be revealed to the cloud.• cq should be revealed only to Bob. Also, no other

information should be revealed to Bob.• Data access patterns, such as the records corre-

sponding to the k-nearest neighbors of q, shouldnot be revealed to Bob and the cloud (to preventany inference attacks).

We emphasize that the intermediate results seen bythe cloud in our protocol are either newly generatedrandomized encryptions or random numbers. Thus,which data records correspond to the k-nearest neigh-bors and the output class label are not known to thecloud. In addition, after sending his encrypted queryrecord to the cloud, Bob does not involve in anycomputations. Hence, data access patterns are furtherprotected from Bob (see Section 5 for more details).

The rest of the paper is organized as follows. Wediscuss the existing related work and some conceptsas a background in Section 2. A set of privacy-preserving protocols and their possible implementa-tions are provided in Section 3. The formal securityproofs for the mentioned privacy-preserving primi-tives are provided in Section 4. The proposed PPkNNprotocol is explained in detail in Section 5. Section 6discusses the performance of the proposed protocolunder different parameter settings. We conclude thepaper along with future work in Section 7.

2 RELATED WORK AND BACKGROUND

Due to space limitations, here we briefly review theexisting related work and provide some definitions asa background. Please refer to our technical report [5]for a more elaborated related work and background.At first, it seems fully homomorphic cryptosystems

(e.g., [6]) can solve the DMED problem since it allowsa third-party (that hosts the encrypted data) to exe-cute arbitrary functions over encrypted data withoutever decrypting them. However, we stress that suchtechniques are very expensive and their usage inpractical applications have yet to be explored. Forexample, it was shown in [7] that even for weaksecurity parameters one “bootstrapping” operation ofthe homomorphic operation would take at least 30seconds on a high performance machine.It is possible to use the existing secret sharing

techniques in SMC, such as Shamir’s scheme [8], todevelop a PPkNN protocol. However, our work isdifferent from the secret sharing based solution inthe following aspect. Solutions based on the secretsharing schemes require at least three parties whereasour work require only two parties. For example, theconstructions based on Sharemind [9], a well-knownSMC framework which is based on the secret sharingscheme, assumes that the number of participatingparties is three. Thus, our work is orthogonal toSharemind and other secret sharing based schemes.

2.1 Privacy-Preserving Data Mining (PPDM)

Agrawal and Srikant [10], Lindell and Pinkas [11]were the first to introduce the notion of privacy-preserving under data mining applications. The exist-ing PPDM techniques can broadly be classified intotwo categories: (i) data perturbation and (ii) data dis-tribution. Agrawal and Srikant [10] proposed the firstdata perturbation technique to build a decision-treeclassifier, and many other methods were proposedlater (e.g., [12]–[14]). However, as mentioned earlier inSection 1, data perturbation techniques cannot be ap-plicable for semantically secure encrypted data. Also,they do not produce accurate data mining results dueto the addition of statistical noises to the data. Onthe other hand, Lindell and Pinkas [11] proposed thefirst decision tree classifier under the two-party setting

IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.

Page 3: k-Nearest Neighbor Classification over Semantically Secure ... Neighbor Classificatio… · k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

assuming the data were distributed between them.Since then much work has been published using SMCtechniques (e.g., [15]–[17]). We claim that the PPkNNproblem cannot be solved using the data distributiontechniques since the data in our case is encrypted andnot distributed in plaintext among multiple parties.For the same reasons, we also do not consider securek-NN methods in which the data are distributedbetween two parties (e.g., [18]).

2.2 Query Processing over Encrypted Data

Various techniques related to query processing overencrypted data have been proposed, e.g., [19]–[21].However, we observe that PPkNN is a more complexproblem than the execution of simple kNN queriesover encrypted data [22], [23]. For one, the interme-diate k-nearest neighbors in the classification process,should not be disclosed to the cloud or any users. Weemphasize that the recent method in [23] reveals thek-nearest neighbors to the user. Secondly, even if weknow the k-nearest neighbors, it is still very difficultto find the majority class label among these neighborssince they are encrypted at the first place to preventthe cloud from learning sensitive information. Third,the existing work did not addressed the access patternissue which is a crucial privacy requirement from theuser’s perspective.In our most recent work [24], we proposed a novel

secure k-nearest neighbor query protocol over en-crypted data that protects data confidentiality, user’squery privacy, and hides data access patterns. How-ever, as mentioned above, PPkNN is a more complexproblem and it cannot be solved directly using theexisting secure k-nearest neighbor techniques overencrypted data. Therefore, in this paper, we extendour previous work in [24] and provide a new solutionto the PPkNN classifier problem over encrypted data.More specifically, this paper is different from our

preliminary work [24] in the following four aspects.First, in this paper, we introduced new security prim-itives, namely secure minimum (SMIN), secure min-imum out of n numbers (SMINn), secure frequency(SF), and proposed new solutions for them. Second,the work in [24] did not provide any formal securityanalysis of the underlying sub-protocols. On the otherhand, this paper provides formal security proofs ofthe underlying sub-protocols as well as the PPkNNprotocol under the semi-honest model. Additionally,we discuss various techniques through which theproposed PPkNN protocol can possibly be extendedto a protocol that is secure under the malicious setting.Third, our preliminary work in [24] addresses only se-cure kNN query which is similar to Stage 1 of PPkNN.However, Stage 2 in PPkNN is entirely new. Finally,our empirical analyses in Section VI are based on areal dataset whereas the results in [24] are based ona simulated dataset. Furthermore, new experimentalresults are included in this paper.

2.3 Threat Model

We adopt the security definitions in the literature ofsecure multi-party computation (SMC) [25], [26], andthere are three common adversarial models underSMC: semi-honest, covert and malicious. In this paper,to develop secure and efficient protocols, we assumethat parties are semi-honest. Briefly, the followingdefinition captures the properties of a secure protocolunder the semi-honest model [27], [28].Definition 1: Let ai be the input of party Pi, Πi(π)

be Pi’s execution image of the protocol π and bibe the output for party Pi computed from π. Then,π is secure if Πi(π) can be simulated from ai andbi such that distribution of the simulated image iscomputationally indistinguishable from Πi(π).

In the above definition, an execution image gener-ally includes the input, the output and the messagescommunicated during an execution of a protocol. Toprove a protocol is secure under semi-honest model,we generally need to show that the execution imageof a protocol does not leak any information regardingthe private inputs of participating parties [28].

2.4 Paillier Cryptosystem

The Paillier cryptosystem is an additive homomor-phic and probabilistic public-key encryption schemewhose security is based on the Decisional CompositeResiduosity Assumption [4]. Let Epk be the encryptionfunction with public key pk given by (N, g), whereN is a product of two large primes of similar bitlength and g is a generator in Z

N2 . Also, let Dsk bethe decryption function with secret key sk. For anygiven two plaintexts a, b ∈ ZN , the Paillier encryptionscheme exhibits the following properties:

1) Homomorphic Addition

Dsk(Epk(a+ b)) = Dsk(Epk(a) ∗Epk(b) mod N2);

2) Homomorphic Multiplication

Dsk(Epk(a ∗ b)) = Dsk(Epk(a)b mod N2);

3) Semantic Security - The encryption scheme issemantically secure [28], [29]. Briefly, given a setof ciphertexts, an adversary cannot deduce anyadditional information about the plaintext(s).

For succinctness, we drop the mod N2 term duringhomomorphic operations in the rest of this paper.

3 PRIVACY-PRESERVING PRIMITIVES

Here we present a set of generic sub-protocols thatwill be used in constructing our proposed k-NNprotocol in Section 5. All of the below protocolsare considered under two-party semi-honest setting.In particular, we assume the existence of two semi-honest parties P1 and P2 such that the Paillier’s secretkey sk is known only to P2 whereas pk is public.

• Secure Multiplication (SM): This protocol consid-ers P1 with input (Epk(a), Epk(b)) and outputs

IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.

Page 4: k-Nearest Neighbor Classification over Semantically Secure ... Neighbor Classificatio… · k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

Epk(a ∗ b) to P1, where a and b are not known toP1 and P2. During this process, no informationregarding a and b is revealed to P1 and P2.

• Secure Squared Euclidean Distance (SSED): Inthis protocol, P1 with input (Epk(X), Epk(Y ))and P2 with sk securely compute the encryptionof squared Euclidean distance between vectorsX and Y . Here X and Y are m dimensionalvectors where Epk(X) = 〈Epk(x1), . . . , Epk(xm)〉and Epk(Y ) = 〈Epk(y1), . . . , Epk(ym)〉. The outputEpk(|X − Y |

2) will be known only to P1.• Secure Bit-Decomposition (SBD): Here P1 with in-

put Epk(z) and P2 securely compute the encryp-tions of the individual bits of z, where 0 ≤ z < 2l.The output [z] = 〈Epk(z1), . . . , Epk(zl)〉 is knownonly to P1. Here z1 and zl are the most and leastsignificant bits of integer z, respectively.

• Secure Minimum (SMIN): In this protocol, P1

holds private input (u′, v′) and P2 holds sk, whereu′ = ([u], Epk(su)) and v′ = ([v], Epk(sv)). Heresu (resp., sv) denotes the secret associated with u(resp., v). The goal of SMIN is for P1 and P2 tojointly compute the encryptions of the individualbits of minimum number between u and v. Inaddition, they compute Epk(smin(u,v)). That is, theoutput is ([min(u, v)], Epk(smin(u,v))) which willbe known only to P1. During this protocol, noinformation regarding the contents of u, v, su, andsv is revealed to P1 and P2.

• Secure Minimum out of n Numbers (SMINn):In this protocol, we consider P1 with n en-crypted vectors ([d1], . . . , [dn]) along with theirrespective encrypted secrets and P2 with sk. Here[di] = 〈Epk(di,1), . . . , Epk(di,l)〉 where di,1 anddi,l are the most and least significant bits ofinteger di respectively, for 1 ≤ i ≤ n. The secretof di is given by sdi

. P1 and P2 jointly com-pute [min(d1, . . . , dn)]. In addition, they computeEpk(smin(d1,...,dn)). At the end of this protocol,the output ([min(d1, . . . , dn)], Epk(smin(d1,...,dn)))is known only to P1. During SMINn, no infor-mation regarding any of di’s and their secrets isrevealed to P1 and P2.

• Secure Bit-OR (SBOR): P1 with input(Epk(o1), Epk(o2)) and P2 securely computeEpk(o1 ∨ o2), where o1 and o2 are two bits. Theoutput Epk(o1 ∨ o2) is known only to P1.

• Secure Frequency (SF): Here P1 with private in-put (〈Epk(c1), . . . Epk(cw)〉, 〈Epk(c

1), . . . , Epk(c′

k)〉)and P2 securely compute the encryption of thefrequency of cj , denoted by f(cj), in the list〈c′1, . . . , c

k〉, for 1 ≤ j ≤ w. Here we ex-plicitly assume that cj ’s are unique and c′i ∈{c1, . . . , cw}, for 1 ≤ i ≤ k. The output〈Epk(f(c1)), . . . , Epk(f(cw))〉 will be known onlyto P1. During the SF protocol, no informationregarding c′i, cj , and f(cj) is revealed to P1 andP2, for 1 ≤ i ≤ k and 1 ≤ j ≤ w.

Now we either propose a new solution or refer tothe most efficient known implementation to each ofthe above protocols. First of all, efficient solutions toSM, SSED, SBD and SBOR were discussed in [24].Therefore, in this paper, we discuss SMIN, SMINn,and SF problems in detail and propose new solutionsto each one of them.

Secure Minimum (SMIN): In this protocol, we as-sume that P1 holds private input (u′, v′) and P2

holds sk, where u′ = ([u], Epk(su)) and v′ =([v], Epk(sv)). Here su and sv denote the secretscorresponding to u and v, respectively. The maingoal of SMIN is to securely compute the encryp-tions of the individual bits of min(u, v), denotedby [min(u, v)]. Here [u] = 〈Epk(u1), . . . , Epk(ul)〉 and[v] = 〈Epk(v1), . . . , Epk(vl)〉, where u1 (resp., v1) andul (resp., vl) are the most and least significant bitsof u (resp., v), respectively. In addition, they computeEpk(smin(u,v)), the encryption of the secret correspond-ing to the minimum value between u and v. At theend of SMIN, the output ([min(u, v)], Epk(smin(u,v))) isknown only to P1.We assume that 0 ≤ u, v < 2l and propose a novel

SMIN protocol. Our solution to SMIN is mainly mo-tivated from the work of [24]. Precisely, the basic ideaof the proposed SMIN protocol is for P1 to randomlychoose the functionality F (by flipping a coin), whereF is either u > v or v > u, and to obliviously executeF with P2. Since F is randomly chosen and knownonly to P1, the result of the functionality F is obliviousto P2. Based on the comparison result and chosenF , P1 computes [min(u, v)] and Epk(smin(u,v)) locallyusing homomorphic properties.The overall steps involved in the SMIN protocol

are shown in Algorithm 1. To start with, P1 initiallychooses the functionality F as either u > v or v > u

randomly. Then, using the SM protocol, P1 computesEpk(ui∗vi)with the help of P2, for 1 ≤ i ≤ l. After this,the protocol has the following key steps, performedby P1 locally, for 1 ≤ i ≤ l:

• Compute the encrypted bit-wise XOR betweenthe bits ui and vi using the following formula-tion1 Ti = Epk(ui) ∗ Epk(vi) ∗ Epk(ui ∗ vi)

N−2

• Compute an encrypted vector H by preservingthe first occurrence of Epk(1) (if there exists one)in T by initializing H0 = Epk(0). The rest of theentries of H are computed as Hi = Hri

i−1 ∗ Ti. Weemphasize that at most one of the entry in H isEpk(1) and the remaining entries are encryptionsof either 0 or a random number.

• Then, P1 computes Φi = Epk(−1) ∗Hi. Note that“−1” is equivalent to “N − 1” under ZN . Fromthe above discussions, it is clear that Φi = Epk(0)at most once since Hi is equal to Epk(1) at most

1. In general, for any two given bits o1 and o2, the propertyo1 ⊕ o2 = o1 + o2 − 2(o1 ∗ o2) always holds.

IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.

Page 5: k-Nearest Neighbor Classification over Semantically Secure ... Neighbor Classificatio… · k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

Algorithm 1 SMIN(u′, v′)→ [min(u, v)], Epk(smin(u,v))

Require: P1 has u′ = ([u], Epk(su)) and v′ =([v], Epk(sv)), where 0 ≤ u, v < 2l; P2 has sk

1: P1:

(a). Randomly choose the functionality F(b). for i = 1 to l do:

• Epk(ui ∗ vi)← SM(Epk(ui), Epk(vi))• Ti ← Epk(ui ⊕ vi)• Hi ← Hri

i−1 ∗Ti; ri ∈R ZN and H0 = Epk(0)• Φi ← Epk(−1) ∗Hi

• if F : u > v then:

– Wi ← Epk(ui) ∗ Epk(ui ∗ vi)N−1

– Γi ← Epk(vi − ui) ∗ Epk(ri); ri ∈R ZN

else

– Wi ← Epk(vi) ∗ Epk(ui ∗ vi)N−1

– Γi ← Epk(ui − vi) ∗ Epk(ri); ri ∈R ZN

• Li ←Wi ∗ Φr′ii ; r′i ∈R ZN

(c). if F : u > v then: δ ← Epk(sv − su) ∗ Epk(r)else δ ← Epk(su−sv)∗Epk(r), where r ∈R ZN

(d). Γ′ ← π1(Γ) and L′ ← π2(L)(e). Send δ,Γ′ and L′ to P2

2: P2:

(a). Receive δ,Γ′ and L′ from P1

(b). Decryption: Mi ← Dsk(L′

i), for 1 ≤ i ≤ l(c). if ∃ j such that Mj = 1 then α← 1

else α← 0(d). if α = 0 then:

• M ′

i ← Epk(0), for 1 ≤ i ≤ l• δ′ ← Epk(0)

else

• M ′

i ← Γ′

i ∗ rN , where r ∈R ZN and is

different for 1 ≤ i ≤ l• δ′ ← δ ∗ rNδ , where rδ ∈R ZN

(e). Send M ′, Epk(α) and δ′ to P1

3: P1:

(a). Receive M ′, Epk(α) and δ′ from P2

(b). M ← π−11 (M ′) and θ ← δ′ ∗ Epk(α)

N−r

(c). λi ← Mi ∗ Epk(α)N−ri , for 1 ≤ i ≤ l

(d). if F : u > v then:

• Epk(smin(u,v))← Epk(su) ∗ θ• Epk(min(u, v)i)← Epk(ui)∗λi, for 1 ≤ i ≤ l

else

• Epk(smin(u,v))← Epk(sv) ∗ θ• Epk(min(u, v)i)← Epk(vi)∗λi, for 1 ≤ i ≤ l

once. Also, if Φj = Epk(0), then index j is theposition at which the bits of u and v differ first(starting from the most significant bit position).

Now, depending on F , P1 creates two encryptedvectors W and Γ as follows, for 1 ≤ i ≤ l:

• If F : u > v, compute

Wi = Epk(ui ∗ (1− vi))

Γi = Epk(vi − ui) ∗ Epk(ri) = Epk(vi − ui + ri)

• If F : v > u, compute

Wi = Epk(vi ∗ (1− ui))

Γi = Epk(ui − vi) ∗ Epk(ri) = Epk(ui − vi + ri)

where ri is a random number (hereafter denoted by∈R) in ZN . The observation is that if F : u > v, thenWi = Epk(1) iff ui > vi, and Wi = Epk(0) otherwise.Similarly, when F : v > u, we have Wi = Epk(1) iffvi > ui, and Wi = Epk(0) otherwise. Also, dependingof F , Γi stores the encryption of the randomizeddifference between ui and vi which will be used inlater computations.After this, P1 computes L by combining Φ and W .

More precisely, P1 computes Li = Wi ∗ Φr′ii , where

r′i is a random number in ZN . The observation hereis if ∃ an index j such that Φj = Epk(0), denotingthe first flip in the bits of u and v, then Wj storesthe corresponding desired information, i.e., whetheruj > vj or vj > uj in encrypted form. In addition,depending on F , P1 computes the encryption of ran-domized difference between su and sv and stores it inδ. Specifically, if F : u > v, then δ = Epk(sv − su + r).Otherwise, δ = Epk(su − sv + r), where r ∈R ZN .After this, P1 permutes the encrypted vectors Γ and

L using two random permutation functions π1 and π2.Specifically, P1 computes Γ′ = π1(Γ) and L′ = π2(L),and sends them along with δ to P2. Upon receiving,P2 decrypts L′ component-wise to get Mi = Dsk(L

i),for 1 ≤ i ≤ l, and checks for index j. That is, ifMj = 1, then P2 sets α to 1, otherwise sets it to 0.In addition, P2 computes a new encrypted vector M ′

depending on the value of α. Precisely, if α = 0, thenM ′

i = Epk(0), for 1 ≤ i ≤ l. Here Epk(0) is different foreach i. On the other hand, when α = 1, P2 sets M ′

i tothe re-randomized value of Γ′

i. That is, M′

i = Γ′

i ∗ rN ,

where the term rN comes from re-randomization andr ∈R ZN should be different for each i. Furthermore,P2 computes δ′ = Epk(0) if α = 0. However, whenα = 1, P2 sets δ′ to δ ∗ rNδ , where rδ is a randomnumber in ZN . Then, P2 sends M ′, Epk(α) and δ′ toP1. After receivingM ′, Epk(α) and δ

′, P1 computes the

inverse permutation of M ′ as M = π−11 (M ′). Then,

P1 performs the following homomorphic operationsto compute the encryption of ith bit of min(u, v), i.e.,Epk(min(u, v)i), for 1 ≤ i ≤ l:

• Remove the randomness from Mi by computing

λi = Mi ∗ Epk(α)N−ri

• If F : u > v, compute Epk(min(u, v)i) = Epk(ui) ∗λi = Epk(ui + α ∗ (vi − ui)). Otherwise, computeEpk(min(u, v)i) = Epk(vi) ∗λi = Epk(vi+α ∗ (ui−vi)).

Also, depending on F , P1 computes Epk(smin(u,v)) asfollows. If F : u > v, P1 computes Epk(smin(u,v)) =Epk(su) ∗ θ, where θ = δ′ ∗ Epk(α)

N−r. Otherwise,he/she computes Epk(smin(u,v)) = Epk(sv) ∗ θ.

In the SMIN protocol, one main observation(upon which we can also justify the correctness

IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.

Page 6: k-Nearest Neighbor Classification over Semantically Secure ... Neighbor Classificatio… · k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

TABLE 1P1 chooses F as v > u where u = 55 and v = 58

[u] [v] Wi Γi Gi Hi Φi Li Γi’ L′

i Mi λi mini

1 1 0 r 0 0 −1 r 1 + r r r 0 1

1 1 0 r 0 0 −1 r r r r 0 1

0 1 1 −1 + r 1 1 0 1 1 + r r r −1 0

1 0 0 1 + r 1 r r r −1 + r r r 1 1

1 1 0 r 0 r r r r 1 1 0 1

1 0 0 1 + r 1 r r r r r r 1 1

All column values are in encrypted form except Mi col-umn. Also, r ∈R ZN is different for each row and column.

of the final output) is that if F : u > v, thenmin(u, v)i = (1 − α) ∗ ui + α ∗ vi always holds, for1 ≤ i ≤ l. On the other hand, if F : v > u, thenmin(u, v)i = α ∗ ui + (1 − α) ∗ vi always holds.Similar conclusions can be drawn for smin(u,v). Weemphasize that using similar formulations one canalso design a SMAX protocol to compute [max(u, v)]and Epk(smax(u,v)). Also, we stress that there canbe multiple secrets of u and v that can be fed asinput (in encrypted form) to SMIN and SMAX. Forexample, let s1u and s2u (resp., s1v and s2v) be two secretsassociated with u (resp., v). Then the SMIN protocoltakes ([u], Epk(s

1u), Epk(s

2u)) and ([v], Epk(s

1v), Epk(s

2v))

as P1’s input and outputs [min(u, v)], Epk(s1min(u,v))

and Epk(s2min(u,v)) to P1.

Example 2: For simplicity, consider that u = 55,v = 58, and l = 6. Suppose su and sv be the secretsassociated with u and v, respectively. Assume thatP1 holds ([55], Epk(su)) ([58], Epk(sv)). In addition, weassume that P1’s random permutation functions areas given below. Without loss of generality, suppose

i = 1 2 3 4 5 6

↓ ↓ ↓ ↓ ↓ ↓

π1(i) = 6 5 4 3 2 1

π2(i) = 2 1 5 6 3 4

P1 chooses the functionality F : v > u. Then, variousintermediate results based on the SMIN protocol areas shown in Table 1. Following from Table 1, weobserve that:

• At most one of the entry in H is Epk(1), namelyH3, and the remaining entries are encryptions ofeither 0 or a random number in ZN .

• Index j = 3 is the first position at which thecorresponding bits of u and v differ.

• Φ3 = Epk(0) since H3 is equal to Epk(1). Also,since M5 = 1, P2 sets α to 1.

• Epk(smin(u,v)) = Epk(α∗su+(1−α)∗sv) = Epk(su).

At the end, only P1 knows [min(u, v)] = [u] = [55]and Epk(smin(u,v)) = Epk(su). �

Secure Minimum out of n Numbers (SMINn): Con-sider P1 with private input ([d1], . . . , [dn]) along withtheir encrypted secrets and P2 with sk, where 0 ≤ di <

Algorithm 2 SMINn(([d1], Epk(sd1)), .., ([dn], Epk(sdn

)))→ ([dmin], Epk(sdmin

))

Require: P1 has (([d1], Epk(sd1)), . . . , ([dn], Epk(sdn

)));P2 has sk

1: P1:

(a). [d′i]← [di] and s′i ← Epk(sdi), for 1 ≤ i ≤ n

(b). num← n

2: for i = 1 to ⌈log2 n⌉:

(a). for 1 ≤ j ≤⌊num2

⌋:

• if i = 1 then:

– ([d′2j−1], s′

2j−1) ← SMIN(x, y), wherex = ([d′2j−1], s

2j−1) and y = ([d′2j ], s′

2j)

– [d′2j ]← 0 and s′2j ← 0

else

– ([d′2i(j−1)+1], s′

2i(j−1)+1) ← SMIN(x, y),where x = ([d′2i(j−1)+1], s

2i(j−1)+1) andy = ([d′2ij−1], s

2ij−1)

– [d′2ij−1]← 0 and s′2ij−1 ← 0

(b). num←⌈num2

3: P1: [dmin]← [d′1] and Epk(sdmin)← s′1

2l and [di] = 〈Epk(di,1), . . . , Epk(di,l)〉, for 1 ≤ i ≤ n.Here the secret of di is denoted by Epk(sdi

), for1 ≤ i ≤ n. The main goal of the SMINn protocol is tocompute [min(d1, . . . , dn)] = [dmin] without revealingany information about di’s to P1 and P2. In addition,they compute the encryption of the secret correspond-ing to the global minimum, denoted by Epk(sdmin

).Here we construct a new SMINn protocol by utilizingSMIN as the building block. The proposed SMINn

protocol is an iterative approach and it computesthe desired output in an hierarchical fashion. In eachiteration, minimum between a pair of values andthe secret corresponding to the minimum value arecomputed (in encrypted form) and fed as input to thenext iteration, thus, generating a binary execution treein a bottom-up fashion. At the end, only P1 knows thefinal result [dmin] and Epk(sdmin

).

The overall steps involved in the proposed SMINn

protocol are highlighted in Algorithm 2. Initially, P1

assigns [di] and Epk(sdi) to a temporary vector [d′i]

and variable s′i, for 1 ≤ i ≤ n, respectively. Also,he/she creates a global variable num and initializes itto n, where num represents the number of (non-zero)vectors involved in each iteration. Since the SMINn

protocol executes in a binary tree hierarchy (bottom-up fashion), we have ⌈log2 n⌉ iterations, and in eachiteration, the number of vectors involved varies. Inthe first iteration (i.e., i = 1), P1 with private input(([d′2j−1], s

2j−1), ([d′

2j ], s′

2j)) and P2 with sk involve inthe SMIN protocol, for 1 ≤ j ≤

⌊num2

⌋. At the end of

the first iteration, only P1 knows [min(d′2j−1, d′

2j)] ands′min(d′

2j−1,d′

2j), and nothing is revealed to P2, for 1 ≤

j ≤⌊num2

⌋. Also, P1 stores the result [min(d′2j−1, d

2j)]and s′min(d′

2j−1,d′

2j) in [d′2j−1] and s

2j−1, respectively. In

IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.

Page 7: k-Nearest Neighbor Classification over Semantically Secure ... Neighbor Classificatio… · k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

[dmin]← [d′1]← [min(d′

1, d′

5)]

[d′1]← [min(d′

1, d′

3)]

[d′1]← [min(d′

1, d′

2)]

[d′1] [d′

2]

[d′3]← [min(d′

3, d′

4)]

[d′3] [d′

4]

[d′5]

[d′5]← [min(d′

5, d′

6)]

[d′5] [d′

6]

Fig. 1. Binary execution tree for n = 6 based on SMINn

addition, P1 updates the values of [d′2j ], s′

2j to 0 andnum to

⌈num2

⌉, respectively.

During the ith iteration, only the non-zero vectors(along with the corresponding encrypted secrets)are involved in SMIN, for 2 ≤ i ≤ ⌈log2 n⌉. Forexample, during the second iteration (i.e., i = 2), only([d′1], s

1), ([d′

3], s′

3), and so on are involved. Note thatin each iteration, the output is revealed only to P1

and num is updated to⌈num2

⌉. At the end of SMINn,

P1 assigns the final encrypted binary vector of globalminimum value, i.e., [min(d1, . . . , dn)] which is storedin [d′1], to [dmin]. Also, P1 assigns s′1 to Epk(sdmin

).

Example 3: Suppose P1 holds 〈[d1], . . . , [d6]〉 (i.e.,n = 6). For simplicity, here we are assuming thatthere are no secrets associated with di’s. Then, basedon the SMINn protocol, the binary execution tree (ina bottom-up fashion) to compute [min(d1, . . . , d6)] isshown in Figure 1. Note that, initially [d′i] = [di]. �

Secure Frequency (SF): Let us considera situation where P1 holds private input(〈Epk(c1), . . . , Epk(cw)〉, 〈Epk(c

1), . . . , Epk(c′

k)〉) and P2

holds the secret key sk. The goal of the SF protocol isto securely compute Epk(f(cj)), for 1 ≤ j ≤ w. Heref(cj) denotes the number of times element cj occurs(i.e., frequency) in the list 〈c′1, . . . , c

k〉. We explicitlyassume that c′i ∈ {c1, . . . , cw}, for 1 ≤ i ≤ k.

The output 〈Epk(f(c1)), . . . , Epk(f(cw))〉 is revealedonly to P1. During the SF protocol, neither c′i nor cjis revealed to P1 and P2. Also, f(cj) is kept privatefrom both P1 and P2, for 1 ≤ i ≤ k and 1 ≤ j ≤ w.

The overall steps involved in the proposed SF pro-tocol are shown in Algorithm 3. To start with, P1

initially computes an encrypted vector Si such thatSi,j = Epk(cj−c

i), for 1 ≤ j ≤ w. Then, P1 randomizesSi component-wise to get S′

i,j = Epk(ri,j ∗ (cj − c′

i)),where ri,j is a random number in ZN . After this, for1 ≤ i ≤ k, P1 randomly permutes S′

i component-wiseusing a random permutation function πi (known onlyto P1). The output Zi ← πi(S

i) is sent to P2. Uponreceiving, P2 decrypts Zi component-wise, computesa vector ui and proceeds as follows:

• If Dsk(Zi,j) = 0, then ui,j is set to 1. Otherwise,ui,j is set to 0.

• The observation is, since c′i ∈ {c1, . . . , cw}, thatexactly one of the entries in vector Zi is an

Algorithm 3 SF(Λ,Λ′)→ 〈Epk(f(c1)), . . . , Epk(f(cw))〉

Require: P1 has Λ = 〈Epk(c1), . . . , Epk(cw)〉, Λ′ =〈Epk(c

1), . . . , Epk(c′

k)〉 and 〈π1, . . . , πk〉; P2 has sk1: P1:

(a). for i = 1 to k do:

• Ti ← Epk(c′

i)N−1

• for j = 1 to w do:

– Si,j ← Epk(cj) ∗ Ti– S′

i,j ← Si,jri,j , where ri,j ∈R ZN

• Zi ← πi(S′

i)

(b). Send Z to P2

2: P2:

(a). Receive Z from P1

(b). for i = 1 to k do

• for j = 1 to w do:

– if Dsk(Zi,j) = 0 then ui,j ← 1else ui,j ← 0

– Ui,j ← Epk(ui,j)

(c). Send U to P1

3: P1:

(a). Receive U from P2

(b). Vi ← π−1i (Ui), for 1 ≤ i ≤ k

(c). Epk(f(cj))←∏k

i=1 Vi,j , for 1 ≤ j ≤ w

encryption of 0 and the rest are encryptions ofrandom numbers. This further implies that ex-actly one of the decrypted values of Zi is 0 andthe rest are random numbers. Precisely, if ui,j = 1,then c′i = cπ−1(j).

• Compute Ui,j = Epk(ui,j) and send it to P1, for1 ≤ i ≤ k and 1 ≤ j ≤ w.

Then, P1 performs row-wise inverse permutation onit to get Vi = π−1

i (Ui), for 1 ≤ i ≤ k. Finally, P1

computes Epk(cj) =∏k

i=1 Vi,j locally, for 1 ≤ j ≤ w.

4 SECURITY ANALYSIS OF PRIVACY-PRESERVING PRIMITIVES UNDER THE SEMI-HONEST MODEL

First of all, we emphasize that the outputs in theabove mentioned protocols are always in encryptedformat, and are known only to P1. Also, all theintermediate results revealed to P2 are either randomor pseudo-random.Since the proposed SMIN protocol (which is used

as a sub-routine in SMINn) is more complex thanother protocols mentioned above and due to spacelimitations, we are motivated to provide its securityproof rather than providing proofs for each protocol.Therefore, here we only include a formal securityproof for the SMIN protocol based on the standardsimulation argument [28]. Nevertheless, we stress thatsimilar proof strategies can be used to show that otherprotocols are secure under the semi-honest model. Forcompleteness, we provided the security proofs for the

IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.

Page 8: k-Nearest Neighbor Classification over Semantically Secure ... Neighbor Classificatio… · k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

4.1 Proof of Security for SMIN

As mentioned in Section 2.3, to formally prove thatSMIN is secure [28] under the semi-honest model,we need to show that the simulated image of SMINis computationally indistinguishable from the actualexecution image of SMIN.An execution image generally includes the mes-

sages exchanged and the information computed fromthese messages. Therefore, according to Algorithm1, let the execution image of P2 be denoted byΠP2

(SMIN), given by

{〈δ, s+ r mod N〉, 〈Γ′

i, µi + ri mod N〉, 〈L′

i, α〉}

Observe that s + r mod N and µi + ri mod N arederived upon decrypting δ and Γ′

i, for 1 ≤ i ≤ l,respectively. Note that the modulo operator is implicitin the decryption function. Also, P2 receives L′ fromP1 and let α denote the (oblivious) comparison resultcomputed from L′. Without loss of generality, supposethe simulated image of P2 be ΠS

P2(SMIN), given by

{〈δ∗, r∗〉, 〈s′1,i, s′

2,i〉, 〈s′

3,i, α′〉 | for 1 ≤ i ≤ l}

Here δ∗, s′1,i and s′3,i are randomly generated fromZN2 whereas r∗ and s′2,i are randomly generated fromZN . In addition, α′ denotes a random bit. Since Epk isa semantically secure encryption scheme with result-ing ciphertext size less than N2, δ is computationallyindistinguishable from δ∗. Similarly, Γ′

i and L′

i arecomputationally indistinguishable from s′1,i and s′3,i,respectively. Also, as r and ri are randomly gener-ated from ZN , s + r mod N and µi + ri mod N arecomputationally indistinguishable from r∗ and s′2,i,respectively. Furthermore, because the functionality israndomly chosen by P1 (at step 1(a) of Algorithm 1),α is either 0 or 1 with equal probability. Thus, α iscomputationally indistinguishable from α′. Combin-ing all these results together, we can conclude thatΠP2

(SMIN) is computationally indistinguishable fromΠS

P2(SMIN) based on Definition 1. This implies that

during the execution of SMIN, P2 does not learn anyinformation regarding u, v, su, sv and the actual com-parison result. Intuitively speaking, the informationP2 has during an execution of SMIN is either randomor pseudo-random, so this information does not dis-close anything regarding u, v, su and sv. Additionally,as F is known only to P1, the actual comparison resultis oblivious to P2.On the other hand, the execution image of P1,

denoted by ΠP1(SMIN), is given by

ΠP1(SMIN) = {M ′

i , Epk(α), δ′ | for 1 ≤ i ≤ l}

M ′

i and δ′ are encrypted values, which are random in

ZN2 , received from P2 (at step 3(a) of Algorithm 1).Let the simulated image of P1 be ΠS

P1(SMIN), where

ΠSP1(SMIN) = {s′4,i, b

′, b′′ | for 1 ≤ i ≤ l}

The values s′4,i, b′ and b′′ are randomly generated from

ZN2 . Since Epk is a semantically secure encryptionscheme with resulting ciphertext size less than N2,it implies that M ′

i , Epk(α) and δ′ are computation-ally indistinguishable from s4,i, b

′ and b′′, respectively.Therefore, ΠP1

(SMIN) is computationally indistin-guishable from ΠS

P1(SMIN) based on Definition 1. As

a result, P1 cannot learn any information regardingu, v, su, sv and the comparison result during the exe-cution of SMIN. Putting everything together, we claimthat the proposed SMIN protocol is secure under thesemi-honest model (according to Definition 1).

5 THE PROPOSED PPkNN PROTOCOL

In this section, we propose a novel privacy-preservingk-NN classification protocol, denoted by PPkNN,which is constructed using the protocols discussed inSection 3 as building blocks. As mentioned earlier, weassume that Alice’s database consists of n records,denoted by D = 〈t1, . . . , tn〉, and m + 1 attributes,where ti,j denotes the jth attribute value of recordti. Initially, Alice encrypts her database attribute-wise,that is, she computes Epk(ti,j), for 1 ≤ i ≤ n and1 ≤ j ≤ m + 1, where column (m + 1) contains theclass labels. Let the encrypted database be denotedby D′. We assume that Alice outsources D′ as well asthe future classification process to the cloud. Withoutloss of generality, we assume that all attribute valuesand their Euclidean distances lie in [0, 2l). In addition,let w denote the number of unique class labels in D.In our problem setting, we assume the exis-

tence of two non-colluding semi-honest cloud serviceproviders, denoted by C1 and C2, which together forma federated cloud. Under this setting, Alice outsourcesher encrypted database D′ to C1 and the secret keysk to C2. Here it is possible for the data owner Aliceto replace C2 with her private server. However, ifAlice has a private server, we can argue that thereis no need for data outsourcing from Alice’s pointof view. The main purpose of using C2 can be mo-tivated by the following two reasons. (i) With limitedcomputing resource and technical expertise, it is inthe best interest of Alice to completely outsource itsdata management and operational tasks to a cloud.For example, Alice may want to access her data andanalytical results using a smart phone or any devicewith very limited computing capability. (ii) SupposeBob wants to keep his input query and access patternsprivate from Alice. In this case, if Alice uses a privateserver, then she has to perform computations assumedby C2 under which the very purpose of outsourcingthe encrypted data to C1 is negated.In general, whether Alice uses a private server or

cloud service provider C2 actually depends on herresources. In particular to our problem setting, weprefer to use C2 as this avoids the above mentioneddisadvantages (i.e., in case of Alice using a private

IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.

Page 9: k-Nearest Neighbor Classification over Semantically Secure ... Neighbor Classificatio… · k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

server) altogether. In our solution, after outsourcingencrypted data to the cloud, Alice does not participatein any future computations.The goal of the PPkNN protocol is to classify users’

query records using D′ in a privacy-preserving man-ner. Consider an authorized user Bob who wants toclassify his query record q = 〈q1, . . . , qm〉 based on D′

in C1. The proposed PPkNN protocol mainly consistsof the following two stages:

• Stage 1 - Secure Retrieval of k-Nearest Neighbors(SRkNN): In this stage, Bob initially sends hisquery q (in encrypted form) to C1. After this,C1 and C2 involve in a set of sub-protocols tosecurely retrieve (in encrypted form) the classlabels corresponding to the k-nearest neighborsof the input query q. At the end of this step,encrypted class labels of k-nearest neighbors areknown only to C1.

• Stage 2 - Secure Computation of Majority Class(SCMCk): Following from Stage 1, C1 and C2

jointly compute the class label with a majorityvoting among the k-nearest neighbors of q. At theend of this step, only Bob knows the class labelcorresponding to his input query record q.

The main steps involved in the proposed PPkNNprotocol are as shown in Algorithm 4. We now explaineach of the two stages in PPkNN in detail.

5.1 Stage 1 : Secure Retrieval of k-Nearest Neigh-bors (SR kNN)

During Stage 1, Bob initially encrypts his queryq attribute-wise, that is, he computes Epk(q) =〈Epk(q1), . . . , Epk(qm)〉 and sends it to C1. The mainsteps involved in Stage 1 are shown as steps 1 to3 in Algorithm 4. Upon receiving Epk(q), C1 withprivate input (Epk(q), Epk(ti)) and C2 with the secretkey sk jointly involve in the SSED protocol. HereEpk(ti) = 〈Epk(ti,1), . . . , Epk(ti,m)〉, for 1 ≤ i ≤ n.The output of this step, denoted by Epk(di), is theencryption of squared Euclidean distance between q

and ti, i.e., di = |q−ti|2. As mentioned earlier, Epk(di)

is known only to C1, for 1 ≤ i ≤ n. We empha-size that the computation of exact Euclidean distancebetween encrypted vectors is hard to achieve as itinvolves square root. However, in our problem, it issufficient to compare the squared Euclidean distancesas it preserves relative ordering. Then, C1 with inputEpk(di) and C2 securely compute the encryptionsof the individual bits of di using the SBD protocol.Note that the output [di] = 〈Epk(di,1), . . . , Epk(di,l)〉 isknown only to C1, where di,1 and di,l are the most andleast significant bits of di, for 1 ≤ i ≤ n, respectively.After this, C1 and C2 compute the encryptions of

class labels corresponding to the k-nearest neighborsof q in an iterative manner. More specifically, theycompute Epk(c

1) in the first iteration, Epk(c′

2) in thesecond iteration, and so on. Here c′s denotes the

Algorithm 4 PPkNN(D′, q)→ cq

Require: C1 has D′ and π; C2 has sk; Bob has q1: Bob:

(a). Compute Epk(qj), for 1 ≤ j ≤ m(b). Send Epk(q) = 〈Epk(q1), . . . , Epk(qm)〉 to C1

2: C1 and C2:

(a). C1 receives Epk(q) from Bob(b). for i = 1 to n do:

• Epk(di)← SSED(Epk(q), Epk(ti))• [di]← SBD(Epk(di))

3: for s = 1 to k do:

(a). C1 and C2:

• ([dmin], Epk(I), Epk(c′))← SMINn(θ1, . . . , θn),

where θi = ([di], Epk(Iti), Epk(ti,m+1))• Epk(c

s)← Epk(c′)

(b). C1:

• ∆← Epk(I)N−1

• for i = 1 to n do:

– τi ← Epk(i) ∗∆– τ ′i ← τ rii , where ri ∈R ZN

• β ← π(τ ′); send β to C2

(c). C2:

• β′

i ← Dsk(βi), for 1 ≤ i ≤ n• Compute U ′, for 1 ≤ i ≤ n:

– if β′

i = 0, then U ′

i = Epk(1)– otherwise, U ′

i = Epk(0)

• Send U ′ to C1

(d). C1: V ← π−1(U ′)(e). C1 and C2, for 1 ≤ i ≤ n and 1 ≤ γ ≤ l:

• Epk(di,γ)← SBOR(Vi, Epk(di,γ))

4: SCMCk(Epk(c′

1), . . . , Epk(c′

k))

class label of sth nearest neighbor to q, for 1 ≤s ≤ k. At the end of k iterations, only C1 knows〈Epk(c

1), . . . , Epk(c′

k)〉. To start with, consider the firstiteration. C1 and C2 jointly compute the encryp-tions of the individual bits of the minimum valueamong d1, . . . , dn and encryptions of the location andclass label corresponding to dmin using the SMINn

protocol. That is, C1 with input (θ1, . . . , θn) and C2

with sk compute ([dmin], Epk(I), Epk(c′)), where θi =

([di], Epk(Iti), Epk(ti,m+1)), for 1 ≤ i ≤ n. Here dmin

denotes the minimum value among d1, . . . , dn; Itiand ti,m+1 denote the unique identifier and classlabel corresponding to the data record ti, respectively.Specifically, (Iti , ti,m+1) is the secret information as-sociated with ti. For simplicity, this paper assumesIti = i. In the output, I and c′ denote the indexand class label corresponding to dmin. The output([dmin], Epk(I), Epk(c)) is known only to C1. Now, C1

performs the following operations locally:• Assign Epk(c

′) to Epk(c′

1). Remember that, ac-cording to the SMINn protocol, c′ is equivalent tothe class label of the data record that correspondsto dmin. Thus, it is same as the class label of themost nearest neighbor to q.

IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.

Page 10: k-Nearest Neighbor Classification over Semantically Secure ... Neighbor Classificatio… · k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

• Compute the encryption of difference between I

and i, where 1 ≤ i ≤ n. That is, C1 computesτi = Epk(i)∗Epk(I)

N−1 = Epk(i−I), for 1 ≤ i ≤ n.• Randomize τi to get τ ′i = τ rii = Epk(ri ∗ (i − I)),

where ri is a random number in ZN . Note that τ ′iis an encryption of either 0 or a random number,for 1 ≤ i ≤ n. Also, it is worth noting thatexactly one of the entries in τ ′ is an encryptionof 0 (which happens iff i = I) and the restare encryptions of random numbers. Permute τ ′

using a random permutation function π (knownonly to C1) to get β = π(τ ′) and send it to C2.

Upon receiving β, C2 decrypts it component-wise toget β′

i = Dsk(βi), for 1 ≤ i ≤ n. After this, he/shecomputes an encrypted vector U ′ of length n suchthat Ui = Epk(1) if β′

i = 0, and Epk(0) otherwise.Since exactly one of entries in τ ′ is an encryption of0, this further implies that exactly one of the entriesin U ′ is an encryption of 1 and the rest of themare encryptions of 0’s. It is important to note that ifβ′

k = 0, then π−1(k) is the index of the data record thatcorresponds to dmin. Then, C2 sends U ′ to C1. Afterreceiving U ′, C1 performs inverse permutation on it toget V = π−1(U ′). Note that exactly one of the entriesin V is Epk(1) and the remaining are encryptions of0’s. In addition, if Vi = Epk(1), then ti is the mostnearest tuple to q. However, C1 and C2 do not knowwhich entry in V corresponds to Epk(1).

Finally, C1 updates the distance vectors [di] due tothe following reason:

• It is important to note that the first nearest tupleto q should be obliviously excluded from furthercomputations. However, since C1 does not knowthe record corresponding to Epk(c

1), we need toobliviously eliminate the possibility of choosingthis record again in next iterations. For this, C1

obliviously updates the distance correspondingto Epk(c

1) to the maximum value, i.e., 2l−1. Morespecifically, C1 updates the distance vectors withthe help of C2 using the SBOR protocol as below,for 1 ≤ i ≤ n and 1 ≤ γ ≤ l.

Epk(di,γ) = SBOR(Vi, Epk(di,γ))

Note that when Vi = Epk(1), the correspond-ing distance vector di is set to the maxi-mum value. That is, under this case, [di] =〈Epk(1), . . . , Epk(1)〉. On the other hand, whenVi = Epk(0), the OR operation has no effect onthe corresponding encrypted distance vector.

The above process is repeated until k iterations, and ineach iteration [di] corresponding to the current chosenlabel is set to the maximum value. However, C1 andC2 does not know which [di] is updated. In iterations, Epk(c

s) is known only to C1. At the end of Stage1, C1 has 〈Epk(c

1), . . . , Epk(c′

k)〉, the list of encryptedclass labels of k-nearest neighbors to the query q.

Algorithm 5 SCMCk(Epk(c′

1), . . . , Epk(c′

k))→ cq

Require: 〈Epk(c1), . . . , Epk(cw)〉, 〈Epk(c′

1), . . . , Epk(c′

k)〉are known only to C1; sk is known only to C2

1: C1 and C2:

(a). 〈Epk(f(c1)), . . . , Epk(f(cw))〉 ← SF(Λ,Λ′),where Λ = 〈Epk(c1), . . . , Epk(cw)〉, Λ′ =〈Epk(c

1), . . . , Epk(c′

k)〉(b). for i = 1 to w do:

• [f(ci)]← SBD(Epk(f(ci)))

(c). ([fmax], Epk(cq)) ← SMAXw(ψ1, . . . , ψw),where ψi = ([f(ci)], Epk(ci)), for 1 ≤ i ≤ w

2: C1:

(a). γq ← Epk(cq) ∗ Epk(rq), where rq ∈R ZN

(b). Send γq to C2 and rq to Bob

3: C2:

(a). Receive γq from C1

(b). γ′q ← Dsk(γq); send γ′q to Bob

4: Bob:

(a). Receive rq from C1 and γ′q from C2

(b). cq ← γ′q − rq mod N

5.2 Stage 2 : Secure Computation of MajorityClass (SCMC k)

Without loss of generality, let us assume that Alice’sdataset D consists of w unique class labels denotedby c = 〈c1, . . . , cw〉. We assume that Alice outsourcesher list of encrypted classes to C1. That is, Aliceoutsources 〈Epk(c1), . . . , Epk(cw)〉 to C1 along with herencrypted database D′ during the data outsourcingstep. Note that, for security reasons, Alice may adddummy categories into the list to protect the numberof class labels, i.e., w from C1 and C2. However, forsimplicity, we assume that Alice does not add anydummy categories to c.

During Stage 2, C1 with private inputsΛ = 〈Epk(c1), . . . , Epk(cw)〉 and Λ′ =〈Epk(c

1), . . . , Epk(c′

k)〉, and C2 with sk securelycompute Epk(cq). Here cq denotes the majority classlabel among c′1, . . . , c

k. At the end of stage 2, onlyBob knows the class label cq .

The overall steps involved in Stage 2 are shownin Algorithm 5. To start with, C1 and C2 jointlycompute the encrypted frequencies of each class la-bel using the k-nearest set as input. That is, theycompute Epk(f(ci)) using (Λ,Λ′) as C1’s input to thesecure frequency (SF) protocol, for 1 ≤ i ≤ w. Theoutput 〈Epk(f(c1)), . . . , Epk(f(cw))〉 is known only toC1. Then, C1 with Epk(f(ci)) and C2 with sk involvein the secure bit-decomposition (SBD) protocol tocompute [f(ci)], that is, vector of encryptions of theindividual bits of f(ci), for 1 ≤ i ≤ w. After this, C1

and C2 jointly involve in the SMAXw protocol. Briefly,SMAXw utilizes the sub-routine SMAX to eventuallycompute ([fmax], Epk(cq)) in an iterative fashion. Here[fmax] = [max(f(c1), . . . , f(cw))] and cq denotes themajority class out of Λ′. At the end, the output

IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.

Page 11: k-Nearest Neighbor Classification over Semantically Secure ... Neighbor Classificatio… · k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

([fmax], Epk(cq)) is known only to C1. After this, C1

computes γq = Epk(cq + rq), where rq is a randomnumber in ZN known only to C1. Then, C1 sends γq toC2 and rq to Bob. Upon receiving γq , C2 decrypts it toget the randomized majority class label γ′q = Dsk(γq)and sends it to Bob. Finally, upon receiving rq fromC1 and γ′q from C2, Bob computes the output classlabel corresponding to q as cq = γ′q − rq mod N .

5.3 Security Analysis of PP kNN under the Semi-honest Model

First of all, we stress that due to the encryption of qand by semantic security of the Paillier cryptosystem,Bob’s input query q is protected from Alice, C1 andC2 in our PPkNN protocol. Apart from guaranteeingquery privacy, the goal of PPkNN is to protect dataconfidentiality and hide data access patterns.In this paper, to prove a protocol’s security under

the semi-honest model, we adopted the well-knownsecurity definitions from the literature of SMC. Morespecifically, as mentioned in Section 2.3, we adoptthe security proofs based on the standard simulationparadigm [28]. For presentation purpose, we provideformal security proofs (under the semi-honest model)for Stages 1 and 2 of PPkNN separately. Note thatthe outputs returned by each sub-protocol are inencrypted form and known only to C1.

5.3.1 Proof of Security for Stage 1As mentioned earlier, the computations involved inStage 1 of PPkNN are given as steps 1 to 3 in Al-gorithm 4. For simplicity, we consider the messagesexchanged between C1 and C2 in a single iteration(similar analysis can be deduced for other iterations).According to Algorithm 4, the execution image of

C2 is given by ΠC2(PPkNN) = {〈βi, β

i〉 | for 1 ≤ i ≤n} where βi is an encrypted value which is randomin ZN2 . Also, β′

i is derived upon decrypting βi by C2.Remember that, exactly one of the entries in β′ is 0and the rest are random numbers in ZN . Without lossof generality, let the simulated image of C2 be givenΠS

C2(PPkNN) = {〈a′1,i, a

2,i〉 | for 1 ≤ i ≤ n}. Here a′1,iis randomly generated from ZN2 and the vector a′2 israndomly generated in such a way that exactly oneof the entries is 0 and the rest are random numbersin ZN . Since Epk is a semantically secure encryptionscheme with resulting ciphertext size less than ZN2 ,we claim that βi is computationally indistinguishablefrom a′1,i. In addition, since the random permutationfunction π is known only to C1, β

′ is a randomvector of exactly one 0 and random numbers in ZN .Thus, β′ is computationally indistinguishable froma′2. By combining the above results, we can concludethat ΠC2

(PPkNN) is computationally indistinguish-able from ΠS

C2(PPkNN). This implies that C2 does not

learn anything during the execution of Stage 1.On the other hand, the execution image of C1

is given by ΠC1(PPkNN) = {U ′} where U ′ is an

encrypted value sent by C2 (at step 3(c) of Algo-rithm 4). Let the simulated image of C1 in Stage1 be ΠS

C1(PPkNN) = {a′}. Here the value of a′

is randomly generated from ZN2 . Since Epk is asemantically secure encryption scheme with result-ing ciphertexts in ZN2 , we claim that U ′ is com-putationally indistinguishable from a′. This impliesthat ΠC1

(PPkNN) is computationally indistinguish-able from ΠS

C1(PPkNN). Hence, C1 cannot learn any-

thing during the execution of Stage 1 in PPkNN.Combining all these results together, it is clear thatStage 1 is secure under the semi-honest model.In each iteration, it is worth pointing out that C1

and C2 do not know which data record belongs tocurrent global minimum. Thus, data access patternsare protected from both C1 and C2. Informally speak-ing, at step 3(c) of Algorithm 4, a component-wisedecryption of β reveals the tuple that satisfy thecurrent global minimum distance to C2. However, dueto the random permutation by C1, C2 cannot traceback to the corresponding data record. Also, note thatdecryption operations on vector β by C2 will result inexactly one 0 and the rest of the results are randomnumbers in ZN . Similarly, since U ′ is an encryptedvector, C1 cannot know which tuple corresponds tocurrent global minimum distance.

5.3.2 Security Proof for Stage 2

In a similar fashion, we can formally prove that Stage2 of PPkNN is secure under the semi-honest model.Briefly, since the sub-protocols SF, SBD, and SMAXw

are secure, no information is revealed to C2. Also, theoperations performed by C1 are entirely on encrypteddata and thus no information is revealed to C1.Furthermore, the output data of Stage 1 which

are passed as input to Stage 2 are in encrypted for-mat. Therefore, the sequential composition of the twostages lead to our PPkNN protocol and we claim itto be secure under the semi-honest model accordingto the Composition Theorem [28]. In particular, basedon the above discussions, it is clear that the proposedPPkNN protocol protects the confidentiality of thedata, the user’s input query, and also hides data accesspatterns from Alice, C1, and C2. Note that Alice doesnot participate in any computations of PPkNN.

5.4 Security under the Malicious model

The next step is to extend our PPkNN protocol into asecure protocol under the malicious model. Under themalicious model, an adversary (i.e., either C1 or C2)can arbitrarily deviate from the protocol to gain someadvantage (e.g., learning additional information aboutinputs) over the other party. The deviations include, asan example, for C1 (acting as a malicious adversary) toinstantiate the PPkNN protocol with modified inputs(say Epk(q

′) and Epk(t′

i)) and to abort the protocol af-ter gaining partial information. However, in PPkNN,

IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.

Page 12: k-Nearest Neighbor Classification over Semantically Secure ... Neighbor Classificatio… · k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

it is worth pointing out that neither C1 nor C2 knowsthe results of Stages 1 and 2. In addition, all the inter-mediate results are either random or pseudo-randomvalues. Thus, even when an adversary modifies theintermediate computations he/she cannot gain anyadditional information. Nevertheless, as mentionedabove, the adversary can change the intermediate dataor perform computations incorrectly before sendingthem to the honest party which may eventually resultin the wrong output. Therefore, we need to ensurethat all the computations performed and messagessent by each party are correct.Remember that the main goal of SMC is to ensure

the honest parties to get the correct result and toprotect their private input data from the maliciousparties. Therefore, under the two-party SMC scenario,if both parties are malicious, there is no point todevelop or adopt an SMC protocol at the first place.In the literature of SMC [30], it is the norm that atmost one party can be malicious under the two-partyscenario. When only one of the party is malicious, thestandard way of preventing the malicious party frommisbehaving is to let the honest party validate theother party’s work using zero-knowledge proofs [31].However, checking the validity of operations at eachstep of PPkNN can significantly increase the cost.An alternative approach, as proposed in [32], is to

instantiate two independent executions of the PPkNNprotocol by swapping the roles of the two partiesin each execution. At the end of the individual ex-ecutions, each party receives the output in encryptedform. This is followed by an equality test on theiroutputs. More specifically, suppose Epk1

(cq,1) andEpk2

(cq,2) be the outputs received by C1 and C2

respectively, where pk1 and pk2 are their respectivepublic keys. Note that the outputs in our case are inencrypted format and the corresponding ciphertexts(resulted from the two executions) are under twodifferent public key domains. Therefore, we stress thatthe equality test based on the additive homomorphicencryption properties which was used in [32] is notapplicable to our problem. Nevertheless, C1 and C2

can perform the equality test based on the traditionalgarbled-circuit technique [33].

5.5 Complexity Analysis

The total computation complexity of Stage 1 isbounded by O(n ∗ (l+m+ k ∗ l ∗ log2 n)) encryptionsand exponentiations. On the other hand, the totalcomputation complexity of Stage 2 is bounded byO(w ∗ (l + k + l ∗ log2 w)) encryptions and exponenti-ations. Due to space limitations, we refer the readerto [5] for detailed complexity analysis of PPkNN. Ingeneral, as w ≪ n, the computation cost of Stage 1should be significantly higher than that of Stage 2.This observation is further justified by our empiricalresults given in the next section.

6 EMPIRICAL RESULTS

In this section, we discuss some experiments demon-strating the performance of our PPkNN protocolunder different parameter settings. We used thePaillier cryptosystem [4] as the underlying additivehomomorphic encryption scheme and implementedthe proposed PPkNN protocol in C. Various experi-ments were conducted on a Linux machine with anIntel R© Xeon R© Six-CoreTM CPU 3.07 GHz processorand 12GB RAM running Ubuntu 12.04 LTS. To the bestof our knowledge, our work is the first effort to de-velop a secure k-NN classifier under the semi-honestmodel. There is no existing work to compare with ourapproach. Hence, we evaluate the performance of ourPPkNN protocol under different parameter settings.

6.1 Dataset and Experimental Setup

For our experiments, we used the Car Evaluationdataset from the UCI KDD archive [34]. It consistsof 1728 records (i.e., n = 1728) and 6 attributes (i.e.,m = 6). Also, there is a separate class attribute andthe dataset is categorized into four different classes(i.e., w = 4). We encrypted this dataset attribute-wise, using the Paillier encryption whose key size isvaried in our experiments, and the encrypted datawere stored on our machine. Based on our PPkNNprotocol, we then executed a random query over thisencrypted data. For the rest of this section, we do notdiscuss about the performance of Alice since it is aone-time cost. Instead, we evaluate and analyze theperformances of the two stages in PPkNN separately.

6.2 Performance of PP kNN

We first evaluated the computation costs of Stage 1 inPPkNN for varying number of k-nearest neighbors.Also, the Paillier encryption key size K is either 512or 1024 bits. The results are shown in Figure 2(a). ForK=512 bits, the computation cost of Stage 1 variesfrom 9.98 to 46.16 minutes when k is changed from 5to 25, respectively. On the other hand, when K=1024bits, the computation cost of Stage 1 varies from 66.97to 309.98 minutes when k is changed from 5 to 25,respectively. In either case, we observed that the costof Stage 1 grows almost linearly with k. For any givenk, we identified that the cost of Stage 1 increasesby almost a factor of 7 whenever k is doubled. E.g.,when k=10, Stage 1 took 19.06 and 127.72 minutes togenerate the encrypted class labels of the 10 nearestneighbors under K=512 and 1024 bits, respectively.Moreover, when k=5, we observe that around 66.29%of cost in Stage 1 is accounted due to SMINn whichis initiated k times in PPkNN (once in each iteration).Also, the cost incurred due to SMINn increases from66.29% to 71.66% when k is increased from 5 to 25.

We now evaluate the computation costs of Stage2 for varying k and K. As shown in Figure 2(b),

IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.

Page 13: k-Nearest Neighbor Classification over Semantically Secure ... Neighbor Classificatio… · k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

0

50

100

150

200

250

300

350

0 5 10 15 20 25

Tim

e (m

inut

es)

Number of k Nearest Neighbors

K=512K=1024

(a) Total cost of Stage 1

0

0.5

1

1.5

2

0 5 10 15 20 25

Tim

e (s

econ

ds)

Number of k Nearest Neighbors

K=512K=1024

(b) Total cost of Stage 2

0

50

100

150

200

250

300

350

0 5 10 15 20 25

Tim

e (m

inut

es)

Number of k Nearest Neighbors

SRkNNSRkNNoSRkNNp

(c) Efficiency gains of Stage 1

Fig. 2. Computation costs of PPkNN for varying number of k nearest neighbors and encryption key size K

for K=512 bits, the computation time for Stage 2 togenerate the final class label corresponding to theinput query varies from 0.118 to 0.285 seconds whenk is changed from 5 to 25. On the other hand, forK=1024 bits, Stage 2 took 0.789 and 1.89 secondswhen k = 5 and 25, respectively. The low computationcosts of Stage 2 were due to SMAXw which incurssignificantly less computations than SMINn in Stage 1.This further justifies our theoretical analysis in Section5.5. Note that, in our dataset, w=4 and n=1728. Likein Stage 1, for any given k, the computation time ofStage 2 increases by almost a factor of 7 whenever Kis doubled. E.g., when k=10, the computation time ofStage 2 varies from 0.175 to 1.158 seconds when theencryption key size K is changed from 512 to 1024bits. As shown in Figure 2(b), a similar analysis canbe observed for other values of k and K.It is clear that the computation cost of Stage 1 is

significantly higher than that of Stage 2 in PPkNN.Specifically, we observed that the computation timeof Stage 1 accounts for at least 99% of the total timein PPkNN. For example, when k = 10 and K=512bits, the computation costs of Stage 1 and 2 are 19.06minutes and 0.175 seconds, respectively. Under thisscenario, cost of Stage 1 is 99.98% of the total cost ofPPkNN. We also observed that the total computationtime of PPkNN grows almost linearly with n and k.

6.3 Performance Improvement of PP kNN

We now discuss two different ways to boost the effi-ciency of Stage 1 (as the performance of PPkNN de-pends primarily on Stage 1) and empirically analyzetheir efficiency gains. First, we observe that some ofthe computations in Stage 1 can be pre-computed. Forexample, encryptions of random numbers, 0s and 1scan be pre-computed (by the corresponding parties) inthe offline phase. As a result, the online computationcost of Stage 1 (denoted by SRkNNo) is expected tobe improved. To see the actual efficiency gains ofsuch a strategy, we computed the costs of SRkNNo

and compared them with the costs of Stage 1 withoutan offline phase (simply denoted by SRkNN) andthe results for K = 1024 bits are shown in Figure2(c). Irrespective of the values of k, we observed thatSRkNNo is around 33% faster than SRkNN. E.g., whenk k k

are 84.47 and 127.72 minutes, respectively (boostingthe online running time of Stage 1 by 33.86%).Our second approach to improve the performance

of Stage 1 is by using parallelism. Since operations ondata records are independent of one another, we claimthat most computations in Stage 1 can be parallelized.To empirically evaluate this claim, we implemented aparallel version of Stage 1 (denoted by SRkNNp) usingOpenMP programming and compared its cost withthe costs of SRkNN (i.e., the serial version of Stage 1).The results for K = 1024 bits are shown in Figure 2(c).The computation cost of SRkNNp varies from 12.02to 55.5 minutes when k is changed from 5 to 25. Weobserve that SRkNNp is almost 6 times more efficientthan SRkNN. This is because our machine has 6 coresand thus computations can be run in parallel on 6separate threads. Based on the above discussions, it isclear that efficiency of Stage 1 can indeed be improvedsignificantly using parallelism.On the other hand, Bob’s computation cost in

PPkNN is mainly due to the encryption of his inputquery. In our dataset, Bob’s computation cost is 4 and17 milliseconds when K is 512 and 1024 bits, respec-tively. It is apparent that PPkNN is very efficient fromBob’s computational perspective which is especiallybeneficial when he issues queries from a resource-constrained device (such as mobile phone and PDA).

6.3.1 A Note on PracticalityThe proposed PPkNN protocol is not very efficientwithout utilizing parallelization. However, ours is thefirst work to propose a PPkNN solution that is secureunder the semi-honest model. Due to rising demandsfor data mining as a service in cloud computing, webelieve that our work will be very helpful to thecloud community to stimulate further research alongthat direction. Hopefully, more practical solutions toPPkNN will be developed (either by optimizing ourprotocol or investigating alternative approaches) inthe near future.

7 CONCLUSIONS AND FUTURE WORK

To protect user privacy, various privacy-preservingclassification techniques have been proposed over thepast decade. The existing techniques are not appli-

IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.

k = 10, the computation costs of SRkNNo and SRkNN cable to outsourced database environments where the

Page 14: k-Nearest Neighbor Classification over Semantically Secure ... Neighbor Classificatio… · k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational Data

data resides in encrypted form on a third-party server.This paper proposed a novel privacy-preserving k-NN classification protocol over encrypted data in thecloud. Our protocol protects the confidentiality of thedata, user’s input query, and hides the data accesspatterns. We also evaluated the performance of ourprotocol under different parameter settings.Since improving the efficiency of SMINn is an im-

portant first step for improving the performance ofour PPkNN protocol, we plan to investigate alterna-tive and more efficient solutions to the SMINn prob-lem in our future work. Also, we will investigate andextend our research to other classification algorithms.

REFERENCES

[1] P. Mell and T. Grance, “The nist definition of cloud computing(draft),” NIST special publication, vol. 800, p. 145, 2011.

[2] S. De Capitani di Vimercati, S. Foresti, and P. Samarati,“Managing and accessing data in the cloud: Privacy risks andapproaches,” in CRiSIS, pp. 1 –9, 2012.

[3] P. Williams, R. Sion, and B. Carbunar, “Building castles outof mud: practical access pattern privacy and correctness onuntrusted storage,” in ACM CCS, pp. 139–148, 2008.

[4] P. Paillier, “Public key cryptosystems based on compositedegree residuosity classes,” in Eurocrypt, pp. 223–238, 1999.

[5] B. K. Samanthula, Y. Elmehdwi, and W. Jiang, “k-nearestneighbor classification over semantically secure encrypted re-lational data.” eprint arXiv:1403.5001, 2014.

[6] C. Gentry, “Fully homomorphic encryption using ideal lat-tices,” in ACM STOC, pp. 169–178, 2009.

[7] C. Gentry and S. Halevi, “Implementing gentry’s fully-homomorphic encryption scheme,” in EUROCRYPT, pp. 129–148, Springer, 2011.

[8] A. Shamir, “How to share a secret,” Commun. ACM, vol. 22,pp. 612–613, Nov. 1979.

[9] D. Bogdanov, S. Laur, and J. Willemson, “Sharemind: A frame-work for fast privacy-preserving computations,” in ESORICS,pp. 192–206, Springer, 2008.

[10] R. Agrawal and R. Srikant, “Privacy-preserving data mining,”in ACM Sigmod Record, vol. 29, pp. 439–450, ACM, 2000.

[11] Y. Lindell and B. Pinkas, “Privacy preserving data mining,” inAdvances in Cryptology (CRYPTO), pp. 36–54, Springer, 2000.

[12] P. Zhang, Y. Tong, S. Tang, and D. Yang, “Privacy preservingnaive bayes classification,” ADMA, pp. 744–752, 2005.

[13] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, “Privacypreserving mining of association rules,” Information Systems,vol. 29, no. 4, pp. 343–364, 2004.

[14] R. J. Bayardo and R. Agrawal, “Data privacy through optimalk-anonymization,” in IEEE ICDE, pp. 217–228, 2005.

[15] H. Hu, J. Xu, C. Ren, and B. Choi, “Processing private queriesover untrusted data cloud through privacy homomorphism,”in IEEE ICDE, pp. 601–612, 2011.

[16] M. Kantarcioglu and C. Clifton, “Privately computing a dis-tributed k-nn classifier,” in PKDD, pp. 279–290, 2004.

[17] L. Xiong, S. Chitti, and L. Liu, “K nearest neighbor classifica-tion across multiple private databases,” in CIKM, pp. 840–841,ACM, 2006.

[18] Y. Qi and M. J. Atallah, “Efficient privacy-preserving k-nearestneighbor search,” in IEEE ICDCS, pp. 311–319, 2008.

[19] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, “Order preserv-ing encryption for numeric data,” in ACM SIGMOD, pp. 563–574, 2004.

[20] H. Hacigumus, B. Iyer, C. Li, and S. Mehrotra, “Executing sqlover encrypted data in the database-service-provider model,”in ACM SIGMOD, pp. 216–227, 2002.

[21] B. Hore, S. Mehrotra, M. Canim, and M. Kantarcioglu, “Securemultidimensional range queries over outsourced data,” TheVLDB Journal, vol. 21, no. 3, pp. 333–358, 2012.

[22] W. K. Wong, D. W.-l. Cheung, B. Kao, and N. Mamoulis,“Secure knn computation on encrypted databases,” in ACMSIGMOD, pp. 139–152, 2009.

[23] X. Xiao, F. Li, and B. Yao, “Secure nearest neighbor revisited,”in IEEE ICDE, pp. 733–744, 2013.

[24] Y. Elmehdwi, B. K. Samanthula, and W. Jiang, “Secure k-nearest neighbor query over encrypted data in outsourcedenvironments,” in IEEE ICDE, pp. 664–675, 2014.

[25] A. C. Yao, “Protocols for secure computations,” in SFCS,pp. 160–164, IEEE Computer Society, 1982.

[26] O. Goldreich, S. Micali, and A. Wigderson, “How to play anymental game - a completeness theorem for protocols withhonest majority,” in STOC, pp. 218–229, ACM, 1987.

[27] O. Goldreich, The Foundations of Cryptography, vol. 2, ch. Gen-eral Cryptographic Protocols, pp. 599–746. Cambridge Uni-versity Press, 2004.

[28] O. Goldreich, The Foundations of Cryptography, vol. 2, ch. En-cryption Schemes, pp. 373–470. Cambridge University Press,2004.

[29] S. Goldwasser, S. Micali, and C. Rackoff, “The knowledgecomplexity of interactive proof systems,” SIAM Journal ofComputing, vol. 18, pp. 186–208, February 1989.

[30] D. Chaum, C. Crepeau, and I. Damgard, “Multiparty uncon-ditionally secure protocols,” in STOC, pp. 11–19, ACM, 1988.

[31] J. Camenisch and M. Michels, “Proving in zero-knowledgethat a number is the product of two safe primes,” in EURO-CRYPT, pp. 107–122, Springer-Verlag, 1999.

[32] Y. Huang, J. Katz, and D. Evans, “Quid-pro-quo-tocols:Strengthening semi-honest protocols with dual execution,” inIEEE Security and Privacy, pp. 272–284, 2012.

[33] Y. Huang, D. Evans, J. Katz, and L. Malka, “Faster secure two-party computation using garbled circuits,” in Proceedings of the20th USENIX conference on Security (SEC ’11), pp. 35–35, 2011.

[34] M. Bohanec and B. Zupan. The UCI KDD Archive, 1997. http://archive.ics.uci.edu/ml/datasets/Car+Evaluation.

Bharath K. Samanthula is a PostdoctoralResearch Associate at the Purdue CyberCenter and CERIAS. He received his Bach-elor’s degree in Computer Science and En-gineering from International Institute of In-formation Technology, Hyderabad, India, in2008. He received his Ph.D. degree in Com-puter Science from the Missouri University ofScience and Technology, Rolla, MO, in 2013.His research interests include applied cryp-tography, personal privacy and data security

in the fields of social networks, cloud computing, and smart grids.

Yousef Elmehdwi received his bachelorsdegree in Computer Science from BenghaziUniversity, Libya in 1993 and MSc degree inInformation Technology from Mannheim Uni-versity of Applied Science, Germany in 2005.Elmehdwi has joined the Computer Sciencedepartment at the Missouri University of Sci-ence and Technology, Rolla, Missouri, asa graduate student in January 2010. He iscurrently working toward his PhD degree un-der the supervision of Dr. Wei Jiang. His

research interests lie at the crossroads of privacy, security, and datamining with current focus on privacy-preserving query processingover encrypted data outsourced to cloud.

Wei Jiang is an assistant professor at theDepartment of Computer Science of MissouriUniversity of Science and Technology. Hereceived the Bachelors degrees in both Com-puter Science and Mathematics from the Uni-versity of Iowa, Iowa City, Iowa, in 2002. Hereceived the Masters degree in ComputerScience and the Ph.D. degree from PurdueUniversity, West Lafayette, IN, in 2004 and2008, respectively. His research interests in-clude privacy-preserving data mining, data

integration, privacy issues in federated search environments, andtext sanitization. His research has been funded by the National Sci-ence Foundation, the Office of Naval Research, and the Universityof Missouri Research Board.

IEEE Transactions on Knowledge and Data Engineering, (Volume:27 , Issue: 5 ),May 1 2015.