
Page 1: [IEEE 2013 IEEE 37th International Computer Software and Applications Conference Workshops (COMPSACW) - Japan (2013.07.22-2013.07.26)] 2013 IEEE 37th Annual Computer Software and Applications

Privacy-Preserving Two-Party k-Means Clustering in Malicious Model

Rahena Akhter†, Rownak Jahan Chowdhury†, Keita Emura††, Tamzida Islam†, Mohammad Shahriar Rahman†, Nusrat Rubaiyat†

†Department of CSE, University of Asia Pacific (UAP), Dhaka, Bangladesh
††National Institute of Information and Communications Technology (NICT), Tokyo, Japan

Email: [email protected], [email protected], [email protected], {rahena ratna, tamzida.Islam, kotha toru}@yahoo.com

Abstract—In data mining, clustering is a well-known and useful technique. One of the most powerful and frequently used techniques is k-means clustering. Most of the cryptography-based privacy-preserving solutions proposed in recent years are in the semi-honest model, where participating parties always follow the protocol. This model is realistic in many cases. However, stronger solutions in the malicious model would be more useful for many practical applications, because that model protects a protocol from arbitrary malicious behavior using cryptographic tools. In this paper, we propose a new protocol for privacy-preserving two-party k-means clustering in the malicious model. We use threshold homomorphic encryption and non-interactive zero-knowledge protocols to construct our protocol according to the real/ideal world paradigm.

KeyWords: k-means clustering, privacy-preserving, malicious model, threshold two-party computation.

I. INTRODUCTION

k-means clustering is one of the simplest and most popular learning algorithms in data mining, and it solves the well-known clustering problem [1]. It has many applications, such as character recognition, encoding/decoding, market segmentation, and healthcare. Suppose two hospitals want to retrieve some data from their joint databases, or two companies want to combine their data to identify which product will be profitable. k-means clustering can be used in such settings.

Let $n$ data objects be given. The k-means clustering algorithm divides them into $k$ groups. Suppose there are two parties $P_0$ and $P_1$ in a distributed network, each holding a private database, denoted by $d^0$ and $d^1$ respectively. We assume a combined database $d = d^0 \cup d^1$ coming from $P_0$ and $P_1$, on which they want to collaboratively compute the k-means clustering for their common benefit. Privacy must be preserved for both parties, which means that no information about either party's database is revealed. Privacy can be violated by adversaries. In the cryptography literature, adversaries are generally classified into the following two categories.

Semi-honest adversary: A semi-honest adversary follows the protocol specification faithfully but tries to learn extra information from the message transcript during the execution.

Malicious adversary: A malicious adversary does not necessarily follow the protocol specification and can alter its input during the execution.

On the basis of these definitions, we construct our protocol in the malicious model so that we can provide a stronger security guarantee.

A. Related work

Most existing privacy-preserving k-means schemes (either two-party or multiparty [2][3]) deal with the semi-honest model, and they also have some security problems [4]. Vaidya and Clifton introduced a solution [5] based on the Okamoto-Uchiyama homomorphic cryptosystem [6], which is not secure without random padding. Jagannathan et al.'s proposal [7] applies a scheme [8] whose security is not proven, and the whole protocol is inefficient in the cluster update computation. In Jha et al.'s solution [9], the two parties first compute their local k-means clusterings, and the update computation is executed afterwards. They use oblivious polynomial evaluation to construct a private update computation that merges the 2k clusters into k. This proposal has a correctness problem because the result is not actually based on the two parties' initial clusters. Another scheme by Jagannathan et al. [10] uses a secure scalar product protocol [11], which has the serious weakness that leakage of some database entries can reveal the whole database; it is also very inefficient when the database is large. In [12], Yao's garbled circuit technique is used, which is time consuming. The scheme in [13] addresses the above problems, but none of these schemes can deal with malicious adversaries. The communication cost of these protocols is mostly reasonable [14], so we cannot ignore the matter of cost. The authors of [15] give an idea of how to convert the k-means clustering process from the semi-honest model to the malicious model, but without any details of the protocols, so we cannot compare its performance. In [16], protocols are constructed for more than two participating parties in the malicious model, using secret sharing with a code-based identification scheme; however, secret sharing is not efficient in some cases. What we want to achieve in this paper is a stronger security guarantee at reasonable cost.

2013 IEEE 37th Annual Computer Software and Applications Conference Workshops

978-0-7695-4987-3/13 $26.00 © 2013 IEEE

DOI 10.1109/COMPSACW.2013.53


B. Our contribution

In this paper, we present an interactive protocol with several sub-protocols that use non-interactive zero-knowledge (NIZK) proofs. The combined database is used to run the protocol and obtain the target results. The overall clustering computations are similar to those in [13] and [12]. We add NIZK proofs to the computations to ensure secure communication. Adding NIZK proofs may seem a trivial way to convert protocols from the semi-honest model to the malicious model, but in fact we had to handle a number of cases while constructing our protocols. We needed to perform a number of encryptions using both parties' data, and decryption by either party might infringe the other party's privacy through information leakage. To avoid such leakage, we use oracle calls through which decryptions are performed and the desired result of a protocol is obtained. First, we construct a privacy-preserving data standardization protocol, which helps the parties avoid correctness problems. Next, we propose a secure distance measuring protocol to compute all the distances between each cluster and each data object and to find the closest cluster center for each data object. Then, we build a secure update computation protocol for reassigning the data points to their nearest clusters. Finally, we construct a privacy-preserving iteration stopping protocol to securely stop the clustering iteration.

Organization of the paper: We briefly describe some preliminaries in the next section. Section 3 contains a short overview of our proposed protocol. The details of our proposed protocol are in Section 4. In Section 5 we give the security analysis. Finally, in Section 6 we draw a conclusion.

II. PRELIMINARIES

A. Brief review of k-means clustering

k-means clustering is an iterative clustering method in which $n$ data items are grouped into $k$ clusters. Each data item belongs to the cluster with the nearest mean, so data items from different clusters have different characteristics. The algorithm is as follows:

1) Choose $k$ center points at random.
2) Assign each sample to the nearest cluster using the Euclidean distance, $\mathrm{Dist}_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.
3) Calculate the center of each cluster.
4) If the centers are unchanged then finish; otherwise repeat from step 2.
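The four steps above can be sketched in plain, non-private form as follows. This is only an illustrative Python sketch: the function names, the `seed` and `max_iter` parameters, and the rule of keeping the old center when a cluster becomes empty are assumptions that do not come from the paper.

```python
import math
import random


def euclidean(x, y):
    """Dist_E(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain (non-private) k-means on a list of equal-length tuples."""
    rng = random.Random(seed)            # seed is an illustrative parameter
    centers = rng.sample(points, k)      # step 1: k random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: euclidean(p, centers[c]))
            clusters[j].append(p)
        # Step 3: recompute each center as the mean of its members.
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # assumption: keep old center
        # Step 4: stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
```

On this small, well-separated data set the two groups of three points end up in separate clusters. The privacy-preserving protocols later in the paper compute the same quantities without revealing either party's data.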

B. Problem specification and notations

In this paper, we assume that there are two parties $P_0$ and $P_1$, each having a private database. They want to benefit jointly from a clustering analysis over their joint databases. We assume that they perform their computations in the presence of malicious adversaries who can change the input during protocol execution. We also assume that:

Party $P_0$ has database $d^0 = \{d^0_1, \dots, d^0_{n_0}\}$ with $n_0$ entries.

Party $P_1$ has database $d^1 = \{d^1_1, \dots, d^1_{n_1}\}$ with $n_1$ entries.

The combined database is $d = \{d_1, \dots, d_n\}$. We take an object $d_i$ as a vector $d_i = (x_{i,1}, \dots, x_{i,l})$, where $x$ denotes an attribute variable and $l$ denotes the number of attributes.

C. Security definitions in malicious model

We construct our protocol based on secure multi-party computation [17]. In the case of malicious adversaries, a party may refuse to participate in the protocol, may substitute its local input, or may abort the protocol prematurely. The adversary may also obtain its output while the honest party does not. These adversarial capabilities are therefore not prevented by the definition of security. The definition below is formalized according to the real/ideal world paradigm.

Definition 1. Let $P_0$ and $P_1$ be the parties and let $I$ denote the indices of the corrupted parties, which are controlled by an adversary $A$, a non-uniform probabilistic polynomial-time machine. There may be zero, one or both parties corrupted. For simplicity, we assume that exactly one of the two parties is corrupted, i.e., either $I = \{1\}$ or $I = \{2\}$. Let $f : \{0,1\}^* \times \{0,1\}^* \to \{0,1\}^* \times \{0,1\}^*$ be a function where $f = (f_1, f_2)$. Then the ideal execution of $f$ (on inputs $(x_1, x_2)$, auxiliary input $z$ to $A$ and security parameter $n$) is defined as the output pair of the honest party and the adversary $A$, and is denoted by $\mathrm{IDEAL}_{f,A(z),I}(n, x_1, x_2)$.

Definition 2. In the real model, a real two-party protocol $\pi$ is executed (no trusted third party exists). The adversary $A$ sends all messages in place of the corrupted party and may follow an arbitrary polynomial-time strategy, whereas the honest party follows the instructions of $\pi$. Let $f$ be as above and let $\pi$ be a two-party protocol for computing $f$. As above, the real execution of $\pi$ is defined as the output pair of the honest party and the adversary $A$, and is denoted by $\mathrm{REAL}_{\pi,A(z),I}(n, x_1, x_2)$.

Definition 3. Let $f$ and $\pi$ be as described above. Protocol $\pi$ is said to securely compute $f$ with abort in the presence of malicious adversaries if for every non-uniform probabilistic polynomial-time adversary $A$ for the real model, there exists a non-uniform probabilistic expected polynomial-time adversary $S$ for the ideal model such that, for every $I$, every $x_1, x_2 \in \{0,1\}^*$ with $|x_1| = |x_2|$, and every auxiliary input $z \in \{0,1\}^*$:

$$\{\mathrm{IDEAL}_{f,S(z),I}(n, x_1, x_2)\}_{n \in \mathbb{N}} \equiv \{\mathrm{REAL}_{\pi,A(z),I}(n, x_1, x_2)\}_{n \in \mathbb{N}}$$

Here, ≡ denotes computational indistinguishability. To prove that the composed protocol is secure, we first show that the protocol is secure given a trusted party (or oracle calls) that implements the functions used as subroutines of the composed protocol. Later, we can replace the oracle calls with secure protocols without violating security. The execution of the protocol $\pi$ in the hybrid model proceeds as in the real model, except that the parties have oracle access to a trusted party $T$ for evaluating the two-party functions, which proceed as in the ideal model. Similarly, security in the hybrid model requires that for any adversary operating in the hybrid model, there exists an adversary in the ideal model such that their executions are indistinguishable.

D. Cryptographic techniques

♦ Public-key encryption: A public-key encryption scheme on a message space $M$ consists of three algorithms $(K, E, D)$:
  – The key generation algorithm $K$ outputs a random pair of secret/public keys $(sk, pk)$.
  – The encryption algorithm $E_{pk}(m)$ outputs a ciphertext $e$ corresponding to the plaintext $m \in M$.
  – The decryption algorithm $D_{sk}(e)$ outputs the plaintext $m$ associated with the ciphertext $e$.

♦ Homomorphic public-key encryption: Homomorphic public-key encryption is a form of encryption that allows specific types of computations to be performed on encrypted data, without compromising the encryption, to obtain an encrypted result. This result is the ciphertext of the result of the same operations performed on the plaintexts. A public-key encryption scheme is called additive homomorphic (e.g., the Paillier scheme [18]) if it satisfies the following requirements:
  – Given the encryptions $E_{pk}(m_1)$ and $E_{pk}(m_2)$ of $m_1$ and $m_2$, there exists an efficient algorithm to compute the public-key encryption of $m_1 + m_2$, denoted by $E_{pk}(m_1 + m_2) := E_{pk}(m_1) +_h E_{pk}(m_2)$.
  – Given a constant $c$ and the encryption $E_{pk}(m)$ of $m$, there exists an efficient algorithm to compute the public-key encryption of $cm$, denoted by $E_{pk}(cm) := c \times_h E_{pk}(m)$.
  For example, suppose a chain of services from different companies wants to calculate the tax, the currency exchange rate or the shipping on a transaction without exposing the unencrypted data to each of those services. Homomorphic encryption allows this chaining. Homomorphic encryption has also been used for private information retrieval for many years.

♦ NIZK protocols: We use efficient NIZK protocols to prove that the actions taken by the parties are correct. No other information is revealed. We give a brief description of those protocols here; the implementation details can be found in [19], [20].
  – Threshold decryption (in the case of two-party computation): Suppose there is a common public key $pk$ and a secret key $sk$ corresponding to $pk$. The secret key $sk$ has been divided into two pieces $sk_0$ and $sk_1$. There exists an efficient and secure protocol $D_{sk_i}(E(m))$ that outputs a random share $s_i$ of the decryption result along with the NIZK proof $POK(sk_i, E_{pk}(m), s_i)$, which shows that $sk_i$ is used correctly. The shares can be combined to calculate the decryption result. No single share $sk_i$ of the secret key can be used to decrypt the ciphertext alone; in other words, $s_i$ reveals nothing about the final decryption result. We also use a special version of threshold decryption in which only one party learns the decryption result.
  – Proving that a party knows a plaintext: A party $P_i$ can compute the proof of plaintext knowledge $POK(e_m)$ if it knows an element $m$ in the domain of valid plaintexts such that $D_{sk}(e_m) = m$.
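As a concrete illustration of the two additive homomorphic properties listed above, the following toy Python sketch implements textbook Paillier encryption [18] with the common generator choice g = n + 1. The tiny primes, the function names `add_h` and `mul_h`, and the omission of threshold key sharing and NIZK proofs are simplifying assumptions for illustration; this is not the threshold scheme used by the protocols, and it is not secure at this key size.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy key generation; p and q are tiny illustrative primes (insecure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    mu = pow(lam, -1, n)        # inverse of lam mod n, valid for g = n + 1
    return n, (lam, mu)


def encrypt(n, m, rng=random):
    """E_pk(m) = (1 + n)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    # (1 + n)^m equals 1 + m*n modulo n^2 by the binomial theorem.
    return ((1 + m * n) * pow(r, n, n2)) % n2


def decrypt(n, sk, c):
    """D_sk(c) = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) / n."""
    lam, mu = sk
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_h(n, c1, c2):
    """E(m1) +_h E(m2): multiplying ciphertexts adds the plaintexts."""
    return (c1 * c2) % (n * n)


def mul_h(n, k, c):
    """k x_h E(m): raising a ciphertext to k multiplies the plaintext by k."""
    return pow(c, k, n * n)


n, sk = keygen()
total = decrypt(n, sk, add_h(n, encrypt(n, 20), encrypt(n, 22)))
scaled = decrypt(n, sk, mul_h(n, 3, encrypt(n, 14)))
```

Both `total` and `scaled` decrypt to 42, matching $E_{pk}(m_1 + m_2)$ and $E_{pk}(cm)$ respectively. Note that `pow(lam, -1, n)` requires Python 3.8 or later.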

Note that Cramer et al. introduced a 3-move Σ-protocol. We can convert the Σ-protocol into a NIZK proof of knowledge by applying the classical Fiat-Shamir heuristic [21] in the random oracle model.
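To make the Fiat-Shamir conversion concrete, the sketch below applies it to a different and simpler Σ-protocol than the one used in the paper: Schnorr's proof of knowledge of a discrete logarithm. The verifier's random challenge is replaced by a hash of the statement and the prover's first message. The toy group parameters `P`, `Q`, `G`, the function names, and the use of SHA-256 as the random oracle are all assumptions for illustration only.

```python
import hashlib
import secrets

# Toy group (insecure): G = 2 generates a subgroup of prime order Q = 11
# in the multiplicative group modulo P = 23.
P, Q, G = 23, 11, 2


def challenge(*values):
    """Hash-derived challenge that replaces the verifier's random message."""
    data = ",".join(str(v) for v in values).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q


def prove(x):
    """Non-interactive proof of knowledge of x such that y = G^x mod P."""
    y = pow(G, x, P)
    r = secrets.randbelow(Q)          # prover's ephemeral secret
    t = pow(G, r, P)                  # first move of the Sigma-protocol
    c = challenge(G, y, t)            # second move, computed by hashing
    s = (r + c * x) % Q               # third move: the response
    return y, (t, s)


def verify(y, proof):
    """Accept iff G^s == t * y^c mod P for the recomputed challenge c."""
    t, s = proof
    c = challenge(G, y, t)
    return pow(G, s, P) == (t * pow(y, c, P)) % P


y, proof = prove(7)
ok = verify(y, proof)
```

The check holds because $G^s = G^{r + cx} = t \cdot y^c$; a response that is changed after the fact no longer satisfies it.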

III. BRIEF OVERVIEW OF OUR PROPOSED PROTOCOL

Brief description of our solution

Notation: $\rho_i$ denotes a cluster center, which is the sum of the shares of $P_0$ and $P_1$. $P_0$ holds its own cluster share $\rho^0_i$ and $P_1$ holds its own cluster share $\rho^1_i$, where $\rho_i = \rho^0_i + \rho^1_i$. A data object $x_i$ is used as a private input in the protocol and is to be clustered into a group; $x_i$ can be held by party $P_0$ or party $P_1$.

Input: (1) Databases $d^0$ and $d^1$, owned by party $P_0$ and party $P_1$ respectively, consisting of $n$ objects in total. (2) The total number of clusters $k$.

Output: The $k$ proper clusters together with the assignment of the data objects to them. The following steps are performed to get the output.

1) The two parties execute the Secure Data Standardization Protocol.
2) Party $P_0$ randomly selects $k$ objects from its database as random shares $\rho^0_1, \dots, \rho^0_k$. Symmetrically, party $P_1$ randomly selects $\rho^1_1, \dots, \rho^1_k$. Let $(\rho_1, \dots, \rho_k) = (\rho^0_1 + \rho^1_1, \dots, \rho^0_k + \rho^1_k)$ be the initial clusters.
3) Compute the distances for the numeric data with the Secure Distance Measuring Protocol.
4) Run the Secure Clustering Update Computation Protocol to assign the data objects to the closest cluster and to update the means of the clusters.
5) Run the Secure Iteration Stopping Protocol to check whether the difference between the clusters $\rho_1, \dots, \rho_k$ and the previous ones is minor. If not, go to step 3.

IV. PROPOSED PROTOCOL IN DETAILS

A. Secure Data Standardization Protocol

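Assume that $P_0$ and $P_1$ have $n_0$ and $n_1$ data entries in their databases $d^0$ and $d^1$, respectively, and, with a slight abuse of notation, let $d^0 = \sum_{j=1}^{n_0} d^0_j$ and $d^1 = \sum_{j=1}^{n_1} d^1_j$ denote the sums of their entries. We need the mean and the standard deviation, which are computed as

$$M = \frac{d^0 + d^1}{n_0 + n_1} \quad\text{and}\quad \sigma = \sqrt{\frac{1}{n_0 + n_1}\left\{\sum_{j=1}^{n_0} (d^0_j - M)^2 + \sum_{j=1}^{n_1} (d^1_j - M)^2\right\}}$$

using Protocol 4.1 and Protocol 4.2, respectively. Then both parties can get the standardized data $x'_i = \frac{x_i - M}{\sigma}$, where $x_i$ is an $l$-attribute data vector in the database.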
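The formulas above split naturally into per-party local sums that are then combined, which is what makes them suitable for additive homomorphic encryption. The following Python sketch shows only that plaintext arithmetic, with no encryption or proofs; the function names and the sample values are illustrative assumptions.

```python
import math


def local_sum_and_count(db):
    """What one party contributes to the mean: its sum d^f and count n_f."""
    return sum(db), len(db)


def local_squared_deviation(db, mean):
    """What one party contributes to sigma: sum_j (d^f_j - M)^2."""
    return sum((v - mean) ** 2 for v in db)


def standardize(db, mean, sigma):
    """x'_i = (x_i - M) / sigma for every entry of one party's database."""
    return [(v - mean) / sigma for v in db]


d0 = [2, 4, 4, 4]          # hypothetical database of P0
d1 = [5, 5, 7, 9]          # hypothetical database of P1

sum0, n0 = local_sum_and_count(d0)
sum1, n1 = local_sum_and_count(d1)
M = (sum0 + sum1) / (n0 + n1)

s0 = local_squared_deviation(d0, M)
s1 = local_squared_deviation(d1, M)
sigma = math.sqrt((s0 + s1) / (n0 + n1))

z0 = standardize(d0, M, sigma)
```

In Protocols 4.1 and 4.2 the two additions in each numerator and denominator are performed on ciphertexts, and only the final values $M$ and $\sigma$ are revealed.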


Protocol 4.1: Secure Mean calculation protocol in malicious model using threshold decryption.

Require: Two parties $P_0$ and $P_1$ with the shares $sk_0$ and $sk_1$ of the secret key $sk$ and databases $d^0$ and $d^1$ with $n_0$ and $n_1$ data entries, respectively.
Ensure: Return $M = (d^0 + d^1)/(n_0 + n_1)$ to both parties.

for all $P_f$ do
    Calculate $e_{d^f} = E_{pk}(d^f)$ and create $POK(e_{d^f})$
    Send $(e_{d^f}, POK(e_{d^f}))$ to $P_{1-f}$
end for
for all $P_f$ do
    Check $POK(e_{d^{1-f}})$ is valid else ABORT
    Calculate $e_{n_f} = E_{pk}(n_f)$ and create $POK(e_{n_f})$
    Send $(e_{n_f}, POK(e_{n_f}))$ to $P_{1-f}$
end for
for all $P_f$ do
    Check $POK(e_{n_{1-f}})$ is valid else ABORT
    Calculate $e_{d^0+d^1} = e_{d^0} +_h e_{d^1}$; $e_{n_0+n_1} = e_{n_0} +_h e_{n_1}$
end for
for all $P_f$ do
    Jointly call the oracle to compute $D_{sk}(e_{d^0+d^1})$, $D_{sk}(e_{n_0+n_1})$ and the value $M$
    Return $M$ to both parties
end for

Protocol 4.2: Secure Standard Deviation calculation protocol in malicious model using threshold decryption.

Require: Two parties $P_0$ and $P_1$ with the shares $sk_0$ and $sk_1$ of the secret key $sk$ and databases $d^0$ and $d^1$ with $n_0$ and $n_1$ data entries, respectively. Both $P_0$ and $P_1$ have $M$.
Ensure: Return $\sigma = \sqrt{\frac{1}{n_0+n_1}\left\{\sum_{j=1}^{n_0}(d^0_j - M)^2 + \sum_{j=1}^{n_1}(d^1_j - M)^2\right\}}$ to both parties.

for all $P_f$ do
    Calculate $e_{n_f} = E_{pk}(n_f)$ and create $POK(e_{n_f})$
    Send $(e_{n_f}, POK(e_{n_f}))$ to $P_{1-f}$
end for
for all $P_f$ do
    Check $POK(e_{n_{1-f}})$ is valid else ABORT
    Calculate $e_{n_0+n_1} = e_{n_0} +_h e_{n_1}$
end for
for all $P_f$ do
    $\forall j$, Calculate $z_j = d^f_j - M$; $z_j^2 = z_j \times z_j$
    Calculate $sum = \sum_{j=1}^{n_f} z_j^2 = z_1^2 + z_2^2 + \dots + z_{n_f}^2$
    Calculate $e_{s_f} = E_{pk}(sum)$ and create $POK(e_{s_f})$
    Send $(e_{s_f}, POK(e_{s_f}))$ to $P_{1-f}$
end for
for all $P_f$ do
    Check whether $POK(e_{s_{1-f}})$ is correct else ABORT
    Calculate $e_s = e_{s_0} +_h e_{s_1}$
end for
for all $P_f$ do
    Jointly call the oracle to compute $D_{sk}(e_{n_0+n_1})$, $D_{sk}(e_s)$ and the value $\sigma = \sqrt{s/(n_0+n_1)}$
    Return $\sigma$ to both parties
end for

B. Secure Distance Measuring Protocol

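Protocol 4.3: Secure Distance Measuring Protocol in malicious model using threshold decryption.

Require: Two parties $P_0$ and $P_1$ with the shares $sk_0$ and $sk_1$ of the secret key $sk$ and private data inputs $x^0_i$ and $x^1_i$. Both have cluster object shares $\rho^0_j$ and $\rho^1_j$.
Ensure: Return $\mathrm{Dist}^2(d_i, \rho_j)$ to party $P_f$, where $f$ may be either 0 or 1.

for $P_{1-f}$ do
    $\forall t$, Calculate $e_{\rho^{1-f}_{j,t}} = E_{pk}(\rho^{1-f}_{j,t})$; $e_{prod_1} = \prod_{t=1}^{l} e_{\rho^{1-f}_{j,t}} = \prod_{t=1}^{l} E_{pk}(\rho^{1-f}_{j,t})$;
    $\forall t$, Calculate $e_{x^{1-f}_{i,t}} = E_{pk}(x^{1-f}_{i,t})$; $e_{prod_2} = \prod_{t=1}^{l} e_{x^{1-f}_{i,t}} = \prod_{t=1}^{l} E_{pk}(x^{1-f}_{i,t})$;
    Create $POK(e_{prod_1})$ and $POK(e_{prod_2})$
    $\forall t$, Calculate $(\rho^{1-f}_{j,t})^2 = \rho^{1-f}_{j,t} \times \rho^{1-f}_{j,t}$; $s_1 = \sum_{t=1}^{l} (\rho^{1-f}_{j,t})^2$; set $e_{s_1} = E_{pk}(s_1)$;
    Create $POK(e_{s_1})$; Send encryptions and NIZK proofs to $P_f$
end for
for $P_f$ do
    Compute $e_{prod_1} = \prod_{t=1}^{l} e_{\rho^{1-f}_{j,t}}$, $e_{prod_2} = \prod_{t=1}^{l} e_{x^{1-f}_{i,t}}$;
    Check $POK(e_{prod_1})$, $POK(e_{prod_2})$, $POK(e_{s_1})$ are correct else ABORT
    $\forall t$, Calculate $e_{\rho^{1-f}_{j,t} \cdot \rho^{f}_{j,t}} = e_{\rho^{1-f}_{j,t}} \times_h \rho^{f}_{j,t}$
    Calculate $e_{s_2} = E_{pk}(\sum_{t=1}^{l} \rho^{1-f}_{j,t}\rho^{f}_{j,t}) = e_{\rho^{1-f}_{j,1} \cdot \rho^{f}_{j,1}} +_h e_{\rho^{1-f}_{j,2} \cdot \rho^{f}_{j,2}} +_h \dots +_h e_{\rho^{1-f}_{j,l} \cdot \rho^{f}_{j,l}}$
    $\forall t$, Calculate $e_{\rho^{1-f}_{j,t} \cdot x^{f}_{i,t}} = e_{\rho^{1-f}_{j,t}} \times_h x^{f}_{i,t}$
    Calculate $e_{s_3} = E_{pk}(\sum_{t=1}^{l} \rho^{1-f}_{j,t} x^{f}_{i,t}) = e_{\rho^{1-f}_{j,1} \cdot x^{f}_{i,1}} +_h e_{\rho^{1-f}_{j,2} \cdot x^{f}_{i,2}} +_h \dots +_h e_{\rho^{1-f}_{j,l} \cdot x^{f}_{i,l}}$
    $\forall t$, Calculate $e_{\rho^{f}_{j,t} \cdot x^{1-f}_{i,t}} = \rho^{f}_{j,t} \times_h e_{x^{1-f}_{i,t}}$
    Calculate $e_{s_4} = E_{pk}(\sum_{t=1}^{l} \rho^{f}_{j,t} x^{1-f}_{i,t}) = e_{\rho^{f}_{j,1} \cdot x^{1-f}_{i,1}} +_h e_{\rho^{f}_{j,2} \cdot x^{1-f}_{i,2}} +_h \dots +_h e_{\rho^{f}_{j,l} \cdot x^{1-f}_{i,l}}$
end for
for both parties do
    Jointly call the oracle along with $\sum_{t=1}^{l} (\rho^{f}_{j,t})^2$ and $\sum_{t=1}^{l} (x_{i,t})^2$ to get the decryptions of $e_{s_1}$, $e_{s_2}$, $e_{s_3}$ and $e_{s_4}$ as well as the value $\mathrm{Dist}^2(d_i, \rho_j)$
    Return $\mathrm{Dist}^2(d_i, \rho_j)$ to $P_f$
end for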

In the case of multi-attribute data, we can compute the distance function $\sum_{j=1}^{k} \sum_{i=1}^{n} \|x_i - C_j\|^2$ using the Euclidean distance, each term of which can be decomposed into $x_i^2 - 2 x_i C_j + C_j^2$. Let the data objects from both parties after providing cluster shares be $d_i = (x_{i,1}, \dots, x_{i,l})$, where $l$ is the number of attributes. Then the distance is calculated as follows:

$$\mathrm{Dist}^2(d_i, \rho_j) = (x_{i,1} - \rho_{j,1})^2 + (x_{i,2} - \rho_{j,2})^2 + \dots + (x_{i,l} - \rho_{j,l})^2$$
$$= \sum_{t=1}^{l} (x_{i,t})^2 + \sum_{t=1}^{l} (\rho^0_{j,t})^2 + \sum_{t=1}^{l} (\rho^1_{j,t})^2 + 2\sum_{t=1}^{l} \rho^0_{j,t}\rho^1_{j,t} - 2\sum_{t=1}^{l} \rho^0_{j,t} x^1_{i,t} - 2\sum_{t=1}^{l} \rho^1_{j,t} x^0_{i,t}$$

where $\rho_j$ is the $j$th cluster center and $1 \le j \le k$. We provide Protocol 4.3 to calculate the distances. The distances between a data object and all $k$ clusters form a distance data set: a data object $d_i$ has a data set consisting of $k$ distances after the protocol has been run $k$ times. Let $P_f$ run the protocol, where $f$ may be either 0 or 1. Some calculations follow [22]. From the above equation, we can see that $P_f$ can calculate $\sum_{t=1}^{l} (x_{i,t})^2$ and $\sum_{t=1}^{l} (\rho^f_{j,t})^2$ locally and privately. Computing the rest of the summations, however, requires both parties' data, so we have to preserve privacy for both parties. $P_f$ runs the protocol and learns only the summations needed to compute the distance function. Each of $P_0$ and $P_1$ runs Protocol 4.3 and obtains a distance data set for each of its own data objects.
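The split into local and jointly computed summations can be checked numerically. The Python sketch below makes a simplifying assumption that is not spelled out in this form in the text: the data object $x_i$ is held entirely by $P_f$, so every product involving only $x_i$ and $\rho^f_j$ is local to $P_f$, and only terms involving $\rho^{1-f}_j$ need the other party. Function names and sample values are illustrative.

```python
def dot(u, v):
    """Sum of attribute-wise products."""
    return sum(a * b for a, b in zip(u, v))


def dist2_direct(x, rho_f, rho_other):
    """Squared distance from x to the full center rho_f + rho_other."""
    return sum((xt - (a + b)) ** 2 for xt, a, b in zip(x, rho_f, rho_other))


def dist2_decomposed(x, rho_f, rho_other):
    """Same value, assembled from local terms and cross-party terms."""
    # Terms P_f can compute alone (assumption: x belongs to P_f).
    local = dot(x, x) + dot(rho_f, rho_f) - 2 * dot(rho_f, x)
    # Terms that involve the other party's share rho^{1-f}.
    s1 = dot(rho_other, rho_other)      # sum of squares of the other share
    s2 = dot(rho_other, rho_f)          # share-by-share cross term
    s3 = dot(rho_other, x)              # other share times the data object
    return local + s1 + 2 * s2 - 2 * s3


x = (3, 5, 2)            # hypothetical data object of P_f
rho_f = (1, 2, 0)        # P_f's share of the cluster center
rho_other = (1, 1, 1)    # P_{1-f}'s share of the cluster center

direct = dist2_direct(x, rho_f, rho_other)
decomposed = dist2_decomposed(x, rho_f, rho_other)
```

Here the full center is (2, 3, 1), so both computations give a squared distance of 6. In Protocol 4.3 the cross-party sums are obtained from ciphertexts with $\times_h$ and $+_h$ rather than in the clear.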

C. Secure Clustering Update Computation Protocol

Algorithm 1: Finding minimum of h numbers

Initialize min ← a[1]
for i = 2 to h do
    if min > a[i] then
        min ← a[i]
    end if
end for
Return min

Protocol 4.4: Secure Clustering Update Computation Protocol in malicious model using threshold decryption.

Require: Two parties $P_0$ and $P_1$ with the shares $sk_0$ and $sk_1$ of the secret key $sk$ and the sums of data points of each attribute in the $i$th cluster as $(s^0_1, \dots, s^0_l)$ and $(s^1_1, \dots, s^1_l)$ with $(n^0_1, \dots, n^0_l)$ and $(n^1_1, \dots, n^1_l)$ data entries, respectively.
Ensure: Return $\mu^0_i = (\mu^0_{i,1}, \dots, \mu^0_{i,l})$ to $P_0$ and $\mu^1_i = (\mu^1_{i,1}, \dots, \mu^1_{i,l})$ to $P_1$.

for all $P_f$ do
    $\forall j$, Calculate $e_{s^f_j} = E_{pk}(s^f_j)$; $e^f_{prod_1} = \prod_{j=1}^{l} e_{s^f_j} = \prod_{j=1}^{l} E_{pk}(s^f_j)$;
    $\forall j$, Calculate $e_{n^f_j} = E_{pk}(n^f_j)$; $e^f_{prod_2} = \prod_{j=1}^{l} e_{n^f_j} = \prod_{j=1}^{l} E_{pk}(n^f_j)$;
    Create $POK(e^f_{prod_1})$ and $POK(e^f_{prod_2})$
    Send the encryptions and NIZK proofs to $P_{1-f}$
end for
for all $P_f$ do
    Compute $e^{1-f}_{prod_1} = \prod_{j=1}^{l} e_{s^{1-f}_j}$, $e^{1-f}_{prod_2} = \prod_{j=1}^{l} e_{n^{1-f}_j}$;
    Check $POK(e^{1-f}_{prod_1})$ and $POK(e^{1-f}_{prod_2})$ are correct else ABORT
    $\forall j$, Calculate $e_{s^f_j + s^{1-f}_j} = s^f_j +_h e_{s^{1-f}_j}$
    $\forall j$, Calculate $e_{n^f_j + n^{1-f}_j} = n^f_j +_h e_{n^{1-f}_j}$
end for
for all $P_f$ do
    Jointly call the oracle to decrypt all the encryptions and to compute the values $\mu^0_{i,j}$ and $\mu^1_{i,j}$ for all attributes
    Return $\mu^0_i$ to $P_0$ and $\mu^1_i$ to $P_1$
end for

In this step, Algorithm 1 is used (replacing $h$ with $k$ and taking $a[i]$ to be the distance data set of a data point) to find the most suitable cluster center for each data point. The data points are then assigned to their nearest clusters. As a result, the clusters change and new centers are assigned to them. Once the data objects have been assigned to their nearest clusters, the cluster shares have changed as well. Both parties know their own shares, so each calculates the sum of each attribute in the $i$th cluster and the number of data objects for the corresponding attribute. The mean of the $j$th attribute in the $i$th cluster is $\mu_{i,j} = (s^0_j + s^1_j)/(n^0_j + n^1_j)$. Protocol 4.4 is used to calculate the new center of the $i$th cluster using this equation. Both parties run this protocol and get the $i$th cluster center $\mu_i = (\mu_{i,1}, \mu_{i,2}, \dots, \mu_{i,l})$ as two shares, $\mu^0_i$ (obtained by party $P_0$) and $\mu^1_i$ (obtained by party $P_1$). The protocol is given for the $i$th cluster, where $1 \le i \le k$, so finding the centers of all clusters requires running the protocol $k$ times.
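The assignment and update arithmetic described above can be sketched in plaintext as follows. This Python sketch only illustrates the selection of the closest cluster and the formula $\mu_{i,j} = (s^0_j + s^1_j)/(n^0_j + n^1_j)$; it does not model the encryption, the proofs, or how the result is split into the shares $\mu^0_i$ and $\mu^1_i$. The function names are assumptions, and the cluster is assumed to be non-empty.

```python
def closest_cluster(distances):
    """Index of the smallest distance (Algorithm 1, returning the index)."""
    best = 0
    for i in range(1, len(distances)):
        if distances[i] < distances[best]:
            best = i
    return best


def local_sums(points, l):
    """Per-attribute sums (s_1..s_l) and counts (n_1..n_l) for one party."""
    sums = [0] * l
    for p in points:
        for j in range(l):
            sums[j] += p[j]
    counts = [len(points)] * l
    return sums, counts


def cluster_mean(s0, n0, s1, n1):
    """mu_j = (s0_j + s1_j) / (n0_j + n1_j); assumes a non-empty cluster."""
    return [(a + b) / (c + d) for a, b, c, d in zip(s0, s1, n0, n1)]


# Hypothetical members of one cluster held by each party.
s0, n0 = local_sums([(1, 2), (3, 4)], l=2)
s1, n1 = local_sums([(5, 6)], l=2)
mu = cluster_mean(s0, n0, s1, n1)
nearest = closest_cluster([5.0, 1.5, 3.0])
```

With these values the combined mean is (3.0, 4.0), and the second cluster (index 1) is the closest one for the sample distance data set.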

D. Secure Iteration Stopping Protocol

k-means clustering is an iterative algorithm, so during the protocol execution we need to stop it when the output satisfies our requirements. For this step, we use a negligible threshold value $\varepsilon$. We iterate until the Euclidean distance between the results of two consecutive calculations is smaller than $\varepsilon$, i.e.,

$$\mathrm{Dist}(C_i, C_{i+1}) = \mathrm{Dist}(\rho^0_j + \rho^1_j, \mu^0_j + \mu^1_j) < \varepsilon$$

Protocol 4.5: Secure Iteration Stopping Protocol in malicious model using threshold decryption.

Require: Two parties $P_0$ and $P_1$ with the shares $sk_0$ and $sk_1$ of the secret key $sk$. They hold shares of the $j$th cluster center, $\mu^0_j$ and $\mu^1_j$ for the current iteration and $\rho^0_j$ and $\rho^1_j$ for the previous iteration, respectively. Suppose party $P_1$ wants to check for iteration stopping. First, both parties compute the distances using the Euclidean distance. Then Protocol 4.5 is used, which is defined for one cluster.
Ensure: Return 1 to $P_1$ if the difference of the two cluster centers is less than $\varepsilon'$, otherwise return 0.

for $P_0$ do
    Calculate $e_{sub_0} = E_{pk}(\rho^0_j - \mu^0_j)$
    Create $POK(e_{sub_0})$
    Send $(e_{sub_0}, POK(e_{sub_0}))$ to $P_1$
end for
for $P_1$ do
    Check $POK(e_{sub_0})$ is valid else ABORT
    Calculate $e_m = e_{sub_0} \times_h sub_1$, where $sub_1 = \rho^1_j - \mu^1_j$
end for
for all $P_f$ do
    Jointly call the oracle to calculate $D_{sk}(e_m)$
    If $m < \varepsilon'$ then return 1, else return 0 to $P_1$
end for

where j represents the cluster number, and i and i + 1 represent the previous and current iterations, respectively. We can transform the distance function into the following one: $(\rho^0_j + \rho^1_j) - (\mu^0_j + \mu^1_j) < \varepsilon'$, where ε′ is also a negligibly small threshold value. For the jth cluster, party P0 locally computes $E_{pk}(\rho^0_j - \mu^0_j)$ and party P1 locally computes $E_{pk}(\rho^1_j - \mu^1_j)$, both using the common public key according


to threshold decryption. If both parties pass their encrypted data to each other, they can perform homomorphic multiplication over the encrypted results. Using Protocol 4.5, party P1 obtains a value, either 1 or 0, for a particular cluster. P1 runs this protocol k times and obtains a set of k values (which we call the decision set) that indicates how each current cluster has changed compared with the previous one. If the decision set consists of k 1s, the iteration can be stopped; otherwise the computation continues.
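The homomorphic step for a single cluster can be illustrated with a toy Paillier-style additively homomorphic scheme. This is only a sketch under strong assumptions: the primes are far too small for real use, a single secret key stands in for the threshold decryption oracle, proofs of knowledge are omitted, the shares are treated as integers (e.g., fixed-point encodings), and the comparison uses the absolute value of the decrypted difference. The names `encrypt`, `decrypt`, and `stop_bit` are hypothetical.

```python
import math
import random

P, Q = 1009, 1013                  # toy primes, insecure by design
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(p-1, q-1)
MU = pow(LAM, -1, N)               # valid for generator g = N + 1 (Python 3.8+)


def encrypt(m, rng):
    """Encrypt integer m (negative values wrap modulo N)."""
    while True:
        r = rng.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return (1 + (m % N) * N) % N2 * pow(r, N, N2) % N2


def decrypt(c):
    """Single-key stand-in for the joint decryption oracle."""
    m = (pow(c, LAM, N2) - 1) // N * MU % N
    return m - N if m > N // 2 else m   # map back to a signed value


def stop_bit(rho0, mu0, rho1, mu1, eps, rng):
    """Return 1 if the recombined center moved by less than eps, else 0."""
    e_sub0 = encrypt(rho0 - mu0, rng)   # computed locally by P0
    e_sub1 = encrypt(rho1 - mu1, rng)   # computed locally by P1
    e_m = e_sub0 * e_sub1 % N2          # ciphertext product = plaintext sum
    m = decrypt(e_m)                    # (rho0 + rho1) - (mu0 + mu1)
    return 1 if abs(m) < eps else 0


rng = random.Random(0)
bit = stop_bit(120, 100, -15, 4, 5, rng)   # difference is 20 + (-19) = 1
```

Multiplying the two ciphertexts yields an encryption of the sum of the two local differences, so neither party sees the other's share in the clear before decryption.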

The final step of our algorithm replaces the old cluster centers with the new ones. Each party does this as follows:

Party P0 sets: $(\rho^0_1, \rho^0_2, \ldots, \rho^0_k) = (\mu^0_1, \mu^0_2, \ldots, \mu^0_k)$

Party P1 sets: $(\rho^1_1, \rho^1_2, \ldots, \rho^1_k) = (\mu^1_1, \mu^1_2, \ldots, \mu^1_k)$
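The end of one iteration, combining the decision set with the center replacement, might be sketched as follows. The decision bits are computed here in the clear as a stand-in for k runs of Protocol 4.5, each center share is a single number per cluster for brevity, and the function names are hypothetical.

```python
def decision_set(rho0, rho1, mu0, mu1, eps):
    """Plaintext stand-in for k runs of the stopping protocol."""
    return [
        1 if abs((r0 + r1) - (m0 + m1)) < eps else 0
        for r0, r1, m0, m1 in zip(rho0, rho1, mu0, mu1)
    ]


def end_of_iteration(rho0, rho1, mu0, mu1, eps):
    """Return (stop, new_rho0, new_rho1) for one k-means iteration."""
    bits = decision_set(rho0, rho1, mu0, mu1, eps)
    stop = all(bits)   # stop only when all k decision bits are 1
    # Each party overwrites its old shares with its own new shares.
    return stop, list(mu0), list(mu1)


stop, rho0, rho1 = end_of_iteration([1, 5], [2, 5], [4, 5], [2, 6], 1)
```

In this example the first cluster moved by 3, so the decision set contains a 0 and the computation would continue with the updated shares.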

V. PROPOSED PROTOCOL ANALYSIS

A. Security Analysis

Here we state a lemma that is used to minimize the communication cost without compromising privacy, as in [22]. We also state theorems on the security of our proposed protocol. The proofs are omitted due to lack of space and will appear in the full version.

Lemma 1: An adversary who does not know the input plaintexts of Protocol 4.3 and Protocol 4.4 can successfully cheat using its own input strings only with negligible probability.

Theorems 1 and 2: Protocols 4.1 and 4.2 are secure in the (decryption)-hybrid model, assuming that the NIZK protocols and the additive homomorphic encryptions used are secure in the malicious model.

Theorems 3 and 4: Protocols 4.3 and 4.4 are secure in the (decryption)-hybrid model, assuming that the NIZK protocols (given that Lemma 1 holds) and the additive homomorphic encryptions used are secure in the malicious model.

Theorem 5: Protocol 4.5 is secure in the (decryption)-hybrid model, assuming that the NIZK protocols and the additive homomorphic encryptions used are secure in the malicious model.

B. Complexity Analysis

A full description of the following complexities will appear in the full version.

Computational complexity: In Protocols 4.1 and 4.2, the complexity is O(1) for ciphertexts and proofs of knowledge (POK), and O(n) for the rest of the operations (n = size of the data). In Protocols 4.3 and 4.4, the complexity is O(1) for POK and O(l) for the rest of the operations (l = number of attributes). In Protocol 4.5, the complexity is O(1) for ciphertexts, POK, and subtraction, and O(l) for the rest of the operations.

Communication complexity: The complexity for POK is O(1) in all sub-protocols. The complexity for ciphertexts is O(1) in Protocols 4.1 and 4.2 and O(l) in the rest of the protocols.

VI. CONCLUSION

In this paper, we have proposed a new scheme based on NIZK proofs for privacy-preserving k-means clustering between two parties in the malicious model. Most existing privacy-preserving k-means schemes work in the semi-honest model and have various security and correctness problems. Our work addresses the more challenging malicious setting, guarantees stronger security, and overcomes such problems. Moreover, we have minimized the communication cost. As ours is a two-party model, an interesting direction for future work is to design private k-means clustering for more than two parties in the presence of malicious adversaries.

REFERENCES

[1] J. B. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. University of California Press, 1967, pp. 281–297.

[2] M. Upmanyu, A. M. Namboodiri, K. Srinathan, and C. V. Jawahar, “Efficient privacy preserving k-means clustering,” in PAISI, 2010, pp. 154–166.

[3] T.-K. Yu, D. T. Lee, S.-M. Chang, and J. Zhan, “Multi-party k-means clustering with privacy consideration,” in ISPA, 2010, pp. 200–207.

[4] C. Su, F. Bao, J. Zhou, T. Takagi, and K. Sakurai, “Security and correctness analysis on privacy-preserving k-means clustering schemes,” IEICE Transactions, vol. 92-A, no. 4, pp. 1246–1250, 2009.

[5] J. Vaidya and C. Clifton, “Privacy-preserving k-means clustering over vertically partitioned data,” in KDD, 2003, pp. 206–215.

[6] T. Okamoto and S. Uchiyama, “A new public-key cryptosystem as secure as factoring,” in EUROCRYPT, 1998, pp. 308–318.

[7] G. Jagannathan and R. N. Wright, “Privacy-preserving distributed k-means clustering over arbitrarily partitioned data,” in KDD, 2005, pp. 593–599.

[8] J. Bar-Ilan and D. Beaver, “Non-cryptographic fault-tolerant computing in constant number of rounds of interaction,” in PODC, 1989, pp. 201–209.

[9] S. Jha, L. Kruger, and P. McDaniel, “Privacy preserving clustering,” in ESORICS, 2005, pp. 397–417.

[10] G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright, “A new privacy-preserving distributed k-clustering algorithm,” in SDM, 2006.

[11] W. Du and M. J. Atallah, “Privacy-preserving cooperative statistical analysis,” in ACSAC, 2001, pp. 102–110.

[12] P. Bunn and R. Ostrovsky, “Secure two-party k-means clustering,” in ACM Conference on Computer and Communications Security, 2007, pp. 486–497.

[13] C. Su, F. Bao, J. Zhou, T. Takagi, and K. Sakurai, “Privacy-preserving two-party k-means clustering via secure approximation,” in AINA Workshops (1), 2007, pp. 385–391.

[14] C. Yildizli, T. B. Pedersen, Y. Saygin, E. Savas, and A. Levi, “Distributed privacy preserving clustering via homomorphic secret sharing and its application to (vertically) partitioned spatio-temporal data,” IJDWM, vol. 7, no. 1, pp. 46–66, 2011.

[15] S. Patel and D. C. Jinwala, “Privacy preserving distributed k-means clustering in malicious model,” http://precog.iiitd.edu.in/events/spsymposium13/SPSymposium files/SPsymposium-papers/SPsymposium-paper7.pdf, 2013.

[16] S. Patel, V. Patel, and D. C. Jinwala, “Privacy preserving distributed k-means clustering in malicious model using zero knowledge proof,” in ICDCIT, 2013, pp. 420–431.

[17] Y. Lindell and B. Pinkas, “Secure multiparty computation for privacy-preserving data mining,” The Journal of Privacy and Confidentiality, vol. 1, no. 1, pp. 59–98, 2009.

[18] P. Paillier, “Public-key cryptosystems based on composite degree residuosity classes,” in EUROCRYPT, 1999, pp. 223–238.

[19] R. Cramer, I. Damgard, and J. B. Nielsen, “Multiparty computation from threshold homomorphic encryption,” in EUROCRYPT, 2001, pp. 280–299.

[20] M. Kantarcioglu and O. Kardes, “Privacy-preserving data mining applications in the malicious model,” International Journal of Information and Computer Security, vol. 2, no. 4, pp. 353–375, 2008.

[21] A. Fiat and A. Shamir, “How to prove yourself: Practical solutions to identification and signature problems,” in CRYPTO, 1986, pp. 186–194.

[22] K. Emura, A. Miyaji, and M. S. Rahman, “Efficient privacy-preserving data mining in malicious model,” in ADMA (1), 2010, pp. 370–382.
