[ieee 2013 ieee 37th international computer software and applications conference workshops...
Privacy-Preserving Two-Party k-Means Clustering in Malicious Model
Rahena Akhter†, Rownak Jahan Chowdhury†, Keita Emura††, Tamzida Islam†, Mohammad Shahriar Rahman†, Nusrat Rubaiyat†
†Department of CSE, University of Asia Pacific (UAP), Dhaka, Bangladesh
††National Institute of Information and Communications Technology (NICT), Tokyo, Japan
Email: [email protected], [email protected], [email protected],
{rahena ratna, tamzida.Islam, kotha toru}@yahoo.com
Abstract: In data mining, clustering is a well-known and useful technique, and k-means clustering is one of the most powerful and frequently used clustering methods. Most cryptography-based privacy-preserving solutions proposed in recent years are in the semi-honest model, where the participating parties always follow the protocol. This model is realistic in many cases, but stronger solutions in the malicious model are more useful for many practical applications, because they protect a protocol from arbitrary malicious behavior using cryptographic tools. In this paper, we propose a new protocol for privacy-preserving two-party k-means clustering in the malicious model. We use threshold homomorphic encryption and non-interactive zero-knowledge protocols to construct our protocol according to the real/ideal world paradigm.
KeyWords: k-means clustering, privacy-preserving, malicious model, threshold two-party computation.
I. INTRODUCTION
k-means clustering is one of the simplest and most popular
learning algorithms in data mining for solving the well-known
clustering problem [1]. Its applications include character
recognition, encoding/decoding, market segmentation and
healthcare. Suppose two hospitals want to retrieve some data
from their joint databases, or two companies want to combine
their data to identify which product will be profitable.
k-means clustering can be used in such settings. Given n data
objects, the k-means clustering algorithm divides them into
k groups. Suppose there are two parties P0 and P1 in a
distributed network, each having a private document database,
denoted by d0 and d1 respectively. We assume a combined
database d = d0 ∪ d1 coming from P0 and P1, on which they
want to collaboratively compute the k-means clustering for
their common benefit. We need to preserve privacy for both
parties, which means that no information about either party's
database is revealed to the other. Privacy can be invaded by
adversaries. In the cryptography literature, adversaries are
generally classified into the following two categories.
Semi-honest adversary: A semi-honest adversary follows
the protocol specification faithfully but tries to learn extra
information from the message transcript during the execution.
Malicious adversary: A malicious adversary may deviate
arbitrarily from the protocol specification and can alter its
input during the execution.
On the basis of the above classification, we have constructed
our protocol in the malicious model so that we can provide
a stronger security guarantee.
A. Related work
Most of the existing privacy-preserving k-means schemes
(either two-party or multiparty [2][3]) deal with the
semi-honest model, and some of them have security problems [4].
Vaidya and Clifton introduced a solution [5] based on the
Okamoto-Uchiyama homomorphic cryptosystem [6], which is not
secure without random padding. Jagannathan et al.'s
proposal [7] applies a scheme [8] whose security is not proven,
and the whole protocol is inefficient in the clustering update
computation. In Jha et al.'s solution [9], the two parties
first compute their local k-means clusterings; the update
computation is executed afterwards. They use oblivious
polynomial evaluation to construct a private update computation
that joins the 2k clusters into k. The proposal has a
correctness problem because the result is not actually based
on the two parties' initial clusters. Another scheme by
Jagannathan et al. [10] uses a secure scalar product
protocol [11], which has the serious weakness that a leakage
of some database entries can reveal the whole database, and it
is very inefficient when the database is large. The scheme
in [12] uses Yao's garbled circuit technique, which is time
consuming. The scheme in [13] addresses the above problems,
but none of these schemes can deal with malicious adversaries.
The communication cost of these protocols is mostly
reasonable [14], so cost cannot be ignored when moving to a
stronger model. The authors of [15] give an idea of how to
convert the k-means clustering process from the semi-honest
model to the malicious model, but without any details about
the protocols, so we cannot compare its performance. In [16],
protocols are constructed for more than two participating
parties in the malicious model, using secret sharing with a
code-based identification scheme; however, secret sharing is
not efficient in some cases. A stronger security guarantee at
reasonable cost is what we want to achieve in this paper.
2013 IEEE 37th Annual Computer Software and Applications Conference Workshops
978-0-7695-4987-3/13 $26.00 © 2013 IEEE
DOI 10.1109/COMPSACW.2013.53
B. Our contribution
In this paper, we present an interactive protocol with
several sub-protocols using non-interactive zero-knowledge
(NIZK) proofs. The protocol is performed on the combined
database to get the target results. The overall clustering
computations are similar to those in [13] and [12]. We add
NIZK proofs to the computations to ensure secure
communication. Adding NIZK proofs may seem to be a trivial
way of converting protocols from the semi-honest model to the
malicious model. In fact, while constructing our protocols we
had to handle a number of cases. We need to perform a number
of encryptions using both parties' data, and decryption by
either party might cause infringement of the other party's
privacy through information leakage. To avoid such leakage, we
use oracle calls, through which we perform decryptions and
obtain the desired result of a protocol. Firstly, we construct
a privacy-preserving data standardization protocol, which
helps the parties avoid correctness problems. Next, we propose
a secure distance measuring protocol to compute the distances
between each cluster and each data object and to find the
closest cluster center for each data object. Then, we build a
secure update computation protocol for reassigning the data
points to their nearest clusters. Finally, we construct a
privacy-preserving iteration stopping protocol to securely
stop the clustering iteration.
Organization of the paper: We give a brief description
of some preliminaries in the next section. Section 3 contains
a short overview of our proposed protocol. The details of
our proposed protocol are in Section 4. In Section 5 we give
the security analysis. Finally, in Section 6 we draw a
conclusion.
II. PRELIMINARIES
A. Brief review of k-means clustering
k-means clustering is an iterative method of clustering in
which n data items are grouped into k clusters. Each data
item belongs to the cluster with the nearest mean. The data
items from different clusters will have different characteristics.
The algorithm is as follows:
1) At first choose k center points randomly.
2) Using Euclidean distance, each sample is assigned to the
nearest cluster.
Euclidean distance: $Dist_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
3) Calculate center of each cluster.
4) If centers are unchanged then finish, otherwise repeat
from step 2.
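The four steps above can be sketched in plain (non-private) Python. This is only an illustration of the underlying algorithm, not of the secure protocol; the function names, the `seed` and `max_iter` parameters, and the handling of empty clusters are assumptions made for the example.

```python
import random

def squared_distance(x, y):
    # Squared Euclidean distance; the square root is not needed for comparisons.
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: choose k center points randomly from the data.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = []
    for _ in range(max_iter):
        # Step 2: assign each sample to the nearest center.
        assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Step 3: recompute the center (mean) of each cluster.
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        # Step 4: stop when the centers are unchanged, otherwise repeat.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assignment
```

The sketch compares squared distances, which yields the same nearest center as comparing the distances themselves.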
B. Problem specification and notations
In this paper, we assume that there are two parties P0 and P1,
each having a private database. They want to gain a common
benefit by doing clustering analysis over their joint databases.
We assume that they do their computations in the presence of
malicious adversaries who can change the input during protocol
execution. We also assume that:
Party P0 has database $d^0 = \{d^0_1, \ldots, d^0_{n_0}\}$ with $n_0$ entries.
Party P1 has database $d^1 = \{d^1_1, \ldots, d^1_{n_1}\}$ with $n_1$ entries.
Combined database $d = \{d_1, \ldots, d_n\}$.
We take an object $d_i$ as a vector $d_i = (x_{i,1}, \ldots, x_{i,l})$, where x denotes the attribute variable and l denotes the number of attributes.
C. Security definitions in malicious model
We have constructed our protocol based on secure multi-party
computation [17]. In the case of malicious adversaries, a
party may refuse to participate in the protocol, may substitute
its local input or may abort the protocol prematurely. The
adversary may also obtain its output while the honest party
does not. These adversarial behaviors cannot be prevented in
general, so the definition of security does not rule them out.
The definition below is formalized according to the real/ideal
world paradigm.
Definition 1. Let P0 and P1 be the parties and let I denote the
indices of the corrupted parties, which are controlled by an
adversary A, a non-uniform probabilistic polynomial-time
machine. Zero, one or both parties may be corrupted. For
simplicity, we assume that exactly one of the two parties is
corrupted, i.e., either I = {1} or I = {2}.
Let f : {0, 1}∗ × {0, 1}∗ → {0, 1}∗ × {0, 1}∗ be a function
where f = (f1, f2). Then the ideal execution of f (on inputs
(x1, x2), auxiliary input z to A and security parameter n) is
the output pair of the honest party and the adversary A,
denoted by $\mathrm{IDEAL}_{f,A(z),I}(n, x_1, x_2)$.
Definition 2. In the real model, a real two-party protocol π
is executed (no trusted third party exists). The adversary A
sends all messages in place of the corrupted party and may
follow an arbitrary polynomial-time strategy, whereas the
honest party follows the instructions of π.
Let f be as above and let π be a two-party protocol for
computing f. As above, the real execution of π is the output
pair of the honest party and the adversary A, denoted by
$\mathrm{REAL}_{\pi,A(z),I}(n, x_1, x_2)$.
Definition 3. Let f and π be as described above. Protocol π is
said to securely compute f with abort in the presence of
malicious adversaries if, for every non-uniform probabilistic
polynomial-time adversary A for the real model, there exists a
non-uniform probabilistic expected polynomial-time adversary S
for the ideal model such that, for every I, every
x1, x2 ∈ {0, 1}∗ with |x1| = |x2|, and every auxiliary input
z ∈ {0, 1}∗:
$\{\mathrm{IDEAL}_{f,S(z),I}(n, x_1, x_2)\}_{n \in \mathbb{N}} \equiv \{\mathrm{REAL}_{\pi,A(z),I}(n, x_1, x_2)\}_{n \in \mathbb{N}}$
Here, ≡ denotes computational indistinguishability. To prove
that the composed protocol is secure, we first show that
the protocol is secure given a trusted party (or oracle calls)
that implements the functions used as a subroutine for the
composed protocol. Later, we can replace the oracle calls with
secure protocols without violating security. The execution of
the protocol π in the hybrid model proceeds as in the real
model, except that the parties have oracle access to a trusted
party T for evaluating the two-party functions that proceed as
in the ideal model. Similarly, the security in the hybrid model
requires that for any adversary operating in the hybrid model,
there exists an adversary in the ideal model such that they are
indistinguishable.
D. Cryptographic techniques
♦ Public-key encryption: A public-key encryption scheme
on a message space M consists of three algorithms
(K,E,D):
– The key generation algorithm K outputs a random
pair of secret/public keys (sk, pk).
– The encryption algorithm Epk(m) outputs a ciphertext e
corresponding to the plaintext m ∈ M.
– The decryption algorithm Dsk(e) outputs the plaintext m
associated with the ciphertext e.
♦ Homomorphic public-key encryption: Homomorphic
public-key encryption is a form of encryption that allows
specific types of computations to be performed on encrypted
data without decrypting it. The result is a ciphertext of
the result of the same operations performed on the
plaintexts. A public-key encryption scheme is called
additive homomorphic (e.g., the Paillier scheme [18]) if it
satisfies the following requirements:
– Given the encryptions Epk(m1) and Epk(m2) of m1 and m2,
there exists an efficient algorithm to compute the
public-key encryption of m1 + m2, denoted by
Epk(m1 + m2) := Epk(m1) +h Epk(m2).
– Given a constant c and the encryption Epk(m) of m,
there exists an efficient algorithm to compute the
public-key encryption of cm, denoted by
Epk(cm) := c ×h Epk(m).
Suppose a chain of services from different companies wants
to calculate the tax, the currency exchange rate or the
shipping cost on a transaction without exposing the
unencrypted data to each of those services. Homomorphic
encryption allows this chaining. Homomorphic encryption has
also been used for private information retrieval for many
years.
♦ NIZK protocols: We use efficient NIZK protocols to prove
that the actions taken by the parties are correct, without
revealing any other information. We give a brief description
of those protocols here; the implementation details can be
found in [19], [20].
– Threshold decryption (in the case of two-party
computation) – Suppose there is a common public key pk and
a secret key sk corresponding to pk. The secret key sk is
divided into two pieces sk0 and sk1. There exists an
efficient and secure protocol $D_{sk_i}(E_{pk}(m))$ that
outputs a random share $s_i$ of the decryption result along
with the NIZK proof $POK(sk_i, E_{pk}(m), s_i)$, which shows
that $sk_i$ is used correctly. The shares can be combined to
calculate the decryption result. A single share $sk_i$ of
the secret key cannot be used to decrypt the ciphertext
alone; in other words, $s_i$ reveals nothing about the final
decryption result. We also use a special version of
threshold decryption where only one party learns the
decryption result.
– Proving that a party knows a plaintext – A party Pi can
compute the proof of plaintext knowledge $POK(e_m)$ if it
knows an element m in the domain of valid plaintexts such
that $D_{sk}(e_m) = m$.
Note that Cramer et al. introduced a 3-move Σ-protocol. We
can convert the Σ-protocol into a NIZK proof of knowledge by
applying the classical Fiat-Shamir heuristic [21] in the random
oracle model.
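To make the two additive-homomorphic operations +h and ×h concrete, the following is a toy Paillier-style sketch. It uses a single (non-threshold) secret key, tiny fixed primes and the common simplification g = n + 1, all of which are assumptions for illustration; it is insecure and is not the threshold scheme or the NIZK machinery used in this paper. The names `keygen`, `encrypt`, `decrypt`, `add_h` and `mul_h` are hypothetical.

```python
import math
import random

def keygen(p=1009, q=1013):
    # Toy key generation with tiny fixed primes (insecure, illustration only).
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m, rng=random):
    # E_pk(m) = (n + 1)^m * r^n mod n^2 for a random r coprime to n.
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, sk, c):
    # D_sk(c) = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) / n.
    lam, mu = sk
    u = pow(c, lam, n * n)
    return ((u - 1) // n) * mu % n

def add_h(n, c1, c2):
    # E(m1) +h E(m2): multiplying ciphertexts adds the plaintexts.
    return (c1 * c2) % (n * n)

def mul_h(n, k, c):
    # k ×h E(m): raising a ciphertext to a constant multiplies the plaintext.
    return pow(c, k, n * n)
```

For example, with `n, sk = keygen()`, decrypting `add_h(n, encrypt(n, 20), encrypt(n, 22))` recovers the sum of the two plaintexts, and decrypting `mul_h(n, 6, encrypt(n, 7))` recovers their product, without either plaintext being exposed during the computation.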
III. BRIEF OVERVIEW OF OUR PROPOSED PROTOCOL
Brief description of our solution
Notation: $\rho_i$ denotes the clustering center, which is the
sum of the shares of P0 and P1. P0 holds its own cluster share
$\rho^0_i$ and P1 holds its own cluster share $\rho^1_i$, where
$\rho_i = \rho^0_i + \rho^1_i$.
A data object $x_i$ is used as a private input in the protocol
to be clustered into a group. $x_i$ can be held by party P0 or
party P1.
Input: (1) Databases d0 and d1, owned by party P0 and party P1
respectively, consisting of n objects in total.
(2) The total number of clusters k.
Output: The k clusters, together with the assignment of the
data objects to them. The following steps are performed to get
the output.
1) The two parties execute the Secure Data Standardization
Protocol.
2) Party P0 randomly selects k objects from its database as
random shares $\rho^0_1, \ldots, \rho^0_k$. Symmetrically,
party P1 randomly selects $\rho^1_1, \ldots, \rho^1_k$. Let
$(\rho_1, \ldots, \rho_k) = (\rho^0_1 + \rho^1_1, \ldots, \rho^0_k + \rho^1_k)$
be the initial cluster centers.
3) Compute the distances between each data object and each
cluster center with the Secure Distance Measuring Protocol.
4) Run the Secure Clustering Update Computation Protocol to
assign the data objects to the closest cluster and to update
the means of the clusters.
5) Run the Secure Iteration Stopping Protocol to check
whether the difference between the clusters
$\rho_1, \ldots, \rho_k$ and the previous ones is minor. If
not, go to step 3.
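Step 2 above can be illustrated in the clear as follows. The sketch only shows how the initial centers arise as attribute-wise sums of the two parties' shares; in the protocol the shares stay private and are never added in plaintext by a single party. The function names are hypothetical.

```python
import random

def pick_shares(database, k, rng):
    # Each party selects k of its own objects as its cluster shares.
    return rng.sample(database, k)

def combine_shares(shares0, shares1):
    # rho_i = rho^0_i + rho^1_i, computed attribute by attribute.
    return [[a + b for a, b in zip(u, v)] for u, v in zip(shares0, shares1)]
```

Each party keeps its own list of shares between iterations, which is why the later update step returns the new center as two shares rather than as a single value.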
IV. PROPOSED PROTOCOL IN DETAILS
A. Secure Data Standardization Protocol
Assume P0 and P1 have $n_0$ and $n_1$ data entries in their
databases d0 and d1, respectively, where
$d^0 = \sum_{j=1}^{n_0} d^0_j$ and $d^1 = \sum_{j=1}^{n_1} d^1_j$.
We need the mean and the standard deviation, which are computed as
$M = \frac{d^0 + d^1}{n_0 + n_1}$ and
$\sigma = \sqrt{\frac{1}{n_0 + n_1}\left\{\sum_{j=1}^{n_0} (d^0_j - M)^2 + \sum_{j=1}^{n_1} (d^1_j - M)^2\right\}}$
using Protocol 4.1 and Protocol 4.2, respectively. Then
both parties can get the standardized data $x'_i = \frac{x_i - M}{\sigma}$,
where $x_i$ is an l-attribute data vector in the database.
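The quantities that the two standardization protocols compute can be checked in plaintext. The sketch below assumes single-attribute (scalar) entries and shows that M and σ depend on each database only through local sums, which is what allows each party to encrypt one aggregate value; it performs no encryption, and the function names are hypothetical.

```python
import math

def joint_mean(d0, d1):
    # M = (d^0 + d^1) / (n_0 + n_1), where d^f is the local sum of party P_f.
    return (sum(d0) + sum(d1)) / (len(d0) + len(d1))

def joint_std(d0, d1, mean):
    # Each party contributes only its local sum of squared deviations from M.
    s0 = sum((v - mean) ** 2 for v in d0)
    s1 = sum((v - mean) ** 2 for v in d1)
    return math.sqrt((s0 + s1) / (len(d0) + len(d1)))

def standardize(database, mean, std):
    # x' = (x - M) / sigma, applied locally by each party to its own data.
    return [(v - mean) / std for v in database]
```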
Protocol 4.1: Secure Mean calculation protocol in malicious
model using threshold decryption.
Require: Two parties P0 and P1 with the shares sk0 and
sk1 of the secret key sk and databases d0 and d1 with n0
and n1 data entries, respectively.
Ensure: Return M = (d0 + d1)/(n0 + n1) to both parties.
for all $P_f$ do
  Calculate $e_{d^f} = E_{pk}(d^f)$ and create $POK(e_{d^f})$
  Send $(e_{d^f}, POK(e_{d^f}))$ to $P_{1-f}$
end for
for all $P_f$ do
  Check $POK(e_{d^{1-f}})$ is valid else ABORT
  Calculate $e_{n_f} = E_{pk}(n_f)$ and create $POK(e_{n_f})$
  Send $(e_{n_f}, POK(e_{n_f}))$ to $P_{1-f}$
end for
for all $P_f$ do
  Check $POK(e_{n_{1-f}})$ is valid else ABORT
  Calculate $e_{d^0+d^1} = e_{d^0} +_h e_{d^1}$; $e_{n_0+n_1} = e_{n_0} +_h e_{n_1}$
end for
for all $P_f$ do
  Jointly call the oracle to compute $D_{sk}(e_{d^0+d^1})$, $D_{sk}(e_{n_0+n_1})$ and the value M
  Return M to both parties
end for
Protocol 4.2: Secure Standard Deviation calculation
protocol in malicious model using threshold decryption.
Require: Two parties P0 and P1 with the shares sk0 and
sk1 of the secret key sk and database d0 and d1 with n0
and n1 data entries, respectively. Both P0 and P1 have M .
Ensure: Return $\sigma = \sqrt{\frac{1}{n_0 + n_1}\left\{\sum_{j=1}^{n_0} (d^0_j - M)^2 + \sum_{j=1}^{n_1} (d^1_j - M)^2\right\}}$ to both parties.
for all $P_f$ do
  Calculate $e_{n_f} = E_{pk}(n_f)$ and create $POK(e_{n_f})$
  Send $(e_{n_f}, POK(e_{n_f}))$ to $P_{1-f}$
end for
for all $P_f$ do
  Check $POK(e_{n_{1-f}})$ is valid else ABORT
  Calculate $e_{n_0+n_1} = e_{n_0} +_h e_{n_1}$
end for
for all $P_f$ do
  $\forall j$, Calculate $z_j = d^f_j - M$; $z_j^2 = (z_j) \times (z_j)$
  Calculate $sum = \sum_{j=1}^{n_f} z_j^2 = z_1^2 + z_2^2 + \cdots + z_{n_f}^2$
  Calculate $e_{s_f} = E_{pk}(sum)$ and create $POK(e_{s_f})$
  Send $(e_{s_f}, POK(e_{s_f}))$ to $P_{1-f}$
end for
for all $P_f$ do
  Check whether $POK(e_{s_{1-f}})$ is correct else ABORT
  Calculate $e_s = e_{s_0} +_h e_{s_1}$
end for
for all $P_f$ do
  Jointly call the oracle to compute $D_{sk}(e_{n_0+n_1})$, $D_{sk}(e_s)$ and the value $\sigma = \sqrt{\frac{s}{n_0 + n_1}}$
  Return σ to both parties
end for
B. Secure Distance Measuring Protocol
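Protocol 4.3: Secure Distance Measuring Protocol in malicious model using threshold decryption.
Require: Two parties P0 and P1 with the shares sk0 and sk1 of the secret key sk and private data inputs $x^0_i$ and $x^1_i$. Both have cluster object shares $\rho^0_j$ and $\rho^1_j$.
Ensure: Return $Dist^2(d_i, \rho_j)$ to party $P_f$ where f may be either 0 or 1.
for $P_{1-f}$ do
  $\forall t$, Calculate $e_{\rho^{1-f}_{j,t}} = E_{pk}(\rho^{1-f}_{j,t})$; $e_{prod_1} = \prod_{t=1}^{l} e_{\rho^{1-f}_{j,t}} = \prod_{t=1}^{l} E_{pk}(\rho^{1-f}_{j,t})$;
  $\forall t$, Calculate $e_{x^{1-f}_{i,t}} = E_{pk}(x^{1-f}_{i,t})$; $e_{prod_2} = \prod_{t=1}^{l} e_{x^{1-f}_{i,t}} = \prod_{t=1}^{l} E_{pk}(x^{1-f}_{i,t})$;
  Create $POK(e_{prod_1})$ and $POK(e_{prod_2})$
  $\forall t$, Calculate $(\rho^{1-f}_{j,t})^2 = \rho^{1-f}_{j,t} \times \rho^{1-f}_{j,t}$; $s_1 = \sum_{t=1}^{l} (\rho^{1-f}_{j,t})^2$; set $e_{s_1} = E_{pk}(s_1)$;
  Create $POK(e_{s_1})$; Send encryptions and NIZK proofs to $P_f$
end for
for $P_f$ do
  Compute $e_{prod_1} = \prod_{t=1}^{l} e_{\rho^{1-f}_{j,t}}$, $e_{prod_2} = \prod_{t=1}^{l} e_{x^{1-f}_{i,t}}$;
  Check $POK(e_{prod_1})$, $POK(e_{prod_2})$, $POK(e_{s_1})$ are correct else ABORT
  $\forall t$, Calculate $e_{\rho^{1-f}_{j,t} \cdot \rho^f_{j,t}} = e_{\rho^{1-f}_{j,t}} \times_h \rho^f_{j,t}$
  Calculate $e_{s_2} = E_{pk}(\sum_{t=1}^{l} \rho^{1-f}_{j,t} \rho^f_{j,t}) = e_{\rho^{1-f}_{j,1} \cdot \rho^f_{j,1}} +_h e_{\rho^{1-f}_{j,2} \cdot \rho^f_{j,2}} +_h \cdots +_h e_{\rho^{1-f}_{j,l} \cdot \rho^f_{j,l}}$
  $\forall t$, Calculate $e_{\rho^{1-f}_{j,t} \cdot x^f_{i,t}} = e_{\rho^{1-f}_{j,t}} \times_h x^f_{i,t}$
  Calculate $e_{s_3} = E_{pk}(\sum_{t=1}^{l} \rho^{1-f}_{j,t} x^f_{i,t}) = e_{\rho^{1-f}_{j,1} \cdot x^f_{i,1}} +_h e_{\rho^{1-f}_{j,2} \cdot x^f_{i,2}} +_h \cdots +_h e_{\rho^{1-f}_{j,l} \cdot x^f_{i,l}}$
  $\forall t$, Calculate $e_{\rho^f_{j,t} \cdot x^{1-f}_{i,t}} = \rho^f_{j,t} \times_h e_{x^{1-f}_{i,t}}$
  Calculate $e_{s_4} = E_{pk}(\sum_{t=1}^{l} \rho^f_{j,t} x^{1-f}_{i,t}) = e_{\rho^f_{j,1} \cdot x^{1-f}_{i,1}} +_h e_{\rho^f_{j,2} \cdot x^{1-f}_{i,2}} +_h \cdots +_h e_{\rho^f_{j,l} \cdot x^{1-f}_{i,l}}$
end for
for both parties do
  Jointly call the oracle along with $\sum_{t=1}^{l} (\rho^f_{j,t})^2$ and $\sum_{t=1}^{l} (x_{i,t})^2$ to get the decryptions of $e_{s_1}$, $e_{s_2}$, $e_{s_3}$ and $e_{s_4}$ as well as the value $Dist^2(d_i, \rho_j)$
  Return $Dist^2(d_i, \rho_j)$ to $P_f$
end for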
In case of multi-attribute data, we can compute the distance
function $\sum_{j=1}^{k} \sum_{i=1}^{n} \|x_i - C_j\|^2$ using the
Euclidean distance, each term of which can be decomposed into
$x_i^2 - 2 x_i C_j + C_j^2$. Let the data objects from both
parties after providing cluster shares be
$d_i = (x_{i,1}, \ldots, x_{i,l})$ where l is the number of
attributes. Then the distance is calculated as follows:
$Dist^2(d_i, \rho_j) = (x_{i,1} - \rho_{j,1})^2 + (x_{i,2} - \rho_{j,2})^2 + \ldots + (x_{i,l} - \rho_{j,l})^2$
$= \sum_{t=1}^{l} (x_{i,t})^2 + \sum_{t=1}^{l} (\rho^0_{j,t})^2 + \sum_{t=1}^{l} (\rho^1_{j,t})^2 + 2\sum_{t=1}^{l} \rho^0_{j,t}\rho^1_{j,t} - 2\sum_{t=1}^{l} \rho^0_{j,t} x^1_{i,t} - 2\sum_{t=1}^{l} \rho^1_{j,t} x^0_{i,t}$
where $\rho_j$ is the jth cluster center and 1 ≤ j ≤ k. We
provide Protocol 4.3 to calculate distances. The distances
between a data object and all the k clusters form a distance
data set: a data object $d_i$ has a data set of k distances
after the protocol is run k times. Let $P_f$ run the protocol,
where f may be either 0 or 1. Some calculations follow [22].
From the above equation, we can see that $P_f$ can calculate
$\sum_{t=1}^{l} (x_{i,t})^2$ and $\sum_{t=1}^{l} (\rho^f_{j,t})^2$
locally and privately. To compute the rest of the summations,
however, both parties' data is needed, so we have to preserve
privacy for both parties. $P_f$ runs the protocol and learns
only the summations needed to compute the distance function.
Each of P0 and P1 runs Protocol 4.3 and gets a distance data
set for each of its own data objects.
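The split between local and joint summations can be checked numerically. The sketch below works in plaintext and assumes, for simplicity, that the data object x is held entirely by the party running the protocol (called P_f here), so the expansion used is the standard algebraic one for the squared distance between x and the sum of the two center shares; this grouping of terms is an illustration and may not match the exact terms of the equation above. The function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def squared_distance_direct(x, rho_f, rho_other):
    # Reference value: squared distance from x to the center rho_f + rho_other.
    return sum((xt - (a + b)) ** 2 for xt, a, b in zip(x, rho_f, rho_other))

def squared_distance_from_sums(x, rho_f, rho_other):
    # Terms P_f can compute alone from its own data and its own share.
    local = dot(x, x) + dot(rho_f, rho_f) - 2 * dot(rho_f, x)
    # Terms that involve the other party's share; in the protocol these
    # would only be available under encryption until the oracle call.
    s1 = dot(rho_other, rho_other)
    s2 = dot(rho_other, rho_f)
    s3 = dot(rho_other, x)
    return local + s1 + 2 * s2 - 2 * s3
```

Only `s1`, `s2` and `s3` need the other party's input, and each is a single inner product, which is why additive-homomorphic operations (scalar multiplication followed by addition of ciphertexts) are sufficient to evaluate them.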
C. Secure Clustering Update Computation Protocol
Algorithm 1: Finding minimum of h numbers
Initialize min ← a[1]
for i = 2 to h do
  if min > a[i] then
    min ← a[i]
  end if
end for
Return min
Protocol 4.4: Secure Clustering Update Computation Protocol in malicious model using threshold decryption.
Require: Two parties P0 and P1 with the shares sk0 and sk1 of the secret key sk and sum of data points of each attribute in ith cluster as $(s^0_1, \ldots, s^0_l)$ and $(s^1_1, \ldots, s^1_l)$ with $(n^0_1, \ldots, n^0_l)$ and $(n^1_1, \ldots, n^1_l)$ data entries, respectively.
Ensure: Return $\mu^0_i = (\mu^0_{i,1}, \ldots, \mu^0_{i,l})$ to P0 and $\mu^1_i = (\mu^1_{i,1}, \ldots, \mu^1_{i,l})$ to P1.
for all $P_f$ do
  $\forall j$, Calculate $e_{s^f_j} = E_{pk}(s^f_j)$; $e^f_{prod_1} = \prod_{j=1}^{l} e_{s^f_j} = \prod_{j=1}^{l} E_{pk}(s^f_j)$;
  $\forall j$, Calculate $e_{n^f_j} = E_{pk}(n^f_j)$; $e^f_{prod_2} = \prod_{j=1}^{l} e_{n^f_j} = \prod_{j=1}^{l} E_{pk}(n^f_j)$;
  Create $POK(e^f_{prod_1})$ and $POK(e^f_{prod_2})$
  Send the encryptions and NIZK proofs to $P_{1-f}$
end for
for all $P_f$ do
  Compute $e^{1-f}_{prod_1} = \prod_{j=1}^{l} e_{s^{1-f}_j}$, $e^{1-f}_{prod_2} = \prod_{j=1}^{l} e_{n^{1-f}_j}$;
  Check $POK(e^{1-f}_{prod_1})$ and $POK(e^{1-f}_{prod_2})$ are correct else ABORT
  $\forall j$, Calculate $e_{s^f_j + s^{1-f}_j} = s^f_j +_h e_{s^{1-f}_j}$
  $\forall j$, Calculate $e_{n^f_j + n^{1-f}_j} = n^f_j +_h e_{n^{1-f}_j}$
end for
for all $P_f$ do
  Jointly call the oracle to decrypt all the encryptions and to compute the values $\mu^0_{i,j}$ and $\mu^1_{i,j}$ for all attributes
  Return $\mu^0_i$ to P0 and $\mu^1_i$ to P1
end for
In this step, Algorithm 1 is used (replacing h with k and
taking a[i] to be the distance data set of a data point) to
find the closest clustering center for each data point. The
data points are then assigned to their nearest clusters. As a
result, the clusters change and new centers must be computed
for them. When data objects have been assigned to their
nearest clusters, the cluster shares also change. Both parties
know their own shares, so they calculate the sum of each
attribute in the ith cluster and the number of data objects
for the corresponding attribute. The mean of the jth attribute
in the ith cluster is
$\mu_{i,j} = (s^0_j + s^1_j)/(n^0_j + n^1_j)$. Protocol 4.4 is
used to calculate the new center of the ith cluster using this
equation. Both parties run this protocol and get the ith
cluster center $\mu_i = (\mu_{i,1}, \mu_{i,2}, \ldots, \mu_{i,l})$
as two shares $\mu^0_i$ (held by party P0) and $\mu^1_i$ (held
by party P1). The protocol is stated for the ith cluster, where
1 ≤ i ≤ k, so finding the centers of all clusters requires
running the protocol k times.
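A plaintext sketch of this update step is given below: choosing the nearest cluster from a distance data set, computing the new mean from the two parties' local sums and counts, and handing the result back as two additive shares. The text does not specify how the oracle splits the center into shares, so the random additive split in `split_into_shares` is an assumption, as are all function names.

```python
import random

def nearest_cluster(distances):
    # Index of the smallest entry in a point's distance data set (Algorithm 1).
    return min(range(len(distances)), key=lambda i: distances[i])

def updated_center(s0, n0, s1, n1):
    # mu_{i,j} = (s^0_j + s^1_j) / (n^0_j + n^1_j) for every attribute j.
    return [(a + b) / (c + d) for a, b, c, d in zip(s0, s1, n0, n1)]

def split_into_shares(center, rng):
    # Assumed sharing: mu^0 is random and mu^1 = mu - mu^0, so mu^0 + mu^1 = mu.
    share0 = [rng.uniform(-1.0, 1.0) for _ in center]
    share1 = [m - r for m, r in zip(center, share0)]
    return share0, share1
```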
D. Secure Iteration Stopping Protocol
We know that k-means clustering is an iterative algorithm, so
during the protocol execution we need to stop it when the
output satisfies our requirements. For this step, we consider
a small threshold value ε. We iterate until the Euclidean
distance between the cluster centers of two consecutive
iterations is smaller than ε, i.e.,
$Dist(C_i, C_{i+1}) = Dist(\rho^0_j + \rho^1_j, \mu^0_j + \mu^1_j) < \epsilon$
Protocol 4.5: Secure Iteration Stopping Protocol in malicious model using threshold decryption.
Require: Two parties P0 and P1 with the shares sk0 and sk1 of the secret key sk. They both have shares of the jth cluster center, $\mu^0_j$ and $\mu^1_j$ for the current iteration and $\rho^0_j$ and $\rho^1_j$ for the previous iteration, respectively. Let party P1 be the one that wants to check for iteration stopping. Firstly, both parties compute the distances using Euclidean distance. Then Protocol 4.5 is used, which is defined for one cluster.
Ensure: Return 1 to P1 if the difference of the two cluster centers is less than ε′, otherwise return 0.
for P0 do
  Calculate $e_{sub_0} = E_{pk}(\rho^0_j - \mu^0_j)$
  Create $POK(e_{sub_0})$
  Send $(e_{sub_0}, POK(e_{sub_0}))$ to P1
end for
for P1 do
  Check $POK(e_{sub_0})$ is valid else ABORT
  Calculate $e_m = e_{sub_0} \times_h sub_1$
end for
for all $P_f$ do
  Jointly call the oracle to calculate $D_{sk}(e_m)$
  If m < ε′ then return 1, else return 0 to P1
end for
where j represents the cluster number and i and i + 1
represent the previous and current iterations, respectively.
We can transform the distance function into the following one:
$(\rho^0_j + \rho^1_j) - (\mu^0_j + \mu^1_j) < \epsilon'$, where
ε′ is also a small threshold value. For the jth cluster, party
P0 locally computes $E_{pk}(\rho^0_j - \mu^0_j)$ and party P1
locally computes $E_{pk}(\rho^1_j - \mu^1_j)$, both using the
common public key according to threshold decryption. If they
pass their data to each other, they can perform homomorphic
multiplication over their encrypted results. Using Protocol 4.5,
party P1 gets a value which is either 1 or 0 for a particular
cluster. It runs this protocol k times and obtains a set of k
values (we define it as the decision set), which represents the
changes of all current clusters compared with the previous
ones. If the decision set consists of k 1s, the iteration can
be stopped; otherwise the computation continues.
The final step of our algorithm replaces the old cluster centers
with the new ones. This is done by each party as follows:
Party P0 sets: $(\rho^0_1, \rho^0_2, \ldots, \rho^0_k) = (\mu^0_1, \mu^0_2, \ldots, \mu^0_k)$
Party P1 sets: $(\rho^1_1, \rho^1_2, \ldots, \rho^1_k) = (\mu^1_1, \mu^1_2, \ldots, \mu^1_k)$
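The stopping decision and the final share replacement can be sketched in plaintext as follows. The sketch assumes one scalar share per cluster and tests the sum of the two parties' local differences, as in the transformed condition; taking the absolute value before comparing with ε′ is an added assumption so that large negative differences do not count as converged. No encryption or oracle call is modeled, and the function names are hypothetical.

```python
def cluster_decision(rho0_j, rho1_j, mu0_j, mu1_j, eps):
    sub0 = rho0_j - mu0_j  # computed locally by P0
    sub1 = rho1_j - mu1_j  # computed locally by P1
    m = sub0 + sub1        # (rho^0_j + rho^1_j) - (mu^0_j + mu^1_j)
    # Assumption: compare the magnitude of the change with the threshold.
    return 1 if abs(m) < eps else 0

def should_stop(rho0, rho1, mu0, mu1, eps):
    # Decision set: one 0/1 value per cluster; stop only if all k values are 1.
    decisions = [
        cluster_decision(a, b, c, d, eps)
        for a, b, c, d in zip(rho0, rho1, mu0, mu1)
    ]
    return all(v == 1 for v in decisions), decisions

def replace_shares(mu_f):
    # Final step: each party overwrites its old shares with the new ones.
    return list(mu_f)
```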
V. PROPOSED PROTOCOL ANALYSIS
A. Security Analysis
Here we provide the following lemma, which is used for
minimizing the cost of communication without invading privacy,
as in [22]. We also provide theorems related to the security
of our proposed protocol. We omit the proofs due to lack of
space; they will appear in the full version.
Lemma 1: An adversary who does not know the input plaintexts of Protocol 4.3 and Protocol 4.4 can successfully cheat using its own input strings only with negligible probability.
Theorem 1 and 2: Protocol 4.1 and 4.2 are secure in the (decryption)-hybrid model assuming that the NIZK protocols and the additive homomorphic encryptions used are secure in the malicious model.
Theorem 3 and 4: Protocol 4.3 and 4.4 are secure in the (decryption)-hybrid model assuming that the NIZK protocols (given that Lemma 1 holds) and the additive homomorphic encryptions used are secure in the malicious model.
Theorem 5: Protocol 4.5 is secure in the (decryption)-hybrid model assuming that the NIZK protocols and the additive homomorphic encryptions used are secure in the malicious model.
B. Complexity Analysis
The full description of the following complexities will
appear in the full version.
Computational complexity: In Protocol 4.1 and 4.2, the
complexity is O(1) for ciphertexts and proofs of knowledge
(POK), and O(n) for the rest of the operations (n = size of
data). In Protocol 4.3 and 4.4, the complexity is O(1) for POK
and O(l) for the rest of the operations (l = number of
attributes). In Protocol 4.5, the complexity is O(1) for
ciphertexts, POK and subtraction, and O(l) for the rest of the
operations.
Communication complexity: The complexity for POK is O(1) in
all sub-protocols. The complexity for ciphertexts is O(1) in
Protocol 4.1 and 4.2 and O(l) in the rest of the protocols.
VI. CONCLUSION
In this paper, we have proposed a new scheme based on NIZK
proofs for privacy-preserving k-means clustering between two
parties in the malicious model. Most of the existing
privacy-preserving k-means schemes deal with the semi-honest
model and have various security and correctness problems. Our
scheme guarantees stronger security and overcomes such
problems. Moreover, we have minimized the communication cost.
As ours is a two-party model, an interesting direction for
future work is to design private k-means clustering for more
than two parties in the presence of malicious adversaries.
REFERENCES
[1] J. B. MacQueen, “Some methods for classification and analysis of multivariate observations,” in Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. University of California Press, 1967, pp. 281–297.
[2] M. Upmanyu, A. M. Namboodiri, K. Srinathan, and C. V. Jawahar, “Efficient privacy preserving k-means clustering,” in PAISI, 2010, pp. 154–166.
[3] T.-K. Yu, D. T. Lee, S.-M. Chang, and J. Zhan, “Multi-party k-means clustering with privacy consideration,” in ISPA, 2010, pp. 200–207.
[4] C. Su, F. Bao, J. Zhou, T. Takagi, and K. Sakurai, “Security and correctness analysis on privacy-preserving k-means clustering schemes,” IEICE Transactions, vol. 92-A, no. 4, pp. 1246–1250, 2009.
[5] J. Vaidya and C. Clifton, “Privacy-preserving k-means clustering over vertically partitioned data,” in KDD, 2003, pp. 206–215.
[6] T. Okamoto and S. Uchiyama, “A new public-key cryptosystem as secure as factoring,” in EUROCRYPT, 1998, pp. 308–318.
[7] G. Jagannathan and R. N. Wright, “Privacy-preserving distributed k-means clustering over arbitrarily partitioned data,” in KDD, 2005, pp. 593–599.
[8] J. Bar-Ilan and D. Beaver, “Non-cryptographic fault-tolerant computing in constant number of rounds of interaction,” in PODC, 1989, pp. 201–209.
[9] S. Jha, L. Kruger, and P. McDaniel, “Privacy preserving clustering,” in ESORICS, 2005, pp. 397–417.
[10] G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright, “A new privacy-preserving distributed k-clustering algorithm,” in SDM, 2006.
[11] W. Du and M. J. Atallah, “Privacy-preserving cooperative statistical analysis,” in ACSAC, 2001, pp. 102–110.
[12] P. Bunn and R. Ostrovsky, “Secure two-party k-means clustering,” in ACM Conference on Computer and Communications Security, 2007, pp. 486–497.
[13] C. Su, F. Bao, J. Zhou, T. Takagi, and K. Sakurai, “Privacy-preserving two-party k-means clustering via secure approximation,” in AINA Workshops (1), 2007, pp. 385–391.
[14] C. Yildizli, T. B. Pedersen, Y. Saygin, E. Savas, and A. Levi, “Distributed privacy preserving clustering via homomorphic secret sharing and its application to (vertically) partitioned spatio-temporal data,” IJDWM, vol. 7, no. 1, pp. 46–66, 2011.
[15] S. Patel and D. C. Jinwala, “Privacy preserving distributed k-means clustering in malicious model,” http://precog.iiitd.edu.in/events/spsymposium13/SPSymposium files/SPsymposium-papers/SPsymposium-paper7.pdf, 2013.
[16] S. Patel, V. Patel, and D. C. Jinwala, “Privacy preserving distributed k-means clustering in malicious model using zero knowledge proof,” in ICDCIT, 2013, pp. 420–431.
[17] Y. Lindell and B. Pinkas, “Secure multiparty computation for privacy-preserving data mining,” The Journal of Privacy and Confidentiality, vol. 1, no. 1, pp. 59–98, 2009.
[18] P. Paillier, “Public-key cryptosystems based on composite degree residuosity classes,” in EUROCRYPT, 1999, pp. 223–238.
[19] R. Cramer, I. Damgard, and J. B. Nielsen, “Multiparty computation from threshold homomorphic encryption,” in EUROCRYPT, 2001, pp. 280–299.
[20] M. Kantarcioglu and O. Kardes, “Privacy-preserving data mining applications in the malicious model,” International Journal of Information and Computer Security, vol. 2, no. 4, pp. 353–375, 2008.
[21] A. Fiat and A. Shamir, “How to prove yourself: Practical solutions to identification and signature problems,” in CRYPTO, 1986, pp. 186–194.
[22] K. Emura, A. Miyaji, and M. S. Rahman, “Efficient privacy-preserving data mining in malicious model,” in ADMA (1), 2010, pp. 370–382.