
Information leakage in cloud data warehouses

Mohammad Ahmadian, Member, IEEE, and Dan C. Marinescu, Senior Member, IEEE

Abstract—Information leakage is the inadvertent disclosure of sensitive information through correlation of records from several databases/collections of a cloud data warehouse. Malicious insiders pose a serious threat to cloud data security and this justifies the focus on information leakage due to rogue employees or to outsiders using the credentials of legitimate employees. The discussion in this paper is restricted to NoSQL databases with a flexible schema. Data encryption can reduce information leakage, but it is impractical to encrypt large databases and/or all fields of database documents. Encryption also limits the operations that can be carried out on the data in a database. It is thus critical to identify the sensitive documents in a data warehouse and concentrate the protection efforts on them. The capacity of a leakage channel introduced in this work quantifies the intuitively obvious means to trigger alarms when an insider attacker uses excessive computer resources to correlate information in multiple databases. The Sensitivity Analysis based on Data Sampling (SADS) introduced in this paper balances the trade-offs between higher efficiency in identifying the risks posed by information leakage and the accuracy of the results obtained by sampling very large collections of documents. The paper reports on experiments assessing the effectiveness of SADS and of the use of selective disinformation to limit information leakage. Cloud services for identifying sensitive records and reducing the risk of information leakage are also discussed.

Index Terms—Database as a Service, Information leakage, Capacity of a leakage channel, Sensitivity analysis, Approximate Query Processing, Biased Sampling, Cross-Correlation estimation.


1 INTRODUCTION

INFORMATION leakage is the inadvertent disclosure of sensitive information. A malicious insider with access to the information stored by a cloud data warehouse is able to infer sensitive information through multiple database searches and cross-correlations among databases.

This new threat to cloud security has received little attention in the past. The impact of information leakage will most likely amplify as the volume of data stored on public clouds by many organizations is steadily increasing. Often oblivious to the dangers of information leakage, many governmental agencies and enterprises transition to private and hybrid clouds with the belief that, in addition to lower cost, a cloud offers enhanced security.

Nowadays virtually all Cloud Service Providers (CSPs) offer Database as a Service (DBaaS) [14]. It is predicted that DBaaS will enjoy a solid annual growth rate for the foreseeable future. CSPs guarantee availability and scalability of cloud services, but data confidentiality poses significant challenges in the face of new threats.

Unauthorized access to confidential information and data theft top the list of concerns of individuals and organizations that relinquish the physical control of their data to a CSP [13]. Some of the new threats emanate from insider attackers who have the ability to correlate information from multiple cloud databases. Sensitive information could be inferred using information from low-risk datasets hosted by the same cloud.

The discussion in this paper is restricted to NoSQL databases with a flexible schema. A NoSQL database is a collection of documents D = {d_1, . . . , d_n} and is also called a collection.

• Mohammad Ahmadian and Dan C. Marinescu are with the Department of Computer Science, University of Central Florida, Orlando, FL, 32816.

• E-mail: {ahmadian, dcm}@cs.ucf.edu

Manuscript received August 28, 2017; revised March 16, 2018.

The two terms database and collection will be used interchangeably throughout this paper. A document is a set of 〈key_i, value_i〉 pairs, each representing an attribute of an object.
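To make the data model concrete, the following minimal Python sketch represents one such document; the attribute names and values are borrowed from the example of Section 3 and are purely illustrative.

    # A NoSQL document modeled as a set of <key, value> pairs.
    # The keys and values below are illustrative, taken from the example in Section 3.
    document = {
        "Name": "John",
        "Addr": "SW81",
        "Ph": "7654321",
        "Card": "VISA xyzw|EXP May20|COD 345",
    }

    # Each (key, value) pair represents one attribute of the object.
    for key, value in document.items():
        print(f"<{key}, {value}>")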

Contrary to common belief, encrypted cloud data and encrypted queries are vulnerable to information leakage. A malicious insider can infer sensitive information, as the attribute name, the key, the number of attributes involved in a query, and the query length often reveal information about the encrypted data.

A motivating sequence of events illustrates the effects of data correlation and, implicitly, of information leakage. In August 2006, AOL, a global on-line mass media corporation, released search logs of over 650 000 users for research purposes. An analysis of the searches conducted over a period of three months, with user names changed to random ID numbers, made users uniquely identifiable. Correlating data released by AOL with publicly available datasets revealed additional private information about AOL users.

The very large number of documents in a collection limits the ability to analyze in real time the dangers posed by information leakage and to take preventive measures. The alternative proposed in this paper uses random sampling and error estimation to assess the vulnerability of the data warehouse to information leakage. This solution dramatically cuts the analysis time but, as expected, approximate measurements based on data sampling exhibit different levels of errors.

Sensitivity Analysis based on Data Sampling, inspired by the Approximate Query Processing (AQP) method, provides bounds on the accuracy of the method [15], [27]. However, uniform sampling cannot provide accurate responses for correlated databases. Thus, different sampling methods, known as biased sampling, have been proposed to obtain better approximations [28], [29].

Extending sensitivity analysis from one dataset to a data warehouse with a very large number of datasets is a daunting task, even when using samples of modest size taken from individual datasets. Fortunately, the computational effort made by an attacker to gather sensitive information can be turned against the attacker. Two cloud information leakage prevention methods are proposed: (i) limit the ability of an attacker to gather sensitive information by restricting the computing resources available for such attacks [25]; and (ii) insert disinformation documents.

The first approach is based on a quantitative characterization of the capacity of a leakage channel discussed in Section 3. Alarms triggered once pre-established thresholds on the number of chained queries are exceeded serve as deterrents for potential attackers and limit their ability to collect sensitive information.

The second approach proposes the insertion of disinformation documents that provide multiple values for an attribute and mislead an attacker. Indiscriminate document replication drastically increases the database size and, implicitly, the query response time. The selective disinformation proposed in this paper uses sensitivity analysis to limit the number of additional documents, as well as the other inherent negative effects of the original method proposed in [26].

The contributions of this paper are:
1) A survey of data encryption methods and their limitations for preventing information leakage in cloud data warehouses, presented in Section 2.
2) The definition of the capacity of an information leakage channel, relating the level of information leakage with the effort and the resources needed by a malicious insider, discussed in Section 3.
3) The use of disinformation to limit the capacity of a leakage channel, discussed in Section 4.
4) The introduction of sensitivity analysis in the study of information leakage through correlation of multiple documents in a database. An effective sensitivity analysis method based on approximate query processing for classifying documents in several sensitivity classes and selective disinformation to limit information leakage are discussed in Section 5.
5) A scalable leakage assessment and parameter extraction algorithm for cloud data warehouses based on approximate query processing, discussed in Section 6.

The paper also suggests cloud services to assess the likelihood of information leakage and limit the ability of a malicious insider to discover sensitive information in cloud data warehouses.

2 CLOUD DATA ENCRYPTION

Computer clouds are target-rich environments for malicious individuals and criminal organizations [1], [2]. The impact of traditional threats to all computer systems connected to the Internet is amplified in the case of computer clouds due to the vast amount of resources and the large user population [3], [4], [5], [6], [7]. At the same time, new threats that exploit cloud organization and services have emerged [8], [9], [10], [11], [12].

Data encryption can be used for the protection of sensitive information, but cannot be used indiscriminately to protect the very large volume of data stored on a cloud. Encryption can be applied at a range of data granularities, from the high granularity of atomic data to the low granularity of aggregated data items. The higher the encryption granularity, the higher the information leakage. For example, encryption of a single attribute leaks how frequently the attribute is present in the database records, while encryption of an entire document or collection as a single unit leaks less information.

Cloud data can be in three states, at-rest, in-transit, or in-process; a comprehensive data security mechanism must protect data in all three states. Data encryption can only protect cloud data while in storage, and arithmetic operations with encrypted data are only theoretically feasible at this time. It is therefore necessary to decrypt data before processing, and this creates another window of vulnerability and opens the door for information leakage. It is, however, feasible to query encrypted data, as we shall see in this section, which starts with a review of cloud encryption schemes and continues with an overview of two systems used to process encrypted data.

Encryption Methods and Cryptosystems. Several encryption methods and cryptosystems used for cloud-hosted databases are discussed next.

I. Deterministic Encryption. This encryption scheme produces the same ciphertext for an identical pair of plaintext and key. For example, block ciphers in Electronic Code Book (ECB) mode, or in modes with a constant initialization vector, are deterministic. A deterministic encryption scheme preserves equality; therefore, the frequency of encrypted data mirrors the frequency of plaintext data and this information is leaked to an attacker.

\[ C_j = E_k(P_j), \qquad P_j = D_k(C_j), \qquad j = 1, \ldots, n. \tag{1} \]

Equation 1 describes the deterministic encryption and decryption operations, with E_k the encryption algorithm, D_k the decryption algorithm, k the secret key, P_j a plaintext data block, and C_j the ciphered data block.

II. Random Encryption. In this encryption scheme, a message is coupled with a key k and a random Initialization Vector (IV). This scheme is non-deterministic, i.e., encryption of the same message with the same key yields different ciphertexts. Random encryption schemes are semantically secure against plaintext attacks. Equation 2 describes the encryption and decryption of a block cipher in Cipher Block Chaining (CBC) mode.

\[ C_1 = E_k(P_1 \oplus IV), \qquad P_1 = IV \oplus D_k(C_1), \]
\[ C_j = E_k(P_j \oplus C_{j-1}), \qquad P_j = C_{j-1} \oplus D_k(C_j), \qquad j = 2, \ldots, n. \tag{2} \]

The Advanced Encryption Standard (AES) is one of the most secure random encryption schemes. AES is a symmetric block cipher algorithm with a key size of 128, 192, or 256 bits and a block size of 128 bits. Equation 3 shows that a random encryption function can be constructed from a deterministic one by concatenating a fixed-length random number r to each input.


\[ E_k(x) = E'_k(x \,\|\, r) \tag{3} \]

where k is the encryption key, E' is a deterministic encryption function, E_k(x) is a random encryption function, and r is a random number.
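The frequency leakage of deterministic encryption, and the way a random IV removes it, can be shown with a short sketch. It assumes the third-party Python package cryptography (any AES implementation would do); the key and plaintext values are hypothetical.

    import os
    from collections import Counter
    # Assumes the third-party 'cryptography' package (pip install cryptography).
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

    key = os.urandom(16)                  # hypothetical 128-bit AES key
    block = b"ATTRIBUTE=SW81.."           # one 16-byte plaintext block
    plaintext = block * 4                 # the same attribute value stored four times

    def encrypt(mode):
        enc = Cipher(algorithms.AES(key), mode).encryptor()
        return enc.update(plaintext) + enc.finalize()

    # Deterministic encryption (AES-ECB): equal plaintext blocks give equal
    # ciphertext blocks, so the frequency of a value is visible to an attacker.
    ecb = encrypt(modes.ECB())
    ecb_blocks = [ecb[i:i + 16] for i in range(0, len(ecb), 16)]
    print("distinct ECB blocks:", len(Counter(ecb_blocks)))   # prints 1

    # Random encryption (AES-CBC with a fresh random IV): repetitions are hidden.
    cbc = encrypt(modes.CBC(os.urandom(16)))
    cbc_blocks = [cbc[i:i + 16] for i in range(0, len(cbc), 16)]
    print("distinct CBC blocks:", len(Counter(cbc_blocks)))   # prints 4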

III. Fully Homomorphic Encryption (FHE) allows computation to be done on encrypted data [33]. However, fully homomorphic encryption algorithms are very intricate and the overhead of computations with FH-encrypted data is several orders of magnitude larger than that of computations with the corresponding plaintext data. Another reason for FHE impracticality is that a query to an FH-encrypted database must be expressed as a circuit over an entire dataset.

IV. Order-Preserving Encryption (OPE) is a deterministic cryptosystem whose ciphertext preserves the ordering of the plaintext data. Aggregate queries such as comparison, min, and max can be executed on OPE-encrypted datasets. Equation 4 shows the preservation of the order relation of the plaintext in the ciphertext, with OPE_k the key-based OPE function:

\[ \forall x, y \in \text{Data Domain}: \quad x < y \implies OPE_k(x) < OPE_k(y). \tag{4} \]

OPE offers less protection than FHE and leaks critical information about the plaintext data. The OPE algorithm introduced in [31] was used for a cloud database service [30]. The Modular Order-Preserving Encryption (MOPE) [32], an extension to the basic OPE claiming security improvements, also leaks information.
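The order-preserving property of Equation 4, and the rank information it leaks, can be illustrated with a toy sketch. This is not the OPE scheme of [31]; it is only a random, strictly increasing mapping over a small hypothetical domain.

    import random

    random.seed(7)
    DOMAIN = range(100)                       # toy plaintext domain 0..99

    # The "key" of this toy scheme is a random strictly increasing mapping:
    # sample 100 distinct ciphertext values and assign them in sorted order.
    ciphertexts = sorted(random.sample(range(10**6), len(DOMAIN)))
    ope = {x: ciphertexts[i] for i, x in enumerate(DOMAIN)}

    x, y = 23, 67
    assert x < y and ope[x] < ope[y]          # order is preserved (Equation 4)

    # Leakage: the rank of a ciphertext equals the rank of its plaintext, so an
    # attacker who sees the ciphertexts learns the ordering of the plaintext data.
    ranks = {c: r for r, c in enumerate(sorted(ope.values()))}
    print("rank of OPE(23):", ranks[ope[23]])  # prints 23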

V. Additive Homomorphic Encryption (AHOME) is a partially homomorphic cryptosystem that allows a database server to conduct homomorphic addition and multiplication computations on ciphertext. An example of AHOME is Paillier's cryptosystem [34]. The homomorphic addition is formulated as

\[ D_k\big(E_k(m_1, r_1)\cdot E_k(m_2, r_2) \bmod n^2\big) = m_1 + m_2 \pmod{n} \tag{5} \]

where m_1, m_2 ∈ Z_n are plaintext messages, r_1, r_2 ∈ Z*_n are randomly selected, n is the product of two large primes, and Z_n and Z*_n are sets of integers.
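A toy-scale sketch of the additive homomorphism of Equation 5 follows; the primes are deliberately tiny and the code is for illustration only, not a cryptographic-strength implementation of [34].

    import math, random

    # Toy Paillier key generation with small primes (illustration only).
    p, q = 1789, 2003
    n, n2 = p * q, (p * q) ** 2
    g = n + 1                                  # standard choice of generator
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)
    L = lambda u: (u - 1) // n
    mu = pow(L(pow(g, lam, n2)), -1, n)        # modular inverse (Python 3.8+)

    def encrypt(m):
        r = random.randrange(1, n)             # random r, in Z*_n w.h.p.
        return (pow(g, m, n2) * pow(r, n, n2)) % n2

    def decrypt(c):
        return (L(pow(c, lam, n2)) * mu) % n

    m1, m2 = 1234, 5678
    c1, c2 = encrypt(m1), encrypt(m2)
    # Multiplying ciphertexts modulo n^2 adds the plaintexts modulo n (Equation 5).
    assert decrypt((c1 * c2) % n2) == (m1 + m2) % n
    print("E(m1)*E(m2) decrypts to", decrypt((c1 * c2) % n2))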

Protection of data in-transit is discussed next. The communication channels with cloud databases can be secured using standard HTTP over the Secure Socket Layer (SSL) protocol. Most CSPs provide web service APIs enabling developers to use both standard HTTP and the secure HTTPS protocol. The security requirements for data in-transit can be fully satisfied using HTTPS for communication with a cloud. The endpoint authentication feature of SSL makes it possible to ensure that the clients are communicating with an authentic cloud server. The basic idea for maintaining the confidentiality of data in the at-rest and in-process states is to use a cryptosystem. However, providing the decryption key to the server is a confidentiality violation.

Processing encrypted data. Searchable encryption methods such as Oblivious RAM (ORAM) [19], [20] provide an acceptable level of security. However, the high computational cost, as well as the excessive communication costs between the clients and the server, make this method impractical [24]. Deterministic and OPE cryptosystems leak critical information, such as the frequency and order of the original data, and enable attackers to extract sensitive information.

The following systems do not require modifications of database services; encrypted data is processed in the same way as plaintext data. Optimizations such as multi-layer indexing, caching, and file management operations are invariant whether applied to encrypted or plaintext databases. The first system, CryptDB [17], is used to search encrypted SQL cloud databases. Inference attacks against CryptDB are discussed in [18].

The system to search encrypted NoSQL databases [16] involves a secure proxy to encrypt client queries and decrypt server query responses. The proxy ensures that an attacker cannot access sensitive information. The process is completely transparent to the clients, which are not involved in encryption/decryption operations.

It is impractical to encrypt all documents in a database or a large number of documents in multiple cloud databases. Moreover, encryption of all document fields restricts the range of database operations. Data encryption can be used to selectively protect sensitive information in a data warehouse; however, it is seldom used, for several reasons. First, it is not feasible at this time to support a full range of arithmetic and logic operations with encrypted data. Second, searching encrypted databases requires more complex software systems such as the ones discussed in [17] and [16].

This motivates the investigation of information leakage due to correlations among documents in plaintext, or with partially encrypted fields, in multiple databases of a data warehouse. Such correlations can only be carried out by individuals with quasi-unlimited access to the data, as discussed in the next section.

3 INFORMATION LEAKAGE DUE TO MALICIOUS INSIDERS' ACCESS TO PLAINTEXT COLLECTIONS

Malicious insiders could exploit information leakage from sensitive documents for a range of nefarious activities. Such attacks can be conducted by the employees and contractors of large data centers with access to the software, the hardware, and the data. There is also the risk of an intruder gaining the same level of access using the credentials of a legitimate employee.

Example. The following example illustrates how correlations among multiple documents in several cloud collections allow an insider to infer sensitive information even when some sensitive documents are encrypted.

After buying an item from an online store, a document R_m of John's sale, including his name, address, phone number, and credit card information, is stored in the merchant's cloud collection D_m:

\[ R_m = \{\langle Name, John\rangle, \langle Addr, SW81\rangle, \langle Ph, 7654321\rangle, \langle Card, VISA\ xyzw\,|\,EXP\ May20\,|\,COD\ 345\rangle\}. \tag{6} \]

John's dental documents are in the D_h collection on the same cloud. He is identified by a patient identification number, PatId, stored in an encrypted document, R^0_h, to ensure the anonymity of patient information:

\[ R^0_h = \{\langle Name, John\rangle, \langle PatId, 987654\rangle\}. \tag{7} \]

After John visits an orthodontist, a new document R^s_h, containing the patient Id, age, sex, social security number, address, phone number, and X-ray results, is stored in the D_h collection:

\[ R^s_h = \{\langle PatId, 987654\rangle, \langle Age, 23\rangle, \langle Sex, M\rangle, \langle SSN, 333\rangle, \langle Addr, SW81\rangle, \langle Ph, 7654321\rangle, \langle XRay, Results\rangle\}. \tag{8} \]

Encryption of the obviously sensitive document R^0_h is insufficient. Indeed, an insider with access to both collections, D_m and D_h, can correlate the address and phone numbers in documents R_m and R^s_h and find John's SSN and his credit card information, in spite of the attempt to protect John's privacy by using the PatId instead of his name. This example illustrates the need for the sensitivity analysis discussed in Section 5, which identifies all fields of a document that need to be protected, in our case the address and the phone number.
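The correlation performed by the insider in this example can be sketched as a join on the unencrypted quasi-identifiers; the two collections are represented here as in-memory Python lists rather than actual cloud-hosted NoSQL collections.

    # Merchant collection D_m and health collection D_h, as in Equations 6 and 8.
    D_m = [{"Name": "John", "Addr": "SW81", "Ph": "7654321",
            "Card": "VISA xyzw|EXP May20|COD 345"}]
    D_h = [{"PatId": 987654, "Age": 23, "Sex": "M", "SSN": "333",
            "Addr": "SW81", "Ph": "7654321", "XRay": "Results"}]

    # The insider joins the two collections on the plaintext pairs (Addr, Ph);
    # the encrypted document R0_h linking Name to PatId is never needed.
    for rm in D_m:
        for rh in D_h:
            if rm["Addr"] == rh["Addr"] and rm["Ph"] == rh["Ph"]:
                print(f'{rm["Name"]}: SSN={rh["SSN"]}, Card={rm["Card"]}')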

The system model. We examine sensitive information leakage due to correlations among data in several NoSQL databases residing on the same cloud. We assume that data documents containing sensitive information, consisting of 〈key, value〉 pairs, are distributed among several databases stored on the same cloud, and that the attacker has access to all N databases stored on the cloud.

In the general case, the intruder can attack a set of targets, T = {T_1, T_2, . . . , T_q}. A target T_i is the collection of documents scattered among the databases of a cloud-hosted data warehouse containing sensitive information about one person, process, or document. The malicious insider knows at least one 〈key, value〉 pair for each target and has the potential to identify one 〈key, value〉 pair in every sensitive document of each target.

3.1 The capacity of a leakage channel

We propose a quantitative characterization of information leakage reflecting the attacker's cost-benefit options. This measure correlates the amount of information leakage with the effort of the attacker; the larger the effort to access sensitive information, the higher the risk of detection.

The following analysis is based on several assumptions:
1) To avoid detection, a malicious insider limits her effort to n_i searches related to target T_i.
2) There are only k_i < n_i documents with sensitive information related to T_i.
3) The attacker does not know k_i.

C_{T_i}(n_i, k_i), the capacity of an [n_i, k_i]-leakage channel relative to target T_i, is the probability of successful access to r_1, r_2, . . . , r_{k_i} sensitive documents relevant to target T_i in n_i searches:

\[ C_{T_i}(n_i, k_i) = \prod_{j=1}^{n_i} p(n_j, k_j) \qquad \text{with} \quad n = \sum_{j=1}^{n_i} n_j \ \ \text{and} \ \ k = \sum_{j=1}^{n_i} k_j. \tag{9} \]

In this expression p(n_j, k_j) is the probability of accessing k_j sensitive documents of T_i in n_j searches of database D_j.

The capacity of the leakage channel for all targets T is then the vector

\[ C_T = \big[\,C_{T_1}(n_1, k_1),\ C_{T_2}(n_2, k_2),\ \ldots,\ C_{T_q}(n_q, k_q)\,\big]. \tag{10} \]

Next, we consider the case of a single target, drop the index identifying the target, and examine several cases:
1) All sensitive documents share a 〈key, value〉 pair, e.g., all documents contain the key-value pair assigning a code name to a patient.
2) All documents of interest are linked together by pairs of documents sharing a unique 〈key, value〉 pair. This is the case of documents R^0_h and R^s_h in our example. In this case there are several possibilities:
   a) The attacker determines the 〈key, value〉 pair of the head of the list and follows the chain;
   b) The attacker determines the 〈key, value〉 pair of one of the documents in the set; thus, she can discover either the downstream or the upstream sensitive documents. This is similar to the previous case: once the document containing the 〈key, value〉 pair is found, it acts as a partial head of the list for the documents in the chain;
   c) The attacker determines two 〈key, value〉 pairs of one document; thus, she can discover both the upstream and the downstream sensitive documents.

Now we analyze the capacity of an n-leakage channel when the malicious insider has different means to identify sensitive information and different targets, and assume that she examines N documents and only K of them contain sensitive information.

A. All sensitive documents share a 〈key, value〉 pair. Assuming that the attacker maintains a list of the databases she has already explored, we have a hypergeometric distribution of successes. The problem is sampling without replacement, and the probability of finding k sensitive documents in n DB searches, p(n, k), is

\[ p(n, k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}. \tag{11} \]

The mean value and the variance of p(n, k) are

\[ \mu_p = n\,\frac{K}{N} \qquad \text{and} \qquad \sigma_p = n\,\frac{K}{N}\times\frac{N-K}{N}\times\frac{N-n}{N-1}. \tag{12} \]

The capacity of an n-leakage channel C_A(n, k) in this case is given by Equation 11:

\[ C_A(n, k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}. \tag{13} \]
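A short sketch of Equations 11-13 using exact binomial coefficients; the values of N, K, and n below are hypothetical.

    from math import comb

    def leakage_capacity(N, K, n, k):
        """Probability of finding exactly k of the K sensitive documents
        in n searches over N documents (Equations 11 and 13)."""
        return comb(K, k) * comb(N - K, n - k) / comb(N, n)

    def moments(N, K, n):
        """Mean and spread of p(n, k), as written in Equation 12."""
        mu = n * K / N
        sigma = n * (K / N) * ((N - K) / N) * ((N - n) / (N - 1))
        return mu, sigma

    # Hypothetical scenario: 10^5 documents, 50 of them sensitive, 10^3 searches.
    N, K, n = 10**5, 50, 10**3
    print("C_A(n, k=1) =", leakage_capacity(N, K, n, 1))
    print("mu, sigma   =", moments(N, K, n))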

B. Chained documents. The attacker first locates the head of the list, including the 〈key_head, value_head〉 pair, then follows the chain by searching for documents containing the pairs 〈key_head, value_head〉 and 〈key_next, value_next〉, until she identifies the document containing the next two pairs, and so on.

To identify the document containing 〈key_head, value_head〉 the attacker must search N documents, and only one contains the desired 〈key, value〉 pair. Call p_h(n_0) the probability of locating the head of the list in n_0 trials. In this case K = 1 and k = 1, and according to Equation 11 p_h(n_0) is

\[ p_h(n_0, 1) = \frac{\binom{N-1}{n_0-1}}{\binom{N}{n_0}} = \frac{n_0}{N(N - n_0 + 1)}. \tag{14} \]

Then the probability of locating a second document of the list in n_1 trials, p_{h,1}(n_1, 1), is

\[ p_{h,1}(n_1, 1) = \frac{\binom{N-2}{n_1-1}}{\binom{N}{n_1}} = \frac{n_1}{N(N - n_1 + 1)}. \tag{15} \]

It follows that p_s(n_0, n_1, . . . , n_{s-1}), the probability of locating s consecutive sensitive documents in n_0, n_1, . . . , n_{s-1} trials, is

\[ p_s(n_0, n_1, \ldots, n_{s-1}) = \prod_{j=0}^{s-1} \frac{\binom{N-j}{n_j-1}}{\binom{N}{n_j}} = \prod_{j=0}^{s-1} \frac{n_j}{N(N - n_j + 1)}. \tag{16} \]

It is plausible that the attacker follows a wrong chain; for example, the pair 〈Age, 35〉 may be present in multiple documents and point the attacker to documents pertinent to a target other than the desired one. Thus, in this case we can only provide an upper bound for the capacity of an n-leakage channel, C_B(n, s):

\[ C_B(n, s) \le \prod_{j=0}^{s-1} \frac{n_j}{N(N - n_j + 1)} \qquad \text{with} \quad n = \sum_{j=0}^{s-1} n_j. \tag{17} \]

C. Chained documents. The attacker identifies two 〈key, value〉 pairs, one pointing upstream, 〈key_ups, value_ups〉, and one downstream, 〈key_downs, value_downs〉. First, a search for the document containing both 〈key, value〉 pairs is conducted. The probability of locating this document in n_0 trials is equal to p_h(n_0, 1) given by Equation 14. Then the search is conducted for documents containing either 〈key_ups, value_ups〉 or 〈key_downs, value_downs〉. It is plausible that the attacker follows a wrong chain in either search; thus,

\[ C_C(n, s) \le C_B(n, s). \tag{18} \]

3.2 Timing Analysis

The attacker is not only constrained by the number of trials, but also by the time necessary to achieve her objectives. One option for the attacker is to use a script and carry out database searches in parallel to reduce the exposure time. If X_1, X_2, . . . , X_N are random variables and X_i represents the search time for database D_i, then T_N, the time for searching in parallel the N databases D_1, D_2, . . . , D_N, is

\[ T_N = \max(X_1, X_2, \ldots, X_N). \tag{19} \]

When X_1, X_2, . . . , X_N are independent random variables with a common distribution function F_X(t), the distribution function of T_N is given by [21]

\[ F_{T_N}(t) = [F_X(t)]^N. \tag{20} \]

The expected value of T_N is

\[ T = E[T_N] = N \int_0^\infty t\, F_X^{N-1}(t)\, dF_X(t). \tag{21} \]

Uniform distribution of the search time. When little is known about a random variable X except its range [a, b], a uniform distribution is used to model X:

\[ F_X(t) = \Pr[X \le t] = \begin{cases} 0 & t < a \\[4pt] \dfrac{t-a}{b-a} & a \le t \le b \\[4pt] 1 & t > b \end{cases} \qquad E[X] = \frac{a+b}{2}. \tag{22} \]

According to Equation 21, the time to search in parallel the N databases when the search times are uniformly distributed is

\[ T_{uniform} = N \int_a^b t\, \frac{(t-a)^{N-1}}{(b-a)^{N-1}}\, \frac{1}{b-a}\, dt = b - \frac{b-a}{N+1}. \tag{23} \]

Normal distribution of the search time. Searching a database requires the comparison of the search pattern with multiple documents; thus, the search time is the sum of a large number of individual operations. By virtue of the central limit theorem, the distribution of a random variable X which is the sum of a large number of quantities is normal. The probability density function of a normal distribution with mean µ and standard deviation σ is

\[ f_X(t) = \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left(-\frac{(t-\mu)^2}{2\sigma^2}\right). \tag{24} \]

There is no closed form for the normal distribution function; for the standard normal distribution, with µ = 0 and σ = 1, it is written as

\[ F_X(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} \exp\!\left(-\frac{x^2}{2}\right) dx. \tag{25} \]

We use a result from [22] to compute the average time

\[ T_{stdnormal} = (2\log N)^{\frac{1}{2}} - \tfrac{1}{2}\,(2\log N)^{-\frac{1}{2}}\left(\log\log N + \log 4\pi - 2C\right) + O\!\left[(\log N)^{-1}\right] \tag{26} \]

with C = 0.577 Euler's constant.

The obvious method for limiting the capacity of a leakage channel is to trigger alarms when the amount of resources used in a given period by an employee, or by someone with access to the credentials of an employee, reaches a threshold enforced by the trusted security base of the system. A statistical analysis based on approximate query processing, discussed in Section 5, could help the cloud service provider determine the probabilities of finding sensitive documents in different collections and estimate the number of trials and the time required by an attacker to gather sensitive information close to the capacity of the leakage channels.
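The closed form in Equation 23 can be checked with a small Monte Carlo sketch; the search-time range and the number of databases are hypothetical.

    import random
    import statistics

    def parallel_search_time(N, a, b, runs=20000):
        """Empirical E[T_N], the mean of the maximum of N uniform(a, b)
        search times (Equation 19)."""
        return statistics.fmean(max(random.uniform(a, b) for _ in range(N))
                                for _ in range(runs))

    # Hypothetical search-time range of 1 to 5 seconds per database, N = 50 databases.
    N, a, b = 50, 1.0, 5.0
    empirical = parallel_search_time(N, a, b)
    closed_form = b - (b - a) / (N + 1)       # Equation 23
    print(f"empirical {empirical:.3f} s vs closed form {closed_form:.3f} s")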


4 USING DISINFORMATION TO LIMIT THE CAPACITY OF LEAKAGE CHANNELS

Disinformation in the context of NoSQL databases means document replication combined with the alteration of sensitive 〈key, value〉 pairs to limit the ability of an attacker to identify the true value for a given key. The use of disinformation [26] is a last-resort method for limiting sensitive information leakage. Indeed, this solution requires changes to the original database that can only be done by the database owner prior to uploading the data to the cloud. The method also requires a trusted proxy to filter out the fictitious data in the answer to the query.

The indiscriminate replication of all collection documents dramatically increases the storage space, as well as the response time for aggregate queries, by a factor at least equal to the replication index. The replication index is the cardinality of the set of documents created to hide the sensitive information in an original document. The larger the replication index, the more difficult it is for an attacker to identify the sensitive information, but the larger the storage for the expanded collection.

For example, a 100 TB collection becomes a 1 PB collection when the replication index of every document in the collection is equal to ten; the query response time increases on average by an order of magnitude. It follows that replication must be selective: it should only cover documents with sensitive information and can only be applied to relatively small databases.

Disinformation reduces the capacity of leakage channels; for example, a replication factor of ten reduces the capacity of the leakage channel in Equation 11 by an order of magnitude, but it also increases the query processing time by an amount related to the effort of identifying the disinformation documents in the response to the query.

A secure proxy like the one in [16] mediates the interaction between clients and the DBaaS server. The proxy intercepts the client queries, transforms them into encrypted queries, and passes them to the cloud DBaaS server, which responds with a combination of valid and forged documents. The proxy decrypts the query response, filters out the disinformation documents, and forwards the desired documents to the user's application.

[Figure 1 depicts an information document and a disinformation document, each a list of 〈Key_i, Value_i〉 pairs; a digest Digest = Hash{Document} is computed, the pairs are encrypted as E_k(Key_i) : E_k(Value_i), and an eTag : E_k{Token ‖ Digest} attribute is attached to each encrypted document.]

Fig. 1: Generation and encryption of the eTag attribute, which allows an authorized database user to identify disinformation in the response to a query.

This method is not only useful to identify disinformation documents, but also to conduct integrity verification using tamper-resistant algorithms. A Message Authentication Code (MAC), also known as a tag, confirms that a message comes from the stated sender and thus is authentic and has not been altered. Figure 1 shows this process, carried out by the owner of the data:

1) Hash functions are applied to the original and disinformation documents.
2) A new attribute 〈eTag, E_k(d_i)〉 is appended to each document d_i.
3) The tag is encrypted. Only an authorized database user can decrypt the tag and identify the original document.
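A sketch of the eTag construction of Figure 1 using only the Python standard library; a keyed HMAC stands in for the encryption E_k{Token ‖ Digest}, and the document contents and token values are hypothetical.

    import hashlib, hmac, json, os

    SECRET_KEY = os.urandom(32)               # known only to the owner/proxy

    def add_etag(document, token):
        """Append an eTag attribute to an information or disinformation document."""
        digest = hashlib.sha1(json.dumps(document, sort_keys=True).encode()).digest()
        document["eTag"] = hmac.new(SECRET_KEY, token + digest, hashlib.sha1).hexdigest()
        return document

    def is_genuine(document, token):
        """The proxy recomputes the tag to keep only genuine documents."""
        body = {k: v for k, v in document.items() if k != "eTag"}
        digest = hashlib.sha1(json.dumps(body, sort_keys=True).encode()).digest()
        expected = hmac.new(SECRET_KEY, token + digest, hashlib.sha1).hexdigest()
        return hmac.compare_digest(expected, document["eTag"])

    real = add_etag({"Name": "John", "Addr": "SW81"}, token=b"REAL")
    fake = add_etag({"Name": "John", "Addr": "XY99"}, token=b"FAKE")  # disinformation
    response = [real, fake]                   # what the DBaaS server returns
    print([d for d in response if is_genuine(d, b"REAL")])   # keeps only the real one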

[Figure 2 plots the hashing rate of MD5, SHA1, RIPEMD, and SHA256 versus document size, from 16 B to 8192 B.]

Fig. 2: The hashing rate of four popular cryptographic hash functions, expressed in KB/second, for several document sizes.

The algorithm can use one of several cryptographic hash functions, including MD5, SHA1, SHA256, and RIPEMD. The time to compute the hash value is a function of the document size. Figure 2 shows the performance of the four popular cryptographic hash functions for a range of document sizes. Based on these results we concluded that SHA1 is the best option.

5 SADS - SENSITIVITY ANALYSIS BASED ON DATA SAMPLING

The documents in the NoSQL databases of a data warehouse include items with different degrees of intrinsic sensitivity and different domains. The intrinsic information quantifies the danger posed by the indiscriminate disclosure of information. The domain quantifies the likelihood that an 〈attribute, value〉 pair is present in multiple databases.

A large domain increases the capacity of the information leakage channel discussed in Section 3. For example, records containing the Social Security Number (SSN) are likely to be present in health, financial, and personnel records, as well as in records maintained by credit scoring agencies, motor vehicle and passport services, airlines, and many other organizations with information about an individual.

The goal of sensitivity analysis is to determine the level of vulnerability resulting from the correlation of sensitive information in various databases of a public cloud data warehouse and to support measures for limiting information leakage. Given the massive amounts of data maintained by a public cloud, a brute-force approach to sensitivity analysis is a hopelessly daunting task.

SADS, the solution we propose, is based on Approximate Query Processing (AQP), a technique used by On-Line Analytical Processing applications to extract information from massive datasets. The response time to a query can be prohibitive, limiting the usefulness of data analytics. Many such applications are latency sensitive and in some cases, e.g., exploratory investigations, it is preferable to have an approximate answer to a query sooner rather than an accurate answer later. In such cases sampling [15] offers a tempting alternative and, as shown in this section, can also be useful for limiting information leakage.

SADS. Sensitivity analysis has several stages: (i) establish sensitivity levels; (ii) establish the domain of the different keys related to sensitive information; (iii) determine the number of collection documents at each sensitivity level. The last two stages of the sensitivity analysis require an examination of all collection documents, a rather slow process. To facilitate fast sensitivity analysis, we use samples of the collection and report the estimation errors.

An initial step of the sensitivity analysis of the documents in one database can be carried out by the database owner to classify the documents into several classes based on the intrinsic information they could leak. The results of this analysis can be used in several ways to limit information leakage:

1) Identify data items that can be encrypted to limit the ability of an insider to correlate sensitive information in multiple documents. For example, encrypt records which encode the randomly selected PatientID in the case of health records.
2) Identify and rename the key in 〈key, value〉 pairs to prevent correlations. For example, instead of "SSN" (Social Security Number) use "PIC" (Personal Identification Code).
3) Selectively apply disinformation to the collection documents.
4) Identify sensitive information and flag repeated queries that search for sensitive information.

A cloud service provider can offer a comprehensive sensitivity analysis service. This service would assess the domain of the key in 〈key, value〉 pairs in all databases hosted by the system. It should be stressed that there is no foolproof method to completely eliminate information leakage; sensitivity analysis can only limit it.

AQP and Sampling. AQP is based on sampling techniques that provide approximate responses to aggregate queries alongside estimates of the implicit error produced by this method. An aggregate query calls aggregate functions to return a meaningful computed summary of specific attributes of a group of documents. Common aggregate functions are: Average, Max, Min, and Count [23].

An AQP system supplies confidence intervals indicating the uncertainty of approximate answers. Confidence Intervals (CI) represent the range of values centered at a known sample mean and are used to calculate error bounds. Indeed, an approximate answer would not be useful without a specification of the errors due to sampling rather than a full database search.

Sampling can be done with or without replacement. In sampling without replacement (disjoint samples), any two samples are independent, whereas in sampling with replacement, sample values are dependent. The sample size should be increased when the estimation error of the AQP method is higher than an acceptable threshold during the sampling phase. The results produced by resampling without replacement from the larger sample set are dependent on the original sample set. The resampling process can be repeated to balance estimation errors while limiting the processing time required by the bootstrapping method [15], [29].

Collection samples consist of randomly selected documents from the original collection. Queries can be conducted in parallel on such samples. Given C, the set of documents in the collection, and S, the set of documents in a sample used by the AQP method, the scaling factor σ is defined as:

\[ \sigma = \frac{|C|}{|S|}. \tag{27} \]

The smaller the sample size, the larger is σ and the shorter is the response time to a query posed to the sample, but also the larger are the estimation errors based on this sample.

Let {s_1, s_2, . . . , s_n} be the set of n sensitivity classes of the documents in C. Call c_i the count of documents classified in sensitivity class s_i, with |C| = \sum_{i=1}^{n} c_i. Given the aggregate query θ, let θ' be the corresponding approximate query carried out using the documents in the sample S.

The response to the approximate query θ' may only include documents in m ≤ n of the sensitivity classes s'_i. Call c'_i ≤ c_i the count of documents classified in sensitivity class s'_i. Then |S| = \sum_{i=1}^{m} c'_i.

Sampling errors. A key element of any AQP system is to provide error bounds for the approximate results, allowing the user to decide whether the results are acceptable. The Sampling-based Approximate Query Processing (S-AQP) with guaranteed accuracy provides bounds on the error caused by sampling [15].

The distribution converges to the standard normal distribution N(0, 1) as n, the number of elements in the sample, goes to infinity. The sampling error for sensitivity class s_i is

\[ e_i = 100\,\frac{c_i - c'_i}{c_i}. \tag{28} \]

The error vector due to sampling is

\[ E = (e_1, e_2, \ldots, e_n). \tag{29} \]

In uniform random sampling, n − m classes may not appear in the response to the query, and the components of the error vector for the missing classes are 100%.
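A small sketch of the sampling and error computations of Equations 27-29 on synthetic data; the class names, weights, and sizes are hypothetical.

    import random
    from collections import Counter

    random.seed(1)
    classes = ["Top Secret", "Secret", "Information", "Official"]
    weights = [0.08, 0.15, 0.31, 0.46]              # hypothetical class mix

    # Synthetic collection C and a uniform random sample S without replacement.
    C = random.choices(classes, weights=weights, k=1_000_000)
    S = random.sample(C, k=1_000)
    sigma = len(C) / len(S)                         # scaling factor, Equation 27

    true_counts = Counter(C)
    estimates = {cls: sigma * cnt for cls, cnt in Counter(S).items()}

    # Error vector (Equations 28 and 29); a class missing from S gets a 100% error.
    errors = {cls: 100 * (true_counts[cls] - estimates.get(cls, 0)) / true_counts[cls]
              for cls in classes}
    print(errors)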

[Figure 3 compares the tightness of the bounds, on a scale from 0 to 1, obtained with the Markov inequality, the Chebyshev inequality, and the closed-form CLT.]

Fig. 3: Bound tightness comparison obtained by using the Markov and Chebyshev inequalities and the closed-form CLT.

We use a closed-form Central Limit Theorem (CLT) and the Markov and Chebyshev inequalities to get the tightest bounds. The tightness of the bounds resulting from the three aforementioned approaches is illustrated in Figure 3. The Markov inequality provides larger deviation bounds than the Chebyshev inequality. The closed-form CLT provides the tightest bound among these three approaches [35].

Experimental results. The results reported in this section are restricted to sensitivity analysis based solely on intrinsic information, and we report on two groups of experiments: (a) investigate the effect of the sample size on the estimation errors; and (b) study the effectiveness of the selective disinformation.

These experiments were conducted on a cluster of 100 AWS EC2 instances (t2.large) with two vCPUs, 8 GB memory, and Linux kernel version 4.4.0-59-generic. MongoDB version 3.2.7 was used as the NoSQL server. MongoDB supports a variety of storage engines designed and optimized for various workloads. The storage engine is responsible for data storage both in memory and on disk; we chose the WiredTiger storage engine. The OPE and AHOME cryptosystems are implemented locally and the other crypto modules are imported from OpenSSL version 1.0.2g.

The relationship between sample size and estimation error. We created four sets of random samples from an original collection of 10^7 documents. Each one of the four sets included 100 random samples with 10^2, 10^3, 10^4, and 10^5 documents, respectively. The samples were selected with and without replacement. Figure 4 displays the error for the different sample sizes and the two sampling modes.

The measurement results show that samples without replacement exhibit slightly more accurate results than samples with replacement. For instance, the average error percentage is 0.22% for the largest sample of 10^5 documents, whereas the error is 5.08% for the smallest sample size of 100 documents. We concluded that a scaling factor of 10^3 is perfectly suitable. This scaling factor is likely to reduce the average query response time by two orders of magnitude.

[Figure 4 plots the error percentage versus the number of documents per sample, from 10^2 to 10^5, with and without replacement.]

Fig. 4: Estimation errors and confidence intervals for a collection with 10^7 documents. Results are shown for 10^2, 10^3, 10^4, and 10^5 documents per sample. Sampling without replacement consistently exhibits slightly more accurate results than sampling with replacement.

The effectiveness of selective disinformation. The experiments have three objectives: (a) investigate the accuracy of sensitivity analysis for collections with multiple classes of documents; (b) study the effect of sampling on the query response time; and (c) assess the effectiveness of the selective disinformation.

In these experiments, we created a collection of ten million documents in eight sensitivity classes. Table 1 shows the eight sensitivity classes and the count and percentage of documents in each sensitivity class.

TABLE 1: The original collection of 10^7 documents and the number of documents in each one of the eight sensitivity classes.

Class (s_i)      Cardinality (c_i)   Percentage
Top Secret             782 471          7.823
Secret               1 475 118         14.751
Information          3 134 844         31.348
Official             1 475 603         14.756
Unclassified           783 443          7.834
Clearance              783 024          7.830
Confidential           782 698          7.826
Restricted             782 799          7.828
Total               10 000 000        100.000

TABLE 2: The effect of sampling on the number of documents in each sensitivity class when the sample size is 10^4. The column labeled "Difference" shows the difference between the number of documents in the original collection and the estimate obtained from the sample.

Class (s'_i)     Cardinality (c'_i)   Percentage   Difference
Top Secret              785 600          7.856       -3 129
Secret                1 462 000         14.620       13 118
Information           3 152 200         31.522      -17 356
Official              1 463 700         14.637       11 903
Unclassified            787 200          7.872       -3 757
Clearance               784 800          7.848       -1 776
Confidential            783 900          7.839       -1 202
Restricted              780 300          7.803       -2 499
Total                 9 945 260         99.4526      54 740

Then we sampled the original collection and investigated the accuracy of document classification in the eight sensitivity classes. Table 2 displays the estimated cardinality of each class when the sample size is 10^4. In this case the scaling factor is

σ = 10^7 / 10^4 = 10^3,    (30)

and the overall error is less than 0.55%, as fewer than 55 000 out of the 10^7 documents have been misclassified. We expected a query processing speedup of the same order as σ and it is indeed 1 300, as we shall see next.
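A minimal sketch of the scaling step, assuming the sample is available as a list of sensitivity labels; the function names and the Counter-based tally are illustrative, not the authors' implementation.

from collections import Counter

SCALING_FACTOR = 10**3          # sigma = 10^7 / 10^4, Equation (30)

def estimate_class_cardinalities(sample_labels, scaling_factor=SCALING_FACTOR):
    """Scale the per-class counts observed in the sample up to the full collection."""
    counts = Counter(sample_labels)
    return {cls: count * scaling_factor for cls, count in counts.items()}

def misclassification(original, estimated):
    """Total absolute deviation between true and estimated class cardinalities."""
    return sum(abs(original[c] - estimated.get(c, 0)) for c in original)

# Hypothetical usage: sample_labels holds the sensitivity labels of the 10^4
# sampled documents; original holds the true cardinalities from Table 1.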

Once the error was estimated, the experiment continued with the estimation of the query response time, defined as the interval between the time when the server receives a query and the time it starts forwarding the result. Most database servers cache the most recently used data to reduce the response latency. In our experiments query caching and prefetching were disabled to force the query optimizer to serve the next matching queries directly from the database rather than from cache memory. The aggregate query displayed in Figure 5 is used to compute the cardinality and percentage of each sensitivity class.

The results show that the average speedup due to sampling is better than linear and the estimation errors are quite low. Both metrics are plotted for the four sample sizes in Figure 6. For the largest sample of 10^5 documents, the average processing time is 93 ms, whereas for the original collection of 10^7 documents the processing time of the same query is 14 000 ms; the speedup is 150.


db[collection].aggregate([
  { "$group": { "_id": { "clearance": "$clearance" }, "count": { "$sum": 1 } } },
  { "$project": { "count": 1,
      "percentage": { "$concat": [
        { "$substr": [ { "$multiply": [ { "$divide": [ "$count", { "$literal": Size } ] }, 100 ] }, 0, 6 ] },
        "", "%" ] } } }
]);

Fig. 5: The aggregate query for sensitivity analysis of a given collection, including the original or sample datasets.
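For readers driving MongoDB from Python, the following is a hedged pymongo equivalent of the Figure 5 pipeline with simple wall-clock timing; the connection string, database and collection names, and the Size constant are placeholders, and the cache-disabling configuration described above is not reproduced here.

import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
coll = client["warehouse"]["documents"]             # placeholder names
SIZE = 10_000_000                                    # cardinality of the queried dataset

pipeline = [
    {"$group": {"_id": {"clearance": "$clearance"}, "count": {"$sum": 1}}},
    {"$project": {"count": 1,
                  "percentage": {"$concat": [
                      {"$substr": [{"$multiply": [{"$divide": ["$count", SIZE]}, 100]}, 0, 6]},
                      "", "%"]}}},
]

start = time.perf_counter()
result = list(coll.aggregate(pipeline))              # force full evaluation of the cursor
elapsed_ms = (time.perf_counter() - start) * 1000
print(result, f"{elapsed_ms:.1f} ms")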

Fig. 6: The performance of SADS classification over 100 sample sets with 10^5, 10^4, 10^3, and 10^2 documents: (a) the processing time of the aggregate query (93, 10.8, 1.8, and 0.45 ms, respectively, versus 14 000 ms for the full collection); (b) the achieved speedup (150, 1 300, 8 000, and 31 000, respectively).


Lastly, we used the sensitivity analysis to generate disinformation. A disinformation replication factor V is assigned to each class according to its sensitivity. For example, knowing the approximated cardinality c'_i of the number of documents in each sample class, we chose replication factors of 100, 25, 0, 5, 10, 0, 50, and 15 for the "Top Secret", "Secret", "Information", "Official", "Unclassified", "Clearance", "Confidential", and "Restricted" classes, respectively, from Table 2. The expansion factor E is defined as the ratio of the cardinality of the collection including disinformation documents to that of the original collection. In this example, the expansion factor is E = 0.18, while indiscriminate disinformation insertion leads to an expansion factor E = 100.

The overhead is drastically reduced while providing a similar leakage prevention level.
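One possible reading of the selective-disinformation step is sketched below: each class receives V_i disinformation copies per estimated original document, and the result is reported relative to the indiscriminate baseline of 100 copies per document. This convention is an assumption rather than the paper's exact formula, although with the Table 2 estimates and the replication factors quoted above it yields approximately 0.18.

# Estimated class cardinalities (c'_i) from Table 2 and the per-class
# replication factors V_i quoted above; both are taken from the text.
estimated = {"Top Secret": 785_600, "Secret": 1_462_000, "Information": 3_152_200,
             "Official": 1_463_700, "Unclassified": 787_200, "Clearance": 784_800,
             "Confidential": 783_900, "Restricted": 780_300}
replication = {"Top Secret": 100, "Secret": 25, "Information": 0, "Official": 5,
               "Unclassified": 10, "Clearance": 0, "Confidential": 50, "Restricted": 15}

original_size = sum(estimated.values())
# Disinformation documents contributed by each class under this reading.
disinfo = {cls: estimated[cls] * replication[cls] for cls in estimated}
added = sum(disinfo.values())

# Assumed convention: expansion relative to indiscriminate replication of
# every document 100 times (the E = 100 baseline mentioned in the text).
relative_expansion = added / (100 * original_size)
print(f"disinformation documents: {added}, relative expansion: {relative_expansion:.2f}")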

The conclusion of our experiments is that SADS is a powerful method to substantially reduce query latency with bounded and small estimation errors. SADS with uniform random sampling provides sensible results for the classification aggregate query workload, with a sensible compromise between sample size and query latency. However, for queries with different workloads, such as aggregate functions that involve multiple correlated collections, uniform sampling cannot provide accurate responses; we therefore designed a biased sampling technique for this problem. In the next section, we discuss approximate answers for correlated collections.

6 WAREHOUSE INFORMATION LEAKAGE

Investigation of potential leakage in a data warehouse requires cross-correlations among all datasets, a truly daunting task due to the colossal amount of data and the huge computational effort. A warehouse hosting n databases, each database with m collections, and each collection containing q documents requires N_c = (m × n × q)^2 correlations. For example, when m = 10^3, n = 10^6, and q = 10^9 then N_c = 10^36. Can the sampling methods discussed in Section 5 be extended to a cloud data warehouse hosting a large number of databases? This is the topic explored in this section.

Ultimately, we wish to understand the relationships among the collections in a data warehouse and among the attributes of the documents. These relationships can be captured by two networks: (1) the collection network Γ_C; and (2) the attribute network Γ_A. Construction of these two networks follows the same ideas discussed earlier: sampling the databases followed by correlations among samples and the determination of the estimation errors.

The vertices and the links of Γ_C, the collection network, are the collections and the identifier attributes, respectively. Two vertices i and j are connected if they share a number of identifier attributes. The link connecting the two vertices is labeled with the list of common attributes of the two collections. In our example the number of vertices is m.

The vertices and the links of Γ_A, the attribute network, are the attributes and the lists of documents sharing attributes, respectively. Two vertices i and j are connected through a link if they appear together in one or more documents. The link between the two vertices is labeled with the Ids of the documents containing the common identifier attributes. An identifier attribute could be a postal address, phone number, social security number, patient ID, and so on for collections related to healthcare, utilities, financial records, etc.
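A sketch of how the two networks could be assembled from sampled documents, assuming the documents are available as Python dictionaries; the set of identifier attributes and the dictionary-based adjacency representation are illustrative assumptions.

from collections import defaultdict
from itertools import combinations

# Attributes treated as identifiers (postal address, phone number, SSN, ...);
# the exact list is an assumption for illustration.
IDENTIFIER_ATTRIBUTES = {"address", "phone", "ssn", "patient_id"}

def build_collection_network(samples):
    """samples: dict mapping collection name -> list of sampled documents (dicts).
    Two collections are linked if their documents share identifier attributes;
    the edge label is the set of shared identifier attributes."""
    attrs = {name: set().union(*(d.keys() for d in docs)) & IDENTIFIER_ATTRIBUTES
             for name, docs in samples.items()}
    edges = {}
    for a, b in combinations(samples, 2):
        shared = attrs[a] & attrs[b]
        if shared:
            edges[(a, b)] = shared
    return edges

def build_attribute_network(samples):
    """Two identifier attributes are linked if they co-occur in a document;
    the edge label is the list of document ids containing both."""
    edges = defaultdict(list)
    for name, docs in samples.items():
        for doc in docs:
            present = sorted(IDENTIFIER_ATTRIBUTES & doc.keys())
            for a, b in combinations(present, 2):
                edges[(a, b)].append(doc.get("_id"))
    return edges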

We suspect that Γ_C and Γ_A are scale-free networks with a power-law distribution of node degrees [37]. Networks with a power-law distribution of node degrees appear naturally in social networks and other virtual organizations. Such organizations are inherently heterogeneous: there are a few highly connected entities and a very large number of entities with few connections. Several instances of virtual organizations, as well as man-made systems, seem to enjoy this type of organization.


In a scale-free organization the probability p(k) that an entity interacts with k other entities decays as a power law

p(k) ≈ k^(-γ),    (31)

with γ a constant and k a positive integer. We only consider the discrete case, when the probability density function is p(k) = a f(k) with f(k) = k^(-γ) and the constant a is a = 1/ζ(γ, k_min), thus

p(k) = (1/ζ(γ, k_min)) k^(-γ).    (32)

In this expression k_min is the lowest degree of any node, and in our discussion we assume that k_min = 1; ζ is the Hurwitz zeta function, defined as ζ(s, q) = Σ_{n=0}^∞ 1/(q + n)^s for s, q ∈ C with Re(s) > 1 and Re(q) > 0; the Riemann zeta function is ζ(s, 1).

Fig. 7: A scale-free network is non-homogeneous; the majority of vertices of a graph model of a scale-free network have a low degree and only a few vertices are connected to a large number of edges; the majority of the vertices are directly connected with the vertices with the highest degree.

The degree distribution of scale-free networks follows a power law; with k_min = 1 the normalization constant is

ζ(γ, k_min) = Σ_{n=0}^∞ 1/(k_min + n)^γ = Σ_{n=0}^∞ 1/(1 + n)^γ.    (33)

Figure 7 shows the graph of a scale-free network. The average distance d between the P nodes of a scale-free network, also referred to as the diameter of the scale-free network, scales as ln P.

Empirical data confirm the existence of scale-free organization in many instances where multiple entities are interconnected with one another. For example, the power grid of the Western US has some 5,000 nodes representing power generating stations; in this scale-free network γ ≈ 4. The collaborative graph of movie actors, where links are present if two actors were ever cast in the same movie, follows the power law with γ ≈ 2.3 [38]. Recent studies indicate that γ ≈ 3 for the citation of scientific papers [39].
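A small numerical sketch of the normalized discrete power law of Equations (32) and (33), using the Hurwitz zeta function from SciPy; the exponent value is only an example.

from scipy.special import zeta   # zeta(s, q) is the Hurwitz zeta function

def powerlaw_pmf(k, gamma, k_min=1):
    """p(k) = k^(-gamma) / zeta(gamma, k_min) for integer degrees k >= k_min."""
    return k ** (-gamma) / zeta(gamma, k_min)

# Example: with gamma = 2.3 (the movie-actor exponent cited above),
# the probability that a node has degree 1, 2, and 10.
print([round(powerlaw_pmf(k, 2.3), 4) for k in (1, 2, 10)])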

Correlations among collections. We consider a 2-way correlation reflecting an unintentional join relation of collections C1 and C2. This relation is the result of a common identifier attribute, called a linkage attribute. A k-way correlation with k > 2 is a k-way join relation between a sequence of k correlated collections C1, C2, ..., Ck. A k-way correlation is equivalent to a combination of multiple 2-way correlations [28], [36].

Biased sampling. A new sampling strategy is discussed next, as uniform random sampling leads to large errors. Biased sampling takes into account the frequency of values [40] in the original dataset. Biased sampling creates sample sets with respect to the repetition frequency of each value of the intended attribute. The higher the frequency of occurrence, the higher the probability to be included in the sample set. Furthermore, infrequent collection documents have a small contribution to the sample set. However, if there are large numbers of infrequent values their impact adds up.

We customized the biased sampling algorithm to be consistent with cross-correlation analysis queries, which are join-driven, to probe and extract the leaked attributes from the given collections. For any 2-way join query there are two collections: the left and the right collection. The cross-correlation analytical query returns a new set of attributes from the right collection based on the evaluation of the join predicate.

Tunable thresholds T_i for collection C_i can balance the sample size and the accuracy. Documents occurring with frequency f_v > T_i are added to the sample, while those with frequency f_v ≤ T_i are included with probability p_v = f_v / T_i. Higher values of T_i result in a smaller sample set.

Sometimes we wish to bias the sampling by adjusting the thresholds of two collections, as seen in Equation (34). This equation shows c_v, the cross-correlation using biased sampling for an attribute value v of two collections C_L and C_R with the threshold parameters T_L and T_R, respectively.

c_v =
  f_L(v) · f_R(v)                                  if f_L(v) ≥ T_L and f_R(v) ≥ T_R
  T_L · f_R(v)                                     if f_L(v) < T_L and f_R(v) ≥ T_R
  f_L(v) · T_R                                     if f_L(v) ≥ T_L and f_R(v) < T_R
  f_L(v) · f_R(v) · max(T_L/f_L(v), T_R/f_R(v))    if f_L(v) < T_L and f_R(v) < T_R
                                                   (34)

We adjust the threshold parameters of the biased sampling algorithm to generate a larger sample set from the right collection. T_R is adjusted to be significantly smaller than T_L to increase the selection probability and, thus, obtain a larger sample from C_R. With this adjustment the sizes of the two samples are different. The cross-correlation analysis uses the samples created by this process.
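A sketch of the frequency-threshold rule described above, assuming per-value frequencies of the join attribute are computed from the documents themselves; field names and thresholds are placeholders.

import random
from collections import Counter

def biased_sample(documents, attribute, threshold):
    """Include every document whose attribute value occurs with frequency
    above the threshold; include the rest with probability f_v / threshold."""
    freq = Counter(doc[attribute] for doc in documents)   # f_v for each value v
    sample = []
    for doc in documents:
        f_v = freq[doc[attribute]]
        if f_v > threshold or random.random() < f_v / threshold:
            sample.append(doc)
    return sample, freq

# Hypothetical usage: a smaller threshold T_R for the right collection yields a
# larger sample from C_R, as described above.
# left_sample, f_L = biased_sample(left_docs, "value", threshold=50)
# right_sample, f_R = biased_sample(right_docs, "value", threshold=5)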

An instance of the correlation extractor query is displayed in Figure 8. After the samples are created, the 2-way cross-correlation analysis query is processed over the samples instead of the original data.

db[Left].aggregate([
  { "$lookup": { "from": Right, "localField": "value", "foreignField": "value", "as": "correlation" } },
  { "$match": { "correlation": { "$ne": [] } } },
  { "$out": saveToDest }
]);

Fig. 8: The aggregation join query for the discovery of attributes from the right collection, based on the evaluation of an equality check on the value of the common attribute.

The sampling probability for collection C_i is p_i = |S_i| / |C_i|, where S_L and S_R are the sample sets taken from C_L and C_R, respectively.


The exact correlation between C_L and C_R is denoted by C_LR, while S_LR is the approximate value computed for S_L and S_R. Using biased sampling, the correlation size approximation is computed as C = Σ_v c_v. The scaling factor 1/min(p_L, p_R) is used to scale up the result from the size of the samples to the size of the original dataset.
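A sketch of the estimator: c_v from Equation (34) summed over the attribute values common to both samples and scaled by 1/min(p_L, p_R); the frequency dictionaries are assumed to come from a biased-sampling step such as the one sketched earlier.

def correlation_contribution(f_l, f_r, t_l, t_r):
    """c_v from Equation (34) for a single attribute value v."""
    if f_l >= t_l and f_r >= t_r:
        return f_l * f_r
    if f_l < t_l and f_r >= t_r:
        return t_l * f_r
    if f_l >= t_l and f_r < t_r:
        return f_l * t_r
    return f_l * f_r * max(t_l / f_l, t_r / f_r)

def estimate_correlation_size(freq_left, freq_right, t_l, t_r, p_l, p_r):
    """C = sum of c_v over the values common to both samples, scaled up by
    1 / min(p_L, p_R) to the size of the original collections."""
    common = freq_left.keys() & freq_right.keys()
    c = sum(correlation_contribution(freq_left[v], freq_right[v], t_l, t_r) for v in common)
    return c / min(p_l, p_r)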

Experimental results. We use four correlated databases from the social media, phone directory, medical, and financial areas. Each collection includes 10^7 documents. The known pairwise correlations are shown in Figure 9.

The proposed estimation method based on biased sampling is evaluated on different datasets and compared with random sample selection. The optimized biased sampling provides more accurate results and reduces the processing time. The frequency of occurrence in each dataset can be computed offline.

Fig. 9: The estimation algorithm uses four datasets, each with 10^7 documents and with a pre-defined level of correlation: 50.34% between the social network profiles and the phone directory; 31.18% between the health and financial collections; and 41.7% between the health collection and the phone directory.

Fig. 10: The approximation of the cross-correlation size using random and biased sampling for the Social-Phone, Phone-Health, and Health-Finance pairs; the exact cardinalities are shown for comparison.

The comparison between random sampling and the proposed method in Figure 10 shows that biased sampling performs better than random sampling. The estimation errors for the biased sampling method are very low, almost zero for two of the three database correlations.

Computing the exact cross-correlation between the social media and phone directory collections requires almost 12 000 seconds, while the time is reduced to 1.5 seconds, with a 1% error, by this sampling method.

7 CONCLUSIONS AND FUTURE WORK

The impact of information leakage will most likely grow in the future, not only for public clouds but also for private clouds. Indeed, more organizations and government agencies transition to private and hybrid clouds with the belief that a cloud could offer enhanced security and prevent data theft. The sad reality is that there are no foolproof methods to prevent information leakage, as the cloud data encryption discussed in Section 2 has serious limitations.

The sensitivity analysis based on data sampling, inspired by approximate query processing and introduced in Section 5, offers a glimmer of hope. The sensitivity analysis identifies the most valuable information to be protected and offers some guidance on how to protect against insider attacks.

Attribute correlation among the databases of a cloud warehouse involves processing enormous amounts of data, and brute-force methods are hopeless. The method based on heterogeneous biased data sampling has a reasonable level of accuracy. The optimum sample size results in substantial speedup with results close to the exact value.

We suggest the introduction of leakage-detection cloud services that can offer guidance to organizations on how to better protect their data and minimize the risks of information leakage. Sensitivity and cross-correlation analysis at the cloud warehouse level can only be conducted by a CSP with access to all datasets. The extension of individual Service Level Agreements to include a clause related to information leakage protection will allow CSPs to periodically compute the correlations necessary to construct the two networks discussed in Section 6.

Future work. The authors are further developing an information-theoretic framework for the analysis of information leakage based on the capacity of a leakage channel introduced in Section 3. We are also developing algorithms to construct the collection and the attribute networks, will investigate whether our assumption that Γ_C and Γ_A are indeed scale-free networks is valid, and will continue to experiment with large collections of documents.

ACKNOWLEDGMENTS

The authors wish to express their gratitude to the anonymous reviewers whose comments and suggestions contributed to significant improvements of the paper.

REFERENCES

[1] F. Y. Rashid. "The dirty dozen: 12 cloud security threats." Infoworld, www.infoworld.com/article/3041078/security/the-dirty-12-cloud-security-threats.html, March 11, 2016.

[2] M. Balduzzi, J. Zaddach, D. Balzarotti, E. Kirda, and S. Loureiro. "A security analysis of Amazon's elastic compute cloud service." Proc. 27th Annual ACM Symp. Applied Computing, pp. 1427–1434, 2012.

[3] Cloud Security Alliance. "Security guidance for critical areas of focus in cloud computing V2.1." https://cloudsecurityalliance.org/csaguide.pdf, 2009.

[4] Cloud Security Alliance. "Top threats to cloud computing V1.0." https://cloudsecurityalliance.org/topthreats/csathreats.v1.0.pdf, 2010. Accessed August 2015.


[5] Cloud Security Alliance. "Security guidance for critical areas of focus in cloud computing V3.0." https://cloudsecurityalliance.org/guidance/csaguide.v3.0.pdf, 2011. Accessed August 2015.

[6] NIST. "Top 10 cloud security concerns (Working list)." http://collaborate.nist.gov/twiki-cloud-computing/bin/view/CloudComputing. Accessed February 2017.

[7] M. O'Neill. "SaaS, PaaS, and IaaS: a security checklist for cloud models." http://www.csoonline.com/article/660065/saas-paas-and-iaas-a-security-checklist-for-cloud-models. Accessed August 2015.

[8] S. Garfinkel and M. Rosenblum. "When virtual is harder than real: security challenges in virtual machines based computing environments." Proc. 10th Conf. Hot Topics in Operating Systems, pp. 20–25, 2005.

[9] S. T. King, P. M. Chen, Y-M. Wang, C. Verbowski, H. J. Wang, and J. R. Lorch. "SubVirt: Implementing malware with virtual machines." Proc. IEEE Symp. Security and Privacy, pp. 314–327, 2006.

[10] M. Price. "The paradox of security in virtual environments." Computer, 41(11):22–28, 2008.

[11] J. Luna, N. Suri, M. Iorga, and A. Karmel. "Leveraging the potential of cloud security service level agreements through standards." IEEE Cloud Computing, 2(3):32–40, 2015.

[12] P. Mell. "What is special about cloud security?" IT Professional, 14(4):6–8, 2012. http://doi.ieeecomputersociety.org/10.1109/MITP.2012.84. Accessed August 2015.

[13] S. Pearson and A. Benameur. "Privacy, security, and trust issues arising from cloud computing." Proc. Cloud Computing and Science, pp. 693–702, 2010.

[14] D. C. Marinescu. Cloud Computing; Theory and Practice, 2nd Ed. Morgan Kaufmann, San Francisco, CA, 2017.

[15] S. Agarwal, H. Milner, A. Kleiner, A. Talwalkar, M. Jordan, S. Madden, B. Mozafari, and I. Stoica. "Knowing when you're wrong: Building fast and reliable approximate query processing systems." Proc. 2014 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD '14), pp. 481–492, 2014.

[16] M. Ahmadian, F. Plochan, Z. Roessler, and D. C. Marinescu. "SecureNoSQL: An approach for secure search of encrypted NoSQL databases in the public cloud." Int. J. Information Management, 37(2):63–74, 2017.

[17] R. A. Popa, C. Redfield, N. Zeldovich, and H. Balakrishnan. "CryptDB: protecting confidentiality with encrypted query processing." Proc. 23rd ACM Symposium on Operating Systems Principles, pp. 85–100, 2011.

[18] M. Naveed, S. Kamara, and C. V. Wright. "Inference attacks on property-preserving encrypted databases." Proc. 22nd ACM SIGSAC Conf. on Computer and Communications Security, pp. 644–655, 2015.

[19] C. Liu, L. Zhu, M. Wang, and Y.-a. Tan. "Search pattern leakage in searchable encryption: Attacks and new construction." Information Sciences, vol. 265, pp. 176–188, 2014.

[20] R. Ostrovsky. "Efficient computation on oblivious RAMs." Proc. 22nd ACM Symposium on Theory of Computing, pp. 514–523, 1990.

[21] H. A. David. Order Statistics. John Wiley, 1970.

[22] D. C. Marinescu and J. R. Rice. "Synchronization of non-homogeneous parallel computations." Parallel Processing for Scientific Computing (G. Rodrigue, Ed.), SIAM, pp. 362–367, 1989.

[23] S. Faber, S. Jarecki, H. Krawczyk, Q. Nguyen, M. Rosu, and M. Steiner. "Rich queries on encrypted data: beyond exact matches." Proc. 20th Euro. Symp. Research in Computer Security, Lecture Notes in Computer Science, Vol. 9327, pp. 123–145, Springer-Verlag, Berlin, 2015.

[24] O. Goldreich and R. Ostrovsky. "Software protection and simulation on oblivious RAMs." Journal of the ACM (JACM), 43(3):431–473, 1996.

[25] N. Li, T. Li, and S. Venkatasubramanian. "t-closeness: privacy beyond k-anonymity and l-diversity." Proc. 23rd Int. Conf. on Data Engineering, pp. 106–115, April 2007.

[26] S. E. Whang and H. Garcia-Molina. "Managing information leakage." Stanford InfoLab, 2010.

[27] K. Zeng, S. Gao, B. Mozafari, and C. Zaniolo. "The analytical bootstrap: a new method for fast error estimation in approximate query processing." Proc. 2014 ACM SIGMOD Int. Conf. on Management of Data, pp. 277–288, 2014.

[28] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. "Join synopses for approximate query answering." ACM SIGMOD Record, vol. 28, no. 2, pp. 275–286, 1999.

[29] B. Babcock, S. Chaudhuri, and G. Das. "Dynamic sample selection for approximate query processing." Proc. 2003 ACM SIGMOD Int. Conf. on Management of Data, pp. 539–550, 2003.

[30] M. Ahmadian, A. Paya, and D. Marinescu. "Security of applications involving multiple organizations and order preserving encryption in hybrid cloud environments." Proc. IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 894–903, 2014.

[31] A. Boldyreva, N. Chenette, Y. Lee, and A. O'Neill. "Order-preserving symmetric encryption." Advances in Cryptology - EUROCRYPT, pp. 224–241, 2009.

[32] C. Mavroforakis, N. Chenette, A. O'Neill, G. Kollios, and R. Canetti. "Modular order-preserving encryption, revisited." Proc. 2015 ACM SIGMOD Int. Conf. on Management of Data, pp. 763–777, 2015.

[33] C. Gentry. "Computing arbitrary functions of encrypted data." Communications of the ACM, vol. 53, no. 3, pp. 97–105, 2010.

[34] P. Paillier. "Public-key cryptosystems based on composite degree residuosity classes." Advances in Cryptology - EUROCRYPT '99, Springer, pp. 223–238, 1999.

[35] P. J. Huber. "The behavior of maximum likelihood estimates under nonstandard conditions." Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability, vol. 1, no. 1, pp. 221–233, 1967.

[36] D. Vengerov, A. C. Menck, M. Zait, and S. P. Chakkappen. "Join size estimation subject to filter conditions." Proc. VLDB Endowment, 8(12):1530–1541, 2015.

[37] D. C. Marinescu. Complex Systems and Clouds; A Self-organization and Self-management Perspective. Morgan Kaufmann, San Francisco, CA, 2016.

[38] D. J. Watts and S. H. Strogatz. "Collective dynamics of small-world networks." Nature, 393:440–442, 1998.

[39] R. Albert and A-L. Barabasi. "Statistical mechanics of complex networks." Reviews of Modern Physics, 72(1):48–97, 2002.

[40] C. Estan and J. F. Naughton. "End-biased samples for join cardinality estimation." Proc. 22nd Int. Conf. on Data Engineering (ICDE'06), pp. 20–20, 2006.

Mohammad Ahmadian received his Ph.D. in 2017 from the Computer Science Department at the University of Central Florida. His research interests are in the area of public cloud security.

Dan C. Marinescu During the period 1984-2001 Dan Marinescu was an Associate and then Full Professor in the Computer Science Department at Purdue University in West Lafayette, Indiana. Since August 2001 he has been a Professor of Computer Science at the University of Central Florida. He is conducting research in parallel and distributed systems, complex systems, and quantum information processing. Dan Marinescu has published more than 220 papers in journals and refereed conference proceedings. He has also published several books, including "Internet-based Workflow Management," published by Wiley-Interscience in 2002; "Cloud Computing; Theory and Practice," published by Morgan Kaufmann in 2013; "Complex Systems and Clouds; A Self-organization and Self-management Perspective," published by Morgan Kaufmann in 2016; and "Cloud Computing; Theory and Practice," Second Edition, published by Morgan Kaufmann in November 2017.
