

Page 1: [IEEE 2009 Ninth IEEE International Conference on Data Mining (ICDM) - Miami Beach, FL, USA (2009.12.6-2009.12.9)] 2009 Ninth IEEE International Conference on Data Mining - On the

On the (In)Security and (Im)Practicality of Outsourcing Precise Association Rule Mining

Ian Molloy, Ninghui Li, and Tiancheng Li

Center for Education and Research in Information Assurance and Security and
Department of Computer Science, Purdue University
West Lafayette, Indiana, USA

{imolloy, ninghui, li83}@cs.purdue.edu

Abstract—The recent interest in outsourcing IT services onto the cloud raises two main concerns: security and cost. One task that could be outsourced is data mining. In VLDB 2007, Wong et al. propose an approach for outsourcing association rule mining [1]. Their approach maps a set of real items into a set of pseudo items, then maps each transaction non-deterministically.

This paper analyzes both the security and costs associated with outsourcing association rule mining. We show how to break the encoding scheme from [1] without using context-specific information, reducing its security to that of a one-to-one mapping. We present a stricter notion of security than the one used in [1], and then consider the practicality of outsourcing association rule mining. Our results indicate that outsourcing association rule mining may not be practical if the data owner is concerned with data confidentiality.

Keywords-association rule mining, outsourcing, security

I. INTRODUCTION

The problem of outsourcing data mining tasks to a third-party service provider has been studied in a number of recent papers [1]–[3]. While outsourcing data mining has the potential of reducing the computation and software costs for the data owner, it is important that private information about the data is not disclosed to the service provider. Both the raw data and the mining results can contain business intelligence of the organization and private information about its customers, and both require protection from the service provider.

Unfortunately, the current understanding of the potential privacy threats in outsourcing data mining and of the needed privacy protection is still quite primitive. In [1] Wong et al. proposed an approach for outsourcing association rule mining. In their model, the data owner first encodes the transactional database before sending it to the service provider. The service provider finds the frequent itemsets and their support counts in the encoded database, then sends the information back to the data owner. The data owner then decodes the results to get the correct support counts of frequent itemsets in the original database. One naïve encoding approach is to replace each item in the original data with a randomly generated pseudo-identifier, but this is subject to frequency analysis attacks [4], [5]. To defend against this, Wong et al. propose an encoding algorithm (which we call the WCH+ algorithm) that supplements the naïve approach with additional, random, dummy items.

Wong et al. claim that “[the proposed] technique is highly secure with a low data transformation cost,” and include a proof of security [1]. We present here an attack that breaks the Wong encoding and reduces it to the naïve approach with a one-to-one mapping, allowing standard frequency analysis to be applied [4], [5].

We find that perfect secrecy is achievable but prohibitively expensive and of limited use. We introduce a more natural and practical notion of security and discuss the tradeoff between security and efficiency. We find there exists a point where the security costs exceed the cost of performing association rule mining oneself.

Our work makes the following contributions:

• We present an attack that breaks a state-of-the-art algorithm for outsourcing association rule mining. Knowledge of this attack can be used to develop more secure schemes and may be more widely applicable.

• We question the costs associated with outsourcing. If one is not concerned with frequency attacks, the naïve approach is sufficient; otherwise, efficient outsourcing may be impossible or impractical in many settings.

The remainder of this paper is organized as follows. In Section II, we present Wong et al.'s encoding scheme. In Section III we present our attack. We discuss different definitions of security and consider the practicality of outsourcing in general in Section IV, and Section V concludes.

II. BACKGROUND

This section provides definitions for a data-transformation framework for outsourcing frequent itemset mining. We then describe the encoding algorithm proposed in [1].

A. A Data Transformation Framework

2009 Ninth IEEE International Conference on Data Mining, 1550-4786/09 $26.00 © 2009 IEEE. DOI 10.1109/ICDM.2009.122

In the data transformation framework, outsourcing works as follows: The data owner has a transaction database T, where each tuple represents a transaction. In an encoding step, the data owner computes an encoded version of T, denoted by W. The data owner then provides W and a threshold θ to the service provider, who computes all itemsets in W that have support at least θ, and returns these itemsets and their support to the data owner. In a decoding step, the data owner computes the support of itemsets in T. Finally, using the result of the decoding step, the data owner computes the association rules and their support and confidence.

Let I be the set of items that can appear in the input table T. A transaction t is a subset of I, i.e., t ∈ 2^I, and T is a sequence of transactions, i.e., T ∈ (2^I)*. Let Σ be the set of items that can appear in an encoded table; then W ∈ (2^Σ)*.

Definition 1: A data transformation algorithm E takes as input a table T ∈ (2^I)*, and outputs ⟨W, D⟩, where W ∈ (2^Σ)* is the encoded table, and D : 2^Σ → 2^I is the decoding mapping. A data transformation algorithm E : (2^I)* → ⟨W, D⟩ is said to be sound if and only if the following two conditions hold:

1) ∀x ∈ 2^I \ {∅} ∀y ∈ 2^Σ \ {∅} [(D(y) = x) ⇒ (supp_T(x) = supp_W(y))].
2) ∀x ∈ 2^I \ {∅} ∃y ∈ 2^Σ \ {∅} [(D(y) = x)].

where supp_T(x) is the support of itemset x in table T.

Condition 1 above requires that if the itemset y decodes into a non-empty itemset x (i.e., D(y) = x), then y's support in W must equal x's support in T. Intuitively, this means that y corresponds to x in the original data. An itemset y ⊆ Σ may not correspond to any itemset in the original set, in which case we should have D(y) = ∅. Upon receiving an itemset y and its support from the service provider, the data owner discards it if D(y) = ∅ and otherwise records the support of D(y).

These conditions ensure that we can follow the outsourcing process described earlier to find all itemsets with frequency at least θ. For any itemset x that occurs at least θ times in T, Condition 2 requires that there must exist a y that decodes into x, and Condition 1 requires y to appear in W with the same frequency. Hence y will be found by frequent itemset mining in W, and will be returned with its support.

A naïve encoding approach is to replace each item with a pseudo-identifier. This, however, is insecure and vulnerable to frequency analysis [4], [5]. In a frequency analysis attack, an adversary uses known information regarding the distribution of items, such as letters in written text, to recover the original database. For example, an adversary may know that an item i is the most frequent item, and that when i occurs, j is also highly likely to occur. Knowing this, the attacker can find out which pseudo-identifiers correspond to i and j and recover their exact frequencies. To defeat frequency analysis attacks, Wong et al. [1] introduced a more sophisticated encoding approach.
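As a toy illustration of such a frequency analysis attack, the sketch below recovers a naïve pseudo-identifier substitution purely from item frequencies. The item names, transactions, and secret mapping are invented for illustration; they are not from the paper's datasets.

```python
from collections import Counter

# Hypothetical background knowledge: the adversary knows the public
# frequency ranking of the real items (most to least frequent).
known_rank = ["bread", "milk", "eggs"]

# A toy database and a secret naive substitution (both invented here).
original = [{"bread", "milk"}, {"bread"}, {"bread", "eggs"}, {"milk"}]
pseudo = {"bread": "p7", "milk": "p2", "eggs": "p9"}
encoded = [{pseudo[i] for i in t} for t in original]

# Frequency analysis: rank pseudo-identifiers by observed support and
# align that ranking with the known ranking of real items.
counts = Counter(x for t in encoded for x in t)
observed_rank = [x for x, _ in counts.most_common()]
recovered = dict(zip(observed_rank, known_rank))
# recovered maps p7 -> bread, p2 -> milk, p9 -> eggs
```

The attack succeeds here because the substitution preserves every item's support exactly; the WCH+ algorithm described next is designed to break precisely this alignment.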

B. The WCH+ Encoding Algorithm

We call the encoding algorithm in [1] the WCH+ encoding algorithm. In this algorithm, items in I are called original items. Σ, the set of items used in the encoded table, consists of three disjoint sets: U, the set of unique items; C, the set of common items; and F, the set of fake items. That is, Σ = U ∪ C ∪ F. The items in U correspond one-to-one with the items in I, that is, |U| = |I|. They can be viewed as replacing each original item with a pseudo-identifier so that the item name itself does not reveal which item it is. Items in C and F are used to defeat frequency analysis attacks. The sizes of C and F are parameters to the algorithm. The algorithm consists of the following two steps.

Step 1: Construct an item-level mapping. In the first step, the algorithm generates an item mapping m : I → 2^(U∪C), which maps each original item i ∈ I to a set of items in U ∪ C. Items in U appear in the image of exactly one item: let u : I → U be a random bijection between I and U; for each i ∈ I, m(i) contains u(i). For each item c_i ∈ C, the algorithm randomly picks b items j ∈ I and adds c_i to m(j), where b has an expected mean of N_B. The number N_B is an input parameter to the algorithm.

The decoding mapping D is uniquely determined by m. Given an itemset σ ⊆ Σ, D(σ) = {i ∈ I | m(i) ⊆ σ}.
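In code, the decoding mapping D is a one-line containment test. The sketch below is illustrative rather than part of the WCH+ specification; the mapping entries are borrowed from Table I.

```python
def decode(sigma, m):
    """Decoding mapping D: original item i is recovered iff its entire
    image m(i) is contained in the encoded itemset sigma."""
    return {i for i, image in m.items() if image <= sigma}

# Two entries from Table I: m(6) = {6, 32} and m(12) = {12, 32, 36}.
m = {6: {6, 32}, 12: {12, 32, 36}}
print(decode({6, 32, 36}, m))       # {6}: m(12) is not fully contained
print(decode({6, 12, 32, 36}, m))   # {6, 12}
```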

Step 2: Construct a transaction-level mapping. Here the algorithm processes the transactions in T one by one. For each transaction t ⊆ I, it performs three steps:

a) Compute M(t) = ∪_{i∈t} m(i).
b) Compute N(t) = M(t) ∪ E(t), where E(t) is a subset of U ∪ C. For each item j ∉ t, E(t) may include items in m(j) as long as m(j) ⊈ N(t), which is sufficient to ensure that D(N(t)) = t.
c) Compute the final transaction R(t) = N(t) ∪ s_f, where s_f is a random subset of F of a random size with mean N_F.

To recap, the algorithm uses the following three steps to defend against frequency analysis attacks: (1) each original item is mapped to a set of pseudo items, including one unique item and zero or more common items; (2) in each transaction, additional unique and common items are added while ensuring that one does not include all items in the mapping of an original item not in the transaction; (3) in each transaction, additional fake items are added.
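The two steps can be sketched as follows. This is a simplified rendering, not the authors' implementation: it fixes b = N_B exactly, leaves E(t) empty, and draws a fixed-size fake set, but it preserves the soundness property D(R(t)) = t.

```python
import random

def build_mapping(items, common, NB):
    """Step 1 (simplified): u(i) = i, and each common item is added to
    the images of exactly NB randomly chosen original items."""
    m = {i: {i} for i in items}
    for c in common:
        for j in random.sample(items, NB):
            m[j].add(c)
    return m

def encode_transaction(t, m, fake, NF):
    """Step 2 (simplified): M(t) is the union of the images; E(t) is
    left empty (which trivially keeps D(N(t)) = t); NF fake items are
    appended to form R(t)."""
    Mt = set().union(*(m[i] for i in t)) if t else set()
    sf = set(random.sample(fake, min(NF, len(fake))))
    return Mt | sf
```

Because each unique item u(i) appears only in m(i), an original item decodes from R(t) exactly when it was in t, which can be checked against the decoding mapping D.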

III. ATTACK

We now present an attack that breaks the security of the WCH+ encoding algorithm.

A. Summary of Our Attack

Before presenting how our attack works, let us first examine what a successful attack against the WCH+ algorithm needs to do. Recall that the WCH+ algorithm is introduced as an enhancement to the naïve algorithm of replacing each item with a pseudo-identifier, which is vulnerable to frequency analysis attacks [6], [7]. Hence the goal of a successful attack is to reduce the security of tables encoded using the WCH+ algorithm to the same level as those encoded using the naïve one-to-one mapping approach, at which point frequency analysis attacks may be applied.

Given a table W encoded from the original table T using the mapping m, there is a bijection g : T → W such that for every transaction t ∈ T, R(t) is the encoded transaction of t. Our attack succeeds if it can identify all images of the mapping m. That is, if we can find Γ = {m(i) | i ∈ I}, then we know that each γ ∈ Γ corresponds to an original item, and we can apply frequency analysis attacks just as in the case of the naïve algorithm.

Furthermore, while finding Γ = {m(i) | i ∈ I} is sufficient, it is not necessary. For example, even if an item a ∉ m(i), it may be fine to include a in the itemset corresponding to i, provided that a occurs in every transaction in which m(i) occurs. More precisely, our attack succeeds when it is able to find a set of correct mappings, defined as follows.

Definition 2: Given W, an encoded table of T under the bijection g : T → W, we say that Y is a correct mapping of the original item i when for each transaction t ∈ T, i ∈ t ⇔ Y ⊆ R(t).

It is sufficient for the attack to find a set of correct mappings for original items in I. As frequency analysis is most effective with items of high frequency, it is critical to identify the correct mappings of high-frequency original items. We illustrate the effectiveness of our attack on high-frequency items in Section III-F. Our attack analyzes the frequencies of single items as well as pairs of items in the encoded database, and relies only on knowledge of the encoding algorithm, not on its security parameters or on the frequencies of any itemsets in the original database.

Given the encoded database, our attack works as follows. First, we identify and remove fake items from the encoded table. Then, we identify pairs of associated items that occur in correct mappings. Finally, we recover the itemsets that are correct mappings. We now explain our attack in detail.

B. Example

To illustrate how our attack works, we introduce a small running example. We use the IBM data generator [8] to create a dataset with thirty items and 10,000 transactions. We encode it using the scheme from Wong et al. [1] with ten common items and two fake items. Each common item is added to an average of three mappings (N_B = 3), on average two extra items are added to each transaction (N_E = 2), and on average one fake item is added to each transaction (N_F = 1). The generated mapping is shown in Table I; note that, without loss of generality, we assume u(i) = i.

We illustrate the WCH+ encoding on a transaction from our dataset using these parameters. Consider the transaction

t = {2, 6, 11, 14, 16, 22, 29}.

The first step is to take the union of the item-wise mappings,

i    m(i)               supp_T(i)    i    m(i)               supp_T(i)
0    {0}                2573         15   {15, 34}           3375
1    {1, 34}            3118         16   {16, 38, 39}       3207
2    {2}                2917         17   {17, 33}           3058
3    {3}                3884         18   {18}               1980
4    {4}                3886         19   {19}               2950
5    {5}                3378         20   {20, 30, 33, 39}   1236
6    {6, 32}            3608         21   {21, 36}           2396
7    {7}                3364         22   {22}               4594
8    {8, 30, 38}        4339         23   {23, 31}           3774
9    {9, 37}            2847         24   {24}               3759
10   {10}               2919         25   {25}               2513
11   {11, 35}           2599         26   {26, 37}           2594
12   {12, 32, 36}       2716         27   {27}               2156
13   {13}               4483         28   {28, 35}           3342
14   {14}               4176         29   {29}               2012

Table I. The mappings m(·) in the example.

M(t) = ∪_{j∈t} m(j) = {2, 6, 11, 14, 16, 22, 29, 32, 35, 38, 39}.

In the next step we add E(t) = {8, 34} to the transaction:

N(t) = {2, 6, 8, 11, 14, 16, 22, 29, 32, 34, 35, 38, 39}.

Finally, we add the fake item s_f = {41}, yielding

R(t) = {2, 6, 8, 11, 14, 16, 22, 29, 32, 34, 35, 38, 39, 41}.

C. Identifying and removing fake items

In Step 2c, the WCH+ algorithm randomly generates a set s_f ⊆ F and adds s_f to obtain the final transformation R(t) = N(t) ∪ s_f. This approach of adding fake items has two weaknesses. The first weakness is that each fake item has the same probability of being added to each transaction, and thus appears with similar frequencies when the number of transactions is large. The second weakness is that fake items are added to transactions independently of the items already present. As a result, each fake item f is independent of every other item x. That is, for each item x, Pr[f ∧ x] = Pr[f] · Pr[x]. This second observation holds even if the frequency of each fake item is different. To measure independence, we define the following metric.

Definition 3: The loading factor of a pair {x, y} ⊆ Σ in W, denoted by loading_W({x, y}), is the ratio of the number of times we observe the pair to the number of times we expect to observe the pair assuming they are i.i.d.:

loading_W({x, y}) = (|W| · supp_W({x, y})) / (supp_W({x}) · supp_W({y})).   (1)

When loading_W({x, y}) = 1, x and y are independent. When loading_W({x, y}) > 1, x and y are positively correlated. When loading_W({x, y}) < 1, x and y are negatively correlated. Hence |1 − loading_W({x, y})| measures the degree of independence, with smaller values meaning higher independence.
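A direct computation of the loading factor of Equation 1, with an invented toy table in the usage line:

```python
def support(W, S):
    """supp_W(S): number of transactions in W containing every item of S."""
    return sum(1 for t in W if S <= t)

def loading(W, x, y):
    """Loading factor of {x, y} (Equation 1): observed joint support over
    the joint support expected if x and y occurred independently."""
    return len(W) * support(W, {x, y}) / (support(W, {x}) * support(W, {y}))

# Toy table: "a" and "b" co-occur more often than independence predicts.
W = [{"a", "b"}, {"a", "b"}, {"c"}, {"a", "c"}]
print(loading(W, "a", "b"))   # 4 * 2 / (3 * 2) = 1.33... > 1
```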

To tell whether an item x is a fake item or not, we need to check whether it is independent of all other items, hence:

[Figure 1. Fake (•) versus non-fake (x) items.]

Definition 4: The independence factor set of an item x ∈ Σ in W, denoted by Ind_W(x), is defined as

Ind_W(x) = { |1 − loading_W({x, y})| : y ∈ Σ ∧ y ≠ x }.

To identify fake items, we use the following observation.

Observation 1: When x is a fake item, both the arithmetic mean and the standard deviation of Ind_W(x) should be close to zero and smaller than those of unique or common items.

The effectiveness of this is illustrated in Figure 1. We add an additional eight fake items (for a total of ten), where the remaining items are given random support in [0, |T|].

We tested this method on a number of datasets with up to 1000 original items, 300 common items, 20 fake items, and 500K transactions, with identical results.
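The fake-item test behind Observation 1 can be sketched as below. The synthetic table and the item names are our own; the statistics computed are exactly the mean and standard deviation of Ind_W(x).

```python
import random
from statistics import mean, pstdev

def support(W, S):
    return sum(1 for t in W if S <= t)

def independence_stats(W, x, sigma):
    """Mean and standard deviation of Ind_W(x) = {|1 - loading_W({x,y})| : y != x}."""
    vals = []
    for y in sigma:
        if y == x:
            continue
        sx, sy = support(W, {x}), support(W, {y})
        if sx and sy:
            vals.append(abs(1 - len(W) * support(W, {x, y}) / (sx * sy)))
    return mean(vals), pstdev(vals)

# Synthetic data: "a" and "b" always co-occur (a correlated mapping),
# "c" is an ordinary item, and "f" is a fake item added independently.
random.seed(1)
W = []
for _ in range(2000):
    t = set()
    if random.random() < 0.5:
        t |= {"a", "b"}
    if random.random() < 0.3:
        t.add("c")
    if random.random() < 0.4:
        t.add("f")
    W.append(t)

sigma = {"a", "b", "c", "f"}
m_fake, s_fake = independence_stats(W, "f", sigma)
m_real, s_real = independence_stats(W, "a", sigma)
# m_fake is close to zero; m_real is pulled up by the pair {a, b}.
```

An adversary would flag as fake any item whose mean and deviation both fall below a small threshold, as in the lower-left cluster of Figure 1.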

D. Identify Associations

The goal of the next step is to identify associations between pairs of items. There may be many candidate mappings m′(·) that allow an adversary to recover the original data. We try to find a suitable candidate by identifying true associations between pairs of items.

Definition 5: We say that y ∈ Σ is truly associated with x ∈ Σ if and only if there exists an original item i such that {x, y} ⊆ Y and Y is a correct mapping for i.

We use two metrics to tell when an association x ⇒ y is true: loading_W({x, y}), defined above, and the following.

Definition 6: The association confidence of an association x ⇒ y in W, denoted by conf_W(x ⇒ y), is

conf_W(x ⇒ y) = supp_W({x, y}) / supp_W({x}).   (2)

Our attack relies on the following observation.

Observation 2: A true association x ⇒ y is very likely to have high association confidence and a high loading factor, and any other pair is unlikely to have both high association confidence and a high loading factor.

We now explain the rationale underlying this observation. Consider a true association x ⇒ y; the two items must appear together in m(i), where i is some original item. When x occurs in an encoded transaction, there are two cases. Case one is that i occurs in the original transaction. Here, y must also occur, contributing to the association confidence. Case two is that x is added during Step 2b. For a number of reasons, such situations will not occur frequently. First, as a unique item, x is unlikely to be added frequently in Step 2b. This is because common items in m(i) may appear in many mappings, thus appearing more frequently, reducing the probability of x being added, because it may result in incorrect decodings. A true association x ⇒ y may have a low association confidence when {x, y} ⊆ m(i) and i appears rarely in the table. In most situations, missing these associations is acceptable because frequency analysis, the step after our attack, is most effective against frequent items.

We point out that when x ⇒ y is a true association, conf_W(y ⇒ x) may not be high, because y may appear in many mappings, and hence may appear often without x.

Some pairs that are not true associations may still have high association confidence. Such an association x ⇒ y is due to y having a high frequency. These pairs can be differentiated from the true associations by examining the loading factor. When x ⇒ y is a true association, many of the joint occurrences are due to the fact that the items are in the same mapping, hence x and y will have a high loading factor. On the other hand, when the high association confidence of x ⇒ y is simply due to the high frequency of y, they will have a loading factor close to 1.

Based on Observation 2, by setting thresholds on the association confidence and the loading factor, one can identify the true associations. We combine these two metrics to create a one-dimensional ordering over all candidate sets as

µ_W({x, y}) = √( loading_W({x, y}) · conf_W(x ⇒ y) ).   (3)

Note that this captures the intuition that both the association confidence and the loading factor must be high. By selecting an appropriate threshold, the adversary can recover the correct associations and minimize the number of false positives and false negatives. The effectiveness of this one-dimensional metric is shown in Figure 2. Each antecedent i thus defines a candidate set of consequence items that may compose m(·). If desired, one may define two candidate sets, one for antecedents and one for consequences, and take their intersection. In practice, this is not required.
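A sketch of the combined score of Equation 3 and the thresholding step; the toy table and the default threshold value are illustrative assumptions.

```python
from math import sqrt

def support(W, S):
    return sum(1 for t in W if S <= t)

def mu(W, x, y):
    """mu_W({x, y}) (Equation 3): geometric mean of the loading factor
    of {x, y} and the confidence of the association x => y."""
    sx, sy, sxy = support(W, {x}), support(W, {y}), support(W, {x, y})
    loading = len(W) * sxy / (sx * sy)
    conf = sxy / sx
    return sqrt(loading * conf)

def candidate_associations(W, sigma, threshold=0.95):
    """Ordered pairs whose combined score clears the threshold."""
    return {(x, y) for x in sigma for y in sigma
            if x != y and mu(W, x, y) > threshold}

# Toy table: "u" (a unique item) never occurs without "c" (a common item).
W = [{"u", "c"}, {"u", "c"}, {"c"}, {"z"}, {"z", "c"}]
print(candidate_associations(W, {"u", "c", "z"}))
```

Note the asymmetry: the score for u ⇒ c clears the threshold while c ⇒ u does not, because c also occurs without u.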

The result produces what we call the mapping association graph: a graph G = (V, E) where V = Σ and (u, v) ∈ E, for u, v ∈ Σ, if µ_W({u, v}) is greater than some threshold. The graph for our running example is shown in Figure 3. It produces three false positives and zero false negatives. The final step is to identify subgraphs of the mapping association graph that represent the correct mappings.

[Figure 2. Confidence versus loadings for the candidate associations. An item in a gray circle is a correct association. The dashed line is the separation for µ_W({·, ·}) > 0.95.]

E. Finding Correct Mappings

At this stage of the attack, we have a set of associations that we believe belong to some mapping. The final stage is to identify all associations to recreate the mappings m(·) before frequency analysis. The key challenge is to identify which items are unique items and which items are common items. Once we are able to do that, we can recover all mappings. We use the following observation.

Observation 3: Unique items can only appear in one subgraph by definition, defining boundaries between mappings, and two unique items are unlikely to be associated together. The subgraphs can thus be identified by two-coloring the mapping association graph.

We two-color the association graph to obtain the final mappings; each mapping is a unique item and the adjacent common items. The difficulty is determining which of the two possible two-colorings is correct. Consider the association graph in Figure 3. In some instances, such as the subgraph ⟨23, 31⟩, either two-coloring is correct (item 31 may also be considered unique). Otherwise, we must select the correct two-coloring. This may be deferred until frequency analysis, or we may select the most probable mapping as follows.

Consider the subgraph containing items ⟨6, 12, 21, 32, 36⟩; the possible two-colorings are that {6, 12, 21} are unique, or that {32, 36} are unique. If 12 is unique, there exists a mapping with three items, m(i) = {12, 32, 36}; otherwise there exist mappings of size three m(i) = {6, 12, 32} and m(j) = {12, 21, 36}. We extend the loading factor to sets of items by assuming a set of items is composed of two independent disjoint sets:

loading_W(S) = min_{S1 ⊂ S, S2 = S \ S1} (|W| · supp_W(S)) / (supp_W(S1) · supp_W(S2)).   (4)

This is the most natural extension of the loading factor defined in Equation 1 because it maintains a similar scale. In the first possible two-coloring, loading_W({12, 32, 36}) = 1.702, while in the other two-coloring we have loading_W({6, 12, 36}) = 1.146 and loading_W({12, 21, 36}) = 0.607. We select the first instance as the most positively anomalous, and from Table I we can verify that this is correct.
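The generalized loading factor of Equation 4 can be computed by brute force over all splits, which is affordable because candidate mappings are small. This sketch is our own, with an invented toy table in the usage line.

```python
from itertools import combinations

def support(W, S):
    return sum(1 for t in W if S <= t)

def loading_set(W, S):
    """loading_W(S) (Equation 4): minimum over all splits of S into two
    non-empty parts S1 and S2 = S \\ S1. (Each split is visited twice,
    once per orientation, which does not change the minimum.)"""
    S = frozenset(S)
    best = float("inf")
    for r in range(1, len(S)):
        for part in combinations(S, r):
            S1 = frozenset(part)
            S2 = S - S1
            denom = support(W, S1) * support(W, S2)
            if denom:
                best = min(best, len(W) * support(W, S) / denom)
    return best

# Toy table: {a, b, c} always occur together, in half of the transactions.
W = [{"a", "b", "c"}] * 3 + [{"d"}] * 3
print(loading_set(W, {"a", "b", "c"}))   # 6 * 3 / (3 * 3) = 2.0
```

Comparing this score across the possible two-colorings picks out the most positively anomalous candidate mapping, as in the example above.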

The result of two-coloring the remainder of the graph is shown in Figure 3. Each mapping m(·) may be defined as a unique item and each adjacent common item separated by at most one edge, treating the graph as undirected. For example, {20, 30, 33, 39}, {8, 30, 38}, or {12, 32, 36}.

[Figure 3. Association graph for µ > 0.95. Unique items colored gray using Observation 3.]

F. Evaluation

In this section, we evaluate the effectiveness of our attack at identifying and isolating the true associations and allowing an adversary to recover the original database.

We generate several datasets, encode them without fake items¹, and select all pairs of candidate sets such that µ_W({·, ·}) > 1.0. Next, we make a simplifying assumption that we can identify each unique item, and define the candidate mappings m′(·) as a unique item and its candidate set. For example, let i ∈ I and x be the unique item in m(i). Then m′(i) = {x} ∪ {y | µ_W({x, y}) > 1.0 ∨ µ_W({y, x}) > 1.0}. Next we find all transactions W[m(i)] that contain m(i) and all transactions W[m′(i)] that contain m′(i). If W[m(i)] ≠ W[m′(i)], we consider the decoding to contain an error and say the item i decodes incorrectly.

In [1], Wong et al. report the decryption accuracy using recall, the total number of correct decodings divided by the total number of items. Without indicating the false positive rate this is a meaningless measure (decoding every transaction into I has perfect recall). We calculate the recall R only for items that are correctly decoded, as a percentage of the total size of the entire table T, i.e.,

R = ( Σ_{i ∈ I : W[m(i)] = W[m′(i)]} supp_T(i) ) / ‖T‖.
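This recall can be computed directly from the formula; the helper names and the tiny example table below are ours.

```python
def recall(T, W, m, m_prime):
    """R: the support in T of items whose candidate mapping m'(i) selects
    exactly the same encoded transactions as the true mapping m(i),
    as a fraction of the total item count ||T||."""
    def rows(image):
        # Indices of encoded transactions containing every item of image.
        return frozenset(j for j, t in enumerate(W) if image <= t)

    total = sum(len(t) for t in T)
    correct = sum(sum(1 for t in T if i in t)
                  for i in m if rows(m[i]) == rows(m_prime[i]))
    return correct / total

# Tiny example: item 1's candidate mapping picks up a spurious item 99,
# so it selects different rows and counts as an error.
T = [{0, 1}, {0}]
W = [{10, 11}, {10}]
m = {0: {10}, 1: {11}}
m_prime = {0: {10}, 1: {11, 99}}
print(recall(T, W, m, m_prime))   # 2 / 3
```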

In Table II we report the number of items that decode incorrectly as errors, E, and the recall, R, of W. We provide results for the top 10%, 20%, 40%, 50%, 75%, and 100% of items (by support in T), illustrating that most errors are due to infrequent items in the original data. Many of these errors are unlikely to be present in any large itemsets, allowing for complete recovery of the association rules by an adversary.

It should be clear from the table that our attack is highly effective, especially for the most frequent items, and results in a very low error rate. Further, most errors are due to infrequent items and are unlikely to adversely affect the recovery of the association rules, or large portions of the original database. Our larger tests (W2–W4) caused errors in around 3–4% of the unique items, and caused only around 1% of the total number of items in the original table to decode incorrectly. Depending on θ, this may have no effect on the association rules.

¹The attack against fake items never produced false positives or false negatives in any of our tests.


        W1       W2       W3       W4
|I|     100      1000     1000     1000
|C|     20       150      150      300
N_B     2.5      4        4        4
N_E     2        8        8        8
|W|     100 k    100 k    500 k    500 k

        E   R        E   R        E   R        E   R
10%     0   22.6%    0   20.0%    0   21.0%    0   20.5%
20%     0   38.4%    0   35.6%    0   36.8%    0   36.4%
40%     0   63.5%    0   60.7%    0   61.4%    0   61.4%
50%     1   72.7%    0   70.8%    0   71.3%    0   71.2%
75%     7   87.7%    0   90.1%    1   90.2%    5   90.0%
100%    12  92.2%    32  99.1%    40  98.4%    33  98.9%

Table II. Analysis of our attack. E: number of items in I producing at least one false positive or negative. R: percentage of W correctly decoded.

IV. DISCUSSIONS

It was claimed in [1] that "[the proposed technique] is highly secure"; indeed, a proof of security was given. The proof implicitly defines the security of an encoding scheme as equivalent to the property that there does not exist an equivalent one-to-one mapping. We note that this is an incorrect definition of security. The existence or non-existence of an equivalent one-to-one mapping does not constitute a proof of security. Even if an equivalent one-to-one mapping does not exist, it may be easy for an attacker to recover a one-to-many mapping and fully recover the original database, as shown by our attack. Conversely, even if a one-to-one decoding mapping exists, an encoding scheme may remain secure because it is infeasible for the attacker to find the mapping. For any symmetric cipher, such as the Advanced Encryption Standard (AES), there always exists a unique one-to-one decryption function. This, however, does not imply that the encryption is insecure, since finding that decryption function takes exponential time, making such attacks infeasible in practice.
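To make concrete why a one-to-one mapping offers so little protection, a rank-based frequency analysis can be sketched as follows. This is an illustrative toy, not the stronger attacks of [4], [5]; the item names and counts are invented for the example:

```python
# Toy rank-based frequency analysis against a one-to-one item mapping.
# known_freqs: estimated support of each real item, from background
# knowledge of the domain. observed_freqs: support of each pseudo-item
# in the encoded database. Matching items by frequency rank recovers
# the mapping whenever the two distributions have the same ordering.

def frequency_match(known_freqs, observed_freqs):
    real = sorted(known_freqs, key=known_freqs.get, reverse=True)
    pseudo = sorted(observed_freqs, key=observed_freqs.get, reverse=True)
    return dict(zip(pseudo, real))

known = {'milk': 90, 'bread': 70, 'caviar': 5}
observed = {'e17': 68, 'e02': 91, 'e33': 6}
print(frequency_match(known, observed))
# -> {'e02': 'milk', 'e17': 'bread', 'e33': 'caviar'}
```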

Given that a satisfactory definition of security is lacking in [1], one might attempt to apply the notion of information-theoretic security to encoding. In this approach, an encoding scheme is secure if the encoded database contains no information about the input database. Indeed, one can find encoding schemes that have this property (see [9]); however, they are very expensive. Intuitively, if the encoded database contains no information about the input database, then the outsourced data mining cannot provide any help.

Instead, we must select a more practical security property and find a solution that satisfies it. Because we are concerned with protecting an encoding against frequency analysis attacks, we propose a security property that obfuscates the frequencies of the original items:

∀ k1 ≥ k′ ≥ k0:  |{X | supp_W(X) = k′}| ≥ n    (5)

for security parameters n, k0, and k1.
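Property (5) can be checked directly from the support counts of the encoded items. The following sketch (our illustration, with invented toy values) tests whether every support level in [k0, k1] is shared by at least n items:

```python
# Check the frequency-obfuscation property of Eq. (5): for every
# support count k' with k0 <= k' <= k1, at least n encoded items must
# have exactly that support, so frequency analysis cannot single any
# item out. Illustrative sketch only.
from collections import Counter

def satisfies_property(supports, n, k0, k1):
    """supports: iterable of supp_W(X) for each encoded item X."""
    per_level = Counter(supports)
    return all(per_level[k] >= n for k in range(k0, k1 + 1))

# Toy check: supports {3,3,4,4} with n=2 pass over [3,4] but fail
# over [3,5], since no item has support 5.
print(satisfies_property([3, 3, 4, 4], n=2, k0=3, k1=4))  # -> True
print(satisfies_property([3, 3, 4, 4], n=2, k0=3, k1=5))  # -> False
```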

Given such a definition, a natural question is whether it is possible to have a practical encoding scheme that satisfies the above notion of security. To answer this question, we need to compare (a) the time required to perform association rule mining oneself, with (b) the time required to encode the data, transmit it to the service provider, receive the results back, and decode the result data. An encoding scheme is impractical if the time for (b) exceeds the time for (a). We have performed some initial analysis to try to answer this question. While the answer is not definitive, our preliminary results suggest that such an encoding scheme is impractical for reasonable security parameters.

For more details about the issues discussed in this section, see the technical report version of this paper [9].

V. CONCLUSIONS AND FUTURE WORK

In this paper we presented an attack on a database encoding scheme for outsourcing association rule mining. We showed how an attacker may identify patterns in the data created by the encoding algorithm, allowing a significant amount of the original data to be recovered; the attack makes no assumptions regarding a priori knowledge of the data.

After illustrating why the security properties discussed in [1] are inadequate, we suggested an alternative security property aimed at defeating frequency analysis attacks. Further, we questioned the practicality of outsourcing association rule mining in general. The evidence we gathered suggests that outsourcing is not efficient in many settings. It remains an open problem whether there exist provably secure encoding schemes that are still practical.

Portions of this work were supported by a Google grant titled "Utility and Privacy in Data Anonymization" and by sponsors of CERIAS.

REFERENCES

[1] W. K. Wong, D. W. Cheung, E. Hung, B. Kao, and N. Mamoulis, "Security in outsourcing of association rule mining," in VLDB, 2007, pp. 111–122.

[2] L. Qiu, Y. Li, and X. Wu, "An approach to outsourcing data mining tasks while protecting business intelligence and customer privacy," in ICDMW, 2006.

[3] L. Xiong, S. Chitti, and L. Liu, "Preserving data privacy in outsourcing data aggregation services," ACM Trans. Internet Technol., vol. 7, no. 3, p. 17, 2007.

[4] D. L. Kahn, The Codebreakers: The Story of Secret Writing. New York: Scribner, 1996.

[5] D. R. Stinson, Cryptography: Theory and Practice. CRC Press, 1995.

[6] R. Kumar, J. Novak, B. Pang, and A. Tomkins, "On anonymizing query logs via token-based hashing," in WWW, 2007.

[7] L. V. S. Lakshmanan, R. T. Ng, and G. Ramesh, "To do or not to do: the dilemma of disclosing anonymized data," in SIGMOD, 2005.

[8] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in VLDB, 1994, pp. 487–499.

[9] I. Molloy, N. Li, and T. Li, "On the (in)security and (im)practicality of outsourcing precise association rule mining," CERIAS, Purdue University, Tech. Rep., 2009, https://www.cerias.purdue.edu/apps/reports and papers/.
