
Designs, Analyses and Optimizations for Attribute-Shuffling Obfuscation to Protect Information from Malicious Cloud Administrators

Hiroshi Fujinoki

Department of Computer Science

Southern Illinois University Edwardsville

Edwardsville, IL 62026-1656

+1 618 650 [email protected]

ABSTRACT

Insider threats are an issue that has recently been gaining attention from stakeholders in network-based information systems. In this paper, we propose algorithms for attribute-shuffling obfuscation for database applications in cloud environments. The proposed obfuscation combines data and code obfuscation methods to prevent information leaks to malicious administrators at cloud providers, who, we assume, have unlimited and uncensored access to any local resources, possibly including the system security logs. The algorithms allow the DBMS at cloud servers to perform fundamental operations, such as INSERT, SELECT, UPDATE, and DELETE, while cloud users’ information is protected against leaks to malicious administrators at cloud providers. We analyzed the expected performance of the proposed obfuscation by focusing on the degree of obfuscation, table size inflation, and increase in network traffic load. We also developed performance optimization algorithms for the proposed attribute-shuffling obfuscation. From the linear relation between the two tuning parameters and the two essential performance metrics (the degree of obfuscation and the increase in network traffic), we conclude that the proposed attribute-shuffling obfuscation will be efficient for systems with diverse requirements. We also found that the relative network traffic load nearly converged to that of “no obfuscation” (the traffic overhead was only 12% of “no obfuscation”) as the query load increased. These observations let us conclude that the optimization algorithm will be efficient under high workload and for busy systems.

Keywords

Information obfuscation, insider threats, malicious administrators, cloud computing, information security, network security

1. INTRODUCTION

Threats from insiders are one of the top concerns for cloud users as well as cloud providers [1]. Insider threats are serious for several reasons. For example, security breaches by malicious insiders cost more and take longer to detect than those by external attackers [2]. For most cloud users, possible insider threats from their cloud provider add significant anxiety, mainly because of the organizational, and thus cultural, separation between the users and those who manage their intellectual assets as well as their privacy.

Although a thorough survey on “malicious insiders” can be found in research efforts such as [3], who the malicious insiders are and what threats they pose are harder questions than they sound, due to our increased dependency on current, and most probably future, information technologies as the vital infrastructure of our society. Given that dependency, coping with the worst-case insiders (the malicious administrators) will be an imminent challenge. Threats from malicious administrators are serious, since many of them have unlimited access to all the raw resources stored at a server host computer.

We extend the definition of “unlimited access” to cover not only the data stored at a server but also the activities performed at a server. Administrators can monitor network communications, including the contents of the messages exchanged through network connections, and in some circumstances can even tamper with users’ access logs, including those for the administrators themselves.

In this work, we focused on protecting users’ information stored at a cloud server from malicious administrators at cloud providers while the users take advantage of cloud computing. Protecting users’ information stored at a cloud server has two aspects: preventing illegal modifications and preventing information leaks. We have developed a solution for preventing malicious administrators from illegally modifying users’ information stored at a cloud server [4]. Preventing information leaks to administrators necessitates information hiding from administrators, at least in a sense that it is hard for administrators to understand the meaning of users’ information. This paper focuses on information hiding from administrators at cloud servers.

Solutions have been proposed for protecting resources from unauthorized and/or malicious uses by insiders. Access control is the most intuitive solution for protecting resources available at server host computers [X]. However, it is not an effective approach against malicious administrators, since they are the system administrators who set up access control policies at a server host computer.

Intrusion detection is another solution for protecting resources at server host computers by detecting unauthorized accesses to resources. Although intrusion detection is widely used, it has some well-known weaknesses [5, 6, 7]. For example, intrusion detection requires training on known normal access patterns of users. This training phase takes time, and the required training makes it hard for intrusion detection to cope with zero-day attacks. Another known weakness is the high false-positive rate of anomaly detection [8, 9, 10]. Accurate training will logically minimize false positives, but it is hard to achieve. The most serious concern is that some system administrators have privileges that let them circumvent intrusion detection altogether. Intrusion detection using honeypots shares the same weaknesses [X].

Another approach is pre-encryption, which encrypts users’ information before it is uploaded to a cloud server [X1, Y1, Y2, Y3]. If properly implemented and carried out, this approach is the most effective solution from the viewpoint of hiding information from malicious administrators. Despite its potential, it significantly limits the types of operations that can be performed on users’ information stored at a cloud server [Y4], which undermines one of the most significant motivations for cloud computing. Pre-encryption also necessitates frequent encryption and decryption at users’ host computers.

A promising approach, which addresses the tradeoff between information hiding from system administrators and processing capability at cloud servers, is obfuscation. Obfuscations are techniques that intentionally make the meaning of data opaque to those who do not have legitimate access rights [S1, S2, S3]. Obfuscations differ from encryption in the sense that obfuscations do not transform data into undecipherable sequences of symbols, while most encryption schemes do. For example, a specific column name in a database table, “SSN”, may be replaced by “column 1” to obfuscate its meaning.

Although the degree of information hiding by obfuscations is generally less than that of encryption, obfuscations allow remote processes, such as DBMS at cloud servers, to perform more diverse operations than encryptions can do. Well-designed obfuscation schemes may provide decent information hiding from malicious administrators, while they allow remote processes at cloud servers to more flexibly process users’ data stored at cloud servers.

There are two primary contributions in this paper. First, we designed an obfuscation-based information protection against malicious administrators at cloud providers, who presumably have unrestricted accesses to both static and dynamic information at their servers and unrestricted authorizations to perform any activities, possibly without being logged by system logs. Although the core of our proposed protection is data obfuscation by shuffling attributes in multiple table rows, we combined both code and data obfuscation techniques to cope with threats from malicious administrators at cloud providers. We designed our obfuscation-based protection in such a way that the attribute shuffling is reversible only by the owners of information, but not by administrators at cloud providers. Since the contents of table attributes are not encrypted, the solution allows the four fundamental SQL query types of INSERT, SELECT, UPDATE, and DELETE to be performed at cloud servers, while code (query) obfuscation prevents malicious administrators from learning the logical connections in shuffled attributes. We developed obfuscation and de-obfuscation algorithms for each of the four query types.

Second, we studied the primary performance factors of the obfuscation (and de-obfuscation) algorithms to analyze the feasibility of the proposed obfuscation-based protection. We identified the major performance bottleneck in the proposed obfuscation-based protections and presented possible approaches to mitigate them for practical use of the obfuscation-based protection in the near future.

The rest of this paper is organized as follows. Section 2 discusses the existing work related to the obfuscation techniques proposed for information security. The primary motivation of the section is to discuss variations in the existing obfuscation methods, as well as to analyze and understand the strengths and weaknesses of each major known obfuscation method. Section 3 describes the proposed attribute-shuffling obfuscation and de-obfuscation algorithms, which protect users’ information from malicious administrators at cloud providers while still allowing users’ queries to be processed at the cloud servers. Section 4 studies the feasibility of the proposed obfuscation-based protection by estimating its effectiveness for protection and its overhead. Section 5 summarizes the contribution of our work, followed by a list of the selected related referenced articles.

2. EXISTING WORK

Various obfuscation methods have been proposed, each with a unique set of strengths and weaknesses and thus a unique set of applications [19, 20, 21, 22]. The two primary categories of obfuscation methods are code and data obfuscations [X2]. Code obfuscations are techniques for hiding the meaning of activities (operations), while data obfuscations hide the meaning of data. The three major approaches for code obfuscation are code replacement, code shuffling, and dummy code injection, while data type transformation, location shuffling, and dummy data injection are the three popular approaches for data obfuscation [G1].

The primary intention of code obfuscations is to make the meaning of the operations unclear. Code replacement replaces a particular operation (instruction, code, command, or query) with another operation (or a set of other operations) that produces the same result (e.g., “xor A, A” for “load A, 0”) [X]. Code shuffling changes the order of operations in such a way that they still produce the same results [X]. Dummy code injection inserts some operations between two intended operations, such as inserting a “nop” instruction between two instructions [X].

Data obfuscations make the meaning of the data, and thus the information derived from the data, unclear to anyone who does not have legitimate access rights to the data. Data type transformations hide the meaning of the data by transforming its data type (e.g., “A” means “1”, “B” means “2”, etc.) [X]. Encryptions are data type transformation obfuscations that make their “de-obfuscation” prohibitively expensive in computation overhead [X]. Location shuffling scrambles the way words, which are “attributes” in database tables, appear in a document [X]. The essential difference between data type transformation and location shuffling is that in location shuffling each word in a document still appears as it is, while the scrambled order prevents unauthorized users from extracting the correct meaning. Dummy data injection inserts extra data into the intended data in such a way that the meaning of the information is clear only to the authorized users [X].

The existing obfuscation techniques are effective in that they allow remote cloud servers to perform many different operations, while the meaning of data is still hidden from potentially malicious personnel at cloud providers. Denning discussed potential data obfuscation by shuffling attributes in multiple table rows [Z1]. Using data obfuscation, owners of information can let the DBMS at a cloud server perform SQL queries while the complete meaning in the information is hidden. Despite its potential, the attribute-shuffling obfuscation has never been implemented. Algorithms for the obfuscation, as well as de-obfuscation, are needed. Moreover, the initial sketch of the attribute-shuffling obfuscation has some major weak points.

First, the solution does not anticipate malicious administrators who are capable of accessing the contents of tables at any given moment, as well as monitoring all the queries submitted by cloud users. If malicious administrators have access to the queries submitted to obfuscated tables, they are still able to discover the hidden relations among the shuffled attributes in an obfuscated table, which lets them understand the meaning of the obfuscated data.

The second weakness is that the solution is not equipped with a mechanism to update tables using UPDATE, DELETE, and INSERT queries to an existing table. This issue has significant implication with the first weakness. Lack of such mechanisms implies that the only query supported by the data obfuscation is SELECT to static tables, meaning that such tables should be fully set up without any dynamic change.

This significantly limits the usability of database tables, since queries such as INSERT and DELETE cannot be performed. The first issue (monitoring by administrators) poses serious threats to the privacy of the obfuscated data, since administrators can log a sequence of queries to infer the meaning behind obfuscated data.

3. PROPOSED SOLUTION

Our solution assumes proxies, called “security gateways”. The concept of the security gateway was initially proposed by Pearson [X1]. Each security gateway is located between a cloud server and cloud users’ host computers (Figure 1). It is owned and managed by the organization that logically owns the information stored at the cloud server. Each security gateway authenticates its users, authorizes accesses to its data, logs activities performed on its information, and can even encrypt the information before it is uploaded to the cloud server.

By requiring all users to go through a security gateway, the owners retain control over their information stored at remote cloud servers. For example, by auditing access logs, security gateways are able to detect unauthorized updates to their information stored at a cloud server. If information is encrypted before it is uploaded to a cloud server, the security gateway can even prevent exposure of the information to anyone who lacks adequate authorization, including administrators of the cloud servers.

[Figure 1 diagram: users’ hosts connect through the security gateway (user authentication, access authorization, activity logging and auditing, encryption) to the cloud server, which is managed by an administrator.]

Figure 1 – Security gateways for maintaining logical ownership to information stored at a cloud server

We designed and implemented a data obfuscation by shuffling attributes in table rows at security gateways. The security gateway performs attribute-shuffling data obfuscation before new information is inserted to a table at a cloud server. Our security gateway keeps track of how attributes are shuffled, which allows cloud servers to execute SQL queries without exposing the meaning of the contents of tables to malicious administrators. Not only obfuscating data, our proposed algorithms also obfuscate SQL queries to prevent malicious administrators from detecting the semantic links in shuffled attributes.

The security gateway works as the mandatory checkpoint to database tables that are logically owned by a cloud user. Anyone who performs a query to a database table is required to submit the query through the security gateway. The mechanism to detect any updating query that bypassed the security gateway has been developed in our previous project [4]. Whenever the security gateway receives a query, it performs the algorithm for each of the four query types.

INSERT: For inserting a new row (Figure 2) to an existing table (e.g., “Employee” table in Figure 3), multiple dummy rows must be inserted for obfuscation against malicious administrators at cloud providers. Contents in dummy rows are first artificially generated. Then, their attribute fields are shuffled in dummy and real rows as shown in Figure 4.

123-45-6789 | Smith | John | 789 N. 55th Ave., St. Louis, MO | (314) 222-3333 | 63001

Figure 2 – Example of information to be inserted to a table

Assuming that the destination table is currently empty (i.e., no row has been inserted yet) and that five rows (one real row and four dummy rows) are inserted into the table, the security gateway constructs a temporary list that contains the five rows. The temporary list is used for shuffling the attributes of the dummy and real rows before they are inserted into the table at the cloud server.

The security gateway assigns a unique row number to each row in the temporary list. The row numbers are a sequence of consecutive positive integers, starting at 1, assigned to each new row to be inserted into a table at a cloud server. For this example, since the table is currently empty, the new row numbers are 1, 2, 3, 4, and 5. If 1,000 rows already exist in the destination table, the five rows will be numbered 1001, 1002, 1003, 1004, and 1005.


R# | SSN | LastName | FirstName | Address | Phone | Zip Code
1 | - | - | - | - | - | -
2 | 123-45-6789 | Smith | John | 789 N. 55th Ave., St. Louis, MO | (314) 222-3333 | 63001
3 | - | - | - | - | - | -
4 | - | - | - | - | - | -
5 | - | - | - | - | - | -

- indicates a dummy attribute

Figure 3 – the contents of the temporary list (before attribute-shuffling obfuscation)

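[Figure 4 table: columns R#, SSN, LastName, FirstName, Address, Phone, Zip Code; rows 1 to 5. The real attributes (123-45-6789; Smith; John; 789 N. 55th Ave., St. Louis, MO; (314) 222-3333; 63001) are scattered across the rows among dummy attributes.]

Figure 4 – the temporary list after attribute-shuffling obfuscation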

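[Figure 5 table: columns R#, V, C1, C2, C3, C4, C5, C6; rows 1 to 5. V is 1 for rows 1 to 4 and 0 for row 5. Each Cn entry is 2 (the real row number) where the obfuscated row holds a real attribute, and 0 otherwise.]

Figure 5 – the contents of the mapping table at the security gateway for the five new rows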

Figure 6 – the algorithm for obfuscated INSERT

The real row is stored in a randomly selected row in the temporary list. Assuming that the real row is initially stored in the second row of the temporary list, the temporary list looks as shown in Figure 3.

After the contents of the five rows (both dummy and real) are prepared, the attributes of the five rows are shuffled in the temporary list (Figure 4). The formats of the mapping table and obfuscated tables are extended to hold two additional columns, “R#” and “V”. The R# column indicates the corresponding row number in the obfuscated table at a cloud server. The V field indicates whether a row holds any attribute of a real row: V is set to 0 if the row does not hold any attribute of a real row, and to 1 otherwise.

Figure 5 shows the contents of the mapping table at the security gateway for the five rows in Figure 4. The V field for row 5 is set to 0, since row 5 does not carry any real attribute. After the mapping table is set up, the security gateway issues five INSERT queries to the table at the cloud server, one for each of the five rows in the temporary list.
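The shuffling and mapping-table bookkeeping described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes that each column is permuted independently across the rows of the temporary list, and the function name `shuffle_attributes` and the dictionary layout of the mapping table are hypothetical.

```python
import random


def shuffle_attributes(rows, real_pos, first_row_number=1, rng=random):
    """Shuffle each column independently across the rows of the temporary list.

    rows     -- list of rows (lists of equal length); exactly one is the real row
    real_pos -- 0-based position of the real row in `rows`
    Returns (obfuscated_rows, mapping_table).
    """
    n_rows, n_cols = len(rows), len(rows[0])
    # Row number of the real row in the temporary list (2 in the running example).
    real_number = first_row_number + real_pos
    obfuscated = [[None] * n_cols for _ in range(n_rows)]
    # One mapping entry per obfuscated row: R#, V flag, and one Cn status per column.
    mapping = [{"R#": first_row_number + r, "V": 0, "C": [0] * n_cols}
               for r in range(n_rows)]
    for c in range(n_cols):
        sources = list(range(n_rows))
        rng.shuffle(sources)  # sources[dst] is the row whose attribute lands in row dst
        for dst, src in enumerate(sources):
            obfuscated[dst][c] = rows[src][c]
            if src == real_pos:
                # Record where the real attribute of this column ended up.
                mapping[dst]["C"][c] = real_number
                mapping[dst]["V"] = 1
    return obfuscated, mapping
```

Only the mapping table, which stays at the security gateway, records which cell of each column holds the real attribute; the obfuscated rows sent to the cloud server carry no such marker.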

Raw SELECT query:

SELECT LastName FROM Employee WHERE SSN = “123456789”

A sequence of obfuscated SELECT queries:

SELECT R# FROM Employee WHERE SSN = “123456789”

SELECT LastName FROM Employee WHERE R# = <target row#>

Figure 7 – example of an obfuscated SELECT query to look up single attribute using a condition

Inserting dummy rows inflates tables, which can be significant for large tables especially if a large number of dummy rows are inserted for obfuscation. To control the table size inflation, the security gateway performs the following algorithm, which uses the following five parameters:

Ntotal_rows: the total number of rows (both dummy and real) that exist in an obfuscated table at a cloud server

Nreal_rows: the number of real rows that exist in an obfuscated table at a cloud server

Nnew_rows: the number of new rows (both dummy and real) to be inserted when a new real row is inserted, where Nnew_rows > (the number of columns in the obfuscated table)

Ninvalid_rows: the number of rows in the mapping table whose “V” status is 0

Obfuscation factor: a factor greater than 1.0 that controls the number of dummy rows in an obfuscated table


(i) Create a list of the rows in the mapping table that have V = 0. Store the count of those rows in Ninvalid_rows.

if (((Ntotal_rows - Nreal_rows) < (Nnew_rows × obfuscation factor)) OR (Ninvalid_rows = 0)) then

(ii) Artificially generate (Nnew_rows - 1) dummy rows.

(iii) Generate a sequence of Nnew_rows unique consecutive row numbers for the temporary list.

(iv) Create an empty temporary list, randomly select one of its rows, and load the real row into that row.

(v) Load the (Nnew_rows - 1) dummy rows into the remaining rows of the temporary list and shuffle the attributes.

(vi) Update the mapping table.

(vii) Issue an INSERT query for each of the Nnew_rows rows in the temporary list.

(viii) Perform the following updates:

(a) Ntotal_rows = Ntotal_rows + Nnew_rows

(b) Nreal_rows = Nreal_rows + 1

else

(ix) Randomly pick one of the rows whose V = 0 from the list created in (i).

(x) Randomly find an attribute whose Cn status is “0” in the mapping table and designate it as the chosen attribute.

(xi) Repeat (ix) and (x) for each of the attributes in the new real row.

(xii) Update the chosen attributes in the obfuscated table, one for each attribute of the new row, by issuing UPDATE queries (defined and described later) to the obfuscated table.

(xiii) Update the mapping table.

(xiv) Perform the following update:

(a) Nreal_rows = Nreal_rows + 1
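The branching condition and counter updates of the INSERT algorithm above can be sketched as follows. This is a sketch under stated assumptions: the names `Counters` and `plan_insert` and the returned labels are hypothetical, and the generation, shuffling, and query-issuing steps are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    n_total_rows: int = 0  # Ntotal_rows: dummy and real rows in the obfuscated table
    n_real_rows: int = 0   # Nreal_rows: real rows in the obfuscated table


def plan_insert(counters, n_new_rows, n_invalid_rows, obfuscation_factor):
    """Decide how a new real row is stored and update the counters accordingly."""
    n_dummy_rows = counters.n_total_rows - counters.n_real_rows
    if n_dummy_rows < n_new_rows * obfuscation_factor or n_invalid_rows == 0:
        # Steps (ii) to (viii): insert a fresh batch of one real and
        # (n_new_rows - 1) dummy rows.
        counters.n_total_rows += n_new_rows
        counters.n_real_rows += 1
        return "insert_batch"
    # Steps (ix) to (xiv): overwrite dummy attributes in existing rows instead,
    # so the table does not grow.
    counters.n_real_rows += 1
    return "reuse_dummy_slots"
```

For example, with Nnew_rows = 10 and an obfuscation factor of 3.0, an insert into an empty table takes the first branch, while an insert into a table with 40 rows, 4 of them real, and at least one row with V = 0 takes the second.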

SELECT: The obfuscated SELECT algorithm uses two different SELECT queries for each SELECT query that looks up an attribute using a condition. The first SELECT query finds the rows that satisfy a given condition in an obfuscated table. After the security gateway receives the row numbers of the rows that satisfy the condition in the obfuscated table at a cloud server, it looks up the real row number for each such row in the local mapping table. Then, the security gateway looks for the obfuscated row number of the target attribute. Finally, the security gateway issues the second SELECT query to the obfuscated table to retrieve the content of the attribute of that row. If more than one row satisfies the condition, one such SELECT query is issued to the obfuscated table for each of them. The algorithm for obfuscated SELECT queries is shown in Figure 9.

Using an example, the obfuscated SELECT algorithm works as follows: for retrieving the last name of the person whose SSN is “123456789”, the security gateway issues the first SELECT query (Figure 7) to the obfuscated table for finding the row number (i.e., “R#”) of the one whose SSN is “123456789” in the obfuscated “Employee” table, which is “4” (Figure 4). Therefore, the obfuscated table returns “4” to the security gateway. Then, the security gateway looks up the content of column C1 (“SSN”) of the fourth row in the mapping table, which is “2” (Figure 5). After that, the security gateway scans column C2 (“LastName”) for the matching number (i.e., “2”). It is the second row in the mapping table (Figure 5). Finally, the security gateway constructs the second SELECT query (in Figure 7) for retrieving the contents of an attribute in the second row of column C2 in the obfuscated table.
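The mapping-table lookup in this example can be sketched as follows. The `MAPPING` dictionary encodes only the entries that the text places explicitly (SSN in row 4, LastName in row 2, FirstName and Phone in row 3); the Address and Zip Code columns are omitted because the text does not say where they land. The function name `resolve_target_row` is hypothetical.

```python
# Mapping table at the security gateway: R# -> {column: real row number, 0 if dummy}.
MAPPING = {
    1: {"SSN": 0, "LastName": 0, "FirstName": 0, "Phone": 0},
    2: {"SSN": 0, "LastName": 2, "FirstName": 0, "Phone": 0},
    3: {"SSN": 0, "LastName": 0, "FirstName": 2, "Phone": 2},
    4: {"SSN": 2, "LastName": 0, "FirstName": 0, "Phone": 0},
    5: {"SSN": 0, "LastName": 0, "FirstName": 0, "Phone": 0},
}


def resolve_target_row(mapping, cond_column, cond_row, target_column):
    """Translate the R# returned by the first SELECT into the R# for the second one."""
    real_row = mapping[cond_row][cond_column]
    if real_row == 0:
        return None  # the condition matched a dummy attribute
    # Scan the target column for the same real row number.
    for row_number, columns in mapping.items():
        if columns[target_column] == real_row:
            return row_number
    return None


# The first SELECT returned R# = 4 for SSN; LastName of the same real row is in R# = 2.
target = resolve_target_row(MAPPING, "SSN", 4, "LastName")
```

The returned row number is what the gateway would substitute for `<target row#>` in the second SELECT query of Figure 7.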

For SELECT queries that look up multiple attributes using multiple conditions, the obfuscated SELECT query works as follows. First, the security gateway issues one SELECT query for each condition to find the rows that satisfy that condition in the obfuscated table. For example, if an obfuscated SELECT has two conditions to satisfy (LastName and FirstName, as shown in Figure 8), the security gateway issues one SELECT for each of the two conditions. The security gateway issues a SELECT query for “Smith” (the first SELECT query in Figure 8). When the obfuscated table receives the SELECT query, it scans the “LastName” column to find a matching attribute for “Smith”, whose R# is 2 in the obfuscated table (Figure 4). The obfuscated table returns the matched row numbers (“2” for this example) to the security gateway. The security gateway issues another SELECT query for FirstName (the second SELECT query in Figure 8). The obfuscated table returns R# = 3 (the row number for “John”) to the security gateway. Since the real row numbers recorded for “Smith” and “John” are both “2” in the mapping table (Figure 5), the security gateway scans its mapping table for the matching real row number (i.e., “2”) in the SSN column. In the mapping table, the row that has “2” in column C1 (the “SSN” column) is row 4 (i.e., R# = 4) (Figure 5). Finally, the security gateway issues the third SELECT query to retrieve the content of the “SSN” attribute of the row whose R# = 4 in the obfuscated table. For the second attribute to look up (“Phone”), the security gateway repeats the same procedure for the “Phone” column, which results in the fourth SELECT query for this example (Figure 8). If more than one row satisfies all the conditions in a SELECT query, one SELECT query is issued to retrieve the content of an attribute for each of the rows.
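The multi-condition case can be sketched as an intersection of real row numbers. This is an illustration only: the sparse dictionary `REAL_AT` stores just the cells that the text places explicitly, and the name `resolve_multi` and its return layout are hypothetical.

```python
# Sparse view of the mapping table: (R#, column) -> real row number.
# Cells that are absent hold dummy attributes.
REAL_AT = {
    (2, "LastName"): 2,
    (3, "FirstName"): 2,
    (3, "Phone"): 2,
    (4, "SSN"): 2,
}


def resolve_multi(real_at, matches, target_columns):
    """Find the target rows for real rows that satisfy every condition.

    matches -- {condition column: [R# values returned by the cloud server]}
    Returns {real row number: {target column: R# to query}}.
    """
    # Real row numbers behind the matches of each condition.
    candidate_sets = [
        {real_at[(r, col)] for r in rows if (r, col) in real_at}
        for col, rows in matches.items()
    ]
    # A real row qualifies only if every condition matched one of its attributes.
    common = set.intersection(*candidate_sets) if candidate_sets else set()
    plan = {}
    for real in sorted(common):
        plan[real] = {
            target: next(
                (r for (r, col), v in real_at.items() if col == target and v == real),
                None,
            )
            for target in target_columns
        }
    return plan


plan = resolve_multi(REAL_AT, {"LastName": [2], "FirstName": [3]}, ["SSN", "Phone"])
```

For the example above, the plan directs the gateway to read SSN from R# = 4 and Phone from R# = 3, which corresponds to the third and fourth SELECT queries in Figure 8.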

Submitting a sequence of SELECT queries to a cloud server will reveal the logical link of the attributes in an obfuscated table to malicious administrators. Therefore, “noises” should be generated to obfuscate such links. The algorithm for the obfuscated SELECT query is shown in Figure 9.

Raw SELECT query:

SELECT SSN, Phone FROM Employee WHERE LastName = “Smith” and FirstName = “John”

A sequence of obfuscated SELECT queries for multiple attributes using multiple conditions:

SELECT R# FROM Employee WHERE LastName = “Smith”

SELECT R# FROM Employee WHERE FirstName = “John”

SELECT SSN FROM Employee WHERE R# = 4

SELECT Phone FROM Employee WHERE R# = 3

Figure 8 – example of a raw SELECT to look up multiple attributes using multiple conditions

Figure 9 – the algorithm for obfuscated SELECT


(i) Construct a SELECT query for finding the rows that satisfy a given condition in an obfuscated table. If there is more than one condition to satisfy, construct one SELECT query for each condition.

(ii) Issue (i.e., send) each SELECT query constructed in (i) to the obfuscated table together with (N-1) dummy SELECT queries, after their transmission order is shuffled.

(iii) Look up the real row number in the mapping table for each row found in (i). If no row satisfies a given condition, or if no single real row number is shared by the rows found for all the given conditions, then no row satisfies all the given conditions; terminate the algorithm. Otherwise, proceed to (iv).

(iv) Scan the target column in the mapping table, looking for the matching real row number for each row found in (iii).

(v) For each matching real row number found in (iv), find the row # (i.e., “R#”) in the mapping table.

(vi) For each R# found in (v), construct a SELECT query for retrieving the content of the attribute stored in the obfuscated table.
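The noise generation in step (ii) can be sketched as follows. The text does not specify how dummy queries are constructed, so the generator `dummy_row_lookup` below (a lookup of a random row number) is purely an assumption for illustration, as are the function names and the default table, column, and row range.

```python
import random


def dummy_row_lookup(rng, table="Employee", column="LastName", max_row=5):
    """Hypothetical dummy query: read one column of a randomly chosen row."""
    return f"SELECT {column} FROM {table} WHERE R# = {rng.randint(1, max_row)}"


def mix_with_dummies(real_query, make_dummy, n, rng=random):
    """Return the real query hidden among (n - 1) dummy queries in shuffled order."""
    batch = [real_query] + [make_dummy(rng) for _ in range(n - 1)]
    rng.shuffle(batch)  # an observer cannot tell the real query by its position
    return batch


batch = mix_with_dummies(
    "SELECT LastName FROM Employee WHERE R# = 2",
    dummy_row_lookup,
    n=4,
    rng=random.Random(1),
)
```

A larger N hides the real query among more candidates at the cost of proportionally more queries sent to the cloud server.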

Raw REPLACE-UPDATE:

UPDATE Employee SET Salary = $100K WHERE Age 50 and Rank = “manager”

Obfuscated REPLACE-UPDATE:

SELECT R# FROM Employee WHERE Age 50

SELECT R# FROM Employee WHERE Rank = “manager”

UPDATE Employee SET Salary = $100K WHERE R# = <target row#>

Figure 10 – REPLACE-UPDATE query that consists of multiple conditions

UPDATE: Updating attributes of a row is performed by a combination of SELECT and UPDATE queries. As in obfuscated SELECT, one SELECT query per condition is used to find the rows that satisfy the condition, followed by UPDATE queries, each of which updates an attribute of a row (Figure 10).

Issuing a sequence of SELECT and UPDATE queries to obfuscated tables will also reveal their semantic links to malicious administrators at the cloud server in the following two ways. (1) Updates to attributes are performed on specific table rows in an obfuscated table, each of which is uniquely identified by its R#. By observing a SELECT query followed by an UPDATE query, malicious administrators will be able to discover the logical connections among shuffled attributes. (2) If multiple attributes in a real row are updated, multiple UPDATE queries will follow a SELECT query. By observing a sequence of such UPDATE queries, malicious administrators can discover the attributes of a real row. These problems suggest that UPDATE queries also require obfuscation.

We recognized two different types of UPDATE queries. The first, what we call “REPLACE UPDATE”, replaces the content of a table attribute without referencing to its current value, such as Salary = $100K. The second type, “DELTA UPDATE”, first retrieves the current value of an attribute, followed by a cumulative update based on its current value, such as “Salary = Salary + $10K”.

The algorithm for the obfuscated REPLACE-UPDATE is identical to that of obfuscated SELECT, except that the second SELECT query is replaced by an UPDATE query (Figure 11). In the algorithm, the parameter Nattributes indicates how many different attributes will be updated. As in obfuscated SELECT, if more than one row satisfies (all) the condition(s), one UPDATE query is issued for each such row. The primary difference between a REPLACE-UPDATE and a DELTA-UPDATE is that, for a DELTA-UPDATE, the security gateway needs to obtain the current value of an attribute before it updates a row. The algorithm is thus a combination of the obfuscated SELECT and REPLACE-UPDATE (Figure 12).
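The read-modify-write core of a DELTA-UPDATE can be sketched as follows, using Python's built-in `sqlite3` module as a stand-in for the DBMS at the cloud server. This is a sketch under stated assumptions: the function name `delta_update` is hypothetical, the target row number is assumed to be already resolved through the mapping table, and the dummy SELECT and UPDATE queries are omitted for brevity.

```python
import sqlite3


def delta_update(conn, table, column, row_number, delta_fn):
    """Read the current value, apply the delta locally, and write the result back."""
    # SELECT part: fetch the current value of the attribute in the resolved row.
    row = conn.execute(
        f'SELECT "{column}" FROM "{table}" WHERE "R#" = ?', (row_number,)
    ).fetchone()
    if row is None:
        raise LookupError(f"no row with R# = {row_number}")
    # The delta is applied at the gateway, so the server never sees the expression.
    new_value = delta_fn(row[0])
    # UPDATE part: the server receives only a plain replacement value.
    conn.execute(
        f'UPDATE "{table}" SET "{column}" = ? WHERE "R#" = ?', (new_value, row_number)
    )
    return new_value


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE Employee ("R#" INTEGER PRIMARY KEY, Salary INTEGER)')
conn.execute("INSERT INTO Employee VALUES (3, 90000)")
# Equivalent of "Salary = Salary + $10K" for the attribute stored in R# = 3.
updated = delta_update(conn, "Employee", "Salary", 3, lambda salary: salary + 10000)
```

From the server's point of view, a DELTA-UPDATE handled this way looks like a SELECT followed by a REPLACE-UPDATE, which is why both queries need to be mixed with dummy queries.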

Figure 11 – the algorithm for obfuscated REPLACE-UPDATE

DELETE: An obfuscated DELETE query does not always delete a row from an obfuscated table immediately. Instead, the security gateway sets the status of each attribute of each deleted row to “0” in its mapping table without issuing a DELETE query to the obfuscated table. When the number of such “deleted” rows reaches a threshold, the security gateway issues multiple DELETE queries, one for each “deleted” row, to the obfuscated table. Thus, there is no need for dummy DELETE queries for obfuscation. The algorithm for obfuscated DELETE is shown in Figure 13.
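The deferred deletion described above can be sketched as follows. The interpretation is an assumption: a row of the obfuscated table is treated as “deleted” once none of its attributes belongs to a real row any more, and the names `mark_deleted` and `flush_if_due`, the mapping layout, and the default table name are all hypothetical.

```python
def mark_deleted(mapping, real_row):
    """Zero every mapping cell that points at real_row.

    mapping -- {R#: {"V": 0 or 1, "C": [real row number or 0 per column]}}
    Returns the R# values whose V status dropped to 0 as a result.
    """
    freed = []
    for row_number, entry in mapping.items():
        if real_row in entry["C"]:
            entry["C"] = [0 if v == real_row else v for v in entry["C"]]
            if not any(entry["C"]):
                entry["V"] = 0  # the row no longer holds any real attribute
                freed.append(row_number)
    return freed


def flush_if_due(pending, threshold, table="Employee"):
    """Build the DELETE queries once enough rows are pending; otherwise build none."""
    if len(pending) < threshold:
        return []
    queries = [f"DELETE FROM {table} WHERE R# = {r}" for r in pending]
    pending.clear()
    return queries
```

Until the threshold is reached, nothing is sent to the cloud server, so an observer cannot associate a deletion with the user request that caused it.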







(vi) For each R# found in (v), construct a SELECT query for retrieving the content of the attribute stored in the obfuscated table.

(vii) Issue (i.e., send) each SELECT query constructed in (vi) to the obfuscated table together with (N-1) dummy SELECT queries, after their transmission order is shuffled.

(viii) For the attribute content retrieved by each SELECT query in (vii), locally apply the requested delta update to the value.

(ix) For the attribute content prepared in (viii), construct an UPDATE query for updating the content of the attribute in the obfuscated table.

(x) Issue (i.e., send) each UPDATE query constructed in (ix) to the obfuscated table together with (N-1) dummy UPDATE queries, after their transmission order is shuffled.

(xi) Repeat (iv) through (x) for each attribute, if there is more than one attribute to update for each row that satisfies the given conditions.






(vi) For each R# found in (v), construct an UPDATE query for updating the content of the attribute stored in the obfuscated table.

(vii) Issue (i.e., send) each UPDATE query constructed in (vi) to the obfuscated table with (N-1) dummy UPDATE queries after their transmission order is shuffled.

(viii) Repeat (iv) through (vii) for each attribute, if there is more than one attribute to retrieve for each row that

Figure 12 – the algorithm for obfuscated DELTA-UPDATE

Figure 13 - the algorithm for obfuscated DELETE

4. PERFORMANCE ANALYSES

The metrics we used for our performance evaluation are the degree of obfuscation, table size inflation, network traffic, and the server-side processing overhead. The degree of obfuscation indicates how hard it is for malicious administrators to reconstruct each table row. We measured it as the probability that a malicious administrator correctly guesses the link between two obfuscated attributes of a real row in an obfuscated table.

The table size inflation was defined as how much larger a table will be after dummy rows are inserted, in terms of the number of rows in an obfuscated table. To analyze the table size inflation, we used D/R (dummy/real) ratio. We defined the D/R ratio as (the number of dummy rows)/(the number of real rows) in an obfuscated table.

Network traffic overhead indicates how much more network traffic will be generated for dummy rows. We quantified the network traffic overhead by estimating the number of queries issued from the security gateway to an obfuscated table.

INSERT: The two primary overhead factors for obfuscated INSERT (Figure 6) are the table size inflation and the increased network traffic for inserting dummy rows to an obfuscated table. Figure 14 shows the D/R ratio observed for different values of Nnew_rows (10, 15, 20, 25, 30, and 35) for the first 100 inserts to an empty table when α = 3.0. The D/R ratio decreased to less than 1.0 by the time the 30th, 45th, 60th, 75th, 90th, and 105th insert was performed for Nnew_rows = 10, 15, 20, 25, 30, and 35, respectively. At the 100th insert, the D/R ratio was 0.3, 0.5, 0.6, 0.75, 1.10, and 1.20 for the six values of Nnew_rows, respectively. Figure 15 shows the observed D/R ratio for α = 11.0. At the 100th insert, the D/R ratio was 1.10, 1.70, 2.20, 2.75, 3.50, and 3.90. The D/R ratio decreased to less than 1.0 by the 110th, 165th, 220th, 275th, 330th, and 385th insert for the six values of Nnew_rows.

Our observations in Figures 14 and 15 suggest that the obfuscated INSERT will cause relatively high table size inflation for tables that contain a small number (e.g., < 100) of rows, but that the overhead will be relatively small for tables that contain a large number (e.g., > 100) of rows. Since the degree of table size inflation is inversely proportional to the number of rows inserted to a table, the D/R ratio will quickly decrease as new rows are inserted to a table. This implies that the proposed attribute-shuffling obfuscation will be efficient for large tables in terms of table size overhead.

Figure 16 shows the D/R ratio for different values of α (3.0, 5.0, 7.0, 9.0, and 11.0) where Nnew_rows = 10 for the first 100 inserts to an empty table. At the 100th insert, the D/R ratio was 0.3, 0.5, 0.7, 0.9, and 1.1. The D/R ratio decreased to less than 1.0 by the time the 35th, 50th, 70th, 90th, and 110th insert was performed for the five values of α. Figure 17 shows the D/R ratio for Nnew_rows = 50. At the 100th insert, the D/R ratio was 1.5, 2.5, 3.5, 4.5, and 5.5. The D/R ratio decreased to less than 1.0 by the time the 150th, 250th, 350th, 449th, and 549th insert was performed for the five values of α.

(v) Set “0” to each such matching row number detected in (iv) in the mapping table. If all attributes in a row in the mapping table become “0”, reset its valid flag (i.e., “V” field) to “0” and perform: Ninvalid_rows = Ninvalid_rows + 1. If (Ninvalid_rows < s), terminate this algorithm. Otherwise, proceed to (vi).

(vi) Randomly select m (m = s/p) invalid rows in the mapping table.

(vii) Construct and issue a DELETE query for each of the m rows in the obfuscated table.

(viii) Perform: Ninvalid_rows = Ninvalid_rows – m.

Figure 14 - the D/R ratio of Nnew_rows = 10, 15, 20, 25, 30, and 35 for the first 100 inserts to an empty table when α = 3.0

(Plot: D/R ratio on the y-axis versus the number of rows inserted on the x-axis, one curve per Nnew_rows value; α = 3.0 for all experiments.)

Figure 15 - the D/R ratio of Nnew_rows = 10, 15, 20, 25, 30, and 35 for the first 100 inserts to an empty table when α = 11.0

(Plot: D/R ratio on the y-axis over the first 100 inserts, one curve per Nnew_rows value; α = 11.0 for all experiments.)

Figure 16 - the D/R ratio of different values of α where Nnew_rows = 10 for the first 100 inserts to an empty table

(Plot: D/R ratio on the y-axis over the first 100 inserts, one curve for each of α = 3.0, 5.0, 7.0, 9.0, and 11.0; Nnew_rows = 10 for all experiments.)

Figure 17 - the D/R ratio of different values of α where Nnew_rows = 50 for the first 100 inserts to an empty table

(Plot: D/R ratio on the y-axis over the first 100 inserts, one curve for each of α = 3.0, 5.0, 7.0, 9.0, and 11.0; Nnew_rows = 50 for all experiments.)

Figure 18 – number of inserts to reach a D/R ratio for α = 3.0

(Plot: number of inserts to reach a target D/R ratio on the y-axis versus Nnew_rows = 10, 15, 20, 25, 30, and 35 on the x-axis. Fitted trend lines: y = 59.7x (D/R = 0.25), y = 30.0x (D/R = 0.50), y = 15.0x (D/R = 1.00), y = 8.7x (D/R = 2.00), y = 6.4x (D/R = 3.00), and y = 4.0x (D/R = 4.00).)

Figures 18 and 19 show our analyses of the impact of Nnew_rows on the number of inserts expected for achieving a D/R ratio of 0.25, 0.50, 1.00, 2.00, 3.00, and 4.00, when α = 3.0 and α = 13.0, respectively. The graphs indicate a linear relation between the number of inserts and the inverse of the D/R ratio: the number of inserts equals a constant divided by the D/R ratio, multiplied by Nnew_rows, where the constant is 16 for α = 3.0 and 64 for α = 13.0. Figure 20 shows the impact of α on the number of inserts needed to achieve D/R ratio = 1.0. Our linear regression analyses indicate that the linear coefficient for each graph is proportional to Nnew_rows, by observing that (the number of inserts to achieve D/R = 1.0) = (Nnew_rows) × α. For different values of the D/R ratio, we observed similar results. This implies that the number of inserts needed to achieve a target D/R ratio can be controlled by the parameter Nnew_rows.
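The observation that roughly Nnew_rows × α inserts are needed before the D/R ratio falls to 1.0 can be checked against the crossover points reported for Figures 14 and 15. The sketch below is only an arithmetic illustration. The function name is ours, and the reported values are copied from the INSERT analysis for α = 3.0 and α = 11.0.

```python
def inserts_to_unit_ratio(n_new_rows, alpha):
    """Inserts needed to reach D/R = 1.0 under the observed linear relation."""
    return n_new_rows * alpha


# Crossover points reported in the text (Nnew_rows -> insert count).
reported_alpha_3 = {10: 30, 15: 45, 20: 60, 25: 75, 30: 90, 35: 105}
reported_alpha_11 = {10: 110, 15: 165, 20: 220, 25: 275, 30: 330, 35: 385}

for n, reported in reported_alpha_3.items():
    print(n, inserts_to_unit_ratio(n, 3.0), reported)
```

For both values of α the predicted and reported insert counts agree.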

Figure 19 – the number of inserts to reach a D/R ratio for α = 13

(Plot: number of inserts to reach a target D/R ratio on the y-axis versus Nnew_rows = 10, 15, 20, 25, 30, and 35 on the x-axis. Fitted trend lines: y = 261.7x (D/R = 0.25), y = 129.5x (D/R = 0.50), y = 67.4x (D/R = 1.00), y = 33.2x (D/R = 2.00), y = 24.4x (D/R = 3.00), and y = 17.0x (D/R = 4.00).)


Figure 20 – the impact of α (alpha) to the number of inserts needed to achieve D/R ratio = 1.0

(Plot: number of inserts to reach the target D/R ratio on the y-axis versus α = 3 to 13 on the x-axis, one series for each of Nnew_rows = 10, 20, 30, 40, and 50. Fitted trend lines: y = 10x, y = 20x, y = 30x, y = 40x, and y = 50x; D/R ratio = 1.0 for all experiments.)

Figure 21 - the D/R ratio for different values of parameter α at the 1,000th insert to an empty table

(Plot: D/R ratio on the y-axis, 0% to 70%, versus Nnew_rows = 10 to 50 on the x-axis. Fitted trend lines: y = 0.015x for α = 3, y = 0.025x for α = 5, y = 0.035x for α = 7, y = 0.045x for α = 9, y = 0.055x for α = 11, and y = 0.065x for α = 13.)

Figure 21 shows the effect of Nnew_rows on the D/R ratio for a large table (at the 1,000th insert to an empty table) for α = 3.0, 5.0, 7.0, 9.0, 11.0, and 13.0. The results suggest that the effect of Nnew_rows on the D/R ratio is linear, with a coefficient of 0.015, 0.025, 0.035, 0.045, 0.055, and 0.065 for the six values of α. The gradual slope of the linear graphs in Figure 21 implies that a wide range of Nnew_rows values can be practically used for controlling the D/R ratio. Figure 21 also suggests that the impact of parameter α is a factor of 0.5% (each unit increase in α increased the linear coefficient by 0.005, i.e., 0.5%). From the observations in the figure, we developed formula (1) for estimating the D/R ratio.

(Expected D/R ratio) = 0.005 × α × Nnew_rows (1)

Since each obfuscated INSERT query requires (Nnew_rows – 1) dummy queries, the network traffic overhead of obfuscated INSERT is a factor of (Nnew_rows – 1). Regarding the degree of obfuscation, the probability that a given attribute among a total of Nnew_rows rows is the one that belongs to the real row is:

1/(Nnew_rows) (2)

Since there is only one unique combination that correctly represents the contents of a whole real row, each of which consists of Ncolumns columns in an obfuscated table, the probability of malicious administrators to correctly guess the correct contents of a whole real row is:

1/((Nnew_rows)^(Ncolumns)) (3)

Formula (3) suggests that the degree of obfuscation provided by obfuscated INSERT queries is inversely proportional to the number of the inserted rows to the power of the number of columns in each row. This implies that the degree of obfuscation can be adjusted using Nnew_rows as a tuning parameter. Controlling the degree of obfuscation by adjusting Nnew_rows is a practical option since Nnew_rows has a weak impact on the D/R ratio (Figure 21). For example, for tables that have a small number of columns, a target degree of obfuscation can be achieved by increasing Nnew_rows, while for tables that have a large number of columns, an equivalent level of obfuscation can be achieved by a small value of Nnew_rows.
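The trade-off between Nnew_rows and the number of columns in formula (3) can be made concrete with a short calculation. The sketch below evaluates the guessing probability exactly. The function name is ours and the parameter values are illustrative only.

```python
from fractions import Fraction


def guess_probability(n_new_rows, n_columns):
    """Probability of guessing a whole real row, per formula (3)."""
    return Fraction(1, n_new_rows ** n_columns)


# A narrow table needs a larger Nnew_rows to match a wide table.
print(guess_probability(10, 3))   # 3 columns, Nnew_rows = 10
print(guess_probability(100, 3))  # same table, Nnew_rows = 100
print(guess_probability(10, 6))   # 6 columns, Nnew_rows = 10
```

Under these illustrative values, a 3-column table with Nnew_rows = 100 and a 6-column table with Nnew_rows = 10 give the same probability, since 100^3 = 10^6.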

SELECT: Since the obfuscated SELECT (Figure 11) will not inflate the size of obfuscated tables, its major overhead is the increase in the network traffic. For an obfuscated SELECT query that looks up one attribute using one condition (Figure 7), the security gateway first issues a SELECT query to find the rows that satisfy the given condition in the obfuscated table. After the security gateway receives the R#’s for the matching rows, it issues one SELECT query for each of them to retrieve the attribute content from the obfuscated table. Since (N-1) dummy SELECT queries are issued with each real query, the degree of obfuscation is the same as that for INSERT (formula (3)).

Assuming that the number of rows in the obfuscated table that satisfy the given condition is Nmatched_rows, the security gateway issues a total of (1 + Nmatched_rows) SELECT queries to an obfuscated table for each obfuscated SELECT. Since each of the SELECT queries needs to be obfuscated with (N-1) dummy SELECT queries, the total number of queries issued to the obfuscated table will be:

(1 + Nmatched_rows) × N (4)

If an obfuscated SELECT query looks up multiple attributes (Nattributes attributes) for multiple conditions (Nconditions conditions), such as the one shown in Figure 8 (Nattributes = 2 and Nconditions = 2 in the example), the obfuscated SELECT works as follows. One SELECT query is issued to the obfuscated table for each condition, to find the rows that satisfy it. After the security gateway receives the R# for the rows that satisfy the condition, it looks for “the real row number(s)” (i.e., “R#”) of them in the mapping table. Then, the security gateway looks up the R# in the mapping table that holds the shuffled target attribute of the rows. Finally, for each such real row, the security gateway issues one SELECT query for retrieving the content of each attribute. This procedure results in the number of real SELECT queries estimated by formula (5).

Nconditions + (Nattributes × Nmatched_rows) (5)

Since each of the SELECT queries in (5) should be obfuscated by (N-1) dummy SELECT queries, the total number of SELECT queries the security gateway issues to the obfuscated table will be:

(Nconditions + (Nattributes × Nmatched_rows)) × N (6)
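Formulas (4) through (6) can be evaluated with a few lines of code to see how quickly the query count grows. The sketch below is a direct transcription of formula (6). The function name and the example values are ours.

```python
def select_query_count(n_conditions, n_attributes, n_matched_rows, n):
    """Total queries for one obfuscated SELECT, per formula (6).

    n is the number of queries sent per real query
    (one real query plus n - 1 dummies).
    """
    real_queries = n_conditions + n_attributes * n_matched_rows  # formula (5)
    return real_queries * n


# One condition and one attribute reduce to formula (4): (1 + Nmatched_rows) x N.
print(select_query_count(1, 1, 20, 10))
# Two conditions, two attributes, 20 matching rows, N = 10.
print(select_query_count(2, 2, 20, 10))
```

With 20 matching rows and N = 10, the single-attribute query costs 210 queries and the two-attribute, two-condition query costs 420.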

REPLACE-UPDATE: Since UPDATE queries will not inflate obfuscated tables, their major overhead is the increase in the network traffic. If a REPLACE-UPDATE updates an attribute of only one row, the security gateway first issues a SELECT query to find the row that satisfies the condition in the obfuscated table. When the security gateway receives the row number of the row whose attribute needs to be updated, it searches the mapping table for the row number that holds the attribute to update. Finally, the security gateway issues an UPDATE query to overwrite the content of the attribute in the obfuscated table.


If an obfuscated REPLACE-UPDATE query contains more than one condition to satisfy (Figure 10), one SELECT query is issued for each condition, to find the rows that satisfy it in the obfuscated table. For example, the obfuscated REPLACE-UPDATE query in Figure 10 will result in a sequence of two SELECT queries (one for finding the rows that satisfy “Age 50” and another for those that satisfy “Rank = manager”). After the security gateway finds all the rows that satisfy each condition in the obfuscated table, it identifies the row numbers that satisfy all the conditions using the local mapping table. Finally, the security gateway issues a REPLACE-UPDATE for each such row. For example, if twenty rows in the obfuscated table satisfy both “Age 50” and “Rank = manager”, the security gateway issues an UPDATE query for each of the twenty rows in the obfuscated table to update their salary to $100K.

Based on the above analyses, the number of queries issued for each obfuscated REPLACE-UPDATE will be estimated by formula (7), assuming that an obfuscated REPLACE-UPDATE updates an attribute of the rows that satisfy Nconditions conditions and that there are Nmatched_rows such rows.

Nconditions + Nmatched_rows (7)

If an obfuscated REPLACE-UPDATE updates as many as Nattributes attributes, one UPDATE query is issued for each attribute of each row that satisfies all the given conditions. Since updating Nattributes attributes of a row requires Nattributes distinct UPDATE queries, each of which requires (N-1) dummy queries for obfuscation, the total number of queries issued from the security gateway for an obfuscated REPLACE-UPDATE is the same as the one for the obfuscated SELECT (formula (6)).

DELTA-UPDATE: Each obfuscated DELTA-UPDATE query issues two types of SELECT queries. The first type finds the rows that satisfy a given condition in an obfuscated table, and the second retrieves their current values. For the first type, one SELECT query is issued for each condition, while for the second type, one SELECT query is issued for each row that satisfies all the given conditions. Finally, one UPDATE query is issued to update an attribute of each such row. Thus, for each DELTA-UPDATE that updates Nattributes attributes, satisfying Nconditions conditions, the total number of queries will be:

(Nconditions + (Nattributes × Nmatched_rows × 2)) × N (8)
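The factor of 2 in formula (8) reflects the extra SELECT that precedes each UPDATE. A minimal sketch comparing formula (8) with the REPLACE-UPDATE count of formula (6) is shown below. The function names and the example values are ours.

```python
def replace_update_query_count(n_conditions, n_attributes, n_matched_rows, n):
    """Total queries for one obfuscated REPLACE-UPDATE, per formula (6)."""
    return (n_conditions + n_attributes * n_matched_rows) * n


def delta_update_query_count(n_conditions, n_attributes, n_matched_rows, n):
    """Total queries for one obfuscated DELTA-UPDATE, per formula (8).

    Each updated attribute needs a SELECT (read the current value)
    and an UPDATE (write the new value), hence the factor of 2.
    """
    return (n_conditions + n_attributes * n_matched_rows * 2) * n


replace_cost = replace_update_query_count(2, 2, 20, 10)
delta_cost = delta_update_query_count(2, 2, 20, 10)
print(replace_cost, delta_cost, delta_cost / replace_cost)
```

When many rows match, the condition term becomes negligible and a DELTA-UPDATE costs close to twice as many queries as a REPLACE-UPDATE.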

DELETE: Since obfuscated DELETE queries do not require a dummy query, they do not cause extra network traffic between the security gateway and obfuscated tables. Obfuscated DELETE queries increase the size of the obfuscated tables until “deleted” rows are physically deleted from obfuscated tables. Assuming that the security gateway issues DELETE queries, one for each “already deleted row”, when the number of such already-deleted rows reaches a certain threshold (“m” in Figure 13), the increase will be a factor of m. If malicious administrators see as many as m DELETE queries, the degree of obfuscation provided by the obfuscated DELETE queries is the same as the one estimated for INSERT queries (formula (3) where Nnew_rows = m).

Data structures at the security gateway: Another major overhead in the implementation of the proposed attribute-shuffling obfuscation is the mapping tables at the security gateway. Assuming that the maximum number of rows in an obfuscated table is N, each field in the mapping table (i.e., R# and CN), except “V”, should be log2N bits, while each “V” field is one bit. Thus, the size of a mapping table, in bytes, for each obfuscated table at a cloud server is estimated as:

((log2N × (Ncolumns + 1) × N) + N)/8 (9)
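Formula (9) can be evaluated directly. The sketch below transcribes it as written, treating log2N as a real number rather than rounding it up to a whole number of bits. That choice, the function name, and the example values are assumptions of the sketch.

```python
import math


def mapping_table_bytes(max_rows, n_columns):
    """Estimated mapping-table size in bytes, per formula (9).

    max_rows is N, the maximum number of rows in the obfuscated table.
    """
    bits_per_field = math.log2(max_rows)          # width of each R# / CN field
    field_bits = bits_per_field * (n_columns + 1) * max_rows
    valid_bits = max_rows                         # one "V" bit per row
    return (field_bits + valid_bits) / 8


print(mapping_table_bytes(1024, 3))   # 1,024 rows, 3 columns
print(mapping_table_bytes(10000, 3))  # 10,000 rows, 3 columns
```

For 1,024 rows and 3 columns this gives 5,248 bytes, so the mapping table stays small relative to the data it indexes.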

Figure 22 shows the estimated size of a mapping table in bytes for tables that consist of 3, 6, 9, 12, and 15 columns, estimated by formula (9). The results of the estimations indicate that the number of rows contained in a table has a linear impact on the mapping table size. Formula (10), which estimates the mapping table size, is derived from the trend lines in Figure 22, where Ntotal_rows represents the number of rows contained in a table. Its linear coefficient is a factor of Ncolumns (i.e., (2 + 5 × (Ncolumns – 1))). The relatively low linear coefficient suggests that the size of mapping tables is tractable, implying good scalability for large obfuscated tables.

(2 + 5 × (Ncolumns – 1)) × (Ntotal_rows) (10)

Figure 22 – Estimated data structure size of a mapping table for different size of obfuscated tables

(Plot: size of a mapping table in bytes on the y-axis versus the maximum number of rows in an obfuscated table, 0 to 10,000, on the x-axis. Fitted trend lines: y = 6.9x (3 columns), y = 12.0x (6 columns), y = 17.0x (9 columns), y = 22.1x (12 columns), and y = 27.2x (15 columns).)

Network traffic load optimization

In the previous section, we observed a linear relation between the two tuning parameters (Nnew_rows and α) and the two essential performance metrics (the D/R ratio and the number of inserts to a database table for achieving a particular D/R ratio) in the proposed attribute-shuffling obfuscation. This result implies that the D/R ratio and the number of inserts to achieve a particular D/R ratio can be controlled by the Nnew_rows and α parameters. From the results observed in Figures 14 through 17, we also found that the D/R ratio will quickly decrease as new rows are inserted to a table. In particular, when the value of Nnew_rows was low (e.g., 10, 15, 20, and 25), the observed D/R ratio at the 100th insert was less than 1.0 (0.3, 0.5, 0.6, and 0.75, respectively) for α = 3.0.

After our initial performance evaluations, our major concerns converged on the increased network traffic in SELECT, REPLACE-UPDATE, and DELTA-UPDATE queries. The results of our analyses suggested possibly high overhead for queries that consist of multiple conditions and attributes, especially when a large number of rows satisfy the query conditions, as predicted by formulas (6) and (8). This section describes our optimizations of the obfuscation algorithms for the three query types, which make the proposed attribute-shuffling obfuscation a practical and efficient solution by minimizing their network traffic overhead.

The optimization method we developed consists of two components: “query crates” (called “crates” hereafter) and “crate slots” (called “slots” hereafter). Each crate consists of a certain number of slots, each of which is used to transfer a SQL query from a security gateway to an obfuscated table at a cloud server. One crate cannot be used to transfer more than one query for one obfuscated query. For example, each execution of an obfuscated DELTA-UPDATE query requires as many as ((Nconditions + (Nattributes × Nmatched_rows × 2)) × N) individual queries. Therefore, ((Nconditions + (Nattributes × Nmatched_rows × 2)) × N) crates need to be used to transmit queries for one obfuscated DELTA-UPDATE query.

Each crate consists of s slots (s is an integer larger than 1). Thus, each crate can accommodate s queries, one from each of s distinct obfuscated queries, which are assumed to be independent of each other. Figure 23 shows the organization of the security gateway to implement this optimization method. The obfuscated query dispatcher accepts obfuscated queries from users and creates an obfuscated query processor for each of them. The obfuscated query processor is a process that executes one of the obfuscation algorithms for an obfuscated query submitted by a user. Obfuscated query processors are dynamically spawned, one for each obfuscated query. Queries generated for an obfuscated query are placed in the local outgoing query queue. The dashed lines in the figure are the backward traffic from a cloud server to users.

Figure 23 - Organization of the security gateway to implement the optimization method

(Diagram components: users and users’ hosts; the security gateway, containing the obfuscated query dispatcher, obfuscated query processors, outgoing query queues, and the crate constructor; the cloud server and its administrator.)

Figure 24 – Crate construction algorithm

The crate constructor constructs crates using the algorithm shown in Figure 24 and transmits the crates to a cloud server. The algorithm uses two key parameters: s (the number of slots in a crate) and q (the number of the existing outgoing query queues). The DBMS at a cloud server accepts and executes one query at a time without using the concept of crates and slots. Thus, this optimization method does not require any special process at the cloud server side.

The above optimization method reduces the network traffic overhead for obfuscated queries in the following way. If as many as s outgoing query queues exist, the security gateway will not transmit any dummy query. Instead, the crate constructor fills all of the s slots in a crate with real queries, one from each of the s outgoing queues. In other words, instead of each obfuscation algorithm generating (N-1) dummy queries, the crate constructor dynamically merges real queries for real obfuscated queries from users. The crate constructor inserts dummy queries for obfuscation only if there are not enough obfuscated queries.

Let the query load be the probability that an outgoing query queue has a query at its top. Assuming that there is no correlation among the query queues, the average number of queries needed for transmitting one real query is calculated by formula (11). The formula was derived in the following way. First, if only one query queue holds a query at the top of the queue, the crate constructor needs to obfuscate it by mixing it with (s-1) dummy queries. This means that a total of s queries are transmitted for transmitting one real query. If two query queues have a query to transmit, the crate constructor creates a crate by merging the two real queries with (s-2) dummy queries, making the number of transmitted queries per real query s/2. Similarly, if all of the s query queues have a query to transmit, all the s slots in the crate are used for carrying real queries, resulting in s/s = 1.0.

(11)
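The derivation above can be turned into a small calculation. The sketch below is one possible reading of it, not a reproduction of formula (11): it assumes exactly s queues, each independently holding a query with the same probability (the query load), so that the number of ready queues follows a binomial distribution, and it averages s/k over k = 1, ..., s ready queues. The function name, the `load` parameter, and the optional conditioning on at least one ready queue are all assumptions.

```python
from math import comb


def expected_queries_per_real_query(s, load, condition_on_nonempty=False):
    """Average number of transmitted queries per real query.

    s: number of slots per crate (also the assumed number of queues).
    load: assumed probability that a queue has a query at its top.
    """
    if not 0.0 < load <= 1.0:
        raise ValueError("load must be in (0, 1]")
    total = sum(
        comb(s, k) * load**k * (1 - load) ** (s - k) * (s / k)
        for k in range(1, s + 1)
    )
    if condition_on_nonempty:
        # Ignore rounds in which no queue has anything to send.
        total /= 1 - (1 - load) ** s
    return total


for load in (0.05, 0.5, 0.9):
    print(load, expected_queries_per_real_query(30, load))
```

Under these assumptions the value falls toward 1.0 as the load approaches 1, which matches the qualitative behavior described for busy systems.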

Using the expected number of queries for transmitting a real query calculated by formula (11), the total number of queries after the optimization is added is given by formula (12) for SELECT and REPLACE-UPDATE and by formula (13) for DELTA-UPDATE:

(Nconditions + (Nattributes × Nmatched_rows)) × (the value of formula (11)) (12)

(Nconditions + (Nattributes × Nmatched_rows × 2)) × (the value of formula (11)) (13)

Figure 25 shows the estimated increase in the network traffic, measured in the number of queries issued to a cloud server, for no obfuscation (“no obfuscation”), the attribute-shuffling obfuscation without the proposed optimization (“w/out”), and the attribute-shuffling obfuscation with the proposed optimization (“w/”). The graph plots the average number of queries for transmitting one real query for different values of the query load. We applied s = Nnew_rows in each analysis. Both “no obfuscation” and “w/out” resulted in a constant traffic load. The average number of queries for transmitting one real query was Nnew_rows for “w/out” and 1.0 for “no obfuscation”.

We observed that the attribute-shuffling obfuscation with the proposed optimization reduced the network traffic to approximately half of that of the obfuscation without the optimization for low query loads (below 0.1). However, as the query load increased toward 0.9, the relative network traffic load converged to that of “no obfuscation” (i.e., the proposed attribute shuffling will cause virtually no major network traffic overhead). These observations let us conclude that the optimization algorithm will be effective for busy database systems. Although its effectiveness will be relatively low for under-utilized systems, overhead for under-utilized systems will not be a fatal weakness in the proposed attribute-shuffling data obfuscation. The DBMS at a cloud server still processes dummy queries, but for the same reason (many dummy queries when the real query load is low, but few dummy queries when the real query load is high), the overhead will not be fatal to the net throughput of the database applications.


1. Grab one query from the top of each of the existing outgoing query queues.

2. Repeat Step 1 for min(q, s) times in the round-robin order of the q existing outgoing query queues.

3. If the number of the existing queues (q) is less than or equal to s, artificially synthesize (s - q) dummy queries.

4. Randomly place the q real and (s - q) dummy queries into the slots of the current crate.

5. Transmit all the s queries in the current crate.

6. Go back to step 1 (i.e., repeat from step 1).
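The crate construction steps can be sketched as a single function that builds one crate. This is a simplified illustration under stated assumptions: queues are modeled as `collections.deque` objects, `make_dummy` is a hypothetical callback that synthesizes one dummy query, empty queues are skipped, and the scan always starts from the first queue (a fair round-robin would rotate the starting queue between crates).

```python
import random
from collections import deque


def build_crate(queues, s, make_dummy, rng=random):
    """Build one crate of s slots from the heads of the outgoing queues."""
    real = []
    for queue in queues:            # take at most one query per queue
        if len(real) == s:
            break                   # crate is already full of real queries
        if queue:
            real.append(queue.popleft())
    # Pad with dummies only when there are fewer real queries than slots.
    dummies = [make_dummy() for _ in range(s - len(real))]
    crate = real + dummies
    rng.shuffle(crate)              # random slot placement
    return crate


queues = [deque(["a1", "a2"]), deque(["b1"])]
crate = build_crate(queues, 4, lambda: "DUMMY", random.Random(0))
print(sorted(crate))
```

In this example two real queries and two dummy queries share the crate, and the query `a2` stays queued for the next crate.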

Figure 25 – increase in the network traffic for no obfuscation, the attribute-shuffling obfuscation with, and without the proposed optimization

(Plot: normalized network traffic on the y-axis versus query load, 0.01 to 0.9, on the x-axis. Labeled curves: “w/out” and “w/” for Nnew_rows = 25 and 30, and “no obfuscation” for Nnew_rows = 10, 15, 20, 25, and 30.)

5. CONCLUSIONS AND FUTURE WORK

In this paper, we propose the algorithms for attribute-shuffling obfuscation for database applications in cloud environment. The proposed obfuscation combines data and code obfuscation methods to prevent information leak to malicious administrators at cloud providers, who, we assume, have unlimited and non-censored access to any local resources, possibly including the system security logs. The algorithms allow database management systems at cloud servers to perform fundamental operations, such as INSERT, SELECT, UPDATE, and DELETE, while cloud users’ information is protected against leaks to malicious administrators at cloud providers.

We analyzed the expected performance of the proposed attribute-shuffling obfuscation by focusing on the degree of obfuscation, table size inflation, and increase in network traffic load. Using the results of the performance analyses, we developed performance optimization algorithms for the proposed attribute-shuffling obfuscation. Our performance analyses led to the following observations.

Observation 1: In Figures 14 and 15, we observed that the table size inflation is inversely proportional to the number of rows inserted to a table. The D/R ratio, which indicates how many dummy rows are inserted to a table relative to real rows, will quickly decrease as new rows are inserted to a table. This implies that the proposed attribute-shuffling obfuscation will be efficient for large tables in terms of table size overhead.

Observation 2: Our primary observation in Figures 18 through 20 was that the number of inserts needed to achieve a target D/R ratio can be controlled by the parameter Nnew_rows. This implies that the proposed attribute-shuffling obfuscation is adjustable to different real database systems.

Observation 3: The gradual slope of the linear graphs in Figure 21 (the linear coefficient relating Nnew_rows to the D/R ratio ranged from 0.015 to 0.065 for α = 3.0 to 13.0) implies that a wide range of Nnew_rows values can be practically used for controlling the D/R ratio. Figure 21 also suggests that the impact of parameter α is a factor of 0.5% (each unit increase in α increased the linear coefficient by 0.005, i.e., 0.5%).

Observation 4: Formula (3) suggests that the degree of obfuscation provided by obfuscated INSERT queries is inversely proportional to the number of the inserted rows to the power of the number of columns in each row. This implies that the degree of obfuscation can be adjusted using Nnew_rows as a tuning parameter. Controlling the degree of obfuscation by adjusting Nnew_rows is a practical option since Nnew_rows has a weak impact on the D/R ratio (Figure 21).

Observation 5: Figure 22 shows the impact of the table size, in terms of the number of columns and the maximum number of rows in a table, on the size of a mapping table at a security gateway. This analysis studied the overhead for the data structure at each security gateway. Our linear regression analyses indicated that its linear coefficient is a factor of Ncolumns (i.e., (2 + 5 × (Ncolumns – 1))). The relatively low linear coefficient suggests that the size of mapping tables is tractable, implying good scalability even for large obfuscated tables.

Observation 6: The attribute-shuffling obfuscation with the proposed optimization reduced the network traffic to approximately half of that of the obfuscation without the optimization for low query loads (below 0.1). As the query load increased toward 0.9, the relative network traffic load converged to that of “no obfuscation” (i.e., the proposed attribute shuffling will cause virtually no major network traffic overhead under high workload).

The six major observations from our performance analyses led us to two major conclusions. First, the proposed attribute-shuffling obfuscation is a solution that will be efficient for systems with diverse requirements. This conclusion follows from the observed linear relation between the two tuning parameters (Nnew_rows and α) and the two essential performance metrics (the D/R ratio and the number of inserts to a database table for achieving a particular D/R ratio).

Second, from our sixth observation (“Observation 6”), we learned that the relative network traffic load converged to that of “no obfuscation” as the query load increased toward 0.9. This observation lets us conclude that the optimization algorithm will be efficient under high workload and for busy systems.

There are some issues that must be addressed before the proposed attribute-shuffling obfuscation can be put to real use. The first issue is the design, the algorithms, and the implementation for artificially generating fake but realistic contents for dummy rows. Since poorly generated fake contents will provide malicious administrators with strong clues for eliminating such dummy rows in obfuscated tables, the “quality of fake data” is significant. Since the information stored and processed differs from one database system to another, generating such “high quality fake data” may be a major challenge.

6. REFERENCES

[1] Jon Oltsik, “2013 Vormetric/ESG Insider Threats Survey - The Ominous State of Insider Threats,” Insider Threats Survey, Vormetric, 2013. URL: http://www.vormetric.com/sites/default/files/ap_Vormetric-Insider_Threat_ESG_Research_Brief.pdf (last accessed: June 13, 2014).

[2] Sang-Chin Yang and Yi-Lu Wang, “System Dynamics Based Insider Threats Modeling,” International Journal of Network Security and its Applications, vol. 3, no. 3, pp. 1-14, 2011.

[3] Adam Cummings, Todd Lewellen, David McIntire, Andrew P. Moore, and Randall F. Trzeciak, “Insider Threat Study: Illicit Cyber Activity Involving Fraud in the U.S. Financial Services Sector,” CERT Research Report CMU/SEI-2012-SR-004, July 2012. URL: http://repository.cmu.edu/sei/688/?utm_source=repository.cmu.edu%2Fsei%2F688&utm_medium=PDF&utm_campaign=PDFCoverPages (last accessed: June 13, 2014).

[4] Hiroshi Fujinoki and Siamak Mahmoudian Dehkordi, “In-line Auditing and Real-time Lineage Summaries to Maintain Ownership of Information Stored in Cloud Servers,” Journal of Network and Information Security, _________, 2013.

[X1]Siani Pearson, Yun Shen, and Miranda Mowbray, ”A Privacy Manager for Cloud Computing,” Cloud Computing, Lecture Notes in Computer Science, vol. 5931, pp. 90-106, 2009.

[5]

[6]

[7]

[8] Brian M. Bowen, Malek Ben Salem, Angelos D. Keromytis, and Salvatore J. Stolfo, “Monitoring Technologies for Mitigating Insider Threats,” Advances in Information Security, vol. 49, pp. 197-217, 2010.

[9] Paul Thompson, “Weak Models for Insider Threat Detection,” Proceedings of the SPIE Sensors, and Command, Control, Communications, and Intelligence (C3I) Technologies for Homeland Security and Homeland Defense III, vol. 5403, 2004.

[10] Nam Nguyen, Peter Reiher, and Geoffrey H. Kuenning, “Detecting Insider Threats by Monitoring System Call Activity,” Proceedings of the IEEE, Workshop on Information Assurance, 2003.

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18] Boanerges Aleman-Meza, Phillip Burns, Matthew Eavenson, Devanand Palaniswami, and Amit Sheth, “An Ontological Approach to the Document Access Problem of Insider Threat,” Proceedings of the IEEE International Conference on Intelligence and Security Informatics, pp. 486-491, 2005.

[X2]Christian Collberg, Clark Thomborson, and Douglas Low, “A Taxonomy of Obfuscating Transformations,” Computer Science Technical Reports #148, Dept. of Computer

Science, Univ. of Auckland, 1997. URL: https://researchspace. auckland.ac.nz/handle/2292/3491 (last accessed: June 13, 2014).

[19] Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, and Hoeteck Wee, “Toward Privacy in Public Databases,” Proceedings of the International Conference on Theory of Cryptography, pp. 363-385, 2005.

[20] William E. Winkler, “Re-identification Methods for Masked Microdata,” Privacy in Statistical Databases, Lecture Notes in Computer Science, vol. 3050, pp. 216-230, 2004.

[21] Alexandre Evfimievski, Johannes Gehrke, and Ramakrishnan Srikant, “Limiting Privacy Breaches in Privacy Preserving Data Mining,” Proceedings of the ACM Symposium on Principles of Database Systems, pp. 211-222, 2003.

[22] Narayanan, Arvind, and Vitaly Shmatikov, “Obfuscated Databases and Group Privacy,” Proceedings of the ACM Conference on Computer and Communications Security, pp. 102-111, 2005.

[Y1]

[Y2]

[Y3]

[Y4]

[Z1] (D. E. Denning, “Cryptography and Data Security, Addison-Wesley, 1982”)].

[G1]John Magnabosco, Protecting SQL Server Data, Red gate Books, 2009, ISBN-10: 1906434271.

EXTRA REFERNCES

William E. Winkler, “Masking and Re-identification Methods for Public-Use Microdata: Overview and Research Problems,” Research Report Series #2004-06, Statistical Research Division, U.S. Bureau of the Census, 2004.

Dawn Cappelli, Andrew P. Moore, Marissa R. Randazzo, Michelle Keeney, and Eileen Kowalski, “Insider Threat Study: Illicit Cyber Activity in the Banking and Finance Sector,” CERT Research Report, Software Engineering Institute, August 2000.
