Privacy Preserving Intermediate Datasets in the Cloud


CHAPTER 1
INTRODUCTION

1.1 CLOUD COMPUTING
A key differentiating element of successful information technology (IT) is its ability to become a true, valuable, and economical contributor to cyber infrastructure. Cloud computing embraces cyber infrastructure and builds upon decades of research in virtualization, distributed computing, grid computing, utility computing and, more recently, networking, web and software services. It implies service-oriented architecture, reduced IT overhead for the end user, greater flexibility, reduced total cost of ownership, on-demand services and many other benefits. Cloud computing is the delivery of computing and storage capacity as a service to a community of end recipients. The name comes from the use of a cloud-shaped symbol as an abstraction for the complex infrastructure it contains in system diagrams. Cloud computing entrusts services with a user's data, software and computation over a network; it is Internet-based development and use of computer technology. A cloud service is an independent piece of software that can be used in conjunction with other services to achieve interoperable machine-to-machine interaction over the network. Typical cloud computing services provide common business applications online that are accessed from a web browser, while the software and data are stored on the servers. Cloud computing is a large-scale distributed computing paradigm in which a pool of computing resources is available to users via the Internet. Computing resources, e.g., processing power, storage, software and network bandwidth, are presented to cloud consumers as accessible public utility services.

These services are broadly divided into three categories:
a) Software as a Service (SaaS)
b) Platform as a Service (PaaS)
c) Infrastructure as a Service (IaaS)

a) Software as a Service (SaaS)
Software as a Service offers a complete application as an on-demand service. A single instance of the software runs on the cloud and serves multiple end users or client organizations. SaaS is a software distribution model in which applications are hosted by a vendor or service provider. It is becoming an increasingly prevalent delivery model as the underlying technologies that support web services and service-oriented architecture (SOA) mature and new development approaches such as Ajax become popular. Meanwhile, broadband service has become increasingly available to support user access from more areas around the world. SaaS is closely related to the ASP (Application Service Provider) and on-demand computing software delivery models. IDC identifies two slightly different delivery models for SaaS.

b) Platform as a Service (PaaS)
Platform as a Service encapsulates a layer of software and provides it as a service that can be used to build higher-level services. There are at least two perspectives on PaaS, depending on whether one is the producer or the consumer of the services. Someone producing PaaS might build a platform by integrating an OS, middleware, application software and even a development environment, which is then provided to a customer as a service.

c) Infrastructure as a Service (IaaS)
Infrastructure as a Service delivers basic storage and compute capabilities as standardized services over the network. Servers, storage systems, switches, routers and other systems are pooled and made available to handle workloads that range from application components to high-performance computing applications.
Commercial examples of IaaS include Joyent, whose main product is a line of virtualized servers that provide a highly available, on-demand infrastructure. The name cloud computing was inspired by the cloud symbol that is often used to represent the Internet in flow charts and diagrams.

1.1.1 Cloud Platform
A cloud platform is a platform that lets developers write applications that run in the cloud, use services provided from the cloud, or both. Different names are used for this kind of platform today, including on-demand platform and Platform as a Service (PaaS). It is a method of managing data (files, photos, music, video, etc.) through one or more web-based solutions. Rather than keeping data primarily on hard drives that are tethered to computers or other devices, the data is kept in the cloud, where it may be accessible from any number of devices. Cloud infrastructure is the concept of providing hardware as a service, i.e., shared and reusable hardware for a specific time of service. Examples include virtualization, grid computing and paravirtualization. This service helps reduce maintenance and usability costs, considering the need for infrastructure management and upgrades.

1.1.2 Cloud Concepts
A powerful underlying and enabling concept is computing through Service Oriented Architectures (SOA): delivery of an integrated and orchestrated suite of functions to an end user through the composition of both loosely and tightly coupled functions, or services, which are often network based. Related concepts are component-based system engineering, orchestration of different services through workflows, and virtualization.

1.1.3 Service Oriented Architecture
SOA is not a new concept, although it has again been receiving considerable attention in recent years. Examples of some of the first network-based service-oriented architectures are Remote Procedure Calls (RPC) and object request brokers based on the CORBA specification. A more recent example is the so-called grid computing architectures and solutions. In an SOA environment, end users request an IT service at the desired functional, quality and capacity level and receive it either at the time requested or at a specified later time. Service discovery, brokering and reliability are important, and services are usually designed to interoperate, as are the composites made of these services. It is expected that in the next 10 years, service-based solutions will be a major vehicle for the delivery of information and other IT-assisted functions at both individual and organizational levels, e.g., software applications, web-based services, personal and business desktop computing, and high-performance computing.

1.1.4 Cloud Components
The key to an SOA framework that supports workflows is componentization of its services: an ability to support a range of couplings among workflow building blocks, fault tolerance in its data- and process-aware service-based delivery, and an ability to audit processes, data and results, i.e., to collect and use provenance information. The component-based approach is characterized by reusability, substitutability, extensibility, scalability, customizability and composability.
Other important properties include reliability and availability of the components and services, the cost of the services, security, total cost of ownership and economy of scale. In the context of cloud computing, users distinguish many categories of components: from differentiated and undifferentiated hardware, to general-purpose and specialized software and applications, to real and virtual images, to environments, to no-root differentiated resources, to workflow-based environments and collections of services, and so on.

1.1.5 Cyber Infrastructure
Cyber infrastructure makes applications dramatically easier to develop and deploy, thus expanding the feasible scope of applications possible within budget and organizational constraints, and shifting the effort of scientists and engineers away from information technology development and concentrating it on scientific and engineering research. Cyber infrastructure also increases efficiency, quality and reliability by capturing commonalities among application needs, and facilitates the efficient sharing of equipment and services. Today, almost any business or major activity uses, or relies in some form on, IT and IT services. These services need to be enabling and appliance-like, and there must be an economy of scale for the total cost of ownership to be better than it would be without cyber infrastructure. Technology needs to improve end-user productivity and reduce technology-driven overhead. For example, unless IT is the primary business of an organization, less than 20% of its effort not directly connected to its primary business should have to do with IT overhead, even though 80% of its business might be conducted using electronic means.

CHAPTER 2
LITERATURE REVIEW

2.1 MINIMUM COST BENCHMARKING FOR INTERMEDIATE DATASET STORAGE IN SCIENTIFIC CLOUD
Dong Yuan et al (2007) have proposed a minimum cost benchmark for scientific applications. Scientific applications are usually complex and data intensive. In many fields, such as astronomy, high-energy physics and bioinformatics, scientists need to analyse terabytes of data, either from existing data resources or collected from physical devices. These analyses are usually computation intensive and hence take a long time to execute. Workflow technologies can be used to automate such scientific applications. Accordingly, scientific workflows are typically very complex: they usually have a large number of tasks and need a long time for execution. During execution, a large volume of new intermediate datasets is generated. These can be even larger than the original dataset(s) and contain important intermediate results. After the execution of a scientific workflow, some intermediate datasets may need to be stored for future use: scientists may need to re-analyze the results or apply new analyses to the intermediate datasets, and for collaboration the intermediate results may need to be shared and reused among scientists from different institutions. Storing valuable intermediate datasets saves their regeneration cost when they are reused, not to mention the waiting time saved by avoiding regeneration. Given the large sizes of the datasets, running scientific workflow applications usually requires not only high-performance computing resources but also massive storage. Nowadays, popular scientific workflows are often deployed in grid systems because they offer high performance and massive storage.
However, building a grid system is extremely expensive, and it is normally not an option for scientists all over the world. The emergence of cloud computing technologies offers a new way to develop scientific workflow systems, in which one research topic is cost-effective strategies for storing intermediate datasets. In late 2007 the concept of cloud computing was proposed, and it is deemed the next generation of IT platforms that can deliver computing as a kind of utility. Foster et al made a comprehensive comparison of grid computing and cloud computing. Cloud computing systems provide the high performance and massive storage required for scientific applications in the same way as grid systems, but with a lower infrastructure construction cost among many other features, because cloud computing systems are composed of data centres that can be clusters of commodity hardware. Research into doing science and data-intensive applications on the cloud has already commenced, with early experiences such as the Nimbus and Cumulus projects. The work by Deelman et al shows that cloud computing offers a cost-effective solution for data-intensive applications such as scientific workflows. Furthermore, cloud computing systems offer a new model in which scientists from all over the world can collaborate and conduct their research together. Cloud computing systems are based on the Internet, and so are the scientific workflow systems deployed in the cloud. Scientists can upload their data and launch their applications on scientific cloud workflow systems from anywhere in the world via the Internet, and they only need to pay for the resources that their applications use. As all the data are managed in the cloud, it is easy to share data among scientists.

Scientific cloud workflows are deployed in a cloud computing environment where all resources used must be paid for. For a scientific cloud workflow system, storing all the intermediate datasets generated during workflow executions may cause a high storage cost. In contrast, if users delete all the intermediate datasets and regenerate them every time they are needed, the computation cost of the system may well be very high too. The goal of an intermediate dataset storage strategy is to reduce the total cost of the whole system; the best approach is to find a balance that selectively stores some popular datasets and regenerates the rest when needed. This work proposes a novel algorithm that can calculate the minimum cost for intermediate dataset storage in scientific cloud workflow systems. The intermediate datasets in scientific cloud workflows often have dependencies, and these generation relationships are a kind of data provenance. Based on the data provenance, the authors create an Intermediate data Dependency Graph (IDG), which records the information of all the intermediate datasets that have ever existed in the cloud workflow system, whether they are stored or deleted. With the IDG, one knows how the intermediate datasets are generated and can further calculate their generation cost. Given an intermediate dataset, its generation cost is divided by its usage rate, so that this cost can be compared with its storage cost per time unit, where a dataset's usage rate is the time between successive usages of the dataset. One can then decide whether an intermediate dataset should be stored or deleted in order to reduce the system cost.
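As a rough illustration of the per-dataset decision rule just described, the comparison might look like the minimal sketch below. This is not the paper's benchmarking algorithm (that is the CTT-SP algorithm introduced next, which works over the whole IDG); the function and parameter names are illustrative assumptions.

```python
# Minimal sketch of the store-vs-regenerate decision for one intermediate
# dataset, assuming we already know its generation cost, how often it is
# used, and its per-time-unit storage cost. Illustrative only; the actual
# minimum cost benchmark is computed by the CTT-SP algorithm over the IDG.

def should_store(generation_cost, usage_interval, storage_cost_per_unit):
    """Return True if keeping the dataset is cheaper than regenerating it.

    generation_cost:        cost to regenerate the dataset from its ancestors
    usage_interval:         average time between consecutive uses
    storage_cost_per_unit:  cost of storing the dataset per time unit
    """
    # Amortize the regeneration cost over the time between uses and
    # compare it with the cost of simply keeping the dataset stored.
    regeneration_rate = generation_cost / usage_interval
    return regeneration_rate > storage_cost_per_unit

# Example: regenerating costs 120 cost units, the dataset is used every
# 10 time units, and storage costs 2 per time unit -> store it.
print(should_store(120, 10, 2))   # True  (12 > 2)
print(should_store(120, 10, 20))  # False (12 < 20)
```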
Given the historic usages of the datasets in an IDG, the authors propose a Cost Transitive Tournament Shortest Path (CTT-SP) based algorithm that can find the minimum cost storage strategy for the intermediate datasets on demand in scientific cloud workflow systems. This minimum cost can be used as a benchmark to evaluate the cost effectiveness of other intermediate dataset storage strategies.

2.2 PRIVACY PRESERVING MULTI-KEYWORD RANKED SEARCH OVER ENCRYPTED CLOUD DATA
Ning Cao et al (2007) have proposed that cloud computing is the long dreamed vision of computing as a utility, where cloud customers can remotely store their data in the cloud so as to enjoy on-demand high-quality applications and services from a shared pool of configurable computing resources. To protect data privacy and combat unsolicited accesses in the cloud and beyond, sensitive data, e.g., emails, personal health records, photo albums, tax documents, financial transactions, etc., may have to be encrypted by data owners before outsourcing to a commercial public cloud; this, however, obsoletes the traditional data utilization service based on plaintext keyword search. The trivial solution of downloading all the data and decrypting locally is clearly impractical, due to the huge bandwidth cost in cloud-scale systems. Moreover, aside from eliminating local storage management, storing data in the cloud serves no purpose unless it can be easily searched and utilized. Thus, exploring privacy-preserving and effective search services over encrypted cloud data is of paramount importance. Considering the potentially large number of on-demand data users and the huge amount of outsourced data documents in the cloud, this problem is particularly challenging, as it is extremely difficult to also meet the requirements of performance, system usability and scalability.

On the one hand, to meet the effective data retrieval need, the large number of documents demands that the cloud server perform result relevance ranking, instead of returning undifferentiated results. Such a ranked search system enables data users to find the most relevant information quickly, rather than burdensomely sorting through every match in the content collection. Ranked search can also elegantly eliminate unnecessary network traffic by sending back only the most relevant data, which is highly desirable in the pay-as-you-use cloud paradigm. For privacy protection, however, such a ranking operation should not leak any keyword-related information. On the other hand, to improve search result accuracy and enhance the user searching experience, it is also crucial for such a ranking system to support multiple-keyword search, as single-keyword search often yields far too coarse results. As a common practice indicated by today's web search engines, data users tend to provide a set of keywords instead of only one as the indicator of their search interest, and each keyword in the search request helps narrow down the search result further. Coordinate matching, i.e., as many matches as possible, is an efficient principle among such multi-keyword semantics to refine the result relevance, and has been widely used in the plaintext Information Retrieval (IR) community.
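As a plain, unencrypted illustration of the coordinate-matching principle just mentioned, the sketch below scores documents by the number of query keywords they contain, i.e., the inner product of binary keyword vectors. The dictionary, documents and scores are made-up examples, not from the paper; the scheme discussed in this section carries the comparison out over encrypted index and query vectors, while this sketch shows only the plaintext similarity measure.

```python
# Plain (unencrypted) illustration of coordinate matching: a document and a
# query are represented as 0/1 vectors over a fixed keyword dictionary, and
# relevance is their inner product, i.e. the number of query keywords the
# document contains. This only demonstrates the similarity measure itself.

DICTIONARY = ["cloud", "privacy", "encryption", "keyword", "ranking"]

def to_vector(keywords):
    """Map a set of keywords to a 0/1 vector over DICTIONARY."""
    present = set(keywords)
    return [1 if w in present else 0 for w in DICTIONARY]

def coordinate_match_score(doc_keywords, query_keywords):
    """Inner product of the two binary vectors = number of matched keywords."""
    d, q = to_vector(doc_keywords), to_vector(query_keywords)
    return sum(di * qi for di, qi in zip(d, q))

docs = {
    "doc1": ["cloud", "privacy", "encryption"],
    "doc2": ["cloud", "ranking"],
}
query = ["privacy", "encryption", "ranking"]

# Rank documents by how many query keywords they contain.
ranked = sorted(docs, key=lambda name: coordinate_match_score(docs[name], query),
                reverse=True)
print(ranked)  # ['doc1', 'doc2']  (scores 2 and 1)
```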
However, how to apply it in an encrypted cloud data search system remains a very challenging task because of inherent security and privacy obstacles, including various strict requirements such as data privacy, index privacy and keyword privacy, among many others. In the literature, searchable encryption is a helpful technique that treats encrypted data as documents and allows a user to securely search over them through a single keyword and retrieve documents of interest. However, directly applying these approaches to deploy a secure large-scale cloud data utilization system would not necessarily be suitable, as they were developed as crypto primitives and do not accommodate such high service-level requirements as system usability, user searching experience and easy information discovery. Although some recent designs have been proposed to support boolean keyword search as an attempt to enrich search flexibility, they are still not adequate to provide users with acceptable result ranking functionality. The authors' earlier work was aware of this problem and solved secure ranked search over encrypted data with support for only single-keyword queries. How to design an efficient encrypted data search mechanism that supports multi-keyword semantics without privacy breaches remains a challenging open problem.

This work defines and solves the problem of multi-keyword ranked search over encrypted cloud data while preserving strict system-wide privacy in the cloud computing paradigm. Among various multi-keyword semantics, the authors choose the efficient principle of coordinate matching, i.e., as many matches as possible, to capture the similarity between the search query and the data documents. Specifically, they use inner product similarity, i.e., the number of query keywords appearing in a document, to quantitatively evaluate the similarity of that document to the search query under the coordinate matching principle. During index construction, each document is associated with a binary vector as a sub-index, where each bit represents whether the corresponding keyword is contained in the document. The search query is also described as a binary vector, where each bit indicates whether the corresponding keyword appears in the search request, so the similarity can be exactly measured by the inner product of the query vector with the data vector. To meet the challenge of supporting such multi-keyword semantics without privacy breaches, they propose a basic scheme using secure inner product computation, adapted from a secure k-nearest neighbour technique, and then improve it step by step to achieve various privacy requirements under two levels of threat models.

2.3 AUTHORIZED PRIVATE KEYWORD SEARCH OVER ENCRYPTED PERSONAL HEALTH RECORDS IN CLOUD
Ming Li et al (2008) have proposed keyword search in the Personal Health Record (PHR), which has emerged as a patient-centric model of health information exchange. It has never been easier for individuals to create and manage their own Personal Health Information (PHI) in one place and share that information with others. It enables a patient to merge potentially separate health records from multiple geographically dispersed health providers into one centralized profile over the passage of time.
This greatly facilitates multiple other users, such as medical practitioners and researchers, in gaining access to and utilizing one's PHR on demand according to their professional needs, thereby making healthcare processes much more efficient and accurate. As a matter of fact, PHRs are usually untethered, i.e., provided by a third-party service provider, in contrast to electronic medical records, which are usually tethered, i.e., kept by each patient's own healthcare provider. Untethered PHRs are the best way to empower patients to manage their health and wellbeing. The most popular examples of PHR systems include Google Health and Microsoft HealthVault, which are hosted on cloud computing platforms, and it is a vision dreamed by many to enable anyone to access PHR services from anywhere, at any time.

Despite the enthusiasm around patient-centric PHR systems, their promise cannot be fulfilled until the serious security and privacy concerns patients have about these systems are addressed; these concerns are the main impediments standing in the way of wide adoption. In fact, people remain dubious about the level of privacy protection of their health data when it is stored on a server owned by a third-party cloud service provider. Most people do not fully entrust third-party service providers with their sensitive PHR data because there is no governance over how this information can be used and whether the patients actually control their information. On the other hand, even if patients choose to trust those service providers, PHR data could be exposed if an insider in the service provider's company misbehaves, or if the server is broken into. To cope with these tough trust issues and to ensure patients' control over their own privacy, applying data encryption to patients' PHR documents before outsourcing has been proposed as a promising solution. With encrypted PHR data, one of the key functionalities of a PHR system, keyword search, becomes an especially challenging issue. First, frequently used complex query types must be supported in practice while preserving the privacy of the query keywords. This class of boolean formulas features conjunctions among different keyword fields and is referred to as multi-dimensional multi-keyword search in this work. To hide the query keywords from the server, it is clearly inefficient for a user to download the whole database and try to decrypt the records one by one. Searchable encryption has been proposed as a better solution: informally speaking, a user submits a capability encoding her query to the server, which searches through the encrypted keyword indexes created by the owners and returns a subset of encrypted documents that satisfy the underlying query, without ever learning the keywords in the query or the index. However, existing searchable encryption solutions are still far from practical for PHR applications in cloud environments. First and foremost, they are limited both in the type of applications and in system scalability. Recently, Benaloh et al and Narayan et al proposed several solutions for securing encrypted electronic health records. In their schemes for encrypted search, each owner issues search capabilities to individual users upon request. The main advantage is that the owner herself can exert fine-grained control over users' search access to her PHR documents.
Yet such a framework is limited to small-scale access and search applications, where the best use case is users who are personally known to the patient, such as family members, friends or primary doctors; this user set is called the personal domain. In contrast, there is the public domain, which contains a large number of users who may come from various wide avenues, such as other fellow patients, medical researchers in a medical school, or staff in public health agencies. Their corresponding applications are patient matching, medical research and public health monitoring, respectively. The user set of one's PHR is potentially very large, usually unknown to the PHR owner, and their access and search requests are basically unpredictable. Under existing solutions, supporting these important kinds of applications will incur intrinsic non-scalability in key management: an owner will need to be always online dealing with many search requests, while a user must obtain search capabilities one by one from every owner. This work focuses on PHR applications in the public domain, where multiple owners can contribute encrypted PHR data while multiple users are able to search over all those records with high scalability.

Second, in many existing searchable encryption schemes, users are often given a private key that endows them with unlimited capability to generate any query of their choice, which is essentially 0-or-1 authorization. However, fine-grained search authorization is an indispensable component of a secure system. Although access to the actual documents can be controlled by separate cryptographically enforced access control techniques such as attribute-based encryption, 0-1 search authorization may still lead to leakage of patients' sensitive health information. For example, if Alice is the only one with a rare disease in the PHR database, then by designing the query in a clever way, Bob can be certain from the results that Alice has that disease. Thus, it is desirable that a user is only allowed to search for some specific sets of keywords; in particular, the authorization should be based on a user's attributes. For instance, in a patient matching application in healthcare social networks, a patient should only be matched to patients having similar symptoms to hers, and should not learn any information about those who do not. On the other hand, requiring every user to obtain restricted search capabilities from a central Trusted Authority (TA) does not achieve high scalability either. If the TA also assumes responsibility for authorization, it must always be online, dealing with a large workload and facing the threat of a single point of failure. In addition, since the global TA does not directly possess the information needed to check the attributes of users from different local domains, additional infrastructure must be employed. It is therefore desirable for users to be authorized locally. To realize such a framework, the authors make novel use of a recent cryptographic primitive, Hierarchical Predicate Encryption (HPE), which features delegation of search capabilities. The first solution enhances search efficiency, especially for subset and a class of simple range queries, while the second enhances query privacy with the help of proxy servers. Both schemes support multi-dimensional multi-keyword searches and allow delegation and revocation of search capabilities.
Finally, the scheme is implemented on a modern workstation and an extensive performance evaluation is carried out. The experimental results demonstrate that the scheme is suitable for a wide range of delay-tolerant PHR applications. To the best of the authors' knowledge, this work is the first to address authorized private search over encrypted PHRs in the public domain.

2.4 SILVERLINE: TOWARD DATA CONFIDENTIALITY IN STORAGE-INTENSIVE CLOUD APPLICATIONS
Jinjun Chen et al (2008) have proposed that third-party computing clouds, such as Amazon's EC2 and Microsoft's Azure, provide support for computation, data management in database instances, and Internet services. By allowing organizations to efficiently outsource computation and data management, they greatly simplify the deployment and management of Internet applications. Examples of success stories on EC2 include Nimbus Health, which manages and distributes patient medical records, and ShareThis, a social content-sharing network that has shared 430 million items across 30,000 websites. Unfortunately, these game-changing advantages come with a significant risk to data confidentiality. Using a multi-tenant model, clouds co-locate applications from multiple organizations on a single managed infrastructure. This means application data is vulnerable to operator errors and software bugs in the cloud. With unencrypted data exposed on disk, in memory, or on the network, it is not surprising that organizations cite data confidentiality as their biggest concern in adopting cloud computing. In fact, researchers recently showed that attackers could effectively target and observe information from specific cloud instances on third-party clouds. As a result, many recommend that cloud providers should never be given access to unencrypted data.

Organizations can achieve strong data confidentiality by encrypting data before it reaches the cloud, but naively encrypting data severely restricts how the data can be used: the cloud cannot perform computation on any data it cannot access in plaintext, which is a problem for applications that want more than pure storage. There are efforts to perform specific operations on encrypted data, such as searches, and a recent proposal of a fully homomorphic cryptosystem even supports arbitrary computations on encrypted data. However, these techniques are either too costly or support only very limited functionality. Thus, users who need real application support from today's clouds must choose between the benefits of clouds and strong confidentiality of their data. This work takes a first step toward improving data confidentiality in cloud applications and proposes a new approach to balance confidentiality and computation on the cloud. The key observation is this: in the applications that can benefit the most from the cloud model, the majority of computations handle data in an opaque way, i.e., without interpretation. Data that is never interpreted by the application is referred to as functionally encryptable, i.e., encrypting it does not limit the application's functionality. Leveraging the observation that certain data is never interpreted by the cloud, the key step is to split the entire application data into two subsets: functionally encryptable data, and data that must remain in plaintext to support computations on the cloud. A majority of the data in many applications is functionally encryptable. Such data would be encrypted by users before uploading it to the cloud and decrypted by users after receiving it from the cloud.
While this idea sounds conceptually simple, realizing it requires addressing three significant challenges: identifying functionally encryptable data in cloud applications, assigning encryption keys to data while minimizing key management complexity and the risks due to key compromise, and providing secure data access at the user device.

Identifying functionally encryptable data: the first challenge is to identify data that can be functionally encrypted without breaking application functionality. To this end, the authors present an automated technique that marks data objects using tags and tracks their usage and dependencies through dynamic program analysis. Functionally encryptable data is identified by discarding all data that is involved in any computation on the cloud. Naturally, the size of this subset depends on the type of service; for programs that compute values based on all data objects, the technique will not find any data suitable for encryption. In practice, however, the results show that for many applications, including social networks and message boards, a large fraction of the data can be encrypted.

Encryption key assignment: once the data to be encrypted has been identified, one must choose how many keys to use for encryption and the granularity of encryption. In the simplest case, all such data can be encrypted with a single key shared with all users of the service. Unfortunately, a malicious or compromised cloud could then obtain access to the encryption key, e.g., by posing as a legitimate user or by compromising or colluding with an existing user, and the confidentiality of the entire dataset would be lost. At the other extreme, each data object could be encrypted with a different key; this increases robustness to key compromise but drastically increases key management complexity. The goal is to automatically infer the right granularity of data encryption that provides the best tradeoff between robustness and management complexity. To this end, the data is partitioned into subsets, where each data subset is accessed by the same group of users. Each data subset is then encrypted with a different key, and keys are distributed to the groups of users that should have access. Thus, a malicious or buggy cloud that compromises a key can only access the data encrypted by that key, minimizing the negative impact. The authors introduce a dynamic access analysis technique that identifies the user groups who can access different objects in the data set, and describe a key management system that leverages this information to assign to each user all the keys she needs to properly access her data. Since key assignment is based on user access patterns, an assignment can be obtained that uses the minimal number of encryption keys necessary to cover all data subsets with distinct access groups, while minimizing the damage from key compromise. Key management is handled by the organization. The authors also develop mechanisms to manage keys when users or objects are dynamically added to or removed from the application or service.

Secure and transparent user data access: client devices, e.g., browsers, are given decryption keys by the organization to provide users with transparent data access. Of course, these devices must protect these keys from compromise. To ward off such attacks, a client-side component is proposed that allows users to access cloud services transparently while preventing key compromise.
As a result, the solution works without any browser modifications and can be easily deployed today.

Prototype and evaluation: the authors implemented these techniques as part of Silverline, a prototype set of software tools designed to simplify the process of securely transitioning applications into the cloud. The prototype takes as input an application and its data. First, it automatically identifies data that is functionally encryptable. Then, it partitions this data into subsets that are accessible to different sets of users. Each group is assigned a different key, and every user obtains a key for each group that she belongs to. This allows the application to run on the cloud while all data not used for computation is encrypted. In addition, the authors find that a large majority of the data can be encrypted in each of the tested applications.
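The access-group-based key assignment described above can be pictured with the minimal sketch below. The object names, user names, access map and uuid-based "keys" are illustrative placeholders, not Silverline's actual data structures or API; the real system derives access groups from dynamic access analysis and uses a proper key management service.

```python
# Minimal sketch of access-group-based key assignment: partition data objects
# by the exact set of users allowed to access them, then use one key per
# partition, so a compromised key only exposes that partition.
import uuid
from collections import defaultdict

# Hypothetical access map: object -> set of users who access it.
access = {
    "profile:alice": {"alice"},
    "msg:42":        {"alice", "bob"},
    "msg:43":        {"alice", "bob"},
    "board:public":  {"alice", "bob", "carol"},
}

# 1. Group objects that share an identical access set.
groups = defaultdict(list)
for obj, users in access.items():
    groups[frozenset(users)].append(obj)

# 2. One encryption key per distinct access group (placeholder key material).
group_keys = {users: uuid.uuid4().hex for users in groups}

# 3. Each user receives exactly the keys for the groups she belongs to.
user_keys = defaultdict(set)
for users, key in group_keys.items():
    for u in users:
        user_keys[u].add(key)

print(len(group_keys))          # 3 keys cover 4 objects
print(len(user_keys["carol"]))  # carol only holds 1 key
```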

2.5 SEDIC: PRIVACY-AWARE DATA INTENSIVE COMPUTING ON HYBRID CLOUDS
With the rapid growth of information within organizations, ranging from hundreds of gigabytes of satellite images to terabytes of commercial transaction data, the demand for processing such data is on the rise. Meeting such demands requires an enormous amount of low-cost computing resources, which can only be supplied by today's commercial cloud computing systems. This newfound capability, however, cannot be fully exploited without addressing the privacy risks it brings: on the one hand, organizational data contains sensitive information and therefore cannot be shared with the cloud provider without proper protection; on the other hand, today's commercial clouds do not offer high security assurance, a concern that has been significantly aggravated by the recent Amazon outages and the Sony PlayStation Network data breach, and they tend to avoid any liability. As a result, attempts to outsource computations involving sensitive data are often discouraged. A natural solution to this problem is cryptographic techniques for secure computation outsourcing, which have been studied for a decade.

Secure hybrid-cloud computing: oftentimes, a data-intensive computation involves both public and sensitive data. For example, a simple grep across an organizational file system encounters advertising slogans as well as lines of commercial secrets. Also, many data analysis tasks, such as intrusion detection and targeted advertising, need to make use of information from public sources, sanitized network traces and social-network data. If the computation on the public data can be separated from that on the sensitive data, the former can be comfortably delegated to public commercial clouds, and the latter, whose scale can be much smaller than the original task, becomes much easier to handle within the organization. Such a split of computation is an effective first step toward securely outsourcing computations and can be naturally incorporated into today's cloud infrastructure, in which a public cloud typically receives the computation overflow from an organization's internal system when it runs out of computing resources. This way of computing is called hybrid cloud computing. The hybrid cloud has already been adopted by most organizational cloud users and is still undergoing rapid development, with new techniques emerging to enable smoother inter-cloud coordination. It also presents a new opportunity that makes practical, secure outsourcing of computation tasks possible. However, today's cloud-based computing frameworks, such as MapReduce, are not ready for secure hybrid-cloud computing: they are designed to work on a single cloud and are not aware of the presence of data with different security levels, which forces cloud users to manually split and re-arrange each computation job across the public and private clouds. This lack of framework-level support also hampers the reuse of existing data-processing code and therefore significantly increases the cloud user's programming burden. Given that privacy concerns have already become the major hurdle to broader adoption of the cloud computing paradigm, there is an urgent need to develop practical techniques to facilitate secure data-intensive computing over hybrid clouds.
To answer this urgent call, a new, generic secure computing framework needs to be built to support the automatic splitting of a data-intensive computing job and its scheduling across the public and private clouds in such a way that data privacy is preserved and computational and communication overheads are minimized. Also desired here is accommodation of legacy data-processing code, which is expected to run directly within the framework without manual intervention by the user. The authors present a suite of new techniques that make this happen. Their system, called Sedic, includes a privacy-aware execution framework that automatically partitions a computing job according to the security levels of the data it involves and distributes the computation between the public and private clouds. Sedic is based on MapReduce, which includes a map step and a reduce step: the map step divides input data into lists of key-value pairs and assigns them to a group of concurrently running mappers; the reduce step receives the outputs of these mappers, which are intermediate key-value pairs, and runs a reducer to transform them into the final outputs. This way of computation is characterized by its simple structure, particularly the map operations, which are performed independently and concurrently on different data records. This feature is leveraged by the execution framework to automatically decompose a computation over a mixture of public and sensitive data, which is difficult in general. More specifically, Sedic transparently processes individual data blocks, sanitizes those carrying sensitive information along the boundary set by the smallest data unit a map operation works on, and replicates these sanitized copies to the public cloud. Over those data blocks, map tasks are assigned to work solely on the public or the sensitive data within the blocks. These tasks are carefully scheduled and executed to ensure the correctness of the computing outcomes with minimal impact on performance.

In this way, the workload of map operations is distributed to the public and private clouds according to their available computing resources and the portion of sensitive data in the original dataset. A significant technical challenge is that reduction usually cannot be done on private nodes and public nodes separately, and only private nodes are suitable for this task if privacy is to be preserved. This implies that the intermediate outputs of the computing nodes on the public cloud need to be sent back to the private cloud. To reduce this inter-cloud data transfer and move part of the reduce computation to the public cloud, the authors developed a new technique that automatically analyzes and transforms reducers to make them suitable for running on the hybrid cloud. The approach extracts a combiner from the original reducer to preprocess the intermediate key-value pairs produced by the public cloud, so as to compress the volume of data to be delivered to the private cloud. This is achieved, again, by leveraging a special feature of MapReduce: the reducer needs to perform a folding operation on a list, which can be automatically identified and extracted by a program analyzer embedded in Sedic. If the operation turns out to be associative or even commutative, as happens in the vast majority of cases, the combiner can be built upon it and deployed to the public cloud to process the map outcomes. The authors implemented Sedic on Hadoop and evaluated it over FutureGrid, a large-scale, cross-country cloud testbed.
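A toy, single-machine sketch of why an associative fold can be partially evaluated on the public side before the final reduce on the private side is shown below. The word-count example and function names are illustrative assumptions, not Sedic's actual API or its automatic reducer analysis.

```python
# Toy, single-machine sketch of the combiner idea described above: when the
# reducer is an associative fold (here, summing counts per word), the public
# cloud can pre-fold its own intermediate key-value pairs, so only compact
# partial sums cross back to the private cloud for the final reduce.
from collections import defaultdict

def map_phase(records):
    """Emit (word, 1) pairs, as a MapReduce mapper would."""
    for line in records:
        for word in line.split():
            yield word, 1

def fold(pairs):
    """Associative fold: sum the values for each key."""
    out = defaultdict(int)
    for key, value in pairs:
        out[key] += value
    return out

public_blocks = ["public data public data", "public records"]
private_blocks = ["secret data", "secret records"]

# Combiner runs on the public cloud over public blocks only.
public_partial = fold(map_phase(public_blocks))

# The private cloud folds its own pairs plus the compact public partials.
final = fold(list(map_phase(private_blocks)) + list(public_partial.items()))
print(dict(final))  # {'secret': 2, 'data': 3, 'records': 2, 'public': 3}
```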
The experimental results show that these techniques effectively protect confidential user data and minimize the workload of the private cloud at a small overall cost. Sedic is designed to protect data privacy during map-reduce operations when the data involved contains both public and private records. This protection is achieved by ensuring that the sensitive information within the input data, intermediate outputs and final results is never exposed to untrusted nodes during the computation. Another important concern in data-intensive computing is integrity, i.e., whether the public cloud honestly performs a computing task and delivers the right results back to the private cloud. The authors choose to address the confidentiality issue first, as it has already impeded the extensive use of the computing resources offered by the public cloud; by comparison, many cloud users today live with the risk that their computing jobs may not be done correctly on the public cloud.

2.6 ENABLING PRIVACY IN PROVENANCE AWARE WORKFLOW SYSTEMS
Daniel Warneke et al (2009) have proposed that a new paradigm for creating and correcting scientific analyses is emerging, that of provenance-aware workflow systems. In such systems, repositories of workflow specifications and of the provenance graphs that represent their executions will be made available as part of scientific information sharing. This will allow users to search and query both workflow specifications and their provenance graphs: scientists who wish to perform new analyses may search workflow repositories to find specifications of interest to reuse or modify, and they may also search provenance information to understand the meaning of a workflow or to debug a specification. On finding erroneous or suspect data, a user may then ask provenance queries to determine what downstream data might have been affected, or to understand how the process that led to creating the data failed. With the increased amount of available provenance information, there is a need to efficiently search and query scientific workflows and their executions. However, workflow authors or owners may wish to keep some information in the repository confidential. Although users with the appropriate access level may be allowed to see such confidential data, making it available to all users, even for scientific purposes, is an unacceptable breach of privacy. Beyond data privacy, a module itself may be proprietary, and hiding its description may not be enough: users without the appropriate access level should not be able to infer its behavior if they are allowed to see the inputs and outputs of the module. Finally, details of how certain modules in the workflow are connected may be proprietary, so showing how data is passed between modules may reveal too much of the structure of the workflow. Scientific workflows are gaining widespread use in life sciences applications, a domain in which privacy concerns are particularly acute. The authors illustrate three types of privacy using an example from this domain. Consider a personalized disease susceptibility workflow. Information such as an individual's genetic make-up and family history of disorders, which this workflow takes as input, is highly sensitive and should not be revealed to an unauthorized user, placing stringent requirements on data privacy. Further, a workflow module may compare an individual's genetic makeup to the profiles of other patients and controls.
The manner in which such historical data is aggregated and the comparison is made is highly sensitive, pointing to the need for module privacy. As recently noted, "you are better off designing in security and privacy from the start, rather than trying to add them later." The authors apply this principle by proposing that privacy guarantees should be integrated into the design of the search and query engines that access provenance-aware workflow repositories. Indeed, the alternative would be to create multiple repositories corresponding to different levels of access, which would lead to inconsistencies, inefficiency and a lack of flexibility that would undermine the desired techniques. This work focuses on privacy-preserving management of provenance-aware workflow systems. The authors consider the formalization of privacy concerns as well as query processing in this context; specifically, they address issues associated with keyword-based search and with querying such repositories for structural patterns. To give some background on provenance-aware workflow systems, they first describe the common model for workflow specifications and their executions, then enumerate privacy concerns, consider their effect on query processing, and discuss the challenges.

2.7 SCALABLE AND SECURE SHARING OF HEALTH RECORDS IN CLOUD USING ATTRIBUTE BASED ENCRYPTION
Christian Vecchiola et al (2009) have proposed this technique for the Personal Health Record (PHR), which has emerged as a patient-centric model of health information exchange. A PHR service allows a patient to create, manage and control her personal health data in one place through the web, which makes the storage, retrieval and sharing of medical information more efficient. In particular, each patient is promised full control of her medical records and can share her health data with a wide range of users, including healthcare providers, family members or friends. Due to the high cost of building and maintaining specialized data centers, many PHR services are outsourced to or provided by third-party service providers. Recently, architectures for storing PHRs in cloud computing have been proposed. While it is exciting to have convenient PHR services for everyone, there are many security and privacy risks that could impede wide adoption. The main concern is whether patients can actually control the sharing of their sensitive Personal Health Information (PHI), especially when it is stored on a third-party server that people may not fully trust. On the one hand, although there exist healthcare regulations such as HIPAA, which was recently amended to incorporate business associates, cloud providers are usually not covered entities. On the other hand, due to the high value of sensitive PHI, third-party storage servers are often the targets of various malicious behaviors which may lead to exposure of the PHI. In one famous incident, a Department of Veterans Affairs database containing sensitive PHI of 26.5 million military veterans, including their social security numbers and health problems, was stolen by an employee who took the data home without authorization. To ensure patient-centric privacy control over their own PHRs, it is essential to have fine-grained data access control mechanisms that work with semi-trusted servers. A feasible and promising approach is to encrypt the data before outsourcing.
Basically, the PHR owner herself should decide how to encrypt her files and which set of users to allow to obtain access to each file. A PHR file should only be available to the users who are given the corresponding decryption key, and should remain confidential to the rest of the users. Furthermore, the patient should always retain the right not only to grant but also to revoke access privileges when she feels it is necessary. However, the goal of patient-centric privacy is often in conflict with scalability in a PHR system. Authorized users may need to access the PHR either for personal use or for professional purposes; the authors refer to these two categories of users as personal and professional users, respectively. The latter group is potentially very large; if each owner herself were directly responsible for managing all the professional users, she would easily be overwhelmed by the key management overhead. In addition, since those users' access requests are generally unpredictable, it is difficult for an owner to determine a list of them. On the other hand, different from the single data owner scenario considered in most existing works, in a PHR system there are multiple owners who may encrypt in their own ways, possibly using different sets of cryptographic keys. An alternative is to employ a Central Authority (CA) to do the key management on behalf of all PHR owners, but this requires too much trust in a single authority. This work studies patient-centric, secure sharing of PHRs stored on semi-trusted servers, and focuses on addressing the complicated and challenging key management issues. In order to protect the personal health data stored on a semi-trusted server, Attribute Based Encryption (ABE) is adopted as the main encryption primitive.

2.8 ENABLING SECURE AND EFFICIENT RANKED KEYWORD SEARCH OVER OUTSOURCED CLOUD DATA
Wanchun Dou et al (2010) have proposed this ranked keyword search. Cloud computing is the long dreamed vision of computing as a utility, where cloud customers can remotely store their data in the cloud so as to enjoy on-demand high-quality applications and services from a shared pool of configurable computing resources. The benefits brought by this new computing model include, but are not limited to, relief of the burden of storage management, universal data access independent of geographical location, and avoidance of capital expenditure on hardware, software and personnel maintenance. As cloud computing becomes prevalent, more and more sensitive information is being centralized in the cloud, such as e-mails, personal health records, company finance data and government documents. The fact that data owners and the cloud server are no longer in the same trusted domain may put the outsourced unencrypted data at risk: the cloud server may leak data to unauthorized entities or even be hacked. It follows that sensitive data has to be encrypted prior to outsourcing for data privacy and to combat unsolicited accesses. Besides, in cloud computing, data owners may share their outsourced data with a large number of users, who might want to retrieve only certain specific data files they are interested in during a given session. One of the most popular ways to do so is keyword-based search. Such keyword search techniques allow users to selectively retrieve files of interest and have been widely applied in plaintext search scenarios.
Unfortunately, data encryption, which restricts users' ability to perform keyword search and further demands the protection of keyword privacy, makes the traditional plaintext search methods fail for encrypted cloud data. On the one hand, for each search request, users without pre-knowledge of the encrypted cloud data have to go through every retrieved file in order to find the ones best matching their interest, which demands a possibly large amount of post-processing overhead. On the other hand, invariably sending back all files based solely on the presence or absence of the keyword incurs large, unnecessary network traffic, which is absolutely undesirable in today's pay-as-you-use cloud paradigm. In short, the lack of effective mechanisms to ensure file retrieval accuracy is a significant drawback of existing searchable encryption schemes in the context of cloud computing. Nonetheless, the state of the art in the Information Retrieval (IR) community already utilizes various scoring mechanisms to quantify and rank-order the relevance of files in response to any given search query. Therefore, how to enable a searchable encryption system with support for secure ranked search is the problem tackled in this work, which is among the first to explore ranked search over encrypted data in cloud computing. To achieve design goals on both system security and usability, the authors propose to bring together the advances of both the crypto and IR communities to design a Ranked Searchable Symmetric Encryption (RSSE) scheme, in the spirit of as-strong-as-possible security guarantees. Specifically, they explore the statistical measure approach from IR and text mining to embed the weight information of each file during the establishment of the searchable index, before outsourcing the encrypted file collection. As directly outsourcing relevance scores would leak a great deal of sensitive frequency information and compromise keyword privacy, they properly modify the scores and develop a one-to-many order-preserving mapping technique to protect the sensitive weight information while still providing efficient ranked search functionality.

2.9 A SECURE ERASURE CODE-BASED CLOUD STORAGE SYSTEM WITH SECURE DATA FORWARDING
Jianfeng Zhan et al (2010) have proposed this code-based cloud storage system with secure data forwarding. As high-speed networks and ubiquitous Internet access have become available in recent years, many services are provided on the Internet such that users can use them from anywhere at any time, the email service probably being the most popular one. Cloud computing is a concept that treats the resources on the Internet as a unified entity, a cloud; users simply use services without being concerned about how computation is done and storage is managed. This work focuses on designing a cloud storage system for robustness, confidentiality and functionality. A cloud storage system is considered a large-scale distributed storage system that consists of many independent storage servers. Data robustness is a major requirement for storage systems, and there have been many proposals for storing data over storage servers. One way to provide data robustness is to replicate a message such that each storage server stores a copy of the message; this is very robust because the message can be retrieved as long as one storage server survives. Another way is to encode a message of k symbols into a codeword of n symbols by erasure coding. To store a message, each of its codeword symbols is stored on a different storage server.
A storage server failure then corresponds to an erasure error of the codeword symbol. As long as the number of failed servers is under the tolerance threshold of the erasure code, the message can be recovered from the codeword symbols stored on the available storage servers by the decoding process. This provides a tradeoff between the storage size and the tolerance threshold of failed servers. Thus, the encoding process for a message can be split into n parallel tasks of generating codeword symbols. A decentralized erasure code is suitable for use in a distributed storage system: after the message symbols are sent to the storage servers, each storage server independently computes a codeword symbol for the received message symbols and stores it. This finishes the encoding and storing process; the recovery process works in the same way. Storing data in a third party's cloud system causes serious concern about data confidentiality. In order to provide strong confidentiality for messages in storage servers, a user can encrypt messages by a cryptographic method before applying an erasure code to encode and store them. When he wants to use a message, he needs to retrieve the codeword symbols from the storage servers, decode them, and then decrypt them using his cryptographic keys. There are three problems with this straightforward integration of encryption and encoding. First, the user has to do most of the computation, and the communication traffic between the user and the storage servers is high. Second, the user has to manage his cryptographic keys; if the user's device storing the keys is lost or compromised, the security is broken. Finally, besides data storing and retrieving, it is hard for storage servers to directly support other functions; for example, storage servers cannot directly forward a user's messages to another user, so the owner of the messages has to retrieve, decode, decrypt and then forward them. This work addresses the problem of having storage servers forward data to another user directly, under the command of the data owner. The system model consists of distributed storage servers and key servers. Since storing cryptographic keys in a single device is risky, a user distributes his cryptographic key to key servers that perform cryptographic functions on behalf of the user; these key servers are highly protected by security mechanisms.

2.10 HARNESSING THE CLOUD FOR SECURELY OUTSOURCING LARGE-SCALE SYSTEMS OF LINEAR EQUATIONS
Kui Ren et al (2010) have proposed this concept of solving linear equations in the cloud. In cloud computing, customers with computationally weak devices are no longer limited by slow processing speed, memory and other hardware constraints, but can enjoy literally unlimited computing resources in the cloud through a convenient yet flexible pay-per-use manner. Despite the tremendous benefits, the fact that customers and the cloud are not necessarily in the same trusted domain brings many security concerns and challenges to this promising computation outsourcing model. First, customers' data that is processed and generated during the computation in the cloud is often sensitive in nature, such as business financial records, proprietary research data and personally identifiable health information. While applying ordinary encryption techniques to this sensitive information before outsourcing could be one way to combat the security concern, it also makes computation over the encrypted data a very difficult problem in general.
Second, since the operational details inside the cloud are not transparent enough to customers, no guarantee is provided on the quality of the computed results returned by the cloud. For example, for computations demanding a large amount of resources, there are huge financial incentives for the Cloud Server (CS) to be lazy if the customer cannot tell whether the answer is correct. Besides, possible software or hardware malfunctions and outsider attacks might also affect the quality of the computed results. Thus, the authors argue that the cloud is intrinsically not secure from the viewpoint of customers without a mechanism for secure computation outsourcing. Focusing on engineering and scientific computing problems, this paper investigates secure outsourcing for widely applicable large-scale systems of Linear Equations (LE), which are among the most popular algorithmic and computational tools in various engineering disciplines that analyze and optimize real-world systems. For example, by applying Newton's method, solving a system modeled by nonlinear equations is converted into solving a sequence of systems of linear equations. Also, by interior point methods, system optimization problems can be converted to a system of nonlinear equations, which is then solved as a sequence of systems of linear equations as mentioned above. Because the execution time of a computer program depends not only on the number of operations it must execute, but also on the location of the data in the memory hierarchy, solving such large-scale problems on customers' weak computing devices can be practically impossible due to the inevitably huge IO cost involved. Thus, resorting to the cloud for such computation-intensive tasks is arguably the only choice for customers with weak computing power, especially when the solution is demanded in a timely fashion. It is worth noting that several cryptographic protocols for solving various core problems in linear algebra, including systems of linear equations, have already been proposed by the Secure Multiparty Computation (SMC) community. However, these approaches are in general ill suited to the computation outsourcing model with large problem sizes. First, the works developed under the SMC model do not address the asymmetry between the computational power possessed by the cloud and the customer, i.e., they impose comparable computation burdens on each involved party, which the design in this paper specifically intends to avoid. Second, the SMC framework usually does not directly consider computation result verification as an indispensable security requirement, due to the assumption that each involved party is semi-honest. This assumption no longer holds in this model, where any unfaithful behavior by the cloud during the computation should be strictly forbidden. Last but not least, almost all of these solutions focus on traditional direct methods for jointly solving the LE, such as joint Gaussian elimination or secure matrix inversion. While working well for small problems, these approaches in general do not yield practically acceptable solution times for large-scale LE, due to the expensive cubic-time computational burden of matrix-matrix operations and the huge IO cost on customers' weak devices.
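For contrast with the cubic-time direct methods just mentioned, the sketch below shows a plain, unprotected Jacobi iteration in Java: each sweep costs one matrix-vector style pass, i.e. O(n^2) work. This is only an illustration of why iterative solvers are attractive for large systems; it is not the secure outsourcing protocol of the surveyed work, and the example matrix is a made-up diagonally dominant system chosen so that the iteration converges.

```java
// Minimal Jacobi iteration for Ax = b: successive approximations until the required
// accuracy is reached, with only O(n^2) work per sweep (no O(n^3) elimination).
public class JacobiSolver {

    public static double[] solve(double[][] a, double[] b, int maxIters, double tol) {
        int n = b.length;
        double[] x = new double[n];            // start from the zero vector
        for (int iter = 0; iter < maxIters; iter++) {
            double[] next = new double[n];
            double diff = 0.0;
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j = 0; j < n; j++) if (j != i) sum += a[i][j] * x[j];
                next[i] = (b[i] - sum) / a[i][i];
                diff = Math.max(diff, Math.abs(next[i] - x[i]));
            }
            x = next;
            if (diff < tol) break;             // required accuracy obtained
        }
        return x;
    }

    public static void main(String[] args) {
        // A small diagonally dominant system, for which Jacobi iteration converges.
        double[][] a = { { 4, 1, 1 }, { 1, 5, 2 }, { 1, 2, 6 } };
        double[] b = { 6, 8, 9 };
        double[] x = solve(a, b, 100, 1e-9);
        System.out.printf("x = [%.4f, %.4f, %.4f]%n", x[0], x[1], x[2]);
    }
}
```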
The analysis of existing approaches and the demands of computational practicality motivate the design of a secure mechanism for outsourcing LE via a completely different approach, the iterative method, where the solution is extracted by finding successive approximations until the required accuracy is obtained. Compared to direct methods, an iterative method only demands relatively simple matrix-vector operations with O(n^2) computational cost, which is much easier to implement in practice and is widely adopted for large-scale LE. For a linear system with an n x n coefficient matrix, the proposed mechanism is based on a one-time amortizable setup with O(n^2) cost. Then, in each execution of the iterative algorithm, the proposed mechanism only incurs an O(n) local computational burden on the customer and asymptotically eliminates the expensive IO cost, i.e., there are no unrealistic memory demands. To ensure computation result integrity, a very efficient cheating detection mechanism is also proposed to verify in one batch, with high probability, all the computation results returned by the cloud server in previous algorithm iterations. Both designs ensure computational savings for the customer.
2.11 AN UPPER BOUND CONTROL APPROACH FOR PRESERVING DATASET PRIVACY IN CLOUD
Xuyun Zhang et al (2012) have proposed this upper bound control approach for security in cloud computing. As more and more data-intensive applications are migrated into cloud environments, valuable intermediate datasets are commonly stored in order to avoid the high cost of recomputing them. However, this poses a risk to data privacy because malicious parties may deduce private information about the parent or original dataset by analyzing some of those stored intermediate datasets. The traditional way of addressing this issue is to encrypt all of the stored datasets so that they are hidden. The authors argue that this is neither efficient nor cost-effective, because it is not necessary to encrypt all of those datasets and encrypting such large amounts of data can be very costly. This paper proposes a new approach to identify which stored datasets need to be encrypted and which do not. Through analysis grounded in information theory, the approach derives an upper bound on a privacy measure: as long as the overall mixed information amount of some stored datasets is no more than that upper bound, those datasets do not need to be encrypted while privacy can still be protected. A tree model is leveraged to analyze privacy disclosure of datasets, and privacy requirements are decomposed and satisfied layer by layer. With a heuristic implementation of this approach, evaluation results demonstrate that the cost of encrypting intermediate datasets decreases significantly compared with the traditional approach while the privacy protection of the parent or original dataset is guaranteed. Technically, cloud computing can be regarded as a combination of a series of developed or developing ideas and technologies, establishing a novel business model. All the participants in cloud computing business chains can benefit from this model, as they can reduce their costs and concentrate on their own core business. Therefore, many companies and individuals have moved their business into cloud computing environments.
Hence, given the pay-as-you-go feature of cloud computing, computation resources are economically equivalent to storage resources. Cloud users can therefore selectively store some intermediate data and final results when processing raw data on a cloud, especially in data-intensive applications like medical diagnosis and bioinformatics. Since such data may be accessed by multiple users, storing these intermediate datasets curtails the overall cost by eliminating the frequently repeated computation needed to obtain them. However, intermediate data within an execution may contain sensitive information, such as a social security number, a medical record or financial information about an individual. These scenarios are quite common because data users often reanalyze results, perform new analyses on intermediate data, or share some intermediate results for collaboration. The storage of intermediate datasets enlarges the attack surface, so the privacy of the original data is at risk of being compromised. Stored intermediate datasets might be out of the control of the original data owner and can be accessed and shared by other applications, enabling an adversary to collect them and threaten the private information of the original dataset, further leading to considerable economic loss or severe damage to social reputation. This new paradigm allows compute resources to be allocated dynamically and only for the time they are required in the processing workflow. With the upper bound control approach for storing intermediate datasets, users can upload their data to the online cloud with more security.
CHAPTER 3
SYSTEM ANALYSIS
3.1 EXISTING SYSTEM
Technically, cloud computing is regarded as an ingenious combination of a series of technologies, establishing a novel business model by offering IT services and exploiting economies of scale. Participants in the business chain of cloud computing can benefit from this novel model: cloud customers can save huge capital investment in IT infrastructure and concentrate on their own core business. Therefore, many companies and organizations have been migrating or building their business in the cloud. However, numerous potential customers are still hesitant to take advantage of the cloud due to security and privacy concerns. The privacy concerns caused by retaining intermediate data sets in the cloud are important, but they have received little attention. Storage and computation services in the cloud are equivalent from an economic perspective because they are charged in proportion to their usage. Thus, cloud users can selectively store valuable intermediate data sets when processing original data sets in data-intensive applications such as medical diagnosis, in order to curtail the overall expense by avoiding frequent recomputation of these data sets. Such scenarios are quite common because data users often reanalyze results, conduct new analysis on intermediate data sets, or share some intermediate results with others for collaboration. Without loss of generality, the notion of intermediate data set herein refers to both intermediate and resultant data sets. However, the storage of intermediate data enlarges attack surfaces, so the privacy requirements of data holders are at risk of being violated.
Usually, intermediate data sets in the cloud are accessed and processed by multiple parties, but rarely controlled by the original data set holders. This enables an adversary to collect intermediate data sets together and extract privacy-sensitive information from them, bringing considerable economic loss or severe social reputation impairment to data owners. However, little attention has been paid to such a cloud-specific privacy issue.
3.1.1 Drawbacks of Existing System
The following drawbacks are identified in the existing system:
- Static privacy preserving model
- Privacy preserving data scheduling is not addressed
- Storage and computational aspects are not considered
- Load balancing is not considered
3.2 PROPOSED SYSTEM
Shared data values are maintained in third party cloud data centers, and data values are processed and stored in different cloud nodes. A privacy leakage upper bound constraint model is used to protect the intermediate data values. A dynamic privacy management and scheduling mechanism is integrated to improve data sharing with security, and multiple intermediate data set privacy models are integrated with the data scheduling mechanism. Privacy preservation is ensured with dynamic data size and access frequency values. Storage space and computational requirements are optimally utilized in the privacy preservation process, and data distribution complexity is handled in the scheduling process.
3.2.1 Advantages of Proposed System
- Privacy preserving cost is reduced
- Resource consumption is controlled
- Data delivery overhead is reduced
- Dynamic privacy preservation model
- Encryption cost is reduced
3.3 ACHIEVING MINIMUM STORAGE COST WITH PRIVACY PRESERVING INTERMEDIATE DATASET IN THE CLOUD
This system uses an approach to identify which intermediate data sets need to be encrypted and which do not, so that privacy-preserving cost can be saved while the privacy requirements of data holders can still be satisfied. For preserving the privacy of datasets, it is promising to anonymize all datasets first and then encrypt them before storing or sharing them in the cloud. However, the volume of intermediate datasets is usually huge, so encrypting all intermediate datasets leads to high overhead and low efficiency when they are frequently accessed or processed. A tree structure is modeled from the generation relationships of intermediate datasets to analyze their privacy propagation, and a practical heuristic algorithm is designed accordingly to identify the datasets that need to be encrypted. A Directed Acyclic Graph (DAG) is exploited to capture the topological structure of the generation relationships among these datasets. The possibility of ensuring privacy leakage requirements without encrypting all intermediate datasets, when encryption is incorporated with anonymization, is formally demonstrated. By this model the privacy of dataset holders is preserved and the sensitive intermediate datasets are protected from intruders.
3.4 PROJECT DESCRIPTION
Cloud computing services provide common business applications online that are accessed from a web browser, while the software and data are stored on the servers. The massive computation power and storage capacity of cloud computing systems allow scientists to deploy computation and data intensive applications without infrastructure investment.
Since the usage of cloud computing has been spreading widely, a large number of transactions are carried out in the cloud, and the intermediate data generated during these transactions may contain sensitive information.
3.4.1 Problem Definition
Existing technical approaches for preserving the privacy of data sets stored in the cloud mainly include encryption and anonymization. On the one hand, encrypting all data sets, a straightforward and effective approach, is widely adopted in current research. However, processing encrypted data sets efficiently is quite a challenging task, because most existing applications only run on unencrypted data sets. Although recent progress has been made in homomorphic encryption, which theoretically allows computation to be performed on encrypted data sets, applying current algorithms is rather expensive due to their inefficiency. On the other hand, partial information about data sets, e.g., aggregate information, is required to be exposed to data users in most cloud applications such as data mining and analytics. In such cases, data sets are anonymized rather than encrypted to ensure both data utility and privacy preservation. Current privacy-preserving techniques like generalization can withstand most privacy attacks on a single data set, while preserving privacy for multiple data sets is still a challenging problem. Thus, for preserving the privacy of multiple data sets, it is promising to anonymize all data sets first and then encrypt them before storing or sharing them in the cloud. Usually, the volume of intermediate data sets is huge; hence, it is argued that encrypting all intermediate data sets will lead to high overhead and low efficiency when they are frequently accessed or processed. As such, encrypting only part of the intermediate data sets rather than all of them is proposed in order to reduce the privacy-preserving cost. This system proposes a novel approach to identify which intermediate data sets need to be encrypted while others do not, in order to satisfy the privacy requirements given by data holders. A tree structure is modeled from the generation relationships of intermediate data sets to analyze the privacy propagation of data sets. As quantifying the joint privacy leakage of multiple data sets efficiently is challenging, an upper bound constraint is exploited to confine privacy disclosure. Based on such a constraint, the problem of saving privacy-preserving cost is modeled as a constrained optimization problem. Experimental results on real-world and extensive data sets demonstrate that the privacy-preserving cost of intermediate data sets can be significantly reduced with this approach compared with existing ones where all data sets are encrypted. The major contributions of the research are threefold. First, the possibility of ensuring privacy leakage requirements without encrypting all intermediate data sets, when encryption is incorporated with anonymization, is formally demonstrated. Second, a practical heuristic algorithm is designed to identify which data sets need to be encrypted for preserving privacy while the rest do not. Third, experimental results demonstrate that the approach can significantly reduce privacy-preserving cost over existing approaches, which is quite beneficial for cloud users who utilize cloud services in a pay-as-you-go fashion.
3.4.2 Overview of the Project
Providing security for intermediate data is carried out by encryption, which is very costly.
Encrypting all intermediate data sets are neither efficient nor cost-effective because it is very time consuming and costly for data-intensive applications to encrypt/decrypt data sets frequently while performing any operation on them.Encrypting all intermediate data sets will lead to high overhead and low efficiency when they are frequently accessed or processed to identify which intermediate datasets need to be encrypted while others do not, in order to satisfy privacy requirements given by data holders. To preserving the privacy of the intermediate dataset, a novel upper bound privacy leakage constraint based approach is used to identify which intermediate data sets need to be encrypted and which do not, which minimize the process of encryption. The purpose of the scheme is to implement the cost effective system for preserving privacy for intermediate data. The user who does not register with the online service provider could not able to upload their information in cloud.
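As a rough illustration of the selection idea described above, the sketch below implements a simplified greedy rule: leave an intermediate dataset unencrypted only while the accumulated leakage of all unencrypted datasets stays below the data holder's threshold epsilon, and encrypt everything else. The dataset names, leakage values and cost figures are hypothetical, and the real approach works on a tree or DAG of generation relationships with a joint leakage upper bound rather than this simple additive budget.

```java
// Simplified greedy sketch: keep the datasets with the highest encryption-cost savings
// per unit of leakage in plaintext while the leakage budget (epsilon) allows it.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class EncryptionSelectionSketch {

    static class IntermediateDataset {
        final String name;
        final double leakage;        // privacy leakage if left unencrypted
        final double encryptionCost; // cost of encrypting/decrypting it on every access
        IntermediateDataset(String name, double leakage, double encryptionCost) {
            this.name = name; this.leakage = leakage; this.encryptionCost = encryptionCost;
        }
    }

    public static List<IntermediateDataset> selectForEncryption(
            List<IntermediateDataset> datasets, double epsilon) {
        List<IntermediateDataset> candidates = new ArrayList<IntermediateDataset>(datasets);
        Collections.sort(candidates, new Comparator<IntermediateDataset>() {
            public int compare(IntermediateDataset a, IntermediateDataset b) {
                // higher cost-per-leakage first: best candidates to leave unencrypted
                return Double.compare(b.encryptionCost / b.leakage, a.encryptionCost / a.leakage);
            }
        });
        double leakageUsed = 0.0;
        List<IntermediateDataset> toEncrypt = new ArrayList<IntermediateDataset>();
        for (IntermediateDataset d : candidates) {
            if (leakageUsed + d.leakage <= epsilon) {
                leakageUsed += d.leakage;   // safe to store in plaintext
            } else {
                toEncrypt.add(d);           // would breach the upper bound, so encrypt it
            }
        }
        return toEncrypt;
    }

    public static void main(String[] args) {
        List<IntermediateDataset> datasets = Arrays.asList(
                new IntermediateDataset("aggregate-counts", 0.10, 8.0),
                new IntermediateDataset("joined-records", 0.45, 5.0),
                new IntermediateDataset("per-patient-slice", 0.60, 3.0));
        for (IntermediateDataset d : selectForEncryption(datasets, 0.5)) {
            System.out.println("encrypt: " + d.name);
        }
    }
}
```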

3.4.3 System Architecture
In this online service, users upload information regarding their business. Users who wish to upload data must already be registered members under the online service provider's norms; only registered users are able to upload their data to the online service provider. The architecture formally demonstrates the possibility of ensuring privacy leakage requirements without encrypting all intermediate data sets when encryption is incorporated with anonymization to preserve privacy.

[Architecture diagram components: Cloud User, Register, Login, Data Upload, Graph, Cost Effective Encryption, Security, Cloud Application, Cloud]
Figure 3.1 System Architecture
3.5 MODULE DESCRIPTION
The cloud data sharing system provides security for original and intermediate data values. Data sensitivity is considered in the intermediate data security process, and resource requirement levels are monitored and controlled by the security operations. The system is divided into five major modules: data center, data provider, intermediate data privacy, security analysis and data scheduling. The data center maintains the encrypted data values for the providers. The shared data uploading process is managed by the data provider module. The intermediate data privacy module is designed to protect intermediate results. The security analysis module is designed to estimate the resource and access levels. Original data and intermediate data distribution is planned under the data scheduling module.
Data Center
Database transactions are shared in the data centers. The data center maintains the shared data values in encrypted form. A homomorphic encryption scheme is used for the encryption process, and key values are also provided by the data center.
Data Provider
The data provider uploads the database tables to the data center, and the database schema is also shared by the provider. The encryption process is performed in the data provider environment, and access control tasks are managed by the providers. Homomorphic encryption can be used to encrypt data in the cloud so that the data can be processed in encrypted form in the cloud itself at processing time.
Intermediate Data Privacy
Intermediate data values are generated by processing the original data values and are stored in the data center or provider environment. The encryption process is carried out on the intermediate data values, and sensitivity information is used for the intermediate data security process.
Security Analysis
A joint privacy leakage model is used for the security process. Storage requirements are analyzed in the intermediate data analysis, and computational resource requirements are also analyzed in the security analysis. Intermediate data encryption decisions are made with reference to the storage and computational resource requirements.
Data Scheduling
Data scheduling is used to plan the data distribution process. Computational tasks are combined in the scheduling process, scheduling is applied to select a suitable provider for the data delivery process, and request levels are considered in the data scheduling process.
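To illustrate the homomorphic property mentioned in the data provider module, the following is a minimal, toy Paillier implementation in plain Java (the class name, key size and example values are assumptions for demonstration, not the project's actual encryption component). Multiplying two Paillier ciphertexts yields an encryption of the sum of the plaintexts, which is the kind of operation that allows limited processing over encrypted data in the cloud.

```java
// Toy additively homomorphic (Paillier) sketch using BigInteger; for real systems,
// a vetted cryptographic library should be used instead of hand-rolled code.
import java.math.BigInteger;
import java.security.SecureRandom;

public class PaillierDemo {
    private final BigInteger n, nSquared, g, lambda, mu;
    private final SecureRandom rnd = new SecureRandom();

    public PaillierDemo(int bits) {
        BigInteger p = BigInteger.probablePrime(bits / 2, rnd);
        BigInteger q = BigInteger.probablePrime(bits / 2, rnd);
        n = p.multiply(q);
        nSquared = n.multiply(n);
        g = n.add(BigInteger.ONE);                                   // common choice g = n + 1
        lambda = lcm(p.subtract(BigInteger.ONE), q.subtract(BigInteger.ONE));
        mu = lFunction(g.modPow(lambda, nSquared)).modInverse(n);
    }

    private static BigInteger lcm(BigInteger a, BigInteger b) {
        return a.divide(a.gcd(b)).multiply(b);
    }

    private BigInteger lFunction(BigInteger u) {                     // L(u) = (u - 1) / n
        return u.subtract(BigInteger.ONE).divide(n);
    }

    public BigInteger encrypt(BigInteger m) {
        // random r in [1, n); coprime to n with overwhelming probability for large keys
        BigInteger r = new BigInteger(n.bitLength() - 1, rnd).add(BigInteger.ONE);
        return g.modPow(m, nSquared).multiply(r.modPow(n, nSquared)).mod(nSquared);
    }

    public BigInteger decrypt(BigInteger c) {
        return lFunction(c.modPow(lambda, nSquared)).multiply(mu).mod(n);
    }

    // Homomorphic addition: multiplying ciphertexts adds the underlying plaintexts mod n.
    public BigInteger addCiphertexts(BigInteger c1, BigInteger c2) {
        return c1.multiply(c2).mod(nSquared);
    }

    public static void main(String[] args) {
        PaillierDemo keys = new PaillierDemo(512);                   // toy key size
        BigInteger c1 = keys.encrypt(BigInteger.valueOf(20));
        BigInteger c2 = keys.encrypt(BigInteger.valueOf(22));
        System.out.println(keys.decrypt(keys.addCiphertexts(c1, c2))); // prints 42
    }
}
```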

3.6 SYSTEM SPECIFICATION
3.6.1 Hardware Requirements
Processor : Intel Dual Core 2.5 GHz
RAM : 1 GB
Hard Disk : 80 GB
Floppy Disk Drive : Sony 1.44 MB
DVD-ROM : LG 52X MAX
Keyboard : TVS Gold 104 Keys
Mouse : Tech-Com SSD Optical Mouse
Ethernet Card : Realtek 1110 10 Mbps
3.6.2 Software Requirements
Platform : Windows XP
Language : Java
Backend : Oracle
Simulation Tool : CloudSim

3.7 SOFTWARE DESCRIPTION
Windows XP
Windows XP offers many new features, in addition to improvements to many features from earlier versions of Windows. Windows XP Professional makes sharing a computer easier than ever by storing personalized settings and preferences for each user.
Windows XP Features
Individual features in Windows XP include:
- Remote Desktop and Remote Assistance
- Power management
- Windows application compatibility
- System tools: device driver rollback, last known good configuration, and system restore
- Multi-language toolkit
- Personal firewall
- Automatic unzip feature: there is no need for expander tools such as WinZip or Aladdin Expander; zipped files are automatically unzipped by Windows and placed in folders
Managing a myriad of network and Internet connections can be confusing, and Windows XP provides tools for managing network and Internet connections for local and remote users. Windows XP is also loaded with new tools and programs that help ensure the privacy and security of data and help the computer operate at peak performance.

Java
Java is a general purpose, object oriented programming language developed by Sun Microsystems of USA in 1991. The most striking feature of the language is that it is a platform neutral language. Java can be called a revolutionary technology because it brought a fundamental shift in how programs are developed and used; the Internet helped catapult Java to the forefront of programming. It can be used to develop both application and applet programs. Java is mainly adopted for two reasons:
- Security
- Portability
These two features are available in Java because of byte code. Byte code is a highly optimized set of instructions to be executed by the Java run time system, called the Java Virtual Machine (JVM). A Java program runs under the control of the JVM; the JVM can contain the program and prevent it from generating side effects outside the system. Thus safety is built into the Java language. Some of the features of Java which are adopted for this system are:
- Multithreading
- Socket programming
- Swing
Multithreading
Users perceive that their world is full of multiple events all happening at once and want their computers to do the same. Unfortunately, writing programs that deal with many things at once can be much more difficult than writing conventional single threaded programs in C or C++. Thread safety in multithreading means that a given library function can safely be used by concurrent threads of execution.
Socket programming
A socket is one end-point of a two-way communication link between two programs running on the network. Socket classes are used to represent the connection between a client program and a server program. The java.net package provides two classes:
- Socket
- ServerSocket
These two classes implement the client and server side of the connection respectively. The beauty of Java sockets is that no knowledge whatsoever of the details of TCP is required. TCP stands for Transmission Control Protocol and is a standard protocol for data transmission with confirmation of data reception. Sockets are highly useful in at least three communication contexts:
- Client/server models
- Peer-to-peer scenarios, such as chat applications
- Making Remote Procedure Calls (RPC), by having the receiving application interpret a message as a function call
Swing
Swing refers to the library of GUI controls (buttons, sliders, checkboxes, etc.) that replaces the somewhat weak and inflexible AWT controls. Swing is a rapid GUI development tool that is part of the standard Java development kit. It was primarily developed due to the shortcomings of the Abstract Window Toolkit. Swing is a set of classes that provides more powerful and flexible components than AWT. Swing components are not implemented by platform specific code; instead they are written in Java and therefore are platform independent. The term lightweight is used to describe such elements. In addition, all Swing components support assistive technologies.
Remote Method Invocation (RMI)
This is a brief introduction to Java Remote Method Invocation (RMI). Java RMI is a mechanism that allows one to invoke a method on an object that exists in another address space. The other address space could be on the same machine or a different one. The RMI mechanism is basically an object oriented RPC mechanism. CORBA is another object-oriented RPC mechanism, and it differs from Java RMI in a number of ways: CORBA is a language-independent standard, and it includes many other mechanisms in its standard, none of which are part of Java RMI.
There is also no notion of an "object request broker" in Java RMI. Java RMI has recently been evolving toward becoming more compatible with CORBA. In particular, there is now a form of RMI called RMI/IIOP ("RMI over IIOP") that uses the Internet Inter-ORB Protocol (IIOP) of CORBA as the underlying protocol for RMI communication. This section attempts to show the essence of RMI without discussing extraneous features. Sun's own material includes a lot that is not relevant to RMI itself; for example, it discusses how to incorporate RMI into an applet, how to use packages, and how to place compiled classes in a different directory than the source code. All of these are interesting in themselves, but they have nothing to do with RMI, and as a result Sun's guide is unnecessarily confusing. Moreover, Sun's guide and examples omit a number of details that are important for RMI. The client is the process that is invoking a method on a remote object. The server is the process that owns the remote object; the remote object is an ordinary object in the address space of the server process. The Object Registry is a name server that relates objects with names. Objects are registered with the Object Registry; once an object has been registered, one can use the Object Registry to obtain access to a remote object using the name of the object.
Relational Database Management System (RDBMS)
Over the past several years, relational database management systems have become the most widely accepted way to manage data. Relational systems offer benefits such as:
- Easy access to data
- Flexibility in data modeling
- Reduced data storage and redundancy
- Independence of physical storage and logical data design
- A high-level data manipulation language (SQL)
The phenomenal growth of relational technology has led to more demand for RDBMSs, from personal computers to large, highly secure machines. The Oracle Corporation was the first company to commercially offer a true RDBMS that is portable, compatible and connectable, along with a set of powerful user-level add-on tools to cater to ad hoc requests.
Oracle RDBMS
ORACLE demands greater expertise on the part of the application developer, but an application developed on ORACLE will be able to keep pace with growth and change. Oracle provides security and control. Disaster recovery can be extremely problematic, and ORACLE has several features that ensure the integrity of the database. If an interruption occurs in processing, a ROLLBACK can reset the database to the previous transaction point before the disaster. If a restore is necessary, ORACLE has a roll forward capability for recreating the database up to its most recent SAVEPOINT. ORACLE provides users with several functions for security; GRANT and REVOKE commands limit access to information down to the column and row levels, and there are many other ways to control access to a database. One part of the kernel is the query optimizer, which examines alternate access paths to the data to find the optimal path for resolving a given query; in order to process a query, the optimizer considers which indexes will be most helpful. At the heart of the ORACLE RDBMS is the Structured Query Language (SQL). SQL was developed and defined by IBM Research and has been accredited by ANSI as the standard query language for RDBMSs. It is an English-like language used for most database activities.
SQL is simple enough to allow novice users to access data easily and quickly, yet it is powerful enough to offer programmers all the capability and flexibility they require. SQL statements are designed to work with relational database data. The SQL data language presumes that database data is found in tables. Each table is defined by a table name and a set of columns, and each column has a column name and a datatype. Columns can contain null values if they have not been declared NOT NULL, and columns are sometimes called fields or attributes.
CloudSim
Cloud computing has emerged as the leading technology for delivering reliable, secure, fault-tolerant, sustainable, and scalable computational services, which are presented as Software, Infrastructure, or Platform as a Service. Moreover, these services may be offered in private data centers, may be commercially offered to clients, or both public and private clouds may be combined.
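For orientation on the simulation tool listed above, a minimal CloudSim 3.x style scenario is sketched below: one datacenter with a single host, one virtual machine and one cloudlet. The class names follow the CloudSim toolkit, but every numeric parameter (MIPS, RAM, bandwidth, cloudlet length, cost figures) is an illustrative assumption and not the configuration used in this project.

```java
// Minimal CloudSim scenario: one datacenter, one VM, one cloudlet (illustrative values only).
import java.util.ArrayList;
import java.util.Calendar;
import java.util.LinkedList;
import java.util.List;

import org.cloudbus.cloudsim.Cloudlet;
import org.cloudbus.cloudsim.CloudletSchedulerTimeShared;
import org.cloudbus.cloudsim.Datacenter;
import org.cloudbus.cloudsim.DatacenterBroker;
import org.cloudbus.cloudsim.DatacenterCharacteristics;
import org.cloudbus.cloudsim.Host;
import org.cloudbus.cloudsim.Log;
import org.cloudbus.cloudsim.Pe;
import org.cloudbus.cloudsim.Storage;
import org.cloudbus.cloudsim.UtilizationModelFull;
import org.cloudbus.cloudsim.Vm;
import org.cloudbus.cloudsim.VmAllocationPolicySimple;
import org.cloudbus.cloudsim.VmSchedulerTimeShared;
import org.cloudbus.cloudsim.core.CloudSim;
import org.cloudbus.cloudsim.provisioners.BwProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.PeProvisionerSimple;
import org.cloudbus.cloudsim.provisioners.RamProvisionerSimple;

public class SimpleCloudSimScenario {
    public static void main(String[] args) throws Exception {
        CloudSim.init(1, Calendar.getInstance(), false);   // one cloud user, no trace events

        Datacenter datacenter = createDatacenter("Datacenter_0");
        DatacenterBroker broker = new DatacenterBroker("Broker_0");

        // One VM: 1000 MIPS, 512 MB RAM, 1000 kbps bandwidth, 10 GB image size.
        Vm vm = new Vm(0, broker.getId(), 1000, 1, 512, 1000, 10000, "Xen",
                new CloudletSchedulerTimeShared());
        List<Vm> vmList = new ArrayList<Vm>();
        vmList.add(vm);
        broker.submitVmList(vmList);

        // One cloudlet of 400000 MI with small input/output file sizes.
        Cloudlet cloudlet = new Cloudlet(0, 400000, 1, 300, 300,
                new UtilizationModelFull(), new UtilizationModelFull(), new UtilizationModelFull());
        cloudlet.setUserId(broker.getId());
        cloudlet.setVmId(vm.getId());
        List<Cloudlet> cloudletList = new ArrayList<Cloudlet>();
        cloudletList.add(cloudlet);
        broker.submitCloudletList(cloudletList);

        CloudSim.startSimulation();
        CloudSim.stopSimulation();

        for (Cloudlet cl : broker.getCloudletReceivedList()) {
            Log.printLine("Cloudlet " + cl.getCloudletId() + " finished at " + cl.getFinishTime());
        }
    }

    // A single-host datacenter with one 1000-MIPS core, 2 GB RAM and 1 TB storage.
    private static Datacenter createDatacenter(String name) throws Exception {
        List<Pe> peList = new ArrayList<Pe>();
        peList.add(new Pe(0, new PeProvisionerSimple(1000)));

        List<Host> hostList = new ArrayList<Host>();
        hostList.add(new Host(0, new RamProvisionerSimple(2048), new BwProvisionerSimple(10000),
                1000000, peList, new VmSchedulerTimeShared(peList)));

        DatacenterCharacteristics characteristics = new DatacenterCharacteristics(
                "x86", "Linux", "Xen", hostList, 10.0, 3.0, 0.05, 0.001, 0.0);

        return new Datacenter(name, characteristics,
                new VmAllocationPolicySimple(hostList), new LinkedList<Storage>(), 0);
    }
}
```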