
Masaryk University

Faculty of Informatics

Master Thesis

Database management as a cloud-based service

for small and medium organizations

Student: Dime Dimovski

Brno, 2013


Statement

I declare that I have worked on this thesis independently, using only the sources listed in the bibliography. All resources, sources, and literature that I used or drew upon are quoted properly in the thesis, with full reference to the source.

Dime Dimovski


Resume

The goal of this thesis is to explore cloud computing, mainly focusing on database management systems as a cloud service. It reviews some of the currently available SQL- and NOSQL-based database management systems offered as a cloud service, the advantages and disadvantages of cloud computing in general, and the common considerations.

Keywords

Cloud computing, SaaS, PaaS, Database management, SQL, NOSQL, DBaaS, Database.com, SQL Azure, Amazon Web Services, SimpleDB, DynamoDB, Google SQL, MongoDB, CouchDB, Google Datastore.


Contents

1. Introduction .................................................................................................................................... 8

2. Introduction to Cloud Computing ................................................................................................... 9

2.1 Cloud computing – definition ........................................................................................................ 9

2.2 Cloud Types .................................................................................................................................. 10

2.2.1 NIST model ............................................................................................................................... 10

2.3 Cloud computing architecture ..................................................................................................... 12

2.3.1 Infrastructure ....................................................................................................................... 13

2.3.2 Platform ............................................................................................................................... 14

2.3.3 Application Platform as a Service (APaaS) or Virtual appliances ........................................ 15

2.3.4 Application ........................................................................................................................... 16

3. Scalability ...................................................................................................................................... 17

4. Elasticity ........................................................................................................................................ 18

5. Database Management Systems in the cloud (Database as a service) ......................................... 19

6. Database.com ............................................................................................................................... 21

6.1 Database.com Architecture ......................................................................................................... 21

6.2 Multitenant data model ............................................................................................................... 22

6.3 Multitenant indexes ..................................................................................................................... 23

6.4 Multitenant relationships ............................................................................................................ 23

6.5 Multitenant field history .............................................................................................................. 23

6.6 Partitioning of metadata, data, and index data ........................................................................... 23

6.7 Application development ............................................................................................................ 24

6.8 Data Access .................................................................................................................................. 24

6.9 Query languages .......................................................................................................................... 25

6.10 Multitenant search processing .................................................................................................... 25


6.11 Multitenant isolation and protection .......................................................................................... 26

6.12 Deletes, undeletes ....................................................................................................................... 27

6.13 Backup.......................................................................................................................................... 27

6.14 Pricing .......................................................................................................................................... 27

7. Microsoft’s SQL AZURE ................................................................................................................. 28

7.1 Subscriptions ................................................................................................................................ 28

7.2 Databases ..................................................................................................................................... 28

7.3 Security and Access to a SQL Azure Database ............................................................................. 29

7.4 SQL Azure architecture ................................................................................................................ 29

7.5 Logical Databases on a SQL Azure Server .................................................................................... 29

7.6 Network Topology ........................................................................................................................ 31

7.7 High Availability with SQL Azure .................................................................................................. 33

7.8 Failure Detection .......................................................................................................................... 33

7.9 Reconfiguration............................................................................................................................ 33

7.10 Availability Guarantees ................................................................................................................ 34

7.11 Scalability with SQL Azure ............................................................................................................ 34

7.12 Throttling ..................................................................................................................................... 34

7.13 Load Balancer ............................................................................................................................... 35

7.14 SQL Azure Management .............................................................................................................. 35

7.15 Pricing in SQL Azure ..................................................................................................................... 35

8. Amazon Web Services ..................................................................................................................... 37

8.1 Amazon Relational Database Service (Amazon RDS) ................................................................... 37

8.2 Amazon RDS Architecture/Features ............................................................................................ 37

8.3 Scalability with Amazon RDS ........................................................................................................ 38

8.4 High Availability ........................................................................................................................... 39

8.5 Pricing .......................................................................................................................................... 39

9. Google Cloud SQL ......................................................................................................................... 40


9.1 Pricing .......................................................................................................................................... 41

10. Summary of RDBMSaaS and common considerations ................................................................. 42

11. NOSQL ........................................................................................................................................... 45

12. Amazon SimpleDB and DynamoDB............................................................................................... 45

12.1 Dynamo History ........................................................................................................................... 45

12.2 Amazon DynamoDB DataModel .................................................................................................. 46

12.3 Amazon DynamoDB Features ...................................................................................................... 48

12.4 Amazon SimpleDB ........................................................................................................................ 49

12.5 Pricing .......................................................................................................................................... 51

13. Google Datastore .......................................................................................................................... 52

13.1 Datastore Datamodel ................................................................................................................... 52

13.2 Queries and indexes ..................................................................................................................... 52

13.3 Transactions ................................................................................................................................. 53

13.4 Scalability ..................................................................................................................................... 53

13.5 High Availability ............................................................................................................................ 53

13.6 Data Access .................................................................................................................................. 54

13.7 Quotas and Limits ........................................................................................................................ 54

14. MongoLab/MongoDB and Cloudant/Apache CouchDB ............................................................... 55

14.1 Document oriented database ...................................................................................................... 55

14.2 MongoDB and CouchDB comparison ........................................................................................... 56

14.3 MVCC – Multi-Version Concurrency Control ................................................................................ 56

14.4 Scalability ..................................................................................................................................... 57

14.5 Querying ....................................................................................................................................... 57

14.6 Atomicity and Durability .............................................................................................................. 58

14.7 Map Reduce ................................................................................................................................. 58

14.8 Javascript ...................................................................................................................................... 58

14.9 REST ............................................................................................................................................. 58


14.10 MongoLab and Cloudant .......................................................................................................... 58

15. What benefits do cloud databases and cloud computing bring for small and medium organizations? ..... 62

15.1 Advantages for Small Business ..................................................................................................... 62

15.2 Disadvantages of Cloud Computing ............................................................................................. 63

15.3 Main things to be considered when moving to the cloud ............................................................ 64

16. Will cloud computing reduce the budget? ................................................................................... 67

17. Conclusion ..................................................................................................................................... 69

Appendix ................................................................................................................................................... 70

Case studies from the industry – Amazon RDS ........................................................................................ 70

Case studies from the industry – Microsoft SQL Azure ........................................................................... 70

Case studies from the industry – Amazon DynamoDB ............................................................................ 70

Case studies from the industry – Amazon SimpleDB ............................................................................... 71

References ................................................................................................................................................ 72


1. Introduction

The boom of cloud computing over the past few years has made it the common ground for many innovations and new technologies. It has become common for enterprises and individuals to use services offered in the cloud and to recognize that cloud computing is a big deal, even if they are not entirely clear why that is so. The phrase “in the cloud” has even entered colloquial language. A large share of the world’s developers is currently working on “cloud-related” products. The cloud has thus become the amorphous entity that is supposed to represent the future of modern computing.

In an attempt to gain a competitive edge, businesses are looking for new and innovative ways to cut costs while maximizing value. They recognize the need to grow, but at the same time they are under pressure to save money. The cloud gives businesses this opportunity: it allows them to focus on their core business by offering hardware and software solutions that they do not have to develop on their own.

In this thesis I will give an overview of what cloud computing is. I will describe its main concepts and architecture, take a look at the XaaS paradigm (something/everything as a service), and survey the options currently available in the cloud, focusing mostly on databases in the cloud, or Database as a Service. I will take a closer look at how cloud computing in general, and database as a service in particular, can be used by small and medium enterprises, what the main benefits are, and whether it will really help businesses to reduce their budget and focus on their core business.


2. Introduction to Cloud Computing

In reality the cloud is something that we have been using for a long time: it is the Internet, with all the standards and protocols that provide Web services to us. The Internet is usually drawn as a cloud, and this represents one of the essential characteristics of cloud computing: abstraction. Cloud computing refers to applications and services that run on a distributed network using virtualized resources and are accessed by common Internet protocols and networking standards. It is distinguished by the notion that resources are virtual and limitless, and that the details of the physical systems on which software runs are abstracted from the user.[1]

One of the main drivers of cloud computing is the recent advancement in wireless speed and connectivity. Without these in place, cloud computing would not be practical or even possible. In many ways, cloud computing was, and is, an eventuality. The influence of telecommunications organizations and their push toward simplifying and miniaturizing virtually every electronic device used by mobile users is accelerating cloud computing even further. This represents a major breakthrough not only in computing but also in communication.

Cloud computing represents a real paradigm shift in the way in which systems are deployed. The massive scale of cloud computing systems was enabled by the popularization of the Internet and the growth of some large service companies.[1]

Cloud computing has been compared to the standard utility companies, and it does bear a striking resemblance to these institutions. Just like water, electricity, or gas, cloud computing makes the long-held dream of utility computing possible with a pay-as-you-go, infinitely scalable, universally available system. In other words, the ‘goods’ come from one central location; we are just turning things off and on. This may ultimately give more people access to a larger pool of resources at a greatly reduced cost. One of the biggest benefits of cloud computing is its ability to offer users access to off-site hardware and software. With cloud computing, the resources of the cloud itself are at your disposal: all the hardware, software, processors, and networks combine to give individuals much more computing power than has ever been possible. This will change nearly every facet of information exchange and influence everything from social networking to web development. By keeping things light and simple, individual access devices will last longer and become more durable. And of course, losing or breaking a device is no longer a particular concern, as devices are easily replaced and there is no danger of losing your files or information either.

With cloud computing, you can start very small and become big very fast. That's why cloud computing is revolutionary, even if the technology it is built on is evolutionary.

2.1 Cloud computing – definition

The use of the word “cloud” makes reference to the two essential concepts:

Abstraction

Virtualization


Abstraction

Cloud computing abstracts the details of the system implementation from users and developers. Applications run on physical systems that aren't specified, data is stored in locations that are unknown, administration of systems is outsourced to others, and access by users is ubiquitous.[1]

Virtualization

Cloud computing virtualizes systems by pooling and sharing resources. Systems and storage can be provisioned as needed from a centralized infrastructure, costs are assessed on a metered basis, multi-tenancy is enabled, and resources are scalable with agility.

Cloud computing is an abstraction based on the notion of pooling physical resources and presenting them as a virtual resource. It is a new model for provisioning resources, for staging applications, and for platform-independent user access to services. Clouds can come in many different types, and the services and applications that run on clouds may or may not be delivered by a cloud service provider.

2.2 Cloud Types

Usually cloud computing is separated into two distinct sets of models:

Deployment models – refer to the location and management of the cloud’s infrastructure.

Service models – the particular types of services that can be accessed on a cloud computing platform.

2.2.1 NIST model

The NIST model is a set of working definitions published by the U.S. National Institute of Standards and Technology. This cloud model is composed of five essential characteristics, three service models, and four deployment models.[2]

Essential Characteristics:

On-demand self-service - A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service provider.

Broad network access - Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, tablets, laptops, and workstations).

Resource pooling - The provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter). Examples of resources include storage, processing, memory, and network bandwidth.


Rapid elasticity - Capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be appropriated in any quantity at any time.

Measured service - Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g. storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.

Service Models:

Software as a Service (SaaS) - The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. The applications are accessible from various client devices through either a thin client interface, such as a web browser (e.g., web-based email), or a program interface. The consumer does not manage or control the underlying cloud infrastructure, including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.

Platform as a Service (PaaS) - The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages, libraries, services, and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly configuration settings for the application-hosting environment.

Infrastructure as a Service (IaaS) - The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, and deployed applications; and possibly limited control of select networking components (e.g., host firewalls).

Deployment Models:

Private cloud - The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises.

Community cloud - The cloud infrastructure is provisioned for exclusive use by a specific community of consumers from organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be owned, managed, and operated by one or more of the organizations in the community, a third party, or some combination of them, and it may exist on or off premises.

Public cloud - The cloud infrastructure is provisioned for open use by the general public, usually as an open system available via the Web. It may be owned, managed, and operated by a business, academic, or government organization, or some combination of them. It exists on the premises of the cloud provider. Examples of public clouds: Google App Engine, Amazon Elastic Compute Cloud, Microsoft Azure.

Hybrid cloud - The cloud infrastructure is a composition of two or more distinct cloud infrastructures (private, community, or public) that remain unique entities, but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds). [2]

2.3 Cloud computing architecture

Cloud computing is essentially a series of levels that function together in various ways to create a system. This system is also referred to as cloud computing architecture. The cloud creates a system where resources can be pooled and partitioned as needed. Cloud architecture can couple software running on virtualized hardware in multiple locations to provide an on-demand service to user-facing hardware and software. A cloud can be created within an organization's own infrastructure or outsourced to another datacenter. Usually resources in a cloud are virtualized resources because virtualized resources are easier to modify and optimize. A compute cloud requires virtualized storage to support the staging and storage of data. From a user's perspective, it is important that the resources appear to be infinitely scalable, that the service be measurable, and that the pricing be metered.[1]

Figure 1 Cloud computing stack

Applications in the cloud are usually composable systems; that is, they use standard components to assemble services tailored for a specific purpose. A composable component must be:

• Modular: It is a self-contained and independent unit that is cooperative, reusable, and replaceable.


• Stateless: A transaction is executed without regard to other transactions or requests.

In general, cloud computing does not require hardware and software to be composable, but it is a highly desirable characteristic. It makes system design easier to implement, and solutions more portable and interoperable.

Some of the benefits of composable systems are:

Easier to assemble systems

Cheaper system development

More reliable operation

A larger pool of qualified developers

A logical design methodology

There is a trend toward designing composable systems in cloud computing, reflected in the widespread adoption of what has come to be called the Service Oriented Architecture (SOA). The essence of service-oriented design is that services are constructed from a set of modules using standard communications and service interfaces. One widely used set of standards describes the services themselves in the Web Services Description Language (WSDL), exchanges data between services using some form of XML, and handles communication between services with the SOAP protocol. There are, of course, alternative sets of standards.[1]

What isn't specified is the nature of the module itself; it can be written in any programming language the developer wants. From the standpoint of the system, the module is a black box, and only the interface is well specified. This independence of the internal workings of the module or component means it can be swapped out for a different module, relocated, or replaced at will, provided that the interface specification remains unchanged. That is a powerful benefit to any system or application provider as their products evolve.
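
As an illustration of this black-box principle, here is a minimal Python sketch; all names are invented for this example, and it is a sketch of the idea rather than of any particular SOA toolkit. Any module that honors the agreed interface can be swapped in without the caller noticing.

    from typing import Protocol

    class StorageService(Protocol):
        """The agreed-upon service interface; callers depend only on this."""
        def put(self, key: str, value: bytes) -> None: ...
        def get(self, key: str) -> bytes: ...

    class InMemoryStorage:
        """One concrete module; an in-process dict stands in for real storage."""
        def __init__(self):
            self._data = {}
        def put(self, key: str, value: bytes) -> None:
            self._data[key] = value
        def get(self, key: str) -> bytes:
            return self._data[key]

    def archive_report(report: bytes, storage: StorageService) -> None:
        """Caller code: works with any module that honors the interface."""
        storage.put("reports/latest", report)

    archive_report(b"quarterly figures", InMemoryStorage())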

Essentially there are 3 tiers in a basic cloud computing architecture:

Infrastructure

Platform

Application

If we further break down the standard cloud computing architecture, there are really two areas to deal with: the front end and the back end.

Front End - The front end includes all client (user) devices and hardware in addition to their computer network and the application that they actually use to make a connection with the cloud.

Back End - The back end is populated with the various servers, data storage devices and hardware that facilitate the functionality of a cloud computing network.

2.3.1 Infrastructure

The infrastructure of cloud computing architecture is essentially all the hardware, data storage devices (including virtualized hardware), networking equipment, applications, and software that operate and drive the cloud.

Most Infrastructure as a Service (IaaS) providers use virtual machines to deliver servers that run applications. Virtual machine images, or instances, are containers that have specific resources assigned to them (number of CPU cycles, memory, network bandwidth, etc.).
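
As a concrete illustration, the following Python sketch provisions such an instance on Amazon EC2 using the boto3 SDK (the current AWS SDK for Python, shown here purely as an example; the image ID is a hypothetical placeholder). The instance type is what fixes the resources assigned to the virtual machine.

    import boto3  # AWS SDK for Python

    ec2 = boto3.client("ec2", region_name="eu-central-1")

    # Request one virtual machine; the instance type bundles the assigned
    # resources (vCPUs, memory, network bandwidth).
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # hypothetical machine image ID
        InstanceType="t2.micro",          # 1 vCPU, 1 GiB of memory
        MinCount=1,
        MaxCount=1,
    )
    print(response["Instances"][0]["InstanceId"])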

Figure 2 shows the cloud computing stack that is defined as the server. The Virtual Machine Monitor, also called a hypervisor, is the low-level software that allows different operating systems to run in their own memory space and manages I/O for the virtual machines.[1]

Figure 2 "Server" stack

2.3.2 Platform

A cloud computing platform is the actual programming, code, and implemented systems of interfacing that help user-level devices (and applications) connect with the hardware and software resources of the cloud. It is a software layer that is used to create higher-level services.

A cloud computing platform is generally divided between the front end and back end of a network. Its job is to provide a communication and access portal for clients, so that they can effectively utilize the resources of the cloud network. The platform may only be a set of directions, but it is in actuality the most integral part of a cloud computing network; without it, cloud computing would not be possible.

There are many different Platform as a Service (PaaS) providers; we will mention some of them:

Salesforce.com’s Force.com and Database.com Platforms

Windows Azure Platform

Google Apps and Google AppEngine

Amazon Web Services

All platform services offer the hosted hardware and software needed to build and deploy Web applications or services that are custom built by developers.

It makes sense for operating system vendors to move their development environments into the cloud with the same technologies that have been successfully used to create Web applications. Thus, you might find a platform based on an Oracle xVM hypervisor virtual machine that includes a NetBeans Integrated Development Environment (IDE) and that supports the Oracle GlassFish Web stack programmable using Perl or Ruby. For Windows, Microsoft would be similarly interested in providing a platform that allowed Windows developers to run on a Hyper-V VM, use the ASP.NET application framework, support one of its enterprise applications such as SQL Server, and be programmable within Visual Studio, which is essentially what the Azure Platform does. This approach allows someone to develop a program in the cloud that can be used by others.

Platforms often come with tools and utilities to aid in application design and deployment. Depending on the vendor, these can include tools for team collaboration, testing tools, versioning tools, database and Web service integration, and storage tools. Platform providers usually begin by creating a developer community to support the work done in the environment.

A platform is exposed to users through an API; likewise, an application built in the cloud using a platform service would encapsulate the service through its own API. An API can control data flow, communications, and other important aspects of the cloud application. To date there is no standard API, and each cloud vendor has its own.

2.3.3 Application Platform as a Service (APaaS) or Virtual appliances

A virtual appliance is software that installs as middleware onto a virtual machine. These are usually Web servers, database servers, BPM engines, ESBs, messaging portals, and the like, running on a virtual machine image. This model, referred to by some as Application Platform as a Service, is more or less a horizontal extension of the PaaS offerings.

APaaS is a type of service model that gives cloud software developers the power to actually do their jobs. It provides the opportunity to use APaaS/virtual appliances to build more complex services. Within the APaaS system, the actual software architectures of applications are built and established. It is also within this layer that overall portability (and the ability of an application to function alongside a bevy of other cloud applications as well as operating systems) is established. Since most of the actual developmental breakthroughs (both in terms of software and overall cloud usability) occur within the realms of the middleware (PaaS, APaaS), it makes sense that a great deal of attention is paid to it. [3]

For example, Amazon Web Services offers more than 700 different virtual machine images preconfigured with enterprise applications like Oracle BPM and SQL Server, and even complete application stacks such as LAMP (Linux, Apache, MySQL, and PHP), which are used to create virtual machines within the Amazon Elastic Compute Cloud (EC2). Such an image serves as the basic unit of deployment for services delivered using EC2.

APaaS gives software developers a solid platform to stand on, with its own impressive workbench of tools, while they are constructing and envisioning new possibilities.

The true benefit of APaaS, however, is its ability to provide accurate feedback regarding the functionality and compatibility of applications that are still under development. This is extremely important to software developers, who can take serious losses (in terms of both money and time) if they produce an application that simply won't function in an environment, won't behave as expected once deployed, or won't function in a compatible manner with other elements in a cloud infrastructure. Companies that want to run their IT and/or software development projects through an APaaS need only pay subscription fees, not licensing fees. Subscription is substantially cheaper than licensing and offers its benefits when paired with cloud APaaS. Most APaaS packages that are put together for designers are much easier to use than most standardized design tools. These packages often allow software development teams to integrate and share their work more smoothly, as well as run a project from start to finish much faster than with other systems.[3]

The global emergence of APaaS will no doubt lead to the creation of a number of companies that will utilize the tools of APaaS to create their own business model, especially one that seeks to provide yet another proprietary service aimed at delivering timely solutions to business software issues. One particular area that could use the help is enterprise software, for example. Enterprise software is often hard to manage, difficult to customize and frequently falls short in its functionalities. When you couple these shortcomings with the fact that it is often quite expensive, there is a serious problem. An obvious solution for dealing with enterprise software problems would be the deployment of an APaaS-style service. Not only would this greatly increase the overall functionality of expensive enterprise business software, but it would also allow for a great range of customization, as well as the option for integrating it with other cloud services and/or networking opportunities. APaaS was created to make the lives of software designers, developers and investors much easier. It is through the use of APaaS that many excellent next generation apps have been developed and many experts in the field of cloud computing agree that it is APaaS that will produce some of the upcoming “game changing” applications that will actually shape the future of cloud computing in general.

2.3.4 Application

This area is comprised of the client hardware and the interface used to connect to the cloud. Problems arise from the design of Internet protocols, which treat each request to a server as an independent transaction (stateless service) [1]. The standard HTTP commands are all atomic in nature. While stateless servers are easier to architect and stateless transactions are more resilient and can survive outages, much of the useful work that computer systems need to accomplish is stateful. The use of transaction servers, message queuing servers, and other similar middleware is meant to bridge this problem. Standard methods, part of Service Oriented Architecture, that help solve this issue and that are used in cloud computing are:

Orchestration – process flow can be choreographed as a service

Use of a service bus that controls cloud components

There are many ways in which clients can connect to a cloud service. The most common are:

Web browser

Proprietary application

These applications can run on a number of different devices: PCs, servers, smartphones, and tablets. They all need a secure way to communicate with the cloud. Some of the basic methods to secure the connection are listed below; a short sketch of a TLS-secured client call follows the list:

Secure protocols such as SSL/TLS (HTTPS), FTPS, IPsec, or SSH

Virtual connection using a virtual private network (VPN)

Remote data transfer such as Microsoft RDP or Citrix ICA, which use a tunneling mechanism

Data encryption
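
A minimal sketch of the first method, assuming the third-party Python requests library and a hypothetical service endpoint: the client connects over HTTPS, and certificate verification (on by default) authenticates the server before any data is exchanged.

    import requests

    # TLS (HTTPS) provides the encrypted channel; verify=True (the default)
    # checks the server certificate against trusted certificate authorities.
    resp = requests.get(
        "https://cloud.example.com/api/v1/status",  # hypothetical endpoint
        verify=True,
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json())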


3. Scalability

Scalability is the ability of a system to handle a growing amount of work in a capable manner, or its ability to improve when additional resources are added.

The scalability requirement arises due to the constant load fluctuations that are common in the context of Web-based services. These load fluctuations occur at varying frequencies: daily, weekly, and over longer periods. The other source of load variation is unpredictable growth (or decline) in usage. Scalable design ensures that the system capacity can be augmented by adding hardware resources whenever warranted by load fluctuations. Thus, scalability has emerged both as a critical requirement and as a fundamental challenge in the context of cloud computing.[1][4]

Typically there are two ways to increase scalability:

Vertical scalability – adding hardware resources to a node, usually additional CPUs, memory, etc. Vertical scaling (scaling up) allows virtualization technologies to be used more effectively by providing more resources for the hosted operating systems and applications to share.

Horizontal scalability – adding more nodes to a system, such as adding a new node to a distributed software application or adding more access points within the current system. Hundreds of small computers may be configured in a cluster to obtain aggregate computing power (a toy sketch of spreading requests across such nodes follows below). The horizontal scalability (scale-out) model also creates an increased demand for shared data storage with very high I/O performance, especially where processing of large amounts of data is required. In general, the scale-out paradigm has served as the fundamental design paradigm for the large-scale data centers of today.
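
The toy sketch mentioned above, in Python, with invented addresses: a round-robin dispatcher spreads requests over the pool of nodes, and scaling out is simply a matter of enlarging the pool. Real load balancers are far more sophisticated, but the principle is the same.

    # Round-robin over whatever nodes are currently in the pool.
    nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

    def route(request_id: int) -> str:
        """Request i is served by node i mod len(nodes)."""
        return nodes[request_id % len(nodes)]

    # Scaling out horizontally is just adding another node to the pool.
    nodes.append("10.0.0.4")
    print([route(i) for i in range(8)])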

Integrating multiple load balancers into the system is probably the best solution for dealing with scalability issues. There are many different forms of load balancers to choose from: server farms, software, and even hardware designed to handle and distribute increased traffic. Items that interfere with scalability include[3]:

Too much software clutter (no organization) within the hardware stack(s).

Overuse of third-party scaling.

Reliance on the use of synchronous calls.

Not enough caching

Database not being used properly.

Creating a cloud network that offers the maximum level of scalability is entirely possible if we apply a more “diagonal” solution. By incorporating the best solutions present in both vertical and horizontal scaling, it is possible to reap the benefits of both models[3]. Once the servers reach the point of diminishing returns (no growth), we should simply start cloning them. This allows a consistent architecture to be kept when adding new components, software, applications, and users. For most organizations, problems arise from a lack of resources, not from the inherent architecture of their cloud itself. A more diagonal approach should help a business deal with the current and growing demands it is facing.


4. Elasticity

Of all the attributes possessed by cloud computing, the most important is certainly its elasticity: its ability to amplify and instantly upgrade resources and/or capacities at a moment’s notice. Storage, processing, and the scalability of applications are all elastic in the cloud. The really remarkable thing about cloud computing is the real-time infrastructure that actively responds to user requests for resources. Without the real-time monitoring and support behind this elasticity, the effectiveness, adaptability, and muscle of cloud computing would be greatly undermined. It is this elastic ability that allows service providers to offer their users access to cloud computing services at such reduced costs. Since users only pay for what they use, they can save money (a small worked example appears at the end of this section). For example, with a traditional grid computing network, every user has their own intensive hardware setup, of which most users rarely use more than 50% of the capacity. Their combined resource usage might be 20-30% of the total resources available on a central cloud computing hardware stack. What cloud computing really offers is the ability for average users to retain their current standards and expectations, while leaving the door open for instant expansion if they desire it. It also provides a much more efficient way to use energy. Elasticity offers the same computing experience to which we are accustomed, with the added benefit of near limitless resources, while at the same time offering a way to manage energy consumption. [1][3] The elastic capabilities offered by cloud computing make it perfectly suited to handling certain activities or processes.

Establishing an “in office” communication and online networking infrastructure (for employees). Setting up a system that gives those in the organization a cleaner and more efficient way of communicating and working often leads to greatly increased profits.

Using cloud computing to handle overdrafting - high volume data transfer periods and events. Some businesses only use cloud computing when they run out of their own resources, or perhaps anticipate that they might lack needed functionalities. This can be something that is scheduled for an annual or bi-annual basis; designed to meet a seasonal demand for a particular product for example.

Assigning all customer data and transaction information to a cloud computing element. This allows an organization to keep its customers’ data safe, even from its own employees. Utilizing a third party to handle all customer data can also pay off in the event of a catastrophic event. Cloud computing providers tend to keep information more securely backed up than most are even aware of. [3]

In other words, elasticity allows both user and provider to “do more with less”.
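
The worked example promised above, as a small Python calculation with purely hypothetical numbers: a workload that needs ten servers for four busy hours a day and two servers otherwise is markedly cheaper under per-hour billing than when permanently provisioned for the peak.

    hours_per_month = 730
    price_per_server_hour = 0.10    # assumed price, in USD
    peak_servers, baseline_servers = 10, 2
    peak_hours = 4 * 30             # four busy hours a day

    # Provisioning for the peak: pay for 10 servers around the clock.
    fixed = peak_servers * hours_per_month * price_per_server_hour

    # Elastic: pay for 10 servers only while busy, and 2 the rest of the time.
    elastic = (peak_servers * peak_hours
               + baseline_servers * (hours_per_month - peak_hours)) * price_per_server_hour

    print(fixed, elastic)           # 730.0 vs 242.0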


5. Database Management Systems in the cloud (Database as a service)

Data and database management are an integral part of a wide variety of applications. Relational DBMSs in particular have been used massively, owing to the many features that they offer:

Overall functionality, offering an intuitive and relatively simple model for modeling different types of applications.

Consistency, dealing with concurrent workloads without worrying about the data getting out of sync

Performance, low latency and high throughput combined with many years of engineering and development

Reliability, persistence of data in the presence of different types of failures and ensuring safety.

The main concern is that DBMSs, and RDBMSs in particular, are not cloud-friendly, because they are not as scalable as web servers and application servers, which can scale from a few machines to hundreds. Traditional DBMSs are not designed to run on top of a shared-nothing architecture (where a set of independent machines accomplish a task with minimal resource overlap), and they do not provide the tools needed to scale out from a few to a large number of machines. Technology leaders such as Google, Amazon, and Microsoft have demonstrated that data centers comprising thousands to hundreds of thousands of compute nodes provide unprecedented economies of scale, since multiple applications can share a common infrastructure. All three companies provide frameworks, such as Amazon’s AWS, Google’s AppEngine, and Microsoft Azure, for hosting third-party applications in their clouds (data-center infrastructures). Because the RDBMSs or “transactional data management” databases that back banking, airline reservation, online e-commerce, and supply chain management applications typically rely on the ACID (Atomicity, Consistency, Isolation, Durability) guarantees that databases provide, and it is hard to maintain ACID guarantees in the face of data replication over large geographic distances1, these companies have developed proprietary data management technologies referred to as key-value stores, informally called NO-SQL database management systems.[6] The need for web-based applications to support a virtually unlimited number of users and to respond to sudden load fluctuations raises the requirement of making them scalable in cloud computing platforms. Such scalability has to be provisioned dynamically, without causing any interruption in the service. Key-value stores and other NOSQL database solutions, such as the Google Datastore offered with Google AppEngine, Amazon SimpleDB and DynamoDB, MongoDB, and others, have been designed so that they can be elastic, or can be dynamically provisioned, in the presence of load fluctuations. We will explain some of these systems in more detail later on.

1 The CAP theorem, also known as Brewer's theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

Consistency (all nodes see the same data at the same time)

Availability (a guarantee that every request receives a response about whether it was successful or failed)

Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

According to the theorem, a distributed system can satisfy any two of these guarantees at the same time, but not all three.


As we move to the cloud computing arena, which typically comprises data centers with thousands of servers, the manual approach to database administration is no longer feasible. Instead, there is a growing need to make the underlying data management layer autonomic, or self-managing, especially when it comes to load redistribution, scalability, and elasticity. [7]

Figure 3 Traditional vs. Cloud Data Services

This issue becomes especially acute in the context of pay-per-use cloud computing platforms hosting multi-tenant applications. In this model, the service provider is interested in minimizing its operational cost by consolidating multiple tenants on as few machines as possible during periods of low activity and distributing these tenants over a larger number of servers during peak usage [7]. Due to the above desirable properties of key-value stores in the context of cloud computing and large-scale data centers, they are being widely used as the data management tier for cloud-enabled Web applications. Although it is claimed that atomicity at a single key is adequate in the context of many Web-oriented applications, evidence is emerging that in many application scenarios this is not enough. In such cases, the responsibility to ensure atomicity and consistency of multiple data entities falls on the application developers. This results in the duplication of multi-entity synchronization mechanisms many times over in the application software. In addition, since it is widely recognized that concurrent programs are highly vulnerable to subtle bugs and errors, this approach adversely impacts application reliability. The realization that atomicity beyond single entities is needed is widely discussed in developer blogs. Recently, this problem has also been recognized by senior architects from Amazon and Google, leading to systems like MegaStore [10] that provide transactional guarantees on key-value stores. Both RDBMS and NOSQL DBMS offerings in the cloud will be explained in more detail: how they work, who offers them, and how they are provisioned. I will first focus on the relational databases offered in the cloud, starting with one of the first enterprise databases built for the cloud, Salesforce’s Database.com.
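
To make the multi-entity problem concrete, here is a deliberately simplified Python sketch (the store and keys are invented): each single-key write is atomic on its own, but nothing ties the two writes together, so the burden of synchronization falls on the application.

    # Toy key-value store: writes are atomic per key, and per key only.
    store = {"balance:alice": 100, "balance:bob": 0}

    def transfer(src: str, dst: str, amount: int) -> None:
        store[src] = store[src] - amount
        # A crash or a concurrent reader right here observes money that has
        # simply vanished: two entities were updated, but no transaction
        # spans them. The application must supply that synchronization.
        store[dst] = store[dst] + amount

    transfer("balance:alice", "balance:bob", 25)
    print(store)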


6. Database.com

Database.com is a database management system built for cloud computing, with multitenancy inherent in its design. Traditional RDBMSs were designed to support on-premises deployments for one organization. All core mechanisms, such as the system catalog, caching mechanisms, and query optimizer, are built to support single-tenant applications and to run directly on a specifically tuned host operating system and hardware. The only possible way to build a multi-tenant cloud database service with a standard RDBMS is to use virtualization. Unfortunately, the extra overhead of the hypervisor typically hurts the performance of the RDBMS. Database.com combines several different persistence technologies, including a custom-designed relational database schema, which are innately designed for clouds and multitenancy, with no virtualization required.

6.1 Database.com Architecture

Database.com’s core relational database technology uses a runtime engine that materializes all application data from metadata, i.e., data about the data itself. In Database.com’s metadata-driven architecture, there is a clear separation between the compiled runtime database engine (kernel), tenant data, and the metadata that describes each application’s schema. These distinct boundaries make it possible to independently update the system kernel and tenant-specific application schemas.

Figure 4 Database.com Architecture [9]
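
The following Python sketch, with invented names and a drastically reduced scope, illustrates the metadata-driven idea: object definitions are ordinary metadata entries, and virtual objects are generated from them at runtime, so a schema change is a metadata update rather than a blocking DDL operation.

    # Object definitions live as plain metadata; "tables" are materialized
    # at runtime rather than compiled into the database schema.
    metadata = {
        ("org1", "Invoice"): ["Number", "Amount", "DueDate"],
        ("org2", "Ticket"): ["Subject", "Priority"],
    }

    def materialize(org: str, obj: str) -> dict:
        """Build a virtual object description from metadata at runtime."""
        return {"object": obj, "fields": metadata[(org, obj)]}

    # A schema change is a simple metadata update:
    metadata[("org1", "Invoice")].append("Currency")
    print(materialize("org1", "Invoice"))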

Every logical database object is internally managed using metadata. Objects (“tables” in traditional relational database parlance), fields, stored procedures, and database triggers are all abstract constructs that exist merely as metadata in Database.com’s Universal Data Dictionary (UDD). The terminology used by Database.com is shown in Table 1.

Relational Database Term        Equivalent Term in Database.com
Database                        Organization
Table                           Object
Column                          Field
Row                             Record

Table 1 Database.com Terminology

When a new application object is defined or some procedural code is written, Database.com does not create an actual table in a database or compile any code; it simply stores metadata that the system’s engine can use to generate the virtual application components at runtime. When something about the application schema needs to be modified or customized, such as an existing field in an object, all that is required is a simple non-blocking update to the corresponding metadata [9].

To avoid performance-sapping disk I/O and code recompilations, and to improve application response times, Database.com uses massive and sophisticated metadata caches to keep the most recently used metadata in memory. The system runtime engine must be optimized for metadata access, because frequent metadata access would otherwise prevent the service from scaling.

At the heart of Database.com is its transaction database engine. Database.com uses a relational database engine with a specialized schema built for multitenancy. It also employs a search engine (separate from the transaction engine) that optimizes full-text indexing and searches. As applications update data, the search service’s background processes asynchronously update tenant- and user-specific indexes in near real time. This separation of duties between the transaction engine and the search service lets applications process transactions without the overhead of text index updates [9].

6.2 Multitenant data model

Database.com’s storage model manages virtual database structures using a set of metadata, data, and pivot tables, as illustrated in Figure 5.

Figure 5 Multitenant data model of Database.com [9]


When application schemas are created, the UDD keeps track of metadata concerning the objects, their fields, their relationships, and other object attributes. A few large database tables store the structured and unstructured data for all virtual tables. A set of related multitenant indexes, implemented as simple pivot tables with denormalized data, makes the combined data set extremely functional. Because Database.com manages object and field definitions as metadata rather than actual database structures, the system can tolerate online multitenant application schema maintenance activities without blocking the concurrent activity of other tenants and users [9].
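
A much-simplified Python sketch of this storage model; only the MT_Data and MT_Indexes names come from the text, and the layout shown is illustrative. All tenants share one generic data structure, and a denormalized pivot structure acts as the index.

    # One wide, shared table holds every tenant's rows.
    MT_Data = [
        # (OrgID,  ObjID,    GUID, Value0,    Value1)
        ("org1", "Invoice", "g1", "INV-001", "250.00"),
        ("org2", "Ticket",  "g2", "Printer", "high"),
    ]

    # Denormalized pivot table: (OrgID, ObjID, FieldNum, value) -> row GUID.
    MT_Indexes = {("org1", "Invoice", 0, "INV-001"): "g1"}

    def lookup(org, obj, field_num, value):
        """Index lookup, logically confined to a single tenant's data."""
        return MT_Indexes.get((org, obj, field_num, value))

    print(lookup("org1", "Invoice", 0, "INV-001"))  # -> g1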

6.3 Multitenant indexes

Database.com automatically indexes various types of fields to deliver scalable performance. Traditional database systems rely on native database indexes to quickly locate specific rows in a database table that have fields matching a specific condition. The index of MT_Data is maintained by synchronously copying field data marked for indexing to an appropriate column in a pivot table called MT_Indexes.

In some circumstances the external search engine can fail to respond to a search request. In such cases, Database.com falls back to a secondary search mechanism. A fallback search is implemented as a direct database query with search conditions that reference the Name field of target records. To optimize global object searches (searches that span tables) without having to execute potentially expensive union queries, a pivot table called MT_Fallback_Indexes, which records the Name of all records, is maintained. Updates to MT_Fallback_Indexes happen synchronously as transactions modify records, so that fallback searches always have access to the most current database information [9].

6.4 Multitenant relationships

Database.com provides “relationship” datatypes that an organization can use to declare relationships (referential integrity) among tables. When an organization declares an object’s field with a relationship type, the field is mapped to a Value field in MT_Data, and this field is then used to store the ObjID of a related object [9].

6.5 Multitenant field history

Database.com provides history tracking for any field. When a tenant enables auditing for a specific field, the system asynchronously records information about the changes made to the field (old and new values, change date, etc.) using an internal pivot table as an audit trail [9].

6.6 Partitioning of metadata, data, and index data

All Database.com data, metadata, and pivot table structures, including underlying database indexes, are physically partitioned by tenant (OrgID) using native database partitioning mechanisms. Data partitioning is a proven technique that database systems provide to physically divide large logical data structures into smaller, more manageable pieces. Partitioning can also help to improve the performance, scalability, and availability of a large database system such as a multitenant environment. For example, by definition, every Database.com query targets a specific tenant's information, so the query optimizer need only consider accessing data partitions that contain a tenant's data, rather than an entire table or index. This common optimization is sometimes referred to as “partition pruning” [9].
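As a hypothetical illustration of partition pruning (the table and column names below follow the MT_Data naming used above, but the query itself is invented):

# Hypothetical illustration: every query is scoped by the tenant's OrgID,
# which is also the physical partition key, so the optimizer can skip all
# partitions that do not hold OrgID 42 (partition pruning).
org_id = 42
query = ("SELECT Value0, Value1 "
         "FROM MT_Data "
         "WHERE OrgID = %d "   # partition key filter enables pruning
         "AND ObjID = 17" % org_id)
print(query)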

6.7 Application development

Developers can declaratively build server-side application components using the Database.com Console. This point-and-click interface supports all facets of the application schema building process, including the creation of an application's data model (objects and their fields, relationships, etc.), security and sharing model (users, profiles, role hierarchies, etc.), declarative logic (workflows), and programmatic logic (stored procedures and triggers). The Console provides access to built-in system features that make it easy to implement application functionality without writing code [9].

6.8 Data Access

Database.com provides the following tools to query and work with data.

Database.com REST API and Force.com Web Services API
The REST API and Web Services API can be used to interact with Database.com by creating, retrieving, updating, and deleting records, maintaining passwords, performing searches, etc. These APIs can be used with any language that supports Web services. The SOAP-based API is optimized for real-time client applications that update small numbers of records at a time [8][9].

Force.com Bulk API
The Bulk API is based on REST principles and is optimized for loading or deleting large sets of data. It can be used to insert, update, delete, or restore a large number of records asynchronously by submitting a number of batches that are processed in the background by Database.com. The Bulk API is designed to simplify the processing of a few thousand to millions of records.

Apex Data Manipulation Language (DML)
DML statements are used to insert, delete, and update data from within Apex code.

Apex Web Services
Apex methods can be exposed as Web service operations that can be called by external Web client applications. This is a powerful tool for building efficient communication between the data service and the application tier. By aggregating business logic onto Database.com, it can:

Prevent unnecessary communication between the data service and the client

Simplify client development and maintenance by providing a coarse-grained application-level API

Build more robust applications, since all of the logic implemented in Apex is executed within a transaction on Database.com [9]
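As a brief illustration of the REST API style described above, the following Python sketch creates and then retrieves a record. The instance URL, session token, and the custom object Invoice__c are placeholders, so the REST API documentation should be consulted for exact endpoints:

import json
import requests

INSTANCE = "https://na1.salesforce.com"  # placeholder instance URL
TOKEN = "SESSION_TOKEN"                  # obtained beforehand via OAuth
HEADERS = {"Authorization": "Bearer " + TOKEN,
           "Content-Type": "application/json"}

# Create a record in a hypothetical custom object named Invoice__c.
resp = requests.post(INSTANCE + "/services/data/v23.0/sobjects/Invoice__c/",
                     headers=HEADERS,
                     data=json.dumps({"Number__c": "INV-001",
                                      "Total__c": 99.0}))
record_id = resp.json()["id"]

# Retrieve the record that was just created.
rec = requests.get(INSTANCE + "/services/data/v23.0/sobjects/Invoice__c/"
                   + record_id, headers=HEADERS).json()
print(rec["Number__c"])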

6.9 Query languages

Database.com uses the Salesforce Object Query Language (SOQL) to construct database queries. Similar to the SELECT command in the Structured Query Language (SQL), SOQL allows you to specify the source object, a list of fields to retrieve, and conditions for selecting rows in the source object.

Database.com also includes a full-text, multilingual search engine that automatically indexes all text-related fields. Apps can leverage this pre-integrated search engine using the Salesforce Object Search Language (SOSL) to perform text searches. Unlike SOQL, which can only query one object at a time, SOSL can search text, email, and phone fields for multiple objects simultaneously [9].
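For illustration, the following strings show what SOQL and SOSL statements look like; the object and field names are hypothetical:

# SOQL: like a SQL SELECT, it targets a single object.
soql = ("SELECT Name, Total__c "
        "FROM Invoice__c "
        "WHERE Total__c > 100 "
        "ORDER BY Total__c DESC")

# SOSL: full-text search across several objects at once.
sosl = ("FIND {jasmine} IN ALL FIELDS "
        "RETURNING Invoice__c(Name), Contact(FirstName, LastName)")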

6.10 Multitenant search processing

Web-based application users have come to expect an interactive search capability that can scan the entire database, or a selected scope of it, return ranked results that are up-to-date, and do it all with sub-second response times. To provide such robust search functionality for applications, Database.com uses a search engine that is separate from its transaction engine. The relationship between the two engines is depicted in Figure 6.

Figure 6 Transaction and Search engine [9]

The search engine receives data from the transactional engine, with which it creates search indexes. The transactional engine forwards search requests to the search engine, which returns results that the transaction engine uses to locate rows that satisfy the search request.

As applications update data in text fields (CLOBs, Name, etc.), a pool of background processes called indexing servers is responsible for asynchronously updating the corresponding indexes, which the search engine maintains outside the core transaction engine. To optimize the indexing process, Database.com synchronously copies modified chunks of text data to an internal “to-be-indexed” table as transactions commit, thus providing a relatively small data source that minimizes the amount of data that indexing servers must read from disk. The search engine automatically maintains separate indexes for each organization (tenant).

Depending on the current load and utilization of indexing servers, text index updates may noticeably lag behind actual transactions. To avoid unexpected search results originating from stale indexes, Database.com also maintains an MRU (most recently used) cache of recently updated rows that the system considers when materializing full-text search results. In order to efficiently support possible search scopes, MRU caches are maintained per-user and per-organization.

Database.com's search engine optimizes the ranking of records within search results using several different methods. For example, the system considers the security domain of the user performing a search and weighs those rows to which the current user has access more heavily. The system can also consider the modification history of a particular row and rank more actively updated rows ahead of those that are relatively static. The user can choose to weight search results as desired, for example placing more emphasis on recently modified rows.

6.11 Multitenant isolation and protection

To protect the overall scalability and performance of the shared database system for all applications concerned, Database.com uses an extensive set of governors and resource limits associated with code execution. Execution of a code script is monitored and limited in how much CPU time it can use, how much memory it can consume, how many queries and DML statements it can execute, how many math calculations it can perform, how many outbound Web services calls it can make, and much more. Individual queries that the optimizer regards as too expensive to execute throw an exception to the caller [9].

Before an organization can transition a new application from development to production status, salesforce.com requires unit tests that validate the functionality of the application's Database.com code routines. Salesforce.com executes submitted unit tests in Database.com's sandbox development environment to ascertain whether the application code will adversely affect the performance and scalability of the multitenant population at large.

Once an application's code is certified for production by salesforce.com, the deployment process copies all the application's metadata into a production Database.com instance and reruns the corresponding unit tests.

After a production application is live, the performance profiler automatically analyzes it and provides associated feedback to administrators. Performance analysis reports include information about slow queries, data manipulations, and sub-routines that you can review and use to tune application functionality.


6.12 Deletes, undeletes

When an app deletes a record from an object, Database.com simply marks the row for deletion. Selected rows can be restored from the Recycle Bin for up to 30 days before they are permanently removed. The total number of records maintained for an organization is limited based on the storage limits for that organization.

The Recycle Bin also stores dropped fields and their data until an organization permanently deletes them or 45 days have elapsed, whichever happens first. Until that time, the entire field and all its data are available for restoration [9].

6.13 Backup

Database.com uses a variety of methods to ensure that organizations do not experience any data loss. Every transaction is stored to RAID disks in real time with archive mode enabled, allowing the database to recover all transactions prior to any system failure. Every night all data is backed up to a separate backup server and an automatic tape library. The backup tapes are cloned as an additional precautionary measure, and the cloned tapes are transported to an off-site, fireproof vault twice a month [8].

6.14 Pricing

Database.com pricing is based on the number of users, records, and transactions per month. Registration of a new account is free and includes:

3 Standard Users

3 Administration Users

100,000 records in the database

50,000 transactions per month

Additional storage and capacity can be purchased at any time with no downtime.


7. Microsoft’s SQL Azure

Microsoft SQL Azure Database is a cloud-based relational database service that is built on SQL Server technologies and runs in Microsoft data centers on hardware that is owned, hosted, and maintained by Microsoft.

SQL Azure is probably the most fully-featured relational database available in the cloud. It is based on the SQL Server standalone database, but the way data is managed and stored in SQL Azure is significantly different.

Similar to an instance of SQL Server, SQL Azure Database exposes a tabular data stream (TDS) interface for Transact-SQL-based database access. This allows your database applications to use SQL Azure Database in the same way that they use SQL Server. Because SQL Azure Database is a service, administration in SQL Azure Database is slightly different.

Unlike administration for an on-premises instance of SQL Server, SQL Azure Database abstracts the logical administration from the physical administration. Users continue to administer databases, logins, users, and roles, but Microsoft administers the physical hardware such as hard drives, servers, and storage. This approach helps SQL Azure Database provide a large-scale multitenant database service that offers enterprise-class availability, scalability, security, and self-healing [11].

7.1 Subscriptions

To use SQL Azure, a Windows Azure platform account is required. This account allows access to all the Windows Azure-related services, such as Windows Azure, Windows Azure AppFabric, and SQL Azure. The Windows Azure platform account is used to set up and manage subscriptions and to bill for consumption of any of the Windows Azure services, including SQL Azure, although running SQL Azure does not require Windows Azure. With the Windows Azure platform account, the Windows Azure Platform Management portal can be used to create SQL Azure servers, databases, and their associated administrator accounts [11].

Each subscription allows one instance of SQL Server to be defined, which will initially include only a master database. Firewall settings have to be configured for each server, to determine which connections will be allowed access.

7.2 Databases

Each SQL Azure server always includes a master database. Up to 149 additional databases can be created for each SQL Azure server. Microsoft offers two editions of SQL Azure databases, Web and Business, and when a database is created using the Windows Azure Platform Management portal, the maximum size specified determines the edition created. A Web Edition database can have a maximum size of 1 GB or 5 GB. A Business Edition database can have a maximum size of up to 150 GB of data, in 10 GB increments up to 50 GB, and then in 50 GB increments [11][12]. If the size of the database reaches the limit, it is not possible to insert data, update data, or create new database objects. However, reading and deleting data, truncating tables, dropping tables and indexes, and rebuilding indexes are still possible.

The SQL Azure data access model does not support cross-database queries; in the current version a connection is made to a single database. If data from another database is needed, a new connection must be created [11].

7.3 Security and Access to a SQL Azure Database

Most security issues for SQL Azure databases are managed by Microsoft within the SQL Azure data center, with very little setup required by the users. A user must have a valid login and password in order to connect to the SQL Azure database. Because SQL Azure supports only standard security, each login must be explicitly created.

In addition, the firewall can be configured on each SQL Azure server to only allow traffic from specified IP addresses to access the SQL Azure server. This helps to greatly reduce any chance of a denial-of-service (DoS) attack. All communications between clients and SQL Azure must be SSL encrypted, and clients should always connect with Encrypt = True to ensure that there is no risk of man-in-the-middle attacks. DoS attacks are further reduced by a service called DoSGuard that actively tracks failed logins from IP addresses; if it notices too many failed logins from the same IP address within a period of time, the IP address is blocked from accessing any resources in the service [11].

The security model within a database is identical to that in SQL Server. Users are created and mapped to login names. Users can be assigned to roles, and users can be granted permissions. Data in each database is protected from users in other databases because the connections from the client application are established directly to the connecting user's database.
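As an illustration of TDS-based access with encryption, a minimal connection sketch using the pyodbc library might look as follows; the server, database, and credentials are placeholders (SQL Azure expects the user@server login form):

import pyodbc

conn = pyodbc.connect(
    "Driver={SQL Server Native Client 10.0};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=mydb;"
    "Uid=myuser@myserver;"    # SQL Azure logins use the user@server form
    "Pwd=secret;"
    "Encrypt=yes;"            # avoid man-in-the-middle attacks
)
cursor = conn.cursor()
cursor.execute("SELECT name FROM sys.tables")  # ordinary T-SQL over TDS
for row in cursor.fetchall():
    print(row.name)
conn.close()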

7.4 SQL Azure architecture

Each SQL Azure database is associated with its own subscription. From the subscriber's perspective, SQL Azure provides logical databases for application data storage. In reality, each subscriber's data is replicated across three SQL Server databases that are distributed across three physical servers in a single data center. Many subscribers may share the same physical database, but the data is presented to each subscriber through a logical database that abstracts the physical storage architecture and uses automatic load balancing and connection routing to access the data. The logical database that the subscriber creates and uses for database storage is referred to as a SQL Azure database [11].

7.5 Logical Databases on a SQL Azure Server

SQL Azure subscribers access the actual databases, which are stored on multiple machines in the data center, through the logical server. The SQL Azure Gateway service acts as a proxy, forwarding the Tabular Data Stream (TDS) requests to the logical server. It also acts as a security boundary, providing login validation, enforcing the firewall, and protecting the instances of SQL Server behind the gateway against denial-of-service attacks. The Gateway is composed of multiple computers, each of which accepts connections from clients, validates the connection information, and then passes on the TDS to the appropriate physical server, based on the database name specified in the connection. Figure 7 shows the physical architecture represented by a single logical server.

Figure 7 A logical server and its databases distributed across machines in the data center [11]

The machines with the SQL Server instances are called data nodes. Each data node contains a single SQL Server instance, and each instance has a single user database, divided into partitions. Each partition contains one SQL Azure client database, either a primary or a secondary replica. Each database hosted in the SQL Azure data center has three replicas: one primary replica and two secondary replicas. All reads and writes go through the primary replica, and any changes are replicated to the secondary replicas asynchronously. The replicas are the central means of providing high availability for SQL Azure databases.

The partitions of other SQL Azure databases existing within the same SQL Server instances in the data center are completely invisible and inaccessible to other subscribers [11].

For SQL Azure databases, every commit needs to be a quorum commit. That is, the primary replica and at least one of the secondary replicas must confirm that the log records have been written before the transaction is considered to be committed.
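A toy model of this quorum-commit rule, under the simplifying assumption that the replica acknowledgements have already been collected, is:

def quorum_commit(primary_ack, secondary_acks):
    """primary_ack: bool; secondary_acks: acknowledgements of the two
    secondary replicas. Commit only if the primary and at least one
    secondary confirmed the log write."""
    return primary_ack and any(secondary_acks)

assert quorum_commit(True, [True, False])        # commits
assert not quorum_commit(True, [False, False])   # not yet committed
assert not quorum_commit(False, [True, True])    # primary must confirm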

Each data node machine hosts a set of processes referred to as the fabric. The fabric processes perform the following tasks:

Failure detection: notes when a primary or secondary replica becomes unavailable, so that the Reconfiguration Agent can be triggered

Reconfiguration Agent: manages the re-establishment of primary or secondary replicas after a node failure

PM (Partition Manager) Location Resolution: allows messages to be sent to the Partition Manager

Engine Throttling: ensures that one logical server does not use a disproportionate amount of the node's resources or exceed its physical limits

Ring Topology: manages the machines in a cluster as a logical ring, so that each machine has two neighbors that can detect when the machine goes down

The machines in the data center are all commodity machines with components of low-to-medium quality and low-to-medium performance capacity. The low cost and the easily available configuration make it easy to quickly replace machines in case of a failure condition. In addition, Windows Azure machines use the same commodity hardware, so that all machines in the data center, whether used for SQL Azure or for Windows Azure, are interchangeable.

In Figure 7, the logical server contains three databases: DB1, DB2, and DB3. The primary replica for DB1 is on Machine 6 and the secondary replicas are on Machine 4 and Machine 5. For DB3, the primary replica is on Machine 4, and the secondary replicas are on Machine 5 and on another machine not shown in this figure. For DB4, the primary replica is on Machine 5, and the secondary replicas are on Machine 6 and on another machine not shown in this figure. Note that this diagram is a simplification. Most production Microsoft SQL Azure data centers have hundreds of machines with hundreds of actual instances of SQL Server to host the SQL Azure replicas, so it is extremely unlikely that if multiple SQL Azure databases have their primary replicas on the same machine, their secondary replicas will also share a machine [11].

The physical distribution of databases that are all part of one logical instance of SQL Server means that each connection is tied to a single database, not to a single instance of SQL Server.

7.6 Network Topology

Four distinct layers of abstraction work together to provide the logical database for the subscriber's application to use: the client layer, the services layer, the platform layer, and the infrastructure layer. Figure 8 illustrates the relationship between these four layers.

The client layer resides closest to the application, and it is used by the application to communicate directly with SQL Azure. The client layer can reside on-premises in a data center, or it can be hosted in Windows Azure. Every protocol that can generate TDS over the wire is supported. Because SQL Azure provides the same TDS interface as SQL Server, known and familiar tools and libraries can be used to build client applications for data that is in the cloud.

The infrastructure layer represents the IT administration of the physical hardware and operating systems that support the services layer.


Figure 8 Four layers of abstraction provide the SQL Azure logical database for a client application to use [11]


7.7 High Availability with SQL Azure

The goal for Microsoft SQL Azure is to maintain 99.9 percent availability for subscribers' databases. As stated earlier, this goal is achieved by the use of commodity hardware, which can be quickly and easily replaced in the case of machine or drive failure, and by the management of the replicas, one primary and two secondary, for each SQL Azure database [12].

7.8 Failure Detection

Management in the data centers needs to detect not only the complete failure of a machine, but also conditions where machines are slowly degenerating and communication with them is affected. The concept of quorum commit, discussed earlier, addresses these conditions. First, a transaction is not considered to be committed unless the primary replica and at least one secondary replica can confirm that the transaction log records were successfully written to disk. Second, if both a primary replica and a secondary replica must report success, small failures that might not prevent a transaction from committing but that might point to a growing problem can be detected [11].

7.9 Reconfiguration

The process of replacing failed replicas is called reconfiguration. Reconfiguration can be required due to failed hardware, an operating system crash, or a problem with the instance of SQL Server running on the node in the data center. Reconfiguration can also be necessary when an upgrade is performed, whether for the operating system, for SQL Server, or for SQL Azure.

All nodes are monitored by six peers, each on a different rack than the failed machine. The peers are referred to as neighbors. A failure is reported by one of the neighbors of the failed node, and the process of reconfiguration is carried out for each database that has a replica on the failed node. Because each machine holds replicas of hundreds of SQL Azure databases (some primary replicas and some secondary replicas), if a node fails, the reconfiguration operations are performed hundreds of times. There is no prioritization in handling the hundreds of failures when a node fails; the Partition Manager randomly selects a failed replica to handle, and when it is done with that one, it chooses another, until all of the replica failures have been dealt with.

If a node goes down because of a reboot, that is considered a clean failure, because the neighbors receive a clear exception message.

Another possibility is that a machine stops responding for an unknown reason, and an ambiguous failure is detected. In this case, an arbitrator process determines whether the node is really down.

Although this discussion centers on the failure of a single replica, it is really the failure of a node that is detected and dealt with. A node contains an entire SQL Server instance with multiple partitions containing replicas from up to 650 different databases. Some of the replicas will be primary and some will be secondary. When a node fails, the processes described earlier are performed for each affected database. That is, for some of the databases, the primary replica fails, and the arbitrator chooses a new primary replica from the existing secondary replicas, and for other databases, a secondary replica fails, and a new secondary replica is created.

The majority of the replicas of any SQL Azure database must confirm the commit. At this time, user databases maintain three replicas, so a quorum commit requires two of the replicas to acknowledge the transaction. A metadata store, which is part of the Gateway components in the data centers, maintains five replicas and so needs three confirmations to satisfy a quorum commit. The master cluster, which maintains seven replicas, needs four of them to confirm a transaction. However, for the master cluster, even if all seven replicas fail, the information is recoverable, because mechanisms are in place to rebuild the master cluster automatically in case of such a massive failure [11].

7.10 Availability Guarantees

As mentioned earlier, the goal for Microsoft SQL Azure is to maintain 99.9 percent availability. Because of the way that database replicas are distributed across multiple servers and the efficient algorithms for promoting secondary replicas to primary, up to 15 percent of the machines in the data center can be down and availability can still be guaranteed [11].

7.11 Scalability with SQL Azure

As said earlier, one of the biggest benefits of hosting databases in the cloud is the built-in scalability. With SQL Azure, as with most cloud database platforms, you add more databases only when and if you need them, and if the need is only temporary, you can then drop the unneeded databases. There are two components within SQL Azure that enable this scalability by continuously monitoring the load on each node. One component is Engine Throttling, which ensures that the server does not get overloaded. The other component is the Load Balancer, which ensures that a server is not continuously in the throttled state. In this section, we will look at these two components and discuss how engine throttling applies when predefined limits are reached and how load balancing works as the number of hosted databases increases. The third technique for achieving greater scalability and performance is the Federations [31] used in SQL Azure. One or more tables within a database are split by row and partitioned across multiple databases (federation members). This type of horizontal partitioning is often referred to as “sharding”. The primary scenarios in which this is useful are where you need to achieve scale or performance, or to manage capacity [11].
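The following toy sketch illustrates the sharding idea behind Federations; the key ranges and member names are invented for the example (in SQL Azure itself, routing to a federation member is expressed in T-SQL with the USE FEDERATION statement):

# Invented key ranges and member names, routing rows by a federation key.
FEDERATION_MEMBERS = [
    (0,      "member_0"),  # customer_id 0 .. 99,999
    (100000, "member_1"),  # customer_id 100,000 .. 199,999
    (200000, "member_2"),  # customer_id 200,000 and above
]

def route(customer_id):
    """Return the federation member whose key range contains customer_id."""
    target = FEDERATION_MEMBERS[0][1]
    for low, member in FEDERATION_MEMBERS:
        if customer_id >= low:
            target = member
    return target

assert route(150000) == "member_1"
assert route(250000) == "member_2"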

7.12 Throttling

Because of the multitenant use of each SQL Server in the data center, it is possible that one subscriber's application could render the entire instance of SQL Server ineffective by imposing heavy loads. For example, under full recovery mode, inserting lots of large rows, especially ones containing large objects, can fill up the transaction log and eventually the drive that the transaction log resides on. In addition, each instance of SQL Server in the data center shares the machine with other critical system processes that cannot be starved – most relevantly the fabric process that monitors the health of the system.

To keep a data center server's resources from being overloaded and jeopardizing the health of the entire machine, the load on each machine is monitored by the Engine Throttling component. In addition, each database replica is monitored to make sure that statistics such as log size, log write duration, CPU usage, the actual physical database size limit, and the SQL Azure user database size are all below target limits. If the limits are exceeded, the result can be that a SQL Azure database rejects reads or writes for 10 seconds at a time. Occasionally, violation of resource limits may result in the SQL Azure database permanently rejecting reads and writes (depending on the resource type in question) [11].

7.13 Load Balancer

At this time, although there are availability guarantees with SQL Azure, there are no performance guarantees. Part of the reason for this is the multitenant problem: many subscribers with their own SQL Azure databases share the same instance of SQL Server and the same computer, and it is impossible to predict the workload that each subscriber's connections will be requesting. SQL Azure provides load balancing services that evaluate the load on each machine in the data center. When a new SQL Azure database is added to the cluster, the Load Balancer determines the locations of the new primary and secondary replicas based on the current load on the machines. If one machine gets loaded too heavily, the Load Balancer can move a primary replica to a machine that is less loaded [11].

7.14 SQL Azure Management

Because SQL Azure databases are hosted within larger SQL Server instances on machines in the data centers, the management work that needs to be done is very limited. However, some maintenance tasks are still necessary.

All physical aspects of dealing with the databases are handled in the data center by Microsoft, and all upgrades are handled in the data center one replica at a time. The user is responsible for troubleshooting poorly performing queries and concurrency problems, such as blocking. Just as in SQL Server, some of the main tools available for troubleshooting are the dynamic management views (DMVs) [11].

7.15 Pricing in SQL Azure

Billing in SQL Azure is per database, based on usage and database edition; this allows an organization to start with a small investment and add space as the business grows. SQL Azure provides two different database editions, Business Edition and Web Edition. SQL Azure edition features apply to the individual database, and different database editions can be mixed and matched within the same SQL Azure server. Both editions offer scalability, automated high availability, and self-provisioning.

The Web Edition database is suited for small Web applications and workgroup or departmental applications. This edition supports a database with a maximum size of 1 or 5 GB of data.

The Business Edition database is suited for independent software vendors (ISVs), line-of-business (LOB) applications, and enterprise applications. This edition supports a database of up to 150 GB of data, in 10 GB increments up to 50 GB, and then in 50 GB increments.

Both editions charge an additional bandwidth-based fee when the data transfer includes a client outside the Windows Azure platform or outside the region of the SQL Azure database.

The edition and maximum size of the database are specified when it is created; the edition and maximum size can also be changed after creation. The billing will be based on the new edition type (and the peak size the database reaches, daily) [13].

Microsoft charges a monthly fee for each SQL Azure user database. The database fee is amortized over the month and charged daily. The daily fee depends on the peak size that each database reached that day, the edition of each database, and the number of databases in use. A 10 GB multiplier is used for pricing Business Edition databases, and a 1 GB or 5 GB multiplier is used for pricing Web Edition databases. Users pay for the databases they have, for the days they have them [13].

Bandwidth used between SQL Azure and Windows Azure or Windows Azure AppFabric is free within the same sub-region or data center.


8. Amazon Web Services

Amazon is another company offering a relational database service as part of its Amazon Web Services. In the next section I will first speak about the Amazon Relational Database Service, and later I will give an overview of their NoSQL databases, Amazon SimpleDB and DynamoDB, and other NoSQL solutions currently available.

8.1 Amazon Relational Database Service (Amazon RDS)

Amazon Relational Database Service (Amazon RDS) is a web service that can operate, and to some level scale, a relational database in the cloud. It provides cost-efficient and resizable capacity while automating the administration tasks. Amazon RDS gives users access to the capabilities of a MySQL or Oracle database running on their own Amazon RDS database instance. This gives the advantage that code and applications that use an on-premises MySQL or Oracle database can be easily migrated to Amazon RDS.

8.2 Amazon RDS Architecture/Features

Amazon RDS takes a different approach than Database.com and SQL Azure. It offers the full capabilities of a MySQL or Oracle database running on a separate database instance. The features provided by Amazon RDS depend on the DB Engine selected. In general, it offers:

Pre-configured Parameters – DB Instances are pre-configured with a sensible set of parameters and settings appropriate for the selected DB Instance class. This makes it possible to launch a MySQL or Oracle DB Instance and connect an application without additional configuration.

Monitoring and Metrics – Amazon RDS provides Amazon CloudWatch metrics for the DB Instance deployments. The AWS Management Console can be used to view key operational metrics for the DB Instance deployments, including compute/memory/storage capacity utilization, I/O activity, and DB Instance connections.

Automatic Software Patching – Amazon RDS will make sure that the relational database software stays up-to-date with the latest patches.

Automated Backups – Turned on by default, the automated backup feature of Amazon RDS enables point-in-time recovery for the DB Instance. Amazon RDS will back up the database and transaction logs and store both for a user-specified retention period. This allows restores of the DB Instance to any second during the retention period, up to the last five minutes. The automatic backup retention period can be configured to up to thirty-five days.

DB Snapshots – DB Snapshots are user-initiated backups of the DB Instance. These full database backups are stored by Amazon RDS until they are explicitly deleted. Users can also create a new DB Instance from a DB Snapshot.

Isolation and Security – Using Amazon VPC², it is possible to isolate DB Instances in their own virtual network and connect to an existing IT infrastructure using an industry-standard encrypted IPsec VPN. In addition, for both MySQL and Oracle, access to the DB Instances can be controlled using database security groups (DB Security Groups). A DB Security Group acts like a firewall controlling network access to the DB Instance. By default, network access to the DB Instances is turned off. For applications to access a DB Instance, a DB Security Group must be set to allow access from EC2³ Instances with specific EC2 Security Group membership or from IP ranges [14].
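A sketch of provisioning and securing a MySQL DB Instance with the boto library (the Python AWS library current at the time of writing) might look as follows; all parameter values are examples:

import boto.rds

conn = boto.rds.connect_to_region("us-east-1")

# Launch a Small DB Instance with 10 GB of storage (example values).
db = conn.create_dbinstance(id="mydb",
                            allocated_storage=10,
                            instance_class="db.m1.small",
                            master_username="admin",
                            master_password="secret123")

# Network access is off by default, so grant an IP range via a
# DB Security Group before applications can connect.
sg = conn.create_dbsecurity_group("web-tier", "Access for the web tier")
sg.authorize(cidr_ip="203.0.113.0/24")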

8.3 Scalability with Amazon RDS

Amazon RDS gives the flexibility of being able to scale the compute resources or storage capacity associated with the relational database instance by using the Amazon RDS APIs or through the AWS Management Console. The compute and memory resources can be scaled up or down by using predefined DB Instance classes. Currently Amazon is offering five supported DB Instance classes:

Small DB Instance: 1.7 GB memory, 1 ECU (1 virtual core with 1 ECU), 64-bit platform, Moderate I/O Capacity

Large DB Instance: 7.5 GB memory, 4 ECUs (2 virtual cores with 2 ECUs each), 64-bit platform, High I/O Capacity

High-Memory Extra Large DB Instance: 17.1 GB memory, 6.5 ECUs (2 virtual cores with 3.25 ECUs each), 64-bit platform, High I/O Capacity

High-Memory Double Extra Large DB Instance: 34 GB memory, 13 ECUs (4 virtual cores with 3.25 ECUs each), 64-bit platform, High I/O Capacity

High-Memory Quadruple Extra Large DB Instance: 68 GB memory, 26 ECUs (8 virtual cores with 3.25 ECUs each), 64-bit platform, High I/O Capacity

For each DB Instance class, it is possible to select from 5 GB to 1 TB of associated storage capacity. Additional storage can be provisioned on the fly with no downtime. One ECU provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor [14].

² Amazon Virtual Private Cloud (Amazon VPC): an isolated section of the Amazon Web Services (AWS) cloud where AWS resources can be launched in a virtual network that you define, offering complete control over the virtual networking environment, including selection of your own IP address range, creation of subnets, and configuration of route tables and network gateways.

³ Amazon Elastic Compute Cloud (EC2): a web service that provides resizable compute capacity in the cloud.


8.4 High Availability

Amazon RDS runs on the same highly reliable infrastructure as the other Amazon Web Services. It has multiple features that enhance availability for critical production databases. Currently it offers automatic host replacement and replication.

With automatic host replacement, Amazon RDS will automatically replace the compute instance powering the deployment in the event of a hardware failure.

Replication is at this time supported only for MySQL, although it is planned to be available for Oracle in the near future. For MySQL, Amazon RDS provides two replication features: Multi-AZ deployments and Read Replicas.

With Multi-AZ deployments, Amazon RDS will automatically provision and manage a “standby” replica in a different Availability Zone (independent infrastructure in a physically separate location). Database updates are made concurrently on the primary and standby resources to prevent replication lag. In the event of planned database maintenance, DB Instance failure, or an Availability Zone failure, Amazon RDS will automatically fail over to the up-to-date standby so that database operations can resume quickly without administrative intervention. Prior to failover the standby cannot be accessed directly, and it cannot be used to serve read traffic.

Read Replicas make it easy to elastically scale out beyond the capacity constraints of a single DB Instance for read-heavy database workloads. It is possible to create one or more replicas of a given source DB Instance and serve high-volume application read traffic from multiple copies of the data, thereby increasing aggregate read throughput. Amazon RDS uses MySQL's native replication to propagate changes made to a source DB Instance to any associated Read Replicas. Since Read Replicas use standard MySQL replication, they may fall behind their sources, and they are therefore not intended to be used for enhancing fault tolerance in the event of source DB Instance failure or Availability Zone failure [14].

8.5 Pricing

As with the other, previously mentioned DBMS services, Amazon RDS pricing is based on usage and the DB Instance class. It is possible to choose between hourly On-Demand pricing, with no up-front or long-term commitments, and the Reserved pricing option.

On-Demand DB Instances let users pay for compute capacity by the hour with no long-term commitments. This frees users from the costs and complexities of planning, purchasing, and maintaining hardware, and transforms what are commonly large fixed costs into much smaller variable costs.

Reserved DB Instances give users the option to make a low, one-time payment for each DB Instance they want to reserve and in turn receive a discount on the hourly usage charge for that DB Instance. Depending on usage, it is possible to choose between three Reserved DB Instance types (Light, Medium, and Heavy Utilization) and receive anywhere between 30% and 55% discount over On-Demand prices. Based on the application workload and the amount of time it will run, Amazon RDS Reserved Instances may provide substantial savings over running On-Demand DB Instances.

The prices differ depending on whether a standard or a Multi-AZ deployment is used. For both standard and Multi-AZ deployments, pricing is per DB Instance-hour consumed, from the time a DB Instance is launched until it is terminated.

There is no additional charge for backup storage of up to 100% of the provisioned database storage for an active DB Instance. After the DB Instance is terminated, backup storage is billed per GB-month. Additional backup storage is also billable.

Data transferred between Amazon RDS and Amazon EC2 Instances in the same Availability Zone, and data transferred between Availability Zones for replication of Multi-AZ deployments, is free.

Amazon RDS DB Instances outside VPC: for data transferred between an Amazon EC2 instance and an Amazon RDS DB Instance in different Availability Zones of the same Region, there is no Data Transfer charge for traffic in or out of the Amazon RDS DB Instance. Charges apply only for the Data Transfer in or out of the Amazon EC2 instance, and standard Amazon EC2 Regional Data Transfer charges apply.

Amazon RDS DB Instances inside VPC: for data transferred between an Amazon EC2 instance and an Amazon RDS DB Instance in different Availability Zones of the same Region, Amazon EC2 Regional Data Transfer charges apply on both sides of the transfer.

Data transferred between Amazon RDS and AWS services in different regions is charged as Internet Data Transfer on both sides of the transfer.

Additionally, for the Oracle database there are two licensing models, “License Included” and “Bring-Your-Own-License (BYOL)”. In the “License Included” service model, separately purchased Oracle licenses are not needed; the Oracle Database software has been licensed by AWS. The “BYOL” model is designed for customers who prefer to use existing Oracle Database licenses or to purchase new licenses directly from Oracle [14].

9. Google Cloud SQL

Google Cloud SQL is a MySQL database in Google's cloud. It has all the capabilities and functionality of MySQL. Google Cloud SQL is currently available for Google App Engine applications that are written in Java or Python. It can also be accessed from a command-line tool.

Like all the other database-as-a-service offerings, Google Cloud SQL is fully managed: patch management, replication, and other database management chores are handled by Google.

High availability is offered by built-in automatic replication across multiple geographic regions, so the service is available and data is preserved even when a whole data center becomes unavailable. Users can choose to create databases with synchronous or asynchronous replication in data centers in the EU or the US.

Google Cloud SQL is tightly integrated with Google App Engine and other Google services, which allows users to work across multiple products and get more value out of their data. The database instances are not restricted to use by only one application in App Engine, allowing multiple applications to use the same instance and database. Data can be imported into the database using mysqldump. This allows users to easily move data, applications, and services in and out of the cloud.
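A minimal sketch of querying a Cloud SQL database from a Python App Engine application, using the rdbms API that App Engine exposes for this purpose (the instance, database, and table names are placeholders):

from google.appengine.api import rdbms

conn = rdbms.connect(instance="myproject:myinstance", database="guestbook")
cursor = conn.cursor()
cursor.execute("SELECT guest_name, entry FROM entries ORDER BY entry_id DESC")
for guest_name, entry in cursor.fetchall():
    print(guest_name, entry)
conn.close()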

As an initial trial, Google is offering instances with a small amount of RAM and 0.5 GB of database storage. Additional RAM and storage can be purchased, up to 16 GB of RAM and 100 GB of storage [15].

9.1 Pricing

Google offers two billing plans for Google Cloud SQL: Packages or Per Use. The Packages offer is shown in the table below:

Tier   RAM      Included Storage   Included I/O per Day
D1     0.5 GB   1 GB               850K
D2     1 GB     2 GB               1.7M
D4     2 GB     5 GB               4M
D8     4 GB     10 GB              8M
D16    8 GB     10 GB              16M
D32    16 GB    10 GB              32M

Table 2 Google Cloud SQL Packages

Each database instance is allocated the RAM shown above, along with an appropriate amount of CPU. Storage is measured as the filespace used by the MySQL database. Bills are issued monthly, based on the number of days during which the database existed. Google does not charge for the storage of backups created using the scheduled backup service. The number of I/O requests to storage made by a database instance depends on the queries, workload, and data set. Cloud SQL caches data in memory to serve queries efficiently and to minimize the number of I/O requests. Use of storage or I/O over the included quota is charged at the Per Use rate. The maximum storage for any instance is currently 100 GB.


With the Per Use plan, the same tiers as in the Packages plan are offered, with the difference that database instances are charged for periods of continuous use. Storage is charged per GB in hourly units (whether the database is active or not), measured as the largest number of bytes during that one-hour period, rounded up to the nearest GB, and I/O requests are charged by number, rounded to the nearest million.

Network use is charged under both the Packages and Per Use billing plans. Only outbound external traffic is charged; network usage between Google App Engine applications and Cloud SQL is not charged [15].

10. Summary of RDBMS DBaaS and common considerations

As we can see from the previous sections, Relational Database as a Service (DBaaS) is currently found in the public marketplace in two broad forms: online general-purpose relational databases, and the ability to operate virtual machine images loaded with common databases such as MySQL, Oracle, or similar commercial databases.

Database.com offers a relational multitenant database specially built for the cloud using its metadata-driven architecture.

Microsoft SQL Azure offers a SQL Server-like relational database management system and controls many of the database configuration details, allowing the users to focus on the schema, data, and application layer.

Amazon RDS provides an implementation of MySQL or Oracle on a virtual machine built and tuned for that purpose, and Google also has its Cloud SQL, providing MySQL for its App Engine PaaS.

While all the presented RDBMS DBaaS offerings provide an opportunity to reduce cost, there are many considerations to take into account before moving data to a cloud-based solution. Table 3 presents a comparison of the main considerations.

Data Sizing - All of the RDBMS DBaaS offerings presented have limits on the size of the data set that can be stored on their systems.

Portability - Portability and adherence to standards is a critical issue for ensuring continuity of operations and for mitigating business risk (e.g., a provider going out of business or raising rates). The ability to instantiate a replicated version of the data “off-cloud” or in another cloud offering can provide the business owners with an extra level of assurance that they will not suffer a loss of data. This can be facilitated by standards, such as the use of a standard database query language (SQL).

Transaction Capabilities - Transaction capabilities are an essential feature for databases that need to provide guaranteed reads and writes (ACID).

Page 43: Database management as a cloud-based service for small - Muni

43

Maximum amount of data that can be stored:
Salesforce Database.com: limited by the number of records per database, up to 22,300,000 records.
Microsoft SQL Azure: 5 GB with a Web Edition database and up to 150 GB with a Business Edition database.
Amazon RDS (MySQL or Oracle): 1 terabyte per database instance.
Google Cloud SQL: 100 GB per database instance.

Ease of software portability with similar locally hosted capability:
Salesforce Database.com: Low. Requires the database to be specially built and tested by Salesforce before deployment.
Microsoft SQL Azure: High. Most SQL Server features are available in SQL Azure.
Amazon RDS (MySQL or Oracle): High. A MySQL/Oracle instantiation in the cloud is very similar to the locally instantiated version.
Google Cloud SQL: Medium. A MySQL instance in the cloud is very similar to the local instance, but accessible only by Google App Engine.

Transaction capabilities:
Salesforce Database.com: Yes. Microsoft SQL Azure: Yes. Amazon RDS: Yes. Google Cloud SQL: Yes.

Configurability and ability to tune databases:
Salesforce Database.com: Low. It creates indexes automatically and keeps a record of the most recently accessed records, but does not allow control over this. It also does not allow control over memory allocation and similar resources.
Microsoft SQL Azure: Medium. Indexes and stored procedures can be created, but there is no control over memory allocation or similar resources.
Amazon RDS (MySQL or Oracle): High. A MySQL/Oracle instantiation in the cloud on a virtual machine.
Google Cloud SQL: Low. Automatically tuned.

Database accessible as a “stand-alone” offering:
Salesforce Database.com: Yes. Microsoft SQL Azure: Yes. Amazon RDS: Yes. Google Cloud SQL: No; requires the Google App Engine application layer.

Possibility to designate where the data is stored (e.g., region or data center):
Salesforce Database.com: No. Microsoft SQL Azure: Yes. Amazon RDS: Yes. Google Cloud SQL: Yes.

Replication:
Salesforce Database.com: No. Microsoft SQL Azure: Yes. Amazon RDS: Yes. Google Cloud SQL: Yes.

Table 3 Main Considerations Comparison


Configurability - DBaaS offerings may provide capabilities that reduce the number of configuration options available to database administrators. For some applications, if more configuration options are managed by the platform owner rather than by the customer's database administrator, this can be a benefit, and it can reduce the amount of effort expended to maintain the database. For others, the inability to tune and control all aspects of the database, such as memory management, can be a limiting constraint in obtaining performance.

Database Accessibility - Most DBaaS offerings provide a predefined set of connectivity mechanisms that will directly impact adoption and use. There are three general approaches. First, most RDBMS offerings are typically accessible through industry-standard database drivers such as Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC). These drivers allow applications external to the service to access the database through a standard connection, facilitating interoperability. Second, services typically provide interfaces that use standards-based, Service-Oriented Architecture (SOA) protocols, such as SOAP or REST, with the Hypertext Transfer Protocol (HTTP) and a vendor-specific API definition. These services may provide software development kits in common source-code languages to facilitate adoption. Third, some databases may be restricted to accessing data through software running in the vendor's ecosystem. This approach may increase security, but it also significantly limits portability and interoperability.

Availability and Replication - The ability to ensure that data is available and not lost will be a key consideration. Ensuring access to data can come through the enforcement of service-level agreement (SLA) metrics such as up time, replication across a cloud provider's regions, and replication or movement of the data across cloud providers or to the consuming organization's data center.

Replication across a cloud provider's hardware within a region may ameliorate the effects of a localized hardware or software failure.

Replication across a cloud provider's geographic regions may ameliorate the effects of a network outage, natural disaster, or other regional event.

Replication across multiple cloud providers or back to the consuming organization's IT infrastructure may provide the most continuity-of-operation benefit, through full geographic and IT stack independence.

Many providers such as Microsoft and Amazon offer replication of the data across hardware within a specific region as part of a packaged service. Within a given vendor, replication across geographies is usually more expensive and may result in significant data transfer fees.


11. NOSQL

While RDBMS databases are widely deployed and successful, they have shortcomings for some applications that have been filled by the growing use of NoSQL databases. Rather than conforming to SQL standards and providing relational data modeling, NoSQL databases typically offer fewer transactional guarantees than RDBMSs in exchange for greater flexibility and scalability. NoSQL databases tend to be less complex than RDBMSs and scale horizontally across lower-cost hardware. Unlike RDBMSs, which share a common relational data model, several different types of databases, such as column-oriented, key-value, and document-oriented, are considered “NoSQL” databases. NoSQL databases tend to be used in applications that do not require the same level of data consistency guarantees that RDBMS systems provide but that require throughput levels that would be very expensive for RDBMSs to support.

12. Amazon SimpleDB and DynamoDB

Amazon DynamoDB is a fully managed NoSQL database service. As I said in the introduction to DBMSs in the cloud, NoSQL databases are more suitable for situations where applications experience explosive growth, when traditional databases require reworking to distribute their workload across multiple servers.

DynamoDB was created by taking Amazon's in-house NoSQL database, Dynamo (incremental scalability, predictable high performance), combining it with the best parts of SimpleDB (ease of administration of a cloud service, consistency, and a table-based data model that is richer than a pure key-value store), and putting it into a form suitable for external use as a service.

In the next section I will give a short overview of Dynamo and SimpleDB.

12.1 Dynamo History

The original Dynamo design was based on a core set of strong distributed systems principles, resulting in an ultra-scalable and highly reliable database system. It was developed as a response to the scaling challenges that Amazon.com faced, when direct database access was one of the major bottlenecks in scaling and operating the business. There are many services that only need primary-key access to a data store. For many services, such as those that provide best-seller lists, shopping carts, customer preferences, session management, sales rank, and product catalogs, the common pattern of using a relational database would lead to inefficiencies and limit scale and availability. Dynamo provided a simple primary-key-only interface to meet the requirements of these applications [17][18].

Dynamo was targeted mainly at applications that need an “always writeable” data store where no updates are rejected due to failures or concurrent writes. It was built for an infrastructure within a single administrative domain where all nodes are assumed to be trusted. Applications that use Dynamo do not require support for hierarchical namespaces (a norm in many file systems) or a complex relational schema (supported by traditional databases). Dynamo can be characterized as a zero-hop DHT, where each node maintains enough routing information locally to route a request to the appropriate node directly, in order to avoid routing requests through multiple nodes and to meet the needs of latency-sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds [17].

While Dynamo gave developers a system that met their reliability, performance, and scalability needs, it did nothing to reduce the operational complexity of running large database systems. Since developers were responsible for running their own Dynamo installations, they had to become experts on the various components running in multiple data centers. They also needed to make complex trade-off decisions between consistency, performance, and reliability. This operational complexity was a barrier that kept them from adopting Dynamo [17].

12.2 Amazon DynamoDB Data Model

Amazon DynamoDB organizes data into tables containing items, and each item has one or more attributes.

Attributes
An attribute is a name-value pair. The name must be a string, but the value can be a string, number, string set, or number set. The following are all examples of attributes:

"ImageID" = 1
"Title" = "flower"
"Tags" = "flower", "jasmine", "white"
"Ratings" = 3, 4, 2

Item
A collection of attributes forms an item, and the item is identified by its primary key. An item's attributes are a collection of name-value pairs, in any order. The item attributes can be sparse, unrelated to the attributes of another item in the same table, and are optional (except for the primary key attribute). The table has no schema other than its reliance on the primary key. Items are stored in a table. The primary key uniquely identifies an item in a DynamoDB table. In the following diagram, Figure 9, ImageID is the attribute designated as the primary key:


Figure 9 Diagram of DynamoDB Data Model [18]

Notice that the table has a name, "my table", but the item does not have a name. The primary key defines the item: it is the item with primary key "ImageID" = 1. [18]

Tables

Tables contain items and organize information into discrete areas. All items in a table have the same primary key scheme. The attribute name (or names) to be used for the primary key is designated when a table is created, and the table requires each item in the table to have a unique primary key value. The first step in writing data to DynamoDB is to create a table and designate a table name with a primary key. The following is a larger table that also uses ImageID as the primary key to identify items.

DynamoDB also allows specifying a composite primary key, which enables designating two attributes in a table that collectively form a unique primary index. All items in the table must have both

attributes. One serves as a “hash partition attribute” and the other as a “range attribute.” For

example, there might be a “Status Updates” table with a composite primary key composed of

“UserID” (hash attribute, used to partition the workload across multiple servers) and a “Time”

(range attribute). Queries can then be executed to fetch either: 1) a particular item uniquely

identified by the combination of UserID and Time values; 2) all of the items for a particular hash

“bucket” – in this case UserID; or 3) all of the items for a particular UserID within a particular time

range. Range queries against “Time” are only supported when the UserID hash bucket is specified.

[18]
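To make the composite-key model concrete, the following is a minimal sketch using the AWS SDK for Python (boto3); the table and attribute names follow the "Status Updates" example above, while the region, capacity values, and query values are illustrative assumptions:

    import boto3

    # Region and credentials are assumptions for this sketch.
    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    # Create a table with a composite primary key: UserID is the hash
    # (partition) attribute, Time the range attribute.
    dynamodb.create_table(
        TableName="StatusUpdates",
        AttributeDefinitions=[
            {"AttributeName": "UserID", "AttributeType": "S"},
            {"AttributeName": "Time", "AttributeType": "N"},
        ],
        KeySchema=[
            {"AttributeName": "UserID", "KeyType": "HASH"},
            {"AttributeName": "Time", "KeyType": "RANGE"},
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    )
    # (In practice one would wait until the table status becomes ACTIVE.)

    # Case 3 above: all items for one UserID within a particular time range.
    response = dynamodb.query(
        TableName="StatusUpdates",
        KeyConditionExpression="UserID = :u AND #t BETWEEN :t1 AND :t2",
        ExpressionAttributeNames={"#t": "Time"},  # "Time" is a reserved word
        ExpressionAttributeValues={
            ":u": {"S": "user-123"},
            ":t1": {"N": "1260653179"},
            ":t2": {"N": "1285277179"},
        },
    )
    for item in response["Items"]:
        print(item)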


Table: My Images

Primary Key   Other Attributes
ImageID = 1   ImageLocation = https://s3.amazonaws.com/bucket/img_1.jpg; Date = 1260653179;
              Title = flower; Tags = Flower, Jasmine; Width = 1024; Depth = 768
ImageID = 2   ImageLocation = https://s3.amazonaws.com/bucket/img_2.jpg; Date = 1252617979;
              Rated = 3, 4, 2; Tags = Work, Seattle, Office; Width = 1024; Depth = 768
ImageID = 3   ImageLocation = https://s3.amazonaws.com/bucket/img_3.jpg; Date = 1285277179;
              Price = 10.25; Tags = Seattle, Grocery, Store; Author = you; Camera = phone
ImageID = 4   ImageLocation = https://s3.amazonaws.com/bucket/img_4.jpg; Date = 1282598779;
              Title = Hawaii; Author = Joe; Colors = orange, blue, yellow; Tags = beach, blanket, ball

Figure 10 DynamoDB Table

12.3 Amazon DynamoDB Features

As we said earlier, Amazon DynamoDB is based on the principles of Dynamo, a progenitor of NoSQL, and brings the power of the cloud to the NoSQL database world. It offers high availability, reliability, and incremental scalability, with no limits on dataset size or request throughput for a given table. Like the previously described services, DynamoDB is a managed, scalable system that handles all the complexities of scaling by partitioning and re-partitioning the data over more machine resources to meet the I/O performance requirements. It can spread the resources dedicated to a table across multiple servers in multiple Availability Zones, and there are no pre-defined limits to the amount of data each table can store.

In order to achieve high performance, all data items are stored on Solid State Drives (SSDs). Moreover, because not all attributes are indexed, the cost of read and write operations stays low: write operations update only the primary key index, which reduces the latency of both reads and writes.

One of the most important features of DynamoDB is performance predictability. There


are many applications that benefit from predictable performance as their workloads scale: online gaming, social graph applications, online advertising, and real-time analytics, to name a few. DynamoDB addresses this with "Provisioned Throughput": users can specify the request throughput capacity they require for a given table, and DynamoDB will allocate sufficient resources to the table to predictably achieve this throughput with low-latency performance. Throughput reservations are elastic and can be increased or decreased on demand using the AWS Management Console or the DynamoDB APIs. CloudWatch metrics provide the information needed to make informed decisions about the right amount of throughput to dedicate to a particular table.
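As a small sketch of this elasticity (the table name and capacity values are assumptions), a throughput reservation can be changed on demand through the API, for example with the AWS SDK for Python:

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    # Raise the reserved read/write capacity of an existing table;
    # DynamoDB re-partitions behind the scenes to meet the new target.
    dynamodb.update_table(
        TableName="StatusUpdates",
        ProvisionedThroughput={"ReadCapacityUnits": 100, "WriteCapacityUnits": 50},
    )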

Amazon DynamoDB also integrates with Amazon Elastic MapReduce (Amazon EMR) which allows

businesses to perform complex analytics on their large datasets using a hosted Hadoop framework

on AWS. [18]

Some of the ways in which EMR can be used with DynamoDB are as follows:

• Users can analyze data stored in DynamoDB using EMR and store the results of the analysis in S3, while leaving the original data in DynamoDB.
• Users can back up data from DynamoDB to S3 using EMR.
• Customers can also use Amazon EMR to access data in multiple stores, do complex analysis over this combined dataset, and store the results of this work.

12.4 Amazon SimpleDB

SimpleDB is another NoSQL DBaaS offered by Amazon. The data model used by Amazon SimpleDB

makes it easy to store, manage and query structured data. Developers organize their data-set into

domains and can run queries across all of the data stored in a particular domain. Domains are

collections of items that are described by attribute-value pairs, and can be thought of in terms analogous to the concepts of a traditional spreadsheet table. For example, take the details of the customer management database shown in the table below and consider how they would be represented in Amazon SimpleDB. The whole table would be a domain named "customers." Individual customers would be rows in the table, or items in the domain. The contact information would be described by column headers (attributes), and values are in individual cells.


CustomerID   First name   Last name   Street address   City          State   Zip     Telephone
123          Bob          Smith       123 Main St      Springfield   MO      65801   222-333-4444
456          James        Johnson     456 Front St     Seattle       WA      98104   333-444-5555

Figure 11 SimpleDB Table

Amazon SimpleDB differs from tables of traditional databases in important ways. It offers the

flexibility to easily go back later and add new attributes that apply only to certain records. For example, to add customers' email addresses and enable real-time alerts on order status, it is possible to add new records and any additional attributes to the existing "customers" domain. The

resulting domain might look something like this:

CustomerID   First name   Last name   Street address   City          State   Zip     Telephone      Email
123          Bob          Smith       123 Main St      Springfield   MO      65801   222-333-4444
456          James        Johnson     456 Front St     Seattle       WA      98104   333-444-5555
789          Deborah      Thomas      789 Garfield     New York      NY      10001   444-555-6666   [email protected]

Figure 12 SimpleDB table after adding additional attributes
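A minimal sketch of this flexibility with the AWS SDK for Python, which exposes SimpleDB as the "sdb" service (the attribute names are simplified and the email value is a placeholder of my own):

    import boto3

    sdb = boto3.client("sdb", region_name="us-east-1")

    # Domains are created without any schema.
    sdb.create_domain(DomainName="customers")

    # Items in a domain need not share the same attributes: this new
    # customer record simply carries an extra Email attribute.
    sdb.put_attributes(
        DomainName="customers",
        ItemName="789",
        Attributes=[
            {"Name": "FirstName", "Value": "Deborah"},
            {"Name": "LastName", "Value": "Thomas"},
            {"Name": "Email", "Value": "deborah.thomas@example.com"},  # placeholder
        ],
    )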

Domains have a finite capacity in terms of storage (10 GB) and request throughput, which is a considerable scaling limitation. Although it is possible to work around this limitation by partitioning workloads over many domains, this is not simple to implement. SimpleDB also fails to meet the requirement of incremental scalability, which is possible with DynamoDB.

Another limitation of SimpleDB is predictability of performance. SimpleDB indexes all attributes for

each item stored in a domain. While this simplifies schema design and provides query flexibility, it

has a negative impact on the predictability of performance. For example, every database write

needs to update not just the basic record, but also all attribute indices (regardless of whether all

indices are used for querying). Similarly, since the Domain maintains a large number of indices, its


working set does not always fit in memory. This impacts the predictability of a Domain’s read

latency, particularly as dataset sizes grow.

SimpleDB’s original implementation had taken the "eventually consistent"4 approach to the extreme and presented users with consistency windows that were up to a second in duration, which meant that developers used to a more traditional database solution had trouble adapting to it. The SimpleDB team eventually addressed this issue by enabling users to specify whether a given read operation should be strongly or eventually consistent. Since a consistent read can incur higher latency and lower read throughput, it is best used only when an application scenario mandates that a read operation absolutely must observe all writes that received a successful response prior to that read. For all other scenarios the default eventually consistent read yields the best performance. [18]
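The choice is made per request; a minimal sketch (the domain and item names follow the earlier example):

    import boto3

    sdb = boto3.client("sdb", region_name="us-east-1")

    # Default: eventually consistent read, the best-performing option.
    eventual = sdb.get_attributes(DomainName="customers", ItemName="789")

    # Opt in when the read must observe all previously acknowledged writes,
    # at the cost of higher latency and lower read throughput.
    strong = sdb.get_attributes(
        DomainName="customers", ItemName="789", ConsistentRead=True
    )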

12.5 Pricing

Like the other services, DynamoDB and SimpleDB keep the pay-only-for-what-you-use model. The

pricing is calculated based on the provisioned throughput capacity, index data storage and data

transfer.

When a DynamoDB table is created or updated, the capacity to be reserved for reads and writes is specified, and it is charged hourly based on the reserved capacity. A unit of Write Capacity

enables users to perform one write per second for items of up to 1KB in size. Similarly, a unit of

Read Capacity enables users to perform one strongly consistent read per second (or two eventually

consistent reads per second) of items of up to 1KB in size.
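As an illustrative calculation (the numbers are mine, not from the pricing pages): an application writing 40 items per second, each 1.5 KB in size, needs 2 write units per item because the size is rounded up to the next whole KB, so it needs 40 × 2 = 80 units of Write Capacity; reading those items back at 40 strongly consistent reads per second would likewise need 80 units of Read Capacity, while eventually consistent reads would halve that to 40 units.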

Amazon DynamoDB is an indexed datastore, and the amount of disk space the data consumes will

exceed the raw size of the data uploaded. Amazon DynamoDB measures the size of the billable data

by adding up the raw byte size of the uploaded data, plus a per-item storage overhead of 100 bytes

to account for indexing. The first 100MB stored per month are offered free and after that the price

is calculated per GB depending on region.

As with the other AWS services, there is no additional charge for data transferred between Amazon DynamoDB, SimpleDB, and other Amazon Web Services within the same Region. Data transferred

across Regions (e.g. between Amazon DynamoDB in the US East (Northern Virginia) Region and

Amazon EC2 in the EU (Ireland) Region), is charged at Internet Data Transfer rates on both sides of

the transfer.

Amazon SimpleDB is billed based on machine-hour utilization and data transfer, depending on the region where the SimpleDB domains are established.

Amazon SimpleDB measures the machine utilization of each request and charges based on the

amount of machine capacity used to complete the particular request (SELECT, GET, PUT, etc.),

normalized to the hourly capacity of a circa 2007 1.7 GHz Xeon processor. [18]

4 Eventually consistent – given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate through the system, and all the replicas will eventually be consistent.


13. Google Datastore

The Google App Engine Datastore is a schemaless object datastore providing robust, scalable storage, mainly targeted at web applications. App Engine's Datastore is built on top of Bigtable. Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many Google projects, like Google Earth and Google Finance, including web indexing, use Bigtable for storing data.

13.1 Datastore Data Model

The Datastore is basically a key-value paired database. The Datastore holds data objects known

as entities. An entity has one or more properties, named values of one of several supported data

types: for instance, a property can be a string, an integer, or a reference to another entity. Each

entity is identified by its kind, which categorizes the entity for the purpose of queries, and a key that uniquely identifies it within its kind [19, 20]. Entities of the same kind can have different

properties, and different entities can have properties with the same name but different value

types. The key consists of the following components:

• The entity's kind
• An identifier, which can be either
  o a key name string
  o an integer numeric ID
• An optional ancestor path locating the entity within the Datastore hierarchy.

Entities in the Datastore form a hierarchically structured space similar to the directory structure of

a file system. When an entity is created, it is possible to designate another entity as its parent; the new entity is a child of the parent entity. This creates the ancestor path. [20]

13.2 Queries and indexes

App Engine predefines a simple index on each property of an entity. An App Engine application can

define further custom indexes in an index configuration file. Because all queries on App Engine are

served by these pre-built indexes, the types of query that can be executed are more restrictive

than those allowed on a relational database with SQL [20]. In particular, the following are not

supported:

• Join operations
• Inequality filtering on multiple properties


• Filtering of data based on results of a subquery

All the queries in the Datastore are eventually consistent. A typical query includes the following:

• An entity kind to which the query applies
• Zero or more filters based on the entities' property values, keys, and ancestors
• Zero or more sort orders to sequence the results

In addition to retrieving entities from the Datastore directly by their keys, an application can

perform a query to retrieve them by the values of their properties [20].
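As an illustration, a minimal sketch using the Python ndb modeling API (the kind, property names, and values are mine):

    from google.appengine.ext import ndb

    class Image(ndb.Model):
        # An entity of kind "Image"; entities of the same kind may
        # still carry different properties.
        title = ndb.StringProperty()
        tags = ndb.StringProperty(repeated=True)
        uploaded = ndb.DateTimeProperty(auto_now_add=True)

    # Create an entity; its key is the pair (kind, identifier).
    key = Image(id="img-1", title="flower", tags=["flower", "jasmine"]).put()

    # Retrieve directly by key ...
    img = key.get()

    # ... or query by property values, served by the pre-built indexes.
    # Combining a filter with a sort order on another property requires
    # a custom composite index declared in index.yaml.
    flowers = Image.query(Image.tags == "flower").order(-Image.uploaded).fetch(10)

    # Roughly equivalent GQL:
    #   SELECT * FROM Image WHERE tags = 'flower' ORDER BY uploaded DESC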

13.3 Transactions

The Datastore can execute multiple operations in a single transaction. By definition, a transaction

cannot succeed unless every one of its operations succeeds. If any of the operations fails, the

transaction is automatically rolled back. This is especially useful for distributed web applications,

where multiple users may be accessing or manipulating the same data at the same time [20].
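A minimal sketch of a transactional update with ndb (the function and entity names are illustrative); the xg=True flag makes this a cross-group transaction, which may span up to five entity groups:

    from google.appengine.ext import ndb

    @ndb.transactional(xg=True)
    def move_tag(src_key, dst_key, tag):
        # All-or-nothing: if any operation fails, every change made in
        # the transaction is rolled back automatically.
        src, dst = src_key.get(), dst_key.get()
        src.tags.remove(tag)
        dst.tags.append(tag)
        src.put()
        dst.put()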

13.4 Scalability

The App Engine Datastore is designed to scale, allowing applications to maintain high performance

as they receive more traffic:

• Datastore writes scale by automatically distributing data as necessary.
• Datastore reads scale because the only queries supported are those whose performance scales with the size of the result set (as opposed to the data set). This means that a query whose result set contains 100 entities performs the same whether it searches over a hundred entities or a million. This property is the key reason some types of query are not supported [20].

13.5 High Availability

App Engine's primary data repository is the High Replication Datastore (HRD), in which data is

replicated across multiple data centers using a system based on the Paxos algorithm5. This provides

a high level of availability for reads and writes [20].

5 Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the

process of agreeing on one result among a group of participants. This problem becomes difficult when the

participants or their communication medium may experience failures.


13.6 Data Access

Development on the Datastore is done through Application Programming Interfaces (APIs), which can be accessed from either Python or Java. The App Engine Java SDK provides a low-level Datastore

API with simple operations on entities. The SDK also includes implementations of the Java Data

Objects (JDO) and Java Persistence API (JPA) interfaces for modeling and persisting data. These

standard interfaces include mechanisms for defining classes for data objects and for performing

queries [20].

The Python Datastore interface includes a rich data modeling API and a SQL-like query language

called GQL [21, 22].

13.7 Quotas and Limits

Google has defined quotas and limits for various aspects of an application's Datastore usage:

• Each call to the Datastore API counts toward the Datastore API Calls quota.

• Data sent to the Datastore by the application counts toward the Data Sent to Datastore API

quota.

• Data received by the application from the Datastore counts toward the Data Received from

Datastore API quota.

The total amount of data currently stored in the Datastore for the application cannot exceed the

Stored Data (billable) quota. This includes all entity properties and keys, as well as the indexes

needed to support querying those entities. The following table shows the limits that apply

specifically to the use of the Datastore [20]:

Limit                                                         Amount
Maximum entity size                                           1 MB
Maximum transaction size                                      10 MB
Maximum number of index entries for an entity                 2000
Maximum number of bytes in composite indexes for an entity    2 MB

Figure 13 Google Datastore Limits [20]


14. MongoLab/MongoDB and Cloudant/Apache CouchDB

Both CouchDB and MongoDB are document-oriented databases with schemaless JSON-style (CouchDB) and BSON-style (Binary JSON, MongoDB) object data storage [26]. Because they offer similar functionality, I will write about them together and give a short overview of their differences. First, what is a document-oriented database?

14.1 Document-oriented database

A document-oriented database or data store does not use tables for storing data. It stores each

record as a document with certain characteristics. Documents inside a document-oriented

database are similar, in some ways, to records or rows, in relational databases, but they are less

rigid. They are not required to adhere to a standard schema nor will they have all the same

sections, slots, parts, keys, or the like [24, 25]. For example, here is a document:

    { FirstName: "Bob", Address: "5 Oak St.", Hobby: "sailing" }

Another document could be:

    { FirstName: "Jonathan", Address: "15 Wanamassa Point Road",
      Children: [ { Name: "Michael", Age: 10 }, { Name: "Jennifer", Age: 8 },
                  { Name: "Samantha", Age: 5 }, { Name: "Elena", Age: 2 } ] }

Both documents have some similar information and some different. Unlike a relational database

where each record would have the same set of fields and unused fields might be kept empty, there

are no empty 'fields' in either document (record) in this case. This system allows new information

to be added and it does not require explicitly stating if other pieces of information are left out.

The benefit is that, when using a document-oriented database to store a large number of records in a huge database, a change in the number or type of fields does not require altering a table. All that is needed is to insert new documents with the new structure, and they are automatically part of the current datastore.

Documents are addressed in the database via a unique key that represents that document. Often,

this key is a simple string. In some cases, this string is a URI or path. Regardless, this key can be

used to retrieve the document from the database. Typically, the database retains an index on the

key such that document retrieval is fast.

One of the other defining characteristics of a document-oriented database is that, beyond the

simple key-document (or key-value) lookup that you can use to retrieve a document, the database

will offer an API or query language that will allow document retrieval based on their contents. For

example, you may want a query that gets you all the documents with a certain field set to a certain

value. The set of query APIs or query language features available, as well as the expected

performance of the queries, varies significantly from one implementation to the next.

Implementations offer a variety of ways of organizing documents, including notions of:

• Collections


• Tags
• Non-visible Metadata
• Directory hierarchies

14.2 MongoDB and CouchDB comparison

As I said earlier, both MongoDB and CouchDB are document-oriented databases with schemaless JSON-style object data storage. Table 4 shows a comparison of the two databases.

                          CouchDB                                     MongoDB
Data Model                Document-oriented (JSON)                    Document-oriented (BSON)
Interface                 HTTP/REST                                   Native drivers; REST
Large Objects (Files)     Yes (attachments)                           Yes (GridFS)
Horizontal Partitioning   BigCouch, CouchDB Lounge, Pillow            Auto-sharding
Object Storage            Database contains documents                 Database contains collections;
                                                                      collection contains documents
Query Method              Map/Reduce (JavaScript) creating views,     Dynamic object-based query language;
                          plus range queries                          Map/Reduce (JavaScript)
Replication               Master-master with custom conflict          Master-slave
                          resolution function
Concurrency               MVCC (Multi-Version Concurrency Control)    Update in-place
Distributed Consistency   Eventually consistent                       Strong consistency; eventually
                                                                      consistent reads from secondary replicas
Written in                Erlang                                      C++

Table 4 Comparison of CouchDB and MongoDB

14.3 MVCC – Multi-Version Concurrency Control

One big difference is that CouchDB is MVCC-based, while MongoDB is more of a traditional update-in-place store [24, 25, 27]. MVCC is very good for certain classes of problems:

• Problems which need intense versioning, and problems with offline databases that re-sync later
• Problems where you want a large amount of master-master replication happening.

With MVCC, however, there are some considerations:


• The database must be compacted periodically if there are many updates.
• When conflicts occur on transactions, they must be handled by the programmer manually (unless the database also does conventional locking, although then master-master replication is likely lost) [25].

MongoDB updates an object in-place when possible. Problems requiring high update rates of

objects are a great fit, and compaction is not necessary. Mongo's replication, without the MVCC model,

is more oriented towards master/slave and auto failover configurations than to master-master

setups. MongoDB promises high write performance, especially for updates.

14.4 Scalability

One fundamental difference is that a number of Couch users use replication as a way to scale, while Mongo uses auto-sharding for scalability. There are a couple of options for sharding CouchDB, available as open source or from third-party developers; the best known are CouchDB Lounge and BigCouch, used by cloudant.com [25, 26].

BigCouch can be seen as an Erlang/OTP application that allows creating a cluster of CouchDBs distributed across many nodes/servers [30]. Instead of one big monolithic CouchDB, the result is an elastic data store which is fully CouchDB API-compliant.

The clustering layer is most closely modeled after Amazon's Dynamo, with consistent hashing,

replication, and quorum for read/write operations. CouchDB view indexing occurs in parallel on

each partition, and can achieve impressive speedups as compared to standalone serial indexing.

[25]

14.5 Querying

CouchDB uses a view model which acts as an ongoing incremental map-reduce function, providing

a constantly updated view of the database. From the HTTP interface different views can be

accessed and data can be retrieved by key/index as well. The view model is well-suited for statically

definable queries; job-style operations. There is elegance to the approach, although these

structures must be pre-declared for each query to be executed. They can be thought of as

materialized views.[27]
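As an illustration, a minimal sketch of declaring and querying such a pre-declared view over CouchDB's HTTP interface from Python (the server URL, database, and field names are assumptions):

    import requests

    COUCH = "http://localhost:5984"  # assumed CouchDB server
    DB = "images"

    # A design document: the JavaScript map function emits one row per
    # tag, and the built-in _count reduce tallies the rows per key.
    design = {
        "views": {
            "by_tag": {
                "map": "function(doc) {"
                       "  if (doc.Tags) {"
                       "    doc.Tags.forEach(function(t) { emit(t, 1); });"
                       "  }"
                       "}",
                "reduce": "_count",
            }
        }
    }
    requests.put("%s/%s/_design/images" % (COUCH, DB), json=design)

    # Query the materialized view: how many documents carry the tag "flower"?
    resp = requests.get(
        "%s/%s/_design/images/_view/by_tag" % (COUCH, DB),
        params={"key": '"flower"', "group": "true"},
    )
    print(resp.json())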

Mongo uses traditional dynamic queries. As with, say, MySQL, it can do queries where an index

does not exist, or where an index is helpful but only partially so. Mongo includes a query optimizer

which makes these determinations. This is very nice for inspecting the data administratively, and

this method is also good when indexes are not used, such as in insert-intensive collections. When an

index corresponds perfectly to the query, the Couch and Mongo approaches are conceptually

similar.[24]
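By contrast, a minimal sketch of a dynamic MongoDB query with the PyMongo driver (the connection string, collection, and field names are assumptions; MongoLab plans expose a similar connection URI):

    from pymongo import MongoClient, DESCENDING

    client = MongoClient("mongodb://localhost:27017")  # assumed server
    images = client["mydb"]["images"]

    # Dynamic query: nothing has to be pre-declared, and the query
    # optimizer will use an index on "Tags" if one exists.
    for doc in images.find({"Tags": "flower"}).sort("Date", DESCENDING).limit(10):
        print(doc)

    # Adding an index afterwards speeds up the same query unchanged.
    images.create_index([("Tags", 1)])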


14.6 Atomicity and Durability

Both MongoDB and CouchDB support concurrent modifications of single documents. Both forego

complex transactions involving large numbers of objects.

CouchDB is a "crash-only" design where the database can terminate at any time and remain

consistent.[25,27]

Previous versions of MongoDB used a storage engine that would require a repair database operation when starting up after a hard crash. Newer versions offer durability via journaling.[24]

14.7 Map Reduce

Both CouchDB and MongoDB support map/reduce operations. For CouchDB map/reduce is

inherent to the building of all views [24]. With MongoDB, map/reduce is only for data processing

jobs but not for traditional queries.[25]

14.8 Javascript

Both CouchDB and MongoDB make use of Javascript. CouchDB uses Javascript extensively including

in the building of views.

MongoDB supports the use of JavaScript but more as an adjunct. In MongoDB, query expressions

are typically expressed as JSON-style query objects, however one may also specify a JavaScript

expression as part of the query. MongoDB also supports running arbitrary javascript functions

server-side and uses JavaScript for map/reduce operations.

14.9 REST

Couch uses REST as its interface to the database. MongoDB relies on language-specific database

drivers for access to the database over a custom binary protocol. Of course, a REST interface can be

added on top of an existing MongoDB driver at any time.

14.10 MongoLab and Cloudant

The most popular platforms offering managed instances of MongoDB and CouchDB as a service are MongoLab and Cloudant, respectively. MongoLab offers two tiers of plans, shared and dedicated, in order to accommodate a range of use cases and budgets. The database can be hosted on Amazon AWS or in the Rackspace Cloud. With the shared plan, MongoLab offers one MongoDB database on a shared mongod server process on a shared VM host, with replication for backups [28]. The architecture is shown in Figure 14 below.



Figure 14 MongoLab Shared Plan

The shared plan is offered for free up to 250 MB; there are three more options available (Small, Medium, and Large), and additional storage is available as an option.

The dedicated plan is offered in two variants: with one dedicated node, and with two or more dedicated nodes. The dedicated plan with one node is a single dedicated VM with automatic failover to a secondary on a shared VM. It offers high availability through the replicas, but it does not allow reading from the replicas as a means to increase read throughput. It also offers monitoring services through MongoDB Monitoring Service (MMS). MMS is a 10gen6 web service that monitors




and graphs the performance of MongoDB clusters, servers and databases over time. It can monitor

important statistics such as resident memory usage, rate of database operations, write-lock queue

depth, and CPU, alongside any other MongoDB instances that might be running outside of MongoLab [28].

6 10gen is a software company that develops and provides commercial support for the open-source MongoDB database.

The dedicated plan with two or more nodes can scale to as many dedicated nodes of equal size as needed. In addition to providing high availability, it also scales read throughput horizontally through the creation of a Replica Set cluster of more than one member. The architectures of both dedicated plans are shown in Figures 15 and 16. With the dedicated plans, hosting is available on Amazon EC2 or in the Rackspace Cloud.

Figure 15 Dedicated Plan Architecture: 1 Dedicated Node


Figure 16 Dedicated Plan Architecture: 2+ Dedicated Nodes

Cloudant.com offers multi-tenant and single-tenant (private) CouchDB database clusters that are hosted and scaled within or across multiple top-tier data centers around the globe. In all offered plans, Cloudant automatically replicates the data across this network as needed to push it closer to the global user base, reduce network latency overhead, ensure 24x7 availability, and provide disaster-recovery capabilities. [27]

Cloudant provides a domain through which to access the data layer. Behind that domain, Cloudant stores the data in a horizontally scalable version of the CouchDB database; the horizontal scalability is achieved with BigCouch [30], as mentioned earlier. The data layer automatically handles load

balancing, clustering, backup, growing/shrinking the clusters, and high availability. It also provides

private, single-tenant clusters that exist entirely within a data center or that span across data

centers to provide real-time data distribution to multiple locations.[29]

Regardless of whether it is a multi- or single-tenant data layer, data can be replicated and synchronized (see the sketch after this list) between Cloudant data centers and:


• other Cloudant data centers, for high availability, backup, or for scalability and performance
• non-Cloudant data centers
• disconnected devices/networks, which is great for mobile apps
• edge databases such as data marts or spreadsheets, great for independent analytic projects
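The sketch below shows what such replication looks like at the API level: Cloudant exposes CouchDB's _replicate endpoint, so a single HTTP request sets up (optionally continuous) replication between two databases. The URLs and credentials are placeholders:

    import requests

    SOURCE = "https://user:password@example.cloudant.com/customers"
    TARGET = "https://user:password@backup.example.com/customers"

    # continuous=True keeps the target synchronized as new changes arrive.
    resp = requests.post(
        "https://user:password@example.cloudant.com/_replicate",
        json={"source": SOURCE, "target": TARGET, "continuous": True},
    )
    print(resp.json())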

The Cloudant Data Layer also includes a number of dashboards that allow viewing and controlling data layer performance, usage, search indexing, billing, and other metrics [29].

Cloudant pricing is a little different from MongoLab's: it is based on data stored and millions of requests per month (MReq/mo). There is a free starting plan that includes 250 MB of storage and 0.5 MReq/mo. [25, 28]

Data storage is counted in a way that includes the size only of the latest revision of all documents,

plus the size of the view indexes. Older revisions and deleted documents do not count towards size

quotas. They are purged automatically after a certain time.

Requests are approximately the number of document reads and writes against the database.

15. What benefits do cloud databases and cloud computing bring for small and medium organizations?

For small and medium business owners, saving money and time whenever possible is critical to their success. Regardless of whether it is a startup or a more mature business, cloud software and services in general can help cut costs and let the owners concentrate on the core of their business. The benefits of cloud computing for small business sound attractive, but that does not mean it has no disadvantages or that it is right for every business. As I have shown in the previous part of this paper, there are many options to choose from among the database-as-a-service offerings, and once all the other available cloud services are included, choosing the right provider and the right services for a particular business is not an easy task. Here I refer to cloud computing in general, as the benefits of the DBaaS solutions are a subset of the benefits of cloud computing. First I will describe the main benefits.

15.1 Advantages for Small Business

I will speak about the advantages and disadvantages of cloud computing in general terms, as the same apply to cloud databases. The main advantages include:

Lower Initial Investment – the only things needed to start using the cloud are a computer and an Internet connection; it is possible to take advantage of most cloud offerings without investing in any new hardware or specialized software, or adding staff. This is one cloud computing


advantage that has universal appeal regardless of the industry or the type of business. It allows organizations, and especially startups, to invest in new projects and ideas without the risk of a big loss.

Easier to manage - There are no power requirements or space considerations to think about

and users do not have to understand the underlying technology in order to take advantage

of it. There is no need to maintain or update any hardware or software.

Planning time is considerably less as well since there are fewer logistical issues.

Pay as You Go - Large upfront fees are not the norm when it comes to cloud services. Most of the cloud services, as I wrote earlier in this paper, are available on a month-to-month basis with no long-term contracts. This also gives the benefit of keeping multiple projects running without enormous expenses.

Scalability - Cloud computing can be scaled to match the changing needs of the small

business as it grows. Licenses, storage space, new instances and more can be added as

needed.

Deploy Faster – usually it is possible to get up and running significantly faster with cloud

services than when there is a need to plan, buy, build, and implement in-house. With many

software as a service applications or other cloud offerings it is possible to start using the

service within hours or days rather than weeks or months.

Location Independent - Because services are offered over the Internet, there are no limits

to using cloud software or services just at work or only on one computer. Access from

anywhere is a big advantage for people who travel a lot, like to be able to work from home,

or whose organization is spread out across multiple locations.

Device independent - Most web-based software and cloud services are not designed

specifically for any one browser or operating system. Many can be accessed via PC, Mac, on

tablets and through mobile phones.

15.2 Disadvantages of Cloud Computing

While the advantages of cloud computing are clear and easy enough to understand, there are

potentially a few disadvantages that need to be considered carefully.

Downtime - While we would like to think our data or the cloud-based services that we use

are available on demand all day every day, the truth is they are not. System uptime is

entirely out of our hands with cloud services. There are two types of downtime:

o Scheduled downtime might be required to upgrade software, install new hardware,

or perform other routine maintenance. Typically, scheduled downtime is

infrequent, announced well in advance, and takes place at non-peak hours where


usage is likely to be low so as to minimize interruption to the customer.

o Unscheduled downtime, otherwise known as an outage, is indicative of some sort

of failure or problem. Outages are rare, but they do happen even to the larger, more established cloud providers, and when one occurs there is not much that can be done other than wait.

Security Issues - This is perhaps one of the most discussed issues when considering moving to the cloud. You are turning over data about your business and your customers to a third

party and entrusting them to keep it safe. Without the proper level of security, your data

could be exposed to users outside your company or accessed by a hacker.

Less control over data-loss prevention - With cloud services, you will have to give up some degree

of control over the prevention of data loss. That is in the hands of the cloud service

provider.

Integration and Customization - Some web based software solutions and cloud services are

offered as a one size fits all solution. If you need to customize the application or service to

fit specific needs or integrate with your existing systems, doing so may be challenging,

expensive, or not an option.

15.3 Main things to be considered when moving to the cloud

Migrating to a cloud solution is usually fairly easy: the service provider typically helps with setting everything up and transferring the information to the hosted environment. But there are some considerations that an organization should look at.

Prioritize applications

Focus on the applications that provide the maximum benefit for the minimum cost/risk. Measure

the business criticality, business risk, functionality of the services and impact to data sovereignty,

regulation and compliance. Prioritize which applications to migrate to the cloud and in which order.

Consumption models

As can be seen from the different pricing models used by the services and providers described

earlier, each provider has a different consumption model for how you procure and use the service.

These consumption models need to be considered carefully from two perspectives – frequency of

change and volume.

Data residency and legal jurisdiction

This issue is often overlooked, yet business information held outside an organization's own country is subject to the commercial law of the country it is held in. Most organizations therefore decide to keep their data in the country of origin to ensure that local law still applies to their business information.


Performance and availability

When moving to a distributed IT landscape with some functionality in the cloud, where there is

integration between these cloud applications and on-premise applications, then performance of

this distributed functionality needs careful consideration and potentially increased processing to

ensure service delivery. Similarly, availability will need careful assessment because an application

that is all in the cloud, or distributed across the cloud and on-premise, will have different

availability characteristics to the legacy on-premise application. Organizations also need to ensure

that their local and wide area networks are enabled for cloud and will support the associated

increase in bandwidth and network traffic.

Service integration

When moving an application to the cloud, continuity of service and service management needs to

be considered. The service management role changes to more of a service integration role. An

alternative to providing this capability through the in-house service management function is to use an outsourcing organization.

Architecting for the cloud and cloud application maturity

Cloud Computing provides real benefits for organizations but to realize these benefits the

applications being utilized sometimes need to be architected to take advantage of the scalable

nature of Cloud Computing. While new applications should be built with this in mind, legacy applications are often built to take advantage of legacy systems and hence may not be able to truly leverage the benefits the Cloud can bring without significant re-architecting. The amount of re-architecting needed even differs between moving to one Cloud Computing provider and moving to the next, so the Cloud provider selection process should include questions about the Cloud provider's technological underpinning, so that if re-architecting is needed, it does not come as a surprise. Currently, application maturity is extremely variable from one application to the next.

Exit strategy

Before adopting a cloud service provider or application ensure you consider your exit strategy, e.g.

data extraction, and put costs for this strategy into your business case and service costs. Many

people are rightly concerned about moving to Cloud Computing and being fixed to one provider.

This is indeed a concern and one which should not be brushed off lightly. That said however, Cloud

Computing tends to be much more transparent when it comes to lock in and so organizations

should be able to accurately gauge the risks. Organizations should look at a number of different

factors:

o Does the vendor use industry standard APIs or proprietary ones?

o Does the vendor provide quick and easy data extraction in the event that the

customer wishes to shift?

o Does the vendor use open standards or have they created their own ways of doing

things?

o Can the Cloud Computing service be controlled by third party control panels?


Data migration

Moving data into or out of a SaaS and DBaaS application may require considerable transformation

and load effort.

Service and transaction state

Maintaining continuity of the state of in-flight transactions at the point of transition into the cloud

will need consideration. This will also be the case at the point of exit as well.

Service Level Agreement (SLA)

Small business owners usually do not have experience with these types of agreements, and not reviewing them might open up Pandora's box without their knowing it. The business impact of the SLA must be carefully considered and analyzed. Close attention should be paid to the availability guarantees and

penalty clauses:

• Does the availability fit in with the organization's business model?
• What do you need to do to receive the credits when the hosting provider fails to achieve the guaranteed service levels?
• Are the credits processed automatically, or do you need to ask for them in writing?

Usually the cloud providers have one SLA for all users and do not provide customization of the SLA.

All these considerations must be evaluated carefully before moving to a cloud-based solution in order to mitigate the risk and be confident of choosing the right cloud services that will support and ensure the growth of the business.


16. Will cloud computing reduce the budget?

A small business which decides to own and manage its own IT equipment sometimes fails to recognize that, over time, this equipment and its components will begin to deteriorate, causing the system to crash or experience latency. This may pose a bigger problem if the company has remote

users and satellite offices. Without much thought, an entrepreneur will surely put in more money by

upgrading its equipment and adding extra redundancy. Additional IT support personnel may be hired.

The cycle will truly become vicious as new equipment will depreciate and break down after a couple of

years.

In general, IT eats up a huge part of the company’s budget not only because of the costly equipment

but its maintenance and upgrade costs as well. Upgrades, security threats, and unexpected system

crashes often cost a lot of money. With cloud computing, all these IT capital investments and expenses

are borne by the third-party supplier. The business owner will just have to budget for the system’s

monthly subscription fees per user. There is also no need to invest in IT in anticipation of future demand, because cloud computing can be deployed on demand when needed. An entrepreneur can settle on a cloud computing service for better forecasting of the IT budget.

Cloud computing simplifies budgeting. The business owner need not worry about merging projects or complex expansion because he only needs to pay for the resources his company uses. Also, when the number of users is reduced, the accompanying cloud computing costs are reduced as well. The traditional IT process of procurement, installation, management, protection, and support of an on-premise system can be a vicious cycle and contradicts the company's goal of reducing recurring expenses. Cloud computing services and resources are used only when needed, which greatly reduces recurring expenditures and helps the company adapt to the frequently evolving conditions of the market. With cloud computing, a business owner can better manage uncertainties. He exposes his company to

greater risks if he invests a lot of money on IT. Because of growing demand, a lot of businesses

overinvest in Information Technology which eventually increases expenses and uncertainties of IT

management and maintenance. Cloud computing vendors reduce the company’s reliance on on-

premise IT systems thereby assuming the uncertainties and costs of IT support, security, backups, and

hardware. The business owner, therefore, has no more liability in procurement, management, and

upgrade of IT equipment. Growth opportunities can then be pursued without having to bear the

uncertainties of important capital outlays.

One benefit usually overlooked by small entrepreneurs is the fact that cloud computing also reduces energy costs, because the company has less IT equipment to maintain. IT servers require specific temperatures to run perfectly. When a business owner decides to use cloud computing services, energy bills are reduced because expensive IT equipment is moved to a safe, monitored, and disaster-proof IT center.

When on-site IT problems arise, employee productivity is inevitably affected and stress levels rise. When using cloud services, employees can do their work anywhere and anytime they wish. They can work from home by accessing the software through an Internet connection,


which also improves morale. Travel time and costs are significantly reduced. Each employee who is

given access to the software can even ask the cloud computing supplier's team for support with regard to problems which may arise while he is using the system. Management can even monitor each employee's activity remotely through the management consoles provided by the supplier.


17. Conclusion

Database management systems have long been an integral part of computing. As the whole IT world moves to the cloud, whether you are assembling, managing, or developing on a cloud computing platform, you need a cloud-compatible database. In this work I gave a short overview of cloud computing and presented a couple of the currently available companies that offer database as a service in the cloud. Although they differ from the most widely used "traditional" relational database systems, and most of them might require revision and recoding of existing applications, it is obvious that they bring a lot of benefits, especially with the offer of fully managed and automated database administration, tuning, and optimization.

Cloud database systems are built to use the power of the cloud: they are extremely scalable and elastic, giving the opportunity to start small and expand as needed, mitigating the risks and uncertainties of investing in IT equipment and professional IT support. Cloud computing in general, with its flexible pricing models and different plans, presents one of the best solutions for startups and small companies that are developing new products and do not have the financial power to risk investing in uncertain projects.

Cloud databases are an ideal solution for web and mobile applications. The fact that most of the DBaaS offerings are tightly integrated with other PaaS offerings gives organizations the opportunity to focus fully on developing their products instead of spending resources on administration of the platform.

Despite the benefits offered by cloud-based DBMS, many people still have apprehensions about them.

This is most likely due to the various security issues that have yet to be dealt with. Storing critical business data in the cloud and entrusting its security to a third party, where the data will be spread over multiple hardware stacks and across multiple data centers, can be a big security issue. In my opinion, the cloud is perhaps still not ready for critical enterprise applications which store highly sensitive data, but it is definitely ready to be used for testing and development of new projects.

Many companies, including some huge multinational corporations, have already moved to cloud computing because it is less costly, efficient, and agile compared to on-site IT systems. Therefore, small and medium-scale enterprises should follow suit: if cloud computing has proven to work for these big enterprises, it will surely work for small and medium enterprises.


Appendix

Case studies from the industry – Amazon RDS

Airbnb, a vacation rental firm, kept its main database in Amazon RDS. The consistency between locally

hosted MySQL and Amazon RDS MySQL facilitated the migration to AWS.

A significant architecture consideration for Airbnb was that Amazon provided the underlying

replication infrastructure. “Amazon RDS supports asynchronous master-slave replication,” wrote Tobi

Knaup.21 Knaup added that the hot standby, which ran in a different AWS Availability Zone, was

updated synchronously with no replication lag. Therefore, if the master database failed, the standby

was promoted to the new master with no loss of data. [32]

Case studies from the industry – Microsoft SQL Azure

Xerox Corporation ported an on-premise enterprise print capability to a public cloud environment.

This capability allowed mobile users to find printers with their smartphones and route printouts. As

the on-premise version leveraged Microsoft SQL Server for the database component, Xerox selected

Microsoft SQL Azure for cloud storage. This approach allowed them to reuse their prior investments in

SQL Server-based technology and .NET, and minimize the technical challenges of porting to a cloud-based environment. They were also able to minimize their skills-based challenges because the

development team was trained on Microsoft products.

Xerox used SQL Azure for “user account information, job information, device information, print job

metadata, and other such data,” but the actual print files were stored in Azure Blob Storage, not SQL

Azure. Azure Blob Storage had different pricing and characteristics than SQL Azure. For example,

unlike SQL Azure, Blob Storage was not limited to 10 GB (Web edition) or 50 GB (Business edition).[33]

Case studies from the industry – Amazon DynamoDB

"When IMDb launches features to our over 110MM monthly unique users worldwide, we want to be

prepared for rapid growth (1000x scale), and for customers to use our software in exciting and

different ways," said H.B. Siegel, CTO, IMDb. "To ensure we could scale quickly, we migrated IMDb’s

popular 10 star rating system to DynamoDB. We evaluated several technologies and chose DynamoDB

because it is a high-performance database system that scales seamlessly and is fully managed. This

saves us a ton of development time and allows us to focus our resources on building better products

for our customers, while still feeling confident in our ability to handle growth."[34]


Case studies from the industry – Amazon SimpleDB

Alexa Web Search crawled the Internet every night and generated a Web-scale datastore with

terabytes of data. They wanted to allow users to run custom queries against this data and generate up

to 10 million results.

To provide this service, Alexa’s architecture team leveraged a combination of AWS services that

included EC2, S3, SQS, and SimpleDB. SimpleDB was used for status information because it was

“schema-less.” AWS’ Jinesh Varia wrote, “There is no need to provide the structure of the record

beforehand. Every controller can define its own structure and append data to a ‘job’ item.” SimpleDB

allowed components of the architecture to independently and asynchronously read and write state

information (e.g., status of jobs in-process). While a good fit for state information, SimpleDB, which

had a 10 GB limit per domain, was not used for the nightly multiterabyte Internet crawl.[35]


References

[1]. Cloud Computing Bible - Barrie Sosinsky, January 2012. ISBN: 978-0-470-90356-8

[2]. http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf

[3]. Introduction to cloud computing - Ivanka Menken, Emereo Publishing 2011

[4]. Understanding PaaS - Michael P. McGrath, O'Reilly Media January 2012

[5]. Data Management Challenges in Cloud Computing Infrastructures - Divyakant Agrawal, A. E., University of California, Santa Barbara.

[6]. Database Scalability, Elasticity, and Autonomy in the Cloud - Divyakant Agrawal, A. E., Department of Computer Science, University of California at Santa Barbara.

[7]. Cloud Computing: Principles, Systems and Applications - Gillam, N. A., Springer 2010

[8]. http://relationalcloud.com/index.php?title=Database_as_a_Service

[9]. The multitenant, metadata-driven architecture of Database.com - Database.com Getting Started Series White Paper

[10]. Megastore: Providing Scalable, Highly Available Storage for Interactive Services - Jason Baker, C. B.-M. http://pdos.csail.mit.edu/6.824-2012/papers/jbaker-megastore.pdf

[11]. Inside SQL Azure. Microsoft TechNet. http://social.technet.microsoft.com/wiki/contents/articles/1695.inside-windows-azure-sql-database.aspx

[12]. https://www.windowsazure.com/en-us/home/features/data-management/

[13]. https://www.windowsazure.com/en-us/pricing/details/#storage

[14]. http://aws.amazon.com/rds/

[15]. https://developers.google.com/appengine/docs

[16]. http://en.wikipedia.org/wiki/Paxos_algorithm

[17]. Werner Vogels' weblog on building scalable and robust distributed systems http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

[18]. http://aws.amazon.com/dynamodb/

[19]. http://www.databasejournal.com/features/mssql/article.php/3823471/Cloud-Computing-with-Google-DataStore.htm


[20]. Google AppEngine Documents https://developers.google.com/appengine/docs/java/overview - Product page

[21]. Google AppEngine Documents https://developers.google.com/appengine/docs/phyton/overview - Product page

[22]. Google AppEngine Documents https://developers.google.com/appengine/docs/python/datastore/gqlreference

[23]. MongoDB - http://www.mongodb.org/ - Product Page

[24]. MongoDB blog: http://blog.mongodb.org – Product Blog

[25]. Cloudant Blog http://blog.cloudant.com/cloudant-bigcouch-is-open-source - Product Blog

[26]. http://bsonspec.org/

[27]. http://wiki.apache.org/couchdb/ Product wiki

[28]. http://www.mongolab.com – Product page

[29]. Technical Overview: Anatomy of the Cloudant Data Layer Service - 2012 Cloudant, Inc.

[30]. http://bigcouch.cloudant.com/

[31]. Building Scalable Database Solution with SQL Azure - Introducing Federation in SQL Azure. http://blogs.msdn.com

[32]. http://aws.amazon.com/solutions/case-studies/airbnb/

[33]. https://www.windowsazure.com/en-us/home/case-studies/

[34]. http://aws.amazon.com/dynamodb/testimonials/#imdb

[35]. http://aws.amazon.com/solutions/case-studies/alexa/

[36]. White Paper - Top Ten Data Management Trends - Scalability Experts - Raj Gill, Y. B.

[37]. http://nosql.mypopescu.com/post/1669537044/sql-and-nosql-in-the-cloud

[38]. White Paper - NOSQL for the Enterprise - Neo Technology (2011)

[39]. White Paper - Database as a Cloud Service - Scalability Experts - Wolter, R. (2011)