
  • 8/3/2019 The Cloud Database

    1/19

    WHITE PAPER

WHAT IS A CLOUD DATABASE?

Robin Bloor, PhD


    Copyright 2011, The Bloor Group

All rights reserved. Neither this publication nor any part of it may be reproduced or transmitted or stored in any form or by any means, without either the prior written permission of the copyright holder or the issue of a license by the copyright holder. The Bloor Group is the sole copyright holder of this publication.

    22214 Oban Drive Spicewood TX 78669 Tel: 512-524-3689

    Email contact: [email protected]

www.TheVirtualCircle.com

www.BloorGroup.com


    Executive Summary

This white paper was commissioned by Algebraix Data. The goal of the paper is to provide a definition of what a cloud database is, and in the light of that definition, examine the suitability of Algebraix Data's technology to fulfill the role of a cloud database.

    Here is a brief summary of the contents of this paper:

We define a cloud DBMS (CDBMS) to be a distributed database that can deliver a query service across multiple distributed database nodes located in multiple data centers, including cloud data centers. Querying distributed data sources is precisely the problem that businesses will encounter as cloud computing grows in popularity. Such a database also needs to deliver high availability and cater for disaster recovery.

In our view, a CDBMS only needs to provide a query service. SOA already delivers connectivity and integration for transactional systems, so we see no need for a CDBMS to cater for transactional traffic - only query traffic. A CDBMS needs to scale across large computer grids, but it also needs to be able to span multiple data centers and, as far as is possible, cater for slow network connections.

We review traditional databases, focusing primarily on relational databases and column store databases, concluding that such databases, as currently engineered, could not fulfill the role of a CDBMS. They have centralized architectures, and such architectures would encounter a scalability limit at some point, both within and between data centers. We conclude that a distributed peer-to-peer architecture is needed to satisfy the CDBMS characteristics that we have defined.

We move on to examine the Hadoop/MapReduce environment and its suitability as a CDBMS. It has much better scalability for many workloads than relational or column store databases, because of its distributed architecture. However, it was not built for mixed workloads or for complex data structures or even for multitasking. In its current form it emphasizes fault tolerance. It succeeds as a database for very large volumes of data, but does not have the characteristics of a CDBMS.

Finally, we examine Algebraix Data's technology as implemented in its database product A2DB. Our conclusion is that it has an architecture which is suitable for deployment as a CDBMS. Our view is as follows:

- A2DB's unique capability to reuse intermediate results of queries that it has previously executed contributes to its delivering high performance at a single node.

- The same performance characteristics can be employed to speed up queries that join information between a local node and remote nodes, whether in the same data center or in a remote data center.

- Algebraix Data's technology is capable of global optimization, balancing the performance requirements of both global and local queries.

- Additionally, the technology can deliver high-availability, fault-tolerant operation.

We are aware that Algebraix Data has not deployed and tested its database A2DB in the role of a CDBMS; hence our conclusion is not that it qualifies as a CDBMS, but that it has an architecture that would enable it to be tested in this role.


    The Cloud Database - In Concept

Cloud computing is a major driving trend for IT. Over 36 percent of US companies already run applications in the cloud (Mimecast survey, February 2010) and the major cloud vendors are growing their revenues and customer bases rapidly. Given the trends, fairly soon the majority of IT departments will be running applications in the cloud, possibly using more than one cloud provider. So corporate computing will inevitably become much more distributed than it currently is, spreading itself across multiple data centers. This will pose management, architectural and performance challenges - and foster innovation to meet those challenges.

    The Cloud Implementation of Transactional and Query Systems

If we think solely in terms of database technology, the wider distribution of transactional systems, such as OLTP systems, communications applications and workflow systems, will not pose a severe problem at the data level. The sweeping success of Salesforce.com demonstrates this. The data problems of placing your CRM system in the cloud are resolved easily enough by the regular transfer of customer and other data from the cloud to the data center.

Indeed the broad success of SOA demonstrates the same thing. Loosely coupling silo transaction systems together works fine as regards the workflow between transactional systems. Because the volume of data passed between applications within a SOA is low, it is highly unlikely that the relatively slow speeds of the Internet will be prohibitive to placing some of these applications in the cloud. There will be exceptions, but in principle it will work well most of the time.

For query workloads typified by BI applications, distribution of the data across multiple data centers is more problematic. There are three main reasons for this:

1. Internet speeds are generally slow compared to data center network speeds, and this limits performance considerably. This issue can be addressed through high-speed direct connections, but this becomes expensive very quickly.

2. Query workloads are not as predictable as transactional workloads. We can predict transactional workloads reasonably accurately, but we cannot easily predict specifically what questions a user might wish to ask - hence we are less able to predict the workload. This has profound architectural implications for the distribution of query systems. Stated simply: we don't know where best to locate the data ahead of time, because we do not know which sets of data users may wish to join together.

3. Even if we achieve an efficient distribution of data, query workloads involve the movement of much greater volumes of data than transactional workloads. That movement of data will inevitably be slower than if the data was located in a single data center.

This set of constraints suggests that it may be better to centralize query workloads in one physical location. This is traditionally how most BI domains have been constructed: around a big data warehouse with subsets of data drawn off to serve individual BI applications. But ultimately that approach fails the test of scalability. A centralized architecture scales poorly over very large numbers of nodes. Bottlenecks eventually arise.


    Towards a Cloud Database

For the moment, we will set aside the fact that there are many challenges in implementing a distributed architecture for query workloads across several data centers, and provide a view of what a cloud database would look like.

We can define a cloud DBMS (CDBMS) as a distributed database that delivers a query service across multiple distributed database nodes located in multiple geographically-distributed data centers, both corporate data centers and cloud data centers. So think in terms of an organization with some applications running in the cloud. Perhaps Salesforce.com plus some hosted transactional web applications in some remote data center, plus local applications including BI applications split between two data centers. Such a situation is illustrated in Figure 1. It is the typical situation that companies will have to deal with as we move forward. In practice, a query can originate from anywhere: from a PC within the corporation, which is connected by a fast line to the local data center, from a PC in the home via a VPN line, from a laptop via a WiFi connection, or from a smart phone via a 3G or 4G connection. For that reason we represent a query here as coming through the Internet, implying that the response will possibly travel through the Internet too.

The CDBMS will not concentrate all query traffic through a single node. A peer-to-peer architecture will be far more scalable - with any single node able to receive any query. In such

Figure 1. A CDBMS: database nodes 1 to 8 spread across Data Center 1, Data Center 2, Cloud Data Center 1 and Cloud Data Center 2, each node with its own data stores; a user's query arrives via the Internet.

  • 8/3/2019 The Cloud Database

    6/19

an arrangement, each node needs to have a map of the data stored at every node and know the performance characteristics of every node. When a node receives a query, its first task is to determine which node is best able to respond to the query. It then passes responsibility for the query to that node. That node executes the query and returns the result directly to the user.
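The routing step just described can be sketched in a few lines. This is a hypothetical illustration only: the node names, the catalog structure and the scoring rule are invented for the example and are not taken from any particular product.

```python
# Hypothetical sketch of peer-to-peer query routing: every node holds the
# same map of which tables live where, plus rough performance figures, and
# forwards each query to the node best placed to execute it.

NODE_CATALOG = {
    # node -> (tables held locally, relative compute speed)
    "node-1": ({"customers", "orders"}, 1.0),
    "node-4": ({"orders", "shipments"}, 2.0),
    "node-7": ({"customers"}, 0.5),
}

def route_query(tables_needed):
    """Return the node holding the most needed tables, breaking ties on speed."""
    def score(item):
        node, (tables, speed) = item
        return (len(tables & tables_needed), speed)
    best_node, _ = max(NODE_CATALOG.items(), key=score)
    return best_node

print(route_query({"customers", "orders"}))   # node-1 holds both tables
print(route_query({"shipments"}))             # only node-4 holds shipments
```

In a real CDBMS the catalog would itself be replicated to every peer, which is why each node can accept any query.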

Figure 1 shows more than one CDBMS node in some of the data centers. In practice, it will probably be necessary to configure more than one node per data center, to distribute the database workload within the data center as well as between data centers.

Consider Figure 2. It illustrates the likely strategy that would be used by a CDBMS node in accessing data held in local transactional databases or files. If the data is held in a database, the CDBMS can either get at the data directly (via ODBC, for example) or access a replicated data store. Replication will only be needed if read access to the data imposes too great an impact on performance. Critical systems often have a hot stand-by in place ready to go if the primary system fails, in which case the stand-by system's database could be used as a data source. Data might also be drawn from operational data stores or data warehouses, with the same kind of replication strategy being employed.

Where the application data is held in a file, the CDBMS will probably be able to access the data directly. For non-database data, the CDBMS would maintain a metadata map of the file so it could identify data items within the records read from the file.

Finally, the CDBMS will maintain its own store of data consisting of frequently used data drawn from the data sources it accesses. This would likely be most of the data the CDBMS node was responsible for, with direct access to data stores being used primarily for data refresh.

    Local Data and Distributed Data

In processing local data, the CDBMS acts as an operational data store. It has up-to-date data and responds to queries using that data. While BI databases, such as a data warehouse or large data mart, could be included, the cloud database might replace rather than complement such data stores.

There is a scalability issue here. If we consider a large data center with many terabytes of data, no matter how efficient the CDBMS node is, it probably will not be able to deal with all the query traffic. At each data center there would likely be several database nodes. And if the query traffic grew, as usually happens, the CDBMS would need to instantiate extra nodes to handle the increased workload.


Figure 2. A CDBMS Node: CDBMS node x maintains its own data store and draws data from application DBMSs (directly or via replicated data) and from application files.


Consider the situation illustrated in Figure 3, where Node A of the CDBMS is managing queries for files A1, A5 and databases A2, A3 and A4. If the workload gets too great for the resources at its disposal then, assuming that there is another server available to use, it could split like an amoeba, as indicated. The original node might take responsibility for file A1 and databases A2 and A3, while the newly created node A' takes responsibility for A4 and A5.

In order to do this, Node A would have to keep a full history of query traffic so that it would be able to calculate the optimal division as it split in two. Similarly, there would need to be a reverse procedure that amalgamated two local nodes in the event that the query workload diminished.

In concept, that takes care of queries that only access local data that Node A has responsibility for. However, there will necessarily be queries that span multiple nodes.

    Distributed Queries

Consider the major entities that a company holds as data: customer, product, sales transaction, staff member, supplier, purchase transaction and so on. They crop up in many applications. Consequently, many queries that seek information on these major entities will inevitably span multiple nodes of a CDBMS. Even if we could find a convenient way to distribute and cluster the applications around these entities, there would be many queries that spanned multiple nodes.

Most query-oriented databases, whether column store databases or traditional relational databases, could be configured to handle single-node queries. Technically, the fundamental challenge for the CDBMS is to handle distributed queries effectively.

A distributed query which accesses multiple nodes of the CDBMS can be thought of as an amalgamation (a union) of several queries that access individual nodes of the CDBMS. This is illustrated in Figure 4. Note that the resolution of a query in this manner could result in more than one result set from each node, as illustrated. Once the answers have been calculated, the CDBMS has to determine which node will join them together.
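The union of per-node sub-queries can be sketched as follows. The shard contents, node names and predicate are invented for illustration; a real CDBMS would ship SQL sub-queries over the network rather than call local functions.

```python
# Hypothetical sketch: a distributed query is answered by running a sub-query
# on each node that holds relevant data, then combining (union-ing) the
# per-node result sets before any final join.

local_shards = {
    "node-2": [("cust-1", "Austin"), ("cust-2", "Dallas")],
    "node-5": [("cust-3", "Houston")],
    "node-8": [("cust-2", "Dallas")],        # replicated row on another node
}

def run_subquery(node, predicate):
    """Stand-in for executing the sub-query on one node."""
    return [row for row in local_shards[node] if predicate(row)]

def distributed_query(predicate):
    results = set()                          # the union removes duplicate rows
    for node in local_shards:
        results.update(run_subquery(node, predicate))
    return sorted(results)

print(distributed_query(lambda row: row[1].startswith("D")))
# [('cust-2', 'Dallas')]
```

Note that the replicated row on node-8 collapses into a single result, which is the set-union semantics the text describes.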


Figure 3. Cloud Database Node Splitting: CDBMS Node A manages file A1, databases A2, A3 and A4, and file A5 over the network; when the workload expands, Node A instantiates a new node, A', and divides responsibility for the data sources between them.

  • 8/3/2019 The Cloud Database

    8/19

The best node to choose is the one that costs least in respect of time. That can depend upon many physical factors: not just the volume of data that needs to be transmitted, but the network speeds and how long it will take each node to carry out its work. It could even depend upon which node is currently busiest. The challenge is to find the fastest solution, but the problem is not a trivial one.
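A rough cost model for choosing the join site might look like the sketch below. All the numbers (answer-set sizes, link speeds, busy factors, the per-megabyte join cost) are invented assumptions; a real optimizer would estimate them from statistics.

```python
# Hypothetical cost model for choosing the join node: for each candidate,
# estimate how long it takes to ship every other node's answer set to it,
# plus the candidate's own (load-adjusted) join time.

candidates = {
    # node -> (answer-set size in MB, link speed to this node in MB/s, busy factor)
    "node-2": (40.0, 100.0, 1.2),
    "node-5": (5.0, 10.0, 1.0),
    "node-8": (120.0, 50.0, 2.5),
}

JOIN_SECONDS_PER_MB = 0.01    # assumed cost of joining one MB locally

def total_cost(target):
    size, link_speed, busy = candidates[target]
    transfer = sum(s / link_speed                      # ship other answers in
                   for node, (s, _, _) in candidates.items()
                   if node != target)
    join = sum(s for s, _, _ in candidates.values()) \
        * JOIN_SECONDS_PER_MB * busy                   # busier node -> slower join
    return transfer + join

best = min(candidates, key=total_cost)
print(best)   # node-2: fast link and moderate load beat its larger transfer
```

The point of the sketch is that the cheapest node is not necessarily the one with the biggest answer set or the fastest CPU; it is the one that minimizes the combined transfer-plus-join time.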

Other Cloud Database Issues

There are other issues that a CDBMS needs to address. A primary one is high availability. This is a necessity rather than a nice-to-have. The CDBMS needs to be able to recover from the failure of any node and, in the extreme, the failure of a whole data center. However, that is achievable by any distributed database that is capable of replicating its nodes.

There are also the traditional issues of database security and the broader issues of data quality and data governance. However, these are not show-stoppers. The CDBMS has to be able to assemble a complete metadata map of all the nodes. For that reason, data security, data quality and data governance issues can be handled as if the CDBMS were a single database.

There is also the need to provide support for a variety of data access interfaces. Ultimately these will include the usual SQL interfaces (ODBC, JDBC, ADO.NET), web services interfaces (HTTP, REST, SOAP, XQuery, etc.) and any other specialized interfaces such as MDX (for data cubes).

All of these features are both necessary and important, but catering for them is not where the main CDBMS challenge lies. The greatest engineering challenge is in optimizing varied query workloads across a widely distributed resource space in a manner that consistently performs well.


Figure 4. CDBMS: Distributed Queries. The node that receives the query decomposes it into sub-queries 1 to 4, which run against CDBMS nodes in Data Center 1, Data Center 2 and Cloud Data Center 1; the resulting answer sets are joined by the most cost-effective node.


Can a Traditional Database Evolve to Be a CDBMS?

Databases came into existence over 40 years ago because of the limitations of file systems. They were a more effective mechanism for storing data, for many reasons. The main one was that they made metadata (data definition data) available, so that many different programs could use the same data store. The situation further improved with the emergence of a standard data access language: SQL. This meant that, for the most part, the programmer no longer needed to think about how data was stored.

Naturally, when databases first appeared, a hope arose that it would eventually be possible to store all of a company's data in a single database. It was a forlorn hope.

    Relational Database Evolution

Relational databases (RDBMS) became the dominant type of database as soon as computer hardware was fast enough to enable their use for OLTP. The relational database was originally viewed as a more appropriate database for query workloads, and it was. But in time it was engineered to be suitable for OLTP.

Once databases had standardized around a data model (relational) and an access language (SQL), the hope that it would become possible to implement a single corporate database for use by all programs strengthened. There were many reasons why this did not happen. The major ones were:

RDBMS products could cater for many different data structures, but never catered for every possible data structure. The relational model was not a universal model of data and, to compound this problem, SQL was not a universal data access language that could access any kind of data structure.

    In practice this meant that RDBMS was simply unfit for storing some kinds of data.

Specifically, RDBMS did not properly cater for many important data types (e.g. text, composite data types, etc.). Consequently, other types of database arose (e.g. object databases, text databases, content databases, etc.).

Even though RDBMS were based on the use of a two-dimensional structure (the table), they never catered for structures of a higher dimension. This meant they did not cater for 3D data cubes or higher-dimensional data cubes. Consequently, specific databases emerged for dealing with such structures (OLAP databases). Most importantly, RDBMS did not directly cater for the dimension of time and for time series data.

While RDBMS could cater to both OLTP and query workloads, they never had the performance capability to cater for both types of workload at the same time. From an engineering perspective it made much more sense to have two database instances, one which was configured for OLTP and another, fed from the first, which was configured and tuned for query traffic.

Most RDBMS products charged license fees, so Independent Software Vendors (ISVs) rarely used them. But even when open source RDBMS products became available at no charge, most ISVs continued to ignore them, preferring their own file structures.

The IT industry never even tried to agree on a standard file format that exposed the metadata of a file. Thus the commonly used operating systems never provided such a file type. This


meant there was no alternative for ISVs but to constantly invent new types of files, and even new data types, for the data they stored.

    This brought us to the situation where the industry began to accept a de facto reality:

There was structured data: data held in databases with its metadata available.

There was unstructured data: data held in files of various kinds where the metadata was either unavailable or incomplete.

    Scale and Scalability

In the light of these constraints, databases evolved in two directions. On one hand, databases accommodated some unstructured data - by extensions to the relational model, implementing some version of an object-relational model. On the other hand, the dream of a single corporate database continued - but only for query traffic - giving rise to the idea of the data warehouse.

Data warehouses were an attempt to scale up by storing all data in a single instance of a database. But in practice they never did scale up. From the get-go, users were forced to store data subsets in data marts. Focusing all query workloads on the data warehouse would have paralyzed it. Because of the limitations of the relational model, some of the data marts were OLAP databases holding multidimensional data cubes.

The impressive march of Moore's Law, which vaporized performance issues in many areas of IT, never came close to fixing this scalability issue - and it still hasn't. Data flowed from operational systems, through ETL and data quality programs, into a data warehouse for later extraction into a data mart for eventual use. This was a slow process. Consequently, software designed to short-cut that pedestrian route emerged, called Enterprise Information Integration (EII) software. EII tools created Operational Data Stores, which were nothing more than accelerated data marts.

RDBMS did not scale out, and little effort was put into making it do so. So when the likes of Yahoo and Google assembled large data centers with thousands of servers, there was no database technology at all that could scale out across such large computing grids. This gave rise to a completely different approach to scaling out for large volumes of data, which went by the name of MapReduce and which gave rise to Hadoop, a programming framework for implementing MapReduce across large grids of servers.

    The Coming of the Column Store

As a database idea, the column store is very old. It goes back to the 1970s. Edward Glaser, principal developer on the MIT MULTICS project, first proposed the idea, and it was used by IBM on a database called APLDI. It came back into fashion via Sybase and Sand Technology when the scalability limitations of the indexed data structures that RDBMS used became more apparent. Column-store databases became increasingly popular with the emergence of new start-up database companies like Vertica and ParAccel that took this approach.

The column stores were RDBMS in the sense that they employed SQL as the primary data access language and they held data in tables, but at a physical level they stored columns rather than rows, they made heavy use of data compression and they didn't use indexes. The simple fact was that, while the speed at which data could be read from disk had been


increasing rapidly over the years, the speed of the movement of the read/write head across the disk had not increased much. Consequently, using indexes for accessing data on disk had become a liability. It caused disk head movement and slowed everything down. It had become far faster to read data serially from disk.

This gave rise to the scalability approach illustrated in Figure 5. This depicts the general approach of the column store DBMS to scalability. First of all, data is compressed when it is loaded, resulting in a much smaller volume of data - one twentieth of the original raw data is achievable. Then the data is stored in columns. The columns may also be split up between disks and between servers. This ensures good parallelism. A query may need to read the whole of a column from a table, for example, so if the column is split between 12 disks that are split between two servers, then the data retrieval may be 12 times faster.

Furthermore, the servers will most likely be configured with a high level of memory so that a good deal of the data is already in memory. The caching algorithms will probably split a fair amount of the memory equally between the disks to balance the average workload. In addition to this, multiple processes will be running, and they will be distributed between multiple cores in the CPUs on each server.
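The basic column-store mechanics - a columnar layout, compression, and whole-column serial scans instead of index chasing - can be illustrated with a toy run-length encoding. The table contents and column names are invented; real products use far more sophisticated compression schemes.

```python
# Toy illustration of column-store mechanics: store each column separately,
# compress it with run-length encoding, and scan whole columns serially.

rows = [("US", 100), ("US", 250), ("US", 75), ("EU", 300), ("EU", 50)]

# Columnar layout: one array per column instead of one record per row.
region_col = [r[0] for r in rows]
amount_col = [r[1] for r in rows]

def rle_encode(column):
    """Run-length encode a column: [[value, run_length], ...]."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return runs

print(rle_encode(region_col))         # [['US', 3], ['EU', 2]]

# A query touching only `amount` scans just that column, never the others.
print(sum(amount_col))                # 775
```

Sorted, low-cardinality columns compress into very few runs, which is one reason columnar layouts can reach the large compression ratios the text mentions.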


Figure 5. Column Store DBMS Scalability: a database table is compressed, then partitioned on disk by column and by range across servers 1, 2 and 3, each configured with as much memory as possible; a query is decomposed into a sub-query for each node, and the columnar database scales up and out by adding more servers.


The overall performance of the column store DBMS will depend on how well the software balances the workload when multiple queries are processed. This solution has the advantage that you can simply add more servers as the data volume expands, and the balancing of the workload across 3, then 4, then 5 servers will usually work out well. This solution scales out onto multiple servers more effectively than the traditional RDBMS - which is precisely why it has become popular.

Unfortunately, it will hit a limit at some point. Clearly that limit will depend upon the structure of the data and the variety of queries being processed. Even though it scales out more effectively, it is still a centralized architecture. As the workload increases, a messaging bottleneck will naturally develop at the master node of the column store database and ultimately, this limits the number of servers it can expand onto.

Hadoop and MapReduce: A Distributed Architecture

The Hadoop development framework for MapReduce has attracted a great deal of attention for two reasons. First, it does scale out across large grids of computers; second, it is the product of an open source project, so companies can test it out at low cost. MapReduce is a parallel architecture designed by Google specifically for large-scale search and data analysis. It is very scalable and works in a distributed manner. The Hadoop environment is a MapReduce framework that enables the addition of Java software components. It also provides HDFS (the Hadoop Distributed File System) and has been extended to include HBase, which is a kind of column store database.

Figure 6 shows how Hadoop works. Basically, a mapping function partitions data and then passes it to a reducing function, which calculates a result. In the diagram we show many nodes (servers), with nodes 1 to i running the mapping process and nodes i+1 to k running the reducing process. The environment is designed to recover from the failure of any node. The HDFS holds a redundant copy of all data, so if any node fails, the same data will be available through another node. Every server logs what it is doing and can be recovered using its backup/recovery file if it fails. Because of that, Hadoop/MapReduce is quite slow at each node, but it compensates for this by scaling out over thousands of nodes. It has been used productively on grids of over 5,000 servers. Node failure is a daily event when you have that many commodity servers working together, so at that scale, its recoverability is an advantage.

Figure 6. Hadoop & MapReduce: a scheduler feeds HDFS data through Map, Partition, Combine and Reduce stages; nodes 1 to i run the mapping process and nodes i+1 to k the reducing process, each node with its own backup/recovery file.

With MapReduce, all the data records consist of a simple key and value pair. An example might be a log file, consisting of message codes (the key) and the details of the condition being reported (the value). For the sake of illustrating the MapReduce process, imagine we have a large log file of many terabytes containing messages and message codes, and we simply want to count each type of message record. It could be done in the following way:

1. The log file is loaded into the HDFS file system. Each mapping node reads some of the log records.

2. The mappers look at each record they read and output a key-value pair containing the message code as the key and 1 as the value (the count of occurrences).

3. The reducer(s) sort by the key and aggregate the counts. With repeated reductions it eventually arrives at the result: a map of distinct keys with their overall counts from all inputs.
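The message-count example above can be sketched as a single-process simulation of the three MapReduce phases. The log records and message codes are invented, and real Hadoop jobs would of course run the phases on separate nodes in Java, but the data flow is the same.

```python
# Single-process sketch of the MapReduce message-count example: mappers emit
# (message_code, 1) pairs, a shuffle groups the pairs by key, and reducers
# sum each group's counts.

from collections import defaultdict

log_records = [("E42", "disk full"), ("W07", "slow query"),
               ("E42", "disk full"), ("I01", "startup"), ("E42", "disk full")]

def map_phase(records):
    for code, _detail in records:
        yield (code, 1)                  # key-value pair: message code -> 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:             # group all 1s under their key
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(log_records)))
print(counts)                            # {'E42': 3, 'W07': 1, 'I01': 1}
```

In Hadoop the shuffle step is what moves data between the mapping nodes and the reducing nodes; everything else stays local to one node.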

While this example is very simple, if we had a very large fact table of the type that might reside in a data warehouse, we could execute SQL queries in the same way. The map process would be the SQL SELECT and the reduce process could simply be the sorting and merging of results. You can add any kind of logic to either the map or the reduce step, and you can also have multiple map and reduce cycles for a single task.

Also, by deploying HBase it is possible to have a very large, massively parallel column-store database that presides over petabytes of data and which can be regularly updated.

    The CDBMS

Ultimately, neither column store databases nor Hadoop (with HBase) currently have the capabilities needed to function as a CDBMS.

Column-store DBMS are (in most cases) centralized databases that will encounter scalability limits as data volumes and workloads increase. Ultimately, all centralized architectures suffer that fate no matter how splendid the underlying engineering. For that reason some of the column-store vendors are integrating with Hadoop and enhancing it in various ways.

Because Hadoop provides a fully distributed environment, it is unlikely to encounter a scalability limit of the kind that would floor a centralized architecture. Hadoop was purposely designed to preside over massive tables and, in that role, it can be useful, especially for those organizations that run into scalability limits with column store databases. However, in its current form it processes only one workload at a time - it has no multiprocessing capability at all. Also, it does not work well with complex data structures, even when they only contain structured data. Big tables, yes; but lots of little tables from lots of databases, all with varying data structures, decidedly no.

Neither is Hadoop equipped to easily distribute workloads across complex networks that work at varying speeds. Hadoop expects a clean environment of similar-sized servers, all networked together at the same speed in an orderly fashion. Its secret sauce is homogeneity in everything it does.

    A CDBMS has to be able to handle heterogeneity at every level.


    Algebraix Data and Cloud Database

Algebraix Data's A2DB is, uniquely, an algebraic database. As such, it is capable of representing any kind of data in an algebraic form and managing it accordingly. Many databases (RDBMS and derivative products) are constrained by the relational model of data, unable to handle data that does not fit in that limited environment. A2DB is not constrained in that way. Its algebraic nature allows it to represent hierarchies, ordered lists, recursive data structures and compound data objects of any kind. (For a more detailed mathematical explanation of how it achieves this, read the Bloor Group white paper: Doing The Math.)

    Algebraic Optimization and the Use of Intermediate Results

To understand how Algebraix Data's technology could implement a CDBMS, you need to understand the optimization strategy it implements. The A2DB product stores all the sets it calculates, including all intermediate result sets, for possible reuse.

Consider a fairly simple query which accesses some rows and columns from one table and then joins them to some rows and columns of another table. Most databases will select the data from the first table, select it from the second table and then join the resulting two tables together to provide the answer. A2DB behaves in the same manner, but with the additional nuance that it stores the first selection, the second selection and the joined result for possible later use. If later queries make the same selection, or make a selection of a subset of either of the two stored selections, then A2DB will reuse those results. Once A2DB has processed many queries, it has assembled a reasonably large population of these intermediate results.

Not only does it store each such set of data, it also stores its algebraic representation. So when it processes a new query, it simply examines its store of algebraic representations and selects those that can contribute to resolving the query. It then works out which of them has the least cost in terms of resource usage, and uses those sets to resolve the query.
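The reuse strategy can be illustrated with a simple sketch (in Python, with invented names and data, not the product's actual mechanism): each selection is cached along with a description of the predicate that produced it, and a later, more restrictive selection is answered by filtering a cached superset rather than scanning the base table.

```python
# A toy base table of 100 rows with amounts 10, 20, ..., 1000.
base_table = [{"id": i, "amount": i * 10} for i in range(1, 101)]
cache = {}  # predicate description -> stored intermediate result set

def select(table, min_amount):
    key = ("amount>=", min_amount)
    if key in cache:
        return cache[key]
    # Look for a cached superset: a selection made with a weaker predicate.
    superset = None
    for (op, bound), cached_rows in cache.items():
        if op == "amount>=" and bound <= min_amount:
            superset = cached_rows
            break
    source = superset if superset is not None else table
    result = [r for r in source if r["amount"] >= min_amount]
    cache[key] = result  # every result is kept for possible reuse
    return result

select(base_table, 500)           # scans the base table, caches 51 rows
subset = select(base_table, 900)  # answered from the cached 51-row set
print(len(subset))                # 11
```

A real optimizer compares the cost of each candidate reuse rather than taking the first match, but the principle is the same: the more queries processed, the more of the workload is answered from stored intermediate results.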

The adjacent graph illustrates how the performance of A2DB improves when the same type of query is repeated.

The first time a query runs, response is slow. But it improves with each repetition until the response time falls to a very low level. This happens with all types of query.

Figure 7. The A2DB Optimizer Performance Curves

The use of intermediate result sets proves valuable in a distributed environment and a cloud environment. Figure 8 illustrates this. The distributed architecture is peer-to-peer, so there could be many such nodes, even thousands - all functioning in the same way. On the left of the diagram are the data sources that this particular node takes input from and is responsible for. In order to load the database node, it is only necessary to create load files of the source databases. The database doesn't immediately load the data; it just loads the metadata from those files. The way the technology works is that there is no data load per se. As queries arrive, it references the load files (or log files or other data files) and gradually accumulates intermediate result sets, which constitute its managed data store - as illustrated.

It uses physically efficient mechanisms to store such data, the same techniques as the typical column-store database: no indexes, data compression and data partitioning. There is complete separation between the logical representation of the data sets stored and the physical storage of those data sets. It works in the following way:

The XSN Translator translates a query into an algebraic representation that corresponds with the algebraic sets defined at a logical level in the Universe Manager. (XSN stands for Extended Set Notation.) The Universe Manager holds a logical model of all the database's sets and their relations.

The Optimizer first works out which stored sets might participate in a solution. It may deduce it has to go to source data (load files) for all or part of the data requested by the query.


Figure 8. Algebraix Data's Technology in a Distributed Operation

(The diagram shows a CDBMS node. A LOGICAL layer - the XSN Translator, the Universe Manager (algebraic model) and the Optimizer - sits above a PHYSICAL layer - the Resource Manager (CPU/cores, memory, disk), the Set Processor and the Storage Manager. Queries arrive and answers are returned. The node draws on its data sources (load files, log files, DBMS data and application files), maintains local result sets, remote result sets and management data, and uses local and remote access components to reach peer CDBMS nodes in the local data center and in a remote data center.)


In any event, the search for alternatives will yield one or more possible solutions. The Optimizer now consults the Resource Manager and tests each of its algebraic solutions against the physical information held by the Resource Manager. Armed with precise cost information, the Optimizer works out the physical cost of each algebraic solution and chooses the fastest one. The Resource Manager knows whether data is on disk or cached in memory, and it knows how it is physically organized. Once the Optimizer has decided on a solution, it passes it to the Set Processor, which executes it.
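The cost-based choice can be sketched as follows. The set names, locations and cost figures are purely illustrative, standing in for the physical information the Resource Manager actually holds: each candidate algebraic solution lists the stored sets it would combine, and the Optimizer picks the cheapest plan.

```python
# Where each stored set currently lives - information the Resource
# Manager would hold (names and figures are invented for illustration).
set_location = {"A": "memory", "B": "disk", "C": "memory"}
COST = {"memory": 1, "disk": 20}  # relative access cost per set

def plan_cost(sets_used):
    # Physical cost of an algebraic solution: sum of access costs
    # for the stored sets it combines.
    return sum(COST[set_location[s]] for s in sets_used)

# Three algebraically equivalent solutions to the same query.
candidates = [["A", "B"], ["A", "C"], ["B", "C"]]
best = min(candidates, key=plan_cost)
print(best, plan_cost(best))  # ['A', 'C'] 2
```

The in-memory pair wins even though all three plans are logically equivalent - which is exactly why the Optimizer defers to physical information before committing to a solution.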

    The Distributed Query

Now consider what happens if the query requests some data that is not on this database node. How does it know what to do? By design, the Universe Manager doesn't just hold a map of local data, it also holds a global map that identifies all other database nodes and the data they are responsible for.

When we described how the database handles a query, we omitted to discuss how it handles a query that spans more than one node. Such a query will naturally involve a join of some kind, with one or more parts of the join operation referencing remote data.

The mode of operation of Algebraix Data's technology is essentially the same, but slightly more complex. The Optimizer always checks to see if any of the data requested is part of the "remote universe" rather than the local universe. If it discovers that some element in the query references remote data, it deconstructs the query into several parts, as follows:

• A subquery for this node

• A subquery for each remote node that is involved

• A master query that joins together all the results of all the subqueries

It calculates which node is the best node to execute the master query by estimating the resource cost of transporting result data from one location to another. If it decides to pass that responsibility to another node, then it behaves as follows:

• It passes all the other subqueries to the nodes where they need to execute.

• It also informs each node where to deliver the result of their subquery.

• It then executes its own subquery and passes the result to the master node when local processing completes. At that point it has finished with that query.

If it has determined that it is, itself, the best node to execute the master query, it behaves as follows:

• It passes all the other subqueries to the nodes where they need to execute.

• It gives itself as the return address for the results of those subqueries.

• It executes its own subquery.

• When it receives all the remote result sets, it executes the master query.

• Finally, it dispatches the end result to the program that sent the query.
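The decomposition described above can be sketched as follows. The node and table names are invented, and choosing the master by "most local tables" is a crude stand-in for the real estimate of result-transport cost.

```python
# Global map of which node is responsible for which table - the kind of
# information the Universe Manager holds (names are illustrative).
table_node = {"orders": "node1", "customers": "node2", "products": "node1"}

def decompose(tables):
    # One subquery per node, covering the tables that node owns.
    subqueries = {}
    for t in tables:
        subqueries.setdefault(table_node[t], []).append(t)
    # Stand-in for the transport-cost estimate: the node owning the
    # most referenced tables runs the master (joining) query.
    master = max(subqueries, key=lambda n: len(subqueries[n]))
    return master, subqueries

master, subqueries = decompose(["orders", "customers", "products"])
print(master)      # node1
print(subqueries)  # {'node1': ['orders', 'products'], 'node2': ['customers']}
```

Each subquery is then dispatched to its node with the master node as the return address, and the master joins the results as they arrive.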

Note that in carrying out such a distributed query, the database gathers some remote result sets at the node that masters the distributed query. It will save these results as remote result sets in the same way that it saves local result sets, so that when more queries of that type come in, it may be able to resolve those queries locally rather than in a distributed manner.

    Failover

With Hadoop, failure of any node can be catered for. The same is true of Algebraix Data's technology. It is fairly easy to configure complete node mirrors so that a standby node can take over immediately if an active node fails. It would be more economical, though, to use a SAN at each data center, and only mirror data that is written to disk (the intermediate results). Then, if a node fails, it will be possible to recover the node from the SAN. This injects a greater delay into the recovery process, as the recovered database would have to recreate the last known state of the failed node.

In practice, Algebraix Data's technology can run on commodity servers. While it may appear that it has a substantial requirement for data storage, because of its strategy of storing intermediate results, in practice this is not the case. This is because, after a suitable time has passed, the database deletes the intermediate results it didn't reuse. The database rarely requires the deployment of additional storage (such as NAS or a SAN). For atypical workloads, special configurations can be deployed for any given node.

    Node Splitting

Node splitting becomes necessary when the query load for a node becomes too great. The need becomes apparent when the performance of the node begins to decline. However, node splitting is simple to achieve:

A replica of the node is created, and the data sources that the new node will be responsible for are defined - deleting those it will not be responsible for from the Universe Manager. The technology can estimate what the best split is likely to be from an analysis of past query workloads. It can also recognize which intermediate results are derived from which source files or databases, so it reclassifies those intermediate results as remote rather than local. The original node is reconfigured in the same way, deleting the data sources that it is no longer responsible for. The nature of the changes is then relayed to all the nodes in the CDBMS.
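A sketch of the reclassification step, with invented names and structures: after a split, an intermediate result stays local only if every source it was derived from remains this node's responsibility.

```python
# Intermediate results, each tagged with the source databases it was
# derived from (names are illustrative).
intermediate_results = {
    "r1": {"sources": {"db_a"}},
    "r2": {"sources": {"db_b"}},
    "r3": {"sources": {"db_a", "db_b"}},
}

def split(results, sources_kept):
    # Reclassify each intermediate result after the node split: it is
    # local only if all of its sources stay with this node.
    for r in results.values():
        r["scope"] = "local" if r["sources"] <= sources_kept else "remote"
    return results

# This node keeps db_a; db_b moves to the replica node.
split(intermediate_results, sources_kept={"db_a"})
print({k: v["scope"] for k, v in intermediate_results.items()})
# {'r1': 'local', 'r2': 'remote', 'r3': 'remote'}
```

Note that a result derived from sources on both sides of the split (r3) becomes remote; the node will rebuild a local equivalent naturally as new queries arrive.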

    Data Growth

Most source data will consist of databases that are themselves being added to on a regular basis. That data growth is best dealt with by feeding database log file images to the database. For other applications which simply use file systems, it is best to feed the equivalent of an update audit trail to the database. There is a specific reason for this: Algebraix Data's technology does not cater for updated data in the way most databases do.

Typically, database updates destroy data by over-writing one value with another. This database technology is different. It treats updates as additional (i.e. new) data. In effect, they become non-destructive updates, with a record of the previous values remaining. For deletions, it simply marks the set of data or a data item as no longer current. To achieve these things, the database adds a time stamp to all data as it arrives, and uses it if such a time stamp does not exist in the source data. All queries to the database either specify the time that applies, so that the result has an "as at" date/time, or omit the time, in which case the current date and time is applied. So all updates are taken into account when the associated data is processed according to time stamp. Because of this, all intermediate result tables also have an "as at" date/time associated with them.
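The non-destructive update scheme can be sketched as an append-only store with timestamped records. The structures here are illustrative, not the product's actual format: every write is appended, a delete is just a tombstone record, and a query either names an "as at" time or defaults to now.

```python
from datetime import datetime

history = []  # append-only list of (timestamp, key, value, deleted)

def write(ts, key, value, deleted=False):
    # Nothing is ever overwritten - updates and deletes are appended.
    history.append((ts, key, value, deleted))

def as_at(key, ts=None):
    # Replay the history up to the "as at" time (default: now/any time).
    ts = ts or datetime.max
    current = None
    for t, k, v, deleted in history:
        if k == key and t <= ts:
            current = None if deleted else v
    return current

write(datetime(2011, 1, 1), "price", 100)
write(datetime(2011, 6, 1), "price", 120)         # update; old value kept
write(datetime(2011, 9, 1), "price", None, True)  # deletion as a tombstone

print(as_at("price", datetime(2011, 3, 1)))  # 100
print(as_at("price", datetime(2011, 7, 1)))  # 120
print(as_at("price"))                        # None (deleted as of now)
```

Because the full history is retained, any intermediate result can likewise carry the "as at" time it was computed for.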

The database is configured at every node to accept new data on the basis of a timed switch. It is inadvisable to set the time switch to too short a period, as this rapidly increases the number of sets held by the Universe Manager - and this, in turn, could impact performance.

    The Economy of A2DB

In any database, and especially in any distributed database, it is always possible to pose queries that will take a long time to answer. This technology does not make that problem suddenly disappear. For example, if you join two terabyte-sized tables together that are on different nodes, a terabyte of data must pass over the network. If it is a slow network line, the query could take a very long time. If such a query is frequently run, the database will solve this particular performance issue naturally by holding one of the terabyte tables as an intermediate result.
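The arithmetic behind this is easy to check. Assuming, purely for illustration, a 1 Gbit/s link between the nodes:

```python
# Rough figures for shipping a terabyte-sized table between nodes.
table_bytes = 1 * 10**12      # 1 TB of result data
link_bits_per_s = 1 * 10**9   # assumed 1 Gbit/s network link

seconds = table_bytes * 8 / link_bits_per_s
print(round(seconds / 3600, 1))  # 2.2 (hours)
```

Before any protocol overhead, the transfer alone takes on the order of two hours - which is why caching one of the tables as an intermediate result pays off so quickly for a repeated query.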

If you have a petabyte or even several petabytes of data that you wish to query regularly, then the database could be used for the task by deploying it on a sufficient number of nodes. In such circumstances it could look quite similar to Hadoop (with HBase). However, that is not the prime requirement of a CDBMS. A CDBMS needs to be able to handle heterogeneous workloads, some of which access complex data structures, and it needs to do so with economy and with speed. That is what Algebraix Data's technology does.

In the distributed environment it is helped by the fact that users and programs that request data normally do not pose queries that have terabyte-long answers. They pose queries that have quite short answers - a few megabytes or less. An exception is when users are downloading a large data extract for more detailed analysis, but such downloads are relatively rare.

This distributed approach has the virtue that it naturally localizes data to suit the query traffic. In each node it localizes the data that is frequently queried in memory. In a distributed environment with multiple nodes it will, through its natural performance mechanisms, gradually localize the data to suit the local and global query traffic. If query volumes rise too high at a given node, then the node can split like an amoeba to cater for the rising workload.

If the query traffic changes - with, say, one kind of query not being posed so frequently and a new set of previously unknown queries becoming common - the database will simply adapt by adjusting the intermediate results it holds. After three or four queries of each new query type, its natural performance will be restored.

The nature of this technology, coupled with the fact that it can be configured for high availability, qualifies it as suitable for deployment as a CDBMS.


    About The Bloor Group

The Bloor Group is a consulting, research and analyst firm that focuses on quality research and analysis of emerging information technologies across the whole spectrum of the IT industry. The firm's research focuses on understanding both the technical features and the business value of information technologies and how they are successfully implemented within modern computing environments.

Additional information on The Bloor Group can be found at www.TheBloorGroup.com and www.TheVirtualCircle.com. The Bloor Group is the sole copyright holder of this publication.

22214 Oban Drive, Spicewood TX 78669 • Tel: 512-524-3689

www.TheVirtualCircle.com • www.BloorGroup.com
