

Parallel Database Systems: The Future of Database Processing or a Passing Fad?

David J. DeWitt1                          Jim Gray
Computer Sciences Department              Tandem Computers Inc.
University of Wisconsin                   19333 Vallco Parkway
Madison, WI 53706                         Cupertino, CA 95014

Tandem Technical Report 90.9 (Part Number: 50119)
September 1990

Abstract: Parallel database machine architectures based on exotic hardware have evolved

into parallel database systems running atop a parallel dataflow software architecture based on

conventional shared-nothing hardware. These new designs provide speedup and scaleup when

processing relational database queries. This paper reviews the techniques used by such systems,

and surveys current commercial and research systems.

Table of Contents

1. Introduction
2. Basic Techniques for Parallel Database Machine Implementation
   2.1 Shared-Nothing Hardware
   2.2 Software
       Declustering
       Intra-operator Parallelization
3. The State of the Art
   3.1. Bubba
   3.2. Gamma
   3.3. Tandem NonStop SQL
   3.4. Teradata
   3.5. The Super Database Computer
   3.6. Other Systems
4. Future Directions and Research Problems
   4.1. Mixing Batch and OLTP Queries
   4.2. Parallel Query Optimization
   4.3. Parallelization of the Application Program
   4.4. Physical Database Design
   4.5. On-line Reorganization and Utilities
   4.6. Processing Highly Skewed Data
   4.7. Non-relational Parallel Database Machines
5. Summary and Conclusions
References

1 This research was partially supported by the Defense Advanced Research Projects Agency under contract N00039-86-C-0578, by the National Science Foundation under grant DCR-8512862, and by research grants from Intel Scientific Computers, Tandem Computers, and Digital Equipment Corporation.


1. Introduction

The 1983 paper titled Database Machines: An Idea Whose Time has Passed? A Critique of

the Future of Database Machines [BORA83] painted a gloomy picture of the future of highly

parallel database machines for two reasons. First, it observed that disk throughput was predicted

to double while processor speeds were predicted to increase by a factor of ten to one hundred.

Consequently, it predicted that systems with multiple processors would soon be

bottlenecked by their I/O systems unless a solution to the I/O bottleneck were found. Second, it

predicted that none of the then trendy mass storage alternatives such as bubble or CCD memories

were likely to replace standard moving head disks. At that time, the database machine field was

much like the fashion industry - one fad after another. In particular, every time a new memory

technology appeared it was soon followed by countless paper designs, which all too frequently

claimed to solve all the world's database problems.

While these predictions were fairly accurate about the future of hardware, the predictions

were certainly wrong about the overall future of parallel database systems as evidenced by the

success of Teradata and Tandem in developing and marketing highly parallel database machines.

What happened? Certainly it has not been due to a fundamental change in the relative performance

of CPUs and disks. If anything, the spread in their relative performance has grown even larger

since 1983. In the interim, disk drives have improved their response time and throughput by about

a factor of two, while in the same period single chip microprocessors have gone from about 1 mips

in 1983 to well over twenty mips in 1990. Fortunately (or perhaps unfortunately depending on

your perspective), many of these additional cycles have been consumed by corresponding

increases in the complexity and sophistication of the database software itself.

Why then have parallel database systems become more than a research curiosity? One

possible explanation is the widespread adoption of the relational data model itself. In 1983,

relational database systems were just beginning to appear in the commercial marketplace. Today,

they dominate it. In addition, mainframe vendors have found it increasingly difficult to build

machines that are powerful enough to meet the CPU and I/O demands of relational DBMS serving

large numbers of simultaneous users or searching terabyte databases. During this same time

frame, multiprocessors based on increasingly fast and inexpensive microprocessors became widely

available from a variety of vendors including Encore, Sequent, Tandem, Intel, Teradata, and

NCUBE. These machines provide not only more total power than their mainframe counterparts,

but also provide a lower price/MIPS. Also, their modular architectures enable customers to grow

their configurations incrementally, adding mips, memory, and disks to a system either to speedup

the processing of a given job, or to scaleup the system to process a larger job in the same time.


The real answer is that special-purpose database machines have indeed failed. But, parallel

database systems have been a big success. They have emerged as the major consumers of highly

parallel architectures, and are in an excellent position to exploit massive numbers of fast-cheap

commodity disks, processors, and memories. In fact, the successful parallel database systems are

built from conventional processors, memories, and disks. Relational queries are ideally suited to

parallel execution: they consist of uniform operations applied to uniform streams of data. They

require a message-based client-server operating system and also a high-speed network in order to

interconnect the parallel processors. Such facilities may have seemed exotic a decade ago, but now

they are clearly the direction that computer architecture is heading. The client-server paradigm

based on high-speed LANs is the basis for most PC and workstation workgroup software, as well

as the basis for distributed database technology.

A consensus on parallel and distributed database system architecture has emerged. This

architecture is based on a shared-nothing hardware design [STON86] in which processors

communicate with one another only by sending messages via an interconnection network.

In such systems, tuples of each relation in the database are declustered [LIVN87] (horizontally

partitioned [RIES78]) across disk drives attached directly to each processor. Declustering allows

multiple processors to scan large relations in parallel without any exotic I/O devices. This design is

used by Tandem [TAND87, TAND88], Teradata [TERA83, TERA85], Gamma [DEWI86,

DEWI90], Bubba [BORA90], Arbre [LORI89], and several other products currently under

development. Such architectures were pioneered by Teradata in the late seventies as well as by the

MUFFIN [STON79], Xtree [GOOD81], and MDBS [BOYN83] projects.

The remainder of this paper is organized as follows. Section 2 describes the basic

architectural concepts used in these parallel database systems. This is followed by a brief

presentation of the unique features of the Teradata, Tandem, Bubba, and Gamma systems in

Section 3. Section 4 describes several possible areas for future research. Our conclusions are

contained in Section 5.


2. Basic Techniques for Parallel Database Machine Implementation

Technology is driving parallel database machines. The ideal machine would be a single

infinitely fast processor with an infinite memory with infinite bandwidth - and it would be

infinitely cheap (free). Given such a machine, the need for parallelism would be much reduced.

Unfortunately, technology is not delivering such machines - but it is coming close. Technology

is promising to deliver fast one-chip processors, fast and large disks, and large RAM memories.

And each of these devices will be very cheap by today's standards, costing only hundreds of

dollars each. So the challenge is to build an infinitely fast processor out of infinitely many

processors of finite speed, and to build an infinitely large memory with infinite memory bandwidth

from infinitely many storage units of finite speed. This sounds trivial mathematically; but in

practice, when a new processor is added to most computer designs, it slows every other computer

down just a little bit. If this slowdown is 1%, then when adding a processor to a 100 processor

complex, the added power is canceled by the slowdown of the other processors. Such a system

has a maximum speedup of 25 (at 50 processors) and the 100 processor complex has the effective

power of a single processor system.
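The arithmetic behind this claim is easy to check. A minimal sketch of the linear interference model the text describes (the 1% slowdown figure is the text's own illustration, not a measured constant):

```python
def effective_power(n, slowdown=0.01):
    """Effective power of an n-processor complex when each added processor
    slows every other processor down by `slowdown` (1% in the text)."""
    # n processors, each running at full speed minus the interference
    # contributed by the other n - 1 processors.
    return n * max(0.0, 1.0 - slowdown * (n - 1))

# Speedup peaks at about 25 near 50 processors ...
peak = max(range(1, 101), key=effective_power)
# ... while a 100-processor complex has roughly the power of one processor.
print(peak, effective_power(peak), effective_power(100))
```

The curve rises to its maximum at 50 processors and then falls back to 1 at 100, which is exactly the "bad speedup curve" of Figure 1b.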

The ideal parallel system demonstrates two key properties: (1) linear speedup: Twice as

much hardware can perform the task in half the elapsed time, and (2) linear scaleup: Twice as

much hardware can perform twice as large a task in the same elapsed time (see Figure 1a).

Figure 1a. The difference between a speedup design in which a one-minute job is done in 15 seconds, and a scaleup design in which a ten-times bigger job is done in the same time by a ten-times bigger system.
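Stated as ratios, the two metrics look like this (a minimal restatement of the definitions above; the 60-second job done in 15 seconds is the Figure 1a example):

```python
def speedup(old_elapsed, new_elapsed):
    """Linear speedup: twice the hardware should halve elapsed time,
    so ideal speedup equals the hardware growth factor."""
    return old_elapsed / new_elapsed

def scaleup(small_elapsed, big_elapsed):
    """Linear scaleup: N times the hardware on an N-times bigger job
    should hold elapsed time constant, so ideal scaleup is 1.0."""
    return small_elapsed / big_elapsed

# The one-minute job of Figure 1a done in 15 seconds is a 4x speedup:
print(speedup(60, 15))   # 4.0
# A ten-times bigger job finished in the same elapsed time is perfect scaleup:
print(scaleup(60, 60))   # 1.0
```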

The barriers to these goals are the triple threats of: (1) startup: The time needed to start a

parallel operation. If thousands of processes must be started, this can easily dominate the actual

computation time. (2) interference: The slowdown each new process imposes on all others. (3)

skew: The service time of a job is often the service time of the slowest step of the job. As the

number of parallel steps increases, the average size of each step decreases, but the variance can well

exceed the mean. When the variance dominates the mean, further parallelism will not help.

This section describes several basic techniques widely used in the design of shared-nothing

parallel database machines to achieve linear speedup and scaleup on relational operators. There are

a number of reasons for the success of shared-nothing systems. The main advantage of shared-nothing

multiprocessors is that they can be scaled up to hundreds and probably thousands of

processors which do not interfere with one another. Teradata and Tandem, for example, have


shipped systems with more than 200 processors, and Intel is implementing a 2000 node

Hypercube. On the other hand, the largest shared memory multiprocessors currently available are

limited to about 32 processors.

Figure 1b. The standard speedup curves. The left curve is the ideal. The middle graph shows no speedup as hardware is added. The right curve shows the three threats to parallelism: initial startup costs may dominate at first; as the number of processes increases, interference can increase; ultimately, the job is divided so finely that the variance in service times (skew) causes a slowdown.

Interference is a major problem for shared-memory multiprocessors for database

applications. As processors become faster, each processor is given a large private cache in order to

avoid interference on shared memory resources. Recent results measuring a multiprocessor

running an online transaction workload show that the cache loading/flushing overhead of

transaction processing applications considerably degrades processor performance [THAK90]. We

do not believe high-performance shared-memory machines will scale beyond tens of processors

when running database applications2. When high degrees of parallelism are introduced, shared

resources become a bottleneck. Consequently the software architecture must use fine granularity

concurrency control, and must even partition resources to avoid process and processor

interference. This observation is true for both shared-nothing and shared memory designs. This

partitioning on a shared-memory system creates many of the problems faced by a shared nothing

machine.

As demonstrated by Teradata in many competitive benchmarks, and by [DEWI90] and

[TAND88, ENGL89], it is possible to obtain near-linear speedups on complex relational queries as

well as for online-transaction processing using a shared-nothing architecture. Given such results,

there is little justification for bothering with the hardware and software complexity associated with

a shared-memory design.

2 SIMD machines such as ILLIAC IV and its derivatives like the Connection Machine are ignored here because to date no software designer has devised a programming technique for them. SIMD machines seem to have application in simulation, pattern matching, and mathematical search, but they do not seem to be appropriate for the I/O-intensive and dataflow paradigm of database systems.


2.1 Shared-Nothing Hardware

In a shared-nothing architecture, processors do not share disk drives or random access

memory. They can only communicate with one another by sending messages through an

interconnection network. Mass storage in such an architecture is distributed among the processors

by connecting one or more disk drives to each processor as shown in Figure 2.

Figure 2. The basic shared nothing design. Each processor has a private memory and one or more diskdrives. Processors communicate via a high-speed interconnect network.

This architecture characterizes the database systems being used by Teradata [TERA85],

Gamma [DEWI86, DEWI90], Tandem [TAND88], Bubba [ALEX88, COPE88], and Arbre

[LORI89]. It is interesting to note, however, that the actual interconnection network varies from

system to system. Teradata employs a redundant tree-structured communication network. Tandem

uses a three-level duplexed network (two levels within a cluster, and rings connecting the clusters).

Arbre, Bubba, and Gamma were designed to be independent of the underlying interconnection

network, requiring only that the network allow any two nodes to communicate with one another.

Gamma currently operates on an Intel iPSC/2 Hypercube which offers each processor eight 2.8

megabyte/second connections to its neighbors. The initial Arbre prototype was implemented using

IBM 4381 processors connected to one another in a point-to-point network. In the case of Bubba,

the shared memory provided by the FLEX/32 multiprocessor was used as an interconnection

network.

2.2 Software

Two software mechanisms have been used in the construction of most parallel database

systems: declustering [LIVN87] for mapping tuples to disks and an intra-operator parallelizer to

distribute the tuples of intermediate relations among the processors being used to execute a

relational operator.


Declustering

Declustering a relation involves distributing the tuples of a relation among two or more disk

drives3 according to some distribution criteria such as applying a hash function to the key attribute

of each tuple. Declustering has its origins in the concept of horizontal partitioning initially

developed as a distribution mechanism for distributed DBMS [RIES78] (see Figure 3). One of the

key reasons for using declustering in a parallel database system is to enable the DBMS software to

exploit the I/O bandwidth reading and writing multiple disks in parallel. By declustering the tuples

of a relation the task of parallelizing a scan operator becomes trivial. All that is required is to start a

copy of the operator on each processor or disk containing relevant tuples, and to merge their

outputs at the destination. For operations requiring a sequential scan of an entire relation, this

approach can provide the same I/O bandwidth as a RAID-style system [SALE84, PATT88]

without needing any specialized hardware.

While tuples can simply be declustered in a round-robin fashion, more interesting

alternatives exist. One is to apply a hashing function to the key attribute of each tuple to distribute

the tuples among the disks. This distribution mechanism allows exact match selection operations

on the partitioning attribute to be directed to a single disk, avoiding the overhead of starting such

queries on multiple disks. On the other hand, range queries on the partitioning attribute, must be

sent to all disks over which a relation has been declustered. A hash declustering mechanism is

provided by Arbre, Bubba, Gamma, and Teradata.

Figure 3: The three basic declustering schemes: range declustering maps contiguous fragments of a table to various disks; round-robin declustering maps the i'th record to disk i mod n; hashed declustering maps each record to a disk location based on some hash function. Each of these schemes spreads data among a collection of disks, allowing parallel disk access and parallel processing.
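The three schemes of Figure 3 reduce to three tuple-to-disk mapping functions. A minimal sketch, assuming three disks and a textual partitioning attribute (the range bounds are illustrative, not taken from any of the systems surveyed):

```python
import bisect

N_DISKS = 3

def round_robin(i):
    """Round-robin declustering: the i'th record goes to disk i mod n."""
    return i % N_DISKS

def hashed(key):
    """Hash declustering: the partitioning attribute is hashed to pick a disk."""
    return hash(key) % N_DISKS

# Range declustering: each disk owns a contiguous range of attribute values.
RANGE_BOUNDS = ["H", "Q"]        # disk 0: A-H, disk 1: I-Q, disk 2: R-Z

def ranged(key):
    return bisect.bisect_left(RANGE_BOUNDS, key[:1].upper())
```

An exact-match selection on the partitioning attribute touches a single disk under hashed or ranged placement; a range query on that attribute touches only a few disks under ranged placement but every disk under the other two.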

An alternative declustering strategy is to associate a distinct range of partitioning attribute

values with each disk by, for example, dividing the range of possible values into N units, one for

each of the N processors in the system (range partitioning). The advantage of range declustering is

that it can isolate the execution of both range and exact match-selection operations to the minimal

3 The term disk is used here as shorthand for disk or other nonvolatile storage media. As the decade proceeds, nonvolatile electronic storage or some other media may replace or augment disks.


number of processors. Another advantage of range partitioning is that it can be used to deal with

non-uniformly distributed partitioning attribute values. A range-partitioning mechanism is provided

by Arbre, Bubba, Gamma, and Tandem.

While declustering is a simple concept that is easy to implement, it raises a number of new

physical database design issues. In addition to selecting a declustering strategy for each relation,

the number of disks over which a relation should be declustered must also be decided. While

Gamma declusters all relations across all disk drives (primarily to simplify the implementation), the

Bubba, Tandem, and Teradata systems allow a subset of the disks to be used. In general,

increasing the degree of declustering reduces the response time for an individual query and

(generally) increases the overall throughput of the system. For sequential scan queries, the

response time decreases because more processors and disks are used to execute the query. For

indexed selections on the partitioning attribute, the response time improves because fewer tuples

are stored at each node and hence the size of the index that must be searched decreases. However,

there is a point beyond which further declustering actually increases the response time of a query.

This point occurs when the cost of starting a query on a node becomes a significant fraction of the

actual execution time [COPE88, DEWI88, GHAN90a]. In general, full declustering is not always

a good idea, especially in very large configurations. Bubba [COPE88] refines the concept of

range-partitioning by considering the heat4 of its tuples when declustering a relation; the goal

being to balance the frequency with which each disk is accessed rather than the actual number of

tuples on each disk. In [COPE88] the effect of the degree of the declustering on the multiuser

throughput of Bubba is studied. In [GHAN90a] the impact of the alternative partitioning strategies

on the multiuser throughput of Gamma is evaluated.
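Bubba's heat-balancing goal can be illustrated with a simple greedy sketch that assigns the hottest remaining fragment to the currently coolest disk. The fragment names and heat values below are invented, and this paper does not specify Bubba's actual placement algorithm; this is only one plausible way to balance access frequency rather than tuple counts:

```python
import heapq

def place_by_heat(fragment_heats, n_disks):
    """Greedy placement: repeatedly put the hottest remaining fragment
    on the disk with the least accumulated heat."""
    disks = [(0.0, d, []) for d in range(n_disks)]
    heapq.heapify(disks)
    for frag, heat in sorted(fragment_heats.items(), key=lambda kv: -kv[1]):
        total, d, frags = heapq.heappop(disks)   # coolest disk so far
        frags.append(frag)
        heapq.heappush(disks, (total + heat, d, frags))
    return {d: (total, frags) for total, d, frags in disks}

# Hypothetical access frequencies ("heat") for five fragments:
heats = {"A0": 90, "A1": 10, "A2": 50, "A3": 40, "A4": 60}
placement = place_by_heat(heats, 2)
```

With these numbers the two disks end up carrying 130 and 120 units of heat, close to the ideal even split of 125, even though one disk holds two fragments and the other three.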

Intra-operator Parallelization

The parallelization of selection operators should now be obvious. Consider the query tree

shown in Figure 4 applied to a table declustered as in Figure 3. The approach used in DIRECT

[DEWI79] to execute such dataflow graphs in parallel was to design and implement parallel

algorithms for each operator. Fortunately, a much simpler approach is possible5 [DEWI86]. The

basic idea is to parallelize the data instead of parallelizing the operators (programs), enabling the

use of unmodified, existing sequential routines to execute the relational operators. Consider first

the simple scan of a table called A that has been declustered across three disks into partitions A0,

A1, and A2. This scan can be implemented as three scan operators which send their output to a

common merge operator which produces a single output data stream to the application or to the

4 "The heat of an object is the access frequency of the object over some period of time" [COPE88].
5 Teradata may use a similar mechanism but, if so, it has not been described publicly.


next relational operator. So the first basic operator is a merge operator that can combine several

parallel data streams into a single sequential stream (see Figure 4).

Figure 4: A simple relational dataflow graph showing a relational scan (project and select) decomposed into three scans on three partitions of the input stream or table. These three scans send their output to a merge node which produces a single data stream.

The parallel query executor creates the three processes shown in Figure 4 and directs them

to take their inputs from some sequential stream and direct their outputs to the merge node. Each

of the scans can run on an independent processor and disk.
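This scan-and-merge plan can be sketched with one worker per partition feeding a common stream. A real system runs a process per operator communicating by messages over the interconnect; the threads and in-memory queue here only stand in for that machinery, and the table contents are invented:

```python
import queue
import threading

DONE = object()   # end-of-stream marker, one per scan

def scan(partition, predicate, out):
    """Scan operator: unmodified sequential code run over one partition."""
    for row in partition:
        if predicate(row):
            out.put(row)
    out.put(DONE)

def merge(out, n_inputs):
    """Merge operator: combines n parallel streams into one sequential stream."""
    finished, result = 0, []
    while finished < n_inputs:
        item = out.get()
        if item is DONE:
            finished += 1
        else:
            result.append(item)
    return result

# Table A declustered into partitions A0, A1, A2, as in Figure 4:
A = [[("adams", 1)], [("jones", 2), ("smith", 3)], [("young", 4)]]
q = queue.Queue()
scans = [threading.Thread(target=scan, args=(p, lambda r: r[1] > 1, q)) for p in A]
for t in scans:
    t.start()
result = merge(q, len(A))   # selects every tuple whose second field exceeds 1
for t in scans:
    t.join()
```

The merge sees an arbitrary interleaving of the three streams, which is exactly why it produces a single sequential stream for the consumer: the consumer need not know how many scans fed it.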

The merge operator tends to focus data on one spot. Clearly, if a multi-stage parallel

operation is to be done in parallel, a single data stream must be split into several independent

streams. To do this, a structure known as a split table is used to distribute the stream of tuples

produced by a relational operator. A split table defines a mapping of one or more attribute values

of the output tuples produced by a process executing a relational operator to a set of destination

processes (see Figure 5).

Figure 5: A relational dataflow graph showing a relational operator's output being decomposed by a split table into several independent streams. Each stream may be a duplicate of the input stream, or a partitioning of the input stream into many disjoint streams. With the split and merge operators, a web of simple sequential dataflow nodes can be connected to form a parallel execution plan.

As an example, consider the two split tables shown in Figure 6 in conjunction with the

query shown in Figure 7. Assume that three processes are used to execute the join operator, and

that five other processes execute the two scan operators - three scanning partitions of table A

while two scan partitions of table B. Each of the three table A scan nodes will have the same split

table, sending all tuples between "A-H" to port 0 of join process 0, all between "I-Q" to port 0 of

join process 1, and all between "R-Z" to port 0 of join process 2. Similarly the two table B scan

nodes have the same split table except that their outputs are merged by port 1 (not port 0) of each

join process. Each join process sees a sequential input stream of A tuples from the port 0 merge

(the left scan nodes) and another sequential stream of B tuples from the port 1 merge (the right scan

nodes). The split tables would be as shown in Figure 6.


Table A Scan Split Table
Predicate    Destination Process
"A-H"        (cpu #5, Process #3, Port #0)
"I-Q"        (cpu #7, Process #8, Port #0)
"R-Z"        (cpu #2, Process #2, Port #0)

Table B Scan Split Table
Predicate    Destination Process
"A-H"        (cpu #5, Process #3, Port #1)
"I-Q"        (cpu #7, Process #8, Port #1)
"R-Z"        (cpu #2, Process #2, Port #1)

Figure 6. Sample split tables which map tuples to different output streams (ports of other processes) depending on the range value of some attribute of the input tuple. The first split table is for the Table A scan in Figure 7, while the second is for the Table B scan.

The split table in Figure 6 is just an example. Other split tables might duplicate the input

stream, or partition it round-robin, or partition it by hash. The partitioning function is arbitrary. In

fact in Volcano, it is an arbitrary program [GRAE90]. For each process executing the scan, the

split table applies the predicates to the join attribute of each output tuple. If the predicate is

satisfied, the tuple is sent to the corresponding destination (this is the logical model, the actual

implementation is generally optimized by table lookups). Each join process has two ports, each of

which merges the outputs from the various scan split tables.
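Logically, a split table is just a list of (predicate, destination) entries applied to each output tuple. A direct transcription of the Table A split table of Figure 6, where the destination tags are plain labels standing in for real interconnect addresses:

```python
# Split table for the table A scan of Figure 7 (destinations from Figure 6).
A_SPLIT = [
    ("A", "H", ("cpu5", "proc3", "port0")),
    ("I", "Q", ("cpu7", "proc8", "port0")),
    ("R", "Z", ("cpu2", "proc2", "port0")),
]

def route(row, split_table, attr=0):
    """Apply the split predicates to the join attribute of an output tuple
    and return the destination whose range contains it."""
    key = row[attr][:1].upper()     # first letter of the join attribute
    for low, high, dest in split_table:
        if low <= key <= high:
            return dest
    raise ValueError("no destination for %r" % (row,))

print(route(("Jones", 2), A_SPLIT))   # ('cpu7', 'proc8', 'port0')
```

As the text notes, this predicate-by-predicate search is only the logical model; an implementation would typically replace it with a table lookup or, for hash partitioning, a single hash computation.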

Figure 7: A simple relational dataflow graph showing two relational scans (project and select) consuming two input tables, A and B, and feeding their outputs to a join operator which in turn produces a data stream C. Each scan's output is split, and the split streams are merged on their way into each join node.

To clarify this example, consider the second join process in Figure 7 (processor 7, process

8, ports 0 and 1). It will get all the table A "I-Q" tuples from the three table A scan operators

merged as a single stream on port 0, and will get all the "I-Q" tuples from table B merged as a

single stream on port 1. It will join them using a hash-join, sort-merge join, or even a nested join

if the tuples arrive in the proper order. This join node in turn sends its output to the merge node,

much as the scans did in Figure 4.
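Each join process can then run an ordinary sequential join on its two merged streams. A minimal hash join sketch, assuming the A-side input fits in memory (the sample tuples are invented; the systems surveyed also support sort-merge and other join methods):

```python
def hash_join(a_stream, b_stream, a_key=0, b_key=0):
    """Build a hash table on the A tuples, then probe it with B tuples.
    Both streams already contain only one key range (e.g. "I-Q"),
    so each join process works independently of its siblings."""
    table = {}
    for a in a_stream:                       # build phase
        table.setdefault(a[a_key], []).append(a)
    for b in b_stream:                       # probe phase
        for a in table.get(b[b_key], []):
            yield a + b

a_iq = [("jones", "dept1"), ("kim", "dept2")]
b_iq = [("jones", 30000), ("kim", 45000)]
print(list(hash_join(a_iq, b_iq)))
# [('jones', 'dept1', 'jones', 30000), ('kim', 'dept2', 'kim', 45000)]
```

Because the split tables already routed matching key ranges to the same join process, no tuple outside this process's range can join here, and the processes need never communicate with one another.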

If each of these processes is on an independent processor with an independent disk, there

will be little interference among them. Such dataflow designs are a natural application for

shared-nothing machine architectures.


Both Volcano and Tandem have taken Gamma's notion of a split table one step further and

made it a separate operator that can be inserted anywhere in a query tree [GRAE90]. This

approach has a number of advantages including the automatic parallelism of any new operator

added to the system plus support for a variety of different kinds of parallelism. In addition, in

Volcano this encapsulation made it possible to eliminate the need for a separate scheduler process

such as the one used by Gamma to coordinate the processes executing the operators of the query. Rather,

the flow control and buffering are built into the merge and join operators.

For simplicity, these examples have been stated in terms of an operator per process. But it

is entirely possible to place several operators within a process to get coarser grained parallelism.

The fundamental idea, though, is to build a self-pacing dataflow graph and distribute it in a

shared-nothing machine in a way that minimizes interference.


3. The State of the Art

3.1. Bubba

The Bubba [BORA90] prototype was implemented using a 40 node FLEX/32

multiprocessor with 40 disk drives. Although this is a shared-memory multiprocessor, Bubba was

designed as a non-shared memory system and the shared-memory is only used for message

passing. Nodes are divided into three groups: Interface Processors for communicating with

external host processors and coordinating query execution, Intelligent Repositories for data storage

and query execution, and Checkpoint/Logging Repositories. While Bubba also uses declustering

as a storage mechanism (both range and hash declustering mechanisms are provided) and dataflow

processing mechanisms, Bubba is unique in a number of ways. First, Bubba uses FAD rather

than SQL as its interface language. FAD is an extended-relational persistent programming

language. FAD provides support for complex objects via a variety of type constructors including

shared sub-objects, set-oriented data manipulation primitives, as well as more traditional language

constructs. The FAD compiler is responsible for detecting operations that can be executed in

parallel according to how the data objects being accessed are declustered. Program execution is

performed using a dataflow execution paradigm. The task of compiling and parallelizing a FAD

program is significantly more difficult than parallelizing a relational query. The other radical

departure taken by Bubba is its use of a "single-level store" mechanism in which the persistent

database at each node is mapped into the virtual memory address space of each process executing at

the node instead of the traditional approach of files and pages. This mechanism, similar to IBM's

AS/400 mapping of SQL databases into virtual memory, HP's mapping of the Image Database into

the operating system virtual address space, and Mach's mapped file [TEVA87] mechanism, greatly

simplified the implementation of the upper levels of the Bubba software.

3.2. Gamma

The current version of Gamma runs on a 32 node Intel iPSC/2 Hypercube with a 330

megabyte disk drive attached to each node. In addition to range and hash declustering, Gamma

also provides a new declustering strategy termed hybrid-range that combines the best features

(from a performance perspective) of the hash and range partitioning strategies [GHAN90b]. Once a

relation has been partitioned, Gamma provides the normal collection of access methods including

both clustered and non-clustered indices. When the user requests that an index be created on a

relation, the system automatically creates an index on each fragment of the relation. There is no

restriction in Gamma that the same attribute which is used to partition a relation into fragments also

be used as the attribute on which the clustering index is created.


As described above, Gamma uses split tables to facilitate the parallelization of the

complex relational algebra operators. Four join methods are supported: sort-merge, Grace

[KITS83], Simple [DEWI84], and Hybrid [DEWI84]. The relative performance of these

alternative methods is presented in [SCHN89]. [DEWI90] contains an extensive performance

evaluation of the system including a series of speedup, scaleup, and sizeup experiments. Although

not yet implemented, a new availability technique called chained declustering [HSIA90] has been

designed which provides a higher degree of availability than interleaved declustering [TERA83,

COPE89] and, in the event of a processor or disk failure, does a better job of distributing the

workload of the broken node than mirrored disks. In [SCHN90] a variety of alternative strategies

for processing queries involving multiple join operations are presented and evaluated.

3.3. Tandem NonStop SQL

The Tandem NonStop SQL system is composed of processor clusters interconnected via

4-plexed fiber optic rings. The systems are typically configured at a disk per MIPS, so each ten­

MIPS processor has about ten disks. Disks are typically duplexed and managed in the standard

way [BITT88]. Each disk pair has a set of processes managing a large RAM cache, a set of locks,

and log records for the data on that disk pair. Considerable effort is spent on optimizing sequential

scans by prefetching large units, and by filtering and manipulating the tuples with SQL predicates at

these disk servers.

Relations are range partitioned across multiple disk pairs [TAND87]; this is the only declustering strategy provided. The partitioning attribute is also used as the primary, clustering attribute on

each node, making it impossible to decluster a relation by partitioning on one attribute and then

constructing a clustered index at each node on a different attribute. Parallelization of operators in a

query plan is achieved by inserting a parallel operator (which is similar to Volcano's exchange

operator [GRAE90]) between operator nodes in the query tree, and joins are executed using nested or sort-merge algorithms. Scans, aggregates, updates, and deletes are parallelized. In addition,

several utilities use parallelism (e.g. load, reorg, ...). A hash join algorithm which uses hash

declustering of intermediate and final results has recently been implemented [ZELL90]. [ENGL89]

contains a performance evaluation of the speedup and scaleup characteristics of the parallel version

of the Non-Stop SQL system.

The system is primarily designed for transaction processing. The main parallelism feature

for OLTP is parallel index update. Relational tables typically have five indices on them, although it

is not uncommon to see ten indices on a table. These indices speed reads, but slow down inserts,

updates, and deletes. By doing the index maintenance in parallel, the elapsed maintenance time can be held constant if the indices are spread among many processors and disks. In addition, NonStop SQL is fault tolerant and supports geographically distributed data and execution.
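The arithmetic behind this claim can be sketched with a toy cost model (our illustration, ignoring message and coordination overheads): serially, an insert pays for the base table plus every index, while with the indices spread across processor/disk pairs the updates overlap:

```python
# Toy model of index maintenance cost for one insert.  base_ms is the base
# table update, index_ms the cost of updating one index, n_indices the
# number of secondary indices.  All figures are hypothetical.

def serial_insert_time(base_ms, index_ms, n_indices):
    # One processor does everything in sequence: cost grows with each index.
    return base_ms + n_indices * index_ms

def parallel_insert_time(base_ms, index_ms, n_indices):
    # Indices spread over separate processor/disk pairs: the updates run
    # concurrently, so elapsed time is the max, roughly constant in n_indices.
    return max(base_ms, index_ms)
```

With five indices at equal cost, the serial insert is six times slower than the base update, while the parallel insert stays near the base cost.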


3.4. Teradata

Teradata quietly pioneered many of the ideas presented here. Since 1978 they have been building shared-nothing, highly parallel SQL systems based on commodity microprocessors, disks,

and memories. They have demonstrated near-linear speedup and scaleup in many commercial

benchmarks, and far exceed the speed of traditional mainframes in their ability to process large

(terabyte) databases.

The Teradata processors are functionally divided into two groups: Interface Processors

(IFPs) and Access Module Processors (AMPs). The IFPs handle communication with the host,

query parsing and optimization, and coordination of AMPs during query execution. The AMPs are

responsible for executing queries. Each AMP typically has several disks and a large memory

cache. IFPs and AMPs are interconnected by a dual redundant, tree-shaped interconnect called the

Y-net [TERA83, TERA85]. The Y-net has an aggregate bandwidth of 12 megabytes/second.

Relations can be declustered over a subset of the AMPs, but hashing is the only declustering

strategy currently supported. Whenever a tuple is to be inserted into a relation, a hash function is

applied to the primary key of the relation to select an AMP for storage. Hash maps in the Y-net

nodes and AMPs are used to indicate which hash buckets reside on each AMP. Once a tuple arrives

at a site, that AMP applies a hash function to the key attribute in order to place the tuple in its

"fragment" (several tuples may hash to the same value) of the appropriate relation. Once an entire

relation has been loaded, the tuples in each horizontal fragment are in what is termed "hash-key

order." Thus, given a value for the key attribute, it is possible to locate the tuple in a single disk

access (assuming no buffer pool hits). This is currently the only physical file organization supported. With this organization, the only kind of indices one can construct are dense, secondary

indices.
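The two-level placement scheme (hash the key to a bucket, then consult a hash map naming the AMP that owns the bucket) can be sketched as follows; the bucket count and AMP names are made up for illustration:

```python
# Sketch of hash declustering with an explicit hash map, in the style
# described for the Y-net and AMPs.  Terminology from [TERA83, TERA85];
# the code itself is illustrative.

N_BUCKETS = 8  # real systems use many more buckets than nodes

def build_hash_map(amps):
    # Assign buckets to AMPs round-robin.  Remapping buckets (rather than
    # rehashing individual tuples) is what keeps reconfiguration cheap.
    return {b: amps[b % len(amps)] for b in range(N_BUCKETS)}

def amp_for(key, hash_map):
    # Hash the primary key to a bucket, then look up the owning AMP.
    return hash_map[hash(key) % N_BUCKETS]
```

Given a key value, one map lookup names the AMP, and within the AMP the tuples are kept in hash-key order, which is why a single disk access usually suffices.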

Optionally, each relation can be replicated for increased availability using a special data

placement scheme termed interleaved declustering [TERA85, COPE89]. For each primary

fragment of a relation (the tuples of a relation stored on one node), a backup copy of the fragment

is made which is itself declustered into subfragments. Each of these backup subfragments is stored

on a disk independent of the primary fragment. During the normal mode of operation, read

requests are directed to the primary fragments and write operations update both copies. If a

primary fragment is unavailable, the corresponding backup fragment is promoted to become the

primary (active) fragment and all data accesses are directed to it.

Like the other systems, Teradata uses hashing to distribute the tuples of intermediate relations to facilitate intra-operator parallelism. However, instead of using dataflow techniques during the

execution of a query, each operator is run to completion on all participating nodes before the next

operator is initiated. Join operators are normally executed using the standard merge-join algorithm.


3.5. The Super Database Computer

The Super Database Computer (SDC) project at the University of Tokyo presents an interesting contrast to other database system projects [KITS90, HIRA90, KITS87]. This system

takes a combined hardware and software approach to the performance problem. The basic unit,

called a processing module (PM), consists of one or more processors with a shared memory.

These processors are augmented by a special purpose sorting engine that sorts at high speed

(3MB/s at present), and by a disk subsystem. Clusters of processing modules are connected via an

omega network [LAWR75] which not only provides non-blocking NxN interconnect, but also

provides some dynamic routing to support data distribution during hash joins [KITS90]. The

SDC is designed to scale to thousands of PMs, and so considerable attention is paid to the problem

of data skew. Data is declustered among the PMs by hashing.

The SDC software includes a unique operating system and a relational database query executor. Most published work so far has been on query execution and on efficient algorithms to

execute the relational operations, rather than on query planning or data declustering. The SDC is a

shared-nothing design with a software dataflow architecture. This is consistent with our assertion

that current parallel database systems use conventional hardware. But the special-purpose design of the omega network and of the hardware sorter clearly contradicts the thesis that

special-purpose hardware is not a good investment of development resources. Time will tell

whether these special-purpose components offer better price/performance or peak performance than

shared-nothing designs built of conventional hardware.

3.6. Other Systems

Other parallel database system prototypes under development include XPRS [STON88], Volcano [GRAE90], and Arbre [LORI89]. While both Volcano and XPRS are implemented on

shared-memory multiprocessors, XPRS is unique in its exploitation of the availability of massive

shared-memory in its design. In addition, XPRS is based on a number of innovative techniques

for obtaining extremely high performance and availability.


4. Future Directions and Research Problems

4.1. Mixing Batch and OLTP Queries

Section 2 concentrated on the basic techniques used for processing complex relational

queries in a parallel database system. Tandem and Teradata have demonstrated that the same

architecture can be used successfully to process transaction-processing workloads [TAND88] as

well as ad-hoc queries [ENGB89]. Running a mix of both types of queries concurrently,

however, presents a number of unresolved problems. The problem is that ad-hoc relational queries

tend to acquire a large number of locks (at least from a logical viewpoint) and tend to hold them for

a relatively long period of time (at least from the viewpoint of a competing debit-credit style

query). The solution currently offered is "Browse mode" locking for ad-hoc queries that do not

update the database. While such a "dirty read" solution is acceptable for some applications, it is

not universally acceptable. The solution advocated by XPRS [STON88] is to use a versioning

mechanism to enable readers to read a consistent version of the database without setting any locks.

While this approach is intuitively appealing, the associated performance implications need further

exploration. Other, perhaps better, solutions for this problem may also exist.

A related issue is priority scheduling in a shared-nothing architecture. Even in centralized

systems, batch jobs have a tendency to monopolize the processor, flood the memory cache, and

make large demands on the I/O subsystem. It is up to the underlying operating system to quantize

and limit the resources used by such batch jobs in order to insure short response times and low

variance in response times for short transactions. A particularly difficult problem is the priority

inversion problem, in which a low-priority client makes a request to a high-priority server. The server must run at high priority because it is managing critical resources. Given this, the work of the low-priority client is effectively promoted to high priority when the low-priority request is serviced by the high-priority server. There have been several ad-hoc attempts at solving this

problem, but considerably more work is needed.
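One remedy that has been sketched in the literature is priority propagation: the server performs each request at the priority of the client that submitted it, rather than letting every request ride on the server's own high priority. A toy sketch (ours, not any particular system's mechanism):

```python
# Toy priority-propagating request queue: the server pulls the
# highest-client-priority request first, so a low-priority batch client
# cannot have its work promoted ahead of a short transaction merely by
# passing through a high-priority server.
import heapq

def make_queue():
    return []

def submit(queue, client_prio, request):
    # Negate the priority: heapq pops the smallest entry, and we want the
    # highest client priority served first.
    heapq.heappush(queue, (-client_prio, request))

def serve_next(queue):
    prio, request = heapq.heappop(queue)
    return -prio, request
```

This addresses only the scheduling of queued requests; the harder part, noted above, is bounding the time the server spends inside each low-priority request once it has started.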

4.2. Parallel Query Optimization

Currently, the query optimizers for most parallel database systems do not

consider alternative query tree formats when optimizing a relational query. Typically, only alternative left-deep query trees are considered, not right-deep or bushy trees. [GRAE89]

proposes to dynamically select from among a number of alternative plans at run time depending

on, for example, the amount of physical memory actually available and the cardinalities of the

intermediate results. While cost models for relational queries running on a single processor are

now well understood [SELI79, JARK84, MACK86], they still depend on cost estimators that are a guess at best. The situation with parallel join algorithms running in a mixed batch and online


environment is even more complex. Only recently have we begun to understand the relative

performance of the various parallel join methods [SCHN89] and alternative query tree

organizations in a parallel database machine environment [SCHN90]. The XPRS optimizer

considers, for example, only the use of sort-merge join methods. To date, no query optimizers

consider the full variety of parallel algorithms for each operator and the full variety of alternative

query tree organizations. While the necessary query optimizer technology exists, accurate cost

models have not been developed, let alone validated. More work is needed in this area.
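The size of the search space is one reason optimizers restrict themselves to left-deep trees: for n relations there are n! left-deep plans, but n! times the (n-1)st Catalan number bushy plans. A quick calculation (standard combinatorics, not tied to any particular optimizer):

```python
# Count join orderings for n relations, ignoring the further multiplication
# by per-operator algorithm and declustering choices.
from math import comb, factorial

def left_deep_plans(n):
    # One left-deep plan per permutation of the n relations.
    return factorial(n)

def bushy_plans(n):
    # n! leaf orderings times Catalan(n-1) distinct binary tree shapes,
    # where Catalan(k) = C(2k, k) / (k + 1).
    catalan = comb(2 * (n - 1), n - 1) // n
    return factorial(n) * catalan
```

Already at four relations the bushy space is five times the left-deep space, and the gap widens rapidly; adding the choice of parallel algorithm per operator multiplies both counts further.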

4.3. Parallelization of the Application Program

While machines like Teradata and Gamma separate the application program running on a

host processor from the database software running on the parallel processor, both the Tandem and

Bubba systems use the same processors for both application programs and the parallel database

software. This arrangement has the disadvantage of requiring a complete, full-function operating

system on the parallel processor, but it avoids any potential load imbalance between the two

systems and allows parallel applications. Missing, however, are tools that would allow the

application programs themselves to take advantage of the inherent underlying parallelism of these

integrated parallel systems. While automatic parallelization of application programs written in

Cobol may not be feasible, library packages to facilitate explicitly parallel application programs are

needed. Support for the SQL3 NOWAIT option, in which the application can launch several SQL statements at once, would be an advance. Ideally, the SPLIT and MERGE operators could be

packaged so that applications could benefit from them.
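What a NOWAIT-style facility buys the application can be sketched with ordinary threads; `run_sql` here is a placeholder standing in for a real database call, not an actual API:

```python
# Sketch of launching several independent SQL statements at once and
# collecting their results, in the spirit of the NOWAIT option discussed
# above.  run_sql is a hypothetical stand-in for a DBMS call.
from concurrent.futures import ThreadPoolExecutor

def run_sql(statement):
    # Placeholder: a real implementation would ship the statement to the
    # database and block until its result arrives.
    return f"result of {statement}"

def launch_nowait(statements):
    # Submit every statement before waiting on any of them, so the
    # statements execute concurrently rather than one after another.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_sql, s) for s in statements]
        return [f.result() for f in futures]
```

A library in this style lets an application overlap several independent queries without the DBMS having to parallelize the application code itself.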

4.4. Physical Database Design

As discussed in Section 3, Gamma currently provides four alternative declustering

strategies in addition to the normal collection of access methods. While this is a richer set than what

is currently available commercially, the results in [GHAN90a, GHAN90b] demonstrate that there

is no one "best" declustering strategy. In addition, for a given database and workload there are a

variety of viable alternative indexing and declustering combinations. Database design tools are

needed to help the database administrator select the correct combination. Such a tool might accept

as input a description of the queries comprising the workload (including their frequency of

execution), statistical information about the relations in the database, and a description of the target

environment. The resulting output would include a specification of which declustering strategy

should be used for each relation (including which nodes the relation should be declustered over)

plus a specification of the indices to be created on each relation. Such a tool would undoubtedly

need to be integrated with the query optimizer as query optimization must incorporate information

on declustering and indices which, in turn, impact what plan is chosen for a particular query.


Another area needing additional research is multidimensional declustering algorithms. All current algorithms decluster the tuples in a relation using the values of a single attribute. While this arrangement allows selections against the partitioning attribute to be

localized to a limited number of nodes, selections on any other attribute must be sent to all the

nodes over which the relation is declustered for processing. While this is acceptable in a small

configuration, it is not in a system with thousands of processors. Recently we have begun to study the

applicability of grid file techniques to the problem of multidimensional declustering. While the

initial results look promising, the associated database design problem of computing the proper size

of each bucket and the distribution of buckets among processors appears daunting.
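The idea can be sketched for two attributes: cut the key space into a grid and map grid cells to nodes so that a selection on either attribute is confined to the nodes owning one row or column of cells. The cell sizes and the diagonal placement below are illustrative assumptions; balancing bucket sizes and heat is the hard design problem noted above:

```python
# Sketch of grid-file-style two-attribute declustering (our illustration).
# A tuple's cell is determined by both attributes, and cells are assigned
# to nodes so no node owns an entire grid row or column.

def grid_cell(a1, a2, cell_w, cell_h):
    # Locate the grid cell for a tuple with attribute values (a1, a2).
    return (a1 // cell_w, a2 // cell_h)

def node_for_cell(cell, n_nodes):
    i, j = cell
    # Diagonal assignment: consecutive cells along a row or column fall on
    # different nodes, so a one-attribute selection touches several nodes
    # of a row/column but never all nodes in the system.
    return (i + j) % n_nodes
```

A selection on either attribute then needs only the nodes holding one row or column of the grid, instead of broadcasting to every node as single-attribute declustering requires.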

4.5. On-line Reorganization and Utilities

Loading, reorganizing, or dumping a terabyte database at a megabyte a second takes over

12 days and nights. Clearly parallelism is needed if utilities are to complete within a few hours or

days. Even then, it will be essential that the data be available while the utilities are operating. In

the SQL world, typical utilities create indices, add or drop columns, add constraints, and

physically reorganize the data, changing its clustering.
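The figure above is simple arithmetic: a terabyte moved at one megabyte per second takes a million seconds, about 11.6 days of round-the-clock operation, i.e. roughly the dozen days and nights quoted:

```python
# Back-of-the-envelope time to scan a terabyte at one megabyte per second.
TB = 10 ** 12           # bytes in a terabyte (decimal units)
MB = 10 ** 6            # bytes per second of utility bandwidth
seconds = TB / MB       # 1,000,000 seconds
days = seconds / 86_400 # roughly 11.6 days of continuous operation
```

Spreading the same work over, say, a hundred disks and processors brings the elapsed time down to a few hours, which is why the utilities themselves must be parallel.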

One unexplored (and difficult) problem is how to process database utility commands while

the system remains operational and the data remains available for concurrent reads and writes by others. A technique for reorganizing indices was proposed in [STON88]. The fundamental

properties of such algorithms are that they must be online (operate without making data unavailable),

incremental (operate on parts of a large database), parallel (exploit parallel processors), and

recoverable (allow the operation to be canceled and return to the old state).

4.6. Processing Highly Skewed Data

Another interesting research area is algorithms to handle relations with highly skewed attribute-value distributions. Both range declustering and hybrid-range declustering with DBA-supplied key ranges help alleviate the problem of declustering a relation whose partitioning attribute

value is skewed (especially if the heat of the tuples in the relation is considered [COPE88]). For

example, there is really no reason why each node must have a unique range of partitioning attribute

values when declustering a relation using range partitioning. Problems can still occur, however,

when the data is redistributed as part of processing a complex operator such as a join. One possible

solution is to use a range-split table with non-unique entries for redistributing the tuples of two

relations to be joined. In the case of the inner relation, if a tuple's joining attribute value falls into

one of the non-unique ranges, one entry is picked at random from the set of possible alternatives.

Tuples from the outer relation, however, must be sent to all the possible alternatives. Other approaches to solving this problem have been proposed in [KITS90, WOLF90], and certainly other

solutions remain to be explored and evaluated.
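The range-split table with non-unique entries can be sketched as follows (the ranges and node numbers are illustrative): a hot range is assigned to two nodes; inner tuples in that range go to one of the two at random, while outer tuples are replicated to both, so each matching pair still meets at exactly one node:

```python
# Sketch of skew handling with a non-unique range-split table.  The hot
# key range [50, 60) is assigned to two nodes so its load is shared.
import random

SPLIT = [((0, 50), [0]),      # cool range: one destination node
         ((50, 60), [1, 2]),  # hot range: two candidate nodes
         ((60, 100), [3])]

def inner_dest(key):
    # Inner-relation tuples pick one candidate at random.
    for (lo, hi), nodes in SPLIT:
        if lo <= key < hi:
            return random.choice(nodes)

def outer_dests(key):
    # Outer-relation tuples are sent to every candidate, so whichever
    # node received a matching inner tuple can produce the join result.
    for (lo, hi), nodes in SPLIT:
        if lo <= key < hi:
            return nodes
```

The cost is the duplication of outer tuples in the hot range, traded against no single node absorbing the entire hot range of the inner relation.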


4.7. Non-relational Parallel Database Machines

While open research issues remain in the area of parallel database machines for relational database systems, building a highly parallel database machine for an object-oriented database system presents a number of new challenges. One of the first issues to resolve is how declustering

should be handled. For example, should one decluster all sets (such as set-valued attributes of a

complex object) or just top-level sets? Another question is how inter-object references should be handled. In a relational database machine, such references are handled by doing a join between the

two relations of interest, but in an object-oriented DBMS references are generally handled via

pointers. In particular, a tension exists between declustering a set in order to parallelize scan

operations on that set and clustering an object and the objects it references in order to reduce the

number of disk accesses necessary to access the components of a complex object. Since clustering

in a standard object-oriented database system remains an open research issue, mixing in

declustering makes the problem even more challenging.

Another open area is parallel query processing in an OODBMS. Most OODBMS provide a

relational-like query language based on an extension to relational algebra. While it is possible to

parallelize these operators, how should class-specific methods be handled? If the method operates

on a single object, it is certainly not worthwhile parallelizing it. However, if the method operates on

a set of values or objects that are declustered, then it must almost certainly be parallelized if one is going to

avoid moving all the data referenced to a single processor for execution. Since it is, at this point in

time, impossible to parallelize arbitrary method code, one possible solution might be to insist that a method to be parallelized be constructed using the primitives from the underlying

algebra, perhaps embedded in a normal programming language.


5. Summary and Conclusions

Like most applications, database systems want cheap, fast hardware. Today that means

commodity processors, memories, and disks. Consequently, the hardware concept of a database

machine built of exotic hardware is inappropriate for current technology. On the other hand, the

availability of fast microprocessors and small, inexpensive disks, packaged as standard but fast computers, provides an ideal platform for parallel database systems. A shared-nothing architecture

is relatively straightforward to implement and, more importantly, has demonstrated both speedup

and scaleup to hundreds of processors. Furthermore, shared-nothing architectures actually

simplify the software implementation. If the software techniques of data declustering and intra-operator parallelism are employed, parallelizing an existing database management system becomes a relatively straightforward task. Finally, there are certain applications (e.g., data

mining in terabyte databases) that require the computational and I/O resources only available from a

parallel architecture.

While the successes of both the commercial products and prototypes demonstrate the viability of highly parallel database machines, a number of open research issues remain unsolved, including techniques for mixing ad-hoc queries with online transaction processing without

seriously limiting transaction throughput, improved optimizers for parallel queries, tools for

physical database design and on-line database reorganization, and algorithms for handling

relations with highly skewed data distributions. Some application domains are not well supported by the relational data model; it appears that a new class of database systems based on an object-oriented data model is needed. Such systems pose a host of interesting research problems that require further examination.


References

[ALEX88] Alexander, W., et al., "Process and Dataflow Control in Distributed Data-Intensive Systems," Proc. ACM SIGMOD Conf., Chicago, IL, June 1988.

[BITT88] Bitton, D. and J. Gray, "Disk Shadowing," Proceedings of the Fourteenth International Conference on Very Large Data Bases, Los Angeles, CA, August 1988.

[BLAS79] Blasgen, M. W., Gray, J., Mitoma, M., and T. Price, "The Convoy Phenomenon," Operating Systems Review, Vol. 13, No. 2, April 1979.

[BOYN83] Boyne, R.D., D.K. Hsiao, D.S. Kerr, and A. Orooji, "A Message-Oriented Implementation of a Multi-Backend Database System (MDBS)," Proceedings of the 1983 Workshop on Database Machines, edited by H.-O. Leilich and M. Missikoff, Springer-Verlag, 1983.

[BORA83] Boral, H. and D. DeWitt, "Database Machines: An Idea Whose Time has Passed? A Critique of the Future of Database Machines," Proceedings of the 1983 Workshop on Database Machines, edited by H.-O. Leilich and M. Missikoff, Springer-Verlag, 1983.

[BORA90] Boral, H., et al., "Prototyping Bubba: A Highly Parallel Database System," IEEE Knowledge and Data Engineering, Vol. 2, No. 1, March 1990.

[COPE89] Copeland, G. and T. Keller, "A Comparison of High-Availability Media Recovery Techniques," Proceedings of the ACM-SIGMOD International Conference on Management of Data, Portland, Oregon, June 1989.

[COPE88] Copeland, G., Alexander, W., Boughter, E., and T. Keller, "Data Placement in Bubba," Proceedings of the ACM-SIGMOD International Conference on Management of Data, Chicago, May 1988.

[DEWI79] DeWitt, D.J., "DIRECT - A Multiprocessor Organization for Supporting Relational Database Management Systems," IEEE Transactions on Computers, June 1979.

[DEWI84] DeWitt, D. J., Katz, R., Olken, F., Shapiro, D., Stonebraker, M., and D. Wood, "Implementation Techniques for Main Memory Database Systems," Proceedings of the 1984 SIGMOD Conference, Boston, MA, June 1984.

[DEWI86] DeWitt, D., et al., "GAMMA - A High Performance Dataflow Database Machine," Proceedings of the 1986 VLDB Conference, Japan, August 1986.

[DEWI88] DeWitt, D., Ghandeharizadeh, S., and D. Schneider, "A Performance Analysis of the Gamma Database Machine," Proceedings of the ACM-SIGMOD International Conference on Management of Data, Chicago, May 1988.

[DEWI90] DeWitt, D., et al., "The Gamma Database Machine Project," IEEE Knowledge and Data Engineering, Vol. 2, No. 1, March 1990.

[ENGL89] Englert, S., J. Gray, T. Kocher, and P. Shah, "A Benchmark of NonStop SQL Release 2 Demonstrating Near-Linear Speedup and Scaleup on Large Databases," Tandem Computers, Technical Report 89.4, Tandem Part No. 27469, May 1989.


[GHAN90a] Ghandeharizadeh, S. and D.J. DeWitt, "Performance Analysis of Alternative Declustering Strategies," Proceedings of the 6th International Conference on Data Engineering, Feb. 1990.

[GHAN90b] Ghandeharizadeh, S. and D. J. DeWitt, "Hybrid-Range Partitioning Strategy: A New Declustering Strategy for Multiprocessor Database Machines," Proceedings of the Sixteenth International Conference on Very Large Data Bases, Melbourne, Australia, August 1990.

[GOOD81] Goodman, J. R., "An Investigation of Multiprocessor Structures and Algorithms for Database Management," University of California at Berkeley, Technical Report UCB/ERL M81/33, May 1981.

[GRAE89] Graefe, G. and K. Ward, "Dynamic Query Evaluation Plans," Proceedings of the 1989 SIGMOD Conference, Portland, OR, June 1989.

[GRAE90] Graefe, G., "Encapsulation of Parallelism in the Volcano Query Processing System," Proceedings of the 1990 ACM-SIGMOD International Conference on Management of Data, May 1990.

[HIRA90] Hirano, M., et al., "Architecture of SDC, the Super Database Computer," Proceedings of JSPP '90, 1990.

[HSIA90] Hsiao, H. I. and D. J. DeWitt, "Chained Declustering: A New Availability Strategy for Multiprocessor Database Machines," Proceedings of the 6th International Conference on Data Engineering, Feb. 1990.

[JARK84] Jarke, M. and J. Koch, "Query Optimization in Database Systems," ACM Computing Surveys, Vol. 16, No. 2, June 1984.

[KIM86] Kim, M., "Synchronized Disk Interleaving," IEEE Transactions on Computers, Vol. C-35, No. 11, November 1986.

[KITS83] Kitsuregawa, M., Tanaka, H., and T. Moto-oka, "Application of Hash to Data Base Machine and Its Architecture," New Generation Computing, Vol. 1, No. 1, 1983.

[KITS87] Kitsuregawa, M., M. Nakano, and L. Harada, "Functional Disk System for Relational Database," Proceedings of the 6th International Workshop on Database Machines, Lecture Notes in Computer Science #368, Springer-Verlag, June 1989.

[KITS89] Kitsuregawa, M., W. Yang, and S. Fushimi, "Evaluation of 18-Stage Pipeline Hardware Sorter," Proceedings of the 3rd International Conference on Data Engineering, Feb. 1987.

[KITS90] Kitsuregawa, M. and Y. Ogawa, "A New Parallel Hash Join Method with Robustness for Data Skew in Super Database Computer (SDC)," Proceedings of the Sixteenth International Conference on Very Large Data Bases, Melbourne, Australia, August 1990.

[LAWR75] Lawrie, D.H., "Access and Alignment of Data in an Array Processor," IEEE Transactions on Computers, Vol. C-24, No. 12, Dec. 1975.

[LIVN87] Livny, M., S. Khoshafian, and H. Boral, "Multi-Disk Management Algorithms," Proceedings of the 1987 SIGMETRICS Conference, Banff, Alberta, Canada, May 1987.


[LORI89] Lorie, R., J. Daudenarde, G. Hallmark, J. Stamos, and H. Young, "Adding Intra-Transaction Parallelism to an Existing DBMS: Early Experience," IEEE Data Engineering Newsletter, Vol. 12, No. 1, March 1989.

[MACK86] Mackert, L. F. and G. M. Lohman, "R* Optimizer Validation and Performance Evaluation for Local Queries," Proceedings of the 1986 SIGMOD Conference, Washington, D.C., May 1986.

[PATT88] Patterson, D. A., G. Gibson, and R. H. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Proceedings of the ACM-SIGMOD International Conference on Management of Data, Chicago, May 1988.

[PIRA90] Pirahesh, H., C. Mohan, J. Cheng, T.S. Liu, and P. Selinger, "Parallelism in Relational Database Systems: Architectural Issues and Design Approaches," Proc. 2nd Int. Symposium on Databases in Parallel and Distributed Systems, IEEE Press, Dublin, July 1990.

[RIES78] Ries, D. and R. Epstein, "Evaluation of Distribution Criteria for Distributed Database Systems," UCB/ERL Technical Report M78/22, UC Berkeley, May 1978.

[SALE84] Salem, K. and H. Garcia-Molina, "Disk Striping," Department of Computer Science, Princeton University, Technical Report EEDS-TR-332-84, Princeton, N.J., Dec. 1984.

[SCHN89] Schneider, D. and D. DeWitt, "A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment," Proceedings of the 1989 SIGMOD Conference, Portland, OR, June 1989.

[SCHN90] Schneider, D. and D. DeWitt, "Tradeoffs in Processing Complex Join Queries via Hashing in Multiprocessor Database Machines," Proceedings of the Sixteenth International Conference on Very Large Data Bases, Melbourne, Australia, August 1990.

[SELI79] Selinger, P. G., et al., "Access Path Selection in a Relational Database Management System," Proceedings of the 1979 SIGMOD Conference, Boston, MA, May 1979.

[STON79] Stonebraker, M., "Muffin: A Distributed Database Machine," ERL Technical Report UCB/ERL M79/28, University of California at Berkeley, May 1979.

[STON86] Stonebraker, M., "The Case for Shared Nothing," Database Engineering, Vol. 9, No. 1, 1986.

[STON88] Stonebraker, M., R. Katz, D. Patterson, and J. Ousterhout, "The Design of XPRS," Proceedings of the Fourteenth International Conference on Very Large Data Bases, Los Angeles, CA, August 1988.

[TAND87] Tandem Database Group, "NonStop SQL, A Distributed, High-Performance, High-Reliability Implementation of SQL," Workshop on High Performance Transaction Systems, Asilomar, CA, September 1987.

[TAND88] Tandem Performance Group, "A Benchmark of NonStop SQL on the Debit Credit Transaction," Proceedings of the 1988 SIGMOD Conference, Chicago, IL, June 1988.

[TERA83] Teradata, "DBC/1012 Data Base Computer Concepts & Facilities," Teradata Corp. Document No. C02-0001-00, 1983.


[TERA85] Teradata, "DBC/1012 Database Computer System Manual Release 2.0," Document No. C10-0001-02, Teradata Corp., Nov. 1985.

[TEVA87] Tevanian, A., et al., "A Unix Interface for Shared Memory and Memory Mapped Files Under Mach," Dept. of Computer Science Technical Report, Carnegie Mellon University, July 1987.

[THAK90] Thakkar, S.S. and M. Sweiger, "Performance of an OLTP Application on Symmetry Multiprocessor System," Proceedings of the 17th Annual International Symposium on Computer Architecture, Seattle, WA, May 1990.

[WOLF90] Wolf, J.L., Dias, D.M., and P.S. Yu, "An Effective Algorithm for Parallelizing Sort-Merge Joins in the Presence of Data Skew," 2nd International Symposium on Databases in Parallel and Distributed Systems, Dublin, Ireland, July 1990.

[ZELL90] Zeller, H.J. and J. Gray, "Adaptive Hash Joins for a Multiprogramming Environment," Proceedings of the 1990 VLDB Conference, Australia, August 1990.
