System R: A Relational Data Base Management System

M. M. Astrahan, M. W. Blasgen, D. D. Chamberlin, J. N. Gray, W. F. King, B. G. Lindsay, R. A. Lorie, J. W. Mehl, T. G. Price, G. R. Putzolu, M. Schkolnick, P. P. Selinger, D. R. Slutz, H. R. Strong, P. Tiberio, I. L. Traiger, B. W. Wade, R. A. Yost
IBM San Jose Research Laboratory

COMPUTER, May 1979. 0018-9162/79/0500-0042$00.75 © 1979 IEEE

A relational approach makes this experimental data base management system unusually easy to install and use. Some of the decisions made in the System R design in order to enhance usability also offer major bonuses in other areas.

Perhaps the greatest impediments to the use of a computerized data base management system are installation cost and complexity. At present, installation of these systems requires a staff skilled in telecommunications, operating systems, data management, and applications. In response, our lab designed and implemented System R, an experimental data base management system allowing easy definition of data bases and data base applications without sacrificing the function and performance available in most commercial systems. Among its capabilities, the system provides a sophisticated authorization facility and automatically handles system functions such as recovery and concurrency control. System R adopts a relational data model and supports a language called SQL for defining, accessing, and modifying various views of the data base.

The relational data model

All data base management systems represent data in the form of records. Records, the basic unit of storage, contain fields, which hold values. Each record represents some fact about the world. A telephone record, for example, has three fields: name, address, and telephone number. Each real-world instance of the record assigns a value to each field.

Data base management systems differ in the way they organize records. System R organizes data into tables, that is, sequences of identically formatted records. Train schedules, price lists, tide tables, and phone books all fit this model.

A relational model was adopted because it is easy to understand and to explain and because it lends itself to powerful relational operators (such as sort, join, project, and select) [2]. By relational model, we mean an organization which collects data into just such uniform tables as those discussed above, and which allows a user to access data without having to specify the physical organization of the tables. Figure 1 shows a fragment of a relational data base; the various attributes (name, office, job, salary) define a relation. Each record (row), therefore, is an instance of that relation.

Relations are combined and manipulated by various relational operators. Each operator produces another relation as a result. Operators apply to entire relations, thereby reducing the use of constructs such as "for each record do." Operators include the following: select chooses a row subset of a table based on a predicate; project removes columns of a relation; join combines two or more relations to produce a new relation; order sorts a relation by a collection of attributes; group by aggregates records by some attribute; Boolean operators provide union and intersection of relations; and aggregate operators such as sum, min, and max collect all instances of a field into a single value.

By contrast, other systems present a hierarchical or network data model [3]. In general, one must navigate through a network or hierarchy. For example, to find recent invoices of a customer, we locate the customer account record and then locate the invoice records under that customer. By contrast, a nonnavigational approach requests all account records of that customer and relies on the system to locate them. Navigation is not inherent in nonrelational systems, but to date all nonrelational systems have provided a navigational interface.

The three data models have equal power of expression. Besides being easier for the end user to visualize and understand, the tables of a relational DBMS make it somewhat easier to define a nonnavigational language. Further, it appears possible to compile code from a nonnavigational language which rivals the execution efficiency of navigational interfaces.

For these reasons, System R supports only a relational model at the external interface; that is, the user can work in a set-oriented language without describing the pathways taken to reach the data. Since efficient implementation of the relational model and operators seems to require the use of network mechanisms (pointers among related records), System R has full network support internally. However, all this is hidden by the SQL language.
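The relational operators listed in this section can be tried on the Figure 1 tables. What follows is a minimal sketch, not System R itself: it assumes Python's standard `sqlite3` module as a stand-in engine, assumes Jones's job in Figure 1 reads SALES, and uses a hypothetical helper name (`load_figure1`).

```python
import sqlite3

# Figure 1 data (Jones's job is assumed to read SALES).
EMPLOYEE_ROWS = [
    ("Smith", "Paris", "SALES", 15000),
    ("Jones", "Bonn", "SALES", 18000),
    ("Clark", "Boise", "SALES", 12000),
    ("Kent", "Paris", "SERVICE", 15000),
    ("Davis", "London", "SERVICE", 13000),
    ("Jacob", "Rio", "SALES", 12000),
]
OFFICE_ROWS = [
    ("San Jose", "Blasgen", 7152),
    ("Paris", "Portal", 9123),
    ("London", "Portal", 3278),
    ("Bonn", "Roever", 1287),
]


def load_figure1():
    """Build an in-memory data base holding the two Figure 1 tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE EMPLOYEE (NAME TEXT, OFFICE TEXT, JOB TEXT, SALARY INTEGER)")
    db.execute("CREATE TABLE OFFICE (LOCATION TEXT, MANAGER TEXT, PHONE INTEGER)")
    db.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)", EMPLOYEE_ROWS)
    db.executemany("INSERT INTO OFFICE VALUES (?, ?, ?)", OFFICE_ROWS)
    return db


db = load_figure1()

# select (JOB = 'SALES'), project (NAME, SALARY), order (by SALARY, then NAME)
sales = db.execute(
    "SELECT NAME, SALARY FROM EMPLOYEE WHERE JOB = 'SALES' ORDER BY SALARY, NAME"
).fetchall()

# join: pair each employee with the manager of the matching office
managed = db.execute(
    "SELECT NAME, MANAGER FROM EMPLOYEE, OFFICE "
    "WHERE EMPLOYEE.OFFICE = OFFICE.LOCATION ORDER BY NAME"
).fetchall()

# group by, combined with aggregate operators
by_job = db.execute(
    "SELECT JOB, COUNT(*), MIN(SALARY), MAX(SALARY), SUM(SALARY) "
    "FROM EMPLOYEE GROUP BY JOB ORDER BY JOB"
).fetchall()

# Boolean operator: union of the place names found in either table
places = db.execute(
    "SELECT OFFICE FROM EMPLOYEE UNION SELECT LOCATION FROM OFFICE ORDER BY 1"
).fetchall()
```

Every statement returns another table (a list of rows), and none of them contains a "for each record do" loop; the iteration is left to the engine.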

High-level and host language

A very high-level data access language (such as SQL) for data definition, manipulation, and control is important for several reasons:

* It is easy to learn and use.
* It permits the use of an optimizer to improve performance.
* It provides some independence between the application programs and the stored data, so that if the data is reorganized, the applications do not have to be rewritten.

SQL is an example of such a language. (A complete definition of that language appears in Astrahan et al. [1]) For instance, in the sample data base of Figure 1, we could find who does what in the Paris office by writing:

    SELECT NAME, JOB
    FROM   EMPLOYEE
    WHERE  OFFICE = 'Paris';

EMPLOYEE

NAME     OFFICE    JOB        SALARY
Smith    Paris     SALES      15000
Jones    Bonn      SALES      18000
Clark    Boise     SALES      12000
Kent     Paris     SERVICE    15000
Davis    London    SERVICE    13000
Jacob    Rio       SALES      12000

OFFICE

LOCATION    MANAGER    PHONE
San Jose    Blasgen    7152
Paris       Portal     9123
London      Portal     3278
Bonn        Roever     1287

Figure 1. A fragment of a relational data base. All data is organized into tables, i.e., into sequences of records, each of which has the same format. Each row of a table (e.g., see "Jones") is a record describing one instance of a relation.

In many data base systems, data can only be accessed and manipulated by writing a program in a language such as Cobol or PL/I. Such a program contains CALL statements to subroutines performing data base functions. This has several drawbacks:

* It requires the services of a Cobol or PL/I programmer.
* It introduces a long delay between request and result (write the program, compile it, debug it, etc.).
* Using the CALL statement prohibits any compile-time optimization of the data base services required.

The solution to the first two drawbacks is to introduce a query capability, that is, a mechanism supporting ad hoc requests to retrieve or modify information in the data base. For example, System R implements this query capability by providing a host-language interface as an application program called the UFI (User-Friendly Interface). Thus, data stored in a System R data base can be accessed and updated both through the UFI (by interactively issuing SQL statements) and by application programs written in host languages such as Cobol and PL/I (see Figure 2).

Figure 2. These diagrams show two scenarios, one using the ad hoc interface and the other the programmer interface. In the first scenario, a casual user deals directly with UFI, the User-Friendly Interface. In the second scenario, a programmer installs an application program; then Jones and Smith use that application.

The solution to the performance problems caused by the third drawback is to extend the host languages to include the data sublanguage. Cobol and PL/I have been extended to permit SQL statements to be embedded in an application program. The extensions are via preprocessors which accept COBSQL and PLISQL input programs and produce Cobol and PL/I output, respectively, along with compiled fragments of SQL statements; these fragments are stored in the data base by the preprocessors. An example application program using PLISQL is shown in Figure 3.

The SQL language used in host languages is the same language accepted by the UFI; hence the SQL portion of application programs can be debugged using the UFI. Since the SQL statement is an integral part of the application program (and not a parameter of a CALL), it is possible to perform early binding (compilation) and thus enhance performance.
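The idea that one SQL statement serves both the ad hoc and the host-language interface can be illustrated in a modern setting. The sketch below is an assumption-laden stand-in: Python with the standard `sqlite3` module plays the host language, the `?` placeholder plays the role of a host variable, and the names `open_sample_db` and `who_does_what` are hypothetical. It does not reproduce PLISQL or the System R preprocessor.

```python
import sqlite3


def open_sample_db():
    """In-memory copy of the EMPLOYEE table from Figure 1."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE EMPLOYEE (NAME TEXT, OFFICE TEXT, JOB TEXT, SALARY INTEGER)")
    db.executemany(
        "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)",
        [
            ("Smith", "Paris", "SALES", 15000),
            ("Jones", "Bonn", "SALES", 18000),
            ("Clark", "Boise", "SALES", 12000),
            ("Kent", "Paris", "SERVICE", 15000),
            ("Davis", "London", "SERVICE", 13000),
            ("Jacob", "Rio", "SALES", 12000),
        ],
    )
    return db


# The SQL text is a fixed part of the program; only the value bound to the
# placeholder changes from one call to the next.
WHO_DOES_WHAT = "SELECT NAME, JOB FROM EMPLOYEE WHERE OFFICE = ? ORDER BY NAME"


def who_does_what(db, office):
    """Application-program use: the statement is embedded, the office is a variable."""
    return db.execute(WHO_DOES_WHAT, (office,)).fetchall()


db = open_sample_db()

# Ad hoc use: the same statement typed with a literal, as one might at a terminal.
adhoc = db.execute(
    "SELECT NAME, JOB FROM EMPLOYEE WHERE OFFICE = 'Paris' ORDER BY NAME"
).fetchall()

paris_staff = who_does_what(db, "Paris")
bonn_staff = who_does_what(db, "Bonn")
```

Because the statement text is identical in both uses, a statement that returns the right rows interactively can be moved into the program unchanged.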

Compilation

A major criticism of nonprocedural languages is that they are inherently inefficient. If that applied to SQL, System R would be of little interest. Our work, therefore, concentrated on performance as well as on function. To achieve acceptable performance, System R compiles SQL statements into machine code containing calls to low-level accessing routines. The alternative is to interpret each statement every time it is executed. Compiling an SQL statement captures and preserves information that an interpreter must rediscover on each invocation. Experiments indicate compilation is almost always superior to interpretation, even for SQL statements which are executed only once and which retrieve or modify only a few records (see Figures 4 and 5).

In fact, System R can be thought of as a compiler of data manipulation statements: It compiles statements in the SQL language into machine code; the code issues calls to the access method.

Prior to PL/I compilation, a PLISQL program (such as in Figure 3) is examined by a System R preprocessor which finds the SQL statements. Each SQL statement is passed to the System R parser, optimizer, and code generator. These produce object code containing calls to a relational access method. The PLISQL program is then translated into a pure PL/I program containing calls to this object code. As a result, the cost of supporting the high-level language SQL is paid once, at compile time, rather than at run time. If a query is used many times, its one-time cost is amortized over many invocations.

An optimizer determines which access paths to use to evaluate a SQL statement. It is the optimizer's responsibility to develop an optimal "plan" for the evaluation of each statement. This optimal plan minimizes the execution cost in terms of instructions and I/O operations. The plan is then passed to the code generator, which creates a machine language program to carry it out.
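The difference between paying the analysis cost once and paying it on every invocation can be shown with a toy. This is only a sketch under loud assumptions: the "statement" is a one-predicate string rather than SQL, a Python closure stands in for generated machine code, and the names `Analyzer`, `interpret`, and `compile_statement` are hypothetical.

```python
class Analyzer:
    """Stand-in for the parser and optimizer; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def analyze(self, statement):
        # Toy grammar: "FIELD = 'VALUE'". Real SQL analysis is far costlier.
        self.calls += 1
        field, value = (part.strip() for part in statement.split("="))
        return field, value.strip("'")


def interpret(analyzer, statement, records):
    """Interpretation: the statement is analyzed again on every invocation."""
    field, value = analyzer.analyze(statement)
    return [record for record in records if record[field] == value]


def compile_statement(analyzer, statement):
    """Compilation: analyze once, return a routine that can be run many times."""
    field, value = analyzer.analyze(statement)

    def run(records):
        return [record for record in records if record[field] == value]

    return run


RECORDS = [
    {"NAME": "Smith", "OFFICE": "Paris"},
    {"NAME": "Jones", "OFFICE": "Bonn"},
    {"NAME": "Kent", "OFFICE": "Paris"},
]
STATEMENT = "OFFICE = 'Paris'"

interpreting = Analyzer()
interpreted_results = [interpret(interpreting, STATEMENT, RECORDS) for _ in range(3)]

compiling = Analyzer()
compiled_query = compile_statement(compiling, STATEMENT)
compiled_results = [compiled_query(RECORDS) for _ in range(3)]
```

Both routes return the same rows; the counters on the two `Analyzer` objects record how many times the analysis step ran in each case.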

Nonprocedurality and automatic path selection

Nonnavigational data base languages allow data accesses and updates to be expressed without implying either the existence of specific access paths or the physical layout of data. While this makes application programs simpler, it relies on the DBMS to choose an optimal strategy for evaluating the program. This system is also relatively flexible. In particular, it is hard to imagine how a system without a high-level language would allow programs to adapt to new storage structures. Using the Figure 1 example, if we wish to determine if manager Portal has a service person in his office, we just query:

    SELECT NAME
    FROM   EMPLOYEE, OFFICE
    WHERE  EMPLOYEE.OFFICE = OFFICE.LOCATION
    AND    OFFICE.MANAGER = 'Portal'
    AND    EMPLOYEE.JOB = 'SERVICE';

Since the language specifies only what is desired, and not how to obtain it, the optimizer chooses among several possible plans. One strategy is to search EMPLOYEE looking for service people and, for each service person, use the corresponding OFFICE to enter into the OFFICE table to see if that employee works for Portal. Another strategy might first search OFFICE to find what LOCATIONs Portal manages and then search EMPLOYEE for service people at those locations. Other strategies involve sorting one or both tables.

When the System R optimizer selects the minimum-cost strategy for carrying out a statement, the cost is based on estimates of CPU and I/O requirements. Using an optimizer in this way has two benefits: First, the user need not be concerned with storage details. Second, the user is prohibited from "taking advantage" of knowledge of such details. The second benefit allows the program to continue functioning as the underlying storage structures evolve with time.
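The first two strategies described above can be written out by hand and compared. The sketch below is a simplification under stated assumptions: it counts records actually touched on the Figure 1 data (with Jones's job assumed to read SALES), whereas the System R optimizer works from cost estimates before execution; no indexes are assumed, and the function names are hypothetical.

```python
# Figure 1 data as (NAME, OFFICE, JOB, SALARY) and (LOCATION, MANAGER, PHONE).
EMPLOYEE = [
    ("Smith", "Paris", "SALES", 15000),
    ("Jones", "Bonn", "SALES", 18000),
    ("Clark", "Boise", "SALES", 12000),
    ("Kent", "Paris", "SERVICE", 15000),
    ("Davis", "London", "SERVICE", 13000),
    ("Jacob", "Rio", "SALES", 12000),
]
OFFICE = [
    ("San Jose", "Blasgen", 7152),
    ("Paris", "Portal", 9123),
    ("London", "Portal", 3278),
    ("Bonn", "Roever", 1287),
]


def plan_employee_first(employees, offices):
    """Scan EMPLOYEE for service people, then look each one's office up in OFFICE."""
    touched = 0
    names = []
    for name, office, job, _salary in employees:
        touched += 1
        if job != "SERVICE":
            continue
        for location, manager, _phone in offices:  # lookup by scanning (no index)
            touched += 1
            if location == office:
                if manager == "Portal":
                    names.append(name)
                break
    return sorted(names), touched


def plan_office_first(employees, offices):
    """Scan OFFICE for Portal's locations, then scan EMPLOYEE for service people there."""
    touched = 0
    locations = set()
    for location, manager, _phone in offices:
        touched += 1
        if manager == "Portal":
            locations.add(location)
    names = []
    for name, office, job, _salary in employees:
        touched += 1
        if job == "SERVICE" and office in locations:
            names.append(name)
    return sorted(names), touched


results = {
    "employee first": plan_employee_first(EMPLOYEE, OFFICE),
    "office first": plan_office_first(EMPLOYEE, OFFICE),
}
# Pick the plan that touched the fewest records.
cheapest = min(results, key=lambda plan: results[plan][1])
```

Both plans return the same names; only the amount of work differs, which is why the choice can be left to the system without changing the query.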

Data independence

Three techniques enhance System R performance: First, the high-level SQL allows global optimization of the query and the consequent generation of efficient code (minimum number of data base calls and I/Os). Second, the system can be tuned by defining or deleting access paths (either associative or pointer-based). Lastly, one can specify that records be clustered together in physical media for fast sequential access. All three techniques are transparent to the application programmer and the decisions can change without altering any programs.

To illustrate the point that System R gives programs some independence from data reorganization as the system is tuned, suppose there are two transactions which must be run periodically against the data base in Figure 1.

(T1) List all employees at a given office.
(T2) List all employees with a certain job.

These two queries need not consider how or where the records are stored nor specify the access paths for locating the records. The transactions will work correctly no matter how the data is organized.

If the users expect to run T1 ninety-nine percent of the time, and T2 one percent of the time, a data base structure which results in high performance will include a fast way to find all employees in a given office (e.g., an index on the OFFICE field of the EMPLOYEE table), with records of employees at each office clustered together in secondary storage (to minimize I/O).

An index is a means of directly addressing records of a relation containing a common value in a particular column, such as the records of EMPLOYEE where OFFICE = 'Paris'. Any authorized user can define the necessary index and clustering criterion and have the data reorganized appropriately. The T1 and T2 transactions will then take advantage of this new organization.

Suppose, however, that these estimates are wrong or that the use of the system changes so that T2 runs 99 percent of the time. Performance on the above data base structure will be very poor, and the data base will have to be restructured. Namely, the OFFICE index will be dropped, and a clustering index on JOB created. An essential characteristic of a flexible system is that this restructuring not require that programs T1 and T2 be rewritten.
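The restructuring scenario above can be replayed in miniature. This is a sketch assuming Python's standard `sqlite3` module in place of System R; the index names `EMP_OFFICE` and `EMP_JOB` are hypothetical, and physical clustering is not modeled because plain SQLite indexes do not express it.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE EMPLOYEE (NAME TEXT, OFFICE TEXT, JOB TEXT, SALARY INTEGER)")
db.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)",
    [
        ("Smith", "Paris", "SALES", 15000),
        ("Jones", "Bonn", "SALES", 18000),
        ("Clark", "Boise", "SALES", 12000),
        ("Kent", "Paris", "SERVICE", 15000),
        ("Davis", "London", "SERVICE", 13000),
        ("Jacob", "Rio", "SALES", 12000),
    ],
)

# The two transactions, written once and never edited below.
T1 = "SELECT NAME FROM EMPLOYEE WHERE OFFICE = ? ORDER BY NAME"
T2 = "SELECT NAME FROM EMPLOYEE WHERE JOB = ? ORDER BY NAME"


def run_transactions(connection):
    """Run T1 for the Paris office and T2 for the SERVICE job."""
    return (
        connection.execute(T1, ("Paris",)).fetchall(),
        connection.execute(T2, ("SERVICE",)).fetchall(),
    )


untuned = run_transactions(db)

# Tune for a workload dominated by T1.
db.execute("CREATE INDEX EMP_OFFICE ON EMPLOYEE (OFFICE)")
tuned_for_t1 = run_transactions(db)

# The workload shifts toward T2: drop the OFFICE index, index JOB instead.
db.execute("DROP INDEX EMP_OFFICE")
db.execute("CREATE INDEX EMP_JOB ON EMPLOYEE (JOB)")
tuned_for_t2 = run_transactions(db)
```

The text of T1 and T2 is untouched across all three configurations, and the answers are identical; only the access paths available to the engine change.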

Figure 3. A sample PLISQL program illustrates this high-level set-oriented language accessing and operating on a table, or view of a table, without detailed description of the specifics of the table organization.

Figure 4. Data flow and summary of SQL compiler. The compiler contains a statement-parser, an optimizer, and a code-generator.

In general, since System R supports a very high-level language, most data base structuring issues can be deferred until after the applications are written. This install-now-tune-later philosophy also eases application programming by deferring many performance decisions, such as the choice of indexes.

Incidentally, System R includes a utility program which evaluates a set of transactions and their relative execution frequencies and suggests good data base structurings for the user. This partially automates the data base design task.

Integrated data dictionary

A data dictionary is a description of the data base. It contains both machine-readable and human-readable descriptions of the data base tables, their attributes, interrelationships, and meaning. It is usually not very large, but it has a very rich structure.

Most systems have a data dictionary facility which stores metadata about the data base aside from the data base itself. The data dictionary is often built on top of the DBMS as a special application with a special data definition language. This approach allows the data dictionary to benefit from the facilities of the DBMS.

Separating the data dictionary from the data base raises two problems: First, the dictionary and data base may disagree with one another unless one interface has control of both functions. Second, having a separate data dictionary implies having a separate language for the definition and manipulation of the dictionary data base.

Figure 5 (axes: cost versus "size" of query). A comparison of interpretation, compilation, and precompilation shows the decided advantage of amortizing costs of screening and precompiling SQL statements, especially when a query will be used many times.

SQL is an integrated data definition and data manipulation language. In System R, the description of the data base is stored in user-visible system tables which can be read and altered using the SQL language. The creation of a table or an access path results in new entries in these system tables. Users who define tables and other objects are encouraged to include English text to describe the meanings of the objects. Later, other users can retrieve all tables with certain attributes or can browse among the descriptions of defined tables (if they are so authorized). A user can modify these entries to change the attributes of an object. This approach eliminates the dual language problem and assures that the dictionary agrees with the actual system.
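Reading the dictionary with the same language used for the data can be illustrated with a present-day analogue. The sketch assumes Python's `sqlite3` module; `sqlite_master` and `PRAGMA table_info` are SQLite's own catalog facilities, not System R's system tables, and unlike the tables described above they hold no English descriptions and are not meant to be altered by ordinary SQL. The index name `EMP_OFFICE` is hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Defining a table and an access path adds entries to the catalog.
db.execute("CREATE TABLE EMPLOYEE (NAME TEXT, OFFICE TEXT, JOB TEXT, SALARY INTEGER)")
db.execute("CREATE INDEX EMP_OFFICE ON EMPLOYEE (OFFICE)")

# The catalog is itself a table, queried with ordinary SQL.
catalog = db.execute(
    "SELECT type, name, tbl_name FROM sqlite_master ORDER BY type, name"
).fetchall()

# Column descriptions: each row is (cid, name, type, notnull, dflt_value, pk).
columns = [row[1] for row in db.execute("PRAGMA table_info(EMPLOYEE)")]
```

Because the catalog entries are produced by the same statements that create the objects, the description cannot drift out of step with the data base it describes.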

Views and authorization

The result of any SQL query is itself a table. A user can display such a table immediately, or store the definition of the table as a view. Views can be used just like other tables, except that some views cannot be modified (will not support insert, delete, or update).

Views provide data independence. If the structure of a table is changed (columns added or permuted or a table split into two tables), the user can define a view which looks like the original table. Old programs can access the new data via the view.

Views also provide a powerful authorization mechanism. Rather than allowing access to an entire table, we can define a view which is a row and column subset of the table and only allow access to that view. For example, one view might allow a manager to see records in his own department only. Further, certain fields of the view might be available for read only. This gives value-dependent authorization at a very detailed level.

Views, in combination with granting and revocation operations, provide the basis for the authorization mechanism. System R maintains special tables containing the definition of each view and the operations which selected users may perform on that view. In order to allow either centralized or distributed control of access, a special privilege called grant is included. Grant allows one user to grant any subset of capabilities to other users, who can pass on the grant as well. The compilation process checks these authorization constraints (at some cost), but this cost is paid only once, at compile time. If the query is invoked many times, a simple test at each invocation ensures that the decision is still valid. This powerful authorization system thus incurs almost no run-time overhead for compiled transactions.
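A view that exposes a row and column subset can be sketched as follows. The assumptions are significant: Python's `sqlite3` module stands in for System R, the view name `PARIS_STAFF` is hypothetical, and SQLite has no grant or revoke operations, so only the view half of the mechanism is shown. In SQLite every view is read-only unless triggers are added, which happens to illustrate a view that cannot be modified.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE EMPLOYEE (NAME TEXT, OFFICE TEXT, JOB TEXT, SALARY INTEGER)")
db.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)",
    [
        ("Smith", "Paris", "SALES", 15000),
        ("Jones", "Bonn", "SALES", 18000),
        ("Kent", "Paris", "SERVICE", 15000),
        ("Davis", "London", "SERVICE", 13000),
    ],
)

# Row subset (Paris only) and column subset (no SALARY), stored as a definition.
db.execute(
    "CREATE VIEW PARIS_STAFF AS "
    "SELECT NAME, JOB FROM EMPLOYEE WHERE OFFICE = 'Paris'"
)

# The view is queried like any other table.
cursor = db.execute("SELECT * FROM PARIS_STAFF ORDER BY NAME")
visible_columns = [column[0] for column in cursor.description]
visible_rows = cursor.fetchall()

# Attempting an update through the view is rejected by SQLite.
try:
    db.execute("UPDATE PARIS_STAFF SET JOB = 'SALES'")
    view_is_updatable = True
except sqlite3.Error:
    view_is_updatable = False
```

A user restricted to `PARIS_STAFF` sees neither the salaries nor the employees of other offices, which is the value-dependent restriction described above.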

In System R, there is no centralized DBA (data base administrator) function. Rather, each user group can have its own DBA to create and authorize access to that group's data. This provides a common data base shared among all users (an integrated data base) and yet allows some autonomy among the user groups.

Further, authorized users can perform almost all DBA functions at any time without interrupting the normal operation of the system. Authorized users can

* create and destroy tables,
* create and destroy indexes on tables,
* add a column to an existing table,
* install a new transaction,
* add users to the system,
* change the privileges held by various users, and
* define or drop a view of existing data.

Since SQL supports these operations, they may be invoked from a terminal (via UFI) or from a program.

Installation and tuning

The installation procedure for System R consists of acquiring storage space for the code and data of a starter system, which contains a small data base (relating to employees of a hypothetical company). Once this starter system is up and running, it enables users to experiment with a working version of System R in their own environment. When they have familiarized themselves with the starter system, they can begin to define their own data bases and transactions; this can be done in an incremental and experimental manner. Users can quickly define and use new data bases because the system permits them to defer many data base specification problems.

After an application is coded and running, the user can tune the system with a program which, for a set of SQL statements and their probabilities of occurrence, evaluates possible data base specifications and tells the user the best ones. A proposed data base specification consists of a clustering criterion for each table and a set of indices to be maintained on each table. The program also estimates the cost of evaluating each SQL statement and the weighted cost of the application.
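The weighted cost that such a tuning program reports can be written as the sum, over the statements, of probability times estimated cost. The sketch below illustrates that arithmetic only; the cost figures are invented for illustration, the function and design names are hypothetical, and nothing here reflects how the actual System R utility estimates costs.

```python
def weighted_cost(frequencies, statement_costs):
    """Sum of (probability of a statement) x (estimated cost of that statement)."""
    return sum(frequencies[name] * statement_costs[name] for name in frequencies)


def best_design(frequencies, designs):
    """Return the name of the candidate design with the lowest weighted cost."""
    return min(designs, key=lambda design: weighted_cost(frequencies, designs[design]))


# Hypothetical per-statement costs (arbitrary units) under two candidate designs
# for the T1 (by office) and T2 (by job) transactions discussed earlier.
DESIGNS = {
    "index and cluster on OFFICE": {"T1": 2, "T2": 50},
    "index and cluster on JOB": {"T1": 50, "T2": 2},
}

mostly_t1 = best_design({"T1": 0.99, "T2": 0.01}, DESIGNS)
mostly_t2 = best_design({"T1": 0.01, "T2": 0.99}, DESIGNS)
```

When the statement mix shifts from mostly T1 to mostly T2, the recommended design flips, while the statements themselves stay the same.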

Transaction management

A major goal of System R is to provide a full set of capabilities for data base management in a realistic, operational environment. Only in this way can we assess the viability of the architecture. In System R, multiple users can concurrently access data, and the system has complete facilities for transaction backout and system recovery. Recovery compensates for system failures as well as catastrophic failures of the magnetic media (e.g., a disk-head crash). We tried to avoid the need for human intervention in system recovery. In particular, recovery requires no explicit operator commands; only media recovery requires handling magnetic tapes. Almost all recovery information is kept on disk and a noncatastrophic restart does not involve operations personnel.

The transaction concept is the key to a successful recovery philosophy. A complex update of the data base involves many SQL statements. If a complex update fails or if the system crashes during the operation, the state of the data base will probably be confused. The system must be able to undo partially completed transactions. System R does this by keeping a log of all the changes a transaction has made. For example, it keeps the old and new value of each updated record. If the transaction gets into trouble, the system can undo the transaction by setting updated records to the old values still in the log. Transactions may be undone by either a system or an explicit user request. Once a transaction commits its updates, they will never be undone. Committed updates are redundantly stored in the log so that they can be reapplied in case of system or media failures (see Figure 6).

Figure 6. This illustrates the three possible transaction outcomes; the transaction commits, aborts, or is aborted by the system. A transaction is a unit of recovery; it commits all outputs, or aborts and has no effect. A transaction is also a unit of consistency; it starts with consistent inputs and produces consistent outputs.

The log is also useful in reconstructing a current version of the data base from an archive or checkpoint version. We use a new technique which provides incremental checkpointing to disk and yet permits dynamic space allocation. Incremental checkpointing allows us to use a particularly simple system restart technique. When a system crash occurs, a previous data base state is rapidly recreated through the use of backup page maps and checkpoint images of modified pages. This data base state, coupled with transaction log information, is then used to either redo or undo all transactions which were in progress at the checkpoint or completed after the checkpoint (Figure 7).

Figure 7. Five transaction types with respect to the most recent checkpoint and the crash point. At restart, T2 and T3 will be redone while T4 and T5 will be undone. Since T1 ended before the checkpoint, its outputs already appear in the checkpoint.

The transaction concept is exposed to the user by providing operations to begin, undo, and commit the work of a transaction. The application programmer brackets a successful transaction by a BEGIN-COMMIT pair. An aborted transaction ends with a RESTORE (undo) verb. All other aspects of recovery are handled by System R.

Transactions also supply the key to concurrency control. If multiple transactions concurrently read and write the same data, anomalies can occur. For example, if two transactions both want to add $100 to an account having $1000 balance, then the proper outcome should be $1200. But if both transactions start with the $1000 balance, then the outcome will be $1100. For this reason the execution of the two transactions T1 and T2 is serialized.
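The account anomaly above can be reproduced in a few lines. This is a sketch under assumptions: the interleaving is written out by hand so that the bad outcome is deterministic, and a `threading.Lock` stands in for whatever locking System R applies; the names `interleaved_deposits`, `Account`, and `serialized_deposits` are hypothetical.

```python
import threading


def interleaved_deposits(balance, amount):
    """Both transactions read the balance before either one writes it back."""
    read_by_t1 = balance
    read_by_t2 = balance
    balance = read_by_t1 + amount  # T1 writes
    balance = read_by_t2 + amount  # T2 overwrites T1's update
    return balance


class Account:
    """An account whose read-modify-write step is protected by a lock."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def deposit(self, amount):
        # Holding the lock across the read and the write forces the two
        # deposits to run one after the other.
        with self._lock:
            current = self.balance
            self.balance = current + amount


def serialized_deposits(balance, amount):
    """Run two concurrent deposits against a locked account."""
    account = Account(balance)
    workers = [threading.Thread(target=account.deposit, args=(amount,)) for _ in range(2)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return account.balance


lost_update = interleaved_deposits(1000, 100)
serialized = serialized_deposits(1000, 100)
```

The first function loses one deposit; the second yields the proper balance regardless of which thread happens to run first.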

System R uses a locking protocol so that: (1) the system itself never gets confused because of concurrent access to a data item by two or more transactions, and (2) the user can control the extent to which his transaction is isolated from the effects of other transactions. Each user can select one of three isolation levels:

* The lowest level permits the reading of data which has been updated by an incomplete transaction.
* The next level guarantees that only committed data is read.
* The third level guarantees that all reads are both committed and repeatable.

At all levels, System R prevents updates on top of uncommitted updates, so that the system can undo the effects of each transaction independently. All locks are set automatically to ensure the semantics of these isolation levels. The locking subsystem handles queuing and deadlock detection. The transaction recovery mechanism resolves deadlocks by undoing one or more transactions and pre-empting their locks.
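The undo mechanism this section relies on, both for user aborts and for breaking deadlocks, amounts to logging the old value of each updated record and restoring those values in reverse order. The sketch below covers undo only and makes several assumptions: a Python dictionary stands in for a table, the class name `Transaction` and its method names are hypothetical (chosen to echo the COMMIT and RESTORE verbs), and the redo use of the log after a crash is not modeled.

```python
class Transaction:
    """Records old and new values so that partial work can be undone."""

    def __init__(self, table):
        self.table = table  # dict mapping record key -> value
        self.log = []       # entries of (key, old_value, new_value)

    def update(self, key, new_value):
        # Log before changing the record, keeping the old value for undo.
        self.log.append((key, self.table[key], new_value))
        self.table[key] = new_value

    def commit(self):
        # Undo information is no longer needed once the work is committed.
        # (A real log would retain the new values for redo after a failure.)
        self.log.clear()

    def restore(self):
        # Undo in reverse order so the earliest old value is the one left in place.
        for key, old_value, _new_value in reversed(self.log):
            self.table[key] = old_value
        self.log.clear()


salaries = {"Smith": 15000, "Kent": 15000}

aborted = Transaction(salaries)
aborted.update("Smith", 16000)
aborted.update("Smith", 17000)
aborted.update("Kent", 15500)
aborted.restore()
after_restore = dict(salaries)

committed = Transaction(salaries)
committed.update("Kent", 15500)
committed.commit()
after_commit = dict(salaries)
```

Because each transaction carries its own log, one transaction can be undone without disturbing another, provided (as stated above) that no update is made on top of an uncommitted update.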

Status and experience

SQL is accessible from PL/I and Cobol and runs on IBM 370 VM and MVS operating systems. System R has been operational since mid-1977. Since then, it has undergone experimentation in various environments. Experience thus far indicates that the system is indeed easy to learn, install, and use. Further, its performance, when approached at the SQL level, is comparable to other data base management systems.

Typical System R applications involve business operations such as accounting, inventory control, purchasing, and bill of materials tracking. For the most part, these applications use System R through a host programming language (one would not want to generate the monthly statements of a large corporation by using an ad hoc query terminal).

Users nevertheless make considerable use of System R's query facilities. For example, they can debug SQL statements in advance by testing them using the UFI. Another example involves table loading: to assure that the table was correctly loaded, it is easier to verify the values using the UFI than to write a complex but seldom-used verification program. These examples demonstrate that it is important to offer both query and host language support in a complete data base management system.

Experience also shows the importance of providing frequently used transactions with high performance by means of early binding (compilation). Consider, for example, an airline reservation application. The most common and simplest instance of this application is fetching one record (a flight) and decrementing the number of seats available. Using the access method, the cost for this operation is very low, and one cannot afford to pay the cost of optimizing and interpreting the transaction on each execution.

In general, our experience with System R indicates that the approach will yield good performance for a variety of users, along with the important ability to tune the system as performance requirements change. In addition, the system is easy for new users, who can employ high-level languages to handle both new and existing applications.

Acknowledgments

Many have contributed to the design and implementation of System R: Raymond Boyce, Arvola Chan, David Choy, Kapali Eswaran, Teo Haerder, Randy Katz, Won Kim, Leonard Liu, Paul McJones, Moscheh Mresse, Jorgen Nilsson, Scott Parker, Dominic Portal, Phyllis Reisner, Paul Roever, Robert Selinger, and Vera Watson.

References

1. M. M. Astrahan et al., "System R, A Relational Approach to Database Management," ACM Trans. Database Systems, Vol. 1, No. 2, June 1976, pp. 97-137.

2. E. F. Codd, "A Relational Model of Data for Large Shared Data Banks," CACM, Vol. 13, No. 6, June 1970, pp. 377-387.

3. Special Issue: Data-Base Management Systems, Computing Surveys, Vol. 8, No. 1, Mar. 1976.
