
10. Client/Server and Distributed Computing

10.1  The Development of the Client/Server Concept
10.2  X-Windows
10.3  The Client/Server Architecture
10.4  Server Overview and Advantages/Disadvantages
      10.4.1  Distributed Server Overview
      10.4.2  Distributed Implementation Layer Overview
      10.4.3  Advantages of Distributed Systems
      10.4.4  Disadvantages of Distributed Systems
10.5  Activity-Entity Matrices
      10.5.1  Interaction Matrices
      10.5.2  Affinity Analysis
      10.5.3  CRUD Matrices
      10.5.4  Frequency Matrices
10.6  Where to Locate Each Database File
      10.6.1  Calculating the Traffic Matrix
10.7  File Partitioning
      10.7.1  Horizontal Partitioning
      10.7.2  Vertical Partitioning
10.8  Data Replication
10.9  References
10.10 Appendix 10A - SQL + Joins/Projections/Selects/Sorts
      10.10.1  SQL Data Definition Language Subset
      10.10.2  SQL CRUD Accesses
      10.10.3  SQL Select/Project/Join
      10.10.4  Column Order Re-Arrangement and Row Sorting
      10.10.5  Views
      10.10.6  The Report Building Process
10.11 Appendix 10B - Binary Components
10.12 Appendix 10C - Example of a Dendrogram

Readings: None.


In this section, we are going to look at several topics. First we will look at where data processing came from: batch processing of sorted transactions. From this we will see how the falling cost of technology, and the decentralization made possible by computer networks and workstations, are causing a revolution in the way the various parts of applications are run. The next section will look at how to evaluate the interaction cost of having the data and processes distributed, and will show ways to partition data for the best convenience at the least cost. Finally, we will look at how to join the distributed data back together in order to produce the reports an application needs.


10.1 The Development of the Client/Server Concept

In the oldest information systems applications, programs were invoked by an operator or by some Job Control Language (JCL) commands read in from punched cards. The applications were ’batch processing’ in nature, as their job was to read in a batch of transactions (i.e. a file of transaction records on tape or a pile of transaction records on punched cards), and use them to update a master file of data. A good example is credit card transactions:

• Credit card transactions might have been entered into the system from forms by clerks working off-line, typing (i.e. punching) the transactions describing individual purchases onto cards. Generally, computers had no interactive terminals, except perhaps one for the mainframe operator, who simply communicated with the operating system, received tape mount requests, and mounted the tapes.

• The transaction cards were physically sorted by an automatic machine.

• The cards, preceded by some program start-up information (JCL), were fed into the mainframe computer. The program asked that a blank tape, and the tape containing the sorted master file of current account records, be mounted on a pair of tape drives.

• It then proceeded to copy the master file to the blank tape, one record at a time, updating the balance of any account number matching the next data card. It did this by debiting an account’s current balance by the amount of the purchase encoded in the transaction on the card.

• When all the inputs for the batch had been processed, the old master tape was erased and the newly written tape became the current master tape.

• This process might be repeated once per working day, as the credit card forms came in.


Nowadays, we have interactive applications, wherein the clerks enter the data straight into the application, and the account balance is updated on disk. In fact, many transactions are coming in electronically using Electronic Data Interchange (EDI).

But we now use nicely normalized data. It is thus quite common that a transaction may cause reads and updates of several files. If several clerks are working simultaneously, they may try to debit the same account at the same time (maybe someone used his credit card at two places in the same day). This can cause a serious conflict, as illustrated in the following example:

1) Clerk A’s process reads the balance for account #1234 to be $500.

2) Clerk B’s process reads the balance for account #1234 to be $500.

3) Clerk A’s process subtracts the debit she is entering, say $100 from $500, and writes the new balance, $400, back to the master file.

4) Clerk B’s process subtracts the debit he is entering, say $150.00 from $500, and writes his new balance, $350, to the master file.

The problem is that the owner of account #1234 charged a total of $250, but only $150 was debited from their account! :)

What is needed is another process that regulates access by multiple clerks (i.e. users), allowing only one to be working on a particular account at any particular time. A database server is such a process. The first clerk, A, would then have locked the record of account #1234, read the balance, calculated and written back the updated balance, then unlocked the record. Clerk B would not have been able to lock the record for the purpose of updating it while Clerk A had it locked. This functionality is called ’record locking’.
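As a minimal sketch of the record-locking idea (the in-memory account table, the per-record lock, and the debit procedure are all invented for the illustration; a real DBMS manages its own locks):

    import threading

    # Hypothetical in-memory "master file": account number -> balance.
    accounts = {1234: 500}
    # One lock per record; a real DBMS would manage these itself.
    record_locks = {1234: threading.Lock()}

    def debit(account_no, amount):
        """Read-modify-write under a record lock, so concurrent clerks
        cannot interleave their updates and lose one of the debits."""
        with record_locks[account_no]:               # lock the record
            balance = accounts[account_no]           # read the balance
            accounts[account_no] = balance - amount  # write the updated balance
        # lock released here; the next clerk now sees the new balance

    # Clerk A and Clerk B both debit account #1234 at the same time.
    a = threading.Thread(target=debit, args=(1234, 100))
    b = threading.Thread(target=debit, args=(1234, 150))
    a.start(); b.start(); a.join(); b.join()
    print(accounts[1234])   # 250, not 350 or 400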

Database systems can do much more than provide just locking to guarantee atomic (i.e. unpartitionable) updates in a multi-user system. They also provide a library of efficient CRUD procedures which can be called to access and update the data, they maintain the multiple B+ trees and sequential links as elements are added and deleted from each file, and often they try to ensure referential integrity.


The database library of procedures can be linked into each user’s application program and, using some operating system semaphores, implement locking. The various users’ applications need not even be the same program; they may be different departments’ programs running simultaneously. As long as they all link in the database library, and the several copies of the database library linked into each user’s program cooperate via semaphores to guarantee proper operation of the locks, all accesses will work properly.

Alternately, rather than having 10 copies of the database library running in 10 different users’ applications on the same main computer (what a waste of RAM!), you could arrange that only one copy of the database code exists. When the first user needs database services, the database code is brought in, opens the necessary files, and prepares queues for any other users requesting CRUDs. In the MS-DOS world, the database might be called a ’terminate-and-stay-resident program’, because the database executable code remains in memory between calls to it, ready to be activated at a moment’s notice. In a proper multi-user operating system, the database code is a process (i.e. just a looping daemon program). The process waits in an idle (i.e. blocked) state between calls, ready to handle any incoming CRUD requests. If incoming CRUD requests do not cause lock conflicts, the database system can even make multiple disk sector requests to the disk driver before getting the results of the first back. The disk driver handles them in an efficient manner (e.g. the elevator scheduling algorithm).
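A toy sketch of the looping-daemon arrangement (the queue, request format, and table are invented for the illustration; real requests would arrive via inter-process communication or the network):

    import queue
    import threading

    crud_requests = queue.Queue()
    table = {}          # toy stand-in for one database file: key -> record

    def database_daemon():
        """One copy of the database code, looping forever: it blocks while idle
        and handles incoming CRUD requests one at a time from its queue."""
        while True:
            op, key, value, reply = crud_requests.get()   # blocks until a request arrives
            if op in ("create", "update"):
                table[key] = value
                reply.put(("ok", None))
            elif op == "read":
                reply.put(("ok", table.get(key)))
            elif op == "delete":
                reply.put(("ok", table.pop(key, None)))

    threading.Thread(target=database_daemon, daemon=True).start()

    # A "user program" queues a request and waits for the result:
    answer = queue.Queue()
    crud_requests.put(("create", 1234, {"balance": 500}, answer))
    print(answer.get())   # ('ok', None)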

Now here is the key point. Since each application user need not have a copy of the database code linked in, the database code need not even be running on the same computer as the application. Normally when code is linked, it has to run in the same memory address space. But with the database management system (DBMS) code a process unto itself, the application can even be running on another computer! Instead of a procedure call (which is normally done in the same memory space), a ’remote procedure call’ can be sent using a network packet from the application to the database server. The results of the access have to be sent back to the application in the same manner (this is called ’call by value, return by value’).


Note that you cannot use call by reference (e.g. VAR parameters in Modula-2), as the address of an in/out parameter variable on one computer is useless to another computer. Also, realize that there are several ways to handle remote access:

• Simply send e-mail-like messages back and forth (be careful: e-mail doesn’t always handle binary data well).

• Write your own communications code, or

• Purchase a library of Remote Procedure Call (RPC) stubs to use locally. These stubs have CRUD definition modules, but their implementation modules implement the CRUD accesses by sending formatted messages, using some protocol (e.g. ISO File Transfer and Access Method (FTAM)), to a known id for the database server process at a known CPU address on a known network, e.g. Process 9876 on CPU #23 on the UBC mathnet. Such a database server process is written and compiled to accept and handle RPCs from anywhere, and (with appropriate security) do the CRUD access as if it had come from a local application, except that it must marshal the return parameters into a message and send them back to the stub on the client application’s computer. (A sketch of such a stub follows below.)
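To make the stub idea concrete, here is a minimal sketch of what a client-side stub for a ’read record’ CRUD call might look like (the message format, server host name, and port number are invented for the example; a purchased stub library would hide all of this behind an ordinary procedure signature):

    import json
    import socket

    DB_SERVER = ("db-server.example.com", 9876)   # hypothetical server process address

    def read_record(table, key):
        """Client-side stub: marshal the call into a message, send it to the
        database server process, and unmarshal the reply (call/return by value)."""
        request = json.dumps({"op": "read", "table": table, "key": key}).encode()
        with socket.create_connection(DB_SERVER) as conn:
            conn.sendall(request)                 # marshalled parameters go out...
            reply = conn.recv(65536)              # ...and marshalled results come back
        return json.loads(reply)

    # The application programmer just sees an ordinary-looking procedure call:
    # balance = read_record("account", 1234)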

RPCs have the following advantages:

- RPCs give the application programmer the illusion of procedure calls, which is a paradigm he is likely to feel very comfortable with. For instance, co-op students and recent grads can use them to write applications without having to know or program all the network communication subtleties.

- RPCs allow access by several de-centralized departments to enterprise-wide data. They allow data to be located where it is used most frequently, yet allow occasional access from other locations by other applications as needed.

RPCs have the following disadvantages:

- RPCs have some overhead in marshalling data into packets for the call, and for the return. This tends to be small compared to the disk access time.

- Someone must be paid to write the RPC data communications code at each end.


- RPCs can have tremendous delays, as satellite transmission to Toronto and back can take 0.5 second, which is an eternity in computer terms. This poses a response time problem if you are trying to, say, join two database tables which are both remote from you.

- RPCs can impose a heavy communication load on the network and, as a result, run up large communications costs.

- Applications must be aware of the process, CPU, and network designators of the databases they are trying to access. This can become so complicated in large publicly-available applications that directory services are needed just to allow automated look-up of servers by name.

- RPCs pose a security risk.

So servers are programs/processes which offer a service (typically database CRUDs) to any authorized process that calls. The server doesn’t care whether it is called from the accounting department’s application on the head office computer (which may be where the database is running too), or from the corporation’s Australian sales office. It will provide and accept data from any authorized client.

Clients are programs which are users of specific services provided by a separate program/process. A client really doesn’t care who or where the service is obtained from, as long as it knows of some place it can access the data it needs to provide its user with the required functionality.

Database specialists usually write the actual database code. They worry about disk sector and cylinder arrangements, and B+ trees, and integrity. They provide a definition module which acts as an advertisement of the Application Program Interface (API) from which the client programmer imports the CRUD procedures.

Systems or data communications programmers (c.f. CMPT 371) can import the API, and use it to write the implementation of the server’s RPC handling code. They advertise with a similar definition module, which adds just a few procedures to make initial contact with the server’s RPC handling code. They thus export a distributed database API, which is implemented on top of the basic database system API. So remember, an API is just a definition module, or a set of definition modules, and some instructions/comments on how to use them.


10.2 X-Windows

To understand the following discussion of GUI-based client/server application architectures, you must understand X-windows. X-windows is a screen device, CPU, and operating system independent communication protocol for windowing. It allows incompatible computers to write to each other’s GUI screens by sending messages to each other. It is a simple protocol for sending window creation and manipulation commands from one computer to another, without knowing a lot about the other computer’s video circuit board, CPU, or OS. It also allows the capture and return of mouse and keyboard inputs. Because it is implemented with a communication protocol, it is also independent of programming language.

For example, with a NeXT (Motorola 68040 CPU) running the NeXTStep operating system and a program called Co-eXist, it allows you to open an X-windows connection to a Sun Sparc CPU running Sun OS and an X-windows based graphical application such as Objectbench. You could also attach a PC or a Mac to the net, and run Objectbench from there!

X-windows is not a GUI system really. You can build several different GUIs with a different ’look and feel’ from each other, using X-windows. Instead, X-windows is a simple underlying sub-system for drawing pixels, rectangles (possibly overlapping), and bit images like characters, and receiving keyboard and mouse events back (including which rectangle someone has clicked their mouse in).


The architecture of X-windows looks like this:

[Figure: X-windows architecture. Client Application #1, Client Application #2, and the Window Manager (itself a client) each call Xlib, possibly through an X toolkit, to send X requests to the X Server; X replies and user events flow back to the clients. The X Server sits on top of the device drivers and basic I/O of the workstation.]

X-windows requires a workstation with a CPU (a dumb terminal won’t do), though the workstation may not necessarily have a hard disk. It is either a special intelligent graphics terminal called an X terminal, or a workstation (e.g. NeXT, Sun, PC, Mac). The X server software is specially written for each type of workstation, and acts sort of like an RPC handler (though it is more sophisticated in that it can originate event messages to the clients, not just respond to calls from the clients). The X server runs on the workstation CPU.

The clients are applications that need to write to the screen and receive input from the user. Clients may be running on the same workstation as the server, or on a CPU on the other side of the world. They write to the screen by calling a local (to them) copy of Xlib, which is like a more sophisticated RPC call stub (more sophisticated in that it can handle receiving unsolicited events from the server/user). The Xlib is written for the native client CPU, OS, and network type. Xlib is another API available to the application programmer.

Since X is such a low level protocol, some programmers writing client applications prefer to call an X tools library which allows them to program using higher level abstractions (higher level API).

The client application doesn’t necessarily know what GUI ’look and feel’ is being shown to the user. It could be Motif, OpenLook, MS-Windows, or the Mac Interface. In fact, the window manager is just a special client of the X server that handles the format and overlap of windows (e.g. is the scroll bar on the left or right, what does it look like, and what happens when it is hidden). So application client requests that come into the X server are often routed to the window manager for display management decisions. The window manager can be quite a large, sophisticated chunk of code.

Interestingly, the window manager need not be running on the same CPU as the X server. Often it is, but if the X server’s CPU is not programmed for a particular look and feel (e.g. an X terminal), or not capable of running the window manager (e.g. not enough RAM), it can be run on a separate machine.

The final result is that you could be sitting at a screen with 4 windows on it, one displaying a PC-DOS application on an X-aware PC somewhere else on the net, one showing the running of a Sun application from somewhere, one running an IBM mainframe application, and one running a DEC minicomputer application. You could mouse between them, and type commands into any one you want. If different X window managers were available for each different CPU’s native GUI, you might even have a different ’look and feel’ for each window by providing a separate manager for each one. More typically you would use a local window manager, and they would all look the same. You should use a local window manager, as there are a lot of pixel drawing commands that would otherwise have to be sent across the network. This is one of the advantages of workstations: they provide distributed users with graphical user interfaces without loading the network down with pixel-by-pixel communication. They do this by providing the high level abstraction implementation in the workstation, so that only simple commands/messages invoking the higher level UI abstractions, and user data I/O, need to be sent over the network. And the user data has to move over the network whether you are using a GUI or a dumb terminal.


10.3 The Client/Server Architecture

We now see that an application program often only needs to implement the functionality associated with the nature of the application itself, and the database and user I/O can be handled by other server processes. For instance, Objectbench could possibly in the future provide a library of UI objects like menus, basic screens, dialog boxes, and text windows, which you could simply include in your application. When you send such an object, say, a ’new window’ event, a user window would open up alongside the Objectbench window to provide user I/O for your Cmpt 370 assignment application.

The term ’client/server’ means different things to different people.

• In Cmpt 275, it just meant one module used the services of another.

• At SFU, it typically just means that your home directory, and some of the application software executable files you can start, are not necessarily located on the CPU you are logged onto. It has nothing to do with database management systems. It is called a distributed or network file system, where there are some CPUs whose sole job is to provide access to some huge hard disks.

• Also at SFU, we have CPU servers, or compute servers. If you have a heavily CPU burdensome application to run, you can connect to a high powered CPU which runs extremely fast. The compute server may not have much file memory or many terminals attached to it, but it sure can compute fast.

• In the information processing world, it typically means that the relational data tables are located in files on a CPU separate from the one running the front end and core of the application. Such a file server must run its own DBMS server software, or you can use a distributed database system server. The client application must pull together the data from the various servers to present to the user, or a distributed DBMS must be constructed to give the application programmer the illusion that the data is not distributed.


When you read in the data processing industry magazines about re-engineering to use the new client/server architecture, they mean taking enterprise applications that have previously resided on a mainframe, and which are getting a little old anyway, and re-writing them to run in a distributed manner. They mainly mean distributing the data to where it is most needed, and using distributed database systems to handle the network-wide record locking that is sometimes needed. I will call this a database server architecture. Such an architecture may or may not use a network (or distributed) file system, which manages filenames, locations, and permissions network-wide.

[Figure: what does ’distributed’ really mean? Three configurations are contrasted, each with a GUI on top: multiple (possibly different) applications accessing a common database; one application accessing distributed data via separate DBMSs; and an application accessing distributed data via a distributed DBMS.]


Occasionally, magazine client/server articles just mean encapsulation and information hiding and Object-Oriented techniques, or alternatively the concept of file/DBMS servers. But generally they mean learning to use distributed systems where multiple concurrent processors and authorization and networks provide new challenges. They MUST deal with these issues as:

• a bunch of small computers is often cheaper than a large mainframe.

• distributed applications can incur a lot of data communications cost if not well planned (e.g. data put in the most-used location, X-windows used to reduce low-level GUI pixel messages).

• Most customers now prefer UIs that are graphical.

• Most corporations like the fact that, with a network, CPU power is easily scalable (just add another CPU in some closet somewhere), and redundant (if one CPU breaks down, the applications for the most part can keep going).

To do this requires planning as to where best to place portions of the application code (e.g. database server, window manager, etc.), where best to place the data, where best to implement the business rules, and how specialized and sophisticated the types of servers on the network need to be (note: you can purchase special computers that are tailored for economical, high-throughput transaction processing). All these interactions between separate parts of the system allow corporations to choose the most efficient (i.e. cheapest) way to do things, but then you have to handle a lot of incompatibilities between different brands of CPU, OS, network protocol, database, etc. Overall, it is a major challenge to design such a heterogeneous system, especially if you only know COBOL, MVS, and SNA, rather than C++, Unix, and TCP. Unix is not an easy operating system to manage, but it was the original one to handle networking integrally.


Regarding placing portions of the system, here is the problem:

[Figure: the pieces to be placed on CPUs: X Server, Window Manager, Client Application, Database Management System, and File System.]

Because there is a lot of communication between the Window Manager and the X server handling the user, it is often good to run these both on the workstation. On the other hand, X terminals are cheaper than workstations, so maybe the window manager should be run on the same CPU as the application.

Because there is a lot of interaction between the DBMS and the file system, maybe these two portions should be run on the same CPU. But what if the data doesn’t all reside on the same CPU or file system, but is spread across the enterprise’s world-wide operations?

If you do use workstations, why not run the application on each user’s workstation? But then if you have 10 users, that means you have 10 copies of the code wasting RAM in 10 different CPUs. Also, workstation CPU power (measured in Million Instructions Per Second (MIPS)) is more expensive than that of specialized CPU servers.


Or, you can put all the processes except the X server on a single computer. Then you simply have updated the mainframe type application server to use a GUI. But if every user likes having a PC on their desk anyway, for miscellaneous tasks, such a system wastes the CPU power on the desktop.

Another possibility is to split the client application and put some of it related to menus and application specific graphics on the workstation, and the other half related to accessing the underlying database and generating reports, on the database/file server’s CPU.

Please think about and understand these trade-offs, and be able to write a paragraph on one or the other during an exam. Also realize that, from the corporation’s point of view, they simply want to provide adequate friendliness, functionality, and power to users at the cheapest possible price. In the 70’s, as a company and the number of computer applications it used grew, the only main decisions were when to buy more or faster disks, and when to buy a faster CPU. In the 90’s, there are so many more variables to consider!


10.4 Server Overview and Advantages/Disadvantages

This section will summarize some features of distributed systems, indicate the most frequent architecture, and then provide an advantages/disadvantages list.

10.4.1 Distributed Server Overview

a) File systems (servers) provide:
- (network-wide) directory services
- (network-wide) access permissions to individual files
- read/write of individual bytes for users/clients (anywhere on the network). Note: file systems don’t really understand the concept of records.
- storage for application executable code, so that when each client application starts up, it gets the latest version. This eases upgrades (vs. having various copies stored on each individual department’s/user’s computer disk), and saves disk space.
- the opportunity for a (network-wide) back-up authority (NEVER EVER rely on workstation users to back their files up!).

b) Database Servers are built on top of file systems and provide:
- CRUDs of individual records (not bytes) for any client written in any language
- creation, updating, and use of record links and trees for fast sequential and random access
- record locking to allow concurrent user access (also deadlock detection)
- transaction commit/rollback/recovery for maintaining integrity
- sort/select/join/project functionality
- database re-organization (sometimes).

c) Distributed Database Servers provide:
- support for many hardware platform/operating system/network brands
- (almost) transparent location of data (distributed/partitioned/replicated?) to the client application programmer and user. A distributed server sometimes needs a data distribution directory service, even to help itself function properly.
- a two-phase commit strategy to maintain integrity across the distributed database
- data format independence (e.g. is the most significant byte of an integer stored lowest in RAM and sent first over the net, or vice versa?)
- support for applications to be built with ’views’, so that if one application needs an extra column in a normalized table, it will not require a change to other applications
- some independence from both foreign key and regular attribute aliasing problems due to synonyms, homonyms, and other enterprise-wide inconsistencies that can occur. If you don’t adopt an enterprise-wide data dictionary, then you may need a distributed database inquire-and-convert-format feature. If you do adopt an enterprise-wide consistent data dictionary, then changes by other applications to the data dictionary/directory entries should somehow be arranged not to affect your application.
- prevention, as much as possible, of problems which occur when the DBMS is upgraded to a later version at one location and not immediately at the other locations
- security and authorization to particular files (or table columns/views) by individual client programs and/or users.

d) GUI Servers provide:
- graphics hardware, OS, and network independent support of overlapping windows, ’look and feel’, mouse, and keyboard services.

e) Print Servers provide:
- queueing and queue management (e.g. kill entry in queue) of print jobs
- a printer driver that appears independent of printer hardware brand
- deletion of the file when done
- warnings re jams and out of paper, in a printer hardware/OS/network independent manner.

f) Fax Servers provide:
- queueing, dialing, re-dialing if busy, and delay-until-later (to use cheap midnight long distance phone rates) sending of documents from word processors, without paper
- queueing and viewing of incoming documents.

g) Operating Systems are really just servers which provide services such as:
- passwords and security for processing (not data)
- tasking and task synchronization (e.g. semaphores)
- memory management
- I/O interfacing
These services are obtained via linking if local, or via messaging if a distributed OS.

10.4.2 Distributed Implementation Layer Overview

We thus see that distributed systems are often implemented with the client application built on top of, i.e. by using the services of, servers. The servers themselves may be built on top of other service providers.

Generally, the most common arrangement you will see in future is:

1) Micro computers/workstations doing GUI-intensive and application ’front-end’ processing (such as checking typed input formats), which would otherwise be expensive if implemented over a network. The bulk of the application and database would be obtained from a departmental server.

2) Mini-computers acting as departmental database and application processing servers. The minis would run the departmental applications on mostly local data, but must provide access to departmental files for other departments in a software/hardware/network brand-independent way (because other departments’ minis or the mainframe may have to be, for various practical (and sometimes political) reasons, a different brand).

3) Central mainframe (or mini) for data storage and processing that is not the responsibility of any particular department (e.g. accounting). Central computers are also good at storing huge amounts of data (Terabytes) and, because they have extremely high processing speed, provide one of the better ways (simple brute force) to implement multi-user integrity for high transaction rate OLTP applications.

10.4.3 Advantages of Distributed Systems

• allows use of mini-computers and micros, which are smaller and cheaper per MIPS and MB. Mainframes are more expensive as it is very hard to build very fast processors, RAM, and disks.


• reduce communications and communications processing costs by moving data from a central location to where it is most used.

• response time improvements, since
- distributed systems use the workstation computer for GUI and some user-intensive application processing, and
- with distributed data, a smaller percentage of the data accesses require remote file access.

• reliability/survivability improvements as most (but probably not all) local user operations can continue at most sites if one file server or distributed DBMS location fails.

• provide smoother incremental growth as you can add more processing power one mini at a time, rather than one mainframe at a time.

• ’design information’ hiding through encapsulation and well defined APIs and message protocols. Server decides how to implement any particular service, and this can be changed/upgraded with no change to the client applications (unless the interface/protocol changes). Also clients can be changed/added without having to change the database.

• autonomy for different departments to develop/buy applications that are most suited to their needs. But this must only be allowed if the data and some processing is available to other departments via a published interface mechanism, otherwise incompatibilities will grow. Note that the desire for autonomy is often an indicator of unresponsive central data processing departments. Recent surveys have shown that most central data processing departments have a 2 year backlog of requests for changes or new systems for individual departments.

10.4.4 Disadvantages of Distributed Systems

• Unfortunately, need to build systems that are independent of hardware, software, operating system, and network brand (this is difficult).


• Unfortunately, need to purchase extra copies of operating systems, DBMSs, network file systems, communications systems, et cetera, one for each computer. Often though, micro and mini systems software is a lot cheaper than mainframe systems software.

• Unfortunately, need to make sure consistent attribute and especially primary and secondary key names and formats are used enterprise wide (i.e. force use and compliance with an enterprise-wide data dictionary, even though programmers may hate names they personally didn’t invent).

• Unfortunately, need distributed integrity controls
- e.g. before creating a new record, you check if a foreign key points to a valid instance. But by the time you write your record, the other instance has been deleted by another user and the foreign key is thus garbage!
- e.g. two salespersons both selling the last item of a particular type that is in stock!

• Rollback and recovery is harder when distributed (a sketch of the two-phase commit idea follows this list).

• Risk of delegating back-up archiving to individual departments who may be lax about backing up disks every night.
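As a sketch of the two-phase commit idea mentioned under Distributed Database Servers above (the Participant class is an invented stand-in for the DBMS at each location; a real implementation also logs each step so it can recover from crashes):

    class Participant:
        """Stand-in for the DBMS at one location; a real one would write
        prepare/commit/abort records to its own log and hold locks meanwhile."""
        def __init__(self, name, can_commit=True):
            self.name = name
            self.can_commit = can_commit
            self.state = "working"

        def prepare(self):            # phase 1: vote on whether this site can commit
            self.state = "prepared" if self.can_commit else "aborted"
            return self.can_commit

        def commit(self):             # phase 2: make the update permanent
            self.state = "committed"

        def abort(self):              # phase 2: undo the tentative update
            self.state = "aborted"

    def two_phase_commit(participants):
        """Commit everywhere only if every location votes yes; otherwise abort everywhere."""
        if all(p.prepare() for p in participants):
            for p in participants:
                p.commit()
            return "committed"
        for p in participants:
            p.abort()
        return "aborted"

    sites = [Participant("Loc1"), Participant("Loc3", can_commit=False)]
    print(two_phase_commit(sites))   # 'aborted' - one site could not prepare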


10.5 Activity-Entity Matrices

In order to scientifically evaluate the best way to distribute data and the processes that work on the data entities, we need to document which processes access which entities, and how frequently. Since each process could potentially access every entity, we have to represent this interaction documentation using a matrix. In the Information Engineering methodology, and especially in Texas Instruments’ IEF CASE tool, these are called Activity-Entity Matrices. IEF can display matrices to document the interaction of anything with anything else. In addition, there are automated functions to re-arrange the columns and rows to help bring out patterns in the data use that are useful during design. We will look at a few uses for these matrices.

10.5.1 Interaction Matrices

It is often important to determine which departments, and more importantly which elementary processes, interact with which data entities. This is usually illustrated with an interaction matrix.


The following matrix is for the university registration example used previously in the BCNF normalization part of the lecture notes.

An interaction matrix has a mark in every matrix cell where a particular process on the left is anticipated to need to CRUD the entity labelling a particular column. In the matrix below, I assumed that before deleting an entity instance, you had to check that this wouldn’t destroy referential integrity. For example, when deleting a course offering, you had to check that (i.e. read whether) there were no students registered in the course offering.

ENTITIES (columns, left to right): advisor, course, student-registration, student-advisor, student, course offering, instructor

ELEMENTARY PROCESSES/ACTIVITIES (rows), with a 1 for each entity the process interacts with:

add coop advisor       1
add instructor         1
add registration       1 1 1 1
assoc stud + advisor   1 1 1
add course offering    1 1 1
enroll student         1
add course             1
expel student          1 1 1
drop registration      1
del course offering    1 1
delete coop advisor    1 1
delete course          1 1
de-assoc stud + adv    1
delete instructor      1 1


10.5.2 Affinity Analysis

In the previous matrix, the intersection marks were scattered all over the matrix. This is because the processes that work on common entities were not grouped together.

• For instance, add course and delete course are not adjacent rows.

• Similarly, the columns are not arranged so the associations are between (or at least near) the kernel entity columns they associate.

It is possible with a good table editor to re-arrange the rows, so the rows which have many ’1’s in common are right above and below one another. Then, you can arrange columns that have similar patterns of ’1’s to be near each other. The result will be something like the following:

ENTITIES (columns, left to right): advisor, student-advisor, student, student-registration, course offering, course, instructor

PROCESSES (rows), with a 1 for each entity the process interacts with:

add coop advisor       1
delete coop advisor    1 1
assoc stud + advisor   1 1 1
de-assoc stud + adv    1
enroll student         1
expel student          1 1 1
add registration       1 1 1 1
drop registration      1
add course offering    1 1 1
del course offering    1 1
add course             1
delete course          1 1
add instructor         1
delete instructor      1 1

• Notice how the ’1’s have arranged themselves into a diagonal pattern.

• What’s more important is you can see that add course is right above delete course.

• And the student-advisor association column is between the advisor and student columns.


This affinity analysis can be done algorithmically, with a process similar to correlation. A computer program, such as the one included with the IEF CASE tool, generates a hierarchical tree of the rows that are most similar to one another, called a dendrogram.

Correlation is not actually used in the algorithm as the correlation between two rows could potentially be negative. We are not interested in whether rows have an opposite pattern of 1’s and 0’s, but how many 1’s they have in common relative to the number of 1’s they have in total. Consider two elementary processes X and Y with the following rows:

X        1101010
Y        1001000
X AND Y  1001000   (2 ones)
X OR Y   1101010   (4 ones)

The bitwise AND of the two rows shows these processes interact with 2 entities in common. Taking the bitwise OR of the two rows shows these processes together access 4 entities. So the affinity between them is 2/4 = 0.5

You have to do this between every row and every other row. The results are shown in a Similarity Matrix. The similarity matrix has 1’s down the diagonal (as every row is similar to itself), and is symmetrical about the diagonal (as X has the same similarity to Y, as Y has to X).
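A small sketch of this affinity calculation, using the X and Y rows from the example above:

    def affinity(row_x, row_y):
        """Affinity = (entities accessed in common) / (entities accessed in total),
        i.e. |X AND Y| / |X OR Y| over the 0/1 interaction rows."""
        common = sum(1 for x, y in zip(row_x, row_y) if x and y)
        total = sum(1 for x, y in zip(row_x, row_y) if x or y)
        return common / total if total else 0.0

    rows = {
        "X": [1, 1, 0, 1, 0, 1, 0],
        "Y": [1, 0, 0, 1, 0, 0, 0],
    }

    # Similarity matrix: 1's on the diagonal, symmetric about it.
    similarity = {a: {b: affinity(rows[a], rows[b]) for b in rows} for a in rows}
    print(similarity["X"]["Y"])   # 0.5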

There are a variety of techniques for measuring similarity that have been tried for affinity analysis, with the goal of rearranging rows (and columns) into clusters. Some, like correlation, allow negative numbers; another, suggested by a popular author, computes an asymmetric matrix (e.g. X has 50% of its 4 accesses in common with Y, while Y has 100% of its 2 in common with X). Which algorithmic technique is best is not clear.

Also, when you cluster two similar rows together, it is open to interpretation what the new affinity is between the pair as a whole and, say, a third row. Some authors [Fertuck 92] suggest using the minimum of the affinities between the third and either of the first two rows. Others [Flaatten 91] suggest using the average. None suggest using the union of the data accesses of the pair. Anyway, with the similarity matrix and a clustering algorithm, you can construct a dendrogram like the one shown in an Appendix of this section of the course notes.

The dendrogram illustrates the processes with the highest affinity to each other being paired first (on the left). Only after large groups of similar (i.e. cohesive, in that they access the same data) processes are joined, are the groups themselves joined (as you move farther to the right). In the end, the dendrogram joins everything into one big cluster (after all, this is one big system of related processes). The amazing thing about a dendrogram is that you can pick and change how many clusters you want, without recalculating everything, simply by specifying the similarity criterion necessary to cluster processes. The dendrogram in the appendix shows five clusters of related processes separated by dashed lines. [Fertuck 92] suggests the number of clusters that will make sense in a typical application is about the square root of the number of rows. But if you want a different number of clusters, you just have to relax or tighten your criterion for what makes two processes similar enough. With the dendrogram, you do not have to recalculate using the new criterion; you would just have to merge two existing clusters, or cut the least cohesive cluster in two where the dendrogram suggests!
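As a sketch of how such a dendrogram can be built from the similarity matrix, using the minimum-affinity merge rule that [Fertuck 92] suggests (the tiny three-process similarity matrix is invented just for the illustration):

    def build_dendrogram(sim):
        """Agglomerative clustering: repeatedly merge the two most similar clusters.
        The affinity between a merged pair and any other cluster is taken as the
        minimum of the individual affinities (the [Fertuck 92] rule)."""
        clusters = {name: (name,) for name in sim}      # start with singleton clusters
        merges = []                                     # (cluster_a, cluster_b, affinity)
        while len(clusters) > 1:
            # find the most similar pair of current clusters
            a, b = max(((a, b) for a in clusters for b in clusters if a < b),
                       key=lambda pair: sim[pair[0]][pair[1]])
            merges.append((a, b, sim[a][b]))
            merged = a + "+" + b
            clusters[merged] = clusters.pop(a) + clusters.pop(b)
            # affinity of the merged cluster to every remaining cluster = min rule
            sim[merged] = {c: min(sim[a][c], sim[b][c]) for c in clusters if c != merged}
            for c in clusters:
                if c != merged:
                    sim[c][merged] = sim[merged][c]
        return merges

    # Toy 3-process similarity matrix (symmetric, 1's on the diagonal):
    sim = {"P1": {"P1": 1.0, "P2": 0.5, "P3": 0.2},
           "P2": {"P1": 0.5, "P2": 1.0, "P3": 0.4},
           "P3": {"P1": 0.2, "P2": 0.4, "P3": 1.0}}
    print(build_dendrogram(sim))   # [('P1', 'P2', 0.5), ('P1+P2', 'P3', 0.2)]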

10.5.3 CRUD Matrices

A popular variant on interaction matrices is the so-called CRUD matrix. Instead of putting a 1 in the matrix cells, you enter one or more of “C”, “R”, “U”, or “D”. The result looks like the following:

ENTITIES (columns, left to right): advisor, student-advisor, student, student-registration, course offering, course, instructor

PROCESSES (rows), with the CRUD letters for the entities each process touches:

add coop advisor       C
delete coop advisor    DR R
assoc stud + advisor   R C R
de-assoc stud + adv    D
enroll student         CR
expel student          R D R
add registration       R C R R
drop registration      D
add course offering    C R R
del course offering    R D
add course             C
delete course          R D
add instructor         C
delete instructor      R D

Note that I have not put any “U”s in the above matrix, as I do not have any row processes which actually modify an existing entity instance.

It is popular to group portions of an activity-entity matrix into a number of regions. You can either group the regions that do most of the writing of entities (i.e. CUD), or you can use the dendrograms for the rows and columns, created by much computation with Similarity Matrices, to automatically make regions for you. Here is a CRUD matrix broken into a number of regions based on CUD affinity:

[Figure: the same CRUD matrix with its rows and columns grouped into regions; the regions correspond to an advisor/student-advisor cluster, a student/student-registration cluster, and a course offering/course/instructor cluster.]

This regionalizing is somewhat arbitrary. Generally, it is good to break the matrix up into a number of regions equal to the square root of the average of the number of rows and columns (in our case SQRT((14+7)/2) ~ 3). This is an arbitrary recommendation, as some matrices have poor affinity everywhere and can’t even be diagonalized well.

The regions can have interesting interpretations. The major one is that they can be considered sub-systems to be developed using the Successive Versions/Spiral software development lifecycle model. For example, with the above regions, you might want to first design and get running the course/instructor/course-offering sub-system. You could do this easily as it is fairly independent of the student, and especially the student advisor, sub-system. You could just stub the check that there are no student registrations in a course you are cancelling. After that sub-system is working, you could develop the student and student course registration sub-system, and test it with your already running course sub-system. Finally, you can develop and add the student advisor sub-system.

Often in a large corporation, things have been organized for political, power, or historical reasons. But business organization theory suggests that businesses work most efficiently if people who share work on the same entities work in the same department, so they can coordinate their work to handle any problems, exceptions, vacations, etc.

In particular, people who have the ’write privileges’ on a set of data files should probably work in the same department where the data is kept, so the increased cooperation possible between the employees can be used to prevent any data corruption. But, if you look at your matrix, you may find this is not true for the company you have just analyzed. Maybe the data is on another department’s computer because:

a) it has always been there, or

b) it’s in a database that can only run on that brand of computer, or

c) that department thinks the data ’belongs’ to them (which is ridiculous considering who has write privileges).

Anyway, this is a trigger to suggest to management that they re-organize their company in a more sensible manner (vs. the manner that it had evolved into up to this point in time). Here again we see a glimpse of where systems analysis is used as a general purpose business technique, not just for developing software applications.

Another interpretation of regions will be discussed in the next section regarding where the data is most frequently accessed from.


10.5.4 Frequency Matrices

Generally, in a distributed application, it is best to locate individual database files in the locations they are most frequently used from. We do this to

• improve the response time to those files for the most frequently using locations, and to

• reduce data communications cost. Though the internet appears free to students, businesses have to pay considerable sums of money in communications cost if they have high volumes over long distances needing quick transmission.

In order to estimate data traffic flow rates between sources of distributed data, we will use frequency matrices. I will show you how to do this by using a small example.

Let us assume that we have the following interaction matrix for an application:

PROCESS - ENTITY INTERACTION MATRIX:

                    ENTITY FILES
PROCESSES        F1    F2    F3    F4
P1                1                 1
P2                1           1
P3                1     1     1
P4                                   1

Now, assume that our corporation has a head office and 2 other locations. We will call the head office Loc1, and the others Loc2 and Loc3. We inquire how often each location needs to execute elementary processes P1, P2, P3, and P4 per hour (or minute, or whatever is appropriate). This data is also given to us in the form of a table:

PROCESS FREQUENCY-OF-USE BY LOCATION MATRIX:

                 Invocations per hour from LOCATION
PROCESSES        Loc1    Loc2    Loc3
P1               1000     100      50
P2                 50      10      70
P3                         10     500
P4                100     100     100

Now what we have to do is to duplicate the interaction matrix 3 times, once for each location, and then multiply each row of, say, Loc1’s interaction matrix by that process’s entry in the Loc1 column of this frequency-of-use by location matrix. The results are shown below.


Loc1 PROCESS FREQUENCY-OF-USE MATRIX:

                             ENTITY FILES
PROCESSES                 F1      F2      F3      F4
P1                      1000                    1000
P2                        50              50
P3                         0       0       0
P4                                                100
Total file access
frequencies per hour
from Loc1               1050       0      50    1100

Loc2 PROCESS FREQUENCY-OF-USE MATRIX:

PROCESSES                 F1      F2      F3      F4
P1                       100                     100
P2                        10              10
P3                        10      10      10
P4                                                100
Total file access
frequencies per hour
from Loc2                120      10      20     200

Loc3 PROCESS FREQUENCY-OF-USE MATRIX:

PROCESSES                 F1      F2      F3      F4
P1                        50                      50
P2                        70              70
P3                       500     500     500
P4                                                100
Total file access
frequencies per hour
from Loc3                620     500     570     150


Now, by taking the bottom summary rows from each of the above three matrices, we can produce a matrix that shows the rate of file accesses from each location to each file, regardless of the exact processes causing the accesses.

FILE ACCESS FREQUENCY BY LOCATION MATRIX:

                                   ENTITY FILES
LOCATIONS                       F1      F2      F3      F4
Sum of accesses from Loc1     1050              50    1100
Sum of accesses from Loc2      120      10      20     200
Sum of accesses from Loc3      620     500     570     150
Total file access frequencies
per hour from all locations   1790     510     640    1450

The bottom summary row of this table gives the total number of file accesses per hour for each of the files F1, F2, F3, and F4.
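A small sketch of the whole calculation, using the interaction matrix and frequency-of-use table from this example:

    # 1 where process Pi accesses file Fj (the interaction matrix above).
    interaction = {
        "P1": {"F1": 1, "F4": 1},
        "P2": {"F1": 1, "F3": 1},
        "P3": {"F1": 1, "F2": 1, "F3": 1},
        "P4": {"F4": 1},
    }
    # Invocations per hour of each process from each location.
    frequency = {
        "Loc1": {"P1": 1000, "P2": 50, "P3": 0,   "P4": 100},
        "Loc2": {"P1": 100,  "P2": 10, "P3": 10,  "P4": 100},
        "Loc3": {"P1": 50,   "P2": 70, "P3": 500, "P4": 100},
    }
    files = ["F1", "F2", "F3", "F4"]

    # File access frequency by location: multiply each interaction row by the
    # process's frequency at that location, then sum down each file column.
    access = {loc: {f: sum(interaction[p].get(f, 0) * freq
                           for p, freq in frequency[loc].items())
                    for f in files}
              for loc in frequency}
    print(access["Loc1"])   # {'F1': 1050, 'F2': 0, 'F3': 50, 'F4': 1100}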


10.6 Where to Locate Each Database File

The final matrix in the last section provides us with the information we need to decide where best to locate each of the four files. Locating them all at a head office mainframe can cause:

a) Slow response time for users of processes that access files F2 and F3 from either location Loc2 or Loc3.

b) Higher than necessary data communication costs for the enterprise to transfer the many references to F2 and F3 from and to Loc 2 and Loc3.

The trick is to locate the data at the site that most frequently uses it. This simply requires finding the maximum value in each column of the above matrix, which indicates that files F1 and F4 should be stored at Loc1 (the head office), while F2 and F3 should be at Loc3. Loc2 would hold none of the database files, and would typically just have workstations on site.

The simple maximum value algorithm is adequate for most systems, though if some locations are very close to each other and some very far away, you may want to evaluate the effect of distance on the communications cost. See Chapter 11 of [Fertuck92] if you desire further details.

10.6.1 Calculating the Traffic Matrix

The communication systems designer who might help you implement your project really doesn't care whether the files are normalized, nor how long each record being transferred is. He just wants to know the average and peak data rates (measured in bytes/second) that need to move in each direction between each pair of locations. This is what a traffic matrix provides him. With the traffic matrix, the communications system designer then determines what data rate modems to purchase for each link, and/or what kind of data communication service deal he should try to get from the data communication service branch of B.C. Telephone Company (or one of their competitors).


To determine the traffic matrix, we still need to do some more multiplication and addition. We need to change from accesses per hour to bytes transferred per hour. In most database applications, rarely is only part of a record ever retrieved or written. Usually to update an instance, the whole record is read, the necessary parts of it are updated in RAM, and then the whole record is written back. We will assume all transfers are therefore of the file’s record length. We must therefore multiply each column in the matrix above by that file’s fixed record length. We will assume:

- F1 record length = 50 bytes
- F2 record length = 100 bytes
- F3 record length = 150 bytes
- F4 record length = 200 bytes.

FILE ACCESS DATA RATE BY LOCATION MATRIX:

                               ENTITY FILES -->
  Bytes from LOCATIONS        F1       F2       F3       F4
  Loc1                     52500        0     7500   220000
  Loc2                      6000     1000     3000    40000
  Loc3                     31000    50000    85500    30000

  Total file access data volume per hour from all locations:
                           89500    51000    96000   290000
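A small sketch of this scaling step, assuming the fixed record lengths listed above (the dictionary names are illustrative, not part of the method):

record_length = {"F1": 50, "F2": 100, "F3": 150, "F4": 200}

access_by_location = {
    "Loc1": {"F1": 1050, "F2": 0,   "F3": 50,  "F4": 1100},
    "Loc2": {"F1": 120,  "F2": 10,  "F3": 20,  "F4": 200},
    "Loc3": {"F1": 620,  "F2": 500, "F3": 570, "F4": 150},
}

# Accesses/hour x record length = bytes/hour, per location and file.
bytes_by_location = {
    loc: {f: n * record_length[f] for f, n in row.items()}
    for loc, row in access_by_location.items()
}
print(bytes_by_location["Loc3"])
# {'F1': 31000, 'F2': 50000, 'F3': 85500, 'F4': 30000}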


Assume F1 and F4 will reside at Loc1, and F2 and F3 will reside at Loc3. We now draw a traffic graph, shown below.

Note the arrows show the direction of data flow assuming the data accesses are reads (and not writes).

You can ignore the data accesses from one location to itself, as they are handled locally and incur no data communication burden or cost. But for the other transfers, we can sum the total data rate traffic from each location to each other location. We document this sum in a traffic matrix.

[Traffic graph: three nodes -- the Loc1 CPU holding F1 and F4, the Loc2 CPU holding no files, and the Loc3 CPU holding F2 and F3 -- with one labelled arc per file access flow: from Loc1, 52500 b/hr (F1), 7500 b/hr (F3), and 220000 b/hr (F4); from Loc2, 6000 b/hr (F1), 1000 b/hr (F2), 3000 b/hr (F3), and 40000 b/hr (F4); from Loc3, 31000 b/hr (F1), 50000 b/hr (F2), 85500 b/hr (F3), and 30000 b/hr (F4).]


TRAFFIC RATE MATRIX (in bytes/hour):

                      From -->
  To:              Loc1     Loc2     Loc3
  Loc1                -        0     7500
  Loc2            46000        -     4000
  Loc3            61000        0        -

This is an interesting matrix.

• First, the diagonal is empty, because it makes no sense to speak of a data communication rate between a CPU at a particular location and itself.

• Second, the matrix is non-symmetrical. This means that the rate from Loc1 to Loc3 is not equal to the average data transfer rate needed from Loc3 to Loc1! Though most modern modems have the same transfer rate in both directions, it is possible to lease extra one-way lines or satellite links if the traffic is heavy in one direction between two locations.

• Third, there are no heavy data volumes. If you look at the prior table, there were several data volumes near 100,000 and one near 300,000 bytes per hour. Because we have chosen to place the files at the locations which most frequently use them, we have saved a lot of data communications cost and delay!

• And finally, this matrix is not quite correct, as it represents only data accesses (i.e. reads) rather than the typical mixture of creates, reads, writes, and deletes (reads transfer data in the opposite direction to creates and writes). In actual fact, a slightly more complex analysis is needed to take into account the actual direction of each different type of access, but the procedure would use the same frequency matrix concepts.

Section 13.3 of [Montgomery94] also has an analytical treatment of where the break-even point is in deciding to distribute, or not to distribute.
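Here is a hedged Python sketch of how the traffic matrix above could be computed for read accesses, assuming F1 and F4 live at Loc1 and F2 and F3 at Loc3. The dictionaries, and keying the result by sending location then receiving location, are my own illustration:

# Bytes/hour of file accesses made from each location (from the table above).
bytes_by_location = {
    "Loc1": {"F1": 52500, "F2": 0,     "F3": 7500,  "F4": 220000},
    "Loc2": {"F1": 6000,  "F2": 1000,  "F3": 3000,  "F4": 40000},
    "Loc3": {"F1": 31000, "F2": 50000, "F3": 85500, "F4": 30000},
}
home = {"F1": "Loc1", "F2": "Loc3", "F3": "Loc3", "F4": "Loc1"}

locations = list(bytes_by_location)
traffic = {src: {dst: 0 for dst in locations} for src in locations}

for reader, row in bytes_by_location.items():
    for f, volume in row.items():
        if home[f] != reader:                  # remote access only
            traffic[home[f]][reader] += volume  # read data flows home -> reader

print(traffic)
# {'Loc1': {'Loc1': 0, 'Loc2': 46000, 'Loc3': 61000},
#  'Loc2': {'Loc1': 0, 'Loc2': 0, 'Loc3': 0},
#  'Loc3': {'Loc1': 7500, 'Loc2': 4000, 'Loc3': 0}}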


10.7 File Partitioning

The above file distributing technique can reduce traffic volumes and response times for most users significantly. But we can sometimes gain further advantage by distributing necessary data, in a more fine-grained manner, to the locations where it is most frequently used. Unfortunately, there is sometimes a cost in extra disk memory for the further reduction of data communication and response time.

10.7.1 Horizontal Partitioning

Often, some entity instances in a file are used only or mostly by users at one location, while other instances are used only at another location. If we put each instance at the location where it is most used, we will gain. Consider the following table.

  salesman     home phone     sales office   sales_this_year   commission earned this year
  Art Knapp    406-123-4567   Toronto        $150,000          $30,000
  Bill Smith   604-987-6543   Vancouver      $100,000          $20,000
  Phil Wong    406-567-8912   Toronto        $200,000          $50,000
  Ted Jones    604-321-7890   Vancouver      $180,000          $45,000

Now if the data about Toronto salespeople are most frequently needed from workstations located in Toronto, maybe we should store the data regarding the Toronto office staff in Toronto, and the data most relevant to the Vancouver office in Vancouver. This can be done by making 'sales office' a key, and sorting using it as the most significant field of the key. Then separate the Toronto half of the list, and move it to Toronto. Here is what the Toronto file would then look like:

  sales office   salesman     home phone     sales_this_year   commission earned this year
  Toronto        Art Knapp    406-123-4567   $150,000          $30,000
  Toronto        Phil Wong    406-567-8912   $200,000          $50,000


And here is what would be stored in Vancouver:

  sales office   salesman     home phone     sales_this_year   commission earned this year
  Vancouver      Bill Smith   604-987-6543   $100,000          $20,000
  Vancouver      Ted Jones    604-321-7890   $180,000          $45,000

There is some cost to doing this. If the compound key is retained, all random searches have more key fields to compare. The sales office field can be made just a regular field again, as it was really only used to separate the data for the two sales offices. But in either case, extra program logic is needed for when the boss in Toronto wants to look at all the salesperson data: the Toronto database or application software has to figure out that some of the data is not in Toronto, and must know where to go and find it.

10.7.2 Vertical Partitioning

Vertical partitioning is the opposite: splitting the table between two columns. If the 'sales office' and 'home phone' fields were only needed in Toronto (because, say, that is where the personnel office was), and 'sales_this_year' and 'commission earned this year' were only needed in Vancouver (because that was where the enterprise's national sales and marketing office was), then consider splitting the table as follows. Put the following entity table in Toronto:

  salesman     home phone     sales office
  Art Knapp    406-123-4567   Toronto
  Bill Smith   604-987-6543   Vancouver
  Phil Wong    406-567-8912   Toronto
  Ted Jones    604-321-7890   Vancouver


And this one in Vancouver:

  salesman     sales_this_year   commission earned this year
  Art Knapp    $150,000          $30,000
  Bill Smith   $100,000          $20,000
  Phil Wong    $200,000          $50,000
  Ted Jones    $180,000          $45,000

Note that we have wasted some space by duplicating the primary key in both locations. Extra space is also needed for the B+ tree in both locations. But this may be a reasonable price to pay to keep the communications costs and response time low.
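As a rough illustration of the two partitioning schemes just described (the field names and data structures are my own sketch, not code from the notes):

# The full salesperson table as a list of records.
salespeople = [
    {"salesman": "Art Knapp",  "home_phone": "406-123-4567", "sales_office": "Toronto",
     "sales_this_year": 150000, "commission": 30000},
    {"salesman": "Bill Smith", "home_phone": "604-987-6543", "sales_office": "Vancouver",
     "sales_this_year": 100000, "commission": 20000},
    {"salesman": "Phil Wong",  "home_phone": "406-567-8912", "sales_office": "Toronto",
     "sales_this_year": 200000, "commission": 50000},
    {"salesman": "Ted Jones",  "home_phone": "604-321-7890", "sales_office": "Vancouver",
     "sales_this_year": 180000, "commission": 45000},
]

# Horizontal partitioning: split the rows by the value of 'sales_office'.
toronto_rows   = [r for r in salespeople if r["sales_office"] == "Toronto"]
vancouver_rows = [r for r in salespeople if r["sales_office"] == "Vancouver"]

# Vertical partitioning: split the columns, repeating the primary key
# ('salesman') in both fragments so the two halves can later be re-joined.
personnel_fragment = [
    {"salesman": r["salesman"], "home_phone": r["home_phone"],
     "sales_office": r["sales_office"]}
    for r in salespeople
]
sales_fragment = [
    {"salesman": r["salesman"], "sales_this_year": r["sales_this_year"],
     "commission": r["commission"]}
    for r in salespeople
]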


10.8 Data Replication

The fourth way to distribute data (other than remote file placement, horizontal partitioning, and vertical partitioning) is replication.

You often distribute several data files which ’belong’ together to the locations where they are most frequently used. To find data that belongs together, look for either:

a) clusters of objects in the ERD connected with few relations between the clusters, or

b) clusters of objects in the OCM connected with few event message paths between the clusters, or

c) regions in a diagonalized activity-entity CRUD matrix that have few columnwise overlaps.

Sometimes, though, it is impossible to find mutually-exclusive regions/clusters that are independent. Instead, there is some overlap. Several operations or several locations need fast access to the same parts of the data model! Since such data is needed frequently and rapidly by users at all locations, the only solution to this dilemma is to make duplicate copies of the relevant data at each location.

Of course there are two serious drawbacks to this idea:

1) it is a terrible waste of disk space, and

2) keeping the distributed copies consistent in the face of multiple simultaneous users updating the data from multiple locations becomes a nightmare!

Obviously, at Air Canada this would require terabytes of disk space to replicate their reservation data in all of Canada's major cities. In this huge database case, it is probably better to design the system with a centralized database run on high performance On-Line Transaction Processors (OLTPs) (because there are so many transactions required per minute). This is needed because one of the problems of a centralized (vs. distributed) system is that you end up concentrating the processing requirements all at one location, and thus need tons of processing and file accessing power. Then you must connect to it with high speed networks, to keep the response time down to a couple of seconds.

It is hard enough keeping referential integrity intact in a distributed file system. It requires doing distributed locking accompanied by the communications needed to coordinate the locking. If you add the problem that every update to a replicated file causes an integrity problem at every other location, you have a real performance bottleneck. The communications required to handle this (and the extra disk space) can exceed any cost savings obtained through replication. So generally, replication is not used to save cost, but instead mainly to meet fast response requirements imposed by the customer.

In fact, replication is usually only used for data files that are rarely (e.g. once per day) updated. Surprisingly, there are a number of applications for which this is a good design alternative. One example might be a telephone directory. New telephone listings can be batched up during the day, and sent as a package of updates to each location at midnight. With this strategy, you can forbid write operations during most of the day, and thus avoid all the distributed and replicated locking problems.

To update such a replicated system, send only the changes (to reduce communications cost) to each location every day at midnight. Have each location build a separate, updated directory file between 12 AM and 1 AM, and then all locations change to use the new data at exactly 1 AM.

The only drawback to this scheme is that each location must have disk space for both the old file, and the updated one, while the updated one is being built.

• To get around this, horizontally partition the data by first letter of the telephone customer’s last name: update all the A’s, then change over to the new A file, etc.

• Or, do normal distributed, replicated locking late at night when communications charges are at a lower rate, and response time is fast because the network is more lightly loaded.
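Returning to the midnight batch-update scheme described above, here is a minimal Python sketch of what one site's nightly update might look like, assuming the replicated directory is kept as a JSON file and the day's changes arrive as a batch file; all file names and formats here are assumptions of the example, not part of the notes:

import json
import os

def apply_nightly_batch(directory_path, changes_path):
    # Load the current replicated directory (here: one JSON object of listings).
    with open(directory_path) as f:
        directory = json.load(f)

    # Apply the day's batched changes: {"add": {...}, "delete": [...]}.
    with open(changes_path) as f:
        changes = json.load(f)
    directory.update(changes.get("add", {}))
    for name in changes.get("delete", []):
        directory.pop(name, None)

    # Build the new copy beside the old one, then switch over atomically
    # (the "1 AM" changeover), so readers never see a half-updated file.
    new_path = directory_path + ".new"
    with open(new_path, "w") as f:
        json.dump(directory, f)
    os.replace(new_path, directory_path)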


As mentioned at the beginning of the course, design is conceiving and then choosing between implementation alternatives. Every implementation scheme that provides users with functionality, large database size, and small response time involves many trade-offs. A designer's job is to think of alternative implementation schemes that simultaneously provide high functionality, large database capacity, and low response time. Good schemes typically make the implementation programmer's job more complicated. No scheme is perfect, and each should be evaluated (analytically if possible) for memory use, performance, communications cost, reliability/security/integrity risk, and implementation cost (i.e. programmer's salary).


10.9 References

[Fertuck92] "Systems Analysis and Design" by Len Fertuck, W.C. Brown Publishers, 1992.

[Flaatten91] "Foundations of Business Systems" by Per O. Flaatten et al, Dryden Press, 1991.

[Montgomery94] "Object Oriented Information Engineering" by Stephen Montgomery, AP Press, 1994.


10.10 Appendix 10A - SQL + Joins/Projections/Selects/Sorts

Starting in semester 94-3, students in Cmpt 370 will require Cmpt 354 as a pre-requisite. That course will have prepared them with a knowledge of normalization, advanced database operations such as joins and projections, and the use of Structured Query Language (SQL).

SQL is quite a simple, database-brand-independent way to define database table attributes and keys, to do CRUDs, and to form complex queries of the database in order to create reports from the selection, join, and sort of data from different tables. Applications which interface with a DBMS via SQL can therefore be easily ported to run on different computers which, say, are not able to run the brand of DBMS that the application was originally designed on.

10.10.1 SQL Data Definition Language Subset

The DDL statements in SQL provide a brand-independent way to create a database file and B+ tree index for the file. File creation and attribute definition are accomplished as per the following examples:

CREATE TABLE Course_Offering
    (course_name CHAR(10) NOT NULL,
     semester    CHAR(5) NOT NULL,
     course_room CHAR(8),
     instructor  CHAR(40));

CREATE TABLE Course(course_name CHAR(10) NOT NULL, credit INTEGER);

Other character types are possible. The NOT NULL notation instructs the DBMS not to allow creation of records without these fields actually containing proper data. This allows them subsequently to be used as search keys.


To create a file containing a B+ tree index into the file Course_Offering, you can do the following:

CREATE UNIQUE INDEX Course_Offering_Index
    ON Course_Offering (semester, course_name DESC);

This statement creates an index file, and prepares for the sequential links to be arranged so the records are sorted first by semester in ascending order (the default), and then, for courses in the same semester, by course name in descending order (e.g. Cmpt 370 before Cmpt 275). Note: for random searches you really don't care about the order as long as the tree finds the data.

By the way, if you leave out the word UNIQUE, it will allow more than one record with the exact same key in the database. Presumably, a random access search for a particular key will then return the set of such records, and you can subsequently use a cursor to access individual records within the set.

10.10.2 SQL CRUD Accesses

To create (i.e. add) a single new record to an existing database file and index you have set up, you need only do something like:

INSERT INTO Course_Offering (course_name, semester, course_room)
    VALUES ('Cmpt 370', '94-2', 'C9000');

This sets the values of the new record to those specified, except that because the instructor attribute is not mentioned, it is set to NULL.

Read access is accomplished simply by:

SELECT course_name, semester, instructor
    FROM Course_Offering
    WHERE course_name = 'Cmpt 370' AND semester = '94-2';


Updates are specified as follows:

UPDATE Course_Offering
    SET course_room = 'ASB10875'
    WHERE course_name = 'Cmpt 370' AND semester = '94-2';

Finally, record deletes are specified simply by:

DELETE FROM Course_Offering
    WHERE course_name = 'Cmpt 370' AND semester = '94-2';

10.10.3 SQL Select/Project/Join

The SELECT statement is much more powerful than just as shown above. It can retrieve a whole subset of the rows and columns of the table. For instance:

SELECT course_name, instructor
    FROM Course_Offering
    WHERE semester = '94-2';

Note the following things about this SELECT:

• It does what is called a projection. A projection eliminates unwanted columns. In the above case, we do not want the semester in the generated temporary table, so we leave it out of the SELECT line. (Think of projection like the term is used in graphics, projecting a 3D object onto a 2D screen; we are leaving out the depth attribute of the object.)

• It generates not a single record, but a new temporary table of records, one row for each record in the original table that meets the WHERE predicate. Note the WHERE predicate needn't specify a unique key of a specific row, and as shown above, may possibly not even use keys in the select criterion. In fact, for numerical attributes you can specify inequalities such as WHERE salary >= 40000.00 AND age < 30.


- c.f. Does this remind you of conditional predicates in Objectbench’s FOREACH statement? Actually SQL provides you the ability to declare a row cursor to be used in an SQL DO WHILE to, say, get and print each row of this temporary table!

Having normalized a database, often the data we need to create a report or screen is spread out over several files (possibly located at several different locations). For instance, if we want a table/report of course offering + semester + instructor + course credit of each offering, this data must be gathered from both the Course_Offering and Course files. This data must be 'joined' together to generate the desired output. A join operation can be accomplished using the keywords already introduced:

SELECT Course_Offering.course_name /* or Course.course_name */,
       semester, course_room, instructor, credit
    FROM Course_Offering, Course
    WHERE Course_Offering.course_name = Course.course_name
    ORDER BY credit DESC, course_name, semester;

This gathers all the columns from both tables, except that we avoided having the course_name column twice by specifying that course_name be selected from only one of the two files. Note the use of "." qualified names to specify which course_name we meant. In this case, it doesn't matter which course_name we choose, but when dealing with homonyms (two different attributes with the same name) we must specify which one we are referring to.

Note that if you join two tables via a common column that can have multiple rows with the same value (i.e. a Multi-Value-Dependent column) such as course_name above, you will get one row for each occurrence. Duplicates will not be removed, but this is what you want in the above case. In some cases, though, this is not what you want. What's worse is that if you have MVD duplicates in the join columns coming in from both files, you get the cartesian product of every row of one file with every matching row of the other file. Cmpt 354 discusses this at length. For more information, see any good database text.


10.10.4 Column Order Re-Arrangement and Row Sorting

Column rearrangement can be accomplished (I believe?) simply by using a different sequence of attribute names in the select clause.

The resulting temporary table can be sorted using the ORDER BY clause. In the above case, I specified the result to be sorted first by credit in descending order, then within groups of courses with the same credit, by ascending course name, and finally when there are several offerings of the same course in different semesters, by ascending semester.

10.10.5 Views

Views are a specification for a named derivable virtual table similar to the specifications used to generate the temporary tables above. Views are not actual tables used in generating a report. Instead, they define a virtual table that can be subsequently used as if it was a base table (e.g. you can project, join, and select from them). These virtual tables do not actually exist, and operations such as projects, selects, and joins on them are translated by SQL into operations on the underlying base tables.

Views are very important as they are often what the client application wants to see. In fact, when building enterprise-wide applications, each client may want a different 'view' of the data, and not want to see a bunch of miscellaneous attributes needed by other departments. By structuring an application around a view, rather than the physical files, attributes, and attribute ordering, you can make your application independent of other departments' application maintenance, such as adding extra attributes to the database without telling you.

SQL provides the ability to define views. In fact, when you consider that each SELECT names the columns wanted and their order, even view-less SQL is already fairly uncoupled from the addition of new attributes.

Data security is often based on views, with users only allowed to see certain rows and columns specified by a view!


10.10.6 The Report Building Process

The complete report building process can now be summarized in the following diagram. In it we will see two tables being joined, projected, selected, and ordered. Finally, derived attributes and totals are calculated.


[Diagram: Table A and Table B are JOINed; SELECT only those rows WHERE salary > $40,000; PROJECT and re-arrange the column order; sort the rows into ORDER BY (some_attrib DESC); derive new attributes by some calculation (e.g. quantity x price); finally, sum/total any columns desired.]

SQL is a powerful data definition and data manipulation language. It allows you to make calls (or pass messages) to a DBMS to do each of these operations in preparation for using a cursor to print out each of the rows in the final table illustrated above. Behind SQL, each brand of DBMS implements the fixed length record definitions, the trees and algorithms, and enforces key uniqueness, integrity, and record locking checks, etc.


10.11 Appendix 10B - Binary Components

This is an email that I wrote to my Cmpt 370 class after a guest talk in a previous semester given by Chris Kerslake on client-server computing and distributed objects. You might find it interesting.

---------------------------------

I just wanted to make clear a point about why everybody is excited about binary distributed objects. Previously:

1) It was not always possible to call a function written say in Pascal, from a main program written in C++, even when the compiler, linker and OS were the same brand and both parts of the program were on the same machine. This was often because they passed parameters on the stack differently, etc.

2) It was not always possible to call a virtual member function written in C++ and compiled into a .o file by one brand of compiler, from a main program also written in C++ but compiled by another brand of compiler.

3) It was not always possible to 'port' an application which calls the operating system (e.g. to open a socket) to another operating system because the socket call is different on the other OS.

4) It is not always possible to call a function in a different address space (e.g. in a different process or on a remote machine) because how you send the call and parameters (least significant byte of an integer first?) was not standardized (different brands of machines internally store these differently), and because some kinds of parameters were not expressive enough (e.g. does a char* point to a single character or to a whole string?).

5) It is not always possible to add a feature to an application after it has been linked (e.g. add a spell checker for a new language to a word processor). This is tough even in a single language on a single machine and OS. It is typically called dynamic linking.

We have been fighting the above problems for decades. The move to binary distributed components might fix them all, because:


1) distributed components are not universally usable unless implemented in some way that is independent of the (possibly different) language and OS of the two machines (client and server) and the different geographic location and address spaces of the two machines (and the fact that they are running asynchronously to each other). So distributedness is FORCING us to address the above problems.

2) If the plumbing standard for distributed object interoperation (e.g. CORBA, COM, Java RMI) is written in terms of exactly what bits and bit format should be sent from the client to the server component function to get it to load into RAM, be called with the correct parameters, and to return the results, then it is written in a language-, machine-, OS-, and location/address-space-independent way.

This has never been technically impossible; only impossible for the main players to agree on! We are now on the edge of having large islands of agreement on a very few standards that are not proprietary to any particular company (vs. many proprietary standards).

So, for instance, CORBA is like a plumbing standard, and there are several brands of ORBs that comply with that standard. They all perform the necessary plumbing but have different implementations (some use plastic pipes, some metal, and some ask Scotty to "beam the data over"). But as a client you don't care; you only care what the interface to the function is in your language (the server author writes and advertizes the interface in IDL. You translate it into your language's function prototype form using an IDL translator). And you care that your development system creates a stub/proxy that actually receives the call and then forwards it over the network according to the CORBA standard. Russ.


10.12 Appendix 10C - Example of a Dendogram

Figure 3-9 from "Systems Analysis and Design" by Len Fertuck, Wm.C. Brown Publishers, 1992.