

Database Systems

BCA-204

Directorate of Distance Education Maharshi Dayanand University

ROHTAK – 124 001


Copyright © 2002, Maharshi Dayanand University, ROHTAK All Rights Reserved. No part of this publication may be reproduced or stored in a retrieval system or

transmitted in any form or by any means; electronic, mechanical, photocopying, recording or otherwise, without the written permission of the copyright holder.

Maharshi Dayanand University

ROHTAK – 124 001

Developed & Produced by EXCEL BOOKS, A-45 Naraina, Phase 1, New Delhi-110028


Contents

UNIT 1 DATA MODELLING FOR A DATABASE

Introduction
Database
Benefits of the Database Approach
Structure of DBMS
DBA
Records
Files
Abstraction
Integration

UNIT 2 DATABASE MANAGEMENT SYSTEM

Data Model

ER Analysis

Record Based Logical Model

Relational Model

Network Model

Hierarchical Model

UNIT 3 RELATIONAL DATA MANIPULATION

Relational Algebra

Relational Calculus

SQL

UNIT 4 RELATIONAL DATABASE DESIGN

Introduction

Functional Dependencies

Normalisation

First Normal Form

Second Normal Form

Third Normal Form

Boyce-Codd Normal Form

Fourth Normal Form

Fifth Normal Form

UNIT 5 QUERY PROCESSING


Introduction

Query Processor

General Strategies for Query Processing

Query Optimization

Concept of Security

Concurrency

Recovery

UNIT 6 DATABASE DESIGN PROJECT

Definition and Analysis of Existing Systems

Data Analysis

Preliminary and Final Design

Testing & Implementation

Maintenance

Operation and Tuning

UNIT 7 USE OF RELATIONAL DBMS PACKAGE FOR CLASS PROJECT

Implementation of SQL using Oracle RDBMS

APPENDIX


Data Modelling for a Database

Introduction
Database
Benefits of the Database Approach
Structure of DBMS
Database Administrator
Records and Record Types
Files
Data Integrity Constraints
Data Abstraction

Learning Objectives

After reading this unit you should appreciate the following:

• Introduction
• Database
• Benefits of the Database Approach
• Structure of DBMS
• DBA
• Records
• Files
• Abstraction
• Integration

Introduction

The present time is known as the information age because humans today deal constantly with data and information related to business and organizations. Since the beginning of civilization man has manipulated data, and the exchange of information has been in practice, but it has been considered an important discipline in its own right only for the last few decades. Today, data manipulation and information processing have become the major tasks of any organization, small or big, whether it is an educational institution, a government concern, or a scientific, commercial or any other enterprise. Thus we can say that information is an essential requirement of any business or organization.

Data: It is the plural of the Latin word datum, which means any raw fact or figure, like numbers, events, letters, transactions, etc., based on which we cannot reach any conclusion. Data can be useful after processing,


e.g. 78 by itself is simply a number (data), but if we say Physics: 78 then it becomes information, since it means somebody got distinction marks in Physics.

Information

Information is processed data. The user can take decisions based on information.

Data → Processing → Information

Information systems, through their central role in the information economy, bring about the following changes:

• Global exposure of the industry.

• Actively working people.

• Precedence of idea and information over money.

• Growth in the business size.

• Globalization – changing technologies.

• Integration among different components based on information flow.

• Need for optimum utilization of resources.

• Deciding loss/benefit of business.

• Future oriented Information.

• External Interfaces.

[Figure: A corporate database serving multiple functional areas: product sales, accounting (accounts receivable, accounts payable), planning and control, and manufacturing (production scheduling, material requirements, purchasing)]


An organization can be viewed as a mechanism for processing information, and the traditional management of an organization can be viewed in the context of information and process. The manager may be considered as a planning and decision centre. Established routes of information flow are used to determine the effectiveness of the organization in achieving its objectives. Thus, information is often described as the key to success in business.

Student Activity 1.1

Before reading the next section, answer the following questions:

1. Justify whether the numbers 90, 40, 45 are data or information.

2. What is the difference between data and information?

3. Give examples of objects that we can judge to be information.

4. Make a list of data.

5. Make a list of information.

6. Data are the raw facts and…….

If your answers are correct, then proceed to the next section.

System

A system is a group of associated activities or functions with the following attributes:

• A common goal or purpose

• Identifiable objectives

• A set of procedures including inputs, outputs and feedback

• An environment in which the system exists

• Mutually dependent sub-systems that interact

It can be understood as follows:

[Figure: A business system: inputs such as labour, money, materials, machinery and methods are transformed into outputs such as goods and services, productivity, taxes, employment, and growth or downfall]

It is evident from the above-mentioned job profile and process of management that effective management and effective decision-making depend directly on an individual's and an organization's ability to manage information.

Sources of Information

Information can be collected as raw data or as information which some other person has already processed. Data/information can be obtained from both internal and external organizational sources.

Information sources from internal and external sources may be classified as formal, or informal.

Internal Sources

Formal systems are based on proper procedures for collecting data. The conversion of data into information, and the use of information, require proper procedures to be in place. A means of identifying such sources is to look for internal systems whose inputs and outputs are expressed in a consistent format. An informal internal source of information is one through which management receives information outside formal procedures. Many informal sources are verbal; therefore, the organization will require procedures through which such information can be collated for future use.

External Sources

Informal and formal information may also be obtained from external sources like newspapers, TV, etc. In addition to identifying potential sources, the organization will need to devise systems through which such information can be collated for potential future use.

Value of Information

The basic quality of good information is that it has some value for the recipient. A measure of value can be found in its usefulness and reliability to the recipient. The value of information decreases if the level of inaccuracy is unknown or if the cost of compensating for the inaccuracies is greater than the potential benefits.

Use of Information

To have value, the information must be used. The value an organization gains from information, relates to the decision making process and the ability of the management to improve profitability through use of information.

[Figure: Management as a processor: planning, organizing, staffing and controlling]


Information should be provided at all levels. The objective of the provision of information is to enable managers to control those areas for which they have responsibility. Information will come from internal and external sources and has to be communicated on time to facilitate effective decision-making. Management is itself a form of system, comprising elements and/or activities that relate to the achievement of a goal.

Levels of Information

Management control: The means through which managers ensure that required resources are obtained and used effectively and efficiently to accomplish the objectives of the organization.

Operational control: Ensures that specific tasks are undertaken and completed effectively and efficiently. Operational control has become less important with automation because tasks are increasingly becoming subject to programmed control.

Strategic information: Enables directors and senior managers to plan the organization’s overall objectives and strategies.

Tactical information: Used at all management levels, but mainly at middle management level, for tactical planning and managing control function.

Operational information: The managers at the operational level who need to ensure that routine tasks are being correctly planned and controlled use this information. These decisions are usually highly structured and repetitive [i.e. programmable] with the required information being presented on a regular basis.

Collecting data to provide information is a time-consuming exercise. The developer must check that the extra information gained is worthwhile, in terms of both cost and time. The assessment of what counts as valuable data is carried out before any data is collected. Clear objectives of the intended system should be used to determine data requirements. For example, a company may commission a market research survey to analyse customer buying habits; the range of investigation and the size of the survey sample would be controlled by the survey budget.

Qualities of Good Information

We expect information to be ‘reliable’ and ‘accurate’. These features can be measured by the degree of completeness, precision and timeliness.

Completeness

The user of information should receive all the details necessary to aid decision-making. It is important for all information to be supplied before decisions are made. For example, new stock should not be ordered until full details of current stock levels are known. This is a simple example, since we know what information is required and where to obtain it. Difficulties begin when we are not sure of the completeness of the information received. Business analysts and economic advisors are well aware of these problems when devising strategies and fiscal plans.

Accuracy

Inaccurate information can be more damaging than incomplete information to a business. The degree of accuracy required depends on the recipient's position in the management hierarchy. In general terms, the higher the position, the less accuracy is required. Decisions made at the top management level are based on annual summaries of items such as sales, purchases and capital spending. Middle managers would require a greater degree of accuracy, perhaps weekly or monthly totals. Junior management requires the greatest degree of accuracy to aid decision-making: daily up-to-date information is often necessary, with accuracy to the nearest percentage point or unit.

Timeliness

This is described as 'the provision of prepared information as soon as it is required'. We also need to consider the case where accurate information is produced but not used immediately, rendering it out of date. Some systems demand timely information and cannot operate without it. Airline reservation systems are one example: passengers and airline staff depend on timely information concerning flight times, reservations and hold-ups.

Student Activity 1.2

Before reading the next section, answer the following questions:

1. What is source?

2. Describe the types of sources.

3. What are various qualities of a good information?

4. Information will come from…………….. and …………………..

5. The basic quality of good information is that it has some value for the …………..

6. Give examples of sources.

If your answers are correct, then proceed to the next section.

Data Processing

This is a traditional term used to describe the processing of function-related data within a business organization. Sales order processing is a typical example of data processing. Note that processing may be carried out manually or using a computer; some systems employ a combination of both manual and computerized processing techniques. In both cases the data processing activity is essentially the same. Differences can be described in terms of:

Speed

Computers can process data much quicker than any human. Hence, a computer system has a potentially higher level of productivity and, therefore, it is cheaper for high-volume data processing. Speed allows more timely information to be generated.

Accuracy

Computers have a reputation for accuracy, assuming that correct data has been input and that procedures define processing steps correctly. The errors in computer systems are thus human errors (software, or input), or less likely, machine errors (hardware failure).

Volume

As processing requirements increase, possibly due to business expansion, managers require more information processing. Human systems cannot cope with these demands. Banking is a prime example, where the dependency on computers is total.

Limitations of Computers

There are some tasks that computers cannot perform. These activities usually have a high degree of non-procedural thinking in which the rules of processing are difficult to define - it would be extremely difficult to produce a set of ‘rules’ even for safety in crossing a busy road. Many management posts still rely to a great degree on human decision-making. Top management decisions on policy and future business are still determined by a board of directors and not by a computer.

Having understood the basic concept and significance of information and database, let us now get into the basics:

• Data: As we described earlier, data are the 'raw' facts used for information processing. Data must be collected and then 'input' ready for processing.

- Each item of data must be clearly labeled, formatted and its size determined. For example, a customer account number may be labeled 'A/C', in numeric format, with a size of five digits.

- Data may enter a system in one form and then be changed as it is processed or calculated. Customer order data, for example, may be converted to electronic form by keying in the orders from specially prepared data entry forms. The order data may then be used to update both customer and stock files.

• Input: The transaction is the primary data input which leads to system action, e.g., the input of a customer order to the sales order processing system. The volume and frequency of transactions will often determine the structure of an organization.

- In addition to transaction data, a business system will also need reference to stored data, known as standing or fixed data. Within a sales order processing system we have standing data in the form of customer names and addresses, stock records and price lists. The transactions contain some standing data, for referencing, but mainly variable data, such as items and quantities ordered.

• Output: Output from a business system is often seen as planning or control information, or as input to another system. This can be understood if we consider a stock control system: output will be stock level information (slow- and fast-moving items, for example) and stock orders for items whose quantities fall below their reorder level. Stock movement information would be used to plan stock levels and reorder levels, whilst stock order requirements would be used as input to the purchasing system.

• Files: A file is an ordered collection of data records, stored for retrieval or amendment, as required. When files are amended from transaction data, this is referred to as updating. In order to aid information flow, files may be shared between sub systems. For example, a stock file may be shared between the sales function and the purchasing function.

• Processes: Data is converted to output or information by processing. Processing examples include sorting, calculating and extracting.
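To make these elements concrete, here is a minimal sketch in C, with illustrative names that are not from the text: standing data (a price list) is referenced by transaction data (an order line), and a process combines the two to produce output information.

#include <stdio.h>

/* Standing (fixed) data: one price list entry. */
struct price { int item_code; double unit_price; };

/* Transaction (variable) data: one customer order line. */
struct order_line { int item_code; int quantity; };

int main(void) {
    struct price price_list[] = { {101, 2.50}, {102, 9.99} };  /* standing data */
    struct order_line order = { 102, 3 };                      /* transaction   */

    /* Process: look up the standing data referenced by the transaction
       and calculate the order value, which is the output information. */
    for (size_t i = 0; i < sizeof price_list / sizeof price_list[0]; i++)
        if (price_list[i].item_code == order.item_code)
            printf("order value: %.2f\n",
                   price_list[i].unit_price * order.quantity);
    return 0;
}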

Student Activity 1.3

Before reading the next section, answer the following questions:


1. Why do we process data?

2. What is input?

3. What is output?

4. Is data processing essential?

5. Speed, accuracy, volume ………are used in favour of …….

6. Output from a business system is often seen as…………….

If your answers are correct, then proceed to the next section.

Database

A database is a collection of related data or operational data extracted from any firm or organization. For example, consider the names, telephone numbers, and addresses of people you know. You may have recorded this data in an indexed address book, or you may have stored it on a diskette, using a personal computer and software such as Microsoft Access (part of MS Office), ORACLE, SQL SERVER, etc.

The common use of the term database is usually more restricted.

A database has the following implicit properties:

• A database represents some aspect of the real world, sometimes called the miniworld or the Universe of Discourse (UoD). Changes to the miniworld are reflected in the database.

• A database is a logically coherent collection of data with some inherent meaning. A random assortment of data cannot correctly be referred to as a database.

• A database is designed, built and populated with data for a specific purpose. It has an intended group of users and some preconceived applications in which these users are interested.

In other words, a database has some source from which data is derived, some degree of interaction with events and an audience that is actively interested in the contents of the database. A database can be of any size and of varying complexity. For example, the list of names and addresses referred to earlier may consist of only a few hundred records, each with a simple structure. On the other hand, the card catalog of a large library may contain half a million cards stored under different categories – by primary author’s last name, by subject, by book titles – with each category organized in alphabetic order.

Here are several examples of databases.

1. Manufacturing company

2. Bank

3. Hospital

4. University

5. Government department

In general, a database is a collection of files (tables).

Entity: A person, place, thing or event about which information must be kept.

Attribute: Pieces of information describing a particular entity. These are mainly the characteristics about the individual entity. Individual attributes help to identify and distinguish one entity from another.

[Example: a Student database, i.e. a file of student records, each record holding field values such as name and class]

A small shop's records can be maintained manually, but if you have a large database and multiple users then you have to maintain a computerized database. The advantages of a database system over traditional, paper-based methods of record-keeping will perhaps be more readily apparent from these examples. Here are some of them:

• Compactness: No need for possibly voluminous paper files.

• Speed: The machine can retrieve and change data faster than a human can.

• Less drudgery: Much of the sheer tedium of maintaining files by hand is eliminated. Mechanical tasks are always better done by machines.

Data Hierarchy

Bit (0 or 1)
Byte 10101011 (8 bits)
Field (an attribute value, e.g. Name, Age, Address)
Record (a row in a table: a collection of related field values)
File (a table: a collection of records)
Database (a collection of files or tables)


• Currency: Accurate, up-to-date information is available on demand at any time.

Benefits of the Database Approach

The following are the benefits of the database approach:

• Redundancy and duplication can be reduced. In the database approach, the views of different user groups are integrated during database design. For consistency, we should have a database design that stores each logical data item – such as a student's name or birth date – in only one place in the database. This prevents inconsistency, and it saves storage space. However, in some cases controlled redundancy may be useful for improving the performance of queries. For example, we may store StudentName and CourseNumber redundantly in a GRADE_REPORT file (figure below), because whenever we retrieve a GRADE_REPORT record we want to retrieve the student name and course number along with the grade, student number, and section identifier. By placing all the data together, we do not have to search multiple files to collect this data.

[Figure: A GRADE_REPORT file that redundantly stores StudentName and CourseNumber in each record]
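A minimal sketch of such a record in C, with field names assumed from the discussion above: the student name and course number are copied into each GRADE_REPORT record so that one record answers the common query without consulting the STUDENT and SECTION files.

struct grade_report {
    int  student_number;      /* identifies the STUDENT record         */
    char student_name[31];    /* redundant copy from the STUDENT file  */
    int  section_identifier;  /* identifies the SECTION record         */
    char course_number[9];    /* redundant copy via the SECTION record */
    char grade;
};

The trade-off is deliberate: the duplicated values speed up retrieval but must be kept consistent whenever the master files change, which is why the redundancy has to be controlled.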

• Inconsistency can be avoided (to some extent). Suppose that the fact "employee E4 works in department D5" is represented by two distinct entries in the stored database, and suppose also that the DBMS is not aware of this duplication (i.e. the redundancy is not controlled). Then there will necessarily be an occasion on which the two entries do not agree, i.e., when one of the two has been updated and the other has not. At such times the database is said to be inconsistent.

• The data can be shared. The same database can be used by a variety of users, for their different objectives, simultaneously.

• Security restrictions can be applied. It is likely that some users will not be authorized to access all information in the database. For example, financial data is often considered confidential, and hence only authorized persons are allowed to access such data. In addition, some users may be permitted only to retrieve data, whereas others are allowed both to retrieve and to update.

• Integrity can be maintained. The problem of integrity is the problem of ensuring that the data in the database is accurate; it means, for example, that if the data type of a field is number then we cannot insert any string text into it.

Student Activity 1.4


Before reading the next section, answer the following questions:

1. What is a database?

2. What is a record?

3. What is a field?

4. Why is a database needed?

5. What is redundancy?

6. What are the benefits of a database?

If your answers are correct, then proceed to the next section.

# "���������� ���������� $�

A DBMS is a sophisticated piece of software, which supports the creation, manipulation and administration of database system. A database system comprises a database of operational data together with the processing functionality required to access and manage that data. Typically, this means a computerized record keeping system whose overall purpose is to maintain information and to make that information available on demand.

[Figure: A simplified view of a database system, showing its four major components: data, hardware, software and users]

This picture shows a greatly simplified view of a database system. The figure is intended to illustrate the point that a database system involves four major components, namely data, hardware, software, and users.

�"�� �%��� ��� ��� ����� ���� ��'���� &"������� ���� ���� ��� �����

(�)�������

The DBMS responds to a query by invoking the appropriate sub-programs, each of which performs its special function to interpret the query, to locate the desired data in the database, and to present it in the desired order. Thus the DBMS shields database users from the tedious programming they would otherwise have to do to organize data for storage, or to gain access to it once it has been stored.



As already mentioned, a database consists of a group of related files of different record types and the DBMS allows users to access data anywhere in the database, without the knowledge of how data are actually organized on the storage device.

Student Activity 1.5

Before reading the next section, answer the following questions:

1. Define DBMS.

2. Why is a DBMS needed?

3. Users' requests are handled by…………………

4. A database consists of…………………..

5. DBMS is an interface between database and………………..

6. Give examples of several DBMSs.

If your answers are correct, then proceed to the next section.

Structure of DBMS

The role of the DBMS as an intermediary between the users and the database is very much like the function of a salesperson in a consumer distribution system. A consumer specifies desired items by filling out an order form, which is submitted to a salesperson at the counter. The salesperson presents the specified items to the consumer after they have been retrieved from the storage room. Consumers who place orders have no idea of where and how the items are stored; they simply select the desired items from an alphabetical list in a catalogue. However, the logical order of goods in the catalogue bears no relationship to the actual physical arrangement of the inventory in the storage room. Similarly, the database user needs to know only what data he or she requires; the DBMS will take care of retrieving it.

Database Management Systems: A database management system (DBMS) is a software application system that is used to create, maintain and provide controlled access to user databases. Database management systems range in complexity from a PC-DBMS (such as Ashton Tate’s dBASE IV) costing a few hundred dollars to a mainframe DBMS product (such as IBM’s DB2) costing several hundred thousand dollars. The major components of a full-function DBMS are shown in the diagram given below:

[Figure: The major components of a full-function database management system (DBMS)]

(1) DBMS Engine

The engine is the central component of a DBMS. This module provides access to the repository and the database and coordinates all the other functional elements of the DBMS. The DBMS engine receives logical requests for data (and metadata) from human users and from applications, determines the secondary storage location of those data and issues physical input/output requests to the computer operating system. The engine provides services such as memory and buffer management, maintenance of indexes and lists, and secondary storage or disk management.

(2) Interface Sub System

The interface sub system provides for users and applications to access the various components of the DBMS. Most DBMS products provide a range of languages and other interfaces, since the system will be used both by programmers (or other technical persons) and by users with little or no programming experience. Some of the typical interfaces to a DBMS are the following:

• A data definition language (or data sub-language), which is used to define database structures such as records, tables, files and views.

• An interactive query language (such as SQL), which is used to display data extracted from the database and to perform simple updates.

• A graphic interface (such as Query-by-Example), in which the system displays a skeleton table (or tables) and users pose requests by making suitable entries in the table.

• A forms interface, in which a screen-oriented form is presented to the user, who responds by filling in the blanks in the form.

• A DBMS programming language (such as the dBASE IV command language), which is a procedural language that allows programmers to develop sophisticated applications.

• An interface to standard third-generation programming languages such as BASIC and COBOL.

• A natural language interface that allows users to present requests in free-form English statements.


(3) Information Repository Dictionary Sub System

The information repository dictionary sub system (IRDS) is used to manage and control access to the repository. The IRDS is a component that is integrated within the DBMS; notice that the IRDS uses the facilities of the database engine to manage the repository.

(4) Performance Management Sub System

The performance management sub system provides facilities to optimize (or at least improve) DBMS performance. One of its most important functions is query optimization: structuring SQL queries (or other forms of user queries) to minimize response times.

(5) Data Integrity Management Sub System

The data integrity management sub system provides facilities for managing the integrity of data in the database and the integrity of metadata in the repository. There are three important functions:

1. Intra-record integrity: Enforcing constraints on data item values and types within each record in the database.

2. Referential integrity: Enforcing the validity of references between records in the database (a minimal sketch follows this list).

3. Concurrency control: Assuring the validity of database updates when multiple users access the database (discussed in a later section).
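To illustrate the second function, here is a minimal sketch in C of a referential integrity check; the table layout and names are illustrative assumptions, not how a real DBMS stores its data. The rule checked is that every SECTION record must reference an existing COURSE record.

#include <stddef.h>
#include <string.h>

struct course  { char course_number[9]; };
struct section { int id; char course_number[9]; };

/* Returns 1 if every section references an existing course, 0 otherwise. */
int sections_are_consistent(const struct section *sec, size_t nsec,
                            const struct course *crs, size_t ncrs)
{
    for (size_t i = 0; i < nsec; i++) {
        int found = 0;
        for (size_t j = 0; j < ncrs; j++)
            if (strcmp(sec[i].course_number, crs[j].course_number) == 0) {
                found = 1;
                break;
            }
        if (!found)
            return 0;   /* a dangling reference violates referential integrity */
    }
    return 1;
}

A real DBMS enforces this automatically on every insert, update and delete; the sketch only shows the rule being checked.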

(6) Backup and Recovery Sub System

The backup and recovery sub system provides facilities for logging transactions and database changes, periodically making backup copies of the database and recovering the database in the event of some type of failure.

(7) Application Development Sub System

The application development sub system provides facilities that allow end users and/or programmers to develop complete database applications. It includes CASE tools as well as facilities such as screen generators and report generators.

(8) Security Management Sub System

The security management sub system provides facilities to protect and control access to the database and repository.

What has been described above are the individual components of a typical DBMS; we will look at these components from another view later in this section. But first, it is important to focus on some of the other aspects of DBMS software.

What might be termed the third-generation approach to systems development involved the production of a suite of programs that together constituted an application system: a self-contained functional capability to do something useful. Within an application system, each program manipulates data in one or more files, and a particular file might be both read and written by several programs. An organization typically develops several application systems, one for each information systems task perceived.

The DBMS (database approach) tries to overcome the shortcomings of the pre-database approach as follows:


• Data Validation Problems: If many programs manipulate a particular type of information, then validation of its correctness must be carried out by each of those programs to guard against the entry of illegal values. Consequently, program code may need to be duplicated and, if the validation conditions change, each program (at least) must be recompiled.

• Data Sharing Problems: Perhaps more seriously, if a file is used by several programs and there is a need to change its structure in some way, perhaps to add a new type of information object that is required by a new program, then each program will need to be recompiled – unless one maintains duplicate information in different structures, in which case there is a synchronization problem.

A further dimension to this problem results from the fact that, with conventional operating system facilities, if two or more programs write to the same file at the same time, unpredictable results will be obtained. Concurrent update must be avoided either by user-imposed synchronization (that is, manually controlling the usage of programs), or by a locking scheme that would have to be implemented by the application programs. In either case there are costs: management control or programming effort.

• Manipulation Problems: When writing a program using a conventional programming language and operating system facilities, a programmer uses record-level commands (i.e. reads and writes) on each file to perform the required functions; this is laborious and hence unproductive of the programmer’s time.

• Data Redundancy: The same piece of information may be stored in two or more files. For example, the particulars of an individual who may be a customer and an employee may be stored in two or more files.

• Program/Data Dependency: In the traditional approach, if a data field is to be added to a master file, all such programmes that access the master file would have to be changed to allow for this new field which would have been added to the master record.

• Lack of Flexibility: In view of the strong coupling between the programs and the data, most information retrieval possibilities would be limited to well-anticipated and predetermined requests for data; the system would normally be capable of producing only the scheduled reports and queries which it had been programmed to create. In the fast-moving and competitive business environment of today, apart from such regularly scheduled reports there is a need for responding to unanticipated queries and for investigative analysis which cannot be envisaged in advance.

One could discuss other points, and security problems would probably come next, but the above should be sufficient to illustrate that the approach is fundamentally inadequate for the problem to which it has been applied. Agreement in this matter has grown since the mid-1960s and the database approach is now well established as the basis for information system development and management in many application areas.

All the above difficulties result from two shortcomings:

• The lack of any definition of data objects independently of the way in which they are used by specific application programmes; and

• The lack of control over data object manipulation beyond that imposed by existing application programs.

The database approach has emerged in response. Fundamentally it rests on the following two interrelated ideas:

• The extraction of data object type descriptions from application programs into a single repository called a database schema (the word schema can be taken to mean a description of form): an application-independent description of the objects in the database; and


• The introduction of a software component called a database management system (DBMS) through which all data definition and manipulation (update and interrogation) occurs: a buffer that controls data access and removes this function from the applications.

Together, these ideas have the effect of fixing a funnel over the top of the data used by application systems and forcing all application programs' data manipulation through it. So let us now try to appreciate how the DBMS solves some of these issues.

• Data Validation: In principle, validation rules for data objects can be held in the schema and enforced on entry by the DBMS. This reduces the amount of application code that is needed. Changes to these rules need be made exactly once because they are not duplicated.

• Data Sharing: Changes to the structures of data objects are registered by modifications to the schema. Existing application programs need not be aware of any differences, because a correspondence between their view of data and that which is now supported can also be held in the schema and interpreted by the DBMS. This concept is often referred to as data independence: applications are independent of the actual representation of their data.

Synchronization of concurrent access can be implemented by the DBMS because it oversees all database access.

The record-level data manipulation of programming languages such as Cobol, PL/1, Fortran and so on can be escaped by means of a higher-level (more problem-oriented than implementation-oriented) data manipulation language that is embedded within application programs.

Furthermore, because the approach involves a central repository of data description, it is possible to develop a mechanism that provides a general inquiry facility to data objects and their descriptions; such a mechanism is normally called a query language.

It is interesting that the emergence of the database approach has brought about a new class of programming language; this is symptomatic of the significant change that database thinking has brought to the information systems development process.

Having described the database approach in terms of its impact on the development and management of information systems, it is now appropriate to attempt some definitions.

Student Activity 1.6

Before reading the next section, answer the following questions:

1. Make a diagram showing the structure of DBMS.

2. What are the major components of DBMS?

3. What is DBMS engine?

4. Write brief notes on security management sub system.

5. What are the functions of data integrity?

6. What do you understand by data validation?

If your answers are correct, then proceed to the next section.

Database Administrator

One of the main reasons for using DBMS is to have central control of both the data and the processes that access those data. The person who has such central control over the system is called the database administrator (DBA). The functions of the DBA include the following:

• Schema Definition: The DBA creates the original database schema by writing a set of definitions that is translated by the DDL (Data Definition Language) compiler into a set of tables that is stored permanently in the data dictionary.

• Storage Structure and Access-Method Definition: The DBA creates appropriate storage structures and access methods by writing a set of definitions, which is translated by the DDL compiler.

• Schema and Physical-Organization Modification: The relatively rare modifications, either to the database schema or to the description of the physical storage organization, are accomplished by writing a set of definitions that is used by either the DDL compiler or the data-storage and data-definition-language compilers to generate modifications to the appropriate internal system tables (for example, the data dictionary).

• Granting of Authorizations for Data Access: The granting of different types of authorizations allows the DBA to regulate which parts of the database various users can access.

• Integrity-Constraint Specification: The data values stored in the database must satisfy certain consistency constraints; for example, the number of hours an employee may work in one week may not exceed a pre-specified limit (say, 80 hours).

Database Users

The primary goal of a database system is to provide an environment for retrieving information from and storing new information into the database. There are four different types of database system users, differentiated by the way that they expect to interact with the system.

• Application programmers are computer professionals who interact with the system through DML (Data Manipulation Language) calls, which are embedded in a program written in a host language (for example, Cobol, PL/1, Pascal, C). These programs are commonly referred to as application programs.

e.g. a banking system includes programs that generate payroll checks, that debit accounts, that credit accounts, or that transfer funds between accounts.

• Sophisticated Users: Such users interact with the system without writing programs. Instead, they form their requests in a database query language. Each such query is submitted to a query processor whose function is to break down DML statements into instructions that the storage manager understands. Analysts who submit queries to explore the data in the database fall in this category.

• Specialized Users: Such users are those who write specialized database applications that do not fit into the traditional data-processing framework.

e.g. computer-aided design systems, knowledge base and expert systems, systems that store data with complex data types (for example, graphics data and audio data).

• Naive users: These are unsophisticated users who interact with the system by invoking one of the permanent application programs that have been written previously. For example, a bank teller who needs to transfer $50 from account A to account B invokes a program called transfer.

This program asks the teller for the amount of money to be transferred, the account from which the money is to be transferred, and the account to which the money is to be transferred.

Instances and Schemas

A database changes over time as information is inserted and deleted. The collection of information stored in the database at a particular moment is called an instance of the database. The overall design of the database is called the database schema; schemas change infrequently, if at all.

Analogies to the concepts of data types, variables and values in programming languages are useful here. Returning to the customer record type definition (given under Data Abstraction below), note that in declaring the type customer we have not declared any variables. To declare such a variable in a Pascal-like language, we write

var customer1 : customer;

Variable customer1 now corresponds to an area of storage containing a customer type record.

A database schema corresponds to the programming-language type definition. A variable of a given type has a particular value at a given instant; thus, the value of a variable in programming languages corresponds to an instance of a database schema. In other words, "the description of a database is called the database schema, which is specified during database design and is not expected to change frequently". A displayed schema is called a schema diagram.

E.g. a student schema diagram:

STUDENT: Name | StudentNumber | Class | Major

COURSE: CourseName | CourseNumber | CreditHours | Department

A schema diagram displays only some aspects of a schema, such as the names of record types and data items, and some types of constraints. Other aspects are not specified in the schema diagram; in the diagram above, neither the data type of each data item nor the relationships among the various files are shown.

Student Activity 1.7

Before reading the next section, answer the following questions:

1. What do you understand by DBA, and what important role does the DBA play?

2. Describe the various types of database users.

3. Differentiate between instances and schemas.

If your answers are correct, then proceed to the next section.

Records and Record Types

Data is usually stored in the form of records. Each record consists of a collection of related data values or items, where each value is formed of one or more bytes and corresponds to a particular field of the record. Records usually describe entities and their attributes. For example, an EMPLOYEE record represents an employee entity, and each field value in the record specifies some attribute of that employee, such as NAME, BIRTHDATE, SALARY, or SUPERVISOR. A collection of field names and their corresponding data types constitutes a record type or record format definition. A data type, associated with each field, specifies the type of values a field can take.

Data Types

The data type of a field is usually one of the standard data types used in programming. These include numeric (integer, long integer, or floating point), string of characters (fixed-length or varying), Boolean (having 0 and 1 or TRUE and FALSE values only), and sometimes specially coded date and time data types. The number of bytes required for each data type is fixed for a given computer system. An integer may require 4 bytes, a long integer 8 bytes, a real number 4 bytes, a Boolean 1 byte, a date 10 bytes (assuming a format of YYYY-MM-DD), and a fixed-length string of k characters k bytes. Variable-length strings may require as many bytes as there are characters in each field value. For example, an EMPLOYEE record type may be defined – using the C programming language notation – as the following structure:

struct employee {
    char name[30];
    char ssn[9];
    int  salary;
    int  jobcode;
    char department[20];
};
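A small check of the storage such a record occupies; note that the printed figure can exceed the sum of the declared field sizes, because the C compiler may insert alignment padding (a property of C, not of the DBMS):

#include <stdio.h>

struct employee {          /* the record type defined above */
    char name[30];
    char ssn[9];
    int  salary;
    int  jobcode;
    char department[20];
};

int main(void) {
    /* The declared sizes sum to 30 + 9 + 4 + 4 + 20 = 67 bytes on a
       machine with 4-byte ints; padding may make the record larger. */
    printf("record size = %zu bytes\n", sizeof(struct employee));
    return 0;
}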

In recent database applications, the need may arise for storing data items that consist of large unstructured objects, which represent images, digitized video or audio streams, or free text. These are referred to as BLOBs (Binary Large Objects). A BLOB data item is typically stored separately from its record, in a pool of disk blocks, and a pointer to the BLOB is included in the record.
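A minimal sketch of this layout in C; the names and the locator format are illustrative assumptions, not any particular DBMS's scheme. The record holds only a small locator, while the object itself lives in a separate pool of disk blocks.

#include <stddef.h>

/* Locator kept inside the record; the BLOB bytes live elsewhere. */
struct blob_ref {
    long   first_block;   /* address of the BLOB's first disk block */
    size_t length;        /* size of the object in bytes            */
};

struct employee_with_photo {
    char name[30];
    char ssn[9];
    struct blob_ref photo;   /* points to the BLOB, not the BLOB itself */
};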

Files

A file is a sequence of records. In many cases, all records in a file are of the same record type. If every record in the file has exactly the same size (in bytes), the file is said to be made up of fixed-length records. If different records in the file have different sizes, the file is said to be made up of variable-length records. A file may have variable-length records for several reasons:

The file records are of the same record type, but one or more of the fields are of varying size (variable-length fields). For example, the NAME field of EMPLOYEE can be a variable-length field.

The file records are of the same record type, but one or more of the fields may have multiple values for individual records; such a field is called a repeating field and a group of values for the field is often called a repeating group.

The file records are of the same record type, but one or more of the fields are optional; that is, they may have values for some but not all of the file records (optional fields).

The file contains records of different record types and hence of varying size (mixed file). This would occur if related records of different types were clustered (placed together) on disk blocks; for example, the GRADE_REPORT records of a particular student may be placed following that STUDENT’s record.

The fixed-length EMPLOYEE records in the figure given below have a record size of 71 bytes. Every record has the same fields, and field lengths are fixed, so the system can identify the starting byte position of each field relative to the starting position of the record. This facilitates locating field values by programs that access such files. Notice that it is possible to represent a file that logically should have variable-length records as a fixed-length records file. For example, in the case of optional fields we could have each field in every file record but store a special null value if no value exists for that field. For a repeating field, we could allocate as many spaces in each record as the maximum number of values that the field can take. In either case, space is wasted when certain records do not have values for all the physical spaces provided in each record. We now consider other options for formatting the records of a file of variable-length records.

[Figure: Record storage formats: a fixed-length EMPLOYEE record, and variable-length records that use separator characters between fields]

For variable-length fields, each record has a value for each field, but we do not know the exact length of some field values. To determine the bytes within a particular record that represent each field, we can use special separator characters (such as ? or % or $) – which do not appear in any field value – to terminate variable-length fields (see figure), or we can store the length in bytes of the field in the record, preceding the field value.

A file of records with optional fields can be formatted in different ways. If the total number of fields for the record type is large but the number of fields that actually appear in a typical record is small, we can include in each record a sequence of <field-name, field-value> pairs rather than just the field values. This needs three types of separator characters, although the same separator character can be used for the first two purposes: separating the field name from the field value, and separating one field from the next field. A more practical option is to assign a short field type code – say, an integer number – to each field and include in each record a sequence of <field-type, field-value> pairs rather than <field-name, field-value> pairs.

A repeating field needs one separator character to separate the repeating values of the field and another separator character to indicate termination of the field. Finally, for a file that includes records of different types, each record is preceded by a record type indicator. Understandably, programs that process files of variable-length records – which are usually part of the file system and hence hidden from the typical programmer – need to be more complex than those for fixed-length records, where the starting position and size of each field are known and fixed.
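A minimal sketch in C of scanning one such variable-length record; the '$' separator and the sample field values are assumptions for illustration. Each field is terminated by a separator character that cannot occur inside a field value.

#include <stdio.h>
#include <string.h>

int main(void) {
    /* Three variable-length fields, each terminated by '$'. */
    char record[] = "John Smith$123456789$Research$";
    const char *start = record;
    const char *sep;
    int field = 0;

    while ((sep = strchr(start, '$')) != NULL) {
        /* Print the bytes between the previous separator and this one. */
        printf("field %d: %.*s\n", field++, (int)(sep - start), start);
        start = sep + 1;   /* the next field begins after the separator */
    }
    return 0;
}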

Data Integrity Constraints

Most database applications have certain integrity constraints that must hold for the data. A DBMS should provide capabilities for defining and enforcing these constraints. The simplest type of integrity constraint involves specifying a data type for each data item. For example, we may specify that the value of the Class data item within each student record must be an integer between 1 and 5, and that the value of Name must be a string of no more than 30 alphabetic characters. A more complex type of constraint that occurs frequently involves specifying that a record in one file must be related to records in other files. For example, in the figure given below, we can specify that "every section record must be related to a course record." Another type of constraint specifies uniqueness on data item values, such as "every course record must have a unique value for CourseNumber." These constraints are derived from the meaning or semantics of the data and of the miniworld it represents. It is the database designer's responsibility to identify integrity constraints during database design. Some constraints can be specified to the DBMS and automatically enforced. Other constraints may have to be checked by update programs or at the time of data entry.

A data item may be entered erroneously and still satisfy the specified integrity constraints. For example, if a student receives a grade of A but a grade of C is entered in the database, the DBMS cannot discover this error automatically, because C is a valid value for the Grade data item. Since application programs are added to the system in an ad hoc manner, it is difficult to enforce such constraints through them alone.

[Figure: A sample database that stores student and course information]

STUDENT
Name    StudentNumber   Class   Major
Smith        17            1     CS
Brown         8            2     CS

COURSE
CourseName                   CourseNumber   CreditHours   Department
Intro to Computer Science      CS1310            4           CS
Data Structures                CS3320            4           CS
Discrete Mathematics           MATH2410          3           MATH
Database                       CS3380            3           CS

SECTION
SectionIdentifier   CourseNumber   Semester   Year   Instructor
       85             MATH2410      Fall       98     King
       92             CS1310        Fall       98     Anderson
      102             CS3320        Spring     99     Knuth
      112             MATH2410      Fall       99     Chang
      119             CS1310        Fall       99     Anderson
      135             CS3380        Fall       99     Stone

GRADE_REPORT
StudentNumber   SectionIdentifier   Grade
      17               112            B
      17               119            C
       8                85            A
       8                92            A
       8               102            B
       8               135            A

PREREQUISITE
CourseNumber   PrerequisiteNumber
   CS3380           CS3320
   CS3380           MATH2410
   CS3320           CS1310
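A minimal sketch in C of how an update program might enforce the two constraints stated above for STUDENT records; the struct layout is assumed, and whether spaces are allowed inside a name is also an assumption.

#include <ctype.h>
#include <string.h>

struct student {
    char name[31];        /* at most 30 characters plus terminator */
    int  student_number;
    int  class;           /* must be an integer between 1 and 5   */
    char major[5];
};

/* Returns 1 if the record satisfies the constraints, 0 otherwise. */
int student_is_valid(const struct student *s)
{
    size_t n = strlen(s->name);
    if (s->class < 1 || s->class > 5)
        return 0;                        /* Class outside 1..5       */
    if (n == 0 || n > 30)
        return 0;                        /* Name empty or too long   */
    for (size_t i = 0; i < n; i++)       /* alphabetic (spaces kept) */
        if (!isalpha((unsigned char)s->name[i]) && s->name[i] != ' ')
            return 0;
    return 1;
}

Constraints such as these can be declared to the DBMS, which then rejects violating updates automatically; the sketch mimics the check an update program would otherwise have to perform itself.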

Data Abstraction

For the system to be usable, it must retrieve data efficiently. This concern has led to the design of complex data structures for the representation of data in the database. Since many database-system users are not computer trained, developers hide the complexity from users through several levels of abstraction, to simplify users' interaction with the system:

• Physical level. The lowest level of abstraction describes how the data are actually stored. At the physical level, complex low-level data structures are described in detail.

• Logical level. The next higher level of abstraction describes what data are stored in the database, and what relationships exist among those data. The entire database is thus described in terms of a small number of relatively simple structures. Although implementation of the simple structures at the logical level may involve complex physical-level structures, the user of the logical level does not need to be aware of this complexity. Database administrators, who must decide what information is to be kept in the database, use the logical level of abstraction.

• View level. The highest level of abstraction describes only part of the entire database. Despite the use of simpler structures at the logical level, some complexity remains, because of the large size of the database. Many users of the database system will not be concerned with all this information. Instead, such users need to access only a part of the database. So that their interaction with the system is simplified, the view level of abstraction is defined. The system may provide many views for the same database.

The interrelationship among these three levels of abstraction is illustrated in Figure given below.

[Figure: The three levels of data abstraction]

An analogy to the concept of data types in programming languages may clarify the distinction among levels of abstraction. Most high-level programming languages support the notion of a record type. For example, in a Pascal-like language, we may declare a record as follows:

type customer = record
    customer-name : string;
    social-security : string;
    customer-street : string;
    customer-city : string;
end;

This code defines a new record type called customer with four fields. Each field has a name and a type associated with it. A banking enterprise may have several such record types, including

• Account, with fields account-number and balance

• Employee, with fields employee-name and salary

At the physical level, a customer, account, or employee record can be described as a block of consecutive storage locations (for example, words or bytes). The language compiler hides this level of detail from programmers. Similarly, the database system hides many of the lowest-level storage details from database programmers. Database administrators may be aware of certain details of the physical organization of the data.

At the logical level, each such record is described by a type definition, as illustrated in the previous code segment, and the interrelationship among these record types is defined. Programmers using a programming language work at this level of abstraction. Similarly, database administrators usually work at this level of abstraction.

Finally, at the view level, computer users see a set of application programs that hide details of the data types. Similarly, at the view level, several views of the database are defined, and database users see these views. In addition to hiding details of the logical level of the database, the views also provide a security mechanism to prevent users from accessing parts of the database. For example, tellers in a bank see only that part of the database that has information on customer accounts; they cannot access information concerning salaries of employees.
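The three levels can be imitated in C; this is a rough analogy with assumed names, not a database implementation. The struct plays the role of the logical level, the raw bytes behind it the physical level, and a function that exposes only non-confidential fields the view level.

#include <stdio.h>
#include <string.h>

/* Logical level: the record described by a type definition. */
struct customer {
    char customer_name[20];
    char social_security[12];
    char customer_city[20];
};

/* View level: expose only part of the record, the way a teller's
   view hides confidential fields such as the social security number. */
void print_teller_view(const struct customer *c) {
    printf("%s (%s)\n", c->customer_name, c->customer_city);
}

int main(void) {
    struct customer c;
    memset(&c, 0, sizeof c);
    strcpy(c.customer_name, "Smith");
    strcpy(c.social_security, "123-45-6789");
    strcpy(c.customer_city, "Rohtak");

    /* Physical level: the same record is just consecutive bytes. */
    const unsigned char *bytes = (const unsigned char *)&c;
    printf("first byte of storage: 0x%02x\n", bytes[0]);

    print_teller_view(&c);
    return 0;
}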


Student Activity 1.8

Answer the following questions:

1. Write short notes on following:

a. Records

b. Files

c. Record types

2. What do you understand by Data Integrity Constraints?

3. What do you understand by Data Abstraction?

Summary

• Data is the plural of the Latin word datum: any raw facts or figures, like numbers, events, letters or transactions, from which alone no conclusion can be drawn. Data becomes useful after processing.

• Information is processed data. The user can take decision based on information.

• A database is a collection of related data or operational data extracted from any firm or organization

• A database is a logically coherent collection of data with some inherent meaning. A random assortment of data cannot correctly be referred to as a database.

• A database management system (DBMS) is a software application system that is used to create, maintain and provide controlled access to user databases. Database management systems range in complexity from a PC-DBMS (such as Ashton Tate’s dBASE IV) costing a few hundred dollars to a mainframe DBMS product (such as IBM’s DB2) costing several hundred thousand dollars.

• One of the main reasons for using DBMS is to have central control of both the data and the processes that access those data. The person who has such central control over the system is called the database administrator (DBA).

• Data is usually stored in the form of records. Each record consists of a collection of related data values or items

• A file is a sequence of records.

• To simplify users’ interactions with the systems, developers hide the complexity from users through several levels of abstraction.

Self-assessment Questions

Solved Exercise

I. True or False

1. Data means any raw facts or numbers


2. A system is not a group of associated activities or functions.

3. A file is a sequence of fields.

II. Fill in the Blanks

1. Data is the plural of the Latin word _____________.

2. The uses of formal systems are based on proper procedures of collecting _________.

3. Data is converted to output or information by ____________.

4. The primary goal of a database system is to provide an environment for retrieving _______________.

5. At the________________, each such record is described by a type definition.

Answers

I. True or False

1. True

2. False

3. False

II. Fill in the Blanks

1. datum

2. data

3. processing

4. information

5. logical level

"� � ��������� ��

I. True or False

1. Today, data manipulation and information processing have become the major tasks of any organization

2. Information from internal and external sources may be classified only as formal.

3. Most database applications have certain integrity constraints that must hold for the data.

4. The person who has central control over the system is called the database user.

5. Database administrators usually work at logical level of abstraction.

II. Fill in the Blanks

1. _________ and _________ information can be obtained from external sources.

2. The user of information should receive all the details necessary to aid_________.

3. _______ information can be more damaging than ______ information to a business.


4. __________ can process data much quicker than human.

5. __________ of concurrent access can be implemented by DBMS.

Detailed Questions

1. Define a system.

2. What is a database system?

3. What is DBMS Engine?

4. Discuss the roles of DBA.

5. Describe different levels of abstraction.


Introduction Object-Based Logical Models ER Analysis Record-Based Logical Models Relational Model Network Model Hierarchical Data Model

Unit 2

Database Management System

Learning Objectives

After reading this unit you should appreciate the following:

• Data Model

• ER Analysis

• Record Based Logical Model

• Relational Model

• Network Model

• Hierarchical Model

Top

Introduction

Underlying the structure of a database is the data model.

Data model: Data model is a collection of conceptual tools for describing data, data relationships, data semantics and consistency constraints.

Constraint: A rule applied to the data or to a column.

The various data models that have been proposed fall into three different groups.

1. Object-Based Logical Models

2. Record-Based Logical Models

3. Physical Models.

Top

Page 32: Bca 204 3rd Database

DATABASE MANAGEMENT SYSTEM 29

Object-Based Logical Models

Object-based logical models are used in describing data at the conceptual and external schema levels. They provide fairly flexible structuring capabilities and allow data constraints to be specified explicitly. Some of the object-based models are:

• The entity-relationship model

• The object-oriented model

• The semantic model

• The functional data model

The entity-relationship model and the object-oriented model act as representatives of the class of object-based logical models.

The Entity-Relationship Model

The entity-relationship (E-R) data model is based on a perception of a real world that consists of a collection of basic objects, called entities, and of relationships among these objects.

The figure below shows two entities and the values of their attributes. The employee entity e1 has four attributes: Name, Address, Age and Home phone; their values are "John Smith," "2311 Kirby, Houston, Texas 77001," "55," and "713-749-2630," respectively. The company entity c1 has three attributes: Name, Headquarters, and President; their values are "Sunco Oil," "Houston," and "John Smith," respectively.

Figure 2.1: Two entities, employee e1 and company c1, and their attribute values.

The Object-Oriented Model

Like E-R model, the object-oriented model is based on a collection of objects. An object is a software bundle of variables and related methods.

Objects that contain the same types of values and the same methods are grouped together into a class.

Example: Bank

Each bank offers the same type of functionality, such as withdraw and deposit.

Here withdraw and deposit are the methods of the class Bank. Individual banks such as Canara Bank, State Bank of India and Andhra Bank all offer this same functionality, so each of these three banks is an object of the Bank class.


Figure 2.2: Individual banks as objects of the Bank class.

Top

ER Analysis

A method called entity-relationship analysis is used for obtaining the conceptual model of the data, which will finally help us in obtaining our relational database. In order to carry out ER analysis it is necessary to understand and use the following features:

1. Entities: These are the real-world objects in an application; rectangles represent entity sets.

2. Attributes: These are the properties of importance of the entities and the relationships; ellipses represent attributes.

3. Relationships: These connect different entities and represent dependencies between them; diamonds represent relationships among entity sets.

4. Lines: These link attributes to entity sets, and entity sets to relationships.

In order to understand these terms let us take an example. If a supplier supplies an item to a company, then the supplier is an entity. The item supplied is also an entity. The item and supplier are related to each other in the sense that the supplier supplies an item. Supplying is thus the verb which specifies the relationship between item and supplier. A collection of similar entities is known as an entity set, and each member of the entity set is described by its attributes.

A supplier could be described by the following attributes:

SUPPLIER [supplier code, supplier name, address]

An item would have the following attributes:

ITEM [item code, item name, rate]

The Entity-Relationship diagram shown in Figure 2.3 represents entities by rectangles and relationships by a diamond.

Figure 2.3: E-R diagram in which the entity sets SUPPLIER and ITEM are connected by the relationship SUPPLIES.


SUPPLIES [supplier code, item code, order no, quantity] is represented by a diamond, which relates the two entity sets.
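As a rough sketch, the two entity sets and the SUPPLIES relationship could be realised as the following SQL tables; the column types shown are assumptions made only for illustration.

CREATE TABLE supplier (
    supplier_code CHAR(5) PRIMARY KEY,
    supplier_name VARCHAR(30),
    address       VARCHAR(50)
);

CREATE TABLE item (
    item_code CHAR(5) PRIMARY KEY,
    item_name VARCHAR(30),
    rate      DECIMAL(8,2)
);

-- The SUPPLIES relationship becomes a table whose foreign keys
-- reference the two participating entity sets.
CREATE TABLE supplies (
    supplier_code CHAR(5) REFERENCES supplier,
    item_code     CHAR(5) REFERENCES item,
    order_no      INTEGER,
    quantity      INTEGER
);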


Top

Record-Based Logical Models

Record-based logical models are used in describing data at the logical and view levels. In contrast to object-based data models, they are used both to specify the overall logical structure of the database and to provide a higher-level description of the implementation.

Record-based models are so named because the database is structured in fixed-format records of several types. Each record type defines a fixed number of fields, or attributes, and each field is usually of a fixed length.

The three most widely accepted record-based data models are:

1. Relational Model

2. Network Model

3. Hierarchical model

Top

Relational Model

The relational model uses a collection of tables to represent both data and the relationships among those data. Each table has multiple columns, and each column has a unique name.


Figure 2.4: A sample relational database.

Fig 2.4 shows that customer Johnson with social-security number 321-12-3123 has two accounts: A-101, with a balance of 500, and A-201, with a balance of 900. Note that the customers Johnson and Smith share account A-201, which means that they may share a business venture.

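As a minimal SQL sketch, the two tables of the figure could be declared and queried as follows; the column names are assumed for illustration.

CREATE TABLE customer (
    customer_name   VARCHAR(30),
    social_security CHAR(11),
    account_number  CHAR(5)
);

CREATE TABLE account (
    account_number CHAR(5),
    balance        INTEGER
);

-- List every customer together with the balance of each account held;
-- the tables are related purely by the values they contain.
SELECT c.customer_name, a.account_number, a.balance
FROM customer c, account a
WHERE c.account_number = a.account_number;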

Top

Network Model

Data in the network model are represented by collections of records, and relationships among data are represented by links, which can be viewed as pointers. The records in the database are organized as a collection of arbitrary graphs. Fig 2.5 shows the same information as the relational model of Fig 2.4, represented in the network model.

Figure 2.5: The same data as in the relational model, represented in the network model.

Top


Hierarchical Data Model

The hierarchical data model is similar to the network model in the sense that records and links represent data and relationships among the data, respectively. It differs from the network model in that the records are organized as collections of trees rather than arbitrary graphs. Fig 2.6 represents the same information as Fig 2.4 and Fig 2.5.

Figure 2.6: The same data represented in the hierarchical model.

Differences among the Models

The relational model differs from the network and hierarchical models in that it does not use pointers or links. Instead, the relational model relates records by the values that they contain. This freedom from the use of pointers allows a formal mathematical foundation to be defined.

Student Activity 2.1

Answer the following questions.

1. What do you understand by ER diagrams and ER analysis?

2. Differentiate between Network and Hierarchical Data Model.

Summary

• Data model is a collection of conceptual tools for describing data, data relationships, data semantics and consistency constraints.


• A method called entity-relationship analysis is used for obtaining the conceptual model of the data, which finally helps in obtaining a relational database.

• Record-based logical models are used in describing data at the logical and view levels.

• The relational model uses a collection of tables to represent both data and the relationships among those data.

• Data in the network model are represented by collections of records, and relationship among data is represented by links, which can be viewed as pointers.

• In hierarchical data model, data and relationships among the data are represented by records and links, respectively.

Self-assessment Questions

Solved Exercise

I. True or False

1. Object-based logical models are used in describing data at conceptual and external schemas.

2. The semantic data model is based on a perception of a real world that consists of a collection of basic objects, called entities, and of relationships among these objects.

3. Record-based logical models are used in describing data at the logical and view levels.

II. Fill in the Blanks

1. _______________models provide fairly flexible structuring capabilities and allow data constraints to be specified explicitly.

2. The object-oriented model is based on a collection of _________.

3. ________ connect different entities and represent dependencies between them.

4. Object that contains the same types of values and the same methods are grouped together into _______________.

5. The records in the network model database are organized as a collection of ___________________.

Answers

I. True or False

1. True

2. False

3. True

II. Fill in the Blanks

1. Object-based

2. objects

3. Relationships


4. class

5. arbitrary graphs

Unsolved Exercise

I. True or False

1. Entity-relational model and the object-oriented model act as representatives of the class of the object-based logical models.

2. The various data models that have been proposed fall into three different groups.

3. Like E-R model, semantic model is based on a collection of objects. An object is a software bundle of variables and related methods.

4. Record-based logical models are used in describing data at the physical level.

II. Fill in the Blanks

a. _________ is a rule applied on the data or column.

b. _______ are the real world objects in an application, rectangles that represent entity sets.

c. _______models are used in describing data at the logical and view levels.

d. The relational model differs from the network and hierarchical models in that it does not use ________ or ________.

Detailed Questions

1. Name different object based data models.

2. What are entities, attribute and relationships?

3. Discuss the importance of E-R model.

4. Differentiate between Network and Hierarchical model.

5. Discuss the advantage of relational model over other models.


Basic Relational Algebra Operations A Complete Set of Relational Algebra Operations Relational Calculus SQL Structured Query Language (SQL)

Unit 3

Relational Data Manipulation

Learning Objectives

After reading this unit you should appreciate the following:

• Relation Algebra

• Relational Calculus

• SQL

Top

Basic Relational Algebra Operations

In addition to defining the database structure and constraints, a data model must include a set of operations to manipulate the data. The basic set of relational model operations constitutes the relational algebra. These operations enable the user to specify basic retrieval requests. The result of a retrieval is a new relation, which may have been formed from one or more relations. The algebra operations thus produce new relations, which can be further manipulated using operations of the same algebra. A sequence of relational algebra operations forms a relational algebra expression, whose result will also be a relation.

The relational algebra operations are usually divided into two groups. One group includes set operations from mathematical set theory; these are applicable because each relation is defined to be a set of tuples. Set operations include UNION, INTERSECTION, SET DIFFERENCE, and CARTESIAN PRODUCT. The other group consists of operations developed specifically for relational databases; these include SELECT, PROJECT, and JOIN, among others. The SELECT and PROJECT operations are discussed first below, because they are the simplest. Then we discuss set operations. Finally, we discuss JOIN and other complex operations. The relational database shown in Figure 3.1 is used for our example.

Some common database requests cannot be performed with the basic relational algebra operations, so additional operations are needed to express these requests.

Figure 3.1: Schema of the COMPANY relational database (the EMPLOYEE, DEPARTMENT, DEPT_LOCATIONS, PROJECT, WORKS_ON and DEPENDENT relations).

The SELECT Operation

The SELECT operation is used to select a subset of the tuples from a relation that satisfy a selection condition. One can consider the SELECT operation to be a filter that keeps only those tuples that satisfy a qualifying condition. For example, to select the EMPLOYEE tuples whose department is 4, or those whose salary is greater than $30,000, we can individually specify each of these two conditions with a SELECT operation as follows:

σDNO=4 (EMPLOYEE)

σSALARY>30000(EMPLOYEE)

In general, the SELECT operation is denoted by

σ<selection condition> ( R )

Where the symbol σ (sigma) is used to denote the SELECT operator, and the selection condition is a Boolean expression specified on the attributes of relation R. Notice that R is generally a relational algebra expression whose result is a relation; the simplest expression is just the name of a database relation. The relation resulting from the SELECT operation has the same attributes as R. The Boolean expression specified in <selection condition> is made up of a number of clauses of the form.

<attribute name> <comparison operator> <constraint value>, or

<attribute name> <comparison operator> <attribute name>

where <attribute name> is the name of an attribute of R, <comparison operator> is normally one of the operators {=, <, ≤, >, ≥, ≠} and <constant value> is a constant value from the attribute domain. Clauses can be arbitrarily connected by the Boolean operators AND, OR, and NOT to form a general selection condition. For example, to select the tuples for all employees who either work in department 4 and make over $25,000 per year, or work in department 5 and make over $30,000, we can specify the following SELECT operation:

σ(DNO=4 AND SALARY> 25000) OR (DNO=5 AND SALARY>30000)(EMPLOYEE)
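For comparison, the same selections can be written in SQL with a WHERE clause; this is only a sketch, assuming an EMPLOYEE table with DNO and SALARY columns as in Figure 3.1.

-- σDNO=4 (EMPLOYEE)
SELECT * FROM employee WHERE dno = 4;

-- σ(DNO=4 AND SALARY>25000) OR (DNO=5 AND SALARY>30000) (EMPLOYEE)
SELECT *
FROM employee
WHERE (dno = 4 AND salary > 25000)
   OR (dno = 5 AND salary > 30000);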

Figure 3.2: One possible database state for the COMPANY relational database schema.

Figure 3.3: Results of SELECT and PROJECT operations. (a) σ(DNO=4 AND SALARY>25000) OR (DNO=5 AND SALARY>30000)(EMPLOYEE). (b) πLNAME, FNAME, SALARY(EMPLOYEE). (c) πSEX, SALARY(EMPLOYEE).

The result is shown in Figure 3.3(a). Notice that the comparison operators in the set {=, <, ≤, >, ≥, ≠} apply to attributes whose domains are ordered values, such as numeric or date domains. Domains of strings of characters are considered ordered, based on the collating sequence of the characters. If the domain of an attribute is a set of unordered values, then only the comparison operators in the set {=, ≠} can be used.


An example of an unordered domain is the domain Color={red, blue, green, white, yellow, ….} where no order is specified among the various colors. Some domains allow additional types of comparison operators; for example, a domain of character strings may allow the comparison operator SUBSTRING_OF.

In general, the result of a SELECT operation can be determined as follows. The <selection condition> is applied independently to each tuple t in R. This is done by substituting each occurrence of an attribute Ai in the selection condition with its value in the tuple t[Ai]. If the condition evaluates to true, then tuple t is selected. All the selected tuples appear in the result of the SELECT operation. The Boolean conditions AND, OR and NOT have their normal interpretation as follows:

(cond1 AND cond2) is true if both (cond1)and (cond2) are true; otherwise, it is false.

(cond1 OR cond2) is true if either (cond1) or (cond2) is true; otherwise, it is false.

(NOT cond) is true if cond is false; otherwise, it is false.

The SELECT operator is unary; that is, it is applied to a single relation. Moreover, the selection operation is applied to each tuple individually; hence, selection conditions cannot involve more than one tuple. The degree of the relation resulting from a SELECT operation is the same as that of R. The number of tuples in the resulting relation is always less than or equal to the number of tuples in R; that is, |σC(R)| ≤ |R| for any condition C.

The fraction of tuples selected by a selection condition is referred to as the selectivity of the condition.

Notice that the SELECT operation is commutative; that is,

σ<cond1>(σ<cond2> ( R )) = σ<cond2>(σ<cond1>( R ))

Hence, a sequence of SELECT can be applied in any order. In addition, we can always combine a cascade of SELECT operations into a single SELECT operation with a conjunctive (AND) condition; that is :

σ<cond1>(σ<cond2> (…(σ<condn>( R ))…)) = σ<cond1> AND <cond2> AND …AND <condn>( R )

The PROJECT Operation

If we think of a relation as a table, the SELECT operation selects some of the rows from the table while discarding other rows. The PROJECT operation, on the other hand, selects certain columns from the table and discards the other columns. If we are interested in only certain attributes of a relation, we use the PROJECT operation to project the relation over these attributes only. For example, to list each employee’s first and last name and salary, we can use the PROJECT operation as follows:

πLNAME, FNAME, SALARY (EMPLOYEE)

The resulting relation is shown in Figure 3.3(b). The general form of the PROJECT operation is

π<attribute list> ( R )

where π (pi) is the symbol used to represent the PROJECT operation and <attribute list> is a list of attributes from the attributes of relation R. Again, notice that R is, in general, a relational algebra expression whose result is a relation, which in the simplest case is just the name of a database relation. The result of the PROJECT operation has only the attributes specified in <attribute list> and in the same order as they appear in the list. Hence, its degree is equal to the number of attributes in <attribute list>.

If the attribute list includes only non-key attributes of R, duplicate tuples are likely to occur; the PROJECT operation removes any duplicate tuples, so the result of the PROJECT operation is a set of tuples and hence a valid relation. This is known as duplicate elimination.

For example, consider the following PROJECT operation:

πSEX, SALARY (EMPLOYEE)


The result is shown in Figure 3.3(c). Notice that the tuple <F, 25000> appears only once in Figure 3.3(c), even though this combination of values appears twice in the EMPLOYEE relation.
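In SQL the PROJECT operation corresponds to the SELECT clause, but, unlike the algebra, SQL retains duplicate rows unless DISTINCT is requested; a sketch against the same assumed EMPLOYEE table:

-- πLNAME, FNAME, SALARY (EMPLOYEE)
SELECT lname, fname, salary FROM employee;

-- πSEX, SALARY (EMPLOYEE); DISTINCT performs the duplicate elimination
SELECT DISTINCT sex, salary FROM employee;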

The number of tuples in a relation resulting from a PROJECT operation is always less than or equal to the number of tuples in R. If the projection list is a superkey of R (that is, it includes some key of R), the result has the same number of tuples as R. Moreover,

π<list1>(π<list2>( R )) = π<list1>( R )

As long as <list2> contains the attributes in <list1>; otherwise, the left-hand side is an incorrect expression. It is also noteworthy that commutativity does not hold on PROJECT.

Sequences of Operations and the RENAME Operation

The relation shown in Figure 3.4 does not have any names. In general, we may want to apply several relational algebra operations one after the other. Either we can write the operations as a single relational algebra expression by nesting the operations, or we can apply one operation at a time and create intermediate result relations. In the latter case, we must name the relations that hold the intermediate results. For example, to retrieve the first name, last name, and salary of all employees who work in department number 5, we must apply a SELECT and a PROJECT operation. We can write a single relational algebra expression as follows:

πFNAME, LNAME, SALARY(σDNO=5(EMPLOYEE))

Figure 3.4(a) shows the result of this relational algebra expression. Alternatively, we can explicitly show the sequence of operations, giving a name to each intermediate relation:

DEP5_EMPS ← σDNO=5(EMPLOYEE)

RESULT ← πFNAME, LNAME, SALARY(DEP5_EMPS)

It is often simpler to break down a complex sequence of operations by specifying intermediate result relations than to write a single relational algebra expression. We can also use this technique to rename the attributes in the intermediate and result relations. This can be useful in connection with more complex operations such as UNION and JOIN, as we shall see. To rename the attributes in a relation, we simply list the new attribute names in parentheses, as in the following example:

TEMP ← σDNO=5(EMPLOYEE)

R(FIRSTNAME, LASTNAME, SALARY) ← πFNAME, LNAME, SALARY(TEMP)

The above two operations are illustrated in Figure 3.4(b). If no renaming is applied, the names of the attributes in the resulting relation of SELECT operation are the same as those in the original relation and in the same order. For a PROJECT operation with no renaming, the resulting relation has the same attribute names as those in the projection list and in the same order in which they appear in the list.
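SQL achieves the same effect with column aliases introduced by AS; a small sketch under the same schema assumptions:

-- R(FIRSTNAME, LASTNAME, SALARY) ← πFNAME, LNAME, SALARY(σDNO=5(EMPLOYEE))
SELECT fname AS firstname, lname AS lastname, salary
FROM employee
WHERE dno = 5;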

We can also define a RENAME operation, which can rename either the relation name, or the attribute names, or both, in a manner similar to the way we defined SELECT and PROJECT.

Figure 3.4: Results of a sequence of operations. (a) πFNAME, LNAME, SALARY(σDNO=5(EMPLOYEE)). (b) The same expression using intermediate relations and renaming of attributes.

The general RENAME operation, when applied to a relation R of degree n, is denoted by

ρS(B1, B2, …, Bn)( R ) or ρS( R ) or ρ(B1, B2, …, Bn)( R )

where the symbol ρ (rho) is used to denote the RENAME operator, S is the new relation name, and B1, B2, …, Bn are the new attribute names. The first expression renames both the relation and its attributes; the second renames the relation only; and the third renames the attributes only. If the attributes of R are (A1, A2, …, An), in that order, then each Ai is renamed as Bi.

Set Theoretic Operations

The next group of relational algebra operations is the standard mathematical operations on sets. For example, to retrieve the social security numbers of all employees who either work in department 5 or directly supervise an employee who works in department 5, we can use the UNION operation as follows:

DEP5_EMPS ← σDNO=5(EMPLOYEE)

RESULT1← πSSN(DEP5_EMPS)

RESULT2(SSN) ← πSUPERSSN(DEP5_EMPS)

RESULT←RESULT1 U RESULT2

The relation RESULT1 has the social security numbers of all employees who work in department 5, whereas RESULT2 has the social security numbers of all employees who directly supervise an employee who works in department 5. The UNION operation produces the tuples that are in either RESULT1 or RESULT2 or both (see Figure 3.5).

Figure 3.5: Result of the UNION operation RESULT ← RESULT1 ∪ RESULT2.

Several set theoretic operations are used to merge the elements of two sets in various ways, including UNION, INTERSECTION, and SET DIFFERENCE. These are binary operations; that is, each is applied to two sets. When these operations are adapted to relational databases, the two relations on which any of these three operations are applied must have the same type of tuples; this condition is called union compatibility. Two relations R(A1, A2, …, An) and S(B1, B2, …, Bn) are said to be union compatible if they have the same degree n, and if dom(Ai) = dom(Bi) for 1 ≤ i ≤ n. This means that the two relations have the same number of attributes and that each pair of corresponding attributes have the same domain.

We can define the three operations UNION, INTERSECTION, and SET DIFFERENCE on two union-compatible relations R and S as follows:

UNION: The result of this operation, denoted by R ∪ S, is a relation that includes all tuples that are either in R or in S or in both R and S. Duplicate tuples are eliminated.

INTERSECTION: The result of this operation, denoted by R ∩ S , is a relation that includes all tuples that are in both R and S.

SET DIFFERENCE: The result of this operation, denoted by R – S, is a relation that includes all tuples that are in R but not in S.

We will adopt the convention that the resulting relation has the same attribute names as the first relation R. Figure 3.6 illustrates the three operations. The relations STUDENT and INSTRUCTOR in Figure 3.6(a) are union compatible, and their tuples represent the names of students and instructors, respectively. The result of the UNION operation in Figure 3.6(b) shows the names of all students and instructors. Note that duplicate tuples appear only once in the result. The result of the INTERSECTION operation (Figure 3.6(c)) includes only those who are both students and instructors. Notice that UNION and INTERSECTION are commutative operations; that is,

R ∪ S = S ∪ R ,and R ∩ S = S ∩ R ,

Both UNION and INTERSECTION can be treated as n-ary operations applicable to any number of relations, as both are associative operations; that is,

R ∪ (S ∪ T) = (R ∪ S) ∪ T, and (R ∩ S) ∩ T = R ∩ (S ∩ T).
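Standard SQL exposes these set operations directly as UNION, INTERSECT and EXCEPT (Oracle spells EXCEPT as MINUS); all three eliminate duplicates. A sketch assuming one-column STUDENT and INSTRUCTOR tables:

SELECT name FROM student UNION     SELECT name FROM instructor;
SELECT name FROM student INTERSECT SELECT name FROM instructor;
SELECT name FROM student EXCEPT    SELECT name FROM instructor;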

Figure 3.6: Illustrating the set operations UNION, INTERSECTION, and DIFFERENCE. (a) Two union-compatible relations STUDENT and INSTRUCTOR. (b) STUDENT ∪ INSTRUCTOR. (c) STUDENT ∩ INSTRUCTOR. (d) STUDENT – INSTRUCTOR. (e) INSTRUCTOR – STUDENT.

The DIFFERENCE operation is not commutative; that is, in general

R – S ≠ S – R

Next we discuss the CARTESIAN PRODUCT operation also known as CROSS PRODUCT or CROSS JOIN—denoted by x, which is also a binary set operation, but the relations on which it is applied do not have to be union compatible. This operation is used to combine tuples from two relations in a combinatorial fashion. In general, the result of R(A1, A2, …, An) x S (B1, B2, …, Bm) is a relation Q with n+m attributes Q(A1, A2, …, An, B1, B2, …, Bm), in that order. The resulting relation Q has one tuple for each combination of tuples —one from R and one from S. Hence, if R has nR tuples and S has nS tuples, then R x S will have nR * nS tuples. The operation applied by itself is generally meaningless. It is useful


when followed by a selection that matches values of attributes coming from the component relations. For example, suppose that we want to retrieve for each female employee a list of the names of her dependents; we can do this as follows:

FEMALE_EMPS ← σ SEX = ‘F’(EMPLOYEE)

EMPNAMES ← πFNAME, LNAME, SSN (FEMALE_EMPS)

EMP_DEPENDENTS ← EMPNAMES X DEPENDENT

ACTUAL_DEPENDENTS ← σSSN=ESSN (EMP_DEPENDENTS)

RESULT ← πFNAME, LNAME, DEPENDENT_NAME (ACTUAL_DEPENDENTS)

The resulting relations from the above sequence of operations are shown in Figure 3.7. The EMP_DEPENDENTS relation is the result of applying the CARTESIAN PRODUCT operation to EMPNAMES with DEPENDENT. In EMP_DEPENDENTS, every tuple from EMPNAMES is combined with every tuple from DEPENDENT, giving a result that is not very meaningful. We want to combine a female employee tuple only with her own dependents, namely the DEPENDENT tuples whose ESSN value matches the SSN value of the EMPLOYEE tuple. The ACTUAL_DEPENDENTS relation accomplishes this.
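In SQL, a CARTESIAN PRODUCT followed by a SELECT is sketched by listing both tables in the FROM clause and filtering in the WHERE clause; the table and column names follow the COMPANY schema assumed throughout:

-- EMPNAMES x DEPENDENT restricted by σSSN=ESSN, then projected
SELECT e.fname, e.lname, d.dependent_name
FROM employee e, dependent d
WHERE e.sex = 'F'
  AND e.ssn = d.essn;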

The CARTESIAN PRODUCT creates tuples with the combined attributes of two relations. We can then SELECT only related tuples from the two relations by specifying an appropriate selection condition, as we did in the preceding example. Because this sequence of CARTESIAN PRODUCT followed by SELECT is used quite commonly to identify and select related tuples from two relations, a special operation, called JOIN, was created to specify this sequence as a single operation. We discuss the JOIN operation next.

The JOIN Operation

The JOIN operation, denoted by ⋈, is used to combine related tuples from two relations into single tuples. This operation is very important for any relational database with more than a single relation, because it allows us to process relationships among relations. To illustrate JOIN, suppose that we want to retrieve the name of the manager of each department. To get the manager's name, we need to combine each department tuple with the employee tuple whose SSN value matches the MGRSSN value in the department tuple. We do this by using the JOIN operation, and then projecting the result over the necessary attributes, as follows:

DEPT_MGR ← DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE

RESULT ← πDNAME, LNAME, FNAME (DEPT_MGR)

The first operation is illustrated in Figure 3.7. Note that MGRSSN is a foreign key and that the referential integrity constraint plays a role in having matching tuples in the referenced relation EMPLOYEE.

Figure 3.7: Illustrating the CARTESIAN PRODUCT and JOIN operations (the FEMALE_EMPS, EMPNAMES, EMP_DEPENDENTS, ACTUAL_DEPENDENTS and DEPT_MGR relations).

The example we gave earlier to illustrate the CARTESIAN PRODUCT operation can be specified, using the JOIN operation, by replacing the two operations:

EMP_DEPENDENTS ← EMPNAMES X DEPENDENT

ACTUAL_DEPENDENTS ← σSSN=ESSN(EMP_DEPENDENTS)

With a single JOIN operation:

ACTUAL_DEPENDENTS ← EMPNAMES ⋈SSN=ESSN DEPENDENT
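In SQL the same EQUIJOIN can be written with an explicit JOIN ... ON clause; a sketch of the DEPT_MGR example under the assumed COMPANY schema:

-- DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE, projected on the required attributes
SELECT d.dname, e.lname, e.fname
FROM department d
JOIN employee e ON d.mgrssn = e.ssn;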

The general form of a JOIN operation on two relations R(A1, A2, …, An) and S(B1, B2, …, Bm) is:

R ⋈<join condition> S

The result of the JOIN is a relation Q with n + m attributes Q(A1, A2, …, An, B1, B2, …, Bm), in that order; Q has one tuple for each combination of tuples, one from R and one from S, whenever the combination satisfies the join condition. This is the main difference between CARTESIAN PRODUCT and JOIN: in JOIN, only combinations of tuples satisfying the join condition appear in the result, whereas in the CARTESIAN PRODUCT all combinations of tuples are included in the result. The join condition is specified on attributes from the two relations R and S and is evaluated for each combination of tuples. Each tuple combination for which the join condition evaluates to true is included in the resulting relation Q as a single combined tuple.

A general join condition is of the form:

<condition> AND <condition> AND… AND <condition>

where each condition is of the form Ai θ Bi; Ai is an attribute of R, Bi is an attribute of S, Ai and Bi have the same domain, and θ (theta) is one of the comparison operators {=, <, ≤, >, ≥, ≠}. A JOIN operation with such a general join condition is called a THETA JOIN. Tuples whose join attributes are null do not appear in the result. In that sense, the join operation does not necessarily preserve all of the information in the participating relations.

Page 51: Bca 204 3rd Database

DATABASE SYSTEMS 48

The most common JOIN involves join conditions with equality comparisons only. Such a JOIN, where the only comparison operator used is =, is called an EQUIJOIN. Both examples we have considered were EQUIJOINs. Notice that in the result of an EQUIJOIN we always have one or more pairs of attributes that have identical values in every tuple. For example, in Figure 3.7 the values of the attributes MGRSSN and SSN are identical in every tuple of DEPT_MGR because of the equality join condition specified on these two attributes. Because one of each pair of attributes with identical values is superfluous, a new operation called NATURAL JOIN, denoted by '*', was created to get rid of the second (superfluous) attribute in an EQUIJOIN condition. The standard definition of NATURAL JOIN requires that the two join attributes (or each pair of join attributes) have the same name in both relations.

Figure 3.8: An illustration of the NATURAL JOIN operation. (a) PROJ_DEPT ← PROJECT * DEPT. (b) DEPT_LOCS ← DEPARTMENT * DEPT_LOCATIONS.

If this is not the case, a renaming operation is applied first. In the following example, we first rename the DNUMBER attribute of DEPARTMENT to DNUM so that it has the same name as the DNUM attribute in PROJECT, and then apply NATURAL JOIN:

PROJ_DEPT ← PROJECT * ρ(DNAME, DNUM, MGRSSN, MGRSTARTDATE) (DEPARTMENT)


The attribute DNUM is called the join attribute. The resulting relation is illustrated in Figure 3.8 (a). In the PROJ_DEPT relation, each tuple combines a PROJECT tuple with the DEPARTMENT tuple for the department that controls the project, but only one join attribute is kept.

If the attributes on which the natural join is specified already have the same names in both relations, renaming is unnecessary. For example, to apply a natural join on the DNUMBER attributes of DEPARTMENT and DEPT_LOCATIONS, it is sufficient to write:

DEPT_LOCS ← DEPARTMENT * DEPT_LOCATIONS

The resulting relation is shown in Figure 3.8(b), which combines each department with its locations and has one tuple for each location. In general, NATURAL JOIN is performed by equating all pairs of attributes that have the same name in the two relations. There can be a list of join attributes from each relation, and each corresponding pair must have the same name.
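Most SQL dialects also support NATURAL JOIN directly, equating every pair of identically named columns; a sketch of the DEPT_LOCS example:

-- DEPARTMENT * DEPT_LOCATIONS: the shared DNUMBER column is the join attribute
SELECT *
FROM department NATURAL JOIN dept_locations;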

A more general but non-standard definition for NATURAL JOIN is

Q ← R *(<list1>), (<list2>) S

In this case, <list1> specifies a list of i attributes from R, and <list2> specifies a list of i attributes from S. The lists are used to form equality comparison conditions between pairs of corresponding attributes; the conditions are then ANDed together. Only the list corresponding to the attributes of the first relation R, <list1>, is kept in the result Q.

Notice that if no combination of tuples satisfies the join condition, the result of a JOIN is an empty relation with zero tuples. In general, if R has nR tuples and S has nS tuples, the result of a JOIN operation R ⋈<join condition> S will have between zero and nR * nS tuples. The expected size of the join result divided by the maximum size nR * nS gives a ratio called join selectivity, which is a property of each join condition. If there is no join condition, all combinations of tuples qualify and the JOIN degenerates into a CARTESIAN PRODUCT, also called CROSS PRODUCT or CROSS JOIN.

The join operation is used to combine data from multiple relations so that related information can be presented in a single table. Note that sometimes a join may be specified between a relation and itself. The natural join or equijoin operation can also be specified among multiple tables, leading to an n-way join. For example, consider the following three-way join:

((PROJECT ⋈DNUM=DNUMBER DEPARTMENT) ⋈MGRSSN=SSN EMPLOYEE)

This links each project to its controlling department, and then relates the department to its manager employee. The net result is a consolidated relation where each tuple contains this project-department-manager information.
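A sketch of the same three-way join in SQL, assuming the COMPANY column names used throughout:

SELECT p.pname, d.dname, e.lname, e.fname
FROM project p
JOIN department d ON p.dnum = d.dnumber
JOIN employee e  ON d.mgrssn = e.ssn;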

Student Activity 3.1

Before reading the next section, answer the following questions:

1. What do you understand by PROJECT, RENAME, and Set Theoretic operations?

2. What do you understand by the Difference, Select and Join operations?

If your answer is correct, then proceed to the next section.

Top

A Complete Set of Relational Algebra Operations

It has been shown that the set of relational algebra operations {σ, π, U, –, x} is a complete set; that is, any of the other relational algebra operations can be expressed as a sequence of operations from this set. For example, the INTERSECTION operation can be expressed by using UNION and DIFFERENCE as follows:


R ∩ S ≡ ( R ∪ S ) – ((R – S ) ∪ ( S – R ))

Although, strictly speaking, INTERSECTION is not required, it is inconvenient to specify this complex expression every time we wish to specify an intersection. As another example, a JOIN operation can be specified as a CARTESIAN PRODUCT followed by a SELECT operation, as we discussed:

R ⋈<condition> S ≡ σ<condition>(R x S)

Similarly, a NATURAL JOIN can be specified as a CARTESIAN PRODUCT preceded by RENAME and followed by SELECT and PROJECT operations. Hence, the various JOIN operations are also not strictly necessary for the expressive power of the relational algebra; however, they are very important because they are convenient to use and are very commonly applied in database applications. Other operations have been included in the relational algebra for convenience rather than necessity. We discuss one of these, the DIVISION operation, in the next section.

The DIVISION Operation

The DIVISION operation is useful for a special kind of query that sometimes occurs in database applications. An example is "Retrieve the names of employees who work on all the projects that 'John Smith' works on." To express this query using the DIVISION operation, proceed as follows. First, retrieve the list of project numbers that 'John Smith' works on in the intermediate relation SMITH_PNOS:

SMITH ← σFNAME='John' AND LNAME='Smith' (EMPLOYEE)

SMITH_PNOS ← πPNO (WORKS_ON ⋈ESSN=SSN SMITH)

Next, create a relation that includes a tuple <PNO, ESSN> whenever the employee whose social security number is ESSN works on the project whose number is PNO, in the intermediate relation SSN_PNOS:

SSN_PNOS ← πESSN, PNO (WORKS_ON)

Finally, apply the DIVISION operation to the two relations, which gives the desired employees' social security numbers:

SSNS(SSN) ← SSN_PNOS ÷ SMITH_PNOS

RESULT ← πFNAME, LNAME (SSNS * EMPLOYEE)

The previous operations are shown in Figure 3.9(a). In general, the DIVISION operation is applied to two relations R(Z) ÷ S(X), where X ⊆ Z. Let Y = Z – X (and hence Z = X ∪ Y); that is, let Y be the set of attributes of R that are not attributes of S. The result of DIVISION is a relation T(Y) that includes a tuple t if tuples tR appear in R with tR[Y] = t, and with tR[X] = tS for every tuple tS in S. This means that, for a tuple t to appear in the result T of the DIVISION, the values in t must appear in R in combination with every tuple in S.

Figure 3.9(b) illustrates a DIVISION operation where X = {A}, Y = {B}, and Z = {A, B}. Notice that the tuples (values) b1 and b4 appear in R in combination with all three tuples in S; that is why they appear in the resulting relation T. All other values of B in R do not appear with all the tuples in S and are not selected: b2 does not appear with a2, and b3 does not appear with a1.

The DIVISION operator can be expressed as a sequence of π, x, and – operations as follows:

T1 ← πY( R ),  T2 ← πY((S x T1) – R ),  T ← T1 – T2

Figure 3.9: Illustrating the DIVISION operation. (a) Dividing SSN_PNOS by SMITH_PNOS. (b) T ← R ÷ S.
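SQL has no DIVISION operator; the usual workaround is a doubly nested NOT EXISTS, which keeps an employee only when no project of 'John Smith' is missing from that employee's projects. This is only a sketch over the assumed COMPANY schema:

SELECT e.fname, e.lname
FROM employee e
WHERE NOT EXISTS (
    -- a project Smith works on ...
    SELECT *
    FROM works_on w1
    WHERE w1.essn = (SELECT ssn FROM employee
                     WHERE fname = 'John' AND lname = 'Smith')
      AND NOT EXISTS (
          -- ... that this employee does not also work on
          SELECT *
          FROM works_on w2
          WHERE w2.essn = e.ssn
            AND w2.pno  = w1.pno));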

In this section we define additional operations to express these requests. These operations enhance the expressive power of the relational algebra.

Aggregate Functions and Grouping

The first type of request that cannot be expressed in the basic relational algebra is to specify mathematical aggregate functions on collections of values from the database. Examples of such functions include retrieving the average or total salary of all employees or the number of employee tuples. Common functions applied to collections of numeric values include SUM, AVERAGE, MAXIMUM and MINIMUM. The COUNT function is used for counting tuples or values.

Another common type of request involves grouping the tuples in a relation by the value of some of their attributes and then applying an aggregate function independently to each group. An example would be to group employee tuples by DNO, so that each group includes the tuples for employees working in the same department. We can then list each DNO value along with, say, the average salary of employees within the department.

We can define an AGGREGATE FUNCTION operation, using the symbol ℑ (pronounced "script F"), to specify these types of requests as follows:

<grouping attributes> ℑ < function list> (R)

where <grouping attributes> is a list of attributes of the relation specified in R, and <function list> is a list of (<function> <attribute>) pairs. In each such pair, <function> is one of the allowed functions, such as SUM, AVERAGE, MAXIMUM, MINIMUM or COUNT, and <attribute> is an attribute of the relation specified by R. The resulting relation has the grouping attributes plus one attribute for each element in the function list. For example, to retrieve each department number, the number of employees in the department, and their average salary, while renaming the resulting attributes as indicated below, we write:

ρR(DNO, NO_OF_EMPLOYEES, AVERAGE_SAL) (DNO ℑ COUNT SSN, AVERAGE SALARY (EMPLOYEE))

The result of this operation is shown in Figure 3.10(a).

In the above example, we specified a list of attribute names—between parentheses in the rename operation—for the resulting relation R. If no renaming is applied, then the attributes of the resulting relation that correspond to the function list will each be the concatenation of the function name with the attribute name in the form <function>_<attribute>. For example, Figure 3.10 (b) shows the result of the following operation:

DNO ℑ COUNT SSN, AVERAGE SALARY (EMPLOYEE)

Figure 3.10: An illustration of the AGGREGATE FUNCTION operation. (a) ρR(DNO, NO_OF_EMPLOYEES, AVERAGE_SAL)(DNO ℑ COUNT SSN, AVERAGE SALARY(EMPLOYEE)). (b) DNO ℑ COUNT SSN, AVERAGE SALARY(EMPLOYEE):

DNO  COUNT_SSN  AVERAGE_SALARY
5    4          33250
4    3          31000
1    1          55000

(c) ℑ COUNT SSN, AVERAGE SALARY(EMPLOYEE):

COUNT_SSN  AVERAGE_SALARY
8          35125

If no grouping attributes are specified, the functions are applied to the attribute values of all the tuples in the relation, so the resulting relation has a single tuple only. For example, Figure 3.10(c) shows the result of the following operation:

ℑ COUNT SSN, AVERAGE SALARY(EMPLOYEE)

It is important to note that, in general, duplicates are not eliminated when an aggregate function is applied; this way, the normal interpretation of functions such as SUM and AVERAGE is computed. It is worth emphasizing that the result of applying an aggregate function is a relation, not a scalar number, even if it has a single value.
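The ℑ operation maps onto SQL aggregate functions with GROUP BY; a sketch over the assumed EMPLOYEE table (note that SQL, too, returns a one-row table when no GROUP BY is given):

-- DNO ℑ COUNT SSN, AVERAGE SALARY (EMPLOYEE)
SELECT dno, COUNT(ssn) AS no_of_employees, AVG(salary) AS average_sal
FROM employee
GROUP BY dno;

-- With no grouping attributes, the result is a single tuple
SELECT COUNT(ssn), AVG(salary) FROM employee;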

Recursive Closure Operations

Another type of operation that, in general, cannot be specified in the basic relational algebra is recursive closure. This operation is applied to a recursive relationship between tuples of the same type, such as the relationship between an employee and a supervisor. This relationship is described by the foreign key SUPERSSN of the EMPLOYEE relation, which relates each employee tuple (in the role of supervisee) to another employee tuple (in the role of supervisor). An example of a recursive operation is to retrieve all supervisees of an employee e at all levels; that is, all employees e' directly supervised by e, all employees e'' directly supervised by each employee e', all employees e''' directly supervised by each employee e'', and so on. Although it is straightforward in the relational algebra to specify all employees supervised by e at a specific level, it is difficult to specify all supervisees at all levels. For example, to specify the SSNs of all employees e' directly supervised, at level one, by the employee e whose name is 'James Borg' (see Figure 3.2), we can apply the following operation:

BORG_SSN ← πSSN (σFNAME='James' AND LNAME='Borg' (EMPLOYEE))

SUPERVISION(SSN1, SSN2) ← π SSN, SUPERSSN(EMPLOYEE)

RESULT1(SSN) ← πSSN1 (SUPERVISION ⋈SSN2=SSN BORG_SSN)

To retrieve all employees supervised by Borg at level 2, that is, all employees e'' supervised by some employee e' who is directly supervised by Borg, we can apply another JOIN to the result of the first query, as follows:

RESULT2(SSN) ← πSSN1 (SUPERVISION ⋈SSN2=SSN RESULT1)


To get both sets of employees supervised at levels 1 and 2 by 'James Borg', we can apply the UNION operation to the two results, as follows:

RESULT ← RESULT2 ∪ RESULT1

The results of these queries are illustrated in Figure 3.11. Although it is possible to retrieve employees at each level and then take their UNION, we cannot, in general, specify a query such as "retrieve the supervisees of 'James Borg' at all levels" without utilizing a looping mechanism. An operation called the transitive closure of relations has been proposed to compute the recursive relationship as far as the recursion proceeds.
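Later SQL standards added exactly such a looping mechanism in the form of recursive common table expressions; the following sketch assumes a system with WITH RECURSIVE support (Oracle traditionally offered CONNECT BY instead):

-- All supervisees of James Borg, at every level
WITH RECURSIVE supervisees(ssn) AS (
    SELECT ssn FROM employee
    WHERE superssn = (SELECT ssn FROM employee
                      WHERE fname = 'James' AND lname = 'Borg')
  UNION
    SELECT e.ssn
    FROM employee e JOIN supervisees s ON e.superssn = s.ssn
)
SELECT ssn FROM supervisees;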

The OUTER JOIN and OUTER UNION Operations

Finally, we discuss some extensions of the JOIN and UNION operations. The JOIN operations described earlier match tuples that satisfy the join condition. For example, for a NATURAL JOIN operation R * S, only tuples from R that have matching tuples in S, and vice versa, appear in the result. Hence, tuples without a matching (or related) tuple are eliminated from the JOIN result. Tuples with null in the join attributes are also eliminated. A set of operations, called OUTER JOINs, can be used when we want to keep all the tuples in R, or those in S, or those in both relations in the result of the JOIN, whether or not they have matching tuples in the other relation. This satisfies the need of queries where tuples from two tables are to be combined by matching corresponding rows, but some tuples are liable to be lost for lack of matching values. In such cases an operation is desirable that would preserve all the tuples whether or not they produce a match.

For example, suppose that we want a list of all employee names and also the names of the departments they manage, if they happen to manage a department; we can apply an operation LEFT OUTER JOIN, denoted by ⟕, to retrieve the result as follows:

TEMP ← (EMPLOYEE ⟕ SSN=MGRSSN DEPARTMENT)

RESULT ← π FNAME, MINIT, LNAME, DNAME (TEMP)

(Borg's SSN is 888665555)

SUPERVISION
(SSN1 = SSN)   (SSN2 = SUPERSSN)
123456789      333445555
333445555      888665555
999887777      987654321
987654321      888665555
666884444      333445555
453453453      333445555
987987987      987654321
888665555      null

RESULT1
SSN
333445555
987654321


RESULT2
SSN
123456789
999887777
666884444
453453453
987987987

RESULT
SSN
123456789
999887777
666884444
453453453
987987987
333445555
987654321

Figure 3.11: A two-level recursive operation

The LEFT OUTER JOIN operation keeps every tuple in the first or left relation R in R ⟕ S; if no matching tuple is found in S, then the attributes of S in the join result are filled or "padded" with null values. The result of these operations is shown in Figure 3.12.

A similar operation, RIGHT OUTER JOIN, denoted by ⟖, keeps every tuple in the second or right relation S in the result of R ⟖ S. A third operation, FULL OUTER JOIN, denoted by ⟗, keeps all tuples in both the left and the right relations when no matching tuples are found, padding them with null values as needed.

The OUTER UNION operation was developed to take the union of tuples from two relations if the relations are not union compatible. This operation will take the UNION of tuples in two relations that are partially compatible, meaning that only some of their attributes are union compatible. It is expected that the list of compatible attributes includes a key for both relations. Tuples from the component relations with the same key are represented only once in the result and have values for all attributes in the result. The attributes that are not union compatible from either relation are kept in the result, and tuples that have no values for these attributes are padded with null values. For example, an OUTER UNION can be applied to two relations whose schemas are STUDENT(Name, SSN, Department, Advisor) and FACULTY(Name, SSN, Department, Rank). The resulting relation schema is R(Name, SSN, Department, Advisor, Rank), and all the tuples from both relations are included in the result. Student tuples will have a null for the Rank attribute, whereas faculty tuples will have a null for the Advisor attribute. A tuple that exists in both will have values for all its attributes.
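SQL has no single OUTER UNION operator, but the effect can be approximated. A sketch, assuming the STUDENT and FACULTY relations above and a dialect that supports FULL OUTER JOIN and COALESCE:

SELECT COALESCE(S.Name, F.Name) Name,
       COALESCE(S.SSN, F.SSN) SSN,
       COALESCE(S.Department, F.Department) Department,
       S.Advisor,     -- null for tuples that exist only in FACULTY
       F.Rank         -- null for tuples that exist only in STUDENT
FROM STUDENT S FULL OUTER JOIN FACULTY F
     ON S.SSN = F.SSN;

Matching on the shared key SSN ensures that a person appearing in both relations is represented only once, with values for all five attributes.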

Another capability that exists in most commercial languages (but not in the basic relational algebra) is that of specifying operations on values after they are extracted from the database. For example, arithmetic operations such as +, -, and * can be applied to numeric values.

FNAME      MINIT   LNAME      DNAME
John       B       Smith      null
Franklin   T       Wong       Research
Alicia     J       Zelaya     null
Jennifer   S       Wallace    Administration
Ramesh     K       Narayan    null
Joyce      A       English    null
Ahmad      V       Jabbar     null
James      E       Borg       Headquarters

Figure 3.12: The result of a LEFT OUTER JOIN operation

Examples of Queries in Relational Algebra

We now give additional examples to illustrate the use of the relational algebra operations. All examples refer to the database of Figure 3.2. In general, the same query can be stated in numerous ways using the various operations. We will state each query in one way and leave it to the reader to come up with equivalent formulations.

Query 1

Retrieve the name and address of all employees who work for the ‘Research’ department.

RESEARCH_DEPT ← σ DNAME='Research' (DEPARTMENT)

RESEARCH_EMPS ← (RESEARCH_DEPT ⋈ DNUMBER=DNO EMPLOYEE)

RESULT ← π FNAME, LNAME, ADDRESS (RESEARCH_EMPS)

This query could be specified in other ways; for example, the order of the JOIN and SELECT operations could be reversed, or the JOIN could be replaced by a NATURAL JOIN (after renaming).

Query 2

For every project located in ‘Stafford’, list the project number, the controlling department number, and the department manager’s last name, address, and birthday.

STAFFORD_PROJS ← σ PLOCATION='Stafford' (PROJECT)

CONTR_DEPT ← (STAFFORD_PROJS ⋈ DNUM=DNUMBER DEPARTMENT)

PROJ_DEPT_MGR ← (CONTR_DEPT ⋈ MGRSSN=SSN EMPLOYEE)

RESULT ← π PNUMBER, DNUM, LNAME, ADDRESS, BDATE (PROJ_DEPT_MGR)

Query 3

Find the names of employees who work on all the projects controlled by department number 5.

DEPT5_PROJS(PNO) ← π PNUMBER (σ DNUM=5 (PROJECT))

EMP_PROJ(SSN, PNO) ← π ESSN, PNO (WORKS_ON)

RESULT_EMP_SSNS ← EMP_PROJ ÷ DEPT5_PROJS

RESULT ← π LNAME, FNAME (RESULT_EMP_SSNS * EMPLOYEE)

Query 4

Make a list of project numbers for projects that involve an employee whose last name is ‘Smith’, either as a worker or as a manager of the department that controls the project.

SMITHS(ESSN) ← π SSN (σ LNAME='Smith' (EMPLOYEE))

SMITH_WORKER_PROJS ← π PNO (WORKS_ON * SMITHS)

MGRS ← π LNAME, DNUMBER (EMPLOYEE ⋈ SSN=MGRSSN DEPARTMENT)

SMITH_MANAGED_DEPTS(DNUM) ← π DNUMBER (σ LNAME='Smith' (MGRS))

SMITH_MGR_PROJS(PNO) ← π PNUMBER (SMITH_MANAGED_DEPTS * PROJECT)

RESULT ← (SMITH_WORKER_PROJS ∪ SMITH_MGR_PROJS)

Query 5

List the names of all employees with two or more dependents.

Strictly speaking, this query cannot be handled in the basic relational algebra. We have to use the AGGREGATE Function operation with the COUNT aggregate function. We assume that dependents of the same employee have distinct DEPENDENT_NAME values.

T1(SSN, NO_OF_DEPS) ← ESSN ℑ COUNT DEPENDENT_NAME (DEPENDENT)

T2 ← σ NO_OF_DEPS≥2 (T1)

RESULT ← π LNAME, FNAME (T2 * EMPLOYEE)

Query 6

Retrieve the names of employees who have no dependents.

ALL_EMPS ← π SSN (EMPLOYEE)

EMPS_WITH_DEPS(SSN) ← π ESSN (DEPENDENT)

EMPS_WITHOUT_DEPS ← (ALL_EMPS − EMPS_WITH_DEPS)

RESULT ← π LNAME, FNAME (EMPS_WITHOUT_DEPS * EMPLOYEE)

Query 7

List the names of managers who have at least one dependent.

MGRS(SSN) ← π MGRSSN (DEPARTMENT)

EMPS_WITH_DEPS(SSN) ← π ESSN (DEPENDENT)

MGRS_WITH_DEPS ← (MGRS ∩ EMPS_WITH_DEPS)

RESULT ← π LNAME, FNAME (MGRS_WITH_DEPS * EMPLOYEE)

As we mentioned earlier, the same query can in general be specified in many different ways. For example, the operations can often be applied in various sequences. In addition, some operations can be used to replace others; for example, the INTERSECTION operation in Query 7 can be replaced by NATURAL JOIN. As an exercise, try to do each of the above example queries using different operations.

Student Activity 3.2

Before reading the next section, answer the following questions:

1. What do you understand by Aggregate functions?

2. Make a list of Aggregate functions.

3. Why do we use group by clause in our query?

4. What do you understand by the division operation? Discuss with an example.


If your answers are correct, then proceed to the next section.


Relational Calculus

Relational calculus is an alternative to relational algebra. In contrast to the algebra, which is procedural, the calculus is nonprocedural, or declarative, in that it allows us to describe the set of answers without being explicit about how they should be computed. Relational calculus has had a big influence on the design of commercial query languages such as SQL and, especially, Query-by-Example (QBE).

The variant of the calculus that we present in detail is called the tuple relational calculus (TRC); variables in TRC take on tuples as values. In another variant, called the domain relational calculus (DRC), the variables range over field values. TRC has had more of an influence on SQL, while DRC has strongly influenced QBE.

Tuple Relational Calculus

A tuple variable is a variable that takes on tuples of a particular relation schema as values. That is, every value assigned to a given tuple variable has the same number and type of fields. A tuple relational calculus query has the form {T | p(T)}, where T is a tuple variable and p(T) denotes a formula that describes T; we will shortly define formulas and queries rigorously. The result of this query is the set of all tuples t for which the formula p(T) evaluates to true when T = t. The language for writing formulas p(T) is thus at the heart of TRC and is essentially a simple subset of first-order logic. As a simple example, consider the following query.

(Q1) Find all sailors with a rating above 7.

{S | S ∈ Sailors ∧ S.rating > 7}

[Instance S3 of the Sailors relation, with fields sid, sname, rating and age]

When this query is evaluated on instance of the Sailors relation, the tuple variable S is instantiated successively with each tuple, and the test S.rating>7 is applied. The answer contains those instances of S that pass this test. On instance S3 of Sailors, the answer contains Sailors tuples with sid 31.

������������ �������

We now define these concepts formally, beginning with the notion of a formula. Let Rel be a relation name, R and S be tuple variables, a an attribute of R, and b an attribute of S. Let op denote an operator in the set {<, >, =, ≤, ≥, ≠}. An atomic formula is one of the following:

• R ∈ Rel

• R.a op S.b

• R.a op constant, or constant op R.a

Page 62: Bca 204 3rd Database

RELATIONAL DATA MANIPULATION 59

A formula is recursively defined to be one of the following, where p and q are themselves formulas, and p(R) denotes a formula in which the variable R appears:

• Any atomic formula

• ¬p, p ∧ q, p ∨ q, or p ⇒ q

• ∃R (p(R)), where R is a tuple variable

• ∀R (p(R)), where R is a tuple variable

In the last two clauses above, the quantifiers ∃ and ∀ are said to bind the variable R. A variable is said to be free in a formula or subformula (a formula contained in a larger formula) if the (sub)formula does not contain an occurrence of a quantifier that binds it.

We observe that every variable in a TRC formula appears in a subformula that is atomic, and every relation schema specifies a domain for each field; this observation ensures that each variable in a TRC formula has a well-defined domain from which values for the variable are drawn. That is, each variable has a well-defined type, in the programming language sense. Informally, an atomic formula R ∈ Rel gives R the type of tuples in Rel, and comparisons such as R.a op S.b and R.a op constant induce type restrictions on the field R.a. If a variable R does not appear in an atomic formula of the form R ∈ Rel (i.e., it appears only in atomic formulas that are comparisons), we will follow the convention that the type of R is a type whose fields include all (and only) fields of R that appear in the formula.

We will not define types of variables formally, but the type of a variable should be clear in most cases, and the important point to note is that comparisons of values having different types should always fail. (In discussions of relational calculus, the simplifying assumption is often made that there is a single domain of constants and that this is the domain associated with each field of each relation.)

A TRC query is defined to be an expression of the form {T | p(T)}, where T is the only free variable in the formula p.

Semantics of TRC Queries

What does a TRC query mean? More precisely, what is the set of answer tuples for a given TRC query? The answer to a TRC query {T | p(T)}, as we noted earlier, is the set of all tuples t for which the formula p(T) evaluates to true when the tuple value t is assigned to the free variable T.

A query is evaluated on a given instance of the database. Let each free variable in a formula F be bound to a tuple value. For the given assignment of tuples to variables, with respect to the given database instance, F evaluates to (or simply ‘is’) true if one of the following holds:

• F is an atomic formula R ∈ Rel, and R is assigned a tuple in the instance of relation Rel.

• F is a comparison R.a op S.b, R.a op constant, or constant op R.a, and the tuples assigned to R and S have field values R.a and S.b that make the comparison true.

• F is of the form ¬p, and p is not true; or of the form p ∧ q, and both p and q are true; or of the form p ∨ q, and one of them is true; or of the form p ⇒ q, and q is true whenever p is true.

• F is of the form ∃R (p(R)), and there is some assignment of tuples to the free variables in p(R), including the variable R, that makes the formula p(R) true.

• F is of the form ∀R (p(R)), and the formula p(R) is true no matter what tuple is assigned to R.

Examples of TRC Queries


We now illustrate the calculus through several examples, using the instances B1 of Boats, R2 of Reserves, and S3 of Sailors as shown below:

[Instances S3 of Sailors (sid, sname, rating, age), B1 of Boats (bid, bname, color) and R2 of Reserves (sid, bid, day)]

We will use parentheses as needed to make our formulas unambiguous. Often, a formula p(R) includes a condition R ∈ Rel, and the meaning of the phrases some tuple R and for all tuples R is intuitive. We will use the notation ∃R ∈ Rel (p(R)) for ∃R (R ∈ Rel ∧ p(R)).

Similarly, we use the notation ∀R ∈ Rel (p(R)) for ∀R (R ∈ Rel ⇒ p(R)).

(Q2) Find the names and ages of sailors with a rating above 7.

{P | ∃S ∈ Sailors (S.rating > 7 ∧ P.name = S.sname ∧ P.age = S.age)}

This query illustrates a useful convention: P is considered to be a tuple variable with exactly two fields, which are called name and age, because these are the only fields of P that are mentioned and P does not range over any of the relations in the query; that is, there is no subformula of the form P ∈ Relname.


The result of this query is a relation with two fields, name and age. The atomic formulas P.name = S.sname and P.age = S.age give values to the fields of an answer tuple P. On instances B1, R2, and S3, the answer is the set of tuples (Lubber, 55.5), (Andy, 25.5), (Rusty, 35.0), (Zorba, 16.0), and (Horatio, 35.0).

(Q3) Find the sailor name, boat id, and reservation date for each reservation.

{P | ∃R ∈ Reserves ∃S ∈ Sailors

(R.sid = S.sid ∧ P.bid = R.bid ∧ P.day = R.day ∧ P.sname = S.sname)}

For each Reserves tuple, we look for a tuple in Sailors with the same sid. Given a pair of such tuples, we construct an answer tuple P with fields sname, bid, and day by copying the corresponding fields from these two tuples. This query illustrates how we can combine values from different relations in each answer tuple. The answer to this query on instances B1, R2, and S3 is shown in figure given below:

sname     bid    day
Dustin    101    10/10/98
Dustin    102    10/10/98
Dustin    103    10/8/98
Dustin    104    10/7/98
Lubber    102    11/10/98
Lubber    103    11/6/98
Lubber    104    11/12/98
Horatio   101    9/5/98
Horatio   102    9/8/98
Horatio   103    9/8/98

(Q4) find the names of sailors who have reserved boat 103.

{P | ∃S ∈ Sailors ∃R ∈ Reserves (R.sid = S.sid ∧ R.bid = 103 ∧ P.sname = S.sname)}

This query can be read as follows: “Retrieve all sailor tuples for which there exists a tuple in Reserves, having the same value in the sid field, and with bid = 103”. That is, for each sailor tuple, we look for a tuple in Reserves that shows that this sailor has reserved boat 103. The answer tuple P contains just one field, sname.

(Q5) Find the names of sailors who have reserved a red boat.

{P | ∃S ∈ Sailors ∃R ∈ Reserves (R.sid = S.sid ∧ P.sname = S.sname

∧ ∃B ∈ Boats (B.bid = R.bid ∧ B.color = 'red'))}

This query can be read as follows: “Retrieve all sailor tuples S for which there exist tuples R in Reserves and B in Boats such that S.sid = R.sid, R.bid = B.bid, and B.color =’red’.” Another way to write this query, which corresponds more closely to this reading, is as follows:

{P | ∃S ∈ Sailors ∃R ∈ Reserves ∃B ∈ Boats

(R.sid = S.sid ∧ B.bid = R.bid ∧ B.color = ‘red’ ∧ P.sname = S.sname)}

(Q6) Find the names of sailors who have reserved at least two boats.


{P | ∃ S ∈Sailors ∃ R1 ∈ Reserves ∃ R2 ∈ Reserves

(S.sid = R1.sid ∧ R1.sid = R2.sid ∧ R1.bid ≠ R2.bid ∧ P.sname = S.sname) }

Contrast this query with the algebra version and see how much simpler the calculus version is. In part, this difference is due to the cumbersome renaming of fields in the algebra version, but the calculus version really is simpler.

(Q7) Find the names of sailors who have reserved all boats.

{P | ∃ S ∈ Sailors ∀B ∈ Boats

(∃R ∈ Reserves (S.sid = R.sid ∧R.bid = B.bid ∧ P.sname = S.sname)) }

This query was expressed using the division operator in relational algebra. Notice how easily it is expressed in the calculus. The calculus query directly reflects how we might express the query in English: “Find sailors S such that for all boats B there is a Reserves tuple showing that sailor S has reserved boat B.”

(Q 8 ) Find sailors who have reserved all red boats.

{ S | S ∈ Sailors ∧ ∀ B ∈ Boats

(B.color = 'red' ⇒ (∃R ∈ Reserves (S.sid = R.sid ∧ R.bid = B.bid)))}

This query can be read as follows: For each candidate (sailor), if a boat is red, the sailor must have reserved it. That is, for a candidate sailor, a boat being red must imply the sailor having reserved it. Observe that since we can return an entire sailor tuple as the answer instead of just the sailor’s name, we have avoided introducing a new free variable (e.g., the variable P in the previous example) to hold the answer values. On instances B1, R2, and S3, the answer contains the Sailors tuples with sid 22 and 31.

We can write this query without using implication, by observing that an expression of the form p ⇒ q is logically equivalent to ¬p ∨ q:

{ S | S ∈ Sailors ∧∀ B ∈Boats

(B.color ≠ 'red' ∨ (∃R ∈ Reserves (S.sid = R.sid ∧ R.bid = B.bid)))}

This query should be read as follows: “Find sailors S such that for all boats B, either the boat is not red or a Reserves tuple shows that sailor S has reserved boat B.”

Domain Relational Calculus

A domain variable is a variable that ranges over the values in the domain of some attribute (e.g., the variable can be assigned an integer if it appears in an attribute whose domain is the set of integers). A DRC query has the form {(x1, x2, …, xn) | p((x1, x2, …, xn))}, where each xi is either a domain variable or a constant and p((x1, x2, …, xn)) denotes a DRC formula whose only free variables are the variables among the xi, 1 ≤ i ≤ n. The result of this query is the set of all tuples (x1, x2, …, xn) for which the formula evaluates to true.

A DRC formula is defined in a manner that is very similar to the definition of a TRC formula. The main difference is that the variables are now domain variables. Let op denote an operator in the set {<, >, =, ≤, ≥, ≠} and let X and Y be domain variables. An atomic formula in DRC is one of the following:

• (x1, x2, …, xn) ∈ Rel, where Rel is a relation with n attributes; each xi, 1 ≤ i ≤ n, is either a variable or a constant.

• X op Y

• X op constant, or constant op X


A formula is recursively defined to be one of the following, where p and q are themselves formulas, and p(X) denotes a formula in which the variable X appears:

• Any atomic formula

• ¬p, p ∧ q, p ∨ q, or p ⇒ q

• ∃X (p(X)), where X is a domain variable

• ∀X (p(X)), where X is a domain variable

The reader is invited to compare this definition with the definition of TRC formulas and see how closely these two definitions correspond. We will not define the semantics of DRC formulas formally; this is left as an exercise for the reader.

Examples of DRC Queries

We now illustrate DRC through several examples. The reader is invited to compare these with the TRC versions.

(Q2) Find all sailors with a rating above 7.

{ (I,N,T,A) | (I, N, T, A) ∈ Sailors ∧ T > 7 }

This differs from the TRC version in giving each attribute a (variable) name. The condition (I, N, T, A) ∈ Sailors ensures that the domain variables I, N, T, and A are restricted to be fields of the same tuple. In comparison with the TRC query, we can say T>7 instead of S.rating > 7, but we must specify the tuple (I, N,T, A) in the result, rather than just S.

(Q4) Find the names of sailors who have reserved boat 103.

{(N) | ∃I, T, A ((I, N, T, A) ∈ Sailors

∧ ∃Ir, Br, D ((Ir, Br, D) ∈ Reserves ∧ Ir = I ∧ Br = 103))}

Notice that only the sname field is retained in the answer and that only N is a free variable. We use the notation ∃Ir, Br, D (…) as a shorthand for ∃Ir (∃Br (∃D (…))).

Very often, all the quantified variables appear in a single relation, as in this example. An even more compact notation in this case is ∃(Ir, Br, D) ∈ Reserves. With this notation, which we will use henceforth, the above query would be as follows:

{(N) | ∃I, T, A ((I, N, T, A) ∈ Sailors

∧ ∃(Ir, Br, D) ∈ Reserves (Ir = I ∧ Br = 103 )) }

The comparison with the corresponding TRC formula should now be straightforward. This query can also be written as follows; notice the repetition of variable I and the use of the constant 103:

{ (N) | ∃ I, T, A((I,N,T,A) ∈Sailors

∧ ∃D ((I, 103, D ) ∈Reserves))}

(Q5) Find the names of sailors who have reserved a red boat.

{ (N) | ∃ I,T, A((I, N, T, A) ∈ Sailors

∧ ∃(I, Br, D) ∈ Reserves ∧ ∃(Br, BN, 'red') ∈ Boats)}

(Q6) Find the names of sailors who have reserved at least two boats.

{(N) | ∃I, T, A ((I, N, T, A) ∈ Sailors ∧ ∃Br1, Br2, D1, D2 ((I, Br1, D1) ∈ Reserves ∧ (I, Br2, D2) ∈ Reserves ∧ Br1 ≠ Br2))}


Notice how the repeated use of variable I ensures that the same sailor has reserved both the boats in question.

(Q7) Find the names of sailors who have reserved all boats.

{(N) | ∃I, T, A ((I, N, T, A) ∈ Sailors ∧ ∀B, BN, C (¬((B, BN, C) ∈ Boats) ∨ (∃(Ir, Br, D) ∈ Reserves (I = Ir ∧ Br = B))))}

This query can be read as follows: "Find all values of N such that there is some tuple (I, N, T, A) in Sailors satisfying the following condition: for every (B, BN, C), either this is not a tuple in Boats or there is some tuple (Ir, Br, D) in Reserves that proves that sailor I has reserved boat B." The ∀ quantifier allows the domain variables B, BN, and C to range over all values in their respective attribute domains, and the pattern '¬((B, BN, C) ∈ Boats) ∨' is necessary to restrict attention to those values that appear in tuples of Boats. This pattern is common in DRC formulas, and the notation ∀(B, BN, C) ∈ Boats can be used as shorthand instead. This is similar to the notation introduced earlier for ∃. With this notation the query would be written as follows:

{ (N) | ∃ I, T, A((I,N,T,A) ∈ Sailors ∧ ∀ (B,BN,C ) ∈ Boats

(∃( Ir, Br, D) ∈ Reserves ( I = Ir ∧ Br = B)))}

(Q8) Find sailors who have reserved all red boats.

{(I, N, T, A) | (I, N, T, A) ∈ Sailors ∧ ∀(B, BN, C) ∈ Boats

(C = 'red' ⇒ ∃(Ir, Br, D) ∈ Reserves (I = Ir ∧ Br = B))}

Here, we find all sailors such that for every red boat there is a tuple in Reserves that shows the sailor has reserved it.

Although SQL is called a "query language", it contains many other capabilities besides querying a database. It includes features for defining the structure of the data, for modifying the data in the database, and for specifying security constraints. SQL has clearly established itself as the standard relational-database language.

Basic Structure

A relational database consists of a collection of relations, each of which is assigned a unique name. Each relation has a structure.

Student Activity 3.3

Before reading the next section, answer the following questions.

1. How does Relational Calculus differ from Relational Algebra?

2. What do you understand by TRC queries?

3. What do you understand by DRC queries?

If your answers are correct, then proceed to the next section.


SQL

The basic structure of an SQL expression consists of three clauses: select, from, and where.


The select clause corresponds to the projection operation of the relational algebra. It is used to list the attributes desired in the result of a query.

The from clause corresponds to the Cartesian product operation of the relational algebra. It lists the relations to be scanned in the evaluation of the expression.

The where clause corresponds to the selection predicate of the relational algebra. It consists of a predicate involving attributes of the relations that appear in the from clause.

A typical SQL query has the form

select A1,A2,..., An

from r1,r2,....,rm

where P

Each Ai represents an attribute, and each ri a relation, P represent a predicate.
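For instance, a query that fits this template, assuming the EMP relation with attributes ENAME, SAL and DEPTNO used in the examples later in this unit:

select ename, sal
from emp
where deptno = 10;

Here A1 = ename and A2 = sal, r1 = emp, and P is the predicate deptno = 10.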

SQL * PLUS: GETTING STARTED

Update the std_fee of the student tuple with std_id = 1 to 3000.50.

In SQL

Update student set std_fee = 3000.50 where std_id = 1

Domain Constraint: Data types help determine what values are valid for a particular column.

Referential constraint: It refers to the maintenance of relationships of data rows in multiple tables.

Entity Constraint: It means that we can uniquely identify every row in a table.

Student Activity 3.4

Before reading the next section, answer the following questions:

1. What are the various types of the update operations on relations?

2. Which operation do we have to use to change an existing value in a table?

If your answers are correct, then proceed to the next section.


Structured Query Language (SQL)

Structured Query Language (SQL), pronounced “sequel”, is the set of commands that all programs and users must use to access data within the Oracle7 database. Application programmes and Oracle7 tools often allow users to access the database without directly using SQL, but these applications in turn must use SQL when executing the user’s request.

Historically, the paper "A Relational Model of Data for Large Shared Data Banks," by Dr E F Codd, was published in June 1970 in the Association for Computing Machinery (ACM) journal, Communications of the ACM. Codd's model is now accepted as the definitive model for relational database management systems (RDBMS). The language, Structured English Query Language (SEQUEL), was developed by IBM Corporation to use Codd's model. SEQUEL later became SQL. In 1979, Relational Software, Inc. (now Oracle Corporation) introduced the first commercially available implementation of SQL. Today, SQL is accepted as the standard RDBMS language. The latest SQL standard published by ANSI and ISO is often called SQL-92 (and sometimes SQL2).


Benefits of SQL

This section describes many reasons for SQL's widespread acceptance by relational database vendors as well as end users. The strengths of SQL benefit all ranges of users, including application programmers, database administrators, management, and end users.

Non-procedural Language

SQL is a non-procedural language because it:

• Processes sets of records rather than just one at a time;

• Provides automatic navigation to the data.

SQL is used by many types of users, including:

• System Administrators

• Database Administrators

• Security Administrators

• Application Programmers

• Decision Support System personnel

• Many other types of end users

SQL provides easy-to-learn commands that are both consistent and applicable to all users. The basic SQL commands can be learned in a few hours and even the most advanced commands can be mastered in a few days.

Unified Language

SQL provides commands for a variety of tasks including:

Querying data;

Inserting, updating and deleting rows in a table;

Creating, replacing, altering and dropping objects;

Controlling access to the database and its object;

Guaranteeing database consistency and integrity.

SQL unifies all the above tasks in one consistent language.

Common Language for All Relational Databases

Because all major relational database management systems support SQL, you can transfer all skills you have gained with SQL from one database to another. In addition, since all programmes written in SQL are portable, they can often be moved from one database to another with very little modification.

Embedded SQL

Embedded SQL refers to the use of standard SQL commands embedded within a procedural programming language. Embedded SQL is a collection of these commands:

All SQL commands, such as SELECT and INSERT, available with interactive SQL tools;

Flow control commands, such as PREPARE and OPEN, which integrate the standard SQL commands with a procedural programming language.


The Oracle precompilers support embedded SQL. The Oracle precompilers interpret embedded SQL statements and translate them into statements that can be understood by procedural language compilers. Each of these Oracle precompilers translates embedded SQL programmes into a different procedural language:

The Pro*Ada precompiler

The Pro*C/C++ precompiler

The Pro*COBOL precompiler

The Pro*FORTRAN precompiler

The Pro*Pascal precompiler

The Pro*PL/I precompiler

Database Objects

Oracle supports two types of data objects:

Schema Objects: A schema is a collection of logical structures of data, or schema objects. A schema is owned by a database user and has the same name as that user. Each user owns a single schema. Schema objects can be created and manipulated with SQL and include the following types of objects:

clusters, database links, database triggers, indexes, packages, sequences, snapshots, snapshot logs, stored functions, stored procedures, synonyms, tables, and views.

Non-schema Objects: Other types of objects are also stored in the database and can be created and manipulated with SQL, but are not contained in a schema. These include profiles, roles, rollback segments, tablespaces, and users.

Object Naming Rules

The following rules apply when naming objects:

• Names must be from 1 to 30 characters long with the following exceptions:

• Names of databases are limited to 8 characters. Names of database links can be as long as 128 characters.

• Names cannot contain quotation marks.

• Names are not case-sensitive.



• A name must begin with an alphabetic character from your database character set unless surrounded by double quotation marks.

• Names can only contain alphanumeric characters from your database character set and the characters _, $ and #. You are strongly discouraged from using $ and #.

• If your database character set contains multi-byte characters, it is recommended that each name for a user or a role contain at least one single-byte character.

• Names of database links can also contain periods (.) and "at" signs (@).

• Columns in the same table or view cannot have the same name. However, columns in different tables or views can have the same name.

• Procedures or functions contained in the same package can have the same name, provided that their arguments are not of the same number and data types. Creating multiple procedures or functions with the same name in the same package with different arguments is called overloading the procedure or function.
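As an illustration of overloading, the following package specification declares two procedures with the same name but different argument data types; the package and parameter names used here (emp_api, log_hire) are hypothetical:

CREATE PACKAGE emp_api AS
    PROCEDURE log_hire (p_empno NUMBER);      -- called with an employee number
    PROCEDURE log_hire (p_ename VARCHAR2);    -- called with an employee name
END emp_api;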

Object Naming Guidelines

There are several helpful guidelines for naming objects and their parts:

• Use full, descriptive, pronounceable names (or well-known abbreviations).

• Use consistent naming rules.

• Use the same name to describe the same entity or attributes across tables.

• When naming objects, balance the objective of keeping names short and easy to use with the objective of making names as long and descriptive as possible. When in doubt, choose the more descriptive name because many people may use the objects in the database over a period of time. Your counterpart ten years from now may have difficulty understanding a database with names like PMDD instead of PAYMENT_DUE_DATE.

• Using consistent naming rules helps users to understand the part that each table plays in your application. One such rule might be beginning the names of all tables belonging to the FINANCE application with FIN_.

• Use the same names to describe the same things across tables. For example, the department number columns of the EMP and DEPT tables should both be named DEPTNO.

Data Types

Internal Data Type: Description

VARCHAR2(size): Variable-length character string having a maximum length of size bytes. Maximum size is 2000 and minimum is 1. You must specify size for a VARCHAR2.

NUMBER(p,s): Number having precision p and scale s. The precision p can range from 1 to 38. The scale s can range from -84 to 127.

LONG: Character data of variable length up to 2 gigabytes, or 2^31 - 1 bytes.

DATE: Valid date range from January 1, 4712 BC to December 31, 4712 AD.

RAW(size): Raw binary data of length size bytes. Maximum size is 255 bytes. You must specify size for a RAW value.

LONG RAW: Raw binary data of variable length up to 2 gigabytes.

ROWID: Hexadecimal string representing the unique address of a row in its table. This data type is primarily for values returned by the ROWID pseudocolumn.

CHAR(size): Fixed-length character data of length size bytes. Maximum size is 255. Default and minimum size is 1 byte.

MLSLABEL: Binary format of an operating system label used on a secure operating system. This data type is used with Trusted Oracle7.

Data Type Summary

Character Data Types

Character data types are used to manipulate words and free-form text. These data types are used to store character (alphanumeric) data in the database character set. They are less restrictive than other data types and consequently have fewer properties. For example, character columns can store all alphanumeric values, but NUMBER columns can only store numeric values. Two data types are used for character data: CHAR and VARCHAR2.

CHAR Data Type

The CHAR data type specifies a fixed-length character string. When you create a table with a CHAR column, you can supply the column length in bytes. Oracle7 subsequently ensures that all values stored in that column have this length. If you insert a value that is shorter than the column length, Oracle7 blank-pads the value to the column length. If you try to insert a value that is too long for the column, Oracle7 returns an error. The default for a CHAR column is 1 character and the maximum allowed is 255 characters. A zero-length string can be inserted into a CHAR column, but the column is blank-padded to 1 character when used in comparisons.

VARCHAR2 Data Type

The VARCHAR2 data type specifies a variable length character string. When you create a VARCHAR2 column, you can supply the maximum number of bytes of data that it can hold. Oracle7 subsequently stores each value in the column exactly as you specify it, provided it does not exceed the column’s maximum length.
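The difference between the two character types can be seen in a small sketch; the table and column names here are hypothetical:

CREATE TABLE name_test (
    fixed_name    CHAR(10),       -- 'Ann' is stored as 'Ann       ' (blank-padded to 10 bytes)
    variable_name VARCHAR2(10)    -- 'Ann' is stored as 'Ann' (3 bytes)
);

INSERT INTO name_test VALUES ('Ann', 'Ann');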

VARCHAR Data Type

The VARCHAR data type is currently synonymous with the VARCHAR2 data type. It is recommended that you use VARCHAR2 rather than VARCHAR. In a future version of Oracle7, VARCHAR might become a separate data type used for variable-length character strings with different comparison semantics.

NUMBER Data Type

The NUMBER data type is used to store zero, positive and negative fixed and floating point numbers with magnitudes between 1.0 x 10^-130 and 9.99… x 10^125 (38 nines followed by 88 zeros), with 38 digits of precision.

DATE Data Type

The DATE data type is used to store date and time information. Although date and time information can be represented in both CHAR and NUMBER data types, the DATE data type has special associated properties.

For each DATE value the following information is stored:

Century, year, month, day, hour, minute, second.

To specify a date value, you must convert a character or numeric value to a date value with the TO_DATE function. Oracle7 automatically converts character values that are in the default date format into date values


when they are used in date expressions. The default date format is specified by the initialization parameter NLS_DATE_FORMAT and is a string such as 'DD-MON-YY'. This example date format includes a two-digit number for the day of the month, an abbreviation of the month name and the last two digits of the year.

If you specify a date value without a time component, the default time is 12:00 a.m. (midnight). If you specify a date value without a date, the default date is the first day of the current month.

The date function SYSDATE returns the current date and time.
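A few illustrative statements, using Oracle's DUAL dummy table; the date literal is an assumption chosen to match the DD-MON-YY format:

SELECT SYSDATE FROM DUAL;                                -- current date and time
SELECT TO_DATE('01-JAN-99', 'DD-MON-YY') FROM DUAL;      -- character value converted to a date
SELECT TO_DATE('01-JAN-99', 'DD-MON-YY') + 7 FROM DUAL;  -- date arithmetic: one week later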

RAW and LONG RAW Data Types

The RAW and LONG RAW data types are used for data that is not to be interpreted (not converted when moving data between different systems) by Oracle. These data types are intended for binary data or byte strings. For example, LONG RAW can be used to store graphics, sound, documents or areas of binary data; the interpretation is dependent on the use.

ROWID Data Type

Each row in the database has an address. You can examine a row’s address by querying the pseudo-column ROWID. Values of this pseudo-column are hexadecimal strings representing the address of each row. These strings have the data type ROWID. You can also create tables and clusters that contain actual columns having the ROWID data type. Oracle7 does not guarantee that the values of such columns are valid ROWIDs.

MLSLABEL Data Type

The MLSLABEL data type is used to store the binary format of a label used on a secure operating system. Labels are used by Trusted Oracle7 to mediate access to information. You can also define columns with this data type if you are using the standard Oracle7 server.

Nulls

If a column in a row has no value, then the column is said to be null, or to contain a null. Nulls can appear in columns of any data type that are not restricted by NOT NULL or PRIMARY KEY integrity constraints. Use a null when the actual value is not known or when a value would not be meaningful. Oracle7 currently treats a character value with a length of zero as null. However, this may not continue to be true in future versions of Oracle7. Do not use null to represent a value of zero, because they are not equivalent. Any arithmetic expression containing a null always evaluates to null. For example, null added to 10 is null. In fact, all operators (except concatenation) return null when given a null operand.

Tables

All data in a relational database is stored in tables. Every table has a table name and a set of columns and rows in which the data is stored. Each column is given a column name, a data type (defining characteristics of the data to be entered in the column). Usually in a relational database, some of the columns in different tables contain the same information. In this way, the tables can refer to one another.

For example, you might want to create a database containing information about the products your company manufactures. In a relational database, you can create several tables to store different pieces of information about your products, such as an inventory table, a manufacturing table and a shipping table. Each table would include columns to store data appropriate to the table (for example, the inventory table would include a column showing how much stock is on hand) and a column for the product’s part number.

Views


A view is a customized presentation of the data from one or more tables. Views derive their data from the tables on which they are based, which are known as base tables. All operations performed on a view actually affect the base tables of the view. You can use views for several purposes. One is to give an additional level of table security by restricting access to a predetermined set of table rows and columns; for example, you can create a view of a table that does not include sensitive data (such as salary information).

To hide data complexity: Oracle7 databases usually include many tables and by creating a view combining information from two or more tables, you make it easier for other users to access information from your database. For example, you might have a view that is a combination of your Employee table and Department table. A user looking at this view, which you have called emp_dept, only has to go to one place to get information, instead of having to access the two tables that make up this view.

To present the data in a different perspective from that of the base table: View provides a means to rename columns without affecting the base table. For example, to store complex queries, a query might perform extensive calculations with table information. By saving this query as a view, the calculations are performed only when the view is queried.
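A sketch of the emp_dept view mentioned above, assuming EMP and DEPT tables joined on a common DEPTNO column:

CREATE VIEW emp_dept AS
    SELECT e.empno, e.ename, d.dname
    FROM emp e, dept d
    WHERE e.deptno = d.deptno;

Users can then query the view exactly as they would a table, for example SELECT * FROM emp_dept, without needing to know that two base tables are involved.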

"���(��

An index is used to quickly retrieve information from a database. Just as a book's index helps you locate specific information faster, a database index provides faster access to table data. Indexing creates an index file consisting of a list of records in a logical record order, along with their corresponding physical position in the table. You can use indexes to rapidly locate and display records, which is especially important with large tables, or with databases composed of many tables.

Indexes are created on one or more columns of a table. Once created, an index is automatically maintained and used by the Oracle7 Database. Changes to table data (such as adding new rows, or deleting rows) are automatically incorporated into all relevant indexes.

To understand how an index works, suppose you have created an employee table containing the first name, last name and employee ID number of hundreds of employees, and that you entered the name of each employee into the table as they were hired. Now, suppose you want to locate a particular record in the table. Because you entered information about each employee in no particular order, the DBMS must do a great deal of database searching to find the record.

If you create an index using the LAST_NAME column of your employee table, the DBMS has to do much less searching and can return the results of a query very quickly.
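A minimal sketch of such an index; the index name is hypothetical:

CREATE INDEX emp_last_name_idx
ON employee (last_name);

From this point on, queries such as SELECT * FROM employee WHERE last_name = 'SMITH' can use the index instead of scanning the whole table, and the index is maintained automatically as rows are inserted, updated or deleted.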

The tables in the following sections provide a functional summary of SQL commands and are divided into these categories:

• Data Definition Language commands

• Data Manipulation Language commands

• Transaction Control commands

• Session Control commands

• System Control commands

• Embedded SQL commands

Student Activity 3.5

Before reading the next section, answer the following questions:


1. What do you understand by SQL?

2. What is a view in a DBMS?

3. Why do we use indexes?

If your answers are correct, then proceed to the next section.

Data Definition Language Commands

Data Definition Language (DDL) commands allow you to perform these tasks:

• Create, alter and drop objects;

• Grant and revoke privileges and roles;

• Analyze information on a table, index, or cluster;

• Establish auditing options.

DDL Commands and the Data Dictionary

The CREATE, ALTER and DROP commands require exclusive access to the object being acted upon. For example, an ALTER TABLE command fails if another user has an open transaction on the specified table.

The GRANT, REVOKE, ANALYZE, AUDIT and COMMENT commands do not require exclusive access to the object being acted upon. For example, you can analyze a table while other users are updating the table.

The following Table shows the Data Definition Language Commands.

Table 3.2: Data Definition Language Commands

ALTER CLUSTER: To change the storage characteristics of a cluster, or to allocate an extent for a cluster.

ALTER DATABASE: To open or mount the database; to convert an Oracle Version 6 data dictionary when migrating to Oracle7, or to prepare to downgrade to an earlier release of Oracle7; to choose archivelog or noarchivelog mode; to perform media recovery; to add, drop or clear redo log file groups and members; to rename a data file or redo log file member; to back up the current control file; to back up SQL commands (that can be used to re-create the database) in the trace file; to create, resize or replace data files; to enable or disable auto-extending of data files; to take a data file online or offline; to change the database's global name; to change the MAC mode.

ALTER FUNCTION: To recompile a stored function.

ALTER INDEX: To redefine an index's future storage allocation.

ALTER PACKAGE: To recompile a stored package.

ALTER PROCEDURE: To recompile a stored procedure.

ALTER PROFILE: To add or remove a resource limit to or from a profile.

ALTER RESOURCE COST: To specify a formula to calculate the total cost of resources used by a session.

ALTER ROLE: To change the authorization needed to access a role.

ALTER ROLLBACK SEGMENT: To change a rollback segment's storage characteristics.

ALTER SNAPSHOT LOG: To change a snapshot log's storage characteristics.

ALTER TABLE: To add a column or integrity constraint to a table; to redefine a column; to change a table's storage characteristics; to enable, disable or drop an integrity constraint; to enable or disable table locks on a table; to enable or disable all triggers on a table; to allocate an extent for the table; to allow or disallow writing to a table; to modify the degree of parallelism for a table.

ALTER TABLESPACE: To add or rename data files; to change storage characteristics; to take a tablespace online or offline; to begin or end a backup; to allow or disallow writing to a tablespace.

ALTER TRIGGER: To enable or disable a database trigger.

ALTER USER: To change a user's password, default tablespace, temporary tablespace, tablespace quotas, profile, or default roles.

ALTER VIEW: To recompile a view.

ANALYZE: To collect performance statistics, validate structure, or identify chained rows for a table, cluster, or index.

AUDIT: To choose auditing for specified SQL commands or operations on schema objects.

COMMENT: To add a comment about a table, view, snapshot, or column to the data dictionary.

CREATE CONTROL FILE: To recreate a control file.

CREATE DATABASE: To create a database.

CREATE DATABASE LINK: To create a link to a remote database.

CREATE FUNCTION: To create a stored function.

CREATE INDEX: To create an index for a table or cluster.

CREATE PACKAGE: To create the specification of a stored package.

CREATE PACKAGE BODY: To create the body of a stored package.

CREATE PROCEDURE: To create a stored procedure.

CREATE PROFILE: To create a profile and specify its resource limits.

CREATE ROLE: To create a role.

CREATE ROLLBACK SEGMENT: To create a rollback segment.

CREATE SCHEMA: To issue multiple CREATE TABLE, CREATE VIEW and GRANT statements in a single transaction.

CREATE SEQUENCE: To create a sequence for generating sequential values.

CREATE SNAPSHOT: To create a snapshot of data from one or more remote master tables.

CREATE SNAPSHOT LOG: To create a snapshot log containing changes made to the master table of a snapshot.

CREATE SYNONYM: To create a synonym for a schema object.

CREATE TABLE: To create a table, defining its columns, integrity constraints and storage allocation.

CREATE TABLESPACE: To create a place in the database for storage of schema objects, rollback segments and temporary segments, naming the data files to comprise the tablespace.

CREATE TRIGGER: To create a database trigger.

CREATE USER: To create a database user.

CREATE VIEW: To define a view of one or more tables or views.

DROP CLUSTER: To remove a cluster from the database.

DROP DATABASE LINK: To remove a database link.

DROP FUNCTION: To remove a stored function from the database.

DROP INDEX: To remove an index from the database.

DROP PACKAGE: To remove a stored package from the database.

DROP PROCEDURE: To remove a stored procedure from the database.

DROP PROFILE: To remove a profile from the database.

DROP ROLE: To remove a role from the database.

DROP SEQUENCE: To remove a sequence from the database.

DROP SNAPSHOT: To remove a snapshot from the database.

DROP SNAPSHOT LOG: To remove a snapshot log from the database.

DROP SYNONYM: To remove a synonym from the database.

DROP TABLE: To remove a table from the database.

DROP TABLESPACE: To remove a tablespace from the database.

DROP TRIGGER: To remove a trigger from the database.

DROP USER: To remove a user and the objects in the user's schema from the database.

DROP VIEW: To remove a view from the database.

GRANT: To grant system privileges, roles and object privileges to users and roles.

NOAUDIT: To disable auditing by reversing, partially or completely, the effect of a prior AUDIT statement.

RENAME: To change the name of a schema object.

REVOKE: To revoke system privileges, roles and object privileges from users and roles.

TRUNCATE: To remove all rows from a table or cluster and free the space that the rows used.
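A short sequence showing three of these DDL commands together; the table used here is hypothetical and is dropped at the end:

CREATE TABLE dept_tmp (
    deptno NUMBER(2),
    dname  VARCHAR2(14)
);

ALTER TABLE dept_tmp ADD (loc VARCHAR2(13));    -- add a column to the existing table

DROP TABLE dept_tmp;                            -- remove the table from the database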

Student Activity 3.6

Before reading the next section, answer the following questions:

1. What do you understand by DDL?

2. Make a list of commands used in DDL.

If your answers are correct, then proceed to the next section.

Data Manipulation Language Commands

Data Manipulation Language (DML) commands query and manipulate data in existing schema objects. These commands do not implicitly commit the current transaction.

Table 3.3: Data Manipulation Language Commands

DELETE: To remove rows from a table.

EXPLAIN PLAN: To return the execution plan for a SQL statement.

INSERT: To add new rows to a table.

LOCK TABLE: To lock a table or view, limiting access to it by other users.

SELECT: To select data in rows and columns from one or more tables.

UPDATE: To change data in a table.

Transaction Control Commands

Transaction Control commands manage changes made by Data Manipulation Language commands.

Table 3.4: Transaction Control Commands

COMMIT: To make permanent the changes made by statements issued since the beginning of a transaction.

ROLLBACK: To undo all changes since the beginning of a transaction or since a savepoint.

SAVEPOINT: To establish a point back to which you may roll.

SET TRANSACTION: To establish properties for the current transaction.

SQL Command Syntax and Description

The description and syntax of some of the primary commands used in the SQL is explained next.

Writing SQL Commands

When writing SQL commands, it is important to remember a few simple rules and guidelines in order to construct valid statements that are easy to read and edit:

• SQL commands may be on one or many lines.

• Clauses are usually placed on separate lines.

• Tabulation can be used.

• Command words cannot be split across lines.

• SQL commands are not case sensitive (unless indicated otherwise).

• An SQL command is entered at the SQL prompt and subsequent lines are numbered. This is called the SQL buffer.

• Only one statement can be current at any time within the buffer and it can be run in a number of ways:

Place a semi-colon (;) at the end of the last clause.

Place a semi-colon or forward slash on the last line in the buffer.

Place a forward slash at the SQL prompt.

Issue a RUN command at the SQL prompt.

Any one of the following statements is valid:

SELECT * FROM EMP;

SELECT
*
FROM EMP
;

SELECT * FROM EMP;

Here, SQL commands are split into clauses in the interests of clarity.

DESCRIBE

PURPOSE: To view the structure of a table.

SYNTAX:

DESCRIBE <tablename>

E.g.

DESCRIBE emp

SELECT

PURPOSE: Used to extract information from a table.

SYNTAX:

SELECT column1, column2
FROM tablename
[WHERE condition]

SELECT * FROM tablename

E.g.

SELECT empno, ename, job, salary
FROM emp
WHERE ename = 'KAMAL'

Column Expressions

SELECT empno, ename, job, salary*1000 "NET SALARY"
FROM emp
WHERE salary > 1000;

The Basic Query Block

The SELECT statement retrieves information from the database, implementing all the operators of Relational Algebra.

In its simplest form, it must include:

1. A SELECT clause, which lists the columns to be displayed i.e. it is essentially a PROJECTION.

2. A FROM clause, which specifies the table involved.

To list all department numbers, employee names and manager numbers in the EMP table you enter the following:


SQL> SELECT DEPTNO, ENAME, MGR FROM EMP;

DEPTNO   ENAME    MGR
20       SMITH    7902
30       ALLEN    7698
30       WARD     7698
20       JONES    7839
30       MARTIN   7698
30       BLAKE    7839
10       CLARK    7839
20       SCOTT    7566
10       KING
30       TURNER   7698
20       ADAMS    7788
30       JAMES    7698
20       FORD     7566
10       MILLER   7782

Note that column names are separated by a comma.

It is possible to select all columns from the table by specifying an * (asterisk) after the SELECT command word.

SQL> SELECT * FROM EMP;

EMPNO  ENAME   JOB        MGR   HIREDATE    SAL       COMM      DEPTNO
7369   SMITH   CLERK      7902  17-DEC-80   800.00              20
7499   ALLEN   SALESMAN   7698  20-FEB-81   1,600.00  300.00    30
7521   WARD    SALESMAN   7698  22-FEB-81   1,250.00  500.00    30
7566   JONES   MANAGER    7839  02-APR-81   2,975.00            20
7654   MARTIN  SALESMAN   7698  28-SEP-81   1,250.00  1,400.00  30
7698   BLAKE   MANAGER    7839  01-MAY-81   2,850.00            30
7782   CLARK   MANAGER    7839  09-JUN-81   2,450.00            10
7788   SCOTT   ANALYST    7566  09-DEC-82   3,000.00            20
7839   KING    PRESIDENT        17-NOV-81   5,000.00            10
7844   TURNER  SALESMAN   7698  08-SEP-81   1,500.00  0.00      30
7876   ADAMS   CLERK      7788  12-JAN-83   1,100.00            20
7900   JAMES   CLERK      7698  03-DEC-81   950.00              30
7902   FORD    ANALYST    7566  03-DEC-81   3,000.00            20
7934   MILLER  CLERK      7782  23-JAN-82   1,300.00            10

Other Items in the SELECT Clause

It is possible to include other items in the SELECT Clause.

• Arithmetic expressions

• Column aliases


• Concatenated columns

• Literals

All these options allow the user to query data, manipulate it for query purposes; for example, performing calculations, joining columns together, or displaying literal text strings.

Arithmetic Expressions

An expression is a combination of one or more values, operators and functions, which evaluate to a value.

Arithmetic expressions may contain column names, constant numeric values and the arithmetic operators:

Operator   Description
+          add
-          subtract
*          multiply
/          divide

SELECT ename, sal*12, comm
FROM emp;

If your arithmetic expression contains more than one operator, the priority is * and / first, then + and - second (left to right if there are several operators with the same priority).

In the following example, the multiplication (250*12) is evaluated first; then the salary value is added to the result of the multiplication (3000). So for Smith’s row: 800+3000=3800

SELECT ename, sal+250*12 FROM emp;

Parentheses may be used to specify the order in which operators are to be executed; if, for example, addition is required before multiplication:

SELECT ename, (sal+250)*12 FROM emp;

�� �# ��� ����

When displaying the result of a query, SQL*Plus normally uses the selected column’s name as the heading. In many cases it may be cryptic or meaningless, You can change a column’s heading by using an Alias.

A column alias gives a column an alternative heading on output. Specify the alias after the column in the select list. By default, alias headings will be forced to uppercase and cannot contain blank spaces, unless the alias in enclosed in double quotes (“ “).

To display the column heading ANNSAL for annual salary instead of SAL*12, use a column alias:

SELECT ename, sal*12 ANNSAL, comm FROM emp;

Once defined, an alias can be used with SQL*Plus commands.

Note: Within a SQL statement, a column alias can only be used in the ORDER BY clause.

Concatenated Columns

The concatenation operator (||) allows columns to be linked to other columns, arithmetic expressions or constant values to create a character expression. Columns on either side of the operator are combined to make one single column.

To combine EMPNO and ENAME and give the alias EMPLOYEE to the expression, enter:�


SELECT EMPNO||ENAME EMPLOYEE FROM EMP;

EMPLOYEE
7369SMITH
7499ALLEN
7521WARD
7566JONES
7654MARTIN
7698BLAKE
7782CLARK
7788SCOTT
7839KING
7844TURNER
7876ADAMS
7900JAMES
7902FORD
7934MILLER

Literals

A literal is any character, expression or number included in the SELECT list which is not a column name or a column alias.

A literal in the SELECT list is output for each row returned. Literal strings of free-format text can be included in the query result and are treated like a column in the select list.

The following statement contains literals selected with concatenation and a column alias:

SELECT EMPNO||'-'||ENAME EMPLOYEE, 'WORKS IN DEPARTMENT', DEPTNO FROM EMP;

EMPLOYEE     'WORKS IN DEPARTMENT'  DEPTNO
7369-SMITH   WORKS IN DEPARTMENT    20
7499-ALLEN   WORKS IN DEPARTMENT    30
7521-WARD    WORKS IN DEPARTMENT    30
7566-JONES   WORKS IN DEPARTMENT    20
7654-MARTIN  WORKS IN DEPARTMENT    30
7698-BLAKE   WORKS IN DEPARTMENT    30
7782-CLARK   WORKS IN DEPARTMENT    10
7788-SCOTT   WORKS IN DEPARTMENT    20
7839-KING    WORKS IN DEPARTMENT    10
7844-TURNER  WORKS IN DEPARTMENT    30
7876-ADAMS   WORKS IN DEPARTMENT    20
7900-JAMES   WORKS IN DEPARTMENT    30
7902-FORD    WORKS IN DEPARTMENT    20
7934-MILLER  WORKS IN DEPARTMENT    10

Handling Null Values

If a row lacks a data value for a particular column, that value is said to be null.


A null value is a value which is either unavailable, unassigned, unknown or inapplicable. A null value is not the same as zero. Zero is a number. Null values take up one byte of internal ‘storage’ overhead.

Null Values in Expressions

If any column value in an expression is null, the result is null. In the following statement, only Salesmen have a remuneration result:

SELECT ENAME, SAL*12+COMM ANNUAL_SAL FROM EMP;

ENAME   ANNUAL_SAL
SMITH
ALLEN   19500
WARD    15500
JONES
MARTIN  16400
BLAKE
CLARK
SCOTT
KING
TURNER  18000
ADAMS
JAMES
FORD
MILLER

In order to achieve a result for all employees, it is necessary to convert the null value to a number. We use the NVL function to convert a null value to a non-null value.

Use the NVL function to convert the null values from the previous statement to zero.

SELECT ENAME, SAL*12+NVL(COMM,0) ANNUAL_SAL FROM EMP;

ENAME   ANNUAL_SAL
SMITH   9600
ALLEN   19500
WARD    15500
JONES   35700
MARTIN  16400
BLAKE   34200
CLARK   29400
SCOTT   36000
KING    60000
TURNER  18000
ADAMS   13200
JAMES   11400
FORD    36000
MILLER  15600

NVL expects two arguments:

1. an expression

2. a non-null value

Note that you can use the NVL function to convert a null number, date or even character string to another number, date or character string, as long as the data types match.

NVL (Datecolumn, ’01-jan-88’)

NVL (Numbercolumn, 9)

NVL (charcolumn, 'string')

Preventing the Selection of Duplicate Rows

Unless you indicate otherwise, SQL*Plus displays the result of a query without eliminating duplicate entries.

To list all department numbers in the EMP table, enter:

SELECT DEPTNO FROM EMP;

DEPTNO
20
30
30
20
30
30
10
20
10
30
20
30
20
10

To eliminate duplicate values in the result, include the DISTINCT qualifier in the SELECT command. To eliminate the duplicate values displayed in the previous example, enter:

SELECT DISTINCT DEPTNO FROM EMP;

DEPTNO
10
20
30


Multiple columns may be specified after the DISTINCT qualifier and the DISTINCT affects all selected columns. To display distinct values of DEPTNO and JOB, enter:

SELECT DISTINCT DEPTNO, JOB FROM EMP;

DEPTNO  JOB
10      CLERK
10      MANAGER
10      PRESIDENT
20      ANALYST
20      CLERK
20      MANAGER
30      CLERK
30      MANAGER
30      SALESMAN

This displays a list of all different combinations of jobs and department numbers.

Note that the order of rows returned in a query result is undefined. The ORDER BY clause may be used to sort the rows. If used, ORDER BY must always be the last clause in the SELECT statement.

To sort by ENAME, enter:

SELECT ENAME, JOB, SAL*12, DEPTNO FROM EMP ORDER BY ENAME;

ENAME   JOB        SAL*12  DEPTNO
ADAMS   CLERK      13200   20
ALLEN   SALESMAN   19200   30
BLAKE   MANAGER    34200   30
CLARK   MANAGER    29400   10
FORD    ANALYST    36000   20
JAMES   CLERK      11400   30
JONES   MANAGER    35700   20
KING    PRESIDENT  60000   10
MARTIN  SALESMAN   15000   30
MILLER  CLERK      15600   10
SCOTT   ANALYST    36000   20
SMITH   CLERK      9600    20
TURNER  SALESMAN   18000   30
WARD    SALESMAN   15000   30



Default Ordering of Data

The default sort order is ASCENDING:

• Numeric values lowest first

• Date values earliest first

• Character values alphabetically

Reversing the Default Order

To reverse the order, the command word DESC is specified after the column name in the ORDER BY clause.

To reverse the order of the HIREDATE column, so that the latest dates are displayed first, enter:

SELECT ENAME, JOB, HIREDATE FROM EMP ORDER BY HIREDATE DESC;

ENAME   JOB        HIREDATE
ADAMS   CLERK      23-MAY-87
SCOTT   ANALYST    19-APR-87
MILLER  CLERK      23-JAN-82
FORD    ANALYST    03-DEC-81
JAMES   CLERK      03-DEC-81
KING    PRESIDENT  17-NOV-81
MARTIN  SALESMAN   28-SEP-81
TURNER  SALESMAN   08-SEP-81
CLARK   MANAGER    09-JUN-81
BLAKE   MANAGER    01-MAY-81
JONES   MANAGER    02-APR-81
WARD    SALESMAN   22-FEB-81
ALLEN   SALESMAN   20-FEB-81
SMITH   CLERK      17-DEC-80

Ordering by Many Columns

It is possible to ORDER BY more than one column. The limit is the number of columns on the table. In the ORDER BY clause, specify the columns to order by separated by commas. If any or all are to be reversed, specify DESC after any or each column.

To order by two columns and display in reverse order of salary, enter:

SELECT DEPTNO, ENAME, SAL FROM EMP ORDER BY DEPTNO, SAL DESC;

DEPTNO  ENAME   SAL
10      KING    5,000.00
10      CLARK   2,450.00
10      MILLER  1,300.00
20      SCOTT   3,000.00
20      FORD    3,000.00
20      JONES   2,975.00
20      ADAMS   1,100.00
20      SMITH   800.00
30      BLAKE   2,850.00
30      ALLEN   1,600.00
30      TURNER  1,500.00
30      WARD    1,250.00
30      MARTIN  1,250.00
30      JAMES   950.00

The WHERE Clause

The WHERE clause corresponds to the Restriction operator of Relational Algebra.

It contains a condition that rows must meet in order to be displayed.

The WHERE clause, if used, must follow the FROM clause:

SELECT columns
FROM table
WHERE certain conditions are met

The WHERE clause may compare values in columns literal values, arithmetic expressions or functions. The WHERE clause expects 3 elements:

1. A column name

2. A comparison operator

3. A column name, constant or list of values.

Comparison operators are used in the WHERE clause and can be divided into two categories: logical operators and SQL operators.

Logical Operators

The logical operators will test the following conditions:

Operator  Meaning
=         equal to
>         greater than
>=        greater than or equal to
<         less than
<=        less than or equal to

Character Strings and Dates in the WHERE Clause

ORACLE columns may be: Character, Number or Date.

Character strings and dates in the WHERE clause must be enclosed in single quotation marks. Character strings must match case with the column value unless modified by a function.

To list the names, number, job and department of all clerks, enter:


SELECT ENAME, EMPNO, JOB, DEPTNO FROM EMP WHERE JOB = 'CLERK';

ENAME   EMPNO  JOB    DEPTNO
SMITH   7369   CLERK  20
ADAMS   7876   CLERK  20
JAMES   7900   CLERK  30
MILLER  7934   CLERK  10

To find department names with department numbers greater than 20, enter:

SELECT DNAME, DEPTNO
FROM DEPT
WHERE DEPTNO > 20;

DNAME       DEPTNO
SALES       30
OPERATIONS  40

Comparing One Column with Another

You can compare a column with another column in the same row, as well as with a constant value.

For example, suppose you want to find those employees whose commission is greater than their salary; enter:

SELECT ENAME, SAL, COMM
FROM EMP
WHERE COMM > SAL;

ENAME   SAL       COMM
MARTIN  1,250.00  1,400.00

SQL Operators

There are four SQL operators, which operate with all data types:

Operator          Meaning
BETWEEN...AND...  between two values (inclusive)
IN (list)         match any of a list of values
LIKE              match a character pattern
IS NULL           is a null value

The BETWEEN Operator

Tests for values between and inclusive of, low and high range.

Suppose we want to see those employees whose salary is between 1000 and 2000.

SELECT ENAME, SAL FROM EMP WHERE SAL BETWEEN 1000 AND 2000;

ENAME   SAL
ALLEN   1,600.00
WARD    1,250.00
MARTIN  1,250.00
TURNER  1,500.00
ADAMS   1,100.00
MILLER  1,300.00


Note that values specified are inclusive and the lower limit must be specified first.

The IN Operator

Purpose: Tests for existence of values in a specified list.

To find all employees who have one of the three MGR numbers, enter:

SELECT EMPNO, ENAME, SAL, MGR FROM EMP WHERE MGR IN (7902, 7566, 7788);

EMPNO  ENAME  SAL       MGR
7369   SMITH  800.00    7902
7788   SCOTT  3,000.00  7566
7876   ADAMS  1,100.00  7788
7902   FORD   3,000.00  7566

If character or dates are used in the list they must be enclosed in single quotes(‘ ’).

The LIKE Operator

Sometimes you may not know the exact value to search for. Using the LIKE operator, it is possible to select rows that match a character pattern. The character pattern matching operation may be referred to as a 'wild-card' search. Two symbols can be used to construct the search string:

Symbol  Represents
%       any sequence of zero or more characters
_       any single character

To list all employees whose name starts with an S, enter:

SELECT ENAME FROM EMP WHERE ENAME LIKE 'S%';

ENAME
SMITH
SCOTT

The _ (underscore) can be used to search for a specific number of characters.

For example, to list all employees who have a name exactly 4 characters in length:

SELECT ENAME FROM EMP WHERE ENAME LIKE '____';

ENAME
WARD
KING
FORD

The % and _ may be used in any combination with literal characters.

The IS NULL Operator

The IS NULL operator specifically tests for values that are NULL.

So to find all employees who have no manager, you are testing for a NULL:


SELECT ENAME, MGR FROM EMP WHERE MGR IS NULL;

ENAME MGR

KING

Negating Expressions

The following operators provide negative tests:

Operator       Meaning
!=             not equal to (VAX, UNIX, PC)
^=             not equal to (IBM)
<>             not equal to (all operating systems)
NOT colname =  not equal to
NOT colname >  not greater than

Negating the SQL Operators

Operator     Meaning
NOT BETWEEN  not between two given values
NOT IN       not in a given list of values
IS NOT NULL  is not a null value

To find employees whose salary is not between a range, enter:

SELECT ENAME, SAL FROM EMP WHERE SAL NOT BETWEEN 1000 AND 2000;

ENAME  SAL
SMITH  800.00
JONES  2,975.00
BLAKE  2,850.00
CLARK  2,450.00
SCOTT  3,000.00
KING   5,000.00
JAMES  950.00
FORD   3,000.00

To find all employees whose name does not start with an S, enter:

SELECT ENAME, JOB FROM EMP WHERE ENAME NOT LIKE 'S%';

ENAME   JOB
ALLEN   SALESMAN
WARD    SALESMAN
JONES   MANAGER
MARTIN  SALESMAN
BLAKE   MANAGER
CLARK   MANAGER
KING    PRESIDENT
TURNER  SALESMAN
ADAMS   CLERK
JAMES   CLERK
FORD    ANALYST
MILLER  CLERK

To find all employees who have a manager, enter:

SELECT ENAME, MGR FROM EMP WHERE MGR IS NOT NULL;

ENAME   MGR
SMITH   7902
ALLEN   7698
WARD    7698
JONES   7839
MARTIN  7698
BLAKE   7839
CLARK   7839
SCOTT   7566
TURNER  7698
ADAMS   7788
JAMES   7698
FORD    7566
MILLER  7782

Note

• If a NULL value is used in a comparison, then the comparison operator should be either IS NULL or IS NOT NULL. If these operators are not used and NULL values are compared, the result is always FALSE.

• For example, COMM != NULL is always FALSE. The result is false because a NULL value can be neither equal nor unequal to any other value, even another NULL. Note that an error is not raised; the result is simply always false.

Student Activity 3.7

Before reading the next section, answer the following questions:

1. What do you understand by DML?

2. What is the use of Insert, Delete and Update commands?

3. Why do we use select statement?

If your answers are correct, then proceed to the next section.

Querying Data with Multiple Conditions

The AND and OR operators may be used to make compound logical expressions.


The AND predicate will expect both conditions to be 'true', whereas the OR predicate will expect either condition to be 'true'.

In the following two examples, the conditions are the same, the predicate is different. See how the result is dramatically changed.

To find all clerks who earn between 1000 and 2000, enter:

SELECT EMPNO, ENAME, JOB, SAL FROM EMP WHERE SAL BETWEEN 1000 AND 2000 AND JOB = 'CLERK';

EMPNO  ENAME   JOB    SAL
7876   ADAMS   CLERK  1,100.00
7934   MILLER  CLERK  1,300.00

To find all employees who are clerks or who earn between 1000 and 2000, enter:

SELECT EMPNO, ENAME, JOB, SAL FROM EMP WHERE SAL BETWEEN 1000 AND 2000 OR JOB = 'CLERK';

EMPNO  ENAME   JOB       SAL
7369   SMITH   CLERK     800.00
7499   ALLEN   SALESMAN  1,600.00
7521   WARD    SALESMAN  1,250.00
7654   MARTIN  SALESMAN  1,250.00
7844   TURNER  SALESMAN  1,500.00
7876   ADAMS   CLERK     1,100.00
7900   JAMES   CLERK     950.00
7934   MILLER  CLERK     1,300.00

You may combine AND and OR in the same logical expression. When AND and OR appear in the same WHERE clause, all the ANDs are performed first, then all the ORs are performed. We say that AND has a higher precedence than OR.

Since AND has a higher precedence than OR, the following SQL statement returns all managers with salaries over $1500 and all salesmen:

SELECT EMPNO, ENAME, JOB, SAL, DEPTNO FROM EMP WHERE SAL > 1500 AND JOB = 'MANAGER' OR JOB = 'SALESMAN';

EMPNO  ENAME   JOB       SAL       DEPTNO
7499   ALLEN   SALESMAN  1,600.00  30
7521   WARD    SALESMAN  1,250.00  30
7566   JONES   MANAGER   2,975.00  20
7654   MARTIN  SALESMAN  1,250.00  30
7698   BLAKE   MANAGER   2,850.00  30
7844   TURNER  SALESMAN  1,500.00  30

If you wanted to select all managers and salesmen with salaries over $1500, you would enter:

SELECT EMPNO, ENAME, JOB, SAL, DEPTNO FROM EMP WHERE SAL > 1500 AND (JOB = 'MANAGER' OR JOB = 'SALESMAN');

EMPNO  ENAME  JOB       SAL       DEPTNO
7499   ALLEN  SALESMAN  1,600.00  30
7566   JONES  MANAGER   2,975.00  20
7698   BLAKE  MANAGER   2,850.00  30
7782   CLARK  MANAGER   2,450.00  10

The parentheses specify the order in which the operators should be evaluated. In the second example, the OR operator is evaluated before the AND operator.

Insert

Purpose: This command is used to add new records into a table.

SYNTAX: INSERT INTO <table name>

VALUES (value1, value2, …);

E.G.

INSERT INTO emp
VALUES ('101', 'KING', 'PRESIDENT', '17-NOV-91', 5000, NULL, '10');

INSERT INTO emp (empno, deptno, ename)
VALUES ('101', '29', 'ANITA');

INSERT INTO table (column1, column2, …)

SELECT column1, column2, …

FROM table WHERE (condition);

E.g.

INSERT INTO emp (empno, deptno, ename)
SELECT empno, deptno, ename FROM emp WHERE
ename = 'SMITH';

Update

Purpose: Used to change the values of fields in a specified table.

Syntax: UPDATE <tablename>

SET column1 = expression, column2 = expression

WHERE condition;

E.g.

UPDATE emp
SET salary = 2000, ename = 'ANITA'
WHERE hiredate <= '01-JAN-85';

Note: If the WHERE clause is omitted, all rows are updated.

Delete

Purpose: Removes the rows from a table.


Syntax: DELETE FROM <table name>

WHERE <condition>;

E.g. DELETE FROM emp WHERE salary > 2000;

Note: If the WHERE clause is omitted, all the rows of the specified table will be deleted. A part of a row cannot be deleted.

Commit

Purpose: COMMIT is used to make the changes [Insert, Update, Delete] that have been made, permanent.

Rollback

Purpose: This is used to undo (discard) all the changes that have been completed but not made permanent in the table by a COMMIT, i.e., all changes since the last COMMIT.

Data Definition Language (DDL)

• Data Definition Language is used to create, alter or remove a data structure, such as a table, in the ORACLE database structure.

• In the ORACLE database, data is stored in data structures known as tables, comprising rows and columns. A table is created in a logical unit of the database called a table space. A database may have one or more table spaces.

• The table space is divided into segments; a segment is a set of database blocks allocated for the storage of database structures, namely tables, indexes, etc.

• Segments are defined using storage parameters, which in turn are expressed in terms of extents of data. An extent is an allocation of database space which itself contains many Oracle blocks, the basic unit of storage.

Create Table

Purpose: To create a TABLE Structure.

Syntax: CREATE TABLE <tablename>

(column datatype(size) [NULL/NOT NULL],

column datatype(size), ……);

E.g. Create table emp

(empno number(4) NOT NULL,
ename varchar2(10),
job varchar2(9),
hiredate date,
salary number(7,2),
comm number(7,2),
deptno number(2) NOT NULL);

E.g.

CREATE TABLE dept
(deptno number(2) NOT NULL PRIMARY KEY,
dname varchar2(14) NOT NULL UNIQUE,
loc varchar2(13));

CREATE TABLE emp
(empno number(4) NOT NULL CONSTRAINT emp_prim_key PRIMARY KEY,
ename varchar2(10) NOT NULL,
sal number(7,2) NOT NULL CHECK (sal > 1000),
deptno number(2) NOT NULL REFERENCES dept (deptno));

Alter Table

Purpose: ALTER is used to change the structure of an existing table

Syntax: ALTER TABLE tablename

[ADD column element ……, ………]

E.g. add a new column:

ALTER TABLE emp
ADD (address varchar2(20));

It is not possible to change the name of an existing column or delete an existing column.

The data type of an existing column can be changed, if the field is blank for all existing records.

Views

• View can be defined as a Logical (Virtual) table derived from one or more base tables or Views.

• It is basically a subschema defined as a subset of the Schema.

• Views are like windows through which one can view selected (or restricted) information.

• A View is a data object that contains no data of its own but whose contents are taken from other tables through the execution of a query.

• Since the contents of the table change, the view would also change dynamically.

Syntax: Create View <view name>

As <query>

[with CHECK OPTION];

• Oracle implementation of WITH CHECK option places no restrictions on the form of query that may be used in the AS clause.

• One may UPDATE and DELETE rows in a view if it is based on a single table and its query contains neither a GROUP BY clause nor the DISTINCT clause.


• One may INSERT rows if the view observes the same restrictions and its query contains no columns defined by expressions.

E.g. In order to create a view of the EMP table named DEPT20, to show the employees in department 20 and their annual salary:

CREATE VIEW dept20
AS SELECT ename, sal*12 annual_salary
FROM emp
WHERE deptno = 20;

Once the VIEW is up, it can be treated like any other table:

SELECT * FROM dept20;

Create Sequence

Purpose: Creates a database object to generate unique integers.

Syntax: CREATE SEQUENCE seq_name

[INCREMENT BY n]

[START WITH n]

[MAXVALUE n]

[MINVALUE n]

[CYCLE]

E.g. CREATE SEQUENCE my_seq
INCREMENT BY 10
START WITH 1
MAXVALUE 100
CYCLE;

Create Index

PURPOSE: To create an index on one or more columns of a table or a cluster.

SYNTAX: CREATE [UNIQUE] INDEX index_name

ON table_name

(column_name[, column_name…])

TABLESPACE tablespace

E.g. CREATE INDEX i_emp_ename ON emp (ename);

Student Activity 3.8

Answer the following questions:

1. What is SQL? Define its types.

2. What is the functionality of the CREATE and ALTER commands?

Summary


• Basic sets of relational model operations constitute the relational algebra.

• A sequence of relational algebra operations forms a relational algebra expression, whose result will also be a relation.

• The set of relational algebra operations {σ, π, U, –, x } is a complete set; that is, any of the other relational algebra operations can be expressed as a sequence of operations from this set.

• Relational calculus is an alternative to relational algebra. The calculus is nonprocedural, or declarative; that is, it allows one to describe the set of answers without being explicit about how they should be computed.

• Structured Query Language (SQL), pronounced “sequel”, is the set of commands that all programs and users use to access data within the Oracle7 database.

• The strengths of SQL benefit all ranges of users including application programmers, database administrators, management and end users.

Assignment Questions

Graded Questions

I. True or False

1. Basic sets of relational model operations constitute the relational algebra.

2. The SELECT operation is used to select a subset of the columns from a relation that satisfy a selection condition.

3. The SELECT operator is binary.

II. Fill in the Blanks

1. The fraction of tuples selected by a __________condition is referred to as the selectivity of the condition.

2. The _________operation selects certain columns from the table and discards the other columns.

3. The CARTESIAN PRODUCT creates tuple with the ________attributes of two relations.

4. ____________is used to undo (discard) all the changes that have been completed but not made permanent in the table by a COMMIT.

5. Character strings must match case with the _________value unless modified by a function.

Answers

I. True or False

1. True

2. False

3. False

II. Fill in the Blanks

1. selection


2. PROJECT

3. combined

4. rollback

5. column

Ungraded Questions

I. True or False

1. Set operations include UNION, INTERSECTION, SET DIFFERENCE, and CARTESIAN PRODUCT.

2. The selection operation is not applied to each tuple individually.

3. A NATURAL JOIN can be specified as a CARTESIAN PRODUCT preceded by RENAME and followed by SELECT and PROJECT operations.

4. Character strings and dates in the WHERE clause need not be enclosed in single quotation marks.

5. The ISNULL operator specifically tests for values that are NULL.

II. Fill in the Blanks

1. A ________ must include a set of operations to manipulate the data.

2. Cartesian Product is also known as ___________.

3. The JOIN operation is used to combine ____________ from two relations into single tuple.

4. __________ is an alternative to relational algebra.

5. The CHAR data type specifies a ____________ character string.

Detailed Questions

1. What is the difference between a key and a super key?

2. Discuss the various reasons that lead to the occurrence of null values in relations.

3. Discuss the entity integrity and referential integrity constraints. Why is each considered important?

4. Define foreign key. What is this used for? How does it play a role in join operation?

5. Discuss the various update operations on relations and the types of integrity constraints that must be checked for each update operation.

6. List the operations of relational algebra and the purpose of each.

7. Discuss the various types of join operations. Why is join required?

8. How are the OUTER JOIN operations different from the (INNER) JOIN operation? How is the OUTER UNION operation different from UNION?

9. Suppose we have a table having a structure like employee (emp_id number(3), emp_name varchar2(15), dep_no number(2), emp_desig varchar2(5), salary number(8,2))

i) Create a table named employee as given above.


ii) Insert <1, 'John', 01, 'manager', 25000.00> into employee.

iii) Insert <2, 'Ram', 02, 'clerk', 5000.00> into employee.

iv) Insert <3, 'Ramesh', 03, 'Accountant', 7000.00> into employee.

v) Insert <4, 'Raje', 05, 'clerk', 500000.00> into employee.

Now make the following queries-

vi) Find the employee whose designation is manager.

vii) Find the employee's details whose salary is the second highest.

viii) Find the name of all employees who are clerks.

ix) Find the employees who belong to the same department.

x) Find the sum of all salaries.

xi) Find the sum of salary of the employees who belong to the same department.

xii) Update Ramesh's department no. to 04.

10. Design a relational database for a University registrar's office. The office maintains data about each class, including the instructor, the number of students enrolled, and the time and place of the class meetings. For each student-class pair, a grade is recorded.

List two reasons why null values might be introduced into the database.


Introduction Update Operations on Relations Functional Dependencies Closure Of A Set of dependencies First Normal Form Second Normal Form Third Normal Form Boyce-Codd Normal Form Fifth Normal Form

UNIT 4

Relational Database Design

Learning Objectives

After reading this unit you should appreciate the following:

• Introduction

• Functional Dependencies

• Normalisation

• First Normal Form

• Second Normal Form

• Third Normal Form

• Boyce-Codd Normal Form

• Fourth Normal Form

• Fifth Normal Form


Introduction

The relational model is an abstract theory of data that is based on certain aspects of mathematics (principally set theory and predicate logic).

The principles of the relational model were originally laid down in 1969-70 by Dr. E.F. Codd. The relational model is a way of looking at data. It is concerned with three aspects of data: structure, data integrity and manipulation (for example join, projection, etc.).

1. Structure aspect: The data in the database is perceived by the user as tables. The database is arranged in tables, and a collection of tables is called a database. Structure means the design view of the database, such as data types, sizes, etc.


2. Integrity aspect: The tables must satisfy certain integrity constraints like domain constraints, entity integrity, referential integrity and operational constraints.

3. Manipulative aspect: The operators available to the user for manipulating the tables in the database, e.g. for the purpose of retrieval of data, such as projection, join and restrict.

Relational Model Constraints

PURPOSE: Used to validate data entered for the specified column(s).

There are two types of constraints:

• Table Constraint

• Column Constraint

Table Level Constraints

If the constraint spans across multiple columns, the user will have to use table level constraints. If the data constraint attached to a specific cell in a table references the contents of another cell in the table, then the user will have to use table level constraints.

Primary key as a table level constraint.

E.g. Create table sales_order_details (s_order_no varchar2(6),

product_no varchar2(6), …, PRIMARY KEY (s_order_no, product_no));

Column Level Constraints

If the constraints are defined with the column definition, they are called column level constraints. They are local to a specific column.

Primary key as a column level constraint

Create table client (client_no varchar2(6) PRIMARY KEY, …);

Student Activity 4.1

Before reading the next section, answer the following questions:

1. What is relational model? Define its aspects.

2. What is relational model constraint? Define its types.

If your answers are correct, then proceed to the next section.

Features of Constraints

• NOT NULL CONDITION

• UNIQUENESS

• PRIMARY KEY identification

• FOREIGN KEY

• CHECK the column value against a specified condition


Some important constraint features and their implementation have been discussed below:

Primary Key Constraints

A PRIMARY KEY constraint designates a column or combination of columns as the table’s primary key. To satisfy a PRIMARY KEY constraint, both the following conditions must be true:

• No primary key value can appear in more than one row in the table.

• No column that is part of the primary key can contain a null.

A table can have only one primary key.

A primary key column cannot be of data type LONG or LONG RAW. You cannot designate the same column or combination of columns as both a primary key and a unique key or as both a primary key and a cluster key. However, you can designate the same column or combination of columns as both a primary key and a foreign key.

Defining a Primary Key on a Single Column

You can use the column_constraint syntax to define a primary key on a single column.

Example

The following statement creates the DEPT table and defines and enables a primary key on the DEPTNO column:

CREATE TABLE dept

(deptno NUMBER(2) CONSTRAINT pk_dept PRIMARY KEY,

dname VARCHAR2 (10))

The PK_DEPT constraint identifies the DEPTNO column as the primary key of the DEPT table. This constraint ensures that no two departments in the table have the same department number and that no department number is NULL.

Alternatively, you can define and enable this constraint with table_ constraint syntax:

CREATE TABLE dept

(deptno NUMBER(2),

dname VARCHAR2 (9),

loc VARCHAR2 (10),

CONSTRAINT pk_dept PRIMARY KEY (deptno))

Defining a Composite Primary Key

A composite primary key is a primary key made up of a combination of columns. Because Oracle 7 creates an index on the columns of a primary key, a composite primary key can contain a maximum of 16 columns. To define a composite primary key, you must use the table_constraint syntax, rather than the column_constraint syntax.

Example

The following statement defines a composite primary key on the combination of the SHIP_NO and CONTAINER_NO columns of the SHIP_CONT table:


ALTER TABLE ship_cont

ADD PRIMARY KEY (ship_no, container_no) DISABLE

This constraint identifies the combination of the SHIP_NO and CONTAINER_NO columns as the primary key of the SHIP_CONT table. The constraint ensures that no two rows in the table have the same values for both the SHIP_NO column and the CONTAINER_NO column.

The CONSTRAINT clause also specifies the following properties of the constraint.

• Since the constraint definition does not include a constraint name, Oracle 7 generates a name for the constraint.

• The DISABLE option causes Oracle7 to define the constraint but not enforce it.

Student Activity 4.2

Before reading the next section, answer the following questions:

1. What are the various features of constraint?

2. Define primary key and composite key.

If your answers are correct, then proceed to the next section.

Referential Integrity Constraints

A referential integrity constraint designates a column or combination of columns as a foreign key and establishes a relationship between that foreign key and a specified primary or unique key, called the referenced key. In this relationship, the table containing the foreign key is called the child table and the table containing the referenced key is called the parent table. Note the following:

• The child and parent tables must be on the same database. They cannot be on different nodes of a distributed database.

• The foreign key and the referenced key can be in the same table. In this case, the parent and child tables are the same.

• To satisfy a referential integrity constraint, each row of the child table must meet one of the following conditions:

� The value of the row’s foreign key must appear as a referenced key value in one of the parent table’s rows. The row in the child table is said to depend on the referenced key in the parent table.

� The value of one of the columns that makes up the foreign key must be null.

A referential integrity constraint is defined in the child table. A referential integrity constraint definition can include any of the following key words:

• Foreign Key: Identifies the column or combination of columns in the child table that makes up the foreign key. Only use this keyword when you define a foreign key with a table constraint clause.

• Reference: Identifies the parent table and the column or combination of columns that make up the referenced key. If you only identify the parent table and omit the column names, the foreign key automatically references the primary key of the parent table. The corresponding columns of the referenced key and the foreign key must match in number and data types.


On Delete Cascade: Allows deletion of referenced key values in the parent table that have dependent rows in the child table and causes Oracle7 to automatically delete dependent rows from the child table to maintain referential integrity. If you omit this option, Oracle7 forbids deletion of referenced key values in the parent table that have dependent rows in the child table.

You cannot define a referential integrity constraint in a CREATE TABLE statement that contains an AS clause. Instead, you can create the table without the constraint and then add it later with an ALTER TABLE statement.

You can define multiple foreign keys in a table. Also, a single column can be part of more than one foreign key.

Defining Referential Integrity Constraints

You can use column_constraint syntax to define a referential integrity constraint in which the foreign key is made up of a single column.

Example

The following statement creates the EMP table and defines and enables a foreign key on the DEPTNO column that references the primary key on the DEPTNO column of the DEPT table:

CREATE TABLE emp

(empno NUMBER(4),
ename VARCHAR2(10),
job VARCHAR2(9),
mgr NUMBER(4),
hiredate DATE,
sal NUMBER(7,2),
comm NUMBER(7,2),
deptno CONSTRAINT fk_deptno REFERENCES dept (deptno))

The constraint FK_DEPTNO ensures that all employees in the EMP table work in a department in the DEPT table. However, employees can have null department numbers.

Before you define and enable this constraint, you must define and enable a constraint that designates the DEPTNO column of the DEPT table as a primary or unique key. Note that the referential integrity constraint definition does not use the FOREIGN KEY keyword to identify the columns that make up the foreign key. Because the constraint is defined with a column constraint clause on the DEPTNO column, the foreign key is automatically on the DEPTNO column.

Note that the constraint definition identifies both the parent table and the columns of the referenced key. Because the referenced key is the parent table’s primary key, the referenced key column names are optional.

Note that the above statement omits the DEPTNO column's data type. Because this column is a foreign key, Oracle7 automatically assigns it the data type of the DEPT.DEPTNO column to which the foreign key refers.

Alternatively, you can define a referential integrity constraint with table_constraint syntax :

CREATE TABLE emp
(empno NUMBER(4),
ename VARCHAR2(10),
job VARCHAR2(9),
mgr NUMBER(4),
hiredate DATE,
sal NUMBER(7,2),
comm NUMBER(7,2),
deptno,
CONSTRAINT fk_deptno FOREIGN KEY (deptno) REFERENCES dept (deptno))

Note that the foreign key definitions in both the above statements omit the ON DELETE CASCADE option, causing Oracle7 to forbid the deletion of a department if any employee works in that department.

Student Activity 4.3

Before reading the next section, answer the following questions:

1. What do you understand by primary key constraint and Referential Integrity constraint?

2. Why do we use Null constraint in the table?

If your answers are correct, then proceed to the next section.


Update Operations on Relations

The operations of the relational model can be categorised into retrievals and updates. But we will discuss update operation here.

There are three basic update operations on relations

(1) Insert, (2) delete, and (3) modify.

The Insert Operation

It is used to insert a new tuple (row) or tuples in a relation, and it can violate any type of constraint. Domain constraints can be violated if an attribute value is given that does not appear in the corresponding domain. Key constraints can be violated if the key value in the new tuple already exists in another tuple of the relation r(R). Referential integrity can be violated if the value of any foreign key in the new tuple refers to a tuple that does not exist in the referenced relation.

Suppose we have a table student (std_id number(4), std_name varchar2(10), std_course varchar2(5), std_fee number(7,2)). Then, to insert values into this table, we will use the Insert operation in this way:

Insert <1, 'john', 'MSc', 5000.50> into student

In SQL (Structured Query Language):

Insert into student values (1, 'john', 'MSc', 5000.50);
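To make these checks concrete, here is a minimal Python sketch (not from the text; the helper name can_insert and the dict-based representation of tuples are our own assumptions) of what the DBMS must verify before accepting a new tuple:

# Sketch: the key and referential-integrity checks behind an Insert.
def can_insert(new_row, relation, pk, fk=None, parent=None, parent_pk=None):
    # Key constraint: the new tuple's key value must not already exist.
    key = tuple(new_row[a] for a in pk)
    if any(tuple(r[a] for a in pk) == key for r in relation):
        return False
    # Referential integrity: a non-null foreign key value must appear
    # as a key value in the referenced (parent) relation.
    if fk is not None and new_row[fk] is not None:
        if all(p[parent_pk] != new_row[fk] for p in parent):
            return False
    return True

students = [{"std_id": 1, "std_name": "john", "std_course": "MSc", "std_fee": 5000.50}]
new = {"std_id": 1, "std_name": "ram", "std_course": "BSc", "std_fee": 4000.00}
print(can_insert(new, students, pk=["std_id"]))   # False: key value 1 already exists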

The Delete Operation


The Delete operation is used to delete tuples. The Delete operation can violate only referential integrity, if the tuple being deleted is referenced by foreign keys from other tuples in the database. To specify a deletion, a condition on the attributes of the relation selects the tuple (or tuples) to be deleted.

Delete the student tuple with std_id = 2:

Delete from student where std_id = 2;

The Update Operation

Update (or Modify) is used to change the values of some attributes in existing tuples. Whenever the update operation is applied, the integrity constraints specified on the relational database schema should not be violated. It is necessary to specify a condition on the attributes of the relation to select the tuple (or tuples) to be modified.


Functional Dependencies

Functional Dependencies (abbreviated as FD) is a many-to-one relationship from one set of attributes to another within a given relation.

Definition: If X and Y are two subsets of a set of attributes of a relation, then attribute Y is said to be functionally dependent on attribute X, if and only if each X value in the relation has associated with it exactly one Y value in the relation.

Stated differently, whenever two tuples of the relation agree on their X value, they also agree on their Y value.

Symbolically, the same can be expressed as: X →Y

To understand the concept of functional dependency, consider the following relation (BM) with four attributes (S#, Item, P#, Qty):

S#   Item       P#   Qty
S1   Book       P1   100
S1   Book       P2   200
S2   Magazine   P1   300
S2   Magazine   P2   100
S2   Magazine   P3   200
S3   Book       P1   100
S3   Book       P2   200
S3   Book       P3   300

There are 4 attributes in this relation and at the moment 8 tuples are inserted into it. Note that whenever in any two tuples the S# value is the same, the Item value is also the same. That is, whenever the value of S# is 'S1', the value of Item is 'Book'; if the value of S# is 'S2', the value of Item is 'Magazine', etc. Therefore, attribute Item is functionally dependent on attribute S#, i.e. the set of attributes {S#} → {Item}. However, the converse is not true, i.e. {Item} does not functionally determine {S#} in this example.

Other functional dependencies valid in the relation are:

{S#, P#} → {Qty}

{S#, P#} → {Item}

{S#, P#} → {Item, Qty}

{S#, P#} → {S#, Item, P#, Qty}

{S#, Item, P#} → {Qty}

{S#} → {Item}

The L.H.S. is called determinant and R.H.S. is called dependent. When the set contains just one attribute (i.e. S#) we can drop the brackets and can write S# → Item.

The above definition refers not only to the existing tuples but all the possible values of the attributes in all the tuples.

What is the significance of finding functional dependency after all? Well, this is because FD’s (short for functional dependencies) represent integrity constraints of the database and therefore, must be enforced. Now, obviously, the set of all FD’s could be very large. This motivates the database designer to look for a smaller set of FD’s which is easily manageable and yet implies all the FD’s. Thus finding out a minimal set of FD’s is of great practical value.
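Because an FD is an integrity constraint, a relation instance can be tested against it: no two tuples may agree on the determinant yet disagree on the dependent. The following Python sketch (our own illustration, not from the text; the function name fd_holds and the dict representation are assumptions) performs that test:

def fd_holds(rows, X, Y):
    # True if every pair of tuples that agrees on X also agrees on Y.
    seen = {}   # maps each X-value to the first Y-value observed with it
    for row in rows:
        x_val = tuple(row[a] for a in X)
        y_val = tuple(row[a] for a in Y)
        if x_val in seen and seen[x_val] != y_val:
            return False
        seen[x_val] = y_val
    return True

bm = [
    {"S#": "S1", "Item": "Book",     "P#": "P1", "Qty": 100},
    {"S#": "S1", "Item": "Book",     "P#": "P2", "Qty": 200},
    {"S#": "S2", "Item": "Magazine", "P#": "P1", "Qty": 300},
]
print(fd_holds(bm, ["S#"], ["Item"]))   # True: S# functionally determines Item
print(fd_holds(bm, ["Item"], ["P#"]))   # False: Item does not determine P#

Remember that such a check over one instance can only refute an FD; the FD itself, as stated above, is an assertion about all possible values of the attributes.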

FD’s can be of two types: Trivial and Non Trivial functional dependency. An FD is of trivial type if the right hand side is a subset (not necessarily a proper subset) of the left hand side such as:

{ S#, P# } → { S#}

Here you see that the right hand side (i.e. the dependent {S#}) is a subset of its determinant (i.e. {S#, P#}) and hence this is an instance of trivial functional dependency. If this condition does not hold for a dependency, we call it a non-trivial dependency.


Closure of a Set of Dependencies

A dependency may imply another dependency even if it is not explicitly obvious. For example the dependency {S#, P#} → {Item, Qty} implies two dependencies, viz. {S#, P#} → Item and {S#, P#} → Qty.

The set of all functional dependencies implied by a given set S of FD’s is called the closure of S and is denoted by S+.

For the purpose of deriving S+ of a given S, a set of inference rules exists, called Armstrong's inference rules or axioms. Using these rules a set of FD's may be reduced to its closure. We state these rules hereunder:

Let R be a relation and A, B, C arbitrary subsets of the set of attributes of R (i.e. A, B, C ⊆ R); then:

1. Reflexivity — If B is a subset of A, then A → B

2. Augmentation — If A → B, then A ∪ C → B ∪ C

3. Transitivity — If A→ B and B → C, then A → C

The rules can be used to derive precisely the closure S+.

The additional rules can be derived from these rules and can be used to simplify the task of computing S+ from S. The derived rules are:

4. Self-determination — A → A

5. Decomposition — If A → B ∪ C, then A → B and A → C

6. Union — If A → B and A → C, then A → B ∪ C


7. Composition — If A → B and C → D, then A ∪ C → B ∪ D

For simplification, we will represent A ∪ B by AB in what follows.

Example: Suppose we have a relation R with attributes A, B, C, D, E, F and following FD’s:

A → BC

B → E

CD → EF

Suppose we wish to show that the FD AD → F is implied by this set. We can derive it by applying the inference rules and the derived rules given above:

A → BC    (given)

A → C     (decomposition)

AD → CD   (augmentation)

CD → EF   (given)

AD → EF   (transitivity)

AD → F    (decomposition)

Thus we can say that in the given set of FD’s, AD → F holds.

It is clear from the ongoing discussion that given a set of FD’s, we can easily determine (by applying Armstrong axioms or otherwise) whether an FD holds in the relation variable R or not.

In principle, we can always compute the closure of a given set of attributes by repeatedly applying the Armstrong axioms and derived rules until they stop producing any new FD. However, as stated earlier, it is difficult this way.

It is more relevant to compute the closure of a set of attributes of relation R (say Z) and a set of functional dependencies on R (say S). We will call this as closure Z+ of Z under S. One of many possible algorithms is given below:

Closure(Z, S)
repeat
  for each FD, X → Y in S
  do
    if (X is a subset of Closure(Z, S)) then
      Closure(Z, S) = Closure(Z, S) ∪ Y
    end if
  end do
  if Closure(Z, S) did not change in the current iteration, leave the repeat loop
end repeat

Example: Let the FD's of a relation R with attributes A, B, C, D, E, F be:

A → BC

E → CF


B → E

CD → EF

Compute the closure Z+, where Z={A, B} under the given set of FD’s (S).

Solution: Applying the above algorithm we find

1. Closure = Z. That is initialize Closure with {A, B}.

2. Start repeating

a. Number of FD’s is 4, therefore, loop 4 times.

i. The first FD is A → BC. Since its LHS is a subset of Closure(Z, S), we add B and C to the Closure set. Thus, Closure becomes {A, B, C}.

ii. The LHS of FD, E → CF, is not a subset of the Closure set. Therefore, no change in Closure set.

iii. In FD, B → E, the LHS is a subset of Closure, therefore add RHS to Closure. Thus Closure={A, B, C, E}.

iv. LHS of FD, CD →EF, CD is not a subset of the Closure, so no change.

b. We go round the ‘for’ loop once again 4 times. Closure set, evidently, does not change for first, third and fourth iterations. However, in second iteration it changes to include attribute F. It becomes now {A, B, C, E, F}.

c. We go round the ‘for’ loop once again. But this time there is no change in the Closure set and hence the algorithm terminates giving the result as {A, B}+={A, B, C, E, F}.

From the above algorithm two corollaries can be derived.

Corollary 1: An FD X → Y follows from a set of FD's S if and only if Y is a subset of the closure X+ of X under S. Thus, we can determine if an FD follows from S even without computing S+.

Corollary 2: A subset of attributes K of a relation R is a super-key of R if and only if the closure K+ of K under the given set of FD's is exactly the set of all attributes of R.
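The algorithm above translates almost line for line into code. Here is a Python sketch of it (our own transcription of this unit's pseudocode, with each FD written as a (LHS, RHS) pair of attribute strings):

def closure(Z, S):
    # Compute Z+, the closure of the attribute set Z under the FD set S.
    result = set(Z)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in S:
            # If the determinant lies inside the closure so far, absorb the RHS.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# The worked example: A -> BC, E -> CF, B -> E, CD -> EF.
S = [("A", "BC"), ("E", "CF"), ("B", "E"), ("CD", "EF")]
print(sorted(closure("AB", S)))            # ['A', 'B', 'C', 'E', 'F'], as derived above

# Corollary 2: {A, B} is a super-key only if its closure covers all of R.
print(closure("AB", S) == set("ABCDEF"))   # False: D is never reached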

Irreducible Sets of Dependencies

Sometimes we may have two sets of FD’s S1 and S2 such that every FD implied by S1 is also implied by S2 (i.e. S1+ ⊂ S2+). When this happens, S2 is called a cover of S1. The implication of this fact is that if the DBMS enforces the FD’s in S2, then it will automatically enforce the FD’s in S1.

If it so happens that S1+ ⊂ S2+ and S2+ ⊂ S1+ (i.e. S1+ = S2+) then S1 and S2 are said to be equivalent. In this case, if the FD's of S1 are enforced it implies that the FD's of S2 are also enforced, and vice versa.

A set of FDs is said to be irreducible (also called minimal) if and only if it satisfies the following three properties:

1. The RHS (the dependent) of every FD in S consists of just one attribute (i.e. is a singleton set).

2. No attribute of the LHS (the determinant) can be removed without changing the closure of the set or its equivalents. We term this left-irreducible.

3. No FD of the set can be discarded without changing the closure S+ or its equivalents.

For example, consider the following FD's in relation R:

1. A → B

   A → C

   A → D

   A → E

The RHS of each of the FD's is a singleton. Also, in each case the LHS is obviously irreducible in turn, and none of the FD's can be discarded without changing the closure (i.e., without losing some information). The above set of FD's has all the three properties and therefore is irreducible.

The following sets of FDs are not irreducible for the stated reasons.

2. A → {A, B}    The RHS is not a singleton set.

   A → C

   A → D

3. {A, B} → C    This FD can be simplified by dropping B from the left-hand side without changing the closure (i.e., it is not left-irreducible).

   A → B

   A → D

   A → E

4. A → A         The first FD here can be discarded without changing the closure.

   A → B

   A → C

   A → D

   A → E

Now, the claim is that for every set of FDs, there exists at least one equivalent set that is irreducible. This is easy to see. Let there be a set of FDs, S. By the decomposition axiom, we can assume without loss of generality that every FD in S has a singleton right-hand side. Also, for each FD f in S, we examine each attribute A in the LHS of f; if deleting A from the LHS of f has no effect on the closure S+, we delete A from the LHS of f. Then, for each FD f remaining in S, if deleting f from S has no effect on the closure S+, we delete f from S. The final set S is irreducible and is equivalent to the original set.

Example: Suppose we are given relation R with attributes A, B, C, D and FDs

A → BC

B → C

A → B

AB → C

AC → D

We now compute an irreducible set of FDs that is equivalent to this given set.

1. The first step is to rewrite the FDs such that each one has a singleton RHS. Thus,

A → B

A → C

B → C

A → B

AB → C

AC → D

There is a double occurrence of the FD A → B, so one occurrence can be eliminated.

2. The attribute C can be eliminated from the LHS of AC → D, because we have A → C, so by augmentation A → AC, and we are given AC → D, so A → D by transitivity; thus the C on the LHS of AC → D is redundant.

3. Next, we observe that the FD AB → C can be eliminated, because again we have A → C, so AB → CB by augmentation, and hence AB → C by decomposition.

4. Finally, the FD A → C is implied by the FDs A → B and B → C, so it can also be eliminated. We are left with:

A → B

B → C

A → D

This is the required irreducible set.
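The reduction just carried out by hand can also be mechanised. A Python sketch follows (our own illustration; it assumes the closure() function from the earlier sketch is in scope, and the name irreducible is ours):

def irreducible(S):
    # Step 1: rewrite every FD so its right-hand side is a singleton.
    fds = [(lhs, a) for lhs, rhs in S for a in rhs]
    # Step 2: drop extraneous attributes from each left-hand side.
    reduced = []
    for lhs, rhs in fds:
        lhs = set(lhs)
        for a in sorted(set(lhs)):
            smaller = lhs - {a}
            if a in lhs and smaller and rhs in closure(smaller, fds):
                lhs = smaller
        reduced.append((frozenset(lhs), rhs))
    # Step 3: drop duplicates, then drop every FD implied by the rest.
    result = list(dict.fromkeys(reduced))
    for fd in list(result):
        rest = [f for f in result if f != fd]
        if fd[1] in closure(fd[0], rest):
            result.remove(fd)
    return result

S = [("A", "BC"), ("B", "C"), ("A", "B"), ("AB", "C"), ("AC", "D")]
for lhs, rhs in irreducible(S):
    print("".join(sorted(lhs)), "->", rhs)   # prints A -> B, B -> C, A -> D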

Student Activity 4.4

Before reading the next section, answer the following questions:

1. What do you understand by functional dependency?

2. What is irreducible set of dependencies?

If your answers are correct, then proceed to the next section.

Normalisation

While designing a database, usually a data model is translated into relational schema. The important question is whether there is a design methodology or is the process arbitrary. A simple answer to this question is affirmative. There are certain properties that a good database design must possess as dictated by Codd’s rules.

There are many different ways of designing good database. One of such methodologies is the method involving ‘Normalization’.

Normalization theory is built around the concept of normal forms. Normalization reduces redundancy. Redundancy is unnecessary repetition of data. It can cause problems with storage and retrieval of data. During the process of normalization, dependencies can be identified, which can cause problems during deletion and updation. Normalization theory is based on the fundamental notion of Dependency. Normalization helps in simplifying the structure of schema and tables.

For the purpose of illustration of the normal forms, we will take an example of a database of the following logical design:

Relation S { S#, SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY}, Primary Key{S#}

Relation P { P#, PARTNAME, PARTCOLOR, PARTWEIGHT, SUPPLYCITY}, Primary Key{P#}

Relation SP { S#, SUPPLYCITY, P#, PARTQTY}, Primary Key{S#, P#}

Foreign Key{S#} Reference S


Foreign Key{P#} Reference P

Now, what prompts the designer to make schema this way? Is this the only design? Is this the most appropriate? Could it be better if we modify it? If yes, then how? There are many such questions that a designer has to ask and answer. Let us see what are the problems we might face if we continue with this design. First of all let us insert some tuples into the table SP.

SP

S# SUPPLYCITY P# PARTQTY

S1 Delhi P1 3000

S1 Delhi P2 2000

S1 Delhi P3 4000

S1 Delhi P4 2000

S1 Delhi P5 1000

S1 Delhi P6 1000

S2 Mumbai P1 3000

S2 Mumbai P2 4000

S3 Mumbai P2 2000

S4 Delhi P2 2000

S4 Delhi P4 3000

S4 Delhi P5 4000

Let us examine the table above to find any design discrepancy. A quick glance reveals that some of the data is being repeated. That is data redundancy, which is of course undesirable. The fact that a particular supplier is located in a city has been repeated many times. This redundancy causes many other related problems. For instance, after an update a supplier may be displayed to be from Delhi in one entry while from Mumbai in another. This further gives rise to many other problems.

Therefore, for the above reasons, the tables need to be refined. This process of refinement of a given schema into another schema or a set of schema possessing qualities of a good database is known as Normalization.

Database experts have defined a series of Normal forms each conforming to some specified design quality condition(s). We shall restrict ourselves to the first five normal forms for the simple reason of simplicity. Each next level of normal form adds another condition. It is interesting to note that the process of normalization is reversible. The following diagram depicts the relation between various normal forms.

(Figure: the normal forms nest inside one another, with 1NF outermost and 5NF innermost: 1NF ⊇ 2NF ⊇ 3NF ⊇ 4NF ⊇ 5NF)

The diagram implies that a relation in 5th Normal form is also in 4th Normal form, which itself is in 3rd Normal form, and so on. These normal forms are not the only ones. There may be 6th, 7th and nth normal forms, but this is not of our concern at this stage.

Before we embark on normalization, however, there are a few more concepts that should be understood.

Decomposition

Decomposition is the process of splitting a relation into two or more relations. This is nothing but the projection process.

Decompositions may or may not lose information. As you will learn shortly, the normalization process involves breaking a given relation into one or more relations, and these decompositions should be reversible as well, so that no information is lost in the process. Thus, we will be interested more in the decompositions that incur no loss of information than in the ones in which information is lost.

Lossless decomposition: The decomposition which results in relations without losing any information is known as lossless decomposition or nonloss decomposition. The decomposition that results in loss of information is known as lossy decomposition.

Consider the relation S{S#, SUPPLYSTATUS, SUPPLYCITY} with some instances of the entries as shown below.

S    S#   SUPPLYSTATUS   SUPPLYCITY
     S3   100            Delhi
     S5   100            Mumbai

Let us decompose this table into two as shown below:

(1)  SX   S#   SUPPLYSTATUS        SY   S#   SUPPLYCITY
          S3   100                      S3   Delhi
          S5   100                      S5   Mumbai

(2)  SX   S#   SUPPLYSTATUS        SY   SUPPLYSTATUS   SUPPLYCITY
          S3   100                      100            Delhi
          S5   100                      100            Mumbai

Let us examine these decompositions. In decomposition (1) no information is lost. We can still say that S3’s status is 100 and location is Delhi and also that supplier S5 has 100 as its status and location Mumbai. This decomposition is therefore lossless.

In decomposition (2), however, we can still say that status of both S3 and S5 is 100. But the location of suppliers cannot be determined by these two tables. The information regarding the location of the suppliers has been lost in this case. This is a lossy decomposition.

Certainly, lossless decomposition is more desirable because otherwise the decomposition will be irreversible. The decomposition process is in fact projection, where some attributes are selected from a table.


A natural question arises here: why is the first decomposition lossless while the second one is lossy? How should a given relation be decomposed so that the resulting projections are nonlossy? The answer to these questions lies in functional dependencies and may be given by the following theorem.

Heath’s theorem: Let R{A, B, C} be a relation, where A, B and C are sets of attributes. If R satisfies the FD A→B, then R is equal to the join of its projections on {A, B} and {A, C}.

Let us apply this theorem on the decompositions described above. We observe that relation S satisfies two irreducible sets of FD’s

S# → SUPPLYSTATUS

S# → SUPPLYCITY

Now taking A as S#, B as SUPPLYSTATUS, and C as SUPPLYCITY, this theorem confirms that relation S can be nonloss decomposition into its projections on {S#, SUPPLYSTATUS} and {S#, SUPPLYCITY} . Note, however, that the theorem does not say why projections {S#, SUPPLYSTATUS} and {SUPPLYSTATUS, SUPPLYCITY} should be lossy. Yet we can see that one of the FD’s is lost in this decomposition. While the FD S#→SUPPLYSTATUS is still represented by projection on {S#, SUPPLYSTATUS}, but the FD S#→SUPPLYCITY has been lost.

An alternative criterion for lossless decomposition is as follows. Let R be a relation schema, and let F be a set of functional dependencies on R. Let R1 and R2 form a decomposition of R. This decomposition is a lossless-join decomposition of R if at least one of the following functional dependencies is in F+:

R1 ∩ R2 → R1

R1 ∩ R2 → R2
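To make the idea concrete, here is a small SQL sketch of decomposition (1) and its reversal. The table names and Oracle-style column types are illustrative only (Oracle permits # in identifiers); treat this as a sketch, not a prescribed schema.

-- Illustrative tables for relation S and its two projections.
CREATE TABLE S  (S# CHAR(2), SUPPLYSTATUS NUMBER(3), SUPPLYCITY VARCHAR2(20));
CREATE TABLE SX (S# CHAR(2), SUPPLYSTATUS NUMBER(3));
CREATE TABLE SY (S# CHAR(2), SUPPLYCITY VARCHAR2(20));

-- Decomposition (1): project S onto {S#, SUPPLYSTATUS} and {S#, SUPPLYCITY}.
INSERT INTO SX SELECT DISTINCT S#, SUPPLYSTATUS FROM S;
INSERT INTO SY SELECT DISTINCT S#, SUPPLYCITY FROM S;

-- Reversal: joining the projections over S# returns exactly the rows of S,
-- because S# -> SUPPLYSTATUS and S# -> SUPPLYCITY hold (Heath's theorem).
SELECT SX.S#, SX.SUPPLYSTATUS, SY.SUPPLYCITY
FROM SX, SY
WHERE SX.S# = SY.S#;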

Functional Dependency Diagrams: This is a handy tool for representing the functional dependencies existing in a relation.

The diagram is very useful for visualizing the FDs in a relation at a glance. Later in the unit you will learn how to use this diagram for normalization purposes.


First Normal Form

A relation is in 1st Normal form (1NF) if and only if, in every legal value of that relation, every tuple contains exactly one value for each attribute.

Although simplest, 1NF relations suffer from a number of anomalies, and therefore 1NF is not the most desirable form of a relation.

[FD diagrams for the suppliers-and-parts relations: S# → SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY; {S#, P#} → PARTQTY; P# → PARTNAME, PARTCOLOR, PARTWEIGHT, SUPPLYCITY]


Let us take a relation (modified to illustrate the point in discussion) as

Rel1{S#, SUPPLYSTATUS, SUPPLYCITY, P#, PARTQTY} Primary Key{S#, P#}

FD{SUPPLYCITY → SUPPLYSTATUS}

Note that SUPPLYSTATUS is functionally dependent on SUPPLYCITY, meaning that a supplier's status is determined by the location of that supplier; e.g. all suppliers from Delhi must have a status of 200. The primary key of the relation Rel1 is {S#, P#}. The FD diagram is shown below:

[FD diagram for Rel1: {S#, P#} → PARTQTY; S# → SUPPLYCITY; SUPPLYCITY → SUPPLYSTATUS]

For a good design the diagram should have arrows out of candidate keys only. The additional arrows cause trouble.

Let us discuss some of the problems with this 1NF relation. For the purpose of illustration, let us insert some sample tuples into this relation.

REL1 S# SUPPLYSTATUS SUPPLYCITY P# PARTQTY

S1 200 Delhi P1 3000

S1 200 Delhi P2 2000

S1 200 Delhi P3 4000

S1 200 Delhi P4 2000

S1 200 Delhi P5 1000

S1 200 Delhi P6 1000

S2 100 Mumbai P1 3000

S2 100 Mumbai P2 4000

S3 100 Mumbai P2 2000

S4 200 Delhi P2 2000

S4 200 Delhi P4 3000

S4 200 Delhi P5 4000

The redundancies in the above relation cause many problems, usually known as update anomalies, that is, problems with INSERT, DELETE and UPDATE operations. Let us see these problems due to the supplier-city redundancy corresponding to the FD S#→SUPPLYCITY.

INSERT: In this relation, unless a supplier supplies at least one part, we cannot insert the information regarding a supplier. Thus, a supplier located in Kolkata is missing from the relation because he has not supplied any part so far.

DELETE: Let us see what problem we may face during deletion of a tuple. If we delete the tuple of a supplier (if there is a single entry for that supplier), we not only delete the fact that the supplier supplied a



particular part but also the fact that the supplier is located in a particular city. In our case, if we delete the entries corresponding to S#=S2, we lose the information that the supplier is located at Mumbai. This is definitely undesirable. The problem here is that there is too much information attached to each tuple; therefore deletion forces us to lose too much information.

UPDATE: If we modify the city of supplier S1 to Mumbai from Delhi, we have to make sure that all the entries corresponding to S#=S1 are updated; otherwise inconsistency will be introduced. As a result some entries will suggest that the supplier is located at Delhi while others will contradict this fact.
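As a sketch of this anomaly in SQL (the statement below is illustrative, not prescribed syntax from the text), note that every one of S1's six tuples must be touched at once, and that SUPPLYSTATUS must change in step with SUPPLYCITY, since Mumbai's status is 100 in the sample data:

-- All six tuples for S1 change together; a partial update (e.g. several
-- statements, one of which is forgotten) would leave the relation inconsistent.
UPDATE REL1
SET SUPPLYCITY = 'Mumbai', SUPPLYSTATUS = 100
WHERE S# = 'S1';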


Second Normal Form

A relation is in 2NF if and only if it is in 1NF and every nonkey attribute is fully functionally dependent on the primary key. Here it has been assumed that there is only one candidate key, which is of course the primary key.

A relation in 1NF can always be decomposed into an equivalent set of 2NF relations. The reduction process consists of replacing the 1NF relation by suitable projections.

We have seen the problems arising due to the under-normalization (1NF) of the relation. The remedy is to break the relation into two simpler relations:

REL2{S#, SUPPLYSTATUS, SUPPLYCITY} and

REL3{S#, P#, PARTQTY}

The FD diagrams and sample relations are shown below.

[FD diagrams. REL2: S# → SUPPLYCITY → SUPPLYSTATUS; REL3: {S#, P#} → PARTQTY]

REL2 REL3

S# SUPPLYSTATUS SUPPLYCITY S# P# PARTQTY

S1 200 Delhi S1 P1 3000

S2 100 Mumbai S1 P2 2000

S3 100 Mumbai S1 P3 4000

S4 200 Delhi S1 P4 2000

S5 300 Kolkata S1 P5 1000

S1 P6 1000

S2 P1 3000

S2 P2 4000

S3 P2 2000



S4 P2 2000

S4 P4 3000

S4 P5 4000

REL2 and REL3 are in 2NF, with primary keys {S#} and {S#, P#} respectively. This is because every nonkey attribute of REL2, i.e. SUPPLYSTATUS and SUPPLYCITY, is fully functionally dependent on the primary key {S#}. By a similar argument, REL3 is also in 2NF.

Evidently, these two relations have overcome all the update anomalies stated earlier.

Now it is possible to insert the facts regarding supplier S5 even when he has not supplied any part, which was earlier not possible; this solves the insert problem. Similarly, the delete and update problems are also solved.
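For instance, a sketch of the insert that 2NF makes possible, assuming REL2 exists as a SQL table named after the relation:

-- Supplier S5 can now be recorded even though no shipment row exists for it.
INSERT INTO REL2 (S#, SUPPLYSTATUS, SUPPLYCITY)
VALUES ('S5', 300, 'Kolkata');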

These relations in 2NF are still not free from all anomalies. REL3 is free from most of the problems we are going to discuss here; REL2, however, still carries some problems. The reason is that the dependency of SUPPLYSTATUS on S#, though functional, is transitive via SUPPLYCITY. Thus we see that there are two dependencies, S#→SUPPLYCITY and SUPPLYCITY→SUPPLYSTATUS, which together imply S#→SUPPLYSTATUS. This relation has a transitive dependency. We will see that this transitive dependency gives rise to another set of anomalies.

INSERT: We are unable to insert the fact that a particular city has a particular status until we have some supplier actually located in that city.

DELETE: If we delete sole REL2 tuple for a particular city, we delete the information that that city has that particular status.

UPDATE: The status for a given city still has redundancy. This causes the usual redundancy problems related to updating.


Third Normal Form

A relation is in 3NF if and only if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key.

To convert the 2NF relation into 3NF, once again, the REL2 is split into two simpler relations – REL4 and REL5 as shown below.

REL4{S#, SUPPLYCITY} and

REL5{SUPPLYCITY, SUPPLYSTATUS}

The FD diagrams and sample relations are shown below.

[FD diagrams. REL4: S# → SUPPLYCITY; REL5: SUPPLYCITY → SUPPLYSTATUS]

REL4 REL5

S# SUPPLYCITY SUPPLYCITY SUPPLYSTATUS

S1 Delhi Delhi 200

S2 Mumbai Mumbai 100



S3 Mumbai Kolkata 300

S4 Delhi

S5 Kolkata

Evidently, the above relations REL4 and REL5 are in 3NF, because there are no transitive dependencies. Every 2NF relation can be reduced to 3NF by decomposing it further and removing any transitive dependency.
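Because the decomposition is lossless, REL2 can always be recovered when needed. A minimal sketch, assuming the projections exist as tables and using a hypothetical view name:

-- Rejoining the 3NF projections reconstructs the 2NF relation REL2.
CREATE VIEW REL2_VIEW AS
SELECT REL4.S#, REL5.SUPPLYSTATUS, REL4.SUPPLYCITY
FROM REL4, REL5
WHERE REL4.SUPPLYCITY = REL5.SUPPLYCITY;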

Optimal Decomposition

The reduction process may suggest a variety of ways in which a relation may be decomposed losslessly. Consider REL2, in which there was a transitive dependency; we therefore split it into two 3NF projections, i.e.

REL4{S#, SUPPLYCITY} and

REL5{SUPPLYCITY, SUPPLYSTATUS}

Let us call this decomposition decomposition-1. An alternative decomposition may be:

REL4{S#, SUPPLYCITY} and

REL5{S#, SUPPLYSTATUS}

which we will call decomposition-2.

Both decompositions, decomposition-1 and decomposition-2, are in 3NF and lossless. However, decomposition-2 is less satisfactory than decomposition-1. For example, it is still not possible to insert the information that a particular city has a particular status unless some supplier is located in the city.

In decomposition-1 the two projections are independent of each other, but the same is not true in the second decomposition. Here independence is in the sense that updates can be made to either relation without regard to the other, provided the update is legal. Independent decompositions also preserve the dependencies of the database; no dependency is lost in the decomposition process.

The concept of independent projections thus provides a criterion for choosing a particular decomposition when there is more than one choice.


Boyce-Codd Normal Form

The previous normal forms assumed that there was just one candidate key in the relation and that that key was also the primary key. Another class of problems arises when this is not the case. Very often there will be more than one candidate key in practical database design situations. To be precise, 1NF, 2NF and 3NF did not deal adequately with the case of relations that:

1. Had two or more candidate keys, and that

2. The candidate keys were composite, and

3. They overlapped (i.e. had at least one attribute common).

A relation is in BCNF (Boyce-Codd Normal Form) if and only if every nontrivial, left-irreducible FD has a candidate key as its determinant.

Or

A relation is in BCNF if and only if all the determinants are candidate keys.


In other words, the only arrows in the FD diagram are arrows out of candidate keys. It has already been explained that there will always be arrows out of candidate keys; the BCNF definition says there are no others, meaning there are no arrows that can be eliminated by the normalization procedure.

These two definitions are apparently different from each other. The difference between them is that in the second, simpler definition we tacitly assume that determinants are "not too big" and that all FDs are nontrivial.

It should be noted that the BCNF definition is conceptually simpler than the old 3NF definition, in that it makes no explicit reference to first and second normal forms as such, nor to the concept of transitive dependence. Furthermore, although BCNF is strictly stronger than 3NF, it is still the case that any given relation can be nonloss decomposed into an equivalent collection of BCNF relations.

Thus, relations REL1 and REL2 which were not in 3NF, are not in BCNF either; also that relations REL3, REL4, and REL5, which were in 3NF, are also in BCNF. Relation REL1 contains three determinants, namely {S#}, {SUPPLYCITY}, and {S#, P#}; of these, only {S#, P#} is a candidate key, so REL1 is not in BCNF. Similarly, REL2 is not in BCNF either, because the determinant {SUPPLYCITY} is not a candidate key. Relations REL3, REL4, and REL5, on the other hand, are each in BCNF, because in each case the sole candidate key is the only determinant in the respective relations.

We now consider an example involving two disjoint (i.e., nonoverlapping) candidate keys. Suppose that in the usual suppliers relation REL1{S#, SUPPLIERNAME, SUPPLYSTATUS, SUPPLYCITY}, {S#} and {SUPPLIERNAME} are both candidate keys (i.e., for all time, it is the case that every supplier has a unique supplier number and also a unique supplier name). Assume, however, that attributes SUPPLYSTATUS and SUPPLYCITY are mutually independent, i.e., the FD SUPPLYCITY→SUPPLYSTATUS no longer holds. Then the FD diagram is as shown below.

[FD diagram: S# and SUPPLIERNAME each determine the other, and each determines SUPPLYSTATUS and SUPPLYCITY]

Relation REL1 is in BCNF. Although the FD diagram does look "more complex" than a 3NF diagram, it is nevertheless still the case that the only determinants are candidate keys; i.e., the only arrows are arrows out of candidate keys. So the message of this example is just that having more than one candidate key is not necessarily bad.

For illustration we will assume that in our relations supplier names are unique. Consider REL6.

REL6{ S#, SUPPLIERNAME, P#, PARTQTY }.

Since it contains two determinants, S# and SUPPLIERNAME, that are not candidate keys for the relation, this relation is not in BCNF. A sample snapshot of this relation is shown below:

REL6  S#   SUPPLIERNAME   P#   PARTQTY
      S1   Vinod          P1   3000
      S1   Vinod          P2   2000
      S1   Vinod          P3   4000
      S1   Vinod          P4   2000



As is evident from the figure above, relation REL6 involves the same kind of redundancies as did relations REL1 and REL2, and hence is subject to the same kind of update anomalies. For example, changing the name of supplier S1 from Vinod to Rahul leads, once again, either to search problems or to possibly inconsistent results. Yet REL6 is in 3NF by the old definition, because that definition did not require an attribute to be irreducibly dependent on each candidate key if it was itself a component of some candidate key of the relation, and so the fact that SUPPLIERNAME is not irreducibly dependent on {S#, P#} was ignored.

The solution to the REL6 problems is, of course, to break the relation down into two projections, in this case the projections are:

REL7{S#, SUPPLIERNAME} and

REL8{S#, P#, PARTQTY}

Or

REL7{S#, SUPPLIERNAME} and

REL8{SUPPLIERNAME, P#, PARTQTY}

Both of these projections are in BCNF. The original design, consisting of the single relation REL6, is clearly bad; the problems with it are intuitively obvious, and it is unlikely that any competent database designer would ever seriously propose it, even if he or she had no exposure to the ideas of BCNF etc. at all.
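A hedged sketch of the first decomposition as Oracle-style DDL (types and names are illustrative only). Declaring SUPPLIERNAME unique in REL7 records the fact that both determinants are candidate keys there:

CREATE TABLE REL7 (
  S#           CHAR(2)      PRIMARY KEY,
  SUPPLIERNAME VARCHAR2(30) NOT NULL UNIQUE  -- S# and SUPPLIERNAME determine each other
);

CREATE TABLE REL8 (
  S#      CHAR(2) REFERENCES REL7,
  P#      CHAR(2),
  PARTQTY NUMBER,
  PRIMARY KEY (S#, P#)  -- the only determinant is a candidate key: BCNF
);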

Comparison of 3NF and BCNF

We have seen two normal forms for relational-database schemas: 3NF and BCNF. There is an advantage to 3NF in that we know that it is always possible to obtain a 3NF design without sacrificing a lossless join or dependency preservation. Nevertheless, there is a disadvantage to 3NF: if we do not eliminate all transitive dependencies, we may have to use null values to represent some of the possible meaningful relationships among data items, and we face the problem of repetition of information.

If we are forced to choose between BCNF and dependency preservation with 3NF, it is generally preferable to opt for 3NF. If we cannot test for dependency preservation efficiently, we either pay a high penalty in system performance or risk the integrity of the data in our database. Neither of these alternatives is attractive. With such alternatives, the limited amount of redundancy imposed by transitive dependencies allowed under 3NF is the lesser evil. Thus, we normally choose to retain dependency preservation and to sacrifice BCNF.

In summary, we repeat that our three design goals for a relational-database design are

1. BCNF

2. Lossless join

3. Dependency preservation

If we cannot achieve all three, we accept

1. 3NF

2. Lossless join

3. Dependency preservation

Student Activity 4.5


Before reading the next section, answer the following questions:

1. What do you understand by Normalisation?

2. Why Normalisation of Database is required?

3. Write short notes on the following:

a. First Normal Form

b. Second Normal Form

c. Third Normal Form

4. Discuss the difference between BCNF and 3NF

If your answers are correct, then proceed to the next section.

Fourth Normal Form

So far we have been normalizing relations based on their functional dependencies. However, FDs are not the only type of dependency found in relations; other types give rise to their own characteristic anomalies.

There is another class of higher normal forms (4th and 5th) that revolves around two other types of dependencies: multi-valued dependencies (MVDs) and join dependencies (JDs).

Multi-valued Dependencies

Multi-valued dependency may be formally defined as:

Let R be a relation, and let A, B, and C be subsets of the attributes of R. Then we say that B is multi-dependent on A - in symbols,

A →→B

(read "A multi-determines B," or simply "A double-arrow B") if and only if, in every possible legal value of R, the set of B values matching a given (A value, C value) pair depends only on the A value and is independent of the C value.

To elucidate the meaning of the above statement, let us take an example relation REL8, as shown below:

REL8  COURSE        TEACHERS          BOOKS

      Computer      TEACHER           BOOK
                    Dr. Wadhwa        Graphics
                    Prof. Mittal      UNIX

      Mathematics   TEACHER           BOOK
                    Prof. Saxena      Relational Algebra
                    Prof. Karmeshu    Discrete Maths

Assume that for a given course there can exist any number of corresponding teachers and any number of corresponding books. Moreover, let us also assume that teachers and books are quite independent of one


another; that is, no matter who actually teaches any particular course, the same books are used. Finally, also assume that a given teacher or a given book can be associated with any number of courses.

Let us try to eliminate the relation-valued attributes. One way to do this is simply to replace relation REL8 by a relation REL9 with three scalar attributes COURSE, TEACHER, and BOOK as indicated below.

REL9  COURSE        TEACHER           BOOK
      Computer      Dr. Wadhwa        Graphics
      Computer      Dr. Wadhwa        UNIX
      Computer      Prof. Mittal      Graphics
      Computer      Prof. Mittal      UNIX
      Mathematics   Prof. Saxena      Relational Algebra
      Mathematics   Prof. Saxena      Discrete Maths
      Mathematics   Prof. Karmeshu    Relational Algebra
      Mathematics   Prof. Karmeshu    Discrete Maths

As you can see from the relation, each tuple of REL8 gives rise to m * n tuples in REL9, where m and n are the cardinalities of the TEACHERS and BOOKS relations in that REL8 tuple. Note that the resulting relation REL9 is "all key".

The meaning of relation REL9 is basically as follows: A tuple {COURSE:c, TEACHER:t, BOOK:x} appears in REL9 if and only if course c can be taught by teacher t and uses book x as a reference. Observe that, for a given course, all possible combinations of teacher and book appear: that is, REL9 satisfies the (relation) constraint

if tuples (c, t1, x1), (c, t2, x2) both appear

then tuples (c, t1, x2), (c, t2, x1) both appear also

Now, it should be apparent that relation REL9 involves a good deal of redundancy, leading as usual to certain update anomalies. For example, to add the information that the Computer course can be taught by a new teacher, it is necessary to insert two new tuples, one for each of the two books. Can we avoid such problems? Well, it is easy to see that:

1. The problems in question are caused by the fact that teachers and books are completely independent of one another;

2. Matters would be much improved if REL9 were decomposed into its two projections call them REL10 and REL11 - on {COURSE, TEACHER} and {COURSE, BOOK}, respectively.

To add the information that the Computer course can be taught by a new teacher, all we have to do now is insert a single tuple into relation REL10. Thus, it does seem reasonable to suggest that there should be a way of "further normalizing" a relation like REL9.
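A sketch of the contrast in SQL (the new teacher's name is hypothetical, and the tables are assumed to mirror the relations above):

-- In REL9, one new fact about Computer's teachers costs two rows:
INSERT INTO REL9 VALUES ('Computer', 'Prof. Gupta', 'Graphics');
INSERT INTO REL9 VALUES ('Computer', 'Prof. Gupta', 'UNIX');

-- After decomposition into REL10{COURSE, TEACHER} and REL11{COURSE, BOOK},
-- the same fact is a single row:
INSERT INTO REL10 VALUES ('Computer', 'Prof. Gupta');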

It is obvious that the design of REL9 is bad and the decomposition into REL10 and REL11 is better. The trouble is, however, these facts are not formally obvious. Note in particular that REL9 satisfies no functional dependencies at all (apart from trivial ones such as COURSE → COURSE); in fact, REL9 is in


BCNF, since as already noted it is all key; any "all key" relation must necessarily be in BCNF. (Note that the two projections REL10 and REL11 are also all key and hence in BCNF.) The ideas of the previous normalization steps are therefore of no help with the problem at hand.

The existence of "problem" BCNF relations like REL9 was recognized very early on, and the way to deal with them was also soon understood, at least intuitively. However, it was not until 1977 that these intuitive ideas were put on a sound theoretical footing by Fagin's introduction of the notion of multi-valued dependencies, MVDs. Multi-valued dependencies are a generalization of functional dependencies, in the sense that every FD is an MVD, but the converse is not true (i.e., there exist MVDs that are not FDs). In the case of relation REL9 there are two MVDs that hold:

COURSE →→ TEACHER

COURSE →→ BOOK

Note the double arrows; the MVD A→→B is read as "B is multi-dependent on A" or, equivalently, "A multi-determines B." Let us concentrate on the first MVD, COURSE→→TEACHER. Intuitively, what this MVD means is that, although a course does not have a single corresponding teacher (i.e., the functional dependence COURSE→TEACHER does not hold), nevertheless each course does have a well-defined set of corresponding teachers. By "well-defined" here we mean, more precisely, that for a given course c and a given book x, the set of teachers t matching the pair (c, x) in REL9 depends on the value c alone; it makes no difference which particular value of x we choose. The second MVD, COURSE→→BOOK, is interpreted analogously.

It is easy to show that, given the relation R{A, B, C}, the MVD A→→B holds if and only if the MVD A→→C also holds. MVDs always go together in pairs in this way. For this reason it is common to represent them both in one statement, thus:

COURSE→→TEACHER | BOOK

Now, we stated above that multi-valued dependencies are a generalization of functional dependencies, in the sense that every FD is an MVD. More precisely, an FD is an MVD in which the set of dependent (right-hand side) values matching a given determinant (left-hand side) value is always a singleton set. Thus, if A→B, then certainly A→→B.

Returning to our original REL9 problem, we can now see that the trouble with relations such as REL9 is that they involve MVDs that are not also FDs. (In case it is not obvious, we point out that it is precisely the existence of those MVDs that leads to the necessity of, for example, inserting two tuples to add another Computer teacher. Those two tuples are needed in order to maintain the integrity constraint that is represented by the MVD.) The two projections REL10 and REL11 do not involve any such MVDs, which is why they represent an improvement over the original design. We would therefore like to replace REL9 by those two projections, and an important theorem proved by Fagin allows us to make exactly that replacement:

Theorem (Fagin): Let R{A, B, C} be a relation, where A, B, and C are sets of attributes. Then R is equal to the join of its projections on {A, B} and {A, C} if and only if R satisfies the MVDs A→→B | C.

At this stage we are equipped to define fourth normal form:

Fourth normal form: Relation R is in 4NF if and only if, whenever there exist subsets A and B of the attributes of R such that the nontrivial MVD A→→B is satisfied (an MVD A→→B is trivial if either A is a superset of B or the union of A and B is the entire heading), then all attributes of R are also functionally dependent on A.


In other words, the only nontrivial dependencies (FDs or MVDs) in R are of the form Y→X (i.e., a functional dependency from a superkey Y to some other attribute X). Equivalently: R is in 4NF if it is in BCNF and all MVDs in R are in fact "FDs out of keys." It follows that 4NF implies BCNF.

Relation REL9 is not in 4NF, since it involves an MVD that is not an FD at all, let alone an FD "out of a key." The two projections REL10 and REL11 are both in 4NF, however. Thus 4NF is an improvement over BCNF, in that it eliminates another form of undesirable dependency. What is more, 4NF is always achievable; that is, any relation can be nonloss decomposed into an equivalent collection of 4NF relations.

You may recall that a relation R{A, B, C} satisfying the FDs A→B and B→C is better decomposed into its projections on {A, B} and {B, C} rather than into those on {A, B} and {A, C}. The same holds true if we replace the FDs by the MVDs A→→B and B→→C.


Fifth Normal Form

It seems from our discussion so far that the sole operation necessary or available in the further normalization process is the replacement of a relation in a nonloss way by exactly two of its projections. This assumption has successfully carried us as far as 4NF. It comes perhaps as a surprise, therefore, to discover that there exist relations that cannot be nonloss-decomposed into two projections but can be nonloss-decomposed into three (or more). Using an unpleasant but convenient term, we will describe such a relation as "n-decomposable" (for some n > 2), meaning that the relation in question can be nonloss-decomposed into n projections but not into m for any m < n.

A relation that can be nonloss-decomposed into two projections we will call "2-decomposable", and the term "n-decomposable" may be defined similarly. The phenomenon of n-decomposability for n > 2 was first noted by Aho, Beeri, and Ullman. The particular case n = 3 was also studied by Nicolas.

Consider relation REL12 from the suppliers-parts-projects database, ignoring attribute QTY for simplicity for the moment. A sample snapshot of the same is shown below. It may be pointed out that relation REL12 is all key and involves no nontrivial FDs or MVDs at all, and is therefore in 4NF. The snapshot also shows:

a. The three binary projections REL13, REL14, and REL15 corresponding to the REL12 relation value displayed on the top section of the adjoining diagram;

b. The effect of joining the REL13 and REL14 projections (over P#);

c. The effect of joining that result and the REL15 projection (over J# and S#).

REL12  S#   P#   J#
       S1   P1   J2
       S1   P2   J1
       S2   P1   J1
       S1   P1   J1

REL13  S#   P#      REL14  P#   J#      REL15  J#   S#
       S1   P1             P1   J2             J2   S1
       S1   P2             P2   J1             J1   S1
       S2   P1             P1   J1             J1   S2


Join Dependency:

Let R be a relation, and let A, B, ..., Z be subsets of the attributes of R. Then we say that R satisfies the Join Dependency (JD)

*{ A, B, ..., Z}

(read "star A, B, ..., Z") if and only if every possible legal value of R is equal to the join of its projections on A, B, ..., Z.

For example, if we agree to use SP to mean the subset {S#, P#} of the set of attributes of REL12, and similarly for PJ and JS, then relation REL12 satisfies the JD *{SP, PJ, JS}.

We have seen, then, that relation REL12, with its JD *{SP, PJ, JS}, can be 3-decomposed. The question is, should it be? And the answer is "probably yes." Relation REL12 (with its JD) suffers from a number of problems over update operations, problems that are removed when it is 3-decomposed.
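As a sketch, the three-way join that reverses the decomposition can be written directly in SQL. Table and column names follow the snapshot above; this is illustrative, not prescribed:

-- Joining all three projections yields exactly the four tuples of REL12.
-- Joining only two of them (say REL13 and REL14) produces spurious tuples,
-- which the third projection eliminates.
SELECT DISTINCT REL13.S#, REL13.P#, REL14.J#
FROM REL13, REL14, REL15
WHERE REL13.P# = REL14.P#
AND REL14.J# = REL15.J#
AND REL15.S# = REL13.S#;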

Fagin's theorem, to the effect that R{A, B, C} can be nonloss-decomposed into its projections on {A, B} and {A, C} if and only if the MVDs A→→B and A→→C hold in R, can now be restated as follows:

R{A, B, C} satisfies the JD*{AB, AC} if and only if it satisfies the MVDs A→→B | C.

Since this theorem can be taken as a definition of multi-valued dependency, it follows that an MVD is just a special case of a JD, or (equivalently) that JDs are a generalization of MVDs.

Thus, to put it formally, we have

A→→B | C ≡ * {AB, AC}

Note that join dependencies are the most general form of dependency possible (using, of course, the term "dependency" in a very special sense). That is, there does not exist a still higher form of dependency such that JDs are merely a special case of that higher form, so long as we restrict our attention to dependencies that deal with a relation being decomposed via projection and recomposed via join.

Coming back to the running example, we can see that the problem with relation REL12 is that it involves a JD that is not an MVD, and hence not an FD either. We have also seen that it is possible, and probably desirable, to decompose such a relation into smaller components - namely, into the projections specified by the join dependency. That decomposition process can be repeated until all resulting relations are in fifth normal form, which we now define:

Fifth normal form: A relation R is in 5NF, also called projection-join normal form (PJ/NF), if and only if every nontrivial join dependency that holds for R is implied by the candidate keys of R.

Let us understand what it means for a JD to be "implied by candidate keys."

Relation REL12 is not in 5NF: it satisfies a certain join dependency, namely the JD *{SP, PJ, JS}, which is certainly not implied by its sole candidate key (that key being the combination of all of its attributes). Stated differently, relation REL12 is not in 5NF, because (a) it can be 3-decomposed and (b) that 3-decomposability is not implied by the fact that the combination {S#, P#, J#} is a candidate key. By contrast, after 3-decomposition, the three projections SP, PJ, and JS are each in 5NF, since they do not involve any (nontrivial) JDs at all.

Now let us understand, through an example, what it means for a JD to be implied by candidate keys. Suppose that the familiar suppliers relation REL1 has two candidate keys, {S#} and {SUPPLIERNAME}. Then that relation satisfies several join dependencies; for example, it satisfies the JD

*{ { S#, SUPPLIERNAME, SUPPLYSTATUS }, { S#, SUPPLYCITY } }


That is, relation REL1 is equal to the join of its projections on {S#, SUPPLIERNAME, SUPPLYSTATUS} and {S#, SUPPLYCITY}, and hence can be nonloss-decomposed into those projections. (This fact does not mean that it should be so decomposed, of course, only that it could be.) This JD is implied by the fact that {S#} is a candidate key (in fact it is implied by Heath's theorem). Likewise, relation REL1 also satisfies the JD

* {{S#, SUPPLIERNAME}, {S#, SUPPLYSTATUS}, {SUPPLIERNAME, SUPPLYCITY}}

This JD is implied by the fact that {S#} and {SUPPLIERNAME} are both candidate keys.

To conclude, we note that it follows from the definition that 5NF is the ultimate normal form with respect to projection and join (which accounts for its alternative name, projection-join normal form). That is, a relation in 5NF is guaranteed to be free of anomalies that can be eliminated by taking projections. For a relation in 5NF, the only join dependencies are those that are implied by candidate keys, and so the only valid decompositions are ones that are based on those candidate keys. (Each projection in such a decomposition will consist of one or more of those candidate keys, plus zero or more additional attributes.) For example, the suppliers relation REL1 is in 5NF. It can be further decomposed in several nonloss ways, as we saw earlier, but every projection in any such decomposition will still include one of the original candidate keys, and hence there does not seem to be any particular advantage in that further reduction.

The Normalization Procedure Summarized

In normalization of a relation, the basic idea is as follows:

Given some 1NF relation R and some set of FDs, MVDs, and JDs that apply to R, we systematically reduce R to a collection of "smaller" (i.e., lower-degree) relations that are equivalent to R in a certain well-defined sense but are also in some way more desirable. (The original relation might have been obtained by first eliminating certain relation-valued attributes)

The process is essentially an iterative refinement. Each step of the reduction process consists of taking projections of the relations resulting from the preceding step. The given constraints are used at each step to guide the choice of which projections to take next. The overall process can be stated informally as a set of rules, thus:

1. Take projections of the original 1NF relation to eliminate any FDs that are not irreducible. This step will produce a collection of 2NF relations.

2. Take projections of those 2NF relations to eliminate any transitive FDs. This step will produce a collection of 3NF relations.

3. Take projections of those 3NF relations to eliminate any remaining FDs in which the determinant is not a candidate key. This step will produce a collection of BCNF relations.

Rules 1-3 can be condensed into the single guideline "Take projections of the original relation to eliminate all FDs in which the determinant is not a candidate key"

4. Take projections of those BCNF relations to eliminate any MVDs that are not also FDs.

This step will produce a collection of 4NF relations. In practice this amounts to "separating independent RVAs" (relation-valued attributes), as explained in our discussion of the REL8 example.

5. Take projections of those 4NF relations to eliminate any JDs that are not implied by the candidate keys - though perhaps we should add "if you can find them." This step will produce a collection of relations in 5NF.

Student Activity 4.7


Answer the following questions.

1. What do you understand by Decomposition?

2. Discuss the various properties of Decomposition.

3. Write short notes on the following:

a. Dependency preservation decomposition.

b. Lossless-Join Decomposition.

Summary

• Normalization is a technique used to design tables in which data redundancies are minimized.

• The first three normal forms (1NF, 2NF and 3NF) are most commonly encountered.

• From a structural point of view, higher normal forms yield relatively fewer data redundancies in the database. In other words, 3NF is better than 2NF, which is better than 1NF.

• Almost all business designs use the 3NF as the ideal normal form. (A special, more restricted 3NF is known as Boyce-Codd normal form, or BCNF).

• A table is in 1NF when all the key attributes are defined and when all remaining attributes are dependent on the primary key. However, a table in 1NF can still contain both partial and transitive dependencies.

• A partial dependency is one in which an attribute is functionally dependent on only a part of a multi-attribute primary key.

• A transitive dependency is one in which one attribute is functionally dependent on another non-key attribute.

• A table with a single-attribute primary key cannot exhibit partial dependencies.

• A table is in 2NF if it is in 1NF and contains no partial dependencies.

• A 1NF table is also in 2NF if its primary key is based on only a single attribute.

• A table is in 3NF if it is in 2NF and contains no transitive dependencies.

• Boyce-Codd (BCNF) is a special case of 3NF in which all the determinant keys are also candidate keys.

• A 3NF table having a single candidate key is in BCNF.

Self-assessment Questions

Exercise 1

I. True or False

1. The data in the database is perceived by the user as a record.

2. If the constraints are defined within the column definition, it is called a table level constraint.

3. A primary key column cannot be of data type LONG or LONG RAW.

II. Fill in the Blanks


1. The relational model is an abstract theory of data that is based on certain aspects of _____________.

2. Oracle 7 creates an index on the columns of a _________.

3. You can use ______________syntax to define a referential integrity constraint in which the foreign key is made up of a single column.

4. For a table to be in the third normal form, the first condition is that it should also be in the ________normal form

5. When decomposing a relation into a number of smaller relations, it is crucial that the decomposition be _________.

Answers

I. True or False

1. False

2. False

3. True

II. Fill in the Blanks

1. mathematics

2. primary key

3. column_constraint

4. second

5. lossless.

Exercise 2

I. True or False

1. A database is arranged in tables, and a collection of tables is called a relational database.

2. A composite primary key is a foreign key made up of a combination of columns.

3. The table containing the foreign key is called the child table and the table containing the referenced key is called the parent table.

4. The repetition of information required by the use of our alternative design is desirable.

5. To have a lossless-join decomposition, we need not impose constraints on the set of possible relations.

II. Fill in the Blanks

a. A _________ constraint designates a column or combination of columns as the table's primary key.


b. A primary key column cannot be of data type ___________.

c. ____________ helps in reducing redundancy.

d. To determine whether these schemas are in BCNF, we need to determine what _____________ apply to them.

Review Questions

1. Discuss insertion, deletion, and modification anomalies. Why are they considered bad? Illustrate with examples.

2. Why are many nulls in a relation considered bad?

3. Discuss the problem of spurious tuples and how we may prevent it.

4. State the informal guidelines for relation schema design that we discuss. Illustrate how violation of these guidelines may be harmful.

5. What is a functional dependency? Who specifies the functional dependencies that hold among the attributes of a relation schema?

6. What are Armstrong’s inference rules?

7. What is meant by the closure of a set of functional dependencies?

8. When are two sets of functional dependencies equivalent? How can we determine their equivalence?

9. What does the term unnormalised relation refer to?

10. Define first, second and third normal forms.

11. Define Boyce-Codd normal form. How does it differ from 3NF? Why is it considered a stronger form of 3NF?

12. What is an irreducible set of dependencies? Suppose we are given relation R with attributes B, C, D, E, F, G, and the FDs

F → G

GF → E

B → CD

D → F

Find an irreducible set of FDs that is equivalent to this given set.

13. What is normalization? Normalize the following table up to 3NF.

Supplier

S_id   s_city     s_status   P_id
S1     Delhi      10         p1, p2
S2     Calcutta   20         p3
S3     Madras     30         p1, p5


Introduction Query Processor Query Processing Strategies Selections Involving Comparisons Query Optimization General Transformation Rules for Relational Algebra Operations Basic Algorithms for Executing Query Operations Locking Techniques for Concurrency Control Concurrency Control Based on Timestamp Ordering Multiversion Concurrency Control Techniques

Unit 5

Query Processing

Learning Objectives

After reading this unit you should appreciate the following:

• Introduction

• Query Processor

• General Strategies for Query Processing

• Query Optimization

• Concept of Security

• Concurrency

• Recovery


Introduction

In this chapter we discuss the techniques used by a DBMS to process, optimize, and execute high-level queries. A query expressed in a high-level query language such as SQL must first be scanned, parsed, and validated. The scanner identifies the language tokens—such as SQL keywords, attribute names, and relation names—in the text of the query, whereas the parser checks the query syntax to determine whether it is formulated according to the syntax rules (rules of grammar) of the query language. The query must also be validated, by checking that all attribute and relation names are valid and semantically meaningful names in the schema of the particular database being queried. An internal representation of the query is then created, usually as a tree data structure called a query tree. It is also possible to represent the query using a graph data structure called a query graph. The DBMS must then devise an execution strategy for retrieving the result of the query from the database files. A query typically has many possible execution strategies, and the process of choosing a suitable one for processing a query is known as query optimization.



Query Processor

Figure 5.1 shows the different steps of processing a high-level query. The query optimizer module has the task of producing an execution plan, and the code generator generates the code to execute the plan. The runtime database processor has the task of running the query code, whether in compiled or interpreted mode, to produce the query result. If a runtime error results, an error message is generated by the runtime database processor.

Figure 5.1: Steps of processing a high-level query

The term optimization is actually a misnomer because in some cases the chosen execution plan is not the optimal (best) strategy—it is just a reasonably efficient strategy for executing the query. Finding the optimal strategy is usually too time-consuming except for the simplest of queries, and may require information on how the files are implemented and even on the contents of the files—information that may not be fully available in the DBMS catalog. Hence, planning of an execution strategy may be a more accurate description than query optimization.


For lower-level navigational database languages in legacy systems, such as the network DML or the hierarchical HDML, the programmer must choose the query execution strategy while writing a database program. If a DBMS provides only a navigational language, there is limited need or opportunity for extensive query optimization by the DBMS; instead, the programmer is given the capability to choose the "optimal" execution strategy. On the other hand, a high-level query language—such as SQL for relational DBMSs (RDBMSs) or OQL for object DBMSs (ODBMSs)—is more declarative in nature because it specifies what the intended results of the query are, rather than the details of how the result should be obtained. Query optimization is thus necessary for queries that are specified in a high-level query language.


Query Processing Strategies

The steps involved in processing a query are illustrated in Figure 5.2. The basic steps are:

1. Parsing and translation

2. Optimization

3. Evaluation

Before query processing can begin, the system must translate the query into a usable form. A language such as SQL is suitable for human use, but is ill suited to be the system’s internal representation of a query. A more useful internal representation is one based on the extended relational algebra.

Figure 5.2: Steps in query processing

Thus, the first action the system must take in query processing is to translate a given query into its internal form. This translation process is similar to the work performed by the parser of a compiler. In generating the internal form of the query, the parser checks the syntax of the user's query, verifies that the relation names appearing in the query are names of relations in the database, and so on. A parse-tree representation of the query is constructed, which is then translated into a relational-algebra expression. If the query was expressed in terms of a view, the translation phase also replaces all uses of the view by the relational-algebra expression that defines the view. Parsing is covered in most compiler texts and is outside the scope of this book.

In the network and hierarchical models (discussed later), query optimization is left, for the most part, to the application programmer. That choice is made because the data-manipulation-language statements of these two models are usually embedded in a host programming language, and it is not easy to transform a network or hierarchical query into an equivalent one without knowledge of the entire application program. In contrast, relational-query languages are either declarative or algebraic. Declarative languages permit


users to specify what a query should generate without saying how the system should do the generating. Algebraic languages allow for algebraic transformation of users' queries. Based on the query specification, it is relatively easy for an optimizer to generate a variety of equivalent plans for a query, and to choose the least expensive one.

In this unit, we assume the relational model. We shall see that the algebraic basis provided by this model is of considerable help in query optimization. Given a query, there are generally a variety of methods for computing the answer. For example, we have seen that, in SQL, a query could be expressed in several different ways. Each SQL query can itself be translated into a relational-algebra expression in one of several ways. Furthermore, the relational-algebra representation of a query specifies only partially how to evaluate a query; there are usually several ways to evaluate relational-algebra expressions. As an illustration, consider the query

select balance

from account

where balance < 2500

This query can be translated into either of the following relational-algebra expressions:

σbalance<2500(Πbalance(account))

Πbalance(σbalance<2500(account))

Further, we can execute each relational-algebra operation using one of several different algorithms. For example, to implement the preceding selection, we can search every tuple in account to find tuples with balance less than 2500. If a tree index is available on the attribute balance, we can use the index instead to locate the tuples.

To specify fully how to evaluate a query, we need not only to provide the relational-algebra expression, but also to annotate it with instructions specifying how to evaluate each operation. Annotations may state the algorithm to be used for a specific operation, or the particular index or indices to use. A relational-algebra operation annotated with instructions on how to evaluate it is called an evaluation primitive. Several primitives may be grouped together into a pipeline, in which several operations are performed in parallel. A sequence of primitive operations that can be used to evaluate a query is a query-execution plan or query-evaluation plan. Figure 5.3 illustrates an evaluation plan for our example query, in which a particular index (denoted in the figure as "index I") is specified for the selection operation. The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.

The different evaluation plans for a given query can have different costs. We do not expect users to write their queries in a way that suggests the most efficient evaluation plan. Rather, it is the responsibility of the system to construct a query-evaluation plan that minimizes the cost of query evaluation. As we explained, the most relevant performance measure is usually the number of disk accesses.

Πbalance
|
σbalance<2500 (use index I)
|
account

Figure 5.3: A query-evaluation plan
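In Oracle (the RDBMS used in Unit 7), the plan chosen by the optimizer can be inspected with EXPLAIN PLAN. This sketch assumes a PLAN_TABLE has already been created, e.g. by running the standard utlxplan.sql script; treat it as illustrative:

EXPLAIN PLAN FOR
SELECT balance FROM account WHERE balance < 2500;

-- Read back the chosen operations (e.g. a TABLE ACCESS or an INDEX RANGE SCAN):
SELECT id, operation, options, object_name
FROM plan_table;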


Query optimization is the process of selecting the most efficient query-evaluation plan for a query. One aspect of optimization occurs at the relational- algebra level. An attempt is made to find an expression that is equivalent to the given expression, but that is more efficient to execute. The other aspect involves the selection of a detailed strategy for processing the query, such as choosing the algorithm to use for executing an operation, choosing the specific indices to use, and so on.

To choose among different query-evaluation plans, the optimizer has to estimate the cost of each evaluation plan. Computing the precise cost of evaluation of a plan is usually not possible without actually evaluating the plan. Instead, optimizers make use of statistical information about the relations, such as relation sizes and index depths, to make a good estimate of the cost of a plan.

Consider the preceding example of a selection applied to the account relation. The optimizer estimates the cost of the different evaluation plans. If an index is available on attribute balance of account, then the evaluation plan shown in Figure 5.3, in which the selection is done using the index, is likely to have the lowest cost and thus, to be chosen.

Once the query plan is chosen, the query is evaluated with that plan, and the result of the query is output.

The sequence of steps already described for processing a query is representative; not all databases exactly follow those steps. For instance, instead of using the relational-algebra representation, several databases use an annotated parse-tree representation based on the structure of the given SQL query. However, the concepts that we describe here form the basis of query processing in databases.

In the next section, we construct a cost model that allows us to estimate the cost of various operations. Using this cost measure, we address the optimal evaluation of individual operations. We examine the efficiencies that we can achieve by combining multiple operations into one pipelined operation. These tools allow us to determine the approximate cost of evaluating a given relational-algebra expression optimally. Finally, we show equivalences among relational-algebra expressions. We can use these equivalences to replace a relational-algebra expression constructed from a user's query with an equivalent expression whose estimated cost of evaluation is lower.

Catalog Information for Cost Estimation

The strategy that we choose for query evaluation depends on the estimated cost of the strategy. Query optimizers make use of statistical information stored in the DBMS catalog to estimate the cost of a plan. The relevant catalog information about relations includes:

• nr is the number of tuples in the relation r.

• br is the number of blocks containing tuples of relation r.

• Sr is the size of a tuple of relation r in bytes.

• fr is the blocking factor of relation r, that is, the number of tuples of relation r that fit into one block.

• V(A, r) is the number of distinct values that appear in the relation r for attribute A. This value is the same as the size of ΠA(r). If A is a key for relation r, V(A, r) is nr.

• SC(A, r) is the selection cardinality of attribute A of relation r. Given relation r and an attribute A of the relation, SC(A, r) is the average number of records that satisfy an equality condition on attribute A, given that at least one record satisfies the equality condition. For example, SC(A, r) = 1 if A is a key attribute of r; for a non-key attribute, we estimate that the V(A, r) distinct values are distributed evenly among the tuples, yielding SC(A, r) = nr/V(A, r).

The last two statistics, V(A, r) and SC(A, r), can also be maintained for sets of attributes if desired, instead of just for individual attributes. Thus, given a set of attributes A, V(A, r) is the size of ΠA(r).


If we assume that the tuples of relation r are stored together physically in a file, the following equation holds:

br = ⌈nr/fr⌉

In addition to catalog information about relations, the following catalog information about indices is also used:

• fi is the average fan-out of internal nodes of index i, for tree-structured indices such as B+-trees.

• HTi is the number of levels in index i, that is, the height of index i. For a balanced tree index (such as a B+-tree) on attribute A of relation r, HTi = ⌈logfi(V(A, r))⌉. For a hash index, HTi is 1.

• LBi is the number of lowest-level index blocks in index i, that is, the number of blocks at the leaf level of the index.

We use these statistical variables to estimate the size of the result and the cost of various operations and algorithms, as we shall see in the following sections. We refer to the cost estimate of algorithm A as EA.

If we wish to maintain accurate statistics, then every time a relation is modified, we must also update the statistics. This update incurs a substantial amount of overhead. Therefore, most systems do not update the statistics on every modification. Instead, the updates are done during periods of light system load.

As a result, the statistics used for choosing a query-processing strategy may not be completely accurate. However, if not too many updates occur in the intervals between the updates of the statistics, the statistics will be sufficiently accurate to provide a good estimation of the relative costs of the different plans. The statistical information noted here is simplified. Real-world optimizers often maintain further statistical information to improve the accuracy of their cost estimates of evaluation plans.
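In Oracle, for example, such statistics are gathered explicitly; a sketch of the two common forms, typically scheduled during off-peak hours:

-- Exact statistics (reads the whole table):
ANALYZE TABLE account COMPUTE STATISTICS;

-- Cheaper, sampled statistics:
ANALYZE TABLE account ESTIMATE STATISTICS;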

Measures of Query Cost

The cost of query evaluation can be measured in terms of a number of different resources, including disk accesses, CPU time to execute a query, and, in a distributed or parallel database system, the cost of communication. The response time for a query-evaluation plan (that is, the clock time required to execute the plan), assuming no other activity is going on in the computer, would account for all these costs, and could be used as a good measure of the cost of the plan.

In large database systems, however, disk accesses (which we measure as the number of transfers of blocks from disk) are usually the most important cost, since disk accesses are slow compared to in-memory operations. Moreover, CPU speeds have been improving much faster than have disk speeds. Thus, it is likely that the time spent in disk activity will continue to dominate the total time to execute a query. Finally, estimating the CPU time is relatively hard, compared to estimating the disk-access cost. Therefore, the disk-access cost is considered a reasonable measure of the cost of a query-evaluation plan.

To simplify our computation of disk-access cost, we assume that all transfers of blocks have the same cost. This assumption ignores the variance arising from rotational latency (waiting for the desired data to spin under the read-write head) and seek time (the time that it takes to move the head over the desired track or cylinder). Although these factors are significant, they are difficult to estimate in a shared system. Therefore, we simply use the number of block transfers from disk as a measure of the actual cost.

We also ignore the cost of writing the final result of an operation back to disk. Whatever the query-evaluation plan used, this cost does not change; hence, ignoring it does not affect the choice of a plan.

The costs of all the algorithms that we consider depend significantly on the size of the buffer in main memory. In the best case, all data can be read into the buffers, and the disk does not need to be accessed


again. In the worst case, we assume that the buffer can hold only a few blocks of data—approximately one block per relation. When presenting cost estimates, we generally assume the worst case.

Selection Operation

In query processing, the file scan is the lowest-level operator to access data. File scans are search algorithms that locate and retrieve records that fulfill a selection condition. In relational systems, a file scan allows an entire relation to be read in the cases where the relation is stored in a single, dedicated file.

Basic Algorithms

Consider a selection operation on a relation whose tuples are stored together in one file. Two scan algorithms to implement the selection operation are as follows:

• A1 (linear search). In a linear search, each file block is scanned, and all records are tested to see whether they satisfy the selection condition. Since all blocks have to be read, EA1 = br. (Recall that EA1 denotes the estimated cost of algorithm A1.) For a selection on a key attribute, we assume that one-half of the blocks will be searched before the record is found, at which point the scan can terminate. The estimate in this case is EA1 = br/2.

Although it may be inefficient in many cases, the linear search algorithm can be applied to any file, regardless of the ordering of the file or of the availability of indices.

• A2 (binary search). If the file is ordered on an attribute, and the selection condition is an equality comparison on the attribute, we can use a binary search to locate records that satisfy the selection. The binary search is performed on the blocks of the file, giving the following estimate for the file blocks to be scanned:

EA2 = ⌈log2(br)⌉ + ⌈SC(A, r)/fr⌉ − 1

The first term, ⌈log2(br)⌉, accounts for the cost of locating the first tuple by a binary search on the blocks. The total number of records that will satisfy the selection is SC(A, r), and these records will occupy ⌈SC(A, r)/fr⌉ blocks, of which one has already been retrieved, giving the preceding estimate. If the equality condition is on a key attribute, then SC(A, r) = 1, and the estimate reduces to EA2 = ⌈log2(br)⌉.

The cost estimates for binary search are based on the assumption that the blocks of a relation are stored contiguously on disk. Otherwise, the cost of looking up the file-access structures (which may be on disk) to locate the physical address of a block in a file must be added to the estimates. The cost estimates also depend on the size of the result of the selection.

If we assume uniform distribution of values (that is, each value appears with equal probability), then the query σA=a(r) is estimated to have

SC(A, r) = nr/V(A, r)

tuples, assuming that the value a appears in attribute A of some record of r. The assumption that the value in the selection appears in some record is generally true, and cost estimates often make it implicitly. However, it is often not realistic to assume that each value appears with equal probability. The branch-name attribute in the account relation is an example where the assumption is not valid. There is one tuple in the account relation for each account. It is reasonable to expect that the large branches have more accounts than smaller branches. Therefore, certain branch-name values appear with greater probability than do others. Despite the fact that the uniform-distribution assumption is often not correct, it is a reasonable approximation of reality in many cases, and it helps us to keep our presentation relatively simple.


As an illustration of this use of the cost estimates, suppose that we have the following statistical information about the account relation:

• faccount = 20 (that is, 20 tuples of account fit in one block).

• V(branch-name, account) = 50 (that is, there are 50 different branches).

• V(balance, account) = 500 (that is, there are 500 different balance values).

• naccount = 10000 (that is, the account relation has 10,000 tuples).

Consider the query

σbranch-name="Perryridge"(account)

Since the relation has 10,000 tuples, and each block holds 20 tuples, the number of blocks is baccount=500. A simple file scan on account therefore takes 500 block accesses.

Suppose that account is sorted on branch-name. Since V(branch-name, account) is 50, we expect that 10000/50 = 200 tuples of the account relation pertain to the Perryridge branch. These tuples would fit in 200/20 = 10 blocks. A binary search to find the first record would take [log2(500)] = 9 block accesses. Thus, the total cost would be 9 + 10 - 1 = 18 block accesses.

Selections Using Indices

Index structures are referred to as access paths, since they provide a path through which data can be located and accessed. It is efficient to read the records of a file in an order corresponding closely to physical order. Recall that a primary index is an index that allows the records of a file to be read in an order that corresponds to the physical order in the file. An index that is not a primary index is called a secondary index.

Search algorithms that use an index are referred to as index scans. Ordered indices, such as B+-trees, also permit access to tuples in a sorted order, which is useful for implementing range queries. Although indices can provide fast, direct, and ordered access, their use imposes the overhead of access to those blocks containing the index. We need to take these block accesses into account when we estimate the cost of a strategy that involves the use of indices. The selection predicate guides us in the choice of the index to use in processing the query.

• A3 (primary index, equality on key). For an equality comparison on a key attribute with a primary index, we can use the index to retrieve a single record that satisfies the corresponding equality condition. To retrieve a single record, we need to retrieve one block more than the number of index levels (HTi); the cost is EA3 = HTi + 1.

• A4 (primary index, equality on non-key). We can retrieve multiple records by using a primary index when the selection condition specifies an equality comparison on a non-key attribute A. SC(A,r) records will satisfy an equality condition, and [SC(A,r)/fr] file blocks will be accessed; hence,

EA4 = HTi + [SC(A,r)/fr]

• A5 (secondary index, equality). Selections specifying an equality condition can use a secondary index. This strategy can retrieve a single record if the indexing field is a key; multiple records can be retrieved if the indexing field is not a key. For an equality condition on attribute A, SC(A,r) records satisfy the condition. Given that the index is a secondary index, we assume the worst-case scenario that each matching record resides on a different block, yielding EA5 = HTi + SC(A,r), or, for a key indexing attribute, EA5 = HTi + 1.


We assume the same statistical information about account as used in the earlier example. We also suppose that the following indices exist on account:

• A primary, B+-tree index for attribute branch-name.

• A secondary, B+-tree index for attribute balance.

As mentioned earlier, we make the simplifying assumption that values are distributed uniformly.

Consider the query

σbranch-name="Perryridge"(account)

Since V(branch-name, account) = 50, we expect that 10000/50 = 200 tuples of the account relation pertain to the Perryridge branch. Suppose that we use the index on branch-name. Since the index is a clustering index, 200/20 = 10 block reads are required to read the account tuples. In addition, several index blocks must be read. Assume that the B+-tree index stores 20 pointers per node. Since there are 50 different branch names, the B+-tree index must have between three and five leaf nodes. With this number of leaf nodes, the entire tree has a depth of 2, so two index blocks must be read. Thus, the preceding strategy requires 12 total block reads.
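These estimates are easy to reproduce mechanically. The following Python sketch recomputes the three costs derived above for the Perryridge query: the linear-search cost EA1, the binary-search cost EA2, and the clustering-index cost. The variable names and the assumed index height HT = 2 are our own illustrative choices, not part of any DBMS:

import math

n_account = 10000                      # tuples in account
f_account = 20                         # blocking factor: tuples per block
b_account = n_account // f_account     # 500 blocks
V_branch = 50                          # distinct branch-name values
HT = 2                                 # assumed B+-tree height (index levels read)

SC = n_account // V_branch                   # 200 matching tuples (uniform assumption)
match_blocks = math.ceil(SC / f_account)     # the 200 tuples occupy 10 blocks

E_A1 = b_account                                             # linear search: 500
E_A2 = math.ceil(math.log2(b_account)) + match_blocks - 1    # 9 + 10 - 1 = 18
E_index = HT + match_blocks                                  # clustering index: 2 + 10 = 12

print(E_A1, E_A2, E_index)             # 500 18 12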

Top

Selections Involving Comparisons

Consider a selection of the form σA≤v(r). In the absence of any further information about the comparison, we assume that approximately one-half of the records will satisfy the comparison condition; hence, the result has nr/2 tuples.

If the actual value v used in the comparison is available at the time of cost estimation, a more accurate estimate can be made. The lowest and highest values (min(A,r) and max(A,r)) for the attribute can be stored in the catalog. Assuming that values are uniformly distributed, we can estimate the number of records that will satisfy the condition A ≤ v as 0 if v < min(A,r), and as

nr × (v - min(A,r)) / (max(A,r) - min(A,r))

otherwise.
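A rough Python sketch of this estimate follows; the function name and the balance range passed in the example call are our own illustrative assumptions:

def estimate_leq(n_r, v, a_min, a_max):
    # Estimated number of tuples satisfying A <= v, assuming uniformly
    # distributed attribute values between min(A,r) and max(A,r).
    if v < a_min:
        return 0
    if v >= a_max:
        return n_r
    return n_r * (v - a_min) / (a_max - a_min)

# 10,000 account tuples, balances assumed uniform over [0, 10000]:
print(estimate_leq(10000, 1200, 0, 10000))   # 1200.0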

We can implement selections involving comparisons either using a linear or binary search, or using indices in one of the following ways:

• A6 (primary index, comparison). A primary ordered index (for example, a primary B+-tree index) can be used when the selection condition is a comparison. For comparison conditions of the form A > v or A ≥ v, the primary index can be used to direct the retrieval of tuples, as follows. For A ≥ v, we look up the value v in the index to find the first tuple in the file that has a value A = v. A file scan starting from that tuple up to the end of the file returns all tuples that satisfy the condition. For A > v, the file scan starts with the first tuple such that A > v.

For comparisons of the form A < v or A ≤ v, an index lookup is not required. For A < v, we use a simple file scan starting from the beginning of the file, and continuing up to (but not including) the first tuple with attribute A = v. The case of A ≤ v is similar, except that the scan continues up to (but not including) the first tuple with attribute A > v. In either case, the index is not useful.

We assume that approximately one-half of the records will satisfy the comparison condition. Under this assumption, retrieval using the index has the following cost:


EA6 = HTi + br/2

If the actual value used in the comparison is available at the time of cost estimation, a more accurate estimate can be made. Let c be the estimated number of tuples that satisfy the condition (as described earlier). Then,

EA6 = HTi + [c/fr]

• A7 (secondary index, comparison). We can use a secondary ordered index to guide retrieval for comparison conditions involving <, ≤, ≥, or >. The lowest-level index blocks are scanned either from the smallest value up to v (for < and ≤), or from v up to the maximum value (for > and ≥). For these comparisons, if we assume that at least one-half of the records satisfy the condition, then one-half of the lowest-level index blocks are accessed and, via the index, one-half of the file records are accessed. Furthermore, a path must be traversed in the index from the root block to the first leaf block to be used. Thus, the cost estimate is the following:

EA7 = HTi + LBi/2 + nr/2

where LBi denotes the number of lowest-level (leaf) index blocks.

As with non-equality comparisons on clustering indices, we can get a more accurate estimate if we know the actual value used in the comparison at the time of cost estimation. In Tandem's Non-Stop SQL System, B+-trees are used both for primary data storage and as secondary access paths. The primary indices are clustering indices, whereas the secondary ones are not. Rather than pointers to records' physical location, the secondary indices contain keys to search the primary B+-tree. The cost formulae described previously for secondary indices will have to be modified slightly if such indices are used.

Although the preceding algorithms show that indices are helpful in processing selections with comparisons, they are not always so useful. As an illustration, consider the query

σbalance<1200(account)

Suppose that the statistical information about the relations is the same as that used earlier. If we have no information about the minimum and maximum balances in the account relation, then we assume that one-half of the tuples satisfy the selection.

If we use the index for balance, we estimate the number of block accesses as follows. Let us assume that 20 pointers fit into one node of the B+-tree index for balance. Since there are 500 different balance values, and each leaf node of the tree must be at least half-full, the tree has between 25 and 50 leaf nodes. So, as was the case for the index on branch-name, the index for balance has a depth of 2, and two block accesses are required to read the first index block. In the worst case, there are 50 leaf nodes, one-half of which must be accessed. This accessing leads to 25 more block reads. Finally, for each tuple that we locate in the index, we have to retrieve that tuple from the relation. We estimate that 5000 tuples (one-half of the 10,000 tuples) satisfy the condition. Since the index is non-clustering, in the worst case each of these tuple accesses will require a separate block access. Thus, we get a total of 5027 block accesses.

In contrast, a simple file scan will take only 10000/20 = 500 block accesses. In this case, it is clearly not wise to use the index, and we should use the file scan instead.

Implementation of Complex Selections

So far, we have considered only simple selection conditions of the form A op B, where op is an equality or comparison operation. We now consider more complex selection predicates.


• Conjunction: A conjunctive selection is a selection of the form

σθ1∧θ2∧...∧θn(r)

We can estimate the result size of such a selection as follows. For each θi, we estimate the size of the selection σθi(r), denoted by si, as described previously. Thus, the probability that a tuple in the relation satisfies selection condition θi is si/nr.

The preceding probability is called the selectivity of the selection σθi(r). Assuming that the conditions are independent of each other, the probability that a tuple satisfies all the conditions is simply the product of all these probabilities. Thus, we estimate the size of the full selection as

nr × (s1 × s2 × ... × sn) / nr^n

• Disjunction: A disjunctive selection is a selection of the form

σθ1∨θ2∨...∨θn(r)

A disjunctive condition is satisfied by the union of all records satisfying the individual, simple conditions θi.

As before, let si/nr denote the probability that a tuple satisfies condition θi. The probability that the tuple will satisfy the disjunction is then 1 minus the probability that it will satisfy none of the conditions, or

1 - (1 - s1/nr) × (1 - s2/nr) × ... × (1 - sn/nr)

Multiplying this value by nr gives us the number of tuples that satisfy the selection.

• Negation: The result of a selection σ¬θ(r) is simply the tuples of r that are not in σθ(r). We already know how to estimate the size of σθ(r); the size of σ¬θ(r) is therefore estimated to be

size(r) - size(σθ(r))
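These three estimates translate directly into a few lines of Python; the sketch below makes the same independence and uniformity assumptions as the text, and the function names are ours:

def conjunction_size(n_r, sizes):
    # Product of the individual selectivities s_i/n_r (independence assumption).
    p = 1.0
    for s in sizes:
        p *= s / n_r
    return n_r * p

def disjunction_size(n_r, sizes):
    # 1 minus the probability that a tuple satisfies none of the conditions.
    miss = 1.0
    for s in sizes:
        miss *= 1 - s / n_r
    return n_r * (1 - miss)

def negation_size(n_r, s):
    return n_r - s

# account example: one condition matches 200 tuples, the other 20.
print(conjunction_size(10000, [200, 20]))   # 0.4 -- about one tuple
print(disjunction_size(10000, [200, 20]))   # 219.6
print(negation_size(10000, 200))            # 9800

The conjunction estimate of 0.4 tuples is the same "one tuple in 25,000" figure used for the intersection-of-identifiers strategy in the worked example below.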

We can implement a selection operation involving either a conjunction or a disjunction of simple conditions using one of the following algorithms:

• A8 (conjunctive selection using one index). We first determine whether an access path is available for an attribute in one of the simple conditions. If one of the selection algorithms A2 through A7 can retrieve records satisfying that condition, then we complete the operation by testing, in the memory buffer, whether or not each retrieved record satisfies the remaining simple conditions.

Selectivity is central to determining in what order the simple conditions in a conjunctive selection should be tested. The most selective condition (that is, the one with the smallest selectivity) will retrieve the smallest number of records; hence, that condition should constitute the first scan.

• A9 (conjunctive selection using composite index). An appropriate composite index may be available for some conjunctive selections. If the selection specifies an equality condition on two or more attributes, and a composite index exists on these combined attribute fields, then the index can be searched directly. The type of index determines which of algorithms A3, A4, or A5 will be used.

• A10 (conjunctive selection by intersection of identifiers). Another alternative for implementing conjunctive selection operations involves the use of record pointers or record identifiers. This


algorithm requires indices with record pointers, on the fields involved in the individual conditions. Each index is scanned for pointers to tuples that satisfy an individual condition. The intersection of all the retrieved pointers is the set of pointers to tuples that satisfy the conjunctive condition. We then use the pointers to retrieve the actual records. If indices are not available on all the individual conditions, then the retrieved records are tested against the remaining conditions.

• A11 (disjunctive selection by union of identifiers). If access paths are available on all the conditions of a disjunctive selection, each index is scanned for pointers to tuples that satisfy the individual condition. The union of all the retrieved pointers yields the set of pointers to all tuples that satisfy the disjunctive condition. We then use the pointers to retrieve the actual records.

However, if even one of the conditions does not have an access path, we will have to perform a linear scan of the relation to find tuples that satisfy the condition. Therefore, if there is even one such condition in the disjunct, the most efficient access method is a linear scan, with the disjunctive condition tested on each tuple during the scan.

To illustrate the preceding algorithms, we suppose that we have the query

select account-number

from account

where branch-name = "Perryridge" and balance = 1200

We assume that the statistical information about the account relation is the same as that in the earlier example.

If we use the index on branch-name, we will incur a total of 12 block reads, as computed earlier. If we use the index for balance, we estimate its access cost as follows. Since V(balance, account) = 500, we expect that 10000/500 = 20 tuples of the account relation will have a balance of $1200. However, since the index for balance is non-clustering, we anticipate that one block read will be required for each tuple. Thus, 20 block reads are required just to read the account tuples.

Let us assume that 20 pointers fit into one node of the B+-tree index for balance. Since there are 500 different balance values, the tree has between 25 and 50 leaf nodes. So, as was the case for the B+-tree index on branch-name, the index for balance has a depth of 2, and two block accesses are required to read the necessary index blocks. Therefore, this strategy requires a total of 22 block reads.

Thus, we conclude that it is preferable to use the index for branch-name. Observe that, if both indices were non-clustering, we would prefer to use the index on balance, since we would expect only 20 tuples to have balance = 1200, versus 200 tuples with branch-name = "Perryridge". Without the clustering property, our first strategy could require as many as 200 block accesses to read the data, since, in the worst case, each tuple is on a different block. We add these 200 accesses to the 2 index block accesses, for a total of 202 block reads. However, because of the clustering property of the branch-name index, it is actually less expensive in this example to use the branch-name index.

Another way in which we could use the indices to process our example query is by intersection of identifiers. We use the index for balance to retrieve pointers to records with balance = 1200, rather than retrieving the records themselves. Let S1 denote this set of pointers. Similarly, we use the index for branch-name to retrieve pointers to records with branch-name = "Perryridge". Let S2 denote this set of pointers. Then, S1 ∩ S2 is the set of pointers to records with branch-name = "Perryridge" and balance = 1200.

This technique requires both indices to be accessed. Both indices have a height of 2, and, for each index, the number of pointers retrieved, estimated earlier as 20 and 200, will fit into a single leaf page. Thus, we read a total of four index blocks to retrieve the two sets of pointers. The intersection of the two sets of pointers can be computed with no further disk I/O. We estimate the number of blocks that must be read from the account file by estimating the number of pointers in S1 ∩ S2.


Since V(branch-name, account) = 50 and V(balance, account) = 500, we estimate that one tuple in 50 × 500, or one in 25,000, has both branch-name = "Perryridge" and balance = 1200. This estimate is based on the assumption of uniform distribution (which we made earlier), and on the added assumption that the distributions of branch names and balances are independent. Based on these assumptions, S1 ∩ S2 is estimated to have only one pointer. Thus, only one block of account needs to be read. The total estimated cost of this strategy is five block reads.

Sorting

Sorting of data plays an important role in database systems for two reasons. First, SQL queries can specify that the output be sorted. Second, and equally important for query processing, several of the relational operations, such as joins, can be implemented efficiently if the input relations are first sorted.

We can accomplish sorting by building an index on the sort key and then using that index to read the relation in sorted order. However, such a process orders the relation only logically, through an index, rather than physically. Hence, the reading of tuples in the sorted order may lead to a disk access for each tuple. For this reason, it may be desirable to order the tuples physically.

The problem of sorting has been studied extensively, both for the case where the relation fits entirely in main memory, and for the case where the relation is bigger than memory. In the first case, standard sorting techniques such as quicksort can be used. Here, we discuss how to handle the second case.

Sorting of relations that do not fit in memory is called external sorting. The most commonly used technique for external sorting is the external sort-merge algorithm. We describe the external sort-merge algorithm next. Let M denote the number of page frames in the main-memory buffer (the number of disk blocks whose contents can be buffered in main memory).

1. In the first stage, a number of sorted runs are created.

i=0;

repeat

read M blocks of the relation, or the rest of the relation,

whichever is smaller;

sort the in-memory part of the relation;

write the sorted data to run file Ri;

i = i + 1;

until the end of the relation;

2. In the second stage, the runs are merged. Suppose, for now, that the total number of runs, N, is less than M, so that we can allocate one page frame to each run and have space left to hold one page of output. The merge stage operates as follows:

read one block of each of the N run files Ri into a buffer page in memory;

repeat

choose the first tuple (in sort order) among all buffer pages;

write the tuple to the output, and delete it from the buffer page;

if the buffer page of any run Ri is empty and not end-of-file(Ri)

then read the next block of Ri into the buffer page;


until all buffer pages are empty

The output of the merge stage is the sorted relation.
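The two stages translate into compact Python. The sketch below is illustrative only: it keeps the "blocks" as in-memory lists rather than disk files, and it assumes the number of runs stays below M, so that a single merge stage suffices:

import heapq

def external_sort(blocks, M):
    # Stage 1: create sorted runs, each built from M blocks of the relation.
    runs = []
    for i in range(0, len(blocks), M):
        runs.append(sorted(rec for blk in blocks[i:i + M] for rec in blk))
    # Stage 2: N-way merge. heapq.merge repeatedly emits the smallest
    # tuple among the heads of all runs, mirroring the pseudocode above.
    return list(heapq.merge(*runs))

data = [[7, 2], [9, 1], [4, 8], [3, 6]]   # 4 blocks, 2 tuples per block
print(external_sort(data, M=3))           # [1, 2, 3, 4, 6, 7, 8, 9]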

Cost Analysis of External Sort-Merge

In general, if the relation is much larger than memory, then there may be M or more runs generated in the first stage, and it is not possible to allocate a page frame for each run during the merge stage. In this case, the merge operation is done in multiple passes. Since there is enough memory for M -1 input buffer pages, each merge can take M - 1 runs as input.

The initial pass functions as follows. The first M - 1 runs are merged (as described previously) to get a single run for the next pass. Then, the next M - 1 runs are similarly merged, and so on, until all the initial runs have been processed. At this point, the number of runs has been reduced by a factor of M - 1. If this reduced number of runs is still greater than or equal to M, another pass is made, with the runs created by the first pass as input. Each pass reduces the number of runs by a factor of M - 1. These passes are repeated as many times as required until the number of runs is less than M; a final pass then generates the sorted output.

Figure 5.4 illustrates the steps of the external sort-merge of an example relation. For illustration purposes, we assume that only one tuple fits in a block (fr = 1), and we assume that memory holds at most three page frames. During the merge stage, two page frames are used for input and one for output.

Let us compute how many block transfers are required for the external sort-merge. In the first stage, every block of the relation is read and is written out again, giving a total of 2br disk accesses. The initial number of runs is [br/M]. Since the number of runs decreases by a factor of M - 1 in each merge pass, the total number of merge passes required is [logM-1(br/M)]. Each of these passes reads every block of the relation once and writes it out once, with two exceptions. First, the final pass can produce the sorted output without writing its result to disk. Second, there may be runs that are not read in or written out during a pass (for example, if there are M runs to be merged in a pass, M - 1 are read in and merged, and one run is not accessed during the pass). Ignoring the (relatively small) savings due to the latter effect, the total number of disk accesses for external sorting of the relation is


br(2[logM-1(br/M)] + 1)

Applying this equation to the example in Figure 5.4, we get a total of 12 × (4 + 1) = 60 block transfers, as you can verify from the figure. Note that this value does not include the cost of writing out the final result.
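The same figure can be reproduced with a short Python function; this sketch counts merge passes with a loop rather than evaluating the logarithm directly, which sidesteps floating-point rounding:

import math

def sort_merge_cost(b_r, M):
    # Total disk accesses for external sort-merge; the final write of the
    # sorted output is not counted, matching the formula above.
    runs = math.ceil(b_r / M)        # number of initial runs
    passes = 0
    while runs > 1:
        runs = math.ceil(runs / (M - 1))   # each pass merges M-1 runs at a time
        passes += 1
    return b_r * (2 * passes + 1)

print(sort_merge_cost(12, 3))        # 60, as computed in the example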

"������������

In this section, we first show how to estimate the size of the result of a join. We then study several algorithms for computing the join of relations, and analyze their respective costs. We use the word equi-join to refer to a join of the form r ⋈r.A=s.B s, where A and B are attributes or sets of attributes of relations r and s respectively.

We use as a running example the expression

depositor ⋈ customer

We assume the following catalog information about the two relations:

• ncustomer = 10000.

• fcustomer = 25, which implies that bcustomer = 10000/25 = 400.

• ndepositor = 5000.

• fdepositor = 50, which implies that bdepositor = 5000/50 = 100.

• V(customer-name, depositor) = 2500, which implies that, on average, each customer has two accounts.

We also assume that customer-name in depositor is a foreign key on customer.

Estimation of the Size of Joins

The Cartesian product r × s contains nr × ns tuples. Each tuple of r × s occupies sr + ss bytes, from which we can calculate the size of the Cartesian product.

Estimation of the size of a natural join is somewhat more complicated than is estimation of the size of a selection or of a Cartesian product. Let r(R) and s(S) be relations.

• If R ∩ S = ∅, that is, the relations have no attribute in common, then r ⋈ s is the same as r × s, and we can use our estimation technique for Cartesian products.
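As a quick illustration of these size estimates with the catalog information above (a sketch; the variable names are ours):

n_customer, n_depositor = 10000, 5000

# Cartesian product: n_r * n_s tuples.
print(n_customer * n_depositor)      # 50000000 tuples in depositor x customer

# Because customer-name in depositor is a foreign key on customer, each
# depositor tuple joins with exactly one customer tuple, so the natural
# join depositor |x| customer contains exactly n_depositor tuples.
print(n_depositor)                   # 5000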

Student Activity 5.1

1. Discuss the various steps in query processing.

2. What do you understand by measuring the cost of a query?

3. Discuss the steps involved in the basic algorithms for query processing.

4. Discuss the role of sorting in query processing.

Top

Query Optimization

In this lesson, we discuss optimization techniques that apply heuristic rules to modify the internal representation of a query, which is usually in the form of a query tree or a query graph data structure, to improve its expected performance. The parser of a high-level query first generates an initial internal


representation, which is then optimized according to heuristic rules. Following that, a query execution plan is generated to execute groups of operations, based on the access paths available for the files involved in the query.

One of the main heuristic rules is to apply SELECT and PROJECT operations before applying the JOIN or other binary operations. This is because the size of the file resulting from a binary operation (such as JOIN) is usually a multiplicative function of the sizes of the input files. The SELECT and PROJECT operations reduce the size of a file and hence should be applied before a join or other binary operation.

Notation for Query Trees and Query Graphs

A query tree is a tree data structure that corresponds to a relational algebra expression. It represents the input relations of the query as leaf nodes of the tree, and represents the relational algebra operations as internal nodes. An execution of the query tree consists of executing an internal node operation whenever its operands are available, and then replacing that internal node by the relation that results from executing the operation. The execution terminates when the root node is executed and produces the result relation for the query.

Figure 5.5 shows a query tree for query Q2: for every project located in 'Stafford', retrieve the project number, the controlling department number, and the department manager's last name, address, and birth date. This query is specified on the relational schema and corresponds to the following relational algebra expression:

ΠPNUMBER, DNUM, LNAME, ADDRESS, BDATE (((σPLOCATION='Stafford' (PROJECT)) ⋈DNUM=DNUMBER (DEPARTMENT)) ⋈MGRSSN=SSN (EMPLOYEE))

Figure 5.5(a): Query tree for Q2, with leaf nodes P (PROJECT), D (DEPARTMENT), and E (EMPLOYEE), and internal nodes marked (1) σP.PLOCATION='Stafford', (2) ⋈P.DNUM=D.DNUMBER, and (3) ⋈D.MGRSSN=E.SSN.


Figure 5.5(b): Initial (canonical) query tree for the SQL version of Q2, corresponding to the expression

ΠP.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE (σP.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND P.PLOCATION='Stafford' ((PROJECT × DEPARTMENT) × EMPLOYEE))

This corresponds to the following SQL query:

SELECT P.PNUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE

FROM PROJECT P, DEPARTMENT D, EMPLOYEE E

WHERE P.DNUM = D.DNUMBER AND D.MGRSSN = E.SSN AND P.PLOCATION = 'Stafford';

In Figure 5.5(a), the three relations PROJECT, DEPARTMENT, and EMPLOYEE are represented by leaf nodes P, D, and E, while the internal tree nodes represent the relational algebra operations of the expression. When this query tree is executed, the node marked (1) must begin execution before node (2), because some resulting tuples of operation (1) must be available before we can begin executing operation (2). Similarly, node (2) must begin executing and producing results before node (3) can start execution, and so on.

As we can see, the query tree represents a specific order of operations for executing a query. A more neutral representation of a query is the query graph notation. Figure 5.6 shows the query graph for query Q2. Relations in the query are represented by relation nodes, which are displayed as single circles. Constant values, typically from the query selection conditions, are represented by constant nodes, which are displayed as double circles. Selection and join conditions are represented by the graph edges, as shown in Figure 5.6. Finally, the attributes to be retrieved from each relation are displayed in square brackets above each relation.

The query graph representation does not indicate an order in which operations are performed. There is only a single graph corresponding to each query. Although some optimization techniques were based on query graphs, it is now generally accepted that query trees are preferable because, in practice, the query optimizer needs to show the order of operations for query execution, which is not possible with query graphs.

Figure 5.6: Query graph for Q2, with relation nodes P, D, and E, the constant node 'Stafford', and edges for the selection condition P.PLOCATION='Stafford' and the join conditions.


Heuristic Optimization of Query Trees

In general, many different relational algebra expressions, and hence many different query trees, can be equivalent; that is, they can correspond to the same query. The query parser will typically generate a standard initial query tree to correspond to an SQL query, without doing any optimization; see Figure 5.5(b). The CARTESIAN PRODUCT of the relations specified in the FROM clause is applied first; then the selection and join conditions of the WHERE clause are applied, followed by the projection onto the SELECT clause attributes. Such a canonical query tree represents a relational algebra expression that is very inefficient if executed directly, because of the CARTESIAN PRODUCT (×) operations. For example, if the PROJECT, DEPARTMENT, and EMPLOYEE relations had record sizes of 100, 50, and 150 bytes and contained 100, 20, and 5000 tuples, respectively, the result of the CARTESIAN PRODUCT would contain 10 million tuples of record size 300 bytes each. It is now the job of the heuristic query optimizer to transform this initial query tree into a final query tree that is efficient to execute.

The optimizer must include rules for equivalence among relational algebra expressions that can be applied to the initial tree. The heuristic query optimization rules then utilize these equivalences to transform the initial tree into the final, optimized query tree. We discuss general transformation rules and show how they may be used in an algebraic heuristic optimizer.

Example of Transforming a Query. Consider the following query Q on the following database:

EMPLOYEE

FNAME     MINIT  LNAME    SSN        BDATE       ADDRESS                   SEX  SALARY  SUPERSSN   DNO

John      B      Smith    123456789  1965-01-09  731 Fondren, Houston, TX  M    30000   333445555  5

Franklin  T      Wong     333445555  1955-12-08  638 Voss, Houston, TX     M    40000   888665555  5

Alicia    J      Zelaya   999887777  1968-07-19  3321 Castle, Spring, TX   F    25000   987654321  4

Jennifer  S      Wallace  987654321  1941-06-20  291 Berry, Bellaire, TX   F    43000   888665555  4

Ramesh    K      Narayan  666884444  1962-09-15  975 Fire Oak, Humble, TX  M    38000   333445555  5

Joyce     A      English  453453453  1972-07-31  5631 Rice, Houston, TX    F    25000   333445555  5

Ahmad     V      Jabbar   987987987  1969-03-29  980 Dallas, Houston, TX   M    25000   987654321  4

"Find the last names of employees born after 1957 who work on a project named 'Aquarius'." This query can be specified in SQL as follows:

Q: SELECT LNAME

FROM EMPLOYEE, WORKS_ON, PROJECT

WHERE PNAME = ‘Aquarius’ AND PNUMBER = PNO AND ESSN = SSN

AND BDATE > '1957-12-31';

The initial query tree for Q is shown in Figure 5.4(a). Executing this tree directly first creates a very large file containing the CARTESIAN PRODUCT of the entire EMPLOYEE, WORKS_ON, and PROJECT files. However, this query needs only one record from the PROJECT relation (for the 'Aquarius' project) and only the EMPLOYEE records for those whose date of birth is after '1957-12-31'. Figure 5.4(b)


shows an improved query tree that first applies the SELECT operations to reduce the number of tuples that appear in the CARTESIAN PRODUCT.

A further improvement is achieved by switching the positions of the EMPLOYEE and PROJECT relations in the tree, as shown in Figure 5.5. This uses the information that PNUMBER is a key attribute of the PROJECT relation, and hence the SELECT operation on the PROJECT relation will retrieve a single record only. We can further improve the query tree by replacing any CARTESIAN PRODUCT operation that is followed by a join condition with a JOIN operation, as shown in Figure 5.6. Another improvement is to keep only the attributes needed by subsequent operations in the intermediate relations, by including PROJECT (Π) operations as early as possible in the query tree, as shown in Figure 5.7. This reduces the attributes (columns) of the intermediate relations, whereas the SELECT operations reduce the number of tuples (records).

As the preceding example demonstrates, a query tree can be transformed step by step into another query tree that is more efficient to execute. However, we must make sure that the transformation steps always lead to an equivalent query tree. To do this, the query optimizer must know which transformation rules preserve this equivalence. We discuss some of these transformation rules next.

Figure 5.4(a): Initial (canonical) query tree for Q:

πLNAME (σPNAME='Aquarius' AND PNUMBER=PNO AND ESSN=SSN AND BDATE>'1957-12-31' ((EMPLOYEE × WORKS_ON) × PROJECT))


Figures 5.4(b), 5.5, and 5.6: Steps in converting the query tree of Q during heuristic optimization: Figure 5.4(b) moves the SELECT operations σBDATE>'1957-12-31' and σPNAME='Aquarius' down the tree; Figure 5.5 applies the more restrictive SELECT operation (the one on PROJECT) first; Figure 5.6 replaces the CARTESIAN PRODUCT and SELECT pairs with the JOIN operations ⋈PNUMBER=PNO and ⋈ESSN=SSN.


Top

General Transformation Rules for Relational Algebra Operations

There are many rules for transforming relational algebra operations into equivalent ones. Here we are interested in the meaning of the operations and the resulting relations. Hence, if two relations have the same set of attributes in a different order but the two relations represent the same information, we consider the relations equivalent. We now state some transformation rules that are useful in query optimization, without proving them:

1. Cascade of σ: A conjunctive selection condition can be broken up into a cascade (that is, a sequence) of individual σ operations:

σc1 AND c2 AND ... AND cn(R) = σc1(σc2(...(σcn(R))...))

2. Commutativity of σ: The σ operation is commutative:

σc1(σc2(R)) = σc2(σc1(R))

3. Cascade of Π: In a cascade (sequence) of Π operations, all but the last one can be ignored:

ΠList1(ΠList2(...(ΠListn(R))...)) = ΠList1(R)

Figure 5.7: Moving the PROJECT (π) operations down the query tree.


4. Commuting σ with Π: If the selection condition c involves only those attributes A1, ..., An in the projection list, the two operations can be commuted:

ΠA1, A2, ..., An(σc(R)) = σc(ΠA1, A2, ..., An(R))

5. Commutativity of ⋈ (and ×): The ⋈ operation is commutative, as is the × operation:

R ⋈c S = S ⋈c R

R × S = S × R

Notice that, although the order of attributes may not be the same in the relations resulting from the two joins (or the two Cartesian products), the "meaning" is the same, because the order of attributes is not important in the alternative definition of a relation.

6. Commuting σ with ⋈ (or ×): If all the attributes in the selection condition c involve only the attributes of one of the relations being joined, say R, the two operations can be commuted as follows:

σc(R ⋈ S) = (σc(R)) ⋈ S

Alternatively, if the selection condition c can be written as (c1 AND c2), where condition c1 involves only the attributes of R and condition c2 involves only the attributes of S, the operations commute as follows:

σc(R ⋈ S) = (σc1(R)) ⋈ (σc2(S))

The same rules apply if the ⋈ is replaced by a × operation.

7. Commuting Π with ⋈ (or ×): Suppose that the projection list is L = (A1, ..., An, B1, ..., Bm), where A1, ..., An are attributes of R and B1, ..., Bm are attributes of S. If the join condition c involves only attributes in L, the two operations can be commuted as follows:

ΠL(R ⋈c S) = (ΠA1, ..., An(R)) ⋈c (ΠB1, ..., Bm(S))

If the join condition c contains additional attributes not in L, these must be added to the projection list, and a final Π operation is needed. For example, if attributes An+1, ..., An+k of R and Bm+1, ..., Bm+p of S are involved in the join condition c but are not in the projection list L, the operations commute as follows:

ΠL(R ⋈c S) = ΠL((ΠA1, ..., An, An+1, ..., An+k(R)) ⋈c (ΠB1, ..., Bm, Bm+1, ..., Bm+p(S)))

For ×, there is no condition c, so the first transformation rule always applies by replacing ⋈c with ×.

8. Commutativity of set operations: The set operations ∪ and ∩ are commutative, but set difference is not.

9. Associativity of ⋈, ×, ∪, and ∩: These four operations are individually associative; that is, if θ stands for any one of these four operations (throughout the expression), we have:

(R θ S) θ T = R θ (S θ T)

10. Commuting σ with set operations: The σ operation commutes with ∪, ∩, and set difference. If θ stands for any one of these three operations (throughout the expression), we have:

σc (R θ S )= (σc (R)) θ (σc (S))

11. The Π operation commutes with ∪:


ΠL(R ∪ S) = (ΠL(R)) ∪ (ΠL(S))

Outline of a Heuristic Algebraic Optimization Algorithm

We can now outline the steps of an algorithm that utilizes some of the above rules to transform an initial query tree into an optimized tree that is more efficient to execute (in most cases). The algorithm will lead to transformations similar to those discussed in our example of Figure 5.4. The steps of the algorithm are as follows:

1. Using Rule 1, break up any SELECT operations with conjunctive conditions into a cascade of SELECT operations. This permits a greater degree of freedom in moving SELECT operations down different branches of the tree.

2. Using Rules 2, 4, 6, and 10 concerning the commutativity of SELECT with other operations, move each SELECT operation as far down the query tree as is permitted by the attributes involved in the select condition.

3. Using Rules 5 and 9 concerning the commutativity and associativity of binary operations, rearrange the leaf nodes of the tree using the following criteria. First, position the leaf node relations with the most restrictive SELECT operations so that they are executed first in the query tree representation. The definition of the most restrictive SELECT can mean either the one that produces a relation with the fewest tuples or the one that produces a relation with the smallest absolute size. Another possibility is to define the most restrictive SELECT as the one with the smallest selectivity; this is more practical because estimates of selectivity are often available in the DBMS catalog. Second, make sure that the ordering of leaf nodes does not cause CARTESIAN PRODUCT operations; for example, if the two relations with the most restrictive SELECTs do not have a direct join condition between them, it may be desirable to change the order of leaf nodes to avoid Cartesian products.

4. Using Rules 3, 4, 7, and 11 concerning the cascading of PROJECT and the commuting of PROJECT with other operations, break down and move lists of projection attributes down the tree as far as possible by creating new PROJECT operations as needed. Only those attributes needed in the query result and in subsequent operations in the query tree should be kept after each PROJECT operation.

5. Identify subtrees that represent groups of operations that can be executed by a single algorithm.

In our example, Figure 5.4(b) shows the tree of Figure 5.4(a) after applying Steps 1 and 2 of the algorithm; Figure 5.5 shows the tree after Step 3; Figure 5.6 after Step 4; and Figure 5.7 after Step 5. In Step 5 we may group together the operations in the subtree whose root is the operation ΠESSN into a single algorithm. We may also group the remaining operations into another subtree, where the tuples resulting from the first algorithm replace the subtree whose root is the operation ΠESSN, because the first grouping means that this subtree is executed first.
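To make the flavour of such a transformation concrete, here is a minimal Python sketch of one rule: pushing a SELECT below a JOIN (Rule 6) when its condition involves attributes of only one join input. The Node class and its fields are our own illustrative choices, not the API of any real optimizer:

class Node:
    def __init__(self, op, children=(), **attrs):
        self.op = op                    # 'relation', 'select', or 'join'
        self.children = list(children)
        self.attrs = attrs

def push_select_down(select):
    # Rule 6: commute a SELECT with a JOIN when the selection condition
    # uses attributes of only one of the joined relations.
    join = select.children[0]
    if select.op != 'select' or join.op != 'join':
        return select
    for i, child in enumerate(join.children):
        if select.attrs['uses'] <= child.attrs['schema']:
            select.children[0] = child   # the SELECT now wraps one input...
            join.children[i] = select    # ...and the JOIN consumes it
            return join                  # the JOIN becomes the subtree root
    return select

emp = Node('relation', schema={'SSN', 'BDATE', 'LNAME'})
works = Node('relation', schema={'ESSN', 'PNO'})
tree = Node('select', [Node('join', [emp, works])], uses={'BDATE'})
tree = push_select_down(tree)
print(tree.op, tree.children[0].op)      # join select -- the SELECT moved down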

Summary of Heuristics for Algebraic Optimization

We now summarize the basic heuristics for algebraic optimization. The main heuristic is to apply first the operations that reduce the size of intermediate results. This includes performing SELECT operations as early as possible, to reduce the number of tuples, and moving PROJECT operations as far down the tree as possible, to reduce the number of attributes. In addition, the SELECT and JOIN operations that are most restrictive (that is, those that result in relations with the fewest tuples or the smallest absolute size) should be executed before other similar operations. This is done by reordering the leaf nodes of the tree among themselves while avoiding Cartesian products, and adjusting the rest of the tree appropriately.


Student Activity 5.2

Before reading the next section, answer the following questions:

1. Why do we use heuristics in query optimization?

2. Describe the notation for query trees and query graphs.

3. Carry out a heuristic optimization of a query tree of your choice.

4. What are the general transformation rules for relational algebra operations?

If your answers are correct, then proceed to the next section.

Top

Converting Query Trees into Query Execution Plans

An execution plan for a relational algebra expression represented as a query tree includes information about the access methods available for each relation as well as the algorithms to be used in computing the relational operators represented in the tree. As a simple example, consider query Q1 from Block 2 whose corresponding relational algebra expression is

ΠFNAME, LNAME, ADDRESS (σDNAME='Research' (DEPARTMENT) ⋈DNUMBER=DNO EMPLOYEE)

The corresponding query tree is shown in the figure below. To convert this tree into an execution plan, the optimizer might choose an index search for the SELECT operation (assuming an index exists), a table scan as the access method for EMPLOYEE, a nested-loop join algorithm for the join, and a scan of the JOIN result for the PROJECT operator. In addition, the approach taken for executing the query may specify a materialized or a pipelined evaluation.

With materialized evaluation, the result of an operation is stored as a temporary relation (that is, the result is physically materialized). For instance, the join operation can be computed and the entire result stored as a temporary relation, which is then read as input by the algorithm that computes the PROJECT operation, which would produce the query result table. On the other hand, with pipelined evaluation, as the resulting tuples of an operation are produced, they are forwarded directly to the next operation in the query sequence. For example, as the selected tuples from DEPARTMENT are produced by the SELECT operation, they are placed in a buffer; the JOIN operation algorithm then consumes the tuples from the buffer, and the tuples that result from the JOIN operation are pipelined to the projection operation algorithm. The advantage of pipelining is the cost saving in not having to write the intermediate results to disk, and not having to read them back for the next operation.

Figure: Query tree for Q1, in which the selection σDNAME='Research' is applied to DEPARTMENT, the result is joined with EMPLOYEE on DNUMBER=DNO, and the projection ΠFNAME, LNAME, ADDRESS is applied last.
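The contrast between the two evaluation approaches can be imitated with Python generators; this is an analogy only (a real system pipelines tuples between operator implementations, not between Python functions):

# Materialized: the operator builds its entire result before the consumer reads it.
def select_materialized(rows, pred):
    return [r for r in rows if pred(r)]

# Pipelined: each qualifying tuple is forwarded to the consumer immediately.
def select_pipelined(rows, pred):
    for r in rows:
        if pred(r):
            yield r

department = [{'DNAME': 'Research', 'DNUMBER': 5},
              {'DNAME': 'Headquarters', 'DNUMBER': 1}]

# The consumer (here, a projection printing DNUMBER) pulls tuples one at a
# time; no intermediate table is ever stored.
for d in select_pipelined(department, lambda r: r['DNAME'] == 'Research'):
    print(d['DNUMBER'])                  # 5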


Student Activity 5.3

Answer the following questions:

1. How can query trees be converted into query execution plans?

2. Construct some queries of your own and draw query trees for them.

Security Specification in SQL

The SQL data definition language includes commands to grant and revoke privileges. The SQL standard includes delete, insert, select, and update privileges. The select privilege corresponds to the read privilege. SQL also includes a references privilege that restricts a user's ability to declare foreign keys when creating relations. If the relation to be created includes a foreign key that references attributes of another relation, the user must have been granted the references privilege on those attributes. The reason that the references privilege is a useful feature is somewhat subtle, and is explained later in this section.

The grant statement is used to confer authorization. The basic form of this statement is as follows:

Grant <privilege list> on <relation name or view name> to <user list>

The privilege list allows the granting of several privileges in one command. The following grant statement grants user U1, U2 and U3 select authorization on the branch relation:

Grant select on branch to U1, U2, U3

The update authorization may be given either on all attributes of the relation or on only some. If update authorization is included in a grant statement, the list of attributes on which update authorization is to be granted optionally appears in parentheses immediately after the update keyword. If the list of attributes is omitted, the update privilege is granted on all attributes of the relation. The following grant statement gives users U1, U2 and U3 update authorization on the amount attribute of the loan relation:

Grant update (amount) on loan to U1, U2, U3




In SQL-92, the insert privilege may also specify a list of attributes: any insert to the relation must specify only these attributes, and each of the remaining attributes is either given default values (if a default is defined for the attribute) or set to null.

The SQL references privilege is granted on specific attributes in a manner similar to that shown for the update privilege. The following grant statement allows user U1 to create relations that reference the key branch-name of the branch relation as a foreign key:


Grant references (branch-name) on branch to U1

Initially, it may appear that there is no reason ever to prevent users from creating foreign keys referencing another relation. However, recall that foreign-key constraints restrict deletion and update


operations on the referenced relation. In the preceding example, if U1 creates a foreign key in a relation r referencing the branch-name attribute of the branch relation, and then inserts a tuple into r pertaining to the Perryridge branch, it is no longer possible to delete the Perryridge branch from the branch relation without also modifying relation r. Thus, the definition of a foreign key by U1 restricts future activity by other users; therefore, there is a need for the references privilege.

The keyword all privileges can be used as a short form for all the allowable privileges. Similarly, the user name public refers to all current and future users of the system. SQL-92 also includes a usage privilege that authorizes a user to use a specified domain (recall that a domain corresponds to the programming-language notion of a type, and may be user defined).

By default, a user who is granted a privilege in SQL is not authorized to grant that privilege to another user. If we wish to grant a privilege and to allow the recipient to pass the privilege on to other users, we append the with grant option clause to the appropriate grant command. For example, if we wish to allow U1 the select privilege on branch and to allow U1 to grant this privilege to others, we write:

Grant select on branch to U1 with grant option

To revoke an authorization, we use the revoke statement. It takes a form almost identical to that of grant.

Revoke <privilege list> on <relation name or view name>

from <user list> [restrict | cascade]

Thus, to revoke the privilege that was granted previously, we write

Revoke select on branch from U1, U2, U3 cascade

Revoke update (amount) on loan from U1, U2, U3

Revoke references (branch-name) on branch from U1

The revocation of a privilege from a user may cause other users also to lose that privilege. This behavior is called cascading of the revoke. The revoke statement may alternatively specify restrict:

Revoke select on branch from U1, U2, U3 restrict

In this case, an error is returned if there are any cascading revokes, and the revoke action is not carried out. The following revoke statement revokes only the grant option, rather than the actual select privilege.

Revoke grant option for select on branch from U1

The SQL-92 standard specifies a primitive authorization mechanism for the database schema: only the owner of the schema can carry out any modification to it. Thus, schema modifications (such as creating or deleting relations, adding or dropping attributes of relations, and adding or dropping indices) may be executed by only the owner of the schema. Several database implementations have more powerful authorization mechanisms for database schemas, similar to those discussed earlier, but these mechanisms are non-standard.

Student Activity 5.4

Before reading the next section, answer the following questions.

1. What do you understand by security specification in SQL?

2. What do you understand by revocation of authorization, and how is it carried out?


If your answers are correct, then proceed to the next section.

Encryption

The various provisions that a database system may make for authorization may not provide sufficient protection for highly sensitive data. In such cases, data may be encrypted. It is not possible for encrypted data to be read unless the reader knows how to decipher (decrypt) them.

There are a vast number of techniques for the encryption of data. Simple encryption techniques may not provide adequate security, since it may be easy for an unauthorized user to break the code. As an example of a weak encryption technique, consider the substitution of each character with the next character in the alphabet; thus, 'Perryridge' becomes 'Qfsszsjehf'.

If an unauthorized user sees only "Qfsszsjehf", she probably has insufficient information to break the code. However, if the intruder sees a large number of encrypted branch names, she could use statistical data regarding the relative frequency of characters (for example, e is more common than f) to guess what substitution is being made.
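A minimal Python sketch of this weak scheme (shift every letter one place forward in the alphabet) shows exactly how "Perryridge" becomes "Qfsszsjehf":

def shift_encrypt(text, key=1):
    # Substitute each letter with the letter `key` places later in the alphabet.
    result = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            result.append(chr((ord(ch) - base + key) % 26 + base))
        else:
            result.append(ch)
    return ''.join(result)

print(shift_encrypt('Perryridge'))       # Qfsszsjehf

Because only 25 distinct shifts exist, and letter frequencies survive the substitution, an intruder can break such a code with simple statistics, which motivates the properties listed next.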

A good encryption technique has the following properties:

• It is relatively simple for authorized users to encrypt and decrypt data.

• The encryption scheme depends not on the secrecy of the algorithm, but rather on a parameter of the algorithm called the encryption key.

• It is extremely difficult for an intruder to determine the encryption key.

A computer system, like any other mechanical or electrical device, is subject to failure. There is a variety of causes of such failure, including disk crash, power failure, software error, a fire in the machine room, or even sabotage. In each of these cases, information may be lost. An integral part of a database system is a recovery scheme that is responsible for the restoration of the database to the consistent state that existed prior to the occurrence of the failure.

Failure Classification

There are various types of failure that may occur in a system, each of which needs to be dealt with in a different manner. The simplest type of failure to deal with is one that does not result in the loss of information in the system. The failures that are more difficult to deal with are those that do result in loss of information. Here, we shall consider only the following types of failure:

• Transaction failure: There are two types of errors that may cause a transaction to fail:

• Logical error: The transaction can no longer continue with its normal execution, owing to some internal condition, such as bad input, data not found, overflow, or a resource limit exceeded.

• System error: The system has entered an undesirable state (for example, deadlock), as a result of which a transaction cannot continue with its normal execution. The transaction, however, can be re-executed at a later time.

• System crash: There is a hardware malfunction, or a bug in the database software or the operating system, that causes the loss of the content of volatile storage, and brings transaction processing to a halt. The content of nonvolatile storage remains intact and is not corrupted.

The assumption that hardware errors and bugs in the software bring the system to a halt, but do not corrupt the nonvolatile storage contents, is known as the fail-stop assumption. Well-designed systems have numerous internal checks, at the hardware and software level, that bring the system to a halt when there is an error. Hence, the fail-stop assumption is a reasonable one.


• Disk failure: A disk loses its content as a result of either a head crash or a failure during a data-transfer operation. Copies of the data on other disks, or archival backups on tertiary media such as tapes, are used to recover from the failure.

Storage Structure

The various data in the database may be stored and accessed in a number of different storage media. To understand how to ensure the atomicity and durability properties of a transaction, we must gain a better understanding of these storage media and their access methods.

Storage Types

There are various types of storage media; they are distinguished by their relative speed, capacity, and resilience to failure.

Volatile Storage: Information residing in volatile storage does not usually survive system crashes. Examples of such storage are main memory and cache memory. Access to volatile storage is extremely fast, both because of the speed of the memory access itself, and because it is possible to access any data item in volatile storage directly.

Nonvolatile storage: Information residing in nonvolatile storage survives system crashes. Examples of such storage are disks and magnetic tapes. Disks are used for online storage, whereas tapes are used for archival storage. Both, however, are subject to failure (for example, head crash), which may result in loss of information. At the current state of technology, nonvolatile storage is slower than volatile storage by several orders of magnitude. This distinction is the result of disk and tape devices being electromechanical, rather than based entirely on chips, as is volatile storage. Other nonvolatile media are normally used only for backup data.

Stable storage: Information residing in stable storage is never lost. Although stable storage is theoretically impossible to obtain, it can be closely approximated by techniques that make data loss extremely unlikely.

The distinction among the various storage types is often less clear in practice than in our presentation. Certain systems provide battery backup, so that some main memory can survive system crashes and power failures. Alternative forms of nonvolatile storage, such as optical media, provide an even higher degree of reliability than disks do.

Stable-Storage Implementation

To implement stable storage, we need to replicate the needed information in several nonvolatile storage media (usually disk) with independent failure modes, and to update the information in a controlled manner to ensure that failure during data transfer does not damage the needed information.

Data transfer between memory and disk storage can result in one of three outcomes:

Successful completion: The transferred information arrived safely at its destination.

Partial failure: A failure occurred in the midst of the transfer, and the destination block has incorrect information.

Total failure: The failure occurred sufficiently early during the transfer that the destination block remains intact.

We require that, if a data-transfer failure occurs, the system detects it and invokes a recovery procedure to restore the block to a consistent state. To do so, the system must maintain two physical blocks for each logical database block: in the case of mirrored disks, both blocks are at the same location; in the case of remote backup, one of the blocks is local, whereas the other is at a remote site.


An output operation is executed as follows:

1. Write the information into the first physical block.

2. When the first write completes successfully, write the same information onto the second block.

3. The output is completed only after the second write completes successfully.

During recovery, each pair of physical blocks is examined. If both are the same and no detectable error exists, then no further action is necessary. If one block contains a detectable error, then we replace its content with the content of the other block. If both blocks contain no detectable error but they differ in content, then we replace the content of the first block with the value of the second. This recovery procedure ensures that a write to stable storage either succeeds completely (that is, updates all copies) or results in no change.

The requirement of comparing every corresponding pair of blocks during recovery is expensive to meet. We can reduce the cost greatly by keeping track of block writes that are in progress, using a small amount of nonvolatile RAM.
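To make the two-step output protocol concrete, here is a minimal Python sketch of a mirrored-block write and its recovery procedure. It is an illustration only, not a real disk driver: the file-per-block layout, the MD5 checksum used to make corruption detectable, and the names write_block, read_block, stable_write and recover are all assumptions introduced for this example.

import hashlib, os

def write_block(path, data):
    # Write the data preceded by a checksum, so that a torn or
    # corrupted write is detectable on the next read.
    digest = hashlib.md5(data).digest()
    with open(path, "wb") as f:
        f.write(digest + data)
        f.flush()
        os.fsync(f.fileno())

def read_block(path):
    # Return the block's data, or None if the block is missing or corrupted.
    try:
        with open(path, "rb") as f:
            raw = f.read()
    except FileNotFoundError:
        return None
    digest, data = raw[:16], raw[16:]
    return data if hashlib.md5(data).digest() == digest else None

def stable_write(data, first="block.0", second="block.1"):
    write_block(first, data)     # step 1: write the first physical block
    write_block(second, data)    # step 2: only after the first write completes;
                                 # the output is complete only after this write

def recover(first="block.0", second="block.1"):
    # Examine the pair of blocks, as in the recovery procedure above.
    d1, d2 = read_block(first), read_block(second)
    if d1 is None and d2 is not None:
        write_block(first, d2)        # first copy has a detectable error
    elif d2 is None and d1 is not None:
        write_block(second, d1)       # second copy has a detectable error
    elif d1 is not None and d2 is not None and d1 != d2:
        write_block(first, d2)        # both readable but different: take the second

# Example: stable_write(b"A=950"); a later recover() repairs any torn write.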

Data Access

The database system resides permanently on nonvolatile storage, and is partitioned into fixed length storage units called blocks.

Figure 5.1: Block storage operations

Blocks are the units of data transfer to and from disk, and may contain several data items. We shall assume that no data item spans two or more blocks. This assumption is realistic for most data-processing applications, such as our banking example.

Transactions input information from the disk to main memory, and then output the information back onto the disk. The input and output operations are done in block units. The blocks residing on the disk are referred to as physical blocks; the blocks residing temporarily in main memory are referred to as buffer blocks. The area of memory where blocks reside temporarily is called the disk buffer.

Block movements between disk and main memory are initiated through the following two operations:

• input(B) transfers the physical block B to main memory.

• output(B) transfers the buffer block B to the disk, and replaces the appropriate physical block there.

This scheme is illustrated in Figure 5.1.

Each transaction Ti has a private work area in which copies of all the data items accessed and updated by Ti are kept. This work area is created when the transaction is initiated, and is removed when the transaction either commits or aborts. Each data item X kept in the work area of transaction Ti is denoted by xi. Transaction Ti interacts with the database system by transferring data between its work area and the system buffer. We transfer data using the following two operations:


1. read(X) assigns the value of data item X to the local variable xi. This operation is executed as follows:

If the block BX on which X resides is not in main memory, then issue input(BX).

Assign to xi the value of X from the buffer block.

2. write(X) assigns the value of the local variable xi to data item X in the buffer block. This operation is executed as follows:

If the block BX on which X resides is not in main memory, then issue input(BX).

Assign the value of xi to X in buffer block BX.

Note that both operations may require the transfer of a block from disk to main memory. They do not, however, specifically require the transfer of a block from main memory to disk.

A buffer block is eventually written out to the disk either because the buffer manager needs the memory space for other purposes or because the database system wishes to reflect the change to block B on the disk. We say that the database system force-outputs buffer block B if it issues an output(B).

When a transaction needs to access a data item X for the first time, it must execute read(X). All updates to X are then performed on xi. After the transaction accesses X for the final time, it must execute write(X) to reflect the change to X in the database itself.

The output(BX) operation for the buffer block BX on which X resides does not need to take effect immediately after write(X) is executed, since the block BX may contain other data items that are still being accessed. Thus, the actual output may take place later. Notice that, if the system crashes after the write(X) operation was executed but before output(BX) was executed, the new value of X is never written to disk and, thus, is lost.
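The buffer scheme just described can be sketched as a toy model in Python. The dictionaries disk, buffer, block_of and work_area below are assumptions made for this illustration; a real buffer manager is far more elaborate.

# A toy model of the buffer scheme. 'disk' maps block names to dicts of
# data items; 'buffer' holds the blocks currently in main memory.
disk = {"Bx": {"X": 1000}, "By": {"Y": 2000}}
buffer = {}
block_of = {"X": "Bx", "Y": "By"}      # the block on which each item resides

def input_block(b):
    buffer[b] = dict(disk[b])          # input(B): physical block -> main memory

def output_block(b):
    disk[b] = dict(buffer[b])          # output(B): buffer block -> disk

work_area = {}                         # the private copies xi of a transaction

def read(x):
    b = block_of[x]
    if b not in buffer:                # bring the block in if necessary
        input_block(b)
    work_area[x] = buffer[b][x]        # assign the value of X to xi

def write(x):
    b = block_of[x]
    if b not in buffer:
        input_block(b)
    buffer[b][x] = work_area[x]        # update only the buffer copy; the actual
                                       # output_block(b) may happen much later

# read("X"); work_area["X"] -= 50; write("X") changes only the buffer: the new
# value of X reaches the disk only when output_block("Bx") is eventually issued.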

Recovery and Atomicity

Consider again our simplified banking system and a transaction Ti that transfers 50 dollars from account A to account B, with initial values of A and B being 1000 and 2000 dollars, respectively. Suppose that a system crash has occurred during the execution of Ti, after output(BA) has taken place, but before output(BB) was executed, where BA and BB denote the buffer blocks on which A and B reside.

Since the memory contents were lost, we do not know the fate of the transaction; thus we could invoke one of two possible recovery procedures:

Re-execute Ti: This procedure will result in the value of A becoming 900 dollars rather than 950. Thus, the system enters an inconsistent state.

Do not re-execute Ti: The current system state has values of 950 and 2000 dollars for A and B, respectively. Thus, the system enters an inconsistent state.

In either case, the database is left in an inconsistent state, and thus this simple recovery scheme does not work. The reason for this difficulty is that we have modified the database without having assurance that the transaction will indeed commit. Our goal is to perform either all or none of the database modifications made by Ti. However, if Ti performed multiple database modifications, several output operations may be required, and a failure may occur after some of these modifications have been made but before all of them are made. To achieve our goal of atomicity, we must first output information describing the modifications to stable storage, without modifying the database itself. As we shall see, this procedure will allow us to output all the modifications made by a committed transaction, despite failures.

Student Activity 5.5

Before reading the next section, answer the following questions.

1. What do you understand by Encryption?


2. Write a short note on following: Recovery and Atomicity.

If your answers are correct, then proceed to the next section.

Why Concurrency Control is Needed

Several problems can occur when concurrent transactions execute in an uncontrolled manner. Here, some of the problems are discussed.

Consider an airline-reservation database in which a record is stored for each airline flight. Each record includes the number of reserved seats on that flight as a named data item, among other information. Transaction T1 transfers N reservations from one flight, whose number of reserved seats is stored in the database item named X, to another flight, whose number of reserved seats is stored in the database item named Y. Transaction T2 just reserves M seats on the first flight (X) referenced in transaction T1.

T1                        T2
read_item(X);             read_item(X);
X := X - N;               X := X + M;
write_item(X);            write_item(X);
read_item(Y);
Y := Y + N;
write_item(Y);

We now discuss the types of problems we may encounter with these two transactions if they run concurrently.

The Lost Update Problem

This problem occurs when two transactions that access the same database items have their operations interleaved in a way that makes the value of some database item incorrect. Suppose that transactions T1 and T2 are submitted at approximately the same time, and suppose that their operations are interleaved as shown in Figure 5.12; then the final value of item X is incorrect, because T2 reads the value of X before T1 changes it in the database, and hence the updated value resulting from T1 is lost. For example, if X = 80 at the start (originally there were 80 reservations on the flight), N = 5 (T1 transfers 5 seat reservations from the flight corresponding to X to the flight corresponding to Y), and M = 4 (T2 reserves 4 seats on X), the final result should be X = 79; but in the interleaving of operations shown in Figure 5.12 it is X = 84, because the update in T1 that removed the five seats from X was lost.

T1                        T2
read_item(X);
X := X - N;
                          read_item(X);
                          X := X + M;
write_item(X);
                          write_item(X);
read_item(Y);
Y := Y + N;
write_item(Y);

(Time increases downward.)

Figure 5.12: Item X has an incorrect value because its update by T1 is "lost" (overwritten)
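The interleaving of Figure 5.12 can be replayed deterministically in a few lines of Python. This is an illustration of the schedule, not real concurrent execution (a real race would need threads and unlucky timing); the function name lost_update is invented for the example.

# X = 80, N = 5, M = 4, as in the example above.
X = 80

def lost_update():
    global X
    x1 = X           # T1: read_item(X)
    x1 = x1 - 5      # T1: X := X - N
    x2 = X           # T2: read_item(X), reads the value before T1's write
    x2 = x2 + 4      # T2: X := X + M
    X = x1           # T1: write_item(X) -> X is 75
    X = x2           # T2: write_item(X) -> X is 84; T1's update is overwritten

lost_update()
print(X)             # prints 84, although the correct final value is 79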

The Temporary Update (or Dirty Read) Problem

This problem occurs when one transaction updates a database item and then the transaction fails for some reason. The updated item is accessed by another transaction before it is changed back to its original value. Figure 5.13 shows an example where T1 updates item X and then fails before completion, so the system must change X back to its original value. Before it can do so, transaction T2 reads the "temporary" value of X, which will not be recorded permanently in the database because of the failure of T1. The value of item X that is read by T2 is called dirty data, because it has been created by a transaction that has not completed and committed yet; hence, this problem is also known as the dirty read problem.

T1                        T2
read_item(X);
X := X - N;
write_item(X);
                          read_item(X);
                          X := X + M;
                          write_item(X);
read_item(Y);
(T1 fails)

Figure 5.13: Transaction T1 fails and must change the value of X back to its old value; meanwhile, T2 has read the "temporary" incorrect value of X

The Incorrect Summary Problem

If one transaction is calculating an aggregate summary function on a number of records while other transactions are updating some of these records, the aggregate function may calculate some values before they are updated and others after they are updated. For example, suppose that transaction T3 is calculating the total number of reservations on all the flights; meanwhile, transaction T1 is executing. If the interleaving of operations shown in Figure 5.14 occurs, the result of T3 will be off by an amount N, because T3 reads the value of X after N seats have been subtracted from it but reads the value of Y before those N seats have been added to it.

T1                        T3
                          sum := 0;
                          read_item(A);
                          sum := sum + A;
read_item(X);
X := X - N;
write_item(X);
                          read_item(X);
                          sum := sum + X;
                          read_item(Y);
                          sum := sum + Y;
read_item(Y);
Y := Y + N;
write_item(Y);

Figure 5.14: T3 reads X after N is subtracted and reads Y before N is added; a wrong summary is the result (off by N)

Student Activity 5.6

Before reading the next section, answer the following questions:



1. Make a comparison between single-user and multi-user systems.

2. Why concurrency control?

3. What do you understand by Dirty Read problem?

If your answers are correct, then proceed to the next section.

Top

Locking Techniques for Concurrency Control

Some of the main techniques used to control concurrent execution of transactions are based on the concept of locking data items. A lock is a variable associated with a data item that describes the status of the item with respect to the possible operations that can be applied to it. Generally, there is one lock for each data item in the database. Locks are used as a means of synchronizing the access by concurrent transactions to the database items.

Types of Locks and System Lock Tables

Binary Locks

A binary lock can have two states or values: locked and unlocked (or 1 and 0, for simplicity). A distinct lock is associated with each database item X. If the value of the lock on X is 1, item X cannot be accessed by a database operation that requests the item. If the value of the lock on X is 0, the item can be accessed when requested. We refer to the current value of the lock associated with item X as LOCK(X).

Two operations, lock_item and unlock_item, are used with binary locking. A transaction requests access to an item X by first issuing a lock_item(X) operation. If LOCK(X) = 1, the transaction is forced to wait. If LOCK(X) = 0, it is set to 1 (the transaction locks the item) and the transaction is allowed to access item X. When the transaction is through using the item, it issues an unlock_item(X) operation, which sets LOCK(X) to 0 (unlocks the item) so that X may be accessed by other transactions. Hence, a binary lock enforces mutual exclusion on the data item. A description of the lock_item(X) and unlock_item(X) operations is shown below.

lock_item(X):
B: if LOCK(X) = 0                (* item is unlocked *)
      then LOCK(X) ← 1           (* lock the item *)
   else begin
      wait (until LOCK(X) = 0 and the lock manager wakes up the transaction);
      go to B
   end;

unlock_item(X):
   LOCK(X) ← 0;                  (* unlock the item *)
   if any transactions are waiting, then wake up one of the waiting transactions;

If the simple binary locking scheme described here is used, every transaction must obey the following rules.


1. A transaction T must issue the operation lock_item(X) before any read_item(X) or write_item(X) operations are performed in T.

2. A transaction T must issue the operation unlock_item(X) after all read_item(X) and write_item(X) operations are completed in T.

3. A transaction T will not issue a lock_item(X) operation if it already holds the lock on item X.

4. A transaction T will not issue an unlock_item(X) operation unless it already holds the lock on item X.

Shared/Exclusive (or Read/Write) Locks

The binary locking scheme is too restrictive for database items, because at most one transaction can hold a lock on a given item. We should allow several transactions to access the same item X if they all access X for reading purposes only. However, if a transaction is to write an item X, it must have exclusive access to X. For this purpose, a different type of lock, called a multiple-mode lock, is used. In this scheme, called shared/exclusive or read/write locking, there are three locking operations: read_lock(X), write_lock(X), and unlock(X). A lock associated with an item X, LOCK(X), now has three possible states: "read-locked", "write-locked", or "unlocked". A read-locked item is also called share-locked, because other transactions are allowed to read the item, whereas a write-locked item is called exclusive-locked, because a single transaction exclusively holds the lock on the item. When we use the shared/exclusive locking scheme, the system must enforce the following rules:

1. A transaction T must issue the operation read_lock(X) or write_lock(X) before any read_item(X) operation is performed in T.

2. A transaction T must issue the operation write_lock(X) before any write_item(X) operation is performed in T.

3. A transaction T must issue the operation unlock(X) after all read_item(X) and write_item(X) operations are completed in T.

4. A transaction T will not issue a read_lock(X) operation if it already holds a read (shared) lock or a write (exclusive) lock on item X.

5. A transaction T will not issue a write_lock(X) operation if it already holds a read (shared) lock or a write (exclusive) lock on item X.

6. A transaction T will not issue an unlock(X) operation unless it already holds a read (shared) lock or a write (exclusive) lock on item X.

read_lock(X):
B: if LOCK(X) = "unlocked"
      then begin
         LOCK(X) ← "read-locked";
         no_of_reads(X) ← 1
      end
   else if LOCK(X) = "read-locked"
      then no_of_reads(X) ← no_of_reads(X) + 1
   else begin
      wait (until LOCK(X) = "unlocked" and the lock manager wakes up the transaction);
      go to B
   end;

write_lock(X):
B: if LOCK(X) = "unlocked"
      then LOCK(X) ← "write-locked"
   else begin
      wait (until LOCK(X) = "unlocked" and the lock manager wakes up the transaction);
      go to B
   end;

unlock(X):
   if LOCK(X) = "write-locked"
      then begin
         LOCK(X) ← "unlocked";
         wake up one of the waiting transactions, if any
      end
   else if LOCK(X) = "read-locked"
      then begin
         no_of_reads(X) ← no_of_reads(X) − 1;
         if no_of_reads(X) = 0
            then begin
               LOCK(X) ← "unlocked";
               wake up one of the waiting transactions, if any
            end
      end;

(Locking and unlocking operations for two-mode locks)
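The shared/exclusive scheme can be sketched the same way. The following Python class is an illustrative approximation only (the class name is invented for the example); it keeps no_of_reads(X) per item, ignores lock upgrading and downgrading, and makes no fairness guarantees for waiting writers.

import threading

class SharedExclusiveLockTable:
    def __init__(self):
        self.readers = {}                  # no_of_reads(X) per item
        self.write_locked = set()          # items currently write-locked
        self.cond = threading.Condition()

    def read_lock(self, x):
        with self.cond:
            while x in self.write_locked:  # wait while exclusive-locked
                self.cond.wait()
            self.readers[x] = self.readers.get(x, 0) + 1

    def write_lock(self, x):
        with self.cond:
            while x in self.write_locked or self.readers.get(x, 0) > 0:
                self.cond.wait()           # wait until the item is unlocked
            self.write_locked.add(x)

    def unlock(self, x):
        with self.cond:
            if x in self.write_locked:
                self.write_locked.discard(x)
            elif self.readers.get(x, 0) > 0:
                self.readers[x] -= 1
            self.cond.notify_all()         # wake up the waiting transactions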

Two-Phase Locking

A transaction is said to follow the two-phase locking protocol if all locking operations (read_lock, write_lock) precede the first unlock operation in the transaction. Such a transaction can be divided into two phases: an expanding or growing (first) phase, during which new locks on items can be acquired but none can be released; and a shrinking (second) phase, during which existing locks can be released but no new locks can be acquired.

Given below are two transactions T1 and T2 that do not follow the two-phase locking protocol.

T1                        T2
read_lock(Y);             read_lock(X);
read_item(Y);             read_item(X);
unlock(Y);                unlock(X);
write_lock(X);            write_lock(Y);
read_item(X);             read_item(Y);
X := X + Y;               Y := X + Y;
write_item(X);            write_item(Y);
unlock(X);                unlock(Y);


This is because the write_lock(X) operation follows the unlock(Y) operation in T1, and similarly the write_lock(Y) operation follows the unlock(X) operation in T2. If we enforce two-phase locking, the transactions can be rewritten as T1' and T2', as shown below:

T1'                       T2'
read_lock(Y);             read_lock(X);
read_item(Y);             read_item(X);
write_lock(X);            write_lock(Y);
unlock(Y);                unlock(X);
read_item(X);             read_item(Y);
X := X + Y;               Y := X + Y;
write_item(X);            write_item(Y);
unlock(X);                unlock(Y);

It can be proved that, if every transaction in a schedule follows the two-phase locking protocol, the schedule is guaranteed to be serializable, obviating the need to test the schedules themselves for serializability. The locking mechanism, by enforcing the two-phase locking rules, also enforces serializability.

Two-phase locking may limit the amount of concurrency that can occur in a schedule. This is because a transaction T may not be able to release an item X after it is through using it, if T must lock an additional item Y later on; conversely, T must lock the additional item Y before it needs it, so that it can release X. Hence, X must remain locked by T until all items that the transaction needs to read or write have been locked; only then can X be released by T. Meanwhile, another transaction seeking to access X may be forced to wait, even though T is done with X; conversely, if Y is locked earlier than it is needed, another transaction seeking to access Y is forced to wait even though T is not using Y yet. This is the price for guaranteeing serializability of all schedules without having to check the schedules themselves.
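A small wrapper can enforce the two-phase rule mechanically: once a transaction has unlocked anything, any further lock request is an error. This sketch assumes the SharedExclusiveLockTable from the earlier example; the class TwoPhaseTransaction is hypothetical.

class TwoPhaseTransaction:
    def __init__(self, lock_table):
        self.lt = lock_table
        self.held = set()
        self.shrinking = False          # becomes True at the first unlock

    def lock(self, x, mode="read"):
        # Growing phase only: locking after the first unlock violates 2PL.
        assert not self.shrinking, "2PL violated: lock requested after an unlock"
        if mode == "read":
            self.lt.read_lock(x)
        else:
            self.lt.write_lock(x)
        self.held.add(x)

    def unlock(self, x):
        self.shrinking = True           # the shrinking phase begins here
        self.lt.unlock(x)
        self.held.discard(x)

    def commit(self):
        for x in list(self.held):       # release whatever is still held
            self.unlock(x)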

Basic, Conservative, Strict, and Rigorous Two-Phase Locking:

The technique just described is known as basic 2PL. The variation known as conservative 2PL (or static 2PL) requires a transaction to lock all the items it accesses before the transaction begins execution, by predeclaring its read-set and write-set. If any of the predeclared items cannot be locked, the transaction does not lock any item; instead, it waits until all the items are available for locking. Conservative two-phase locking is a deadlock-free protocol. However, it is difficult to use in practice because of the need to predeclare the read-set and write-set, which is not possible in most situations.

In practice, the most popular variation of 2PL is strict 2PL, which guarantees strict schedules. In this variation, a transaction T does not release any of its exclusive (write) locks until after it commits or aborts.

Student Activity 5.7

Before reading the next section, answer the following questions:

1. What are the various Locking Techniques for concurrency control?

2. What is two phase locking?

If your answers are correct, then proceed to the next section.

Top

Concurrency Control Based on Timestamp Ordering


Time Stamps

A timestamp is a unique identifier created by the DBMS to identify a transaction. Typically, timestamp values are assigned in the order in which the transactions are submitted to the system, so a timestamp can be thought of as the transaction start time. We will refer to the timestamp of transaction T as TS(T).

Time Stamp Ordering Algorithm:

The idea of this scheme is to order the transactions based on their timestamps. A schedule in which the transactions participate is then serializable, and the equivalent serial schedule has the transactions in order of their timestamp values. This is called timestamp ordering (TO). In timestamp ordering, the schedule is equivalent to the particular serial order corresponding to the order of the transaction timestamps. The algorithm must ensure that, for each item accessed by conflicting operations in the schedule, the order in which the item is accessed does not violate the serializability order. To do this, the algorithm associates with each database item X two timestamp (TS) values:

1. read_TS(X): the read timestamp of item X; this is the largest timestamp among all the timestamps of transactions that have successfully read item X, i.e., read_TS(X) = TS(T), where T is the youngest transaction that has read X successfully.

2. write_TS(X): the write timestamp of item X; this is the largest of all the timestamps of transactions that have successfully written item X, i.e., write_TS(X) = TS(T), where T is the youngest transaction that has written X successfully.

Basic Timestamp Ordering

Whenever some transaction T tries to issue a read_item(X) or write_item(X) operation, the basic TO (timestamp ordering) algorithm compares the timestamp of T with read_TS(X) and write_TS(X) to ensure that the timestamp order of transactions is not violated. If this order is violated, then transaction T is aborted and resubmitted to the system as a new transaction with a new timestamp. If T is aborted and rolled back, any transaction T1 that may have used a value written by T must also be rolled back. Similarly, any transaction T2 that may have used a value written by T1 must also be rolled back, and so on. This effect is known as cascading rollback and is one of the problems associated with basic TO, since the schedules produced are not guaranteed to be recoverable, cascadeless, or strict. We describe the basic TO algorithm here. The concurrency-control algorithm must check whether conflicting operations violate the timestamp ordering in the following two cases:

1. Transaction T issues a write_item(X) operation:

a) If read_TS(X) > TS(T) or if write_TS(X) > TS(T), then abort and roll back T and reject the operation. This should be done because some younger transaction, with a timestamp greater than TS(T) and hence after T in the timestamp ordering, has already read or written the value of item X before T had a chance to write X, thus violating the timestamp ordering.

b) If the condition in part (a) does not occur, then execute the write_item(X) operation of T and set write_TS(X) to TS(T).

2. Transaction T issues a read_item(X) operation:

a) If write_TS(X) > TS(T), then abort and roll back T and reject the operation. This should be done because some younger transaction, with a timestamp greater than TS(T) and hence after T in the timestamp ordering, has already written the value of item X before T had a chance to read X.

b) If write_TS(X) <= TS(T), then execute the read_item(X) operation of T and set read_TS(X) to the larger of TS(T) and the current read_TS(X).


Hence, whenever the basic TO algorithm detects two conflicting operations that occur in the incorrect order, it rejects the later of the two operations by aborting the transaction that issued it.
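The two basic TO rules translate directly into code. In the sketch below, timestamps are positive integers, a value of 0 in the tables means "never read or written", and the function names to_read and to_write are invented for the example; aborting is represented by a return value.

read_TS = {}     # item -> largest timestamp of a transaction that read it
write_TS = {}    # item -> largest timestamp of a transaction that wrote it

def to_write(ts, x):
    # Rule 1: reject if a younger transaction has already read or written X.
    if read_TS.get(x, 0) > ts or write_TS.get(x, 0) > ts:
        return "abort"                  # roll back T and resubmit it
    write_TS[x] = ts
    return "ok"

def to_read(ts, x):
    # Rule 2: reject if a younger transaction has already written X.
    if write_TS.get(x, 0) > ts:
        return "abort"
    read_TS[x] = max(read_TS.get(x, 0), ts)
    return "ok"

# to_read(5, "X") succeeds; to_write(3, "X") then returns "abort": the older
# transaction (timestamp 3) arrived after a younger one (timestamp 5) read X.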

Strict Timestamp Ordering

A variation of basic TO called strict TO ensures that the schedules are both strict and serializable. In this variation, a transaction T that issues a read_item(X) or write_item(X) such that TS(T) > write_TS(X) has its read or write operation delayed until the transaction T' that wrote the value of X (hence TS(T') = write_TS(X)) has committed or aborted. To implement this algorithm, it is necessary to simulate the locking of an item X that has been written by transaction T' until T' is either committed or aborted. This algorithm does not cause deadlock, since T waits for T' only if TS(T) > TS(T').

Student Activity 5.8

Before reading the next section, answer the following questions:

1. Write brief notes about Basic Time Stamp Ordering.

2. Write the Time Stamp Ordering Algorithm.

3. What is Strict Time Stamp Ordering?

If your answers are correct, then proceed to the next section.

Top

Multiversion Concurrency Control Techniques

Other protocols for concurrency control keep the old values of a data item when the item is updated. These are known as multiversion concurrency control techniques, because several versions (values) of an item are maintained. When a transaction requires access to an item, an appropriate version is chosen to maintain the serializability of the currently executing schedule, if possible. The idea is that some read operations that would be rejected in other techniques can still be accepted by reading an older version of the item, thereby maintaining serializability. When a transaction writes an item, it writes a new version and the old version of the item is retained. Some multiversion concurrency control algorithms use the concept of view serializability rather than conflict serializability.

An obvious drawback of multi-version techniques is that more storage is needed to maintain multiple versions of the database items.

Several multi-version concurrency control schemes have been proposed. We discuss two schemes here, one based on timestamp ordering and the other based on 2PL.

Multiversion Technique Based on Timestamp Ordering

In this method, several versions X1, X2, ..., Xk of each data item X are maintained. For each version, the value of version Xi and the following two timestamps are kept:

1. read_TS(Xi): the read timestamp of Xi is the largest of all the timestamps of transactions that have successfully read version Xi.

2. write_TS(Xi): the write timestamp of Xi is the timestamp of the transaction that wrote the value of version Xi.

Whenever a transaction T is allowed to execute a write_item(X) operation, a new version Xk+1 of item X is created, with both write_TS(Xk+1) and read_TS(Xk+1) set to TS(T). Correspondingly, when a transaction T is allowed to read the value of version Xi, the value of read_TS(Xi) is set to the larger of the current read_TS(Xi) and TS(T).


To ensure serializability, the following rules are used.

1. If transaction T issues a write_item(X) operation, and version Xi of X has the highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T), and read_TS(Xi) > TS(T), then abort and roll back transaction T; otherwise, create a new version Xj of X with read_TS(Xj) = write_TS(Xj) = TS(T).

2. If transaction T issues a read_item(X) operation, find the version Xi of X that has the highest write_TS(Xi) of all versions of X that is also less than or equal to TS(T); then return the value of Xi to transaction T, and set the value of read_TS(Xi) to the larger of TS(T) and the current read_TS(Xi).

As we can see in case 2, a read_item(X) is always successful, since it finds the appropriate version Xi to read based on the write_TS of the various existing versions of X. In case 1, however, transaction T may be aborted and rolled back. This happens if T is attempting to write a version of X that should have been read by another transaction T' whose timestamp is read_TS(Xi); however, T' has already read version Xi, which was written by the transaction with timestamp equal to write_TS(Xi). If this conflict occurs, T is rolled back; otherwise, a new version of X, written by transaction T, is created. Notice that, if T is rolled back, cascading rollback may occur. Hence, to ensure recoverability, a transaction T should not be allowed to commit until after all the transactions that have written some version that T has read have committed.
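The version-selection logic of the two rules can be sketched in Python. The triples and function names below are assumptions made for the illustration, and each item is assumed to start with an initial version whose timestamps are 0.

# versions[x] is a list of (value, read_TS, write_TS) triples; every item is
# assumed to start with an initial version whose timestamps are 0.
versions = {"X": [(100, 0, 0)]}

def latest_version(ts, x):
    # The version with the highest write_TS that is still <= TS(T).
    return max((v for v in versions[x] if v[2] <= ts), key=lambda v: v[2])

def mv_read(ts, x):
    v = latest_version(ts, x)
    value, r, w = v
    versions[x][versions[x].index(v)] = (value, max(r, ts), w)
    return value                         # a read never fails in this scheme

def mv_write(ts, x, value):
    v = latest_version(ts, x)
    if v[1] > ts:                        # a younger transaction already read v
        return "abort"
    versions[x].append((value, ts, ts))  # new version, read_TS = write_TS = TS(T)
    return "ok"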

Multiversion Two-Phase Locking Using Certify Locks

In this multiple-mode locking scheme, there are three locking modes for an item: read, write, and certify, instead of just the two modes (read, write) discussed previously. Hence the state of LOCK(X) for an item X can be one of read-locked, write-locked, certify-locked, or unlocked. In the standard locking scheme with only read and write locks, a write lock is an exclusive lock. We can describe the relationship between read and write locks in the standard scheme by means of the lock compatibility table shown below. An entry of yes means that, if a transaction T holds the type of lock specified in the column header on item X and transaction T' requests the type of lock specified in the row header on the same item X, then T' can obtain the lock, because the locking modes are compatible. On the other hand, an entry of no in the table indicates that the locks are not compatible, so T' must wait until T releases the lock.

In the standard locking scheme, once a transaction obtains a write lock on an item, no other transaction can access that item. The idea behind multiversion 2PL is to allow other transactions T' to read an item X while a single transaction T holds a write lock on X. This is accomplished by allowing two versions of each item X; one version must always have been written by some committed transaction. The second version X' is created when a transaction T acquires a write lock on the item. Other transactions can continue to read the committed version of X while T holds the write lock. Transaction T can write the value of X' as needed, without affecting the value of the committed version X. However, once T is ready to commit, it must obtain a certify lock on all items on which it currently holds write locks before it can commit. The certify lock is not compatible with read locks, so the transaction may have to delay its commit until all its write-locked items are released by any reading transactions in order to obtain the certify locks. Once the certify locks, which are exclusive locks, are acquired, the committed version X of the data item is set to the value of version X', version X' is discarded, and the certify locks are released. The lock compatibility table for this scheme is shown below.

Compatibility table for the read/write locking scheme:

            Read    Write
Read        yes     no
Write       no      no

Compatibility table for the read/write/certify locking scheme:

            Read    Write   Certify
Read        yes     yes     no
Write       yes     no      no
Certify     no      no      no
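The compatibility tables translate into a simple lookup. In this hypothetical sketch, COMPAT maps a (requested, held) pair of modes to the yes/no entry of the read/write/certify table above; the names are invented for the example.

# yes/no entries of the read/write/certify table, keyed by (requested, held).
COMPAT = {
    ("read", "read"): True,     ("read", "write"): True,     ("read", "certify"): False,
    ("write", "read"): True,    ("write", "write"): False,   ("write", "certify"): False,
    ("certify", "read"): False, ("certify", "write"): False, ("certify", "certify"): False,
}

def can_grant(requested, held_modes):
    # T' obtains the lock only if the requested mode is compatible with every
    # mode already held on the item by other transactions.
    return all(COMPAT[(requested, h)] for h in held_modes)

# can_grant("read", ["write"]) is True (reads may proceed under a write lock),
# but can_grant("certify", ["read"]) is False, so the commit must wait.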


Returning to strict and rigorous 2PL: in strict 2PL, no other transaction can read or write an item written by T unless T has committed, which leads to a strict schedule for recoverability. Strict 2PL is not deadlock-free, however. A more restrictive variation of strict 2PL is rigorous 2PL, which also guarantees strict schedules. In this variation, a transaction T does not release any of its locks (exclusive or shared) until after it commits or aborts, and so it is easier to implement than strict 2PL. Notice the difference between conservative and rigorous 2PL: the former must lock all its items before it starts, so once the transaction starts it is in its shrinking phase, whereas the latter does not unlock any of its items until after it terminates (by committing or aborting), so the transaction is in its expanding phase until it ends.

Student Activity 5.9

Answer the following questions:

1. What is meant by the concurrent execution of database transactions in a multi-user system? Discuss why concurrency control is needed, and give informal examples.

2. Why concurrency control is needed in database system?

3. What is the two-phase locking protocol? How does it guarantee serializability?

Log-Based Recovery

The most widely used structure for recording database modifications is the log. The log is a sequence of log records, and maintains a record of all the update activities in the database. There are several types of log records. An update log record describes a single database write, and has the following fields:

Transaction identifier is the unique identifier of the transaction that performed the write operation.

Data-item identifier is the unique identifier of the data item written. Typically, it is the location on disk of the data item.

Old value is the value of the data item prior to the write.

New value is the value that the data item will have after the write.

Other special log records exist to record significant events during transaction processing, such as the start of a transaction and the commit or abort of a transaction.

<Ti start>            Transaction Ti has started.

<Ti, Xj, V1, V2>      Transaction Ti has performed a write on data item Xj; Xj had value V1 before the write, and will have value V2 after the write.

<Ti commit>           Transaction Ti has committed.

<Ti abort>            Transaction Ti has aborted.

Whenever a transaction performs a write, it is essential that the log record for that write be created before the database is modified. Once a log record exists, we can output the modification to the database if that is desirable. Also, we have the ability to undo a modification that has already been output to the database, by using the old-value field in the log records.
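A toy version of an update log and of undo-by-old-value can make the record format concrete. The list-based log and the function names below are illustrative assumptions, not a real recovery manager.

log = []                                   # the system log, oldest record first
db = {"A": 1000}

def log_start(tid):  log.append(("start", tid))
def log_commit(tid): log.append(("commit", tid))

def write_item(tid, item, new):
    # The log record <Ti, Xj, V1, V2> is appended BEFORE the database changes.
    log.append(("write", tid, item, db[item], new))
    db[item] = new

def undo(tid):
    # Scan the log backwards, restoring the old value V1 of every write by tid.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] == tid:
            db[rec[2]] = rec[3]
    log.append(("abort", tid))

log_start("T1")
write_item("T1", "A", 950)
undo("T1")            # A is restored to 1000 using the old-value field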

Summary

• A query typically has many possible execution strategies, and the process of choosing a suitable one for processing a query is known as query optimisation.

Page 171: Bca 204 3rd Database

DATABASE SYSTEMS 172

• The runtime database processor has the task of running the query code, whether in compiled or interpreted mode, to produce the query result.

• The steps involved in processing a query are: Parsing and translation, Optimisation, Evaluation

• The most widely used structure for recording database modifications is the log. The log is a sequence of log records, and maintains a record of all the update activities in the database.

Self-assessment Exercise

Solved Exercise

I. True or False

1. A query expressed in a high-level query language such as SQL must first be scanned, parsed, and validated.

2. Before query processing can begin, the system need not translate the query into a usable form.

3. Computing the precise cost of evaluation of a plan is usually not possible without actually evaluating the plan.

II. Fill in the Blanks

1. Planning of an execution strategy may be a more accurate description than_____________.

2. ___________languages permit users to specify what a query should generate without saying how the system should do the generating.

3. The cost estimates for _____________are based on the assumption that the blocks of a relation are stored contiguously on disk.

4. We must make sure that the _____________ steps always lead to an equivalent query tree.

5. ___________ identifier is the unique identifier of the transaction that performed the write operation.

Answers

I. True or False

1. True

2. False

3. True

II. Fill in the Blanks

1. query optimization

2. Declarative

3. binary search

4. transformation

5. Transaction

Page 172: Bca 204 3rd Database

QUERY PROCESSING 173

Unsolved Exercise

I. True or False

1. A query typically has many possible execution strategies, and the process of choosing a suitable one for processing a query is known as query processing.

2. The first action the system must take in query processing is to translate a given query into its internal form.

3. The programmer must choose the query execution strategy while writing a database program.

4. The content of nonvolatile storage remains intact and is not corrupted.

5. The idea for time – stamp ordering scheme is to order the transactions based on their timestamps.

II. Fill in the Blanks

1. A DBMS must devise a _____________ for retrieving the result of the query from the database files.

2. _____________ make use of statistical information stored in the DBMS catalog to estimate the cost of a plan.

3. In query processing, the ____________ is the lowest-level operator to access data.

4. The SQL ____________ includes commands to grant and revoke privileges.

5. A ________________ is a logical unit of database processing.

Detailed Questions

1. "Query optimisation is an important step in query processing." Justify the statement.

2. Discuss the various steps in Query processing?

3. What is heuristic optimisation of query trees? Discuss with an example.

4. "Security is a basic need for a database system." Justify the statement with a supporting example.

5. Compare binary locks to exclusive/shared locks. Why is the latter type of locks preferable?

6. What is a time stamp? How does the system generate timestamps?

7. Discuss the timestamp ordering protocol for concurrency control. How does strict timestamp ordering differ from basic timestamp ordering?

8. Discuss two multiversion techniques for concurrency control.

9. What is a certify lock? What are the advantages and disadvantages of using certify locks?

10. Fill in the blanks:


Definition and Analysis of Existing Systems
Data Analysis
Preliminary & Final Design of Relational Database
Testing
Process of Testing
Drawbacks of Testing
What is Implementation?
Operation and Tuning

Unit 6

Database Design Project

Learning Objectives

After reading this unit you should appreciate the following:

• Definition and Analysis of Existing Systems

• Data Analysis

• Preliminary and Final Design

• Testing & Implementation

• Maintenance

• Operation and Tuning

Top

Definition and Analysis of Existing Systems

Here, an existing system means "a system that is working within certain constraints" which we now want to update. In other words, we want our existing database system to work under improved constraints. For example, if we are handling the database of employees of a particular organisation, then it needs upgradation from time to time, i.e., we may have to add or delete something.

For this upgradation process, we have to analyse the system carefully so that the database constraints are not violated. We should also take care of data analysis in the following ways.

Top

Data Analysis

Although complex statistical analysis is best left to statistics packages, databases should support simple, commonly used forms of data analysis. Since the data stored in databases are usually large in volume, they need to be summarized in some fashion if we are to derive information that humans can use; aggregate functions are commonly used for this task.


The SQL aggregation functionality is limited, so several extensions have been implemented by different databases. For instance, although SQL defines only a few aggregate functions, many database systems provide a richer set of functions, including variance, median, and so on. Some systems also allow users to add new aggregate functions.

Histograms are frequently used in data analysis. A histogram partitions the values taken by an attribute into ranges, and computes an aggregate, such as sum, over the values in each range. For example, a histogram on salary values might count the number of people whose salaries fall in each of the ranges 0 to 20000, 20001 to 40000, 40001 to 60000, and above 60000. Using SQL to construct such a histogram efficiently would be cumbersome. We leave it as an exercise for you to verify our claim. Extensions to the SQL syntax that allow functions to be used in the group by clause have been proposed to simplify the task. For instance, the N_tile function supported on the Red Brick database system divides values into percentiles. Consider the following query:

select percentile, avg(balance)

from account

group by N_tile(balance, 10) as percentile

Here, N_tile(balance, 10) divides the values for balance into 10 consecutive ranges, with an equal number of values in each range; duplicates are not eliminated. Thus, the first range would have the bottom 10 percent of the values, and the tenth range would have the top 10 percent of the values. The rest of the query performs a group by based on these ranges, and returns the average balance for each range.
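Since N_tile is a vendor extension (Red Brick), the same percentile bucketing can be sketched portably. The following Python functions are a minimal illustration of what the query computes, assuming duplicates are kept and the buckets are made as equal in size as possible; the function names are invented for the example.

def ntile(values, n):
    # Split the sorted values into n consecutive ranges with (nearly) equal
    # numbers of values; duplicates are not eliminated.
    ordered = sorted(values)
    size = len(ordered)
    return {i + 1: ordered[i * size // n:(i + 1) * size // n] for i in range(n)}

def percentile_averages(balances, n=10):
    # The analogue of: select percentile, avg(balance) ... group by N_tile(...).
    return {p: sum(vs) / len(vs) for p, vs in ntile(balances, n).items() if vs}

# percentile_averages([100, 250, 300, 420, 500, 610, 700, 820, 900, 1500])
# -> {1: 100.0, 2: 250.0, ..., 10: 1500.0}, one balance per decile here.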

Statistical analysis often requires aggregation on multiple attributes. Consider an application in which a shop wants to find out what kinds of clothes are popular. Let us suppose that clothes are classified by colour and size, and that we have a relation sales with the schema Sales(colour, size, number). To analyse the sales by colour (light versus dark) and size (small, medium, large), the manager may want to see the data laid out as shown in the table in Figure 6.1.

The table in Figure 6.1 is an example of a cross-tabulation (cross-tab). Many report writers provide support for such tables. In this case, the data are two- dimensional, since they are based on two attributes: size and color. In general, the data can be represented as a multidimensional array, with a value for each element of the array. Such data are called multidimensional data.

Figure 6.1: Cross-tabulation of sales by colour and size

The data in a cross-tabulation cannot be generated by a single SQL query, since totals are taken at several different levels. Moreover, we can see easily that a cross-tabulation is not the same as a relational table. We can represent the data in relational form by introducing a special value all to represent subtotals, as shown in Figure 6.2.

Consider the tuples (Light, all, 53) and (Dark, all, 35). We have obtained these tuples by eliminating individual tuples with different values for size, and by replacing the value of number by an aggregate, namely sum. The value all can be thought of as representing the set of values for size. Moving from finer-granularity data to a coarser granularity by means of aggregation is called doing a rollup. In our example, we have rolled up the attribute size. The opposite operation, that of moving from coarser-granularity data to finer granularity, is called drill down. Clearly, finer-granularity data cannot be generated from coarse-granularity data: they must be generated either from the original data, or from yet-finer-granularity summary data.
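The rollup that produces the all tuples can be expressed in a few lines of Python. The per-cell numbers below are hypothetical, chosen only so that the colour totals match the tuples (Light, all, 53) and (Dark, all, 35) quoted above.

from collections import defaultdict

# Hypothetical sales tuples (colour, size, number).
sales = [("light", "small", 20), ("light", "medium", 18), ("light", "large", 15),
         ("dark", "small", 14), ("dark", "medium", 12), ("dark", "large", 9)]

def rollup_size(rows):
    # Aggregate size away: 'all' stands for the set of sizes rolled up.
    totals = defaultdict(int)
    for colour, size, number in rows:
        totals[colour] += number
    return [(colour, "all", n) for colour, n in totals.items()]

# rollup_size(sales) -> [('light', 'all', 53), ('dark', 'all', 35)]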

The number of different ways in which the tuples can be grouped for aggregation can be large, as you can verify easily from the table in Figure 6.2. In fact, for a table with n dimensions, rollup can be performed on each of the 2^n subsets of the n dimensions. Consider a three-dimensional version of the sales relation with size, colour, and price as the three dimensions. Figure 6.3 shows the subsets of attributes of the relation as corners of a three-dimensional cube; rollup can be performed on each of these subsets of attributes. In general, the subsets of attributes of an n-dimensional relation can be visualized as the corners of a corresponding n-dimensional cube.

Figure 6.2: Relational representation of the data in Figure 6.1

Figure 6.3: The subsets of attributes of a three-dimensional sales relation, shown as corners of a cube

Although we can generate tables such as the one in Figure 6.2 using SQL, doing so is cumbersome. The query involves the use of the union operation, and can be long; we leave it to you as an exercise to generate the rows containing all from a table containing the other rows.


There have been proposals to extend the SQL syntax with a cube operator. For instance, the following extended SQL query would generate the table shown in Figure 6.2:

select colour, size, sum (number)

from sales

group by colour, size with cube
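The "with cube" clause computes one group-by per subset of the listed dimensions. A short Python sketch of the enumeration follows (the function name is invented for the example):

from itertools import combinations

def cube_groupings(dimensions):
    # 'with cube' computes one group-by for each of the 2^n subsets of the n
    # dimensions; the empty subset corresponds to the grand total.
    subsets = []
    for k in range(len(dimensions) + 1):
        subsets.extend(combinations(dimensions, k))
    return subsets

# cube_groupings(["colour", "size"])
# -> [(), ('colour',), ('size',), ('colour', 'size')]  (2^2 = 4 groupings)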

Top

Preliminary and Final Design of Relational Database

Here we will discuss the process and the various steps of relational-database design. We take an example and apply the design rules to it step by step.

Pitfalls in Relational-Database Design

Before we begin our discussion of normal forms and data dependencies, let us look at what can go wrong in a bad database design. Among the undesirable properties that a bad design may have are

• Repetition of information

• Inability to represent certain information

We shall discuss these problems using a modified database design for our banking example, in which the information concerning loans is kept in one single relation, lending, which is defined over the relation schema

Figure 6.4: Sample lending relation

Lending-schema = (branch-name, branch-city, assets, customer-name, loan-number, amount)

Figure 6.4 shows an instance of the relation lending (Lending-schema). A tuple t in the lending relation has the following intuitive meaning:

• t[assets] is the asset figure for the branch named t[branch-name].

• t[branch-city] is the city in which the branch named t[branch-name] is located.

• t[loan-number] is the number assigned to a loan made by the branch named t[branch-name] to the customer named t[customer-name].

• t[amount] is the amount of the loan whose number is t[loan-number].


Suppose that we wish to add a new loan to our database. Say that the loan is made by the Perryridge branch to Adams in the amount of $1500. Let the loan number be L-31. In our design, we need a tuple with values on all the attributes of Lending-schema. Thus, we must repeat the asset and city data for the Perryridge branch, and must add the tuple

(Perryridge, Horseneck, 1700000, Adams, L-31, 1500)

to the lending relation. In general, the asset and city data for a branch must appear once for each loan made by that branch.

The repetition of information required by the use of our alternative design is undesirable. Repeating information wastes space. Furthermore, it complicates updating the database. Suppose, for example, that the Perryridge branch moves from Horseneck to Newtown. Under our original design, one tuple of the branch relation needs to be changed. Under our alternative design, many tuples of the lending relation need to be changed. Thus, updates are more costly under the alternative design than under the original design. When we perform the update in the alternative database, we must ensure that every tuple pertaining to the Perryridge branch is updated, or else our database will show two cities for the Perryridge branch.

That observation is central to understanding why the alternative design is bad. We know that a bank branch is located in exactly one city. On the other hand, we know that a branch may make many loans. In other words, the functional dependency

branch-name → branch-city

holds on Lending-schema, but we do not expect the functional dependency branch-name → loan-number to hold. The fact that a branch is located in a city and the fact that a branch makes a loan are independent, and, as we have seen, these facts are best represented in separate relations. We shall see that we can use functional dependencies to specify formally when a database design is good.
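A functional dependency can be checked mechanically against a relation instance. The helper below is a minimal Python sketch (the fd_holds name and the tiny lending sample are assumptions introduced for the example); it confirms that branch-name → branch-city can hold while branch-name → loan-number does not.

def fd_holds(rows, lhs, rhs):
    # lhs -> rhs holds if no two tuples agree on lhs but disagree on rhs.
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in seen and seen[key] != val:
            return False
        seen[key] = val
    return True

lending = [
    {"branch-name": "Perryridge", "branch-city": "Horseneck", "loan-number": "L-15"},
    {"branch-name": "Perryridge", "branch-city": "Horseneck", "loan-number": "L-16"},
]
assert fd_holds(lending, ["branch-name"], ["branch-city"])        # holds
assert not fd_holds(lending, ["branch-name"], ["loan-number"])    # does not hold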

Another problem with the Lending-schema design is that we cannot represent directly the information concerning a branch (branch-name, branch-city, assets) unless there exists at least one loan at the branch. The problem is that tuples in the lending relation require values for loan number, amount, and customer-name.

One solution to this problem is to introduce null values to handle updates through views. Recall, however, that null values are difficult to handle. If we are not willing to deal with null values, then we can create the branch information only when the first loan application at that branch is made. Worse, we would have to delete this information when all the loans have been paid. Clearly, this situation is undesirable, since, under our original database design, the branch information would be available regardless of whether or not loans are currently maintained in the branch, and without resorting to the use of null values.

Decomposition

A bad design suggests that we should decompose a relational schema that has many attributes into several schemas with fewer attributes. Careless decomposition, however, may lead to another form of bad design.

Consider an alternative design in which Lending-schema is decomposed into the following two schemas:

Branch-customer-schema = (branch-name, branch-city, assets, customer-name)

Customer-loan-schema = (customer-name, loan-number, amount)

Using the lending relation of Figure 6.4, we construct our new relations branch-customer (Branch-customer-schema) and customer-loan (Customer-loan-schema) as follows:

branch-customer = Π branch-name, branch-city, assets, customer-name (lending)


customer-loan = Π customer-name, loan-number, amount (lending)

We show the resulting branch-customer and customer-loan relations in Figures 6.5 and 6.6, respectively.

Figure 6.5: The relation branch-customer

Of course, there are cases in which we need to reconstruct the lending relation. For example, suppose that we wish to find all branches that have loans with amounts less than $1000. No relation in our alternative database contains these data. We need to reconstruct the lending relation. It appears that we can do so by writing

branch-customer ⋈ customer-loan

Figure 6.7 shows the result of computing branch-customer ⋈ customer-loan. When we compare this relation and the lending relation with which we started (Figure 6.4), we notice some differences. Although every tuple that appears in lending appears in branch-customer ⋈ customer-loan, there are tuples in branch-customer ⋈ customer-loan that are not in lending. In our example,

Figure 6.6: The relation customer-loan


Figure 6.7: The relation branch-customer ⋈ customer-loan

branch-customer ⋈ customer-loan has the following additional tuples:

(Downtown, Brooklyn, 9000000, Jones, L-93, 500)
(Perryridge, Horseneck, 1700000, Hayes, L-16, 1300)
(Mianus, Horseneck, 400000, Jones, L-17, 1000)
(North Town, Rye, 3700000, Hayes, L-15, 1500)

Consider the query, "Find all branches that have made a loan in an amount less than $1000." If we look back at Figure 6.4, we see that the only branches with loan amounts less than $1000 are Mianus and Round Hill. However, when we apply the expression below, we obtain three branch names: Mianus, Round Hill, and Downtown.

Π branch-name (σ amount < 1000 (branch-customer ⋈ customer-loan))

Let us examine this example more closely. If a customer happens to have several loans from different branches, we cannot tell which loan belongs to which branch. Thus, when we join branch-customer and customer-loan, we obtain not only the tuples we had originally in lending, but also several additional tuples. Although we have more tuples in branch-customer ⋈ customer-loan, we actually have less information. We are no longer able, in general, to represent in the database information about which customers are borrowers from which branch. Because of this loss of information, we call the decomposition of Lending-schema into Branch-customer-schema and Customer-loan-schema a lossy decomposition, or a lossy-join decomposition. A decomposition that is not a lossy-join decomposition is a lossless-join decomposition. It should be clear from our example that a lossy-join decomposition is, in general, a bad database design.
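The lossy join can be demonstrated in a few lines of Python. The sketch below uses a deliberately simplified two-tuple lending relation (attribute names shortened, values hypothetical); projecting onto (branch, customer) and (customer, loan) and rejoining yields four tuples, two of them spurious.

def project(rows, attrs):
    seen, out = set(), []
    for r in rows:
        key = tuple(r[a] for a in attrs)
        if key not in seen:              # projection eliminates duplicates
            seen.add(key)
            out.append({a: r[a] for a in attrs})
    return out

def natural_join(r1, r2):
    out = []
    for t1 in r1:
        for t2 in r2:
            common = set(t1) & set(t2)
            if all(t1[c] == t2[c] for c in common):
                out.append({**t1, **t2})
    return out

lending = [{"branch": "Downtown", "customer": "Jones", "loan": "L-17"},
           {"branch": "Mianus",   "customer": "Jones", "loan": "L-93"}]
bc = project(lending, ["branch", "customer"])
cl = project(lending, ["customer", "loan"])
rejoined = natural_join(bc, cl)
# len(rejoined) == 4: the spurious tuples (Downtown, Jones, L-93) and
# (Mianus, Jones, L-17) appear, so the decomposition has lost information.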

Let us examine the decomposition more closely to see why it is lossy. There is one attribute in common between Branch-customer-schema and Customer-loan-schema:

Branch-customer-schema ∩ Customer-loan-schema = {customer-name}


The only way that we can represent a relationship between, for example, loan number and branch-name is through customer-name. This representation is not adequate because a customer may have several loans, yet these loans are not necessarily obtained from the same branch.

Let us consider another alternative design, in which Lending schema is decomposed into the following two schemas:

Branch-schema = (branch-name, branch-city, assets)
Loan-info-schema = (branch-name, customer-name, loan-number, amount)

There is one attribute in common between these two schemas:

Branch-schema ∩ Loan-info-schema = {branch-name}

Thus, the only way that we can represent a relationship between, for example, customer-name and assets is through branch-name. The difference between this example and the preceding one is that the assets of a branch are the same regardless of the customer to which we are referring, whereas the lending branch associated with a certain loan amount does depend on the customer to which we are referring. For a given branch-name, there is exactly one assets value and exactly one branch-city, whereas a similar statement cannot be made for customer-name. That is, the functional dependency

branch-name → assets branch-city

holds, but customer-name does not functionally determine loan-number.

The notion of lossless joins is central to much of relational-database design. Therefore, we restate the preceding examples more concisely and more formally. Let R be a relation schema. A set of relation schemas {R1, R2, ..., Rn} is a decomposition of R if

R = R1 ∪ R2 ∪ … ∪ Rn

That is, {R1, R2, ..., Rn} is a decomposition of R if, for i = 1, 2, ..., n, each Ri is a subset of R, and every attribute in R appears in at least one Ri.

Let r be a relation on schema R, and let ri = ΠRi(r) for i = 1, 2, ..., n. That is, {r1, r2, ..., rn} is the database that results from decomposing R into {R1, R2, ..., Rn}. It is always the case that

r ⊆ r1 ⋈ r2 ⋈ ... ⋈ rn

To see that this assertion is true, consider a tuple t in relation r. When we compute the relations r1, r2, ..., rn, the tuple t gives rise to one tuple ti in each ri, i = 1, 2, ..., n. These n tuples combine to regenerate t when we compute r1 ⋈ r2 ⋈ ... ⋈ rn. The details are left for you to complete as an exercise. Therefore, every tuple in r appears in r1 ⋈ r2 ⋈ ... ⋈ rn.

In general, though, r ≠ r1 ⋈ r2 ⋈ ... ⋈ rn. As an illustration, consider our earlier example, in which

• n = 2.

• R = Lending-schema.

• R1 = Branch-customer-schema.

• R2 = Customer-loan-schema.

• r = the relation shown in Figure 6.4.

• r1 = the relation shown in Figure 6.5.

• r2 = the relation shown in Figure 6.6.

• r1 ⋈ r2 = the relation shown in Figure 6.7.

Note that the relations in Figures 6.4 and 6.7 are not the same.

To have a lossless-join decomposition, we need to impose constraints on the set of possible relations. We found that the decomposition of Lending-schema into Branch-schema and Loan-info-schema is lossless because the functional dependency

branch-name → branch-city assets

holds on Branch-schema.

Later in this unit, we shall introduce constraints other than functional dependencies. We say that a relation is legal if it satisfies all rules, or constraints, that we impose on our database.

Let C represent a set of constraints on the database. A decomposition {R1, R2, ..., Rn} of a relation schema R is a lossless-join decomposition for R if, for all relations r on schema R that are legal under C,

r = ΠR1(r) ⋈ ΠR2(r) ⋈ ... ⋈ ΠRn(r)

We shall show how to test whether a decomposition is a lossless-join decomposition in the next few sections. A major part of this unit is concerned with the questions of how to specify constraints on the database, and how to obtain lossless-join decompositions that avoid the pitfalls represented by the examples of bad database designs that we have seen in this section.

Normalisation Using Functional Dependencies

We can use a given set of functional dependencies in designing a relational database in which most of the undesirable properties do not occur. When we design such systems, it may become necessary to decompose a relation into several smaller relations. Using functional dependencies, we can define several normal forms that represent "good" database designs.

Desirable Properties of Decomposition

In this subsection, we shall illustrate our concepts by considering the Lending schema discussed earlier.

Lending-schema = (branch-name, branch-city, assets, customer-name, loan-number, amount)

The set F of functional dependencies that we require to hold on Lending-schema is:

branch-name → assets branch-city

loan-number → amount branch-name

As we discussed earlier with reference to Figure 6.4, Lending-schema is an example of a bad database design. Assume that we decompose it into the following three relations:

Branch-schema = (branch-name, assets, branch-city)
Loan-schema = (branch-name, loan-number, amount)
Borrower-schema = (customer-name, loan-number)

Lossless-Join Decomposition

When decomposing a relation into a number of smaller relations, it is crucial that the decomposition be lossless. To demonstrate our claim we must first present a criterion for determining whether a decomposition is lossy.


Let R be a relation schema, and let F be a set of functional dependencies on R. Let R1 and R2 form a decomposition of R. This decomposition is a lossless-join decomposition of R if at least one of the following functional dependencies is in F+:

� R1 ∩ R2 → R1

� R1 ∩ R2 → R2

We now show that our decomposition of Lending-schema is a lossless-join decomposition by showing a sequence of steps that generate the decomposition. We begin by decomposing Lending-schema into two schemas:

Branch-schema = (branch-name, branch-city, assets)
Loan-info-schema = (branch-name, customer-name, loan-number, amount)

Since branch-name → branch-city assets, the augmentation rule for functional dependencies implies that

branch-name → branch-name branch-city assets

Since Branch-schema ∩ Loan-info-schema = {branch-name}, it follows that our initial decomposition is a lossless-join decomposition.

Next, we decompose Loan-info-schema into

Loan-schema = (branch-name, loan-number, amount)
Borrower-schema = (customer-name, loan-number)

This step results in a lossless-join decomposition, since loan-number is a common attribute and loan-number → amount branch-name.
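To make the result concrete, the final decomposition can be sketched as SQL tables. This is only an illustrative sketch: the column types are assumptions, and the hyphens of the text's attribute names are written as underscores, since Oracle identifiers cannot contain hyphens. Note how the left-hand side of each functional dependency becomes the primary key of its relation.

CREATE TABLE branch
(branch_name varchar2 (15) PRIMARY KEY,   -- branch-name -> branch-city assets
branch_city varchar2 (15),
assets number (12, 2));

CREATE TABLE loan
(loan_number varchar2 (10) PRIMARY KEY,   -- loan-number -> amount branch-name
branch_name varchar2 (15),
amount number (10, 2));

CREATE TABLE borrower
(customer_name varchar2 (20),
loan_number varchar2 (10));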

Dependency Preservation

There is another goal in relational-database design: dependency preservation. When an update is made to the database, the system should be able to check that the update will not create an illegal relation—that is, one that does not satisfy all the given functional dependencies. If we are to check updates efficiently, we should design relational-database schemas that allow update validation without the computation of joins.

To decide whether joins must be computed, we need to determine what functional dependencies may be tested by checking each relation individually. Let F be a set of functional dependencies on a schema R, and let R1, R2, …, Rn be a decomposition of R. The restriction of F to Ri is the set Fi of all functional dependencies in F+ that include only attributes of Ri. Since all functional dependencies in a restriction involve attributes of only one relation schema, it is possible to test satisfaction of such a dependency by checking only one relation.

The set of restrictions F1, F2, …, Fn is the set of dependencies that can be checked efficiently. We now must ask whether testing only the restrictions is sufficient. Let F' = F1 ∪ F2 ∪ … ∪ Fn. F' is a set of functional dependencies on schema R, but, in general, F' ≠ F. However, even if F' ≠ F, it may be that F'+ = F+. If the latter is true, then every dependency in F is logically implied by F', and, if we verify that F' is satisfied, we have verified that F is satisfied. We say that a decomposition having the property F'+ = F+ is a dependency-preserving decomposition. Figure 6.8 shows an algorithm for testing dependency preservation. The input is a set D = {R1, R2, …, Rn} of decomposed relation schemas, and a set F of functional dependencies.

We can now show that our decomposition of Lending-schema is dependency preserving. We consider each member of the set F of functional dependencies that we require to hold on Lending-schema, and show that each one can be tested in at least one relation in the decomposition.

� We can test the functional dependency: branch-name → branch-city assets using Branch - schema = (branch-name, branch-city, assets).


� We can test the functional dependency: loan-number → amount branch-name using Loan-schema = (branch-name, loan-number, amount).

As the preceding example shows, it is often easier not to apply the algorithm of Figure 6.8 to test dependency preservation, since the first step—computation of F+—takes exponential time.

compute F+;
for each schema Ri in D do
begin
    Fi := the restriction of F+ to Ri;
end
F' := ∅
for each restriction Fi do
begin
    F' := F' ∪ Fi
end
compute F'+;
if (F'+ = F+) then return (true)
else return (false);

Figure 6.8: Algorithm for testing dependency preservation

Repetition of Information

In Lending-schema, it was necessary to repeat the city and assets of a branch for each loan. The decomposition separates branch and loan data into distinct relations, thereby eliminating this redundancy. Similarly, observe that, if a single loan is made to several customers, we must repeat the amount of the loan once for each customer (as well as the city and assets of the branch). In the decomposition, the relation on schema Borrower-schema contains the loan-number, customer-name relationship, and no other schema does. Therefore, we have one tuple for each customer of a loan only in the relation on Borrower-schema. In the other relations involving loan-number (those on schemas Loan-schema and Borrower-schema), only one tuple per loan needs to appear.

Clearly, the lack of redundancy exhibited by our decomposition is desirable. The degree to which we can achieve this lack of redundancy is represented by several normal forms, which we shall discuss in the remainder of this chapter.

Student Activity 6.1

Before reading the next section, answer the following questions:

1. Consider the database of a university computer science department, which has the following attributes: student-id, course, teacher-id, centre. How would you design a database for it?

2. How will you analyse an existing system?

If your answers are correct, then proceed to the next section.

Top

Testing

Testing is the most time-consuming, but an essential, activity of a software project. It is vital to the success of the candidate system. Although programmers also test their programs during the development phase, they


generally do not test them in a systematic way. This is because, during the development phase, they concentrate on removing syntax errors and some logical errors of the programs, and hence neither compare the outputs with the requirements nor test the complete system. For making the system reliable and error free, the complete system must be tested in a systematic and organized way. Before discussing the process of testing, let us first identify the activities that are required to be tested.

Testing Activities

During system testing, the following activities must be tested:

(a) Outputs: The system is tested to see whether it provides the desired outputs correctly and efficiently.

(b) Response Time: A system is expected to respond quickly during data entry, modification and query processes. The system should be tested to find the response time for various operations.

(c) Storage: A system should be tested to determine the capacity of the system to store data on the hard disk or other external storage device.

(d) Memory: During execution of the system, the programs require sufficient memory. The system is tested to determine the memory required for running various programs.

(e) Peak Load Processing: The system must also be tested to determine whether it can handle more than one activity simultaneously during the peak of its processing demand. This type of test is generally conducted for multi-user systems such as banking applications, railway reservation systems, etc.

(f) Security: The system must ensure the security of data and information. Therefore, the system is tested to check whether all the security measures are provided in the system or not.

(g) Recovery: Sometimes, due to certain technical or operational problems, data may be lost or damaged. The system must be tested to ensure that an efficient recovery procedure is available in the system to avoid disasters.

Types of Testing

Testing can be of the following types:

(a) Unit Testing: Testing of individual programs or modules is known as unit testing. Unit testing is done during both the development and testing phase.

(b) Integration Testing: Testing the interfaces between related modules of a system is known as integration testing. After development phase, all modules are tested to check whether they are properly integrated or not.

(c) System Testing: Executing the programs of the entire system with specially prepared test data, on the assumption that the programs may contain logical errors and may not conform to specifications, is known as system testing. System testing is the testing that is actually done during the testing phase.

(d) Acceptance Testing: Running the system under live or realistic data by the actual user is called acceptance testing. It can be done during both testing and implementation phases.

(e) Verification Testing: Running the system under a simulated environment using simulated data in order to find errors is known as verification testing or alpha testing.

(f) Validation Testing: Running the system under a live environment using live data in order to find errors is known as validation testing or beta testing.


(g) Static Testing: Observing the behaviour of the system not by executing the system but through reviews, reading the programs or any other non-execution method is called static testing.

(h) Dynamic Testing: Observing the behaviour of the system by executing the system is called dynamic testing. Except static testing, all other testing types are actually dynamic testing.

(i) Structural Testing: Testing the structure of various programs of a system by examining the program and data structures is called structural testing. So structural testing is concerned with the implementation of the program.

(j) Functional Testing: Testing the functions of various programs of a system by executing them, without examining the internal structure of the programs and data, is called functional testing. Functional testing is concerned with the functionality of the system.

Top

Process of Testing

There are many activities that must be performed during testing process. Some important activities are –

1. Preparation of Test Plan: A test plan is the first step of testing process. A test plan is a general document for the project, which contains the following:

- Identification and specification of test unit;

- Software features to be tested, e.g., performance, design constraints;

- Techniques used for testing;

- Preparation of test data;

- Schedule of each testing unit;

- Identification of persons responsible for each activity.

2. Specification of Test Cases: Specification of test cases is the next major step of the testing process. In this process, test data is prepared for testing each and every criterion of a test unit, along with the specifications of conditions and expected outputs. Selecting the test cases manually is a very difficult process. Some data flow analysis tools help in deciding the test cases.

3. Execution and Analysis of Test Cases: All the test cases are executed and analyzed by the analyst to see whether the system is giving expected outputs for all the conditions.

4. Special System Tests: Besides testing the normal execution of the system, special tests are needed to be performed to check the response time, storage capacity, memory requirements, peak load performance, security features and recovery procedures of the system.

Top

Drawbacks of Testing

Although testing is an essential phase of the SDLC, it has the following drawbacks:

� Testing is an expensive method for the identification and removal of faults (bugs) in the system.

� Testing is the most time-consuming activity of software development process.

Student Activity 6.2

Before reading the next section, answer the following questions:

1. Describe the activities performed during system testing.

2. Differentiate between the following.

a. Unit and Integration Testing

b. Static and Dynamic Testing

c. Verification and Validation Testing

d. Structural and Functional Testing

3. Differentiate between testing and debugging with suitable example.

If your answers are correct, then proceed to the next section.

Top

What is Implementation?

After testing, the candidate system is installed and implemented at the user's place. The old system is changed to the new or modified system, and users are given training to operate the new system. This is a crucial phase of the SDLC and is known as the implementation phase. Before discussing the activities of the implementation phase, let us first see what is meant by implementation. The term 'implementation' may be defined as follows:

Implementation is the process of replacing the manual or old computerized system with the newly developed system and making it operational, without disturbing the functioning of the organization.

Types of Implementation

Implementation may be of the following three types-

(a) Fresh Implementation: Implementation of totally new computerized system by replacing manual system.

(b) Replacement Implementation: Implementation of new computerized system by replacing old computerized system.

(c) Modified Implementation: Implementation of modified computerized system by replacing old computerized system.

Process of Implementation

Whatever be the kind of implementation, the implementation process has the following two parts:

(i) Conversion

(ii) Training of Users

We will discuss these procedures in brief.

Conversion

Conversion is the process of changing from the old system to modified or new one. Many different activities are needed to be performed in conversion process depending upon the type of implementation (as defined above). During fresh implementation, all necessary hardware is installed and manual files are converted to computer-compatible files. During replacement implementation, old hardware may be


replaced with new hardware and old file structures are also needed to be converted to new ones. The conversion process is comparatively simpler in the third type of implementation, i.e., modified implementation. In such implementation, existing hardware is generally not replaced and also no changes are made in file structures.

Conversion Plan

Before starting conversion process, the analyst must prepare a plan for conversion. This plan should be prepared in consultation with users. The conversion plan contains following important tasks:

(i) Selection of conversion method;

(ii) Preparation of a conversion schedule;

(iii) Identification of all data files needed to be converted;

(iv) Identification of documents required for conversion process;

(v) Selecting team members and assigning them different responsibilities.

Conversion Methods

The following four methods are available for conversion process:

(a) Direct Cutover: In this method, the old system (whether manual or computerised) is completely dropped on one particular date and the new system is implemented.

(b) Parallel Conversion: In this method, the old system is not dropped at once; both the old and new systems are operated in parallel. When the new system is accepted and successfully implemented, the old system is dropped.

(c) Phase-in-method of Conversion: In this method, the new system is implemented in many phases. Each phase is carried out only after successful implementation of previous phase.

(d) Pilot System: In this method, only a working version of the new system is implemented in one department of the organization. If the system is accepted in that department, it is implemented in the other departments, either in phases or completely.

Each of the above methods has its advantages and disadvantages. Although direct cutover is the fastest way of implementing the system, this method is very risky: since the organization depends completely on the new system, its failure would mean a great loss to the company. The parallel method is considered more secure, but it has many disadvantages. Parallel conversion doubles not only the operating costs but also the workload. Its major disadvantage is that the outputs of the two systems may mismatch, and in such cases it becomes very difficult for management to analyze, compare and evaluate the results. Although the phase-in method and pilot systems are more time-consuming methods of implementation, they are considered more reliable, secure and economical.

Top

Operation and Tuning

Tuning the performance of a system involves adjusting various parameters and design choices to improve its performance for a specific application. Various aspects of a database-system design, ranging from high-level aspects such as the schema and transaction design, through database parameters such as buffer size, down to hardware issues such as the number of disks, affect the performance of an application. Each of these aspects can be adjusted so that performance is improved.

Location of Bottlenecks

The performance of most systems is limited primarily by the performance of one or a few components, called bottlenecks. For instance, a program may spend 80 percent of its time in a small loop deep in the code and the remaining 20 percent on the rest of the code; even a great speedup of the rest of the code would then yield at most a 20 percent improvement overall, whereas improving the speed of the bottleneck loop could result in an improvement of nearly 80 percent overall, in the best case.

Hence, when tuning a system, we must first try to discover what the bottlenecks are, and then eliminate them by improving the performance of the components causing them. When one bottleneck is removed, it may turn out that another component becomes the bottleneck. In a well-balanced system, no single component is the bottleneck. If the system contains bottlenecks, components that are not part of the bottleneck are underutilized, and could perhaps have been replaced by cheaper components with lower performance.

For simple programs, the time spent in each region of the code determines the overall execution time. Database systems, however, are much more complex, and are better modeled as queueing systems. A transaction requests various services from the database system, starting from entry to a server process, disk reads during execution, CPU cycles, and locks for concurrency control. Each of these services has a queue associated with it, and transactions may spend most of their time waiting in queues, especially in disk I/O queues, rather than executing code.

As a result of the numerous queues in the database, bottlenecks in a database system typically show up in the form of long queues for a particular service, or, equivalently, in high utilizations for a particular service. If requests are spaced exactly uniformly, and the time to service a request is less than or equal to the time when the next request arrives, then each request will find the resource idle and can therefore start execution immediately without waiting. Unfortunately, the arrival of requests in a database system is never so uniform, and is instead random.

If a resource, such as a disk, has a low utilization, then, when a request is made, the resource is likely to be idle, in which case the waiting time for the request will be 0. Assuming uniformly randomly distributed arrivals, the length of the queue (and, correspondingly, the waiting time) goes up exponentially with utilization; as utilization approaches 100 percent, the queue length increases sharply, resulting in excessively long waiting times. The utilization of a resource should therefore be kept low enough that the queue length is short. As a rule of thumb, utilizations of around 70 percent are considered good, and utilizations above 90 percent are considered excessive, since they will result in significant delays. To learn more about the theory of queueing systems, generally referred to as queueing theory, you can consult the references cited in the bibliographic notes.
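As a rough numerical illustration, assume the simple M/M/1 queueing model (an assumption made only for this illustration; the discussion above does not depend on it). Under that model, the expected number of requests waiting or in service at a resource with utilization ρ is ρ/(1 − ρ):

ρ = 0.50: 0.50/0.50 = 1
ρ = 0.70: 0.70/0.30 ≈ 2.3
ρ = 0.90: 0.90/0.10 = 9
ρ = 0.99: 0.99/0.01 = 99

The jump between 90 and 99 percent utilization shows why queue lengths, and hence waiting times, grow so sharply as utilization approaches 100 percent.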

Tunable Parameters

Database administrators can tune a database system at three levels. The lowest level is the hardware level. Options at this level include adding disks or using a RAID system if disk I/O is a bottleneck, adding more memory if the disk buffer size is a bottleneck, or moving to a faster processor if CPU use is a bottleneck.

The second level consists of the database-system parameters, such as buffer size and checkpointing intervals. The exact set of database-system parameters that can be tuned depends on the specific database system. Most database-system manuals provide information on what database-system parameters can be adjusted, and how you should choose values for them. Well-designed database systems perform as much tuning as possible automatically, freeing the user or database administrator from the burden. For instance, many database systems have a buffer of fixed size, and the buffer size can be tuned. If the system automatically adjusts the buffer size by observing indicators such as page-fault rates, then the user will not have to worry about tuning it.

The third level is the higher-level design, including the schema and transactions. You can tune the design of the schema, the indices that are created, and the transactions that are executed, to improve performance. Tuning at


this level is comparatively system independent. In the rest of this section, we discuss tuning of the higher-level design.

The three levels of tuning interact with one another; we must consider them together when tuning a system. For example, tuning at a higher level may result in the hardware bottleneck changing from the disk system to the CPU, or vice versa.

Tuning of the Schema

Within the constraints of the normal form adopted, it is possible to partition relations vertically. For example, consider the account relation, with the schema:

Account (branch-name, account-number, balance)

for which account-number is a key. Within the constraints of the normal forms (BCNF and third normal form), we can partition the account relation into two relations as follows:

Account-branch (account-number, branch-name)

Account-balance (account-number, balance)

The two representations are logically equivalent, since account-number is a key, but they have different performance characteristics.

If most accesses to account information look at only the account-number and balance, then they can be run against the account-balance relation, and access is likely to be somewhat faster, since the branch-name attribute is not fetched. For the same reason, more tuples of account-balance will fit in the buffer than corresponding tuples of account, again leading to faster performance. This effect would be particularly marked if the branch-name attribute were large. Hence, a schema consisting of account-branch and account-balance would be preferable to a schema consisting of the account relation in this case.

On the other hand, if most accesses to account information require both balance and branch-name, using the account relation would be preferable, since the cost of the join of account-balance and account-branch would be avoided. Also, the storage overhead would be lower, since there would be only one relation, and the attribute account-number would not be replicated.
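A minimal SQL sketch of the partitioned alternative follows. The column types are assumptions, and hyphens are written as underscores, since Oracle identifiers cannot contain hyphens.

CREATE TABLE account_branch
(account_number varchar2 (10) PRIMARY KEY,
branch_name varchar2 (15));

CREATE TABLE account_balance
(account_number varchar2 (10) PRIMARY KEY,
balance number (10, 2));

-- A balance-only lookup touches just the smaller relation:
SELECT balance FROM account_balance
WHERE account_number = 'A-101';

-- When both branch-name and balance are needed, the join cost returns:
SELECT ab.branch_name, bal.balance
FROM account_branch ab, account_balance bal
WHERE ab.account_number = bal.account_number
AND ab.account_number = 'A-101';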

Tuning of Indices

We can tune the indices in a system to improve performance. If queries are the bottleneck, we can often speed them up by creating appropriate indices on relations. If updates are the bottleneck, there may be too many indices, which have to be updated when the relations are updated; removing indices may then speed up updates.

The choice of the type of index also is important. Some database systems support different kinds of indices, such as hash indices and B-tree indices. If range queries are common, B-tree indices are preferable to hash indices. Whether or not to make an index a clustered index is another tunable parameter. Only one index on a relation can be made clustered; generally, the one that benefits the greatest number of queries and updates should be made clustered.
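For instance, indices can be created and removed with the standard CREATE INDEX and DROP INDEX commands. The sketch below is illustrative only; the relation and index names are assumptions, not taken from the text.

-- If queries that look up accounts by branch are the bottleneck:
CREATE INDEX acc_branch_idx ON account (branch_name);

-- If updates later become the bottleneck, an unused index can be removed:
DROP INDEX acc_branch_idx;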

Tuning of Transactions

Both read-only and update transactions can be tuned. In the past, optimizers on many database systems were not particularly good, so how a query was written would have a big influence on how it was executed, and therefore on performance. Today's advanced optimizers can transform even badly written queries and execute them efficiently. However, optimizers have limits on what they can do. Most systems provide a mechanism to find out the exact execution plan for a query, and to use it to tune the query.


In embedded SQL, if a query is executed repeatedly, once for each of a set of values, it may pay to combine the repeated invocations into a single, more set-oriented query that is executed only once. The costs of communication of SQL queries can be high in client-server systems, so combining the embedded SQL calls is particularly helpful in such systems. For example, consider a program that steps through each department specified in a list, invoking an embedded SQL query to find the total expenses of the department using the group by construct on a relation expenses (date, employee, department, amount). If the expenses relation does not have a clustered index on department, each such query will result in a scan of the relation. Instead, we can use a single embedded SQL query to find the total expenses of every department, and to store the totals in a temporary relation; the query can be evaluated with a single scan. The relevant departments can then be looked up in this (presumably much smaller) temporary relation.
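A sketch of the set-oriented query for the expenses example just described (the relation and its columns follow the text; the column alias is illustrative):

SELECT department, SUM (amount) total_expenses
FROM expenses
GROUP BY department;

A single scan of expenses now computes the total for every department, instead of one scan per department in the loop-based version.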

Student Activity 6.3

Answer the following questions.

1. What do you mean by the concept of maintenance of a database system?

2. What do you understand by bottlenecks?

Summary

• Although complex statistical analysis is best left to statistics packages, databases should support simple, commonly used, forms of data analysis.

• For making the system reliable and error free, the complete system must be tested in a systematic and organized way.

• During system testing, the following activities must be tested: Outputs, Response Time, Storage, Memory, Peak Load Processing, Security and Recovery.

• Implementation is the process of converting the manual or old computerized system with the newly developed system and making it operational, without disturbing the functioning of the organization.

Self-assessment Questions

Solved Exercise

I. True or False

1. The data stored in databases are usually small in volume.

2. Statistical analysis often requires aggregation on multiple attributes.

3. The repetition of information required by the use of our alternative design is undesirable.

II. Fill in the Blanks

1. The SQL aggregation functionality is___________, so several extensions have been implemented by different databases.

2. Histograms are frequently used in____________.

3. Repeating information wastes _________.

4. To have a_____________________________, we need to impose constraints on the set of possible relations.

5. The ________________of a system involves adjusting various parameters design choices to improve its performance for a specific application.


Answers

I. True or False

1. False

2. True

3. True

II. Fill in the Blanks

1. limited

2. data analysis

3. space

4. lossless-join decomposition

5. performance

Unsolved Exercise

I. True or False

1. Updates are less costly under the alternative design than under the original design.

2. Careless decomposition may lead to another form of bad design.

3. If we are to check updates efficiently, we do not design relational-database schemas that allow update validation without the computation of joins.

4. Fresh Implementation is the Implementation of totally new computerized system by replacing manual system.

5. We can tune the indices in a system to improve performance.

II. Fill in the Blanks

1. Statistical analysis often requires _____________on multiple attribute.

2. Decomposing is dividing one __________into two.

3. When an update is made to the database the system should be able to check that the update will not create an illegal __________ .

4. 3NF is _________ than BCNF.

Detailed Questions

1. “Design of database requires a full conceptual knowledge of BASIC”. Justify the statement with an example.

2. BCNF is stronger than 3NF, justify with supportive example?

3. What do you understand by Tuning?

4. Is testing necessary before implementation? Give your views with example?


Implementation of SQL Using ORACLE RDBMS

Unit 7

Use of Relational DBMS Package for Class Project

Learning Objectives

After reading this unit you should appreciate the following:

• Implementation of SQL using Oracle RDBMS

Top

Implementation of SQL using Oracle RDBMS

Oracle Data Types

� CHAR: Values of this data type are fixed-length character strings of maximum length 255 characters.

� VARCHAR/VARCHAR2: Values of this data type are variable-length character strings of maximum length 2000 characters.

� NUMBER: The NUMBER data type is used to store numbers (fixed or floating point). Numbers of virtually any magnitude may be stored, up to 38 digits of precision. Numbers as large as 9.99 × 10 to the power of 124, i.e. 1 followed by 125 zeros, can be stored.

� DATE: The standard format is DD-MON-YY, as in 13-DEC-99. To enter dates in other than the standard format, use the appropriate functions. DATE stores the time along with the date, in the 24-hour format. By default, the time in a date field is 12:00:00 am if no time portion is specified, and the default date for a date field is the first day of the current month.

� LONG: It is used to store variable-length character strings containing up to 65,535 characters.

Creating a Table

Syntax

CREATE TABLE tablename (columnname datatype (size), columnname datatype (size));

Example

1. Create the client-master table, where

Column Name Datatype Size

Client-no Varchar 2 6

Name Varchar 2 20


Address 1 Varchar 2 30

Address 2 Varchar 2 30

City Varchar 2 15

State Varchar 2 15

Pincode Number 6

Remarks Varchar 2 60

Bal-due Number 10, 2

CREATE TABLE client-master (client-no varchar2 (6),
name varchar2 (20),
address1 varchar2 (30),
address2 varchar2 (30),
city varchar2 (15),
state varchar2 (15),
pincode number (6),
remarks varchar2 (60),
bal-due number (10, 2));

2. Create product-master table where

Column Name Data Type Size

Product-no Varchar 2 6

Description Varchar 2 25

Profit-percent Number 2, 2

Unit-measure Varchar 2 10

Qty-on-hand Number 8

Reorder-lvl Number 8

Sell-price Number 8, 2

Cost-price Number 8, 2

CREATE TABLE product-master (product-no varchar2 (6), description varchar2 (25), profit-percent number (2, 2), unit-measure varchar2 (10), qty-on-hand number (8), reorder-lvl number (8), sell-price number (8, 2), cost-price number (8, 2));

Creating a Table from a Table

Syntax

CREATE TABLE tablename [(columnname, columnname)] AS SELECT columnname, columnname from tablename;


Note: If the source table, from which the target table is being created, has records in it then the target table is populated with these records as well.

Example

Create table supplier-master from client-master. Select all fields and rename client-no with supplier-no and name with supplier-name.

CREATE TABLE supplier-master (supplier-no, supplier-name, address1, address2, city, state, pincode, remarks) AS SELECT client-no, name, address1, address2, city, state, pincode, remarks FROM client-master;

Insertion of Data into Tables

Inserting a Single Row of Data into a Table

Syntax

INSERT INTO tablename [(columnname, columnname)]

VALUES (expression, expression);

Example

Insert a record in the client-master table with client-no = C02000, name = Prabhakar Rane, address1 = A-5, Jay Apartments, address2 = Service Road, Vile Parle, city = Bombay, state = Maharashtra, pincode = 400057.

INSERT INTO client-master (client-no, name, address1, address2, city, state, pincode) VALUES ('C02000', 'Prabhakar Rane', 'A-5, Jay Apartments', 'Service Road, Vile Parle', 'Bombay', 'Maharashtra', 400057);

Note: The character expressions must be in single quotes.

Inserting Data into a Table from Another Table

Syntax

INSERT INTO table name SELECT columnname, columnname FROM tablename;

Example

Insert records in table supplier-master from table client-master.

INSERT INTO supplier-master SELECT client-no, name, address1, address2, city, state, pincode, remarks from client-master;

Inserting Selected Data into a Table from Another Table

Syntax

INSERT INTO tablename SELECT columnname, columnname FROM tablename WHERE column = expression;

Example

Insert records into supplier-master from table client-master where client-no = 'C01001';

INSERT INTO supplier-master SELECT client-no, name, address1, address2, city, pincode, state, remarks FROM client-master WHERE client-no = 'C01001';


Updating the Contents of a Table

Syntax

UPDATE tablename SET columnname = expression, columnname = expression … WHERE columnname = expression;

Example

Update table client-master: set name to 'Vijay Kadam' and address1 to 'SCT Jay Apartments' where client-no = 'C02000'.

UPDATE client-master SET name = 'Vijay Kadam', address1 = 'SCT Jay Apartments' WHERE client-no = 'C02000';

Deletion Operations

Removal of All Rows

Syntax

DELETE FROM tablename;

Example

Delete all records from table client-master;

DELETE FROM client-master;

Deletion of a Specified Number of Rows:

Syntax

DELETE FROM tablename WHERE search condition;

Example

Delete the record from client-master where client-no = 'C02000'.

DELETE FROM client-master WHERE client-no = 'C02000';

Viewing Data in the Tables

Global Data Extract (All Rows and All Columns)

Syntax

SELECT * FROM tablename;

Example

Select all records from table client-master;

SELECT * FROM client-master;

Retrieving Specific Columns from a Table

Syntax

SELECT columnname, columnname FROM tablename;


Examples

Select client-no and name from client-master

SELECT client-no, name FROM client-master;

Elimination of Duplicates from the Select Statement

Syntax

SELECT DISTINCT columname, columnname FROM tablename;

Example

Select unique rows from client-master.

SELECT DISTINCT client-no, name FROM client-master;

Sorting Data in a Table

Syntax

SELECT columnname, columnname FROM tablename ORDER BY columnname, columnname;

Example

Select client-no, name, address1, address2, city, pincode from client-master, sorted in the ascending order of client-no.

SELECT client-no, name, address1, address2, city, pincode FROM client-master ORDER BY client-no;
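By default, ORDER BY sorts in ascending order. To sort in descending order instead, append the standard DESC keyword, as in this small variation on the example above:

SELECT client-no, name, address1, address2, city, pincode
FROM client-master
ORDER BY client-no DESC;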

Selected Columns and Selected Rows

Syntax

SELECT columnname, columnname FROM tablename where search condition;

Example

Select client-no, name from client-master where client-no is equal to ‘C01234’;

SELECT client-no, name from client-master where client-no = ‘C01234’;

Note: In the search condition all standard operators such as logic, arithmetic, predicates, etc. can be used.

Modifying the Structure of Tables

Adding New Columns

Syntax

ALTER TABLE tablename ADD (new columnname datatype (size), new columnname datatype (size)……);

Example

Add fields, client-tel number (8), client-fax number (15) to table client-master;

ALTER TABLE Client-master

Add (Client-tel number (8), client-fax number (15));

Modifying Existing Columns


Syntax

ALTER TABLE tablename MODIFY (columnname new datatype (size));

Example

Modify the field client-fax to varchar2 (25).

ALTER TABLE client-master MODIFY (client-fax varchar2 (25));

Restrictions on the ALTER TABLE Command

Using the alter table clause, you cannot perform the following tasks:

� Change the name of the table.

� Change the name of the column.

� Drop a column.

� Decrease the size of a column if table data exists.

Deleting a Table from the Database

Syntax

DROP TABLE tablename;

Example

Delete table client-master;

DROP TABLE client-master;

Implementation of Constraints

NOT NULL Constraint at Column Level

CREATE TABLE client-master

(client-no varchar2 (6) NOT NULL, name varchar2 (20) NOT NULL, address1 varchar2 (30) NOT NULL, address2 varchar2 (30) NOT NULL, city varchar2 (15), state varchar2 (15), pincode number (6), remarks varchar2 (60), bal-due number (10, 2));

Primary Key Concept

A primary key is one or more columns in a table used to uniquely identify each row in table. Primary key values must not be null and must be unique across the column.

A multicolumn primary key is called a composite primary key. The only function that a primary key performs is to uniquely identify a row; thus, if one column is used, it is just as good as if multiple columns were used. Multiple columns (i.e. composite keys) are used only when the system design requires a primary key that cannot be contained in a single column.

Example

Primary Key at Column Level

Create client-master where client-no is the primary key.


CREATE TABLE client-master (client-no varchar2 (6) PRIMARY KEY, name varchar2 (20), address1 varchar2 (30), address2 varchar2 (30), city varchar2 (15), state varchar2 (15), pincode number (6), remarks varchar2 (60), bal-due number (10, 2));

Primary Key at Table Level

Create a sales-order-details table where

Column Name Date Type Size Attributes

S-order-no Varchar 2 6 Primary key

Product-no Varchar 2 6 Primary key

Qty.-ordered Number 8

Qty.-disp Number 8

Product-rate Number 8, 2

CREATE TABLE sales-order-details (s-order-no varchar2 (6), product-no varchar2 (6), qty-ordered number (8), qty-disp number (8), product-rate number (8, 2), PRIMARY KEY (s-order-no, product-no));

Unique Key Concept

A unique key is similar to a primary key, except that the purpose of a unique key is to ensure that the information in the column is unique for each record, as with telephone or driver's licence numbers.

A table may have many unique keys.

Example

Create table client Master with unique constraint on column client-no.

Unique Key at Column Level

CREATE TABLE Client-master

(client-no varchar2 (6) CONSTRAINT num-key UNIQUE,
name varchar2 (20),
address1 varchar2 (30),
address2 varchar2 (30),
city varchar2 (15),
state varchar2 (15),
pincode number (6));

Unique Key at Table Level

CREATE TABLE Client-Master

(client-no varchar2 (6), name varchar2 (20),
address1 varchar2 (30), address2 varchar2 (30),
city varchar2 (15), state varchar2 (15),
pincode number (6),
CONSTRAINT num-key UNIQUE (client-no));

Default Value Concept

At the time of cell creation, a 'default value' can be assigned to it. When the user loads a record and leaves this cell empty, the Oracle engine will automatically load the cell with the default value specified. The data type of the default value should match the data type of the column.

Example

Create the sales-order table, where:

Column Name Data Type Size Attributes

S-order-No Varchar 2 6 Primary Key

S-order-date Date

Client-no Varchar 2 6

Dely-Add Varchar 2 25

Salesman-no Varchar 2 6

Dely-type Char 1 Delivery: Part (P)/Full (F); Default 'F'

Dely-date Date

Order-status Varchar 2 10

CREATE TABLE Sales-order

(s-order-no varchar2 (6) PRIMARY KEY,
s-order-date date,
client-no varchar2 (6),
dely-add varchar2 (25),
salesman-no varchar2 (6),
dely-type char (1) DEFAULT 'F',
dely-date date,
order-status varchar2 (10));
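To see the default value in action, consider an insert that leaves dely-type unspecified (the values below are illustrative only):

INSERT INTO sales-order (s-order-no, s-order-date, client-no)
VALUES ('O19001', '12-JUN-02', 'C02000');

The new row's dely-type column is automatically filled with the default value 'F'.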

CHECK Integrity Constraints

Use the CHECK constraint when you need to enforce integrity rules that can be evaluated based on a logical expression. Never use CHECK constraints if the constraint can be defined using the not null, primary key or foreign key constraint.

Following are a few examples of CHECK constraints.


� A CHECK constraint on the client-no column of the client-master so that client-no value starts with ‘C’.

� A CHECK constraint on name column of the client-master so that the name is entered in upper case.

� A CHECK constraint on the city column of the client-master so that only the cities “BOMBAY”, “NEW DELHI” “MADRAS” and “CALCUTTA” are allowed.

Example

CREATE TABLE client-master
(client-no varchar2 (6) CONSTRAINT k-client
CHECK (client-no LIKE 'C%'),
name varchar2 (20) CONSTRAINT k-name
CHECK (name = UPPER(name)),
address1 varchar2 (30),
address2 varchar2 (30),
city varchar2 (15) CONSTRAINT k-city
CHECK (city IN ('NEW DELHI', 'BOMBAY', 'CALCUTTA', 'MADRAS')));

Foreign Key Concept

Foreign keys represent relationships between tables. A foreign key is a column (or a group of columns) whose values are derived from the primary key of the same or some other table.

The existence of a foreign key implies that the table with the foreign key is related to the primary key table from which the foreign key is derived. A foreign key must have corresponding primary key value in the primary key table to have a meaning.

For example, the S-order-no column is the primary key of table Sales-order. In table sales-order-details, S-order-no is a foreign key that references the S-order-no values in table sales-order.

Example: Create table sales-order-details with primary key as s-order-no and product-no, and foreign key as s-order-no referencing column s-order-no in the sales-order table.

CREATE TABLE sales-order-details
(s-order-no varchar2 (6) REFERENCES sales-order,
product-no varchar2 (6),
qty-ordered number (8),
product-rate number (8, 2),
PRIMARY KEY (s-order-no, product-no));
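With this constraint in place, a detail row that references a non-existent sales order is rejected. In the illustrative insert below, the order number is assumed to be absent from sales-order:

INSERT INTO sales-order-details (s-order-no, product-no, qty-ordered, product-rate)
VALUES ('O99999', 'P00001', 10, 525);

The Oracle engine reports an integrity-constraint violation, because 'O99999' has no matching primary key value in sales-order.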

Defining Integrity Constraints in the ALTER TABLE Command

You can also define integrity constraints using the constraint clause in the ALTER TABLE command. Examples are given below.

1. Add PRIMARY KEY constraint on column supplier-no in table SUPPLIER-MASTER.

ALTER TABLE supplier-master
ADD PRIMARY KEY (supplier-no);

2. Add a FOREIGN KEY constraint on column s-order-no in table sales-order-details, referencing table sales-order; also modify column qty-ordered to include the NOT NULL constraint.

ALTER TABLE Sales-order-details

Add Constraint order-Key

FOREIGN KEY (S-order-no) REFERENCES Sales-order

MODIFY (qty-ordered number (8) NOT NULL);

Dropping Integrity Constraints in the ALTER TABLE Command

You can drop an integrity constraint if the rule that it enforces is no longer true or if the constraint is no longer needed. Drop the constraint using the alter table command with the DROP clause. The following example illustrates the dropping of integrity constraints.

1. Drop the PRIMARY KEY constraint from supplier-master.

ALTER TABLE Supplier-Master

DROP PRIMARY KEY;

2. Drop the FOREIGN KEY constraint on column product-no in table sales-order-details;

ALTER TABLE Sales-order-details

DROP CONSTRAINT Product-key;

Computations Done on Table Data

Arithmetic Operators

+ Addition * Multiplication

- Subtraction ** Exponentiation

/ Division () Enclosed operation

Example

Select Product-no, description and compute sell-price * 0.05 and Sell-price * 1.05 for each row retrieved.

Select product-no, description, Sell-price * 0.05, Sell-price * 1.05 from Product-master;

Here, sell-price * 0.05 and sell-price * 1.05 are not columns in the table product-master, but are calculations done on the contents of the column sell-price of the table product-master. The output will be shown as follows.

Product No. Description Sell-Price *0.05 Sell-Price *1.05

P00001 Floppy 25 525

P03453 Mouse 50 1050

P07865 Keyboard 150 3150

Renaming Columns Used with Expression Lists

The default output column names can be renamed by the user if required.


Syntax

SELECT columnname result-columnname, columnname result-columnname … FROM tablename;

Example

Select product-no, description and compute sell-price * 0.05 and sell-price * 1.05 for each row retrieved. Rename sell-price * 0.05 as Decrease and sell-price * 1.05 as New Price.

SELECT Product-no, description,

Sell-price * 0.05 Decrease,

Sell-price *1.05 New Price

FROM product-master;

The output will be

Product No. Description Decrease New Price

P00001 Floppy 25 525

P03453 Mouse 50 1050

P07865 Keyboard 150 3150

Logical Operators

The Logical operators that can be used in SQL sentences are and, or, not

Example

1. Select client information like client-no, name, address 1, address 2, city and pincode for all the clients in ‘Bombay’ or ‘Delhi’;

SELECT client-no, name, address1, address2, city, pincode

FROM client-master

WHERE City = ‘BOMBAY’ or City = ‘DELHI’;

2. Select Product-no, description, profit-percent, sell price where profit-percent is between 10 and 20 both inclusive.

SELECT Product-no, description, profit-percent, Sell-price

FROM product-Master

WHERE profit-Percent > = 10 AND

Profit-Percent < = 20;

Range Searching

Example

Select product-no, description, profit-percent, sell-price where profit-percent is not between 10 and 20;

SELECT product-no, description, profit-percent, sell-price

FROM product-master

WHERE profit-percent NOT BETWEEN 10 AND 20;


Pattern Matching

Use of the LIKE Predicate

For the character data types: % matches any string of characters, while _ (underscore) matches any single character.

Example

1. Select supplier_name from supplier_master where the first two characters of name are ‘ja’;

SELECT supplier_name

FROM supplier_master

WHERE supplier_name LIKE ‘ja%’;

2. Select supplier_name from supplier_master where the second character of name is ‘r’ or ‘h’;

SELECT supplier_name

FROM supplier-master

WHERE supplier_name LIKE '_r%' OR supplier_name LIKE '_h%';

3. Select supplier-name, address 1, address 2, city and pincode from supplier_master where name is 3 characters long and the first two characters are ‘ja’;

SELECT supplier_name, address1, address2, city, pincode
FROM supplier_master
WHERE supplier_name LIKE 'ja_';

Summary

• Command for creating a table: CREATE TABLE tablename (columnname datatype (size), columnname datatype (size));

• Command for inserting a record into a table: INSERT INTO tablename [(columnname, columnname)] VALUES (expression, expression);

• Command to update a record: UPDATE tablename SET columnname = expression, columnname = expression … WHERE columnname = expression;

• Command to delete records: DELETE FROM tablename;

Self-assessment Questions

Solved Exercise

I. True or False

1. Values of Char Datatype are fixed length character strings of maximum length 255 characters.

2. Using the alter table clause you can change the name of the table.


3. You can define integrity constraints using the constraint clause in the ALTER TABLE command.

II. Fill in the Blanks

1. The ______________Datatype is used to store numbers (fixed or floating point).

2. INSERT INTO tablename [(columnname, columnname)]

___________(expression, expression);

3. _____________keys represent relationships between tables.

4. DELETE ________ tablename;

5. You can drop an __________ ___________if the rule that it enforces is no longer true or if the constraint is no longer needed.

Answers

I. True or False

1. True

2. False

3. True

II. Fill in the Blanks

1. NUMBER

2. VALUES

3. Foreign

4. FROM

5. integrity constraint

Unsolved Exercise

I. True or False

1. LONG is used to store variable length character strings containing up to 65,535 characters.

2. The syntax of create table command is CREATE TABLE (Columname datatype (size), column name datatype (size);

3. Using the alter table clause you can not change the name of the column.

4. Use the CHECK constraint when you need to enforce integrity rules that can be evaluated based on a logical expression.

5. At the time of cell creation ‘a default value’ can not be assigned to it.

II. Fill in the Blanks

1. The standard date format in Oracle is ______________.

2. LONG is used to store variable length character strings containing up to ____________ characters.


3. CHAR: values of this data type are fixed length character strings of maximum length __________ characters.

4. VARCHAR: values of this data type are variable length character strings of maximum length ______________.

5. Numbers of virtually any magnitude may be stored, up to ___________ digits of precision.

Detailed Questions

1. Create table supplier-master from client-master; select all fields and rename client-no with supplier-no and name with supplier-name.

2. Insert a record in the client-master table with client-no = C03000, name = Peter, address1 = A-5, Jay Apartments, address2 = Service Road, Vile Parle, city = Bangalore, state = Karnataka, pincode = 300056.

3. Insert records into supplier-master from table client-master where client-no=C03000.

4. Update table client-master: set name = 'Peterwiley' and address1 = 'SXT Jay Apartments' where client-no = 'C03000'.


Appendix

Question Bank

1. Discuss each of the following terms:

a) Data

b) Field

c) Record

d) File

2. What do you suppose to be the difference in requirement for database management systems intended for use with general purpose business and administrative systems and those intended for use with each of the following:

a) Science Research

b) Library Retrieval Systems

c) Inventory Management

d) Computer Aided Software Design

e) Personal Management System

3. What is the difference between data and information?

4. What is the significance of Information Systems from the Manager’s point of view?

5. Highlight the difference in the File Management and Database Management approach.

6. Discuss the advantages and disadvantages of Database approach.

7. What do you think should the major objectives of a Database be?

8. What is Data Independence and why do you think is it important?

9. What do you think is the significance of Information, for an organisation.

10. What is meant by Data Processing?

11. Describe the Storage hierarchy.

12. Illustrate the benefits of maintaining records of computers.

13. Compile a list of applications that exist around you, which you think are directly dependent on the Database technology.

14. Describe how would you go about collecting information for the following business systems:

a) Marketing Information Systems

b) Accounts Receivable System


c) Purchase Order Processing Systems

d) Describe the various inputs, processes and outputs involved in the above-mentioned business systems.

15. Collect some real-life examples of how Database technology has helped organisations to grow better and improve their efficiency and customer service.

16. Describe the Database System Life Cycle. What is functional operation of each of these stages?

17. Why do you think proper analysis of the required database is important?

18. Make a checklist of items that you would consider while you are in the design stage.

19. What is Operation and Maintenance stage all about? Describe the role of DBA in this stage?

20. Describe the different components of a typical Database Management Systems.

21. Identify the different files involved in the following business processing system and detail out their database structure:

a) Order Processing System

b) Accounts Payable System

c) Inventory Management System

d) Student Information System.

22. Perform a research on the present set of tools and techniques offered by major RDBMS companies like Oracle and Sybase.

23. Describe the following terms:

a) Schema

b) Data Dictionary

c) End-User

d) Three Levels of Abstraction

24. Having understood the structure of a typical DBMS package, highlight the role of each individual component.

25. Compare the three functional approaches to database design and implementation—Hierarchical, Network and Relational.

26. What are the distinct advantages of Hierarchical and Network Database?

27. Describe the disadvantages of Relational Databases.

28. To acquire knowledge about latest developments in the area of Relational Databases and reviews about different database packages, read the latest reviews and articles. Document this information and draw your comparison chart.

29. Discuss the benefits of Data Dictionary.

30. Why is the relational model becoming dominant?


31. Describe the basic features of the Relational Database model and discuss their importance to the end user and the designer.

32. Describe the architecture of a DBMS? What is the importance of Logical and Physical Data Independence?

33. Explain the following with relevant examples:

Data Description Language , Device Media Control Language, Data Manipulation Language.

34. Compare the Insertion and Deletion functions between the three models.

35. Relational model is one of the major inventions of information technology. Elaborate.

36. Describe the following terms

i. Primary Key

ii. Candidate Key

iii. Foreign Key

iv. Cardinality

v. Degree

vi. Domain

37. Explain the relevance and significance of each of the Twelve Codd rules, with appropriate relevant examples.

38. With context to the E-R relationship diagram, diagrammatically represent at least five examples for each of the following types of relationships:

a) One – One

b) One – Many

c) Many – Many

39. Describe the technique of Information Modeling and the art of drawing E-R diagrams. Take up any example of business system that exists around you and try modelling it using this technique.

40. What is the concept of Normalisation? Why do you think is it required?

41. Document the features of Oracle latest products—Oracle 8.0, Developer 2000 and Designer 2000.

42. Why do you think the database needs to be protected?

43. Highlight the differences between Security and Integrity.

44. How are the concepts of Integrity maintained in Relational Databases?

45. What are the major threats and security mechanisms to be adopted against them?

46. Discuss some of the security and integrity functions provided by Oracle.

47. Describe the following terms

a) Access Control


b) Audit Trail

c) Revoke and Grant

d) Hacking

e) Virus

f) Failure Recovery

g) Backup Techniques

h) Administrative Controls

48. Describe the different types of database failures against which the database should be guarded and the respective recovery technique to be adopted against them.

49. Describe in detail the role of Database Administrator.

50. How should one plan for Audit and Control Mechanisms?

51. Describe the technique of Encryption as a control mechanism.

52. What is the difference between deadlock prevention and deadlock resolution?

53. What is Transaction Integrity? Why is it important?

54. HallMart Department Stores runs a multi-user DBMS on a local area network file server. Unfortunately, at the present time the DBMS does not enforce concurrency control. One HallMart customer had a balance of $250.00 when the following three transactions were processed at about the same time:

a) Payment of $250.00

b) Purchase on credit of $100.00

c) Merchandise return (credit) of $50.00

Each of the three transactions read the customer record when the balance was $250.00 (that is, before the other transactions were completed). The updated customer record was returned to the database in the order shown above.

i) What was the actual balance for the customer after that last transaction was completed?

ii) What balance should have resulted from processing these three transactions?

55. For each of the situation described below, indicate which of the following security measures is most appropriate:

a) Authorisation

b) Encryption

c) Authentication schemes


i) A national brokerage firm uses an electronic funds transfer system (EFTS) to transmit sensitive financial data between locations.

ii) A manufacturing firm uses a simple password system to protect its database but finds it needs a more comprehensive system to grant different privileges (such as read versus create or update) to different users.

iii) A university has experienced considerable difficulty with unauthorised users who access files and databases by appropriating passwords from legitimate users.

56. Customcraft, Inc., is a mail-order firm specialising in the manufacture of stationery and other paper products. Annual sales of Customcraft are $25 million and are growing at a rate of 15% per year. After several years of experience with conventional data processing systems, Customcraft has decided to organise a data administration function. At present, they have four major candidates for the data administrator position:

a. John Bach, a senior systems analyst with three years' experience at Customcraft, who has attended recent seminars in structured systems design and database design.

b. Margaret Smith, who has been production control manager for the past two years after a year's experience as programmer/analyst at Customcraft.

c. William Rogers, a systems programmer with extensive experience with DB2 and Oracle, the two database management systems under consideration at Customcraft.

d. Ellen Reddy, who is currently database administrator with a medium-size electronics firm in the same city as Customcraft.

Based on this limited information, rank the four candidates for the data administrator position, and state your reasons.

57. Referring to the previous problem, rank the four candidates for the position of database administrator at Customcraft. State your reasons.

58. Visit an organisation that has implemented a database approach. Evaluate each of the following:

a. The organisational placement of the data administration function

b. The functions performed by data administration and database administration

c. The background of the person chosen as head of data administration

d. The status and usage of an information repository (passive, active-in-design, active-in-production)

e. The procedures that are used for security, concurrency control, and backup and recovery.

59. Find a recent article describing an incident of computer crime. Was there evidence of inadequate security in this incident? What security measures described in this chapter might have been instrumental in preventing this incident?

60. A Primary key is a minimal superkey

a) True

b) False

c) Partially True

d) Inadequate data

61. The ................................. statement is used to modify one or more records in a specified relation.

a) Update

b) Alter

c) Add, delete, modify

d) Both a and b

62. A database is said to be fully redundant when

a) no replicas of the fragments are allowed

b) complete database copies are distributed at all sites

c) only certain fragments are replicated

d) None of the above

63. In order to modify data, SQL uses update statements; these include (a brief sketch follows the options)

a) Insert and Update

b) Modify, Insert and Delete

c) Insert, Update and Delete

d) None of the above
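
As a hedged illustration of these data-modification statements (the emp table and its columns are assumed for the example only):

    INSERT INTO emp (empno, ename, sal) VALUES (1001, 'RAVI', 4000);  -- add a record
    UPDATE emp SET sal = 4500 WHERE empno = 1001;                     -- modify a record
    DELETE FROM emp WHERE empno = 1001;                               -- remove a record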

64. The Network Model is usually used to represent

a) One to One relationship

b) One to Many relationship

c) Many to Many relationship

d) None of the above

65. If all non-key attributes are fully dependent on all the attributes of the key in a relational database, the relation is in

a) Second Normal Form

b) Third Normal Form

c) Fourth Normal Form

d) None of the above

66. The stored value of the attribute is referred to as an ..........................................

a) Attribute value

b) Stored field

c) Field

d) All of the above

67. In a Client-Server Architecture one of the possible choices for Back-end could be

a) Oracle

b) Sybase

c) FoxPro

d) All of the above

68. A user in a DDBMS environment does not know the location of the data; this is called

a) Location transparency

b) Replication transparency

c) Fragmentation transparency

d) None of the above

69. The .................................. option specifies that only one record can exist at any time with a given value for the column(s) specified in the statement to create the index. (A brief sketch follows the options.)

a) Distinct

b) Unique

c) Cluster

d) Both a and b
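
For instance, in standard SQL such an index might be created as follows (the table and column names are assumptions):

    -- Once this index exists, any attempt to insert a second row with an
    -- already-used empno value is rejected.
    CREATE UNIQUE INDEX emp_empno_idx ON emp (empno);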

70. An existing relation can be deleted from the database by the ................................... SQL statement. (A brief sketch follows the options.)

a) delete table tablename

b) delete relation relationname

c) drop table tablename

d) None of the above
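
As a hedged aside, the sketch below contrasts removing tuples with removing the relation itself (the emp table name is assumed for illustration):

    DELETE FROM emp;   -- removes all tuples; the empty relation emp still exists
    DROP TABLE emp;    -- removes the relation emp itself from the database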

71. The objective of Normalisation is

a) to reduce the number of tables in a database

b) to reduce the number of fields in a record

c) to increase the speed of retrieving data

d) None of the above

72. A Transaction can end in

a) Successful Termination

b) Suicidal Termination

c) Murderous Termination

d) All of the above

73. An advantage of Distributed Data Processing is

a) Sharing

b) Availability and Reliability

c) Incremental Growth and Parallel evaluation

d) All of the above

74. Which of the following is not a Network Topology

a) Star Topology

b) Bus Topology

c) Synchronous Topology

d) Mesh Topology

75. A ................................. statement is used to delete one or more records from a relation.

a) Alter

b) Modify

c) Drop

d) Delete

76. In a Client-Server Architecture one of the possible choices for Front-end could be

a) Oracle

b) Sybase

c) Power Builder

d) All of the above

77. Which of the following truly represents a Binary relationship?

a) 1:1

b) 1:M

c) M:N

d) All of the above

78. The advantages of local data availability in a Distributed environment are

a) Access of a non-update type is cheaper

b) Even if access to a remote site is not possible, access to local data is still possible

c) Cost and complexity of updates increases

d) Both a and b

79. A characteristic of Distributed Databases is

a) Location Transparency

b) Fragmentation Transparency

c) Replication Transparency and Update Transparency

d) All of the above

80. ................................... is the simplest concurrency control method.

a) Locking

b) Fragmentation

c) Replication

d) Isolation

81. Which of the following is not a SQL built-in function?

a) Count

b) Sum

c) Min

d) Mean

82. Systems Catalogues are used to maintain

a) metadata on database relations

b) hardware information

c) system performance

d) both b and c

83. A Data Dictionary doesn’t provide information about

a) where data is located

b) the size of the disk storage device

c) who owns or is responsible for the data

d) how data is used

84. Which of the following contain a complete record of all activities that affect the contents of a database during a certain period of time?

a) Report writer

b) Transaction Manager

c) Transaction log

d) Database Administrator

85. Which of the following is a traditional data model?

a) Relational

b) Network

c) Hierarchical

d) All of the above

86. A Schema describes

a) data elements, attributes

b) records and relationships

c) size of the disk storage and its usage by the database

d) both a and b

87. Security in a database involves

a) Policies to protect data and ensure that it is not accessed or deleted without proper authorisation

b) Appointing a Security Manager

c) Mechanisms to protect the data and ensure that it is not accessed or deleted without proper authorisation

d) both a and b

88. A site needing remote catalogue information requests it and stores it for later use. This scheme is called

a) remote cataloging

b) caching the remote catalogue

c) remote catalogue caching

d) None of the above

89. ..................................... Data Models were developed to organize and represent general knowledge; they are also able to express greater interdependence among entities of interest.

a) Network

b) Semantic

c) Hierarchical

d) All of the above

90. The smallest amount of data can be stored in

a) Bit

b) Byte

c) Nibble

d) Record

91. The two important dimensions for the protection of data in a database are

a) Confidentiality and Protection from accidental and malicious corruption and destruction

b) Protection and Security Locks

c) Data compression and Encryption techniques

d) None of the above

92. Which of the following is not true about a Primary Key?

a) it is a unique entity

b) could be one or a combination of attributes

c) can be null

d) both a and b

93. An Alternate key is

a) A primary key

b) All keys except candidate key

c) Combination of one or more keys

d) None of the above

94. Entity Relation Diagram

a) Describes the entities, their attributes and their relationships

b) Describes the scope of the project

c) Describes the level of participation of each entity

d) None of the above

95. The three levels of architecture of a DBMS are

a) DBA, Internal schema and user

b) User, Application and Transaction Manager

c) Database, Application and user

d) External level, Conceptual level and Internal Level

96. Which of the following is not a component of a DBMS?

a) Data definition language

b) Data manipulation language

c) Query processor

d) All of the above

97. Metadata is

a) Data about data

b) Data into data

c) Description of data and user

d) None of the above

98. An Attribute is a

a) Name of a field

b) Property of a given entity

c) Name of the database

d) All of the above

99. The way a particular application views the data from the database, depending on the user requirement, is

a) Subschema

b) Schema

c) Metadatabase

d) None of the above

100. Which of the following is not a characteristic of a relational database model?

a) Logical relationship

b) Tables

c) Relationships

d) Tree like structure

101. The set of possible values that a given attribute can have is called its

a) Entity set

b) Domain

c) Object property

d) None of the above

102. A ................................. key is an attribute or combination of attributes in a database that uniquely identifies an instance.

a) Superkeys

b) Primary key

c) Candidate key

d) Both a and b

103. .................................... is a collection of identical record type occurrences pertaining to an entity set and is labelled to identify the entity set.

a) File

b) Database

c) Field

d) None of the above

104. A ............................... represents a different perspective of a base relation or relations.

a) Virtual table

b) Table

c) Tuple

d) View

105. The mapping between the Conceptual and Internal view is provided by

a) DBA

b) DBMS

c) Operating System

d) Both b and c

106. Disadvantages of Database Processing are

a) Size, complexity, cost and failure impact

b) Needs highly skilled staff

c) Needs expensive hardware

d) Integrity

107. Which of the following is not an advantage of Database Processing?

a) Data Redundancy

b) Data Sharing

c) Physical and Logical Independence

d) None of the above

108. What is Data Integrity?

a) Protection of the information contained in the database against unauthorised access, modification or destruction

b) The culmination of administrative policies of the organisation

c) The mechanism that is applied to ensure that the data in the database is correct and consistent

d) None of the above

109. Which of the following is not an example of RDBMS?

a) Oracle

b) Informix

c) Access

d) Focus

110. Advantages of Distributed Database are

a) Data Sharing and Distributed Control

b) High Software Development Cost

c) Reliability and Availability

d) Both a and b

111. Four levels of defense generally recognised for database security are

a) Human Factor, Legal Laws, Administrative Controls and Security Policies

b) Human Factor, Authorisation, Encryption and Compression

c) Human Factor, Physical Security, Administrative Controls, and DBMS and OS Security Mechanisms

d) None of the above

112. Content-Dependent Access control is

a) A user is allowed access to everything unless access is explicitly denied

b) Access is allowed to those data objects whose names are known to the user

c) Concept of least privilege extended to take into account the contents of database

d) None of the above

113. A ......................... subset view slices the table horizontally.

a) Row

b) Column

c) Join

d) Both a and b

114. The SQL statement to give a privilege to a user is (a brief sketch follows the options)

a) Grant

b) Revoke

c) Select

d) Update
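
A minimal sketch of privilege handling, assuming a table emp and a user scott that exist only for this example:

    GRANT SELECT, UPDATE ON emp TO scott;  -- give scott read and update rights on emp
    REVOKE UPDATE ON emp FROM scott;       -- later withdraw the update right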

115. The Commit operation is

a) Start of a new operation

b) Signifies an unsuccessful end of an operation

c) Generating a Report

d) None of the above

116. Large collections of files are called

a) Fields

b) Records

c) Databases

d) Record

117. A Subject

a) Is something that needs protection

b) Is an active element in the security mechanism; it operates on objects

c) A user is allowed to only that portion of the database defined by the user’s view

d) Both a and c

118. Identification and Authentication can be enforced through

a) Something you have

b) Someone you are

c) Something you know

d) All of the above

119. A Self Join statement is used to (a brief sketch follows the options)

a) Join two records

b) Join two tables

c) Join a table with itself

d) None of the above
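
For example, a self join can list each employee alongside his or her manager by joining a table to itself under two aliases (emp, ename, mgr and empno are assumed names):

    SELECT e.ename, m.ename AS manager
    FROM emp e, emp m        -- the same table under two aliases
    WHERE e.mgr = m.empno;   -- match each employee row to the manager's row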

120. A .......................... is well-defined collection of objects.

a) Set

b) Member

c) Field

d) File

121. Duplicate tuples are not permitted in a relation.

a) False

b) Inadequate data

c) True

d) Either a or c

122. A single relation may be stored in more than one file, i.e., some attributes in one and the rest in others; this is called

a) Distribution

b) Cardinality

c) Fragmentation

d) None of the above

123. The number of tuples in a Table is called the

a) Cardinality

b) Degree

c) Count

d) None of the above

124. Which of the following is not true about a Foreign Key?

a) It is primary key of some other table

b) It cannot be NULL

c) It is used to maintain referential integrity

d) It should always be numeric

125. The ........................ option is used in a Select statement to eliminate duplicate tuples in the result. (A brief sketch follows the options.)

a) Unique

b) Distinct

c) Exists

d) None of the above
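
For instance (emp and deptno are assumed names):

    -- Without the option asked about above, the same department number
    -- would be repeated once for every employee in it.
    SELECT DISTINCT deptno FROM emp;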

126. The number of columns in a Table is called the

a) Cardinality

b) Degree

c) Tuple

d) None of the above

127. In a Table of 4X3, which of the following is true?

a) Cardinality is 12

b) Degree is 4

c) No. of rows is 3

d) None of the above

128. Database management systems are intended to

a) establish relationships among records in different files

b) manage file access

c) maintain data integrity

d) all of the above

129. Which of the following hardware components is the most important to the operation of a database management system?

a) High-resolution video display

b) High speed printer

c) High speed and large capacity disk

d) All of the above

130. Which of the following are the functions of a DBA?

a) Database Design

b) Backing up the data

c) Maintenance and administration of data

d) All of the above

131. Which of the following is a serious problem of the file management?

a) Lack of data independence

b) Data redundancy

c) Non-shareability

d) All of the above

132. The database environment has all of the following components except

a) Users

b) Separate files

c) Metadata

d) DBA

133. ............................................. is used to represent an entity type in a DBTG model.

a) Owner record type

b) Record type

c) Member Record type

d) None of the above

134. ......................................... is also called the Navigational Model.

a) Network

b) Hierarchical

c) Relational

d) both a and b

135. In .................................................. file organisation the physical location of a record is based on some relationship with its primary key value.

a) Index-sequential

b) Sequential

c) Direct

d) None of the above

136. Atomic domains are sometimes also referred to as

a) Composite domains

b) Structured domains

c) Application dependent domains

d) Application-independent domains

137. Address is an example of

a) Composite domains

b) Structured domains

c) Both a and b

d) None of the above

138. The important properties associated with a key are

a) Should be from Composite domains

b) Should be Unique and Numeric

c) Should have Unique Identification and non redundancy

d) None of the above

139. The ........................................ operation removes common tuples from the first relation.

a) Difference

b) Intersection

c) Union

d) Both a and b

140. The ........................................ operation is also called the Restriction operation.

a) Difference

b) Selection

c) Union

d) Both a and b

141. Two tuples of a Table can have identical data

a) True

b) False

c) Partially False

d) Inadequate data

142. Data Security threats include

a) Hardware Failure

b) Privacy invasion

c) Fraudulent manipulation of data

d) All of the above

143. The ...................................... function of SQL allows data to be classified into categories. (A brief sketch follows the options.)

a) Having Count

b) Count(*)

c) Sum

d) Group by
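
A minimal sketch, again assuming an emp table with a deptno column:

    -- Classifies the employee rows into one category per department and
    -- reports how many rows fall into each.
    SELECT deptno, COUNT(*)
    FROM emp
    GROUP BY deptno;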

144. Which of the following is true about Updation of Views?

a) Any update operation through a view requires that user has appropriate authorisation

b) If the definition of the view involves a single key and includes the primary key

c) The value of a nonprime attribute can be modified

d) All of the above

145. A ......................................... is a program unit whose execution may change the contents of a database.

a) Update Statement

b) Transaction

c) Event

d) None of the above

146. Which of the following is not a property of a Transaction?

a) Atomicity

b) Consistency

c) Durability

d) Inadequate data

147. Which of the following is not recorded in a Transaction log?

a) Size of the database before and after the transaction

b) The record identifiers, which include who and where information

c) The updated value of the modified records

d) All of the above are recorded

148. A scheme called ......................................... is used to limit the volume of log information that has to be handled and processed in the event of system failure.

a) Write-ahead log

b) Checkpoint

c) Volume limit transaction log

d) None of the above

149. Which of the following is also called the Read lock?

a) Non exclusive lock

b) Exclusive lock

c) Shared lock

d) Non Shared Lock

150. In the ...................................... phase the number of locks increases from zero to the maximum.

a) Two phase locking

b) Growing phase

c) Contracting phase

d) None of the above

Suggested Readings

1. Database Management, Fred McFadden, Benjamin Cummings Publishing.

2. Database Management: Principles and Products, Charles J. Bontempo, Prentice Hall PTR.

3. Database Management Systems, Raghu Ramakrishnan, McGraw-Hill Inc.

4. Database Systems, Thomas M. Connolly, Addison-Wesley.

5. Database Systems, Peter Rob and Carlos Coronel, Galgotia Publishers.

6. Fundamentals of Database Systems, Ramez Elmasri, Addison-Wesley.