ROEVER ENGINEERING COLLEGE
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CS1254 DATABASE MANAGEMENT SYSTEMS
16 MARKS:
UNIT I
FUNDAMENTALS
PART B
1. EXPLAIN ABOUT DB SYSTEM STRUCTURE OR ARCHITECTURE. (16 marks)
The functional components can be broadly divided into the
Storage manager
Query Processor.
Storage Manager: It is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system.
It translates the various DML statements into file-system commands.
It is responsible for storing, retrieving and updating data in the database.
STORAGE MANAGER
The storage manager components include:
Authorization & Integrity Manager: It tests for the satisfaction of integrity constraints and checks the authority of users to access data.
File Manager: It manages the allocation of space on disk and the data structures used to represent information stored on disk.
Buffer Manager: It decides what data to cache in main memory in order to speed up access to data.
Transaction Manager: It ensures that the database remains in a consistent state despite system failures and that concurrent transaction executions proceed without conflicting.
The storage manager implements several data structures, such as:
Data files which store db
Data dictionary which stores meta data about the structure of the db.
Indices: It is a file that provides fast access of data items that hold particular values.
QUERY PROCESSOR
It includes
DDL Interpreter:
- Interprets DDL statements and records the definitions in the data dictionary.
DML Compiler:
- Translates DML statements into an evaluation plan consisting of low-level instructions that the query evaluation engine understands.
Query Evaluation Engine:
-Which executes low level instructions generated by DML compiler.
DATABASE ARCHITECTURE
DATABASE USERS
We can classify users of a database into several types.
Naïve users:
- who interact with the system by invoking one of the application programs that have been previously written.
Application Programmers:
- Computer professionals who write application programs.
Sophisticated users
- familiar with the structure of database
- Such users can use a query language such as SQL to perform the required operations on databases.
Specialized users:
- Specialized users who write specialized db applications
DATABASE ADMINISTRATOR
DBA is a person who has central control over the system.
DBA is responsible for authorizing access to the db.
DBA is responsible for acquiring software and hardware resources.
DBA is responsible for coordinating and monitoring the use of db.
DBA creates the original database schema by executing a set of data definition statements in the DDL.
2. EXPLAIN ABOUT CONCEPTS OF ER MODEL. (16 marks)
Entity-Relationship data model: It is a high-level conceptual data model that describes the structure of a database in terms of entities, relationships among entities, and constraints on them.
Basic Concepts of E-R Model:
Entity
Entity Set
Attributes
Relationship
Relationship set
Identifying Relationship
Entity:
"An entity is a business object that represents a group, or category, of data."
- Example: Person, Employee, Car, Home, etc.
Objects with conceptual existence:
- Account, loan, job, etc.
Entity Set:
- A set of entities of the same type.
Attributes:
- A set of properties that describe an entity.
Types of Attributes:
Simple (or) atomic vs. Composite:
- An attribute which cannot be subdivided is a simple (atomic) attribute. (E.g. Roll No)
- An attribute which can be divided into sub-parts is called a composite attribute.
E.g. Address: apartment no., street, place, city, district.
Single Valued vs. Multivalued:
- An attribute having only one value (e.g. Eid, Roll No).
- An attribute having multiple values (e.g. Dept_location: a department can be located in several places).
Stored Vs Derived
- A stored attribute (SA) is one whose value is stored directly in the database, whereas a derived attribute (DA) is one whose value is derived from stored attributes.
- E.g. stored attribute: DOB;
derived attribute: Age, derived from DOB.
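The stored/derived distinction can be sketched in a few lines of Python. The dates here are made-up illustrative values:

```python
from datetime import date

# Stored attribute: the date of birth is kept in the database.
dob = date(1990, 6, 15)

# Derived attribute: age is computed from the stored DOB, not stored itself.
def age_from_dob(dob, today):
    years = today.year - dob.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years

print(age_from_dob(dob, date(2024, 6, 14)))  # 33 (birthday not yet reached)
print(age_from_dob(dob, date(2024, 6, 15)))  # 34
```

Storing only DOB avoids the age value going stale; the derived value is always recomputed on demand.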
Key Attribute:
- An attribute which is used to uniquely identify records.
E.g.. eid, sno, dno
Relationship:
It is an association among several entities. It specifies what type of relationship exists between entities.
Relationship set:
- It is a set of relationships of the same type.
Weak Entity Set/ Strong Entity Set:
An entity set that does not possess sufficient attributes to form a primary key is called a Weak entity set. One that does have a primary key is called a Strong entity set.
Identifying Relationship:
The relationship that associates a weak entity set with its owner (strong) entity set.
Constraints
Two of the most important constraints are
Mapping Constraints
Participation constraints- Total Participation and Partial Participation
Mapping Cardinalities:
Mapping cardinalities, or cardinality ratios, express the number of entities to which another entity can be associated via a relationship set. There are several types of mapping cardinalities available. They are:
(i) One-to-One
An entity in set A is associated with at most one entity in set B and vice versa.
(ii) One-to-Many
An entity in set A is associated with zero or more entities in set B, and an entity in B is associated with at most one entity in A.
(iii) Many-to-One
One or more entities in set A are associated with at most one entity in B. An entity in B can be associated with any number of entities in A.
(iv) Many-to-Many
One or more entities in set A are associated with one or more entities in set B.
Participation Constraints:
Total Participation: The participation of an entity set E in a relationship set R is said to be total if every entity in E participates in at least one relationship in R.
Partial Participation: The participation of an entity set E in a relationship set R is said to be partial if only some of the entities in E participate in relationships in R.
Keys
It is used to uniquely identify entities within a given entity set or a relationship set.
Keys in Entity set:
(i) Primary Key: It is a key chosen to uniquely identify an entity in the entity set.
E.g. eno, rno, dno, etc.
(ii) Super Key: It is a set of one or more attributes that, taken together, allow us to uniquely identify an entity in the entity set. E.g. {eid} is a super key, and so is {eid, ename}, since any attribute set containing a key is still a super key.
(iii) Candidate Key: Candidate keys are minimal super keys, i.e. super keys for which no proper subset is itself a super key.
E.g. ename and eaddr together can be sufficient to identify an employee in the employee set, so {eid} and {ename, eaddr} are candidate keys.
Foreign keys:
An attribute which makes a reference to an attribute of another entity type is called a foreign key. Foreign keys link tables together to form an integrated database.
Domain
A range of values can be defined for an attribute; this is called the domain of that attribute.
E.g. for the attribute Age, Domain(Age) = {1, 2, …, 100}.
Keys in Relationship set:
Case 1: If the relationship set R has no attributes, then the set of attributes
Primary-key(E1) ∪ Primary-key(E2) ∪ … ∪ Primary-key(En)
describes an individual relationship in set R.
Case 2: If the relationship set R has attributes a1, a2, …, an, then the set of attributes
Primary-key(E1) ∪ Primary-key(E2) ∪ … ∪ Primary-key(En) ∪ {a1, a2, …, an}
describes an individual relationship in set R.
In both cases, Primary-key(E1) ∪ Primary-key(E2) ∪ … ∪ Primary-key(En) forms a super key for the relationship set.
E-R Diagram
3. COMPARE AND CONTRAST BETWEEN DATABASE SYSTEMS AND FILE SYSTEMS (6 marks)
In the early days, database applications were built directly on top of file systems
Drawbacks of using file systems to store data:
a. Data redundancy and inconsistency
i. Multiple file formats, duplication of information in different files
b. Difficulty in accessing data
i. Need to write a new program to carry out each new task
c. Data isolation — multiple files and formats
d. Integrity problems
i. Integrity constraints (e.g. account balance > 0) become "buried" in program code rather than being stated explicitly
ii. Hard to add new constraints or change existing ones
e. Atomicity of updates
i. Failures may leave database in an inconsistent state with partial updates carried out
ii. Example: Transfer of funds from one account to another should either complete or not happen at all
f. Concurrent access by multiple users
i. Concurrent access needed for performance
ii. Uncontrolled concurrent accesses can lead to inconsistencies
1. Example: Two people reading a balance and updating it at the same time
g. Security problems
i. Hard to provide user access to some, but not all, data
Database systems offer solutions to all the above problems.
4.WRITE ABOUT DIFFERENT LEVELS OF ABSTRACTION ,INSTANCES AND SCHEMAS (6 marks)
ABSTRACTION
It hides the complex details from the user and provide only the necessary data to the user
Three levels are there,
Physical level: describes how a record (e.g., customer) is stored.
Logical level: describes data stored in database, and the relationships among the data.
type instructor = record
    ID : string;
    name : string;
    dept_name : string;
    salary : integer;
end;
View level: application programs hide details of data types. Views can also hide information (such as an employee's salary) for security purposes.
Three levels of abstraction
SCHEMA
The logical structure of the database .
– Example: The database consists of information about a set of customers and accounts and the
relationship between them)
– Analogous to type information of a variable in a program
– Physical schema: database design at the physical level
– Logical schema: database design at the logical level
Conceptual Schema
1. Conceptual Schema: The conceptual schema (or logical schema) describes all relations that are stored in the database.
For a university, a conceptual schema might be:
- Students (sid:string,Age:integer)
– Faculty (fid: string, salary: real)
– Courses (cid: string, cname: string, credits:integer)
In the university example, these relations contain information about entities, such as students and faculty, and about relationships, such as students' enrollment in courses.
Physical Schema: specifies additional storage details.
• It summarizes how the relations described in the conceptual schema are stored on secondary storage devices such as disks and tapes.
• Creation of data structures called indexes, to speed up data retrieval operations
A sample physical schema for the university:
– Store all relations as unsorted files of records
– Create indexes on the first column of Students, Faculty, and Courses relations.
External Schema
Each external schema consists of a collection of one or more views and relations from the conceptual
schema.
• A view is conceptually a relation, but the records in a view are not stored in the DBMS.
• A user creates any view from data already stored.
• For example: we might want to allow students to find out the names of faculty members teaching
courses.
– This is the view associated:
Courseinfo (cid: string, fname:string)
– A user can treat a view just like a relation and ask questions about the records in the view, even though the records in the view are not stored explicitly.
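The Courseinfo view can be sketched with SQLite from Python. Note the fid column linking Courses to Faculty is an assumption added here so the join has something to match on; the notes only give Courseinfo(cid, fname), and the sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Relations from the conceptual schema (simplified to what the view needs).
cur.execute("CREATE TABLE Faculty (fid TEXT PRIMARY KEY, fname TEXT, salary REAL)")
cur.execute("CREATE TABLE Courses (cid TEXT PRIMARY KEY, cname TEXT, fid TEXT)")
cur.execute("INSERT INTO Faculty VALUES ('f1', 'Ramesh', 50000)")
cur.execute("INSERT INTO Courses VALUES ('CS1254', 'DBMS', 'f1')")

# The view stores no records of its own; its rows are computed on demand.
cur.execute("""CREATE VIEW Courseinfo AS
               SELECT c.cid, f.fname
               FROM Courses c JOIN Faculty f ON c.fid = f.fid""")

# A user queries the view exactly like a base relation.
print(cur.execute("SELECT * FROM Courseinfo").fetchall())  # [('CS1254', 'Ramesh')]
```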
INSTANCE – the actual content of the database at a particular point in time
– Analogous to the value of a variable
Physical Data Independence – the ability to modify the physical schema without changing the logical schema
– Applications depend on the logical schema
– In general, the interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.
5. DRAW THE ER DIAGRAM FOR BANKING (10 marks)
6. DISCUSS ABOUT DATABASE LANGUAGES (6 marks)
Data Manipulation Language (DML)
Language for accessing and manipulating the data organized by the appropriate data model.
DML also known as query language
Two classes of languages
o Procedural – user specifies what data is required and how to get those data
o Declarative (nonprocedural) – user specifies what data is required without specifying how to get those data
SQL is the most widely used query language
SQL
SQL: widely used non-procedural language
Example: Find the name of the instructor with ID 22222:
select name from instructor where instructor.ID = '22222'
Example: Find the IDs of instructors together with their department names, for departments with a budget over 95000:
select instructor.ID, department.dept_name from instructor, department where instructor.dept_name = department.dept_name and department.budget > 95000
Application programs generally access databases through one of
Language extensions to allow embedded SQL
Application program interface (e.g., ODBC/JDBC) which allow SQL queries to be
sent to a database
Data Definition Language (DDL)
Specification notation for defining the database schema.
Example: create table instructor (ID char(5),name varchar(20),dept_name varchar(20),
salary numeric(8,2))
DDL compiler generates a set of tables stored in a data dictionary.
Data dictionary contains metadata (i.e., data about data)
Database schema
Integrity constraints
Primary key (ID uniquely identifies instructors)
Referential integrity (references constraint in SQL)
e.g. dept_name value in any instructor tuple must appear in department
relation
Authorization
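The DDL statement above can be run as a sketch through SQLite from Python; SQLite's catalogue table sqlite_master plays the role of the data dictionary that the DDL compiler populates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# DDL: the create table statement from the notes.
cur.execute("""CREATE TABLE instructor (
    ID        CHAR(5) PRIMARY KEY,
    name      VARCHAR(20),
    dept_name VARCHAR(20),
    salary    NUMERIC(8,2))""")

# The schema definition lands in the system's data dictionary;
# in SQLite that catalogue is the sqlite_master table.
row = cur.execute(
    "SELECT type, name FROM sqlite_master WHERE name = 'instructor'").fetchone()
print(row)  # ('table', 'instructor')
```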
7.EXPLAIN DIFFERENT TYPES OF ATTRIBUTES IN ER MODEL( 4 MARKS)
Single-valued attributes: Attributes with a single value for a particular entity are called single-valued attributes. E.g. Cus_id: it takes only one value.
Multivalued attributes: Attributes with a set of values for a particular entity are called multivalued attributes. E.g. Phone number: it can take many values.
Stored attributes: The attributes stored directly in the database are called stored attributes.
Derived attributes: The attributes that are derived from the stored attributes are called derived attributes.
E.g. Age: it is derived from date of birth (another attribute).
Simple attribute: It cannot be divided into sub-parts. E.g. Cus_id.
Composite attributes: Composite attributes can be divided into sub-parts.
E.g. Name: has first name, middle name and last name.
UNIT II
RELATIONAL MODEL
PART B
1. EXPLAIN THE BASIC OPERATIONS OF RELATION ALGEBRA(16 marks)
Relational Algebra
It consists of a set of operations that take one or more relations as input and produce a new relation as output.
The operations can be divided into,
Basic operations: Select, Project, Union, rename, set difference and Cartesian product
Additional operations: Set intersections, natural join, division and assignment.
Extended operations: Aggregate operations and outer join
Basic operations
SELECT
It selects tuples that satisfy a given predicate. To denote selection, σ (sigma) is used.
Syntax: σ_condition(relation name), e.g. σ_sal>1000(Employee)
PROJECT
It selects attributes from the relation.
π is the symbol for project.
Syntax: π_<attribute list>(relation name), e.g. π_Eid,sal(Employee)
1. Mathematical Set Operations
UNION OPERATION:
R1 ∪ R2 implies tuples either from R1, or from R2, or from both R1 and R2.
The ∪ symbol is used.
SET DIFFERENCE
R1 − R2 implies tuples present in R1 but not in R2. The '−' symbol is used.
CARTESIAN PRODUCT
R1 × R2 allows to combine tuples from any two relations.
E.G.. Emp1 × Emp2
RENAME OPERATION
To rename the name of a relation or the name of an attribute.
2. Additional operations
INTERSECTION
R1 ∩ R2 implies tuples present in both R1 and R2.
NATURAL JOIN OR EQUI JOIN
Used to combine related tuples from two relations.
It requires that the two join attributes have the same name; otherwise the rename operation is applied first and then the join operation is applied.
Symbol: ⋈
OUTER JOIN
It is an extension of the join operation to deal with missing information.
In a natural join, only the matching tuples come in the result and the unmatched tuples are lost. To avoid this loss of information we use the outer join.
There are 3 forms of the outer join operation: left outer join, right outer join and full outer join.
LEFT OUTER JOIN-
It takes all tuples in the left relation that did not match with any tuple in the right relation and pads
the tuples with null values for all other attributes from the right relation and adds them to the result of the natural join.
RIGHT OUTER JOIN-
It takes all tuples in the right relation that did not match with any tuple in the left relation and pads the tuples with null values for all other attributes and adds them to the result of the natural join.
FULL OUTER JOIN
Padding tuples from the left relation that didn‘t match any from the right relation, as well as tuples from the right relation that did not match any from the left relation & adding them to the result of the
join.
DIVISION OPERATION
It is denoted by ÷. It is suited to queries that include the phrase "for all".
AGGREGATE FUNCTIONS
It takes a collection of values and returns a single value as a result.
avg, min, max, sum and count are a few aggregate functions.
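As a runnable sketch of the two basic operations, SQL's WHERE clause realises the select (σ) operation and the column list realises project (π); the employee table and its rows here are made-up illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (eid INTEGER, ename TEXT, sal INTEGER)")
cur.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                [(1, 'Anu', 900), (2, 'Babu', 1500), (3, 'Chitra', 2000)])

# SELECT (sigma): sigma_sal>1000(employee) -> WHERE picks whole tuples.
high = cur.execute("SELECT * FROM employee WHERE sal > 1000").fetchall()
print(high)  # [(2, 'Babu', 1500), (3, 'Chitra', 2000)]

# PROJECT (pi): pi_{eid,sal}(employee) -> the column list picks attributes.
proj = cur.execute("SELECT eid, sal FROM employee").fetchall()
print(proj)  # [(1, 900), (2, 1500), (3, 2000)]
```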
2. EXPLAIN ABOUT TUPLE RELATIONAL AND DOMAIN RELATIONAL CALCULUS (16 marks)
Relational Calculus
It can be divided as Tuple Relational calculus and Domain Relational Calculus
Tuple relational Calculus:
It is a non procedural query language
Specifies what data are required without describing how to get those data
Each query is of the form {t | P(t)}.
It is the set of all tuples t such that predicate P is true for t.
Notations Used:
t is a tuple variable; t[A] denotes the value of tuple t on attribute A.
t ∈ r denotes that tuple t is in relation r.
P is the formula similar to that of the predicate calculus.
Notations used in formulas:
t[A] → the value of tuple t on attribute A
t ∈ r → tuple t is in relation r
∃ → there exists. Definition: ∃t ∈ r (Q(t)) means there exists a tuple t in relation r such that predicate Q(t) is true.
∀ → for all. Definition: ∀t ∈ r (Q(t)) means Q(t) is true for all tuples t in relation r.
⇒ → implication. Definition: P ⇒ Q means if P is true then Q must be true.
A predicate calculus formula is built from:
Set of attributes and constants
Set of comparison operators (e.g. <, ≤, =, ≠, >, ≥)
Set of connectives: and (∧), or (∨), not (¬)
Implication (⇒): X ⇒ Y means if X is true, then Y is true
Set of quantifiers:
∃ — there exists: ∃t ∈ r (Q(t)) means there exists a tuple t in relation r such that predicate Q(t) is true.
∀ — for all: ∀t ∈ r (Q(t)) means Q(t) is true for all tuples t in relation r.
Safety of Expressions
It is possible to write tuple calculus expressions that generate infinite relations.
For example, { t | ¬(t ∈ r) } results in an infinite relation if the domain of any attribute of relation r is infinite.
To guard against the problem, we restrict the set of allowable expressions to safe expressions.
An expression {t | P (t )} in the tuple relational calculus is safe if every component of t appears in
one of the relations, tuples, or constants that appear in P
o NOTE: this is more than just a syntax condition.
o E.g. { t | t[A] = 5 ∨ true } is not safe: it defines an infinite set with attribute values that do not appear in any relation, tuple, or constant in P.
Domain Relational calculus
The domain relational calculus uses domain variables that take on values from an attribute domain rather than values for entire tuple.
Each Query is an expression of the form,
{ <x1, x2, …, xn> | P(x1, x2, …, xn) }
where x1, x2, …, xn represent domain variables and
P represents a formula similar to that of the predicate calculus.
Safety of Expressions
The expression { <x1, x2, …, xn> | P(x1, x2, …, xn) }
is safe if all of the following hold:
1. All values that appear in tuples of the expression are values from dom (P ) (that is, the values
appear either in P or in a tuple of a relation mentioned in P ).
2. For every "there exists" subformula of the form ∃x (P1(x)), the subformula is true if and only if there is a value of x in dom(P1) such that P1(x) is true.
3. For every "for all" subformula of the form ∀x (P1(x)), the subformula is true if and only if P1(x) is true for all values x from dom(P1).
3.EXPLAIN ABOUT TRIGGERS (6 marks)
Triggers
A trigger is a statement that is executed automatically by the system as a side effect of a modification to the database.
To design a trigger mechanism, we must
* Specify the conditions under which the trigger is to be executed.
* Specify the actions to be taken when the triggers executed.
Need for triggers
Suppose that instead of allowing negative account balances, the bank deals with overdrafts by,
* Setting the account balance to zero.
* Creating the loan in the amount of overdraft.
* Giving this loan a loan number identical to the account number of the overdrawn account.
3 parts of triggers:
o Event: a change to the database that activates the trigger.
o Condition: a query or test that is run when the trigger is activated.
o Action: a procedure that is executed when the trigger is activated.
The condition for executing the trigger is an update to the account relation that results in a negative balance value.
E.g.:
create trigger overdraft_trigger after update on account
referencing new row as nrow
for each row
when nrow.balance < 0
begin atomic
    insert into borrower
        (select customer_name, acc_no
         from depositor
         where nrow.acc_no = depositor.acc_no);
    insert into loan values
        (nrow.acc_no, nrow.branch_name, nrow.balance);
    update account set balance = 0
        where account.acc_no = nrow.acc_no
end
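The trigger above is written in SQL:1999 syntax. As a runnable sketch, roughly the same overdraft logic can be expressed in SQLite's trigger dialect, which uses NEW.column instead of a referencing clause; here the loan is recorded with a positive amount, and the sample account row is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE account (acc_no TEXT PRIMARY KEY, branch_name TEXT, balance REAL);
CREATE TABLE loan    (loan_no TEXT, branch_name TEXT, amount REAL);
INSERT INTO account VALUES ('A-101', 'Adayar', 500);

-- Overdraft trigger: when an update drives a balance negative,
-- turn the overdraft into a loan and reset the balance to zero.
CREATE TRIGGER overdraft_trigger AFTER UPDATE ON account
WHEN NEW.balance < 0
BEGIN
    INSERT INTO loan VALUES (NEW.acc_no, NEW.branch_name, -NEW.balance);
    UPDATE account SET balance = 0 WHERE acc_no = NEW.acc_no;
END;
""")

# Withdrawing 800 from a balance of 500 fires the trigger.
cur.execute("UPDATE account SET balance = balance - 800 WHERE acc_no = 'A-101'")
print(cur.execute("SELECT * FROM loan").fetchall())           # [('A-101', 'Adayar', 300.0)]
print(cur.execute("SELECT balance FROM account").fetchone())  # (0.0,)
```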
4. EXPLAIN ABOUT DATA DEFINITION LANGUAGE (DDL) IN SQL
(Or basic schema definition in SQL) (10 marks)
1. Create:
It is used to create a new table in oracle.
Syntax:
Create table table_name(column_name1 datatype1 [constraint], column_name2 datatype2 [constraint], …, column_name_n datatype_n [constraint]);
2. Alter:
It is used to add a new column to the table, modify an existing column, and include or drop an integrity constraint (primary key, not null, etc.).
a. Adding new column:
Syntax: alter table tablename add column_name data_type;
E.G. alter table customer add pno number(10);
b. Modifying an existing column:
Syntax: alter table tablename modify column_name newdata_type;
c. Dropping a column: It is used to delete a column from a table.
Syntax: alter table tablename drop column column_name;
E.g.. alter table customer drop column cust_add;
3. Dropping a table It is used to drop a table.
Syntax: drop table table_name;
E.G.. Drop table customer;
All data and the table structure of the customer table are permanently removed.
4. Renaming a table
SYNTAX: rename oldtablename to newtablename;
E.G.. Rename customer to cust_det;
5. Truncate a table
It removes all records or rows from the table.
SYNTAX: truncate table tablename;
E.G.. Truncate table cust_det;
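The syntax above is Oracle-flavoured. A runnable sketch of the same DDL life cycle (create, alter-add, rename, drop) in SQLite, which supports a subset of these commands and has no truncate (DELETE FROM is used instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Create a table.
cur.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT)")

# 2a. Alter: add a new column.
cur.execute("ALTER TABLE customer ADD COLUMN pno NUMERIC(10)")

# 4. Rename the table.
cur.execute("ALTER TABLE customer RENAME TO cust_det")

# Inspect the resulting columns via the catalogue.
cols = [r[1] for r in cur.execute("PRAGMA table_info(cust_det)")]
print(cols)  # ['cust_id', 'cust_name', 'pno']

# 3. Drop: removes the data and the table structure permanently.
cur.execute("DROP TABLE cust_det")
```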
5. EXPLAIN THE BASIC STRUCTURE OF SQL (10 marks)
The basic structure of an SQL expression consists of 3 clauses,
Select- select clause corresponds to projection operation in relational algebra
It is used to list the attributes desired in the result of a query.
From- from clause corresponds to Cartesian product in relation algebra.
Where- where clause corresponds to the selection operation in relational algebra.
It consist of a predicate involving attributes of the relations that appear in the from clause
All comparisons can be used such as <,>,<=,>=,=.
Logical operator like or, and are used.
Syntax: select A1,A2,…..An from R1,R2,…Rm where P
E.g.. Select cust_name from customer;
E.g.2 Select distinct cust_name from customer;
if we want duplicates removed.
E.g. 3: Select all cust_name from customer;
to specify explicitly that duplicates are retained.
From Clause: It is used to list the relations involved in the query.
E.G.. Select * from customer;
Where clause: It is used for specifying the condition.
E.g. Find the names of all customers whose city is "chennai":
select customer_name from customer where city = 'chennai';
OTHER OPERATIONS
Rename operation
Sql provides rename operation for both relations and attributes.
It uses the as clause.
Syntax: old_name as new_name.
Example: select loan_number as loanid ,amount from loan.
Tuple variable
The as clause is particularly useful in defining the notation of tuple variables.
A tuple variable in SQL must be associated with particular relation.
It is useful in comparing two tuples in the same relation.
Ex: select customer_name, T.loan_number, S.amount from borrower as T, loan as S where T.loan_number = S.loan_number;
String operations
SQL specifies strings by enclosing them in single quote.
The most commonly used operation on string is pattern matching-Like.
We describe patterns by using two special characters,
o Percent (%): the % character matches any substring.
o Underscore (_): the _ character matches any single character.
Ex: 'perry%' matches any string beginning with "perry".
'%idge%' matches any string containing "idge" as a substring, e.g. perryridge, rockridge, ridgeway.
'_ _ _' matches any string of exactly 3 characters.
'_ _ _%' matches any string of at least 3 characters.
Ex: select customer_name from customer where customer_street like '%main%'
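The LIKE patterns can be tried out with SQLite from Python; the customer rows here are made-up sample data (note that SQLite's LIKE is case-insensitive for ASCII by default):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (customer_name TEXT, customer_street TEXT)")
cur.executemany("INSERT INTO customer VALUES (?, ?)",
                [('Jones', 'Main Street'), ('Smith', 'North Road'),
                 ('Hayes', 'Old Main Lane')])

# '%main%' matches any street containing 'main' as a substring.
rows = cur.execute("""SELECT customer_name FROM customer
                      WHERE customer_street LIKE '%main%'""").fetchall()
print(rows)  # [('Jones',), ('Hayes',)]
```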
6. EXPLAIN SET OPERATIONS AND AGGREGATE FUNCTIONS IN SQL.
(10 marks)
Set Operations
The set operations union, intersect and except operate on relations and correspond to the relational algebra operations ∪, ∩ and −.
Each of the above operations automatically eliminates duplicates; to retain all duplicates use union all, intersect all and except all.
1. Find all customers who have a loan or an account, or both:
(select customer_name from depositor) union (select customer_name from borrower);
2. Find all customers who have both a loan and an account:
(select customer_name from depositor) intersect (select customer_name from borrower);
3. Find all customers who have an account but no loan:
(select customer_name from depositor) except (select customer_name from borrower);
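The three set-operation queries can be run as a sketch in SQLite (which does not accept parentheses around the compound-select operands, so they are omitted); the depositor and borrower rows are made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE depositor (customer_name TEXT)")
cur.execute("CREATE TABLE borrower  (customer_name TEXT)")
cur.executemany("INSERT INTO depositor VALUES (?)", [('Jones',), ('Smith',)])
cur.executemany("INSERT INTO borrower  VALUES (?)", [('Smith',), ('Hayes',)])

# union: loan or account or both (duplicates removed automatically).
union = sorted(cur.execute("""SELECT customer_name FROM depositor
                              UNION
                              SELECT customer_name FROM borrower""").fetchall())
# intersect: both a loan and an account.
inter = cur.execute("""SELECT customer_name FROM depositor
                       INTERSECT
                       SELECT customer_name FROM borrower""").fetchall()
# except: an account but no loan.
diff = cur.execute("""SELECT customer_name FROM depositor
                      EXCEPT
                      SELECT customer_name FROM borrower""").fetchall()
print(union)  # [('Hayes',), ('Jones',), ('Smith',)]
print(inter)  # [('Smith',)]
print(diff)   # [('Jones',)]
```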
Aggregate functions
It takes collection of values as input and return a single value as output.
SQL has 5 built-in aggregate functions:
avg – Find average value.
min – Find minimum value.
max – Find Maximum value.
Sum – Find sum of values.
Count – Counts number of values
Find the average account balance at the Adayar branch:
Select avg(balance) from account where branch_name = 'adayar';
Others
Group by clause
It is used to group tuples: the set of tuples with the same value on all attributes in the group by clause are placed in one group.
E.g. To find the average account balance at each branch:
Select branch_name, avg(balance) from account group by branch_name;
We use the keyword distinct for eliminating duplicates.
E.g. To find the number of depositors at each branch:
Select branch_name, count(distinct cus_name) from depositor, account where depositor.acc_num = account.acc_num group by branch_name;
Having
It is useful to state a condition that applies to groups rather than to tuples.
E.g. To find the branches where the average account balance is more than $1200:
Select branch_name, avg(balance) from account group by branch_name having avg(balance) > 1200;
NULL values
SQL allows the use of null values to indicate the absence of information about the value of an attribute.
E.g. To find the loan numbers that appear in the loan relation with null values for amount:
Select loan_number from loan where amount is null;
The result of an arithmetic expression is null if any of the input values is null, and SQL treats the result of any comparison involving a null value as unknown.
Boolean operations
And : true and unknown =unknown
False and unknown =false
Unknown and unknown=unknown
Or : true or unknown =true
False or unknown = unknown
Unknown or unknown=unknown
Not : not unknown = unknown.
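The effect of nulls on comparisons and aggregates can be demonstrated as a sketch in SQLite; the loan rows are made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE loan (loan_number TEXT, amount REAL)")
cur.executemany("INSERT INTO loan VALUES (?, ?)",
                [('L-11', 900.0), ('L-17', None)])

# 'amount = NULL' evaluates to unknown for every row, so it finds nothing;
# 'IS NULL' is the correct test for absent values.
eq = cur.execute("SELECT loan_number FROM loan WHERE amount = NULL").fetchall()
isnull = cur.execute("SELECT loan_number FROM loan WHERE amount IS NULL").fetchall()
print(eq)      # []
print(isnull)  # [('L-17',)]

# Aggregates ignore null: count(amount) counts only the known values.
counts = cur.execute("SELECT count(*), count(amount) FROM loan").fetchone()
print(counts)  # (2, 1)
```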
Nested sub queries
A sub query is a select- from-where expression that is nested within another query.
7. EXPLAIN ABOUT SET MEMBERSHIP AND SET COMPARISON (10 marks)
Set membership
SQL allows testing tuples for membership in a relation
Eg., find all customers who have both a loan and an account at the bank.
For finding all account holders we write the sub query as,
Select cus_name from depositor.
1. In clause
Then find customers who are borrowers from the bank and who appear in the list of account holders obtained in the sub query:
Select distinct cus_name from borrower where cus_name in (select cus_name from depositor);
2. Not in clause
Eg., Find all customers who have loan but not having deposit account in the bank.
select distinct cus_name from borrower where cus_name not in (select cus_name from depositor)
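The in and not in queries above can be run as a sketch in SQLite; the depositor and borrower rows are made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE depositor (cus_name TEXT)")
cur.execute("CREATE TABLE borrower  (cus_name TEXT)")
cur.executemany("INSERT INTO depositor VALUES (?)", [('Jones',), ('Smith',)])
cur.executemany("INSERT INTO borrower  VALUES (?)", [('Smith',), ('Hayes',)])

# IN: borrowers who also appear among the account holders.
both = cur.execute("""SELECT DISTINCT cus_name FROM borrower
                      WHERE cus_name IN (SELECT cus_name FROM depositor)""").fetchall()
# NOT IN: borrowers with no deposit account.
only = cur.execute("""SELECT DISTINCT cus_name FROM borrower
                      WHERE cus_name NOT IN (SELECT cus_name FROM depositor)""").fetchall()
print(both)  # [('Smith',)]
print(only)  # [('Hayes',)]
```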
Set comparison
Some: Find the names of all branches that have assets greater than those of at least one branch located in Brooklyn:
Select branch_name from branch where assets > some (select assets from branch where branch_city = 'brooklyn');
The > some comparison in the where clause of the outer select is true if the assets value of the tuple is greater than at least one member of the set of all assets values of branches in Brooklyn.
SQL allows < some, <= some, >= some, = some and <> some comparisons.
All: Find the names of all branches that have an assets value greater than that of each branch in Brooklyn [> all means greater than all]:
Select branch_name from branch where assets > all (select assets from branch where branch_city = 'brooklyn');
SQL allows < all, <= all, >= all, = all and <> all comparisons.
Test for empty relations
SQL includes a feature for testing whether a sub query has any tuples in its result.
Exists: returns the value true if the subquery is non-empty.
E.g. Find all customers who have both an account and a loan at the bank:
Select cus_name from borrower where exists (select * from depositor where depositor.cus_name = borrower.cus_name);
Not exists: Find all customers who have an account at all the branches in Brooklyn:
Select distinct s.cus_name from depositor as s where not exists ((select branch_name from branch where branch_city = 'brooklyn') except (select r.branch_name from depositor as t, account as r where t.account_num = r.account_num and s.cus_name = t.cus_name));
The sub query finds all the branches at which customer s.cus_name has an account. Thus the outer select takes each customer and tests whether the set of all branches at which that customer has an account contains the set of all branches located in Brooklyn.
Test for the absence of duplicate tuples
SQL includes a feature for testing whether a sub query has any duplicate tuples in its result.
The unique construct returns the value true if the argument sub query contains no duplicate tuples.
E.g. Find all customers who have at most one account at the Perryridge branch:
Select t.cus_name from depositor as t where unique (select r.cus_name from account, depositor as r where t.cus_name = r.cus_name and r.acc_num = account.acc_num and account.branch_name = 'perryridge');
Not unique: tests for the existence of duplicate tuples.
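The exists test can be run as a sketch in SQLite (SQLite has no unique construct, so only exists is shown); the depositor and borrower rows are made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE depositor (cus_name TEXT)")
cur.execute("CREATE TABLE borrower  (cus_name TEXT)")
cur.executemany("INSERT INTO depositor VALUES (?)", [('Jones',), ('Smith',)])
cur.executemany("INSERT INTO borrower  VALUES (?)", [('Smith',), ('Hayes',)])

# EXISTS is true for a borrower when the correlated subquery
# finds at least one matching depositor row.
rows = cur.execute("""SELECT cus_name FROM borrower
                      WHERE EXISTS (SELECT * FROM depositor
                                    WHERE depositor.cus_name = borrower.cus_name)""").fetchall()
print(rows)  # [('Smith',)]
```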
8. EXPLAIN DIFFERENT TYPES OF JOIN QUERIES. (8 marks)
The purpose of join is to combine the data spread across tables.
Types of joins
Inner join
Left outer join
Right outer join
Full outer join
The loan and borrower relations:
loan:
Loan_no | Branch_name | amt
170 | Downtown | 3000
180 | Redwood | 4000
190 | Perryridge | 7000
borrower:
Cus_name | Loan_no
Jones | 170
Smith | 180
Hayes | 155
Inner join
Inner join combines the two relations, i.e. the left-side attributes and the right-side attributes.
Eg., loan inner join borrower on loan.loan_no = borrower.loan_no.
Output will be:
Loan_no | Branch_name | amt | Cus_name | Loan_no
170 | Downtown | 3000 | Jones | 170
180 | Redwood | 4000 | Smith | 180
Natural inner join
The common join attribute is displayed only once. Eg., loan natural join borrower.
Loan_no | Branch_name | amt | Cus_name
170 | Downtown | 3000 | Jones
180 | Redwood | 4000 | Smith
Left outer join
It keeps tuples of the left-hand relation that do not match any tuple in the right-hand relation, padded with null values for the attributes from the right relation.
Eg., loan left outer join borrower on loan.loan_no = borrower.loan_no.
Loan_no | Branch_name | amt | Cus_name | Loan_no
170 | Downtown | 3000 | Jones | 170
180 | Redwood | 4000 | Smith | 180
190 | Perryridge | 7000 | Null | Null
Natural left outer join
Eg., loan natural left outer join borrower.
Loan_no | Branch_name | amt | Cus_name
170 | Downtown | 3000 | Jones
180 | Redwood | 4000 | Smith
190 | Perryridge | 7000 | Null
Right outer join
All tuples of the right-hand relation appear in the result; tuples that do not match any tuple in the left-hand relation are padded with null values.
Eg., loan right outer join borrower on loan.loan_no = borrower.loan_no.
Loan_no Branch_name amt Cus_name Loan_no
170 Down town 3000 Jones 170
180 Redwood 4000 Smith 180
Null Null Null Hayes 155
Natural right outer join
Eg., loan natural right outer join borrower.
Loan_no Branch_name amt Cus_name
170 Down town 3000 Jones
180 Redwood 4000 Smith
155 Null Null Hayes
Full outer join
Combines the effect of both the left and right outer joins.
Eg., loan full outer join borrower on loan.loan_no=borrower.loan_no.
Loan_no Branch_name amt Cus_name Loan_no
170 Down town 3000 Jones 170
180 Redwood 4000 Smith 180
190 Perryridge 7000 Null Null
Null Null Null Hayes 155
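The joins above can be reproduced with any SQL engine; the sketch below uses Python's sqlite3 module (an assumed environment, not part of the original notes) with the loan and borrower tables of this section. SQLite supports right and full outer joins only from version 3.39, so the sketch sticks to inner and left outer joins.

```python
import sqlite3

# Hypothetical in-memory bank database mirroring the loan/borrower tables above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE loan (loan_no INTEGER, branch_name TEXT, amt INTEGER);
    CREATE TABLE borrower (cus_name TEXT, loan_no INTEGER);
    INSERT INTO loan VALUES (170, 'Downtown', 3000), (180, 'Redwood', 4000),
                            (190, 'Perryridge', 7000);
    INSERT INTO borrower VALUES ('Jones', 170), ('Smith', 180), ('Hayes', 155);
""")

# Inner join: only loans with a matching borrower appear.
inner = conn.execute("""
    SELECT loan.loan_no, branch_name, amt, cus_name
    FROM loan INNER JOIN borrower ON loan.loan_no = borrower.loan_no
""").fetchall()

# Left outer join: the unmatched loan (190) is padded with NULL (None in Python).
left = conn.execute("""
    SELECT loan.loan_no, branch_name, amt, cus_name
    FROM loan LEFT OUTER JOIN borrower ON loan.loan_no = borrower.loan_no
""").fetchall()

print(inner)  # 2 rows
print(left)   # 3 rows; the Perryridge loan has cus_name = None
```

The row counts match the tables shown above: the inner join drops loan 190 and borrower Hayes, while the left outer join keeps loan 190 with a null customer.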
Natural full outer join
Eg., loan natural full outer join borrower.
Loan_no Branch_name Amt Cus_name
170 Downtown 3000 Jones
180 Redwood 4000 Smith
190 Perryridge 7000 Null
155 Null Null Hayes
9. EXPLAIN ABOUT INTEGRITY CONSTRAINTS. (6 marks)
Integrity constraints guard against accidental damage to the database, by ensuring that authorized changes to the database do not result in a loss of data consistency. Example constraints:
A checking account must have a balance greater than $10,000.00.
A salary of a bank employee must be at least $4.00 an hour
A customer must have a (non-null) phone number
Constraints on a Single Relation
not null
primary key
unique
check (P ), where P is a predicate
Not Null Constraint
Declare branch_name for branch is not null
branch_name char(15) not null
Declare the domain Dollars to be not null
create domain Dollars numeric(12,2) not null
The Unique Constraint
unique ( A1, A2, …, Am)
The unique specification states that the attributes
A1, A2, … Am form a candidate key.
Candidate keys are permitted to be null (in contrast to primary keys).
The check clause
check (P ), where P is a predicate
Example: Declare branch_name as the primary key for branch and ensure that the values of assets
are non-negative.
create table branch
(branch_name char(15),
branch_city char(30),
assets integer,
primary key (branch_name),
check (assets >= 0))
The check clause in SQL-92 permits domains to be restricted:
Use the check clause to ensure that an hourly_wage domain allows only values greater than a specified value.
create domain hourly_wage numeric(5,2)
constraint value_test check (value >= 4.00)
The domain has a constraint that ensures that the hourly_wage is greater than 4.00
The clause constraint value_test is optional; useful to indicate which constraint an update violated.
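The not null, primary key, and check declarations above can be exercised directly in SQLite; the following sketch (Python's sqlite3, an assumed environment) shows the check clause rejecting an insert with negative assets.

```python
import sqlite3

# A minimal sketch of not null and check constraints, using SQLite syntax;
# the table follows the branch example above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE branch (
        branch_name CHAR(15) NOT NULL,
        branch_city CHAR(30),
        assets      INTEGER,
        PRIMARY KEY (branch_name),
        CHECK (assets >= 0)
    )
""")
conn.execute("INSERT INTO branch VALUES ('Perryridge', 'Horseneck', 1700000)")

# Inserting negative assets violates the check clause and is rejected.
try:
    conn.execute("INSERT INTO branch VALUES ('Downtown', 'Brooklyn', -1)")
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # True: only the valid row remains in the table
```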
10.EXPLAIN ABOUT REFERENTIAL INTEGRITY AND ASSERTIONS
(8 marks)
Referential integrity
Ensures that a value that appears in one relation for a given set of attributes also appears for a certain set of
attributes in another relation.
Example: If "Perryridge" is a branch name appearing in one of the tuples in the account relation, then there exists a tuple in the branch relation for branch "Perryridge".
Primary and candidate keys and foreign keys can be specified as part of the SQL create table statement:
The primary key clause lists attributes that comprise the primary key.
The unique key clause lists attributes that comprise a candidate key.
The foreign key clause lists the attributes that comprise the foreign key and the name of the
relation referenced by the foreign key. By default, a foreign key references the primary key attributes of the referenced table.
Example
create table customer
(customer_name char(20),
customer_street char(30),
customer_city char(30),
primary key (customer_name))
create table branch
(branch_name char(15),
branch_city char(30),
assets numeric(12,2),
primary key (branch_name))
create table account
(account_number char(10),
branch_name char(15),
balance integer,
primary key (account_number),
foreign key (branch_name) references branch)
create table depositor
(customer_name char(20),
account_number char(10),
primary key (customer_name, account_number),
foreign key (account_number) references account,
foreign key (customer_name) references customer)
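A sketch of referential-integrity enforcement using Python's sqlite3 (an assumed environment; note that SQLite checks foreign keys only after the pragma shown is enabled), with trimmed versions of the branch and account tables above:

```python
import sqlite3

# SQLite enforces foreign keys only when this pragma is on.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE branch (branch_name CHAR(15) PRIMARY KEY)")
conn.execute("""
    CREATE TABLE account (
        account_number CHAR(10) PRIMARY KEY,
        branch_name    CHAR(15) REFERENCES branch,  -- references branch's primary key
        balance        INTEGER
    )
""")
conn.execute("INSERT INTO branch VALUES ('Perryridge')")
conn.execute("INSERT INTO account VALUES ('A-101', 'Perryridge', 500)")  # ok

# An account naming a non-existent branch violates referential integrity.
try:
    conn.execute("INSERT INTO account VALUES ('A-102', 'Nowhere', 700)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```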
Assertions
An assertion is a predicate expressing a condition that we wish the database always to satisfy.
An assertion in SQL takes the form
create assertion <assertion-name> check <predicate>
When an assertion is made, the system tests it for validity, and tests it again on every update that may violate the assertion.
This testing may introduce a significant amount of overhead; hence assertions should be used with great
care.
An assertion of the form "for all X, P(X)" is achieved in a round-about fashion using
not exists X such that not P(X)
Assertion Example
Every loan has at least one borrower who maintains an account with a minimum balance of $1000.00.
create assertion balance_constraint check
(not exists (
select * from loan
where not exists (
select * from borrower, depositor, account
where loan.loan_number = borrower.loan_number
and borrower.customer_name = depositor.customer_name
and depositor.account_number = account.account_number
and account.balance >= 1000)))
The sum of all loan amounts for each branch must be less than the sum of all account balances at the branch.
create assertion sum_constraint check
(not exists (select *
from branch
where (select sum(amount) from loan
where loan.branch_name = branch.branch_name)
>= (select sum(balance) from account
where account.branch_name = branch.branch_name)))
11. EXPLAIN ABOUT EMBEDDED SQL AND DYNAMIC SQL (6 marks)
Embedded SQL
A query is first associated with a cursor c using a declare cursor statement.
The open statement causes the query to be evaluated:
EXEC SQL open c END_EXEC
The fetch statement causes the values of one tuple in the query result to be placed in host language variables.
EXEC SQL fetch c into :cn, :cc END_EXEC
Repeated calls to fetch get successive tuples in the query result.
A variable called SQLSTATE in the SQL communication area (SQLCA) gets set to '02000' to indicate no more data is available.
The close statement causes the database system to delete the temporary relation that holds the result of the query.
EXEC SQL close c END_EXEC
Note: above details vary with language. For example, the Java embedding defines Java iterators to step through result tuples.
Updates Through Cursors
Can update tuples fetched by cursor by declaring that the cursor is for update
declare c cursor for
select * from account
where branch_name = 'Perryridge'
for update
To update tuple at the current location of cursor c
update account
set balance = balance + 100
where current of c
Dynamic SQL
Allows programs to construct and submit SQL queries at run time.
Example of the use of dynamic SQL from within a C program.
char *sqlprog = "update account set balance = balance * 1.05 where account_number = ?";
EXEC SQL prepare dynprog from :sqlprog;
char account[10] = "A-101";
EXEC SQL execute dynprog using :account;
The dynamic SQL program contains a ?, which is a place holder for a value that is provided when the SQL program is executed.
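The same ? place holder appears in host-language database APIs; the sketch below mirrors the C fragment using Python's sqlite3 (an assumed environment): the statement text is built at run time and the value is supplied when it is executed.

```python
import sqlite3

# Hypothetical account table for the dynamic-update example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_number TEXT, balance REAL)")
conn.execute("INSERT INTO account VALUES ('A-101', 1000.0)")

# The SQL text is constructed at run time; '?' is filled in on execution.
sqlprog = "UPDATE account SET balance = balance * 1.05 WHERE account_number = ?"
conn.execute(sqlprog, ("A-101",))

balance = conn.execute(
    "SELECT balance FROM account WHERE account_number = ?", ("A-101",)
).fetchone()[0]
print(balance)  # 1050.0
```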
UNIT III DATABASE DESIGN
PARTB
1. EXPLAIN FIRST NORMAL FORM (1NF) (8 marks)
A relation schema R is in 1NF if all the attributes of the relation R are atomic in nature.
E.g., consider the DEPT relation, extended by including a DLOCATIONS attribute. We assume that each department may have a number of locations.
This is not in 1NF because DLOCATIONS is not an atomic attribute.
First Normal Form(1NF)
There are 2 main techniques to achieve 1NF.
The original DEPT relation (not in 1NF):
DNAME DNO DHEAD DLOCATIONS
Research 3 John (Mianus, Rye, Stratford)
Administration 2 Princy Mianus
Headquarters 1 Peter Rye
1. Remove the attribute DLOCATIONS and place it in a separate relation DEPT_LOCATIONS along with the primary key DNO of DEPT. The primary key of DEPT_LOCATIONS is the combination {DNO, DLOCATIONS}.
DEPT_LOCATIONS
DNO DLOCATIONS
1 Rye
2 Mianus
3 Rye
3 Mianus
3 Stratford
2. Expand the key so that there will be a separate tuple in the original DEPT relation for each location of a department, as shown below.
DNAME DNO DHEAD DLOCATIONS
Research 3 John Mianus
Research 3 John Rye
Research 3 John Stratford
Administration 2 Princy Mianus
Headquarters 1 Peter Rye
This second technique introduces redundancy, so the first technique is superior.
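Technique 1 can be sketched in a few lines of Python (purely illustrative; the tuples mirror the DEPT example above): the multivalued DLOCATIONS attribute is flattened into a separate DEPT_LOCATIONS relation with one tuple per (DNO, location) pair.

```python
# DEPT with a multivalued DLOCATIONS attribute (not in 1NF).
dept = [
    ("Research",       3, "John",   ("Mianus", "Rye", "Stratford")),
    ("Administration", 2, "Princy", ("Mianus",)),
    ("Headquarters",   1, "Peter",  ("Rye",)),
]

# DEPT without the multivalued attribute: atomic columns only.
dept_1nf = [(dname, dno, dhead) for dname, dno, dhead, _ in dept]

# DEPT_LOCATIONS: one tuple per (DNO, location) pair;
# its primary key is the combination {DNO, DLOCATIONS}.
dept_locations = [(dno, loc) for _, dno, _, locs in dept for loc in locs]

print(dept_locations)  # 5 pairs, e.g. (3, 'Stratford')
```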
2.EXPLAIN SECOND NORMAL FORM(2NF) (8 marks)
A relation R is in 2NF if and only if,
• It is in the 1NF and
• No partial dependency exists between non-key attributes and key attributes.
• The test for 2NF involves testing for functional dependencies whose left-hand-side attributes are part of the primary key. If the primary key contains a single attribute, the test need not be applied at all.
• A relation schema R is in 2NF if every non-prime attribute A in R is fully functionally
dependent on the primary key of R.
• E.g., consider the EMP_PROJ relation; it is in 1NF but not in 2NF.
The non-prime attribute ENAME violates 2NF because of FD2, as do the non-prime attributes PNAME and PLOCATION because of FD3; FD2 and FD3 make ENAME, PNAME and PLOCATION partially dependent on the primary key {SSN, PNO}, thus violating the 2NF test.
The functional dependencies FD1, FD2 and FD3 lead to the decomposition of EMP_PROJ into the 3 relation schemas EP1, EP2 and EP3, each of which is in 2NF.
3. EXPLAIN THIRD NORMAL FORM (3NF) AND BCNF.GIVE THEIR
COMPARISON (16 marks)
THIRD NORMAL FORM
A relation R is said to be in the 3NF if and only if
* It is in 2 NF and
* No transitive dependency exists between non-key attributes and key attributes.
E.g., consider the relation schema EMP_DEPT.
The dependency SSN → DNGRSSN is transitive through DNO in EMP_DEPT, because both the dependencies SSN → DNO and DNO → DNGRSSN hold, and DNO is neither a key itself nor a subset of the key of EMP_DEPT.
A relation schema R is in 3NF if it satisfies 2NF and no non-prime attribute of R is transitively dependent on the primary key.
EMP_DEPT Relation Schema in 2NF
ENAME SSN BDATE ADDRESS DNO DNAME DNGRSSN
We can normalize schemas ED1 and ED2,
3NF relation schemas ED1 and ED2
Here ED1 and ED2 represent independent entity facts about employees and departments.
ED1
ENAME SSN BDATE ADDRESS
DNO
ED2
DNO DNAME DNGRSSN
BOYCE-CODD NORMAL FORM (BCNF)
A relation R is said to be in BCNF if and only if all the determinants are candidate keys.
A BCNF relation is a strong 3NF, but not every 3NF relation is in BCNF.
A relation schema R is in BCNF with respect to a set F of functional dependencies if, for all functional dependencies in F of the form
α → β
where α ⊆ R and β ⊆ R,
at least one of the following holds:
1. α → β is trivial (i.e., β ⊆ α)
2. α is a superkey of R
Comparison of BCNF and 3NF
1. It is always possible to decompose a relation into a set of relations that are in 3NF such that:
the decomposition is lossless
the dependencies are preserved
2. It is always possible to decompose a relation into a set of relations that are in BCNF such that:
the decomposition is lossless
it may not be possible to preserve dependencies.
4.EXPLAIN ABOUT MULTIVALUED DEPENDENCIES AND FOURTH NORMAL FORM (16 marks)
Multivalued Dependencies (MVDs)
Let R be a relation schema and let α ⊆ R and β ⊆ R. The multivalued dependency α →→ β
holds on R if, in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α],
there exist tuples t3 and t4 in r such that:
t1[α] = t2[α] = t3[α] = t4[α]
t3[β] = t1[β]
t3[R – β] = t2[R – β]
t4[β] = t2[β]
t4[R – β] = t1[R – β]
Let R be a relation schema with a set of attributes that are partitioned into 3 nonempty subsets.
Y, Z, W
We say that Y →→ Z (Y multidetermines Z) if and only if, for all possible relations r(R),
if < y1, z1, w1 > ∈ r and < y1, z2, w2 > ∈ r
then
< y1, z1, w2 > ∈ r and < y1, z2, w1 > ∈ r
Note that since the behavior of Z and W are identical, it follows that
Y →→ Z if and only if Y →→ W
In our example:
course →→ teacher
course →→ book
The above formal definition is supposed to formalize the notion that given a particular value of Y (course) it has associated with it a set of values of Z (teacher) and a set of values of W (book), and these two sets are in some sense independent of each other.
Note:
If Y → Z then Y →→ Z
Indeed we have (in the above notation) z1 = z2.
The claim follows.
We use multivalued dependencies in two ways:
1. To test relations to determine whether they are legal under a given set of functional and multivalued dependencies
2. To specify constraints on the set of legal relations. We shall thus concern ourselves only with relations that satisfy a given set of functional and multivalued dependencies.
If a relation r fails to satisfy a given multivalued dependency, we can construct a relation r′ that does satisfy the multivalued dependency by adding tuples to r.
Fourth Normal Form
A relation schema R is in 4NF with respect to a set D of functional and multivalued dependencies if,
for all multivalued dependencies in D+ of the form α →→ β, where α ⊆ R and β ⊆ R, at least one of the following holds:
α →→ β is trivial (i.e., β ⊆ α or α ∪ β = R)
α is a superkey for schema R
If a relation is in 4NF it is in BCNF.
5.EXPLAIN ABOUT 5NF AND DK/NF(6 marks)
Fifth Normal Form (5NF)
There are certain conditions under which after decomposing a relation, it cannot be reassembled back into its original form.
We don't consider these issues here.
Domain Key Normal Form (DK/NF)
A relation is in DK/NF if every constraint on the relation is a logical consequence of the definition of keys and domains.
Constraint: a rule governing static values of an attribute such that we can determine if this constraint is true or false. Examples:
1. Functional dependencies 2. Multivalued dependencies
3. Inter-relation rules 4. Intra-relation rules
However, this does not include time-dependent constraints.
Key: Unique identifier of a tuple.
Domain: The physical (data type, size, NULL values) and semantic (logical) description of what values an attribute can hold.
There is no known algorithm for converting a relation directly into DK/NF.
UNIT IV
TRANSACTIONS
PART B
1. EXPLAIN ABOUT DIFFERENT TRANSACTION STATES. (6 marks)
o Active – the initial state; the transaction stays in this state while it is executing
o Partially committed – after the final statement has been executed.
o Failed -- after the discovery that normal execution can no longer proceed.
o Aborted – after the transaction has been rolled back and the database restored to its state
prior to the start of the transaction. Two options after it has been aborted:
restart the transaction (can be done only if no internal logical error occurred)
kill the transaction
o Committed – after successful completion.
2. EXPLAIN THE IMPLEMENTATION OF ATOMICITY AND DURABILITY (6
marks)
The recovery-management component of a database system implements the support for atomicity
and durability.
The shadow-database scheme:
assume that only one transaction is active at a time.
a pointer called db_pointer always points to the current consistent copy of the
database.
all updates are made on a shadow copy of the database, and db_pointer is made to
point to the updated shadow copy only after the transaction reaches partial commit
and all updated pages have been flushed to disk.
in case transaction fails, old consistent copy pointed to by db_pointer can be used,
and the shadow copy can be deleted.
o Assumes disks do not fail
o Useful for text editors, but
extremely inefficient for large databases (why?)
Does not handle concurrent transactions
o Will study better schemes in Chapter 17.
3. DISCUSS ABOUT SERIALIZABILITY (16 marks)
Basic Assumption – Each transaction preserves database consistency.
Thus serial execution of a set of transactions preserves database consistency.
A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of:
1. conflict serializability
2. view serializability
We ignore operations other than read and write instructions, and we assume that transactions may perform arbitrary computations on data in local buffers in between reads and writes. Our simplified schedules consist of only read and write instructions.
Conflicting Instructions
Instructions li and lj of transactions Ti and Tj respectively conflict if and only if there exists some item Q accessed by both li and lj, and at least one of these instructions wrote Q.
1. li = read(Q), lj = read(Q). li and lj don't conflict.
2. li = read(Q), lj = write(Q). They conflict.
3. li = write(Q), lj = read(Q). They conflict.
4. li = write(Q), lj = write(Q). They conflict.
Intuitively, a conflict between li and lj forces a (logical) temporal order between them.
If li and lj are consecutive in a schedule and they do not conflict, their results would
remain the same even if they had been interchanged in the schedule.
Conflict Serializability
If a schedule S can be transformed into a schedule S´ by a series of swaps of non-
conflicting instructions, we say that S and S´ are conflict equivalent.
We say that a schedule S is conflict serializable if it is conflict equivalent to a serial
schedule
Schedule 3 can be transformed into Schedule 6, a serial schedule where T2 follows T1,
by series of swaps of non-conflicting instructions.
Therefore Schedule 3 is conflict serializable.
Schedule 3 Schedule 6
Example of a schedule that is not conflict serializable: we are unable to swap instructions in the above schedule to obtain either the serial schedule < T3, T4 > or the serial schedule < T4, T3 >.
View Serializability
Let S and S´ be two schedules with the same set of transactions. S and S´ are view equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S´, also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and that value was produced by transaction Tj (if any), then transaction Ti must in schedule S´ also read the value of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S´.
As can be seen, view equivalence is also based purely on reads and writes alone.
A schedule S is view serializable if it is view equivalent to a serial schedule.
Every conflict serializable schedule is also view serializable.
Below is a schedule which is view-serializable but not conflict serializable.
What serial schedule is above equivalent to?
Every view serializable schedule that is not conflict serializable has blind writes.
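Conflict serializability can be tested mechanically by building a precedence graph (an edge Ti → Tj for each conflicting pair where Ti's instruction comes first) and checking it for cycles. A minimal Python sketch, with the schedule encoding as an assumption:

```python
# A schedule is a list of (transaction, action, item) steps.
def conflict_serializable(schedule):
    edges = set()
    for i, (ti, ai, qi) in enumerate(schedule):
        for tj, aj, qj in schedule[i + 1:]:
            # Two steps conflict if they touch the same item, come from
            # different transactions, and at least one is a write.
            if ti != tj and qi == qj and "write" in (ai, aj):
                edges.add((ti, tj))
    nodes = {t for t, _, _ in schedule}

    # Depth-first search for a cycle in the precedence graph.
    def cyclic(node, path):
        if node in path:
            return True
        return any(cyclic(m, path | {node}) for (n, m) in edges if n == node)

    return not any(cyclic(t, set()) for t in nodes)

# T1 fully before T2 on Q: serial, hence conflict serializable.
ok = conflict_serializable([("T1", "read", "Q"), ("T1", "write", "Q"),
                            ("T2", "read", "Q"), ("T2", "write", "Q")])
# Writes interleaved in both directions create a cycle T1 -> T2 -> T1.
bad = conflict_serializable([("T1", "read", "Q"), ("T2", "write", "Q"),
                             ("T1", "write", "Q")])
print(ok, bad)  # True False
```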
4. EXPLAIN LOCK-BASED PROTOCOLS (8 marks)
A lock is a mechanism to control concurrent access to a data item
Data items can be locked in two modes :
exclusive (X) mode. Data item can be both read as well as
written. X-lock is requested using lock-X instruction.
shared (S) mode. Data item can only be read. S- lock is
requested using lock-S instruction.
Lock requests are made to concurrency-control manager. Transaction can proceed only after request
is granted.
Lock-compatibility matrix:
     S      X
S    true   false
X    false  false
A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions.
Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on the item.
If a lock cannot be granted, the requesting transaction is made to wait till all incompatible locks held
by other transactions have been released. The lock is then granted.
Example of a transaction performing locking:
T2: lock-S(A);
    read(A);
    unlock(A);
    lock-S(B);
    read(B);
    unlock(B);
    display(A+B)
Locking as above is not sufficient to guarantee serializability — if A and B get updated in-between
the read of A and B, the displayed sum would be wrong.
A locking protocol is a set of rules followed by all transactions while requesting and releasing
locks. Locking protocols restrict the set of possible schedules.
Pitfalls of Lock-Based Protocols
Consider the partial schedule
Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for T3 to release its
lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.
Such a situation is called a deadlock.
To handle a deadlock one of T3 or T4 must be rolled back and its locks released.
The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil.
Starvation is also possible if concurrency control manager is badly designed. For example:
A transaction may be waiting for an X-lock on an item, while a sequence of other transactions
request and are granted an S- lock on the same item.
The same transaction is repeatedly rolled back due to deadlocks.
Concurrency control manager can be designed to prevent starvation.
5. EXPLAIN THE CONCEPT OF TWO-PHASE LOCKING PROTOCOL
(8 marks)
This is a protocol which ensures conflict-serializable schedules.
Phase 1: Growing Phase
transaction may obtain locks
transaction may not release locks
Phase 2: Shrinking Phase
transaction may release locks
transaction may not obtain locks
The protocol assures serializability. It can be proved that the transactions can be serialized in
the order of their lock points (i.e. the point where a transaction acquired its final lock).
Two-phase locking does not ensure freedom from deadlocks
Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified
protocol called strict two-phase locking. Here a transaction must hold all its exclusive locks
till it commits/aborts.
Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this
protocol transactions can be serialized in the order in which they commit.
There can be conflict serializable schedules that cannot be obtained if two-phase locking is
used.
However, in the absence of extra information (e.g., ordering of access to data), two-phase
locking is needed for conflict serializability in the following sense:
Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj
that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.
Lock Conversions
Two-phase locking with lock conversions:
– First Phase:
can acquire a lock-S on item
can acquire a lock-X on item
can convert a lock-S to a lock-X (upgrade)
– Second Phase:
can release a lock-S
can release a lock-X
can convert a lock-X to a lock-S (downgrade)
This protocol assures serializability. But still relies on the programmer to insert the various locking
instructions.
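The growing/shrinking rule can be checked mechanically: once a transaction has released any lock, it may not acquire another. A small Python sketch (illustrative only; the operation strings are an assumed encoding):

```python
# ops is a transaction's sequence of "lock-S(A)", "lock-X(B)", "unlock(A)", ...
def is_two_phase(ops):
    shrinking = False
    for op in ops:
        if op.startswith("lock"):
            if shrinking:          # acquiring after a release violates 2PL
                return False
        elif op.startswith("unlock"):
            shrinking = True       # first release starts the shrinking phase
    return True

print(is_two_phase(["lock-S(A)", "lock-X(B)", "unlock(A)", "unlock(B)"]))  # True
print(is_two_phase(["lock-S(A)", "unlock(A)", "lock-S(B)", "unlock(B)"]))  # False
```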
6.EXPLAIN THE IMPLEMENTATION OF LOCKING (6 marks)
A lock manager can be implemented as a separate process to which transactions send lock
and unlock requests
The lock manager replies to a lock request by sending a lock grant messages (or a message
asking the transaction to roll back, in case of a deadlock)
The requesting transaction waits until its request is answered
The lock manager maintains a data-structure called a lock table to record granted locks and
pending requests
The lock table is usually implemented as an in-memory hash table indexed on the name of
the data item being locked
Black rectangles indicate granted locks, white ones indicate waiting requests
Lock table also records the type of lock granted or requested
New request is added to the end of the queue of requests for the data item, and granted if it is
compatible with all earlier locks
Unlock requests result in the request being deleted, and later requests are checked to see if
they can now be granted
If transaction aborts, all waiting or granted requests of the transaction are deleted
lock manager may keep a list of locks held by each transaction, to implement this efficiently
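The lock table described above can be sketched as a hash table of per-item queues. This is a simplified Python model (real lock managers also handle lock upgrades, deadlock detection, and per-transaction lock lists):

```python
from collections import defaultdict

class LockTable:
    def __init__(self):
        # item name -> queue of (transaction, mode, granted) entries
        self.table = defaultdict(list)

    @staticmethod
    def _compatible(m1, m2):
        return m1 == "S" and m2 == "S"   # only shared locks coexist

    def request(self, txn, item, mode):
        queue = self.table[item]
        # A new request is granted only if compatible with all earlier entries.
        granted = all(self._compatible(mode, m) for _, m, _ in queue)
        queue.append((txn, mode, granted))
        return granted

    def unlock(self, txn, item):
        queue = self.table[item]
        queue[:] = [e for e in queue if e[0] != txn]
        # Later requests are re-checked to see if they can now be granted.
        for i, (t, m, g) in enumerate(queue):
            if not g and all(self._compatible(m, m2) for _, m2, _ in queue[:i]):
                queue[i] = (t, m, True)

lt = LockTable()
print(lt.request("T1", "Q", "S"))  # True  (first lock, granted)
print(lt.request("T2", "Q", "S"))  # True  (shared locks are compatible)
print(lt.request("T3", "Q", "X"))  # False (must wait)
lt.unlock("T1", "Q")
lt.unlock("T2", "Q")
print(lt.table["Q"])               # [('T3', 'X', True)]: the wait is over
```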
7.EXPLAIN ABOUT LOG BASED RECOVERY (16 marks)
Log-Based Recovery
o A log is kept on stable storage.
The log is a sequence of log records, and maintains a record of update activities on
the database.
o When transaction Ti starts, it registers itself by writing a
<Ti start>log record
o Before Ti executes write(X), a log record <Ti, X, V1, V2> is written, where V1 is the value
of X before the write, and V2 is the value to be written to X.
o The log record notes that Ti has performed a write on data item Xj; Xj had value V1 before the write, and will have value V2 after the write.
o When Ti finishes its last statement, the log record <Ti commit> is written.
o We assume for now that log records are written directly to stable storage (that is, they are
not buffered)
o Two approaches using logs
Deferred database modification
Immediate database modification
Deferred Database Modification
o The deferred database modification scheme records all modifications to the log, but defers all
the writes to after partial commit.
o Assume that transactions execute serially
o Transaction starts by writing <Ti start> record to log.
o A write(X) operation results in a log record <Ti, X, V> being written, where V is the new
value for X
Note: old value is not needed for this scheme
o The write is not performed on X at this time, but is deferred.
o When Ti partially commits, <Ti commit> is written to the log
o Finally, the log records are read and used to actually execute the previously deferred writes.
o During recovery after a crash, a transaction needs to be redone if and only if both <Ti start>
and<Ti commit> are there in the log.
Redoing a transaction Ti ( redoTi) sets the value of all data items updated by the transaction to the new
values.
o Crashes can occur while
the transaction is executing the original updates, or
while recovery action is being taken
o example transactions T0 and T1 (T0 executes before T1):
o T0: read(A)            T1: read(C)
     A := A - 50             C := C - 100
     write(A)                write(C)
     read(B)
     B := B + 50
     write(B)
o Below we show the log as it appears at three instances of time.
If log on stable storage at time of crash is as in case:
(a) No redo actions need to be taken
(b) redo(T0) must be performed since <T0 commit> is present
(c) redo(T0) must be performed followed by redo(T1) since
<T0 commit> and <T1 commit> are present
Immediate Database Modification
o The immediate database modification scheme allows database updates of an uncommitted
transaction to be made as the writes are issued
since undoing may be needed, update logs must have both old value and new value
o Update log record must be written before database item is written
We assume that the log record is output directly to stable storage
Can be extended to postpone log record output, so long as prior to execution of an
output(B) operation for a data block B, all log records corresponding to items B must
be flushed to stable storage
o Output of updated blocks can take place at any time before or after transaction commit
o Order in which blocks are output can be different from the order in which they are written.
Immediate Database Modification Example
Log                      Write              Output
<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
                         A = 950
                         B = 2050
<T0 commit>
<T1 start>
<T1, C, 700, 600>
                         C = 600
                                            BB, BC
<T1 commit>
                                            BA
Note: BX denotes the block containing X.
Immediate DB Modification Recovery Example
Below we show the log as it appears at three instances of time.
Recovery actions in each case above are:
undo (T0): B is restored to 2000 and A to 1000.
undo (T1) and redo (T0): C is restored to 700, and then A and B are
set to 950 and 2050 respectively.
redo (T0) and redo (T1): A and B are set to 950 and 2050
respectively. Then C is set to 600
Checkpoints
Problems in recovery procedure as discussed earlier :
searching the entire log is time-consuming
we might unnecessarily redo transactions which have already
output their updates to the database.
Streamline recovery procedure by periodically performing checkpointing
Output all log records currently residing in main memory onto stable storage.
Output all modified buffer blocks to the disk.
Write a log record < checkpoint> onto stable storage.
Example of Checkpoints
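Recovery under deferred database modification (redo a transaction if and only if both its start and commit records are in the log) can be sketched as follows. The log-tuple encoding is an assumption, and the values follow the T0/T1 example above:

```python
# A log is a list of tuples: ("start", T), ("write", T, item, new_value),
# or ("commit", T). Deferred modification needs no old values.
def recover(log):
    db = {}
    committed = {t for op, t, *_ in log if op == "commit"}
    for op, t, *rest in log:
        if op == "write" and t in committed:
            item, new_value = rest
            db[item] = new_value      # redo: install the new value
    return db

# Log at the time of crash: T0 committed, T1 did not (its write is ignored).
log = [("start", "T0"), ("write", "T0", "A", 950), ("write", "T0", "B", 2050),
       ("commit", "T0"),
       ("start", "T1"), ("write", "T1", "C", 600)]
print(recover(log))  # {'A': 950, 'B': 2050}
```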
8. DESCRIBE DEADLOCK HANDLING (16 marks)
Neither of the transaction can ever proceed with its normal execution. This situation is called deadlock.
Consider the following two transactions:
T1: write (X) T2 : write(Y)
write(Y) write(X)
Schedule with deadlock
System is deadlocked if there is a set of transactions such that every transaction in the set is waiting
for another transaction in the set.
Deadlock prevention protocols ensure that the system will never enter into a deadlock state. Some
prevention strategies :
Require that each transaction locks all its data items before it begins execution
(predeclaration).
Impose partial ordering of all data items and require that a transaction can lock data
items only in the order specified by the partial order (graph-based protocol).
Deadlock Prevention Strategies
Following schemes use transaction timestamps for the sake of deadlock prevention alone.
wait-die scheme — non-preemptive
older transaction may wait for younger one to release data item. Younger transactions
never wait for older ones; they are rolled back instead.
a transaction may die several times before acquiring needed data item
wound-wait scheme — preemptive
older transaction wounds (forces rollback) of younger transaction instead of waiting
for it. Younger transactions may wait for older ones.
may be fewer rollbacks than wait-die scheme.
Both in wait-die and in wound-wait schemes, a rolled-back transaction is restarted with its original timestamp. Older transactions thus have precedence over newer ones, and starvation is hence avoided.
Timeout-Based Schemes:
a transaction waits for a lock only for a specified amount of time. After that, the wait times out and the transaction is rolled back.
thus deadlocks are not possible
simple to implement; but starvation is possible. Also difficult to determine a good value for the timeout interval.
Deadlock Detection
o Deadlocks can be described as a wait-for graph, which consists of a pair G = (V, E),
V is a set of vertices (all the transactions in the system)
E is a set of edges; each element is an ordered pair Ti → Tj.
o If Ti → Tj is in E, then there is a directed edge from Ti to Tj, implying that Ti is waiting for Tj to release a data item.
o When Ti requests a data item currently being held by Tj, then the edge Ti → Tj is inserted in the wait-for graph. This edge is removed only when Tj is no longer holding a data item needed by Ti.
o The system is in a deadlock state if and only if the wait-for graph has a cycle. Must invoke a deadlock-detection algorithm periodically to look for cycles.
Wait-for graph without a cycle    Wait-for graph with a cycle
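Deadlock detection on the wait-for graph thus reduces to cycle detection; a minimal Python sketch (the edge encoding is an assumption):

```python
# edges is a list of (waiter, holder) pairs: Ti waits for Tj.
def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)

    # Depth-first search; a node revisited on the current path closes a cycle.
    def visit(node, path):
        if node in path:
            return True
        return any(visit(m, path | {node}) for m in graph.get(node, []))

    return any(visit(n, set()) for n in graph)

# T3 waits for T4 and T4 waits for T3: a cycle, hence deadlock.
print(has_cycle([("T3", "T4"), ("T4", "T3")]))  # True
print(has_cycle([("T1", "T2"), ("T2", "T3")]))  # False
```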
Deadlock Recovery
o When deadlock is detected :
Some transaction will have to be rolled back (made a victim) to break the deadlock. Select as victim the transaction that will incur minimum cost.
Rollback -- determine how far to roll back transaction
Total rollback: Abort the transaction and then restart it.
More effective to roll back transaction only as far as necessary to break deadlock.
Starvation happens if same transaction is always chosen as victim. Include the
number of rollbacks in the cost factor to avoid starvation
UNIT V
IMPLEMENTATION TECHNIQUES PART B
1. EXPLAIN DIFFERENT TYPES OF PHYSICAL STORAGE MEDIA (8 marks)
Cache – fastest and most costly form of storage; volatile; managed by the computer system hardware.
Main memory:
fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds)
generally too small (or too expensive) to store the entire database
capacities of up to a few Gigabytes widely used currently
Capacities have gone up and per-byte costs have decreased steadily and rapidly (roughly factor of 2
every 2 to 3 years)
Volatile — contents of main memory are usually lost if a power failure or system crash occurs.
Flash memory
Data survives power failure
Data can be written at a location only once, but the location can be erased and written to again
o Can support only a limited number (10K – 1M) of write/erase cycles
o Erasing of memory has to be done to an entire bank of memory
Reads are roughly as fast as main memory
But writes are slow (few microseconds), erase is slower
Cost per unit of storage roughly similar to main memory
Widely used in embedded devices such as digital cameras
Is a type of EEPROM (Electrically Erasable Programmable Read-Only Memory)
Magnetic-disk
Data is stored on spinning disk, and read/written magnetically
Primary medium for the long-term storage of data; typically stores entire database.
Data must be moved from disk to main memory for access, and written back for storage
o Much slower access than main memory (more on this later)
direct-access – possible to read data on disk in any order, unlike magnetic tape
Capacities range up to roughly 400 GB currently
o Much larger capacity, and lower cost per byte, than main memory/flash memory
o Growing constantly and rapidly with technology improvements (factor of 2 to 3 every 2
years)
Survives power failures and system crashes
o disk failure can destroy data, but is rare
Optical storage
non-volatile, data is read optically from a spinning disk using a laser
CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms
Write-once, read-many (WORM) optical disks are used for archival storage (CD-R, DVD-R, DVD+R)
Multiple write versions also available (CD-RW, DVD-RW, DVD+RW, and DVD-RAM)
Reads and writes are slower than with magnetic disk
Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism for automatic loading/unloading of disks, are available for storing large volumes of data
Tape storage
non-volatile, used primarily for backup (to recover from disk failure), and for archival data
sequential-access – much slower than disk
very high capacity (40 to 300 GB tapes available)
tape can be removed from the drive
storage costs are much cheaper than disk, but drives are expensive
Tape jukeboxes are available for storing massive amounts of data: hundreds of terabytes
(1 terabyte = 10^12 bytes) up to even a petabyte (1 petabyte = 10^15 bytes)
Storage Hierarchy
primary storage: Fastest media but volatile (cache, main memory).
secondary storage: next level in hierarchy, non-volatile, moderately fast access time
also called on-line storage. E.g. flash memory, magnetic disks
tertiary storage: lowest level in hierarchy, non-volatile, slow access time; also called off-line storage
E.g. magnetic tape, optical storage
2.EXPLAIN ABOUT MAGNETIC DISKS MECHANISM (8 marks)
Data is stored on spinning disk, and read/written magnetically
Primary medium for the long-term storage of data; typically stores entire database.
Data must be moved from disk to main memory for access, and written back for storage
Much slower access than main memory (more on this later)
direct-access – possible to read data on disk in any order, unlike magnetic tape
Capacities range up to roughly 400 GB currently
Much larger capacity, and lower cost per byte, than main memory/flash memory
Growing constantly and rapidly with technology improvements (factor of 2 to 3 every
2 years)
Survives power failures and system crashes
o disk failure can destroy data, but is rare
Read-write head
Positioned very close to the platter surface (almost touching it)
Reads or writes magnetically encoded information.
Surface of platter divided into circular tracks
Over 50K-100K tracks per platter on typical hard disks
Each track is divided into sectors.
A sector is the smallest unit of data that can be read or written.
Sector size typically 512 bytes
Typical sectors per track: 500 (on inner tracks) to 1000 (on outer tracks)
To read/write a sector:
the disk arm swings to position the head on the right track
the platter spins continually; data is read/written as the sector passes under the head
Head-disk assemblies:
multiple disk platters on a single spindle (1 to 5 usually)
one head per platter, mounted on a common arm
Cylinder i consists of ith track of all the platters
Earlier generation disks were susceptible to head-crashes
The surface of earlier generation disks had metal-oxide coatings which would disintegrate on a head crash and damage all data on the disk
Current generation disks are less susceptible to such disastrous failures, although individual sectors may get corrupted
Disk controller
interfaces between the computer system and the disk drive hardware.
accepts high- level commands to read or write a sector
initiates actions such as moving the disk arm to the right track and actually reading or writing the
data
Computes and attaches checksums to each sector to verify that data is read back correctly
If data is corrupted, with very high probability the stored checksum won't match the recomputed checksum
Ensures successful writing by reading back sector after writing it
Performs remapping of bad sectors
3.EXPLAIN ABOUT RAID LEVELS. (16 marks)
RAID: Redundant Arrays of Independent Disks
disk organization techniques that manage a large number of disks, providing a view of a single disk of
o high capacity and high speed, by using multiple disks in parallel, and
o high reliability, by storing data redundantly, so that data can be recovered even if a disk fails
The chance that some disk out of a set of N disks will fail is much higher than the chance that a
specific single disk will fail. E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years),
will have a system MTTF of 100,000 / 100 = 1,000 hours (approx. 41 days)
Techniques for using redundancy to avoid data loss are critical with large numbers of disks
Originally a cost-effective alternative to large, expensive disks: the "I" in RAID originally stood for "inexpensive". Today RAIDs are used for their higher reliability and bandwidth,
and the "I" is interpreted as "independent"
Schemes to provide redundancy at lower cost by using disk striping combined with parity bits
Different RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics
RAID Level 0: Block striping; non-redundant.
Used in high-performance applications where data loss is not critical.
RAID Level 1: Mirrored disks with block striping
Offers best write performance.
Popular for applications such as storing log files in a database system.
RAID Level 2: Memory-Style Error-Correcting-Codes (ECC) with bit striping.
RAID Level 3: Bit-Interleaved Parity
a single parity bit is enough for error correction, not just detection, since we know which disk has failed
o When writing data, corresponding parity bits must also be computed and written to a parity
bit disk
o To recover data in a damaged disk, compute XOR of bits from other disks (including parity bit disk)
Faster data transfer than with a single disk, but fewer I/Os per second since every disk has to
participate in every I/O.
Subsumes Level 2 (provides all its benefits, at lower cost)
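The XOR recovery rule above can be demonstrated directly. This is an illustrative sketch (not from the notes), using bytes in place of bits: XOR-ing the surviving disks with the parity disk reproduces the failed disk's contents.

```python
# Sketch: rebuilding a failed disk in a parity-protected array.
# The parity disk holds the byte-wise XOR of all data disks, so
# XOR of the survivors plus parity recovers the missing disk.

def parity(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

disks = [b"\x0a\x0b", b"\x0c\x0d", b"\x0e\x0f"]   # three data disks
p = parity(disks)                                  # the parity disk

# suppose disk 1 fails: rebuild it from the other disks plus parity
rebuilt = parity([disks[0], disks[2], p])
assert rebuilt == disks[1]
```

The same property is why a single parity disk suffices: XOR is its own inverse, so any one missing operand can be reconstructed from the rest.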
RAID Level 4: Block-Interleaved Parity;
uses block-level striping, and keeps a parity block on a separate disk for corresponding blocks from
N other disks.
When writing data block, corresponding block of parity bits must also be computed and written to
parity disk
To find value of a damaged block, compute XOR of bits from corresponding blocks (including parity block)
from other disks.
Provides higher I/O rates for independent block reads than Level 3
block read goes to a single disk, so blocks stored on different disks can be read in parallel
Provides higher transfer rates for reads of multiple blocks than no striping
Before writing a block, parity data must be computed
Can be done by using old parity block, old value of current block and new value of current block (2
block reads + 2 block writes)
Or by recomputing the parity value using the new values of blocks corresponding to the parity block
More efficient for writing large amounts of data sequentially
Parity block becomes a bottleneck for independent block writes since every block write also writes
to parity disk
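The "2 block reads + 2 block writes" update described above follows from XOR algebra: the new parity equals the old parity XOR the old block XOR the new block. A small illustrative sketch (not from the notes):

```python
# Sketch: updating the parity for a single block write without reading
# all the other data disks:
#   new_parity = old_parity XOR old_block XOR new_block
# (2 reads: old block + old parity; 2 writes: new block + new parity)

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

old_block, other = b"\x11\x22", b"\x33\x44"
old_parity = xor(old_block, other)        # parity over both data blocks

new_block = b"\x55\x66"
new_parity = xor(xor(old_parity, old_block), new_block)

# same result as recomputing parity from scratch over all data blocks
assert new_parity == xor(new_block, other)
```

For large sequential writes the recompute-from-scratch alternative wins, because every data block is in memory anyway.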
RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.
E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
Higher I/O rates than Level 4.
o Block writes occur in parallel if the blocks and their parity blocks are on different disks.
o Subsumes Level 4: provides same benefits, but avoids bottleneck of parity disk.
RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra redundant information to guard against multiple disk failures.
Better reliability than Level 5 at a higher cost; not used as widely.
4.EXPLAIN ABOUT FILE ORGANIZATION (8 marks)
The database is stored as a collection of files. Each file is a sequence of records. A record is a sequence
of fields. One approach:
assume record size is fixed
each file has records of one particular type only
different files are used for different relations
This case is easiest to implement; will consider variable length records later.
Fixed-Length Records
Simple approach:
o Store record i starting from byte n × (i – 1), where n is the size of each record.
o Record access is simple, but records may cross blocks
o Modification: do not allow records to cross block boundaries
Deletion of record i:
alternatives:
move records i + 1, . . ., n to i, . . . , n – 1
move record n to i
do not move records, but link all free records on a free list
Free Lists
Store the address of the first deleted record in the file header.
Use this first record to store the address of the second deleted record, and so on
Can think of these stored addresses as pointers since they "point" to the location of a record.
More space efficient representation: reuse space for normal attributes of free records to store pointers. (No pointers stored in in-use records.)
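The offset formula and the free-list scheme above can be sketched together. This is an assumed, simplified layout (not from the notes): a Python list stands in for the file, and a freed slot reuses its own space to store the pointer to the next free slot.

```python
# Sketch: fixed-length records at byte offset n*(i-1), with deleted
# slots chained on a free list whose head lives in the file header.

RECORD_SIZE = 8                 # n: bytes per record (assumed)

def offset(i):
    """Byte offset of record i (1-based), per the formula above."""
    return RECORD_SIZE * (i - 1)

class FixedFile:
    def __init__(self):
        self.slots = []          # record data, or next-free pointer
        self.free_head = None    # "file header": first free slot index

    def insert(self, rec):
        if self.free_head is not None:            # reuse a freed slot
            i, self.free_head = self.free_head, self.slots[self.free_head]
            self.slots[i] = rec
        else:
            self.slots.append(rec)
            i = len(self.slots) - 1
        return i

    def delete(self, i):
        # the freed slot's space stores the pointer to the next free slot
        self.slots[i] = self.free_head
        self.free_head = i

f = FixedFile()
a = f.insert("r0"); b = f.insert("r1"); c = f.insert("r2")
f.delete(b)
assert f.insert("r3") == b      # the freed slot is reused first
assert offset(3) == 16          # third record starts at byte 8 * 2
```

Note how no pointer space is reserved in in-use slots, matching the space-efficient representation described above.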
Variable-Length Records
Variable-length records arise in database systems in several ways:
Storage of multiple record types in a file.
Record types that allow variable lengths for one or more fields
Record types that allow repeating fields (used in some older data models).
Variable-Length Records: Slotted Page Structure
Slotted page header contains:
o number of record entries
o end of free space in the block
o location and size of each record
Records can be moved around within a page to keep them contiguous with no empty space between
them; entry in the header must be updated.
Pointers should not point directly to record — instead they should point to the entry for the record in
header.
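The slotted-page layout above can be sketched as follows. This is an assumed, simplified structure (not from the notes): records grow downward from the end of the block, the header keeps the free-space pointer and one (location, size) entry per record, and callers hold slot numbers rather than byte addresses.

```python
# Sketch: a slotted page. Records are addressed through header entries
# (slot numbers), so they can be moved within the page without
# invalidating external pointers.

BLOCK_SIZE = 128                     # assumed block size

class SlottedPage:
    def __init__(self):
        self.slots = []              # header: (offset, size) per record
        self.free_end = BLOCK_SIZE   # header: end of free space
        self.data = bytearray(BLOCK_SIZE)

    def insert(self, rec: bytes):
        self.free_end -= len(rec)    # records grow from the block's end
        self.data[self.free_end:self.free_end + len(rec)] = rec
        self.slots.append((self.free_end, len(rec)))
        return len(self.slots) - 1   # external pointers use this slot id

    def read(self, slot):
        off, size = self.slots[slot]
        return bytes(self.data[off:off + size])

p = SlottedPage()
s = p.insert(b"alice")
p.insert(b"bob")
assert p.read(s) == b"alice"
```

Compaction after a deletion would only rewrite the (offset, size) entries; slot numbers, and thus external pointers, stay valid.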
5.EXPLAIN ABOUT ORGANIZATION OF RECORDS IN FILES (8 marks)
Heap – a record can be placed anywhere in the file where there is space
Sequential – store records in sequential order, based on the value of the search key of each record
Hashing – a hash function computed on some attribute of each record; the result specifies in which
block of the file the record should be placed
Records of each relation may be stored in a separate file. In a multitable clustering file organization,
records of several different relations can be stored in the same file
Motivation: store related records on the same block to minimize I/O
Sequential File Organization
Suitable for applications that require sequential processing of the entire file
The records in the file are ordered by a search-key
Deletion – use pointer chains
Insertion –locate the position where the record is to be inserted
if there is free space insert there
if no free space, insert the record in an overflow block
In either case, pointer chain must be updated
Need to reorganize the file from time to time to restore sequential order
Multitable Clustering File Organization
Store several relations in one file using a multitable clustering file organization
Multitable clustering organization of customer and depositor:
good for queries involving the join of depositor and customer, and for queries involving one single customer and his accounts
bad for queries involving only customer
results in variable size records
Can add pointer chains to link records of a particular relation
6.DISCUSS ABOUT DIFFERENT TYPES OF INDICES WITH INSERTION AND DELETION. (16 marks)
In an ordered index, index entries are stored sorted on the search key value. E.g., author catalog in library.
Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
Also called clustering index
The search key of a primary index is usually but not necessarily the primary key.
Secondary index: an index whose search key specifies an order different from the sequential order
of the file. Also called non-clustering index.
Index-sequential file: ordered sequential file with a primary index
Dense index — Index record appears for every search-key value in the file.
Sparse Index: contains index records for only some search-key values.
o Applicable when records are sequentially ordered on the search-key
o To locate a record with search-key value K:
Find the index record with the largest search-key value ≤ K
Search the file sequentially starting at the record to which the index record points
Compared to dense indices:
Less space and less maintenance overhead for insertions and deletions.
Generally slower than dense index for locating records.
Good tradeoff: sparse index with an index entry for every block in file, corresponding to
least search-key value in the block.
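The sparse-index lookup just described can be sketched directly, using the good tradeoff above of one index entry per block. This is an illustrative sketch with made-up data (the branch names are the ones used elsewhere in these notes):

```python
# Sketch: locating a record via a sparse index with one entry per block.
# Rule: find the largest index key <= K, then scan that block
# sequentially.

import bisect

# one (search-key, block number) entry per block, sorted by key
index_keys   = ["Brighton", "Mianus", "Redwood"]
index_blocks = [0, 1, 2]

blocks = [["Brighton", "Downtown"],
          ["Mianus", "Perryridge"],
          ["Redwood", "Round Hill"]]

def lookup(key):
    i = bisect.bisect_right(index_keys, key) - 1   # largest key <= K
    if i < 0:
        return None                                # before the first block
    for rec in blocks[index_blocks[i]]:            # sequential scan
        if rec == key:
            return rec
    return None

assert lookup("Perryridge") == "Perryridge"   # found via block 1's entry
assert lookup("Aardvark") is None
```

A dense index would replace the scan with a direct hit, at the cost of one index entry per record instead of per block.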
Multilevel index
If primary index does not fit in memory, access becomes expensive.
Solution: treat primary index kept on disk as a sequential file and construct a sparse index on it.
outer index – a sparse index of primary index
inner index – the primary index file
If even outer index is too large to fit in main memory, yet another level of index can be created, and so on.
Indices at all levels must be updated on insertion or deletion from the file.
Deletion
If deleted record was the only record in the file with its particular search-key value, the
search-key is deleted from the index also.
Single-level index deletion:
Dense indices – deletion of search-key: similar to file record deletion.
Sparse indices –
if an entry for the search key exists in the index, it is deleted by replacing the entry in the
index with the next search-key value in the file (in search-key order).
If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
Insertion
Single-level index insertion:
Perform a lookup using the search-key value appearing in the record to be inserted.
Dense indices – if the search-key value does not appear in the index, insert it.
Sparse indices – if index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created.
If a new block is created, the first search-key value appearing in the new block is inserted into the index.
Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms
7.EXPLAIN B+-TREE INDEX FILES(16 marks)
B+-tree indices are an alternative to indexed-sequential files.
Disadvantage of indexed-sequential files
performance degrades as file grows, since many overflow blocks get created.
Periodic reorganization of entire file is required.
Advantage of B+-tree index files:
automatically reorganizes itself with small, local, changes, in the face of insertions and deletions.
Reorganization of entire file is not required to maintain performance.
(Minor) disadvantage of B+-trees:
extra insertion and deletion overhead, space overhead.
Advantages of B+-trees outweigh disadvantages
B+-trees are used extensively
A B+-tree is a rooted tree satisfying the following properties:
All paths from root to leaf are of the same length
Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.
A leaf node has between ⌈(n–1)/2⌉ and n–1 values
Special cases:
If the root is not a leaf, it has at least 2 children.
If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values.
B+-Tree Node Structure
Typical node
Ki are the search-key values
Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf
nodes).
The search-keys in a node are ordered K1 < K2 < K3 < . . . < Kn–1
Leaf Nodes in B+-Trees
Properties of a leaf node:
For i = 1, 2, . . ., n–1, pointer Pi either points to a file record with search-key value Ki, or to a
bucket of pointers to file records, each record having search-key value Ki.
Only need the bucket structure if the search-key does not form a primary key.
If Li, Lj are leaf nodes and i < j, Li's search-key values are less than Lj's search-key values
Pn points to next leaf node in search-key order
Non-Leaf Nodes in B+-Trees
Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers:
All the search-keys in the subtree to which P1 points are less than K1
For 2 ≤ i ≤ n – 1, all the search-keys in the subtree to which Pi points have values
greater than or equal to Ki–1 and less than Ki
All the search-keys in the subtree to which Pn points have values greater than or equal
to Kn–1
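The key-range rules above determine how a search descends the tree. The following is an illustrative sketch (an assumed node layout, not from the notes) of B+-tree search over a tiny two-level tree:

```python
# Sketch: B+-tree search. Internal nodes hold keys and child pointers
# per the range rules above; leaves hold keys with record values.

class Node:
    def __init__(self, keys, children=None, values=None, leaf=False):
        self.keys = keys
        self.children = children    # child pointers (internal nodes)
        self.values = values        # record values (leaf nodes)
        self.leaf = leaf

def search(node, k):
    while not node.leaf:
        # follow Pi such that k lies in [K(i-1), Ki)
        i = 0
        while i < len(node.keys) and k >= node.keys[i]:
            i += 1
        node = node.children[i]
    for key, val in zip(node.keys, node.values):
        if key == k:
            return val
    return None

l1 = Node(["Brighton", "Downtown"], values=[100, 200], leaf=True)
l2 = Node(["Mianus", "Redwood"],   values=[300, 400], leaf=True)
root = Node(["Mianus"], children=[l1, l2])

assert search(root, "Redwood") == 400    # "Redwood" >= "Mianus": right
assert search(root, "Downtown") == 200   # "Downtown" < "Mianus": left
```

Because every path from root to leaf has the same length, the number of node accesses is just the height of the tree.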
Example of a B+-tree
B+-tree for account file (n = 5)
Leaf nodes must have between 2 and 4 values
(⌈(n–1)/2⌉ and n–1, with n = 5).
Non-leaf nodes other than root must have between 3 and 5 children (⌈n/2⌉ and n, with n = 5).
Root must have at least 2 children.
Since the inter-node connections are done by pointers, "logically" close blocks need not be
"physically" close.
The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
The B+-tree contains a relatively small number of levels
Level below root has at least 2⌈n/2⌉ values
Next level has at least 2⌈n/2⌉ × ⌈n/2⌉ values
.. etc.
o If there are K search-key values in the file, the tree height is no more than ⌈log⌈n/2⌉(K)⌉
o thus searches can be conducted efficiently.
Insertions and deletions to the main file can be handled efficiently, as the index can be restructured
in logarithmic time
Advantages of B-Tree indices:
o May use fewer tree nodes than a corresponding B+-Tree.
o Sometimes it is possible to find a search-key value before reaching a leaf node.
Disadvantages of B-Tree indices:
o Only a small fraction of all search-key values are found early
o Non-leaf nodes are larger, so fan-out is reduced. Thus, B-Trees typically have greater depth
than the corresponding B+-Tree
o Insertion and deletion are more complicated than in B+-Trees
o Implementation is harder than B+-Trees.
Typically, the advantages of B-Trees do not outweigh the disadvantages.
8.EXPLAIN ABOUT INSERTION AND DELETION IN B+ TREE (16 marks)
Insertion
Since the inter-node connections are done by pointers, "logically" close blocks need not be "physically" close.
The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
The B+-tree contains a relatively small number of levels
Level below root has at least 2⌈n/2⌉ values
Next level has at least 2⌈n/2⌉ × ⌈n/2⌉ values etc.
If there are K search-key values in the file, the tree height is no more than ⌈log⌈n/2⌉(K)⌉
thus searches can be conducted efficiently.
Insertions and deletions to the main file can be handled efficiently, as the index can be restructured
in logarithmic time (as we shall see).
Splitting a leaf node:
take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place
the first ⌈n/2⌉ in the original node, and the rest in a new node.
let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split.
If the parent is full, split it and propagate the split further up.
Splitting of nodes proceeds upwards till a node that is not full is found.
In the worst case the root node may be split increasing the height of the tree by 1.
Splitting a non-leaf node: when inserting (k, p) into an already full internal node N
Copy N to an in-memory area M with space for n+1 pointers and n keys
Insert (k, p) into M
Copy P1, K1, …, K⌈n/2⌉–1, P⌈n/2⌉ from M back into node N
Copy P⌈n/2⌉+1, K⌈n/2⌉+1, …, Kn, Pn+1 from M into the newly allocated node N′
Insert (K⌈n/2⌉, N′) into the parent of N
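The leaf-split arithmetic described above can be sketched concretely. This is an illustrative sketch (not from the notes), with n = 4 so that a leaf holds at most n – 1 = 3 key/pointer pairs:

```python
# Sketch: splitting a full B+-tree leaf on insertion. The first
# ceil(n/2) pairs stay in the original node; the rest move to a new
# node, whose least key is inserted into the parent.

import math

def split_leaf(pairs, new_pair, n):
    """pairs: sorted (key, ptr) list that is already full (n-1 entries)."""
    all_pairs = sorted(pairs + [new_pair])
    keep = math.ceil(n / 2)
    left, right = all_pairs[:keep], all_pairs[keep:]
    # (least key of the new node, pointer to it) goes into the parent
    return left, right, right[0][0]

left, right, up_key = split_leaf(
    [("Brighton", 1), ("Downtown", 2), ("Mianus", 3)],
    ("Clearview", 4), n=4)

assert [k for k, _ in left]  == ["Brighton", "Clearview"]
assert [k for k, _ in right] == ["Downtown", "Mianus"]
assert up_key == "Downtown"          # inserted into the parent node
```

If the parent is itself full, the same split logic propagates upward, possibly growing the tree by one level at the root.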
Deletion
o Find the record to be deleted, and remove it from the main file and from the bucket (if present)
o Remove (search-key value, pointer) from the leaf node if there is no bucket or if the bucket has become empty
o If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then merge siblings:
Insert all the search-key values in the two nodes into a single node (the one on the
left), and delete the other node.
Delete the pair (Ki–1, Pi), where Pi is the pointer to the deleted node, from its parent, recursively using the above procedure.
o Otherwise, if the node has too few entries due to the removal, but the entries in the node and a sibling do not fit into a single node, then redistribute pointers :
o Redistribute the pointers between the node and a sibling such that both have more than the minimum number of entries.
o Update the corresponding search-key value in the parent of the node.
o The node deletions may cascade upwards till a node which has ⌈n/2⌉ or more pointers is found.
o If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.
Deleting "Downtown" causes merging of under-full leaves
a leaf node can become empty only for n=3!
The leaf with "Perryridge" becomes underfull (actually empty, in this special case) and is merged with its
sibling. As a result, the "Perryridge" node's parent became underfull, and was merged with its sibling
The value separating the two nodes (at the parent) moves into the merged node; the entry is deleted from the parent
The root node then has only one child, and is deleted
The parent of the leaf containing Perryridge became underfull, and borrowed a pointer from its left sibling. The search-key value in the parent's parent changes as a result
9. EXPLAIN ABOUT STATIC HASHING (8 marks)
A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
In a hash file organization we obtain the bucket of a record directly from its search-key value using
a hash function.
Hash function h is a function from the set of all search-key values K to the set of all bucket addresses
B.
Hash function is used to locate records for access, insertion as well as deletion.
Records with different search-key values may be mapped to the same bucket; thus entire bucket has to be searched sequentially to locate a record.
Example of Hash File Organization
There are 10 buckets.
The binary representation of the ith character is assumed to be the integer i.
The hash function returns the sum of the binary representations of the characters modulo 10.
E.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
Hash Functions
o Worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.
o An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.
Ideal hash function is random, so each bucket will have the same number of records assigned to it
irrespective of the actual distribution of search-key values in the file.
Typical hash functions perform computation on the internal binary representation of the search-key.
o For example, for a string search-key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets could be returned.
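A hash of the kind just described can be sketched in a few lines. This is an illustrative sketch: the notes' example assigns the ith letter the value i, whereas this sketch uses Python's ord(), so the resulting bucket numbers differ from the h(Perryridge) = 5 example above.

```python
# Sketch: a typical string hash for a hash file organization - sum the
# character codes and take the result modulo the number of buckets.

N_BUCKETS = 10

def h(key: str) -> int:
    return sum(ord(c) for c in key) % N_BUCKETS

# place some records into buckets by hashing their search key
buckets = [[] for _ in range(N_BUCKETS)]
for name in ["Perryridge", "Round Hill", "Brighton"]:
    buckets[h(name)].append(name)

# a lookup rehashes the key, then scans that bucket sequentially
assert all(name in buckets[h(name)]
           for name in ["Perryridge", "Round Hill", "Brighton"])
```

Note the sequential scan of the target bucket: different keys can land in the same bucket, which is exactly the collision case the notes mention.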
Handling of Bucket Overflows
o Bucket overflow can occur because of:
insufficient buckets
skew in distribution of records. This can occur due to two reasons:
multiple records have the same search-key value
the chosen hash function produces a non-uniform distribution of key values
o Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is
handled by using overflow buckets.
o Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list.
Above scheme is called closed hashing. An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.
Hash Indices
o Hashing can be used not only for file organization, but also for index-structure creation.
o A hash index organizes the search keys, with their associated record pointers, into a hash file structure.
o Strictly speaking, hash indices are always secondary indices
o if the file itself is organized using hashing, a separate primary hash index on it using the same search-key is unnecessary.
o However, we use the term hash index to refer to both secondary index structures and hash organized files.
Example of Hash Index
Deficiencies of Static Hashing
o In static hashing, function h maps search-key values to a fixed set B of bucket addresses.
Databases grow or shrink with time.
If the initial number of buckets is too small, and the file grows, performance will degrade due to too many overflows.
If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).
If the database shrinks, again space will be wasted.
o One solution: periodic re-organization of the file with a new hash function
Expensive, disrupts normal operations
o Better solution: allow the number of buckets to be modified dynamically.