ROEVER ENGINEERING COLLEGE
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
CS1254 DATABASE MANAGEMENT SYSTEMS
16 MARKS:
UNIT I
FUNDAMENTALS
PART B
1. EXPLAIN ABOUT DB SYSTEM STRUCTURE OR ARCHITECTURE. (16 marks)
The functional components can be broadly divided into the
Storage manager
Query Processor.
Storage Manager: It is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system.
It translates the various DML statements into file-system commands.
It is responsible for storing, retrieving and updating data in the database.
STORAGE MANAGER
The storage manager components include:
Authorization & Integrity Manager: It tests for the satisfaction of integrity constraints and checks the authority of users to access data.
File Manager: It manages the allocation of space on disk and the data structures used to represent information stored on disk.
Buffer Manager: It decides what data to cache in main memory in order to speed up access to data.
Transaction Manager: It ensures that the database remains in a consistent state despite system failures and that concurrent transaction executions proceed without conflicting.
The storage manager implements several data structures, such as:
Data files which store db
Data dictionary which stores meta data about the structure of the db.
Indices: It is a file that provides fast access of data items that hold particular values.
QUERY PROCESSOR
It includes
DDL Interpreter:
- Interprets DDL statements and records the definitions in the data dictionary.
DML Compiler:
- Translates DML statements into an evaluation plan consisting of low-level instructions that the query evaluation engine understands.
Query Evaluation Engine:
-Which executes low level instructions generated by DML compiler.
DATABASE ARCHITECTURE
DATABASE USERS
We can classify users of a database into several types.
Naïve users:
- who interact with the system by invoking one of the application programs that have been previously written.
Application Programmers:
- Computer professionals who write application programs.
Sophisticated users
- familiar with the structure of database
- Such users can use a query language such as SQL to perform the required operations on databases.
Specialized users:
- Specialized users who write specialized db applications
DATABASE ADMINISTRATOR
DBA is a person who has central control over the system.
DBA is responsible for authorizing access to the db.
DBA is responsible for acquiring software and hardware resources.
DBA is responsible for coordinating and monitoring the use of db.
DBA creates the original database schema by executing a set of data definition statements in the DDL.
2. EXPLAIN ABOUT CONCEPTS OF ER MODEL. (16 marks)
Entity-Relationship data model: It is a high-level conceptual data model that describes the structure of a database in terms of entities, relationships among entities, and constraints on them.
Basic Concepts of E-R Model:
Entity
Entity Set
Attributes
Relationship
Relationship set
Identifying Relationship
Entity:
"An entity is a business object that represents a group, or category, of data."
- Example: Person, Employee, Car, Home, etc.
Objects with conceptual existence:
- Account, loan, job, etc.
Entity Set:
- A set of entities of the same type.
Attributes:
- A set of properties that describe an entity.
Types of Attributes:
Simple (or) atomic vs. Composite:
- An attribute which cannot be subdivided is a simple (atomic) attribute. (E.g. Roll No)
- An attribute which can be divided into sub-parts is called a composite attribute.
E.g. Address: apartment no., street, place, city, district.
Single Valued vs. Multivalued:
- An attribute having only one value (e.g. Eid, Roll No).
- An attribute having multiple values (e.g. Dept_location: a department can be located in several places).
Stored Vs Derived
- A stored attribute (SA) is one whose value is stored directly in the database, whereas a derived attribute (DA) is one whose value is derived from stored attributes.
- E.g. stored attribute: DOB;
derived attribute: Age, derived from DOB.
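The stored/derived distinction can be sketched in a few lines of Python. The dates here are made-up illustrative values:

```python
from datetime import date

# Stored attribute: the date of birth is kept in the database.
dob = date(1990, 6, 15)

# Derived attribute: age is computed from the stored DOB, not stored itself.
def age_from_dob(dob, today):
    years = today.year - dob.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (today.month, today.day) < (dob.month, dob.day):
        years -= 1
    return years

print(age_from_dob(dob, date(2024, 6, 14)))  # 33 (birthday not yet reached)
print(age_from_dob(dob, date(2024, 6, 15)))  # 34
```

Storing only DOB avoids the age value going stale; the derived value is always recomputed on demand.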
Key Attribute:
- An attribute which is used to uniquely identify records.
E.g.. eid, sno, dno
Relationship:
It is an association among several entities. It specifies what type of relationship exists between entities.
Relationship set:
- It is a set of relationships of the same type.
Weak Entity Set/ Strong Entity Set:
An entity set that does not possess sufficient attributes to form a primary key is called a Weak entity set. One that does have a primary key is called a Strong entity set.
Identifying Relationship:
The relationship that associates a weak entity set with its owner (strong) entity set.
Constraints
Two of the most important constraints are
Mapping Constraints
Participation constraints- Total Participation and Partial Participation
Mapping Cardinalities:
Mapping cardinalities, or cardinality ratios, express the number of entities to which another entity can be associated via a relationship set. There are several types of mapping cardinalities available. They are:
(i) One-to-One
An entity in set A is associated with at most one entity in set B and vice versa.
(ii) One-to-Many
An entity in set A is associated with zero or more entities in set B, and an entity in B is associated with at most one entity in A.
(iii) Many-to-One
One or more entities in set A are associated with at most one entity in B. An entity in B can be associated with any number of entities in A.
(iv) Many-to-Many
One or more entities in set A are associated with one or more entities in set B.
Participation Constraints:
Total Participation: The participation of an entity set E in a relationship set R is said to be total if every entity in E participates in at least one relationship in R.
Partial Participation: The participation of an entity set E in a relationship set R is said to be partial if only some of the entities in E participate in relationships in R.
Keys
It is used to uniquely identify entities within a given entity set or a relationship set.
Keys in Entity set:
(i) Primary Key: It is a key chosen to uniquely identify an entity in the entity set.
E.g. eno, rno, dno, etc.
(ii) Super Key: It is a set of one or more attributes that, taken together, allow us to uniquely identify an entity in the entity set. E.g. {eid} is a super key, and so is {eid, ename}, since any attribute set containing a key is still a super key.
(iii) Candidate Key: Candidate keys are minimal super keys, i.e. super keys for which no proper subset is itself a super key.
E.g. ename and eaddr together can be sufficient to identify an employee in the employee set, so {eid} and {ename, eaddr} are candidate keys.
Foreign keys:
An attribute which makes a reference to an attribute of another entity type is called a foreign key. Foreign keys link tables together to form an integrated database.
Domain
A range of values can be defined for an attribute; this is called the domain of that attribute.
E.g. for the attribute Age, Domain(Age) = {1, 2, …, 100}.
Keys in Relationship set:
Case 1: If the relationship set R has no attributes, then the set of attributes
Primary-key(E1) ∪ Primary-key(E2) ∪ … ∪ Primary-key(En)
describes an individual relationship in set R.
Case 2: If the relationship set R has attributes a1, a2, …, an, then the set of attributes
Primary-key(E1) ∪ Primary-key(E2) ∪ … ∪ Primary-key(En) ∪ {a1, a2, …, an}
describes an individual relationship in set R.
In both cases, Primary-key(E1) ∪ Primary-key(E2) ∪ … ∪ Primary-key(En) forms a super key for the relationship set.
E-R Diagram
3. COMPARE AND CONTRAST BETWEEN DATABASE SYSTEMS AND FILE SYSTEMS (6 marks)
In the early days, database applications were built directly on top of file systems
Drawbacks of using file systems to store data:
a. Data redundancy and inconsistency
i. Multiple file formats, duplication of information in different files
b. Difficulty in accessing data
i. Need to write a new program to carry out each new task
c. Data isolation — multiple files and formats
d. Integrity problems
i. Integrity constraints (e.g. account balance > 0) become "buried" in program code rather than being stated explicitly
ii. Hard to add new constraints or change existing ones
e. Atomicity of updates
i. Failures may leave database in an inconsistent state with partial updates carried out
ii. Example: Transfer of funds from one account to another should either complete or not happen at all
f. Concurrent access by multiple users
i. Concurrent access needed for performance
ii. Uncontrolled concurrent accesses can lead to inconsistencies
1. Example: Two people reading a balance and updating it at the same time
g. Security problems
i. Hard to provide user access to some, but not all, data
Database systems offer solutions to all the above problems.
4.WRITE ABOUT DIFFERENT LEVELS OF ABSTRACTION ,INSTANCES AND SCHEMAS (6 marks)
ABSTRACTION
It hides the complex details from the user and provide only the necessary data to the user
Three levels are there,
Physical level: describes how a record (e.g., customer) is stored.
Logical level: describes data stored in database, and the relationships among the data.
type instructor = record
    ID : string;
    name : string;
    dept_name : string;
    salary : integer;
end;
View level: application programs hide details of data types. Views can also hide information (such as an employee's salary) for security purposes.
Three levels of abstraction
SCHEMA
The logical structure of the database .
– Example: The database consists of information about a set of customers and accounts and the
relationship between them)
– Analogous to type information of a variable in a program
– Physical schema: database design at the physical level
– Logical schema: database design at the logical level
Conceptual Schema
1. Conceptual Schema: The conceptual schema (or logical schema) describes all relations that are stored in the database.
For a university, a conceptual schema might be:
- Students (sid:string,Age:integer)
– Faculty (fid: string, salary: real)
– Courses (cid: string, cname: string, credits:integer)
In the university example, these relations contain information about entities, such as students and faculty, and about relationships, such as students' enrollment in courses.
Physical Schema: specifies additional storage details.
• It summarizes how the relations described in the conceptual schema are stored on secondary storage devices such as disks and tapes.
• Creation of data structures called indexes, to speed up data retrieval operations
A sample physical schema for the university:
– Store all relations as unsorted files of records
– Create indexes on the first column of Students, Faculty, and Courses relations.
External Schema
Each external schema consists of a collection of one or more views and relations from the conceptual
schema.
• A view is conceptually a relation, but the records in a view are not stored in the DBMS.
• A user creates any view from data already stored.
• For example: we might want to allow students to find out the names of faculty members teaching
courses.
– This is the view associated:
Courseinfo (cid: string, fname:string)
– A user can treat a view just like a relation and ask questions about the records in the view, even though the records in the view are not stored explicitly.
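The Courseinfo view can be sketched with SQLite from Python. Note the fid column linking Courses to Faculty is an assumption added here so the join has something to match on; the notes only give Courseinfo(cid, fname), and the sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Relations from the conceptual schema (simplified to what the view needs).
cur.execute("CREATE TABLE Faculty (fid TEXT PRIMARY KEY, fname TEXT, salary REAL)")
cur.execute("CREATE TABLE Courses (cid TEXT PRIMARY KEY, cname TEXT, fid TEXT)")
cur.execute("INSERT INTO Faculty VALUES ('f1', 'Ramesh', 50000)")
cur.execute("INSERT INTO Courses VALUES ('CS1254', 'DBMS', 'f1')")

# The view stores no records of its own; its rows are computed on demand.
cur.execute("""CREATE VIEW Courseinfo AS
               SELECT c.cid, f.fname
               FROM Courses c JOIN Faculty f ON c.fid = f.fid""")

# A user queries the view exactly like a base relation.
print(cur.execute("SELECT * FROM Courseinfo").fetchall())  # [('CS1254', 'Ramesh')]
```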
INSTANCE – the actual content of the database at a particular point in time
– Analogous to the value of a variable
Physical Data Independence – the ability to modify the physical schema without changing the logical schema
– Applications depend on the logical schema
– In general, the interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.
5. DRAW THE ER DIAGRAM FOR BANKING (10 marks)
6. DISCUSS ABOUT DATABASE LANGUAGES (6 marks)
Data Manipulation Language (DML)
Language for accessing and manipulating the data organized by the appropriate data model.
DML also known as query language
Two classes of languages
o Procedural – user specifies what data is required and how to get those data
o Declarative (nonprocedural) – user specifies what data is required without specifying how to get those data
SQL is the most widely used query language
SQL
SQL: widely used non-procedural language
Example: Find the name of the instructor with ID 22222:
select name from instructor where instructor.ID = '22222'
Example: Find the IDs of instructors together with their department names, for departments with a budget over 95000:
select instructor.ID, department.dept_name from instructor, department where instructor.dept_name = department.dept_name and department.budget > 95000
Application programs generally access databases through one of
Language extensions to allow embedded SQL
Application program interface (e.g., ODBC/JDBC) which allow SQL queries to be
sent to a database
Data Definition Language (DDL)
Specification notation for defining the database schema.
Example: create table instructor (ID char(5),name varchar(20),dept_name varchar(20),
salary numeric(8,2))
DDL compiler generates a set of tables stored in a data dictionary.
Data dictionary contains metadata (i.e., data about data)
Database schema
Integrity constraints
Primary key (ID uniquely identifies instructors)
Referential integrity (references constraint in SQL)
e.g. dept_name value in any instructor tuple must appear in department
relation
Authorization
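The DDL statement above can be run as a sketch through SQLite from Python; SQLite's catalogue table sqlite_master plays the role of the data dictionary that the DDL compiler populates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# DDL: the create table statement from the notes.
cur.execute("""CREATE TABLE instructor (
    ID        CHAR(5) PRIMARY KEY,
    name      VARCHAR(20),
    dept_name VARCHAR(20),
    salary    NUMERIC(8,2))""")

# The schema definition lands in the system's data dictionary;
# in SQLite that catalogue is the sqlite_master table.
row = cur.execute(
    "SELECT type, name FROM sqlite_master WHERE name = 'instructor'").fetchone()
print(row)  # ('table', 'instructor')
```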
7.EXPLAIN DIFFERENT TYPES OF ATTRIBUTES IN ER MODEL( 4 MARKS)
Single-valued attributes: Attributes with a single value for a particular entity are called single-valued attributes. E.g. Cus_id: it takes only one value.
Multivalued attributes: Attributes with a set of values for a particular entity are called multivalued attributes. E.g. Phone number: it can take many values.
Stored attributes: The attributes stored directly in the database are called stored attributes.
Derived attributes: The attributes that are derived from the stored attributes are called derived attributes.
E.g. Age: it is derived from date of birth (another attribute).
Simple attribute: It cannot be divided into sub-parts. E.g. Cus_id.
Composite attributes: Composite attributes can be divided into sub-parts.
E.g. Name: has first name, middle name and last name.
UNIT II
RELATIONAL MODEL
PART B
1. EXPLAIN THE BASIC OPERATIONS OF RELATION ALGEBRA(16 marks)
Relational Algebra
It consists of a set of operations that take one or more relations as input and produce a new relation as output.
The operations can be divided into,
Basic operations: Select, Project, Union, rename, set difference and Cartesian product
Additional operations: Set intersections, natural join, division and assignment.
Extended operations: Aggregate operations and outer join
Basic operations
SELECT
It selects tuples that satisfy a given predicate. To denote selection, σ (sigma) is used.
Syntax: σ_condition(relation name), e.g. σ_sal>1000(Employee)
PROJECT
It selects attributes from the relation.
π is the symbol for project.
Syntax: π_<attribute list>(relation name), e.g. π_Eid,sal(Employee)
1. Mathematical Set Operations
UNION OPERATION:
R1 ∪ R2 implies tuples either from R1, or from R2, or from both R1 and R2.
The ∪ symbol is used.
SET DIFFERENCE
R1 − R2 implies tuples present in R1 but not in R2. The '−' symbol is used.
CARTESIAN PRODUCT
R1 × R2 allows to combine tuples from any two relations.
E.G.. Emp1 × Emp2
RENAME OPERATION
To rename the name of a relation or the name of an attribute.
2. Additional operations
INTERSECTION
R1 ∩ R2 implies tuples present in both R1 and R2.
NATURAL JOIN OR EQUI JOIN
Used to combine related tuples from two relations.
It requires that the two join attributes have the same name; otherwise the rename operation is applied first and then the join operation is applied.
Symbol: ⋈
OUTER JOIN
It is an extension of the join operation to deal with missing information.
In a natural join, only the matching tuples come in the result and the unmatched tuples are lost. To avoid this loss of information we use the outer join.
There are 3 forms of the outer join operation: left outer join, right outer join and full outer join.
LEFT OUTER JOIN-
It takes all tuples in the left relation that did not match with any tuple in the right relation and pads
the tuples with null values for all other attributes from the right relation and adds them to the result of the natural join.
RIGHT OUTER JOIN-
It takes all tuples in the right relation that did not match with any tuple in the left relation and pads the tuples with null values for all other attributes and adds them to the result of the natural join.
FULL OUTER JOIN
Padding tuples from the left relation that didn‘t match any from the right relation, as well as tuples from the right relation that did not match any from the left relation & adding them to the result of the
join.
DIVISION OPERATION
It is denoted by ÷. It is suited to queries that include the phrase "for all".
AGGREGATE FUNCTIONS
It takes a collection of values and returns a single value as a result.
avg, min, max, sum and count are a few aggregate functions.
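As a runnable sketch of the two basic operations, SQL's WHERE clause realises the select (σ) operation and the column list realises project (π); the employee table and its rows here are made-up illustrative data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employee (eid INTEGER, ename TEXT, sal INTEGER)")
cur.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                [(1, 'Anu', 900), (2, 'Babu', 1500), (3, 'Chitra', 2000)])

# SELECT (sigma): sigma_sal>1000(employee) -> WHERE picks whole tuples.
high = cur.execute("SELECT * FROM employee WHERE sal > 1000").fetchall()
print(high)  # [(2, 'Babu', 1500), (3, 'Chitra', 2000)]

# PROJECT (pi): pi_{eid,sal}(employee) -> the column list picks attributes.
proj = cur.execute("SELECT eid, sal FROM employee").fetchall()
print(proj)  # [(1, 900), (2, 1500), (3, 2000)]
```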
2. EXPLAIN ABOUT TUPLE RELATIONAL AND DOMAIN RELATIONAL CALCULUS (16 marks)
Relational Calculus
It can be divided as Tuple Relational calculus and Domain Relational Calculus
Tuple relational Calculus:
It is a non procedural query language
Specifies what data are required without describing how to get those data
Each query is of the form {t | P(t)}.
It is the set of all tuples t such that predicate P is true for t.
Notations Used:
t is a tuple variable; t[A] denotes the value of tuple t on attribute A.
t ∈ r denotes that tuple t is in relation r.
P is the formula similar to that of the predicate calculus.
Notations used in formulas:
t[A] → the value of tuple t on attribute A
t ∈ r → tuple t is in relation r
∃ → there exists. Definition: ∃t ∈ r (Q(t)) means there exists a tuple t in relation r such that predicate Q(t) is true.
∀ → for all. Definition: ∀t ∈ r (Q(t)) means Q(t) is true for all tuples t in relation r.
⇒ → implication. Definition: P ⇒ Q means if P is true then Q must be true.
A predicate calculus formula is built from:
Set of attributes and constants
Set of comparison operators (e.g. <, ≤, =, ≠, >, ≥)
Set of connectives: and (∧), or (∨), not (¬)
Implication (⇒): X ⇒ Y means if X is true, then Y is true
Set of quantifiers:
∃ — there exists: ∃t ∈ r (Q(t)) means there exists a tuple t in relation r such that predicate Q(t) is true.
∀ — for all: ∀t ∈ r (Q(t)) means Q(t) is true for all tuples t in relation r.
Safety of Expressions
It is possible to write tuple calculus expressions that generate infinite relations.
For example, { t | ¬(t ∈ r) } results in an infinite relation if the domain of any attribute of relation r is infinite.
To guard against the problem, we restrict the set of allowable expressions to safe expressions.
An expression {t | P (t )} in the tuple relational calculus is safe if every component of t appears in
one of the relations, tuples, or constants that appear in P
o NOTE: this is more than just a syntax condition.
o E.g. { t | t[A] = 5 ∨ true } is not safe: it defines an infinite set with attribute values that do not appear in any relation, tuple, or constant in P.
Domain Relational calculus
The domain relational calculus uses domain variables that take on values from an attribute domain rather than values for entire tuple.
Each Query is an expression of the form,
{ <x1, x2, …, xn> | P(x1, x2, …, xn) }
where x1, x2, …, xn represent domain variables and
P represents a formula similar to that of the predicate calculus.
Safety of Expressions
The expression { <x1, x2, …, xn> | P(x1, x2, …, xn) }
is safe if all of the following hold:
1. All values that appear in tuples of the expression are values from dom (P ) (that is, the values
appear either in P or in a tuple of a relation mentioned in P ).
2. For every "there exists" subformula of the form ∃x (P1(x)), the subformula is true if and only if there is a value of x in dom(P1) such that P1(x) is true.
3. For every "for all" subformula of the form ∀x (P1(x)), the subformula is true if and only if P1(x) is true for all values x from dom(P1).
3.EXPLAIN ABOUT TRIGGERS (6 marks)
Triggers
A trigger is a statement that is executed automatically by the system as a side effect of a modification to the database.
To design a trigger mechanism, we must
* Specify the conditions under which the trigger is to be executed.
* Specify the actions to be taken when the triggers executed.
Need for triggers
Suppose that instead of allowing negative account balances, the bank deals with overdrafts by,
* Setting the account balance to zero.
* Creating the loan in the amount of overdraft.
* Giving this loan a loan number identical to the account number of the overdrawn account.
3 parts of triggers:
o Event: a change to the database that activates the trigger.
o Condition: a query or test that is run when the trigger is activated.
o Action: a procedure that is executed when the trigger is activated.
The condition for executing the trigger is an update to the account relation that results in a negative balance value.
E.g.:
create trigger overdraft_trigger after update on account
referencing new row as nrow
for each row
when nrow.balance < 0
begin atomic
    insert into borrower
        (select customer_name, acc_no
         from depositor
         where nrow.acc_no = depositor.acc_no);
    insert into loan values
        (nrow.acc_no, nrow.branch_name, nrow.balance);
    update account set balance = 0
        where account.acc_no = nrow.acc_no
end
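The trigger above is written in SQL:1999 syntax. As a runnable sketch, roughly the same overdraft logic can be expressed in SQLite's trigger dialect, which uses NEW.column instead of a referencing clause; here the loan is recorded with a positive amount, and the sample account row is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE account (acc_no TEXT PRIMARY KEY, branch_name TEXT, balance REAL);
CREATE TABLE loan    (loan_no TEXT, branch_name TEXT, amount REAL);
INSERT INTO account VALUES ('A-101', 'Adayar', 500);

-- Overdraft trigger: when an update drives a balance negative,
-- turn the overdraft into a loan and reset the balance to zero.
CREATE TRIGGER overdraft_trigger AFTER UPDATE ON account
WHEN NEW.balance < 0
BEGIN
    INSERT INTO loan VALUES (NEW.acc_no, NEW.branch_name, -NEW.balance);
    UPDATE account SET balance = 0 WHERE acc_no = NEW.acc_no;
END;
""")

# Withdrawing 800 from a balance of 500 fires the trigger.
cur.execute("UPDATE account SET balance = balance - 800 WHERE acc_no = 'A-101'")
print(cur.execute("SELECT * FROM loan").fetchall())           # [('A-101', 'Adayar', 300.0)]
print(cur.execute("SELECT balance FROM account").fetchone())  # (0.0,)
```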
4. EXPLAIN ABOUT DATA DEFINITION LANGUAGE (DDL) IN SQL
(Or basic schema definition in SQL) (10 marks)
1. Create:
It is used to create a new table in oracle.
Syntax:
Create table table_name(column_name1 datatype1 [constraint], column_name2 datatype2 [constraint], …, column_name_n datatype_n [constraint]);
2. Alter:
It is used to add a new column to the table, modify an existing column, and include or drop an integrity constraint (primary key, not null, etc.).
a. Adding new column:
Syntax: alter table tablename add column_name data_type;
E.G. alter table customer add pno number(10);
b. Modifying an existing column:
Syntax: alter table tablename modify column_name newdata_type;
c. Dropping a column: It is used to delete a column from a table.
Syntax: alter table tablename drop column column_name;
E.g.. alter table customer drop column cust_add;
3. Dropping a table It is used to drop a table.
Syntax: drop table table_name;
E.G.. Drop table customer;
All data and the table structure of the customer table are permanently removed.
4. Renaming a table
SYNTAX: rename oldtablename to newtablename;
E.G.. Rename customer to cust_det;
5. Truncate a table
It removes all records or rows from the table.
SYNTAX: truncate table tablename;
E.G.. Truncate table cust_det;
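The syntax above is Oracle-flavoured. A runnable sketch of the same DDL life cycle (create, alter-add, rename, drop) in SQLite, which supports a subset of these commands and has no truncate (DELETE FROM is used instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Create a table.
cur.execute("CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, cust_name TEXT)")

# 2a. Alter: add a new column.
cur.execute("ALTER TABLE customer ADD COLUMN pno NUMERIC(10)")

# 4. Rename the table.
cur.execute("ALTER TABLE customer RENAME TO cust_det")

# Inspect the resulting columns via the catalogue.
cols = [r[1] for r in cur.execute("PRAGMA table_info(cust_det)")]
print(cols)  # ['cust_id', 'cust_name', 'pno']

# 3. Drop: removes the data and the table structure permanently.
cur.execute("DROP TABLE cust_det")
```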
5. EXPLAIN THE BASIC STRUCTURE OF SQL (10 marks)
The basic structure of an SQL expression consists of 3 clauses,
Select- select clause corresponds to projection operation in relational algebra
It is used to list the attributes desired in the result of a query.
From- from clause corresponds to Cartesian product in relation algebra.
Where- where clause corresponds to the selection operation in relational algebra.
It consist of a predicate involving attributes of the relations that appear in the from clause
All comparisons can be used such as <,>,<=,>=,=.
Logical operator like or, and are used.
Syntax: select A1,A2,…..An from R1,R2,…Rm where P
E.g.. Select cust_name from customer;
E.g.2 Select distinct cust_name from customer;
if we want duplicates removed.
E.g. 3: Select all cust_name from customer;
to specify explicitly that duplicates are retained.
From Clause: It is used to list the relations involved in the query.
E.G.. Select * from customer;
Where clause: It is used for specifying the condition.
E.g. Find the names of all customers whose city is "chennai":
select customer_name from customer where city = 'chennai';
OTHER OPERATIONS
Rename operation
Sql provides rename operation for both relations and attributes.
It uses the as clause.
Syntax: old_name as new_name.
Example: select loan_number as loanid ,amount from loan.
Tuple variable
The as clause is particularly useful in defining the notation of tuple variables.
A tuple variable in SQL must be associated with particular relation.
It is useful in comparing two tuples in the same relation.
Ex: select customer_name, T.loan_number, S.amount from borrower as T, loan as S where T.loan_number = S.loan_number;
String operations
SQL specifies strings by enclosing them in single quote.
The most commonly used operation on string is pattern matching-Like.
We describe patterns by using two special characters,
o Percent (%): the % character matches any substring.
o Underscore (_): the _ character matches any single character.
Ex: 'perry%' matches any string beginning with "perry".
'%idge%' matches any string containing "idge" as a substring, e.g. perryridge, rockridge, ridgeway.
'_ _ _' matches any string of exactly 3 characters.
'_ _ _%' matches any string of at least 3 characters.
Ex: select customer_name from customer where customer_street like '%main%'
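The LIKE patterns can be tried out with SQLite from Python; the customer rows here are made-up sample data (note that SQLite's LIKE is case-insensitive for ASCII by default):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customer (customer_name TEXT, customer_street TEXT)")
cur.executemany("INSERT INTO customer VALUES (?, ?)",
                [('Jones', 'Main Street'), ('Smith', 'North Road'),
                 ('Hayes', 'Old Main Lane')])

# '%main%' matches any street containing 'main' as a substring.
rows = cur.execute("""SELECT customer_name FROM customer
                      WHERE customer_street LIKE '%main%'""").fetchall()
print(rows)  # [('Jones',), ('Hayes',)]
```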
6. EXPLAIN SET OPERATIONS AND AGGREGATE FUNCTIONS IN SQL.
(10 marks)
Set Operations
The set operations union, intersect and except operate on relations and correspond to the relational algebra operations ∪, ∩ and −.
Each of the above operations automatically eliminates duplicates; to retain all duplicates use union all, intersect all and except all.
1. Find all customers who have a loan or an account, or both:
(select customer_name from depositor) union (select customer_name from borrower);
2. Find all customers who have both a loan and an account:
(select customer_name from depositor) intersect (select customer_name from borrower);
3. Find all customers who have an account but no loan:
(select customer_name from depositor) except (select customer_name from borrower);
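The three set-operation queries can be run as a sketch in SQLite (which does not accept parentheses around the compound-select operands, so they are omitted); the depositor and borrower rows are made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE depositor (customer_name TEXT)")
cur.execute("CREATE TABLE borrower  (customer_name TEXT)")
cur.executemany("INSERT INTO depositor VALUES (?)", [('Jones',), ('Smith',)])
cur.executemany("INSERT INTO borrower  VALUES (?)", [('Smith',), ('Hayes',)])

# union: loan or account or both (duplicates removed automatically).
union = sorted(cur.execute("""SELECT customer_name FROM depositor
                              UNION
                              SELECT customer_name FROM borrower""").fetchall())
# intersect: both a loan and an account.
inter = cur.execute("""SELECT customer_name FROM depositor
                       INTERSECT
                       SELECT customer_name FROM borrower""").fetchall()
# except: an account but no loan.
diff = cur.execute("""SELECT customer_name FROM depositor
                      EXCEPT
                      SELECT customer_name FROM borrower""").fetchall()
print(union)  # [('Hayes',), ('Jones',), ('Smith',)]
print(inter)  # [('Smith',)]
print(diff)   # [('Jones',)]
```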
Aggregate functions
It takes collection of values as input and return a single value as output.
SQL has 5 built-in aggregate functions:
avg – Find average value.
min – Find minimum value.
max – Find Maximum value.
Sum – Find sum of values.
Count – Counts number of values
Find the average account balance at the Adayar branch:
Select avg(balance) from account where branch_name = 'adayar';
Others
Group by clause
It is used to group tuples: the set of tuples with the same value on all attributes in the group by clause are placed in one group.
E.g. To find the average account balance at each branch:
Select branch_name, avg(balance) from account group by branch_name;
We use the keyword distinct for eliminating duplicates.
E.g. To find the number of depositors at each branch:
Select branch_name, count(distinct cus_name) from depositor, account where depositor.acc_num = account.acc_num group by branch_name;
Having
It is useful to state a condition that applies to groups rather than to tuples.
E.g. To find the branches where the average account balance is more than $1200:
Select branch_name, avg(balance) from account group by branch_name having avg(balance) > 1200;
NULL values
SQL allows the use of null values to indicate the absence of information about the value of an attribute.
E.g. To find the loan numbers that appear in the loan relation with null values for amount:
Select loan_number from loan where amount is null;
The result of an arithmetic expression is null if any of the input values is null, and SQL treats the result of any comparison involving a null value as unknown.
Boolean operations
And : true and unknown =unknown
False and unknown =false
Unknown and unknown=unknown
Or : true or unknown =true
False or unknown = unknown
Unknown or unknown=unknown
Not : not unknown = unknown.
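The effect of nulls on comparisons and aggregates can be demonstrated as a sketch in SQLite; the loan rows are made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE loan (loan_number TEXT, amount REAL)")
cur.executemany("INSERT INTO loan VALUES (?, ?)",
                [('L-11', 900.0), ('L-17', None)])

# 'amount = NULL' evaluates to unknown for every row, so it finds nothing;
# 'IS NULL' is the correct test for absent values.
eq = cur.execute("SELECT loan_number FROM loan WHERE amount = NULL").fetchall()
isnull = cur.execute("SELECT loan_number FROM loan WHERE amount IS NULL").fetchall()
print(eq)      # []
print(isnull)  # [('L-17',)]

# Aggregates ignore null: count(amount) counts only the known values.
counts = cur.execute("SELECT count(*), count(amount) FROM loan").fetchone()
print(counts)  # (2, 1)
```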
Nested sub queries
A sub query is a select- from-where expression that is nested within another query.
7. EXPLAIN ABOUT SET MEMBERSHIP AND SET COMPARISON (10 marks)
Set membership
SQL allows testing tuples for membership in a relation
Eg., find all customers who have both a loan and an account at the bank.
For finding all account holders we write the sub query as,
Select cus_name from depositor.
1. In clause
Then find customers who are borrowers from the bank and who appear in the list of account holders obtained in the sub query:
Select distinct cus_name from borrower where cus_name in (select cus_name from depositor);
2. Not in clause
Eg., Find all customers who have loan but not having deposit account in the bank.
select distinct cus_name from borrower where cus_name not in (select cus_name from depositor)
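The in and not in queries above can be run as a sketch in SQLite; the depositor and borrower rows are made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE depositor (cus_name TEXT)")
cur.execute("CREATE TABLE borrower  (cus_name TEXT)")
cur.executemany("INSERT INTO depositor VALUES (?)", [('Jones',), ('Smith',)])
cur.executemany("INSERT INTO borrower  VALUES (?)", [('Smith',), ('Hayes',)])

# IN: borrowers who also appear among the account holders.
both = cur.execute("""SELECT DISTINCT cus_name FROM borrower
                      WHERE cus_name IN (SELECT cus_name FROM depositor)""").fetchall()
# NOT IN: borrowers with no deposit account.
only = cur.execute("""SELECT DISTINCT cus_name FROM borrower
                      WHERE cus_name NOT IN (SELECT cus_name FROM depositor)""").fetchall()
print(both)  # [('Smith',)]
print(only)  # [('Hayes',)]
```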
Set comparison
Some: Find the names of all branches that have assets greater than those of at least one branch located in Brooklyn:
Select branch_name from branch where assets > some (select assets from branch where branch_city = 'brooklyn');
The > some comparison in the where clause of the outer select is true if the assets value of the tuple is greater than at least one member of the set of all assets values of branches in Brooklyn.
SQL allows < some, <= some, >= some, = some and <> some comparisons.
All: Find the names of all branches that have an assets value greater than that of each branch in Brooklyn [> all means greater than all]:
Select branch_name from branch where assets > all (select assets from branch where branch_city = 'brooklyn');
SQL allows < all, <= all, >= all, = all and <> all comparisons.
Test for empty relations
SQL includes a feature for testing whether a sub query has any tuples in its result.
Exists: returns the value true if the subquery is non-empty.
E.g. Find all customers who have both an account and a loan at the bank:
Select cus_name from borrower where exists (select * from depositor where depositor.cus_name = borrower.cus_name);
Not exists: Find all customers who have an account at all the branches in Brooklyn:
Select distinct s.cus_name from depositor as s where not exists ((select branch_name from branch where branch_city = 'brooklyn') except (select r.branch_name from depositor as t, account as r where t.account_num = r.account_num and s.cus_name = t.cus_name));
The sub query finds all the branches at which customer s.cus_name has an account. Thus the outer select takes each customer and tests whether the set of all branches at which that customer has an account contains the set of all branches located in Brooklyn.
Test for the absence of duplicate tuples
SQL includes a feature for testing whether a sub query has any duplicate tuples in its result.
The unique construct returns the value true if the argument sub query contains no duplicate tuples.
E.g. Find all customers who have at most one account at the Perryridge branch:
Select t.cus_name from depositor as t where unique (select r.cus_name from account, depositor as r where t.cus_name = r.cus_name and r.acc_num = account.acc_num and account.branch_name = 'perryridge');
Not unique: tests for the existence of duplicate tuples.
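The exists test can be run as a sketch in SQLite (SQLite has no unique construct, so only exists is shown); the depositor and borrower rows are made-up sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE depositor (cus_name TEXT)")
cur.execute("CREATE TABLE borrower  (cus_name TEXT)")
cur.executemany("INSERT INTO depositor VALUES (?)", [('Jones',), ('Smith',)])
cur.executemany("INSERT INTO borrower  VALUES (?)", [('Smith',), ('Hayes',)])

# EXISTS is true for a borrower when the correlated subquery
# finds at least one matching depositor row.
rows = cur.execute("""SELECT cus_name FROM borrower
                      WHERE EXISTS (SELECT * FROM depositor
                                    WHERE depositor.cus_name = borrower.cus_name)""").fetchall()
print(rows)  # [('Smith',)]
```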
8. EXPLAIN DIFFERENT TYPES OF JOIN QUERIES. (8 marks)
The purpose of join is to combine the data spread across tables.
Types of joins
Inner join
Left outer join
Right outer join
Full outer join
The loan and borrower relations:
loan:
Loan_no | Branch_name | amt
170 | Downtown | 3000
180 | Redwood | 4000
190 | Perryridge | 7000
borrower:
Cus_name | Loan_no
Jones | 170
Smith | 180
Hayes | 155
Inner join
Inner join combines the two relations, i.e. the left-side attributes and the right-side attributes.
Eg., loan inner join borrower on loan.loan_no = borrower.loan_no.
Output will be:
Loan_no | Branch_name | amt | Cus_name | Loan_no
170 | Downtown | 3000 | Jones | 170
180 | Redwood | 4000 | Smith | 180
Natural inner join
The common join attribute is displayed only once. Eg., loan natural join borrower.
Loan_no | Branch_name | amt | Cus_name
170 | Downtown | 3000 | Jones
180 | Redwood | 4000 | Smith
Left outer join
It keeps tuples of the left-hand relation that do not match any tuple in the right-hand relation, padded with null values for the attributes from the right relation.
Eg., loan left outer join borrower on loan.loan_no = borrower.loan_no.
Loan_no | Branch_name | amt | Cus_name | Loan_no
170 | Downtown | 3000 | Jones | 170
180 | Redwood | 4000 | Smith | 180
190 | Perryridge | 7000 | Null | Null
Natural left outer join
Eg., loan natural left outer join borrower.
Loan_no | Branch_name | amt | Cus_name
170 | Downtown | 3000 | Jones
180 | Redwood | 4000 | Smith
190 | Perryridge | 7000 | Null
Right outer join
All tuples of the right-hand relation appear in the result; tuples that do not match any tuple in the left-hand relation are padded with null values.
Eg., loan right outer join borrower on loan.loan_no = borrower.loan_no.
Loan_no Branch_name amt Cus_name Loan_no
170 Down town 3000 Jones 170
180 Redwood 4000 Smith 180
Null Null Null Hayes 155
Natural right outer join
Eg., loan natural right outer join borrower.
Loan_no Branch_name amt Cus_name
170 Down town 3000 Jones
180 Redwood 4000 Smith
155 Null Null Hayes
Full outer join
Combines the effect of both the left and right outer joins.
Eg., loan full outer join borrower on loan.loan_no=borrower.loan_no.
Loan_no Branch_name amt Cus_name Loan_no
170 Down town 3000 Jones 170
180 Redwood 4000 Smith 180
190 Perryridge 7000 Null Null
Null Null Null Hayes 155
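The joins above can be reproduced with any SQL engine; the sketch below uses Python's sqlite3 module (an assumed environment, not part of the original notes) with the loan and borrower tables of this section. SQLite supports right and full outer joins only from version 3.39, so the sketch sticks to inner and left outer joins.

```python
import sqlite3

# Hypothetical in-memory bank database mirroring the loan/borrower tables above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE loan (loan_no INTEGER, branch_name TEXT, amt INTEGER);
    CREATE TABLE borrower (cus_name TEXT, loan_no INTEGER);
    INSERT INTO loan VALUES (170, 'Downtown', 3000), (180, 'Redwood', 4000),
                            (190, 'Perryridge', 7000);
    INSERT INTO borrower VALUES ('Jones', 170), ('Smith', 180), ('Hayes', 155);
""")

# Inner join: only loans with a matching borrower appear.
inner = conn.execute("""
    SELECT loan.loan_no, branch_name, amt, cus_name
    FROM loan INNER JOIN borrower ON loan.loan_no = borrower.loan_no
""").fetchall()

# Left outer join: the unmatched loan (190) is padded with NULL (None in Python).
left = conn.execute("""
    SELECT loan.loan_no, branch_name, amt, cus_name
    FROM loan LEFT OUTER JOIN borrower ON loan.loan_no = borrower.loan_no
""").fetchall()

print(inner)  # 2 rows
print(left)   # 3 rows; the Perryridge loan has cus_name = None
```

The row counts match the tables shown above: the inner join drops loan 190 and borrower Hayes, while the left outer join keeps loan 190 with a null customer.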
Natural full outer join
Eg., loan natural full outer join borrower.
Loan_no Branch_name Amt Cus_name
170 Downtown 3000 Jones
180 Redwood 4000 Smith
190 Perryridge 7000 Null
155 Null Null Hayes
9. EXPLAIN ABOUT INTEGRITY CONSTRAINTS. (6 marks)
Integrity constraints guard against accidental damage to the database, by ensuring that authorized changes to the database do not result in a loss of data consistency. Example constraints:
A checking account must have a balance greater than $10,000.00.
A salary of a bank employee must be at least $4.00 an hour
A customer must have a (non-null) phone number
Constraints on a Single Relation
not null
primary key
unique
check (P ), where P is a predicate
Not Null Constraint
Declare branch_name for branch is not null
branch_name char(15) not null
Declare the domain Dollars to be not null
create domain Dollars numeric(12,2) not null
The Unique Constraint
unique ( A1, A2, …, Am)
The unique specification states that the attributes
A1, A2, … Am form a candidate key.
Candidate keys are permitted to be null (in contrast to primary keys).
The check clause
check (P ), where P is a predicate
Example: Declare branch_name as the primary key for branch and ensure that the values of assets
are non-negative.
create table branch
(branch_name char(15),
branch_city char(30),
assets integer,
primary key (branch_name),
check (assets >= 0))
The check clause in SQL-92 permits domains to be restricted:
Use the check clause to ensure that an hourly_wage domain allows only values greater than a specified value.
create domain hourly_wage numeric(5,2)
constraint value_test check (value >= 4.00)
The domain has a constraint that ensures that the hourly_wage is greater than 4.00
The clause constraint value_test is optional; useful to indicate which constraint an update violated.
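The not null, primary key, and check declarations above can be exercised directly in SQLite; the following sketch (Python's sqlite3, an assumed environment) shows the check clause rejecting an insert with negative assets.

```python
import sqlite3

# A minimal sketch of not null and check constraints, using SQLite syntax;
# the table follows the branch example above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE branch (
        branch_name CHAR(15) NOT NULL,
        branch_city CHAR(30),
        assets      INTEGER,
        PRIMARY KEY (branch_name),
        CHECK (assets >= 0)
    )
""")
conn.execute("INSERT INTO branch VALUES ('Perryridge', 'Horseneck', 1700000)")

# Inserting negative assets violates the check clause and is rejected.
try:
    conn.execute("INSERT INTO branch VALUES ('Downtown', 'Brooklyn', -1)")
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # True: only the valid row remains in the table
```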
10.EXPLAIN ABOUT REFERENTIAL INTEGRITY AND ASSERTIONS
(8 marks)
Referential integrity
Ensures that a value that appears in one relation for a given set of attributes also appears for a certain set of
attributes in another relation.
Example: If "Perryridge" is a branch name appearing in one of the tuples in the account relation, then there exists a tuple in the branch relation for branch "Perryridge".
Primary and candidate keys and foreign keys can be specified as part of the SQL create table statement:
The primary key clause lists attributes that comprise the primary key.
The unique key clause lists attributes that comprise a candidate key.
The foreign key clause lists the attributes that comprise the foreign key and the name of the
relation referenced by the foreign key. By default, a foreign key references the primary key attributes of the referenced table.
Example
create table customer
(customer_name char(20),
customer_street char(30),
customer_city char(30),
primary key (customer_name))
create table branch
(branch_name char(15),
branch_city char(30),
assets numeric(12,2),
primary key (branch_name))
create table account
(account_number char(10),
branch_name char(15),
balance integer,
primary key (account_number),
foreign key (branch_name) references branch)
create table depositor
(customer_name char(20),
account_number char(10),
primary key (customer_name, account_number),
foreign key (account_number) references account,
foreign key (customer_name) references customer)
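A sketch of referential-integrity enforcement using Python's sqlite3 (an assumed environment; note that SQLite checks foreign keys only after the pragma shown is enabled), with trimmed versions of the branch and account tables above:

```python
import sqlite3

# SQLite enforces foreign keys only when this pragma is on.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE branch (branch_name CHAR(15) PRIMARY KEY)")
conn.execute("""
    CREATE TABLE account (
        account_number CHAR(10) PRIMARY KEY,
        branch_name    CHAR(15) REFERENCES branch,  -- references branch's primary key
        balance        INTEGER
    )
""")
conn.execute("INSERT INTO branch VALUES ('Perryridge')")
conn.execute("INSERT INTO account VALUES ('A-101', 'Perryridge', 500)")  # ok

# An account naming a non-existent branch violates referential integrity.
try:
    conn.execute("INSERT INTO account VALUES ('A-102', 'Nowhere', 700)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```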
Assertions
An assertion is a predicate expressing a condition that we wish the database always to satisfy.
An assertion in SQL takes the form
create assertion <assertion-name> check <predicate>
When an assertion is made, the system tests it for validity, and tests it again on every update that may violate the assertion.
This testing may introduce a significant amount of overhead; hence assertions should be used with great
care.
An assertion of the form "for all X, P(X)" is achieved in a round-about fashion using
not exists X such that not P(X)
Assertion Example
Every loan has at least one borrower who maintains an account with a minimum balance of $1000.00.
create assertion balance_constraint check
(not exists (
select * from loan
where not exists (
select * from borrower, depositor, account
where loan.loan_number = borrower.loan_number
and borrower.customer_name = depositor.customer_name
and depositor.account_number = account.account_number
and account.balance >= 1000)))
The sum of all loan amounts for each branch must be less than the sum of all account balances at the branch.
create assertion sum_constraint check
(not exists (select *
from branch
where (select sum(amount) from loan
where loan.branch_name = branch.branch_name)
>= (select sum(balance) from account
where account.branch_name = branch.branch_name)))
11. EXPLAIN ABOUT EMBEDDED SQL AND DYNAMIC SQL (6 marks)
Embedded SQL
A query is first associated with a cursor c using a declare cursor statement.
The open statement causes the query to be evaluated:
EXEC SQL open c END_EXEC
The fetch statement causes the values of one tuple in the query result to be placed in host language variables.
EXEC SQL fetch c into :cn, :cc END_EXEC
Repeated calls to fetch get successive tuples in the query result.
A variable called SQLSTATE in the SQL communication area (SQLCA) gets set to '02000' to indicate no more data is available.
The close statement causes the database system to delete the temporary relation that holds the result of the query.
EXEC SQL close c END_EXEC
Note: above details vary with language. For example, the Java embedding defines Java iterators to step through result tuples.
Updates Through Cursors
Can update tuples fetched by cursor by declaring that the cursor is for update
declare c cursor for
select * from account
where branch_name = 'Perryridge'
for update
To update tuple at the current location of cursor c
update account
set balance = balance + 100
where current of c
Dynamic SQL
Allows programs to construct and submit SQL queries at run time.
Example of the use of dynamic SQL from within a C program.
char *sqlprog = "update account set balance = balance * 1.05 where account_number = ?";
EXEC SQL prepare dynprog from :sqlprog;
char account[10] = "A-101";
EXEC SQL execute dynprog using :account;
The dynamic SQL program contains a ?, which is a place holder for a value that is provided when the SQL program is executed.
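The same ? place holder appears in host-language database APIs; the sketch below mirrors the C fragment using Python's sqlite3 (an assumed environment): the statement text is built at run time and the value is supplied when it is executed.

```python
import sqlite3

# Hypothetical account table for the dynamic-update example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (account_number TEXT, balance REAL)")
conn.execute("INSERT INTO account VALUES ('A-101', 1000.0)")

# The SQL text is constructed at run time; '?' is filled in on execution.
sqlprog = "UPDATE account SET balance = balance * 1.05 WHERE account_number = ?"
conn.execute(sqlprog, ("A-101",))

balance = conn.execute(
    "SELECT balance FROM account WHERE account_number = ?", ("A-101",)
).fetchone()[0]
print(balance)  # 1050.0
```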
UNIT III DATABASE DESIGN
PARTB
1. EXPLAIN FIRST NORMAL FORM (1NF) (8 marks)
A relation schema R is in 1NF if all the attributes of the relation R are atomic in nature.
E.g., consider the DEPT relation, extended by including a DLOCATIONS attribute. We assume that each department may have a number of locations.
This is not in 1NF because DLOCATIONS is not an atomic attribute.
First Normal Form(1NF)
There are 2 main techniques to achieve 1NF.
The original DEPT relation (not in 1NF):
DNAME DNO DHEAD DLOCATIONS
Research 3 John (Mianus, Rye, Stratford)
Administration 2 Princy Mianus
Headquarters 1 Peter Rye
1. Remove the attribute DLOCATIONS and place it in a separate relation DEPT_LOCATIONS along with the primary key DNO of DEPT. The primary key of DEPT_LOCATIONS is the combination {DNO, DLOCATIONS}.
DEPT_LOCATIONS
DNO DLOCATIONS
1 Rye
2 Mianus
3 Rye
3 Mianus
3 Stratford
2. Expand the key so that there will be a separate tuple in the original DEPT relation for each location of a department, as shown below.
DNAME DNO DHEAD DLOCATIONS
Research 3 John Mianus
Research 3 John Rye
Research 3 John Stratford
Administration 2 Princy Mianus
Headquarters 1 Peter Rye
This second technique introduces redundancy, so the first technique is superior.
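Technique 1 can be sketched in a few lines of Python (purely illustrative; the tuples mirror the DEPT example above): the multivalued DLOCATIONS attribute is flattened into a separate DEPT_LOCATIONS relation with one tuple per (DNO, location) pair.

```python
# DEPT with a multivalued DLOCATIONS attribute (not in 1NF).
dept = [
    ("Research",       3, "John",   ("Mianus", "Rye", "Stratford")),
    ("Administration", 2, "Princy", ("Mianus",)),
    ("Headquarters",   1, "Peter",  ("Rye",)),
]

# DEPT without the multivalued attribute: atomic columns only.
dept_1nf = [(dname, dno, dhead) for dname, dno, dhead, _ in dept]

# DEPT_LOCATIONS: one tuple per (DNO, location) pair;
# its primary key is the combination {DNO, DLOCATIONS}.
dept_locations = [(dno, loc) for _, dno, _, locs in dept for loc in locs]

print(dept_locations)  # 5 pairs, e.g. (3, 'Stratford')
```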
2.EXPLAIN SECOND NORMAL FORM(2NF) (8 marks)
A relation R is in 2NF if and only if,
• It is in the 1NF and
• No partial dependency exists between non-key attributes and key attributes.
• The test for 2NF involves testing for functional dependencies whose left-hand-side attributes are part of the primary key. If the primary key contains a single attribute, the test need not be applied at all.
• A relation schema R is in 2NF if every non-prime attribute A in R is fully functionally
dependent on the primary key of R.
• E.g., consider the EMP_PROJ relation; it is in 1NF but not in 2NF.
The non-prime attribute ENAME violates 2NF because of FD2, as do the non-prime attributes PNAME and PLOCATION because of FD3; FD2 and FD3 make ENAME, PNAME and PLOCATION partially dependent on the primary key {SSN, PNO}, thus violating the 2NF test.
The functional dependencies FD1, FD2 and FD3 lead to the decomposition of EMP_PROJ into the 3 relation schemas EP1, EP2 and EP3, each of which is in 2NF.
3. EXPLAIN THIRD NORMAL FORM (3NF) AND BCNF.GIVE THEIR
COMPARISON (16 marks)
THIRD NORMAL FORM
A relation R is said to be in the 3NF if and only if
* It is in 2 NF and
* No transitive dependency exists between non-key attributes and key attributes.
E.g., consider the relation schema EMP_DEPT.
The dependency SSN → DNGRSSN is transitive through DNO in EMP_DEPT, because both the dependencies SSN → DNO and DNO → DNGRSSN hold, and DNO is neither a key itself nor a subset of the key of EMP_DEPT.
A relation schema R is in 3NF if it satisfies 2NF and no non-prime attribute of R is transitively dependent on the primary key.
EMP_DEPT Relation Schema in 2NF
ENAME SSN BDATE ADDRESS DNO DNAME DNGRSSN
We can normalize schemas ED1 and ED2,
3NF relation schemas ED1 and ED2
Here ED1 and ED2 represent independent entity facts about employees and departments.
ED1
ENAME SSN BDATE ADDRESS
DNO
ED2
DNO DNAME DNGRSSN
BOYCE-CODD NORMAL FORM (BCNF)
A relation R is said to be in BCNF if and only if all the determinants are candidate keys.
A BCNF relation is a strong 3NF, but not every 3NF relation is in BCNF.
A relation schema R is in BCNF with respect to a set F of functional dependencies if, for all functional dependencies in F of the form
α → β
where α ⊆ R and β ⊆ R,
at least one of the following holds:
1. α → β is trivial (i.e., β ⊆ α)
2. α is a superkey of R
Comparison of BCNF and 3NF
1. It is always possible to decompose a relation into a set of relations that are in 3NF such that:
the decomposition is lossless
the dependencies are preserved
2. It is always possible to decompose a relation into a set of relations that are in BCNF such that:
the decomposition is lossless
it may not be possible to preserve dependencies.
4.EXPLAIN ABOUT MULTIVALUED DEPENDENCIES AND FOURTH NORMAL FORM (16 marks)
Multivalued Dependencies (MVDs)
Let R be a relation schema and let α ⊆ R and β ⊆ R. The multivalued dependency α →→ β
holds on R if, in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α],
there exist tuples t3 and t4 in r such that:
t1[α] = t2[α] = t3[α] = t4[α]
t3[β] = t1[β]
t3[R – β] = t2[R – β]
t4[β] = t2[β]
t4[R – β] = t1[R – β]
Let R be a relation schema with a set of attributes that are partitioned into 3 nonempty subsets.
Y, Z, W
We say that Y →→ Z (Y multidetermines Z) if and only if, for all possible relations r(R),
if < y1, z1, w1 > ∈ r and < y1, z2, w2 > ∈ r
then
< y1, z1, w2 > ∈ r and < y1, z2, w1 > ∈ r
Note that since the behavior of Z and W are identical, it follows that
Y →→ Z if and only if Y →→ W
In our example:
course →→ teacher
course →→ book
The above formal definition is supposed to formalize the notion that given a particular value of Y (course) it has associated with it a set of values of Z (teacher) and a set of values of W (book), and these two sets are in some sense independent of each other.
Note:
If Y → Z then Y →→ Z
Indeed we have (in the above notation) z1 = z2.
The claim follows.
We use multivalued dependencies in two ways:
1. To test relations to determine whether they are legal under a given set of functional and multivalued dependencies
2. To specify constraints on the set of legal relations. We shall thus concern ourselves only with relations that satisfy a given set of functional and multivalued dependencies.
If a relation r fails to satisfy a given multivalued dependency, we can construct a relation r′ that does satisfy the multivalued dependency by adding tuples to r.
Fourth Normal Form
A relation schema R is in 4NF with respect to a set D of functional and multivalued dependencies if,
for all multivalued dependencies in D+ of the form α →→ β, where α ⊆ R and β ⊆ R, at least one of the following holds:
α →→ β is trivial (i.e., β ⊆ α or α ∪ β = R)
α is a superkey for schema R
If a relation is in 4NF it is in BCNF.
5.EXPLAIN ABOUT 5NF AND DK/NF(6 marks)
Fifth Normal Form (5NF)
There are certain conditions under which after decomposing a relation, it cannot be reassembled back into its original form.
We don't consider these issues here.
Domain Key Normal Form (DK/NF)
A relation is in DK/NF if every constraint on the relation is a logical consequence of the definition of keys and domains.
Constraint: a rule governing static values of an attribute such that we can determine if this constraint is true or false. Examples:
1. Functional dependencies 2. Multivalued dependencies
3. Inter-relation rules 4. Intra-relation rules
However, this does not include time-dependent constraints.
Key: Unique identifier of a tuple.
Domain: The physical (data type, size, NULL values) and semantic (logical) description of what values an attribute can hold.
There is no known algorithm for converting a relation directly into DK/NF.
UNIT IV
TRANSACTIONS
PART B
1. EXPLAIN ABOUT DIFFERENT TRANSACTION STATES. (6 marks)
o Active – the initial state; the transaction stays in this state while it is executing
o Partially committed – after the final statement has been executed.
o Failed -- after the discovery that normal execution can no longer proceed.
o Aborted – after the transaction has been rolled back and the database restored to its state
prior to the start of the transaction. Two options after it has been aborted:
restart the transaction (can be done only if no internal logical error occurred)
kill the transaction
o Committed – after successful completion.
2. EXPLAIN THE IMPLEMENTATION OF ATOMICITY AND DURABILITY (6
marks)
The recovery-management component of a database system implements the support for atomicity
and durability.
The shadow-database scheme:
assume that only one transaction is active at a time.
a pointer called db_pointer always points to the current consistent copy of the
database.
all updates are made on a shadow copy of the database, and db_pointer is made to
point to the updated shadow copy only after the transaction reaches partial commit
and all updated pages have been flushed to disk.
in case transaction fails, old consistent copy pointed to by db_pointer can be used,
and the shadow copy can be deleted.
o Assumes disks do not fail
o Useful for text editors, but
extremely inefficient for large databases (why?)
Does not handle concurrent transactions
o Will study better schemes in Chapter 17.
3. DISCUSS ABOUT SERIALIZABILITY (16 marks)
Basic Assumption – Each transaction preserves database consistency.
Thus serial execution of a set of transactions preserves database consistency.
A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of:
1. conflict serializability
2. view serializability
We ignore operations other than read and write instructions, and we assume that transactions may perform arbitrary computations on data in local buffers in between reads and writes. Our simplified schedules consist of only read and write instructions.
Conflicting Instructions
Instructions li and lj of transactions Ti and Tj respectively conflict if and only if there exists some item Q accessed by both li and lj, and at least one of these instructions wrote Q.
1. li = read(Q), lj = read(Q). li and lj don't conflict.
2. li = read(Q), lj = write(Q). They conflict.
3. li = write(Q), lj = read(Q). They conflict.
4. li = write(Q), lj = write(Q). They conflict.
Intuitively, a conflict between li and lj forces a (logical) temporal order between them.
If li and lj are consecutive in a schedule and they do not conflict, their results would
remain the same even if they had been interchanged in the schedule.
Conflict Serializability
If a schedule S can be transformed into a schedule S´ by a series of swaps of non-
conflicting instructions, we say that S and S´ are conflict equivalent.
We say that a schedule S is conflict serializable if it is conflict equivalent to a serial
schedule
Schedule 3 can be transformed into Schedule 6, a serial schedule where T2 follows T1,
by series of swaps of non-conflicting instructions.
Therefore Schedule 3 is conflict serializable.
Schedule 3 Schedule 6
Example of a schedule that is not conflict serializable: we are unable to swap instructions in the above schedule to obtain either the serial schedule < T3, T4 > or the serial schedule < T4, T3 >.
View Serializability
Let S and S´ be two schedules with the same set of transactions. S and S´ are view equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S´, also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and that value was produced by transaction Tj (if any), then transaction Ti must in schedule S´ also read the value of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S´.
As can be seen, view equivalence is also based purely on reads and writes alone.
A schedule S is view serializable if it is view equivalent to a serial schedule.
Every conflict serializable schedule is also view serializable.
Below is a schedule which is view-serializable but not conflict serializable.
What serial schedule is above equivalent to?
Every view serializable schedule that is not conflict serializable has blind writes.
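Conflict serializability can be tested mechanically by building a precedence graph (an edge Ti → Tj for each conflicting pair where Ti's instruction comes first) and checking it for cycles. A minimal Python sketch, with the schedule encoding as an assumption:

```python
# A schedule is a list of (transaction, action, item) steps.
def conflict_serializable(schedule):
    edges = set()
    for i, (ti, ai, qi) in enumerate(schedule):
        for tj, aj, qj in schedule[i + 1:]:
            # Two steps conflict if they touch the same item, come from
            # different transactions, and at least one is a write.
            if ti != tj and qi == qj and "write" in (ai, aj):
                edges.add((ti, tj))
    nodes = {t for t, _, _ in schedule}

    # Depth-first search for a cycle in the precedence graph.
    def cyclic(node, path):
        if node in path:
            return True
        return any(cyclic(m, path | {node}) for (n, m) in edges if n == node)

    return not any(cyclic(t, set()) for t in nodes)

# T1 fully before T2 on Q: serial, hence conflict serializable.
ok = conflict_serializable([("T1", "read", "Q"), ("T1", "write", "Q"),
                            ("T2", "read", "Q"), ("T2", "write", "Q")])
# Writes interleaved in both directions create a cycle T1 -> T2 -> T1.
bad = conflict_serializable([("T1", "read", "Q"), ("T2", "write", "Q"),
                             ("T1", "write", "Q")])
print(ok, bad)  # True False
```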
4. EXPLAIN LOCK-BASED PROTOCOLS (8 marks)
A lock is a mechanism to control concurrent access to a data item
Data items can be locked in two modes :
exclusive (X) mode. Data item can be both read as well as
written. X-lock is requested using lock-X instruction.
shared (S) mode. Data item can only be read. S- lock is
requested using lock-S instruction.
Lock requests are made to concurrency-control manager. Transaction can proceed only after request
is granted.
Lock-compatibility matrix:
     S      X
S    true   false
X    false  false
A transaction may be granted a lock on an item if the requested lock is compatible with locks already held on the item by other transactions.
Any number of transactions can hold shared locks on an item, but if any transaction holds an exclusive lock on the item, no other transaction may hold any lock on the item.
If a lock cannot be granted, the requesting transaction is made to wait till all incompatible locks held
by other transactions have been released. The lock is then granted.
Example of a transaction performing locking:
T2: lock-S(A);
    read(A);
    unlock(A);
    lock-S(B);
    read(B);
    unlock(B);
    display(A+B)
Locking as above is not sufficient to guarantee serializability — if A and B get updated in-between
the read of A and B, the displayed sum would be wrong.
A locking protocol is a set of rules followed by all transactions while requesting and releasing
locks. Locking protocols restrict the set of possible schedules.
Pitfalls of Lock-Based Protocols
Consider the partial schedule
Neither T3 nor T4 can make progress — executing lock-S(B) causes T4 to wait for T3 to release its
lock on B, while executing lock-X(A) causes T3 to wait for T4 to release its lock on A.
Such a situation is called a deadlock.
To handle a deadlock one of T3 or T4 must be rolled back and its locks released.
The potential for deadlock exists in most locking protocols. Deadlocks are a necessary evil.
Starvation is also possible if concurrency control manager is badly designed. For example:
A transaction may be waiting for an X-lock on an item, while a sequence of other transactions
request and are granted an S- lock on the same item.
The same transaction is repeatedly rolled back due to deadlocks.
Concurrency control manager can be designed to prevent starvation.
5. EXPLAIN THE CONCEPT OF TWO-PHASE LOCKING PROTOCOL
(8 marks)
This is a protocol which ensures conflict-serializable schedules.
Phase 1: Growing Phase
transaction may obtain locks
transaction may not release locks
Phase 2: Shrinking Phase
transaction may release locks
transaction may not obtain locks
The protocol assures serializability. It can be proved that the transactions can be serialized in
the order of their lock points (i.e. the point where a transaction acquired its final lock).
Two-phase locking does not ensure freedom from deadlocks
Cascading roll-back is possible under two-phase locking. To avoid this, follow a modified
protocol called strict two-phase locking. Here a transaction must hold all its exclusive locks
till it commits/aborts.
Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In this
protocol transactions can be serialized in the order in which they commit.
There can be conflict serializable schedules that cannot be obtained if two-phase locking is
used.
However, in the absence of extra information (e.g., ordering of access to data), two-phase
locking is needed for conflict serializability in the following sense:
Given a transaction Ti that does not follow two-phase locking, we can find a transaction Tj
that uses two-phase locking, and a schedule for Ti and Tj that is not conflict serializable.
Lock Conversions
Two-phase locking with lock conversions:
– First Phase:
can acquire a lock-S on item
can acquire a lock-X on item
can convert a lock-S to a lock-X (upgrade)
– Second Phase:
can release a lock-S
can release a lock-X
can convert a lock-X to a lock-S (downgrade)
This protocol assures serializability. But still relies on the programmer to insert the various locking
instructions.
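The growing/shrinking rule can be checked mechanically: once a transaction has released any lock, it may not acquire another. A small Python sketch (illustrative only; the operation strings are an assumed encoding):

```python
# ops is a transaction's sequence of "lock-S(A)", "lock-X(B)", "unlock(A)", ...
def is_two_phase(ops):
    shrinking = False
    for op in ops:
        if op.startswith("lock"):
            if shrinking:          # acquiring after a release violates 2PL
                return False
        elif op.startswith("unlock"):
            shrinking = True       # first release starts the shrinking phase
    return True

print(is_two_phase(["lock-S(A)", "lock-X(B)", "unlock(A)", "unlock(B)"]))  # True
print(is_two_phase(["lock-S(A)", "unlock(A)", "lock-S(B)", "unlock(B)"]))  # False
```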
6.EXPLAIN THE IMPLEMENTATION OF LOCKING (6 marks)
A lock manager can be implemented as a separate process to which transactions send lock
and unlock requests
The lock manager replies to a lock request by sending a lock grant messages (or a message
asking the transaction to roll back, in case of a deadlock)
The requesting transaction waits until its request is answered
The lock manager maintains a data-structure called a lock table to record granted locks and
pending requests
The lock table is usually implemented as an in-memory hash table indexed on the name of
the data item being locked
Black rectangles indicate granted locks, white ones indicate waiting requests
Lock table also records the type of lock granted or requested
New request is added to the end of the queue of requests for the data item, and granted if it is
compatible with all earlier locks
Unlock requests result in the request being deleted, and later requests are checked to see if
they can now be granted
If transaction aborts, all waiting or granted requests of the transaction are deleted
lock manager may keep a list of locks held by each transaction, to implement this efficiently
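The lock table described above can be sketched as a hash table of per-item queues. This is a simplified Python model (real lock managers also handle lock upgrades, deadlock detection, and per-transaction lock lists):

```python
from collections import defaultdict

class LockTable:
    def __init__(self):
        # item name -> queue of (transaction, mode, granted) entries
        self.table = defaultdict(list)

    @staticmethod
    def _compatible(m1, m2):
        return m1 == "S" and m2 == "S"   # only shared locks coexist

    def request(self, txn, item, mode):
        queue = self.table[item]
        # A new request is granted only if compatible with all earlier entries.
        granted = all(self._compatible(mode, m) for _, m, _ in queue)
        queue.append((txn, mode, granted))
        return granted

    def unlock(self, txn, item):
        queue = self.table[item]
        queue[:] = [e for e in queue if e[0] != txn]
        # Later requests are re-checked to see if they can now be granted.
        for i, (t, m, g) in enumerate(queue):
            if not g and all(self._compatible(m, m2) for _, m2, _ in queue[:i]):
                queue[i] = (t, m, True)

lt = LockTable()
print(lt.request("T1", "Q", "S"))  # True  (first lock, granted)
print(lt.request("T2", "Q", "S"))  # True  (shared locks are compatible)
print(lt.request("T3", "Q", "X"))  # False (must wait)
lt.unlock("T1", "Q")
lt.unlock("T2", "Q")
print(lt.table["Q"])               # [('T3', 'X', True)]: the wait is over
```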
7.EXPLAIN ABOUT LOG BASED RECOVERY (16 marks)
Log-Based Recovery
o A log is kept on stable storage.
The log is a sequence of log records, and maintains a record of update activities on
the database.
o When transaction Ti starts, it registers itself by writing a
<Ti start>log record
o Before Ti executes write(X), a log record <Ti, X, V1, V2> is written, where V1 is the value
of X before the write, and V2 is the value to be written to X.
o The log record notes that Ti has performed a write on data item Xj; Xj had value V1 before the write, and will have value V2 after the write.
o When Ti finishes its last statement, the log record <Ti commit> is written.
o We assume for now that log records are written directly to stable storage (that is, they are
not buffered)
o Two approaches using logs
Deferred database modification
Immediate database modification
Deferred Database Modification
o The deferred database modification scheme records all modifications to the log, but defers all
the writes to after partial commit.
o Assume that transactions execute serially
o Transaction starts by writing <Ti start> record to log.
o A write(X) operation results in a log record <Ti, X, V> being written, where V is the new
value for X
Note: old value is not needed for this scheme
o The write is not performed on X at this time, but is deferred.
o When Ti partially commits, <Ti commit> is written to the log
o Finally, the log records are read and used to actually execute the previously deferred writes.
o During recovery after a crash, a transaction needs to be redone if and only if both <Ti start>
and<Ti commit> are there in the log.
Redoing a transaction Ti ( redoTi) sets the value of all data items updated by the transaction to the new
values.
o Crashes can occur while
the transaction is executing the original updates, or
while recovery action is being taken
o example transactions T0 and T1 (T0 executes before T1):
o T0: read(A)            T1: read(C)
     A := A - 50             C := C - 100
     write(A)                write(C)
     read(B)
     B := B + 50
     write(B)
o Below we show the log as it appears at three instances of time.
If log on stable storage at time of crash is as in case:
(a) No redo actions need to be taken
(b) redo(T0) must be performed since <T0 commit> is present
(c) redo(T0) must be performed followed by redo(T1) since
<T0 commit> and <T1 commit> are present
Immediate Database Modification
o The immediate database modification scheme allows database updates of an uncommitted
transaction to be made as the writes are issued
since undoing may be needed, update logs must have both old value and new value
o Update log record must be written before database item is written
We assume that the log record is output directly to stable storage
Can be extended to postpone log record output, so long as prior to execution of an
output(B) operation for a data block B, all log records corresponding to items B must
be flushed to stable storage
o Output of updated blocks can take place at any time before or after transaction commit
o Order in which blocks are output can be different from the order in which they are written.
Immediate Database Modification Example
Log                      Write              Output
<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
                         A = 950
                         B = 2050
<T0 commit>
<T1 start>
<T1, C, 700, 600>
                         C = 600
                                            BB, BC
<T1 commit>
                                            BA
Note: BX denotes the block containing X.
Immediate DB Modification Recovery Example
Below we show the log as it appears at three instances of time.
Recovery actions in each case above are:
undo (T0): B is restored to 2000 and A to 1000.
undo (T1) and redo (T0): C is restored to 700, and then A and B are
set to 950 and 2050 respectively.
redo (T0) and redo (T1): A and B are set to 950 and 2050
respectively. Then C is set to 600
Checkpoints
Problems in recovery procedure as discussed earlier :
searching the entire log is time-consuming
we might unnecessarily redo transactions which have already
output their updates to the database.
Streamline recovery procedure by periodically performing checkpointing
Output all log records currently residing in main memory onto stable storage.
Output all modified buffer blocks to the disk.
Write a log record < checkpoint> onto stable storage.
Example of Checkpoints
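Recovery under deferred database modification (redo a transaction if and only if both its start and commit records are in the log) can be sketched as follows. The log-tuple encoding is an assumption, and the values follow the T0/T1 example above:

```python
# A log is a list of tuples: ("start", T), ("write", T, item, new_value),
# or ("commit", T). Deferred modification needs no old values.
def recover(log):
    db = {}
    committed = {t for op, t, *_ in log if op == "commit"}
    for op, t, *rest in log:
        if op == "write" and t in committed:
            item, new_value = rest
            db[item] = new_value      # redo: install the new value
    return db

# Log at the time of crash: T0 committed, T1 did not (its write is ignored).
log = [("start", "T0"), ("write", "T0", "A", 950), ("write", "T0", "B", 2050),
       ("commit", "T0"),
       ("start", "T1"), ("write", "T1", "C", 600)]
print(recover(log))  # {'A': 950, 'B': 2050}
```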
8. DESCRIBE DEADLOCK HANDLING (16 marks)
Neither of the transaction can ever proceed with its normal execution. This situation is called deadlock.
Consider the following two transactions:
T1: write (X) T2 : write(Y)
write(Y) write(X)
Schedule with deadlock
System is deadlocked if there is a set of transactions such that every transaction in the set is waiting
for another transaction in the set.
Deadlock prevention protocols ensure that the system will never enter into a deadlock state. Some
prevention strategies :
Require that each transaction locks all its data items before it begins execution
(predeclaration).
Impose partial ordering of all data items and require that a transaction can lock data
items only in the order specified by the partial order (graph-based protocol).
Deadlock Prevention Strategies
Following schemes use transaction timestamps for the sake of deadlock prevention alone.
wait-die scheme — non-preemptive
older transaction may wait for younger one to release data item. Younger transactions
never wait for older ones; they are rolled back instead.
a transaction may die several times before acquiring needed data item
wound-wait scheme — preemptive
older transaction wounds (forces rollback) of younger transaction instead of waiting
for it. Younger transactions may wait for older ones.
may be fewer rollbacks than wait-die scheme.
Both in wait-die and in wound-wait schemes, a rolled-back transaction is restarted with its original timestamp. Older transactions thus have precedence over newer ones, and starvation is hence avoided.
Timeout-Based Schemes:
a transaction waits for a lock only for a specified amount of time. After that, the wait times out and the transaction is rolled back.
thus deadlocks are not possible
simple to implement; but starvation is possible. Also difficult to determine a good value for the timeout interval.
Deadlock Detection
o Deadlocks can be described as a wait-for graph, which consists of a pair G = (V, E),
V is a set of vertices (all the transactions in the system)
E is a set of edges; each element is an ordered pair Ti → Tj.
o If Ti → Tj is in E, then there is a directed edge from Ti to Tj, implying that Ti is waiting for Tj to release a data item.
o When Ti requests a data item currently being held by Tj, then the edge Ti → Tj is inserted in the wait-for graph. This edge is removed only when Tj is no longer holding a data item needed by Ti.
o The system is in a deadlock state if and only if the wait-for graph has a cycle. Must invoke a deadlock-detection algorithm periodically to look for cycles.
Wait-for graph without a cycle    Wait-for graph with a cycle
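Deadlock detection on the wait-for graph thus reduces to cycle detection; a minimal Python sketch (the edge encoding is an assumption):

```python
# edges is a list of (waiter, holder) pairs: Ti waits for Tj.
def has_cycle(edges):
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)

    # Depth-first search; a node revisited on the current path closes a cycle.
    def visit(node, path):
        if node in path:
            return True
        return any(visit(m, path | {node}) for m in graph.get(node, []))

    return any(visit(n, set()) for n in graph)

# T3 waits for T4 and T4 waits for T3: a cycle, hence deadlock.
print(has_cycle([("T3", "T4"), ("T4", "T3")]))  # True
print(has_cycle([("T1", "T2"), ("T2", "T3")]))  # False
```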
Deadlock Recovery
o When deadlock is detected :
Some transaction will have to be rolled back (made a victim) to break the deadlock. Select as victim the transaction that will incur minimum cost.
Rollback -- determine how far to roll back transaction
Total rollback: Abort the transaction and then restart it.
More effective to roll back transaction only as far as necessary to break deadlock.
Starvation happens if same transaction is always chosen as victim. Include the
number of rollbacks in the cost factor to avoid starvation
UNIT V
IMPLEMENTATION TECHNIQUES PART B
1. EXPLAIN DIFFERENT TYPES OF PHYSICAL STORAGE MEDIA (8 marks)
Cache – fastest and most costly form of storage; volatile; managed by the computer system hardware.
Main memory:
fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds)
generally too small (or too expensive) to store the entire database
capacities of up to a few Gigabytes widely used currently
Capacities have gone up and per-byte costs have decreased steadily and rapidly (roughly factor of 2
every 2 to 3 years)
Volatile — contents of main memory are usually lost if a power failure or system crash occurs.
Flash memory
Data survives power failure
Data can be written at a location only once, but the location can be erased and written to again
o Can support only a limited number (10K – 1M) of write/erase cycles
o Erasing of memory has to be done to an entire bank of memory
Reads are roughly as fast as main memory
But writes are slow (few microseconds), erase is slower
Cost per unit of storage roughly similar to main memory
Widely used in embedded devices such as digital cameras
Is a type of EEPROM (Electrically Erasable Programmable Read-Only Memory)
Magnetic-disk
Data is stored on spinning disk, and read/written magnetically
Primary medium for the long-term storage of data; typically stores entire database.
Data must be moved from disk to main memory for access, and written back for storage
o Much slower access than main memory (more on this later)
direct-access – possible to read data on disk in any order, unlike magnetic tape
Capacities range up to roughly 400 GB currently
o Much larger capacity, and lower cost per byte, than main memory/flash memory
o Growing constantly and rapidly with technology improvements (factor of 2 to 3 every 2
years)
Survives power failures and system crashes
o disk failure can destroy data, but is rare
Optical storage
non-volatile, data is read optically from a spinning disk using a laser
CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms
Write-once, read-many (WORM) optical disks are used for archival storage (CD-R, DVD-R, DVD+R)
Multiple write versions also available (CD-RW, DVD-RW, DVD+RW, and DVD-RAM)
Reads and writes are slower than with magnetic disk
Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism for automatic loading/unloading of disks, are available for storing large volumes of data
Tape storage
non-volatile, used primarily for backup (to recover from disk failure), and for archival data
sequential-access – much slower than disk
very high capacity (40 to 300 GB tapes available)
tape can be removed from the drive
storage costs are much cheaper than disk, but drives are expensive
Tape jukeboxes are available for storing massive amounts of data: hundreds of terabytes
(1 terabyte = 10^12 bytes) up to even a petabyte (1 petabyte = 10^15 bytes)
Storage Hierarchy
primary storage: Fastest media but volatile (cache, main memory).
secondary storage: next level in hierarchy, non-volatile, moderately fast access time
also called on-line storage. E.g. flash memory, magnetic disks
tertiary storage: lowest level in hierarchy, non-volatile, slow access time; also called off-line storage
E.g. magnetic tape, optical storage
2.EXPLAIN ABOUT MAGNETIC DISKS MECHANISM (8 marks)
Data is stored on spinning disk, and read/written magnetically
Primary medium for the long-term storage of data; typically stores entire database.
Data must be moved from disk to main memory for access, and written back for storage
Much slower access than main memory (more on this later)
direct-access – possible to read data on disk in any order, unlike magnetic tape
Capacities range up to roughly 400 GB currently
Much larger capacity, and lower cost per byte, than main memory/flash memory
Growing constantly and rapidly with technology improvements (factor of 2 to 3 every
2 years)
Survives power failures and system crashes
o disk failure can destroy data, but is rare
Read-write head
Positioned very close to the platter surface (almost touching it)
Reads or writes magnetically encoded information.
Surface of platter divided into circular tracks
Over 50K-100K tracks per platter on typical hard disks
Each track is divided into sectors.
A sector is the smallest unit of data that can be read or written.
Sector size typically 512 bytes
Typical sectors per track: 500 (on inner tracks) to 1000 (on outer tracks)
To read/write a sector:
the disk arm swings to position the head on the right track
the platter spins continually; data is read/written as the sector passes under the head
Head-disk assemblies:
multiple disk platters on a single spindle (1 to 5 usually)
one head per platter, mounted on a common arm
Cylinder i consists of ith track of all the platters
Earlier generation disks were susceptible to head-crashes
The surface of earlier generation disks had metal-oxide coatings which would disintegrate on a head crash and damage all data on the disk
Current generation disks are less susceptible to such disastrous failures, although individual sectors may get corrupted
Disk controller
interfaces between the computer system and the disk drive hardware.
accepts high- level commands to read or write a sector
initiates actions such as moving the disk arm to the right track and actually reading or writing the
data
Computes and attaches checksums to each sector to verify that data is read back correctly
If data is corrupted, with very high probability the stored checksum won't match the recomputed checksum
Ensures successful writing by reading back sector after writing it
Performs remapping of bad sectors
3.EXPLAIN ABOUT RAID LEVELS. (16 marks)
RAID: Redundant Arrays of Independent Disks
disk organization techniques that manage a large number of disks, providing a view of a single disk of
o high capacity and high speed, by using multiple disks in parallel, and
o high reliability, by storing data redundantly, so that data can be recovered even if a disk fails
The chance that some disk out of a set of N disks will fail is much higher than the chance that a
specific single disk will fail. E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11 years),
will have a system MTTF of 100,000 / 100 = 1,000 hours (approx. 41 days)
Techniques for using redundancy to avoid data loss are critical with large numbers of disks
Originally a cost-effective alternative to large, expensive disks: the "I" in RAID originally stood for "inexpensive". Today RAIDs are used for their higher reliability and bandwidth,
and the "I" is interpreted as "independent"
Schemes to provide redundancy at lower cost by using disk striping combined with parity bits
Different RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics
RAID Level 0: Block striping; non-redundant.
Used in high-performance applications where data loss is not critical.
RAID Level 1: Mirrored disks with block striping
Offers best write performance.
Popular for applications such as storing log files in a database system.
RAID Level 2: Memory-Style Error-Correcting-Codes (ECC) with bit striping.
RAID Level 3: Bit-Interleaved Parity
a single parity bit is enough for error correction, not just detection, since we know which disk has failed
o When writing data, corresponding parity bits must also be computed and written to a parity
bit disk
o To recover data in a damaged disk, compute XOR of bits from other disks (including parity bit disk)
Faster data transfer than with a single disk, but fewer I/Os per second since every disk has to
participate in every I/O.
Subsumes Level 2 (provides all its benefits, at lower cost)
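The XOR recovery rule above can be demonstrated directly. This is an illustrative sketch (not from the notes), using bytes in place of bits: XOR-ing the surviving disks with the parity disk reproduces the failed disk's contents.

```python
# Sketch: rebuilding a failed disk in a parity-protected array.
# The parity disk holds the byte-wise XOR of all data disks, so
# XOR of the survivors plus parity recovers the missing disk.

def parity(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

disks = [b"\x0a\x0b", b"\x0c\x0d", b"\x0e\x0f"]   # three data disks
p = parity(disks)                                  # the parity disk

# suppose disk 1 fails: rebuild it from the other disks plus parity
rebuilt = parity([disks[0], disks[2], p])
assert rebuilt == disks[1]
```

The same property is why a single parity disk suffices: XOR is its own inverse, so any one missing operand can be reconstructed from the rest.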
RAID Level 4: Block-Interleaved Parity;
uses block-level striping, and keeps a parity block on a separate disk for corresponding blocks from
N other disks.
When writing data block, corresponding block of parity bits must also be computed and written to
parity disk
To find value of a damaged block, compute XOR of bits from corresponding blocks (including parity block)
from other disks.
Provides higher I/O rates for independent block reads than Level 3
block read goes to a single disk, so blocks stored on different disks can be read in parallel
Provides higher transfer rates for reads of multiple blocks than no striping
Before writing a block, parity data must be computed
Can be done by using old parity block, old value of current block and new value of current block (2
block reads + 2 block writes)
Or by recomputing the parity value using the new values of blocks corresponding to the parity block
More efficient for writing large amounts of data sequentially
Parity block becomes a bottleneck for independent block writes since every block write also writes
to parity disk
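The "2 block reads + 2 block writes" update described above follows from XOR algebra: the new parity equals the old parity XOR the old block XOR the new block. A small illustrative sketch (not from the notes):

```python
# Sketch: updating the parity for a single block write without reading
# all the other data disks:
#   new_parity = old_parity XOR old_block XOR new_block
# (2 reads: old block + old parity; 2 writes: new block + new parity)

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

old_block, other = b"\x11\x22", b"\x33\x44"
old_parity = xor(old_block, other)        # parity over both data blocks

new_block = b"\x55\x66"
new_parity = xor(xor(old_parity, old_block), new_block)

# same result as recomputing parity from scratch over all data blocks
assert new_parity == xor(new_block, other)
```

For large sequential writes the recompute-from-scratch alternative wins, because every data block is in memory anyway.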
RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.
E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
Higher I/O rates than Level 4.
o Block writes occur in parallel if the blocks and their parity blocks are on different disks.
o Subsumes Level 4: provides same benefits, but avoids bottleneck of parity disk.
RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores extra redundant information to guard against multiple disk failures.
Better reliability than Level 5 at a higher cost; not used as widely.
4.EXPLAIN ABOUT FILE ORGANIZATION (8 marks)
The database is stored as a collection of files. Each file is a sequence of records. A record is a sequence
of fields. One approach:
assume record size is fixed
each file has records of one particular type only
different files are used for different relations
This case is easiest to implement; will consider variable length records later.
Fixed-Length Records
Simple approach:
o Store record i starting from byte n × (i – 1), where n is the size of each record.
o Record access is simple, but records may cross blocks
o Modification: do not allow records to cross block boundaries
Deletion of record i:
alternatives:
move records i + 1, . . ., n to i, . . . , n – 1
move record n to i
do not move records, but link all free records on a free list
Free Lists
Store the address of the first deleted record in the file header.
Use this first record to store the address of the second deleted record, and so on
Can think of these stored addresses as pointers since they "point" to the location of a record.
More space efficient representation: reuse space for normal attributes of free records to store pointers. (No pointers stored in in-use records.)
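The offset formula and the free-list scheme above can be sketched together. This is an assumed, simplified layout (not from the notes): a Python list stands in for the file, and a freed slot reuses its own space to store the pointer to the next free slot.

```python
# Sketch: fixed-length records at byte offset n*(i-1), with deleted
# slots chained on a free list whose head lives in the file header.

RECORD_SIZE = 8                 # n: bytes per record (assumed)

def offset(i):
    """Byte offset of record i (1-based), per the formula above."""
    return RECORD_SIZE * (i - 1)

class FixedFile:
    def __init__(self):
        self.slots = []          # record data, or next-free pointer
        self.free_head = None    # "file header": first free slot index

    def insert(self, rec):
        if self.free_head is not None:            # reuse a freed slot
            i, self.free_head = self.free_head, self.slots[self.free_head]
            self.slots[i] = rec
        else:
            self.slots.append(rec)
            i = len(self.slots) - 1
        return i

    def delete(self, i):
        # the freed slot's space stores the pointer to the next free slot
        self.slots[i] = self.free_head
        self.free_head = i

f = FixedFile()
a = f.insert("r0"); b = f.insert("r1"); c = f.insert("r2")
f.delete(b)
assert f.insert("r3") == b      # the freed slot is reused first
assert offset(3) == 16          # third record starts at byte 8 * 2
```

Note how no pointer space is reserved in in-use slots, matching the space-efficient representation described above.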
Variable-Length Records
Variable-length records arise in database systems in several ways:
Storage of multiple record types in a file.
Record types that allow variable lengths for one or more fields
Record types that allow repeating fields (used in some older data models).
Variable-Length Records: Slotted Page Structure
Slotted page header contains:
o number of record entries
o end of free space in the block
o location and size of each record
Records can be moved around within a page to keep them contiguous with no empty space between
them; entry in the header must be updated.
Pointers should not point directly to record — instead they should point to the entry for the record in
header.
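The slotted-page layout above can be sketched as follows. This is an assumed, simplified structure (not from the notes): records grow downward from the end of the block, the header keeps the free-space pointer and one (location, size) entry per record, and callers hold slot numbers rather than byte addresses.

```python
# Sketch: a slotted page. Records are addressed through header entries
# (slot numbers), so they can be moved within the page without
# invalidating external pointers.

BLOCK_SIZE = 128                     # assumed block size

class SlottedPage:
    def __init__(self):
        self.slots = []              # header: (offset, size) per record
        self.free_end = BLOCK_SIZE   # header: end of free space
        self.data = bytearray(BLOCK_SIZE)

    def insert(self, rec: bytes):
        self.free_end -= len(rec)    # records grow from the block's end
        self.data[self.free_end:self.free_end + len(rec)] = rec
        self.slots.append((self.free_end, len(rec)))
        return len(self.slots) - 1   # external pointers use this slot id

    def read(self, slot):
        off, size = self.slots[slot]
        return bytes(self.data[off:off + size])

p = SlottedPage()
s = p.insert(b"alice")
p.insert(b"bob")
assert p.read(s) == b"alice"
```

Compaction after a deletion would only rewrite the (offset, size) entries; slot numbers, and thus external pointers, stay valid.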
5.EXPLAIN ABOUT ORGANIZATION OF RECORDS IN FILES (8 marks)
Heap – a record can be placed anywhere in the file where there is space
Sequential – store records in sequential order, based on the value of the search key of each record
Hashing – a hash function computed on some attribute of each record; the result specifies in which
block of the file the record should be placed
Records of each relation may be stored in a separate file. In a multitable clustering file organization,
records of several different relations can be stored in the same file
Motivation: store related records on the same block to minimize I/O
Sequential File Organization
Suitable for applications that require sequential processing of the entire file
The records in the file are ordered by a search-key
Deletion – use pointer chains
Insertion –locate the position where the record is to be inserted
if there is free space insert there
if no free space, insert the record in an overflow block
In either case, pointer chain must be updated
Need to reorganize the file from time to time to restore sequential order
Multitable Clustering File Organization
Store several relations in one file using a multitable clustering file organization
Multitable clustering organization of customer and depositor:
good for queries involving the join of depositor and customer, and for queries involving one single customer and his accounts
bad for queries involving only customer
results in variable size records
Can add pointer chains to link records of a particular relation
6.DISCUSS ABOUT DIFFERENT TYPES OF INDICES WITH INSERTION AND DELETION. (16 marks)
In an ordered index, index entries are stored sorted on the search key value. E.g., author catalog in library.
Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
Also called clustering index
The search key of a primary index is usually but not necessarily the primary key.
Secondary index: an index whose search key specifies an order different from the sequential order
of the file. Also called non-clustering index.
Index-sequential file: ordered sequential file with a primary index
Dense index — Index record appears for every search-key value in the file.
Sparse Index: contains index records for only some search-key values.
o Applicable when records are sequentially ordered on the search-key
o To locate a record with search-key value K:
Find the index record with the largest search-key value ≤ K
Search the file sequentially starting at the record to which the index record points
Compared to dense indices:
Less space and less maintenance overhead for insertions and deletions.
Generally slower than dense index for locating records.
Good tradeoff: sparse index with an index entry for every block in file, corresponding to
least search-key value in the block.
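The sparse-index lookup just described can be sketched directly, using the good tradeoff above of one index entry per block. This is an illustrative sketch with made-up data (the branch names are the ones used elsewhere in these notes):

```python
# Sketch: locating a record via a sparse index with one entry per block.
# Rule: find the largest index key <= K, then scan that block
# sequentially.

import bisect

# one (search-key, block number) entry per block, sorted by key
index_keys   = ["Brighton", "Mianus", "Redwood"]
index_blocks = [0, 1, 2]

blocks = [["Brighton", "Downtown"],
          ["Mianus", "Perryridge"],
          ["Redwood", "Round Hill"]]

def lookup(key):
    i = bisect.bisect_right(index_keys, key) - 1   # largest key <= K
    if i < 0:
        return None                                # before the first block
    for rec in blocks[index_blocks[i]]:            # sequential scan
        if rec == key:
            return rec
    return None

assert lookup("Perryridge") == "Perryridge"   # found via block 1's entry
assert lookup("Aardvark") is None
```

A dense index would replace the scan with a direct hit, at the cost of one index entry per record instead of per block.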
Multilevel index
If primary index does not fit in memory, access becomes expensive.
Solution: treat primary index kept on disk as a sequential file and construct a sparse index on it.
outer index – a sparse index of primary index
inner index – the primary index file
If even outer index is too large to fit in main memory, yet another level of index can be created, and so on.
Indices at all levels must be updated on insertion or deletion from the file.
Deletion
If deleted record was the only record in the file with its particular search-key value, the
search-key is deleted from the index also.
Single-level index deletion:
Dense indices – deletion of search-key: similar to file record deletion.
Sparse indices –
if an entry for the search key exists in the index, it is deleted by replacing the entry in the
index with the next search-key value in the file (in search-key order).
If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
Insertion
Single-level index insertion:
Perform a lookup using the search-key value appearing in the record to be inserted.
Dense indices – if the search-key value does not appear in the index, insert it.
Sparse indices – if index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created.
If a new block is created, the first search-key value appearing in the new block is inserted into the index.
Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms
7.EXPLAIN B+-TREE INDEX FILES(16 marks)
B+-tree indices are an alternative to indexed-sequential files.
Disadvantage of indexed-sequential files
performance degrades as file grows, since many overflow blocks get created.
Periodic reorganization of entire file is required.
Advantage of B+-tree index files:
automatically reorganizes itself with small, local, changes, in the face of insertions and deletions.
Reorganization of entire file is not required to maintain performance.
(Minor) disadvantage of B+-trees:
extra insertion and deletion overhead, space overhead.
Advantages of B+-trees outweigh disadvantages
B+-trees are used extensively
A B+-tree is a rooted tree satisfying the following properties:
All paths from root to leaf are of the same length
Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.
A leaf node has between ⌈(n–1)/2⌉ and n–1 values
Special cases:
If the root is not a leaf, it has at least 2 children.
If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values.
B+-Tree Node Structure
Typical node
Ki are the search-key values
Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf
nodes).
The search-keys in a node are ordered K1 < K2 < K3 < . . . < Kn–1
Leaf Nodes in B+-Trees
Properties of a leaf node:
For i = 1, 2, . . ., n–1, pointer Pi either points to a file record with search-key value Ki, or to a
bucket of pointers to file records, each record having search-key value Ki.
Only need the bucket structure if the search-key does not form a primary key.
If Li, Lj are leaf nodes and i < j, Li's search-key values are less than Lj's search-key values
Pn points to next leaf node in search-key order
Non-Leaf Nodes in B+-Trees
Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers:
All the search-keys in the subtree to which P1 points are less than K1
For 2 ≤ i ≤ n – 1, all the search-keys in the subtree to which Pi points have values
greater than or equal to Ki–1 and less than Ki
All the search-keys in the subtree to which Pn points have values greater than or equal
to Kn–1
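The key-range rules above determine how a search descends the tree. The following is an illustrative sketch (an assumed node layout, not from the notes) of B+-tree search over a tiny two-level tree:

```python
# Sketch: B+-tree search. Internal nodes hold keys and child pointers
# per the range rules above; leaves hold keys with record values.

class Node:
    def __init__(self, keys, children=None, values=None, leaf=False):
        self.keys = keys
        self.children = children    # child pointers (internal nodes)
        self.values = values        # record values (leaf nodes)
        self.leaf = leaf

def search(node, k):
    while not node.leaf:
        # follow Pi such that k lies in [K(i-1), Ki)
        i = 0
        while i < len(node.keys) and k >= node.keys[i]:
            i += 1
        node = node.children[i]
    for key, val in zip(node.keys, node.values):
        if key == k:
            return val
    return None

l1 = Node(["Brighton", "Downtown"], values=[100, 200], leaf=True)
l2 = Node(["Mianus", "Redwood"],   values=[300, 400], leaf=True)
root = Node(["Mianus"], children=[l1, l2])

assert search(root, "Redwood") == 400    # "Redwood" >= "Mianus": right
assert search(root, "Downtown") == 200   # "Downtown" < "Mianus": left
```

Because every path from root to leaf has the same length, the number of node accesses is just the height of the tree.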
Example of a B+-tree
B+-tree for account file (n = 5)
Leaf nodes must have between 2 and 4 values
(⌈(n–1)/2⌉ and n–1, with n = 5).
Non-leaf nodes other than root must have between 3 and 5 children (⌈n/2⌉ and n, with n = 5).
Root must have at least 2 children.
Since the inter-node connections are done by pointers, "logically" close blocks need not be
"physically" close.
The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
The B+-tree contains a relatively small number of levels
Level below root has at least 2⌈n/2⌉ values
Next level has at least 2⌈n/2⌉ × ⌈n/2⌉ values
.. etc.
o If there are K search-key values in the file, the tree height is no more than ⌈log⌈n/2⌉(K)⌉
o thus searches can be conducted efficiently.
Insertions and deletions to the main file can be handled efficiently, as the index can be restructured
in logarithmic time
Advantages of B-Tree indices:
o May use fewer tree nodes than a corresponding B+-Tree.
o Sometimes it is possible to find a search-key value before reaching a leaf node.
Disadvantages of B-Tree indices:
o Only a small fraction of all search-key values are found early
o Non-leaf nodes are larger, so fan-out is reduced. Thus, B-Trees typically have greater depth
than the corresponding B+-Tree
o Insertion and deletion are more complicated than in B+-Trees
o Implementation is harder than B+-Trees.
Typically, the advantages of B-Trees do not outweigh the disadvantages.
8.EXPLAIN ABOUT INSERTION AND DELETION IN B+ TREE (16 marks)
Insertion
Since the inter-node connections are done by pointers, "logically" close blocks need not be "physically" close.
The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
The B+-tree contains a relatively small number of levels
Level below root has at least 2⌈n/2⌉ values
Next level has at least 2⌈n/2⌉ × ⌈n/2⌉ values etc.
If there are K search-key values in the file, the tree height is no more than ⌈log⌈n/2⌉(K)⌉
thus searches can be conducted efficiently.
Insertions and deletions to the main file can be handled efficiently, as the index can be restructured
in logarithmic time (as we shall see).
Splitting a leaf node:
take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place
the first ⌈n/2⌉ in the original node, and the rest in a new node.
let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split.
If the parent is full, split it and propagate the split further up.
Splitting of nodes proceeds upwards till a node that is not full is found.
In the worst case the root node may be split increasing the height of the tree by 1.
Splitting a non-leaf node: when inserting (k, p) into an already full internal node N
Copy N to an in-memory area M with space for n+1 pointers and n keys
Insert (k, p) into M
Copy P1, K1, …, K⌈n/2⌉–1, P⌈n/2⌉ from M back into node N
Copy P⌈n/2⌉+1, K⌈n/2⌉+1, …, Kn, Pn+1 from M into the newly allocated node N′
Insert (K⌈n/2⌉, N′) into the parent of N
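The leaf-split arithmetic described above can be sketched concretely. This is an illustrative sketch (not from the notes), with n = 4 so that a leaf holds at most n – 1 = 3 key/pointer pairs:

```python
# Sketch: splitting a full B+-tree leaf on insertion. The first
# ceil(n/2) pairs stay in the original node; the rest move to a new
# node, whose least key is inserted into the parent.

import math

def split_leaf(pairs, new_pair, n):
    """pairs: sorted (key, ptr) list that is already full (n-1 entries)."""
    all_pairs = sorted(pairs + [new_pair])
    keep = math.ceil(n / 2)
    left, right = all_pairs[:keep], all_pairs[keep:]
    # (least key of the new node, pointer to it) goes into the parent
    return left, right, right[0][0]

left, right, up_key = split_leaf(
    [("Brighton", 1), ("Downtown", 2), ("Mianus", 3)],
    ("Clearview", 4), n=4)

assert [k for k, _ in left]  == ["Brighton", "Clearview"]
assert [k for k, _ in right] == ["Downtown", "Mianus"]
assert up_key == "Downtown"          # inserted into the parent node
```

If the parent is itself full, the same split logic propagates upward, possibly growing the tree by one level at the root.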
Deletion
o Find the record to be deleted, and remove it from the main file and from the bucket (if present)
o Remove (search-key value, pointer) from the leaf node if there is no bucket or if the bucket has become empty
o If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then merge siblings:
Insert all the search-key values in the two nodes into a single node (the one on the
left), and delete the other node.
Delete the pair (Ki–1, Pi), where Pi is the pointer to the deleted node, from its parent, recursively using the above procedure.
o Otherwise, if the node has too few entries due to the removal, but the entries in the node and a sibling do not fit into a single node, then redistribute pointers :
o Redistribute the pointers between the node and a sibling such that both have more than the minimum number of entries.
o Update the corresponding search-key value in the parent of the node.
o The node deletions may cascade upwards till a node which has ⌈n/2⌉ or more pointers is found.
o If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.
Deleting "Downtown" causes merging of under-full leaves
a leaf node can become empty only for n=3!
The leaf with "Perryridge" becomes underfull (actually empty, in this special case) and is merged with its
sibling. As a result, the "Perryridge" node's parent became underfull, and was merged with its sibling
The value separating the two nodes (at the parent) moves into the merged node; the entry is deleted from the parent
The root node then has only one child, and is deleted
The parent of the leaf containing Perryridge became underfull, and borrowed a pointer from its left sibling. The search-key value in the parent's parent changes as a result
9. EXPLAIN ABOUT STATIC HASHING (8 marks)
A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
In a hash file organization we obtain the bucket of a record directly from its search-key value using
a hash function.
Hash function h is a function from the set of all search-key values K to the set of all bucket addresses
B.
Hash function is used to locate records for access, insertion as well as deletion.
Records with different search-key values may be mapped to the same bucket; thus entire bucket has to be searched sequentially to locate a record.
Example of Hash File Organization
There are 10 buckets.
The binary representation of the ith character is assumed to be the integer i.
The hash function returns the sum of the binary representations of the characters modulo 10.
E.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
Hash Functions
o Worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.
o An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.
Ideal hash function is random, so each bucket will have the same number of records assigned to it
irrespective of the actual distribution of search-key values in the file.
Typical hash functions perform computation on the internal binary representation of the search-key.
o For example, for a string search-key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets could be returned.
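A hash of the kind just described can be sketched in a few lines. This is an illustrative sketch: the notes' example assigns the ith letter the value i, whereas this sketch uses Python's ord(), so the resulting bucket numbers differ from the h(Perryridge) = 5 example above.

```python
# Sketch: a typical string hash for a hash file organization - sum the
# character codes and take the result modulo the number of buckets.

N_BUCKETS = 10

def h(key: str) -> int:
    return sum(ord(c) for c in key) % N_BUCKETS

# place some records into buckets by hashing their search key
buckets = [[] for _ in range(N_BUCKETS)]
for name in ["Perryridge", "Round Hill", "Brighton"]:
    buckets[h(name)].append(name)

# a lookup rehashes the key, then scans that bucket sequentially
assert all(name in buckets[h(name)]
           for name in ["Perryridge", "Round Hill", "Brighton"])
```

Note the sequential scan of the target bucket: different keys can land in the same bucket, which is exactly the collision case the notes mention.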
Handling of Bucket Overflows
o Bucket overflow can occur because of:
insufficient buckets
skew in distribution of records. This can occur due to two reasons:
multiple records have the same search-key value
the chosen hash function produces a non-uniform distribution of key values
o Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is
handled by using overflow buckets.
o Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list.
Above scheme is called closed hashing. An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.
Hash Indices
o Hashing can be used not only for file organization, but also for index-structure creation.
o A hash index organizes the search keys, with their associated record pointers, into a hash file structure.
o Strictly speaking, hash indices are always secondary indices
o if the file itself is organized using hashing, a separate primary hash index on it using the same search-key is unnecessary.
o However, we use the term hash index to refer to both secondary index structures and hash organized files.
Example of Hash Index
Deficiencies of Static Hashing
o In static hashing, function h maps search-key values to a fixed set B of bucket addresses.
Databases grow or shrink with time.
If the initial number of buckets is too small, and the file grows, performance will degrade due to too many overflows.
If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).
If the database shrinks, again space will be wasted.
o One solution: periodic re-organization of the file with a new hash function
Expensive, disrupts normal operations
o Better solution: allow the number of buckets to be modified dynamically.