
UNIT – II

Database Design: Functional Dependencies, Normal forms based on Primary Keys, Second Normal Forms, Third Normal Forms, Multivalued Dependencies and Fourth Normal Form, Join Dependencies and Fifth Normal Form.

Query Processing and Optimization: Algorithms for External Sorting, SELECT and JOIN Operations, PROJECT and SET Operations, Aggregate Operations and OUTER JOINS, Using Heuristics in Query Optimization, Using Selectivity and Cost Estimates in Query Optimization.

Functional Dependencies

1. Definition of Functional Dependency

An attribute B of a table is said to be functionally dependent on an attribute A when each value of A uniquely determines the value of B in the same table.

For example, suppose we have a student table with attributes Stu_Id, Stu_Name, and Stu_Age. Here the Stu_Id attribute uniquely identifies the Stu_Name attribute, because if we know the student id we can tell the student name associated with it. This is a functional dependency and is written Stu_Id -> Stu_Name; in words, Stu_Name is functionally dependent on Stu_Id.

Formally:

If column A of a table uniquely identifies column B of the same table, this is represented as A -> B (attribute B is functionally dependent on attribute A).

Example:

Let us consider the functional dependency that exactly one person works on a given machine each day, which is written as:

FD: {MACHINE-NO, DATE-USED} -> {PERSON-ID}

This means that once the values of MACHINE-NO and DATE-USED are known, a unique value of PERSON-ID is determined. Fig.(a) shows the functional dependency diagram (FDD) for this example.
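To make the definition concrete, a dependency X -> Y can be checked mechanically against a relation instance: the FD fails exactly when two rows agree on X but differ on Y. A minimal Python sketch (the fd_holds helper and the sample rows are illustrative, not from the notes):

def fd_holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds in rows (a list of dicts):
    no two rows may agree on all lhs attributes yet differ on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in seen and seen[key] != val:
            return False
        seen[key] = val
    return True

students = [
    {"Stu_Id": 1, "Stu_Name": "Ravi", "Stu_Age": 20},
    {"Stu_Id": 2, "Stu_Name": "Asha", "Stu_Age": 21},
]
print(fd_holds(students, ["Stu_Id"], ["Stu_Name"]))  # True: Stu_Id -> Stu_Name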

2. Inference Rules for Functional Dependencies

The set of all functional dependencies implied by a given set of functional dependencies X is called the closure of X, written X+. A set of inference rules is needed to compute X+ from X.

Armstrong’s axioms

1. Reflexivity: If B is a subset of A, then A → B. This also implies that A → A always holds.

2. Augmentation: If A → B, then AC → BC.

3. Transitivity: If A → B and B → C, then A → C.

4. Additivity or Union: If A → B and A → C, then A → BC.

5. Projectivity or Decomposition: If A → BC, then A → B and A → C.

6. Pseudotransitivity: If A → B and CB → D, then AC → D.

3. Equivalence of sets of Functional Dependencies

Definition: Two sets of FDs F and G are equivalent if:

· Every FD in F can be inferred from G, and

· Every FD in G can be inferred from F

· Hence, F and G are equivalent if F+ = G+

There is an algorithm for checking the equivalence of two sets of FDs, based on computing attribute closures; a sketch follows.
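The check rests on the attribute-closure algorithm: F and G are equivalent iff every FD in G follows from F and vice versa, which is tested by computing the closure of each left-hand side. A minimal Python sketch, assuming FDs are represented as (lhs, rhs) pairs of attribute sets:

def attr_closure(attrs, fds):
    """Closure of a set of attributes under FDs given as (lhs, rhs) set pairs."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def implies(fds, fd):
    """An FD lhs -> rhs is implied by fds iff rhs is inside closure(lhs)."""
    lhs, rhs = fd
    return set(rhs) <= attr_closure(lhs, fds)

def equivalent(f, g):
    """F and G are equivalent iff each FD of one is implied by the other."""
    return all(implies(g, fd) for fd in f) and all(implies(f, fd) for fd in g)

F = [({"A"}, {"B"}), ({"B"}, {"C"})]
G = [({"A"}, {"B", "C"}), ({"B"}, {"C"})]
print(equivalent(F, G))  # True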

4. Minimal sets of Functional Dependencies

A set of FDs F is minimal if every FD in F has a single attribute on its right-hand side, no attribute can be removed from the left-hand side of any FD in F without changing F+, and no FD can be removed from F without changing F+.

Normal forms based on Primary Keys

http://www.studytonight.com/dbms/database-normalization.php

https://www.slideshare.net/raiuniversity/bsc-cs-iidbmsuivnormalization

Normalization of Database

Database normalization is a technique for organizing the data in a database. Normalization is a systematic approach to decomposing tables to eliminate data redundancy and undesirable characteristics such as insertion, update, and deletion anomalies. It is a multi-step process that puts data into tabular form by removing duplicated data from the relation tables.

Normalization is used mainly for two purposes:

1. Eliminating redundant (useless) data.

2. Ensuring data dependencies make sense, i.e., data is logically stored.

Problem without Normalization

Without normalization, it becomes difficult to handle and update the database without facing data loss. Insertion, update, and deletion anomalies are very frequent if the database is not normalized. To understand these anomalies, let us take the example of a Student table.

S_id | S_Name | S_Address | Subject_opted
-----|--------|-----------|--------------
401  | Adam   | Noida     | Bio
402  | Alex   | Panipat   | Maths
403  | Stuart | Jammu     | Maths
404  | Adam   | Noida     | Physics

Update Anomaly: To update the address of a student who occurs two or more times in the table, we must update the S_Address column in all those rows; otherwise the data will become inconsistent.

Insertion Anomaly: Suppose, for a new admission, we have the student id (S_id), name, and address of a student, but the student has not opted for any subject yet. Then we must insert NULL in Subject_opted, leading to an insertion anomaly.

Deletion Anomaly: If student 401 has only one subject and temporarily drops it, then when we delete that row, the entire student record is deleted along with it.

Normalization Rule

Normalization rules are divided into the following normal forms:

1. First Normal Form

2. Second Normal Form

3. Third Normal Form

4. Boyce-Codd Normal Form (BCNF)

1. First Normal Form (1NF)

As per First Normal Form, no row of data may contain a repeating group of information: every column must hold a single atomic value, so that multiple values never have to be packed into one column to describe a row. Each table should be organized into rows, and each row should have a primary key that distinguishes it as unique.

The primary key is usually a single column, but sometimes more than one column can be combined to create a single primary key. For example, consider a table which is not in First Normal Form.

Student Table:

Student | Age | Subject
--------|-----|---------------
Adam    | 15  | Biology, Maths
Alex    | 14  | Maths
Stuart  | 17  | Maths

In First Normal Form, no row may have a column in which more than one value is stored, for example separated by commas. Instead, we must separate such data into multiple rows.

Student Table following 1NF will be :

Student | Age | Subject
--------|-----|--------
Adam    | 15  | Biology
Adam    | 15  | Maths
Alex    | 14  | Maths
Stuart  | 17  | Maths

With First Normal Form, data redundancy increases: there will be many columns with the same data in multiple rows, but each row as a whole will be unique. A small sketch of this row-splitting step follows.
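As an illustration, the comma-separated Subject column of the unnormalized table can be flattened mechanically. A minimal Python sketch (illustrative only) that produces the 1NF rows shown above:

rows = [
    ("Adam", 15, "Biology, Maths"),
    ("Alex", 14, "Maths"),
    ("Stuart", 17, "Maths"),
]

# Emit one output row per subject: this is the 1NF version of the table.
normalized = [
    (student, age, subject.strip())
    for student, age, subjects in rows
    for subject in subjects.split(",")
]
for row in normalized:
    print(row)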

2. Second Normal Form (2NF)

As per the Second Normal Form, there must not be any partial dependency of any column on the primary key. For a table with a concatenated (composite) primary key, each column that is not part of the key must depend on the entire concatenated key for its existence. If any column depends on only one part of the concatenated key, the table fails Second Normal Form.

In the First Normal Form example there are two rows for Adam, to include the multiple subjects he has opted for. While this is searchable and follows First Normal Form, it is an inefficient use of space. Also, in the 1NF table above, while the candidate key is {Student, Subject}, the Age of a student depends only on the Student column, which violates Second Normal Form. To achieve Second Normal Form, we split the subjects out into an independent table and match them up using the student name as a foreign key.

New Student Table following 2NF will be:

Student | Age
--------|----
Adam    | 15
Alex    | 14
Stuart  | 17

In the Student table the candidate key is the Student column, because every other column (i.e., Age) is dependent on it.

New Subject Table introduced for 2NF will be :

Student | Subject
--------|--------
Adam    | Biology
Adam    | Maths
Alex    | Maths
Stuart  | Maths

In the Subject table the candidate key is the {Student, Subject} combination. Now both of the above tables qualify for Second Normal Form and will never suffer from update anomalies. There are a few complex cases in which a table in Second Normal Form still suffers update anomalies; Third Normal Form exists to handle those scenarios.

3. Third Normal Form (3NF)

Third Normal Form requires that every non-prime attribute of a table depend on the primary key directly: there must be no case where a non-prime attribute is determined by another non-prime attribute. Such a transitive functional dependency should be removed from the table, and the table must also be in Second Normal Form. For example, consider a table with the following fields.

Student_Detail Table :

Student_id | Student_name | DOB | Street | City | State | Zip

In this table Student_id is the primary key, but Street, City, and State depend upon Zip. The dependency between Zip and the other fields is called a transitive dependency. Hence, to achieve 3NF, we move Street, City, and State to a new table, with Zip as its primary key.

New Student_Detail Table :

Student_id | Student_name | DOB | Zip

Address Table:

Zip | Street | City | State

The advantages of removing transitive dependencies are:

a) The amount of data duplication is reduced.

b) Data integrity is achieved.

Fourth Normal Form (4NF)

For a table to be in the fourth normal form,

· It should meet all the requirements of 3NF.

· It should have no multivalued dependencies: a value of one attribute must not independently determine multiple values of another attribute across rows of the same table.

To understand it clearly, consider a COURSE table with a Subject, the Lecturers who teach the subject, and the recommended Books for the subject. From the example described below, the table contains rows such as:

SUBJECT     | LECTURER | BOOKS
------------|----------|------------
Mathematics | Alex     | Maths Book1
Mathematics | Alex     | Maths Book2
Mathematics | Bosco    | Maths Book1
Mathematics | Bosco    | Maths Book2

This table satisfies 3NF. But LECTURER and BOOKS are two independent entities here; there is no relationship between lecturers and books. Either Alex or Bosco can teach Mathematics, and for the Mathematics subject a student can refer to either 'Maths Book1' or 'Maths Book2'. That is:

SUBJECT ->> LECTURER

SUBJECT ->> BOOKS

These are multivalued dependencies on SUBJECT. If we select both the lecturer and the books recommended for a subject, the query returns every (lecturer, books) combination, which implies that a particular lecturer recommends a particular book. This is not correct:

SELECT c.LECTURER, c.BOOKS FROM COURSE c WHERE SUBJECT = 'Mathematics';

To eliminate this dependency, we divide the table into two, (SUBJECT, LECTURER) and (SUBJECT, BOOKS), as below:

SUBJECT     | LECTURER
------------|---------
Mathematics | Alex
Mathematics | Bosco

SUBJECT     | BOOKS
------------|------------
Mathematics | Maths Book1
Mathematics | Maths Book2

Now if we want to know the lecturer names and the books recommended for a subject, we fire two independent queries. This removes the multivalued dependency and the confusion around the data; the tables are now in 4NF.

--Select the lecturer names

SELECT c.SUBJECT , c.LECTURER FROM COURSE c WHERE c.SUBJECT = 'Mathematics';

--Select the recommended book names

SELECT c.SUBJECT , c.BOOKS FROM COURSE c WHERE c.SUBJECT = 'Mathematics';

Fifth Normal Form (5NF)

A database is said to be in 5NF, if and only if,

· It is in 4NF.

· It cannot be decomposed further without anomaly: if we decompose a table to eliminate redundancy, then re-joining the decomposed tables by means of their candidate keys must reproduce the original data exactly. In simple words, joining two or more decomposed tables should neither lose records nor create spurious new records.

Consider an example of different subjects taught by different lecturers, with the lecturers taking classes for different semesters.

Note: Please consider that Semester 1 has Mathematics, Physics, and Chemistry, while Semester 2 has only Mathematics in its academic year.

In this table, Rose takes both the Mathematics and Physics classes for Semester 1, but she does not take the Physics class for Semester 2. In this case, the combination of all three fields is required to identify valid data. Imagine we want to add a new class, Semester 3, but do not yet know the subject or who will be teaching it. We would simply insert a new entry with Class as Semester 3, leaving Lecturer and Subject as NULL. As discussed above, it is not good practice to have such entries. Moreover, since all three columns together act as the primary key, we cannot leave the other two columns blank.

Hence we have to decompose the table in such a way that it satisfies all the rules up to 4NF, and such that joining the decomposed tables by their keys yields the correct records. Here we can represent each lecturer's subject area and classes in a better way by dividing the table into three: (SUBJECT, LECTURER), (LECTURER, CLASS), (SUBJECT, CLASS).

Now each of the combinations is in a different table. If we need to identify who is teaching which subject to which semester, we join the keys of the tables and read off the result.

For example, to find who teaches Physics to Semester 1, we select Physics and Semester 1 from table 3, join with table 1 on Subject to narrow down the lecturer names, and then join with table 2 on Lecturer to get the correct lecturer name. That is, we join the key columns of each table to get the correct data. Hence there is no lost or spurious data, satisfying the 5NF condition.

SELECT t3.Class, t3.Subject, t1.Lecturer

FROM TABLE3 t3, TABLE2 t2, TABLE1 t1

WHERE t3.Class = 'SEMESTER1' AND t3.Subject = 'PHYSICS'

AND t3.Subject = t1.Subject

AND t3.Class = t2.Class

AND t1.Lecturer = t2.Lecturer;

First Normal Form (1NF)

In 1NF, every column of every row holds a single atomic value.

Example:

Sample Employee table; it shows employees working with multiple departments.

Employee | Age | Department
---------|-----|------------------
Melvin   | 32  | Marketing, Sales
Edward   | 45  | Quality Assurance
Alex     | 36  | Human Resource

Employee table following 1NF:

Employee | Age | Department
---------|-----|------------------
Melvin   | 32  | Marketing
Melvin   | 32  | Sales
Edward   | 45  | Quality Assurance
Alex     | 36  | Human Resource

Second Normal Form (2NF)

The entity should already be in 1NF, and all attributes within the entity should depend solely on the entity's unique identifier.

Example:

Sample Products table:

productID | product    | Brand
----------|------------|--------
1         | Monitor    | Apple
2         | Monitor    | Samsung
3         | Scanner    | HP
4         | Head phone | JBL

Product tables following 2NF:

Products table:

productID | product
----------|-----------
1         | Monitor
2         | Scanner
3         | Head phone

Brand table:

brandID | brand
--------|--------
1       | Apple
2       | Samsung
3       | HP
4       | JBL

Products Brand table:

pbID | productID | brandID
-----|-----------|--------
1    | 1         | 1
2    | 1         | 2
3    | 2         | 3
4    | 3         | 4

Third Normal Form (3NF)

The entity should already be in 2NF, and no column entry should be dependent on any value other than the key of the table.

If such a dependency exists, move the dependent columns out into a new table.

Once 3NF is achieved, the database is generally considered normalized.

Boyce-Codd Normal Form (BCNF)

A table is in BCNF if it is in 3NF and, additionally, every determinant (the left-hand side of every functional dependency that holds in the table) is a candidate key.

Fourth Normal Form (4NF)

Tables cannot have multi-valued dependencies on a Primary Key.

Fifth Normal Form (5NF)

Composite key shouldn’t have any cyclic dependencies.

This is a highly simplified explanation of database normalization; the process can be studied far more extensively. After working with databases for some time, you will find yourself creating normalized databases automatically, since doing so is logical and practical.

Database Keys

Keys are a very important part of a relational database. They are used to establish and identify relationships between tables. They also ensure that each record within a table can be uniquely identified by a combination of one or more fields within the table.

Super Key

A super key is defined as a set of attributes within a table that uniquely identifies each record in the table. A super key is a superset of a candidate key.

Candidate Key

Candidate keys are defined as the set of fields from which the primary key can be selected. A candidate key is an attribute or set of attributes that can act as a primary key for a table, uniquely identifying each record in that table.

Primary Key

The primary key is the candidate key that is most appropriate to serve as the main key of the table. It is a key that uniquely identifies each record in the table.

Composite Key

A key that consists of two or more attributes that together uniquely identify an entity occurrence is called a composite key. No attribute that makes up a composite key is a simple key in its own right.

Secondary or Alternative key

The candidate keys that are not selected as the primary key are known as secondary or alternate keys.

Non-key Attribute

Non-key attributes are attributes other than candidate key attributes in a table.

Non-prime Attribute

Non-prime attributes are attributes that do not form part of any candidate key.

Algorithms for Query Processing and Optimization

A query expressed in a high-level query language such as SQL must be scanned, parsed, and validated.

Scanner: identifies the language tokens.

Parser: checks the query syntax.

Validator: checks that all attribute and relation names are valid and semantically meaningful.

· An internal representation (query tree or query graph) of the query is created after scanning, parsing, and validating.

· Then DBMS must devise an execution strategy for retrieving the result from the database files.

· How to choose a suitable (efficient) strategy for processing a query is known as query optimization

The figure shows the different steps of processing a high-level query.

The query optimizer module has the task of producing an execution plan

The code generator generates the code to execute that plan.

The runtime database processor has the task of running the query code, whether in compiled or interpreted mode, to produce the query result. If a runtime error occurs, an error message is generated by the runtime database processor.

Algorithms for External Sorting

· External sorting is a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead they must reside in the slower external memory, usually a hard disk drive.

· The usual approach is a sort-merge strategy, which starts by sorting small subfiles of the main file, called runs, and then merges the sorted runs, creating larger sorted subfiles that are merged in turn.

· The algorithm consists of two phases: sorting phase and merging phase.

1. Sorting phase: In the sorting phase, runs (portions or pieces) of the file that can fit in the available buffer space are read into main memory, sorted using an internal sorting algorithm, and written back to disk as temporary sorted subfiles (runs).

Let nR be the number of initial runs, b the number of file blocks, and nB the available buffer space in blocks. Then:

nR = ⌈b/nB⌉

Example: If the available buffer size is 5 blocks and the file contains 1024 blocks, then there are ⌈1024/5⌉ = 205 initial runs, each of size 5 blocks (except the last run, which has only 4 blocks). After the sort phase, 205 sorted runs are stored as temporary subfiles on disk.

2. Merging Phase

– The sorted runs are merged during one or more passes.

– The degree of merging (dM) is the number of runs that can be merged together in each pass.

– In each pass, one buffer block is needed to hold one block from each of the runs being merged, and one block is needed to hold one block of the merge result.

– dM = min(nB − 1, nR), and the number of merge passes is ⌈log_dM(nR)⌉.

– In the previous example, dM = min(5 − 1, 205) = 4, and the runs shrink as 205 → 52 → 13 → 4 → 1. This means 4 passes.

• The complexity of external sorting, measured in block accesses, is (2 × b) + (2 × b × ⌈log_dM(nR)⌉).

For example:

– 5 initial runs: [2, 8, 11], [4, 6, 7], [1, 9, 13], [3, 12, 15], [5, 10, 14].

– The available buffer is nB = 3 blocks, so dM = 2 (two-way merge). A small runnable simulation of the two phases is sketched below.
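The following is an in-memory simulation of the two phases, a sketch rather than a real disk-based implementation: "blocks" are slices of a Python list, the sort phase builds runs of nB blocks each, and the merge phase performs dM-way merges until one run remains.

import heapq

def external_sort(data, block_size, n_buffers):
    """Simulated two-phase external merge sort.
    Sort phase: cut the file into runs of n_buffers blocks and sort each.
    Merge phase: repeatedly merge up to d_M = n_buffers - 1 runs per pass."""
    run_len = block_size * n_buffers
    runs = [sorted(data[i:i + run_len]) for i in range(0, len(data), run_len)]

    d_m = max(n_buffers - 1, 2)   # degree of merging (at least two-way)
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + d_m]))
                for i in range(0, len(runs), d_m)]
    return runs[0] if runs else []

# 9 'blocks' of 1 record each, 3 buffers -> 3 initial runs, two-way merging.
print(external_sort([9, 1, 7, 3, 8, 2, 6, 5, 4], block_size=1, n_buffers=3))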

Algorithms for SELECT and JOIN Operations

There are many options for executing a SELECT operation; five example operations are used for demonstration (Fig. 5.5).

Search Methods for Simple Selection

A number of search algorithms are possible for selecting records from a file. These are also known as file scans, because they scan the records of a file to search for and retrieve records that satisfy a selection condition. If the search algorithm involves the use of an index, the index search is called an index scan.

The following search methods (S1 through S6) are examples of the search algorithms that can be used to implement a select operation:

S1: linear (brute force) search.

S2: binary search on an ordering attribute.

S3: use of a primary index (S3a) or hash key (S3b) to retrieve a single record.

S4: use of a primary index to retrieve multiple records (e.g., a range condition).

S5: use of a clustering index to retrieve multiple records.

S6: use of a secondary (B+-tree) index.

Search Methods for Complex Selection. If the condition of a SELECT operation is a conjunctive condition, that is, if it is made up of several simple conditions connected with the AND logical connective, the DBMS can use additional methods, such as a conjunctive selection using an individual index, a composite index, or the intersection of record pointers, to implement the operation.

2. Implementing the JOIN Operation

The JOIN operation is one of the most time-consuming operations in query processing. Many of the join operations encountered in queries are of the EQUI-JOIN and NATURAL JOIN varieties, so we consider only these two here.

Two-way join: join on two files.

Multiway join: joins involving more than two files


Methods for Implementing Joins

J1: Nested-loop join

for r in R:
    for s in S:
        if r[A] == s[B]:
            add the combined tuple <r, s> to the result

J2: Single-loop join (with index)

Assume there is an index on S.B:

for r in R:
    ss = S.index(B).getall(r[A])   # every s in S with s[B] == r[A]
    for s in ss:
        add the combined tuple <r, s> to the result

Different Types of SQL JOINs

Here are the different types of the JOINs in SQL:

(INNER) JOIN: Returns records that have matching values in both tables

LEFT (OUTER) JOIN: Return all records from the left table, and the matched records from the right table

RIGHT (OUTER) JOIN: Return all records from the right table, and the matched records from the left table

FULL (OUTER) JOIN: Return all records when there is a match in either left or right table

Algorithms for PROJECT and SET Operations

PROJECT Operations

A PROJECT operation on a relation R is straightforward to implement if the attribute list includes a key of R, because the result then has the same number of tuples as R. If the attribute list does not include a key, duplicate tuples must be eliminated, typically by sorting the result and discarding adjacent duplicates, or by hashing. The set operations (UNION, INTERSECTION, SET DIFFERENCE, CARTESIAN PRODUCT) are likewise usually implemented with sort-merge or hashing variants, since they require matching and deduplicating whole tuples.

Algorithms for Aggregate Operations and OUTER JOINS

1. Implementing Aggregate Operations

The aggregate operators (MIN, MAX, COUNT, AVERAGE, SUM), when applied to an entire table, can be computed by a table scan or by using an appropriate index. For example, consider the following SQL query:

SELECT MAX (SALARY) FROM EMPLOYEE;

If an (ascending) index on SALARY exists for the EMPLOYEE relation, the optimizer can locate the largest value without scanning the table: it is found at the rightmost leaf (B-tree and B+-tree) or the last entry in the first-level index (clustering, secondary).

The index could also be used for COUNT, AVERAGE, and SUM aggregates, if it is a dense index. For a nondense index, the actual number of records associated with each index entry must be used for a correct computation.

For a GROUP BY clause in a query, the technique is to partition the relation on the grouping attributes (by either sorting or hashing) and then apply the aggregate operators to each group, as sketched after the query below. Consider the following query:

SELECT DNO, AVG (SALARY) FROM EMPLOYEE GROUP BY DNO;
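A hash-based implementation of this GROUP BY keeps one running (sum, count) pair per DNO value, which is the partition-then-aggregate idea in miniature; the employee tuples below are hypothetical:

from collections import defaultdict

employees = [
    {"DNO": 5, "SALARY": 30000},
    {"DNO": 5, "SALARY": 40000},
    {"DNO": 4, "SALARY": 25000},
]

# Hash partitioning on DNO with running aggregates per group.
acc = defaultdict(lambda: [0, 0])   # DNO -> [sum of salaries, count]
for e in employees:
    acc[e["DNO"]][0] += e["SALARY"]
    acc[e["DNO"]][1] += 1

for dno, (total, count) in acc.items():
    print(dno, total / count)       # AVG(SALARY) per department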

2. Implementing Outer Join operations

The outer join operation was introduced, with its three variations:

a) Left outer join

b) Right outer join

c) Full outer join


The following is an example of a left outer join operation in SQL

SELECT LNAME, FNAME, DNAME FROM (EMPLOYEE LEFT OUTER JOIN DEPARTMENT ON DNO=DNUMBER);

The result of this query is a table of employee names and their associated departments.

Outer join can be computed by modifying one of the join algorithms, such as nested loop join or single-loop join.

– For a left (right) outer join, we use the left (right) relation as the outer loop or single-loop in the join algorithms.

– If there are matching tuples in the other relation, the joined tuples are produced and saved in the result. However, if no matching tuple is found, the tuple is still included in the result but is padded with null values.

The other join algorithms, sort-merge and hash-join, can also be extended to compute outer joins; a small nested-loop sketch of the NULL-padding idea follows.
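The NULL-padding rule is a one-line change to the nested-loop join: if a tuple of the left relation finds no match, it is emitted once, padded with NULLs (None in Python). A minimal sketch with hypothetical employee and department tuples:

employees = [("Smith", "John", 5), ("Borg", "James", None)]
departments = [(5, "Research")]

# Left outer join on EMPLOYEE.DNO = DEPARTMENT.DNUMBER (nested-loop variant).
result = []
for lname, fname, dno in employees:
    matched = False
    for dnumber, dname in departments:
        if dno == dnumber:
            result.append((lname, fname, dname))
            matched = True
    if not matched:
        result.append((lname, fname, None))   # pad with NULL

print(result)   # [('Smith', 'John', 'Research'), ('Borg', 'James', None)]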

Using Heuristics in Query Optimization

Heuristic optimization uses heuristic rules to modify the internal representation (query tree) of a query.

• One of the main heuristic rules is to apply SELECT and PROJECT operations before applying the JOIN or other binary operations.

• The SELECT and PROJECT operations reduce the size of a file and hence should be applied first.

1. Notation for Query Trees and Query Graphs

• Query tree: see Figure 15.4(a) (Fig 18.4(a) on e3).

– A query tree is a tree data structure that corresponds to a relational algebra expression. The input relations of the query are the leaf nodes; the relational algebra operations are the internal nodes.

– An execution of the query tree consists of executing an internal node operation whenever its operands are available and then replacing that internal node by the relation that results from executing the operation.

– A query tree specifies a specific order of operations for executing a query.

Query Graph: see Figure 15.4(c) (Fig 18.4(c) on e3).

A query graph is a graph data structure that corresponds to a relational calculus expression. It does not indicate an order in which to perform operations; there is only a single graph corresponding to each query.

– Relation nodes: displayed as single circles.

– Constant nodes: displayed as double circles.

– Graph edges: specify the selection and join conditions.

– The attributes to be retrieved from each relation are displayed in square brackets above each relation.


2. Heuristic Optimization of Query Trees

• Many different relational algebra expressions – and hence many different query trees – can be equivalent.

• The query parser will typically generate a standard initial query tree to correspond to an SQL query, without doing any optimization.

• The initial query tree (canonical query tree) is generated by the following sequence.

– The CARTESIAN PRODUCT of the relations specified in the FROM clause is first applied.

– The selection and join conditions of the WHERE clause are applied.

– The projection onto the SELECT clause attributes is applied.

• The heuristic query optimizer transforms this initial (inefficient) query tree into a final query tree that is efficient to execute.

• Example of transforming a query. See Figure 15.5 (Fig 18.5 on e3).

Consider the SQL query below.

SELECT LNAME FROM EMPLOYEE, WORKS_ON, PROJECT WHERE PNAME='Aquarius' AND PNUMBER=PNO AND ESSN=SSN AND BDATE > '1957-12-31';

Relational algebra is a procedural query language. It consists of a set of operations that take one or two relations as input and produce a new relation as their result. The main heuristic, pushing selections below joins, is sketched below.
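To make the heuristic concrete, the sketch below represents a query tree as nested nodes and pushes a SELECT below a JOIN when its attribute belongs to one side. The Node class, the owned_attrs catalog lookup, and the attribute sets are all illustrative assumptions, not part of any real optimizer API:

class Node:
    def __init__(self, op, children=(), attr=None):
        self.op, self.children, self.attr = op, list(children), attr

def owned_attrs(node):
    # Hypothetical catalog lookup: which attributes a subtree produces.
    return {"EMPLOYEE": {"SSN", "BDATE", "LNAME"},
            "PROJECT": {"PNUMBER", "PNAME"}}.get(node.op, set())

def push_select_down(node):
    """Push a SELECT below a JOIN when its attribute comes from one child."""
    node.children = [push_select_down(c) for c in node.children]
    if node.op == "SELECT" and node.children[0].op == "JOIN":
        join = node.children[0]
        for i, child in enumerate(join.children):
            if node.attr in owned_attrs(child):   # selection fits this side
                node.children = [child]           # SELECT now wraps the child
                join.children[i] = node
                return join                       # JOIN moves above the SELECT
    return node

tree = Node("SELECT", [Node("JOIN", [Node("EMPLOYEE"), Node("PROJECT")])],
            attr="PNAME")
optimized = push_select_down(tree)
print(optimized.op)   # JOIN: the selection now sits on the PROJECT branch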

Using Selectivity and Cost Estimates in Query Optimization

The DBMS attempts to form a good cost model of various query operations as applied to the current database state, including the attribute value statistics (histogram), nature of indices, number of block buffers that can be allocated to various pipelines, selectivity of selection clauses, storage speed, network speed (for distributed databases in particular), and so on.

Access cost to secondary storage: the cost of reading and writing blocks between storage and RAM. (time)

Disk storage cost: the cost of temporary intermediate files in storage. (time/space)

Computation cost: the CPU cost of evaluating the query; usually cheaper than storage access, but not always. (time)

Memory usage cost: the amount of RAM needed by the query. (space)

Communication cost: the cost of transporting query data over a network between database nodes. (time)

“Typical” databases emphasize access cost, the usual limiting factor. In-memory databases minimize computation cost, while distributed databases put increasing emphasis on communication cost.

Catalog Information Used in Cost Functions

The catalog typically records, for each file, the number of records (r), the average record size, the number of blocks (b), and the blocking factor (bfr), and, for each index, the number of index levels (x) and the number of distinct values of each indexed attribute, from which selectivities are estimated.

Examples of Cost Functions for Select

S1, linear search

On average half the blocks must be accessed for an equality condition on the key, all the blocks otherwise.

S2, binary search

Approximately ⌈log2 b⌉ block accesses for the search, plus more if the condition is not on a key attribute.

S3a, primary index for a single record

One more than the number of index levels.

S3b, hash index for a single record

Average of 1 or 2 depending on the type of hash.

S4, ordered index for multiple records

CS4=x+b/2 as a rough estimate.

S5, clustering index for multiple records

CS5 = x + ⌈s/bfr⌉, where s is the selection cardinality and bfr the blocking factor. A small calculator for these estimates is sketched below.

http://www.cs.montana.edu/~halla/csci440/n19/n19.html#trees
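The cost formulas above can be collected into one small calculator; a sketch where b is the number of file blocks, x the number of index levels, s the selection cardinality, and bfr the blocking factor (the formulas follow the rough textbook estimates listed above, and the sample numbers are made up):

from math import ceil, log2

def select_costs(b, x, s, bfr):
    """Rough block-access cost estimates for the select methods above.
    b: file blocks, x: index levels, s: selection cardinality,
    bfr: blocking factor (records per block)."""
    return {
        "S1 linear (equality on key)": b / 2,
        "S1 linear (otherwise)": b,
        "S2 binary search": ceil(log2(b)),
        "S3a primary index, single record": x + 1,
        "S4 ordered index, range": x + b / 2,
        "S5 clustering index": x + ceil(s / bfr),
    }

for method, cost in select_costs(b=2000, x=2, s=100, bfr=10).items():
    print(f"{method}: {cost}")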
