normalization

Normalization

• Good database design ensures:– that the data that we want to store is consistent.– and can be retrieved or updated in the most efficient manner.

• Database design deals with fitting of a given piece of data into an appropriate logical organization.

• Database design considers the following factors:– How many tables should we have?– How should they be interrelated?– How many columns should each table have?– Which columns should be key columns?– What sort of constraints should we have?

Functional Dependency:• Functional dependency is very important in

formulating database design.

• For two columns A and B of a table, we can define the functional dependency between them as:

Column B is said to be functionally dependent on column A if, given A we can

precisely determine B.

E.g. Consider the following Employee table:

• Is there any functional dependence between any two columns of the table?

• Consider a combination of Emp_id & Emp_name.• Are any of the following statements true?• Given an Emp_id we can certainly determine a

precise Emp_name.• Given an Emp_name, we can certainly determine a

precise Emp_id.

Emp_id Emp_Name Age

A101 Anand 29

A102 Shilpa 23

A103 Reshama 34

• The first statement is true, but second one is not.• Emp_id is the P.K. of the table and therefore is

always unique.• Emp_name can repeat and therefore can be

duplicated in the table.• Therefore we can determine Employee name given

employee id.

Thus Emp_name is functionally dependent on Emp_id.

• Symbolically:

Emp_id Emp_name

• This should be read as Emp_id functionally determines Emp_name.

• In general if column A determines the value of column B, then we write it as:

A B

• This should be read as:

A functionally determines B

• There can be several functional dependencies between various columns of a table.

• E.g.

Emp_id Age

Expanded Employee Table:

List of functional dependencies:• Emp_id Emp_name• Emp_id Age• Emp_id Salary• Emp_id Project_id• Emp_id Completion_date

Emp_id Emp_name Age Salary Proj_id Completion_date

A101 Anand 29 10000 BTC 12/12/2005

A102 Shilpa 23 7000 WAP 11/8/2005

A103 Reshma 34 25000 WAP 11/8/2005

List that cannot be termed as functional dependencies:

Age NOT Emp_id

(because more than one employee can have same age).

Salary NOT Emp_id

(because more than one employee can have the same salary.

Decomposition:• Decomposition refers to the breaking down of one table

into multiple tables.• It is not much different from RDBMS projection operation.• During projection, we choose only the needed columns of

the table, discarding the rest.• In decomposition we do the same.• Discarding means not deleting forever but it means they

are placed in another table where they should logically belong.

• Decomposition is done to reduce redundancy.• Redundancy means unnecessary and uncontrolled

duplication of data.• Redundancy leads to loss of data integrity and data

consistency.

Examination Table

Rno name Sub_id Sub_name Marks

101 Narendra s1 English 80

101 Narendra s2 Physics 65

101 Narendra s3 Chemistry 68

102 Nita s1 English 70

102 Nita s2 Physics 49

102 Nita s3 Chemistry 50

• Redundancy exists in the table.

• Student names appear three times in the table.

• Similarly with the subject.

• To minimize the redundancy following rule is to be followed:

Keep one fact in one place.

• Decomposing the Examination table breaks it into two tables:

RnoNameSub_idSub_nameMks

RnoName

Sub_idSub_name

Mks

Student

Subject

After decomposition:• Student’s information in Student Table.• Subject’s information in Subject table.• Rno primary key in Student table.• Sub_id primary key in Subject table.• For a given student there will be only one entry in

student table and therefore student information cannot repeat.

• For a given subject there will also be exactly one entry in subject table and therefore subject information cannot repeat.

• Hence the problem of redundancy is solved.

• Although the problem of redundancy is solved another undesirable problem has been introduced.

• Where do we now have the information as to which student has obtained how many marks in which subject?

• We have lost it.

Loss of information due to decomposition is called Lossy Decomposition.

• Lossy decomposition is not acceptable in any case and cannot be used to reduce redundancy.

How to figure out lossy Decomposition?• Perform reversibility of decomposition i.e. recomposition.• Check whether we can reassemble the data in the

original form.• If we are not able to take back the data in original form,

it means we have lost some data from the original database design.

• On the other hand, if we are able to successfully reassemble the split tables so that we can recreate the original data, then the decomposition is desirable and successful.

When all information found in the original database is preserved after decomposition, it is called lossless

decomposition or non - lossy decomposition.

Modification in the previous table designs:

RnoNameSub_idSub_nameMks

RnoName

Sub_idSub_name

Student

Subject

RnoSub_id

Mks

Result

• Three tables are created as a result of the decomposition.

• The student and the subject tables continue to exist as before, but the structure of the student table is changed.

• The marks column no longer exists.• The newly created table is the Result table.• This table shows the marks obtained by a student in a

particular subject.• If we observe carefully, there are referential integrity

relationships as follows:–Between Student and result tables based on the Rno.–Between the Subject and the result tables, based on the Sub_id.

Student Table

Rno

Subject Table

Sub_id

Result TableRno

Sub_id

• If we join these three tables to perform a recomposition, we would get the following columns (eliminate duplicates):

Rno

Name

Sub_id

Sub_name

Mks• This is precisely what was contained in the original

Examination table.• This is an example of lossless decomposition.

In lossy decomposition, we lose some of the functional dependency relationships.

In lossless decomposition, all functional dependency relationships are preserved.

Normalization:• Normalization is the process of successive reduction

of a given set of relations to a better form.• It is the process of efficiently organizing data in a

database. • In the field of relational database design,

normalization is a systematic way of ensuring that:– a database structure is suitable for general-purpose

querying, and– free of certain undesirable characteristics as insertion,

update, and deletion anomalies—that could lead to a loss of data integrity.

• The end result of normalization is that redundant data is eliminated and only data related to the attribute is stored within a table.

• E.F. Codd, the inventor of the relational model, introduced the concept of normalization.

Goals of the normalization process:

1. Eliminating redundant data (for example, storing the same data in more than one table)

2.Ensuring data dependencies(only storing related data in a table).

Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.

Advantages of Normalization:

1. Data integrity:– The primary thing that normalization gives you is

data integrity. – With a fully normalized database design, you're

assured that your transactions will completed or roll backed.

– It also assures that inserts, updates, and deletes i.e., any type of modification to the data can't compromise your data.

2. Data Retrieval:– A fully normalized database will have many

tables, each with a handful of columns. – to retrieve any single piece of data, then, requires

joining together multiple tables. – With a fully normalized database, you can retrieve

any piece of data with a single query. – And generally, the queries aren't that complex.

3. Performance:– A fully normalized database exhibits greater

performance because:– Queries are shorter and simpler – Tables have fewer records since duplicate

information has been eliminated through the normalization process, and

– Indexes are more efficient because each table only contains a limited amount of non-redundant information.

Disadvantages:

1. Normalization Procedures:

• The proper normalization of a database is a challenging process.

• An improperly normalized database can cause significant problems.

• Furthermore, planning is even more important because databases don't lend themselves to rapid development.

2. Database restructuring:

• Proper database design is also important because it can be very difficult to modify the structure of a database after.

• Modifying a properly normalized database is easier than modifying an improperly normalized database .

• Database restructuring is also problematic since the database generally serves as the foundation of an larger application.

• Changes to the database risk breaking the application.

3. Data insertion and Maintenance:– A fully normalized database makes it easy to

retrieve data, it can be a serious pain to insert new records and update existing records.

– This is because inter-table dependencies (i.e., they have referential integrity on themselves).

– These tables are locked down so tightly that you can only insert, update, or delete data by acting on all related records.

– It is impossible to update individual records.

Normal Forms:• The database community has developed a series of

guidelines for ensuring that databases are normalized.

• These are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF).

• Normal forms are guidelines and guidelines only. • Sometimes it becomes necessary to stray from them

to meet practical business requirements.

E.g.: Order Table:

Order_number

Order_date

Customer_number

Item_number

Item_name

Quantity

Unit_price

Bill_amount

These would repeat for as many items as we have for a given order

First Normal Form:

• A table is in the first normal form(1NF) if it does not contain any repeating columns or repeating groups of columns.

• Order table does not conform to the first normal form as it can contain duplicate items.

• To bring it into 1NF it is required to perform lossless decomposition.

• This is done by following a rule:

Move all the repeating columns to another table.

• The repeating columns in the Order table are:

Item_number

Item_name

Quantity

Unit_price

Bill_amount

• These columns can repeat many times in the same order.

• These columns are to be moved to another table.

• Thus we have two table:

• Table 1: Order

Order_number

Order_date

Customer_number

• Table 2 : Order_item

Item_number

Item_name

Quantity

Unit_price

Bill_amount

• For joining them there is a need for some commonality between these two tables.

• There is no such commonality.• This is an example of lossy decomposition.• One way to achieve commonality is to add

column Order_number to Order_item table.• It will link item sold to a particular order.• This will create referential integrity

relationship based on this column, to be established between the two tables.

• Order_number will be Primary Key in the Order table and Foreign Key in the Order_item table.

• Table 1: Order

Order_number

Order_date

Customer_number

• Table 2 : Order_item

Order_number

Item_number

Item_name

Quantity

Unit_price

Bill_amount

• This results into lossless decomposition.• What should be the primary key for Order_item

table.• It cannot be Order_number as there can be multiple

rows for the same Order.• Therefore here the primary key should be

combination of the Order_number and the Item_number.

• Thus the primary key is a composite or concatenated primary key.

Second Normal Form:

• A table is in the second normal form (2NF) if it is in the first normal form and if all non key columns in the table depend on the entire primary key.

• Let us examine the above tables for 2NF:

i) it is in the first normal form:

First condition is already satisfied.

ii) if all non key columns in the table depend on the entire primary key:

Order Table: Order_number (PK)

Order_date

Customer_number

• From Order_number we can derive the other non-key columns i.e. Order_date and Customer_number.

• These non key columns do not depend on each other at all.

• Thus this table is in 2NF.

• Order_item Table:Order_number

Item_numberItem_nameQuantityUnit_priceBill_amount

• Can we determine the values of non key columns using Composite primary key?

• Not Quite.• Item_name is base on Item_number alone, Order_number is

not required.• Unit_price column is determined with Item_number only.• Thus all non key columns in the table do not depend on the

entire primary key, but some of them depend on the part of the primary key.

• 2NF principle is violated.

Primary Key (Composite Key

• TO bring it into 2NF:

Move columns that do not depend on the entire primary key to another table.

• The columns that do not depend on the entire primary key are Item_name and Unit_price.

• They depend only on the Item_number.

• Move these two columns to a new table called Item.

• We also need Item_number (PK) in this table.

• The table would now look as:

• Table 1: Order Order_number (PK)

Order_dateCustomer_number

Table 2 : Order_itemOrder_number (FK)

Item_number (FK)QuantityBill_amount

Table 3 : ItemItem_number (PK)Item_nameUnit_price

Third Normal Form (3NF):

• A table is in the third normal form if it is in the second normal form and if all non key columns in the table depend non – transitively on the entire primary key.

• In other words, 3NF requires a table:i) to be in the second normal form and

ii) if all non-key column in the table must be independent of all other non-key columns.

Transitive Dependency:

A B C

B Functionally depends on A

C Functionally depends on B

C Transitively depends on A

3NF states to identify all transitive dependencies and get rid of them.

• Possible transitive dependencies in current table structure:

Order Table: Order_number (PK)

Order_date

Customer_number

• No such dependency as Order_date and Customer_number are functionally dependent on Order_number.

Order_item Table:Order_number (FK)

Item_number (FK)

Quantity

Bill_amount• Item_number and Quantity are functionally

dependent on Order_number.• Bill_amount does not depend directly on the PK of

the Order_item table.• It is calculated from:

Quantity (Order_item) * Unit_price ( Item)• This is the case of transitive dependency and have

to be get rid of it.

Item Table:

Item_number (PK)

Item_name

Unit_price• Here the non key columns Item_name and

Unit_price functionally depend on the primary key of the table, that is Item_number.

• Hence no transitive dependency in this table.

• In summary we need to get rid of the Bill_amount column from the Order_item table in order to bring the table into the 3NF.

• DO we loose any information by doing so?

• No really, as value for Bill_amount can be calculated from quantity and Unit_price columns.

• Table 1: Order

Order_number (PK)

Order_date

Customer_number

Table 2 : Order_item

Order_number (FK)

Item_number (FK)

Quantity

Table 3 : Item

Item_number (PK)

Item_name

Unit_price

Boyce – Codd Normal Form (BCNF):

• The first three normal forms provide means to ensure that a table is reasonably normalized.

• The next normal form is Boyce-Codd normal form (BCNF), named after the creators of this normal form.

Determinant:

• Functional dependency states that if B is functionally dependent on A, that is, A functionally determines B, then A B.

• Based on this definition, let us define BCNF:

• A table is in BCNF if the only determinants in the table are the candidate keys.

• In other words:

• A table is in BCNF if every column, on which some other column is fully functionally dependent, is also a candidate for the primary key of the table.

Example:

•The relation student (sno, sname, cno, cname) has all attributes participating in candidate keys since all the attributes are assumed to be unique. We therefore had the following candidate keys:

(sno, cno) (sno, cname) (sname, cno) (sname, cname)

•Since the relation has no non-key attributes, the relation is in 2NF and also in 3NF,

• The difficulty in this relation is being caused by dependence within the candidate keys.

• The second and third normal forms assume that all attributes not part of the candidate keys depend on the candidate keys but does not deal with dependencies within the keys.

• BCNF deals with such dependencies.

• A relation R is said to be in BCNF if whenever X -> A holds in R, and A is not in X, then X is a candidate key for R.

• It should be noted that most relations that are in 3NF are also in BCNF. Infrequently, a 3NF relation is not in BCNF and this happens only if :

1. The candidate keys in the relation are composite keys (that is, they are not single attributes),

2. There is more than one candidate key in the relation, and

3. The keys are not disjoint, that is, some attributes in the keys are common.

• The BCNF differs from the 3NF only when there are more than one candidate keys and the keys are composite and overlapping.

• Consider for example, the relationship enrol (sno, sname, cno, cname, date-enrolled) Let us assume that the relation has the following

candidate keys: (sno, cno)

(sno, cname) (sname, cno) (sname, cname)

• (we have assumed sname and cname are unique identifiers).

• The relation is in 3NF but not in BCNF because there are dependencies

sno -> snamecno -> cname

• Where attributes that are part of a candidate key are dependent on part of another candidate key.

• Such dependencies indicate that although the relation is about some entity or association that is identified by the candidate keys e.g. (sno, cno), there are attributes that are not about the whole thing that the keys identify.

• For example, the above relation is about an association (enrolment) between students and subjects and therefore the relation needs to include only one identifier to identify students and one identifier to identify subjects.

• Providing two identifiers about students (sno, sname) and two keys about subjects (cno, cname) means that some information about students and subjects that is not needed is being provided. This provision of information will result in repetition of information and the anomalies.

• If we wish to include further information about students and courses in the database, it should not be done by putting the information in the present relation but by creating new relations that represent information about entities student and subject.

• These difficulties may be overcome by decomposing the above relation in the following three relations:

(sno, sname) (cno, cname) (sno, cno, date-of-enrolment)

• We now have a relation that only has information about students, another only about subjects and the third only about enrolments.

• All the anomalies and repetition of information have been removed.

normalization

Documents