normalization
DESCRIPTION
Normalization. Good database design ensures: that the data that we want to store is consistent . and can be retrieved or updated in the most efficient manner. Database design deals with fitting of a given piece of data into an appropriate logical organization . - PowerPoint PPT PresentationTRANSCRIPT
Normalization
• Good database design ensures:– that the data that we want to store is consistent.– and can be retrieved or updated in the most efficient manner.
• Database design deals with fitting of a given piece of data into an appropriate logical organization.
• Database design considers the following factors:– How many tables should we have?– How should they be interrelated?– How many columns should each table have?– Which columns should be key columns?– What sort of constraints should we have?
Functional Dependency:• Functional dependency is very important in
formulating database design.
• For two columns A and B of a table, we can define the functional dependency between them as:
Column B is said to be functionally dependent on column A if, given A we can
precisely determine B.
E.g. Consider the following Employee table:
• Is there any functional dependence between any two columns of the table?
• Consider a combination of Emp_id & Emp_name.• Are any of the following statements true?• Given an Emp_id we can certainly determine a
precise Emp_name.• Given an Emp_name, we can certainly determine a
precise Emp_id.
Emp_id Emp_Name Age
A101 Anand 29
A102 Shilpa 23
A103 Reshama 34
• The first statement is true, but second one is not.• Emp_id is the P.K. of the table and therefore is
always unique.• Emp_name can repeat and therefore can be
duplicated in the table.• Therefore we can determine Employee name given
employee id.
Thus Emp_name is functionally dependent on Emp_id.
• Symbolically:
Emp_id Emp_name
• This should be read as Emp_id functionally determines Emp_name.
• In general if column A determines the value of column B, then we write it as:
A B
• This should be read as:
A functionally determines B
• There can be several functional dependencies between various columns of a table.
• E.g.
Emp_id Age
Expanded Employee Table:
List of functional dependencies:• Emp_id Emp_name• Emp_id Age• Emp_id Salary• Emp_id Project_id• Emp_id Completion_date
Emp_id Emp_name Age Salary Proj_id Completion_date
A101 Anand 29 10000 BTC 12/12/2005
A102 Shilpa 23 7000 WAP 11/8/2005
A103 Reshma 34 25000 WAP 11/8/2005
List that cannot be termed as functional dependencies:
Age NOT Emp_id
(because more than one employee can have same age).
Salary NOT Emp_id
(because more than one employee can have the same salary.
Decomposition:• Decomposition refers to the breaking down of one table
into multiple tables.• It is not much different from RDBMS projection operation.• During projection, we choose only the needed columns of
the table, discarding the rest.• In decomposition we do the same.• Discarding means not deleting forever but it means they
are placed in another table where they should logically belong.
• Decomposition is done to reduce redundancy.• Redundancy means unnecessary and uncontrolled
duplication of data.• Redundancy leads to loss of data integrity and data
consistency.
Examination Table
Rno name Sub_id Sub_name Marks
101 Narendra s1 English 80
101 Narendra s2 Physics 65
101 Narendra s3 Chemistry 68
102 Nita s1 English 70
102 Nita s2 Physics 49
102 Nita s3 Chemistry 50
• Redundancy exists in the table.
• Student names appear three times in the table.
• Similarly with the subject.
• To minimize the redundancy following rule is to be followed:
Keep one fact in one place.
• Decomposing the Examination table breaks it into two tables:
RnoNameSub_idSub_nameMks
RnoName
Sub_idSub_name
Mks
Student
Subject
After decomposition:• Student’s information in Student Table.• Subject’s information in Subject table.• Rno primary key in Student table.• Sub_id primary key in Subject table.• For a given student there will be only one entry in
student table and therefore student information cannot repeat.
• For a given subject there will also be exactly one entry in subject table and therefore subject information cannot repeat.
• Hence the problem of redundancy is solved.
• Although the problem of redundancy is solved another undesirable problem has been introduced.
• Where do we now have the information as to which student has obtained how many marks in which subject?
• We have lost it.
Loss of information due to decomposition is called Lossy Decomposition.
• Lossy decomposition is not acceptable in any case and cannot be used to reduce redundancy.
How to figure out lossy Decomposition?• Perform reversibility of decomposition i.e. recomposition.• Check whether we can reassemble the data in the
original form.• If we are not able to take back the data in original form,
it means we have lost some data from the original database design.
• On the other hand, if we are able to successfully reassemble the split tables so that we can recreate the original data, then the decomposition is desirable and successful.
When all information found in the original database is preserved after decomposition, it is called lossless
decomposition or non - lossy decomposition.
Modification in the previous table designs:
RnoNameSub_idSub_nameMks
RnoName
Sub_idSub_name
Student
Subject
RnoSub_id
Mks
Result
• Three tables are created as a result of the decomposition.
• The student and the subject tables continue to exist as before, but the structure of the student table is changed.
• The marks column no longer exists.• The newly created table is the Result table.• This table shows the marks obtained by a student in a
particular subject.• If we observe carefully, there are referential integrity
relationships as follows:–Between Student and result tables based on the Rno.–Between the Subject and the result tables, based on the Sub_id.
Student Table
Rno
Subject Table
Sub_id
Result TableRno
Sub_id
• If we join these three tables to perform a recomposition, we would get the following columns (eliminate duplicates):
Rno
Name
Sub_id
Sub_name
Mks• This is precisely what was contained in the original
Examination table.• This is an example of lossless decomposition.
In lossy decomposition, we lose some of the functional dependency relationships.
In lossless decomposition, all functional dependency relationships are preserved.
Normalization:• Normalization is the process of successive reduction
of a given set of relations to a better form.• It is the process of efficiently organizing data in a
database. • In the field of relational database design,
normalization is a systematic way of ensuring that:– a database structure is suitable for general-purpose
querying, and– free of certain undesirable characteristics as insertion,
update, and deletion anomalies—that could lead to a loss of data integrity.
• The end result of normalization is that redundant data is eliminated and only data related to the attribute is stored within a table.
• E.F. Codd, the inventor of the relational model, introduced the concept of normalization.
Goals of the normalization process:
1. Eliminating redundant data (for example, storing the same data in more than one table)
2.Ensuring data dependencies(only storing related data in a table).
Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.
Advantages of Normalization:
1. Data integrity:– The primary thing that normalization gives you is
data integrity. – With a fully normalized database design, you're
assured that your transactions will completed or roll backed.
– It also assures that inserts, updates, and deletes i.e., any type of modification to the data can't compromise your data.
2. Data Retrieval:– A fully normalized database will have many
tables, each with a handful of columns. – to retrieve any single piece of data, then, requires
joining together multiple tables. – With a fully normalized database, you can retrieve
any piece of data with a single query. – And generally, the queries aren't that complex.
3. Performance:– A fully normalized database exhibits greater
performance because:– Queries are shorter and simpler – Tables have fewer records since duplicate
information has been eliminated through the normalization process, and
– Indexes are more efficient because each table only contains a limited amount of non-redundant information.
Disadvantages:
1. Normalization Procedures:
• The proper normalization of a database is a challenging process.
• An improperly normalized database can cause significant problems.
• Furthermore, planning is even more important because databases don't lend themselves to rapid development.
2. Database restructuring:
• Proper database design is also important because it can be very difficult to modify the structure of a database after.
• Modifying a properly normalized database is easier than modifying an improperly normalized database .
• Database restructuring is also problematic since the database generally serves as the foundation of an larger application.
• Changes to the database risk breaking the application.
3. Data insertion and Maintenance:– A fully normalized database makes it easy to
retrieve data, it can be a serious pain to insert new records and update existing records.
– This is because inter-table dependencies (i.e., they have referential integrity on themselves).
– These tables are locked down so tightly that you can only insert, update, or delete data by acting on all related records.
– It is impossible to update individual records.
Normal Forms:• The database community has developed a series of
guidelines for ensuring that databases are normalized.
• These are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF).
• Normal forms are guidelines and guidelines only. • Sometimes it becomes necessary to stray from them
to meet practical business requirements.
E.g.: Order Table:
Order_number
Order_date
Customer_number
Item_number
Item_name
Quantity
Unit_price
Bill_amount
These would repeat for as many items as we have for a given order
First Normal Form:
• A table is in the first normal form(1NF) if it does not contain any repeating columns or repeating groups of columns.
• Order table does not conform to the first normal form as it can contain duplicate items.
• To bring it into 1NF it is required to perform lossless decomposition.
• This is done by following a rule:
Move all the repeating columns to another table.
• The repeating columns in the Order table are:
Item_number
Item_name
Quantity
Unit_price
Bill_amount
• These columns can repeat many times in the same order.
• These columns are to be moved to another table.
• Thus we have two table:
• Table 1: Order
Order_number
Order_date
Customer_number
• Table 2 : Order_item
Item_number
Item_name
Quantity
Unit_price
Bill_amount
• For joining them there is a need for some commonality between these two tables.
• There is no such commonality.• This is an example of lossy decomposition.• One way to achieve commonality is to add
column Order_number to Order_item table.• It will link item sold to a particular order.• This will create referential integrity
relationship based on this column, to be established between the two tables.
• Order_number will be Primary Key in the Order table and Foreign Key in the Order_item table.
• Table 1: Order
Order_number
Order_date
Customer_number
• Table 2 : Order_item
Order_number
Item_number
Item_name
Quantity
Unit_price
Bill_amount
• This results into lossless decomposition.• What should be the primary key for Order_item
table.• It cannot be Order_number as there can be multiple
rows for the same Order.• Therefore here the primary key should be
combination of the Order_number and the Item_number.
• Thus the primary key is a composite or concatenated primary key.
Second Normal Form:
• A table is in the second normal form (2NF) if it is in the first normal form and if all non key columns in the table depend on the entire primary key.
• Let us examine the above tables for 2NF:
i) it is in the first normal form:
First condition is already satisfied.
ii) if all non key columns in the table depend on the entire primary key:
Order Table: Order_number (PK)
Order_date
Customer_number
• From Order_number we can derive the other non-key columns i.e. Order_date and Customer_number.
• These non key columns do not depend on each other at all.
• Thus this table is in 2NF.
• Order_item Table:Order_number
Item_numberItem_nameQuantityUnit_priceBill_amount
• Can we determine the values of non key columns using Composite primary key?
• Not Quite.• Item_name is base on Item_number alone, Order_number is
not required.• Unit_price column is determined with Item_number only.• Thus all non key columns in the table do not depend on the
entire primary key, but some of them depend on the part of the primary key.
• 2NF principle is violated.
Primary Key (Composite Key
• TO bring it into 2NF:
Move columns that do not depend on the entire primary key to another table.
• The columns that do not depend on the entire primary key are Item_name and Unit_price.
• They depend only on the Item_number.
• Move these two columns to a new table called Item.
• We also need Item_number (PK) in this table.
• The table would now look as:
• Table 1: Order Order_number (PK)
Order_dateCustomer_number
Table 2 : Order_itemOrder_number (FK)
Item_number (FK)QuantityBill_amount
Table 3 : ItemItem_number (PK)Item_nameUnit_price
Third Normal Form (3NF):
• A table is in the third normal form if it is in the second normal form and if all non key columns in the table depend non – transitively on the entire primary key.
• In other words, 3NF requires a table:i) to be in the second normal form and
ii) if all non-key column in the table must be independent of all other non-key columns.
Transitive Dependency:
A B C
B Functionally depends on A
C Functionally depends on B
C Transitively depends on A
3NF states to identify all transitive dependencies and get rid of them.
• Possible transitive dependencies in current table structure:
Order Table: Order_number (PK)
Order_date
Customer_number
• No such dependency as Order_date and Customer_number are functionally dependent on Order_number.
Order_item Table:Order_number (FK)
Item_number (FK)
Quantity
Bill_amount• Item_number and Quantity are functionally
dependent on Order_number.• Bill_amount does not depend directly on the PK of
the Order_item table.• It is calculated from:
Quantity (Order_item) * Unit_price ( Item)• This is the case of transitive dependency and have
to be get rid of it.
Item Table:
Item_number (PK)
Item_name
Unit_price• Here the non key columns Item_name and
Unit_price functionally depend on the primary key of the table, that is Item_number.
• Hence no transitive dependency in this table.
• In summary we need to get rid of the Bill_amount column from the Order_item table in order to bring the table into the 3NF.
• DO we loose any information by doing so?
• No really, as value for Bill_amount can be calculated from quantity and Unit_price columns.
• Table 1: Order
Order_number (PK)
Order_date
Customer_number
Table 2 : Order_item
Order_number (FK)
Item_number (FK)
Quantity
Table 3 : Item
Item_number (PK)
Item_name
Unit_price
Boyce – Codd Normal Form (BCNF):
• The first three normal forms provide means to ensure that a table is reasonably normalized.
• The next normal form is Boyce-Codd normal form (BCNF), named after the creators of this normal form.
Determinant:
• Functional dependency states that if B is functionally dependent on A, that is, A functionally determines B, then A B.
• Based on this definition, let us define BCNF:
• A table is in BCNF if the only determinants in the table are the candidate keys.
• In other words:
• A table is in BCNF if every column, on which some other column is fully functionally dependent, is also a candidate for the primary key of the table.
Example:
•The relation student (sno, sname, cno, cname) has all attributes participating in candidate keys since all the attributes are assumed to be unique. We therefore had the following candidate keys:
(sno, cno) (sno, cname) (sname, cno) (sname, cname)
•Since the relation has no non-key attributes, the relation is in 2NF and also in 3NF,
• The difficulty in this relation is being caused by dependence within the candidate keys.
• The second and third normal forms assume that all attributes not part of the candidate keys depend on the candidate keys but does not deal with dependencies within the keys.
• BCNF deals with such dependencies.
• A relation R is said to be in BCNF if whenever X -> A holds in R, and A is not in X, then X is a candidate key for R.
• It should be noted that most relations that are in 3NF are also in BCNF. Infrequently, a 3NF relation is not in BCNF and this happens only if :
1. The candidate keys in the relation are composite keys (that is, they are not single attributes),
2. There is more than one candidate key in the relation, and
3. The keys are not disjoint, that is, some attributes in the keys are common.
• The BCNF differs from the 3NF only when there are more than one candidate keys and the keys are composite and overlapping.
• Consider for example, the relationship enrol (sno, sname, cno, cname, date-enrolled) Let us assume that the relation has the following
candidate keys: (sno, cno)
(sno, cname) (sname, cno) (sname, cname)
• (we have assumed sname and cname are unique identifiers).
• The relation is in 3NF but not in BCNF because there are dependencies
sno -> snamecno -> cname
• Where attributes that are part of a candidate key are dependent on part of another candidate key.
• Such dependencies indicate that although the relation is about some entity or association that is identified by the candidate keys e.g. (sno, cno), there are attributes that are not about the whole thing that the keys identify.
• For example, the above relation is about an association (enrolment) between students and subjects and therefore the relation needs to include only one identifier to identify students and one identifier to identify subjects.
• Providing two identifiers about students (sno, sname) and two keys about subjects (cno, cname) means that some information about students and subjects that is not needed is being provided. This provision of information will result in repetition of information and the anomalies.
• If we wish to include further information about students and courses in the database, it should not be done by putting the information in the present relation but by creating new relations that represent information about entities student and subject.
• These difficulties may be overcome by decomposing the above relation in the following three relations:
(sno, sname) (cno, cname) (sno, cno, date-of-enrolment)
• We now have a relation that only has information about students, another only about subjects and the third only about enrolments.
• All the anomalies and repetition of information have been removed.