Chapter 11Further Normalization (I)
Throughout this book, we have made use of
the suppliers and parts database as an exam-
ple, as follows:
S {S#, SNAME, STATUS, CITY}
PRIMARY KEY {S#}
P {P#, PNAME, COLOR, WEIGHT, CITY}
PRIMARY KEY {P#};
SP {S#, P#, QTY}
PRIMARY KEY {S#, P#}
FOREIGN KEY {S#} REFERENCES S
FOREIGN KEY {P#} REFERENCES P
1
The above design does seem to make sense:
it is obvious that three tables S, P, and SP are
necessary; it is also obvious that the supplier
CITY attribute belongs in table S, the part COLOR
attribute belongs in P, and the shipment QTY
belongs in SP, etc.. But, why are they the right
decisions?
Assume that we move the supplier attribute
CITY out of the S and into SP to obtain the
following SCOP:
S# CITY P# QTYS1 London P1 100S1 London P2 100S2 Paris P1 200S2 Paris P2 200S3 Paris P2 300S4 London P2 400S4 London P4 400S4 London P5 400
2
Looking at this modified table, we immediately
spot what is wrong with this design: redun-
dancy: every tuple for supplier S1 tells user S1
is located in London, every tuple for supplier
S2 tells us that S2 is located in Paris, and so
on. This leads to some further problems: Af-
ter an update, S1 might be shown as being
located in London in one tuple, and in Ams-
terdam in another (inconsistency).
Thus, a good design principle is “one fact in
one place.” This chapter provides a formal
treatment for this simple idea. In fact, rela-
tional model is already normalized in certain
sense, every tuple contains exactly one value
for each attribute. We will say that it is in the
first normal form (INF).
3
What are they for?
As we noticed that a given table is in INF,
but still possess certain undesirable properties,
e.g., the table SCOP. The principle of further
normalization allows us to recognize such cases
and to replace them with those that are more
desirable in certain way.
In case of the table SCOP, those principles would
tell us exactly what is wrong, and would tell us
how to replace it with two more desirable ta-
bles, one with heading {S#, CITY}, and another
with heading {S#, P#, QTY}.
4
Normal forms
The process of further normalization is built
around the concept of normal forms. A ta-
ble is said to be in a certain normal form if
it satisfies a certain prescribed set of condi-
tions. For example, a table is in second normal
form (2NFS) iff it is in 1NFS and also satisfies
some additional conditions.
Many normal forms have been defined. The
following picture shows their relationship:
5
Before we can start...
To replace a table with something better, we
have to apply certain normalization procedure,
which will successively reduce a give collection
of tables to some more desirable form. Such
a procedure should be reversible, in the sense
that it is also possible to take its output and
map it back to the input, i.e., such a proce-
dure should be information preserving. In other
words, the only decomposition we are inter-
ested in are ones that will lose nothing during
its application.
For example, consider S, with heading {S#,STATUS, CITY}. We will consider several op-
tions to decompose it into several simpler ta-
bles.
6
You have a choice
Given the following two options:
we observe that in case (a), nothing is lost: the
SST and SC values still tell us that S3 has sta-
tus 30 and city Paris, etc., while in case (b),
although we can still tell that both suppliers
have status 30, we cannot tell which suppler
has which city. Thus, we call second decom-
position lossy.
7
What is the problem?
First of all, the decomposition is really a pro-
cess of projection: SST, SC and ST are each
projections of S. Thus, projection is the de-
composition operator.
Secondly, when we say case (a) is non-lossy, we
really mean that if we join SST and SC, we will
get back to the original S. On the other hand,
in case (b), if we join them together, we won’t
get back the original table S. Thus, reversibility
means precisely that the original table is equal
to the join of its projections. Hence, join is the
recomposition operator.
8
The million buck question
Question: If R1 and R2 are projections of
some table R, and R1 and R2 between them
include all of the attributes of R, what condi-
tions have to be satisfied in order to guarantee
that their join takes back to the original R?
We notice that S# → STATUS and S# → CITY
are two FDs associated with S. As a matter of
fact, we have the following general result:
Heath’s theorem: Let R{A, B, C} be a table,
where A, B, and C are attributes of R. If R
satisfies the FD A → B, then R is equal to the
join of its projections on {A, B} and {A, C}.
9
More on FDs
1. An FD rule is left-irreducible if its left-hand
side cannot be further cut down. E.g., con-
sider SCOP, the table satisfies {S#, P#} → CITY.
However, P# is redundant, since we also have S#
→ CITY, which implies the earlier one. Thus,
we can say that CITY is irreducibly dependent
on S#, but not on {S#, P#}.
2. FDs, as special integrity constraints, are
semantic notions. Recognizing the FDs is thus
part of the process of understanding what the
data means. The FD S# → CITY means that
each supplier is located in precisely one city,
and the way to specify this understanding is to
declare the FD in the database.
10
3. Let R be a table and let I be some irre-
ducible set of FDs that apply to R. It is con-
venient to represent the set I by means of a
functional dependency diagram. E.g., below is
the FD diagram for S, SP, and P.
Each arrow in the above figure comes out a
candidate key. The normalization process is to
eliminate arrows that don’t come out of can-
didate keys.
11
The fun starts...
Assume that each table has exactly one can-
didate key, thus also the primary key. Then,
an table is in 3NF iff the non-key attributes are
1) mutually independent, and 2) irreducibly de-
pendent on the primary key.
In contrast, a table is in 1NF if, in every legal
value of that table, every tuple contains exactly
one value for each attribute.
A table that is only in 1NF, but not in higher
normal forms, has a structure that is unde-
sirable for several reasons. For example, let’s
suppose that information about suppliers and
shipments, is lumped together into a single ta-
ble, FIRST, as follows:
FIRST {S#,STATUS,CITY,P#,QTY}
PRIMARY KEY {S#,P#}
12
Below is the FD diagram of this table, with an
additional FD: CITY → STATUS:
This diagram is more complex than that for
a 3NF table, since in a 3NF diagram, arrows
come out of candidate keys only; while in this
diagram for FIRST, there are certain additional
arrows, which will cause some troubles. In fact,
FIRST violates both conditions of being a 3NF
table: non-key attributes STATUS depends on
another non-key CITY; and they are not irre-
ducibly dependent on the primary key, since
both STATUS and CITY depend on S# alone.
13
An example
S# STATUS CITY P# QTYS1 20 London P1 300S1 20 London P2 200S1 20 London P3 400S1 20 London P4 200S1 20 London P5 100S1 20 London P6 100S2 10 Paris P1 300S2 10 Paris P2 400S3 10 Paris P2 200S4 20 London P2 200S4 20 London P4 400S4 20 London P5 400
The redundancy is obvious. E.g., every tuple
for supplier S1 shows that the city as London;
likewise, every tuple of city London shows the
status as 20.
14
Update related problems
The redundancies in FIRST lead to a bunch of
problems, when we try to apply various update
operations, such as INSERT, DELETE, and
UPDATE, on the table. For the moment, we
focus on the supplier-city redundancy.
1. We cannot insert the fact that a particular
supplier is located in a particular city until and
unless that supplier supplies at least one part,
since, unless that happens, there is no primary
key value for such a tuple.
15
2. By the same token, if we delete the only
tuple for a particular supplier, we delete not
only the shipment connecting that supplier to
a particular part, but also the information that
the supplier is located in a particular city.
3. The city value for a given supplier appears in
FIRST many times, in general. Then, if supplier
S1 moves over to Amsterdam, we are faced
with either the inconsistency problem, or we
have to search throughout the table to find
every tuple connecting S1 and London.
16
What to do?
The solution to these problems, is to replace
FIRST by the two tables:
SECOND {S#,STATUS,CITY}
and
SP {S#,P#,QTY}.
Below is the corresponding FD diagrams.
17
Sample values
S# STATUS CITYS1 20 LondonS2 10 ParisS3 10 ParisS4 20 LondonS5 30 Athens
S# P# QTYS1 P1 300S1 P2 200S1 P3 400S1 P4 200S1 P5 100S1 P6 100S2 P1 300S2 P2 400S2 P2 200S2 P2 200S2 P4 300S2 P5 400
18
They are better!
All the aforementioned problems have been elim-
inated:
We can insert the information that S5 is lo-
cated in Athens, even though S5 does not cur-
rently supply any parts.
We can also delete the shipment connecting
S3 and P2 by deleting the appropriate tuple
from SP, w/o losing the information that S3
is located in Paris.
Finally, in the revised structure, the city for
a given supplier appears only once. Thus,the
S#-CITY redundancy has been eliminated.
19
The effect of decomposing FIRST into SECOND
and SP is to eliminate the dependencies that
were not irreducible.
More specifically, in FIRST, the attribute CITY
does not describe the entity identified by the
primary key, namely a shipment, it merely de-
scribes the supplier involved in that shipment.
Hence, we should not mix those two kinds of
information in the same table.
20
Second normal form
A table is in 2NF iff it is in 1NF and every
non-key attribute is irreducibly dependent on
the primary key.
Both SECOND and SP are in 2NF, but FIRST is
not, since the two non-key attributes STATUS
and CITY depend on S# alone, while in that
case, the primary key is {S#,P#}.
A table that is in 1NF, but not in 2NF can
always be reduced to an equivalent collection
of 2NF tables, by replacing the 1NF table with
suitable projections.
21
The process
Let R(A, B, C, D) be a table, and {A, B} be its
primary key, and the FD A → D holds. Then,
it can be decomposed into the following two
tables: R1(A, D) with A as its primary key; and
R2(A, B, C) with {A, B} as its primary key, and
A as its foreign key referencing R1.
For example, given the {S#, Status, City, P#,
QTY} table, since {S#, P#} is the primary key,
thus, A ≡ S#, and B ≡ P#. We also have
the FD: S# → {City, Status}, thus, D ≡ {City,
Status}, and finally, C ≡ QTY.
Thus, by the process, we have two tables, {S#,City, Status} and {S#, P#, QTY}.
22
It is not good enough!
Below are the corresponding FD diagrams for
those two tables.
Although table SP is actually in 3NF, thus sat-
isfactory for now; the FD diagram for SECOND
is still more complex than a 3NF diagram.
More specifically, the dependency of STATUS on
S#, is transitive (via CITY.) Such dependencies
lead to update problems.
23
What do you mean?
1. We cannot insert the fact that a particular
city has a status until we have some supplier
actually located in that city.
2. Similarly, if we delete the only tuple for a
particular city, we delete both the information
for that supplier concerned and the information
that a specific city has a status.
3. Again, the status for a given city appears
many ties, which leads to....
S# STATUS CITYS1 20 LondonS2 10 ParisS3 10 ParisS4 20 LondonS5 30 Athens
24
One step further
Again the solution is to replace the originalSECOND with two projections, SC {S#, CITY}, andCS {CITY, STATUS}, with the following diagram
and the corresponding values:
S# CITYS1 LondonS2 ParisS3 ParisS4 LondonS5 Athens
CITY STATUSAthens 30London 20Paris 10Rome 50
25
Third normal form
It is clear to see that the revised structure over-
comes all the problems, and the effect of the
projection is to eliminate the transitive depen-
dence of, in this case, STATUS on S#. Again,
STATUS does not describe the whole tuple in
SECOND, it only describes the supplier.
In general, a table is in 3NF iff it is in 2NF
and every non-key attribute is non transitively
dependent on the primary key. both SC and CS
are in 3NF, and SECOND is not.
Again, any table that is in 2NF but not in 3NF
can be converted to an equivalent collection of
3NF tables, via projections.
26
The process
Let R(A, B, C, D) be a table, and let A be its
primary key, and the FD B → C holds. Then,
it can be decomposed into the following two
tables: R1(B, C) with B as its primary key; and
R2(A, B, D) with A being its primary key, and
B as its foreign key referencing R1.
Given Second{S#, City, Status}, we immedi-
ately have that A ≡ S#. Moreover, since City
→ Status, we have that B ≡ City, C ≡ Status
and D ≡ ∅.
Now the process says that we should split the
table into R1(City, Status) and R2(S#, City).
27
An alternative
Start with the table First, we apply a process
to split it into Second and SP, both in 2NF,
then apply another process to convert Second
into two tables, both in 3NF. We can also do
it another way.
Given FIRST {S#,STATUS,CITY,P#,QTY}, we have
that A ≡ {S# P#}. Since CITY → STATUS, we
have that B ≡ City and C ≡ Status. Finally, D
≡ QTY.
Then, we have that R1 : {CITY, STATUS}, and
R2 : {S#,P#,CITY, QTY}.
In R2, since S# → CITY. CITY is not irreducibly
dependent on {S#,P#}. Hence, R2 is not in
2NF. Using the previous process, R2 can be
further decomposed into R21 : {S#, CITY}, and
R22 : {S#,P#,QTY}.28
Dependency preservation
During the reduction process, it is often the
case that a given table can be decomposed,
without losing any information, in a variety
of different ways. Consider the table SECOND
{S#,STATUS,CITY} , with FDs S# → CITY and
CITY → STATUS, thus, S# → STATUS, by tran-
sitivity. Besides the given decomposition A:
SC {S#, CITY}, and CS {CITY, STATUS}; we can
also put into decomposition B SC {S#, CITY},and SS {S#, STATUS}. (?)
Both of them are non lossy, and in 3NF. But,
decomposition B is less satisfactory, since it is
still not possible to insert the information that
a particular city has a particular status unless
some supplier is located in that city.
29
Dig a bit deeper
We may observe that the two projections in the
decomposition A are independent of one an-
other, in the sense that updates can be made
in either one w/o regarding for another, pro-
vided that update is legal within the context
of the projection concerned.
In decomposition B, updates to either of the
two projections must be monitored to ensure
that the FD: CITY → STATUS is not violated.
Thus, the two constraints depend on each other.
30
Big deal
The essence is that in decomposition B, the
aforementioned FD has become a database
constraint that relates two tables. On the
other hand, for decomposition A, it is the tran-
sitive FD:S# → STATUS that becomes a database
constraint, which will be automatically enforced
as long as the two separate table constraints
S# → CITY and CITY → STATUS are enforced.
The latter is much easier to achieve, since we
only need to enforce two table constraints.
31
You have a choice
Thus, the concept of independent projectionsprovides a guideline for choosing a particulardecomposition when there is more than onepossibility.
It has been shown that projections R1 and R2of a table R are independent iff 1) every FDin R is implied by those in R1 and R2; and 2)the common attributes of R1 and R2 form acandidate key for at least one of the pair.
For example, in A, the common attribute CITY
constitutes the primary key for CS; and everyFD in SECOND either appears in one of the twoprojections, or logically follows from those thatdo. For B, the FD CITY → STATUS cannot bededucted from the FDs of those projections.
There is an algorithm that can decompose anytable, without losing information, into a set ofindependent 3NF projections.
32
Boyce/Codd normal form
Now, we drop our assumption that every table
has just one candidate key and consider the
general case, for which we define another nor-
mal form. Recall that the term determinant
refers to the left-hand side of an FD. Also, by
a trivial FD, we mean an FD whose right-hand
side is a subset of its left-hand side.
A table is in BCNF iff every determinant of
a nontrivial, left-irreducible FD is a candidate
key. Thus, for such tables, the only arrows in
their diagrams come out of candidate keys.
33
It is clear that FIRST is not in BCNF, since two
determinants S# and CITY are not candidate
keys. Similarly, SECOND is also not in BCNF.
On the other hand tables SP, SC, and CS are
all in BCNF, since in each case, the sole can-
didate key is the only determinant in the table.
Let’s consider another example, involving two
disjoint candidate keys. Suppose that in the
usual suppliers table S, S# and SNAME are both
candidate keys. Assume, however, STATUS and
CITY are now mutually independent. Thus, we
now have the following diagram for their FDs.
Table S is in BCNF, since the only determi-
nants are candidates keys.
34
Example continued...
Now, we consider some examples in which the
candidate keys overlap, i.e., they share at least
one attribute. Again, we assume that supplier
names are unique, and we consider SSP{S#,SNAME, P#, QTY}, with {S#, P#}, and {SNAME,P#}, as candidate keys. Since both S# and
SNAME are determinants (?) and they happen
not to be candidate key by themselves, SSP is
not in BCNF.
S# SNAME P# QTYS1 Smith P1 300S1 Smith P2 200S1 Smith P3 400S1 Smith P4 200
It is clear that SSP has redundancies and will
suffer from update anomalies, as well.
35
Yet another solution
So, we break SSP into two projections: SS {S#,SNAME}, and SP {S#, P#, QTY}; or alternatively,
SS {S#, SNAME}, and SP {SNAME, P#, QTY}. Both
of them are in BCNF.
Common sense tells us that the latter structure
is much preferable to SSP, which is supported
by the functional dependency theory and the
BCNF theory.
Homework: Do either Exercise 12.3, or 12.4.
36