DatabaseUnderstanding Database Normalization

Understanding Database Normalization

Normalization is a very basic and important concept to consider when designing a workable relational schema in the database. The idea was proposed by E.F. Codd in 1972; since then, it has been the cornerstone of every relational database design. In fact, any schema, if it is working, must follow certain rules of it. It basically takes a relational schema through a series of test to minimize redundancy and insertion, deletion, and update anomalies by decomposing into smaller relations.

Before We Begin…

Before we begin with the rules of normalization and applying them, we must understand the following concepts.

Redundancy

One of the primary considerations for table design is to minimize storage space requirements. The tables should be designed in such a manner that very minimal data is repeatedly kept in one or more table. Storing redundant data not only takes more space but also leads to more serious problems.

EMPLOYEE_DEPARTMENT
SSN (PK) NAME ADDRESS DNO DNAME MGR_SSN
123456789 Wiles, A. 123, Fondren, Houston, TX 5 R&D 234567890
234567890 Newton, I. 234, Voss, Houston, TX 5 R&D 234567890
345678901 Turing, A 2121, Castle, Spring, TX 4 Administration 234567890
456789012 Gauss, C.F. 786, Berry, Bellaire, TX 4 Administration 234567890
234567890 Euler, L. 123, Fondren, Houston, TX 5 R&D 234567890
678901234 Rieman, G.F.B 980, Dallas, Houston, TX 1 Headquarters 678901234

Table 1: Employees working in different departments showing redundant data

Observe how DNO and DNAME are repeatedly stored in the table. This type of redundant data leds to updation, insertion, and deletion anomalies.

Insertion Anomaly

If we want to insert new employee who has not been assigned any department, the attribute values of the department for that particular employees has to be kept null. This is clearly a waste of space. Further, if we insert a new employee for a department, say, 4, the other attributes of the department have to be consistent. For example, department 4, its DNAME must be ‘Administration‘ and MGR_SSN must be ‘234567890‘.

Updation Anomaly

If we change the value of one of the departments by, say, changing its DNAME or MGR_SSN, we also must update the value of all employees who work in that department. Otherwise, the database will be in an inconsistent state.

Deletion Anomaly

Suppose we delete an employee from the database, for example, the last employee in the above table, who is the sole representative of DNO=1, then all the information about the department is also lost. This is ridiculous because we want to delete an employee information, not the whole department.

Functional Dependency

Functional dependency is the base of normalization. It states the interdependency of attributes in a table. If we know the SSN of particular employee, we can find out the address of that employee. This means that the attribute address is functionally determined by SSN. Symbolically, we can write:

{ SSN } → { ADDRESS }

Similarly,

{ SSN } → { ENAME, ADDRESS }
{ SSN, DNO } → { MGR_SSN}

When one or more attributes uniquely identifies a row, that attribute is called the primary key.

Normalization Forms

Rules of normalization, when applied to a table, minimize the problem areas making the table level-up into a consistent state especially during the insertion, updation, and deletion processes. The first normal form or, 1NF, is the first rule, and so on. Let’s have a closer look at the rules.

First Normal Form (1NF)

Based on the attribute atomicity, the first normal form essentially states that we should not put more than one attribute values in a single domain. For example, in the following table, more than one PHONES are attributed to a person, that too within a single domain. This is a pretty bad design.

EMPLOYEE_PHONE
SSN (PK) NAME PHONES
123456789 Wiles, A. 1122334455, 3344556677, 6677889944
234567890 Newton, I. 3399118822, 3399554773
345678901 Turing, A 2266001993
456789012 Gauss, C.F. 4466577853

Instead, what we should do is as follows

EMPLOYEE_PHONE
SSN (PK) NAME PHONE1 PHONE2 PHONE3
123456789 Wiles, A. 1122334455 3344556677 6677889944
234567890 Newton, I. 3399118822 3399554773 NULL
345678901 Turing, A 2266001993 NULL NULL
456789012 Gauss, C.F. 4466577853 NULL NULL

There are too many NULL values; moreover, we have no way to add another phone number. We can do better by decomposing the table into two, as follows.

EMPLOYEE_NAME
SSN (PK) NAME
123456789 Wiles, A.
234567890 Newton, I.
345678901 Turing, A
456789012 Gauss, C.F.
EMPLOYEE_PHONE
SSN (PK) PHONES (PK)
123456789 1122334455
123456789 3344556677
123456789 6677889944
234567890 3399554773
234567890 3399118822
345678901 2266001993
456789012 4466577853

Second Normal Form (2NF)

The second normal form is based on the concept of full functional dependency apart from the fact that the table must be in 1NF. Here, we must remove all non-key attributes that are not completely dependent on the primary key. For example,

EMPLOYEE_PROJECT
SSN (PK) PROJECT_NO (PK) EMPLOYEE_NAME PROJECT_NAME PROJECT_HOURS

In the preceding table:

{ SSN } → { EMPLOYEE_NAME }
{ SSN } → { PROJ_HOURS }

Also,

{ PROJECT_NO } → { PROJECT_NAME }
{ PROJECT_NO } → { PROJECT_HOURS }

This is clearly a violation of 2NF, because the attributes PROJECT_HOURS and PROJECT_NAME are functionally dependent on the PROJECT_NO, individually. Also, EMPLOYEE_NAME and PROJ_HOURS are uniquely determined by SSN. What we can do is decompose the table into the following tables.

SSN (PK) PROJECT_NO (PK) PROJECT_HOURS
SSN (PK) EMPLOYEE_NAME
PROJECT_NO (PK) PROJECT_NAME

Third Normal Form (3NF)

As should be obvious, for a table to be in 3NF, first it must be in 2NF and the core concept behind it is that a table must not hold transitive dependency of attributes. Transitive dependency is: X → Y, Y → Z, X → Z. That means any non-key field must not depend on a field that is not a primary key. For example,

SSN (PK) EMPLOYEE_NAME BIRTH_DATE DEPT_NAME DEPT_ADDRESS

Here,

{ SSN } → { EMPLOYEE_NAME }
{ SSN } → { BIRTH_DATE }
{ SSN } → { DEPT_NAME }
{ SSN } → { DEPT_ADDRESS }

However, the anomaly is

{ DEPT_NAME } → { DEPT_ADDRESS }

whereas DEPT_NAME is a non-key. We can remove this problem by decomposing the table as follows.

SSN (PK) EMPLOYEE_NAME BIRTH_DATE DEPT_NAME
DEPT_NAME (PK) DEPT_ADDRESS

Boyce Codd Normal Form (BCNF)

BCNF is almost equivalent to 3NF for most tables, but is a little stricter than 3NF. Every table in BCNF is also in 3NF but not the other way around.

BCNF states that whenever a nontrivial functional dependency X Y holds that means X is the primary key.

Let us understand by comparing tables where some of them are in 3NF as well as in BCNF and another table, which is in 3NF but not in BCNF.

EMPLOYEE
SSN (PK) EMPLOYEE_NAME BIRTH_DATE
{ SSN } → { EMPLOYEE_NAME }
{ SSN } → { BIRTH_DATE }

In the previous table, the primary key for the table is SSN. This is the only nontrivial functional dependency that holds on the EMPLOYEE table have SSN on the left side of the arrow. Because SSN is the primary key, functional dependency does not violate BCNF.

<bPROJECT
PROJECT_NO (PK) PROJECT_NAME PROJECT_DURATION
{ PROJECT_NO } → { PROJECT_NAME }
{ PROJECT_NO } → { PROJECT_DURATION }

Similarly, the previous table also is in BCNF.

PROJECT_INFO
DEPT_NO SSN PROJECT_NO DURATION
{ DEPT_NO, SSN } → { PROJECT_NO, DURATION }
{ PROJECT_NO } → { DURATION, DEPT_NO }

However, the previous table is not in BCNF, because PROJECT_NO is not a primary key. We cannot pair rows representing two different SSNs working in the same PROJECT_NO and DEPT_NO; for example:

[ D1, SSN1, P1, 22Hrs ]
[ D1, SSN2, P1, 22Hrs ]

The functional dependency PROJECT_NO → DURATION is nontrivial. Therefore, the table does not satisfy the definition of BCNF. We can eliminate this problem by redesigning the table such that all decomposed tables hereafter are in BCNF, as follows:

DEPT_NO PROJECT_NO DURATION
SSN PROJECT_NO

This decomposition is lossless-join decomposition.

Conclusion

Normalization goes further to 4NF, 5NF, and DKNF (Domain Key Normal Form). The four types of normalization discussed in this article are primary for most database design. Tables can be constricted further with the next levels of normalization, yet in practice they may not be feasible. The disadvantage of imposing stricter rules is that tables may be decomposed further into smaller relations. As a result, even a trivial database searches may require extensive join operations. Too many join operations are expensive and degrade performance. However, these basic four levels of normalization are not only achievable but also desirable in most cases.

Get the Free Newsletter!

Subscribe to Developer Insider for top news, trends & analysis

Latest Posts

Related Stories