Comparing database normalization and denormalization

Let's now take a deep dive and understand the fundamental difference between normalization and denormalization.

Normalization

Database normalization is a design process that aims to reduce data redundancy and maintain data integrity. This systematic approach removes redundant data and undesirable characteristics such as insertion, update, and deletion anomalies. It achieves this by breaking larger tables down into smaller ones and linking them through relationships so that data is stored logically.

Data normalization is necessary to eliminate the following anomalies.

Insertion anomalies

An insertion anomaly occurs in a relational database when we are unable to insert a row because some required attribute is missing. This commonly happens when a column, such as a foreign key, cannot be NULL but has no value available.

For instance, suppose customer details are stored only alongside orders, with the customer ID embedded in each order row. A customer who has not yet ordered anything cannot be recorded at all, because there is no order row to attach them to.
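To make this concrete, here is a minimal sketch of the problem, assuming a single combined customer_orders table (the table and column names are illustrative, not taken from the book's figures):

```sql
-- Hypothetical combined table: customer details live only inside order rows.
CREATE TABLE customer_orders (
    order_id              INT NOT NULL PRIMARY KEY,
    customer_id           INT NOT NULL,
    customer_name         VARCHAR(100),
    customer_phone_number VARCHAR(20),
    product_name          VARCHAR(100)
);

-- This insert fails: we cannot register a new customer without an order,
-- because order_id is NOT NULL and serves as the primary key.
INSERT INTO customer_orders (customer_id, customer_name)
VALUES (4, 'Maria');
```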

Update anomalies

An update anomaly occurs when data is only partially updated in a database. This commonly happens when the data is not normalized and the same piece of information is stored in more than one place. Because these copies are not updated automatically, each one must be updated manually.

This process is time-consuming and error-prone, as it requires finding every copy of the data and updating each one by hand.

For instance, suppose we have a Customers table with five columns, including Customer Phone Number and Customer Address. If a customer's address or phone number changes, we must update the table. However, if the table is not normalized, a single customer may have multiple rows, and an update anomaly arises if any one of those rows is overlooked.
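Continuing the illustrative customer_orders sketch from above, the update must touch every copy of the customer's data; missing any row leaves the table inconsistent:

```sql
-- Correct: update every row that stores Adam's phone number.
UPDATE customer_orders
SET customer_phone_number = '485-111-0000'
WHERE customer_id = 1;

-- Anomaly: updating only one of Adam's rows leaves his other rows
-- holding the stale number.
UPDATE customer_orders
SET customer_phone_number = '485-111-0000'
WHERE customer_id = 1
  AND order_id = 1;
```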

Delete anomalies

A delete anomaly occurs when deleting one piece of data unintentionally removes other, unrelated information that was stored in the same row. This can result in the loss of crucial data.

For example, let's consider a customer named Adam who ordered one product, a phone. If the order is canceled and we delete that row from the order table, we also lose the only record of Adam and of the product (the phone).
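In the same illustrative sketch, canceling Adam's only order erases everything we know about him:

```sql
-- Deleting Adam's only order also deletes the only record of Adam
-- and of the product he bought (a delete anomaly).
DELETE FROM customer_orders
WHERE customer_id = 1
  AND order_id = 1;
```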

Hence, to avoid these anomalies, we need to normalize the data.

Types of normalization

There are four commonly used normal forms:

  • First Normal Form (1NF)
  • Second Normal Form (2NF)
  • Third Normal Form (3NF)
  • Boyce–Codd Normal Form (BCNF)

Let’s explore each of them.

1NF

In 1NF, each attribute in the table should contain only atomic values. This means that each cell in the table should hold only one value, and the intersection of rows and columns should not contain multiple values or repeating groups.

For example, in the following table, we have details of the products purchased by customers.

| Customer_ID | Customer_Name | Products    |
|-------------|---------------|-------------|
| 1           | Adam          | Phone, Pen  |
| 2           | James         | Car, iPod   |
| 3           | Paul          | Laptop, Cup |

Figure 1.22 – Non-normalized product table

We can see that the Products attribute holds information related to multiple products purchased by the customer and is therefore not normalized. This is because it contains more than one value in a single cell.

Hence, this can be normalized in the following way in 1NF.

| Customer_ID | Customer_Name | Products |
|-------------|---------------|----------|
| 1           | Adam          | Phone    |
| 1           | Adam          | Pen      |
| 2           | James         | Car      |
| 2           | James         | iPod     |
| 3           | Paul          | Laptop   |
| 3           | Paul          | Cup      |

Figure 1.23 – 1NF normalized product table

As we can see, each cell now holds only one value at the intersection of a column and row. Therefore, the data has been normalized to 1NF.
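As a rough sketch, the 1NF design in Figure 1.23 could be created and loaded like this (table and column names are illustrative):

```sql
-- 1NF: one atomic product value per row, no repeating groups.
CREATE TABLE customer_products (
    customer_id   INT NOT NULL,
    customer_name VARCHAR(100),
    product_name  VARCHAR(100) NOT NULL,
    PRIMARY KEY (customer_id, product_name)
);

INSERT INTO customer_products (customer_id, customer_name, product_name) VALUES
    (1, 'Adam',  'Phone'),
    (1, 'Adam',  'Pen'),
    (2, 'James', 'Car'),
    (2, 'James', 'iPod'),
    (3, 'Paul',  'Laptop'),
    (3, 'Paul',  'Cup');
```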

2NF

There are two rules that must be followed to normalize a table into 2NF:

  • Rule 1 – the table first has to be in 1NF.
  • Rule 2 – the table should not have any partial dependency. This means that every non-prime attribute (every attribute that is not part of the primary key) must depend on the whole primary key, not just part of it.

Whenever a table holds data for two different entities rather than one, each entity needs to be split out into its own table.

For example, the following table has a composite primary key made up of Customer_ID and Order_ID. However, some non-prime attributes depend on only part of that key: Order_Status, for instance, depends only on Order_ID. Therefore, the table must be split in two; one table will hold the customer details, while the other will hold the order details.

| Customer_ID | Customer_Name | Customer_Phone_Number | Order_ID | Order_Status |
|-------------|---------------|-----------------------|----------|--------------|
| 1           | Adam          | 485-000-9890          | 1        | In Progress  |
| 1           | Adam          | 585-000-9890          | 2        | Delivered    |
| 2           | James         | 685-000-9890          | 3        | In Progress  |
| 2           | James         | 785-000-9890          | 4        | In Progress  |
| 3           | Paul          | 885-000-9890          | 5        | Delivered    |
| 3           | Paul          | 985-000-9890          | 6        | Delivered    |

Figure 1.24 – Customer details table

The table depicted in Figure 1.24 has been decomposed into two separate tables that satisfy 2NF.

| Customer_ID | Customer_Name | Customer_Phone_Number |
|-------------|---------------|-----------------------|
| 1           | Adam          | 485-000-9890          |
| 1           | Adam          | 585-000-9890          |
| 2           | James         | 685-000-9890          |
| 2           | James         | 785-000-9890          |
| 3           | Paul          | 885-000-9890          |
| 3           | Paul          | 985-000-9890          |

Figure 1.25 – Customers table, which holds customer details

| Order_ID | Order_Status |
|----------|--------------|
| 1        | In Progress  |
| 2        | Delivered    |
| 3        | In Progress  |
| 4        | In Progress  |
| 5        | Delivered    |
| 6        | Delivered    |

Figure 1.26 – Orders table
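A minimal sketch of this 2NF split in SQL follows. Note that the customer_id column in the orders table is an assumption not shown in the figures; in practice you would keep such a foreign key so the two tables remain joinable:

```sql
-- Customers: non-key attributes depend on customer_id alone.
CREATE TABLE customers (
    customer_id           INT NOT NULL,
    customer_name         VARCHAR(100),
    customer_phone_number VARCHAR(20)
);

-- Orders: order_status depends on order_id alone.
CREATE TABLE orders (
    order_id     INT NOT NULL PRIMARY KEY,
    order_status VARCHAR(20),
    customer_id  INT NOT NULL  -- assumed link back to customers
);

-- Reassemble the combined view of Figure 1.24 when needed:
SELECT c.customer_id, c.customer_name, c.customer_phone_number,
       o.order_id, o.order_status
FROM customers AS c
JOIN orders    AS o ON o.customer_id = c.customer_id;
```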

3NF

3NF ensures a reduction in data duplication, thereby maintaining data integrity by following these rules:

  • Rule 1 – the table has to be in 1NF and 2NF.
  • Rule 2 – the table should not have transitive functional dependencies. This means that non-prime attributes, which are attributes that are not part of the candidate key, should not be dependent on other non-prime attributes within the table.

In the following table, Customer_ID determines Product_ID, and Product_ID determines Product_Name. Hence, Customer_ID determines Product_Name through Product_ID. This means the table has a transitive dependency and is not in 3NF.

Therefore, the table has been divided into two separate tables to achieve 3NF.

| Customer_ID | Customer_Name | Product_ID | Product_Name | Customer_Phone_Number |
|-------------|---------------|------------|--------------|-----------------------|
| 1           | Adam          | 1          | Phone        | 485-000-9890          |
| 1           | Adam          | 2          | Pen          | 585-000-9890          |
| 2           | James         | 3          | Car          | 685-000-9890          |
| 2           | James         | 4          | iPod         | 785-000-9890          |
| 3           | Paul          | 5          | Laptop       | 885-000-9890          |
| 3           | Paul          | 6          | Cup          | 985-000-9890          |

Figure 1.27 – Customer order table

| Customer_ID | Customer_Name | Customer_Phone_Number |
|-------------|---------------|-----------------------|
| 1           | Adam          | 485-000-9890          |
| 1           | Adam          | 585-000-9890          |
| 2           | James         | 685-000-9890          |
| 2           | James         | 785-000-9890          |
| 3           | Paul          | 885-000-9890          |
| 3           | Paul          | 985-000-9890          |

Figure 1.28 – Customers table

| Product_ID | Product_Name |
|------------|--------------|
| 1          | Phone        |
| 2          | Pen          |
| 3          | Car          |
| 4          | iPod         |
| 5          | Laptop       |
| 6          | Cup          |

Figure 1.29 – Products table

The non-prime attributes, that is, the attributes other than the primary key, now depend only on the primary key of their table and on nothing else. For instance, Customer_Name and Customer_Phone_Number depend solely on Customer_ID, the primary key of the Customers table. Similarly, Product_Name depends solely on Product_ID, the primary key of the Products table.
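As a sketch, the 3NF design might look like this in SQL. The purchases link table is an assumption added here to preserve which customer bought which product; it does not appear in the figures:

```sql
-- Products: product_name depends only on product_id (no transitive hop
-- through another non-key column).
CREATE TABLE products (
    product_id   INT NOT NULL PRIMARY KEY,
    product_name VARCHAR(100)
);

-- Assumed link table connecting customers to the products they bought.
CREATE TABLE purchases (
    customer_id INT NOT NULL,
    product_id  INT NOT NULL REFERENCES products (product_id)
);

-- Product names are now looked up through the key, not stored redundantly:
SELECT pu.customer_id, pr.product_name
FROM purchases AS pu
JOIN products  AS pr ON pr.product_id = pu.product_id;
```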

BCNF

BCNF is sometimes referred to as 3.5NF. Even if a table is in 3NF, anomalies may still be present if it has more than one candidate key.

Two rules must be followed for a table to be in BCNF:

  • Rule 1 – the table should be in 1NF, 2NF, and 3NF
  • Rule 2 – for every functional dependency A -> B, A must be a super key

Super key

A super key is any set of one or more columns whose values uniquely identify each row in a table.

Consider the following table:

| Student_ID | Subject            | Faculty     |
|------------|--------------------|-------------|
| Student_1  | Cloud Computing    | Professor A |
| Student_2  | Big Data           | Professor B |
| Student_3  | Statistics         | Professor C |
| Student_4  | Project Management | Professor D |
| Student_5  | Analytics          | Professor E |

Figure 1.30 – Student-subject table

In the table depicted in Figure 1.30, a student can study multiple subjects in one semester, and each subject has exactly one professor assigned to it. The primary key is the combination of Student_ID and Subject, since together they uniquely identify each row and let us retrieve both faculty and student information. In addition, there is a functional dependency between the Subject and Faculty columns: because each subject has one assigned professor, Subject determines Faculty.

There are a few key points to understand here:

  • The preceding table is in 1NF as every cell holds a single atomic value
  • The preceding table is in 2NF as there is no partial dependency
  • The preceding table is also in 3NF as there is no transitive dependency

This table does not satisfy the BCNF condition because of the dependency of Faculty on Subject: the determinant, Subject, is a prime attribute but not a super key, while Faculty is a non-prime attribute.

How to fix this

A new super key called Faculty_ID will be introduced, and the preceding table will be split into two different tables, as follows.

| Student_ID | Faculty_ID  |
|------------|-------------|
| Student_1  | Faculty_ID1 |
| Student_2  | Faculty_ID2 |
| Student_3  | Faculty_ID3 |
| Student_4  | Faculty_ID4 |
| Student_5  | Faculty_ID5 |

Figure 1.31 – Student-faculty mapping table

| Faculty_ID  | Faculty     | Subject            |
|-------------|-------------|--------------------|
| Faculty_ID1 | Professor A | Cloud Computing    |
| Faculty_ID2 | Professor B | Big Data           |
| Faculty_ID3 | Professor C | Statistics         |
| Faculty_ID4 | Professor D | Project Management |
| Faculty_ID5 | Professor E | Analytics          |

Figure 1.32 – Faculty table

Now, the Faculty_ID super key has eliminated the functional dependency on a non-prime attribute, and BCNF is satisfied.
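A sketch of this BCNF decomposition in SQL (names are illustrative):

```sql
-- Faculty: faculty_id determines both the professor and the subject,
-- and faculty_id is the key, so every determinant is a super key.
CREATE TABLE faculty (
    faculty_id VARCHAR(20) NOT NULL PRIMARY KEY,
    faculty    VARCHAR(100),
    subject    VARCHAR(100)
);

-- Student-faculty mapping: each row is identified by its composite key.
CREATE TABLE student_faculty (
    student_id VARCHAR(20) NOT NULL,
    faculty_id VARCHAR(20) NOT NULL REFERENCES faculty (faculty_id),
    PRIMARY KEY (student_id, faculty_id)
);
```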

Denormalization

Before delving into denormalization, let’s first quickly review normalization, which we just discussed. Normalization, in simple terms, involves segmenting each piece of data into the appropriate bucket, which helps maintain data integrity, eliminates redundancy, and ensures easy updates to the data in the future.

However, the drawback of using normalized data is that it may take a longer time to retrieve the data as the data has to be queried from different tables where it is stored.

Denormalization comes into play when there is a trade-off between organized data and faster data retrieval. It is the opposite of normalization. In simple terms, denormalization involves storing the same data in multiple locations, which increases redundancy but decreases data retrieval time, making it faster.

Denormalization is used when there are performance issues with normalized data or when faster data retrieval is needed.

For example, let us consider the Customers and Orders tables. Both are currently normalized: there is no redundancy, each piece of data is housed in only one table, and data integrity is maintained through the primary key and foreign key relationship.

Figure 1.33 – Normalized tables

In the following figure, we have two new tables: the customer census table and the IT incidents table. Let's take a closer look at the type of data that each of these tables contains:

  • customer_census: This table contains statistics about each customer, including the number of orders placed, the total amount spent on orders, the number of incidents raised due to complaints, and the total number of complaint calls made by the customer.
  • IT incidents: This table holds one row per ticket raised by a customer, along with the related order, the customer, and the customer service agent working on the ticket.

Both of these tables are denormalized, as customer and order information has been copied into them. There is no need to join them back to the Customers or Orders tables, because they already contain all the necessary information, so retrieving data from them is faster than querying the normalized tables.
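As a hedged sketch, a denormalized summary table such as customer_census could be populated along these lines. The orders and it_incidents source tables and their customer_id columns are assumptions based on the description above:

```sql
-- Build a denormalized per-customer summary so later reads need no joins.
CREATE TABLE customer_census AS
SELECT
    c.customer_id,
    c.customer_name,
    (SELECT COUNT(*)
     FROM orders AS o
     WHERE o.customer_id = c.customer_id) AS total_orders,      -- orders placed
    (SELECT COUNT(*)
     FROM it_incidents AS i
     WHERE i.customer_id = c.customer_id) AS total_incidents    -- complaints raised
FROM customers AS c;
```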

Interesting read

These types of tables are utilized by e-commerce companies to monitor customer service metrics and enhance customer service on a month-over-month (MoM), quarter-over-quarter (QoQ), and year-over-year (YoY) basis.

A few of the main key performance indicators (KPIs) to track in this area are as follows:

  • Customer call frequency over a period of time
  • Number of incidents raised over a period of time
  • Major dispute reasons

These two tables contain redundant data, such as customer and order information repeated multiple times. However, the data can now be retrieved from a single table, making the process faster. This is a clear example of denormalized tables.

Whenever order details are modified, the changes must also be reflected in both of the new tables to maintain data integrity. This duplication of updates is one of the disadvantages of denormalized tables.

Figure 1.34 – Denormalized tables

When to apply denormalization

Some situations in which it may be appropriate to apply denormalization are as follows:

  • Performance improvement: One of the main reasons to use denormalization is to improve query performance. Because the data a query needs has already been copied into a single table, it can be read without joining many tables, so denormalized tables often perform better (see the sketch after this list).
  • Maintaining history: Keeping historical copies of data in separate tables helps with data validation. For example, if we have built a dashboard using these tables and want to look back at the history and validate it in the future, denormalized tables can help with that validation.
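Here is a sketch of the performance trade-off mentioned in the list above, reusing the illustrative tables from earlier: the normalized answer requires a join at query time, while the denormalized answer is a single-table read:

```sql
-- Normalized: join and aggregate at query time.
SELECT c.customer_name, COUNT(*) AS total_orders
FROM customers AS c
JOIN orders    AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_name;

-- Denormalized: the same answer, precomputed in one table.
SELECT customer_name, total_orders
FROM customer_census;
```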

Disadvantages of denormalization

Here are some of the disadvantages of denormalization:

  • Expensive updates and inserts: Whenever data is updated in one table, it needs to be updated in other related tables as well. For instance, if customer or order data is updated in one table, it must also be updated in all other tables where it is stored. This process can be costly and time-consuming.
  • Expensive storage: Storing copies of the same data in multiple tables requires more storage space; for instance, the same customer and order details may be repeated across several tables.
  • Data inconsistency: This can occur when data is not updated correctly, leading to discrepancies and errors in the data.

With that, we have reached the end of the chapter. Let's summarize everything we have learned so far.
