Data Wrangling with SQL
Let's now dive deep and understand the fundamental difference between normalization and denormalization.
Normalization is a database design process that aims to reduce data redundancy and maintain data integrity. This systematic approach removes redundancy and undesirable characteristics, such as insertion, update, and deletion anomalies, by breaking larger tables down into smaller ones and linking them through relationships so that data is stored logically.
Data normalization is necessary to eliminate the following anomalies.
An insertion anomaly occurs in relational databases when we cannot insert a record because an attribute that is not allowed to be NULL has no value yet. This is common when a single table stores data about more than one entity.
For instance, suppose customer details are kept only in the Orders table, where every row must have an order. A customer who has not yet ordered anything cannot be recorded at all, because there is no order to attach them to.
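This scenario can be sketched with Python's built-in sqlite3 module, which runs real SQL. The schema and the new customer are illustrative, not taken from the book's figures: because every row must carry an order, a customer without one cannot be stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_orders (
        customer_id   INTEGER NOT NULL,
        customer_name TEXT    NOT NULL,
        order_id      INTEGER NOT NULL  -- every row must describe an order
    )
""")
# Trying to record a new customer who has not ordered anything yet:
try:
    conn.execute(
        "INSERT INTO customer_orders VALUES (?, ?, ?)",
        (4, "Dana", None),  # hypothetical customer with no order to attach
    )
    inserted = True
except sqlite3.IntegrityError:
    inserted = False  # the NOT NULL constraint on order_id blocks the row
print(inserted)  # False: the customer cannot be stored without an order
```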
An update anomaly occurs when we can only partially update data in a database. This is common when data is not normalized and the same piece of data is therefore stored in more than one place. Because these copies are not updated automatically, each location must be found and updated manually, which is time-consuming, inefficient, and error-prone.
For instance, suppose we have a Customers table with five columns, including Customer Phone Number and Customer Address. If a customer’s address or phone number changes, we must update the table. However, if the table is not normalized, a single customer may have multiple entries, and updating all of them could result in an update anomaly if one of them is overlooked.
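A small sqlite3 sketch (illustrative data) shows the mechanics: with the customer duplicated across rows, one logical change must touch every copy, and missing any of them leaves the data inconsistent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER, customer_name TEXT, phone TEXT)"
)
# The same customer appears twice, so their phone number is stored twice
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Adam", "485-000-9890"), (1, "Adam", "485-000-9890")],
)
# A single phone-number change must update every duplicate row
cur = conn.execute(
    "UPDATE customers SET phone = '111-000-0000' WHERE customer_id = 1"
)
print(cur.rowcount)  # 2 rows had to change for one logical update
```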
A deletion anomaly occurs when important information is unintentionally removed along with other data, resulting in the loss of crucial data.
For example, consider a customer named Adam who ordered one product, a phone. If the order is cancelled and we delete that row from the Orders table, the only record of the product (the phone) is deleted along with it.
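The same loss can be demonstrated with sqlite3 (illustrative schema): when product details live only inside order rows, cancelling the order erases the product too.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Product details live only inside the order rows (illustrative schema)
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_name TEXT, product TEXT)"
)
conn.execute("INSERT INTO orders VALUES (1, 'Adam', 'Phone')")
# Cancelling the order deletes the only record of the product as well
conn.execute("DELETE FROM orders WHERE order_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(remaining)  # 0: the 'Phone' product vanished with the cancelled order
```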
Hence, to avoid these anomalies, we need to normalize the data.
There are four normal forms that we will cover: first normal form (1NF), second normal form (2NF), third normal form (3NF), and Boyce-Codd normal form (BCNF).
Let’s explore each of them.
In 1NF, each attribute in the table should contain only atomic values. This means that each cell in the table should hold only one value, and the intersection of rows and columns should not contain multiple values or repeating groups.
For example, in the following table, we have details of the products purchased by customers.
| Customer_ID | Customer_Name | Products |
| --- | --- | --- |
| 1 | Adam | Phone, Pen |
| 2 | James | Car, iPod |
| 3 | Paul | Laptop, Cup |
Figure 1.22 – Non-normalized product table
We can see that the Products attribute holds information related to multiple products purchased by the customer and is therefore not normalized. This is because it contains more than one value in a single cell.
Hence, this can be normalized in the following way in 1NF.
| Customer_ID | Customer_Name | Products |
| --- | --- | --- |
| 1 | Adam | Phone |
| 1 | Adam | Pen |
| 2 | James | Car |
| 2 | James | iPod |
| 3 | Paul | Laptop |
| 3 | Paul | Cup |
Figure 1.23 – 1NF normalized product table
As we can see, each cell now holds only one value at the intersection of a column and row. Therefore, the data has been normalized to 1NF.
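To see the benefit in practice, here is a minimal sketch using Python's built-in sqlite3 module (the rows mirror the 1NF table; the column names are illustrative). With atomic values, filtering on a single product is a plain WHERE clause, which would not match a cell containing 'Phone, Pen'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_products (customer_id INTEGER, customer_name TEXT, product TEXT)"
)
# One atomic product value per row, as in the 1NF table
conn.executemany(
    "INSERT INTO customer_products VALUES (?, ?, ?)",
    [(1, "Adam", "Phone"), (1, "Adam", "Pen"), (2, "James", "Car"),
     (2, "James", "iPod"), (3, "Paul", "Laptop"), (3, "Paul", "Cup")],
)
# A simple equality filter now finds who bought a single product
buyer = conn.execute(
    "SELECT customer_name FROM customer_products WHERE product = 'Pen'"
).fetchone()[0]
print(buyer)  # Adam
```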
There are two rules that must be followed to normalize a table into 2NF: the table must already be in 1NF, and every non-prime attribute must depend on the whole primary key rather than on just part of a composite key (that is, there must be no partial dependencies).
Whenever a table holds data about two different entities instead of just one, each entity needs to be moved into its own table.
For example, the following table has a composite primary key consisting of Customer_ID and Order_ID. However, some non-prime attributes do not depend on the full key: Order_Status, for instance, depends only on Order_ID. Therefore, it is necessary to split the table into two separate tables, one containing the customer details and the other containing the order details.
| Customer_ID | Customer_Name | Customer_Phone_Number | Order_ID | Order_Status |
| --- | --- | --- | --- | --- |
| 1 | Adam | 485-000-9890 | 1 | In Progress |
| 1 | Adam | 585-000-9890 | 2 | Delivered |
| 2 | James | 685-000-9890 | 3 | In Progress |
| 2 | James | 785-000-9890 | 4 | In Progress |
| 3 | Paul | 885-000-9890 | 5 | Delivered |
| 3 | Paul | 985-000-9890 | 6 | Delivered |
Figure 1.24 – Customer details table
The table depicted in Figure 1.24 has been decomposed into two separate tables that satisfy 2NF.
| Customer_ID | Customer_Name | Customer_Phone_Number |
| --- | --- | --- |
| 1 | Adam | 485-000-9890 |
| 1 | Adam | 585-000-9890 |
| 2 | James | 685-000-9890 |
| 2 | James | 785-000-9890 |
| 3 | Paul | 885-000-9890 |
| 3 | Paul | 985-000-9890 |
Figure 1.25 – Customers table, which holds customer details
| Order_ID | Order_Status |
| --- | --- |
| 1 | In Progress |
| 2 | Delivered |
| 3 | In Progress |
| 4 | In Progress |
| 5 | Delivered |
| 6 | Delivered |
Figure 1.26 – Orders table
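The decomposition can be sketched with sqlite3 (schema and sample rows follow the Orders table in Figure 1.26). Because Order_Status now depends only on Order_ID, an order's status lives in exactly one row and can be changed in one place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, customer_name TEXT,
                            customer_phone_number TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, order_status TEXT);
    INSERT INTO orders VALUES (1, 'In Progress'), (2, 'Delivered');
""")
# Order_Status depends only on Order_ID, so each status lives in one row
conn.execute("UPDATE orders SET order_status = 'Delivered' WHERE order_id = 1")
status = conn.execute(
    "SELECT order_status FROM orders WHERE order_id = 1"
).fetchone()[0]
print(status)  # Delivered
```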
3NF reduces data duplication, thereby maintaining data integrity, by following these rules: the table must already be in 2NF, and no non-prime attribute may depend transitively on the primary key through another non-prime attribute.
In the following table, Customer_ID determines Product_ID, and Product_ID determines Product_Name. Hence, Customer_ID determines Product_Name through Product_ID. This means the table has a transitive dependency and is not in 3NF.
Therefore, the table has been divided into two separate tables to achieve 3NF.
| Customer_ID | Customer_Name | Product_ID | Product_Name | Customer_Phone_Number |
| --- | --- | --- | --- | --- |
| 1 | Adam | 1 | Phone | 485-000-9890 |
| 1 | Adam | 2 | Pen | 585-000-9890 |
| 2 | James | 3 | Car | 685-000-9890 |
| 2 | James | 4 | iPod | 785-000-9890 |
| 3 | Paul | 5 | Laptop | 885-000-9890 |
| 3 | Paul | 6 | Cup | 985-000-9890 |
Figure 1.27 – Customer order table
| Customer_ID | Customer_Name | Customer_Phone_Number |
| --- | --- | --- |
| 1 | Adam | 485-000-9890 |
| 1 | Adam | 585-000-9890 |
| 2 | James | 685-000-9890 |
| 2 | James | 785-000-9890 |
| 3 | Paul | 885-000-9890 |
| 3 | Paul | 985-000-9890 |
Figure 1.28 – Customers table
| Product_ID | Product_Name |
| --- | --- |
| 1 | Phone |
| 2 | Pen |
| 3 | Car |
| 4 | iPod |
| 5 | Laptop |
| 6 | Cup |
Figure 1.29 – Products table
Therefore, it is evident that non-prime attributes, which refer to attributes other than the primary key, are solely reliant on the primary key of the table and not on any other column. For instance, non-key attributes such as Customer_Name and Customer_Phone_Number are solely dependent on Customer_ID, which serves as the primary key for the Customers table. Similarly, non-key attributes such as Product_Name are exclusively dependent on Product_ID, which acts as the primary key for the Products table.
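A quick sqlite3 sketch of the Products table shows the payoff of removing the transitive dependency: renaming a product touches exactly one row, no matter how many customers bought it. The data follows Figure 1.29; the new product name is purely illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, "Phone"), (2, "Pen"), (3, "Car"), (4, "iPod"), (5, "Laptop"), (6, "Cup")],
)
# Renaming a product touches exactly one row ('Smartphone' is illustrative)
cur = conn.execute(
    "UPDATE products SET product_name = 'Smartphone' WHERE product_id = 1"
)
print(cur.rowcount)  # 1
```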
BCNF is sometimes referred to as 3.5NF. Even if a table is in 3NF, anomalies may still be present when the table has more than one candidate key.
Two rules must be followed for a table to be in BCNF: the table must already be in 3NF, and for every functional dependency X → Y, X must be a super key.
Consider the following table:
| Student_ID | Subject | Faculty |
| --- | --- | --- |
| Student_1 | Cloud Computing | Professor A |
| Student_2 | Big Data | Professor B |
| Student_3 | Statistics | Professor C |
| Student_4 | Project Management | Professor D |
| Student_5 | Analytics | Professor E |
Figure 1.30 – Student-subject table
In the table depicted in Figure 1.30, a student can study multiple subjects in one semester, and a subject can be taught by more than one professor, but each professor is assigned to exactly one subject. The primary key is the combination of Student_ID and Subject, as this uniquely identifies the faculty and student information in each row. All the earlier normal forms are satisfied, but BCNF is not, because there is a functional dependency between Faculty and Subject: knowing the professor determines the subject they teach.
The key point is that this table does not satisfy the BCNF condition: in the dependency Faculty → Subject, the determinant Faculty is a non-prime attribute and not a super key, while Subject is a prime attribute.
A new super key called Faculty_ID will be introduced, and the preceding table will be split into two different tables, as follows.
| Student_ID | Faculty_ID |
| --- | --- |
| Student_1 | Faculty_ID1 |
| Student_2 | Faculty_ID2 |
| Student_3 | Faculty_ID3 |
| Student_4 | Faculty_ID4 |
| Student_5 | Faculty_ID5 |
Figure 1.31 – Student-faculty mapping table
| Faculty_ID | Faculty | Subject |
| --- | --- | --- |
| Faculty_ID1 | Professor A | Cloud Computing |
| Faculty_ID2 | Professor B | Big Data |
| Faculty_ID3 | Professor C | Statistics |
| Faculty_ID4 | Professor D | Project Management |
| Faculty_ID5 | Professor E | Analytics |
Figure 1.32 – Faculty table
Now the Faculty_ID key has eliminated the problematic functional dependency, and the table satisfies BCNF.
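The decomposed design can be sketched with sqlite3: a student's subject is recovered by joining the mapping table to the faculty table through Faculty_ID (sample rows follow Figures 1.31 and 1.32).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student_faculty (student_id TEXT, faculty_id TEXT);
    CREATE TABLE faculty (faculty_id TEXT PRIMARY KEY, faculty TEXT, subject TEXT);
    INSERT INTO student_faculty VALUES ('Student_1', 'Faculty_ID1');
    INSERT INTO faculty VALUES ('Faculty_ID1', 'Professor A', 'Cloud Computing');
""")
# The student's subject is recovered through the Faculty_ID key
subject = conn.execute("""
    SELECT f.subject
    FROM student_faculty AS sf
    JOIN faculty AS f ON f.faculty_id = sf.faculty_id
    WHERE sf.student_id = 'Student_1'
""").fetchone()[0]
print(subject)  # Cloud Computing
```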
Before delving into denormalization, let’s first quickly review normalization, which we just discussed. Normalization, in simple terms, involves segmenting each piece of data into the appropriate bucket, which helps maintain data integrity, eliminates redundancy, and ensures easy updates to the data in the future.
However, the drawback of using normalized data is that it may take a longer time to retrieve the data as the data has to be queried from different tables where it is stored.
Denormalization comes into play when there is a trade-off between organized data and faster data retrieval. It is the opposite of normalization. In simple terms, denormalization involves storing the same data in multiple locations, which increases redundancy but decreases data retrieval time, making it faster.
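As a minimal sqlite3 sketch (the schema is illustrative, not from the book's figures), a denormalized orders table copies the customer's name into every order row, so reading it requires no join at the cost of storing the name repeatedly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Denormalized: the customer's name is copied into every order row
conn.execute("""
    CREATE TABLE orders_denorm (order_id INTEGER, customer_id INTEGER,
                                customer_name TEXT, order_status TEXT)
""")
conn.executemany(
    "INSERT INTO orders_denorm VALUES (?, ?, ?, ?)",
    [(1, 1, "Adam", "In Progress"), (2, 1, "Adam", "Delivered")],
)
# No join with a Customers table is needed to read the name
name = conn.execute(
    "SELECT customer_name FROM orders_denorm WHERE order_id = 2"
).fetchone()[0]
print(name)  # Adam
```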
Denormalization is used when there are performance issues with normalized data or when faster data retrieval is needed.
For example, let us consider the “Customers” and “Orders” tables. Both of these are currently normalized as there is no redundancy, and the data is housed in only one table, maintaining data integrity between the primary key and foreign key.
Figure 1.33 – Normalized tables
The following figure shows two new tables: the customer census table and the IT incidents table.
Both of these tables are denormalized, as we have included both customer and order information. There is no need to join these tables with the “Customer” or “Order” tables, as they contain all the necessary information. Retrieving data from these tables is faster than from normalized tables.
Interesting read
These types of tables are utilized by e-commerce companies to monitor customer service metrics and enhance customer service on a month-over-month (MoM), quarter-over-quarter (QoQ), and year-over-year (YoY) basis.
Several key performance indicators (KPIs) are typically tracked in this area.
These two tables contain redundant data, such as customer and order information repeated multiple times. However, this data can be retrieved from a single table, making the process faster. This is a clear example of denormalized tables.
Whenever order details are modified, the changes must also be applied to both of the new tables to maintain data integrity. This duplication of updates is one of the disadvantages of denormalized tables.
Figure 1.34 – Denormalized tables
Denormalization may be appropriate in situations where query performance is critical, for example when data is read far more often than it is updated, or when reports would otherwise require joining many tables.
Denormalization also has disadvantages: redundant data consumes more storage, and every change must be applied in multiple places, which increases the risk of inconsistent data.
With that, we have reached the end of the chapter. Let’s conclude everything we have learned so far.