A new application almost always starts with data modeling. For traditional relational databases, this process is known as data normalization, which focuses on removing data redundancy and minimizing data dependency. The modeling produces a series of tables, each defined by its schema (the rows and columns of the table).

Why normalization? Data normalization reduces the dependency and redundancy of data, which makes it easier to update the data while maintaining its consistency: an update needs to be made only once for a given entity. Let's take a typical example to show how data normalization works:
| ID | Name | Subject |
|---|---|---|
| 1 | John | Biology |
| 2 | Smith | Math |
| 3 | John | Math |
| 4 | Scott | Biology |
| 5 | Smith | Physics |
In this example, the data in the Name field repeats because one student can take multiple subjects. Normalization requires tabular data in which each row-column intersection holds exactly one value. Redundant data can create inconsistencies during updates, where there is a real possibility of the same piece of data ending up with different values. To normalize the preceding table, we could consider adding another column for the extra subject, which looks something like this:
| ID | Name | Subject_1 | Subject_2 |
|---|---|---|---|
| 1 | John | Math | Biology |
| 2 | Smith | Physics | Math |
| 3 | Scott | Biology | Null |
The preceding structure creates its own problems. For instance, if you want to know all the students who are taking the Math class, you have to check two columns. The problem gets worse if a student takes more than two subjects, because that forces you to add yet another column to the table. Updating and deleting records becomes equally problematic.
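The awkwardness of querying the two subject columns can be sketched with SQLite (an illustrative sketch; the table and column names simply follow the example above):

```python
import sqlite3

# In-memory database holding the denormalized two-column layout
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE STUDENTS (ID INTEGER, Name TEXT, Subject_1 TEXT, Subject_2 TEXT)")
con.executemany(
    "INSERT INTO STUDENTS VALUES (?, ?, ?, ?)",
    [(1, "John", "Math", "Biology"),
     (2, "Smith", "Physics", "Math"),
     (3, "Scott", "Biology", None)],
)

# Every subject column must be checked explicitly; a third subject
# would force a schema change *and* a change to this query.
rows = con.execute(
    "SELECT Name FROM STUDENTS WHERE Subject_1 = 'Math' OR Subject_2 = 'Math'"
).fetchall()
print(rows)  # [('John',), ('Smith',)]
```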
In this scenario, to maintain consistency and to avoid redundancy, we have to create two tables. The first table will contain all the student information as follows:
| Student_ID | Name |
|---|---|
| 1 | John |
| 2 | Smith |
| 3 | Scott |
The second table will contain the information related to the subjects offered as follows:
| Subject_ID | Subject_name |
|---|---|
| 1 | Biology |
| 2 | Math |
| 3 | Physics |
Now, we have to create a map of uniquely identifiable IDs in the form of a table that shows which student is taking what subjects:
| Student_ID | Subject_ID |
|---|---|
| 1 | 1 |
| 1 | 2 |
| 2 | 2 |
| 2 | 3 |
| 3 | 1 |
In the preceding data model, we no longer have to worry about data inconsistencies. We can manipulate the data easily, inserting and deleting records without creating anomalies. However, a problem arises when reading the data. If you want to know the names of the students who are taking Math, the SQL query will be as follows:
```sql
SELECT a.Name
FROM STUDENTS a
JOIN STU_SUB_MAP b ON a.Student_ID = b.Student_ID
JOIN SUBJECTS c ON c.Subject_ID = b.Subject_ID
WHERE c.Subject_name = 'Math';
```
The result of the query will look like the following table:
| Name |
|---|
| John |
| Smith |
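To see the query in action end to end, here is a minimal sketch using SQLite that builds the three normalized tables and runs the same join (table and column names follow the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE STUDENTS (Student_ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE SUBJECTS (Subject_ID INTEGER PRIMARY KEY, Subject_name TEXT);
CREATE TABLE STU_SUB_MAP (Student_ID INTEGER, Subject_ID INTEGER);
INSERT INTO STUDENTS VALUES (1, 'John'), (2, 'Smith'), (3, 'Scott');
INSERT INTO SUBJECTS VALUES (1, 'Biology'), (2, 'Math'), (3, 'Physics');
INSERT INTO STU_SUB_MAP VALUES (1, 1), (1, 2), (2, 2), (2, 3), (3, 1);
""")

# The same three-table join: map students to subjects, filter on Math
names = [row[0] for row in con.execute("""
    SELECT a.Name
    FROM STUDENTS a
    JOIN STU_SUB_MAP b ON a.Student_ID = b.Student_ID
    JOIN SUBJECTS c ON c.Subject_ID = b.Subject_ID
    WHERE c.Subject_name = 'Math'
""")]
print(sorted(names))  # ['John', 'Smith']
```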
Even though we got the correct result from the database, the problem lies in the amount of work the database must do to produce it: it has to join three different tables, read the data, and return the results. The JOIN operation is relatively expensive. Indexes can mitigate the cost to some extent, but as the tables grow, so does the time the database needs to compute the result.
Though normalization gives us highly structured tables with the utmost consistency, sometimes we need to keep the data in a denormalized format. This is usually necessary in scenarios where performance takes precedence over structure. A denormalized structure is most beneficial when your application performs far more reads than writes.
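As a sketch of that trade-off, the same data could be denormalized into a single read-optimized table (a hypothetical layout chosen for illustration, not one prescribed by the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One row per (student, subject) pair: subject names are duplicated,
# but a read needs no JOIN at all.
con.execute("CREATE TABLE ENROLLMENTS (Name TEXT, Subject_name TEXT)")
con.executemany(
    "INSERT INTO ENROLLMENTS VALUES (?, ?)",
    [("John", "Biology"), ("John", "Math"),
     ("Smith", "Math"), ("Smith", "Physics"),
     ("Scott", "Biology")],
)

# Single-table read: cheap, at the cost of redundant Subject_name values
names = [r[0] for r in con.execute(
    "SELECT Name FROM ENROLLMENTS WHERE Subject_name = 'Math'")]
print(sorted(names))  # ['John', 'Smith']
```

The cost resurfaces on writes: renaming a subject now means updating every duplicated row, which is exactly the update anomaly normalization was designed to avoid.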