Comparing database normalization and denormalization

Let's now take a deep dive and understand the fundamental difference between normalization and denormalization.

Normalization

Database normalization is a design process that aims to reduce data redundancy and maintain data integrity. This systematic approach removes redundant data and undesirable characteristics such as insertion, update, and deletion anomalies. It achieves this by breaking larger tables down into smaller ones and linking them through relationships so that data is stored logically.

Data normalization is necessary to eliminate the following anomalies.

Insertion anomalies

An insertion anomaly occurs in a relational database when we are unable to insert a row because some required attribute is missing. This commonly happens when a column, such as a foreign key, cannot be NULL but has no value available.

For instance, suppose customer details are stored only alongside orders, with the customer ID embedded in each order row. A customer who has not yet ordered anything cannot be recorded at all, because there is no order row to attach them to.
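To make this concrete, here is a minimal sketch of the problem, assuming a single combined customer_orders table (the table and column names are illustrative, not taken from the book's figures):

```sql
-- Hypothetical combined table: customer details live only inside order rows.
CREATE TABLE customer_orders (
    order_id              INT NOT NULL PRIMARY KEY,
    customer_id           INT NOT NULL,
    customer_name         VARCHAR(100),
    customer_phone_number VARCHAR(20),
    product_name          VARCHAR(100)
);

-- This insert fails: we cannot register a new customer without an order,
-- because order_id is NOT NULL and serves as the primary key.
INSERT INTO customer_orders (customer_id, customer_name)
VALUES (4, 'Maria');
```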

Update anomalies

An update anomaly occurs when data is only partially updated in a database. This commonly happens when the data is not normalized and the same piece of information is stored in more than one place. Because these copies are not updated automatically, each one must be updated manually.

This process is time-consuming and error-prone, as it requires finding every copy of the data and updating each one by hand.

For instance, suppose we have a Customers table with five columns, including Customer Phone Number and Customer Address. If a customer's address or phone number changes, we must update the table. However, if the table is not normalized, a single customer may have multiple rows, and an update anomaly arises if any one of those rows is overlooked.
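Continuing the illustrative customer_orders sketch from above, the update must touch every copy of the customer's data; missing any row leaves the table inconsistent:

```sql
-- Correct: update every row that stores Adam's phone number.
UPDATE customer_orders
SET customer_phone_number = '485-111-0000'
WHERE customer_id = 1;

-- Anomaly: updating only one of Adam's rows leaves his other rows
-- holding the stale number.
UPDATE customer_orders
SET customer_phone_number = '485-111-0000'
WHERE customer_id = 1
  AND order_id = 1;
```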

Delete anomalies

A delete anomaly occurs when deleting one piece of data unintentionally removes other, unrelated information that was stored in the same row. This can result in the loss of crucial data.

For example, let's consider a customer named Adam who ordered one product, a phone. If the order is canceled and we delete that row from the order table, we also lose the only record of Adam and of the product (the phone).
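In the same illustrative sketch, canceling Adam's only order erases everything we know about him:

```sql
-- Deleting Adam's only order also deletes the only record of Adam
-- and of the product he bought (a delete anomaly).
DELETE FROM customer_orders
WHERE customer_id = 1
  AND order_id = 1;
```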

Hence, to avoid these anomalies, we need to normalize the data.

Types of normalization

There are four commonly used normal forms:

  • First Normal Form (1NF)
  • Second Normal Form (2NF)
  • Third Normal Form (3NF)
  • Boyce–Codd Normal Form (BCNF)

Let’s explore each of them.

1NF

In 1NF, each attribute in the table should contain only atomic values. This means that each cell in the table should hold only one value, and the intersection of rows and columns should not contain multiple values or repeating groups.

For example, in the following table, we have details of the products purchased by customers.

| Customer_ID | Customer_Name | Products    |
|-------------|---------------|-------------|
| 1           | Adam          | Phone, Pen  |
| 2           | James         | Car, iPod   |
| 3           | Paul          | Laptop, Cup |

Figure 1.22 – Non-normalized product table

We can see that the Products attribute holds information related to multiple products purchased by the customer and is therefore not normalized. This is because it contains more than one value in a single cell.

Hence, this can be normalized in the following way in 1NF.

| Customer_ID | Customer_Name | Products |
|-------------|---------------|----------|
| 1           | Adam          | Phone    |
| 1           | Adam          | Pen      |
| 2           | James         | Car      |
| 2           | James         | iPod     |
| 3           | Paul          | Laptop   |
| 3           | Paul          | Cup      |

Figure 1.23 – 1NF normalized product table

As we can see, each cell now holds only one value at the intersection of a column and row. Therefore, the data has been normalized to 1NF.
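As a rough sketch, the 1NF design in Figure 1.23 could be created and loaded like this (table and column names are illustrative):

```sql
-- 1NF: one atomic product value per row, no repeating groups.
CREATE TABLE customer_products (
    customer_id   INT NOT NULL,
    customer_name VARCHAR(100),
    product_name  VARCHAR(100) NOT NULL,
    PRIMARY KEY (customer_id, product_name)
);

INSERT INTO customer_products (customer_id, customer_name, product_name) VALUES
    (1, 'Adam',  'Phone'),
    (1, 'Adam',  'Pen'),
    (2, 'James', 'Car'),
    (2, 'James', 'iPod'),
    (3, 'Paul',  'Laptop'),
    (3, 'Paul',  'Cup');
```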

2NF

There are two rules that must be followed to normalize a table into 2NF:

  • Rule 1 – the table first has to be in 1NF.
  • Rule 2 – the table should not have any partial dependency. This means that every non-prime attribute (every attribute that is not part of the primary key) must depend on the whole primary key, not just part of it.

Whenever a table holds data for two different entities rather than one, each entity needs to be split out into its own table.

For example, the following table has a composite primary key made up of Customer_ID and Order_ID. However, some non-prime attributes depend on only part of that key: Order_Status, for instance, depends only on Order_ID. Therefore, the table must be split in two; one table will hold the customer details, while the other will hold the order details.

| Customer_ID | Customer_Name | Customer_Phone_Number | Order_ID | Order_Status |
|-------------|---------------|-----------------------|----------|--------------|
| 1           | Adam          | 485-000-9890          | 1        | In Progress  |
| 1           | Adam          | 585-000-9890          | 2        | Delivered    |
| 2           | James         | 685-000-9890          | 3        | In Progress  |
| 2           | James         | 785-000-9890          | 4        | In Progress  |
| 3           | Paul          | 885-000-9890          | 5        | Delivered    |
| 3           | Paul          | 985-000-9890          | 6        | Delivered    |

Figure 1.24 – Customer details table

The table depicted in Figure 1.24 has been decomposed into two separate tables that satisfy 2NF.

| Customer_ID | Customer_Name | Customer_Phone_Number |
|-------------|---------------|-----------------------|
| 1           | Adam          | 485-000-9890          |
| 1           | Adam          | 585-000-9890          |
| 2           | James         | 685-000-9890          |
| 2           | James         | 785-000-9890          |
| 3           | Paul          | 885-000-9890          |
| 3           | Paul          | 985-000-9890          |

Figure 1.25 – Customers table, which holds customer details

| Order_ID | Order_Status |
|----------|--------------|
| 1        | In Progress  |
| 2        | Delivered    |
| 3        | In Progress  |
| 4        | In Progress  |
| 5        | Delivered    |
| 6        | Delivered    |

Figure 1.26 – Orders table
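A minimal sketch of this 2NF split in SQL follows. Note that the customer_id column in the orders table is an assumption not shown in the figures; in practice you would keep such a foreign key so the two tables remain joinable:

```sql
-- Customers: non-key attributes depend on customer_id alone.
CREATE TABLE customers (
    customer_id           INT NOT NULL,
    customer_name         VARCHAR(100),
    customer_phone_number VARCHAR(20)
);

-- Orders: order_status depends on order_id alone.
CREATE TABLE orders (
    order_id     INT NOT NULL PRIMARY KEY,
    order_status VARCHAR(20),
    customer_id  INT NOT NULL  -- assumed link back to customers
);

-- Reassemble the combined view of Figure 1.24 when needed:
SELECT c.customer_id, c.customer_name, c.customer_phone_number,
       o.order_id, o.order_status
FROM customers AS c
JOIN orders    AS o ON o.customer_id = c.customer_id;
```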

3NF

3NF ensures a reduction in data duplication, thereby maintaining data integrity by following these rules:

  • Rule 1 – the table has to be in 1NF and 2NF.
  • Rule 2 – the table should not have transitive functional dependencies. This means that non-prime attributes, which are attributes that are not part of the candidate key, should not be dependent on other non-prime attributes within the table.

In the following table, Customer_ID determines Product_ID, and Product_ID determines Product_Name. Hence, Customer_ID determines Product_Name through Product_ID. This means the table has a transitive dependency and is not in 3NF.

Therefore, the table has been divided into two separate tables to achieve 3NF.

| Customer_ID | Customer_Name | Product_ID | Product_Name | Customer_Phone_Number |
|-------------|---------------|------------|--------------|-----------------------|
| 1           | Adam          | 1          | Phone        | 485-000-9890          |
| 1           | Adam          | 2          | Pen          | 585-000-9890          |
| 2           | James         | 3          | Car          | 685-000-9890          |
| 2           | James         | 4          | iPod         | 785-000-9890          |
| 3           | Paul          | 5          | Laptop       | 885-000-9890          |
| 3           | Paul          | 6          | Cup          | 985-000-9890          |

Figure 1.27 – Customer order table

| Customer_ID | Customer_Name | Customer_Phone_Number |
|-------------|---------------|-----------------------|
| 1           | Adam          | 485-000-9890          |
| 1           | Adam          | 585-000-9890          |
| 2           | James         | 685-000-9890          |
| 2           | James         | 785-000-9890          |
| 3           | Paul          | 885-000-9890          |
| 3           | Paul          | 985-000-9890          |

Figure 1.28 – Customers table

| Product_ID | Product_Name |
|------------|--------------|
| 1          | Phone        |
| 2          | Pen          |
| 3          | Car          |
| 4          | iPod         |
| 5          | Laptop       |
| 6          | Cup          |

Figure 1.29 – Products table

The non-prime attributes, that is, the attributes other than the primary key, now depend only on the primary key of their table and on nothing else. For instance, Customer_Name and Customer_Phone_Number depend solely on Customer_ID, the primary key of the Customers table. Similarly, Product_Name depends solely on Product_ID, the primary key of the Products table.
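As a sketch, the 3NF design might look like this in SQL. The purchases link table is an assumption added here to preserve which customer bought which product; it does not appear in the figures:

```sql
-- Products: product_name depends only on product_id (no transitive hop
-- through another non-key column).
CREATE TABLE products (
    product_id   INT NOT NULL PRIMARY KEY,
    product_name VARCHAR(100)
);

-- Assumed link table connecting customers to the products they bought.
CREATE TABLE purchases (
    customer_id INT NOT NULL,
    product_id  INT NOT NULL REFERENCES products (product_id)
);

-- Product names are now looked up through the key, not stored redundantly:
SELECT pu.customer_id, pr.product_name
FROM purchases AS pu
JOIN products  AS pr ON pr.product_id = pu.product_id;
```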

BCNF

BCNF is sometimes referred to as 3.5NF. Even if a table is in 3NF, anomalies may still be present if it has more than one candidate key.

Two rules must be followed for a table to be in BCNF:

  • Rule 1 – the table should be in 1NF, 2NF, and 3NF
  • Rule 2 – for every functional dependency A -> B, A must be a super key

Super key

A super key is any set of one or more columns whose values uniquely identify each row in a table.

Consider the following table:

| Student_ID | Subject            | Faculty     |
|------------|--------------------|-------------|
| Student_1  | Cloud Computing    | Professor A |
| Student_2  | Big Data           | Professor B |
| Student_3  | Statistics         | Professor C |
| Student_4  | Project Management | Professor D |
| Student_5  | Analytics          | Professor E |

Figure 1.30 – Student-subject table

In the table depicted in Figure 1.30, a student can study multiple subjects in one semester, and each subject has exactly one professor assigned to it. The primary key is the combination of Student_ID and Subject, since together they uniquely identify each row and let us retrieve both faculty and student information. In addition, there is a functional dependency between the Subject and Faculty columns: because each subject has one assigned professor, Subject determines Faculty.

There are a few key points to understand here:

  • The preceding table is in 1NF as every cell holds a single atomic value
  • The preceding table is in 2NF as there is no partial dependency
  • The preceding table is also in 3NF as there is no transitive dependency

This table does not satisfy the BCNF condition because of the dependency of Faculty on Subject: the determinant, Subject, is a prime attribute but not a super key, while Faculty is a non-prime attribute.

How to fix this

A new super key called Faculty_ID will be introduced, and the preceding table will be split into two different tables, as follows.

| Student_ID | Faculty_ID  |
|------------|-------------|
| Student_1  | Faculty_ID1 |
| Student_2  | Faculty_ID2 |
| Student_3  | Faculty_ID3 |
| Student_4  | Faculty_ID4 |
| Student_5  | Faculty_ID5 |

Figure 1.31 – Student-faculty mapping table

| Faculty_ID  | Faculty     | Subject            |
|-------------|-------------|--------------------|
| Faculty_ID1 | Professor A | Cloud Computing    |
| Faculty_ID2 | Professor B | Big Data           |
| Faculty_ID3 | Professor C | Statistics         |
| Faculty_ID4 | Professor D | Project Management |
| Faculty_ID5 | Professor E | Analytics          |

Figure 1.32 – Faculty table

Now, the Faculty_ID super key has eliminated the functional dependency on a non-prime attribute, and BCNF is satisfied.
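A sketch of this BCNF decomposition in SQL (names are illustrative):

```sql
-- Faculty: faculty_id determines both the professor and the subject,
-- and faculty_id is the key, so every determinant is a super key.
CREATE TABLE faculty (
    faculty_id VARCHAR(20) NOT NULL PRIMARY KEY,
    faculty    VARCHAR(100),
    subject    VARCHAR(100)
);

-- Student-faculty mapping: each row is identified by its composite key.
CREATE TABLE student_faculty (
    student_id VARCHAR(20) NOT NULL,
    faculty_id VARCHAR(20) NOT NULL REFERENCES faculty (faculty_id),
    PRIMARY KEY (student_id, faculty_id)
);
```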

Denormalization

Before delving into denormalization, let’s first quickly review normalization, which we just discussed. Normalization, in simple terms, involves segmenting each piece of data into the appropriate bucket, which helps maintain data integrity, eliminates redundancy, and ensures easy updates to the data in the future.

However, the drawback of using normalized data is that it may take a longer time to retrieve the data as the data has to be queried from different tables where it is stored.

Denormalization comes into play when there is a trade-off between organized data and faster data retrieval. It is the opposite of normalization. In simple terms, denormalization involves storing the same data in multiple locations, which increases redundancy but decreases data retrieval time, making it faster.

Denormalization is used when there are performance issues with normalized data or when faster data retrieval is needed.

For example, let us consider the Customers and Orders tables. Both are currently normalized: there is no redundancy, each piece of data is housed in only one table, and data integrity is maintained through the primary key and foreign key relationship.

Figure 1.33 – Normalized tables

In the following figure, we have two new tables: the customer census table and the IT incidents table. Let's take a closer look at the type of data that each of these tables contains:

  • customer_census: This table contains statistics about each customer, including the number of orders placed, the total amount spent on orders, the number of incidents raised due to complaints, and the total number of complaint calls made by the customer.
  • IT incidents: This table holds one row per ticket raised by a customer, along with the related order, the customer, and the customer service agent working on the ticket.

Both of these tables are denormalized, as customer and order information has been copied into them. There is no need to join them back to the Customers or Orders tables, because they already contain all the necessary information, so retrieving data from them is faster than querying the normalized tables.
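As a hedged sketch, a denormalized summary table such as customer_census could be populated along these lines. The orders and it_incidents source tables and their customer_id columns are assumptions based on the description above:

```sql
-- Build a denormalized per-customer summary so later reads need no joins.
CREATE TABLE customer_census AS
SELECT
    c.customer_id,
    c.customer_name,
    (SELECT COUNT(*)
     FROM orders AS o
     WHERE o.customer_id = c.customer_id) AS total_orders,      -- orders placed
    (SELECT COUNT(*)
     FROM it_incidents AS i
     WHERE i.customer_id = c.customer_id) AS total_incidents    -- complaints raised
FROM customers AS c;
```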

Interesting read

These types of tables are utilized by e-commerce companies to monitor customer service metrics and enhance customer service on a month-over-month (MoM), quarter-over-quarter (QoQ), and year-over-year (YoY) basis.

A few of the main key performance indicators (KPIs) to track in this area are as follows:

  • Customer call frequency over a period of time
  • Number of incidents raised over a period of time
  • Major dispute reasons

These two tables contain redundant data, such as customer and order information repeated multiple times. However, the data can now be retrieved from a single table, making the process faster. This is a clear example of denormalized tables.

Whenever order details are modified, the changes must also be reflected in both of the new tables to maintain data integrity. This duplication of updates is one of the disadvantages of denormalized tables.

Figure 1.34 – Denormalized tables

When to apply denormalization

Some situations in which it may be appropriate to apply denormalization are as follows:

  • Performance improvement: One of the main reasons to use denormalization is to improve query performance. Because the data a query needs has already been copied into a single table, it can be read without joining many tables, so denormalized tables often perform better (see the sketch after this list).
  • Maintaining history: Keeping historical copies of data in separate tables helps with data validation. For example, if we have built a dashboard using these tables and want to look back at the history and validate it in the future, denormalized tables can help with that validation.
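Here is a sketch of the performance trade-off mentioned in the list above, reusing the illustrative tables from earlier: the normalized answer requires a join at query time, while the denormalized answer is a single-table read:

```sql
-- Normalized: join and aggregate at query time.
SELECT c.customer_name, COUNT(*) AS total_orders
FROM customers AS c
JOIN orders    AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_name;

-- Denormalized: the same answer, precomputed in one table.
SELECT customer_name, total_orders
FROM customer_census;
```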

Disadvantages of denormalization

Here are some of the disadvantages of denormalization:

  • Expensive updates and inserts: Whenever data is updated in one table, it needs to be updated in other related tables as well. For instance, if customer or order data is updated in one table, it must also be updated in all other tables where it is stored. This process can be costly and time-consuming.
  • Expensive storage: Storing copies of the same data in multiple tables requires more storage space; for instance, the same customer and order details may be repeated across several tables.
  • Data inconsistency: This can occur when data is not updated correctly, leading to discrepancies and errors in the data.

With that, we have reached the end of the chapter. Let's summarize everything we have learned so far.
