Sign In Start Free Trial
Account

Add to playlist

Create a Playlist

Modal Close icon
You need to login to use this feature.
  • Book Overview & Buying Data Wrangling with SQL
  • Table Of Contents Toc
  • Feedback & Rating feedback
Data Wrangling with SQL

Data Wrangling with SQL

By : Raghav Kandarpa, Shivangi Saxena
4.6 (25)
close
close
Data Wrangling with SQL

Data Wrangling with SQL

4.6 (25)
By: Raghav Kandarpa, Shivangi Saxena

Overview of this book

The amount of data generated continues to grow rapidly, making it increasingly important for businesses to be able to wrangle this data and understand it quickly and efficiently. Although data wrangling can be challenging, with the right tools and techniques you can efficiently handle enormous amounts of unstructured data. The book starts by introducing you to the basics of SQL, focusing on the core principles and techniques of data wrangling. You’ll then explore advanced SQL concepts like aggregate functions, window functions, CTEs, and subqueries that are very popular in the business world. The next set of chapters will walk you through different functions within SQL query that cause delays in data transformation and help you figure out the difference between a good query and bad one. You’ll also learn how data wrangling and data science go hand in hand. The book is filled with datasets and practical examples to help you understand the concepts thoroughly, along with best practices to guide you at every stage of data wrangling. By the end of this book, you’ll be equipped with essential techniques and best practices for data wrangling, and will predominantly learn how to use clean and standardized data models to make informed decisions, helping businesses avoid costly mistakes.
Table of Contents (21 chapters)
close
close
1
Part 1:Data Wrangling Introduction
4
Part 2:Data Wrangling Techniques Using SQL
9
Part 3:SQL Subqueries, Aggregate And Window Functions
13
Part 4:Optimizing Query Performance
15
Part 5:Data Science And Wrangling

Decoding database structures – relational and non-relational

Before we delve into the details of relational and non-relational databases, let us first understand the meaning of the term database and why it is important to know about databases.

What is a database?

Most of us have heard of a database, right? To put it simply, it is a collection of information that is stored in an organized and logical manner. This helps people keep track of things and find information quickly. For example, imagine you are walking into a superstore and looking for a specific item, such as a phone charger. To find it, you would use logical categorization. First, you would go to the electronics section, but this section would have all sorts of electronics the superstore had to offer. So, you would then look for a section called phones and accessories and search for the specific phone charger that was compatible with your phone.

By using logical reasoning, you can determine the location of the object and purchase the charger successfully. If we consider the process from the perspective of the superstore, we can see that they have divided the entire area into sections such as electronics, merchandise, and groceries, and further subdivided it into rows and columns known as aisles. They store each object according to its category in an organized manner, which can be accessed through the store’s database.

The business definition of a database is that it is a collection of information stored on a server that is accessed regularly for analysis and decision-making. The information is organized into tables, which are similar to spreadsheets, with rows and columns. A database can contain multiple tables, and a server can have multiple databases for different categories or clients. For example, a university database may contain information on students, teachers, and subjects, while a superstore database may contain data on products, orders, store locations, and customers. Each row in the database represents a specific occurrence or transaction. The database stores information and its relationships.

Types of databases

Database Management Systems (DBMSs) are used to store and manage data in a database. The most commonly used language to extract information from a database is SQL. The history of databases dates back several decades, specifically to the 1970s. Since then, databases have evolved into two broad categories, known as relational and non-relational.

Relational databases

A relational database, or relational DBMS, stores data in the form of tables or entities that we want to track, such as customers and orders. The data about these entities is stored in relations, which are 2D tables of rows and columns, similar to a spreadsheet. Each row contains data, and each column contains different attributes about that entity.

Figure 1.1 – Relational database

Figure 1.1 – Relational database

For instance, in a table/entity that contains information about customers, the attributes or columns could include Name, Phone Number, Address, and Gender. The rows would then represent specific information for each customer in a separate row.

For example, we could have a customers table as follows:

Customer_ID

Name

Address

Phone

Gender

Email

1

Joey

Texas

834-2345

M

[email protected]

2

Ron

Tennessee

987-6543

M

[email protected]

3

Fred

New York

876-5678

M

[email protected]

4

Tom

LA

765-7654

M

[email protected]

5

Mary

Georgia

124-0987

F

[email protected]

Figure 1.2 – Customers table

Every row in a relational database should have a unique key, which we call the primary key (discussed later in the chapter). This key can be used as a foreign key in a different table to build logical referential relations between the two tables. The relations between the fields and tables are known as schemas. To extract data from databases, we use SQL queries.

These are some of the advantages of relational databases:

  • Highly efficient
  • High readability as data is sorted and unique
  • High data integrity
  • Normalized data

Non-relational databases

A non-relational database stores data in a non-tabular format, meaning it does not have a structure of tables and relations. Instead, this type of database stores information in various ways, such as key-value and document-based databases. In a key-value database, data is stored in two parts: a key and its corresponding value. Each key is unique and can only connect to one value in the collection. In contrast, a document-oriented database pairs a key with a document that contains a complex combination of several key-value pairs. Non-relational databases, also known as NoSQL databases, are more flexible than traditional relational databases. Some commonly used non-relational databases include MongoDB, Cassandra, Amazon DynamoDB, and Apache HBase.

Figure 1.3 – NoSQL databases

Figure 1.3 – NoSQL databases

Key

Document

2022

{  Customer ID: 1234,  Customer Name: Joey,  Customer address: XYZTX,  Order details:    {     Order 1: {product 1, product description}     Order 2: {Product 1, product description}    }}

2023

{  Customer ID:5667,  Customer Name: Ron,  Customer address: LKJHNTN,  Order details:    {     Order 1: {product 1, product description}     Order 2: {Product 1, product description}    }}

Figure 1.4 – Non-relational database example

These are some of the advantages of non-relational databases:

  • Simple management – no sorting needed, so data can be directly dumped into the database without any preprocessing
  • Higher readability of a particular document, especially when the dataset contains big data, avoiding the need to parse through multiple tables and write complex queries
  • Can be scaled to a huge level by splitting the servers into multiple clusters and managing the CPU utilization on each of these clusters

Let’s understand that last point in detail. Multiple clusters refers to a distributed computing architecture that consists of multiple servers or nodes, each serving a different purpose, but working together to achieve a common goal. In this context, a cluster typically consists of a group of interconnected computers that works together to provide a more powerful and scalable computing environment. Each cluster may have its own dedicated resources, such as CPU, memory, and storage, and can be managed independently. By splitting servers into multiple clusters, the workload can be distributed more efficiently, allowing for greater scalability and flexibility. For example, suppose you have a large-scale application that requires a lot of processing power and storage. In that case, you might split your servers into multiple clusters and distribute the workload across those clusters. This way, you can achieve better performance, as well as reduce the risk of a single point of failure. Overall, multiple clusters offer several advantages, including increased scalability, improved performance, better fault tolerance, and the ability to handle large workloads more efficiently.

Visually different images
CONTINUE READING
83
Tech Concepts
36
Programming languages
73
Tech Tools
Icon Unlimited access to the largest independent learning library in tech of over 8,000 expert-authored tech books and videos.
Icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Icon 50+ new titles added per month and exclusive early access to books as they are being written.
Data Wrangling with SQL
notes
bookmark Notes and Bookmarks search Search in title playlist Add to playlist download Download options font-size Font size

Change the font size

margin-width Margin width

Change margin width

day-mode Day/Sepia/Night Modes

Change background colour

Close icon Search
Country selected

Close icon Your notes and bookmarks

Confirmation

Modal Close icon
claim successful

Buy this book with your credits?

Modal Close icon
Are you sure you want to buy this book with one of your credits?
Close
YES, BUY

Submit Your Feedback

Modal Close icon
Modal Close icon
Modal Close icon