Book Image

SQL for Data Analytics - Third Edition

By : Jun Shan, Matt Goldwasser, Upom Malik, Benjamin Johnston
Book Image

SQL for Data Analytics - Third Edition

By: Jun Shan, Matt Goldwasser, Upom Malik, Benjamin Johnston

Overview of this book

Every day, businesses operate around the clock, and a huge amount of data is generated at a rapid pace. This book helps you analyze this data and identify key patterns and behaviors that can help you and your business understand your customers at a deep, fundamental level. SQL for Data Analytics, Third Edition is a great way to get started with data analysis, showing how to effectively sort and process information from raw data, even without any prior experience. You will begin by learning how to form hypotheses and generate descriptive statistics that can provide key insights into your existing data. As you progress, you will learn how to write SQL queries to aggregate, calculate, and combine SQL data from sources outside of your current dataset. You will also discover how to work with advanced data types, like JSON. By exploring advanced techniques, such as geospatial analysis and text analysis, you will be able to understand your business at a deeper level. Finally, the book lets you in on the secret to getting information faster and more effectively by using advanced techniques like profiling and automation. By the end of this book, you will be proficient in the efficient application of SQL techniques in everyday business scenarios and looking at data with the critical eye of analytics professional.
Table of Contents (11 chapters)
9
9. Using SQL to Uncover the Truth: A Case Study

Relational Databases and SQL

A relational database is a database that utilizes the relational model of data. The relational model, invented by Dr. Edgar F. Codd in 1970, organizes data as relations, or sets of tuple. Tuple is the mathematical term for a series of attributes grouped together in a particular order. A more common (and more practical) name for a tuple is a record. Each record consists of a series of attributes that generally describe the record.

For instance, a fast-moving consumer goods company wants to track its customers. They can save the customer information in a relation called customer_info. Each record in this relation contains details about one customer. The attributes in each record include information such as the customer's last name, first name, age, date of signup, and delivery address. This relationship and its first two records will look like this:

Figure 2.2: An example customer_info relation

Figure 2.2: An example customer_info relation

As you can see, each relation is indeed a two-dimensional table that looks like an Excel spreadsheet. Thus, when implemented in a relational database, these relations are called tables. Each table is made up of rows and columns. Each row of the table is a record, and the attributes are represented as columns of the table. There cannot be duplicate columns and the columns must follow the same order in all the rows. Every column also has a data type that describes the type of data in the column.

While not technically required, most tables in a relational database have a column (sometimes a group of columns) referred to as the primary key, which uniquely identifies a row of the database. In the example shown in Figure 2.2, each row contains a column called ID. This record, as the name suggests, is an attribute that can be used to uniquely identify this record. It is known as a relational key. In all other columns, you can have data duplicated across different rows. But in the primary key column(s), the data must be unique.

Most of the operations in a relational database, and in all data management systems, are organized around tables and the data inside them. They generally can be categorized into four groups—create, read, update, and delete. To utilize any data, you must create the definition of the dataset first, then create the individual data records one by one and put them into the dataset. Once a dataset is created, you can read all aspects of information from it. If there is any change to the data, you need to update the affected records.

Finally, when you do not need the data anymore, you will want to delete the records to save storage costs and increase performance. If you do not need this dataset, you can even delete the whole dataset by removing its definition from the database. These operations, by the order of each operation's position in a dataset's lifecycle, are generally called CRUD. CRUD stands for create, read, update, and delete.

In relational databases, all these operations are carried out using SQL. You will learn all the related SQL statements in this and the upcoming chapters.

Note

Virtually all relational databases that use SQL deviate from the relational model in some basic ways. For example, not every table has a specified relational key. Additionally, a relational model does not technically allow duplicate rows, but you can have duplicate rows in a relational database. These differences are minor and will not matter to most readers of this book.

Advantages and Disadvantages of SQL Databases

As discussed in the previous sections, since relations are collections of records that have clearly defined attributes in a defined order, they are considered structured data. Relational databases are the main tool used for storing and processing structured data.

Since the release of Oracle Database in 1979, SQL has become an industry standard for structured data in nearly all computer applications—and for good reasons. SQL databases provide a range of advantages that make them the first choice for many applications:

  • Intuitive: Relations represented as tables serve as a common data structure that almost everyone understands. As such, working with and reasoning about relational databases is much easier than doing so with other models.
  • Efficient: Using a technique known as normalization, relational databases allow the representation of data without unnecessarily repeating it. As such, relational databases can represent large amounts of information while utilizing less space. This reduced storage footprint also allows the database to reduce operation costs, making well-designed relational databases quick to process.
  • Declarative: SQL is a declarative language, meaning that when you write code, you only need to tell the computer what data you want, and the database takes care of determining how to execute the SQL code. You never have to worry about telling the computer how to access and pull data from the table.
  • Robust: Most popular SQL databases have a property known as atomicity, consistency, isolation, and durability (ACID) compliance, which guarantees the validity of the data, even if the hardware fails.

That said, there are still some downsides to SQL databases, which are as follows:

  • Relatively lower specificity: While SQL is declarative, its functionality can often be limited to what has already been programmed into it. Although most popular relational database software is updated constantly with new functionality being built all the time, it can be difficult to process and work with data structures and algorithms that are not programmed into a relational database.
  • Limited scalability: SQL databases are incredibly robust, but this robustness comes at a cost. As the amount of information you have doubles, the cost of resources increases even more than double. When very large volumes of information are involved, other data stores such as NoSQL databases may be efficient.
  • Sacrificing performance for consistency: Relational databases are generally designed for consistency, which means they will take extra steps to make sure multiple users will see the same data when they try to access/modify the data at the same time. To achieve this, relational databases implement some complex checking and data locking mechanisms into their operational logic. For usage scenarios that do not require consistency, especially for high-performance operations like search engines or social network sites, this is an unnecessary burden and will hurt the performance of the application.
  • Lack of semi-structured and unstructured data processing ability: The fundamental theory that SQL is built on is the relational theory, which, by definition, handles only structured data. Relational databases can store and fetch semi-structured and unstructured data. But processing this data requires processing power and functionalities that are beyond standard SQL. Later chapters of this book will cover some examples of this type of processing.