Book Image

The Applied SQL Data Analytics Workshop - Second Edition

By : Matt Goldwasser, Upom Malik, Benjamin Johnston
3.5 (2)
Book Image

The Applied SQL Data Analytics Workshop - Second Edition

3.5 (2)
By: Matt Goldwasser, Upom Malik, Benjamin Johnston

Overview of this book

Every day, businesses operate around the clock and a huge amount of data is generated at a rapid pace. Hidden in this data are key patterns and behaviors that can help you and your business understand your customers at a deep, fundamental level. Are you ready to enter the exciting world of data analytics and unlock these useful insights? Written by a team of expert data scientists who have used their data analytics skills to transform businesses of all shapes and sizes, The Applied SQL Data Analytics Workshop is a great way to get started with data analysis, showing you how to effectively sieve and process information from raw data, even without any prior experience. The book begins by showing you how to form hypotheses and generate descriptive statistics that can provide key insights into your existing data. As you progress, you'll learn how to write SQL queries to aggregate, calculate and combine SQL data from sources outside of your current dataset. You'll also discover how to work with different data types, like JSON. By exploring advanced techniques, such as geospatial analysis and text analysis, you'll finally be able to understand your business at a deeper level. Finally, the book lets you in on the secret to getting information faster and more effectively by using advanced techniques like profiling and automation. By the end of The Applied SQL Data Analytics Workshop, you'll have the skills you need to start identifying patterns and unlocking insights in your own data. You will be capable of looking and assessing data with the critical eye of a skilled data analyst.
Table of Contents (9 chapters)
Preface
7
7. The Scientific Method and Applied Problem Solving

Reading Tables: The SELECT Query

The most common operation in a database is reading data from a database. This is almost exclusively done through the use of the SELECT keyword.

Basic Anatomy and Working of a SELECT Query

Generally speaking, a query can be broken down into five parts:

  • Operation: The first part of a query describes what is going to be done. In this case, this is the word SELECT followed by the names of columns combined with functions.
  • Data: The next part of the query is the data, which is the FROM keyword followed by one or more tables connected together with reserved keywords indicating what data should be scanned for filtering, selection, and calculation.
  • Conditional: This is a part of the query that filters the data to only rows that meet a condition usually indicated with WHERE.
  • Grouping: This is a special clause that takes the rows of a data source and assembles them together using a key created by a GROUP BY clause, and then calculates a value using the values from all rows with the same value. We will discuss this step more in Chapter 3, Aggregate and Window Functions.
  • Postprocessing: This is a part of the query that takes the results of the data and formats them by sorting and limiting the data, often using keywords such as ORDER BY and LIMIT.

The steps of a SELECT query are as follows:

  1. Create a data source by taking one or more tables and combining them together into one large table.
  2. Filter the table based on the large data source created in Step 1 by seeing which rows meet the WHERE clause.
  3. Calculate values based on columns in the data source in Step 1. If there is a GROUP BY clause, divide the rows into groups and then calculate an aggregate statistic for each group. Otherwise, return a column or value that has been calculated by performing functions on one or more columns together.
  4. Take the rows returned and reorganize them based on the query.

To break down those steps, let's look at a typical query and follow the logic we've described:

SELECT 
  first_name
FROM 
  customers
WHERE 
  state='AZ'
ORDER BY 
  first_name;

The operation of this query follows a sequence:

  1. We start with the customers table.
  2. The customers table is filtered to where the state column equals 'AZ'.
  3. We capture the first_name column from the filtered table.
  4. The first_name column is ordered alphabetically.

Here, we've demonstrated how a query can be broken down into a series of steps for the database to process.

We will now look at the query keywords and patterns found in a SELECT query.

Basic Keywords in a SELECT Query

There are many keywords that we use while writing a SELECT query. Let's start by understanding the SELECT and FROM keywords.

The SELECT and FROM Statements

The most basic SELECT query follows this pattern: SELECT…FROM <table_name>;. This query is a way to pull data from a single table. For example, if you want to pull all the data from the products table in our sample database, simply use this query:

SELECT 
  *
FROM 
  products;

This query will pull all the data from a database. The * symbol seen here is shorthand to return all the columns from a database. The semicolon operator (;) is used to tell the computer it has reached the end of the query, much as a period is used for a normal sentence. It's important to note that the rows will be returned in no specific order. If we want to return only specific columns from a query, we can simply replace the asterisk (*) with the names of the columns we want to be separated in the order we want them to be returned in. For example, if we wanted to return the product_id column followed by the model column of the products table, we would write the following query:

SELECT product_id, model
FROM products;

If we wanted to return the model column first and the product_id column second, we would write this:

SELECT model, product_id
FROM products;

In the next section, we will learn about the WHERE clause.

The WHERE Clause

The WHERE clause is a piece of conditional logic that limits the amount of data returned. All of the rows that are returned in a SELECT statement with a WHERE clause in it meet the conditions of the WHERE clause. The WHERE clause can usually be found after the FROM clause of a single SELECT statement.

The condition in the WHERE clause is generally a Boolean statement that can either be True or False for every row. In the case of numeric columns, these Boolean statements can use equals, greater than, or less than operators to compare the columns against a value.

We will use an example to illustrate this. Let's say we wanted to see the model names of our products with the model year of 2014 from our sample dataset. We would write the following query:

SELECT 
  model
FROM 
  products
WHERE 
  year=2014;

In the next section, we will learn how we can use the AND/OR clause in our queries.

The AND/OR Clause

The previous query had only one condition. We are often interested in multiple conditions being met at once. For this, we put multiple statements together using the AND or OR clause.

Now we will illustrate this with an example. Let's say we wanted to return models that were not only built-in 2014 but also have a manufacturer's suggested retail price (MSRP) of less than $1,000. We can write the following:

SELECT 
  model
FROM 
  products
WHERE 
  year=2014
  AND msrp<=1000;

Now, let's say we wanted to return any models that were released in the year 2014 or had a product type of automobile. We would write the following query:

SELECT 
  model
FROM 
  products
WHERE 
  year=2014
  OR product_type='automobile';

When using more than one AND/OR condition, use parentheses to separate and position pieces of logic together. This will ensure that your query works as expected and that it is as readable as possible. For example, if we wanted to get all products with models between the years of 2014 and 2016, as well as any products that are scooters, we could write the following:

SELECT 
  *
FROM 
  products
WHERE 
  year>2014
  AND year<2016
  OR product_type='scooter';

However, to clarify the WHERE clause, it would be preferable to write the following:

SELECT 
  *
FROM 
  products
WHERE 
  (year>2014 AND year<2016)
  OR product_type='scooter';

In the next section, we will learn about the IN and NOT ON clauses.

The IN/NOT IN Clause

As mentioned earlier, Boolean statements can use equals signs to indicate that a column must equal a certain value. However, what if you are interested in returning rows where a row has a column that can be equal to any group of values? For instance, let's say you were interested in returning all models from the years 2014, 2016, or 2019. You could write a query such as this:

SELECT 
  model
FROM 
  products
WHERE 
  year = 2014
  OR year = 2016
  OR year = 2019;

However, this is long and tedious to write. Using IN, you can instead write the following:

SELECT 
  model
FROM 
  products
WHERE 
  year IN (2014, 2016, 2019);

This is much cleaner to write and makes it easier to understand what is going on.

Conversely, you can also use the NOT IN clause to return all the values that are not in a list of values. For instance, if you wanted all of the products that were not produced in the years 2014, 2016, and 2019, you could write the following:

SELECT 
  model
FROM 
  products
WHERE 
  year NOT IN (2014, 2016, 2019);

In the next section, we will learn how to use the ORDER BY clause in our queries.

The ORDER BY Clause

As previously mentioned, SQL queries will order rows as the database finds them if they are not given more specific instructions to do otherwise. For many use cases, this is acceptable. However, you will often want to see rows in a specific order. Let's say you want to see all of the products listed by the date when they were first produced, from earliest to latest. The method for doing this in SQL would be using the ORDER BY clause as follows:

SELECT 
  model
FROM 
  products
ORDER BY 
  production_start_date;

If an order sequence is not explicitly mentioned, the rows will be returned in ascending order. Ascending order simply means the rows will be ordered from the smallest value to the highest value of the chosen column or columns. In the case of things such as text, this means alphabetical order. You can make the ascending order explicit by using the ASC keyword. For our last query, this could be achieved by writing the following:

SELECT 
  model
FROM 
  products
ORDER BY 
  production_start_date ASC;

If you want to extract data in descending order, you can use the DESC keyword. If we wanted to fetch manufactured models ordered from newest to oldest, we would write the following:

SELECT 
  model
FROM 
  products
ORDER BY 
  production_start_date DESC;

Also, instead of writing the name of the column you want to order by, you can refer to what number column it is in the natural order of the table. For instance, say you wanted to return all the models in the products table ordered by product ID. You could write the following:

SELECT 
  model
FROM 
  products
ORDER BY 
  product_id;

However, because product_id is the first column in the table, you could instead write the following:

SELECT 
  model
FROM 
  products
ORDER BY 
  1;

Finally, you can order by multiple columns by adding additional columns after ORDER BY separated with commas. For instance, let's say we wanted to order all of the rows in the table first by the year of the model from newest to oldest, and then by the MSRP from least to greatest. We would then write the following:

SELECT 
  *
FROM 
  products
ORDER BY 
  year DESC, 
  base_msrp ASC;

The following is the output of the preceding code:

Figure 1.37: Ordering multiple columns using ORDER BY

Figure 1.37: Ordering multiple columns using ORDER BY

In the next section, we will learn about the LIMIT keyword in SQL.

The LIMIT Clause

Most tables in SQL databases tend to be quite large and, therefore, returning every single row is unnecessary. Sometimes, you may want only the first few rows. For this scenario, the LIMIT keyword comes in handy. Let's imagine that you wanted to only get the first five products that were produced by the company. You could get this by using the following query:

SELECT 
  model
FROM 
  products
ORDER BY 
  production_start_date
LIMIT 
  5;

The following is the output of the code:

Figure 1.38: Query with LIMIT

Figure 1.38: Query with LIMIT

As a general rule, you probably want to use the LIMIT keyword for a table or query you have not worked with.

The IS NULL/IS NOT NULL Clause

Often, some entries in a given column may be missing. This could be for a variety of reasons. Perhaps the data was not collected or not available at the time that the data was collected. Perhaps the ETL job failed to collect and load data into a column. It may also be possible that the absence of a value is representative of a certain state in the row and actually provides valuable information. Whatever the reason, we are often interested in finding rows where the data is not filled in for a certain value. In SQL, blank values are often represented by the NULL value. For instance, in the products table, the production_end_date column having a NULL value indicates that the product is still being made. In this case, if we want to list all products that are still being made, we can use the following query:

SELECT 
  *
FROM 
  products
WHERE 
  production_end_date IS NULL;

The following is the output of the code:

Figure 1.39: Products with NULL production_end_date

Figure 1.39: Products with NULL production_end_date

If we are only interested in products that are not being produced, we can use the IS NOT NULL clause, as shown in the following query:

SELECT *
FROM products
WHERE production_end_date IS NOT NULL;

The following is the output of the code:

Figure 1.40: Products with non-NULL production_end_date

Figure 1.40: Products with non-NULL production_end_date

We now will look at how to use these new keywords in the following exercise.

Exercise 1.06: Querying the salespeople Table Using Basic Keywords in a SELECT Query

In this exercise, we will create various queries using basic keywords in a SELECT query. Let's say that after a few days at your new job, you finally get access to the company database. Today, your boss has asked you to help a sales manager who does not know SQL particularly well. The sales manager would like a couple of different lists of salespeople. First, create a list of the online usernames of the first 10 female salespeople hired, ordered from the first hired to the latest hired.

Note

For all future exercises in this book, we will be using pgAdmin 4.

Perform the following steps to complete the exercise:

  1. Open your favorite SQL client and connect to the sqlda database.
  2. Examine the schema for the salespeople table from the schema drop-down list. Notice the names of the columns in the following figure:
    Figure 1.41: Schema of the salespeople table

    Figure 1.41: Schema of the salespeople table

  3. Execute the following query to get the usernames of female salespeople sorted by their hire_date values, and then set LIMIT to 10:
    SELECT 
      username
    FROM 
      salespeople
    WHERE 
      gender= 'Female'
    ORDER BY 
      hire_date
      LIMIT 10;

    The following is the output of the preceding code:

    Figure 1.42: Usernames of female salespeople sorted by hire date

Figure 1.42: Usernames of female salespeople sorted by hire date

We now have a list of usernames for female salespeople ordered from the earliest hire to the most recent hire.

Note

To access the source code for this specific section, please refer to https://packt.live/2B4qMUk.

In this exercise, we used different basic keywords in a SELECT query to help the sales manager get the list of salespeople as per their requirements.

Activity 1.03: Querying the customers Table Using Basic Keywords in a SELECT Query

The marketing department has decided that they want to do a series of marketing campaigns to help promote a sale. To do this, they need details of all customers in New York city. The following are the steps to complete the activity:

  1. Open your favorite SQL client and connect to the sqlda database. Examine the schema for the customers table from the schema drop-down list.
  2. Write a query that pulls all emails for ZoomZoom customers in the state of Florida in alphabetical order.
  3. Write a query that pulls all the first names, last names and email details for ZoomZoom customers in New York City in the state of New York. They should be ordered alphabetically, with the last name followed by the first name.
  4. Write a query that returns all the customers with a phone number ordered by the date the customer was added to the database.

    Note

    The solution for this activity can be found via this link.

In this activity, we used various basic keywords in a SELECT query and helped the marketing manager to get the data they needed for the marketing campaign.