Book Image

Data Science with SQL Server Quick Start Guide

By : Dejan Sarka
Book Image

Data Science with SQL Server Quick Start Guide

By: Dejan Sarka

Overview of this book

SQL Server only started to fully support data science with its two most recent editions. If you are a professional from both worlds, SQL Server and data science, and interested in using SQL Server and Machine Learning (ML) Services for your projects, then this is the ideal book for you. This book is the ideal introduction to data science with Microsoft SQL Server and In-Database ML Services. It covers all stages of a data science project, from businessand data understanding,through data overview, data preparation, modeling and using algorithms, model evaluation, and deployment. You will learn to use the engines and languages that come with SQL Server, including ML Services with R and Python languages and Transact-SQL. You will also learn how to choose which algorithm to use for which task, and learn the working of each algorithm.
Table of Contents (15 chapters)
Title Page
Copyright and Credits
Packt Upsell
Contributors
Preface
Index

Core T-SQL SELECT statement elements


You probably already know that the most important SQL statement is the mighty SELECT statement you use to retrieve data from your databases. Every database developer knows the basic clauses and their usage:

  • SELECT: Defines the columns returned, or a projection of all table columns
  • FROM: Lists the tables used in the query and how they are associated, or joined
  • WHERE: Filters the data to return only the rows that satisfy the condition in the predicate
  • GROUP BY: Defines the groups over which the data is aggregated
  • HAVING: Filters the data after the grouping with conditions that refer to aggregations
  • ORDER BY: Sorts the rows returned to the client application

 

 

The simplest form of the SELECT statement

Let's start with the simplest concept of SQL that every Tom, Dick, and Harry is aware of! The simplest query to retrieve the data you can write includes the SELECT and the FROM clauses. In the select clause, you can use the star character, literally SELECT *, to denote that you need all columns from a table in the result set. The following code switches to the AdventureWorksDW2017 database context and selects all data from the dbo.DimEmployee table:

USE AdventureWorksDW2017;
GO
SELECT *
FROM dbo.DimEmployee;

This query returns 296 rows, all employees with all columns.

Note

Using SELECT * is not recommended in production. Queries with SELECT * can return an unexpected result when the table structure changes and are also not suitable for good optimization.

Better than using SELECT * is to explicitly list only the columns you need. This means you are returning only a projection on the table. The following example selects only three columns from the table:

SELECT EmployeeKey, FirstName, LastName
FROM dbo.DimEmployee;

Here is the shortened result, limited to the first three rows only:

EmployeeKey FirstName LastName
----------- --------- ----------
1           Guy       Gilbert
2           Kevin     Brown
3           Roberto   Tamburello

Object names in SQL Server, such as table and column, can include spaces. Names that include spaces are called delimited identifiers. To make SQL Server properly understand them as column names, you must enclose delimited identifiers in square brackets. However, if you prefer to have names without spaces, or if you use computed expressions in the column list, you can add column aliases. The following code uses an expression in the SELECT clause to create a calculated column called [Full Name], and then uses the INTO clause to store the data in a table.

The next query retrieves the data from the newly created and populated dbo.EmpFull table:

SELECT EmployeeKey,
 FirstName + ' ' + LastName AS [Full Name]
INTO dbo.EmpFUll
FROM dbo.DimEmployee;
GO
SELECT EmployeeKey, [Full Name]
FROM dbo.EmpFUll;

Here is the partial result:

EmployeeKey Full Name
----------- ------------------
1           Guy Gilbert
2           Kevin Brown
3           Roberto Tamburello

As you have seen before, there are 296 employees. If you check the full result of the first query, you might notice that there is a column named SalesPersonFlag in the dbo.DimEmployee table. You might want to check which of the employees are also salespeople. You can filter the results of a query with the WHERE clause, as the following query shows:

SELECT EmployeeKey, FirstName, LastName
FROM dbo.DimEmployee
WHERE SalesPersonFlag = 1;

This query returns 17 rows only.

Joining multiple tables

In a relational database, you typically have data spread in multiple tables. Each table represents a set of entities of the same kind, such as employees in the examples you have seen so far. In order to make result sets meaningful for the business your database supports, most of the time you need to retrieve data from multiple tables in the same query. You need to join two or more tables based on some conditions. The most frequent kind of a join is the inner join. An inner join returns only rows for which the condition in the join predicate for the two joined tables evaluates to true. Note that in a relational database, you have three-valued logic, because there is always a possibility that a piece of data is unknown. You mark the unknown with the NULL keyword. A predicate can thus evaluate to true, false, or NULL. For an inner join, the order of the tables involved in the join is not important.

 

 

In the following example, you can see the dbo.DimEmployee table joined with an inner join to the dbo.FactResellerSales table:

SELECT e.EmployeeKey, e.FirstName, e.LastName,
 fr.SalesAmount
FROM dbo.DimEmployee AS e
 INNER JOIN dbo.FactResellerSales AS fr
  ON e.EmployeeKey = fr.EmployeeKey;

Here are the partial results:

EmployeeKey FirstName LastName SalesAmount
----------- --------- -------- -----------
285         Tsvi      Reiter   2024.994
285         Tsvi      Reiter   6074.982
285         Tsvi      Reiter   2024.994

In the previous query, you can see that table aliases are used. If a column's name is unique across all tables in the query, you can use it without a table name. If not, you need to use table name in front of the column, to avoid ambiguous column names, in the table.column format. In the previous query, the EmployeeKey column appears in both tables. Therefore, you need to precede this column name with the table name of its origin to avoid ambiguity. You can shorten the two-part column names by using table aliases. You specify table aliases in the FROM clause. Once you specify table aliases, you must always use the aliases; you can't refer to the original table names in that query anymore. Please note that a column name might be unique in the query at the moment when you write the query. However, later somebody could add a column with the same name in another table involved in the query. If the column name is not preceded by an alias or by the table name, you would get an error when executing the query because of the ambiguous column name. In order to make the code more stable and more readable, you should always use table aliases for each column in the query.

The previous query returned 60,855 rows. It is always recommended to know at least approximately the number of rows your query should return. This number is the first control of the correctness of the result set, or said differently, whether the query is written in a logically correct way. If every sale has an employee, as it should have, then the previous query should have returned exactly the number of rows dbo.FactResellerSales has. You can quickly check the number of rows in the dbo.FactResellerSales table with the help of the COUNT(*) aggregate function, as the following query shows:

SELECT COUNT(*) AS ResellerSalesCount
FROM dbo.FactResellerSales;

 

The result is, as you probably expected, 60,855 rows.

You can join multiple tables in a single query. The following code joins seven tables in a single query. Note that all of the joins are still inner joins. The query returns 60,855 rows again, with at least 1 column from each table involved in the query:

SELECT e.EmployeeKey, e.FirstName, e.LastName,
 r.ResellerKey, r.ResellerName,
 d.DateKey, d.CalendarYear, d.CalendarQuarter,
 p.ProductKey, p.EnglishProductName,
 ps.EnglishProductSubcategoryName,
 pc.EnglishProductCategoryName,
 fr.OrderQuantity, fr.SalesAmount
FROM dbo.DimEmployee AS e
 INNER JOIN dbo.FactResellerSales AS fr
  ON e.EmployeeKey = fr.EmployeeKey
 INNER JOIN dbo.DimReseller AS r
  ON r.ResellerKey = fr.ResellerKey
 INNER JOIN dbo.DimDate AS d
  ON fr.OrderDateKey = d.DateKey
 INNER JOIN dbo.DimProduct AS p
  ON fr.ProductKey = p.ProductKey
 INNER JOIN dbo.DimProductSubcategory AS ps
  ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
 INNER JOIN dbo.DimProductCategory AS pc
  ON ps.ProductCategoryKey = pc.ProductCategoryKey;

In the dbo.Employees table, there are 17 salespeople. Do all of them have at least one sale, at least one row from the dbo.FactResellerSales table associated with the employee key of that salesperson? You can check how many distinct employees have sales associated with them with the help of the DISTINCT keyword:

SELECT DISTINCT fr.EmployeeKey
FROM dbo.FactResellerSales AS fr;

The query returns 17 rows. Now imagine that you would like to list all sales rows together with the employees' data, but you also need to include in the result the employees that are not salespeople, that do now have any row associated with their EmployeeKey column in the fact table. You can use an outer join to fulfill this task.

 

 

With an outer join, you preserve the rows from one or both tables, even if they don't have a match in the other table. The result set returned includes all of the matched rows, like what you get from an inner join plus the preserved rows. Within an outer join, the order of the tables involved in the join might be important. If you use LEFT OUTER JOIN, the rows from the left table are preserved. If you use RIGHT OUTER JOIN, the rows from the right table are preserved. Of course, in both cases, the order of the tables involved in the join is important. With FULL OUTER JOIN, you preserve the rows from both tables, and the order of the tables is not important. The following query uses a left outer join to preserve the rows from the dbo.DimEmployee table:

SELECT e.EmployeeKey, e.FirstName, e.LastName,
 fr.SalesAmount
FROM dbo.DimEmployee AS e
 LEFT OUTER JOIN dbo.FactResellerSales AS fr
  ON e.EmployeeKey = fr.EmployeeKey;

The query returns 61,134 rows. Did we get all of the employees in the result? You can check this by checking the distinct EmployeeKey after the outer join:

SELECT DISTINCT e.EmployeeKey
FROM dbo.DimEmployee AS e
 LEFT OUTER JOIN dbo.FactResellerSales AS fr
  ON e.EmployeeKey = fr.EmployeeKey;

The query returns 296 rows, which is the number of employees.

Joining more than two tables is not tricky if all of the joins are inner joins. The order of joins is not important. However, you might want to execute an outer join after all of the inner joins. If you don't control the join order with the outer joins, it might happen that a subsequent inner join filters out the preserved rows of an outer join. You can control the join order with parentheses. The following query uses the right outer join to preserve all employees and makes sure that this join is executed after all inner joins:

SELECT e.EmployeeKey, e.FirstName, e.LastName,
 r.ResellerKey, r.ResellerName,
 d.DateKey, d.CalendarYear, d.CalendarQuarter,
 p.ProductKey, p.EnglishProductName,
 ps.EnglishProductSubcategoryName,
 pc.EnglishProductCategoryName,
 fr.OrderQuantity, fr.SalesAmount
FROM (dbo.FactResellerSales AS fr
 INNER JOIN dbo.DimReseller AS r
  ON r.ResellerKey = fr.ResellerKey
 INNER JOIN dbo.DimDate AS d
  ON fr.OrderDateKey = d.DateKey
 INNER JOIN dbo.DimProduct AS p
  ON fr.ProductKey = p.ProductKey
 INNER JOIN dbo.DimProductSubcategory AS ps
  ON p.ProductSubcategoryKey = ps.ProductSubcategoryKey
 INNER JOIN dbo.DimProductCategory AS pc
  ON ps.ProductCategoryKey = pc.ProductCategoryKey)
 RIGHT OUTER JOIN dbo.DimEmployee AS e
  ON e.EmployeeKey = fr.EmployeeKey;

The query returns 61,134 rows, as it should. Note that with the usage of the parenthesis, the order of joins is defined in the following way:

  • Perform all inner joins, with an arbitrary order among them

  • Execute the left outer join after all of the inner joins

Grouping and aggregating data

Many times, you need to aggregate data in groups. This is where the GROUP BY clause comes in handy. The following query aggregates the sales data for each employee:

SELECT e.EmployeeKey,
 MIN(e.LastName) AS LastName,
 SUM(fr.OrderQuantity)AS EmpTotalQuantity,
 SUM(fr.SalesAmount) AS EmpTotalAmount
FROM dbo.DimEmployee AS e
 INNER JOIN dbo.FactResellerSales AS fr
  ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey;

The query returns 17 aggregated rows. Here are the results, abbreviated to the first three rows only:

EmployeeKey LastName   EmpTotalQuantity EmpTotalAmount
----------- ---------- ---------------- --------------
284         Vargas     11544            3609447.2163
295         Valdez     6898             1790640.2311
281         Blythe     23058            9293903.0055

In the SELECT clause, you can have only the columns used for grouping, or aggregated columns. That is why the LastName column in the SELECT list is used in the MIN() aggregate function. You need to get a scalar, a single aggregated value for each row for each column not included in the GROUP BY list.

 

 

Sometimes, you need to filter aggregated data. For example, you might need to find only the employees for which the sum of the order quantity did not reach 10,000. You can filter the result set on the aggregated data by using the HAVING clause:

SELECT e.EmployeeKey,
 MIN(e.LastName) AS LastName,
 SUM(fr.OrderQuantity)AS EmpTotalQuantity,
 SUM(fr.SalesAmount) AS EmpTotalAmount
FROM dbo.DimEmployee AS e
 INNER JOIN dbo.FactResellerSales AS fr
  ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey
HAVING SUM(fr.OrderQuantity) < 10000;

The query returns eight rows only. Note that you can't use column aliases from the SELECT clause in any other clause introduced in the previous query. The SELECT clause logically executes after all other clauses from the query, and the aliases are not known yet. However, the ORDER BY clause, which sorts the result, executes after the SELECT clause, and therefore the columns aliases are already known and you can refer to them. The following query shows the nine employees with sum of the OrderQuantity variable greater than 10,000, sorted in descending order by this sum:

SELECT e.EmployeeKey,
 MIN(e.LastName) AS LastName,
 SUM(fr.OrderQuantity)AS EmpTotalQuantity,
 SUM(fr.SalesAmount) AS EmpTotalAmount
FROM dbo.DimEmployee AS e
 INNER JOIN dbo.FactResellerSales AS fr
  ON e.EmployeeKey = fr.EmployeeKey
GROUP BY e.EmployeeKey
HAVING SUM(fr.OrderQuantity) > 10000
ORDER BY EmpTotalQuantity DESC;

You can see the shortened results as follows:

EmployeeKey LastName   EmpTotalQuantity EmpTotalAmount
----------- ---------- ---------------- --------------
282         Mitchell   27229            10367007.4286
283         Carson     27051            10065803.5429
291         Pak        26231            8503338.6472