
Staging data

In this phase, we collect data from all the sources and dump it into a common repository, which can be any database system such as SQL Server, Oracle, and so on. An organization usually has various applications to keep track of the data from its different departments, and it is quite possible that these applications use different database systems to store that data. The staging phase is therefore characterized by dumping the data from all these source systems into a centralized repository.
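A minimal T-SQL sketch of this dumping step, assuming a hypothetical staging database named Staging, a hypothetical linked server named SourceERP, and illustrative table and column names, might look like this:

-- Create a staging table that mirrors the structure of the source table (names are illustrative).
CREATE TABLE Staging.dbo.Stg_Customer
(
    CustomerID   INT,
    CustomerName NVARCHAR(100),
    ModifiedDate DATETIME
);

-- Dump the rows from one source system into the centralized staging table.
INSERT INTO Staging.dbo.Stg_Customer (CustomerID, CustomerName, ModifiedDate)
SELECT CustomerID, CustomerName, ModifiedDate
FROM   SourceERP.SalesApp.dbo.Customer;   -- four-part name via the hypothetical linked server

A similar INSERT ... SELECT (or an SSIS or bcp step) would be repeated for every other source system feeding the repository.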

Extract, transform, and load

This term is most commonly used when we talk about data warehouses. As the name suggests, ETL has the following three parts (a combined T-SQL sketch follows this list):

  • Extract: The data is extracted from the various source databases and any other data stores that might contain the information we seek

  • Transform: Transformations are applied to the data to fit the operational needs, such as cleaning, calculating derived values, removing duplicates, reformatting, and so on

  • Load: The transformed data is loaded into the destination data store
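Putting the three steps together, a minimal T-SQL sketch might look like the following; the staging table Stg_Customer and the target table DW_Customer are hypothetical names used only for illustration:

-- Extract: read the raw rows from the staging table.
-- Transform: clean, deduplicate, and reformat the values on the fly.
-- Load: insert the result into the destination table.
INSERT INTO DW.dbo.DW_Customer (CustomerID, CustomerName)
SELECT DISTINCT
       s.CustomerID,
       UPPER(LTRIM(RTRIM(s.CustomerName)))   -- simple cleaning and reformatting
FROM   Staging.dbo.Stg_Customer AS s
WHERE  s.CustomerID IS NOT NULL;             -- discard rows without a usable key

In practice, such steps are usually built as SSIS packages rather than ad hoc statements, but the logic is the same.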

It is commonly believed that ETL is required only until the data is loaded into the data warehouse, but this is not true. ETL can be used anywhere we need to transform data, as shown in the following figure:

Data warehouse

As evident from the preceding figure, the next stage is the data warehouse. The AdventureWorksDW database is the outcome of the ETL applied to the staging database, which is AdventureWorks. We will now discuss the concepts of data warehousing and some best practices, and then relate these concepts to the AdventureWorksDW database.

Measures and dimensions

There are a few common terms you will encounter as you enter the world of data warehousing. This section discusses them to help you get familiar with them:

  • Measure: Any business entity that can be aggregated, or whose value can be expressed as a number, is termed a measure, for example, sales, number of products, and so on

  • Dimension: This is any business entity that lends meaning to the measures, for example, in an organization, the quantity of goods sold is a measure, but the month in which they were sold is a dimension (a query sketch illustrating this distinction follows the list)
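In query terms, a measure is what we aggregate and a dimension is what we group by. A small sketch, assuming the standard FactInternetSales and DimDate tables of AdventureWorksDW, illustrates this:

-- SalesAmount is the measure; the calendar month from DimDate is the dimension.
SELECT d.CalendarYear,
       d.MonthNumberOfYear,
       SUM(f.SalesAmount) AS TotalSales
FROM   dbo.FactInternetSales AS f
JOIN   dbo.DimDate AS d
       ON d.DateKey = f.OrderDateKey
GROUP BY d.CalendarYear, d.MonthNumberOfYear
ORDER BY d.CalendarYear, d.MonthNumberOfYear;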

Schema

Basically, a schema determines the relationship of the various entities with each other. There are essentially two types of schema, namely:

  • Star schema: This is a relationship where the measures have a direct relationship with the dimensions. Let's look at an instance wherein a seller has several stores that sell several products. The relationship of the tables based on the star schema will be as shown in the following screenshot:

  • Snowflake schema: This is a relationship wherein the measures may have a direct or an indirect relationship with the dimensions. We design a snowflake schema when we want a more detailed drill down of the data. A snowflake schema usually involves hierarchies, as shown in the following screenshot (a minimal DDL sketch of both layouts follows this list):
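A bare-bones sketch of a star schema in T-SQL is shown here; every table and column name is illustrative only:

-- Dimension tables: one row per product and one row per store.
CREATE TABLE dbo.DimProductDemo (ProductKey INT PRIMARY KEY, ProductName NVARCHAR(50));
CREATE TABLE dbo.DimStoreDemo   (StoreKey   INT PRIMARY KEY, StoreName   NVARCHAR(50));

-- Fact table: every measure row points directly at its dimensions (star schema).
CREATE TABLE dbo.FactSalesDemo
(
    SalesKey    INT IDENTITY(1,1) PRIMARY KEY,
    ProductKey  INT NOT NULL REFERENCES dbo.DimProductDemo (ProductKey),
    StoreKey    INT NOT NULL REFERENCES dbo.DimStoreDemo   (StoreKey),
    SalesAmount MONEY NOT NULL
);

-- In a snowflake schema, DimProductDemo would itself reference further tables
-- (for example, a product subcategory table), giving an indirect relationship.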

Data mart

While a data warehouse is an organization-wide repository of data, extracting data from such a huge repository might well be an uphill task. We therefore segregate the data according to the department or the specialty it belongs to, so that we have much smaller sections of the data to work with and extract information from. We call these smaller data warehouses data marts.

Let's consider the sales for AdventureWorks Cycles. To make any predictions on the sales of AdventureWorks Cycles, we will have to group all the tables associated with sales together in a data mart. Based on the AdventureWorks database, we have the following tables in the AdventureWorks sales data mart.

The Internet sales facts table has the following data:

[ProductKey]
 [OrderDateKey]
 [DueDateKey]
 [ShipDateKey]
 [CustomerKey]
 [PromotionKey]
 [CurrencyKey]
 [SalesTerritoryKey]
 [SalesOrderNumber]
 [SalesOrderLineNumber]
 [RevisionNumber]
 [OrderQuantity]
 [UnitPrice]
 [ExtendedAmount]
 [UnitPriceDiscountPct]
 [DiscountAmount]
 [ProductStandardCost]
 [TotalProductCost]
 [SalesAmount]
 [TaxAmt]
 [Freight]
 [CarrierTrackingNumber]
 [CustomerPONumber]
 [OrderDate]
 [DueDate]
 [ShipDate]

From the preceding columns, we can easily identify that if we need to separate out the tables required to perform the sales analysis alone, we can safely include the following:

  • Product: This provides the following data:

    [ProductKey]
    [ListPrice]
  • Date: This provides the following data:

    [DateKey]
  • Customer: This provides the following data:

    [CustomerKey]
  • Currency: This provides the following data:

    [CurrencyKey]
  • Sales territory: This provides the following data:

    [SalesTerritoryKey]

The preceding tables provide the relevant dimensions, and the facts are already contained in the FactInternetSales table; hence, we can easily perform all the analysis pertaining to the sales of the organization.
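As a quick check that the mart supports sales analysis on its own, a sketch query joining the fact table to two of these dimensions (assuming the standard AdventureWorksDW table and column names) could look like this:

-- Total Internet sales quantity and amount per product and sales territory.
SELECT p.EnglishProductName,
       st.SalesTerritoryRegion,
       SUM(f.OrderQuantity) AS TotalQuantity,
       SUM(f.SalesAmount)   AS TotalSales
FROM   dbo.FactInternetSales AS f
JOIN   dbo.DimProduct        AS p  ON p.ProductKey         = f.ProductKey
JOIN   dbo.DimSalesTerritory AS st ON st.SalesTerritoryKey = f.SalesTerritoryKey
GROUP BY p.EnglishProductName, st.SalesTerritoryRegion;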

Refreshing data

Based on the nature of the business and the requirements of the analysis, data can be refreshed either incrementally, wherein only new data is added to the existing tables, or completely, wherein the tables are emptied and reloaded with the full dataset consisting of both the old and the new data.

Let's discuss the preceding points in the context of the AdventureWorks database. We will take the employee table to begin with. The following is the list of columns in the employee table:

[BusinessEntityID]
,[NationalIDNumber]
,[LoginID]
,[OrganizationNode]
,[OrganizationLevel]
,[JobTitle]
,[BirthDate]
,[MaritalStatus]
,[Gender]
,[HireDate]
,[SalariedFlag]
,[VacationHours]
,[SickLeaveHours]
,[CurrentFlag]
,[rowguid]
,[ModifiedDate]

Considering an organization in the real world, we do not have a large number of employees leaving and joining the organization, so it does not really make sense to have a procedure in place that reloads the dimension tables in full. Prior to SQL Server 2008, we had to follow the method described in the next section to keep track of the changes. SQL Server 2008 introduced Change Data Capture (CDC) and Change Tracking (CT), which help with incremental loading of our data warehouse; however, the solution presented in the following section is a generalized one that will work for any source database. When it comes to managing changes in the dimension tables, Slowly Changing Dimensions (SCD) are worth a mention. We will briefly look at SCD here. There are three types of SCD, namely:

  • Type 1: The older values are overwritten by the new values

  • Type 2: A new row specifying the current value of the dimension attribute is inserted, and the old row is kept as history (a sketch of this follows the list)

  • Type 3: A separate column holds the previous value, and a timestamp column specifying when the new value became effective is updated
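As an illustration, a Type 2 change can be implemented with a simple pair of statements. The EndDate and IsCurrent columns, the EmployeeNationalID column, and the variable values shown here are assumptions made for this sketch, not part of the standard DimEmployee definition:

-- Hypothetical inputs: the employee's national ID and the new attribute value.
DECLARE @NationalID  NVARCHAR(15) = N'999999999';
DECLARE @NewJobTitle NVARCHAR(50) = N'Senior Design Engineer';

-- Step 1: close off the row that currently describes this employee.
UPDATE dbo.DimEmployee
SET    EndDate = GETDATE(), IsCurrent = 0
WHERE  EmployeeNationalID = @NationalID AND IsCurrent = 1;

-- Step 2: insert a new row carrying the new value of the changed attribute.
INSERT INTO dbo.DimEmployee (EmployeeNationalID, JobTitle, StartDate, IsCurrent)
VALUES (@NationalID, @NewJobTitle, GETDATE(), 1);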

Let's take the example of using HireDate as the basis for incremental loading. We will also have to maintain a small table that keeps track of the data that has been loaded from the employee table. So, we create a table as follows:

-- Tracks the hire dates that have been loaded and whether each load succeeded or failed.
CREATE TABLE employee_load_status
(
HireDate   DATETIME,
LoadStatus VARCHAR(10)   -- an unsized VARCHAR would hold only one character
);

The following script will load the employee table from the AdventureWorks database to the DimEmployee table in the AdventureWorksDW database:

WITH employee_loaded_date (HireDate) AS
(
-- Latest successful load date, or 1900-01-01 if nothing has been loaded yet or a load failed.
SELECT ISNULL(MAX(HireDate), CAST('19000101' AS DATETIME)) FROM employee_load_status WHERE LoadStatus = 'success'
UNION ALL
SELECT ISNULL(MIN(HireDate), CAST('19000101' AS DATETIME)) FROM employee_load_status WHERE LoadStatus = 'failed'
)
INSERT INTO DimEmployee
SELECT * FROM employee
WHERE HireDate >= (SELECT MIN(HireDate) FROM employee_loaded_date);

This will reload all the data from the date of the first failure till the present day.

A similar procedure can be followed to load the fact table, but there is a catch. If we look at the sales table in the AdventureWorks database, we see the following columns:

[BusinessEntityID]
,[TerritoryID]
,[SalesQuota]
,[Bonus]
,[CommissionPct]
,[SalesYTD]
,[SalesLastYear]
,[rowguid]
,[ModifiedDate]

The SalesYTD column might change with every passing day, so do we perform a full load every day or do we perform an incremental load based on date? This will depend upon the procedure used to load the data in the sales table and the ModifiedDate column.

Assuming the ModifiedDate column reflects the date on which the load was performed, we also see that there is no table in the AdventureWorksDW database that uses the SalesYTD field directly. We will have to apply some transformation to get the values of OrderQuantity, DateOfShipment, and so on.

Let's look at this with a simpler example. Consider we have the following sales table:

Name     SalesAmount   Date
Rama     1000          11-02-2014
Shyama   2000          11-02-2014

Consider we have the following fact table:

id   SalesAmount   Datekey

We will have to decide whether to apply an incremental load or a complete reload of the table based on our end needs. The entries for the incremental load will look like this:

id   SalesAmount   Datekey
Ra   1000          11-02-2014
Sh   2000          11-02-2014
Ra   4000          12-02-2014
Sh   5000          13-02-2014

Also, a complete reload will appear as shown here:

id   TotalSalesAmount   Datekey
Ra   5000               12-02-2014
Sh   7000               13-02-2014

Notice how the SalesAmount column changes to TotalSalesAmount depending on the load criteria.
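A sketch of the two approaches in T-SQL, using hypothetical staging and fact tables whose names and types are chosen only for this example, could look like the following:

-- Hypothetical tables for this example.
CREATE TABLE dbo.Stg_Sales      (id NVARCHAR(10), SalesAmount MONEY, Datekey DATE);
CREATE TABLE dbo.FactSales      (id NVARCHAR(10), SalesAmount MONEY, Datekey DATE);
CREATE TABLE dbo.FactSalesTotal (id NVARCHAR(10), TotalSalesAmount MONEY, Datekey DATE);

-- Incremental load: append only rows newer than anything already in the fact table.
INSERT INTO dbo.FactSales (id, SalesAmount, Datekey)
SELECT s.id, s.SalesAmount, s.Datekey
FROM   dbo.Stg_Sales AS s
WHERE  s.Datekey > ISNULL((SELECT MAX(f.Datekey) FROM dbo.FactSales AS f), '1900-01-01');

-- Complete reload: empty the target and rebuild it with one aggregated row per id.
TRUNCATE TABLE dbo.FactSalesTotal;
INSERT INTO dbo.FactSalesTotal (id, TotalSalesAmount, Datekey)
SELECT s.id, SUM(s.SalesAmount), MAX(s.Datekey)
FROM   dbo.Stg_Sales AS s
GROUP BY s.id;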