
Staging data

In this phase, we collect data from all the sources and dump it into a common repository, which can be any database system such as SQL Server, Oracle, and so on. An organization usually has various applications to keep track of the data from its different departments, and it is quite possible that these applications use different database systems to store that data. The staging phase is therefore characterized by dumping the data from all these source systems into a centralized repository.
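A minimal T-SQL sketch of this dumping step, assuming a hypothetical staging database named Staging, a hypothetical linked server named SourceERP, and illustrative table and column names, might look like this:

-- Create a staging table that mirrors the structure of the source table (names are illustrative).
CREATE TABLE Staging.dbo.Stg_Customer
(
    CustomerID   INT,
    CustomerName NVARCHAR(100),
    ModifiedDate DATETIME
);

-- Dump the rows from one source system into the centralized staging table.
INSERT INTO Staging.dbo.Stg_Customer (CustomerID, CustomerName, ModifiedDate)
SELECT CustomerID, CustomerName, ModifiedDate
FROM   SourceERP.SalesApp.dbo.Customer;   -- four-part name via the hypothetical linked server

A similar INSERT ... SELECT (or an SSIS or bcp step) would be repeated for every other source system feeding the repository.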

Extract, transform, and load

This term is most commonly used when we talk about data warehouses. As the name suggests, ETL has the following three parts (a combined T-SQL sketch follows this list):

  • Extract: The data is extracted from the various source databases and any other data stores that might contain the information we seek

  • Transform: Transformations are applied to the data to fit the operational needs, such as cleaning, calculating derived values, removing duplicates, reformatting, and so on

  • Load: The transformed data is loaded into the destination data store
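Putting the three steps together, a minimal T-SQL sketch might look like the following; the staging table Stg_Customer and the target table DW_Customer are hypothetical names used only for illustration:

-- Extract: read the raw rows from the staging table.
-- Transform: clean, deduplicate, and reformat the values on the fly.
-- Load: insert the result into the destination table.
INSERT INTO DW.dbo.DW_Customer (CustomerID, CustomerName)
SELECT DISTINCT
       s.CustomerID,
       UPPER(LTRIM(RTRIM(s.CustomerName)))   -- simple cleaning and reformatting
FROM   Staging.dbo.Stg_Customer AS s
WHERE  s.CustomerID IS NOT NULL;             -- discard rows without a usable key

In practice, such steps are usually built as SSIS packages rather than ad hoc statements, but the logic is the same.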

It is commonly believed that ETL is required only until the data is loaded into the data warehouse, but this is not true. ETL can be used anywhere we need to transform data, as shown in the following figure:

Data warehouse

As evident from the preceding figure, the next stage is the data warehouse. The AdventureWorksDW database is the outcome of the ETL applied to the staging database, which is AdventureWorks. We will now discuss the concepts of data warehousing and some best practices, and then relate these concepts to the AdventureWorksDW database.

Measures and dimensions

There are a few common terms you will encounter as you enter the world of data warehousing. This section discusses them to help you get familiar with them:

  • Measure: Any business entity that can be aggregated, or whose value can be expressed as a number, is termed a measure, for example, sales, number of products, and so on

  • Dimension: This is any business entity that lends meaning to the measures, for example, in an organization, the quantity of goods sold is a measure, but the month in which they were sold is a dimension (a query sketch illustrating this distinction follows the list)
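In query terms, a measure is what we aggregate and a dimension is what we group by. A small sketch, assuming the standard FactInternetSales and DimDate tables of AdventureWorksDW, illustrates this:

-- SalesAmount is the measure; the calendar month from DimDate is the dimension.
SELECT d.CalendarYear,
       d.MonthNumberOfYear,
       SUM(f.SalesAmount) AS TotalSales
FROM   dbo.FactInternetSales AS f
JOIN   dbo.DimDate AS d
       ON d.DateKey = f.OrderDateKey
GROUP BY d.CalendarYear, d.MonthNumberOfYear
ORDER BY d.CalendarYear, d.MonthNumberOfYear;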

Schema

Basically, a schema determines the relationship of the various entities with each other. There are essentially two types of schema, namely:

  • Star schema: This is a relationship where the measures have a direct relationship with the dimensions. Let's look at an instance wherein a seller has several stores that sell several products. The relationship of the tables based on the star schema will be as shown in the following screenshot:

  • Snowflake schema: This is a relationship wherein the measures may have a direct or an indirect relationship with the dimensions. We design a snowflake schema when we want a more detailed drill down of the data. A snowflake schema usually involves hierarchies, as shown in the following screenshot (a minimal DDL sketch of both layouts follows this list):
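A bare-bones sketch of a star schema in T-SQL is shown here; every table and column name is illustrative only:

-- Dimension tables: one row per product and one row per store.
CREATE TABLE dbo.DimProductDemo (ProductKey INT PRIMARY KEY, ProductName NVARCHAR(50));
CREATE TABLE dbo.DimStoreDemo   (StoreKey   INT PRIMARY KEY, StoreName   NVARCHAR(50));

-- Fact table: every measure row points directly at its dimensions (star schema).
CREATE TABLE dbo.FactSalesDemo
(
    SalesKey    INT IDENTITY(1,1) PRIMARY KEY,
    ProductKey  INT NOT NULL REFERENCES dbo.DimProductDemo (ProductKey),
    StoreKey    INT NOT NULL REFERENCES dbo.DimStoreDemo   (StoreKey),
    SalesAmount MONEY NOT NULL
);

-- In a snowflake schema, DimProductDemo would itself reference further tables
-- (for example, a product subcategory table), giving an indirect relationship.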

Data mart

While a data warehouse is an organization-wide repository of data, extracting data from such a huge repository might well be an uphill task. We therefore segregate the data according to the department or the specialty it belongs to, so that we have much smaller sections of the data to work with and extract information from. We call these smaller data warehouses data marts.

Let's consider the sales for AdventureWorks Cycles. To make any predictions on the sales of AdventureWorks Cycles, we will have to group all the tables associated with sales together in a data mart. Based on the AdventureWorks database, we have the following tables in the AdventureWorks sales data mart.

The Internet sales facts table has the following data:

[ProductKey]
 [OrderDateKey]
 [DueDateKey]
 [ShipDateKey]
 [CustomerKey]
 [PromotionKey]
 [CurrencyKey]
 [SalesTerritoryKey]
 [SalesOrderNumber]
 [SalesOrderLineNumber]
 [RevisionNumber]
 [OrderQuantity]
 [UnitPrice]
 [ExtendedAmount]
 [UnitPriceDiscountPct]
 [DiscountAmount]
 [ProductStandardCost]
 [TotalProductCost]
 [SalesAmount]
 [TaxAmt]
 [Freight]
 [CarrierTrackingNumber]
 [CustomerPONumber]
 [OrderDate]
 [DueDate]
 [ShipDate]

From the preceding columns, we can easily identify that if we need to separate out the tables required to perform the sales analysis alone, we can safely include the following:

  • Product: This provides the following data:

    [ProductKey]
    [ListPrice]
  • Date: This provides the following data:

    [DateKey]
  • Customer: This provides the following data:

    [CustomerKey]
  • Currency: This provides the following data:

    [CurrencyKey]
  • Sales territory: This provides the following data:

    [SalesTerritoryKey]

The preceding tables provide the relevant dimensions, and the facts are already contained in the FactInternetSales table; hence, we can easily perform all the analysis pertaining to the sales of the organization.
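As a quick check that the mart supports sales analysis on its own, a sketch query joining the fact table to two of these dimensions (assuming the standard AdventureWorksDW table and column names) could look like this:

-- Total Internet sales quantity and amount per product and sales territory.
SELECT p.EnglishProductName,
       st.SalesTerritoryRegion,
       SUM(f.OrderQuantity) AS TotalQuantity,
       SUM(f.SalesAmount)   AS TotalSales
FROM   dbo.FactInternetSales AS f
JOIN   dbo.DimProduct        AS p  ON p.ProductKey         = f.ProductKey
JOIN   dbo.DimSalesTerritory AS st ON st.SalesTerritoryKey = f.SalesTerritoryKey
GROUP BY p.EnglishProductName, st.SalesTerritoryRegion;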

Refreshing data

Based on the nature of the business and the requirements of the analysis, data can be refreshed either incrementally, wherein only new data is added to the existing tables, or completely, wherein the tables are emptied and reloaded with the full dataset consisting of both the old and the new data.

Let's discuss the preceding points in the context of the AdventureWorks database. We will take the employee table to begin with. The following is the list of columns in the employee table:

[BusinessEntityID]
,[NationalIDNumber]
,[LoginID]
,[OrganizationNode]
,[OrganizationLevel]
,[JobTitle]
,[BirthDate]
,[MaritalStatus]
,[Gender]
,[HireDate]
,[SalariedFlag]
,[VacationHours]
,[SickLeaveHours]
,[CurrentFlag]
,[rowguid]
,[ModifiedDate]

Considering an organization in the real world, we do not have a large number of employees leaving and joining the organization, so it does not really make sense to have a procedure in place that reloads the dimension tables in full. Prior to SQL Server 2008, we had to follow the method described in the next section to keep track of the changes. SQL Server 2008 introduced Change Data Capture (CDC) and Change Tracking (CT), which help with incremental loading of our data warehouse; however, the solution presented in the following section is a generalized one that will work for any source database. When it comes to managing changes in the dimension tables, Slowly Changing Dimensions (SCD) are worth a mention. We will briefly look at SCD here. There are three types of SCD, namely:

  • Type 1: The older values are overwritten by the new values

  • Type 2: A new row specifying the current value of the dimension attribute is inserted, and the old row is kept as history (a sketch of this follows the list)

  • Type 3: A separate column holds the previous value, and a timestamp column specifying when the new value became effective is updated
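As an illustration, a Type 2 change can be implemented with a simple pair of statements. The EndDate and IsCurrent columns, the EmployeeNationalID column, and the variable values shown here are assumptions made for this sketch, not part of the standard DimEmployee definition:

-- Hypothetical inputs: the employee's national ID and the new attribute value.
DECLARE @NationalID  NVARCHAR(15) = N'999999999';
DECLARE @NewJobTitle NVARCHAR(50) = N'Senior Design Engineer';

-- Step 1: close off the row that currently describes this employee.
UPDATE dbo.DimEmployee
SET    EndDate = GETDATE(), IsCurrent = 0
WHERE  EmployeeNationalID = @NationalID AND IsCurrent = 1;

-- Step 2: insert a new row carrying the new value of the changed attribute.
INSERT INTO dbo.DimEmployee (EmployeeNationalID, JobTitle, StartDate, IsCurrent)
VALUES (@NationalID, @NewJobTitle, GETDATE(), 1);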

Let's take the example of using HireDate as the basis for incremental loading. We will also have to maintain a small table that keeps track of the data that has been loaded from the employee table. So, we create a table as follows:

-- Tracks the hire dates that have been loaded and whether each load succeeded or failed.
CREATE TABLE employee_load_status
(
HireDate   DATETIME,
LoadStatus VARCHAR(10)   -- an unsized VARCHAR would hold only one character
);

The following script will load the employee table from the AdventureWorks database to the DimEmployee table in the AdventureWorksDW database:

WITH employee_loaded_date (HireDate) AS
(
-- Latest successful load date, or 1900-01-01 if nothing has been loaded yet or a load failed.
SELECT ISNULL(MAX(HireDate), CAST('19000101' AS DATETIME)) FROM employee_load_status WHERE LoadStatus = 'success'
UNION ALL
SELECT ISNULL(MIN(HireDate), CAST('19000101' AS DATETIME)) FROM employee_load_status WHERE LoadStatus = 'failed'
)
INSERT INTO DimEmployee
SELECT * FROM employee
WHERE HireDate >= (SELECT MIN(HireDate) FROM employee_loaded_date);

This will reload all the data from the date of the first failure till the present day.

A similar procedure can be followed to load the fact table, but there is a catch. If we look at the sales table in the AdventureWorks database, we see the following columns:

[BusinessEntityID]
,[TerritoryID]
,[SalesQuota]
,[Bonus]
,[CommissionPct]
,[SalesYTD]
,[SalesLastYear]
,[rowguid]
,[ModifiedDate]

The SalesYTD column might change with every passing day, so do we perform a full load every day or do we perform an incremental load based on date? This will depend upon the procedure used to load the data in the sales table and the ModifiedDate column.

Assuming the ModifiedDate column reflects the date on which the load was performed, we also see that there is no table in the AdventureWorksDW database that uses the SalesYTD field directly. We will have to apply some transformation to get the values of OrderQuantity, DateOfShipment, and so on.

Let's look at this with a simpler example. Consider we have the following sales table:

Name     SalesAmount   Date
Rama     1000          11-02-2014
Shyama   2000          11-02-2014

Consider we have the following fact table:

id   SalesAmount   Datekey

We will have to decide whether to apply an incremental load or a complete reload of the table based on our end needs. The entries for the incremental load will look like this:

id   SalesAmount   Datekey
Ra   1000          11-02-2014
Sh   2000          11-02-2014
Ra   4000          12-02-2014
Sh   5000          13-02-2014

Also, a complete reload will appear as shown here:

id   TotalSalesAmount   Datekey
Ra   5000               12-02-2014
Sh   7000               13-02-2014

Notice how the SalesAmount column changes to TotalSalesAmount depending on the load criteria.
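A sketch of the two approaches in T-SQL, using hypothetical staging and fact tables whose names and types are chosen only for this example, could look like the following:

-- Hypothetical tables for this example.
CREATE TABLE dbo.Stg_Sales      (id NVARCHAR(10), SalesAmount MONEY, Datekey DATE);
CREATE TABLE dbo.FactSales      (id NVARCHAR(10), SalesAmount MONEY, Datekey DATE);
CREATE TABLE dbo.FactSalesTotal (id NVARCHAR(10), TotalSalesAmount MONEY, Datekey DATE);

-- Incremental load: append only rows newer than anything already in the fact table.
INSERT INTO dbo.FactSales (id, SalesAmount, Datekey)
SELECT s.id, s.SalesAmount, s.Datekey
FROM   dbo.Stg_Sales AS s
WHERE  s.Datekey > ISNULL((SELECT MAX(f.Datekey) FROM dbo.FactSales AS f), '1900-01-01');

-- Complete reload: empty the target and rebuild it with one aggregated row per id.
TRUNCATE TABLE dbo.FactSalesTotal;
INSERT INTO dbo.FactSalesTotal (id, TotalSalesAmount, Datekey)
SELECT s.id, SUM(s.SalesAmount), MAX(s.Datekey)
FROM   dbo.Stg_Sales AS s
GROUP BY s.id;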