Mastering SQL Server 2014 Data Mining

By: Amarpreet Singh Bassan, Debarchan Sarkar

Overview of this book

Whether you are new to data mining or are a seasoned expert, this book will provide you with the skills you need to successfully create, customize, and work with Microsoft Data Mining Suite. Starting with the basics, this book will cover how to clean the data, design the problem, and choose a data mining model that will give you the most accurate prediction.

Next, you will be taken through the various classification models such as the decision tree data model, neural network model, as well as Naïve Bayes model. Following this, you'll learn about the clustering and association algorithms, along with the sequencing and regression algorithms, and understand the data mining expressions associated with each algorithm. With ample screenshots that offer a step-by-step account of how to build a data mining solution, this book will ensure your success with this cutting-edge data mining system.

Data mining life cycle


Before going into further detail, it is important to understand the various stages of the data mining life cycle. The data mining life cycle can be broadly classified into the following steps:

  1. Understanding the business requirement.

  2. Understanding the data.

  3. Preparing the data for analysis.

  4. Preparing the data mining models.

  5. Evaluating the results of the analysis prepared with the models.

  6. Deploying the models to SQL Server Analysis Services.

  7. Repeating steps 1 to 6 whenever the business requirement changes.

Let's look at each of these stages in detail.

First and foremost, the goals must be well defined before the mining process begins. This is a crucial part of the data mining exercise, and you need to answer the following questions:

  • What and whom are we targeting?

  • What is the outcome we are targeting?

  • What is the time frame for which we have the data and what is the target time period that our data is going to forecast?

  • What would the success measures look like?

Let's define a classic problem and understand more about the preceding questions. Note that for most of this book, we will be using the AdventureWorks and AdventureWorksDW databases for our data mining activities, as they already have the schema and dimensions predefined. We can use them to discuss how to extract the information rather than spending our time on defining the schema.

The details on how to acquire the AdventureWorks database are discussed in the Preface of this book.

Consider an instance where you are a salesman for the AdventureWorks Cycles company, and you need to make predictions that could be used in marketing the products. The problem sounds simple and straightforward, but any serious data miner would immediately come up with many questions. Why? The answer lies in the exactness of the information being searched for. Let's discuss this in detail.

The problem statement comprises the words predictions and marketing. When we talk about predictions, there are several insights that we seek, namely:

  • What is it that we are predicting? (for example: customers, product sales, and so on)

  • What is the time period of the data that we are selecting for prediction?

  • What time period are we going to have the prediction for?

  • What is the expected outcome of the prediction exercise?

From the marketing point of view, several follow-up questions must be answered:

  • What is our target for marketing: a new product or an older product?

  • Is our marketing strategy product centric or customer centric? Are we going to market our product irrespective of the customer classification, or are we marketing our product according to customer classification?

  • On what period in the past is our marketing going to be based?

We might observe that there are many questions that overlap the two categories and, therefore, there is an opportunity to consolidate the questions and classify them as follows:

  • What is the population that we are targeting?

  • What are the factors that we will actually be looking at?

  • What is the time period of the past data that we will be looking at?

  • What is the time period in the future that we will be considering the data mining results for?

Let's throw some light on these aspects based on the AdventureWorks example. We will get answers to the preceding questions and arrive at a more refined problem statement.

What is the population that we are targeting? The target population might be classified according to the following aspects:

  • Age

  • Salary

  • Number of kids

What are the factors that we are actually looking at? They might be classified as follows:

  • Geographical location: People living in hilly areas might prefer all-terrain bikes (ATBs), whereas people living on the plains might prefer daily commute bikes.

  • Household: People living in affluent areas might look for bikes with the latest gears and state-of-the-art accessories, whereas people in suburban areas might mostly look for budget bikes.

  • Affinity of components: People who buy bikes also tend to buy accessories.
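
The affinity factor in the preceding list can be explored with a simple co-occurrence count before any mining model is built. The following is a minimal sketch in Python, not the association algorithm covered later in this book; the `orders` data and the `pair_counts` helper are hypothetical and serve only to illustrate the idea of counting which items are bought together.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: each set holds the items that appear on one invoice.
orders = [
    {"Mountain Bike", "Helmet", "Bottle Cage"},
    {"Road Bike", "Helmet"},
    {"Mountain Bike", "Helmet"},
    {"Touring Bike", "Bottle Cage"},
]


def pair_counts(orders):
    """Count how often each unordered pair of items shares an order."""
    counts = Counter()
    for items in orders:
        # sorted() gives each pair a canonical order, so (A, B) and (B, A)
        # are counted as the same pair.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(orders)
print(counts.most_common(1))
```

With this made-up data, the most frequent pair is a helmet bought with a mountain bike, which is the kind of signal that suggests the two might be marketed together.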

What is the time period of the past data that we would be looking at? Usually, the data that we get is very large and often contains information that we can fairly label as noise. To sift out the useful information, we have to determine exactly how far into the past we should look; for example, we can look at the data for the past year, the past two years, or the past five years.

We also need to decide the future period for which the data mining results will be considered valid. We might be predicting our market strategy for an upcoming festive season or for the whole year. Market trends change, and so do people's needs and requirements, so we need to set a time frame after which we refresh our findings; for example, predictions based on the past five years' data might remain valid for the upcoming two or three years, depending on the results that we get.
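
Choosing how far back to look amounts to filtering the records by a cutoff date. The following is a minimal sketch in Python under stated assumptions: the `sales` records, the `as_of` date, and the `within_lookback` helper are all hypothetical and are not taken from the AdventureWorks schema.

```python
from datetime import date

# Hypothetical sales records: (order_date, amount).
sales = [
    (date(2010, 6, 1), 1200.0),
    (date(2012, 9, 15), 800.0),
    (date(2013, 11, 30), 450.0),
    (date(2014, 2, 10), 999.0),
]


def within_lookback(records, as_of, years):
    """Keep records dated within `years` years before `as_of`, inclusive."""
    try:
        cutoff = as_of.replace(year=as_of.year - years)
    except ValueError:
        # as_of is 29 February and the cutoff year is not a leap year.
        cutoff = as_of.replace(year=as_of.year - years, day=28)
    return [r for r in records if cutoff <= r[0] <= as_of]


as_of = date(2014, 6, 30)
for years in (1, 2, 5):
    print(years, len(within_lookback(sales, as_of, years)))
```

Comparing how many records each window retains is one quick way to judge whether a shorter window still leaves enough data to work with.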

Now that we have taken a closer look into the problem, let's redefine the problem more accurately. AdventureWorks Cycles has several stores in various locations and, based on the location, we would like to get an insight into the following:

  • Which products should be stocked where?

  • Which products should be stocked together?

  • How many products should be stocked?

  • What is the trend of sales for a new product in an area?
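
The first of these questions, which products should be stocked where, can be approximated with a plain aggregation before any model is trained. The following is a minimal sketch in Python; the `rows` data and the `units_by_location` and `best_seller` helpers are hypothetical, and the approach is simple summation rather than a data mining algorithm.

```python
from collections import defaultdict

# Hypothetical sales rows: (location, product, units_sold).
rows = [
    ("Hillside", "Mountain Bike", 40),
    ("Hillside", "Road Bike", 5),
    ("Plains", "Road Bike", 30),
    ("Plains", "Mountain Bike", 10),
    ("Hillside", "Mountain Bike", 15),
]


def units_by_location(rows):
    """Sum the units sold for every (location, product) combination."""
    totals = defaultdict(lambda: defaultdict(int))
    for location, product, units in rows:
        totals[location][product] += units
    return totals


def best_seller(totals):
    """Return the product with the highest unit total in each location."""
    return {
        location: max(products, key=products.get)
        for location, products in totals.items()
    }


totals = units_by_location(rows)
print(best_seller(totals))
```

A summary like this only describes past sales; the mining models discussed in this book are what turn such history into predictions.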

We may not get answers to all of these detailed questions, but simply looking for the answers will yield several insights that help us make better business decisions.