Hands-On Data Science with SQL Server 2017

By: Marek Chmel, Vladimír Mužný

Overview of this book

SQL Server is a relational database management system that enables you to cover end-to-end data science processes using its various built-in services and features. Hands-On Data Science with SQL Server 2017 starts with an overview of data science with SQL to help you understand the core tasks in data science. You will learn intermediate-to-advanced level concepts to perform analytical tasks on data using SQL Server. The book has a unique approach, covering best practices, tasks, and challenges to test your abilities at the end of each chapter. You will explore the ins and outs of performing various key tasks such as data collection, cleaning, manipulation, aggregation, and filtering. As you make your way through the chapters, you will turn raw data into actionable insights by wrangling and extracting data from databases using T-SQL. You will get to grips with preparing and presenting data in a meaningful way, using Power BI to reveal hidden patterns. In the concluding chapters, you will work with SQL Server Integration Services to transform data into a useful format and delve into advanced examples covering machine learning concepts such as predictive analytics, using real-world examples. By the end of this book, you will be in a position to handle the growing amounts of data and perform the everyday activities that a data science professional performs.

Introducing data science

Data science is a modern term that spans a number of different disciplines. We can think of data science as a field that uses various tools, processes, methods, and algorithms to extract knowledge and insights from data, whether that data is stored in a structured or unstructured form. In one view, we can see data science as being quite similar to data mining.

Data science as a field includes everything associated with data manipulation: cleansing, preparation, analysis, visualization, and so on. Data science combines numerous skills that can be used for working with data, such as programming, reasoning, mathematical skills, and statistics.

Data science is frequently mentioned together with other buzzwords, such as big data and machine learning. As a matter of fact, projects working with machine learning and big data usually use data science principles, tools, and processes to build the application.

Why is data science so important to us? Well, up until 2005, mankind had created approximately 130 exabytes of data (1 exabyte = 1,000 petabytes). This number is growing quickly: the amount of data created around the world grows not linearly but exponentially, and it was expected to reach 40 zettabytes by 2020. Such a large amount of data can hardly be processed by machines, or even data scientists, but a proper approach can increase the fraction of data that we'll be able to analyze.

Data science project life cycle

There are different data science life cycles available, which can fit different projects. We'll focus most on the Team Data Science Process (TDSP) life cycle, which is defined by Microsoft and can be applied to data science with Microsoft SQL Server. The TDSP provides a recommended life cycle that you can use to structure your data science projects. The life cycle outlines the steps, from start to finish, that projects usually follow when they are executed.

Business understanding

A data science project usually has several phases. Each data science project begins with the business problem, or identifying the question. The key tasks addressed in this first stage of the project are the following:

  • Define the goal or objective: identify the required business variable that needs to be predicted by the analysis
  • Identify the data sources: find the required data that you will use for the data science project

When you work with stakeholders to identify the key questions and project goals, you should use sharp questions that will help you identify the required data for your analysis.

If your question is "What will my stock's sale price be next month?", then you'll need to analyze data that includes your stock's price history over past months, as well as the sales of your stock from those months. In a similar manner, you need to think about the business problem definition and ask specific questions that include numbers, names, categories, and so on.
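As a minimal illustration of the kind of historical data such a question requires, the following T-SQL sketch aggregates monthly sale prices and volumes from a hypothetical dbo.StockSales table (all table and column names here are assumptions, not part of any real schema):

    -- Hypothetical source: dbo.StockSales (SaleDate date, Ticker varchar(10),
    --                                      SalePrice decimal(10, 2), Quantity int)
    -- Monthly average sale price and total volume per ticker, as a starting
    -- point for analyzing the price trend.
    SELECT
        Ticker,
        DATEFROMPARTS(YEAR(SaleDate), MONTH(SaleDate), 1) AS SaleMonth,
        AVG(SalePrice) AS AvgSalePrice,
        SUM(Quantity)  AS TotalQuantity
    FROM dbo.StockSales
    GROUP BY Ticker, DATEFROMPARTS(YEAR(SaleDate), MONTH(SaleDate), 1)
    ORDER BY Ticker, SaleMonth;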

Based on the questions that you'll be trying to answer, you can also foresee the data science methods that you'll use to address such a question. Typical questions for data science projects would include the following:

  • How much, or how many?
  • Which category?
  • Which group?
  • Is this weird?
  • Which option should be taken?

During the first phase of the project, you also usually assemble the team for the Team Data Science Process. It consists of several roles, each specializing in a particular subject that is essential to the success of the project:

  • A data scientist: A highly educated and skilled person who can solve complex data problems by employing deep expertise in scientific disciplines (mathematics, statistics, or computer science).
  • A data professional: A skilled person who creates or maintains data systems and data solutions, or implements predictive models. These people work in professions such as database administration, database development, or BI development.
  • A software developer: A person who designs and develops programming logic, and who can apply machine learning to integrate predictive functionality into applications.
  • A project leader: A project leader manages the daily activities of individual data scientists and other project contributors on a specific data science project.

Getting data

The second phase of the project is related to data identification, acquisition, and understanding. Data comes from various data sources that provide it in structured, semi-structured, and unstructured formats. The input data may vary in quality and integrity, depending on the data source used to store the information. For the data analysis, we need to ingest the data into the target analytic environment, either on-premises or in the cloud. This environment can be built on numerous services from Microsoft, such as SQL Server (ideally with PolyBase to access external data) or cloud services such as Azure Storage accounts, HDInsight, and Azure Data Lake.
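As a hedged sketch of what such external data access can look like, the following T-SQL uses PolyBase in SQL Server 2017 to expose a CSV file held in Azure Blob Storage as an external table. All object names and the storage location are placeholders, and the database-scoped credential setup needed for a secured container is omitted:

    -- External data source pointing at an Azure Blob Storage container
    -- (credential setup omitted; names are illustrative).
    CREATE EXTERNAL DATA SOURCE AzureBlobStore
    WITH (
        TYPE = HADOOP,
        LOCATION = 'wasbs://data@mystorageaccount.blob.core.windows.net'
    );

    -- File format describing the CSV layout
    CREATE EXTERNAL FILE FORMAT CsvFormat
    WITH (
        FORMAT_TYPE = DELIMITEDTEXT,
        FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
    );

    -- External table that can be queried like a regular table
    CREATE EXTERNAL TABLE dbo.ExternalSales (
        SaleID   int,
        SaleDate date,
        Amount   decimal(10, 2)
    )
    WITH (LOCATION = '/sales/', DATA_SOURCE = AzureBlobStore, FILE_FORMAT = CsvFormat);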

Considering we'll load the data into Microsoft SQL Server, we need a good way to break the dataset down into individual rows and columns. Each row in the table will represent one event, instance, or item for our analysis; each column will represent an attribute of that row. Different projects will collect data at different levels of detail, based on the available data sources and our ability to process such data.
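A minimal sketch of such an analysis table, with one row per observed event and one column per attribute (all names are illustrative):

    -- One row = one sale event; columns are the attributes of that event.
    CREATE TABLE dbo.SalesEvents (
        SalesEventID int IDENTITY(1, 1) PRIMARY KEY,  -- surrogate key for the event
        SaleDate     date           NOT NULL,         -- when the event happened
        ProductID    int            NOT NULL,         -- categorical attribute
        Region       nvarchar(50)   NOT NULL,         -- categorical attribute
        Quantity     int            NOT NULL,         -- numerical attribute
        UnitPrice    decimal(10, 2) NOT NULL          -- numerical attribute
    );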

When we talk about the initial loading of the data into SQL Server, the target is usually referred to as a staging database. Since the data can be loaded from numerous different databases and repositories, dumping all the data from the sources into a centralized repository is usually the first step before building the analytical storage.

The next stage is the data warehouse. A data warehouse is a common term for an enterprise system used for reporting and data analysis, and it is usually the core of an enterprise business intelligence solution. While a data warehouse is an enterprise-wide repository of data, extracting insights from such a huge repository can be a challenging task. We can segregate the data according to the department or the category that the data belongs to, so that we have much smaller sections of the data to work with and extract information from. These smaller portions of the data warehouse are then referred to as data marts.
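One simple way to separate these layers inside a single SQL Server database is with schemas; the following sketch keeps raw loads in a stg schema, the typed warehouse tables in dwh, and exposes a department-level data mart as a view (all names are illustrative):

    CREATE SCHEMA stg;
    GO
    CREATE SCHEMA dwh;
    GO
    CREATE SCHEMA mart;
    GO

    -- Staging: data dumped from the source systems as-is, everything kept as text
    CREATE TABLE stg.Sales (SaleID int, SaleDate varchar(30), Amount varchar(30));

    -- Warehouse: the typed, cleansed version of the same data
    CREATE TABLE dwh.FactSales (SaleID int PRIMARY KEY, SaleDate date, Amount decimal(10, 2));
    GO

    -- Data mart: a narrow, department-focused slice of the warehouse
    CREATE VIEW mart.CurrentYearSales
    AS
    SELECT SaleID, SaleDate, Amount
    FROM dwh.FactSales
    WHERE YEAR(SaleDate) = YEAR(GETDATE());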

Data in the source systems may need a lot of work before and during loading into a database or other analytical storage, where we can properly analyze it. In general, one of the many steps in data science projects is data wrangling: the process of acquiring raw data and mapping and transforming it into another format that is suitable for its end use, which, for us, is analysis. Data wrangling basically has three steps (a minimal sketch of the cleaning step follows the list):

  • Getting and reading the data
  • Cleaning the data
  • Shaping and structuring the data
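Continuing the staging sketch from the previous section, the cleaning step might look like the following: raw text columns are cast to proper types, stray whitespace is trimmed, and rows that cannot be repaired are discarded (table names are the illustrative ones introduced earlier):

    -- Cleaning step: move repaired rows from staging into the warehouse table.
    INSERT INTO dwh.FactSales (SaleID, SaleDate, Amount)
    SELECT
        s.SaleID,
        TRY_CAST(s.SaleDate AS date),                        -- NULL when the source date is malformed
        TRY_CAST(LTRIM(RTRIM(s.Amount)) AS decimal(10, 2))   -- trim whitespace before casting
    FROM stg.Sales AS s
    WHERE TRY_CAST(s.SaleDate AS date) IS NOT NULL;          -- drop rows we cannot repair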

Reading the data sounds simple, but it ends up being a complex task in a data science project: one part of the project is the definition of a data flow and a pipeline describing how to connect to the data, read it with the proper tools, and move it to the analytics store. Just interconnecting various data sources and shaping the data from each of them can turn into complex integration work, but it's what lets you run powerful analytics on the data and extract the insights. Microsoft SQL Server includes important services, such as SQL Server Integration Services, which, together with SQL Server Data Tools, can be used for all three data wrangling steps.

Once the data is loaded into the analytical store, we need to explore and visualize it with the available toolset to get an idea of its structure and develop an initial understanding of the data. An initial understanding of the data can be achieved with numerous tools, but if we focus on Microsoft SQL Server, then the choices include the SQL Server Integration Services Data Profiling Task and SQL Server Management Studio.

When you explore the data, you're looking for basic information such as the following (a simple profiling query follows the list):

  • Is the data organized?
  • Are there any missing values?
  • What does each row represent?
  • What do columns represent?
  • Is the data stored as a categorical or a numerical feature?
  • Are there any transformations required?
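A quick profiling pass over a single column can answer several of these questions at once; the following query reports the row count, missing values, and distinct values for the Region column of the illustrative dbo.SalesEvents table from earlier, similar in spirit to what the Data Profiling Task reports:

    -- Basic column profile: totals, missing values, and cardinality.
    SELECT
        COUNT(*)                                        AS TotalRows,
        SUM(CASE WHEN Region IS NULL THEN 1 ELSE 0 END) AS MissingRegion,
        COUNT(DISTINCT Region)                          AS DistinctRegions
    FROM dbo.SalesEvents;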

Modelling and analysis

This part of the project might be the most creative one, since it includes numerous tasks that have to be carried out to deliver the final product. The list of tasks can be very long, and may include the following:

  • Data mining
  • Text analytics
  • Model building
  • Feature engineering and extraction
  • Model testing

Microsoft SQL Server has built-in tools that provide a delivery platform for most of these tasks. When we talk about data mining, there are several different methodologies or frameworks to follow, of which the Cross Industry Standard Process for Data Mining (CRISP-DM) is so far the most frequently used, according to several surveys of methodology usage. In 2015, IBM released a new methodology called Analytics Solutions Unified Method for Data Mining/Predictive Analytics, which refined and extended CRISP-DM; even so, CRISP-DM remains the most widely used analytics model.

CRISP-DM is an open-standard process model that describes common approaches used by data-mining experts. It breaks the process of data mining into six major phases. The sequence of the phases is not strict; moving back and forth between different phases is usually required. The arrows in the CRISP-DM process diagram indicate the most important and frequent dependencies between phases, and the outer circle symbolizes the cyclic nature of data mining itself. A data-mining process continues after a solution has been deployed: the lessons learned during the process can trigger new, often more focused business questions, and subsequent data-mining processes will benefit from the experience of the previous ones.

The purpose of data mining is to relate structured and unstructured data to each other so that they can be easily interfaced, and to give the workers in each sector a system that is easy to use. The experts in each area of the business then have access to a complex data system that can process information at different levels. This has the advantage of bringing to light the relationships among data, and it supports predictive analysis, assessments for specific business decisions, and much more.

Data mining can be used to solve many business problems and to prepare the data for more advanced approaches, such as machine learning. Typical applications include the following (a small market basket sketch follows the list):

  • Searching for anomalies
  • Churn analysis
  • Customer segmentation
  • Forecasting
  • Market basket analysis
  • Network intrusion detection
  • Targeted advertisement
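To make one of these concrete, market basket analysis at its simplest counts how often two products appear in the same order; the following T-SQL sketch does exactly that over a hypothetical dbo.OrderLines table (OrderID, ProductID), which is an assumption for illustration:

    -- Count co-occurrences of product pairs within the same order.
    SELECT
        a.ProductID AS ProductA,
        b.ProductID AS ProductB,
        COUNT(*)    AS TimesBoughtTogether
    FROM dbo.OrderLines AS a
    JOIN dbo.OrderLines AS b
        ON  a.OrderID = b.OrderID
        AND a.ProductID < b.ProductID     -- count each pair only once
    GROUP BY a.ProductID, b.ProductID
    ORDER BY TimesBoughtTogether DESC;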

Deployment and visualization

Once you have developed a functioning data science solution or a learning model, you're ready to deploy it into production. Most systems support two primary modes of operation for a data science solution: real-time operation or batch operation. In this part, we usually consider data visualization and the proper toolset to deliver the results to our users. Tools such as Power BI or Tableau will help you bring interesting insights out of your data in a visual way, which is usually best for the end users.
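As a hedged sketch of batch operation inside the database engine itself, SQL Server 2017 native scoring can apply a previously trained model (serialized in the supported format and stored in a table) to new rows with the PREDICT function; the model, table, and column names below are all illustrative assumptions:

    -- Batch scoring sketch: apply a stored, serialized model to new rows.
    DECLARE @model varbinary(max) =
        (SELECT Model FROM dbo.Models WHERE ModelName = 'ChurnModel');

    SELECT d.CustomerID, p.ChurnProbability
    FROM PREDICT(MODEL = @model, DATA = dbo.NewCustomers AS d)
    WITH (ChurnProbability float) AS p;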

Final acceptance

The last step of the project is the final presentation and acceptance by the client or customer. You'll present the insights and translate the findings into language appropriate for your audience. In this part of the project, you'll work with the customer or the internal team that will run and support the project once it gets into production, and you'll verify that the outcome meets the required needs.