Data Science with SQL Server Quick Start Guide

By: Dejan Sarka

Overview of this book

SQL Server only started to fully support data science with its two most recent editions. If you are a professional from both worlds, SQL Server and data science, and are interested in using SQL Server and Machine Learning (ML) Services for your projects, then this book is for you. It is an introduction to data science with Microsoft SQL Server and In-Database ML Services. It covers all stages of a data science project, from business and data understanding, through data overview, data preparation, modeling and using algorithms, and model evaluation, to deployment. You will learn to use the engines and languages that come with SQL Server, including ML Services with the R and Python languages, and Transact-SQL. You will also learn how to choose which algorithm to use for which task, and how each algorithm works.
Table of Contents (15 chapters)
Title Page
Copyright and Credits
Packt Upsell
Contributors
Preface
Index

Chapter 5. Data Preparation

Unfortunately, much of the data you get to work with is not immediately useful for a data science project. A major part of the work on such a project is data preparation. There are many different issues you could have with your data. You might have some missing values in it. Maybe you need to group the values of some continuous variables into a limited number of bins; that is, you have to bin, or discretize, them. You quickly realize that discretization is not a particularly straightforward process. Maybe you need to create numerical variables from categorical ones. To do so, you create so-called dummy variables, or dummies, from the values of a categorical variable. Sometimes, you need to aggregate data over groups defined by one or more variables, and then operate further on the aggregated data.
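To make these tasks concrete, here is a minimal sketch of all four of them in plain Python, using only the standard library. It is an illustration only, not the code this chapter uses: the sample rows, the column names, and the helper functions (`impute_mean`, `equal_width_bin`, `make_dummies`, `group_mean`) are hypothetical. Mean imputation and equal-width binning are just one simple option for each task.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows (not from the book); None marks a missing value.
rows = [
    {"age": 25, "education": "Bachelors", "income": 40000.0},
    {"age": 40, "education": "Graduate", "income": None},
    {"age": 55, "education": "Bachelors", "income": 60000.0},
    {"age": 70, "education": "High School", "income": 35000.0},
]


def impute_mean(rows, column):
    """Replace missing (None) values in a column with the mean of the known values."""
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def equal_width_bin(value, low, high, bins):
    """Return the 0-based index of the equal-width bin that contains value."""
    width = (high - low) / bins
    if width == 0:  # all values identical: everything falls into one bin
        return 0
    index = int((value - low) // width)
    return min(max(index, 0), bins - 1)  # clamp so value == high lands in the last bin


def make_dummies(rows, column):
    """Add one 0/1 indicator column per distinct value of a categorical column."""
    levels = sorted({r[column] for r in rows})
    return [
        {**r, **{f"{column}_{level}": int(r[column] == level) for level in levels}}
        for r in rows
    ]


def group_mean(rows, key, column):
    """Average a numeric column within each group defined by the key column."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[key]].append(r[column])
    return {k: mean(v) for k, v in groups.items()}


# 1. Handle missing values, 2. discretize age, 3. create dummies, 4. aggregate.
prepared = impute_mean(rows, "income")
ages = [r["age"] for r in prepared]
prepared = [
    {**r, "age_bin": equal_width_bin(r["age"], min(ages), max(ages), 3)}
    for r in prepared
]
prepared = make_dummies(prepared, "education")
income_by_education = group_mean(prepared, "education", "income")
```

Even this small example involves choices that affect later analysis: which value replaces a missing one, how many bins to use, and where the bin boundaries fall.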

This chapter will introduce you to some of the basic data preparation tasks and tools, including the following:

  • Handling missing values
  • Creating dummies from categorical variables
  • Different...