Data Science with SQL Server Quick Start Guide

By: Dejan Sarka

Overview of this book

SQL Server only started to fully support data science with its two most recent editions. If you are a professional from both worlds, SQL Server and data science, and are interested in using SQL Server and Machine Learning (ML) Services for your projects, then this is the ideal book for you. It is an introduction to data science with Microsoft SQL Server and In-Database ML Services. It covers all stages of a data science project, from business and data understanding, through data overview, data preparation, modeling and using algorithms, and model evaluation, to deployment. You will learn to use the engines and languages that come with SQL Server, including ML Services with the R and Python languages, and Transact-SQL. You will also learn how to choose which algorithm to use for which task, and how each algorithm works.

The entropy of a discrete variable


In information theory, as defined by Claude E. Shannon, information is surprise. Surprise comes from diversity, not from equality. In terms of data, a variable that can take only a single value, which is really a constant, holds no surprise and therefore no information.

Whichever case you draw at random from the dataset, you know the value of this variable in advance, so you are never surprised. To carry at least some information, a variable must be at least dichotomous, meaning it must have a pool of at least two distinct values. Now imagine that you draw a case at random from the dataset, knowing the overall distribution of that variable. If one state is more frequent, appearing in 80% of cases, you would of course expect that state, and you would be surprised 20% of the time. With a 50%-50% distribution, no matter which state you expect, you would be surprised half of the time. With such a distribution, this variable would have the maximum possible...
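
The surprise rates discussed above can be checked numerically. The following is only a minimal Python sketch, not code from the book: the helper names `surprise_rate` and `shannon_entropy` are hypothetical, and it assumes the standard Shannon entropy formula with a base-2 logarithm, H = -Σ p·log2(p), which this excerpt does not show.

```python
import math


def surprise_rate(probabilities):
    """Share of cases in which a guess of the most frequent state is wrong."""
    return 1.0 - max(probabilities)


def shannon_entropy(probabilities):
    """Shannon entropy in bits (assumed formula: -sum of p * log2(p)).

    States with zero probability are skipped, because they contribute nothing.
    """
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


# The three distributions mentioned in the text: a constant,
# an 80%-20% split, and a 50%-50% split.
distributions = {
    "constant": [1.0],
    "80/20": [0.8, 0.2],
    "50/50": [0.5, 0.5],
}

for name, probs in distributions.items():
    rate = surprise_rate(probs)
    bits = shannon_entropy(probs)
    print(f"{name}: surprised {rate:.0%} of the time, entropy {bits:.3f} bits")
```

Under these assumptions the constant yields zero entropy, the 80%-20% split yields roughly 0.722 bits, and the 50%-50% split yields exactly 1 bit, the largest value a two-state variable can reach.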