Principles of Data Science - Second Edition

By: Sinan Ozdemir, Sunil Kakade, Marco Tibaldeschi

Overview of this book

Need to turn programming skills into effective data science skills? This book helps you connect mathematics, programming, and business analysis. You’ll feel confident asking—and answering—complex, sophisticated questions of your data, making abstract and raw statistics into actionable ideas. Going through the data science pipeline, you'll clean and prepare data and learn effective data mining strategies and techniques to gain a comprehensive view of how the data science puzzle fits together. You’ll learn fundamentals of computational mathematics and statistics and pseudo-code used by data scientists and analysts. You’ll learn machine learning, discovering statistical models that help control and navigate even the densest datasets, and learn powerful visualizations that communicate what your data means.

The data science Venn diagram

It is a common misconception that only those with a PhD or geniuses can understand the math/programming behind data science. This is absolutely false. Understanding data science begins with three basic areas:

Math/statistics: This is the use of equations and formulas to perform analysis.
Computer programming: This is the ability to use code to create outcomes on a computer.
Domain knowledge: This refers to understanding the problem domain (medicine, finance, social science, and so on).

The following Venn diagram provides a visual representation of how these three areas of data science intersect:

The Venn diagram of data science

Those with hacking skills can conceptualize and program complicated algorithms using computer languages. Having a math and statistics background allows you to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations. Having substantive expertise (domain expertise) allows you to apply concepts and results in a meaningful and effective way.

While having only two of these three qualities can make you intelligent, it will also leave a gap. Let's say that you are very skilled in coding and have formal training in day trading. You might create an automated system to trade in your place, but lack the math skills to evaluate your algorithms. As a result, you will likely end up losing money in the long run. It is only when you boost your skills in coding, math, and domain knowledge that you can truly perform data science.

The quality that was probably a surprise for you was domain knowledge. It is really just knowledge of the area you are working in. If a financial analyst started analyzing data about heart attacks, they might need the help of a cardiologist to make sense of a lot of the numbers.

Data science is the intersection of the three key areas mentioned earlier. In order to gain knowledge from data, we must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and, above all, understand our analyses' place in the domain we are in. This includes the presentation of data. If we are creating a model to predict heart attacks in patients, is it better to create a PDF of information, or an app where you can type in numbers and get a quick prediction? All these decisions must be made by the data scientist.

Note

The intersection of math and coding is machine learning. This book will look at machine learning in great detail later on, but it is important to note that without the explicit ability to generalize any models or results to a domain, machine learning algorithms remain just that: algorithms sitting on your computer. You might have the best algorithm to predict cancer. You might be able to predict cancer with over 99% accuracy based on past cancer patient data, but if you don't understand how to apply this model in a practical sense so that doctors and nurses can easily use it, your model might be useless.

Both computer programming and math are covered extensively in this book. Domain knowledge comes with both the practice of data science and reading examples of other people's analyses.

The math

Most people stop listening once someone says the word "math." They'll nod along in an attempt to hide their utter disdain for the topic. This book will guide you through the math needed for data science, specifically statistics and probability. We will use these subdomains of mathematics to create what are called models.

A data model refers to an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon.

Essentially, we will use math in order to formalize relationships between variables. As a former pure mathematician and current math teacher, I know how difficult this can be. I will do my best to explain everything as clearly as I can. Of the three areas of data science, math is what allows us to move from domain to domain. Understanding the theory allows us to apply a model that we built for the fashion industry to a financial domain.

The math covered in this book ranges from basic algebra to advanced probabilistic and statistical modeling. Do not skip over these chapters, even if you already know these topics or you're afraid of them. Every mathematical concept that I will introduce will be introduced with care and purpose, using examples. The math in this book is essential for data scientists.

Example – spawner-recruit models

In biology, we use, among many other models, a model known as the spawner-recruit model to judge the biological health of a species. It is a basic relationship between the number of healthy parental units of a species and the number of new units in the group of animals. In a public dataset of the number of salmon spawners and recruits, the graph further down (titled spawner-recruit model) was formed to visualize the relationship between the two. We can see that there definitely is some sort of positive relationship (as one goes up, so does the other). But how can we formalize this relationship? For example, if we knew the number of spawners in a population, could we predict the number of recruits that the group would obtain, and vice versa?

Essentially, models allow us to plug in one variable to get the other. For example, let's say we knew that a group of salmon had 1.15 (in thousands) spawners. Plugging that value into the model's equation would give us an estimate of the number of recruits. This result can be very useful for estimating how the health of a population is changing. If we can create these models, we can visually observe how the relationship between the two variables can change.
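To make the idea of plugging in one variable to get the other concrete, here is a minimal Python sketch. The spawner and recruit counts are made-up illustrative numbers, not the salmon dataset discussed in this section. The straight-line fit with `numpy.polyfit` is an assumption chosen for simplicity, not necessarily the model behind the book's graph, and `predict_recruits` is a hypothetical helper name.

```python
import numpy as np

# Made-up illustrative observations (in thousands of fish).
# These are NOT the values from the salmon dataset in the text.
spawners = np.array([0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0])
recruits = np.array([1.1, 1.5, 2.1, 2.3, 3.0, 3.4, 4.1])

# Assumption: a straight line, recruits = slope * spawners + intercept.
# np.polyfit with deg=1 returns the least-squares slope and intercept.
slope, intercept = np.polyfit(spawners, recruits, deg=1)


def predict_recruits(n_spawners):
    """Plug a spawner count (in thousands) into the fitted line."""
    return slope * n_spawners + intercept


# Plug in 1.15 (thousand) spawners, as in the example above.
predicted = predict_recruits(1.15)
print(f"Fitted line: recruits = {slope:.3f} * spawners + {intercept:.3f}")
print(f"Predicted recruits for 1.15 thousand spawners: {predicted:.3f} thousand")
```

With real data, you would replace the two arrays with observed counts and check whether a straight line is a reasonable shape for the relationship before trusting the predictions.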

There are many types of data models, including probabilistic and statistical models. Both of these are subsets of a larger paradigm, called machine learning. The essential idea behind these three topics is that we use data in order to come up with the best model possible. We no longer rely on human instincts—rather, we rely on data, such as that displayed in the following graph:

The spawner-recruit model visualized

The purpose of this example is to show how we can define relationships between data elements using mathematical equations. The fact that I used salmon health data was irrelevant! Throughout this book, we will look at relationships involving marketing dollars, sentiment data, restaurant reviews, and much more. The main reason for this is that I would like you (the reader) to be exposed to as many domains as possible.

Math and coding are vehicles that allow data scientists to step back and apply their skills virtually anywhere.

Computer programming

Let's be honest: you probably think computer science is way cooler than math. That's OK, I don't blame you. The news isn't filled with math news like it is with news on technology. You don't turn on the TV to see a new theory on primes; rather, you will see investigative reports on how the latest smartphone can take better photos of cats, or something. Computer languages are how we communicate with machines and tell them to do our bidding. Just as a book can be written in many languages, data science can be done in many programming languages. Python, Julia, and R are some of the many languages that are available to us. This book will focus exclusively on using Python.
