Book Image

Principles of Data Science - Second Edition

By : Sinan Ozdemir, Sunil Kakade, Marco Tibaldeschi
Book Image

Principles of Data Science - Second Edition

By: Sinan Ozdemir, Sunil Kakade, Marco Tibaldeschi

Overview of this book

Need to turn programming skills into effective data science skills? This book helps you connect mathematics, programming, and business analysis. You’ll feel confident asking—and answering—complex, sophisticated questions of your data, making abstract and raw statistics into actionable ideas. Going through the data science pipeline, you'll clean and prepare data and learn effective data mining strategies and techniques to gain a comprehensive view of how the data science puzzle fits together. You’ll learn fundamentals of computational mathematics and statistics and pseudo-code used by data scientists and analysts. You’ll learn machine learning, discovering statistical models that help control and navigate even the densest datasets, and learn powerful visualizations that communicate what your data means.
Table of Contents (17 chapters)
16
Index

The data science Venn diagram

It is a common misconception that only those with a PhD or geniuses can understand the math/programming behind data science. This is absolutely false. Understanding data science begins with three basic areas:

  • Math/statistics: This is the use of equations and formulas to perforanalysis.
  • Computer programming: This is the ability to use code to create outcomes on computer.
  • Domain knowledge: This refers to understanding the problem domain (medicine, finance, social science, d so on).

The following Venn diagram provides a visual representation of how these three areas of data science intersect:

The data science Venn diagram

The Venn diagram of data science

Those with hacking skills can conceptualize and program complicated algorithms using computer languages. Having a math and statistics background allows you to theorize and evaluate algorithms and tweak the existing procedures to fit specific situations. Having substantive expertise (domain expertise) allows you to apply concepts and results in a meaningful and effective way.

While having only two of these three qualities can make you intelligent, it will also leave a gap. Let's say that you are very skilled in coding and have formal training in day trading. You might create an automated system to trade in your place, but lack the math skills to evaluate your algorithms. This will mean that you end up losing money in the long run. It is only when you boost your skills in coding, math, and domain knowledge that you can truly perform data science.

The quality that was probably a surprise for you was domain knowledge. It is really just knowledge of the area you are working in. If a financial analyst started analyzing data about heart attacks, they might need the help of a cardiologist to make sense of a lot of the numbers.

Data science is the intersection of the three key areas mentioned earlier. In order to gain knowledge from data, we must be able to utilize computer programming to access the data, understand the mathematics behind the models we derive, and, above all, understand our analyses' place in the domain we are in. This includes the presentation of data. If we are creating a model to predict heart attacks in patients, is it better to create a PDF of information, or an app where you can type in numbers and get a quick prediction? All these decisions must be made by the data scientist.

Note

The intersection of math and coding is machine learning. This book will look at machine learning in great detail later on, but it is important to note that without the explicit ability to generalize any models or results to a domain, machine learning algorithms remain just that—algorithms sitting on your computer. You might have the best algorithm to predict cancer. You could be able to predict cancer with over 99% accuracy based on past cancer patient data, but if you don't understand how to apply this model in a practical sense so that doctors and nurses can easily use it, your model might be useless.

Both computer programming and math are covered extensively in this book. Domain knowledge comes with both the practice of data science and reading examples of other people's analyses.

The math

Most people stop listening once someone says the word "math." They'll nod along in an attempt to hide their utter disdain for the topic. This book will guide you through the math needed for data science, specifically statistics and probability. We will use these subdomains of mathematics to create what are called models.

A data model refers to an organized and formal relationship between elements of data, usually meant to simulate a real-world phenomenon.

Essentially, we will use math in order to formalize relationships between variables. As a former pure mathematician and current math teacher, I know how difficult this can be. I will do my best to explain everything as clearly as I can. Between the three areas of data science, math is what allows us to move from domain to domain. Understanding the theory allows us to apply a model that we built for the fashion industry to a financial domain.

The math covered in this book ranges from basic algebra to advanced probabilistic and statistical modeling. Do not skip over these chapters, even if you already know these topics or you're afraid of them. Every mathematical concept that I will introduce will be introduced with care and purpose, using examples. The math in this book is essential for data scientists.

Example – spawner-recruit models

In biology, we use, among many other models, a model known as the spawner-recruit model to judge the biological health of a species. It is a basic relationship between the number of healthy parental units of a species and the number of new units in the group of animals. In a public dataset of the number of salmon spawners and recruits, the graph further down (titled spawner-recruit model) was formed to visualize the relationship between the two. We can see that there definitely is some sort of positive relationship (as one goes up, so does the other). But how can we formalize this relationship? For example, if we knew the number of spawners in a population, could we predict the number of recruits that the group would obtain, and vice versa?

Essentially, models allow us to plug in one variable to get the other. Consider the folloExample – spawner-recruit modelsIn this example, let's say we knew that a group of salmon had 1.15 (in thousands) spawners. Then, we would have tExample – spawner-recruit models Example – spawner-recruit modelsThis result can be very beneficial to estimate how the health of a population is changing. If we can create these models, we can visually observe how the relationship between the two variables can change.

There are many types of data models, including probabilistic and statistical models. Both of these are subsets of a larger paradigm, called machine learning. The essential idea behind these three topics is that we use data in order to come up with the best model possible. We no longer rely on human instincts—rather, we rely on data, such as that displayed in the following graph:

Example – spawner-recruit models

The spawner-recruit model visualized

The purpose of this example is to show how we can define relationships between data elements using mathematical equations. The fact that I used salmon health data was irrelevant! Throughout this book, we will look at relationships involving marketing dollars, sentiment data, restaurant reviews, and much more. The main reason for this is that I would like you (the reader) to be exposed to as many domains as possible.

Math and coding are vehicles that allow data scientists to step back and apply their skills virtually anywhere.

Computer programming

Let's be honest: you probably think computer science is way cooler than math. That's ok, I don't blame you. The news isn't filled with math news like it is with news on technology. You don't turn on the TV to see a new theory on primes—rather, you will see investigative reports on how the latest smartphone can take better photos of cats, or something. Computer languages are how we communicate with machines and tell them to do our bidding. A computer speaks many languages and, like a book, can be written in many languages; similarly, data science can also be done in many languages. Python, Julia, and R are some of the many languages that are available to us. This book will focus exclusively on using Python.