Principles of Data Science

Overview of this book

Need to turn your skills at programming into effective data science skills? Principles of Data Science is created to help you join the dots between mathematics, programming, and business analysis. With this book, you’ll feel confident about asking, and answering, complex and sophisticated questions of your data to move from abstract and raw statistics to actionable ideas. With a unique approach that bridges the gap between mathematics and computer science, this book takes you through the entire data science pipeline. Beginning with cleaning and preparing data, and effective data mining strategies and techniques, you’ll move on to build a comprehensive picture of how every piece of the data science puzzle fits together. Learn the fundamentals of computational mathematics and statistics, as well as some pseudocode being used today by data scientists and analysts. You’ll get to grips with machine learning, discover the statistical models that help you take control and navigate even the densest datasets, and find out how to create powerful visualizations that communicate what your data means.

What is data science?

Before we go any further, let's look at some basic definitions that we will use throughout this book. The great/awful thing about this field is that it is so young that these definitions can differ from textbook to newspaper to whitepaper.

Basic terminology

The definitions that follow are general enough to be used in daily conversations and work to serve the purpose of the book, an introduction to the principles of data science.

Let's start by defining what data is. This might seem like a silly first definition to have, but it is very important. Whenever we use the word "data", we refer to a collection of information in either an organized or unorganized format:

Organized data: This refers to data that is sorted into a row/column structure, where every row represents a single observation and the columns represent the characteristics of that observation.
Unorganized data: This is data in free form, usually text or raw audio/signals, that must be parsed further to become organized.

Whenever you open Excel (or any other spreadsheet program), you are looking at a blank row/column structure waiting for organized data. These programs don't do well with unorganized data. For the most part, we will deal with organized data as it is the easiest to glean insight from, but we will not shy away from looking at raw text and methods of processing unorganized forms of data.
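
To make the distinction concrete, here is a minimal sketch of turning unorganized text into organized rows. The sample notes, the regular expression, and the `organize` function are all hypothetical and exist only for illustration; real free-form text is rarely this regular.

```python
import re

# Unorganized data: free-form text lines (made-up examples for this sketch).
raw_notes = [
    "2012 sedan, 85000 miles, asking 7500",
    "2018 hatchback, 22000 miles, asking 14900",
]

# Hypothetical pattern that happens to fit the sample lines above.
PATTERN = re.compile(
    r"(?P<year>\d{4}) (?P<body>\w+), (?P<miles>\d+) miles, asking (?P<price>\d+)"
)


def organize(lines):
    """Parse free-form lines into rows: one dict per observation."""
    rows = []
    for line in lines:
        match = PATTERN.fullmatch(line)
        if match is None:
            continue  # skip lines that do not fit the assumed pattern
        rows.append(
            {
                "year": int(match["year"]),
                "body": match["body"],
                "miles": int(match["miles"]),
                "price": int(match["price"]),
            }
        )
    return rows


# Organized data: every row is an observation, every key is a characteristic.
rows = organize(raw_notes)
for row in rows:
    print(row)
```

Each dictionary plays the role of a spreadsheet row, and its keys play the role of the column headers.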

Data science is the art and science of acquiring knowledge through data.

What a small definition for such a big topic, and rightfully so! Data science covers so many things that it would take pages to list them all (I should know, I tried and got edited down).

Data science is all about how we take data, use it to acquire knowledge, and then use that knowledge to do the following:

Make decisions
Predict the future
Understand the past/present
Create new industries/products

This book is all about the methods of data science, including how to process data, gather insights, and use those insights to make informed decisions and predictions.

Data science is about using data in order to gain new insights that you would otherwise have missed.

As an example, imagine you are sitting around a table with three other people. The four of you have to make a decision based on some data. There are four opinions to consider. You would use data science to bring a fifth, sixth, and even seventh opinion to the table.

That's why data science won't replace the human brain but will complement it and work alongside it. Data science should not be thought of as an end-all solution to our data woes; it is merely an opinion, a very informed opinion, but an opinion nonetheless. It deserves a seat at the table.

Why data science?

In this data age, it's clear that we have a surplus of data. But why should that necessitate an entirely new vocabulary? What was wrong with our previous forms of analysis? For one, the sheer volume of data makes it impossible for a human to parse it in a reasonable time. Data is collected in various forms and from different sources, and often arrives in a very unorganized state.

Data can be missing, incomplete, or just flat out wrong. Often, we have data on very different scales, and that makes it tough to compare. Consider that we are looking at data in relation to pricing used cars. One characteristic of a car is the year it was made, and another might be the number of miles on that car. Years span only a few decades, while mileage can run into the hundreds of thousands, so the raw numbers cannot be compared directly. Once we clean our data (which we spend a great deal of time looking at in this book), the relationships between the data become more obvious, and the knowledge that was once buried deep in millions of rows of data simply pops out. One of the main goals of data science is to make explicit practices and procedures to discover and apply these relationships in the data.
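
As a rough illustration of the scale problem, the sketch below rescales a car's year and mileage onto a shared 0-to-1 range using min-max scaling. This is one common technique, not necessarily the one used later in this book, and the three cars are made-up values.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical used cars: model year and odometer reading in miles.
years = [2008, 2013, 2018]
miles = [150000, 90000, 30000]

scaled_years = min_max_scale(years)
scaled_miles = min_max_scale(miles)

print(scaled_years)  # [0.0, 0.5, 1.0]
print(scaled_miles)  # [1.0, 0.5, 0.0]
```

After scaling, both characteristics live on the same range, so it becomes easy to see that in this toy data the newest car is also the one with the fewest miles.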

Earlier, we looked at data science from a more historical perspective, but let's take a minute to discuss its role in business today, through a very simple example.

Example – Sigma Technologies

Ben Runkle, CEO of Sigma Technologies, is trying to resolve a huge problem. The company is consistently losing long-time customers. He does not know why they are leaving, but he must do something fast. He is convinced that in order to reduce his churn, he must create new products and features, and consolidate existing technologies. To be safe, he calls in his chief data scientist, Dr. Jessie Hughan. However, she is not convinced that new products and features alone will save the company. Instead, she turns to the transcripts of recent customer service tickets. She shows Runkle the most recent transcripts and finds something surprising:

"…. Not sure how to export this; are you?"
"Where is the button that makes a new list?"
"Wait, do you even know where the slider is?"
"If I can't figure this out today, it's a real problem..."

It is clear that customers were having problems with the existing UI/UX, and weren't upset due to a lack of features. Runkle and Hughan organized a mass UI/UX overhaul and their sales have never been better.
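
A reading like this could be backed up with a simple tally. The sketch below labels each transcript by looking for a few cue phrases; the cue lists and the `label` function are assumptions made for illustration, not a method described in the example, and a real analysis would need far more care.

```python
from collections import Counter

# The four ticket excerpts quoted above.
tickets = [
    "…. Not sure how to export this; are you?",
    "Where is the button that makes a new list?",
    "Wait, do you even know where the slider is?",
    "If I can't figure this out today, it's a real problem...",
]

# Hypothetical cue phrases for each theme (assumed, not from the example).
CUES = {
    "usability": ("how to", "where", "figure"),
    "feature request": ("wish", "please add", "missing"),
}


def label(ticket):
    """Return the first theme whose cue phrase appears in the ticket text."""
    text = ticket.lower()
    for theme, cues in CUES.items():
        if any(cue in text for cue in cues):
            return theme
    return "other"


counts = Counter(label(ticket) for ticket in tickets)
print(counts)  # Counter({'usability': 4})
```

With these assumed cues, all four excerpts land under usability and none under feature requests, which matches the conclusion drawn from reading them.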

Of course, the science used in the last example was minimal, but it makes a point. We tend to call people like Runkle drivers. Today's common stick-to-your-gut CEO wants to make all decisions quickly and iterate over solutions until something works. Dr. Hughan is much more analytical. She wants to solve the problem just as much as Runkle does, but she turns to user-generated data instead of her gut feeling for answers. Data science is about applying the skills of the analytical mind and using them as a driver would.

Both of these mentalities have their place in today's enterprises; however, it is Hughan's way of thinking that dominates the ideas of data science: using data generated by the company as her source of information rather than just picking a solution and going with it.
