Principles of Data Science

Overview of this book

Need to turn your programming skills into effective data science skills? Principles of Data Science is created to help you join the dots between mathematics, programming, and business analysis. With this book, you’ll feel confident about asking, and answering, complex and sophisticated questions of your data to move from abstract and raw statistics to actionable ideas. With a unique approach that bridges the gap between mathematics and computer science, this book takes you through the entire data science pipeline. Beginning with cleaning and preparing data, and effective data mining strategies and techniques, you’ll move on to build a comprehensive picture of how every piece of the data science puzzle fits together. Learn the fundamentals of computational mathematics and statistics, as well as some pseudocode being used today by data scientists and analysts. You’ll get to grips with machine learning, discover the statistical models that help you take control and navigate even the densest datasets, and find out how to create powerful visualizations that communicate what your data means.

What is data science?


Before we go any further, let's look at some basic definitions that we will use throughout this book. The great/awful thing about this field is that it is so young that these definitions can differ from textbook to newspaper to whitepaper.

Basic terminology

The definitions that follow are general enough to be used in daily conversations and work to serve the purpose of the book, an introduction to the principles of data science.

Let's start by defining what data is. This might seem like a silly first definition to have, but it is very important. Whenever we use the word "data", we refer to a collection of information in either an organized or unorganized format:

  • Organized data: This refers to data that is sorted into a row/column structure, where every row represents a single observation and the columns represent the characteristics of that observation.

  • Unorganized data: This is data in free form, usually text or raw audio/signals, that must be parsed further to become organized.

    Whenever you open Excel (or any other spreadsheet program), you are looking at a blank row/column structure waiting for organized data. These programs don't do well with unorganized data. For the most part, we will deal with organized data as it is the easiest to glean insight from, but we will not shy away from looking at raw text and methods of processing unorganized forms of data.
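To make the distinction concrete, here is a minimal sketch in Python. The free-text notes, the name `raw_notes`, and the helper `parse_note` are all invented for illustration and are not taken from the book; the point is only to show unorganized text being parsed into a row/column shape.

```python
import re

# Unorganized data: free-form text (invented examples for illustration).
raw_notes = [
    "2009 sedan with 85000 miles",
    "truck, 2015, only 30000 miles",
]

def parse_note(note):
    """Pull a four-digit year and a mileage figure out of one free-text note."""
    year = re.search(r"\b(19|20)\d{2}\b", note)
    miles = re.search(r"(\d+)\s*miles", note)
    return {
        "year": int(year.group(0)) if year else None,
        "miles": int(miles.group(1)) if miles else None,
    }

# Organized data: each dict is one observation (a row),
# and each key is a characteristic of that observation (a column).
rows = [parse_note(note) for note in raw_notes]
for row in rows:
    print(row)
```

Real unorganized data is rarely this tidy, so a parser like this would normally need many more patterns and some handling for notes it cannot read.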

Data science is the art and science of acquiring knowledge through data.

What a small definition for such a big topic, and rightfully so! Data science covers so many things that it would take pages to list it all out (I should know, I tried and got edited down).

Data science is all about how we take data, use it to acquire knowledge, and then use that knowledge to do the following:

  • Make decisions

  • Predict the future

  • Understand the past/present

  • Create new industries/products

This book is all about the methods of data science, including how to process data, gather insights, and use those insights to make informed decisions and predictions.

Data science is about using data in order to gain new insights that you would otherwise have missed.

As an example, imagine you are sitting around a table with three other people. The four of you have to make a decision based on some data. There are four opinions to consider. You would use data science to bring a fifth, sixth, and even seventh opinion to the table.

That's why data science won't replace the human brain, but complement it, work alongside it. Data science should not be thought of as an end-all solution to our data woes; it is merely an opinion, a very informed opinion, but an opinion nonetheless. It deserves a seat at the table.

Why data science?

In this data age, it's clear that we have a surplus of data. But why should that necessitate an entirely new vocabulary? What was wrong with our previous forms of analysis? For one, the sheer volume of data makes it impossible for a human to parse it in a reasonable time. Data is collected in various forms and from different sources, and often arrives very unorganized.

Data can be missing, incomplete, or just flat out wrong. Often, we have data on very different scales, and that makes it tough to compare. Consider that we are looking at data in relation to pricing used cars. One characteristic of a car is the year it was made, and another might be the number of miles on that car. A year varies over a few decades, while mileage can reach into the hundreds of thousands, so the raw numbers are hard to compare directly. Once we clean our data (which we spend a great deal of time looking at in this book), the relationships between the data become more obvious, and the knowledge that was once buried deep in millions of rows of data simply pops out. One of the main goals of data science is to establish explicit practices and procedures to discover and apply these relationships in the data.
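One common way to put characteristics such as year and mileage on a comparable footing is min-max scaling, which maps each column onto the range 0 to 1. The sketch below is only an illustration of that idea: the car records are made up, `min_max_scale` is a hypothetical helper, and this passage of the book does not name min-max scaling as its method.

```python
# Hypothetical used-car records: (year made, miles driven).
cars = [(2005, 150000), (2012, 80000), (2019, 10000)]

def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

years = min_max_scale([year for year, _ in cars])
miles = min_max_scale([m for _, m in cars])

# After scaling, both columns live on the same 0-to-1 range.
for y, m in zip(years, miles):
    print(f"year={y:.2f}  miles={m:.2f}")
```

In this toy data, the newest car also has the fewest miles, a relationship that is easier to see once both columns share a scale.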

Earlier, we looked at data science from a more historical perspective, but let's take a minute to discuss its role in business today, through a very simple example.

Example – Sigma Technologies

Ben Runkle, CEO, Sigma Technologies, is trying to resolve a huge problem. The company is consistently losing long-time customers. He does not know why they are leaving, but he must do something fast. He is convinced that in order to reduce his churn, he must create new products and features, and consolidate existing technologies. To be safe, he calls in his chief data scientist, Dr. Jessie Hughan. However, she is not convinced that new products and features alone will save the company. Instead, she turns to the transcripts of recent customer service tickets. She shows Runkle the most recent transcripts and finds something surprising:

  • "…. Not sure how to export this; are you?"

  • "Where is the button that makes a new list?"

  • "Wait, do you even know where the slider is?"

  • "If I can't figure this out today, it's a real problem..."

It was clear that customers were having problems with the existing UI/UX, rather than being upset about a lack of features. Runkle and Hughan organized a mass UI/UX overhaul, and their sales have never been better.
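A first pass over transcripts like these could be as simple as counting how many tickets contain phrases that signal confusion. The following is a rough sketch, not the method used in the example: the cue phrases in `confusion_cues` and the helper `is_usability_complaint` are assumptions chosen by hand to fit the four quotes above.

```python
# Ticket excerpts quoted in the Sigma Technologies example.
tickets = [
    "…. Not sure how to export this; are you?",
    "Where is the button that makes a new list?",
    "Wait, do you even know where the slider is?",
    "If I can't figure this out today, it's a real problem...",
]

# Hypothetical cue phrases suggesting a customer cannot find or use a feature.
confusion_cues = ["not sure how", "where is", "where the", "can't figure"]

def is_usability_complaint(text):
    """Return True if the text contains any confusion cue (case-insensitive)."""
    lowered = text.lower()
    return any(cue in lowered for cue in confusion_cues)

flagged = [t for t in tickets if is_usability_complaint(t)]
share = len(flagged) / len(tickets)
print(f"{len(flagged)} of {len(tickets)} tickets look like usability problems ({share:.0%})")
```

Because the cues were picked to match these four quotes, the result says little on its own; on a real ticket archive, the cue list would need to be checked against a much larger sample.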

Of course, the science used in the last example was minimal, but it makes a point. We tend to call people like Runkle drivers. Today's common stick-to-your-gut CEO wants to make all decisions quickly and iterate over solutions until something works. Dr. Hughan is much more analytical. She wants to solve the problem just as much as Runkle, but she turns to user-generated data instead of her gut feeling for answers. Data science is about applying the skills of the analytical mind and using them as a driver would.

Both of these mentalities have their place in today's enterprises; however, it is Hughan's way of thinking that dominates the ideas of data science: using data generated by the company as her source of information rather than just picking a solution and going with it.