Principles of Data Science

Overview of this book

Need to turn your programming skills into effective data science skills? Principles of Data Science is created to help you join the dots between mathematics, programming, and business analysis. With this book, you’ll feel confident about asking and answering complex and sophisticated questions of your data to move from abstract and raw statistics to actionable ideas. With a unique approach that bridges the gap between mathematics and computer science, this book takes you through the entire data science pipeline. Beginning with cleaning and preparing data, and effective data mining strategies and techniques, you’ll move on to build a comprehensive picture of how every piece of the data science puzzle fits together. Learn the fundamentals of computational mathematics and statistics, as well as some pseudocode being used today by data scientists and analysts. You’ll get to grips with machine learning, discover the statistical models that help you take control and navigate even the densest datasets, and find out how to create powerful visualizations that communicate what your data means.

Data science case studies


The combination of math, computer programming, and domain knowledge is what makes data science so powerful. Often, it is difficult for a single person to master all three of these areas. That's why it's very common for companies to hire teams of data scientists instead of a single person. Let's look at a few powerful examples of data science in action and their outcome.

Case study – automating government paper pushing

Social security claims are known to be a major hassle for both the agent reading them and the person who wrote them. Some claims take over 2 years to get resolved in their entirety, and that's absurd! Let's look at what goes into a claim:

Sample social security form

Not bad. It's mostly just text, though. Fill this in, then that, then this, and so on. You can see how it would be difficult for an agent to read these all day, form after form. There must be a better way!

Well, there is. Elder Research Inc. parsed this unorganized data and was able to automate 20% of all disability social security forms. This means that a computer could look at 20% of these written forms and give its opinion on the approval.

Not only that, the third-party company that is hired to rate the approvals of the forms actually gave the machine-graded forms a higher grade than the human forms. So, not only did the computer handle 20% of the load, it, on average, did better than a human.

Fire all humans, right?

Before I get a load of angry e-mails claiming that data science is bringing about the end of human workers, keep in mind that the computer was only able to handle 20% of the load. That means it probably performed terribly for 80% of the forms! This is because the computer was probably great at simple forms. The claims that would have taken a human minutes took the computer seconds to compute. But these minutes add up, and before you know it, each human is being saved over an hour a day!

Forms that might be easy for a human to read are also likely easy for the computer. It's when the form becomes very terse or when the writer starts deviating from usual grammar that the computer starts to fail. This model is great because it lets the humans spend more time on those difficult claims and gives them more attention without getting distracted by the sheer volume of papers.

Note

Note that I used the word model. Remember that a model is a relationship between elements. In this case, the relationship is between written words and the approval status of a claim.
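To make the idea of a model as a relationship between written words and approval status concrete, here is a minimal sketch using a bag-of-words classifier. It is not the method Elder Research used, which the text does not describe. The four claim snippets and their labels are invented purely for illustration, and the choice of CountVectorizer with LogisticRegression is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical claim snippets and outcomes (1 = approved, 0 = not approved).
# These are made up for illustration; they are not real claims.
train_texts = [
    "medical records attached",
    "doctor statement attached",
    "signature page missing",
    "medical history missing",
]
train_labels = [1, 1, 0, 0]

# Turn each snippet into word counts, then learn which words go with approval.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
model = LogisticRegression()
model.fit(X_train, train_labels)

# Ask the model for its opinion on two unseen snippets.
new_texts = ["records attached", "history missing"]
predictions = model.predict(vectorizer.transform(new_texts))
print(list(zip(new_texts, predictions)))
```

The learned weights are the "relationship": words seen only in approved snippets push the prediction toward approval, and words seen only in rejected ones push it away.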

Case study – marketing dollars

A dataset records the money spent on advertising in three categories: TV, radio, and newspaper. The goal is to analyze the relationship between these three marketing mediums and how each affects the sales of a product. Our data is in a row and column structure. Each row represents a sales region, and the columns tell us how much money was spent on each medium and the sales achieved in that region.

Note

Usually, the data scientist must ask for units and scale. In this case, I will tell you that TV, radio, and newspaper are measured in "thousands of dollars" and sales in "thousands of widgets sold". This means that in the first region, $230,100 was spent on TV advertising, $37,800 on radio advertising, and $69,200 on newspaper advertising. In the same region, 22,100 items were sold.

Advertising budgets

For example, in the third region, we spent $17,200 on TV advertising and sold 9,300 widgets.
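The unit conversion described above can be checked with a few lines of Python. This is only a sketch: the raw values are the ones implied by the dollar figures in the text, and the names UNIT, first_region, and to_actual are my own.

```python
# Budgets are recorded in thousands of dollars, sales in thousands of widgets.
UNIT = 1000

# Raw values implied by the figures quoted in the text.
first_region = {"TV": 230.1, "Radio": 37.8, "Newspaper": 69.2, "Sales": 22.1}
third_region_tv, third_region_sales = 17.2, 9.3


def to_actual(value_in_thousands):
    """Convert a value recorded in thousands into whole units."""
    # round() guards against floating-point noise such as 230100.00000000003
    return round(value_in_thousands * UNIT)


first_actual = {name: to_actual(value) for name, value in first_region.items()}
print(first_actual)
print(to_actual(third_region_tv), to_actual(third_region_sales))
```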

If we plot each variable against sales, we get the following graphs:

import seaborn as sns
# data is assumed to be a pandas DataFrame with TV, Radio, Newspaper, and Sales columns
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars=['Sales'])

Graphs of advertising budgets

Note how none of these variables form a very strong line and, therefore, might not work well to predict sales (on their own). TV comes closest in forming an obvious relationship, but still even that isn't great. In this case, we will have to form a more complex model than the one we used in the spawner-recruiter model and combine all three variables in order to model sales.
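One common way to combine all three variables is a multiple linear regression, where sales are modeled as an intercept plus a weighted sum of the three budgets. The sketch below is an assumption about what such a model could look like, not the model the book goes on to build. It uses synthetic data generated from made-up coefficients (not the advertising dataset), so the only thing it demonstrates is that a least-squares fit recovers those coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n_regions = 200

# Synthetic budgets in thousands of dollars. This is NOT the book's dataset.
# Columns are TV, Radio, Newspaper.
spend = rng.uniform(0, 300, size=(n_regions, 3))

# Made-up "true" relationship used only to generate the synthetic sales.
true_intercept = 3.0
true_coefs = np.array([0.05, 0.2, 0.0])
noise = rng.normal(0, 1.0, size=n_regions)
sales = true_intercept + spend @ true_coefs + noise

# Add a column of ones so the fit can learn an intercept,
# then solve the least-squares problem for all three budgets at once.
X = np.column_stack([np.ones(n_regions), spend])
fitted, *_ = np.linalg.lstsq(X, sales, rcond=None)
intercept, tv_coef, radio_coef, newspaper_coef = fitted

print("intercept:", intercept)
print("TV:", tv_coef, "Radio:", radio_coef, "Newspaper:", newspaper_coef)
```

Each fitted coefficient is read as the change in sales (in thousands of widgets) associated with one extra thousand dollars on that medium, holding the other two budgets fixed.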

This type of problem is very common in data science. In this example, we are attempting to identify key features that are associated with the sales of a product. If we can isolate these key features, then we can exploit these relationships and change how much we spend on advertising in different places with the hopes of increasing our sales.

Case study – what's in a job description?

Looking for a job in data science? Great, let me help. In this case study, I have "scraped" (taken from the Web) 1,000 job descriptions for companies actively hiring data scientists (as of January 2016). The goal here is to look at some of the most common keywords people use in their job descriptions.

An example of data scientist job listings.

(Note the second one asking for core Python libraries; we talk about these later on in this book.)

import requests
# used to grab data from the web

from bs4 import BeautifulSoup
# used to parse HTML (BeautifulSoup 4 package)

from sklearn.feature_extraction.text import CountVectorizer
# used to count the number of words and phrases (we will be using this module a lot)

The first two imports are used to grab web data from the website, Indeed.com, and the third import is meant to simply count the number of times a word or phrase appears.

texts = []
# hold our job descriptions in this list

for index in range(0, 1000, 10): # go through 100 pages of indeed
  page = 'https://www.indeed.com/jobs?q=data+scientist&start=' + str(index)
  # identify the url of the job listings (requests needs the full URL, scheme included)

  web_result = requests.get(page).text
  # use requests to actually visit the url

  soup = BeautifulSoup(web_result, 'html.parser')
  # parse the html of the resulting page

  for listing in soup.find_all('span', {'class': 'summary'}):
    # for each listing on the page

    texts.append(listing.text)
    # append the text of the listing to our list

Okay, before I lose you, all that this loop is doing is going through 100 pages of job descriptions, and for each page, grabbing each job description. The important variable here is texts, which is a list of over 1,000 job descriptions:

type(texts) # == list

vect = CountVectorizer(ngram_range=(1,2), stop_words='english')
# Get basic counts of one and two word phrases

matrix = vect.fit_transform(texts)
# learn the vocabulary of the corpus and count each phrase in each description

print(len(vect.get_feature_names_out()))  # how many features are there
# There are 11,293 total one- and two-word phrases in my case!!

I have omitted some code here, but it exists in the GitHub repository for this book. The results are as follows (represented as the phrase, and then the number of times it occurred):

experience 320
machine 306
learning 305
machine learning 294
techniques 266
statistical 215
team 197
analytics 173
business 167
statistics 159
algorithms 152
datamining 149
software 144
applied 141
programming 132
understanding 127
world 127
research 125
datascience 123
methods 122
join 122
quantitative 122
group 121
real 120
large 120

Notable things:

  • Machine learning and experience are at the top of the list. Experience comes with practice. A basic idea of machine learning comes with this book.

  • These words are followed closely by statistical words implying knowledge of math and theory.

  • The word team is very high up, implying that you will need to work with a team of data scientists; you won't be a lone wolf.

  • Computer science words such as algorithms and programming are prevalent.

  • The words techniques, understanding, and methods imply a more theoretical approach that is not tied to any single domain.

  • The word business implies a particular problem domain.

There are many interesting things to note about this case study, but the biggest takeaway is that many keywords and phrases make up a data science role. It isn't just math, coding, or domain knowledge; it truly is the combination of these three ideas (whether exemplified in a single person or across a multiperson team) that makes data science possible and powerful.