Principles of Data Science - Third Edition

By: Sinan Ozdemir

Overview of this book

Principles of Data Science bridges mathematics, programming, and business analysis, empowering you to confidently pose and address complex data questions and construct effective machine learning pipelines. This book will equip you with the tools to transform abstract concepts and raw statistics into actionable insights. Starting with cleaning and preparation, you’ll explore effective data mining strategies and techniques before moving on to building a holistic picture of how every piece of the data science puzzle fits together. Throughout the book, you’ll discover statistical models with which you can control and navigate even the densest or the sparsest of datasets, and learn how to create powerful visualizations that communicate the stories hidden in your data. With a focus on application, this edition covers advanced transfer learning and pre-trained models for NLP and vision tasks. You’ll get to grips with advanced techniques for mitigating algorithmic bias in data and models, and for addressing model and data drift. Finally, you’ll explore medium-level data governance, including data provenance, privacy, and deletion request handling. By the end of this data science book, you'll have learned the fundamentals of computational mathematics and statistics, all while navigating the intricacies of modern ML and large pre-trained models like GPT and BERT.

Data science case studies

We will spend much of this book looking at real-life examples of using data science and machine learning. The combination of math, computer programming, and domain knowledge is what makes data science so powerful, but it can often feel too abstract without concrete coding examples.

Oftentimes, it is difficult for a single person to master all three of these areas. That’s why it’s very common for companies to hire teams of data scientists instead of a single person. Let’s look at a few powerful examples of data science in action and its outcomes.

Case study – automating government paper pushing

Social security claims are known to be a major hassle for both the agent reading them and the person who wrote the claims. Some claims take over two years to get resolved in their entirety, and that’s absurd! Let’s look at the following figure, which shows what goes into a claim:

Figure 1.4 – Sample social security form

Not bad. It’s mostly just text, though. Fill this in, then that, then this, and so on. You can see how it would be difficult for an agent to read these all day, form after form. There must be a better way!

Well, there is. Elder Research Inc. parsed this unstructured data and was able to automate 20% of all disability social security forms. This means that a computer could look at 20% of these written forms and give its opinion on the approval.

What’s more, the third-party company hired to rate the approvals of the forms gave the machine-graded forms higher marks than the human-graded ones. So, not only did the computer handle 20% of the load, but it also did better than a human.

Modern language models such as GPT-3 and BERT have taken the world of NLP by storm, pushing the boundaries of what we thought was possible. We will spend a good deal of time talking about these models later in this book.

Fire all humans, am I right?

Before I get a load of angry emails and tweets claiming that data science is bringing about the end of human workers, keep in mind that the computer was only able to handle 20% of the load in our previous example. This means that it likely performed poorly on the other 80% of the forms! The computer excelled at the simple forms: claims that would have taken a human minutes to process took the computer seconds. But those minutes add up, and before you know it, each human is saved over an hour a day!

Forms that might be easy for a human to read are also likely easy for the computer. It’s when the forms are very terse, or when the writer starts deviating from the usual grammar, that the computer starts to fail. This model is great because it lets humans spend more time on those difficult claims and give them more attention without getting distracted by the sheer volume of papers.

Note that I used the word “model.” Remember that a model is a relationship between elements. In this case, the relationship is between written words and the approval status of a claim.
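To make this idea of a model concrete, here is a minimal sketch of how a relationship between written words and approval status could be learned. The claim texts and labels are invented for illustration, and the `CountVectorizer` plus `LogisticRegression` pipeline is an assumption on my part, not the approach Elder Research actually used:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy claim texts (invented for illustration) with approval labels
claims = [
    "applicant unable to work due to chronic back injury",
    "claim missing medical documentation and work history",
    "long-term disability confirmed by two physicians",
    "form incomplete, no supporting evidence provided",
]
approved = [1, 0, 1, 0]  # 1 = approved, 0 = denied

# The "model" is a learned relationship: word counts -> approval status
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(claims, approved)

# Ask the model for its opinion on a new, unseen claim
print(model.predict(["disability confirmed by physicians"]))
```

A real system would be trained on thousands of labeled forms rather than four toy strings, but the shape of the model, text in, decision out, is the same.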

Case study – what’s in a job description?

Looking for a job in data science? Great! Let me help. In this case study, I have scraped (used code to read from the web) 1,000 job descriptions for companies that are actively hiring data scientists. The goal here is to look at some of the most common keywords that people use in their job descriptions, as shown in the following screenshot:

Figure 1.5 – An example of data scientist job listings

In the following Python code, the first two imports are used to grab web data from Indeed.com, and the third import is used to count the number of times a word or phrase appears:

import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer

# grab postings from the web
texts = []
# cycle through 100 pages of Indeed job results
for i in range(0, 1000, 10):
    response = requests.get('http://www.indeed.com/jobs?q=data+scientist&start=' + str(i)).text
    soup = BeautifulSoup(response, 'html.parser')
    texts += [a.text for a in soup.findAll('span', {'class': 'summary'})]

print(type(texts))
print(texts[0])  # first job description

All this loop is doing is going through 100 pages of job descriptions, and for each page, it is grabbing each job description. The important variable here is texts, which is a list of over 1,000 job descriptions, as shown in the following code:

type(texts)  # == list
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words='english')
# Get basic counts of one- and two-word phrases
matrix = vectorizer.fit_transform(texts)
# fit to and learn the vocabulary in the corpus
print(len(vectorizer.get_feature_names_out()))  # how many features there are

There are 10,857 total one- and two-word phrases in my case! Since web pages are scraped in real time and these pages may change when you run this code, you may get a different number than 10,857.

I have omitted some code here because we will cover these packages in more depth in our NLP chapters later, but it exists in the GitHub repository for this book. The results are as follows (represented by the phrase and then the number of times it occurred):

Figure 1.6 – The top one- and two-word phrases when looking at job descriptions on Indeed for the title of “Data Scientist”

There are many interesting things to note about this case study, but the biggest takeaway is that there are many keywords and phrases that make up a data science role. It isn’t just math, coding, or domain knowledge; it truly is a combination of these three ideas (whether exemplified in a single-person team or across a multi-person team) that makes data science possible and powerful.