Principles of Data Science - Third Edition

By: Sinan Ozdemir

Overview of this book

Principles of Data Science bridges mathematics, programming, and business analysis, empowering you to confidently pose and address complex data questions and construct effective machine learning pipelines. This book will equip you with the tools to transform abstract concepts and raw statistics into actionable insights. Starting with cleaning and preparation, you’ll explore effective data mining strategies and techniques before moving on to building a holistic picture of how every piece of the data science puzzle fits together. Throughout the book, you’ll discover statistical models with which you can control and navigate even the densest or the sparsest of datasets and learn how to create powerful visualizations that communicate the stories hidden in your data. With a focus on application, this edition covers advanced transfer learning and pre-trained models for NLP and vision tasks. You’ll get to grips with advanced techniques for mitigating algorithmic bias in data as well as models and addressing model and data drift. Finally, you’ll explore medium-level data governance, including data provenance, privacy, and deletion request handling. By the end of this data science book, you'll have learned the fundamentals of computational mathematics and statistics, all while navigating the intricacies of modern ML and large pre-trained models like GPT and BERT.

Preface

Who is this book for?

What this book covers

To get the most out of this book

Download the example code files

Conventions used

Get in touch

Share Your Thoughts

Download a free PDF copy of this book

Free Chapter

Chapter 1: Data Science Terminology

What is data science?

The data science Venn diagram

Some more terminology

Data science case studies

Summary

Chapter 2: Types of Data

Structured versus unstructured data

The four levels of data

Summary

Questions and answers

Chapter 3: The Five Steps of Data Science

Introduction to data science

Exploring the data

Summary

Chapter 4: Basic Mathematics

Basic symbols and terminology

Linear algebra

Summary

Chapter 5: Impossible or Improbable – A Gentle Introduction to Probability

Basic definitions

Bayesian versus frequentist

How to utilize the rules of probability

Introduction to binary classifiers

Summary

Chapter 6: Advanced Probability

Bayesian ideas revisited

Random variables

Summary

Chapter 7: What Are the Chances? An Introduction to Statistics

What are statistics?

How do we obtain and sample data?

How do we measure statistics?

The empirical rule

Summary

Chapter 8: Advanced Statistics

Understanding point estimates

Sampling distributions

Confidence intervals

Hypothesis tests

Summary

Chapter 9: Communicating Data

Why does communication matter?

Identifying effective visualizations

When graphs and statistics lie

Verbal communication

Summary

Chapter 10: How to Tell if Your Toaster is Learning – Machine Learning Essentials

Introducing ML

Types of ML

Predicting continuous variables with linear regression

Summary

Chapter 11: Predictions Don’t Grow on Trees, or Do They?

Performing naïve Bayes classification

Understanding decision trees

Diving deep into UL

Feature extraction and PCA

Summary

Chapter 12: Introduction to Transfer Learning and Pre-Trained Models

Understanding pre-trained models

Different types of TL

TL with BERT and GPT

Summary

Chapter 13: Mitigating Algorithmic Bias and Tackling Model and Data Drift

Understanding algorithmic bias

Sources of algorithmic bias

Measuring bias

Consequences of unaddressed bias and the importance of fairness

Mitigating algorithmic bias

Bias in LLMs

Emerging techniques in bias and fairness in ML

Understanding model drift and decay

Mitigating drift

Summary

Chapter 14: AI Governance

Mastering data governance

Navigating the intricacy and the anatomy of ML governance

A guide to architectural governance

Summary

Chapter 15: Navigating Real-World Data Science Case Studies in Action

Introduction to the COMPAS dataset case study

Text embeddings using pretrained models and OpenAI

Summary

Index

Why subscribe?

Other Books You May Enjoy

Packt is searching for authors like you

Share Your Thoughts

Download a free PDF copy of this book

What this book covers

Chapter 1, Data Science Terminology, describes the basic terminology used by data scientists. We will cover the differences between often-confused terms as well as looking at examples of each term used in order to truly understand how to communicate in the language of data science. We will begin by looking at the broad term data science and then, little by little, get more specific until we arrive at the individual subdomains of data science, such as machine learning and statistical inference. This chapter will also look at the three main areas of data science, which are math, programming, and domain expertise. We will look at each one individually and understand the uses of each. We will also look at the basic Python packages and the syntax that will be used throughout the book.

Chapter 2, Types of Data, deals with data types and the way data is observed. We will explore the different levels of data as well as the different forms of data. Specifically, we will understand the differences between structured/unstructured data, quantitative/qualitative data, and more.

Chapter 3, The Five Steps of Data Science, deals with the data science process as well as data wrangling and preparation. We will go into the five steps of data science and give examples of the process at every step of the way. After we cover the five steps of data science, we will turn to data wrangling, which is the data exploration/preparation stage of the process. In order to best understand these principles, we will use extensive examples to explain each step. I will also provide tips on what to look for when exploring data, including data on different scales, categorical variables, and missing data. We will use pandas to check for and fix all of these things.

Chapter 4, Basic Mathematics, goes over the elementary mathematical skills needed by any data scientist. We will dive into functional analysis and use matrix algebra as well as calculus to show and prove various outcomes based on real-world data problems.

Chapter 5, Impossible or Improbable – A Gentle Introduction to Probability, focuses heavily on the basic probability that is required for data science. We will derive results from data using probability rules and begin to see how we view real-world problems using probability. This chapter will be highly practical and Python will be used to code the examples.

Chapter 6, Advanced Probability, is where we explore how to use Python to solve more complex probability problems and also look at a new approach to probability called Bayesian inference. We will use the underlying theorems to solve real-world data scenarios such as weather predictions.

Chapter 7, What Are the Chances? An Introduction to Statistics, covers the basic statistics required for data science. We will also explore the types of statistical errors, including type I and type II errors, using examples. These errors are as essential to our analysis as the actual results. Errors and their different types allow us to dig deeper into our conclusions and avoid potentially disastrous results. Python will be used to code up statistical problems and results.

Chapter 8, Advanced Statistics, is where normalization is key. Understanding why and how we normalize data will be crucial. We will cover basic plotting, such as scatter plots, bar plots, and histograms. This chapter will also get into statistical modeling using data. We will not only define the concept as using math to model a real-world situation, but we will also use real data in order to extrapolate our own statistical models. We will also discuss overfitting. Python will be used to code up statistical problems and results.

Chapter 9, Communicating Data, deals with the different ways of communicating results from our analysis. We will look at different presentation styles as well as different visualization techniques. The point of this chapter is to take our results and be able to explain them in a coherent, intelligible way so that anyone, whether they are data-savvy or not, may understand and use our results. Much of what we will discuss will be how to create effective graphs through labels, keys, colors, and more. We will also look at more advanced visualization techniques such as parallel coordinates plots.

Chapter 10, How to Tell if Your Toaster is Learning – Machine Learning Essentials, focuses on machine learning as a part of data science. We will define the different types of machine learning and see examples of each kind. We will specifically cover areas in regression, classification, and unsupervised learning. This chapter will cover what machine learning is and how it is used in data science. We will revisit the differences between machine learning and statistical modeling and how machine learning is a broader category of the latter. Our aim will be to utilize statistics and probability in order to understand and apply essential machine learning skills to practical industries such as marketing. Examples will include predicting star ratings of restaurant reviews, predicting the presence of disease, spam email detection, and much more. This chapter focuses more on statistical and probabilistic models. The next chapter will deal with models that do not fall into this category. We will also focus on metrics that tell us how accurate our models are. We will use metrics in order to conclude results and make predictions using machine learning.

Chapter 11, Predictions Don’t Grow on Trees, or Do They?, focuses heavily on machine learning that is not considered a statistical or probabilistic model. These are models that, unlike linear regression or naïve Bayes, cannot be contained in a single equation. The models in this chapter are, while still based on mathematical principles, more complex than a single equation. The models include KNN, decision trees, and an introduction to unsupervised clustering. Metrics will become very important here as they will form the basis for measuring our understanding and our models. We will also peer into some of the ethics of data science in this chapter. We will see where machine learning can perhaps overstep boundaries in areas such as privacy and advertising and try to draw a conclusion about the ethics of predictions.

Chapter 12, Introduction to Transfer Learning and Pre-Trained Models, introduces transfer learning and gives examples of how to transfer a machine’s learning from a pre-trained model to fine-tuned models. We will navigate the world of open source models and achieve state-of-the-art performance in NLP and vision tasks.

Chapter 13, Mitigating Algorithmic Bias and Tackling Model and Data Drift, introduces algorithmic bias and how to quantify, identify, and mitigate biases in data and models. We will see how biased data can lead to biased models. We will also see how we can identify bias as early as possible and catch new biases that arise in existing models. The chapter also introduces drift in models and data and the proper ways to quantify and combat it. We will see how data can drift over time and how we can update models properly to combat drift and keep our pipelines as performant as possible.

Chapter 14, AI Governance, introduces basic governance structures and how to navigate deletion requests, privacy/permission structures, and data provenance.

Chapter 15, Navigating Real-World Data Science Case Studies in Action, works through case studies, including the COMPAS dataset and text embeddings using pretrained models and OpenAI.
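As a small illustration of the type I and type II errors mentioned in the Chapter 7 summary, the following is a minimal sketch and not code from the book. The function name `count_errors` and the sample lists are hypothetical. It assumes we know, for each test, whether the null hypothesis was actually true and whether we rejected it. A type I error is rejecting a true null hypothesis; a type II error is failing to reject a false one.

```python
def count_errors(null_is_true, rejected_null):
    """Count type I and type II errors across paired test outcomes.

    null_is_true:  list of bools, True when the null hypothesis really holds
    rejected_null: list of bools, True when our test rejected the null
    """
    pairs = list(zip(null_is_true, rejected_null))
    # Type I error (false positive): the null was true, but we rejected it.
    type_i = sum(1 for truth, rejected in pairs if truth and rejected)
    # Type II error (false negative): the null was false, but we kept it.
    type_ii = sum(1 for truth, rejected in pairs if not truth and not rejected)
    return type_i, type_ii


# Hypothetical outcomes for six independent tests.
null_is_true = [True, True, False, False, True, False]
rejected_null = [False, True, True, False, False, False]

type_i, type_ii = count_errors(null_is_true, rejected_null)
print(f"Type I errors: {type_i}, type II errors: {type_ii}")
```

Keeping the two counts separate matters because the two mistakes usually carry different costs, which is why the summary treats the errors as being as essential as the results themselves.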
