Principles of Data Science - Third Edition

By : Sinan Ozdemir

Overview of this book

Principles of Data Science bridges mathematics, programming, and business analysis, empowering you to confidently pose and address complex data questions and construct effective machine learning pipelines. This book will equip you with the tools to transform abstract concepts and raw statistics into actionable insights. Starting with cleaning and preparation, you’ll explore effective data mining strategies and techniques before moving on to building a holistic picture of how every piece of the data science puzzle fits together. Throughout the book, you’ll discover statistical models with which you can control and navigate even the densest or the sparsest of datasets and learn how to create powerful visualizations that communicate the stories hidden in your data. With a focus on application, this edition covers advanced transfer learning and pre-trained models for NLP and vision tasks. You’ll get to grips with advanced techniques for mitigating algorithmic bias in data as well as models and addressing model and data drift. Finally, you’ll explore medium-level data governance, including data provenance, privacy, and deletion request handling. By the end of this data science book, you'll have learned the fundamentals of computational mathematics and statistics, all while navigating the intricacies of modern ML and large pre-trained models like GPT and BERT.

Some more terminology

At this point, you’re probably excitedly looking up a lot of data science material and seeing words and phrases I haven’t used yet. Here are some common terms that you are likely to encounter:

Machine learning: This refers to giving computers the ability to learn from data without explicit “rules” being given by a programmer. Earlier in this chapter, we saw machine learning in the Venn diagram as the overlap between coding skills and math skills. Here, we are attempting to formalize this definition. Machine learning combines the power of computers with intelligent learning algorithms to automate the discovery of relationships in data and create powerful data models.
Statistical model: This refers to taking advantage of statistical theorems to formalize relationships between data elements in a (usually) simple mathematical formula.
Exploratory data analysis (EDA): This refers to preparing data to standardize results and gain quick insights. EDA is concerned with data visualization and preparation. This is where we turn unstructured data into structured data and clean up missing/incorrect data points. During EDA, we will create many types of plots and use these plots to identify key features and relationships to exploit in our data models.
Data mining: This is the process of finding relationships between elements of data. It is the part of data science where we look for relationships between variables (think of the spawn-recruit model).

I have tried pretty hard not to use the term big data up until now. This is because I think this term is misused – a lot. Big data is data that is too large to be processed by a single machine (if your laptop crashed, it might be suffering from a case of big data).

The following diagram shows the relationship between these data science concepts.

Figure 1.3 – The state of data science (so far)

With these terms securely stored in our brains, we can move on to the main educational resource in this book: data science case studies.