The Applied Data Science Workshop - Second Edition

By: Alex Galea

Overview of this book

From banking and manufacturing through to education and entertainment, data science has revolutionized almost every sector in the modern world. It has an important role to play in everything from app development to network security. Taking an interactive approach to learning the fundamentals, this book is ideal for beginners. You'll learn all the best practices and techniques for applying data science in the context of real-world scenarios and examples. Starting with an introduction to data science and machine learning, you'll get to grips with Jupyter functionality and features. You'll use Python libraries such as scikit-learn, pandas, Matplotlib, and Seaborn to perform data analysis and data preprocessing on real-world datasets from within your own Jupyter environment. Progressing through the chapters, you'll train classification models using scikit-learn and assess model performance using advanced validation techniques. Towards the end, you'll use Jupyter Notebooks to document your research, build stakeholder reports, and even analyze web performance data. By the end of The Applied Data Science Workshop, you'll be prepared to progress from being a beginner to taking your skills to the next level by confidently applying data science techniques and tools to real-world projects.

Assessing Models with k-Fold Cross Validation

Thus far, we have trained models on a subset of the data and then assessed performance on the unseen portion, called the test set. This is good practice because the model's performance on the data used for training is not a good indicator of its effectiveness as a predictor. It's very easy to increase accuracy on a training dataset by overfitting a model, which results in poorer performance on unseen data.
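
As a minimal sketch of this train/test workflow, the snippet below fits an unconstrained decision tree on scikit-learn's built-in breast cancer dataset (both illustrative choices, not examples taken from the workshop) and shows how training accuracy can be near-perfect while test accuracy lags behind:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Load a labeled sample dataset (any real-world dataset would do)
    X, y = load_breast_cancer(return_X_y=True)

    # Hold out 30% of the samples as an unseen test set
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0
    )

    # An unconstrained decision tree can effectively memorize the training data
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)

    # Training accuracy will be near-perfect; the test score is the honest one
    print(f"Train accuracy: {model.score(X_train, y_train):.3f}")
    print(f"Test accuracy:  {model.score(X_test, y_test):.3f}")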

That being said, simply training models on data that's been split in this way is not good enough. There is natural variance in the data that causes accuracy to differ (even if only slightly) depending on the training and test split. Furthermore, using only one training/test split to compare models can introduce bias toward certain models and lead to overfitting.
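
To see this split-to-split variance directly, the sketch below (reusing the same assumed dataset and model) re-splits the data with several different random seeds and prints the resulting test accuracies, which will differ from seed to seed:

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)

    # Re-split the same data with different random seeds and compare scores
    for seed in range(5):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=seed
        )
        model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
        print(f"Seed {seed}: test accuracy = {model.score(X_test, y_test):.3f}")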

k-Fold cross validation offers a solution to this problem and allows the variance to be accounted for by way of an error estimate on each accuracy...
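
As a minimal sketch of the idea, scikit-learn's cross_val_score trains and scores a model on k different train/validation splits, so the spread of the fold scores can serve as an error estimate on the accuracy (the dataset and classifier below are again illustrative assumptions):

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    model = DecisionTreeClassifier(random_state=0)

    # Fit and score the model on k=5 different train/validation folds
    scores = cross_val_score(model, X, y, cv=5)

    # The mean gives the accuracy estimate; the spread gives its uncertainty
    print(f"Fold accuracies: {np.round(scores, 3)}")
    print(f"Mean accuracy:   {scores.mean():.3f} +/- {scores.std():.3f}")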