The Data Science Workshop - Second Edition

By: Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, Dr. Samuel Asare

Overview of this book

Where there’s data, there’s insight. With so much data being generated, there is immense scope to extract meaningful information that’ll boost business productivity and profitability. By learning to convert raw data into game-changing insights, you’ll open new career paths and opportunities. The Data Science Workshop begins by introducing different types of projects and showing you how to incorporate machine learning algorithms in them. You’ll learn to select a relevant metric and assess the performance of your model. To tune the hyperparameters of an algorithm and improve its accuracy, you’ll get hands-on with approaches such as grid search and random search. Next, you’ll learn dimensionality reduction techniques to handle many variables at once, before exploring how to use model ensembling techniques and create new features to enhance model performance. To help you automatically create new features that improve your model, the book demonstrates how to use an automated feature engineering tool. You’ll also learn how to use an orchestration and scheduling workflow to deploy machine learning models in batch. By the end of this book, you’ll have the skills to start working on data science projects confidently.

Preface

1. Introduction to Data Science in Python

Application of Data Science

Overview of Python

Python for Data Science

2. Regression

Simple Linear Regression

Multiple Linear Regression

Conducting Regression Analysis Using Python

Multiple Regression Analysis

Assumptions of Regression Analysis

Explaining the Results of Regression Analysis

3. Binary Classification

Understanding the Business Context

Feature Engineering

Data-Driven Feature Engineering

Correlation Matrix and Visualization

4. Multiclass Classification with RandomForest

Training a Random Forest Classifier

Evaluating the Model's Performance

Minimum Sample in Leaf

Maximum Features

5. Performing Your First Cluster Analysis

Clustering with k-means

Interpreting k-means Results

Choosing the Number of Clusters

Initializing Clusters

Calculating the Distance to the Centroid

Standardizing Data

6. How to Assess Performance

Assessing Model Performance for Regression Models

Assessing Model Performance for Classification Models

The Confusion Matrix

Receiver Operating Characteristic Curve

Area Under the ROC Curve

Saving and Loading Models

7. The Generalization of Machine Learning Models

Cross-Validation

cross_val_score

LogisticRegressionCV

Hyperparameter Tuning with GridSearchCV

Hyperparameter Tuning with RandomizedSearchCV

Model Regularization with Lasso Regression

Ridge Regression

8. Hyperparameter Tuning

What Are Hyperparameters?

Finding the Best Hyperparameterization

Tuning Using Grid Search

9. Interpreting a Machine Learning Model

Linear Model Coefficients

RandomForest Variable Importance

Variable Importance via Permutation

Partial Dependence Plots

Local Interpretation with LIME

10. Analyzing a Dataset

Exploring Your Data

Analyzing Your Dataset

Analyzing the Content of a Categorical Variable

Summarizing Numerical Variables

Visualizing Your Data

11. Data Preparation

Handling Row Duplication

Converting Data Types

Handling Incorrect Values

Handling Missing Values

12. Feature Engineering

13. Imbalanced Datasets

Understanding the Business Context

Challenges of Imbalanced Datasets

Strategies for Dealing with Imbalanced Datasets

Generating Synthetic Samples

14. Dimensionality Reduction

Creating a High-Dimensional Dataset

Strategies for Addressing High-Dimensional Datasets

Comparing Different Dimensionality Reduction Techniques

15. Ensemble Learning

Ensemble Learning

Simple Methods for Ensemble Learning

Advanced Techniques for Ensemble Learning

Data

In machine learning, you do not use all of your data to train your model. Instead, you separate your data into three sets:

A training dataset, which is used to train your model and measure the training loss.
An evaluation or validation dataset, which you use to measure the validation loss of the model and check whether it keeps decreasing along with the training loss.
A test dataset for final testing to see how well the model performs before you put it into production.
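
To make the three roles concrete, the following is a minimal sketch rather than the book's own code. It assumes a toy one-variable linear model fitted by plain gradient descent on made-up data; the names `mse`, `history`, and the 140/30/30 split are illustrative choices. The training set drives the weight updates, the validation set is only measured after each epoch, and the test set is touched once at the end.

```python
import random


def mse(w, b, data):
    """Mean squared error of the line y = w * x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Made-up data for illustration: y = 3x + 2 plus a little Gaussian noise.
rng = random.Random(0)
points = []
for _ in range(200):
    x = rng.uniform(-1, 1)
    points.append((x, 3 * x + 2 + rng.gauss(0, 0.1)))

# Three disjoint sets, each with a different job.
train_set, val_set, test_set = points[:140], points[140:170], points[170:]

w, b, learning_rate = 0.0, 0.0, 0.3
history = []  # one (training loss, validation loss) pair per epoch

for epoch in range(100):
    # Gradients are computed on the training set only.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in train_set) / len(train_set)
    grad_b = sum(2 * (w * x + b - y) for x, y in train_set) / len(train_set)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

    # The validation set is measured, never trained on.
    history.append((mse(w, b, train_set), mse(w, b, val_set)))

# The test set is used once, after training has finished.
test_loss = mse(w, b, test_set)
print(f"final train loss={history[-1][0]:.4f}, "
      f"validation loss={history[-1][1]:.4f}, test loss={test_loss:.4f}")
```

If the validation loss in `history` stopped falling, or began to rise, while the training loss kept decreasing, that would suggest the model is starting to overfit the training data.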

The Ratio for Dataset Splits

The evaluation dataset is set aside from your full dataset and is never used for training. There are various schools of thought on what share to set aside for evaluation, but it generally ranges from 10% to 30%. This evaluation dataset is normally further split into a validation dataset that is used during training and a test dataset...
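
The two-stage split described above can be sketched in a few lines. This is an illustrative example using only the Python standard library, not the book's implementation; the function name `split_dataset`, the 20% evaluation share, and the even validation/test division are assumptions chosen for the example.

```python
import random


def split_dataset(rows, eval_fraction=0.2, test_share_of_eval=0.5, seed=42):
    """Shuffle rows, hold out eval_fraction of them for evaluation, then
    divide the held-out part into validation and test sets.

    All parameter names and defaults are hypothetical, for illustration.
    """
    if not 0.0 < eval_fraction < 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    if not 0.0 <= test_share_of_eval <= 1.0:
        raise ValueError("test_share_of_eval must be between 0 and 1")

    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable

    n_eval = round(len(shuffled) * eval_fraction)
    n_test = round(n_eval * test_share_of_eval)

    evaluation = shuffled[:n_eval]   # never used for training
    train = shuffled[n_eval:]
    test = evaluation[:n_test]       # kept for the final check
    validation = evaluation[n_test:] # monitored during training
    return train, validation, test


train, validation, test = split_dataset(range(1000), eval_fraction=0.2)
print(len(train), len(validation), len(test))  # 800 100 100
```

Shuffling before slicing matters when the rows are ordered, for example by date or by class, because otherwise the held-out sets would not resemble the training data. In practice a library helper such as scikit-learn's `train_test_split`, applied twice, achieves the same result.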