The Data Science Workshop - Second Edition

By : Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, Dr. Samuel Asare

Overview of this book

Where there’s data, there’s insight. With so much data being generated, there is immense scope to extract meaningful information that’ll boost business productivity and profitability. By learning to convert raw data into game-changing insights, you’ll open new career paths and opportunities. The Data Science Workshop begins by introducing different types of projects and showing you how to incorporate machine learning algorithms in them. You’ll learn to select a relevant metric and assess the performance of your model. To tune the hyperparameters of an algorithm and improve its accuracy, you’ll get hands-on with approaches such as grid search and random search. Next, you’ll learn dimensionality reduction techniques to handle many variables at once, before exploring how to use model ensembling techniques and create new features to enhance model performance. To help you automatically create new features that improve your model, the book demonstrates how to use an automated feature engineering tool. You’ll also learn how to use an orchestration and scheduling workflow to deploy machine learning models in batch. By the end of this book, you’ll have the skills to start working on data science projects confidently.

Preface

1. Introduction to Data Science in Python

Application of Data Science

Overview of Python

Python for Data Science

2. Regression

Simple Linear Regression

Multiple Linear Regression

Conducting Regression Analysis Using Python

Multiple Regression Analysis

Assumptions of Regression Analysis

Explaining the Results of Regression Analysis

3. Binary Classification

Understanding the Business Context

Feature Engineering

Data-Driven Feature Engineering

Correlation Matrix and Visualization

4. Multiclass Classification with RandomForest

Training a Random Forest Classifier

Evaluating the Model's Performance

Minimum Sample in Leaf

Maximum Features

5. Performing Your First Cluster Analysis

Clustering with k-means

Interpreting k-means Results

Choosing the Number of Clusters

Initializing Clusters

Calculating the Distance to the Centroid

Standardizing Data

6. How to Assess Performance

Assessing Model Performance for Regression Models

Assessing Model Performance for Classification Models

The Confusion Matrix

Receiver Operating Characteristic Curve

Area Under the ROC Curve

Saving and Loading Models

7. The Generalization of Machine Learning Models

Cross-Validation

cross_val_score

LogisticRegressionCV

Hyperparameter Tuning with GridSearchCV

Hyperparameter Tuning with RandomizedSearchCV

Model Regularization with Lasso Regression

Ridge Regression

8. Hyperparameter Tuning

What Are Hyperparameters?

Finding the Best Hyperparameterization

Tuning Using Grid Search

9. Interpreting a Machine Learning Model

Linear Model Coefficients

RandomForest Variable Importance

Variable Importance via Permutation

Partial Dependence Plots

Local Interpretation with LIME

10. Analyzing a Dataset

Exploring Your Data

Analyzing Your Dataset

Analyzing the Content of a Categorical Variable

Summarizing Numerical Variables

Visualizing Your Data

11. Data Preparation

Handling Row Duplication

Converting Data Types

Handling Incorrect Values

Handling Missing Values

12. Feature Engineering

13. Imbalanced Datasets

Understanding the Business Context

Challenges of Imbalanced Datasets

Strategies for Dealing with Imbalanced Datasets

Generating Synthetic Samples

14. Dimensionality Reduction

Creating a High-Dimensional Dataset

Strategies for Addressing High-Dimensional Datasets

Comparing Different Dimensionality Reduction Techniques

15. Ensemble Learning

Ensemble Learning

Simple Methods for Ensemble Learning

Advanced Techniques for Ensemble Learning

Interpreting k-means Results

After training our k-means algorithm, we will likely want to analyze its results in more detail. Remember, the objective of cluster analysis is to group observations with similar patterns together. But how can we tell whether the groupings found by the algorithm are meaningful? This section looks at that question using the results we just generated.

One way to investigate this is to go through the dataset row by row, checking the cluster assigned to each observation. This quickly becomes tedious, especially with a large dataset, so a summary of the cluster results would be more useful.

If you are familiar with Excel spreadsheets, you are probably thinking about using a pivot table to get the average of the variables for each cluster. In SQL, you would probably use a GROUP BY statement. If you are not familiar with either of these, you may think of grouping each cluster together and then...
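As a rough illustration of the pivot-table and GROUP BY idea described above, here is a minimal sketch in pandas. The column names (`income`, `deductions`), the values, and the hand-written `cluster` labels are all hypothetical stand-ins for whatever a fitted k-means model would assign; they are not taken from the book's dataset.

```python
import pandas as pd

# Hypothetical toy data: two numeric variables plus a 'cluster' column
# standing in for the labels a fitted k-means model would assign.
df = pd.DataFrame({
    "income": [20.0, 22.0, 80.0, 84.0, 50.0, 54.0],
    "deductions": [2.0, 4.0, 10.0, 14.0, 6.0, 8.0],
    "cluster": [0, 0, 1, 1, 2, 2],
})

# SQL-style GROUP BY: average of every variable within each cluster.
summary_groupby = df.groupby("cluster").mean()

# Excel-style pivot table: one row per cluster, mean of the chosen variables.
summary_pivot = pd.pivot_table(
    df,
    index="cluster",
    values=["income", "deductions"],
    aggfunc="mean",
)

print(summary_groupby)
print(summary_pivot)
```

Both calls produce one row per cluster containing the mean of each variable, which is usually easier to read than scanning the assignments row by row. Whether `groupby` or `pivot_table` is preferable is largely a matter of taste here.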