The Data Science Workshop - Second Edition

By : Anthony So, Thomas V. Joseph, Robert Thas John, Andrew Worsley, Dr. Samuel Asare

Overview of this book

Where there’s data, there’s insight. With so much data being generated, there is immense scope to extract meaningful information that’ll boost business productivity and profitability. By learning to convert raw data into game-changing insights, you’ll open new career paths and opportunities. The Data Science Workshop begins by introducing different types of projects and showing you how to incorporate machine learning algorithms in them. You’ll learn to select a relevant metric and assess the performance of your model. To tune the hyperparameters of an algorithm and improve its accuracy, you’ll get hands-on with approaches such as grid search and random search. Next, you’ll learn dimensionality reduction techniques to handle many variables at once, before exploring how to use model ensembling techniques and create new features to enhance model performance. To help you create new features automatically, the book demonstrates how to use an automated feature engineering tool. You’ll also learn how to use an orchestration and scheduling workflow to deploy machine learning models in batch. By the end of this book, you’ll have the skills to start working on data science projects confidently.

Preface

1. Introduction to Data Science in Python

Application of Data Science

Overview of Python

Python for Data Science

2. Regression

Simple Linear Regression

Multiple Linear Regression

Conducting Regression Analysis Using Python

Multiple Regression Analysis

Assumptions of Regression Analysis

Explaining the Results of Regression Analysis

3. Binary Classification

Understanding the Business Context

Feature Engineering

Data-Driven Feature Engineering

Correlation Matrix and Visualization

4. Multiclass Classification with RandomForest

Training a Random Forest Classifier

Evaluating the Model's Performance

Minimum Sample in Leaf

Maximum Features

5. Performing Your First Cluster Analysis

Clustering with k-means

Interpreting k-means Results

Choosing the Number of Clusters

Initializing Clusters

Calculating the Distance to the Centroid

Standardizing Data

6. How to Assess Performance

Assessing Model Performance for Regression Models

Assessing Model Performance for Classification Models

The Confusion Matrix

Receiver Operating Characteristic Curve

Area Under the ROC Curve

Saving and Loading Models

7. The Generalization of Machine Learning Models

Cross-Validation

cross_val_score

LogisticRegressionCV

Hyperparameter Tuning with GridSearchCV

Hyperparameter Tuning with RandomizedSearchCV

Model Regularization with Lasso Regression

Ridge Regression

8. Hyperparameter Tuning

What Are Hyperparameters?

Finding the Best Hyperparameterization

Tuning Using Grid Search

9. Interpreting a Machine Learning Model

Linear Model Coefficients

RandomForest Variable Importance

Variable Importance via Permutation

Partial Dependence Plots

Local Interpretation with LIME

10. Analyzing a Dataset

Exploring Your Data

Analyzing Your Dataset

Analyzing the Content of a Categorical Variable

Summarizing Numerical Variables

Visualizing Your Data

11. Data Preparation

Handling Row Duplication

Converting Data Types

Handling Incorrect Values

Handling Missing Values

12. Feature Engineering

13. Imbalanced Datasets

Understanding the Business Context

Challenges of Imbalanced Datasets

Strategies for Dealing with Imbalanced Datasets

Generating Synthetic Samples

14. Dimensionality Reduction

Creating a High-Dimensional Dataset

Strategies for Addressing High-Dimensional Datasets

Comparing Different Dimensionality Reduction Techniques

15. Ensemble Learning

Ensemble Learning

Simple Methods for Ensemble Learning

Advanced Techniques for Ensemble Learning

Clustering with k-means

k-means is one of the most popular clustering algorithms (if not the most popular) among data scientists due to its simplicity and high performance. Its origins date back to 1956, when the mathematician Hugo Steinhaus laid its foundations, but it was about a decade later that another researcher, James MacQueen, gave the approach its name: k-means.

The objective of k-means is to group similar data points (or observations) into clusters. Think of it as grouping elements that are close to each other (we will define how to measure closeness later in this chapter). For example, if you were manually analyzing user behavior on a mobile app, you might end up grouping together customers who log in frequently, or users who make bigger in-app purchases. This is the kind of grouping that clustering algorithms such as k-means will find for you automatically from the data.

In this chapter, we will be working with an open source dataset...