## Chapter 4: Dimensionality Reduction and Unsupervised Learning

### Activity 12: Ensemble k-means Clustering and Calculating Predictions

**Solution**:

After the glass dataset has been imported, shuffled, and standardized (see Exercise 58):

- Instantiate an empty data frame to which the labels from each model will be appended, and save it as the new data frame object **labels_df**, with the following code:

  ```python
  import pandas as pd

  labels_df = pd.DataFrame()
  ```

- Import the **KMeans** function outside of the loop using the following:

  ```python
  from sklearn.cluster import KMeans
  ```

- Complete 100 iterations as follows:

  ```python
  for i in range(0, 100):
  ```

- Save a KMeans model object with two clusters (arbitrarily decided upon, a priori) using:

  ```python
      model = KMeans(n_clusters=2)
  ```

- Fit the model to **scaled_features** using the following:

  ```python
      model.fit(scaled_features)
  ```

- Generate the labels array and save it as the **labels** object, as follows:

  ```python
      labels = model.labels_
  ```

- Store **labels** as a column in **labels_df** named after the iteration using the code:

  ```python
      labels_df['Model_{}_Labels'.format(i+1)] = labels
  ```

- After labels have been generated for each of the 100 models (see Activity 21), calculate the mode for each row using the following code:

  ```python
  row_mode = labels_df.mode(axis=1)
  ```

- Assign **row_mode** to a new column in **labels_df**, as shown in the following code:

  ```python
  labels_df['row_mode'] = row_mode
  ```

- View the first five rows of **labels_df**:

  ```python
  print(labels_df.head(5))
  ```

###### Figure 4.24: First five rows of labels_df

We have drastically increased the confidence in our predictions by iterating through numerous models, saving the predictions at each iteration, and assigning the final predictions as the mode of these predictions. However, these predictions were generated by models using a predetermined number of clusters. Unless we know the number of clusters a priori, we will want to discover the optimal number of clusters to segment our observations.
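Assembled end to end, the steps above can be sketched as follows. The standardized glass data (**scaled_features**) is not reproduced here, so synthetic blobs from `make_blobs` stand in for it; ties in the row-wise mode are resolved by keeping the first mode column.

```python
# Sketch of the full ensemble procedure above. Synthetic blobs stand in
# for the standardized glass features (scaled_features).
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

scaled_features, _ = make_blobs(n_samples=200, centers=2, random_state=0)

labels_df = pd.DataFrame()
for i in range(0, 100):
    model = KMeans(n_clusters=2, n_init=10)
    model.fit(scaled_features)
    labels_df['Model_{}_Labels'.format(i + 1)] = model.labels_

# mode(axis=1) returns a DataFrame (row-wise ties produce extra columns);
# taking column 0 keeps the first mode for each row
labels_df['row_mode'] = labels_df.mode(axis=1)[0]
print(labels_df.head(5))
```

Setting `n_init` explicitly keeps the snippet stable across scikit-learn versions, where the default has changed over time.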

### Activity 13: Evaluating Mean Inertia by Cluster after PCA Transformation

**Solution**:

- Instantiate a PCA model with the value for the **n_components** argument equal to **best_n_components** (that is, remember, **best_n_components = 6**) as follows:

  ```python
  from sklearn.decomposition import PCA

  model = PCA(n_components=best_n_components)
  ```

- Fit the model to **scaled_features** and transform them into the six components, as shown here:

  ```python
  df_pca = model.fit_transform(scaled_features)
  ```

- Import **numpy** and the **KMeans** function outside the loop using the following code:

  ```python
  import numpy as np
  from sklearn.cluster import KMeans
  ```

- Instantiate an empty list, **inertia_list**, to which we will append inertia values after each iteration of the inner loop (this list is emptied at the start of each pass through the outer loop):

  ```python
      inertia_list = []
  ```

- In the inner for loop, we will iterate through 100 models as follows:

  ```python
      for i in range(100):
  ```

- Build our **KMeans** model with **n_clusters=x** using:

  ```python
          model = KMeans(n_clusters=x)
  ```

#### Note

The value for x will be dictated by the outer loop, which is covered in detail below.

- Fit the model to **df_pca** as follows:

  ```python
          model.fit(df_pca)
  ```

- Get the inertia value and save it to the object **inertia** using the following code:

  ```python
          inertia = model.inertia_
  ```

- Append **inertia** to **inertia_list** using the following code:

  ```python
          inertia_list.append(inertia)
  ```

- Moving to the outer loop, instantiate another empty list to store the average inertia values using the following code:

  ```python
  mean_inertia_list_PCA = []
  ```

- Since we want to check the average inertia over 100 models for **n_clusters** values 1 through 10, we will instantiate the outer loop as follows:

  ```python
  for x in range(1, 11):
  ```

- After the inner loop has run through its 100 iterations, and the inertia value for each of the 100 models has been appended to **inertia_list**, compute the mean of this list and save the object as **mean_inertia** using the following code:

  ```python
      mean_inertia = np.mean(inertia_list)
  ```

- Append **mean_inertia** to **mean_inertia_list_PCA** using the following code:

  ```python
      mean_inertia_list_PCA.append(mean_inertia)
  ```

- Print **mean_inertia_list_PCA** to the console using the following code:

  ```python
  print(mean_inertia_list_PCA)
  ```

- Notice the output in the following screenshot:
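Putting the inner and outer loops together, the whole activity can be sketched as shown here. Two assumptions keep the sketch self-contained and quick: synthetic data stands in for the standardized glass features, and the inner loop runs 10 times rather than 100.

```python
# Sketch of the nested loop above. Synthetic data stands in for the
# standardized glass features, and the inner loop runs 10 times
# (rather than 100) so the example finishes quickly.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

scaled_features, _ = make_blobs(n_samples=100, n_features=9, centers=3,
                                random_state=0)

best_n_components = 6
df_pca = PCA(n_components=best_n_components).fit_transform(scaled_features)

mean_inertia_list_PCA = []
for x in range(1, 11):
    inertia_list = []  # emptied for each candidate cluster count
    for i in range(10):
        model = KMeans(n_clusters=x, n_init=10)
        model.fit(df_pca)
        inertia_list.append(model.inertia_)
    mean_inertia_list_PCA.append(np.mean(inertia_list))

print(mean_inertia_list_PCA)
```

Because inertia can only fall as clusters are added, the printed list decreases from **n_clusters=1** to **n_clusters=10**; the elbow in that curve suggests the number of clusters to use.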