Python Machine Learning Blueprints - Second Edition

By: Alexander Combs, Michael Roman

Overview of this book

Machine learning is transforming the way we understand and interact with the world around us. This book is the perfect guide for you to put your knowledge and skills into practice and use the Python ecosystem to cover key domains in machine learning. This second edition covers a range of libraries from the Python ecosystem, including TensorFlow and Keras, to help you implement real-world machine learning projects. The book begins by giving you an overview of machine learning with Python. With the help of complex datasets and optimized techniques, you’ll go on to understand how to apply advanced concepts and popular machine learning algorithms to real-world projects. Next, you’ll cover projects from domains such as predictive analytics to analyze the stock market and recommendation systems for GitHub repositories. In addition to this, you’ll also work on projects from the NLP domain to create a custom news feed using frameworks such as scikit-learn, TensorFlow, and Keras. Following this, you’ll learn how to build an advanced chatbot, and scale things up using PySpark. In the concluding chapters, you can look forward to exciting insights into deep learning and you'll even create an application using computer vision and neural networks. By the end of this book, you’ll be able to analyze data seamlessly and make a powerful impact through your projects.
Table of Contents (13 chapters)

Data science/machine learning workflow

Building machine learning applications, while similar in many respects to the standard engineering paradigm, differs in one crucial aspect: the need to work with data as a raw material. The success of your project will, in large part, depend on the quality of the data you acquire, as well as your handling of that data. And because working with data falls into the domain of data science, it is helpful to understand the data science workflow:

Figure: Data science workflow

The process involves these six steps in the following order:

  1. Acquisition
  2. Inspection
  3. Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

Frequently, you will need to circle back to earlier steps, for example between inspection and preparation or between modeling and evaluation, but at a high level the process follows the preceding list.

Let's now discuss each step in detail.


Acquisition

Data for machine learning applications can come from any number of sources: it may be emailed to you as a CSV file, it may come from pulling down server logs, or it may require building a custom web scraper. Data is also likely to exist in any number of formats. In most cases, you will be dealing with text-based data, but, as we'll see, machine learning applications can just as easily be built on images or even video files. Regardless of the format, once you have secured the data, it is crucial that you understand what's in the data, as well as what isn't.
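As a minimal sketch of that first look, the following uses only the standard library to parse CSV text and count empty fields per column. The sample data, the column names, and the helper names load_rows and missing_counts are hypothetical, chosen purely for illustration; in practice a library such as pandas would be a more typical choice.

```python
import csv
import io

# Hypothetical CSV content standing in for a file you were emailed.
RAW_CSV = """listing_id,price,city
1,1200,Boston
2,,Chicago
3,950,
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_counts(rows):
    """Count empty fields per column to see what the data lacks."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (value == "")
    return counts


rows = load_rows(RAW_CSV)
print(len(rows), "rows; columns:", list(rows[0].keys()))
print("missing per column:", missing_counts(rows))
```

Knowing which columns have gaps at this point makes the later inspection and preparation steps much easier to plan.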


Inspection

Once you have acquired your data, the next step is to inspect it. The primary goal at this stage is to sanity check the data, and the best way to accomplish this is to look for things that are either impossible or highly unlikely. As an example, if the data has a unique identifier, check that each identifier really does appear only once; if the data is price-based, check that every price is positive; and whatever the data type, check the most extreme cases. Do they make sense? A good practice is to run some simple statistical tests on the data and to visualize it. The output of your models is only as good as the data you put in, so it is crucial to get this step right.
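The checks described above could be sketched as follows. This is an illustrative example rather than a prescribed method: the records, the field names listing_id and price, and the three helper functions are all assumptions made for the sketch.

```python
from collections import Counter
import statistics

# Hypothetical records; the field names are assumptions for illustration.
records = [
    {"listing_id": 1, "price": 1200.0},
    {"listing_id": 2, "price": 950.0},
    {"listing_id": 2, "price": 1100.0},   # duplicated identifier
    {"listing_id": 3, "price": -50.0},    # impossible price
    {"listing_id": 4, "price": 98000.0},  # suspicious extreme
]


def duplicate_ids(rows, key="listing_id"):
    """Return identifiers that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def non_positive(rows, key="price"):
    """Return rows whose value is zero or negative."""
    return [row for row in rows if row[key] <= 0]


def summary(rows, key="price"):
    """Report the extremes and the median so outliers stand out."""
    values = [row[key] for row in rows]
    return {
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }


print("duplicate ids:", duplicate_ids(records))
print("non-positive prices:", non_positive(records))
print("price summary:", summary(records))
```

Comparing the minimum and maximum against the median is a quick way to spot values that deserve a closer look before any modeling begins.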


Preparation

When you are confident you have your data in order, you will next need to prepare it by placing it in a format that is amenable to modeling. This stage encompasses a number of processes, such as filtering, aggregating, imputing, and transforming. The actions you need to take will depend heavily on the type of data you're working with, as well as on the libraries and algorithms you will be using. For example, if you are working with natural-language text, the transformations required will be very different from those required for time-series data. We'll see a number of examples of these types of transformations throughout the book.
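To make the four processes concrete, here is a small sketch that filters out impossible prices, imputes missing ones with the median, min-max scales the result, and aggregates by city. The data, the field names, and the specific choices (median imputation, min-max scaling) are assumptions for illustration only; the right choices depend on your data and algorithm.

```python
import statistics
from collections import defaultdict

# Hypothetical rows; None marks a missing price.
raw_rows = [
    {"city": "Boston", "price": 1200.0},
    {"city": "Boston", "price": None},
    {"city": "Chicago", "price": 900.0},
    {"city": "Chicago", "price": -50.0},
    {"city": "Chicago", "price": 1100.0},
]


def prepare(rows):
    """Filter, impute, and transform the price field."""
    # Filter: drop rows whose price is present but not positive.
    kept = [r for r in rows if r["price"] is None or r["price"] > 0]
    # Impute: fill missing prices with the median of the observed ones.
    observed = [r["price"] for r in kept if r["price"] is not None]
    fill_value = statistics.median(observed)
    filled = [
        dict(r, price=fill_value if r["price"] is None else r["price"])
        for r in kept
    ]
    # Transform: min-max scale prices into the range [0, 1].
    low, high = min(observed), max(observed)
    span = (high - low) or 1.0  # avoid dividing by zero if all are equal
    return [dict(r, price_scaled=(r["price"] - low) / span) for r in filled]


def mean_price_by_city(rows):
    """Aggregate: average price per city."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["city"]].append(r["price"])
    return {city: statistics.mean(prices) for city, prices in groups.items()}


prepared = prepare(raw_rows)
print(prepared)
print(mean_price_by_city(prepared))
```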


Modeling

Once the data preparation is complete, the next phase is modeling. Here, you will be selecting an appropriate algorithm and using the data to train your model. There are a number of best practices to adhere to during this stage, and we will discuss them in detail, but a basic one is to split your data into training, validation, and testing sets. Splitting up the data may seem illogical, especially since more data typically yields better models, but as we'll see, holding some data back gives us better feedback on how the model will perform in the real world and helps us avoid the cardinal sin of modeling: overfitting. We will talk more about this in later chapters.
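A three-way split can be sketched with nothing more than a seeded shuffle. The function name train_val_test_split and the 60/20/20 proportions are assumptions for this example, not a recommendation from the text; libraries such as scikit-learn provide their own splitting utilities.

```python
import random


def train_val_test_split(items, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and cut it into three disjoint parts."""
    shuffled = list(items)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The model is fitted on the training set, tuned against the validation set, and scored once on the test set, which it has never seen. A large gap between training and test performance is the usual warning sign of overfitting.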


Evaluation

So, now you've got a shiny new model, but exactly how good is that model? This is the question that the evaluation phase seeks to answer. There are a number of ways to measure the performance of a model, and again the choice depends largely on the type of data you are working with and the type of model used, but on the whole, we are seeking to answer the question of how close the model's predictions are to the actual values. There is an array of confusing-sounding terms, such as root mean square error, Euclidean distance, and F1 score, but in the end, they are all just measures of how far the predicted values are from the actual values.
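Two of the metrics named above can be written out by hand in a few lines. The functions below are hand-rolled illustrations, not the API of any particular library, and the sample values are made up. Root mean square error suits numeric predictions, while the F1 score (the harmonic mean of precision and recall) suits class labels.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def f1_score(actual, predicted, positive=1):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up values for illustration.
print(rmse([0.0, 0.0], [3.0, 4.0]))
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

A lower RMSE means predictions sit closer to the actual values, whereas a higher F1 score (up to 1.0) means the predicted labels agree more closely with the true ones.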


Deployment

Once you are comfortable with the performance of your model, you'll want to deploy it. This can take a number of forms depending on the use case, but common scenarios include use as a feature within a larger application, a bespoke web application, or even just a simple cron job.
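For the simplest of those scenarios, a scheduled batch job, a deployment might look something like the sketch below: trained parameters are saved to disk once and then reloaded each time the job runs. Everything here is a stand-in chosen for illustration, including the toy linear model, the JSON file format, and the function names; a real project would persist whatever model object its library produces.

```python
import json
import os
import tempfile


def save_model(params, path):
    """Persist model parameters so a later job can reload them."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(params, handle)


def load_model(path):
    """Read model parameters back from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def predict(params, features):
    """Toy linear model: intercept plus weighted sum of the features."""
    weighted = sum(w * x for w, x in zip(params["weights"], features))
    return params["intercept"] + weighted


def run_batch(model_path, batch):
    """What a scheduled job might do: load the model, score new rows."""
    params = load_model(model_path)
    return [predict(params, features) for features in batch]


# Hypothetical parameters and inputs, written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, "model.json")
    save_model({"intercept": 1.0, "weights": [2.0, 0.5]}, model_path)
    predictions = run_batch(model_path, [[1.0, 2.0], [0.0, 4.0]])

print(predictions)
```

Separating training from scoring in this way means the scheduled job stays small and fast, since it only loads parameters and makes predictions.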