Data science/machine learning workflow

Building machine learning applications, while similar in many respects to the standard engineering paradigm, differs in one crucial aspect: the need to work with data as a raw material. The success of your project will, in large part, depend on the quality of the data you acquire, as well as your handling of that data. And because working with data falls into the domain of data science, it is helpful to understand the data science workflow:

Data science workflow

The process involves these six steps in the following order:

Acquisition
Inspection
Preparation
Modeling
Evaluation
Deployment

Frequently, there is a need to circle back to prior steps, such as when inspecting and preparing the data, or when evaluating and modeling, but the process at a high level can be as described in the preceding list.

Let's now discuss each step in detail.

Acquisition

Data for machine learning applications can come from any number of sources; it may be emailed to you as a CSV file, it may come from pulling down server logs, or it may require building a custom web scraper. Data is also likely to exist in any number of formats. In most cases, you will be dealing with text-based data, but, as we'll see, machine learning applications may just as easily be built that utilize images or even video files. Regardless of the format, once you have secured the data, it is crucial that you understand what's in the data, as well as what isn't.

Inspection

Once you have acquired your data, the next step is to inspect it. The primary goal at this stage is to sanity check the data, and the best way to accomplish this is to look for things that are either impossible or highly unlikely. As an example, if the data has a unique identifier, check to see that there is indeed only one; if the data is price-based, check that it is always positive; and whatever the data type, check the most extreme cases. Do they make sense? A good practice is to run some simple statistical tests on the data, and visualize it. The outcome of your models is only as good as the data you put in, so it is crucial to get this step right.

Preparation

When you are confident you have your data in order, next you will need to prepare it by placing it in a format that is amenable to modeling. This stage encompasses a number of processes, such as filtering, aggregating, imputing, and transforming. The type of actions you need to take will be highly dependent on the type of data you're working with, as well as the libraries and algorithms you will be utilizing. For example, if you are working with natural language-based texts, the transformations required will be very different from those required for time-series data. We'll see a number of examples of these types of transformations throughout the book.

Modeling

Once the data preparation is complete, the next phase is modeling. Here, you will be selecting an appropriate algorithm and using the data to train your model. There are a number of best practices to adhere to during this stage, and we will discuss them in detail, but the basic steps involve splitting your data into training, testing, and validation sets. This splitting up of the data may seem illogical—especially when more data typically yields better models—but as we'll see, doing this allows us to get better feedback on how the model will perform in the real world, and prevents us from the cardinal sin of modeling: overfitting. We will talk more about this in later chapters.

Evaluation

So, now you've got a shiny new model, but exactly how good is that model? This is the question that the evaluation phase seeks to answer. There are a number of ways to measure the performance of a model, and again it is largely dependent on the type of data you are working with and the type of model used, but on the whole, we are seeking to answer the question of how close the model's predictions are to the actual value. There is an array of confusing sounding terms, such as root mean-square error, or Euclidean distance, or F1 score. But in the end, they are all just a measure of distance between the actual prediction and the estimated prediction.

Deployment

Once you are comfortable with the performance of your model, you'll want to deploy it. This can take a number of forms depending on the use case, but common scenarios include utilization as a feature within another larger application, a bespoke web application, or even just a simple cron job.

Python Machine Learning Blueprints - Second Edition

By : Alexander Combs, Saurabh Chhajed, Michael Roman

Python Machine Learning Blueprints

By: Alexander Combs, Saurabh Chhajed, Michael Roman

Overview of this book