Machine Learning Automation with TPOT

By: Dario Radečić

Overview of this book

The automation of machine learning tasks allows developers more time to focus on the usability and reactivity of the software powered by machine learning models. TPOT is a Python automated machine learning tool used for optimizing machine learning pipelines using genetic programming. Automating machine learning with TPOT enables individuals and companies to develop production-ready machine learning models cheaper and faster than with traditional methods. With this practical guide to AutoML, developers working with Python on machine learning tasks will be able to put their knowledge to work and become productive quickly. You'll adopt a hands-on approach to learning the implementation of AutoML and associated methodologies. Complete with step-by-step explanations of essential concepts, practical examples, and self-assessment questions, this book will show you how to build automated classification and regression models and compare their performance to custom-built models. As you advance, you'll also develop state-of-the-art models using only a couple of lines of code and see how those models outperform all of your previous models on the same datasets. By the end of this book, you'll have gained the confidence to implement AutoML techniques in your organization on a production level.

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Section 1: Introducing Machine Learning and the Idea of Automation

Chapter 1: Machine Learning and the Idea of Automation

Technical requirements

Reviewing the history of machine learning

Reviewing automation

Applying automation to machine learning

Automation options

Summary

Q&A

Further reading

Chapter 3: Exploring Regression with TPOT

Technical requirements

Applying automated regression modeling to the fish market dataset

Applying automated regression modeling to the insurance dataset

Applying automated regression modeling to the vehicle dataset

Summary

Q&A

Chapter 4: Exploring Classification with TPOT

Technical requirements

Applying automated classification models to the iris dataset

Applying automated classification modeling to the titanic dataset

Summary

Q&A

Chapter 5: Parallel Training with TPOT and Dask

Technical requirements

Introduction to parallelism in Python

Introduction to the Dask library

Training machine learning models with TPOT and Dask

Summary

Q&A

Section 3: Advanced Examples and Neural Networks in TPOT

Chapter 6: Getting Started with Deep Learning: Crash Course in Neural Networks

Technical requirements

Overview of deep learning

Introducing artificial neural networks

Using neural networks to classify handwritten digits

Neural networks in regression versus classification

Summary

Q&A

Chapter 7: Neural Network Classifier with TPOT

Technical requirements

Exploring the dataset

Exploring options for training neural network classifiers

Training a neural network classifier

Summary

Questions

Chapter 8: TPOT Model Deployment

Technical requirements

Why do we need model deployment?

Introducing Flask and Flask-RESTful

Best practices for deploying automated models

Deploying machine learning models to localhost

Deploying machine learning models to the cloud

Summary

Question

Chapter 9: Using the Deployed TPOT Model in Production

Technical requirements

Making predictions in a notebook environment

Developing a simple GUI web application

Making predictions in a GUI environment

Summary

Q&A

Applying automation to machine learning

We've covered the idea of automation and various types of automation thus far, but what's the connection between automation and machine learning? What exactly is it that we are trying to automate in machine learning?

That's what this section aims to demystify. By the end of it, you will know the difference between automation with machine learning and automating machine learning. The two terms might sound similar at first, but they mean very different things.

What are we trying to automate?

Let's get one thing straight – automation of machine learning processes has nothing to do with business process automation with machine learning. In the former, we're trying to automate machine learning itself, that is, the process of selecting the best model and the best hyperparameters. The latter refers to automating a business process with the help of machine learning; for example, building a decision system that decides when to buy or sell a stock based on historical data.

It's crucial to remember this distinction. The primary focus of this book is to demonstrate how automation libraries can be used to automate the process of machine learning. By doing so, you can follow the same approach regardless of the dataset and end up with the best model the search was able to find.

Choosing an appropriate machine learning algorithm isn't an easy task. Just take a look at the following diagram:

Figure 1.9 – Algorithms in scikit-learn (source: Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011)

As you can see, multiple decisions are required to select an appropriate algorithm. In addition, every algorithm has its own set of hyperparameters (parameters specified by the engineer). To make things even worse, some of these hyperparameters are continuous in nature, so when you add it all up, there are hundreds of thousands or even millions of hyperparameter combinations that you as an engineer would have to test.

Every hyperparameter combination requires training and evaluating a completely new model. Techniques such as grid search can help you avoid writing tens of nested loops, but they are far from an optimal solution, because every combination in the grid still has to be trained and scored.
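To make the idea of grid search concrete, here is a minimal sketch using only the Python standard library. The search space, its values, and the evaluate function are hypothetical stand-ins; in practice, evaluate would train a model and return its validation score.

```python
from itertools import product

# Hypothetical search space: names and candidate values are illustrative only.
param_grid = {
    "max_depth": [2, 4, 6],
    "eta": [0.01, 0.1, 0.3],
    "subsample": [0.5, 1.0],
}


def evaluate(params):
    """Stand-in for 'train a model and return its validation score'.

    This toy function simply peaks at max_depth=4, eta=0.1, subsample=1.0.
    """
    return (
        -abs(params["max_depth"] - 4)
        - abs(params["eta"] - 0.1)
        - abs(params["subsample"] - 1.0)
    )


def grid_search(grid, score_fn):
    """Score every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    n_evaluated = 0
    # product() replaces one nested loop per hyperparameter.
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        n_evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated


best_params, best_score, n_evaluated = grid_search(param_grid, evaluate)
print(f"Evaluated {n_evaluated} combinations, best: {best_params}")
```

Even this tiny grid needs 3 x 3 x 2 = 18 evaluations, and each one would be a full model fit in a real project. Adding one more hyperparameter multiplies that number again.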

Modern machine learning engineers don't spend their time and energy on model training and optimization, but instead on raising data quality and availability. Hyperparameter tweaking can squeeze out that additional 2% increase in accuracy, but it is the data quality that can make or break your project.

We'll dive a bit deeper into hyperparameters next and demonstrate why searching for the optimal ones manually isn't that good an idea.

The problem of too many parameters

Let's take a look at some of the hyperparameters available for one of the most popular machine learning algorithms – XGBoost. The following list shows the general ones:

booster
verbosity
validate_parameters
nthread
disable_default_eval_metric
num_pbuffer
num_feature

That's not many, and some of these hyperparameters are set automatically by the algorithm. The problem lies in the choices that follow. For example, if you choose gbtree as the value for the booster parameter, you can immediately tweak the values for the following:

eta
gamma
max_depth
min_child_weight
max_delta_step
subsample
sampling_method
colsample_bytree
colsample_bylevel
colsample_bynode
lambda
alpha
tree_method
sketch_eps
scale_pos_weight
updater
refresh_leaf
process_type
grow_policy
max_leaves
max_bin
predictor
num_parallel_tree
monotone_constraints
interaction_constraints

And that's a lot! As mentioned before, some hyperparameters take continuous values, which tremendously increases the total number of combinations. Here's the final icing on the cake – these are the hyperparameters for a single model only. Different models have different hyperparameters, which makes the tuning process that much more time-consuming.
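A quick back-of-the-envelope calculation shows how fast the number of combinations grows. The sketch below assumes a hypothetical number of candidate values for just eight of the gbtree hyperparameters listed above, and a hypothetical one second per model fit; none of these numbers are recommendations.

```python
from math import prod

# Hypothetical number of candidate values tried per hyperparameter.
# Continuous hyperparameters are assumed to be discretized into a few values.
candidates_per_param = {
    "eta": 10,
    "gamma": 5,
    "max_depth": 8,
    "min_child_weight": 5,
    "subsample": 5,
    "colsample_bytree": 5,
    "lambda": 5,
    "alpha": 5,
}

# An exhaustive grid has one model fit per combination of values.
n_combinations = prod(candidates_per_param.values())

# Assumed cost of a single fit, in seconds (purely illustrative).
seconds_per_fit = 1
days = n_combinations * seconds_per_fit / 86_400

print(f"{n_combinations:,} combinations, about {days:.1f} days of training")
```

Under these assumptions, an exhaustive search over only eight hyperparameters already takes 1,250,000 model fits, which is roughly two weeks of compute at one second per fit.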

Put simply, model selection and hyperparameter tuning aren't things you should do manually. There are more important tasks to spend your energy on. Even if there's nothing else you have to do, I'd prefer going for lunch over manually tuning models any day of the week.

AutoML takes this manual work off our hands, so we'll explore it briefly in the next section.

What is AutoML?

AutoML stands for Automated Machine Learning, and its primary goal is to reduce or completely eliminate the role of data scientists in building machine learning models. That sentence might sound harsh at first, and I know what you are thinking. But no – AutoML can't replace data scientists and other data professionals.

In the best-case scenario, AutoML technologies enable other software engineers to utilize the power of machine learning in their applications without needing a solid background in ML. This best-case scenario is only possible if the data is adequately gathered and prepared – a task that's not the specialty of a backend developer.

To make things even harder for the non-data scientist, the machine learning process often requires extensive feature engineering. This step can be skipped, but more often than not, this will result in poor models.

In conclusion, AutoML won't replace data scientists. Quite the contrary – it's here to make their lives easier. The only steps AutoML fully automates are model selection and hyperparameter tuning.
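The two steps that are automated, model selection and tuning, can be sketched as a single search loop. The example below is a deliberately simplified illustration that uses plain random search; it is not how TPOT works internally (TPOT relies on genetic programming). The model names, search spaces, and scoring functions are all hypothetical stand-ins for training and cross-validating real models.

```python
import random


# Toy scoring functions standing in for "train and cross-validate a model".
def score_tree(params):
    return 0.80 - 0.01 * abs(params["max_depth"] - 5)


def score_linear(params):
    return 0.75 - 0.05 * abs(params["alpha"] - 1.0)


# Each hypothetical model type has its own hyperparameter search space.
search_spaces = {
    "tree": ({"max_depth": list(range(1, 11))}, score_tree),
    "linear": ({"alpha": [0.01, 0.1, 1.0, 10.0]}, score_linear),
}


def random_search(spaces, n_trials, seed=0):
    """Pick a model type and hyperparameters at random; keep the best trial."""
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    best = None
    for _ in range(n_trials):
        # Model selection: choose which kind of model to try.
        model_name = rng.choice(sorted(spaces))
        space, score_fn = spaces[model_name]
        # Hyperparameter tuning: sample one value per hyperparameter.
        params = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(params)
        if best is None or score > best[2]:
            best = (model_name, params, score)
    return best


best_model, best_params, best_score = random_search(search_spaces, n_trials=50)
print(best_model, best_params, round(best_score, 3))
```

Notice what the loop does not do: it never touches the data. Gathering, cleaning, and feature engineering still happen before a search like this starts.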

Some AutoML services advertise themselves as fully automating even data preparation and feature engineering, but they do so simply by combining various features, which most of the time produces something that is not interpretable. A machine doesn't know the true relationships between variables. Discovering them is the job of a data scientist.
