Book Image

Machine Learning Automation with TPOT

By : Dario Radečić
Book Image

Machine Learning Automation with TPOT

By: Dario Radečić

Overview of this book

The automation of machine learning tasks allows developers more time to focus on the usability and reactivity of the software powered by machine learning models. TPOT is a Python automated machine learning tool used for optimizing machine learning pipelines using genetic programming. Automating machine learning with TPOT enables individuals and companies to develop production-ready machine learning models cheaper and faster than with traditional methods. With this practical guide to AutoML, developers working with Python on machine learning tasks will be able to put their knowledge to work and become productive quickly. You'll adopt a hands-on approach to learning the implementation of AutoML and associated methodologies. Complete with step-by-step explanations of essential concepts, practical examples, and self-assessment questions, this book will show you how to build automated classification and regression models and compare their performance to custom-built models. As you advance, you'll also develop state-of-the-art models using only a couple of lines of code and see how those models outperform all of your previous models on the same datasets. By the end of this book, you'll have gained the confidence to implement AutoML techniques in your organization on a production level.
Table of Contents (14 chapters)
1
Section 1: Introducing Machine Learning and the Idea of Automation
3
Section 2: TPOT – Practical Classification and Regression
8
Section 3: Advanced Examples and Neural Networks in TPOT

Automation options

AutoML isn't that new a concept. The idea and some implementations have been around for years, and are receiving positive feedback overall. Still, some fail to implement and fully utilize AutoML solutions in their organization due to a lack of understanding.

AutoML can't do everything – someone still has to gather the data, store it, and prepare it. This isn't a small task, and more often than not requires a significant amount of domain knowledge. Then and only then can automated solutions be utilized to their full potential.

This section explores a couple of options for implementing AutoML solutions. We'll compare one code-based tool written in Python, and one that is delivered as a browser application, meaning that no coding is required. We'll start with the code-based one first.

PyCaret

PyCaret has been widely used to make production-ready machine learning models with as little code as possible. It is a completely free solution capable of training, visualizing, and interpreting machine learning models with ease.

It has built-in support for regression and classification models and shows in an interactive way which models were used for the task, and which generated the best result. It's up to the data scientist to decide which model will be used for the task. Both training and optimization are as simple as a function call.

The library also provides an option to explain machine learning models with game-theoretic algorithms such as SHAP (Shapely Additive Explanations), also with a single function call.

PyCaret still requires a bit of human interaction. Oftentimes, though, the initialization and training process of a model must be selected explicitly by the user, and that breaks the idea of a fully-automated solution.

Further, PyCaret can be slow to run and optimize for a larger dataset. Let's take a look at a code-free AutoML solution next.

ObviouslyAI

Not all of us know how to develop machine learning models, or even how to write code. That's where drag and drop solutions come into play. ObviouslyAI is certainly one of the best out there, and is also easy to use.

This service allows for in-browser model training and evaluation, and can even explain the reasoning behind decisions made by a model. It's a no-brainer for companies in which machine learning isn't the core business, as it's pretty easy to start with and doesn't cost nearly as much as an entire data science team.

A big gotcha with services like this one is the pricing. There's always a free plan included, but in this particular case it's limited to datasets with fewer than 50,000 rows. That's completely fine for occasional tests here and there, but is a deal-breaker for most production use cases.

The second deal-breaker is the actual automation. You can't easily automate mouse clicks and dataset loads. This service automates the machine learning process itself completely, but you still have to do some manual work.

TPOT

The acronym TPOT stands for Tree-based Pipeline Optimization Tool. It is a Python library designed to handle machine learning tasks in an automated fashion.

Here's a statement from the official documentation:

Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming (TPOT Documentation page, TPOT Team; November 5, 2019).

Genetic programming is a term that is further discussed in the later chapters. For now, just know that it is based on evolutionary algorithms – a special type of algorithm used to discover solutions to problems that humans don't know how to solve.

In a way, TPOT is your data science assistant. You can use it to automate everything boring in a data science project. The term "boring" is subjective, but throughout the book, we use it to refer to the tasks of manually selecting and tweaking machine learning models (read: spending days waiting for the model to tune).

TPOT can't automate the process of data gathering and cleaning, and the reason is obvious – a machine can't read your mind. It can, however, perform machine learning tasks on well prepared datasets better than most data scientists.

The following chapters discuss the library in great detail.