Practical Automated Machine Learning Using H2O.ai

By: Salil Ajgaonkar

Overview of this book

With the huge amount of data being generated over the internet and the benefits that Machine Learning (ML) predictions bring to businesses, ML implementation has become low-hanging fruit that everyone is striving for. The complex mathematics behind it, however, can be discouraging for many users. This is where H2O comes in: it automates various repetitive steps, and this encapsulation helps developers focus on results rather than handling complexities. You’ll begin by understanding how H2O’s AutoML simplifies the implementation of ML by providing a simple, easy-to-use interface to train and use ML models. Next, you’ll see how AutoML automates the entire process of training multiple models, optimizing their hyperparameters, and explaining their performance. As you advance, you’ll find out how to leverage a Plain Old Java Object (POJO) and Model Object, Optimized (MOJO) to deploy your models to production. Throughout this book, you’ll take a hands-on approach to implementation using H2O that will enable you to set up your ML systems in no time. By the end of this H2O book, you’ll be able to train and use your ML models with H2O AutoML, from experimentation all the way to production, without needing to understand complex statistics or data science.
Table of Contents (19 chapters)
Part 1: H2O AutoML Basics
Part 2: H2O AutoML Deep Dive
Part 3: H2O AutoML Advanced Implementation and Productization

Understanding AutoML and H2O AutoML

Before we begin our journey with H2O AutoML, it is important to understand what exactly AutoML is and what part it plays in the entire ML pipeline. In this section, we will try to understand the various steps involved in the ML pipeline and where AutoML fits into it. Then, we will explore what makes H2O’s AutoML so unique among the various AutoML technologies.

Let’s start by learning a bit about AutoML in general.

AutoML

AutoML is the process of automating the various steps that are performed while developing a viable ML system for predictions. A typical ML pipeline consists of the following steps (a condensed code sketch of steps 3 to 8 follows the list):

  1. Data Collection: This is the very first step in an ML pipeline. Data is collected from various sources. The sources can generate different types of data, such as categorical, numeric, textual, time series, or even visual and auditory data. All these types of data are aggregated based on the requirements and merged into a common structure. This could be a comma-separated values (CSV) file, a Parquet file, or even a table in a database.
  2. Data Exploration: Once data has been collected, it is explored using basic analytical techniques to identify what it contains, how complete and correct it is, and whether it shows potential patterns from which a model can learn.
  3. Data Preparation: Missing values, duplicates, and noisy data can all affect the quality of the model, as they introduce incorrect learning. Hence, the raw data that has been collected and explored needs to be pre-processed to remove these anomalies using appropriate data processing methods.
  4. Data Transformation: Different ML models accept different types of data: some can work with categorical data, while others can only work with numeric data. You may therefore need to convert certain features from one form to another so that the dataset can be fed to the model properly during training.
  5. Model Selection: Once the dataset is ready, an ML model is selected to be trained. The model is chosen based on the type of data the dataset contains, the information that is to be extracted from it, and how well the algorithm suits the problem.
  6. Model Training: This is where the model is trained. The ML system will learn from the processed dataset and create a model. This training can be influenced by several factors, such as data attribute weighting, learning rate, and other hyperparameters.
  7. Hyperparameter Tuning: Apart from model training, another factor that needs to be considered is the model’s architecture. The model’s architecture depends on the type of algorithm used, such as the number of trees in a random forest or neurons in a neural network. We don’t immediately know which architecture is optimal for a given model, so experimentation is needed. The parameters that define the architecture of a model are called hyperparameters; finding the best combination of hyperparameter values is known as hyperparameter tuning.
  8. Prediction: The final step of the ML pipeline is prediction. Based on the patterns in the dataset that were learned by the model during training, the model can now make a generalized prediction on unseen data.
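
To make the manual effort concrete, here is a minimal sketch of steps 3 to 8 performed by hand with scikit-learn. The dataset file, column names, and parameter grid are illustrative assumptions, not taken from this book:

# A hand-rolled version of pipeline steps 3-8 using scikit-learn.
# The CSV file, column names, and parameter grid are illustrative.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Step 3: remove duplicates from the collected data.
df = pd.read_csv("customers.csv").drop_duplicates()

X, y = df.drop(columns=["churned"]), df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Steps 3-4: impute missing values, encode categoricals as numbers.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "monthly_spend"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan", "region"]),
])

# Steps 5-6: select and train a model.
pipeline = Pipeline([("prep", preprocess),
                     ("model", RandomForestClassifier(random_state=42))])

# Step 7: hyperparameter tuning over the number of trees and depth.
search = GridSearchCV(pipeline,
                      param_grid={"model__n_estimators": [100, 300],
                                  "model__max_depth": [5, 10, None]},
                      cv=5)
search.fit(X_train, y_train)

# Step 8: prediction on unseen data.
print(search.best_params_)
print(search.predict(X_test)[:5])

Every choice in this sketch (the imputation strategy, the encoder, the algorithm, the grid) is a decision a practitioner would otherwise have to research and justify; AutoML makes these choices automatically.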

For non-experts, all these steps and their complexities can be overwhelming. Every step in the ML pipeline has been developed over years of research, and each is a vast topic in itself. AutoML automates the majority of these steps, from data exploration to hyperparameter tuning, and delivers the best possible models for making predictions. This helps companies focus on solving real-world problems and getting results rather than on ML processes and workflows.

Now that you understand the different steps in an ML pipeline and how the steps are automated by AutoML, let’s see why H2O’s AutoML technology is one of the leading technologies in the industry.

H2O AutoML

H2O AutoML is an AutoML technology developed by H2O.ai that simplifies how ML systems are built by providing user-friendly interfaces that help non-experts experiment with ML. It is an in-memory, distributed, fast, and scalable ML and analytics platform that works on big data and can be used for enterprise needs.

It is written in Java and uses key-value storage to access data, models, and the other ML objects that are involved. It runs on a cluster system and uses the multi-threaded MapReduce framework to parallelize data operations. It is also easy to communicate with, as it exposes a simple REST API. Finally, it has a web interface that provides a detailed graphical view of data and models.
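
As a quick illustration of that REST interface, a running H2O instance can be queried directly over HTTP. This is a minimal sketch assuming the default localhost:54321 address and H2O-3’s version 3 endpoint paths:

# Query a running H2O instance over its REST API.
# Assumes H2O is listening on the default localhost:54321;
# the /3/... paths are H2O-3's version 3 REST endpoints.
import requests

base = "http://localhost:54321"

cloud = requests.get(f"{base}/3/Cloud").json()    # cluster status
print(cloud["version"], cloud["cloud_size"])

frames = requests.get(f"{base}/3/Frames").json()  # frames in the key-value store
print([f["frame_id"]["name"] for f in frames["frames"]])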

Not only does H2O AutoML automate the majority of the sophisticated steps involved in the ML life cycle, but it also provides a lot of flexibility for even expert data scientists to implement specialized model training processes. H2O AutoML provides a simple wrapper function that encapsulates several model training tasks that would otherwise be complicated to orchestrate. It also has extensive explainability functions that can describe the various details of the model training life cycle. These details are easy to export, allowing users to explain and justify the performance of the trained models.
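
For instance, with the h2o Python package, the whole train-many-models-and-tune loop collapses into a single wrapper call, H2OAutoML. The data path and target column below are placeholders:

# Train a suite of models with a single wrapper call.
# The data path and target column name are placeholders.
import h2o
from h2o.automl import H2OAutoML

h2o.init()  # starts (or connects to) a local H2O cluster

data = h2o.import_file("customers.csv")
train, test = data.split_frame(ratios=[0.8], seed=1)

# One call trains, cross-validates, and tunes many models.
aml = H2OAutoML(max_models=10, seed=1)
aml.train(y="churned", training_frame=train)

print(aml.leaderboard)            # models ranked by performance
preds = aml.leader.predict(test)  # best model makes predictions

The leaderboard ranks every trained model by a default performance metric, and aml.leader is simply the best of them, ready for prediction or export.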

The best part about H2O AutoML is that it is entirely open source. You can find H2O’s source code at https://github.com/h2oai. It is actively maintained by a community of developers working at both open source and commercial companies. At the time of writing, it is on its third major version, which indicates that it is a mature, feature-rich technology that supports several major companies around the world. It also supports several programming languages, including R, Scala, Python, and Java, runs on several operating systems, and provides support for a wide variety of data sources involved in the ML life cycle, such as the Hadoop Distributed File System (HDFS), Hive, Amazon S3, and even Java Database Connectivity (JDBC).
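
To illustrate that breadth of data sources, here is a minimal sketch of loading data from different locations; all paths, hostnames, and credentials are placeholders. h2o.import_file accepts local, HDFS, and S3 paths, while h2o.import_sql_table reads a table over JDBC (the JDBC driver must be on H2O’s classpath):

# Loading data into H2O from different sources.
# All paths, URLs, and credentials below are placeholders.
import h2o

h2o.init()

local_frame = h2o.import_file("data/train.csv")
hdfs_frame = h2o.import_file("hdfs://namenode:8020/data/train.csv")
s3_frame = h2o.import_file("s3://my-bucket/data/train.csv")

# JDBC import: requires the JDBC driver on H2O's classpath.
jdbc_frame = h2o.import_sql_table(
    connection_url="jdbc:postgresql://dbhost:5432/mydb",
    table="customers",
    username="user",
    password="secret",
)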

Now that you understand the basics of AutoML and how powerful H2O AutoML is, let’s see what the minimum requirements are for a system to run H2O AutoML without any performance issues.