Practical Deep Learning at Scale with MLflow

By Yong Liu

Overview of this book

The book starts with an overview of the deep learning (DL) life cycle and the emerging Machine Learning Ops (MLOps) field, providing a clear picture of the four pillars of deep learning (data, model, code, and explainability) and the role of MLflow in these areas. From there onward, it guides you step by step through the concept of MLflow experiments and usage patterns, using MLflow as a unified framework to track DL data, code and pipelines, models, parameters, and metrics at scale. You’ll also tackle running DL pipelines in a distributed execution environment with reproducibility and provenance tracking, and tuning DL models through hyperparameter optimization (HPO) with Ray Tune, Optuna, and HyperBand. As you progress, you’ll learn how to build a multi-step DL inference pipeline with preprocessing and postprocessing steps, deploy a DL inference pipeline to production using Ray Serve and AWS SageMaker, and finally create a DL explanation as a service (EaaS) using the popular Shapley Additive Explanations (SHAP) toolbox. By the end of this book, you’ll have built the foundation and gained the hands-on experience you need to develop a DL pipeline solution from initial offline experimentation to final deployment and production, all within a reproducible and open source framework.
Table of Contents (17 chapters)

Section 1 – Deep Learning Challenges and MLflow Prime
Section 2 – Tracking a Deep Learning Pipeline at Scale
Section 3 – Running Deep Learning Pipelines at Scale
Section 4 – Deploying a Deep Learning Pipeline at Scale
Section 5 – Deep Learning Model Explainability at Scale

Understanding DL code challenges

In this section, we will discuss DL code challenges and look at how they manifest in each of the stages described in Figure 1.3. Within the context of DL development, code refers to the source code written in certain programming languages, such as Python, for data processing and model implementation, while a model refers to the model logic and architecture in its serialized format (for example, pickle, TorchScript, or ONNX):

  • Data collection/cleaning/annotation: While data is the central piece of this stage, the code that performs the querying, extract-transform-load (ETL) operations, and data cleaning and augmentation is of critical importance. We cannot decouple the development of the model from the data pipelines that feed it. Therefore, the data pipelines that implement ETL need to be treated as integrated steps in both offline experimentation and online production. A common mistake is to use one data ETL and cleaning pipeline during offline experimentation and then implement a different one for online production, which can cause inconsistent model behavior. We need to version and serialize the data pipeline as part of the entire model pipeline. MLflow provides several ways to implement such multistep pipelines (see the first sketch after this list).
  • Model development: During offline experiments, in addition to different versions of data pipeline code, we might also have different versions of notebooks or use different versions of DL library code. The usage of notebooks is particularly unique to ML/DL life cycles. Tracking which model results are produced by which notebook, model pipeline, and data pipeline needs to be done for each run. MLflow does this with automatic code version and dependency tracking. In addition, code reproducibility across different running environments is unique to DL models, as they usually require hardware accelerators such as GPUs or TPUs. The flexibility to run either locally or remotely, in a CPU or GPU environment, is of great importance. MLflow provides a lightweight way to organize ML projects so that code can be written once and run everywhere (see the second sketch after this list).
  • Model deployment and serving in production: While the model is serving production traffic, any bug needs to be traced back to both the model and the code. Thus, tracking code provenance is critical, as is tracking the dependent library versions for a particular version of the model (see the third sketch after this list).
  • Model validation and A/B testing: Online experiments could use multiple model versions with different data feeds. Debugging any experiment requires knowing not only which model is being used but also which code was used to produce that model.
  • Monitoring and feedback loops: This stage is similar to the previous one in terms of code challenges, in that we need to know whether model degradation is due to code bugs or to model and data drift. The monitoring pipeline needs to collect all the metrics for both data and model performance (see the final sketch after this list).
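
As a minimal sketch of the multistep pipeline idea, the following Python driver chains an ETL step and a training step with the MLflow Projects API. The entry point names (etl, train) and parameter names are illustrative assumptions, not from the book; they would need to be defined in the project's MLproject file:

    import mlflow

    # Illustrative two-step pipeline driver. The "etl" and "train" entry
    # points are assumed to be defined in the project's MLproject file.
    etl_run = mlflow.run(".", entry_point="etl",
                         parameters={"input_path": "data/raw"},
                         synchronous=True)

    # Pass the ETL run ID downstream so the training step can record exactly
    # which data pipeline run produced its input.
    mlflow.run(".", entry_point="train",
               parameters={"etl_run_id": etl_run.run_id, "epochs": "5"},
               synchronous=True)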
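
The "write once, run everywhere" point can be sketched with the same API: the same project entry point is launched locally for a quick CPU check and then on a remote GPU backend without changing the training code. The GitHub URI and the Databricks cluster spec file below are placeholders, not references from the book:

    import mlflow

    # Run the project locally for a quick CPU-based sanity check.
    mlflow.run(".", entry_point="train", parameters={"epochs": "1"},
               backend="local")

    # Run the same entry point remotely on a GPU-backed Databricks cluster;
    # the cluster spec JSON file is a placeholder for your own configuration.
    mlflow.run("https://github.com/your-org/dl-pipeline",
               entry_point="train", parameters={"epochs": "20"},
               backend="databricks",
               backend_config="gpu_cluster_spec.json")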
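
For the code provenance and dependency tracking discussed in the deployment and A/B testing points, the following sketch looks up a registered model version and traces it back to the run, source code commit, and pinned library requirements that produced it. The model name and version number are placeholders:

    import mlflow.pyfunc
    from mlflow.tracking import MlflowClient

    # Placeholder registered model name and version.
    client = MlflowClient()
    mv = client.get_model_version(name="dl_inference_pipeline", version="3")

    # The producing run carries source and git commit tags when the code was
    # launched via `mlflow run` from a git repository.
    run = client.get_run(mv.run_id)
    print(run.data.tags.get("mlflow.source.name"))
    print(run.data.tags.get("mlflow.source.git.commit"))

    # Path to the requirements file captured when the model was logged.
    print(mlflow.pyfunc.get_model_dependencies(
        "models:/dl_inference_pipeline/3"))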
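
Finally, for the monitoring point, a scheduled job can log both data quality and model performance metrics to a dedicated experiment, together with tags that tie each check back to the model version and code commit that served the traffic. The experiment name, metric names, and values here are all illustrative:

    import mlflow

    # Illustrative monitoring job; experiment name, metrics, and tag values
    # are placeholders.
    mlflow.set_experiment("production_monitoring")
    with mlflow.start_run(run_name="hourly_check"):
        mlflow.log_metric("null_rate", 0.012)          # data pipeline health
        mlflow.log_metric("feature_drift_psi", 0.08)   # input drift signal
        mlflow.log_metric("online_accuracy", 0.91)     # model performance
        mlflow.set_tag("served_model_version", "3")    # model serving traffic
        mlflow.set_tag("code_commit", "placeholder-git-sha")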

In summary, DL code challenges are unique because DL frameworks are still rapidly evolving (for example, TensorFlow, PyTorch, Keras, Hugging Face, and Spark NLP). MLflow provides a lightweight framework that addresses many of these common challenges and interfaces with many DL frameworks seamlessly.