Book Image

Production-Ready Applied Deep Learning

By : Tomasz Palczewski, Jaejun (Brandon) Lee, Lenin Mookiah
Book Image

Production-Ready Applied Deep Learning

By: Tomasz Palczewski, Jaejun (Brandon) Lee, Lenin Mookiah

Overview of this book

Machine learning engineers, deep learning specialists, and data engineers encounter various problems when moving deep learning models to a production environment. The main objective of this book is to close the gap between theory and applications by providing a thorough explanation of how to transform various models for deployment and efficiently distribute them with a full understanding of the alternatives. First, you will learn how to construct complex deep learning models in PyTorch and TensorFlow. Next, you will acquire the knowledge you need to transform your models from one framework to the other and learn how to tailor them for specific requirements that deployment environments introduce. The book also provides concrete implementations and associated methodologies that will help you apply the knowledge you gain right away. You will get hands-on experience with commonly used deep learning frameworks and popular cloud services designed for data analytics at scale. Additionally, you will get to grips with the authors’ collective knowledge of deploying hundreds of AI-based services at a large scale. By the end of this book, you will have understood how to convert a model developed for proof of concept into a production-ready application optimized for a particular production setting.
Table of Contents (19 chapters)
1
Part 1 – Building a Minimum Viable Product
6
Part 2 – Building a Fully Featured Product
10
Part 3 – Deployment and Maintenance

Planning a DL project

Every project starts with planning. Throughout the planning, the purpose of the project needs to be clearly defined, and key members should have a thorough understanding of the available resources that can be allocated to the project. Once team members and stakeholders are identified, the next step is to discuss each individual’s responsibility and create a timeline for the project.

This phase should result in a well-documented project playbook that precisely defines business objectives and how the project will be evaluated. A typical playbook contains an overview of key deliverables, a list of stakeholders, a Gantt chart defining steps and bottlenecks, definitions of responsibilities, timelines, and evaluation criteria. In the case of highly complex projects, following the Project Management Body of Knowledge (PMBOK®) Guide (https://www.pmi.org/pmbok-guide-standards/foundational/pmbok) and considering every knowledge domain (for example, integration management, project scope management, and time management) are strongly recommended. Of course, other project management frameworks exist, such as PRINCE2 (https://www.prince2.com/usa/what-is-prince2), which can provide a good starting point. Once the playbook is constructed, every stakeholder must review and revise it until everyone agrees with the contents.

In real life, many people underestimate the importance of planning. Especially in start-ups, engineers are eager to dive into MVP development and spend minimal time planning. However, it is especially dangerous to do so in the case of DL projects because the training process can quickly drain all the available resources.

Defining goal and evaluation metrics

The very first step of planning is to understand what purpose the project serves. The goal might be developing a new product, improving the performance of an existing service, or saving on operational costs. The motivation of the project naturally helps define the evaluation metrics.

In the case of DL projects, there are two types of evaluation metrics: business-related metrics and model-based metrics. Some examples of business-related metrics are as follows: conversion rate, click-through rate (CTR), lifetime value, user engagement measure, savings in operational cost, return on investment (ROI), and revenue. These are commonly used in advertising, marketing, and product recommendation verticals.

On the other hand, model-based metrics include accuracy, precision, recall, F1-score, rank accuracy metrics, mean absolute error (MAE), mean squared error (MSE), root-mean-square error (RMSE), and normalized mean absolute error (NMAE). In general, tradeoffs can be made between the various metrics. For example, a slight decrease in accuracy may be acceptable if meeting latency requirements is more critical to the project.

Along with the target evaluation metric, which differs from project to project, there are other metrics that are commonly found in most projects. These are due dates and resource usage. The target state must be reached by a certain date using available resources.

The goal and corresponding evaluation metrics need to be fair. If the goal is too hard to achieve, project members can possibly lose motivation. If the metric for the evaluation is not correct, understanding the impact of the project becomes difficult. As a result, it is recommended that the selected evaluation metrics are shared with others and considered fair for everyone.

Figure 1.4 – A sample playbook with the project description section filled out

Figure 1.4 – A sample playbook with the project description section filled out

As shown in Figure 1.4, the first section of a playbook begins with a general description, an estimated complexity of the technical aspects, and a list of required tools and frameworks. Next, it clearly describes the objective of the project and how the project will be evaluated.

Stakeholder identification

In the same way that the term stakeholder is used for a business, a stakeholder for a project refers to a person or group who can affect or be affected by the project. Stakeholders can be grouped into two types, internal and external. Internal stakeholders are those that are directly involved in project executions, while external stakeholders may be outside of the circle, supporting the project execution in an indirect way.

Each stakeholder has a different role within the project. First, we’ll look at internal stakeholders. Internal stakeholders are the main drivers of the project. Therefore, they work closely together to process and analyze data, develop a model, and build deliverables. Table 1.1 lists internal stakeholders that are commonly found in DL projects:

Table 1.1 – Common internal stakeholders for DL projects

Table 1.1 – Common internal stakeholders for DL projects

On the other hand, external stakeholders often play supportive roles, such as collecting necessary data for the project or providing feedback about the deliverable. In Table 1.2, we describe some common external stakeholders for DL projects:

Table 1.2 – Common external stakeholders for DL projects

Table 1.2 – Common external stakeholders for DL projects

Stakeholders are described in the second section of a playbook. As shown in Figure 1.4, the playbook must list stakeholders and their responsibilities in the project.

Task organization

A milestone refers to a point in a project where a significant event occurs. Therefore, there is a set of requirements leading up to a milestone. Once the requirements are met, a milestone can be claimed to have been reached. One of the most important steps in project planning is defining milestones and their associated tasks. The ordering of tasks that lead to the goal is called the critical path. It is worth mentioning that tasks don’t need to be tackled sequentially all the time. The understanding of a critical path is important because it allows the team to prioritize tasks appropriately to ensure the success of the project.

In this step, it is also critical to identify level-of-effort (LOE) activities and supportive activities, which are required for project execution. In the case of software development projects, LOE activities include supplementary tasks, such as setting up Git repositories or reviewing others’ merge requests. The following figure (Figure 1.5) describes a typical critical path for a DL project. It will be structured differently if the underlying project consists of different tasks, requirements, technologies, and desired levels of detail:

Figure 1.5 – A typical critical path for a DL project

Figure 1.5 – A typical critical path for a DL project

Resource allocation

For a DL project, there are two main resources that require explicit resource allocations: human and computational resources. Human resources refer to employees that will actively work on individual tasks. In general, they hold positions in data engineering, data science, DevOps, or software engineering. When people talk about human resources, they often consider headcount only. However, the knowledge and skills that individuals hold are other critical factors. Human resources are closely related to how fast the project can be carried out.

Computational resources refer to hardware and software resources that are allocated to the project. Unlike typical software engineering projects, such as mobile app development or web page development, DL projects require heavy computation and large amounts of data. Common techniques for speeding up the development process involve parallelism or using computationally stronger machines. In some cases, tradeoffs need to be made between them, as a single machine of high computational power can cost more than multiple machines of low computational power.

Overall, novel DL pipelines take advantage of flexible and stateless resources, such as AWS Spot instances with fault-tolerant code. Besides hardware resources, there are frameworks and services that may provide necessary features out of the box. If the necessary service requires a payment, it is important to understand how it can change the project execution and what the demand on human resources would be if the team decided to handle it in-house.

In this step, available resources need to be allocated to each task. Figure 1.6 lists the tasks described in the previous section and describes the allocated resources, along with estimates of operational costs. Each task has its own risk level indicator. It is designed for a small team of three people working on a simple DL project with limited computational resources on a couple of AWS Elastic Compute Cloud (EC2) instances for around 4 to 6 months. Please note that the cost estimation of human resources is not included in the example, as it differs a lot depending on geographic location and team seniority:

Figure 1.6 – A sample resource allocation section of a DL project

Figure 1.6 – A sample resource allocation section of a DL project

Before we move on to the next step, we would like to mention that it is important to set aside a portion of the resources as a backup, in case the milestone requires more resources than that have been allocated.

Defining a timeline

Now that we know the available resources, we should be able to construct a timeline for the project. In this step, the team needs to discuss how long each step would take to complete. It is worth mentioning that things don’t work out as planned all the time. There will be many difficulties throughout the project that can delay the delivery of the final product.

Therefore, including buffers within the timeline is a common practice in many organizations. It is important that every stakeholder agrees with the timeline. If anyone believes that it’s not reasonable, the adjustment needs to be made right away. Figure 1.7 is a sample Gantt chart with the most likely estimated timeline for the information presented in Figure 1.6:

Figure 1.7 – A sample Gantt chart describing the timeline

Figure 1.7 – A sample Gantt chart describing the timeline

It is worth mentioning that the chart can also be used to monitor the progress of each task and the overall project. In such cases, additional comments or visualizations summarizing the progress can be attached to each indicator bar.

Managing a project

Another important aspect of a DL project that needs to be discussed during the project planning phase is the process that the team will follow to update other team members and ensure on-time delivery of the project. Out of various project management methodologies, Agile fits perfectly, as it helps to split work into smaller parts and quickly iterate over development cycles until the FFP emerges. As DL projects are generally considered highly complex, it is more convenient to work within short cycles of research, development, and optimization phases. At the end of each cycle, stakeholders review results and adjust their long-term goals. Agile methodology is particularly suitable for small teams of experienced individuals. In a typical setting, 2-week sprints are found to be the most effective, especially when the short-term goals are clearly defined.

During a sprint meeting, the team reviews goals from the last sprint and defines goals for the upcoming sprint. It is also recommended to have short daily meetings to go over work performed on the previous day and plan for the upcoming day, as this process can help the team to quickly recognize bottlenecks and adjust their priorities as necessary. Commonly used tools for this process are Jira, Asana, and Quickbase. The majority of the aforementioned tools also support budget management, timeline reviewing, idea management, and resource tracking.

Things to remember

a. Project planning should result in a playbook that clearly describes what purpose the project serves and how the team will move together to reach a particular goal state.

b. The first step of project planning is to define a goal and its corresponding evaluation metrics. In the case of DL projects, there are two types of evaluation metrics: business-related metrics and model-based metrics.

c. A stakeholder refers to a person or a group who can affect or be affected by the project. Stakeholders can be grouped into two types: internal and external.

d. The next stage of project planning is task organization. The team needs to identify milestones, identify tasks, along with LOE activities, and understand the critical path.

e. For DL projects, there are two main resources that require explicit resource allocation: human and computational resources. During resource allocation, it is important to put aside a portion of the resources as a backup.

f. When estimating the timeline for the project, it must be shared within the team, and every stakeholder must agree with the schedule.

g. Agile methodology is a perfect fit for managing DL projects, as it helps to split work into smaller parts and quickly iterate over development cycles.