
Demystifying MLOps

This section defines Machine Learning Operations (MLOps) and describes why it is crucial to establish a reliable MLOps process within your data science department.

In many organizations, data science departments were created only in the last few years, and the data scientist profession itself is fairly new. Therefore, many of these departments have to find a way to integrate into existing corporate processes and devise ways to ensure the reliability and scalability of data science deliverables.

In many cases, the burden of building a suitable infrastructure falls on the shoulders of the data scientists themselves, who are often not familiar with the latest infrastructure trends. Another problem is making it all work across different languages, platforms, and environments. In the end, data scientists spend more time building the infrastructure than working on the model itself. This is where a new discipline has emerged to help bridge the gap between data science and infrastructure.

MLOps is a set of practices that defines the stages of the machine learning development lifecycle and ensures the reliability of the data science process. Although the term was coined fairly recently, most data scientists agree that a successful MLOps process should adhere to the following principles:

  • Collaboration: This principle implies that everything that goes into developing an ML model must be shared among data scientists to preserve knowledge.
  • Reproducibility: This principle implies that not only the code but also the datasets, metadata, and parameters should be versioned and reproducible for all production models (a minimal versioning sketch follows this list).
  • Continuity: This principle implies that the lifecycle of a model is a continuous process: the lifecycle stages are repeated, and the model is improved with each iteration.
  • Testability: This principle implies that the organization implements ML testing and monitoring practices to ensure the model's quality.
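
To make the reproducibility principle more concrete, here is a minimal, purely illustrative Python sketch of what versioning an experiment might look like: it records a hash of the dataset together with the training parameters so that a past result can later be traced back to the exact data and settings that produced it. The file names and parameters are hypothetical; later chapters show how Pachyderm versions data for you.

import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def fingerprint_experiment(data_path: str, params: dict,
                           out_path: str = "experiment_metadata.json") -> dict:
    """Record a dataset hash and training parameters so a run can be reproduced later."""
    digest = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()
    metadata = {
        "dataset": data_path,
        "dataset_sha256": digest,
        "parameters": params,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    Path(out_path).write_text(json.dumps(metadata, indent=2))
    return metadata

# Hypothetical usage: hash the training data and store it alongside the run's parameters.
# fingerprint_experiment("training_data.csv", {"learning_rate": 0.01, "epochs": 20})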

Before we dive into the MLOps process stages, let's take a look at more established software development practices. DevOps is a software development practice that is used in many enterprise-level software projects. A typical DevOps lifecycle includes the following stages that continuously repeat, ensuring product improvement:

  • Planning: In this stage, the overall vision for the software is developed, and a more detailed design is devised.
  • Development: In this stage, the code is written, and the planned functionality is implemented. The code is shared through version control systems, such as Git, which ensures collaboration between software developers.
  • Testing: In this stage, the developed code is tested for defects through an automated or manual process.
  • Deployment: In this stage, the code is released to production servers, and the users have a chance to test it and provide feedback.
  • Monitoring: In this stage, the DevOps engineers focus on software performance and causes of outages, identifying possible areas of improvement.
  • Operations: This stage ensures the automated release of software updates.

The following diagram illustrates the DevOps lifecycle:

Figure 1.5 – DevOps Lifecycle

All these phases are continuously repeated, enabling communication between departments and a customer feedback loop. This practice has brought enterprises such benefits as a faster development cycle, better products, and continuous innovation. Better teamwork enabled by the close relationships between departments is one of the key factors that make this process efficient.

Data scientists deserve a process that brings the same level of reliability. One of the biggest problems of enterprise data science is that very few machine learning models make it to production. Many companies are just starting to adopt data science, and the new departments face unprecedented challenges. Often, the teams lack an understanding of the workflows that need to be implemented in order to make enterprise-level data science work.

Another important challenge is that, unlike in traditional software development, data scientists operate not only with code but also with data and parameters. Data is taken from the real world, while code is carefully developed in the office; the two only cross paths when they are combined in a data model.

The challenges that all data science departments face include the following:

  • Inconsistent or totally absent data science processes
  • No way to track data changes and reproduce past results
  • Slow performance

In many enterprises, data science departments are still small and struggle to create a reliable workflow. Building such a process requires a particular mix of expertise: an understanding of traditional software practices, such as DevOps, combined with an understanding of data science challenges. That is where MLOps emerged, combining data science with the best practices of software development.

If we try to apply similar DevOps practices to data science, here is what we might see:

  • Design: In this phase, data scientists work on acquiring the data and designing a data pipeline, also known as an Extract, Transform, Load (ETL) pipeline. A data pipeline is a sequence of transformation steps that data goes through, ending with an output result (see the sketch after this list).
  • Development: In this stage, data scientists work on writing the algorithmic code for the previously developed data pipeline.
  • Training: In this stage, the model is trained with the selected or autogenerated data. During this stage, such techniques as hyperparameter tuning can be used.
  • Validation: In this stage, the trained model is validated to work with the rest of the data pipeline.
  • Deployment: In this stage, the trained and validated model is deployed into production.
  • Monitoring: In this stage, the model is constantly monitored for performance and possible flaws, and feedback is delivered directly to the data scientist for further improvement.
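
As a rough illustration of the design stage, the following Python sketch treats a data pipeline as a plain sequence of transformation steps: extract, transform, load. The function names and the CSV file paths are hypothetical; in Pachyderm, as later chapters show, each such step typically becomes a pipeline that reads from and writes to versioned data repositories.

import csv
from pathlib import Path

def extract(path: str) -> list[dict]:
    """Read raw rows from a CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows: list[dict]) -> list[dict]:
    """Clean the raw rows; here we simply drop rows with missing values."""
    return [row for row in rows if all(value.strip() for value in row.values())]

def load(rows: list[dict], out_path: str) -> None:
    """Write the cleaned rows to the output location."""
    if not rows:
        Path(out_path).write_text("")
        return
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

# Hypothetical usage: each step's output becomes the next step's input.
# load(transform(extract("raw_sales.csv")), "clean_sales.csv")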

Similar to DevOps, the stages of MLOps are constantly repeated. The following diagram shows the stages of MLOps:

Figure 1.6 – MLOps Lifecycle

As you can see, the two practices are very similar, and the latter borrows the main concepts from the former. Using MLOps in practice has brought the following advantages to enterprise-level data science:

  • Faster go-to-market delivery: A data science model only has value when it is successfully deployed in production. With so many companies struggling to implement a proper process in their data science departments, an MLOps solution can genuinely make a difference.
  • Cross-team collaboration and communication: Software-development practices applied to data science create a common ground for developers, data scientists, and IT operations to work together and speak the same language.
  • Reproducibility and knowledge transfer: Keeping the code, the datasets, and the history of changes plays a big role in the improvement of overall model quality and enables data scientists to learn from each other's examples, contributing to innovation and feature development.
  • Automation: Automating a data pipeline helps to keep the process consistent across multiple releases and speeds up the promotion of a Proof of Concept (POC) model to a production-grade pipeline.

In this section, we've learned about the important stages of the MLOps process. In the next section, we will learn more about the types of data science platforms that can help you implement MLOps in your organization.