Reproducible Data Science with Pachyderm

By: Svetlana Karslioglu

Overview of this book

Pachyderm is an open source project that enables data scientists to run reproducible data pipelines and scale them to an enterprise level. This book will teach you how to implement Pachyderm to create collaborative data science workflows and reproduce your ML experiments at scale. You'll begin your journey by exploring the importance of data reproducibility and comparing different data science platforms. Next, you'll explore how Pachyderm fits into the picture and why it matters, followed by learning how to install Pachyderm locally on your computer or on a cloud platform of your choice. You'll then discover Pachyderm's architectural components and its main pipeline principles and concepts. The book demonstrates how to use Pachyderm components to create your first data pipeline and advances to cover common data operations, such as uploading data to and downloading it from Pachyderm, to create more complex pipelines. Based on what you've learned, you'll develop an end-to-end ML workflow, before trying out hyperparameter tuning techniques and the different supported Pachyderm language clients. Finally, you'll learn how to use a SaaS version of Pachyderm with Pachyderm Notebooks. By the end of this book, you will have learned all aspects of running your data pipelines in Pachyderm and how to manage them on a day-to-day basis.
Table of Contents (16 chapters)

Section 1: Introduction to Pachyderm and Reproducible Data Science
Section 2: Getting Started with Pachyderm
Section 3: Pachyderm Clients and Tools

Types of data science platforms

This section walks you through the data science platforms available in the open source world and on the market today, and will help you understand the differences between them.

As the fields of AI and machine learning evolve, more and more engineers are working on new ways of solving data science problems and building infrastructure for better, faster AI adoption. Some platforms provide end-to-end capabilities for data from a data warehouse all the way to production, while others offer partial functionality and work in combination with other tools. Generally, there is no solution that fits all use cases, and certainly not every budget.

However, all of these solutions completely or partially facilitate the following stages of a data science lifecycle:

  • Data Engineering
  • Data Acquisition and Transformation
  • Model Training
  • Model Deployment
  • Monitoring and Improvement

The following diagram shows the types of data science tools:

Figure 1.7 – Types of data science tools

Let's take a look at the existing data science platforms that can help you to build your data science workflow at scale.

End-to-end platforms

An end-to-end data science solution should be able to provide the tooling for all the stages of the ML lifecycle listed in the previous section. However, in some use cases, the definition of an end-to-end workflow is narrower and covers mostly the ML pipelines and projects, excluding the data engineering part. Since the definition is still in flux, end-to-end tools are likely to continue providing different functionality as the field evolves.

Such a platform, where it exists, should bring the following benefits:

  • A unified user interface that eliminates the need to stitch multiple interfaces together
  • Collaboration for all involved individuals, including data scientists, data engineers, and IT operations
  • The convenience of infrastructure support being offloaded to the solution provider, which offers the team additional time to focus on data models rather than on infrastructure problems

However, you might find the following disadvantages of an end-to-end platform to be inconsistent with your organization's goals:

  • Limited portability: Such a platform would likely be proprietary, and migration to a different platform would be difficult.
  • Price: An end-to-end platform will likely be subscription-based, which many data science departments might not be able to afford. If GPU-based workflows are involved, the price increases even more.
  • Bias: When you are using a proprietary solution that offers built-in pipelines, your models are bound to inherit bias from these automated tools. The problem is that bias might be difficult to recognize and address in automated ML solutions, which could potentially have negative consequences for your business.

Now that we are aware of the advantages and disadvantages of end-to-end data science platforms, let's consider the ones that are available on the market today. Because the AI field is developing rapidly, new platforms emerge every year. We'll look into the top five such platforms.

Big tech giants, such as Microsoft, Google, and Amazon, all offer automated ML features that many users might find useful. Google's AI Platform offers Kubeflow Pipelines to help manage ML workflows. Amazon offers tools that assist with hyperparameter tuning and labeling. Microsoft offers Azure Machine Learning services, which support GPU-based workflows and provide functionality similar to Amazon's services.

However, as stated previously, all these automated ML features are prone to bias and require the data science team to build additional tools that can verify model performance and reliability. For many organizations, automated ML is not the right answer. Another issue is vendor lock-in, as you will have to keep all your data in the underlying cloud storage.

The Databricks solution provides a more flexible approach, as it can be deployed on any cloud. Databricks is based on Apache Spark, one of the most popular tools for AI and ML workflows, and offers end-to-end ML pipeline management through a platform called MLflow. MLflow enables data scientists to track their pipeline progress from model development to deployment in production. Many users enjoy the built-in notebook interface. One disadvantage is the lack of data visualization tools, which might be added in the future.
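To give a feel for the kind of tracking MLflow provides, here is a minimal sketch of logging a single run with the MLflow Python client; the run name, parameters, and metric values are placeholders rather than an excerpt from a real project:

    import mlflow

    # Record one training run; parameter and metric names are illustrative.
    with mlflow.start_run(run_name="baseline-model"):
        mlflow.log_param("learning_rate", 0.01)
        mlflow.log_param("epochs", 5)
        # ... train and evaluate the model here ...
        mlflow.log_metric("validation_accuracy", 0.91)

Each run logged this way appears in the MLflow tracking UI, which is what makes it possible to compare experiments over time.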

Algorithmia is another proprietary solution that can be deployed on any cloud platform and that provides an end-to-end ML workflow with model training, deployment, versioning, and other built-in functionality. It supports batch processing and can be integrated with GitHub actions. While Algorithmia is a great tool, it has some of the traditional software developer tools built in, which some engineering teams might find redundant.

Pluggable solutions

While end-to-end platforms might sound like the right solution for your data science department, in reality, this is not always the case. Big companies often have requirements that end-to-end platforms cannot meet. These requirements might include the following:

  • Data security: Some companies might have privacy limitations on storing their data in the cloud. These limitations also apply to the use of automated ML features.
  • Pipeline outputs: Often, the final product of a pipeline is a library that is packaged and used in other projects within the organization.
  • Existing infrastructure constraints: Some existing infrastructure components might prevent the integration of an end-to-end platform, while other parts of the infrastructure might already exist and satisfy the team's needs.

Pluggable solutions give data infrastructure teams the flexibility to build their own solution, which also comes with the need to support it. Even so, most big companies end up doing just that.

Pluggable solutions can be divided into the following categories:

  • Data ingestion tools
  • Data transformation tools
  • Data serving tools
  • Data visualization and monitoring tools

Let's consider some of these tools, which can be combined together to build a data science solution.

Data ingestion tools

Data ingestion is the process of collecting data from all sources in your company, such as databases, social media, and other platforms, into a centralized location for further consumption by machine learning pipelines and other AI processes.

One of the most popular open source tools to ingest data is Apache NiFi, which can ingest data into Apache Kafka, an open source streaming platform. From there, data pipelines can consume the data for processing.
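As a rough sketch of how a downstream pipeline might pick up the ingested data, the following snippet reads JSON records from a Kafka topic using the kafka-python client; the topic name and broker address are placeholders:

    import json

    from kafka import KafkaConsumer  # pip install kafka-python

    # Connect to the Kafka broker that the ingestion tool publishes to.
    consumer = KafkaConsumer(
        "ingested-events",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
        auto_offset_reset="earliest",
    )

    for message in consumer:
        record = message.value
        # Hand each record off to the next pipeline step here.
        print(record)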

Among commercial cloud-hosted platforms, we can name Wavefront, which enables not only ingestion but data processing as well. Wavefront is notable for its ability to scale and support high query loads.

Data transformation tools

Data transformation is the process of running your code against the data you have. This includes training and testing your models on that data as part of a data pipeline. The tool should be able to consume the data from a centralized location. Tools such as TensorFlow and Keras provide extended functionality for this type of operation.
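As an illustration, a minimal Keras training step might look like the following sketch; the dataset is synthetic and the network architecture is purely illustrative:

    import numpy as np
    from tensorflow import keras

    # Placeholder dataset: 1,000 samples with 20 features and a binary label.
    features = np.random.rand(1000, 20)
    labels = np.random.randint(0, 2, size=(1000,))

    # A small feed-forward network; the architecture is illustrative only.
    model = keras.Sequential([
        keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])

    # validation_split holds out part of the data for evaluation.
    model.fit(features, labels, epochs=5, validation_split=0.2)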

Pachyderm is a data transformation and pipeline tool as well, although its main value is in version control for large datasets. Unlike other transformation tools, Pachyderm gives data scientists the freedom to define their own pipelines and supports any language and library.
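To give a feel for how Pachyderm pipelines are defined (covered in detail later in this book), here is a minimal sketch of a pipeline specification written as a Python dictionary and saved as JSON; the pipeline name, repository, container image, and script path are placeholders:

    import json

    # Minimal Pachyderm pipeline spec expressed as a Python dict.
    pipeline_spec = {
        "pipeline": {"name": "clean-data"},
        "transform": {
            "image": "python:3.9",
            "cmd": ["python3", "/code/clean.py"],
        },
        "input": {
            # Process every top-level file in the 'raw-data' repository.
            "pfs": {"repo": "raw-data", "glob": "/*"}
        },
    }

    with open("clean-data.json", "w") as spec_file:
        json.dump(pipeline_spec, spec_file, indent=2)

    # The spec can then be submitted with:
    #   pachctl create pipeline -f clean-data.json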

If you have taken any data science classes, chances are you have used MATLAB or Octave for model training. These tools provide a great playground to start exploring machine learning. However, when it comes to production-grade data science that requires continuous training, collaboration, version control, and model productization, these tools might not be the best choice. MATLAB and Octave are mainly used for numerical computing in academic settings. Another issue with platforms such as MATLAB is that they often use proprietary languages, while tools like Pachyderm support any language, including the most popular ones in the data science community.

Model serving tools

After you train your model and it gives satisfactory results, you need to think about moving that model into production, which is often most convenient to do in the form of a REST API or as a table that is ingested into a database. Depending on the language that is used in your model, serving a REST API can be as easy as using a web framework such as Flask.
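As a rough sketch, a Flask-based prediction endpoint could look like the following; the model file name, route, and payload format are assumptions made for illustration:

    import pickle

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # Load a previously trained model; the file name is a placeholder.
    with open("model.pkl", "rb") as model_file:
        model = pickle.load(model_file)

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expects a JSON payload such as {"features": [[0.1, 0.2, 0.3]]}.
        payload = request.get_json()
        prediction = model.predict(payload["features"])
        return jsonify({"prediction": prediction.tolist()})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5000)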

However, there are more advanced tools that give data scientists end-to-end control over the machine learning process. One such open source tool is Seldon. Seldon wraps your model in a production microservice exposed through REST API endpoints, where you can easily promote each version of your model from staging to production.

Another tool that provides similar functionality is KFServing. Both solutions use Kubernetes Custom Resource Definitions (CRDs) to define custom resources for model serving.

Often, in big companies, different teams are responsible for training models and serving models, and therefore, decisions can be made based on the team's familiarity and preference for one or the other solution.

Data monitoring tools

After the model is deployed in production, data scientists need to continue to receive feedback about model performance, possible bias, and other metrics. For example, if you have an e-commerce website with a recommendation system that suggests to users what to buy with the current order based on their past choices, you need to make sure that the system is still on track with the latest fashion trends. You might not know the trends, but the feedback loop should signal a decrease in model performance when it occurs.

Often, enterprises fail to employ a good monitoring solution for ML workflows, which can have a potentially devastating outcome for your business. Seldon Alibi is one of the tools that provide model inspection functionality, which enables data scientists to monitor models running in production and identify areas of improvement. Seldon Alibi provides outlier detection, which helps to discover anomalies; drift detection, which helps monitor changes in correlation between input and output data; and adversarial detection, which exposes malicious changes in the original data inputs.
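Alibi ships its own detector implementations; as a library-agnostic sketch of the underlying idea, the following snippet flags unusual live inputs with scikit-learn's IsolationForest and treats a rising outlier rate as a simple drift signal (the data here is synthetic):

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Reference data the model was trained on, and a batch of live inputs.
    training_inputs = np.random.normal(loc=0.0, scale=1.0, size=(1000, 5))
    live_inputs = np.random.normal(loc=0.5, scale=1.5, size=(100, 5))

    # Fit an isolation forest on the reference data and score live inputs.
    detector = IsolationForest(contamination=0.05, random_state=42)
    detector.fit(training_inputs)

    flags = detector.predict(live_inputs)  # -1 marks an outlier, 1 an inlier
    outlier_rate = float(np.mean(flags == -1))
    print(f"Share of live inputs flagged as outliers: {outlier_rate:.2%}")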

Fiddler is another popular tool that monitors a production model for integrity, bias, performance, and outlier anomalies.

Putting it all together

As you can see, there are multiple ways to create a production-grade data science solution, and one size likely will not fit all. Although end-to-end solutions provide the convenience of using one vendor, they also have multiple disadvantages and are likely to offer weaker domain functionality compared to pluggable tools. Pluggable tools, on the other hand, require certain expertise and culture to be present in your organization, allowing different teams, such as DevOps engineers and data scientists, to collaborate on an optimal solution and workflow.

The next section will walk us through the ethical problems that plague modern AI applications, how they might affect your business, and what you can do about them.