Reproducible Data Science with Pachyderm

By: Svetlana Karslioglu

Overview of this book

Pachyderm is an open source project that enables data scientists to run reproducible data pipelines and scale them to an enterprise level. This book will teach you how to implement Pachyderm to create collaborative data science workflows and reproduce your ML experiments at scale. You’ll begin your journey by exploring the importance of data reproducibility and comparing different data science platforms. Next, you’ll explore how Pachyderm fits into the picture and its significance, followed by learning how to install Pachyderm locally on your computer or a cloud platform of your choice. You’ll then discover Pachyderm's architectural components and its main pipeline principles and concepts. The book demonstrates how to use Pachyderm components to create your first data pipeline and advances to cover common data operations, such as uploading data to and downloading data from Pachyderm, to create more complex pipelines. Based on what you've learned, you'll develop an end-to-end ML workflow, before trying out the hyperparameter tuning technique and the different supported Pachyderm language clients. Finally, you’ll learn how to use a SaaS version of Pachyderm with Pachyderm Notebooks. By the end of this book, you will have learned all aspects of running your data pipelines in Pachyderm and managing them on a day-to-day basis.
Table of Contents (16 chapters)

Section 1: Introduction to Pachyderm and Reproducible Data Science
Section 2: Getting Started with Pachyderm
Section 3: Pachyderm Clients and Tools

The reproducibility crisis in science

The reproducibility crisis is a problem that has been around for more than a decade. Because data science is closely related to the broader scientific discipline, it is important to review the issues many scientists have raised in the past and correlate them with similar problems the data science space is facing today.

One of the most important issues is replication: the ability to reproduce the results of a scientific experiment has long been one of the founding principles of good research. In other words, if an experiment can be reproduced, it is valid; if not, it could be a one-time occurrence that does not represent a real phenomenon. Unfortunately, in recent years, more and more research papers in sociology, medicine, biology, and other areas of science cannot withstand retesting against a larger number of samples, even when these papers were published in well-known and trustworthy journals, such as Nature. This tendency could lead to public mistrust in science, and in AI as part of it.

As mentioned previously, because of the popularity and growth of the AI industry, the number of AI papers has increased severalfold. Unfortunately, the quality of these papers has not grown with their number.

Nature recently conducted a survey asking scientists whether they feel there is a reproducibility crisis in science. The majority agreed that the pressure to publish frequently definitely leads to false-positive results. Researchers need sponsorship, and sponsors need to see results before investing additional money in the research, which leads to many published papers of declining credibility. Ultimately, the fight for grants and bureaucracy are often named as the main causes of the lack of a reproducibility process in labs.

The research papers whose integrity was questioned share the following common attributes:

  • No code or data were publicly shared for other researchers to attempt to replicate the results.
  • The scientists who attempted to replicate the results by following the provided instructions failed, either completely or partially.

Even papers published by Nobel laureates can sometimes be questioned due to an inability to reproduce the results. For example, in 2014, the journal Science retracted a paper published by Nobel Prize winner and immunologist Bruce Beutler. The paper was about the response to pathogens by virus-like organisms in the human genome and had been cited over 50 times before it was retracted.

When COVID-19 became a major topic of 2020, multiple papers were published on it. According to Retraction Watch, an online blog that tracks retracted scientific papers, more than 86 of them had been retracted as of March 2021.

In 2019, more than 1,400 scientific papers were retracted by various publishers. This number is huge and has been growing steadily, compared to only 50 papers in the early 2000s, which has raised awareness of a so-called reproducibility crisis in science. While not every retraction happens for reproducibility reasons, it is often the cause.

Data fishing

Data fishing, or data dredging, is the practice of achieving a statistically significant experimental result by running a computation multiple times until the desired result appears, reporting only those results, and ignoring the inconvenient ones. Sometimes, scientists unintentionally dredge the data to achieve the result they think is most probable and confirm their hypothesis. A more sinister scenario can take place too: a scientist might intentionally hack the result of the experiment to reach a predefined conclusion.

An example of such a misuse of data analysis would be if you decided to prove that there is a correlation between banana consumption and increased IQ in children aged 10 and older. This is a completely made-up example, but say you wanted to establish this connection. You would need to gather information about the IQ and banana consumption of a large enough sample of children, let's say 5,000.

Then, you would run tests such as the following: do kids who eat bananas and exercise have a higher IQ than the ones who only exercise? Do kids who watch TV and eat bananas have a higher IQ compared to the ones who do not? After conducting enough of these tests, you would most likely find some kind of correlation. However, such a result would not be meaningful, and using the data dredging technique is considered extremely unethical by the scientific community. Similar problems are seen in data science specifically.
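To see why dredging almost always "works," consider the following minimal sketch. It is not from the book; it uses purely synthetic, randomly generated data (the sample size, the number of subgroup tests, and the random seed are arbitrary assumptions) to show that testing enough arbitrary subgroups of pure noise usually yields at least one nominally significant p-value.

```python
# A minimal sketch of data dredging on synthetic data: IQ scores and banana
# consumption are generated independently, so there is no real effect to find,
# yet running many subgroup tests still produces "significant" p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
n_children = 5000
n_subgroup_tests = 40  # bananas + exercise, bananas + TV, and so on

iq = rng.normal(loc=100, scale=15, size=n_children)
eats_bananas = rng.integers(0, 2, size=n_children).astype(bool)

false_positives = 0
for _ in range(n_subgroup_tests):
    # Each "test" compares banana-eaters against non-eaters within an
    # arbitrary random subgroup split.
    subgroup = rng.integers(0, 2, size=n_children).astype(bool)
    group_a = iq[eats_bananas & subgroup]
    group_b = iq[~eats_bananas & subgroup]
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < 0.05:
        false_positives += 1

print(f"'Significant' findings out of {n_subgroup_tests} tests: {false_positives}")
# With a 5% false-positive rate per test, roughly 2 of the 40 tests are
# expected to look significant even though the data contains no true effect.
```

Reporting only those "significant" subgroups while discarding the rest is exactly the dredging pattern described above.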

Without conducting a full investigation, detecting data dredging might be difficult. Possible factors to look for include the following:

  • Was the research conducted by a reputable institution or group of scientists?
  • What does other research in similar areas suggest?
  • Is financial interest involved?
  • Is the claim sensational?

Without a proper process, data dredging and unreliable research will continue to be published. Recently, Nature surveyed around 1,500 researchers from different areas of science, and more than 50% of respondents reported that they had tried and failed to reproduce the results of published research in the past. Even more shockingly, in many cases, they failed to reproduce the results of their own experiments.

Of all respondents, only 24% were able to successfully publish their reproduction attempts, and the majority had never been contacted with a request to reproduce someone else's research.

Of course, increasing the reproducibility of experiments is costly and can double the time required to conduct an experiment, which many research laboratories might not be able to afford. But if reproducibility is built into the originally planned research time and backed by a proper process, it is far less difficult and burdensome than adding it midway through the research lifecycle.

Even worse, retracting a paper after it was published can be a tedious task. Some publishers even charge researchers a significant amount of money if a paper is retracted. Such practices are truly discouraging.

All of this negatively impacts research all over the world and results in growing mistrust in science. Organizations must take steps to improve processes in their scientific departments, and scientific journals must raise the bar for the research they publish.

Now that we have learned about data fishing, let's review better reproducibility guidelines.

Better reproducibility in science research guidelines

The Center for Open Science (COS), a non-profit organization that focuses on supporting and promoting open-science initiatives, reproducibility, and the integrity of scientific research, has published the Guidelines for Transparency and Openness Promotion (TOP) in Journal Policies and Practices, or the TOP Guidelines. These guidelines emphasize the importance of transparency in published research papers. Researchers can use them to justify the necessity of sharing research artifacts publicly and to avoid possible inquiries regarding the integrity of their work.

The main principles of the TOP Guidelines include the following:

  • Proper citation and credit to original authors: All text, code, and data artifacts that belong to other authors must be outlined in the paper and credit given as needed.
  • Data, methodology, and research material transparency: The authors of the paper must share the written code, methodology, and research materials in a publicly accessible location with instructions on how to access and use them.
  • Design and analysis transparency: The authors should be transparent about the methodology as much as possible, although this might vary by industry. At a minimum, they must disclose the standards that have been applied during the research.
  • Preregistration of the research and analysis plans: Even if the research does not get published, preregistration makes it more discoverable.
  • Reproducibility of obtained results: The authors must include sufficient details on how to reproduce the original results.

Each of these metrics is rated at one of the following levels:

  • Not implemented—information is not included in the report
  • Level 1—available upon request
  • Level 2—access before publication
  • Level 3—verification before publication

Level 3 is the highest level of transparency that a metric can achieve. Having this level of transparency attests to the quality of the submitted research. COS applies the TOP Factor to rate a journal's efforts to ensure transparency and, ultimately, the quality of the research it publishes.

Apart from data and code reproducibility, the environment and software used during the research often play a big role. New technologies, such as containers and virtual and cloud environments, make it easier to achieve uniformity in how research is conducted. Of course, in biochemistry or other fields that require precise lab conditions, achieving uniformity might be even more complex.
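Containers are one way to pin the software environment; as a lighter-weight illustration of the same idea, the following sketch (an illustrative assumption, not from the book) records the exact Python and package versions used in a run, so the environment can be reconstructed or compared later.

```python
# Minimal sketch: snapshot the software environment alongside experiment results
# so that a later attempt to reproduce the work can start from the same versions.
import json
import platform
import sys
from importlib import metadata

environment = {
    "python_version": sys.version,
    "platform": platform.platform(),
    # Exact versions of every package installed in the current environment.
    "packages": {dist.metadata["Name"]: dist.version for dist in metadata.distributions()},
}

# Store the snapshot next to the experiment outputs (filename is arbitrary).
with open("environment_snapshot.json", "w") as f:
    json.dump(environment, f, indent=2)

print(f"Recorded {len(environment['packages'])} package versions for reproducibility.")
```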

Now let's learn about common practices that help improve reproducibility.

Common practices to improve reproducibility

Thanks to the work of reproducibility advocates and the problem being widely discussed in scientific communities in recent years, some positive practices for increasing reproducibility seem to be emerging. These include the following:

  • Request a colleague to reproduce your work.
  • Develop extensive documentation.
  • Standardize research methodology.
  • Preregister your research before publication to avoid data cherry-picking.

There are scientific groups that make it their mission to reproduce published results and notify researchers about mistakes in their papers. Their typical process is to try to reproduce the result of a paper and write a letter to the researchers or lab to request a correction or retraction. Some researchers willingly collaborate and correct the mistakes in the paper, but in other cases, the process is unclear and difficult. One such group identified the following problems in the 25 papers that they analyzed:

  • A lack of process or a point of contact for addressing feedback on a paper. Scientific journals do not clearly state whether feedback should be addressed to the chief editor or whether there is a feedback submission form of some sort.
  • Scientific journal editors are reluctant to accept and act on such submissions. In some cases, it might take up to a year to publish a warning on a paper that has received critical feedback, even if the feedback was provided by a reputable institution.
  • Some publishers expect you to pay to publish a correction letter, and they delay retractions.
  • Raw data is not always publicly available. In many cases, publishers do not have a unified process around a shared location for the raw data used in the research. If you have to contact an author directly, you might not get the requested information, and doing so might significantly delay the process. Moreover, the author can simply deny such a request.

The lack of a standard for submitting corrections and retracting research papers contributes to the overall reproducibility crisis and hinders knowledge sharing. Papers that used data dredging and other techniques to manipulate results become a source of information for future researchers, adding to the overall misinformation and chaos. Researchers, publishers, and editors should work together to establish unified post-publication review guidelines that encourage other scientists to participate in testing and providing feedback.

We've learned how reproducibility affects the quality of research. Now, let's review how organizations can establish a process to ensure their data science experiments adhere to industry best practices and maintain high standards.