Reproducible Data Science with Pachyderm

By: Svetlana Karslioglu

Overview of this book

Pachyderm is an open source project that enables data scientists to run reproducible data pipelines and scale them to an enterprise level. This book will teach you how to implement Pachyderm to create collaborative data science workflows and reproduce your ML experiments at scale. You’ll begin your journey by exploring the importance of data reproducibility and comparing different data science platforms. Next, you’ll explore how Pachyderm fits into the picture and its significance, followed by learning how to install Pachyderm locally on your computer or a cloud platform of your choice. You’ll then discover Pachyderm's architectural components and its main pipeline principles and concepts. The book demonstrates how to use Pachyderm components to create your first data pipeline and advances to cover common data operations, such as uploading data to and downloading data from Pachyderm, to create more complex pipelines. Based on what you've learned, you'll develop an end-to-end ML workflow, before trying out the hyperparameter tuning technique and the different supported Pachyderm language clients. Finally, you’ll learn how to use a SaaS version of Pachyderm with Pachyderm Notebooks. By the end of this book, you will have learned all aspects of running your data pipelines in Pachyderm and managing them on a day-to-day basis.
Table of Contents (16 chapters)

Section 1: Introduction to Pachyderm and Reproducible Data Science
Section 2: Getting Started with Pachyderm
Section 3: Pachyderm Clients and Tools

The reproducibility crisis in science

The reproducibility crisis is a problem that has been around for more than a decade. Because data science is closely related to the broader scientific discipline, it is important to review the issues many scientists have raised in the past and correlate them with similar problems the data science space is facing today.

One of the most important issues is replication: the ability to reproduce the results of a scientific experiment has long been one of the founding principles of good research. In other words, if an experiment can be reproduced, it is valid; if not, it could be a one-time occurrence that does not represent a real phenomenon. Unfortunately, in recent years, more and more research papers in sociology, medicine, biology, and other areas of science cannot withstand retesting against a larger number of samples, even when these papers were published in well-known and trustworthy journals, such as Nature. This tendency could lead to public mistrust in science, and in AI as part of it.

As mentioned previously, because of the popularity and growth of the AI industry, the number of AI papers has increased severalfold. Unfortunately, the quality of these papers has not grown with their number.

Nature recently conducted a survey asking scientists whether they feel there is a reproducibility crisis in science. The majority agreed that the pressure to publish frequently definitely leads to false-positive results. Researchers need sponsorship, and sponsors need to see results before investing additional money in the research, which leads to many published papers of declining credibility. Ultimately, the fight for grants and bureaucracy are often named as the main causes of the lack of a reproducibility process in labs.

The research papers whose integrity was questioned share the following common attributes:

  • No code or data were publicly shared for other researchers to attempt to replicate the results.
  • The scientists who attempted to replicate the results by following the provided instructions failed, either completely or partially.

Even papers published by Nobel laureates can sometimes be questioned due to an inability to reproduce the results. For example, in 2014, the journal Science retracted a paper published by Nobel Prize winner and immunologist Bruce Beutler. The paper was about the response to pathogens by virus-like organisms in the human genome and had been cited over 50 times before it was retracted.

When COVID-19 became a major topic of 2020, multiple papers were published on it. According to Retraction Watch, an online blog that tracks retracted scientific papers, more than 86 of them had been retracted as of March 2021.

In 2019, more than 1,400 scientific papers were retracted by various publishers. This number is huge and has been growing steadily, compared to only 50 papers in the early 2000s, which has raised awareness of a so-called reproducibility crisis in science. While not every retraction happens for reproducibility reasons, it is often the cause.

Data fishing

Data fishing, or data dredging, is the practice of achieving a statistically significant experimental result by running a computation multiple times until the desired result appears, reporting only those results, and ignoring the inconvenient ones. Sometimes, scientists unintentionally dredge the data to achieve the result they think is most probable and confirm their hypothesis. A more sinister scenario can take place too: a scientist might intentionally hack the result of the experiment to reach a predefined conclusion.

An example of such a misuse of data analysis would be if you decided to prove that there is a correlation between banana consumption and increased IQ in children aged 10 and older. This is a completely made-up example, but say you wanted to establish this connection. You would need to gather information about the IQ and banana consumption of a large enough sample of children, let's say 5,000.

Then, you would run tests such as the following: do kids who eat bananas and exercise have a higher IQ than the ones who only exercise? Do kids who watch TV and eat bananas have a higher IQ compared to the ones who do not? After conducting enough of these tests, you would most likely find some kind of correlation. However, such a result would not be meaningful, and using the data dredging technique is considered extremely unethical by the scientific community. Similar problems are seen in data science specifically.
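To see why dredging almost always "works," consider the following minimal sketch. It is not from the book; it uses purely synthetic, randomly generated data (the sample size, the number of subgroup tests, and the random seed are arbitrary assumptions) to show that testing enough arbitrary subgroups of pure noise usually yields at least one nominally significant p-value.

```python
# A minimal sketch of data dredging on synthetic data: IQ scores and banana
# consumption are generated independently, so there is no real effect to find,
# yet running many subgroup tests still produces "significant" p-values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
n_children = 5000
n_subgroup_tests = 40  # bananas + exercise, bananas + TV, and so on

iq = rng.normal(loc=100, scale=15, size=n_children)
eats_bananas = rng.integers(0, 2, size=n_children).astype(bool)

false_positives = 0
for _ in range(n_subgroup_tests):
    # Each "test" compares banana-eaters against non-eaters within an
    # arbitrary random subgroup split.
    subgroup = rng.integers(0, 2, size=n_children).astype(bool)
    group_a = iq[eats_bananas & subgroup]
    group_b = iq[~eats_bananas & subgroup]
    _, p_value = stats.ttest_ind(group_a, group_b)
    if p_value < 0.05:
        false_positives += 1

print(f"'Significant' findings out of {n_subgroup_tests} tests: {false_positives}")
# With a 5% false-positive rate per test, roughly 2 of the 40 tests are
# expected to look significant even though the data contains no true effect.
```

Reporting only those "significant" subgroups while discarding the rest is exactly the dredging pattern described above.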

Without conducting a full investigation, detecting data dredging might be difficult. Possible factors to look for include the following:

  • Was the research conducted by a reputable institution or group of scientists?
  • What does other research in similar areas suggest?
  • Is financial interest involved?
  • Is the claim sensational?

Without a proper process, data dredging and unreliable research will continue to be published. Recently, Nature surveyed around 1,500 researchers from different areas of science, and more than 50% of respondents reported that they had tried and failed to reproduce the results of published research in the past. Even more shockingly, in many cases, they failed to reproduce the results of their own experiments.

Of all respondents, only 24% were able to successfully publish their reproduction attempts, and the majority had never been contacted with a request to reproduce someone else's research.

Of course, increasing the reproducibility of experiments is costly and can double the time required to conduct an experiment, which many research laboratories might not be able to afford. But if reproducibility is built into the originally planned research time and backed by a proper process, it is far less difficult and burdensome than adding it midway through the research lifecycle.

Even worse, retracting a paper after it was published can be a tedious task. Some publishers even charge researchers a significant amount of money if a paper is retracted. Such practices are truly discouraging.

All of this negatively impacts research all over the world and results in growing mistrust in science. Organizations must take steps to improve processes in their scientific departments, and scientific journals must raise the bar for the research they publish.

Now that we have learned about data fishing, let's review better reproducibility guidelines.

Better reproducibility in science research guidelines

The Center for Open Science (COS), a non-profit organization that focuses on supporting and promoting open-science initiatives, reproducibility, and the integrity of scientific research, has published the Guidelines for Transparency and Openness Promotion (TOP) in Journal Policies and Practices, or the TOP Guidelines. These guidelines emphasize the importance of transparency in published research papers. Researchers can use them to justify the necessity of sharing research artifacts publicly and to avoid possible inquiries regarding the integrity of their work.

The main principles of the TOP Guidelines include the following:

  • Proper citation and credit to original authors: All text, code, and data artifacts that belong to other authors must be outlined in the paper and credit given as needed.
  • Data, methodology, and research material transparency: The authors of the paper must share the written code, methodology, and research materials in a publicly accessible location with instructions on how to access and use them.
  • Design and analysis transparency: The authors should be transparent about the methodology as much as possible, although this might vary by industry. At a minimum, they must disclose the standards that have been applied during the research.
  • Preregistration of the research and analysis plans: Even if the research does not get published, preregistration makes it more discoverable.
  • Reproducibility of obtained results: The authors must include sufficient details on how to reproduce the original results.

Each of these metrics is rated at one of the following levels:

  • Not implemented—information is not included in the report
  • Level 1—available upon request
  • Level 2—access before publication
  • Level 3—verification before publication

Level 3 is the highest level of transparency that a metric can achieve. Having this level of transparency attests to the quality of the submitted research. COS applies the TOP Factor to rate a journal's efforts to ensure transparency and, ultimately, the quality of the research it publishes.

Apart from data and code reproducibility, the environment and software used during the research often play a big role. New technologies, such as containers and virtual and cloud environments, make it easier to achieve uniformity in how research is conducted. Of course, in biochemistry or other fields that require precise lab conditions, achieving uniformity might be even more complex.
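Containers are one way to pin the software environment; as a lighter-weight illustration of the same idea, the following sketch (an illustrative assumption, not from the book) records the exact Python and package versions used in a run, so the environment can be reconstructed or compared later.

```python
# Minimal sketch: snapshot the software environment alongside experiment results
# so that a later attempt to reproduce the work can start from the same versions.
import json
import platform
import sys
from importlib import metadata

environment = {
    "python_version": sys.version,
    "platform": platform.platform(),
    # Exact versions of every package installed in the current environment.
    "packages": {dist.metadata["Name"]: dist.version for dist in metadata.distributions()},
}

# Store the snapshot next to the experiment outputs (filename is arbitrary).
with open("environment_snapshot.json", "w") as f:
    json.dump(environment, f, indent=2)

print(f"Recorded {len(environment['packages'])} package versions for reproducibility.")
```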

Now let's learn about common practices that help improve reproducibility.

Common practices to improve reproducibility

Thanks to the work of reproducibility advocates and the problem being widely discussed in scientific communities in recent years, some positive practices for increasing reproducibility seem to be emerging. These include the following:

  • Request a colleague to reproduce your work.
  • Develop extensive documentation.
  • Standardize research methodology.
  • Preregister your research before publication to avoid data cherry-picking.

There are scientific groups that make it their mission to reproduce published results and notify researchers about mistakes in their papers. Their typical process is to try to reproduce the result of a paper and write a letter to the researchers or lab to request a correction or retraction. Some researchers willingly collaborate and correct the mistakes in the paper, but in other cases, the process is unclear and difficult. One such group identified the following problems in the 25 papers that they analyzed:

  • A lack of process or a point of contact for addressing feedback on a paper. Scientific journals do not clearly state whether feedback should be addressed to the chief editor or whether there is a feedback submission form of some sort.
  • Scientific journal editors are reluctant to accept and act on such submissions. In some cases, it might take up to a year to publish a warning on a paper that has received critical feedback, even if the feedback was provided by a reputable institution.
  • Some publishers expect you to pay to publish a correction letter, and they delay retractions.
  • Raw data is not always publicly available. In many cases, publishers do not have a unified process around a shared location for the raw data used in the research. If you have to contact an author directly, you might not get the requested information, and doing so might significantly delay the process. Moreover, the author can simply deny such a request.

The lack of a standard for submitting corrections and retracting research papers contributes to the overall reproducibility crisis and hinders knowledge sharing. Papers that used data dredging and other techniques to manipulate results become a source of information for future researchers, adding to the overall misinformation and chaos. Researchers, publishers, and editors should work together to establish unified post-publication review guidelines that encourage other scientists to participate in testing and providing feedback.

We've learned how reproducibility affects the quality of research. Now, let's review how organizations can establish a process to ensure their data science experiments adhere to industry best practices and maintain high standards.