
Explaining ethical AI

This section describes the ethical problems that arise in AI and what organizations need to be aware of when they build artificial intelligence applications.

With AI and machine learning technologies becoming more widespread and accepted, it is easy to lose track of where the data came from and how the decision-making process originated. When an AI algorithm suggests which pair of shoes to buy based on your recent searches, it might not be a big deal. But suppose an AI algorithm is used to decide whether you qualify for a job, how likely you are to commit a crime, or whether you qualify for a mortgage. In that case, it is essential to know how the algorithm was created, which data it was trained on, what was included in the dataset, and, more importantly, what was not. At a minimum, we need to question whether a proper process existed to validate the data used for producing the model. Not only is this the right thing to do, but it could also save your organization from undesirable legal consequences.

While AI applications bring certain advantages and improve the quality of our lives, they can make mistakes that sometimes can have adverse, and even devastating, effects on people's lives. These tendencies resulted in the emergence of ethical AI teams and ethical AI advocates in leading AI companies and big tech.

Ethical AI has been an increasingly discussed topic in the data science community over the last few years. According to the Artificial Intelligence Index Report 2019, the number of papers mentioning ethics at leading AI conferences has grown steadily.

Figure 1.8 – Number of AI conference papers mentioning Ethics since 1970, from the AI Index 2019 Annual Report (p. 44)

Let's consider one of the most widely criticized AI technologies: facial recognition. A facial recognition application can identify a person in an image or video. In recent years, this technology has become widespread and is now used in home security, authentication, and other areas. In 2018-2019, more than 1,000 newspaper articles worldwide mentioned facial recognition and data privacy. One such cloud-based facial recognition technology, Rekognition, developed by Amazon, has been used by police departments in a few states. Law enforcement departments used the software to search for suspects in a database and in video surveillance analysis, including the feed from police body cameras. Independent research showed that the software was biased against people of color: in a test of 120 Members of Congress, it flagged 28 of them as potential criminals, all of whom had darker skin tones. The tool performed especially poorly at identifying women of color.

The problem with this and other facial recognition technologies is that they were trained on non-inclusive datasets consisting mostly of photographs of white men. Such outcomes are difficult to predict, but predicting them is exactly what ethical AI tries to do. Implementing a surveillance system like that in public places would have negatively affected thousands of people. Advances in AI have produced facial recognition technology that requires little to no human involvement in identifying subjects. This raises the problem of total surveillance and loss of privacy. While such a system could help identify criminals, possibly prevent crimes, and make our society safer, it needs to be thoroughly audited for potential errors and protected from misuse. With great power comes great responsibility.

Another interesting example concerns Natural Language Processing (NLP) applications. NLP is an ML technology that enables machines to automatically interpret human language, including translating text from one language to another. In recent years, NLP applications have seen major advances. Tools such as Google Translate solved a problem that was unsolvable even 20 years ago. An NLP system breaks a sentence down into chunks and tries to make connections between those chunks to produce a meaningful interpretation. NLP applications deal not only with translation but can also summarize a lengthy research paper or convert text to speech.

But these applications can make mistakes as well. One example was discovered in translations from Turkish to English. Turkish has a single third-person pronoun, o, which can mean either she or he. It was discovered that Google Translate reproduced common gender stereotypes, diminishing women's roles: for example, it would translate sentences as She is a secretary and He is a doctor, although in Turkish, both sentences could refer to either a man or a woman.

From these examples, you can see that bias is one of the biggest problems in AI applications. A biased dataset is a dataset that does not include enough samples of the studied phenomenon to produce an objective result, as in the facial recognition example above, where the training data did not include enough images of people of color for the model to make correct predictions.
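
As a minimal, illustrative sketch (not from the book), the following Python snippet shows one simple way to audit an evaluation set for group representation and per-group error rates; the column names skin_tone and correct are hypothetical.

    import pandas as pd

    # Hypothetical evaluation results: one row per image, recording the
    # subject's demographic group and whether the model's prediction was correct.
    results = pd.DataFrame({
        "skin_tone": ["light", "light", "light", "light", "dark", "dark"],
        "correct": [True, True, True, False, False, True],
    })

    # How well is each group represented in the evaluation data?
    print(results["skin_tone"].value_counts(normalize=True))

    # How does accuracy differ across groups?
    print(results.groupby("skin_tone")["correct"].mean())

A real audit would of course use far larger samples and more nuanced fairness metrics, but even a check as simple as this can reveal when one group is underrepresented or systematically misclassified.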

While many companies are becoming aware of the adverse effects and risks of bias in datasets, few of them are taking steps to mitigate the possible negative consequences. According to the Artificial Intelligence Index Report 2019, only 13% of organizations that responded were working toward improving the equity and fairness of the datasets used:

Figure 1.9 – Types of organizations taking steps to mitigate the risks of AI, from the AI Index 2019 Annual Report (p. 102)

Another aspect of bias is financial inequality. It is no secret that people from less economically advantaged backgrounds have a harder time getting credit than those from more fortunate backgrounds. Credit reports are known to contain errors that result in higher borrowing rates.

Companies whose business is creating customer profiles, or personalization, go even further, collecting intimate information about users and their behavior from public records, credit card transactions, sweepstakes, and other sources. These reports can be sold to marketers and even law enforcement organizations. Individuals are categorized according to their sex, age, marital status, wealth, medical conditions, and other factors. Sometimes these reports contain outdated information about things such as criminal records. In one case, an elderly woman could not get into a senior living home because of an arrest record; although she had been arrested, the incident was a case of domestic violence committed against her by her partner, and she was never prosecuted. She was able to correct her police record, but not the report created by a profiling company. Correcting mistakes in the reports created by these companies is extremely difficult, and those mistakes can affect people's lives for decades.

Sometimes, people get flagged because of a misidentified profile. Imagine that you are applying for a job and are denied because you have supposedly been prosecuted for theft or burglary in the past. This could come as a shock and might not make any sense, but cases like that do happen to people with common names. Clearing up such a mistake requires someone willing to spend the time to correct it for you, and how often do you meet people like that?

With machine learning now being used in customer profiling, many data privacy advocates question the methods used in these algorithms. Because these algorithms learn from past experiences, they assume that anything you have done in the past you are likely to repeat in the future. According to these algorithms, criminals will commit more crimes and the poor will get poorer; there is no room for mistakes in their reality. This means that people with prior convictions are likely to get arrested again, which gives law enforcement a basis for discrimination. The opposite is also true: those with a perfect record, from a better neighborhood, are deemed unlikely to commit a crime. This does not sound fair.

The problem with recidivism models is that most of them are proprietary black boxes. A black-box model is an end-to-end model that is created by an algorithm directly from the provided data, and even a data scientist cannot explain how it makes decisions. And because machine learning algorithms learn from data produced by humans, as they evolve over time they pick up the same biases that we hold.

Figure 1.10 – Black-box model
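
As an illustrative sketch only (not taken from the book), the following Python snippet trains a random forest, a typical black-box model, on synthetic data using scikit-learn. The model readily produces predictions, but nothing in its output explains why any individual decision was made, which is exactly the opacity described above.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in data: 1,000 "applicants" described by 10 numeric features.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

    # A forest of hundreds of trees makes accurate predictions, but tracing any
    # single decision back through all of its trees is impractical for a human.
    model = RandomForestClassifier(n_estimators=300, random_state=42)
    model.fit(X, y)

    # The model happily scores a new applicant...
    print(model.predict(X[:1]))

    # ...but the only built-in "explanation" is an aggregate feature importance
    # vector, which says nothing about this particular decision.
    print(model.feature_importances_)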

Let's move on to the next section!

Trustworthy AI

While only a few years ago ethical AI was something that just a few groups of independent advocates and academics were working on, today more and more big tech companies have established ethical AI departments to protect themselves from reputational and legal risks.

Establishing standards for trustworthy AI models is an ambitious task and one size does not fit all. However, the following principles apply to most cases:

  • Create an ethical AI committee that discusses AI-associated risks in alignment with the overall company strategy.
  • Raise awareness of the dangers of non-transparent machine learning algorithms and the potential risks they pose to society and your organization.
  • Create a process for identifying, evaluating, and communicating biased models and privacy concerns. For example, in healthcare, protecting patients' personal information is vitally important. Establish ownership of ethical risk in the product management department.
  • Establish a process of notifying users about how their data will be used, explaining the risk of bias and other concepts in plain English. The earlier the user becomes aware of the implications of using your application, the less legal risk this will pose in the future.
  • Build a culture that praises efforts to promote ethical programs and initiatives, and motivate employees to contribute to them. Engage employees from different departments, including engineering, data science, product management, and others.

According to the Artificial Intelligence Index Report 2019, the top AI ethics challenges include fairness, interpretability and explainability, and transparency.

The following figure shows a more complete list of challenges present in the ethical AI space:

Figure 1.11 – Ethical AI challenges, from the AI Index 2019 Annual Report (p. 149)

The following is a list of issues that non-transparent machine learning algorithms may cause:

  • Disproportionate spread of economic and financial opportunities, including credit discrimination and unequal access to discounts and promotions based on predefined buying habits
  • Access to information and social circles, such as algorithms that promote news based on socio-economic groups and suggestions to join specific groups or circles
  • Employment discrimination, including algorithms that filter candidates based on their race, religion, or gender
  • Unequal use of police force and punishment, including algorithms that predict the possibility of an individual committing a crime in the future based on social status and race
  • Housing discrimination, including the denial of equal rental and mortgage opportunities to people of color, LGBT groups, and other minorities

AI has brought unprecedented benefits to our society, much as the industrial revolution did. But along with these benefits, we should be aware of the societal changes they carry. If the future of driving is self-driving cars, driving as a profession will disappear in the foreseeable future. Many other industries will be affected, and some will cease to exist. This does not mean that progress should not happen, but it needs to happen in a controlled way.

Software is only as perfect as its creators, and flaws in new AI-powered products are inevitable. But if these new applications are the first level in the decision-making process about human lives and destinies, there has to be a way to ensure that we minimize potential harmful consequences. Therefore, deeply understanding our models is paramount. Part of that is reproducibility, which is one of the key factors in minimizing the negative consequences of AI.