
Explaining ethical AI

This section describes the ethical problems that arise in AI and what organizations need to be aware of when they build artificial intelligence applications.

With AI and machine learning technologies becoming more widespread and accepted, it is easy to lose track of where the data came from and how the decision-making process originated. When an AI algorithm suggests which pair of shoes to buy based on your recent searches, it might not be a big deal. But suppose an AI algorithm is used to decide whether you qualify for a job, how likely you are to commit a crime, or whether you qualify for a mortgage. In that case, it is essential to know how the algorithm was created, which data it was trained on, what was included in the dataset, and, more importantly, what was not. At a minimum, we need to question whether a proper process existed to validate the data used for producing the model. Not only is this the right thing to do, but it could also save your organization from undesirable legal consequences.

While AI applications bring certain advantages and improve the quality of our lives, they can make mistakes that sometimes can have adverse, and even devastating, effects on people's lives. These tendencies resulted in the emergence of ethical AI teams and ethical AI advocates in leading AI companies and big tech.

Ethical AI has been an increasingly discussed topic in the data science community over the last few years. According to the Artificial Intelligence Index Report 2019, the number of papers mentioning ethics at leading AI conferences has grown steadily.

Figure 1.8 – Number of AI conference papers mentioning Ethics since 1970, from the AI Index 2019 Annual Report (p. 44)

Let's consider one of the most widely criticized AI technologies: facial recognition. A facial recognition application can identify a person in an image or video. In recent years, this technology has become widespread and is now used in home security, authentication, and other areas. In 2018-2019, more than 1,000 newspaper articles worldwide mentioned facial recognition and data privacy. One such cloud-based facial recognition technology, Rekognition, developed by Amazon, has been used by police departments in a few states. Law enforcement departments used the software to search for suspects in a database and in video surveillance analysis, including the feed from police body cameras. Independent research showed that the software was biased against people of color: in a test of 120 Members of Congress, it flagged 28 of them as potential criminals, all of whom had darker skin tones. The tool performed especially poorly at identifying women of color.

The problem with this and other facial recognition technologies is that they were trained on non-inclusive datasets consisting mostly of photographs of white men. Such outcomes are difficult to predict, but predicting them is exactly what ethical AI tries to do. Implementing a surveillance system like that in public places would have negatively affected thousands of people. Advances in AI have produced facial recognition technology that requires little to no human involvement in identifying subjects. This raises the problem of total surveillance and loss of privacy. While such a system could help identify criminals, possibly prevent crimes, and make our society safer, it needs to be thoroughly audited for potential errors and protected from misuse. With great power comes great responsibility.

Another interesting example concerns Natural Language Processing (NLP) applications. NLP is an ML technology that enables machines to automatically interpret human language, including translating text from one language to another. In recent years, NLP applications have seen major advances. Tools such as Google Translate solved a problem that was unsolvable even 20 years ago. An NLP system breaks a sentence down into chunks and tries to make connections between those chunks to produce a meaningful interpretation. NLP applications deal not only with translation but can also summarize a lengthy research paper or convert text to speech.

But these applications can make mistakes as well. One example was discovered in translations from Turkish to English. Turkish has a single third-person pronoun, o, which can mean either she or he. It was discovered that Google Translate reproduced common gender stereotypes, diminishing women's roles: for example, it would translate sentences as She is a secretary and He is a doctor, although in Turkish, both sentences could refer to either a man or a woman.

From these examples, you can see that bias is one of the biggest problems in AI applications. A biased dataset is a dataset that does not include enough samples of the studied phenomenon to produce an objective result, as in the facial recognition example above, where the training data did not include enough images of people of color for the model to make correct predictions.
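
As a minimal, illustrative sketch (not from the book), the following Python snippet shows one simple way to audit an evaluation set for group representation and per-group error rates; the column names skin_tone and correct are hypothetical.

    import pandas as pd

    # Hypothetical evaluation results: one row per image, recording the
    # subject's demographic group and whether the model's prediction was correct.
    results = pd.DataFrame({
        "skin_tone": ["light", "light", "light", "light", "dark", "dark"],
        "correct": [True, True, True, False, False, True],
    })

    # How well is each group represented in the evaluation data?
    print(results["skin_tone"].value_counts(normalize=True))

    # How does accuracy differ across groups?
    print(results.groupby("skin_tone")["correct"].mean())

A real audit would of course use far larger samples and more nuanced fairness metrics, but even a check as simple as this can reveal when one group is underrepresented or systematically misclassified.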

While many companies are becoming aware of the adverse effects and risks of bias in datasets, few of them are taking steps to mitigate the possible negative consequences. According to the Artificial Intelligence Index Report 2019, only 13% of organizations that responded were working toward improving the equity and fairness of the datasets used:

Figure 1.9 – Types of organizations taking steps to mitigate the risks of AI, from the AI Index 2019 Annual Report (p. 102)

Another aspect of bias is financial inequality. It is no secret that people from less economically advantaged backgrounds have a harder time getting credit than those from more fortunate backgrounds. Credit reports are known to contain errors that result in higher borrowing rates.

Companies whose business is creating customer profiles, or personalization, go even further, collecting intimate information about users and their behavior from public records, credit card transactions, sweepstakes, and other sources. These reports can be sold to marketers and even law enforcement organizations. Individuals are categorized according to their sex, age, marital status, wealth, medical conditions, and other factors. Sometimes these reports contain outdated information about things such as criminal records. In one case, an elderly woman could not get into a senior living home because of an arrest record; although she had been arrested, the incident was a case of domestic violence committed against her by her partner, and she was never prosecuted. She was able to correct her police record, but not the report created by a profiling company. Correcting mistakes in the reports created by these companies is extremely difficult, and those mistakes can affect people's lives for decades.

Sometimes, people get flagged because of a misidentified profile. Imagine that you are applying for a job and are denied because you have supposedly been prosecuted for theft or burglary in the past. This could come as a shock and might not make any sense, but cases like that do happen to people with common names. Clearing up such a mistake requires someone willing to spend the time to correct it for you, and how often do you meet people like that?

With machine learning now being used in customer profiling, many data privacy advocates question the methods used in these algorithms. Because these algorithms learn from past experiences, they assume that anything you have done in the past you are likely to repeat in the future. According to these algorithms, criminals will commit more crimes and the poor will get poorer; there is no room for mistakes in their reality. This means that people with prior convictions are likely to get arrested again, which gives law enforcement a basis for discrimination. The opposite is also true: those with a perfect record, from a better neighborhood, are deemed unlikely to commit a crime. This does not sound fair.

The problem with recidivism models is that most of them are proprietary black boxes. A black-box model is an end-to-end model that is created by an algorithm directly from the provided data, and even a data scientist cannot explain how it makes decisions. And because machine learning algorithms learn from data produced by humans, as they evolve over time they pick up the same biases that we hold.

Figure 1.10 – Black-box model
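
As an illustrative sketch only (not taken from the book), the following Python snippet trains a random forest, a typical black-box model, on synthetic data using scikit-learn. The model readily produces predictions, but nothing in its output explains why any individual decision was made, which is exactly the opacity described above.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in data: 1,000 "applicants" described by 10 numeric features.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

    # A forest of hundreds of trees makes accurate predictions, but tracing any
    # single decision back through all of its trees is impractical for a human.
    model = RandomForestClassifier(n_estimators=300, random_state=42)
    model.fit(X, y)

    # The model happily scores a new applicant...
    print(model.predict(X[:1]))

    # ...but the only built-in "explanation" is an aggregate feature importance
    # vector, which says nothing about this particular decision.
    print(model.feature_importances_)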

Let's move on to the next section!

Trustworthy AI

While only a few years ago ethical AI was something that just a few groups of independent advocates and academics were working on, today more and more big tech companies have established ethical AI departments to protect themselves from reputational and legal risks.

Establishing standards for trustworthy AI models is an ambitious task and one size does not fit all. However, the following principles apply to most cases:

  • Create an ethical AI committee that discusses AI-associated risks in alignment with the overall company strategy.
  • Raise awareness of the dangers of non-transparent machine learning algorithms and the potential risks they pose to society and your organization.
  • Create a process for identifying, evaluating, and communicating biased models and privacy concerns. For example, in healthcare, protecting patients' personal information is vitally important. Establish ownership of ethical risk in the product management department.
  • Establish a process of notifying users about how their data will be used, explaining the risk of bias and other concepts in plain English. The earlier the user becomes aware of the implications of using your application, the less legal risk this will pose in the future.
  • Build a culture that praises efforts to promote ethical programs and initiatives, and motivate employees to contribute to them. Engage employees from different departments, including engineering, data science, product management, and others.

According to the Artificial Intelligence Index Report 2019, the top AI ethics challenges include fairness, interpretability and explainability, and transparency.

The following figure shows a more complete list of challenges present in the ethical AI space:

Figure 1.11 – Ethical AI challenges, from the AI Index 2019 Annual Report (p. 149)

The following is a list of issues that non-transparent machine learning algorithms may cause:

  • Disproportionate spread of economic and financial opportunities, including credit discrimination and unequal access to discounts and promotions based on predefined buying habits
  • Access to information and social circles, such as algorithms that promote news based on socio-economic groups and suggestions to join specific groups or circles
  • Employment discrimination, including algorithms that filter candidates based on their race, religion, or gender
  • Unequal use of police force and punishment, including algorithms that predict the possibility of an individual committing a crime in the future based on social status and race
  • Housing discrimination, including the denial of equal rental and mortgage opportunities to people of color, LGBT groups, and other minorities

AI has brought unprecedented benefits to our society, much as the industrial revolution did. But along with these benefits, we should be aware of the societal changes they carry. If the future of driving is self-driving cars, driving as a profession will disappear in the foreseeable future. Many other industries will be affected, and some will cease to exist. This does not mean that progress should not happen, but it needs to happen in a controlled way.

Software is only as perfect as its creators, and flaws in new AI-powered products are inevitable. But if these new applications are the first level in the decision-making process about human lives and destinies, there has to be a way to ensure that we minimize potential harmful consequences. Therefore, deeply understanding our models is paramount. Part of that is reproducibility, which is one of the key factors in minimizing the negative consequences of AI.