Apache Spark 2.x Machine Learning Cookbook

By: Siamak Amirghodsi, Shuen Mei, Meenakshi Rajendran, Broderick Hall

Overview of this book

Machine learning aims to extract knowledge from data, relying on fundamental concepts in computer science, statistics, probability, and optimization. Learning about algorithms enables a wide range of applications, from everyday tasks such as product recommendations and spam filtering to cutting-edge applications such as self-driving cars and personalized medicine. You will gain hands-on experience of applying these principles using Apache Spark, a resilient cluster-computing system well suited to large-scale machine learning tasks. This book begins with a quick overview of setting up the necessary IDEs to facilitate the execution of the code examples covered in the various chapters. It also highlights some key issues developers face while working with machine learning algorithms on the Spark platform. We progress by uncovering the various Spark APIs and the implementation of ML algorithms, developing classification systems, recommendation engines, text analytics, clustering, and learning systems. Toward the final chapters, we focus on building high-end applications and explain various unsupervised methodologies and the challenges you will tackle when implementing ML systems with big data.

Preface

 

Education is not the learning of facts,

but the training of the mind to think.

- Albert Einstein

Data is the new silicon of our age, and machine learning, coupled with biologically inspired cognitive systems, serves as the core foundation to not only enable but also accelerate the birth of the fourth industrial revolution. This book is dedicated to our parents, who through extreme hardship and sacrifice, made our education possible and taught us to always practice kindness.

The Apache Spark 2.x Machine Learning Cookbook is crafted by four friends with diverse backgrounds, who bring vast experience in the subject matter across multiple industries and academic disciplines. The book is as much about friendship as it is about the science underpinning Spark and machine learning. We wanted to put our thoughts together and write a book for the community that not only combines Spark's ML code and real-world data sets but also provides context-relevant explanations, references, and readings for deeper understanding and further research. This book is a reflection of what our team would have wished to have when we got started with Apache Spark.

My own interest in machine learning and artificial intelligence started in the mid-eighties, when I had the opportunity to read two significant artifacts that happened to be listed back to back in Artificial Intelligence, An International Journal, Volume 28, Number 1, February 1986. While it has been a long journey for engineers and scientists of my generation, fortunately, the advancements in resilient distributed computing, cloud computing, GPUs, cognitive computing, optimization, and advanced machine learning have made a decades-long dream come true. All these advancements have become accessible to the current generation of ML enthusiasts and data scientists alike.

We live in one of the rarest periods in history: a time when multiple technological and sociological trends have converged. The elasticity of cloud computing, with built-in access to ML and deep learning nets, will provide a whole new set of opportunities to create and capture new markets. The emergence of Apache Spark as the lingua franca, or common language, of near real-time resilient distributed computing and data virtualization has given smart companies the opportunity to employ ML techniques at scale without heavy investment in specialized data centers or hardware.

The Apache Spark 2.x Machine Learning Cookbook is one of the most comprehensive treatments of the Apache Spark machine learning API, covering selected subcomponents of Spark to give you the foundation you need for a high-end career in machine learning and Apache Spark. The book is written with the goal of providing clarity and accessibility, and it reflects our own experience (including reading the source code) and learning curve with Apache Spark, which started with Spark 1.0.

The Apache Spark 2.x Machine Learning Cookbook sits at the intersection of Apache Spark, machine learning, and Scala. It is written through a practitioner's lens for developers and data scientists who must understand not only the code but also the details, theory, and inner workings of a given Spark ML algorithm or API in order to establish a successful career in the new economy.

The book takes the cookbook format to a whole new level by blending downloadable, ready-to-run Apache Spark ML code recipes with background, actionable theory, references, research, and real-life data sets to help the reader understand the what, the how, and the why behind the extensive facilities offered by Spark's machine learning library. The book starts by laying the foundations needed to succeed and then rapidly evolves to cover all the meaningful ML algorithms available in Apache Spark.

What this book covers

Chapter 1, Practical Machine Learning with Spark Using Scala, covers installing and configuring a real-life development environment for machine learning programming with Apache Spark. Using screenshots, it walks you through downloading, installing, and configuring Apache Spark and IntelliJ IDEA, along with the necessary libraries, to reflect a developer's desktop in a real-world setting. It then identifies and lists over 40 data repositories with real-world data sets that can help the reader experiment with and advance the code recipes even further. In the final step, we run our first ML program on Spark and then provide directions on how to add graphics to your machine learning programs, which are used in the subsequent chapters.

Chapter 2, Just Enough Linear Algebra for Machine Learning with Spark, covers the use of linear algebra (vectors and matrices), which underpins some of the most monumental works in machine learning. Through its recipes, it provides a comprehensive treatment of the DenseVector, SparseVector, and matrix facilities available in Apache Spark. It provides recipes for both local and distributed matrices, including RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix, to give a detailed explanation of this topic. We included this chapter because mastery of Spark's ML/MLlib was only possible by reading most of the source code line by line and understanding how matrix decomposition and vector/matrix arithmetic work underneath the coarser-grained algorithms in Spark.
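
As a small taste of the chapter, here is a minimal sketch (our own illustration, not a recipe from the book, assuming a Spark 2.x dependency on the classpath) of creating local vectors and matrices with the ml.linalg factories; the object name and sample values are ours:

import org.apache.spark.ml.linalg.{Matrices, Vectors}

object LinearAlgebraSketch extends App {
  // A dense vector stores every element explicitly
  val dense = Vectors.dense(1.0, 0.0, 3.0)

  // A sparse vector stores only the non-zero entries:
  // size 3, with values at indices 0 and 2
  val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

  // A local 2 x 2 dense matrix, supplied in column-major order
  val m = Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))

  println(s"dense = $dense, sparse = $sparse")
  println(m)
}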

Chapter 3, Spark’s Three Data Musketeers for Machine Learning - Perfect Together, provides an end-to-end treatment of the three pillars of resilient distributed data manipulation and wrangling in Apache Spark. The chapter comprises detailed recipes covering the RDD, DataFrame, and Dataset facilities from a practitioner’s point of view. Through an exhaustive list of 17 recipes, examples, references, and explanations, it lays the foundation for a successful career in the machine learning sciences. The chapter provides both functional (code) and non-functional (SQL interface) programming approaches to solidify a knowledge base reflecting the real demands of a successful Spark ML engineer at tier 1 companies.
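
To illustrate how the three APIs relate, here is a sketch of ours (assuming a Spark 2.x dependency; the Person case class and the sample data are hypothetical):

import org.apache.spark.sql.SparkSession

// Defined at the top level so Spark can derive an Encoder for it
case class Person(name: String, count: Int)

object ThreeMusketeersSketch extends App {
  val spark = SparkSession.builder.master("local[*]").appName("ThreeMusketeers").getOrCreate()
  import spark.implicits._

  // RDD: the low-level, functional API
  val rdd = spark.sparkContext.parallelize(Seq(("alice", 1), ("bob", 2)))

  // DataFrame: untyped rows with named columns, queryable via SQL
  val df = rdd.toDF("name", "count")
  df.createOrReplaceTempView("people")
  spark.sql("SELECT name FROM people WHERE count > 1").show()

  // Dataset: a typed, compile-time-checked view of the same data
  val ds = df.as[Person]
  ds.filter(_.count > 1).show()

  spark.stop()
}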

Chapter 4, Common Recipes for Implementing a Robust Machine Learning System, covers and factors out the tasks that are common to most machine learning systems, through 16 short but to-the-point code recipes that the reader can use in their own real-world systems. It covers a gamut of techniques, ranging from normalizing data to evaluating model output using best-practice metrics, via Spark ML/MLlib facilities that might not be readily visible to the reader. It is a collection of recipes that we use in our day-to-day jobs in most situations, but which are listed separately to save on the space and complexity of the other recipes.
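
Normalizing data, for instance, can be as short as the following sketch (ours, not a recipe from the chapter; the sample data is made up), which rescales each feature to the [0, 1] range with Spark's MinMaxScaler:

import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object NormalizeSketch extends App {
  val spark = SparkSession.builder.master("local[*]").appName("Normalize").getOrCreate()

  val df = spark.createDataFrame(Seq(
    (0, Vectors.dense(10.0, 200.0)),
    (1, Vectors.dense(20.0, 400.0)),
    (2, Vectors.dense(30.0, 800.0))
  )).toDF("id", "features")

  // Fit the min/max per feature, then rescale every row to [0, 1]
  val scaler = new MinMaxScaler().setInputCol("features").setOutputCol("scaledFeatures")
  scaler.fit(df).transform(df).show(truncate = false)

  spark.stop()
}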

Chapter 5, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part I, is the first of two chapters exploring classification and regression in Apache Spark. This chapter starts with Generalized Linear Regression (GLM), extending it to Lasso and Ridge with the different types of optimization available in Spark. The chapter then proceeds to cover Isotonic regression, Survival regression, the multilayer perceptron (a neural network), and the One-vs-Rest classifier.
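
In Spark's ML API, Ridge and Lasso are both reachable through LinearRegression's elastic net mixing parameter; the following sketch of ours (with a tiny made-up dataset) shows the idea:

import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object ElasticNetSketch extends App {
  val spark = SparkSession.builder.master("local[*]").appName("ElasticNet").getOrCreate()

  // Tiny illustrative dataset: label is roughly 2 * x + 1
  val training = spark.createDataFrame(Seq(
    (3.1, Vectors.dense(1.0)),
    (5.0, Vectors.dense(2.0)),
    (6.9, Vectors.dense(3.0)),
    (9.1, Vectors.dense(4.0))
  )).toDF("label", "features")

  val lr = new LinearRegression()
    .setMaxIter(100)
    .setRegParam(0.1)        // regularization strength
    .setElasticNetParam(1.0) // 1.0 = Lasso (L1); 0.0 would be Ridge (L2)

  val model = lr.fit(training)
  println(s"coefficients = ${model.coefficients}, intercept = ${model.intercept}")

  spark.stop()
}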

Chapter 6, Practical Machine Learning with Regression and Classification in Spark 2.0 - Part II, is the second of the two regression and classification chapters. This chapter covers RDD-based regression systems, ranging from Linear, Logistic, and Ridge to Lasso, using Stochastic Gradient Descent and L-BFGS optimization in Spark. The last three recipes cover Support Vector Machines (SVM) and Naïve Bayes, ending with a detailed recipe for the ML pipelines that are gaining a prominent position in the Spark ML ecosystem.
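
To show the shape of such a pipeline, here is a small sketch of ours following the standard Tokenizer / HashingTF / LogisticRegression pattern (the toy documents and labels are made up):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch extends App {
  val spark = SparkSession.builder.master("local[*]").appName("Pipeline").getOrCreate()

  val training = spark.createDataFrame(Seq(
    (0L, "spark is great", 1.0),
    (1L, "hadoop mapreduce", 0.0),
    (2L, "spark ml rocks", 1.0),
    (3L, "slow batch jobs", 0.0)
  )).toDF("id", "text", "label")

  // Each stage feeds its output column into the next stage
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val lr = new LogisticRegression().setMaxIter(10)

  val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
  model.transform(training).select("text", "prediction").show(truncate = false)

  spark.stop()
}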

Chapter 7, Recommendation Engine that Scales with Spark, covers how to explore your data set and build a movie recommendation engine using Spark’s ML library facilities. It uses a large dataset and several recipes, in addition to figures and write-ups, to explore the various methods of recommenders before going deep into collaborative filtering techniques in Spark.
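
Collaborative filtering in Spark's ML library centers on the ALS estimator; this sketch of ours (with made-up user/movie/rating triples) shows its basic shape:

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object AlsSketch extends App {
  val spark = SparkSession.builder.master("local[*]").appName("ALS").getOrCreate()
  import spark.implicits._

  // Tiny made-up (user, movie, rating) triples
  val ratings = Seq(
    (0, 10, 4.0), (0, 11, 1.0),
    (1, 10, 5.0), (1, 12, 2.0),
    (2, 11, 3.0), (2, 12, 4.0)
  ).toDF("userId", "movieId", "rating")

  val als = new ALS()
    .setRank(5)
    .setMaxIter(10)
    .setUserCol("userId")
    .setItemCol("movieId")
    .setRatingCol("rating")

  // Predicted ratings for the known (user, movie) pairs
  als.fit(ratings).transform(ratings).show()

  spark.stop()
}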

Chapter 8, Unsupervised Clustering with Apache Spark 2.0, covers the techniques used in unsupervised learning, such as KMeans, Gaussian Mixture with Expectation Maximization (EM), Power Iteration Clustering (PIC), and Latent Dirichlet Allocation (LDA), while also covering the why and the how to help the reader understand the core concepts. Using Spark Streaming, the chapter commences with a real-time KMeans clustering recipe that classifies the input stream into labeled classes via unsupervised means.
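
For instance, a batch KMeans run on two obvious clusters takes only a few lines; this sketch is ours (made-up points), not a recipe from the chapter:

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object KMeansSketch extends App {
  val spark = SparkSession.builder.master("local[*]").appName("KMeans").getOrCreate()

  // Two obvious clusters, around (0, 0) and (9, 9)
  val points = spark.createDataFrame(Seq(
    Tuple1(Vectors.dense(0.0, 0.1)), Tuple1(Vectors.dense(0.2, 0.0)),
    Tuple1(Vectors.dense(9.0, 9.1)), Tuple1(Vectors.dense(9.2, 8.9))
  )).toDF("features")

  val model = new KMeans().setK(2).setSeed(1L).fit(points)
  model.clusterCenters.foreach(println)          // the two learned centers
  model.transform(points).show(truncate = false) // each point's cluster assignment

  spark.stop()
}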

Chapter 9, Optimization - Going Down the Hill with Gradient Descent, is a unique chapter that walks you through optimization as it applies to machine learning. It starts with closed-form formulas and quadratic function optimization (for example, of a cost function), and moves on to using Gradient Descent (GD) to solve a regression problem from scratch. The chapter helps the reader look under the hood by developing their skill set with Scala code, while providing an in-depth explanation of how to code and understand Gradient Descent from scratch. The chapter concludes by using Spark's ML API to achieve the same results that we coded from scratch.
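
The from-scratch idea boils down to repeatedly stepping against the gradient of the cost function; here is a compressed sketch in plain Scala (ours; the data, learning rate, and iteration count are made up, and no Spark is required):

// Batch Gradient Descent fitting y = w * x + b by minimizing mean squared error
object GradientDescentSketch extends App {
  val xs = Array(1.0, 2.0, 3.0, 4.0)
  val ys = Array(3.0, 5.0, 7.0, 9.0) // generated by y = 2x + 1
  val n = xs.length
  val lr = 0.05 // learning rate

  var w = 0.0
  var b = 0.0
  for (_ <- 1 to 1000) {
    // Gradients of MSE with respect to w and b
    val errors = xs.zip(ys).map { case (x, y) => (w * x + b) - y }
    val gradW = 2.0 / n * errors.zip(xs).map { case (e, x) => e * x }.sum
    val gradB = 2.0 / n * errors.sum
    w -= lr * gradW
    b -= lr * gradB
  }
  println(f"w = $w%.3f, b = $b%.3f") // should approach w = 2, b = 1
}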

Chapter 10, Building Machine Learning Systems with Decision Tree and Ensemble Models, covers the Tree and Ensemble models for classification and regression in depth, using Spark’s machine learning library. We use three real-world data sets to explore classification and regression problems using Decision Trees, Random Forests, and Gradient-Boosted Trees. The chapter provides an in-depth explanation of these methods, in addition to plug-and-play code recipes that explore Apache Spark’s machine learning library step by step.
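
As a flavor of the ensemble API, here is a sketch of ours (tiny made-up dataset) training a random forest and inspecting its feature importances:

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object RandomForestSketch extends App {
  val spark = SparkSession.builder.master("local[*]").appName("RandomForest").getOrCreate()

  // Tiny labeled dataset: class 1.0 when the first feature is large
  val data = spark.createDataFrame(Seq(
    (0.0, Vectors.dense(0.1, 1.0)),
    (0.0, Vectors.dense(0.3, 0.5)),
    (1.0, Vectors.dense(5.0, 0.2)),
    (1.0, Vectors.dense(6.0, 0.9))
  )).toDF("label", "features")

  val rf = new RandomForestClassifier().setNumTrees(20).setMaxDepth(3)
  val model = rf.fit(data)

  println(s"feature importances: ${model.featureImportances}")
  model.transform(data).select("label", "prediction").show()

  spark.stop()
}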

Chapter 11, The Curse of High-Dimensionality in Big Data, demystifies the art and science of dimensionality reduction and provides complete coverage of the Spark ML/MLlib facilities for this important concept in machine learning at scale. The chapter provides sufficient, in-depth coverage of the theory (the what and the why) and then proceeds to cover the two fundamental techniques available in Spark (the how) for the reader to use. The chapter covers Singular Value Decomposition (SVD), which relates well to the second chapter, and then proceeds to examine Principal Component Analysis (PCA) in depth with code and write-ups.
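
PCA, for example, is exposed as an ordinary feature transformer; this sketch of ours (made-up vectors) projects five-dimensional features down to two principal components:

import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object PcaSketch extends App {
  val spark = SparkSession.builder.master("local[*]").appName("PCA").getOrCreate()

  val df = spark.createDataFrame(Seq(
    Tuple1(Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)),
    Tuple1(Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)),
    Tuple1(Vectors.dense(1.0, 0.0, 1.0, 2.0, 2.0))
  )).toDF("features")

  // Fit the projection, then reduce each row from 5 dimensions to 2
  val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(2).fit(df)
  pca.transform(df).select("pcaFeatures").show(truncate = false)

  spark.stop()
}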

Chapter 12, Implementing Text Analytics with Spark 2.0 ML Library, covers the various techniques available in Spark for implementing text analytics at scale. It provides a comprehensive treatment, starting from the basics, such as Term Frequency (TF), and similarity techniques, such as Word2Vec, and moves on to analyzing a complete dump of Wikipedia for a real-life Spark ML project. The chapter concludes with an in-depth discussion and code for implementing Latent Semantic Analysis (LSA) and Topic Modeling with Latent Dirichlet Allocation (LDA) in Spark.
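
The TF end of that spectrum can be sketched in a few lines by chaining Tokenizer, HashingTF, and IDF (a sketch of ours; the toy sentences are made up):

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

object TfIdfSketch extends App {
  val spark = SparkSession.builder.master("local[*]").appName("TFIDF").getOrCreate()

  val docs = spark.createDataFrame(Seq(
    (0, "spark makes big data simple"),
    (1, "machine learning with spark"),
    (2, "text analytics at scale")
  )).toDF("id", "sentence")

  // sentence -> words -> hashed term frequencies -> TF-IDF weights
  val words = new Tokenizer().setInputCol("sentence").setOutputCol("words").transform(docs)
  val tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
    .setNumFeatures(4096).transform(words)
  val tfidf = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)

  tfidf.transform(tf).select("id", "features").show(truncate = false)

  spark.stop()
}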

Chapter 13, Spark Streaming and Machine Learning Library, starts by providing an introduction to, and the future direction of, Spark Streaming, and then proceeds to provide recipes for both RDD-based (DStream) and structured streaming to establish a baseline. The chapter then covers all the streaming ML algorithms available in Spark at the time of writing this book. It provides code showing how to implement streaming DataFrames and streaming Datasets, and then covers queueStream for debugging before going into Streaming KMeans (unsupervised learning) and streaming linear models, such as Linear and Logistic Regression, using real-world datasets.
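
To give a flavor of those pieces working together, here is a sketch of ours that feeds a debugging queueStream into a StreamingKMeans model (the vectors, batch interval, and timeout are made up):

import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object StreamingKMeansSketch extends App {
  val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingKMeans")
  val ssc = new StreamingContext(conf, Seconds(1))

  // queueStream lets us push pre-built RDDs into the stream, which is handy for debugging
  val queue = mutable.Queue(ssc.sparkContext.parallelize(Seq(
    Vectors.dense(0.0, 0.1), Vectors.dense(9.0, 9.1)
  )))
  val trainingStream = ssc.queueStream(queue)

  // Maintain two cluster centers, updated as each micro-batch arrives
  val model = new StreamingKMeans()
    .setK(2)
    .setDecayFactor(1.0)
    .setRandomCenters(2, 0.0) // dimension 2, initial weight 0.0

  model.trainOn(trainingStream)

  ssc.start()
  ssc.awaitTerminationOrTimeout(5000) // run briefly, then shut down
  ssc.stop()
}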

What you need for this book

To execute the recipes in this book, you need a system running Windows 7 or later, or Mac OS X 10.x, with the following software installed:

  • Apache Spark 2.x
  • Oracle JDK SE 1.8.x
  • JetBrains IntelliJ IDEA Community Edition 2016.2.x or later
  • Scala plug-in for IntelliJ IDEA 2016.2.x
  • JFreeChart 1.0.19
  • breeze-core 0.12
  • Cloud9 1.5.0 JAR
  • Bliki-core 3.0.19
  • hadoop-streaming 2.2.0
  • JCommon 1.0.23
  • Lucene-analyzers-common 6.0.0
  • Lucene-core 6.0.0
  • Spark-streaming-flume-assembly 2.0.0
  • Spark-streaming-kafka-assembly 2.0.0

The hardware requirements for this software are mentioned in the software list provided with the code bundle of this book.

Who this book is for

This book is for Scala developers with fairly good exposure to and understanding of machine learning techniques, but who lack practical experience implementing those techniques with Spark. Solid knowledge of machine learning algorithms is assumed, as well as some hands-on experience implementing ML algorithms with Scala. However, you do not need to be acquainted with the Spark ML libraries and ecosystem.

Sections

In this book, you will find several headings that appear frequently (Getting ready, How to do it…, How it works…, There's more…, and See also). To give clear instructions on how to complete a recipe, we use these sections as follows:

Getting ready

This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.

How to do it…

This section contains the steps required to follow the recipe.

How it works…

This section usually consists of a detailed explanation of what happened in the previous section.

There's more…

This section contains additional information about the recipe, designed to give the reader deeper knowledge of it.

See also

This section provides helpful links to other useful information for the recipe.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Mac users note that we installed Spark 2.0 in the /Users/USERNAME/spark/spark-2.0.0-bin-hadoop2.7/ directory on a Mac machine."

A block of code is set as follows:

object HelloWorld extends App {
  println("Hello World!")
}

Any command-line input or output is written as follows:

 mysql -u root -p

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Configure Global Libraries. Select Scala SDK as your global library."

Note

Warnings or important notes appear like this.

Note

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book, what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:

  1. Log in or register to our website using your e-mail address and password.
  2. Hover the mouse pointer on the SUPPORT tab at the top.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box.
  5. Select the book for which you're looking to download the code files.
  6. Choose from the drop-down menu where you purchased this book from.
  7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account. Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Apache-Spark-2x-Machine-Learning-Cookbook. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

 

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books, whether in the text or in the code, we would be grateful if you could report it to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to the list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.