Machine Learning with Spark - Second Edition

By: Rajdeep Dua, Manpreet Singh Ghotra

Overview of this book

This book teaches popular machine learning algorithms and how they are implemented in Spark ML. You will start by installing Spark on a single node and on a multinode cluster. Next, you will see how to run Scala- and Python-based programs for Spark ML. You will then work with a few datasets to go deeper into clustering, classification, and regression. Toward the end, the book also covers text processing using Spark ML. Once you have learned these concepts, you can apply them to implement algorithms in green-field projects or to migrate existing systems, for example from Mahout or scikit-learn, to Spark ML. By the end of this book, you will have the skills to use Spark's features to build your own scalable machine learning applications and power a modern data-driven business.
Table of Contents (13 chapters)

What this book covers

Chapter 1, Getting Up and Running with Spark, shows how to install and set up a local development environment for the Spark framework, as well as how to create a Spark cluster in the cloud using Amazon EC2. The Spark programming model and API will be introduced and a simple Spark application will be created using Scala, Java, and Python.

Chapter 2, Math for Machine Learning, provides a mathematical introduction to machine learning. A solid grasp of the underlying math is important for understanding the inner workings of the algorithms and for getting the best results from them.

Chapter 3, Designing a Machine Learning System, presents an example of a real-world use case for a machine learning system. We will design a high-level architecture for an intelligent system in Spark based on this illustrative use case.

Chapter 4, Obtaining, Processing, and Preparing Data with Spark, details how to go about obtaining data for use in a machine learning system, in particular from various freely and publicly available sources. We will learn how to process, clean, and transform the raw data into features that may be used in machine learning models, using available tools, libraries, and Spark's functionality.

Chapter 5, Building a Recommendation Engine with Spark, deals with creating a recommendation model based on the collaborative filtering approach. This model will be used to recommend items to a given user, as well as create lists of items that are similar to a given item. Standard metrics to evaluate the performance of a recommendation model will be covered here.

Chapter 6, Building a Classification Model with Spark, details how to create a model for binary classification, as well as how to utilize standard performance-evaluation metrics for classification tasks.

Chapter 7, Building a Regression Model with Spark, shows how to create a model for regression, extending the classification model created in Chapter 6, Building a Classification Model with Spark. Evaluation metrics for the performance of regression models will be detailed here.

Chapter 8, Building a Clustering Model with Spark, explores how to create a clustering model and how to use related evaluation methodologies. You will learn how to analyze and visualize the clusters that are generated.

Chapter 9, Dimensionality Reduction with Spark, takes us through methods to extract the underlying structure from, and reduce the dimensionality of, our data. You will learn some common dimensionality-reduction techniques and how to apply and analyze them. You will also see how to use the resulting data representation as an input to another machine learning model.

Chapter 10, Advanced Text Processing with Spark, introduces approaches to deal with large-scale text data, including techniques for feature extraction from text and dealing with the very high-dimensional features typical in text data.

Chapter 11, Real-Time Machine Learning with Spark Streaming, provides an overview of Spark Streaming and how it fits in with online and incremental learning approaches for applying machine learning to data streams.

Chapter 12, Pipeline APIs for Spark ML, covers the Pipeline API, a uniform set of APIs built on top of DataFrames that helps users create and tune machine learning pipelines.