Apache Spark 2: Data Processing and Real-Time Analytics

By: Romeo Kienzler, Md. Rezaul Karim, Sridhar Alla, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei

Overview of this book

Apache Spark is an in-memory, cluster-based data processing system that provides a wide range of functionalities such as big data processing, analytics, machine learning, and more. With this Learning Path, you can take your knowledge of Apache Spark to the next level by learning how to expand Spark's functionality and building your own data flow and machine learning programs on this platform. You will work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and datasets, implementing streaming analytics with Spark Streaming, and applying machine learning and deep learning techniques on Spark using MLlib and various external tools. By the end of this elaborately designed Learning Path, you will have all the knowledge you need to master Apache Spark, and build your own big data processing and analytics pipeline quickly and without any hassle.

This Learning Path includes content from the following Packt products:

  • Mastering Apache Spark 2.x by Romeo Kienzler
  • Scala and Spark for Big Data Analytics by Md. Rezaul Karim and Sridhar Alla
  • Apache Spark 2.x Machine Learning Cookbook by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei
Table of Contents (23 chapters)

Preface

Apache Spark is an in-memory, cluster-based data processing system that provides a wide range of functionalities such as big data processing, analytics, machine learning, and more. With this Learning Path, you can take your knowledge of Apache Spark to the next level by learning how to expand Spark's functionality and building your own data flow and machine learning programs on this platform. You will work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and datasets, implementing streaming analytics with Spark Streaming, and applying machine learning and deep learning techniques on Spark using MLlib and various external tools. By the end of this elaborately designed Learning Path, you will have all the knowledge you need to master Apache Spark, and build your own big data processing and analytics pipeline quickly and without any hassle.

Who This Book Is For

If you are an intermediate-level Spark developer looking to master the advanced capabilities and use-cases of Apache Spark 2.x, this Learning Path is ideal for you. Big data professionals who want to learn how to integrate and use the features of Apache Spark and build a strong big data pipeline will also find this Learning Path useful. To grasp the concepts explained in this Learning Path, you must know the fundamentals of Apache Spark and Scala.

What This Book Covers

Chapter 1, A First Taste and What's New in Apache Spark V2, provides an overview of Apache Spark, the functionality that is available within its modules, and how it can be extended. It covers the tools available in the Apache Spark ecosystem outside the standard Apache Spark modules for processing and storage. It also provides tips on performance tuning.

Chapter 2, Apache Spark Streaming, talks about continuous applications using Apache Spark Streaming. You will learn how to incrementally process data and create actionable insights.

Chapter 3, Structured Streaming, talks about Structured Streaming – a new way of defining continuous applications using the DataFrame and Dataset APIs.

Chapter 4, Apache Spark MLlib, introduces you to MLlib, the de facto standard for machine learning when using Apache Spark.  

Chapter 5, Apache SparkML, introduces you to the DataFrame-based machine learning library of Apache Spark: the new first-class citizen when it comes to high performance and massively parallel machine learning. 

Chapter 6, Apache SystemML, introduces you to Apache SystemML, another machine learning library capable of running on top of Apache Spark and incorporating advanced features such as a cost-based optimizer, hybrid execution plans, and low-level operator re-writes.

Chapter 7, Apache Spark GraphX, talks about Graph processing with Scala using GraphX. You will learn some basic and also advanced graph algorithms and how to use GraphX to execute them.

Chapter 8, Spark Tuning, digs deeper into Apache Spark internals. While Spark is great at making us feel as if we are using just another Scala collection, we shouldn't forget that Spark actually runs on a distributed system. Therefore, throughout this chapter, we will cover how to monitor Spark jobs, Spark configuration, common mistakes in Spark app development, and some optimization techniques.
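The collection analogy is worth making concrete: the functional operations you chain on a local Scala Seq look almost identical to the RDD API, except that Spark evaluates them lazily and distributes them across a cluster. A minimal local sketch (plain Scala, no Spark dependency; the hypothetical wordCount helper is ours, not from the book):

```scala
object CollectionAnalogy {
  // Word count over a local Seq. With Spark, the same chain would read
  // sc.textFile(...).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _),
  // but it would execute lazily and in parallel across partitions.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" "))                 // tokenize each line
      .groupBy(identity)                     // group equal words together
      .map { case (w, ws) => (w, ws.size) }  // count each group

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("spark makes big data simple", "spark runs on a cluster")))
}
```

The key difference the chapter explores is precisely what this sketch hides: on a cluster, each of these steps has a shuffle, serialization, and memory cost that a local collection never exposes.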

Chapter 9, Testing and Debugging Spark, explains how difficult it can be to test a distributed application, and then presents some ways to tackle this. We will cover how to test and debug Spark applications in a distributed environment.

Chapter 10, Practical Machine Learning with Spark Using Scala, covers installing and configuring a real-life development environment with machine learning and programming with Apache Spark. Using screenshots, it walks you through downloading, installing, and configuring Apache Spark and IntelliJ IDEA along with the necessary libraries that would reflect a developer’s desktop in a real-world setting. It then proceeds to identify and list over 40 data repositories with real-world datasets that can help the reader in experimenting and advancing even further with the code recipes. In the final step, we run our first ML program on Spark and then provide directions on how to add graphics to your machine learning programs, which are used in the subsequent chapters.

 

Chapter 11, Spark’s Three Data Musketeers for Machine Learning - Perfect Together, provides an end-to-end treatment of the three pillars of resilient distributed data manipulation and wrangling in Apache Spark. The chapter comprises detailed recipes covering RDDs, DataFrames, and Datasets from a practitioner’s point of view. Through an exhaustive list of 17 recipes, examples, references, and explanations, it lays the foundation for a successful career in machine learning sciences. The chapter provides both functional (code) and declarative (SQL interface) programming approaches to solidify the knowledge base, reflecting the real demands of a successful Spark ML engineer at tier-1 companies.

Chapter 12, Common Recipes for Implementing a Robust Machine Learning System, covers and factors out the tasks that are common in most machine learning systems through 16 short but to-the-point code recipes that the reader can use in their own real-world systems. It covers a gamut of techniques, ranging from normalizing data to evaluating the model output, using best practice metrics via Spark’s ML/MLlib facilities that might not be readily visible to the reader. It is a combination of recipes that we use in our day-to-day jobs in most situations but is listed separately to save on space and complexity of other recipes.

Chapter 13, Recommendation Engine that Scales with Spark, covers how to explore your dataset and build a movie recommendation engine using Spark’s ML library facilities. It uses a large dataset and several recipes, in addition to figures and write-ups, to explore the various methods of recommenders before going deep into collaborative filtering techniques in Spark.

Chapter 14, Unsupervised Clustering with Apache Spark 2.0, covers the techniques used in unsupervised learning, such as K-Means, Gaussian Mixture with Expectation Maximization (EM), Power Iteration Clustering (PIC), and Latent Dirichlet Allocation (LDA), while also covering the why and how to help the reader understand the core concepts. Using Spark Streaming, the chapter commences with a real-time K-Means clustering recipe to classify the input stream into labeled classes via unsupervised means.
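Independently of Spark's distributed implementation, the core of K-Means is just two alternating steps: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. A minimal single-machine sketch of one such iteration over 1-D points, in plain Scala (the object and method names are ours, for illustration only; Spark's KMeans parallelizes these same steps over partitions):

```scala
object KMeansSketch {
  // One Lloyd iteration: assign each point to its nearest centroid,
  // then return the mean of each resulting cluster as the new centroids.
  def step(points: Seq[Double], centroids: Seq[Double]): Seq[Double] =
    points
      .groupBy(p => centroids.minBy(c => math.abs(c - p))) // assignment step
      .values
      .map(cluster => cluster.sum / cluster.size)          // update step
      .toSeq
      .sorted

  def main(args: Array[String]): Unit = {
    val pts = Seq(1.0, 2.0, 3.0, 9.0, 10.0, 11.0)
    println(step(pts, Seq(0.0, 12.0))) // moves the centroids toward 2.0 and 10.0
  }
}
```

Iterating step until the centroids stop moving is the whole algorithm; the distributed version differs mainly in where the points live and how the per-cluster sums are aggregated.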

Chapter 15, Implementing Text Analytics with Spark 2.0 ML Library, covers the various techniques available in Spark for implementing text analytics at scale. It provides a comprehensive treatment, starting from basics such as Term Frequency (TF) and similarity techniques such as Word2Vec, and moves on to analyzing a complete dump of Wikipedia for a real-life Spark ML project. The chapter concludes with an in-depth discussion and code for implementing Latent Semantic Analysis (LSA) and Topic Modeling with Latent Dirichlet Allocation (LDA) in Spark.
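As a concrete starting point, raw Term Frequency is simply the count of each token within a document (often later normalized or hashed into a fixed-size feature vector, as Spark's HashingTF does). A hypothetical plain-Scala version of the unnormalized count, for orientation only:

```scala
object TermFrequency {
  // Raw term frequency: number of occurrences of each token in one document.
  def tf(doc: String): Map[String, Int] =
    doc.toLowerCase
      .split("\\s+")                          // whitespace tokenization
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (t, ts) => (t, ts.length) } // count each token

  def main(args: Array[String]): Unit =
    println(tf("To be or not to be")) // to -> 2, be -> 2, or -> 1, not -> 1
}
```

The chapter's recipes build on this idea at scale, where the tokenization and counting run across a cluster rather than over a single string.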

Chapter 16, Spark Streaming and Machine Learning Library, starts by providing an introduction to and the future direction of Spark Streaming, and then proceeds to provide recipes for both RDD-based (DStream) and Structured Streaming to establish a baseline. The chapter then covers all the ML streaming algorithms available in Spark at the time of writing. It provides code showing how to implement streaming DataFrames and streaming Datasets, covers queueStream for debugging, and then moves on to Streaming K-Means (unsupervised learning) and streaming linear models such as linear and logistic regression, using real-world datasets.

To Get the Most out of This Book

Operating system: Linux distributions are preferable (including Debian, Ubuntu, Fedora, RHEL, and CentOS); to be more specific, for Ubuntu it is recommended to have a complete 14.04 (LTS) 64-bit (or later) installation, with VMware Player 12 or VirtualBox. You can also run Spark jobs on Windows (XP/7/8/10) or Mac OS X (10.4.7+).

Hardware configuration: Processor Core i3, Core i5 (recommended), or Core i7 (to get the best results); in any case, a multicore processor will provide faster data processing and scalability. You will need at least 8-16 GB RAM (recommended) for standalone mode and at least 32 GB RAM for a single VM, and more for a cluster. You will also need enough storage for running heavy jobs (depending on the dataset size you will be handling), preferably at least 50 GB of free disk storage (for standalone mode and for an SQL warehouse).

Along with this, you would require the following:

  • VirtualBox 5.1.22 or above
  • Hortonworks HDP Sandbox V2.6 or above
  • Eclipse Neon or above
  • Eclipse Scala Plugin
  • Eclipse Git Plugin
  • Spark 2.0.0 (or higher)
  • Hadoop 2.7 (or higher)
  • Java (JDK and JRE) 1.7+/1.8+
  • Scala 2.11.x (or higher)
  • Python 2.7+/3.4+
  • R 3.1+ and RStudio 1.0.143 (or higher)
  • Maven Eclipse plugin (2.9 or higher)
  • Maven compiler plugin for Eclipse (2.3.2 or higher)
  • Maven assembly plugin for Eclipse (2.4.1 or higher)
  • Oracle JDK SE 1.8.x
  • JetBrains IntelliJ Community Edition 2016.2.x or later
  • Scala plug-in for IntelliJ 2016.2.x
  • JFreeChart 1.0.19
  • breeze-core 0.12
  • Cloud9 1.5.0 JAR
  • Bliki-core 3.0.19
  • hadoop-streaming 2.2.0
  • JCommon 1.0.23
  • Lucene-analyzers-common 6.0.0
  • Lucene-core-6.0.0
  • Spark-streaming-flume-assembly 2.0.0
  • Spark-streaming-kafka-assembly 2.0.0

Download the Example Code Files

You can download the example code files for this book from your account at www.packt.com. If you purchased this book elsewhere, you can visit www.packt.com/support and register to have the files emailed directly to you.

You can download the code files by following these steps:

  1. Log in or register at www.packt.com.
  2. Select the SUPPORT tab.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box and follow the onscreen instructions.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR/7-Zip for Windows
  • Zipeg/iZip/UnRarX for Mac
  • 7-Zip/PeaZip for Linux

 

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Apache-Spark-2-Data-Processing-and-Real-Time-Analytics. In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Conventions Used

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in the text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The next lines of code read the link and assign it to the BeautifulSoup function."

A block of code is set as follows:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

Any command-line input or output is written as follows:

$ ./bin/spark-submit --class com.chapter11.RandomForestDemo \
--master spark://ip-172-31-21-153.us-west-2.compute:7077 \
--executor-memory 2G \
--total-executor-cores 2 \
file:///home/KMeans-0.0.1-SNAPSHOT.jar \
file:///home/mnist.bz2

Bold: New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Configure Global Libraries. Select Scala SDK as your global library."

 

 

Note

Warnings or important notes appear like this.

Note

Tips and tricks appear like this.

Get in Touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, mention the book title in the subject of your message and email us at [email protected].

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packt.com/submit-errata, select your book, click on the Errata Submission Form link, and enter the details.

Piracy: If you come across any illegal copies of our works in any form on the Internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Reviews

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!

For more information about Packt, please visit packt.com.