Apache Spark 2: Data Processing and Real-Time Analytics

By: Romeo Kienzler, Md. Rezaul Karim, Sridhar Alla, Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei

Overview of this book

Apache Spark is an in-memory, cluster-based data processing system that provides a wide range of functionality such as big data processing, analytics, machine learning, and more. With this Learning Path, you can take your knowledge of Apache Spark to the next level by learning how to expand Spark's functionality and build your own data flow and machine learning programs on this platform. You will work with the different modules in Apache Spark, such as interactive querying with Spark SQL, using DataFrames and Datasets, implementing streaming analytics with Spark Streaming, and applying machine learning and deep learning techniques on Spark using MLlib and various external tools. By the end of this Learning Path, you will have the knowledge you need to master Apache Spark and build your own big data processing and analytics pipeline. This Learning Path includes content from the following Packt products:
• Mastering Apache Spark 2.x by Romeo Kienzler
• Scala and Spark for Big Data Analytics by Md. Rezaul Karim, Sridhar Alla
• Apache Spark 2.x Machine Learning Cookbook by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, Shuen Mei

Performance

Before moving on to the rest of the chapters, which cover the functional areas of Apache Spark and its extensions, we will examine performance. What issues and areas need to be considered? What might impact Spark application performance, from the cluster level down to the actual Scala code? We don't want to just repeat what the Spark website says, so take a look at this URL: http://spark.apache.org/docs/<version>/tuning.html.

Here, <version> relates to the version of Spark that you are using; that is, either the latest or something like 1.6.1 for a specific version. So, having looked at this page, we will briefly mention some of the topic areas. We will list some general points in this section without implying an order of importance.

The cluster structure

The size and structure of your big data cluster will affect performance. If you have a cloud-based cluster, your I/O and latency will suffer in comparison to an unshared hardware cluster, because you share the underlying hardware with multiple customers and the cluster hardware may be remote. There are some exceptions to this. The IBM cloud, for instance, offers dedicated bare-metal high-performance cluster nodes with an InfiniBand network connection, which can be rented on an hourly basis.

Additionally, the positioning of cluster components on servers may cause resource contention. For instance, think carefully about where to locate Hadoop NameNodes, Spark servers, Zookeeper, Flume, and Kafka servers in large clusters. With high workloads, you might consider segregating servers onto individual systems. You might also consider using an Apache system such as Mesos that provides better distribution and assignment of resources to the individual processes.

Consider potential parallelism as well. For large datasets, the greater the number of workers in your Spark cluster, the greater the opportunity for parallelism. One rule of thumb is one worker per hyper-thread or virtual core.

Hadoop Distributed File System

You might consider using an alternative to HDFS, depending upon your cluster requirements. For instance, IBM has GPFS (General Parallel File System) for improved performance.

GPFS might be a better choice because it comes from a high-performance computing background and has full read-write capability, whereas HDFS is designed as a write-once, read-many filesystem. It offers an improvement in performance over HDFS because it runs at the kernel level, whereas HDFS runs in a Java Virtual Machine (JVM) that in turn runs as an operating system process. It also integrates with Hadoop and the Spark cluster tools. IBM runs setups with several hundred petabytes using GPFS.

Another commercial alternative is the MapR file system that, besides performance improvements, supports mirroring, snapshots, and high availability.

Ceph is an open source alternative. Like HDFS, it is a distributed, fault-tolerant, and self-healing filesystem for commodity hard drives. It runs in the Linux kernel as well and addresses many of the performance issues that HDFS has. Other promising candidates in this space are Alluxio (formerly Tachyon), Quantcast File System (QFS), GlusterFS, and Lustre.

Finally, Cassandra is not a filesystem but a NoSQL key-value store. It is tightly integrated with Apache Spark and is therefore regarded as a valid and powerful alternative to HDFS, or even to any other distributed filesystem, especially as it supports predicate push-down using Apache SparkSQL and the Catalyst optimizer, which we will cover in the following chapters.

Data locality

The key to good data processing performance is the avoidance of network transfers. This was very true a couple of years ago. It is less relevant today for tasks with high CPU demand and low I/O demand, but it still holds for data processing algorithms with low CPU demand and high I/O demand.

Note

We can conclude from this that HDFS is one of the best ways to achieve data locality, as chunks of files are distributed across the cluster nodes, in most cases using hard drives directly attached to the server systems. This means that those chunks can be processed in parallel using the CPUs on the machines where the individual data chunks are located, which avoids network transfer.

Another way to achieve data locality is using Apache SparkSQL. Depending on the connector implementation, SparkSQL can make use of the data processing capabilities of the source engine. So, for example, when using MongoDB in conjunction with SparkSQL, parts of the SQL statement are preprocessed by MongoDB before data is sent upstream to Apache Spark.
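As a rough illustration of letting the source do part of the work, the following PySpark sketch filters a dataset immediately after reading it and prints the query plan. It uses a Parquet dataset as a stand-in, because the MongoDB connector is a separate package; the function name, the path, and the year column are assumptions made for this example only.

```python
def load_recent_events(spark, path, min_year):
    """Read a Parquet dataset and keep only rows from min_year onwards.

    `spark` is an existing SparkSession. The path and the `year`
    column are hypothetical placeholders for this sketch.
    """
    events = spark.read.parquet(path)
    # Filtering right after the read lets Catalyst offer the predicate
    # to the data source, so the source can skip data it does not need.
    recent = events.filter(f"year >= {int(min_year)}")
    # Prints the physical plan to stdout. For sources that support
    # push-down, the predicate usually appears under "PushedFilters".
    recent.explain()
    return recent
```

Whether a predicate is actually pushed down depends on the connector, so checking the printed plan is more reliable than assuming it.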

Memory

In order to avoid OOM (Out of Memory) messages for the tasks on your Apache Spark cluster, consider the following questions when tuning:

Consider the level of physical memory available on your Spark worker nodes. Can it be increased? Check on the memory consumption of operating system processes during high workloads in order to get an idea of free memory. Make sure that the workers have enough memory.
Consider data partitioning. Can you increase the number of partitions? As a rule of thumb, you should have at least as many partitions as you have available CPU cores on the cluster. Use the repartition function on the RDD API.
Can you modify the storage fraction and the memory used by the JVM for storage and caching of RDDs? Workers are competing for memory against data storage. Use the Storage page on the Apache Spark user interface to see if this fraction is set to an optimal value. Then update the following properties:
spark.memory.fraction
spark.memory.storageFraction
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size
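The partitioning and memory points above can be sketched in PySpark as follows. This is a minimal illustration, not a tuning recommendation: the helper names are made up for this example, and suitable values depend on your workload.

```python
def ensure_min_partitions(rdd, total_cores):
    """Repartition `rdd` if it has fewer partitions than CPU cores."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    if rdd.getNumPartitions() < total_cores:
        # repartition() triggers a full shuffle, so only do it when needed.
        return rdd.repartition(total_cores)
    return rdd


def memory_settings(fraction, storage_fraction, off_heap_bytes=0):
    """Build the memory-related Spark properties as a dict of strings."""
    for name, value in (("fraction", fraction),
                        ("storage_fraction", storage_fraction)):
        if not 0.0 < value < 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    settings = {
        "spark.memory.fraction": str(fraction),
        "spark.memory.storageFraction": str(storage_fraction),
    }
    if off_heap_bytes > 0:
        # Off-heap memory needs both the switch and a positive size.
        settings["spark.memory.offHeap.enabled"] = "true"
        settings["spark.memory.offHeap.size"] = str(off_heap_bytes)
    return settings


def apply_settings(builder, settings):
    """Apply each property to a SparkSession builder and return it."""
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, a session could then be created with apply_settings(SparkSession.builder.appName("tuning-demo"), memory_settings(0.6, 0.5)).getOrCreate(). The values 0.6 and 0.5 are the documented defaults in Spark 2.x, and the application name is a placeholder.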

In addition, the following two things can be done in order to improve performance:

Consider using Parquet as a storage format, which is much more storage-efficient than CSV or JSON
Consider using the DataFrame/Dataset API instead of the RDD API, as it might result in more efficient execution (more about this in the next three chapters)

Coding

Try to tune your code to improve Spark application performance. For instance, filter your application-based data early in your ETL cycle. One example: when using raw HTML files, detag them and crop away unneeded parts at an early stage. Tune your degree of parallelism, try to find the resource-expensive parts of your code, and find alternatives.
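The detagging step mentioned above might look like the following sketch, which uses only the Python standard library for the parsing. The names detag and clean_pages are hypothetical, and the approach (dropping script and style content, joining text nodes with spaces) is one simple choice among many.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text and ignore script and style content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def detag(html):
    """Return the visible text of an HTML document as one string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def clean_pages(pages):
    """Detag an RDD of raw HTML strings and drop pages left empty."""
    return pages.map(detag).filter(lambda text: len(text) > 0)
```

Applying clean_pages directly after loading the raw files means that every later stage shuffles and caches only the text that is actually needed.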

Note

ETL is one of the first things you do in an analytics project. You grab data from third-party systems, either by directly accessing relational or NoSQL databases or by reading exports in various file formats such as CSV, TSV, JSON, or even more exotic ones, from local or remote filesystems or from a staging area in HDFS. After some inspections and sanity checks on the files, an ETL process in Apache Spark basically reads in the files and creates RDDs or DataFrames/Datasets out of them.

They are transformed so that they fit the downstream analytics application, running on top of Apache Spark or other applications, and then stored back into filesystems as JSON, CSV, or Parquet files, or even back into relational or NoSQL databases.
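A minimal PySpark sketch of such an ETL step is shown here, assuming a CSV export with a header row. The function name, the paths, and the column list are placeholders, and dropping incomplete rows is only one example of a sanity check.

```python
def run_etl(spark, source_path, target_path, columns):
    """Read a CSV export, keep the needed columns, and store Parquet.

    `spark` is an existing SparkSession. The paths and column names
    are supplied by the caller and are hypothetical in this sketch.
    """
    # Extract: read the export and let Spark infer the column types.
    raw = spark.read.csv(source_path, header=True, inferSchema=True)
    # Transform: project and drop incomplete rows early so that later
    # stages move less data.
    cleaned = raw.select(*columns).dropna()
    # Load: store the result in a columnar format for downstream jobs.
    cleaned.write.mode("overwrite").parquet(target_path)
    return cleaned
```

The "overwrite" mode replaces any existing output at the target path, which suits repeatable batch runs but should be chosen deliberately.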

Note

Finally, I can recommend the following resource for any performance-related problems with Apache Spark: https://spark.apache.org/docs/latest/tuning.html.
