
An overview of Apache Hadoop


Apache Hadoop is an open source software framework designed from the ground up to store data across a cluster of computers and to process that data in a distributed fashion. The framework comes with a distributed filesystem for data storage, namely the Hadoop Distributed File System (HDFS), and a data processing framework, namely MapReduce. The design of HDFS was inspired by the Google research paper The Google File System, and MapReduce is based on the Google research paper MapReduce: Simplified Data Processing on Large Clusters.

Hadoop was adopted by organizations in a big way, with many of them implementing huge Hadoop clusters for data processing. It saw tremendous growth from Hadoop MapReduce version 1 (MRv1) to Hadoop MapReduce version 2 (MRv2). From a pure data processing perspective, MRv1 consisted of HDFS and MapReduce as its core components. Many applications, generally called SQL-on-Hadoop applications, such as Hive and Pig, were stacked on top of the MapReduce framework. Even though these applications are separate Apache projects, as a suite they provide great value.

The Yet Another Resource Negotiator (YARN) project came to the fore to let computing frameworks other than MapReduce run on the Hadoop ecosystem. With YARN introduced on top of HDFS, and below MapReduce from a component architecture layering perspective, users could write their own applications to run on YARN and HDFS and make use of the distributed data storage and data processing capabilities of the Hadoop ecosystem. In other words, the newly overhauled MapReduce version 2 (MRv2) became just one of the application frameworks sitting on top of HDFS and YARN.

Figure 1 gives a brief idea about these components and how they are stacked together:

Figure 1

MapReduce is a generic data processing model. The data processing goes through two steps, namely the map step and the reduce step. In the map step, the input data is divided into a number of smaller parts so that each one of them can be processed independently. Once the map step is completed, its output is consolidated and the final result is generated in the reduce step. In a typical word count example, creating key-value pairs with each word as the key and the value 1 is the map step. Sorting these pairs on the key and summing the values of the pairs with the same key fall into an intermediate combine step. Producing the pairs containing the unique words and their occurrence counts is the reduce step.
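The following plain Python sketch, a toy illustration rather than Hadoop code (the sample input is made up), walks a small dataset through these three phases of the word count:

from collections import defaultdict

# Toy input: each string stands in for one independently processed split
splits = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map step: emit a (word, 1) pair for every word in every split
mapped = []
for split in splits:
    for word in split.split():
        mapped.append((word, 1))

# Intermediate combine step: group the pairs by their key (the word)
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce step: sum the values for each unique word
word_counts = {word: sum(ones) for word, ones in grouped.items()}

print(word_counts)  # {'the': 3, 'quick': 2, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 2}

In a real Hadoop job, the map and reduce steps run as many parallel tasks across the cluster, and the framework takes care of splitting the input and shuffling the intermediate pairs between them.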

From an application programming perspective, the basic ingredients for an over-simplified MapReduce application are as follows:

  • Input location

  • Output location

  • A map function, implemented for the data processing need using the appropriate interfaces and classes from the MapReduce library

  • A reduce function, implemented for the data processing need using the appropriate interfaces and classes from the MapReduce library

The MapReduce job is submitted to run in Hadoop and, once the job is completed, the output can be collected from the specified output location.
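To make these ingredients concrete, here is a hypothetical sketch of the same word count written for Hadoop Streaming, a Hadoop facility that accepts the map and reduce functions as scripts reading standard input instead of Java classes from the MapReduce library; the file names mapper.py and reducer.py are illustrative:

# mapper.py (a separate file): reads lines of text from standard input and
# emits one tab-separated "word 1" pair per word on standard output.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

# reducer.py (a separate file): Hadoop Streaming sorts the mapper output by key,
# so all counts for a given word arrive contiguously and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Such a job is typically submitted with the Hadoop Streaming jar bundled with the distribution, passing the input location, the output location, and the two scripts as the mapper and the reducer; the exact jar name and path depend on the Hadoop installation.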

This two-step process of dividing a MapReduce data processing job into map and reduce tasks was highly effective and turned out to be a perfect fit for many batch data processing use cases. There are a lot of Input/Output (I/O) operations against the disk happening under the hood during the whole process. Even in the intermediate steps of a MapReduce job, when the internal data structures fill up with data or when the tasks are completed beyond a certain percentage, data is written to the disk. Because of this, the subsequent steps of the MapReduce job have to read it back from the disk.

Another big challenge comes when multiple MapReduce jobs have to be completed in a chained fashion, in other words, when a big data processing task is accomplished by two MapReduce jobs such that the output of the first job is the input of the second. In this situation, whatever the size of the first job's output, it has to be written to the disk before the second job can use it as its input. So even in this simple case, there is a definite, and unnecessary, write operation.

In many batch data processing use cases, these I/O operations are not a big issue. As long as the results are reliable, latency is tolerated. But the biggest challenge comes with real-time data processing: the huge amount of I/O involved in MapReduce jobs makes them unsuitable for real-time data processing with the lowest possible latency.