Apache Spark 2 for Beginners

By: Rajanarayanan Thottuvaikkatumana

Overview of this book

Spark is one of the most widely used engines for fast, large-scale data processing. Its tools are equally useful to application developers and data scientists. This book starts with the fundamentals of Spark 2 and covers the core data processing framework and API, installation, and application development setup. The Spark programming model is then introduced through real-world examples, followed by Spark SQL programming with DataFrames. An introduction to SparkR comes next. Later, we cover the charting and plotting features of Python in conjunction with Spark data processing. After that, we look at Spark's stream processing, machine learning, and graph processing libraries. The last chapter combines the skills you learned in the preceding chapters to develop a real-world Spark application. By the end of this book, you will have the knowledge you need to develop efficient large-scale applications using Apache Spark.

Credits

About the Author

About the Reviewer

www.PacktPub.com

Preface

Spark Fundamentals

An overview of Apache Hadoop

Understanding Apache Spark

Installing Spark on your machines

References

Summary

Spark Programming Model

Functional programming with Spark

Understanding Spark RDD

Data transformations and actions with RDDs

Monitoring with Spark

The basics of programming with Spark

Creating RDDs from files

Understanding the Spark library stack

Reference

Summary

Spark SQL

Understanding the structure of data

Why Spark SQL?

Anatomy of Spark SQL

DataFrame programming

Understanding Aggregations in Spark SQL

Understanding multi-datasource joining with SparkSQL

Introducing datasets

Understanding Data Catalogs

References

Summary

Spark Programming with R

The need for SparkR

Basics of the R language

DataFrames in R and Spark

Spark DataFrame programming with R

Understanding aggregations in Spark R

Understanding multi-datasource joins with SparkR

References

Summary

Spark Data Analysis with Python

Charting and plotting libraries

Setting up a dataset

Data analysis use cases

Charts and plots

References

Summary

Spark Stream Processing

Data stream processing

Micro batch data processing

A log event processor

Windowed data processing

More processing options

Kafka stream processing

Spark Streaming jobs in production

References

Summary

Spark Machine Learning

Understanding machine learning

Why Spark for machine learning?

Wine quality prediction

Summary

Spark Graph Processing

Understanding graphs and their usage

The Spark GraphX library

Tennis tournament analysis

Applying the PageRank algorithm

Connected component algorithm

Understanding GraphFrames

Understanding GraphFrames queries

References

Summary

Designing Spark Applications

Lambda Architecture

Microblogging with Lambda Architecture

Implementing Lambda Architecture

Working with Spark applications

Coding style

Setting up the source code

Understanding data ingestion

Generating purposed views and queries

Understanding custom data processes

References

Summary

Data transformations and actions with RDDs

Spark processes data using RDDs. Data is read from a source, such as text files or NoSQL data stores, to form RDDs. Various data transformations are then performed on such an RDD, and finally the result is collected. To be precise, Spark provides transformations and actions that operate on RDDs. Consider the following RDD capturing a list of retail banking transactions, which is of the type RDD[(String, String, Double)]:

AccountNo	TranNo	TranAmount
SB001	TR001	250.00
SB002	TR004	450.00
SB003	TR010	120.00
SB001	TR012	-120.00
SB001	TR015	-10.00
SB003	TR020	100.00

To calculate the account level summary of the transactions from the RDD of the form (AccountNo,TranNo,TranAmount):

First, it has to be transformed into key-value pairs of the form (AccountNo,TranAmount), where AccountNo is the key; multiple elements can share the same key.
On this key, do a summation operation on...
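
The two steps above can be sketched in plain Python, without a Spark installation, to make the data flow concrete. This is only an illustration under stated assumptions: the second step is truncated in the text, so the sketch assumes it sums TranAmount for each AccountNo. The helper names to_key_value and sum_by_key are hypothetical and are not part of the Spark API.

```python
from collections import defaultdict

# Sample transactions from the table above: (AccountNo, TranNo, TranAmount)
transactions = [
    ("SB001", "TR001", 250.00),
    ("SB002", "TR004", 450.00),
    ("SB003", "TR010", 120.00),
    ("SB001", "TR012", -120.00),
    ("SB001", "TR015", -10.00),
    ("SB003", "TR020", 100.00),
]


def to_key_value(tran):
    """Transformation step: keep only (AccountNo, TranAmount)."""
    account_no, _tran_no, amount = tran
    return (account_no, amount)


def sum_by_key(pairs):
    """Summation step (assumed): add up the amounts sharing an account number."""
    totals = defaultdict(float)
    for key, amount in pairs:
        totals[key] += amount
    return dict(totals)


# Step 1: (AccountNo, TranNo, TranAmount) -> (AccountNo, TranAmount)
pairs = [to_key_value(t) for t in transactions]

# Step 2: one total per account
account_summary = sum_by_key(pairs)

for account_no, total in sorted(account_summary.items()):
    print(account_no, total)
# prints:
# SB001 120.0
# SB002 450.0
# SB003 220.0
```

With a running SparkContext (assumed here to be named sc), the same idea would roughly correspond to sc.parallelize(transactions).map(lambda t: (t[0], t[2])).reduceByKey(lambda a, b: a + b).collect(), where map and reduceByKey are transformations and collect is the action that returns the result.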
