Apache Spark 2.x for Java Developers

By: Sourav Gulati, Sumit Kumar

Overview of this book

Apache Spark is the buzzword in the big data industry right now, especially with the increasing need for real-time streaming and data processing. While Spark is built on Scala, the Spark Java API exposes all the Spark features available in the Scala version for Java developers. This book will show you how to implement various functionalities of the Apache Spark framework in Java, without stepping out of your comfort zone. The book starts with an introduction to the Apache Spark 2.x ecosystem, then explains how to install and configure Spark, and refreshes the Java concepts that will be useful to you when consuming Apache Spark's APIs. You will explore RDDs and their common action and transformation Java APIs, set up a production-like clustered environment, and work with Spark SQL. Moving on, you will perform near-real-time processing with Spark Streaming, machine learning analytics with Spark MLlib, and graph processing with GraphX, all using various Java packages. By the end of the book, you will have a solid foundation in implementing components in the Spark framework in Java to build fast, real-time applications.

Spark components

Before moving any further, let's first understand the common terminology associated with Spark:

Driver: This is the main program that oversees the end-to-end execution of a Spark job or program. It negotiates resources with the resource manager of the cluster in order to delegate and orchestrate the program as the smallest possible data-local units of parallel work.
Executors: In any Spark job, there can be one or more executors, that is, processes that execute the smaller tasks delegated by the driver. The executors process the data, preferably data local to their node, and store the results in memory, on disk, or both.
Master: Apache Spark is implemented in a master-slave architecture, and hence master refers to the cluster node executing the driver program.
Slave: In a distributed cluster mode, slave refers to the nodes on which executors are run, and hence there can be (and usually is) more than one slave in the cluster.
Job: This is a collection of operations performed on any set of...
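
The driver and executor roles described above can be pictured with a small plain-Java analogy. This is only a sketch and does not use any Spark API: the class name DriverExecutorSketch, the method sumOfSquares, and its parameters are hypothetical names chosen for illustration, and a thread pool from java.util.concurrent stands in for the executors. The calling code plays the driver: it splits the data into partitions, delegates one task per partition, and combines the partial results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain-Java analogy only: no Spark classes are used here.
class DriverExecutorSketch {

    // Plays the role of the driver: splits the data into partitions,
    // delegates one task per partition to the worker pool, and merges the results.
    static long sumOfSquares(List<Integer> data, int numPartitions, int numExecutors) {
        if (numPartitions < 1 || numExecutors < 1) {
            throw new IllegalArgumentException("partitions and executors must be >= 1");
        }
        // The thread pool stands in for the executors (an assumption of this sketch).
        ExecutorService executors = Executors.newFixedThreadPool(numExecutors);
        try {
            int chunk = (data.size() + numPartitions - 1) / numPartitions; // ceiling division
            List<Future<Long>> partials = new ArrayList<>();
            for (int start = 0; start < data.size(); start += chunk) {
                List<Integer> partition =
                        data.subList(start, Math.min(start + chunk, data.size()));
                // Each submitted task works only on its own partition of the data.
                partials.add(executors.submit(() -> {
                    long sum = 0;
                    for (int value : partition) {
                        sum += (long) value * value;
                    }
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get(); // the "driver" collects each partial result
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e);
        } finally {
            executors.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6);
        // Three partitions processed by two worker threads.
        System.out.println(sumOfSquares(data, 3, 2));
    }
}
```

In a real Spark application the executors are separate processes, typically on other nodes, and the driver obtains them through the cluster's resource manager; the sketch only mirrors the split, delegate, and combine flow inside a single JVM.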