Apache Spark 2.x for Java Developers

By : Sourav Gulati, Sumit Kumar

Overview of this book

Apache Spark is the buzzword in the big data industry right now, especially with the increasing need for real-time streaming and data processing. While Spark is built on Scala, the Spark Java API exposes all the Spark features available in the Scala version to Java developers. This book shows you how to implement various functionalities of the Apache Spark framework in Java, without stepping out of your comfort zone. It starts with an introduction to the Apache Spark 2.x ecosystem, explains how to install and configure Spark, and refreshes the Java concepts that will be useful to you when consuming Apache Spark's APIs. You will explore RDDs and their common action and transformation Java APIs, set up a production-like clustered environment, and work with Spark SQL. Moving on, you will perform near-real-time processing with Spark Streaming, machine learning analytics with Spark MLlib, and graph processing with GraphX, all using various Java packages. By the end of the book, you will have a solid foundation in implementing components of the Spark framework in Java to build fast, real-time applications.

Why Apache Spark?

"MapReduce is on its way to being a legacy. We've got Spark," says the man behind Apache Hadoop, Doug Cutting. MapReduce (MR) is an amazingly popular distributed framework, but it comes with its fair share of criticism as well. Since its inception, MapReduce has been built to run jobs in batch mode. Although it supports streaming, it is not well suited to ad-hoc queries, machine learning, and so on. Apache Spark is a distributed in-memory computing framework that tries to address some of the major concerns surrounding MapReduce:

Performance: A major bottleneck in MapReduce jobs is disk I/O, which is especially visible during the shuffle and sort phase of MR, as data is written to disk. The guiding principle that Spark follows is simple: share the memory across the cluster and keep everything in memory for as long as possible. This improves the performance of Spark jobs by up to 100x compared to MR (as claimed by its developers).
Fault tolerance: MR and Spark take different approaches to fault tolerance. In MR, the ApplicationMaster (AM) keeps track of mappers and reducers while a job executes. When these containers stop responding or fail outright, the AM requests resources from the ResourceManager (RM) and launches a separate JVM to rerun the affected tasks. While this approach achieves fault tolerance, it consumes both time and resources. Apache Spark instead uses Resilient Distributed Datasets (RDDs), a read-only, fault-tolerant parallel collection. An RDD maintains its lineage graph, so whenever one of its partitions is lost, it recovers the lost data by recomputing it from the previous stage, which makes it more resilient.
DAG: Chaining MapReduce jobs is a difficult task, and it carries the additional burden of writing intermediate results to HDFS before the next job can start. Spark is a Directed Acyclic Graph (DAG) engine, so any number of jobs can easily be chained. Intermediate results are shared in memory, avoiding repeated disk I/O. These jobs are also lazily evaluated, so only the paths that are explicitly requested for computation are processed. In Spark, the operations that trigger computation are called actions.
Data processing: Spark (aka Spark Core) is not an isolated distributed compute framework. Many Spark modules have been built around Spark Core to make it more general purpose, and the RDD is the main abstraction in all of them. More recently, the DataFrame and Dataset abstractions have been added, which enrich the RDD with a schema and type safety. Nevertheless, the RDD remains common to all Spark modules, which keeps it simple to use and easy to share between them. Whether the task is streaming, querying, machine learning, or graph processing, the same data can be referenced through RDDs and used interchangeably. This is a unique appeal of the RDD that MR lacks: there, handling machine learning jobs in Apache Mahout required different concepts from handling graph processing in Apache Giraph.
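
The lazy evaluation and lineage-based recovery described in the points above can be sketched in plain Java. This is a toy, single-JVM model and not Spark code: the class LineageSketch and its methods (of, map, filter, collect, loseCachedData) are hypothetical names invented for this illustration. Transformations only record a recipe; the action collect() triggers the work, and a lost in-memory copy is rebuilt from the recorded recipe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Toy model of a lazily evaluated dataset that remembers its lineage. Not Spark's API. */
class LineageSketch<T> {
    // The lineage: a recipe for rebuilding this dataset from its parent.
    private final Supplier<List<T>> lineage;
    // In-memory copy, standing in for a partition held in cluster memory.
    private List<T> cached;

    private LineageSketch(Supplier<List<T>> lineage) {
        this.lineage = lineage;
    }

    // Wraps a source collection; the source plays the role of stable storage.
    static <T> LineageSketch<T> of(List<T> source) {
        List<T> copy = new ArrayList<>(source);
        return new LineageSketch<>(() -> copy);
    }

    // "Transformation": records a step in the lineage and computes nothing yet.
    <R> LineageSketch<R> map(Function<T, R> fn) {
        return new LineageSketch<>(() -> {
            List<R> out = new ArrayList<>();
            for (T item : this.collect()) {
                out.add(fn.apply(item));
            }
            return out;
        });
    }

    // Another "transformation", equally lazy.
    LineageSketch<T> filter(Predicate<T> keep) {
        return new LineageSketch<>(() -> {
            List<T> out = new ArrayList<>();
            for (T item : this.collect()) {
                if (keep.test(item)) {
                    out.add(item);
                }
            }
            return out;
        });
    }

    // "Action": triggers evaluation, reusing the in-memory copy when present.
    List<T> collect() {
        if (cached == null) {
            cached = lineage.get();
        }
        return cached;
    }

    // Simulates losing the in-memory copy, for example when a node fails.
    void loseCachedData() {
        cached = null;
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        LineageSketch<Integer> result = LineageSketch.of(Arrays.asList(1, 2, 3, 4))
                .map(x -> { calls.incrementAndGet(); return x * x; })
                .filter(x -> x > 4);

        System.out.println("map calls before action: " + calls.get()); // 0
        System.out.println(result.collect());                          // [9, 16]
        System.out.println("map calls after action: " + calls.get());  // 4

        result.loseCachedData();              // the filtered copy is gone
        System.out.println(result.collect()); // [9, 16], rebuilt from the lineage
    }
}
```

In this sketch the rebuilt result comes from the still-cached output of the map step, so the map function is not invoked again; only the lost step is recomputed from its parent.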

Compatibility: Spark was not developed with only YARN clusters in mind. It runs on Hadoop YARN, on Mesos, and in its own standalone cluster mode. Similarly, Spark is not built around HDFS and works with a wide variety of filesystems. Apache Spark simply provides the compute capabilities and leaves the choice of cluster manager and filesystem to the use case at hand.
Spark APIs: Spark's APIs have wide coverage in terms of both functionality and programming languages. They closely resemble the collections API of Scala, the language in which Apache Spark is mostly implemented. It is this richness of functional programming that lets Apache Spark avoid much of the boilerplate code that burdened MR jobs. Unlike MR, which required low-level programming, Spark exposes APIs at a higher level of abstraction, while still allowing lower-level behavior to be overridden if ever required.
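
To give a feel for the functional, collection-like style mentioned in the last point, here is a minimal word count written with plain Java 8 streams. It is an analogy only and does not use Spark: the class name WordCountSketch and the sample input are made up for this illustration. Spark's Java API offers similarly named operations on JavaRDD and JavaPairRDD (for example flatMap, mapToPair, and reduceByKey), so a Spark version has a comparable shape, whereas a classic MR job needs separate mapper, reducer, and driver classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Word count with plain Java streams, as a local analogy for Spark's functional style. */
class WordCountSketch {

    // Splits each line into lowercase words and counts the occurrences of each word.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(word -> !word.isEmpty())
                // TreeMap keeps the keys sorted, so the printed output is deterministic.
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("spark is fast", "spark is general");
        System.out.println(countWords(lines));
        // prints {fast=1, general=1, is=2, spark=2}
    }
}
```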
