Scala and Spark for Big Data Analytics

By: Md. Rezaul Karim, Sridhar Alla

Overview of this book

Scala has seen wide adoption over the past few years, especially in data science and analytics. Spark, built on Scala, has gained broad recognition and is widely used in production. If you want to leverage the power of Scala and Spark to make sense of big data, this book is for you. The first part introduces you to Scala, helping you understand the object-oriented and functional programming concepts needed for Spark application development. It then moves on to Spark to cover the basic abstractions using RDD and DataFrame. This will help you develop scalable and fault-tolerant streaming applications by analyzing structured and unstructured data using Spark SQL, GraphX, and Spark Structured Streaming. Finally, the book moves on to some advanced topics, such as monitoring, configuration, debugging, testing, and deployment. You will also learn how to develop Spark applications using the SparkR and PySpark APIs, perform interactive data analytics using Zeppelin, and process data in memory with Alluxio. By the end of this book, you will have a thorough understanding of Spark and be able to perform full-stack data analytics with the confidence that no amount of data is too big.

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

Introduction to Scala

History and purposes of Scala

Platforms and editors

Installing and setting up Scala

Scala: the scalable language

Scala for Java programmers

Scala for the beginners

Summary

Object-Oriented Scala

Variables in Scala

Methods, classes, and objects in Scala

Packages and package objects

Java interoperability

Pattern matching

Implicit in Scala

Generic in Scala

SBT and other build systems

Summary

Functional Programming Concepts

Introduction to functional programming

Functional Scala for the data scientists

Why FP and Scala for learning Spark?

Pure functions and higher-order functions

Using higher-order functions

Error handling in functional Scala

Functional programming and data mutability

Summary

Collection APIs

Scala collection APIs

Types and hierarchies

Performance characteristics

Java interoperability

Using Scala implicits

Summary

Tackle Big Data – Spark Comes to the Party

Introduction to data analytics

Introduction to big data

Distributed computing using Apache Hadoop

Here comes Apache Spark

Summary

Start Working with Spark – REPL and RDDs

Dig deeper into Apache Spark

Apache Spark installation

Introduction to RDDs

Using the Spark shell

Actions and Transformations

Caching

Loading and saving data

Summary

Special RDD Operations

Types of RDDs

Aggregations

Partitioning and shuffling

Broadcast variables

Accumulators

Summary

Introduce a Little Structure - Spark SQL

Spark SQL and DataFrames

DataFrame API and SQL API

Aggregations

Joins

Summary

Stream Me Up, Scotty - Spark Streaming

A Brief introduction to streaming

Spark Streaming

Discretized streams

Stateful/stateless transformations

Checkpointing

Interoperability with streaming platforms (Apache Kafka)

Structured streaming

Summary

Everything is Connected - GraphX

A brief introduction to graph theory

GraphX

VertexRDD and EdgeRDD

Graph operators

Pregel API

PageRank

Summary

Learning Machine Learning - Spark MLlib and Spark ML

Introduction to machine learning

Spark machine learning APIs

Feature extraction and transformation

Creating a simple pipeline

Unsupervised machine learning

Binary and multiclass classification

Summary

My Name is Bayes, Naive Bayes

Multinomial classification

Bayesian inference

Naive Bayes

The decision trees

Summary

Time to Put Some Order - Cluster Your Data with Spark MLlib

Unsupervised learning

Clustering techniques

Centroid-based clustering (CC)

Hierarchical clustering (HC)

Distribution-based clustering (DC)

Determining number of clusters

A comparative analysis between clustering algorithms

Submitting Spark job for cluster analysis

Summary

Text Analytics Using Spark ML

Understanding text analytics

Transformers and Estimators

Tokenization

StopWordsRemover

NGrams

TF-IDF

Word2Vec

CountVectorizer

Topic modeling using LDA

Implementing text classification

Summary

Spark Tuning

Monitoring Spark jobs

Spark configuration

Common mistakes in Spark app development

Optimization techniques

Summary

Time to Go to ClusterLand - Deploying Spark on a Cluster

Spark architecture in a cluster

Deploying the Spark application on a cluster

Summary

Testing and Debugging Spark

Testing in a distributed environment

Testing Spark applications

Debugging Spark applications

Summary

PySpark and SparkR

Introduction to PySpark

Installation and configuration

Introduction to SparkR

Summary

Special RDD Operations

"It's supposed to be automatic, but actually you have to push this button."

- John Brunner

In this chapter, you will learn how RDDs can be tailored to different needs, and how these RDDs provide new functionalities (and dangers!). Moreover, we investigate other useful objects that Spark provides, such as broadcast variables and accumulators.
In a nutshell, the following topics will be covered throughout this chapter: