
Scala and Spark for Big Data Analytics

By: Md. Rezaul Karim, Sridhar Alla

Overview of this book

Scala has seen wide adoption over the past few years, especially in the field of data science and analytics. Spark, built on Scala, has gained a lot of recognition and is widely used in production. If you want to leverage the power of Scala and Spark to make sense of big data, this book is for you. The first part introduces you to Scala, helping you understand the object-oriented and functional programming concepts needed for Spark application development. It then moves on to Spark to cover the basic abstractions: RDDs and DataFrames. This will help you develop scalable and fault-tolerant streaming applications by analyzing structured and unstructured data using Spark SQL, GraphX, and Spark Structured Streaming. Finally, the book moves on to some advanced topics, such as monitoring, configuration, debugging, testing, and deployment. You will also learn how to develop Spark applications using the SparkR and PySpark APIs, perform interactive data analytics using Zeppelin, and do in-memory data processing with Alluxio. By the end of this book, you will have a thorough understanding of Spark, and you will be able to perform full-stack data analytics with the confidence that no amount of data is too big.

Preface

The continued growth in data, coupled with the need to make increasingly complex decisions against that data, is creating massive hurdles that prevent organizations from deriving insights in a timely manner using traditional analytical approaches. The field of big data has become so closely tied to its processing frameworks that its scope is often defined by what those frameworks can handle. Whether you're scrutinizing the clickstream from millions of visitors to optimize online ad placements, or sifting through billions of transactions to identify signs of fraud, the need for advanced analytics, such as machine learning and graph processing, to automatically glean insights from enormous volumes of data is more evident than ever.

Apache Spark, the de facto standard for big data processing, analytics, and data science across academia and industry, provides both machine learning and graph processing libraries, allowing companies to tackle complex problems easily with the power of highly scalable, clustered computing. Spark's promise is to take this a step further, making writing distributed programs in Scala feel like writing regular programs. Spark gives ETL pipelines a huge boost in performance and eases some of the pain that feeds the MapReduce programmer's daily chant of despair to the Hadoop gods.
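To give a flavor of what that looks like in practice, here is a minimal word-count sketch in Scala (a hypothetical example rather than code from the book; it assumes Spark 2.x or later is on the classpath, and the application name and input path are placeholders):

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Start (or reuse) a SparkSession; "local[*]" runs on all local cores,
    // while on a cluster the master is usually supplied by spark-submit.
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    // Read a plain-text file into an RDD of lines (the path is a placeholder).
    val lines = spark.sparkContext.textFile("data/sample.txt")

    // Ordinary-looking Scala transformations that Spark distributes across the cluster.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Bring a small sample back to the driver and print it.
    counts.take(10).foreach(println)

    spark.stop()
  }
}

The same pipeline can also be expressed with DataFrames and Spark SQL, which the book covers in its middle chapters.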

In this book, we use Spark and Scala to bring state-of-the-art advanced data analytics to big data, covering machine learning, graph processing, streaming, and SQL through Spark's MLlib, ML, SQL, GraphX, and other libraries.

We start with Scala, then move on to Spark, and finally cover some advanced topics for big data analytics with Spark and Scala. In the appendices, we show how to extend your Scala knowledge to SparkR, PySpark, Apache Zeppelin, and in-memory data processing with Alluxio. This book isn't meant to be read from cover to cover; skip to a chapter that covers something you're trying to accomplish or that simply ignites your interest.

Happy reading!