Scala and Spark for Big Data Analytics

By: Md. Rezaul Karim, Sridhar Alla

Overview of this book

Scala has seen wide adoption over the past few years, especially in data science and analytics. Spark, which is built on Scala, has gained a lot of recognition and is widely used in production. If you want to leverage the power of Scala and Spark to make sense of big data, this book is for you. The first part introduces you to Scala and helps you understand the object-oriented and functional programming concepts needed for Spark application development. The book then moves on to Spark and covers its basic abstractions, RDD and DataFrame. This will help you develop scalable and fault-tolerant streaming applications by analyzing structured and unstructured data using Spark SQL, GraphX, and Spark Structured Streaming. Finally, the book moves on to some advanced topics, such as monitoring, configuration, debugging, testing, and deployment. You will also learn how to develop Spark applications using the SparkR and PySpark APIs, perform interactive data analytics using Zeppelin, and do in-memory data processing with Alluxio. By the end of this book, you will have a thorough understanding of Spark, and you will be able to perform full-stack data analytics with the confidence that no amount of data is too big.

Determining number of clusters

The beauty of clustering algorithms such as K-means is that they can cluster data with any number of features. They are a great tool to use when you have raw data and would like to discover the patterns in it. However, fixing the number of clusters before running the experiment may not work well, and it can sometimes lead to overfitting or underfitting. Moreover, all three algorithms discussed here (that is, K-means, bisecting K-means, and Gaussian mixture) share one requirement: the number of clusters must be determined in advance and supplied to the algorithm as a parameter. Hence, informally, determining the number of clusters is a separate optimization problem to be solved.

In this section, we will use a heuristic approach based on the Elbow method. We start from K = 2 clusters, and then we run the...
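To make the idea concrete, here is a minimal sketch of the Elbow heuristic in plain Scala, with no Spark dependency. It is not the book's implementation. It makes several assumptions of its own: a toy dataset of three well-separated blobs, a simple Lloyd-style K-means with a deterministic farthest-point initialization, the within-cluster sum of squares (WCSS) as the cost, and a "largest second difference" rule standing in for reading the elbow off a plot. All names (`ElbowSketch`, `wcss`, `elbow`, and so on) are hypothetical.

```scala
object ElbowSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.indices.map { i => val d = a(i) - b(i); d * d }.sum

  // Component-wise mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    Array.tabulate(ps.head.length)(d => ps.map(_(d)).sum / ps.size)

  // Assumption: deterministic farthest-point initialization, so the sketch
  // gives the same result on every run. Start from the first point, then
  // repeatedly add the point farthest from its nearest chosen center.
  def initCenters(data: Seq[Point], k: Int): Seq[Point] =
    (1 until k).foldLeft(Seq(data.head)) { (cs, _) =>
      cs :+ data.maxBy(p => cs.map(c => sqDist(p, c)).min)
    }

  // Runs a fixed number of Lloyd iterations and returns the WCSS for k clusters.
  def wcss(data: Seq[Point], k: Int, iterations: Int = 20): Double = {
    var centers = initCenters(data, k)
    for (_ <- 1 to iterations) {
      // Assignment step: group points by the index of their nearest center.
      val groups = data.groupBy(p => centers.indices.minBy(i => sqDist(p, centers(i))))
      // Update step: move each center to its group's mean (keep it if the group is empty).
      centers = centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
    }
    data.map(p => centers.map(c => sqDist(p, c)).min).sum
  }

  // Assumed elbow rule: pick the K with the largest second difference of the
  // cost curve, that is, where the cost stops dropping sharply.
  // Needs at least three consecutive (K, cost) pairs.
  def elbow(costs: Seq[(Int, Double)]): Int =
    costs.sliding(3).map { case Seq((_, prev), (k, cur), (_, next)) =>
      (k, (prev - cur) - (cur - next))
    }.maxBy(_._2)._1

  // Toy data (an assumption): three blobs of four points each.
  val data: Seq[Point] = for {
    (cx, cy) <- Seq((0.0, 0.0), (10.0, 10.0), (20.0, 0.0))
    (dx, dy) <- Seq((-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5))
  } yield Array(cx + dx, cy + dy)

  // Cost for each candidate K, starting from K = 2.
  val costs: Seq[(Int, Double)] = (2 to 6).map(k => k -> wcss(data, k))

  def main(args: Array[String]): Unit = {
    costs.foreach { case (k, c) => println(f"K = $k%d  WCSS = $c%.3f") }
    println(s"Elbow at K = ${elbow(costs)}")
  }
}
```

On this toy data the cost drops sharply from K = 2 to K = 3 and changes little afterwards, so the sketch reports an elbow at K = 3, matching the three blobs. On real data the curve is rarely this clean, which is why the elbow is usually judged from a plot rather than by a fixed rule.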