Scala and Spark for Big Data Analytics

By: Md. Rezaul Karim, Sridhar Alla

Overview of this book

Scala has seen wide adoption over the past few years, especially in the field of data science and analytics. Spark, built on Scala, has gained a lot of recognition and is widely used in production. So, if you want to leverage the power of Scala and Spark to make sense of big data, this book is for you. The first part introduces you to Scala, helping you understand the object-oriented and functional programming concepts needed for Spark application development. It then moves on to Spark to cover the basic abstractions, RDDs and DataFrames. This will help you develop scalable and fault-tolerant streaming applications by analyzing structured and unstructured data using Spark SQL, GraphX, and Spark Structured Streaming. Finally, the book moves on to some advanced topics, such as monitoring, configuration, debugging, testing, and deployment. You will also learn how to develop Spark applications using the SparkR and PySpark APIs, perform interactive data analytics with Zeppelin, and do in-memory data processing with Alluxio. By the end of this book, you will have a thorough understanding of Spark, and you will be able to perform full-stack data analytics, feeling that no amount of data is too big.
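As a taste of the two core abstractions mentioned above (an illustrative sketch, not code taken from the book), the same small dataset can be handled first as an RDD and then as a DataFrame from the Scala spark-shell, where a SparkSession named spark is already available:

  // In spark-shell, a SparkSession (spark) and its SparkContext (sc) already exist.
  import spark.implicits._

  // RDD: a low-level, fault-tolerant collection partitioned across the cluster,
  // manipulated with functional transformations such as reduceByKey.
  val pairs = sc.parallelize(Seq(("scala", 2), ("spark", 3), ("scala", 1)))
  val totals = pairs.reduceByKey(_ + _)
  totals.collect().foreach(println)        // prints (scala,3) and (spark,3)

  // DataFrame: the same data with a named schema, queryable through Spark SQL.
  val df = pairs.toDF("word", "count")
  df.createOrReplaceTempView("words")
  spark.sql("SELECT word, SUM(count) AS total FROM words GROUP BY word").show()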

What you need for this book

All the examples have been implemented and tested on a 64-bit Ubuntu Linux setup, and the complete source code can be downloaded from the Packt repository. You will also need the following software and tools (preferably the latest versions); a short setup-verification sketch follows the list:

  • Spark 2.0.0 (or higher)
  • Hadoop 2.7 (or higher)
  • Java (JDK and JRE) 1.7+/1.8+
  • Scala 2.11.x (or higher)
  • Python 2.7+/3.4+
  • R 3.1+ and RStudio 1.0.143 (or higher)
  • Eclipse Mars, Oxygen, or Luna (latest)
  • Maven Eclipse plugin (2.9 or higher)
  • Maven compiler plugin for Eclipse (2.3.2 or higher)
  • Maven assembly plugin for Eclipse (2.4.1 or higher)
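
Once these are installed, a quick way to confirm that Scala, Spark, and the Maven toolchain are wired up correctly is to build and run a tiny smoke-test application like the one below. This is a minimal sketch assuming a local-mode run; the object and application names are illustrative, not from the book:

  import org.apache.spark.sql.SparkSession

  object SetupCheck {
    def main(args: Array[String]): Unit = {
      // local[*] is assumed here purely for a smoke test; a real job would
      // normally get its master from spark-submit or the cluster manager.
      val spark = SparkSession.builder()
        .appName("SetupCheck")
        .master("local[*]")
        .getOrCreate()

      // Report the Spark version and run a trivial distributed computation.
      println(s"Spark version: ${spark.version}")
      val sum = spark.sparkContext.parallelize(1 to 100).sum()
      println(s"Sum of 1..100 = $sum")   // expected: 5050.0

      spark.stop()
    }
  }

Packaged with the Maven assembly plugin listed above, the resulting JAR can also be submitted with spark-submit for the same check.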

Operating system: A Linux distribution is preferable (including Debian, Ubuntu, Fedora, RHEL, and CentOS). More specifically, for Ubuntu it is recommended to have a complete 14.04 (LTS) 64-bit (or later) installation, either natively or on VMware Player 12 or VirtualBox. You can also run Spark jobs on Windows (XP/7/8/10) or Mac OS X (10.4.7+).

Hardware configuration: Processor Core i3, Core i5 (recommended), or Core i7 (to get the best results); a multicore processor will provide faster data processing and better scalability. You will need at least 8-16 GB of RAM (recommended) for standalone mode and at least 32 GB of RAM for a single VM, and more for a cluster. You will also need enough storage to run heavy jobs (depending on the size of the datasets you will be handling), preferably at least 50 GB of free disk space (for standalone mode and for an SQL warehouse).