Book Image

Spark for Data Science

By : Srinivas Duvvuri, Bikramaditya Singhal
Book Image

Spark for Data Science

By: Srinivas Duvvuri, Bikramaditya Singhal

Overview of this book

This is the era of Big Data. The words ‘Big Data’ implies big innovation and enables a competitive advantage for businesses. Apache Spark was designed to perform Big Data analytics at scale, and so Spark is equipped with the necessary algorithms and supports multiple programming languages. Whether you are a technologist, a data scientist, or a beginner to Big Data analytics, this book will provide you with all the skills necessary to perform statistical data analysis, data visualization, predictive modeling, and build scalable data products or solutions using Python, Scala, and R. With ample case studies and real-world examples, Spark for Data Science will help you ensure the successful execution of your data science projects.
Table of Contents (18 chapters)
Spark for Data Science
Credits
Foreword
About the Authors
About the Reviewers
www.PacktPub.com
Preface

The Spark stack


Spark is a general-purpose cluster computing system that empowers other higher-level components to leverage its core engine. It is interoperable with Apache Hadoop, in the sense that it can read and write data from/to HDFS and can also integrate with other storage systems that are supported by the Hadoop API.

While it allows building other higher-level applications on top of it, it already has a few components built on top that are tightly integrated with its core engine to take advantage of the future enhancements at the core. These applications come bundled with Spark to cover the broader sets of requirements in the industry. Most of the real-world applications need to be integrated across projects to solve specific business problems that usually have a set of requirements. This is eased out with Apache Spark as it allows its higher level components to be seamlessly integrated, such as libraries in a development project.

Also, with Spark's built-in support for Scala, Java, R and Python, a broader range of developers and data engineers are able to leverage the entire Spark stack:

Spark core

The Spark core, in a way, is similar to the kernel of an operating system. It is the general execution engine, which is fast as well as fault tolerant. The entire Spark ecosystem is built on top of this core engine. It is mainly designed to do job scheduling, task distribution, and monitoring of jobs across worker nodes. It is also responsible for memory management, interacting with various heterogeneous storage systems, and various other operations.

The primary building block of Spark core is the Resilient Distributed Dataset (RDD), which is an immutable, fault-tolerant collection of elements. Spark can create RDDs from a variety of data sources such as HDFS, local filesystems, Amazon S3, other RDDs, NoSQL data stores such as Cassandra, and so on. They are resilient in the sense that they automatically rebuild on failure. RDDs are built through lazy parallel transformations. They may be cached and partitioned, and may or may not be materialized.

The entire Spark core engine may be viewed as a set of simple operations on distributed datasets. All the scheduling and execution of jobs in Spark is done based on the methods associated with each RDD. Also, the methods associated with each RDD define their own ways of distributed in-memory computation.

Spark SQL

This module of Spark is designed to query, analyze, and perform operations on structured data. This is a very important component in the entire Spark stack because of the fact that most of the organizational data is structured, though unstructured data is growing rapidly. Acting as a distributed query engine, it enables Hadoop Hive queries to run up to 100 times faster on it without any modification. Apart from Hive, it also supports Apache Parquet, an efficient columnar storage, JSON, and other structured data formats. Spark SQL enables running SQL queries along with complex programs written in Python, Scala, and Java.

Spark SQL provides a distributed programming abstraction called DataFrames, referred to as SchemaRDD before, which had fewer functions associated with it. DataFrames are distributed collections of named columns, analogous to SQL tables or Python's Pandas DataFrames. They can be constructed with a variety of data sources that have schemas with them such as Hive, Parquet, JSON, other RDBMS sources, and also from Spark RDDs.

Spark SQL can be used for ETL processing across different formats and then running ad hoc analysis. Spark SQL comes with an optimizer framework called Catalyst that can transform SQL queries for better efficiency.

Spark streaming

The processing window for the enterprise data is becoming shorter than ever. To address the real-time processing requirement of the industry, this component of Spark was designed, which is fault tolerant as well as scalable. Spark enables real-time data analytics on live streams of data by supporting data analysis, machine learning, and graph processing on them.

It provides an API called Discretised Stream (DStream) to manipulate the live streams of data. The live streams of data are sliced up into small batches of, say, x seconds. Spark treats each batch as an RDD and processes them as basic RDD operations. DStreams can be created out of live streams of data from HDFS, Kafka, Flume, or any other source which can stream data on the TCP socket. By applying some higher-level operations on DStreams, other DStreams can be produced.

The final result of Spark streaming can either be written back to the various data stores supported by Spark or can be pushed to any dashboard for visualization.

MLlib

MLlib is the built-in machine learning library in the Spark stack. This was introduced in Spark 0.8. Its goal is to make machine learning scalable and easy. Developers can seamlessly use Spark SQL, Spark Streaming, and GraphX in their programming language of choice, be it Java, Python, or Scala. MLlib provides the necessary functions to perform various statistical analyses such as correlations, sampling, hypothesis testing, and so on. This component also has a broad coverage of applications and algorithms in classification, regression, collaborative filtering, clustering, and decomposition.

The machine learning workflow involves collecting and preprocessing data, building and deploying the model, evaluating the results, and refining the model. In the real world, the preprocessing steps take up significant effort. These are typically multi-stage workflows involving expensive intermediate read/write operations. Often, these processing steps may be performed multiple times over a period of time. A new concept called ML Pipelines was introduced to streamline these preprocessing steps. A Pipeline is a sequence of transformations where the output of one stage is the input of another, forming a chain. The ML Pipeline leverages Spark and MLlib and enables developers to define reusable sequences of transformations.

GraphX

GraphX is a thin-layered unified graph analytics framework on Spark. It was designed to be a general-purpose distributed dataflow framework in place of specialized graph processing frameworks. It is fault tolerant and also exploits in-memory computation.

GraphX is an embedded graph processing API for manipulating graphs (for example, social networks) and to do graph parallel computation (for example, Google's Pregel). It combines the advantages of both graph-parallel and data-parallel systems on the Spark stack to unify exploratory data analysis, iterative graph computation, and ETL processing. It extends the RDD abstraction to introduce the Resilient Distributed Graph (RDG), which is a directed graph with properties associated to each of its vertices and edges.

GraphX includes a decently large collection of graph algorithms, such as PageRank, K-Core, Triangle Count, LDA, and so on.

SparkR

The SparkR project was started to integrate the statistical analysis and machine learning capability of R with the scalability of Spark. It addressed the limitation of R, which was its ability to process as much data as fitted in the memory of a single machine. R programs can now scale in a distributed setting through SparkR.

SparkR is actually an R Package that provides an R shell to leverage Spark's distributed computing engine. With R's rich set of built-in packages for data analytics, data scientists can analyze large datasets interactively at scale.