Big Data Analytics

By: Venkat Ankam

Overview of this book

This book aims to provide the fundamentals of Apache Spark and Hadoop. All Spark components (Spark Core, Spark SQL, DataFrames, Datasets, conventional Streaming, Structured Streaming, MLlib, and GraphX) and the Hadoop core components (HDFS, MapReduce, and YARN) are explored in depth, with implementation examples on Spark + Hadoop clusters. Big data processing is moving away from MapReduce toward Spark, so the advantages of Spark over MapReduce are explained in depth to help readers reap the benefits of in-memory speeds. The DataFrames API, the Data Sources API, and the new Dataset API are explained for building Big Data analytical applications. Real-time data analytics using Spark Streaming with Apache Kafka and HBase is covered to help readers build streaming applications. The new Structured Streaming concept is explained with an IoT (Internet of Things) use case. Machine learning techniques are covered using MLlib, ML Pipelines, and SparkR, and graph analytics is covered with the GraphX and GraphFrames components of Spark. Readers will also get an opportunity to get started with web-based notebooks such as Jupyter and Apache Zeppelin, and with the data flow tool Apache NiFi, to analyze and visualize data.

Why Hadoop plus Spark?

Apache Spark works best when it is combined with Hadoop. To understand why, let's take a look at the features of Hadoop and Spark.

Hadoop features



Unlimited scalability

Stores unlimited data by scaling out HDFS

Effectively manages cluster resources with YARN

Runs multiple applications along with Spark

Thousands of simultaneous users

Enterprise grade

Provides security with Kerberos authentication and ACL-based authorization

Data encryption

High reliability and integrity


Wide range of applications

Files: Structured, semi-structured, and unstructured

Streaming sources: Flume and Kafka

Databases: Any RDBMS and NoSQL database

Spark features



Easy development

No boilerplate coding

Native APIs for multiple languages: Java, Scala, Python, and R

REPL for Scala, Python, and R

Optimized performance


Optimized shuffle

Catalyst Optimizer


Batch, SQL, machine learning, streaming, and graph processing