Scala and Spark for Big Data Analytics

By: Md. Rezaul Karim, Sridhar Alla

Overview of this book

Scala has seen wide adoption over the past few years, especially in the field of data science and analytics. Spark, built on Scala, has gained a lot of recognition and is widely used in production. So, if you want to leverage the power of Scala and Spark to make sense of big data, this book is for you. The first part introduces you to Scala, helping you understand the object-oriented and functional programming concepts needed for Spark application development. It then moves on to Spark to cover the basic abstractions, RDDs and DataFrames. This will help you develop scalable and fault-tolerant streaming applications by analyzing structured and unstructured data using Spark SQL, GraphX, and Spark Structured Streaming. Finally, the book moves on to some advanced topics, such as monitoring, configuration, debugging, testing, and deployment. You will also learn how to develop Spark applications using the SparkR and PySpark APIs, perform interactive data analytics with Zeppelin, and do in-memory data processing with Alluxio. By the end of this book, you will have a thorough understanding of Spark, and you will be able to perform full-stack data analytics, feeling that no amount of data is too big.
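As a taste of the two core abstractions mentioned above (an illustrative sketch, not code taken from the book), the same small dataset can be handled first as an RDD and then as a DataFrame from the Scala spark-shell, where a SparkSession named spark is already available:

  // In spark-shell, a SparkSession (spark) and its SparkContext (sc) already exist.
  import spark.implicits._

  // RDD: a low-level, fault-tolerant collection partitioned across the cluster,
  // manipulated with functional transformations such as reduceByKey.
  val pairs = sc.parallelize(Seq(("scala", 2), ("spark", 3), ("scala", 1)))
  val totals = pairs.reduceByKey(_ + _)
  totals.collect().foreach(println)        // prints (scala,3) and (spark,3)

  // DataFrame: the same data with a named schema, queryable through Spark SQL.
  val df = pairs.toDF("word", "count")
  df.createOrReplaceTempView("words")
  spark.sql("SELECT word, SUM(count) AS total FROM words GROUP BY word").show()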

What you need for this book

All the examples have been implemented and tested on a 64-bit Ubuntu Linux setup, and the complete source code can be downloaded from the Packt repository. You will also need the following software and tools (preferably the latest versions); a short setup-verification sketch follows the list:

  • Spark 2.0.0 (or higher)
  • Hadoop 2.7 (or higher)
  • Java (JDK and JRE) 1.7+/1.8+
  • Scala 2.11.x (or higher)
  • Python 2.7+/3.4+
  • R 3.1+ and RStudio 1.0.143 (or higher)
  • Eclipse Mars, Oxygen, or Luna (latest)
  • Maven Eclipse plugin (2.9 or higher)
  • Maven compiler plugin for Eclipse (2.3.2 or higher)
  • Maven assembly plugin for Eclipse (2.4.1 or higher)
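
Once these are installed, a quick way to confirm that Scala, Spark, and the Maven toolchain are wired up correctly is to build and run a tiny smoke-test application like the one below. This is a minimal sketch assuming a local-mode run; the object and application names are illustrative, not from the book:

  import org.apache.spark.sql.SparkSession

  object SetupCheck {
    def main(args: Array[String]): Unit = {
      // local[*] is assumed here purely for a smoke test; a real job would
      // normally get its master from spark-submit or the cluster manager.
      val spark = SparkSession.builder()
        .appName("SetupCheck")
        .master("local[*]")
        .getOrCreate()

      // Report the Spark version and run a trivial distributed computation.
      println(s"Spark version: ${spark.version}")
      val sum = spark.sparkContext.parallelize(1 to 100).sum()
      println(s"Sum of 1..100 = $sum")   // expected: 5050.0

      spark.stop()
    }
  }

Packaged with the Maven assembly plugin listed above, the resulting JAR can also be submitted with spark-submit for the same check.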

Operating system: A Linux distribution is preferable (including Debian, Ubuntu, Fedora, RHEL, and CentOS). More specifically, for Ubuntu it is recommended to have a complete 14.04 (LTS) 64-bit (or later) installation, either natively or on VMware Player 12 or VirtualBox. You can also run Spark jobs on Windows (XP/7/8/10) or Mac OS X (10.4.7+).

Hardware configuration: Processor Core i3, Core i5 (recommended), or Core i7 (to get the best results); a multicore processor will provide faster data processing and better scalability. You will need at least 8-16 GB of RAM (recommended) for standalone mode and at least 32 GB of RAM for a single VM, and more for a cluster. You will also need enough storage to run heavy jobs (depending on the size of the datasets you will be handling), preferably at least 50 GB of free disk space (for standalone mode and for an SQL warehouse).