Scala and Spark for Big Data Analytics

By: Md. Rezaul Karim, Sridhar Alla

Overview of this book

Scala has seen wide adoption over the past few years, especially in the field of data science and analytics. Spark, built on Scala, has gained a lot of recognition and is widely used in production. Thus, if you want to leverage the power of Scala and Spark to make sense of big data, this book is for you. The first part introduces you to Scala, helping you understand the object-oriented and functional programming concepts needed for Spark application development. It then moves on to Spark to cover the basic abstractions using RDD and DataFrame. This will help you develop scalable and fault-tolerant streaming applications by analyzing structured and unstructured data using SparkSQL, GraphX, and Spark structured streaming. Finally, the book moves on to some advanced topics, such as monitoring, configuration, debugging, testing, and deployment. You will also learn how to develop Spark applications using SparkR and PySpark APIs, interactive data analytics using Zeppelin, and in-memory data processing with Alluxio. By the end of this book, you will have a thorough understanding of Spark, and you will be able to perform full-stack data analytics with the confidence that no amount of data is too big.
Table of Contents (19 chapters)

Scala for the beginners

In this part, we assume that you have a basic understanding of at least one other programming language. If Scala is your first entry into the coding world, you will find a large set of materials online, including tutorials, videos, and courses, that explain Scala for beginners.

There is a whole specialization on Coursera: https://www.coursera.org/specializations/scala. Its introductory class, taught by the creator of Scala, Martin Odersky, takes a somewhat academic approach to teaching the fundamentals of functional programming. You will learn a lot about Scala by solving the programming assignments. Moreover, this specialization includes a course on Apache Spark.

Furthermore, Kojo (http://www.kogics.net/sf:kojo) is an interactive learning environment that uses Scala programming to explore and play with math, art, music, animations, and games.

Your first line of code

As a first example, we will use the pretty common Hello, world! program in order to show you how to use Scala and its tools without knowing much about it. Let's open your favorite editor (this example runs on Windows 7, but can be run similarly on Ubuntu or macOS), say Notepad++, and type the following lines of code:

object HelloWorld {
  def main(args: Array[String]){
    println("Hello, world!")
  }
}

Now, save the code with a name, say HelloWorld.scala, as shown in the following figure:

Figure 11: Saving your first Scala source code using Notepad++

Let's compile and then run the source file as follows:

C:\>scalac HelloWorld.scala
C:\>scala HelloWorld
Hello, world!
C:\>

I'm the hello world program, explain me well!

The program should be familiar to anyone who has some programming experience. It has a main method, which prints the string Hello, world! to your console. The method is defined with the def main() syntax, which may look strange at first: def is the Scala keyword for declaring or defining a method, and we will cover methods and the different ways of writing them in the next chapter. The method takes an Array[String] as its argument, an array of strings holding the command-line arguments, which can be used for the initial configuration of your program; passing no arguments is also valid. Then, we use the common println() method, which takes a string (or a formatted one) and prints it to the console. A simple hello world has opened up many topics to learn; three in particular:

● Methods (covered in a later chapter)
● Objects and classes (covered in a later chapter)
● Type inference, which lets Scala remain a statically typed language without requiring a type annotation on every value (explained earlier)
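
To make these three topics a little more concrete, here is a minimal sketch that extends the hello world example. The names Greeter and greeting are hypothetical and chosen only for illustration; they do not come from the book.

```scala
// Hypothetical example: Greeter and greeting are illustrative names only.
object Greeter {
  // A method: the def keyword, a typed parameter list, and a return type.
  def greeting(name: String): String = s"Hello, $name!"

  def main(args: Array[String]): Unit = {
    // Type inference: no annotations are written here, yet the compiler
    // still infers that both name and message are of type String.
    val name = if (args.isEmpty) "world" else args(0)
    val message = greeting(name)
    println(message)
  }
}
```

When run without arguments, the program falls back to the name world; when an argument is passed on the command line, the first one is used as the name instead.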

Run Scala interactively!

The scala command starts the interactive shell for you, where you can interpret Scala expressions interactively:

> scala
Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121).
Type in expressions for evaluation. Or try :help.
scala>
scala> object HelloWorld {
     |   def main(args: Array[String]){
     |     println("Hello, world!")
     |   }
     | }
defined object HelloWorld
scala> HelloWorld.main(Array())
Hello, world!
scala>

The shortcut :q stands for the internal shell command :quit, used to exit the interpreter.

Compile it!

The scalac command, which is similar to the javac command, compiles one or more Scala source files and generates bytecode as output, which can then be executed on any Java Virtual Machine. To compile your hello world object, use the following:

> scalac HelloWorld.scala

By default, scalac generates the class files into the current working directory. You may specify a different output directory using the -d option:

> scalac -d classes HelloWorld.scala

However, note that the directory called classes must be created before executing this command.

Execute it with the scala command

The scala command executes the bytecode that is generated by the compiler:

$ scala HelloWorld

The scala command also accepts command options, such as the -classpath (alias -cp) option:

$ scala -cp classes HelloWorld

Before using the scala command to execute your source file(s), you should have a main method that acts as an entry point for your application. Otherwise, you should have an object that extends the scala.App trait, so that all the code inside this object is executed by the command. The following is the same Hello, world! example, but using the App trait:

#!/usr/bin/env scala
object HelloWorld extends App {
  println("Hello, world!")
}
HelloWorld.main(args)

The preceding script can be run directly from the command shell:

./script.sh

Note: we assume here that the file script.sh has the execute permission. If it does not, you can grant it as follows (sudo is needed only if you do not own the file):

$ sudo chmod +x script.sh

Also, the scala command must be reachable through the directories listed in the $PATH environment variable, because the first line of the script relies on env to locate it.
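
Returning to the two entry-point styles described above, the following sketch places them side by side. The object names HelloMain and HelloApp are hypothetical and used only for illustration; both objects are meant to print the same line when run.

```scala
// Style 1: an explicit main method acts as the entry point.
object HelloMain {
  def main(args: Array[String]): Unit =
    println("Hello, world!")
}

// Style 2: the body of an object extending scala.App is the program itself;
// the App trait supplies the main method for you.
object HelloApp extends App {
  println("Hello, world!")
}
```

Assuming both objects are saved in one source file and compiled with scalac, either can be started by name, for example with scala HelloMain or scala HelloApp. The App style is shorter for small programs, while an explicit main method keeps the entry point visible.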