Book Image

Apache Spark 2.x Cookbook

By : Rishi Yadav
Book Image

Apache Spark 2.x Cookbook

By: Rishi Yadav

Overview of this book

While Apache Spark 1.x gained a lot of traction and adoption in the early years, Spark 2.x delivers notable improvements in the areas of API, schema awareness, Performance, Structured Streaming, and simplifying building blocks to build better, faster, smarter, and more accessible big data applications. This book uncovers all these features in the form of structured recipes to analyze and mature large and complex sets of data. Starting with installing and configuring Apache Spark with various cluster managers, you will learn to set up development environments. Further on, you will be introduced to working with RDDs, DataFrames and Datasets to operate on schema aware data, and real-time streaming with various sources such as Twitter Stream and Apache Kafka. You will also work through recipes on machine learning, including supervised learning, unsupervised learning & recommendation engines in Spark. Last but not least, the final few chapters delve deeper into the concepts of graph processing using GraphX, securing your implementations, cluster optimization, and troubleshooting.
Table of Contents (19 chapters)
Title Page
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface

Leveraging Databricks Cloud


Databricks is the company behind Spark. It has a cloud platform that takes out all of the complexity of deploying Spark and provides you with a ready-to-go environment with notebooks for various languages. Databricks Cloud also has a community edition that provides one node instance with 6 GB of RAM for free. It is a great starting place for developers. The Spark cluster that is created also terminates after 2 hours of sitting idle. 

Note

All the recipes in this book can be run on either the InfoObjects Sandbox or Databricks Cloud community edition. The entire data for the recipes in this book has also been ported to a public bucket called sparkcookbook on S3. Just put these recipes on the Databricks Cloud community edition, and they will work seamlessly. 

 

How to do it...

  1. Click on Sign Up :
  1. Choose COMMUNITY EDITION (or full platform): 
  1.  Fill in the details and you'll be presented with a landing page, as follows:
  1. Click on Clusters, then Create Cluster (showing community edition below it):
  1. Enter the cluster name, for example, myfirstcluster, and choose Availability Zone (more about AZs in the next recipe). Then click on Create Cluster:
  1. Once the cluster is created, the blinking green signal will become solid green, as follows:
  1. Now go to Home and click on Notebook. Choose an appropriate notebook name, for example, config, and choose Scala as the language:
  1. Then set the AWS access parameters. There are two access parameters:
    • ACCESS_KEY: This is referred to as fs.s3n.awsAccessKeyId in SparkContext's Hadoop configuration.
    • SECRET_KEY: This is referred to as  fs.s3n.awsSecretAccessKey in SparkContext's Hadoop configuration.
  2. Set ACCESS_KEY in the config notebook:
        sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<replace  
          with your key>")
  1. Set SECRET_KEY in the config notebook:
        sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey","
          <replace with your secret key>")
  1. Load a folder from the sparkcookbook bucket (all of the data for the recipes in this book are available in this bucket:
        val yelpdata = 
          spark.read.textFile("s3a://sparkcookbook/yelpdata")
  1. The problem with the previous approach was that if you were to publish your notebook, your keys would be visible. To avoid the use of this approach, use Databricks File System (DBFS).

Note

DBFS is Databricks Cloud's internal file system. It is a layer above S3, as you can guess. It mounts S3 buckets in a user's workspace as well as caches frequently accessed data on worker nodes. 

  1. Set the access key in the Scala notebook:
        val accessKey = "<your access key>"
  1. Set the secret key in the Scala notebook:
        val secretKey = "<your secret key>".replace("/","%2F")
  1. Set the bucket name in the Scala notebook:
        val bucket = "sparkcookbook"
  1. Set the mount name in the Scala notebook:
        val mount = "cookbook"
  1. Mount the bucket:
        dbutils.fs.mount(s"s3a://$accessKey:$secretKey@$bucket",
          s"/mnt/$mount")
  1. Display the contents of the bucket:
        display(dbutils.fs.ls(s"/mnt/$mount"))

Note

The rest of the recipes will assume that you would have set up AWS credentials.

How it works...

Let's look at the key concepts in Databricks Cloud.

Cluster

The concept of clusters is self-evident. A cluster contains a master node and one or more slave nodes. These nodes are EC2 nodes, which we are going to learn more about in the next recipe. 

Notebook

Notebook is the most powerful feature of Databricks Cloud. You can write your code in Scala/Python/R or a simple SQL notebook. These notebooks cover the whole 9 yards. You can use notebooks to write code like a programmer, use SQL like an analyst, or do visualization like a Business Intelligence (BI) expert. 

Table

Tables enable Spark to run SQL queries.

Library

Library is the section where you upload the libraries you would like to attach to your notebooks. The beauty is that you do not have to upload libraries manually; you can simply provide the Maven parameters and it would find the library for you and attach it.