Apache Spark for Data Science Cookbook

By: Padma Priya Chitturi

Overview of this book

Spark has emerged as the most promising big data analytics engine for data science professionals. The true power and value of Apache Spark lie in its ability to execute data science tasks with speed and accuracy. Spark’s selling point is that it combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. It lets you tackle the complexities that come with raw, unstructured data sets with ease. This guide will get you comfortable and confident performing data science tasks with Spark. You will learn about implementations including distributed deep learning, numerical computing, and scalable machine learning. You will be shown effective solutions to difficult problems in data science using libraries such as MLlib, Pandas, NumPy, SciPy, and more. These simple and efficient recipes will show you how to implement algorithms and optimize your work.

Working with the Spark programming model

This recipe explains the fundamentals of the Spark programming model. It covers the basics of the Resilient Distributed Dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. It also covers how to create RDDs and perform transformations and actions on them.
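
To make the idea of a partitioned collection concrete, here is a minimal sketch. It assumes Spark is on the classpath and runs against a local master (`local[2]`) rather than a real cluster; the object name `RddPartitionsSketch`, the helper `partitionSizes`, and the sample range are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddPartitionsSketch {
  // Hypothetical helper: returns how many elements ended up in each partition
  def partitionSizes(sc: SparkContext, upTo: Int, numPartitions: Int): Array[Int] =
    sc.parallelize(1 to upTo, numPartitions) // split the range into numPartitions slices
      .glom()                                // one array per partition
      .map(_.length)                         // keep only the size of each partition
      .collect()                             // bring the sizes back to the driver

  def main(args: Array[String]): Unit = {
    // Assumption: two local worker threads stand in for cluster nodes
    val conf = new SparkConf().setAppName("RddPartitionsSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val sizes = partitionSizes(sc, 10, 4)
      println("Number of partitions: " + sizes.length)
      println("Elements per partition: " + sizes.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

On a real cluster the same partitions would be spread across executors on different nodes; the local master only imitates that layout with threads.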

How to do it…

  1. Let's create RDDs and apply a few transformations such as map and filter, and a few actions such as count, take, top, and so on, in the Spark shell:

            scala> val data = Array(1, 2, 3, 4, 5)
            scala> val rddData = sc.parallelize(data)
            scala> val evenData = rddData.filter(ele => ele % 2 == 0)
            evenData: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at filter at <console>:29
            scala> val mydata = rddData.map(ele => ele + 2)
            mydata: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at map at <console>:29
            scala> mydata.count()
            res0: Long = 5
            scala> mydata.take(2)
            res1: Array[Int] = Array(3, 4)
            scala> mydata.top(1)
            res2: Array[Int] = Array(7)
  2. Now let's work with the transformations and actions in a Spark standalone application:

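              import org.apache.spark.{SparkConf, SparkContext}

              object SparkTransformations {
                def main(args: Array[String]): Unit = {
                  // The master URL is expected to come from spark-submit
                  val conf = new SparkConf().setAppName("SparkTransformations")
                  val sc = new SparkContext(conf)
                  // Illustrative sample data: any two RDD[String] values work here
                  val baseRdd1 = sc.parallelize(Array("hello", "hi", "big", "data", "hub", "hub", "hi"), 1)
                  val baseRdd2 = sc.parallelize(Array("hey", "hi", "data"), 1)
                  val baseRdd3 = sc.parallelize(Array(1, 2, 3, 4), 2)
                  // Transformations: each returns a new RDD; nothing is computed yet
                  val sampledRdd = baseRdd1.sample(false, 0.5)
                  val unionRdd = baseRdd1.union(baseRdd2).repartition(1)
                  val intersectionRdd = baseRdd1.intersection(baseRdd2)
                  val distinctRdd = baseRdd1.distinct.repartition(1)
                  val subtractRdd = baseRdd1.subtract(baseRdd2)
                  val cartesianRdd = sampledRdd.cartesian(baseRdd2)
                  // Actions: each triggers a job and returns a value to the driver
                  val reducedValue = baseRdd3.reduce((a, b) => a + b)
                  val collectedRdd = distinctRdd.collect
                  val count = distinctRdd.count
                  val first = distinctRdd.first
                  println("Count is..." + count); println("First Element is..." + first)
                  val takeValues = distinctRdd.take(3)
                  val takeSample = distinctRdd.takeSample(false, 2)
                  val takeOrdered = distinctRdd.takeOrdered(2)
                  println("Take Sample Values.." + takeSample.mkString(", "))
                  val foldResult = distinctRdd.fold("<>")((a, b) => a + b)
                  println(foldResult)
                  sc.stop()
                }
              }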

How it works…

Spark offers an abstraction called an RDD as part of its programming model. The preceding code snippets show RDD creation, transformations, and actions. Transformations such as union, subtract, intersection, sample, cartesian, map, filter, and flatMap, when applied to an RDD, produce a new RDD, whereas actions such as count, first, take(3), takeSample(false, 2), and takeOrdered(2) compute a result from the RDD and return it to the driver program or save it to external storage. Although we can define RDDs at any point, Spark evaluates them lazily: an RDD is computed only the first time it is used in an action.
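
Lazy evaluation can be observed directly. The following is a minimal sketch, assuming Spark 2.x or later (for `sc.longAccumulator`) and a local master; the object name `LazyEvaluationSketch` and the helper `countCalls` are hypothetical. An accumulator counts how often the function passed to `map` runs, read once before and once after an action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Hypothetical helper: returns (map calls seen before the action, map calls seen after it)
  def countCalls(sc: SparkContext): (Long, Long) = {
    val calls = sc.longAccumulator("map calls")
    // Transformation: only recorded in the lineage, not executed
    val doubled = sc.parallelize(1 to 5).map { x => calls.add(1); x * 2 }
    val before = calls.sum
    // Action: forces Spark to evaluate the map over every element
    doubled.count()
    val after = calls.sum
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master with two threads, for experimentation only
    val conf = new SparkConf().setAppName("LazyEvaluationSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (before, after) = countCalls(sc)
      println("map calls before the action: " + before)
      println("map calls after the action: " + after)
    } finally {
      sc.stop()
    }
  }
}
```

In a local run without task retries, the first value should be zero and the second should equal the number of elements. Accumulators updated inside transformations can be incremented more than once if a task is re-executed, so this is a demonstration device rather than a reliable counter.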

There's more…

A few transformations, such as reduceByKey, groupByKey, repartition, distinct, intersection, subtract, and so on, result in a shuffle operation. A shuffle is very expensive because it involves disk I/O, data serialization, and network I/O. Shuffle behavior can be tuned using certain configuration parameters.
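
As a rough illustration, the sketch below runs a word count with `reduceByKey` and prints the RDD lineage, where the shuffle shows up as a `ShuffledRDD` stage. The recipe does not name specific parameters, so the two settings used here, `spark.shuffle.compress` and `spark.shuffle.file.buffer`, are picked as examples from the Spark configuration options, and the values are not tuning recommendations. The object name `ShuffleSketch` and the helpers are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ShuffleSketch {
  // Hypothetical helper: builds per-word counts; reduceByKey introduces a shuffle
  def countsRdd(sc: SparkContext, words: Seq[String]): RDD[(String, Int)] =
    sc.parallelize(words, 2)
      .map(word => (word, 1))
      .reduceByKey(_ + _) // partial sums are combined per partition before data is shuffled

  // Hypothetical helper: collects the counts to the driver as a Map
  def wordCounts(sc: SparkContext, words: Seq[String]): Map[String, Int] =
    countsRdd(sc, words).collect().toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ShuffleSketch")
      .setMaster("local[2]")                   // assumption: local run
      .set("spark.shuffle.compress", "true")   // compress shuffle output files
      .set("spark.shuffle.file.buffer", "64k") // example buffer size for shuffle writes
    val sc = new SparkContext(conf)
    try {
      val words = Seq("hub", "hi", "hub", "data", "hi", "hub")
      // The lineage string lists a ShuffledRDD where the shuffle boundary sits
      println(countsRdd(sc, words).toDebugString)
      println(wordCounts(sc, words))
    } finally {
      sc.stop()
    }
  }
}
```

Where the logic allows it, `reduceByKey` is usually a cheaper choice than `groupByKey` followed by a sum, because it reduces the amount of data written and sent during the shuffle.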

See also

The Apache Spark documentation offers a detailed explanation of the Spark programming model. Please refer to this documentation page: