Learning Spark SQL

By: Aurobindo Sarkar

Overview of this book

In the past year, Apache Spark has been increasingly adopted for the development of distributed applications. Spark SQL APIs provide an optimized interface that helps developers build such applications quickly and easily. However, designing web-scale production applications using Spark SQL APIs can be a complex task. Hence, understanding the design and implementation best practices before you start your project will help you avoid such pitfalls. This book gives an insight into the engineering practices used to design and build real-world, Spark-based applications. The book's hands-on examples will give you the required confidence to work on any future projects you encounter in Spark SQL. It starts by familiarizing you with data exploration and data munging tasks using Spark SQL and Scala. Extensive code examples will help you understand the methods used to implement typical use cases for various types of applications. You will get a walkthrough of the key concepts and terms that are common to streaming, machine learning, and graph applications. You will also learn key performance-tuning details, including Cost-Based Optimization (introduced in Spark 2.2), in Spark SQL applications. Finally, you will move on to learning how such systems are architected and deployed for the successful delivery of your project.

Using Spark SQL for creating pivot tables


Pivot tables provide alternate views of your data and are useful during data exploration. In the following examples, we demonstrate pivoting using Spark DataFrames.
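The sketches that follow assume a DataFrame df loaded from the bank marketing dataset commonly used for such examples; the file path, delimiter, and column names (marital, housing, job, campaign, duration, deposit, month) are assumptions for illustration, not taken from this excerpt:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("SparkSQLPivotExamples")
  .master("local[*]") // local mode for experimentation
  .getOrCreate()

// Hypothetical input file; the bank marketing CSV is semicolon-delimited.
val df = spark.read
  .option("header", "true")
  .option("delimiter", ";")
  .option("inferSchema", "true")
  .csv("bank.csv")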

The following example pivots on whether a housing loan was taken and computes the counts by marital status:
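A minimal sketch of this pivot, using the df defined in the setup above:

// Rows: marital status; columns: housing loan values ("yes"/"no");
// cells: the number of records in each combination.
val housingByMarital = df.groupBy("marital")
  .pivot("housing")
  .count()
housingByMarital.show()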

In the next example, we create a DataFrame with appropriate column names for the total and average number of calls:
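A sketch under the assumption that the campaign column holds the number of calls made to a client; the renamed output columns are illustrative:

// Pivot on housing loan status and compute the total and average number of
// calls per marital status, then rename the generated columns with toDF().
val callCounts = df.groupBy("marital")
  .pivot("housing", Seq("yes", "no"))
  .agg(sum("campaign"), avg("campaign"))
  .toDF("marital", "yes_total_calls", "yes_avg_calls",
        "no_total_calls", "no_avg_calls")
callCounts.show()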

In the following example, we create a DataFrame with appropriate column names for the total and average duration of calls for each job category:
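A sketch, assuming duration holds each call's duration and using illustrative column names:

// Total and average call duration for each job category.
val callDurations = df.groupBy("job")
  .agg(sum("duration"), avg("duration"))
  .toDF("job", "total_duration", "avg_duration")
callDurations.show()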

In the following example, we use pivoting to compute the average call duration for each job category, while also specifying a subset of marital status values:
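A sketch; the subset of marital status values passed to pivot() is illustrative:

// Supplying the pivot values explicitly both restricts the output columns
// and avoids an extra pass over the data to discover distinct values.
val avgDurationByJob = df.groupBy("job")
  .pivot("marital", Seq("married", "single"))
  .agg(avg("duration"))
avgDurationByJob.show()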

The following example is the same as the preceding one, except that in this case we also split the average call duration values by the housing loan field:
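A sketch of the same pivot with the housing loan field added to the grouping:

// Adding housing to groupBy() splits each job category's averages
// by housing loan status as well.
val avgDurationByJobAndHousing = df.groupBy("job", "housing")
  .pivot("marital", Seq("married", "single"))
  .agg(avg("duration"))
avgDurationByJobAndHousing.show()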

Next, we show how you can create a DataFrame of a pivot table of deposits subscribed by month, save it to disk, and read it back into an RDD:
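A sketch, assuming the subscription outcome is held in a deposit column with "yes"/"no" values and writing to a hypothetical output directory:

// Pivot deposits subscribed by month.
val depositsByMonth = df.groupBy("month")
  .pivot("deposit")
  .count()

// Save the pivot table to disk as CSV, then read it back as an RDD of lines.
depositsByMonth.write.mode("overwrite").csv("deposits_by_month")
val depositsRDD = spark.sparkContext.textFile("deposits_by_month")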

Further, we use the RDD in the preceding step to...