Azure Databricks Cookbook

By: Phani Raj, Vinod Jaiswal

Overview of this book

Azure Databricks is a unified collaborative platform for performing scalable analytics in an interactive environment. The Azure Databricks Cookbook provides recipes to get hands-on with the analytics process, including ingesting data from various batch and streaming sources and building a modern data warehouse. The book starts by teaching you how to create an Azure Databricks instance using the Azure portal, the Azure CLI, and ARM templates. You'll work through clusters in Databricks and explore recipes for ingesting data from sources, including files, databases, and streaming sources such as Apache Kafka and EventHub. The book will help you explore all the features supported by Azure Databricks for building powerful end-to-end data pipelines. You'll also find out how to build a modern data warehouse by using Delta tables and Azure Synapse Analytics. Later, you'll learn how to write ad hoc queries and extract meaningful insights from the data lake by creating visualizations and dashboards with Databricks SQL. Finally, you'll deploy and productionize a data pipeline as well as deploy notebooks and the Azure Databricks service using continuous integration and continuous delivery (CI/CD). By the end of this Azure book, you'll be able to use Azure Databricks to streamline different processes involved in building data-driven apps.

Reading and writing data from and to JSON, including nested JSON 

Spark SQL automatically infers the schema of a JSON dataset from the files and loads it as a DataFrame. It also provides options for reading and writing JSON data and for querying it directly. Nested JSON can also be parsed, and its fields can be accessed directly without any explicit transformations.

Getting ready

You can follow along by running the steps in the 2_8.Reading and Writing data from and to Json including nested json.ipynb notebook in the Chapter02 folder of your locally cloned repository.

Upload the JsonData folder from the Chapter02/sensordata folder to an ADLS Gen-2 account that has sensordata as its file system. We are mounting the ADLS Gen-2 storage account with the sensordata file system to /mnt/SensorData.
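
If the storage account is not already mounted, the following is a minimal sketch of mounting it with OAuth and a service principal; the angle-bracket values (storage account name, client ID, secret scope and key, tenant ID) are placeholders, not values from the recipe:

    # Minimal mount sketch using OAuth with a service principal; every
    # angle-bracket value is a placeholder you must replace with your own.
    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": "<application-client-id>",
        "fs.azure.account.oauth2.client.secret":
            dbutils.secrets.get(scope="<secret-scope>", key="<secret-key>"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
    }

    # Mount the sensordata file system to /mnt/SensorData, as used in this recipe.
    dbutils.fs.mount(
        source="abfss://sensordata@<storage-account>.dfs.core.windows.net/",
        mount_point="/mnt/SensorData",
        extra_configs=configs,
    )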

The JsonData folder contains two subfolders: SimpleJsonData, which has files with a simple JSON structure, and JsonData, which has files with a nested JSON structure.

Note

The code was tested on Databricks Runtime Version 7.3 LTS, which includes Spark 3.0.1.

In the upcoming sections, we will learn how to process simple and complex JSON data files. We will use sensor data files with simple and nested schemas.

How to do it…

In this section, you will see how to read and write simple JSON data files:

  1. You can read JSON data files using the following code snippet. You need to set the multiline option to true when reading a JSON file that spans multiple lines; for a single-line JSON data file, this option can be skipped.
    df_json = spark.read.option("multiline","true").json("/mnt/SensorData/JsonData/SimpleJsonData/")
    display(df_json)
  2. After executing the preceding code, you can see the schema of the JSON data.
    Figure 2.21 – Simple Json Data

  3. Writing a JSON file is similar to writing a CSV file. You can use the following code snippet to write a JSON file using Azure Databricks. You can specify different mode options while writing JSON data, such as append, overwrite, ignore, and error or errorifexists. error is the default mode and throws an exception if data already exists at the destination. A sketch of the other modes follows this list.
    df_json.write.format("json").mode("overwrite").save("/mnt/SensorData/JsonData/SimpleJsonData/")
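
As a minimal sketch of the remaining save modes mentioned in step 3, the following reuses the df_json DataFrame from step 1; the /mnt/SensorData/JsonDataOut/ output path is hypothetical and used only for illustration:

    # Hypothetical output location for this sketch.
    out_path = "/mnt/SensorData/JsonDataOut/"

    # append: add new files alongside any data already at the destination.
    df_json.write.format("json").mode("append").save(out_path)

    # ignore: silently skip the write if data already exists at the destination.
    df_json.write.format("json").mode("ignore").save(out_path)

    # error / errorifexists (the default): raise an error if data already exists.
    df_json.write.format("json").mode("errorifexists").save(out_path)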

Very often, you will come across scenarios where you need to handle complex data types such as arrays, maps, and structs while processing data.

To encode a struct type as a string and to read the struct type as a complex type, Spark provides functions such as to_json() and from_json(). If you are receiving data from the streaming source in nested JSON format, then you can use the from_json() function to process the input data before sending it downstream.

Similarly, you can use the to_json() method to encode or convert columns in a DataFrame to a JSON string and send the dataset to various destinations such as EventHub, Data Lake storage, a Cosmos database, or RDBMS systems such as SQL Server and Oracle. The following steps show how to process simple and nested JSON.
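
To make the from_json() usage concrete, here is a minimal sketch that parses a JSON string column into a struct and then encodes it back with to_json(); the column names, schema, and sample payload are illustrative and not taken from the recipe:

    from pyspark.sql.functions import from_json, to_json, col
    from pyspark.sql.types import StructType, StructField, StringType

    # Illustrative only: a DataFrame with a raw JSON string column, as you might
    # receive from a streaming source such as EventHub or Kafka.
    raw_df = spark.createDataFrame(
        [('{"name": "sensor-1", "phone": "555-0100"}',)], ["body"]
    )

    # Schema of the incoming JSON payload.
    payload_schema = StructType([
        StructField("name", StringType()),
        StructField("phone", StringType()),
    ])

    # from_json() parses the string column into a struct column...
    parsed_df = raw_df.withColumn("payload", from_json(col("body"), payload_schema))

    # ...and to_json() encodes a struct column back into a JSON string.
    roundtrip_df = parsed_df.withColumn("json_out", to_json(col("payload")))
    display(roundtrip_df)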

  1. Execute the following code to read and display the dataset from the mount location of the storage account.
    df_json = spark.read.option("multiline","true").json("dbfs:/mnt/SensorData/JsonData/")
    display(df_json)
  2. You can see that the vehicle sensor data has an Owners attribute in multiline JSON format.
    Figure 2.22 – Complex Json Data

  3. To convert the Owners attribute into row format for joining or transforming the data, you can use the explode() function. Execute the following command to perform this operation (a sketch of this step is shown after the list). Here you can see that, by using the explode() function, the elements of the array have been converted into two columns named name and phone.
    Figure 2.23 – Explode function

  4. To encode or convert the columns in a DataFrame to a JSON string, the to_json() method can be used. In this example, we will create a new DataFrame with a JSON column from the DataFrame data_df created in the preceding step. Execute the following code to create the new DataFrame with a JSON column.
    from pyspark.sql.functions import to_json, struct
    jsonDF = data_df.withColumn("jsonCol", to_json(struct([data_df[x] for x in data_df.columns]))).select("jsonCol")
    display(jsonDF)
  5. After executing the preceding command, you will get a new DataFrame with a column named jsonCol.
    Figure 2.24 – Encode To Json

  6. Using the to_json() method, you have converted the _id, name, and phone columns to a new JSON column.
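
Since the recipe shows the explode() result of step 3 only as a screenshot, here is a minimal sketch of what that step might look like, assuming each element of the Owners array is a struct with name and phone fields (as suggested by Figure 2.23); the exact notebook code may differ:

    from pyspark.sql.functions import explode, col

    # Explode the Owners array so each element becomes its own row, then flatten
    # the struct fields into name and phone columns, matching Figure 2.23.
    data_df = (
        df_json
        .withColumn("Owner", explode(col("Owners")))
        .select("*", col("Owner.name").alias("name"), col("Owner.phone").alias("phone"))
        .drop("Owner", "Owners")
    )
    display(data_df)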

How it works…

Spark provides you with options to process data from multiple sources and in multiple formats. You can process and enrich the data before sending it to downstream applications. Data can be sent to a downstream application with low latency for streaming data or high throughput for historical data.