Book Image

Azure Databricks Cookbook

By : Phani Raj, Vinod Jaiswal
Book Image

Azure Databricks Cookbook

By: Phani Raj, Vinod Jaiswal

Overview of this book

Azure Databricks is a unified collaborative platform for performing scalable analytics in an interactive environment. The Azure Databricks Cookbook provides recipes to get hands-on with the analytics process, including ingesting data from various batch and streaming sources and building a modern data warehouse. The book starts by teaching you how to create an Azure Databricks instance within the Azure portal, Azure CLI, and ARM templates. You’ll work through clusters in Databricks and explore recipes for ingesting data from sources, including files, databases, and streaming sources such as Apache Kafka and EventHub. The book will help you explore all the features supported by Azure Databricks for building powerful end-to-end data pipelines. You'll also find out how to build a modern data warehouse by using Delta tables and Azure Synapse Analytics. Later, you’ll learn how to write ad hoc queries and extract meaningful insights from the data lake by creating visualizations and dashboards with Databricks SQL. Finally, you'll deploy and productionize a data pipeline as well as deploy notebooks and Azure Databricks service using continuous integration and continuous delivery (CI/CD). By the end of this Azure book, you'll be able to use Azure Databricks to streamline different processes involved in building data-driven apps.
Table of Contents (12 chapters)

Reading and writing data from and to CSV and Parquet 

Azure Databricks supports multiple file formats, including sequence files, Record Columnar files, and Optimized Row Columnar files. It also provides native support for CSV, JSON, and Parquet file formats.

Parquet is the most widely used file format in the Databricks Cloud for the following reasons:

  1. Columnar storage format—Stores data column-wise, unlike row-based format files such as Avro and CSV.
  2. Open source—Parquet is open source and free to use.
  3. Aggressive compression—Parquet supports compression, which is not available in most file formats. Because of its compression technique, it requires slow storage compared to other file formats. It uses different encoding methods for compressions.
  4. Performance—The Parquet file format is designed for optimized performance. You can get the relevant data quickly as it saves both data and metadata. The amount of data scanned is comparatively smaller, resulting in less input/output (I/O) usage.
  5. Schema evaluation—It supports changes in the column schema as required. Multiple Parquet files with compatible schemas can be merged.
  6. Self-describing—Each Parquet file contains metadata and data, which makes it self-describing.

Parquet files also support predicate push-down, column filtering, static, and dynamic partition pruning.

In this recipe, you will learn how to read from and write to CSV and Parquet files using Azure Databricks.

Getting ready

You can follow the steps by running the steps in the 2_7.Reading and Writing data from and to CSV, Parquet.ipynb notebook in your local cloned repository in the Chapter02 folder.

Upload the csvFiles folder in the Chapter02/Customer folder to the ADLS Gen2 storage account in the rawdata file system and in Customer/csvFiles folder.

How to do it…

Here are the steps and code samples for reading from and writing to CSV and Parquet files using Azure Databricks. You will find a separate section for processing CSV and Parquet file formats.

Working with the CSV file format

Go through the following steps for reading CSV files and saving data in CSV format.

  1. Ensure that you have mounted the ADLS Gen2 Storage location. If not, you can refer to the Mounting ADLS Gen2 and Azure Blob storage to Azure DBFS recipe in this chapter to follow the steps for mounting a storage account.
  2. Run the following code to list the CSV data files from the mounted ADLS Gen2 storage account:
    #Listing CSV Files
    dbutils.fs.ls("/mnt/Gen2Source/Customer/csvFiles")
  3. Read the customer data stored in csv files in the ADLS Gen2 storage account by running the following code:
    customerDF = spark.read.format("csv").option("header",True).option("inferSchema", True).load("/mnt/Gen2Source/Customer/csvFiles")
  4. You can display the result of a Dataframe by running the following code:
    customerDF.show()
  5. By running the following code, we are writing customerDF DataFrame data to the location /mnt/Gen2Source/Customer/WriteCsvFiles in CSV format.
    customerDF.write.mode("overwrite").option("header", "true").csv("/mnt/Gen2Source/Customer/WriteCsvFiles")
  6. To confirm that the data is written to the target folder in csv format, let's read the csv files from target folder by running the following code.
    targetDF = spark.read.format("csv").option("header",True).option("inferSchema", True).load("/mnt/Gen2Source/Customer/WriteCsvFiles")
    targetDF.show()

In the following section we will learn how to read data from and write data to parquet files.

Working with the Parquet file format

Let's get started.

  1. You can use the same customer dataset for reading from the CSV files and writing into the Parquet file format.
  2. We will use the targetDF DataFrame used in Step 6 and save it as parquet format by running the following code. We are using save mode as overwrite in the following code. Using overwrite save option, existing data is overwritten in the target or destination folder mentioned.
    #Writing the targetDF data which has the CSV data read as parquet File using append mode
    targetDF.write.mode("overwrite").option("header", "true").parquet("/mnt/Gen2Source/Customer/csvasParquetFiles/") 
  3. In the following code, we are reading data from csvasParquetFiles folder to confirm the data in parquet format:
    df_parquetfiles=spark.read.format("parquet").option("header",True).load("/mnt/Gen2Source/Customer/csvasParquetFiles/") 
    display(df_parquetfiles.limit(5))
  4. Let's change the save mode from overwrite to append by running the following code. Using save mode as append, new data will be inserted, and existing data is preserved in the target or destination folder:
    #Using overwrite as option for save mode
    targetDF.write.mode("append").option("header", "true").parquet("/mnt/Gen2Source/Customer/csvasParquetFiles/") 
  5. Run the following code to check the count of records in the parquet folder and number should increase as we have appended the data to the same folder.
    df_parquetfiles=spark.read.format("parquet").option("header",True).load("/mnt/Gen2Source/Customer/csvasParquetFiles/")
    df_parquetfiles.count()

By the end of this recipe, you have learnt how to read from and write to CSV and Parquet files.

How it works…

The CSV file format is a widely used format by many tools, and it's also a default format for processing data. There are many disadvantages when you compare it in terms of cost, query processing time, and size of the data files. The CSV format is not that effective compared with what you will find in the Parquet file format. Also, it doesn't support partition pruning, which directly impacts the cost of storing and processing data in CSV format.

Conversely, Parquet is a columnar format that supports compression and partition pruning. It is widely used for processing data in big data projects for both reading and writing data. A Parquet file stores data and metadata, which makes it self-describing.

Parquet also supports schema evolution, which means you can change the schema of the data as required. This helps in developing systems that can accommodate changes in the schema as it matures. In such cases, you may end up with multiple Parquet files that have different schemas but are compatible.