Distributed Data Systems with Azure Databricks

By Alan Bernardo Palacio

Overview of this book

Microsoft Azure Databricks helps you harness the power of distributed computing and apply it to create robust data pipelines, along with training and deploying machine learning and deep learning models. Databricks' advanced features enable developers to process, transform, and explore data. Distributed Data Systems with Azure Databricks will help you put your knowledge of Databricks to work to create big data pipelines. The book provides a hands-on approach to implementing Azure Databricks and its associated methodologies that will make you productive in no time. Complete with detailed explanations of essential concepts, practical examples, and self-assessment questions, the book begins with a quick introduction to Databricks' core functionalities before moving on to distributed model training and inference using TensorFlow and Spark MLlib. As you advance, you'll explore MLflow Model Serving on Azure Databricks and implement distributed training pipelines using HorovodRunner in Databricks. Finally, you'll discover how to transform, use, and obtain insights from massive amounts of data to train predictive models and create fully working data pipelines. By the end of this book, you'll have gained a solid understanding of how to work with Databricks to create and manage an entire big data pipeline.
Table of Contents (17 chapters)

Section 1: Introducing Databricks
Section 2: Data Pipelines with Databricks
Section 3: Machine and Deep Learning with Databricks

What this book covers

Chapter 1, Introduction to Azure Databricks, takes you through the core functionalities of Databricks, including how to interact with the workspace environment, a quick look at the main applications, and how Python users will work with the tool. It covers topics such as the workspace, the interface, computation management, and Databricks notebooks.

Chapter 2, Creating an Azure Databricks Workspace, teaches you how to apply all the previous concepts using the different tools Azure provides for interacting with the workspace, including PowerShell and the Azure CLI to manage all Databricks resources.

Chapter 3, Creating ETL Operations with Azure Databricks, shows you how to manage different data sources, transform them, and create an entire event-driven ETL pipeline.
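
To give a flavor of the kind of pipeline built there, here is a minimal PySpark sketch (not taken from the book; the path and column names are hypothetical placeholders, and spark is the SparkSession that Databricks notebooks provide):

    from pyspark.sql import functions as F

    # Extract: read raw CSV files from mounted storage (path and columns are hypothetical)
    raw_df = spark.read.option("header", "true").csv("/mnt/raw/sales")

    # Transform: cast types, drop incomplete rows, and add an ingestion date
    clean_df = (raw_df
        .withColumn("amount", F.col("amount").cast("double"))
        .dropna(subset=["amount"])
        .withColumn("ingest_date", F.current_date()))

    # Load: write the curated result as Parquet for downstream consumption
    clean_df.write.mode("overwrite").parquet("/mnt/curated/sales")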

Chapter 4, Delta Lake with Azure Databricks, explores Delta Lake and how to implement it for various operations.
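
As a quick illustration of the Delta Lake API covered there, a minimal sketch (df, sales_df, and the paths are hypothetical placeholders):

    # Write a DataFrame as a Delta table (df and the path are placeholders)
    df.write.format("delta").mode("overwrite").save("/mnt/delta/sales")

    # Read it back; Delta Lake adds ACID transactions and versioning on top of Parquet
    sales_df = spark.read.format("delta").load("/mnt/delta/sales")

    # Time travel: query an earlier version of the same table
    v0_df = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/sales")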

Chapter 5, Introducing Delta Engine, explores Delta Engine and shows you how to use it together with Delta Lake to create efficient ETLs in Databricks.

Chapter 6, Introducing Structured Streaming, explains how to use specific types of streaming sources and sinks and how to put streaming into production, with notebooks demonstrating example use cases.
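
A minimal sketch of the source-to-sink pattern discussed there, assuming a JSON file source and a Delta sink (the schema fields and paths are hypothetical):

    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    # A schema is required for file-based streaming sources (fields are placeholders)
    events_schema = StructType([
        StructField("event_id", StringType()),
        StructField("event_time", TimestampType()),
    ])

    # Incrementally read new JSON files landing in a directory (a file streaming source)
    stream_df = (spark.readStream
        .format("json")
        .schema(events_schema)
        .load("/mnt/landing/events"))

    # Write the stream to a Delta sink, with a checkpoint location for fault tolerance
    query = (stream_df.writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/events")
        .outputMode("append")
        .start("/mnt/delta/events"))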

Chapter 7, Using Python Libraries in Azure Databricks, explores the nuances of working with Python and introduces core concepts about models and data that will be studied in more detail later on.

Chapter 8, Databricks Runtime for Machine Learning, is a deep dive into developing classic ML algorithms to train and deploy models based on tabular data, exploring the relevant libraries and algorithms along the way. The examples focus on the particularities and advantages of using Databricks for ML.
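
As a rough illustration (not the book's own example), a classic tabular pipeline with Spark MLlib might look like this; train_df, test_df, and the column names are placeholders:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    # Assemble tabular feature columns into a single vector column
    assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")

    # A classic ML estimator trained on the assembled features
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    # Fit the pipeline on a training DataFrame and score a held-out set
    model = Pipeline(stages=[assembler, lr]).fit(train_df)
    predictions = model.transform(test_df)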

Chapter 9, Databricks Runtime for Deep Learning, is a deep dive into developing DL algorithms to train and deploy models based on unstructured data, exploring the relevant libraries and algorithms along the way. The examples focus on the particularities and advantages of using Databricks for DL.
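
For a sense of the building blocks involved, here is a minimal Keras model definition of the kind such a chapter works with (the layer sizes and input shape are arbitrary placeholders, not the book's example):

    import tensorflow as tf

    # A small feed-forward network for image-like input (shapes are placeholders)
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    # Compile with a standard optimizer and loss for multi-class classification
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])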

Chapter 10, Model Tracking and Tuning in Azure Databricks, focuses on model tuning, deployment, and control using Databricks functionality such as AutoML and Delta Lake, in conjunction with popular libraries such as TensorFlow.
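
The chapter's exact tooling aside, one common way to distribute hyperparameter tuning on Databricks is Hyperopt with SparkTrials; a hedged sketch with a toy objective function standing in for real model training:

    from hyperopt import fmin, tpe, hp, SparkTrials

    def objective(params):
        # A real objective would train a model with these hyperparameters and return
        # its validation loss; a toy quadratic stands in so the sketch runs end to end
        return (params["learning_rate"] - 0.1) ** 2

    # Search space for a single hypothetical hyperparameter
    search_space = {"learning_rate": hp.loguniform("learning_rate", -5, 0)}

    # SparkTrials distributes the trials across the cluster's workers
    best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
                max_evals=20, trials=SparkTrials(parallelism=4))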

Chapter 11, Managing and Serving Models with MLflow and MLeap, explores MLflow, an open source platform for managing the end-to-end ML life cycle, in more detail. MLflow allows you to track experiments, record and compare parameters, centralize model storage, and more. You will learn how to use it in combination with what was covered in the previous chapters.
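
A minimal MLflow tracking sketch (the parameter, metric value, and model are placeholders; mlflow.spark.log_model assumes a trained Spark ML model such as the pipeline fitted earlier):

    import mlflow
    import mlflow.spark

    # Track parameters, metrics, and the trained model in a single MLflow run
    with mlflow.start_run():
        mlflow.log_param("max_iter", 10)
        mlflow.log_metric("auc", 0.87)          # a metric computed elsewhere
        mlflow.spark.log_model(model, "model")  # store the Spark ML model as an artifact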

Chapter 12, Distributed Deep Learning in Azure Databricks, demonstrates how to use Horovod to make distributed DL faster by taking single-GPU training scripts and scaling them to train across many GPUs in parallel.
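
On Databricks Runtime ML, HorovodRunner is exposed through the sparkdl package; a hedged sketch in which train() stands in for any single-GPU training function you already have:

    from sparkdl import HorovodRunner

    def train():
        # A single-GPU TensorFlow/Keras training loop, wrapped with Horovod's
        # distributed optimizer, would go here; Horovod averages gradients
        # across workers so the script scales to many GPUs with few changes.
        ...

    # np=2 requests two parallel workers; HorovodRunner distributes train() across them
    hr = HorovodRunner(np=2)
    hr.run(train)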