Distributed Data Systems with Azure Databricks

By: Alan Bernardo Palacio

Overview of this book

Microsoft Azure Databricks helps you to harness the power of distributed computing and apply it to create robust data pipelines, along with training and deploying machine learning and deep learning models. Databricks' advanced features enable developers to process, transform, and explore data. Distributed Data Systems with Azure Databricks will help you to put your knowledge of Databricks to work to create big data pipelines. The book provides a hands-on approach to implementing Azure Databricks and its associated methodologies that will make you productive in no time. Complete with detailed explanations of essential concepts, practical examples, and self-assessment questions, the book begins with a quick introduction to Databricks' core functionalities before moving on to distributed model training and inference using TensorFlow and Spark MLlib. As you advance, you’ll explore MLflow Model Serving on Azure Databricks and implement distributed training pipelines using HorovodRunner in Databricks. Finally, you’ll discover how to transform, use, and obtain insights from massive amounts of data to train predictive models and create fully working data pipelines. By the end of this MS Azure book, you’ll have gained a solid understanding of how to work with Databricks to create and manage an entire big data pipeline.
Table of Contents (17 chapters)
Section 1: Introducing Databricks
Section 2: Data Pipelines with Databricks
Section 3: Machine and Deep Learning with Databricks

Ingesting data using Delta Lake

Data can be ingested into Delta Lake in several ways. Azure Databricks offers integrations with partners that provide data sources, which are loaded as Delta tables. We can also copy a file directly into a table, use Auto Loader, or create a new streaming table. Let's take a closer look at each of these options.
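To make two of these options more concrete, here is a minimal sketch in Python. It assumes a Databricks notebook where a `spark` session already exists. The helper names (`build_copy_into`, `start_auto_loader`), the table name `bronze.events`, and the `/mnt/raw/events/` path are hypothetical and used only for illustration. The first helper only assembles a `COPY INTO` statement as text, and the second shows how an Auto Loader stream (the `cloudFiles` source) might be wired to a Delta table.

```python
def build_copy_into(table, source_path, file_format="CSV", format_options=None):
    """Return a COPY INTO statement that loads files from source_path into a table.

    The table and path are supplied by the caller; nothing is executed here.
    """
    lines = [
        f"COPY INTO {table}",
        f"FROM '{source_path}'",
        f"FILEFORMAT = {file_format}",
    ]
    if format_options:
        # Render options as 'key' = 'value' pairs, in insertion order.
        opts = ", ".join(f"'{k}' = '{v}'" for k, v in format_options.items())
        lines.append(f"FORMAT_OPTIONS ({opts})")
    return "\n".join(lines)


def start_auto_loader(spark, source_path, target_table, schema_path,
                      checkpoint_path, file_format="json"):
    """Incrementally ingest new files from source_path into a Delta table.

    `spark` is the session provided by the Databricks runtime. All paths and
    the table name are assumptions supplied by the caller.
    """
    stream = (
        spark.readStream.format("cloudFiles")                  # Auto Loader source
        .option("cloudFiles.format", file_format)              # format of incoming files
        .option("cloudFiles.schemaLocation", schema_path)      # where the inferred schema is tracked
        .load(source_path)
    )
    return (
        stream.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)         # records which files were processed
        .toTable(target_table)
    )


# Hypothetical table and mount point, for illustration only.
statement = build_copy_into(
    "bronze.events",
    "/mnt/raw/events/",
    file_format="CSV",
    format_options={"header": "true", "inferSchema": "true"},
)
print(statement)
# On a Databricks cluster, the statement could then be run with spark.sql(statement).
```

`COPY INTO` suits one-off or scheduled batch loads, since re-running it skips files that were already loaded, whereas Auto Loader keeps picking up new files as they arrive in the source location.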

Partner integrations

Azure Databricks allows us to connect to different partners that provide data sources. These are easy to implement and provide scalable ingestion.

We can view the options that we have for ingesting data from Partner Integrations when creating a new table in the UI, as shown in the following screenshot:

Figure 4.1 – Ingesting data from Partner Integrations

Some of these integrations, such as Qlik, allow you to get data from multiple data sources, such as Oracle, Microsoft SQL Server, and SAP, and load it into Delta Lake. Of course, these integrations require you to have a subscription to...