Data Engineering with Azure Databricks
In this chapter, we explored essential data ingestion strategies for Azure Databricks, focusing on batch ingestion patterns that underpin enterprise data platforms.
We began by understanding the differences between batch and streaming ingestion. Batch ingestion processes data in scheduled chunks and is ideal for historical analysis, regulatory reporting, and cost-sensitive workloads. Streaming ingestion (covered in detail in Chapter 5) handles real-time data processing with Event Hubs and Auto Loader.
We examined ingesting data from Azure Storage (ADLS Gen2 and Blob Storage) and reviewed the available authentication methods: Managed Identities (the recommended approach, as set up in Chapter 2), service principals with OAuth 2.0, and access keys. We explored reading key file formats: CSV for data exchange, JSON for semi-structured data, and Parquet for optimal analytical performance. We implemented a simple watermark-based incremental loading pattern for processing...
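As a reminder of the idea behind the watermark-based incremental loading pattern mentioned above, here is a minimal sketch in plain Python rather than Spark. It is not the chapter's implementation: the `Record` type, the `modified_at` column, and the `load_increment` helper are hypothetical names chosen for illustration, and the source is an in-memory list standing in for a table or a folder of files.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Record:
    """Hypothetical source row with a last-modified timestamp."""
    id: int
    modified_at: datetime


def load_increment(
    source: List[Record], watermark: Optional[datetime]
) -> Tuple[List[Record], Optional[datetime]]:
    """Return the rows newer than the watermark, plus the advanced watermark.

    A watermark of None means nothing has been loaded yet, so every row qualifies.
    """
    new_rows = [
        r for r in source if watermark is None or r.modified_at > watermark
    ]
    if not new_rows:
        # Nothing new arrived: keep the old watermark unchanged.
        return [], watermark
    # Advance the watermark to the newest timestamp just processed.
    return new_rows, max(r.modified_at for r in new_rows)


# Stand-in for a source table; in practice this would be a Spark read.
source = [
    Record(1, datetime(2024, 1, 1)),
    Record(2, datetime(2024, 1, 2)),
]

batch1, wm1 = load_increment(source, None)   # first run loads everything
source.append(Record(3, datetime(2024, 1, 3)))
batch2, wm2 = load_increment(source, wm1)    # second run loads only the new row
batch3, wm3 = load_increment(source, wm2)    # third run finds nothing new
```

In a real pipeline the watermark would be persisted between runs (for example in a small control table) and the filter would be pushed down to the source query. The strict `>` comparison assumes timestamps are unique enough that no late row shares the watermark value.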