Hands-On Data Warehousing with Azure Data Factory

By: Christian Cote, Michelle Gutzait, Giuseppe Ciaburro

Overview of this book

ETL is one of the essential techniques in data processing. Given that data is everywhere, ETL will always remain a vital process for handling data from different sources. Hands-On Data Warehousing with Azure Data Factory starts with the basic concepts of data warehousing and the ETL process. You will learn how Azure Data Factory and SSIS can be used to understand the key components of an ETL solution. You will go through the different services offered by Azure that can be used with ADF and SSIS, such as Azure Data Lake Analytics, Machine Learning, and Databricks Spark, with the help of practical examples. You will explore how to design and implement ETL hybrid solutions using different integration services, with a step-by-step approach. Once you get to grips with all this, you will use Power BI to interact with data coming from different sources in order to reveal valuable insights. By the end of this book, you will not only know how to build your own ETL solutions but also how to address the key challenges that are faced while building them.

What's new in V2.0?


ADF has been completely overhauled in V2. This section describes the main novelties of ADF V2.

Integration runtime

This is one of the main features of version 2.0. The integration runtime is the compute infrastructure that ADF uses to perform data integration across different network environments. Here are the main capabilities it provides:

  • Data movement between public and private networks, either on-premises or via a virtual private network (VPN). This capability was known as the data management gateway in V1 and Power BI.
    • Public: Used for Azure and other cloud connections. A default integration runtime comes with ADF.
    • Private: Used to connect private compute resources, such as an on-premises SQL Server, to ADF. This requires installing a service on a Windows machine in the private network; that machine can then connect to the enterprise resources and send the data to ADF via the installed service (see the sketch after this list).
  • SSIS package execution: managing and running SSIS packages in Azure. This is one of the main topics of this book; Chapter 3, SSIS Lift and Shift, is completely dedicated to this feature.
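
As an illustration, here is a minimal sketch of registering a self-hosted (private) runtime with the Python SDK (the azure-mgmt-datafactory package). All credentials and resource names below are placeholders, and the exact signatures vary slightly between SDK versions:

    from azure.identity import ClientSecretCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        IntegrationRuntimeResource, SelfHostedIntegrationRuntime)

    # Placeholder credentials and names -- replace with real values.
    credential = ClientSecretCredential(
        tenant_id="<tenant-id>", client_id="<app-id>", client_secret="<secret>")
    client = DataFactoryManagementClient(credential, "<subscription-id>")
    rg, factory = "my-resource-group", "my-data-factory"

    # Create the runtime definition inside the factory...
    client.integration_runtimes.create_or_update(
        rg, factory, "OnPremIR",
        IntegrationRuntimeResource(
            properties=SelfHostedIntegrationRuntime(
                description="Connects on-premises resources to ADF")))

    # ...then fetch the key that the agent installed on the Windows
    # machine uses to register itself with this runtime.
    keys = client.integration_runtimes.list_auth_keys(rg, factory, "OnPremIR")
    print(keys.auth_key1)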

Linked services

Linked services now have a connectVia property that lets them use the integration runtimes mentioned earlier in this chapter. They can also connect to many more data stores than was possible before.
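
For example, here is a minimal sketch of a linked service that reaches an on-premises SQL Server through the runtime registered above; the connection string is a placeholder:

    from azure.mgmt.datafactory.models import (
        LinkedServiceResource, SqlServerLinkedService,
        IntegrationRuntimeReference, SecureString)

    # `client`, `rg`, and `factory` as set up in the earlier runtime sketch.
    ls = SqlServerLinkedService(
        connection_string=SecureString(
            value="Server=onprem-sql;Database=Sales;Integrated Security=True;"),
        # connectVia: route this connection through the private runtime.
        connect_via=IntegrationRuntimeReference(reference_name="OnPremIR"))

    client.linked_services.create_or_update(
        rg, factory, "OnPremSqlServer", LinkedServiceResource(properties=ls))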

Datasets

Datasets are the same as they were in V1, but we no longer need to define availability schedules on them, which makes them much more flexible to use. In conjunction with linked services, datasets now have access to a whole range of new data stores, both as sources and as destinations.
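
Here is a minimal sketch of a V2 dataset, again with placeholder names; note that, unlike in V1, no availability section is required:

    from azure.mgmt.datafactory.models import (
        DatasetResource, AzureBlobDataset, LinkedServiceReference)

    # `client`, `rg`, and `factory` as set up in the earlier runtime sketch.
    ds = AzureBlobDataset(
        linked_service_name=LinkedServiceReference(reference_name="BlobStorage"),
        folder_path="input-container/sales",
        file_name="sales.csv")

    client.datasets.create_or_update(
        rg, factory, "SalesBlob", DatasetResource(properties=ds))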

Pipelines

Pipelines have been modified quite a lot in V2. They no longer have execution windows with start and end times. Pipelines can now be executed in the following ways:

  • On demand via .NET, PowerShell, the REST API, or Python
  • Trigger:
    • Schedule trigger: This trigger uses a wall-clock schedule; for example, a pipeline can be executed on a weekly basis, every Tuesday and Thursday at 10:00 AM (this exact schedule is sketched after this list)
    • Tumbling window trigger: This works on a periodic interval; for example, every 15 minutes between two specific dates
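
Both execution modes can be sketched with the Python SDK; the pipeline and trigger names below are placeholders:

    from datetime import datetime
    from azure.mgmt.datafactory.models import (
        TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
        RecurrenceSchedule, TriggerPipelineReference, PipelineReference)

    # `client`, `rg`, and `factory` as set up in the earlier runtime sketch.

    # On demand: request a single run right now.
    run = client.pipelines.create_run(rg, factory, "CopySalesPipeline",
                                      parameters={})
    print(run.run_id)

    # Schedule trigger: weekly, every Tuesday and Thursday at 10:00 AM.
    trigger = ScheduleTrigger(
        pipelines=[TriggerPipelineReference(
            pipeline_reference=PipelineReference(
                reference_name="CopySalesPipeline"))],
        recurrence=ScheduleTriggerRecurrence(
            frequency="Week", interval=1, start_time=datetime(2018, 4, 1),
            schedule=RecurrenceSchedule(
                week_days=["Tuesday", "Thursday"], hours=[10], minutes=[0])))

    client.triggers.create_or_update(
        rg, factory, "WeeklyTrigger", TriggerResource(properties=trigger))
    # The trigger must still be started (triggers.start or begin_start,
    # depending on the SDK version) before it fires.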

Activities

Pipelines now have the following control activities (a short sketch combining some of them follows this list):

  • Execute pipeline activity: Calls another pipeline in the same factory.
  • For each activity: Executes activities in a loop, similar to a foreach loop in structured programming languages.
  • Web activity: Used to call custom REST endpoints.
  • Lookup activity: Reads a record from an external data source. The output can be used by subsequent activities.
  • Get metadata activity: Retrieves the metadata of data in ADF, such as a dataset's structure or existence.
  • Until activity: Loops the execution of a set of activities until the associated condition evaluates to true.
  • If condition activity: This is like an if statement in standard programming languages.
  • Wait activity: Pauses the pipeline for a specified time before execution resumes.
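
Here is a minimal sketch combining three of these activities -- wait, for each, and execute pipeline. The pipeline names are placeholders:

    from azure.mgmt.datafactory.models import (
        PipelineResource, ParameterSpecification, WaitActivity,
        ForEachActivity, ExecutePipelineActivity, PipelineReference,
        Expression)

    # `client`, `rg`, and `factory` as set up in the earlier runtime sketch.
    loop = ForEachActivity(
        name="ForEachTable",
        # Iterate over an array parameter supplied at run time.
        items=Expression(value="@pipeline().parameters.tableList"),
        activities=[ExecutePipelineActivity(
            name="LoadOneTable",
            pipeline=PipelineReference(reference_name="LoadTablePipeline"))])

    pipe = PipelineResource(
        parameters={"tableList": ParameterSpecification(type="Array")},
        activities=[WaitActivity(name="WarmUp", wait_time_in_seconds=30),
                    loop])
    client.pipelines.create_or_update(rg, factory, "ControlFlowDemo", pipe)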

Parameters

Parameters can now be defined on pipelines. They are read-only values that are passed in when the pipeline is executed, whether manually or by a trigger.
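
A minimal sketch of declaring a parameter and supplying its value at run time, with placeholder names:

    from azure.mgmt.datafactory.models import (
        PipelineResource, ParameterSpecification, WaitActivity)

    # `client`, `rg`, and `factory` as set up in the earlier runtime sketch.
    pipe = PipelineResource(
        parameters={"runDate": ParameterSpecification(type="String")},
        activities=[WaitActivity(name="Placeholder", wait_time_in_seconds=1)])
    client.pipelines.create_or_update(rg, factory, "ParamPipeline", pipe)

    # The value is supplied when the run is requested; it cannot be
    # changed from inside the pipeline.
    client.pipelines.create_run(
        rg, factory, "ParamPipeline", parameters={"runDate": "2018-04-01"})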

Expressions

In V1, functions could only be used to filter dataset queries. In V2, expressions can be used almost anywhere in JSON-defined factory objects.
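
For example, an expression can compute an activity property at run time. In this minimal sketch, @concat and pipeline().parameters belong to the ADF expression language, while the URL and names are placeholders:

    from azure.mgmt.datafactory.models import (
        PipelineResource, ParameterSpecification, WebActivity)

    # `client`, `rg`, and `factory` as set up in the earlier runtime sketch.
    pipe = PipelineResource(
        parameters={"runDate": ParameterSpecification(type="String")},
        activities=[WebActivity(
            name="NotifyApi", method="GET",
            # Evaluated at run time, e.g. to https://example.com/runs/2018-04-01
            url="@concat('https://example.com/runs/', "
                "pipeline().parameters.runDate)")])
    client.pipelines.create_or_update(rg, factory, "ExpressionDemo", pipe)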

Controlling the flow of activities

Calling activities is more flexible in V2 than it was in V1. As listed in the Activities section, there are many new control activities, such as for each, if, until, and lookup. In addition, activities can be chained by declaring dependencies between them, as the following sketch shows.
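
In this minimal sketch, with placeholder names, the second activity only runs when the first one succeeds:

    from azure.mgmt.datafactory.models import (
        PipelineResource, WaitActivity, WebActivity, ActivityDependency)

    # `client`, `rg`, and `factory` as set up in the earlier runtime sketch.
    first = WaitActivity(name="Stage", wait_time_in_seconds=5)
    second = WebActivity(
        name="Notify", method="GET", url="https://example.com/done",
        # Run only if the "Stage" activity finished successfully.
        depends_on=[ActivityDependency(
            activity="Stage", dependency_conditions=["Succeeded"])])

    client.pipelines.create_or_update(
        rg, factory, "DependencyDemo",
        PipelineResource(activities=[first, second]))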

SSIS package deployment in Azure

There is now a new SSIS runtime that completely manages clusters of Azure VMs dedicated to running SSIS in the cloud. With the Azure-SSIS integration runtime, packages are deployed in the same manner as they are deployed on-premises. SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS) can be used to deploy SSIS packages.
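
Provisioning the Azure-SSIS runtime can also be sketched with the Python SDK. The node size, node counts, and catalog details below are placeholder values, and the model names are assumed from the azure-mgmt-datafactory package:

    from azure.mgmt.datafactory.models import (
        IntegrationRuntimeResource, ManagedIntegrationRuntime,
        IntegrationRuntimeComputeProperties, IntegrationRuntimeSsisProperties,
        IntegrationRuntimeSsisCatalogInfo, SecureString)

    # `client`, `rg`, and `factory` as set up in the earlier runtime sketch.
    ssis_ir = ManagedIntegrationRuntime(
        compute_properties=IntegrationRuntimeComputeProperties(
            location="EastUS", node_size="Standard_D2_v3",
            number_of_nodes=2, max_parallel_executions_per_node=4),
        ssis_properties=IntegrationRuntimeSsisProperties(
            # SSISDB catalog hosted on an Azure SQL Database server:
            catalog_info=IntegrationRuntimeSsisCatalogInfo(
                catalog_server_endpoint="myserver.database.windows.net",
                catalog_admin_user_name="ssisadmin",
                catalog_admin_password=SecureString(value="<password>"),
                catalog_pricing_tier="Basic")))

    client.integration_runtimes.create_or_update(
        rg, factory, "AzureSsisIR",
        IntegrationRuntimeResource(properties=ssis_ir))
    # The runtime must then be started before packages can run on it.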

Spark cluster data store

There are many more data stores available now.

Spark clusters are now available in V2. Since Spark is very performant and keeps integrating more functionality, it has become an almost essential player in the big data world. In the previous version of ADF, Spark clusters were only reachable via MapReduce custom activities. In this version, Spark is a first-class citizen, so there will be no more headaches when it comes to integrating it into our data flows.
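
A minimal sketch of calling Spark as a first-class activity; the HDInsight linked service and file paths are placeholders:

    from azure.mgmt.datafactory.models import (
        PipelineResource, HDInsightSparkActivity, LinkedServiceReference)

    # `client`, `rg`, and `factory` as set up in the earlier runtime sketch.
    spark = HDInsightSparkActivity(
        name="TransformSales",
        # Linked service pointing to the HDInsight Spark cluster:
        linked_service_name=LinkedServiceReference(
            reference_name="MyHDInsightSpark"),
        root_path="adfspark",                   # container with the job files
        entry_file_path="scripts/transform.py",
        # Storage account that hosts the script files:
        spark_job_linked_service=LinkedServiceReference(
            reference_name="BlobStorage"))

    client.pipelines.create_or_update(
        rg, factory, "SparkPipeline", PipelineResource(activities=[spark]))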