What's new in V2.0?
ADF has been overhauled in V2. This section describes the main new features of ADF V2.
Integration runtime
The integration runtime is one of the main features of version 2.0. It is the compute infrastructure that ADF uses to perform data integration across networks. It provides the following capabilities:
- Data movement between public and private networks, whether on-premises or over a virtual private network (VPN). In V1 and Power BI, integration runtimes were known as data management gateways. There are two kinds:
  - Public: Used by Azure and other cloud connections. ADF comes with a default integration runtime of this kind.
  - Private: Used to connect private compute resources, such as an on-premises SQL Server, to ADF. A service must be installed on a Windows machine in the private network. That machine can then connect to the enterprise resources and send the data to ADF through the installed service.
- SSIS package execution: Runs and manages SSIS packages in Azure. This is one of the main topics of this book. Chapter 3, SSIS Lift and Shift, is completely dedicated to this feature.
Linked services
Linked services now have a connectVia property that lets them use the integration runtimes described earlier in this chapter. They can also connect to many more data stores than was possible in V1.
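As a rough illustration, the following sketch builds a linked service definition as a Python dictionary and prints it as JSON. The connectVia block is what points the linked service at an integration runtime. The names OnPremSqlServer and MySelfHostedIR and the connection string are placeholders invented for this example, and the property layout should be checked against the current ADF documentation before use.

```python
import json


def build_linked_service(name, integration_runtime, connection_string):
    """Return a linked service definition that runs through the given integration runtime."""
    return {
        "name": name,
        "properties": {
            "type": "SqlServer",
            "typeProperties": {"connectionString": connection_string},
            # connectVia references the integration runtime by name.
            "connectVia": {
                "referenceName": integration_runtime,
                "type": "IntegrationRuntimeReference",
            },
        },
    }


# All values below are hypothetical placeholders.
linked_service = build_linked_service(
    name="OnPremSqlServer",
    integration_runtime="MySelfHostedIR",
    connection_string="<your connection string>",
)
print(json.dumps(linked_service, indent=2))
```

Leaving connectVia out would presumably make the linked service fall back to the default integration runtime that comes with ADF.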
Datasets
Datasets are the same as they were in V1, except that they no longer need an availability schedule, which makes them more flexible to use. In conjunction with linked services, datasets now have access to many new data stores, both as sources and as destinations.
Pipelines
Pipelines have changed considerably in V2. They no longer have execution windows with start and end times. A pipeline can now be executed in the following ways:
- On demand via .NET, PowerShell, REST API, or Python
- By a trigger:
  - Schedule trigger: Uses a wall-clock schedule. For example, a pipeline can be executed weekly, every Tuesday and Thursday at 10:00 AM.
  - Tumbling window trigger: Fires at a periodic interval, for example, every 15 minutes between two specific dates.
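To make the difference between the two trigger types concrete, here is a small local simulation in Python. It is not ADF's scheduler and it calls no ADF API. It only computes the times at which each kind of trigger would fire for the two examples above. The function names and the dates are assumptions chosen for the illustration.

```python
from datetime import datetime, timedelta


def tumbling_windows(start, end, interval):
    """Yield contiguous, non-overlapping (window_start, window_end) pairs."""
    window_start = start
    while window_start + interval <= end:
        yield window_start, window_start + interval
        window_start += interval


def weekly_schedule(start, end, weekdays, hour, minute):
    """Return wall-clock run times in [start, end) on the given weekdays (0 = Monday)."""
    runs = []
    candidate = start.replace(hour=hour, minute=minute, second=0, microsecond=0)
    while candidate < end:
        if candidate >= start and candidate.weekday() in weekdays:
            runs.append(candidate)
        candidate += timedelta(days=1)
    return runs


# Tumbling window: every 15 minutes between two specific dates.
windows = list(
    tumbling_windows(
        datetime(2018, 1, 1, 0, 0),
        datetime(2018, 1, 1, 1, 0),
        timedelta(minutes=15),
    )
)

# Schedule: every Tuesday (1) and Thursday (3) at 10:00 AM during one week.
runs = weekly_schedule(datetime(2018, 1, 1), datetime(2018, 1, 8), {1, 3}, 10, 0)

for window_start, window_end in windows:
    print("window:", window_start, "to", window_end)
for run in runs:
    print("scheduled run:", run)
```

The tumbling window output is a series of back-to-back time slices, each of which can be processed as a unit, whereas the schedule output is just a list of moments on the calendar.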
Activities
Pipelines now have the following control activities:
- Execute pipeline: Calls another pipeline in the same factory.
- For each activity: Executes activities in a loop similar to any for each loop in structured programming languages.
- Web activity: Used to call custom REST endpoints.
- Lookup activity: Gets a record from any external data source. The output can be used by subsequent activities.
- Get metadata activity: Gets the metadata of data in ADF.
- Until activity: Repeats a set of activities until the condition evaluates to true.
- If condition activity: This is like any if statement in standard programming languages.
- Wait activity: Pauses the pipeline for a time before resuming other activities.
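The sketch below shows how a few of these control activities might be combined in a pipeline definition. It is written as a Python dictionary so that it can be printed as JSON. The pipeline name, the activity names, and the tableNames parameter are hypothetical. The property names reflect the V2 JSON format as commonly documented and should be verified against the official reference.

```python
import json


def expression(value):
    """Wrap a string so that it is marked as an expression rather than a literal."""
    return {"value": value, "type": "Expression"}


# If condition: only wait when at least one table name was passed in.
check_input = {
    "name": "CheckInput",
    "type": "IfCondition",
    "typeProperties": {
        "expression": expression(
            "@greater(length(pipeline().parameters.tableNames), 0)"
        ),
        "ifTrueActivities": [
            {
                "name": "ShortPause",
                "type": "Wait",
                "typeProperties": {"waitTimeInSeconds": 5},
            }
        ],
    },
}

# For each: runs its inner activities once per item, after CheckInput succeeds.
loop_over_tables = {
    "name": "LoopOverTables",
    "type": "ForEach",
    "dependsOn": [
        {"activity": "CheckInput", "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "items": expression("@pipeline().parameters.tableNames"),
        "activities": [
            {
                "name": "PausePerTable",
                "type": "Wait",
                "typeProperties": {"waitTimeInSeconds": 1},
            }
        ],
    },
}

pipeline = {
    "name": "ControlFlowDemo",
    "properties": {
        "parameters": {"tableNames": {"type": "Array"}},
        "activities": [check_input, loop_over_tables],
    },
}
print(json.dumps(pipeline, indent=2))
```

In a real pipeline, the Wait activities inside the loop would be replaced by something useful, such as a copy activity that reads the current item.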
Parameters
Pipelines can define parameters. These are read-only values that are passed in when the pipeline is run, whether it is executed manually or by a schedule.
Expressions
In V1, functions could be used to filter data in dataset queries. In V2, expressions can be used anywhere in JSON-defined factory objects.
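To give a feel for how a parameter value ends up inside an expression, here is a toy resolver in Python. ADF evaluates expressions itself when a pipeline runs, so this code only imitates one simple case: string interpolation of the form @{pipeline().parameters.NAME}. The resolve function, the tableName parameter, and the query are made up for the example.

```python
import re

# Matches only the interpolation form @{pipeline().parameters.NAME}.
PARAM_REF = re.compile(r"@\{pipeline\(\)\.parameters\.(\w+)\}")


def resolve(template, parameters):
    """Toy resolver: replace each parameter reference with the value passed at run time."""

    def substitute(match):
        return str(parameters[match.group(1)])

    return PARAM_REF.sub(substitute, template)


# A query as it might appear in a JSON definition, with an embedded expression.
query = "SELECT * FROM @{pipeline().parameters.tableName}"

# Values supplied when the pipeline is run (hypothetical).
run_parameters = {"tableName": "dbo.Sales"}

resolved = resolve(query, run_parameters)
print(resolved)  # SELECT * FROM dbo.Sales
```

The same definition can therefore be reused for different tables simply by passing a different parameter value at run time.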
Controlling the flow of activities
Calling activities is more flexible in V2 than in V1. As stated in the Activities section, there are many new control activities, such as for each, if, until, and lookup.
SSIS package deployment in Azure
There is now a new SSIS runtime, the Azure SSIS integration runtime, that completely manages clusters of Azure VMs dedicated to running SSIS in the cloud. With this runtime, packages are deployed in the same manner as they are on-premises: SQL Server Data Tools (SSDT) or SQL Server Management Studio (SSMS) can be used to deploy them.
Spark cluster data store
There are many more data stores available now.
Spark clusters are among them. Because Spark performs very well and keeps gaining functionality, it has become an almost essential player in the big data world. In the previous version of ADF, Spark clusters were only available via MapReduce custom activities. In V2, Spark is a first-class citizen, so there will be no more headaches when it comes to integrating it into our data flow.