Hands-On Data Warehousing with Azure Data Factory

By: Christian Cote, Michelle Gutzait, Giuseppe Ciaburro

Overview of this book

ETL is one of the essential techniques in data processing. Given that data is everywhere, ETL will always remain a vital process for handling data from different sources. Hands-On Data Warehousing with Azure Data Factory starts with the basic concepts of data warehousing and the ETL process. You will learn how Azure Data Factory and SSIS can be used to implement the key components of an ETL solution. You will go through the different services offered by Azure that can be used by ADF and SSIS, such as Azure Data Lake Analytics, Machine Learning, and Databricks Spark, with the help of practical examples. You will explore how to design and implement ETL hybrid solutions using different integration services with a step-by-step approach. Once you get to grips with all this, you will use Power BI to interact with data coming from different sources in order to reveal valuable insights. By the end of this book, you will not only know how to build your own ETL solutions but also be able to address the key challenges that are faced while building them.

The modern data warehouse


Microsoft, as well as many other service providers, has outlined the concept of the modern data warehouse. Here are some of the many features a modern data warehouse should have:

  • Integration of relational as well as non-relational sources: The data warehouse should be able to ingest data that cannot easily be integrated into a traditional data warehouse, such as big data, non-relational data, and so on.
  • Hybrid deployment: It should be possible to extend the data warehouse from on-premises storage to the cloud.
  • Advanced analytics: The data warehouse should be able to analyze the data from all kinds of datasets using modern machine learning tools.
  • In-database analytics: The data warehouse should be able to run powerful open analytics tools, such as R and Python, directly inside the database, as Microsoft SQL Server does. Also, with PolyBase integration, the data warehouse can integrate even more data sources when it's based on SQL Server. A minimal sketch of in-database analytics follows this list.
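
As an illustration of in-database analytics, here is a minimal sketch that submits a Python script to SQL Server's sp_execute_external_script from a client program; it assumes Machine Learning Services is installed, and the connection string and dbo.Sales table are made up for the example.

    # In-database analytics sketch: the Python script runs inside SQL Server via
    # sp_execute_external_script; only the small result set travels to the client.
    # The connection string and the dbo.Sales table are illustrative assumptions.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=sqlserver01;DATABASE=SalesDW;Trusted_Connection=yes;"
    )

    sql = """
    EXEC sp_execute_external_script
        @language = N'Python',
        @script = N'OutputDataSet = InputDataSet.describe().reset_index()',
        @input_data_1 = N'SELECT Amount, Quantity FROM dbo.Sales';
    """

    # Print the summary statistics computed by Python inside the database engine.
    for row in conn.cursor().execute(sql):
        print(row)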

Main components of a data warehouse

This section will discuss the various parts of a data warehouse.

Staging area

In a classic data warehouse, this zone is usually a database and/or a schema within it that is used to hold a copy of the data from the source systems. The staging area is necessary because, most of the time, data sources are not stored on the same server as the data warehouse. Even if they are on the same server, we prefer to work on a copy of the data for the following reasons:

  • It preserves data integrity. All the data is copied over as of a specific point in time, which ensures consistency between tables.
  • We might need specific indexes that we could not create in the source system. When we query the data, we're not necessarily making the same links (joins) as the source system does. Therefore, we might have to create indexes to increase query performance.
  • Querying the source might have an impact on the performance of the source application. Usually, the staging area is used to bring over just the changes from the source systems. This prevents processing too much data from the data source; a minimal sketch of such an incremental load follows this list.
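
To make the idea of bringing over just the changes concrete, here is a minimal sketch of a watermark-based incremental load into a staging table. The connection strings, the src.Orders and stg.Orders tables, and the ModifiedDate column are all assumptions made for the example.

    # Incremental staging load sketch: copy only the rows changed since the last
    # high-water mark. Connection strings, tables, and columns are assumptions.
    import pyodbc

    SRC_CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=srcdb;DATABASE=Sales;Trusted_Connection=yes;"
    STG_CONN_STR = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dwhdb;DATABASE=Staging;Trusted_Connection=yes;"

    src = pyodbc.connect(SRC_CONN_STR)
    stg = pyodbc.connect(STG_CONN_STR)

    # Highest ModifiedDate already present in the staging table.
    last_load = stg.cursor().execute(
        "SELECT COALESCE(MAX(ModifiedDate), '1900-01-01') FROM stg.Orders"
    ).fetchval()

    # Pull only the delta, which limits the impact on the source application.
    rows = src.cursor().execute(
        "SELECT OrderID, CustomerID, Amount, ModifiedDate "
        "FROM src.Orders WHERE ModifiedDate > ?",
        last_load,
    ).fetchall()

    if rows:
        cur = stg.cursor()
        cur.fast_executemany = True
        cur.executemany(
            "INSERT INTO stg.Orders (OrderID, CustomerID, Amount, ModifiedDate) "
            "VALUES (?, ?, ?, ?)",
            [tuple(r) for r in rows],
        )
        stg.commit()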

Not to mention that the data source might be files: CSV, XML, and so on. It's much easier to bring their content into relational tables. From a modern data warehouse perspective, this means storing the files in HDFS and separating them by date, as sketched below.
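
Here is a minimal sketch of that date-based separation, using a local folder layout; the same year/month/day convention applies whether the landing zone is HDFS, Azure Data Lake Store, or Blob storage. The root path and file name are assumptions made for the example.

    # Land source files in date-separated folders, for example
    # /data/landing/sales/2018/04/15/. Paths and file names are assumptions.
    import shutil
    from datetime import date
    from pathlib import Path

    def land_file(local_file: str, root: str = "/data/landing/sales") -> Path:
        today = date.today()
        # One folder per load date keeps each extraction isolated and easy to replay.
        target_dir = Path(root) / f"{today:%Y}" / f"{today:%m}" / f"{today:%d}"
        target_dir.mkdir(parents=True, exist_ok=True)
        return Path(shutil.copy2(local_file, target_dir))

    print(land_file("extracts/sales_20180415.csv"))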

In a modern data warehouse, if we're in the cloud only, relational data can still be stored in databases. The only difference might be the location of the databases. In Azure, we can use Azure SQL Database tables or Azure SQL Data Warehouse.

Data warehouse

This is where the data is copied over from the staging area. There are several schools of thought that define the data warehouse:

  • Kimball group data warehouse bus: Ralph Kimball was a pioneer in data warehousing. He and his colleagues wrote many books and articles on their method. It consists of conformed dimensions that can be used by many business processes. For example, if we have a dimension named DimCustomer, we should link it to all fact tables that store customers. We should not create another dimension that redefines our customers (a minimal sketch follows this list). The following link gives more information on the Kimball group method: https://www.kimballgroup.com.
  • Inmon CIF: Bill Inmon and his colleagues defined the Corporate Information Factory at the end of the 1990s. It consisted of modeling the source systems, commonly using the third normal form. All the data in the tables was dated, which means that any changes in the data sources were inserted into the data warehouse tables. The following link gives more information on CIF: http://www.inmoncif.com.
  • Data Vault: Created by Dan Linstedt in the 21st century, this is the latest and most efficient modeling method in data warehousing. It consists of breaking down the source data into many different entities. This gives a lot of flexibility when the data is consumed: we have to reconstruct the data and use only the necessary pieces for our analysis. Here is a link that gives more information on Data Vault: http://learndatavault.com.
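
As an illustration of the Kimball approach mentioned in the first bullet, here is a minimal star schema sketch in which a single conformed DimCustomer dimension is shared by two fact tables. It uses SQLAlchemy, and every table and column name is an assumption made for the example.

    # Conformed dimension sketch (Kimball style): one DimCustomer dimension is
    # reused by two fact tables instead of redefining customers in each of them.
    # All table and column names are illustrative assumptions.
    from sqlalchemy import (Column, Date, ForeignKey, Integer, MetaData,
                            Numeric, String, Table)

    metadata = MetaData()

    dim_customer = Table(
        "DimCustomer", metadata,
        Column("CustomerKey", Integer, primary_key=True),  # surrogate key
        Column("CustomerName", String(100)),
        Column("Country", String(50)),
    )

    # Two different business processes share the same customer dimension.
    fact_sales = Table(
        "FactSales", metadata,
        Column("CustomerKey", Integer, ForeignKey("DimCustomer.CustomerKey")),
        Column("OrderDate", Date),
        Column("SalesAmount", Numeric(18, 2)),
    )

    fact_returns = Table(
        "FactReturns", metadata,
        Column("CustomerKey", Integer, ForeignKey("DimCustomer.CustomerKey")),
        Column("ReturnDate", Date),
        Column("ReturnAmount", Numeric(18, 2)),
    )
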
Cubes

In addition to the relational data warehouse, we might have a cube, such as one built with SQL Server Analysis Services. Cubes don't replace the relational data warehouse; they extend it. They can also connect to the parts of the warehouse that are not necessarily stored in a relational database. By doing this, they become a semantic layer that can be used by the consumption layer described next.

Consumption layer – BI and analytics

This area is where the data is consumed from the data warehouse and/or the data lake. This book has a chapter dedicated to the data lake. In short, the data lake is composed of several areas (data ponds) that classify the data inside it. The data warehouse is a part of the data lake; it contains the certified data. The data in the data lake outside the data warehouse is, most of the time, not certified. It is used for ad hoc analysis or data discovery.

The BI part can be stored in relational databases, analytic cubes, or models. It can also consist of views on top of the data warehouse when the data is suitable for it.

What is Azure Data Factory?

An Azure data factory is composed of the following components (a simplified sketch of how they fit together follows the list):

  • Linked services: Connectors to the various storage and compute services. For example, we can have a pipeline that will use the following artifacts:
    • HDInsight cluster on demand: Access to the HDInsight compute service to run a Hive script that uses HDFS external storage
    • Azure Blob storage/SQL Azure: Once the Hive job has run, this will retrieve the data from Azure Blob storage and copy it to a SQL Azure database
  • Datasets: These are layers for the data used in pipelines. A dataset uses a linked service.
  • Pipeline: The pipeline is the link between all datasets. It contains activities that initiate data movements and transformations. It is the engine of the factory; without pipelines, nothing will move in the factory.
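
To show how these pieces fit together, here is a simplified sketch of the three artifacts as Python dictionaries that mirror the shape of the underlying JSON definitions; the property names are deliberately simplified and should not be read as the exact ADF JSON schema.

    # Simplified sketch of how the three ADF building blocks reference each other,
    # written as Python dictionaries that mirror the JSON shape. Property names
    # and values are illustrative, not the exact ADF schema.
    linked_service = {
        "name": "BlobStore",
        "type": "AzureStorage",            # connector to a storage service
        "properties": {"connectionString": "DefaultEndpointsProtocol=https;..."},
    }

    dataset = {
        "name": "SalesCsv",
        "linkedServiceName": "BlobStore",  # a dataset uses a linked service
        "properties": {"folderPath": "landing/sales/", "format": "TextFormat"},
    }

    pipeline = {
        "name": "CopySalesToSqlAzure",
        "activities": [
            {
                "type": "Copy",                # the activity moves the data
                "inputs": ["SalesCsv"],        # dataset defined above
                "outputs": ["SalesSqlTable"],  # a second dataset, defined similarly
            }
        ],
    }

A second dataset for the SQL Azure table, built on top of a SQL Azure linked service, would be defined in the same way.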

Limitations of ADF V1.0

As good as ADF was, and although a lot of features were added to it after its general availability (GA) in 2015, there were a few limitations. At first, we relied on JSON quite a lot to define the various ADF artifacts. The number of supported data stores and compute capabilities was quite limited.

The development experience was very different compared to V2.0. As shown in the following screenshot, we could use the Author and Deploy capability, but it only gave us JSON templates.

As we will see later in this book, the new V2.0 factory has a much better development experience.

When it came to source control, we had to rely on Visual Studio integration. From Visual Studio, we could create a new factory or import an existing one and, therefore, use the source control of our choice to version it.