Scalable Data Analytics with Azure Data Explorer

By: Jason Myerscough

Overview of this book

Azure Data Explorer (ADX) enables developers and data scientists to make data-driven business decisions. This book will help you rapidly explore and query your data at scale and secure your ADX clusters. The book begins by introducing you to ADX, its architecture, core features, and benefits. You'll learn how to securely deploy ADX instances, navigate the ADX Web UI, ingest data, and query and visualize your data using the powerful Kusto Query Language (KQL). Next, you'll get to grips with KQL operators and functions to efficiently query and explore your data, as well as perform time series analysis and search for anomalies and trends in your data. As you progress through the chapters, you'll explore advanced ADX topics, including deploying your ADX instances using Infrastructure as Code (IaC). The book also shows you how to manage your cluster performance and monthly ADX costs by handling cluster scaling and data retention periods. Finally, you'll understand how to secure your ADX environment by restricting access, and you'll learn best practices for improving your KQL query performance. By the end of this Azure book, you'll be able to securely deploy your own ADX instance, ingest data from multiple sources, rapidly query your data, and produce reports with KQL and Power BI.
Table of Contents (18 chapters)

  • Section 1: Introduction to Azure Data Explorer
  • Section 2: Querying and Visualizing Your Data
  • Section 3: Advanced Azure Data Explorer Topics

Introducing the data analytics pipeline

Before diving into ADX, it is worth spending some time to understand the data analytics pipeline. Whenever I am learning something new that is large and complex in scope, such as data analytics, I break the topic down into smaller chunks to help with learning and measuring my progress. Therefore, an understanding of the various stages of the data analytics pipeline will help you understand how ADX takes raw data and generates reports and visuals as a result of our analytical tasks, such as time series analysis.

Figure 1.1 illustrates the stages of the data analytics pipeline required to take data from a data source, perform some analysis, and produce the result of the analysis in the form of a visual, such as tables, reports, and graphs:

Figure 1.1 – Data analytics pipeline

In the spirit of breaking a complex subject into smaller chunks, let's look at each stage in detail:

  1. Data: The first stage of the pipeline is the data source. In Chapter 4, Ingesting Data in Azure Data Explorer, we will discuss the different types of data. For now, suffice it to say that data falls into three categories: structured (such as tables), semi-structured, and unstructured (such as free-form text).
  2. Ingestion: Once the data sources have been identified, the data needs to be ingested by the pipeline. The primary purpose of the ingestion stage is to take the raw data, perform some Extract-Transform-Load (ETL) operations to format the data in a way that helps with your analysis, and send the data to the storage stage. The data can be ingested using tools and services such as Apache Kafka, Azure Event Hubs, and IoT Hub. Chapter 4, Ingesting Data in Azure Data Explorer, discusses the different ingestion methods, such as streaming versus batch, and demonstrates how to ingest data using multiple services, such as Azure Event Hubs and Azure Blob storage.
  3. Store: Once the data has been ingested, ADX natively compresses it and stores it in a proprietary format. The data is then cached locally on the cluster based on the hot cache settings and phased out of the cluster based on the retention settings. We will discuss these terms a little later in the chapter.
  4. Analyze: At this stage, we can start to query the data, apply machine learning to detect anomalies, and predict trends. We will see examples of anomaly detection and trend prediction in Chapter 7, Identifying Patterns, Anomalies, and Trends in Your Data. In this book, we will perform most of our analysis in the ADX Web UI using Kusto Query Language (KQL).
  5. Visualize: The final stage of the pipeline is visualization. Once you have ingested your data and performed your analysis, chances are you will want to share and present your findings. We will present our findings using the ADX Web UI's dashboards and Power BI.
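
To make the five stages more concrete, the following is a minimal Python sketch of a toy pipeline. It is not how ADX is implemented; it only mirrors the stages listed above. The sample events, the function names (ingest, store, analyze, visualize, and run_pipeline), and the in-memory list standing in for storage are all assumptions made for illustration, and the retention filter is a simplified stand-in for the retention settings mentioned in the Store stage:

```python
import json
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# 1. Data: hypothetical semi-structured events (JSON lines) from a data source.
RAW_EVENTS = [
    '{"device": "sensor-a", "ts": "2024-01-01T00:00:00+00:00", "temp": "21.5"}',
    '{"device": "sensor-a", "ts": "2024-01-01T01:00:00+00:00", "temp": "22.5"}',
    '{"device": "sensor-b", "ts": "2024-01-01T00:30:00+00:00", "temp": "30.0"}',
    'not valid json',
]


def ingest(lines):
    """2. Ingestion: parse raw lines, coerce types, and skip malformed records."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
            rows.append({
                "device": record["device"],
                "ts": datetime.fromisoformat(record["ts"]),
                "temp": float(record["temp"]),
            })
        except (ValueError, KeyError, TypeError):
            continue  # a real pipeline would log or dead-letter the bad record
    return rows


def store(rows, now, retention):
    """3. Store: keep only the rows that fall inside the retention window."""
    cutoff = now - retention
    return [row for row in rows if row["ts"] >= cutoff]


def analyze(rows):
    """4. Analyze: compute the average temperature per device."""
    readings = defaultdict(list)
    for row in rows:
        readings[row["device"]].append(row["temp"])
    return {dev: sum(vals) / len(vals) for dev, vals in sorted(readings.items())}


def visualize(averages):
    """5. Visualize: render the averages as a plain-text bar chart."""
    return [f"{dev:<10}{'#' * round(avg / 5)} {avg:.1f}" for dev, avg in averages.items()]


def run_pipeline(lines, now, retention):
    """Chain the stages in pipeline order."""
    return visualize(analyze(store(ingest(lines), now, retention)))


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
    for line in run_pipeline(RAW_EVENTS, now, timedelta(days=1)):
        print(line)
```

Each function consumes only the output of the stage before it, which is the property that lets us study, and later swap out, one stage at a time.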

In the next section, we will look at some of the services Azure provides for the different stages of the analytics pipeline.

Overview of Azure data analytics services

You may have noticed that I referenced a few of Azure's data services previously, and you may be wondering what they are used for. Although this book is about Azure Data Explorer, it is worth understanding what some of the common data services are, since some of the services, such as Event Hubs and Blob storage, will be discussed and used in later chapters.

To help map the different data services to the analytics pipeline, Figure 1.2 illustrates an updated pipeline, with the Azure data services mapped to the respective pipeline stages:

Figure 1.2 – Azure data services

Important Note

The list of services depicted in Figure 1.2 is by no means an exhaustive list of Azure data analytics services. For a complete and accurate list, please see https://azure.microsoft.com/en-us/services/#analytics.

The following list briefly describes the services shown in Figure 1.2:

  • Event Hubs: This is an event and streaming Platform as a Service (PaaS). Event Hubs allows us to stream data, which we will demonstrate and use in Chapter 4, Ingesting Data in Azure Data Explorer.
  • Data Factory: This is a PaaS offering that allows us to move data between systems and transform it from one format to another. These transformations are commonly referred to as Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT).
  • HDInsight: This is a PaaS offering that appears twice in Figure 1.2 and could technically appear in other stages. HDInsight is quite possibly one of the most misunderstood analytical services with regard to what it does. HDInsight is a PaaS version of the Hortonworks Hadoop framework, which includes a wide range of ingestion, analytics, and storage services, such as Apache Kafka, Hive, HBase, Spark, and the Hadoop Distributed File System (HDFS).
  • Azure Data Lake Gen2: This is a storage solution based on Azure Blob storage that provides HDFS-compatible access to the data.
  • Blob Storage: This is Azure's object storage service, on which other storage offerings, such as Azure Data Lake Gen2, are built.
  • Azure Databricks: This is Azure's PaaS implementation of Apache Spark.
  • Power BI: Technically not an Azure service, Power BI is a rich reporting product that is commonly integrated with Azure.
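
The difference between ETL and ELT, mentioned in the Data Factory description, comes down to where the transformation happens relative to the load. The following minimal Python sketch illustrates only that ordering; it does not use Data Factory or any Azure API. The sample records, the transform, etl, and elt functions, and the dictionary with raw and curated lists standing in for a target store are all hypothetical:

```python
# Hypothetical source records with inconsistent formatting.
SOURCE = [
    {"name": "  alice ", "amount": "10.50"},
    {"name": "BOB", "amount": "3"},
]


def transform(record):
    """Normalize one record: tidy the name and convert the amount to a number."""
    return {
        "name": record["name"].strip().title(),
        "amount": float(record["amount"]),
    }


def new_store():
    """Create an empty stand-in for a target data store."""
    return {"raw": [], "curated": []}


def etl(source, store):
    """Extract-Transform-Load: records are reshaped before they reach the store."""
    extracted = list(source)                          # extract
    transformed = [transform(r) for r in extracted]   # transform
    store["curated"].extend(transformed)              # load
    return store


def elt(source, store):
    """Extract-Load-Transform: raw records land first and are reshaped in the store."""
    extracted = list(source)                                  # extract
    store["raw"].extend(extracted)                            # load
    store["curated"] = [transform(r) for r in store["raw"]]   # transform
    return store


if __name__ == "__main__":
    print(etl(SOURCE, new_store()))
    print(elt(SOURCE, new_store()))
```

Both functions produce the same curated records. The practical difference in this sketch is that elt also keeps the raw records in the store, so they can be transformed again later in a different way.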

You may be wondering where ADX would fit in Figure 1.2. The answer is in every stage after the data source: ingestion, store, analyze, and visualize. In the next section, you will learn how this is possible by understanding what Azure Data Explorer is.