Learn Azure Synapse Data Explorer

By: Pericles (Peri) Rocha

Overview of this book

Large volumes of free-text and semi-structured data are generated daily by applications, websites, IoT devices, and other sources. Azure Synapse Data Explorer helps you collect, store, and analyze such data, and work with other analytical engines, such as Apache Spark, to develop advanced data science projects and maximize the value you extract from data. This book offers a comprehensive view of Azure Synapse Data Explorer, exploring not only the core scenarios of Data Explorer but also how it integrates within Azure Synapse. From data ingestion to data visualization and advanced analytics, you’ll learn to take an end-to-end approach that maximizes the value of unstructured data and drives powerful insights using data science capabilities. With real-world usage scenarios, you’ll discover how to identify key projects where Azure Synapse Data Explorer can help you achieve your business goals. Throughout the chapters, you'll also find out how to manage big data as part of a software as a service (SaaS) platform, as well as tune, secure, and serve data to end users. By the end of this book, you’ll have mastered the big data life cycle and be able to implement advanced analytical scenarios from raw telemetry and log data.
Table of Contents (19 chapters)

Part 1: Introduction to Azure Synapse Data Explorer
Part 2: Working with Data
Part 3: Managing Azure Synapse Data Explorer

Exploring the Data Explorer pool infrastructure and scalability

Let us look at how Data Explorer pools work behind the scenes.

A typical deployment of Data Explorer, whether the standalone service or Data Explorer pools in Azure Synapse, consists of two major services working together, as follows:

  • The Engine service: Serves user queries, processes data ingestion, and accepts control commands that create or change databases, tables, or other metadata objects (a.k.a. data definition language (DDL) for seasoned SQL users).
  • The Data Management service: Connects the Engine service with data pipelines, orchestrates and maintains data ingestion processes, and manages data purging tasks (a.k.a. data grooming) that run on the Engine nodes of the cluster.

These services are deployed through virtual machines (VMs) in Microsoft Azure, building a cluster of Data Explorer compute nodes. These nodes perform different tasks in the architecture of the Data Explorer pool, which we will discuss next.

Data Explorer pool architecture

The Engine service is the most important component in the architecture of Data Explorer pools. Four types of cluster nodes, defined by their respective roles, support the Engine service, as follows:

  • Admin node: This node maintains and performs all metadata transactions across the cluster.
  • Query Head node: When users submit queries, the Query Head node builds a distributed query plan and orchestrates query execution across the Data nodes in the cluster. It holds a read-only copy of the cluster metadata to make decisions for optimal query performance.
  • Data node: As the worker bee in the cluster, it receives part of the distributed query from the Query Head node and executes that portion of the query to retrieve the data that it holds. Data shards are cached in the Data nodes. These nodes also create new data shards when new data is ingested into the database.
  • Gateway node: Acts as a broker for the Data Explorer REST API. It receives control commands and dispatches them to the Admin node, and sends any user queries it receives to a Query Head node. It is also responsible for authenticating clients that connect to the service via external API calls.
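
The Gateway node's routing role can be sketched with a simple dispatch rule. In KQL, control commands begin with a dot (for example, `.create table`), which lets a broker distinguish them from user queries. The function and node names below are illustrative assumptions, not the actual implementation:

```python
def route_request(request: str) -> str:
    """Illustrative gateway dispatch: KQL control commands start with a dot,
    so they go to the Admin node; everything else is a user query and is
    sent to a Query Head node."""
    if request.lstrip().startswith("."):
        return "admin-node"
    return "query-head-node"

# A control command (DDL) versus a regular user query:
print(route_request(".create table Logs (Timestamp: datetime, Message: string)"))
print(route_request("Logs | where Timestamp > ago(1h) | count"))
```

In the real service, authentication and connection handling also happen at this layer before any dispatch occurs.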

You do not need to worry about how many nodes of each type your cluster contains. The actual implementation of the cluster is transparent to the end user, and you don’t have control over the individual nodes.

Scalability of compute resources

Data Explorer was designed to scale vertically and horizontally to meet companies’ requirements and to accommodate periodic changes in demand. Scaling vertically adds or removes CPU, cache, or RAM capacity for each node in the cluster. Scaling horizontally adds or removes instances of the specified node size. For example, you can configure your Data Explorer pool to start with two instances with eight cores each, and then scale your environment horizontally or vertically as needed.
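
The arithmetic behind the two scaling directions is simple, and a minimal sketch makes the difference concrete (the function below is illustrative, not part of any Azure SDK):

```python
def total_cores(instances: int, cores_per_instance: int) -> int:
    """Total compute capacity of a cluster, in cores."""
    return instances * cores_per_instance

# Starting point from the text: two instances with eight cores each.
baseline = total_cores(2, 8)      # 16 cores

# Scaling horizontally: more instances of the same node size.
scaled_out = total_cores(4, 8)    # 32 cores

# Scaling vertically: same instance count, a larger node size.
scaled_up = total_cores(2, 16)    # 32 cores

print(baseline, scaled_out, scaled_up)
```

Both paths double total capacity here; which one is preferable depends on workload shape, since more instances add query parallelism while larger nodes add per-node cache and memory.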

Note

You cannot control the specific number of CPUs, amount of RAM, or cache size for the VMs used in your clusters. Azure Synapse Data Explorer has a pre-defined set of VM sizes from Extra Small (two cores) to Large (16 cores) to choose from. These VM sizes have a balanced amount of each compute resource.

Sometimes, it is hard to anticipate how much of a compute resource you will need for a given task throughout the day. Furthermore, if usage of your analytics environment is high at one point in the day but lower at other times, you will want the service to scale automatically as users demand more or fewer resources.

Data Explorer allows you to do just that through Optimized autoscale: set the minimum number of instances you want running at any given time, and the maximum number of instances the service can provision in case user demand exceeds what the currently allocated resources can support, and Data Explorer pools will scale in and out automatically. If your cluster is underutilized, Data Explorer will scale in to lower your cost; if it is overutilized, it will scale out. This can be configured in the Azure portal or in Azure Synapse Studio, as seen in Figure 1.16.
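
The core of such a scale-in/scale-out decision can be sketched as below. The utilization thresholds are hypothetical assumptions for illustration; the actual Optimized autoscale logic uses the service's own internal metrics and forecasting:

```python
def autoscale(current: int, utilization: float, minimum: int, maximum: int,
              low: float = 0.3, high: float = 0.8) -> int:
    """Return the new instance count, kept within [minimum, maximum].
    The low/high thresholds are illustrative, not the service's values."""
    if utilization > high and current < maximum:
        return current + 1   # scale out under heavy load
    if utilization < low and current > minimum:
        return current - 1   # scale in when underutilized, to lower cost
    return current

print(autoscale(current=4, utilization=0.9, minimum=2, maximum=10))
print(autoscale(current=4, utilization=0.1, minimum=2, maximum=10))
print(autoscale(current=2, utilization=0.1, minimum=2, maximum=10))
```

Note how the minimum and maximum act as hard bounds: a cluster already at its minimum never scales in further, no matter how idle it is.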

Figure 1.16 – Specifying a minimum and maximum number of instances on Autoscale


The maximum number of instances seen on the slider in Figure 1.16 scales with the workload size that you selected. With a large compute cluster, you can scale to up to 1,000 instances.

Managing data on distributed clusters

Scaling in and out is great, but you must be thinking: what about my data? The architecture of Data Explorer decouples the storage layer from the compute layer, meaning these layers can scale independently. If more storage is needed, more resources are allocated to the storage layer. If more compute is needed, your compute VMs increase in size, or more instances of them are added.

The Data Explorer service implements database sharding to distribute data across its storage. The Engine service is aware of each data shard and distributes queries across them. In almost all cases, you don’t need to know details about the physical data shards themselves, as Data Explorer exposes data simply through logical tables.

Data is physically persisted in storage, but to deliver a fast query experience, Data Explorer pools cache data in solid-state drives (SSDs). We will look at how you can define caching policies to balance costs and performance in Chapter 10, System Monitoring and Diagnostics.
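
The effect of a caching policy can be sketched as a simple window check: data ingested within the policy's hot period is served from local SSD, while older data is read from the slower persistent storage layer. The function below is a conceptual sketch, not the service's implementation:

```python
from datetime import datetime, timedelta, timezone

def is_hot(ingestion_time: datetime, hot_cache_days: int,
           now: datetime) -> bool:
    """Illustrative hot-cache check: True if the data falls inside the
    caching policy window and would be served from SSD cache."""
    return now - ingestion_time <= timedelta(days=hot_cache_days)

now = datetime(2023, 1, 31, tzinfo=timezone.utc)
recent = datetime(2023, 1, 29, tzinfo=timezone.utc)  # 2 days old
old = datetime(2022, 11, 1, tzinfo=timezone.utc)     # ~3 months old

print(is_hot(recent, hot_cache_days=7, now=now))
print(is_hot(old, hot_cache_days=7, now=now))
```

Widening the window improves query latency on older data at the cost of more (and larger) SSD-backed cache, which is exactly the cost/performance trade-off the caching policy controls.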

Data shards are distributed across Data nodes using a hash function, which makes the process deterministic—using this hash function, the cluster can determine at any time which Data node is the preferred one for a certain shard. When you scale a Data Explorer pool in or out, the cluster then redistributes the data shards equally across the Data nodes available.
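
The deterministic placement described above can be sketched with a hash function: hashing a shard's identifier always yields the same preferred node for a given cluster size, so any node can recompute the mapping at any time. This is a conceptual sketch only; the shard IDs are hypothetical and the actual placement logic in Data Explorer is internal to the service:

```python
import hashlib

def preferred_node(shard_id: str, node_count: int) -> int:
    """Deterministically map a shard to a Data node index: the same
    shard ID and cluster size always produce the same answer."""
    digest = hashlib.sha256(shard_id.encode()).hexdigest()
    return int(digest, 16) % node_count

shards = [f"shard-{i}" for i in range(8)]
placement_before = {s: preferred_node(s, 3) for s in shards}  # 3 Data nodes
placement_after = {s: preferred_node(s, 4) for s in shards}   # after scaling out

# Determinism: recomputing the mapping gives identical results.
assert placement_before == {s: preferred_node(s, 3) for s in shards}
print(placement_before)
print(placement_after)
```

Comparing the two placements shows why scaling triggers redistribution: changing the node count changes the preferred node for some shards, so the cluster moves them to rebalance.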

Mission-critical infrastructure

For enterprises, it is not enough to store large amounts of data and retrieve them quickly. Data is a critical asset for companies, and the infrastructure that holds it needs to be bulletproof against security and availability challenges, while also offering developer productivity and sophisticated monitoring tooling.

Data Explorer pools inherit several of the mission-critical features in Azure Synapse Analytics (some of these were described in the What is Azure Synapse? section of this chapter). Let us look at other features they offer that are relevant to building mission-critical environments, as follows:

  • AAD integration: Azure Active Directory (AAD) is Microsoft’s cloud-based identity and access management (IAM) service for the enterprise. It helps users sign in to a corporate network and access resources in thousands of SaaS applications, such as Microsoft Office 365, Azure services, and third-party applications built with support for AAD.
  • Azure Policy support: This allows companies to enforce standards and evaluate compliance for services provisioned by users. For Data Explorer, you can apply policies such as requiring encryption at rest with a customer-managed key or enforcing double encryption, among others.
  • Purging of personal data: Companies have a responsibility to protect customer data, and the ability to delete personal data from the service helps them satisfy the General Data Protection Regulation’s (GDPR’s) obligations. Data Explorer supports purging individual records, entire tables, or records in materialized views. This operation permanently deletes data and is irreversible.
  • Azure Availability Zones: Built for business continuity and disaster recovery (BCDR), Azure Availability Zones replicate your data and services to at least three different data centers in an Azure region. Your data residency is still respected: in the case of a failure in one of the region’s data centers, your application fails over to one of the copies in a different data center within the same Azure region.
  • Integrated with Azure Monitor: Collect and analyze telemetry data from your Data Explorer pools to understand cluster metrics and track the performance of query, data ingestion, and data export operations.
  • Globally available: At the time of this writing, Azure was available in more than 60 regions worldwide, and the list of regions continues to grow every year. This allows organizations to deploy applications closer to their users to reduce latency and offer more resiliency and recovery options, while also respecting data residency rules. For an updated list of Azure regions, visit https://azure.microsoft.com/explore/global-infrastructure/.

Note

Not every Azure service is available in every Azure region. For a detailed view of Azure service availability per Azure region, use the Products available by region tool at https://azure.microsoft.com/en-us/global-infrastructure/services/.

How much scale can Data Explorer handle?

As of July 2022, Microsoft claimed the following usage statistics for Azure Data Explorer globally:

  • 115 PB of data ingested daily
  • 2.5 billion queries daily
  • 8.1 exabytes (EB) in total data size
  • 2.4 million VM cores running at any given time
  • More than 350,000 KQL developers

These are important numbers for a managed service. What is even more impressive is that Microsoft claims those numbers are growing close to 100% year over year.

All the details mentioned here about the service architecture and scalability are characteristics of the standalone Azure Data Explorer service too. There are a few special things about Data Explorer in Azure Synapse, so let’s explore that next (no pun intended).