Learn Azure Synapse Data Explorer

By Pericles (Peri) Rocha

Overview of this book

Large volumes of free-text and semi-structured data are generated daily by applications, websites, IoT devices, and other sources. Azure Synapse Data Explorer helps you collect, store, and analyze such data, and work with other analytical engines, such as Apache Spark, to develop advanced data science projects and maximize the value you extract from data. This book offers a comprehensive view of Azure Synapse Data Explorer, exploring not only the core scenarios of Data Explorer but also how it integrates within Azure Synapse. From data ingestion to data visualization and advanced analytics, you’ll learn to take an end-to-end approach that maximizes the value of unstructured data and drives powerful insights using data science capabilities. With real-world usage scenarios, you’ll discover how to identify key projects where Azure Synapse Data Explorer can help you achieve your business goals. Throughout the chapters, you'll also find out how to manage big data as part of a software as a service (SaaS) platform, as well as tune, secure, and serve data to end users. By the end of this book, you’ll have mastered the big data lifecycle and be able to implement advanced analytical scenarios from raw telemetry and log data.
Table of Contents (19 chapters)

Part 1: Introduction to Azure Synapse Data Explorer
Part 2: Working with Data
Part 3: Managing Azure Synapse Data Explorer

Understanding the lifecycle of data

The typical data lifecycle in the world of analytics begins with data generation and ends with data analysis, or visualization through reports or dashboards. In between these steps, data gets ingested into an analytical store. Data may or may not be transformed in this process, depending on how the data will be used. In some cases, data can be updated after it has been loaded into an analytical store, even though this is not optimal. Appending new data is quite common.

Big data is normally defined as very large datasets (volume) that can be structured, semi-structured, or unstructured, without necessarily having a pre-defined format (variety), and data that changes or is produced fast (velocity). Volume, variety, and velocity are known as the three Vs of big data.

Note

While most literature defines the three Vs of big data as volume, variety, and velocity, you may also see references to five Vs: the previously mentioned volume, variety, and velocity, plus veracity (consistency, or the lack thereof) and value (how useful the data is). It is important to understand that a big data solution needs to accommodate loading large volumes of data at low latency, regardless of the structure of the data.

For data warehousing and analytics scenarios in general, you will typically go through the following workflow:

Figure 1.1 – A typical workflow in analytics

Let us break down the steps in this process, as follows:

  • Data sources: This is where data originates. Examples of data sources include a sales application that stores transactions in a database (in which case, that database would be the source), telemetry data from internet of things (IoT) devices, application log data, and much more.
  • Create database objects: The first step is to create the database itself, and any objects you will need to start loading data. Creating tables at this stage is common, but not required—in many cases, you will create destination tables as part of the data ingestion phase.
  • Ingest and transform data: The second step is to bring data into your analytical store. It involves acquiring data, copying it to your destination storage location, transforming it as needed, and loading it into a final table that will be retrieved by user queries and dashboards (not necessarily in this order; sometimes you will load data first and transform it in the destination location). This can be a complex process that may involve moving data from a source location to a data lake (a repository where data is stored and analyzed in its raw form), creating intermediary tables to sort, enrich, and clean data, creating indexes and views, and other steps. A minimal code sketch of this flow follows this list.
  • User queries, data visualization, and dashboards: In this step, data is ready to be served to end users. But this does not mean you are done—you need to make sure queries are executed at the expected performance level, and dashboards can refresh data without user interaction while reducing overall system overhead (we do not want a dashboard refreshing several times per day if that’s not needed).
  • Manage and optimize tables, views, and indexes: Once the system is in production and serving end users, you will start to find system bottlenecks and opportunities to optimize your analytical environment. This will involve creating new indexes (and maintaining the ones you have created before!), views, and materialized views, and tuning your servers.
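
To make these steps concrete, the following is a minimal sketch of the create-ingest-query flow, assuming the azure-kusto-data and azure-kusto-ingest Python packages and Azure CLI authentication. The cluster URIs, database, table, and file names are placeholders (a Synapse Data Explorer pool exposes equivalent endpoints under kusto.azuresynapse.net), and exact module paths and class names may vary slightly between package versions:

from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.data_format import DataFormat
from azure.kusto.ingest import IngestionProperties, QueuedIngestClient

# Placeholder endpoints for a Data Explorer cluster or Synapse Data Explorer pool.
query_uri = "https://<cluster>.kusto.windows.net"
ingest_uri = "https://ingest-<cluster>.kusto.windows.net"
database = "TelemetryDB"

query_client = KustoClient(
    KustoConnectionStringBuilder.with_az_cli_authentication(query_uri))
ingest_client = QueuedIngestClient(
    KustoConnectionStringBuilder.with_az_cli_authentication(ingest_uri))

# Create database objects: a destination table for raw device telemetry.
query_client.execute_mgmt(
    database,
    ".create table DeviceTelemetry (Timestamp: datetime, DeviceId: string, Reading: real)")

# Ingest data: queue a local CSV file for ingestion into the new table.
ingest_client.ingest_from_file(
    "telemetry.csv",
    IngestionProperties(database=database, table="DeviceTelemetry",
                        data_format=DataFormat.CSV))

# User queries: aggregate readings per device for a report or dashboard.
response = query_client.execute(
    database,
    "DeviceTelemetry | summarize avg(Reading) by DeviceId | take 10")
for row in response.primary_results[0]:
    print(row["DeviceId"], row["avg_Reading"])

Queued ingestion is asynchronous, so the query may not see new rows immediately; in production, sources such as Event Hubs, Event Grid, or Synapse pipelines would typically feed the ingestion endpoint rather than manual file uploads.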

The lifecycle of big data can be similar to that of a normal data warehouse (a robust database system used for reporting and analytics), but it can also be very specific. For the purposes of this book, we’ll look at big data through the eyes of a data scientist, or someone who will deliver advanced analytics scenarios from big data. Building the pipeline and processes that move data quickly from the moment it is produced to the moment it unlocks insights, without compromising quality or productivity, is a challenge for companies of all sizes.

The lifecycle of data described here is widely implemented and well proven as a pattern. With the growth of the data science profession, we have observed a proliferation of new tools and requirements for projects that go well beyond this pattern. With that came the need for a methodology that helps govern machine learning (ML) projects from requirements gathering to model deployment, and everything in between, allowing data scientists to focus on the outcomes of their projects rather than building a new approach for every new project. Let’s look at how the Team Data Science Process (TDSP) helps achieve that.

Introducing the Team Data Science Process

In 2016, Microsoft introduced the Team Data Science Process (TDSP), an agile, iterative methodology for building data science solutions efficiently and at scale. It includes best practices, role definitions, guidelines for collaborative development, and project planning to help data scientists and analysts build end-to-end (E2E) data science projects without having to worry about building their own operational model.

Figure 1.2 illustrates the stages in this process:

Figure 1.2 – The TDSP lifecycle

At a high level, the TDSP lifecycle outlines the following stages of data science projects:

  1. Business Understanding: This stage involves working with project stakeholders to assess and identify the business problems that are being addressed by the project, as well as to define the project objectives. It also involves identifying the source data that will be used to answer the business problems that were identified.
  2. Data Acquisition & Understanding: At this stage, the actual ingestion of data begins, ensuring a clean, high-quality dataset that has a clear relationship with the business problems identified in the Business Understanding stage. After the initial data ingestion, we explore the data to determine whether its quality is, in fact, adequate.
  3. Modeling: After ensuring we have the right data to address the business problems, we perform feature engineering (FE) and model training. By creating the right features from the source data and finding the model that best answers the problem specified in the Business Understanding stage, we determine the model that is best suited for production use (see the sketch after this list).
  4. Deployment: This is where we operationalize the model that was identified in the Modeling stage. We build a data pipeline, deploy the model to production, and prepare the interfaces that allow model consumption from external applications.
  5. Customer Acceptance: By now, we have a data pipeline in place and a model that helps address the business challenges identified at the beginning of the project. At the Customer Acceptance stage, we confirm with the customer that the project does, in fact, address those challenges, and we identify the entity to whom we hand off the project for ongoing management and operations.
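
To make the Modeling and Deployment stages more concrete, here is a deliberately simplified sketch of what they might look like in code. The TDSP does not prescribe specific libraries, so scikit-learn, the column names, and the file paths below are illustrative assumptions only:

import pandas as pd
from joblib import dump
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Load the curated dataset produced during Data Acquisition & Understanding
# (the file name and columns are placeholders).
data = pd.read_csv("curated_telemetry.csv")

# Modeling, part 1 - feature engineering: derive a simple feature from raw columns.
data["reading_ratio"] = data["reading"] / data["baseline_reading"]
features = data[["reading", "baseline_reading", "reading_ratio"]]
target = data["device_failed"]

# Modeling, part 2 - training and evaluation: hold out a test set and score the
# candidate model against the metric agreed on during Business Understanding.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Deployment: persist the chosen model so a pipeline or web service can serve it.
dump(model, "failure_model.joblib")

In practice, the persisted model would typically be registered and hosted behind an endpoint (for example, in Azure Machine Learning), and the Customer Acceptance stage then validates the end-to-end pipeline against the objectives set during Business Understanding.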

For more details about the TDSP, refer to https://docs.microsoft.com/en-us/azure/architecture/data-science-process/overview.

Tooling and infrastructure

Big data projects require specialized tools and infrastructure to process data at scale and with low latency. The TDSP provides recommendations for infrastructure and tooling in data science projects, covering the underlying storage systems, the analytical engines (such as SQL and Apache Spark), the cloud services used to host ML models, and more.

Azure Synapse offers the infrastructure and development tools needed in big data projects, from data ingestion through data storage, with a choice of analytical engines for data exploration and for serving data to users at scale, as well as for modeling and data visualization. In the next sections, we will explore the full data lifecycle and how Azure Synapse helps individuals deliver E2E advanced analytics and data science projects.