Data Engineering with Alteryx

By: Paul Houghton

Overview of this book

Alteryx is a GUI-based development platform for data analytic applications. Data Engineering with Alteryx will help you leverage Alteryx’s code-free aspects, which increase development speed, while still letting you make the most of the code-based skills you have. This book will teach you the principles of DataOps and how they can be used with the Alteryx software stack. You’ll build data pipelines with Alteryx Designer and incorporate the error handling and data validation needed for reliable datasets. Next, you’ll take a data pipeline from raw data, transform it into a robust dataset, and publish it to Alteryx Server following a continuous integration process. By the end of this book, you’ll be able to build systems for validating datasets, monitoring workflow performance, managing access, and promoting the use of your data sources.

Profiling data with summary and statistical aggregations

Once you have your dataset, profiling the values it contains shows you what the data looks like. That profile also gives you a baseline to compare against on later runs.

In Alteryx, there are several ways to investigate the range of values that appear in a dataset. In this section, we are going to look at the following three areas:

  • What is the variation in the dataset and the size of the range?
  • How is the dataset distributed?
  • What proportion of your records is missing values?

In each of these areas, Alteryx provides tools for answering the questions quickly, along with ways to persist those answers in your logging systems.
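To make the three questions concrete outside of Alteryx, here is a minimal Python sketch that uses only the standard library. It is not how Alteryx implements its profiling tools; the function names (`profile_field`, `histogram`), the sample field `order_totals`, and the JSON log format are all assumptions made for illustration.

```python
import json
import statistics
from collections import Counter


def profile_field(values):
    """Summarise one numeric field; None marks a missing value."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    summary = {
        "count": len(values),
        "missing": missing,
        "missing_pct": round(100 * missing / len(values), 2) if values else 0.0,
    }
    if present:
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["range"] = max(present) - min(present)
        summary["mean"] = statistics.mean(present)
        summary["median"] = statistics.median(present)
        # Sample standard deviation needs at least two data points.
        summary["std_dev"] = statistics.stdev(present) if len(present) > 1 else 0.0
    return summary


def histogram(values, bin_width):
    """Count present values per fixed-width bin, keyed by the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values if v is not None)
    return dict(sorted(counts.items()))


# Hypothetical sample data standing in for one field of a dataset.
order_totals = [12, 15, None, 40, 22, 18, None, 35]

profile = profile_field(order_totals)
distribution = histogram(order_totals, bin_width=10)

# One JSON line per run is easy to append to a log and compare later.
log_line = json.dumps(
    {"field": "order_totals", "profile": profile, "histogram": distribution}
)
print(log_line)
```

The summary covers variation and range (minimum, maximum, standard deviation), the histogram approximates the distribution, and `missing_pct` reports the share of records without a value. Writing the result as a single JSON line is one simple way to keep a per-run baseline for comparison.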

Investigating the variation and size range of your dataset

The first area to investigate is the spread of the data. Understanding the aggregated spread of the records in each field will give you an understanding...