Data Engineering with Alteryx

By: Paul Houghton

Overview of this book

Alteryx is a GUI-based development platform for data analytic applications. Data Engineering with Alteryx will help you leverage Alteryx’s code-free aspects, which increase development speed, while still letting you make the most of the code-based skills you have. This book will teach you the principles of DataOps and how they can be used with the Alteryx software stack. You’ll build data pipelines with Alteryx Designer and incorporate the error handling and data validation needed for reliable datasets. Next, you’ll take a data pipeline from raw data, transform it into a robust dataset, and publish it to Alteryx Server following a continuous integration process. By the end of this book, you’ll be able to build systems for validating datasets, monitoring workflow performance, managing access, and promoting the use of your data sources.
Table of Contents (18 chapters)
Part 1: Introduction
Part 2: Functional Steps in DataOps
Part 3: Governance of DataOps

The data cleansing process

The data cleansing process is built around identifying the records that are useful for the intended purpose and enriching the dataset with any fields that might be valuable. We can achieve this in two ways:

  • By modifying the existing dataset
  • By adding data to the dataset

Of these two options, adding data is effectively an extension of modifying the dataset: it combines multiple data pipelines into a single, cohesive pipeline.
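To make the idea of adding data concrete, here is a minimal sketch in plain Python rather than Alteryx itself. It treats enrichment as joining a second pipeline's output onto the main dataset. The sample records, the field names, and the enrich helper are all hypothetical and exist only for illustration.

```python
# Hypothetical sample data; the field names are illustrative assumptions.
orders = [
    {"order_id": 1, "store_id": "A", "amount": 120.0},
    {"order_id": 2, "store_id": "B", "amount": 80.0},
    {"order_id": 3, "store_id": "C", "amount": 45.5},
]
stores = [
    {"store_id": "A", "region": "North"},
    {"store_id": "B", "region": "South"},
]


def enrich(records, lookup, key):
    """Left-join lookup fields onto records by key.

    Records with no match in the lookup are kept unchanged.
    """
    # Index the lookup rows by the join key for constant-time matching.
    index = {row[key]: row for row in lookup}
    enriched = []
    for record in records:
        extra = index.get(record[key], {})
        # Merge the matched lookup fields into a copy of the record.
        enriched.append({**record, **extra})
    return enriched


enriched_orders = enrich(orders, stores, "store_id")
```

In this sketch, the order for store C has no match and passes through without a region field. Whether to keep or drop such unmatched records depends on your use case.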

When modifying the existing dataset, four primary processes provide an umbrella for the transformations:

  • Selecting the columns of interest
  • Filtering the relevant rows
  • Creating and modifying columns with formulas
  • Summarizing the dataset to a more relevant level of granularity

Each of these steps focuses on transforming the dataset according to your use case and solving your data question.
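The four processes above can be sketched outside Alteryx as a short plain-Python pipeline. This is only an illustration of the logic, not how Alteryx implements it. The sample sales records, the field names, and the rule of dropping rows with zero units are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample data; the field names are illustrative assumptions.
sales = [
    {"store": "A", "product": "tea", "units": 3, "unit_price": 2.5, "note": "x"},
    {"store": "A", "product": "coffee", "units": 0, "unit_price": 3.0, "note": ""},
    {"store": "B", "product": "tea", "units": 4, "unit_price": 2.5, "note": ""},
    {"store": "A", "product": "tea", "units": 1, "unit_price": 2.5, "note": ""},
]

# 1. Select the columns of interest.
columns = ("store", "units", "unit_price")
selected = [{name: row[name] for name in columns} for row in sales]

# 2. Filter the relevant rows (here: keep only rows with units sold).
filtered = [row for row in selected if row["units"] > 0]

# 3. Create a new column with a formula.
with_revenue = [
    {**row, "revenue": row["units"] * row["unit_price"]} for row in filtered
]

# 4. Summarize to a coarser level of granularity (revenue per store).
totals = defaultdict(float)
for row in with_revenue:
    totals[row["store"]] += row["revenue"]
summary = dict(totals)
```

Each step takes the previous step's output as its input, so the order matters: selecting and filtering early keeps the later formula and summary steps working on less data.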

Selecting columns

Selecting the relevant columns in a dataset is achieved...