-
Book Overview & Buying
-
Table Of Contents
-
Feedback & Rating
Data Engineering with Alteryx
By :
The Alteryx platform is the Alteryx software suite that combines processing, managing datasets, and analysis. While a lot of focus in the Alteryx community tends to be on the business user analyst, a data engineer's benefits are extensive. Alteryx as a whole allows for both code-free and code-friendly workflow development, giving it the flexibility to quickly transform a dataset while having the depth to make complex transformations using whatever tool or process makes the most sense.
In this section, we will learn about the following:
The Alteryx platform is a collection of four software products:
Important Note
Alteryx software products have Alteryx as part of the name. Generally, the name Alteryx is dropped from the name in discussions and that will often happen throughout this book.
Because the data science deployment falls into Machine Learning Operations (MLOps), it isn't a core component of the Data Operations (DataOps) process. Thus, while you might have some interactions with the model deployment as a data engineer, we will be focusing on extracting and processing the raw datasets rather than the model management and implementation that Promote supports. As such, the Promote software will be beyond the scope of this book.
Now that we know what the Alteryx platform is and what software is available, we can look at how Alteryx will fit into a business case.
The Alteryx platform is all about creating a process where iteration is easy. All too often, when integrating a new data source, you won't always know the answer to the following questions until late in the process:
Trying to develop a workflow to answer these questions with a pipeline focused on writing code, common areas of frustration appear when trying to iterate through ideas and tests. These frustrations include the following:
The visual nature of Alteryx lets you quickly think through the pipeline, and see what transformation is happening where. When errors appear in the process, the tool will highlight the error in context.
It is also easy to trace specific records back through the process visually. This tracing renders straightforward the process of identifying when a transformation takes place that results in a logical error.
The Alteryx platform's key benefits to a data engineer arise in three major cases:
These benefits fall under an overarching theme of making it easier to get new datasets to the end user. For example, suppose the development time, debugging, and documentation can all be made simpler. In that case, responding to requests from analysts and data scientists becomes something to take pride in rather than dreading.
The Alteryx platform supports the speed of development with two fundamental features:
The visual development process helps a data engineer by allowing them to lay out the pipeline onto the Alteryx canvas. Of course, you can create the pipeline from scratch, which is often the case if little information about the end destination is available. Still, you can build the pipeline from a data flow chart with the principal steps preplanned.
This translation process uses the transformation tools that provide the building blocks for a workflow. By aligning those tools with a logical grid across (or down) the Designer canvas, you can see each step in the pipeline. Such an arrangement allows you to focus on each step to identify when the data might diverge for a particular process and add any intermediate checks.
The other benefit is speed – the fact that the Alteryx engine performs the operations quickly. One of the reasons for this performance is that transformations take place in memory and with the minimum memory footprint required for any particular change.
For example, when a column with millions of records has a formula applied, only the cells (the row and column combination) that are processed are needed in memory. The result is that the transformations that Alteryx does are fast.
The location of the dataset is often the only limit to Alteryx's in-memory performance. For example, opening a large Snowflake or Microsoft SQL Server table in Alteryx can become bottlenecked by network transfers. In these cases, the InDB tools can perform calculations on the remote database to minimize the problem and reduce the volume of data transferred locally.
The next significant benefit is the inherent iterative workflow that Alteryx development uses. When building a data pipeline, the sequencing of the transformations is vital to the dataset result.
This iterative process allows you to do the following:
The iterative process allows the data engineer to test changes quickly without worrying about how long it will take to compile or if you haven't noticed a typo in the SQL script.
Each tool in Alteryx will automatically document itself with annotations. For example, a formula tool will list the calculations taking place.
This self-documenting provides a good starting point for the documentation of the overall workflow. You can supplement these annotations by adding additional context. The further context can be renaming specific tools to reflect what they are doing (which also appears in the workflow logs). Add comment sections to the canvas or grouping processes with tool containers.
We now understand why the Alteryx platform is a powerful tool for data engineering and some of its key benefits. Next, we need to gain a deeper insight into the benefits that using Alteryx Designer can bring to your data engineering development.
Change the font size
Change margin width
Change background colour