Instant Pentaho Data Integration Kitchen

By: Sergio Ramazzina

Overview of this book

Pentaho Data Integration (PDI) is a modern, powerful, and easy-to-use ETL system that lets you develop ETL processes with simplicity. Through an extensive description and a good set of samples, you will gain the experience and skills you need to run processes from the command line or to schedule them. Instant Pentaho Data Integration Kitchen How-to will help you understand the correct way to deal with the PDI command-line tools. We start with a recap of how transformations and jobs are designed using Spoon, and then move on to configuring the memory requirements needed to run your processes properly from the command line. A set of recipes then shows you the different ways to start PDI processes, and we dive into the various flags that control the logging system by specifying the logging output and the log verbosity. The goal is to deliver all the knowledge you require to run your ETL processes with the command-line tools easily and proficiently.

Designing a simple PDI job (Simple)


This recipe guides you through creating a simple PDI job using the graphical development environment Spoon. In a PDI process, jobs orchestrate other jobs and transformations in a coordinated way to realize our business process. This simple job uses the transformation created in the previous recipe, and we will use it as a simple example in this book's recipes, wherever necessary, to play with PDI command-line tools.

Getting ready

To get ready for this recipe, you need to check that the JAVA_HOME environment variable is set properly and then start the Spoon script. For more information on this, check what we detailed in the first recipe, Designing a sample PDI transformation (Simple).
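
If you want to double-check your environment from a terminal before launching the GUI, a minimal sketch on Linux could look like the following; the JDK path and the PDI installation directory are only examples, so adjust them to your system:

  # Verify that JAVA_HOME points to a valid Java installation (example path only)
  echo $JAVA_HOME
  export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

  # Start Spoon from your PDI installation directory (example path only)
  cd /opt/data-integration
  ./spoon.sh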

How to do it...

  1. Create a new empty job. To do this, you can perform either of the following steps:

    • Click on the New button from the toolbar menu and select the Job item entry

    • Select the Job item entry by navigating to File | New or by pressing Ctrl + Alt + N

  2. Go to the Job settings dialog and define a new parameter called p_country. Do this as follows:

    1. Open the Job settings dialog by either pressing Ctrl + J or right-clicking anywhere in the working area on the right and selecting Job settings from the contextual menu that appears.

    2. Once the dialog opens, select the Parameters tab and add a new parameter called p_country.

  3. Select the Design tab from the left-hand side view to display the list of tasks in their corresponding category folders. Look for the following tasks and drag-and-drop them into the working area on the right:

    1. From the General category folder, get a Start task, a Transformation task, a Dummy task, and a Success task, and drag-and-drop them into the working area.

    2. From the File management category folder, get a Delete File task and drag-and-drop it into the working area.

    3. From the Utility category folder, get an Abort task and drag-and-drop it into the working area on the right.

  4. Connect the tasks together in the following specified order:

    1. Connect the Start task to the Delete File task.

    2. Connect the Delete File task to the Transformation task and then connect the Delete File task to the Dummy task as well.

    3. Connect the Transformation task to the Success task and then connect the Transformation task to the Dummy task as well.

    4. Connect the Dummy task to the Abort task.

  5. Configure the Delete File task as follows:

    1. Open the Delete File task properties dialog by either double-clicking on the step icon in the working area or right-clicking on the step icon and selecting Edit job entry.

    2. Configure the name of the task.

    3. Configure the name of the file to be deleted by typing the complete name of the file in the Filename input field. The file to be deleted is located in the same samples directory where the transformation and the job reside. In any case, a good approach is to type the filename in a way that is location independent, using a system variable to parameterize the directory name where the file is located. In our case, the complete filename is ${Internal.Transformation.Filename.Directory}/selected_country_customers.xls.

    4. As you can see in the Properties dialog, you can tick a checkbox so that the task will fail in case the file to be deleted does not exist.

    5. Click on OK and close the Delete File task properties dialog.

  6. Configure the Transformation task as follows:

    1. Open the Transformation task properties dialog by either double-clicking on the step icon in the working area or right-clicking on the step icon and selecting Edit job entry.

    2. Configure the name of the task.

    3. Under the Transformation specification tab, configure the name of the file for the transformation we're going to call. The transformation's file is located in the same samples directory where the job resides. In any case, a good approach is to type the filename in a way that is location independent, using a system variable to parameterize the directory name where the file is located. In our case, the complete filename is ${Internal.Job.Filename.Directory}/read-customers.ktr.

  7. Save the job with the suggested name, export-job.kjb, in the same directory where you previously saved the transformation.

  8. Try to execute the process from the Spoon GUI. This is normally done by the developer to test or debug his/her jobs and transformations. To do this, click on the Run button, which can be identified by the green triangle icon located in the toolbar.

  9. After the execution, if everything terminates successfully, the task icons will be decorated with a green mark that indicates that the specific tasks were executed successfully. Moreover, at the bottom of the working area, the Job Entry log details view shows the results for every single task called by the job. A command-line equivalent of this run, using Kitchen, is sketched right after this list.
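
Although this recipe runs the job from inside Spoon, the same job can also be launched from the command line with Kitchen, which is the focus of the following recipes. A minimal sketch on Linux could look like the following; the paths are only examples, and the command assumes that the job was saved as export-job.kjb in a samples directory under your home:

  # Run the job file with Kitchen, passing a value for the p_country parameter
  # and asking for a basic log verbosity (paths are examples only)
  cd /opt/data-integration
  ./kitchen.sh -file=$HOME/samples/export-job.kjb -param:p_country=Italy -level=Basic

If the job ends successfully, Kitchen returns a zero exit code; a non-zero exit code means that the execution failed, for example because the Abort task in one of the failure paths was reached.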

How it works...

A Pentaho ETL process is generally made up of a set of jobs and transformations.

Transformations are workflows whose role is to perform actions on a flow of data, typically by applying a set of basic steps to the data. A transformation can be made up of:

  • A set of input steps

  • A set of transformation steps

  • A set of output steps

Input steps take data from external sources and bring them into the transformation. Examples of input steps are as follows:

  • File input steps (text, Excel, properties, other)

  • Table input steps

  • OLAP source input steps or other similar steps

Transformation steps apply elementary business rules to the flow of data; the composition of this set of elementary transformation steps into an organized flow of operations represents a process. Examples of transformation steps are those that perform the following actions:

  • Make operations on strings

  • Make calculations

  • Join different flow paths

  • Apply scripts to the data with the goal of getting the results into other fields

Output steps send the data from the flow to external targets, such as databases, files, web services, or others. Therefore, we can say that transformations act as a sort of unit of work in the context of an entire ETL process. The more a transformation is atomic and concrete in its work, the more we can reuse it throughout other ETL processes.
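
Because a transformation works as a self-contained unit of work, you can also run one on its own with Pan, the command-line tool that plays for transformations the role Kitchen plays for jobs. A minimal sketch on Linux could look like the following; the paths are only examples, and it assumes that the read-customers.ktr transformation declares the country filter as a named parameter (called p_country here purely for illustration):

  # Run the transformation file with Pan, passing the country parameter
  # and asking for a basic log verbosity (paths are examples only)
  cd /opt/data-integration
  ./pan.sh -file=$HOME/samples/read-customers.ktr -param:p_country=Italy -level=Basic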

Jobs are workflows whose role is to orchestrate the execution of a set of tasks: they synchronize and prioritize the tasks, ordering their execution based on the success or failure of the current task. These tasks are basic building blocks that either prepare the execution environment for the tasks that follow them in the workflow or manage the artifacts produced by the tasks that precede them. For example, there are tasks that manipulate files and directories in the local filesystem, tasks that move files between remote servers through FTP or SSH, and tasks that check the availability or the content of a table. Any job can call other jobs or transformations to design more complex processes. Therefore, generally speaking, jobs orchestrate the execution of jobs and transformations into larger ETL processes.

In our case, we have a very simple example with a job and a transformation to support our recipes' experiments. The transformation gets data from a text file that contains a set of customers by country. After the data from the text file is loaded, it filters the dataflow by country and writes the result to an Excel file. The filter uses a parameter that you set when you start the process and that feeds the filter step. The job checks whether the output file from a previous run exists and, if so, deletes it and then calls the transformation for a new extraction. The job also has some failure paths to manage any error condition that could occur while processing the tasks; the failure paths terminate with a step that aborts the job, marking it as failed.

There's more...

Every time we design a job or a transformation, there are some basic rules that help make things more easily portable between different systems and, ideally, self-describing. The use of internal variables and a proper naming scheme for your job tasks and transformation steps are good rules of thumb. Finally, a brief recap of the various color and icon indicators that are implicitly present in the design of your process is also a good exercise: they help you quickly understand how information flows (inside your transformations) or how the set of operations executes (inside a job).

Why a proper naming for tasks and steps is so important

Each task in a job and each step in a transformation has a set of properties that lets the user configure the expected behavior; one of these properties is used to give it a name. Giving tasks and steps an appropriate name is very important because it helps make our transformations and jobs more readable. This suggestion becomes more valuable as the process grows bigger. Documentation is always considered an unpleasant thing to do, but documenting processes is the only way to remember what we made, and why, in the long term. In this case, the correct naming of our components is something that, if done well, can be considered documentation in itself, at least for the insiders. And that is good enough!

Using internal variables to write location-independent processes

When writing our sample jobs and transformations, we used internal variables to set the path of the files we are reading or writing, and to set the path of the transformation file we have linked. This is very important to make our transformations and jobs location unaware, so that we can easily move them around our servers without any pain.

Kettle has internal variables for this that you can access whenever required. By pressing Ctrl + Space directly inside a field, you can activate the variable inspector, which helps you find the right variable name without having to struggle with it. Going back to our problem of building a location-independent path, two system variables are particularly important:

  • Internal.Job.Filename.Directory – This is the directory name where the running job resides

  • Internal.Transformation.Filename.Directory – This is the directory name where the running transformation resides

The important thing about these two variables is that PDI resolves them dynamically at runtime. So whenever you refer to a file, if you build its path relative to one of these two variables (depending on the case), you will be able to create location-unaware processes that you can move around without any pain.
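
As an illustration, suppose the same job is copied to two different servers; the directories below are purely hypothetical:

  # Server A: the job is saved as /opt/etl/samples/export-job.kjb
  # Server B: the job is saved as /home/pdi/work/samples/export-job.kjb
  #
  # Inside the job, the transformation is referenced as:
  #   ${Internal.Job.Filename.Directory}/read-customers.ktr
  #
  # At runtime, PDI resolves the variable from the location of the running job file:
  #   Server A -> /opt/etl/samples/read-customers.ktr
  #   Server B -> /home/pdi/work/samples/read-customers.ktr
  ./kitchen.sh -file=/opt/etl/samples/export-job.kjb -param:p_country=Italy

As long as the job and the transformation travel together in the same directory, nothing inside the process needs to be changed when it is moved.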