Book Image

Instant Pentaho Data Integration Kitchen

By : Sergio Ramazzina
Book Image

Instant Pentaho Data Integration Kitchen

By: Sergio Ramazzina

Overview of this book

Pentaho PDI is a modern, powerful, and easy-to-use ETL system that lets you develop ETL processes with simplicity. Explore and gain the experience and skills that you need to run processes from the command line or schedule them by using an extensive description and a good set of samples. Instant Pentaho Data Integration Kitchen How-to will help you to understand the correct way to deal with PDI command line tools. We start with a recipe about how to configure your memory requirements to run your processes effectively and then move forward with a set of recipes that show you the different ways to start PDI processes. We start with a recap about how transformations and jobs are designed using spoon and then move forward to configure memory requirements to properly run your processes from the command line. We dive into the various flags that control the logging system by specifying the logging output and the log verbosity. We focus and deliver all the knowledge you require to run the ETL processes using command line tools with ease and in a proficient manner.
Table of Contents (7 chapters)

Designing a simple PDI transformation (Simple)


This recipe guides you through creating a simple PDI transformation using the graphical development environment Spoon. Using this simple example, we will see how to play with PDI command-line tools. The goal of this recipe is to extract a list of customers located in a selected country. The country to be exported is identified through an input parameter and the export is made to an Excel file located in the same directory where the transformation is run.

Getting ready

To get ready for this recipe, you first need to check that your Java environment is configured properly; to do this, check that the JAVA_HOME environment variable is set. Even if PDI while starting up tries to guess the value of the JAVA_HOME environment variable from the system, it is always good practice to set the JAVA_HOME environment variable. As soon as this is done, you need to start Spoon, the PDI's graphical development environment. You can start Spoon from the command line using the appropriate script located in the PDI's home directory. As soon as you get into the PDI home directory, you can run the proper script depending on the specific operating system environment.

Have a look at the following options:

  • If you are on Windows, use the script spoon.bat to start the application

  • If you are on Linux or Mac, use the script spoon.sh to start the application

How to do it...

  1. Create a new empty transformation. To do this, you can perform either of the following steps:

    1. Click on the New button from the toolbar menu and select the Transformation item entry.

    2. Select the Transformation item entry by navigating to File | New or by pressing Ctrl + N.

  2. Go to the Transformation properties dialog and define a new transformation parameter called p_country. To do this, perform the following steps:

    1. Open the Transformation settings dialog by either pressing Ctrl + T or by right-clicking anywhere on the working area to the right and selecting Transformation settings from the newly displayed contextual menu.

    2. Once the dialog opens, select the Parameters tab and add a new parameter called p_country.

  3. Select the Design tab from the left view. The list of category folders will appear in the Steps panel. Perform the following steps on them:

    1. From the Input category folder, get a Text File Input step and drag-and-drop it into the working area on the right.

    2. From the Job category folder, get a Get Variables step and drag-and-drop it into the working area on the right.

    3. From the Flow category folder, get a Filter and a Dummy step and drag-and-drop them into the working area on the right.

    4. Rename the Dummy step Discarded customers.

    5. From the Output category folder, get a Microsoft Excel Output step and drag-and-drop it into the working area on the right.

  4. Connect the steps together in the following specified order:

    1. Connect the Text File Input step to the Get Variable step.

    2. Connect the Get Variable step to the Filter step.

    3. Connect the Filter step to the Microsoft Excel Output step and then connect the Filter step to the Dummy step as well.

  5. Configure the Text File Input step as follows:

    1. Open the Text File Input step properties dialog by either double-clicking on the step icon in the working area or right-clicking on the step icon and selecting Edit step.

    2. Configure the name of the step.

    3. Under the File tab, configure the file to be read by typing the complete name of the file in the File or Directory input field. Because the sample file is located in the same sample directory where the transformation resides, a good approach to naming the file in a way that is location independent is to use a system variable to parameterize the directory name where the file is located. In our case, the complete filename is ${Internal.Transformation.Filename.Directory}/customers.txt.

  6. After the name of the file has been typed in, click on the Add button; the file will be added to the selected files located beneath as follows:

    1. Select the Fields tab. You must fill in the table describing the fields format.

    2. Click on the Get Fields button. The Text File Input step automatically analyzes a first set of 100 rows in the input file and tries to guess the field names to be imported. By the end of this inspection, all the fields will automatically get defined. Remember to verify the guessed data types so that only the CUSTOMER_KEY field has an Integer data type while the remaining fields have a String data type.

    3. Click on OK and close the Text File Input step properties dialog.

  7. Configure the Get Variable step as follows:

    1. Open the Get Variable step properties by either double-clicking on the step icon in the working area or right-clicking on the step icon and selecting Edit step.

    2. Configure the name of the step.

    3. Click on the first row under the Field column, add a new column; named filter_country and press the Tab key.

    4. The cursor goes to the next column. Add the name of the parameter whose value is used to populate the new field: ${p_country}. Press the Tab key.

    5. In the next column, where the cursor goes, select String as the data type of the new field.

    6. Click on OK and close the Get Variable step properties dialog.

  8. Configure the Filter step as follows:

    1. Open the Filter step properties dialog by either double-clicking on the step icon in the working area or right-clicking on the step icon and selecting Edit step.

    2. From the Send "true" data to step combobox, select the Write selected country customers item entry to set the path that is to be navigated to any time the result of the condition is true.

    3. From the Send "false" data to step combobox, select the Discarded customers item entry to set the path that is to be navigated to anytime the result of the condition is false.

    4. Add a new condition called COUNTRY = filter_country.

    5. To also manage the cases where the p_country parameter has not been set by the user, when starting the procedure, add another condition called filter_country IS NULL. Set this new condition as an alternate to the previous condition by using the OR logical operator.

    6. Click on OK and close the Filter step properties dialog.

  9. Configure the Microsoft Excel Output step as follows:

    1. Open the Microsoft Excel Output step properties dialog by either double-clicking on the step icon in the working area or right-clicking on the step icon and selecting Edit step.

    2. Configure the name of the step.

    3. Under the File tab, configure the name of the export file by typing the complete name of the file in the Filename input field. Remember to write the name of the file without any extension because the file extension is located in the Extension input field. The exported file will be put in the same samples directory where the transformation resides. In any case, a good approach is to type the filename in a way that is location independent, using a system variable to parameterize the directory name where the file is located. In our case, the complete filename is ${Internal.Transformation.Filename.Directory}/selected_country_customers.

    4. Select the Fields tab. You must fill in the table describing the field's format.

    5. Click on the Get Fields button. The Text File Input step automatically analyzes a first set of 100 rows in the input file and tries to guess the field names to be imported. At the end of this inspection, all the fields will automatically get defined.

    6. Click on OK and close the Microsoft Excel Output step properties dialog.

  10. Save the transformation with the suggested name read-customers.ktr.

  11. The transformation design is now complete. You can now go to the Design a simple job (Simple) recipe and create the job that will use this transformation.

There's more...

Now that we have designed a sample transformation, let's analyze a quick way to easily find and get directly to the needed steps.

How to quickly find the steps to use

A set of category folders organizes transformations to facilitate the user's search process. If you are unsure about the right location of the step you are looking for, you can easily find it using the search functionality. To do this, go to the Search input text field located in the upper-left corner of the design view and write the name of the step you are looking for. While you are typing, you will see the items in the Steps panel whose name starts with the letters you are typing appear as shown in the following screenshot: