Instant Pentaho Data Integration Kitchen

By: Sergio Ramazzina

Overview of this book

Pentaho PDI is a modern, powerful, and easy-to-use ETL system that lets you develop ETL processes with simplicity. Through an extensive description and a good set of samples, you gain the experience and skills you need to run processes from the command line or schedule them. Instant Pentaho Data Integration Kitchen How-to will help you understand the correct way to work with the PDI command-line tools. We start with a recap of how transformations and jobs are designed using Spoon, then configure the memory requirements so that your processes run properly from the command line, and move on to a set of recipes that show the different ways to start PDI processes. We dive into the various flags that control the logging system by specifying the logging output and the log verbosity. Throughout, we focus on the knowledge you need to run ETL processes from the command line with ease and proficiency.

Configuring command-line tools to run properly (Simple)


This recipe guides you through configuring the command-line tool scripts so that execution performance stays under control when memory requirements grow. Many steps work in memory, so the more memory we reserve for PDI, within the limits of the available memory and the overall system requirements, the better. A wrong memory configuration leads to poor performance, unexpected OutOfMemory errors, or both.

You will learn how to modify the script files Kitchen or Pan to set new memory requirements. This recipe will work the same for both Kitchen and Pan; the only difference to consider is in the names of the script files.

Getting ready

Remember that in PDI, we have two different sets of scripts to start PDI processes from the command line:

  • The Kitchen scripts for starting PDI jobs

  • The Pan scripts for starting PDI transformations

Both sets of scripts live in the PDI home directory. Which files you edit depends on your operating system: the .bat files on Windows, and the .sh files on Linux or Mac.

So, let's go to the PDI home directory and start working on this recipe.

How to do it...

To change the memory settings by modifying the script in Windows, perform the following steps:

  1. From the PDI home directory, open the kitchen.bat (or pan.bat) script.

  2. Scan through the script's code until you find the following line of code:

    if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS=-Xmx512m
  3. Change the value of the variable PENTAHO_DI_JAVA_OPTIONS to the required memory value, for example, 1024 MB:

    set PENTAHO_DI_JAVA_OPTIONS=-Xmx1024m
  4. Save the file and exit.

To change the memory settings by modifying the script in Linux or Mac, perform the following steps:

  1. From the PDI home directory, open the kitchen.sh script (or pan.sh).

  2. Scan through the script's code until you find the following lines of code:

    if [ -z "$JAVAMAXMEM" ]; then
      JAVAMAXMEM="512"
    fi
    
    if [ -z "$PENTAHO_DI_JAVA_OPTIONS" ]; then
      PENTAHO_DI_JAVA_OPTIONS="-Xmx${JAVAMAXMEM}m"
    fi
  3. Change the value of the variable JAVAMAXMEM to the required memory value, for example, 1024 MB:

      JAVAMAXMEM="1024"
  4. Save the file and exit.
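The defaulting logic in the kitchen.sh fragment quoted above can be tried in isolation. The following is a minimal sketch, not part of PDI: it copies the two `if` blocks into a function with the hypothetical name resolve_java_options and prints the options that would result in three situations. It assumes the script contains exactly the lines quoted in this recipe; other PDI versions may differ.

```shell
#!/bin/bash
# Hypothetical helper (not part of PDI) that mirrors the two default
# checks quoted from kitchen.sh in this recipe.
resolve_java_options() {
  if [ -z "$JAVAMAXMEM" ]; then
    JAVAMAXMEM="512"
  fi

  if [ -z "$PENTAHO_DI_JAVA_OPTIONS" ]; then
    PENTAHO_DI_JAVA_OPTIONS="-Xmx${JAVAMAXMEM}m"
  fi
  echo "$PENTAHO_DI_JAVA_OPTIONS"
}

# Each case runs in its own subshell, so assignments do not leak between cases.

# Nothing set: the script falls back to its built-in default.
default_opts=$(unset JAVAMAXMEM PENTAHO_DI_JAVA_OPTIONS; resolve_java_options)

# JAVAMAXMEM set to 1024, as in step 3 of this recipe.
edited_opts=$(unset PENTAHO_DI_JAVA_OPTIONS; JAVAMAXMEM="1024"; resolve_java_options)

# PENTAHO_DI_JAVA_OPTIONS already set: the script leaves it untouched.
env_opts=$(PENTAHO_DI_JAVA_OPTIONS="-Xmx2048m"; resolve_java_options)

echo "default: $default_opts"
echo "edited:  $edited_opts"
echo "env:     $env_opts"
```

The third case shows why a value already present in PENTAHO_DI_JAVA_OPTIONS takes precedence over JAVAMAXMEM: the second check only builds the option string when the variable is empty.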

Setting the environment variables in Windows is another, cleaner way to change the memory settings. To do this, you need to execute the steps summarized as follows:

  1. Go to Control Panel and open the Environment variables dialog window.

  2. Create a new system variable by clicking on the New button in the System Environment Variable section of the Environment variables dialog window.

  3. Add a new variable called PENTAHO_DI_JAVA_OPTIONS.

  4. Set the value of the variable. If we want to assign 1024 MB to PDI, for example, we set the variable's value to -Xmx1024m.

  5. Click on OK to confirm and close the New System Variable dialog window.

  6. Click on OK to close the System Environment Variables dialog window. This change will affect both scripts for jobs and transformations without any additional requirement.

To change the memory settings by setting the environment variables in Linux/Mac, perform the following steps:

  1. Go to the User home directory and open the .bash_profile script file. If it does not exist, create a new one.

  2. Add the following line to the script file:

    export PENTAHO_DI_JAVA_OPTIONS="-Xmx1024m"
  3. Save and close the file. Remember that the new environment variable becomes visible in the user context only from the next login, or after you close and reopen your terminal window.
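To check that the variable reaches the processes you start, a quick test like the following can help. It is only a sketch: it sets the variable in the current shell, as the line in .bash_profile would at login, and then asks a child bash process what it sees. The variable name child_view is illustrative.

```shell
#!/bin/bash
# Same assignment as the line added to .bash_profile, applied to the
# current shell. To load the real file without logging in again, you
# could run: source ~/.bash_profile
export PENTAHO_DI_JAVA_OPTIONS="-Xmx1024m"

# Exported variables are inherited by child processes, which is how
# kitchen.sh or pan.sh would receive the value.
child_view=$(bash -c 'echo "$PENTAHO_DI_JAVA_OPTIONS"')
echo "Child process sees: $child_view"
```

Without the export keyword the assignment would stay local to the shell, and the child process would see an empty value.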

There's more...

Setting the environment variable is a good way to configure our scripts seamlessly without modifying anything in the standard script. However, we can simplify our life by writing scripts that encapsulate all the internals related to the preparation of the script's execution environment. This lets us run our process without any hassle.

Making things easier by writing custom scripts

Kitchen and Pan are the two scripts that start our PDI processes from the command line, and they offer many switches that let us configure a PDI job or transformation to run properly. However, starting a job or a transformation sometimes also means preparing an execution environment, which can require technical knowledge and a considerable amount of time. We do not usually want our users to be in that situation. To work around this, wrap the call to either the Kitchen or the Pan script in a custom script that takes care of all the preparation.

Let's say we have a PDI job to start in Linux/Mac. We can write a bash script called startMyJob.sh that starts our job easily by configuring all the settings required to perform the job execution properly as shown in the following code:

#!/bin/bash

export PENTAHO_DI_JAVA_OPTIONS="-Xmx3072m -Djava.io.tmpdir=/mnt/tmp"
export KETTLE_HOME=/home/ubuntu/pdi_settings

/home/ubuntu/pentaho/data-integration/kitchen.sh -file=/home/ubuntu/pentaho/etl/run_load.kjb -level=Basic -param:SKIP_FTP=false -param:SKIP_DIMENSIONS=true "run_model"

As you can see, the code prepares the execution environment by setting the following:

  • The memory options

  • The location of a directory to store temporary files

  • The KETTLE_HOME variable

Finally, it starts the PDI job. You can see how simple it is to start our job using this script instead of spending a lot of time on manual settings every time!
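A wrapper like this is often called from a scheduler, where it helps if a failed job is also reported by the wrapper itself. The following variation is a sketch under stated assumptions, not the book's script: run_job is a hypothetical function name, the paths are the ones used above, and it assumes that kitchen.sh returns a non-zero exit status when the job fails.

```shell
#!/bin/bash
# Hypothetical wrapper function (run_job is an illustrative name, not part of PDI).
#   $1: path to kitchen.sh
#   $2: path to the job file (.kjb)
run_job() {
  local kitchen="$1"
  local job_file="$2"

  # Same environment preparation as in startMyJob.sh above.
  export PENTAHO_DI_JAVA_OPTIONS="-Xmx3072m -Djava.io.tmpdir=/mnt/tmp"
  export KETTLE_HOME=/home/ubuntu/pdi_settings

  # Start the job and keep its exit status.
  "$kitchen" -file="$job_file" -level=Basic
  local status=$?

  # Assumption: a non-zero status means the job did not complete successfully.
  if [ "$status" -ne 0 ]; then
    echo "Job $job_file failed with exit status $status" >&2
  fi
  return "$status"
}

# Example call, using the paths from the script above:
# run_job /home/ubuntu/pentaho/data-integration/kitchen.sh /home/ubuntu/pentaho/etl/run_load.kjb
```

Returning the status from the function lets the calling script or scheduler decide what to do next, for example, send a notification or skip a dependent job.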