Book Image

Instant Pentaho Data Integration Kitchen

By : Sergio Ramazzina
Book Image

Instant Pentaho Data Integration Kitchen

By: Sergio Ramazzina

Overview of this book

Pentaho PDI is a modern, powerful, and easy-to-use ETL system that lets you develop ETL processes with simplicity. Explore and gain the experience and skills that you need to run processes from the command line or schedule them by using an extensive description and a good set of samples. Instant Pentaho Data Integration Kitchen How-to will help you to understand the correct way to deal with PDI command line tools. We start with a recipe about how to configure your memory requirements to run your processes effectively and then move forward with a set of recipes that show you the different ways to start PDI processes. We start with a recap about how transformations and jobs are designed using spoon and then move forward to configure memory requirements to properly run your processes from the command line. We dive into the various flags that control the logging system by specifying the logging output and the log verbosity. We focus and deliver all the knowledge you require to run the ETL processes using command line tools with ease and in a proficient manner.
Table of Contents (7 chapters)

Executing PDI jobs from the repository (Simple)


This recipe guides you through starting PDI jobs using Kitchen by assuming that PDI jobs are stored in the PDI repository. We will learn how to start simple jobs both with and without a set of input parameters previously defined in the job.

There are different versions of repositories that you can use depending on your needs: filesystem, database, and Pentaho repository.

  • The filesystem repository is a very easy form of repository; it lets you store your files at a specified location in the filesystem.

  • The database repository is another simple form repository that uses database tables as a way to have a unique central objects store. It is usually an authenticated repository and does not have any kind of versioning or locking mechanism on the object, so it is not recommended for team development.

  • The last type of repository, the Pentaho repository, is a sophisticated repository environment for safely storing our BI object (and so our ETL processes) available only for the Enterprise Edition (EE) of PDI.

Getting ready

To play with this recipe, I prepared a filesystem repository in the <book_samples>/sample3 directory with all the related transformations and the job in it. The repository is located in the samples/sample3 directory. To have our example working fully, we need to add the filesystem repository (see the How to define a filesystem repository section under the There's more… section of this recipe for details about it) to PDI by specifying the complete path to the <book_samples>/sample3 directory in the base directory field.

Check if the JAVA_HOME environment variable is set properly and configure your environment variables so that the Kitchen script can start from anywhere without specifying the complete path to your PDI home directory. For details about these checks, refer to the Executing PDI jobs from a filesystem (Simple) recipe.

How to do it...

For starting a PDI job from a filesystem repository in Linux or Mac, perform the following steps:

  1. Let's start our sample job from the repository; to do this, we need to use two new arguments:

    • The –rep argument identifies the repository we are going to connect to; the syntax is as follows:

      –rep: <id_of_the_repository>
      
    • The –job argument specifies the name of the job that needs to be started; the syntax is as follows:

      –job: <name_of_the_job>
      
  2. So, because the ID of our repository is sample3, we'll start the sample job with the following syntax:

    $ kitchen.sh –rep:sample3 –job: export-job
    
  3. Suppose that the job is not located in the root directory of our repository, but is stored in the pdi_book directory; to start it, we must specify the correct path to the job in the repository with the following syntax using the –dir argument:

    $ kitchen.sh –rep:sample3 –dir:/pdi_book  -job:export-job
    
  4. Parameter usage from within the command-line tool is exactly the same as we saw in the previous recipes. So, if we want to extract all the customers from the country U.S.A, a sample of the command to use is as follows:

    kitchen.sh -param:p_country=USA ––rep:sample3 –job: export-job
    

For starting a PDI job from a filesystem repository on Windows, perform the following steps:

  1. Starting a PDI job from within a filesystem repository on Windows requires for the same rules we saw previously to be followed with the exception that, as usual, we need to specify the arguments using the / character instead of the character that we used for Linux or Mac. So, the –rep argument will change to /rep:

    /rep: < id_of_the_repository >
    

    And the –job argument will change to /job:

    /job: < name_of_the_job >
    
  2. Let's suppose that the job is located in the root directory of our repository; we can call the Kitchen script using the following syntax:

    C:\temp\samples>Kitchen.bat /rep:sample3 /job:export-job
    
  3. About the use of PDI parameters through the command-line arguments: let's suppose we run our job by extracting all the customers for the country U.S.A; we can call our job using the following syntax:

    C:\temp\samples>Kitchen.bat /rep:sample3 /job:export-job /param:p_country:USA
    

For starting a PDI job from a database repository, perform the following steps:

  1. The process of starting a PDI job in a database repository is the same as in a filesystem repository. The only difference in this case is that a database repository is an authenticated source, so it requires a username and password.

  2. The –user parameter lets you specify a valid username to connect to the repository. If you are on Linux/Mac, the syntax will be as follows:

    –user: <repository_username>
    
  3. For Windows, you need to change the character to the / character; the syntax is as follows:

    /user: <repository_username>
    
  4. The –pass argument lets you specify a valid password for the user you are using to connect to the database repository; the syntax in this case for Linux/Mac is as follows:

    –pass: <repository_user_password>
    

    And for Windows, the syntax is as follows:

    /pass: <repository_username>
    
  5. Suppose that our job, called export-job.kjb, is located in a database repository instead of in a filesystem repository, and that we are connecting to the user pdiuser with the password password. The example command to start the job in Linux/Mac is as follows:

    $ kitchen.sh –user:pdiuser –pass:password –rep:sample3 –dir:/pdi_book  -job:export-job
    

    Whereas in Windows, the syntax is as follows:

    C:\temp\samples>Kitchen.bat /user:pdiuser /pass:password /rep:sample3 /job:export-job
    

There's more...

Up until now, we worked with jobs and transformations saved as XML files in the plain filesystem. Let's discuss some important things in the following sections that will help them work properly if we were to store them in a repository. Then, we'll talk about how to define a filesystem and a database repository.

Changes in job and transformation design

When using jobs and transformations from within repositories (either database or files), things change slightly. In this case, the ETL process files relate to the root of the repository, and the internal variable Internal.Job.Filename.Directory does not make any sense. Because of this, we need to change the way our example job links jobs or transformations or any kind of file. The major changes are as follows:

  • The job links the transformation using the system variable Internal.Job. Repository.Directory to dynamically get the path to the job file in the repository. This system variable gives you the path to the job file being executed in the repository at runtime.

  • To specify the input and output path for all the files read or written from or to the outside world, we added two new parameters, p_input_directory and p_target_directory, to specify the absolute or relative path (it depends on your needs) to the input and output directories. In our examples, these parameters default to the local directory where the job or transformation starts.

How to define a filesystem repository

Creating a filesystem repository is a fairly easy task. Let's follow the ensuing summarized steps and try to define a new one:

  1. Start Spoon and navigate to Tools | Repository | Connect. The repository connection dialog opens.

  2. Click on the new repository button (the green circular button with the + icon in it). The Select repository type dialog box opens.

  3. Spoon lets you decide between the two repository types. Choose the filesystem repository type and click on OK. The File repository settings dialog box opens in front of you.

  4. Select the base directory that will contain the repository objects (in our case, it is the <book_samples>/sample3 directory).

  5. Fill in the ID field with the ID of the repository; in our case, the ID of the repository is sample3.

  6. Set a name for the repository. You can even use spaces in your repository name because it is just a sort of label.

  7. Click on OK and you are done. The newly created repository will appear in the list of available repositories.

Defining a database repository

The process to define a database repository is almost identical to what we saw in the previous paragraph regarding the filesystem repository. The difference here is that PDI requires a connection definition where it can create the related repository's database tables. Let's follow the ensuing summarized steps to create a new one:

  1. Start Spoon and navigate to Tools | Repository | Connect. The repository connection dialog opens.

  2. Click on the new repository button (the green circular button with the + icon in it). The Select repository type dialog box opens.

  3. Spoon lets you decide between the two repository types. Choose the database repository type and click on OK. The Repository information dialog box opens in front of you.

  4. Choose the connection name from a list of connections. In case there is no valid connection to use, you can define it by clicking on the New button (we assume you already know how to define a connection and that you already have a connection available). Fill in the repository ID and repository name.

  5. If you don't have an existing repository or if you are accessing an old repository version, you can respectively create the necessary tables or upgrade the existing ones by clicking on the Create or Upgrade button.

  6. Click on OK and you are done. The newly created repository will appear in the list of available repositories.