Instant Pentaho Data Integration Kitchen

By: Sergio Ramazzina

Overview of this book

Pentaho Data Integration (PDI) is a modern, powerful, and easy-to-use ETL system that lets you develop ETL processes with simplicity. Through an extensive description and a good set of samples, you will gain the experience and skills you need to run processes from the command line or schedule them. Instant Pentaho Data Integration Kitchen How-to will help you understand the correct way to work with the PDI command-line tools. We start with a recap of how transformations and jobs are designed using Spoon, then configure the memory requirements needed to run your processes properly from the command line, and then move on to a set of recipes that show you the different ways to start PDI processes. We dive into the various flags that control the logging system by specifying the logging output and the log verbosity. Throughout, we focus on delivering the knowledge you need to run ETL processes with the command-line tools easily and proficiently.
Table of Contents (7 chapters)

Discovering your PDI repository from the command line (Simple)


This recipe guides you through discovering the structure and content of your PDI repository using the PDI command-line tools. From the command line we can view the list of available repositories, the list of directories in a repository, and the list of jobs or transformations in a specified directory. The recipe works the same way for both Kitchen and Pan, with one exception: jobs in a repository directory are listed with Kitchen, whereas transformations are listed with Pan.

Getting ready

To get ready for this recipe, you need to check that the JAVA_HOME environment variable is set properly and then configure your environment variables so that the Kitchen script can start from anywhere without specifying the complete path to your PDI home directory. For details about these checks, refer to the recipe Executing PDI jobs from a filesystem (Simple).
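
The two checks described above can be scripted. What follows is a minimal sketch, assuming a POSIX shell on Linux/Mac; the function name `check_pdi_env` is made up for illustration, and it only tests that `JAVA_HOME` points at a Java executable and that `kitchen.sh` is on the `PATH`, not that the Java version suits your PDI release.

```shell
#!/bin/sh
# check_pdi_env is a hypothetical helper name, not part of PDI.
# It prints one message per problem on stderr and returns non-zero
# if any check fails; it prints nothing when the environment looks fine.
check_pdi_env() {
    pdi_env_status=0
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        pdi_env_status=1
    elif [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no java executable under $JAVA_HOME/bin" >&2
        pdi_env_status=1
    fi
    # command -v succeeds only if the script can be found through PATH.
    if ! command -v kitchen.sh >/dev/null 2>&1; then
        echo "kitchen.sh not found on PATH" >&2
        pdi_env_status=1
    fi
    return "$pdi_env_status"
}
```

Calling `check_pdi_env && kitchen.sh -listrep` would then run Kitchen only when both checks pass.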

How to do it...

To get the list of the available repositories, perform the following steps:

  1. Sometimes we need to start a job or a transformation without knowing the details of the repository we are going to work with. The first thing we need is the name of the repository to connect to. To get the names of the available repositories, we can use the listrep command-line argument.

  2. Usage is simple: the argument takes no value, so we just add its name to the command line.

  3. Imagine that we need to find the list of the available repositories on Linux/Mac; the command to give is as follows:

    $ kitchen.sh -listrep
    
  4. To do the same thing on Windows, the command is written as follows:

    C:\temp\samples>Kitchen.bat /listrep
    
  5. The result lists each repository on one line in a concise form: a sequence number, the repository name, its description in square brackets, and the repository type ID:

    INFO  19-03 23:18:51,675 - Kitchen - Start of run.
    INFO  19-03 23:18:51,695 - RepositoriesMeta - Reading repositories XML file: /home/sramazzina/.kettle/repositories.xml
    List of repositories:
    #1 : sample3 [PDI Book Samples]  id=KettleFileRepository
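
When the repository name is needed in a script, it can be pulled out of this listing. The sketch below assumes the entry layout shown above (`#<number> : <name> [<description>]  id=<type>`), which may differ between PDI versions; `repo_names` is a hypothetical helper name.

```shell
#!/bin/sh
# repo_names (hypothetical name) reads `kitchen.sh -listrep` output on stdin
# and prints one repository name per line.
# Assumption: entries look like
#   #1 : sample3 [PDI Book Samples]  id=KettleFileRepository
# Log lines and the "List of repositories:" header do not match and are dropped.
repo_names() {
    sed -n 's/^#[0-9][0-9]* : \(.*\) \[.*]  *id=.*$/\1/p'
}

# Usage (requires a PDI installation on the PATH):
#   kitchen.sh -listrep | repo_names
```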
    

To get the list of directories in a selected repository, perform the following steps:

  1. Once we have found the repository we were looking for, the next step is often to look for a job or transformation stored somewhere inside it.

  2. To do this, we need to get the list of available directories in the repository using the listdir argument together with the following arguments:

    • The rep argument, to specify the name of the repository whose internal directory structure we want to display.

    • The dir argument, to give the starting directory. The command shows the directories contained in that directory. If this argument is not specified, PDI lists the directories contained in the root directory. Navigating the structure of a complex repository this way is a tedious, iterative process, but something is better than nothing!

    • The user and pass arguments, to specify the username and password used to connect, in case the repository requires authentication.

  3. To find the list of the available directories in the root of the repository sample3, the command to run on Linux/Mac is as follows:

    $ kitchen.sh -rep:sample3 -listdir
    
  4. To find the list of the available directories in the root of the repository sample3, the command to run on Windows is as follows:

    C:\temp\samples>Kitchen.bat /rep:sample3 /listdir
    
  5. The command returns the list of available directories in the following form:

    INFO  20-03 07:07:17,236 - Kitchen - Start of run.
    INFO  20-03 07:07:17,252 - RepositoriesMeta - Reading repositories XML file: /home/sramazzina/.kettle/repositories.xml
    dir2
    dir1
    
  6. The directory dir1 has a subdirectory, subdir11. To show it, we run another command, which on Linux/Mac is as follows:

    $ kitchen.sh -rep:sample3 -dir:dir1 -listdir
    

    And for Windows, the command is as follows:

    C:\temp\samples>Kitchen.bat /rep:sample3 /dir:dir1 /listdir
    
  7. PDI will show us the children of the directory dir1 as follows:

    INFO  20-03 07:07:34,324 - Kitchen - Start of run.
    INFO  20-03 07:07:34,339 - RepositoriesMeta - Reading repositories XML file: /home/sramazzina/.kettle/repositories.xml
    subdir11
    
  8. If you're checking the directories of an authenticated repository, the command will change as follows for Linux/Mac:

    $ kitchen.sh -user:pdiuser -pass:password -rep:sample3 -listdir
    

    And the command will change as follows for Windows:

    C:\temp\samples>Kitchen.bat /user:pdiuser /pass:password /rep:sample3 /listdir
    
  9. The output of the command will remain the same.
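
Because listdir shows only one level at a time, walking a deep repository by hand is repetitive. The sketch below automates that iteration under several assumptions: log lines start with a level such as INFO, every other non-blank line is a directory name, and the dir argument accepts slash-separated paths such as /dir1/subdir11. The names `KITCHEN`, `strip_log_lines`, and `list_tree` are hypothetical, not part of PDI.

```shell
#!/bin/sh
# Command used to invoke Kitchen; override KITCHEN to point at your installation.
KITCHEN=${KITCHEN:-kitchen.sh}

# Drop blank lines and log lines (assumed to start with INFO, WARN, ERROR or DEBUG).
strip_log_lines() {
    grep -v -E '^((INFO|WARN|ERROR|DEBUG)[[:space:]]|[[:space:]]*$)'
}

# Recursively print every directory path of repository $1, starting at
# directory $2 (default: the root). The subshell body keeps the variables
# of each recursion level separate.
list_tree() (
    repo=$1
    dir=${2:-/}
    # </dev/null stops Kitchen from consuming the list the loop is reading.
    "$KITCHEN" "-rep:$repo" "-dir:$dir" -listdir </dev/null | strip_log_lines |
    while IFS= read -r child; do
        path="${dir%/}/$child"
        printf '%s\n' "$path"
        list_tree "$repo" "$path"
    done
)

# Usage (requires a PDI installation on the PATH):
#   list_tree sample3
```

Each directory costs one Kitchen start-up, so on a large repository this walk can be slow.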

To get the list of jobs in a specified directory, perform the following steps:

  1. Now that we know about the internals of our repository, we're ready to look for our jobs.

  2. The argument used to show the list of the available jobs in a specified directory is listjobs. This argument must be used together with the following:

    • The rep argument, to specify the name of the repository whose jobs we want to list.

    • The dir argument, to give the name of the directory. The command shows the jobs contained in that directory. If this argument is not specified, PDI lists the jobs contained in the root directory.

    • The user and pass arguments, to specify the username and password used to connect, in case the repository requires authentication.

  3. To find the list of the available jobs in the root directory of the repository sample3, the command to run on Linux/Mac is as follows:

    $ kitchen.sh -rep:sample3 -listjobs
    
  4. To find the list of the available jobs in the root directory of the repository sample3, the command to run on Windows is as follows:

    C:\temp\samples>Kitchen.bat /rep:sample3 /listjobs
    
  5. The command returns the list of available jobs in the following form:

    INFO  20-03 07:30:46,642 - Kitchen - Start of run.
    INFO  20-03 07:30:46,657 - RepositoriesMeta - Reading repositories XML file: /home/sramazzina/.kettle/repositories.xml
    export-job
    
  6. If you're checking the jobs in an authenticated repository, the command on Linux/Mac will change in the following way:

    $ kitchen.sh -user:pdiuser -pass:password -rep:sample3 -listjobs
    
  7. If you're checking the jobs in an authenticated repository, the command on a Windows platform will change in the following way:

    C:\temp\samples>Kitchen.bat /user:pdiuser /pass:password /rep:sample3 /listjobs
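
A script can use the job listing as a guard before it tries to start a job. The following is a sketch only: `job_exists`, `KITCHEN`, `PDI_USER`, and `PDI_PASS` are hypothetical names, and the match assumes each job name appears alone on its own output line, as in the sample output above.

```shell
#!/bin/sh
# Command used to invoke Kitchen; override KITCHEN to point at your installation.
KITCHEN=${KITCHEN:-kitchen.sh}

# Succeed if job $3 is listed in directory $2 of repository $1.
# PDI_USER and PDI_PASS are hypothetical variable names; when PDI_USER is set,
# both are passed as -user/-pass for an authenticated repository.
job_exists() (
    repo=$1
    dir=$2
    job=$3
    set -- "-rep:$repo" "-dir:$dir" -listjobs
    if [ -n "${PDI_USER:-}" ]; then
        set -- "-user:$PDI_USER" "-pass:${PDI_PASS:-}" "$@"
    fi
    # -F: literal match, -x: whole line, so "export" does not match "export-job".
    "$KITCHEN" "$@" </dev/null | grep -F -x -- "$job" >/dev/null
)

# Usage (requires a PDI installation on the PATH):
#   job_exists sample3 / export-job && echo "export-job is available"
```

Note that a password passed on the command line can be visible to other users in the process list.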
    

To get the list of transformations in a specified directory, perform the following steps:

  1. The Pan script lets us see the list of transformations contained in a directory of our repository.

  2. To do this, we need to specify the listtrans argument together with the same arguments used with the listjobs argument; refer to the previous set of steps for a detailed explanation of the meaning and syntax of those arguments.

  3. To find the list of the available transformations in the root directory of the repository sample3, the command to run on Linux/Mac is as follows:

    $ pan.sh -rep:sample3 -listtrans
    
  4. To find the list of the available transformations in the root directory of the repository sample3, the command to run on Windows is as follows:

    C:\temp\samples>Pan.bat /rep:sample3 /listtrans
    
  5. The command returns the list of available transformations in the following form:

    INFO  20-03 07:35:10,073 - Pan - Start of run.
    INFO  20-03 07:35:10,103 - RepositoriesMeta - Reading repositories XML file: /home/sramazzina/.kettle/repositories.xml
    read-customers
    
  6. Everything said about listing the jobs in a specific directory, including the use of the user and pass arguments for an authenticated repository, applies to transformations as well; just remember to use the Pan script instead of the Kitchen script.
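
The split between the two scripts is easy to get wrong in automation, so it can help to encode it once. The sketch below only prints the matching command line rather than running it; `list_command` is a hypothetical helper name, and the printed text is not quoted for repository or directory names that contain spaces.

```shell
#!/bin/sh
# list_command (hypothetical name) prints the command that lists objects of
# kind $1 ("jobs" or "trans") in directory $3 of repository $2.
# Jobs are listed by Kitchen, transformations by Pan.
list_command() {
    case "$1" in
        jobs)  printf 'kitchen.sh -rep:%s -dir:%s -listjobs\n' "$2" "$3" ;;
        trans) printf 'pan.sh -rep:%s -dir:%s -listtrans\n' "$2" "$3" ;;
        *)     echo "unknown kind: $1 (expected jobs or trans)" >&2
               return 2 ;;
    esac
}

# Usage:
#   list_command trans sample3 dir1
```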