Instant Pentaho Data Integration Kitchen

By: Sergio Ramazzina

Overview of this book

Pentaho Data Integration (PDI) is a modern, powerful, and easy-to-use ETL system that lets you develop ETL processes with simplicity. Through an extensive description and a good set of samples, this book gives you the experience and skills you need to run processes from the command line or schedule them. Instant Pentaho Data Integration Kitchen How-to will help you understand the correct way to deal with the PDI command-line tools. We start with a recap of how transformations and jobs are designed using Spoon, then move on to configuring memory requirements so that your processes run properly from the command line, and continue with a set of recipes that show you the different ways to start PDI processes. We then dive into the various flags that control the logging system, specifying the logging output and the log verbosity. Throughout, we focus on delivering all the knowledge you require to run your ETL processes from the command-line tools with ease and proficiency.

Preface

Pentaho Data Integration (PDI) is an ETL tool that was born 10 years ago. Its creator, Matt Casters, celebrated the 10th anniversary of the product, originally named Kettle, on March 8th, 2013 (you can read the celebratory post on Matt's blog at http://www.ibridge.be/?p=211). The term K.E.T.T.L.E. is an acronym that stands for Kettle Extraction Transformation Transport Load Environment. When Pentaho acquired Kettle, its name was changed to Pentaho Data Integration, but many developers still call it by its old name: Kettle.

How the story began…

The history of Kettle began in 2001, when Matt Casters, Pentaho Data Integration's chief architect and the creator of Kettle, was working as a BI consultant. He had the idea of writing his own ETL tool to have a better and cheaper way to transfer data from one place to another. He was looking for a different solution, something better than building ugly data warehouse solutions written in PL/SQL, VB, or shell scripts. He spent two years doing a thorough analysis of the problem. Because he was busy all the time with his work as a consultant, he worked on the project during the weekends or at night. After this phase, he came out with a set of analysis documents and a couple of test programs written in C. He was not fully satisfied with what he got, so by early 2003 he started looking towards Java and continued his work on that platform, which in those years was gaining more and more traction in the market. By mid-2003, the first version of the ETL design tool, named Stir (now called Spoon), came to life.

It is interesting to see a screenshot of how things were then:

Stir featured a big X on the graphical view, the log view was not working, and neither were most of the step dialogs; but it is useful for you to understand what the starting point of this adventure was. A number of other releases followed, each with a different set of new features or bug fixes.

In 2004, the tool was reasonably stable and Matt was able to deploy Kettle to a customer for the first time. Because of this "real-world" situation, a lot of things needed to be fixed and new features needed to be implemented, which is why, in those days, things advanced a lot faster than they had in the first three years. The code base grew so fast that several refactorings and code clean-ups were needed. Version 2.0 was one of the last "unstructured" versions, but it was thanks to the Java expertise of companies such as ixor (Wim De Clerq especially) that Kettle survived and changed radically. They helped Matt a lot with refactoring and code reorganization to give the application a better structure and to simplify the code. At that time, Kettle had a fairly complete first release, with support for slowly changing dimensions, junk dimensions, 28 steps, and 13 database connectors.

The application, which was initially closed source, was open sourced in late 2005. The first version under the new licensing mode was published in December 2005, and the response from the community was massive.

Kettle components

As of today, PDI is one of the best open source ETL solutions; it is made up of the following components:

  • Spoon: This is a desktop application that uses a graphical interface and editor for transformations and jobs. It provides a way for you to create complex ETL jobs without having to read or write code. Any time you author, edit, run, or debug a transformation or job, you will use Spoon.

  • Pan: This is a standalone command-line process that can be used to execute transformations created in Spoon (a sample invocation follows this list).

  • Kitchen: This is a standalone command-line process that can be used to execute jobs.

  • Carte: Carte is a lightweight web container that allows you to set up a dedicated, remote ETL server.
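
To give a flavor of how these tools are launched, here is a minimal sketch of running a transformation with Pan on Linux; the transformation name is an illustrative assumption, and the path reuses the sample directory quoted later in this preface:

$ pan.sh -file=/home/sramazzina/tmp/samples/export-data.ktr -level=Basic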

What this book covers

Designing a simple PDI transformation (Simple) shows you how to design the simple transformation used as an example throughout all the recipes in this book. It also summarizes how to develop a simple transformation using the design tool, Spoon, and gives some advice to follow when developing transformations.

Designing a simple PDI job (Simple) shows you how to design a simple job that uses the transformation developed in the previous recipe. This job will be used as an example throughout the book's recipes. Like the previous recipe, it summarizes how to develop a simple job using the design tool, Spoon, and gives some advice to follow when developing jobs and transformations.

Configuring command-line tools to run properly (Simple) represents the main starting point for everything. It covers the main things you need to do to configure your PDI ETL system properly so that everything works without any inconvenience.
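
As a quick illustration (not part of the recipe itself), the following sketch shows the kind of settings involved; the heap sizes, the KETTLE_HOME path, and the installation directory are all illustrative assumptions:

# Give the PDI command-line tools more heap than the default
export PENTAHO_DI_JAVA_OPTIONS="-Xms512m -Xmx2048m"
# Optional: point PDI to a custom .kettle folder (kettle.properties, repositories.xml)
export KETTLE_HOME=/home/sramazzina/pdi-config
# Run the tools from the PDI installation directory and check that they start
cd /opt/pentaho/data-integration
./kitchen.sh -version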

Executing PDI jobs from a filesystem (Simple) is the first of a set of three recipes about how to start an ETL job from the command line. It explains how to start your PDI process when it is saved to the regular filesystem.
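
For instance, a job saved as a .kjb file on disk can be launched like this (a minimal sketch; the path matches the sample location quoted in the Conventions section):

$ kitchen.sh -file=/home/sramazzina/tmp/samples/export-job.kjb -level=Basic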

Executing PDI jobs packaged in archive files (Intermediate) covers the same topic as the previous recipe, but considers the case where the process files are packaged in an archive file. This is useful any time you use an ETL procedure (for example, a maintenance procedure) on multiple systems and want to move and run it quickly and without pain.
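
As an illustrative sketch (the archive name and the entry inside it are assumptions), a job packaged in a ZIP file can usually be referenced through an Apache VFS style URL:

$ kitchen.sh -file="zip:file:///home/sramazzina/tmp/samples/etl-package.zip!/export-job.kjb" -level=Basic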

Executing PDI jobs from the repository (Simple) is the last in the series about how to start a job or transformation from the command line. This recipe is all about starting a job or transformation whose ETL files are stored in a repository.
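
The following is a hedged sketch of what a repository-based invocation typically looks like; the repository name, credentials, directory, and job name are illustrative and must match an entry in your repositories.xml:

$ kitchen.sh -rep=MyRepository -user=admin -pass=admin -dir=/samples -job=export-job -level=Basic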

Dealing with the execution log (Simple) explains how to efficiently use the various types of arguments available to manage the logfile and how to set the appropriate severity depending on the situation.
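
As a small sketch of the flags involved (the paths are illustrative), the log destination and verbosity are typically controlled like this:

$ kitchen.sh -file=/home/sramazzina/tmp/samples/export-job.kjb -logfile=/tmp/export-job.log -level=Detailed

The -level option accepts values such as Basic, Detailed, Debug, and Rowlevel.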

Discovering your PDI repository from the command line (Simple) is useful any time you decide to explore your PDI repository from the command line. It can happen that you forget what you have in your repository and where you have placed it; if that is the case, this is the recipe for you.
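
If memory serves, Kitchen exposes a few listing flags for exactly this purpose; the sketch below assumes a repository named MyRepository with default credentials:

$ kitchen.sh -listrep
$ kitchen.sh -rep=MyRepository -user=admin -pass=admin -listdir
$ kitchen.sh -rep=MyRepository -user=admin -pass=admin -dir=/samples -listjobs

The first command lists the repositories defined in repositories.xml, the second lists the repository directories, and the third lists the jobs stored in a given directory.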

Exporting jobs and transformations to .zip files (Simple) shows you how to use a very simple and useful export mechanism. It can be used to create a backup of your process files or to export them and easily move them to other systems.

Managing return code of PDI processes (Simple) is really the recipe for you if you need to get the procedure's return code to manage the conditional execution of other external processes.
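
To make the idea concrete, here is a minimal bash sketch (the wrapper script and the downstream step are hypothetical) that branches on the exit code returned by kitchen.sh, where zero normally means success and a non-zero value signals a failure:

#!/bin/bash
# Assumes the script is run from the PDI installation directory
./kitchen.sh -file=/home/sramazzina/tmp/samples/export-job.kjb -level=Basic
RC=$?
if [ "$RC" -eq 0 ]; then
  echo "ETL finished successfully, launching the next process..."
  # ./next-step.sh   # hypothetical downstream process
else
  echo "ETL failed with return code $RC" >&2
  exit "$RC"
fi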

Scheduling PDI jobs and transformations (Intermediate) tries to clear any doubts you have about scheduling your ETL processes.
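
For example, on a Unix-like system a common approach is a crontab entry such as the sketch below; the installation path and log location are assumptions:

# Run the sample job every night at 2:00 AM and append the output to a log file
0 2 * * * /opt/pentaho/data-integration/kitchen.sh -file=/home/sramazzina/tmp/samples/export-job.kjb -level=Basic >> /var/log/etl/export-job.log 2>&1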

What you need for this book

To run the samples in this book, you need Java installed (JDK 1.6 or higher is fine) and the latest version of Pentaho Data Integration. If you don't have PDI installed, you can download it for free from http://kettle.pentaho.com. For those of you who prefer to compile the tool directly from the sources (I prefer to do this for my personal installation), you can get the latest sources from the following repository: svn://source.pentaho.org/svnkettleroot.

Who this book is for

This book is for ETL developers, from beginners to advanced, who already have some knowledge of developing ETL processes using PDI. It is for anyone who wants to get a better idea of how to get their ETL processes running anywhere, manually or through a scheduler, using the command-line tools. It gives you all the knowledge you need to do your work easily and without pain.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text are shown as follows: "We can include other contexts through the use of the include directive".

A block of code is set as follows:

if "%PENTAHO_DI_JAVA_OPTIONS%"=="" set PENTAHO_DI_JAVA_OPTIONS=-Xmx512m

Any command-line input or output is written as follows:

$ kitchen.sh -file=/home/sramazzina/tmp/samples/export-job.kjb

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "Click on the New button from the toolbar menu".

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to , and mention the book title via the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at if you are having a problem with any aspect of the book, and we will do our best to address it.