Book Image

Pentaho Data Integration Beginner's Guide - Second Edition - Second Edition

By : María Carina Roldán
Book Image

Pentaho Data Integration Beginner's Guide - Second Edition - Second Edition

By: María Carina Roldán

Overview of this book

Capturing, manipulating, cleansing, transferring, and loading data effectively are the prime requirements in every IT organization. Achieving these tasks require people devoted to developing extensive software programs, or investing in ETL or data integration tools that can simplify this work. Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. Pentaho Data Integration has an intuitive, graphical, drag-and-drop design environment and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing. "Pentaho Data Integration Beginner's Guide - Second Edition" provides the guidance needed to overcome that difficulty, covering all the possible key features of Pentaho Data Integration. "Pentaho Data Integration Beginner's Guide - Second Edition" starts with the installation of Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. Moreover, you will be introduced to data warehouse concepts and you will learn how to load data in a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity of applying and reinforcing all the learned concepts through the implementation of a simple datamart. With "Pentaho Data Integration Beginner's Guide - Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.
Table of Contents (26 chapters)
Pentaho Data Integration Beginner's Guide
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Best Practices
Index

Time for action – starting and customizing Spoon


In this section, you are going to launch the PDI graphical designer, and get familiarized with its main features.

  1. Start Spoon.

    • If your system is Windows, run Spoon.bat

      Tip

      You can just double-click on the Spoon.bat icon, or Spoon if your Windows system doesn’t show extensions for known file types. Alternatively, open a command window—by selecting Run in the Windows start menu, and executing cmd, and run Spoon.bat in the terminal.

    • In other platforms such as Unix, Linux, and so on, open a terminal window and type spoon.sh

    • If you didn’t make spoon.sh executable, you may type sh spoon.sh

    • Alternatively, if you work on Mac OS, you can execute the JavaApplicationStub file, or click on the Data Integration 32-bit.app, or Data Integration 64-bit.app icon

  2. As soon as Spoon starts, a dialog window appears asking for the repository connection data. Click on the Cancel button.

    Note

    Repositories are explained in Appendix A, Working with Repositories. If you want to know what a repository connection is about, you will find the information in that appendix.

  3. A small window labeled Spoon tips... appears. You may want to navigate through various tips before starting. Eventually, close the window and proceed.

  4. Finally, the main window shows up. A Welcome! window appears with some useful links for you to see. Close the window. You can open it later from the main menu.

  5. Click on Options... from the menu Tools. A window appears where you can change various general and visual characteristics. Uncheck the highlighted checkboxes, as shown in the following screenshot:

  6. Select the tab window Look & Feel.

  7. Change the Grid size and Preferred Language settings as shown in the following screenshot:

  8. Click on the OK button.

  9. Restart Spoon in order to apply the changes. You should not see the repository dialog, or the Welcome! window. You should see the following screenshot full of French words instead:

What just happened?

You ran for the first time Spoon, the graphical designer of PDI. Then you applied some custom configuration.

In the Option… tab, you chose not to show the repository dialog or the Welcome! window at startup. From the Look & Feel configuration window, you changed the size of the dotted grid that appears in the canvas area while you are working. You also changed the preferred language. These changes were applied as you restarted the tool, not before.

The second time you launched the tool, the repository dialog didn’t show up. When the main window appeared, all of the visible texts were shown in French which was the selected language, and instead of the Welcome! window, there was a blank screen.

You didn’t see the effect of the change in the Grid option. You will see it only after creating or opening a transformation or job, which will occur very soon!

Spoon

Spoon, the tool you’re exploring in this section, is the PDI’s desktop design tool. With Spoon, you design, preview, and test all your work, that is, Transformations and Jobs. When you see PDI screenshots, what you are really seeing are Spoon screenshots. The other PDI components which you will learn in the following chapters, are executed from terminal windows.

Setting preferences in the Options window

In the earlier section, you changed some preferences in the Options window. There are several look and feel characteristics you can modify beyond those you changed. Feel free to experiment with these settings.

Note

Remember to restart Spoon in order to see the changes applied.

In particular, please take note of the following suggestion about the configuration of the preferred language.

Tip

If you choose a preferred language other than English, you should select a different language as an alternative. If you do so, every name or description not translated to your preferred language, will be shown in the alternative language.

One of the settings that you changed was the appearance of the Welcome! window at startup. The Welcome! window has many useful links, which are all related with the tool: wiki pages, news, forum access, and more. It’s worth exploring them.

Tip

You don’t have to change the settings again to see the Welcome! window. You can open it by navigating to Help | Welcome Screen.

Storing transformations and jobs in a repository

The first time you launched Spoon, you chose not to work with repositories. After that, you configured Spoon to stop asking you for the Repository option. You must be curious about what the repository is and why we decided not to use it. Let’s explain it.

As we said, the results of working with PDI are transformations and jobs. In order to save the transformations and jobs, PDI offers two main methods:

  • Database repository: When you use the database repository method, you save jobs and transformations in a relational database specially designed for this purpose.

  • Files: The files method consists of saving jobs and transformations as regular XML files in the filesystem, with extension KJB and KTR respectively.

It’s not allowed to mix the two methods in the same project. That is, it makes no sense to mix jobs and transformations in a database repository with jobs and transformations stored in files. Therefore, you must choose the method when you start the tool.

Note

By clicking on Cancel in the repository window, you are implicitly saying that you will work with the files method.

Why did we choose not to work with repositories? Or, in other words, to work with the files method? Mainly for two reasons:

  • Working with files is more natural and practical for most users.

  • Working with a database repository requires minimal database knowledge, and that you have access to a database engine from your computer. Although it would be an advantage for you to have both preconditions, maybe you haven’t got both of them.

There is a third method called File repository, that is a mix of the two above—it’s a repository of jobs and transformations stored in the filesystem. Between the File repository and the files method, the latest is the most broadly used. Therefore, throughout this book we will use the files method. For details of working with repositories, please refer to Appendix A, Working with Repositories.

Creating your first transformation

Until now, you’ve seen the very basic elements of Spoon. You must be waiting to do some interesting task beyond looking around. It’s time to create your first transformation.