
Time for action – logging into a database repository


To log into an existing database repository, follow these steps:

  1. Launch Spoon.

  2. If the repository dialog window doesn't show up, select Tools | Repository | Connect... from the main menu. The repository dialog window appears.

  3. From the list, select the repository you want to log in to.

  4. Type your username and password. If you didn't create any users, use the default credentials, admin/admin, and click on OK.

  5. You are now logged into the repository; you will see the name of the repository in the upper-left corner of Spoon.

What just happened?

In the preceding section, you opened Spoon and logged into a database repository.

Logging into a database repository using credentials

If you want to use the repository storage system, you have to log in to the repository before starting work. In order to do that, you choose the repository and provide a repository username and password.

The repository dialog that allows you to log into a repository can be opened from the main Spoon menu. If you intend to log into the repository often, it is more convenient to select Tools | Options... and check the general option Show repository dialog at startup?, which causes the repository dialog to show up every time you launch Spoon.

It is also possible to log into the repository automatically. Suppose you have a repository named MY_REPO and you use the default user. By adding the following lines to the kettle.properties file, the next time you launch Spoon you will be logged into the repository without being prompted:

  KETTLE_REPOSITORY=MY_REPO
  KETTLE_USER=admin
  KETTLE_PASSWORD=admin

Tip

For details on the kettle.properties file, refer to the Kettle variables section in Chapter 3, Manipulating Real-world Data.

As a final note, take into account that with this approach the login information is exposed in plain text; as such, autologin is not recommended.
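The same credentials can also be supplied to the command-line tools. As a minimal sketch, the following Kitchen invocation runs a job stored in the MY_REPO repository used above; my_job is a hypothetical job name, and note that this exposes the password on the command line just as kettle.properties does:

  # Run the job my_job, stored in the root folder of the MY_REPO
  # repository, logging in with the default credentials
  sh kitchen.sh -rep=MY_REPO -user=admin -pass=admin -dir=/ -job=my_job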

Creating transformations and jobs in repository folders

In a repository, the jobs and transformations are organized in folders. A folder in a repository fulfills the same purpose as a folder on your drive: it allows you to keep your work organized. Once you create a folder, you can save both transformations and jobs in it.

While connected to a repository, you can design, preview, and run jobs and transformations just as you do with files. However, there are some differences when it comes to opening, creating, or saving your work. So let's summarize how you do those tasks when logged into a repository:

  • Open a transformation/job: Select File | Open. The repository explorer shows up. Navigate the repository until you find the transformation or job you want to open, then double-click on it.

  • Create a folder: Select Tools | Repository | Explore..., expand the transformation or job tree, locate the parent folder, right-click on it, and create the folder.

  • Create a transformation: Select File | New | Transformation or press Ctrl + N.

  • Create a job: Select File | New | Job or press Ctrl + Alt + N.

  • Save a transformation: Press Ctrl + T. Give the transformation a name. In the Directory textbox, select the folder where the transformation is going to be saved. Press Ctrl + S. The transformation will be saved in the selected directory under the given name.

  • Save a job: Press Ctrl + J. Give the job a name. In the Directory textbox, select the folder where the job is going to be saved. Press Ctrl + S. The job will be saved in the selected directory under the given name.

Creating database connections, users, servers, partitions, and clusters

Besides jobs and transformations, there are some additional PDI elements that you can define:

  • Connections: This element defines connections to relational databases. These are covered in Chapter 8, Working with Databases.

  • Security: This element defines the repository users, whose credentials are needed to log into the repository. There are two predefined users: admin and guest.

You can also define some elements not covered in this book, but worth mentioning:

  • Slaves: Slave servers are servers installed on remote machines to execute jobs and transformations remotely.

  • Partitions: Partitioning is a mechanism by which you can send individual rows to different copies of the same step, for example, based on a field value.

  • Clusters: A cluster is a group of slave servers that collectively execute a job or a transformation.

All the elements can also be created, modified, and deleted from the repository explorer. Once you create any of these elements, they are automatically shared by all repository users.
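Regarding the slave servers just mentioned, a remote machine typically becomes a slave server by running the Carte utility that ships with PDI. As a minimal sketch, with an arbitrary example host and port:

  # Start a Carte slave server listening on localhost, port 8181
  sh carte.sh localhost 8181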

Designing jobs and transformations

You shouldn't have any difficulty designing jobs and transformations under the database repository-based method. The way you work is exactly the same as the way you learned throughout the chapters while working under the file-based method. The main difference, besides the way you open and save your work, is the way you refer to other jobs and transformations. Here is a list of the main situations:

  • Providing the name of the job to run in a Job job entry:
      File-based method: You provide the full path of the .kjb file or a relative path, optionally using the ${Internal.Job.Filename.Directory} variable.
      Database repository-based method: You fill in the Repository: specify by name option. You may optionally use a relative location with the ${Internal.Job.Repository.Directory} variable.

  • Providing the name of the transformation to run in a Transformation job entry:
      File-based method: You provide the full path of the .ktr file or a relative path, optionally using the ${Internal.Job.Filename.Directory} variable.
      Database repository-based method: You fill in the Specify by name and directory option. You may optionally use a relative location with the ${Internal.Job.Repository.Directory} variable.

  • Providing the name of the transformation to run as a subtransformation using a Mapping step:
      File-based method: You provide the full path of the .ktr file or a relative path, optionally using the ${Internal.Transformation.Filename.Directory} variable.
      Database repository-based method: You fill in the Use a mapping transformation from the repository option. You may optionally use a relative location with the ${Internal.Transformation.Repository.Directory} variable.

  • Providing the name of a file whose location is relative to the location of a job or a transformation:
      File-based method: You provide the relative path using ${Internal.Job.Filename.Directory} in a job or ${Internal.Transformation.Filename.Directory} in a transformation.
      Database repository-based method: You can create a folder tree in the repository just as you do in the filesystem; however, you are not able to specify relative paths to external files in this way.
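To make the Transformation job entry case concrete, here is a sketch of the kind of reference you might type under each method; load_sales.ktr is a hypothetical transformation name, and the exact dialog fields are the ones described above:

  # File-based method: path entered as the transformation filename
  ${Internal.Job.Filename.Directory}/load_sales.ktr

  # Database repository-based method: relative repository location
  ${Internal.Job.Repository.Directory}/load_sales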

Backing up and restoring a repository

You can regularly back up your database repository in the same way as you would do with any other database, that is, by using the utilities provided by the RDBMS, for example, mysqldump in MySQL; a sketch of this follows. PDI also offers you a method of its own for creating a backup in an XML file, described in the list below.
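As a rough illustration of the RDBMS approach, assuming the repository is stored in a MySQL database named pdi_repo (a hypothetical name), the backup and restore could look like this:

  # Dump the entire repository database to a file
  mysqldump -u root -p pdi_repo > pdi_repo_backup.sql

  # Restore the dump later, into an empty database
  mysql -u root -p pdi_repo < pdi_repo_backup.sql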

  • You create a backup from the Tools | Repository | Export Repository... option. You will be asked for the name and location of the XML file that will contain the backup data. To back up a single folder, open the repository explorer and right-click on the name of the folder.

  • You restore a backup from the Tools | Repository | Import Repository... option. You will be asked for the name and location of the XML file that contains the backup.

In both the export and the import operations, Kettle offers you the possibility of applying some rules to make sure that the transformations and jobs adhere to given standards.

As an example, suppose that you want to make sure that there are no disabled hops in your jobs. One of the rules that Kettle offers performs exactly this verification; its name is Job has no disabled hops. The list of available rules is quite limited, but still useful for meeting some basic requirements.