Pentaho Data Integration Beginner's Guide - Second Edition

By: María Carina Roldán

Overview of this book

Capturing, manipulating, cleansing, transferring, and loading data effectively are prime requirements in every IT organization. Achieving these tasks requires either people devoted to developing extensive software programs or an investment in ETL or data integration tools that simplify the work. Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. It has an intuitive, graphical, drag-and-drop design environment and powerful ETL capabilities. However, getting started with Pentaho Data Integration can be difficult or confusing. "Pentaho Data Integration Beginner's Guide - Second Edition" provides the guidance needed to overcome that difficulty, covering the key features of Pentaho Data Integration.

The book starts with the installation of the Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to get involved with the tool gradually. First, you will learn to do all kinds of data manipulation and to work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. You will also be introduced to data warehouse concepts and learn how to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity to apply and reinforce everything you have learned by implementing a simple datamart. With "Pentaho Data Integration Beginner's Guide - Second Edition", you will learn what you need to know in order to meet your data manipulation requirements.

Time for action – creating a PDI repository


To create a repository, follow these steps:

  1. Open the MySQL command-line client.

  2. In the command window, type the following command:

    CREATE DATABASE PDI_REPO;

  3. Open Spoon.

  4. If the repository dialog does not appear automatically, open it from the Tools | Repository | Connect... menu.

  5. Click on the plus icon to create a new repository. A window with two options appears. Select the Kettle database repository option, as shown in the following screenshot.

  6. The Repository information dialog shows up. Click on New to create a new database connection.

  7. The database connection window appears. Define a connection to the database you have just created and give the connection the name PDI_REPO_CONN.

    Tip

    To create the database connection, refer to the Time for action – creating a connection to the Steel Wheels database section in Chapter 8, Working with Databases.

  8. Test the connection to see that it is properly configured.

  9. Click on OK to close the database connection window. The Select Database Connection box will show the created connection.

  10. Give the repository an ID and a Name, for example, kettle_repo and My First Repo.

  11. Click on Create or Upgrade.

  12. PDI will ask you if you are sure you want to create the repository on the specified database connection. Answer Yes, provided you are sure of the settings you entered.

  13. A dialog appears asking if you want to do a dry run to evaluate the generated SQL before execution. Answer No, unless you want to preview the SQL that will create the repository.

  14. A progress window appears while the repository is being created.

  15. Finally, you see a window with the message Kettle created the repository on the specified connection. Close the dialog window.

  16. Click on OK to close the Repository information window. You will be back in the repository dialog, this time with a new repository available in the repository list.

  17. If you want to start working with the created repository, refer to the Working with the repository storage system section. If not, click on Cancel. This will close the window.

What just happened?

In MySQL, you created a new database named PDI_REPO. Then, you used that database to create a PDI database repository.
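
If the repository is meant to be more than a quick experiment, it can help to give it its own MySQL account instead of connecting as an administrative user. The following is only a sketch under stated assumptions: the account name pdi_repo_user, the host localhost, and the password change_me are hypothetical placeholders that do not come from the steps above, and the statements assume a MySQL session that is allowed to create users.

```sql
-- Create the repository database (does nothing if it already exists).
CREATE DATABASE IF NOT EXISTS PDI_REPO;

-- Hypothetical dedicated account; pick your own name and password.
CREATE USER 'pdi_repo_user'@'localhost' IDENTIFIED BY 'change_me';

-- Allow that account to create and maintain tables in PDI_REPO only.
GRANT ALL PRIVILEGES ON PDI_REPO.* TO 'pdi_repo_user'@'localhost';
```

With a setup like this, the user name and password of the dedicated account would be the ones you enter when defining the PDI_REPO_CONN connection.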

Creating a database repository to store your transformations and jobs

A Kettle database repository is a database that provides a storage system for your transformations and jobs. The repository is the alternative to the file-based system, in which transformations and jobs are saved as *.ktr and *.kjb files.

Before you can create a new database repository, the database that will hold it must already exist. In the preceding section, the repository was created in a MySQL RDBMS. However, you can create your repositories in any JDBC-compliant RDBMS.

Note

The PDI repository database should be used exclusively for storing the repository. Don't use it for any other data.

Note that if the repository has already been created from another machine or by another user (that is, under another profile in the operating system), you don't have to create it again. In that case, just define the connection to the repository: follow all the instructions, but don't click on the Create or Upgrade button.

Once you have created a repository, its name, description, and connection information are stored in a file named repositories.xml, located in the PDI home directory. The repository database is populated with a set of tables with familiar names such as transformation, job, steps, and steps_type.
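
To see what PDI actually generated, you can list the tables from the MySQL command-line client. This is a minimal sketch that assumes the database is named PDI_REPO, as in the steps above. The exact set of tables and their names may differ between PDI versions, so no particular output is shown here.

```sql
-- Switch to the repository database created earlier.
USE PDI_REPO;

-- List every table that PDI generated for the repository.
SHOW TABLES;

-- Count the generated tables; the number varies between PDI versions.
SELECT COUNT(*) AS repository_tables
FROM information_schema.tables
WHERE table_schema = 'PDI_REPO';
```

If SHOW TABLES returns an empty set, the repository was most likely not created on this database, and the settings of the connection used in the Repository information dialog are worth checking.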

Note that you may have more than one repository: different repositories for different projects, different repositories for different versions of a project, one repository just for testing new PDI features and another for serious development, and so on. It is therefore important to give your repositories meaningful names and descriptions so that you can tell them apart.