Pentaho 3.2 Data Integration: Beginner's Guide

Overview of this book

Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated.

This book is full of practical examples that will help you to take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease.

Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files, and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. Not only that, you'll be given an introduction to data warehouse concepts and you will learn to load data in a data warehouse. After that, you will learn to implement simple and complex processes.

Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned through the book.

Time for action – creating a PDI repository


To create a repository, follow these steps:

  1. Open MySQL Command Line Client.

  2. In the command window, type the following:

    CREATE DATABASE PDI_REPO;
  3. Open Spoon.

  4. If the repository dialog appears, skip to step 6.

  5. Open the repository dialog from the Repository | Connect to repository menu.

  6. Click on New to create a new repository. The repository information dialog shows up. Click on New to create a new database connection.

  7. The database connection window appears. Define a connection to the database you have just created and name the connection PDI_REPO_CONN.

    Tip

    If you want to review the steps for creating a database connection, see the Time for action – creating a connection to the Steel Wheels database section in Chapter 8.

  8. Test the connection to see that it is properly configured.

  9. Click OK to close the database connection window. The Select database connection box will show the created connection.

  10. Give the name MY_REPO to the repository. As description, type My first repository.

  11. Click on Create or Upgrade.

  12. PDI will ask you if you are sure you want to create the repository on the specified database connection. Answer Yes if you are sure of the settings you entered.

  13. A dialog appears asking if you want to do a dry run to evaluate the generated SQL before execution.

  14. Answer No unless you want to preview the SQL that will create the repository. A progress window appears while the repository is being created.

  15. Finally, you see a window with the message Kettle created the repository on the specified connection. Close the dialog window.

  16. Click on OK to close the repository information window. You will be back in the repository dialog, this time with a new repository available in the repository drop-down list.

  17. If you want to start working with the created repository, please refer to the Working with the repository storage system section. If not, click on No Repository. This will close the window.

What just happened?

In MySQL you created a new database named PDI_REPO. Then you used that database to create a PDI repository.

Creating repositories to store your transformations and jobs

A Kettle repository is a database that provides you with a storage system for your transformations and jobs. The repository is the alternative to the *.ktr and *.kjb file-based system.

In order to create a new repository, a database must have been created previously. In the tutorial, the repository was created in a MySQL RDBMS. However, you can create your repositories in any relational database.

Note

Use the PDI repository database exclusively for the repository. Do not store any other data in it!
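One way to help keep the repository database dedicated to PDI is to give it its own MySQL account. The following is only a minimal sketch, assuming the MySQL server used in the tutorial. The account name pdi_repo_user and the password are hypothetical placeholders, not values from the tutorial.

```sql
-- Hypothetical account name and placeholder password; replace both.
CREATE USER 'pdi_repo_user'@'localhost' IDENTIFIED BY 'change_me';

-- Grant the account privileges on the repository database only,
-- so it cannot be used to read or write any other database.
GRANT ALL PRIVILEGES ON PDI_REPO.* TO 'pdi_repo_user'@'localhost';
```

You could then use this account, instead of an administrative one, when defining the connection to the repository database.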

If the repository has already been created from another machine or by another user (that is, under another operating system profile), you don't have to create it again. In that case, just define the connection to the repository: follow all the instructions, but don't click the Create or Upgrade button.

Once you have created a repository, its name, description, and connection information are stored in a file named repositories.xml, which is located in the PDI home directory. The repository database is populated with a bunch of tables with familiar names such as transformation, job, steps, and steps_type.
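If you want to see the repository tables for yourself, you can list them from the MySQL Command Line Client. This is a minimal sketch that assumes the PDI_REPO database from the tutorial. The exact table names depend on your PDI version, so none are assumed here.

```sql
-- Switch to the database that holds the repository.
USE PDI_REPO;

-- List every table PDI created when you clicked Create or Upgrade.
SHOW TABLES;
```

An empty result would suggest that the repository was not created on this database, for example because the connection pointed somewhere else.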

Note that you may have more than one repository: different repositories for different projects, different repositories for different versions of a project, a repository just for testing new PDI features, another for serious development, and so on. Therefore, it is important to give your repositories meaningful names and descriptions so that you can tell them apart.