Pentaho 3.2 Data Integration: Beginner's Guide

Overview of this book

Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated.

This book is full of practical examples that will help you to take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease.

Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files, and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. Not only that, you'll be given an introduction to data warehouse concepts and you will learn to load data in a data warehouse. After that, you will learn to implement simple and complex processes.

Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned through the book.

Appendix A. Working with Repositories

Spoon allows you to store your transformations and jobs in two different ways: file based and repository based. The file-based configuration keeps transformations and jobs as XML files (*.ktr and *.kjb, respectively) in the local file system, whereas the repository-based configuration keeps the same information in tables in a relational database.

While working with the file-based system is simple and practical, the repository-based system can be convenient in some situations. The following is a list of some of the distinctive repository features:

  • Repositories implement security. In order to work with a repository, you need credentials. You can create users and profiles with different permissions on the repository; however, keep in mind that the kind of permissions you may apply is limited.

  • Repositories are prepared for basic team development. The elements you create (transformations, jobs, database connections, and so on) are shared by all repository users as soon as you create them.

  • If you want to use PDI as the input source in dashboards made with the CDF (refer to Chapter 13 for details), you have to work with repositories; there is no file-based alternative.

  • PDI 4, in its Enterprise version, will include a lot of new repository features such as version control.

Before you decide to work with a repository, be aware of the benefits of the file-based system that you would give up. Here are some examples:

  • When working with the repository-based system, you need access to the repository database. If, for some reason, you cannot access the database (due to a network problem or any other issue), you will not be able to work. You don't have this restriction when working with files—you need only the software and the .ktr/.kjb files.

  • When working with repositories, it is difficult to keep track of changes. When you work with the file system, on the other hand, it's easier to see which jobs or transformations have been modified. If you keep the files in Subversion, you even have version control.

  • Suppose you want to search and replace some text in all jobs and transformations. If you are working with repositories, you would have to do it for each table in the repository database. When working with the file-based system, this task is quite simple—you could create an Eclipse project, load the root directory of your jobs and transformations, and do the task by using the Eclipse utilities.
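
As an alternative to the Eclipse approach in the last item, the same file-based search and replace can be scripted. The following is a minimal sketch in Python, not part of PDI: the function name replace_in_pdi_files and its parameters are hypothetical, the files are assumed to be UTF-8 encoded, and the replacement is a plain text substitution that knows nothing about the XML structure of the files. Back up the directory (or commit it to version control) before running anything like this.

```python
from pathlib import Path

# File extensions used by PDI for transformations and jobs.
PDI_SUFFIXES = {".ktr", ".kjb"}


def replace_in_pdi_files(root, old, new, encoding="utf-8"):
    """Replace `old` with `new` in every .ktr/.kjb file under `root`.

    Hypothetical helper: plain text substitution, assumes `encoding`
    matches the files. Returns the list of paths that were modified.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in PDI_SUFFIXES:
            continue
        # newline="" keeps the original line endings untouched.
        with open(path, encoding=encoding, newline="") as handle:
            text = handle.read()
        if old not in text:
            continue
        with open(path, "w", encoding=encoding, newline="") as handle:
            handle.write(text.replace(old, new))
        changed.append(path)
    return changed
```

For example, calling replace_in_pdi_files("path/to/etl", "old_host", "new_host") (a made-up directory and made-up strings) rewrites every matching job and transformation below that directory and returns the files it touched, which also gives you a list of what to review afterwards.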

This appendix explains how to create a repository and how to work with it. You can give repositories a try and decide for yourself which method, repository-based or file-based, suits you best.