Pentaho Data Integration Beginner's Guide - Second Edition

By: María Carina Roldán

Overview of this book

Capturing, manipulating, cleansing, transferring, and loading data effectively are prime requirements in every IT organization. Achieving these tasks requires either people devoted to developing extensive software programs, or an investment in ETL or data integration tools that can simplify this work. Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. It has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing. "Pentaho Data Integration Beginner's Guide - Second Edition" provides the guidance needed to overcome that difficulty, covering the key features of Pentaho Data Integration. The book starts with the installation of the Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. Moreover, you will be introduced to data warehouse concepts and you will learn how to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity to apply and reinforce the concepts you have learned by implementing a simple datamart. With "Pentaho Data Integration Beginner's Guide - Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.

Appendix F. Best Practices

This appendix gives you some advice to take into account in your daily work with PDI. If you intend to work seriously with PDI, knowing how to accomplish different tasks is not enough.

Here are some guidelines that will help you head in the right direction:

  • Outline your ideas on paper before creating a transformation or a job. Don't drop steps randomly on the canvas trying to get things to work; otherwise, you will end up with a transformation or a job that is difficult to understand and might not be of any use.

  • Document your work. Write at least a simple description in the transformation and job settings windows. Replace the default names of the steps and the job entries with meaningful ones. Use notes to clarify the purpose of the transformations and the jobs. Color-code your notes for a better effect; for example, use one color for notes explaining the purpose of a transformation, and a different color or font for technical notes. By doing this, your work will be well documented.

  • Make your jobs and transformations easy to understand. Arrange the elements on the canvas so that it does not look like a puzzle to solve. Memorize the shortcuts for arrangement and alignment and use them regularly. You will find a full list in Appendix D, Spoon Shortcuts.

  • Organize the PDI elements in folders. Don't save all of the transformations and jobs in the same folder. Organize them according to the purpose they have.

  • Make your work flexible and reusable. Make use of arguments, variables, and named parameters. If you identify tasks that are going to be used in several situations, create subtransformations.

  • Make your work portable (ready for deployment). Do whatever you can so that even if you move your work to another machine or another folder, or the path to the source or destination files changes, or the connection properties of the databases change, everything keeps working with minimal or no changes. In order to do that, use variables instead of fixed names. If you know the values for the variables beforehand, define the variables in the kettle.properties file. For the names of the transformations and jobs, use relative paths built on the ${Internal.Job.Filename.Directory} and ${Internal.Transformation.Filename.Directory} variables.

  • Avoid overloading your transformations. A transformation should do a precise task. If it doesn't, think of splitting it into two or more, or create subtransformations. By doing so, you will make your transformation clearer and, in the case of subtransformations, also reusable.

  • Handle errors. Try to figure out the kinds of errors that may occur and trap them by validating, handling errors, and acting accordingly: fixing data, taking alternative paths, sending friendly messages to the log files, and so on.

  • Do everything you can to optimize the PDI performance. You can find a full checklist at http://wiki.pentaho.com/display/COM/PDI+Performance+tuning+check-list.

    For tracking the performance of individual steps in a transformation, you can look up the details at http://wiki.pentaho.com/display/EAI/Step+performance+monitoring.

  • Keep track of the history of your jobs and transformations. You can use a versioning system, such as Subversion or Git. In doing so, you can recover older versions of your jobs and transformations or examine the history of how they changed. For more on Subversion, visit the site http://subversion.tigris.org/. For more on Git, visit the official site http://git-scm.com/. Also, consider upgrading to the Enterprise Edition (EE), where versioning is a repository feature.

  • Bookmark the forum page and visit it frequently. The PDI forum is available at http://forums.pentaho.org/forumdisplay.php?f=135. If you are stuck with something, search for a solution in the forum. If you don't find what you're looking for, create a new thread and describe your question or scenario clearly; you'll get a prompt answer, as the Pentaho community, and particularly the PDI one, is quite active. Alternatively, you can meet Pentaho people on the IRC server www.freenode.net, channel #pentaho. On the channel, people discuss all kinds of issues related to all the Pentaho tools, and not just Kettle.
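
To make the portability guideline above more concrete, here is a minimal Python sketch of the idea behind defining variables in kettle.properties and referring to them as ${NAME} instead of hard-coding values. It is not PDI's own implementation, only an illustration of the substitution concept. The variable names (INPUT_DIR, OUTPUT_DIR, DB_HOST), the paths, and the helper functions parse_properties and expand are all hypothetical.

```python
import re

# Hypothetical contents of a kettle.properties file (key = value lines).
# The variable names and values are made up for illustration.
KETTLE_PROPERTIES = """
# locations that differ from one machine to another
INPUT_DIR = /data/incoming
OUTPUT_DIR = /data/processed
DB_HOST = localhost
"""


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and # comments."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        variables[key.strip()] = value.strip()
    return variables


def expand(template, variables):
    """Replace each ${NAME} with its value; leave unknown names untouched."""
    def lookup(match):
        return variables.get(match.group(1), match.group(0))
    return re.sub(r"\$\{([^}]+)\}", lookup, template)


variables = parse_properties(KETTLE_PROPERTIES)

# The same definitions work on any machine; only the properties file changes.
print(expand("${INPUT_DIR}/sales.csv", variables))
print(expand("jdbc:mysql://${DB_HOST}/sales", variables))
```

With this approach, moving the work to another machine only requires editing the values in the properties file; the file names and connection settings that reference the variables stay the same.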