Book Image

Pentaho Data Integration Beginner's Guide - Second Edition - Second Edition

By : María Carina Roldán
Book Image

Pentaho Data Integration Beginner's Guide - Second Edition - Second Edition

By: María Carina Roldán

Overview of this book

Capturing, manipulating, cleansing, transferring, and loading data effectively are the prime requirements in every IT organization. Achieving these tasks require people devoted to developing extensive software programs, or investing in ETL or data integration tools that can simplify this work. Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. Pentaho Data Integration has an intuitive, graphical, drag-and-drop design environment and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing. "Pentaho Data Integration Beginner's Guide - Second Edition" provides the guidance needed to overcome that difficulty, covering all the possible key features of Pentaho Data Integration. "Pentaho Data Integration Beginner's Guide - Second Edition" starts with the installation of Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. Moreover, you will be introduced to data warehouse concepts and you will learn how to load data in a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity of applying and reinforcing all the learned concepts through the implementation of a simple datamart. With "Pentaho Data Integration Beginner's Guide - Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.
Table of Contents (26 chapters)
Pentaho Data Integration Beginner's Guide
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Best Practices
Index

Exploring the Pentaho Demo


Despite being out of the scope of this book, it’s worth to briefly introduce the Pentaho Demo. The Pentaho BI Platform Demo is a pre-configured installation that allows you to explore several capabilities of the Pentaho platform. It includes sample reports, cubes, and dashboards for Steel Wheels. Steel Wheels is a fictional store that sells all kind of scale replicas of vehicles. The following screenshot is a sample dashboard available in the demo:

The Pentaho BI Platform Demo is free and can be downloaded from http://sourceforge.net/projects/pentaho/files/. Under the Business Intelligence Server folder, look for the latest stable version. By the time you read the book, Pentaho 5.0 may already have arrived. At the time of writing this book, the latest stable version is 4.8.0, so the file you have to download is biserver-ce-4.8.0-stable.zip for Windows and biserver-ce-4.8.0-stable.tar.gz for other systems.

Note

You can find out more about Pentaho BI Suite Community Edition at http://community.pentaho.com/projects/bi_platform. There is also an Enterprise Edition of the platform with additional features and support. You can find more on this at www.pentaho.org.

Pentaho Data Integration

Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is not an exception—Pentaho Data Integration is the new denomination for the business intelligence tool born as Kettle.

Note

The name Kettle didn’t come from the recursive acronym Kettle Extraction, Transportation, Transformation, and Loading Environment it has now. It came from KDE Extraction, Transportation, Transformation, and Loading Environment, since the tool was planned to be written on top of KDE, a Linux desktop environment, as mentioned in the introduction of the book.

In April 2006, the Kettle project was acquired by the Pentaho Corporation and Matt Casters, the Kettle founder, also joined the Pentaho team as a Data Integration Architect.

When Pentaho announced the acquisition, James Dixon, Chief Technology Officer said:

We reviewed many alternatives for open source data integration, and Kettle clearly had the best architecture, richest functionality, and most mature user interface. The open architecture and superior technology of the Pentaho BI Platform and Kettle allowed us to deliver integration in only a few days, and make that integration available to the community.

By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project.

From that moment, the tool has grown with no pause. Every few months a new release is available, bringing to the users improvements in performance, existing functionality, new functionality, ease of use, and great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:

  • June 2006: PDI 2.3 is released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. The version included among other changes, enhancements for large-scale environments and multilingual capabilities.

  • February 2007: Almost seven months after the last major revision, PDI 2.4 is released including remote execution and clustering support, enhanced database support, and a single designer for jobs and transformations, the two main kind of elements you design in Kettle.

  • May 2007: PDI 2.5 is released including many new features; the most relevant being the advanced error handling.

  • November 2007: PDI 3.0 emerges totally redesigned. Its major library changed to gain massive performance. The look and feel had also changed completely.

  • October 2008: PDI 3.1 arrives, bringing a tool which was easier to use, and with a lot of new functionality as well.

  • April 2009: PDI 3.2 is released with a really large amount of changes for a minor version: new functionality, visualization and performance improvements, and a huge amount of bug fixes. The main change in this version was the incorporation of dynamic clustering.

  • June 2010: PDI 4.0 was released, delivering mostly improvements with regard to enterprise features, for example, version control. In the community version, the focus was on several visual improvements such as the mouseover assistance that you will experiment with soon.

  • November 2010: PDI 4.1 is released with many bug fixes.

  • August 2011: PDI 4.2 comes to light not only with a large amount of bug fixes, but also with a lot of improvements and new features. In particular, several of them were related to the work with repositories (see Appendix A, Working with Repositories for details).

  • April 2012: PDI 4.3 is released also with a lot of fixes, and a bunch of improvements and new features.

  • November 2012: PDI 4.4 is released. This version incorporates a lot of enhancements and new features. In this version there is a special emphasis on Big Data—the ability of reading, searching, and in general transforming large and complex collections of datasets.

  • 2013: PDI 5.0 will be released, delivering interesting low-level features such as step load balancing, job transactions, and restartability.

Using PDI in real-world scenarios

Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data.

In fact, PDI not only serves as a data integrator or an ETL tool. PDI is such a powerful tool, that it is common to see it used for these and for many other purposes. Here you have some examples.

Loading data warehouses or datamarts

The loading of a data warehouse or a datamart involves many steps, and there are many variants depending on business area, or business rules.

But in every case, no exception, the process involves the following steps:

  • Extracting information from one or different databases, text files, XML files and other sources. The extract process may include the task of validating and discarding data that doesn’t match expected patterns or rules.

  • Transforming the obtained data to meet the business and technical needs required on the target. Transformation implies tasks as converting data types, doing some calculations, filtering irrelevant data, and summarizing.

  • Loading the transformed data into the target database. Depending on the requirements, the loading may overwrite the existing information, or may add new information each time it is executed.

Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with Kettle:

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main ERP (Enterprise Resource Planning) application and a CRM (Customer Relationship Management) application, though they’re not connected. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data. Some conversions, validation, and transport of data have to be done. Kettle is meant to do all of those tasks.

Data cleansing

It’s important and even critical that data be correct and accurate for the efficiency of business, to generate trust conclusions in data mining or statistical studies, to succeed when integrating data. Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying if the data meets certain rules, discarding or correcting those which don’t follow the expected pattern, setting default values for missing data, eliminating information that is duplicated, normalizing data to conform minimum and maximum values, and so on. These are tasks that Kettle makes possible thanks to its vast set of transformation and validation capabilities.

Migrating information

Think of a company, any size, which uses a commercial ERP application. One day the owners realize that the licenses are consuming an important share of its budget. So they decide to migrate to an open source ERP. The company will no longer have to pay licenses, but if they want to change, they will have to migrate the information. Obviously, it is not an option to start from scratch, nor type the information by hand. Kettle makes the migration possible thanks to its ability to interact with most kind of sources and destinations such as plain files, commercial and free databases, and spreadsheets, among others.

Exporting data

Data may need to be exported for numerous reasons:

  • To create detailed business reports

  • To allow communication between different departments within the same company

  • To deliver data from your legacy systems to obey government regulations, and so on

Kettle has the power to take raw data from the source and generate these kind of ad-hoc reports.

Integrating PDI along with other Pentaho tools

The previous examples show typical uses of PDI as a standalone application. However, Kettle may be used embedded as part of a process or a dataflow. Some examples are pre-processing data for an online report, sending mails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on.

Note

The use of PDI integrated with other tools is beyond the scope of this book. If you are interested, you can find more information on this subject in the Pentaho Data Integration 4 Cookbook by Packt Publishing at http://www.packtpub.com/pentaho-data-integration-4-cookbook/book.

Pop quiz – PDI data sources

Q1. Which of the following are not valid sources in Kettle?

  1. Spreadsheets.

  2. Free database engines.

  3. Commercial database engines.

  4. Flat files.

  5. None.