Pentaho 3.2 Data Integration: Beginner's Guide

Overview of this book

Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated.

This book is full of practical examples that will help you to take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease.

Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. Not only that, you'll be given an introduction to data warehouse concepts and you will learn to load data into a data warehouse. After that, you will learn to implement simple and complex processes.

Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned throughout the book.

Pentaho Data Integration


Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is no exception: Pentaho Data Integration is the new name for the business intelligence tool born as Kettle.

Note

The name Kettle did not come from the recursive acronym it has now (Kettle Extraction, Transportation, Transformation, and Loading Environment), but from KDE Extraction, Transportation, Transformation, and Loading Environment, because the tool was originally planned to be written on top of KDE, as mentioned in the introduction of the book.

In April 2006, the Kettle project was acquired by the Pentaho Corporation, and Matt Casters, Kettle's founder, joined the Pentaho team as a Data Integration Architect.

When Pentaho announced the acquisition, James Dixon, the Chief Technology Officer, said:

We reviewed many alternatives for open source data integration, and Kettle clearly had the best architecture, richest functionality, and most mature user interface. The open architecture and superior technology of the Pentaho BI Platform and Kettle allowed us to deliver integration in only a few days, and make that integration available to the community.

By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project.

From that moment on, the tool has grown constantly. A new release becomes available every few months, bringing users improvements in performance and existing functionality, new functionality, ease of use, and significant changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:

  • June 2006: PDI 2.3 is released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. Among other changes, the version included enhancements for large scale environments and multilingual capabilities.

  • February 2007: Almost seven months after the last major revision, PDI 2.4 is released, including remote execution and clustering support (more on this in Chapter 13), enhanced database support, and a single designer for the two main elements you design in Kettle: jobs and transformations.

  • May 2007: PDI 2.5 is released, including many new features; the most notable of them is advanced error handling.

  • November 2007: PDI 3.0 emerges totally redesigned. Its underlying library was changed to deliver a massive performance gain, and the look and feel also changed completely.

  • October 2008: PDI 3.1 arrives, making the tool easier to use and adding a lot of new functionality.

  • April 2009: PDI 3.2 is released with a really large number of changes for a minor version—new functionality, visualization improvements, performance improvements, and a huge pile of bug fixes. The main change in this version was the incorporation of dynamic clustering (see Chapter 13 for details).

  • In 2010, PDI 4.0 will be released, delivering mostly improvements to enterprise features such as version control.

Note

Most users still refer to PDI as Kettle, its former name. Therefore, the names PDI, Pentaho Data Integration, and Kettle will be used interchangeably throughout the book.

Using PDI in real world scenarios

Going by its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data.

If you look at its original name, K.E.T.T.L.E., then you must conclude that it is a tool used for ETL processes which, as you may know, are most frequently found in data warehouse environments.

In fact, PDI not only serves as a data integrator or an ETL tool; it is such a powerful tool that it is common to see it used for those purposes and for many others. Here are some examples.

Loading datawarehouses or datamarts

The loading of a datawarehouse or a datamart involves many steps, and there are many variants depending on business area or business rules. However, in every case, the process involves the following steps:

  • Extracting information from one or more databases, text files, and other sources. The extraction process may include the task of validating and discarding data that doesn't match expected patterns or rules.

  • Transforming the obtained data to meet the business and technical needs required on the target. Transformation implies tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.

  • Loading the transformed data into the target database. Depending on the requirements, the loading may overwrite the existing information, or may add new information each time it is executed.

Kettle comes ready to do every stage of this loading process. The following sample screenshot shows a simple ETL designed with Kettle:
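Outside the graphical designer, the same three stages can be sketched in plain code. The following minimal Python sketch is not PDI code at all; it is only an illustration of the extract, transform, and load idea, and the file name, column names, and target table are invented for the example:

    import csv
    import sqlite3

    # Extract: read raw rows from a source file (hypothetical sales.csv).
    def extract(path):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                yield row

    # Transform: convert types, discard rows that break the rules, tidy values.
    def transform(rows):
        for row in rows:
            try:
                amount = float(row["amount"])
            except (KeyError, ValueError):
                continue  # discard data that doesn't match the expected pattern
            yield {"product": row["product"].strip().upper(), "amount": round(amount, 2)}

    # Load: write the transformed rows into the target table.
    def load(rows, db_path="warehouse.db"):
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS sales (product TEXT, amount REAL)")
        con.executemany("INSERT INTO sales (product, amount) VALUES (:product, :amount)", rows)
        con.commit()
        con.close()

    if __name__ == "__main__":
        load(transform(extract("sales.csv")))

In Kettle you design the same flow visually, dragging steps onto the canvas instead of writing code.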

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main ERP application and a CRM application, though they're not connected. These are just two of hundreds of examples where data integration is needed. Integrating data is not just a matter of gathering and mixing data; some conversion, validation, and transport of data have to be done as well. Kettle is meant to do all of those tasks.
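As a rough, tool-agnostic illustration of what that gathering and mixing involves, the following Python sketch merges customer records coming from two hypothetical systems (the field names are invented), reconciling them by e-mail address and normalizing values so they can be compared. PDI does this kind of work with its built-in steps rather than with code:

    # Records as they might arrive from two hypothetical systems.
    erp_customers = [
        {"cust_id": 1, "email": "ANA@EXAMPLE.COM", "name": "Ana Perez"},
    ]
    crm_contacts = [
        {"contact": "ana@example.com", "full_name": "Ana Perez", "phone": "555-0101"},
    ]

    def normalize_email(value):
        # Conversion step: bring both sources to a comparable form.
        return value.strip().lower()

    # Merge the two sources on the normalized e-mail key.
    unified = {}
    for row in erp_customers:
        key = normalize_email(row["email"])
        unified[key] = {"email": key, "name": row["name"], "erp_id": row["cust_id"]}
    for row in crm_contacts:
        key = normalize_email(row["contact"])
        record = unified.setdefault(key, {"email": key, "name": row["full_name"]})
        record["phone"] = row.get("phone")  # enrich with CRM-only data

    print(list(unified.values()))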

Data cleansing

Why do we need data to be correct and accurate? There are many reasons: for the efficiency of the business, to draw trusted conclusions in data mining or statistical studies, to succeed when integrating data, and so on. Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying that the data meets certain rules, discarding or correcting records that don't follow the expected pattern, setting default values for missing data, eliminating duplicated information, normalizing data to conform to minimum and maximum values, and so on. These are all tasks that Kettle makes possible thanks to its vast set of transformation and validation capabilities.
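To make those tasks concrete, here is a small Python sketch; it is not PDI code, and the rules, field names, and bounds are assumptions made up for the example. It applies the kinds of rules just described: discarding rows that break a rule, setting a default for missing data, clamping out-of-range values, and removing duplicates:

    raw = [
        {"id": "1", "country": "AR", "age": "34"},
        {"id": "2", "country": "",   "age": "250"},  # missing country, out-of-range age
        {"id": "1", "country": "AR", "age": "34"},   # duplicate of the first row
    ]

    def cleanse(rows, default_country="UNKNOWN", min_age=0, max_age=120):
        seen = set()
        for row in rows:
            if row["id"] in seen:
                continue  # eliminate duplicated information
            seen.add(row["id"])
            country = row["country"] or default_country  # default value for missing data
            try:
                age = int(row["age"])
            except ValueError:
                continue  # discard data that doesn't follow the expected pattern
            age = max(min_age, min(age, max_age))  # normalize to minimum and maximum values
            yield {"id": row["id"], "country": country, "age": age}

    print(list(cleanse(raw)))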

Migrating information

Think of a company of any size that uses a commercial ERP application. One day the owners realize that the licences are consuming an important share of the budget, so they decide to migrate to an open source ERP. The company will no longer have to pay for licences, but to make the change they will have to migrate the information. Obviously, it is not an option to start from scratch or to type the information in by hand. Kettle makes the migration possible thanks to its ability to interact with most kinds of sources and destinations, such as plain files, commercial and free databases, and spreadsheets.

Exporting data

Sometimes you are forced by government regulations to export certain data to be processed by legacy systems. You can't just print and deliver some reports containing the required data. The data has to follow a rigid format, with columns that have to obey certain rules (size, format, content) and different records for the heading and the tail, just to name some common demands. Kettle has the power to take raw data from the source and generate these kinds of ad hoc files.
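To picture what such a rigid format looks like, the following Python sketch writes a fixed-width file with a heading record, detail records, and a tail record carrying a row count. The layout (field widths, record markers, amounts expressed in cents) is an invented example of this kind of requirement, not a real regulation, and in PDI you would produce the same output with file output steps rather than with code:

    # Hypothetical fixed-width layout: 1-char record type, 10-char code,
    # 8-char date, 12-char zero-padded amount in cents; 80-char records.
    rows = [
        {"code": "A001", "date": "20091031", "amount": 1250.50},
        {"code": "B017", "date": "20091031", "amount": 89.99},
    ]

    def export(rows, path="export.txt"):
        with open(path, "w") as out:
            out.write(("H" + "20091101").ljust(80) + "\n")  # heading record with file date
            for r in rows:
                line = (
                    "D"
                    + r["code"].ljust(10)
                    + r["date"].ljust(8)
                    + f"{int(round(r['amount'] * 100)):012d}"  # amount in cents, zero-padded
                )
                out.write(line.ljust(80) + "\n")
            out.write(("T" + str(len(rows)).zfill(9)).ljust(80) + "\n")  # tail record with row count

    export(rows)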

Integrating PDI using Pentaho BI

The previous examples show typical uses of PDI as a standalone application. However, Kettle may also be used as part of a process inside the Pentaho BI Platform. There are many things that Kettle can do when embedded in the Pentaho application: preprocessing data for an online report, sending emails on a schedule, or generating spreadsheet reports.

Note

You'll find more on this in Chapter 13. However, the use of PDI integrated with the BI Suite is beyond the scope of this book.

Pop quiz – PDI data sources

Which of the following aren't valid sources in Kettle:

  1. Spreadsheets

  2. Free database engines

  3. Commercial database engines

  4. Flat files

  5. None of the above