Pentaho 3.2 Data Integration: Beginner's Guide

Overview of this book

Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated.

This book is full of practical examples that will help you to take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease.

Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files, and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. Not only that, you'll be given an introduction to data warehouse concepts and you will learn to load data in a data warehouse. After that, you will learn to implement simple and complex processes.

Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned through the book.

Pentaho 3.2 Data Integration Beginner's Guide

Credits

Foreword

The Kettle Project

About the Author

About the Reviewers

Preface

Getting Started with Pentaho Data Integration

Pentaho Data Integration and Pentaho BI Suite

Pentaho Data Integration

Installing PDI

Time for action – installing PDI

Launching the PDI graphical designer: Spoon

Time for action – starting and customizing Spoon

Time for action – creating a hello world transformation

Time for action – running and previewing the hello_world transformation

Installing MySQL

Time for action – installing MySQL on Windows

Time for action – installing MySQL on Ubuntu

Summary

Getting Started with Transformations

Reading data from files

Time for action – reading results of football matches from files

Time for action – reading all your files at a time using a single Text file input step

Time for action – reading all your files at a time using a single Text file input step and regular expressions

Sending data to files

Time for action – sending the results of matches to a plain file

Getting system information

Time for action – updating a file with news about examinations

Time for action – running the examination transformation from a terminal window

XML files

Time for action – getting data from an XML file with information about countries

Summary

Basic Data Manipulation

Basic calculations

Time for action – reviewing examinations by using the Calculator step

Time for action – reviewing examinations by using the Formula step

Calculations on groups of rows

Time for action – calculating World Cup statistics by grouping data

Filtering

Time for action – counting frequent words by filtering

Looking up data

Time for action – finding out which language people speak

Summary

Controlling the Flow of Data

Splitting streams

Time for action – browsing new PDI features by copying a dataset

Time for action – assigning tasks by distributing

Splitting the stream based on conditions

Time for action – assigning tasks by filtering priorities with the Filter rows step

Time for action – assigning tasks by filtering priorities with the Switch/Case step

Merging streams

Time for action – gathering progress and merging all together

Time for action – giving priority to Bouchard by using Append Stream

Summary

Transforming Your Data with JavaScript Code and the JavaScript Step

Doing simple tasks with the JavaScript step

Time for action – calculating scores with JavaScript

Time for action – testing the calculation of averages

Enriching the code

Time for action – calculating flexible scores by using variables

Reading and parsing unstructured files

Time for action – changing a list of house descriptions with JavaScript

Avoiding coding by using purpose-built steps

Summary

Transforming the Row Set

Converting rows to columns

Time for action – enhancing a films file by converting rows to columns

Time for action – calculating total scores by performances by country

Normalizing data

Time for action – enhancing the matches file by normalizing the dataset

Generating a custom time dimension dataset by using Kettle variables

Time for action – creating the time dimension dataset

Time for action – getting variables for setting the default starting date

Summary

Validating Data and Handling Errors

Capturing errors

Time for action – capturing errors while calculating the age of a film

Time for action – aborting when there are too many errors

Time for action – treating errors that may appear

Avoiding unexpected errors by validating data

Time for action – validating genres with a Regex Evaluation step

Time for action – checking films file with the Data Validator

Summary

Working with Databases

Introducing the Steel Wheels sample database

Time for action – creating a connection with the Steel Wheels database

Time for action – exploring the sample database

Querying a database

Time for action – getting data about shipped orders

Time for action – getting orders in a range of dates by using parameters

Time for action – getting orders in a range of dates by using variables

Sending data to a database

Time for action – loading a table with a list of manufacturers

Time for action – inserting new products or updating existent ones

Time for action – testing the update of existing products

Eliminating data from a database

Time for action – deleting data about discontinued items

Summary

Performing Advanced Operations with Databases

Preparing the environment

Time for action – populating the Jigsaw database

Looking up data in a database

Time for action – using a Database lookup step to create a list of products to buy

Time for action – using a Database join step to create a list of suggested products to buy

Introducing dimensional modeling

Loading dimensions with data

Time for action – loading a region dimension with a Combination lookup/update step

Time for action – testing the transformation that loads the region dimension

Time for action – keeping a history of product changes with the Dimension lookup/update step

Time for action – testing the transformation that keeps a history of product changes

Summary

Creating Basic Task Flows

Introducing PDI jobs

Time for action – creating a simple hello world job

Receiving arguments and parameters in a job

Time for action – customizing the hello world file with arguments and parameters

Running jobs from a terminal window

Time for action – executing the hello world job from a terminal window

Using named parameters and command-line arguments in transformations

Time for action – calling the hello world transformation with fixed arguments and parameters

Deciding between the use of a command-line argument and a named parameter

Running job entries under conditions

Time for action – sending a sales report and warning the administrator if something is wrong

Summary

Creating Advanced Transformations and Jobs

Enhancing your processes with the use of variables

Time for action – updating a file with news about examinations by setting a variable with the name of the file

Enhancing the design of your processes

Time for action – generating files with top scores

Time for action – calculating the top scores with a subtransformation

Time for action – splitting the generation of top scores by copying and getting rows

Time for action – generating the files with top scores by nesting jobs

Iterating jobs and transformations

Time for action – generating custom files by executing a transformation for every input row

Summary

Developing and Implementing a Simple Datamart

Exploring the sales datamart

Loading the dimensions

Time for action – loading dimensions for the sales datamart

Extending the sales datamart model

Loading a fact table with aggregated data

Time for action – loading the sales fact table by looking up dimensions

Getting facts and dimensions together

Time for action – loading the fact table using a range of dates obtained from the command line

Time for action – loading the sales star

Getting rid of administrative tasks

Time for action – automating the loading of the sales datamart

Summary

Taking it Further

PDI best practices

Getting the most out of PDI

Integrating PDI and the Pentaho BI suite

PDI Enterprise Edition and Kettle Developer Support

Summary

Working with Repositories

Creating a repository

Time for action – creating a PDI repository

Working with the repository storage system

Time for action – logging into a repository

Examining and modifying the contents of a repository with the Repository explorer

Migrating from a file-based system to a repository-based system and vice-versa

Summary

Pan and Kitchen: Launching Transformations and Jobs from the Command Line

Running transformations and jobs stored in files

Running transformations and jobs from a repository

Checking the exit code

Providing options when running Pan and Kitchen

Quick Reference: Steps and Job Entries

Transformation steps

Job entries

Spoon Shortcuts

General shortcuts

Designing transformations and jobs

Grids

Repositories

Introducing PDI 4 Features

Agile BI

Visual improvements for designing transformations and jobs

Time for action – creating a hop with the mouse-over assistance

Enterprise features

Summary

Pop Quiz Answers

Index

The Kettle Project

Whether there is a migration to do, an ETL process to run, or a need for massively loading data into a database, you have several software tools, ranging from expensive and sophisticated to free open source and friendly ones, which help you accomplish the task.

Ten years ago, the scenario was clearly different. By 2000, Matt Casters, a Belgian business intelligence consultant, had been working for a while as a data warehouse architect and administrator. As such, he was one of quite a number of people who, whether the company they worked for was big or small, had to deal with the difficulties of bridging the gap between information technology and business needs. What made it even worse at that time was that ETL tools were prohibitively expensive and everything had to be crafted by hand. The last employer he worked for didn't think that writing a new ETL tool would be a good idea. This was one of the motivations for Matt to become an independent contractor and to start his own company. That was in June 2001.

At the end of that year, he told his wife that he was going to write a new piece of software for himself to do ETL tasks. It was going to take up some time left and right in the evenings and weekends. Surprised, she asked how long it would take him to get it done. He replied that it would probably take five years and that he perhaps would have something working in three.

Work on it started in early 2003. Matt's main goals for writing the software included learning about databases, ETL processes, and data warehousing. This would in turn improve his chances on a job market that was pretty volatile. Ultimately, it would allow him to work full time on the software.

Another important goal was to understand what the tool had to do. Matt wanted a scalable and parallel tool, and wanted to isolate rows of data as much as possible.

The last, but not least, goal was to pick the right technology that would support the tool. The first idea was to build it on top of KDE, the popular Unix desktop environment. Trolltech, the people behind Qt, the core UI library of KDE, had released plans to create drivers for popular databases. However, the lack of decent drivers for those databases drove Matt to change plans and use Java. He picked Java because he had some prior experience with it, having written a Japanese Chess (Shogi) database program when Java 1.0 was released. To Sun's credit, this software still runs and is available at http://ibridge.be/shogi/.

After a year of development, the tool was capable of reading text files, reading from databases, and writing to databases, and it was very flexible. The experience with Java was not 100% positive though. The code had grown unstructured, crashes occurred all too often, and it was hard to get something going with the Java graphic library used at that moment, the Abstract Window Toolkit (AWT); it looked bad and it was slow.

As for the library, Matt decided to start using the newly released Standard Widget Toolkit (SWT), which helped solve part of the problem. As for the rest, Kettle was a complete mess. It was time to ask for help. Help came from Wim De Clercq, a senior enterprise Java architect, co-owner of Ixor (www.ixor.be), and also a friend of Matt. At various intervals over the next few years, Wim involved himself in the project, giving Matt advice about good practices in Java programming. Following that advice meant making massive amounts of code changes. As a consequence, it was not unusual to spend weekends doing nothing but refactoring code and fixing the thousands of errors that resulted. But, bit by bit, things kept going in the right direction.

At that same time, Matt also showed the results to his peers, colleagues, and other senior BI consultants to hear what they thought of Kettle. That was how he got in touch with the Flemish Traffic Centre (www.verkeerscentrum.be/verkeersinfo/kaart), where billions of rows of data had to be integrated from thousands of data sources all over Belgium. All of a sudden, he was being paid to deploy and improve Kettle to handle that job. The diversity of test cases at the traffic center helped to improve Kettle dramatically. That was somewhere in 2004, and Kettle was at version 1.2.

While working at the Flemish Traffic Centre, Matt also posted messages on Javaforge (www.javaforge.com) to let people know they could download a free copy of Kettle for their own use. He got a few reactions. Although some of them were remarkably negative, most were positive. The most interesting response came from a nice guy called Jens Bleuel in Germany, who asked if it was possible to integrate third-party software into Kettle. In his specific case, he needed a connector to link Kettle with the German SAP software (www.sap.com). Kettle didn't have a plugin architecture, so Jens' question made Matt think about a plugin system, and that was the main motivation for developing version 2.0.

For various reasons, including the birth of Matt's son Sam and a lot of consultancy work, it took around a year to release Kettle version 2.0. It was a fairly complete release with advanced support for slowly changing dimensions and junk dimensions (Chapter 9 explains those concepts), the ability to connect to thirteen different databases, and, most importantly, support for plugins. Matt contacted Jens to let him know the news, and Jens was really interested. It was a very memorable moment for Matt and Jens, as it took them only a few hours to get a new plugin going that read data from an SAP/R3 server. There was a lot of excitement, and they agreed to start promoting the sales of Kettle from the Kettle.be website and from Proratio (www.proratio.de), the company Jens worked for.

Those were days of improvements, requests, and people interested in the project. However, it became too much to handle. Doing development and sales all by themselves was no fun after a while. As such, Matt thought about open sourcing Kettle early in 2005, and by late summer he made his decision. Jens and Proratio didn't mind, and the decision was final.

When they finally open sourced Kettle in December 2005, the response was massive. The downloadable package put up on Javaforge was downloaded around 35,000 times in the first week alone. The news spread all over the world pretty quickly.

What followed was a flood of messages, both private and on the forum. At its peak in March 2006, Matt got over 300 messages a day concerning Kettle.

In no time, he was answering questions like crazy, allowing people to join the development team, and working as a consultant at the same time. With the birth of his daughter Hannelore in February 2006 added to this, it was too much to deal with.

Fortunately, good times came. While Matt was trying to handle all that, a discussion was taking place at the Pentaho forum (http://forums.pentaho.org/) concerning the ETL tool that Pentaho should support. They had selected Enhydra Octopus, a Java-based ETL tool, but they didn't have a strong reliance on a specific tool.

While Jens was evaluating all sorts of open source BI packages, he came across that thread. Matt replied immediately, persuading people at Pentaho to consider including Kettle. And he must have been convincing, because the answer came quickly and was positive. James Dixon, Pentaho founder and CTO, offered Kettle the possibility of being the premier and only ETL tool supported by Pentaho. Later on, Matt got in touch with one of the other Pentaho founders, Richard Daley, who offered him a job. That allowed Matt to focus full-time on Kettle. Four years later, he's still happily working for Pentaho as chief architect for data integration, doing his best to deliver Kettle 4.0. Jens Bleuel, who has collaborated with Matt since the early versions, is now also part of the Pentaho team.
