Pentaho 3.2 Data Integration: Beginner's Guide

Overview of this book

Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated.

This book is full of practical examples that will help you to take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease.

Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files, and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. Not only that, you'll be given an introduction to data warehouse concepts and you will learn to load data in a data warehouse. After that, you will learn to implement simple and complex processes.

Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned through the book.
Table of Contents (27 chapters)
Pentaho 3.2 Data Integration Beginner's Guide
Credits
Foreword
The Kettle Project
About the Author
About the Reviewers
Preface
Index

The Kettle Project

Whether there is a migration to do, an ETL process to run, or a need for massively loading data into a database, you have several software tools, ranging from expensive and sophisticated to free open source and friendly ones, which help you accomplish the task.

Ten years ago, the scenario was clearly different. By 2000, Matt Casters, a Belgian business intelligence consultant, had been working for a while as a data warehouse architect and administrator. As such, he was one of quite a number of people who, whether the company they worked for was big or small, had to deal with the difficulties involved in bridging the gap between information technology and business needs. What made it even worse at that time was that ETL tools were prohibitively expensive and everything had to be crafted by hand. The last employer he worked for didn't think that writing a new ETL tool would be a good idea. This was one of the motivations for Matt to become an independent contractor and to start his own company. That was in June 2001.

At the end of that year, he told his wife that he was going to write a new piece of software for himself to do ETL tasks. It was going to take up some time left and right in the evenings and weekends. Surprised, she asked how long it would take him to get it done. He replied that it would probably take five years and that he perhaps would have something working in three.

Work on that started in early 2003. Matt's main goals for writing the software included learning about databases, ETL processes, and data warehousing. This would in turn improve his chances in a job market that was pretty volatile. Ultimately, it would allow him to work full time on the software.

Another important goal was to understand what the tool had to do. Matt wanted a scalable and parallel tool, and wanted to isolate rows of data as much as possible.

Last but not least, Matt had to pick the right technology to support the tool. The first idea was to build it on top of KDE, the popular Unix desktop environment. Trolltech, the people behind Qt, the core UI library of KDE, had released plans to create database drivers for popular databases. However, the lack of decent drivers for those databases drove Matt to change plans and use Java. He picked Java because he had some prior experience with it, having written a Japanese Chess (Shogi) database program when Java 1.0 was released. To Sun's credit, this software still runs and is available at http://ibridge.be/shogi/.

After a year of development, the tool was capable of reading text files, reading from databases, and writing to databases, and it was very flexible. The experience with Java was not 100% positive though. The code had grown unstructured, crashes occurred all too often, and it was hard to get something going with the Java graphic library used at that moment, the Abstract Window Toolkit (AWT); it looked bad and it was slow.

As for the library, Matt decided to start using the newly released Standard Widget Toolkit (SWT), which helped solve part of the problem. As for the rest, Kettle was a complete mess. It was time to ask for help. The help came from Wim De Clercq, a senior enterprise Java architect, co-owner of Ixor (www.ixor.be), and also a friend of Matt. At various intervals over the next few years, Wim involved himself in the project, giving Matt advice about good practices in Java programming. Following that advice meant making massive amounts of code changes. As a consequence, it was not unusual to spend weekends doing nothing but refactoring code and fixing the thousands of errors that resulted. But, bit by bit, things kept going in the right direction.

At that same time, Matt also showed the results to his peers, colleagues, and other senior BI consultants to hear what they thought of Kettle. That was how he got in touch with the Flemish Traffic Centre (www.verkeerscentrum.be/verkeersinfo/kaart) where billions of rows of data had to be integrated from thousands of data sources all over Belgium. All of a sudden, he was being paid to deploy and improve Kettle to handle that job. The diversity of test cases at the traffic center helped to improve Kettle dramatically. That was somewhere in 2004, and Kettle was at version 1.2.

While working at the Flemish Traffic Centre, Matt also posted messages on Javaforge (www.javaforge.com) to let people know they could download a free copy of Kettle for their own use. He got a few reactions. Although some of them were remarkably negative, most were positive. The most interesting response came from a nice guy called Jens Bleuel in Germany who asked if it was possible to integrate third-party software into Kettle. In his specific case, he needed a connector to link Kettle with the German SAP software (www.sap.com). Kettle didn't have a plugin architecture, so Jens' question made Matt think about a plugin system, and that was the main motivation for developing version 2.0.

For various reasons, including the birth of Matt's son Sam and a lot of consultancy work, it took around a year to release Kettle version 2.0. It was a fairly complete release with advanced support for slowly changing dimensions and junk dimensions (Chapter 9 explains those concepts), the ability to connect to thirteen different databases, and, most importantly, support for plugins. Matt contacted Jens to let him know the news and Jens was really interested. It was a very memorable moment for Matt and Jens as it took them only a few hours to get a new plugin going that read data from an SAP/R3 server. There was a lot of excitement, and they agreed to start promoting sales of Kettle from the Kettle.be website and from Proratio (www.proratio.de), the company Jens worked for.

Those were days of improvements, requests, and people interested in the project. However, it became too much to handle. Doing development and sales all by themselves was no fun after a while. As such, Matt thought about open sourcing Kettle early in 2005, and by late summer he had made his decision. Jens and Proratio didn't mind, and the decision was final.

When they finally open sourced Kettle in December 2005, the response was massive. The downloadable package put up on Javaforge was downloaded around 35,000 times during the first week alone. The news spread all over the world pretty quickly.

What followed was a flood of messages, both private and on the forum. At its peak in March 2006, Matt got over 300 messages a day concerning Kettle.

In no time, he was answering questions like crazy, allowing people to join the development team, and working as a consultant at the same time. With the birth of his daughter Hannelore in February 2006 added to this, it was too much to deal with.

Fortunately, good times came. While Matt was trying to handle all that, a discussion was taking place at the Pentaho forum (http://forums.pentaho.org/) concerning the ETL tool that Pentaho should support. They had selected Enhydra Octopus, a Java-based ETL tool, but they didn't have a strong reliance on a specific tool.

While Jens was evaluating all sorts of open source BI packages, he came across that thread. Matt replied immediately, persuading people at Pentaho to consider including Kettle. And he must have been convincing, because the answer came quickly and was positive. James Dixon, Pentaho founder and CTO, offered Kettle the possibility of being the premier and only ETL tool supported by Pentaho. Later on, Matt came in touch with one of the other Pentaho founders, Richard Daley, who offered him a job. That allowed Matt to focus full-time on Kettle. Four years later, he's still happily working for Pentaho as chief architect for data integration, doing his best to deliver Kettle 4.0. Jens Bleuel, who has collaborated with Matt since the early versions, is now also part of the Pentaho team.