Pentaho Data Integration Beginner's Guide - Second Edition

By: María Carina Roldán

Overview of this book

Capturing, manipulating, cleansing, transferring, and loading data effectively are prime requirements in every IT organization. Achieving these tasks requires either people devoted to developing extensive software programs, or an investment in ETL or data integration tools that simplify the work. Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. It has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing. "Pentaho Data Integration Beginner's Guide - Second Edition" provides the guidance needed to overcome that difficulty, covering all the key features of Pentaho Data Integration.

The book starts with the installation of the Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. You will also be introduced to data warehouse concepts and learn how to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity to apply and reinforce all the learned concepts by implementing a simple datamart. With "Pentaho Data Integration Beginner's Guide - Second Edition", you will learn what you need to know in order to meet your data manipulation requirements.

Time for action – creating a hello world transformation

How about starting by saying hello to the world? It's not really new, but good enough for our first practical example; here are the steps to follow:

1. Create a folder named pdi_labs under a folder of your choice.
2. Open Spoon.
3. From the main menu, navigate to File | New | Transformation.
4. On the left of the screen, under the Design tab, you’ll see a tree of Steps. Expand the Input branch by double-clicking on it.
   Note: If you work in Mac OS, a single click is enough.
5. Left-click on the Generate Rows icon and, without releasing the button, drag-and-drop the selected icon onto the main canvas. The screen will look like the following screenshot:
   Note: We changed the preferred language back to English.
6. Double-click on the Generate Rows step you just put on the canvas, and fill in the textboxes, including Step name and Limit, and the grid, so that the step produces 10 rows containing the message Hello World!
7. From the Steps tree, double-click on the Flow branch.
8. Click on the Dummy (do nothing) icon and drag-and-drop it onto the main canvas.
9. Put the mouse cursor over the Generate Rows step and wait until a tiny toolbar shows up below the entry icon, as shown in the following screenshot:
10. Click on the output connector (the last icon in the toolbar), and drag towards the Dummy (do nothing) step. A grayed hop is displayed.
11. When the mouse cursor is over the Dummy (do nothing) step, release the button. A link (a hop from now on) is created from the Generate Rows step to the Dummy (do nothing) step. The screen should look like the following screenshot:
12. Right-click anywhere on the canvas to bring up a contextual menu.
13. In the menu, select the New note option. A note editor appears.
14. Type a description such as Hello, World! Select the Font style tab, choose a font and colors for your note, and then click on OK.
15. From the main menu, navigate to Edit | Settings... A window appears in which you specify the transformation properties. Fill in the Transformation name textbox with a simple name, such as hello world. Fill in the Description textbox with a short description, such as My first transformation. Finally, provide a clearer explanation in the Extended description textbox, and then click on OK.
16. From the main menu, navigate to File | Save.
17. Save the transformation in the folder pdi_labs with the name hello_world.
18. Select the Dummy (do nothing) step by left-clicking on it.
19. Click on the Preview icon in the bar menu above the main canvas. The screen should look like the following screenshot:
20. The Transformation debug dialog window appears. Click on the Quick Launch button.
21. A window appears to preview the data generated by the transformation, as shown in the following screenshot:
22. Close the preview window and click on the Run icon. The screen should look like the following screenshot:
23. A window named Execute a transformation appears. Click on Launch.
24. The execution results are shown at the bottom of the screen. The Logging tab should look as follows:

What just happened?

You have just created your first transformation.

First, you created a new transformation, dragged two steps into the work area, Generate Rows and Dummy (do nothing), and connected them.

With the Generate Rows step you created 10 rows of data with the message Hello World! The Dummy (do nothing) step simply served as a destination of those rows.

After creating the transformation, you did a preview. The preview allowed you to see the content of the created data, that is, the 10 rows with the message Hello World!

Finally, you ran the transformation. At the bottom of the screen you could then see the Execution Results window, where the Logging tab shows the complete detail of what happened. There are other tabs in this window, which you will learn about later in the book.

Directing Kettle engine with transformations

A transformation is an entity made of steps linked by hops. These steps and hops build paths through which data flows—the data enters or is created in a step, the step applies some kind of transformation to it, and finally the data leaves that step. Therefore, it’s said that a transformation is data flow oriented.
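The data-flow idea can be pictured with a small Python sketch. This is only an analogy, not how Kettle is implemented: the function names and the field name message are assumptions made for illustration. Each function stands in for a step, and passing one generator into the next plays the role of a hop.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, str]


def generate_rows(limit: int, fields: Row) -> Iterator[Row]:
    """Stand-in for a Generate Rows step: emits `limit` identical rows."""
    for _ in range(limit):
        yield dict(fields)


def dummy(rows: Iterable[Row]) -> Iterator[Row]:
    """Stand-in for Dummy (do nothing): passes every row through unchanged."""
    for row in rows:
        yield row


def run_hello_world() -> List[Row]:
    # The "hop": the output of the origin step is the input of the destination step.
    # The field name "message" is hypothetical; the text only gives its value.
    source = generate_rows(limit=10, fields={"message": "Hello World!"})
    return list(dummy(source))


if __name__ == "__main__":
    for row in run_hello_world():
        print(row["message"])
```

Rows are created in the first function, travel through the second, and leave at the end, which mirrors the enter, transform, leave path described above.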

A transformation itself is neither a program nor an executable file. It is just plain XML. The transformation contains metadata which tells the Kettle engine what to do.
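Because a saved transformation is plain XML, any XML parser can inspect it. The sketch below reads a heavily simplified, hand-written document; the element names and values are assumptions loosely modeled on a saved transformation, and a real file contains far more metadata than is shown here.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written XML for illustration only (element names are assumed).
SAMPLE_KTR = """
<transformation>
  <info>
    <name>hello world</name>
    <description>My first transformation</description>
  </info>
  <order>
    <hop><from>Generate Rows</from><to>Dummy (do nothing)</to></hop>
  </order>
  <step><name>Generate Rows</name><limit>10</limit></step>
  <step><name>Dummy (do nothing)</name></step>
</transformation>
"""


def summarize(xml_text: str) -> dict:
    """Return the transformation name, its step names, and its hops."""
    root = ET.fromstring(xml_text.strip())
    steps = [step.findtext("name") for step in root.findall("step")]
    hops = [(hop.findtext("from"), hop.findtext("to")) for hop in root.iter("hop")]
    return {"name": root.findtext("info/name"), "steps": steps, "hops": hops}


if __name__ == "__main__":
    print(summarize(SAMPLE_KTR))
```

Nothing in the document is executable: it only describes steps and hops, and it is the engine that reads this metadata and acts on it.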

A step is the minimal unit inside a transformation. A large set of steps is available. These steps are grouped in categories, such as the Input and Flow categories that you saw in the example.

Each step is conceived to accomplish a specific function, ranging from reading a parameter to normalizing a dataset.

Each step has a configuration window. These windows vary according to the functionality of the steps and the category to which they belong. What all steps have in common are the name and description:

Step property	Description
Name	A representative name inside the transformation.
Description	A brief explanation that allows you to clarify the purpose of the step. It’s not mandatory but it is useful.

A hop is a graphical representation of data flowing between two steps: an origin and a destination. The data that flows through that hop constitutes the output data of the origin step and the input data of the destination step.

Exploring the Spoon interface

As you just saw, Spoon is the tool with which you create, preview, and run transformations. The following screenshot shows you the basic work areas: Main menu, Design view, Transformation toolbar, and Canvas (work area):

Note

The words canvas and work area will be used interchangeably throughout the book.

There is also an area named View that shows the structure of the transformation currently being edited. You can see that area by clicking on the View tab at the upper-left corner of the screen:

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Designing a transformation

In the previous section, you designed a very simple transformation, with just two steps and one explanatory note. You learned to link steps by using the mouseover assistance toolbar. There are alternative ways to do the same thing, and you can use whichever you feel most comfortable with. Appendix D, Spoon Shortcuts, explains all of the different options. It also covers many shortcuts for zooming in and out and aligning the steps, among others. These shortcuts become very useful as your transformations grow more complex.

Note

Appendix F, Best Practices, explains the benefit of using shortcuts as well as other best practices that are invaluable when you work with Spoon, especially when you have to design and develop big ETL projects.

Running and previewing the transformation

The Preview functionality allows you to see a sample of the data produced by the selected steps. In the previous example, you previewed the output of the Dummy (do nothing) step.

The Run icon effectively runs the whole transformation.
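The difference between previewing and running can be illustrated with a rough Python analogy. The function names, the field name message, and the sample size are assumptions for illustration and do not describe Spoon's internals: a preview stops after collecting a sample of rows, while a run consumes everything the transformation produces.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def hello_rows(limit: int) -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for the hello world transformation's output."""
    for _ in range(limit):
        yield {"message": "Hello World!"}


def preview(rows: Iterable[Dict[str, str]], sample_size: int) -> List[Dict[str, str]]:
    """Collect at most `sample_size` rows and stop, like looking at a sample."""
    return list(islice(rows, sample_size))


def run(rows: Iterable[Dict[str, str]]) -> int:
    """Consume every row and report how many were processed."""
    return sum(1 for _ in rows)


if __name__ == "__main__":
    print(preview(hello_rows(10), sample_size=3))
    print(run(hello_rows(10)))
```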

Whether you preview or run a transformation, you’ll get an Execution Results window showing what happened. You will learn more about this in the next chapter.

Pop quiz – PDI basics

Q1. There are several graphical tools in PDI, but Spoon is the most used.

True.
False.

Q2. You can choose to save transformations either in files or in a database.

True.
False.

Q3. To run a transformation, an executable file has to be generated from Spoon.

True.
False.

Q4. The grid size option in the Look & Feel window allows you to resize the work area.

True.
False.

Q5. To create a transformation you have to provide external data (for example, a text file, a spreadsheet, or a database).

True.
False.
