Pentaho 3.2 Data Integration: Beginner's Guide

Overview of this book

Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated.

This book is full of practical examples that will help you to take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease.

Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files, and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. Not only that, you'll be given an introduction to data warehouse concepts and you will learn to load data in a data warehouse. After that, you will learn to implement simple and complex processes.

Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned through the book.

Time for action – logging into a repository

To log into an existing repository, follow these steps:

1. Launch Spoon.
2. If the repository dialog window doesn't show up, select Repository | Connect to repository from the main menu. The repository dialog window appears.
3. In the drop-down list, select the repository you want to log into.
4. Type your username and password. If you have never created any users, use the default username and password: admin and admin. Click on OK.
5. You will now be logged into the repository. The name of the repository appears in the upper-left corner of Spoon.

What just happened?

You opened Spoon and logged into a repository. To do that, you provided the name of the repository and valid credentials. Once you did that, you were ready to start working with the repository.

Logging into a repository by using credentials

If you want to work with the repository storage system, you have to log into the repository before you begin your work. In order to do that, you have to choose the repository and provide a repository username and password.

The repository dialog that allows you to log into the repository can be opened from the main Spoon menu. If you intend to log into the repository often, select Edit | Options... and check the general option Show repository dialog at startup?. The repository dialog will then show up every time you launch Spoon.

It is possible to log into the repository automatically. Let's assume you have a repository named MY_REPO and you use the default user. Add the following lines to the kettle.properties file:

KETTLE_REPOSITORY=MY_REPO
KETTLE_USER=admin
KETTLE_PASSWORD=admin

The next time you launch Spoon, you will be logged into the repository automatically.
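If you would rather add those three lines with a script than by hand, the following is a minimal sketch of one way to do it. It is not part of PDI: the function name set_properties is hypothetical, and the sketch only handles simple key=value lines, leaving comments and unrelated settings untouched.

```python
def set_properties(text, updates):
    """Return properties text in which every key of `updates` is set exactly once.

    Existing key=value lines are rewritten in place; missing keys are appended.
    Comment lines (starting with #) and unrelated lines are copied unchanged.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_setting = "=" in line and not line.lstrip().startswith("#")
        if is_setting and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    # Append whatever was not already present in the file.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


# In-memory demonstration with the values used in the text above.
existing = "# my settings\nKETTLE_USER=guest\n"
updated = set_properties(existing, {
    "KETTLE_REPOSITORY": "MY_REPO",
    "KETTLE_USER": "admin",
    "KETTLE_PASSWORD": "admin",
})
print(updated)
```

To apply this to a real file, you would read the contents of your kettle.properties file, pass them to the function, and write the result back. The location of the file depends on your installation, so the sketch deliberately does not hard-code a path.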

Tip

For details about the kettle.properties file, refer to the section on Kettle variables in Chapter 2.

Note

Because the username and password are stored in plain text in the kettle.properties file, automatic login is not recommended.

Defining repository user accounts

To log into a repository, you need a user account. Every repository user has a profile that dictates the permissions that the user has on the repository. There are three predefined profiles:

Profile	Permissions
Read-only	Cannot create or modify any element in the repository
User	Can create, modify, and delete any object in the repository except users and profiles
Administrator	Has full permissions, including creating new users and profiles

There are also two predefined users:

admin: A user with Administrator profile. This is the user you used to log into the repository for the first time. It has full permissions on the repository.
guest: A user with Read-only profile.
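The three profiles and two predefined users can be summarized as a small permission model. The sketch below is only an illustration of the rules in the table above; the names Profile, PREDEFINED_USERS, and can_modify are hypothetical and do not correspond to any PDI API. It does not model a user's ability to edit his or her own account.

```python
from enum import Enum


class Profile(Enum):
    READ_ONLY = "Read-only"
    USER = "User"
    ADMINISTRATOR = "Administrator"


# The two users that exist in a freshly created repository.
PREDEFINED_USERS = {
    "admin": Profile.ADMINISTRATOR,
    "guest": Profile.READ_ONLY,
}


def can_modify(profile, object_kind):
    """Return True if `profile` may create, modify, or delete `object_kind`.

    `object_kind` is an illustrative label such as "transformation",
    "job", "user", or "profile".
    """
    if profile is Profile.ADMINISTRATOR:
        return True  # full permissions, including users and profiles
    if profile is Profile.USER:
        return object_kind not in {"user", "profile"}
    return False  # Read-only: no changes at all


print(can_modify(PREDEFINED_USERS["guest"], "transformation"))
print(can_modify(Profile.USER, "job"))
print(can_modify(Profile.USER, "profile"))
```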

If you have the Administrator profile, you can create, modify, rename, or delete users and profiles from the Repository explorer. For details, please refer to the section Examining and modifying the contents of a repository with the Repository explorer, later in this chapter. Any user may change his/her own user information both from the Repository explorer and from the Repository | Edit current user menu option.

Creating transformations and jobs in repository folders

In a repository, the jobs and transformations are organized in folders. A folder in a repository fulfills the same purpose as a folder in your drive—it allows you to keep your work organized. Once you create a folder, you can save both transformations and jobs in it.

While connected to a repository, you design, preview, and run jobs and transformations just as you do with files. However, there are some differences when it comes to opening, creating, or saving your work. The following table summarizes how you do those tasks when logged into a repository:

Task	Procedure
Open a transformation / job	Select File | Open. The Repository explorer shows up. Navigate the repository until you find the transformation or job you want to open. Double-click it.
Create a folder	Select Repository | Explore repository, expand the transformation or job tree, locate the parent folder, right-click and create the folder. Alternatively, double-click the parent folder.
Create a transformation	Select File | New | Transformation or press Ctrl+N.
Create a job	Select File | New | Job or press Ctrl+Alt+N.
Save a transformation	Press Ctrl+T. Give a name to the transformation. In the Directory textbox, select the folder where the transformation is going to be saved. Press Ctrl+S. The transformation will now be saved in the selected directory under the given name.
Save a job	Press Ctrl+J. Give a name to the job. In the Directory textbox, select the folder where the job is going to be saved. Press Ctrl+S. The job will be saved in the selected directory under the given name.

Creating database connections, partitions, servers, and clusters

Besides users, profiles, jobs, and transformations, there are some additional PDI elements that you can define:

Element	Description
Database connections	Connection definitions to relational databases. These are covered in Chapter 8.
Partition schemas	Partitioning is a mechanism by which you send individual rows to different copies of the same step—for example, based on a field value. This is an advanced topic not covered in this book.
Slave servers	Slave servers are installed on remote machines to execute jobs and transformations remotely. They are introduced in Chapter 13.
Clusters	Clusters are groups of slave servers that collectively execute a job or a transformation. They are also introduced in Chapter 13.

All these elements can also be created, modified, and deleted from the Repository explorer.

Once you create any of these elements, it is automatically shared by all repository users.

Backing up and restoring a repository

A PDI repository is a database. As such, you can back it up regularly with the utilities provided by the RDBMS. In addition, PDI offers a method for creating a backup in an XML file.

You create a backup from the Repository explorer. Right-click the name of the repository and select Export all objects to an XML file. You will be asked for the name and location of the XML file that will contain the backup data. In order to back up a single folder, instead of right-clicking the repository name, right-click the name of the folder.

You can also restore a backup stored in an XML file from the Repository explorer. Right-click the name of the repository and select Import all objects from an XML file. You will be asked for the name and location of the XML file that contains the backup.
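Before relying on an exported XML file as a backup, it can be worth checking that the file is well-formed and not empty. The following is a minimal sketch of such a check. It makes no assumption about how PDI structures the export: it simply counts how often each element name occurs. The function name summarize_xml is hypothetical, and the element names in the sample string are invented for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_xml(xml_text):
    """Parse XML text and count the occurrences of every element name.

    Raises xml.etree.ElementTree.ParseError if the text is not well-formed,
    which is itself a useful signal that a backup file is damaged.
    """
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    return Counter(element.tag for element in root.iter())


# Invented sample; a real export will contain different element names.
sample = (
    "<backup>"
    "<item kind='transformation'/>"
    "<item kind='transformation'/>"
    "<item kind='job'/>"
    "</backup>"
)
for tag, count in sorted(summarize_xml(sample).items()):
    print(tag, count)
```

For a real backup, you would read the exported file into a string (or use ET.parse with the file name) and compare the counts between successive backups to spot an export that is unexpectedly small.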
