
Time for action – logging into a database repository


To log into an existing database repository, follow these steps:

  1. Launch Spoon.

  2. If the repository dialog window doesn't show up, select Tools | Repository | Connect... from the main menu. The repository dialog window appears.

  3. From the list, select the repository you want to log in to.

  4. Type your username and password. If you didn't create any users, use the default credentials, admin/admin, and click on OK.

  5. You are now logged into the repository; you will see the name of the repository in the upper-left corner of Spoon.

What just happened?

In the preceding section, you opened Spoon and logged into a database repository.

Logging into a database repository using credentials

If you want to use the repository storage system, you have to log in to the repository before starting work. In order to do that, you choose the repository and provide a repository username and password.

The repository dialog that allows you to log into a repository can be opened from the main Spoon menu. If you intend to log into the repository often, it is more convenient to select Tools | Options... and check the general option Show repository dialog at startup?, which causes the repository dialog to show up every time you launch Spoon.

It is also possible to log into the repository automatically. Suppose you have a repository named MY_REPO and you use the default user. By adding the following lines to the kettle.properties file, the next time you launch Spoon you will be logged into the repository without being prompted:

  KETTLE_REPOSITORY=MY_REPO
  KETTLE_USER=admin
  KETTLE_PASSWORD=admin

Tip

For details on the kettle.properties file, refer to the Kettle variables section in Chapter 3, Manipulating Real-world Data.

As a final note, take into account that with this approach the login information is exposed in plain text; as such, autologin is not recommended.
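The same credentials can also be supplied to the command-line tools. As a minimal sketch, the following Kitchen invocation runs a job stored in the MY_REPO repository used above; my_job is a hypothetical job name, and note that this exposes the password on the command line just as kettle.properties does:

  # Run the job my_job, stored in the root folder of the MY_REPO
  # repository, logging in with the default credentials
  sh kitchen.sh -rep=MY_REPO -user=admin -pass=admin -dir=/ -job=my_job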

Creating transformations and jobs in repository folders

In a repository, the jobs and transformations are organized in folders. A folder in a repository fulfills the same purpose as a folder on your drive: it allows you to keep your work organized. Once you create a folder, you can save both transformations and jobs in it.

While connected to a repository, you can design, preview, and run jobs and transformations just as you do with files. However, there are some differences when it comes to opening, creating, or saving your work. So let's summarize how you do those tasks when logged into a repository:

  • Open a transformation/job: Select File | Open. The repository explorer shows up. Navigate the repository until you find the transformation or job you want to open, then double-click on it.

  • Create a folder: Select Tools | Repository | Explore..., expand the transformation or job tree, locate the parent folder, right-click on it, and create the folder.

  • Create a transformation: Select File | New | Transformation or press Ctrl + N.

  • Create a job: Select File | New | Job or press Ctrl + Alt + N.

  • Save a transformation: Press Ctrl + T. Give the transformation a name. In the Directory textbox, select the folder where the transformation is going to be saved. Press Ctrl + S. The transformation will be saved in the selected directory under the given name.

  • Save a job: Press Ctrl + J. Give the job a name. In the Directory textbox, select the folder where the job is going to be saved. Press Ctrl + S. The job will be saved in the selected directory under the given name.

Creating database connections, users, servers, partitions, and clusters

Besides jobs and transformations, there are some additional PDI elements that you can define:

  • Connections: This element defines connections to relational databases. These are covered in Chapter 8, Working with Databases.

  • Security: This element defines the repository users, whose credentials are needed to log into the repository. There are two predefined users: admin and guest.

You can also define some elements not covered in this book, but worth mentioning:

  • Slaves: Slave servers are servers installed on remote machines to execute jobs and transformations remotely.

  • Partitions: Partitioning is a mechanism by which you can send individual rows to different copies of the same step, for example, based on a field value.

  • Clusters: A cluster is a group of slave servers that collectively execute a job or a transformation.

All the elements can also be created, modified, and deleted from the repository explorer. Once you create any of these elements, they are automatically shared by all repository users.
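Regarding the slave servers just mentioned, a remote machine typically becomes a slave server by running the Carte utility that ships with PDI. As a minimal sketch, with an arbitrary example host and port:

  # Start a Carte slave server listening on localhost, port 8181
  sh carte.sh localhost 8181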

Designing jobs and transformations

You shouldn't have any difficulty designing jobs and transformations under the database repository-based method. The way you work is exactly the same as the way you learned throughout the chapters while working under the file-based method. The main difference, besides the way you open and save your work, is the way you refer to other jobs and transformations. Here is a list of the main situations:

  • Providing the name of the job to run in a Job job entry:
      File-based method: You provide the full path of the .kjb file or a relative path, optionally using the ${Internal.Job.Filename.Directory} variable.
      Database repository-based method: You fill in the Repository: specify by name option. You may optionally use a relative location with the ${Internal.Job.Repository.Directory} variable.

  • Providing the name of the transformation to run in a Transformation job entry:
      File-based method: You provide the full path of the .ktr file or a relative path, optionally using the ${Internal.Job.Filename.Directory} variable.
      Database repository-based method: You fill in the Specify by name and directory option. You may optionally use a relative location with the ${Internal.Job.Repository.Directory} variable.

  • Providing the name of the transformation to run as a subtransformation using a Mapping step:
      File-based method: You provide the full path of the .ktr file or a relative path, optionally using the ${Internal.Transformation.Filename.Directory} variable.
      Database repository-based method: You fill in the Use a mapping transformation from the repository option. You may optionally use a relative location with the ${Internal.Transformation.Repository.Directory} variable.

  • Providing the name of a file whose location is relative to the location of a job or a transformation:
      File-based method: You provide the relative path using ${Internal.Job.Filename.Directory} in a job or ${Internal.Transformation.Filename.Directory} in a transformation.
      Database repository-based method: You can create a folder tree in the repository just as you do in the filesystem; however, you are not able to specify relative paths to external files in this way.
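To make the Transformation job entry case concrete, here is a sketch of the kind of reference you might type under each method; load_sales.ktr is a hypothetical transformation name, and the exact dialog fields are the ones described above:

  # File-based method: path entered as the transformation filename
  ${Internal.Job.Filename.Directory}/load_sales.ktr

  # Database repository-based method: relative repository location
  ${Internal.Job.Repository.Directory}/load_sales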

Backing up and restoring a repository

You can regularly back up your database repository in the same way as you would do with any other database, that is, by using the utilities provided by the RDBMS, for example, mysqldump in MySQL; a sketch of this follows. PDI also offers you a method of its own for creating a backup in an XML file, described in the list below.
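As a rough illustration of the RDBMS approach, assuming the repository is stored in a MySQL database named pdi_repo (a hypothetical name), the backup and restore could look like this:

  # Dump the entire repository database to a file
  mysqldump -u root -p pdi_repo > pdi_repo_backup.sql

  # Restore the dump later, into an empty database
  mysql -u root -p pdi_repo < pdi_repo_backup.sql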

  • You create a backup from the Tools | Repository | Export Repository... option. You will be asked for the name and location of the XML file that will contain the backup data. To back up a single folder, open the repository explorer and right-click on the name of the folder.

  • You restore a backup from the Tools | Repository | Import Repository... option. You will be asked for the name and location of the XML file that contains the backup.

In both the export and the import operations, Kettle offers you the possibility of applying some rules to make sure that the transformations and jobs adhere to given standards.

As an example, suppose that you want to make sure that there are no disabled hops in your jobs. One of the rules that Kettle offers performs exactly this verification; its name is Job has no disabled hops. The list of available rules is quite limited, but still useful for meeting some basic requirements.