Pentaho Data Integration Beginner's Guide - Second Edition

By: María Carina Roldán

Overview of this book

Capturing, manipulating, cleansing, transferring, and loading data effectively are the prime requirements in every IT organization. Achieving these tasks requires either people devoted to developing extensive software programs or an investment in ETL or data integration tools that can simplify this work. Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. Pentaho Data Integration has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing. "Pentaho Data Integration Beginner's Guide - Second Edition" provides the guidance needed to overcome that difficulty, covering the key features of Pentaho Data Integration. The book starts with the installation of the Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. Moreover, you will be introduced to data warehouse concepts and you will learn how to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity to apply and reinforce all the learned concepts through the implementation of a simple datamart. With "Pentaho Data Integration Beginner's Guide - Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.

Solutions to commonly occurring situations

Among the common use cases in PDI, three have been specifically addressed in PDI 5: restartability, database transactions, and looping.

Restartability is the ability to restart a job after an interruption. To keep track of the status of an execution, you define checkpoints. Checkpoints ensure that the status variable, arguments, files, and rows from the result set are serialized. This way, the execution of a failed job can easily be resumed from a safe state, starting at the job entry that follows the last successful checkpoint.
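The checkpoint idea can be illustrated outside of PDI with a short Python sketch. This is not PDI's implementation or API; the function names (`run_job`, `demo`), the JSON state file, and the entry names are all hypothetical and only model the behavior described above: state is serialized after a checkpoint entry, and a restarted job resumes at the entry that follows the last successful checkpoint.

```python
import json
import os
import tempfile


def run_job(entries, checkpoint_names, state_path):
    """Run entries in order, resuming after the last saved checkpoint if any.

    entries: list of (name, function) pairs; each function takes and returns a dict.
    checkpoint_names: names of the entries after which the state is serialized.
    state_path: hypothetical file used to persist the checkpoint.
    """
    start, state = 0, {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            saved = json.load(f)
        # Resume at the entry that follows the last successful checkpoint.
        start, state = saved["next_index"], saved["state"]

    executed = []
    for index in range(start, len(entries)):
        name, func = entries[index]
        state = func(state)  # may raise, which interrupts the job
        executed.append(name)
        if name in checkpoint_names:
            with open(state_path, "w") as f:
                json.dump({"next_index": index + 1, "state": state}, f)

    # The job finished, so the saved checkpoint is no longer needed.
    if os.path.exists(state_path):
        os.remove(state_path)
    return executed, state


def demo():
    """Fail once in 'load', then restart: only 'load' runs the second time."""
    attempts = {"load": 0}

    def extract(state):
        return {**state, "rows": [1, 2, 3]}

    def load(state):
        attempts["load"] += 1
        if attempts["load"] == 1:
            raise RuntimeError("simulated interruption")
        return {**state, "loaded": len(state["rows"])}

    entries = [("extract", extract), ("load", load)]
    path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
    try:
        run_job(entries, {"extract"}, path)
    except RuntimeError:
        pass  # the first run was interrupted after the checkpoint
    return run_job(entries, {"extract"}, path)
```

In this sketch, the second call to `run_job` skips `extract` because its output was already serialized at the checkpoint, which is the same reason a checkpointed job does not need to redo completed work.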

The second common requirement addressed is database transactions that span transformations and jobs. In previous versions of PDI, the scope of a transaction was limited to a single transformation or job.
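To see why a wider transaction scope matters, consider the following Python sketch. It uses the standard `sqlite3` module rather than PDI, and the two functions standing in for separate units of work (`load_products`, `load_sales`) and the table names are hypothetical. The point is only that when both units share one transaction, a failure in the second also undoes the first.

```python
import sqlite3


def load_products(conn):
    # First unit of work (think of it as one transformation).
    conn.execute("INSERT INTO products (name) VALUES ('puzzle')")


def load_sales(conn, fail):
    # Second unit of work; can be forced to fail for the demonstration.
    if fail:
        raise ValueError("simulated failure in the second unit of work")
    conn.execute("INSERT INTO sales (product, qty) VALUES ('puzzle', 2)")


def run_in_one_transaction(conn, fail=False):
    """Commit both units together, or roll both back if either fails."""
    try:
        load_products(conn)
        load_sales(conn, fail)
        conn.commit()
        return True
    except ValueError:
        conn.rollback()
        return False


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (name TEXT)")
    conn.execute("CREATE TABLE sales (product TEXT, qty INTEGER)")
    conn.commit()
    ok = run_in_one_transaction(conn, fail=True)
    count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
    return ok, count
```

If each unit committed on its own, the product row would survive the failure and leave the two tables inconsistent. With a shared transaction, no product row remains after the rollback.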

Finally, a very interesting feature included in PDI 5 is the ability to run subjobs from within transformations through the use of executors. For a long time, PDI developers asked, "Can I run a job inside a transformation?" The answer was a definite no. The workaround was to nest jobs and transformations in complex ways. Now you can avoid all that unnecessary work by looping over data or files in an easier way. There is a Job Executor step that can easily be configured to loop over the rows in a dataset. Not only is the loop easier to implement, but there is also a bonus: the step returns the execution results (number of rows read, number of errors, and so on), the result rows, and the result files. Analogously, there is also a Transformation Executor step.
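The looping behavior of an executor can be pictured with a small Python sketch. Again, this is a conceptual model and not PDI code: `ExecutionResult`, `run_subjob`, and `execute_for_each_row` are hypothetical names, and the "subjob" is just a function that derives a file name from each row. The sketch runs the subjob once per input row and collects the three kinds of output mentioned above: execution counters, result rows, and result files.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionResult:
    rows_read: int = 0
    errors: int = 0
    result_rows: list = field(default_factory=list)
    result_files: list = field(default_factory=list)


def run_subjob(row):
    """Hypothetical subjob: produces one output row and one file name per input row."""
    if not row.get("name"):
        raise ValueError("row has no name")
    return {"name": row["name"], "file": f"{row['name']}.txt"}


def execute_for_each_row(rows, subjob):
    """Run the subjob once per row and gather counters, rows, and files."""
    result = ExecutionResult()
    for row in rows:
        result.rows_read += 1
        try:
            out = subjob(row)
        except ValueError:
            # A failing row is counted but does not stop the loop.
            result.errors += 1
            continue
        result.result_rows.append(out)
        result.result_files.append(out["file"])
    return result
```

For example, calling `execute_for_each_row([{"name": "a"}, {"name": ""}, {"name": "b"}], run_subjob)` reads three rows, records one error for the row without a name, and returns two result rows and two result files.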