Book Image

Pentaho Data Integration Beginner's Guide - Second Edition - Second Edition

By : María Carina Roldán
Book Image

Pentaho Data Integration Beginner's Guide - Second Edition - Second Edition

By: María Carina Roldán

Overview of this book

Capturing, manipulating, cleansing, transferring, and loading data effectively are the prime requirements in every IT organization. Achieving these tasks require people devoted to developing extensive software programs, or investing in ETL or data integration tools that can simplify this work. Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. Pentaho Data Integration has an intuitive, graphical, drag-and-drop design environment and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing. "Pentaho Data Integration Beginner's Guide - Second Edition" provides the guidance needed to overcome that difficulty, covering all the possible key features of Pentaho Data Integration. "Pentaho Data Integration Beginner's Guide - Second Edition" starts with the installation of Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. Moreover, you will be introduced to data warehouse concepts and you will learn how to load data in a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity of applying and reinforcing all the learned concepts through the implementation of a simple datamart. With "Pentaho Data Integration Beginner's Guide - Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.
Table of Contents (26 chapters)
Pentaho Data Integration Beginner's Guide
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Best Practices
Index

Transformation steps


The following table includes all of the transformation steps used in the book. For a full list of the steps and their descriptions, select Help | Step information... in Spoon’s main menu.

Also visit the website http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+Steps.

There you will find a full step reference and some examples as well.

Icon

Name

Purpose

Time for action

Abort

Aborts a transformation.

Generating custom messages by setting a variable with the name of the examination file (Chapter 11, Creating Advanced Transformations and Jobs). Also in Chapter 12, Developing and Implementing a Simple Datamart.

Add constants

Adds one or more constants to the input rows.

Gathering progress and merging all together (Chapter 5, Controlling the Flow of Data). Also in Chapter 8, Working with Databases, Chapter 9, Performing Advanced Operations with Databases, and Chapter 12, Developing and Implementing a Simple Datamart.

Add sequence

Gets the next value from a sequence.

Assigning tasks by distributing (Chapter 5, Controlling the Flow of Data). Also in Chapter 11, Creating Advanced Transformations and Jobs, and Chapter 12, Developing and Implementing a Simple Datamart.

Append streams

Appends two streams in an ordered way.

Giving priority to Bouchard by using Append Stream (Chapter 5, Controlling the Flow of Data).

Calculator

Creates new fields by performing simple calculations.

Avoiding errors while converting the estimated time (Chapter 2, Getting started with Transformations). Also in Chapter 3, Manipulating Real-world Data, Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data, Chapter 5, Controlling the Flow of Data, Chapter 7, Transforming the Rowset, Chapter 8, Working with Databases, Chapter 10, Creating Basic Task Flows, and Chapter 12, Developing and Implementing a Simple Datamart.

Clone row

Clones a row as many times as needed.

Generating a range of dates and inspecting (Chapter 2, Getting started with Transformations). Also in Chapter 7, Transforming the Rowset, and Chapter 10, Creating Basic Task Flows.

Combination lookup/update

Updates a junk dimension in a data-warehouse. Alternatively, looks up information in this dimension.

Loading a region dimension with a Combination lookup-update step (Chapter 9, Performing Advanced Operations with Databases). Also in Chapter 12, Developing and Implementing a Simple Datamart.

Copy rows to result

Writes rows to the executing job. The information will then be passed to the next entry in the job.

Generating top average scores by copying and getting rows (Chapter 11, Creating Advanced Transformations and Jobs).

Data Grid

Allows you to create rows of static data in a grid, usually for testing, reference, or demo purposes.

Avoiding errors while converting the estimated time (Chapter 2, Getting started with Transformations). Also in Chapter 5, Controlling the Flow of Data.

Database join

Executes a database query using stream values as parameters.

Using a Database join step to create a list of suggested products to buy (Chapter 9, Performing Advanced Operations with Databases).

Database lookup

Looks up values in a database using field values.

Using a Database lookup step to create a list of products to buy (Chapter 9, Performing Advanced Operations with Databases). Also in Chapter 12, Developing and Implementing a Simple Datamart.

Delay row

Outputs each input row after a delay.

Generating a range of dates and inspecting (Chapter 2, Getting started with Transformations) Also in Chapter 10, Creating Basic Task Flows.

Delete

Deletes data in a database table based upon keys.

Deleting data about discontinued items (Chapter 8, Working with Databases).

Dimension lookup/update

Updates a slowly changing dimension in a data-warehouse. Alternatively, look up information in this dimension.

Keeping a history of changes in product by using the Dimension lookup-update step (Chapter 9, Performing Advanced Operations with Databases). Also in Chapter 12, Developing and Implementing a Simple Datamart.

Dummy (do nothing)

This step type doesn’t do anything. It’s useful however when testing, or in certain situations where you want to split streams.

Creating a hello world transformation (Chapter 1, Getting started with Pentaho Data Integration). Also in Chapter 3, Manipulating Real-world Data, Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data, Chapter 6, Transforming your Data by Coding, Chapter 7, Transforming the Rowset, Chapter 9, Performing Advanced Operations with Databases, Chapter 11, Creating Advanced Transformations and Jobs, and Chapter 12, Developing and Implementing a Simple Datamart.

Filter rows

Filters rows using simple equations.

Counting frequent words by filtering (Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data). Also in Chapter 5, Controlling the Flow of Data, Chapter 7, Transforming the Rowset, Chapter 9, Performing Advanced Operations with Databases, Chapter 11, Creating Advanced Transformations and Jobs, and Chapter 12, Developing and Implementing a Simple Datamart.

Formula

Calculates a formula using Pentaho’s libformula.

Generating custom files by executing a transformation for every input row (Chapter 11, Creating Advanced Transformations and Jobs).

Generate random value

Generates random values.

Have a go Hero – measuring the performance of input steps (Chapter 3, Manipulating Real-world Data).

Generate Rows

Generates a number of empty or equal rows.

Creating a hello world transformation (Chapter 1, Getting started with Pentaho Data Integration). Also in Chapter 2, Getting started with Transformations, Chapter 3, Manipulating Real-world Data, Chapter 7, Transforming the Rowset, Chapter 9, Performing Advanced Operations with Databases, Chapter 10, Creating Basic Task Flows, and Chapter 12, Developing and Implementing a Simple Datamart.

Get data from XML

Gets data from an XML file by using XPath. This step also allows you to parse XML defined in a previous field.

Getting data from an XML file with information about countries (Chapter 3, Manipulating Real-world Data). Also in Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data, and Chapter 9, Performing Advanced Operations with Databases.

Get rows from result

Reads rows from a previous entry in a job.

Generating top average scores by copying and getting rows (Chapter 11, Creating Advanced Transformations and Jobs).

Get System Info

Gets information from the system such as system date, arguments, and so on.

Reading and writing matching files with flexibility (Chapter 3, Manipulating Real-world Data). Also in Chapter 8, Working with Databases, Chapter 10, Creating Basic Task Flows, Chapter 11, Creating Advanced Transformations and Jobs, and Chapter 12, Developing and Implementing a Simple Datamart.

Get Variables

Determines the values of certain (environment or Kettle) variables and puts them in field values.

Creating a time dimension dataset (Chapter 7, Transforming the Rowset). Also in Chapter 12, Developing and Implementing a Simple Datamart.

Group by

Builds aggregates in a group by fashion. This works only on a sorted input. If the input is not sorted, only double consecutive rows are handled correctly.

Calculating football matches statistics by grouping data (Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data). Also in Chapter 5, Controlling the Flow of Data, Chapter 9, Performing Advanced Operations with Databases, and Chapter 11, Creating Advanced Transformations and Jobs.

If field value is null

Sets a field value to a constant if it is null.

Enhancing a films file by converting rows to columns (Chapter 7, Transforming the Rowset).

Insert/Update

Updates or insert rows in a database based upon keys.

Inserting new products or updating existent ones (Chapter 8, Working with Databases).

Java Filter

Filters rows using Java code.

Generating top average scores by copying and getting rows (Chapter 11, Creating Advanced Transformations and Jobs).

Mapping (subtransformation)

Run a subtransformation, use MappingInput and MappingOutput to specify the fields interface.

Calculating statistics with the use of a subtransformation (Chapter 11, Creating Advanced Transformations and Jobs).

Mapping input specification

Specifies the input interface of a mapping.

Calculating statistics with the use of a subtransformation (Chapter 11, Creating Advanced Transformations and Jobs).

Mapping output specification

Specifies the output interface of a mapping.

Calculating statistics with the use of a subtransformation (Chapter 11, Creating Advanced Transformations and Jobs).

Microsoft Excel Input

Reads data from Excel and OpenOffice Workbooks (.xls, .xlsx, and .ods).

Browsing new features of PDI by copying a dataset (Chapter 5, Controlling the Flow of Data). Also in Chapter 8, Working with Databases, and Chapter 12, Developing and Implementing a Simple Datamart.

Microsoft Excel Output

Stores records into an Excel (.xl s) document.

Getting data from an XML file with information about countries (Chapter 3, Manipulating Real-World Data). Also in Chapter 5, Controlling the Flow of Data.

Modified JavaScript Value

Allows you to code JavaScript to modify or create new fields.

Counting frequent words by coding in JavaScript (Chapter 6, Transforming your Data by Coding). Also in Chapter 7, Transforming the Rowset.

Number range

Creates ranges based on a numeric field.

Creating a simple transformation (Chapter 2, Getting Started with Transformations). Also in Chapter 8, Working with Databases.

Regex Evaluation

Evaluates a field using a regular expression. It can also extract new fields out of an existing field with capturing groups.

Loading the fact table using a range of dates (Chapter 12, Developing and Implementing a Simple Datamart).

Replace in string

Replaces all occurrences of a word in a string with another word.

Fixing words before counting them (Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data).

Row denormaliser

Denormalizes rows by looking up key-value pairs and assigning them to new fields in the output rows. This method aggregates and needs the input rows to be sorted on the grouping fields.

Enhancing a films file by converting rows to columns (Chapter 7, Transforming the Rowset).

Row Normaliser

Denormalized information can be normalized using this step-type.

Enhancing the matches files by normalizing the dataset (Chapter 7, Transforming the Rowset). Also in Chapter 12, Developing and Implementing a Simple Datamart.

Select values

Selects, reorders, or removes fields. Optionally, allows you to change the metadata: type, length, and precision.

Avoiding errors while converting the estimated time (Chapter 2, Getting Started with Transformations). Also in Chapter 3, Manipulating Real-World Data, Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data, Chapter 5, Controlling the Flow of Data, Chapter 7, Transforming the Rowset, Chapter 8, Working with Databases, Chapter 9, Performing Advanced Operations with Databases, Chapter 10, Creating Basic Task Flows, and Chapter 12, Developing and Implementing a Simple Datamart.

Set Variables

Sets Kettle variables based on a single-input row.

Generating custom messages by setting a variable with the name of the examination file (Chapter 11, Creating Advanced Transformations and Jobs). Also in Chapter 12, Developing and Implementing a Simple Datamart.

Sort rows

Sorts rows based upon field values (ascending or descending).

Sorting information about matches (Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data). Also in Chapter 5, Controlling the Flow of Data, Chapter 7, Transforming the Rowset, Chapter 8, Working with Databases, Chapter 9, Performing Advanced Operations with Databases, and Chapter 11, Creating Advanced Transformations and Jobs.

Split field to rows

Splits a single string field by delimiter and creates a new row for each split term.

Counting frequent words by filtering (Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data). Also in Chapter 6, Transforming your Data by Coding.

Split Fields

Splits a single string field into two or more.

Aggregating football matches data with the Row denormaliser step (Chapter 7, Transforming the Rowset). Also in Chapter 12, Developing and Implementing a Simple Datamart.

Stream lookup

Looks up values coming from another stream in the transformation.

Finding out which language people speak (Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data). Also in Chapter 7, Transforming the Rowset, and Chapter 12, Developing and Implementing a Simple Datamart.

Switch/Case

Switches a row to a certain target step based on the case value in a field.

Assigning tasks by filtering priorities with the Switch Case step (Chapter 5, Controlling the Flow of Data).

Table input

Reads data from a database table.

Getting data about shipped orders (Chapter 8, Working with Databases). Also in Chapter 9, Performing Advanced Operations with Databases, and Chapter 12, Developing and Implementing a Simple Datamart.

Table output

Writes data onto a database table.

Loading a table with a list of manufacturers (Chapter 8, Working with Databases). Also in Chapter 9, Performing Advanced Operations with Databases, and Chapter 12, Developing and Implementing a Simple Datamart.

Text file input

Reads data from a text file.

Reading results of football matches from files (Chapter 3, Manipulating Real-world Data). Also in Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data, Chapter 6, Transforming your Data by Coding, Chapter 7, Transforming the Rowset, Chapter 8, Working with Databases, and Chapter 11, Creating Advanced Transformations and Jobs.

Text file output

Writes data to a text file.

Sending the results of matches to a plain file (Chapter 3, Manipulating Real-world Data). Also in Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data, Chapter 9, Performing Advanced Operations with Databases, Chapter 10, Creating Basic Task Flows, and Chapter 11, Creating Advanced Transformations and Jobs.

Unique rows

Removes double rows leaving unique occurrences.

Have a go hero – listing the last match played by each team (Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data).

Update

Updates data in a database table based upon keys.

Loading a region dimension with a Combination lookup-update step (Chapter 9, Performing Advanced Operations with Databases) Also in Chapter 12, Developing and Implementing a Simple Datamart.

User Defined Java Class

Allows you to program a step using Java code.

Counting frequent words by coding in Java (Chapter 6, Transforming your Data by Coding).

User Defined Java Expression

Calculates the result of a Java Expression using Janino.

Creating a simple transformation (Chapter 2, Getting Started with Transformations). Also in Chapter 10, Creating Basic Task Flows, and Chapter 11, Creating Advanced Transformations and Jobs.

Value Mapper

Maps values of a certain field from one value to another.

Browsing new features of PDI by copying a dataset (Chapter 5, Controlling the Flow of Data).

Write to log

Writes data to the log.

Avoiding errors while converting the estimated time (Chapter 2, Getting started with Transformations). Also in Chapter 5, Controlling the Flow of Data, and Chapter 8, Working with Databases.