Pentaho 3.2 Data Integration: Beginner's Guide
Overview of this book

Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated. This book is full of practical examples that will help you take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease.

Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. You will also be given an introduction to data warehouse concepts and learn how to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned throughout the book.

Transformation steps


The following table includes all the transformation steps used in the book. For a full list of steps and their descriptions, select Help | Show step plug-in information in Spoon's main menu.

You can also visit http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+v3.2.+Steps for a full step reference along with some examples.

| Name | Purpose | Time for action |
| --- | --- | --- |
| Abort | Aborts a transformation | Aborting when there are too many errors (Chapter 7); also in Chapters 11 and 12 |
| Add constants | Adds one or more constant fields to the stream | Gathering progress and merging all together (Chapter 4); also in Chapters 7, 8, and 9 |
| Add sequence | Gets the next value from a sequence | Assigning tasks by Distributing (Chapter 4); also in Chapters 6 and 11 |
| Append streams | Appends two streams in an ordered way | Giving priority to Bouchard by using Append Stream (Chapter 4) |
| Calculator | Creates new fields by performing simple calculations | Reviewing examinations by using the Calculator step (Chapter 3); also in Chapters 6 and 8 |
| Combination lookup/update | Updates a junk dimension. Alternatively, it can be used to update a Type I SCD. | Loading a region dimension with a Combination lookup/update step (Chapter 9); also in Chapter 12 |
| Copy rows to result | Writes rows to the executing job. The information will then be passed to the next entry in the job. | Splitting the generation of top scores by copying and getting rows (Chapter 11) |
| Data Validator | Validates fields based on a set of rules | Checking films file with the Data Validator (Chapter 7) |
| Database join | Executes a database query using stream values as parameters | Using a Database join step to create a list of suggested products to buy (Chapter 9) |
| Database lookup | Looks up values in a database table | Using a Database lookup step to create a list of products to buy (Chapter 9); also in Chapter 12 |
| Delay row | For each incoming row, waits a given time before passing the row to the next step | Generating custom files by executing a transformation for every input row (Chapter 11) |
| Delete | Deletes data in a database table | Deleting data about discontinued items (Chapter 8) |
| Dimension lookup/update | Updates or looks up a Type II SCD. Alternatively, it can be used to update a Type I SCD or hybrid dimensions. | Keeping a history of product changes with the Dimension lookup/update step (Chapter 9); also in Chapter 12 |
| Dummy (do nothing) | This step type doesn't do anything! However, it is often used. | Creating a hello world transformation (Chapter 1); also in Chapters 2, 3, 7, and 9 |
| Excel Input | Reads data from a Microsoft Excel (.xls) file | Browsing PDI new features by copying a dataset (Chapter 4); also in Chapter 8 |
| Excel Output | Writes data to a Microsoft Excel (.xls) file | Getting data from an XML file with information about countries (Chapter 2); also in Chapters 4 and 10 |
| Filter rows | Splits the stream in two based on a given condition. Alternatively, it can be used to let only the rows that meet the condition pass. | Counting frequent words by filtering (Chapter 3); also in Chapters 4, 6, 7, 9, 11, and 12 |
| Fixed file input | Reads data from a fixed-width file | Calculating Scores with JavaScript (Chapter 5) |
| Formula | Creates new fields by using formulas. It uses Pentaho's libformula. | Reviewing examinations by using the Formula step (Chapter 3); also in Chapters 10 and 11 |
| Generate Rows | Generates a number of equal rows | Creating a hello world transformation (Chapter 1); also in Chapters 6, 9, and 10 |
| Get data from XML | Gets data from XML files | Getting data from an XML file with information about countries (Chapter 2); also in Chapters 3 and 9 |
| Get rows from result | Reads rows from a previous entry in a job | Splitting the generation of top scores by copying and getting rows (Chapter 11) |
| Get System Info | Gets information from the system, such as the system date, arguments, and so on | Updating a file with news about examinations (Chapter 2); also in Chapters 7, 8, 10, 11, and 12 |
| Get Variables | Takes the values of environment or Kettle variables and adds them as fields in the stream | Creating the time dimension dataset (Chapter 6) |
| Group by | Builds aggregates in a group by fashion. This works only on sorted input; if the input is not sorted, only double consecutive rows are handled correctly. | Calculating World Cup statistics by grouping data (Chapter 3); also in Chapters 4, 7, and 9 |
| If field value is null | If a field is null, changes its value to a constant. It can be applied to all fields of the same data type, or to particular fields. | Enhancing a films file by converting rows to columns (Chapter 6) |
| Insert / Update | Updates or inserts rows in a database table | Inserting new products or updating existent ones (Chapter 8) |
| Mapping (sub-transformation) | Runs a subtransformation | Calculating the top scores with a subtransformation (Chapter 11) |
| Mapping input specification | Specifies the input interface of a subtransformation | Calculating the top scores with a subtransformation (Chapter 11) |
| Mapping output specification | Specifies the output interface of a subtransformation | Calculating the top scores with a subtransformation (Chapter 11) |
| Modified Java Script Value | Allows you to code JavaScript that modifies or creates new fields; it's also possible to code Java. See the script sketch after this table. | Calculating Scores with JavaScript (Chapter 5); also in Chapters 6, 7, and 11 |
| Number range | Creates ranges based on a numeric field | Capturing errors while calculating the age of a film (Chapter 7); also in Chapter 8 |
| Regex Evaluation | Evaluates a field with a regular expression | Validating Genres with a Regex Evaluation step (Chapter 7); also in Chapter 12 |
| Row denormaliser | Denormalises rows by looking up key-value pairs | Enhancing a films file by converting rows to columns (Chapter 6) |
| Row Normaliser | Normalises denormalised data | Enhancing the matches file by normalizing the dataset (Chapter 6) |
| Select values | Selects, reorders, or removes fields. Also allows you to change the metadata of fields. | Reading all your files at a time using a single Text file input step (Chapter 2); also in Chapters 3, 4, 6, 7, 8, 9, 11, and 12 |
| Set Variables | Sets Kettle variables based on a single input row | Updating a file with news about examinations by setting a variable with the name of the file (Chapter 11); also in Chapter 12 |
| Sort rows | Sorts rows based upon field values, ascending or descending | Reviewing examinations by using the Calculator step (Chapter 3); also in Chapters 4, 6, 7, 8, 9, and 11 |
| Split field to rows | Splits a single string field and creates a new row for each split term | Counting frequent words by filtering (Chapter 3) |
| Split Fields | Splits a single field into more than one | Calculating World Cup statistics by grouping data (Chapter 3); also in Chapters 6 and 11 |
| Stream lookup | Looks up values coming from another stream in the transformation | Finding out which language people speak (Chapter 3); also in Chapter 6 |
| Switch / Case | Switches a row to a certain target step based on the value of a field | Assigning tasks by filtering priorities with the Switch/Case step (Chapter 4) |
| Table input | Reads data from a database table | Getting data about shipped orders (Chapter 8); also in Chapters 9, 10, and 12 |
| Table output | Writes data to a database table | Loading a table with a list of manufacturers (Chapter 8); also in Chapters 9 and 12 |
| Text file input | Reads data from a text file | Reading all your files at a time using a single Text file input step (Chapter 2); also in Chapters 3, 5, 6, 7, 8, and 11 |
| Text file output | Writes data to a text file | Sending the results of matches to a plain file (Chapter 2); also in Chapters 3, 7, 9, 10, and 11 |
| Update | Updates data in a database table | Loading a region dimension with a Combination lookup/update step (Chapter 9) |
| Value Mapper | Maps values of a certain field from one value to another | Browsing PDI new features by copying a dataset (Chapter 4) |
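The Modified Java Script Value step listed above expects a script in which every input field of the current row is available as a JavaScript variable, and any value you want to add to the stream must also be declared as a new field in the step's Fields grid. The following is a minimal sketch of such a script; the field names (writing, reading, speaking, listening) and the passing threshold are hypothetical, chosen only to mirror the kind of score calculation done in Chapter 5.

```javascript
// Runs once per row. Input fields of the current row (here the hypothetical
// writing, reading, speaking, and listening scores) are exposed as variables.
var totalScore = writing + reading + speaking + listening;

// Derive a pass/fail flag; the threshold of 240 is an assumption for illustration only.
var passed = (totalScore >= 240) ? 'Yes' : 'No';

// To send totalScore and passed to the next step, add them (with their types)
// as new fields in the Fields grid at the bottom of the step dialog.
```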