Pentaho Data Integration Beginner's Guide - Second Edition

Pentaho Data Integration Beginner's Guide - Second Edition - Second Edition

By : María Carina Roldán

Buy this Book

Pentaho Data Integration Beginner's Guide - Second Edition - Second Edition

By: María Carina Roldán

Buy this Book

Overview of this book

Capturing, manipulating, cleansing, transferring, and loading data effectively are the prime requirements in every IT organization. Achieving these tasks require people devoted to developing extensive software programs, or investing in ETL or data integration tools that can simplify this work. Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. Pentaho Data Integration has an intuitive, graphical, drag-and-drop design environment and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing. "Pentaho Data Integration Beginner's Guide - Second Edition" provides the guidance needed to overcome that difficulty, covering all the possible key features of Pentaho Data Integration. "Pentaho Data Integration Beginner's Guide - Second Edition" starts with the installation of Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. Moreover, you will be introduced to data warehouse concepts and you will learn how to load data in a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity of applying and reinforcing all the learned concepts through the implementation of a simple datamart. With "Pentaho Data Integration Beginner's Guide - Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.

Pentaho Data Integration Beginner's Guide

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

Free Chapter

Getting Started with Pentaho Data Integration

Pentaho Data Integration and Pentaho BI Suite

Exploring the Pentaho Demo

Installing PDI

Time for action – installing PDI

Launching the PDI graphical designer – Spoon

Time for action – starting and customizing Spoon

Name	Purpose	Time for action
Abort	Aborts a transformation.	Generating custom messages by setting a variable with the name of the examination file (Chapter 11, Creating Advanced Transformations and Jobs). Also in Chapter 12, Developing and Implementing a Simple Datamart.
Add constants	Adds one or more constants to the input rows.	Gathering progress and merging all together (Chapter 5, Controlling the Flow of Data). Also in Chapter 8, Working with Databases, Chapter 9, Performing Advanced Operations with Databases, and Chapter 12, Developing and Implementing a Simple Datamart.
Add sequence	Gets the next value from a sequence.	Assigning tasks by distributing (Chapter 5, Controlling the Flow of Data). Also in Chapter 11, Creating Advanced Transformations and Jobs, and Chapter 12, Developing and Implementing a Simple Datamart.
Append streams	Appends two streams in an ordered way.	Giving priority to Bouchard by using Append Stream (Chapter 5, Controlling the Flow of Data).
Calculator	Creates new fields by performing simple calculations.	Avoiding errors while converting the estimated time (Chapter 2, Getting started with Transformations). Also in Chapter 3, Manipulating Real-world Data, Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data, Chapter 5, Controlling the Flow of Data, Chapter 7, Transforming the Rowset, Chapter 8, Working with Databases, Chapter 10, Creating Basic Task Flows, and Chapter 12, Developing and Implementing a Simple Datamart.
Clone row	Clones a row as many times as needed.	Generating a range of dates and inspecting (Chapter 2, Getting started with Transformations). Also in Chapter 7, Transforming the Rowset, and Chapter 10, Creating Basic Task Flows.
Combination lookup/update	Updates a junk dimension in a data-warehouse. Alternatively, looks up information in this dimension.	Loading a region dimension with a Combination lookup-update step (Chapter 9, Performing Advanced Operations with Databases). Also in Chapter 12, Developing and Implementing a Simple Datamart.
Copy rows to result	Writes rows to the executing job. The information will then be passed to the next entry in the job.	Generating top average scores by copying and getting rows (Chapter 11, Creating Advanced Transformations and Jobs).
Data Grid	Allows you to create rows of static data in a grid, usually for testing, reference, or demo purposes.	Avoiding errors while converting the estimated time (Chapter 2, Getting started with Transformations). Also in Chapter 5, Controlling the Flow of Data.
Database join	Executes a database query using stream values as parameters.	Using a Database join step to create a list of suggested products to buy (Chapter 9, Performing Advanced Operations with Databases).
Database lookup	Looks up values in a database using field values.	Using a Database lookup step to create a list of products to buy (Chapter 9, Performing Advanced Operations with Databases). Also in Chapter 12, Developing and Implementing a Simple Datamart.
Delay row	Outputs each input row after a delay.	Generating a range of dates and inspecting (Chapter 2, Getting started with Transformations) Also in Chapter 10, Creating Basic Task Flows.
Delete	Deletes data in a database table based upon keys.	Deleting data about discontinued items (Chapter 8, Working with Databases).
Dimension lookup/update	Updates a slowly changing dimension in a data-warehouse. Alternatively, look up information in this dimension.	Keeping a history of changes in product by using the Dimension lookup-update step (Chapter 9, Performing Advanced Operations with Databases). Also in Chapter 12, Developing and Implementing a Simple Datamart.
Dummy (do nothing)	This step type doesn’t do anything. It’s useful however when testing, or in certain situations where you want to split streams.	Creating a hello world transformation (Chapter 1, Getting started with Pentaho Data Integration). Also in Chapter 3, Manipulating Real-world Data, Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data, Chapter 6, Transforming your Data by Coding, Chapter 7, Transforming the Rowset, Chapter 9, Performing Advanced Operations with Databases, Chapter 11, Creating Advanced Transformations and Jobs, and Chapter 12, Developing and Implementing a Simple Datamart.
Filter rows	Filters rows using simple equations.	Counting frequent words by filtering (Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data). Also in Chapter 5, Controlling the Flow of Data, Chapter 7, Transforming the Rowset, Chapter 9, Performing Advanced Operations with Databases, Chapter 11, Creating Advanced Transformations and Jobs, and Chapter 12, Developing and Implementing a Simple Datamart.
Formula	Calculates a formula using Pentaho’s `libformula`.	Generating custom files by executing a transformation for every input row (Chapter 11, Creating Advanced Transformations and Jobs).
Generate random value	Generates random values.	Have a go Hero – measuring the performance of input steps (Chapter 3, Manipulating Real-world Data).
Generate Rows	Generates a number of empty or equal rows.	Creating a hello world transformation (Chapter 1, Getting started with Pentaho Data Integration). Also in Chapter 2, Getting started with Transformations, Chapter 3, Manipulating Real-world Data, Chapter 7, Transforming the Rowset, Chapter 9, Performing Advanced Operations with Databases, Chapter 10, Creating Basic Task Flows, and Chapter 12, Developing and Implementing a Simple Datamart.
Get data from XML	Gets data from an XML file by using `XPath`. This step also allows you to parse XML defined in a previous field.	Getting data from an XML file with information about countries (Chapter 3, Manipulating Real-world Data). Also in Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data, and Chapter 9, Performing Advanced Operations with Databases.
Get rows from result	Reads rows from a previous entry in a job.	Generating top average scores by copying and getting rows (Chapter 11, Creating Advanced Transformations and Jobs).
Get System Info	Gets information from the system such as system date, arguments, and so on.	Reading and writing matching files with flexibility (Chapter 3, Manipulating Real-world Data). Also in Chapter 8, Working with Databases, Chapter 10, Creating Basic Task Flows, Chapter 11, Creating Advanced Transformations and Jobs, and Chapter 12, Developing and Implementing a Simple Datamart.
Get Variables	Determines the values of certain (environment or Kettle) variables and puts them in field values.	Creating a time dimension dataset (Chapter 7, Transforming the Rowset). Also in Chapter 12, Developing and Implementing a Simple Datamart.
Group by	Builds aggregates in a group by fashion. This works only on a sorted input. If the input is not sorted, only double consecutive rows are handled correctly.	Calculating football matches statistics by grouping data (Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data). Also in Chapter 5, Controlling the Flow of Data, Chapter 9, Performing Advanced Operations with Databases, and Chapter 11, Creating Advanced Transformations and Jobs.
If field value is null	Sets a field value to a constant if it is null.	Enhancing a films file by converting rows to columns (Chapter 7, Transforming the Rowset).
Insert/Update	Updates or insert rows in a database based upon keys.	Inserting new products or updating existent ones (Chapter 8, Working with Databases).
Java Filter	Filters rows using Java code.	Generating top average scores by copying and getting rows (Chapter 11, Creating Advanced Transformations and Jobs).
Mapping (subtransformation)	Run a subtransformation, use `MappingInput` and `MappingOutput` to specify the fields interface.	Calculating statistics with the use of a subtransformation (Chapter 11, Creating Advanced Transformations and Jobs).
Mapping input specification	Specifies the input interface of a mapping.	Calculating statistics with the use of a subtransformation (Chapter 11, Creating Advanced Transformations and Jobs).
Mapping output specification	Specifies the output interface of a mapping.	Calculating statistics with the use of a subtransformation (Chapter 11, Creating Advanced Transformations and Jobs).
Microsoft Excel Input	Reads data from Excel and OpenOffice Workbooks (`.xls`, `.xlsx`, and `.ods`).	Browsing new features of PDI by copying a dataset (Chapter 5, Controlling the Flow of Data). Also in Chapter 8, Working with Databases, and Chapter 12, Developing and Implementing a Simple Datamart.
Microsoft Excel Output	Stores records into an Excel (`.xl` `s`) document.	Getting data from an XML file with information about countries (Chapter 3, Manipulating Real-World Data). Also in Chapter 5, Controlling the Flow of Data.
Modified JavaScript Value	Allows you to code JavaScript to modify or create new fields.	Counting frequent words by coding in JavaScript (Chapter 6, Transforming your Data by Coding). Also in Chapter 7, Transforming the Rowset.
Number range	Creates ranges based on a numeric field.	Creating a simple transformation (Chapter 2, Getting Started with Transformations). Also in Chapter 8, Working with Databases.
Regex Evaluation	Evaluates a field using a regular expression. It can also extract new fields out of an existing field with capturing groups.	Loading the fact table using a range of dates (Chapter 12, Developing and Implementing a Simple Datamart).
Replace in string	Replaces all occurrences of a word in a string with another word.	Fixing words before counting them (Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data).
Row denormaliser	Denormalizes rows by looking up key-value pairs and assigning them to new fields in the output rows. This method aggregates and needs the input rows to be sorted on the grouping fields.	Enhancing a films file by converting rows to columns (Chapter 7, Transforming the Rowset).
Row Normaliser	Denormalized information can be normalized using this step-type.	Enhancing the matches files by normalizing the dataset (Chapter 7, Transforming the Rowset). Also in Chapter 12, Developing and Implementing a Simple Datamart.
Select values	Selects, reorders, or removes fields. Optionally, allows you to change the metadata: type, length, and precision.	Avoiding errors while converting the estimated time (Chapter 2, Getting Started with Transformations). Also in Chapter 3, Manipulating Real-World Data, Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data, Chapter 5, Controlling the Flow of Data, Chapter 7, Transforming the Rowset, Chapter 8, Working with Databases, Chapter 9, Performing Advanced Operations with Databases, Chapter 10, Creating Basic Task Flows, and Chapter 12, Developing and Implementing a Simple Datamart.
Set Variables	Sets Kettle variables based on a single-input row.	Generating custom messages by setting a variable with the name of the examination file (Chapter 11, Creating Advanced Transformations and Jobs). Also in Chapter 12, Developing and Implementing a Simple Datamart.

Pentaho Data Integration Beginner's Guide - Second Edition - Second Edition

By : María Carina Roldán

Pentaho Data Integration Beginner's Guide - Second Edition - Second Edition

By: María Carina Roldán

Overview of this book

Related Content you might be interested in

Current Title:

Pentaho Data Integration Beginner's Guide - Second Edition - Second Edition

Transformation steps