Pentaho 3.2 Data Integration: Beginner's Guide
Overview of this book

Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated. This book is full of practical examples that will help you take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease.

Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. You will also be given an introduction to data warehouse concepts and learn how to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned throughout the book.

Transformation steps


The following table includes all the transformation steps used in the book. For a full list of steps and their descriptions, select Help | Show step plug-in information in Spoon's main menu.

You can also visit http://wiki.pentaho.com/display/EAI/Pentaho+Data+Integration+v3.2.+Steps for a full step reference along with some examples.

| Name | Purpose | Time for action |
| --- | --- | --- |
| Abort | Aborts a transformation | Aborting when there are too many errors (Chapter 7); also in Chapters 11 and 12 |
| Add constants | Adds one or more constant fields to the stream | Gathering progress and merging all together (Chapter 4); also in Chapters 7, 8, and 9 |
| Add sequence | Gets the next value from a sequence | Assigning tasks by Distributing (Chapter 4); also in Chapters 6 and 11 |
| Append streams | Appends two streams in an ordered way | Giving priority to Bouchard by using Append Stream (Chapter 4) |
| Calculator | Creates new fields by performing simple calculations | Reviewing examinations by using the Calculator step (Chapter 3); also in Chapters 6 and 8 |
| Combination lookup/update | Updates a junk dimension. Alternatively, it can be used to update a Type I SCD. | Loading a region dimension with a Combination lookup/update step (Chapter 9); also in Chapter 12 |
| Copy rows to result | Writes rows to the executing job. The information will then be passed to the next entry in the job. | Splitting the generation of top scores by copying and getting rows (Chapter 11) |
| Data Validator | Validates fields based on a set of rules | Checking films file with the Data Validator (Chapter 7) |
| Database join | Executes a database query using stream values as parameters | Using a Database join step to create a list of suggested products to buy (Chapter 9) |
| Database lookup | Looks up values in a database table | Using a Database lookup step to create a list of products to buy (Chapter 9); also in Chapter 12 |
| Delay row | For each incoming row, waits a given time before passing the row to the next step | Generating custom files by executing a transformation for every input row (Chapter 11) |
| Delete | Deletes data in a database table | Deleting data about discontinued items (Chapter 8) |
| Dimension lookup/update | Updates or looks up a Type II SCD. Alternatively, it can be used to update a Type I SCD or hybrid dimensions. | Keeping a history of product changes with the Dimension lookup/update step (Chapter 9); also in Chapter 12 |
| Dummy (do nothing) | This step type doesn't do anything! However, it is often used. | Creating a hello world transformation (Chapter 1); also in Chapters 2, 3, 7, and 9 |
| Excel Input | Reads data from a Microsoft Excel (.xls) file | Browsing PDI new features by copying a dataset (Chapter 4); also in Chapter 8 |
| Excel Output | Writes data to a Microsoft Excel (.xls) file | Getting data from an XML file with information about countries (Chapter 2); also in Chapters 4 and 10 |
| Filter rows | Splits the stream in two based on a given condition. Alternatively, it can be used to let only the rows that meet the condition pass. | Counting frequent words by filtering (Chapter 3); also in Chapters 4, 6, 7, 9, 11, and 12 |
| Fixed file input | Reads data from a fixed-width file | Calculating Scores with JavaScript (Chapter 5) |
| Formula | Creates new fields by using formulas. It uses Pentaho's libformula. | Reviewing examinations by using the Formula step (Chapter 3); also in Chapters 10 and 11 |
| Generate Rows | Generates a number of equal rows | Creating a hello world transformation (Chapter 1); also in Chapters 6, 9, and 10 |
| Get data from XML | Gets data from XML files | Getting data from an XML file with information about countries (Chapter 2); also in Chapters 3 and 9 |
| Get rows from result | Reads rows from a previous entry in a job | Splitting the generation of top scores by copying and getting rows (Chapter 11) |
| Get System Info | Gets information from the system, such as the system date, arguments, and so on | Updating a file with news about examinations (Chapter 2); also in Chapters 7, 8, 10, 11, and 12 |
| Get Variables | Takes the values of environment or Kettle variables and adds them as fields in the stream | Creating the time dimension dataset (Chapter 6) |
| Group by | Builds aggregates in a group by fashion. This works only on sorted input; if the input is not sorted, only double consecutive rows are handled correctly. | Calculating World Cup statistics by grouping data (Chapter 3); also in Chapters 4, 7, and 9 |
| If field value is null | If a field is null, changes its value to a constant. It can be applied to all fields of the same data type, or to particular fields. | Enhancing a films file by converting rows to columns (Chapter 6) |
| Insert / Update | Updates or inserts rows in a database table | Inserting new products or updating existent ones (Chapter 8) |
| Mapping (sub-transformation) | Runs a subtransformation | Calculating the top scores with a subtransformation (Chapter 11) |
| Mapping input specification | Specifies the input interface of a subtransformation | Calculating the top scores with a subtransformation (Chapter 11) |
| Mapping output specification | Specifies the output interface of a subtransformation | Calculating the top scores with a subtransformation (Chapter 11) |
| Modified Java Script Value | Allows you to code JavaScript that modifies or creates new fields; it's also possible to code Java. See the script sketch after this table. | Calculating Scores with JavaScript (Chapter 5); also in Chapters 6, 7, and 11 |
| Number range | Creates ranges based on a numeric field | Capturing errors while calculating the age of a film (Chapter 7); also in Chapter 8 |
| Regex Evaluation | Evaluates a field with a regular expression | Validating Genres with a Regex Evaluation step (Chapter 7); also in Chapter 12 |
| Row denormaliser | Denormalises rows by looking up key-value pairs | Enhancing a films file by converting rows to columns (Chapter 6) |
| Row Normaliser | Normalises denormalised data | Enhancing the matches file by normalizing the dataset (Chapter 6) |
| Select values | Selects, reorders, or removes fields. Also allows you to change the metadata of fields. | Reading all your files at a time using a single Text file input step (Chapter 2); also in Chapters 3, 4, 6, 7, 8, 9, 11, and 12 |
| Set Variables | Sets Kettle variables based on a single input row | Updating a file with news about examinations by setting a variable with the name of the file (Chapter 11); also in Chapter 12 |
| Sort rows | Sorts rows based upon field values, ascending or descending | Reviewing examinations by using the Calculator step (Chapter 3); also in Chapters 4, 6, 7, 8, 9, and 11 |
| Split field to rows | Splits a single string field and creates a new row for each split term | Counting frequent words by filtering (Chapter 3) |
| Split Fields | Splits a single field into more than one | Calculating World Cup statistics by grouping data (Chapter 3); also in Chapters 6 and 11 |
| Stream lookup | Looks up values coming from another stream in the transformation | Finding out which language people speak (Chapter 3); also in Chapter 6 |
| Switch / Case | Switches a row to a certain target step based on the value of a field | Assigning tasks by filtering priorities with the Switch/Case step (Chapter 4) |
| Table input | Reads data from a database table | Getting data about shipped orders (Chapter 8); also in Chapters 9, 10, and 12 |
| Table output | Writes data to a database table | Loading a table with a list of manufacturers (Chapter 8); also in Chapters 9 and 12 |
| Text file input | Reads data from a text file | Reading all your files at a time using a single Text file input step (Chapter 2); also in Chapters 3, 5, 6, 7, 8, and 11 |
| Text file output | Writes data to a text file | Sending the results of matches to a plain file (Chapter 2); also in Chapters 3, 7, 9, 10, and 11 |
| Update | Updates data in a database table | Loading a region dimension with a Combination lookup/update step (Chapter 9) |
| Value Mapper | Maps values of a certain field from one value to another | Browsing PDI new features by copying a dataset (Chapter 4) |
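The Modified Java Script Value step listed above expects a script in which every input field of the current row is available as a JavaScript variable, and any value you want to add to the stream must also be declared as a new field in the step's Fields grid. The following is a minimal sketch of such a script; the field names (writing, reading, speaking, listening) and the passing threshold are hypothetical, chosen only to mirror the kind of score calculation done in Chapter 5.

```javascript
// Runs once per row. Input fields of the current row (here the hypothetical
// writing, reading, speaking, and listening scores) are exposed as variables.
var totalScore = writing + reading + speaking + listening;

// Derive a pass/fail flag; the threshold of 240 is an assumption for illustration only.
var passed = (totalScore >= 240) ? 'Yes' : 'No';

// To send totalScore and passed to the next step, add them (with their types)
// as new fields in the Fields grid at the bottom of the step dialog.
```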