Learning Pentaho Data Integration 8 CE - Third Edition

Executing jobs in an iterative way


For a long time, PDI developers used to ask: can I run a Job inside a transformation? The answer was definitely no. To meet that requirement, the only option was to nest jobs and transformations in complex ways. Now you can avoid all that unnecessary work by looping over a Job in a much simpler way. There is a Job Executor step, analogous to the Transformation Executor step that you already know, which can easily be configured to loop over the rows of a dataset.

Using Job executors

The Job Executor is a PDI step that allows you to execute a Job several times, simulating a loop. The executor receives a dataset and then executes the Job once for each row, or once for each set of rows, of the incoming dataset.
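To picture this behavior, the following plain Python sketch models the executor's looping semantics. It is only a conceptual illustration, not PDI's actual API: the names run_job and group_size are hypothetical stand-ins for the Job being executed and the step's row-grouping setting.

# Conceptual sketch (plain Python, not the PDI API): the Job Executor
# runs a Job once per incoming row, or once per group of rows.
def job_executor(rows, run_job, group_size=1):
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == group_size:
            run_job(batch)   # one Job execution per group of rows
            batch = []
    if batch:                # leftover rows form a final, smaller group
        run_job(batch)

# Example: each row supplies the parameters for one execution of the Job
rows = [
    {"folder": "reports", "file": "jan.txt"},
    {"folder": "reports", "file": "feb.txt"},
]
job_executor(rows, lambda batch: print("running Job with", batch))

With group_size=1, as in this sketch's default, the Job runs once per row, which is the behavior used in the example that follows.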

To understand how this works, we will build a very simple example. The Job that we will execute will have two parameters: a folder and a file. It will create the folder, and then it will create an empty file inside the new folder. Both the name of the folder and the name of the...