Pentaho Data Integration Quick Start Guide

By: María Carina Roldán

Overview of this book

Pentaho Data Integration (PDI) is an intuitive, graphical environment that combines drag-and-drop design with powerful Extract-Transform-Load (ETL) capabilities. Given its power and flexibility, initial attempts to use the tool can be difficult or confusing. This book shortens your learning curve with PDI: it provides the guidance you need to become productive and covers the main features of the tool. It demonstrates the interactive features of the graphical designer and takes you through the main ETL capabilities that PDI offers. By the end of the book, you will be able to use PDI for extracting, transforming, and loading the types of data you encounter on a daily basis.
Table of Contents (15 chapters)

Getting data from other sources

So far, we have been getting data from plain files and databases. These are two of the most common data sources, but PDI supports many more; most of them, though not all, are grouped in the Input folder. The following subsections present some useful sources that we didn't cover in the previous sections.


With PDI, you can read XML files or parse fields whose contents are in an XML structure. In both cases, you parse the XML with the Get data from XML input step. To specify the fields to read, you use XPath notation. When the XML is very big or complex, there is an alternative step: XML Input Stream (StAX).
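
To get a feel for XPath notation before configuring the step, it can help to try a few expressions outside PDI. The following is a minimal sketch in plain Python, not PDI code: the sample document, its element names, and the read_orders helper are all made up for illustration, and Python's standard library supports only a subset of XPath. It shows the general idea of selecting a repeating node with one path and then reading each field with a path relative to that node.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are illustrative only.
XML_TEXT = """
<orders>
  <order id="1001">
    <customer>Alice</customer>
    <total>25.50</total>
  </order>
  <order id="1002">
    <customer>Bob</customer>
    <total>13.00</total>
  </order>
</orders>
""".strip()


def read_orders(xml_text):
    """Return one dictionary (one 'row') per <order> node."""
    root = ET.fromstring(xml_text)
    rows = []
    # "./order" selects the repeating node: one output row per match.
    for node in root.findall("./order"):
        rows.append({
            "id": node.get("id"),                    # attribute, "@id" in XPath
            "customer": node.findtext("customer"),   # child element, relative path
            "total": float(node.findtext("total")),  # converted from text to number
        })
    return rows


rows = read_orders(XML_TEXT)
for row in rows:
    print(row)
```

Each dictionary here plays the role of one row in a PDI stream, with the keys standing in for field names.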

Similarly, you can parse JSON structures with the JSON Input step. To specify the fields in this case, you use JSONPath notation.
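
JSONPath expressions can likewise be explored outside PDI. The sketch below is plain Python and only an illustration: the json_path function is a hypothetical toy evaluator written for this example, covering just the root symbol ($), child names (.name), array indexes ([0]) and the array wildcard ([*]). It is not the implementation used by the JSON Input step, and the sample data is invented.

```python
import json
import re

# Hypothetical sample data; the structure and names are illustrative only.
JSON_TEXT = """
{"store": {"books": [
    {"title": "Book A", "price": 10.0},
    {"title": "Book B", "price": 12.5}
]}}
"""


def json_path(data, path):
    """Evaluate a tiny subset of JSONPath: $, .name, [index] and [*].

    Any other syntax (filters, recursive descent, slices) is not supported
    and is simply ignored by the tokenizer below.
    """
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    # Each token is a (name, index) pair; exactly one of the two is non-empty.
    tokens = re.findall(r"\.([A-Za-z_]\w*)|\[(\d+|\*)\]", path[1:])
    matches = [data]
    for name, index in tokens:
        next_matches = []
        for item in matches:
            if name:
                # Child access: keep the value if the key exists.
                if isinstance(item, dict) and name in item:
                    next_matches.append(item[name])
            elif index == "*":
                # Wildcard: expand every element of a list.
                if isinstance(item, list):
                    next_matches.extend(item)
            else:
                # Numeric index: keep the element if it is in range.
                if isinstance(item, list) and int(index) < len(item):
                    next_matches.append(item[int(index)])
        matches = next_matches
    return matches


doc = json.loads(JSON_TEXT)
titles = json_path(doc, "$.store.books[*].title")
first_price = json_path(doc, "$.store.books[0].price")
print(titles)
print(first_price)
```

The wildcard path yields one value per array element, which is the same shape of result you would expect when a path matches a repeating structure.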

Also, you can parse both XML and JSON structures with JavaScript or Java code, by using the Modified Java Script Value step or the User Defined Java Class step...