Pentaho Data Integration Quick Start Guide

By: María Carina Roldán

Overview of this book

Pentaho Data Integration (PDI) is an intuitive graphical environment that combines drag-and-drop design with powerful Extract-Transform-Load (ETL) capabilities. Given its power and flexibility, first attempts to use the tool can be difficult or confusing. This book shortens that learning curve: it covers the main features of PDI, demonstrates the interactive features of the graphical designer, and takes you through the main ETL capabilities that the tool offers. By the end of the book, you will be able to use PDI to extract, transform, and load the types of data you encounter on a daily basis.

Getting data from plain files


In this section, you will learn how to get data from plain files (for example, .txt and .csv files). We will start by reading and configuring such files, and then we will look at how PDI allows you to read multiple files at once, compressed files, and files stored in remote locations.

Reading plain files

In the previous chapter, we experimented with reading a simple file, but this time we will go into detail on getting and properly configuring a simple file's metadata.
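To make the idea of a plain file's metadata more concrete, the following Python sketch works outside PDI and derives the three things you typically have to configure for a delimited file: the delimiter, the field names, and a data type for each field. It is only an illustration under stated assumptions. The sample rows, the function names, and the type-guessing rule are all hypothetical, and the sketch does not claim to reproduce how PDI itself inspects a file.

```python
import csv
import io

# Hypothetical sample rows, invented for illustration (not the book's data).
SAMPLE = "room_id;room_type;price\n101;Private room;65.5\n102;Shared room;30\n"

CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]


def guess_delimiter(header_line):
    """Pick the candidate delimiter that occurs most often in the header line."""
    return max(CANDIDATE_DELIMITERS, key=header_line.count)


def guess_type(values):
    """Return 'Integer', 'Number', or 'String' for a column of raw strings."""
    for type_name, caster in (("Integer", int), ("Number", float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue  # at least one value did not fit; try the next type
    return "String"


def describe(text):
    """Return (delimiter, [(field name, guessed type), ...]) for delimited text."""
    delimiter = guess_delimiter(text.splitlines()[0])
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    columns = zip(*data)  # transpose the data rows into columns
    fields = [(name, guess_type(col)) for name, col in zip(header, columns)]
    return delimiter, fields


delimiter, fields = describe(SAMPLE)
print("Delimiter:", repr(delimiter))
for name, type_name in fields:
    print(name, "->", type_name)
```

A guess made from a handful of rows can be wrong for the rest of the file, which is why it pays to review the field definitions rather than accept them blindly.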

Note

For this and some later exercises in this book, we will use .csv files containing surveys of the Airbnb website. The sample data can be downloaded from http://tomslee.net/airbnb-data-collection-get-the-data.

For this exercise, we will read and configure a file with data about a survey carried out in Amsterdam. The file looks as follows:

Sample file

This time, we will use a Text file input step, which is much more flexible than the CSV file input step that you are already familiar with:

  1. Create a...