Pentaho Data Integration Beginner's Guide - Second Edition

By: María Carina Roldán

Overview of this book

Capturing, manipulating, cleansing, transferring, and loading data effectively are the prime requirements in every IT organization. Achieving these tasks requires either people devoted to developing extensive software programs or an investment in ETL or data integration tools that can simplify this work. Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. It has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing. "Pentaho Data Integration Beginner's Guide - Second Edition" provides the guidance needed to overcome that difficulty, covering the key features of Pentaho Data Integration. The book starts with the installation of the Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. You will also be introduced to data warehouse concepts and learn how to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity to apply and reinforce all the learned concepts through the implementation of a simple datamart. With "Pentaho Data Integration Beginner's Guide - Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.

Appendix A. Working with Repositories

Spoon allows you to store your transformations and jobs under two different configurations: file-based and database repository. In contrast to the file-based configuration that keeps the transformations and jobs in XML format as *.ktr and *.kjb files in the local filesystem, the database repository configuration keeps the same information in tables in a relational database.

Although working with the file-based system is simple and practical, the database repository method can be convenient in some situations.

The following is a list of some of the distinctive repository features:

Repositories implement security. In order to work with a repository, you need credentials.
Repositories are, by their nature, prepared for basic team development. The elements you create (transformations, jobs, database connections, and so on) are shared by all the repository users as soon as you create them.
The Enterprise Repository is a Java content repository capable of more robust and scalable collaborative functions such as version control, locking, and more.

Before you decide on working with a repository, you have to be aware of the file-based system benefits you lose. Here are some examples:

When working with the database repository-based system, you need access to the repository database. If for some reason you cannot access it (for example, network problems), you will not be able to work. You don't have this restriction when working with files, where you only need the software and the transformation and job files, that is, the .ktr and .kjb files.
When working with database repositories, it is difficult to keep track of changes. Working with the filesystem, it is easier to know which jobs or transformations were modified. If you use Subversion or Git, you even have version control that allows you to examine the history of changes and to recover older versions of your work if necessary.
Suppose that you want to search and replace some text in all the jobs and transformations. If you are working with repositories, you would have to do it for each table in the repository database. With the file-based system, by contrast, this task is quite simple. For example, you could create a Sublime Text project (the editor is available for download at www.sublimetext.com), open the root directory of your jobs and transformations, and do the task by using the Sublime Text utilities.
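
The search-and-replace task on the file-based system can also be scripted. The following is a minimal sketch in Python, not something taken from the book: the function name replace_in_kettle_files is hypothetical, and the sketch assumes that the .ktr and .kjb files are UTF-8 encoded XML and that a plain text substitution (not an XML-aware one) is acceptable.

```python
from pathlib import Path


def replace_in_kettle_files(root, old, new, suffixes=(".ktr", ".kjb")):
    """Replace `old` with `new` in every job and transformation file under `root`.

    Returns the list of files that were actually modified.
    """
    changed = []
    for path in sorted(Path(root).rglob("*")):
        # Skip directories and anything that is not a .ktr or .kjb file.
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # Assumption: files are UTF-8 encoded. Reading bytes and decoding
        # explicitly keeps the original line endings untouched.
        text = path.read_bytes().decode("utf-8")
        if old in text:
            path.write_bytes(text.replace(old, new).encode("utf-8"))
            changed.append(path)
    return changed
```

For instance, calling replace_in_kettle_files("my_pdi_project", "old_server", "new_server") would rewrite every matching file below the my_pdi_project directory; both the directory and the two strings are placeholders. Because the files are overwritten in place, it is prudent to run such a script only on a directory that is under version control or has a backup.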

As explained in Chapter 1, Getting Started with Pentaho Data Integration, there is a third method, the File repository, which is a mix of the two mentioned earlier: a repository of jobs and transformations stored in the filesystem.

Note

The use of the File repository is similar to the database repository. Therefore, we will not explain it in this appendix. You should not have any difficulty in trying it once you understand how to work with the database repository.

This appendix shows you how to create a database repository and how to work with it. You can try repositories and decide for yourself which method, database repository-based or file-based, suits you best.