Pentaho Data Integration Beginner's Guide - Second Edition

By: María Carina Roldán

Overview of this book

Capturing, manipulating, cleansing, transferring, and loading data effectively are prime requirements in every IT organization. Achieving these tasks requires either people devoted to developing extensive software programs, or an investment in ETL or data integration tools that can simplify this work. Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. It has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing. "Pentaho Data Integration Beginner's Guide - Second Edition" provides the guidance needed to overcome that difficulty, covering the key features of Pentaho Data Integration.

The book starts with the installation of the Pentaho Data Integration software and then moves on to cover the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. Moreover, you will be introduced to data warehouse concepts and you will learn how to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity to apply and reinforce all the learned concepts through the implementation of a simple datamart. With "Pentaho Data Integration Beginner's Guide - Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.

Pentaho Data Integration Beginner's Guide

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

Getting Started with Pentaho Data Integration

Pentaho Data Integration and Pentaho BI Suite

Exploring the Pentaho Demo

Installing PDI

Time for action – installing PDI

Launching the PDI graphical designer – Spoon

Time for action – starting and customizing Spoon

Time for action – creating a hello world transformation

Installing MySQL

Time for action – installing MySQL on Windows

Time for action – installing MySQL on Ubuntu

Summary

Getting Started with Transformations

Designing and previewing transformations

Time for action – creating a simple transformation and getting familiar with the design process

Running transformations in an interactive fashion

Time for action – generating a range of dates and inspecting the data as it is being created

Handling errors

Time for action – avoiding errors while converting the estimated time from string to integer

Time for action – configuring the error handling to see the description of the errors

Summary

Manipulating Real-world Data

Reading data from files

Time for action – reading results of football matches from files

Time for action – reading all your files at a time using a single text file input step

Time for action – reading all your files at a time using a single text file input step and regular expressions

Sending data to files

Time for action – sending the results of matches to a plain file

Getting system information

Time for action – reading and writing matches files with flexibility

Time for action – running the matches transformation from a terminal window

XML files

Time for action – getting data from an XML file with information about countries

Summary

Filtering, Searching, and Performing Other Useful Operations with Data

Sorting data

Time for action – sorting information about matches with the Sort rows step

Calculations on groups of rows

Time for action – calculating football match statistics by grouping data

Filtering

Time for action – counting frequent words by filtering

Time for action – refining the counting task by filtering even more

Looking up data

Time for action – finding out which language people speak

Controlling the Flow of Data

Splitting streams

Time for action – browsing new features of PDI by copying a dataset

Time for action – assigning tasks by distributing

Splitting the stream based on conditions

Time for action – assigning tasks by filtering priorities with the Filter rows step

Time for action – assigning tasks by filtering priorities with the Switch/Case step

Merging streams

Time for action – gathering progress and merging it all together

Time for action – giving priority to Bouchard by using the Append Stream

Treating invalid data by splitting and merging streams

Time for action – treating errors in the estimated time to avoid discarding rows

Summary

Transforming Your Data by Coding

Doing simple tasks with the JavaScript step

Time for action – counting frequent words by coding in JavaScript

Reading and parsing unstructured files with JavaScript

Time for action – changing a list of house descriptions with JavaScript

Doing simple tasks with the Java Class step

Time for action – counting frequent words by coding in Java

Transforming the dataset with Java

Time for action – splitting the field to rows using Java

Avoiding coding by using purpose built steps

Summary

Transforming the Rowset

Converting rows to columns

Time for action – enhancing the films file by converting rows to columns

Aggregating data with a Row Denormaliser step

Time for action – aggregating football matches data with the Row Denormaliser step

Normalizing data

Time for action – enhancing the matches file by normalizing the dataset

Generating a custom time dimension dataset by using Kettle variables

Time for action – creating the time dimension dataset

Time for action – parameterizing the start and end date of the time dimension dataset

Summary

Working with Databases

Introducing the Steel Wheels sample database

Time for action – creating a connection to the Steel Wheels database

Time for action – exploring the sample database

Querying a database

Time for action – getting data about shipped orders

Time for action – getting orders in a range of dates using parameters

Time for action – getting orders in a range of dates by using Kettle variables

Sending data to a database

Time for action – loading a table with a list of manufacturers

Time for action – inserting new products or updating existing ones

Time for action – testing the update of existing products

Eliminating data from a database

Time for action – deleting data about discontinued items

Summary

Performing Advanced Operations with Databases

Preparing the environment

Time for action – populating the Jigsaw database

Looking up data in a database

Time for action – using a Database lookup step to create a list of products to buy

Time for action – using a Database join step to create a list of suggested products to buy

Introducing dimensional modeling

Loading dimensions with data

Time for action – loading a region dimension with a Combination lookup/update step

Time for action – testing the transformation that loads the region dimension

Time for action – keeping a history of changes in products by using the Dimension lookup/update step

Time for action – testing the transformation that keeps history of product changes

Summary

Creating Basic Task Flows

Introducing PDI jobs

Time for action – creating a folder with a Kettle job

Designing and running jobs

Time for action – creating a simple job and getting familiar with the design process

Running transformations from jobs

Time for action – generating a range of dates and inspecting how things are running

Receiving arguments and parameters in a job

Time for action – generating a hello world file by using arguments and parameters

Running jobs from a terminal window

Time for action – executing the hello world job from a terminal window

Using named parameters and command-line arguments in transformations

Time for action – calling the hello world transformation with fixed arguments and parameters

Deciding between the use of a command-line argument and a named parameter

Summary

Creating Advanced Transformations and Jobs

Re-using part of your transformations

Time for action – calculating statistics with the use of a subtransformation

Time for action – generating top average scores by copying and getting rows

Iterating jobs and transformations

Time for action – generating custom files by executing a transformation for every input row

Enhancing your processes with the use of variables

Time for action – generating custom messages by setting a variable with the name of the examination file

Summary

Developing and Implementing a Simple Datamart

Exploring the sales datamart

Loading the dimensions

Time for action – loading the dimensions for the sales datamart

Extending the sales datamart model

Loading a fact table with aggregated data

Time for action – loading the sales fact table by looking up dimensions

Getting facts and dimensions together

Time for action – loading the fact table using a range of dates obtained from the command line

Time for action – loading the SALES star

Automating the administrative tasks

Time for action – automating the loading of the sales datamart

Summary

Working with Repositories

Creating a database repository

Time for action – creating a PDI repository

Working with the repository storage system

Time for action – logging into a database repository

Examining and modifying the contents of a repository with the Repository Explorer

Migrating from file-based system to repository-based system and vice versa

Summary

Pan and Kitchen – Launching Transformations and Jobs from the Command Line

Running transformations and jobs stored in files

Running transformations and jobs from a repository

Kettle variables and the Kettle home directory

Checking the exit code

Providing options when running Pan and Kitchen

Summary

Quick Reference – Steps and Job Entries

Transformation steps

Job entries

Summary

Spoon Shortcuts

General shortcuts

Designing transformations and jobs

Grids

Repositories

Database wizards

Summary

Introducing PDI 5 Features

Welcome page

Usability

Solutions to commonly occurring situations

Backend

Summary

Best Practices

Summary

Pop Quiz Answers

Chapter 1, Getting Started with Pentaho Data Integration

Chapter 2, Getting Started with Transformations

Chapter 3, Manipulating Real-world Data

Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data

Chapter 5, Controlling the Flow of Data

Chapter 6, Transforming Your Data by Coding

Chapter 8, Working with Databases

Chapter 9, Performing Advanced Operations with Databases

Chapter 10, Creating Basic Task Flows

Chapter 11, Creating Advanced Transformations and Jobs

Chapter 12, Developing and Implementing a Simple Datamart

Index

Preface

Pentaho Data Integration (also known as Kettle) is an engine along with a suite of tools responsible for the processes of Extracting, Transforming, and Loading—better known as the ETL processes. PDI not only serves as an ETL tool, but is also used for other purposes such as migrating data between applications or databases, exporting data from databases to flat files, data cleansing, and much more. PDI has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with PDI can be difficult or confusing. This book provides the guidance needed to overcome that difficulty, covering the key features of PDI. Each chapter introduces new features, allowing you to gradually get involved with the tool.

By the end of the book, you will have not only experimented with all kinds of examples, but will have also built a basic but complete datamart with the help of PDI.

How to read this book

Although it is recommended that you read all the chapters, you don't have to. The book allows you to tailor the PDI learning process according to your particular needs.

The first five chapters along with Chapter 10, Creating Basic Task Flows, cover the core concepts. If you don't know PDI and want to learn just the basics, reading those chapters will suffice. If you need to work with databases, you could include Chapter 8, Working with Databases, in the roadmap.

If you already know the basics, you can improve your PDI knowledge by reading Chapter 6, Transforming Your Data by Coding, Chapter 7, Transforming the Rowset, and Chapter 11, Creating Advanced Transformations and Jobs.

If you already know PDI and want to learn how to use it to load or maintain a data warehouse or datamart, you will find all that you need in Chapter 9, Performing Advanced Operations with Databases, and Chapter 12, Developing and Implementing a Simple Datamart.

Finally, all the appendices are valuable resources for anyone reading this book.

What this book covers

Chapter 1, Getting Started with Pentaho Data Integration, serves as the most basic introduction to PDI, presenting the tool. This chapter includes instructions for installing PDI and gives you the opportunity to play with the graphical designer (Spoon). The chapter also includes instructions for installing a MySQL server.

Chapter 2, Getting Started with Transformations, explains the fundamentals of working with transformations, including learning the simplest ways of transforming data and getting familiar with the process of designing, debugging, and testing a transformation.

Chapter 3, Manipulating Real-world Data, explains how to apply the concepts learned in the previous chapter to real-world data that comes from different sources. It also explains how to save the results to different destinations: plain files, Excel files, and more. As real data is very prone to errors, this chapter also explains the basics of handling errors and validating data.

Chapter 4, Filtering, Searching, and Performing Other Useful Operations with Data, expands the set of operations learned in previous chapters by teaching the reader a great variety of essential features such as filtering, sorting, and looking up data.

Chapter 5, Controlling the Flow of Data, explains different options that PDI offers to combine or split flows of data.

Chapter 6, Transforming Your Data by Coding, explains how JavaScript and Java coding can help in the treatment of data. It shows why you may need to code inside PDI, and explains in detail how to do it.

Chapter 7, Transforming the Rowset, explains the ability of PDI to deal with some sophisticated problems—for example, normalizing data from pivoted tables—in a simple fashion.

Chapter 8, Working with Databases, explains how to use PDI to work with databases. The list of topics covered includes connecting to a database, previewing and getting data, and inserting, updating, and deleting data. As database knowledge is not presumed, the chapter also covers fundamental concepts of databases and the SQL language.

Chapter 9, Performing Advanced Operations with Databases, explains how to perform advanced operations with databases, including those especially designed to load data warehouses. A primer on data warehouse concepts is also given in case you are not familiar with the subject.

Chapter 10, Creating Basic Task Flows, serves as an introduction to processes in PDI. Through the creation of simple jobs, you will learn what jobs are and what they are used for.

Chapter 11, Creating Advanced Transformations and Jobs, deals with advanced concepts that will allow you to build complex PDI projects. The list of covered topics includes nesting jobs, iterating on jobs and transformations, and creating subtransformations.

Chapter 12, Developing and Implementing a Simple Datamart, presents a simple datamart project, and guides you to build the datamart by using all the concepts learned throughout the book.

Appendix A, Working with Repositories, is a step-by-step guide to the creation of a PDI database repository and then gives instructions on how to work with it.

Appendix B, Pan and Kitchen – Launching Transformations and Jobs from the Command Line, is a quick reference for running transformations and jobs from the command line.

Appendix C, Quick Reference – Steps and Job Entries, serves as a quick reference to steps and job entries used throughout the book.

Appendix D, Spoon Shortcuts, is an extensive list of Spoon shortcuts useful for saving time when designing and running PDI jobs and transformations.

Appendix E, Introducing PDI 5 Features, quickly introduces you to the architectural and functional features included in Kettle 5—the version that was under development when this book was written.

Appendix F, Best Practices, gives a list of best PDI practices and recommendations.

Appendix G, Pop Quiz Answers, contains answers to pop quiz questions.

What you need for this book

PDI is a multiplatform tool. This means that no matter what your operating system is, you will be able to work with the tool. The only prerequisite is to have JVM 1.6 installed. It is also useful to have Excel or Calculator, along with a nice text editor.
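As a quick way to check the JVM prerequisite before installing, the following is a minimal Python sketch, not part of the book. It assumes that a `java` executable is on the PATH and that its `java -version` banner contains a quoted version string such as "1.6.0_45" (older numbering) or "11.0.2" (newer numbering). The function names and the `MINIMUM_MAJOR` constant are hypothetical, chosen for illustration only.

```python
import re
import subprocess

# Assumption: "JVM 1.6" in the text corresponds to Java major version 6.
MINIMUM_MAJOR = 6


def java_major_version(version_output):
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "([^"]+)"', version_output)
    if match is None:
        return None
    parts = re.split(r"[._-]", match.group(1))
    try:
        # Older numbering: "1.6.0_45" means Java 6.
        # Newer numbering: "11.0.2" means Java 11.
        if parts[0] == "1" and len(parts) > 1:
            return int(parts[1])
        return int(parts[0])
    except ValueError:
        return None


def meets_pdi_prerequisite(version_output):
    """True when the reported version is at least the minimum named in the text."""
    major = java_major_version(version_output)
    return major is not None and major >= MINIMUM_MAJOR


def installed_java_output():
    """Run `java -version`; the JVM prints its version banner on stderr."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True
        )
    except FileNotFoundError:
        return ""  # no java executable on the PATH
    return result.stderr + result.stdout


if __name__ == "__main__":
    output = installed_java_output()
    if meets_pdi_prerequisite(output):
        print("Java", java_major_version(output), "found: minimum version met")
    else:
        print("No suitable Java runtime found on the PATH")
```

The check only confirms the minimum version. Whether a much newer JVM works with the PDI release covered by the book is a separate question that this sketch does not answer.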

Having an Internet connection while reading is extremely useful as well. Several links are provided throughout the book that complement what is explained. Additionally, there is the PDI forum, where you can search for answers or post questions if you are stuck with something.

Who this book is for

This book is a must-have for software developers, database administrators, IT students, and everyone involved or interested in developing ETL solutions, or more generally, doing any kind of data manipulation. Those who have never used PDI will benefit the most from the book, but those who have will also find it useful.

This book is also a good starting point for database administrators, data warehouse designers, architects, or anyone who is responsible for data warehouse projects and needs to load data into them.

You don't need to have any prior data warehouse or database experience to read this book. Fundamental database and data warehouse technical terms and concepts are explained in easy-to-understand language.

Conventions

In this book, you will find several headings that appear frequently.

To give clear instructions on how to complete a procedure or task, we use:

Time for action – heading

1. Action 1
2. Action 2
3. Action 3

Instructions often need some extra explanation so that they make sense, so they are followed with:

What just happened?

This heading explains the working of tasks or instructions that you have just completed.

You will also find some other learning aids in the book, including:

Pop quiz – heading

These are short multiple-choice questions intended to help you test your own understanding.

Have a go hero – heading

These set practical challenges and give you ideas for experimenting with what you have learned.

You will also find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text are shown as follows: "You may notice that we used the Unix command rm to remove the Drush directory rather than the DOS del command."

A block of code is set as follows:

# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300

When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

# * Fine Tuning
#
key_buffer = 16M
key_buffer_size = 32M
max_allowed_packet = 16M
thread_stack = 512K
thread_cache_size = 8
max_connections = 300

Any command-line input or output is written as follows:

cd /ProgramData/Propeople
rm -r Drush
git clone --branch master http://git.drupal.org/project/drush.git

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "On the Select Destination Location screen, click on Next to accept the default destination."

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title through the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website, or added to any list of existing errata, under the Errata section of that title.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
