Book Image

Pentaho Data Integration Beginner's Guide - Second Edition - Second Edition

By : María Carina Roldán
Book Image

Pentaho Data Integration Beginner's Guide - Second Edition - Second Edition

By: María Carina Roldán

Overview of this book

Capturing, manipulating, cleansing, transferring, and loading data effectively are the prime requirements in every IT organization. Achieving these tasks require people devoted to developing extensive software programs, or investing in ETL or data integration tools that can simplify this work. Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. Pentaho Data Integration has an intuitive, graphical, drag-and-drop design environment and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing. "Pentaho Data Integration Beginner's Guide - Second Edition" provides the guidance needed to overcome that difficulty, covering all the possible key features of Pentaho Data Integration. "Pentaho Data Integration Beginner's Guide - Second Edition" starts with the installation of Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. Moreover, you will be introduced to data warehouse concepts and you will learn how to load data in a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity of applying and reinforcing all the learned concepts through the implementation of a simple datamart. With "Pentaho Data Integration Beginner's Guide - Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.
Table of Contents (26 chapters)
Pentaho Data Integration Beginner's Guide
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Best Practices
Index

Time for action – creating a hello world transformation


How about starting by saying hello to the world? It's not really new, but good enough for our first practical example; here are the steps to follow:

  1. Create a folder named pdi_labs under a folder of your choice.

  2. Open Spoon.

  3. From the main menu, navigate to File | New | Transformation.

  4. On the left of the screen, under the Design tab, you’ll see a tree of Steps. Expand the Input branch by double-clicking on it.

    Note

    Note that if you work in Mac OS a single click is enough.

  5. Then, left-click on the Generate Rows icon and without releasing the button, drag-and-drop the selected icon to the main canvas. The screen will look like the following screenshot:

    Note

    Note that we changed the preferred language back to English.

  6. Double-click on the Generate Rows step you just put in the canvas, and fill the textboxes, including Step name and Limit and grid as follows:

  7. From the Steps tree, double-click on the Flow branch.

  8. Click on the Dummy (do nothing) icon and drag-and-drop it to the main canvas.

  9. Put the mouse cursor over the Generate Rows step and wait until a tiny toolbar shows up below the entry icon, as shown in the following screenshot:

  10. Click on the output connector (the last icon in the toolbar), and drag towards the Dummy (do nothing) step. A grayed hop is displayed.

  11. When the mouse cursor is over the Dummy (do nothing) step, release the button. A link—a hop from now on—is created from the Generate Rows step to the Dummy (do nothing) step. The screen should look like the following screenshot:

  12. Right-click anywhere on the canvas to bring a contextual menu.

  13. In the menu, select the New note option. A note editor appears.

  14. Type some description such as Hello, World! Select the Font style tab and choose some nice font and colors for your note, and then click on OK.

  15. From the main menu, navigate to Edit | Settings.... A window appears to specify transformation properties. Fill the Transformation name textbox with a simple name, such as hello world. Fill the Description textbox with a short description such as My first transformation. Finally, provide a more clear explanation in the Extended description textbox, and then click on OK.

  16. From the main menu, navigate to File | Save.

  17. Save the transformation in the folder pdi_labs with the name hello_world.

  18. Select the Dummy (do nothing) step by left-clicking on it.

  19. Click on the Preview icon in the bar menu above the main canvas. The screen should look like the following screenshot:

  20. The Transformation debug dialog window appears. Click on the Quick Launch button.

  21. A window appears to preview the data generated by the transformation as shown in the following screenshot:

  22. Close the preview window and click on the Run icon. The screen should look like the following screenshot:

  23. A window named Execute a transformation appears. Click on Launch.

  24. The execution results are shown at the bottom of the screen. The Logging tab should look as follows:

What just happened?

You have just created your first transformation.

First, you created a new transformation, dragged-and-dropped into the work area two steps: Generate Rows and Dummy (do nothing), and connected them.

With the Generate Rows step you created 10 rows of data with the message Hello World! The Dummy (do nothing) step simply served as a destination of those rows.

After creating the transformation, you did a preview. The preview allowed you to see the content of the created data, this is, the 10 rows with the message Hello World!

Finally, you run the transformation. Then you could see at the bottom of the screen the Execution Results window, where a Logging tab shows the complete detail of what happened. There are other tabs in this window which you will learn later in the book.

Directing Kettle engine with transformations

A transformation is an entity made of steps linked by hops. These steps and hops build paths through which data flows—the data enters or is created in a step, the step applies some kind of transformation to it, and finally the data leaves that step. Therefore, it’s said that a transformation is data flow oriented.

A transformation itself is neither a program nor an executable file. It is just plain XML. The transformation contains metadata which tells the Kettle engine what to do.

A step is the minimal unit inside a transformation. A big set of steps is available. These steps are grouped in categories such as the Input and Flow categories that you saw in the example.

Each step is conceived to accomplish a specific function, going from reading a parameter to normalizing a dataset.

Each step has a configuration window. These windows vary according to the functionality of the steps and the category to which they belong. What all steps have in common are the name and description:

Step property

Description

Name

A representative name inside the transformation.

Description

A brief explanation that allows you to clarify the purpose of the step. It’s not mandatory but it is useful.

A hop is a graphical representation of data flowing between two steps: an origin and a destination. The data that flows through that hop constitute the output data of the origin step and the input data of the destination step.

Exploring the Spoon interface

As you just saw, Spoon is the tool with which you create, preview, and run transformations. The following screenshot shows you the basic work areas: Main menu, Design view, Transformation toolbar, and Canvas (work area):

Note

The words canvas and work area will be used interchangeably throughout the book.

There is also an area named View that shows the structure of the transformation currently being edited. You can see that area by clicking on the View tab at the upper-left corner of the screen:

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com . If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Designing a transformation

In the earlier section, you designed a very simple transformation, with just two steps and one explanatory note. You learned to link steps by using the mouseover assistance toolbar. There are alternative ways to do the same thing. You can use the one that you feel more comfortable with. Appendix D, Spoon Shortcuts explains all of the different options to you. It also explains a lot of shortcuts to zoom in and out, align the steps, among others. These shortcuts are very useful as your transformations become more complex.

Note

Appendix F, Best Practices, explains the benefit of using shortcuts as well as other best practices that are invaluable when you work with Spoon, especially when you have to design and develop big ETL projects.

Running and previewing the transformation

The Preview functionality allows you to see a sample of the data produced for selected steps. In the previous example, you previewed the output of the Dummy (do nothing) step.

The Run icon effectively runs the whole transformation.

Whether you preview or run a transformation, you’ll get an Execution Results window showing what happened. You will learn more about this in the next chapter.

Pop quiz – PDI basics

Q1. There are several graphical tools in PDI, but Spoon is the most used.

  1. True.

  2. False.

Q2. You can choose to save transformations either in files or in a database.

  1. True.

  2. False.

Q3. To run a transformation, an executable file has to be generated from Spoon.

  1. True.

  2. False.

Q4. The grid size option in the Look & Feel window allows you to resize the work area.

  1. True.

  2. False.

Q5. To create a transformation you have to provide external data (that is, text file, spreadsheet, database, and so on).

  1. True.

  2. False.