Learning Pentaho Data Integration 8 CE - Third Edition

Overview of this book

Pentaho Data Integration (PDI) is an intuitive, graphical environment that combines drag-and-drop design with powerful Extract-Transform-Load (ETL) capabilities. This book shows and explains the new interactive features of Spoon, the revamped look and feel, and the newest features of the tool, including Transformation and Job executors and the invaluable Metadata Injection capability. We begin with the installation of the PDI software and then move on to cover all the key PDI concepts. Each chapter introduces new features, enabling you to gradually gain practice with the tool. First, you will learn to do all kinds of data manipulation and work with simple plain files. Then, the book teaches you how to work with relational databases inside PDI. Moreover, you will be given a primer on data warehouse concepts and you will learn how to load data into a data warehouse. During the course of this book, you will become familiar with PDI's intuitive, graphical, drag-and-drop design environment. By the end of this book, you will have learned everything you need to know in order to meet your data manipulation requirements. You will also be given best practices and advice for designing and deploying your projects.

Preface

Pentaho Data Integration (also known as Kettle) is an engine, along with a suite of tools, responsible for the processes of Extracting, Transforming, and Loading, better known as ETL processes. Pentaho Data Integration (PDI) serves not only as an ETL tool; it is also used for other purposes, such as migrating data between applications or databases, exporting data from databases to flat files, data cleansing, and much more. PDI has an intuitive, graphical, drag-and-drop design environment, and its ETL capabilities are powerful. However, getting started with PDI can be difficult and confusing. This book provides the guidance needed to overcome that difficulty by covering the key features of PDI. Learning Pentaho Data Integration 8 CE explains the new interactive features of the graphical designer, Spoon, and its revamped look and feel. It also covers the newest features of the tool, including Transformation and Job executors and the invaluable metadata injection capability.

The content of the book is based on PDI 8 Community Edition (CE). However, it can be used with the Enterprise Edition (EE) as well. In addition, if you are currently working with an earlier version of the tool, you should know that most of the content is also valid for PDI 6 and PDI 7.

By the end of the book, not only will you have experimented with all kinds of examples, but you will also have gained the knowledge needed to develop useful, portable, reusable, and well-designed processes.

What this book covers

Chapter 1, Getting Started with Pentaho Data Integration, serves as an introduction to PDI, presenting the tool. This chapter includes instructions for installing PDI and gives you the opportunity to play with the graphical designer (Spoon).

Chapter 2, Getting Started with Transformations, explains the fundamentals of working with transformations, including learning the simplest ways of transforming data and getting familiar with the process of designing, debugging, and testing a Transformation. This chapter also explains the basics of handling errors.

Chapter 3, Creating Basic Task Flows, serves as an introduction to the processes in PDI. Through the creation of simple Jobs, you will learn what Jobs are and what they are used for.

 

Chapter 4, Reading and Writing Files, explains how to get data from several file formats, such as spreadsheets, CSV files, and more. It also explains how to save data in the same kinds of formats.

Chapter 5, Manipulating PDI Data and Metadata, expands the set of operations learned in the previous chapters. Besides exploring new PDI steps for data manipulation, this chapter introduces the Select values step for manipulating metadata. It also explains how to get system information and predefined variables to use as part of the data flow. The chapter also explains how to read and write XML and JSON structures.

Chapter 6, Controlling the Flow of Data, explains the different options that PDI offers for dealing with more than one stream of data. It explains how to combine and split flows of data, filter data, and more.

Chapter 7, Cleansing, Validating, and Fixing Data, offers different ways for cleansing data, and also for dealing with invalid data, either by discarding it or by fixing it.

Chapter 8, Manipulating Data by Coding, explains how JavaScript and Java coding can help in the treatment of data. It shows why you may need to code inside PDI, and explains in detail how to do it.

Chapter 9, Transforming the Dataset, explains techniques for transforming the dataset as a whole; for example, aggregating data or normalizing pivoted tables.

Chapter 10, Performing Basic Operations with Databases, explains how to use PDI to work with databases. The list of topics in this chapter includes connecting to a database, and previewing and getting data. It also covers other basic operations, such as inserting data, looking up data, and more.

Chapter 11, Loading Data Marts with PDI, explains the details about loading simple data marts. It shows how to load common types of dimensions (SCD, Junk, Time, and so on) and also different types of fact tables.

Chapter 12, Creating Portable and Reusable Transformations, explains several techniques for creating versatile transformations that can be used and reused in different scenarios or with different sets of data.

Chapter 13, Implementing Metadata Injection, explains a powerful feature of PDI, which is basically about injecting metadata into a template Transformation at runtime. The Pentaho team has put a huge effort into supporting this feature in the latest PDI versions, so it is worth explaining in detail how it works.

Chapter 14, Creating Advanced Jobs, explains techniques for creating complex processes; for example, iterating over Jobs or manipulating lists of files for different purposes.

Chapter 15, Launching Transformations and Jobs from the Command Line, is a reference not only for running Transformations and Jobs from a terminal, but also for dealing with the output of the executions.

Chapter 16, Best Practices for Designing and Deploying a PDI Project, covers the setup of a new project and also the best practices that make it easier to develop, maintain, and deploy a project in different environments.

What you need for this book

PDI is a multiplatform tool. This means that no matter which operating system you have, you will be able to work with the tool. The only prerequisite is to have JVM 1.8 installed. You will also need an office suite, for example, OpenOffice or LibreOffice, and a good text editor, for example, Sublime Text 3 or Notepad++. Access to a relational database is recommended. The suggested engines are MySQL and PostgreSQL, but you can use any other engine of your choice as well.

Having an internet connection while reading is extremely useful too. Several links are provided throughout the book that complement what is explained. Besides, there is the PDI forum, where you may search for answers or post questions if you are stuck with something.

Who this book is for

This book is a must-have for software developers, business intelligence analysts, IT students, and everyone involved or interested in developing ETL solutions, or more generally, doing any kind of data manipulation. Those who have never used PDI will benefit the most from the book, but those who have will also find it useful. This book is also a good starting point for data warehouse designers, architects, or anyone who is responsible for data warehouse projects and needs to load data into them.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meanings.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: Unzip the downloaded file in a folder of your choice, as, for example, c:/util/kettle or /home/pdi_user/kettle.

A block of code is set as follows:

project_name,start_date,end_date
Project A,2016-01-10,2016-01-25
Project B,2016-04-03,2016-07-21
Project C,2017-01-15,???
Project D,2015-09-03,2015-12-20
Project E,2016-05-11,2016-05-31
Project F,2011-12-01,2013-11-30

Any command-line input or output is written as follows:

 kitchen /file:c:/pdi_labs/hello_world.kjb

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: Open Spoon from the main menu and navigate to File | New | Transformation.

Note

Warnings or important notes appear like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected] and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:

  1. Log in to or register on our website using your email address and password.
  2. Hover the mouse pointer over the SUPPORT tab at the top.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box.
  5. Select the book for which you're looking to download the code files.
  6. Choose from the drop-down menu where you purchased this book.
  7. Click on Code Download.

Once the file is downloaded, make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Pentaho-Data-Integration-8-CE. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/LearningPentahoDataIntegration8CE_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or the website name immediately, so that we can pursue a remedy. Contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.