Book Image

Pentaho Data Integration Cookbook - Second Edition - Second Edition

Book Image

Pentaho Data Integration Cookbook - Second Edition - Second Edition

Overview of this book

Pentaho Data Integration is the premier open source ETL tool, providing easy, fast, and effective ways to move and transform data. While PDI is relatively easy to pick up, it can take time to learn the best practices so you can design your transformations to process data faster and more efficiently. If you are looking for clear and practical recipes that will advance your skills in Kettle, then this is the book for you. Pentaho Data Integration Cookbook Second Edition guides you through the features of explains the Kettle features in detail and provides easy to follow recipes on file management and databases that can throw a curve ball to even the most experienced developers. Pentaho Data Integration Cookbook Second Edition provides updates to the material covered in the first edition as well as new recipes that show you how to use some of the key features of PDI that have been released since the publication of the first edition. You will learn how to work with various data sources – from relational and NoSQL databases, flat files, XML files, and more. The book will also cover best practices that you can take advantage of immediately within your own solutions, like building reusable code, data quality, and plugins that can add even more functionality. Pentaho Data Integration Cookbook Second Edition will provide you with the recipes that cover the common pitfalls that even seasoned developers can find themselves facing. You will also learn how to use various data sources in Kettle as well as advanced features.
Table of Contents (21 chapters)
Pentaho Data Integration Cookbook Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
References
Index

Getting data from a database


If you're used to working with databases, one of your main objectives while working with PDI must be getting data from your databases for transforming, loading in other databases, generating reports, and so on. Whatever operation you intend to achieve, the first thing you have to do after connecting to the database is to get that data and create a PDI dataset. In this recipe, you will learn the simplest way to do that.

Getting ready

To follow these instructions, you need to have access to any DBMS. Many of the recipes in this chapter will be connecting to a MySQL instance. It is recommended that to fully take advantage of the book's code, (which can be found on the book's website) you have access to a MySQL instance.

How to do it...

  1. Create a transformation and drop a Table Input step into the canvas. You will find it in the Input category of steps.

  2. From the Connection drop-down list, select the connection to the database where your data resides, or create it if it doesn't exist.

  3. In the SQL textarea, type the SQL statement that returns the data you need. So far, you should have something like the following:

  4. Click on Preview. This will bring a sample list of rows so you can confirm that the data is as expected.

  5. Click on OK to close the Table Input configuration window, and you'll be ready to use the data for further manipulation.

How it works...

The Table Input step you used in the recipe is the main Kettle step to get data from a database. When you run or preview the transformation, Kettle executes the SQL and pushes the rows of data coming from the database into the output stream of the step. Each column of the SQL statement leads to a PDI field and each row generated by the execution of the statement becomes a row in the PDI dataset.

Once you get the data from the database, it will be available for any kind of manipulation inside the transformation.

There's more...

In order to save time, or in case you are not sure of the name of the tables or columns in the database, instead of typing the SQL statement, click on the Get SQL select statement... button. This will bring the Database Explorer window. This window allows you to explore the selected database. By expanding the database tree and selecting the table that interests you, you will be able to explore that table through the different options available under the Actions menu.

Double-clicking on the name of the table will generate a SELECT statement to query that table. You will have the chance to include all the field names in the statement, or simply generate a SELECT * statement. After bringing the SQL to the Table Input configuration window, you will be able to modify it according to your needs.

Note

By generating this statement, you will lose any statement already in the SQL textarea.

See also

  • Connecting to a database

  • Getting data from a database by providing parameters

  • Getting data from a database by running a query built at runtime