Pentaho Data Integration Cookbook - Second Edition

Overview of this book

Pentaho Data Integration (PDI) is the premier open source ETL tool, providing easy, fast, and effective ways to move and transform data. While PDI is relatively easy to pick up, it can take time to learn the best practices so you can design your transformations to process data faster and more efficiently. If you are looking for clear and practical recipes that will advance your skills in Kettle, then this is the book for you. Pentaho Data Integration Cookbook Second Edition explains the Kettle features in detail and provides easy-to-follow recipes on file management and databases that can throw a curveball at even the most experienced developers.

Pentaho Data Integration Cookbook Second Edition updates the material covered in the first edition and adds new recipes that show you how to use some of the key features of PDI released since the first edition was published. You will learn how to work with various data sources, from relational and NoSQL databases to flat files, XML files, and more. The book also covers best practices that you can take advantage of immediately in your own solutions, such as building reusable code, ensuring data quality, and using plugins that add even more functionality. It provides recipes that cover the common pitfalls that even seasoned developers can find themselves facing, and it shows you how to use the advanced features of Kettle.

Inserting or updating rows in a table

Two of the most common operations on databases, besides retrieving data, are inserting and updating rows in a table.

PDI has several steps that allow you to perform these operations. In this recipe you will learn to use the Insert/Update step. Before inserting or updating rows with this step, it is critical that you know which field or fields uniquely identify a row in the table.

Note

If you don't have a way to uniquely identify the records, you should consider other steps, as explained in the There's more... section.

Assume this situation: you have a file with new employees of Steel Wheels. You have to insert those employees in the database. The file also contains old employees that have changed either the office where they work, the extension number, or other basic information. You will take the opportunity to update that information as well.

Getting ready

Download the material for the recipe from the book's site. Take a look at the file you will use:

EMPLOYEE_NUMBER, LASTNAME, FIRSTNAME, EXTENSION, OFFICE, REPORTS, TITLE
1188, Firrelli, Julianne,x2174,2,1143, Sales Manager
1619, King, Tom,x103,6,1088,Sales Rep
1810, Lundberg, Anna,x910,2,1143,Sales Rep
1811, Schulz, Chris,x951,2,1143,Sales Rep
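
Note that some values in the file are preceded by a space after the comma. As a minimal sketch outside PDI, the listing above can be parsed and trimmed as follows; the function name read_employees and the inline copy of the file are assumptions made only for illustration:

```python
import csv
import io

# Content copied from the employees.txt listing shown in this recipe.
EMPLOYEES_TXT = """\
EMPLOYEE_NUMBER, LASTNAME, FIRSTNAME, EXTENSION, OFFICE, REPORTS, TITLE
1188, Firrelli, Julianne,x2174,2,1143, Sales Manager
1619, King, Tom,x103,6,1088,Sales Rep
1810, Lundberg, Anna,x910,2,1143,Sales Rep
1811, Schulz, Chris,x951,2,1143,Sales Rep
"""

def read_employees(text):
    """Parse the comma-separated text into dicts, trimming stray spaces."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = [name.strip() for name in next(reader)]
    return [dict(zip(header, (value.strip() for value in row))) for row in reader]

rows = read_employees(EMPLOYEES_TXT)
print(rows[0]["TITLE"])  # Sales Manager
```

Every value is kept as a string here; in a real transformation the types would be set in the Fields grid of the input step.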

Explore the Steel Wheels database, in particular the employees table, so you know what you have before running the transformation. Execute the following MySQL statement:

SELECT
      EMPLOYEENUMBER ENUM
    , CONCAT(FIRSTNAME,' ',LASTNAME) NAME
    , EXTENSION EXT
    , OFFICECODE OFF
    , REPORTSTO REPTO
    , JOBTITLE
    FROM EMPLOYEES
    WHERE EMPLOYEENUMBER IN (1188, 1619, 1810, 1811);
+------+----------------+-------+-----+-------+-----------+
| ENUM | NAME           | EXT   | OFF | REPTO | JOBTITLE  |
+------+----------------+-------+-----+-------+-----------+
| 1188 | Julie Firrelli | x2173 | 2   |  1143 | Sales Rep |
| 1619 | Tom King       | x103  | 6   |  1088 | Sales Rep |
+------+----------------+-------+-----+-------+-----------+
2 rows in set (0.00 sec)

How to do it...

Perform the following steps to insert or update rows in a table:

1. Create a transformation and use a Text File input step to read the file employees.txt. Provide the name and location of the file, specify comma as the separator, and fill in the Fields grid.
   Tip
   Remember that you can quickly fill the grid by clicking on the Get Fields button.
2. Now, you will do the inserts and updates with an Insert/Update step. So, expand the Output category of steps, look for the Insert/Update step, drag it to the canvas, and create a hop from the Text File input step toward this one.
3. Double-click on the Insert/Update step and select the connection to the Steel Wheels database, or create it if it doesn't exist. As target table, type EMPLOYEES.
4. Fill the grids as shown in the following screenshot. In the upper grid, compare the table field EMPLOYEENUMBER with the stream field EMPLOYEE_NUMBER. In the lower grid, map each table column to its stream field (OFFICECODE to OFFICE, REPORTSTO to REPORTS, JOBTITLE to TITLE, and so on), and put Y under the Update column only for EXTENSION, OFFICECODE, REPORTSTO, and JOBTITLE.
5. Save and run the transformation.

Explore the employees table by running the query executed earlier. You will see that one employee was updated, two were inserted, and one remained untouched because the file had the same data as the database for that employee:

+------+----------------+-------+-----+-------+---------------+
| ENUM | NAME           | EXT   | OFF | REPTO | JOBTITLE      |
+------+----------------+-------+-----+-------+---------------+
| 1188 | Julie Firrelli | x2174 | 2   |  1143 | Sales Manager |
| 1619 | Tom King       | x103  | 6   |  1088 | Sales Rep     |
| 1810 | Anna Lundberg  | x910  | 2   |  1143 | Sales Rep     |
| 1811 | Chris Schulz   | x951  | 2   |  1143 | Sales Rep     |
+------+----------------+-------+-----+-------+---------------+
4 rows in set (0.00 sec)

How it works...

The Insert/Update step, as its name implies, serves for both inserting and updating rows. For each row in your stream, Kettle looks for a row in the table that matches the condition you put in the upper grid, which is labeled "The key(s) to look up the value(s):". Take for example the last row in your input file:

1811, Schulz, Chris,x951,2,1143,Sales Rep

When this row comes to the Insert/Update step, Kettle looks for a row where EMPLOYEENUMBER equals 1811. When it doesn't find one, it inserts a row following the directions you put in the lower grid. For this sample row, the equivalent INSERT statement would be as follows:

INSERT INTO EMPLOYEES (EMPLOYEENUMBER, LASTNAME, FIRSTNAME,
            EXTENSION, OFFICECODE, REPORTSTO, JOBTITLE)
       VALUES (1811, 'Schulz', 'Chris',
              'x951', 2, 1143, 'Sales Rep')

Now look at the first row:

1188, Firrelli, Julianne,x2174,2,1143, Sales Manager

When Kettle looks for a row with EMPLOYEENUMBER equal to 1188, it finds it. Then, it updates that row according to what you put in the lower grid. It only updates the columns where you put Y under the Update column. For this sample row, the equivalent UPDATE statement would be as follows:

UPDATE EMPLOYEES SET EXTENSION = 'x2174'
                   , OFFICECODE = 2
                   , REPORTSTO = 1143
                   , JOBTITLE = 'Sales Manager'
WHERE EMPLOYEENUMBER = 1188

Note that the name of this employee in the file (Julianne) is different from the name in the table (Julie), but, as you put N under the column Update for the field FIRSTNAME, this column was not updated.
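
The behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Kettle's implementation: an in-memory SQLite table stands in for the MySQL EMPLOYEES table, the column types are guesses, and the names insert_update, LOWER_GRID, KEY_COLUMN, and KEY_FIELD are hypothetical.

```python
import sqlite3

# Upper grid: table column compared with stream field.
KEY_COLUMN, KEY_FIELD = "EMPLOYEENUMBER", "EMPLOYEE_NUMBER"
# Lower grid: (table column, stream field, Update flag).
LOWER_GRID = [
    ("EMPLOYEENUMBER", "EMPLOYEE_NUMBER", False),
    ("LASTNAME", "LASTNAME", False),
    ("FIRSTNAME", "FIRSTNAME", False),
    ("EXTENSION", "EXTENSION", True),
    ("OFFICECODE", "OFFICE", True),
    ("REPORTSTO", "REPORTS", True),
    ("JOBTITLE", "TITLE", True),
]

def insert_update(conn, row):
    """Process one stream row; return 'inserted', 'updated', or 'unchanged'."""
    updatable = [(col, field) for col, field, update in LOWER_GRID if update]
    lookup_sql = "SELECT {} FROM EMPLOYEES WHERE {} = ?".format(
        ", ".join(col for col, _ in updatable), KEY_COLUMN)
    found = conn.execute(lookup_sql, (row[KEY_FIELD],)).fetchone()
    if found is None:
        # No match on the key: insert every column listed in the lower grid.
        columns = [col for col, _, _ in LOWER_GRID]
        insert_sql = "INSERT INTO EMPLOYEES ({}) VALUES ({})".format(
            ", ".join(columns), ", ".join("?" for _ in columns))
        conn.execute(insert_sql, [row[field] for _, field, _ in LOWER_GRID])
        return "inserted"
    new_values = tuple(row[field] for _, field in updatable)
    if found == new_values:
        return "unchanged"  # same data as the table: nothing to do
    # Match found with differences: update only the columns flagged Y.
    update_sql = "UPDATE EMPLOYEES SET {} WHERE {} = ?".format(
        ", ".join(col + " = ?" for col, _ in updatable), KEY_COLUMN)
    conn.execute(update_sql, new_values + (row[KEY_FIELD],))
    return "updated"

conn = sqlite3.connect(":memory:")
# Assumed column types; the real Steel Wheels schema may differ.
conn.execute("""CREATE TABLE EMPLOYEES (
    EMPLOYEENUMBER INTEGER PRIMARY KEY, LASTNAME TEXT, FIRSTNAME TEXT,
    EXTENSION TEXT, OFFICECODE TEXT, REPORTSTO INTEGER, JOBTITLE TEXT)""")
conn.executemany("INSERT INTO EMPLOYEES VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1188, "Firrelli", "Julie", "x2173", "2", 1143, "Sales Rep"),
    (1619, "King", "Tom", "x103", "6", 1088, "Sales Rep"),
])

STREAM_FIELDS = ["EMPLOYEE_NUMBER", "LASTNAME", "FIRSTNAME",
                 "EXTENSION", "OFFICE", "REPORTS", "TITLE"]
stream = [dict(zip(STREAM_FIELDS, values)) for values in [
    (1188, "Firrelli", "Julianne", "x2174", "2", 1143, "Sales Manager"),
    (1619, "King", "Tom", "x103", "6", 1088, "Sales Rep"),
    (1810, "Lundberg", "Anna", "x910", "2", 1143, "Sales Rep"),
    (1811, "Schulz", "Chris", "x951", "2", 1143, "Sales Rep"),
]]

results = [insert_update(conn, row) for row in stream]
print(results)  # ['updated', 'unchanged', 'inserted', 'inserted']
```

The result matches the outcome of the recipe: one row updated, one left untouched, and two inserted, with FIRSTNAME for employee 1188 still set to Julie.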

Note

If you run the transformation with the log level Detailed, in the log you will be able to see the real prepared statements that Kettle performs when inserting or updating rows in a table.

There's more...

The following are alternative solutions to this use case.

Alternative solution if you just want to insert records

If you just want to insert records, you shouldn't use the Insert/Update step but the Table Output step. This would be faster because you would be avoiding unnecessary lookup operations; however, the Table Output step does not check for duplicated records. The Table Output step is really simple to configure; just select the database connection and the table where you want to insert the records. If the fields coming to the Table Output step have the same names as the columns in the table, you are done. If not, you should check the Specify database fields option, and fill the Database fields tab exactly as you filled the lower grid in the Insert/Update step, except that here there is no Update column.
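
As a rough sketch of the insert-only approach, the following Python code maps stream fields to table columns and inserts rows without any lookup. It assumes an in-memory SQLite table with no unique key, which is what allows the duplicate to get through; the names table_output and DATABASE_FIELDS are hypothetical.

```python
import sqlite3

# Hypothetical equivalent of the Database fields tab: table column -> stream field.
DATABASE_FIELDS = {
    "EMPLOYEENUMBER": "EMPLOYEE_NUMBER",
    "LASTNAME": "LASTNAME",
    "FIRSTNAME": "FIRSTNAME",
    "EXTENSION": "EXTENSION",
    "OFFICECODE": "OFFICE",
    "REPORTSTO": "REPORTS",
    "JOBTITLE": "TITLE",
}

def table_output(conn, rows):
    """Insert every row without looking anything up first."""
    columns = list(DATABASE_FIELDS)
    sql = "INSERT INTO EMPLOYEES ({}) VALUES ({})".format(
        ", ".join(columns), ", ".join("?" for _ in columns))
    conn.executemany(
        sql, [[row[DATABASE_FIELDS[col]] for col in columns] for row in rows])

conn = sqlite3.connect(":memory:")
# Assumption for this demo: no unique key, so nothing blocks duplicates.
conn.execute("""CREATE TABLE EMPLOYEES (
    EMPLOYEENUMBER INTEGER, LASTNAME TEXT, FIRSTNAME TEXT,
    EXTENSION TEXT, OFFICECODE TEXT, REPORTSTO INTEGER, JOBTITLE TEXT)""")

new_employee = {
    "EMPLOYEE_NUMBER": 1810, "LASTNAME": "Lundberg", "FIRSTNAME": "Anna",
    "EXTENSION": "x910", "OFFICE": "2", "REPORTS": 1143, "TITLE": "Sales Rep",
}
table_output(conn, [new_employee])
table_output(conn, [new_employee])  # the same row arrives a second time

copies = conn.execute(
    "SELECT COUNT(*) FROM EMPLOYEES WHERE EMPLOYEENUMBER = 1810").fetchone()[0]
print(copies)  # 2
```

If the table did have a primary key on EMPLOYEENUMBER, the second insert would fail with an error instead of creating a duplicate.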

Alternative solution if you just want to update rows

If you just want to update rows, instead of using the Insert/Update step, you should use the Update step. You configure the Update step just as you configure the Insert/Update step, except that here there is no Update column.
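
A minimal sketch of the update-only approach follows, again assuming an in-memory SQLite table in place of MySQL and a hypothetical function name update_only. How the Update step itself treats a row whose key is not found depends on its settings and is not modeled here; the sketch only reports how many rows the statement touched.

```python
import sqlite3

UPDATE_SQL = """
    UPDATE EMPLOYEES
       SET EXTENSION = ?, OFFICECODE = ?, REPORTSTO = ?, JOBTITLE = ?
     WHERE EMPLOYEENUMBER = ?
"""

def update_only(conn, row):
    """Update the row matching the key; return the number of rows touched."""
    cursor = conn.execute(UPDATE_SQL, (
        row["EXTENSION"], row["OFFICE"], row["REPORTS"], row["TITLE"],
        row["EMPLOYEE_NUMBER"]))
    return cursor.rowcount

conn = sqlite3.connect(":memory:")
# Assumed column types; the real Steel Wheels schema may differ.
conn.execute("""CREATE TABLE EMPLOYEES (
    EMPLOYEENUMBER INTEGER PRIMARY KEY, LASTNAME TEXT, FIRSTNAME TEXT,
    EXTENSION TEXT, OFFICECODE TEXT, REPORTSTO INTEGER, JOBTITLE TEXT)""")
conn.execute("INSERT INTO EMPLOYEES VALUES (?, ?, ?, ?, ?, ?, ?)",
             (1188, "Firrelli", "Julie", "x2173", "2", 1143, "Sales Rep"))

existing = {"EMPLOYEE_NUMBER": 1188, "EXTENSION": "x2174", "OFFICE": "2",
            "REPORTS": 1143, "TITLE": "Sales Manager"}
unknown = {"EMPLOYEE_NUMBER": 1811, "EXTENSION": "x951", "OFFICE": "2",
           "REPORTS": 1143, "TITLE": "Sales Rep"}

touched_existing = update_only(conn, existing)
touched_unknown = update_only(conn, unknown)
print(touched_existing, touched_unknown)  # 1 0
```

The row for employee 1811 is never inserted, because nothing in this approach creates new rows.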

Alternative way for inserting and updating

The following is an alternative way for inserting and updating rows in a table.

Note

This alternative only works if the columns in the Key fields grid of the Insert/Update step are a unique key in the database.

You may replace the Insert/Update step with a Table Output step and attach an Update step to the error handling stream coming out of the Table Output step.

Tip

In order to handle the error when creating the hop from the Table Output step towards the Update step, select the Error handling of step option.

Alternatively, right-click on the Table Output step, select Define error handling..., and configure the Step error handling settings window that shows up. Your transformation would look like the following:

In the Table Output step, select the table EMPLOYEES, check the Specify database fields option, and fill the Database fields tab just as you filled the lower grid in the Insert/Update step, except that here there is no Update column.

In the Update step, select the same table and fill the upper grid—let's call it the Key fields grid—just as you filled the Key fields grid in the Insert/Update step. Finally, fill the lower grid with those fields that you want to update, that is, those rows that had Y under the Update column.

In this case, Kettle tries to insert all records coming to the Table Output step. The rows for which the insert fails go to the Update step, and get updated.

If the columns in the Key fields grid of the Insert/Update step are not a unique key in the database, this alternative approach doesn't work. The Table Output would insert all the rows. Those that already existed would be duplicated instead of getting updated.

This strategy for performing inserts and updates has been shown to be much faster than the Insert/Update step when the ratio of updates to inserts is low. In general, however, it is not advisable as a best practice.
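
The arrangement of a Table Output step with an Update step on its error stream can be approximated as follows. This is a sketch under assumptions: an in-memory SQLite table with a primary key on EMPLOYEENUMBER stands in for the MySQL table, a caught IntegrityError plays the part of the error handling hop, and the name insert_then_update is hypothetical.

```python
import sqlite3

INSERT_SQL = """
    INSERT INTO EMPLOYEES (EMPLOYEENUMBER, LASTNAME, FIRSTNAME,
                           EXTENSION, OFFICECODE, REPORTSTO, JOBTITLE)
    VALUES (?, ?, ?, ?, ?, ?, ?)
"""
UPDATE_SQL = """
    UPDATE EMPLOYEES
       SET EXTENSION = ?, OFFICECODE = ?, REPORTSTO = ?, JOBTITLE = ?
     WHERE EMPLOYEENUMBER = ?
"""

def insert_then_update(conn, row):
    """Try to insert; if the unique key rejects the row, update it instead."""
    try:
        conn.execute(INSERT_SQL, (
            row["EMPLOYEE_NUMBER"], row["LASTNAME"], row["FIRSTNAME"],
            row["EXTENSION"], row["OFFICE"], row["REPORTS"], row["TITLE"]))
        return "inserted"
    except sqlite3.IntegrityError:
        # The key already exists: this row takes the error handling path.
        conn.execute(UPDATE_SQL, (
            row["EXTENSION"], row["OFFICE"], row["REPORTS"], row["TITLE"],
            row["EMPLOYEE_NUMBER"]))
        return "updated"

conn = sqlite3.connect(":memory:")
# The PRIMARY KEY is what makes the failed insert detectable.
conn.execute("""CREATE TABLE EMPLOYEES (
    EMPLOYEENUMBER INTEGER PRIMARY KEY, LASTNAME TEXT, FIRSTNAME TEXT,
    EXTENSION TEXT, OFFICECODE TEXT, REPORTSTO INTEGER, JOBTITLE TEXT)""")
conn.execute("INSERT INTO EMPLOYEES VALUES (?, ?, ?, ?, ?, ?, ?)",
             (1188, "Firrelli", "Julie", "x2173", "2", 1143, "Sales Rep"))

stream = [
    {"EMPLOYEE_NUMBER": 1188, "LASTNAME": "Firrelli", "FIRSTNAME": "Julianne",
     "EXTENSION": "x2174", "OFFICE": "2", "REPORTS": 1143,
     "TITLE": "Sales Manager"},
    {"EMPLOYEE_NUMBER": 1811, "LASTNAME": "Schulz", "FIRSTNAME": "Chris",
     "EXTENSION": "x951", "OFFICE": "2", "REPORTS": 1143, "TITLE": "Sales Rep"},
]
outcomes = [insert_then_update(conn, row) for row in stream]
print(outcomes)  # ['updated', 'inserted']
```

Without the primary key, the first insert would succeed and leave two rows for employee 1188, which is the duplication problem described above.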
