Book Image

Pentaho Analytics for MongoDB Cookbook

By : Joel André Latino, Harris Ward
Book Image

Pentaho Analytics for MongoDB Cookbook

By: Joel André Latino, Harris Ward

Overview of this book

MongoDB is an open source, schemaless NoSQL database system. Pentaho as a famous open source Analysis tool provides high performance, high availability, and easy scalability for large sets of data. The variant features in Pentaho for MongoDB are designed to empower organizations to be more agile and scalable and also enables applications to have better flexibility, faster performance, and lower costs. Whether you are brand new to online learning or a seasoned expert, this book will provide you with the skills you need to create turnkey analytic solutions that deliver insight and drive value for your organization. The book will begin by taking you through Pentaho Data Integration and how it works with MongoDB. You will then be taken through the Kettle Thin JDBC Driver for enabling a Java application to interact with a database. This will be followed by exploration of a MongoDB collection using Pentaho Instant view and creating reports with MongoDB as a datasource using Pentaho Report Designer. The book will then teach you how to explore and visualize your data in Pentaho BI Server using Pentaho Analyzer. You will then learn how to create advanced dashboards with your data. The book concludes by highlighting contributions of the Pentaho Community.
Table of Contents (15 chapters)
Pentaho Analytics for MongoDB Cookbook
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Migrating data from the RDBMS to MongoDB


In this recipe, you will transfer data from a sample RDBMS to a MongoDB database. The sample data is called SteelWheels and is available in the Pentaho BA server, running on the Hypersonic Database Server.

Getting ready

Start the Pentaho BA Server by executing the appropriate scripts located in the BA Server's home folder. It is start-pentaho.sh for Unix/Linux operating systems, and for the Windows operating system, it is start-pentaho.bat. Also in Windows, you can go to the Start menu and choose Pentaho Enterprise Edition, then Server Management, and finally Start BA Server.

Start Pentaho Data Integration by executing the right scripts in the PDI home folder. It is spoon.sh for Unix/Linux operating systems and spoon.bat for the Windows operating system. Besides this, in Windows, you can go to the Start menu and choose Pentaho Enterprise Edition, then Design Tools, and finally Data Integration.

Start MongoDB. If you don't have the server running as a service, you need execute the mongod –dbpath=<data folder> command in the bin folder of MongoDB.

To make sure you have the Pentaho BA Server started, you can access the default URL, which is http://localhost:8080/pentaho/. When you launch Spoon, you should see a welcome screen like the one pictured here:

How to do it…

After you have made that sure you are ready to start the recipe, perform the following steps:

  1. Create a new empty transformation.

    1. As was explained in the first recipe of this chapter, set the name of this transformation to Migrate data from RDBMS to MongoDB.

    2. Save the transformation with the name chapter1-rdbms-to-mongodb.

  2. Select a customer's data from the SteelWheels database using Table Input step.

    1. Select the Design tab in the left-hand-side view.

    2. From the Input category folder, find the Table Input step and drag and drop it into the working area in the right-hand-side view.

    3. Double-click on the Table Input step to open the configuration dialog.

    4. Set the Step Name property to Select Customers.

    5. Before we can get any data from the SteelWheels Hypersonic database, we will have to create a JDBC connection to it.

      To do this, click on the New button next to the Database Connection pulldown. This will open the Database Connection dialog.

      Set Connection Name to SteelWheels. Next, select the Connection Type as Hypersonic. Set Host Name to localhost, Database Name to SampleData, Port to 9001, Username to pentaho_user, and finally Password to password. Your setup should look similar to the following screenshot:

    6. You can test the connection by clicking on the Test button at the bottom of the dialog. You should get a message similar to Connection Successful. If not, then you must double-check your connection details.

    7. Click on OK to return to the Table Input step.

    8. Now that we have a valid connection set, we are able to get a list of customers from the SteelWheels database. Copy and paste the following SQL into the query text area:

      SELECT * FROM CUSTOMERs
    9. Click on the Preview button and you will see a table of customer details.

    10. Your Table Input step configuration should look similar to what is shown in the following screenshot:

    11. Click on OK to exit the Table Input configuration dialog.

  3. Now, let's configure the output of the customer's data in the MongoDB database.

    1. Under the Design tab, from the Big Data category folder, find the MongoDB Output step and drag and drop it into the working area in the right-hand-side view.

    2. As we want data to flow from the Table Input step to the MongoDB Output step, we are going to create a Hop between the steps. To do this, simply hover over the Table Input step and a popup will appear, with some options below the step. Click on Right Arrow and then on the MongoDB Output step. This will create a Hop between the two steps.

    3. It's time to configure the MongoDB Output step. Double-click on it.

    4. Set Step Name to Customers Output.

    5. As we're running a default MongoDB instance, we only have to set some simple properties in this step. Set Hostname to localhost and Port to 27017.

    6. Select the Output options tab. In this tab, we can define how the data will be inserted into MongoDB.

    7. Set the Database property to SteelWheels. Don't worry if this database doesn't exist in MongoDB, as it will be created automatically.

    8. Set the Collection property to Customers. Again, don't worry if this collection doesn't exist in MongoDB, as it will be created automatically.

    9. Leave the Batch insert size property at 100. For performance and/or production purposes, you can increase it if necessary. If you don't provide any value to this field, the default value will be 100.

    10. We are going to truncate the collection each time before we load data. In this way, if we rerun the transformation many times, we won't get duplicate records. Your Output options page should look like what is shown in this screenshot:

    11. Now, let's define the MongoDB documents structure. Select the Mongo document fields tab.

    12. Click on the Get fields button, and the fields list will be populated with the SteelWheels database fields in the ETL stream.

    13. By default, the column names in the SteelWheels database are in uppercase. In MongoDB, these field names should be in camel case. You can manually edit the names of the MongoDB document paths in this section also. Make sure that the Use Field Name option is set to No for each field, like this:

    14. By clicking on Preview document structure, you will see an example of what the document will look like when it is inserted into the MongoDB Customers collection.

    15. Click on the OK button to finish the MongoDB Output configuration.

  4. The transformation design is complete. You can run it for testing purposes using the Run button, as illustrated here:

How it works…

As you can see, this is a basic transformation that loads data from the RDBMS database and inserts it into a MongoDB collection. This is a very simple example of loading data from one point to another. Not all transformations are like this. That is why PDI comes with various steps that allow you to manipulate data along the way.

In this case, we truncate the collection each time the transformation is run. However, it is also possible to use other combinations, such as Insert&Update or just Insert or Update individually.

There's more…

Now that we have designed a transformation, let's look at a simple way of reusing the MongoDB connection for future transformations.

How to reuse the properties of a MongoDB connection

If you have to create MongoDB connections manually for each transformation, you are likely to make mistakes and typos. A good way to avoid this is to store the MongoDB connection details in a separate .properties file on your filesystem. There is a file called kettle.properties that is located in a hidden directory called .kettle in your home directory. For example, in Linux, the location will be /home/latino/.kettle. In Windows, it will be C:\Users\latino\.kettle. Navigate to and open this .properties file in your favorite text editor. Then, copy and paste the following lines:

MONGODB_STEELWHEELS_HOSTNAME=localhost
MONGODB_STEELWHEELS_PORT=27017
MONGODB_STEELWHEELS_USERNAME=
MONGODB_STEELWHEELS_PASSWORD=

Save the .properties file and restart Spoon.

Now, where can we use these properties?

You will notice that when you are setting properties in certain PDI steps, you can see the following icon:

This icon denotes that we can use a variable or parameter in place of a static value. Variables are defined using the following structure: ${MY_VARIABLE}. You will notice that the variables are encapsulated in ${}. If you are not sure what the name of your variable is, you can also press Ctrl and the Spacebar; this will open a drop-down list of the available variables. You will see the MongoDB variables that you defined in the .properties file earlier in this list. With this in mind, we can now replace the connection details in our steps with variables as shown in this screenshot:

You can find out more about the MongoDB Output step on this documentation website: http://wiki.pentaho.com/display/EAI/MongoDB+Output