Pentaho Analytics for MongoDB Cookbook

By: Joel André Latino, Harris Ward

Overview of this book

MongoDB is an open source, schemaless NoSQL database system. Pentaho, a well-known open source analytics platform, provides high performance, high availability, and easy scalability for large sets of data. Pentaho's features for MongoDB are designed to help organizations become more agile and scalable, and to give applications better flexibility, faster performance, and lower costs. Whether you are brand new to data analytics or a seasoned expert, this book will give you the skills you need to create turnkey analytic solutions that deliver insight and drive value for your organization. The book begins by taking you through Pentaho Data Integration and how it works with MongoDB. You will then be taken through the Kettle Thin JDBC Driver, which enables a Java application to interact with a database. This is followed by exploring a MongoDB collection using Pentaho Instaview and creating reports with MongoDB as a data source using Pentaho Report Designer. The book then teaches you how to explore and visualize your data in the Pentaho BI Server using Pentaho Analyzer, and how to create advanced dashboards with your data. It concludes by highlighting contributions of the Pentaho community.

Migrating data from files to MongoDB


In this recipe, we will guide you through creating a transformation that loads data from different files in your filesystem and then writes it to a MongoDB collection. We are going to load data from files called orders.csv, customers.xls, and products.xml. Each of these files contains a key that we can use to join the data in PDI before we send it to the MongoDB Output step.

Getting ready

Start Spoon and take a look at the content of the orders.csv, customers.xls, and products.xml files. This will help you understand what the data looks like before you start loading it into MongoDB.
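If you prefer to check the files from code before building the transformation, a small Python sketch like the following previews a semicolon-delimited file the way the CSV file input step will read it. The sample rows and column names here are illustrative assumptions, not the actual contents of orders.csv:

```python
import csv
import io
from itertools import islice

# Sample rows standing in for orders.csv -- the real file uses the same
# semicolon delimiter, but these column names and values are made up.
SAMPLE = (
    "Order Number;Customer Number;Product Code\n"
    "10100;363;S18_1749\n"
    "10101;128;S18_2795\n"
)

def preview_csv(handle, delimiter=";", rows=5):
    """Return the header plus the first few data rows of a delimited file."""
    reader = csv.reader(handle, delimiter=delimiter)
    return list(islice(reader, rows + 1))

# With a real file you would pass open("orders.csv", newline="") instead.
rows = preview_csv(io.StringIO(SAMPLE))
print(rows[0])  # the header row
```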

How to do it…

You will need the orders.csv, customers.xls, and products.xml files. If you don't have them, they are available from the Packt Publishing website. Make sure that MongoDB is up and running, and then perform the following steps:

  1. Create a new empty transformation.

    1. Set the transformation name to Migrate data from files to MongoDB.

    2. Save the transformation with the name chapter1-files-to-mongodb.

  2. Select data from the orders.csv file using the CSV file input step.

    1. Select the Design tab in the left-hand-side view.

    2. From the Input category folder, find the CSV file input step and drag and drop it into the working area in the right-hand-side view.

    3. Double-click on the step to open the CSV Input configuration dialog.

    4. Set Step Name to Select Orders.

    5. In the Filename field, click on the Browse button, navigate to the location of the .csv file, and select the orders.csv file.

    6. Set the Delimiter field to a semicolon (;).

    7. Now, let's define our output fields by clicking on the Get Fields button. A Sample size dialog will appear; it is used to analyze the format of the data in the CSV file. Click on OK. Then, click on Close in the Scan results dialog.

    8. Click on OK to finish the configuration of the CSV file input.

  3. Select data from the customers.xls file using the Microsoft Excel Input step.

    1. Select the Design tab in the left-hand-side view.

    2. From the Input category folder, find the Microsoft Excel Input step and drag and drop it into the working area in the right-hand-side view.

    3. Double-click on the step to open the Microsoft Excel Input dialog.

    4. Set Step Name to Select Customers.

    5. On the Files tab, in the File or directory field, click on the Browse button and choose the location of the customers.xls file in your filesystem. After that, click on the Add button to add the file to the list of files to be processed.

    6. Select the Sheets tab. Then, click on the Get sheetname(s)... button. You'll be shown an Enter list dialog. Select Sheet1 and click on the > button to add a sheet to the Your selection list. Finally, click on OK.

    7. Select the Fields tab. Then, click on the Get fields from header row... button. This will generate a list of the fields in the spreadsheet. You will have to make one small change: set the Type field for Customer Number from Number to Integer. You can preview the file data by clicking on the Preview rows button.

    8. Click on OK to finish the configuration of the Select Customers step.

  4. Select data from the products.xml file using the Get data from XML step.

    1. Select the Design tab in the left-hand-side view.

    2. From the Input category folder, find the Get data from XML step and drag and drop it into the working area in the right-hand-side view.

    3. Double-click on the step to open the Get data from XML dialog.

    4. Set Step Name to Select Products.

    5. On the File tab, in the File or directory field, click on the Browse button and choose the location of the products.xml file in your filesystem. After that, click on the Add button to add the file to the list of files to be processed.

    6. Select the Content tab. Click on the Get XPath nodes button and select the /products/product option from the list of the Available Paths dialog.

    7. Next, select the Fields tab. Click on the Get fields button and you will get a list of available fields in the XML file. Change the types of the last three fields (stockquantity, buyprice, and MSRP) from Number to Integer. Set the Trim Type to Both for all fields.

  5. Now, let's join the data from the three different files.

    1. Select the Design tab in the left-hand-side view.

    2. From the Lookup category folder, find the Stream lookup step. Drag and drop it onto the working area in the right-hand-side view. Double-click on Stream lookup and change the Step name field to Lookup Customers.

    3. We are going to need two lookup steps for this transformation. Drag and drop another Stream Lookup step onto the design view, and set Step Name to Lookup Products.

    4. Create a hop between the Select Orders step and the Lookup Customers step.

    5. Then, create a hop from the Select Customers step to the Lookup Customers step.

    6. Next, create a hop from the Lookup Customers step to the Lookup Products step.

    7. Finally, create a hop from Select Products to the Lookup Products step.

  6. Let's configure the Lookup Customers step. Double-click on the Lookup Customers step and set the Lookup step field to the Select Customers option.

    1. In the Keys section, set the Field and Lookup Field options to Customer Number.

    2. Click on the Get lookup fields button. This will populate the step with all the available fields from the lookup source. Remove the Customer Number field from the list.

    3. Click on OK to finish.

  7. Let's configure the Lookup Products step. The process is similar to that of the Lookup Customers step but with different values. Double-click on the Lookup Products step and set the Lookup step field to the Select Products option.

    1. In the Keys section, set Field to Product Code and the Lookup Field option to Code.

    2. Click on the Get lookup fields button. This will populate the step with all the available fields from the lookup source. Remove the Code field from the list.

    3. Click on OK to finish.

  8. Now that we have the data joined correctly, we can write the data stream to a MongoDB collection.

    1. On the Design tab, from the Big Data category folder, find the MongoDB Output step and drag and drop it into the working area in the right-hand-side view.

    2. Create a hop between the Lookup Products step and the MongoDB Output step.

    3. Double-click on the MongoDB Output step and change the Step name field to Orders Output.

    4. Select the Output options tab. Click on the Get DBs button and select the SteelWheels option for the Database field. Set the Collection field to Orders. Check the Truncate collection option.

    5. Select the Mongo document fields tab. Click on the Get fields button and you will get a list of fields from the previous step.

    6. Configure the Mongo document output as seen in the following screenshot:

    7. Click on OK.

  9. You can run the transformation and check out MongoDB for the new data. Your transformation should look like the one in this screenshot:
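As a rough code analogue of the Orders Output step configured above, the following sketch shapes a joined row into a MongoDB document. The flat one-field-per-path layout and the sample values are assumptions, since the actual mapping comes from the Mongo document fields screenshot; the pymongo lines are left commented out because they require a running MongoDB instance:

```python
def to_document(row):
    """Turn one joined PDI row (a plain dict) into a MongoDB document.

    Assumes a flat 1:1 field-to-path mapping; a nested layout would
    instead build sub-documents here.
    """
    return dict(row)

# Illustrative joined row -- field names are assumptions, not the real ones.
doc = to_document({
    "Order Number": 10100,
    "Customer Name": "Online Diecast Creations Co.",
    "Product Code": "S18_1749",
})

# Writing it with pymongo would look roughly like this:
# from pymongo import MongoClient
# collection = MongoClient("localhost", 27017)["SteelWheels"]["Orders"]
# collection.drop()          # the code analogue of "Truncate collection"
# collection.insert_one(doc)

print(sorted(doc))  # the document's field names
```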

How it works…

In this transformation, we initially get data from the orders CSV file. This first step populates the primary data stream in PDI. The XLS and XML input steps collect data into two further streams, which we connect to the first one using the Stream lookup steps and the correct keys. Once all of the data is in a single stream, we can load it into the MongoDB collection.
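In code terms, a Stream lookup step amounts to building an in-memory index of the lookup stream keyed on the lookup field, then enriching each row of the main stream. A minimal sketch of the Lookup Customers step, using illustrative field values rather than the real file contents:

```python
# Illustrative rows -- field names and values are assumptions.
orders = [
    {"Order Number": 10100, "Customer Number": 363, "Product Code": "S18_1749"},
    {"Order Number": 10101, "Customer Number": 128, "Product Code": "S18_2795"},
]
customers = [
    {"Customer Number": 363, "Customer Name": "Online Diecast Creations Co."},
    {"Customer Number": 128, "Customer Name": "Blauer See Auto, Co."},
]

# Index the lookup stream on its key field (the Keys section of the step)...
by_customer = {c["Customer Number"]: c for c in customers}

# ...then enrich each main-stream row, dropping the duplicate key field just
# as the recipe removes Customer Number from the lookup fields list.
joined = []
for order in orders:
    extra = dict(by_customer.get(order["Customer Number"], {}))
    extra.pop("Customer Number", None)
    joined.append({**order, **extra})

print(joined[0]["Customer Name"])
```

The Lookup Products step works the same way, except that the main-stream key (Product Code) and the lookup-stream key (Code) have different names.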

You can learn more about the Stream lookup step online at:

http://wiki.pentaho.com/display/EAI/Stream+Lookup