Pentaho Analytics for MongoDB Cookbook

By: Joel André Latino, Harris Ward
Overview of this book

MongoDB is an open source, schemaless NoSQL database system. Pentaho, a well-known open source analytics platform, provides high performance, high availability, and easy scalability for large data sets. Pentaho's features for MongoDB are designed to help organizations become more agile and scalable, and to give applications better flexibility, faster performance, and lower cost. Whether you are brand new to these tools or a seasoned expert, this book will give you the skills you need to create turnkey analytic solutions that deliver insight and drive value for your organization. The book begins by taking you through Pentaho Data Integration and how it works with MongoDB. You are then introduced to the Kettle Thin JDBC Driver, which enables a Java application to interact with a database. This is followed by exploring a MongoDB collection using Pentaho Instaview and creating reports with MongoDB as a data source using Pentaho Report Designer. The book then teaches you how to explore and visualize your data in Pentaho BI Server using Pentaho Analyzer, and how to create advanced dashboards with your data. It concludes by highlighting contributions of the Pentaho community.

Exporting MongoDB data using the aggregation framework

In this recipe, we will explore the use of the MongoDB aggregation framework in the MongoDB Input step. We will create a simple example that gets data from a collection and shows how you can take advantage of the aggregation framework to prepare data for the PDI stream.

Getting ready

To get ready for this recipe, start Spoon, your ETL development environment, and make sure that the MongoDB server is running with the data loaded in the previous recipe.

How to do it…

The following steps introduce the use of the MongoDB aggregation framework:

  1. Create a new empty transformation.

    1. Set the transformation description to PDI using MongoDB Aggregation Framework.

    2. Set the name for this transformation to chapter1-using-mongodb-aggregation-framework.

  2. Select data from the Orders collection using the MongoDB Input step.

    1. Select the Design tab in the left-hand-side view.

    2. From the Big Data category folder, find the MongoDB Input step and drag and drop it into the working area in the right-hand-side view.

    3. Double-click on the step to open the MongoDB Input dialog.

    4. Set the step name to Select 'Baane Mini Imports' Orders.

    5. Select the Input options tab. Click on the Get DBs button and select the SteelWheels option for the Database field. Next, click on Get collections and select the Orders option for the Collection field.

    6. Select the Query tab and then check the Query is aggregation pipeline option. In the text area, write the following aggregation query:

       { $match: { "customer.name" : "Baane Mini Imports" } },
       { $group: { "_id" : { "orderNumber" : "$orderNumber",
       "orderDate" : "$orderDate" }, "totalSpend" : { $sum :
       "$totalPrice" } } }
    7. Uncheck the Output single JSON field option.

    8. Select the Fields tab. Click on the Get Fields button and you will get a list of fields returned by the query. You can preview your data by clicking on the Preview button.

    9. Click on the OK button to finish the configuration of this step.

  3. We want to add a Dummy step to the stream. This step does nothing, but it will allow us to select a step to preview our data. Add the Dummy step from the Flow category to the workspace and name it OUTPUT.

  4. Create a hop between the Select 'Baane Mini Imports' Orders step and the OUTPUT step.

  5. Select the OUTPUT dummy step and preview the data.
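Outside Spoon, you can sanity-check what the two pipeline stages do before wiring them into the step. The following plain-JavaScript sketch mimics the $match and $group logic over a few invented order documents; the documents, their values, and the customer.name field path are illustrative assumptions, not data from the SteelWheels dump:

```javascript
// Invented sample documents shaped roughly like the Orders collection.
const orders = [
  { orderNumber: 10101, orderDate: "2004-01-09", totalPrice: 1200.5, customer: { name: "Baane Mini Imports" } },
  { orderNumber: 10101, orderDate: "2004-01-09", totalPrice: 300.0,  customer: { name: "Baane Mini Imports" } },
  { orderNumber: 10223, orderDate: "2004-02-20", totalPrice: 999.9,  customer: { name: "Baane Mini Imports" } },
  { orderNumber: 10305, orderDate: "2004-03-15", totalPrice: 450.0,  customer: { name: "Other Customer" } },
];

// Stage 1 -- like $match: keep only one customer's orders.
const matched = orders.filter(o => o.customer.name === "Baane Mini Imports");

// Stage 2 -- like $group: one bucket per {orderNumber, orderDate},
// summing totalPrice into totalSpend.
const groups = new Map();
for (const o of matched) {
  const key = JSON.stringify({ orderNumber: o.orderNumber, orderDate: o.orderDate });
  groups.set(key, (groups.get(key) || 0) + o.totalPrice);
}

const result = [...groups.entries()].map(([key, totalSpend]) => ({
  _id: JSON.parse(key),
  totalSpend,
}));
console.log(result);
```

With this sample data, the two documents for order 10101 collapse into a single grouped row, which is exactly the shape of the rows the Fields tab picks up in the step.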

How it works…

The MongoDB aggregation framework allows you to define a sequence of operations, or stages, that are executed as a pipeline, much like a Unix command-line pipeline. You can manipulate your collection data using operations such as filtering, grouping, and sorting before the data even enters the PDI stream.
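The stage-pipeline idea can be sketched as function composition: each stage consumes the documents the previous stage emitted. This is a hypothetical plain-JavaScript illustration of the concept, not PDI or MongoDB driver code:

```javascript
// Each "stage" is a function from an array of documents to an array of documents.
const match = pred => docs => docs.filter(pred);
const sortBy = key => docs => [...docs].sort((a, b) => (a[key] < b[key] ? -1 : 1));
const limit = n => docs => docs.slice(0, n);

// Running a pipeline = feeding each stage's output into the next,
// much like `cmd1 | cmd2 | cmd3` on a Unix command line.
const aggregate = (docs, stages) => stages.reduce((acc, stage) => stage(acc), docs);

const docs = [
  { orderNumber: 3, totalPrice: 10 },
  { orderNumber: 1, totalPrice: 250 },
  { orderNumber: 2, totalPrice: 90 },
];

const out = aggregate(docs, [
  match(d => d.totalPrice > 50), // like { $match: ... }
  sortBy("orderNumber"),         // like { $sort: ... }
  limit(1),                      // like { $limit: ... }
]);
console.log(out); // [{ orderNumber: 1, totalPrice: 250 }]
```

Because each stage only sees its predecessor's output, reordering stages changes the result, which is why $match is usually placed first: it shrinks the data every later stage has to touch.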

In this case, we are using the MongoDB Input step to execute an aggregation framework query; technically, this does the same as db.collection.aggregate(). The query we execute breaks down into two parts. In the first part, we filter the data by customer name, in this case Baane Mini Imports. In the second part, we group the data by order number and order date and sum the total price.

See also

In the next recipe, we will talk about other ways in which you can aggregate data using MongoDB Map/Reduce.