Pentaho Analytics for MongoDB

By: Bo Borland

Overview of this book

Pentaho Analytics for MongoDB will teach you MongoDB and Pentaho integration points and developer skills needed to create turnkey analytic solutions that deliver insight and drive value for your organization.

Starting with how to install, configure, and develop content in both Pentaho and MongoDB, this book will give you the complete range of skills needed to gain insight into MongoDB data using Pentaho Business Analytics. You will learn about MongoDB data models and query techniques, which are covered in combination with the provided sample MongoDB database. You then advance to data integration, analysis, and reporting using Pentaho.

You will learn how to use Pentaho Data Integration to blend and enrich data from additional sources. From this blended data, you will develop professional-looking reports and analysis views that are visual and interactive. Lastly, we will cover the Pentaho web portal and web interfaces for deploying analytics out to a broader set of consumer users.

MongoDB technology overview


Modern businesses capture huge volumes and varieties of data using several different data storage methods. There is no one-size-fits-all data storage method, because each technology evolved to tackle the data challenges or opportunities of its own time. We continue to see new and innovative data storage solutions as data volume, variety, and velocity grow, and as people figure out new ways to use data. The following is a small sampling of the variety of data sources you might encounter in a single organization:

  • Simple tabular data files: CSV, text files, and MS Excel

  • Commercial relational databases: Oracle, SQL Server, and DB2

  • Open source relational databases: MySQL and PostgreSQL

  • Modern, web-oriented data sources: XML, JSON, web services, and APIs

  • Hadoop distributions: Apache, Hortonworks, Cloudera, MapR, and Intel

  • Analytical databases: Vertica, Greenplum, and Infobright

  • Machine-generated data sources: Application logs, web server logs, configuration files, sensor data, message queues, and filesystem audit logs

  • NoSQL databases: MongoDB, Redis, Cassandra, HBase, and CouchDB

Organizations invest heavily in these storage technologies and the skills needed to capture, store, and process data. MongoDB has emerged as a leader in the NoSQL category of databases. Because you are reading this book, you are probably well aware of the differences between MongoDB and relational databases; however, it is important to review where MongoDB came from and why it is a popular alternative to relational databases.

MongoDB is a document-oriented database designed to conquer some of the modern data storage challenges that developers and IT departments experience when using traditional relational databases. These challenges started with the rise of the internet and high-traffic websites such as Amazon and Google. These companies' websites attracted millions of users and, with them, massive volumes of website log, clickstream, and event data. Traditional relational database methods for handling growing data mostly involved scaling up (that is, vertical scaling) by adding more CPU and RAM to a single, often proprietary database server. This method of scaling was expensive, and there were limits on how far a single server could scale. As a result, Google and Amazon decided to solve these data challenges by developing their own distributed data stores that could easily scale out (that is, horizontal scaling) across hundreds or thousands of commodity servers, as shown in the following figure. Horizontal scaling made it easier to scale dynamically by adding more machines to the cluster without downtime and without the capacity ceiling of a single server.

These early pioneers of distributed databases inspired a NoSQL data storage movement that included MongoDB. The term NoSQL is a popular way to describe MongoDB because MongoDB does not use SQL, but for many the name is inadequate: it refers only to the query language, whereas the true essence of MongoDB lies in how data is stored, as a horizontally scalable, distributed document database. Surprisingly, the name NoSQL originated simply as a short and unique Twitter hashtag, #nosql, used to advertise a meetup on these emerging distributed data stores. The hashtag stuck and is widely used today.

Why would a database solution not leverage SQL, the most popular database query language used by developers and organizations all over the world? One key reason is that SQL is not designed to efficiently query the nested constructs involved in hierarchical JSON documents, which form the foundation of the MongoDB document data model. JSON documents are language-independent text that represents data, and they are built on two primary data structures: collections of name/value pairs (objects), which can be nested, and ordered lists of values (arrays). The following example shows the JSON representation of a dataset that describes the movie Forrest Gump. It contains a parent object for movie information, a nested object representing the production company, and an array of cast member objects, shown as follows:

{
    "movie": "Forrest Gump",
    "rating": "PG-13",
    "duration_min": 142,
    "production_company": {
       "name": "Paramount Pictures",
       "streetAddress": "5555 Melrose Ave",
       "city": "Los Angeles",
       "state": "CA",
       "postalCode": 90038
    },
    "cast": [
        {
            "character": "Forrest Gump",
            "person": "Tom Hanks"
        },
        {
            "character": "Jenny Curran",
            "person": "Robin Wright"
        }
    ]
}

Each object is a comma-separated collection of key-value pairs enclosed in curly braces. MongoDB has its own query language, built from the ground up as a powerful way to retrieve, process, and update JSON documents. Document-oriented data storage is an alternative to SQL-based relational databases, and it offers some unique advantages that we will discuss in more detail in the next chapter. You can also learn more about JSON at json.org or Wikipedia.org/wiki/JSON.
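
To make the nesting concrete, the following is a minimal Python sketch that parses the movie document and reads values from the nested object and the array. It uses only the standard json module and is not MongoDB's query language. The get_path helper is hypothetical, written for this illustration; it loosely imitates the dotted field paths (such as production_company.city) that MongoDB uses to address nested fields.

```python
import json

# The movie document from the text, held as a JSON string.
MOVIE_JSON = """
{
    "movie": "Forrest Gump",
    "rating": "PG-13",
    "duration_min": 142,
    "production_company": {
        "name": "Paramount Pictures",
        "streetAddress": "5555 Melrose Ave",
        "city": "Los Angeles",
        "state": "CA",
        "postalCode": 90038
    },
    "cast": [
        {"character": "Forrest Gump", "person": "Tom Hanks"},
        {"character": "Jenny Curran", "person": "Robin Wright"}
    ]
}
"""


def get_path(document, dotted_path):
    """Hypothetical helper: follow a dotted path such as 'a.b' or 'cast.0.person'."""
    current = document
    for part in dotted_path.split("."):
        if isinstance(current, list):
            current = current[int(part)]  # numeric parts index into arrays
        else:
            current = current[part]       # other parts are object keys
    return current


movie = json.loads(MOVIE_JSON)  # JSON objects become dicts, arrays become lists

city = get_path(movie, "production_company.city")   # value inside the nested object
first_actor = get_path(movie, "cast.0.person")      # value inside the first array element
actors = [member["person"] for member in movie["cast"]]

print(city)
print(first_actor)
print(actors)
```

Reading the same values from a relational model would typically involve joining separate movie, company, and cast tables, which is the mismatch the paragraph above describes.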

MongoDB stores each record as a JSON-like data structure known as a document, and these documents can contain many different key-value pairs, key-array pairs, or even nested documents. MongoDB's document-oriented data model offers the following benefits over relational databases:

  • It provides speedy and easy horizontal scaling by auto-sharding and grouping related data together in document collections instead of separate database tables that require joins to pull the data back together. The multitable joins of relational database management systems (RDBMS) reduce performance and make horizontal scaling more difficult.

  • It provides faster and easier application development by providing a data model that maps to native programming language objects. JSON's universal data structures make this possible and are supported by virtually any modern programming language. This makes MongoDB popular among developers because it permits a one-to-one mapping between object-oriented software objects and database entities. It also makes data interchange between software and databases easier.

  • It provides a dynamic schema that makes it easier than enforced RDBMS schemas to manage and evolve your data model. MongoDB allows for insertion of data without a predefined database schema. This gives software developers more flexibility to define and manipulate the database schema instead of relying on a separate database administrator to maintain schema changes.
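
The object mapping and dynamic schema points in the list above can be sketched in plain Python. This is an illustration under stated assumptions, not MongoDB's API: the Movie and CastMember classes are hypothetical, an ordinary list stands in for a document collection, and the second document is made-up sample data.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class CastMember:
    character: str
    person: str


@dataclass
class Movie:
    movie: str
    rating: str
    cast: list = field(default_factory=list)


# An application object with a nested list of objects.
forrest = Movie("Forrest Gump", "PG-13", [CastMember("Forrest Gump", "Tom Hanks")])

# One-to-one mapping: the object graph becomes one nested document, with no join tables.
document = asdict(forrest)

# The document survives a trip through JSON text unchanged.
round_trip = json.loads(json.dumps(document))

# Dynamic schema: a plain list stands in for a collection (an assumption of this sketch).
# The second document is made-up data with a different set of fields.
collection = [document, {"movie": "Example Film", "tags": ["drama"]}]
field_sets = [sorted(doc.keys()) for doc in collection]

print(json.dumps(document, indent=2))
print(field_sets)
```

Both documents sit side by side even though their fields differ, which is the flexibility the dynamic schema bullet describes. In a relational database, adding the tags field would usually require a schema change first.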