Pentaho for Big Data Analytics

By: Manoj R Patil, Feris Thia

Overview of this book

Pentaho accelerates the realization of value from big data with the most complete solution for big data analytics and data integration. The real power of big data analytics is the abstraction between data and analytics. Data can be distributed across the cluster in various formats, and the analytics platform should have the capability to talk to different heterogeneous data stores and fetch the filtered data to enrich its value.

Pentaho Big Data Analytics is a practical, hands-on guide that provides you with clear, step-by-step exercises for using Pentaho to take advantage of big data systems, where data beats algorithm, and gives you a good grounding in using Pentaho Business Analytics' capabilities.

This book looks at the key ingredients of the Pentaho Business Analytics platform. We will see how to prepare the Pentaho BI environment, and get to grips with the big data ecosystem. The book provides a clear guide to the essential tools of Pentaho Business Analytics, providing familiarity with both the various design tools for setting up reports, and the visualization tools necessary for complete data analysis.

Pentaho BI Suite – components


Pentaho is a trailblazer when it comes to business intelligence and analysis, offering a full suite of capabilities for the ETL (Extract, Transform, and Load) processes, data discovery, predictive analysis, and powerful visualization. It can be deployed on premises, in the cloud, or embedded in custom applications.

Pentaho is a provider of a Big Data analytics solution that spans data integration, interactive data visualization, and predictive analytics. As depicted in the following diagram, this platform contains multiple components, which are divided into three layers: data, server, and presentation:

Let us take a detailed look at each of the components in the previous diagram.

Data

One of the biggest advantages of Pentaho is that it integrates seamlessly with multiple data sources. In fact, Pentaho Data Integration 4.4 Community Edition (referred to as CE hereafter) supports 44 open source and proprietary databases, flat files, spreadsheets, and other third-party software out of the box. Pentaho introduced the Adaptive Big Data Layer as part of the Pentaho Data Integration engine to support the evolution of the Big Data stores. This layer accelerates access to, and integration with, the latest versions and capabilities of the Big Data stores. It natively supports third-party Hadoop distributions from MapR, Cloudera, and Hortonworks, as well as popular NoSQL databases such as Cassandra and MongoDB. These new Pentaho Big Data initiatives bring greater adaptability, abstraction from change, and increased competitive advantage to companies facing the never-ceasing evolution of the Big Data ecosystem. Pentaho also supports analytic databases such as Greenplum and Vertica.

Server applications

The Pentaho Administration Console (PAC) server in CE or Pentaho Enterprise Console (PEC) server in EE (Enterprise Edition) is a web interface used to create, view, schedule, and apply permissions to reports and dashboards. It also provides an easy way to manage security, scheduling, and configuration for the Business Analytics Server and Data Integration Server along with repository management. The server applications are as follows:

  • Business Analytics (BA) Server: This is a Java-based BI platform with a report management system and lightweight process-flow engine. This platform also provides an HTML5-based web interface for creating, scheduling, and sharing various artifacts of BI such as interactive reporting, data analysis, and a custom dashboard. In CE, we have a parallel application called Business Intelligence (BI) Server.

  • Data Integration (DI) Server: This is a commercially available enterprise class server for the ETL processes and Data Integration. It helps to execute ETL and Data Integration jobs smoothly. It also provides scheduling to automate jobs and supports content management with the help of revision history and security integration.

Thin Client Tools

The Thin Client Tools all run inside Pentaho User Console (PUC) in a web browser (such as Internet Explorer, Chrome, or Firefox). Let's have a look at each of the tools:

  • Pentaho Interactive Reporting: This is a "What You See is What You Get" (WYSIWYG) type of design interface used to build simple and ad hoc reports on the fly without having to rely on IT support. Any business user can design reports using the drag-and-drop feature by connecting to the desired data source and then do rich formatting or use the existing templates.

  • Pentaho Analyzer: This provides an advanced web-based OLAP viewer that works across multiple browsers and supports drag-and-drop. It is an intuitive analytical visualization application with the capability to filter and drill down further into business information data, which is stored in its own Pentaho Analysis (Mondrian) data source. You can also perform other activities such as sorting, creating derived measures, and chart visualization.

  • Pentaho Dashboard Designer (EE): This is a commercial plugin that allows users to create dashboards with great usability. Dashboards can present a centralized view of key performance indicators (KPIs) and other business data movement, along with dynamic filter controls and customizable layouts and themes.
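
The filtering, drill-down, sorting, and derived-measure operations mentioned for Pentaho Analyzer can be illustrated outside of Pentaho with a few lines of plain Python. This is only a conceptual sketch: the `SALES` rows, the `roll_up` function, and the `margin` measure are hypothetical names invented for this example, and none of it reflects how Analyzer or Mondrian is implemented.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, city, revenue, cost). Illustrative data only.
SALES = [
    ("East", "Boston", 120.0, 80.0),
    ("East", "New York", 200.0, 150.0),
    ("West", "Seattle", 90.0, 60.0),
    ("West", "Portland", 60.0, 45.0),
]


def roll_up(rows, level, region=None):
    """Sum revenue and cost per member of `level` ("region" or "city").

    When `region` is given, only rows of that region are kept (a filter),
    so roll_up(rows, "city", region="East") acts as a drill-down into East.
    """
    index = {"region": 0, "city": 1}[level]
    totals = defaultdict(lambda: {"revenue": 0.0, "cost": 0.0})
    for row in rows:
        if region is not None and row[0] != region:
            continue
        totals[row[index]]["revenue"] += row[2]
        totals[row[index]]["cost"] += row[3]
    # Derived measure: margin is computed from the two stored measures.
    for measures in totals.values():
        measures["margin"] = measures["revenue"] - measures["cost"]
    return dict(totals)


by_region = roll_up(SALES, "region")                  # top-level view
east_by_city = roll_up(SALES, "city", region="East")  # drill down into East
# Sort members by the derived measure, largest margin first.
ranked = sorted(by_region, key=lambda r: by_region[r]["margin"], reverse=True)
```

An OLAP viewer performs the same kind of grouping and filtering interactively, with the engine generating the queries for you instead of hand-written loops.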

Design tools

Let's take a quick look at each of these tools:

  • Schema Workbench: This is a Graphical User Interface (GUI) for designing ROLAP cubes for Pentaho Analysis (Mondrian). It also lets BI end users explore and analyze data without having to understand the MultiDimensional eXpressions (MDX) language.

  • Aggregation Designer: This tool works from Pentaho Analysis (Mondrian) schema files in XML and from the database whose underlying tables the schema describes. It generates pre-calculated, pre-aggregated answers that greatly improve the performance of analysis work and of MDX queries executed against Mondrian.

  • Metadata Editor: This is a tool used to create logical business models; it acts as an abstraction layer over the underlying physical data layer. The resulting metadata mappings are used by Pentaho's Interactive Reporting (or the community-based Saiku Reporting) to create reports within the BA Server without any other external desktop application.

  • Report Designer: This is a banded report designing tool with a rich GUI, which can also contain sub-reports, charts, and graphs. It can query and use data from a range of data sources, from text files to RDBMS to Big Data, which addresses the requirements of financial, operational, and production reporting. Reports can be executed standalone from the user console or used within a dashboard. Pentaho Report Designer consists of a reporting engine at its core, which accepts a .prpt template to process reports. This file is in a ZIP format with XML resources to define the report design.

  • Data Integration: This is also known as "Kettle", and consists of a core integration (ETL) engine and a GUI application that allows the user to design Data Integration jobs and transformations. It supports distributed deployment on a cluster or cloud environment as well as on single-node computers. It has an adaptive Big Data layer, which supports different Big Data stores and insulates you from Hadoop specifics, so that you can focus on analysis without worrying much about changes in the Big Data stores.

  • Design Studio: This is an Eclipse-based application and plugin that helps you create business process flows. It uses a special XML script to define action sequences, called xactions, and other forms of automation in the platform. Action sequences define a lightweight, result-oriented business flow within the Pentaho BA Server.
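
The Report Designer entry above describes the report template as a ZIP file containing XML resources. The following Python sketch shows, under stated assumptions, what inspecting such a bundle could look like. The entry names (`layout.xml`, `datasource.xml`) and their XML elements are hypothetical placeholders chosen for this example; they are not the actual layout of a Pentaho report file, and the functions `build_sample_bundle` and `describe_bundle` are not part of any Pentaho API.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def build_sample_bundle():
    """Create an in-memory ZIP holding two small XML resources.

    The entry names and XML elements are hypothetical placeholders,
    not the real layout of a Pentaho report file.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        bundle.writestr(
            "layout.xml",
            "<layout><band name='header'/><band name='details'/></layout>",
        )
        bundle.writestr("datasource.xml", "<datasource type='sql'/>")
    return buffer.getvalue()


def describe_bundle(data):
    """Return {entry name: root XML tag} for every .xml entry in the ZIP."""
    summary = {}
    with zipfile.ZipFile(io.BytesIO(data)) as bundle:
        for name in bundle.namelist():
            if name.endswith(".xml"):
                root = ET.fromstring(bundle.read(name))
                summary[name] = root.tag
    return summary


summary = describe_bundle(build_sample_bundle())
print(summary)
```

To look inside a real template, you could read the file's bytes from disk and pass them to `describe_bundle`; the entries you see will depend on the Report Designer version that produced the file.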