Book Image

Getting Started with Greenplum for Big Data Analytics

By : Sunila Gollapudi
Book Image

Getting Started with Greenplum for Big Data Analytics

By: Sunila Gollapudi

Overview of this book

Organizations are leveraging the use of data and analytics to gain a competitive advantage over their opposition. Therefore, organizations are quickly becoming more and more data driven. With the advent of Big Data, existing Data Warehousing and Business Intelligence solutions are becoming obsolete, and a requisite for new agile platforms consisting of all the aspects of Big Data has become inevitable. From loading/integrating data to presenting analytical visualizations and reports, the new Big Data platforms like Greenplum do it all. It is now the mindset of the user that requires a tuning to put the solutions to work. "Getting Started with Greenplum for Big Data Analytics" is a practical, hands-on guide to learning and implementing Big Data Analytics using the Greenplum Integrated Analytics Platform. From processing structured and unstructured data to presenting the results/insights to key business stakeholders, this book explains it all. "Getting Started with Greenplum for Big Data Analytics" discusses the key characteristics of Big Data and its impact on current Data Warehousing platforms. It will take you through the standard Data Science project lifecycle and will lay down the key requirements for an integrated analytics platform. It then explores the various software and appliance components of Greenplum and discusses the relevance of each component at every level in the Data Science lifecycle. You will also learn Big Data architectural patterns and recap some key advanced analytics techniques in detail. The book will also take a look at programming with R and integration with Greenplum for implementing analytics. Additionally, you will explore MADlib and advanced SQL techniques in Greenplum for analytics. This book also elaborates on the physical architecture aspects of Greenplum with guidance on handling high-availability, back-up, and recovery.
Table of Contents (13 chapters)
Getting Started with Greenplum for Big Data Analytics
About the Author
About the Reviewers

Enterprise data

Before we take a deep dive into Big Data and analytics, let us understand the important characteristics of enterprise data as a prerequisite.

Enterprise data signifies data in a perspective that is holistic to an enterprise. We are talking about data that is centralized/integrated/federated, using diverse storage strategy, from diverse sources (that are internal and/or external to the enterprise), condensed and cleansed for quality, secure, and definitely scalable.

In short, enterprise data is the data that is seamlessly shared or available for exploration where relevant information is used appropriately to gain competitive advantage for an enterprise.

Data formats and access patterns are diverse which additionally drives some of the need for various platforms. Any new strategic enterprise application development should not assume the persistence requirements to be relational. For example, data that is transactional in nature could be stored in a relational store and twitter feed could be stored in NoSQL structure.

This would mean bringing in complexity that introduces learning new interfaces but a benefit worth the performance gain.

It requires that an enterprise has the important data engineering aspects in place to handle enterprise data effectively. The following list covers a few critical data engineering aspects:

  • Data architecture and design

  • Database administration

  • Data governance (that includes data life cycle management, compliance, and security)


Enterprise data can be classified into the following categories:

  • Transactional data: It is the data generated to handle day-to-day affairs within an enterprise and reveals a snapshot of ongoing business processing. It is used to control and run fundamental business tasks. This category of data usually refers to a subset of data that is more recent and relevant. This data requires a strong backup strategy and data loss is likely to entail significant monetary impact and legal issues. Transactional data is owned by Enterprise Transactional systems that are the actual source for the data as well. This data is characterized by dynamicity. For example, order entry, new account creation, payments, and so on.

  • Master and Reference data: Though we see Master data and Reference data categorized under the same bucket, they are different in their own sense. Reference data is all about the data that is usually outside the enterprise and is Standards compliant and usually static in nature. On the other hand, Master data is similar in definition with the only difference that it originates from within the enterprise. Both Master and Reference data are referenced by Transactional data and key to the operation of business. This data is often non-transactional/static in nature and can be stored centralized or duplicated. For example:

    Reference data: country codes, PIN, branch codes, and so on

    Master data: accounts, portfolio managers, departments, and so on

  • Analytical data: Business data is analyzed and insights derived are presented for decision making; data classified under this category usually is not owned by the analyzing application. Transaction data from various transaction processing systems is fed for analysis. This data is sliced and diced at various levels to help problem solving, planning, and decision-support as it gives multi-dimensional views of various business activities. It is usually larger in volume and historic in nature when compared to transactional data.

In addition to the preceding categories, there are a few other important data classifications. These classifications define the character data:

  • Configuration data: This classification refers to the data that describes data or defines the way data needs to be used. There can be many categories of configuration data. For example, an application has many clients, and each client needs to refer to a unique set of messaging configurations (let's say a unique queue name) or information regarding how to format a report that needs to be displayed for a particular user, and so on. This classification is also referred to as metadata.

  • Historic data: It refers to any data that is historic in nature. Typically gives reference to facts at a given point in time. This data requires a robust archival strategy as it is expected to be voluminous. At the same time, it would not undergo any changes and is usually used as a reference for comparison. Corrections/changes to historic data can happen only in the case of errors. Examples can be, security price at a point in time, say January 20, 1996, financial transactions of an account in the first quarter of the year, and so on.

  • Transitional data: This is one of the most important data classifications that refer to data that is intermediary and temporary in nature. This data is usually generated to improve the data processing speed and could be kept in memory that is evicted post its use. This data might not be available for direct consumption. Example for this data classification can be an intermediary computation data that is stored and is to be used in a bigger scheme of data processing, like market value for each security to compute, and rate of return on the overall value invested.


In this section, we will understand the characteristic features of enterprise data. Each of the listed characteristics describes a unique facet/behavior that would be elaborated in the implementation perspective later in the Data Science life cycle section in this chapter. Following are a few important characteristics of enterprise data:

  • Included: Enterprise data is integrated and usually, but not mandatorily, centralized to all applications within an enterprise. Data from various sources and varied formats is either aggregated or federated for this purpose. (Aggregation refers to physically combining data sets into a single structure and location while federation is all about getting a centralized way to access a variety of data sources to get the required data without physically combining/merging the data.)

  • Standards compliance: Data is represented/presented to the application in context in a format that is either a standard to an enterprise/across enterprises.

  • Secure: Data is securely accessible through authorization.

  • Scalable: In a context where data is integrated from various sources, the need to support larger volumes becomes critical, and thus the scalability, both in terms of storage and processing.

  • Condensed/Cleansed/Consistent: Enterprise data can possibly be condensed and cleansed to ensure data quality against a given set of data standards for an enterprise.

  • Varied sources and formats: Data is mostly combined from varied sources and can continue to be stored in varied formats for optimized usage.

  • Available: Enterprise data is always consistent with minimal data disparity and available to all applications using it.