Pig Design Patterns

By: Pradeep Pasupuleti

Chapter 6. Understanding Data Reduction Patterns

In the previous chapter, we learned about various Big Data transformation techniques that convert the structure of the data to a hierarchical representation in order to take advantage of Hadoop's capability to process semistructured data. We saw the importance of normalizing the data before analyzing it, and then discussed using joins to denormalize the data. We also saw that CUBE and ROLLUP perform multiple aggregations on the data, and that these aggregations provide a snapshot of the data. In the data generalization section, we discussed various generalization techniques for numerical and categorical data.

In this chapter, we will discuss design patterns that perform dimensionality reduction using Principal Component Analysis, and numerosity reduction using clustering, sampling, and histogram techniques.