Bad data lurks in all data ingested by Hadoop, but its impact is magnified by the phenomenal volume and variety of Big Data. Missing records, malformed values, and wrong file formats multiply the time we waste. What drives us to frustration is the data we have but cannot use, the data we once had and then lost, and the data that is no longer the same as it was yesterday. In a Big Data analytics project, it is common to be handed an enormous dataset with little information about where it came from, how it was collected, or what the fields mean. In many cases, the data has passed through many hands and multiple transformations since it was gathered, and nobody really knows what it all means anymore.
Data profiling measures the quality of the data and its fitness for processing in subsequent steps. Put simply, it indicates what is wrong with the data. Data profiling...
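As a minimal sketch of what profiling looks like in practice, the snippet below scans a few hypothetical CSV records (the field names and the rule that `age` must be numeric are assumptions for illustration) and tallies missing and malformed values per column:

```python
import csv
import io

# Hypothetical sample: a few raw records as they might arrive in an ingest job.
raw = """id,age,country
1,34,US
2,,DE
3,abc,FR
4,29,
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# One counter pair per column: how many values are missing, how many malformed.
profile = {col: {"missing": 0, "malformed": 0} for col in rows[0]}

for row in rows:
    for col, value in row.items():
        if value is None or value.strip() == "":
            profile[col]["missing"] += 1
        elif col == "age" and not value.isdigit():
            # Assumed rule: 'age' should be numeric; anything else is malformed.
            profile[col]["malformed"] += 1

for col, stats in profile.items():
    print(col, stats)
```

A real profiling job would apply the same idea at scale (for example, as a MapReduce or Hive job) and cover many more checks, but the core pattern is the same: define an expectation per field and count how often the data violates it.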