Pig Design Patterns
This section describes the string profiling design pattern, in which Pig scripts are applied to textual data to compute key statistics.
Most Big Data implementations deal with text data embedded in columns. To gain insight from these columns, they have to be integrated with other structured enterprise data. This design pattern describes a few techniques that help you understand the quality of textual data.
The quality of textual data can be assessed by applying basic statistical techniques to the attribute values. String length is the most important dimension when selecting appropriate data types and sizes for the target system. You can use the maximum and minimum string lengths to determine, at a glance, whether the data ingested into Hadoop meets a given constraint. When dealing with data sizes in the petabyte range, limiting the character count to be just large enough optimizes storage and computation by cutting down...
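The length statistics described above can be illustrated with a minimal sketch. It is written in plain Python rather than Pig, purely to show the arithmetic a profiling script would perform on a single column. The function name `profile_string_lengths`, the `max_allowed` parameter, and the sample values are all hypothetical and do not come from the book.

```python
def profile_string_lengths(values, max_allowed=None):
    """Return basic length statistics for one column of strings.

    None entries are counted as nulls and excluded from the length
    statistics. `max_allowed` is a hypothetical target-system limit;
    values longer than it are counted as violations.
    """
    lengths = [len(v) for v in values if v is not None]
    null_count = sum(1 for v in values if v is None)

    if not lengths:
        # Nothing to profile: report the null count and empty statistics.
        return {"count": 0, "nulls": null_count, "min": None,
                "max": None, "avg": None, "violations": 0}

    if max_allowed is None:
        violations = 0
    else:
        violations = sum(1 for n in lengths if n > max_allowed)

    return {
        "count": len(lengths),
        "nulls": null_count,
        "min": min(lengths),
        "max": max(lengths),
        "avg": sum(lengths) / len(lengths),
        "violations": violations,
    }


# Illustrative sample column (made-up values).
sample = ["Alice", "Bob", None, "Christopher", ""]
profile = profile_string_lengths(sample, max_allowed=10)
print(profile)
```

In this sketch, a minimum length of zero flags empty strings, while a maximum above the assumed limit indicates values that would not fit the target column. In a Pig script, the same figures would typically be produced by grouping the whole relation and applying aggregate functions to the per-row string sizes.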