Index
A
- Abstract Syntax Tree (AST) / The EXPLAIN operator
- aggregation
- about / Data transformation processes
- aggregation pattern
- about / The aggregation pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- results / Results
- ANother Tool for Language Recognition (ANTLR) / The EXPLAIN operator
- Apache log ingestion pattern
- about / The Apache log ingestion pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- CommonLogLoader class / Code for the CommonLogLoader class
- CombinedLogLoader class / Code for the CombinedLogLoader class
- results / Results
- Apache projects, Hadoop
- Avro / Hadoop under the covers
- Chukwa / Hadoop under the covers
- SQOOP / Hadoop under the covers
- HBase / Hadoop under the covers
- Hive / Hadoop under the covers
- Mahout / Hadoop under the covers
- Pig / Hadoop under the covers
- ZooKeeper / Hadoop under the covers
- Application Programming Interfaces (APIs)
- about / The advent of Hadoop
- atom
- about / Complex types
- audio / Types of data in the enterprise
- Avro
- about / Hadoop under the covers
B
- bag / Complex types
- basic statistical profiling pattern
- background / Background
- motivation / Motivation
- profiling requisites / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- Pig script / Pig script
- getProfile macro / Macro
- results / Results
- Big Data
- data profiling / Data profiling for Big Data
- data validation / Data validation and cleansing for Big Data
- data cleansing / Data validation and cleansing for Big Data
- data reduction considerations / Data reduction considerations for Big Data
- Big Data analytics project
- about / Data profiling for Big Data
- quality measures / Data profiling for Big Data
- metadata / Data profiling for Big Data
- data quality, measuring / Data profiling for Big Data
- Big Data profiling
- about / Data profiling for Big Data
- structured Big Data, profiling / Data profiling for Big Data
- unstructured Big Data, profiling / Data profiling for Big Data
- dimensions / Big Data profiling dimensions
- sampling considerations / Sampling considerations for profiling Big Data
- sampling techniques / Sampling considerations for profiling Big Data
- binning / Motivation
C
- cartesian join / Motivation
- Chukwa
- about / Hadoop under the covers
- classification
- about / Motivation
- training / Motivation
- testing / Motivation
- production / Motivation
- performing / Motivation
- classification pattern
- about / The classification pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- results / Results
- cleansing steps, for unstructured data / Motivation
- clustering / Motivation
- about / Background, Background
- text clustering / Background
- clustering design pattern
- about / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- results / Results
- clustering pattern
- about / The clustering pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- results / Results
- code snippets, image egress
- about / The image egress
- Pig script / Pig script
- SequenceToImageStorage / Sequence to an image UDF
- SequenceFile.Reader class / Sequence to an image UDF
- results / Results
- code snippets, image ingress
- about / The image ingress
- Pig script / Pig script
- ImagetoSequenceFileUDF / Image to a sequence UDF snippet
- code snippets, JSON ingress and egress patterns
- ingress code / The ingress code
- code for simple JSON / The code for simple JSON
- code for nested JSON / The code for nested JSON
- egress code / The egress code
- results / Results
- code snippets, XML ingest and egress patterns
- XML raw ingestion / The XML raw ingestion code
- XML binary ingestion code / The XML binary ingestion code
- XML egress code / The XML egress code
- Pig script / Pig script
- XML Storage / The XML storage
- coherence
- about / Big Data profiling dimensions
- measuring / Big Data profiling dimensions
- referential integrity / Big Data profiling dimensions
- value integrity / Big Data profiling dimensions
- Collapsed Variational Bayes (CVB) / Pattern implementation
- combiner / The MapReduce internals
- completeness
- about / Big Data profiling dimensions
- determining / Big Data profiling dimensions
- attribute completeness / Big Data profiling dimensions
- tuple completeness / Big Data profiling dimensions
- value completeness / Big Data profiling dimensions
- complex data types, Pig
- about / Complex types
- atom / Complex types
- tuple / Complex types
- bag / Complex types
- map / Complex types
- Compression
- constraint validation and cleansing design pattern
- about / The constraint validation and cleansing design pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- pattern implementation / Pattern implementation
- mandatory constraints / Pattern implementation
- range constraints / Pattern implementation
- unique constraints / Pattern implementation
- code snippets / Code snippets
- results / Results
- correctness
- about / Big Data profiling dimensions
- determining / Big Data profiling dimensions
- corrupt data validation and cleansing design pattern
- about / The corrupt data validation and cleansing design pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- pattern implementation / Pattern implementation
- code snippets / Code snippets
- results / Results
- custom log ingestion pattern
- about / The Custom log ingestion pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- results / Results
- Cygwin / Firing up Pig
D
- data-driven patterns
- emergence / Emergence of data-driven patterns
- data cleansing, for Big Data
- data cleansing, Pig used
- advantages / Choosing Pig for validation and cleansing
- data corruption, sources
- sensor data / Background
- structured data / Background
- data egress
- data generalization pattern
- about / The data generalization pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- results / Results
- data ingest
- data integration
- about / Data transformation processes
- data integration pattern
- about / The data integration pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- results / Results
- data model, Pig
- primitive data types / Primitive types
- complex data types / Complex types
- data normalization pattern
- about / The data normalization pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- results / Results
- data profiling
- implementing in Hadoop, Pig used / Rationale for using Pig in data profiling
- data profiling, Big Data
- about / Data profiling for Big Data
- data reduction
- data reduction considerations, Big Data / Data reduction considerations for Big Data
- data reduction techniques
- dimensionality reduction / Data reduction – a quick introduction
- numerosity reduction / Data reduction – a quick introduction
- compression / Data reduction – a quick introduction
- diagrammatic representation / Data reduction – a quick introduction
- data transformation
- about / Data transformation processes
- normalization / Data transformation processes
- aggregation / Data transformation processes
- generalization / Data transformation processes
- data integration / Data transformation processes
- data type inference pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- Pig script / Pig script
- Java UDF code snippet / Java UDF
- results / Results
- data validation, Big Data / Data validation and cleansing for Big Data
- DEFINE operator / Operators used in code
- DESCRIBE operator / Operators used in code
- design patterns
- about / Understanding design patterns
- scope, in Pig / The scope of design patterns in Pig
- dimensionality reduction
- Principal Component Analysis design pattern / Dimensionality reduction – the Principal Component Analysis design pattern
- SVD / Motivation
- Dimensionality reduction
- dimensions, Big Data profiling
- completeness / Big Data profiling dimensions
- correctness / Big Data profiling dimensions
- coherence / Big Data profiling dimensions
- distance-based partitioning algorithms / Motivation
- DISTINCT operator / Operators used in code
- DUMP operator / Operators used in code
E
- edit distance / Motivation
- enterprise-centric view, data
- diagrammatic representation / Types of data in the enterprise
- enterprise context
- about / The enterprise context
- equal-frequency grouping technique / Motivation
- equal-width grouping technique / Motivation
- EXPLAIN operator
- about / The EXPLAIN operator
F
- File System nodes / Working of HDFS
- FILTER operator / Operators used in code
- FLATTEN operator / Operators used in code
- FOREACH operator / Operators used in code
- full outer join / Motivation
G
- generalization
- about / Data transformation processes
- Google File System (GFS)
- about / The advent of Hadoop
- GROUP operator / Operators used in code
H
- Hadoop
- enterprise context / The enterprise context
- challenges, of distributed systems / Common challenges of distributed systems
- features / The advent of Hadoop
- integral parts / Hadoop under the covers
- Apache projects / Hadoop under the covers
- data profiling, implementing using Pig / Rationale for using Pig in data profiling
- HBase
- about / Hadoop under the covers
- HBase ingress and egress pattern
- about / The HBase ingress and egress pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- ingress implementation / The ingress implementation
- egress implementation / The egress implementation
- code snippets / Code snippets
- ingress code / The ingress code
- egress code / The egress code
- results / Results
- HDFS
- about / Understanding the Hadoop Distributed File System
- design goals / HDFS design goals
- working / Working of HDFS
- NameNode / Working of HDFS
- DataNodes / Working of HDFS
- Hierarchical Agglomerative Clustering (HAC) / Motivation
- high-volume data
- legacy data / Types of data in the enterprise
- transactional (OLTP) data / Types of data in the enterprise
- unstructured data / Types of data in the enterprise
- video / Types of data in the enterprise
- audio / Types of data in the enterprise
- images / Types of data in the enterprise
- numerical/patterns/graphs / Types of data in the enterprise
- social media data / Types of data in the enterprise
- histogram design pattern
- about / Numerosity reduction – the histogram design pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- results / Results
- Hive
- about / Hadoop under the covers
- Hive ingress and egress pattern
- about / The Hive ingress and egress patterns
- background / Background
- motivation / Motivation
- use cases / Use cases
- ingress implementation / The ingress implementation
- egress implementation / The egress implementation
- code snippets / Code snippets
- ingress code / The ingress Code
- data, importing using RCFile / Importing data using RCFile
- HiveColumnarLoader, using / Importing data using RCFile
- data, importing using HCatalog / Importing data using HCatalog
- egress code / The egress code
- results / Results
I
- ILLUSTRATE operator / Operators used in code
- image egress implementation
- performing / The image egress implementation
- image ingress and egress pattern
- about / The image ingress and egress pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- image ingress implementation
- performing / The image Ingress Implementation
- images / Types of data in the enterprise
- inner join / Motivation
- integral parts, Hadoop
- about / Hadoop under the covers
- Hadoop Common / Hadoop under the covers
- HDFS / Hadoop under the covers
- Hadoop MapReduce / Hadoop under the covers
- interquartile range (IQR) / Motivation
J
- Jaccard similarity / Motivation
- Java 1.6 / Firing up Pig
- JobTracker / Understanding how MapReduce works
- JOIN operator / Operators used in code
- JSON ingress and egress patterns
- about / JSON ingress and egress patterns
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- ingress implementation / The ingress implementation
- egress implementation / The egress implementation
- code snippets / Code snippets
K
- K-means clustering algorithm / Motivation
- K-medoid clustering algorithm / Motivation
L
- Latent Dirichlet Allocation (LDA)
- about / The topic discovery pattern
- left outer join / Motivation
- legacy data / Types of data in the enterprise
- Levenshtein distance / Motivation
- LIMIT operator / Operators used in code
- LOAD operator / Operators used in code
- local mode, Pig
- about / Firing up Pig
- logical optimization, Pig processing / The EXPLAIN operator
- log ingestion pattern
- considerations / Considerations for log ingestion
M
- mainframe ingestion pattern
- about / The mainframe ingestion pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- results / Results
- map / Complex types
- Map function / The MapReduce internals
- MAPREDUCE / Pig's extensibility
- MapReduce
- using / Understanding the Hadoop Distributed File System
- about / Understanding MapReduce
- working / Understanding how MapReduce works
- JobTracker / Understanding how MapReduce works
- TaskTrackers / Understanding how MapReduce works
- internals / The MapReduce internals
- components / The MapReduce internals
- MapReduce components
- combiner / The MapReduce internals
- partitioner / The MapReduce internals
- output / The MapReduce internals
- job configuration / The MapReduce internals
- job input / The MapReduce internals
- MapReduce job
- Map function / The MapReduce internals
- Reduce function / The MapReduce internals
- MapReduce mode, Pig
- about / Firing up Pig
- MapReduce plan, Pig processing / The EXPLAIN operator
- master node / Working of HDFS
- maxDiff grouping technique / Motivation
- MongoDB ingress and egress pattern
- about / MongoDB ingress and egress patterns
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- ingress implementation / The ingress implementation
- egress implementation / The egress implementation
- code snippets / Code snippets
- ingress code / The ingress code
- egress code / The egress code
- results / Results
- multistructured data
- ingest pattern / Ingest and egress patterns for multistructured data
- egress pattern / Ingest and egress patterns for multistructured data
- Apache Log formats / Ingest and egress patterns for multistructured data
- custom log format / Ingest and egress patterns for multistructured data
- image format / Ingest and egress patterns for multistructured data
N
- natural language processing pattern
- about / The natural language processing pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- results / Results
- NLP pipeline
- end of sentence detection / Motivation
- tokenization / Motivation
- parts-of-speech tagging / Motivation
- chunking / Motivation
- extraction / Motivation
- non-numeric data
- normalizing / Motivation
- NonProbabilistic sampling
- about / Motivation
- nonprobabilistic sampling methods
- about / Motivation
- normalization
- about / Data transformation processes
- NoSQL data
- ingress pattern / The ingress and egress patterns for the NoSQL data
- egress pattern / The ingress and egress patterns for the NoSQL data
- numerical/patterns/graphs / Types of data in the enterprise
- numeric data
- normalizing / Motivation
- numerosity reduction
- histogram design pattern / Numerosity reduction – the histogram design pattern
- sampling design pattern / Numerosity reduction – sampling design pattern
- clustering design pattern / Numerosity reduction – clustering design pattern
- Numerosity reduction
O
- operators, Pig code
- DEFINE / Operators used in code
- LOAD / Operators used in code
- STORE / Operators used in code
- DUMP / Operators used in code
- UNION / Operators used in code
- SAMPLE / Operators used in code
- GROUP / Operators used in code
- FOREACH / Operators used in code
- DISTINCT / Operators used in code
- JOIN / Operators used in code
- DESCRIBE / Operators used in code
- FILTER / Operators used in code
- ILLUSTRATE / Operators used in code
- ORDERBY / Operators used in code
- PARALLEL / Operators used in code
- LIMIT / Operators used in code
- FLATTEN / Operators used in code
- ORDERBY operator / Operators used in code
P
- PARALLEL operator / Operators used in code
- partitioner / The MapReduce internals
- pattern-matching pattern
- about / The pattern-matching pattern
- background / Background, Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- Pig script / Pig script
- getPatterns macro, implementing / Macro
- results / Results
- Perl / Firing up Pig
- physical plan, Pig processing / The EXPLAIN operator
- Pig
- design patterns / The scope of design patterns in Pig
- about / Hadoop under the covers, Pig – a quick intro
- compiler / Pig – a quick intro
- Latin script / Pig – a quick intro
- standard data-processing operators / Pig – a quick intro
- functions, in Big Data processing flow / Understanding the relevance of Pig in the enterprise
- working / Working of Pig – an overview
- firing up / Firing up Pig
- versus Hadoop, compatibility / Firing up Pig
- prerequisites / Firing up Pig
- installing / Firing up Pig
- installation, verifying / Firing up Pig
- local mode / Firing up Pig
- MapReduce mode / Firing up Pig
- use case / The use case
- code listing / Code listing
- dataset / The dataset
- extensibility / Pig's extensibility
- operators / Operators used in code
- EXPLAIN operator / The EXPLAIN operator
- data model / Understanding Pig's data model
- schemas, handling / The relevance of schemas
- sampling support / Sampling support in Pig
- Datafu library / Sampling support in Pig
- used, for implementing data profiling in Hadoop / Rationale for using Pig in data profiling
- used, for data cleansing / Choosing Pig for validation and cleansing
- Pig core
- about / Firing up Pig
- Pig extensibility features
- REGISTER / Pig's extensibility
- MAPREDUCE / Pig's extensibility
- STREAM / Pig's extensibility
- Pig Latin
- about / Pig – a quick intro
- features / Understanding the rationale of Pig
- ILLUSTRATE function / Understanding the rationale of Pig
- Pig processing
- query parser / The EXPLAIN operator
- logical plan / The EXPLAIN operator
- logical optimization / The EXPLAIN operator
- physical plan / The EXPLAIN operator
- MapReduce plan / The EXPLAIN operator
- primitive data types, Pig
- about / Primitive types
- Int / Primitive types
- Float / Primitive types
- Long / Primitive types
- Double / Primitive types
- Chararray / Primitive types
- Principal Component Analysis design pattern
- about / Dimensionality reduction – the Principal Component Analysis design pattern
- background / Background
- motivation / Motivation
- eigenvalues / Motivation
- eigenvectors / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- limitations / Limitations of PCA implementation
- code snippets / Code snippets
- results / Results
- probabilistic sampling methods
- about / Motivation
- simple random sampling / Motivation
- stratified sampling / Motivation
- NonProbabilistic sampling / Motivation
Q
- query parser, Pig processing / The EXPLAIN operator
R
- reduce-side join / Motivation
- Reduce function / The MapReduce internals
- regex validation and cleansing design pattern
- about / The regex validation and cleansing design pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- pattern implementation / Pattern implementation
- code snippets / Code snippets
- results / Results
- REGISTER / Pig's extensibility
- regression / Motivation
- replicated join / Motivation
- ReservoirSampling / Sampling support in Pig
- right outer join / Motivation
S
- SampleByKey / Sampling support in Pig
- SAMPLE operator / Operators used in code, Sampling support in Pig
- sampling
- probabilistic sampling methods / Background
- nonprobabilistic methods / Motivation
- sampling design pattern
- about / Numerosity reduction – sampling design pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- results / Results
- sampling support / Sampling support in Pig
- sampling techniques, Datafu library / Sampling support in Pig
- ReservoirSampling / Sampling support in Pig
- SampleByKey / Sampling support in Pig
- WeightedSample / Sampling support in Pig
- semi-structured data
- ingress and egress patterns / The ingress and egress patterns for semi-structured data
- mainframe ingestion pattern / The mainframe ingestion pattern
- XML ingest and egress patterns / XML ingest and egress patterns
- simple random sampling
- about / Motivation
- simple random sampling technique / Sampling support in Pig
- slave nodes / Working of HDFS
- social media data / Types of data in the enterprise
- solution-driven patterns
- emergence / The emergence of solution-driven patterns
- SQOOP
- about / Hadoop under the covers
- standard data-processing operators, Pig
- JOIN / Pig – a quick intro
- FILTER / Pig – a quick intro
- GROUP BY / Pig – a quick intro
- ORDER BY / Pig – a quick intro
- UNION / Pig – a quick intro
- Stochastic SVD (SSVD)
- about / Motivation
- implementing / Limitations of PCA implementation
- Storage Area Network (SAN)
- about / The advent of Hadoop
- STORE operator / Operators used in code
- Stratified Random Sampling technique / Sampling support in Pig
- stratified sampling
- about / Motivation
- STREAM / Pig's extensibility
- string profiling pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- Pig script / Pig script
- getStringProfile macro, implementing / Macro
- results / Results
- structured-to-hierarchical transformation pattern
- about / The structured-to-hierarchical transformation pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- results / Results
- structured data
- ingress and egress patterns / The ingress and egress patterns for structured data
- SVD
- about / Motivation
T
- TaskTrackers / Understanding how MapReduce works
- text clustering techniques
- Hierarchical Agglomerative Clustering (HAC) / Motivation
- distance-based partitioning algorithms / Motivation
- K-medoid clustering algorithm / Motivation
- K-means clustering algorithm / Motivation
- time-quality trade-off / Data validation and cleansing for Big Data
- topic discovery pattern
- about / The topic discovery pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- results / Results
- additional information / Additional information
- transactional (OLTP) data / Types of data in the enterprise
- tuples / Pig – a quick intro, Understanding the rationale of Pig, Complex types
U
- UNION operator / Operators used in code
- unstructured data / Types of data in the enterprise
- unstructured text profiling pattern
- background / Background
- motivation / Motivation
- text pre-processing / Motivation
- use cases / Use cases
- implementing / Pattern implementation
- code snippets / Code snippets
- Pig script / Pig script
- Java UDF code snippet / Java UDF for stemming
- Java UDF code snippet, for computing TF-IDF / Java UDF for generating TF-IDF
- results / Results
- unstructured text validation and cleansing pattern
- about / The unstructured text data validation and cleansing design pattern
- background / Background
- motivation / Motivation
- use cases / Use cases
- pattern implementation / Pattern implementation
- code snippets / Code snippets
- results / Results
- User Defined Functions (UDFs) / Pig – a quick intro
V
- V-Optimal grouping technique / Motivation
- video / Types of data in the enterprise
W
- WeightedSample / Sampling support in Pig
X
- XML ingest and egress patterns
- about / XML ingest and egress patterns
- background / Background
- motivation / Motivation
- motivation, for ingesting raw XML / Motivation for ingesting raw XML
- motivation, for ingesting binary XML / Motivation for ingesting binary XML
- motivation, for egression of XML / Motivation for egression of XML
- use cases / Use cases
- implementing / Pattern implementation
- XML raw ingestion implementation / The implementation of the XML raw ingestion
- XML binary ingestion implementation / The implementation of the XML binary ingestion
- code snippets / Code snippets
- results / Results
Z
- ZooKeeper
- about / Hadoop under the covers