Index
A
- aggregator
- about / Understanding operations, Aggregator
- calling sequence / Aggregator calling sequence
- built-in aggregators / Built-in aggregators
- aggregator, custom operations
- writing / Writing an aggregator
- Analytics Inside website
- URL / Installing and using
- Apache Spark
- about / Beyond MapReduce
- Apache Tez
- about / Beyond MapReduce
- Application Master / YARN – MapReduce version 2
- AssertionLevel
- about / AssertionLevel
- .STRICT / AssertionLevel
- .VALID / AssertionLevel
- .NONE / AssertionLevel
- assertions
- about / Understanding operations, Assertions
- ValueAssertion calling sequence / ValueAssertion calling sequence
- GroupAssertion calling sequence / GroupAssertion calling sequence
- AssertionLevel / AssertionLevel
- using / Using assertions
- built-in assertions / Built-in assertions
- BaseOperation methods, implementing / A note about implementing BaseOperation methods
B
- balancer
- boundary nodes / Hadoop architecture
- branch
- buffer, custom operations
- writing / Writing a buffer
- buffers
- about / Understanding operations, Buffers
- calling sequence / Buffer calling sequence
- built-in buffers / Built-in buffers
- built-in aggregators
- about / Built-in aggregators
- First / Built-in aggregators
- Last / Built-in aggregators
- Min / Built-in aggregators
- Max / Built-in aggregators
- Average / Built-in aggregators
- Sum / Built-in aggregators
- Count / Built-in aggregators
- built-in assertions
- AssertEquals / Built-in assertions
- AssertEqualsAll / Built-in assertions
- AssertExpression / Built-in assertions
- AssertGroupBase / Built-in assertions
- AssertGroupSizeEquals / Built-in assertions
- AssertGroupSizeLessThan / Built-in assertions
- AssertGroupSizeMoreThan / Built-in assertions
- AssertMatches / Built-in assertions
- AssertMatchesAll / Built-in assertions
- AssertNotEquals / Built-in assertions
- AssertNotNull / Built-in assertions
- AssertNull / Built-in assertions
- AssertSizeEquals / Built-in assertions
- AssertSizeLessThan / Built-in assertions
- AssertSizeMoreThan / Built-in assertions
- built-in filters
- ExpressionFilter / Built-in filters
- RegexFilter / Built-in filters
- Sample / Built-in filters
- ScriptFilter / Built-in filters
- Debug / Built-in filters
- FilterNotNull / Built-in filters
- FilterNull / Built-in filters
- And / Built-in filters
- Or / Built-in filters
- Not / Built-in filters
- Xor / Built-in filters
- examples / Built-in filters
- built-in functions
- Identity / Built-in functions
- DateFormatter / Built-in functions
- DateParser / Built-in functions
- RegexParser / Built-in functions
- RegexReplace / Built-in functions
- RegexSplitGenerator / Built-in functions
- RegexSplitter / Built-in functions
- ExpressionFunction / Built-in functions
- ScriptFunction / Built-in functions
- FieldJoiner / Built-in functions
- FieldFormatter / Built-in functions
- SetValue / Built-in functions
- UnGroup / Built-in functions
- Insert / Built-in functions
- XPathParser / Built-in functions
- TagSoupParser / Built-in functions
- business intelligence
C
- cascade
- about / Data flows as processes, Cascades
- example code / Cascades
- refining / Refining and adjusting
- adjusting / Refining and adjusting
- cascade events / Flow and cascade events
- cascades
- using / Using cascades
- building, complex workflow used / Building a complex workflow using cascades
- flow, skipping / Skipping a flow in a cascade
- Cascading
- data flow, controlling / Understanding how Cascading controls data flow
- common errors / Common errors
- using, as insulation from big data migrations / Using Cascading as insulation from big data migrations and upgrades
- using, as insulation from big data upgrades / Using Cascading as insulation from big data migrations and upgrades
- optimizing / Optimizing Cascading
- cascading
- Cascading application
- implementing / Putting it all together
- Cascading application, debugging
- about / Debugging a Cascading application
- development environment, setting up / Getting your environment ready for debugging
- Cascading local mode debugging, using / Using Cascading local mode debugging
- Eclipse, setting up / Setting up Eclipse
- remote debugging / Remote debugging
- assertions, using / Using assertions
- Debug() filter / The Debug() filter
- exceptions, managing with traps / Managing exceptions with traps
- checkpoints / Checkpoints
- bad data, managing / Managing bad data
- Flow sequencing, viewing with DOT files / Viewing flow sequencing using DOT files
- cascading development methodology
- about / Defining the project – the Cascading development methodology
- project, roles / Project roles and responsibilities
- project, responsibilities / Project roles and responsibilities
- data analysis, conducting / Conducting data analysis
- functional decomposition, performing / Performing functional decomposition
- process, designing / Designing the process and components
- components, designing / Designing the process and components
- operations, creating / Creating and integrating the operations
- operations, integrating / Creating and integrating the operations
- subassemblies, creating / Creating and using subassemblies
- subassemblies, using / Creating and using subassemblies
- Cascading framework
- about / The Cascading framework
- execution graph / The execution graph and flow planner
- flow planner / The execution graph and flow planner
- MapReduce jobs, producing / How Cascading produces MapReduce jobs
- Cascading processing
- data streams, defining for / Understanding how Cascading represents records
- data streams, structuring for / Understanding how Cascading represents records
- Cascading serializers
- about / Cascading serializers
- Cascading tools
- about / Using other Cascading tools
- Lingual / Lingual
- Pattern / Pattern
- Driven / Driven
- Fluid / Fluid
- Load / Load
- Multitool / Multitool
- language support / Support for other languages
- Hortonworks / Hortonworks
- cascading_ext project
- URL / Optimizing Cascading
- checkpoints
- classes, of operations
- filters / Understanding operations
- functions / Understanding operations
- aggregator / Understanding operations
- buffers / Understanding operations
- assertion / Understanding operations
- classes, simple MapReduce job
- MRJob / A simple MapReduce job
- MRMapper / A simple MapReduce job
- MRReducer / A simple MapReduce job
- MRPartitioner / A simple MapReduce job
- MRComparator / A simple MapReduce job
- cluster test / Performing a cluster test
- coercion
- about / Data typing and coercion
- CoGroup pipe
- about / CoGroup
- combiner
- about / Hadoop architecture
- common Cascading themes
- about / Understanding common Cascading themes
- data flows, as processes / Data flows as processes
- comparators
- about / Using tuples and defining fields
- contents
- about / Contents
- contexts
- about / Contexts
- counters
- about / Counters, Instrumentation and counters
- used, for controlling flows / Using counters to control flow
- custom assertion, custom operations
- writing / Writing a custom assertion
- custom operations
- writing / Writing custom operations
- filter, writing / Writing a filter
- function, writing / Writing a function
- aggregator, writing / Writing an aggregator
- custom assertion, writing / Writing a custom assertion
- buffer, writing / Writing a buffer
- use cases, identifying / Identifying common use cases for custom operations
- custom Taps
- about / Custom taps
D
- data definition language (DDL)
- about / Hadoop architecture
- Data Mining Group website
- URL / Pattern
- DataNode
- about / DataNodes
- data streams
- defining, for Cascading processing / Understanding how Cascading represents records
- structuring, for Cascading processing / Understanding how Cascading represents records
- data typing
- about / Data typing and coercion
- Debug() filter
- about / The Debug() filter
- declarator
- distributed cache
- about / Distributed cache
- domain-specific language (DSL)
- about / The Cascading framework
- DOT
- DOT file
- Driven
E
- Each operations
- about / Each operations
- filters / Filters
- functions / Function
- Each pipe
- about / Each
- EasyMock
- Eclipse
- URL / Setting up Eclipse
- end-to-end testing / Testing strategies
- environment variables, Hadoop
- JAVA_HOME / Reviewing Hadoop
- HADOOP_HOME / Reviewing Hadoop
- HADOOP_CONF_DIR / Reviewing Hadoop
- Every operations
- about / Every operations
- aggregator / Aggregator
- Every pipe
- about / Every
- external components
- integrating / Integrating external components
- external JAR files, using / Using external JAR files
F
- FAT JAR
- about / Distributed cache
- feeding sources / Skipping a flow in a cascade
- field algebra
- fields
- field sets
- about / Using a Fields object, named field groups, and selectors
- Fields.ALL / Using a Fields object, named field groups, and selectors
- Fields.UNKNOWN / Using a Fields object, named field groups, and selectors
- Fields.ARGS / Using a Fields object, named field groups, and selectors
- Fields.RESULTS / Using a Fields object, named field groups, and selectors
- Fields.GROUP / Using a Fields object, named field groups, and selectors
- Fields object
- filter
- about / Understanding operations, Filters
- calling sequence / Filter calling sequence
- built-in filters / Built-in filters
- filter, custom operations
- writing / Writing a filter
- flow
- about / Data flows as processes, Flow
- FlowConnector
- about / FlowConnector
- flow events / Flow and cascade events
- FlowProcess object
- about / FlowProcess
- flows
- controlling, dynamically / Dynamically controlling flows
- controlling, counters used / Using counters to control flow
- existing MapReduce jobs used / Using existing MapReduce jobs
- fluent programming techniques used / Using fluent programming techniques, The FlowDef fluent interface
- fluent interface
- about / The Cascading framework
- Fluid
- full load test / Performing a full load test
- function, custom operations
- writing / Writing a function
- functions
- about / Understanding operations, Function
- calling sequence / Function calling sequence
- built-in functions / Built-in functions
- functor / Built-in subassemblies
G
- Graphviz
- GroupAssertion calling sequence / GroupAssertion calling sequence
- GroupBy pipe
- about / GroupBy and sorting
- groups
- about / Data flows as processes
H
- Hadoop
- reviewing / Reviewing Hadoop
- primary files / Reviewing Hadoop
- environment variables / Reviewing Hadoop
- tasks / Hadoop architecture
- optimizing / Optimizing Hadoop
- Hadoop architecture
- about / Hadoop architecture
- Hadoop jobs
- about / Hadoop jobs
- Hadoop modes
- about / Local and Hadoop modes
- HashJoin pipe / HashJoin
- HDFS
- about / Reviewing Hadoop, HDFS – the Hadoop Distributed File System
- NameNode / The NameNode
- secondary NameNode / The secondary NameNode
- DataNode / DataNodes
- head nodes / Hadoop architecture
- High Availability (HA)
- about / The secondary NameNode
- Hortonworks
- about / Hortonworks
- references / Hortonworks
- Hortonworks Data Platform (HDP)
- about / Hortonworks
I
- instrumentation
- about / Instrumentation and counters
- integrated development environment (IDE) / Using Cascading local mode debugging
- integration test / Performing an integration test
- integration testing
- about / Testing strategies, Integration testing
- methodologies / Integration testing
- intermediate file management
- about / Intermediate file management
J
- Janino compiler
- about / Built-in filters
- Java Archive (JAR) files / Using external JAR files
- Java open source mock frameworks
- about / Java open source mock frameworks
- Java Virtual Machine (JVM) / Hadoop architecture
- jMock
- JMockit
- Job Queue / Hadoop architecture
- JobTracker / Hadoop architecture
- about / Hadoop architecture, The JobTracker
- join pipes
- JUnit
- about / Unit testing and JUnit
K
- key field / Hadoop architecture
L
- least recently used (LRU) / Built-in subassemblies
- Lingual
- about / Lingual
- Load
- load testing
- Load tool
- about / Load and performance testing
- local mode
- about / Local and Hadoop modes
M
- machine learning
- about / Pattern
- Map process
- about / Hadoop architecture
- MapReduce
- MapReduce execution framework
- about / MapReduce execution framework
- JobTracker / The JobTracker
- TaskTracker / The TaskTracker
- MapReduce jobs
- MapReduce version 1
- moving to MapReduce version 2, YARN used / Using Cascading as insulation from big data migrations and upgrades
- Merge pipe
- about / The Merge pipe
- method cascading / Using fluent programming techniques
- Mockachino
- mocking
- about / Mocking
- Mockito
- monad / Using fluent programming techniques
- Multitool
N
- named field groups
- NameNode
- about / The NameNode
- natural language processing (NLP) / Understanding the project domain – text analytics and natural language processing (NLP), Next steps
- negative tests
- about / Load and performance testing
- nodes, Hadoop
- head nodes / Hadoop architecture
- slave nodes / Hadoop architecture
- boundary nodes / Hadoop architecture
O
- online resources
- searching / Finding online resources
- OpenNLP
- OperationCall<Context> object
- about / OperationCall<Context>
- Operation class
- about / The Operation class and interface hierarchy
- interface hierarchy / The Operation class and interface hierarchy
- contexts / Contexts
- FlowProcess object / FlowProcess
- OperationCall<Context> object / OperationCall<Context>
- operations
- about / Understanding operations, Operations and fields
- using / Understanding operations
- basic operation lifecycle / The basic operation lifecycle
- processing sequence / An operation processing sequence and its methods
- methods / An operation processing sequence and its methods
- operations, types
- about / Operation types
- Each operations / Each operations
- Every operations / Every operations
- buffers / Buffers
- assertions / Assertions
- Optiq
- about / Lingual
- orchestration layer
- about / The Cascading framework
P
- partitioner
- about / Hadoop architecture
- Pattern
- performance
- optimizing / Optimizing performance
- performance testing
- about / Load and performance testing
- pipe
- about / Data flows as processes
- pipe assembly
- about / Data flows as processes
- pipe operations
- about / Pipe operations
- filter / Pipe operations
- function / Pipe operations
- aggregator / Pipe operations
- count / Pipe operations
- buffer / Pipe operations
- assertion / Pipe operations
- pipes
- using / Using pipes
- creating / Creating and chaining
- Each pipe / Each
- splitting / Splitting
- GroupBy pipe / GroupBy and sorting
- sorting / GroupBy and sorting
- Every pipe / Every
- merging / Merging and joining
- joining / Merging and joining
- Merge pipe / The Merge pipe
- join pipes / The join pipes – CoGroup and HashJoin
- CoGroup / CoGroup
- HashJoin / HashJoin
- default output selectors / Default output selectors
- pipe subassemblies / Creating and using subassemblies
- plumbing/liquid processing model / Data flows as processes
- PowerMock
- Predictive Model Markup Language (PMML)
- about / Pattern
- primary files, Hadoop
- core-site.xml / Reviewing Hadoop
- hdfs-site.xml / Reviewing Hadoop
- mapred-site.xml / Reviewing Hadoop
- project
- scope / Project scope – understanding requirements
- domain / Understanding the project domain – text analytics and natural language processing (NLP)
- named entity extraction, creating / Conducting a simple named entity extraction
R
- rack / Hadoop architecture
- Read phase
- about / Hadoop architecture
- Reduce process
- about / Hadoop architecture
- relational database management system (RDBMS)
- about / Hadoop architecture
- Resilient Distributed Datasets / Using Cascading as insulation from big data migrations and upgrades
- resilient distributed datasets (RDD)
- about / Beyond MapReduce
- Resource Manager / YARN – MapReduce version 2
S
- schema-on-read
- about / Hadoop architecture
- scheme
- about / Defining schemes
- defining / Defining schemes
- in detail / Schemes in detail
- secondary NameNode
- about / The secondary NameNode
- selector
- selectors
- Shuffle and Sort process
- about / Hadoop architecture
- simple MapReduce job
- about / A simple MapReduce job
- sink data
- about / Schemes in detail
- SinkMode, options
- KEEP / Using taps
- REPLACE / Using taps
- UPDATE / Using taps
- slave nodes / Hadoop architecture
- Sort and Group step
- about / Hadoop architecture
- source data
- about / Schemes in detail
- speculative execution / Hadoop architecture
- spill parameters / HashJoin
- subassemblies
- creating / Creating and using subassemblies
- using / Creating and using subassemblies
- built-in subassemblies / Built-in subassemblies
- new custom subassembly, creating / Creating a new custom subassembly
- custom subassemblies, using / Using custom subassemblies
- subject matter expert (SME) / Project roles and responsibilities
T
- taps
- about / Data flows as processes
- using / Using taps
- Tap types
- Hfs / Using taps
- Lfs / Using taps
- Dfs / Using taps
- FileTap / Using taps
- tasks, Hadoop
- parallel tasks / Hadoop architecture
- mapper tasks / Hadoop architecture
- reducer tasks / Hadoop architecture
- TaskTracker
- about / The TaskTracker
- Task Tracker / Hadoop architecture
- test plan, designing
- unit test / Performing a unit test
- integration test / Performing an integration test
- cluster test / Performing a cluster test
- full load test / Performing a full load test
- test strategies
- about / Testing strategies
- integration testing / Testing strategies, Integration testing
- load testing / Testing strategies, Load and performance testing
- unit testing / Unit testing and JUnit
- mocking / Mocking
- text analytics / Understanding the project domain – text analytics and natural language processing (NLP)
- Tez / Using Cascading as insulation from big data migrations and upgrades
- trap
- about / Managing exceptions with traps
- tuple
- TupleEntry class
- about / TupleEntry
- tuple stream
U
- Unitils
- unit test / Performing a unit test
- unit testing
- about / Unit testing and JUnit
- creating / Unit testing and JUnit
V
- ValueAssertion calling sequence / ValueAssertion calling sequence
- value field / Hadoop architecture
W
- workflow
- building / Building the workflow
- flows, building / Building flows
- context, managing / Managing the context
- cascade, building / Building the cascade
- test plan, designing / Designing the test plan
- software, packaging / Software packaging and delivery to the cluster
- delivery, to cluster / Software packaging and delivery to the cluster
- Write phase
- about / Hadoop architecture
Y
- YARN
- about / YARN – MapReduce version 2
Z
- ZIP file
- code / Contents
- installing / Installing and using
- using / Installing and using