Index
A
- --as-avrodatafile argument / There's more...
- --as-sequencefile argument / There's more...
- -a arguments / How it works...
- Accumulator interface / How it works...
- Accumulo
- row key designing, to store geographic events / Designing a row key to store geographic events in Accumulo, How to do it..., How it works...
- geographic event data bulk importing, MapReduce used / Using MapReduce to bulk import geographic event data into Accumulo, How to do it..., How it works...
- custom field constraint setting, to input geographic event data / Setting a custom field constraint for inputting geographic event data in Accumulo, How to do it..., How it works...
- SumCombiner, using / Counting fatalities for different versions of the same key using SumCombiner, How to do it..., How it works...
- used for enforcing cell-level security, on scans / Enforcing cell-level security on scans using Accumulo, How to do it..., How it works...
- sources aggregating, MapReduce used / Aggregating sources in Accumulo using MapReduce, How to do it..., How it works...
- AccumuloFileOutputFormat
- versus AccumuloOutputFormat / AccumuloOutputFormat versus AccumuloFileOutputFormat
- AccumuloInputFormat class / How it works...
- AccumuloOutputFormat
- versus AccumuloFileOutputFormat / AccumuloOutputFormat versus AccumuloFileOutputFormat
- AccumuloTableAssistant.java / AccumuloTableAssistant.java
- ACLED
- ACLEDIngestReducer.java class / How to do it...
- ACLEDSourceReducer static inner class / How to do it...
- addCacheArchive() static method / There's more...
- Apache Avro
- using, to serialize data / Using Apache Avro to serialize data, How to do it..., How it works...
- Apache Giraph
- about / Introduction
- PageRank with / PageRank with Apache Giraph, How to do it..., How it works...
- -v option / How it works...
- -e option / How it works...
- -s option / How it works...
- -w option / How it works...
- community / Keep up with the Apache Giraph community
- single-source shortest-path / Single-source shortest-path with Apache Giraph, How to do it...
- using, to perform distributed breadth-first search / Using Apache Giraph to perform a distributed breadth-first search, How to do it..., How it works..., There's more...
- scalability tuning / Apache Giraph jobs often require scalability tuning
- Apache Hive
- map-side join, using / Using a map-side join in Apache Hive to analyze geographical events, How to do it..., How it works...
- optimized full outer joins, using / Using optimized full outer joins in Apache Hive to analyze geographical events, How to do it..., How it works...
- Apache logs
- transforming into TSV format, MapReduce used / Transforming Apache logs into TSV format using MapReduce, How to do it..., How it works..., There's more...
- Apache Mahout
- about / Introduction
- collaborative filtering with / Collaborative filtering with Apache Mahout, How to do it..., How it works...
- clustering with / Clustering with Apache Mahout, How to do it..., How it works...
- sentiment classification / Sentiment classification with Apache Mahout, How to do it..., How it works..., There's more...
- Apache Mahout 0.6
- URL / Getting ready
- Apache Pig
- about / Using Apache Pig to filter bot traffic from web server logs, There's more...
- using to sort web server log data, by timestamp / Using Apache Pig to sort web server log data by timestamp, See also
- used for sorting web server log data, by timestamp / Using Apache Pig to sort web server log data by timestamp, See also
- used, for sorting data / How to do it...
- using, to view sessionize web server log data / Using Apache Pig to sessionize web server log data, How to do it..., See also
- functionality extending, Python used / Using Python to extend Apache Pig functionality, How it works...
- SELECT operation, performing with GROUP BY operation / Using Pig to load a table and perform a SELECT operation with GROUP BY, How it works...
- using, to load table / Using Pig to load a table and perform a SELECT operation with GROUP BY, How it works...
- replicated join, used for joining data / Joining data using Apache Pig replicated join, How it works..., There's more...
- merge join, used for joining data / Joining sorted data using Apache Pig merge join, How it works...
- skewed join, used for joining skewed data / Joining skewed data using Apache Pig skewed join, How to do it..., How it works...
- record with malformed IP address, example / How to do it...
- Apache Pig 0.10
- URL, for installing / Getting ready
- Apache Thrift
- about / Using Apache Thrift to serialize data
- using, to serialize data / Getting ready, How to do it..., How it works...
- apache_clf.txt dataset / Getting ready
- associative / How it works...
- Audioscrobbler dataset
- about / Calculating the cosine similarity of artists in the Audioscrobbler dataset using Pig
- URL, for downloading / Getting ready, Getting ready
- outliers trimming, Datafu used / Trim Outliers from the Audioscrobbler dataset using Pig and datafu, How to do it...
- AvroWrapper class / How it works...
- AvroWriter job / There's more...
- AvroWriter MapReduce job / How it works...
B
- @BeforeClass annotation / How to do it...
- bad records
- skipping, by enabling MapReduce jobs / Enabling MapReduce jobs to skip bad records, There's more...
- Before annotation / How to do it...
- block compression option / There's more...
- block replication
- about / Introduction
- block size
- setting, for HDFS / Setting the block size for HDFS, How it works...
- Boolean expressions / Supporting more complex Boolean expressions
- Bot traffic
- about / Using Apache Pig to filter bot traffic from web server logs
- filtering, Apache Pig UDF used / How to do it...
- BreadthFirstTextOutputFormat class / How to do it...
- BSON object / How to do it...
- Builder pattern / How it works..., AccumuloTableAssistant.java
- Bulk Synchronous Parallel (BSP) / Getting ready
- bzip2 / Compressing data using LZO
C
- --clear-staging-table argument / How it works...
- --clustering parameter / How it works...
- --clusters parameter / How it works...
- --connect statement / How it works...
- -conf argument/flag / How it works...
- CachedConfiguration.getInstance() static method / How it works...
- coalesce() method / The coalesce() method can take variable length arguments.
- CDH3 / Getting ready
- URL / Getting ready, Getting ready
- cell-level security
- enforcing on scans, Accumulo used / Enforcing cell-level security on scans using Accumulo, How to do it..., How it works...
- cleanAndValidatePoint() private method / How to do it...
- cluster
- new nodes, adding / Adding new nodes to an existing cluster, How it works...
- monitoring, Ganglia used / Monitoring cluster health using Ganglia, Getting ready, How it works...
- CLUSTER BY / SORT BY versus DISTRIBUTE BY versus CLUSTER BY versus ORDER BY
- clusters
- data moving between, distributed copy used / Moving data efficiently between clusters using Distributed Copy, There's more..., How to do it...
- coalesce() method / How it works...
- ColumnUpdate object / How it works...
- ColumnVisibility / ColumnVisibility is part of the key
- Combiners
- used, for counting distinct IPs in weblog data / Counting distinct IPs in weblog data using MapReduce and Combiners, How to do it..., How it works...
- common join
- versus map-side join / Common join versus map-side join
- STREAMTABLE hint / STREAMTABLE hint
- commutative / How it works...
- CompositeKey class / How it works...
- CompositeKeyParitioner class / How it works...
- compute() function / How it works..., How it works..., Apache Giraph jobs often require scalability tuning
- concat_ws() function / The UDF concat_ws() function will not automatically cast parameters to String, The concat_ws() function supports variable length parameter arguments
- Connector instance / How to do it...
- Constraint class / How it works...
- constraint class
- building / Bundled Constraint classes
- installing, on TabletServer / Installing a constraint on each TabletServer
- context class / How it works...
- copyFromLocal command / How it works...
- copyToLocal command / How it works..., There's more...
- copyToLocal Hadoop shell command / How it works...
- Cosine similarity
- about / Calculating the cosine similarity of artists in the Audioscrobbler dataset using Pig
- calculating, Pig used / How to do it..., How it works...
- counters
- about / Using Counters in a MapReduce job to track bad records
- using in MapReduce job, to track bad records / Using Counters in a MapReduce job to track bad records, How to do it..., How it works...
- using, in streaming job / Using Counters in a streaming job
- create() method / How it works...
- CREATE command / How it works...
- createRecordReader() method / How it works...
- current_diff / How it works...
- current_loc / How it works...
- Cyclic Redundancy Check (CRC) / How it works...
D
- -D argument/flag / How it works...
- data
- moving between clusters, distributed copy used / Moving data efficiently between clusters using Distributed Copy, There's more...
- importing from MySQL into HDFS, Sqoop used / Importing data from MySQL into HDFS using Sqoop, Getting ready, How it works..., There's more...
- exporting from MySQL into HDFS, Sqoop used / Exporting data from HDFS into MySQL using Sqoop, Getting ready, How it works...
- compressing, LZO used / Compressing data using LZO, How to do it...
- sorting, Apache Pig used / How to do it...
- joining in mapper, MapReduce used / Joining data in the Mapper using MapReduce, How to do it..., How it works..., There's more...
- joining, Apache Pig replicated join used / Joining data using Apache Pig replicated join, How it works..., There's more...
- dataflow programs
- example data generating, URL for / See also
- Datafu
- about / Trim Outliers from the Audioscrobbler dataset using Pig and datafu
- Audioscrobbler dataset, outliers trimming from / Trim Outliers from the Audioscrobbler dataset using Pig and datafu
- data locality / Introduction
- Datanode / Introduction
- data serialization / Using Apache Avro to serialize data
- Apache Avro used / Using Apache Avro to serialize data
- Apache Thrift used / Using Apache Thrift to serialize data
- Protocol Buffers used / Using Protocol Buffers to serialize data
- datediff() argument / How it works...
- date format strings / Date format strings follow Java SimpleDateFormat guidelines
- debugging information
- displaying, task status messages updated / Updating task status messages to display debugging information, How it works...
- dfs.block.size property / How it works...
- dfs.replication property / How it works..., How it works...
- distcp command / There's more...
- DistinctCounterJob / How it works...
- distinct IPs
- counting in weblog data, MapReduce used / Counting distinct IPs in weblog data using MapReduce and Combiners, How to do it..., How it works...
- counting in weblog data, Combiners used / Counting distinct IPs in weblog data using MapReduce and Combiners, How to do it..., How it works...
- counting, MapReduce used / How to do it...
- DISTRIBUTE BY / SORT BY versus DISTRIBUTE BY versus CLUSTER BY versus ORDER BY
- distributed breadth-first search
- performing, Apache Giraph used / Using Apache Giraph to perform a distributed breadth-first search, How to do it..., How it works..., There's more...
- DistributedCache class / There's more...
- distributed cache mechanism / How it works...
- distributed copy
- used, for moving data between clusters / Moving data efficiently between clusters using Distributed Copy, There's more..., How to do it...
- DistributedLzoIndexer / How it works..., There's more...
- DROP temporary tables / DROP temporary tables
- dump command / How it works...
E
- -e option / How to do it..., How it works...
- export HIVE_AUX_JARS_PATH / Export HIVE_AUX_JARS_PATH in your environment
- end_date / How it works...
- end_time_obj / How it works...
- EvalFunc abstract class / How it works...
- EvalFunc class / How it works...
- evaluate() method / How it works...
- event dates
- transforming, Hive date UDFs used / Using Hive date UDFs to transform and sort event dates from geographic event data, How to do it...
- sorting, Hive date UDFs used / Using Hive date UDFs to transform and sort event dates from geographic event data, How to do it...
- event_date field / How it works...
- exec(Tuple t) method / How it works...
- external table
- mapping over weblog data in HDFS, Hive used / Using Hive to map an external table over weblog data in HDFS, How it works...
- dropping / Dropping an external table does not delete the data stored in the table
F
- -file location_regains_by_time.py \ argument / How it works...
- -file location_regains_mapper.py \ argument / How it works...
- -fs argument/flag / How it works...
- field constraint
- setting, to input geographic event data in Accumulo / Setting a custom field constraint for inputting geographic event data in Accumulo, How to do it..., How it works...
- fields
- concatenating in weblog data, Hive string UDFs used / Using the Hive string UDFs to concatenate fields in weblog data, Getting ready, How it works...
- FileInputFormat class / How to do it..., How it works...
- FileSystem.get() method / How it works...
- FileSystem API / There's more...
- FileSystem class / See also, How it works...
- FileSystem object / How it works...
- FilterFunc abstract class / How to do it...
- Flume
- using, to load data into HDFS / Using Flume to load data into HDFS, How it works...
- flushDB() method / How it works...
- from_unixtime() / How it works...
- fs.checkpoint.dir property / Getting ready
- fs.default.name parameter / How it works...
- fs.default.name property / How to do it..., How it works...
- fully-distributed mode
- about / Starting Hadoop in pseudo-distributed mode
- Hadoop, starting in / Starting Hadoop in distributed mode, How to do it..., How it works..., There's more...
G
- Ganglia
- used, for monitoring cluster / Monitoring cluster health using Ganglia, Getting ready, How it works...
- Ganglia meta daemon (gmetad) / Getting ready
- Ganglia monitoring daemon (gmond) / Getting ready
- geographical event data
- cleaning, Hive used / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- transforming, Hive used / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- cleaning, Python used / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- transforming, Python used / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- reading, by creating custom Hadoop Writable / Creating custom Hadoop Writable and InputFormat to read geographical event data, How to do it..., How it works...
- reading, by creating custom InputFormat / Creating custom Hadoop Writable and InputFormat to read geographical event data, How to do it..., How it works...
- geographic event data
- events transforming, Hive date UDFs used / Using Hive date UDFs to transform and sort event dates from geographic event data, How to do it...
- events sorting, Hive date UDFs used / Using Hive date UDFs to transform and sort event dates from geographic event data, How to do it...
- per-month report of fatalities building over, Hive used / Using Hive to build a per-month report of fatalities over geographic event data, How it works..., Date reformatting code template
- bulk importing into Accumulo, MapReduce used / Using MapReduce to bulk import geographic event data into Accumulo, How to do it..., How it works...
- inputting in Accumulo, by setting custom field constraint / Setting a custom field constraint for inputting geographic event data in Accumulo, How to do it..., How it works...
- geographic events
- storing in Accumulo, by designing row key / Designing a row key to store geographic events in Accumulo, How to do it..., How it works...
- get command / There's more...
- getCurrentVertex() method / How to do it...
- getmerge command / There's more...
- getRecordReader() method / How it works...
- getReverseTime() function / How it works...
- getRowID() / How to do it...
- getZOrderedCurve() method / How to do it..., How it works...
- Git Client
- URL / Getting ready, Getting ready
- GitHub
- for Windows, URL / Getting ready, Getting ready
- for Mac, URL / Getting ready, Getting ready
- Google BigTable design
- URL / Introduction
- Google BigTable design approach
- URL / Introduction
- Google Pregel paper
- Greenplum external table
- HDFS, using / Using HDFS in a Greenplum external table, How it works..., There's more...
- GroupComparator class / How it works...
- GzipCodec / Reading and writing data to SequenceFiles
H
- $HADOOP_BIN / Importing and exporting data into HDFS using Hadoop shell commands
- Hadoop
- about / Introduction, Introduction, Developing and testing MapReduce jobs with MRUnit
- URL / Getting ready
- streaming job, executing / How to do it...
- starting, in pseudo-distributed mode / Starting Hadoop in pseudo-distributed mode, How to do it..., How it works..., There's more...
- starting, in fully-distributed mode / Starting Hadoop in distributed mode, How to do it..., How it works..., There's more...
- new nodes, adding to existing cluster / Getting ready, There's more...
- rebalancing / There's more...
- cluster monitoring, Ganglia used / Monitoring cluster health using Ganglia, Getting ready, How it works...
- hadoop-streaming.jar file / How to do it...
- Hadoop Distributed Copy (distcp) tool / Moving data efficiently between clusters using Distributed Copy
- hadoop fs -COMMAND / Importing and exporting data into HDFS using Hadoop shell commands
- Hadoop FS shell / Getting ready
- hadoop mradmin -refreshNodes command / How it works...
- Hadoop shell commands
- used, for importing data / Importing and exporting data into HDFS using Hadoop shell commands, How to do it..., How it works...
- used, for exporting data / Importing and exporting data into HDFS using Hadoop shell commands, How to do it..., How it works...
- hadoop shell script / Importing and exporting data into HDFS using Hadoop shell commands
- Hadoop streaming
- using, to perform time series analytic / Using Python and Hadoop Streaming to perform a time series analytic, How to do it..., How it works...
- using, with language / Using Hadoop Streaming with any language that can read from stdin and write to stdout
- Hadoop Writable
- creating, to read geographical event data / Creating custom Hadoop Writable and InputFormat to read geographical event data, How to do it..., How it works...
- HashSet instance / How it works...
- HDFS
- about / Introduction, Introduction
- data importing, Hadoop shell commands used / Importing and exporting data into HDFS using Hadoop shell commands, How to do it..., How it works...
- data exporting, Hadoop shell commands used / Importing and exporting data into HDFS using Hadoop shell commands, How to do it..., How it works...
- data importing from MySQL, Sqoop used / Importing data from MySQL into HDFS using Sqoop, Getting ready, How it works..., There's more...
- data exporting from MySQL, Sqoop used / Exporting data from HDFS into MySQL using Sqoop, Getting ready, How it works...
- data exporting, into MongoDB / Exporting data from HDFS into MongoDB, How to do it..., How it works...
- data, importing from MongoDB / Importing data from MongoDB into HDFS, How to do it...
- data exporting into MongoDB, Pig used / Exporting data from HDFS into MongoDB using Pig, How to do it..., How it works...
- using, in Greenplum external table / Using HDFS in a Greenplum external table, How it works..., There's more...
- data loading, Flume used / Using Flume to load data into HDFS, How it works...
- services / Introduction
- data, reading to / Reading and writing data to HDFS, How to do it..., How it works...
- data, writing to / Reading and writing data to HDFS, How to do it..., How it works...
- replication factor, setting / Setting the replication factor for HDFS, How it works...
- block size, setting / Setting the block size for HDFS, How it works...
- external table over weblog data, mapping / Using Hive to map an external table over weblog data in HDFS, How it works...
- external table, mapping / How to do it...
- HDFS, services
- Namenode / Introduction
- Secondary Namenode / Introduction
- Datanode / Introduction
- hdfs-site.xml file / Getting ready, How it works...
- HdfsReader class / There's more...
- HdfsWriter class / How it works..., There's more...
- Hive
- used, for transforming geographical event data / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- used, for cleaning geographical event data / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- used for mapping external table over weblog, in HDFS / Using Hive to map an external table over weblog data in HDFS, How it works...
- using, to create tables from weblog query results / Using Hive to dynamically create tables from the results of a weblog query, How to do it..., There's more...
- using to intersect weblog IPs and determine country / Using Hive to intersect weblog IPs and determine the country, How to do it...
- multitable join support / Hive supports multitable joins
- ON operator / The ON operator for inner joins does not support inequality conditions
- using to build per-month report of fatalities, over geographic event data / Using Hive to build a per-month report of fatalities over geographic event data, How it works..., Date reformatting code template
- custom UDF, implementing / Implementing a custom UDF in Hive to help validate source reliability over geographic event data , Getting ready, How to do it..., How it works...
- existing UDFs, checking out / Check out the existing UDFs
- used, for marking non-violence longest period / Getting ready, How to do it..., How it works...
- Hive date UDFs
- using to transform event dates, from geographic event data / Using Hive date UDFs to transform and sort event dates from geographic event data, How to do it...
- using to sort event dates, from geographic event data / Using Hive date UDFs to transform and sort event dates from geographic event data, How to do it...
- Hive query language
- Hive string UDFs
- using, to concatenate fields in weblog data / Using the Hive string UDFs to concatenate fields in weblog data, Getting ready, How it works...
I
- --input arguments / How it works...
- --input parameter / How it works...
- -ignorecrc argument / How it works...
- IdentityMapper / How to do it..., Developing and testing MapReduce jobs with MRUnit
- IdentityMapperTest class / Getting ready, How to do it...
- IllegalArgumentException exception / How it works...
- illustrate
- using, to debug Apache Pig / Using illustrate to debug Pig jobs
- InputFormat
- creating, to read geographical event data / Creating custom Hadoop Writable and InputFormat to read geographical event data, How to do it..., How it works...
- input splits / Compressing data using LZO
- InputStream object / How it works...
- invalidZOrder() unit test method / How to do it...
- INVALID_IP_ADDRESS counter / How it works...
- io.compression.codecs property / How it works...
- ip field / How to do it...
- isSplitable() method / How it works...
- IsUseragentBot class / How to do it..., How it works..., There's more...
J
- -jobconf mapred.reduce.tasks=1 argument / How it works...
- -jobconf num.key.fields.for.partition=1 \ argument / How it works...
- -jobconf stream.num.map.output.key.fields=2 \ argument / How it works...
- -jt argument/flag / How it works...
- Java Virtual Machine (JVM) / How it works...
- JAVA_HOME environment property / How to do it...
- JobConf.setMaxMapAttempts() method / How it works...
- JobConf.setMaxReduceAttempts() method / How it works...
- JobConf documentation
- URL / There's more...
- Job Tracker UI
- JOIN statement / How it works...
- JOIN table / The ON operator for inner joins does not support inequality conditions
K
- k-means / Clustering with Apache Mahout
- key-value store
- used, for joining data / Joining data using an external key-value store (Redis)
- key.toString() method / How it works...
- keys
- Lexicographic sorting / Lexicographic sorting of keys
L
- LineReader / How it works...
- LineRecordReader class / How it works...
- loadRedis() method / How to do it..., How it works...
- LocalJobRunner class / How it works...
- local mode
- MapReduce running jobs, developing / Developing and testing MapReduce jobs running in local mode, How to do it..., How it works..., There's more...
- MapReduce running jobs, testing / Developing and testing MapReduce jobs running in local mode, How to do it..., How it works..., There's more...
- LOCATION keyword / LOCATION must point to a directory, not a file
- location_regains_mapper.py file / How it works...
- LZO
- used, for data compressing / Compressing data using LZO, How to do it...
- codec implementation, downloading / Getting ready
- setting up, steps for / How to do it...
- working / How it works..., There's more...
- io.compression.codecs property / How it works...
- DistributedLzoIndexer / There's more...
- LzoIndexer / There's more...
- LzoTextInputFormat / How it works...
M
- --maxIter parameter / How it works...
- -mapper location_regains_mapper.py \ argument / How it works...
- -m argument / How it works...
- -md arguments / How it works...
- -ml arguments / How it works...
- main() method / How to do it..., How to do it...
- map() function / How to do it..., How to do it..., How to do it..., How it works...
- map() method / How it works...
- map-side join
- about / Joining data in the Mapper using MapReduce
- using, in Apache Hive / Using a map-side join in Apache Hive to analyze geographical events, How to do it..., How it works...
- auto-converting to / Auto-convert to map-side join whenever possible
- behavior / Map-join behavior
- versus common join / Common join versus map-side join
- MapDriver class / How it works...
- Map input records counter / Using Counters in a MapReduce job to track bad records
- maponly jobs / How it works...
- Mapper class / How it works...
- col_pos / How it works...
- pattern / How it works...
- outKey / How it works...
- outValue / How it works...
- mapred-site.xml configuration file / Getting ready, There's more...
- mapred.cache.files property / How it works...
- mapred.child.java.opts property / How to do it...
- mapred.compress.map.output property / How to do it...
- mapred.job.reuse.jvm.num.tasks property / How to do it...
- mapred.job.tracker property / How to do it..., How it works...
- mapred.map.child.java.opts property / How to do it...
- mapred.map.output.compression.codec property / How to do it...
- mapred.map.tasks.speculative.execution property / How to do it...
- mapred.output.compression.codec property / How to do it...
- mapred.output.compression.type property / How to do it...
- mapred.output.compress property / How to do it...
- mapred.reduce.child.java.opts property / How to do it...
- mapred.reduce.tasks.speculative.execution property / How to do it...
- mapred.reduce.tasks property / How to do it...
- mapred.skip.attempts.to.start.skipping property / There's more...
- mapred.skip.map.auto.incr.proc.count property / There's more...
- mapred.skip.map.max.skip.records property / There's more...
- mapred.skip.out.dir property / There's more...
- mapred.skip.reduce.auto.incr.proc.count property / There's more...
- mapred.tasktracker.reduce.tasks.maximum property / There's more...
- mapred.textoutputformat.separator property / There's more...
- MapReduce
- about / How it works...
- used, for transforming Apache logs into TSV format / Transforming Apache logs into TSV format using MapReduce, How to do it..., How it works..., There's more...
- using, to calculate page views / Using MapReduce and secondary sort to calculate page views, How to do it..., How it works...
- page views calculating, secondary sort used / Using MapReduce and secondary sort to calculate page views, How to do it..., How it works...
- output files naming, MultipleOutputs, using / Using MultipleOutputs in MapReduce to name output files, How to do it..., How it works...
- distributed cache, using to find lines with matching keywords over news archives / Using the distributed cache in MapReduce to find lines that contain matching keywords over news archives, How it works..., Distributed cache does not work in local jobrunner mode
- used, for joining data in mapper / Joining data in the Mapper using MapReduce, How to do it..., How it works..., There's more...
- used, for counting distinct IPs in weblog data / Counting distinct IPs in weblog data using MapReduce and Combiners, How to do it..., How it works...
- used, for counting distinct IPs / How to do it...
- used for bulk importing geographic event data, into Accumulo / Using MapReduce to bulk import geographic event data into Accumulo, How to do it..., How it works...
- used for aggregating sources in Accumulo / Aggregating sources in Accumulo using MapReduce, How to do it..., How it works...
- MapReduce job
- counters, using to track bad records / Using Counters in a MapReduce job to track bad records, How to do it..., How it works...
- parameters, tuning / Tuning MapReduce job parameters, How to do it...
- MapReduce job, properties
- mapred.skip.attempts.to.start.skipping property / There's more...
- mapred.skip.map.auto.incr.proc.count property / There's more...
- mapred.skip.reduce.auto.incr.proc.count property / There's more...
- mapred.skip.out.dir property / There's more...
- mapred.skip.map.max.skip.records property / There's more...
- MapReduce jobs
- -file parameter, using to pass required files / Using the -file parameter to pass additional required files for MapReduce jobs
- about / Developing and testing MapReduce jobs with MRUnit
- developing, with MRUnit / Developing and testing MapReduce jobs with MRUnit
- MRUnit, URL for downloading / Getting ready
- testing, with MRUnit / How to do it..., There's more...
- enabling, to skip bad records / Enabling MapReduce jobs to skip bad records, There's more...
- MapReduce running jobs
- in local mode, developing / Developing and testing MapReduce jobs running in local mode, How to do it..., How it works..., There's more...
- in local mode, testing / Developing and testing MapReduce jobs running in local mode, How to do it..., How it works..., There's more...
- MapReduce used
- used, for generating n-grams over news archives / Generating n-grams over news archives using MapReduce, Getting ready, How to do it..., How it works...
- mapred_excludes file / How it works...
- masters configuration file / There's more...
- Maven 2.2
- URL / Getting ready
- merge join, Apache Pig
- used, for joining sorted data / Joining sorted data using Apache Pig merge join, How it works...
- Microsoft SQL Server
- Sqoop, configuring for / Configuring Sqoop for Microsoft SQL Server, How to do it...
- min() operator / The Combiner does not always have to be the same class as your Reducer
- Mockito
- URL / See also
- MongoDB
- data, exporting from HDFS / Exporting data from HDFS into MongoDB, How to do it..., How it works...
- data, importing into HDFS / Importing data from MongoDB into HDFS, How to do it...
- Mongo Hadoop Adaptor / Getting ready
- URL / Getting ready, Getting ready
- Mongo Java Driver
- URL / Getting ready, Getting ready, Getting ready
- MRUnit
- about / Developing and testing MapReduce jobs with MRUnit
- URL, for downloading / Getting ready
- mapper, testing / How to do it..., There's more...
- MultipleOutputs
- used, for naming output files in MapReduce / Using MultipleOutputs in MapReduce to name output files, How to do it..., How it works...
- MySQL
- data importing into HDFS, Sqoop used / Importing data from MySQL into HDFS using Sqoop, Getting ready, How it works..., There's more...
- data exporting from HDFS, Sqoop used / Exporting data from HDFS into MySQL using Sqoop, Getting ready, How it works...
- mysql.user table / How it works...
- MySQL JDBC driver JAR file / Getting ready
N
- --namedVector arguments / How it works...
- --numClusters parameter / How it works...
- -ng arguments / How it works...
- n-grams
- generating, over news archives, MapReduce used / Generating n-grams over news archives using MapReduce, Getting ready, How to do it..., How it works...
- NameNode failure
- recovering from / Recovering from a NameNode failure, How to do it..., There's more...
- news archives
- n-grams generating over, MapReduce used / Generating n-grams over news archives using MapReduce, Getting ready, How to do it..., How it works...
- NGramMapper class / How it works...
- Nigera_ACLED_cleaned.tsv dataset / Getting ready, Getting ready
- nigeria_holidays table / How it works...
- nobots relationship / There's more...
- nobots_weblogs relation / How it works...
- nodes
- adding, to existing cluster / Adding new nodes to an existing cluster, How to do it..., There's more...
- decommissioning / Safely decommissioning nodes, How to do it...
- non-violence, longest period of
- marking, Hive used / Getting ready, How to do it..., How it works...
- NullWritable / Use NullWritable to avoid unnecessary serialization overhead
- NumberFormatException exception / How it works...
O
- --output arguments / How it works...
- --output parameter / How it works...
- --overwrite parameter / How it works...
- -output /output/acled_analytic_out \ argument / How it works...
- ON operator / The ON operator for inner joins does not support inequality conditions
- operating modes, Hadoop
- standalone mode / Starting Hadoop in pseudo-distributed mode
- pseudo-distributed mode / Starting Hadoop in pseudo-distributed mode
- fully-distributed mode / Starting Hadoop in pseudo-distributed mode
- optimized full outer joins
- using, in Apache Hive / Using optimized full outer joins in Apache Hive to analyze geographical events, How to do it..., How it works...
- ORDER BY / SORT BY versus DISTRIBUTE BY versus CLUSTER BY versus ORDER BY
- ORDER BY relational operator / There's more...
- org.apache.hadoop.fs.FileSystem object / How it works...
- org.apache.hadoop.fs.FsShell class / How it works...
- output.write() method / How it works...
- OutputStream object / How it works...
P
- --password option / How it works...
- PageRank
- with Apache Giraph / PageRank with Apache Giraph, How to do it..., How it works...
- page views
- calculating, secondary sort used / Using MapReduce and secondary sort to calculate page views, How to do it..., How it works...
- per-month report of fatalities
- building over geographic event data, Hive used / Using Hive to build a per-month report of fatalities over geographic event data, How it works..., Date reformatting code template
- Pig
- used, for exporting data from HDFS into MongoDB / Exporting data from HDFS into MongoDB using Pig, How to do it..., How it works...
- used, for calculating Cosine similarity / How to do it..., How it works...
- play counts
- prev_date / How it works...
- protobufRecord object / How it works...
- ProtobufWritable class / How it works...
- ProtobufWritable instance / How it works...
- Protocol Buffers
- using, to serialize data / Getting ready, How to do it..., How it works...
- pseudo-distributed mode
- about / Starting Hadoop in pseudo-distributed mode
- Hadoop, starting in / Starting Hadoop in pseudo-distributed mode, How to do it..., How it works..., There's more...
- Python
- using, to extend Apache Pig functionality / Using Python to extend Apache Pig functionality, How it works...
- used, for cleaning geographical event data / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- used, for transforming geographical event data / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- AS keyword, used for type casting values / Type casting values using the AS keyword
- Python streaming
- using, to perform time series analytic / Using Python and Hadoop Streaming to perform a time series analytic, How to do it..., How it works...
Q
- QL statement / Making every column type String
- Quantile UDF / Trim Outliers from the Audioscrobbler dataset using Pig and datafu, How it works...
- query
- issuing, SumCombiner used / How to do it..., How it works...
- query results
- limiting, regex filtering iterator used / Limiting query results using the regex filtering iterator, Getting ready, How to do it..., How it works...
R
- -reducer location_regains_by_time.py \ argument / How it works...
- read compression option / There's more...
- record-skipping / There's more...
- Record class / How it works...
- Redis
- about / Joining data using an external key-value store (Redis), Getting ready
- used, for joining data in MapReduce / How to do it...
- URL / There's more...
- reduce() function / How to do it...
- reduce() method / How it works...
- reduce-side join
- Reducer class / How it works...
- regex filtering iterator
- used, for limiting query results / Limiting query results using the regex filtering iterator, Getting ready, How to do it..., How it works...
- removeAndSetOutput() method / How to do it...
- removeAndSetPath() method / Use caution when invoking FileSystem.delete()
- replicated join, Apache Pig
- used, for joining data / Joining data using Apache Pig replicated join, How it works..., There's more...
- replication factor
- setting, for HDFS / Setting the replication factor for HDFS, How it works...
- replication factor setting
- about / Introduction
- request_date field / Using the Hive string UDFs to concatenate fields in weblog data
- request_time field / Using the Hive string UDFs to concatenate fields in weblog data
- Resource Description Framework (RDF) / Single-source shortest-path with Apache Giraph
- rowCount variable / How it works...
- row key
- designing, to store geographic events in Accumulo / Designing a row key to store geographic events in Accumulo, How to do it..., How it works...
- run() method / How it works..., How it works..., How it works..., How to do it..., How to do it..., How to do it..., How to do it..., How to do it...
- runTest() method / How it works...
S
- $SQOOP_HOME / How it works...
- --split-by argument / How it works...
- --staging-table argument / How it works...
- -s arguments / How it works...
- -s option / How it works...
- scans
- cell-level security enforcing, Accumulo used / Enforcing cell-level security on scans using Accumulo, How to do it..., How it works...
- Sqoop
- configuring, for Microsoft SQL Server / Configuring Sqoop for Microsoft SQL Server, How to do it...
- Secondary NameNode / Introduction
- secondary sort
- using, to calculate page views / Using MapReduce and secondary sort to calculate page views, How to do it..., How it works...
- select() method / How it works...
- SELECT statement / How it works...
- SELECT TRANSFORM / MAP and REDUCE keywords are shorthand for SELECT TRANSFORM
- seq2sparse arguments / How it works...
- --input arguments / How it works...
- --output arguments / How it works...
- --namedVector arguments / How it works...
- -ml arguments / How it works...
- -ng arguments / How it works...
- -x arguments / How it works...
- -md arguments / How it works...
- -s arguments / How it works...
- -wt arguments / How it works...
- -a arguments / How it works...
- seqdirectory tool / How it works...
- SequenceFileInputFormat.class / How it works...
- SequenceFiles
- data, writing to / Reading and writing data to SequenceFiles, How to do it...
- data, reading from / Reading and writing data to SequenceFiles, How to do it...
- about / There's more...
- uncompressed option / There's more...
- read compression option / There's more...
- block compression option / There's more...
- SequenceWriter class / How it works...
- SerDe / How it works...
- sessionize web server log data
- viewing, Apache Pig used / Using Apache Pig to sessionize web server log data, How to do it...
- set() method / How it works...
- setAttemptsToStartSkipping() method / There's more...
- setJarByClass() / How it works...
- setJarByClass() method / How it works...
- setNumReduceTasks() method / There's more...
- setSkipOutputPath() method / There's more...
- setStatus() method / How to do it...
- setup() method / How it works..., How it works..., How to do it...
- setup() routine / How to do it...
- shell commands
- SimpleDateFormat pattern / Setting a custom field constraint for inputting geographic event data in Accumulo
- single-source shortest-path
- with Apache Giraph / Single-source shortest-path with Apache Giraph, How to do it..., How it works...
- first superstep (S0) / First superstep (S0)
- second superstep (S1) / Second superstep (S1)
- Sinks / There's more...
- skewed data
- joining, Apache Pig skewed join used / Joining skewed data using Apache Pig skewed join, How to do it..., How it works...
- skewed join, Apache Pig
- used, for joining skewed data / Joining skewed data using Apache Pig skewed join, How to do it..., How it works...
- SkipBadRecords class / How it works..., There's more...
- slaves configuration file / How to do it...
- SORT BY / SORT BY versus DISTRIBUTE BY versus CLUSTER BY versus ORDER BY
- SortComparator class / How it works...
- sorted data
- joining, Apache Pig merge join used / Joining sorted data using Apache Pig merge join, How it works...
- Sources / There's more...
- sources
- aggregating in Accumulo, MapReduce used / Aggregating sources in Accumulo using MapReduce, How to do it..., How it works...
- spiders / Using Apache Pig to filter bot traffic from web server logs
- split points / Split points
- splittable / Compressing data using LZO
- Sqoop
- used, for importing data from MySQL into HDFS / Importing data from MySQL into HDFS using Sqoop, Getting ready, How it works..., There's more...
- used, for exporting data from HDFS into MySQL / Exporting data from HDFS into MySQL using Sqoop, Getting ready, How it works...
- URL / Getting ready
- standalone mode
- startTime variable / How it works...
- start_date / How it works...
- start_time_obj / How it works...
- stderr / How it works...
- stdin / Using Counters in a streaming job
- stdout / Using Counters in a streaming job
- streaming job
- counters, using / Using Counters in a streaming job
- executing, streaming_counters.py program used / How to do it...
- StreamingQuantile UDF / There's more...
- streaming_counters job / How to do it...
- string fields / How to do it...
- STRING type / Making every column type String
- String[] parameters / How to do it...
- strip() method / How it works...
- SumCombiner
- using, in Accumulo / Counting fatalities for different versions of the same key using SumCombiner, How to do it..., How it works...
- used, for issuing query / How to do it..., How it works...
T
- --table argument / How it works..., How it works...
- tab-separated values (TSV) / Transforming Apache logs into TSV format using MapReduce
- TableFoo FULL OUTER JOIN TableBar / Map-join behavior
- TableFoo LEFT OUTER JOIN TableBar / Map-join behavior
- TableFoo RIGHT OUTER JOIN TableBar / Map-join behavior
- TabletServer
- constraint class, installing / Installing a constraint on each TabletServer
- task status messages
- updating, to display debugging information / Updating task status messages to display debugging information, How it works...
- TestCase class / How to do it...
- testclassifier tool / How to do it...
- testFullKey() unit test method / How to do it...
- testIdentityMapper1() method / How to do it...
- testIdentityMapper2() method / How to do it...
- testInvalidReverseTime() unit test method / How to do it...
- testValidReverseTime() unit test method / How to do it...
- TextOutputFormat class / There's more...
- thriftRecord object / How it works...
- ThriftWritable class / How it works...
- time series analytic
- performing, Hadoop Streaming used / Using Python and Hadoop Streaming to perform a time series analytic, How to do it..., How it works...
- timestamp
- web server log data sorting, Apache Pig used / Using Apache Pig to sort web server log data by timestamp, See also
- timestamp field / There's more...
- Tool interface / How to do it...
- ToolRunner class / How it works...
- ToolRunner setup / Generating n-grams over news archives using MapReduce
- train_formated dataset / How to do it...
- TRANSFORM operator / How it works...
- TSV format
- Apache logs transforming, MapReduce used / Transforming Apache logs into TSV format using MapReduce, How to do it..., How it works..., There's more...
- type casting values
- AS keyword used / Type casting values using the AS keyword
U
- --update-key value / How it works...
- --username option / How it works...
- -usersFile flag / How it works...
- uncompressed option / There's more...
- unix_timestamp() / How it works...
- user-defined filter function (UDF) / Using Apache Pig to filter bot traffic from web server logs
- user_artist_data.txt file / How it works...
V
- -v option / How it works...
- validZOrder() unit test method / How to do it...
- VALID_IP_ADDRESS regular expression / How it works...
W
- -w option / How it works...
- -wt arguments / How it works...
- weblog data
- Hive string UDFs, using to concatenate fields / Using the Hive string UDFs to concatenate fields in weblog data, Getting ready, How it works...
- distinct IPs counting, MapReduce used / Counting distinct IPs in weblog data using MapReduce and Combiners, How to do it..., How it works...
- distinct IPs counting, Combiners used / Counting distinct IPs in weblog data using MapReduce and Combiners, How to do it..., How it works...
- weblog IPs
- intersecting, Hive used / Using Hive to intersect weblog IPs and determine the country, How to do it...
- WeblogMapper class / How it works...
- WeblogMapper map() method / How it works...
- weblog query results
- tables creating, Hive used / Using Hive to dynamically create tables from the results of a weblog query, How to do it..., There's more...
- WeblogRecord.Record object / How it works...
- WeblogRecord class / How it works...
- WeblogRecord object / How to do it..., How it works..., How to do it..., How it works..., How to do it...
- weblog_entries.txt dataset
- URL, for downloading / Getting ready
- weblog_entries dataset / Getting ready, Getting ready
- weblog_entries_bad_records.txt dataset
- URL, for downloading / Getting ready
- WhitespaceAnalyzer / How it works...
- withInput() method / How it works...
- withOutput() method / How it works...
- WritableComparable class / How it works..., How it works...
- WritableComparable interface / How it works...
- writeVertex() method / How it works...
X
- -x arguments / How it works...
Z
- Z-order curve / Z-order curve