Index
A
- --as-avrodatafile argument / There's more...
- --as-sequencefile argument / There's more...
- -a arguments / How it works...
- Accumulator interface / How it works...
- Accumulo
- row key designing, to store geographic events / Designing a row key to store geographic events in Accumulo, How to do it..., How it works...
- geographic event data bulk importing, MapReduce used / Using MapReduce to bulk import geographic event data into Accumulo, How to do it..., How it works...
- custom field constraint setting, to input geographic event data / Setting a custom field constraint for inputting geographic event data in Accumulo, How to do it..., How it works...
- SumCombiner, using / Counting fatalities for different versions of the same key using SumCombiner, How to do it..., How it works...
- used for enforcing cell-level security, on scans / Enforcing cell-level security on scans using Accumulo, How to do it..., How it works...
- sources aggregating, MapReduce used / Aggregating sources in Accumulo using MapReduce, How to do it..., How it works...
- AccumuloFileOutputFormat
- versus AccumuloOutputFormat / AccumuloOutputFormat versus AccumuloFileOutputFormat
- AccumuloInputFormat class / How it works...
- AccumuloOutputFormat
- versus AccumuloFileOutputFormat / AccumuloOutputFormat versus AccumuloFileOutputFormat
- AccumuloTableAssistant.java / AccumuloTableAssistant.java
- ACLED
- ACLEDIngestReducer.java class / How to do it...
- ACLEDSourceReducer static inner class / How to do it...
- addCacheArchive() static method / There's more...
- Apache Avro
- using, to serialize data / Using Apache Avro to serialize data, How to do it..., How it works...
- Apache Giraph
- about / Introduction
- PageRank with / PageRank with Apache Giraph, How to do it..., How it works...
- -v option / How it works...
- -e option / How it works...
- -s option / How it works...
- -w option / How it works...
- community / Keep up with the Apache Giraph community
- single-source shortest-path / Single-source shortest-path with Apache Giraph, How to do it...
- using, to perform distributed breadth-first search / Using Apache Giraph to perform a distributed breadth-first search, How to do it..., How it works..., There's more...
- scalability tuning / Apache Giraph jobs often require scalability tuning
- Apache Hive
- map-side join, using / Using a map-side join in Apache Hive to analyze geographical events, How to do it..., How it works...
- optimized full outer joins, using / Using optimized full outer joins in Apache Hive to analyze geographical events, How to do it..., How it works...
- Apache logs
- transforming into TSV format, MapReduce used / Transforming Apache logs into TSV format using MapReduce, How to do it..., How it works..., There's more...
- Apache Mahout
- about / Introduction
- collaborative filtering with / Collaborative filtering with Apache Mahout, How to do it..., How it works...
- clustering with / Clustering with Apache Mahout, How to do it..., How it works...
- sentiment classification / Sentiment classification with Apache Mahout, How to do it..., How it works..., There's more...
- Apache Mahout 0.6
- URL / Getting ready
- Apache Pig
- about / Using Apache Pig to filter bot traffic from web server logs, There's more...
- using to sort web server log data, by timestamp / Using Apache Pig to sort web server log data by timestamp, See also
- used for sorting web server log data, by timestamp / Using Apache Pig to sort web server log data by timestamp, See also
- used, for sorting data / How to do it...
- using, to view sessionize web server log data / Using Apache Pig to sessionize web server log data, How to do it..., See also
- functionality extending, Python used / Using Python to extend Apache Pig functionality, How it works...
- SELECT operation, performing with GROUP BY operation / Using Pig to load a table and perform a SELECT operation with GROUP BY, How it works...
- using, to load table / Using Pig to load a table and perform a SELECT operation with GROUP BY, How it works...
- replicated join, used for joining data / Joining data using Apache Pig replicated join, How it works..., There's more...
- merge join, used for joining data / Joining sorted data using Apache Pig merge join, How it works...
- skewed join, used for joining skewed data / Joining skewed data using Apache Pig skewed join, How to do it..., How it works...
- record with malformed IP address, example / How to do it...
- Apache Pig 0.10
- URL, for installing / Getting ready
- Apache Thrift
- about / Using Apache Thrift to serialize data
- using, to serialize data / Getting ready, How to do it..., How it works...
- apache_clf.txt dataset / Getting ready
- associative / How it works...
- Audioscrobbler dataset
- about / Calculating the cosine similarity of artists in the Audioscrobbler dataset using Pig
- URL, for downloading / Getting ready, Getting ready
- outliers trimming, Datafu used / Trim Outliers from the Audioscrobbler dataset using Pig and datafu, How to do it...
- AvroWrapper class / How it works...
- AvroWriter job / There's more...
- AvroWriter MapReduce job / How it works...
B
- @BeforeClass annotation / How to do it...
- bad records
- skipping, by enabling MapReduce jobs / Enabling MapReduce jobs to skip bad records, There's more...
- Before annotation / How to do it...
- block compression option / There's more...
- block replication
- about / Introduction
- block size
- setting, for HDFS / Setting the block size for HDFS, How it works...
- Boolean expressions / Supporting more complex Boolean expressions
- Bot traffic
- about / Using Apache Pig to filter bot traffic from web server logs
- filtering, Apache Pig UDF used / How to do it...
- BreadthFirstTextOutputFormat class / How to do it...
- BSON object / How to do it...
- Builder pattern / How it works..., AccumuloTableAssistant.java
- Bulk Synchronous Parallel (BSP) / Getting ready
- bzip2 / Compressing data using LZO
C
- --clear-staging-table argument / How it works...
- --clustering parameter / How it works...
- --clusters parameter / How it works...
- --connect statement / How it works...
- -conf argument/flag / How it works...
- CachedConfiguration.getInstance() static method / How it works...
- coalesce() method / The coalesce() method can take variable length arguments.
- CDH3 / Getting ready
- URL / Getting ready, Getting ready
- cell-level security
- enforcing on scans, Accumulo used / Enforcing cell-level security on scans using Accumulo, How to do it..., How it works...
- cleanAndValidatePoint() private method / How to do it...
- cluster
- new nodes, adding / Adding new nodes to an existing cluster, How it works...
- monitoring, Ganglia used / Monitoring cluster health using Ganglia, Getting ready, How it works...
- CLUSTER BY / SORT BY versus DISTRIBUTE BY versus CLUSTER BY versus ORDER BY
- clusters
- data moving between, distributed copy used / Moving data efficiently between clusters using Distributed Copy, There's more..., How to do it...
- coalesce() method / How it works...
- ColumnUpdate object / How it works...
- ColumnVisibility / ColumnVisibility is part of the key
- Combiners
- used, for counting distinct IPs in weblog data / Counting distinct IPs in weblog data using MapReduce and Combiners, How to do it..., How it works...
- common join
- versus map-side join / Common join versus map-side join
- STREAMTABLE hint / STREAMTABLE hint
- commutative / How it works...
- CompositeKey class / How it works...
- CompositeKeyParitioner class / How it works...
- compute() function / How it works..., How it works..., Apache Giraph jobs often require scalability tuning
- concat_ws() function / The UDF concat_ws() function will not automatically cast parameters to String, The concat_ws() function supports variable length parameter arguments
- Connector instance / How to do it...
- Constraint class / How it works...
- constraint class
- building / Bundled Constraint classes
- installing, on TabletServer / Installing a constraint on each TabletServer
- context class / How it works...
- copyFromLocal command / How it works...
- copyToLocal command / How it works..., There's more...
- copyToLocal Hadoop shell command / How it works...
- Cosine similarity
- about / Calculating the cosine similarity of artists in the Audioscrobbler dataset using Pig
- calculating, Pig used / How to do it..., How it works...
- counters
- about / Using Counters in a MapReduce job to track bad records
- using in MapReduce job, to track bad records / Using Counters in a MapReduce job to track bad records, How to do it..., How it works...
- using, in streaming job / Using Counters in a streaming job
- create() method / How it works...
- CREATE command / How it works...
- createRecordReader() method / How it works...
- current_diff / How it works...
- current_loc / How it works...
- Cyclic Redundancy Check (CRC) / How it works...
D
- -D argument/flag / How it works...
- data
- moving between clusters, distributed copy used / Moving data efficiently between clusters using Distributed Copy, There's more...
- importing from MySQL into HDFS, Sqoop used / Importing data from MySQL into HDFS using Sqoop, Getting ready, How it works..., There's more...
- exporting from MySQL into HDFS, Sqoop used / Exporting data from HDFS into MySQL using Sqoop, Getting ready, How it works...
- compressing, LZO used / Compressing data using LZO, How to do it...
- sorting, Apache Pig used / How to do it...
- joining in mapper, MapReduce used / Joining data in the Mapper using MapReduce, How to do it..., How it works..., There's more...
- joining, Apache Pig replicated join used / Joining data using Apache Pig replicated join, How it works..., There's more...
- dataflow programs
- example data generating, URL for / See also
- Datafu
- about / Trim Outliers from the Audioscrobbler dataset using Pig and datafu
- Audioscrobbler dataset, outliers trimming from / Trim Outliers from the Audioscrobbler dataset using Pig and datafu
- data locality / Introduction
- Datanode / Introduction
- data serialization / Using Apache Avro to serialize data
- Apache Avro used / Using Apache Avro to serialize data
- Apache Thrift used / Using Apache Thrift to serialize data
- Protocol Buffers used / Using Protocol Buffers to serialize data
- datediff() argument / How it works...
- date format strings / Date format strings follow Java SimpleDateFormat guidelines
- debugging information
- displaying, task status messages updated / Updating task status messages to display debugging information, How it works...
- dfs.block.size property / How it works...
- dfs.replication property / How it works..., How it works...
- distcp command / There's more...
- DistinctCounterJob / How it works...
- distinct IPs
- counting in weblog data, MapReduce used / Counting distinct IPs in weblog data using MapReduce and Combiners, How to do it..., How it works...
- counting in weblog data, Combiners used / Counting distinct IPs in weblog data using MapReduce and Combiners, How to do it..., How it works...
- counting, MapReduce used / How to do it...
- DISTRIBUTE BY / SORT BY versus DISTRIBUTE BY versus CLUSTER BY versus ORDER BY
- distributed breadth-first search
- performing, Apache Giraph used / Using Apache Giraph to perform a distributed breadth-first search, How to do it..., How it works..., There's more...
- DistributedCache class / There's more...
- distributed cache mechanism / How it works...
- distributed copy
- used, for moving data between clusters / Moving data efficiently between clusters using Distributed Copy, There's more..., How to do it...
- DistributedLzoIndexer / How it works..., There's more...
- DROP temporary tables / DROP temporary tables
- dump command / How it works...
E
- -e option / How to do it..., How it works...
- export HIVE_AUX_JARS_PATH / Export HIVE_AUX_JARS_PATH in your environment
- end_date / How it works...
- end_time_obj / How it works...
- EvalFunc abstract class / How it works...
- EvalFunc class / How it works...
- evaluate() method / How it works...
- event dates
- transforming, Hive date UDFs used / Using Hive date UDFs to transform and sort event dates from geographic event data, How to do it...
- sorting, Hive date UDFs used / Using Hive date UDFs to transform and sort event dates from geographic event data, How to do it...
- event_date field / How it works...
- exec(Tuple t) method / How it works...
- external table
- mapping over weblog data in HDFS, Hive used / Using Hive to map an external table over weblog data in HDFS, How it works...
- dropping / Dropping an external table does not delete the data stored in the table
F
- -file location_regains_by_time.py \ argument / How it works...
- -file location_regains_mapper.py \ argument / How it works...
- -fs argument/flag / How it works...
- field constraint
- setting, to input geographic event data in Accumulo / Setting a custom field constraint for inputting geographic event data in Accumulo, How to do it..., How it works...
- fields
- concatenating in weblog data, Hive string UDFs used / Using the Hive string UDFs to concatenate fields in weblog data, Getting ready, How it works...
- FileInputFormat class / How to do it..., How it works...
- FileSystem.get() method / How it works...
- FileSystem API / There's more...
- FileSystem class / See also, How it works...
- FileSystem object / How it works...
- FilterFunc abstract class / How to do it...
- Flume
- using, to load data into HDFS / Using Flume to load data into HDFS, How it works...
- flushDB() method / How it works...
- from_unixtime() / How it works...
- fs.checkpoint.dir property / Getting ready
- fs.default.name parameter / How it works...
- fs.default.name property / How to do it..., How it works...
- fully-distributed mode
- about / Starting Hadoop in pseudo-distributed mode
- Hadoop, starting in / Starting Hadoop in distributed mode, How to do it..., How it works..., There's more...
G
- Ganglia
- used, for monitoring cluster / Monitoring cluster health using Ganglia, Getting ready, How it works...
- Ganglia meta daemon (gmetad) / Getting ready
- Ganglia monitoring daemon (gmond) / Getting ready
- geographical event data
- cleaning, Hive used / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- transforming, Hive used / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- cleaning, Python used / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- transforming, Python used / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- reading, by creating custom Hadoop Writable / Creating custom Hadoop Writable and InputFormat to read geographical event data, How to do it..., How it works...
- reading, by creating custom InputFormat / Creating custom Hadoop Writable and InputFormat to read geographical event data, How to do it..., How it works...
- geographic event data
- events transforming, Hive date UDFs used / Using Hive date UDFs to transform and sort event dates from geographic event data, How to do it...
- events sorting, Hive date UDFs used / Using Hive date UDFs to transform and sort event dates from geographic event data, How to do it...
- per-month report of fatalities building over, Hive used / Using Hive to build a per-month report of fatalities over geographic event data, How it works..., Date reformatting code template
- bulk importing into Accumulo, MapReduce used / Using MapReduce to bulk import geographic event data into Accumulo, How to do it..., How it works...
- inputting in Accumulo, by setting custom field constraint / Setting a custom field constraint for inputting geographic event data in Accumulo, How to do it..., How it works...
- geographic events
- storing in Accumulo, by designing row key / Designing a row key to store geographic events in Accumulo, How to do it..., How it works...
- get command / There's more...
- getCurrentVertex() method / How to do it...
- getmerge command / There's more...
- getRecordReader() method / How it works...
- getReverseTime() function / How it works...
- getRowID() / How to do it...
- getZOrderedCurve() method / How to do it..., How it works...
- Git Client
- URL / Getting ready, Getting ready
- GitHub
- for Windows, URL / Getting ready, Getting ready
- for Mac, URL / Getting ready, Getting ready
- Google BigTable design
- URL / Introduction
- Google BigTable design approach
- URL / Introduction
- Google Pregel paper
- Greenplum external table
- HDFS, using / Using HDFS in a Greenplum external table, How it works..., There's more...
- GroupComparator class / How it works...
- GzipCodec / Reading and writing data to SequenceFiles
H
- $HADOOP_BIN / Importing and exporting data into HDFS using Hadoop shell commands
- Hadoop
- about / Introduction, Introduction, Developing and testing MapReduce jobs with MRUnit
- URL / Getting ready
- streaming job, executing / How to do it...
- starting, in pseudo-distributed mode / Starting Hadoop in pseudo-distributed mode, How to do it..., How it works..., There's more...
- starting, in fully-distributed mode / Starting Hadoop in distributed mode, How to do it..., How it works..., There's more...
- new nodes, adding to existing cluster / Getting ready, There's more...
- rebalancing / There's more...
- cluster monitoring, Ganglia used / Monitoring cluster health using Ganglia, Getting ready, How it works...
- hadoop-streaming.jar file / How to do it...
- Hadoop Distributed Copy (distcp) tool / Moving data efficiently between clusters using Distributed Copy
- hadoop fs -COMMAND / Importing and exporting data into HDFS using Hadoop shell commands
- Hadoop FS shell / Getting ready
- hadoop mradmin -refreshNodes command / How it works...
- Hadoop shell commands
- used, for importing data / Importing and exporting data into HDFS using Hadoop shell commands, How to do it..., How it works...
- used, for exporting data / Importing and exporting data into HDFS using Hadoop shell commands, How to do it..., How it works...
- hadoop shell script / Importing and exporting data into HDFS using Hadoop shell commands
- Hadoop streaming
- using, to perform time series analytic / Using Python and Hadoop Streaming to perform a time series analytic, How to do it..., How it works...
- using, with language / Using Hadoop Streaming with any language that can read from stdin and write to stdout
- Hadoop Writable
- creating, to read geographical event data / Creating custom Hadoop Writable and InputFormat to read geographical event data, How to do it..., How it works...
- HashSet instance / How it works...
- HDFS
- about / Introduction, Introduction
- data importing, Hadoop shell commands used / Importing and exporting data into HDFS using Hadoop shell commands, How to do it..., How it works...
- data exporting, Hadoop shell commands used / Importing and exporting data into HDFS using Hadoop shell commands, How to do it..., How it works...
- data importing from MySQL, Sqoop used / Importing data from MySQL into HDFS using Sqoop, Getting ready, How it works..., There's more...
- data exporting from MySQL, Sqoop used / Exporting data from HDFS into MySQL using Sqoop, Getting ready, How it works...
- data exporting, into MongoDB / Exporting data from HDFS into MongoDB, How to do it..., How it works...
- data, importing from MongoDB / Importing data from MongoDB into HDFS, How to do it...
- data exporting into MongoDB, Pig used / Exporting data from HDFS into MongoDB using Pig, How to do it..., How it works...
- using, in Greenplum external table / Using HDFS in a Greenplum external table, How it works..., There's more...
- data loading, Flume used / Using Flume to load data into HDFS, How it works...
- services / Introduction
- data, reading to / Reading and writing data to HDFS, How to do it..., How it works...
- data, writing to / Reading and writing data to HDFS, How to do it..., How it works...
- replication factor, setting / Setting the replication factor for HDFS, How it works...
- block size, setting / Setting the block size for HDFS, How it works...
- external table over weblog data, mapping / Using Hive to map an external table over weblog data in HDFS, How it works...
- external table, mapping / How to do it...
- HDFS, services
- Namenode / Introduction
- Secondary Namenode / Introduction
- Datanode / Introduction
- hdfs-site.xml file / Getting ready, How it works...
- HdfsReader class / There's more...
- HdfsWriter class / How it works..., There's more...
- Hive
- used, for transforming geographical event data / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- used, for cleaning geographical event data / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- used for mapping external table over weblog, in HDFS / Using Hive to map an external table over weblog data in HDFS, How it works...
- using, to create tables from weblog query results / Using Hive to dynamically create tables from the results of a weblog query, How to do it..., There's more...
- using to intersect weblog IPs and determine country / Using Hive to intersect weblog IPs and determine the country, How to do it...
- multitable join support / Hive supports multitable joins
- ON operator / The ON operator for inner joins does not support inequality conditions
- using to build per-month report of fatalities, over geographic event data / Using Hive to build a per-month report of fatalities over geographic event data, How it works..., Date reformatting code template
- custom UDF, implementing / Implementing a custom UDF in Hive to help validate source reliability over geographic event data , Getting ready, How to do it..., How it works...
- existing UDFs, checking out / Check out the existing UDFs
- used, for marking non-violence longest period / Getting ready, How to do it..., How it works...
- Hive date UDFs
- using to transform event dates, from geographic event data / Using Hive date UDFs to transform and sort event dates from geographic event data, How to do it...
- using to sort event dates, from geographic event data / Using Hive date UDFs to transform and sort event dates from geographic event data, How to do it...
- Hive query language
- Hive string UDFs
- using, to concatenate fields in weblog data / Using the Hive string UDFs to concatenate fields in weblog data, Getting ready, How it works...
I
- --input arguments / How it works...
- --input parameter / How it works...
- -ignorecrc argument / How it works...
- IdentityMapper / How to do it..., Developing and testing MapReduce jobs with MRUnit
- IdentityMapperTest class / Getting ready, How to do it...
- IllegalArgumentException exception / How it works...
- illustrate
- using, to debug Apache Pig / Using illustrate to debug Pig jobs
- InputFormat
- creating, to read geographical event data / Creating custom Hadoop Writable and InputFormat to read geographical event data, How to do it..., How it works...
- input splits / Compressing data using LZO
- InputStream object / How it works...
- invalidZOrder() unit test method / How to do it...
- INVALID_IP_ADDRESS counter / How it works...
- io.compression.codecs property / How it works...
- ip field / How to do it...
- isSplitable() method / How it works...
- IsUseragentBot class / How to do it..., How it works..., There's more...
J
- -jobconf mapred.reduce.tasks=1 argument / How it works...
- -jobconf num.key.fields.for.partition=1 \ argument / How it works...
- -jobconf stream.num.map.output.key.fields=2 \ argument / How it works...
- -jt argument/flag / How it works...
- Java Virtual Machine (JVM) / How it works...
- JAVA_HOME environment property / How to do it...
- JobConf.setMaxMapAttempts() method / How it works...
- JobConf.setMaxReduceAttempts() method / How it works...
- JobConf documentation
- URL / There's more...
- Job Tracker UI
- JOIN statement / How it works...
- JOIN table / The ON operator for inner joins does not support inequality conditions
K
- k-means / Clustering with Apache Mahout
- key-value store
- used, for joining data / Joining data using an external key-value store (Redis)
- key.toString() method / How it works...
- keys
- Lexicographic sorting / Lexicographic sorting of keys
L
- LineReader / How it works...
- LineRecordReader class / How it works...
- loadRedis() method / How to do it..., How it works...
- LocalJobRunner class / How it works...
- local mode
- MapReduce running jobs, developing / Developing and testing MapReduce jobs running in local mode, How to do it..., How it works..., There's more...
- MapReduce running jobs, testing / Developing and testing MapReduce jobs running in local mode, How to do it..., How it works..., There's more...
- LOCATION keyword / LOCATION must point to a directory, not a file
- location_regains_mapper.py file / How it works...
- LZO
- used, for data compressing / Compressing data using LZO, How to do it...
- codec implementation, downloading / Getting ready
- setting up, steps for / How to do it...
- working / How it works..., There's more...
- io.compression.codecs property / How it works...
- DistributedLzoIndexer / There's more...
- LzoIndexer / There's more...
- LzoTextInputFormat / How it works...
M
- --maxIter parameter / How it works...
- -mapper location_regains_mapper.py \ argument / How it works...
- -m argument / How it works...
- -md arguments / How it works...
- -ml arguments / How it works...
- main() method / How to do it..., How to do it...
- map() function / How to do it..., How to do it..., How to do it..., How it works...
- map() method / How it works...
- map-side join
- about / Joining data in the Mapper using MapReduce
- using, in Apache Hive / Using a map-side join in Apache Hive to analyze geographical events, How to do it..., How it works...
- auto-converting to / Auto-convert to map-side join whenever possible
- behavior / Map-join behavior
- versus common join / Common join versus map-side join
- MapDriver class / How it works...
- Map input records counter / Using Counters in a MapReduce job to track bad records
- maponly jobs / How it works...
- Mapper class / How it works...
- col_pos / How it works...
- pattern / How it works...
- outKey / How it works...
- outValue / How it works...
- mapred-site.xml configuration file / Getting ready, There's more...
- mapred.cache.files property / How it works...
- mapred.child.java.opts property / How to do it...
- mapred.compress.map.output property / How to do it...
- mapred.job.reuse.jvm.num.tasks property / How to do it...
- mapred.job.tracker property / How to do it..., How it works...
- mapred.map.child.java.opts property / How to do it...
- mapred.map.output.compression.codec property / How to do it...
- mapred.map.tasks.speculative.execution property / How to do it...
- mapred.output.compression.codec property / How to do it...
- mapred.output.compression.type property / How to do it...
- mapred.output.compress property / How to do it...
- mapred.reduce.child.java.opts property / How to do it...
- mapred.reduce.tasks.speculative.execution property / How to do it...
- mapred.reduce.tasks property / How to do it...
- mapred.skip.attempts.to.start.skipping property / There's more...
- mapred.skip.map.auto.incr.proc.count property / There's more...
- mapred.skip.map.max.skip.records property / There's more...
- mapred.skip.out.dir property / There's more...
- mapred.skip.reduce.auto.incr.proc.count property / There's more...
- mapred.tasktracker.reduce.tasks.maximum property / There's more...
- mapred.textoutputformat.separator property / There's more...
- MapReduce
- about / How it works...
- used, for transforming Apache logs into TSV format / Transforming Apache logs into TSV format using MapReduce, How to do it..., How it works..., There's more...
- using, to calculate page views / Using MapReduce and secondary sort to calculate page views, How to do it..., How it works...
- page views calculating, secondary sort used / Using MapReduce and secondary sort to calculate page views, How to do it..., How it works...
- output files naming, MultipleOutputs, using / Using MultipleOutputs in MapReduce to name output files, How to do it..., How it works...
- distributed cache, using to find lines with matching keywords over news archives / Using the distributed cache in MapReduce to find lines that contain matching keywords over news archives, How it works..., Distributed cache does not work in local jobrunner mode
- used, for joining data in mapper / Joining data in the Mapper using MapReduce, How to do it..., How it works..., There's more...
- used, for counting distinct IPs in weblog data / Counting distinct IPs in weblog data using MapReduce and Combiners, How to do it..., How it works...
- used, for counting distinct IPs / How to do it...
- used for bulk importing geographic event data, into Accumulo / Using MapReduce to bulk import geographic event data into Accumulo, How to do it..., How it works...
- used for aggregating sources in Accumulo / Aggregating sources in Accumulo using MapReduce, How to do it..., How it works...
- MapReduce job
- counters, using to track bad records / Using Counters in a MapReduce job to track bad records, How to do it..., How it works...
- parameters, tuning / Tuning MapReduce job parameters, How to do it...
- MapReduce job, properties
- mapred.skip.attempts.to.start.skipping property / There's more...
- mapred.skip.map.auto.incr.proc.count property / There's more...
- mapred.skip.reduce.auto.incr.proc.count property / There's more...
- mapred.skip.out.dir property / There's more...
- mapred.skip.map.max.skip.records property / There's more...
- MapReduce jobs
- -file parameter, using to pass required files / Using the -file parameter to pass additional required files for MapReduce jobs
- about / Developing and testing MapReduce jobs with MRUnit
- developing, with MRUnit / Developing and testing MapReduce jobs with MRUnit
- MRUnit, URL for downloading / Getting ready
- testing, with MRUnit / How to do it..., There's more...
- enabling, to skip bad records / Enabling MapReduce jobs to skip bad records, There's more...
- MapReduce running jobs
- in local mode, developing / Developing and testing MapReduce jobs running in local mode, How to do it..., How it works..., There's more...
- in local mode, testing / Developing and testing MapReduce jobs running in local mode, How to do it..., How it works..., There's more...
- MapReduce used
- used, for generating n-grams over news archives / Generating n-grams over news archives using MapReduce, Getting ready, How to do it..., How it works...
- mapred_excludes file / How it works...
- masters configuration file / There's more...
- Maven 2.2
- URL / Getting ready
- merge join, Apache Pig
- used, for joining sorted data / Joining sorted data using Apache Pig merge join, How it works...
- Microsoft SQL Server
- Sqoop, configuring for / Configuring Sqoop for Microsoft SQL Server, How to do it...
- min() operator / The Combiner does not always have to be the same class as your Reducer
- Mockito
- URL / See also
- MongoDB
- data, exporting from HDFS / Exporting data from HDFS into MongoDB, How to do it..., How it works...
- data, importing into HDFS / Importing data from MongoDB into HDFS, How to do it...
- Mongo Hadoop Adaptor / Getting ready
- URL / Getting ready, Getting ready
- Mongo Java Driver
- URL / Getting ready, Getting ready, Getting ready
- MRUnit
- about / Developing and testing MapReduce jobs with MRUnit
- URL, for downloading / Getting ready
- mapper, testing / How to do it..., There's more...
- MultipleOutputs
- used, for naming output files in MapReduce / Using MultipleOutputs in MapReduce to name output files, How to do it..., How it works...
- MySQL
- data importing into HDFS, Sqoop used / Importing data from MySQL into HDFS using Sqoop, Getting ready, How it works..., There's more...
- data exporting from HDFS, Sqoop used / Exporting data from HDFS into MySQL using Sqoop, Getting ready, How it works...
- mysql.user table / How it works...
- MySQL JDBC driver JAR file / Getting ready
N
- --namedVector arguments / How it works...
- --numClusters parameter / How it works...
- -ng arguments / How it works...
- n-grams
- generating, over news archives, MapReduce used / Generating n-grams over news archives using MapReduce, Getting ready, How to do it..., How it works...
- NameNode failure
- recovering from / Recovering from a NameNode failure, How to do it..., There's more...
- news archives
- n-grams generating over, MapReduce used / Generating n-grams over news archives using MapReduce, Getting ready, How to do it..., How it works...
- NGramMapper class / How it works...
- Nigera_ACLED_cleaned.tsv dataset / Getting ready, Getting ready
- nigeria_holidays table / How it works...
- nobots relationship / There's more...
- nobots_weblogs relation / How it works...
- nodes
- adding, to existing cluster / Adding new nodes to an existing cluster, How to do it..., There's more...
- decommissioning / Safely decommissioning nodes, How to do it...
- non-violence, longest period of
- marking, Hive used / Getting ready, How to do it..., How it works...
- NullWritable / Use NullWritable to avoid unnecessary serialization overhead
- NumberFormatException exception / How it works...
O
- --output arguments / How it works...
- --output parameter / How it works...
- --overwrite parameter / How it works...
- -output /output/acled_analytic_out \ argument / How it works...
- ON operator / The ON operator for inner joins does not support inequality conditions
- operating modes, Hadoop
- standalone mode / Starting Hadoop in pseudo-distributed mode
- pseudo-distributed mode / Starting Hadoop in pseudo-distributed mode
- fully-distributed mode / Starting Hadoop in pseudo-distributed mode
- optimized full outer joins
- using, in Apache Hive / Using optimized full outer joins in Apache Hive to analyze geographical events, How to do it..., How it works...
- ORDER BY / SORT BY versus DISTRIBUTE BY versus CLUSTER BY versus ORDER BY
- ORDER BY relational operator / There's more...
- org.apache.hadoop.fs.FileSystem object / How it works...
- org.apache.hadoop.fs.FsShell class / How it works...
- output.write() method / How it works...
- OutputStream object / How it works...
P
- --password option / How it works...
- PageRank
- with Apache Giraph / PageRank with Apache Giraph, How to do it..., How it works...
- page views
- calculating, secondary sort used / Using MapReduce and secondary sort to calculate page views, How to do it..., How it works...
- per-month report of fatalities
- building over geographic event data, Hive used / Using Hive to build a per-month report of fatalities over geographic event data, How it works..., Date reformatting code template
- Pig
- used, for exporting data from HDFS into MongoDB / Exporting data from HDFS into MongoDB using Pig, How to do it..., How it works...
- used, for calculating Cosine similarity / How to do it..., How it works...
- play counts
- prev_date / How it works...
- protobufRecord object / How it works...
- ProtobufWritable class / How it works...
- ProtobufWritable instance / How it works...
- Protocol Buffers
- using, to serialize data / Getting ready, How to do it..., How it works...
- pseudo-distributed mode
- about / Starting Hadoop in pseudo-distributed mode
- Hadoop, starting in / Starting Hadoop in pseudo-distributed mode, How to do it..., How it works..., There's more...
- Python
- using, to extend Apache Pig functionality / Using Python to extend Apache Pig functionality, How it works...
- used, for cleaning geographical event data / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- used, for transforming geographical event data / Using Hive and Python to clean and transform geographical event data, How to do it..., How it works..., There's more...
- AS keyword, used for type casting values / Type casting values using the AS keyword
- Python streaming
- using, to perform time series analytic / Using Python and Hadoop Streaming to perform a time series analytic, How to do it..., How it works...
Q
- QL statement / Making every column type String
- Quantile UDF / Trim Outliers from the Audioscrobbler dataset using Pig and datafu, How it works...
- query
- issuing, SumCombiner used / How to do it..., How it works...
- query results
- limiting, regex filtering iterator used / Limiting query results using the regex filtering iterator, Getting ready, How to do it..., How it works...
R
- -reducer location_regains_by_time.py \ argument / How it works...
- read compression option / There's more...
- record-skipping / There's more...
- Record class / How it works...
- Redis
- about / Joining data using an external key-value store (Redis), Getting ready
- used, for joining data in MapReduce / How to do it...
- URL / There's more...
- reduce() function / How to do it...
- reduce() method / How it works...
- reduce-side join
- Reducer class / How it works...
- regex filtering iterator
- used, for limiting query results / Limiting query results using the regex filtering iterator, Getting ready, How to do it..., How it works...
- removeAndSetOutput() method / How to do it...
- removeAndSetPath() method / Use caution when invoking FileSystem.delete()
- replicated join, Apache Pig
- used, for joining data / Joining data using Apache Pig replicated join, How it works..., There's more...
- replication factor
- setting, for HDFS / Setting the replication factor for HDFS, How it works...
- replication factor setting
- about / Introduction
- request_date field / Using the Hive string UDFs to concatenate fields in weblog data
- request_time field / Using the Hive string UDFs to concatenate fields in weblog data
- Resource Description Framework (RDF) / Single-source shortest-path with Apache Giraph
- rowCount variable / How it works...
- row key
- designing, to store geographic events in Accumulo / Designing a row key to store geographic events in Accumulo, How to do it..., How it works...
- run() method / How it works..., How it works..., How it works..., How to do it..., How to do it..., How to do it..., How to do it..., How to do it...
- runTest() method / How it works...
S
- $SQOOP_HOME / How it works...
- --split-by argument / How it works...
- --staging-table argument / How it works...
- -s arguments / How it works...
- -s option / How it works...
- scans
- cell-level security enforcing, Accumulo used / Enforcing cell-level security on scans using Accumulo, How to do it..., How it works...
- Sqoop
- configuring, for Microsoft SQL Server / Configuring Sqoop for Microsoft SQL Server, How to do it...
- Secondary NameNode / Introduction
- secondary sort
- using, to calculate page views / Using MapReduce and secondary sort to calculate page views, How to do it..., How it works...
- select() method / How it works...
- SELECT statement / How it works...
- SELECT TRANSFORM / MAP and REDUCE keywords are shorthand for SELECT TRANSFORM
- seq2sparse arguments / How it works...
- --input arguments / How it works...
- --output arguments / How it works...
- --namedVector arguments / How it works...
- -ml arguments / How it works...
- -ng arguments / How it works...
- -x arguments / How it works...
- -md arguments / How it works...
- -s arguments / How it works...
- -wt arguments / How it works...
- -a arguments / How it works...
- seqdirectory tool / How it works...
- SequenceFileInputFormat.class / How it works...
- SequenceFiles
- data, writing to / Reading and writing data to SequenceFiles, How to do it...
- data, reading from / Reading and writing data to SequenceFiles, How to do it...
- about / There's more...
- uncompressed option / There's more...
- read compression option / There's more...
- block compression option / There's more...
- SequenceWriter class / How it works...
- SerDe / How it works...
- sessionize web server log data
- viewing, Apache Pig used / Using Apache Pig to sessionize web server log data, How to do it...
- set() method / How it works...
- setAttemptsToStartSkipping() method / There's more...
- setJarByClass() / How it works...
- setJarByClass() method / How it works...
- setNumReduceTasks() method / There's more...
- setSkipOutputPath() method / There's more...
- setStatus() method / How to do it...
- setup() method / How it works..., How it works..., How to do it...
- setup() routine / How to do it...
- shell commands
- SimpleDateFormat pattern / Setting a custom field constraint for inputting geographic event data in Accumulo
- single-source shortest-path
- with Apache Giraph / Single-source shortest-path with Apache Giraph, How to do it..., How it works...
- first superstep (S0) / First superstep (S0)
- second superstep (S1) / Second superstep (S1)
- Sinks / There's more...
- skewed data
- joining, Apache Pig skewed join used / Joining skewed data using Apache Pig skewed join, How to do it..., How it works...
- skewed join, Apache Pig
- used, for joining skewed data / Joining skewed data using Apache Pig skewed join, How to do it..., How it works...
- SkipBadRecords class / How it works..., There's more...
- slaves configuration file / How to do it...
- SORT BY / SORT BY versus DISTRIBUTE BY versus CLUSTER BY versus ORDER BY
- SortComparator class / How it works...
- sorted data
- joining, Apache Pig merge join used / Joining sorted data using Apache Pig merge join, How it works...
- Sources / There's more...
- sources
- aggregating in Accumulo, MapReduce used / Aggregating sources in Accumulo using MapReduce, How to do it..., How it works...
- spiders / Using Apache Pig to filter bot traffic from web server logs
- split points / Split points
- splittable / Compressing data using LZO
- Sqoop
- used, for importing data from MySQL into HDFS / Importing data from MySQL into HDFS using Sqoop, Getting ready, How it works..., There's more...
- used, for exporting data from HDFS into MySQL / Exporting data from HDFS into MySQL using Sqoop, Getting ready, How it works...
- URL / Getting ready
- standalone mode
- startTime variable / How it works...
- start_date / How it works...
- start_time_obj / How it works...
- stderr / How it works...
- stdin / Using Counters in a streaming job
- stdout / Using Counters in a streaming job
- streaming job
- counters, using / Using Counters in a streaming job
- executing, streaming_counters.py program used / How to do it...
- StreamingQuantile UDF / There's more...
- streaming_counters job / How to do it...
- string fields / How to do it...
- STRING type / Making every column type String
- String[] parameters / How to do it...
- strip() method / How it works...
- SumCombiner
- using, in Accumulo / Counting fatalities for different versions of the same key using SumCombiner, How to do it..., How it works...
- used, for issuing query / How to do it..., How it works...
T
- --table argument / How it works..., How it works...
- tab-separated values (TSV) / Transforming Apache logs into TSV format using MapReduce
- TableFoo FULL OUTER JOIN TableBar / Map-join behavior
- TableFoo LEFT OUTER JOIN TableBar / Map-join behavior
- TableFoo RIGHT OUTER JOIN TableBar / Map-join behavior
- TabletServer
- constraint class, installing / Installing a constraint on each TabletServer
- task status messages
- updating, to display debugging information / Updating task status messages to display debugging information, How it works...
- TestCase class / How to do it...
- testclassifier tool / How to do it...
- testFullKey() unit test method / How to do it...
- testIdentityMapper1() method / How to do it...
- testIdentityMapper2() method / How to do it...
- testInvalidReverseTime() unit test method / How to do it...
- testValidReverseTime() unit test method / How to do it...
- TextOutputFormat class / There's more...
- thriftRecord object / How it works...
- ThriftWritable class / How it works...
- time series analytic
- performing, Hadoop Streaming used / Using Python and Hadoop Streaming to perform a time series analytic, How to do it..., How it works...
- timestamp
- web server log data sorting, Apache Pig used / Using Apache Pig to sort web server log data by timestamp, See also
- timestamp field / There's more...
- Tool interface / How to do it...
- ToolRunner class / How it works...
- ToolRunner setup / Generating n-grams over news archives using MapReduce
- train_formated dataset / How to do it...
- TRANSFORM operator / How it works...
- TSV format
- Apache logs transforming, MapReduce used / Transforming Apache logs into TSV format using MapReduce, How to do it..., How it works..., There's more...
- type casting values
- AS keyword used / Type casting values using the AS keyword
U
- --update-key value / How it works...
- --username option / How it works...
- -usersFile flag / How it works...
- uncompressed option / There's more...
- unix_timestamp() / How it works...
- user-defined filter function (UDF) / Using Apache Pig to filter bot traffic from web server logs
- user_artist_data.txt file / How it works...
V
- -v option / How it works...
- validZOrder() unit test method / How to do it...
- VALID_IP_ADDRESS regular expression / How it works...
W
- -w option / How it works...
- -wt arguments / How it works...
- weblog data
- Hive string UDFs, using to concatenate fields / Using the Hive string UDFs to concatenate fields in weblog data, Getting ready, How it works...
- distinct IPs counting, MapReduce used / Counting distinct IPs in weblog data using MapReduce and Combiners, How to do it..., How it works...
- distinct IPs counting, Combiners used / Counting distinct IPs in weblog data using MapReduce and Combiners, How to do it..., How it works...
- weblog IPs
- intersecting, Hive used / Using Hive to intersect weblog IPs and determine the country, How to do it...
- WeblogMapper class / How it works...
- WeblogMapper map() method / How it works...
- weblog query results
- tables creating, Hive used / Using Hive to dynamically create tables from the results of a weblog query, How to do it..., There's more...
- WeblogRecord.Record object / How it works...
- WeblogRecord class / How it works...
- WeblogRecord object / How to do it..., How it works..., How to do it..., How it works..., How to do it...
- weblog_entries.txt dataset
- URL, for downloading / Getting ready
- weblog_entries dataset / Getting ready, Getting ready
- weblog_entries_bad_records.txt dataset
- URL, for downloading / Getting ready
- WhitespaceAnalyzer / How it works...
- withInput() method / How it works...
- withOutput() method / How it works...
- WritableComparable class / How it works..., How it works...
- WritableComparable interface / How it works...
- writeVertex() method / How it works...
X
- -x arguments / How it works...
Z
- Z-order curve / Z-order curve