Index
A
- addnl
- about / How it works...
- Adwords assigner / There's more...
- Adwords balance algorithm
- used, for assigning advertisements to keywords / Assigning advertisements to keywords using the Adwords balance algorithm
- implementing / Assigning advertisements to keywords using the Adwords balance algorithm, Getting ready, How to do it...
- working / How it works...
- AdwordsBidGenerator / How it works...
- Amazon EC2 Spot Instances
- Amazon Elastic Compute Cloud (EC2) / Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR)
- Amazon Elastic MapReduce (EMR)
- about / Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR)
- used, for running MapReduce computations / Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR), How to do it...
- Amazon EMR console
- URL / How to do it...
- Amazon sales dataset
- clustering / Clustering an Amazon sales dataset, Getting ready
- working / How it works...
- Amazon Simple Storage Service (S3) / Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR)
- ant-nodeps package
- about / Building libhdfs
- ant-trax package
- about / Building libhdfs
- Apache Ant
- download link / Getting ready
- URL / Getting ready
- Apache Forrest
- URL / Building libhdfs
- Apache Gora / Configuring Apache HBase as the backend data store for Apache Nutch
- Apache HBase
- configuring, as backend data store for Apache Nutch / Configuring Apache HBase as the backend data store for Apache Nutch, How to do it, How it works...
- deploying, on Hadoop cluster / Deploying Apache HBase on a Hadoop cluster, How to do it, How it works...
- download link / How to do it
- Apache HBase Cluster
- deploying, on Amazon EC2 cloud with EMR / Deploying an Apache HBase Cluster on Amazon EC2 cloud using EMR, How to do it...
- Apache Lucene project / Indexing and searching web documents using Apache Solr
- Apache Mahout K-Means clustering algorithm
- about / How to do it...
- Apache Nutch
- about / Intra-domain web crawling using Apache Nutch
- used, for intra-domain web crawling / Intra-domain web crawling using Apache Nutch, How to do it...
- Apache HBase, configuring as backend data store / Configuring Apache HBase as the backend data store for Apache Nutch, How to do it, How it works...
- using, with Hadoop/HBase cluster for web crawling / Getting ready, How to do it, How it works
- Apache Nutch Ant build / How it works
- Apache Nutch search engine
- about / Introduction
- Apache Solr
- about / Indexing and searching web documents using Apache Solr
- used, for indexing and searching web documents / Indexing and searching web documents using Apache Solr, How to do it
- working / How it works
- Apache Tomcat developer list e-mail archives
- URL / Introduction
- Apache Whirr
- about / Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment
- used, for deploying Hadoop cluster on Amazon EC2 cloud / Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment, How to do it..., How it works...
- used, for deploying HBase cluster on Amazon EC2 cloud / Using Apache Whirr to deploy an Apache HBase cluster in a cloud environment, How to do it..., How it works...
- Apache Whirr binary distribution
- downloading / How to do it...
- automake package
- about / Building libhdfs
- AWS Access Keys / How to do it...
B
- bad records
- benchmarks
- running, for verifying Hadoop installation / Running benchmarks to verify the Hadoop installation, How it works...
- about / Running benchmarks to verify the Hadoop installation
- built-in data types
- Text / There's more...
- BytesWritable / There's more...
- VIntWritable / There's more...
- VLongWritable / There's more...
- NullWritable / There's more...
- ArrayWritable / There's more...
- TwoDArrayWritable / There's more...
- MapWritable / There's more...
- SortedMapWritable / There's more...
C
- <configuration> tag
- about / How to do it...
- capacity scheduler
- classifiers
- CLI
- cluster deployments
- Hadoop configurations, tuning / Getting ready, How to do it...
- clustering
- about / Clustering the text data
- clustering algorithm
- about / Running K-means with Mahout
- collaborative filtering-based recommendations
- about / Collaborative filtering-based recommendations
- implementing / Getting ready, How to do it...
- working / How it works...
- compareTo() method / How it works...
- combiner
- adding, to WordCount MapReduce program / Adding the combiner step to the WordCount MapReduce program, How to do it...
- about / Adding the combiner step to the WordCount MapReduce program
- activating / How it works...
- completebulkload command
- about / How it works...
- complex dataset
- parsing, with Hadoop / Parsing a complex dataset with Hadoop, How to do it..., How it works...
- computational complexity / How it works...
- conf/core-site.xml
- about / How to do it...
- configuration properties / There's more...
- conf/hdfs-site.xml
- about / How to do it...
- configuration properties / There's more...
- conf/mapred-site.xml
- about / How to do it...
- configuration properties / There's more...
- configuration files
- conf/core-site.xml / How to do it...
- conf/hdfs-site.xml / How to do it...
- conf/mapred-site.xml / How to do it...
- configuration properties, conf/core-site.xml
- fs.inmemory.size.mb / There's more...
- io.sort.factor / There's more...
- io.file.buffer.size / There's more...
- configuration properties, conf/hdfs-site.xml
- dfs.block.size / There's more...
- dfs.namenode.handler.count / There's more...
- configuration properties, conf/mapred-site.xml
- mapred.reduce.parallel.copies / There's more...
- mapred.map.child.java.opts / There's more...
- mapred.reduce.child.java.opts / There's more...
- io.sort.mb / There's more...
- content-based recommendations
- about / Content-based recommendations
- implementing / Getting ready, How to do it...
- working / How it works...
- createRecordReader() method
- about / How it works...
- custom Hadoop key type
- implementing / Implementing a custom Hadoop key type, How to do it..., How it works...
- custom Hadoop Writable data type
- custom InputFormat
- custom Partitioner
- implementing / Hadoop intermediate (map to reduce) data partitioning
- Cygwin / Getting ready
D
- data
- emitting, from mapper / Emitting data of different value types from a mapper, How to do it..., How it works...
- grouping, MapReduce used / Performing Group-By using MapReduce, How to do it..., How it works...
- data de-duplication
- Hadoop Streaming, used / Data de-duplication using Hadoop Streaming, How it works...
- HBase, used / Data de-duplication using HBase
- Dataflow language / How to do it...
- data mining algorithm
- about / Installing Mahout
- DataNodes
- about / Introduction
- adding / Adding a new DataNode
- decommissioning / Decommissioning DataNodes, How to do it...
- data preprocessing
- datasets
- joining, MapReduce used / Joining two datasets using MapReduce, Getting ready, How it works...
- debug scripts
- about / Debug scripts – analyzing task failures
- writing / How to do it...
- decommissioning process
- working / Decommissioning DataNodes
- about / How it works...
- DFSIO
- used, for benchmarking / Benchmarking HDFS
- about / Benchmarking HDFS
- distributed cache / How it works...
- distributed mode, Hadoop installation
- about / Introduction
- document classification
- about / Document classification using Mahout Naive Bayes classifier
- Naive Bayes Classifier, used / Document classification using Mahout Naive Bayes classifier, How to do it..., How it works...
E
- EC2 console
- URL / How to do it...
- ElasticSearch
- about / ElasticSearch for indexing and searching
- URL / ElasticSearch for indexing and searching
- used, for indexing and searching data / How to do it, How it works
- download link / How to do it
- working / How it works
- using / How it works
- EMR
- used, for executing Pig script / Executing a Pig script using EMR, How to do it...
- used, for executing Hive script / Executing a Hive script using EMR, How to do it...
- used, for deploying Apache HBase Cluster on Amazon EC2 cloud / Deploying an Apache HBase Cluster on Amazon EC2 cloud using EMR, How to do it...
- EMR Bootstrap actions
- used, for configuring VMs for EMR jobs / Using EMR Bootstrap actions to configure VMs for the Amazon EMR jobs, How to do it..., There's more...
- configure-daemons / There's more...
- configure-hadoop / There's more...
- memory-intensive / There's more...
- run-if / There's more...
- EMR CLI
- used, for creating EMR job flow / Creating an Amazon EMR job flow using the Command Line Interface, How to do it...
- EMR job flows
- executing, Amazon EC2 Spot Instances used / Saving money by using Amazon EC2 Spot Instances to execute EMR job flows, How to do it...
- creating, CLI used / Creating an Amazon EMR job flow using the Command Line Interface, How to do it...
- exclude file / How to do it...
F
- failure percentages
- fair scheduler
- fault tolerance
- FIFO scheduler
- file replication factor
- setting / Setting the file replication factor
- FileSystem.create() method / How it works...
- FileSystem.create(filePath) method / How it works...
- FileSystem object
- configuring / Configuring the FileSystem object
- frequency distribution
- about / Calculating frequency distributions and sorting using MapReduce
- calculating, MapReduce used / Calculating frequency distributions and sorting using MapReduce, How it works...
- Fuse-DFS project
- mounting / Mounting HDFS (Fuse-DFS), Getting ready, How to do it...
- working / How it works...
- URL / How it works...
G
- getDistance() method / How it works...
- getFileBlockLocations() function / Retrieving the list of data blocks of a file
- getGeoLocation() method / How it works...
- getInputSplit() method / How it works...
- getLength() method
- about / There's more...
- getLocalCacheFiles() method / How it works...
- getmerge command / How to do it...
- about / How it works...
- getPath() method / How it works...
- getSplits() method
- about / There's more...
- getTypes() method / How to do it...
- getUri() function / Configuring the FileSystem object
- GNU Plot
- used, for plotting results / Plotting the Hadoop results using GNU Plot, How to do it..., How it works...
- URL / There's more...
- Google
- about / Introduction
- Gross National Income (GNI) / Running your first Pig command
H
- Hadoop
- about / Introduction
- setting up / Setting up Hadoop on your machine, How to do it...
- URL / How to do it...
- MapReduce program, writing / Writing a WordCount MapReduce sample, bundling it, and running it using standalone Hadoop, How to do it...
- MapReduce program, executing / Writing a WordCount MapReduce sample, bundling it, and running it using standalone Hadoop
- setting, in distributed cluster environment / Setting Hadoop in a distributed cluster environment, Getting ready, How to do it...
- used, for parsing complex dataset / Parsing a complex dataset with Hadoop, How to do it..., How it works...
- content-based recommendations / Content-based recommendations
- hierarchical clustering / Hierarchical clustering
- Amazon sales dataset clustering / Clustering an Amazon sales dataset
- collaborative filtering-based recommendations / Collaborative filtering-based recommendations
- Adwords balance algorithm / Assigning advertisements to keywords using the Adwords balance algorithm
- Hadoop's Writable-based serialization framework
- Hadoop Aggregate package / How it works...
- Hadoop cluster
- Apache HBase, deploying on / Deploying Apache HBase on a Hadoop cluster, How to do it, How it works...
- deploying on Amazon EC2 cloud, Apache Whirr used / Using Apache Whirr to deploy an Apache Hadoop cluster in a cloud environment, How to do it..., How it works...
- Hadoop configurations
- tuning / How to do it...
- Hadoop counters
- about / Hadoop counters for reporting custom metrics
- used, for reporting custom metrics / Hadoop counters for reporting custom metrics
- working / How it works...
- Hadoop data types
- Hadoop DistributedCache
- about / Broadcasting and distributing shared resources to tasks in a MapReduce job – Hadoop DistributedCache
- used, for retrieving Map and Reduce tasks / How to do it...
- working / How it works...
- used, for distributing archives / Distributing archives using the DistributedCache
- resources, adding from command line / Adding resources to the DistributedCache from the command line
- used, for adding resources to classpath / Adding resources to the classpath using DistributedCache
- Hadoop GenericWritable data type / How to do it...
- Hadoop InputFormat
- selecting, for input data format / Choosing a suitable Hadoop InputFormat for your input data format
- Hadoop installation
- NameNode / Introduction
- DataNodes / Introduction
- JobTracker / Introduction
- TaskTracker / Introduction
- modes / Introduction
- verifying, benchmarks used / Running benchmarks to verify the Hadoop installation, How it works...
- Hadoop intermediate data partitioning
- Hadoop Kerberos security
- about / Hadoop security – integrating with Kerberos
- pitfalls / How it works...
- Hadoop monitoring UI
- using / Using MapReduce monitoring UI, How to do it...
- working / How it works...
- Hadoop OutputFormats
- used, for formatting MapReduce computations results / Formatting the results of MapReduce computations – using Hadoop OutputFormats, How it works...
- Hadoop Partitioners
- Hadoop results
- plotting, GNU Plot used / Plotting the Hadoop results using GNU Plot, How to do it..., How it works...
- Hadoop scheduler
- hadoop script / How to do it...
- Hadoop security
- about / Hadoop security – integrating with Kerberos
- Kerberos, integrating with / Hadoop security – integrating with Kerberos, How to do it...
- Hadoop Streaming
- about / Using Hadoop with legacy applications – Hadoop Streaming, There's more...
- working / How it works...
- URL / There's more...
- using with Python script-based mapper, for data preprocessing / Data preprocessing (extract, clean, and format conversion) using Hadoop Streaming and Python, How it works..., There's more...
- used, for data de-duplication / Data de-duplication using Hadoop Streaming, How to do it..., How it works...
- Hadoop Tool interface
- HADOOP_LOG_DIR
- about / How it works...
- hashCode() method / How it works..., How it works...
- HashPartitioner partitions
- HBase
- about / Introduction, Installing HBase
- installing / Installing HBase, How to do it...
- downloading / How to do it...
- working / How it works...
- running, in distributed mode / There's more...
- data random access, via Java client APIs / Data random access using Java client APIs, How to do it...
- MapReduce jobs, running / Running MapReduce jobs on HBase (table input/output), How to do it...
- used, for data de-duplication / Data de-duplication using HBase
- HBase cluster
- deploying on Amazon EC2 cloud, Apache Whirr used / Using Apache Whirr to deploy an Apache HBase cluster in a cloud environment, How to do it..., How it works...
- HBase data model
- about / Installing HBase
- reference link / Installing HBase
- HBase TableMapper / How it works
- HDFS
- about / Setting up HDFS, Introduction
- setting up / Setting up HDFS, How to do it...
- working / How it works...
- benchmarking / Benchmarking HDFS, How to do it...
- DataNode, adding / Adding a new DataNode, How to do it...
- rebalancing / Rebalancing HDFS
- files, merging / Merging files in HDFS
- HDFS basic command-line file operations
- executing / HDFS basic command-line file operations, How to do it...
- HDFS block size
- setting / Setting HDFS block size, How to do it...
- HDFS C API
- using / Using HDFS C API (libhdfs), How to do it...
- working / How it works...
- HDFS configuration files
- configuring / Configuring using HDFS configuration files
- hdfsConnectAsUser command / How it works...
- hdfsConnect command / How it works...
- HDFS disk usage
- HDFS filesystem
- mounting / How to do it...
- HDFS Java API
- about / Using HDFS Java API, How to do it..., How it works...
- using / Using HDFS Java API, How to do it...
- working / How it works...
- HDFS monitoring UI
- using / Using HDFS monitoring UI
- hdfsOpenFile command / How it works...
- hdfsRead command / How it works...
- HDFS replication factor
- about / Setting the file replication factor
- working / How it works...
- HDFS setup
- testing / How to do it...
- HDFS web console
- accessing / How to do it...
- hierarchical clustering
- about / Hierarchical clustering
- implementing / Hierarchical clustering, How to do it...
- working / How it works...
- higher-level programming interfaces
- about / Installing Pig
- histograms
- about / Calculating histograms using MapReduce
- calculating, MapReduce used / Calculating histograms using MapReduce, Getting ready, How to do it..., How it works...
- Hive
- about / Introduction, Installing Hive
- downloading / How to do it...
- installing / How to do it...
- working / How it works..., How it works...
- SQL-style query, running with / Running a SQL-style query with Hive, Getting ready, How to do it...
- used, for filtering and sorting / How to do it...
- join, performing with / Performing a join with Hive, How to do it..., How it works...
- Hive interactive session
- Hive script
- executing, EMR used / Executing a Hive script using EMR, How to do it...
- Human Development Report (HDR) / Running a SQL-style query with Hive
- Human Development Report (HDR) data / Running your first Pig command
I
- importtsv and bulkload
- used, for importing large text dataset to HBase / Loading large datasets to an Apache HBase data store using importtsv and bulkload tools, How to do it..., How it works...
- importtsv tool
- about / How it works...
- using / There's more...
- in-links graph
- generating, for crawled web pages / Generating the in-links graph for crawled web pages, How to do it, How it works
- InputFormat implementations
- TextInputFormat / There's more...
- NLineInputFormat / There's more...
- SequenceFileInputFormat / There's more...
- DBInputFormat / There's more...
- InputSplit object
- about / There's more...
- intra-domain web crawling
- Apache Nutch used / Intra-domain web crawling using Apache Nutch, How to do it...
- inverted document frequencies (IDF) / Creating TF and TF-IDF vectors for the text data
- inverted index
- generating, MapReduce used / Generating an inverted index using Hadoop MapReduce, How to do it..., How it works...
J
- Java 1.6
- downloading / Getting ready
- installing / Getting ready
- Java client APIs
- used, for connecting HBase / How to do it...
- Java Cryptography Extension (JCE) Policy / How to do it...
- Java Integrated Development Environment (IDE) / Getting ready
- Java JDK 1.6 / Getting ready
- Java regular expressions
- URL / There's more...
- Java VMs
- reusing, for improving performance / Reusing Java VMs to improve the performance, How it works...
- JDK 1.5
- URL / Building libhdfs
- JobTracker
- about / Introduction
- setting up / How to do it...
- join
- performing, with Hive / Performing a join with Hive, How to do it..., How it works...
- JSON snippet / How to do it...
K
- K-means
- about / How it works..., Running K-means with Mahout
- running, with Mahout / Running K-means with Mahout, How to do it..., How it works...
- K-means results
- visualizing / Visualizing K-means results, How it works...
- Kerberos
- integrating with / Hadoop security – integrating with Kerberos
- installing / How to do it...
- principals / How to do it...
- Kerberos setup
- about / Hadoop security – integrating with Kerberos
- NameNode / Hadoop security – integrating with Kerberos
- DataNodes / Hadoop security – integrating with Kerberos
- JobTracker / Hadoop security – integrating with Kerberos
- TaskTrackers / Hadoop security – integrating with Kerberos
- KeyFieldPartitioner / KeyFieldBasedPartitioner
- KeyValueTextInputFormat
- about / How it works...
- kinit command / How it works...
L
- large text dataset
- importing to HBase, importtsv and bulkload used / Loading large datasets to an Apache HBase data store using importtsv and bulkload tools, Getting ready, How to do it..., How it works..., There's more...
- LDA
- about / Topic discovery using Latent Dirichlet Allocation (LDA)
- used, for topic discovery / Topic discovery using Latent Dirichlet Allocation (LDA), How to do it...
- libhdfs
- about / Using HDFS C API (libhdfs)
- using / Getting ready
- building / Building libhdfs
- Libtool package
- about / Building libhdfs
- local mode, Hadoop installation
- about / Introduction
- working / How it works...
- LogFileInputFormat
- about / How it works...
- LogFileRecordReader class
- about / How it works...
- LogWritable class
- about / How it works...
M
- machine learning algorithm
- about / Installing Mahout
- Mahout
- about / Introduction, Installing Mahout
- installing / How to do it...
- working / How it works...
- K-means, running with / Running K-means with Mahout, How to do it..., How it works...
- Mahout installation
- verifying / How to do it...
- Mahout K-Means algorithm / How it works...
- Mahout seqdumper command / How it works...
- Mahout split command
- about / How it works...
- map() function / How it works...
- MapFile
- about / There's more...
- mapper
- data, emitting from / Emitting data of different value types from a mapper, How to do it...
- implementing, for HTTP log processing application / Using Hadoop with legacy applications – Hadoop Streaming, How to do it...
- MapReduce
- about / Introduction
- used, for calculating simple analytics / Simple analytics using MapReduce, Getting ready, How to do it..., How it works...
- used, for grouping data / Performing Group-By using MapReduce, How to do it..., How it works...
- used, for calculating frequency distributions / Calculating frequency distributions and sorting using MapReduce, How it works...
- used, for calculating histograms / Calculating histograms using MapReduce, Getting ready, How to do it..., How it works...
- used, for calculating Scatter plots / Calculating scatter plots using MapReduce, Getting ready, How to do it..., How it works...
- used, for joining datasets / Joining two datasets using MapReduce, How to do it..., How it works...
- used, for generating inverted index / Generating an inverted index using Hadoop MapReduce, How to do it..., How it works...
- MapReduce application
- MultipleInputs feature, using / Using multiple input data types and multiple mapper implementations in a single MapReduce application
- MapReduce computations
- running, Amazon Elastic MapReduce (EMR) used / Running Hadoop MapReduce computations using Amazon Elastic MapReduce (EMR), How to do it...
- MapReduce computations results
- formatting, Hadoop OutputFormats used / Formatting the results of MapReduce computations – using Hadoop OutputFormats, How it works...
- MapReduce jobs
- dependencies, adding / Adding dependencies between MapReduce jobs, How to do it...
- running, on HBase / Running MapReduce jobs on HBase (table input/output), How to do it...
- working / How it works...
- MapReduce monitoring UI
- using / Using MapReduce monitoring UI, How to do it...
- working / How it works...
- MBOX format / Joining two datasets using MapReduce
- minSupport / How it works...
- modes, Hadoop installation
- local mode / Introduction
- Pseudo distributed mode / Introduction
- distributed mode / Introduction
- mrbench / There's more...
- multi-dimensional space / Clustering an Amazon sales dataset
- multiple disks/volumes
- MultipleInputs feature
- using, in MapReduce application / Using multiple input data types and multiple mapper implementations in a single MapReduce application
N
- 20news dataset
- downloading / How to do it...
- Naive Bayes Classifier
- about / Classification using Naive Bayes Classifier
- URL / Classification using Naive Bayes Classifier
- implementing / Classification using Naive Bayes Classifier, How to do it...
- working / How it works...
- used, for document classification / Document classification using Mahout Naive Bayes classifier, How to do it..., How it works...
- NameNode
- about / Introduction
- NASA weblog dataset
- URL / Introduction
- nextKeyValue() method
- about / How it works..., How it works...
- NLineInputFormat
- about / There's more...
- nnbench / There's more...
- non-Euclidean space / Clustering an Amazon sales dataset
O
- orthogonal axes / Clustering an Amazon sales dataset
P
- <path> parameter / How it works...
- Partitioner / How it works...
- Pattern.compile() method / How it works...
- Pig
- about / Introduction, Installing Pig
- installing / How to do it...
- downloading / How to do it...
- join and sort operations, implementing / Set operations (join, union) and sorting with Pig, How to do it..., There's more...
- Pig command
- running / Running your first Pig command, How to do it...
- working / How it works...
- Pig interactive session
- Pig script
- executing, EMR used / Executing a Pig script using EMR, How to do it...
- primitive data types
- IntWritable / There's more...
- LongWritable / There's more...
- BooleanWritable / There's more...
- FloatWritable / There's more...
- ByteWritable / There's more...
- principals
- Pseudo distributed mode, Hadoop installation
- about / Introduction
R
- random sample / Clustering an Amazon sales dataset
- readFields() method / How it works...
- read performance benchmark
- running / How to do it...
- rebalancer tool
- about / Rebalancing HDFS
- reduce() function / How it works...
- reduce() method / How to do it...
S
- S3 bucket / How to do it...
- Scatter plot
- about / Calculating scatter plots using MapReduce
- calculating, MapReduce used / Calculating scatter plots using MapReduce, Getting ready, How to do it..., How it works...
- scheduling
- seq2sparse command / How it works...
- seqdirectory command / How it works...
- SequenceFileInputFormat
- about / There's more...
- SequenceFileAsBinaryInputFormat / There's more...
- SequenceFileAsTextInputFormat / There's more...
- setrep command syntax / How it works...
- shared-user Hadoop clusters
- simple analytics
- calculating, MapReduce used / Simple analytics using MapReduce, How to do it..., How it works...
- speculative execution
- SQL-style query
- running, with Hive / Running a SQL-style query with Hive, How to do it...
- SSH server / Getting ready
T
- -threshold parameter
- about / Rebalancing HDFS
- tab-separated value (TSV) file / How to do it...
- TableMapReduceUtil class / How it works
- task failures
- TaskTracker
- about / Introduction
- TaskTrackers
- setting up / How to do it...
- TeraSort / There's more...
- term frequencies (TF) / Creating TF and TF-IDF vectors for the text data
- Term frequency-inverse document frequency (TF-IDF) model / Creating TF and TF-IDF vectors for the text data
- TestDFSIO / There's more...
- testmapredsort job / How it works...
- text data
- clustering / Clustering the text data, How to do it..., How it works...
- TextInputFormat
- about / There's more...
- TextInputFormat class / How it works...
- TF and TF-IDF vectors
- creating, for text data / Creating TF and TF-IDF vectors for the text data, Getting ready, How to do it...
- working / How it works...
- Topic discovery
- toString() method / There's more...
- TotalOrderPartitioner / There's more...
- Twahpic / Topic discovery using Latent Dirichlet Allocation (LDA)
V
- VMs
- configuring for EMR jobs, EMR Bootstrap actions used / Using EMR Bootstrap actions to configure VMs for the Amazon EMR jobs, How to do it...
W
- web crawling
- about / Intra-domain web crawling using Apache Nutch
- performing, Apache Nutch used with Hadoop/HBase cluster / Whole web crawling with Apache Nutch using a Hadoop/HBase cluster, How to do it, How it works
- web documents
- indexing and searching, Apache Solr used / Indexing and searching web documents using Apache Solr, How to do it
- WordCount MapReduce program
- writing / Writing a WordCount MapReduce sample, bundling it, and running it using standalone Hadoop, How to do it...
- working / How it works...
- combiner step, adding / Adding the combiner step to the WordCount MapReduce program, How to do it...
- running, in distributed cluster environment / Running the WordCount program in a distributed cluster environment, How to do it..., How it works...
- Writable interface / Choosing appropriate Hadoop data types
- write() method / How it works...
- write performance benchmark
- running / How to do it...
Z
- Zipf / How to do it...
- zlib-devel package
- about / Building libhdfs