This recipe describes how to run a MapReduce computation in a distributed Hadoop v2 cluster.
Start the Hadoop cluster by following the Setting up HDFS recipe or the Setting up Hadoop ecosystem in a distributed cluster environment using a Hadoop distribution recipe.
Now let's run the WordCount sample in the distributed Hadoop v2 setup:
Upload the wc-input directory in the source repository to the HDFS filesystem. Alternatively, you can upload any other set of text documents:

$ hdfs dfs -copyFromLocal wc-input .
Execute the WordCount example from the HADOOP_HOME directory:

$ hadoop jar hcb-c1-samples.jar \
    chapter1.WordCount \
    wc-input wc-output
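The actual chapter1.WordCount class uses Hadoop's Mapper and Reducer APIs, but the logic it performs can be illustrated with a standalone sketch. The following plain-Java version is only an assumption of the data flow (the class name WordCountSketch and the in-memory TreeMap standing in for the shuffle-and-sort step are illustrative, not part of the sample jar):

```java
import java.util.Map;
import java.util.TreeMap;

// Standalone sketch of the WordCount data flow. In a real MapReduce job,
// the map phase emits (word, 1) pairs, the framework shuffles and sorts
// them by key, and the reduce phase sums the counts per word.
public class WordCountSketch {

    static Map<String, Integer> wordCount(String[] lines) {
        // TreeMap stands in for the framework's grouping-by-key step.
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Map phase: split each line into words.
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // Reduce phase: sum the counts for each word.
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] input = { "hello hadoop", "hello world" };
        // Prints one "word<TAB>count" line per word, in the same
        // tab-separated format as the part-r-00000 output file.
        wordCount(input).forEach((w, c) -> System.out.println(w + "\t" + c));
    }
}
```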
Run the following commands to list the output directory and then look at the results:
$ hdfs dfs -ls wc-output
Found 3 items
-rw-r--r--   1 joe supergroup    0 2013-11-09 09:04 /data/output1/_SUCCESS
drwxr-xr-x   - joe supergroup    0 2013-11-09 09:04 /data/output1/_logs
-rw-r--r--   1 joe supergroup 1306 2013-11-09 09:04 /data/output1/part-r-00000

$ hdfs dfs -cat wc-output/part*
When we submit a job, YARN schedules a MapReduce ApplicationMaster to coordinate and execute the computation. The ApplicationMaster requests the necessary resources from the ResourceManager and runs the MapReduce tasks in the containers it receives for that request.