Now that we have our ES-Hadoop environment tested and running, we are all set to run our first WordCount example. In the Hadoop world, WordCount has taken the place of the classic HelloWorld program, hasn't it?
You can download the examples in the book from https://github.com/vishalbrevitaz/eshadoop/tree/master/ch01. Once you have the source code, you can build the JAR file for this chapter using the steps mentioned in the readme file in the source code zip. The build process should generate a ch01-0.0.1-job.jar file under the <SOURCE_CODE_BASE_DIR>/ch01/target directory.
For our WordCount example, you can use any text file of your choice. To explain the example, we will use the sample.txt file that is part of the source zip. Perform the following steps:
First, let's create a directory structure in HDFS to manage our input files with the following commands:
$ hadoop fs -mkdir /input
$ hadoop fs -mkdir /input/ch01
Next, upload the sample.txt file to HDFS at the desired location by using the following command:
$ hadoop fs -put data/ch01/sample.txt /input/ch01/sample.txt
Now, verify that the file is successfully imported to HDFS by using the following command:
$ hadoop fs -ls /input/ch01
Finally, when you execute the preceding command, it should show an output similar to the following:
Found 1 items
-rw-r--r--   1 eshadoop supergroup       2803 2015-05-10 15:18 /input/ch01/sample.txt
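The steps above can also be run as a single, rerunnable sequence. Note that this is a sketch rather than anything from the book's source; it assumes the hadoop binary is on your PATH and that you run it from the source code base directory, and it uses -mkdir -p to create the whole path in one call without failing if the directories already exist:

```shell
# Create the input directory tree in one call (-p: no error if it already exists).
hadoop fs -mkdir -p /input/ch01

# Upload the sample file to HDFS.
hadoop fs -put data/ch01/sample.txt /input/ch01/sample.txt

# Confirm the upload.
hadoop fs -ls /input/ch01
```

If you rerun the sequence, the -put step will fail on the existing file; remove it first with hadoop fs -rm /input/ch01/sample.txt.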
We now have the job JAR file ready, and the sample file is imported to HDFS. Point your terminal to the <SOURCE_CODE_BASE_DIR>/ch01/target directory and run the following command:
$ hadoop jar ch01-0.0.1-job.jar /input/ch01/sample.txt
Now you'll get the following output:
15/05/10 15:21:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/05/10 15:21:34 WARN mr.EsOutputFormat: Speculative execution enabled for reducer - consider disabling it to prevent data corruption
15/05/10 15:21:34 INFO util.Version: Elasticsearch Hadoop v2.0.2 [ca81ff6732]
15/05/10 15:21:34 INFO mr.EsOutputFormat: Writing to [eshadoop/wordcount]
15/05/10 15:21:35 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/05/10 15:21:41 INFO input.FileInputFormat: Total input paths to process : 1
15/05/10 15:21:42 INFO mapreduce.JobSubmitter: number of splits:1
15/05/10 15:21:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1431251282365_0002
15/05/10 15:21:42 INFO impl.YarnClientImpl: Submitted application application_1431251282365_0002
15/05/10 15:21:42 INFO mapreduce.Job: The url to track the job: http://eshadoop:8088/proxy/application_1431251282365_0002/
15/05/10 15:21:42 INFO mapreduce.Job: Running job: job_1431251282365_0002
15/05/10 15:21:54 INFO mapreduce.Job: Job job_1431251282365_0002 running in uber mode : false
15/05/10 15:21:54 INFO mapreduce.Job: map 0% reduce 0%
15/05/10 15:22:01 INFO mapreduce.Job: map 100% reduce 0%
15/05/10 15:22:09 INFO mapreduce.Job: map 100% reduce 100%
15/05/10 15:22:10 INFO mapreduce.Job: Job job_1431251282365_0002 completed successfully
…
Elasticsearch Hadoop Counters
  Bulk Retries=0
  Bulk Retries Total Time(ms)=0
  Bulk Total=1
  Bulk Total Time(ms)=48
  Bytes Accepted=9655
  Bytes Received=4000
  Bytes Retried=0
  Bytes Sent=9655
  Documents Accepted=232
  Documents Received=0
  Documents Retried=0
  Documents Sent=232
  Network Retries=0
  Network Total Time(ms)=84
  Node Retries=0
  Scroll Total=0
  Scroll Total Time(ms)=0
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
We just executed our first Hadoop MapReduce job that imports data into Elasticsearch. This MapReduce job simply emits a count for each word in the Mapper phase, and the Reducer then sums up all the counts for each word. We will dig into the details of how exactly this WordCount program is developed in the next chapter. The console output of the job displays useful log information that indicates the progress of the job execution. It also displays the ES-Hadoop counters, which provide handy information about the amount of data and the number of documents sent and received, the number of retries, the time taken, and so on. If you have used the sample.txt file provided in the source zip, you will see that the job found 232 unique words, all of which are pushed to Elasticsearch as documents. In the next section, we will examine these documents with the Elasticsearch Head and Marvel plugins that we have already installed in Elasticsearch. Note that you can also track the status of your ES-Hadoop MapReduce jobs, just like any other Hadoop job, in the job tracker. In our setup, you can access the job tracker at http://localhost:8088/cluster.
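The map-shuffle-reduce flow just described can be mimicked on the command line with standard Unix tools. This is only a conceptual illustration with a made-up input string, not the book's Java implementation (which the next chapter covers): tr plays the role of the Mapper (one word per line), sort plays the shuffle phase (identical words become adjacent), and uniq -c plays the Reducer (counting each run of identical words):

```shell
# "Map": split the input into one word per line.
# "Shuffle": sort so that identical words end up adjacent.
# "Reduce": count each run of identical words.
printf 'to be or not to be\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c
```

The real job performs the same aggregation, except that each resulting word and its count are written as a document to the eshadoop/wordcount index instead of to the terminal.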