Running the WordCount example


Now that our ES-Hadoop environment is tested and running, we are all set to run our first WordCount example. In the Hadoop world, WordCount has taken HelloWorld's place as the canonical first program, hasn't it?

Getting the examples and building the job JAR file

You can download the examples in this book from https://github.com/vishalbrevitaz/eshadoop/tree/master/ch01. Once you have the source code, you can build the JAR file for this chapter by following the steps in the readme file included in the source code zip. The build process should generate a ch01-0.0.1-job.jar file under the <SOURCE_CODE_BASE_DIR>/ch01/target directory.

Importing the test file to HDFS

For our WordCount example, you can use any text file of your choice. To walk through the example, we will use the sample.txt file that is part of the source zip. Perform the following steps (a programmatic alternative is sketched right after the list):

  1. First, let's create a directory structure in HDFS to organize our input files with the following command:

    $ hadoop fs -mkdir /input
    $ hadoop fs -mkdir /input/ch01
    
  2. Next, upload the sample.txt file to HDFS at the desired location by using the following command:

    $ hadoop fs -put data/ch01/sample.txt /input/ch01/sample.txt 
    
  3. Now, verify that the file was successfully imported into HDFS by using the following command:

    $ hadoop fs -ls /input/ch01
    

    Finally, when you execute the preceding command, it should show an output similar to the following:

    Found 1 items
    -rw-r--r--   1 eshadoop supergroup       2803 2015-05-10 15:18 /input/ch01/sample.txt 
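
The preceding steps use the hadoop fs shell, but the same import can also be performed programmatically. The following is a minimal sketch that uses the HDFS Java API; the class name is hypothetical, and it assumes that your Hadoop configuration is on the classpath so that fs.defaultFS points to your cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ImportSampleFile {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml and friends from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of the two hadoop fs -mkdir commands;
        // mkdirs() also creates missing parent directories.
        fs.mkdirs(new Path("/input/ch01"));

        // Equivalent of hadoop fs -put.
        fs.copyFromLocalFile(new Path("data/ch01/sample.txt"),
                new Path("/input/ch01/sample.txt"));

        // Equivalent of hadoop fs -ls /input/ch01.
        for (FileStatus status : fs.listStatus(new Path("/input/ch01"))) {
            System.out.println(status.getPath() + "  " + status.getLen());
        }
    }
}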
    

Running our first job

The job JAR file is ready, and the sample file has been imported to HDFS. Point your terminal to the <SOURCE_CODE_BASE_DIR>/ch01/target directory and run the following command:

$ hadoop jar ch01-0.0.1-job.jar /input/ch01/sample.txt  

Notice that we pass only the input path; the job writes its output to Elasticsearch rather than to an HDFS directory. You should see an output similar to the following:

15/05/10 15:21:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/05/10 15:21:34 WARN mr.EsOutputFormat: Speculative execution enabled for reducer - consider disabling it to prevent data corruption
15/05/10 15:21:34 INFO util.Version: Elasticsearch Hadoop v2.0.2 [ca81ff6732]
15/05/10 15:21:34 INFO mr.EsOutputFormat: Writing to [eshadoop/wordcount]
15/05/10 15:21:35 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/05/10 15:21:41 INFO input.FileInputFormat: Total input paths to process : 1
15/05/10 15:21:42 INFO mapreduce.JobSubmitter: number of splits:1
15/05/10 15:21:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1431251282365_0002
15/05/10 15:21:42 INFO impl.YarnClientImpl: Submitted application application_1431251282365_0002
15/05/10 15:21:42 INFO mapreduce.Job: The url to track the job: http://eshadoop:8088/proxy/application_1431251282365_0002/
15/05/10 15:21:42 INFO mapreduce.Job: Running job: job_1431251282365_0002
15/05/10 15:21:54 INFO mapreduce.Job: Job job_1431251282365_0002 running in uber mode : false
15/05/10 15:21:54 INFO mapreduce.Job:  map 0% reduce 0%
15/05/10 15:22:01 INFO mapreduce.Job:  map 100% reduce 0%
15/05/10 15:22:09 INFO mapreduce.Job:  map 100% reduce 100%
15/05/10 15:22:10 INFO mapreduce.Job: Job job_1431251282365_0002 completed successfully



  Elasticsearch Hadoop Counters
    Bulk Retries=0
    Bulk Retries Total Time(ms)=0
    Bulk Total=1
    Bulk Total Time(ms)=48
    Bytes Accepted=9655
    Bytes Received=4000
    Bytes Retried=0
    Bytes Sent=9655
    Documents Accepted=232
    Documents Received=0
    Documents Retried=0
    Documents Sent=232
    Network Retries=0
    Network Total Time(ms)=84
    Node Retries=0
    Scroll Total=0
    Scroll Total Time(ms)=0

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

We just executed our first Hadoop MapReduce job that uses ES-Hadoop to import data into Elasticsearch. The Mapper phase of this job simply emits each word with a count of 1 for every occurrence, and the Reducer then sums up these counts for each word. We will dig into the details of how exactly this WordCount program is developed in the next chapter.

The console output of the job displays useful log information that indicates the progress of the job execution. It also displays the ES-Hadoop counters, which provide some handy information about the amount of data and the number of documents being sent and received, the number of retries, the time taken, and so on. If you used the sample.txt file provided in the source zip, you will see that the job found 232 unique words, and all of them were pushed to Elasticsearch as documents. In the next section, we will examine these documents with the Elasticsearch Head and Marvel plugins that we already installed in Elasticsearch. Note that you can also track the status of your ES-Hadoop MapReduce jobs, just as you would for any other Hadoop job, in the YARN ResourceManager web UI (the successor to the classic job tracker). In our setup, you can access it at http://localhost:8088/cluster.
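
Before moving on, here is a minimal sketch of what such an ES-Hadoop WordCount job can look like, to make the preceding description concrete. Treat it as an illustration rather than the chapter's actual source code: the class names and the word/count document fields are assumptions of this sketch, whereas EsOutputFormat and the eshadoop/wordcount index/type come straight from the job output above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class WordCountSketch {

    // Mapper: emits (word, 1) for every token in its input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for a word and emits one MapWritable,
    // which EsOutputFormat serializes into a JSON document.
    public static class SumReducer
            extends Reducer<Text, IntWritable, NullWritable, MapWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            MapWritable doc = new MapWritable();
            // "word" and "count" are illustrative field names only.
            doc.put(new Text("word"), key);
            doc.put(new Text("count"), new IntWritable(sum));
            // EsOutputFormat ignores the output key.
            context.write(NullWritable.get(), doc);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Defaults to localhost:9200, which suits our single-node setup.
        conf.set("es.nodes", "localhost:9200");
        // The target <index>/<type>, as seen in the
        // "Writing to [eshadoop/wordcount]" log line above.
        conf.set("es.resource", "eshadoop/wordcount");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountSketch.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(MapWritable.class);
        // Send the reducer output to Elasticsearch instead of HDFS;
        // note that no HDFS output path is needed.
        job.setOutputFormatClass(EsOutputFormat.class);
        // A speculatively re-executed reducer could index duplicate
        // documents, so we follow the advice of the WARN line above.
        job.setSpeculativeExecution(false);

        // The single command-line argument: /input/ch01/sample.txt
        FileInputFormat.addInputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Two details of the console output show up in this sketch: the warning about the Tool interface appears because the job is submitted through a plain main() method, and the speculative execution WARN is ES-Hadoop reminding you that a re-executed reducer could push duplicate documents, which is why the sketch disables it.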