Elasticsearch for Hadoop

By : Vishal Shukla

Setting up Elasticsearch

In this section, we will download and configure the Elasticsearch server and install the Elasticsearch Head and Marvel plugins.

Downloading Elasticsearch

To download Elasticsearch, perform the following steps:

First, download Elasticsearch using the following command:

$ wget https://download.elastic.co/elasticsearch/elasticsearch/elasticsearch-1.7.1.tar.gz

Once the file is downloaded, extract it to /usr/local and rename the extracted directory to something more convenient, using the following commands:

$ sudo tar -xvzf elasticsearch-1.7.1.tar.gz -C /usr/local
$ sudo mv /usr/local/elasticsearch-1.7.1 /usr/local/elasticsearch

Then, set the eshadoop user as the owner of the directory as follows:
```
$ sudo chown -R eshadoop:hadoop /usr/local/elasticsearch
```

Configuring Elasticsearch

The Elasticsearch configuration file, elasticsearch.yml, is located in the config folder under the Elasticsearch home directory. Open the elasticsearch.yml file in the editor of your choice by using the following commands:

$ cd /usr/local/elasticsearch
$ vi config/elasticsearch.yml

Uncomment the line with the cluster.name key in the elasticsearch.yml file and change the cluster name, as shown in the following code:

cluster.name: eshadoopcluster

Similarly, uncomment the line with the node.name key and change the value as follows:

node.name: "ES Hadoop Node"
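The same two edits can also be scripted instead of made by hand in an editor. The following is a minimal sketch, not part of the original setup: set_key is a hypothetical helper, and it assumes that the stock file contains commented lines of the form `#cluster.name: ...` and `#node.name: ...`. The demo runs against a scratch file so that the real configuration is not touched.

```shell
#!/bin/sh
# set_key FILE KEY VALUE (hypothetical helper): uncomment or overwrite a
# top-level "KEY: ..." line. VALUE must not contain "|" or "&", because it is
# pasted into a sed replacement.
set_key() {
  file=$1 key=$2 value=$3
  tmp=$(mktemp)
  # Match the key with an optional leading "#" and spaces, then rewrite the line.
  sed "s|^#\{0,1\}[[:space:]]*${key}:.*|${key}: ${value}|" "$file" > "$tmp"
  # Copy the result back so the file keeps its owner and permissions.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Demo on a scratch file that mimics the assumed commented defaults.
demo=$(mktemp)
printf '%s\n' '#cluster.name: elasticsearch' '#node.name: "Franz Kafka"' > "$demo"
set_key "$demo" cluster.name eshadoopcluster
set_key "$demo" node.name '"ES Hadoop Node"'
cat "$demo"
# prints:
# cluster.name: eshadoopcluster
# node.name: "ES Hadoop Node"
```

To apply it to the real file, you would pass config/elasticsearch.yml as the first argument. Note the space after the colon in the output: YAML needs it to treat the line as a key and a value.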

Note

Elasticsearch comes with sensible defaults that let you start nodes without any additional configuration. In a production environment, and sometimes even in a development environment, you may still want to tweak some settings.

By default, Elasticsearch gives each node a name picked at random from a list of about 3,000 Marvel character names. The default cluster name is elasticsearch. With the default configuration, ES nodes that run on the same network and share the same cluster name join the same cluster, and Elasticsearch synchronizes the data between them. This may be unwanted if each developer expects an isolated ES server setup. It's always good to specify cluster.name and node.name explicitly to avoid surprises.

You can also change the defaults for the settings that start with path.*. To set up the directories that store the server data, locate the paths section, uncomment the following path settings, and change their values as shown:

########################### paths #############################
# Path to directory containing configuration (this file and logging.yml):
#
path.conf: /usr/local/elasticsearch/config

# Path to directory where to store index data allocated for this node.
# 
# Can optionally include more than one location, causing data to be striped across
# the locations (a la RAID 0) on a file level, favouring locations with most free
# space on creation. 
path.data: /usr/local/elasticsearch/data


# Path to temporary files:
#
path.work: /usr/local/elasticsearch/work

# Path to log files:
#
path.logs: /usr/local/elasticsearch/logs

Note

It's important to choose the location of path.data wisely. In production, you should make sure that this path is not inside the Elasticsearch installation directory, in order to avoid accidentally overwriting or deleting the data when upgrading Elasticsearch.
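A deployment script can guard against this mistake before the node is started. The following is a minimal sketch under stated assumptions: is_inside is a hypothetical helper, and /var/lib/elasticsearch is only an illustrative data location that does not appear in this setup.

```shell
#!/bin/sh
# is_inside CHILD PARENT (hypothetical helper): succeed when CHILD is PARENT
# itself or lies somewhere below it. Both paths are compared as plain strings,
# so pass absolute paths without symlinks or ".." components.
is_inside() {
  case "${1%/}/" in
    "${2%/}/"*) return 0 ;;
    *) return 1 ;;
  esac
}

es_home=/usr/local/elasticsearch

# The development layout from this section keeps the data under the install directory.
if is_inside "$es_home/data" "$es_home"; then
  echo "warning: path.data is inside $es_home"
fi

# An illustrative production location outside the install directory passes the check.
if ! is_inside /var/lib/elasticsearch "$es_home"; then
  echo "ok: /var/lib/elasticsearch is outside $es_home"
fi
```

The trailing slash appended to both paths keeps a sibling directory such as /usr/local/elasticsearch2 from being mistaken for a subdirectory.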

Installing Elasticsearch's Head plugin

Elasticsearch provides a plugin utility to install the Elasticsearch plugins. Execute the following command to install the Head plugin:

$ bin/plugin -install mobz/elasticsearch-head

-> Installing mobz/elasticsearch-head...
Trying https://github.com/mobz/elasticsearch-head/archive/master.zip...
Downloading ...DONE
Installed mobz/elasticsearch-head into /usr/local/elasticsearch/plugins/head
Identified as a _site plugin, moving to _site structure ...

As indicated by the console output, the plugin was installed in the default plugins directory under the Elasticsearch home. Once the server is running, you can access the Head plugin at http://localhost:9200/_plugin/head/.

Installing the Marvel plugin

Now, let's install the Marvel plugin using a similar command:

$ bin/plugin -i elasticsearch/marvel/latest

-> Installing elasticsearch/marvel/latest...
Trying http://download.elasticsearch.org/elasticsearch/marvel/marvel-latest.zip...
Downloading ...DONE
Installed elasticsearch/marvel/latest into /usr/local/elasticsearch/plugins/marvel
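Both installers report a target directory under plugins, so a quick sanity check is to confirm that those directories exist. The following is a minimal sketch: missing_plugins is a hypothetical helper, and the demo uses a scratch directory that mimics the layout after installing only Head.

```shell
#!/bin/sh
# missing_plugins ES_HOME NAME... (hypothetical helper): print every NAME that
# has no directory under ES_HOME/plugins.
missing_plugins() {
  es_home=$1
  shift
  for name in "$@"; do
    [ -d "$es_home/plugins/$name" ] || printf '%s\n' "$name"
  done
}

# Demo: a scratch directory in which only the Head plugin is present.
scratch=$(mktemp -d)
mkdir -p "$scratch/plugins/head"
missing_plugins "$scratch" head marvel   # prints: marvel
```

Against the installation from this section, the call would be missing_plugins /usr/local/elasticsearch head marvel, which prints nothing when both plugins are in place.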

Running and testing

Finally, start Elasticsearch using the following command:

$ ./bin/elasticsearch

Elasticsearch then prints a startup log similar to the following (the version and build values reflect the release that is installed):

[2015-05-13 21:59:37,344][INFO ][node                     ] [ES Hadoop Node] version[1.5.1], pid[3822], build[5e38401/2015-04-09T13:41:35Z]
[2015-05-13 21:59:37,346][INFO ][node                     ] [ES Hadoop Node] initializing ...
[2015-05-13 21:59:37,358][INFO ][plugins                  ] [ES Hadoop Node] loaded [marvel], sites [marvel, head]
[2015-05-13 21:59:39,956][INFO ][node                     ] [ES Hadoop Node] initialized
[2015-05-13 21:59:39,959][INFO ][node                     ] [ES Hadoop Node] starting ...
[2015-05-13 21:59:40,133][INFO ][transport                ] [ES Hadoop Node] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.0.2.15:9300]}
[2015-05-13 21:59:40,159][INFO ][discovery                ] [ES Hadoop Node] eshadoopcluster/_bzqXWbLSXKXWpafHaLyRA
[2015-05-13 21:59:43,941][INFO ][cluster.service          ] [ES Hadoop Node] new_master [ES Hadoop Node][_bzqXWbLSXKXWpafHaLyRA][eshadoop][inet[/10.0.2.15:9300]], reason: zen-disco-join (elected_as_master)
[2015-05-13 21:59:43,989][INFO ][http                     ] [ES Hadoop Node] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.0.2.15:9200]}
[2015-05-13 21:59:43,989][INFO ][node                     ] [ES Hadoop Node] started
[2015-05-13 21:59:44,026][INFO ][gateway                  ] [ES Hadoop Node] recovered [0] indices into cluster_state
[2015-05-13 22:00:00,707][INFO ][cluster.metadata         ] [ES Hadoop Node] [.marvel-2015.05.13] creating index, cause [auto(bulk api)], templates [marvel], shards [1]/[1], mappings [indices_stats, cluster_stats, node_stats, shard_event, node_event, index_event, index_stats, _default_, cluster_state, cluster_event, routing_event]
[2015-05-13 22:00:01,421][INFO ][cluster.metadata         ] [ES Hadoop Node] [.marvel-2015.05.13] update_mapping [node_stats] (dynamic)

The startup log gives useful hints about what is going on. By default, Elasticsearch uses ports 9200 to 9299 for HTTP and allocates the first available port in that range to the node. The log also shows that the node binds to port 9300. Elasticsearch uses the port range 9300 to 9399 for internal node-to-node communication and for communication with the Java client. To find other nodes in the cluster, it can use zen multicast or unicast ping discovery, with multicast as the default. We will learn more about these discovery modes in later chapters.
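Because the HTTP port is allocated from a range, a script cannot always assume 9200. One option is to read the published address from the startup log. The following is a minimal sketch that assumes the 1.x log format shown above; http_address is a hypothetical helper, and the sample line is copied from that log.

```shell
#!/bin/sh
# http_address (hypothetical helper): read a startup log on stdin and print the
# host:port published by the HTTP module. Transport lines are ignored because
# only the [http] logger is matched.
http_address() {
  sed -n 's|.*]\[http *].*publish_address {inet\[/\([^]]*\)]}.*|\1|p'
}

# Demo with the [http] line from the startup log in this section.
sample='[2015-05-13 21:59:43,989][INFO ][http                     ] [ES Hadoop Node] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.0.2.15:9200]}'
printf '%s\n' "$sample" | http_address   # prints: 10.0.2.15:9200
```

With a running node, the same function could be fed the server's log file and its output used to build the base URL for further requests.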
