Elasticsearch Server - Third Edition

By: Rafal Kuc

Overview of this book

Elasticsearch is a very fast and scalable open source search engine, designed with distribution and the cloud in mind, complete with all the goodies that Apache Lucene has to offer. Elasticsearch's schema-free architecture allows developers to index and search unstructured content, making it perfectly suited for both small projects and large big data warehouses, even those with petabytes of unstructured data. This book will guide you through the world of the most commonly used Elasticsearch server functionalities. You'll start off by getting an understanding of the basics of Elasticsearch and its data indexing functionality. Next, you will see the querying capabilities of Elasticsearch, followed by a thorough explanation of scoring and search relevance. After this, you will explore the aggregation and data analysis capabilities of Elasticsearch and will learn how cluster administration and scaling can be used to boost your application performance. You'll find out how to use the friendly REST APIs and how to tune Elasticsearch to make the most of it. By the end of this book, you will be able to create amazing search solutions as per your project's specifications.

Installing and configuring your cluster


Installing and running Elasticsearch, even in production environments, is very easy nowadays compared to how it was in the days of Elasticsearch 0.20.x. Only a few steps are needed to go from a bare system to one running Elasticsearch. We will explore these steps in the following sections:

Installing Java

Elasticsearch is a Java application, and to use it we need to make sure that the Java SE environment is installed properly. Elasticsearch requires Java Version 7 or later to run. You can download it from http://www.oracle.com/technetwork/java/javase/downloads/index.html. You can also use OpenJDK (http://openjdk.java.net/) if you wish. You can, of course, use Java Version 7, but it is not publicly supported by Oracle anymore, at least not without commercial support. For example, you can't expect new, patched versions of Java 7 to be released. Because of this, we strongly suggest that you install Java 8, especially given that Java 9 seems to be right around the corner, with general availability planned for September 2016.
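
Before moving on, it is worth verifying that a Java runtime is actually visible on the PATH. The following sketch prints the detected version, or a reminder to install one (the fallback message is ours, not something Elasticsearch emits):

```shell
#!/bin/sh
# Check whether a Java runtime is available before installing Elasticsearch.
if command -v java >/dev/null 2>&1; then
  # java -version prints to stderr, so redirect it to capture the first line
  java_info=$(java -version 2>&1 | head -n 1)
else
  java_info="Java not found - install a JDK before running Elasticsearch"
fi
echo "$java_info"
```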

Installing Elasticsearch

To install Elasticsearch, you just need to go to https://www.elastic.co/downloads/elasticsearch, choose the latest stable version of Elasticsearch, download it, and unpack it. That's it! The installation is complete.

Note

At the time of writing, we used a snapshot of Elasticsearch 2.2. This means that we've skipped describing some properties that were marked as deprecated and have been or will be removed in future versions of Elasticsearch.

The main interface for communicating with Elasticsearch is based on the HTTP protocol and REST. This means that you can even use a web browser for some basic queries and requests, but for anything more sophisticated you'll need to use additional software, such as the cURL command. If you use Linux or OS X, the cURL package should already be available. If you use Windows, you can download it from http://curl.haxx.se/download.html.

Running Elasticsearch

Let's run our first instance, the one we just downloaded as the ZIP archive and unpacked. Go to the bin directory and run the following command, depending on your OS:

  • Linux or OS X: ./elasticsearch

  • Windows: elasticsearch.bat

Congratulations! You now have your Elasticsearch instance up and running. During its work, the server usually uses two port numbers: the first for communication with the REST API using the HTTP protocol, and the second for the transport module, used for communication within a cluster and between the native Java client and the cluster. The default port used for the HTTP API is 9200, so we can check that the server is ready by pointing a web browser to http://127.0.0.1:9200/. The browser should show a code snippet similar to the following:

{
  "name" : "Blob",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.2.0",
    "build_hash" : "5b1dd1cf5a1957682d84228a569e124fedf8e325",
    "build_timestamp" : "2016-01-13T18:12:26Z",
    "build_snapshot" : true,
    "lucene_version" : "5.4.0"
  },
  "tagline" : "You Know, for Search"
}

The output is structured as a JavaScript Object Notation (JSON) object. If you are not familiar with JSON, please take a minute and read the article available at https://en.wikipedia.org/wiki/JSON.
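
Because the response is plain JSON, individual fields are easy to pick out with standard command-line tools. The sketch below stores the response shown above in a shell variable so it works without a live cluster; with a node running, you could substitute `curl -s http://127.0.0.1:9200/` for the variable:

```shell
#!/bin/sh
# Extract the version number from a saved copy of the root-endpoint response.
response='{
  "name" : "Blob",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.2.0",
    "lucene_version" : "5.4.0"
  },
  "tagline" : "You Know, for Search"
}'
# sed keeps only the quoted value that follows the "number" key
es_version=$(echo "$response" | sed -n 's/.*"number" : "\([^"]*\)".*/\1/p')
echo "Elasticsearch version: $es_version"
```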

Note

Elasticsearch is smart. If the default port is not available, the engine binds to the next free port. You can find information about this on the console during booting as follows:

[2016-01-13 20:04:49,953][INFO ][http] [Blob] publish_address {127.0.0.1:9201}, bound_addresses {[fe80::1]:9200}, {[::1]:9200}, {127.0.0.1:9201} 

Note the fragment with [http]. Elasticsearch uses a few ports for various tasks. The interface that we are using is handled by the HTTP module.

Now, we will use the cURL program to communicate with Elasticsearch. For example, to check the cluster health, we will use the following command:

curl -XGET http://127.0.0.1:9200/_cluster/health?pretty

The -X parameter is a definition of the HTTP request method. The default value is GET (so in this example, we can omit this parameter). For now, do not worry about the GET value; we will describe it in more detail later in this chapter.

By default, the API returns information as a JSON object in which new line characters are omitted. The pretty parameter added to our request forces Elasticsearch to add new line characters to the response, making it more user-friendly. You can try running the preceding query with and without the ?pretty parameter to see the difference.
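
If you want to see what the pretty parameter does without querying a cluster, you can reproduce the effect locally. The sketch below (assuming python3 is available on the machine) pretty-prints a compact JSON string, much as Elasticsearch reformats its responses server-side when ?pretty is passed:

```shell
#!/bin/sh
# Compare a compact JSON response with its pretty-printed form.
compact='{"cluster_name":"elasticsearch","status":"yellow","number_of_nodes":2}'
echo "--- without pretty ---"
echo "$compact"
echo "--- with pretty ---"
# python3 -m json.tool re-indents JSON read from stdin
pretty=$(echo "$compact" | python3 -m json.tool)
echo "$pretty"
```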

Elasticsearch is useful in small and medium-sized applications, but it has been built with large clusters in mind. So, now we will set up our big two-node cluster. Unpack the Elasticsearch archive in a different directory and run the second instance. If we look at the log, we will see the following:

[2016-01-13 20:07:58,561][INFO ][cluster.service          ] [Big Man] detected_master {Blob}{5QPh00RUQraeLHAInbR4Jw}{127.0.0.1}{127.0.0.1:9300}, added {{Blob}{5QPh00RUQraeLHAInbR4Jw}{127.0.0.1}{127.0.0.1:9300},}, reason: zen-disco-receive(from master [{Blob}{5QPh00RUQraeLHAInbR4Jw}{127.0.0.1}{127.0.0.1:9300}])

This means that our second instance (named Big Man) discovered the previously running instance (named Blob). Here, Elasticsearch automatically formed a new two-node cluster. Starting from Elasticsearch 2.0, this will only work with nodes running on the same physical machine—because Elasticsearch 2.0 no longer supports multicast. To allow your cluster to form, you need to inform Elasticsearch about the nodes that should be contacted initially using the discovery.zen.ping.unicast.hosts array in elasticsearch.yml. For example, like this:

discovery.zen.ping.unicast.hosts: ["192.168.2.1", "192.168.2.2"]

Shutting down Elasticsearch

Even though we expect our cluster (or node) to run flawlessly for a lifetime, we may need to restart it or shut it down properly (for example, for maintenance). The following are the two ways in which we can shut down Elasticsearch:

  • If your node is attached to the console, just press Ctrl + C

  • The second option is to kill the server process by sending the TERM signal (see the kill command on Linux boxes and Program Manager on Windows)

    Note

    The previous versions of Elasticsearch exposed a dedicated shutdown API but, in 2.0, this option has been removed because of security reasons.
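
The TERM-signal approach from the list above can be scripted. A minimal sketch, using a background sleep process as a stand-in for the Elasticsearch node (for a real node, you would first find its pid, for example with pgrep):

```shell
#!/bin/sh
# Graceful shutdown via the TERM signal, demonstrated on a stand-in process.
sleep 60 &
es_pid=$!
# Send TERM, which lets the process shut down cleanly
kill -TERM "$es_pid"
# Reap the terminated process; ignore its non-zero exit status
wait "$es_pid" 2>/dev/null || true
echo "sent TERM to pid $es_pid"
```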

The directory layout

Now, let's go to the newly created directory. We should see the following directory structure:

  • bin: The scripts needed to run Elasticsearch instances and for plugin management

  • config: The directory where the configuration files are located

  • lib: The libraries used by Elasticsearch

  • modules: The plugins bundled with Elasticsearch

After Elasticsearch starts, it will create the following directories (if they don't exist):

  • data: The directory used by Elasticsearch to store all the data

  • logs: The files with information about events and errors

  • plugins: The location used to store the installed plugins

  • work: The temporary files used by Elasticsearch

Configuring Elasticsearch

One of the reasons—of course, not the only one—why Elasticsearch is gaining more and more popularity is that getting started with Elasticsearch is quite easy. Because of the reasonable default values and automatic settings for simple environments, we can skip the configuration and go straight to indexing and querying (or to the next chapter of the book). We can do all this without changing a single line in our configuration files. However, in order to truly understand Elasticsearch, it is worth understanding some of the available settings.

We will now explore the default directories and the layout of the files provided with the Elasticsearch tar.gz archive. The entire configuration is located in the config directory. We can see two files here: elasticsearch.yml (or elasticsearch.json, which will be used if present) and logging.yml. The first file is responsible for setting the default configuration values for the server. This is important because some of these values can be changed at runtime and can be kept as a part of the cluster state, so the values in this file may not be accurate. The two values that we cannot change at runtime are cluster.name and node.name.

The cluster.name property is responsible for holding the name of our cluster. The cluster name separates different clusters from each other. Nodes configured with the same cluster name will try to form a cluster.

The second value is the instance name (the node.name property). We can leave this parameter undefined; in that case, Elasticsearch automatically chooses a unique name for itself. Note that this name is chosen during each startup, so it can be different on each restart. Defining the name can be helpful when referring to concrete instances via the API or when using monitoring tools to see what is happening to a node over long periods of time and between restarts. Think about giving descriptive names to your nodes.
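
Putting the two properties together, a minimal elasticsearch.yml fragment could look like the following (the names here are illustrative values of our own, not defaults):

```yaml
# Fragment of config/elasticsearch.yml - example values, not defaults
cluster.name: books-cluster
node.name: node-1
```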

Other parameters are commented well in the file, so we advise you to look through it; don't worry if you do not understand the explanation. We hope that everything will become clearer after reading the next few chapters.

Note

Remember that most of the parameters that have been set in the elasticsearch.yml file can be overwritten with the use of the Elasticsearch REST API. We will talk about this API in The update settings API section of Chapter 9, Elasticsearch Cluster in Detail.

The second file (logging.yml) defines how much information is written to the system logs, defines the log files, and sets up the periodic creation of new files. Changes in this file are usually required only when you need to adapt to monitoring or backup solutions or during system debugging; however, if you want more detailed logging, you need to adjust it accordingly.

Let's leave the configuration files for now and look at the base for all applications: the operating system. Tuning your operating system is one of the key points to ensure that your Elasticsearch instance works well. During indexing, especially with many shards and replicas, Elasticsearch creates many files; so, the system must not limit the number of open file descriptors to fewer than 32,000. For Linux servers, this can usually be changed in /etc/security/limits.conf, and the current value can be displayed using the ulimit command. If you end up reaching the limit, Elasticsearch will not be able to create new files; merging will fail, indexing may fail, and new indices will not be created.
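
You can check the current limit from the shell before touching any configuration. A short sketch (the limits.conf entries in the comments assume an elasticsearch system user and are illustrative):

```shell
#!/bin/sh
# Display the open-file-descriptor limit for this shell session.
# To raise it permanently on Linux, entries like these would go into
# /etc/security/limits.conf (the user name is an assumption):
#   elasticsearch  soft  nofile  65536
#   elasticsearch  hard  nofile  65536
soft_limit=$(ulimit -n)
echo "current open file limit: $soft_limit"
```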

Note

On Microsoft Windows platforms, the default limit is more than 16 million handles per process, which should be more than enough. You can read more about file handles on the Microsoft Windows platform at https://blogs.technet.microsoft.com/markrussinovich/2009/09/29/pushing-the-limits-of-windows-handles/.

The next set of settings is connected to the Java Virtual Machine (JVM) heap memory limit for a single Elasticsearch instance. For small deployments, the default memory limit (1,024 MB) will be sufficient, but for large ones it will not be enough. If you spot entries that indicate OutOfMemoryError exceptions in a log file, set the ES_HEAP_SIZE variable to a value greater than 1024. When choosing the right amount of memory to give to the JVM, remember that, in general, no more than 50 percent of your total system memory should be assigned to it. However, as with all rules, there are exceptions. We will discuss this in greater detail later, but you should always monitor your JVM heap usage and adjust it when needed.
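
For example, to start Elasticsearch from the unpacked archive with a 4 GB heap, you can export the variable first. A minimal sketch (the 4g value is an example, not a recommendation for every machine):

```shell
#!/bin/sh
# Set the JVM heap for Elasticsearch before launching it.
# ES_HEAP_SIZE is read by the Elasticsearch 2.x startup scripts and sets
# both the minimum and maximum heap; keep it at or below half of the RAM.
export ES_HEAP_SIZE=4g
echo "ES_HEAP_SIZE=$ES_HEAP_SIZE"
# ./bin/elasticsearch   # would now start with the 4 GB heap
```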

The system-specific installation and configuration

Although downloading an archive with Elasticsearch and unpacking it works and is convenient for testing, there are dedicated methods for Linux operating systems that give you several advantages in production deployments. In production deployments, the Elasticsearch service should start automatically at system boot; we should have dedicated start and stop scripts, unified paths, and so on. Elasticsearch provides installation packages for various Linux distributions that we can use. Let's see how this works.

Installing Elasticsearch on Linux

The other way to install Elasticsearch on a Linux operating system is to use packages such as RPM or DEB, depending on your Linux distribution and the supported package type. This way, the installation automatically adapts to the system directory layout; for example, the configuration and logs will go to their standard places in the /etc/ or /var/log directories. But this is not the only benefit. When using packages, Elasticsearch will also install startup scripts, making our life easier. What's more, we will be able to upgrade Elasticsearch easily by running a single command from the command line. Of course, the mentioned packages can be found at the same URL we mentioned previously when we talked about installing Elasticsearch from the zip or tar.gz archives: https://www.elastic.co/downloads/elasticsearch. Elasticsearch can also be installed from remote repositories via standard distribution tools such as apt-get or yum.

Note

Before installing Elasticsearch, make sure that you have a proper version of Java Virtual Machine installed.

Installing Elasticsearch using RPM packages

When using a Linux distribution that supports RPM packages, such as Fedora Linux (https://getfedora.org/), Elasticsearch installation is very easy. After downloading the RPM package, we just need to run the following command as root:

yum install elasticsearch-2.2.0.noarch.rpm

Alternatively, you can add the remote repository and install Elasticsearch from it. The first step is importing the GPG key (this command needs to be run as root as well):

rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch

This command adds the GPG key and allows the system to verify that the fetched package really comes from Elasticsearch developers. In the second step, we need to create the repository definition in the /etc/yum.repos.d/elasticsearch.repo file. We need to add the following entries to this file:

[elasticsearch-2.2]
name=Elasticsearch repository for 2.2.x packages
baseurl=http://packages.elastic.co/elasticsearch/2.x/centos
gpgcheck=1
gpgkey=http://packages.elastic.co/GPG-KEY-elasticsearch
enabled=1

Now it's time to install the Elasticsearch server, which is as simple as running the following command (again, don't forget to run it as root):

yum install elasticsearch

Elasticsearch will be automatically downloaded, verified, and installed.

Installing Elasticsearch using the DEB package

When using a Linux distribution that supports DEB packages (such as Debian), installing Elasticsearch is again very easy. After downloading the DEB package, all you need to do is run the following command:

sudo dpkg -i elasticsearch-2.2.0.deb

It is as simple as that. Another way, which is similar to what we did with RPM packages, is by creating a new packages source and installing Elasticsearch from the remote repository. The first step is to add the public GPG key used for package verification. We can do that using the following command:

wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

The second step is adding the DEB package location. We need to add the following line to the /etc/apt/sources.list file:

deb http://packages.elastic.co/elasticsearch/2.x/debian stable main

This defines the source for the Elasticsearch packages. The last step is updating the list of remote packages and installing Elasticsearch using the following command:

sudo apt-get update && sudo apt-get install elasticsearch

Elasticsearch configuration file localization

When using packages to install Elasticsearch, the configuration files are placed in slightly different directories than the default config directory. After the installation, the configuration files should be stored in the following locations:

  • /etc/sysconfig/elasticsearch or /etc/default/elasticsearch: A file with the configuration of the Elasticsearch process, such as the user to run it as, the directories for logs and data, and memory settings

  • /etc/elasticsearch/: A directory for the Elasticsearch configuration files, such as the elasticsearch.yml file
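
A quick way to confirm which layout you have is to look for the package-install configuration directory. A sketch that degrades gracefully when Elasticsearch was installed from an archive instead of a package:

```shell
#!/bin/sh
# List the package-layout configuration files if a DEB/RPM install is present.
if [ -d /etc/elasticsearch ]; then
  config_listing=$(ls /etc/elasticsearch/)
else
  config_listing="no package-based Elasticsearch installation found"
fi
echo "$config_listing"
```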

Configuring Elasticsearch as a system service on Linux

If everything goes well, you can run Elasticsearch using the following command:

/bin/systemctl start elasticsearch.service

If you want Elasticsearch to start automatically every time the operating system starts, you can set up Elasticsearch as a system service by running the following command:

/bin/systemctl enable elasticsearch.service

Elasticsearch as a system service on Windows

Installing Elasticsearch as a system service on Windows is also very easy. You just need to go to your Elasticsearch installation directory, then go to the bin subdirectory, and run the following command:

service.bat install

You'll be asked for permission to do so. If you allow the script to run, Elasticsearch will be installed as a Windows service.

If you would like to see all the commands exposed by the service.bat script file, just run the following command in the same directory as earlier:

service.bat

For example, to start Elasticsearch, we will just run the following command:

service.bat start