Elasticsearch Server - Third Edition

By: Rafal Kuc

Overview of this book

Elasticsearch is a very fast and scalable open source search engine, designed with distribution and the cloud in mind, complete with all the goodies that Apache Lucene has to offer. Elasticsearch's schema-free architecture allows developers to index and search unstructured content, making it perfectly suited for both small projects and large big data warehouses, even those with petabytes of unstructured data. This book will guide you through the world of the most commonly used Elasticsearch server functionalities. You'll start off by getting an understanding of the basics of Elasticsearch and its data indexing functionality. Next, you will see the querying capabilities of Elasticsearch, followed by a thorough explanation of scoring and search relevance. After this, you will explore the aggregation and data analysis capabilities of Elasticsearch and will learn how cluster administration and scaling can be used to boost your application performance. You'll find out how to use the friendly REST APIs and how to tune Elasticsearch to make the most of it. By the end of this book, you will be able to create amazing search solutions as per your project's specifications.

Installing and configuring your cluster


Installing and running Elasticsearch, even in production environments, is very easy nowadays compared to how it was in the days of Elasticsearch 0.20.x. Only a few steps are needed to go from a bare system to one running Elasticsearch. We will explore these steps in the following sections:

Installing Java

Elasticsearch is a Java application, and to use it we need to make sure that the Java SE environment is installed properly. Elasticsearch requires Java Version 7 or later to run. You can download it from http://www.oracle.com/technetwork/java/javase/downloads/index.html. You can also use OpenJDK (http://openjdk.java.net/) if you wish. You can, of course, use Java Version 7, but it is not publicly supported by Oracle anymore, at least not without commercial support. For example, you can't expect new, patched versions of Java 7 to be released. Because of this, we strongly suggest that you install Java 8, especially given that Java 9 seems to be right around the corner, with general availability planned for September 2016.
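
Before moving on, it is worth verifying that a Java runtime is actually visible on the PATH. The following sketch prints the detected version, or a reminder to install one (the fallback message is ours, not something Elasticsearch emits):

```shell
#!/bin/sh
# Check whether a Java runtime is available before installing Elasticsearch.
if command -v java >/dev/null 2>&1; then
  # java -version prints to stderr, so redirect it to capture the first line
  java_info=$(java -version 2>&1 | head -n 1)
else
  java_info="Java not found - install a JDK before running Elasticsearch"
fi
echo "$java_info"
```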

Installing Elasticsearch

To install Elasticsearch, you just need to go to https://www.elastic.co/downloads/elasticsearch, choose the latest stable version of Elasticsearch, download it, and unpack it. That's it! The installation is complete.

Note

At the time of writing, we used a snapshot of Elasticsearch 2.2. This means that we've skipped describing some properties that were marked as deprecated and have been or will be removed in future versions of Elasticsearch.

The main interface for communicating with Elasticsearch is based on the HTTP protocol and REST. This means that you can even use a web browser for some basic queries and requests, but for anything more sophisticated you'll need to use additional software, such as the cURL command. If you use Linux or OS X, the cURL package should already be available. If you use Windows, you can download it from http://curl.haxx.se/download.html.

Running Elasticsearch

Let's run our first instance, the one we just downloaded as the ZIP archive and unpacked. Go to the bin directory and run the following command, depending on your OS:

  • Linux or OS X: ./elasticsearch

  • Windows: elasticsearch.bat

Congratulations! You now have your Elasticsearch instance up and running. During its work, the server usually uses two port numbers: the first for communication with the REST API using the HTTP protocol, and the second for the transport module, used for communication within a cluster and between the native Java client and the cluster. The default port used for the HTTP API is 9200, so we can check that the server is ready by pointing a web browser to http://127.0.0.1:9200/. The browser should show a code snippet similar to the following:

{
  "name" : "Blob",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.2.0",
    "build_hash" : "5b1dd1cf5a1957682d84228a569e124fedf8e325",
    "build_timestamp" : "2016-01-13T18:12:26Z",
    "build_snapshot" : true,
    "lucene_version" : "5.4.0"
  },
  "tagline" : "You Know, for Search"
}

The output is structured as a JavaScript Object Notation (JSON) object. If you are not familiar with JSON, please take a minute and read the article available at https://en.wikipedia.org/wiki/JSON.
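
Because the response is plain JSON, individual fields are easy to pick out with standard command-line tools. The sketch below stores the response shown above in a shell variable so it works without a live cluster; with a node running, you could substitute `curl -s http://127.0.0.1:9200/` for the variable:

```shell
#!/bin/sh
# Extract the version number from a saved copy of the root-endpoint response.
response='{
  "name" : "Blob",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "2.2.0",
    "lucene_version" : "5.4.0"
  },
  "tagline" : "You Know, for Search"
}'
# sed keeps only the quoted value that follows the "number" key
es_version=$(echo "$response" | sed -n 's/.*"number" : "\([^"]*\)".*/\1/p')
echo "Elasticsearch version: $es_version"
```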

Note

Elasticsearch is smart. If the default port is not available, the engine binds to the next free port. You can find information about this on the console during booting as follows:

[2016-01-13 20:04:49,953][INFO ][http] [Blob] publish_address {127.0.0.1:9201}, bound_addresses {[fe80::1]:9200}, {[::1]:9200}, {127.0.0.1:9201} 

Note the fragment with [http]. Elasticsearch uses a few ports for various tasks. The interface that we are using is handled by the HTTP module.

Now, we will use the cURL program to communicate with Elasticsearch. For example, to check the cluster health, we will use the following command:

curl -XGET http://127.0.0.1:9200/_cluster/health?pretty

The -X parameter is a definition of the HTTP request method. The default value is GET (so in this example, we can omit this parameter). For now, do not worry about the GET value; we will describe it in more detail later in this chapter.

By default, the API returns information as a JSON object in which new line characters are omitted. The pretty parameter added to our request forces Elasticsearch to add new line characters to the response, making it more user-friendly. You can try running the preceding query with and without the ?pretty parameter to see the difference.
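
If you want to see what the pretty parameter does without querying a cluster, you can reproduce the effect locally. The sketch below (assuming python3 is available on the machine) pretty-prints a compact JSON string, much as Elasticsearch reformats its responses server-side when ?pretty is passed:

```shell
#!/bin/sh
# Compare a compact JSON response with its pretty-printed form.
compact='{"cluster_name":"elasticsearch","status":"yellow","number_of_nodes":2}'
echo "--- without pretty ---"
echo "$compact"
echo "--- with pretty ---"
# python3 -m json.tool re-indents JSON read from stdin
pretty=$(echo "$compact" | python3 -m json.tool)
echo "$pretty"
```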

Elasticsearch is useful in small and medium-sized applications, but it has been built with large clusters in mind. So, now we will set up our big two-node cluster. Unpack the Elasticsearch archive in a different directory and run the second instance. If we look at the log, we will see the following:

[2016-01-13 20:07:58,561][INFO ][cluster.service          ] [Big Man] detected_master {Blob}{5QPh00RUQraeLHAInbR4Jw}{127.0.0.1}{127.0.0.1:9300}, added {{Blob}{5QPh00RUQraeLHAInbR4Jw}{127.0.0.1}{127.0.0.1:9300},}, reason: zen-disco-receive(from master [{Blob}{5QPh00RUQraeLHAInbR4Jw}{127.0.0.1}{127.0.0.1:9300}])

This means that our second instance (named Big Man) discovered the previously running instance (named Blob). Here, Elasticsearch automatically formed a new two-node cluster. Starting from Elasticsearch 2.0, this will only work with nodes running on the same physical machine—because Elasticsearch 2.0 no longer supports multicast. To allow your cluster to form, you need to inform Elasticsearch about the nodes that should be contacted initially using the discovery.zen.ping.unicast.hosts array in elasticsearch.yml. For example, like this:

discovery.zen.ping.unicast.hosts: ["192.168.2.1", "192.168.2.2"]

Shutting down Elasticsearch

Even though we expect our cluster (or node) to run flawlessly for a lifetime, we may need to restart it or shut it down properly (for example, for maintenance). The following are the two ways in which we can shut down Elasticsearch:

  • If your node is attached to the console, just press Ctrl + C

  • The second option is to kill the server process by sending the TERM signal (see the kill command on Linux boxes and Program Manager on Windows)

    Note

    The previous versions of Elasticsearch exposed a dedicated shutdown API but, in 2.0, this option has been removed because of security reasons.
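
The TERM-signal approach from the list above can be scripted. A minimal sketch, using a background sleep process as a stand-in for the Elasticsearch node (for a real node, you would first find its pid, for example with pgrep):

```shell
#!/bin/sh
# Graceful shutdown via the TERM signal, demonstrated on a stand-in process.
sleep 60 &
es_pid=$!
# Send TERM, which lets the process shut down cleanly
kill -TERM "$es_pid"
# Reap the terminated process; ignore its non-zero exit status
wait "$es_pid" 2>/dev/null || true
echo "sent TERM to pid $es_pid"
```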

The directory layout

Now, let's go to the newly created directory. We should see the following directory structure:

  • bin: The scripts needed to run Elasticsearch instances and for plugin management

  • config: The directory where the configuration files are located

  • lib: The libraries used by Elasticsearch

  • modules: The plugins bundled with Elasticsearch

After Elasticsearch starts, it will create the following directories (if they don't exist):

  • data: The directory used by Elasticsearch to store all the data

  • logs: The files with information about events and errors

  • plugins: The location used to store the installed plugins

  • work: The temporary files used by Elasticsearch

Configuring Elasticsearch

One of the reasons—of course, not the only one—why Elasticsearch is gaining more and more popularity is that getting started with Elasticsearch is quite easy. Because of the reasonable default values and automatic settings for simple environments, we can skip the configuration and go straight to indexing and querying (or to the next chapter of the book). We can do all this without changing a single line in our configuration files. However, in order to truly understand Elasticsearch, it is worth understanding some of the available settings.

We will now explore the default directories and the layout of the files provided with the Elasticsearch tar.gz archive. The entire configuration is located in the config directory. We can see two files here: elasticsearch.yml (or elasticsearch.json, which will be used if present) and logging.yml. The first file is responsible for setting the default configuration values for the server. This is important because some of these values can be changed at runtime and can be kept as a part of the cluster state, so the values in this file may not be accurate. The two values that we cannot change at runtime are cluster.name and node.name.

The cluster.name property is responsible for holding the name of our cluster. The cluster name separates different clusters from each other. Nodes configured with the same cluster name will try to form a cluster.

The second value is the instance name (the node.name property). We can leave this parameter undefined; in that case, Elasticsearch automatically chooses a unique name for itself. Note that this name is chosen during each startup, so it can be different on each restart. Defining the name can be helpful when referring to concrete instances via the API or when using monitoring tools to see what is happening to a node over long periods of time and between restarts. Think about giving descriptive names to your nodes.
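
Putting the two properties together, a minimal elasticsearch.yml fragment could look like the following (the names here are illustrative values of our own, not defaults):

```yaml
# Fragment of config/elasticsearch.yml - example values, not defaults
cluster.name: books-cluster
node.name: node-1
```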

Other parameters are commented well in the file, so we advise you to look through it; don't worry if you do not understand the explanation. We hope that everything will become clearer after reading the next few chapters.

Note

Remember that most of the parameters that have been set in the elasticsearch.yml file can be overwritten with the use of the Elasticsearch REST API. We will talk about this API in The update settings API section of Chapter 9, Elasticsearch Cluster in Detail.

The second file (logging.yml) defines how much information is written to the system logs, defines the log files, and sets up the periodic creation of new files. Changes in this file are usually required only when you need to adapt to monitoring or backup solutions or during system debugging; however, if you want more detailed logging, you need to adjust it accordingly.

Let's leave the configuration files for now and look at the base for all applications: the operating system. Tuning your operating system is one of the key points to ensure that your Elasticsearch instance works well. During indexing, especially with many shards and replicas, Elasticsearch creates many files; so, the system must not limit the number of open file descriptors to fewer than 32,000. For Linux servers, this can usually be changed in /etc/security/limits.conf, and the current value can be displayed using the ulimit command. If you end up reaching the limit, Elasticsearch will not be able to create new files; merging will fail, indexing may fail, and new indices will not be created.
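
You can check the current limit from the shell before touching any configuration. A short sketch (the limits.conf entries in the comments assume an elasticsearch system user and are illustrative):

```shell
#!/bin/sh
# Display the open-file-descriptor limit for this shell session.
# To raise it permanently on Linux, entries like these would go into
# /etc/security/limits.conf (the user name is an assumption):
#   elasticsearch  soft  nofile  65536
#   elasticsearch  hard  nofile  65536
soft_limit=$(ulimit -n)
echo "current open file limit: $soft_limit"
```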

Note

On Microsoft Windows platforms, the default limit is more than 16 million handles per process, which should be more than enough. You can read more about file handles on the Microsoft Windows platform at https://blogs.technet.microsoft.com/markrussinovich/2009/09/29/pushing-the-limits-of-windows-handles/.

The next set of settings is connected to the Java Virtual Machine (JVM) heap memory limit for a single Elasticsearch instance. For small deployments, the default memory limit (1,024 MB) will be sufficient, but for large ones it will not be enough. If you spot entries that indicate OutOfMemoryError exceptions in a log file, set the ES_HEAP_SIZE variable to a value greater than 1024. When choosing the right amount of memory to give to the JVM, remember that, in general, no more than 50 percent of your total system memory should be assigned to it. However, as with all rules, there are exceptions. We will discuss this in greater detail later, but you should always monitor your JVM heap usage and adjust it when needed.
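
For example, to start Elasticsearch from the unpacked archive with a 4 GB heap, you can export the variable first. A minimal sketch (the 4g value is an example, not a recommendation for every machine):

```shell
#!/bin/sh
# Set the JVM heap for Elasticsearch before launching it.
# ES_HEAP_SIZE is read by the Elasticsearch 2.x startup scripts and sets
# both the minimum and maximum heap; keep it at or below half of the RAM.
export ES_HEAP_SIZE=4g
echo "ES_HEAP_SIZE=$ES_HEAP_SIZE"
# ./bin/elasticsearch   # would now start with the 4 GB heap
```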

The system-specific installation and configuration

Although downloading an archive with Elasticsearch and unpacking it works and is convenient for testing, there are dedicated methods for Linux operating systems that give you several advantages in production deployments. In production deployments, the Elasticsearch service should start automatically at system boot; we should have dedicated start and stop scripts, unified paths, and so on. Elasticsearch provides installation packages for various Linux distributions that we can use. Let's see how this works.

Installing Elasticsearch on Linux

The other way to install Elasticsearch on a Linux operating system is to use packages such as RPM or DEB, depending on your Linux distribution and the supported package type. This way, the installation automatically adapts to the system directory layout; for example, the configuration and logs will go to their standard places in the /etc/ or /var/log directories. But this is not the only benefit. When using packages, Elasticsearch will also install startup scripts, making our life easier. What's more, we will be able to upgrade Elasticsearch easily by running a single command from the command line. Of course, the mentioned packages can be found at the same URL we mentioned previously when we talked about installing Elasticsearch from the zip or tar.gz archives: https://www.elastic.co/downloads/elasticsearch. Elasticsearch can also be installed from remote repositories via standard distribution tools such as apt-get or yum.

Note

Before installing Elasticsearch, make sure that you have a proper version of Java Virtual Machine installed.

Installing Elasticsearch using RPM packages

When using a Linux distribution that supports RPM packages, such as Fedora Linux (https://getfedora.org/), Elasticsearch installation is very easy. After downloading the RPM package, we just need to run the following command as root:

yum install elasticsearch-2.2.0.noarch.rpm

Alternatively, you can add the remote repository and install Elasticsearch from it. The first step is importing the GPG key (this command needs to be run as root as well):

rpm --import https://packages.elastic.co/GPG-KEY-elasticsearch

This command adds the GPG key and allows the system to verify that the fetched package really comes from Elasticsearch developers. In the second step, we need to create the repository definition in the /etc/yum.repos.d/elasticsearch.repo file. We need to add the following entries to this file:

[elasticsearch-2.2]
name=Elasticsearch repository for 2.2.x packages
baseurl=http://packages.elastic.co/elasticsearch/2.x/centos
gpgcheck=1
gpgkey=http://packages.elastic.co/GPG-KEY-elasticsearch
enabled=1

Now it's time to install the Elasticsearch server, which is as simple as running the following command (again, don't forget to run it as root):

yum install elasticsearch

Elasticsearch will be automatically downloaded, verified, and installed.

Installing Elasticsearch using the DEB package

When using a Linux distribution that supports DEB packages (such as Debian), installing Elasticsearch is again very easy. After downloading the DEB package, all you need to do is run the following command:

sudo dpkg -i elasticsearch-2.2.0.deb

It is as simple as that. Another way, which is similar to what we did with RPM packages, is by creating a new packages source and installing Elasticsearch from the remote repository. The first step is to add the public GPG key used for package verification. We can do that using the following command:

wget -qO - https://packages.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -

The second step is adding the DEB package location. We need to add the following line to the /etc/apt/sources.list file:

deb http://packages.elastic.co/elasticsearch/2.x/debian stable main

This defines the source for the Elasticsearch packages. The last step is updating the list of remote packages and installing Elasticsearch using the following command:

sudo apt-get update && sudo apt-get install elasticsearch

Elasticsearch configuration file localization

When using packages to install Elasticsearch, the configuration files are placed in slightly different directories than the default config directory. After the installation, the configuration files should be stored in the following locations:

  • /etc/sysconfig/elasticsearch or /etc/default/elasticsearch: A file with the configuration of the Elasticsearch process, such as the user to run it as, the directories for logs and data, and memory settings

  • /etc/elasticsearch/: A directory for the Elasticsearch configuration files, such as the elasticsearch.yml file
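
A quick way to confirm which layout you have is to look for the package-install configuration directory. A sketch that degrades gracefully when Elasticsearch was installed from an archive instead of a package:

```shell
#!/bin/sh
# List the package-layout configuration files if a DEB/RPM install is present.
if [ -d /etc/elasticsearch ]; then
  config_listing=$(ls /etc/elasticsearch/)
else
  config_listing="no package-based Elasticsearch installation found"
fi
echo "$config_listing"
```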

Configuring Elasticsearch as a system service on Linux

If everything goes well, you can run Elasticsearch using the following command:

/bin/systemctl start elasticsearch.service

If you want Elasticsearch to start automatically every time the operating system starts, you can set up Elasticsearch as a system service by running the following command:

/bin/systemctl enable elasticsearch.service

Elasticsearch as a system service on Windows

Installing Elasticsearch as a system service on Windows is also very easy. You just need to go to your Elasticsearch installation directory, then go to the bin subdirectory, and run the following command:

service.bat install

You'll be asked for permission to do so. If you allow the script to run, Elasticsearch will be installed as a Windows service.

If you would like to see all the commands exposed by the service.bat script file, just run the following command in the same directory as earlier:

service.bat

For example, to start Elasticsearch, we will just run the following command:

service.bat start