Advanced Elasticsearch 7.0

By: Wai Tak Wong

Overview of this book

Building enterprise-grade distributed applications and executing systematic search operations call for a strong understanding of Elasticsearch and expertise in using its core APIs and latest features. This book will help you master the advanced functionalities of Elasticsearch and understand how you can develop a sophisticated, real-time search engine confidently. In addition to this, you'll also learn to run machine learning jobs in Elasticsearch to speed up routine tasks. You'll get started by learning to use Elasticsearch features on Hadoop and Spark and make search results faster, thereby improving the speed of query results and enhancing the customer experience. You'll then get up to speed with performing analytics by building a metrics pipeline, defining queries, and using Kibana for intuitive visualizations that help provide decision-makers with better insights. The book will later guide you through using Logstash with examples to collect, parse, and enrich logs before indexing them in Elasticsearch. By the end of this book, you will have comprehensive knowledge of advanced topics such as Apache Spark support, machine learning using Elasticsearch and scikit-learn, and real-time analytics, along with the expertise you need to increase business productivity, perform analytics, and get the very best out of Elasticsearch.
Table of Contents (25 chapters)
  • Section 1: Fundamentals and Core APIs
  • Section 2: Data Modeling, Aggregations Framework, Pipeline, and Data Analytics
  • Section 3: Programming with the Elasticsearch Client
  • Section 4: Elastic Stack
  • Section 5: Advanced Features

Running Elasticsearch

Elasticsearch does not start automatically after installation. On Windows, to start it automatically at boot time, you can install Elasticsearch as a service. On Ubuntu, it's best to use the Debian package, which installs everything you need to configure Elasticsearch as a service. If you're interested, please refer to the official website (https://www.elastic.co/guide/en/elasticsearch/reference/master/deb.html).

Basic Elasticsearch configuration

Elasticsearch 7.0 has several configuration files located in the config directory, which you can list with the following command. Elasticsearch ships with good defaults and requires very little configuration from developers:

ls config

The output will be similar to the following:

elasticsearch.keystore  elasticsearch.yml  jvm.options  log4j2.properties  role_mapping.yml  roles.yml  users  users_roles

Let's take a quick look at the elasticsearch.yml, jvm.options, and log4j2.properties files:

  • elasticsearch.yml: The main configuration file. It contains settings for the cluster, nodes, and paths. To specify an item, uncomment the line and set its value. We'll explain the terminology in the Elasticsearch architectural overview section:
# -------------------------- Cluster ---------------------------
# Use a descriptive name for your cluster:
#cluster.name: my-application
# -------------------------- Node ------------------------------
# Use a descriptive name for the node:
#node.name: node-1
# -------------------------- Network ---------------------------
# Set the bind address to a specific IP (IPv4 or IPv6):
#network.host: 192.168.0.1
# Set a custom port for HTTP:
#http.port: 9200
# --------------------------- Paths ----------------------------
# Path to directory where to store the data (separate multiple
# locations by comma):
#path.data: /path/to/data
# Path to log files:
#path.logs: /path/to/logs
  • jvm.options: Recalling that Elasticsearch is developed in Java, this file is the preferred place to set the JVM options, as shown in the following code block:
## IMPORTANT: JVM heap size
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms1g
-Xmx1g
You rarely need to change the Java Virtual Machine (JVM) options until the Elasticsearch server is moved to production, where these settings can be tuned to improve performance. When configuring heap memory, set Xms and Xmx to the same value, and keep in mind that Xmx should be 32 GB at most and no more than 50% of the available RAM.
  • log4j2.properties: Elasticsearch uses Log4j 2 for logging. It exposes three properties, ${sys:es.logs.base_path}, ${sys:es.logs.cluster_name}, and ${sys:es.logs.node_name}, that can be referenced in the log4j2.properties file to build the log file location. The following code block uses the first two:
appender.rolling.fileName = ${sys:es.logs.base_path}${sys:file.separator}${sys:es.logs.cluster_name}.log

For example, suppose our installation directory is ~/elasticsearch-7.0.0. Since no base path is specified, the default value of ~/elasticsearch-7.0.0/logs is used. Since no cluster name is specified, the default value of elasticsearch is used. The log file location setting, appender.rolling.fileName, will therefore generate a log file named ~/elasticsearch-7.0.0/logs/elasticsearch.log.
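To make the heap sizing rule and the log file location above concrete, here is a minimal shell sketch. The helper names suggest_heap_mb and log_file_path are hypothetical and not part of Elasticsearch. The 31 GB cap is an assumption chosen to stay safely under the 32 GB ceiling, and the "/" separator assumes a Linux or macOS host.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are not shipped with Elasticsearch.

# suggest_heap_mb TOTAL_RAM_MB
# Prints a heap size in MB: half of the available RAM, capped at 31 GB.
# The 31 GB cap is an assumption that keeps the value under the 32 GB ceiling.
suggest_heap_mb() {
  total_mb=$1
  cap_mb=$((31 * 1024))
  half_mb=$((total_mb / 2))
  if [ "$half_mb" -gt "$cap_mb" ]; then
    echo "$cap_mb"
  else
    echo "$half_mb"
  fi
}

# log_file_path BASE_PATH CLUSTER_NAME
# Joins the two values the same way the fileName pattern does,
# assuming "/" as the file separator.
log_file_path() {
  printf '%s/%s.log\n' "$1" "$2"
}

# Example: a machine with 16 GB (16384 MB) of RAM.
heap=$(suggest_heap_mb 16384)
echo "-Xms${heap}m"
echo "-Xmx${heap}m"

# Example: the default base path and cluster name discussed above.
log_file_path "$HOME/elasticsearch-7.0.0/logs" elasticsearch
```

For the 16 GB example, the script prints -Xms8192m and -Xmx8192m, which could be placed in jvm.options in place of the 1 GB defaults. Treat the result as a starting point rather than a tuned value.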

Important system configuration

Elasticsearch has two working modes: development mode and production mode. A fresh installation runs in development mode. If you reconfigure a network setting such as network.host, Elasticsearch assumes that you are moving to production and switches to production mode. In production mode, several system settings must be configured correctly; see the Elasticsearch Reference at https://www.elastic.co/guide/en/elasticsearch/reference/master/system-config.html for the full list. We will discuss the file descriptor and virtual memory settings as follows:

  • File descriptors: Elasticsearch uses a large number of file descriptors. Running out of file descriptors can result in data loss. Use the ulimit command to set the maximum number of open files for the current session or in a runtime script file:
ulimit -n 65536

If you want to set the value permanently, add the following line to the /etc/security/limits.conf file:

elasticsearch - nofile 65536

Ubuntu ignores the limits.conf file for processes started by init.d. To enable the limits.conf file, edit /etc/pam.d/su and uncomment the following line:

# Sets up user limits according to /etc/security/limits.conf
# (Replaces the use of /etc/limits in old login)
#session required pam_limits.so
  • Virtual memory: By default, Elasticsearch uses an mmapfs directory to store its indices. However, the default operating system limit on mmap counts is likely to be too low. If the current value is below 262144, increase the limit to 262144 or higher:
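# Raise the limit for the running kernel (the change is lost on reboot)
sudo sysctl -w vm.max_map_count=262144
# Reload /etc/sysctl.conf; add vm.max_map_count=262144 to that file to keep the value across reboots
sudo sysctl -p
# Verify the current value
cat /proc/sys/vm/max_map_count
262144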
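The two limits above can be checked together before starting the server. The following is a minimal sketch of such a pre-flight check. The function names meets_minimum and report are hypothetical, and the thresholds are simply the values used in this section. Note that ulimit -n reports the limit of the current shell session, which is only the limit Elasticsearch will see if the server is started from that same session.

```shell
#!/bin/sh
# Hypothetical pre-flight check; the function names are illustrative only.

# meets_minimum ACTUAL REQUIRED
# Succeeds when ACTUAL is "unlimited" or a number >= REQUIRED.
meets_minimum() {
  if [ "$1" = "unlimited" ]; then
    return 0
  fi
  [ "$1" -ge "$2" ]
}

# report NAME ACTUAL REQUIRED
# Prints one status line for the given setting.
report() {
  if meets_minimum "$2" "$3"; then
    echo "OK   $1=$2 (need >= $3)"
  else
    echo "LOW  $1=$2 (need >= $3)"
  fi
}

# Open file limit for the current shell session.
report "nofile" "$(ulimit -n)" 65536

# mmap count limit; /proc/sys/vm/max_map_count only exists on Linux.
if [ -r /proc/sys/vm/max_map_count ]; then
  report "vm.max_map_count" "$(cat /proc/sys/vm/max_map_count)" 262144
fi
```

A LOW line indicates which of the two settings still needs to be raised using the commands shown in this section.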

By default, the Elasticsearch security features are disabled for open source downloads and under the basic license. Since Elasticsearch binds only to localhost by default, it is safe to run the installed server as a local development server. Any changes to the settings described in this section take effect only after the Elasticsearch server instance has been restarted. In the next section, we will discuss several ways to communicate with Elasticsearch.