Elasticsearch 5.x Cookbook - Third Edition

By: Alberto Paro

Overview of this book

Elasticsearch is a Lucene-based distributed search server that allows users to index and search unstructured content with petabytes of data. This book is your one-stop guide to master the complete Elasticsearch ecosystem. We’ll guide you through comprehensive recipes on what’s new in Elasticsearch 5.x, showing you how to create complex queries and analytics, and perform index mapping, aggregation, and scripting. Further on, you will explore the modules of Cluster and Node monitoring and see ways to back up and restore a snapshot of an index. You will understand how to install Kibana to monitor a cluster and also to extend Kibana for plugins. Finally, you will also see how you can integrate your Java, Scala, Python, and Big Data applications such as Apache Spark and Pig with Elasticsearch, and add enhanced functionalities with custom plugins. By the end of this book, you will have an in-depth knowledge of the implementation of the Elasticsearch architecture and will be able to manage data efficiently and effectively with Elasticsearch.

Understanding node and cluster


Every instance of Elasticsearch is called a node. Several nodes are grouped in a cluster. This is the basis of the cloud nature of Elasticsearch.

Getting ready

To better understand the following sections, knowledge of basic concepts such as application node and cluster is required.

How it works...

One or more Elasticsearch nodes can be set up on a physical or virtual server, depending on the available resources such as RAM, CPUs, and disk space.

By default, a node can store data and process requests and responses. (In Chapter 2, Downloading and Setup, we will see in detail how to set up different nodes and cluster topologies.)

When a node is started, several actions take place during its startup, such as:

  • Configuration is read from the environment variables and from the elasticsearch.yml configuration file

  • A node name is set in the configuration file or chosen from a list of built-in random names

  • Internally, the Elasticsearch engine initializes all the modules and plugins that are available in the current installation

After node startup, the node searches for other cluster members and checks its index and shard status.

To join two or more nodes in a cluster, these rules must be matched:

  • The version of Elasticsearch must be the same (2.3, 5.0, and so on), otherwise the join is rejected

  • The cluster name must be the same

  • The network must be configured so that the nodes can discover and communicate with each other (Elasticsearch 5.x uses unicast discovery by default; refer to the How to setup networking recipe in Chapter 2, Downloading and Setup)
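The join rules above can be modelled with a minimal Python sketch. The NodeInfo class and the can_join function are hypothetical names introduced only for illustration; they are not part of Elasticsearch, and network reachability is deliberately left out of the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    """Hypothetical, simplified description of a node (not an Elasticsearch class)."""
    name: str
    cluster_name: str
    version: str  # for example "2.3" or "5.0"


def can_join(candidate: NodeInfo, member: NodeInfo) -> bool:
    """Return True if the candidate satisfies the version and cluster-name rules."""
    same_version = candidate.version == member.version
    same_cluster = candidate.cluster_name == member.cluster_name
    return same_version and same_cluster


member = NodeInfo("node-1", "my-cluster", "5.0")
print(can_join(NodeInfo("node-2", "my-cluster", "5.0"), member))     # True
print(can_join(NodeInfo("node-3", "my-cluster", "2.3"), member))     # False
print(can_join(NodeInfo("node-4", "other-cluster", "5.0"), member))  # False
```

In this sketch, a node with a different version or a different cluster name is simply rejected, which mirrors the first two rules in the list.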

A common approach in cluster management is to have one or more master nodes, which are the main reference for all cluster-level actions, and other nodes, called secondary, that replicate the master data and actions.

To be consistent in write operations, all update actions are first committed on the master node and then replicated to the secondary ones.

In a cluster with multiple nodes, if a master node dies, a master-eligible one is elected to be the new master. This approach allows automatic failover to be set up in an Elasticsearch cluster.
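The failover idea can be illustrated with a toy Python model. The Node class, the elect_master function, and the tie-breaking rule (the lowest node id wins) are all assumptions made for this sketch; they do not describe the actual election algorithm used by Elasticsearch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical cluster member used only in this sketch."""
    node_id: str
    master_eligible: bool
    alive: bool = True


def elect_master(nodes: List[Node]) -> Optional[Node]:
    """Pick a master among the live, master-eligible nodes.

    Assumption: the node with the lowest id wins. This is an
    illustrative rule, not Elasticsearch's real election logic.
    """
    candidates = [n for n in nodes if n.alive and n.master_eligible]
    if not candidates:
        return None  # no master can be elected
    return min(candidates, key=lambda n: n.node_id)


cluster = [Node("a", True), Node("b", True), Node("c", False)]
master = elect_master(cluster)
print(master.node_id)                 # a

master.alive = False                  # simulate the master dying
print(elect_master(cluster).node_id)  # b
```

Node "c" is never elected because it is not master-eligible, even when it is the only other node still alive.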

There's more...

In Elasticsearch, we have four kinds of nodes:

  • Master nodes that are able to process REST (https://en.wikipedia.org/wiki/Representational_state_transfer) requests and responses and all other search operations. During every action execution, Elasticsearch generally executes actions using a MapReduce approach (https://en.wikipedia.org/wiki/MapReduce): the non-data node is responsible for distributing the actions to the underlying shards (map) and for collecting/aggregating the shard results (reduce) to send a final response. These nodes may use a huge amount of RAM due to operations such as aggregations, collecting hits, and caching (for example, scan/scroll queries).

  • Data nodes that are able to store data. They contain the index shards that store the indexed documents as Lucene indexes.

  • Ingest nodes that are able to process ingestion pipelines (new in Elasticsearch 5.x).

  • Client nodes (no master and no data) that are used to do processing in such a way that, if something bad happens (out of memory or bad queries), they can be killed/restarted without data loss or reduced cluster stability.

Using the standard configuration, a node is a master, data, and ingest node at the same time.
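The node kinds listed above can be summarized in a short Python sketch. It assumes the Elasticsearch 5.x setting names node.master, node.data, and node.ingest from elasticsearch.yml, each defaulting to true; the node_roles function itself is a hypothetical helper, not an Elasticsearch API.

```python
from typing import Dict, List


def node_roles(settings: Dict[str, bool]) -> List[str]:
    """Derive the roles of a node from its (assumed) role settings.

    Missing settings are treated as True, matching the standard
    configuration in which a node is master, data, and ingest.
    """
    roles = []
    if settings.get("node.master", True):
        roles.append("master")
    if settings.get("node.data", True):
        roles.append("data")
    if settings.get("node.ingest", True):
        roles.append("ingest")
    # A node with no role left only routes and merges requests.
    return roles or ["client"]


print(node_roles({}))  # ['master', 'data', 'ingest']
print(node_roles({"node.master": False, "node.ingest": False}))  # ['data']
print(node_roles({"node.master": False, "node.data": False,
                  "node.ingest": False}))  # ['client']
```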

In big cluster architectures, having some nodes act as simple client nodes with a lot of RAM and no data reduces the resources required by the data nodes and improves search performance, because the client nodes can use their local memory cache.
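The map/reduce style of execution described for non-data nodes can be sketched in Python as a scatter-gather search. The shard layout, the term-count scoring, and the function names search_shard and search are assumptions made for illustration; real Elasticsearch shards are Lucene indexes with far richer scoring.

```python
import heapq
from typing import Dict, List, Tuple

Hit = Tuple[float, str]  # (score, document id)


def search_shard(shard: Dict[str, str], term: str) -> List[Hit]:
    """Map step: score the documents of one shard by term occurrences."""
    hits = []
    for doc_id, text in shard.items():
        score = float(text.lower().split().count(term.lower()))
        if score > 0:
            hits.append((score, doc_id))
    return hits


def search(shards: List[Dict[str, str]], term: str, size: int = 10) -> List[Hit]:
    """Distribute the query to every shard, then merge the partial results."""
    per_shard = [search_shard(shard, term) for shard in shards]  # map
    all_hits = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(size, all_hits)                        # reduce


shards = [
    {"1": "red apple", "2": "green pear"},
    {"3": "apple apple pie"},
]
print(search(shards, "apple", size=2))  # [(2.0, '3'), (1.0, '1')]
```

The merge step is where the coordinating node holds all partial results in memory at once, which is why such nodes benefit from a lot of RAM.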

See also

  • The Setting up a single node, Setting a multi node cluster and Setting up different node types recipes in Chapter 2, Downloading and Setup.