Book Image

Elasticsearch 5.x Cookbook - Third Edition

By : Alberto Paro
Book Image

Elasticsearch 5.x Cookbook - Third Edition

By: Alberto Paro

Overview of this book

Elasticsearch is a Lucene-based distributed search server that allows users to index and search unstructured content with petabytes of data. This book is your one-stop guide to master the complete Elasticsearch ecosystem. We’ll guide you through comprehensive recipes on what’s new in Elasticsearch 5.x, showing you how to create complex queries and analytics, and perform index mapping, aggregation, and scripting. Further on, you will explore the modules of Cluster and Node monitoring and see ways to back up and restore a snapshot of an index. You will understand how to install Kibana to monitor a cluster and also to extend Kibana for plugins. Finally, you will also see how you can integrate your Java, Scala, Python, and Big Data applications such as Apache Spark and Pig with Elasticsearch, and add enhanced functionalities with custom plugins. By the end of this book, you will have an in-depth knowledge of the implementation of the Elasticsearch architecture and will be able to manage data efficiently and effectively with Elasticsearch.
Table of Contents (25 chapters)
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Dedication
Preface

Understanding cluster, replication, and sharding


Related to shards management, there are key concepts of replication and cluster status.

Getting ready

You need one or more nodes running to have a cluster. To test an effective cluster, you need at least two nodes (that can be on the same machine).

How it works...

An index can have one or more replicas (full copies of your data, automatically managed by Elasticsearch): the shards are called primary ones if they are part of the primary replica, and secondary ones if they are part of other replicas.

To maintain consistency in write operations, the following workflow is executed:

  • The write is first executed in the primary shard

  • If the primary write is successfully done, it is propagated simultaneously in all the secondary shards

  • If a primary shard becomes unavailable, a secondary one is elected as primary (if available) and the flow is re-executed

During search operations, if there are some replicas, a valid set of shards is chosen randomly between primary and secondary to improve performances. Elasticsearch has several allocation algorithms to better distribute shards on nodes. For reliability, replicas are allocated in a way that if a single node becomes unavailable, there is always at least one replica of each shard that is still available on the remaining nodes.

The following figure shows some example of possible shards and replica configuration:

The replica has a cost to increase the indexing time due to data node synchronization and also the time spent to propagate the message to the slaves (mainly in an asynchronous way).

Best practice

To prevent data loss and to have high availability, it's good to have at least one replica; so, your system can survive a node failure without downtime and without loss of data.

A typical approach for scaling performance in search when your customer number is to increase the replica number.

There's more...

Related to the concept of replication, there is the cluster status indicator of the health of your cluster.

It can cover three different states:

  • Green: This state depicts that everything is ok.

  • Yellow: This state depicts that some shards are missing but you can work.

  • Red: This state depicts that, "Houston we have a problem". Some primary shards are missing. The cluster will not accept writing and errors and stale actions may happen due to missing shards. If the missing shard cannot be restored, you have lost your data.

Solving the yellow status

  • Mainly yellow status is due to some shards that are not allocated.

  • If your cluster is in "recovery" status (this means that it's starting up and checking the shards before we put them online), just wait so that the shards start up process ends.

  • After having finished the recovery, if your cluster is always in yellow state, you may not have enough nodes to contain your replicas (because, for example, the number of replicas is bigger than the number of your nodes). To prevent this, you can reduce the number of your replicas or add the required number of nodes.

Note

The total number of nodes must not be lower than the maximum number of replicas.

Solving the red status

  • You have loss of data. This is when you have one or more shards missing.

  • You need to try to restore the node(s) that are missing. If your nodes restart and the system goes back to yellow or green status, you are safe. Otherwise, you have lost data and your cluster is not usable: delete the index/indices and restore them from backups or snapshots (if you have already done it) or from other sources.

To prevent data loss, I suggest having always at least two nodes and a replica set to 1.

Tip

Having one or more replicas on different nodes on different machines allows you to have a live backup of your data, always updated.

See also

  • We'll see replica and shard management in the Managing index settings recipe in Chapter 4, Basic Operations.