Understanding clusters, replication, and sharding
Closely related to shard management are two key concepts: replication and cluster status.
Getting ready
You need one or more nodes running to have a cluster. To test an effective cluster, you need at least two nodes (that can be on the same machine).
How it works...
An index can have zero or more replicas. The shards that hold the original copy of the data are called primary shards; the shards that hold the replicated copies are called secondary shards.
To maintain consistency in write operations, the following workflow is executed:
The write operation is first executed in the primary shard
If the primary write succeeds, it is propagated in parallel to all the secondary shards
If a primary shard becomes unavailable, a secondary one is elected as primary (if available) and then the flow is re-executed
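The three steps above can be sketched as a small in-memory simulation. This is only an illustration of the workflow, not ElasticSearch code: the `ShardCopy` and `ReplicatedShard` classes, the node names, and the rule of promoting the first available secondary are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ShardCopy:
    """One copy of a shard (hypothetical model, not an ElasticSearch class)."""
    node: str
    available: bool = True
    docs: dict = field(default_factory=dict)


class ReplicatedShard:
    """A shard with one primary copy and zero or more secondary copies."""

    def __init__(self, primary, secondaries):
        self.primary = primary
        self.secondaries = list(secondaries)

    def _promote_secondary(self):
        # Elect the first available secondary as the new primary
        # (an assumption; the real election logic is not described here).
        for i, copy in enumerate(self.secondaries):
            if copy.available:
                self.primary = self.secondaries.pop(i)
                return
        raise RuntimeError("no available copy of this shard")

    def write(self, doc_id, doc):
        # If the primary is unavailable, promote a secondary first,
        # then run the normal flow against the new primary.
        if not self.primary.available:
            self._promote_secondary()
        # Execute the write on the primary.
        self.primary.docs[doc_id] = doc
        # Propagate the write to every available secondary.
        for copy in self.secondaries:
            if copy.available:
                copy.docs[doc_id] = doc
        # Return the node that acted as primary for this write.
        return self.primary.node
```

For example, with a primary on `node-1` and secondaries on `node-2` and `node-3`, a write returns `node-1`; after marking the primary as unavailable, the next write is served by `node-2`, which has become the new primary.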
During search operations, if there are replicas, a valid set of shards is chosen at random among the primary and secondary shards to improve search performance. ElasticSearch has several allocation algorithms to distribute shards across nodes. For reliability, replicas are allocated so that if a single node becomes unavailable, at least one copy of each shard is still available on the remaining nodes.
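The random choice of shard copies during a search can be illustrated with a minimal sketch. The `pick_search_copies` function and the layout of its input (a mapping from shard number to `(node, role, available)` tuples) are hypothetical; the sketch only shows the idea of picking one available copy per shard.

```python
import random


def pick_search_copies(shard_copies, rng=random):
    """For each shard, pick one copy at random among its available copies.

    shard_copies maps a shard number to a list of (node, role, available)
    tuples; this layout is a hypothetical one chosen for the sketch.
    """
    chosen = {}
    for shard_id, copies in shard_copies.items():
        # Only copies that are currently available can serve the search.
        candidates = [(node, role) for node, role, available in copies if available]
        if not candidates:
            raise RuntimeError(f"shard {shard_id} has no available copy")
        chosen[shard_id] = rng.choice(candidates)
    return chosen
```

Because primary and secondary copies are treated alike, adding replicas spreads the search load over more nodes.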
The following figure shows some examples of possible shards and replica configuration:
Replicas have a cost: they increase the indexing time, because the data nodes must be synchronized, that is, every write must be propagated from the primary shard to the secondary shards (mainly in an asynchronous way).
Note
To prevent data loss and to have high availability, it's good to have at least one replica, so that your system can survive a node failure without downtime and without loss of data.
There's more...
Related to the concept of replication, there is the cluster status indicator, which shows you information on the health of your cluster. It can be in one of three states:
Green: This shows that everything is okay
Yellow: This means that some secondary shards are missing (not allocated), but the cluster is still usable
Red: This indicates a problem as some primary shards are missing
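The three states can be expressed as a small decision function. This is a sketch of the rule only: the `cluster_status` function and its input format (a list of `(role, assigned)` tuples) are assumptions made for the example.

```python
def cluster_status(shards):
    """Derive the status colour from a list of shard copies.

    Each entry is a (role, assigned) tuple, where role is "primary" or
    "secondary"; this input format is a hypothetical one for the sketch.
    """
    # Any missing primary shard makes the cluster red.
    if any(role == "primary" and not assigned for role, assigned in shards):
        return "red"
    # All primaries are present, but some secondary shard is missing.
    if any(not assigned for _, assigned in shards):
        return "yellow"
    # Every primary and secondary shard is allocated.
    return "green"
```

On a live cluster, the same indicator is reported in the `status` field of the response to a `GET /_cluster/health` request.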
Solving the yellow status
The yellow status is mainly due to some shards not being allocated.
If your cluster is in the recovery status (meaning that it's starting up and checking the shards before they are online), you need to wait until the shards' startup process ends.
If your cluster is still in the yellow state after the recovery has finished, you may not have enough nodes to contain your replicas (for example, the number of replicas may be equal to or bigger than the number of your nodes). To fix this, you can reduce the number of your replicas or add the required number of nodes. Because a secondary shard is never allocated on the same node as its primary, a good practice is to make sure that the total number of nodes is greater than the maximum number of replicas present.
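As a rough aid for choosing the replica count, the following sketch computes the largest number of replicas that can be fully allocated on a given number of nodes and builds the JSON body for an update-settings request. It assumes that copies of the same shard are never placed on the same node, so a shard with N replicas needs N + 1 nodes. The function names are hypothetical, and the request itself is not sent.

```python
import json


def max_allocatable_replicas(node_count):
    """Largest replica count for which every shard copy can be placed.

    Assumes copies of the same shard are never placed on the same node,
    so a shard with N replicas needs N + 1 nodes.
    """
    return max(node_count - 1, 0)


def replica_settings_body(index_replicas, node_count):
    """Build the JSON body for an update-settings request (not sent here)."""
    # Cap the requested replica count at what the nodes can hold.
    replicas = min(index_replicas, max_allocatable_replicas(node_count))
    return json.dumps({"index": {"number_of_replicas": replicas}})
```

For an index configured with two replicas on a two-node cluster, `replica_settings_body(2, 2)` yields a body that lowers `number_of_replicas` to 1; the body can then be sent with a `PUT` request to the `_settings` endpoint of the index.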
Solving the red status
The red status means that data is currently missing, because one or more primary shards are not available.
To fix this, try to restore the missing node(s). If the nodes restart and the system goes back to the yellow or green status, you are safe. Otherwise, you have lost data and your cluster is not usable; the next step is to delete the affected index or indices and restore them from backups or snapshots (if you have taken them) or from other sources. To prevent data loss, I suggest always having at least two nodes and the number of replicas set to 1 as good practice.
Note
Having one or more replicas on different nodes on different machines gives you a live backup of your data that is always up to date.
See also
Setting up different node types in the next chapter.