Book Image

ElasticSearch Cookbook

By : Alberto Paro
Book Image

ElasticSearch Cookbook

By: Alberto Paro

Overview of this book

Table of Contents (20 chapters)
ElasticSearch Cookbook Second Edition
Credits
About the Author
Acknowledgments
About the Reviewers
www.PacktPub.com
Preface
Index

Managing your data


If you are going to use ElasticSearch as a search engine or a distributed data store, it's important to understand concepts of how ElasticSearch stores and manages your data.

Getting ready

To work with ElasticSearch data, a user must have basic concepts of data management and JSON data format, which is the lingua franca to work with ElasticSearch data and services.

How it works...

Our main data container is called index (plural indices) and it can be considered as a database in the traditional SQL world. In an index, the data is grouped into data types called mappings in ElasticSearch. A mapping describes how the records are composed (fields).

Every record that must be stored in ElasticSearch must be a JSON object.

Natively, ElasticSearch is a schema-less data store; when you enter records in it during the insert process it processes the records, splits it into fields, and updates the schema to manage the inserted data.

To manage huge volumes of records, ElasticSearch uses the common approach to split an index into multiple shards so that they can be spread on several nodes. Shard management is transparent to the users; all common record operations are managed automatically in the ElasticSearch application layer.

Every record is stored in only a shard; the sharding algorithm is based on a record ID, so many operations that require loading and changing of records/objects, can be achieved without hitting all the shards, but only the shard (and its replica) that contains your object.

The following schema compares ElasticSearch structure with SQL and MongoDB ones:

ElasticSearch

SQL

MongoDB

Index (Indices)

Database

Database

Shard

Shard

Shard

Mapping/Type

Table

Collection

Field

Field

Field

Object (JSON Object)

Record (Tuples)

Record (BSON Object)

There's more...

To ensure safe operations on index/mapping/objects, ElasticSearch internally has rigid rules about how to execute operations.

In ElasticSearch, the operations are divided into:

  • Cluster/index operations: All clusters/indices with active write are locked; first they are applied to the master node and then to the secondary one. The read operations are typically broadcasted to all the nodes.

  • Document operations: All write actions are locked only for the single hit shard. The read operations are balanced on all the shard replicas.

When a record is saved in ElasticSearch, the destination shard is chosen based on:

  • The id (unique identifier) of the record; if the id is missing, it is autogenerated by ElasticSearch

  • If routing or parent (we'll see it in the parent/child mapping) parameters are defined, the correct shard is chosen by the hash of these parameters

Splitting an index in shard allows you to store your data in different nodes, because ElasticSearch tries to balance the shard distribution on all the available nodes.

Every shard can contain up to 2^32 records (about 4.9 billion), so the real limit to a shard size is its storage size.

Shards contain your data and during search process all the shards are used to calculate and retrieve results. So ElasticSearch performance in big data scales horizontally with the number of shards.

All native records operations (such as index, search, update, and delete) are managed in shards.

Shard management is completely transparent to the user. Only an advanced user tends to change the default shard routing and management to cover their custom scenarios. A common custom scenario is the requirement to put customer data in the same shard to speed up his operations (search/index/analytics).

Best practices

It's best practice not to have a shard too big in size (over 10 GB) to avoid poor performance in indexing due to continuous merging and resizing of index segments.

It is also not good to over-allocate the number of shards to avoid poor search performance due to native distributed search (it works as map and reduce). Having a huge number of empty shards in an index will consume memory and increase the search times due to an overhead on network and results aggregation phases.