Elasticsearch Essentials

Overview of this book

With constantly evolving and growing datasets, organizations need to find actionable insights for their business. Elasticsearch, which is the world's most advanced search and analytics engine, makes massive amounts of data usable in a matter of milliseconds. It not only gives you the power to build blazing fast search solutions over a massive amount of data, but can also serve as a NoSQL data store. This guide will help you become a competent developer quickly, with a solid understanding of the core Elasticsearch concepts. Starting from the beginning, this book covers these core concepts, setting up Elasticsearch and various plugins, working with analyzers, and creating mappings. It provides complete coverage of working with Elasticsearch using Python, performing CRUD operations and aggregation-based analytics, handling document relationships in the NoSQL world, working with geospatial data, and taking data backups. Finally, we'll show you how to set up and scale Elasticsearch clusters in production environments, along with some best practices.
Table of Contents (18 chapters)

Basic operations with Elasticsearch


We have already seen how Elasticsearch stores data and provides REST APIs to perform operations on it. In the next few sections, we will perform some basic actions using the command-line tool curl. Once you have grasped the basics, you will start programming and implementing these concepts using Python and Java in the upcoming chapters.

Note

When we create an index, Elasticsearch by default creates five shards and one replica for each shard (this means five primary and five replica shards). These defaults can be changed in the elasticsearch.yml file through the index.number_of_shards and index.number_of_replicas settings, or the values can be provided while creating the index.

Once the index is created, the number of shards can't be increased or decreased; however, you can increase or decrease the number of replicas at any time after index creation. So it is better to choose the number of required shards for an index at the time of index creation.
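
The shard arithmetic from this note can be sketched in a few lines of Python. This is only a minimal illustration: the helper names index_settings_body and total_shards are hypothetical, and the body layout assumes the settings are passed under a top-level "settings" key when the index is created.

```python
import json


def index_settings_body(number_of_shards=5, number_of_replicas=1):
    """Return a JSON body carrying explicit shard settings for index creation."""
    return json.dumps({
        "settings": {
            "number_of_shards": number_of_shards,
            "number_of_replicas": number_of_replicas,
        }
    })


def total_shards(number_of_shards=5, number_of_replicas=1):
    """Every primary shard has `number_of_replicas` copies of itself."""
    return number_of_shards * (1 + number_of_replicas)


if __name__ == "__main__":
    # Defaults described in the note: five primaries and five replicas.
    print(total_shards())
    print(index_settings_body(3, 2))
```

Because the number of primary shards is fixed once the index exists, only number_of_replicas in such a body could be changed later.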

Creating an Index

Let's begin by creating our first index and giving it a name, which is books in this case. After executing the following command, an index with five shards and one replica will be created:

curl -XPUT 'localhost:9200/books/'

Note

Uppercase letters and blank spaces are not allowed in index names.
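
As a rough illustration, the two naming rules above can be checked before sending a request. The function name is_valid_index_name is hypothetical, and the check covers only the two rules stated in this note; Elasticsearch may reject names for other reasons that this sketch does not model.

```python
def is_valid_index_name(name):
    """Check only the two rules from the note: no uppercase, no blank spaces."""
    if not name:
        return False
    has_uppercase = name != name.lower()
    has_whitespace = any(ch.isspace() for ch in name)
    return not has_uppercase and not has_whitespace


if __name__ == "__main__":
    for candidate in ("books", "Books", "my books"):
        print(candidate, is_valid_index_name(candidate))
```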

Indexing a document in Elasticsearch

Like all databases, Elasticsearch has the concept of a unique identifier for each document, known as _id. This identifier is created in one of two ways: either you provide your own unique ID while indexing the data, or, if you don't provide one, Elasticsearch generates an ID for that document. The following are examples:

curl -XPUT 'localhost:9200/books/elasticsearch/1' -d '{
"name":"Elasticsearch Essentials",
"author":"Bharvi Dixit", 
"tags":["Data Analytics","Text Search","Elasticsearch"],
"content":"Added with PUT request"
}'

On executing the preceding command, Elasticsearch will give the following response:

{"_index":"books","_type":"elasticsearch","_id":"1","_version":1,"created":true}

However, if you do not provide an id, which is 1 in our case, then you will get the following error:

No handler found for uri [/books/elasticsearch] and method [PUT] 

The reason behind the preceding error is that we are using a PUT request to create a document. However, Elasticsearch has no idea where to store this document (no existing URI for the document is available).

If you want the _id to be auto-generated, you have to use a POST request. For example:

curl -XPOST 'localhost:9200/books/elasticsearch' -d '{
"name":"Elasticsearch Essentials",
"author":"Bharvi Dixit", 
"tags":["Data Analytics","Text Search","Elasticsearch"],
"content":"Added with POST request"
}'

The response from the preceding request will be as follows:

{"_index":"books","_type":"elasticsearch","_id":"AU-ityC8xdEEi6V7cMV5","_version":1,"created":true}
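
The choice between PUT and POST described above can be sketched in Python with the standard library. This is a minimal sketch, not a client implementation: build_index_request and BASE_URL are hypothetical names, the node address is assumed to be the local one used in the curl examples, and the request is only built, never sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumed local node, as in the curl examples


def build_index_request(index, doc_type, document, doc_id=None):
    """Build (but do not send) the HTTP request that indexes a document.

    With an id, PUT to /index/type/id; without one, POST to /index/type
    so that Elasticsearch generates the _id itself.
    """
    url = "{}/{}/{}".format(BASE_URL, index, doc_type)
    method = "POST"
    if doc_id is not None:
        url = "{}/{}".format(url, doc_id)
        method = "PUT"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(url, data=body, method=method)


if __name__ == "__main__":
    doc = {"name": "Elasticsearch Essentials", "author": "Bharvi Dixit"}
    for req in (build_index_request("books", "elasticsearch", doc, 1),
                build_index_request("books", "elasticsearch", doc)):
        print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would send it to a running node.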

If you open the localhost:9200/_plugin/head URL, you can perform all the CRUD operations using the head plugin as well:

Some of the stats that you can see in the preceding image are these:

  • Cluster name: elasticsearch_cluster

  • Node name: node-1

  • Index name: books

  • No. of primary shards: 5

  • No. of docs in the index: 2

  • No. of unassigned shards (replica shards): 5

    Note

    Cluster states in Elasticsearch

    An Elasticsearch cluster can be in one of the three states: GREEN, YELLOW, or RED. If all the shards, meaning primary as well as replicas, are assigned in the cluster, it will be in the GREEN state. If any one of the replica shards is not assigned because of any problem, then the cluster will be in the YELLOW state. If any one of the primary shards is not assigned on a node, then the cluster will be in the RED state. We will see more on these states in the upcoming chapters. Elasticsearch never assigns a primary and its replica shard on the same node.
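
The three rules in this note can be written as a small decision function. This is a sketch of the rules as stated here, not of how Elasticsearch computes health internally; the function name cluster_state and its parameters are hypothetical.

```python
def cluster_state(unassigned_primaries, unassigned_replicas):
    """Derive the cluster colour from the number of unassigned shards."""
    if unassigned_primaries > 0:
        return "RED"      # at least one primary shard has no node
    if unassigned_replicas > 0:
        return "YELLOW"   # all primaries assigned, some replica is not
    return "GREEN"        # every primary and replica is assigned


if __name__ == "__main__":
    # Single node holding the books index: 5 primaries assigned,
    # 5 replicas unassigned because a replica cannot share a node
    # with its primary.
    print(cluster_state(unassigned_primaries=0, unassigned_replicas=5))
```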

Fetching documents

We have stored documents in Elasticsearch. Now we can fetch them using their unique ids with a simple GET request.

Get a complete document

We have already indexed our document. Now, we can get the document using its document identifier by executing the following command:

curl -XGET 'localhost:9200/books/elasticsearch/1?pretty'

The output of the preceding command is as follows:

{
  "_index" : "books",
  "_type" : "elasticsearch",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source":{"name":"Elasticsearch Essentials","author":"Bharvi Dixit", "tags":["Data Analytics","Text Search","Elasticsearch"],"content":"Added with PUT request"}
}

Note

pretty is used in the preceding request to make the response nicer and more readable.

As you can see, there is a _source field in the response. This is a special field reserved by Elasticsearch to store all the JSON data. There are options available to not store the data in this field, since it requires extra disk space. However, keeping it helps in many ways: when returning data from Elasticsearch, re-indexing data, or doing partial document updates. We will see more on this field in the upcoming chapters.

If the document did not exist in the index, the found field would have been marked as false.

Getting part of a document

Sometimes you need only some of the fields to be returned instead of returning the complete document. For these scenarios, you can send the names of the fields to be returned inside the _source parameter with the GET request:

curl -XGET 'localhost:9200/books/elasticsearch/1?_source=name,author'

The response of Elasticsearch will be as follows:

{
"_index":"books",
"_type":"elasticsearch",
"_id":"1",
"_version":1,
"found":true,
"_source":{"author":"Bharvi Dixit","name":"Elasticsearch Essentials"}
}
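
When such requests are built in code, the field list has to end up in the query string as a comma-separated value. A minimal sketch using only the Python standard library follows; build_get_url is a hypothetical helper, and the node address is assumed to be the local one used in the curl examples.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200"  # assumed local node, as in the curl examples


def build_get_url(index, doc_type, doc_id, source_fields=None, pretty=False):
    """Build the URL for fetching a document, optionally limiting _source."""
    url = "{}/{}/{}/{}".format(BASE_URL, index, doc_type, doc_id)
    params = {}
    if source_fields:
        params["_source"] = ",".join(source_fields)
    if pretty:
        params["pretty"] = "true"
    if params:
        # Keep commas unescaped so the URL reads like the curl example.
        url += "?" + urlencode(params, safe=",")
    return url


if __name__ == "__main__":
    print(build_get_url("books", "elasticsearch", 1, ["name", "author"]))
```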

Updating documents

It is possible to update documents in Elasticsearch, which can be done either completely or partially, but updates come with some limitations and costs. In the next sections, we will see how these operations can be performed and how things work behind the scenes.

Updating a whole document

To update a whole document, you can use the same kind of PUT/POST request that we used to create a new document:

curl -XPUT 'localhost:9200/books/elasticsearch/1' -d '{
"name":"Elasticsearch Essentials",
"author":"Bharvi Dixit", 
"tags":["Data Analytics","Text Search","Elasticsearch"],
"content":"Updated document",
"publisher":"pact-pub"
}'

The response of Elasticsearch looks like this:

{"_index":"books","_type":"elasticsearch","_id":"1","_version":2,"created":false}

If you look at the response, it shows _version is 2 and created is false, meaning the document is updated.
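
An application can read the same two fields to tell a create from an update. The sketch below parses responses in the format shown in this section; describe_index_response is a hypothetical helper name.

```python
import json


def describe_index_response(raw_response):
    """Summarise an index response using its created and _version fields."""
    response = json.loads(raw_response)
    action = "created" if response["created"] else "updated"
    return "{} (version {})".format(action, response["_version"])


if __name__ == "__main__":
    first = '{"_index":"books","_type":"elasticsearch","_id":"1","_version":1,"created":true}'
    second = '{"_index":"books","_type":"elasticsearch","_id":"1","_version":2,"created":false}'
    print(describe_index_response(first))
    print(describe_index_response(second))
```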

Updating documents partially

Instead of updating the whole document, we can use the _update API to do partial updates. In the following example, we add a new field, updated_time, to the document by passing a script parameter. Elasticsearch uses Groovy scripting by default.

Note

Scripting is by default disabled in Elasticsearch, so to use a script you need to enable it by adding the following parameter to your elasticsearch.yml file:

script.inline: on

curl -XPOST 'localhost:9200/books/elasticsearch/1/_update' -d '{
   "script" : "ctx._source.updated_time= \"2015-09-09T00:00:00\""
}'

The response of the preceding request will be this:

{"_index":"books","_type":"elasticsearch","_id":"1","_version":3}

It shows that a new version has been created in Elasticsearch.

Elasticsearch stores data in indexes that are composed of Lucene segments. These segments are immutable in nature, meaning that, once created, they can't be changed. So, when we send an update request to Elasticsearch, it does the following things in the background:

  • Fetches the JSON data from the _source field for that document

  • Makes changes in the _source field

  • Marks the old document as deleted

  • Indexes a new document with the changed _source

The user could carry out all of these re-indexing steps manually; with the _update API, however, they are done in a single request. The process is the same for a whole-document update as for a partial update. The benefit of a partial update is that all operations happen within a single shard, which avoids network overhead.
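
The steps listed above can be modelled with a toy in-memory class. This is purely illustrative: TinyIndex is a hypothetical name, and real Elasticsearch stores documents in Lucene segments, not in Python dictionaries.

```python
class TinyIndex:
    """Toy model of the update cycle; not how Lucene actually stores data."""

    def __init__(self):
        self.live = {}     # _id -> {"_version": int, "_source": dict}
        self.deleted = []  # old copies, retired but not yet purged

    def index(self, doc_id, source):
        """Store a whole document, retiring any previous copy."""
        old = self.live.get(doc_id)
        version = 1
        if old is not None:
            self.deleted.append(old)       # old copy is never modified
            version = old["_version"] + 1
        self.live[doc_id] = {"_version": version, "_source": dict(source)}
        return version

    def partial_update(self, doc_id, changes):
        """Fetch _source, apply the changes, then re-index the result."""
        source = dict(self.live[doc_id]["_source"])  # fetch the JSON data
        source.update(changes)                        # change the copy
        return self.index(doc_id, source)             # retire old, add new


if __name__ == "__main__":
    idx = TinyIndex()
    idx.index("1", {"name": "Elasticsearch Essentials"})
    print(idx.partial_update("1", {"updated_time": "2015-09-09T00:00:00"}))
```

In this model a whole-document update and a partial update end in the same index call, which mirrors the point that both follow the same process.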

Deleting documents

To delete a document using its identifier, we need to use the DELETE request:

curl -XDELETE 'localhost:9200/books/elasticsearch/1'

The following is the response of Elasticsearch:

{"found":true,"_index":"books","_type":"elasticsearch","_id":"1","_version":4}

If you come from a Lucene background, you will know how segment merging is done and how new segments are created in the background as more documents get indexed. Whenever we delete a document from Elasticsearch, it does not get deleted from the file system right away. Rather, Elasticsearch just marks that document as deleted, and when you index more data, segment merging is done. During merging, the documents that are marked as deleted are physically removed, based on the merge policy. The same process applies when a document is updated.

The space from deleted documents can also be reclaimed with the _optimize API by executing the following command:

curl -XPOST 'http://localhost:9200/_optimize?only_expunge_deletes=true'
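
The mark-then-merge behaviour described in this section can be modelled in a few lines. This is a simplified sketch with hypothetical function names; it represents a segment as a dictionary and ignores everything a real merge policy takes into account.

```python
def delete_document(segment, doc_id):
    """Mark a document as deleted; the stored documents are left untouched."""
    segment["deleted_ids"].add(doc_id)


def merge_segments(segments):
    """Write a new segment holding only documents not marked as deleted."""
    merged = {"docs": {}, "deleted_ids": set()}
    for segment in segments:
        for doc_id, source in segment["docs"].items():
            if doc_id not in segment["deleted_ids"]:
                merged["docs"][doc_id] = source
    return merged


if __name__ == "__main__":
    seg_a = {"docs": {"1": {"name": "a"}, "2": {"name": "b"}}, "deleted_ids": set()}
    seg_b = {"docs": {"3": {"name": "c"}}, "deleted_ids": set()}
    delete_document(seg_a, "1")
    print("1" in seg_a["docs"])            # still on disk after the delete
    print(sorted(merge_segments([seg_a, seg_b])["docs"]))
```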

Checking documents' existence

While developing applications, some scenarios require you to check whether a document exists or not in Elasticsearch. In these scenarios, rather than querying the documents with a GET request, you have the option of using another HTTP request method called HEAD:

curl -i -XHEAD 'localhost:9200/books/elasticsearch/1'

The following is the response of the preceding command:

HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Length: 0

In the preceding command, the -i parameter is used to show the header information of the HTTP response. It is needed because a HEAD request returns only headers and no content. If the document is found, the status code will be 200; if not, it will be 404.
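
In application code, the existence check comes down to reading the status code of a HEAD request. The sketch below only builds the request and interprets a status code, without contacting a node. The helper names are hypothetical, and it assumes that a missing document is reported with 404 Not Found, the usual HTTP status for an absent resource.

```python
import urllib.request

BASE_URL = "http://localhost:9200"  # assumed local node, as in the curl examples


def build_exists_request(index, doc_type, doc_id):
    """Build (but do not send) a HEAD request for a document."""
    url = "{}/{}/{}/{}".format(BASE_URL, index, doc_type, doc_id)
    return urllib.request.Request(url, method="HEAD")


def document_exists(status_code):
    """Interpret the status code returned for a HEAD request."""
    if status_code == 200:
        return True
    if status_code == 404:  # assumption: missing documents yield 404
        return False
    raise ValueError("unexpected status code: {}".format(status_code))


if __name__ == "__main__":
    req = build_exists_request("books", "elasticsearch", 1)
    print(req.get_method(), req.full_url)
    print(document_exists(200), document_exists(404))
```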