
Elasticsearch Server - Third Edition

By : Rafal Kuc

Overview of this book

Elasticsearch is a very fast and scalable open source search engine, designed with distribution and the cloud in mind, complete with all the goodies that Apache Lucene has to offer. Elasticsearch's schema-free architecture allows developers to index and search unstructured content, making it perfectly suited for both small projects and large big data warehouses, even those with petabytes of unstructured data. This book will guide you through the world of the most commonly used Elasticsearch server functionalities. You'll start off by getting an understanding of the basics of Elasticsearch and its data indexing functionality. Next, you will see the querying capabilities of Elasticsearch, followed by a thorough explanation of scoring and search relevance. After this, you will explore the aggregation and data analysis capabilities of Elasticsearch and will learn how cluster administration and scaling can be used to boost your application performance. You'll find out how to use the friendly REST APIs and how to tune Elasticsearch to make the most of it. By the end of this book, you will be able to create amazing search solutions as per your project's specifications.

Manipulating data with the REST API


Elasticsearch exposes a very rich REST API that can be used to search through the data, index the data, and control Elasticsearch behavior. You can imagine that using the REST API allows you to get a single document, index or update a document, get information on the current state of Elasticsearch, create or delete indices, or force Elasticsearch to move around shards of your indices. Of course, these are only examples that show what you can expect from the Elasticsearch REST API. For now, we will concentrate on using the create, retrieve, update, delete (CRUD) part of the Elasticsearch API (https://en.wikipedia.org/wiki/Create,_read,_update_and_delete), which allows us to use Elasticsearch in a fashion similar to how we would use any other NoSQL (https://en.wikipedia.org/wiki/NoSQL) data store.

Understanding the REST API

If you've never used an application exposing a REST API, you may be surprised how easy such applications are to use and remember. In REST-like architectures, every request is directed to a concrete object indicated by a path in the address. For example, let's assume that our hypothetical application exposes the /books REST end-point as a reference to the list of books. In such a case, a call to /books/1 could be a reference to a concrete book with the identifier 1. You can think of it as a data-oriented model of an API. Of course, we can nest the paths—for example, a path such as /books/1/chapters could return the list of chapters of our book with identifier 1 and a path such as /books/1/chapters/6 could be a reference to the sixth chapter in that particular book.

We talked about paths, but when using the HTTP protocol (https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol), we have some additional verbs (such as POST, GET, PUT, and so on) that we can use to define system behavior in addition to paths. So if we would like to retrieve the book with identifier 1, we would use the GET request method with the /books/1 path. However, we would use the PUT request method with the same path to create a book record with the identifier of 1, the POST request method to alter the record, DELETE to remove that entry, and the HEAD request method to get basic information about the data referenced by the path.
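The verb-and-path convention above can be sketched in a few lines. The following is a minimal Python illustration of the hypothetical /books API from this section (not real Elasticsearch endpoints):

```python
# Map CRUD-style operations on the hypothetical /books resource to an HTTP
# verb and a path. The /books endpoint is this section's example API, not a
# real Elasticsearch API.
def crud_request(operation, book_id=None):
    routes = {
        "create":   ("PUT",    "/books/{id}"),
        "retrieve": ("GET",    "/books/{id}"),
        "update":   ("POST",   "/books/{id}"),
        "delete":   ("DELETE", "/books/{id}"),
        "info":     ("HEAD",   "/books/{id}"),
        "list":     ("GET",    "/books"),
    }
    method, path = routes[operation]
    return method, path.format(id=book_id)

print(crud_request("retrieve", 1))  # ('GET', '/books/1')
print(crud_request("delete", 1))    # ('DELETE', '/books/1')
```

The same (method, path) pairing is exactly what we will send to Elasticsearch in the examples below.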

Now, let's look at example HTTP requests that are sent to real Elasticsearch REST API endpoints, so the preceding hypothetical information will be turned into something real:

GET http://localhost:9200/: This retrieves basic information about Elasticsearch, such as the version, the name of the node that the command has been sent to, the name of the cluster that node is connected to, the Apache Lucene version, and so on.

GET http://localhost:9200/_cluster/state/nodes/: This retrieves information about all the nodes in the cluster, such as their identifiers, names, transport addresses with ports, and additional node attributes for each node.

DELETE http://localhost:9200/books/book/123: This deletes a document that is indexed in the books index, with the book type and an identifier of 123.

We now know what REST means and we can start concentrating on Elasticsearch to see how we can store, retrieve, alter, and delete the data from its indices. If you would like to read more about REST, please refer to http://en.wikipedia.org/wiki/Representational_state_transfer.

Storing data in Elasticsearch

In Elasticsearch, every document is represented by three attributes—the index, the type, and the identifier. Each document must be indexed into a single index, needs to have its type correspond to the document structure, and is described by the identifier. These three attributes allow us to identify any document in Elasticsearch and need to be provided when the document is physically written to the underlying Apache Lucene index. Having this knowledge, we are now ready to create our first Elasticsearch document.

Creating a new document

We will start learning the Elasticsearch REST API by indexing one document. Let's imagine that we are building a CMS system (http://en.wikipedia.org/wiki/Content_management_system) that will provide the functionality of a blogging platform for our internal users. We will have different types of documents in our indices, but the most important ones are the articles that will be published and are readable by users.

Because we talk to Elasticsearch using JSON notation and Elasticsearch responds to us again using JSON, our example document could look as follows:

{ 
 "id": "1", 
 "title": "New version of Elasticsearch released!", 
 "content": "Version 2.2 released today!", 
 "priority": 10, 
 "tags": ["announce", "elasticsearch", "release"] 
}

As you can see in the preceding code snippet, the JSON document is built with a set of fields, where each field can have a different format. In our example, we have a set of text fields (id, title, and content), we have a number (the priority field), and an array of text values (the tags field). We will show documents that are more complicated in the next examples.
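Because JSON is semi-typed, the type of every field is visible directly in the payload. A small Python sketch makes this explicit using the example document:

```python
import json

# The example article parsed from its JSON representation; each field keeps
# its own JSON type (strings, a number, and an array).
doc = json.loads('''{
  "id": "1",
  "title": "New version of Elasticsearch released!",
  "content": "Version 2.2 released today!",
  "priority": 10,
  "tags": ["announce", "elasticsearch", "release"]
}''')

field_types = {name: type(value).__name__ for name, value in doc.items()}
print(field_types)
# {'id': 'str', 'title': 'str', 'content': 'str', 'priority': 'int', 'tags': 'list'}
```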

Note

One of the changes introduced in Elasticsearch 2.0 has been that field names can't contain the dot character. Such field names were possible in older versions of Elasticsearch, but could result in serialization errors in certain cases and thus Elasticsearch creators decided to remove that possibility.

One thing to remember is that, by default, Elasticsearch works as a schema-less data store. This means that it tries to guess the type of each field in a document sent to Elasticsearch. It will use numeric types for values that are not enclosed in quotation marks and strings for data enclosed in quotation marks. It will also try to detect dates and index them in dedicated fields, and so on. This is possible because the JSON format is semi-typed. Internally, when the first document with a new field is sent to Elasticsearch, it will be processed and mappings will be written (we will talk more about mappings in the Mappings configuration section of Chapter 2, Indexing Your Data).
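To illustrate the idea, here is a toy simulation of how such type guessing could work. This is a deliberately simplified sketch, not Elasticsearch's actual mapping-detection logic:

```python
import re

# A toy simulation of dynamic type guessing. This mirrors the idea only --
# real mapping detection in Elasticsearch is far more involved.
def guess_field_type(value):
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"          # unquoted numeric value
    if isinstance(value, str):
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
            return "date"        # looks like a date, gets a dedicated type
        return "string"          # quoted value, treated as text
    if isinstance(value, list):
        return "array of " + guess_field_type(value[0])
    return "object"

print(guess_field_type(10))            # number
print(guess_field_type("10"))          # string
print(guess_field_type("2016-02-02"))  # date
```

Note how the same digits produce different types depending on quoting: this is exactly the mismatch the following note warns about.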

Note

A schema-less approach and dynamic mappings can be problematic when documents come with a slightly different structure—for example, the first document would contain the value of the priority field without quotation marks (like the one shown in the discussed example), while the second document would have quotation marks for the value in the priority field. This will result in an error because Elasticsearch will try to put a text value in the numeric field and this is not possible in Lucene. Because of this, it is advisable to define your own mappings, which you will learn in the Mappings configuration section of Chapter 2, Indexing Your Data.

Let's now index our document and make it available for retrieval and searching. We will index our articles to an index called blog under a type named article. We will also give our document an identifier of 1, as this is our first document. To index our example document, we will execute the following command:

curl -XPUT 'http://localhost:9200/blog/article/1' -d '{"title": "New version of Elasticsearch released!", "content": "Version 2.2 released today!", "priority": 10, "tags": ["announce", "elasticsearch", "release"] }'

Note a new option to the curl command, the -d parameter. The value of this option is the text that will be used as a request payload—a request body. This way, we can send additional information such as the document definition. Also, note that the unique identifier is placed in the URL and not in the body. If you omit this identifier (while using the HTTP PUT request), the indexing request will return the following error:

No handler found for uri [/blog/article] and method [PUT]

If everything worked correctly, Elasticsearch will return a JSON response informing us about the status of the indexing operation. This response should be similar to the following one:

{
 "_index":"blog",
 "_type":"article",
 "_id":"1",
 "_version":1,
 "_shards":{
  "total":2,
  "successful":1,
  "failed":0},
 "created":true
}

In the preceding response, Elasticsearch included information about the status of the operation, index, type, identifier, and version. We can also see information about the shards that took part in the operation—the total number of shards, and how many of them succeeded or failed.
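A client would typically inspect a few of these fields to verify the outcome. A minimal sketch in Python, using the response body shown above:

```python
import json

# Parse the indexing response shown above and check the outcome the way a
# client typically would.
response = json.loads('{"_index":"blog","_type":"article","_id":"1",'
                      '"_version":1,"_shards":{"total":2,"successful":1,'
                      '"failed":0},"created":true}')

assert response["created"]                 # a new document was created
assert response["_version"] == 1           # the first version of this document
assert response["_shards"]["failed"] == 0  # no shard reported a failure
print("indexed %s/%s/%s" % (response["_index"], response["_type"], response["_id"]))
```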

Automatic identifier creation

In the previous example, we specified the document identifier manually when we were sending the document to Elasticsearch. However, there are use cases when we don't have an identifier for our documents—for example, when handling logs as our data. In such cases, we would like some application to create the identifier for us and Elasticsearch can be such an application. Of course, generating document identifiers doesn't make sense when your document already has them, such as data in a relational database. In such cases, you may want to update the documents; in this case, automatic identifier generation is not the best idea. However, when we are in need of such functionality, instead of using the HTTP PUT method we can use POST and omit the identifier in the REST API path. So if we would like Elasticsearch to generate the identifier in the previous example, we would send a command like this:

curl -XPOST 'http://localhost:9200/blog/article/' -d '{"title": "New version of Elasticsearch released!", "content": "Version 2.2 released today!", "priority": 10, "tags": ["announce", "elasticsearch", "release"] }'

We've used the HTTP POST method instead of PUT and we've omitted the identifier. The response produced by Elasticsearch in such a case would be as follows:

{
 "_index":"blog",
 "_type":"article",
 "_id":"AU1y-s6w2WzST_RhTvCJ",
 "_version":1,
 "_shards":{
  "total":2,
  "successful":1,
  "failed":0},
 "created":true
}

As you can see, the response returned by Elasticsearch is almost the same as in the previous example, with a minor difference—the value of the _id field. Instead of 1, we now have AU1y-s6w2WzST_RhTvCJ, which is the identifier Elasticsearch generated for our document.

Retrieving documents

We now have two documents indexed into our Elasticsearch instance—one using an explicit identifier and one using a generated identifier. Let's now try to retrieve one of the documents using its unique identifier. To do this, we will need information about the index the document is indexed in, what type it has, and of course what identifier it has. For example, to get the document from the blog index with the article type and the identifier of 1, we would run the following HTTP GET request:

curl -XGET 'localhost:9200/blog/article/1?pretty'

Note

The additional pretty URI parameter tells Elasticsearch to include new line characters and additional white spaces in the response to make the output easier to read.

Elasticsearch will return a response similar to the following:

{
  "_index" : "blog",
  "_type" : "article",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "New version of Elasticsearch released!",
    "content" : "Version 2.2 released today!",
    "priority" : 10,
    "tags" : [ "announce", "elasticsearch", "release" ]
  }
}

As you can see in the preceding response, Elasticsearch returned the _source field, which contains the original document sent to Elasticsearch, along with a few additional fields that tell us about the document, such as the index, type, identifier, document version, and of course whether the document was found (the found property).

If we try to retrieve a document that is not present in the index, such as the one with the 12345 identifier, we get a response like this:

{
  "_index" : "blog",
  "_type" : "article",
  "_id" : "12345",
  "found" : false
}

As you can see, this time the value of the found property was set to false and there was no _source field because the document has not been retrieved.
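Putting the two responses together, a client can use the found property to decide whether the _source field is present. A small Python sketch using the sample responses from this section:

```python
import json

# Reduce a GET-by-identifier response to what a client usually needs: the
# original document when found, or None when it was not.
def extract_document(response_body):
    response = json.loads(response_body)
    if response["found"]:
        return response["_source"]
    return None

hit = ('{"_index":"blog","_type":"article","_id":"1","_version":1,'
       '"found":true,"_source":{"title":"New version of Elasticsearch '
       'released!","priority":10}}')
miss = '{"_index":"blog","_type":"article","_id":"12345","found":false}'

print(extract_document(hit)["priority"])  # 10
print(extract_document(miss))             # None
```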

Updating documents

Updating documents in the index is a more complicated task compared to indexing. When a document is indexed and Elasticsearch flushes it to disk, it creates segments—immutable structures that are written once and read many times. This is needed because the inverted index created by Apache Lucene is currently impossible to update (at least most of its parts). To update a document, Elasticsearch internally first fetches the document using a GET request, modifies its _source field, removes the old document, and indexes a new document using the updated content. The content update is done using scripts in Elasticsearch (we will talk more about scripting in Elasticsearch in the Scripting capabilities of Elasticsearch section in Chapter 6, Make Your Search Better).
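The fetch-modify-reindex cycle can be sketched in plain Python. This in-memory simulation illustrates only the visible effect (an updated _source and a bumped version), not Elasticsearch's actual implementation:

```python
# An in-memory simulation of the update cycle: fetch the stored _source,
# apply the change, and store the result as a new version of the document.
index = {("blog", "article", "1"): {
    "_version": 1,
    "_source": {"title": "New version of Elasticsearch released!",
                "content": "Version 2.2 released today!"}}}

def update(key, changes):
    stored = index[key]                # 1. fetch the current document
    stored["_source"].update(changes)  # 2. modify its _source
    stored["_version"] += 1            # 3. store it as a new version
    return stored["_version"]

new_version = update(("blog", "article", "1"),
                     {"content": "This is the updated document"})
print(new_version)  # 2
```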

Note

Please note that the following document update examples require you to put the script.inline: on property into your elasticsearch.yml configuration file. This is needed because inline scripting is disabled in Elasticsearch for security reasons. The other way to handle updates is to store the script content in the file in the Elasticsearch configuration directory, but we will talk about that in the Scripting capabilities of Elasticsearch section in Chapter 6, Make Your Search Better.

Let's now try to update our document with identifier 1 by modifying its content field to contain the This is the updated document sentence. To do this, we need to run a POST HTTP request on the document path using the _update REST end-point. Our request to modify the document would look as follows:

curl -XPOST 'http://localhost:9200/blog/article/1/_update' -d '{ 
 "script" : "ctx._source.content = new_content",
 "params" : {
  "new_content" : "This is the updated document"
 }
}'

As you can see, we've sent the request to the /blog/article/1/_update REST end-point. In the request body, we've provided two parameters—the update script in the script property and the parameters of the script. The script is very simple; it takes the _source field and modifies the content field by setting its value to the value of the new_content parameter. The params property contains all the script parameters.

For the preceding update command execution, Elasticsearch would return the following response:

{"_index":"blog","_type":"article","_id":"1","_version":2,"_shards":{"total":2,"successful":1,"failed":0}}

The thing to look at in the preceding response is the _version field. Right now, the version is 2, which means that the document has been updated (or re-indexed) once. Basically, each update makes Elasticsearch update the _version field.

We could also update the document using the doc section and providing the changed field, for example:

curl -XPOST 'http://localhost:9200/blog/article/1/_update' -d '{
 "doc" : {
  "content" : "This is the updated document"
 }
}'

We now retrieve the document using the following command:

curl -XGET 'http://localhost:9200/blog/article/1?pretty'

And we get the following response from Elasticsearch:

{
  "_index" : "blog",
  "_type" : "article",
  "_id" : "1",
  "_version" : 2,
  "found" : true,
  "_source" : {
    "title" : "New version of Elasticsearch released!",
    "content" : "This is the updated document",
    "priority" : 10,
    "tags" : [ "announce", "elasticsearch", "release" ]
  }
}

As you can see, the document has been updated properly.

Note

The thing to remember when using the update API of Elasticsearch is that the _source field needs to be present because this is the field that Elasticsearch uses to retrieve the original document content from the index. By default, that field is enabled and Elasticsearch uses it to store the original document.

Dealing with non-existing documents

A handy feature of the update API, which we would like to mention here, is that we can define what Elasticsearch should do when the document we are trying to update is not present.

For example, let's try incrementing the priority field value for a non-existing document with identifier 2:

curl -XPOST 'http://localhost:9200/blog/article/2/_update' -d '{ 
 "script" : "ctx._source.priority += 1"
}'

The response returned by Elasticsearch would look more or less as follows:

{"error":{"root_cause":[{"type":"document_missing_exception","reason":"[article][2]: document missing","shard":"2","index":"blog"}],"type":"document_missing_exception","reason":"[article][2]: document missing","shard":"2","index":"blog"},"status":404}

As you can imagine, the document has not been updated because it doesn't exist. So now, let's modify our request to include the upsert section in our request body that will tell Elasticsearch what to do when the document is not present. The new command would look as follows:

curl -XPOST 'http://localhost:9200/blog/article/2/_update' -d '{ 
 "script" : "ctx._source.priority += 1",
 "upsert" : {
  "title" : "Empty document",
  "priority" : 0,
  "tags" : ["empty"]
 }
}'

With the modified request, a new document would be indexed; if we retrieve it using the GET API, it will look as follows:

{
  "_index" : "blog",
  "_type" : "article",
  "_id" : "2",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "title" : "Empty document",
    "priority" : 0,
    "tags" : [ "empty" ]
  }
}

As you can see, the fields from the upsert section of our update request were taken by Elasticsearch and used as document fields.
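The script-or-upsert behavior can be simulated in a few lines of Python. This is a sketch of the idea only, with the update script replaced by an ordinary function:

```python
# Simulate update-with-upsert: run the update script when the document
# exists, otherwise index the upsert body as the new document.
index = {}

def update_with_upsert(doc_id, script, upsert_body):
    if doc_id in index:
        script(index[doc_id])              # document exists: apply the script
    else:
        index[doc_id] = dict(upsert_body)  # missing: use the upsert section

def increment_priority(doc):
    doc["priority"] += 1

upsert = {"title": "Empty document", "priority": 0, "tags": ["empty"]}

update_with_upsert("2", increment_priority, upsert)  # missing -> upsert indexed
print(index["2"]["priority"])                        # 0
update_with_upsert("2", increment_priority, upsert)  # present -> script runs
print(index["2"]["priority"])                        # 1
```

This mirrors what we just observed: the first request indexed the upsert body untouched, and only subsequent updates run the script.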

Adding partial documents

In addition to what we already wrote about the update API, Elasticsearch is also capable of merging partial documents from the update request into already existing documents, or indexing new documents using the information in the request, similar to what we saw with the upsert section.

Let's imagine that we would like to update our initial document and add a new field called count to it (setting it to 1 initially). We would also like to index the document under the specified identifier if the document is not present. We can do this by running the following command:

curl -XPOST 'http://localhost:9200/blog/article/1/_update' -d '{ 
  "doc" : {
    "count" : 1
  },
  "doc_as_upsert" : true
}'

We specified the new field in the doc section and we said that we want the doc section to be treated as the upsert section when the document is not present (with the doc_as_upsert property set to true).
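The merge semantics of the doc section with doc_as_upsert can also be simulated. Again, this is a simplified in-memory sketch, not Elasticsearch's implementation:

```python
# Simulate a partial-document update with doc_as_upsert: merge the doc
# section into an existing document, or index it as a whole new document
# when none exists.
def partial_update(index, doc_id, doc, doc_as_upsert=False):
    if doc_id in index:
        index[doc_id].update(doc)      # merge the partial document
    elif doc_as_upsert:
        index[doc_id] = dict(doc)      # use the doc section as the document
    else:
        raise KeyError("document missing")

index = {"1": {"title": "New version of Elasticsearch released!",
               "priority": 10}}
partial_update(index, "1", {"count": 1}, doc_as_upsert=True)
print(index["1"]["count"])  # 1 -- the new field was merged in
```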

If we now retrieve that document, we see the following response:

{
  "_index" : "blog",
  "_type" : "article",
  "_id" : "1",
  "_version" : 3,
  "found" : true,
  "_source" : {
    "title" : "New version of Elasticsearch released!",
    "content" : "This is the updated document",
    "priority" : 10,
    "tags" : [ "announce", "elasticsearch", "release" ],
    "count" : 1
  }
}

Note

For a full reference on document updates, please refer to the official Elasticsearch documentation on the Update API, which is available at https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html.

Deleting documents

Now that we know how to index documents, update them, and retrieve them, it is time to learn how we can delete them. Deleting a document from an Elasticsearch index is very similar to retrieving it, but with one major difference—instead of the HTTP GET method, we have to use the HTTP DELETE method.

For example, if we would like to delete the document indexed in the blog index under the article type and with an identifier of 1, we would run the following command:

curl -XDELETE 'localhost:9200/blog/article/1'

The response from Elasticsearch indicates that the document has been deleted and should look as follows:

{
 "found":true,
 "_index":"blog",
 "_type":"article",
 "_id":"1",
 "_version":4,
 "_shards":{
  "total":2,
  "successful":1,
  "failed":0
 }
}

Of course, deleting single documents is not our only option. We can also remove entire indices. For example, if we would like to delete the whole blog index, we just omit both the type and the identifier, so the command looks like this:

curl -XDELETE 'localhost:9200/blog'

The preceding command would result in the deletion of the blog index.

Versioning

Finally, there is one last thing we would like to talk about when it comes to data manipulation in Elasticsearch—the great feature of versioning. As you may have already noticed, Elasticsearch increments the document version whenever the document is updated. We can leverage this functionality to implement optimistic locking (http://en.wikipedia.org/wiki/Optimistic_concurrency_control) and avoid conflicts and overwrites when multiple processes or threads access the same document concurrently. For example, an automated indexing application may try to update a document at the same time that a user is updating it manually. The question that arises is: which document should be the correct one—the one updated by the indexing application, the one updated by the user, or a merge of the changes? What if the changes conflict? To handle such cases, we can use versioning.
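The optimistic locking idea can be sketched as follows. This Python simulation mirrors the version-conflict behavior, not Elasticsearch's internals:

```python
# Simulate an optimistic-locking delete: the operation succeeds only when
# the caller's version matches the stored one.
class VersionConflict(Exception):
    pass

def delete_with_version(index, doc_id, expected_version):
    current = index[doc_id]["_version"]
    if current != expected_version:
        raise VersionConflict("current [%d], provided [%d]"
                              % (current, expected_version))
    del index[doc_id]

index = {"10": {"_version": 2, "_source": {"title": "Updated test document"}}}
try:
    delete_with_version(index, "10", 1)  # stale version -> conflict (like HTTP 409)
except VersionConflict as conflict:
    print(conflict)                      # current [2], provided [1]
delete_with_version(index, "10", 2)      # matching version -> delete succeeds
print("10" in index)                     # False
```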

Usage example

Let's index a new document to our blog index—one with an identifier of 10, and let's index its second version soon after we do that. The commands that do this look as follows:

curl -XPUT 'localhost:9200/blog/article/10' -d '{"title":"Test document"}'
curl -XPUT 'localhost:9200/blog/article/10' -d '{"title":"Updated test document"}'

Because we've indexed the document with the same identifier twice, it should now have a version of 2 (you can check this using the GET request).

Now, let's try deleting the document we've just indexed, but let's specify a version property equal to 1. By doing this, we tell Elasticsearch that we are only interested in deleting the document with the provided version. Because the document now has a different version, Elasticsearch shouldn't allow the delete to succeed. Let's check if this is true. The command we will use to send the delete request looks as follows:

curl -XDELETE 'localhost:9200/blog/article/10?version=1'

The response generated by Elasticsearch should be similar to the following one:

{
  "error" : {
    "root_cause" : [ {
      "type" : "version_conflict_engine_exception",
      "reason" : "[article][10]: version conflict, current [2], provided [1]",
      "shard" : 1,
      "index" : "blog"
    } ],
    "type" : "version_conflict_engine_exception",
    "reason" : "[article][10]: version conflict, current [2], provided [1]",
    "shard" : 1,
    "index" : "blog"
  },
  "status" : 409
}

As you can see, the delete operation was not successful—the versions didn't match. If we set the version property to 2, the delete operation would be successful:

curl -XDELETE 'localhost:9200/blog/article/10?version=2&pretty'

The response this time will look as follows:

{
  "found" : true,
  "_index" : "blog",
  "_type" : "article",
  "_id" : "10",
  "_version" : 3,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  }
}

This time the delete operation was successful because the provided version matched the current version of the document.

Versioning from external systems

The very good thing about Elasticsearch versioning capabilities is that we can provide the version of the document that we would like Elasticsearch to use. This allows us to provide versions from external data systems that are our primary data stores. To do this, we need to provide an additional parameter during indexing—version_type=external and, of course, the version itself. For example, if we would like our document to have the 12345 version, we could send a request like this:

curl -XPUT 'localhost:9200/blog/article/20?version=12345&version_type=external' -d '{"title":"Test document"}'

The response returned by Elasticsearch is as follows:

{
  "_index" : "blog",
  "_type" : "article",
  "_id" : "20",
  "_version" : 12345,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : true
}

We just need to remember that, when using version_type=external, we need to provide the version whenever we index the document. If we would like to change the document and use optimistic locking, we need to provide a version parameter higher than the version currently stored in the document.
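The acceptance rule for external versions can be sketched like this. This is a simplified simulation, assuming the strictly-higher-version rule for version_type=external:

```python
# Simulate version_type=external: an index operation is accepted only when
# the provided version is higher than the currently stored one.
def index_external(index, doc_id, version, source):
    current = index.get(doc_id, {}).get("_version", -1)
    if version <= current:
        raise ValueError("version conflict, current [%d], provided [%d]"
                         % (current, version))
    index[doc_id] = {"_version": version, "_source": source}

index = {}
index_external(index, "20", 12345, {"title": "Test document"})
print(index["20"]["_version"])  # 12345

try:
    index_external(index, "20", 12345, {"title": "Stale write"})  # not higher
except ValueError:
    print("rejected")
```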