Elasticsearch Indexing

By : Huseyin Akdogan

Overview of this book

Starting with an overview of the way Elasticsearch stores data, you'll extend your knowledge to tackle indexing and mapping, and learn how to configure Elasticsearch to meet your users' needs. You'll then find out how to use analysis and analyzers to organize and retrieve search results more intelligently, so that search queries are met with relevant results. You'll explore the anatomy of an Elasticsearch cluster, and learn how to set up configurations that give you optimum availability as well as scalability. Once you've learned how these elements work, you'll find real-world solutions to help you improve indexing performance, as well as tips and guidance on safety so you can back up and restore data. Once you've learned each component outlined throughout, you will be confident that you can help to deliver the improved search experience that modern users demand and expect.

Understanding the document storage strategy

First of all, we need to answer the question: what is an Elasticsearch index?

The short answer is that an index is like a database in a relational database system. Elasticsearch is a document-oriented search and analytics engine. Each record in Elasticsearch is a structured JSON document. In other words, each piece of data that is sent to Elasticsearch for indexing is a JSON document. All fields of a document are indexed by default, and these indexed fields can be used in a single query. More information about this can be found in the next chapter.

Elasticsearch uses the Apache Lucene library for writing and reading the data from the index. In fact, Apache Lucene is at the heart of Elasticsearch.

Note

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. For more information, please refer to https://lucene.apache.org/core/.

Every document sent to Elasticsearch is stored in Apache Lucene, and the library stores all data in a data structure called an inverted index. An inverted index maps terms to the documents that contain them. That means an inverted index holds a list of all the unique words that appear in any document and, for each of those words, a list of the documents in which it appears. This data structure is what makes fast full-text search possible at low cost. The inverted index is a basic index structure used by search engines.
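To make the idea concrete, here is a minimal sketch of an inverted index in Python. It is only an illustration of the data structure described above, not how Lucene implements it: the function name `build_inverted_index`, the sample documents, and the naive lowercase-and-split tokenization are all assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each unique lowercased term to the sorted IDs of the documents containing it.

    Illustrative only: real analyzers do far more than lowercase and split on whitespace.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "What is an Elasticsearch index",
    2: "An index is like a database",
}

inverted = build_inverted_index(docs)
print(inverted["index"])     # IDs of the documents that contain the term "index"
print(inverted["database"])  # IDs of the documents that contain the term "database"
```

Looking up a term is a single dictionary access that returns the matching document IDs directly, which is why a search does not need to scan every document.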

Note

The inverted index will be discussed in depth in the next chapter.

The _source field

As mentioned earlier, all fields of a document are indexed by default in Elasticsearch, and these fields can be used in a single query. We usually send data to Elasticsearch because we want to either search it or retrieve it.

The _source field is a metadata field automatically generated during indexing within Lucene that stores the actual JSON document. When executing search requests, the _source field is returned by default as shown in the following code snippet:

curl -XPUT localhost:9200/my_index/article/1 -d '{
  "title": "What is an Elasticsearch Index",
  "category": "Elasticsearch",
  "content": "An index is like a...",
  "date": "2015-07-18",
  "tags": ["bigdata", "elasticsearch"]
}'
{"_index":"my_index","_type":"article","_id":"1","_version":1,"created":true}

curl -XGET localhost:9200/my_index/_search?pretty
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "my_index",
            "_type": "article",
            "_id": "1",
            "_score": 1,
            "_source": {
               "title": "What is an Elasticsearch Index",
               "category": "Elasticsearch",
               "content": "An index is like a...",
               "date": "2015-07-18",
               "tags": [
                  "bigdata",
                  "elasticsearch"
               ]
            }
         }
      ]
   }
}

Note

More information about the metadata fields can be found in Chapter 3, Basic Concepts of Mapping.

We sent a document to Elasticsearch that contains title, category, content, date, and tags fields for indexing. Then we ran the search command. The result of the search command is shown in the preceding snippet.

So that it can always return everything you send to it as a search result, Elasticsearch stores every field of the document within the _source field by default.

You can change this behavior if you want. This can be a preferred option because in some cases you may not need all fields to be returned in the search results. Also, a field does not need to be stored in the _source field in order to be searchable:

curl -XPUT localhost:9200/my_index/_mapping/article -d '{
  "article": {
    "_source": {
      "excludes": [
        "date"
      ]
    }
  }
}'
{"acknowledged":true}

curl -XPUT localhost:9200/my_index/article/1 -d '{
  "title": "What is an Elasticsearch Index",
  "category": "Elasticsearch",
  "content": "An index is like a...",
  "date": "2015-07-18",
  "tags": ["bigdata", "elasticsearch"]
}'
{"_index":"my_index","_type":"article","_id":"1","_version":2,"created":false}

What did we do?

Firstly, we removed the date field from the _source field by changing the dynamic mapping. Then we sent the same document to Elasticsearch again for reindexing. In the next step, we will use a range query to list the documents whose date is on or after July 18, 2015. The pretty parameter used in the following query tells Elasticsearch to return pretty-printed JSON results:

curl -XGET localhost:9200/my_index/_search?pretty -d '{
  "query": {
    "range": {
      "date": {
        "gte": "2015-07-18"
      }
    }
  }
}'
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "my_index",
            "_type": "article",
            "_id": "1",
            "_score": 1,
            "_source": {
               "title": "What is an Elasticsearch Index",
               "category": "Elasticsearch",
               "content": "An index is like a...",
               "tags": [
                  "bigdata",
                  "elasticsearch"
               ]
            }
         }
      ]
   }
}

As you can see, we can search on the date field even though it is not returned. This is because, as previously mentioned, all fields of a document are indexed by default by Elasticsearch.

The difference between storable and searchable fields
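The separation between what is searchable and what is returned can be sketched with a small Python model. This is a simplified illustration, not Elasticsearch code: the helper names `index_document` and `range_gte` are hypothetical, and a plain dictionary stands in for the index.

```python
def index_document(doc, source_excludes=()):
    """Toy model: return (searchable fields, stored _source) for one document.

    Assumption for illustration: every field is indexed, and only the
    fields not listed in source_excludes are kept in _source.
    """
    searchable = dict(doc)
    source = {name: value for name, value in doc.items() if name not in source_excludes}
    return searchable, source


def range_gte(searchable, field, bound):
    """Return True if the indexed value of field is greater than or equal to bound."""
    # ISO 8601 date strings (YYYY-MM-DD) compare correctly as plain strings.
    return field in searchable and searchable[field] >= bound


article = {
    "title": "What is an Elasticsearch Index",
    "category": "Elasticsearch",
    "content": "An index is like a...",
    "date": "2015-07-18",
    "tags": ["bigdata", "elasticsearch"],
}

searchable, source = index_document(article, source_excludes=["date"])
matches = range_gte(searchable, "date", "2015-07-18")

print(matches)           # the range condition is evaluated against the indexed value
print("date" in source)  # the date is absent from what would be returned
```

In this model the query is answered from the indexed values, while the response is built from the stored `_source`, so excluding a field from `_source` changes only what is returned.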

Elasticsearch allows you to separately manage fields that can be searchable and/or storable. This is useful because in some cases we may want to index a field but may not want to store it or vice versa. In some cases, we might not want to do either.

To better understand the subject, let's change the preceding example. Let's create my_index again with an explicit mapping and disable the _source field:

curl -XDELETE localhost:9200/my_index
{"acknowledged": true}

curl -XPUT localhost:9200/my_index -d '{
  "mappings": {
    "article": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "title": {"type": "string", "store": true},
        "category": {"type": "string"},
        "content": {"type": "string"},
        "date": {"type": "date", "index": "no"},
        "tags": {"type": "string", "index": "no", "store": true}
      }
    }
  }
}'

Firstly, we disabled the _source field for the article type. In this case, unless otherwise stated, no fields of the article type are stored or returned. However, we would like to store some fields, so we set the store option on the title and tags fields only. Enabling the store option tells Elasticsearch to store the specified field. In this way, we explicitly specify which fields we want to be able to retrieve later.

In addition, we don't want some fields to be indexed. This means that such fields will not be searchable. The date and the tags fields will not be searchable with the preceding configuration but, if requested, the tags field can be returned.

Note

Keep in mind that after disabling the _source field, you cannot make use of a number of features that come with the _source field, for example, the update API and highlighting.

Now, let's see the effect of the preceding configuration in practice:

curl -XPUT localhost:9200/my_index/article/1 -d '{
  "title": "What is an Elasticsearch Index",
  "category": "Elasticsearch",
  "content": "An index is like a...",
  "date": "2015-07-18",
  "tags": ["bigdata", "elasticsearch"]
}'
{"_index":"my_index","_type":"article","_id":"1","_version":1,"created":true}

curl -XGET localhost:9200/my_index/_search?pretty
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "my_index",
      "_type" : "article",
      "_id" : "1",
      "_score" : 1.0
    } ]
  }
}

curl -XGET localhost:9200/my_index/_search?pretty -d '{
  "query": {
    "range": {
      "date": {
        "gte": "2015-07-18"
      }
    }
  }
}'
{
   "took": 6,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

Firstly, we sent a document whose date field value is 2015-07-18 for indexing. Then we ran a match_all query (the search request does not have a body) and did not see the _source field within hits.

Then we ran a range query on the date field because we wanted the documents whose date is greater than or equal to July 18, 2015. Elasticsearch did not return any documents because the date field no longer uses the default configuration: we set its index option to no. In other words, the date field was not indexed and is therefore not searchable, so no documents are retrieved.

Now let's run another scenario with the following command:

curl -XGET localhost:9200/my_index/_search?pretty -d '{
  "fields": ["title", "content", "tags"],
  "query": {
    "match": {
      "content": "like"
    }
  }
}'
{
   "took": 6,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.13424811,
      "hits": [
         {
            "_index": "my_index",
            "_type": "article",
            "_id": "1",
            "_score": 0.13424811,
            "fields": {
               "title": [
                  "What is an Elasticsearch Index"
               ],
               "tags": [
                  "bigdata",
                  "elasticsearch"
               ]
            }
         }
      ]
   }
}

The document is returned as a result of the preceding query because the content field is searchable; however, the content field itself is not returned because it was not stored in Lucene.

Understanding the difference between storable and searchable (indexed) fields is important for indexing performance and relevant search results. It offers significant advantages for advanced users.
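The outcome of the three requests can be summarized with a small Python model of the mapping. This is a sketch under stated assumptions, not Elasticsearch behavior in full: the `MAPPING` table simply restates the index and store settings from the mapping above (with _source disabled), the helper names are hypothetical, and the real response wraps each returned value in an array.

```python
# Per-field flags restating the explicit mapping used in this section.
# Fields without "index": "no" are indexed; only fields with "store": true are stored.
MAPPING = {
    "title":    {"index": True,  "store": True},
    "category": {"index": True,  "store": False},
    "content":  {"index": True,  "store": False},
    "date":     {"index": False, "store": False},
    "tags":     {"index": False, "store": True},
}


def is_searchable(field):
    """A field can be queried only if it was indexed."""
    return MAPPING[field]["index"]


def returned_fields(doc, requested):
    """With _source disabled, only stored fields can be returned on request."""
    return {name: doc[name] for name in requested if MAPPING[name]["store"]}


article = {
    "title": "What is an Elasticsearch Index",
    "category": "Elasticsearch",
    "content": "An index is like a...",
    "date": "2015-07-18",
    "tags": ["bigdata", "elasticsearch"],
}

print(is_searchable("date"))     # whether a range query on date can match
print(is_searchable("content"))  # whether a match query on content can match
print(returned_fields(article, ["title", "content", "tags"]))
```

The model shows the two settings acting independently: content can be queried but not returned, tags can be returned but not queried, and date is neither.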
