Analysis


As we mentioned earlier, Apache Lucene stores all of its data in an inverted index. Data must be transformed into this structure for Elasticsearch to respond to search requests successfully. The process of transforming data in this way is called analysis.

Elasticsearch has an index analysis module that maps to the Lucene Analyzer. In general, an analyzer is composed of zero or more character filters, a single tokenizer, and zero or more token filters.

Note

Analysis modules and analyzers will be discussed in depth in Chapter 4, Analysis and Analyzers.

Elasticsearch provides many character filters, tokenizers, and token filters. For example, a character filter may be used to strip out HTML markup, and a token filter may be used to modify tokens (for example, to lowercase them). You can combine them to create custom analyzers, or you can use one of the built-in analyzers.
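For instance, the following is a minimal sketch of a custom analyzer definition; the index name myindex and the analyzer name my_html_analyzer are hypothetical, and the example simply combines the built-in html_strip character filter, the standard tokenizer, and the lowercase token filter:

# myindex and my_html_analyzer are hypothetical names used only for illustration
curl -XPUT localhost:9200/myindex -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_html_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}'

Once the index exists, fields can reference my_html_analyzer in their mappings, and the _analyze API can be used to inspect the tokens it produces.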

A good understanding of the analysis process is very important for improving the user's search experience and for returning relevant search results, because Elasticsearch (actually Lucene) uses analyzers both at indexing time and at query time (a minimal mapping sketch showing this follows the tip below).

Tip

It is crucial to remember that not all Elasticsearch queries are analyzed; for example, the match query analyzes its input, whereas the term query does not.
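Returning to the point about index-time and query-time analysis, the following is a minimal sketch (the otherindex index, the document type, and the title field are all hypothetical names) of how a mapping can declare which analyzer to use when indexing a field and which to use when querying it, via the analyzer and search_analyzer options:

# otherindex, document, and title are hypothetical names used only for illustration
curl -XPUT localhost:9200/otherindex -d '{
  "mappings": {
    "document": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "standard",
          "search_analyzer": "simple"
        }
      }
    }
  }
}'

In most cases, the same analyzer should be used at both indexing and query time; separate index and search analyzers are typically only needed for special cases such as autocomplete.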

Now let's examine the importance of the analyzer in terms of relevant search results with a simple scenario:

curl -XPOST localhost:9200/company/employee -d '{
  "firstname": "Joe Jeffers",
  "lastname": "Hoffman",
  "age": 30
}'
{"_index":"company","_type":"employee","_id":"AU7GIEQeR7spPlxvqlud","_version":1,"created":true}

We indexed an employee named Joe Jeffers Hoffman, 30 years old. Now let's search for the employees named Joe in the company index:

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match": {
      "firstname": "joe"
    }
  }
}'
{
   "took": 68,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.19178301,
      "hits": [
         {
            "_index": "company",
            "_type": "employee",
            "_id": "AU7GIEQeR7spPlxvqlud",
            "_score": 0.19178301,
            "_source": {
               "firstname": "Joe Jeffers",
               "lastname": "Hoffman",
               "age": 30
            }
         }
      ]
   }
}

All string type fields in the company index will be analyzed by the standard analyzer because the employee type was created with dynamic mapping.
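You can verify the mapping that dynamic mapping generated with the following command; the firstname and lastname fields appear as string fields with no analyzer setting, which means the default standard analyzer applies:

curl -XGET localhost:9200/company/_mapping?pretty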

The standard analyzer is the default analyzer that Elasticsearch uses. It removes most punctuation and splits the text on word boundaries, as defined by the Unicode Consortium.

Note

If you want to have more information about the Unicode Consortium, please refer to http://www.unicode.org/reports/tr29/.

In this case, Joe Jeffers would be two tokens (Joe and Jeffers). To see how the standard analyzer works, run the following command:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Joe Jeffers'
{
  "tokens" : [ {
    "token" : "joe",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "jeffers",
    "start_offset" : 4,
    "end_offset" : 11,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

We searched for joe, and the document containing Joe Jeffers was returned to us because the standard analyzer had split the text on word boundaries and converted it to lowercase. The standard analyzer is built using the Lower Case Token Filter along with other filters (the Standard Token Filter and the Stop Token Filter, for example).
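This also illustrates the earlier tip that not all queries are analyzed. A term query, for example, looks up the given value verbatim; because the indexed tokens are lowercase, the following query should return no hits even though the match query for joe succeeded:

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "term": {
      "firstname": "Joe"
    }
  }
}'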

Now let's examine the following example:

curl -XDELETE localhost:9200/company
{"acknowledged":true}

curl -XPUT localhost:9200/company -d '{
  "mappings": {
    "employee": {
      "properties": {
        "firstname": {"type": "string", "index": "not_analyzed"}
      }
    }
  }
}'
{"acknowledged":true}

curl -XPOST localhost:9200/company/employee -d '{
  "firstname": "Joe Jeffers",
  "lastname": "Hoffman",
  "age": 30
}'
{"_index":"company","_type":"employee","_id":"AU7GOF2wR7spPlxvqmHY","_version":1,"created":true}

We deleted the company index created by dynamic mapping and recreated it with explicit mapping. This time, we used the not_analyzed value of the index option on the firstname field in the employee type. This means that the field is not analyzed at indexing time:

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match": {
      "firstname": "joe"
    }
  }
}'
{
   "took": 12,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 2,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

As you can see, Elasticsearch did not return a result to us with the match query because the firstname field was configured with the not_analyzed value. Therefore, Elasticsearch did not use an analyzer during indexing; the indexed value was exactly as specified. In other words, Joe Jeffers was indexed as a single token. Unless otherwise indicated, the match query uses the default search analyzer. Therefore, if we want the document to be returned by the match query in this example without changing the analyzer, we need to specify the exact value (paying attention to uppercase/lowercase):

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match" : {
        "firstname": "Joe Jeffers"
    }
  }
}'

The preceding command will return the document we searched for. Now let's examine the following example:

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match_phrase_prefix": {
      "firstname": "Joe"
    }
  }
}'
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,****
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.30685282,
      "hits": [
         {
            "_index": "company",
            "_type": "employee",
            "_id": "AU7GOF2wR7spPlxvqmHY",
            "_score": 0.30685282,
            "_source": {
               "firstname": "Joe Jeffers",
               "lastname": "Hoffman",
               "age": 30
            }
         }
      ]
   }
}

As you can see, the document we searched for was returned to us even though we did not specify the exact value (please note that we still used uppercase letters). This is because the match_phrase_prefix query analyzes the text, creates a phrase query out of the analyzed text, and allows for prefix matches on the last term in the text. Since the firstname field is not analyzed, the indexed value Joe Jeffers is a single term, and the query text Joe is a prefix of it.
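Finally, note that case still matters here because the field is not analyzed; the indexed term is literally Joe Jeffers. If we run the same match_phrase_prefix query with lowercase joe, no document should be returned, since prefix matching against the not_analyzed term is case-sensitive:

curl -XGET localhost:9200/company/_search?pretty -d '{
  "query": {
    "match_phrase_prefix": {
      "firstname": "joe"
    }
  }
}'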